3

Regex is absolutely my weak point and this one has me completely stumped. I am building a fairly basic search functionality and I need to be able to alter my user input based on the following pattern:

Subject:

%22first set%22 %22second set%22-drupal -wordpress

Desired output:

+"first set" +"second set" -drupal -wordpress

I wish I could be more help as I normally like to at least post the solution I have so far, but on this one I'm at a loss.

Any help is appreciated. Thank you.

2 Comments

  • It seems your data is URL encoded. If you apply urldecode, you will get "first set" "second set"-drupal -wordpress. Do you actually have a space before -drupal, or should this be inserted too? Commented Jan 10, 2011 at 3:29
  • I can manage the space. The only issue with using urldecode is that this is going into an SQL query, and I only want to decode double quotes, and only if they're in this pattern. Commented Jan 10, 2011 at 3:35

3 Answers

2

Seems your data is URL encoded. If you apply urldecode, you will get

"first set" "second set" -drupal -wordpress

(I assume you have a space before -drupal).

Now you have to add +. Again, I assume you have to add those before all words that don't have a - and that are not inside quotes:

$str = '"first set" "second set" -drupal -wordpress foo';
echo preg_replace('#( |^)(?!(?:\w+"|-| ))#', '\1+', $str);
// prints +"first set" +"second set" -drupal -wordpress +foo

Update: If you cannot use urldecode, you could just use str_replace to replace %22 with ".
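As a rough sketch of how the two steps could be combined, the snippet below decodes only the %22 sequences with str_replace and then applies the same preg_replace pattern as above. The function name buildSearchTerms is made up for illustration, and the input is assumed to already have a space before -drupal.

```php
<?php
// Hypothetical helper name; not part of the original answer.
function buildSearchTerms(string $raw): string
{
    // Decode only the URL-encoded double quotes; other % sequences are left alone.
    $decoded = str_replace('%22', '"', $raw);

    // Insert + after a space (or at the start) unless what follows is
    // a word ending in a quote, a -, or another space.
    return preg_replace('#( |^)(?!(?:\w+"|-| ))#', '\1+', $decoded);
}

echo buildSearchTerms('%22first set%22 %22second set%22 -drupal -wordpress'), "\n";
// prints +"first set" +"second set" -drupal -wordpress
```

Note that the lookahead only recognises the last word of a quoted phrase, so a phrase of three or more words would get a stray + in the middle and would need a different approach.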

1
preg_replace('/%22((?:[^%]|%[^2]|%2[^2])*)%22/', '+"$1"', $str);

Explanation: $1 is a backreference to the first ()-section in the regular expression, in this case ((?:[^%]|%[^2]|%2[^2])*). The alternation [^%]|%[^2]|%2[^2] is meant to let the group consume any text except the sequence %22, so the match stops at the first closing %22 instead of greedily running on to the last one. See http://en.wikipedia.org/wiki/Regular_expression#Lazy_quantification.

I found that technique in a JavaCC example of matching block comments (/* */), and I can't find any other webpages explaining it, so here is a cleaner example. To match a block of text between two occurrences of 12345, with no 12345 in between: /12345([^1]|1[^2]|12[^3]|123[^4]|1234[^5])*12345/
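For comparison, here is a small sketch that runs the alternation pattern next to a lazy quantifier, which PCRE (and therefore preg_replace) supports directly. The variable names and the $tricky input are made up for illustration. On the sample input the two should behave the same; they can differ when a literal % sits right before the closing %22, because %[^2] consumes the pair %% and the closing delimiter is then missed.

```php
<?php
$input = '%22first set%22 %22second set%22-drupal -wordpress';

// Pattern from the answer: the group is built from pieces that avoid "%22".
$alternation = '/%22((?:[^%]|%[^2]|%2[^2])*)%22/';
// Lazy alternative: .*? expands only as far as the nearest closing %22.
$lazy = '/%22(.*?)%22/';

echo preg_replace($alternation, '+"$1"', $input), "\n";
echo preg_replace($lazy, '+"$1"', $input), "\n";
// Both lines print: +"first set" +"second set"-drupal -wordpress

// Edge case (hypothetical input): a literal % directly before the closing %22.
$tricky = '%22a%%22';
echo preg_replace($alternation, '+"$1"', $tricky), "\n"; // no match, prints %22a%%22
echo preg_replace($lazy, '+"$1"', $tricky), "\n";        // prints +"a%"
```

The alternation technique is mainly useful in tools without lazy quantifiers, such as the lexer generators where it was first seen.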

3 Comments

You rock. Thank you very much. Any chance you could offer an explanation of the solution?
$1 is a backreference to the first ()-section in the regular expression, in this case ((?:[^%]|%[^2]|%2[^2])*). The [^%] alternation keeps a %22 in between from being matched, which prevents greedy matching. Greediness is explained at en.wikipedia.org/wiki/Regular_expression#Lazy_quantification, while the [^%] method is explained at shinkirou.org/blog/2010/12/tricky-regular-expression-problems (first seen in a JavaCC example).
@SHiNKiROU An explanation of code given in an answer should go in the answer itself, not in the comments, where many people may miss it. Why didn't you edit your own answer when asked for clarification, instead of posting a short comment?
1

Is this what you're looking for?

<?php
  $input = "%22first set%22 %22second set%22-drupal -wordpress";
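  // The lazy (.+?) captures the text between each pair of %22s. \s* absorbs any
  // whitespace after the pair so the replacement's trailing space is the only one left.
  $res = preg_replace( "/\%22(.+?)\%22\s*/","+\"\\1\" ", $input);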
  print $res;
?>

1 Comment

Explanation: each \%22 matches a literal "%22". The key here is the (.+?) part, which finds the shortest (i.e., "ungreedy") match between the %22s. In the replacement string, \1 stands for the text captured by (.+?).
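To illustrate why the ungreedy quantifier matters here, the following sketch runs a greedy and a lazy version of the pattern on the input from the question. The variable names are made up for illustration.

```php
<?php
$input = '%22first set%22 %22second set%22-drupal -wordpress';

// Greedy: (.+) runs on to the LAST %22, so both phrases end up in one group.
$greedy = preg_replace('/%22(.+)%22/', '+"$1"', $input);
echo $greedy, "\n";
// prints +"first set%22 %22second set"-drupal -wordpress

// Lazy: (.+?) stops at the NEAREST closing %22, so each phrase is matched on its own.
$ungreedy = preg_replace('/%22(.+?)%22/', '+"$1"', $input);
echo $ungreedy, "\n";
// prints +"first set" +"second set"-drupal -wordpress
```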
