I'm working on some code inherited from someone else and trying to understand some regular expression code in C#:

Regex.Replace(query, @"""[^""~]+""([^~]|$)", 
    m => string.Format(field + "_exact:{0}", m.Value))

What is the above regular expression doing? This is in relation to input from a user performing a search. It's doing a replace of the query string using the pattern provided in the second argument, with the value of the third. But what is that regular expression? For the life of me, it doesn't make sense. Thanks.

1
  • Does the reference help? Commented Aug 5, 2013 at 6:54

2 Answers

As far as I can see, xanatos' answer is correct. I tried to understand the regex, so here it comes:

    "[^"~]+"([^~]|$)

You can test the regex and play with its individual parts for a better understanding at http://www.regexpal.com/

1.) a single character

"

The first element of the pattern is a literal character. Since the pattern is not anchored to any position, it can occur anywhere in the input.

2.) a character class

[^"~]

The next expression is the []-bracket. This is a character class (character set). It defines a set of characters, one of which must follow next, so it stands for exactly one character. Let's look inside to see what is allowed:

^"~

The definition of the character class begins with a caret (^), which is a special character. A caret directly after the opening square bracket negates the character class. So it's "upside down": any character that is not listed in the class matches.

In this case, any character is allowed except the two excluded ones: " and ~.

3.) a special character

+

The next expression, a plus, tells the engine to match the preceding token one or more times. So the character class defined above must occur at least once and may repeat any number of times.

4.) a single character

"

To match, the input must then contain one more double quote. It is the closing quote that pairs with the one in 1.), since the character class in 2.) and its repetition in 3.) do not permit a double quote in between.

5.) a capturing group

([^~]|$)

The first structure to examine here is the ()-bracket. This is a capturing group, not a lookaround: a lookaround would be written (?=...) or (?!...) and would only assert a position. Because this is an ordinary group, whatever it matches is consumed and becomes part of the overall match (m.Value in the question's code).

Inside the group are two alternatives, joined by a logical OR (the pipe symbol |). After the closing quote there must be either [^~], one single character that is anything except ~, or $, the end of the string (or the end of a line, if RegexOptions.Multiline is set).
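To check the walkthrough above against real inputs, here is a minimal sketch. The class name, the helper FirstMatch and the sample strings are my own, not from the question; only the pattern is. It shows that the final group consumes one character when there is one, and that a ~ right after the closing quote prevents a match.

```csharp
using System;
using System.Text.RegularExpressions;

static class QuotedPhraseDemo
{
    // The pattern from the question, as a verbatim string ("" is one literal ").
    const string Pattern = @"""[^""~]+""([^~]|$)";

    // Hypothetical helper: returns the full match text, or null if nothing matches.
    public static string FirstMatch(string input)
    {
        Match m = Regex.Match(input, Pattern);
        return m.Success ? m.Value : null;
    }

    static void Main()
    {
        // The group eats the space after the closing quote.
        Console.WriteLine("[" + FirstMatch("say \"hello world\" now") + "]"); // ["hello world" ]

        // Nothing follows the closing quote, so the $ alternative matches.
        Console.WriteLine("[" + FirstMatch("\"hello\"") + "]");               // ["hello"]

        // A ~ directly after the closing quote: neither alternative matches.
        Console.WriteLine(FirstMatch("\"hello\"~2") ?? "(no match)");         // (no match)

        // Empty quotes: + needs at least one character between the quotes.
        Console.WriteLine(FirstMatch("\"\"") ?? "(no match)");                // (no match)
    }
}
```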

I'll try to edit my answer to a better format, since this is my first post, I first have to check out how this is working.. :)

Update: to "detect" an asterisk/star at the start or end of the line, you have to do the following:

First, it's a special character, so you have to escape it with a backslash: \*

To define the position, you can use:

  • ^ to look at the beginning of the line,
  • $ end of the line

The overall expression would be:

^\* in front of the expression to search for an * at the beginning of the line, and \*$ at the end of the regex to demand an * at the end.

.... in your case you can add the * to the last group to detect an * at the end:

([^~]|$|\*$)

Note that [^~] already matches a *, so this extra alternative is redundant for matching; it only makes the intent explicit.

and to force an * at the end, delete the other alternatives:

(\*$)
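As a quick check of the escaped asterisk with the two anchors, here is a small sketch. The class and method names are made up for illustration; it assumes the escaped forms ^\* and \*$ are what was intended above.

```csharp
using System;
using System.Text.RegularExpressions;

static class AsteriskDemo
{
    // ^\* : a literal * at the very start of the input.
    public static bool StartsWithStar(string s) => Regex.IsMatch(s, @"^\*");

    // \*$ : a literal * at the very end of the input.
    public static bool EndsWithStar(string s) => Regex.IsMatch(s, @"\*$");

    static void Main()
    {
        Console.WriteLine(StartsWithStar("*foo")); // True
        Console.WriteLine(StartsWithStar("foo*")); // False
        Console.WriteLine(EndsWithStar("foo*"));   // True
        Console.WriteLine(EndsWithStar("fo*o"));   // False
    }
}
```

An unescaped * is a quantifier, so a pattern such as ^* does not mean "a star at the start"; the backslash is what makes it a literal character.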

1 Comment

Thanks, @EpicEmil. You went over each piece of the regex and explained it so well. Much appreciated :)
1

The @ (verbatim string) makes it necessary to escape every " with a second ", giving "". Without the @ you would escape the " as \", but I consider it better to always use @ for regexes, because \ appears so often in patterns, and it's tedious and unreadable to keep escaping it as \\.
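To make the two escaping styles concrete, here is a short sketch (the class and constant names are mine, chosen for illustration). It compares the verbatim form from the question with the equivalent regular string literal, and does the same for a pattern with backslashes.

```csharp
using System;

static class VerbatimStringDemo
{
    // Verbatim (@) string: "" stands for one literal ", backslashes are not escapes.
    public const string Verbatim = @"""[^""~]+""([^~]|$)";

    // Regular string: \" stands for one literal ".
    public const string Regular = "\"[^\"~]+\"([^~]|$)";

    static void Main()
    {
        // Both literals produce exactly the same characters.
        Console.WriteLine(Verbatim == Regular);            // True

        // With backslashes in the pattern, the @ form stays readable.
        Console.WriteLine(@"\d+\.\d+" == "\\d+\\.\\d+");   // True
    }
}
```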

Let's see what the regex really is:

Console.WriteLine(@"""[^""~]+""([^~]|$)");

is

"[^"~]+"([^~]|$)

So now we can look at the "real" regex.

It looks for a " followed by one or more characters that are neither " nor ~, followed by another ", followed by either a non-~ character or the end of the string. Note that the match can start after the start of the string, and it can end before the end of the string (with a non-~ character).

For example in

car"hello"help

it would match "hello"h
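Here is a small sketch that reproduces this example and shows what the Replace call from the question then does with it. The class name, the Rewrite wrapper and the field name "title" are assumptions for illustration; the question does not say what field contains.

```csharp
using System;
using System.Text.RegularExpressions;

static class ExactFieldDemo
{
    const string Pattern = @"""[^""~]+""([^~]|$)";

    // Mirrors the Replace call from the question; 'field' is a hypothetical field name.
    public static string Rewrite(string query, string field)
    {
        return Regex.Replace(query, Pattern,
            m => string.Format(field + "_exact:{0}", m.Value));
    }

    static void Main()
    {
        // The match includes the character after the closing quote.
        Console.WriteLine(Regex.Match("car\"hello\"help", Pattern).Value); // "hello"h

        // Each match is prefixed with "<field>_exact:".
        Console.WriteLine(Rewrite("car\"hello\"help", "title"));    // cartitle_exact:"hello"help
        Console.WriteLine(Rewrite("\"new york\" hotels", "title")); // title_exact:"new york" hotels

        // A quoted phrase followed by ~ is left alone.
        Console.WriteLine(Rewrite("\"new york\"~3", "title"));      // "new york"~3
    }
}
```

Because the consumed character is part of m.Value, it is written back unchanged after the prefix, so the rest of the query is not altered.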

4 Comments

Thanks, xanatos. What was confusing was that the " weren't escaped with a backslash. A follow-up: How do I detect a * at the beginning or the end of the string?
@Alex You have to escape it as \*... But what do you mean by "detect"? An optional \* is \*?, so you could write @"\*?""[^""~]+""([^~]|$)" (if you want the * outside of the ") and then check with standard string methods if there is an *. The final [^~] will already eat the *
Thanks, @xanatos. Detect is the wrong word -- to know that the pattern exists and replace it. So there's no regex pattern that can detect the last *? Would a lastIndexOf() be the recommended way to do that and somehow merge the two into one check?
@Alex Yes there is, but "detect" is not a word with a single meaning. You have to give concrete examples of what you want to match, what you don't want to match, and what you want to do after the matching. It's also important to remember that regexes can have various uses. A regex to find text that you want to replace is different from a regex that you use to extract pieces of a text, and both are different from a regex that only checks whether a text has a certain "shape".
