I'm wondering about the behavior of the Matcher in Java.

I have a compiled pattern, and when I iterate over the matcher's results I don't understand why a specific value is missing.

My code:

String str = "star wars";
Pattern p = Pattern.compile("star war|Star War|Starwars|star wars|star wars|pirates of the caribbean|long strage trip|drone|snatched (2017)");
Matcher matcher = p.matcher(str);
while (matcher.find()) {
    System.out.println("\nRegex : " + matcher.group());
}

I get a hit for "star war", which is right, as it is in my pattern.

But I don't get "star wars" as a hit, and I don't understand why, since it is also part of my pattern.

  • The first alternative in the alternation group that matches "wins", and the rest is not checked. Once the star war is matched, the text is consumed, there won't be any more passes. It is expected. What behavior do you need? Commented May 26, 2017 at 18:38
  • Is there a way to return all hits? Commented May 26, 2017 at 18:39
  • You would have to check each pattern against the string separately instead of as a long chain of alternations. Commented May 26, 2017 at 18:42
  • Understood.. thanks guys Commented May 26, 2017 at 18:44
  • See ideone.com/gGcALb. BTW, you have 2 "star wars" :) Commented May 26, 2017 at 19:02

3 Answers

2

This behavior is expected because alternation in an NFA regex engine is "eager": the first alternative that matches wins, and the remaining alternatives are not tried. Also, note that once a regex engine finds a match with a consuming pattern (and yours is consuming; it is not a zero-width assertion such as a lookahead, lookbehind, word boundary, or anchor), the index advances to the end of the match, and the next match is searched for from that position.

So, once the first alternative, star war, matches, there is no way to match star wars: the regex index is now just before the last s, and only that s is left to search.
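To make the two effects visible, here is a minimal sketch that prints each match together with its start and end index. The class name EagerAlternationDemo and the helper findAll are made up for illustration, and the pattern is cut down to the two alternatives that matter here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EagerAlternationDemo {

    // Hypothetical helper: collects every match find() reports, with its [start,end) span.
    public static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.group() + " [" + m.start() + "," + m.end() + ")");
        }
        return hits;
    }

    public static void main(String[] args) {
        // Shorter alternative listed first: it wins, and the search resumes at index 8,
        // where only "s" is left, so nothing else is found.
        System.out.println(findAll("star war|star wars", "star wars")); // [star war [0,8)]

        // Longer alternative listed first: it is tried first and consumes the whole input.
        System.out.println(findAll("star wars|star war", "star wars")); // [star wars [0,9)]
    }
}
```

Reordering the alternatives changes which single hit is reported, but it still does not return both, because the matched text is consumed either way.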

Just check whether the string contains each of the strings you are looking for. The simplest approach is a loop:

String str = "star wars";
String[] arr = {"star war","Star War","Starwars","star wars","pirates of the caribbean","long strage trip","drone","snatched (2017)"};
for(String s: arr){
    if(str.contains(s))
        System.out.println(s);
}

See the Java demo

By the way, your regex contains snatched (2017). Unescaped, the ( and ) form a capturing group rather than matching literal parentheses, so that alternative only matches snatched 2017. To match literal parentheses, the ( and ) must be escaped. I also removed a duplicate entry for star wars.
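If the titles do have to go into a regex, the parentheses can be escaped by hand or the whole title can be passed through Pattern.quote. The following is a small sketch of the difference; the class name LiteralParensDemo and the helper found are illustrative only.

```java
import java.util.regex.Pattern;

public class LiteralParensDemo {

    // Hypothetical helper: true if the regex matches somewhere in the input.
    public static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String title = "snatched (2017)";

        // Unescaped: ( and ) form a capturing group, so the regex means "snatched 2017".
        System.out.println(found("snatched (2017)", title));     // false

        // Escaped by hand: \\( and \\) match the literal parentheses.
        System.out.println(found("snatched \\(2017\\)", title)); // true

        // Pattern.quote treats every character of the title literally.
        System.out.println(found(Pattern.quote(title), title));  // true
    }
}
```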

3 Comments

This approach is better, but then we should also split the string on | and match str entirely in order to avoid problem with movies like AI or so.
@steffen: I split with \| just to quickly build an array. I think the best way is to define it as usual, with String[] arr = {"term1", "term2", "etc."};. Note I did not even remove dupes, I guess those are served at design time.
I decided to edit the answer to show how the array of search terms should be defined. Splitting with "\\|" is hacky.
1

A better way to build your regex would be like this:

String pattern = "[Ss]tar[\\s]{0,1}[Ww]ar[s]{0,1}";

Breaking down:

  • [Ss]: it will match either S or s in the first position
  • \s: matches a single whitespace character (written as \\s inside a Java string literal)
  • {0,1}: the preceding character (or set) will be matched zero or one time

An alternative is:

String pattern = "[Ss]tar[\\s]?[Ww]ar[s]?";
  • ?: the previous character (or set) will be matched once or not at all

For more information, see https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
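As a quick check of what this pattern accepts, here is a small sketch that runs it against a few inputs with Matcher.matches(), which requires the whole input to match. The class name CompactPatternDemo and the helper matchesTitle are invented for the example. Note that the pattern also accepts spellings that were not in the original list, such as starWar.

```java
import java.util.regex.Pattern;

public class CompactPatternDemo {

    // The pattern from the answer, compiled once.
    private static final Pattern STAR_WARS = Pattern.compile("[Ss]tar[\\s]?[Ww]ar[s]?");

    // Hypothetical helper: true only if the entire input matches the pattern.
    public static boolean matchesTitle(String input) {
        return STAR_WARS.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesTitle("star wars"));  // true
        System.out.println(matchesTitle("Star War"));   // true
        System.out.println(matchesTitle("Starwars"));   // true
        System.out.println(matchesTitle("star  wars")); // false: two spaces, but \\s? allows at most one
        System.out.println(matchesTitle("STAR WARS"));  // false: only the first letters may be upper case
    }
}
```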

Edit 1: fixed typo (\s -> \\s). Thanks, @eugene.

4 Comments

[\s]{0,1} should really be \\s+ there could be many spaces probably
Eugene: * would cause to match 0 or more times. By using {0,1} it says to match 0 or 1 times only.
@Eugene - '\\s*' would allow things like 'star wars' to match, as well.
@marklark, yeah, I know that. I just wanted to point out a better way to build Regex'es than the one on the question. In fact, that is the first line of my answer.
0

You want to match the whole input sequence, so you should use Matcher.matches() or add ^ and $:

Pattern p = Pattern.compile("^(star war|Star War|Starwars|star wars|"
        + "star wars|pirates of the caribbean)$");

will print

Regex : star wars

But I agree with @NAMS: Don't build your regex like this.
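For the Matcher.matches() variant mentioned above, a minimal sketch could look like this. The class name WholeInputMatchDemo and the helper isKnownTitle are illustrative, and the alternation is shortened. Because matches() must cover the entire input, the engine backtracks past star war and settles on star wars without any ^ or $ in the pattern.

```java
import java.util.regex.Pattern;

public class WholeInputMatchDemo {

    // Shortened version of the alternation from the question, without anchors.
    private static final Pattern TITLES =
            Pattern.compile("star war|Star War|Starwars|star wars|pirates of the caribbean");

    // Hypothetical helper: true only if the whole input equals one of the alternatives.
    public static boolean isKnownTitle(String input) {
        return TITLES.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isKnownTitle("star wars"));   // true
        System.out.println(isKnownTitle("star war"));    // true
        System.out.println(isKnownTitle("star wars 2")); // false: trailing text is not covered
    }
}
```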
