How do I match a URL string like this:

img src = "https://stackoverflow.com/a/b/c/d/someimage.jpg"

where only the domain name and the file extension (jpg) are fixed while everything else varies?

The following code does not seem to work:

Pattern p = Pattern.compile("<img src=\"http://stachoverflow.com/.*jpg");
    // Create a matcher with an input string
    Matcher m = p.matcher(url);
    while (m.find()) {
     String s = m.toString();
    }
  • Maybe because it's actually stackoverflow with a k instead of an h? ;) Further, the right approach would be to use an HTML parser: stackoverflow.com/search?q=parse+html+with+regex Commented Apr 4, 2010 at 4:17
  • This problem is well within the bounds of regex. A lot of HTML parsing is not suited to regex, but I don't see a problem with extracting an image path this way. Commented Apr 4, 2010 at 5:09
  • @Michael: <img src="http://stachoverflow.com/" /><script ...>...</script><img src="some.jpg" /> could become a problem, depending on what the regex is used for... Commented Apr 4, 2010 at 8:01

2 Answers

There were a couple of issues with the regex matching the sample string you gave. You were close, though. Here's your code fixed to make it work:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TCPChat {

  static public void main(String[] args) {
    String url = "<img src=\"http://stackoverflow.com/a/b/c/d/someimage.jpg\">";
    Pattern p = Pattern.compile("<img src=\"http://stackoverflow.com/.*jpg\">");
    // Create a matcher with an input string
    Matcher m = p.matcher(url);
    while (m.find()) {
      // group() returns the matched text; toString() only describes the Matcher itself
      String s = m.group();
      System.out.println(s);
    }
  }
}

2 Comments

I would end the regex with .*\.jpg. It's a little thing, but it prevents matching stuff like "myimage.notreallyajpg"
You need to double-escape the backslash. Your example would not have compiled (if that was the actual problem, you should have said that instead of just "it's not working").

First, I would use the group() method to retrieve the matched text, not toString(). But it's probably just the URL part you want, so I would use parentheses to capture that part and call group(1) to retrieve it.
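
To make the difference concrete, here is a minimal sketch. The class name GroupDemo, its helper methods, and the simplified pattern are my own illustrative choices, not the asker's code. It assumes src is the first attribute and the value is double-quoted:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
  // Illustrative pattern: the parentheses capture only the URL inside the quotes.
  static final Pattern IMG =
      Pattern.compile("<img src=\"(http://stackoverflow\\.com/[^\"]*\\.jpg)\"");

  // Whole matched text (group 0), or null when nothing matches.
  static String wholeMatch(String html) {
    Matcher m = IMG.matcher(html);
    return m.find() ? m.group() : null;
  }

  // Just the captured URL (group 1), or null when nothing matches.
  static String urlOnly(String html) {
    Matcher m = IMG.matcher(html);
    return m.find() ? m.group(1) : null;
  }

  public static void main(String[] args) {
    String html = "<img src=\"http://stackoverflow.com/a/b/c/d/someimage.jpg\">";
    System.out.println(wholeMatch(html));
    System.out.println(urlOnly(html));
  }
}
```

group() hands back everything the pattern consumed, including the <img src=" prefix, while group(1) hands back only what sits inside the parentheses.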

Second, I wouldn't assume src was the first attribute in the <img> tag. On SO, for example, it's usually preceded by a class attribute. You want to add something to match intervening attributes, but make sure it can't match beyond the end of the tag. [^<>]+ will probably suffice.

Third, I would use something more restrictive than .* to match the unknown part of the path. There's always a chance that you'll find two URLs on one line, like this:

<img src="http://so.com/foo.jpg"> blah <img src="http://so.com/bar.jpg">

In that case, the .* in your regex would bridge the gap, giving you one match where you wanted two. Again, [^<>]* will probably be restrictive enough.
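
A rough way to see this for yourself, using the two-tag line above. GreedyDemo and findAll are hypothetical names, and both patterns are simplified stand-ins that assume src comes first and is double-quoted:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDemo {
  // Collects group 1 of every match of regex in input.
  static List<String> findAll(String regex, String input) {
    List<String> out = new ArrayList<>();
    Matcher m = Pattern.compile(regex).matcher(input);
    while (m.find()) {
      out.add(m.group(1));
    }
    return out;
  }

  public static void main(String[] args) {
    String line = "<img src=\"http://so.com/foo.jpg\"> blah <img src=\"http://so.com/bar.jpg\">";
    // Greedy .* runs to the last ".jpg" on the line, so both tags collapse into one match.
    System.out.println(findAll("<img src=\"(http://so\\.com/.*\\.jpg)\"", line).size());
    // [^<>]* cannot cross a tag boundary, so each tag is matched separately.
    System.out.println(findAll("<img src=\"(http://so\\.com/[^<>]*\\.jpg)\"", line).size());
  }
}
```

With the greedy version, the single captured "URL" contains the end of the first tag, the text in between, and the start of the second tag.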

There are several other potential problems as well. Are attribute values always enclosed in double-quotes, or could they be single-quoted, or not quoted at all? Will there be whitespace around the =? Are element and attribute names always lowercase?

...and I could go on. As has been pointed out many, many times here on SO, regexes are not really the right tool for working with HTML. They can usually handle simple tasks like this one, but it's essential that you understand their limitations.

Here's my revised version of your regex (as a Java string literal):

"(?i)<img[^<>]+src\\s*=\\s*[\"']?(http://stackoverflow\\.com/[^<>]+\\.jpg)"

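For completeness, a small sketch of how that pattern might be used. The class name ImgUrlExtractor, the method extractUrls, and the sample input are assumptions made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImgUrlExtractor {
  // The revised regex from above, compiled once and reused.
  static final Pattern IMG_SRC = Pattern.compile(
      "(?i)<img[^<>]+src\\s*=\\s*[\"']?(http://stackoverflow\\.com/[^<>]+\\.jpg)");

  // Returns the captured URL (group 1) of every matching <img> tag.
  static List<String> extractUrls(String html) {
    List<String> urls = new ArrayList<>();
    Matcher m = IMG_SRC.matcher(html);
    while (m.find()) {
      urls.add(m.group(1));
    }
    return urls;
  }

  public static void main(String[] args) {
    // Upper-case names, an extra attribute, spaces around '=' and single quotes.
    String html = "<IMG CLASS=\"avatar\" SRC = 'http://stackoverflow.com/a/b/c/d/someimage.jpg'>"
        + " text <img src=\"http://stackoverflow.com/x/other.jpg\">";
    for (String url : extractUrls(html)) {
      System.out.println(url);
    }
  }
}
```

Keep in mind that [^<>]+ inside the capture is still greedy within a single tag, so a later attribute whose value ends in .jpg would be swallowed into the captured text.
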
2 Comments

@Alan: <img src="http://stackoverflow.com/" onclick="javascript:evil()" class="some.jpg"/> would still be matched when using [^<>]. It's just one more example of why this is so dangerous...
@chris: Yeah, I was just covering some of the most common issues found in valid, well-intentioned HTML. A complete list of potential gotchas would fill a thick book.
