2

I have an HTML string like so:

<img src="http://foo"><img src="http://bar">

What would be the regex pattern to split this into two separate img tags?

5
  • 5
    They are already 2 separate tags Commented Oct 28, 2010 at 16:21
  • 2
    It already is two separate img tags. Commented Oct 28, 2010 at 16:21
  • 1
    Please search for the similar questions. There are tons of them. Never use RegEx for HTML unless you have very small, specific and pattern-ized input. Commented Oct 28, 2010 at 16:30
  • 1
    Not every computing problem is best solved with a regex. Commented Oct 28, 2010 at 16:57
  • 2
    The literal answer to your question is split /(?<=>)(?=<)/, but if that is really the answer you're looking for, I can virtually guarantee that you're doing something very wrong. Commented Oct 28, 2010 at 19:20
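The lookaround split from the last comment can be tried directly. This is a minimal sketch in Python, which is an assumption since the question never names a language. It needs Python 3.7 or later, because older versions of `re.split` ignore patterns that match the empty string.

```python
import re

html = '<img src="http://foo"><img src="http://bar">'

# Split at every zero-width position that has ">" on its left and "<" on its right.
# Nothing is consumed, so each piece keeps its own angle brackets.
tags = re.split(r'(?<=>)(?=<)', html)

print(tags)
```

This only works while `><` never appears inside an attribute value; something like `alt="><"` would be cut in the middle.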

5 Answers

8

How sure are you that your string is exactly that? What about input like this:

<img alt=">"          src="http://foo"  >
<img src='http://bar' alt='<'           >

What programming language is this? Is there some reason you're not using a standard HTML-parsing class to handle this? Regexes are only a good approach when you have an extremely well-known set of inputs. They don't work for real HTML, only for rigged demos.

Even if you must use a regex, you should use a proper grammatical one. This is quite easy. I've tested the following programacita on a zillion web pages. It takes care of the cases I outline above — and one or two others, too.

#!/usr/bin/perl
use 5.10.0;
use strict;
use warnings;

my $img_rx = qr{

    # save capture in $+{TAG} variable
    (?<TAG> (?&image_tag) )

    # remainder is pure declaration
    (?(DEFINE)

        (?<image_tag>
            (?&start_tag)
            (?&might_white) 
            (?&attributes) 
            (?&might_white) 
            (?&end_tag)
        )

        (?<attributes>
            (?: 
                (?&might_white) 
                (?&one_attribute) 
            ) *
        )

        (?<one_attribute>
            \b
            (?&legal_attribute)
            (?&might_white) = (?&might_white) 
            (?:
                (?&quoted_value)
              | (?&unquoted_value)
            )
        )

        (?<legal_attribute> 
            (?: (?&required_attribute)
              | (?&optional_attribute)
              | (?&standard_attribute)
              | (?&event_attribute)
              # for LEGAL parse only, comment out next line 
              | (?&illegal_attribute)
            )
        )

        (?<illegal_attribute> \b \w+ \b )

        (?<required_attribute>
            alt
          | src
        )

        (?<optional_attribute>
            (?&permitted_attribute)
          | (?&deprecated_attribute)
        )

        # NB: The white space in string literals 
        #     below DOES NOT COUNT!   It's just 
        #     there for legibility.

        (?<permitted_attribute>
            height
          | is map
          | long desc
          | use map
          | width
        )

        (?<deprecated_attribute>
             align
           | border
           | hspace
           | vspace
        )

        (?<standard_attribute>
            class
          | dir
          | id
          | style
          | title
          | xml:lang
        )

        (?<event_attribute>
            on abort
          | on click
          | on dbl click
          | on mouse down
          | on mouse out
          | on key down
          | on key press
          | on key up
        )

        (?<unquoted_value> 
            (?&unwhite_chunk) 
        )

        (?<quoted_value>
            (?<quote>   ["']      )
            (?: (?! \k<quote> ) . ) *
            \k<quote> 
        )

        (?<unwhite_chunk>   
            (?:
                # (?! [<>'"] ) 
                (?! > ) 
                \S
            ) +   
        )

        (?<might_white>     \s *   )

        (?<start_tag>  
            < (?&might_white) 
            img 
            \b       
        )

        (?<end_tag>          
            (?&html_end_tag)
          | (?&xhtml_end_tag)
        )

        (?<html_end_tag>       >  )
        (?<xhtml_end_tag>    / >  )

    )

}six;

$/ = undef;
$_ = <>;   # read all input

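# strip stuff we aren't supposed to look at
s{ <!    DOCTYPE  .*?         > }{}six;   # /i: HTML5 pages usually write <!doctype html>
s{ <! \[ CDATA \[ .*?    \]\] > }{}gsx; 

# allow attributes on the opening tag, e.g. <script type="text/javascript">
s{ <script \b [^>]* > .*?  </script \s* > }{}gsix; 
s{ <!--     .*?        --> }{}gsx;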

my $count = 0;

while (/$img_rx/g) {
    printf "Match %d at %d: %s\n", 
            ++$count, pos(), $+{TAG};
} 

There you go. Nothing to it!

Gee, why would you ever want to use an HTML-parsing class, given how easily HTML can be dealt with in a regex? ☺

5

Don't do it with regex. Use an HTML/XML parser. You can even run it through Tidy first to clean it up. Most languages have a Tidy library. What language are you using?
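For illustration, here is roughly what the parser route looks like. It is a sketch using Python's standard-library `html.parser`, which is an assumption since the question does not say which language is in use. The names `ImgCollector` and `find_images` are made up for this example.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Records the raw text and the attributes of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased; <img ... /> is routed here as well.
        if tag == "img":
            self.images.append((self.get_starttag_text(), dict(attrs)))


def find_images(html):
    parser = ImgCollector()
    parser.feed(html)
    parser.close()
    return parser.images


# The awkward input from the top answer: ">" and "<" inside quoted values.
sample = '''<img alt=">"          src="http://foo"  >
<img src='http://bar' alt='<'           >'''

for raw, attrs in find_images(sample):
    print(attrs.get("src"), "from", raw)
```

The parser copes with the quoted `>` and `<` and with either quote style, which the short regexes elsewhere in this thread do not.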

2

This will do it:

<img\s+src=\"[^\"]*?\">

Or you can do this to allow other attributes before and after src:

<img\s+[^>]*?\bsrc=\"[^\"]*?\"[^>]*>
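To see where the second pattern holds up and where it does not, here is a small check in Python (an assumed language; the thread does not settle on one). The backslashes before the quotes are dropped because they are not needed in a raw string.

```python
import re

# Second pattern from this answer, without the unneeded \" escapes.
IMG_WITH_SRC = re.compile(r'<img\s+[^>]*?\bsrc="[^"]*?"[^>]*>')

# The string from the question: both tags are found.
simple = '<img src="http://foo"><img src="http://bar">'
simple_matches = IMG_WITH_SRC.findall(simple)
print(simple_matches)

# The input from the top answer: a quoted ">" and single-quoted values.
tricky = '''<img alt=">"          src="http://foo"  >
<img src='http://bar' alt='<'           >'''
tricky_matches = IMG_WITH_SRC.findall(tricky)
print(tricky_matches)
```

The first call returns both tags. The second returns an empty list, because `[^>]*?` cannot step over the `>` inside `alt=">"` and the pattern only accepts double quotes around the `src` value.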

2 Comments

That doesn't account for the "additional attributes" that you say it does. Look at my solution for how to do this properly. Well, as properly as possible if you're not using an HTML-parsing class.
I was actually looking for a quick and dirty solution to get all src attribute values of img tags in a string and came across this answer, which was very helpful. For my case I only had to add a pair of parentheses to capture the value: <img\s+[^>]*?\bsrc=\"([^\"]*?)\"[^>]*>
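The capture-group variant from that comment, sketched in Python (an assumed language). With exactly one group in the pattern, `findall` returns just the captured `src` values. The sample markup is made up for the example.

```python
import re

# Same pattern as in the comment, with parentheses around the src value.
IMG_SRC = re.compile(r'<img\s+[^>]*?\bsrc="([^"]*?)"[^>]*>')

html = '<p><img class="a" src="http://foo" alt="x"><img src="http://bar"></p>'
srcs = IMG_SRC.findall(html)

print(srcs)
```

It is quick and dirty in the sense the comment describes: single-quoted or unquoted `src` values, or a `>` inside another attribute, are silently skipped.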
0

<img src=\"https?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?\">

PHP example:

$prom = '<img src="http://foo"><img src="http://bar">';

preg_match_all('|<img src=\"https?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?\">|',$prom, $matches);

print_r($matches[0]);

0

One slightly insane/brilliant/weird way to do it would be to split on >< and then put the two lost characters back: the > on the end of the first piece and the < on the front of the second.

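$string = '<img src="http://foo"><img src="http://bar">';
// split() was removed in PHP 7; explode() splits on a literal string
$KimKardashian = explode("><", $string);
$First = $KimKardashian[0] . '>';   // restore the ">" consumed by the delimiter
$Second = '<' . $KimKardashian[1];  // restore the "<" consumed by the delimiter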
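The same trick can be stretched to any number of adjacent tags. This is a sketch in Python rather than PHP (an assumption, chosen only to keep the added examples in one language); the helper name `split_adjacent_tags` is made up.

```python
def split_adjacent_tags(html):
    """Split on the literal '><' and give each piece its bracket back."""
    pieces = html.split("><")
    last = len(pieces) - 1
    tags = []
    for i, piece in enumerate(pieces):
        if i > 0:
            piece = "<" + piece   # every piece but the first lost its "<"
        if i < last:
            piece = piece + ">"   # every piece but the last lost its ">"
        tags.append(piece)
    return tags


three = '<img src="http://foo"><img src="http://bar"><img src="http://baz">'
print(split_adjacent_tags(three))
```

It assumes the tags sit directly next to each other with no whitespace between them, and that `><` never occurs inside an attribute value.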
