Parse HTML DOM using PHP

Question

I have this HTML code:

<marquee  align="left" id="LatestNewsM" SCROLLAMOUNT="4" loop="infinite" direction="right">

            <font dir="rtl" valign="top" class="StringTheme" style="font-size:14px;">test test test</font>  
            <img src="/Portal/images/LightVersionWeb2/jazeeraTicSep.gif" align="middle">

            <font dir="rtl" valign="top" class="StringTheme" style="font-size:14px;">test sample text sample</font>  
            <img src="/Portal/images/LightVersionWeb2/jazeeraTicSep.gif" align="middle">

            <font dir="rtl" valign="top" class="StringTheme" style="font-size:14px;">text text 222 another text</font>  
            <img src="/Portal/images/LightVersionWeb2/jazeeraTicSep.gif" align="middle">
            ...........
            .....
</marquee>

and this PHP code:

$homepage = file_get_contents('http://www.site.com');

How can I search the content and get only the text inside the <font> tags?

Your question is completely unrelated to file_get_contents(); I've edited it accordingly. Besides that, a DOM parser is the way to go. Please don't even think about using regular expressions for it.
– ThiefMaster, Commented Apr 15, 2011 at 23:24
<marquee> and <font>?! Please tell me you're just scraping data from a decades-old site in order to update it to modern standards...
– jrn.ak, Commented Apr 15, 2011 at 23:36
@jnpcl: No, I am trying to get news from a news website and display it on my webpage; that is all I want to do.
– Saleh, Commented Apr 15, 2011 at 23:41
You should send a letter with white powder to whoever is responsible for the HTML code of that site. ;x
– ThiefMaster, Commented Apr 16, 2011 at 0:08

ludesign · Accepted Answer · 2011-04-16 00:07:23Z

You have a few options: regex (which ThiefMaster advised against), strpos and substr, or a DOM/XML parser.

If you go with regex, you might end up with something like this:

/<font[^>]*>.*<\/font>/i

When run on data like this:

> Hello, this is my brutal <font>font
> <font>tag</font> right</font> it is

You will end up with (if greedy)

<font>font <font>tag</font> right</font>

or if ungreedy

<font>font <font>tag</font>

You can use a negative lookahead and do a better job, but it's still not a good solution (this example only shows why; the regex is kept as simple as possible).
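The greedy and ungreedy results above can be reproduced with a short script. This is only a sketch: the sample text is joined onto a single line (an assumption made here, because `.` does not match newlines without the `s` modifier), and the ungreedy variant is written with `.*?`.

```php
<?php
// Sample data from the answer, joined onto one line because "." does not
// match newlines unless the "s" modifier is used.
$data = 'Hello, this is my brutal <font>font <font>tag</font> right</font> it is';

// Greedy: ".*" runs on to the LAST </font> in the string.
preg_match('/<font[^>]*>.*<\/font>/i', $data, $greedy);

// Ungreedy: ".*?" stops at the FIRST </font>, cutting the outer tag short.
preg_match('/<font[^>]*>.*?<\/font>/i', $data, $ungreedy);

echo $greedy[0], PHP_EOL;   // <font>font <font>tag</font> right</font>
echo $ungreedy[0], PHP_EOL; // <font>font <font>tag</font>
```

Neither result is the text of one complete, correctly paired element, which is the weakness the example is meant to show.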

If you go with strpos and substr, you'll have to look through all characters one by one and parse the document yourself (matching opening and closing tags, skipping attributes) or you can try

$closing = 0; // the first search starts at offset zero
$opening = strpos($dataset, '<font', $closing); // next opening tag
$closing = strpos($dataset, '</font', $opening); // search from the opening tag

and so on until you parse it all.
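A fuller sketch of that loop might look like the following. The function name `extractFontText` and the shortened sample string are made up for illustration, and the approach assumes lowercase tags, no nested `<font>` elements, and no `>` inside attribute values.

```php
<?php
// Hypothetical helper: collect the text of every non-nested <font> element
// using only strpos() and substr().
function extractFontText(string $dataset): array
{
    $texts = [];
    $offset = 0; // where the next search starts

    while (($opening = strpos($dataset, '<font', $offset)) !== false) {
        // End of the opening tag, which skips over its attributes.
        $tagEnd = strpos($dataset, '>', $opening);
        if ($tagEnd === false) {
            break;
        }
        // First close tag after it (only correct when tags are not nested).
        $closing = strpos($dataset, '</font', $tagEnd);
        if ($closing === false) {
            break;
        }
        $texts[] = trim(substr($dataset, $tagEnd + 1, $closing - $tagEnd - 1));
        $offset = $closing + strlen('</font'); // continue after this close tag
    }

    return $texts;
}

// Shortened sample in the shape of the question's markup.
$dataset = '<font class="StringTheme">test test test</font> '
         . '<img src="/Portal/images/LightVersionWeb2/jazeeraTicSep.gif" align="middle">'
         . '<font class="StringTheme">test sample text sample</font>';

$result = extractFontText($dataset);
print_r($result);
```

Note that strpos() is case-sensitive, so markup written as `<FONT>` would need stripos() instead.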

If you go with a DOM/XML parser, keep in mind that file_get_contents() and file() load the whole file into memory, as most DOM/XML parsers do. I would go with XMLReader, which streams the document instead of loading the whole file into memory, parsing it, and building the tree; it's more efficient.
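For the markup in the question, a minimal sketch of the DOM route could look like this. It uses PHP's bundled DOMDocument rather than XMLReader, which is a substitution made here because DOMDocument::loadHTML() tolerates the unclosed `<img>` tags, whereas XMLReader expects well-formed XML. The sample is a trimmed copy of the question's markup.

```php
<?php
// Trimmed copy of the markup from the question (unclosed <img> tags included).
$html = <<<'HTML'
<marquee align="left" id="LatestNewsM" SCROLLAMOUNT="4" loop="infinite" direction="right">
    <font dir="rtl" valign="top" class="StringTheme" style="font-size:14px;">test test test</font>
    <img src="/Portal/images/LightVersionWeb2/jazeeraTicSep.gif" align="middle">
    <font dir="rtl" valign="top" class="StringTheme" style="font-size:14px;">test sample text sample</font>
    <img src="/Portal/images/LightVersionWeb2/jazeeraTicSep.gif" align="middle">
</marquee>
HTML;

// Collect parser warnings (e.g. about the non-standard <marquee> tag)
// instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Gather the text of every <font> element in document order.
$texts = [];
foreach ($doc->getElementsByTagName('font') as $font) {
    $texts[] = trim($font->textContent);
}

print_r($texts);
```

With a real page, `$html` would be the string returned by file_get_contents(), as in the question.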

P.S. It's quite late here (3:00 AM), so please excuse any misspelled words. Thank you. :)

score 0 · Accepted Answer · 2011-04-15 23:36:56Z


These may be useful:
http://php.net/manual/en/function.strip-tags.php - to delete all tags from text
http://php.net/manual/en/book.simplexml.php - to parse XML

If the HTML were valid XML (currently it is not: the 'img' tags are not closed), something like this could be used:

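$xml = new SimpleXMLElement($data); // $data must be well-formed XML
$fonts = $xml->xpath('/marquee/font'); // every <font> child of the root <marquee>
foreach ($fonts as $font) {
    echo (string) $font, PHP_EOL; // text content of the element
}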
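When the markup is not well-formed XML, one possible workaround is to parse it leniently with DOMDocument::loadHTML() and hand the result to SimpleXML through simplexml_import_dom(), so the XPath approach still applies. The sketch below assumes a shortened sample string; the query becomes `//marquee/font` because loadHTML() wraps the fragment in `<html>` and `<body>` elements.

```php
<?php
// Shortened sample in the shape of the question's markup; note the unclosed <img>.
$data = '<marquee id="LatestNewsM">'
      . '<font class="StringTheme">test test test</font>'
      . '<img src="/Portal/images/LightVersionWeb2/jazeeraTicSep.gif" align="middle">'
      . '<font class="StringTheme">text text 222 another text</font>'
      . '</marquee>';

// Parse as HTML, which accepts markup that an XML parser would reject.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($data);
libxml_clear_errors();

// Switch to SimpleXML so that xpath() can be used as before.
$xml = simplexml_import_dom($doc);
$lines = [];
foreach ($xml->xpath('//marquee/font') as $font) {
    $lines[] = (string) $font;
}

echo implode(PHP_EOL, $lines), PHP_EOL;
```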

edited Apr 15, 2011 at 23:36

answered Apr 15, 2011 at 23:24

user680786

1 Comment

Saleh Over a year ago

Thanks for helping me, but I am getting an error because the first line is 'DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd' and the string could not be parsed. What should I do?
