1

I have this HTML code:

<marquee  align="left" id="LatestNewsM" SCROLLAMOUNT="4" loop="infinite" direction="right">

            <font dir="rtl" valign="top" class="StringTheme" style="font-size:14px;">test test test</font>  
            <img src="/Portal/images/LightVersionWeb2/jazeeraTicSep.gif" align="middle">

            <font dir="rtl" valign="top" class="StringTheme" style="font-size:14px;">test sample text sample</font>  
            <img src="/Portal/images/LightVersionWeb2/jazeeraTicSep.gif" align="middle">

            <font dir="rtl" valign="top" class="StringTheme" style="font-size:14px;">text text 222 another text</font>  
            <img src="/Portal/images/LightVersionWeb2/jazeeraTicSep.gif" align="middle">
            ...........
            .....
</marquee>

and this PHP code:

$homepage = file_get_contents('http://www.site.com');

How can I search the content and get only the text inside the <font> tags?

4
  • Your question is completely unrelated to file_get_contents(); I've edited it accordingly. Besides that.. a DOM parser is the way to go - please don't even think about using regular expressions for it. Commented Apr 15, 2011 at 23:24
  • <marquee> and <font>?! Please tell me you're just scraping data from a decades-old site in order to update it to modern standards... Commented Apr 15, 2011 at 23:36
  • @jnpcl: No, I am trying to get news from a news website and display it on my webpage; that is all I want to do. Commented Apr 15, 2011 at 23:41
  • You should send a letter with white powder to whoever is responsible for the HTML code of that site. ;x Commented Apr 16, 2011 at 0:08

2 Answers

1

You have a few options: a regex (which ThiefMaster advises against), strpos and substr, or a DOM/XML parser.

If you go with regex, you might end up with something like this:

/<font[^>]*>.*<\/font>/i

When run on data like this:

> Hello, this is my brutal <font>font
> <font>tag</font> right</font> it is

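You will end up with this if the quantifier is greedy, as written (assuming the data sits on one line or the /s modifier is used, since . does not match newlines by default):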

<font>font <font>tag</font> right</font>

or with this if it is ungreedy (.*? instead of .*):

<font>font <font>tag</font>

You can use a negative lookahead and do a better job, but it's still not a good solution. (This example is only meant to show you why; the regex is kept as simple as possible.)
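To see the greedy/ungreedy difference for yourself, here is a minimal sketch. It assumes the sample data is on a single line, and the variable names ($data, $greedy, $lazy) are just for illustration:

```php
<?php
// Sample data from above, kept on one line so "." can span all of it.
$data = 'Hello, this is my brutal <font>font <font>tag</font> right</font> it is';

// Greedy: ".*" runs on to the LAST </font> in the string.
preg_match('/<font[^>]*>.*<\/font>/i', $data, $greedy);

// Ungreedy: ".*?" stops at the FIRST </font> it meets.
preg_match('/<font[^>]*>.*?<\/font>/i', $data, $lazy);

echo $greedy[0], PHP_EOL; // <font>font <font>tag</font> right</font>
echo $lazy[0], PHP_EOL;   // <font>font <font>tag</font>
```

Neither result is a properly matched pair for the inner tag, which is the point: the pattern has no notion of nesting.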

If you go with strpos and substr, you'll have to look through all characters one by one and parse the document yourself (matching opening and closing tags, skipping attributes) or you can try

$closing = 0; // start searching at offset zero
$opening = strpos($dataset, '<font', $closing);  // next opening tag
$closing = strpos($dataset, '</font', $opening); // its closing tag, searched from the opening tag

and so on until you parse it all.
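A rough sketch of that loop, assuming the <font> tags are not nested and no attribute value contains a ">" character. The helper name fontTexts and the shortened sample markup are made up for the example:

```php
<?php
// Hypothetical helper: collects the text of each non-nested <font>...</font> pair.
function fontTexts(string $dataset): array
{
    $texts = [];
    $closing = 0; // the search starts at offset zero

    while (($opening = stripos($dataset, '<font', $closing)) !== false) {
        // Skip the attributes: the text starts after the ">" that ends the opening tag.
        $start = strpos($dataset, '>', $opening);
        if ($start === false) {
            break; // malformed opening tag
        }
        $start++;

        $closing = stripos($dataset, '</font', $start);
        if ($closing === false) {
            break; // unclosed tag: stop rather than guess
        }

        $texts[] = trim(substr($dataset, $start, $closing - $start));
    }

    return $texts;
}

$sample = '<marquee><font class="StringTheme">test test test</font> <img src="sep.gif">'
        . '<font class="StringTheme">test sample text sample</font></marquee>';

print_r(fontTexts($sample));
```

It works on simple input like the one in the question, but every edge case (nesting, comments, odd quoting) is something you would have to handle by hand.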

If you go with a DOM/XML parser, keep in mind that file_get_contents() and file() load the whole file into memory, as do most DOM/XML parsers. I would go with XMLReader, which streams the document instead of loading the whole file into memory, parsing it and building the tree, so it is more efficient.
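One caveat: XMLReader expects well-formed XML, and markup like the one in the question (unclosed img tags) is not. So as an illustration of the parser route, here is a sketch that swaps in DOMDocument::loadHTML instead, which tolerates this kind of HTML at the cost of building the whole tree in memory. The $homepage string below is a shortened stand-in for the fetched page, not the real site:

```php
<?php
// Stand-in for the fetched page; in practice this would come from file_get_contents().
$homepage = '<html><body><marquee id="LatestNewsM">'
          . '<font class="StringTheme">test test test</font><img src="sep.gif">'
          . '<font class="StringTheme">test sample text sample</font><img src="sep.gif">'
          . '</marquee></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($homepage);
libxml_clear_errors();

// Gather the text content of every <font> element in the document.
$texts = [];
foreach ($doc->getElementsByTagName('font') as $font) {
    $texts[] = trim($font->textContent);
}

print_r($texts);
```

If the page has other font elements outside the marquee, you would need to narrow the search, for example by starting from the element with id LatestNewsM.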

P.S. It's quite late here (3:00 AM), so excuse any misspelled words. Thank you. :)

0

These will be useful:
http://php.net/manual/en/function.strip-tags.php - to delete all tags from text
http://php.net/manual/en/book.simplexml.php - to parse XML

If the HTML were valid XML (currently it is not: the 'img' tags are not closed), something like this could be used:

$xml = new SimpleXMLElement($data);
$fonts = $xml->xpath('/marquee/font');
foreach ($fonts as $font) print $font[0].PHP_EOL;
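When the markup is not valid XML, one possible workaround (a sketch, not the only way) is to let DOMDocument::loadHTML parse it first and then hand the result to SimpleXML with simplexml_import_dom, so the XPath query still works. The $data string here is an invented sample with a DOCTYPE and unclosed img tags; the path changes to //marquee/font because the root element is now html:

```php
<?php
// Invented sample: a DOCTYPE line plus unclosed <img> tags, which SimpleXMLElement alone rejects.
$data = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
      . '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">'
      . '<html><body><marquee id="LatestNewsM">'
      . '<font class="StringTheme">test test test</font><img src="sep.gif">'
      . '<font class="StringTheme">text text 222 another text</font><img src="sep.gif">'
      . '</marquee></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($data);
libxml_clear_errors();

// Bridge to SimpleXML so the xpath() call from above can be reused.
$xml = simplexml_import_dom($doc);
$texts = array_map('strval', $xml->xpath('//marquee/font'));

print_r($texts);
```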

1 Comment

Thanks for helping me, but I am getting an error because the first line is 'DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd' and the string could not be parsed. What should I do?