Python HTML parsing specific information within tags

Question

I am trying to parse out specific information within tags.

For example, on the website:

http://www.epicurious.com/articlesguides/bestof/toprecipes/bestchickenrecipes/recipes/food/views/My-Favorite-Simple-Roast-Chicken-231348

I am trying to extract very specific information, such as the ingredients. If you look at the page source, you can see that this information sits under the heading

<h2>Ingredients</h2>, and that <ul class="ingredientsList"> holds all the actual ingredients.

I found a Python program online that conveniently parses out the hyperlinks in a website, and I want to modify it to parse out these ingredients instead. I am not very well versed in Python, so how exactly would I go about modifying the code to fit my parsing needs?

Please elaborate on how I should go about this; examples would be greatly appreciated, since I am not very experienced with this.

The code:

import sgmllib

class MyParser(sgmllib.SGMLParser):
    "A simple parser class."

    def parse(self, s):
        "Parse the given string 's'."
        self.feed(s)
        self.close()

    def __init__(self, verbose=0):
        "Initialise an object, passing 'verbose' to the superclass."

        sgmllib.SGMLParser.__init__(self, verbose)
        self.hyperlinks = []
        self.descriptions = []
        self.inside_a_element = 0
        self.starting_description = 0

    def start_a(self, attributes):
        "Process a hyperlink and its 'attributes'."

        for name, value in attributes:
            if name == "href":
                self.hyperlinks.append(value)
                self.inside_a_element = 1
                self.starting_description = 1

    def end_a(self):
        "Record the end of a hyperlink."

        self.inside_a_element = 0

    def handle_data(self, data):
        "Handle the textual 'data'."

        if self.inside_a_element:
            if self.starting_description:
                self.descriptions.append(data)
                self.starting_description = 0
            else:
                self.descriptions[-1] += data

    def get_hyperlinks(self):
        "Return the list of hyperlinks."

        return self.hyperlinks

    def get_descriptions(self):
        "Return a list of descriptions."

        return self.descriptions

import urllib, sgmllib

# Get something to work with.
f = urllib.urlopen("http://www.epicurious.com/Roast-Chicken-231348")
s = f.read()

# Try and process the page.
# The class should have been defined first, remember.
myparser = MyParser()
myparser.parse(s)

# Get the hyperlinks.
print myparser.get_hyperlinks()
print myparser.get_descriptions()

Achim · Accepted Answer · 2011-04-10 22:29:39Z

3

Have a look at http://www.crummy.com/software/BeautifulSoup/ . Your approach works for simple cases, but it will cause you headaches as soon as the HTML and/or your requirements get a bit more complicated.
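For reference, this is roughly what extending the question's event-driven approach to the ingredients list looks like. It is a minimal sketch and not BeautifulSoup itself: it assumes Python 3, where `sgmllib` no longer exists and `html.parser.HTMLParser` fills the same role, and it runs on a hand-written sample that mimics the markup described in the question. The class name `IngredientParser` and the sample string are made up for illustration; the real page may differ.

```python
from html.parser import HTMLParser


class IngredientParser(HTMLParser):
    """Collect the text of every <li> inside <ul class="ingredientsList">."""

    def __init__(self):
        super().__init__()
        self.in_list = False   # currently inside the ingredients <ul>?
        self.in_item = False   # currently inside an <li> of that list?
        self.ingredients = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands over tag and attribute names already lowercased.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "ul" and "ingredientsList" in classes:
            self.in_list = True
        elif tag == "li" and self.in_list:
            self.in_item = True
            self.ingredients.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_item = False
        elif tag == "ul":
            self.in_list = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current item.
        if self.in_item:
            self.ingredients[-1] += data


# Hand-written sample (an assumption, not the real page source).
sample = """
<h2>Ingredients</h2>
<ul class="ingredientsList">
<li class="ingredient">Unsalted butter</li>
<LI class='ingredient'>Dijon
mustard</LI>
</ul>
<ul><li>not an ingredient</li></ul>
"""

parser = IngredientParser()
parser.feed(sample)
parser.close()

# Collapse line breaks and repeated spaces inside each item.
ingredients = [" ".join(text.split()) for text in parser.ingredients]
print(ingredients)
```

The two boolean flags are the part that grows awkward as requirements pile up (nested lists, several sections per page). With BeautifulSoup installed, the same extraction should reduce to something like `soup.find("ul", class_="ingredientsList").find_all("li")`.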

Community · Accepted Answer · 2017-05-23 12:04:13Z

1

I will get a ticking-off from all the people who say HTML can't be analysed with regexes.

OK, OK, but I got the result in fifty minutes:

First, I used this code to get a convenient display of the page source:

import urllib

url = ('http://www.epicurious.com/articlesguides/bestof/'
       'toprecipes/bestchickenrecipes/recipes/food/views/'
       'My-Favorite-Simple-Roast-Chicken-231348')


sock = urllib.urlopen(url)
ch = sock.read()
sock.close()


gen = (str(i)+' '+repr(line) for i,line in enumerate(ch.splitlines(1)))

print '\n'.join(gen)

Then it's child's play to catch the ingredients:

import urllib
import re

url = ('http://www.epicurious.com/articlesguides/bestof/'
       'toprecipes/bestchickenrecipes/recipes/food/views/'
       'My-Favorite-Simple-Roast-Chicken-231348')

sock = urllib.urlopen(url)
ch = sock.read()
sock.close()

x = ch.find('ul class="ingredientsList">')

patingr = re.compile('<li class="ingredient">(.+?)</li>\n')

print patingr.findall(ch,x)

EDIT

Achim,

Concerning the presence of '\n', the fault is mine, not the regex tool's: I wrote the code too quickly.

You are right concerning uppercase: BS still finds the right strings, while the regex fails. But I have never seen page source in which the element tags were written in upper case. Can you give me a link to one like that?

Concerning ' versus ", it's the same: I have never seen it, but you are right, it may happen.

However, when writing an RE, if there are upper-case letters or ' instead of " in some places, the RE will be written to match them: where is the problem?

Do you mean: what if the source code changes? It is even less probable that a site's source code will one day change from lower case to upper case, or from " to '. It isn't very realistic.

In any case, it's easy to correct my RE:

import urllib
import re

url = ('http://www.epicurious.com/articlesguides/bestof/'
       'toprecipes/bestchickenrecipes/recipes/food/views/'
       'My-Favorite-Simple-Roast-Chicken-231348')

sock = urllib.urlopen(url)
ch = sock.read()
sock.close()

#----------------------------------------------------------
patingr = re.compile('<li class="ingredient">(.+?)</li>\n')
print
print '\n'.join(repr(mat.group()) for mat in patingr.finditer(ch))


ch = ch.replace('<li class="ingredient">One 2- to 3-pound farm-raised chicken</li>',
                "<LI class='ingredient'>One 2- to 3-pound farm-raised \nchicken</li>")
print
print '\n'.join(repr(mat.group()) for mat in patingr.finditer(ch))


patingr = re.compile('<li class=["\']ingredient["\']>(.+?)</li>\n',re.DOTALL|re.IGNORECASE)
print
print '\n'.join(repr(mat.group()) for mat in patingr.finditer(ch))

result

'<li class="ingredient">One 2- to 3-pound farm-raised chicken</li>\n'
'<li class="ingredient">Kosher salt and freshly ground black pepper</li>\n'
'<li class="ingredient">2 teaspoons minced thyme (optional)</li>\n'
'<li class="ingredient">Unsalted butter</li>\n'
'<li class="ingredient">Dijon mustard</li>\n'

'<li class="ingredient">Kosher salt and freshly ground black pepper</li>\n'
'<li class="ingredient">2 teaspoons minced thyme (optional)</li>\n'
'<li class="ingredient">Unsalted butter</li>\n'
'<li class="ingredient">Dijon mustard</li>\n'

"<LI class='ingredient'>One 2- to 3-pound farm-raised \nchicken</li>\n"
'<li class="ingredient">Kosher salt and freshly ground black pepper</li>\n'
'<li class="ingredient">2 teaspoons minced thyme (optional)</li>\n'
'<li class="ingredient">Unsalted butter</li>\n'
'<li class="ingredient">Dijon mustard</li>\n'

So from now on I will always add the re.IGNORECASE flag and ["'] in tags.

Are there other "problems" that can happen? I would be interested to know about them.
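One way to probe that question is to run the corrected pattern against a few hand-made variants of a single list item. The sketch below assumes Python 3; the variant strings are hypothetical and not taken from the real page, though a browser would render each of them the same way.

```python
import re

# The corrected pattern from above, unchanged.
patingr = re.compile('<li class=["\']ingredient["\']>(.+?)</li>\n',
                     re.DOTALL | re.IGNORECASE)

# Hypothetical variants of one list item (illustrative, not from the site).
variants = {
    "baseline":        '<li class="ingredient">Dijon mustard</li>\n',
    "no newline":      '<li class="ingredient">Dijon mustard</li>',
    "extra attribute": '<li id="i5" class="ingredient">Dijon mustard</li>\n',
    "second class":    '<li class="ingredient last">Dijon mustard</li>\n',
    "space in tag":    '<li class = "ingredient" >Dijon mustard</li>\n',
    "unquoted value":  '<li class=ingredient>Dijon mustard</li>\n',
}

# Map each variant name to the list of captured ingredient texts.
results = {name: patingr.findall(html) for name, html in variants.items()}

for name, found in results.items():
    print('%-16s %s' % (name, found or 'NO MATCH'))
```

Only the baseline matches. Each of the other variants can be handled by loosening the pattern further, so the practical question is how many such variations the target pages are likely to contain.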

I don't claim that regexes must be used in all cases and parsers never. I just think that when the conditions for using regexes in a controlled and delimited manner are met, they are very interesting, and it would be a pity to neglect them.

By the way, you say nothing about the fact that regexes are enormously faster than BeautifulSoup. See the time comparison between regex and BeautifulSoup.

edited May 23, 2017 at 12:04

CommunityBot

answered Apr 10, 2011 at 23:45

eyquem

4 Comments

eyquem Over a year ago

@Ryan Matthew I see you accepted my answer. Fine. Do you know you can also upvote it? Be aware that variations in the source code from one page to another within a given site may cause failures; results must be checked over several runs, and it's good practice to add verification snippets where possible. If you have other questions, I will be pleased to try to answer.

Eric Over a year ago

Hi eyquem, that is what I figured out too. So I am going to try Beautiful Soup as well, since that might be easier when switching to a different page's source code.

eyquem Over a year ago

@Ryan Matthew "that might be easier when switching to a different page source code": I have the same impression, but it isn't entirely obvious. By the way, see the edit in my other answer: regexes are faster by 3 orders of magnitude (I mean 1000 times faster), at least on the example I tested.

Achim Over a year ago

Your regex breaks if the <li> contains line breaks. It also breaks on the smallest variation in the HTML. For example, using LI or single quotes does not change the semantics of the document at all, but will break your code. This code might work for a single document, but I would never use something like that in real life, let alone in production code.
