Python, parsing html

Question

Thanks to the kind users of this site, I have some idea of how to use re as an alternative to a non-standard Python module, so that my script works with minimal overhead. Today I've been experimenting with parsing modules and came across BeautifulSoup. It looks great, but I don't understand it.

For educational purposes, I'd like to strip the following information from http://yify-torrents.com/browse-movie (please don't tell me to use a web-crawler, I'm not trying to crawl the whole site - just extract the information from this page to learn how parsing modules work!)

Movie Title, Quality, Torrent Link

There are 22 of these items, and I'd like each one stored in its own list, in order, i.e. item_1, item_2, and so on. Each list needs to contain these three values. For instance:

item_1 = ["James Bond: Casino Royale (2006)", "720p", "http://yify-torrents.com/download/start/James_Bond_Casino_Royale_2006.torrent"]
item_2 = ["Pitch Perfect (2012)", "720p", "http://yify-torrents.com/download/start/Pitch_Perfect_2012.torrent"]

And then, to make matters simple, I just want to print every item to the console. To make things more difficult, however, these items don't have identifiers on the page, so the info. needs to be strictly ordered. This is all good, but all I'm getting is either the entire source being contained by each list item, or empty items! An example item divider is as follows:

<div class="browse-info">
    <span class="info">
        <h3><a href="http://yify-torrents.com/movie/James_Bond_Casino_Royale_2006">James Bond: Casino Royale (2006)</a></h3>
        <p><b>Size:</b> 1018.26 MB</p>
        <p><b>Quality:</b> 720p</p>
        <p><b>Genre:</b> Action | Crime</p>
        <p><b>IMDB Rating:</b> 7.9/10</p>
            <span>
                <p class="peers"><b>Peers:</b> 698</p>
                <p class="peers"><b>Seeds:</b> 356</p>
            </span>
    </span>
    <span class="links">
        <a href="http://yify-torrents.com/movie/James_Bond_Casino_Royale_2006" class="std-btn-small mright">View Info<span></span></a>
        <a href="http://yify-torrents.com/download/start/James_Bond_Casino_Royale_2006.torrent" class="std-btn-small mleft torrentDwl" data-movieID="2620" data-torrentID="2812">Download<span></span></a>
    </span> 
</div>

Any ideas? Would someone please do me the honours of giving me an example of how to do this? I'm not sure beautiful soup accommodates all of my requirements! PS. Sorry for the poor English, it's not my first language.

Whenever you start thinking about code with variables named x_1, x_2, etc., this is usually an indication that you should really use a Python list, in this case one named x. A list will make your script much more robust, as you can append new elements and the list will grow as necessary. If you design your program for one page with 4 elements, and then parse a different page with 6 elements, your x_1 scheme will require that you change your input loop, and probably your print loop too. But if you have coded to use a list, then it will adapt to other pages without any changes.
– PaulMcG, Commented Mar 22, 2015 at 15:13
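
To illustrate the point of this comment, here is a minimal sketch in Python 3 that keeps every item in one list instead of separate item_1, item_2 variables. The rows are the two examples from the question; the names rows and items are only chosen for this illustration.

```python
# The two example rows from the question, used here as stand-in input.
rows = [
    ("James Bond: Casino Royale (2006)", "720p",
     "http://yify-torrents.com/download/start/James_Bond_Casino_Royale_2006.torrent"),
    ("Pitch Perfect (2012)", "720p",
     "http://yify-torrents.com/download/start/Pitch_Perfect_2012.torrent"),
]

# One list of lists replaces item_1, item_2, ...; it grows with the input.
items = []
for title, quality, link in rows:
    items.append([title, quality, link])

# The same print loop works for 2 items or 22.
for number, item in enumerate(items, start=1):
    print(number, item)
```

What the question calls item_1 is then simply items[0], and nothing has to change when a page holds more or fewer entries.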

root · Accepted Answer · 2012-12-07 08:50:32Z


from bs4 import BeautifulSoup
import urllib2  # Python 2 only

# Fetch the page and hand the raw HTML to BeautifulSoup
f = urllib2.urlopen('http://yify-torrents.com/browse-movie')
html = f.read()
soup = BeautifulSoup(html)


In [25]: for i in soup.findAll("div",{"class":"browse-info"}):
    ...:     name=i.find('a').text
    ...:     for x in i.findAll('b'):
    ...:         if x.text=="Quality:":
    ...:             quality=x.parent.text
    ...:     link=i.find('a',{"class":"std-btn-small mleft torrentDwl"})['href']
    ...:     print [name,quality,link]
    ...:     
[u'James Bond: Casino Royale (2006)', u'Quality: 720p', 'http://yify-torrents.com/download/start/James_Bond_Casino_Royale_2006.torrent']
[u'Pitch Perfect (2012)', u'Quality: 720p', 'http://yify-torrents.com/download/start/Pitch_Perfect_2012.torrent']
...

Or, to get exactly the output you wanted (the quality value without its "Quality:" label):

In [26]: for i in soup.findAll("div",{"class":"browse-info"}):
    ...:     name=i.find('a').text
    ...:     for x in i.findAll('b'):
    ...:         if x.text=="Quality:":
    ...:             quality=x.parent.find(text=True, recursive=False).strip()
    ...:     link=i.find('a',{"class":"std-btn-small mleft torrentDwl"})['href']
    ...:     print [name,quality,link]
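
For readers on Python 3, the same idea can be sketched against the sample markup from the question instead of the live page. This is only an illustration and assumes bs4 is installed. The helper name extract_items and the HTML constant are made up for this example, and the markup is trimmed to the tags that matter.

```python
from bs4 import BeautifulSoup

# One item from the question, trimmed to the relevant tags.
HTML = """
<div class="browse-info">
    <span class="info">
        <h3><a href="http://yify-torrents.com/movie/James_Bond_Casino_Royale_2006">James Bond: Casino Royale (2006)</a></h3>
        <p><b>Size:</b> 1018.26 MB</p>
        <p><b>Quality:</b> 720p</p>
    </span>
    <span class="links">
        <a href="http://yify-torrents.com/movie/James_Bond_Casino_Royale_2006" class="std-btn-small mright">View Info<span></span></a>
        <a href="http://yify-torrents.com/download/start/James_Bond_Casino_Royale_2006.torrent" class="std-btn-small mleft torrentDwl">Download<span></span></a>
    </span>
</div>
"""


def extract_items(html):
    """Return one [title, quality, link] list per browse-info block."""
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for block in soup.find_all("div", class_="browse-info"):
        title = block.find("h3").get_text(strip=True)
        quality = None  # stays None if the block has no Quality line
        for bold in block.find_all("b"):
            if bold.get_text(strip=True) == "Quality:":
                # The value is the text node right after the <b> label.
                quality = bold.next_sibling.strip()
        # class_ matches a single class, so the other button classes
        # do not need to be spelled out.
        link = block.find("a", class_="torrentDwl")["href"]
        items.append([title, quality, link])
    return items


items = extract_items(HTML)
for item in items:
    print(item)
```

Resetting quality to None for each block avoids silently reusing the value from the previous block when a Quality line is missing.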

edited Dec 7, 2012 at 8:50

answered Dec 7, 2012 at 8:39


3 Comments

Jamus Over a year ago

Thank you @root! This code is exactly what I'm after, and your styling skills make it very easy to follow and interpret. The thing that surprises me, however, is that I'm not getting any output whatsoever. The soup variable contains the required mark-up, but the for loop doesn't spit anything out. I'm using bs3 (Python 2.5, so no choice). Could this make a difference? Thanks again! :)

root Over a year ago

This will most likely not work with bs3. I will try to take a look some time later, as I don't have bs3 at hand at the moment. Also, 2.5 is really old; if possible you should upgrade to 2.7, or at least 2.6.

Jamus Over a year ago

Thank you for your help! Don't worry about checking it out, I have another computer that has 2.7 installed - I've just been working through ssh to my WD TV all day and completely forgot that I have a MODERN LINUX OPERATING SYSTEM INSTALLED! GUI and all. I'll make the switch now. I can see that by the time it hits 'soup', there's practically no code left. Thank you so much for your help! :)

Denis · Accepted Answer · 2012-12-07 08:25:09Z

As you requested, here is a simple example of a parser. As you can see, it uses lxml. With lxml you have two ways to work with the DOM tree: XPath and CSS selectors. I preferred XPath.

import lxml.html
import urllib  # Python 2 only

def parse():
    url = 'https://sometotosite.com'
    doc = lxml.html.fromstring(urllib.urlopen(url).read())
    main_div = doc.xpath("//div[@id='line']")[0]
    main = {}
    tr = []
    category = None
    for el in main_div.getchildren():
        # A child holding an <a name="...tn..."> starts a new category
        names = el.xpath("descendant::a[contains(@name,'tn')]/text()")
        if names:
            category = names[0]
            main[category] = ''
            tr = []
        else:
            # Otherwise collect the children whose markup contains &#8212
            for element in el.getchildren():
                if '&#8212' in lxml.html.tostring(element):
                    tr.append(element)
                    print category, tr

parse()
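
The example above targets a different site, so here is a rough Python 3 sketch of the XPath approach applied to the markup from the question. It assumes lxml is installed. The helper name extract_items and the HTML constant are invented for this illustration, and the markup is trimmed to the relevant tags.

```python
import lxml.html

# One item from the question, wrapped in a minimal document.
HTML = """
<html><body>
<div class="browse-info">
    <span class="info">
        <h3><a href="http://yify-torrents.com/movie/James_Bond_Casino_Royale_2006">James Bond: Casino Royale (2006)</a></h3>
        <p><b>Size:</b> 1018.26 MB</p>
        <p><b>Quality:</b> 720p</p>
    </span>
    <span class="links">
        <a href="http://yify-torrents.com/movie/James_Bond_Casino_Royale_2006" class="std-btn-small mright">View Info<span></span></a>
        <a href="http://yify-torrents.com/download/start/James_Bond_Casino_Royale_2006.torrent" class="std-btn-small mleft torrentDwl">Download<span></span></a>
    </span>
</div>
</body></html>
"""


def extract_items(html):
    """Return one [title, quality, link] list per browse-info block."""
    doc = lxml.html.fromstring(html)
    items = []
    for block in doc.xpath("//div[@class='browse-info']"):
        # normalize-space() trims the text of the first matching node.
        title = block.xpath("normalize-space(.//h3/a)")
        # Pick the <p> whose <b> label reads "Quality:" and take the
        # text that follows the label.
        quality = block.xpath("normalize-space(.//p[b='Quality:']/text())")
        # The download button is the link carrying the torrentDwl class.
        link = block.xpath(".//a[contains(@class, 'torrentDwl')]/@href")[0]
        items.append([title, quality, link])
    return items


items = extract_items(HTML)
for item in items:
    print(item)
```

Each XPath expression starts with a dot so that it searches only inside the current block, which keeps the three values of one movie together.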

LXML official site
