Thanks to the kind users of this site, I have some idea of how to use re as an alternative to a non-standard python module so that my script will work with minimum over-hang. Today, I've been experimenting with parsing modules. I've come across beautifulsoup.. this is all great, but I don't understand it.
For educational purposes, I'd like to strip the following information from http://yify-torrents.com/browse-movie (please don't tell me to use a web-crawler, I'm not trying to crawl the whole site - just extract the information from this page to learn how parsing modules work!)
Movie Title Quality Torrent Link
There is 22 of these items, I wish for them to be stored in lists in order, ie. item_1, item_2. And these lists need to contain these three items. For instance:
item_1 = ["James Bond: Casino Royale (2006)", "720p", "http://yify-torrents.com/download/start/James_Bond_Casino_Royale_2006.torrent"]
item_2 = ["Pitch Perfect (2012)", "720p", "http://yify-torrents.com/download/start/Pitch_Perfect_2012.torrent"]
And then, to make matters simple, I just want to print every item to the console. To make things more difficult, however, these items don't have identifiers on the page, so the info. needs to be strictly ordered. This is all good, but all I'm getting is either the entire source being contained by each list item, or empty items! An example item divider is as follows:
<div class="browse-info">
<span class="info">
<h3><a href="http://yify-torrents.com/movie/James_Bond_Casino_Royale_2006">James Bond: Casino Royale (2006)</a></h3>
<p><b>Size:</b> 1018.26 MB</p>
<p><b>Quality:</b> 720p</p>
<p><b>Genre:</b> Action | Crime</p>
<p><b>IMDB Rating:</b> 7.9/10</p>
<span>
<p class="peers"><b>Peers:</b> 698</p>
<p class="peers"><b>Seeds:</b> 356</p>
</span>
</span>
<span class="links">
<a href="http://yify-torrents.com/movie/James_Bond_Casino_Royale_2006" class="std-btn-small mright">View Info<span></span></a>
<a href="http://yify-torrents.com/download/start/James_Bond_Casino_Royale_2006.torrent" class="std-btn-small mleft torrentDwl" data-movieID="2620" data-torrentID="2812">Download<span></span></a>
</span>
</div>
Any ideas? Would someone please do me the honours of giving me an example of how to do this? I'm not sure beautiful soup accommodates all of my requirements! PS. Sorry for the poor English, it's not my first language.
x_1,x_2, etc., this is usually an indication that you should really use a Python list, in this case one namedx. A list will make your script much more robust, as you can append new elements and the list will grow as necessary. If you design your program for one page with 4 elements, and then parse a different page with 6 elements, yourx_1scheme will require that you change your input loop, and probably your print loop too. But if you have coded to use a list, then it will adapt to other pages without any changes.