Parsing HTML with Python 2.7

Question

Evening folks (or morning depending on where you are :) ).

I'm looking to parse a webpage that contains multiple segments similar to the one below:

> <p><a name="Abercrombie"></a></p> <h3>Abercrombie Council</h3> <p>Mr
> Billy Smith<br />The Managing Director<br />123 Jones Street,
> London<br />T:02081234567<br /><a
> href="mailto:[email protected]">Email</a></p>

What I want to do is capture the page source, parse it to extract the unique information above, and write it out as rows in a tab-delimited document, one segment per line, split into: title, name of office, name of individual, job role, address, telephone number, and email address.

I've been looking at using BeautifulSoup, but I'm wondering whether there are any other tools that would be more suitable?

Kartik · Accepted Answer · 2013-01-24 21:15:41Z

1

I'd say BeautifulSoup would be your best and easiest option for parsing pages or chunks of HTML. You can also try Scrapy or even ScraperWiki.

Sample usage for BeautifulSoup:

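import BeautifulSoup  # BeautifulSoup 3 (Python 2)
import urllib2

# Download the page source (replace the URL with the page you want to scrape).
html = urllib2.urlopen('http://site.com').read()
dom = BeautifulSoup.BeautifulSoup(html)

# Find every matching tag, e.g. <p class='address'>....</p>
data = dom.findAll('p', {'class': 'address'})

for tag in data:
    print tag  # print each matched tag, not the whole list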
More examples: http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html
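
Applied to the segment in the question, the same approach might look roughly like the sketch below. It is only an illustration under several assumptions: it uses the newer `bs4` package (BeautifulSoup 4) rather than the BeautifulSoup 3 module above, it assumes every segment is an `<h3>` followed by a `<p>` whose `<br />`-separated chunks are always name, role, address and telephone in that order, and it assumes the first word of the name is the title. The names `SAMPLE`, `clean`, `extract_rows` and `to_tsv` are made up for this example, and the e-mail address is a placeholder because the real one is obscured in the question.

```python
from bs4 import BeautifulSoup

# The question's sample segment. The e-mail address is a made-up placeholder.
SAMPLE = """<p><a name="Abercrombie"></a></p> <h3>Abercrombie Council</h3> <p>Mr
Billy Smith<br />The Managing Director<br />123 Jones Street,
London<br />T:02081234567<br /><a
href="mailto:someone@example.com">Email</a></p>"""


def clean(text):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return " ".join(text.split())


def extract_rows(page_source):
    """Return one list of field values per <h3> segment in page_source."""
    soup = BeautifulSoup(page_source, "html.parser")
    rows = []
    for heading in soup.find_all("h3"):
        details = heading.find_next_sibling("p")
        if details is None:
            continue
        # Text chunks between the <br /> tags, in document order.
        chunks = [clean(s) for s in details.stripped_strings]
        if len(chunks) < 4:
            continue  # segment does not match the assumed layout
        # Assumption: the first word of the name chunk is the title ("Mr").
        title, _, name = chunks[0].partition(" ")
        phone = chunks[3]
        if phone.startswith("T:"):
            phone = phone[2:]
        link = details.find("a", href=True)
        email = link["href"].replace("mailto:", "", 1) if link else ""
        rows.append([title, clean(heading.get_text()), name,
                     chunks[1], chunks[2], phone, email])
    return rows


def to_tsv(rows):
    """Join each row with tabs and end every row with a newline."""
    return "".join("\t".join(row) + "\n" for row in rows)


print(to_tsv(extract_rows(SAMPLE)))
```

For a real page you would pass the downloaded source to `extract_rows` instead of `SAMPLE` and write the result of `to_tsv` to a file.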

thikonom · Accepted Answer · 2013-01-24 21:10:16Z

0

BeautifulSoup is a decent and popular library, but you could also have a look at lxml.
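
As a rough idea of what that could look like, here is a minimal sketch using `lxml.html` and XPath on the segment from the question. It assumes each `<h3>` is immediately followed by the `<p>` holding the details; `SAMPLE` and `extract_segments` are names invented for this example, and the e-mail address is a placeholder because the real one is obscured in the question.

```python
from lxml import html

# The question's sample segment. The e-mail address is a made-up placeholder.
SAMPLE = """<p><a name="Abercrombie"></a></p> <h3>Abercrombie Council</h3> <p>Mr
Billy Smith<br />The Managing Director<br />123 Jones Street,
London<br />T:02081234567<br /><a
href="mailto:someone@example.com">Email</a></p>"""


def extract_segments(page_source):
    """Return (office, text_chunks, email) for each <h3> followed by a <p>."""
    tree = html.fromstring(page_source)
    segments = []
    for heading in tree.xpath("//h3"):
        details = heading.getnext()  # next sibling element, if any
        if details is None or details.tag != "p":
            continue
        # itertext() yields the text before, between and after the <br /> tags.
        chunks = [" ".join(t.split()) for t in details.itertext() if t.strip()]
        hrefs = details.xpath(".//a[starts-with(@href, 'mailto:')]/@href")
        email = hrefs[0][len("mailto:"):] if hrefs else ""
        segments.append((heading.text_content().strip(), chunks, email))
    return segments


for office, chunks, email in extract_segments(SAMPLE):
    print("\t".join([office] + chunks[:4] + [email]))
```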

Victor Olex · Accepted Answer · 2013-01-24 22:27:17Z

0

The web scraping framework Scrapy (http://scrapy.org/) is a good choice for this kind of task, because it can not only parse and extract data but also run automated scraping jobs.
