Evening folks (or morning depending on where you are :) ).

I'm looking to parse a webpage which contains multiple segments similar to the one below:

> <p><a name="Abercrombie"></a></p> <h3>Abercrombie Council</h3> <p>Mr
> Billy Smith<br />The Managing Director<br />123 Jones Street,
> London<br />T:02081234567<br /><a
> href="mailto:[email protected]">Email</a></p>

What I want to do is capture the page source, parse it to extract the unique information in each segment, and write it out as rows in a tab-delimited document, one line per segment. Each row should be split into title, name of office, name of individual, job role, address, telephone number and email address.
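
To make the target output concrete, here is a minimal sketch of the row format using Python's standard csv module. The column order, the `FIELDS` names and the `to_tsv` helper are assumptions for illustration; the values are copied by hand from the sample segment above (the email address is shown exactly as it appears in the post).

```python
import csv
import io

# Hypothetical column order, taken from the list of fields in the question.
FIELDS = ["title", "office", "name", "role", "address", "telephone", "email"]


def to_tsv(rows):
    """Return the rows as tab-delimited text, one record per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(FIELDS)   # header line
    writer.writerows(rows)    # one line per extracted segment
    return buffer.getvalue()


# Values typed in by hand from the sample segment; nothing is parsed here.
SAMPLE_ROW = [
    "Mr",
    "Abercrombie Council",
    "Billy Smith",
    "The Managing Director",
    "123 Jones Street, London",
    "02081234567",
    "[email protected]",
]

if __name__ == "__main__":
    print(to_tsv([SAMPLE_ROW]), end="")
```

This only covers the output side; producing rows like `SAMPLE_ROW` from the HTML is the parsing step the question is about.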

I've been looking at using BeautifulSoup, but I'm wondering if there are any other tools that are more suitable?

3 Answers

1

I'd say BeautifulSoup would be your best and easiest option; it can parse whole pages or chunks of HTML. You could also try Scrapy or even ScraperWiki.

Sample usage for BeautifulSoup:

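# Python 2 with BeautifulSoup 3
import BeautifulSoup
import urllib2

# Download the page source
html = urllib2.urlopen('http://site.com').read()
dom = BeautifulSoup.BeautifulSoup(html)
# Find every <p class='address'>....</p>
data = dom.findAll('p', {'class': 'address'})

for i in data:
    print i  # print each matching element, not the whole list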
More examples: http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html
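
Applied to the segment in the question, a rough sketch could look like the following. It assumes Python 3 with the newer `bs4` package rather than BeautifulSoup 3, and it assumes every segment is an `<h3>` followed by a `<p>` whose fields are separated by `<br />` tags. `extract_rows` is a made-up helper name, and the email address is kept exactly as it appears in the post.

```python
from bs4 import BeautifulSoup

# Sample segment copied from the question (e-mail shown as it appears there).
SAMPLE = """
<p><a name="Abercrombie"></a></p> <h3>Abercrombie Council</h3> <p>Mr
Billy Smith<br />The Managing Director<br />123 Jones Street,
London<br />T:02081234567<br /><a
href="mailto:[email protected]">Email</a></p>
"""


def extract_rows(html):
    """Return one list of fields per <h3>/<p> pair found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for heading in soup.find_all("h3"):
        # Assumption: the details paragraph is the next <p> sibling.
        details = heading.find_next_sibling("p")
        if details is None:
            continue
        link = details.find("a", href=True)
        email = link["href"].replace("mailto:", "", 1) if link else ""
        if link:
            link.extract()  # drop the "Email" link text from the fields
        # Each text chunk between <br /> tags becomes one field;
        # internal line breaks are collapsed to single spaces.
        parts = [" ".join(s.split()) for s in details.stripped_strings]
        rows.append([heading.get_text(strip=True)] + parts + [email])
    return rows


if __name__ == "__main__":
    for row in extract_rows(SAMPLE):
        print("\t".join(row))
```

Splitting "Mr" from the name and stripping the "T:" prefix from the phone number are left out here; both depend on how consistent the real pages are.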


0

BeautifulSoup is a decent and popular library, but you could also have a look at lxml.

0

The web scraping framework Scrapy (http://scrapy.org/) is a good choice for this kind of task, because it can not only parse and extract data but also run automated scraping jobs.
