-6

I need to extract the HTML content within given tags, and I have the URLs of the pages. Is there any way I can do this using Python?

2 Answers

1

There is an incredible scraping library for Python called BeautifulSoup which will make your life much easier: http://www.crummy.com/software/BeautifulSoup/

BeautifulSoup allows you to select by HTML tags and/or HTML attributes, such as a CSS class name. It also handles badly formed HTML documents really well, but you need to read the docs to see how it works. It's pretty amazing how much you can scrape with so few lines of code using this library.
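As a rough illustration of selecting by tag and by attribute, here is a minimal sketch. The sample HTML, the class name "intro" and the id "title" are made up for this example, and it assumes BeautifulSoup 4 is installed (the bs4 package) and uses Python's built-in "html.parser".

```python
from bs4 import BeautifulSoup

# Hypothetical sample document; in practice this would be the fetched page.
html = """
<html><body>
  <h1 id="title">Example page</h1>
  <p class="intro">First paragraph.</p>
  <p>Second paragraph.</p>
  <p class="intro">Third paragraph.</p>
  <a href="http://example.com/about">About</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Select by tag name only.
all_paragraphs = soup.find_all("p")

# Select by tag name plus CSS class (class_ avoids the Python keyword).
intro_paragraphs = soup.find_all("p", class_="intro")

# Select by another attribute: the element whose id is "title".
heading = soup.find(id="title")

# Select every <a> tag that has an href attribute at all.
links = soup.find_all("a", href=True)

intro_texts = [p.get_text() for p in intro_paragraphs]
link_targets = [a["href"] for a in links]
```

find_all returns a list of matching tags, while find returns only the first match (or None if nothing matches), so it is worth checking for None before using the result on real pages.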

Have fun!

0

Use BeautifulSoup

It is very easy to do this: just use urllib to get the data from the web, then use BeautifulSoup to parse out the information you need.

Here is an example:

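from urllib.request import urlopen  # Python 3; on Python 2 this was urllib2.urlopen
from bs4 import BeautifulSoup

# urlopen needs a full URL, including the scheme
page = urlopen('http://example.com')

# the open file handle can be passed in directly; name the parser explicitly
soup = BeautifulSoup(page, 'html.parser')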

You can then use BeautifulSoup to extract the information for a given tag like this:

soup.find_all('tag_name')
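To get the HTML content inside each matching tag, rather than the tag objects themselves, something like the following sketch might work. The helper name contents_of and the sample markup are hypothetical, and it assumes BeautifulSoup 4 with the built-in "html.parser".

```python
from bs4 import BeautifulSoup


def contents_of(html, tag_name):
    """Return the inner HTML of every tag_name element as a list of strings."""
    soup = BeautifulSoup(html, "html.parser")
    # decode_contents() renders a tag's children as markup, without the tag itself.
    return [tag.decode_contents() for tag in soup.find_all(tag_name)]


# Hypothetical sample markup standing in for a downloaded page.
sample = "<div><p>Hello <b>world</b></p></div><div>Other</div>"

inner = contents_of(sample, "div")

# get_text() drops the markup and keeps only the text.
soup = BeautifulSoup(sample, "html.parser")
texts = [tag.get_text() for tag in soup.find_all("div")]
```

Use decode_contents() when the nested markup matters and get_text() when only the text is needed. Note that with nested tags of the same name, find_all returns both the outer and the inner tag.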

There are also a lot of other ways to extract data; this site will help: Web-Scraping with bs4

4 Comments

from bs4 import * should be from bs4 import BeautifulSoup. Also, you don't need to read the file handle before passing it into BeautifulSoup.
Well, if you download BeautifulSoup 4, you can import it like that.
Sorry, I was talking about the asterisk. You shouldn't need to do that.
Ohhh yes, you're right. I'll fix it.
