I need to extract the HTML content within given tags, provided I have the URLs of the pages. Is there any way I can do this using Python?
-
Google python web scraping. – Blender, Jul 26, 2013 at 5:01
-
Possible duplicate of Options for HTML scraping? – Anthon, Jul 26, 2013 at 5:06
-
Duplicate. stackoverflow.com/questions/1391657/… stackoverflow.com/questions/2081586/… stackoverflow.com/questions/6969567/… – Logan, Jul 26, 2013 at 5:07
2 Answers
There is an excellent HTML parsing library for Python called BeautifulSoup, which makes scraping much easier: http://www.crummy.com/software/BeautifulSoup/
BeautifulSoup lets you select elements by HTML tag and/or by HTML attribute, such as a CSS class name. It also handles malformed HTML really well, but you need to read the docs to see how it works. It's pretty amazing how much you can scrape with so few lines of code using this library.
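To make the tag and attribute selection above concrete, here is a minimal sketch. The sample markup is made up for illustration, and it assumes the bs4 package is installed; treat it as one way to do it rather than the only one.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this illustration.
html = """
<div class="post"><p>First</p></div>
<div class="post featured"><p>Second</p></div>
<span id="note">Third</span>
"""

# html.parser ships with Python, so no extra parser is needed.
soup = BeautifulSoup(html, "html.parser")

# Select by tag name.
paragraphs = [p.get_text() for p in soup.find_all("p")]

# Select by CSS class (class_ avoids clashing with the Python keyword).
posts = soup.find_all("div", class_="post")

# Select with a CSS selector string.
featured = soup.select("div.featured p")

# Select by an arbitrary attribute.
note = soup.find(attrs={"id": "note"})

print(paragraphs, len(posts), featured[0].get_text(), note.get_text())
```

Note that class_="post" also matches the second div, because BeautifulSoup compares against each class in a multi-valued class attribute.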
Have fun!
Use BeautifulSoup.
It is very easy to do this: just use urllib to fetch the page, then use BeautifulSoup to parse out the information you need.
Here is an example:
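from urllib.request import urlopen  # Python 3; on Python 2 this was urllib2.urlopen
from bs4 import BeautifulSoup
# The URL needs a scheme such as http://, otherwise urlopen raises ValueError
response = urlopen('http://example.com')
# Name the parser explicitly; html.parser ships with Python
soup = BeautifulSoup(response, 'html.parser')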
You can then use BeautifulSoup to extract the information for a given tag like this:
soup.find_all('tag_name')
There are also a lot of other ways to extract data; this site will help: Web-Scraping with bs4
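Putting the pieces together for the original question (the HTML content inside given tags, starting from a URL), something like the sketch below should work. The function names inner_html and inner_html_from_url are just illustrative helpers, not part of BeautifulSoup, and the sample markup is invented. It assumes Python 3 and bs4.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def inner_html(markup, tag_name):
    """Return the inner HTML of every <tag_name> element in markup."""
    soup = BeautifulSoup(markup, "html.parser")
    # decode_contents() renders a tag's children without the tag itself.
    return [tag.decode_contents() for tag in soup.find_all(tag_name)]


def inner_html_from_url(url, tag_name):
    """Fetch url (scheme included, e.g. http://) and extract as above."""
    with urlopen(url) as response:
        return inner_html(response.read(), tag_name)


# Hypothetical sample markup, used here so the example runs offline.
sample = "<ul><li><b>one</b></li><li>two</li></ul>"
items = inner_html(sample, "li")
print(items)
```

For a live page you would call inner_html_from_url with the page URL and the tag name instead. If you only want the text rather than the markup, tag.get_text() can replace decode_contents().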
4 Comments
from bs4 import * should be from bs4 import BeautifulSoup. Also, you don't need to read the file handle before passing it into BeautifulSoup.