Extracting parts of HTML from a website using Python

Question

I'm working on a project that involves a Python program that inspects a web page's HTML. The program has to monitor a page and, when its HTML changes, carry out a set of actions. My questions are: how do you extract just part of a web page, and how do you monitor a page's HTML and report a change almost instantly? Thanks.

Massimo · Accepted Answer · 2011-11-26 21:54:36Z

In the past I wrote my own parsers. Nowadays HTML is HTML5: more elements, more JavaScript, and a lot of sloppy markup produced by developers and their editors, like

document.write('<SCR' + 'IPT

Also, some web frameworks (or badly written applications) change the Last-Modified HTTP header on every request, even when the text a human reads on the page hasn't changed.
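Since the Last-Modified header may not be trustworthy, one option is to compare a hash of the text you actually care about. This is only a minimal sketch using the standard library; the function name `fingerprint` and the whitespace normalization are my own assumptions, not something a particular library prescribes.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Return a SHA-256 hex digest of text with whitespace runs collapsed.

    `fingerprint` is a hypothetical helper name used for illustration.
    """
    # Collapse all whitespace so that re-indentation alone is not a "change".
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


before = fingerprint("Price:   42 EUR\n")
reformatted = fingerprint("Price: 42 EUR")
edited = fingerprint("Price: 43 EUR")

print(before == reformatted)  # whitespace-only difference
print(before == edited)       # real content difference
```

Storing only the digest between checks keeps the comparison cheap; what you feed into it (the whole page, or just one extracted fragment) decides how many false alarms you get.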

I suggest BeautifulSoup for the parsing; you will still have to choose carefully what to watch in order to decide whether the page has been modified.

Its intro:

BeautifulSoup is a Python package that parses broken HTML, just like lxml supports it based on the parser of libxml2. BeautifulSoup uses a different parsing approach. It is not a real HTML parser but uses regular expressions to dive through tag soup. It is therefore more forgiving in some cases and less good in others. It is not uncommon that lxml/libxml2 parses and fixes broken HTML better, but BeautifulSoup has superior support for encoding detection. It very much depends on the input which parser works better.
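To make "extract just part of a page" concrete, here is a rough sketch that pulls the text out of one element by its id. To stay dependency-free it uses the standard library's `html.parser` instead of BeautifulSoup, so it assumes reasonably well-formed markup; on real tag soup BeautifulSoup or lxml will be more forgiving. The class name `FragmentText`, the helper `extract_text`, and the sample HTML are all made up for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class FragmentText(HTMLParser):
    """Collect the text inside the first element whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0       # 0 means we are outside the target element
        self.done = False    # set once the target element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS or self.done:
            return
        if self.depth:
            self.depth += 1                      # nested element inside target
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1                       # entering the target element

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True                     # target element fully read

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_text(html, target_id):
    """Return the whitespace-normalized text of the element with that id."""
    parser = FragmentText(target_id)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


sample = ('<div id="nav">Menu</div>'
          '<div id="price"><b>42</b> EUR <img src="x.png"></div>'
          '<p>footer</p>')
print(extract_text(sample, "price"))  # 42 EUR
```

Watching only a fragment like this, rather than the whole document, avoids being triggered by rotating ads, timestamps, or session tokens elsewhere on the page.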

dm03514 · Accepted Answer · 2011-11-26 21:52:36Z

Scrapy might be a good place to start. http://doc.scrapy.org/en/latest/intro/overview.html

Extracting sections of a web page is easy: HTML is tree-structured markup much like XML, so you can use Scrapy or BeautifulSoup to select the part you need.
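For the monitoring half of the question, an ordinary web page generally won't notify you of changes, so "almost instantly" usually comes down to polling at a short interval. The loop below is only a sketch under that assumption; `watch`, `fetch`, `on_change`, and `max_polls` are hypothetical names, and `fetch` stands in for whatever downloads the page and extracts the part you care about.

```python
import time


def watch(fetch, on_change, interval=5.0, max_polls=None):
    """Call on_change(old, new) whenever fetch() returns a different value.

    fetch      -- callable returning the current snapshot (e.g. extracted text)
    on_change  -- callable invoked with the previous and the new snapshot
    interval   -- seconds to sleep between polls
    max_polls  -- stop after this many polls; None means run forever
    """
    last = fetch()
    polls = 0
    while max_polls is None or polls < max_polls:
        time.sleep(interval)
        current = fetch()
        if current != last:
            on_change(last, current)
            last = current
        polls += 1


# Demonstration with canned snapshots instead of real HTTP requests.
snapshots = iter(["v1", "v1", "v2", "v2", "v3"])
events = []
watch(lambda: next(snapshots),
      lambda old, new: events.append((old, new)),
      interval=0, max_polls=4)
print(events)  # [('v1', 'v2'), ('v2', 'v3')]
```

A shorter interval reports changes sooner but puts more load on the site, so pick the longest delay you can tolerate.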
