I'm working on a project that involves a Python program to inspect a web page's HTML. The program has to monitor a web page and, when the HTML changes, run a set of actions. My questions are: how do you extract just part of a web page, and how do you monitor a page's HTML and report almost instantly when a change is made? Thanks.

2 Answers

2

In the past I wrote my own parsers. Nowadays HTML means HTML5: more elements, more JavaScript, and a lot of sloppiness produced by developers and their editors, like

document.write('<SCR' + 'IPT

Some web frameworks, and some badly coded applications, also change the Last-Modified HTTP header on every request, even when the text a human reads on the page has not changed.
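Because that header can't be trusted, one option is to compare the content itself. The following is only a minimal sketch using the standard library; the helper names `fetch`, `fingerprint` and `has_changed` are made up for illustration, and hashing the whole body will still react to invisible changes such as rotating tokens or timestamps.

```python
import hashlib
import urllib.request


def fetch(url: str, timeout: float = 10.0) -> bytes:
    """Download the raw bytes of a page; the Last-Modified header is ignored."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fingerprint(body: bytes) -> str:
    """Return a SHA-256 hex digest of the response body."""
    return hashlib.sha256(body).hexdigest()


def has_changed(previous_digest: str, body: bytes) -> bool:
    """True when the body no longer matches the digest stored earlier."""
    return fingerprint(body) != previous_digest
```

A caller would store `fingerprint(fetch(url))` after the first request and pass it to `has_changed` on later ones.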

I suggest BeautifulSoup for the parsing; you will still have to choose carefully what to watch when deciding whether the web page has been modified.

Its intro:

BeautifulSoup is a Python package that parses broken HTML, just like lxml supports it based on the parser of libxml2. BeautifulSoup uses a different parsing approach. It is not a real HTML parser but uses regular expressions to dive through tag soup. It is therefore more forgiving in some cases and less good in others. It is not uncommon that lxml/libxml2 parses and fixes broken HTML better, but BeautifulSoup has superior support for encoding detection. It very much depends on the input which parser works better.
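To illustrate "choosing what to watch", here is a rough sketch that pulls one fragment out of the page and fingerprints only its visible text. It assumes Beautiful Soup 4 (the `bs4` package), which hands the actual parsing to an underlying parser such as the standard library's `html.parser`. The function names and the `#content` selector are placeholders, not anything from the page being monitored.

```python
import hashlib

from bs4 import BeautifulSoup


def watched_text(html: str, css_selector: str) -> str:
    """Return the whitespace-normalised text of the first matching element."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.select_one(css_selector)
    if node is None:
        # The selector matched nothing; treat that as an empty fragment.
        return ""
    # Collapse runs of whitespace so re-indented markup compares equal.
    return " ".join(node.get_text().split())


def digest(text: str) -> str:
    """Fingerprint the extracted text so it can be compared cheaply."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

Two downloads of the same page can then be compared with `digest(watched_text(html, "#content"))`; markup outside the selected element, such as injected scripts, no longer affects the result.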


1

Scrapy might be a good place to start. http://doc.scrapy.org/en/latest/intro/overview.html

Getting sections of a web page is easy: HTML is XML-like markup, so you can use Scrapy or BeautifulSoup to select just the parts you need.
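Neither library watches a page for you, so the monitoring part usually comes down to polling. The loop below is a hedged sketch of that idea using only the standard library; `poll_for_changes`, `fetch_fragment` and `on_change` are hypothetical names, and `fetch_fragment` stands for whatever download-and-extract step you build with Scrapy or BeautifulSoup. How "instant" the report is depends entirely on the polling interval.

```python
import hashlib
import time
from typing import Callable, Optional


def poll_for_changes(
    fetch_fragment: Callable[[], str],
    on_change: Callable[[str], None],
    interval_seconds: float = 5.0,
    max_polls: Optional[int] = None,
) -> None:
    """Call on_change(fragment) whenever the fetched fragment differs from the previous poll."""
    last_digest = None
    polls = 0
    while max_polls is None or polls < max_polls:
        fragment = fetch_fragment()
        current = hashlib.sha256(fragment.encode("utf-8")).hexdigest()
        # The first poll only records a baseline; later polls are compared to it.
        if last_digest is not None and current != last_digest:
            on_change(fragment)
        last_digest = current
        polls += 1
        # Skip the sleep after the final poll.
        if max_polls is None or polls < max_polls:
            time.sleep(interval_seconds)
```

A short interval reports changes sooner but puts more load on the server being watched, so pick the longest delay you can tolerate.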
