Html Parsing vs. Regex

Question

I have a fixed, well-structured HTML source. The incoming data is clean and small: it contains just a short list of divs. I know that an HTML parser is the usual recommendation for parsing HTML, but this looks like a special case and I am not sure which approach I should use. The conditions are listed below:

Data is clean and well structured
Data is small
Performance matters: the application must be able to fetch as much data as possible
The application will write the data to a MongoDB database
The implementation language will be Scala or Python

Any opinion is valuable, so what should I do?

Community · Accepted Answer · 2017-05-23 12:31:29Z

7

I would still stick to using an HTML Parser, because, at least, there is a specific data format and a specialized tool that understands the format.

If performance matters here, there is the blazingly fast lxml package. For HTML, use its lxml.html module.
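As a rough illustration of the lxml.html route, here is a minimal sketch. The question never shows its markup, so the sample HTML, the "item" class, the data-id attribute, and the extract_items helper are all assumptions made up for this example.

```python
import lxml.html

# Hypothetical sample: the real markup is not shown in the question,
# so this "list of divs" structure is assumed for illustration only.
SAMPLE_HTML = """
<html><body>
  <div class="item" data-id="1">First</div>
  <div class="item" data-id="2">Second</div>
  <div class="other">Ignored</div>
</body></html>
"""


def extract_items(html):
    """Return one dict per <div class="item"> found in the document."""
    root = lxml.html.fromstring(html)
    return [
        {"id": div.get("data-id"), "text": div.text_content().strip()}
        for div in root.xpath('//div[@class="item"]')
    ]


items = extract_items(SAMPLE_HTML)
print(items)
```

Each result is a plain dict, so a list like this could probably be handed to a MongoDB driver (for example pymongo's insert_many) without further conversion.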

You can also use the awesome BeautifulSoup package and let it use the lxml parser under the hood. Besides, if the data you need is in a specific part of the HTML document, you can gain performance by asking BeautifulSoup to parse only that part; see "Parsing only part of a document" in the BeautifulSoup documentation.
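A minimal sketch of that partial-parsing idea, using BeautifulSoup's SoupStrainer together with the lxml parser. The sample markup and the "item" class are assumptions, since the question does not show the real data.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical sample: only the "item" divs are of interest here.
SAMPLE_HTML = """
<html><body>
  <div class="header">Site header</div>
  <div class="item">First</div>
  <div class="item">Second</div>
  <div class="footer">Site footer</div>
</body></html>
"""

# Build a tree only for <div class="item"> elements; the rest of the
# document is skipped while the tree is being constructed.
only_items = SoupStrainer("div", attrs={"class": "item"})
soup = BeautifulSoup(SAMPLE_HTML, "lxml", parse_only=only_items)

texts = [div.get_text(strip=True) for div in soup.find_all("div")]
print(texts)
```

Whether this actually saves time on a small document is something to measure rather than assume; the strainer mainly helps when the interesting part is a small fraction of a larger page.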

And, to follow the tradition for HTML+regex threads, here is the reference to the famous topic covering the reasons why you should not use regex for parsing HTML:

RegEx match open tags except XHTML self-contained tags

edited May 23, 2017 at 12:31

CommunityBot

answered Oct 11, 2014 at 20:15

alecxe

2 Comments

Hüseyin Zengin Over a year ago

I know that I shouldn't use regex for HTML parsing. I know what regex is and what it turns into when implemented, and yes, I took an automata course too. Most of the reasons concern HTML's unstable structure and large amounts of data, which does not apply to our case: we have well-structured, small data to process. So I appreciate your answer, but I don't think this is what we are looking for.

alecxe Over a year ago

@HüseyinZengin thanks. It's difficult to say without seeing what kind of data you have, how much of it there is, and what you need to extract from it. I guess your best bet would be to measure the performance yourself. For example, implement it with both lxml and a regex-only approach and benchmark them.
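One way such a benchmark could look, as a sketch only. The sample document, the ITEM_RE pattern, and both extractor functions are assumptions; a real comparison should run against the actual input and the actual fields being extracted.

```python
import re
import timeit

import lxml.html

# Hypothetical input: 20 small divs standing in for the real data.
SAMPLE_HTML = "<html><body>" + "".join(
    f'<div class="item">value {i}</div>' for i in range(20)
) + "</body></html>"

# This pattern only works while the markup stays exactly in this shape.
ITEM_RE = re.compile(r'<div class="item">(.*?)</div>')


def with_regex(html):
    """Extract the item texts with a regular expression."""
    return ITEM_RE.findall(html)


def with_lxml(html):
    """Extract the item texts by parsing the document with lxml."""
    root = lxml.html.fromstring(html)
    return root.xpath('//div[@class="item"]/text()')


# Make sure both approaches agree before comparing their speed.
assert with_regex(SAMPLE_HTML) == with_lxml(SAMPLE_HTML)

for fn in (with_regex, with_lxml):
    seconds = timeit.timeit(lambda: fn(SAMPLE_HTML), number=1000)
    print(f"{fn.__name__}: {seconds:.4f} s for 1000 runs")
```

The timings depend on the machine and on the shape of the input, so the numbers are only meaningful relative to each other and for data that resembles the production feed.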
