I have a fixed, well-structured HTML source. The incoming data is clean and small: it just contains a short list of divs. I know the usual advice is to use an HTML parser for parsing HTML, but this looks like a special case and I am not sure which approach I should use. The conditions are listed below:

  • Data is clean and well structured
  • Data is small
  • Performance matters: the application must be able to fetch as much data as possible
  • The application will write the data to a MongoDB database
  • The implementation language will be Scala or Python

Any opinion is valuable, so what should I do?

1 Answer

I would still stick to using an HTML Parser, because, at least, there is a specific data format and a specialized tool that understands the format.

If performance matters here, there is the very fast lxml package. For HTML, use its lxml.html module.
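As a minimal sketch of the lxml.html approach, assuming the input looks like the small list of divs described in the question: the sample markup, the `item` class name, and the `extract_items` helper are all hypothetical and only illustrate the idea.

```python
import lxml.html

# Hypothetical sample input: a small, fixed list of divs,
# similar to what the question describes.
SAMPLE_HTML = """
<html><body>
  <div id="items">
    <div class="item">first</div>
    <div class="item">second</div>
    <div class="item">third</div>
  </div>
</body></html>
"""


def extract_items(html):
    """Return the stripped text of every div whose class is exactly "item"."""
    root = lxml.html.fromstring(html)
    # The XPath query is evaluated by libxml2, so the matching happens in C.
    return [div.text_content().strip() for div in root.xpath('//div[@class="item"]')]


if __name__ == "__main__":
    print(extract_items(SAMPLE_HTML))
```

Because the structure is fixed, a single XPath expression is probably all the extraction logic that is needed; the resulting list can then be turned into documents for MongoDB.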

You can also use the BeautifulSoup package and let it use the lxml parser under the hood. Besides, if the data you need is in a specific part of the HTML document, you can gain performance by asking BeautifulSoup to parse only that part; see "Parsing only part of a document" in the BeautifulSoup documentation.
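A rough sketch of the partial-parsing idea, using BeautifulSoup's `SoupStrainer` together with the `parse_only` argument and the lxml parser. The sample markup and the `item` class name are assumptions made for illustration, not something taken from the question.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical sample input: only the "item" divs are of interest,
# the rest of the page is noise.
SAMPLE_HTML = """
<html><body>
  <div class="ad">skip me</div>
  <div id="items">
    <div class="item">first</div>
    <div class="item">second</div>
    <div class="item">third</div>
  </div>
</body></html>
"""


def extract_items(html):
    """Build a tree from the "item" divs only and return their text."""
    # The strainer tells BeautifulSoup which tags to keep while parsing;
    # everything else is discarded instead of being turned into objects.
    only_items = SoupStrainer("div", attrs={"class": "item"})
    soup = BeautifulSoup(html, "lxml", parse_only=only_items)
    return [div.get_text(strip=True) for div in soup.find_all("div")]


if __name__ == "__main__":
    print(extract_items(SAMPLE_HTML))
```

The gain comes from building fewer Python objects, so it is likely to be more noticeable on large pages than on the small input described in the question.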

And, to follow the tradition for HTML+regex threads, here is the reference to the famous topic covering the reasons why you should not use regex for parsing HTML:

2 Comments

I know that I shouldn't use regex for HTML parsing. I know what a regex is and what it turns into when implemented (yes, I took an automata course too). Most of the reasons concern unstable HTML structure and large amounts of data, which do not apply in our case: we have well-structured, small data to process. So I appreciate your answer, but I don't think this is what we are looking for.
@HüseyinZengin thanks. It's difficult to say without seeing what kind of data you have, how much of it there is, and what you need to extract from it. I guess your best bet would be to measure the performance yourself. For example, implement it with both lxml and a regex-only approach and benchmark them.
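A benchmark along those lines could be sketched as follows. The sample markup, the regular expression, and the helper names are all hypothetical; the regex in particular only works because the input is assumed to be fixed and free of nested or reordered attributes.

```python
import re
import timeit

import lxml.html

# Hypothetical input: a fixed, small list of divs.
SAMPLE_HTML = (
    '<div id="items">'
    + "".join(f'<div class="item">value {i}</div>' for i in range(20))
    + "</div>"
)

# Assumes every item is written exactly as <div class="item">...</div>.
ITEM_RE = re.compile(r'<div class="item">(.*?)</div>')


def extract_with_regex(html):
    """Regex-only extraction; breaks as soon as the markup changes."""
    return ITEM_RE.findall(html)


def extract_with_lxml(html):
    """Parser-based extraction using lxml.html and XPath."""
    root = lxml.html.fromstring(html)
    return [div.text for div in root.xpath('//div[@class="item"]')]


def benchmark(func, html, number=10_000):
    """Return the best total time in seconds for `number` calls, over 3 runs."""
    return min(timeit.repeat(lambda: func(html), number=number, repeat=3))


if __name__ == "__main__":
    for func in (extract_with_regex, extract_with_lxml):
        print(f"{func.__name__}: {benchmark(func, SAMPLE_HTML):.3f} s")
```

The numbers depend heavily on the real data and machine, so it is worth running this against an actual sample of the source rather than a synthetic one.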
