Parse HTML with HTMLParser in Python3

Question

I have a piece of Python 3 code that successfully parses HTML with HTMLParser on Windows. The problem is that I also want to run the script on Linux, where it doesn't seem to work.

I retrieve the HTML code with the following:

html = urllib.request.urlopen(url).read()
html_str = str(html)
parse = MyHTMLParser()
parse.feed(html_str)

The original output of html is the following:

b'\n \n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"\n
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n
    <html xmlns="http://www.w3.org/1999/xhtml">\n
        <head>\n

html is a bytes object, so I convert it to a string so that parse.feed doesn't complain. The problem is that the string I get after the conversion looks something like this:

'b\'\\n \\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"\\n
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\\n
<html xmlns="http://www.w3.org/1999/xhtml">\\n
    <head>\\n

As you can see, there are several \\n sequences. Windows doesn't seem to care about them, but on Linux they appear to be treated as escape sequences, so parsing the HTML fails. I don't remember the exact error right now, but it was something like "can't parse \\".

I've tried using re to remove the extra \ characters with re.sub("\\","",html_str), but on Windows it doesn't seem to do anything, and on Linux I get an error as well.

This is the error I get when trying to re.sub the html in Linux:

>>> re.sub("\\","",html_str)
Traceback (most recent call last):
  File "/usr/lib/python3.1/sre_parse.py", line 194, in __next
    c = self.string[self.index + 1]
IndexError: string index out of range

Any idea how I can remove the extra \ characters from html_str so I can parse it on Linux?

\\n is not an escape sequence on Linux. \\n is two characters: a backslash (escaped as \\ to make the output a valid Python bytes literal) and an n character. These characters have the same meaning on Windows and Linux. Could you please look up the exact error and traceback?
– Martijn Pieters, Commented Apr 24, 2013 at 7:38
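
To illustrate the point in this comment, here is a minimal sketch using a short, made-up bytes value (raw is an illustrative stand-in for what urlopen(url).read() returns). It suggests that calling str() on bytes without an encoding produces the text of the bytes literal itself, so each newline becomes a backslash followed by an n, and that this happens on any platform:

```python
# Illustrative stand-in for the bytes returned by urlopen(url).read()
raw = b'\n<html>\n'

# Without an encoding argument, str() returns the repr of the bytes object
as_repr = str(raw)

print(as_repr)              # the text of the literal, including b'...'
print('\\n' in as_repr)     # a literal backslash followed by 'n' is present
print('\n' in as_repr)      # no real newline character survives
```

The parser is then fed the characters b, ', \ and n as ordinary text, which is why the result is wrong regardless of the operating system.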

mata · Accepted Answer · 2013-04-24 14:58:37Z

In Python 3 you can't convert bytes to str the way you're doing it:

html_str = str(html)

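This worked in Python 2 because bytes and str were the same type, but in Python 3 str(html) without an encoding returns the representation of the bytes object (the b'...' literal with escaped backslashes), not the decoded text. To decode the bytes, either pass the encoding argument to str(), or use: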

html_str = html.decode(encoding)

If you can't get the charset from the HTTP headers, you could either try to guess it or use chardet to detect the most likely encoding.
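
As a rough sketch of the approach described above, assuming a current Python 3: the charset declared in the Content-Type header can be read with response.headers.get_content_charset(), which returns None when none is declared. The names TagCollector, decode_body and fetch_and_parse are hypothetical helpers introduced for illustration (TagCollector stands in for the question's MyHTMLParser), and the utf-8 fallback with errors="replace" is an assumption, not something the original answer prescribes.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TagCollector(HTMLParser):
    """Hypothetical stand-in for MyHTMLParser: records start tag names."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def decode_body(raw, charset=None, fallback="utf-8"):
    """Decode response bytes, using the fallback when no charset is known."""
    # errors="replace" is an assumption: undecodable bytes become U+FFFD
    # instead of raising UnicodeDecodeError.
    return raw.decode(charset or fallback, errors="replace")


def fetch_and_parse(url):
    """Fetch url, decode the body using the header charset, and parse it."""
    with urlopen(url) as response:
        raw = response.read()
        # None if the Content-Type header declares no charset
        charset = response.headers.get_content_charset()
    parser = TagCollector()
    parser.feed(decode_body(raw, charset))
    return parser.tags


# Offline check with literal bytes, so no network access is needed
parser = TagCollector()
parser.feed(decode_body(b'\n<html xmlns="http://www.w3.org/1999/xhtml">\n<head>\n'))
print(parser.tags)
```

If neither the headers nor a guess gives a usable charset, the fallback argument is where a detected encoding (for example one reported by chardet) could be passed in.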

edited Apr 24, 2013 at 14:58

answered Apr 24, 2013 at 14:52

mata

1 Comment

Martijn Pieters Over a year ago

Note that str(html, 'ascii') is the same thing as html.decode('ascii').
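
A quick way to check this equivalence, sketched with a made-up ASCII-only bytes value (html here is illustrative, not the page from the question):

```python
# Illustrative ASCII-only bytes value
html = b'<p>hello</p>\n'

via_str = str(html, 'ascii')        # str() with an explicit encoding decodes
via_decode = html.decode('ascii')   # the equivalent method call

print(via_str == via_decode)        # both produce the same decoded text
print(str(html) == via_decode)      # without an encoding, str() gives the repr instead
```

Both forms raise UnicodeDecodeError if the bytes contain anything outside the ASCII range, so a real page usually needs the encoding it was actually served with.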
