I have a Python 3 script that successfully parses HTML with HTMLParser on Windows. I also want to run the script on Linux, but it doesn't seem to work there.

I retrieve the HTML code with the following:

html = urllib.request.urlopen(url).read()
html_str = str(html)
parse = MyHTMLParser()
parse.feed(html_str)

The original output of html is the following:

b'\n \n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"\n
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n
    <html xmlns="http://www.w3.org/1999/xhtml">\n
        <head>\n

html is a bytes object, so I convert it to a string so that parse.feed doesn't complain. The problem is that the string I get after the conversion looks like this:

'b\'\\n \\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"\\n
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\\n
<html xmlns="http://www.w3.org/1999/xhtml">\\n
    <head>\\n

As you can see, the string contains several \\n sequences. Windows seems to ignore them, but on Linux they appear to be treated as escape sequences, and parsing the HTML fails because of them. I don't remember the exact error right now, but it was something like can't parse \\

I've tried using re to remove the extra \ with re.sub("\\","",html_str), but on Windows it doesn't seem to do anything, and on Linux I also get an error.

This is the error I get when calling re.sub on the HTML in Linux:

>>> re.sub("\\","",html_str)
Traceback (most recent call last):
  File "/usr/lib/python3.1/sre_parse.py", line 194, in __next
    c = self.string[self.index + 1]
IndexError: string index out of range

Any idea how I can remove the extra \ in html_str so I can parse it on Linux?

  • \\n is not an escape sequence on Linux. \\n is two characters: a backslash (escaped to \\ to make the output a valid Python bytes literal) and an n character. These characters have the same meaning on Windows and Linux. Could you please look up the exact error and traceback? Commented Apr 24, 2013 at 7:38
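To illustrate the comment above, here is a minimal sketch (the sample bytes are made up for the demonstration). It also shows why re.sub("\\", ...) fails: the Python literal "\\" is a single backslash, which is an incomplete escape when used as a regex pattern. On current Python versions this raises re.error; the Python 3.1 build in the question surfaced it as an IndexError instead.

```python
import re

# Sample data, assumed for illustration only.
raw = b'\n'

# str() on bytes without an encoding returns the repr of the bytes object.
as_repr = str(raw)            # the five characters  b ' \ n '

# "\\n" in Python source is two characters: a backslash and the letter n.
two_chars = len("\\n")

# A raw string gives the regex engine an escaped backslash, which is valid.
cleaned = re.sub(r"\\", "", as_repr)

# The non-raw literal "\\" is a lone backslash: an invalid regex pattern.
try:
    re.sub("\\", "", as_repr)
    pattern_failed = False
except re.error:
    pattern_failed = True

print(two_chars, cleaned, pattern_failed)
```

Stripping backslashes this way still leaves the b'...' wrapper and turns newlines into stray n characters, so it does not recover the original HTML; decoding the bytes is the proper fix.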

1 Answer

In Python 3 you can't convert bytes to str the way you're doing it:

html_str = str(html)

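This worked in Python 2 because bytes and str were the same type, but in Python 3 str(html) returns the printable representation of the bytes object (the b'...' literal, with newlines shown as a backslash followed by n). To decode the bytes, you either need to supply the encoding argument to str, or use: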

html_str = html.decode(encoding)

If you can't get the charset from the HTTP headers, you could either try to guess, or use chardet to detect the right encoding.
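As a rough sketch of the approach described above: read the charset from the response headers and fall back to a guess when none is declared. TitleParser, decode_body, fetch_html and the UTF-8 fallback are hypothetical names and choices made for this example; TitleParser only stands in for the question's MyHTMLParser.

```python
import urllib.request
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Hypothetical stand-in for MyHTMLParser: collects the <title> text."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def decode_body(raw, charset=None, fallback="utf-8"):
    """Decode bytes with the declared charset, or an assumed fallback."""
    return raw.decode(charset or fallback, errors="replace")


def fetch_html(url):
    """Download a page and return it as str (performs a network request)."""
    with urllib.request.urlopen(url) as response:
        raw = response.read()
        # Returns None when the Content-Type header declares no charset.
        charset = response.headers.get_content_charset()
    return decode_body(raw, charset)


# Offline demonstration with sample bytes instead of a real download.
sample = b'\n<html><head><title>caf\xc3\xa9</title></head></html>\n'
html_str = decode_body(sample, "utf-8")
parser = TitleParser()
parser.feed(html_str)
print(parser.title)
```

The decoded string contains real newline characters rather than backslash sequences, so it should behave the same on Windows and Linux. If the fallback guess is not good enough, chardet can be used in place of the fixed default.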

1 Comment

Note that str(html, 'ascii') is the same thing as html.decode('ascii').
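A quick check of this equivalence, using made-up sample bytes, next to what str(html) without an encoding produces:

```python
# Sample data, assumed for illustration only.
html = b'\n<html>\n'

via_str = str(html, 'ascii')      # decodes the bytes
via_decode = html.decode('ascii') # same result
as_repr = str(html)               # no encoding: repr of the bytes object

print(via_str == via_decode)
print(as_repr.startswith("b'"))
```

Both decoding calls raise UnicodeDecodeError if the page contains non-ASCII bytes, so 'ascii' is only safe when the content really is ASCII.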
