I've got a piece of Python 3 code that successfully parses HTML with HTMLParser on Windows. The problem is that I also want to run the script on Linux, and it doesn't seem to work there.
I retrieve the HTML code with the following:
html = urllib.request.urlopen(url).read()
html_str = str(html)
parse = MyHTMLParser()
parse.feed(html_str)
The original output of html is the following:
b'\n \n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"\n
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n
<html xmlns="http://www.w3.org/1999/xhtml">\n
<head>\n
html is a bytes object, so I convert it to a string so that parse.feed doesn't complain. The problem is that the string I get after the conversion looks something like this:
'b\'\\n \\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"\\n
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\\n
<html xmlns="http://www.w3.org/1999/xhtml">\\n
<head>\\n
As you can see, I've got several \\n sequences. Windows doesn't seem to care about them, but on Linux they appear to be treated as escape sequences, so parsing the HTML fails. I don't remember the exact error right now, but it was something like: can't parse \\
I've tried using re to remove the extra backslashes with re.sub("\\","",html_str), but on Windows it doesn't seem to do anything, and on Linux I get an error as well.
This is the error I get when trying to re.sub the html on Linux:
>>> re.sub("\\","",html_str)
Traceback (most recent call last):
File "/usr/lib/python3.1/sre_parse.py", line 194, in __next
c = self.string[self.index + 1]
IndexError: string index out of range
Any idea how I can remove the extra \ characters in html_str so I can parse it on Linux?
\\n is not an escape sequence on Linux. \\n is two characters: a backslash (escaped to \\ to make the output a valid Python bytes literal) and an n character. These characters have the same meaning on Windows and Linux. Could you please look up the exact error and traceback?
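To illustrate the point about the backslashes, here is a minimal sketch of what str() does to a bytes object compared with decode(). It is not a diagnosis of the Linux failure, since the exact error is still unknown. The sample bytes, the TagCollector class, and the "utf-8" encoding are assumptions made up for the example. The real page may use a different charset.

```python
import re
from html.parser import HTMLParser

# Hypothetical stand-in for the bytes returned by urllib.request.urlopen(url).read().
html = b'\n \n<!DOCTYPE html>\n<html>\n<head>\n'

# str() on bytes gives the repr of the object: the b'...' wrapper plus a literal
# backslash followed by "n" wherever the data had a newline.
as_repr = str(html)

# decode() gives the actual text. "utf-8" is an assumed encoding.
html_str = html.decode("utf-8")


class TagCollector(HTMLParser):
    """Hypothetical parser that records the start tags it sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed(html_str)

# The Python string "\\" holds one backslash, which is an incomplete regex
# pattern, so compiling it fails. A raw string r"\\" matches one backslash.
try:
    re.sub("\\", "", as_repr)
    pattern_error = None
except re.error as exc:
    pattern_error = exc

stripped = re.sub(r"\\", "", as_repr)
```

With decode() there are no extra backslashes to strip in the first place, so the re.sub step is only shown to explain why the pattern "\\" is rejected. Recent Python versions report this as re.error, whereas the traceback above shows Python 3.1 surfacing it as an IndexError.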