Encode Decode of strings python

Question

I have a list of html pages which may contain certain encoded characters. Some examples are as below -

<a href="mailto:lad%20at%20maestro%20dot%20com">
<em>ada&#x40;graphics.maestro.com</em>
<em>mel&#x40;graphics.maestro.com</em>

I would like to decode (escape, I'm unsure of the current terminology) these strings to -

 <a href="mailto:lad at maestro dot com">
<em>[email protected]</em>
<em>[email protected]</em>

Note, the HTML pages are in a string format. Also, I DO NOT want to use any external library like a BeautifulSoup or lxml, only native python libraries are ok.

Edit -

The below solution isn't perfect. HTML Parser unescaping with urllib2 throws a

UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 31: ordinal not in range(128)

error in some cases.

mechanical_meat · Accepted Answer · 2012-03-25 01:57:21Z

8

You need to unescape HTML entities, and URL-unquote.
The standard library has HTMLParser and urllib2 to help with those tasks.

import HTMLParser, urllib2

markup = '''<a href="mailto:lad%20at%20maestro%20dot%20com">
<em>ada&#x40;graphics.maestro.com</em>
<em>mel&#x40;graphics.maestro.com</em>'''

result = HTMLParser.HTMLParser().unescape(urllib2.unquote(markup))
for line in result.split("\n"): 
    print(line)

Result:

<a href="mailto:lad at maestro dot com">
<em>[email protected]</em>
<em>[email protected]</em>

Edit:
If your pages can contain non-ASCII characters, you'll need to take care to decode on input and encode on output.
The sample file you uploaded has charset set to cp-1252, so let's try decoding from that to Unicode:

import codecs 
with codecs.open(filename, encoding="cp1252") as fin:
    decoded = fin.read()
result = HTMLParser.HTMLParser().unescape(urllib2.unquote(decoded))
with codecs.open('/output/file.html', 'w', encoding='cp1252') as fou:
    fou.write(result)

Edit2:
If you don't care about the non-ASCII characters you can simplify a bit:

with open(filename) as fin:
    decoded = fin.read().decode('ascii','ignore')
...

edited Mar 25, 2012 at 1:57

answered Mar 25, 2012 at 0:58

mechanical_meat

170k25 gold badges238 silver badges231 bronze badges

Sign up to request clarification or add additional context in comments.

15 Comments

Dexter Over a year ago

While this solution looks good is does throw a UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 31: ordinal not in range(128) error at some places.

Niall Byrne Over a year ago

Try using .encode('ascii') on the markup string before feeding it in.

mechanical_meat Over a year ago

@mcenley: if you post more detail about how you're getting your data we can provide encoding assistance.

Dexter Over a year ago

@bernie I have a list of html pages downloaded. How should I send them to you?

mechanical_meat Over a year ago

No no, I believe you. What we'd need is the encoding used for those pages, and how you're reading them. The principle is decode (to Unicode) on input, and encode on output.

|

Collectives™ on Stack Overflow

Encode Decode of strings python

1 Answer 1

15 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

15 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related