encoding/decoding unicode and utf-8 : Python [duplicate]

Question

I have some HTML text: If I&#039;m reading lots of articles

I am trying to replace &#039; and other such special characters with the Unicode characters they stand for, such as '. I tried

rawtxt.encode('utf-8').encode('ascii','ignore')

but it fails:

Error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2

It looks like this is not really the code that produces the error, because the error comes from trying to decode the string as ASCII. Where does rawtxt come from?
– Sarien, Commented May 16, 2013 at 12:01
@Sarien: it is the code that produces the error. You can get a decode error in a call to encode. See: chat.stackoverflow.com/rooms/10/conversation/…
– R. Martinho Fernandes, Commented May 16, 2013 at 13:04
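The point made in the comments can be sketched roughly as follows. In Python 2, calling encode on a byte string first decodes it implicitly as ASCII, which is where the decode error comes from. The snippet below reproduces that hidden step explicitly so it also runs on Python 3. It assumes rawtxt was a UTF-8 byte string containing a curly apostrophe (U+2019, whose first byte is 0xe2); the sample text and the names raw_bytes, message and ascii_only are illustrative only.

```python
# Assumption: rawtxt held UTF-8 bytes that include U+2019 (bytes e2 80 99).
raw_bytes = u"If I\u2019m reading lots of articles".encode("utf-8")

try:
    # Python 2's str.encode() performs this ASCII decode implicitly first.
    raw_bytes.decode("ascii")
except UnicodeDecodeError as exc:
    message = str(exc)
    print(message)

# Decoding with the actual encoding first, then encoding, does not raise.
# The 'ignore' handler silently drops the non-ASCII apostrophe.
ascii_only = raw_bytes.decode("utf-8").encode("ascii", "ignore")
print(ascii_only)
```

Note that this only explains the traceback; it does nothing about HTML entities such as &#039;, which are plain ASCII already.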

likeitlikeit · Accepted Answer · 2013-05-16 11:54:56Z

3

You're having problems with HTML entities, not unicode or UTF-8. Try this:

# Python 2: HTMLParser.unescape turns HTML entities into the characters they represent
import HTMLParser
h = HTMLParser.HTMLParser()
s = h.unescape('If I&#039;m reading lots of articles')
print s

This prints If I'm reading lots of articles.
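The snippet above is Python 2. On Python 3.4 and later, the standard library provides html.unescape for the same job, so a rough equivalent might look like this (the variable names raw_html and text are illustrative):

```python
import html

# The same sample string as in the answer, with the apostrophe as an HTML entity.
raw_html = "If I&#039;m reading lots of articles"

# html.unescape converts named and numeric character references to characters.
text = html.unescape(raw_html)
print(text)
```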


1 Comment

thanks for saving loads of time