1

I have some HTML text: If I&#39;m reading lots of articles

I am trying to replace &#39; and other such HTML-escaped special characters with the Unicode characters they stand for, such as '. I tried

rawtxt.encode('utf-8').encode('ascii','ignore') 

but it fails with:

Error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2

2
  • It looks like this is not really the code that produces the error because the error comes from trying to decode the string as ascii. Where does rawtxt come from? Commented May 16, 2013 at 12:01
  • @Sarien: it is the code that produces the error. You can get a decode error in a call to encode. See: chat.stackoverflow.com/rooms/10/conversation/… Commented May 16, 2013 at 13:04
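
The point made in the comment above can be illustrated with a short sketch. In Python 2, calling .encode() on a byte string first decodes it implicitly as ASCII, and that hidden step is where the UnicodeDecodeError comes from. Python 3 no longer does this implicitly, so the sketch below (written for Python 3) performs the decode explicitly. The sample text and variable names are assumptions for illustration, not the asker's actual data.

```python
# Bytes as they might arrive from a web page: the UTF-8 encoding of a
# right single quote is 0xe2 0x80 0x99. The sample text is an assumption.
raw_bytes = "If I\u2019m reading".encode("utf-8")

# Python 2 runs this decode implicitly when .encode() is called on a byte string.
try:
    raw_bytes.decode("ascii")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)
print(error_message)

# Decoding with the real encoding first avoids the error;
# 'ignore' then silently drops any character that is not ASCII.
ascii_only = raw_bytes.decode("utf-8").encode("ascii", "ignore")
print(ascii_only)
```

Note that 'ignore' discards the quote character entirely rather than converting it, which is usually not what is wanted for readable text.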

1 Answer

3

Your problem is with HTML entities, not with Unicode or UTF-8. Try this:

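import HTMLParser  # Python 2 standard library module

h = HTMLParser.HTMLParser()
# unescape() converts HTML entities such as &#39; into the characters they represent
s = h.unescape('If I&#39;m reading lots of articles')
print s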

This prints If I'm reading lots of articles.
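
The code above targets Python 2. As a minimal sketch of the same idea on Python 3.4 or later, the standard library function html.unescape does the equivalent conversion. The sample string and variable names here are assumptions for illustration.

```python
import html

# Sample input is an assumption: &#39; is the numeric entity for an apostrophe.
raw_text = "If I&#39;m reading lots of articles"

# html.unescape converts named and numeric HTML entities to Unicode characters.
clean_text = html.unescape(raw_text)
print(clean_text)
```

No parser instance is needed in this version, since html.unescape is a plain function.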


1 Comment

thanks for saving loads of time
