0

Possible Duplicate:
Decode HTML entities in Python string?

I have a malformed string in Python:

Muhammad Ali&#39;s fight with Larry Holmes

where &#39; is an apostrophe.

Firstly, what representation is this: &#39;? Secondly, how can I parse the string in Python so that it replaces &#39; with '?

2
  • 3
    This looks like an HTML entity for the character with code 39 (which would make it easy to parse and reassemble using chr()). However, there is also a large number of symbolic HTML entities like &amp; (&) which you'd probably also want to consider. Commented Nov 13, 2011 at 20:17
  • @All: I did not know how to search for an answer because I did not know what to search for. Commented Nov 13, 2011 at 20:20

2 Answers

5

The Python Standard Library's HTMLParser module (Python 2) can decode HTML entities in strings, both named and numeric.

>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> s = h.unescape('&copy; 2010')
>>> s
u'\xa9 2010'
>>> print s
© 2010
>>> s = h.unescape('&#169; 2010')
>>> s
u'\xa9 2010'

A range of solutions are described here: http://fredericiana.com/2010/10/08/decoding-html-entities-to-text-in-python/
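
For readers on Python 3, where the HTMLParser module was renamed to html.parser, the same decoding is available as the standard-library function html.unescape (added in Python 3.4). A minimal sketch, assuming Python 3.4 or later; the variable names are only illustrative:

```python
import html

# The string from the question, with the entity still encoded.
raw = "Muhammad Ali&#39;s fight with Larry Holmes"

# html.unescape handles numeric (&#39;, &#169;) and named (&copy;) entities.
decoded = html.unescape(raw)
print(decoded)

copyright_named = html.unescape("&copy; 2010")
copyright_numeric = html.unescape("&#169; 2010")
print(copyright_named == copyright_numeric)
```

Both forms of the copyright entity decode to the same character, which matches the behaviour of h.unescape shown in the Python 2 session.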

1

The &#CHAR-CODE; form is a syntax for special characters in HTML (maybe elsewhere, but I'm not sure). There may be a more complete way to do this, but you could simply replace it with:

mystring = "Muhammad Ali&#39;s fight with Larry Holmes"
print(mystring.replace("&#39;", "'"))

Yields:

Muhammad Ali's fight with Larry Holmes
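
The single replace only covers the apostrophe entity. A slightly more general sketch, following the chr() idea from the comments, decodes any decimal numeric entity with a regular expression. The function name decode_numeric_entities is hypothetical, Python 3 is assumed, and named entities such as &amp;amp; or hexadecimal ones are deliberately left untouched:

```python
import re

# Matches decimal numeric entities such as &#39; or &#169;.
_NUMERIC_ENTITY = re.compile(r"&#(\d+);")


def decode_numeric_entities(text):
    """Replace each decimal entity with the character it encodes.

    Assumption: every code is a valid Unicode code point; chr() raises
    ValueError for values above 0x10FFFF.
    """
    return _NUMERIC_ENTITY.sub(lambda m: chr(int(m.group(1))), text)


print(decode_numeric_entities("Muhammad Ali&#39;s fight with Larry Holmes"))
```

For anything beyond a quick cleanup, a full entity decoder from the standard library is likely the safer choice, since it also covers named and hexadecimal entities.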
