How to decode a string representation of UTF-8 bytes with Python?

Question

I have a unicode string like this:

\xE5\xB1\xB1\xE4\xB8\x9C \xE6\x97\xA5\xE7\x85\xA7

I know it is the string representation of bytes encoded as UTF-8.

Note that the string \xE5\xB1\xB1\xE4\xB8\x9C \xE6\x97\xA5\xE7\x85\xA7 itself is of type <type 'unicode'>.

How do I decode it to the real string 山东日照?

Martijn Pieters · Accepted Answer · 2016-08-19 10:14:24Z


If what you printed is the repr() output of your unicode string, then you appear to have Mojibake: bytes decoded using the wrong encoding.

First encode back to bytes, then decode using the right codec. This may be as simple as encoding as Latin-1:

unicode_string.encode('latin1').decode('utf8')

This depends on how the incorrect decoding was applied, however. If a Windows codepage (like CP1252) was used, you can end up with Unicode data that cannot be encoded back to CP1252, because UTF-8 bytes that have no CP1252 mapping were force-decoded anyway.

The best way to repair such mistakes is to use the ftfy library, which knows how to deal with force-decoded Mojibake text for a variety of codecs.
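As a rough illustration of the CP1252 caveat (not a replacement for ftfy), here is a minimal Python 3 sketch. It assumes the broken text came from decoding UTF-8 bytes as CP1252, with the bytes CP1252 leaves undefined passed through as the matching C1 control characters. The helper name `reencode_forced_cp1252` is hypothetical.

```python
def reencode_forced_cp1252(text):
    """Turn force-decoded CP1252 Mojibake back into the original bytes.

    Assumption: bytes with no CP1252 mapping (such as 0x9D) were passed
    through as the code points U+0080..U+009F during the bad decode.
    """
    raw = bytearray()
    for ch in text:
        try:
            raw += ch.encode('cp1252')
        except UnicodeEncodeError:
            # Not a CP1252 character: fall back to the Latin-1 byte value.
            # This still raises for anything above U+00FF.
            raw += ch.encode('latin1')
    return bytes(raw)


# UTF-8 bytes E4 B8 9C E6 9D A5, force-decoded as CP1252:
# 0x9C became U+0153 and the undefined 0x9D became U+009D.
mojibake = '\xe4\xb8\u0153\xe6\x9d\xa5'

repaired = reencode_forced_cp1252(mojibake).decode('utf8')
print(repaired)
```

Neither plain 'latin1' nor plain 'cp1252' can encode this sample on its own, which is the situation the caveat describes.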

For your small sample, Latin-1 appears to work just fine:

>>> unicode_string = u'\xE5\xB1\xB1\xE4\xB8\x9C \xE6\x97\xA5\xE7\x85\xA7'
>>> print unicode_string.encode('latin1').decode('utf8')
山东 日照
>>> import ftfy
>>> print ftfy.fix_text(unicode_string)
山东 日照
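The same round trip in Python 3, as a minimal sketch (the examples above are Python 2; the variable names here are illustrative). It also shows why Latin-1 is a convenient choice: it maps every byte value from 0x00 to 0xFF to the code point with the same number, so encoding with it recovers exactly the bytes that were mis-decoded.

```python
# In Python 3 every str is Unicode, so no u'' prefix is needed.
mojibake = '\xE5\xB1\xB1\xE4\xB8\x9C \xE6\x97\xA5\xE7\x85\xA7'

# Latin-1 maps code points U+0000..U+00FF one-to-one onto bytes 0x00..0xFF.
raw = mojibake.encode('latin1')
assert raw == b'\xe5\xb1\xb1\xe4\xb8\x9c \xe6\x97\xa5\xe7\x85\xa7'

# Now decode the recovered bytes with the codec they were written in.
fixed = raw.decode('utf8')
print(fixed)  # 山东 日照
```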

If you have the literal characters \ and x followed by two hex digits, you have another layer of encoding, where each byte was replaced by 4 characters. You'd have to 'decode' those to actual bytes first, by asking Python to interpret the escapes with the string_escape codec:

>>> unicode_string = ur'\xE5\xB1\xB1\xE4\xB8\x9C \xE6\x97\xA5\xE7\x85\xA7'
>>> unicode_string
u'\\xE5\\xB1\\xB1\\xE4\\xB8\\x9C \\xE6\\x97\\xA5\\xE7\\x85\\xA7'
>>> print unicode_string.decode('string_escape').decode('utf8')
山东 日照

'string_escape' is a Python 2 only codec that produces a bytestring, so it is safe to decode that as UTF-8 afterwards.
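Python 3 has no string_escape codec. One possible equivalent, sketched here under the assumption that the text is pure ASCII and contains only \xNN escapes, goes through unicode_escape and Latin-1. The function name `unescape_utf8` is made up for this example.

```python
def unescape_utf8(text):
    """Interpret literal \\xNN escapes in text as UTF-8 bytes and decode them."""
    # ASCII-encode first: unicode_escape decodes bytes, not str.
    unescaped = text.encode('ascii').decode('unicode_escape')
    # Each \xNN escape is now one code point in U+0000..U+00FF;
    # Latin-1 turns those back into the byte values they stand for.
    return unescaped.encode('latin1').decode('utf8')


sample = r'\xE5\xB1\xB1\xE4\xB8\x9C \xE6\x97\xA5\xE7\x85\xA7'
print(unescape_utf8(sample))  # 山东 日照
```

Note that unicode_escape also interprets other escapes such as \n and \uNNNN, so this is only appropriate when the input is known to contain just \xNN sequences.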

edited Aug 19, 2016 at 10:14

answered Aug 19, 2016 at 9:48

Martijn Pieters


6 Comments

armnotstrong Over a year ago

Thanks, Martijn. What if printing the dict that contains the string shows this:

{u'qualifier': u'name', u'timestamp': u'1462275769186', u'value': u'\\xE5\\x8E\\x9F\\xE6\\x9D\\xA5\\xE6\\x98\\xAFolivia\\xE5\\x95\\x8A', u'columnFamily': u'interActive', u'type': u'Put', u'row': u'1771897264'}

and print m.get('value').encode('latin1').decode('utf8') still prints \xE5\x8E\x9F...

Martijn Pieters Over a year ago

@armnotstrong: You don't have bytes. You have literal backslashes, x characters and hex digits. You have a different problem here. What even produces this?

Martijn Pieters Over a year ago

@armnotstrong: updated; from your question it was not clear that you had literal text. In future, please show the repr() output of such a string (which is what the dict() representation you showed in your comment uses for each key and value)

armnotstrong Over a year ago

The original data was retrieved from HBase, which could only store ASCII strings as values, and org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter gives me this.

Martijn Pieters Over a year ago

I'm not familiar with spark, sorry. No idea if this is a spark issue or an issue with how the data was stored in the first place.
