I have a unicode string like this:

\xE5\xB1\xB1\xE4\xB8\x9C \xE6\x97\xA5\xE7\x85\xA7

I know it is the string representation of bytes that were encoded as UTF-8.

Note that the string \xE5\xB1\xB1\xE4\xB8\x9C \xE6\x97\xA5\xE7\x85\xA7 itself is <type 'unicode'>

How do I decode it to the real string 山东 日照?

1 Answer

If you printed the repr() output of your unicode string, then you appear to have Mojibake: byte data that was decoded using the wrong encoding.

First encode back to bytes, then decode using the right codec. This may be as simple as encoding as Latin-1:

unicode_string.encode('latin1').decode('utf8')

This depends on how the incorrect decoding was applied however. If a Windows codepage (like CP1252) was used, you can end up with Unicode data that is not actually encodable back to CP1252 if UTF-8 bytes outside the CP1252 range were force-decoded anyway.
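
To illustrate why the codec matters, here is a minimal sketch in Python 3 syntax (the rest of this answer uses Python 2; all variable names are chosen only for illustration). It builds Mojibake from the question's text with both Latin-1 and CP1252, shows that each can only be undone with the codec that produced it, and shows that some UTF-8 bytes have no mapping in Python's CP1252 codec at all.

```python
original = "山东 日照"
raw = original.encode("utf-8")

# Latin-1 maps every byte 0x00-0xFF to a code point, so it always round-trips.
via_latin1 = raw.decode("latin-1")
repaired_latin1 = via_latin1.encode("latin-1").decode("utf-8")

# CP1252 maps some bytes (e.g. 0x9C, 0x97, 0x85) to code points above U+00FF.
via_cp1252 = raw.decode("cp1252")
repaired_cp1252 = via_cp1252.encode("cp1252").decode("utf-8")

# Undoing CP1252 Mojibake with Latin-1 fails on those characters.
try:
    via_cp1252.encode("latin-1")
    latin1_undo_works = True
except UnicodeEncodeError:
    latin1_undo_works = False

# Byte 0x81 (part of the UTF-8 encoding of "あ") is undefined in Python's
# cp1252 codec, so a strict decode is not even possible for such input.
try:
    "あ".encode("utf-8").decode("cp1252")
    strict_cp1252_works = True
except UnicodeDecodeError:
    strict_cp1252_works = False

print(repaired_latin1 == original, repaired_cp1252 == original)
print(latin1_undo_works, strict_cp1252_works)
```

When a tool has forced its way past such undefined bytes, the resulting text no longer encodes cleanly with any single codec, which is the situation described above.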

The best way to repair such mistakes is using the ftfy library, which knows how to deal with forced-decoded Mojibake texts for a variety of codecs.

For your small sample, Latin-1 appears to work just fine:

>>> unicode_string = u'\xE5\xB1\xB1\xE4\xB8\x9C \xE6\x97\xA5\xE7\x85\xA7'
>>> print unicode_string.encode('latin1').decode('utf8')
山东 日照
>>> import ftfy
>>> print ftfy.fix_text(unicode_string)
山东 日照

If you have the literal characters \ and x followed by two hex digits, you have another layer of encoding, in which each byte was replaced by 4 characters. You'd have to 'decode' those to actual bytes first, by asking Python to interpret the escapes with the string_escape codec:

>>> unicode_string = ur'\xE5\xB1\xB1\xE4\xB8\x9C \xE6\x97\xA5\xE7\x85\xA7'
>>> unicode_string
u'\\xE5\\xB1\\xB1\\xE4\\xB8\\x9C \\xE6\\x97\\xA5\\xE7\\x85\\xA7'
>>> print unicode_string.decode('string_escape').decode('utf8')
山东 日照

'string_escape' is a Python 2 only codec that produces a bytestring, so it is safe to decode that as UTF-8 afterwards.
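
Since string_escape does not exist in Python 3, the same two-step repair needs a different spelling there. The following is a sketch under that assumption and is not part of the original answer: the helper name unescape_utf8 is hypothetical, and it assumes the input contains only ASCII characters (literal backslashes, x, hex digits and plain ASCII text).

```python
def unescape_utf8(text):
    # unicode_escape interprets each \xNN sequence as the code point U+00NN;
    # encoding as Latin-1 then turns each code point into the byte of the
    # same value, and those bytes are finally decoded as UTF-8.
    raw = text.encode("ascii").decode("unicode_escape").encode("latin-1")
    return raw.decode("utf-8")


escaped = r"\xE5\xB1\xB1\xE4\xB8\x9C \xE6\x97\xA5\xE7\x85\xA7"
print(unescape_utf8(escaped))
```

Note that unicode_escape also interprets other escapes such as \n or \u, so this is only appropriate when the input is known to use \xNN sequences for bytes.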

6 Comments

Thanks, Martijn. What if I print the dict that contains the string? It shows {u'qualifier': u'name', u'timestamp': u'1462275769186', u'value': u'\\xE5\\x8E\\x9F\\xE6\\x9D\\xA5\\xE6\\x98\\xAFolivia\\xE5\\x95\\x8A', u'columnFamily': u'interActive', u'type': u'Put', u'row': u'1771897264'} and print m.get('value').encode('latin1').decode('utf8') still prints \xE5\x8E\x9F...
@armnotstrong: You don't have bytes. You have literal backslashes, x characters and hex digits. You have a different problem here. What even produces this?
@armnotstrong: updated; from your question it was not clear that you had literal text. In future, please show the repr() output of such a string (which is what the dict() representation you showed in your comment uses for each key and value)
The original data was retrieved from HBase, which could only store ASCII strings as values, and org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter gives me this.
I'm not familiar with spark, sorry. No idea if this is a spark issue or an issue with how the data was stored in the first place.