So I am converting some code from Python 2 to Python 3. I don't understand the Python 2 encode/decode functionality well enough to determine what I should be doing in Python 3.

In python2, I can do the following things:

>>> c = '\xe5\xb8\x90\xe6\x88\xb7'
>>> print c
帐户
>>> c.decode('utf8')
u'\u5e10\u6237'

What did I just do there? Doesn't the 'u' prefix mean unicode? Shouldn't the utf8 be '\xe5\xb8\x90\xe6\x88\xb7' since that is what I input in the first place?

1 Answer

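Your variable c was not declared as a unicode string (with the prefix 'u'), so it is a byte string. If you decode it using the 'latin1' encoding, each byte maps to the code point with the same value, so the result shows the same escape sequences as your input: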

>>> c.decode('latin1')
u'\xe5\xb8\x90\xe6\x88\xb7'

Note that the result of decode is a unicode string:

>>> type(c)
<type 'str'>
>>> type(c.decode('latin1'))
<type 'unicode'>
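
Since the question is about moving to Python 3, here is a minimal sketch of how the two types above map across, assuming the original bytes were meant as UTF-8 (the variable names raw and text are only illustrative). The Python 2 str corresponds to bytes, and the Python 2 unicode corresponds to str:

```python
# Python 3: a bytes literal needs the b prefix; a plain literal is already text.
raw = b'\xe5\xb8\x90\xe6\x88\xb7'   # six bytes, like the Python 2 str
text = raw.decode('utf-8')          # two code points, like the Python 2 unicode

print(type(raw).__name__, len(raw))    # bytes 6
print(type(text).__name__, len(text))  # str 2
print(text == '\u5e10\u6237')          # True
```

In Python 3 only bytes has decode() and only str has encode(), so the direction of each conversion is explicit.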

If you declare c as a unicode string and keep the same input, each \x escape becomes a code point instead of a byte, so you will not print the same characters:

>>> c=u'\xe5\xb8\x90\xe6\x88\xb7'
>>> print c
å¸æ·

If you use the input '\u5e10\u6237', you will print the initial characters:

>>> c=u'\u5e10\u6237'
>>> print c
帐户

Encoding and decoding is just a matter of using a table of correspondence value<->character. The catch is that the same byte value does not map to the same character under different encodings (i.e. tables).
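
To make the table idea concrete, here is a small Python 3 sketch that decodes the same six bytes with two different tables. The names as_utf8 and as_latin1 are just for illustration:

```python
raw = b'\xe5\xb8\x90\xe6\x88\xb7'

as_utf8 = raw.decode('utf-8')      # multi-byte table: three bytes per character here
as_latin1 = raw.decode('latin-1')  # single-byte table: one byte per character

print([hex(ord(ch)) for ch in as_utf8])    # ['0x5e10', '0x6237']
print([hex(ord(ch)) for ch in as_latin1])  # ['0xe5', '0xb8', '0x90', '0xe6', '0x88', '0xb7']

# Encoding with the same table that was used for decoding restores the bytes.
assert as_utf8.encode('utf-8') == raw
assert as_latin1.encode('latin-1') == raw
```

Both decodes succeed, but only the UTF-8 one yields the two intended characters. The bytes themselves carry no record of which table produced them.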

The main difficulty is when you don't know the encoding of an input string that you have to handle. Some tools can try to guess it, but it is not always successful (see https://superuser.com/questions/301552/how-to-auto-detect-text-file-encoding).
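
When the encoding is unknown, one simple approach (not a real detector) is to try a short list of candidates in order. The sketch below assumes Python 3; decode_with_fallback is a hypothetical helper, not a standard library function, and the candidate list is an assumption you would adapt to your data:

```python
def decode_with_fallback(raw, candidates=('utf-8', 'latin-1')):
    """Return (text, encoding) for the first candidate that decodes without error."""
    for name in candidates:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings matched')

# Valid UTF-8 is accepted by the first candidate.
text, used = decode_with_fallback(b'\xe5\xb8\x90\xe6\x88\xb7')

# b'caf\xe9' is not valid UTF-8, so the helper falls through to latin-1.
other, used_other = decode_with_fallback(b'caf\xe9')
```

latin-1 accepts every possible byte, so it has to come last, and a successful decode does not prove the guess was right.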

4 Comments

"Encoding and decoding is just a matter of using a table of correspondence value<->character": I would prefer "Encoding and decoding is just a matter of using a table of correspondence (single byte) character(s) <-> unicode character".
@SergeBallesta yes you are right. I meant 'value' as the byte value.
So what format is u'\u5e10\u6237' in? Is that actually utf8? Then what format is '\xe5\xb8\x90\xe6\x88\xb7' in, latin1? This is confusing because turning those Chinese characters to bytes in python3 gives me '\xe5\xb8\x90\xe6\x88\xb7', which I had assumed is utf8.
u'\u5e10\u6237' is the Unicode string, which prints as the Chinese ideograms when using the print statement. '\xe5\xb8\x90\xe6\x88\xb7' is the corresponding UTF-8 encoded byte string; decoding those same bytes as latin1 gives six unrelated characters instead. I know this is quite confusing (even for me), please refer to the documentation: docs.python.org/2/howto/unicode.html
