string.decode() function in python2

Question

I am converting some code from Python 2 to Python 3. I don't understand Python 2's encode/decode functionality well enough to determine what I should be doing in Python 3.

In python2, I can do the following things:

>>> c = '\xe5\xb8\x90\xe6\x88\xb7'
>>> print c
帐户
>>> c.decode('utf8')
u'\u5e10\u6237'

What did I just do there? Doesn't the 'u' prefix mean unicode? Shouldn't the utf8 be '\xe5\xb8\x90\xe6\x88\xb7' since that is what I input in the first place?

Community · Accepted Answer · 2017-03-20 10:18:12Z

2

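Your variable c was not declared as a unicode (no 'u' prefix), so it is a byte string (str). Its six bytes are the UTF-8 encoding of 帐户, and c.decode('utf8') converts them into the two Unicode code points u'\u5e10\u6237'. If you decode it with the 'latin1' encoding instead, each byte maps to the code point with the same value, so the result shows the same escapes as your input: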

>>> c.decode('latin1')
u'\xe5\xb8\x90\xe6\x88\xb7'

Note that the result of decode is a unicode string:

>>> type(c)
<type 'str'>
>>> type(c.decode('latin1'))
<type 'unicode'>

If you declare c as a unicode and keep the same escapes, each \xNN is read as a code point rather than a byte, so you will not print the same characters:

>>> c=u'\xe5\xb8\x90\xe6\x88\xb7'
>>> print c
å¸æ·

If you use the input u'\u5e10\u6237' instead, you will print the original characters:

>>> c=u'\u5e10\u6237'
>>> print c
帐户
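For the Python 3 port mentioned in the question, the same steps might look like the sketch below. It assumes the data is UTF-8, as in the question. In Python 3 a byte string needs the b prefix, a plain literal is already Unicode, and only bytes has a decode method.

```python
# Python 3 sketch: bytes must be written with the b prefix.
c = b'\xe5\xb8\x90\xe6\x88\xb7'   # the six bytes from the question
text = c.decode('utf8')           # bytes -> str (Unicode text)

print(type(c).__name__)           # bytes
print(type(text).__name__)        # str
print(ascii(text))                # shows the code points as escapes

# Plain string literals are Unicode already, so no u prefix is needed.
same_as_literal = (text == '\u5e10\u6237')
print(same_as_literal)

# str has encode() but no decode(); going back to bytes is an encode step.
round_trip = text.encode('utf8')
print(round_trip == c)
```

Calling print(text) would show the two ideograms, provided the terminal's encoding can represent them.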

Encoding and decoding are just a matter of using a table of correspondence between byte values and Unicode characters. The same byte value does not map to the same character in every encoding (i.e. table), and in a multi-byte encoding such as UTF-8 several bytes together can map to a single character.
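As a small illustration of the table idea, here is a Python 3 sketch that decodes the question's bytes with two different tables. The variable names are only illustrative.

```python
# Python 3 sketch: one byte sequence, two decoding tables.
raw = b'\xe5\xb8\x90\xe6\x88\xb7'    # the six bytes from the question

as_utf8 = raw.decode('utf-8')        # three bytes per character for this text
as_latin1 = raw.decode('latin-1')    # always one byte per character

print(len(raw), len(as_utf8), len(as_latin1))   # 6 2 6
print(ascii(as_utf8))                # '\u5e10\u6237'
print(ascii(as_latin1))              # '\xe5\xb8\x90\xe6\x88\xb7'

# Encoding with the same table that was used to decode restores the bytes.
back_from_utf8 = as_utf8.encode('utf-8')
back_from_latin1 = as_latin1.encode('latin-1')
print(back_from_utf8 == raw, back_from_latin1 == raw)
```

Both decodes succeed, but only the UTF-8 one yields the two characters the bytes were meant to represent.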

The main difficulty is when you don't know the encoding of an input string that you have to handle. Some tools can try to guess it, but it is not always successful (see https://superuser.com/questions/301552/how-to-auto-detect-text-file-encoding).
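When the encoding is unknown, one simple approach is to try a short list of candidate encodings in order. The Python 3 sketch below is only that: decode_with_fallback is a hypothetical helper, not a library function, and the candidate list is an assumption you would adapt to your data. A decode that succeeds does not prove the guess is right, and latin-1 accepts any byte sequence, so it only makes sense as the last candidate.

```python
def decode_with_fallback(data, candidates=('utf-8', 'latin-1')):
    """Hypothetical helper: return (text, encoding) for the first
    candidate encoding that decodes the bytes without an error."""
    for name in candidates:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue  # this table does not fit; try the next one
    raise ValueError('none of the candidate encodings matched')


# Valid UTF-8, so the first candidate is used.
text, used = decode_with_fallback(b'\xe5\xb8\x90\xe6\x88\xb7')
print(used)

# A lone 0xe9 byte is not valid UTF-8, so the helper falls back to latin-1.
text2, used2 = decode_with_fallback(b'caf\xe9')
print(used2)
```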

edited Mar 20, 2017 at 10:18

CommunityBot


answered Jul 12, 2016 at 13:44

Frodon


4 Comments

Serge Ballesta Over a year ago

"Encoding and decoding is just a matter of using a table of correspondence value<->character": I would prefer "Encoding and decoding is just a matter of using a table of correspondence (single-byte) character(s) <-> Unicode character".

Frodon Over a year ago

@SergeBallesta yes you are right. I meant 'value' as the byte value.

kingledion Over a year ago

So what format is u'\u5e10\u6237' in? Is that actually utf8? Then what format is '\xe5\xb8\x90\xe6\x88\xb7' in, latin1? This is confusing because turning those Chinese characters to bytes in Python 3 gives me '\xe5\xb8\x90\xe6\x88\xb7', which I had assumed is utf8.

Frodon Over a year ago

u'\u5e10\u6237' is the Unicode string, which prints as the Chinese ideograms when using the print statement. '\xe5\xb8\x90\xe6\x88\xb7' is the corresponding UTF-8 encoded byte string (latin1 cannot represent these ideograms; decoding with latin1 only turns each of the six bytes into a separate character). I know this is quite confusing (even for me); please refer to the documentation: docs.python.org/2/howto/unicode.html
