Unicode and `decode()` in Python

Question

>>> a = "我"  # chinese  
>>> b = unicode(a,"gb2312")  
>>> a.__class__   
<type 'str'>   
>>> b.__class__   
<type 'unicode'>  # b is unicode
>>> a
'\xce\xd2'
>>> b
u'\u6211' 

>>> c = u"我"
>>> c.__class__
<type 'unicode'>  # c is unicode
>>> c
u'\xce\xd2'

b and c are both unicode, but >>> b outputs u'\u6211' and >>> c outputs u'\xce\xd2'. Why?

What terminal are you using? I can't reproduce the results on my Unicode gnome-terminal (c === u'\u6211')
– Chris Morgan, Commented Apr 23, 2012 at 8:53

Fred Foo · Accepted Answer · 2012-04-23 09:05:06Z

When you enter "我", the Python interpreter receives from the terminal a representation of that character in your local character set, and stores it in a byte string byte-for-byte because the literal uses plain "" quotes without a u prefix. On my UTF-8 system, that's '\xe6\x88\x91'. On yours, it's '\xce\xd2' because you use GB2312. That explains the value of your variable a.
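As a rough illustration of why the same character produces different bytes, here is a minimal sketch. It is written in Python 3 syntax, which is an assumption on my part since the question uses Python 2. Here str.encode stands in for the terminal turning the character into bytes, and the variable names are made up for the example.

```python
# Python 3 sketch: one character, two character sets.
ch = "\u6211"  # the character U+6211 from the question

gb_bytes = ch.encode("gb2312")    # roughly what a GB2312 terminal sends
utf8_bytes = ch.encode("utf-8")   # roughly what a UTF-8 terminal sends

print(gb_bytes)    # b'\xce\xd2'
print(utf8_bytes)  # b'\xe6\x88\x91'
```

These match the two values of a quoted above: two bytes under GB2312, three under UTF-8.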

When you enter u"我", the Python interpreter doesn't know which encoding the 我 character is in. What it does is pretty much the same as for an ordinary string: it stores the bytes of the character in a Unicode string, interpreting each byte as a Unicode codepoint, hence the wrong result u'\xce\xd2' (or, on my box, u'\xe6\x88\x91').
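Treating each byte as one code point has the same effect as decoding the bytes as Latin-1, so the behaviour can be imitated in a short sketch. Again this is Python 3 syntax (an assumption, not what the asker ran), and the names raw, wrong and right are only illustrative.

```python
# Python 3 sketch: the same two bytes decoded two ways.
raw = b"\xce\xd2"  # bytes a GB2312 terminal sends for U+6211

wrong = raw.decode("latin-1")  # one code point per byte, like the u"" case
right = raw.decode("gb2312")   # bytes decoded with their real encoding

print(ascii(wrong))  # '\xce\xd2'
print(ascii(right))  # '\u6211'

# The mis-decoded text can be repaired by undoing the Latin-1 step.
print(wrong.encode("latin-1").decode("gb2312") == right)  # True
```

The first result has two code points and corresponds to c in the question; the second has one and corresponds to b.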

This problem only exists in the interactive interpreter. When you write Python scripts or modules, you can specify the encoding near the top and Unicode strings will come out right. E.g., on my system, the following prints the word liberté twice:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

print(u"liberté")
print("liberté")

pepr · Accepted Answer · 2012-04-23 10:19:36Z

The interactive Python interpreter shows the representation of an object when you just type its name. The print command, on the other hand, tries to render the character. Your variable named a is of a string type, and strings in Python 2.x are sequences of bytes, so its content depends on your working environment. You tell the unicode() function that those bytes use the gb2312 encoding. If that is true, then b contains the correct character, decoded from the given encoding.

Try

>>> print b

in your case. It is likely you will see the wanted result. Try also:

>>> print repr(a)
...
>>> print repr(b)

The representation is (if possible) a text string that, when copy-pasted into source code, would create an object with the same value.
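The round-trip property can be checked with a small sketch. It uses Python 3, which is my assumption rather than the asker's setup. There the built-in ascii() gives the escaped form that Python 2's repr() shows for unicode objects, and ast.literal_eval turns that text back into an object.

```python
import ast

# Python 3 sketch: an escaped representation that evaluates back to the value.
b = b"\xce\xd2".decode("gb2312")  # the character U+6211 as text

literal = ascii(b)  # escaped form, similar to Python 2's repr(b)
print(literal)      # '\u6211'

# Pasting the representation back as source yields an equal object.
print(ast.literal_eval(literal) == b)  # True
```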

Have a look at Mark Pilgrim's "Dive Into Python 3", Chapter 4. Strings (http://getpython3.com/diveintopython3/strings.html) for a nice, readable explanation.
