8
>>> a = "我"  # chinese  
>>> b = unicode(a,"gb2312")  
>>> a.__class__   
<type 'str'>   
>>> b.__class__   
<type 'unicode'>  # b is unicode
>>> a
'\xce\xd2'
>>> b
u'\u6211' 

>>> c = u"我"
>>> c.__class__
<type 'unicode'>  # c is unicode
>>> c
u'\xce\xd2'

b and c are both unicode objects, but evaluating b displays u'\u6211' while evaluating c displays u'\xce\xd2'. Why?

3
  • What terminal are you using? I can't reproduce the results on my Unicode gnome-terminal (c === u'\u6211') Commented Apr 23, 2012 at 8:53
  • @ChrisMorgan I tested this code in IDLE. Commented Apr 23, 2012 at 8:54
  • I can also reproduce this in IDLE. Commented Apr 23, 2012 at 9:00

2 Answers

12

When you enter "我", the Python interpreter receives from the terminal the bytes that represent that character in your local character set, and it stores them byte-for-byte in a byte string (str), because plain "" quotes create a str in Python 2. On my UTF-8 system, that's '\xe6\x88\x91'. On yours, it's '\xce\xd2' because you use GB2312. That explains the value of your variable a.
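To see that the same character really does map to different bytes under the two encodings, here is a minimal sketch. The variable names are mine, and the character is written as an escape so the snippet does not depend on the terminal or source encoding:

```python
# -*- coding: utf-8 -*-
# The character from the question, written as an escape sequence
# so that no terminal or file encoding is involved.
char = u'\u6211'

# Encode the same character with the two encodings discussed above.
utf8_bytes = char.encode('utf-8')      # three bytes on a UTF-8 system
gb2312_bytes = char.encode('gb2312')   # two bytes on a GB2312 system

print(repr(utf8_bytes))
print(repr(gb2312_bytes))
```

The two byte sequences match the values quoted above for a UTF-8 system and a GB2312 system respectively.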

When you enter u"我", the Python interpreter doesn't know which encoding the character is in. What it does is pretty much the same as for an ordinary string: it stores the bytes of the character in a Unicode string, interpreting each byte as a Unicode codepoint, hence the wrong result u'\xce\xd2' (or, on my box, u'\xe6\x88\x91').
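The effect can be imitated outside IDLE. In this sketch I use the latin-1 codec as a stand-in for "each byte becomes the codepoint with the same number"; that choice is my assumption for illustration, not a statement about how IDLE is implemented:

```python
# -*- coding: utf-8 -*-
# The GB2312 bytes for the character, as shown for variable a in the question.
raw = b'\xce\xd2'

# Decoding with the right codec gives the single intended character.
correct = raw.decode('gb2312')

# latin-1 maps byte N to codepoint N, which mimics the wrong result:
# two unrelated characters instead of one.
wrong = raw.decode('latin-1')

print(repr(correct), len(correct))
print(repr(wrong), len(wrong))

# If you already hold the wrong string, reversing the byte-to-codepoint
# mapping and decoding properly recovers the intended character.
repaired = wrong.encode('latin-1').decode('gb2312')
print(repaired == correct)
```

The last three lines are a possible workaround if you are stuck with a mis-decoded value; passing the byte string to unicode(a, "gb2312"), as the question already does for b, avoids the problem in the first place.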

This problem only exists in the interactive interpreter. When you write Python scripts or modules, you can specify the encoding near the top and Unicode strings will come out right. E.g., on my system, the following prints the word liberté twice:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

print(u"liberté")
print("liberté")

0

The interactive Python interpreter shows the representation (repr) of an object when you just type its name. The print statement, on the other hand, tries to render the characters. Your variable a is of type str, and strings in Python 2.x are really sequences of bytes, so its content depends on your working environment. You tell the unicode() function that those bytes are encoded in gb2312. If that is true, then b contains the correctly decoded character.

Try

>>> print b

in your case. You will likely see the expected result. Also try:

>>> print repr(a)
...
>>> print repr(b)

The representation is (where possible) a text string that, when copied and pasted into source code, creates an object with the same value.
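That round-trip property can be checked directly. This is a small sketch with names of my own choosing; it uses ast.literal_eval from the standard library to evaluate the representation as a literal instead of pasting it into a file:

```python
# -*- coding: utf-8 -*-
import ast

# Build the Unicode character from the GB2312 bytes in the question.
text = b'\xce\xd2'.decode('gb2312')

# repr() gives the source-code form of the object ...
shown = repr(text)

# ... and evaluating that form as a literal rebuilds an equal object.
rebuilt = ast.literal_eval(shown)

print(shown)
print(rebuilt == text)
```

The exact text of the representation differs between Python versions (Python 2 shows an escaped u'...' literal), but in both cases evaluating it yields the original value.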

Have a look at Mark Pilgrim's "Dive Into Python 3", Chapter 4, "Strings" (http://getpython3.com/diveintopython3/strings.html) for a nice, readable explanation.
