Python Unicode Encode Decode Issue

Question

Let's take a simple variable:

var =  u' \u2013 2'

Let's try decoding it:

var.decode('utf-8')

I get the following error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 7: ordinal not in range(128)

Let's try encoding it:

var.encode('utf-8')

I get the following error:

'ascii' codec can't decode byte 0xe2 in position 8: ordinal not in range(128)

One solution is to do:

sys.setdefaultencoding('utf-8')

What are others doing?

You just don't understand the difference between unicode and bytes. Python 2.7 did not manage to get it right, however: unicode objects have a .decode method and bytestrings have a .encode method, which is nonsense.
– bgusach, Commented May 19, 2015 at 11:17

bobince · Accepted Answer · 2015-05-19 10:59:24Z

Let's try decoding [a Unicode string]

You decode bytes to Unicode. You encode Unicode to bytes.

You cannot decode a unicode string.

If you try, Python tries to help you out by automatically converting the Unicode string to something it can decode: a byte string. As this conversion is implicit, it uses Python 2's default encoding, which is ASCII. ASCII can't encode U+2013, so you get an error.

(With hindsight, this attempt at “do what I mean” behaviour was a mistake. Python 3 no longer allows it.)
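The hidden step can be written out by hand to see where the error comes from. This is a minimal sketch that assumes Python 3 (the question uses Python 2), so the ASCII encode that Python 2 performs silently has to be spelled out. The helper name `py2_style_decode` is hypothetical and only imitates the behaviour described above.

```python
def py2_style_decode(text, encoding="utf-8"):
    """Imitate Python 2's unicode.decode(): ASCII-encode first, then decode."""
    # Step 1 (implicit in Python 2): Unicode -> bytes using the ASCII default.
    raw = text.encode("ascii")
    # Step 2 (what the caller actually asked for): bytes -> Unicode.
    return raw.decode(encoding)


var = u" \u2013 2"

try:
    py2_style_decode(var)
    error = None
except UnicodeEncodeError as exc:
    # The failure is an *encode* error even though decode() was requested.
    error = exc

print(type(error).__name__)

# Pure-ASCII text survives both steps, which is why the bug can stay hidden.
print(py2_style_decode(u"plain"))

# Python 3 removes the trap: str objects have no decode() method at all.
print(hasattr(var, "decode"))
```

The error is raised in step 1, before any decoding happens, which is why the message names the 'ascii' codec and "encode" rather than UTF-8 and "decode".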

I get 'ascii' codec can't decode byte 0xe2 in position 8: ordinal not in range(128)

You're doing something else there you haven't shown us, then, because encoding works fine:

>>> u' \u2013 2'.encode('utf-8')
' \xe2\x80\x93 2'
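The question does not show the code that produced the decode error, so what follows is only a guess at one common cause: calling .encode() on a value that is already a byte string. In Python 2 that triggers an implicit ASCII decode first. The sketch assumes Python 3, where that hidden decode has to be written explicitly; the reported position will differ from the one in the question because the input differs.

```python
# Already-encoded UTF-8 bytes, the same value shown in the console output above.
data = u" \u2013 2".encode("utf-8")

try:
    # Python 2 would do this silently before honouring str.encode('utf-8').
    data.decode("ascii")
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```

If this is what happened, the fix is to encode only Unicode strings and to leave byte strings alone.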

One solution is to do sys.setdefaultencoding('utf-8')

This was never a proper solution to anything, which is why Python takes some steps to prevent you doing it.

holdenweb · Accepted Answer · 2015-05-19 11:06:26Z

The statement

>>> var =  u' \u2013 2'

creates a Unicode string object inside your program. The mistake you appear to be making is assuming that Unicode objects are encoded. They aren't: they are in a form suitable for direct use by Python code.

When you want to transmit the Unicode string, you have to do so as a sequence of bytes, which means your string must be encoded.

>>> var.encode("utf-8")

gives the result

' \xe2\x80\x93 2'

which is indeed your original string encoded in UTF-8. You can verify this with

>>> var.encode('utf-8').decode('utf-8')

which gives you back the original Unicode string:

u' \u2013 2'

Remember: decode on the way in (to convert an external representation into a Unicode object), encode on the way out (so your Unicode objects can be represented as byte strings).
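A minimal sketch of that rule, using in-memory byte streams to stand in for a file or socket. The function name `shout` and the sample bytes are made up for illustration, and UTF-8 is assumed as the external encoding.

```python
import io


def shout(source, sink, encoding="utf-8"):
    """Read bytes, work on Unicode text, write bytes."""
    raw = source.read()                  # bytes from the outside world
    text = raw.decode(encoding)          # decode on the way in
    result = text.upper()                # all processing happens on Unicode
    sink.write(result.encode(encoding))  # encode on the way out


# UTF-8 bytes for an accented word followed by the en dash from the question.
incoming = io.BytesIO(b"caf\xc3\xa9 \xe2\x80\x93 2")
outgoing = io.BytesIO()

shout(incoming, outgoing)
print(repr(outgoing.getvalue()))
```

Keeping the decode and encode calls at the edges means the code in the middle never has to know which encoding the data arrived in.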
