Let's take a simple variable -

var =  u' \u2013 2'

Let's try decoding it -

var.decode('utf-8')

I get the following error -

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 7: ordinal not in range(128)

Let's try encoding it -

var.encode('utf-8')

I get the following error -

'ascii' codec can't decode byte 0xe2 in position 8: ordinal not in range(128)

One solution is to do -

sys.setdefaultencoding('utf-8')

What are others doing to handle this?

  • Where are you running this? Commented May 19, 2015 at 10:34
  • You just don't understand the difference between unicode and bytes. Python 2.7 did not get this right, however: unicode objects have a .decode method and byte strings have a .encode method, which is nonsense. Commented May 19, 2015 at 11:17

2 Answers

Let's try decoding [a Unicode string]

You decode bytes to Unicode. You encode Unicode to bytes.

You cannot decode a Unicode string.

If you try, Python 2 tries to help you out by automatically converting the Unicode string to something it can decode: a byte string. As this is implicit, it uses the default encoding (the value of sys.getdefaultencoding()), which is ASCII. ASCII can't encode U+2013 (EN DASH), so you get a UnicodeEncodeError from what looked like a decode call.

(With hindsight, this attempt at “do what I mean” behaviour was a mistake. Python 3 no longer allows it.)
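
To make the hidden step visible, here is a minimal sketch that spells out the implicit conversion by hand. It assumes the implicit step behaves like an explicit encode('ascii'); the name error is only there for illustration.

```python
var = u' \u2013 2'

# Roughly what Python 2 does for var.decode('utf-8'):
# an implicit var.encode('ascii') first, then .decode('utf-8') on the bytes.
try:
    var.encode('ascii').decode('utf-8')
    error = None
except UnicodeEncodeError as exc:
    # The ASCII encode step fails before any decoding happens.
    error = exc

print(type(error).__name__)
print(error)
```

The exception comes from the encode step, which is why a call named decode reports an encode error.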

I get 'ascii' codec can't decode byte 0xe2 in position 8: ordinal not in range(128)

You're doing something else there you haven't shown us, then, because encoding works fine:

>>> u' \u2013 2'.encode('utf-8')
' \xe2\x80\x93 2'
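
I can only guess at the code you haven't shown, but one common way to hit that message is calling .encode on something that is already a byte string. Python 2 then implicitly decodes it as ASCII first. A sketch of that implicit step written out explicitly (raw is a hypothetical stand-in for your real data):

```python
# Hypothetical value: UTF-8 bytes, not a Unicode string.
raw = b' \xe2\x80\x93 2'

# In Python 2, raw.encode('utf-8') implicitly performs this ASCII decode first.
try:
    raw.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    # 0xe2 is outside the ASCII range, so the decode fails.
    error = exc

print(type(error).__name__)
print(error)
```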

One solution is to do sys.setdefaultencoding('utf-8')

This was never a proper solution to anything, which is why Python 2 removes the function from the sys module at startup, so you have to reload(sys) before you can even call it.


The statement

>>> var =  u' \u2013 2'

creates a Unicode string object inside your program. The mistake you appear to be making is assuming that Unicode objects are encoded: they aren't, they are in a form suitable for direct use by Python code.

When you want to transmit the Unicode string, you have to do so as a sequence of bytes, which means your string must be encoded.

>>> var.encode("utf-8")

gives the result

' \xe2\x80\x93 2'

which is indeed your original string encoded in UTF-8. You can verify this with

>>> var.encode('utf-8').decode('utf-8')

which gives you back the original Unicode string:

u' \u2013 2'

Remember: decode on the way in (to convert an external representation into a Unicode object), encode on the way out (so your Unicode objects can be represented as byte strings).
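
As a rough sketch of that rule, assuming the outside world speaks UTF-8: the function name shout and the BytesIO stand-in for a file or socket are made up for the example.

```python
import io

def shout(raw_bytes, encoding='utf-8'):
    # Decode on the way in: bytes from outside become a Unicode string.
    text = raw_bytes.decode(encoding)
    # Work only with Unicode inside the program.
    text = text.upper()
    # Encode on the way out: Unicode becomes bytes again.
    return text.encode(encoding)

# io.BytesIO stands in for a file or socket delivering UTF-8 bytes.
incoming = io.BytesIO(b' \xe2\x80\x93 two')
result = shout(incoming.read())
print(repr(result))
```

Keeping the encoding explicit at both edges means the default encoding never gets a chance to step in.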
