Python string encoding issue

Question

I am using the Amazon MWS API to get the sales report for my store and then save that report in a table in the database. Unfortunately, I am getting an encoding error when I try to encode the information as Unicode. After looking through the report (exactly as Amazon sent it), I saw this string, which is the location of the buyer:

'S�o Paulo'

so I tried to encode it like so:

encodeme = 'S�o Paulo'
encodeme.encode('utf-8)

but got the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 1: ordinal not in range(128)

The whole reason why I am trying to encode it is because as soon as Django sees the � character it throws a warning and cuts off the string, meaning that the location is saved as S instead of

São Paulo

Any help is appreciated.

Community · Accepted Answer · 2017-05-23 10:32:32Z

It looks like you are having some kind of encoding problem.

First, you should be very certain what encoding Amazon is using in the report body they send you. Is it UTF-8? Is it ISO 8859-1? Something else?

Unfortunately, the Amazon MWS Reports API documentation, especially the API Reference, is not very forthcoming about what encoding they use. The only encoding I see them mention is UTF-8, so that should be your first guess. The GetReport API documentation (p.36-37) describes the response element Report as being type xs:string, but I don't see where they define that data type. Maybe they mean XML Schema's string datatype.

So, I suggest you save the byte sequence you are receiving as your report body from Amazon in a file, with zero transformations. Be aware that your code which calls Amazon MWS might be modifying the report body string inadvertently. Examine the non-ASCII bytes in that file with a binary editor. Is "São" stored as S\xC3\xA3o, indicating UTF-8 encoding? Or is it stored as S\xE3o, indicating ISO 8859-1 encoding?
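If you don't have a binary editor handy, a few lines of Python can do the same inspection. This is only a sketch, written for Python 3 (where bytes and str are separate types); the helper name non_ascii_bytes and the two sample byte strings are made up for illustration and are not taken from a real report.

```python
def non_ascii_bytes(raw):
    """Return (offset, hex value) pairs for every byte >= 0x80 in raw."""
    return [(i, "0x%02x" % b) for i, b in enumerate(raw) if b >= 0x80]

# Two made-up report fragments holding the same city name.
utf8_sample = b"S\xc3\xa3o Paulo"    # what UTF-8 would look like
latin1_sample = b"S\xe3o Paulo"      # what ISO 8859-1 would look like

print(non_ascii_bytes(utf8_sample))    # [(1, '0xc3'), (2, '0xa3')]
print(non_ascii_bytes(latin1_sample))  # [(1, '0xe3')]
```

On a real report you would read the saved file in binary mode and pass its contents to the helper instead of the samples.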

I'm guessing that you receive your report as a flat file. The Amazon MWS documentation says that you can request that reports be delivered to you as XML. This would have the advantage of giving you a reply with an explicit encoding declaration.
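To show why an explicit encoding declaration helps, here is a small sketch in Python 3 using the standard library's xml.etree.ElementTree. The XML body and the City element name are invented for this example; I don't know what a real report looks like as XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML body: the declaration says the bytes are ISO 8859-1.
xml_body = b'<?xml version="1.0" encoding="ISO-8859-1"?><City>S\xe3o Paulo</City>'

# Given bytes, the parser reads the declaration and decodes accordingly,
# so the caller never has to guess the encoding.
city = ET.fromstring(xml_body)
print(city.text == "S\u00e3o Paulo")  # True
```

The point is that the decoding decision moves from your code into the document itself.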

Once you know the encoding of the report body, you need to handle it properly. You imply that you are using the Django framework and Python code to receive the report from Amazon MWS.

One thing to get very clear (as Skirmantas also explains):

Unicode strings hold characters. Byte strings hold bytes (octets).
Encoding converts a Unicode string into a byte string.
Decoding converts a byte string into a Unicode string.
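The three statements above can be sketched as a round trip. This is written for Python 3, where the Unicode string type is str and the byte string type is bytes (roughly Python 2's unicode and str); the variable names are just for illustration.

```python
text = "S\u00e3o Paulo"                     # Unicode string: 9 characters

# Encoding: Unicode string -> byte string. The result depends on the encoding.
utf8_bytes = text.encode("utf-8")           # b'S\xc3\xa3o Paulo', 10 bytes
latin1_bytes = text.encode("iso-8859-1")    # b'S\xe3o Paulo', 9 bytes

# Decoding: byte string -> Unicode string, using the matching encoding.
assert utf8_bytes.decode("utf-8") == text
assert latin1_bytes.decode("iso-8859-1") == text
```

Note that the same characters give different bytes under different encodings, which is why you have to know the encoding before you decode.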

The string you get from Amazon MWS is a byte string. You need to decode it to get a Unicode string. But your code fragment, encodeme = 'São Paulo', gives you a byte string. encodeme.encode('utf-8) performs an encode() on the byte string, which isn't what you want. (The missing closing quote on 'utf-8 doesn't help.)

Try this example code:

>>> reportbody = 'S\xc3\xa3o Paulo'   # UTF-8 encoded byte string
>>> reportbody.decode('utf-8')        # returns a Unicode string, u'...'
u'S\xe3o Paulo'

You might find some background reading helpful. I agree with Hoxieboy that you should take the time to read Python's Unicode HOWTO. Also check out the top answers to "What do I need to know about Unicode?"

thanks, i really appreciate it. I will try to get the xml response from amazon

Ski · Accepted Answer · 2012-01-30 08:39:12Z

I think you have to decode it using a correct encoding rather than encode it to utf-8. Try

s = s.decode('utf-8')

However, you need to know which encoding to use. Input can come in encodings other than utf-8.

The error you received, UnicodeDecodeError, means that your object is not a unicode object; it is a bytestring. When you call bytestring.encode, the string is first decoded into a unicode object using the default encoding (ascii), and only then is it encoded with utf-8.
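You can reproduce that hidden first step on its own. The sketch below is Python 3, where bytes has no encode method, so the ASCII decode is written out explicitly. I am assuming the string in the question held the replacement character U+FFFD encoded as UTF-8 (bytes EF BF BD), which is consistent with the byte 0xef in the reported error, but that is a guess about the asker's data.

```python
# Assumed bytes: 'S', then U+FFFD encoded as UTF-8, then 'o Paulo'.
raw = b"S\xef\xbf\xbdo Paulo"

# Python 2's bytestring.encode('utf-8') implicitly started with this decode.
try:
    raw.decode("ascii")
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```

The failure comes from the decode step, which is why an encode call ends up raising a UnicodeDecodeError.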

I'll try to explain the difference between a unicode string and a utf-8 bytestring in Python.

unicode is a Python datatype which represents a unicode string. You use unicode for most string operations in your program. How Python stores it internally (CPython 2 uses UCS-2 or UCS-4, depending on the build) doesn't matter for you.

bytestring is a binary-safe string. It can be in any encoding. When you receive data, for example when you open a file, you get a bytestring, and in most cases you will want to decode it to unicode. When you write to a file, you have to encode unicode objects into bytestrings. Sometimes decoding/encoding is done for you by a framework or library. However, a framework cannot always do this, because it cannot always know which encoding to use.
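Here is a rough sketch of that file round trip in Python 3. The file name report.txt and the temporary directory are placeholders for the example; io.open with an encoding argument is also available in Python 2.6 and later.

```python
import io
import os
import tempfile

text = "S\u00e3o Paulo"
path = os.path.join(tempfile.mkdtemp(), "report.txt")  # throwaway file

# Writing in text mode: the text is encoded to bytes on the way out.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Reading in binary mode returns the raw bytes, with no decoding applied.
with io.open(path, "rb") as f:
    raw = f.read()

# The bytes must be decoded explicitly to get text back.
decoded = raw.decode("utf-8")
```

Here I know the encoding because I wrote the file myself. With data from someone else you have to find it out first.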

utf-8 is an encoding which can correctly represent any unicode string as a bytestring. However, you can't decode just any bytestring into unicode with utf-8. You need to know what encoding is used in the bytestring to decode it.
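A short sketch of that last point, again in Python 3 and with a made-up sample: bytes produced by ISO 8859-1 are not valid utf-8, so decoding them with the wrong codec fails.

```python
latin1_bytes = b"S\xe3o Paulo"  # the city name encoded as ISO 8859-1

try:
    latin1_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # 0xe3 announces a three-byte utf-8 sequence, but the next byte
    # ('o') is not a continuation byte, so the decode is rejected.
    utf8_ok = False

# With the encoding the bytes were actually written in, decoding works.
as_latin1 = latin1_bytes.decode("iso-8859-1")
print(utf8_ok, as_latin1 == "S\u00e3o Paulo")  # False True
```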

CR0SS0V3R · Accepted Answer · 2012-01-30 07:25:55Z


Official Python unicode documentation

You might try that webpage if you haven't already and see if you can get the answer you're looking for ;)


2 Comments

CR0SS0V3R Over a year ago

Should have looked at where I was posting, D'oh! I'm new if you haven't noticed already :)

Yuji 'Tomita' Tomita Over a year ago

don't worry about it! Questions with answers get less attention, so it's just good etiquette if it's definitely not an answer.
