python encoding conversion

Question

Here's my problem: I have a wrongly encoded variable that I want to fix. Long story short, I end up with:

myVar=u'\xc3\xa9'

which is wrong because it holds the UTF-8 encoded bytes of the character 'é' (\u00e9) as if each byte were a Unicode code point, instead of holding the character itself.

None of the combinations of encode/decode I tried seems to solve the problem. I looked at the bytearray object, but you must provide an encoding, and obviously none of them fits.

Basically I need to reinterpret the byte array into the correct encoding. Any ideas on how to do that? Thanks.

@X-Istence: nope, Unicode is a number standing for a character, while UTF-8 is an encoding for that number (as are UTF-16, UTF-32, ...).
– gregseth, Commented Jun 28, 2011 at 7:04

S.Lott · Accepted Answer · 2011-06-27 20:43:42Z

5

What you should have done.

>>> b='\xc3\xa9'
>>> b
'\xc3\xa9'
>>> b.decode("UTF-8")
u'\xe9'

Since you didn't show the broken code that caused the problem, all we can do is make a complex problem more complex.

This appears to be what you're seeing.

>>> c
u'\xc3\xa9'
>>> c.decode("UTF-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
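
The traceback reports an encode error even though decode was called. In Python 2, calling decode on a unicode object first converts it to a byte string with the default 'ascii' codec, and that hidden step is what fails. A small sketch that makes the hidden step explicit (the variable names are only for illustration; written to run on Python 2.6+ and 3.3+):

```python
c = u'\xc3\xa9'

# Python 2 performs this conversion implicitly, with the default
# 'ascii' codec, before c.decode("UTF-8") can do any decoding.
try:
    c.encode('ascii')
    implicit_step_failed = False
except UnicodeEncodeError:
    # U+00C3 and U+00A9 are both outside the ASCII range (0..127).
    implicit_step_failed = True
```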

Here's a workaround.

>>> [ chr(ord(x)) for x in c ]
['\xc3', '\xa9']
>>> ''.join(_)
'\xc3\xa9'
>>> _.decode("UTF-8")
u'\xe9'

Fix the code that produced the wrong stuff to begin with.
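
For reference, the chr(ord(x)) workaround above maps each code point to the byte with the same value, which is also what the Latin-1 (ISO-8859-1) codec does. A minimal sketch under that assumption; fix_mojibake is a hypothetical helper name, not something from the original code, and it only applies when every character in the input is below U+0100:

```python
def fix_mojibake(text):
    """Reinterpret a string of byte-valued characters as UTF-8."""
    # Latin-1 maps U+0000..U+00FF one-to-one onto bytes 0x00..0xFF,
    # so this recovers the original byte sequence unchanged.
    raw = text.encode('latin-1')
    # Decode those bytes with the encoding they were really written in.
    return raw.decode('utf-8')

fixed = fix_mojibake(u'\xc3\xa9')
```

If the input contains a character above U+00FF, the encode step raises UnicodeEncodeError; if the recovered bytes are not valid UTF-8, the decode step raises UnicodeDecodeError. Either one is a sign the string was not damaged in the way assumed here.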

answered Jun 27, 2011 at 20:43


4 Comments

gregseth Over a year ago

Yeah, I know what should have been done, and that fixing the source of the problem is the best solution. But I'm in a situation where I can't, so I'll take the workaround, which is precisely what I wanted. Thanks.

gregseth Over a year ago

It appears that c.encode('iso-8859-15').decode('utf-8').encode('utf-8') works too. Am I in a special case?

S.Lott Over a year ago

@gregseth: No. Many encodings overlap. The point of UTF-8 is to look vaguely like ASCII for most of the standard ASCII characters. I have no idea what you mean by "works" in that comment, since there's no point in doing a decode to create Unicode followed by an encode to recreate the bytes again. Python code works in Unicode. Period. External files are encoded (on output) and decoded (on input). There's no other use for encoding and decoding except file I/O.

gregseth Over a year ago

Ok, my bad, I got confused. Thanks for your time.
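
On the iso-8859-15 question from the comments: that codec agrees with ISO-8859-1 on the two bytes involved here (0xC3 and 0xA9), so it gives the same result for 'é'. The two codecs differ at eight positions, including 0xA4, so a string whose UTF-8 form contains one of those bytes behaves differently. A rough sketch, using a mis-decoded 'ä' (UTF-8 bytes C3 A4) as an illustrative input that is not from the original question:

```python
broken_e = u'\xc3\xa9'  # e-acute, UTF-8 bytes C3 A9 read as Latin-1
broken_a = u'\xc3\xa4'  # a-umlaut, UTF-8 bytes C3 A4 read as Latin-1

# U+00C3 and U+00A9 sit at the same byte values in both codecs.
via_15 = broken_e.encode('iso-8859-15').decode('utf-8')
via_1 = broken_e.encode('iso-8859-1').decode('utf-8')

# U+00A4 has no slot in ISO-8859-15 (byte 0xA4 is the euro sign there),
# so the same trick fails for this input.
try:
    broken_a.encode('iso-8859-15')
    iso15_ok = True
except UnicodeEncodeError:
    iso15_ok = False

# ISO-8859-1 still round-trips it.
fixed_a = broken_a.encode('iso-8859-1').decode('utf-8')
```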

Fred Foo · Answer · 2011-06-28 13:40:41Z

1

The hacky solution: pull out the codepoints with ord, then build characters (length-one strings) out of these with chr, then paste the lot back together and decode.

>>> u = u'\xc3\xa9'
>>> s = ''.join(chr(ord(c)) for c in u)
>>> unicode(s, encoding='utf-8')
u'\xe9'
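
Since the question mentions bytearray: it can also be built from integers instead of from a string plus an encoding, which sidesteps the "you must provide an encoding" problem. A sketch of the same idea, assuming every code point is below 256 (bytearray raises ValueError otherwise); it should run on Python 2.6+ and 3.3+:

```python
u = u'\xc3\xa9'

# bytearray accepts an iterable of integers in range(256),
# so no source encoding has to be named at this step.
raw = bytearray(ord(ch) for ch in u)

# The recovered bytes are then decoded as what they really are: UTF-8.
fixed = raw.decode('utf-8')
```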

edited Jun 28, 2011 at 13:40

answered Jun 27, 2011 at 20:43
