
Here's my problem: I have a wrongly encoded variable that I want to fix. Long story short, I end up with:

myVar=u'\xc3\xa9'

which is wrong because these are the UTF-8 encoded bytes of the character 'é' (\u00e9), stored as two separate code points instead of the single Unicode character.

None of the combinations of encode/decode I tried seems to solve the problem. I looked at the bytearray object, but you must provide an encoding, and obviously none of them fits.

Basically I need to reinterpret the byte array into the correct encoding. Any ideas on how to do that? Thanks.

  • What would you like to end up with, unicode or str? Commented Jun 27, 2011 at 20:29
  • @X-Istence: No. Unicode assigns a number to each character, while UTF-8 is one encoding of that number (as are UTF-16, UTF-32, ...). Commented Jun 28, 2011 at 7:04

2 Answers


What you should have done.

>>> b='\xc3\xa9'
>>> b
'\xc3\xa9'
>>> b.decode("UTF-8")
u'\xe9'

Since you didn't show the broken code that caused the problem, all we can do is make a complex problem more complex.

This appears to be what you're seeing.

>>> c
u'\xc3\xa9'
>>> c.decode("UTF-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
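The traceback reports an encode error even though decode was called. In Python 2, calling decode on a unicode object first converts it to a byte string implicitly with the ASCII codec, and that hidden step is what fails. A minimal sketch that makes the step explicit (the variable `error_name` is only there for illustration):

```python
# Same value as in the session above.
c = u'\xc3\xa9'

try:
    # Python 2 performs this encode implicitly before decoding.
    c.encode("ascii")
    error_name = None
except UnicodeEncodeError as exc:
    # Both code points are above 127, so ASCII cannot represent them.
    error_name = type(exc).__name__

print(error_name)
```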

Here's a workaround.

>>> [ chr(ord(x)) for x in c ]
['\xc3', '\xa9']
>>> ''.join(_)
'\xc3\xa9'
>>> _.decode("UTF-8")
u'\xe9'
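The chr(ord(x)) loop maps each code point in the range 0-255 to the byte with the same value. The Latin-1 (ISO-8859-1) codec does the same mapping, so the workaround can also be sketched as a single encode/decode round trip. The function name `reinterpret_as_utf8` is hypothetical, and the sketch assumes every character in the input is below U+0100:

```python
def reinterpret_as_utf8(text):
    """Treat each code point (0-255) as one byte, then decode as UTF-8."""
    raw = text.encode("latin-1")   # code point N becomes byte N
    return raw.decode("utf-8")     # read those bytes as UTF-8


c = u'\xc3\xa9'
fixed = reinterpret_as_utf8(c)
print(repr(fixed))
```

Like the loop, this raises an error if the input holds a character above U+00FF or if the resulting bytes are not valid UTF-8.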

Fix the code that produced the wrong stuff to begin with.


4 Comments

Yeah, I know what should have been done, and that fixing the source of the problem is the best solution. But I'm in a situation where I can't, so I'll take the workaround, which is precisely what I wanted. Thanks.
It appears that c.encode('iso-8859-15').decode('utf-8').encode('utf-8') works too. Am I in a special case?
@gregseth: No. Many encodings overlap. The point of UTF-8 is to look vaguely like ASCII for most of the standard ASCII characters. I have no idea what you mean by "works" in that comment, since there's no point in doing a decode to create Unicode followed by an encode to recreate the bytes again. Python code works in Unicode. Period. External files are encoded (on output) and decoded (on input). There's no other use for encoding and decoding except file I/O.
Ok, my bad, I got confused. Thanks for your time.

The hacky solution: pull out the codepoints with ord, then build characters (length-one strings) out of these with chr, then paste the lot back together and decode.

>>> u = u'\xc3\xa9'
>>> s = ''.join(chr(ord(c)) for c in u)
>>> unicode(s, encoding='utf-8')
u'\xe9'
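If the same repair has to run over values that may or may not be broken, one option is to fall back to the original string whenever the reinterpretation fails. This is only a sketch under that assumption: the name `repair_or_keep` is hypothetical, and it uses a Latin-1 round trip in place of the chr/ord loop.

```python
def repair_or_keep(text):
    """Reinterpret text as UTF-8 bytes if possible, else return it unchanged."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either a character is above U+00FF, or the bytes are not valid
        # UTF-8, so the string was probably not double-decoded.
        return text


repaired = repair_or_keep(u'\xc3\xa9')     # the broken value from the question
untouched = repair_or_keep(u'\xe9')        # already correct, left alone
print(repr(repaired), repr(untouched))
```

A correct string can occasionally also happen to form valid UTF-8 when read this way, so this is a heuristic, not a guarantee.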
