Convert UTF-8 bytes to a string in Python

Question

I already know that if I want to encode and decode a string in 'utf-8', I can do:

string = "Kröger"

print(string.encode('utf-8'))
>> b'Kr\xc3\xb6ger'

print(b'Kr\xc3\xb6ger'.decode('utf-8'))
>> Kröger

If I have a string 'Kr\xc3\xb6ger' that is not marked as <class bytes> (the 'b' prefix is missing), how do I decode it?

Edit:

I have a tokenized list, if it helps: ['K', 'r', '\\xc3\\xb6', 'g', 'e', 'r']

Do you have 'Kr\xc3\xb6ger' or 'Kr\\xc3\\xb6ger'? What's the length of the string?
– lenz, Commented Jul 5, 2020 at 20:05
This is mojibake, i.e. a string decoded from bytes with the wrong codec. Specifically, Latin-1 was used to decode instead of UTF-8. You can undo the damage by encoding and decoding in reverse: 'Kr\xc3\xb6ger'.encode('latin1').decode('utf8')
– lenz, Commented Jul 5, 2020 at 20:09
@lenz, thank you. I took this word from a German word corpus.
– Ansh David, Commented Jul 5, 2020 at 20:13
In that case you may want to check that you decode correctly when loading the data (e.g. explicitly specify encoding='utf8' when opening a file for reading).
– lenz, Commented Jul 5, 2020 at 20:43
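
The two suggestions in the comments above can be tried together in a small sketch. It assumes the corpus is a UTF-8 text file; the temporary file here is only a stand-in for it, and the variable names are illustrative.

```python
import os
import tempfile

# Write the word as UTF-8 bytes to a temporary file (a stand-in for the corpus).
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as f:
    f.write("Kröger".encode("utf-8"))
    path = f.name

try:
    # Reading with the wrong codec reproduces the mojibake from the question.
    with open(path, encoding="latin-1") as f:
        garbled = f.read()

    # Reading with the right codec avoids the problem at the source.
    with open(path, encoding="utf-8") as f:
        clean = f.read()
finally:
    os.remove(path)

# An already-garbled string can be repaired by reversing the wrong decode.
repaired = garbled.encode("latin-1").decode("utf-8")

print(repr(garbled))
print(clean, repaired)
```

If the data can be reloaded, fixing the `encoding` argument is probably the cleaner option; the round trip is mainly useful when only the garbled strings are left.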

alani · Accepted Answer · 2020-07-05 20:05:23Z

2

string = "Kr\xc3\xb6ger"
print(bytes(string, "raw_unicode_escape").decode("utf-8"))

gives

Kröger
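
For this input, "raw_unicode_escape" and "latin-1" produce the same bytes, because every character is below U+0100. The sketch below illustrates where the two differ; the euro sign is an arbitrary test character added for illustration and is not part of the question's data.

```python
mojibake = "Kr\xc3\xb6ger"

# Both codecs map code points below U+0100 to single bytes.
via_raw = bytes(mojibake, "raw_unicode_escape").decode("utf-8")
via_latin1 = mojibake.encode("latin-1").decode("utf-8")

# A character above U+00FF (here the euro sign) exposes the difference.
mixed = mojibake + "\u20ac"

# raw_unicode_escape does not fail; it writes the literal escape text \u20ac.
raw_bytes = mixed.encode("raw_unicode_escape")

# latin-1 refuses to encode the character instead.
try:
    mixed.encode("latin-1")
    latin1_failed = False
except UnicodeEncodeError:
    latin1_failed = True

print(via_raw, via_latin1)
print(raw_bytes, latin1_failed)
```

Which behaviour is preferable depends on the data: "latin-1" fails loudly on strings that are not pure mojibake, while "raw_unicode_escape" silently leaves escape text in the result.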

1 Comment

Thank you @alaniwi, I missed the "raw_unicode_escape".

thebjorn · Accepted Answer · 2020-07-05 20:07:39Z

2

First you have to encode it to bytes, then decode it from utf-8:

>>> s = 'Kr\xc3\xb6ger'
>>> s.encode("raw-unicode-escape")
b'Kr\xc3\xb6ger'
>>> s.encode("raw-unicode-escape").decode('u8')
'Kröger'
>>>
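
The tokenized list in the question's edit contains literal backslashes ('\\xc3\\xb6'), which is a different case from the one handled above: the escape text has to be interpreted before the bytes can be decoded. A minimal sketch, assuming the tokens contain only ASCII characters plus such escapes; `decode_escaped_utf8` is a hypothetical helper name, not a library function.

```python
def decode_escaped_utf8(text: str) -> str:
    """Turn literal '\\xNN' escape text into the UTF-8 string it spells out."""
    # 1. latin-1 maps each character to one byte (the input is assumed ASCII).
    # 2. unicode_escape interprets the backslash escapes as code points.
    # 3. latin-1 again turns those code points back into raw bytes.
    # 4. utf-8 decodes the raw bytes into the intended text.
    return (
        text.encode("latin-1")
        .decode("unicode_escape")
        .encode("latin-1")
        .decode("utf-8")
    )


tokens = ['K', 'r', '\\xc3\\xb6', 'g', 'e', 'r']

# Decode the whole word at once, or each token separately.
word = decode_escaped_utf8("".join(tokens))
per_token = [decode_escaped_utf8(t) for t in tokens]

print(word)
print(per_token)
```

Input containing characters outside Latin-1 would raise a UnicodeEncodeError in the first step, so this only fits data shaped like the list in the question.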
