I already know that to encode and decode a string as UTF-8, I can do the following:

string = "Kröger"

print(string.encode('utf-8'))
>> b'Kr\xc3\xb6ger'

print(b'Kr\xc3\xb6ger'.decode('utf-8'))
>> Kröger

If I have a string 'Kr\xc3\xb6ger' that is not marked as <class 'bytes'> (the 'b' prefix is missing), how can I decode it?


Edit:

I have a tokenized list, if it helps: ['K', 'r', '\\xc3\\xb6', 'g', 'e', 'r']

  • Do you have 'Kr\xc3\xb6ger' or 'Kr\\xc3\\xb6ger'? What's the length of the string? Commented Jul 5, 2020 at 20:05
  • 'Kr\xc3\xb6ger', len(string) = 7 Commented Jul 5, 2020 at 20:08
  • This is mojibake, i.e. a string decoded from bytes with the wrong codec. Specifically, Latin-1 was used to decode instead of UTF-8. You can undo the damage by encoding and decoding in reverse: 'Kr\xc3\xb6ger'.encode('latin1').decode('utf8') Commented Jul 5, 2020 at 20:09
  • @lenz, thank you. I took this word from a German word corpus. Commented Jul 5, 2020 at 20:13
  • In that case you may want to check that you correctly decode when loading the data (e.g. explicitly specify encoding='utf8' when opening a file for reading). Commented Jul 5, 2020 at 20:43
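
To illustrate the last comment, here is a minimal sketch of how the mojibake can arise at load time and how passing the encoding explicitly avoids it. The file name corpus.txt and the helper read_text are made up for the example, and it assumes the corpus file really is UTF-8.

```python
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Read a whole text file with an explicit encoding (hypothetical helper)."""
    with open(path, encoding=encoding) as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the real corpus file; assumed to be stored as UTF-8.
    path = os.path.join(tmp, "corpus.txt")
    with open(path, "wb") as f:
        f.write("Kröger".encode("utf-8"))

    # Decoding the UTF-8 bytes as Latin-1 yields the 7-character mojibake.
    wrong = read_text(path, encoding="latin-1")
    # Decoding with the codec the file was written in yields the real word.
    right = read_text(path, encoding="utf-8")

print(repr(wrong))
print(right)
```

If the data is decoded correctly when it is read, no repair step is needed afterwards.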

2 Answers

string = "Kr\xc3\xb6ger"
print(bytes(string, "raw_unicode_escape").decode("utf-8"))

gives

Kröger
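
One difference between "raw_unicode_escape" and the "latin1" codec suggested in the comments may be worth knowing. As far as I can tell, both produce the same bytes for code points up to U+00FF, but they diverge above that. The sketch below uses a made-up helper fix_mojibake and a made-up test string containing a euro sign.

```python
def fix_mojibake(text):
    """Undo UTF-8 bytes that were wrongly decoded as Latin-1 (hypothetical helper)."""
    return text.encode("latin-1").decode("utf-8")


# Both approaches agree on the string from the question.
fixed = fix_mojibake("Kr\xc3\xb6ger")
same = bytes("Kr\xc3\xb6ger", "raw_unicode_escape").decode("utf-8")

# A string that also contains a character outside Latin-1 (U+20AC, euro sign).
mixed = "Kr\xc3\xb6ger \u20ac"

# raw_unicode_escape does not fail; it writes the euro sign as a literal
# backslash-u escape, which then survives as plain text in the result.
lenient = mixed.encode("raw_unicode_escape").decode("utf-8")

# latin-1 cannot represent U+20AC, so it raises instead of guessing.
try:
    fix_mojibake(mixed)
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print(fixed, same, repr(lenient), strict_failed)
```

Which behaviour is preferable depends on the data: latin-1 fails loudly on input that is not pure Latin-1 mojibake, while raw_unicode_escape quietly leaves escape text in the output.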

1 Comment

Thank you @alaniwi, I missed the "raw_unicode_escape".

First you have to encode it to bytes, then decode it from utf-8:

>>> s = 'Kr\xc3\xb6ger'
>>> s.encode("raw-unicode-escape")
b'Kr\xc3\xb6ger'
>>> s.encode("raw-unicode-escape").decode('u8')
'Kröger'
>>>
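
The tokenized list in the question's edit looks like a different case: there the token contains literal backslashes ('\\xc3\\xb6' is eight characters), so the escape sequences have to be interpreted before the bytes can be decoded as UTF-8. A rough sketch, assuming the input holds only ASCII characters plus \xNN escapes; unescape_utf8 is a name made up for this example.

```python
def unescape_utf8(text):
    """Turn literal \\xNN escape text into the UTF-8 string it spells out.

    Assumes `text` is ASCII apart from the escapes (hypothetical helper).
    """
    return (
        text.encode("latin-1")       # str -> bytes, one byte per character
        .decode("unicode_escape")    # interpret \xNN as code points U+00NN
        .encode("latin-1")           # code points U+00NN -> raw bytes NN
        .decode("utf-8")             # raw bytes -> real text
    )


tokens = ['K', 'r', '\\xc3\\xb6', 'g', 'e', 'r']

# Decode the joined word, or each token on its own.
word = unescape_utf8("".join(tokens))
decoded_tokens = [unescape_utf8(t) for t in tokens]

print(word)
print(decoded_tokens)
```

Input that already contains real non-ASCII characters would need different handling, since the final UTF-8 decode may fail on it.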
