One more fight between Unicode, Python 3 and the programmer: decoding a string

Question

The problem is converting bytes to Unicode when those bytes are already stored in a string. Here is an example:

s1 = '\xd0\xb1\xd0\xb0'
s2 = b'\xd0\xb1\xd0\xb0'

print(s1)  #  Here is the problem: prints garbage (Ð±Ð°)
print(s2.decode('utf-8'))  #  Everything is OK, prints 'ба' (two Cyrillic letters)

But how can I decode the data from s1 now? I can't add the b'' prefix to the s1 declaration, because s1 may come from the internet, so I can't simply declare s1 the way I declared s2. I found that the b'' prefix works like the bytes() function, but when I tried to call it:

s3 = bytes(s1, 'utf-8')

The garbage was back:

print(s3.decode('utf-8'))  #  Ð±Ð°

So the question is: what should I do with s1 so that it shows up as 'ба' in the terminal output?

I googled a lot, but nothing I found was what I need.

This is what I need:

s4 = SOME_WONDERFUL_MAGIC(s1)
print(s4)  #  Prints 'ба'

Many thanks to everybody who can help, and please excuse my bad English.

UPDATE: Oops, the problem is back. I hoped that the first answer would help me, but I found that:

s1 == '\\xd0\\xb1\\xd0\\xb0'  #  BUT
s1 != '\xd0\xb1\xd0\xb0'

What I mean: I used the 'requests' package to make a POST request to a Flask server. It responds with:

req = requests.post(hostName)
print(req.text)  #  b'testText'
#  BUT!
print(req.text[2:-1])  #  testText

This means that the bytes representation of testText arrives as a string, like this:

s5 = "b'tumba'"

So the real question is: how do I extract tumba from "b'tumba'" (given that tumba may contain Cyrillic symbols)?
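
One possible way to do the extraction asked about here, offered as a sketch and not as a fix for the root cause: the text "b'tumba'" looks like the repr of a bytes object, so it can be parsed back into bytes with ast.literal_eval from the standard library and then decoded. The helper name bytes_repr_to_text and the sample value s6 are made up for illustration, and UTF-8 is assumed as the real encoding.

```python
import ast


def bytes_repr_to_text(s, encoding="utf-8"):
    """Hypothetical helper: turn the text "b'...'" into a decoded str."""
    # literal_eval only parses Python literals; it does not run arbitrary code.
    value = ast.literal_eval(s)
    if not isinstance(value, bytes):
        raise ValueError("expected the repr of a bytes object")
    return value.decode(encoding)


s5 = "b'tumba'"
# Assumed sample: the same kind of text, but with escaped Cyrillic bytes.
s6 = "b'\\xd0\\xb1\\xd0\\xb0'"

print(bytes_repr_to_text(s5))  # tumba
print(bytes_repr_to_text(s6))  # ба
```

A string of this shape usually means that something called str() on a bytes object before sending it. That is a guess about the server code, which is not shown here; if it is right, decoding the bytes on the server side would remove the need for this helper.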

How can a unicode object "come from the internet"? If it's from the internet it's bytes. It's being decoded to unicode somewhere, the question is where?
– John La Rooy, Commented Nov 26, 2013 at 1:21
@gnibbler: It's possible that "come from the internet" may mean "come from one of the internet-related modules in the stdlib or elsewhere, which decoded it behind my back". In that case, of course, the OP has to tell us which module he used, and we can tell him how to set an encoding explicitly instead of defaulting to something incorrect, which will solve the problem without needing any wonderful magic.
– abarnert, Commented Nov 26, 2013 at 1:39
@gnibbler: Yes, and your point is crucial; I just wasn't sure a novice would understand it. Novices are usually not directly processing data off sock.read(), and may not realize that requests or ElementTree or whatever is doing some magic with a default value or guess.
– abarnert, Commented Nov 26, 2013 at 1:43
For everybody who asked about the source of the unicode object: it comes as a response from a Flask server. The 'requests' package is used to make the requests. But the problem is gone now: gnibbler answered the question, many thanks.
– user3034492, Commented Nov 26, 2013 at 2:30

John La Rooy · Accepted Answer · 2013-11-26 01:34:53Z

4

s1 is probably being incorrectly decoded as ISO-8859-1 (latin1) somewhere.

You can try re-encoding:

>>> s4 = s1.encode('ISO-8859-1')
>>> s4.decode('UTF-8')
'ба'

Your real task, though, is finding where the incorrect decoding is happening.

Stop treating unicode and bytes interchangeably and the fighting will stop :)
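
To illustrate why the re-encoding step works, here is a small self-contained sketch. It assumes the original bytes are UTF-8 and that the wrong decoding was ISO-8859-1, as guessed above; the variable names are made up for the example. ISO-8859-1 maps each byte 0-255 to the code point with the same number, so encoding the garbled string back with it recovers the original bytes exactly.

```python
# The bytes that actually travelled over the wire (UTF-8 for 'ба').
raw = "ба".encode("utf-8")  # b'\xd0\xb1\xd0\xb0'

# What happens when something decodes them with the wrong codec.
wrong = raw.decode("iso-8859-1")  # 'Ð±Ð°', i.e. '\xd0\xb1\xd0\xb0'

# The workaround: undo the wrong decoding, then decode correctly.
repaired = wrong.encode("iso-8859-1").decode("utf-8")

# The real fix: decode the bytes once, with the right codec.
right = raw.decode("utf-8")

print(repaired, right)
```

If the string comes from requests, the same idea applies at the source: response.content holds the undecoded bytes, and assigning response.encoding = 'utf-8' before reading response.text overrides the guessed codec. Whether that is where the wrong decoding happens in this case is an assumption.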

edited Nov 26, 2013 at 1:34

answered Nov 26, 2013 at 1:24


1 Comment

user3034492 Over a year ago

It worked for the test string, but real life surprised me with a new problem (the question is updated).

Tupteq · Answer · 2013-11-26 01:23:59Z

1

Quick and dirty solution that worked for me:

s1 = '\xd0\xb1\xd0\xb0'
s4 = bytes(s1, encoding='latin1').decode('utf-8')
print(s4)

answered Nov 26, 2013 at 1:23


3 Comments

abarnert Over a year ago

Sure, but if you've got data that "comes from the internet" and is being incorrectly decoded as Latin-1 this time, so you write code that re-encodes it just to decode it properly, it's all going to fail with the next piece of data that gets incorrectly decoded as something else. Without finding out how it got incorrectly decoded, you can't fix the problem.

Tupteq Over a year ago

If you don't know what you have (which encoding), then it's quite hard to guess the encoding, and even if you do, there's always a possibility of failure. So I just assumed it's UTF-8.

abarnert Over a year ago

You're missing the point. We can assume the actual bytes are UTF-8 because the OP said so. What we can't assume is that the bytes are getting improperly decoded with Latin-1, as opposed to, say, with sys.getdefaultencoding() (which would make your code break on other machines), or Unicode, Dammit (which would make it break with different data), etc.
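
A small sketch of the failure mode described in this comment, under an assumed scenario: the UTF-8 bytes were wrongly decoded as cp1252 (a common Windows codec) instead of Latin-1. The codec choice and the sample letter are assumptions picked for illustration; they are not taken from the question.

```python
original = "р"  # Cyrillic letter er; its UTF-8 bytes are b'\xd1\x80'
raw = original.encode("utf-8")

# Assumed wrong decoding: cp1252 maps byte 0x80 to the euro sign (U+20AC).
mojibake = raw.decode("cp1252")

# The Latin-1 repair from the answer cannot encode the euro sign.
try:
    mojibake.encode("latin1")
    latin1_repair_worked = True
except UnicodeEncodeError:
    latin1_repair_worked = False

# Only the codec that did the damage can undo it.
repaired = mojibake.encode("cp1252").decode("utf-8")

print(latin1_repair_worked, repaired == original)
```

So the re-encoding trick is only as reliable as the guess about which codec did the wrong decoding.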
