The problem is converting bytes to Unicode when those bytes have already been stored in a str. Here is an example:

s1 = '\xd0\xb1\xd0\xb0'
s2 = b'\xd0\xb1\xd0\xb0'

print(s1)  #  Here is the problem: prints garbage (Ð±Ð°)
print(s2.decode('utf-8'))  #  Everything is OK: prints 'ба' (two Cyrillic letters)

But how can I decode the data from s1 now? I can't add the b'' prefix to the s1 declaration because s1 may come from the internet, so I can't simply declare s1 the way I declared s2. I found that the b'' prefix works like the bytes() function, but when I tried calling it:

s3 = bytes(s1, 'utf-8')

The output was garbage again:

print(s3.decode('utf-8'))  #  Ð±Ð°

So the question is: what should I do with s1 so that it shows up as 'ба' in the terminal output?

I googled a lot, but nothing I found was what I need.

This is what I need:

s4 = SOME_WONDERFUL_MAGIC(s1)
print(s4)  #  Prints 'ба'

Many thanks to everybody who can help, and sorry for my bad English.

UPDATE: Oops, the problem is back. I hoped that the first answer would help me, but I found that:

s1 == '\xd0\xb1\xd0\xb0'  #  BUT
s1 != '\xd0\xb1\xd0\xb0'

What I mean: I used the 'requests' package to make a POST request to a Flask server. It responds with:

req = requests.post(hostName)
print(req.text)  #  b'testText'
#  BUT!
print(req.text[2:-1])  #  testText

This means that the bytes representation of testText arrives embedded in a string, like this:

s5 = "b'tumba'"

So the real question is: how do I extract tumba from "b'tumba'" (given that tumba may contain Cyrillic symbols)?
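
A string such as "b'tumba'" is what str() produces when it is applied to a bytes object, so one possible way to undo it is to parse the string back as a bytes literal and then decode the result. This is only a minimal sketch, assuming the string is exactly the repr of a bytes object and that the underlying bytes are UTF-8; the helper name unwrap_bytes_repr is hypothetical.

```python
import ast


def unwrap_bytes_repr(text, encoding="utf-8"):
    """Turn a string such as "b'tumba'" back into the text it encodes.

    Hypothetical helper: assumes `text` is the repr of a bytes object.
    """
    value = ast.literal_eval(text)  # parses the literal without executing code
    if not isinstance(value, bytes):
        raise ValueError("expected the repr of a bytes object")
    return value.decode(encoding)


print(unwrap_bytes_repr("b'tumba'"))
# str() of a bytes object gives the same "b'...'" shape, escapes included
print(unwrap_bytes_repr(str("ба".encode("utf-8"))))
```

This treats the symptom only. If the server is under your control, returning the text itself instead of str() of a bytes object would likely remove the need for this step.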

  • How can a unicode object "come from the internet"? If it's from the internet it's bytes. It's being decoded to unicode somewhere, the question is where? Commented Nov 26, 2013 at 1:21
  • @gnibbler: It's possible that "come from the internet" may mean "come from one of the internet-related modules in the stdlib or elsewhere, which decoded it behind my back". In that case, of course, the OP has to tell us which module he used, and we can tell him how to set an encoding explicitly instead of defaulting to something incorrect, which will solve the problem without needing any wonderful magic. Commented Nov 26, 2013 at 1:39
  • @abarnert, that's exactly what I was trying to say. Commented Nov 26, 2013 at 1:40
  • @gnibbler: Yes, and your point is crucial; I just wasn't sure a novice would understand it. Novices are usually not directly processing data off sock.read(), and may not realize that requests or ElementTree or whatever is doing some magic with a default value or guess. Commented Nov 26, 2013 at 1:43
  • For everybody who asked about the source of the unicode object: it comes as a response from a Flask server. The 'requests' package is used to make the requests. But the problem is solved now: gnibbler answered the question, many thanks. Commented Nov 26, 2013 at 2:30
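
The point made in these comments can be illustrated without any network access: the same bytes become the right text or garbage depending on which codec decodes them. The sketch below only simulates a library guessing Latin-1; the variable names are illustrative. With requests specifically, decoding response.content yourself, or setting response.encoding before reading response.text, are the usual ways to state the encoding explicitly.

```python
# The bytes that actually travel over the wire for the text 'ба'.
raw = "ба".encode("utf-8")

# A wrong guess made "behind your back" by some layer of the stack.
guessed = raw.decode("latin-1")

# Stating the real encoding explicitly.
explicit = raw.decode("utf-8")

print(repr(raw))
print(repr(guessed))
print(explicit)
```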

2 Answers

s1 is probably being incorrectly decoded as ISO-8859-1 (latin1) somewhere.

You can try re-encoding it:

>>> s4 = s1.encode('ISO-8859-1')
>>> s4.decode('UTF-8')
'ба'

Your real task, though, is finding where the incorrect decoding happens.

Stop treating unicode and bytes interchangeably and the fighting will stop :)
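
The re-encoding trick above can be wrapped in a small function. This is a sketch under the same assumption as the answer (the text was UTF-8 wrongly decoded as Latin-1); the name fix_mojibake and its parameters are hypothetical, and it does nothing about the underlying cause.

```python
def fix_mojibake(text, wrong="latin-1", right="utf-8"):
    """Undo a wrong decode by re-encoding and decoding again.

    Hypothetical helper: `wrong` is the codec assumed to have been
    misapplied, `right` is the encoding the bytes were really in.
    """
    try:
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text does not fit the assumed mistake; leave it untouched.
        return text


s1 = '\xd0\xb1\xd0\xb0'
print(fix_mojibake(s1))      # ба
print(fix_mojibake('ба'))    # already correct text is returned unchanged
```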

1 Comment

It worked for the test string, but real life surprised me with a new problem (the question is updated).

Quick and dirty solution that worked for me:

s1 = '\xd0\xb1\xd0\xb0'
s4 = bytes(s1, encoding='latin1').decode('utf-8')
print(s4)

3 Comments

Sure, but if you've got data that "comes from the internet" and is being incorrectly decoded as Latin-1 this time, so you write code that re-encodes it just to decode it properly, it's all going to fail with the next piece of data that gets incorrectly decoded as something else. Without finding out how it got incorrectly decoded, you can't fix the problem.
If you don't know what encoding you have, it's quite hard to guess it, and even if you do guess, there's always a possibility of failure. So I just assumed it's UTF-8.
You're missing the point. We can assume the actual bytes are UTF-8 because the OP said so. What we can't assume is that the bytes are getting improperly decoded with Latin-1, as opposed to, say, with sys.getdefaultencoding() (which would make your code break on other machines), or Unicode, Dammit (which would make it break with different data), etc.
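
A short sketch of why the wrong codec matters, assuming for illustration that the bad decode used cp1252 rather than Latin-1 (the variable names are made up for the example). The Latin-1 repair then fails outright, while re-encoding with the codec that was really used still works.

```python
raw = "р".encode("utf-8")        # b'\xd1\x80'
mangled = raw.decode("cp1252")   # assumed wrong codec: cp1252 maps 0x80 to '€'

try:
    mangled.encode("latin-1").decode("utf-8")
    latin1_repair_worked = True
except UnicodeEncodeError:
    # '€' has no Latin-1 code point, so the Latin-1 round trip cannot start.
    latin1_repair_worked = False

# Re-encoding with the codec that was actually misapplied recovers the text.
repaired = mangled.encode("cp1252").decode("utf-8")

print(latin1_repair_worked)
print(repaired)
```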
