I'm working on a project in which I have to perform some byte operations in Python, and I'd like to understand some basic principles before I go on with it.

t1 = b"\xAC\x42\x4C\x45\x54\x43\x48\x49\x4E\x47\x4C\x45\x59"
t2 = "\xAC\x42\x4C\x45\x54\x43\x48\x49\x4E\x47\x4C\x45\x59"
print("Adding b character before: ",t1)
print("Using bytes(str): ",bytes(t2,"utf-8"))
print("Using str.encode: ",t2.encode())

In particular, I cannot understand why the console prints this when I run the code above:

C:\Users\Marco\PycharmProjects\codeTest\venv\Scripts\python.exe C:/Users/Marco/PycharmProjects/codeTest/msgPack/temp.py
Adding b character before:  b'\xacBLETCHINGLEY'
Using bytes(str):  b'\xc2\xacBLETCHINGLEY'
Using str.encode:  b'\xc2\xacBLETCHINGLEY'

What I would like to understand is why, if I use bytes() or str.encode(), I get an extra "\xc2" in front of the value. What does it mean? Is it supposed to appear? And if so, how can I get rid of it without using the first method?

  • b is not a character. It is part of a bytes literal. Commented Mar 21, 2020 at 22:46
  • Possible duplicate of this question. Welcome to Stack Overflow. Some Unicode code points take two or more bytes when encoded as UTF-8, so they print as two or more bytes. Commented Mar 22, 2020 at 2:48

2 Answers

Because bytes objects and str objects are two different things. The former represents a sequence of bytes, the latter a sequence of Unicode code points. There's a huge difference between the byte 172 and the Unicode code point 172.
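As a minimal sketch of that difference (the variable names here are only illustrative and use a shortened version of the value from the question): indexing a bytes object yields an integer, while indexing a str yields a one-character string whose code point can be read with ord().

```python
raw = b"\xacBLE"   # bytes: a sequence of integers in range(256)
text = "\xacBLE"   # str: a sequence of Unicode code points

first_byte = raw[0]            # indexing bytes gives an int
first_char = text[0]           # indexing str gives a one-character str
code_point = ord(first_char)   # the code point number of that character

print(type(raw).__name__, type(text).__name__)
print(first_byte, code_point)
print(len(raw), len(text))
```

Both objects have length 4 here and the numbers happen to match, but one holds a byte value and the other a character; turning the character into bytes requires choosing an encoding.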

In particular, the byte 172 doesn't encode anything in particular in Unicode. On the other hand, Unicode code point 172 refers to the following character:

>>> c = chr(172)
>>> print(c)
¬

And of course, the actual raw bytes this corresponds to depend on the encoding. In UTF-8 it is a two-byte encoding:

>>> c.encode()
b'\xc2\xac'

In the Latin-1 encoding, it is a single byte:

>>> c.encode('latin')
b'\xac'
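Applied to the strings from the question, this might look like the sketch below. It assumes every character in the string is in the range U+0000 to U+00FF, which is the only range Latin-1 can encode; the helper names (latin1_bytes, round_trip_ok, euro_encodable) are made up for illustration.

```python
t1 = b"\xAC\x42\x4C\x45\x54\x43\x48\x49\x4E\x47\x4C\x45\x59"
t2 = "\xAC\x42\x4C\x45\x54\x43\x48\x49\x4E\x47\x4C\x45\x59"

latin1_bytes = t2.encode("latin-1")   # one byte per character
utf8_bytes = t2.encode("utf-8")       # U+00AC becomes the two bytes c2 ac

print(latin1_bytes == t1)
print(len(latin1_bytes), len(utf8_bytes))

# Latin-1 maps each code point 0..255 to the byte with the same value.
round_trip_ok = all(chr(i).encode("latin-1") == bytes([i]) for i in range(256))
print(round_trip_ok)

# A character above U+00FF (here the euro sign) has no Latin-1 byte.
try:
    "\u20ac".encode("latin-1")
    euro_encodable = True
except UnicodeEncodeError:
    euro_encodable = False
print(euro_encodable)
```

So encoding with Latin-1 removes the extra \xc2, but only as long as the string never contains a character beyond U+00FF.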

If you want raw bytes, the most precise/easy way then is to use a bytes-literal.
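If the byte values are only known at run time, so a literal is not an option, the built-in bytes constructor and bytes.fromhex avoid the text-encoding step altogether. A short sketch, using a shortened version of the value from the question and illustrative variable names:

```python
from_literal = b"\xacBLE"

# Build the same bytes from a list of integers in range(256).
from_ints = bytes([0xAC, 0x42, 0x4C, 0x45])

# Build the same bytes from a string of hex digits (spaces are ignored).
from_hex = bytes.fromhex("ac 42 4c 45")

print(from_ints == from_literal)
print(from_hex == from_literal)
```

Neither approach goes through a str of characters, so no encoding is involved and no extra byte can appear.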

3 Comments

Thank you very much for the quick answer, just one more thing: so as far as Python is concerned, there should be no difference between these two variables? t1 = b"\xAC\x42\x4C\x45\x54\x43\x48\x49\x4E\x47\x4C\x45\x59" and t2 = "\xAC\x42\x4C\x45\x54\x43\x48\x49\x4E\x47\x4C\x45\x59".encode("latin") correct?
@MarcoBorinato there isn't, but if you mean in general, I can't say 100%, but I believe that, essentially, the first 256 code points of Unicode are equivalent to Latin-1 by design. Why do you ask?
Because they behave in different ways, or at least not in the way I was expecting. I've posted another question since the topic was slightly different; if you can help with that one as well I'd really appreciate it. Link below: stackoverflow.com/questions/60811088/…

In a string literal, \xhh (h being a hex digit) selects the corresponding Unicode character in the range U+0000 to U+00FF, with U+00AC being the ¬ "not sign". When encoding to UTF-8, all code points above 0x7F take two or more bytes. \xc2\xac is the UTF-8 encoding of U+00AC.

>>> "\u00AC" == "\xAC"
True
>>> "\u00AC" == "¬"
True
>>> "\xAC" == "¬"
True
>>> "\u00AC".encode('utf-8')
b'\xc2\xac'
>>> "¬".encode("utf-8")
b'\xc2\xac'
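To see how the UTF-8 length grows with the code point, here is a small sketch; the sample characters other than U+00AC are arbitrary picks for illustration and do not come from the question.

```python
samples = {
    "letter A": "\u0041",       # at or below 0x7F
    "not sign": "\u00ac",       # above 0x7F
    "euro sign": "\u20ac",      # above 0x7FF
    "emoji": "\U0001f600",      # above 0xFFFF
}

# Number of bytes each character takes once encoded as UTF-8.
lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}

for name, ch in samples.items():
    print(f"U+{ord(ch):04X} {name}: {ch.encode('utf-8').hex()} ({lengths[name]} bytes)")
```

Only code points up to 0x7F encode to a single byte in UTF-8, which is why the rest of the string in the question (plain ASCII letters) comes out unchanged.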
