Bytes operations in Python

Question

I'm working on a project in which I have to perform some byte operations using Python, and I'd like to understand some basic principles before I go on with it.

t1 = b"\xAC\x42\x4C\x45\x54\x43\x48\x49\x4E\x47\x4C\x45\x59"
t2 = "\xAC\x42\x4C\x45\x54\x43\x48\x49\x4E\x47\x4C\x45\x59"
print("Adding b character before: ",t1)
print("Using bytes(str): ",bytes(t2,"utf-8"))
print("Using str.encode: ",t2.encode())

In particular, I cannot understand why the console prints this when I run the code above:

C:\Users\Marco\PycharmProjects\codeTest\venv\Scripts\python.exe C:/Users/Marco/PycharmProjects/codeTest/msgPack/temp.py
Adding b character before:  b'\xacBLETCHINGLEY'
Using bytes(str):  b'\xc2\xacBLETCHINGLEY'
Using str.encode:  b'\xc2\xacBLETCHINGLEY'

What I would like to understand is why, if I use bytes() or str.encode(), I get an extra "\xc2" in front of the value. What does it mean? Is this supposed to appear? And if so, how can I get rid of it without using the first method?

Possible duplicate of this question. Welcome to Stack Overflow. Some Unicode code points take two bytes when encoded with UTF-8, so two bytes are printed.
– Rahat Zaman, Commented Mar 22, 2020 at 2:48

juanpa.arrivillaga · Accepted Answer · 2020-03-21 23:03:58Z

1

Because bytes objects and str objects are two different things. The former represents a sequence of bytes, the latter represents a sequence of Unicode code points. There's a huge difference between the byte 172 and the Unicode code point 172.

In particular, the byte 172 (0xAC) doesn't encode anything on its own in Unicode; in UTF-8, for instance, a lone 0xAC is a continuation byte and is not valid by itself. On the other hand, Unicode code point 172 refers to the following character:

>>> c = chr(172)
>>> print(c)
¬

And of course, the actual raw bytes this would correspond to depend on the encoding. In UTF-8 it is a two-byte encoding:

>>> c.encode()
b'\xc2\xac'

In the Latin-1 encoding, it is a single byte:

>>> c.encode('latin')
b'\xac'
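
As a small check, using only the standard library, the sketch below illustrates why chr(172) becomes the single byte 0xAC in Latin-1: that encoding maps each of the first 256 code points to the byte with the same numeric value. The variable names are made up for illustration.

```python
# Latin-1 maps code points 0..255 to the byte with the same numeric value.
all_bytes = bytes(range(256))

# Decoding every possible byte value yields the code points 0..255 in order.
text = all_bytes.decode("latin-1")
assert [ord(ch) for ch in text] == list(range(256))

# Encoding goes back to exactly the same bytes, so the round trip is lossless.
assert text.encode("latin-1") == all_bytes

# Code points above 255 have no Latin-1 representation.
try:
    chr(256).encode("latin-1")
except UnicodeEncodeError as exc:
    print("cannot encode:", exc.reason)
```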

If you want raw bytes, the most precise and easiest way is to use a bytes literal.
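
If the data already sits in a str and a bytes literal is not an option, a few alternatives can be sketched as follows. This assumes every character in the string is below U+0100; t1 and t2 are taken from the question, while hex_text is a hypothetical input added for illustration.

```python
t1 = b"\xAC\x42\x4C\x45\x54\x43\x48\x49\x4E\x47\x4C\x45\x59"
t2 = "\xAC\x42\x4C\x45\x54\x43\x48\x49\x4E\x47\x4C\x45\x59"

# Latin-1 turns each code point below 256 into the byte with the same value.
from_latin1 = t2.encode("latin-1")

# bytes() also accepts an iterable of integers in range(256).
from_ints = bytes(ord(ch) for ch in t2)

# bytes.fromhex() parses pairs of hex digits; spaces between bytes are allowed.
hex_text = "AC 42 4C 45 54 43 48 49 4E 47 4C 45 59"  # hypothetical input
from_hex = bytes.fromhex(hex_text)

# All three should compare equal to the bytes literal.
print(from_latin1 == t1, from_ints == t1, from_hex == t1)

# UTF-8 gives a longer result because U+00AC needs two bytes there.
print(len(t1), len(t2.encode("utf-8")))
```

Which variant fits best depends on where the data comes from: encode("latin-1") suits strings written with \x escapes, while bytes.fromhex() suits hex dumps read from a file or a log.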


3 Comments

Marco Borinato Over a year ago

Thank you very much for the quick answer, just one more thing: so as far as python is concerned, there should be no difference between these two variables? t1 = b"\xAC\x42\x4C\x45\x54\x43\x48\x49\x4E\x47\x4C\x45\x59" and t2 = "\xAC\x42\x4C\x45\x54\x43\x48\x49\x4E\x47\x4C\x45\x59".encode("latin") correct?

juanpa.arrivillaga Over a year ago

@MarcoBorinato there isn't, but if you mean in general, I can't say 100%, but I believe that, essentially, the first 256 code points of Unicode are equivalent to Latin-1 by design. Why do you ask?

Marco Borinato Over a year ago

Because they behave in different ways, or at least not in the way I was expecting. I've posted another question since the topic was slightly different; if you can help with that one as well I'd really appreciate it. Link below: stackoverflow.com/questions/60811088/…

tdelaney · Accepted Answer · 2020-03-21 23:36:15Z

1

In a string literal, \xhh (h being a hex digit) selects the corresponding unicode character U+0000 to U+00FF, with U+00AC being the ¬ "not sign". When encoding to utf-8, all code points above 0x7F take two or more bytes. \xc2\xac is the utf-8 encoding of U+00AC.

>>> "\u00AC" == "\xAC"
True
>>> "\u00AC" == "¬"
True
>>> "\xAC" == "¬"
True
>>> "\u00AC".encode('utf-8')
b'\xc2\xac'
>>> "¬".encode("utf-8")
b'\xc2\xac'
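
To see where the \xc2 comes from, the two-byte UTF-8 layout can be worked through by hand. The helper encode_two_byte below is a hypothetical name used only for this illustration, and it handles only the range U+0080 to U+07FF; in practice one would simply call str.encode.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Encode a code point in U+0080..U+07FF using the 2-byte UTF-8 layout.

    Layout: 110xxxxx 10yyyyyy, where xxxxx are the top 5 bits and
    yyyyyy the low 6 bits of the (at most 11-bit) code point.
    """
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only the two-byte range is handled in this sketch")
    lead = 0b11000000 | (code_point >> 6)                # 110xxxxx
    continuation = 0b10000000 | (code_point & 0b111111)  # 10yyyyyy
    return bytes([lead, continuation])


# 0xAC is 0b10101100: the top bits (0b10) go into the lead byte, giving 0xC2,
# and the low six bits (0b101100) go into the continuation byte, giving 0xAC.
manual = encode_two_byte(0xAC)
print(manual)
assert manual == "\xAC".encode("utf-8")
```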

edited Mar 21, 2020 at 23:36

answered Mar 21, 2020 at 23:30
