Python - Reading a UTF-8 encoded string byte-by-byte

Question

I have a device that returns a UTF-8 encoded string. I can only read from it byte-by-byte and the read is terminated by a byte of value 0x00.

I'm writing a Python 2.7 function that lets others access my device and returns a string.

In a previous design when the device just returned ASCII, I used this in a loop:

x = read_next_byte()
if x == 0:
    break
my_string += chr(x)

Where x is the latest byte value read from the device.

Now the device can return a UTF-8 encoded string, but I'm not sure how to convert the bytes that I get back into a Python unicode string.

chr(x) understandably breaks down when x > 127, so I thought unichr(x) might work. But unichr assumes the value passed is a full Unicode code point, whereas I only have a single byte in the range 0-255.

So how can I convert the bytes that I get back from the device into a string that can be used in Python and still handle the full UTF-8 string?

Likewise, if I was given a UTF-8 string in Python, how would I break that down into individual bytes to send to my device and still maintain UTF-8?

ShadowRanger · Accepted Answer · 2016-09-26 20:06:11Z

4

The correct solution is to read until you hit the terminating byte, then decode from UTF-8 at that point (so that every byte of each multi-byte character is available):

mybytes = bytearray()
while True:
    x = read_next_byte()
    if x == 0:
        break
    mybytes.append(x)
my_string = mybytes.decode('utf-8')

The above is the most direct translation of your original code. Interestingly, this is one of those cases where the two-argument form of iter can dramatically simplify the code: it turns your C-style stateful byte reader function into a Python iterator, which lets you do the work in one line:

# If this were Python 3 code, you'd use the bytes constructor instead of bytearray
my_string = bytearray(iter(read_next_byte, 0)).decode('utf-8')
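To try the one-liner without the hardware, here is a minimal sketch that fakes the device. make_fake_reader is a hypothetical helper, not part of the original code; it only assumes that read_next_byte() takes no arguments and returns an int from 0 to 255 on each call, as described in the question.

```python
def make_fake_reader(payload):
    """Return a zero-argument callable that hands back one int per call.

    Hypothetical stand-in for the device's read_next_byte().
    """
    stream = iter(bytearray(payload))  # iterating a bytearray yields ints
    return lambda: next(stream)

# UTF-8 bytes for u'h\xe9llo' (the e-acute takes two bytes), then the 0x00 terminator.
read_next_byte = make_fake_reader(b'h\xc3\xa9llo\x00')

# iter(callable, sentinel) keeps calling read_next_byte() until it returns 0.
my_string = bytearray(iter(read_next_byte, 0)).decode('utf-8')
```

Six payload bytes come back as a five-character unicode string, because the two bytes 0xC3 0xA9 are only decoded once both have been collected.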

edited Sep 26, 2016 at 20:06

answered Sep 26, 2016 at 19:59


2 Comments

Will Over a year ago

Fantastic. That seems to work great. So to do the opposite and encode a string into a bytearray, I could use this, right? my_bytes = bytearray(my_string, 'utf-8') and then just loop over my_bytes to send the individual bytes.

ShadowRanger Over a year ago

@Will: Yup. In Py3, it's somewhat more intuitive to do my_string.encode('utf-8') (which gets you bytes, which behave like immutable bytearrays in Py3); in Py2 though, encode gets you a str, which iterates as length-1 str objects rather than as ints from 0-255. Either way, you can iterate the result and call a write function: for b in bytearray(my_string, 'utf-8'): write_one_byte(b)
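The sending direction can be sketched the same way. Here write_one_byte is a hypothetical stand-in that just records the values it receives, and send_string is an illustrative name. Whether the device also expects a trailing 0x00 on writes is not stated in the thread, so this sketch does not send one.

```python
sent = []  # what the hypothetical device would have received

def write_one_byte(b):
    """Stand-in for the real device write; records each int 0-255."""
    sent.append(b)

def send_string(text):
    # Encode first, then wrap in bytearray so iteration yields ints
    # on both Python 2 and Python 3.
    for b in bytearray(text.encode('utf-8')):
        write_one_byte(b)

send_string(u'h\xe9llo')
```

Wrapping the encoded result in bytearray avoids the Py2/Py3 difference mentioned above, since a bytearray iterates as ints in both versions.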
