I have a device that returns a UTF-8 encoded string. I can only read from it byte-by-byte and the read is terminated by a byte of value 0x00.

I'm writing a Python 2.7 function that lets others access my device and returns the string.

In a previous design, when the device returned only ASCII, I used this loop:

my_string = ''
while True:
    x = read_next_byte()
    if x == 0:
        break
    my_string += chr(x)

Where x is the latest byte value read from the device.

Now the device can return a UTF-8 encoded string, but I'm not sure how to convert the bytes that I get back into a UTF-8 encoded string/unicode.

chr(x) understandably causes an error when x > 127, so I thought unichr(x) might work. But unichr assumes the value passed is a full Unicode code point, and I only have one byte (0-255) of the encoded character at a time.

So how can I convert the bytes that I get back from the device into a string that can be used in Python and still handle the full UTF-8 string?

Likewise, if I was given a UTF-8 string in Python, how would I break that down into individual bytes to send to my device and still maintain UTF-8?

1 Answer

The correct solution is to read until you hit the terminating byte, then decode the collected bytes from UTF-8 in one step, so that every multi-byte character is complete before decoding:

mybytes = bytearray()
while True:
    x = read_next_byte()
    if x == 0:
        break
    mybytes.append(x)
my_string = mybytes.decode('utf-8')

The above is the most direct translation of your original code. This is also one of those cases where the two-argument form of iter can dramatically simplify the code: iter(read_next_byte, 0) turns your C-style stateful byte reader into a Python iterator that keeps calling read_next_byte until it returns 0, which lets you one-line the work:

# If this were Python 3 code, you'd use the bytes constructor instead of bytearray
my_string = bytearray(iter(read_next_byte, 0)).decode('utf-8')
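To try the one-liner without the hardware, here is a minimal sketch using a stand-in reader. make_fake_reader is a hypothetical helper invented for this example, not part of any device API; it only assumes that read_next_byte returns ints and finishes with 0x00. The last few lines show why the bytes are collected before decoding: a lone lead byte of a multi-byte character cannot be decoded on its own.

```python
def make_fake_reader(payload):
    """Hypothetical stand-in for the device: returns a read_next_byte()
    that hands out the payload one int at a time, then a 0 terminator."""
    data = iter(bytearray(payload) + bytearray([0]))
    return lambda: next(data)

# UTF-8 bytes for u'h\xe9llo'; U+00E9 is encoded as the two bytes 0xC3 0xA9.
payload = b'h\xc3\xa9llo'

read_next_byte = make_fake_reader(payload)
my_string = bytearray(iter(read_next_byte, 0)).decode('utf-8')

# Decoding only the first byte of a two-byte sequence raises an error,
# which is why per-byte conversion cannot work for UTF-8.
try:
    bytearray([0xC3]).decode('utf-8')
    partial_ok = True
except UnicodeDecodeError:
    partial_ok = False
```

The result is a unicode object in Python 2 (a str in Python 3) with five characters, even though seven bytes including the terminator were read.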

2 Comments

Fantastic. That seems to work great. So to do the opposite and encode to a bytearray, I could use this, right? my_bytes = bytearray(my_string, 'utf-8') and then just loop over my_bytes to send the individual bytes.
@Will: Yup. In Py3, it's somewhat more intuitive to do my_string.encode('utf-8') (which gets you bytes, which behave like immutable bytearrays in Py3); in Py2 though, encode gets you a str, which iterates as length-1 str objects rather than as ints from 0-255. Either way, you can iterate the result and call a write function: for b in bytearray(my_string, 'utf-8'): write_one_byte(b)
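Spelled out as a function, the sending direction might look like the sketch below. write_one_byte is the hypothetical write function named in the comment above, and sending a trailing 0x00 is an assumption that the device expects the same terminator it uses when replying; drop that line if it does not.

```python
def send_string(text, write_one_byte):
    """Encode text as UTF-8 and pass each byte (an int 0-255) to write_one_byte."""
    # Wrapping in bytearray gives ints when iterating on both Py2 and Py3.
    for b in bytearray(text.encode('utf-8')):
        write_one_byte(b)
    write_one_byte(0)  # assumed terminator, mirroring the read protocol

# Stand-in for the device: collect the bytes in a list instead of sending them.
sent = []
send_string(u'h\xe9llo', sent.append)
```

Here sent ends up holding the six UTF-8 byte values followed by the 0 terminator.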
