Python: Bytes to string with accented characters

Question

Git reads the file name "ùàèòùèòùùè.txt" as a plain string of bytes, so when I ask git for a list of committed files, I get the following string:

r"\303\271\303\240\303\250\303\262\303\271\303\250\303\262\303\271\303\271\303\250.txt"

How can I use Python 2 to turn it back into "ùàèòùèòùùè.txt"?

Please specify whether you are using Python 2 or 3; this is one of the main differences between them.
– mmdanziger, Commented Jun 17, 2015 at 11:24
How did you manage to end up with ùàèòùèòùùè.txt in your file system?
– John Dvorak, Commented Jun 17, 2015 at 11:37

Martijn Pieters · Accepted Answer · 2015-06-17 11:31:20Z

If the git format contains literal \ddd sequences (so up to 4 characters per filename byte) you can use the string_escape (Python 2) or unicode_escape (Python 3) codecs to have Python interpret the escape sequences.
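To make the escapes concrete, here is a minimal Python 3 sketch of what one pair of them stands for. Each \ddd is the octal value of a single byte, and the two bytes together form the UTF-8 encoding of one character. The names octal_digits and raw are only illustrative.

```python
# Python 3 sketch: what the escape pair \303\271 from the git output stands for.
octal_digits = ["303", "271"]

# Interpret each group of digits as base 8 to get one byte value per escape.
raw = bytes(int(digits, 8) for digits in octal_digits)

print(raw)                  # b'\xc3\xb9'
print(raw.decode("utf-8"))  # ù
```

The codecs described above do this same conversion for every escape in the string in one pass.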

You'll get UTF-8 data; my terminal is set to interpret UTF-8 directly:

>>> git_data = r"\303\271\303\240\303\250\303\262\303\271\303\250\303\262\303\271\303\271\303\250.txt"
>>> git_data.decode('string_escape')
'\xc3\xb9\xc3\xa0\xc3\xa8\xc3\xb2\xc3\xb9\xc3\xa8\xc3\xb2\xc3\xb9\xc3\xb9\xc3\xa8.txt'
>>> print git_data.decode('string_escape')
ùàèòùèòùùè.txt

You'd want to decode that as UTF-8 to get text:

>>> git_data.decode('string_escape').decode('utf8')
u'\xf9\xe0\xe8\xf2\xf9\xe8\xf2\xf9\xf9\xe8.txt'
>>> print git_data.decode('string_escape').decode('utf8')
ùàèòùèòùùè.txt

In Python 3, the unicode_escape codec gives you (Unicode) text, so an extra encode to Latin-1 is required to turn it back into bytes before decoding as UTF-8:

>>> git_data = rb"\303\271\303\240\303\250\303\262\303\271\303\250\303\262\303\271\303\271\303\250.txt"
>>> git_data.decode('unicode_escape').encode('latin1').decode('utf8')
'ùàèòùèòùùè.txt'

Note that git_data is a bytes object before decoding.
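If you need the result in a variable rather than printed, the Python 3 steps can be wrapped in a small function. This is only a sketch: decode_git_path is a hypothetical name, and it assumes the input is a bytes object containing only the octal escapes shown in the question, with the file name encoded as UTF-8.

```python
def decode_git_path(escaped: bytes) -> str:
    """Hypothetical helper: turn git's octal-escaped path bytes into text."""
    # unicode_escape maps each \ddd escape to the code point with that value.
    text = escaped.decode("unicode_escape")
    # Latin-1 maps code points 0-255 straight back to the same byte values.
    raw = text.encode("latin1")
    # Assumption: the original file name was encoded as UTF-8.
    return raw.decode("utf-8")


git_data = rb"\303\271\303\240\303\250\303\262\303\271\303\250\303\262\303\271\303\271\303\250.txt"
filename = decode_git_path(git_data)
print(filename)  # ùàèòùèòùùè.txt
```

The Latin-1 round trip works because that encoding maps each of the first 256 code points to the byte with the same value, so no information is lost between the two decode steps.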

edited Jun 17, 2015 at 11:31

answered Jun 17, 2015 at 11:28

Martijn Pieters


5 Comments

Padraic Cunningham Over a year ago

Or just print the string

pistacchio Over a year ago

Hi, thanks! This seems to work, but the problem is that I don't need to print it, I have to set it in a string variable

Martijn Pieters Over a year ago

@PadraicCunningham: that only works when copying the string into Python. This is not a literal, this is read from a file.

Martijn Pieters Over a year ago

@pistacchio: the prints are there to demonstrate that the data has been decoded correctly.

Martijn Pieters Over a year ago

@pistacchio: in other words: text = git_data.decode('string_escape').decode('utf8') gives you the Unicode text value for the given data.
