Git treats the file name "ùàèòùèòùùè.txt" as a plain string of bytes, so when I ask git for a list of committed files, I get the following string:

r"\303\271\303\240\303\250\303\262\303\271\303\250\303\262\303\271\303\271\303\250.txt"

How can I use Python 2 to convert it back to "ùàèòùèòùùè.txt"?

  • Please specify whether you are using Python 2 or 3; this is one of the main differences between them. Commented Jun 17, 2015 at 11:24
  • How did you manage to end up with ùàèòùèòùùè.txt in your file system? Commented Jun 17, 2015 at 11:37

1 Answer

If git's output contains literal \ddd octal escape sequences (so up to 4 characters per filename byte), you can use the string_escape (Python 2) or unicode_escape (Python 3) codec to have Python interpret the escape sequences.
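To make the escape format concrete, here is a minimal Python 3 sketch that undoes the \ddd sequences by hand with a regular expression instead of a codec. The name unescape_octal is hypothetical, and the sketch assumes the input holds only three-digit octal escapes plus plain ASCII; it does not handle other escapes git may emit (such as \" or \\) or surrounding double quotes.

```python
import re

# Matches a backslash followed by three octal digits (\000 to \377).
_OCTAL_ESCAPE = re.compile(rb"\\([0-3][0-7]{2})")


def unescape_octal(raw: bytes) -> bytes:
    """Replace each literal \\ddd sequence with the single byte it names.

    Hypothetical helper for illustration; assumes only octal escapes occur.
    """
    return _OCTAL_ESCAPE.sub(lambda m: bytes([int(m.group(1), 8)]), raw)


escaped = rb"\303\271.txt"          # 12 ASCII characters, as git printed them
utf8_bytes = unescape_octal(escaped)  # b'\xc3\xb9.txt' (two bytes for one character)
text = utf8_bytes.decode("utf8")      # 'ù.txt'
```

Each four-character sequence collapses to one byte, and the resulting bytes still need a UTF-8 decode to become text, which is what the codec-based approach also does.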

You'll get UTF-8 data; my terminal is set to interpret UTF-8 directly:

>>> git_data = r"\303\271\303\240\303\250\303\262\303\271\303\250\303\262\303\271\303\271\303\250.txt"
>>> git_data.decode('string_escape')
'\xc3\xb9\xc3\xa0\xc3\xa8\xc3\xb2\xc3\xb9\xc3\xa8\xc3\xb2\xc3\xb9\xc3\xb9\xc3\xa8.txt'
>>> print git_data.decode('string_escape')
ùàèòùèòùùè.txt

You'd want to decode that as UTF-8 to get text:

>>> git_data.decode('string_escape').decode('utf8')
u'\xf9\xe0\xe8\xf2\xf9\xe8\xf2\xf9\xf9\xe8.txt'
>>> print git_data.decode('string_escape').decode('utf8')
ùàèòùèòùùè.txt

In Python 3, the unicode_escape codec gives you (Unicode) text so an extra encode to Latin-1 is required to make it bytes again:

>>> git_data = rb"\303\271\303\240\303\250\303\262\303\271\303\250\303\262\303\271\303\271\303\250.txt"
>>> git_data.decode('unicode_escape').encode('latin1').decode('utf8')
'ùàèòùèòùùè.txt'

Note that git_data is a bytes object before decoding.
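If you need the result in a variable rather than printed, the Python 3 chain above can be wrapped in a small function. This is only a sketch: decode_git_path is a hypothetical name, and it assumes the value is a bytes object containing ASCII characters and backslash escapes only, with any surrounding double quotes already removed.

```python
def decode_git_path(raw: bytes) -> str:
    """Turn git's octal-escaped file name into text (hypothetical helper).

    Assumes `raw` is ASCII plus backslash escapes and that the
    underlying file name is UTF-8 encoded.
    """
    # unicode_escape turns each \ddd into the code point with that value,
    # latin1 maps those code points back to the same byte values,
    # and utf8 finally decodes the bytes as text.
    return raw.decode("unicode_escape").encode("latin1").decode("utf8")


git_data = rb"\303\271\303\240\303\250\303\262\303\271\303\250\303\262\303\271\303\271\303\250.txt"
filename = decode_git_path(git_data)  # 'ùàèòùèòùùè.txt'
```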


5 Comments

Or just print the string
Hi, thanks! This seems to work, but the problem is that I don't need to print it, I have to set it in a string variable
@PadraicCunningham: that only works when copying the string into Python. This is not a literal, this is read from a file.
@pistacchio: the prints are there to demonstrate that the data has been decoded correctly.
@pistacchio: in other words: text = git_data.decode('string_escape').decode('utf8') gives you the Unicode text value for the given data.
