
I have some problems regarding escape characters.

Problem I:

I have a string in the form of:

String = "%C3%85"

String represents the two bytes that encode the character "Å" in UTF-8, except that each "\x" is replaced with "%".

So I want to alter String to look like this:

String = "\xC3\x85"

Problem II:

I have a String in the form:

String = "\\x33"

Now I want to convert it into its UTF-8 byte representation, which should look like:

String = b"\x33"

How do I do that?

Approaches I tried:

I tried using the replace method:

string.replace("%","\")  -- won't work since \ escapes "
string.replace("%","\\") -- won't work since this produces problem II
string.replace("%","\x00").replace("00","") -- won't work since "\x00" is a single character of its own.

bytes(string.replace("%","\\") ) -- won't work since this basically comes down to problem II

One approach that works but is way more work than seems to be needed is to create a dictionary with all characters in the form of:

"%00" = "\x00"
...
...

But this should be automatable, since it is basically just replacing % with \x.

I am out of luck and couldn't find any help anywhere on the internet.

lmgtfy won't help me either ;)

Thanks for any help!

  • What approaches did you try, so we don't give you the same approach? Commented Nov 27, 2017 at 18:55
  • I tried using the replace method: string.replace("%","\") -- won't work since \ escapes "; string.replace("%","\\") -- won't work since this produces problem II; string.replace("%","\x00").replace("00","") -- won't work since "\x00" is a single character of its own; bytes(string.replace("%","\\") ) -- won't work since this basically comes down to problem II. One approach that works but is way more work than seems to be needed is to create a dictionary with all characters in the form of: "%00" = "\x00" Commented Nov 27, 2017 at 19:04

2 Answers


Both problems can probably be solved with the standard library.

Problem I looks like URL encoding (percent-encoding), i.e. the kind of "garbling" you see in query strings in the browser's address bar. In Python 3, the urllib.parse module can handle this:

>>> import urllib.parse
>>> urllib.parse.unquote('%C3%85')
'Å'
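
If the goal is the raw bytes rather than the decoded character (which the question's b"\x33" example suggests), the same module also provides unquote_to_bytes(). A minimal sketch; the variable names are only illustrative:

```python
import urllib.parse

quoted = "%C3%85"

# unquote() percent-decodes and then decodes the resulting bytes as UTF-8 by default.
text = urllib.parse.unquote(quoted)

# unquote_to_bytes() stops after percent-decoding and returns the raw bytes.
raw = urllib.parse.unquote_to_bytes(quoted)

print(text)
print(raw)
```

Decoding raw with raw.decode("utf-8") should give the same string as unquote().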

For Problem II, you seem to have escape sequences as they are used in Python's string literals. As you might know, you can type 'å' or '\xe5' in the source code to get exactly the same string, just as you can type 0.1, .1 or 1e-1 to get the same float value. Since the Python interpreter sees the four characters \, x, e and 5 in your source code, it must have a way to convert this sequence into the character å. And (part of) this algorithm is made available to Python programmers through the "unicode_escape" codec, which you can use like "normal" codecs such as "utf-8":

>>> '\\x33'.encode('ascii').decode('unicode_escape')
'3'

Since Python 3's str type has no decode() method, you have to encode it to bytes first. If your input contains ASCII characters only, the above line works; also "latin-1" is possible for a mixture of Latin-1 characters and \xNN escapes.
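
To end up with a bytes object instead of a str, one option is to encode the result back with Latin-1, which maps code points 0-255 one-to-one onto byte values. The following is a sketch under that assumption; unescape_to_bytes is a hypothetical helper name, not a standard library function:

```python
def unescape_to_bytes(escaped: str) -> bytes:
    """Turn literal backslash escapes such as '\\x33' into the bytes they name.

    Assumes the input holds only Latin-1 characters and \\xNN-style escapes;
    escapes naming code points above 255 would make the final encode() fail.
    """
    # str -> bytes, so that the codec's decode() can be applied.
    as_bytes = escaped.encode("latin-1")
    # Interpret the escapes; each \xNN becomes the code point U+00NN.
    as_text = as_bytes.decode("unicode_escape")
    # Code points below 256 map one-to-one onto bytes under Latin-1.
    return as_text.encode("latin-1")


print(unescape_to_bytes("\\x33"))
print(unescape_to_bytes("\\xC3\\x85").decode("utf-8"))
```

The second call combines both problems: the escaped UTF-8 bytes are recovered first and only then decoded to text.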


5 Comments

I like this answer a lot; however, I can't seem to find a usable explanation of how to use the unicode_escape codec. From the description: "Encoding suitable as the contents of a Unicode literal in ASCII-encoded Python source code, except that quotes are not escaped. Decodes from Latin-1 source code. Beware that Python source code actually uses UTF-8 by default." I wouldn't have been able to tell that this codec would solve this issue. Can you give me a short guide to what unicode_escape is capable of, or some guidelines or links so I can teach myself?
@Nord.Kind You are absolutely right, I wrote my answer in haste and didn't explain anything. I edited the answer to include a bit of background.
Thanks a lot for clarifying. However, may I ask you for a little guide to the "unicode_escape" encoding? I can't seem to find any explanation of what to expect from it in different cases. What, for example, does this sentence tell me? "Encoding suitable as the contents of a Unicode literal in ASCII-encoded Python source code, except that quotes are not escaped" A normal string in Python 3 is a Unicode string afaik, so the encoding makes the decoded bytestring usable for this (that's what I would expect of any encoding). However, Python source code should be UTF-8 encoded afaik.
So I am not sure what to make of this. Also, "except that quotes are not escaped": how can quotes not be escaped? I don't quite understand what they mean.
It's cumbersome to explain this in a comment. Can you post a new question about it? Send me a link here once you did so, and I'll try to answer it (or somebody else who sees it before me).

The problem is that you have a string representation of a hex-encoded byte array. You need to parse each pair of hex digits into a byte value, then let Python decode those bytes as UTF-8. Try this:

import re 

String = "%C3%85"
out = bytearray(int(c, 16) for c in re.findall(r'%(\w\w)', String)).decode('utf8')
out
# returns:
'Å'

For your second part, the bytes representation of '\x33' is b'3'. To get from the string '\\x33' to b'3', you again need to strip out the escape prefix, parse the hex digits as integers, and convert them to bytes.

String = '\\x33'
out = bytes(int(c, 16) for c in re.findall(r'\\x(\w\w)', String))
out
# returns:
b'3'
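
Both snippets follow the same pattern, so they could be folded into one helper. The sketch below is an illustration, not part of the original answer: hex_escapes_to_bytes and HEX_PAIR are hypothetical names, and the pattern is tightened from \w\w to two hex digits so that int(..., 16) cannot be handed something like "zz".

```python
import re

# Two hex digits; stricter than \w\w, which would also match non-hex
# characters such as "zz" and make int(..., 16) raise ValueError.
HEX_PAIR = r"([0-9A-Fa-f]{2})"


def hex_escapes_to_bytes(text: str, prefix: str) -> bytes:
    """Collect every <prefix>NN pair in text and return the bytes they name.

    Characters that are not part of such a pair are ignored, as in the
    snippets above.
    """
    pattern = re.escape(prefix) + HEX_PAIR
    return bytes(int(pair, 16) for pair in re.findall(pattern, text))


print(hex_escapes_to_bytes("%C3%85", "%").decode("utf8"))
print(hex_escapes_to_bytes("\\x33", "\\x"))
```

Passing the prefix as a parameter keeps the percent form and the backslash form on the same code path; re.escape() makes sure the backslash in "\\x" is matched literally.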
