I get a reply from a Python server. I send it an article, and the Python code sends back the important tags in that article. The reply looks like this:

"keywords": "[u'Smartphone', u'Abmessung', u'Geh\xe4userand']"

I want to decode the Geh\xe4userand string so that it comes out as proper UTF-8. I read in a post that I have to put it in double quotes and then do the decoding, but that is not working. My code is:

$tags = str_replace("'",'"',$tags);
$tags = preg_replace('/\[*\s*u(".*?")\]*/', "$1", $tags);
$tags = explode(',', $tags);
foreach ($tags as $tag) {
    pr(utf8_encode($tag));
}
die;

The output I am getting is:

<pre>"Smartphone"</pre><pre>"Abmessung"</pre><pre>"Geh\xe4userand"</pre>

I don't have access to the Python code.

  • Fix the Python code instead; it is sending you a Python list literal with a Unicode escape, not UTF-8. Most likely it should send you JSON instead. The \xe4 character sequence encodes the codepoint U+00E4, but it consists of 4 literal ASCII characters. Commented Oct 31, 2014 at 11:48
  • If you cannot fix the Python code, you'll have to translate every \xhh two-digit hex code to the corresponding Latin-1 codepoint. Any \uhhhh four-digit hex codes are Unicode codepoints, \Uhhhhhhhh eight-digit hex codes are Unicode codepoints outside the BMP, and then there are the \n, \r and \t escape codes for newline, carriage return and tab. Commented Oct 31, 2014 at 11:50
  • I replaced the hex codes with the appropriate characters, since changing the Python code can't happen soon. Thanks @Martijn Pieters. Commented Oct 31, 2014 at 12:31

1 Answer

If at all feasible, fix the Python code instead; it is sending you a Python list literal with a Unicode escape, not UTF-8. Ideally it should send you JSON instead.
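As a rough sketch of what the server-side fix could look like (the keywords list and the payload shape here are assumptions made for illustration, not the actual server code), the standard json module produces output that any JSON parser can read:

```python
import json

# Hypothetical tag list, standing in for whatever the server extracts.
keywords = ["Smartphone", "Abmessung", "Geh\u00e4userand"]

# json.dumps writes a standard JSON array. With the default settings,
# non-ASCII characters are emitted as \uXXXX escapes, so the payload
# itself stays plain ASCII.
payload = json.dumps({"keywords": keywords})
print(payload)

# Round trip: a JSON parser hands back real strings, not a Python literal.
decoded = json.loads(payload)
```

On the PHP side, json_decode() should then return the tags as an array of UTF-8 strings, with no manual unescaping needed.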

The \xe4 character sequence encodes the codepoint U+00E4, but it is using 4 literal ASCII characters (\, x, e, 4).
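The difference between the escape as text and the character it stands for can be seen in a small Python 3 sketch (the variable names are only illustrative):

```python
# The text as received: a backslash followed by "x", "e", "4".
# The raw-string prefix keeps the backslash literal.
received = r"Geh\xe4userand"

# What the sender meant: a single character, U+00E4.
intended = "Geh\xe4userand"

print(len(received))  # 14
print(len(intended))  # 11

# Every character of the received text is plain ASCII ...
print(all(ord(c) < 128 for c in received))  # True
# ... while the intended string holds the codepoint U+00E4 at index 3.
print(hex(ord(intended[3])))  # 0xe4
```

This is why calling utf8_encode() on the received text changes nothing: there is no non-ASCII byte in it to convert.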

Other Python literal rules:

  • It'll use either single quotes or double quotes, depending on the contents, with a preference for single quotes. As a result you may have to handle escaped \' single quotes.
  • Newlines, carriage returns and tabs are escaped to \n, \r and \t respectively.
  • All other non-printable characters, and all non-ASCII Latin-1 characters, are escaped to \xhh, a two-digit hexadecimal encoding of the codepoint.
  • If the literal starts with u it is a Unicode string, not a byte string, and any codepoint outside the Latin-1 subset but part of the Basic Multilingual Plane is escaped to \uhhhh, a four-digit hexadecimal encoding of the codepoint in the range U+0100 through to U+FFFF.
  • In a Unicode string you'll also find \Uhhhhhhhh, an eight-digit hexadecimal encoding of non-BMP Unicode codepoints in the range U+00010000 through to U+0010FFFF.
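
If the server cannot be changed, the rules above can be applied by hand. The following is a minimal sketch in Python 3, not a complete parser for Python literals; unescape and parse_tags are hypothetical helper names, and the handling of an escaped backslash (\\) is an addition that the list above does not mention. The same regular-expression approach should carry over to PHP with preg_replace_callback().

```python
import re

# Single-character escapes: newline, carriage return, tab, both quote
# styles, and the backslash itself (an assumption beyond the list above).
SIMPLE = {"n": "\n", "r": "\r", "t": "\t", "'": "'", '"': '"', "\\": "\\"}

# One backslash followed by \xhh, \uhhhh, \Uhhhhhhhh, or any single character.
ESCAPE = re.compile(
    r"\\(?:x([0-9a-fA-F]{2})|u([0-9a-fA-F]{4})|U([0-9a-fA-F]{8})|(.))",
    re.DOTALL,
)

# A quoted string with an optional u prefix; escaped characters inside the
# body are skipped over so an escaped quote does not end the match.
TAG = re.compile(r"""u?(['"])((?:\\.|(?!\1)[^\\])*)\1""", re.DOTALL)


def unescape(text):
    """Translate Python-style escape sequences into the characters they name."""
    def repl(match):
        hex_digits = match.group(1) or match.group(2) or match.group(3)
        if hex_digits:
            return chr(int(hex_digits, 16))
        ch = match.group(4)
        # Unknown escapes are left untouched.
        return SIMPLE.get(ch, "\\" + ch)
    return ESCAPE.sub(repl, text)


def parse_tags(literal):
    """Pull every quoted string out of a list literal and unescape it."""
    return [unescape(m.group(2)) for m in TAG.finditer(literal)]


received = r"[u'Smartphone', u'Abmessung', u'Geh\xe4userand']"
tags = parse_tags(received)  # three strings; the last one contains U+00E4
```

The result is a list of real Unicode strings, which can then be encoded as UTF-8 for output. An out-of-range \Uhhhhhhhh value would raise ValueError in this sketch, so input from an untrusted source would need extra checks.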