PHP UTF-8 decode not working for output returned from Python

Question

I get a reply from a Python server. I send it an article, and the Python code sends back the important tags in that article. The reply looks like this:

"keywords": "[u'Smartphone', u'Abmessung', u'Geh\xe4userand']"

I want to UTF-8 decode the Geh\xe4userand string. I read in a post that I have to wrap it in double quotes before decoding, but that is not working. My code is:

$tags = str_replace("'", '"', $tags);
$tags = preg_replace('/\[*\s*u(".*?")\]*/', "$1", $tags);
$tags = explode(',', $tags);
foreach ($tags as $tag) {
    pr(utf8_encode($tag));
}
die;

The output I am getting is:

<pre>"Smartphone"</pre><pre>"Abmessung"</pre><pre>"Geh\xe4userand"</pre>

I don't have access to the Python code.

Fix the Python code instead; it is sending you a Python list literal with a Unicode escape, not UTF-8. It should most likely send you JSON instead. The \xe4 character sequence encodes the codepoint U+00E4, but it is 4 literal ASCII characters.
– Martijn Pieters, Commented Oct 31, 2014 at 11:48
If you cannot fix the Python code, you'll have to translate every \xhh 2-hex code to the matching Latin-1 codepoint. Any \uhhhh 4-hex codes are Unicode codepoints in the BMP, \Uhhhhhhhh 8-hex codes are Unicode codepoints outside the BMP, and then there are the \n, \r and \t escape codes for newline, carriage return and tab.
– Martijn Pieters, Commented Oct 31, 2014 at 11:50
I replaced the hex escapes with the appropriate characters, since changing the Python code can't happen soon. Thanks @Martijn Pieters
– Rohan, Commented Oct 31, 2014 at 12:31

Martijn Pieters · Accepted Answer · 2014-10-31 12:41:58Z

If at all feasible, fix the Python code instead; it is sending you a Python list literal with a Unicode escape, not UTF-8. Ideally it should send you JSON instead.

The \xe4 character sequence encodes the codepoint U+00E4, but it is using 4 literal ASCII characters (\, x, e, 4).
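
To see why utf8_encode() leaves the tag untouched, it may help to compare the received text with the byte it stands for. This is only a small illustration; the variable names are made up for the example.

```php
<?php
// What the server sends: single quotes keep the backslash sequence literal,
// so this holds the four ASCII characters \, x, e and 4.
$received = 'Geh\xe4userand';

// What the escape stands for: in double quotes PHP itself turns \xe4
// into the single Latin-1 byte 0xE4.
$latin1 = "Geh\xe4userand";

echo strlen($received), "\n";                 // 14
echo strlen($latin1), "\n";                   // 11
echo bin2hex(substr($received, 3, 4)), "\n";  // 5c786534
```

Because $received is pure ASCII, any encoding conversion returns it unchanged; the escape has to be interpreted first.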

Other Python literal rules:

It'll use either single quotes or double quotes, depending on the contents, with a preference for single quotes. As a result you may have to handle escaped \' single quotes.
Newlines, carriage returns and tabs are escaped to \n, \r and \t respectively.
All other non-printable or non-ASCII characters in the Latin-1 range are escaped to \xhh, a two-digit hexadecimal encoding of the codepoint.
If the literal starts with u it is a Unicode string, not a byte string, and any codepoint outside the Latin-1 subset but part of the Basic Multilingual Plane is escaped to \uhhhh, a four-digit hexadecimal encoding of the codepoint in the range U+0100 through to U+FFFF.
In a Unicode string you'll also find \Uhhhhhhhh, an eight-digit hexadecimal encoding of non-BMP Unicode codepoints in the range U+00010000 through to U+0010FFFF.
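
If you are stuck decoding on the PHP side, the rules above could be handled roughly like this. Treat it as a minimal sketch, assuming PHP 7 or later and a reply that only contains a flat list of string literals. The function names codepoint_to_utf8, decode_python_escapes and parse_python_string_list are invented for this example, and surrogate pairs written as two \uhhhh escapes are not combined.

```php
<?php
// Hypothetical helper: encode one Unicode codepoint as UTF-8 bytes.
function codepoint_to_utf8(int $cp): string
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))
            . chr(0x80 | (($cp >> 6) & 0x3F))
            . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))
        . chr(0x80 | (($cp >> 12) & 0x3F))
        . chr(0x80 | (($cp >> 6) & 0x3F))
        . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical helper: undo the escapes inside one Python string literal body.
function decode_python_escapes(string $body): string
{
    // Single-character escapes: \n, \r, \t, \\, \' and \"
    $simple = ['n' => "\n", 'r' => "\r", 't' => "\t", '\\' => '\\', "'" => "'", '"' => '"'];
    // Groups: 1 = \xhh, 2 = \uhhhh, 3 = \Uhhhhhhhh, 4 = single-character escape
    $pattern = '/\\\\(?:x([0-9a-fA-F]{2})|u([0-9a-fA-F]{4})|U([0-9a-fA-F]{8})|([nrt\\\\\'"]))/';

    return preg_replace_callback($pattern, function (array $m) use ($simple): string {
        if (($m[4] ?? '') !== '') {
            return $simple[$m[4]];
        }
        // Only one of the hex groups is non-empty for any given match.
        $hex = ($m[1] ?? '') . ($m[2] ?? '') . ($m[3] ?? '');
        return codepoint_to_utf8((int) hexdec($hex));
    }, $body);
}

// Hypothetical helper: pull every '...' or "..." literal (optionally prefixed
// with u) out of the list text and decode it to UTF-8.
function parse_python_string_list(string $list): array
{
    $pattern = '/u?\'((?:[^\'\\\\]|\\\\.)*)\'|u?"((?:[^"\\\\]|\\\\.)*)"/s';
    preg_match_all($pattern, $list, $matches, PREG_SET_ORDER);

    $out = [];
    foreach ($matches as $m) {
        $out[] = decode_python_escapes(($m[1] ?? '') . ($m[2] ?? ''));
    }
    return $out;
}

// Nowdoc keeps the backslash literal, just like the server reply.
$reply = <<<'PY'
[u'Smartphone', u'Abmessung', u'Geh\xe4userand']
PY;
print_r(parse_python_string_list($reply));
```

For the reply in the question this should yield Smartphone, Abmessung and Gehäuserand as UTF-8 strings. It is still a workaround; having the Python side emit JSON and calling json_decode() remains the more robust route.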
