I'm running some strings through json_encode. Sometimes they contain binary data, which causes the encoding to fail with the error code JSON_ERROR_UTF8. Running the strings through utf8_encode first gets around this error. However, ✓ (a Unicode checkmark) then gets encoded as \u00e2\u009c\u0093, which, when interpreted by JavaScript and rendered in the browser, looks like â.

How can I fix this? Is there another encoding I can use?


echo json_encode(utf8_encode('✓')); // "\u00e2\u009c\u0093"

Now press F12 and paste that into your JavaScript console (quotes included). It should output â.


Please note that

echo json_encode('✓'); // "\u2713"

Works as intended. The issue is that sometimes the string will contain binary data which json_encode can't handle, so I need to sanitize every string without breaking the strings it can handle.


More examples:

json_encode(chr(200));               // false (bad)
json_encode(utf8_encode(chr(200)));  // "\u00c8" (good)
json_encode('✓');                    // "\u2713" (good)
json_encode(utf8_encode('✓'));       // "\u00e2\u009c\u0093" (bad)

So you see, encoding it works well for some strings and breaks others.
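Looking at the raw bytes may make the difference clearer. The following is only a minimal sketch, assuming the mbstring extension is loaded and PHP 7.0+ (for the "\u{...}" escape). It uses mb_convert_encoding from ISO-8859-1 as a stand-in for utf8_encode, since both treat every input byte as one Latin-1 character:

```php
<?php
// Assumes the mbstring extension is loaded; "\u{...}" escapes need PHP 7.0+.
$check = "\u{2713}";                  // the checkmark as valid UTF-8
$checkHex = bin2hex($check);          // e29c93 (three bytes)

// Same byte-for-byte mapping that utf8_encode performs (ISO-8859-1 -> UTF-8):
// each of the three bytes becomes its own two-byte character.
$double = mb_convert_encoding($check, 'UTF-8', 'ISO-8859-1');
$doubleHex = bin2hex($double);        // c3a2c29cc293

echo $checkHex, "\n";
echo $doubleHex, "\n";
echo json_encode($double), "\n";      // "\u00e2\u009c\u0093"
```

So the string that was already UTF-8 gets encoded a second time, which is where the three separate \u00.. escapes come from.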

This is strictly for logging. I don't care if the binary data comes out weird, I just don't want it to mess with valid strings.

  • Can you show your example PHP and JS code? Commented Aug 21, 2014 at 19:15
  • Maybe the problem lies in the document charset. Did you try adding <meta charset="UTF-8"> in the head of the HTML document? Commented Aug 21, 2014 at 19:20
  • @hek2mgl I pretty much gave it to you, but nevertheless, I updated the question. Commented Aug 21, 2014 at 19:20
  • This question is unanswerable. Encoding arbitrary binary data is one thing, keeping UTF-8 characters intact is something completely separate. What's to stop 0xe29c93 from being interpreted as ✓ when it shows up in your binary data? Commented Aug 21, 2014 at 19:27
  • chr(200) isn't a valid Unicode char. Commented Aug 21, 2014 at 19:31

2 Answers

Running strings through this function

function _utf8($str) {
    if(!mb_check_encoding($str, 'UTF-8')) {
        return utf8_encode($str);
    }
    return $str;
}

(taken and modified from here)

Seems to give the results I'm after.

Checkmarks are left alone, but chr(200) and other weirdness is encoded:

json_encode(_utf8('✓'));      // "\u2713"
json_encode(_utf8(chr(200))); // "\u00c8"
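
For log entries that are arrays rather than single strings, the same check can be applied recursively. This is only a sketch under a few assumptions: the names to_utf8 and sanitize_for_log are made up for illustration, the mbstring extension is loaded, and mb_convert_encoding from ISO-8859-1 is swapped in for utf8_encode because utf8_encode is deprecated as of PHP 8.2.

```php
<?php
// Hypothetical helper: leave valid UTF-8 alone, otherwise treat the bytes
// as ISO-8859-1 and convert them (the same mapping utf8_encode uses).
function to_utf8(string $str): string
{
    if (mb_check_encoding($str, 'UTF-8')) {
        return $str;
    }
    return mb_convert_encoding($str, 'UTF-8', 'ISO-8859-1');
}

// Hypothetical helper: walk nested arrays and sanitize every string value.
// Array keys are not touched, so an invalid key would still break json_encode.
function sanitize_for_log($value)
{
    if (is_string($value)) {
        return to_utf8($value);
    }
    if (is_array($value)) {
        return array_map('sanitize_for_log', $value);
    }
    return $value;
}

$entry = ['ok' => "\u{2713}", 'raw' => chr(200), 'n' => 3];
$json = json_encode(sanitize_for_log($entry));
echo $json, "\n"; // {"ok":"\u2713","raw":"\u00c8","n":3}
```

Binary data that happens to be valid UTF-8 passes the check untouched, which is fine for logging but means the output cannot be mapped back to the original bytes.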

EDIT: This question is unanswerable. Encoding arbitrary binary data is one thing, keeping UTF-8 characters intact is something completely separate. What's to stop the byte sequence 0xe29c93 from being interpreted as ✓ when it shows up in your binary data?

According to the json_encode PHP reference page, you can use the following syntax to encode Unicode characters:

json_encode($data, JSON_UNESCAPED_UNICODE);

This should make json_encode pass Unicode characters through unescaped instead of writing them as \uXXXX sequences.
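
Note that this flag only changes how valid characters are written; it does not make invalid input acceptable. As a hedged aside, PHP 7.2 and later also provide JSON_INVALID_UTF8_SUBSTITUTE, which replaces invalid bytes with U+FFFD instead of failing. A minimal sketch, assuming PHP 7.2+:

```php
<?php
// Assumes PHP 7.2+ for JSON_INVALID_UTF8_SUBSTITUTE.
$valid  = "\u{2713}"; // the checkmark as valid UTF-8
$binary = chr(200);   // a lone 0xC8 byte, not valid UTF-8

// JSON_UNESCAPED_UNICODE does not rescue invalid input.
$unescaped = json_encode($binary, JSON_UNESCAPED_UNICODE); // false

// Valid strings are unaffected by the substitute flag.
$a = json_encode($valid, JSON_INVALID_UTF8_SUBSTITUTE);    // "\u2713"

// Invalid bytes become U+FFFD rather than making the call fail.
$b = json_encode($binary, JSON_INVALID_UTF8_SUBSTITUTE);   // "\ufffd"

var_dump($unescaped);
echo $a, "\n", $b, "\n";
```

The trade-off is that the original byte value is lost, which may be acceptable when the output is only read as a log.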

4 Comments

Tried it already. Doesn't work: json_encode(chr(200),JSON_UNESCAPED_UNICODE) yields false.
re: "What's to stop..." I don't actually care if that shows up in my binary data. I just need it not to break (return false) for data it can't handle.
@Mark Then transfer it into an encoding it will always be able to handle. For example, base64 encode it.
That would make valid strings illegible. It's for logging. I want to be able to read the valid strings. I'll visually ignore any binary data.
