To express, for example, the character U+10400 in JavaScript, I use "\uD801\uDC00" or String.fromCharCode(0xD801) + String.fromCharCode(0xDC00). How do I work out those two values for a given Unicode code point? I want the following:

var char = getUnicodeCharacter(0x10400);

How do I find 0xD801 and 0xDC00 from 0x10400?

  • See the Wikipedia article on UTF-16. Commented Aug 19, 2011 at 19:24
  • I can't believe that this many years later JavaScript is still in the Stone Age regarding Unicode. Having only BMP characters was something that should have gone out the door with Unicode 1.1, something like 15 years ago. Why is JavaScript still so broken? Commented Aug 20, 2011 at 3:48
  • @tchrist: because you can't change a language's basic string model without widespread application breakage. Java, .NET and Windows in general are in the same boat: most of the world is afflicted by the UTF-16 curse. Browser JavaScript has a further hurdle in that the DOM standard also requires strings to be indexed by UTF-16 code units. Commented Aug 20, 2011 at 10:06
  • @bobince: I agree that the UTF-16 Curse sucks, but it may not be insurmountable. There are still measures that can be taken. You can provide alternate libraries, available by explicit declaration, that put a code point interface on top of the original code unit one. On the other hand, the UCS-2 that afflicts JavaScript and many aspects of narrow builds of Python is a scourge, and some of the JVM languages can't make use of the code point interfaces that Java is able to provide if you ask nicely enough. Commented Aug 20, 2011 at 10:29
  • String.fromCharCode(0xD801) + String.fromCharCode(0xDC00) can be written as String.fromCharCode(0xD801, 0xDC00). Commented Feb 2, 2012 at 13:08

2 Answers

Based on the Wikipedia article on UTF-16 suggested by Henning Makholm, the following function returns the correct string for a code point:

function getUnicodeCharacter(cp) {

    if (cp >= 0 && cp <= 0xD7FF || cp >= 0xE000 && cp <= 0xFFFF) {
        // BMP code point that is not a surrogate: a single code unit
        return String.fromCharCode(cp);
    } else if (cp >= 0x10000 && cp <= 0x10FFFF) {

        // we subtract 0x10000 from cp to get a 20-bit number
        // in the range 0..0xFFFFF
        cp -= 0x10000;

        // we add 0xD800 to the number formed by the high 10 bits
        // to give the first code unit (the high surrogate)
        var first = ((0xffc00 & cp) >> 10) + 0xD800;

        // we add 0xDC00 to the number formed by the low 10 bits
        // to give the second code unit (the low surrogate)
        var second = (0x3ff & cp) + 0xDC00;

        return String.fromCharCode(first) + String.fromCharCode(second);
    }
    // surrogate values (0xD800..0xDFFF) and out-of-range input
    // fall through, so the function returns undefined for them
}
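
As a quick check of the arithmetic for the value in the question, here is a minimal sketch that splits 0x10400 into its two code units. The helper name toSurrogatePair is hypothetical (it is not part of the answer above), and the final comparison assumes an ES2015+ engine where String.fromCodePoint is available.

```javascript
// Hypothetical helper: split a supplementary code point into its surrogate pair.
function toSurrogatePair(cp) {
    if (cp < 0x10000 || cp > 0x10FFFF) {
        throw new RangeError("Not a supplementary code point: " + cp);
    }
    var offset = cp - 0x10000;            // 0x10400 - 0x10000 = 0x00400
    var high = 0xD800 + (offset >> 10);   // 0x00400 >> 10 = 0x001, giving 0xD801
    var low = 0xDC00 + (offset & 0x3FF);  // 0x00400 & 0x3FF = 0x000, giving 0xDC00
    return [high, low];
}

var pair = toSurrogatePair(0x10400);
console.log(pair.map(function (u) { return u.toString(16); })); // [ 'd801', 'dc00' ]

// Assumption: ES2015+ engine. String.fromCodePoint builds the same string natively.
console.log(String.fromCharCode(pair[0], pair[1]) === String.fromCodePoint(0x10400)); // true
```

On engines that have String.fromCodePoint, calling it directly is probably the simpler route; the manual split is mainly useful for older environments or for generating "\uXXXX\uXXXX" escapes.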

2 Comments

You can't concatenate "\u" with a hex code to get a unicode character. That is the literal syntax. To get a string from a code you must use String.fromCharCode(). This will return false: "\u0001" == "\u"+"0001" so will this: "\u0001" == "\\u"+"0001".
Well, I know :) The function purposefully returned the JavaScript literal for those code points (so, "\uD801\uDC00" for 0x10400). I modified the function to return the character instead.

How do I find 0xD801 and 0xDC00 from 0x10400?

JavaScript exposes strings as sequences of 16-bit code units (UCS-2/UTF-16). That’s why String#charCodeAt() doesn’t work the way you’d want it to: for a non-BMP character it returns one surrogate half, not the full code point.

If you want to get the code point of every Unicode character (including non-BMP characters) in a string, you could use Punycode.js’s utility functions to convert between UCS-2 strings and Unicode code points:

// String#charCodeAt() replacement that only considers full Unicode characters
punycode.ucs2.decode('𝌆'); // [119558]
punycode.ucs2.decode('abc'); // [97, 98, 99]
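
If pulling in a library is not an option, a rough equivalent can be sketched with built-ins, assuming an ES2015+ engine that provides Array.from and String.prototype.codePointAt. The helper name toCodePoints is hypothetical, and this is not how Punycode.js itself is implemented.

```javascript
// Hypothetical helper: list the code points of a string, treating each
// surrogate pair as one character.
function toCodePoints(str) {
    // Array.from uses the string iterator, which walks by code point,
    // so a surrogate pair arrives as a single two-unit element.
    return Array.from(str, function (ch) {
        return ch.codePointAt(0);
    });
}

console.log(toCodePoints('𝌆'));   // [ 119558 ]
console.log(toCodePoints('abc')); // [ 97, 98, 99 ]
```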

If you don’t need to do it programmatically though, and you’ve already got the character, just use mothereff.in/js-escapes. It will tell you how to escape any character in JavaScript.
