
I am comparing two file names in JavaScript that are presumably encoded differently, hoping to find matches.

Analysis

When I compare the log output in the JavaScript console, these file names look exactly identical:

15 - Beschänkt und gsägnet - PLAYBACKVERSION.mp3
15 - Beschänkt und gsägnet - PLAYBACKVERSION.mp3

Note the German umlauts.

Now, when I just copy and paste these strings into Notepad++ and enable the hex editor, it looks like this:

[Image: the above text in hex]

  • In the first case the A-Umlaut is encoded with 3 (three) bytes
  • In the second case the A-Umlaut is encoded with only 2 (two) bytes.

Question

How can I safely compare those two strings? Is there a general "unencode" method in JavaScript that can handle these cases? Or must I guess each encoding and then compare explicitly?

Note

1 Answer


What's happening here?

A String in JavaScript is already decoded text: a sequence of UTF-16 code units that represent Unicode code points. Some component has already decoded the bytes from the ZIP or the plist into that form.

That is, this question is not quite about encodings, but about Unicode decomposition and normalization forms.

Unicode can represent an ä in (at least) two different ways (the examples below are in Python because its output is easier to read): as a single precomposed code point, taking two bytes in UTF-8,

>>> import unicodedata
>>> "ä".encode("UTF-8")
b'\xc3\xa4'  # two bytes
>>> [ord(c) for c in "ä"]
[228]
>>> [unicodedata.name(c) for c in "ä"]
['LATIN SMALL LETTER A WITH DIAERESIS']

or in decomposed form (NFD, or NFKD as used here; the two give the same result for ä), taking two code points and three bytes in UTF-8.

>>> unicodedata.normalize("NFKD", "ä").encode("UTF-8")
b'a\xcc\x88'  # three bytes
>>> [ord(c) for c in unicodedata.normalize("NFKD", "ä")]
[97, 776]  # two codepoints
>>> [unicodedata.name(c) for c in unicodedata.normalize("NFKD", "ä")]
['LATIN SMALL LETTER A', 'COMBINING DIAERESIS']
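
For readers who would rather stay in JavaScript, roughly the same inspection can be sketched as follows. The helper names `codePoints` and `utf8Length` are made up for this illustration, and the sketch assumes a runtime with a global `TextEncoder` (current browsers and recent Node.js versions). The characters are written as escapes so that the normalization form of the source file itself cannot change the result.

```javascript
// Both spellings of "ä", written as escapes to keep them unambiguous.
const composed = "\u00e4";    // LATIN SMALL LETTER A WITH DIAERESIS
const decomposed = "a\u0308"; // LATIN SMALL LETTER A + COMBINING DIAERESIS

// Hypothetical helper: list the code points of a string as 4-digit hex.
function codePoints(s) {
  return Array.from(s, (ch) => ch.codePointAt(0).toString(16).padStart(4, "0"));
}

// Hypothetical helper: number of bytes the string takes in UTF-8.
function utf8Length(s) {
  return new TextEncoder().encode(s).length;
}

console.log(codePoints(composed), utf8Length(composed));     // one code point, 2 bytes
console.log(codePoints(decomposed), utf8Length(decomposed)); // two code points, 3 bytes
console.log(composed === decomposed);                        // false
console.log(composed.normalize("NFD") === decomposed);       // true
console.log(decomposed.normalize("NFC") === composed);       // true
```

The two strings render identically but only compare equal once both are in the same normalization form.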

Answer

Long story short: in JavaScript, call String#normalize() on both strings (with no argument it normalizes to NFC) so that they are in the same normalization form before comparing them with ===.

$ node
Welcome to Node.js v16.6.1.
Type ".help" for more information.
> var a = '15 - Beschänkt und gsägnet - PLAYBACKVERSION.mp3';
undefined
> var b = '15 - Beschänkt und gsägnet - PLAYBACKVERSION.mp3';
undefined
> a.length
50
> b.length
48
> a === b
false
> a.normalize() === b.normalize()
true
>
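
If many names need to be matched, the comparison can be wrapped in a small helper. This is only a sketch: `sameFileName`, `indexByNormalizedName`, `fromArchive` and `fromPlist` are hypothetical names, and the sample data assumes one name arrives decomposed and the other precomposed. NFC is chosen explicitly here; NFD would work just as well as long as both sides use the same form. The compatibility forms (NFKC/NFKD) additionally fold characters such as the "ﬁ" ligature into "fi", which may or may not be wanted for file names.

```javascript
// Hypothetical helper: compare two names after bringing both to NFC.
function sameFileName(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

// Hypothetical helper: index names by their NFC form for repeated lookups.
// The map value keeps the original, un-normalized name.
function indexByNormalizedName(names) {
  const index = new Map();
  for (const name of names) {
    index.set(name.normalize("NFC"), name);
  }
  return index;
}

// Assumed sample data: decomposed on one side, precomposed on the other.
const fromArchive = ["15 - Bescha\u0308nkt und gsa\u0308gnet - PLAYBACKVERSION.mp3"];
const fromPlist = "15 - Besch\u00e4nkt und gs\u00e4gnet - PLAYBACKVERSION.mp3";

const index = indexByNormalizedName(fromArchive);

console.log(fromArchive[0].length, fromPlist.length);  // 50 48
console.log(fromArchive[0] === fromPlist);             // false
console.log(sameFileName(fromArchive[0], fromPlist));  // true
console.log(index.has(fromPlist.normalize("NFC")));    // true
```

Keeping the original name as the map value means the file can still be opened under the exact name the file system or archive reported.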

Comments

"I am specifically asking for a solution in javascript" - OP.
@Rojo Yep. That's what the link at the end is. The examples are in Python because it has useful visualizations for the different normalization forms.
The weird thing is that JavaScript strings are UTF-16, and lower-case "a" with diaeresis in JavaScript is 0x00E4, a single UTF-16 code unit.
@Pointy UTF-16 is the internal memory representation JavaScript uses. That doesn't factor in here, since we're not talking about code points above 0xFFFF. 0xE4 is 228 in decimal, the same code point as in the Python illustration.
My guess is that the data was already decomposed to begin with. Since OP is talking about Apple plists, I'd wager the ZIP was created on a Mac too, and Mac file systems have traditionally stored filenames in decomposed form (NFD). You can emulate this with String.fromCodePoint(97) + String.fromCodePoint(776).