JavaScript: compare two strings with actually different encoding

Question

I am comparing two file names in JavaScript that are presumably encoded differently, hoping to find matches:

One file name is an actual file name from an unarchived zip (using https://stuk.github.io/jszip/)
One file name is extracted from a bplist (an iOS archive format, unarchived with https://github.com/joeferner/node-bplist-parser)

Analysis

When comparing the log output in the JavaScript console, these file names look exactly identical:

15 - Beschänkt und gsägnet - PLAYBACKVERSION.mp3
15 - Beschänkt und gsägnet - PLAYBACKVERSION.mp3

Note the German umlauts.

However, when I copy and paste these strings into Notepad++ and enable the hex editor, the bytes differ:

In the first case the ä is encoded with 3 (three) bytes.
In the second case the ä is encoded with only 2 (two) bytes.

Question

How can I safely compare those two strings? Is there a general "unencode" method in JavaScript that can handle these cases? Or must I guess each encoding and then compare explicitly?

Note

I am specifically asking for a solution in JavaScript.
The question "Compare strings with different encodings", although similar, is not actually about encoding.

AKX · Accepted Answer · 2021-09-14 12:50:36Z

4

What's happening here?

If you have a String in JavaScript, it is already decoded text: a sequence of UTF-16 code units, which for the characters in question map one-to-one to Unicode codepoints. Some component has already decoded the bytes representing those strings from the ZIP or the plist into such a sequence.

That is, this question is not quite about encodings, but about Unicode decomposition and normalization forms.

It's possible to represent an ä in (at least) two different ways in Unicode (the examples below are in Python because of its useful output). Either as a single precomposed codepoint, which takes two bytes in UTF-8:

>>> import unicodedata
>>> "ä".encode("UTF-8")
b'\xc3\xa4'  # two bytes
>>> [ord(c) for c in "ä"]
[228]
>>> [unicodedata.name(c) for c in "ä"]
['LATIN SMALL LETTER A WITH DIAERESIS']

or in decomposed form (obtained here via the NFKD normalization form; NFD gives the same result for this character), taking two codepoints and three bytes in UTF-8:

>>> unicodedata.normalize("NFKD", "ä").encode("UTF-8")
b'a\xcc\x88'  # three bytes
>>> [ord(c) for c in unicodedata.normalize("NFKD", "ä")]
[97, 776]  # two codepoints
>>> [unicodedata.name(c) for c in unicodedata.normalize("NFKD", "ä")]
['LATIN SMALL LETTER A', 'COMBINING DIAERESIS']
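
The same inspection can be done in JavaScript itself. The following is a minimal sketch, assuming a runtime that provides `TextEncoder` (modern browsers and recent Node.js versions do). The helper names `codePoints` and `utf8Bytes` are made up for illustration, and the string literals use escapes so the two forms cannot be confused in an editor.

```javascript
// Precomposed form: one codepoint, U+00E4 LATIN SMALL LETTER A WITH DIAERESIS.
const composed = "\u00e4";
// Decomposed form: "a" followed by U+0308 COMBINING DIAERESIS.
const decomposed = "a\u0308";

// Hypothetical helper: list the codepoints of a string as numbers.
function codePoints(str) {
  return Array.from(str, (ch) => ch.codePointAt(0));
}

// Hypothetical helper: list the UTF-8 bytes of a string as numbers.
function utf8Bytes(str) {
  return Array.from(new TextEncoder().encode(str));
}

console.log(codePoints(composed));   // [ 228 ]
console.log(utf8Bytes(composed));    // [ 195, 164 ]      (two bytes)
console.log(codePoints(decomposed)); // [ 97, 776 ]
console.log(utf8Bytes(decomposed));  // [ 97, 204, 136 ]  (three bytes)

// Normalization converts between the two forms.
console.log(composed.normalize("NFD") === decomposed); // true
console.log(decomposed.normalize("NFC") === composed); // true
```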

Answer

Long story short: in JavaScript, call String#normalize() on both strings so that they are in the same normalization form before attempting a regular comparison. Called without an argument, normalize() uses NFC.

$ node
Welcome to Node.js v16.6.1.
Type ".help" for more information.
> var a = '15 - Beschänkt und gsägnet - PLAYBACKVERSION.mp3';
undefined
> var b = '15 - Beschänkt und gsägnet - PLAYBACKVERSION.mp3';
undefined
> a.length
50
> b.length
48
> a === b
false
> a.normalize() === b.normalize()
true
>
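
For the use case in the question (finding matches between two lists of file names), the comparison could be wrapped up roughly as follows. This is only a sketch: `sameName` and `findMatches` are hypothetical helpers, and which source delivers the decomposed form is an assumption made for the sample data.

```javascript
// Hypothetical helper: compare two strings after bringing both to NFC.
function sameName(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

// Hypothetical helper: return the entries of `candidates` that also occur
// in `names`, ignoring differences in normalization form.
function findMatches(names, candidates) {
  const known = new Set(names.map((n) => n.normalize("NFC")));
  return candidates.filter((c) => known.has(c.normalize("NFC")));
}

// Sample data with explicit escapes (assumed: ZIP name decomposed,
// plist name precomposed).
const fromZip = ["15 - Bescha\u0308nkt und gsa\u0308gnet - PLAYBACKVERSION.mp3"];
const fromPlist = [
  "15 - Besch\u00e4nkt und gs\u00e4gnet - PLAYBACKVERSION.mp3",
  "other.mp3",
];

console.log(fromZip[0] === fromPlist[0]);            // false
console.log(sameName(fromZip[0], fromPlist[0]));     // true
console.log(findMatches(fromZip, fromPlist).length); // 1
```

Normalizing once into a Set avoids comparing every pair of names. NFC is used here because it only unifies canonically equivalent sequences; the compatibility forms (NFKC, NFKD) additionally fold characters such as ligatures, which may be more than a file-name match should ignore.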

edited Sep 14, 2021 at 12:50

answered Sep 14, 2021 at 12:33

AKX


7 Comments

Rojo Over a year ago

"I am specifically asking for a solution in javascript" - OP.

AKX Over a year ago

@Rojo Yep. That's what the link in the end is. Examples are in Python because it has useful visualizations for the different normalization forms.

Pointy Over a year ago

The weird thing is that JavaScript strings are UTF-16, and lower-case "a" with diaeresis in JavaScript is 0x00E4, one UTF-16 code unit.

AKX Over a year ago

@Pointy UTF-16 is the internal memory representation JavaScript uses; that doesn't factor in here (since we're not talking about codepoints above 0xFFFF). 0xE4 is 228 in decimal, the same codepoint as in the Python illustration.

AKX Over a year ago

My guess is that the binary data was already decomposed to begin with. Since OP is talking about Apple plists, I'd wager the ZIP was created on a Mac too, and Mac file systems store filenames in NFD. You can emulate this with String.fromCodePoint(97) + String.fromCodePoint(776).
