
ACME Updates

10mar2019 Unicode in JavaScript

I was doing some Unicode stuff in JavaScript today. I needed to extract the code points from a string. You might think that the way to get the code point at a given position in a string is:


cp = str.codePointAt( i );

Hah hah, no. The index there is a UTF-16 code unit offset, not a character position. Those are the same thing as long as everything is in the 16-bit Basic Multilingual Plane, but characters outside it, such as many emoji, occupy two code units each (a surrogate pair). So your positions drift, and any index that lands on the second half of a pair gives you garbage.
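
A quick demonstration you can paste into a console (the pile-of-poo emoji is U+1F4A9, outside the BMP):

s = "a💩b";
s.length;            // 4, not 3 - the emoji counts as two code units
s.codePointAt( 1 );  // 128169 - fine, index 1 is the first half of the pair
s.codePointAt( 2 );  // 56489 - garbage: the lone second half of the pair
s.codePointAt( 3 );  // 98 - "b" has drifted to index 3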

To handle code points beyond the 16-bit range, you instead do it like so:


chars = Array.from( str );
cp = chars[i].codePointAt( 0 );

Array.from() knows how to correctly split a string into individual code points. Why does something as generically named as Array.from() have intimate knowledge of Unicode? It turns out it doesn't: Array.from() just consumes the string's built-in iterator, and it's the string iterator that is defined to step through whole code points rather than 16-bit code units.
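
Anything else that goes through that iterator gets the same correct splitting. A quick sketch, again pasteable into a console:

[ ..."a💩b" ];   // [ "a", "💩", "b" ] - spread uses the same iterator

for ( const ch of "a💩b" )
    console.log( ch.codePointAt( 0 ) );   // 97, then 128169, then 98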

And why does codePointAt() correctly handle high-plane code points here but not before? There's no invisible encoding flag; the single-character strings produced by Array.from() are ordinary UTF-16 strings. codePointAt() always decodes a full surrogate pair when the index points at its first half. The problem before was that stepping an index by ones kept landing on second halves; here, each string holds exactly one character, so index 0 is always the start of it.
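
You can check that the single-character string holding the emoji is still two code units under the hood:

poo = "💩";
poo.length;            // 2 - still a surrogate pair, no magic flag
poo.codePointAt( 0 );  // 128169 - index 0 is the pair's first half, so it decodes
poo.codePointAt( 1 );  // 56489 - the trailing-surrogate garbage again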

Anyway, that's how you extract Unicode code points in JavaScript. Thank you for coming to my TED talk.
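
Bonus: Array.from() also takes a map function as its second argument, so the whole extraction collapses to a one-liner. A sketch, where codePoints is just my name for the helper, not anything standard:

function codePoints( str ) {
    return Array.from( str, ( ch ) => ch.codePointAt( 0 ) );
}

codePoints( "a💩b" );   // [ 97, 128169, 98 ]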

