Full Unicode Search at 50× ICU Speed with AVX‑512
This article is about the ugliest, but potentially the most useful, piece of open-source software I’ve written this year. It’s messy because UTF-8 is messy. The world’s most widely used text encoding was designed in 1992. Today it spans a code space of more than 1.1 million code points and covers the majority of writing systems in use, so it’s not exactly trivial to work with. The example above contains multiple confusable characters: the German Eszett variants 'ß' (U+00DF, bytes 0xC3 9F) and 'ẞ' (U+1E9E, bytes 0xE1 BA 9E), the Kelvin sign 'K' (U+212A, bytes 0xE2 84 AA) and ASCII 'k' (U+006B, byte 0x6B), and Greek mu 'μ' (U+03BC, bytes 0xCE BC) vs. the micro sign 'µ' (U+00B5, bytes 0xC2 B5). Try guessing which is which and how they are encoded in UTF-8! ...
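To check the guesses, here is a minimal Python sketch that prints the code point, the UTF-8 bytes, and the case-folded form of each confusable character. It is only an illustration and not the AVX-512 implementation this article is about: the names `CONFUSABLES` and `describe` are hypothetical, and Python's built-in `str.casefold()` stands in for Unicode full case folding.

```python
import unicodedata

# Code points are written as escapes so that look-alike glyphs cannot be
# mixed up by an editor or a font. CONFUSABLES is a hypothetical name.
CONFUSABLES = [
    "\u00df",  # LATIN SMALL LETTER SHARP S
    "\u1e9e",  # LATIN CAPITAL LETTER SHARP S
    "\u212a",  # KELVIN SIGN
    "\u006b",  # LATIN SMALL LETTER K (ASCII)
    "\u03bc",  # GREEK SMALL LETTER MU
    "\u00b5",  # MICRO SIGN
]


def describe(ch: str) -> tuple[str, str, str]:
    """Return (code point, UTF-8 bytes as hex, case-folded form) for one character."""
    code_point = f"U+{ord(ch):04X}"
    utf8_hex = ch.encode("utf-8").hex(" ").upper()
    # str.casefold() applies full case folding, which may change the length:
    # both Eszett variants fold to the two-character string "ss".
    folded = ch.casefold()
    return code_point, utf8_hex, folded


if __name__ == "__main__":
    for ch in CONFUSABLES:
        code_point, utf8_hex, folded = describe(ch)
        print(f"{unicodedata.name(ch):30} {code_point}  {utf8_hex:9} folds to {folded!r}")
```

Each pair folds to the same string even though the byte sequences differ in both content and length, which is why a byte-wise comparison is not enough for case-insensitive search.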