To my great surprise, one of the biggest current users of my StringZilla library in the Python ecosystem is one of the world’s most widely used Image Augmentation libraries - Albumentations with over 5 million monthly downloads on PyPI. Last year, Albumentations swapped parts of OpenCV - the world’s most widely used image processing library with 32 million monthly downloads in Python - for my strings library 🤯
The reason for that surprising move? The quality of Look-Up Tables (LUTs) implementation in StringZilla.
What are LUTs?
A LUT is simply a 256-byte array where each byte value (0-255) maps to another byte value. In Python, you’d implement it naively like this:
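A minimal sketch of that idea, using a gamma curve (one of the mappings listed below) as the table:

```python
# Build the 256-entry table once: here, a gamma curve with gamma = 2.2,
# but any byte-to-byte mapping works the same way.
GAMMA = 2.2
LUT = [round(255 * (i / 255) ** (1 / GAMMA)) for i in range(256)]

def apply_lut(data: bytes, lut: list[int]) -> bytes:
    # The naive approach: one Python-level table lookup per input byte.
    return bytes(lut[byte] for byte in data)

pixels = bytes([0, 64, 128, 255])
corrected = apply_lut(pixels, LUT)  # the same pixels, pushed through the curve
```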
LUTs are everywhere in image processing: gamma correction, histogram equalization, color channel swapping, threshold operations.
I’ve implemented those in StringZilla for byte-level mappings: as a generic foundation for `to_lower` and `to_upper` string operations, and for translating biological alphabets, like `ACGT` to `TGCA` for DNA complements, or to {0, 1, 2, 3} for efficient storage and processing.
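In pure Python, that kind of byte-level translation is exactly what `bytes.translate` (benchmarked below) exposes. A DNA complement, for example, takes two lines:

```python
# Build a 256-byte table mapping each nucleotide to its complement;
# bytes.maketrans leaves every byte value not listed here unchanged.
complement = bytes.maketrans(b"ACGT", b"TGCA")

assert b"GATTACA".translate(complement) == b"CTAATGT"
```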
Performance Comparison
The newer your hardware, the bigger the performance gap compared to OpenCV, the industry standard for image processing. Running on server-grade Intel Sapphire Rapids and consumer-grade Apple M2 Pro CPUs, one may expect the following results:
| Library | Intel Sapphire Rapids, Short Words | Intel Sapphire Rapids, Long Lines | Apple M2 Pro, Short Words | Apple M2 Pro, Long Lines |
|---|---|---|---|---|
| Rust 🦀 | | | | |
| `to[i] = lut[from[i]]` | 0.61 GiB/s | 1.49 GiB/s | 0.16 GiB/s | 3.57 GiB/s |
| `stringzilla::lookup_inplace` | 0.54 GiB/s | 9.90 GiB/s | 0.19 GiB/s | 9.38 GiB/s |
| Python 🐍 | | | | |
| `bytes.translate` | 0.05 GiB/s | 1.92 GiB/s | 0.08 GiB/s | 2.52 GiB/s |
| `numpy.take` | 0.01 GiB/s | 0.85 GiB/s | 0.01 GiB/s | 0.47 GiB/s |
| `opencv.LUT` | 0.01 GiB/s | 1.95 GiB/s | 0.01 GiB/s | 1.60 GiB/s |
| `opencv.LUT` inplace | 0.01 GiB/s | 2.16 GiB/s | 0.02 GiB/s | 1.97 GiB/s |
| `stringzilla.translate` | 0.07 GiB/s | 7.92 GiB/s | 0.07 GiB/s | 7.49 GiB/s |
| `stringzilla.translate` inplace | 0.06 GiB/s | 8.14 GiB/s | 0.03 GiB/s | 4.87 GiB/s |
Words are around 8.5 bytes long on average. Lines are typically around 4 KB long. The benchmarks show throughput in GiB/s - higher is better.
The source of this gap? As with many performance wins I write about, it’s the clever use of SIMD instructions. SIMD (Single Instruction, Multiple Data) lets modern CPUs process multiple bytes in parallel: AVX-512 handles 64 bytes at once on Intel, while ARM NEON processes 16 bytes.
LUTs on Intel Ice Lake and Newer
For full code, refer to `sz_lookup_ice` in StringZilla’s source.
On Intel’s Ice Lake CPUs and newer, `AVX512_VBMI` (Vector Byte Manipulation Instructions) introduces several game-changing instructions for LUTs. The most important is `VPERMB`, which performs 64 parallel byte lookups from a 64-byte table. For a 256-entry alphabet, we need 4 such lookups combined with several other instructions:
- 4x `_mm512_permutexvar_epi8`, which maps to `VPERMB (ZMM, ZMM, ZMM)`:
  - On Intel Ice Lake: 3 cycles latency, on port 5
  - On AMD Genoa: 6 cycles latency, on ports 1 and 2
- 3x `_mm512_mask_blend_epi8`, which maps to `VPBLENDMB_Z (ZMM, K, ZMM, ZMM)`:
  - On Intel Ice Lake: 3 cycles latency, on ports 0 and 5
  - On AMD Genoa: 1 cycle latency, on ports 0, 1, 2, and 3
- 2x `_mm512_test_epi8_mask`, which maps to `VPTESTMB (K, ZMM, ZMM)`:
  - On Intel Ice Lake: 3 cycles latency, on port 5
  - On AMD Genoa: 4 cycles latency, on ports 0 and 1
Here’s how the algorithm works:
- LUT Partitioning: Split the 256-byte LUT into four 64-byte segments
- Bit Testing: Use the top 2 bits of each byte to determine which segment to use
- Parallel Lookups: Perform lookups in all four segments simultaneously
- Selective Blending: Use mask operations to mix parts of different segments
This roughly translates into C code along the following lines:
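Below is a simplified sketch of translating a single 64-byte vector, assuming `AVX512_VBMI` and `AVX512BW` are available. The names are illustrative, not the exact `sz_lookup_ice` internals, which also wrap this step in the alignment and tail handling described in the next section.

```c
#include <immintrin.h> // AVX-512 intrinsics (VBMI + BW)

// Illustrative sketch: translate one 64-byte vector through a 256-byte LUT.
// The four `lut_*` vectors hold the table split into 64-byte segments.
static inline __m512i lookup_64bytes(
    __m512i source_vec,
    __m512i lut_0_to_63, __m512i lut_64_to_127,
    __m512i lut_128_to_191, __m512i lut_192_to_255) {

    // Bit testing: the top two bits of every byte pick one of the four segments.
    __mmask64 is_upper_half = _mm512_test_epi8_mask(source_vec, _mm512_set1_epi8((char)0x80));
    __mmask64 is_odd_quarter = _mm512_test_epi8_mask(source_vec, _mm512_set1_epi8(0x40));

    // Parallel lookups: VPERMB only reads the low 6 bits of each index byte,
    // so the same source vector can index into all four segments at once.
    __m512i from_0_to_63 = _mm512_permutexvar_epi8(source_vec, lut_0_to_63);
    __m512i from_64_to_127 = _mm512_permutexvar_epi8(source_vec, lut_64_to_127);
    __m512i from_128_to_191 = _mm512_permutexvar_epi8(source_vec, lut_128_to_191);
    __m512i from_192_to_255 = _mm512_permutexvar_epi8(source_vec, lut_192_to_255);

    // Selective blending: pick within each half, then between the halves.
    __m512i lower_half = _mm512_mask_blend_epi8(is_odd_quarter, from_0_to_63, from_64_to_127);
    __m512i upper_half = _mm512_mask_blend_epi8(is_odd_quarter, from_128_to_191, from_192_to_255);
    return _mm512_mask_blend_epi8(is_upper_half, lower_half, upper_half);
}
```

Translating 64 bytes costs just those nine vector instructions, plus the loads and stores around them - which is where the gap over byte-at-a-time code comes from.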
Optimizing for Small and Large Inputs
For small inputs, the idea is simple - avoid the overhead of setting up AVX-512 state, and fall back to serial code. This also helps prevent CPU energy state transitions that can add latency.
For larger inputs, we want to ensure our main loop uses aligned stores to maximize throughput. Emphasis on “stores”: unaligned reads are rarely a bottleneck on modern CPUs.
This approach, sketched in code below, processes:
- Head: Unaligned portion at the beginning using masked operations
- Body: 64-byte aligned chunks using slightly faster aligned stores
- Tail: Remaining unaligned bytes at the end using masked operations
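Here is a sketch of that structure, reusing the `lookup_64bytes` helper from the previous snippet together with AVX-512 masked loads and stores and BMI2’s `_bzhi_u64` for mask construction. The bookkeeping is simplified relative to the real kernel:

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

// Illustrative sketch of the head/body/tail split; not the exact `sz_lookup_ice` code.
void lookup_avx512(char const *source, char *target, size_t length, unsigned char const *lut) {
    __m512i lut_0_to_63 = _mm512_loadu_si512(lut + 0);
    __m512i lut_64_to_127 = _mm512_loadu_si512(lut + 64);
    __m512i lut_128_to_191 = _mm512_loadu_si512(lut + 128);
    __m512i lut_192_to_255 = _mm512_loadu_si512(lut + 192);

    // Head: bytes before the first 64-byte-aligned target address, via masked load and store.
    size_t head_length = (64 - ((uintptr_t)target & 63)) & 63;
    if (head_length > length) head_length = length;
    __mmask64 head_mask = _bzhi_u64(~(uint64_t)0, (unsigned)head_length);
    __m512i head = _mm512_maskz_loadu_epi8(head_mask, source);
    _mm512_mask_storeu_epi8(
        target, head_mask,
        lookup_64bytes(head, lut_0_to_63, lut_64_to_127, lut_128_to_191, lut_192_to_255));
    source += head_length, target += head_length, length -= head_length;

    // Body: full 64-byte chunks, written back with aligned stores.
    for (; length >= 64; source += 64, target += 64, length -= 64) {
        __m512i body = _mm512_loadu_si512(source);
        _mm512_store_si512(
            target, lookup_64bytes(body, lut_0_to_63, lut_64_to_127, lut_128_to_191, lut_192_to_255));
    }

    // Tail: whatever remains, again via masked load and store.
    __mmask64 tail_mask = _bzhi_u64(~(uint64_t)0, (unsigned)length);
    __m512i tail = _mm512_maskz_loadu_epi8(tail_mask, source);
    _mm512_mask_storeu_epi8(
        target, tail_mask,
        lookup_64bytes(tail, lut_0_to_63, lut_64_to_127, lut_128_to_191, lut_192_to_255));
}
```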
Those masked operations are becoming the new norm: they are present in x86 starting with the server-grade Skylake CPUs, on Arm starting with SVE, and even on RISC-V with the “V” vector extension.
LUTs in Arm NEON
For full code, refer to `sz_lookup_neon` in StringZilla’s source.
ARM’s NEON implementation is simpler than its x86 counterpart.
NEON is ARM’s SIMD extension, available on virtually all modern ARM processors including Apple Silicon, Android phones, and AWS Graviton servers.
We don’t have 512-bit registers on most Arm systems (except for the SVE-capable Fujitsu A64FX), but that’s not our target platform. Still, there is a logical abstraction for 4x 128-bit vectors, called `uint8x16x4_t`, which is perfect for our 256-byte LUT.
A potential challenge with NEON is register pressure. AArch64 provides 32 vector registers of 128 bits each (only 512 bytes in total), while our algorithm uses:
- 4x `uint8x16x4_t`, i.e. 16x `uint8x16_t`, for the LUT
- 4x source values, 3x masks, 3x intermediaries
- 3x blend registers
So we fit within the 32-register limit anyway, but even if we didn’t, it wouldn’t be a big deal. The physical register file can be 10x larger than the architectural one, but that’s a topic for another post. Without overcomplicating things, the main body translation loop (excluding the misaligned head and tail) boils down to something like this:
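A simplified sketch of such a loop is shown below, processing one 16-byte vector per iteration for clarity. It combines the four partial lookups with bitwise ORs, whereas `sz_lookup_neon` itself unrolls further and blends with masks, as the register tally above suggests:

```c
#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

// Illustrative sketch of the aligned main loop; not the exact `sz_lookup_neon` code.
// `length` is assumed to be a multiple of 16 here.
void lookup_neon_body(uint8_t const *source, uint8_t *target, size_t length, uint8_t const *lut) {
    // The 256-byte LUT as four 64-byte tables, each held in a `uint8x16x4_t`.
    uint8x16x4_t lut_0_to_63 = vld1q_u8_x4(lut + 0);
    uint8x16x4_t lut_64_to_127 = vld1q_u8_x4(lut + 64);
    uint8x16x4_t lut_128_to_191 = vld1q_u8_x4(lut + 128);
    uint8x16x4_t lut_192_to_255 = vld1q_u8_x4(lut + 192);

    for (size_t i = 0; i + 16 <= length; i += 16) {
        uint8x16_t source_vec = vld1q_u8(source + i);

        // TBL yields zero for indices >= 64, so each quarter is addressed by shifting
        // the source bytes into the [0, 64) range with a wrapping subtraction.
        uint8x16_t from_0_to_63 = vqtbl4q_u8(lut_0_to_63, source_vec);
        uint8x16_t from_64_to_127 = vqtbl4q_u8(lut_64_to_127, vsubq_u8(source_vec, vdupq_n_u8(64)));
        uint8x16_t from_128_to_191 = vqtbl4q_u8(lut_128_to_191, vsubq_u8(source_vec, vdupq_n_u8(128)));
        uint8x16_t from_192_to_255 = vqtbl4q_u8(lut_192_to_255, vsubq_u8(source_vec, vdupq_n_u8(192)));

        // For every byte, only the correct quarter can contribute a non-zero lookup,
        // so OR-ing the four partial results reconstructs the translated vector.
        uint8x16_t result = vorrq_u8(vorrq_u8(from_0_to_63, from_64_to_127),
                                     vorrq_u8(from_128_to_191, from_192_to_255));
        vst1q_u8(target + i, result);
    }
}
```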
Daniel Lemire has discussed similar NEON LUT implementations, but the variations suggested in the comments don’t yield significant improvements over this approach.
Instead of Conclusion
Sometimes, when optimizing code, solutions come from unexpected places. A string library beating OpenCV at image processing is one such example.
These LUT kernels were integrated into Albumentations alongside other SIMD kernels from SimSIMD. Since then, StringZilla v4 has grown to include CUDA kernels for GPUs and SVE/SVE2 implementations for newer ARM processors. The same byte manipulation techniques now accelerate bioinformatics pipelines processing DNA sequences.
4x speedups mean 4x less energy wasted and 4x more we can do with the same hardware. And if you find that useful, spread the word 🤗