Over the years, Intel’s 512-bit Advanced Vector eXtensions (AVX-512) have stirred extensive discussion. Though introduced in 2014, it wasn’t until recently that CPUs began providing comprehensive support. Similarly, Arm’s Scalable Vector Extensions (SVE), primarily designed for Arm servers, have only lately started making waves. The computing landscape now looks quite different, with powerhouses like Intel’s Sapphire Rapids CPUs, AWS Graviton 3, and Ampere Altra entering the fray. Their arrival brings compelling advantages over the traditional AVX2 and NEON extensions:
- Masked Loads: Enhanced data processing through selective loading.
- Half-Precision Floating Point Math: Speedier computations with less memory usage.
These enhancements shine in the latest SimSIMD release.
Do you use #NumPy or #SciPy to compute vector similarity? For @OpenAI Ada embeddings, SimSIMD does it 3-200x faster
— Ash Vardanian (@ashvardanian) October 5, 2023
float16 works great on every arch - Arm NEON & SVE and x86 AVX2 & AVX-512. Intel AMX also looks excellent. https://t.co/ouFdvMJd7g #opensource #Python
This library now boasts:
- Speeds up to 200x faster than SciPy and NumPy,
- Compatibility with NEON, SVE, AVX2, and AVX-512 extensions,
- Support for Inner Product, Euclidean, Angular, Hamming, Jaccard distances, and
- Acceptance of `f32`, `f16`, `i8`, and binary vectors.
If this sparks your interest, go ahead and `pip install simsimd`. The library is available for both public and commercial projects under the Apache 2.0 license. For those keen on delving deeper into the mechanics behind these advancements, understanding how AVX-512 and SVE make SIMD code cleaner, and how they accelerate half-precision AI and Machine Learning workloads, continue reading!
The appendix contains performance benchmarks for:
- Apple M2 Pro,
- 4th Gen Intel Xeon Platinum (8480+),
- AWS Graviton 3.
“Tails of the past”: The Significance of Masked Loads
Think of NEON as a container that’s always 128 bits wide, while AVX is a wider container at 256 bits. But real-world data can come in all sorts of sizes, and this sometimes causes complications, especially when we’re not just working with standard math libraries.
- For tasks like measuring the distance between dense data points, manually guiding the computer can be 2-5 times faster than letting it figure things out on its own.
- For trickier tasks, like processing strings of text or handling varying bit rates, manually guiding can be even 20-50 times faster.
Now, with this in mind, you might expect to see a piece of code from StringZilla. But this isn’t just about speed; it’s also about writing neat and concise code. One thing that bothers me is splitting a piece of code into three parts because the data doesn’t fit perfectly.
However, if you’re not chasing the absolute best speed, you can simplify it to two parts. Just skip the beginning and use a more flexible approach in the main body of the code. With tools like NEON and AVX, making this kind of distance calculation might look like this:
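Below is a simplified AVX2 sketch of that two-part structure, using squared Euclidean distance over `f32` values as the example; it stands in for SimSIMD’s actual kernel rather than reproducing it verbatim:

```c
#include <immintrin.h> // AVX2 + FMA intrinsics
#include <stddef.h>

// Squared Euclidean distance: a SIMD body for full 8-float steps,
// followed by a plain serial "tail" for whatever remains.
static float l2sq_f32_avx2(float const *a, float const *b, size_t n) {
    __m256 sum_vec = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 a_vec = _mm256_loadu_ps(a + i);
        __m256 b_vec = _mm256_loadu_ps(b + i);
        __m256 d_vec = _mm256_sub_ps(a_vec, b_vec);
        sum_vec = _mm256_fmadd_ps(d_vec, d_vec, sum_vec);
    }
    // Horizontal reduction of the eight partial sums.
    __m128 lo = _mm256_castps256_ps128(sum_vec);
    __m128 hi = _mm256_extractf128_ps(sum_vec, 1);
    __m128 sum128 = _mm_add_ps(lo, hi);
    sum128 = _mm_hadd_ps(sum128, sum128);
    sum128 = _mm_hadd_ps(sum128, sum128);
    float sum = _mm_cvtss_f32(sum128);
    // The tail: up to 7 leftover elements, handled one by one.
    for (; i < n; ++i) {
        float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}
```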
To avoid doing the tail, you can use the `_mm256_maskload_ps` for masking, but it’s only available for 32- and 64-bit entries. Now, here’s where AVX-512 makes things better. It can handle uneven data sizes more gracefully, so we can get rid of the “tail” for `f32`, `f16`, `i8`, and other data types.
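With AVX-512, the same kernel can fold the remainder into a single masked step instead of a scalar tail. Again, this is a simplified sketch rather than SimSIMD’s verbatim code:

```c
#include <immintrin.h> // AVX-512 F intrinsics
#include <stddef.h>

// Squared Euclidean distance without a scalar tail: the last partial step
// loads only the remaining elements through a per-lane mask.
static float l2sq_f32_avx512(float const *a, float const *b, size_t n) {
    __m512 sum_vec = _mm512_setzero_ps();
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512 a_vec = _mm512_loadu_ps(a + i);
        __m512 b_vec = _mm512_loadu_ps(b + i);
        __m512 d_vec = _mm512_sub_ps(a_vec, b_vec);
        sum_vec = _mm512_fmadd_ps(d_vec, d_vec, sum_vec);
    }
    if (i < n) {
        // One bit per remaining element; masked-off lanes are zeroed.
        __mmask16 mask = (__mmask16)((1u << (n - i)) - 1u);
        __m512 a_vec = _mm512_maskz_loadu_ps(mask, a + i);
        __m512 b_vec = _mm512_maskz_loadu_ps(mask, b + i);
        __m512 d_vec = _mm512_sub_ps(a_vec, b_vec);
        sum_vec = _mm512_fmadd_ps(d_vec, d_vec, sum_vec);
    }
    return _mm512_reduce_add_ps(sum_vec);
}
```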
For 1536-dimensional OpenAI Ada embeddings on an Intel Xeon Platinum 8480+, the performance gets 84% better:
- AVX2: 3.3 Million ops/s, same as auto-vectorized code.
- AVX-512: 6.1 Million ops/s.
Another neat idea is to swap out some `_mm512_maskz_loadu_ps` instructions in the code for better-aligned ones, like `_mm512_maskz_load_ps`. But this doesn’t always work, especially with tools like SVE, which handles things differently. You will often see intrinsics like `svwhilelt` and `svcnt`:
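Below is a simplified SVE sketch of the same kernel. The predicate from `svwhilelt` covers only the valid lanes, and the step size comes from `svcntw()`, so one loop body serves both full and partial iterations regardless of the hardware vector length (this is illustrative, not SimSIMD’s verbatim code):

```c
#include <arm_sve.h> // SVE intrinsics; compile with -march=armv8-a+sve
#include <stddef.h>
#include <stdint.h>

// Squared Euclidean distance with a vector-length-agnostic loop:
// no tail and no hard-coded register width.
static float l2sq_f32_sve(float const *a, float const *b, size_t n) {
    svfloat32_t sum_vec = svdup_n_f32(0.f);
    uint64_t i = 0, count = (uint64_t)n;
    do {
        svbool_t pg = svwhilelt_b32_u64(i, count); // predicate over valid lanes
        svfloat32_t a_vec = svld1_f32(pg, a + i);
        svfloat32_t b_vec = svld1_f32(pg, b + i);
        svfloat32_t d_vec = svsub_f32_x(pg, a_vec, b_vec);
        sum_vec = svmla_f32_x(pg, sum_vec, d_vec, d_vec); // sum += d * d
        i += svcntw(); // number of 32-bit lanes in a hardware vector
    } while (i < count);
    return svaddv_f32(svptrue_b32(), sum_vec); // horizontal sum
}
```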
For 1536-dimensional OpenAI Ada embeddings on AWS Graviton 3, the performance gets 43% better:
- NEON: 3.5 Million ops/s.
- SVE: 5 Million ops/s.
To wrap it up, this might be a glimpse into the future of computing, where the lines between different types of computer brains, like CPUs and GPUs, start to blur. It’s more evident here than with other technologies we’ve seen in the past, like Intel’s Knights Landing lineup.
The Challenge of f16
Half-precision math, represented as `_Float16`, has been standardized as a C language extension since ISO/IEC TS 18661-3 (2015). But even today, it’s tricky to get optimal performance. While compilers like GCC and Clang can work with `_Float16` or its variant `__fp16`, they often don’t vectorize it effectively. As a result, when you compile with options like `-O3` and `-ffast-math`, half-precision code tends to run about 10 times slower than its single-precision counterpart (a sketch of such a plain kernel follows the table below). Here’s a breakdown of some benchmarks, in a table format for clarity:
Datatype | auto-vectorization | AVX2 + F16C + FMA | AVX-512 FP16 |
---|---|---|---|
f32 | 3.0 M | 3.3 M | 6.0 M |
f16 | 148.0 K | 3.0 M | 9.0 M |
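For reference, the “auto-vectorization” rows above measure a plain scalar kernel along these lines, with everything left to the compiler (a sketch, not the exact benchmark code):

```c
#include <stddef.h>

// Plain half-precision dot product. Even with `-O3 -ffast-math`,
// GCC and Clang frequently fall back to `float` conversions here
// instead of emitting native f16 SIMD instructions.
static _Float16 dot_f16_serial(_Float16 const *a, _Float16 const *b, size_t n) {
    _Float16 sum = 0;
    for (size_t i = 0; i != n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```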
Given AI’s increasing reliance on half-precision floats for most workloads, the lag in performance is concerning. Some vendors initially prioritized `bf16` support, offering a different balance between exponent and mantissa than the IEEE 754 16-bit `float`. However, the 4th generation of Intel Xeon now supports both `bf16` and `f16`, and introduces mixed-precision fused-multiply-add instructions, streamlining most of the vectorized dot-product tasks.
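As an illustration, a simplified dot-product kernel built on the AVX-512 FP16 extension might look like this; it is a sketch in the spirit of the real kernels, not SimSIMD’s verbatim code:

```c
#include <immintrin.h> // requires AVX-512 FP16, e.g. -mavx512fp16 on Sapphire Rapids
#include <stddef.h>

// Half-precision dot product: 32 f16 lanes per step, a masked load
// for the remainder, and a reduction over the accumulator at the end.
static _Float16 dot_f16_avx512fp16(_Float16 const *a, _Float16 const *b, size_t n) {
    __m512h sum_vec = _mm512_setzero_ph();
    size_t i = 0;
    for (; i + 32 <= n; i += 32) {
        __m512h a_vec = _mm512_loadu_ph(a + i);
        __m512h b_vec = _mm512_loadu_ph(b + i);
        sum_vec = _mm512_fmadd_ph(a_vec, b_vec, sum_vec);
    }
    if (i < n) {
        // Load the remaining 16-bit lanes under a mask and reinterpret as f16.
        __mmask32 mask = (__mmask32)((1ull << (n - i)) - 1ull);
        __m512h a_vec = _mm512_castsi512_ph(_mm512_maskz_loadu_epi16(mask, a + i));
        __m512h b_vec = _mm512_castsi512_ph(_mm512_maskz_loadu_epi16(mask, b + i));
        sum_vec = _mm512_fmadd_ph(a_vec, b_vec, sum_vec);
    }
    return _mm512_reduce_add_ph(sum_vec);
}
```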
To maintain clarity in the demonstration, I’ve used a sequential `_mm512_reduce_add_ph` at the conclusion; yet, this choice doesn’t impede performance. These compelling results are set to be integrated into the forthcoming USearch v3 release, bolstering its reputation as the fastest Vector Search engine in the domain.
We are excited to share a single-header C++11 vector-search engine with SDKs for #Python, #JavaScript, #Java, #Rust, and soon #GoLang & #Wolfram! It's tiny and easy to use, but still brings half-precision & integer quantization and scales beyond 4B points! https://t.co/q7yqpL3CHL
— Unum (@unum_cloud) May 2, 2023
Bonus Section: Bypassing `sqrt` and LibC Dependencies!
One of the joys of library development is achieving independence from external dependencies. This means total independence, including from staples like LibC and the Standard Template Library of C++. In SimSIMD, I previously depended on just one header from LibC, `<math.h>`. Not anymore.
Originally, it was used for the angular distance computation given by `1 - ab / sqrtf(a2 * b2)`. A simple optimization is substituting it with the renowned `rsqrt` bithack from Quake 3.
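Here’s a minimal rendition of that classic trick, for reference (SimSIMD’s actual implementation differs in the details):

```c
#include <stdint.h>

// The Quake 3 "fast inverse square root": a magic-number initial guess
// followed by a single Newton-Raphson refinement step.
static float rsqrt_quake3(float x) {
    union { float f; uint32_t u; } y = { x };
    y.u = 0x5F3759DFu - (y.u >> 1);     // bit-level initial estimate
    y.f *= 1.5f - 0.5f * x * y.f * y.f; // one Newton-Raphson iteration
    return y.f;
}
```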
This hack is already a remarkable trick. However, not everyone is aware of alternative approaches, some of which offer improved computational steps or employ different magic numbers to enhance numerical stability. I opted for the methods recommended by Jan Kadlec.
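To bolster numerical stability, consider a refinement along these lines, shown with the magic constants commonly attributed to Kadlec’s variant (treat the exact numbers as indicative; SimSIMD’s constants may differ):

```c
#include <stdint.h>

// A Kadlec-style inverse square root: a different magic number and a
// rescaled correction step that trade a few operations for better accuracy.
static float rsqrt_kadlec(float x) {
    union { float f; uint32_t u; } y = { x };
    y.u = 0x5F1FFFF9u - (y.u >> 1);
    return 0.703952253f * y.f * (2.38924456f - x * y.f * y.f);
}
```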
Going further, modern CPUs furnish specialized instructions, and I’ve integrated several of these into SimSIMD. Some prominent examples are:
- `_mm_rsqrt_ss` for SSE.
- `_mm_mask_rsqrt14_pd` for AVX-512.
- `vrsqrte_f32` for NEON.
For those keen on eliminating the `<math.h>` dependency, these instructions are certainly worth exploring!
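For example, on x86 one can replace a `sqrtf`-based normalization with the hardware reciprocal-square-root estimate plus a single Newton-Raphson refinement; the sketch below is illustrative rather than SimSIMD’s exact code:

```c
#include <immintrin.h> // SSE intrinsics

// Approximate 1/sqrt(x) without <math.h>: hardware estimate (~12 bits)
// refined by one Newton-Raphson step.
static float rsqrt_sse(float x) {
    float y = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
    return y * (1.5f - 0.5f * x * y * y);
}
```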
Benchmark Results
Methodology:
- For Cosine, squared Euclidean, Hamming, and Jaccard distances, I employed specialized functions from the `scipy.spatial.distance` module of SciPy.
- Inner Product distance calculations were performed directly using NumPy as `1 - np.inner(A, B)`.
- `cdist` Inner Product distance was computed with `1 - np.dot(A, B.T)`, translating directly into a BLAS `sgemm` call.
- For batch processes where direct support from SciPy is absent, each row was looped over using SciPy and NumPy.
- For Cosine, squared Euclidean, and Inner Product, three datatypes were benchmarked: `f32`, `f16`, and `i8`.
- For binary metrics, the only datatype that makes sense is an 8-bit unsigned integer, `u8`.
Notably, there are two specific scenarios where SimSIMD doesn’t outshine the alternatives:
- Inner Product distance between all pairs of rows in two matrices (expressed as `1 - np.dot(A, B.T)`): as matrices grow, effective caching is required to retain parts of the matrix close to the CPU, circumventing frequent RAM fetches. While BLAS optimizes for this, SimSIMD doesn’t and has no plans to.
- Minimal operations on contemporary Intel CPUs involving only two vectors: despite bypassing heavy abstractions like `PyArg_ParseTuple` in the CPython binding, performance suffers from a combination of substantial type-checking and ancillary tasks between the API and its execution. A potential optimization is to pre-select the backend once at module load and store it in a global variable, rather than the current approach of verifying `cpuid` on every interaction.
In upcoming endeavors, I’ll delve into Intel’s Advanced Matrix eXtensions (AMX) and Arm’s Scalable Matrix Extensions (SME). These have already enhanced Unum’s UForm Transformer models, trimming inference time to a mere 1.3 milliseconds on a Sapphire Rapids CPU. While these extensions are yet to be integrated into SimSIMD, I openly invite contributions. Excited to see your inputs!
Appendix 1: Performance on Apple M2 Pro
Between 2 Vectors, Batch Size: 1
Datatype | Method | Ops/s | SimSIMD Ops/s | SimSIMD Improvement |
---|---|---|---|---|
f32 | scipy.cosine | 67,352 | 1,559,963 | 23.16 x |
f16 | scipy.cosine | 26,723 | 1,553,700 | 58.14 x |
i8 | scipy.cosine | 78,937 | 3,259,101 | 41.29 x |
f32 | scipy.sqeuclidean | 383,307 | 1,587,302 | 4.14 x |
f16 | scipy.sqeuclidean | 89,709 | 1,567,298 | 17.47 x |
i8 | scipy.sqeuclidean | 213,618 | 1,794,259 | 8.40 x |
f32 | numpy.inner | 1,346,952 | 1,611,604 | 1.20 x |
f16 | numpy.inner | 266,904 | 1,545,994 | 5.79 x |
i8 | numpy.inner | 862,502 | 3,302,596 | 3.83 x |
u8 | scipy.hamming | 1,632,432 | 27,874,556 | 17.08 x |
u8 | scipy.jaccard | 1,180,928 | 24,515,213 | 20.76 x |
Between 2 Vectors, Batch Size: 1,000
Datatype | Method | Ops/s | SimSIMD Ops/s | SimSIMD Improvement |
---|---|---|---|---|
f32 | scipy.cosine | 68,414 | 2,471,680 | 36.13 x |
f16 | scipy.cosine | 27,964 | 2,470,154 | 88.33 x |
i8 | scipy.cosine | 81,064 | 15,364,768 | 189.54 x |
f32 | scipy.sqeuclidean | 415,326 | 2,462,800 | 5.93 x |
f16 | scipy.sqeuclidean | 90,795 | 2,439,524 | 26.87 x |
i8 | scipy.sqeuclidean | 203,094 | 4,144,373 | 20.41 x |
f32 | numpy.inner | 1,590,667 | 2,535,124 | 1.59 x |
f16 | numpy.inner | 268,583 | 2,505,481 | 9.33 x |
i8 | numpy.inner | 914,879 | 16,985,118 | 18.57 x |
u8 | scipy.hamming | 1,645,641 | 134,064,810 | 81.47 x |
u8 | scipy.jaccard | 1,237,944 | 103,444,581 | 83.56 x |
Between All Pairs of Vectors (`cdist`), Batch Size: 1,000
Datatype | Method | Ops/s | SimSIMD Ops/s | SimSIMD Improvement |
---|---|---|---|---|
f32 | scipy.cosine | 777,894 | 2,533,840 | 3.26 x |
f16 | scipy.cosine | 776,171 | 2,480,454 | 3.20 x |
i8 | scipy.cosine | 778,570 | 17,927,198 | 23.03 x |
f32 | scipy.sqeuclidean | 2,349,645 | 2,569,746 | 1.09 x |
f16 | scipy.sqeuclidean | 2,158,488 | 2,377,701 | 1.10 x |
i8 | scipy.sqeuclidean | 2,091,610 | 4,118,384 | 1.97 x |
f32 | numpy.inner | 21,854,341 | 2,092,590 | 0.10 x |
f16 | numpy.inner | 304,646 | 2,423,802 | 7.96 x |
i8 | numpy.inner | 1,819,572 | 17,014,688 | 9.35 x |
u8 | scipy.hamming | 46,265,239 | 1,502,534,579 | 32.48 x |
u8 | scipy.jaccard | 129,574,247 | 971,895,663 | 7.50 x |
Appendix 2: Performance on 4th Gen Intel Xeon Platinum (8480+)
Between 2 Vectors, Batch Size: 1
Datatype | Method | Ops/s | SimSIMD Ops/s | SimSIMD Improvement |
---|---|---|---|---|
f32 | scipy.cosine | 57,346 | 224,148 | 3.91 x |
f16 | scipy.cosine | 17,068 | 226,829 | 13.29 x |
i8 | scipy.cosine | 71,983 | 234,372 | 3.26 x |
f32 | scipy.sqeuclidean | 341,301 | 218,177 | 0.64 x |
f16 | scipy.sqeuclidean | 37,076 | 229,565 | 6.19 x |
i8 | scipy.sqeuclidean | 236,241 | 237,825 | 1.01 x |
f32 | numpy.inner | 912,270 | 225,867 | 0.25 x |
f16 | numpy.inner | 92,400 | 230,110 | 2.49 x |
i8 | numpy.inner | 700,822 | 231,987 | 0.33 x |
u8 | scipy.hamming | 1,586,149 | 1,875,391 | 1.18 x |
u8 | scipy.jaccard | 1,083,886 | 1,871,898 | 1.73 x |
Between 2 Vectors, Batch Size: 1,000
Datatype | Method | Ops/s | SimSIMD Ops/s | SimSIMD Improvement |
---|---|---|---|---|
f32 | scipy.cosine | 56,848 | 2,835,761 | 49.88 x |
f16 | scipy.cosine | 17,125 | 4,144,457 | 242.01 x |
i8 | scipy.cosine | 71,990 | 7,627,362 | 105.95 x |
f32 | scipy.sqeuclidean | 370,555 | 2,817,710 | 7.60 x |
f16 | scipy.sqeuclidean | 37,116 | 4,451,408 | 119.93 x |
i8 | scipy.sqeuclidean | 224,866 | 10,432,198 | 46.39 x |
f32 | numpy.inner | 910,884 | 2,814,501 | 3.09 x |
f16 | numpy.inner | 91,347 | 4,728,760 | 51.77 x |
i8 | numpy.inner | 700,576 | 7,347,589 | 10.49 x |
u8 | scipy.hamming | 1,611,206 | 79,796,509 | 49.53 x |
u8 | scipy.jaccard | 1,096,522 | 87,557,689 | 79.85 x |
Between All Pairs of Vectors (`cdist`), Batch Size: 1,000
Datatype | Method | Ops/s | SimSIMD Ops/s | SimSIMD Improvement |
---|---|---|---|---|
f32 | scipy.cosine | 2,430,617 | 3,106,912 | 1.28 x |
f16 | scipy.cosine | 2,272,228 | 4,304,768 | 1.89 x |
i8 | scipy.cosine | 2,370,290 | 8,384,383 | 3.54 x |
f32 | scipy.sqeuclidean | 2,226,342 | 5,505,910 | 2.47 x |
f16 | scipy.sqeuclidean | 2,237,966 | 4,732,212 | 2.11 x |
i8 | scipy.sqeuclidean | 2,234,063 | 13,029,101 | 5.83 x |
f32 | numpy.inner | 175,492,443 | 4,984,406 | 0.03 x |
f16 | numpy.inner | 106,677 | 5,048,223 | 47.32 x |
i8 | numpy.inner | 1,807,566 | 8,257,849 | 4.57 x |
u8 | scipy.hamming | 159,574,713 | 2,325,251,446 | 14.57 x |
u8 | scipy.jaccard | 19,478,797 | 1,705,905,017 | 87.58 x |
Appendix 3: Performance on AWS Graviton 3
Between 2 Vectors, Batch Size: 1
Datatype | Method | Ops/s | SimSIMD Ops/s | SimSIMD Improvement |
---|---|---|---|---|
f32 | scipy.cosine | 30,830 | 1,007,218 | 32.67 x |
f16 | scipy.cosine | 14,683 | 1,055,420 | 71.88 x |
i8 | scipy.cosine | 38,843 | 1,441,367 | 37.11 x |
f32 | scipy.sqeuclidean | 173,800 | 1,076,601 | 6.19 x |
f16 | scipy.sqeuclidean | 48,080 | 1,136,170 | 23.63 x |
i8 | scipy.sqeuclidean | 120,281 | 1,371,827 | 11.41 x |
f32 | numpy.inner | 642,658 | 1,103,082 | 1.72 x |
f16 | numpy.inner | 171,904 | 1,078,110 | 6.27 x |
i8 | numpy.inner | 459,605 | 1,438,284 | 3.13 x |
u8 | scipy.hamming | 796,739 | 10,041,370 | 12.60 x |
u8 | scipy.jaccard | 534,529 | 9,181,640 | 17.18 x |
Between 2 Vectors, Batch Size: 1,000
Datatype | Method | Ops/s | SimSIMD Ops/s | SimSIMD Improvement |
---|---|---|---|---|
f32 | scipy.cosine | 31,162 | 2,692,131 | 86.39 x |
f16 | scipy.cosine | 14,721 | 3,034,736 | 206.15 x |
i8 | scipy.cosine | 38,837 | 10,442,992 | 268.89 x |
f32 | scipy.sqeuclidean | 177,025 | 2,705,020 | 15.28 x |
f16 | scipy.sqeuclidean | 50,164 | 3,151,582 | 62.83 x |
i8 | scipy.sqeuclidean | 121,269 | 6,081,171 | 50.15 x |
f32 | numpy.inner | 657,257 | 3,078,979 | 4.68 x |
f16 | numpy.inner | 171,875 | 3,153,530 | 18.35 x |
i8 | numpy.inner | 459,975 | 9,719,212 | 21.13 x |
u8 | scipy.hamming | 845,629 | 46,755,187 | 55.29 x |
u8 | scipy.jaccard | 546,286 | 30,730,463 | 56.25 x |
Between All Pairs of Vectors (`cdist`), Batch Size: 1,000
Datatype | Method | Ops/s | SimSIMD Ops/s | SimSIMD Improvement |
---|---|---|---|---|
f32 | scipy.cosine | 1,570,730 | 3,066,406 | 1.95 x |
f16 | scipy.cosine | 1,605,807 | 3,057,974 | 1.90 x |
i8 | scipy.cosine | 1,590,897 | 14,750,039 | 9.27 x |
f32 | scipy.sqeuclidean | 1,622,545 | 3,272,164 | 2.02 x |
f16 | scipy.sqeuclidean | 1,578,548 | 3,178,287 | 2.01 x |
i8 | scipy.sqeuclidean | 1,592,369 | 6,436,115 | 4.04 x |
f32 | numpy.inner | 58,199,003 | 3,445,812 | 0.06 x |
f16 | numpy.inner | 218,842 | 3,274,774 | 14.96 x |
i8 | numpy.inner | 1,356,633 | 14,812,931 | 10.92 x |
u8 | scipy.hamming | 78,713,290 | 579,018,344 | 7.36 x |
u8 | scipy.jaccard | 34,455,520 | 318,356,871 | 9.24 x |