Over the years, Intel’s 512-bit Advanced Vector eXtensions (AVX-512) have stirred extensive discussion. Though introduced in 2014, it wasn’t until recently that CPUs began providing comprehensive support. Similarly, Arm Scalable Vector Extensions (SVE), primarily designed for Arm servers, have only lately started making waves. The computing landscape now looks quite different with powerhouses like Intel’s Sapphire Rapids CPUs, AWS Graviton 3, and Ampere Altra entering the fray, bringing compelling advantages over the traditional AVX2 and NEON extensions.

These enhancements shine in the latest SimSIMD release.

This library now boasts:

  • Speeds up to 200x faster than SciPy and NumPy,
  • Compatibility with NEON, SVE, AVX2, and AVX-512 extensions,
  • Support for Inner Product, Euclidean, Angular, Hamming, Jaccard distances, and
  • Acceptance of f32, f16, i8, and binary vectors.

If this sparks your interest, go ahead and pip install simsimd. The library is available for both public and commercial projects under the Apache 2.0 license. For those keen on delving deeper into the mechanics behind these advancements, understanding how AVX-512 and SVE make SIMD code cleaner, and how they accelerate half-precision AI and Machine Learning workloads, continue reading!


The appendix contains performance benchmarks for:

  1. Apple M2 Pro
  2. 4th Gen Intel Xeon Platinum (8480+)
  3. AWS Graviton 3

“Tails of the past”: The Significance of Masked Loads

Think of NEON as a container that’s always 128 bits wide, while AVX is a wider container at 256 bits. But real-world data can come in all sorts of sizes, and this sometimes causes complications, especially when we’re not just working with standard math libraries.

  • For tasks like measuring the distance between dense data points, manually guiding the computer can be 2-5 times faster than letting it figure things out on its own.
  • For trickier tasks, like processing strings of text or handling varying bit rates, manually guiding can be even 20-50 times faster.

Now, with this in mind, you might expect to see a piece of code from StringZilla. But this isn’t just about speed—it’s also about writing neat and concise code. One thing that bothers me is splitting a piece of code into three parts because the data doesn’t fit perfectly.
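Concretely, the three parts are a scalar head that runs until the pointer reaches an aligned address, an aligned SIMD body, and a scalar tail for the leftovers. A sketch of that bookkeeping, assuming a 32-byte AVX register; the helper name is hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

// How many bytes go into the scalar "head", the aligned SIMD "body",
// and the scalar "tail" for a given pointer and vector width.
typedef struct { size_t head, body, tail; } split_t;

split_t split_for_width(void const *data, size_t n_bytes, size_t width) {
    uintptr_t addr = (uintptr_t)data;
    size_t head = (width - addr % width) % width; // bytes until alignment
    if (head > n_bytes) head = n_bytes;           // tiny inputs: all head
    size_t body = (n_bytes - head) / width * width; // whole vectors only
    size_t tail = n_bytes - head - body;            // leftovers
    return (split_t){head, body, tail};
}
```

Every kernel ends up repeating this arithmetic, plus three separate loops, which is precisely the clutter that masked loads eliminate.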

However, if you’re not chasing the absolute best speed, you can simplify it to two parts. Just skip the beginning and use a more flexible approach in the main body of the code. With tools like NEON and AVX, making this kind of distance calculation might look like this:

__attribute__((target("avx2,fma")))
simsimd_f32_t simsimd_avx2_f32_l2sq(simsimd_f32_t const* a, simsimd_f32_t const* b, simsimd_size_t n) {
    __m256 d2_vec = _mm256_set1_ps(0);
    simsimd_size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 a_vec = _mm256_loadu_ps(a + i);
        __m256 b_vec = _mm256_loadu_ps(b + i);
        __m256 d_vec = _mm256_sub_ps(a_vec, b_vec);
        d2_vec = _mm256_fmadd_ps(d_vec, d_vec, d2_vec);
    }

    d2_vec = _mm256_add_ps(_mm256_permute2f128_ps(d2_vec, d2_vec, 1), d2_vec);
    d2_vec = _mm256_hadd_ps(d2_vec, d2_vec);
    d2_vec = _mm256_hadd_ps(d2_vec, d2_vec);

    simsimd_f32_t result[1];
    _mm_store_ss(result, _mm256_castps256_ps128(d2_vec));

    // Accumulate the tail:
    for (; i < n; ++i)
        result[0] += (a[i] - b[i]) * (a[i] - b[i]);
    return result[0];
}

To avoid the serial tail, you can use _mm256_maskload_ps for masked loads, but AVX2 provides it only for 32-bit and 64-bit lanes. Here’s where AVX-512 makes things better: it supports masked loads and stores for every lane width, handling uneven data sizes gracefully, so we can get rid of the “tail” for f32, f16, i8, and other data types.

__attribute__((target("avx512f,avx512vl")))
simsimd_f32_t simsimd_avx512_f32_l2sq(simsimd_f32_t const* a, simsimd_f32_t const* b, simsimd_size_t n) {
    __m512 d2_vec = _mm512_set1_ps(0);
    for (simsimd_size_t i = 0; i < n; i += 16) {
        __mmask16 mask = n - i >= 16 ? 0xFFFF : ((1u << (n - i)) - 1u);
        __m512 a_vec = _mm512_maskz_loadu_ps(mask, a + i);
        __m512 b_vec = _mm512_maskz_loadu_ps(mask, b + i);
        __m512 d_vec = _mm512_sub_ps(a_vec, b_vec);
        d2_vec = _mm512_fmadd_ps(d_vec, d_vec, d2_vec);
    }
    return _mm512_reduce_add_ps(d2_vec);
}

For 1536-dimensional OpenAI Ada embeddings on an Intel Xeon Platinum 8480+, the performance gets 84% better:

  • AVX2: 3.3 Million ops/s, same as auto-vectorized code.
  • AVX-512: 6.1 Million ops/s.
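The mask arithmetic carries the whole trick: each iteration uses either a full 0xFFFF mask or (1 << remaining) - 1, and masked-off lanes are zero-filled by the maskz load, contributing nothing to the sum. Here is a portable C simulation of the same loop structure, useful as a mental model or a test oracle:

```c
#include <stddef.h>

// Simulates the tail-free masked loop with 16 "lanes" in plain C.
// Lanes whose mask bit is clear read as zero, so the squared
// difference they contribute is zero, just like a maskz load.
float l2sq_masked_sim(float const *a, float const *b, size_t n) {
    float sum = 0;
    for (size_t i = 0; i < n; i += 16) {
        unsigned mask = n - i >= 16 ? 0xFFFFu : ((1u << (n - i)) - 1u);
        for (size_t lane = 0; lane != 16; ++lane) {
            float av = (mask >> lane) & 1 ? a[i + lane] : 0; // maskz load
            float bv = (mask >> lane) & 1 ? b[i + lane] : 0;
            sum += (av - bv) * (av - bv);
        }
    }
    return sum;
}
```

The conditional guards the memory access itself, which mirrors the hardware guarantee: masked-off lanes of a masked load never fault, even if they fall past the end of the buffer.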

Another neat idea is to swap some of the _mm512_maskz_loadu_ps instructions for their aligned counterpart, _mm512_maskz_load_ps. But this doesn’t always work, especially with SVE, which handles things differently: the hardware vector length is not fixed at compile time, so you will often see intrinsics like svwhilelt and svcntw:

__attribute__((target("+sve")))
simsimd_f32_t simsimd_sve_f32_l2sq(simsimd_f32_t const* a, simsimd_f32_t const* b, simsimd_size_t n) {
    simsimd_size_t i = 0;
    svfloat32_t d2_vec = svdupq_n_f32(0.f, 0.f, 0.f, 0.f);
    do {
        svbool_t pg_vec = svwhilelt_b32((unsigned int)i, (unsigned int)n);
        svfloat32_t a_vec = svld1_f32(pg_vec, a + i);
        svfloat32_t b_vec = svld1_f32(pg_vec, b + i);
        svfloat32_t d_vec = svsub_f32_x(pg_vec, a_vec, b_vec);
        d2_vec = svmla_f32_x(pg_vec, d2_vec, d_vec, d_vec);
        i += svcntw();
    } while (i < n);
    simsimd_f32_t d2 = svaddv_f32(svptrue_b32(), d2_vec);
    return d2;
}

For 1536-dimensional OpenAI Ada embeddings on AWS Graviton 3, the performance gets 43% better:

  • NEON: 3.5 Million ops/s.
  • SVE: 5 Million ops/s.

To wrap it up, this might be a glimpse into the future of computing, where the lines between different types of processors, like CPUs and GPUs, start to blur. It’s more evident here than with other technologies we’ve seen in the past, like Intel’s Knights Landing lineup.

The Challenge of f16

Half-precision math, represented by the _Float16 type, has been specified for C since the ISO/IEC TS 18661-3 extension, and compilers have supported it (or the older __fp16 variant) for years. But even today, it’s tricky to get optimal performance: GCC and Clang accept _Float16, yet they often don’t vectorize it effectively. As a result, when you compile with options like -O3 and -ffast-math, half-precision code can run an order of magnitude slower than its single-precision counterpart. Here’s a breakdown of some benchmarks:

------------------------------------------------------------------------------------------------------------
Benchmark                                                  Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------
avx2_f32_l2sq_1536d/min_time:10.000/threads:224         1.37 ns          306 ns     46795392 pairs=3.2643M/s
avx2_f16_l2sq_1536d/min_time:10.000/threads:224         1.48 ns          329 ns     44446528 pairs=3.03545M/s
avx512_f32_l2sq_1536d/min_time:10.000/threads:224      0.741 ns          166 ns     85654240 pairs=6.03806M/s
avx512_f16_l2sq_1536d/min_time:10.000/threads:224      0.495 ns          111 ns    122070592 pairs=9.01289M/s
auto_f32_l2sq_1536d/min_time:10.000/threads:224         1.45 ns          325 ns     50284192 pairs=3.08087M/s
auto_f16_l2sq_1536d/min_time:10.000/threads:224         30.2 ns         6752 ns      1931328 pairs=148.094k/s

In a table format for clarity:

       auto-vectorization   AVX2 + F16C + FMA   AVX-512 FP16
f32    3.0 M                3.3 M               6.0 M
f16    148.0 K              3.0 M               9.0 M
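The gap in the auto-vectorized case is easier to understand once you see what a scalar fallback must do per element when hardware f16 arithmetic is unavailable. A simplified decoder, ignoring subnormals, infinities, and NaNs:

```c
#include <math.h>
#include <stdint.h>

// Decode an IEEE 754 binary16 value: 1 sign bit, 5 exponent bits,
// 10 mantissa bits. Simplified: subnormals, infinities, and NaNs
// are not handled, to keep the per-element cost visible.
float f16_to_f32(uint16_t h) {
    int sign = h >> 15;
    int exponent = (h >> 10) & 0x1F; // biased by 15
    int mantissa = h & 0x3FF;        // 10 fraction bits
    float value = ldexpf(1.0f + mantissa / 1024.0f, exponent - 15);
    return sign ? -value : value;
}
```

Doing this bit surgery (and the reverse) around every multiply is what drags the auto-vectorized f16 path down to 148 K ops/s, while AVX-512 FP16 operates on the half-precision values natively.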

Given AI’s increasing reliance on half-precision floats for most workloads, the lag in performance is concerning. Some vendors initially prioritized bf16 support, offering a different balance between exponent and mantissa than the IEEE 754 16-bit float. However, 4th generation Intel Xeon now supports bf16 and introduces mixed-precision fused-multiply-add instructions, streamlining most of the vectorized dot-product tasks.
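The appeal of bf16 is that it keeps f32’s eight exponent bits and truncates the mantissa to seven, so conversion is just taking the top half of a float’s bit pattern. A truncating sketch; real hardware typically rounds to nearest-even instead:

```c
#include <stdint.h>
#include <string.h>

// bf16 is the top 16 bits of an IEEE 754 float32, so the
// conversion is a shift (here truncating, not rounding).
uint16_t f32_to_bf16_trunc(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof(bits));
    return (uint16_t)(bits >> 16); // drop the 16 low mantissa bits
}

float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16; // zero-fill the low mantissa bits
    float f;
    memcpy(&f, &bits, sizeof(f));
    return f;
}
```

The cheap conversion is exactly why vendors prioritized bf16: the same dynamic range as f32, at half the memory traffic, with no exponent rescaling needed.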

__attribute__((target("avx512fp16,avx512vl,avx512f")))
simsimd_f32_t simsimd_avx512_f16_l2sq(simsimd_f16_t const* a, simsimd_f16_t const* b, simsimd_size_t d) {
    __m512h d2_vec = _mm512_set1_ph(0);
    for (simsimd_size_t i = 0; i < d; i += 32) {
        __mmask32 mask = d - i >= 32 ? 0xFFFFFFFF : ((1u << (d - i)) - 1u);
        __m512i a_vec = _mm512_maskz_loadu_epi16(mask, a + i);
        __m512i b_vec = _mm512_maskz_loadu_epi16(mask, b + i);
        __m512h d_vec = _mm512_sub_ph(_mm512_castsi512_ph(a_vec), _mm512_castsi512_ph(b_vec));
        d2_vec = _mm512_fmadd_ph(d_vec, d_vec, d2_vec);
    }
    return _mm512_reduce_add_ph(d2_vec);
}

To keep the demonstration clear, I’ve used the sequential _mm512_reduce_add_ph at the end; this choice doesn’t measurably impede performance. These results are set to be integrated into the forthcoming USearch v3 release, bolstering its reputation as the fastest vector-search engine in the domain.

Bonus Section: Bypassing sqrt and LibC Dependencies!

One of the joys of library development is achieving independence from external dependencies. This means total independence, including from staples like LibC and the Standard Template Library of C++. In SimSIMD, I previously depended on just one header from LibC, <math.h>. Not anymore.

Originally, it was used for the angular distance computation given by 1 - ab / sqrtf(a2 * b2). A simple optimization for this is substituting with the renowned rsqrt bithack from Quake 3:

simsimd_f32_t simsimd_approximate_inverse_square_root(simsimd_f32_t number) {
    simsimd_f32i32_t conv;
    conv.f = number;
    conv.i = 0x5F1FFFF9 - (conv.i >> 1);
    conv.f *= 0.703952253f * (2.38924456f - number * conv.f * conv.f);
    return conv.f;
}

This hack is already a remarkable trick, but not everyone is aware of alternative approaches, some of which need fewer computational steps or employ different magic numbers for better numerical stability. I opted for the variant recommended by Jan Kadlec. To bolster numerical stability further, consider rewriting the distance as:

1 - ab / sqrt(a2 * b2);
1 - ab * rsqrt(a2 * b2);
1 - ab * rsqrt(a2) * rsqrt(b2);

Going further, modern CPUs furnish specialized instructions, and I’ve integrated several of these into SimSIMD. Some prominent examples are:

  • _mm_rsqrt_ss for SSE.
  • _mm_mask_rsqrt14_pd for AVX-512.
  • vrsqrte_f32 for NEON.

For those keen on eliminating the <math.h> dependency, these instructions are certainly worth exploring!

Benchmark Results

Methodology:

  • For Cosine, squared Euclidean, Hamming, and Jaccard distances, I employed specialized functions from the scipy.spatial.distance module of SciPy.
  • Inner Product distance calculations were performed directly using NumPy as 1 - np.inner(A, B).
  • cdist Inner Product distance was computed with 1 - np.dot(A, B.T), translating directly into a BLAS sgemm call.
  • For batch processes where direct support from SciPy is absent, each row was looped using SciPy and NumPy.
  • For Cosine, squared Euclidean, and Inner Product, three datatypes were benchmarked: f32, f16, and i8.
  • For binary metrics, the only datatype that makes sense is an 8-bit unsigned integer, u8.

Notably, there are two specific scenarios where SimSIMD doesn’t outshine the alternatives:

  1. Inner Product distance between pairwise rows in two matrices (expressed as 1 - np.dot(A, B.T)): When matrices increase in size, effective caching is required to retain parts of the matrix on the CPU, circumventing frequent RAM fetches. While BLAS optimizes this, SimSIMD doesn’t and has no plans to.
  2. Minimal operations on contemporary Intel CPUs involving only two vectors: Despite bypassing heavy abstractions like PyArg_ParseTuple in the CPython binding, the performance is impacted due to a combination of substantial type-checking and ancillary tasks between the API and its execution. Potential optimization could arise from backend pre-selection upon module loading, saving it in a global variable, rather than the current approach of verifying the cpuid for every interaction.
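The pre-selection idea from the second point is a one-time dispatch: probe CPU capabilities once, cache a function pointer in a global, and make every later call a plain indirect jump. A minimal sketch with a stubbed capability probe; the names are illustrative, not SimSIMD’s internals:

```c
#include <stddef.h>

typedef float (*l2sq_fn)(float const *, float const *, size_t);

// Portable fallback kernel.
static float l2sq_serial(float const *a, float const *b, size_t n) {
    float d2 = 0;
    for (size_t i = 0; i != n; ++i) d2 += (a[i] - b[i]) * (a[i] - b[i]);
    return d2;
}

// Placeholder standing in for an AVX-512 kernel in this sketch.
static float l2sq_accelerated(float const *a, float const *b, size_t n) {
    return l2sq_serial(a, b, n);
}

// Stub standing in for a real cpuid/hwcap probe.
static int cpu_supports_avx512(void) { return 0; }

// Backend chosen once and cached; a Python binding would populate
// this at module load time instead of lazily.
static l2sq_fn l2sq_backend = NULL;

float l2sq(float const *a, float const *b, size_t n) {
    if (!l2sq_backend)
        l2sq_backend = cpu_supports_avx512() ? l2sq_accelerated : l2sq_serial;
    return l2sq_backend(a, b, n); // no capability check on the hot path
}
```

For two short vectors the saved cpuid and type-checking overhead is a large fraction of the total call cost, which is exactly the scenario described above.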

In upcoming endeavors, I’ll delve into Intel’s Advanced Matrix eXtensions (AMX) and Arm’s Scalable Matrix Extensions (SME). These have already enhanced our Unum’s UForm Transformer models, trimming inference time to a mere 1.3 milliseconds on a Sapphire Rapids CPU. While these extensions are yet to be integrated into SimSIMD, I openly invite contributions. Excited to see your inputs! 🤗

Appendix 1: Performance on Apple M2 Pro

Between 2 Vectors, Batch Size: 1

Datatype   Method               Ops/s        SimSIMD Ops/s   SimSIMD Improvement
f32        scipy.cosine         67,352       1,559,963       23.16 x
f16        scipy.cosine         26,723       1,553,700       58.14 x
i8         scipy.cosine         78,937       3,259,101       41.29 x
f32        scipy.sqeuclidean    383,307      1,587,302       4.14 x
f16        scipy.sqeuclidean    89,709       1,567,298       17.47 x
i8         scipy.sqeuclidean    213,618      1,794,259       8.40 x
f32        numpy.inner          1,346,952    1,611,604       1.20 x
f16        numpy.inner          266,904      1,545,994       5.79 x
i8         numpy.inner          862,502      3,302,596       3.83 x
u8         scipy.hamming        1,632,432    27,874,556      17.08 x
u8         scipy.jaccard        1,180,928    24,515,213      20.76 x

Between 2 Vectors, Batch Size: 1,000

Datatype   Method               Ops/s        SimSIMD Ops/s   SimSIMD Improvement
f32        scipy.cosine         68,414       2,471,680       36.13 x
f16        scipy.cosine         27,964       2,470,154       88.33 x
i8         scipy.cosine         81,064       15,364,768      189.54 x
f32        scipy.sqeuclidean    415,326      2,462,800       5.93 x
f16        scipy.sqeuclidean    90,795       2,439,524       26.87 x
i8         scipy.sqeuclidean    203,094      4,144,373       20.41 x
f32        numpy.inner          1,590,667    2,535,124       1.59 x
f16        numpy.inner          268,583      2,505,481       9.33 x
i8         numpy.inner          914,879      16,985,118      18.57 x
u8         scipy.hamming        1,645,641    134,064,810     81.47 x
u8         scipy.jaccard        1,237,944    103,444,581     83.56 x

Between All Pairs of Vectors (cdist), Batch Size: 1,000

Datatype   Method               Ops/s        SimSIMD Ops/s   SimSIMD Improvement
f32        scipy.cosine         777,894      2,533,840       3.26 x
f16        scipy.cosine         776,171      2,480,454       3.20 x
i8         scipy.cosine         778,570      17,927,198      23.03 x
f32        scipy.sqeuclidean    2,349,645    2,569,746       1.09 x
f16        scipy.sqeuclidean    2,158,488    2,377,701       1.10 x
i8         scipy.sqeuclidean    2,091,610    4,118,384       1.97 x
f32        numpy.inner          21,854,341   2,092,590       0.10 x
f16        numpy.inner          304,646      2,423,802       7.96 x
i8         numpy.inner          1,819,572    17,014,688      9.35 x
u8         scipy.hamming        46,265,239   1,502,534,579   32.48 x
u8         scipy.jaccard        129,574,247  971,895,663     7.50 x

Appendix 2: Performance on 4th Gen Intel Xeon Platinum (8480+)

Between 2 Vectors, Batch Size: 1

Datatype   Method               Ops/s        SimSIMD Ops/s   SimSIMD Improvement
f32        scipy.cosine         57,346       224,148         3.91 x
f16        scipy.cosine         17,068       226,829         13.29 x
i8         scipy.cosine         71,983       234,372         3.26 x
f32        scipy.sqeuclidean    341,301      218,177         0.64 x
f16        scipy.sqeuclidean    37,076       229,565         6.19 x
i8         scipy.sqeuclidean    236,241      237,825         1.01 x
f32        numpy.inner          912,270      225,867         0.25 x
f16        numpy.inner          92,400       230,110         2.49 x
i8         numpy.inner          700,822      231,987         0.33 x
u8         scipy.hamming        1,586,149    1,875,391       1.18 x
u8         scipy.jaccard        1,083,886    1,871,898       1.73 x

Between 2 Vectors, Batch Size: 1,000

Datatype   Method               Ops/s        SimSIMD Ops/s   SimSIMD Improvement
f32        scipy.cosine         56,848       2,835,761       49.88 x
f16        scipy.cosine         17,125       4,144,457       242.01 x
i8         scipy.cosine         71,990       7,627,362       105.95 x
f32        scipy.sqeuclidean    370,555      2,817,710       7.60 x
f16        scipy.sqeuclidean    37,116       4,451,408       119.93 x
i8         scipy.sqeuclidean    224,866      10,432,198      46.39 x
f32        numpy.inner          910,884      2,814,501       3.09 x
f16        numpy.inner          91,347       4,728,760       51.77 x
i8         numpy.inner          700,576      7,347,589       10.49 x
u8         scipy.hamming        1,611,206    79,796,509      49.53 x
u8         scipy.jaccard        1,096,522    87,557,689      79.85 x

Between All Pairs of Vectors (cdist), Batch Size: 1,000

Datatype   Method               Ops/s        SimSIMD Ops/s   SimSIMD Improvement
f32        scipy.cosine         2,430,617    3,106,912       1.28 x
f16        scipy.cosine         2,272,228    4,304,768       1.89 x
i8         scipy.cosine         2,370,290    8,384,383       3.54 x
f32        scipy.sqeuclidean    2,226,342    5,505,910       2.47 x
f16        scipy.sqeuclidean    2,237,966    4,732,212       2.11 x
i8         scipy.sqeuclidean    2,234,063    13,029,101      5.83 x
f32        numpy.inner          175,492,443  4,984,406       0.03 x
f16        numpy.inner          106,677      5,048,223       47.32 x
i8         numpy.inner          1,807,566    8,257,849       4.57 x
u8         scipy.hamming        159,574,713  2,325,251,446   14.57 x
u8         scipy.jaccard        19,478,797   1,705,905,017   87.58 x

Appendix 3: Performance on AWS Graviton 3

Between 2 Vectors, Batch Size: 1

Datatype   Method               Ops/s        SimSIMD Ops/s   SimSIMD Improvement
f32        scipy.cosine         30,830       1,007,218       32.67 x
f16        scipy.cosine         14,683       1,055,420       71.88 x
i8         scipy.cosine         38,843       1,441,367       37.11 x
f32        scipy.sqeuclidean    173,800      1,076,601       6.19 x
f16        scipy.sqeuclidean    48,080       1,136,170       23.63 x
i8         scipy.sqeuclidean    120,281      1,371,827       11.41 x
f32        numpy.inner          642,658      1,103,082       1.72 x
f16        numpy.inner          171,904      1,078,110       6.27 x
i8         numpy.inner          459,605      1,438,284       3.13 x
u8         scipy.hamming        796,739      10,041,370      12.60 x
u8         scipy.jaccard        534,529      9,181,640       17.18 x

Between 2 Vectors, Batch Size: 1,000

Datatype   Method               Ops/s        SimSIMD Ops/s   SimSIMD Improvement
f32        scipy.cosine         31,162       2,692,131       86.39 x
f16        scipy.cosine         14,721       3,034,736       206.15 x
i8         scipy.cosine         38,837       10,442,992      268.89 x
f32        scipy.sqeuclidean    177,025      2,705,020       15.28 x
f16        scipy.sqeuclidean    50,164       3,151,582       62.83 x
i8         scipy.sqeuclidean    121,269      6,081,171       50.15 x
f32        numpy.inner          657,257      3,078,979       4.68 x
f16        numpy.inner          171,875      3,153,530       18.35 x
i8         numpy.inner          459,975      9,719,212       21.13 x
u8         scipy.hamming        845,629      46,755,187      55.29 x
u8         scipy.jaccard        546,286      30,730,463      56.25 x

Between All Pairs of Vectors (cdist), Batch Size: 1,000

Datatype   Method               Ops/s        SimSIMD Ops/s   SimSIMD Improvement
f32        scipy.cosine         1,570,730    3,066,406       1.95 x
f16        scipy.cosine         1,605,807    3,057,974       1.90 x
i8         scipy.cosine         1,590,897    14,750,039      9.27 x
f32        scipy.sqeuclidean    1,622,545    3,272,164       2.02 x
f16        scipy.sqeuclidean    1,578,548    3,178,287       2.01 x
i8         scipy.sqeuclidean    1,592,369    6,436,115       4.04 x
f32        numpy.inner          58,199,003   3,445,812       0.06 x
f16        numpy.inner          218,842      3,274,774       14.96 x
i8         numpy.inner          1,356,633    14,812,931      10.92 x
u8         scipy.hamming        78,713,290   579,018,344     7.36 x
u8         scipy.jaccard        34,455,520   318,356,871     9.24 x