Set intersections are one of the standard operations in databases and search engines.

Chances are, you rely on them every day without realizing that they are some of the most complex operations to accelerate with SIMD instructions. SIMD instructions make up the majority of modern Assembly instruction sets on x86 and Arm and can yield 10x speedups, but due to their complexity, they are almost never used in production codebases.

Meet SimSIMD, which achieves up to 5x faster set intersections for sorted arrays of unique u16 and u32 values. With SimSIMD v5.3, we’ve broken new ground: it was one of the first libraries to support Arm’s Scalable Vector Extensions (SVE) and now leads the charge on SVE2.


This latest release uses instructions like HISTCNT and MATCH and compares their performance to Intel’s VP2INTERSECT in AVX-512, along with simpler methods like Galloping. These instructions are already live on AWS Graviton 4 CPUs and will soon power Nvidia’s Grace Hopper, Microsoft’s Cobalt, and Google’s Axion. So, let’s dive into the details and see how more software can get those 5x speedups.

Baseline Algorithm

Let’s say you have two sets of unique integers, like identifiers of different documents or tokens. You’d often store them as sorted arrays and intersect them with:

def intersect(a, b):
    i, j = 0, 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result

Rewriting this in C++ gives us a simplified analog of std::set_intersection from the Standard Template Library:

template <typename T>
std::vector<T> intersect(std::vector<T> const & a, std::vector<T> const & b) {
    std::vector<T> result;
    auto i = a.begin(), j = b.begin();
    while (i != a.end() && j != b.end()) {
        if (*i == *j) {
            result.push_back(*i);
            ++i;
            ++j;
        } else if (*i < *j) {
            ++i;
        } else {
            ++j;
        }
    }
    return result;
}

In the SimSIMD v5.1 library release, I’ve added Set Intersections as a similarity metric for sparse vector representations. There, I don’t need to export the matches but rather just count them, making the serial algorithm even more straightforward:

template <typename T>
size_t intersection_size(T const *a, T const *b, size_t n, size_t m) noexcept {
    size_t i = 0, j = 0, count = 0;
    while (i < n && j < m) {
        if (a[i] < b[j]) i++;
        else if (a[i] > b[j]) j++;
        else i++, j++, count++;
    }
    return count;
}

There are several ways to optimize this algorithm, but the most common one is using the Galloping algorithm, a binary search with a twist.

Galloping

The Galloping algorithm is, at its core, an application of binary search. The idea is to look up the current element of the first array in the second array. If the element is found, we add it to the result and continue from the next element. Either way, every iteration shrinks the searchable scope within the second array.

#include <algorithm> // for `std::lower_bound` binary search procedure

template <typename T>
size_t galloping_intersection_size(T const *a, T const *b, size_t n, size_t m) noexcept {
    size_t i = 0, j = 0, count = 0;    
    while (i < n && j < m) {
        // If elements are equal, increment both indices and count
        if (a[i] == b[j]) count++, i++, j++;

        // If a[i] is smaller, gallop through b to find its match
        else if (a[i] < b[j]) j = std::lower_bound(b + j, b + m, a[i]) - b;

        // If a[i] is larger, gallop through a to find its match
        else i = std::lower_bound(a + i, a + n, b[j]) - a;
    }
    return count;
}
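
Strictly speaking, textbook Galloping adds an exponential probing phase before the binary search: double the step until you overshoot the needle, then binary-search only the last bracketed interval, so each lookup costs logarithmic time in the distance to the match rather than in the array size. A minimal sketch of that probe (the gallop_lower_bound helper is illustrative, not part of SimSIMD):

#include <algorithm> // for `std::lower_bound` and `std::min`
#include <cstddef>   // for `std::size_t`

// Find the first position in [begin, end) where `data` is not less than `needle`,
// probing at exponentially growing offsets before the final binary search.
template <typename T>
std::size_t gallop_lower_bound(T const *data, std::size_t begin, std::size_t end, T needle) {
    std::size_t step = 1;
    while (begin + step < end && data[begin + step] < needle) step *= 2;
    std::size_t lo = begin + step / 2;           // last probe known to be smaller (or `begin`)
    std::size_t hi = std::min(begin + step, end); // first probe known to be larger (or `end`)
    return std::lower_bound(data + lo, data + hi, needle) - data;
}

Swapping each std::lower_bound call in the snippet above for gallop_lower_bound yields the classic formulation.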

As you might guess, if one array is orders of magnitude smaller than the other, the Galloping algorithm can be much faster. The benefit generally becomes noticeable when the ratio of the sizes exceeds 100:1. But when dealing with 16-bit uint16_t integers, the computer still fetches entire cache lines from DRAM to SRAM, so the read amplification of every binary-search iteration is 32x, and the latency of a DRAM access is easily 100x higher than the single CPU cycle needed to compare two integers. Anything else we can do?

Baseline Benchmarks

The practical runtime of these algorithms depends on the size of the inputs and the size of their intersection. The SimSIMD benchmarking suite includes benchmarks for exactly these scenarios, implemented with Google Benchmark.

cmake -D CMAKE_BUILD_TYPE=Release -D SIMSIMD_BUILD_BENCHMARKS=1 -B build_release
cmake --build build_release --config Release
build_release/simsimd_bench --benchmark_filter=intersect

In SimSIMD applications, we generally deal with vectors containing roughly 128, 1'024, 8'192, or up to 65'536 elements. The intersection is emulated to be 1%, 5%, 50%, or 95% of the smaller array. This is what the logic for defining benchmarks looks like:

// Register different lengths, intersection sizes, and distributions:
// 2 first lengths * 3 second length multipliers * 4 intersection grades = 24 benchmarks for each metric.
for (std::size_t first_len : {128, 1024}) {                         //< 2 lengths
    for (std::size_t second_len_multiplier : {1, 8, 64}) {          //< 3 length multipliers
        for (double intersection_share : {0.01, 0.05, 0.5, 0.95}) { //< 4 intersection grades
            std::size_t intersection_size = static_cast<std::size_t>(first_len * intersection_share);
            std::size_t second_len = first_len * second_len_multiplier;
            std::string bench_name = name +
                "<#A=" + std::to_string(first_len) +
                ",#B=" + std::to_string(second_len) +
                ",#A∩B=" + std::to_string(intersection_size) + ">";
            // ... register the benchmark under `bench_name` ...
        }
    }
}

// Resulting in names like:
// - intersect_u16_ice<#A=128,#B=128,#A∩B=1>
// - intersect_u16_ice<#A=1024,#B=8192,#A∩B=512>

Running it, we get a baseline for our serial implementation:

$ build_release/simsimd_bench --benchmark_filter=intersect_u16_serial
-----------------------------------------------------------------------------------------------------------
Benchmark                                                Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------
intersect_u16_serial<#A=128,#B=128,#A∩B=1>            567 ns          567 ns     24785678 pairs=1.76263M/s
intersect_u16_serial<#A=128,#B=128,#A∩B=6>            567 ns          567 ns     24598141 pairs=1.76286M/s
intersect_u16_serial<#A=128,#B=128,#A∩B=64>           569 ns          569 ns     24741572 pairs=1.75684M/s
intersect_u16_serial<#A=128,#B=128,#A∩B=121>          568 ns          568 ns     24871638 pairs=1.76073M/s
intersect_u16_serial<#A=128,#B=1024,#A∩B=1>          2508 ns         2508 ns      5591748 pairs=398.803k/s
intersect_u16_serial<#A=128,#B=1024,#A∩B=6>          2509 ns         2509 ns      5589871 pairs=398.535k/s
intersect_u16_serial<#A=128,#B=1024,#A∩B=64>         2530 ns         2530 ns      5564535 pairs=395.33k/s
intersect_u16_serial<#A=128,#B=1024,#A∩B=121>        2522 ns         2522 ns      5532306 pairs=396.447k/s
intersect_u16_serial<#A=128,#B=8192,#A∩B=1>          4791 ns         4791 ns      2920833 pairs=208.737k/s
intersect_u16_serial<#A=128,#B=8192,#A∩B=6>          4800 ns         4800 ns      2923139 pairs=208.346k/s
intersect_u16_serial<#A=128,#B=8192,#A∩B=64>         4821 ns         4820 ns      2906942 pairs=207.448k/s
intersect_u16_serial<#A=128,#B=8192,#A∩B=121>        4843 ns         4843 ns      2897334 pairs=206.504k/s
intersect_u16_serial<#A=1024,#B=1024,#A∩B=10>        4484 ns         4484 ns      3122873 pairs=223.023k/s
intersect_u16_serial<#A=1024,#B=1024,#A∩B=51>        4479 ns         4479 ns      3124662 pairs=223.261k/s
intersect_u16_serial<#A=1024,#B=1024,#A∩B=512>       4484 ns         4484 ns      3125584 pairs=223.034k/s
intersect_u16_serial<#A=1024,#B=1024,#A∩B=972>       4500 ns         4500 ns      3104588 pairs=222.229k/s
intersect_u16_serial<#A=1024,#B=8192,#A∩B=10>       20118 ns        20117 ns       696244 pairs=49.7084k/s
intersect_u16_serial<#A=1024,#B=8192,#A∩B=51>       20134 ns        20134 ns       696160 pairs=49.6682k/s
intersect_u16_serial<#A=1024,#B=8192,#A∩B=512>      20125 ns        20124 ns       695799 pairs=49.6911k/s
intersect_u16_serial<#A=1024,#B=8192,#A∩B=972>      20102 ns        20102 ns       695762 pairs=49.7464k/s

Sadly, the std::unordered_set used to produce the input dataset is too slow to scale to 65'536 elements 🤦‍♂️
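
If one wanted to scale the inputs further, a cheaper generator is easy to imagine: over-sample into a flat vector, then sort and deduplicate in place. A sketch under those assumptions, not the actual SimSIMD benchmark code:

#include <algorithm>
#include <cstdint>
#include <random>
#include <vector>

// Produce `count` sorted unique u16 values without a hash set:
// over-sample into a flat buffer, then sort + deduplicate in place.
// Note: `count` must not exceed the 65'536 distinct u16 values.
std::vector<std::uint16_t> sorted_unique_u16(std::size_t count, std::mt19937 &rng) {
    std::uniform_int_distribution<std::uint32_t> dist(0, 65535);
    std::vector<std::uint16_t> values;
    while (values.size() < count) {
        // Top up with fresh samples; duplicates are removed below.
        for (std::size_t i = values.size(); i < count + count / 4 + 16; ++i)
            values.push_back(static_cast<std::uint16_t>(dist(rng)));
        std::sort(values.begin(), values.end());
        values.erase(std::unique(values.begin(), values.end()), values.end());
    }
    values.resize(count); // trim the excess, keeping the result sorted and unique
    return values;
}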

SIMD

I am not even remotely the first to think about accelerating set intersections with SIMD instructions. Several single-purpose repositories explore the idea, and there are also production systems built on it. Let’s start with SSE usage in NMSLIB.

String Matching Instructions in SSE

The Non-Metric Space Library (NMSLIB) is one of the best of the original Nearest-Neighbor Search libraries. Here is a snippet from their SparseScalarProductFastIntern function:

__m128i id1 = _mm_loadu_si128(reinterpret_cast<const __m128i *>(&pBlockIds1[i1]));
__m128i id2 = _mm_loadu_si128(reinterpret_cast<const __m128i *>(&pBlockIds2[i2]));
while (true) {
    __m128i cmpRes = _mm_cmpistrm(id2, id1,
        _SIDD_UWORD_OPS | _SIDD_CMP_EQUAL_ANY | _SIDD_BIT_MASK);
    int r = _mm_extract_epi32(cmpRes, 0);
    // ... (rest of the loop omitted)

The _mm_cmpistrm intrinsic maps to the pcmpistrm instruction from the SSE4.2 instruction set and is used to compare two strings. It’s a heavy instruction, with a latency of 10 cycles on Intel Skylake CPUs and tons of flags. It’s so bad that calling it the fcntl of SIMD instructions wouldn’t be an exaggeration:

_SIDD_UBYTE_OPS                // unsigned 8-bit characters
_SIDD_UWORD_OPS                // unsigned 16-bit characters
_SIDD_SBYTE_OPS                // signed 8-bit characters
_SIDD_SWORD_OPS                // signed 16-bit characters
_SIDD_CMP_EQUAL_ANY            // compare equal any
_SIDD_CMP_RANGES               // compare ranges
_SIDD_CMP_EQUAL_EACH           // compare equal each
_SIDD_CMP_EQUAL_ORDERED        // compare equal ordered
_SIDD_NEGATIVE_POLARITY        // negate results
_SIDD_MASKED_NEGATIVE_POLARITY // negate results only before end of string
_SIDD_LEAST_SIGNIFICANT        // index only: return last significant bit
_SIDD_MOST_SIGNIFICANT         // index only: return most significant bit
_SIDD_BIT_MASK                 // mask only: return bit mask
_SIDD_UNIT_MASK                // mask only: return byte/word mask
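
For the curious, a minimal self-contained use may look like this, counting how many 16-bit values of one 8-element block appear in another. The sketch below uses the explicit-length sibling _mm_cmpestrm to sidestep implicit null-termination; it’s an illustration, not NMSLIB’s exact code:

#include <nmmintrin.h> // SSE4.2

// Count how many of the 8 u16 values in `b` appear among the 8 u16 values in `a`.
inline int count_matches_sse42(unsigned short const *a, unsigned short const *b) {
    __m128i a_vec = _mm_loadu_si128(reinterpret_cast<__m128i const *>(a));
    __m128i b_vec = _mm_loadu_si128(reinterpret_cast<__m128i const *>(b));
    // For every u16 in `b_vec`, set a bit if it equals ANY u16 in `a_vec`.
    __m128i mask = _mm_cmpestrm(a_vec, 8, b_vec, 8,
        _SIDD_UWORD_OPS | _SIDD_CMP_EQUAL_ANY | _SIDD_BIT_MASK);
    return _mm_popcnt_u32(static_cast<unsigned>(_mm_extract_epi32(mask, 0)));
}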

Announced back in 2006, it was designed for 128-bit XMM registers. It’s not a stretch to imagine that we can do better in 2024 with 512-bit ZMM registers and much more powerful AVX-512 instructions.

Integer Comparisons with AVX-512BW

On Intel Ice Lake and newer CPUs, we can use AVX-512BW instructions for simple integer comparisons as a baseline implementation:

  1. Load one element from the first array and a vector of elements from the second array.
  2. Compare the first element with all elements in the vector.
  3. Skip:
    1. potentially many elements in the second array and
    2. at most, one element in the first array.
  4. Repeat.

One nifty optimization is to keep the shorter array in the first argument and the longer array in the second argument, swapping them if necessary. Putting this into code for uint16_t arguments, targeting Intel Ice Lake CPUs and newer, we get:

void simsimd_intersect_u16_ice(
    simsimd_u16_t const* shorter, simsimd_u16_t const* longer,
    simsimd_size_t shorter_length, simsimd_size_t longer_length,
    simsimd_distance_t* results) {

    simsimd_size_t intersection_count = 0;
    simsimd_size_t shorter_idx = 0, longer_idx = 0;
    simsimd_size_t longer_load_size;
    __mmask32 longer_mask;

    while (shorter_idx < shorter_length && longer_idx < longer_length) {
        // Load `shorter_member` and broadcast it to shorter vector, load `longer_members_vec` from memory.
        simsimd_size_t longer_remaining = longer_length - longer_idx;
        simsimd_u16_t shorter_member = shorter[shorter_idx];
        __m512i shorter_member_vec = _mm512_set1_epi16(*(short*)&shorter_member);
        __m512i longer_members_vec;
        if (longer_remaining < 32) {
            longer_load_size = longer_remaining;
            longer_mask = (__mmask32)_bzhi_u32(0xFFFFFFFF, longer_remaining);
        } else {
            longer_load_size = 32;
            longer_mask = 0xFFFFFFFF;
        }
        longer_members_vec = _mm512_maskz_loadu_epi16(longer_mask, (__m512i const*)(longer + longer_idx));

        // Compare `shorter_member` with each element in `longer_members_vec`,
        // and jump to the position of the match. There can be only one match at most!
        __mmask32 equal_mask = _mm512_mask_cmpeq_epu16_mask(longer_mask, shorter_member_vec, longer_members_vec);
        simsimd_size_t equal_count = equal_mask != 0;
        intersection_count += equal_count;

        // When comparing a scalar against a sorted array, we can find three types of elements:
        // - entries that scalar is greater than,
        // - entries that scalar is equal to,
        // - entries that scalar is less than,
        // ... in that order! Any of them can be an empty set.
        __mmask32 greater_mask = _mm512_mask_cmplt_epu16_mask(longer_mask, longer_members_vec, shorter_member_vec);
        simsimd_size_t greater_count = _mm_popcnt_u32(greater_mask);
        simsimd_size_t smaller_exists = longer_load_size > greater_count - equal_count;

        // Advance the first array:
        // - to the next element, if a match was found,
        // - to the next element, if the current element is smaller than any elements in the second array.
        shorter_idx += equal_count | smaller_exists;
        // Advance the second array:
        // - to the next element after match, if a match was found,
        // - to the first element that is greater than the current element in the first array, if no match was found.
        longer_idx += greater_count + equal_count;

        // At any given cycle, take one entry from shorter array and compare it with multiple from the longer array.
        // For that, we need to swap the arrays if necessary.
        if ((shorter_length - shorter_idx) > (longer_length - longer_idx)) {
            simsimd_u16_t const* temp_array = shorter;
            shorter = longer, longer = temp_array;
            simsimd_size_t temp_length = shorter_length;
            shorter_length = longer_length, longer_length = temp_length;
            simsimd_size_t temp_idx = shorter_idx;
            shorter_idx = longer_idx, longer_idx = temp_idx;
        }
    }
    *results = intersection_count;
}

Every time we load data, we need to perform two comparisons and analyze the resulting masks. Running the benchmark, we get:

---------------------------------------------------------------------------------------------------------
Benchmark                                              Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------------
intersect_u16_ice<#A=128,#B=128,#A∩B=1>             875 ns          875 ns     16248886 pairs=1.14342M/s
intersect_u16_ice<#A=128,#B=128,#A∩B=6>             873 ns          873 ns     16081249 pairs=1.14555M/s
intersect_u16_ice<#A=128,#B=128,#A∩B=64>            882 ns          882 ns     15851609 pairs=1.13354M/s
intersect_u16_ice<#A=128,#B=128,#A∩B=121>           916 ns          916 ns     15282595 pairs=1091.32k/s
intersect_u16_ice<#A=128,#B=1024,#A∩B=1>            955 ns          955 ns     14660187 pairs=1047.53k/s
intersect_u16_ice<#A=128,#B=1024,#A∩B=6>            955 ns          955 ns     14663375 pairs=1047.57k/s
intersect_u16_ice<#A=128,#B=1024,#A∩B=64>           952 ns          952 ns     14702462 pairs=1050.17k/s
intersect_u16_ice<#A=128,#B=1024,#A∩B=121>          949 ns          949 ns     14743103 pairs=1053.59k/s
intersect_u16_ice<#A=128,#B=8192,#A∩B=1>           2718 ns         2718 ns      5168053 pairs=367.871k/s
intersect_u16_ice<#A=128,#B=8192,#A∩B=6>           2698 ns         2698 ns      5155819 pairs=370.664k/s
intersect_u16_ice<#A=128,#B=8192,#A∩B=64>          2705 ns         2705 ns      5203675 pairs=369.686k/s
intersect_u16_ice<#A=128,#B=8192,#A∩B=121>         2693 ns         2693 ns      5187007 pairs=371.377k/s
intersect_u16_ice<#A=1024,#B=1024,#A∩B=10>         7310 ns         7310 ns      1910292 pairs=136.8k/s
intersect_u16_ice<#A=1024,#B=1024,#A∩B=51>         7312 ns         7312 ns      1913190 pairs=136.759k/s
intersect_u16_ice<#A=1024,#B=1024,#A∩B=512>        7365 ns         7365 ns      1900946 pairs=135.781k/s
intersect_u16_ice<#A=1024,#B=1024,#A∩B=972>        7439 ns         7439 ns      1882319 pairs=134.43k/s
intersect_u16_ice<#A=1024,#B=8192,#A∩B=10>         7682 ns         7681 ns      1821784 pairs=130.183k/s
intersect_u16_ice<#A=1024,#B=8192,#A∩B=51>         7695 ns         7695 ns      1821861 pairs=129.955k/s
intersect_u16_ice<#A=1024,#B=8192,#A∩B=512>        7643 ns         7643 ns      1829955 pairs=130.842k/s
intersect_u16_ice<#A=1024,#B=8192,#A∩B=972>        7617 ns         7617 ns      1838612 pairs=131.279k/s

Highlighting the results:

Benchmark                            Serial      🆕 AVX-512BW
Small:   #A=128,  #B=128,  #A∩B=1    1'762 K/s   1'143 K/s
Medium:  #A=128,  #B=1024, #A∩B=6    398 K/s     1'047 K/s
Large:   #A=1024, #B=1024, #A∩B=512  223 K/s     135 K/s
Largest: #A=1024, #B=8192, #A∩B=972  49 K/s      131 K/s

We can do better.

Emulating AVX-512VP2INTERSECT

AVX-512 added many instructions, one of the most interesting being the VP2INTERSECT instructions. Here is how Intel documents them:

Compute intersection of packed 32-bit integer vectors a and b, and store indication of match in the corresponding bit of two mask registers specified by k1 and k2. A match in corresponding elements of a and b is indicated by a set bit in the corresponding bit of the mask registers.

MEM[k1+15:k1] := 0
MEM[k2+15:k2] := 0
FOR i := 0 TO 15
	FOR j := 0 TO 15
		match := (a.dword[i] == b.dword[j] ? 1 : 0)
		MEM[k1+15:k1].bit[i] |= match
		MEM[k2+15:k2].bit[j] |= match
	ENDFOR
ENDFOR

Even from the pseudo-code, it clearly does at least 2x more work than a set intersection needs, comparing all pairs of elements instead of just the upper or lower triangle of the matrix. Moreover, it applies only to 32-bit and 64-bit integers and has extremely high latency. It was an easy pass, given that only a few Intel Tiger Lake mobile CPUs support the VP2INTERSECT subset of AVX-512.

Still, a 2022 paper by Guille Diez-Canas explores emulating this instruction using VPERM, VPSHUF, and VALIGN:

Since computing set intersections (or sizes of set intersections) of sorted vectors only makes use of the first output result of a vp2intersectd instruction (as seen in listing 1), we implement equivalents to all instructions in the VP2INTERSECT subset that return only the first output mask. Note, therefore, that our implementations (except for that of section 3.3) are not drop-in replacements for the VP2INTERSECT subset. In fact, as shown in section 3.3, the strict emulation is slower than the native instruction.

The outlined procedures, however, can be valuable for our use case. I have reimplemented them in SimSIMD, integrating them with minor modifications under the name _mm512_2intersect_epi16_mask, and voila:

void simsimd_intersect_u16_ice(
    simsimd_u16_t const* a, simsimd_u16_t const* b, 
    simsimd_size_t a_length, simsimd_size_t b_length, 
    simsimd_distance_t* results) {

    simsimd_u16_t const* const a_end = a + a_length;
    simsimd_u16_t const* const b_end = b + b_length;
    simsimd_size_t c = 0;
    union vec_t {
        __m512i zmm;
        simsimd_u16_t u16[32];
        simsimd_u8_t u8[64];
    } a_vec, b_vec;

    while (a + 32 < a_end && b + 32 < b_end) {
        a_vec.zmm = _mm512_loadu_si512((__m512i const*)a);
        b_vec.zmm = _mm512_loadu_si512((__m512i const*)b);

        // Intersecting registers with `_mm512_2intersect_epi16_mask` involves a lot of shuffling
        // and comparisons, so we want to avoid it if the slices don't overlap at all
        simsimd_u16_t a_min;
        simsimd_u16_t a_max = a_vec.u16[31];
        simsimd_u16_t b_min = b_vec.u16[0];
        simsimd_u16_t b_max = b_vec.u16[31];

        // If the slices don't overlap, advance the appropriate pointer
        while (a_max < b_min && a + 64 < a_end) {
            a += 32;
            a_vec.zmm = _mm512_loadu_si512((__m512i const*)a);
            a_max = a_vec.u16[31];
        }
        a_min = a_vec.u16[0];
        while (b_max < a_min && b + 64 < b_end) {
            b += 32;
            b_vec.zmm = _mm512_loadu_si512((__m512i const*)b);
            b_max = b_vec.u16[31];
        }
        b_min = b_vec.u16[0];

        // Now we are likely to have some overlap, so we can intersect the registers
        __mmask32 a_matches = _mm512_2intersect_epi16_mask(a_vec.zmm, b_vec.zmm);
        c += _mm_popcnt_u32(a_matches); // The `_popcnt32` symbol isn't recognized by MSVC

        // Determine the number of entries to skip in each array, by comparing
        // every element in the vector with the last (largest) element in the other array
        __m512i a_last_broadcasted = _mm512_set1_epi16(*(short const*)&a_max);
        __m512i b_last_broadcasted = _mm512_set1_epi16(*(short const*)&b_max);
        __mmask32 a_step_mask = _mm512_cmple_epu16_mask(a_vec.zmm, b_last_broadcasted);
        __mmask32 b_step_mask = _mm512_cmple_epu16_mask(b_vec.zmm, a_last_broadcasted);
        a += 32 - _lzcnt_u32((simsimd_u32_t)a_step_mask);
        b += 32 - _lzcnt_u32((simsimd_u32_t)b_step_mask);
    }

    // Handle the tail:
    simsimd_intersect_u16_serial(a, b, a_end - a, b_end - b, results);
    *results += c; // And merge it with the main body result
}

There are a lot of neat aspects that make the borrowed _mm512_2intersect_epi16_mask subroutine great. Most similar snippets may rely on the following intrinsics:

Intrinsic                   Requirements   Latency
_mm512_permutex_epi64       AVX-512F       3 cycles
_mm512_permutexvar_epi16    AVX-512BW      4-6 cycles
_mm512_permutexvar_epi8     AVX-512VBMI    3 cycles
_mm512_shuffle_epi8         AVX-512BW      1 cycle

Lesson? The permutex variants are slow and not always available. Shuffles are great! Feel free to quote me on that 😅:

If in C++ everything is a rotate, in Assembly everything is a shuffle!
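
To give a feel for what an emulation involves, here is the naive all-pairs variant for 32-bit lanes, rotating one register lane-by-lane and OR-ing the equality masks. It’s a simplified sketch, not the paper’s faster shuffle network and not SimSIMD’s 16-bit routine:

#include <immintrin.h>

// Emulate the first output mask of `vp2intersectd`: bit `i` is set when
// lane `i` of `a` equals ANY lane of `b`. Sixteen single-lane rotations
// of `b` cover all 16x16 pairs.
inline __mmask16 emulated_2intersect_epi32_mask(__m512i a, __m512i b) {
    __m512i rotate_one = _mm512_set_epi32(0, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1);
    __mmask16 matches = 0;
    for (int i = 0; i < 16; ++i) {
        matches |= _mm512_cmpeq_epi32_mask(a, b);
        b = _mm512_permutexvar_epi32(rotate_one, b); // rotate `b` by one 32-bit lane
    }
    return matches;
}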

Luckily, our snippet uses only shuffles and alignr, reaching a massive 7 million pairs per second on 128-element arrays:

--------------------------------------------------------------------------------------------------------
Benchmark                                             Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------
intersect_u16_ice<#A=128,#B=128,#A∩B=1>            129 ns          129 ns    101989513 pairs=7.72559M/s
intersect_u16_ice<#A=128,#B=128,#A∩B=6>            134 ns          134 ns    107140278 pairs=7.46949M/s
intersect_u16_ice<#A=128,#B=128,#A∩B=64>           122 ns          122 ns    113134485 pairs=8.18634M/s
intersect_u16_ice<#A=128,#B=128,#A∩B=121>          114 ns          114 ns    122765163 pairs=8.75268M/s
intersect_u16_ice<#A=128,#B=1024,#A∩B=1>          1042 ns         1042 ns     13412933 pairs=959.711k/s
intersect_u16_ice<#A=128,#B=1024,#A∩B=6>          1035 ns         1035 ns     13423867 pairs=966.278k/s
intersect_u16_ice<#A=128,#B=1024,#A∩B=64>         1038 ns         1038 ns     13401265 pairs=963.267k/s
intersect_u16_ice<#A=128,#B=1024,#A∩B=121>        1055 ns         1055 ns     13170438 pairs=948.193k/s
intersect_u16_ice<#A=128,#B=8192,#A∩B=1>          4315 ns         4315 ns      3024069 pairs=231.776k/s
intersect_u16_ice<#A=128,#B=8192,#A∩B=6>          3999 ns         3999 ns      3371134 pairs=250.088k/s
intersect_u16_ice<#A=128,#B=8192,#A∩B=64>         4486 ns         4486 ns      3278143 pairs=222.9k/s
intersect_u16_ice<#A=128,#B=8192,#A∩B=121>        4525 ns         4525 ns      3170802 pairs=220.991k/s
intersect_u16_ice<#A=1024,#B=1024,#A∩B=10>         817 ns          817 ns     17102654 pairs=1.22419M/s
intersect_u16_ice<#A=1024,#B=1024,#A∩B=51>         820 ns          820 ns     17168886 pairs=1.22003M/s
intersect_u16_ice<#A=1024,#B=1024,#A∩B=512>        793 ns          793 ns     17756237 pairs=1.26107M/s
intersect_u16_ice<#A=1024,#B=1024,#A∩B=972>        747 ns          747 ns     18261381 pairs=1.33794M/s
intersect_u16_ice<#A=1024,#B=8192,#A∩B=10>        5142 ns         5142 ns      2728465 pairs=194.496k/s
intersect_u16_ice<#A=1024,#B=8192,#A∩B=51>        5114 ns         5114 ns      2727670 pairs=195.56k/s
intersect_u16_ice<#A=1024,#B=8192,#A∩B=512>       5142 ns         5142 ns      2716714 pairs=194.491k/s
intersect_u16_ice<#A=1024,#B=8192,#A∩B=972>       5151 ns         5151 ns      2721708 pairs=194.148k/s

Highlighting the results:

Benchmark                            Serial      AVX-512BW   🆕 AVX-512VP2INTERSECT
Small:   #A=128,  #B=128,  #A∩B=1    1'762 K/s   1'143 K/s   7'725 K/s
Medium:  #A=128,  #B=1024, #A∩B=6    398 K/s     1'047 K/s   966 K/s
Large:   #A=1024, #B=1024, #A∩B=512  223 K/s     135 K/s     1'261 K/s
Largest: #A=1024, #B=8192, #A∩B=972  49 K/s      131 K/s     194 K/s

Masks & Bitsets in NEON

Before jumping to SVE2, there is a well-known SIMD extension for Arm CPUs to cover - NEON. It’s a 128-bit extension, available on most Arm CPUs, and should have been straightforward to implement, but it wasn’t. I kept running into a pattern where I needed an analog of the x86 movemask instruction to extract one bit from every byte or lane of a vector. Here is a snippet I previously borrowed from Danila Kutenin’s blog post on the Arm community pages and used in StringZilla for faster-than-LibC memmem and strstr:

// Shift every 16-bit lane right by 4 and narrow it to 8 bits:
// every byte of the input mask becomes a nibble of the 64-bit result.
simsimd_u64_t _simsimd_u8_to_u4_neon(uint8x16_t vec) {
    return vget_lane_u64(vreinterpret_u64_u8(vshrn_n_u16(vreinterpretq_u16_u8(vec), 4)), 0);
}

It is a powerful tool, use it wisely 😊
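
As a usage example, here is how one might count matching 16-bit lanes with it. Every matching u16 lane survives the shrink as 8 set bits, hence the final division; the count_equal_lanes_u16 helper is hypothetical, for illustration only:

#include <arm_neon.h>
#include <cstdint>

// Count how many 16-bit lanes of `a` equal the corresponding lanes of `b`.
inline int count_equal_lanes_u16(uint16x8_t a, uint16x8_t b) {
    uint16x8_t eq = vceqq_u16(a, b); // 0xFFFF in every matching lane
    // Shrink the 128-bit mask into 64 bits: every mask byte becomes a nibble,
    // so one matching u16 lane contributes two nibbles = 8 set bits.
    uint64_t mask = vget_lane_u64(vreinterpret_u64_u8(vshrn_n_u16(eq, 4)), 0);
    return __builtin_popcountll(mask) / 8; // GCC/Clang builtin, as discussed below
}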

Next up in my over-engineered SIMD version, I faced the issue of implementing decent bitset operations on Arm. Arm does not have a good way of computing the population count or the leading zeros of a bitset, similar to _mm_popcnt_u32 and _lzcnt_u32 on x86. One approach may be compiler intrinsics, like __builtin_popcountll and __builtin_clzll, but those are specific to GCC and Clang. Combining CPU-specific intrinsics with compiler-specific intrinsics is in bad taste, and beyond that, it’s not portable to MSVC and less popular compilers. Another approach would be to implement it in inline assembly:

int _simsimd_clz_u64(simsimd_u64_t value) {
    simsimd_u64_t result;
    __asm__("clz %x0, %x1" : "=r"(result) : "r"(value));
    return (int)result;
}

However, MSVC doesn’t support inline assembly either. In StringZilla, I needed a general-purpose solution, so I implemented vanilla serial fallbacks for when compiler support is missing. It’s ugly code with common bit-twiddling hacks and messy #if preprocessor macros.
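
Such a fallback might look like the following bit-smearing sketch (an illustration, not StringZilla’s exact code):

#include <cstdint>

// Portable count-leading-zeros for 64-bit values: localize the highest set bit
// by repeatedly halving the search window, shifting the value up as we go.
inline int clz_u64_fallback(std::uint64_t x) {
    if (x == 0) return 64;
    int n = 0;
    if ((x >> 32) == 0) n += 32, x <<= 32;
    if ((x >> 48) == 0) n += 16, x <<= 16;
    if ((x >> 56) == 0) n += 8, x <<= 8;
    if ((x >> 60) == 0) n += 4, x <<= 4;
    if ((x >> 62) == 0) n += 2, x <<= 2;
    if ((x >> 63) == 0) n += 1;
    return n;
}

A better solution for Arm, though, was to aggregate the counts within the loop body and horizontally sum them after the loop, keeping all of this off the hot path: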

// Now we are likely to have some overlap, so we can intersect the registers.
// We could perform a population count on every cycle, but it's not the cheapest in terms of cycles:
//
//      simsimd_u64_t a_matches = __builtin_popcountll(
//          _simsimd_u8_to_u4_neon(vreinterpretq_u8_u16(
//              _simsimd_intersect_u16x8_neon(a_vec.u16x8, b_vec.u16x8))));
//      c += a_matches / 8;
//
// Alternatively, we can transform match-masks into "ones", accumulate them across cycles,
// and merge them all together in the end.
uint16x8_t a_matches = _simsimd_intersect_u16x8_neon(a_vec.u16x8, b_vec.u16x8);
c_counts_vec.u16x8 = vaddq_u16(c_counts_vec.u16x8, vandq_u16(a_matches, vdupq_n_u16(1)));

Population counts - solved! Counting leading zeros is harder. We can use the vclz instruction in NEON. Looking through Arm documentation, we can find all the supported variants of the intrinsic:

SIMD ISA   Return Type   Name        Arguments
Neon       int8x8_t      vclz_s8     (int8x8_t a)
Neon       int8x16_t     vclzq_s8    (int8x16_t a)
Neon       int16x4_t     vclz_s16    (int16x4_t a)
Neon       int16x8_t     vclzq_s16   (int16x8_t a)
Neon       int32x2_t     vclz_s32    (int32x2_t a)
Neon       int32x4_t     vclzq_s32   (int32x4_t a)
Neon       uint8x8_t     vclz_u8     (uint8x8_t a)
Neon       uint8x16_t    vclzq_u8    (uint8x16_t a)
Neon       uint16x4_t    vclz_u16    (uint16x4_t a)
Neon       uint16x8_t    vclzq_u16   (uint16x8_t a)
Neon       uint32x2_t    vclz_u32    (uint32x2_t a)
Neon       uint32x4_t    vclzq_u32   (uint32x4_t a)

Notice that we also need a variant for 64-bit integers, but it’s not defined. We can circumvent that in two ways:

  1. Further shrinking the masks from 64-bit to 32-bit integers before vclz, or
  2. Treating the 64-bit masks as a pair of concatenated 32-bit masks and mixing the results afterward, as sketched below.
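
In scalar terms, the second idea boils down to the following mixing step (an illustrative helper, not SimSIMD code):

#include <cstdint>

// Treat a 64-bit mask as two 32-bit halves: count leading zeros per half
// with a 32-bit primitive, then mix the two results.
inline int clz_u64_as_two_u32(std::uint64_t mask) {
    std::uint32_t hi = static_cast<std::uint32_t>(mask >> 32);
    std::uint32_t lo = static_cast<std::uint32_t>(mask);
    if (hi) return __builtin_clz(hi); // GCC/Clang builtin, just for illustration
    return lo ? 32 + __builtin_clz(lo) : 64;
}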

So far, I couldn’t find a way to make those ideas pay off, sometimes losing as much as 50% of the performance compared to the following solution:

void simsimd_intersect_u16_neon(
    simsimd_u16_t const* a, simsimd_u16_t const* b,
    simsimd_size_t a_length, simsimd_size_t b_length,
    simsimd_distance_t* results) {

    simsimd_u16_t const* const a_end = a + a_length;
    simsimd_u16_t const* const b_end = b + b_length;
    union vec_t {
        uint16x8_t u16x8;
        simsimd_u16_t u16[8];
        simsimd_u8_t u8[16];
    } a_vec, b_vec, c_counts_vec;
    c_counts_vec.u16x8 = vdupq_n_u16(0);

    while (a + 8 < a_end && b + 8 < b_end) {
        a_vec.u16x8 = vld1q_u16(a);
        b_vec.u16x8 = vld1q_u16(b);

        // Intersecting registers with `_simsimd_intersect_u16x8_neon` involves a lot of shuffling
        // and comparisons, so we want to avoid it if the slices don't overlap at all..
        simsimd_u16_t a_min;
        simsimd_u16_t a_max = a_vec.u16[7];
        simsimd_u16_t b_min = b_vec.u16[0];
        simsimd_u16_t b_max = b_vec.u16[7];

        // If the slices don't overlap, advance the appropriate pointer
        while (a_max < b_min && a + 16 < a_end) {
            a += 8;
            a_vec.u16x8 = vld1q_u16(a);
            a_max = a_vec.u16[7];
        }
        a_min = a_vec.u16[0];
        while (b_max < a_min && b + 16 < b_end) {
            b += 8;
            b_vec.u16x8 = vld1q_u16(b);
            b_max = b_vec.u16[7];
        }
        b_min = b_vec.u16[0];

        // Now we are likely to have some overlap, so we can intersect the registers.
        uint16x8_t a_matches = _simsimd_intersect_u16x8_neon(a_vec.u16x8, b_vec.u16x8);
        c_counts_vec.u16x8 = vaddq_u16(c_counts_vec.u16x8, vandq_u16(a_matches, vdupq_n_u16(1)));

        // Counting leading zeros is tricky. 
        uint16x8_t a_last_broadcasted = vdupq_n_u16(a_max);
        uint16x8_t b_last_broadcasted = vdupq_n_u16(b_max);
        simsimd_u64_t a_step = _simsimd_clz_u64(_simsimd_u8_to_u4_neon( //
            vreinterpretq_u8_u16(vcleq_u16(a_vec.u16x8, b_last_broadcasted))));
        simsimd_u64_t b_step = _simsimd_clz_u64(_simsimd_u8_to_u4_neon( //
            vreinterpretq_u8_u16(vcleq_u16(b_vec.u16x8, a_last_broadcasted))));
        a += (64 - a_step) / 8;
        b += (64 - b_step) / 8;
    }

    simsimd_intersect_u16_serial(a, b, a_end - a, b_end - b, results);
    *results += vaddvq_u16(c_counts_vec.u16x8);
}

The _simsimd_clz_u64 helper currently falls back to a serial __builtin_clzll. Even with that, and with only 128-bit registers, the performance is still decent:

--------------------------------------------------------------------------------------------------------
Benchmark                                             Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------
intersect_u16_neon<#A=128,#B=128,#A∩B=1>           195 ns          195 ns     72473251 pairs=5.12346M/s
intersect_u16_neon<#A=128,#B=128,#A∩B=6>           193 ns          193 ns     71826322 pairs=5.17983M/s
intersect_u16_neon<#A=128,#B=128,#A∩B=64>          181 ns          181 ns     76859132 pairs=5.51211M/s
intersect_u16_neon<#A=128,#B=128,#A∩B=121>         161 ns          161 ns     86301671 pairs=6.22906M/s
intersect_u16_neon<#A=128,#B=1024,#A∩B=1>         1199 ns         1027 ns     13866808 pairs=973.295k/s
intersect_u16_neon<#A=128,#B=1024,#A∩B=6>         1171 ns         1034 ns     13729254 pairs=966.886k/s
intersect_u16_neon<#A=128,#B=1024,#A∩B=64>        1120 ns         1038 ns     13671085 pairs=963.804k/s
intersect_u16_neon<#A=128,#B=1024,#A∩B=121>       1150 ns         1051 ns     13070692 pairs=951.238k/s
intersect_u16_neon<#A=128,#B=8192,#A∩B=1>         2587 ns         2446 ns      5685615 pairs=408.885k/s
intersect_u16_neon<#A=128,#B=8192,#A∩B=6>         2595 ns         2490 ns      5538880 pairs=401.615k/s
intersect_u16_neon<#A=128,#B=8192,#A∩B=64>        2482 ns         2460 ns      5704185 pairs=406.459k/s
intersect_u16_neon<#A=128,#B=8192,#A∩B=121>       2512 ns         2512 ns      5592948 pairs=398.064k/s
intersect_u16_neon<#A=1024,#B=1024,#A∩B=10>       1599 ns         1573 ns      8893290 pairs=635.781k/s
intersect_u16_neon<#A=1024,#B=1024,#A∩B=51>       1570 ns         1570 ns      8950291 pairs=637.098k/s
intersect_u16_neon<#A=1024,#B=1024,#A∩B=512>      1488 ns         1488 ns      9449103 pairs=672.121k/s
intersect_u16_neon<#A=1024,#B=1024,#A∩B=972>      1332 ns         1332 ns     10582682 pairs=751.007k/s
intersect_u16_neon<#A=1024,#B=8192,#A∩B=10>       8997 ns         8997 ns      1556944 pairs=111.144k/s
intersect_u16_neon<#A=1024,#B=8192,#A∩B=51>       8999 ns         8999 ns      1554324 pairs=111.128k/s
intersect_u16_neon<#A=1024,#B=8192,#A∩B=512>      9126 ns         9070 ns      1543769 pairs=110.257k/s
intersect_u16_neon<#A=1024,#B=8192,#A∩B=972>      9089 ns         9089 ns      1536462 pairs=110.029k/s

Highlights:

Benchmark                            Serial      🆕 NEON
Small:   #A=128,  #B=128,  #A∩B=1    1'628 K/s   5'123 K/s
Medium:  #A=128,  #B=1024, #A∩B=6    393 K/s     963 K/s
Large:   #A=1024, #B=1024, #A∩B=512  218 K/s     672 K/s
Largest: #A=1024, #B=8192, #A∩B=972  49 K/s      110 K/s

Matching & Histograms in SVE2

Dessert, at last! Arm SVE2 is a new SIMD extension available on the latest Arm CPUs. It stretches the idea of a runtime-defined vector length to non-trivial, often integer, operations, applied either at a 128-bit lane granularity or across the whole vector.
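
To illustrate the vector-length-agnostic style, here is a toy predicated loop that counts occurrences of a value; the count_occurrences_u16 function is purely illustrative and unrelated to SimSIMD:

#include <arm_sve.h>
#include <stdint.h>

// Vector-length-agnostic counting: `svwhilelt_b16` builds a predicate covering
// the remaining elements, so the same binary adapts to any SVE register width.
uint64_t count_occurrences_u16(uint16_t const *data, uint64_t length, uint16_t needle) {
    uint64_t count = 0;
    for (uint64_t i = 0; i < length; i += svcnth()) {
        svbool_t progress = svwhilelt_b16(i, length);
        svuint16_t vec = svld1_u16(progress, data + i);
        count += svcntp_b16(progress, svcmpeq_n_u16(progress, vec, needle));
    }
    return count;
}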

SVE2 has a MATCH instruction that can be used to compute the intersection of two vectors. It operates at a 128-bit lane granularity and supports 16-bit inputs, so we need to apply it several times to cover the whole vector:

simsimd_size_t const register_size = svcnth();         // Number of 16-bit elements in a vector
simsimd_size_t const lanes_count = register_size / 8;  // 8 such elements fit in each 128-bit lane
svbool_t equal_mask = svmatch_u16(a_progress, a_vec, b_vec);
for (simsimd_size_t i = 1; i < lanes_count; i++) {
    b_vec = svext_u16(b_vec, b_vec, 8);
    equal_mask = svorr_z(svptrue_b16(), equal_mask, svmatch_u16(a_progress, a_vec, b_vec));
}
simsimd_size_t equal_count = svcntp_b16(svptrue_b16(), equal_mask);

This is a great start, but the MATCH instruction is not available for 32-bit integers. We can emulate similar behavior by running a lot more iterations of svcmpeq_u32:

svbool_t equal_mask = svpfalse_b();
for (simsimd_size_t i = 0; i < register_size; i++) {
    equal_mask = svorr_z(svptrue_b32(), equal_mask, svcmpeq_u32(a_progress, a_vec, b_vec));
    b_vec = svext_u32(b_vec, b_vec, 1);
}
simsimd_size_t equal_count = svcntp_b32(a_progress, equal_mask);

Alternatively, the new histogram instructions, such as HISTCNT and its svhistcnt_u32_z intrinsic, can be used. For every element in one vector, it counts the number of times that element appears in the other vector’s prefix. It’s practically an inclusive prefix sum of equality matches and operates over the whole register. If we think of those intersections as a matrix of matches, it is equivalent to the lower triangle of the row-major intersection matrix. To compute the upper triangle, we can reverse the order of elements (with svrev_u32) and repeat the operation, accumulating the results for the top and bottom:

//      Given α = {A, B, C, D} and β = {X, Y, Z, W}:
//
//      hist(α, β):           hist(α_rev, β_rev):
//
//        X Y Z W               W Z Y X
//      A 1 0 0 0             D 1 0 0 0
//      B 1 1 0 0             C 1 1 0 0
//      C 1 1 1 0             B 1 1 1 0
//      D 1 1 1 1             A 1 1 1 1
//
svuint32_t hist_lower = svhistcnt_u32_z(a_progress, a_vec, b_vec);
svuint32_t a_rev_vec = svrev_u32(a_vec);
svuint32_t b_rev_vec = svrev_u32(b_vec);
svuint32_t hist_upper = svrev_u32(svhistcnt_u32_z(svptrue_b32(), a_rev_vec, b_rev_vec));
svuint32_t hist = svorr_u32_x(a_progress, hist_lower, hist_upper);
svbool_t equal_mask = svcmpne_n_u32(a_progress, hist, 0);
simsimd_size_t equal_count = svcntp_b32(a_progress, equal_mask);

This is the most elegant of today’s solutions. And the performance looks mostly good:

--------------------------------------------------------------------------------------------------------
Benchmark                                             Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------
intersect_u16_sve2<#A=128,#B=128,#A∩B=1>           179 ns          178 ns     77997900 pairs=5.60245M/s
intersect_u16_sve2<#A=128,#B=128,#A∩B=6>           179 ns          179 ns     77959137 pairs=5.59776M/s
intersect_u16_sve2<#A=128,#B=128,#A∩B=64>          170 ns          170 ns     82829421 pairs=5.88598M/s
intersect_u16_sve2<#A=128,#B=128,#A∩B=121>         143 ns          143 ns     97771708 pairs=6.9995M/s
intersect_u16_sve2<#A=128,#B=1024,#A∩B=1>          900 ns          900 ns     15430306 pairs=1.11111M/s
intersect_u16_sve2<#A=128,#B=1024,#A∩B=6>          909 ns          909 ns     15374525 pairs=1099.58k/s
intersect_u16_sve2<#A=128,#B=1024,#A∩B=64>         922 ns          922 ns     15025863 pairs=1085.12k/s
intersect_u16_sve2<#A=128,#B=1024,#A∩B=121>        932 ns          932 ns     15083373 pairs=1072.6k/s
intersect_u16_sve2<#A=128,#B=8192,#A∩B=1>         2135 ns         2135 ns      6460842 pairs=468.333k/s
intersect_u16_sve2<#A=128,#B=8192,#A∩B=6>         2118 ns         2118 ns      6509484 pairs=472.238k/s
intersect_u16_sve2<#A=128,#B=8192,#A∩B=64>        2138 ns         2138 ns      6468742 pairs=467.706k/s
intersect_u16_sve2<#A=128,#B=8192,#A∩B=121>       2136 ns         2136 ns      6419653 pairs=468.097k/s
intersect_u16_sve2<#A=1024,#B=1024,#A∩B=10>       1502 ns         1502 ns      9329372 pairs=665.698k/s
intersect_u16_sve2<#A=1024,#B=1024,#A∩B=51>       1492 ns         1492 ns      9375601 pairs=670.246k/s
intersect_u16_sve2<#A=1024,#B=1024,#A∩B=512>      1416 ns         1416 ns      9859829 pairs=706.16k/s
intersect_u16_sve2<#A=1024,#B=1024,#A∩B=972>      1274 ns         1274 ns     11052636 pairs=785.05k/s
intersect_u16_sve2<#A=1024,#B=8192,#A∩B=10>       9148 ns         9148 ns      1528714 pairs=109.319k/s
intersect_u16_sve2<#A=1024,#B=8192,#A∩B=51>       9150 ns         9150 ns      1529679 pairs=109.287k/s
intersect_u16_sve2<#A=1024,#B=8192,#A∩B=512>      9148 ns         9147 ns      1527762 pairs=109.32k/s
intersect_u16_sve2<#A=1024,#B=8192,#A∩B=972>      9135 ns         9135 ns      1529316 pairs=109.473k/s

Highlights look comparable:

Benchmark                            Serial      NEON        🆕 SVE2
Small:   #A=128,  #B=128,  #A∩B=1    1'628 K/s   5'123 K/s   5'602 K/s
Medium:  #A=128,  #B=1024, #A∩B=6    393 K/s     963 K/s     1'099 K/s
Large:   #A=1024, #B=1024, #A∩B=512  218 K/s     672 K/s     706 K/s
Largest: #A=1024, #B=8192, #A∩B=972  49 K/s      110 K/s     109 K/s

But there are some artifacts this picture can’t show. The u16 SVE2 variant is always at least as fast as NEON. The u32 is a different story. In some cases, like <#A=128,#B=128,#A∩B=121> it’s a bit faster; in some, like <#A=128,#B=8192,#A∩B=1> it’s 50% slower. To reproduce, take a modern Arm CPU, like AWS Graviton 4, and try it yourself:

cmake -D CMAKE_BUILD_TYPE=Release -D SIMSIMD_BUILD_BENCHMARKS=1 -B build_release
cmake --build build_release --config Release
build_release/simsimd_bench --benchmark_filter=intersect

Still, SVE promises that once the CPUs get beefier, the registers will get wider, and the performance will skyrocket. Amen!


In the meantime, check out some of my GitHub projects for more over-engineered tech and “good first issues” for potential contributors. Happy coding 🤗