Set intersections are one of the standard operations in databases and search engines.
Chances are, you rely on them every day, but you may not realize that they are also some of the most complex operations to accelerate with SIMD instructions.
SIMD instructions make up the majority of modern assembly instruction sets on x86 and Arm. They can yield 10x speedups, but due to their complexity, they are rarely used directly in production codebases.
Meet SimSIMD, which achieves up to 5x faster set intersections for sorted arrays of unique `u16` and `u32` values. With SimSIMD v5.3, we’ve broken new ground: it was one of the first libraries to support Arm’s Scalable Vector Extensions (SVE) and now leads the charge on SVE2. This latest release uses instructions like `HISTCNT` and `MATCH` and compares their performance to Intel’s `VP2INTERSECT` in AVX-512, along with simpler methods like Galloping. These instructions are already live on AWS Graviton 4 CPUs and will soon power Nvidia’s Grace Hopper, Microsoft Cobalt, and Google Axion. So, let’s dive into the details and see how more software can get those 5x speedups.
## Baseline Algorithm
Let’s say you have two sets of unique integers, like identifiers of different documents or tokens.
You’d often store them as sorted arrays and intersect them with something like:
```py
def intersect(a, b):
    i, j = 0, 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result
```
Rewriting this in C++ gives us a simplified analog of `std::set_intersection` from the Standard Template Library:
```cpp
template <typename T>
std::vector<T> intersect(std::vector<T> const & a, std::vector<T> const & b) {
    std::vector<T> result;
    auto i = a.begin(), j = b.begin();
    while (i != a.end() && j != b.end()) {
        if (*i == *j) {
            result.push_back(*i);
            ++i;
            ++j;
        } else if (*i < *j) {
            ++i;
        } else {
            ++j;
        }
    }
    return result;
}
```
In the SimSIMD v5.1 library release, I’ve added Set Intersections as a similarity metric for sparse vector representations.
There, I don’t need to export the matches but rather just count them, making the serial algorithm even more straightforward:
```cpp
template <typename T>
size_t intersection_size(T const *a, T const *b, size_t n, size_t m) noexcept {
    size_t i = 0, j = 0, count = 0;
    while (i < n && j < m) {
        if (a[i] < b[j]) i++;
        else if (a[i] > b[j]) j++;
        else i++, j++, count++;
    }
    return count;
}
```
There are several ways to optimize this algorithm, but the most common one is using the Galloping algorithm, a binary search with a twist.
## Galloping
The Galloping algorithm is a straightforward application of binary search. The idea is to look up the current element of the first array in the second array. If the element is found, we add it to the result and continue the search from the next element. Either way, every iteration shrinks the searchable scope within the second array.
```cpp
#include <algorithm> // for the `std::lower_bound` binary search procedure

template <typename T>
size_t galloping_intersection_size(T const *a, T const *b, size_t n, size_t m) noexcept {
    size_t i = 0, j = 0, count = 0;
    while (i < n && j < m) {
        // If elements are equal, increment both indices and the count
        if (a[i] == b[j]) count++, i++, j++;
        // If a[i] is smaller, gallop through a to find the first element not less than b[j]
        else if (a[i] < b[j]) i = std::lower_bound(a + i, a + n, b[j]) - a;
        // If a[i] is larger, gallop through b to find the first element not less than a[i]
        else j = std::lower_bound(b + j, b + m, a[i]) - b;
    }
    return count;
}
```
As you might guess, if one array is orders of magnitude smaller than the other, the Galloping algorithm can be much faster.
It generally becomes noticeable when the ratio of the sizes exceeds 100:1.
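The snippet above jumps straight into a binary search over the whole remaining range. The textbook galloping variant first probes exponentially growing offsets and only then binary-searches the bracketed range. Here is a minimal sketch of that idea, with my own naming, not the SimSIMD implementation:

```cpp
#include <algorithm>
#include <cstddef>

// Exponential probing followed by a bounded binary search: a sketch of classic
// galloping, not the SimSIMD code. Returns the index of the first element in
// `data[begin, end)` that is not less than `needle`.
template <typename T>
std::size_t gallop_lower_bound(T const *data, std::size_t begin, std::size_t end, T needle) {
    std::size_t probe = begin, step = 1;
    // Double the step until we overshoot the needle or run out of data
    while (probe + step < end && data[probe + step] < needle) probe += step, step *= 2;
    // Binary search only within the last bracketed range
    std::size_t last = probe + step + 1 < end ? probe + step + 1 : end;
    return std::lower_bound(data + probe, data + last, needle) - data;
}
```

The exponential probe keeps the search close to the current position, which is also friendlier to caches when the next match is nearby.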
When you are dealing with `uint16_t` 16-bit integers, the computer will still fetch entire 64-byte cache lines from DRAM to SRAM, so the read amplification of every binary search iteration is 32x (one cache line holds 32 such values), and the latency of a DRAM access is easily 100x higher than the single CPU cycle needed to compare two integers.
Anything else we can do?
## Baseline Benchmarks
The practical runtime of these algorithms depends on the sizes of the inputs and of their intersection.
The SimSIMD benchmarking suite, built with Google Benchmark, covers exactly those cases.
```sh
cmake -D CMAKE_BUILD_TYPE=Release -D SIMSIMD_BUILD_BENCHMARKS=1 -B build_release
cmake --build build_release --config Release
build_release/simsimd_bench --benchmark_filter=intersect
```
In SimSIMD applications, we generally deal with vectors containing roughly 128, 1'024, 8'192, or up to 65'536 elements.
The intersection is constructed to be 1%, 5%, 50%, or 95% of the smaller array.
This is what the logic for defining benchmarks looks like:
```cpp
// Register different lengths, intersection sizes, and distributions
// 2 first lengths * 3 second length multipliers * 4 intersection grades = 24 benchmarks for each metric.
for (std::size_t first_len : {128, 1024}) {                          //< 2 lengths
    for (std::size_t second_len_multiplier : {1, 8, 64}) {           //< 3 length multipliers
        for (double intersection_share : {0.01, 0.05, 0.5, 0.95}) {  //< 4 intersection grades
            std::size_t intersection_size = static_cast<std::size_t>(first_len * intersection_share);
            std::size_t second_len = first_len * second_len_multiplier;
            std::string bench_name = name +
                "<#A=" + std::to_string(first_len) +
                ",#B=" + std::to_string(second_len) +
                ",#A∩B=" + std::to_string(intersection_size) + ">";
            // Results in:
            // - intersect_u16_ice<#A=128,#B=128,#A∩B=1>
            // - intersect_u16_ice<#A=1024,#B=8192,#A∩B=512>
```
Running it, we get a baseline for our serial implementation:
```sh
$ build_release/simsimd_bench --benchmark_filter=intersect_u16_serial
```
```
-----------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------
intersect_u16_serial<#A=128,#B=128,#A∩B=1> 567 ns 567 ns 24785678 pairs=1.76263M/s
intersect_u16_serial<#A=128,#B=128,#A∩B=6> 567 ns 567 ns 24598141 pairs=1.76286M/s
intersect_u16_serial<#A=128,#B=128,#A∩B=64> 569 ns 569 ns 24741572 pairs=1.75684M/s
intersect_u16_serial<#A=128,#B=128,#A∩B=121> 568 ns 568 ns 24871638 pairs=1.76073M/s
intersect_u16_serial<#A=128,#B=1024,#A∩B=1> 2508 ns 2508 ns 5591748 pairs=398.803k/s
intersect_u16_serial<#A=128,#B=1024,#A∩B=6> 2509 ns 2509 ns 5589871 pairs=398.535k/s
intersect_u16_serial<#A=128,#B=1024,#A∩B=64> 2530 ns 2530 ns 5564535 pairs=395.33k/s
intersect_u16_serial<#A=128,#B=1024,#A∩B=121> 2522 ns 2522 ns 5532306 pairs=396.447k/s
intersect_u16_serial<#A=128,#B=8192,#A∩B=1> 4791 ns 4791 ns 2920833 pairs=208.737k/s
intersect_u16_serial<#A=128,#B=8192,#A∩B=6> 4800 ns 4800 ns 2923139 pairs=208.346k/s
intersect_u16_serial<#A=128,#B=8192,#A∩B=64> 4821 ns 4820 ns 2906942 pairs=207.448k/s
intersect_u16_serial<#A=128,#B=8192,#A∩B=121> 4843 ns 4843 ns 2897334 pairs=206.504k/s
intersect_u16_serial<#A=1024,#B=1024,#A∩B=10> 4484 ns 4484 ns 3122873 pairs=223.023k/s
intersect_u16_serial<#A=1024,#B=1024,#A∩B=51> 4479 ns 4479 ns 3124662 pairs=223.261k/s
intersect_u16_serial<#A=1024,#B=1024,#A∩B=512> 4484 ns 4484 ns 3125584 pairs=223.034k/s
intersect_u16_serial<#A=1024,#B=1024,#A∩B=972> 4500 ns 4500 ns 3104588 pairs=222.229k/s
intersect_u16_serial<#A=1024,#B=8192,#A∩B=10> 20118 ns 20117 ns 696244 pairs=49.7084k/s
intersect_u16_serial<#A=1024,#B=8192,#A∩B=51> 20134 ns 20134 ns 696160 pairs=49.6682k/s
intersect_u16_serial<#A=1024,#B=8192,#A∩B=512> 20125 ns 20124 ns 695799 pairs=49.6911k/s
intersect_u16_serial<#A=1024,#B=8192,#A∩B=972> 20102 ns 20102 ns 695762 pairs=49.7464k/s
```
Sadly, the `std::unordered_set` used to produce the input dataset is too slow to scale to 65'536 elements 🤦♂️
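If you want to regenerate such inputs without hashing, one option is to build them from a shuffled range instead. Below is a possible sketch, with hypothetical naming, not the generator used in these benchmarks:

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical input generator: enumerate the full 16-bit range, shuffle it,
// keep the first `count` values, and sort them. No hash tables involved, and it
// scales all the way to the full 65'536-element case.
std::vector<std::uint16_t> make_sorted_unique_u16(std::size_t count, std::uint32_t seed) {
    std::vector<std::uint16_t> values(65536);
    std::iota(values.begin(), values.end(), std::uint16_t(0)); // 0, 1, 2, ... 65535
    std::shuffle(values.begin(), values.end(), std::mt19937(seed));
    if (count < values.size()) values.resize(count);
    std::sort(values.begin(), values.end());
    return values;
}
```

A flat `std::shuffle` plus `std::sort` tends to be far cheaper than inserting the same number of keys into a hash set.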
## SIMD
I am not even remotely the first to think about accelerating set intersections with SIMD instructions.
There are several representative single-purpose repositories, as well as production systems built on those ideas.
Let’s start with SSE usage in NMSLIB.
## String Matching Instructions in SSE
The Non-Metric Space Library (NMSLIB) is one of the original and best-known Nearest-Neighbors Search libraries.
Here is a snippet from its `SparseScalarProductFastIntern` function:
```cpp
__m128i id1 = _mm_loadu_si128(reinterpret_cast<const __m128i *>(&pBlockIds1[i1]));
__m128i id2 = _mm_loadu_si128(reinterpret_cast<const __m128i *>(&pBlockIds2[i2]));
while (true) {
    __m128i cmpRes = _mm_cmpistrm(id2, id1,
        _SIDD_UWORD_OPS | _SIDD_CMP_EQUAL_ANY | _SIDD_BIT_MASK);
    int r = _mm_extract_epi32(cmpRes, 0);
```
The `_mm_cmpistrm` intrinsic maps to the `pcmpistrm` instruction from the SSE4.2 instruction set, and it is used to compare two strings.
It’s a heavy instruction, with a latency of 10 cycles on Intel Skylake CPUs and tons of flags.
It’s so bad that calling it the `fcntl` of SIMD instructions wouldn’t be an exaggeration:
```c
_SIDD_UBYTE_OPS                // unsigned 8-bit characters
_SIDD_UWORD_OPS                // unsigned 16-bit characters
_SIDD_SBYTE_OPS                // signed 8-bit characters
_SIDD_SWORD_OPS                // signed 16-bit characters
_SIDD_CMP_EQUAL_ANY            // compare equal any
_SIDD_CMP_RANGES               // compare ranges
_SIDD_CMP_EQUAL_EACH           // compare equal each
_SIDD_CMP_EQUAL_ORDERED        // compare equal ordered
_SIDD_NEGATIVE_POLARITY        // negate results
_SIDD_MASKED_NEGATIVE_POLARITY // negate results only before end of string
_SIDD_LEAST_SIGNIFICANT        // index only: return least significant bit
_SIDD_MOST_SIGNIFICANT         // index only: return most significant bit
_SIDD_BIT_MASK                 // mask only: return bit mask
_SIDD_UNIT_MASK                // mask only: return byte/word mask
```
Announced back in 2006, it was designed for 128-bit `XMM` registers. It’s not a stretch to imagine that we can do better in 2024 with 512-bit `ZMM` registers and much more powerful AVX-512 instructions.
## Integer Comparisons with AVX-512BW
On Intel Ice Lake and newer CPUs, we can use AVX-512BW integer-comparison instructions for a baseline implementation:
- Load one element from the first array and a vector of elements from the second array.
- Compare the first element with all elements in the vector.
- Skip:
  - potentially many elements in the second array, and
  - at most one element in the first array.
- Repeat.
One nifty optimization is to keep the shorter array in the first argument and the longer array in the second argument, swapping them if necessary.
Putting this into code for `uint16_t` arguments, targeting Intel Ice Lake CPUs and newer, we get:
```c
void simsimd_intersect_u16_ice(
    simsimd_u16_t const* shorter, simsimd_u16_t const* longer,
    simsimd_size_t shorter_length, simsimd_size_t longer_length,
    simsimd_distance_t* results) {

    simsimd_size_t intersection_count = 0;
    simsimd_size_t shorter_idx = 0, longer_idx = 0;
    simsimd_size_t longer_load_size;
    __mmask32 longer_mask;

    while (shorter_idx < shorter_length && longer_idx < longer_length) {
        // Load `shorter_member` and broadcast it to shorter vector, load `longer_members_vec` from memory.
        simsimd_size_t longer_remaining = longer_length - longer_idx;
        simsimd_u16_t shorter_member = shorter[shorter_idx];
        __m512i shorter_member_vec = _mm512_set1_epi16(*(short*)&shorter_member);
        __m512i longer_members_vec;
        if (longer_remaining < 32) {
            longer_load_size = longer_remaining;
            longer_mask = (__mmask32)_bzhi_u32(0xFFFFFFFF, longer_remaining);
        } else {
            longer_load_size = 32;
            longer_mask = 0xFFFFFFFF;
        }
        longer_members_vec = _mm512_maskz_loadu_epi16(longer_mask, (__m512i const*)(longer + longer_idx));

        // Compare `shorter_member` with each element in `longer_members_vec`,
        // and jump to the position of the match. There can be only one match at most!
        __mmask32 equal_mask = _mm512_mask_cmpeq_epu16_mask(longer_mask, shorter_member_vec, longer_members_vec);
        simsimd_size_t equal_count = equal_mask != 0;
        intersection_count += equal_count;

        // When comparing a scalar against a sorted array, we can find three types of elements:
        // - entries that scalar is greater than,
        // - entries that scalar is equal to,
        // - entries that scalar is less than,
        // ... in that order! Any of them can be an empty set.
        __mmask32 greater_mask = _mm512_mask_cmplt_epu16_mask(longer_mask, longer_members_vec, shorter_member_vec);
        simsimd_size_t greater_count = _mm_popcnt_u32(greater_mask);
        simsimd_size_t smaller_exists = longer_load_size > greater_count - equal_count;

        // Advance the first array:
        // - to the next element, if a match was found,
        // - to the next element, if the current element is smaller than any elements in the second array.
        shorter_idx += equal_count | smaller_exists;
        // Advance the second array:
        // - to the next element after match, if a match was found,
        // - to the first element that is greater than the current element in the first array, if no match was found.
        longer_idx += greater_count + equal_count;

        // At any given cycle, take one entry from shorter array and compare it with multiple from the longer array.
        // For that, we need to swap the arrays if necessary.
        if ((shorter_length - shorter_idx) > (longer_length - longer_idx)) {
            simsimd_u16_t const* temp_array = shorter;
            shorter = longer, longer = temp_array;
            simsimd_size_t temp_length = shorter_length;
            shorter_length = longer_length, longer_length = temp_length;
            simsimd_size_t temp_idx = shorter_idx;
            shorter_idx = longer_idx, longer_idx = temp_idx;
        }
    }
    *results = intersection_count;
}
```
Every time we load data, we need to perform 2 comparisons and analyze the resulting masks.
Running the benchmark, we get:
```
| ---------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
---------------------------------------------------------------------------------------------------------
intersect_u16_ice<#A=128d,#B=128,#A∩B=1> 875 ns 875 ns 16248886 pairs=1.14342M/s
intersect_u16_ice<#A=128d,#B=128,#A∩B=6> 873 ns 873 ns 16081249 pairs=1.14555M/s
intersect_u16_ice<#A=128d,#B=128,#A∩B=64> 882 ns 882 ns 15851609 pairs=1.13354M/s
intersect_u16_ice<#A=128d,#B=128,#A∩B=121> 916 ns 916 ns 15282595 pairs=1091.32k/s
intersect_u16_ice<#A=128d,#B=1024,#A∩B=1> 955 ns 955 ns 14660187 pairs=1047.53k/s
intersect_u16_ice<#A=128d,#B=1024,#A∩B=6> 955 ns 955 ns 14663375 pairs=1047.57k/s
intersect_u16_ice<#A=128d,#B=1024,#A∩B=64> 952 ns 952 ns 14702462 pairs=1050.17k/s
intersect_u16_ice<#A=128d,#B=1024,#A∩B=121> 949 ns 949 ns 14743103 pairs=1053.59k/s
intersect_u16_ice<#A=128d,#B=8192,#A∩B=1> 2718 ns 2718 ns 5168053 pairs=367.871k/s
intersect_u16_ice<#A=128d,#B=8192,#A∩B=6> 2698 ns 2698 ns 5155819 pairs=370.664k/s
intersect_u16_ice<#A=128d,#B=8192,#A∩B=64> 2705 ns 2705 ns 5203675 pairs=369.686k/s
intersect_u16_ice<#A=128d,#B=8192,#A∩B=121> 2693 ns 2693 ns 5187007 pairs=371.377k/s
intersect_u16_ice<#A=1024d,#B=1024,#A∩B=10> 7310 ns 7310 ns 1910292 pairs=136.8k/s
intersect_u16_ice<#A=1024d,#B=1024,#A∩B=51> 7312 ns 7312 ns 1913190 pairs=136.759k/s
intersect_u16_ice<#A=1024d,#B=1024,#A∩B=512> 7365 ns 7365 ns 1900946 pairs=135.781k/s
intersect_u16_ice<#A=1024d,#B=1024,#A∩B=972> 7439 ns 7439 ns 1882319 pairs=134.43k/s
intersect_u16_ice<#A=1024d,#B=8192,#A∩B=10> 7682 ns 7681 ns 1821784 pairs=130.183k/s
intersect_u16_ice<#A=1024d,#B=8192,#A∩B=51> 7695 ns 7695 ns 1821861 pairs=129.955k/s
intersect_u16_ice<#A=1024d,#B=8192,#A∩B=512> 7643 ns 7643 ns 1829955 pairs=130.842k/s
intersect_u16_ice<#A=1024d,#B=8192,#A∩B=972> 7617 ns 7617 ns 1838612 pairs=131.279k/s
```
Highlighting the results:
| Benchmark | Serial | 🆕 AVX-512BW |
|---|---|---|
| Small: #A=128,#B=128,#A∩B=1 | 1'762 K/s | 1'143 K/s |
| Medium: #A=128,#B=1024,#A∩B=6 | 398 K/s | 1'047 K/s |
| Large: #A=1024,#B=1024,#A∩B=512 | 223 K/s | 135 K/s |
| Largest: #A=1024,#B=8192,#A∩B=972 | 49 K/s | 131 K/s |
We can do better.
## Emulating AVX-512 VP2INTERSECT
AVX-512 added many instructions, some of the most interesting ones being the `VP2INTERSECT` instructions.
Here is how Intel documents them:
> Compute intersection of packed 32-bit integer vectors a and b, and store indication of match in the corresponding bit of two mask registers specified by k1 and k2. A match in corresponding elements of a and b is indicated by a set bit in the corresponding bit of the mask registers.
```
MEM[k1+15:k1] := 0
MEM[k2+15:k2] := 0
FOR i := 0 TO 15
    FOR j := 0 TO 15
        match := (a.dword[i] == b.dword[j] ? 1 : 0)
        MEM[k1+15:k1].bit[i] |= match
        MEM[k2+15:k2].bit[j] |= match
    ENDFOR
ENDFOR
```
Even from pseudo-code, it clearly does at least 2x more work than a set intersection would — comparing all pairs of elements instead of just the upper or lower triangle of the matrix.
Moreover, it applies only to 32-bit and 64-bit integers and has extremely high latency.
It was an easy pass, given that only a few Intel Tiger Lake mobile CPUs support the `VP2INTERSECT` subset of AVX-512.
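For the record, on the few CPUs that do implement it, counting matches with the native intrinsic takes only a couple of lines. The sketch below merely illustrates the intrinsic’s interface and is not a code path taken from SimSIMD:

```cpp
#include <immintrin.h>

// Count 32-bit matches between two ZMM registers with native VP2INTERSECT.
// Needs a CPU and compiler flags supporting AVX512_VP2INTERSECT; only the first
// output mask matters when we just count matches.
inline int count_matches_vp2intersect(__m512i a, __m512i b) {
    __mmask16 a_matches, b_matches;
    _mm512_2intersect_epi32(a, b, &a_matches, &b_matches);
    return _mm_popcnt_u32(a_matches);
}
```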
Still, a 2022 paper by Guille Diez-Canas explores emulating this instruction using `VPERM`, `VPSHUF`, and `VALIGN`:
> Since computing set intersections (or sizes of set intersections) of sorted vectors only makes use of the first output result of a `vp2intersectd` instruction (as seen in listing 1), we implement equivalents to all instructions in the `VP2INTERSECT` subset that return only the first output mask. Note, therefore, that our implementations (except for that of section 3.3) are not drop-in replacements for the `VP2INTERSECT` subset. In fact, as shown in section 3.3, the strict emulation is slower than the native instruction.
The outlined procedures, however, can be valuable for our use case.
I have reimplemented them in SimSIMD, integrating them with minor modifications under the name `_mm512_2intersect_epi16_mask`, and voila:
```c
void simsimd_intersect_u16_ice(
    simsimd_u16_t const* a, simsimd_u16_t const* b,
    simsimd_size_t a_length, simsimd_size_t b_length,
    simsimd_distance_t* results) {

    simsimd_u16_t const* const a_end = a + a_length;
    simsimd_u16_t const* const b_end = b + b_length;
    simsimd_size_t c = 0;
    union vec_t {
        __m512i zmm;
        simsimd_u16_t u16[32];
        simsimd_u8_t u8[64];
    } a_vec, b_vec;

    while (a + 32 < a_end && b + 32 < b_end) {
        a_vec.zmm = _mm512_loadu_si512((__m512i const*)a);
        b_vec.zmm = _mm512_loadu_si512((__m512i const*)b);

        // Intersecting registers with `_mm512_2intersect_epi16_mask` involves a lot of shuffling
        // and comparisons, so we want to avoid it if the slices don't overlap at all
        simsimd_u16_t a_min;
        simsimd_u16_t a_max = a_vec.u16[31];
        simsimd_u16_t b_min = b_vec.u16[0];
        simsimd_u16_t b_max = b_vec.u16[31];

        // If the slices don't overlap, advance the appropriate pointer
        while (a_max < b_min && a + 64 < a_end) {
            a += 32;
            a_vec.zmm = _mm512_loadu_si512((__m512i const*)a);
            a_max = a_vec.u16[31];
        }
        a_min = a_vec.u16[0];
        while (b_max < a_min && b + 64 < b_end) {
            b += 32;
            b_vec.zmm = _mm512_loadu_si512((__m512i const*)b);
            b_max = b_vec.u16[31];
        }
        b_min = b_vec.u16[0];

        // Now we are likely to have some overlap, so we can intersect the registers
        __mmask32 a_matches = _mm512_2intersect_epi16_mask(a_vec.zmm, b_vec.zmm);
        c += _mm_popcnt_u32(a_matches); // The `_popcnt32` symbol isn't recognized by MSVC

        // Determine the number of entries to skip in each array, by comparing
        // every element in the vector with the last (largest) element in the other array
        __m512i a_last_broadcasted = _mm512_set1_epi16(*(short const*)&a_max);
        __m512i b_last_broadcasted = _mm512_set1_epi16(*(short const*)&b_max);
        __mmask32 a_step_mask = _mm512_cmple_epu16_mask(a_vec.zmm, b_last_broadcasted);
        __mmask32 b_step_mask = _mm512_cmple_epu16_mask(b_vec.zmm, a_last_broadcasted);
        a += 32 - _lzcnt_u32((simsimd_u32_t)a_step_mask);
        b += 32 - _lzcnt_u32((simsimd_u32_t)b_step_mask);
    }

    // Handle the tail:
    simsimd_intersect_u16_serial(a, b, a_end - a, b_end - b, results);
    *results += c; // And merge it with the main body result
}
```
There are a lot of neat aspects that make the borrowed `_mm512_2intersect_epi16_mask` subroutine great.
Most similar snippets may rely on the following intrinsics:
| Intrinsic | Requirements | Latency |
|---|---|---|
| `_mm512_permutex_epi64` | AVX-512F | 3 cycles |
| `_mm512_permutexvar_epi16` | AVX-512BW | 4-6 cycles |
| `_mm512_permutexvar_epi8` | AVX-512VBMI | 3 cycles |
| `_mm512_shuffle_epi8` | AVX-512BW | 1 cycle |
Lesson?
The `permutex` variants are slow and not always available.
Shuffles are great!
Feel free to quote me on that 😅:
> If in C++ everything is a rotate, in Assembly everything is a shuffle!
Luckily, our snippet uses only shuffles and `alignr`, reaching a massive 7 million pairs per second on 128-element arrays:
```
| --------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
--------------------------------------------------------------------------------------------------------
intersect_u16_ice<#A=128,#B=128,#A∩B=1> 129 ns 129 ns 101989513 pairs=7.72559M/s
intersect_u16_ice<#A=128,#B=128,#A∩B=6> 134 ns 134 ns 107140278 pairs=7.46949M/s
intersect_u16_ice<#A=128,#B=128,#A∩B=64> 122 ns 122 ns 113134485 pairs=8.18634M/s
intersect_u16_ice<#A=128,#B=128,#A∩B=121> 114 ns 114 ns 122765163 pairs=8.75268M/s
intersect_u16_ice<#A=128,#B=1024,#A∩B=1> 1042 ns 1042 ns 13412933 pairs=959.711k/s
intersect_u16_ice<#A=128,#B=1024,#A∩B=6> 1035 ns 1035 ns 13423867 pairs=966.278k/s
intersect_u16_ice<#A=128,#B=1024,#A∩B=64> 1038 ns 1038 ns 13401265 pairs=963.267k/s
intersect_u16_ice<#A=128,#B=1024,#A∩B=121> 1055 ns 1055 ns 13170438 pairs=948.193k/s
intersect_u16_ice<#A=128,#B=8192,#A∩B=1> 4315 ns 4315 ns 3024069 pairs=231.776k/s
intersect_u16_ice<#A=128,#B=8192,#A∩B=6> 3999 ns 3999 ns 3371134 pairs=250.088k/s
intersect_u16_ice<#A=128,#B=8192,#A∩B=64> 4486 ns 4486 ns 3278143 pairs=222.9k/s
intersect_u16_ice<#A=128,#B=8192,#A∩B=121> 4525 ns 4525 ns 3170802 pairs=220.991k/s
intersect_u16_ice<#A=1024,#B=1024,#A∩B=10> 817 ns 817 ns 17102654 pairs=1.22419M/s
intersect_u16_ice<#A=1024,#B=1024,#A∩B=51> 820 ns 820 ns 17168886 pairs=1.22003M/s
intersect_u16_ice<#A=1024,#B=1024,#A∩B=512> 793 ns 793 ns 17756237 pairs=1.26107M/s
intersect_u16_ice<#A=1024,#B=1024,#A∩B=972> 747 ns 747 ns 18261381 pairs=1.33794M/s
intersect_u16_ice<#A=1024,#B=8192,#A∩B=10> 5142 ns 5142 ns 2728465 pairs=194.496k/s
intersect_u16_ice<#A=1024,#B=8192,#A∩B=51> 5114 ns 5114 ns 2727670 pairs=195.56k/s
intersect_u16_ice<#A=1024,#B=8192,#A∩B=512> 5142 ns 5142 ns 2716714 pairs=194.491k/s
intersect_u16_ice<#A=1024,#B=8192,#A∩B=972> 5151 ns 5151 ns 2721708 pairs=194.148k/s
```
Highlighting the results:
| Benchmark | Serial | AVX-512BW | 🆕 AVX-512VP2INTERSECT |
|---|---|---|---|
| Small: #A=128,#B=128,#A∩B=1 | 1'762 K/s | 1'143 K/s | 7'725 K/s |
| Medium: #A=128,#B=1024,#A∩B=6 | 398 K/s | 1'047 K/s | 966 K/s |
| Large: #A=1024,#B=1024,#A∩B=512 | 223 K/s | 135 K/s | 1'261 K/s |
| Largest: #A=1024,#B=8192,#A∩B=972 | 49 K/s | 131 K/s | 194 K/s |
## Masks & Bitsets in NEON
Before jumping to SVE2, there is a well-known SIMD extension for Arm CPUs: NEON.
It’s a 128-bit SIMD extension available on most Arm CPUs, so this port should have been straightforward to implement, but it wasn’t.
I’ve noticed a repeated pattern: I keep needing an equivalent of the x86 `movemask` instruction to extract one bit from every byte or lane of a vector. Here is a snippet I originally borrowed from Danila Kutenin’s post on the Arm community blog and previously used in StringZilla for faster-than-LibC `memmem` and `strstr`:
```c
simsimd_u64_t _simsimd_u8_to_u4_neon(uint8x16_t vec) {
    return vget_lane_u64(vreinterpret_u64_u8(vshrn_n_u16(vreinterpretq_u16_u8(vec), 4)), 0);
}
```
It is a powerful tool, use it wisely 😊
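To make the trick concrete, here is a tiny usage sketch of my own, assuming the helper above and the NEON header are in scope; it locates the first 16-bit lane equal to a needle:

```cpp
#include <arm_neon.h>
#include <cstdint>

// Hypothetical example: every 16-bit lane of the comparison mask collapses into
// 8 bits of the 64-bit scalar, so the first set bit points at the first match.
inline int first_equal_lane(uint16x8_t lanes, std::uint16_t needle) {
    uint16x8_t equal = vceqq_u16(lanes, vdupq_n_u16(needle)); // 0xFFFF per matching lane
    std::uint64_t mask = _simsimd_u8_to_u4_neon(vreinterpretq_u8_u16(equal));
    return mask ? (int)(__builtin_ctzll(mask) / 8) : 8;       // 8 means "no match"
}
```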
Next up in my over-engineered SIMD version, I faced the issue of implementing decent bitset operations on Arm.
Arm does not have a good way of computing the population count or the leading zeros of a scalar bitmask, similar to `_mm_popcnt_u32` and `_lzcnt_u32` on x86.
One approach is compiler intrinsics, like `__builtin_popcountll` and `__builtin_clzll`, but those are specific to GCC and Clang.
Combining CPU-specific intrinsics with compiler-specific intrinsics is in bad taste, and beyond that, it’s not portable to MSVC and less popular compilers.
Another approach would be to implement it in inline assembly:
```c
int _simsimd_clz_u64(simsimd_u64_t value) {
    simsimd_u64_t result;
    __asm__("clz %x0, %x1" : "=r"(result) : "r"(value));
    return (int)result;
}
```
However, MSVC also doesn’t support inline assembly.
In StringZilla, I needed a general-purpose solution, so I implemented vanilla serial variants for the cases where compiler support is missing.
It’s ugly code with some common bit-twiddling hacks and messy `#if` preprocessor macros.
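For illustration, a fully portable fallback with no intrinsics at all can look like the sketch below; it is one of the textbook bit-twiddling approaches, not the StringZilla code:

```cpp
#include <cstdint>

// Portable count-leading-zeros fallback: smear the highest set bit downwards,
// then count the set bits with Kernighan's trick. A sketch, not StringZilla's code.
inline int clz_u64_fallback(std::uint64_t value) {
    if (value == 0) return 64;
    value |= value >> 1, value |= value >> 2, value |= value >> 4;
    value |= value >> 8, value |= value >> 16, value |= value >> 32;
    int ones = 0;
    for (; value; value &= value - 1) ones++; // clear the lowest set bit per step
    return 64 - ones;
}
```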
A better solution for the population counts on Arm was to aggregate the match counts within the loop body and horizontally sum them once after the loop, keeping the expensive reduction off the hot path:
```c
// Now we are likely to have some overlap, so we can intersect the registers.
// We can do it by performing a population count at every cycle, but it's not the cheapest in terms of cycles.
//
//      simsimd_u64_t a_matches = __builtin_popcountll(
//          _simsimd_u8_to_u4_neon(vreinterpretq_u8_u16(
//              _simsimd_intersect_u16x8_neon(a_vec.u16x8, b_vec.u16x8))));
//      c += a_matches / 8;
//
// Alternatively, we can transform match-masks into "ones", accumulate them between the cycles,
// and merge all together in the end.
uint16x8_t a_matches = _simsimd_intersect_u16x8_neon(a_vec.u16x8, b_vec.u16x8);
c_counts_vec.u16x8 = vaddq_u16(c_counts_vec.u16x8, vandq_u16(a_matches, vdupq_n_u16(1)));
```
Population counts - solved!
Counting leading zeros is harder.
We can use the `vclz` instruction in NEON.
Looking through the Arm documentation, we can find all the supported variants of the intrinsic:
| SIMD ISA | Return Type | Name | Arguments |
|---|---|---|---|
| Neon | int8x8_t | vclz_s8 | (int8x8_t a) |
| Neon | int8x16_t | vclzq_s8 | (int8x16_t a) |
| Neon | int16x4_t | vclz_s16 | (int16x4_t a) |
| Neon | int16x8_t | vclzq_s16 | (int16x8_t a) |
| Neon | int32x2_t | vclz_s32 | (int32x2_t a) |
| Neon | int32x4_t | vclzq_s32 | (int32x4_t a) |
| Neon | uint8x8_t | vclz_u8 | (uint8x8_t a) |
| Neon | uint8x16_t | vclzq_u8 | (uint8x16_t a) |
| Neon | uint16x4_t | vclz_u16 | (uint16x4_t a) |
| Neon | uint16x8_t | vclzq_u16 | (uint16x8_t a) |
| Neon | uint32x2_t | vclz_u32 | (uint32x2_t a) |
| Neon | uint32x4_t | vclzq_u32 | (uint32x4_t a) |
Notice that we need a variant for 64-bit integers, but it’s not defined.
We can circumvent that in two ways:

- further shrink the masks from 64-bit to 32-bit integers before `vclz`, or
- treat the 64-bit mask as a pair of concatenated 32-bit masks and mix the results afterward.
For now, I couldn’t make either of those ideas pay off, sometimes losing as much as 50% of the performance compared to the following solution:
```c
void simsimd_intersect_u16_neon(
    simsimd_u16_t const* a, simsimd_u16_t const* b,
    simsimd_size_t a_length, simsimd_size_t b_length,
    simsimd_distance_t* results) {

    simsimd_u16_t const* const a_end = a + a_length;
    simsimd_u16_t const* const b_end = b + b_length;
    union vec_t {
        uint16x8_t u16x8;
        simsimd_u16_t u16[8];
        simsimd_u8_t u8[16];
    } a_vec, b_vec, c_counts_vec;
    c_counts_vec.u16x8 = vdupq_n_u16(0);

    while (a + 8 < a_end && b + 8 < b_end) {
        a_vec.u16x8 = vld1q_u16(a);
        b_vec.u16x8 = vld1q_u16(b);

        // Intersecting registers with `_simsimd_intersect_u16x8_neon` involves a lot of shuffling
        // and comparisons, so we want to avoid it if the slices don't overlap at all..
        simsimd_u16_t a_min;
        simsimd_u16_t a_max = a_vec.u16[7];
        simsimd_u16_t b_min = b_vec.u16[0];
        simsimd_u16_t b_max = b_vec.u16[7];

        // If the slices don't overlap, advance the appropriate pointer
        while (a_max < b_min && a + 16 < a_end) {
            a += 8;
            a_vec.u16x8 = vld1q_u16(a);
            a_max = a_vec.u16[7];
        }
        a_min = a_vec.u16[0];
        while (b_max < a_min && b + 16 < b_end) {
            b += 8;
            b_vec.u16x8 = vld1q_u16(b);
            b_max = b_vec.u16[7];
        }
        b_min = b_vec.u16[0];

        // Now we are likely to have some overlap, so we can intersect the registers.
        uint16x8_t a_matches = _simsimd_intersect_u16x8_neon(a_vec.u16x8, b_vec.u16x8);
        c_counts_vec.u16x8 = vaddq_u16(c_counts_vec.u16x8, vandq_u16(a_matches, vdupq_n_u16(1)));

        // Counting leading zeros is tricky.
        uint16x8_t a_last_broadcasted = vdupq_n_u16(a_max);
        uint16x8_t b_last_broadcasted = vdupq_n_u16(b_max);
        simsimd_u64_t a_step = _simsimd_clz_u64(_simsimd_u8_to_u4_neon( //
            vreinterpretq_u8_u16(vcleq_u16(a_vec.u16x8, b_last_broadcasted))));
        simsimd_u64_t b_step = _simsimd_clz_u64(_simsimd_u8_to_u4_neon( //
            vreinterpretq_u8_u16(vcleq_u16(b_vec.u16x8, a_last_broadcasted))));
        a += (64 - a_step) / 8;
        b += (64 - b_step) / 8;
    }
    simsimd_intersect_u16_serial(a, b, a_end - a, b_end - b, results);
    *results += vaddvq_u16(c_counts_vec.u16x8);
}
```
The `_simsimd_clz_u64` helper currently falls back to the `__builtin_clzll` compiler intrinsic.
Even with that and only 128-bit registers, the performance is still decent:
```
| --------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
--------------------------------------------------------------------------------------------------------
intersect_u16_neon<#A=128,#B=128,#A∩B=1> 195 ns 195 ns 72473251 pairs=5.12346M/s
intersect_u16_neon<#A=128,#B=128,#A∩B=6> 193 ns 193 ns 71826322 pairs=5.17983M/s
intersect_u16_neon<#A=128,#B=128,#A∩B=64> 181 ns 181 ns 76859132 pairs=5.51211M/s
intersect_u16_neon<#A=128,#B=128,#A∩B=121> 161 ns 161 ns 86301671 pairs=6.22906M/s
intersect_u16_neon<#A=128,#B=1024,#A∩B=1> 1199 ns 1027 ns 13866808 pairs=973.295k/s
intersect_u16_neon<#A=128,#B=1024,#A∩B=6> 1171 ns 1034 ns 13729254 pairs=966.886k/s
intersect_u16_neon<#A=128,#B=1024,#A∩B=64> 1120 ns 1038 ns 13671085 pairs=963.804k/s
intersect_u16_neon<#A=128,#B=1024,#A∩B=121> 1150 ns 1051 ns 13070692 pairs=951.238k/s
intersect_u16_neon<#A=128,#B=8192,#A∩B=1> 2587 ns 2446 ns 5685615 pairs=408.885k/s
intersect_u16_neon<#A=128,#B=8192,#A∩B=6> 2595 ns 2490 ns 5538880 pairs=401.615k/s
intersect_u16_neon<#A=128,#B=8192,#A∩B=64> 2482 ns 2460 ns 5704185 pairs=406.459k/s
intersect_u16_neon<#A=128,#B=8192,#A∩B=121> 2512 ns 2512 ns 5592948 pairs=398.064k/s
intersect_u16_neon<#A=1024,#B=1024,#A∩B=10> 1599 ns 1573 ns 8893290 pairs=635.781k/s
intersect_u16_neon<#A=1024,#B=1024,#A∩B=51> 1570 ns 1570 ns 8950291 pairs=637.098k/s
intersect_u16_neon<#A=1024,#B=1024,#A∩B=512> 1488 ns 1488 ns 9449103 pairs=672.121k/s
intersect_u16_neon<#A=1024,#B=1024,#A∩B=972> 1332 ns 1332 ns 10582682 pairs=751.007k/s
intersect_u16_neon<#A=1024,#B=8192,#A∩B=10> 8997 ns 8997 ns 1556944 pairs=111.144k/s
intersect_u16_neon<#A=1024,#B=8192,#A∩B=51> 8999 ns 8999 ns 1554324 pairs=111.128k/s
intersect_u16_neon<#A=1024,#B=8192,#A∩B=512> 9126 ns 9070 ns 1543769 pairs=110.257k/s
intersect_u16_neon<#A=1024,#B=8192,#A∩B=972> 9089 ns 9089 ns 1536462 pairs=110.029k/s
```
Highlights:
| Benchmark | Serial | 🆕 NEON |
|---|---|---|
| Small: #A=128,#B=128,#A∩B=1 | 1'628 K/s | 5'123 K/s |
| Medium: #A=128,#B=1024,#A∩B=6 | 393 K/s | 963 K/s |
| Large: #A=1024,#B=1024,#A∩B=512 | 218 K/s | 672 K/s |
| Largest: #A=1024,#B=8192,#A∩B=972 | 49 K/s | 110 K/s |
## Matching & Histograms in SVE2
Dessert, at last!
Arm SVE2 is a new SIMD extension available on the latest Arm CPUs.
It stretches the idea of a runtime-defined vector length to non-trivial, often integer, operations, applied either at a 128-bit single-lane granularity or across the whole vector.
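Unlike the fixed-width snippets above, SVE kernels are normally written around a governing predicate that masks off the tail automatically; the intersection snippets below plug into a loop of roughly this shape. The sketch uses my own naming and is not SimSIMD’s exact control flow:

```cpp
#include <arm_sve.h>
#include <cstdint>

// Rough shape of a predicate-driven SVE loop over a `u32` array: count how many
// elements are below `bound`. Illustrates the idiom, not an intersection kernel.
inline std::uint64_t count_less_than_sve(std::uint32_t const *data, std::uint64_t length, std::uint32_t bound) {
    std::uint64_t count = 0;
    for (std::uint64_t i = 0; i < length; i += svcntw()) {
        svbool_t progress = svwhilelt_b32_u64(i, length);  // active lanes for this step
        svuint32_t values = svld1_u32(progress, data + i); // predicated load, no tail overrun
        count += svcntp_b32(progress, svcmplt_n_u32(progress, values, bound));
    }
    return count;
}
```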
SVE2 has a `MATCH` instruction that can be used to compute the intersection of two vectors.
It operates at a lane granularity and can be used for 16-bit inputs, so we need to apply it several times to cover the whole vector.
```c
simsimd_size_t const register_size = svcnth();        // Number of 16-bit elements in a vector
simsimd_size_t const lanes_count = register_size / 8; // 8 elements in a 128-bit vector
svbool_t equal_mask = svmatch_u16(a_progress, a_vec, b_vec);
for (simsimd_size_t i = 1; i < lanes_count; i++) {
    b_vec = svext_u16(b_vec, b_vec, 8);
    equal_mask = svorr_z(svptrue_b16(), equal_mask, svmatch_u16(a_progress, a_vec, b_vec));
}
simsimd_size_t equal_count = svcntp_b16(svptrue_b16(), equal_mask);
```
This is a great start, but the `MATCH` instruction is not available for 32-bit integers.
We can emulate similar behavior by running a lot more iterations of `svcmpeq_u32`:
```c
svbool_t equal_mask = svpfalse_b();
for (simsimd_size_t i = 0; i < register_size; i++) {
    equal_mask = svorr_z(svptrue_b32(), equal_mask, svcmpeq_u32(a_progress, a_vec, b_vec));
    b_vec = svext_u32(b_vec, b_vec, 1);
}
simsimd_size_t equal_count = svcntp_b32(a_progress, equal_mask);
```
Alternatively, the new histogram instructions, such as `HISTCNT` and its `svhistcnt_u32_z` intrinsic, can be used.
For every element in the first vector, it counts the number of matching elements in the other vector’s prefix.
It’s practically an inclusive prefix sum of equality matches and operates over the whole register.
If we think of those comparisons as a matrix of matches, the result is equivalent to the lower triangle of the row-major intersection matrix.
To compute the upper triangle, we can reverse the order of elements (with `svrev_u32`) and repeat the operation, accumulating the results for the top and bottom:
```c
// Given α = {A, B, C, D}, β = {X, Y, Z, W}:
//
//  hist(α, β):          hist(α_rev, β_rev):
//
//      X Y Z W              W Z Y X
//    A 1 0 0 0            D 1 0 0 0
//    B 1 1 0 0            C 1 1 0 0
//    C 1 1 1 0            B 1 1 1 0
//    D 1 1 1 1            A 1 1 1 1
//
svuint32_t hist_lower = svhistcnt_u32_z(a_progress, a_vec, b_vec);
svuint32_t a_rev_vec = svrev_u32(a_vec);
svuint32_t b_rev_vec = svrev_u32(b_vec);
svuint32_t hist_upper = svrev_u32(svhistcnt_u32_z(svptrue_b32(), a_rev_vec, b_rev_vec));
svuint32_t hist = svorr_u32_x(a_progress, hist_lower, hist_upper);
svbool_t equal_mask = svcmpne_n_u32(a_progress, hist, 0);
simsimd_size_t equal_count = svcntp_b32(a_progress, equal_mask);
```
This is the most elegant of today’s solutions.
And the performance looks mostly good:
```
| --------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
--------------------------------------------------------------------------------------------------------
intersect_u16_sve2<#A=128,#B=128,#A∩B=1> 179 ns 178 ns 77997900 pairs=5.60245M/s
intersect_u16_sve2<#A=128,#B=128,#A∩B=6> 179 ns 179 ns 77959137 pairs=5.59776M/s
intersect_u16_sve2<#A=128,#B=128,#A∩B=64> 170 ns 170 ns 82829421 pairs=5.88598M/s
intersect_u16_sve2<#A=128,#B=128,#A∩B=121> 143 ns 143 ns 97771708 pairs=6.9995M/s
intersect_u16_sve2<#A=128,#B=1024,#A∩B=1> 900 ns 900 ns 15430306 pairs=1.11111M/s
intersect_u16_sve2<#A=128,#B=1024,#A∩B=6> 909 ns 909 ns 15374525 pairs=1099.58k/s
intersect_u16_sve2<#A=128,#B=1024,#A∩B=64> 922 ns 922 ns 15025863 pairs=1085.12k/s
intersect_u16_sve2<#A=128,#B=1024,#A∩B=121> 932 ns 932 ns 15083373 pairs=1072.6k/s
intersect_u16_sve2<#A=128,#B=8192,#A∩B=1> 2135 ns 2135 ns 6460842 pairs=468.333k/s
intersect_u16_sve2<#A=128,#B=8192,#A∩B=6> 2118 ns 2118 ns 6509484 pairs=472.238k/s
intersect_u16_sve2<#A=128,#B=8192,#A∩B=64> 2138 ns 2138 ns 6468742 pairs=467.706k/s
intersect_u16_sve2<#A=128,#B=8192,#A∩B=121> 2136 ns 2136 ns 6419653 pairs=468.097k/s
intersect_u16_sve2<#A=1024,#B=1024,#A∩B=10> 1502 ns 1502 ns 9329372 pairs=665.698k/s
intersect_u16_sve2<#A=1024,#B=1024,#A∩B=51> 1492 ns 1492 ns 9375601 pairs=670.246k/s
intersect_u16_sve2<#A=1024,#B=1024,#A∩B=512> 1416 ns 1416 ns 9859829 pairs=706.16k/s
intersect_u16_sve2<#A=1024,#B=1024,#A∩B=972> 1274 ns 1274 ns 11052636 pairs=785.05k/s
intersect_u16_sve2<#A=1024,#B=8192,#A∩B=10> 9148 ns 9148 ns 1528714 pairs=109.319k/s
intersect_u16_sve2<#A=1024,#B=8192,#A∩B=51> 9150 ns 9150 ns 1529679 pairs=109.287k/s
intersect_u16_sve2<#A=1024,#B=8192,#A∩B=512> 9148 ns 9147 ns 1527762 pairs=109.32k/s
intersect_u16_sve2<#A=1024,#B=8192,#A∩B=972> 9135 ns 9135 ns 1529316 pairs=109.473k/s
```
Highlights look comparable:
| Benchmark | Serial | NEON | 🆕 SVE2 |
|---|---|---|---|
| Small: #A=128,#B=128,#A∩B=1 | 1'628 K/s | 5'123 K/s | 5'602 K/s |
| Medium: #A=128,#B=1024,#A∩B=6 | 393 K/s | 963 K/s | 1'099 K/s |
| Large: #A=1024,#B=1024,#A∩B=512 | 218 K/s | 672 K/s | 706 K/s |
| Largest: #A=1024,#B=8192,#A∩B=972 | 49 K/s | 110 K/s | 109 K/s |
But there are some artifacts this picture can’t show.
The `u16` SVE2 variant is always at least as fast as NEON.
The `u32` variant is a different story: in some cases, like `<#A=128,#B=128,#A∩B=121>`, it’s a bit faster; in others, like `<#A=128,#B=8192,#A∩B=1>`, it’s 50% slower.
To reproduce, take a modern Arm CPU, like AWS Graviton 4, and try it yourself:
```sh
cmake -D CMAKE_BUILD_TYPE=Release -D SIMSIMD_BUILD_BENCHMARKS=1 -B build_release
cmake --build build_release --config Release
build_release/simsimd_bench --benchmark_filter=intersect
```
Still, SVE promises that once the CPUs get beefier, the registers will get wider, and the performance will skyrocket.
Amen!
In the meantime, check out some of my GitHub projects for some more over-engineered tech and “good first issues” for potential contributors. Happy coding 🤗