A few years back, I found a simple trick in tandem with SIMD intrinsics to truly unleash the power of contemporary CPUs. I put the strstr of LibC and the std::search of the C++ Standard Templates Library to the test and hit a throughput of around 1.5 GB/s for substring search on a solitary core. Not too shabby, right? But imagine, that the memory bandwidth could theoretically reach a striking 10-15 GB/s per core.


Going on this wild hypothesis that if the first 4 characters of a string are a match, it’s highly probable that the rest will match too, I brought this idea to life for x86 SSE, AVX, AVX-512, and Arm Neon. The result? I achieved memory bandwidth that outdid the standard by a whopping 5-10x factor. But alas, it remained a demo, with no practical applications… until recently.

Meet StringZilla

I found myself knee-deep in a demo project for USearch, UStore, and UForm, which required parsing of a multi-terabyte collection of newline-delimited files. Being a frequent Python user, I, naively, attempted open(...).readlines()...

StringZilla

A billion lines were simply too overwhelming for Python to handle. So, I went back to my demo code, breathed new life into it, and reshaped it into a Python library. And thus, StringZilla was born!

1
2
3
4
5
from stringzilla import Str, File

text: str = 'some-string'
text: Str = Str('some-string')
text: File = File('some-file.txt')

Str and File mimic the interface of a str, offering functions like contains, find, count, splitlines, and split, in addition to other operator overloads.

  • Native: 1.5 GB/s.
  • Heuristic: 4 GB/s.
  • Heuristic with Arm NEON: 16 GB/s.
  • Heuristic with x86 AVX: 13 GB/s.

Even with its minimalistic implementation that doesn’t even bother to dodge misaligned loads, StringZilla holds its ground. The NEON version runs through 16 possible offsets simultaneously, employing SIMD to verify if a match is present and switching to scalar code if has_match == true for a particular block. Here’s a sneak peek at the hot path:

1
2
3
4
5
6
7
8
uint32x4_t matches0 = vceqq_u32(vld1q_u32((uint32_t *)(text + 0)), prefix);
uint32x4_t matches1 = vceqq_u32(vld1q_u32((uint32_t *)(text + 1)), prefix);
uint32x4_t matches2 = vceqq_u32(vld1q_u32((uint32_t *)(text + 2)), prefix);
uint32x4_t matches3 = vceqq_u32(vld1q_u32((uint32_t *)(text + 3)), prefix);

uint32x4_t matches32x4 = vorrq_u32(vorrq_u32(matches0, matches1), vorrq_u32(matches2, matches3));
uint64x2_t matches64x2 = vreinterpretq_u64_u32(matches32x4);
bool has_match = vgetq_lane_u64(matches64x2, 0) | vgetq_lane_u64(matches64x2, 1);

Interestingly, the same concept can be effortlessly extended to cater to needles under 4 characters long… in more ways than one. You can create standalone methods:

  1. For one-character needles, you would only require matches0.
  2. For two-character needles, matches0 and matches1 would suffice.
  3. For three-character needles, you’d need matches0, matches1, and matches2.

Despite a more straightforward but slower solution, I opted to keep the code brief. You might consider developing special functions for counting the number of times a single character appears in the string…

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
inline static size_t strzl_neon_count_char(strzl_haystack_t h, char n) {
    char const *const end = h.ptr + h.len;

    // The plan is simple: skim through the misaligned part of the string.
    char const *aligned_start = (char const *)(strzl_divide_round_up((size_t)h.ptr, 16) * 16);
    size_t misaligned_len = (size_t)(aligned_start - h.ptr) < h.len ? (size_t)(aligned_start - h.ptr) : h.len;
    size_t result = strzl_naive_count_char({h.ptr, misaligned_len}, n);
    if (h.ptr + misaligned_len >= end)
        return result;

    // Count matches in the aligned part.
    char const *text = aligned_start;
    uint8x16_t n_vector = vld1q_dup_u8((uint8_t const *)&n);
    for (; (text + 16) <= end; text += 16) {
        uint8x16_t masks = vceqq_u8(vld1q_u8((uint8_t const *)text), n_vector);
        uint64x2_t masks64x2 = vreinterpretq_u64_u8(masks);
        result += __builtin_popcountll(vgetq_lane_u64(masks64x2, 0)) / 8;
        result += __builtin_popcountll(vgetq_lane_u64(masks64x2, 1)) / 8;
    }

    // Count matches in the misaligned tail.
    size_t tail_len = end - text;
    result += strzl_naive_count_char({text, tail_len}, n);
    return result;
}

While an impressive exercise, the effectiveness of implementing it may be questionable if we’re already operating with a fast scalar variant.

Pseudo-SIMD

SIMD is an exceptional asset when the goal is optimal performance, assuming your hardware is known. The complexity arises when the software is distributed across millions of different devices globally. Balancing performance and adaptability, we may compile the same function multiple times with varying configurations, bundling them in a single binary and executing dynamic dispatch at the initiation stage.


However, dynamic dispatch is far from simple. GCC provides the __builtin_cpu_supports function, but it’s limited to x86 flags. On Linux, getauxval allows checking of hardware abilities, but it’s exclusively available for Arm. This identical issue occurs in USearch and in all of my projects, from time to time compelling me to resort to JIT

What if there was a more straightforward way? The Arm NEON registers are only 128 bits long. How much worse would the performance get if we use 64-bit general-purpose registers and emulate SIMD instructions with simple bitshifts and lookups? The answer - not much!

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
uint64_t nnnnnnnn = n;
nnnnnnnn |= nnnnnnnn << 8;  // broadcast `n` into `nnnnnnnn`
nnnnnnnn |= nnnnnnnn << 16; // broadcast `n` into `nnnnnnnn`
nnnnnnnn |= nnnnnnnn << 32; // broadcast `n` into `nnnnnnnn`
for (; haystack + 8 <= end; haystack += 8) {
    uint64_t haystack_part = *(uint64_t const *)haystack;
    uint64_t match_indicators = ~(haystack_part ^ nnnnnnnn);
    match_indicators &= match_indicators >> 1;
    match_indicators &= match_indicators >> 2;
    match_indicators &= match_indicators >> 4;
    match_indicators &= 0x0101010101010101;

    if (match_indicators != 0)
        return haystack - begin + ctz64(match_indicators) / 8;
}

Let’s start with the simple case when the needle is only a single character long. The n is the “needle”, and ctz is short for “count trailing zeros”. This would work on almost every modern CPU, and it’s easy to extend for 2, 3, and 4-character needles. Adding boilerplate code to choose the suitable backend and match the misaligned head and tail, you can expect the following performance on the Apple M2 Pro CPU:

Needle LengthSTLStringZillaBackend
13.4 GB/s12.25 GB/sstrzl_naive_find_char
23.4 GB/s6.4 GB/sstrzl_naive_find_2chars
33.4 GB/s4.1 GB/sstrzl_naive_find_3chars
43.4 GB/s3.4 GB/sstrzl_naive_find_4chars
51.7 GB/s3.1 GB/sstrzl_naive_find_substr

This may not be that impressive after the previous chapter, but this doesn’t need any specialized hardware and can easily be integrated into WebAssembly and other runtimes.

Sorting

Aside from “search”, other common operations on strings include sorting, grouping, and partitioning. For those, StringZilla brings Strs to replace the list[str].

1
2
3
4
5
6
7
8
def stringzilla_sort(strings, bit=0):
    pivot = partition(strings, bit)
    if bit < 32:
        stringzilla_sort(strings[:pivot], bit + 1)
        stringzilla_sort(strings[pivot:], bit + 1)
    else:
        sort(strings[:pivot])
        sort(strings[pivot:])

The core idea is the same. Let’s Radix Sort based on the first 4 characters and finalize the rest with Quick Sort or any other standard algorithm. In C, the code is more mouthful.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
inline static void _strzl_sort_recursion( //
    strzl_array_t *array,
    size_t bit_idx,
    size_t bit_max,
    int (*comparator)(void const *, void const *, void *)) {

    if (!array->count)
        return;

    // Partition a range of integers according to a specific bit value
    size_t split = 0;
    {
        size_t mask = (1ul << 63) >> bit_idx;
        while (split != array->count && (array->order[split] & mask))
            ++split;

        for (size_t i = split + 1; i < array->count; ++i)
            if (array->order[i] & mask)
                strzl_swap(array->order + i, array->order + split), ++split;
    }

    // Go down recursively
    if (bit_idx < bit_max) {
        strzl_array_t a = *array;
        a.count = split;
        _strzl_sort_recursion(&a, bit_idx + 1, bit_max, comparator);

        strzl_array_t b = *array;
        b.order += split;
        b.count -= split;
        _strzl_sort_recursion(&b, bit_idx + 1, bit_max, comparator);
    }

    // Reached the end of recursion
    else {
        // Discard the prefixes
        for (size_t i = 0; i != array->count; ++i)
            memset(&array->order[i], 0, 4ul);

        qsort_r(array->order, split, sizeof(size_t), comparator, (void *)array);
        qsort_r(array->order + split, array->count - split, sizeof(size_t), comparator, (void *)array);
    }
}

inline static int _strzl_sort_array_strncmp(void const *a_raw, void const *b_raw, void *array_raw) {
    strzl_array_t *array = (strzl_array_t *)array_raw;
    size_t a = *(size_t *)a_raw;
    size_t b = *(size_t *)b_raw;
    size_t a_len = array->get_length(array->handle, a);
    size_t b_len = array->get_length(array->handle, b);
    return strncmp( //
        array->get_begin(array->handle, a),
        array->get_begin(array->handle, b),
        a_len > b_len ? b_len : a_len);
}

inline static void strzl_sort(strzl_array_t *array) {

    // Export up to 4 bytes into the `array` bits themselves
    for (size_t i = 0; i != array->count; ++i) {
        char const *begin = array->get_begin(array->handle, array->order[i]);
        size_t length = array->get_length(array->handle, array->order[i]);
        char *prefix = (char *)&array->order[i];
        memcpy(prefix, begin, length > 4ul ? 4ul : length);
    }

    _strzl_sort_recursion(array, 0, 32, _strzl_sort_array_strncmp);
}

Sadly, we need more boilerplate to make it run on Windows and MacOS, as qsort_r isn’t broadly supported. And when supported, the comparator callback signature is different:

1
2
3
4
5
#if defined(WIN32) || defined(_WIN32) || defined(__WIN32__) || defined(__NT__) || __APPLE__
    void *array, void const *a, void const *b
#else
    void const *a, void const *b, void *array
#endif

Accounting for that nonsense, we can finally benchmark. Taking the leipzig1M.txt and sorting 1 Million of its first whitespace-delimited tokens, we get:

  • std::sort: 139.82 milliseconds.
  • strzl_sort: 26.43 milliseconds. 5x improvement.

For 10 Million tokens:

  • std::sort: 1'959.33 milliseconds.
  • strzl_sort: 214.57 milliseconds. 9x improvement.

I have also implemented Merge Sort, Quick Sort, Insertion Sort, and a few other algorithm variations as part of an experiment. None of them performed well and were deprecated. It’s also worth noting, that this exact implementation only works for arrays with less than 4 Billion entries, but is easy to generalize further.

Applying on Large Datasets

This trick has already saved me a few thousand dollars in the last couple of weeks. The most apparent scenario would be when I read a huge file and split its parts between processes (pages[p::n]).

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
from stringzilla import File
from multiprocessing import Process, cpu_count

def process_pages(pages: Iterable[str]):
    pass

warc = File('CC-xxxx.warc')
pages = warc.split('WARC/1.1\nWARC-Type: request')

n = cpu_count()
pool = [Process(process_pages, args=(pages[p::n],) for p in range(n))
for p in pool:
    p.start()
for p in pool:
    p.join()

You probably have a few similar scenarios as well, so check out the repo and consider contributing Possible next steps would be:

  • rfind, rsplit, and other reverse-order interfaces.
  • composing Strs from Python strings.
  • stable in-place partitions and sorts (not so trivial without dynamic memory).

The last one is not so trivial, if you want it fast, in-place, and without using any dynamic memory. It probably deserves a separate article, but the next one is again about abusing USearch