Less Slow

Fork Union: Beyond OpenMP in C++ and Rust?

TL;DR: Most C++ and Rust thread‑pool libraries leave significant performance on the table - often running 10× slower than OpenMP on classic fork‑join workloads and micro-benchmarks. So I’ve drafted a minimal ~300‑line library called Fork Union that lands within 20% of OpenMP. It does not use advanced NUMA tricks; it uses only the C++ and Rust standard libraries and has no other dependencies. OpenMP has been the industry workhorse for coarse‑grain parallelism in C and C++ for decades. I lean on it heavily in projects like USearch, yet I avoid it in larger systems because: ...

Calling CUDA in 3000 Words

You’ve probably seen a CUDA tutorial like this one — a classic “Hello World” blending CPU and GPU code in a single “heterogeneous” CUDA C++ source file, with the kernel launched using NVCC’s now-iconic triple-bracket <<<>>> syntax: 1 2 3 4 5 6 7 8 9 10 11 #include <cuda_runtime.h> #include <stdio.h> __global__ void kernel() { printf("Hello World from block %d, thread %d\n", blockIdx.x, threadIdx.x); } int main() { kernel<<<1, 1>>>(); // Returns `void`?! 🤬 return cudaDeviceSynchronize() == cudaSuccess ? 0 : -1; } I still see this exact pattern in production code — and I’ll admit, it shows up in some of my own toy projects too - one, two, and three. But relying on triple-bracket kernel launches in production isn’t ideal. They don’t return error codes, and they encourage a false sense of simplicity. So in the next ~25 KBytes of text, we’ll explore the less wrong ways to launch kernels. ...

The Longest Nvidia PTX Instruction

The race for AI dominance isn’t just about who has the most computing - it’s increasingly about who can use it most efficiently. With the recent emergence of DeepSeek and other competitors in the AI space, even well-funded companies are discovering that raw computational power isn’t enough. The ability to squeeze maximum performance out of hardware through low-level optimization is becoming a crucial differentiator. One powerful tool in this optimization arsenal is the ability to work directly with PTX, NVIDIA’s low-level Instruction Set Architecture (ISA). However, PTX instructions are quite different from those for traditional CPU assembly. PTX Intermediate Representations (IR) live between high-level languages like CUDA and the actual hardware-specific Streaming Assembler (SASS) instructions. PTX is more akin to Java bytecode than x86 Assembly. And as we’re about to discover, they can reach lengths that would make even the most verbose x86 “opcodes” blush! ...

Hiding x86 Port Latency for 330 GB/s/core Reductions 🫣

For High-Performance Computing engineers, here’s the gist: On Intel CPUs, the vaddps instruction (vectorized float addition) executes on ports 0 and 5. The vfmadd132ps instruction (vectorized fused float multiply-add, or FMA) also executes on ports 0 and 5. On AMD CPUs, however, the vaddps instruction takes ports 2 and 3, and the vfmadd132ps instruction takes ports 0 and 1. Since FMA is equivalent to simple addition when one of the arguments is 1, we can drastically increase the throughput of addition-heavy numerical kernels. ...

Parsing JSON in C & C++: Singleton Tax

I’d argue that almost every open-source developer gets an extra spark of joy when someone reads the documentation and uses their tool in a way that goes beyond the classic 101 examples. It’s a rare treat even for popular projects like JSON parsers, but if you are building high-throughput software, such as analyzing millions of network packets per second, you’ll have to dig deeper. The first thing I generally check in such libraries is the memory usage pattern and whether I can override the default memory allocator. It is the most common singleton of them all! ...

10x Faster C++ String Split, 16 Years Later 👴🏻

It’s 2025. Sixteen years ago, someone asked on StackOverflow how to split a string in C++. With 3000 upvotes, you might think this question has been definitively answered. However, the provided solutions can be greatly improved in terms of both flexibility and performance, yielding up to a 10x speedup. In this post, we’ll explore three better ways to split strings in C++, including a solution I briefly mentioned in 2024 as part of a longer review of the Painful Pitfalls of C++ STL Strings. ...

The Next 31 Years of Developing Unum

When I was 20, I committed the next 40 years of my life to AI, approaching it from the infrastructure perspective. In 2015, before ChatGPT and the AI surge, such a long-term commitment seemed naive to many — almost like proposing marriage on a second date — the move I am most proud of. Yesterday, Unum celebrated its 9th anniversary, and my open-source contributions have surpassed 9,000 stars on GitHub. To mark the occasion, as I’ve done before, I’m releasing something new, free, and practical for the scientific community: efficient Bilinear Forms for real and complex numbers. These are useful in fields like statistics, computational physics, biology, or chemistry. These kernels may offer up to 5x improvements in mixed-precision throughput compared to BLAS and other libraries that power tools like NumPy, especially for simulating the time evolution of small systems of non-entangled quantum states. If you’re curious about Bilinear Forms, you can check the release notes of the SimSIMD project. ...

Understanding SIMD: Infinite Complexity of Trivial Problems 🔥

This blogpost is a mirror of the original post on Modular.com. Modern CPUs have an incredible superpower: super-scalar operations, made available through single instruction, multiple data (SIMD) parallel processing. Instead of doing one operation at a time, a single core can do up to 4, 8, 16, or even 32 operations in parallel. In a way, a modern CPU is like a mini GPU, able to perform a lot of simultaneous calculations. Yet, because it’s so tricky to write parallel operations, almost all that potential remains untapped, resulting in code that only does one operation at a time. ...

5x Faster Set Intersections: SVE2, AVX-512, & NEON 🤐

Set intersections are one of the standard operations in databases and search engines. They are used in: TF-IDF ranking in search engines, Table Joins in OLAP databases, Graph Algorithms. Chances are, you rely on them every day, but you may not realize that they are some of the most complex operations to accelerate with SIMD instructions. SIMD instructions make up the majority of modern Assembly instruction sets on x86 and Arm. They can yield 10x speedups, but due to their complexity, they are almost never used in production codebases. ...

35% Discount on Keyword Arguments in Python 🐍

Python has a straightforward syntax for positional and keyword arguments. Positional arguments are arguments passed to a function in a specific order, while keyword arguments are passed to a function by name. Surprising to most Python developers, the choice of positional vs keyword arguments can have huge implications on readability and performance. Let’s take the cdist interface as an example. It’s a function implemented in SimSIMD, mimicking SciPy, that computes all pairwise distances between two sets of points, each represented by a matrix. It accepts up to 6 arguments: ...

NumPy vs BLAS: Losing 90% of Throughput

Downloaded over 5 Billion times, NumPy is the most popular library for numerical computing in Python. It wraps low-level HPC libraries like BLAS and LAPACK, providing a high-level interface for matrix operations. BLAS is mainly implemented in C, Fortran, or Assembly and is available for most modern chips, not just CPUs. BLAS is fast, but bindings aren’t generally free. So, how much of the BLAS performance is NumPy leaving on the table? ...

The Painful Pitfalls of C++ STL Strings 🧵

Criticizing software is easy, yet the C++ and C standard libraries have withstood the test of time admirably. Nevertheless, they are not perfect. Especially the <string>, <string_view>, and <string.h> headers. The first two alone bring in over 20,000 lines of code, slowing the compilation of every translation unit by over 100 milliseconds. Most of that code seems dated, much slower than LibC, and equally error-prone, with interfaces that are very hard to distinguish. ...

Python, C, Assembly - 2'500x Faster Cosine Similarity 📐

In this fourth article of the “Less Slow” series, I’m accelerating Unum’s open-source Vector Search primitives used by some great database and cloud providers to replace Meta’s FAISS and scale-up search in their products. This time, our focus is on the most frequent operation for these tasks - computing the the Cosine Similarity/Distance between two vectors. It’s so common, even doubling it’s performance can have a noticeable impact on applications economics. But compared to a pure Python baseline our single-threaded performance grew by a factor of 2'500x. Let’s go through optimizations one by one: ...

GCC Compiler vs Human - 119x Faster Assembly 💻🆚🧑‍💻

When our Python code is too slow, like most others we switch to C and often get 100x speed boosts, just like when we replaced SciPy distance computations with SimSIMD. But imagine going 100x faster than C code! It sounds crazy, especially for number-crunching tasks that are “data-parallel” and easy for compilers to optimize. In such spots the compiler will typically “unroll” the loop, vectorize the code, and use SIMD instructions to process multiple data elements in parallel. But I’ve found a simple case, where the GNU C Compiler (GCC 12) failed to do so, and lost to hand-written SIMD code by a factor of 119. Enough to make the friendly bull compiler cry. ...

Accelerating JavaScript arrays by 10x for Vector Search 🏹

You’ve probably heard about AI a lot this year. Lately, there’s been talk about something called Retrieval Augmented Generation (RAG). Unlike a regular chat with ChatGPT, RAG lets ChatGPT search through a database for helpful information. This makes the conversation better and the answers more on point. Usually, a Vector Search engine is used as the database. It’s good at finding similar data points in a big pile of data. These data points are often at least 256-dimensional, meaning they have many Number-s. If you use JavaScript, you might wonder whether to use the built-in Array type or the more specialized TypedArray for this job. ...

Our CPython bindings got 5x faster without PyBind11 🐍

Python’s not the fastest language out there. Developers often use tools like Boost.Python and SWIG to wrap faster native C/C++ code for Python. PyBind11 is the most popular tool for the job not the quickest. NanoBind offers improvements, but when speed really matters, we turn to pure CPython C API bindings. With StringZilla, I started with PyBind11 but switched to CPython to reduce latency. The switch did demand more coding effort, moving from modern C++17 to more basic C99, but the result is a 5x lower call latency! So if you want to make your bindings lighter and better understand how CPython works under the hood - continue reading 🤗 ...

SciPy distances... up to 200x faster with AVX-512 & SVE 📏

Over the years, Intel’s 512-bit Advanced Vector eXtensions (AVX-512) stirred extensive discussions. While introduced in 2014, it wasn’t until recently that CPUs began providing comprehensive support. Similarly, Arm Scalable Vector Extensions (SVE), primarily designed for Arm servers, have also started making waves only lately. The computing landscape now looks quite different with powerhouses like Intel’s Sapphire Rapids CPUs, AWS Graviton 3, and Ampere Altra entering the fray. Their arrival brings compelling advantages over the traditional AVX2 and NEON extensions: ...

Combinatorial Stable Marriages for DBMS Semantic Joins 💍

How can the 2012 Nobel Prize in Economics, Vector Search, and the world of dating come together? What are the implications for the future of databases? And why do Multi-Modal AI model evaluation datasets often fall short? Synopsis: Stable Marriages are generally computed from preference lists. Those consume too much memory. Instead, one can dynamically recalculate candidate lists using a scalable Vector Search engine. However, achieving this depends on having high-quality representations in a shared Vector Space. While this already works well for text-based features using modern BERT-like architectures, the quality decreases significantly for Multi-Modal data. This shortcoming, reflected in OpenAI’s CLIP and our own Unum’s UForm, signifies the need for improving modern space-alignment techniques. Their advancement could not only catalyze the integration of AI into databases but also enhance the performance of upcoming Generative Vision-Language models. ...

StringZilla: 5x faster strings with SIMD & SWAR 🦖

A few years back, I found a simple trick in tandem with SIMD intrinsics to truly unleash the power of contemporary CPUs. I put the strstr of LibC and the std::search of the C++ Standard Templates Library to the test and hit a throughput of around 1.5 GB/s for substring search on a solitary core. Not too shabby, right? But imagine, that the memory bandwidth could theoretically reach a striking 10-15 GB/s per core. ...

Counting Strings in C++: 30x Throughput Difference 💬

Some of the most common questions in programming interviews are about strings - reversing them, splitting, joining, counting, etc. These days, having to interview more and more developers across the whole spectrum, we see how vastly the solutions, even to the most straightforward problems, differ depending on experience. Let’s imagine a test with the following constraints: You must find the first occurrence of every unique string in a non-empty array. You are only allowed to use the standard library, no other dependencies. You have 20 minutes. In Python, the solution may look like this: ...

Mastering C++ with Google Benchmark ⏱️

There are only two kinds of languages: the ones people complain about and the ones nobody uses. – Bjarne Stroustrup, creator of C++. Very few consider C++ attractive, and only some people think it’s easy. Choosing it for a project generally means you care about the performance of your code. And rightly so! Today, machines can process hundreds of Gigabytes per second, and we, as developers, should all learn to saturate those capabilities. So, let’s look into a few simple code snippets and familiarize ourselves with Google Benchmark (GB) - the most famous library in the space using incrementally complex examples. ...

Failing to Reach DDR4 Bandwidth 🚌

A bit of history. Not so long ago, we tried to use GPU acceleration from Python. We benchmarked NumPy vs CuPy in the most common number-crunching tasks. We took the highest-end desktop CPU and the highest-end desktop GPU and put them to the test. The GPU, expectedly, won, but not just in Matrix Multiplications. Sorting arrays, finding medians, and even simple accumulation was vastly faster. So we implemented multiple algorithms for parallel reductions in C++ and CUDA, just to compare efficiency. CUDA was obviously harder, than using std::accumulate, but there is a shortcut: thrust::reduce. ...

Crushing CPUs with 879 GB/s Reductions in CUDA

GPU acceleration can be trivial for Python users. Follow CUDA installation steps carefully, replace import numpy as np with import cupy as np, and you will often get the 100x performance boosts without breaking a sweat. Every time you write magical one-liners, remember a systems engineer is making your dreams come true. A couple of years ago, when I was giving a talk on the breadth of GPGPU technologies, I published a repo. A repo with various cool GPGPU staff in Vulkan, Halide, and most importantly, 8x implementations of std::accumulate in OpenCL. I know what you are thinking: ...

Apple to Apple Comparison: M1 Max vs Intel 🍏

This will be a story about many things: about computers, about their (memory) speed limits, about very specific workloads that can push computers to those limits and the subtle differences in Hash-Tables (HT) designs. But before we get in, here is a glimpse of what we are about to see. A friendly warning, the following article contains many technical terms and is intended for somewhat technical and hopefully curious readers. ...

Only 1% of Software Benefits from SIMD Instructions

David Patterson had recently mentioned that (rephrasing): The programmers may benefit from using complex instruction sets directly, but it is increasingly challenging for compilers to automatically generate them in the right spots. In the last 3-4 years I gave a bunch of talks on the intricacies of SIMD programming, highlighting the divergence in hardware and software design in the past ten years. Chips are becoming bigger and more complicated to add more functionality, but the general-purpose compilers like GCC, LLVM, MSVC and ICC cannot keep up with the pace. Hardly any developer codes in Assembly today, hoping that the compiler will do the heavy lifting. ...