TL;DR: Most C++ and Rust thread‑pool libraries leave significant performance on the table - often running 10× slower than OpenMP on classic fork‑join workloads and micro-benchmarks. So I’ve drafted a minimal ~300‑line library called Fork Union that lands within 20% of OpenMP. It does not use advanced NUMA tricks; it uses only the C++ and Rust standard libraries and has no other dependencies.
OpenMP has been the industry workhorse for coarse‑grain parallelism in C and C++ for decades. I lean on it heavily in projects like USearch, yet I avoid it in larger systems because:
- Fine‑grain parallelism with independent subsystems doesn’t map cleanly to OpenMP’s global runtime.
- The C++ STL and the Rust standard library are more portable than OpenMP.
- Meta‑programming with OpenMP is a pain - mixing `#pragma omp` with templates quickly becomes unmaintainable.
So I went looking for ready‑made thread pools in C++ and Rust — only to realize most of them implement asynchronous task queues, a much heavier abstraction than OpenMP’s fork‑join model. Those extra layers introduce what I call the four horsemen of low performance:
- Locks & mutexes with syscalls in the hot path.
- Heap allocations in queues, tasks, futures, and promises.
- Compare‑and‑swap (CAS) stalls in the pessimistic path.
- False sharing from unaligned counters thrashing cache lines.
With today’s dual‑socket AWS machines pushing 192 physical cores, I needed something leaner than Taskflow, Rayon, or Tokio. Enter Fork Union.
Benchmarks
Hardware: AWS Graviton 4 metal (single NUMA node, 96× Arm v9 cores, 1 thread/core).
Workload: “ParallelReductionsBenchmark” - summing single-precision floats in parallel.
In this case, just one cache line (`float[16]`) per core—small enough to stress the synchronization cost of the thread pool rather than the arithmetic throughput of the CPU.
In other words, we are benchmarking kernels similar to:
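In rough OpenMP form, the kernel has the shape of the sketch below (the function name and signature are mine, not the benchmark's verbatim code):

```cpp
// A sketch of the kind of kernel being timed: each core gets only a handful
// of floats, so the measurement is dominated by fork-join overhead, not math.
#include <cstddef>

double sum_parallel(float const *data, std::ptrdiff_t count) {
    double total = 0;
#pragma omp parallel for schedule(static) reduction(+:total)
    for (std::ptrdiff_t i = 0; i < count; ++i)
        total += data[i];
    return total;
}
```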
Google Benchmark numbers for the C++ version of Fork Union, compared to OpenMP, Taskflow, and allocating 96× `std::thread` objects on demand, are as follows:
I’ve cleaned up the output, focusing only on the relevant rows and the reduction throughput.
Criterion.rs numbers for the Rust version of Fork Union, compared to Rayon, Tokio, and Smol’s Async Executors, are as follows:
The timing methods used in those two executables are different, but the relative observations should hold.
- Spawning new threads is obviously too expensive.
- Most reusable thread pools are still 10x slower to sync than OpenMP.
- OpenMP isn’t easy to compete with and still outperforms Fork Union by 20%.
This clearly shows how important it is to choose the right tool for the job. Don't pick an asynchronous task pool for a fork-join blocking workload!
Four Horsemen of Performance
This article won’t be a deep dive into those topics. Each deserves its own article and a proper benchmark, with some good ones already available and linked.
Locks and Mutexes
Unlike a `std::atomic` update, a `std::mutex` operation may result in a system call, and it can be expensive to acquire and release.
Mutex implementations generally have two execution paths:
- the fast path, taken when the mutex is uncontended: the thread tries to grab it with a single compare-and-swap and, if that succeeds, returns immediately;
- the slow path, taken under contention: the thread has to go through the kernel and block until the mutex becomes available.
On Linux, the latter translates to a “futex” syscall and an expensive context switch.
In Rust, the same applies to `std::sync::atomic` and `std::sync::Mutex`.
Prefer the former when possible.
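As a minimal illustration (my own sketch, not Fork Union code), compare a relaxed atomic increment, which stays a single hardware instruction, with a mutex-guarded one, which may fall into the kernel under contention:

```cpp
#include <atomic>
#include <cstddef>
#include <mutex>

std::atomic<std::size_t> atomic_counter{0};
std::size_t plain_counter = 0;
std::mutex counter_mutex;

void increment_atomic() noexcept {
    // Compiles to a single LOCK ADD (x86) or LDADD (Arm LSE) - no syscalls.
    atomic_counter.fetch_add(1, std::memory_order_relaxed);
}

void increment_locked() {
    // Fast path: one CAS to acquire the lock. Slow path under contention:
    // a futex wait in the kernel, descheduling, and a context switch.
    std::lock_guard<std::mutex> guard(counter_mutex);
    ++plain_counter;
}
```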
Memory Allocations
Most thread pools use classes like `std::future`, `std::packaged_task`, `std::function`, `std::queue`, and `std::condition_variable`.
In Rust land, there will often be a `Box`, a `std::sync::Arc`, a `std::collections::VecDeque`, a `std::sync::mpsc`, or even a `std::sync::mpmc`.
Most of those, I believe, aren't usable in Big-Data applications, where you always operate in memory-constrained environments:
- Raising a `std::bad_alloc` exception when there is no memory left and just hoping that someone up the call stack will catch it is not a great design idea for Systems Engineering.
- The threat of having to synchronize ~200 physical CPU cores across 2-8 sockets and potentially dozens of NUMA nodes around a shared global memory allocator practically means you can't have predictable performance.
As we focus on a simpler parallelism model, we can avoid the complexity of allocating shared state, wrapping callbacks into heap-allocated “tasks”, and a lot of other boilerplate.
Less work = more performance.
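As a minimal sketch of that idea (my illustration, not Fork Union's internals - a real pool reuses persistent workers instead of spawning them), the shared state of a blocking fork-join can live entirely in caller-owned storage: a pre-allocated results array and a single atomic completion counter, with no futures, promises, or task queues:

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdio>
#include <thread>

int main() {
    constexpr std::size_t thread_count = 4;
    std::array<double, thread_count> partial_sums{};           // pre-allocated result slots
    std::atomic<std::size_t> threads_remaining{thread_count};  // the only synchronization point

    std::array<std::thread, thread_count> workers;
    for (std::size_t t = 0; t != thread_count; ++t)
        workers[t] = std::thread([&partial_sums, &threads_remaining, t] {
            partial_sums[t] = static_cast<double>(t);  // stand-in for the real work
            threads_remaining.fetch_sub(1, std::memory_order_release);
        });

    // The caller blocks until every worker checks in - no futures to await.
    while (threads_remaining.load(std::memory_order_acquire) != 0) { /* spin */ }

    double total = 0;
    for (double s : partial_sums) total += s;
    std::printf("total = %f\n", total);

    for (auto &w : workers) w.join();
    return 0;
}
```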
Atomics and CAS
Once you get to the lowest-level concurrency primitives, you end up with `std::atomic` and a small set of hardware-supported atomic instructions.
Hardware implements them differently:
- x86 is built around the “Total Store Order” (TSO) memory consistency model and provides `LOCK` variants of `ADD` and `CMPXCHG`. These variants act as full-blown “fences” — no loads or stores can be reordered across them. This makes atomic operations on x86 straightforward but heavyweight.
- Arm, on the other hand, has a “weak” memory model and provides a set of atomic instructions that are not fenced and match the C++ concurrency model. It offers `acquire`, `release`, and `acq_rel` variants of each atomic instruction — such as `LDADD`, `STADD`, and `CAS` — which allow precise control over visibility and ordering, especially with the introduction of the “Large System Extension” (LSE) instructions in Armv8.1-A.
A locked atomic on x86 requires the cache line to be held in the Exclusive state in the requester’s L1 cache, incurring a coherence transaction (Read-for-Ownership) if another core holds the line. Both Intel and AMD handle this similarly.
It makes Arm and Power much more suitable for lock-free programming and concurrent data structures, but some observations hold for both platforms. Most importantly, “Compare and Swap” (CAS) is costly and should be avoided at all costs.
On x86, for example, a `LOCK ADD` can easily take 50 CPU cycles - 50x slower than a regular `ADD` instruction, but still easily 5-10x faster than a `LOCK CMPXCHG` instruction.
Once the contention rises, the gap naturally widens, further amplified by the increased “failure” rate of the CAS operation when the value being compared has already changed.
That’s why, for the “dynamic” mode, we resort to using an additional atomic variable rather than more typical CAS-based implementations.
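To illustrate the shape of that approach (a simplified sketch, not the library's actual code), dynamic scheduling can be driven by one shared counter that every thread advances with a single `fetch_add`, so there is no retry loop and no failure path:

```cpp
#include <atomic>
#include <cstddef>

std::atomic<std::size_t> next_task_index{0};  // shared by all workers

void dynamic_worker(std::size_t total_tasks) noexcept {
    while (true) {
        // One LOCK ADD / LDADD per claimed task - unlike a CAS loop, it never
        // "fails" and never has to retry when another thread wins the race.
        std::size_t task = next_task_index.fetch_add(1, std::memory_order_relaxed);
        if (task >= total_tasks) break;
        // ... execute the unevenly-sized task with index `task` ...
    }
}
```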
Alignment
Assuming a thread pool is a heavy object anyway, nobody will care if it’s a bit larger than expected.
That allows us to over-align the internal counters to `std::hardware_destructive_interference_size` or `std::max_align_t` to avoid false sharing.
In that case, even on x86, where each counter’s cache line ends up exclusively owned by a single thread, the eager mode effectively “pipelines” the execution: one thread may be incrementing the “in-flight” counter while another is decrementing the “remaining” counter, and the rest are executing the loop body in between.
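In code, the over-alignment looks roughly like this (field names are illustrative, not the library's exact members):

```cpp
#include <atomic>
#include <cstddef>
#include <new>  // std::hardware_destructive_interference_size

// Each counter occupies its own cache line, so a thread bumping one of them
// never invalidates the line that holds the other.
struct alignas(std::hardware_destructive_interference_size) padded_counter_t {
    std::atomic<std::size_t> value{0};
};

padded_counter_t threads_in_flight;  // incremented as workers join the fork
padded_counter_t tasks_remaining;    // decremented as workers finish their share
```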
Comparing APIs
Fork Union
Fork Union has a straightforward goal, so its API is equally clear. There are only 4 core interfaces:
- `for_each_thread` - to dispatch a callback per thread, similar to `#pragma omp parallel`.
- `for_each_static` - for individual evenly-sized tasks, similar to `#pragma omp for schedule(static)`.
- `for_each_slice` - for slices of evenly-sized tasks, similar to a nested `#pragma omp for schedule(static)`.
- `for_each_dynamic` - for individual unevenly-sized tasks, similar to `#pragma omp for schedule(dynamic, 1)`.
They all receive a C++ lambda or a Rust closure and a range of tasks to execute.
The construction of the thread pool itself is a bit trickier than is typical for standard-library types, as “exceptions” and “panics” are not allowed.
So the constructor can’t perform any real work.
In C++, the `try_spawn` method can be called to allocate all the threads:
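A sketch of what that looks like - `try_spawn` and `for_each_static` come from the text above, while the header, pool type name, and exact signatures are assumptions on my part, so check the repository's README for the real spelling:

```cpp
#include <cstddef>
#include <thread>
#include <vector>

#include <fork_union.hpp>  // header name assumed

int main() {
    fork_union::thread_pool_t pool;  // pool type name assumed for illustration
    // The constructor does no real work; `try_spawn` allocates the worker
    // threads and reports failure through its return value, never by throwing.
    if (!pool.try_spawn(std::thread::hardware_concurrency())) return 1;

    std::vector<float> data(1024);
    // Callbacks must be `noexcept` and return nothing.
    pool.for_each_static(data.size(), [&](std::size_t i) noexcept { data[i] = 1.0f; });
    return 0;
}
```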
As you may have noticed, the lambdas are forced to be `noexcept` and can’t return anything. This is a design choice that vastly simplifies the implementation.
In Rust, a similar `try_spawn` method is used.
Since Rust has no function overloading, there are a few variants:
- `try_spawn` - to spawn a thread pool with the main allocator.
- `try_spawn_in` - to spawn a thread pool with a custom allocator.
- `try_named_spawn` - to spawn a thread pool with the main allocator and a name.
- `try_named_spawn_in` - to spawn a thread pool with a custom allocator and a name.
Rayon
Rayon is the go-to Rust library for data parallelism. It suffers from the same core design issues as every other thread pool I’ve looked at on GitHub, but it’s fair to say that at the high level, it provides outstanding coverage for various parallel iterators! As such, there is an open call to explore similar “Map-Reduce” and “Map-Fork-Reduce” patterns in Fork Union to see if they can be implemented efficiently.
The default `.par_iter()` API of Rayon, shown at the start of its README.md, is not how I’ve used it in the “Parallel Reductions Benchmark”.
To ensure that we are benchmarking the actual synchronization cost of the thread pool, I’ve gone directly to the underlying `rayon::ThreadPool` API.
Taskflow
Taskflow is one of the most popular C++ libraries for parallelism. It has many features, including async execution graphs on CPUs and GPUs. The most common example looks like this:
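Something along the lines of the canonical example from Taskflow's README (reproduced from memory, so treat the details as approximate):

```cpp
#include <iostream>

#include <taskflow/taskflow.hpp>

int main() {
    tf::Executor executor;
    tf::Taskflow taskflow;

    auto [A, B, C, D] = taskflow.emplace(  // create four tasks
        []() { std::cout << "TaskA\n"; },
        []() { std::cout << "TaskB\n"; },
        []() { std::cout << "TaskC\n"; },
        []() { std::cout << "TaskD\n"; });

    A.precede(B, C);  // A runs before B and C
    D.succeed(B, C);  // D runs after B and C

    executor.run(taskflow).wait();  // submit the graph and block until it finishes
    return 0;
}
```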
Despite being just an example, it clearly shows how different Taskflow’s core objectives are from OpenMP and Fork Union.
It is still probably used mainly for simple static parallelism, similar to our case, where there are no complex dependencies and the `taskflow` object can be reused.
Here is how “Parallel Reductions Benchmark” wraps Taskflow:
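The wrapper is shaped roughly like the sketch below (my reconstruction, not the benchmark's verbatim code): the executor outlives the measurement, each worker accumulates a strided partial sum, and only `operator()` runs under the timer:

```cpp
#include <cstddef>
#include <vector>

#include <taskflow/taskflow.hpp>

struct taskflow_reduction_t {
    tf::Executor executor;  // constructed once, outside the timed region
    std::vector<float> const &data;
    std::vector<double> partial_sums;

    explicit taskflow_reduction_t(std::vector<float> const &d)
        : data(d), partial_sums(executor.num_workers(), 0.0) {}

    double operator()() {
        tf::Taskflow taskflow;
        std::size_t const n_threads = partial_sums.size();
        for (std::size_t t = 0; t != n_threads; ++t)
            taskflow.emplace([&, t] {
                double sum = 0;
                for (std::size_t i = t; i < data.size(); i += n_threads) sum += data[i];
                partial_sums[t] = sum;
            });
        executor.run(taskflow).wait();  // block until all partial sums are ready
        double total = 0;
        for (double s : partial_sums) total += s;
        return total;
    }
};
```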
Only the `operator()` method is timed, leaving the construction costs out of the equation.
Conclusions & Observations
Fork Union shows that a lean, 300-line fork-join pool can sit within ~20% of OpenMP, while more functional pools trail by an order of magnitude. That margin will shift as more workloads, CPUs, and compilers are tested, so treat today’s numbers as directional, not gospel. There may still be subtle memory‑ordering bugs lurking in Fork Union, but the core observations should hold: dodge mutexes, dynamic queues, likely-pessimistic CAS paths, and false sharing — regardless of language or framework.
Rust is still new territory for me.
The biggest surprise is the missing allocator support in `std::collections` on the stable toolchain.
Nightly’s `Vec::try_reserve_in` helps, but until that lands in stable, ergonomic custom allocation remains tricky.
The machinery exists in C++, yet most projects ignore it — so the culture needs to catch up.
PS: Spot dubious memory‑ordering? Open an issue. Want to close the remaining 20% gap? Happy forking 🤗
I'm waiting for those #Rust features to use it more in HPC/BigData environments:
- Allocators API for containers
- AVX-512 intrinsics in the toolchain
- Provenance for pointer math
Any other important RFCs to keep an eye on?
— Ash Vardanian (@ashvardanian) May 19, 2025