Very few consider C++ attractive, and hardly anyone thinks it’s easy. Choosing it for a project generally means you care about the performance of your code. And rightly so! Today’s machines can process hundreds of gigabytes per second, and we, as developers, should all learn to saturate those capabilities. So let’s look into a few simple code snippets and familiarize ourselves with Google Benchmark (GB) - the most famous library in the space. Here is the plan:
- When A Nanosecond Is Too Long
- Randomizing The Inputs
- Doing Some Math
- Integer Division
- Hardware Acceleration
- Data Alignment
- From Nanoseconds to Microseconds
- The Cost of Timing in Google Benchmark
- Failing To Make A Point, Again
- Bigger Benchmarks
For the impatient ones, here is the source code on GitHub.
That repo also has the results for AMD EPYC 7302 and Threadripper PRO 3995WX CPUs with `-O1` and `-O3` builds.
When A Nanosecond Is Too Long
Every benchmark has the following structure. You define the operation, but GB decides how many times to run it.
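A minimal sketch of such a benchmark might look like this (function names are illustrative, and the later snippets assume the same includes and the `bm` alias):

```cpp
#include <cstdint>
#include <benchmark/benchmark.h>

namespace bm = benchmark;

// GB repeats the body until the timings stabilize; we only define the operation.
static void i32_addition(bm::State &state) {
    int32_t a = 0, b = 0, c = 0;
    for (auto _ : state)
        c = a + b;
    (void)c;
}
BENCHMARK(i32_addition);

BENCHMARK_MAIN(); // expands into a `main` that parses flags and runs all registered benchmarks
```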
Compile it with `-O3`, run, and the result is: 0 ns.
Thanks for your attention, that’s all 😅
Unfortunately, no operation runs this fast on a computer. A 3 GHz CPU performs 3 billion cycles per second, so even a single-cycle operation would take 0.33 ns, not 0 ns. The compiler simply optimized everything out.
Randomizing The Inputs
What will happen if we mutate the addition arguments between loop iterations?
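For example, something along these lines, with `std::rand` supplying fresh operands on every iteration (a sketch, not necessarily the repo’s exact code):

```cpp
#include <cstdlib> // std::rand

// New random arguments on every iteration.
static void i32_addition_random(bm::State &state) {
    int32_t c = 0;
    for (auto _ : state)
        c = std::rand() + std::rand();
    (void)c;
}
BENCHMARK(i32_addition_random);
```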
25 ns looks better than zero to me, but something doesn’t add up. If addition is a single hardware instruction, why did it take ~100 cycles? There are BMI2 instructions with such latencies on AMD Zen and Zen 2 chips, but addition isn’t one of them.
Here is the pattern we often use in benchmarking.
It’s definitely not the cleanest approach for something as small as addition, but good enough to make a point.
Initialize randomly before evaluation, then mutate.
Your mutations can be regular and predictable.
They just shouldn’t be expensive.
We also apply the `DoNotOptimize` function to force the compilation of that useless line.
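A sketch of that pattern applied to our addition:

```cpp
// Initialize randomly once, then apply cheap, predictable mutations,
// and use DoNotOptimize to keep the otherwise "useless" assignment alive.
static void i32_addition_semi_random(bm::State &state) {
    int32_t a = std::rand(), b = std::rand(), c = 0;
    for (auto _ : state)
        bm::DoNotOptimize(c = (++a) + (++b));
}
BENCHMARK(i32_addition_semi_random);
```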
This run took 0.7 ns per iteration, or around 2 cycles. The first cycle was spent incrementing `a` and `b` on different ALUs of the same core, while the second performed the final accumulation.
Now let’s run those benchmarks on 8 threads.
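Re-registering the same functions with the `Threads` modifier is enough; assuming the function names from the sketches above:

```cpp
BENCHMARK(i32_addition_random)->Threads(8);
BENCHMARK(i32_addition_semi_random)->Threads(8);
```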
The `std::rand`-powered function took 12'000 ns, while our latest variant remained the same. Like many other `libc` functions, `std::rand` depends on global state and uses mutexes to synchronize concurrent access to global memory.
Here is its source in glibc:
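I won’t reproduce it verbatim, but from memory the relevant part of `stdlib/random.c` looks roughly like this, with a process-wide lock around the shared state:

```c
long int __random (void) {
  int32_t retval;
  __libc_lock_lock (lock);
  (void) __random_r (&unsafe_state, &retval);
  __libc_lock_unlock (lock);
  return retval;
}
```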
Doing Some Math
Let’s say you want to compute the sine of a floating-point number. The standard library has `std::sin` for that, but there is a different way. We can approximate the result with the Taylor-Maclaurin series, taking the first three terms of the expansion: sin(x) ≈ x - x³/3! + x⁵/5!.
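A sketch of the two competing benchmarks (function names are illustrative):

```cpp
#include <cmath> // std::sin, std::pow

static void f64_sin(bm::State &state) {
    double argument = std::rand(), result = 0;
    for (auto _ : state)
        bm::DoNotOptimize(result = std::sin(argument += 1.0));
}
BENCHMARK(f64_sin);

// Maclaurin series with the first three terms, powers computed via std::pow.
static void f64_sin_maclaurin(bm::State &state) {
    double argument = std::rand(), result = 0;
    for (auto _ : state) {
        argument += 1.0;
        result = argument - std::pow(argument, 3) / 6.0 + std::pow(argument, 5) / 120.0;
        bm::DoNotOptimize(result);
    }
}
BENCHMARK(f64_sin_maclaurin);
```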
Pretty simple, right? Here are the results:
- 51 ns for `std::sin`.
- 27 ns for the Maclaurin series with `std::pow`.
Can we do better? Of course, we can. Raising to an arbitrary power `x` is a generic operation; in other words, it is likely slow.
Let’s unfold the expression manually:
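Roughly like this:

```cpp
// Same three-term series, with the powers unfolded into explicit multiplications.
static void f64_sin_maclaurin_powless(bm::State &state) {
    double argument = std::rand(), result = 0;
    for (auto _ : state) {
        argument += 1.0;
        result = argument - (argument * argument * argument) / 6.0 +
                 (argument * argument * argument * argument * argument) / 120.0;
        bm::DoNotOptimize(result);
    }
}
BENCHMARK(f64_sin_maclaurin_powless);
```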
Result: 2.4 ns, or a 10x improvement! The compiler will take care of redundant operations, but it cannot reorder them. Both addition and multiplication are associative for real numbers in math, but not in programming: the order of operands affects the result, so the compilers have their hands cuffed. Let’s set them free!
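A sketch of the same kernel with a per-function fast-math attribute:

```cpp
// Allow GCC to re-associate the chained multiplications within this function only.
[[gnu::optimize("-ffast-math")]]
static void f64_sin_maclaurin_with_fast_math(bm::State &state) {
    double argument = std::rand(), result = 0;
    for (auto _ : state) {
        argument += 1.0;
        result = argument - (argument * argument * argument) / 6.0 +
                 (argument * argument * argument * argument * argument) / 120.0;
        bm::DoNotOptimize(result);
    }
}
BENCHMARK(f64_sin_maclaurin_with_fast_math);
```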
If you are not familiar with modern C++ attributes, here is the older GCC spelling: `__attribute__((optimize("-ffast-math")))`. Here we advise GCC to apply extra flags when compiling the scope of this specific function. This is one of my favorite tricks for rapid benchmarking without having to change `CMakeLists.txt`.
Result: 0.85 ns.
For those interested, GCC has a separate manual page with the full list of available floating-point math optimizations. The `-ffast-math` flag is a shortcut for `-funsafe-math-optimizations`, which is a shortcut for `-fassociative-math`. In our specific case, with that many chained multiplications, the reordering may change the underflow/overflow behavior, as stated in the “Transformations” section.
Integer Division
We have just accelerated a floating-point trigonometric function by a factor of 60x from 51 ns to less than a nanosecond! Can we do the same for integer division, the slowest integer arithmetical operation? To a certain extent, yes.
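As a baseline, here is a sketch of a division benchmark where both operands keep changing, so the compiler has to emit a real division:

```cpp
// 64-bit integer division with operands the compiler cannot predict.
static void i64_division(bm::State &state) {
    int64_t a = std::rand(), b = std::rand(), c = 0;
    for (auto _ : state)
        bm::DoNotOptimize(c = (++a) / (++b));
}
BENCHMARK(i64_division);
```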
This variant runs in 3.3 ns, or 10x longer than addition. But what if we divide by a constant, not a mutating value? In that case, we just have to make sure that the compiler knows that our denominator remains constant!
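A sketch of such a pair, assuming `std::launder` is what hides the value from the optimizer, as the next paragraph hints:

```cpp
#include <new> // std::launder

// The denominator is "laundered", so the compiler cannot prove it is a constant.
static void i64_division_by_laundered(bm::State &state) {
    int64_t money = 2147483647;
    int64_t a = std::rand(), c = 0;
    for (auto _ : state)
        bm::DoNotOptimize(c = (++a) / *std::launder(&money));
}
BENCHMARK(i64_division_by_laundered);

// The denominator is visibly constant, so the division becomes shifts and multiplications.
static void i64_division_by_constant(bm::State &state) {
    int64_t money = 2147483647;
    int64_t a = std::rand(), c = 0;
    for (auto _ : state)
        bm::DoNotOptimize(c = (++a) / money);
}
BENCHMARK(i64_division_by_constant);
```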
The above functions only differ in that the first one is involved in something shady - `money` laundering. The compiler fails to trace the origins of `money`, so it can’t replace the division with shifts and multiplications.
How big is the difference? 3.3 ns vs 0.5 ns!
Hardware Acceleration
Most of the examples in this article are abstract, but this actually happened to me a couple of months ago.
Compilers have special intrinsics, like `__builtin_popcount`. Those are high-performance routines shipped with the compiler, but they aren’t portable across compilers. They are different from Intel or ARM intrinsics, which are hardware-specific and compiler-agnostic; compiler built-ins are the opposite: hardware-agnostic and compiler-specific.
I forgot to pass the `-mpopcnt` flag in the first variant, and the compiler silently continued, falling back to a generic software routine. When you want to be specific, explicitly request the `popcnt` instruction.
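A sketch of the two variants; here the second one requests the instruction through a per-function target attribute, though the original may have done it differently:

```cpp
#include <cstdint>

// Without -mpopcnt, the builtin lowers to a slower software routine.
static void u64_popcount_builtin(bm::State &state) {
    uint64_t a = std::rand();
    for (auto _ : state)
        bm::DoNotOptimize(__builtin_popcountll(++a));
}
BENCHMARK(u64_popcount_builtin);

// Enable the POPCNT extension for this function only, so the builtin maps to one instruction.
__attribute__((target("popcnt")))
static void u64_popcount_instruction(bm::State &state) {
    uint64_t a = std::rand();
    for (auto _ : state)
        bm::DoNotOptimize(__builtin_popcountll(++a));
}
BENCHMARK(u64_popcount_instruction);
```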
The cost of this mistake: 1.9 ns vs 0.3 ns, or over 6x!
Data Alignment
Compute may be expensive, but memory accesses always are! The more often you miss your caches, the more time you waste! Let’s illustrate it by creating a cache-aligned array of 32x `float`s - that is, two 64-byte cache lines worth of content.
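Something like this:

```cpp
// 32 floats, i.e., exactly two 64-byte cache lines, starting on a cache-line boundary.
alignas(64) static float buffer[32];
```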
Now we run two benchmarks. Both perform the same logical operations: 8x `float` additions and 8x stores.
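A sketch of that pair, written with AVX intrinsics (assuming the benchmark is built with `-mavx`) and using the `buffer` declared above; the first variant is the misaligned one:

```cpp
#include <immintrin.h> // AVX intrinsics

// Every load and store straddles the boundary between the two cache lines.
static void f32x8_split(bm::State &state) {
    for (auto _ : state) {
        __m256 a = _mm256_loadu_ps(&buffer[11]);
        __m256 b = _mm256_loadu_ps(&buffer[12]);
        _mm256_storeu_ps(&buffer[13], _mm256_add_ps(a, b));
        bm::ClobberMemory(); // keep the loads and stores inside the loop
    }
}
BENCHMARK(f32x8_split);

// The same 8 additions and 8 stores, but every access stays within a single cache line.
static void f32x8_aligned(bm::State &state) {
    for (auto _ : state) {
        __m256 a = _mm256_loadu_ps(&buffer[0]);
        __m256 b = _mm256_loadu_ps(&buffer[8]);
        _mm256_storeu_ps(&buffer[16], _mm256_add_ps(a, b));
        bm::ClobberMemory();
    }
}
BENCHMARK(f32x8_aligned);
```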
But the first took 8 ns and the second took 4 ns. Our benchmark ends up dominated by the cost of the split load, in other words, by the lack of memory alignment.
From Nanoseconds to Microseconds
Let’s gradually grow our operations in size. What’s wrong with the following benchmark?
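A sketch of such a benchmark, parameterized by the array size and by whether the reversal is timed (the argument layout is an assumption):

```cpp
#include <algorithm> // std::sort, std::reverse
#include <cstdint>
#include <numeric>   // std::iota
#include <vector>

static void sorting(bm::State &state) {
    auto count = static_cast<size_t>(state.range(0));
    auto include_preprocessing = static_cast<bool>(state.range(1));

    std::vector<int32_t> array(count);
    std::iota(array.begin(), array.end(), 1);

    for (auto _ : state) {
        if (!include_preprocessing)
            state.PauseTiming();
        // Repack the array into the worst-case (descending) order before every sort.
        std::reverse(array.begin(), array.end());
        if (!include_preprocessing)
            state.ResumeTiming();

        std::sort(array.begin(), array.end());
        bm::DoNotOptimize(array.size());
    }
}
BENCHMARK(sorting)->Args({3, 0})->Args({3, 1})->Args({4, 0})->Args({4, 1});
```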
We are benchmarking `std::sort`, the most used `<algorithm>`. Array sizes are fixed at 3 and 4 elements in different benchmarks. Each runs twice:
- with the `std::reverse` preprocessing time included.
- with the `std::reverse` preprocessing time excluded.
One took 227 ns and another took 9 ns. Which is which?
The Cost of Timing in Google Benchmark
If we think about this, the branches always evaluate identically, so the only problems might be coming from the `state.*Timing()` functions or the reversal itself.
Anyways, if you have read the title of this section, it shouldn’t be a surprise.
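A sketch of a benchmark that isolates that overhead by pausing and resuming the timers around an operation that costs almost nothing:

```cpp
// The increment is nearly free, so the measurement is dominated by the timing calls themselves.
static void cost_of_pausing(bm::State &state) {
    int32_t a = std::rand(), c = 0;
    for (auto _ : state) {
        state.PauseTiming();
        ++a;
        state.ResumeTiming();
        bm::DoNotOptimize(c += a);
    }
}
BENCHMARK(cost_of_pausing);
```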
This took 213 ns. Those things look tiny when you are processing gigabytes, but you will run your benchmark on a small input one day. Plan ahead, so you can extract an unaffected asymptotic curve, as we will do later.
Failing To Make A Point, Again
Branching is also critical. We avoid it on hot paths and design branchless vectorizable procedures. It’s time-consuming but makes further porting to SIMD assembly much more manageable.
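Here is a sketch of the kind of snippet I mean, with a branch on a trivially predictable condition:

```cpp
// The branch direction alternates, so the predictor gets it right almost every time.
static void cost_of_branching(bm::State &state) {
    int32_t a = std::rand(), b = std::rand(), c = 0;
    for (auto _ : state) {
        if ((++a) & 1)
            bm::DoNotOptimize(c += b);
        else
            bm::DoNotOptimize(c -= b);
    }
}
BENCHMARK(cost_of_branching);
```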
Unfortunately, it’s tough to showcase its cost in a five-line code snippet. Every simple snippet like this completes in just around 2 ns. There are still papers like this regularly published on dynamic branch prediction, so I am looking for someone to contribute a better example for the cost of branching!
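A sketch of that separation: the flag becomes a template parameter, and `if constexpr` discards the untaken path at compile time:

```cpp
template <bool add_k>
static void branching_at_compile_time(bm::State &state) {
    int32_t a = std::rand(), b = std::rand(), c = 0;
    for (auto _ : state) {
        if constexpr (add_k)
            bm::DoNotOptimize(c += (++a) + (++b));
        else
            bm::DoNotOptimize(c -= (++a) + (++b));
    }
}
BENCHMARK_TEMPLATE(branching_at_compile_time, true);
BENCHMARK_TEMPLATE(branching_at_compile_time, false);
```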
Even though I couldn’t show the cost of `if` before, I still recommend separating compile-time and run-time logic. It just makes sense.
Bigger Benchmarks
GB comes with all kinds of bells and whistles, like complexity analysis. Those come in handy when you analyze scaling. Most of us assume that `std::sort` uses Quick Sort, at least for big inputs. But what if the input is ginormous?
There is the Parallel STL! In GCC, those procedures are forwarded to Intel TBB, which will use many threads to sort your numbers. On GPUs, the underlying algorithm would almost certainly be a linear-time radix sort. But what’s the complexity of TBB’s implementation?
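A sketch of that benchmark, templated on the execution policy and reporting the built-in throughput counters (under GCC, the parallel policies assume TBB is linked):

```cpp
#include <algorithm>
#include <cstdint>
#include <execution> // std::execution::seq, std::execution::par_unseq
#include <numeric>
#include <vector>

template <typename execution_policy_t>
static void supersort(bm::State &state, execution_policy_t &&policy) {
    auto count = static_cast<size_t>(state.range(0));
    std::vector<int32_t> array(count);
    std::iota(array.begin(), array.end(), 1);

    for (auto _ : state) {
        std::reverse(policy, array.begin(), array.end());
        std::sort(policy, array.begin(), array.end());
        bm::DoNotOptimize(array.size());
    }

    state.SetComplexityN(count);
    state.SetItemsProcessed(count * state.iterations());
    state.SetBytesProcessed(count * state.iterations() * sizeof(int32_t));
}
```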
This short benchmark logs more information than just the runtime per iteration. Aside from the built-in `SetItemsProcessed` and `SetBytesProcessed`, you can add anything else:
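For example, a made-up metric appended at the end of the same `supersort` body:

```cpp
    // Any number you can compute can become a column in the report,
    // here the raw footprint of the array being sorted.
    state.counters["bytes_in_array"] = count * sizeof(int32_t);
```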
I then run this benchmark with a couple of different policies and on numerous sizes:
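The registration might look roughly like this (the exact size range may differ from the repo):

```cpp
BENCHMARK_CAPTURE(supersort, seq, std::execution::seq)
    ->RangeMultiplier(8)
    ->Range(1ll << 20, 1ll << 32)
    ->MinTime(10)
    ->Complexity(bm::oNLogN);

BENCHMARK_CAPTURE(supersort, par_unseq, std::execution::par_unseq)
    ->RangeMultiplier(8)
    ->Range(1ll << 20, 1ll << 32)
    ->MinTime(10)
    ->Complexity(bm::oNLogN);
```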
Which would yield results like this:
Algorithm | Result |
---|---|
supersort/seq/1048576/min_time:10.000 | … |
supersort/seq/2097152/min_time:10.000 | … |
supersort/seq/16777216/min_time:10.000 | … |
supersort/seq/134217728/min_time:10.000 | … |
supersort/seq/1073741824/min_time:10.000 | … |
supersort/seq/4294967296/min_time:10.000 | … |
supersort/seq/min_time:10.000_BigO | 1.78 NlgN |
supersort/seq/min_time:10.000_RMS | 41 % |
supersort/par_unseq/1048576/min_time:10.000 | … |
supersort/par_unseq/2097152/min_time:10.000 | … |
supersort/par_unseq/16777216/min_time:10.000 | … |
supersort/par_unseq/134217728/min_time:10.000 | … |
supersort/par_unseq/1073741824/min_time:10.000 | … |
supersort/par_unseq/4294967296/min_time:10.000 | … |
supersort/par_unseq/min_time:10.000_BigO | 0.09 NlgN |
supersort/par_unseq/min_time:10.000_RMS | 2 % |
So GB took the heavy burden of fitting and comparing our results against the suggested complexity curve. From those experiments, `NLogN` fits best.
Another thing to note here is the missing `UseRealTime`. Without it, the "CPU Time" is used by default. If you were to put your process to sleep, its "CPU Time" would stop growing.
GB’s functionality list is extensive, and many parts of it are little-known:
- Random Interleaving with the `--benchmark_enable_random_interleaving=true` CLI argument.
- Hardware Performance Counters via `libpfm`.
- Comparing against previous results with `compare.py`.
It’s impossible to cover the depth of performance engineering and the complexity of modern software/hardware with just one tool. Luckily, GB has an excellent single-page user guide, and all the code from this article is also available on GitHub. Feel free to fork and extend it!
The other essential tools to know are `perf` and `valgrind`, in ascending order of complexity. We were only focusing on runtime speed today, which is the easy part, while tracking memory usage or hardware stalls is more complicated, yet equally crucial for evaluating software.
Those things are especially tricky when you benchmark databases like our UStore. Presumably, the fastest transactional database needs accurate benchmarks, so we are working on integrating `eBPF` into UCSB and UCall. Check those projects out; they will come in handy in your next C++ projects!
You can always request more features on our Discord, but don’t forget to ⭐ them on GitHub, so we keep open-sourcing our technology 🤗