There are only two kinds of languages: the ones people complain about and the ones nobody uses.
– Bjarne Stroustrup, creator of C++.
Very few consider C++ attractive, and only some people think it’s easy. Choosing it for a project generally means you care about the performance of your code. And rightly so! Today, machines can process hundreds of Gigabytes per second, and we, as developers, should all learn to saturate those capabilities. So, let’s look into a few simple code snippets and familiarize ourselves with Google Benchmark (GB) - the most famous library in the space using incrementally complex examples.
For the impatient ones, here is the source code on GitHub.
That repo also has the results for AMD EPYC 7302 and Threadripper PRO 3995WX CPUs with -O1
and -O3
builds.
Math Micro-Ops
Every benchmark has the following structure. You define the operation, but GB decides how often to run it.
Compile it with -O3
, run, result: 0 ns.
Thanks for your attention, that’s all 😅
Unfortunately, no operation runs this fast on the computer. On a 3 GHz CPU, you would perform 3 Billion ops every second. So, each would take 0.33 ns, not 0 ns. The compiler just optimized everything out.
Cost of Random Generators
What will happen if we mutate the addition arguments between loop iterations?
__25ns__looks better than zero, but something doesn’t add up. If the addition is a single hardware instruction, why did it take ~100 cycles? There are BMI2 bit-manipulation assembly instructions with such runtime on AMD Zen and Zen 2 chips, but not the addition.
Here is the pattern we often use in benchmarking.
There are better approaches for something as small as an addition, but it is good enough to make a point.
Initialize randomly before evaluation, then mutate.
Your mutations can be regular and predictable.
They just shouldn’t be expensive.
We also apply the DoNotOptimize
function to force the compilation of that useless line.
This run took 0.7ns per iteration or around 2 cycles.
The first cycle was spent incrementing a
and b
on different ALUs of the same core, while the second was performed the final accumulation.
Now, let’s run those benchmarks on 8 threads.
The std::rand
powered function took 12'000 ns, while our latest variant remained the same.
Like many other libc
functions, it depends on the global state and uses mutexes to synchronize concurrent access to global memory.
Here is its source in glibc:
Slow Trigonometry in LibC and STL and Fast Math
Let’s say you want to compute the sine of a floating-point number.
The standard has std::sin
for that, but there is a different way.
We can approximate the result with the Taylor-Maclaurin series, taking the first three components of the expansion.
|
|
Pretty simple. Here is the result:
- 51 ns for the
std::sin
. - 27 ns for the Maclaurin series with
std::pow
.
Can we do better?
Of course, we can.
Raising to a power x
is a generic operation.
In other words, likely slow.
Let’s unfold the expression manually:
|
|
Result: 2.4 ns, or 10x improvement! The compiler will handle redundant operations, but he cannot do permutations. In math, addition and multiplication are associative for real numbers but not programming. The order of operands will affect the result, so the compilers have their hands in cuffs. Let’s set them free!
|
|
If you are not familiar with the modern C++ attributes, here is the older GCC version: __attribute__((optimize("-ffast-math")))
.
Here, we advise GCC to apply extra flags when compiling the scope of this specific function.
This is one of my favorite tricks for rapid benchmarking without having to change CMakeLists.txt
.
Result: 0.85 ns.
For those interested, GCC has a separate manual for the complete list of available floating-point math optimizations.
The -ffast-math
is a shortcut for -funsafe-math-optimizations
, a shortcut for -fassociative-math
.
In our specific case, with that many chained multiplications, the ordering may change under/overflow behavior, as stated in the “Transformations” section.
Integer Division
We have just accelerated a floating-point trigonometric function by a factor of 60x from 51 ns to less than a nanosecond! Can we do the same for integer division, the slowest integer arithmetical operation? To a certain extent, yes.
This variant runs in 3.3 ns, or 10x longer than addition. But what if we divide by a constant, not a mutating value? In that case, we must ensure the compiler knows that our denominator remains constant!
|
|
The above functions only differ because the first involves some shady money
launder
ing 😅
The compiler fails to trace the origins of money
, so it can’t replace it with shifts and multiplications.
How big is the difference: 3.3 ns vs 0.5 ns!
Hardware Acceleration without Intrinsics
Most of the examples in this article are abstract, but this happened to me a couple of months ago.
Compilers have special intrinsics, like __builtin_popcount
.
Those are high-performance routines shipped with the compiler and aren’t portable.
Those differ from Intel or ARM intrinsics, which are hardware-specific and compiler-agnostic.
These are hardware-agnostic and compiler-specific.
I forgot to pass the -mpopcnt
flag in the first variant, and the compiler silently continued.
When you want to be specific - explicitly trigger the popcnt
instruction.
The cost of this mistake was: 1.9 ns and 0.3 ns, or over 6x!
Broader Logic and Memory
Data Alignment
Compute may be expensive, but memory accesses always are! The more you miss your CPU caches, the more you waste time!
Let’s illustrate it by creating a cache-aligned array with 32x float
s.
That means 2x 64-byte cache lines worth of content.
|
|
Now we run two benchmarks.
Both perform the same logical operations - 8x float
additions and 8x stores.
But the first took 8 ns and the second took 4 ns.
Our benchmark ends up being dominated by the cost of the split-load or the lack of memory alignment, in other words.
Cost of Branching
The if
statement and seemingly innocent ternary operator (condition ? a : b)
can have a high-performance impact.
It’s especially noticeable when conditional execution happens at the scale of single bytes, like in text processing, parsing, search, compression, encoding, etc.
Don’t forget that every for
statement is, in reality, just a combination of an if
and goto
.
The CPU has a branch-predictor, one of the silicon’s most complex parts.
It memorizes the most common if
statements to allow “speculative execution”.
In other words, start processing the job i + 1
before finishing the job i
.
Those branch predictors are very powerful, and if you have a single if
statement on your hot-path, it’s not a big deal.
But most modern programs are almost entirely built on if
statements.
To estimate the cost of those, let’s generate some random_values
, iterate through them and pick different branches depending on the value of
|
|
Up to 4096 branches will be memorized on most modern CPUs, but anything beyond that would work slower. Running the benchmark, I end up with 2.9 ns vs. 0.7 ns, so on average, 10 cycles are wasted when you have 2 random branches. This means the cost of branch mis-prediction is around 20 CPU cycles.
Advanced Google Benchmark Features
GB packs a lot of little-known functionality. Let’s take a more complex workload to understand it - sorting. What’s wrong with the following benchmark?
|
|
We are benchmarking the std::sort
, the most used <algorithm>
.
Array sizes are fixed at 3 and 4 elements in different benchmarks.
Each runs twice:
- with
std::reverse
preprocessing time included. - with
std::reverse
preprocessing time excluded.
One took 227 ns and another took 9 ns. Which is which? The answer is different from what common sense suggests.
The Cost of Timing in Google Benchmark
Aside from the primary operation in question - std::sort
, we also do some pre-processing.
That pre-processing code is isolated with state.*Timing()
functions, and often, people use those without realizing the actual cost.
This took 213 ns. Those things look tiny when processing gigabytes, but you will run your benchmark on a small input one day. Plan ahead to extract an unaffected asymptotic curve, as we will do later.
Asymptotic Complexity and Sorting
GB also packs functionality for complexity analysis.
Those come in handy when you analyze scaling.
Most of us assume that std::sort
uses Quicksort, at least for significant inputs.
But what if the input is ginormous?
There is the Parallel STL! In GCC, those procedures are directed towards Intel’s oneTBB, which will use many threads to sort your numbers. The underlying algorithm would almost certainly be a linear-time Radix sort on GPUs. But what’s the complexity of oneTBB’s implementation on the CPU?
|
|
This short benchmark logs more information than just the runtime per iteration.
Aside from the built-in SetItemsProcessed
and SetBytesProcessed
you can add anything else:
|
|
I then run this benchmark with a couple of different policies and on numerous sizes:
|
|
GB will take the heavy burden of fitting and comparing our results against the suggested complexity curve.
From those experiments, NlgN
fits best, as seen in the output.
Algorithm | Result |
---|---|
super_sort/seq/1048576/min_time:10.000 | … |
super_sort/seq/2097152/min_time:10.000 | … |
super_sort/seq/16777216/min_time:10.000 | … |
super_sort/seq/134217728/min_time:10.000 | … |
super_sort/seq/1073741824/min_time:10.000 | … |
super_sort/seq/4294967296/min_time:10.000 | … |
super_sort/seq/min_time:10.000_BigO | 1.78 NlgN |
super_sort/seq/min_time:10.000_RMS | 41 % |
Another thing to note here is the missing UseRealTime
.
Without it the “CPU Time” is used by default.
If you were to sleep your process, the “CPU Time” would stop growing.
Random Interleaving
GB provides Random Interleaving for benchmarks with high run-to-run or order-depend variance. To enable it:
- pass the
--benchmark_enable_random_interleaving=true
flag, - optionally specify non-zero repetition count
--benchmark_repetitions=1
, - optionally decrease the per-repetition time
--benchmark_min_time=0.1
.
In a few cases, enabling this flag will save you some pain. One example is when dealing with CPUs with variable frequency with “Turbo Boost” like features that scale CPU frequency to a higher license. In that such case your CPU may accelerate from 2 GHz to 3 GHz, but will only sustain that frequency for a few seconds. So, if you run multiple compute-heavy benchmarks, the first one will work better, and the rest will work worse.
Another scenario is when you are working with large static arrays of data preserved from benchmark to benchmark; the first one will be slow, and the others will be fast if the data has been cached in the CPU.
Hardware Performance Counters
Most chips, including CPUs, include hardware performance counters.
Those gather stats that your benchmarking toolkit may be able to collect.
In GB, those counters are implemented via libpmf
(PMF).
PMF, however, is only available on Linux, and only some kernels support it.
“Windows Subsystem for Linux” users are out of luck, but if you are running on an original kernel, it will only take one extra argument and sudo
privileges to access:
|
|
More often, however, I’d see people call GB through the infamous Linux perf
utility.
It will expose a lot more functionality, such as taskset
, to control the availability of different CPU cores.
|
|
In Closing
GB has a lot of functionality we haven’t touched, which is listed in their single-page user guide.
You can take all the discussed codes from this article on GitHub and extend them to include more functionality, but that’s just the peak of the iceberg.
We haven’t touched on system calls and communications with external devices, where you would need eBPF
to effectively trace calls from almost inside the kernel.
If you want to go deeper, check out the following repos I maintain 🤗