Some of the most common questions in programming interviews are about strings - reversing them, splitting, joining, counting, etc. These days, having to interview more and more developers across the whole spectrum, we see how vastly the solutions, even to the most straightforward problems, differ depending on experience. Let’s imagine a test with the following constraints:
- You must find the first occurrence of every unique string in a non-empty array.
- You are only allowed to use the standard library, no other dependencies.
- You have 20 minutes.
In Python, the solution may look like this:
There must be a way to make this faster, even in Python, but as often happens, the hack may backfire once the next CPython version comes out. The simplest solution is generally the best. In C++, that’s not the case. There many ways to implement the same thing, and the performance difference can be staggering.
Junior
C++ allows you to combine very high-level and low-level abstractions, giving you opportunities to improve almost any code snippet or shoot yourself in the foot.
We will do just that. Let’s start with a solution that any developer can write after the first two C++ tutorials and progress through their career, marking potential differences in answers of Junior, Mid, Senior, and Enthusiast developers and benchmarking everything to get some unexpected results.
|
|
Mid
A mid-developer will probably have the following suggestions looking at that code:
- Capturing a copy of an input
std::vector<std::string>
argument by value can be more expensive than the logic inside the function. - Using
std::map
is slower thanstd::unordered_map
, both asymptotically and practically. One is a Binary Search Tree. Another one is a Hash-Table. You should probably use the latter if you don’t need sorted iterators. - Using
int
results in comparing integers of a different sign as thestrings.size()
output is astd::size_t
In the end, a mid developer may add that offsets.find
can be replaced with a offsets.contains
in C++20 and patch the code like this:
|
|
Senior
Senior developers would notice, that the previous solution performs two lookups for every new unique string.
First, when checking for presence with contains
, and then on insertion.
The standard has more functionality than just operator []
overloads: insert
, emplace
, and most importantly try_emplace
, which will only add an entry, if the key isn’t present.
|
|
Another anti-pattern is to use std::vector<...> const &
, where all you need is just a span.
The std::vector
is a very specific class template, that instantiates with a specific allocator and owns the memory.
But the logic of this function is independent of the memory allocator used for the input, so it is wiser to use std::span
.
It is a surprisingly new tool, introduced in C++20, but every major C++ frameworks implement something like this.
A Span
in Abseil, Slice
in RocksDB, and so on.
Similarly, dynamically-allocating containers placed inside other dynamically allocating containers should be avoided where possible.
In this case it is easy, as we can use a std::string_view
mapping into parts of the original input.
Enthusiast
Writing general-purpose libraries for most of the last decade, I have a natural tendency to overcomplicate things. A passion for cutting-edge tech may only sometimes be an advantage, but let’s explore how we can improve the previous already good solution.
- The
std::string_view
andstd::size_t
, being 16 and 8 bytes, respectively, will be combined into astd::pair<std::string_view, std::size_t>
, resulting in a 32-byte structure, not 24. Moreover, one can’t stop thinking that we don’t need thestd::size_t
to solve the puzzle. - As everyone knows,
std::unordered_map
is not the fastest Hash-Table out there, and it is far from being space-efficient. To be more efficient, we need a Hash-Table with continuous addressing.
The “Apple to Apple comparison” article covers HTs in more details, benchmarking Arm-based Macs against older Intel-based versions and a $50K server.
|
|
This is a bit harder to implement in 20 minutes, but most C++ enthusiasts have probably implemented hash-tables at least a dozen times.
Which C++ and STL features did we use here?
- We have hoped to get a
std::flat_unordered_set
in C++ for a while, but we are only getting astd::flat_set
in C++23. Not a single compiler suite ships it at the time of writing. Moreover, using a sorted array would throttle on anything other than highly sparse sets due to countlessstd::memmove
calls. So we can prototype a minimalisticflat_unordered_set_t
. - I belong to the “anti-exception cult”, not particularly appreciating it when programs crash, even if I am trying to process a 10 TB dataset with a 16 GB RAM budget. Dynamic allocations can cause
std::bad_alloc
. So naturally, I wouldcatch
the exception and return some message that the request can’t be processed. I recently used a feature from C++98 I have never used in my career - “Function-try-block”. Last time I was so proud of myself when I usedstd::launder
in production code. I even tweeted about it 🐦 Not a big deal, but looks cleaner than nested expression blocks. - C++23 also brings
std::expected
, which can complement the previous point. Again, STL implementations don’t have it yet, so let’s pick thestd::optional
. - C++20
<bit>
header bringsstd::bit_ceil
to compute the next power-of-two integer.
Well, this solution is much longer - 45 lines instead of 10. Let’s benchmark to see if it was worth it.
Benchmarks
Let’s generate large arrays of random strings controlling three variables:
- number of strings in the input array,
- size of the alphabet,
- length of strings.
We would then use Google Benchmark to evaluate the four algorithms.
- The last column contains the expected number of repetitions per unique string.
- The other columns store the number of strings processed on 1 core.
- The measurements were conducted on a 16" i9 2019 MacBook Pro.
- The unit of measurement is: millions of strings per second.
- Higher is better.
Junior | Middle | Senior | Enthusiast | |
---|---|---|---|---|
1 M strings | ||||
32 char alphabet, length 3 | 3.1 | 13.2 | 12.6 | 23.9 |
32 char alphabet, length 4 | 0.8 | 2.0 | 2.0 | 14.0 |
32 char alphabet, length 5 | 0.6 | 1.4 | 1.2 | 18.6 |
1 M strings | ||||
16 char alphabet, length 3 | 4.9 | 21.5 | 19.2 | 31.0 |
16 char alphabet, length 4 | 2.2 | 9.8 | 8.9 | 20.5 |
16 char alphabet, length 5 | 0.8 | 2.0 | 2.1 | 13.2 |
32 char alphabet | ||||
1 M strings of length 3 | 3.1 | 13.2 | 12.6 | 23.9 |
100 K strings of length 4 | 0.9 | 2.5 | 2.6 | 26.3 |
10 K strings of length 5 | 1.4 | 3.3 | 3.6 | 41.6 |
In at least 2 cases we are getting close to 30x performance improvement even in the single-threaded environment. In multi-threaded case, the memory allocations would be more expensive, and the gap would be wider.
Frankly speaking, reserving a relatively small buffer ahead of time and using copy-less Hash-Table with open addressing would obviously be faster than the associative container in STL.
The sources for the benchmark are on my GitHub, together with some other strings- and SIMD-related benchmarks. Feel free to repeat and share your numbers.
What is the weird part?
The numbers for the Senior solution are sometimes worse, than the Middle, while the code looks strictly better.
If we just change the try_emplace
line to this:
|
|
And rerun the benchmark:
The numbers for the senior case get strictly better.
This was a nice warm-up, but if you are curious about advanced string algorithms and ready to go below the C++ layer and into the assembly, check out StringZilla and some of the other libraries I maintain 🤗