Some of the most common questions in programming interviews are about strings: reversing them, splitting, joining, counting, and so on. Interviewing more and more developers across the whole experience spectrum these days, we see how vastly the solutions, even to the most straightforward problems, differ depending on experience. Let’s imagine a test with the following constraints:
- You must find the first occurrence of every unique string in a non-empty array.
- You are only allowed to use the standard library, no other dependencies.
- You have 20 minutes.
In Python, the solution may look like this:
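The original snippet is not reproduced here, but a straightforward dictionary-based version fitting the constraints might look like this sketch (the function name is my own):

```python
def first_occurrences(strings):
    """Map every unique string to the index of its first appearance."""
    offsets = {}
    for index, string in enumerate(strings):
        if string not in offsets:
            offsets[string] = index
    return offsets
```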
There must be a way to make this faster, even in Python, but as often happens, the hack may backfire once the next CPython version comes out. The simplest solution is generally the best.
C++ allows you to combine very high-level and low-level abstractions, giving you opportunities to improve almost any code snippet or shoot yourself in the foot.
We will do just that. Let’s start with a solution that any developer can write after the first two C++ tutorials and progress through their career, marking potential differences in answers of Junior, Mid, Senior, and Enthusiast developers and benchmarking everything to get some unexpected results.
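The original Junior-level listing is not shown here, but a sketch with the typical first-tutorial choices might look like this (the names are mine):

```cpp
#include <map>
#include <string>
#include <vector>

// Junior-level sketch: the input is copied, the container is a tree-based
// `std::map`, and the loop counter is a signed `int` - all choices that a
// reviewer is likely to question.
std::map<std::string, int> first_occurrences(std::vector<std::string> strings) {
    std::map<std::string, int> offsets;
    for (int i = 0; i < strings.size(); ++i)
        if (offsets.find(strings[i]) == offsets.end())
            offsets[strings[i]] = i;
    return offsets;
}
```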
A mid-level developer will probably have the following suggestions after looking at that code:

- Capturing a copy of the input `std::vector<std::string>` argument by value can be more expensive than the logic inside the function.
- `std::map` is slower than `std::unordered_map`, both asymptotically and practically. One is a Binary Search Tree; the other is a Hash-Table. You should probably use the latter if you don’t need sorted iterators.
- An `int` loop counter results in comparing integers of a different sign, as the `strings.size()` output is a `std::size_t`.

In the end, a mid developer may add that `offsets.find` can be replaced with `offsets.contains` in C++20 and patch the code like this:
Senior developers would notice that the previous solution performs two lookups for every new unique string: first, when checking for presence with `contains`, and then on insertion. The standard library has more functionality than just `operator[]` overloads: `emplace` and, most importantly, `try_emplace`, which will only add an entry if the key isn’t present.
Another anti-pattern is to use `std::vector<...> const &` where all you need is just a span. `std::vector` is a very specific class template that instantiates with a specific allocator and owns the memory. But the logic of this function is independent of the memory allocator used for the input, so it is wiser to use `std::span`. It is a surprisingly new tool, introduced in C++20, but every major C++ framework implements something like it: `Span` in Abseil, `Slice` in RocksDB, and so on. Similarly, dynamically-allocating containers placed inside other dynamically-allocating containers should be avoided where possible. In this case it is easy, as we can use a `std::string_view` mapping into parts of the original input.
Writing general-purpose libraries for most of the last decade, I have a natural tendency to overcomplicate things. A passion for cutting-edge tech may not always be an advantage, but let’s explore how we can improve the previous, already good, solution.
- A `std::string_view` and a `std::size_t`, being 16 and 8 bytes, respectively, will be combined into a `std::pair<std::string_view, std::size_t>`, resulting in a 32-byte structure, not 24. Moreover, one can’t stop thinking that we don’t need the `std::size_t` to solve the puzzle.
- As everyone knows, `std::unordered_map` is not the fastest Hash-Table out there, and it is far from being space-efficient. To be more efficient, we need a Hash-Table with open addressing.
The “Apple to Apple comparison” article covers Hash-Tables in more detail, benchmarking Arm-based Macs against older Intel-based versions and a $50K server.
This is a bit harder to implement in 20 minutes, but most C++ enthusiasts have probably implemented hash-tables at least a dozen times.
Which C++ and STL features did we use here?
- We have hoped to get a `std::flat_unordered_set` in C++ for a while, but we are only getting a `std::flat_set` in C++23. Not a single compiler suite ships it at the time of writing. Moreover, using a sorted array would throttle on anything other than highly sparse sets due to countless `std::memmove` calls. So we can prototype a minimalistic one ourselves.
- I belong to the anti-exception cult, not particularly appreciating it when programs crash, even if I am trying to process a 10 TB dataset with a 16 GB RAM budget. Dynamic allocations can cause `std::bad_alloc`. So naturally, I would `catch` the exception and return some message that the request can’t be processed. I recently used a feature from C++98 I had never used in my career - the “function-try-block”. The last time I was this proud of myself was when I used `std::launder` in production code. I even tweeted about it 🐦 Not a big deal, but it looks cleaner than nested expression blocks.
- C++23 also brings `std::expected`, which can complement the previous point. Again, STL implementations don’t ship it yet, so let’s just pick `std::bit_ceil` to compute the next power-of-two integer.
Well, this solution is much longer - 45 lines instead of 10. Let’s benchmark to see if it was worth it.
Let’s generate large arrays of random strings controlling three variables:
- number of strings in the input array,
- size of the alphabet,
- length of strings.
We would then use Google Benchmark to evaluate the four algorithms.
- The last column contains the expected number of repetitions per unique string.
- The other columns store the number of strings processed on 1 core.
- The measurements were conducted on a 16" i9 2019 MacBook Pro.
- The unit of measurement is millions of strings per second.
- Higher is better.
|1 M strings|Junior|Mid|Senior|Enthusiast|
|---|---|---|---|---|
|32 char alphabet, length 3|3.1|13.2|12.6|23.9|
|32 char alphabet, length 4|0.8|2.0|2.0|14.0|
|32 char alphabet, length 5|0.6|1.4|1.2|18.6|

|1 M strings|Junior|Mid|Senior|Enthusiast|
|---|---|---|---|---|
|16 char alphabet, length 3|4.9|21.5|19.2|31.0|
|16 char alphabet, length 4|2.2|9.8|8.9|20.5|
|16 char alphabet, length 5|0.8|2.0|2.1|13.2|

|32 char alphabet|Junior|Mid|Senior|Enthusiast|
|---|---|---|---|---|
|1 M strings of length 3|3.1|13.2|12.6|23.9|
|100 K strings of length 4|0.9|2.5|2.6|26.3|
|10 K strings of length 5|1.4|3.3|3.6|41.6|
In at least two cases, we are getting close to a 30x performance improvement, even in a single-threaded environment. In a multi-threaded case, memory allocations would be more expensive, and the gap would grow even wider.
Frankly speaking, it was obvious that reserving a relatively small buffer ahead of time and using a copy-less Hash-Table with open addressing would be faster than the associative containers in the STL.
The sources for the benchmark are on my GitHub, together with some other strings- and SIMD-related benchmarks. Feel free to repeat and share your numbers.
What is the weird part?
The numbers for the Senior solution are sometimes worse than the Middle ones, while the code looks strictly better.
If we just change the `try_emplace` line to this:
And rerun the benchmark: