Criticizing software is easy, yet the C++ and C standard libraries have withstood the test of time admirably.
Nevertheless, they are not perfect.
Especially the <string>
, <string_view>
, and <string.h>
headers.
The first two alone bring in over 20,000 lines of code, slowing the compilation of every translation unit by over 100 milliseconds.
Most of that code seems dated, much slower than LibC, and equally error-prone, with interfaces that are very hard to distinguish.
This is not a new problem, and I don’t have an exhaustive list of all the issues with STL and LibC, but some issues became very noticeable when upgrading StringZilla to v3.
The upgrade makes it largely compatible with STL, stateful allocators aside.
Now it covers most of C++ 20 strings functionality in a C++ 11 compatible form, also adding dynamic (runtime) dispatch for SIMD-accelerated functions.
It also provides a few extensions, that are not present in STL, but are common in other languages.
For most code bases, replacing std::string
and std::string_view
with sz::string
and sz::string_view
should now be a drop-in replacement.
- So, what’s wrong with STL existing functionality?
- What’s missing in C++, but exists in other languages?
- And how can we make it up to 10x faster?
Error Prone APIs
Ambiguous Function Overloads
Let’s start with a question.
The std::string
has 14 variants of replace
with different argument order and meaning.
Can you guess what they do?
|
|
I definitely can’t… despite testing this functionality last week.
Those seven replace
calls produce six different results.
|
|
This complexity is likely a byproduct of the standard library’s evolutionary path:
std::string
was welcomed in C++ 98.std::string_view
made its debut in C++ 17.std::span
joined the roster in C++ 20.
I’ve noticed a trend of preferring these newer constructs in reverse order. Absent non-owning views and slices, you’re bound to lug around references to the original strings. Doing so, you need to pass additional integer arguments to specify the range of the original string you want to operate on.
This raises another inquiry - why opt for integers over iterators? And more crucially, how thorough are the checks on these integer parameters?
Asymmetric “Out of Bounds” Checks
Most containers in the C++ standard library have an operator[]
and an at
method.
- The
operator[]
is fast, unchecked, and can lead to undefined behavior. - The
at
method is slower, checked, and throws an exception on out-of-bounds access.
Easy to remember when it’s just one argument per function. What if you have two arguments? Common sense suggests that argument checking is either done for both or for neither. Documentation suggests otherwise.
|
|
The substr
method is the one-dimensional sibling of the zero-dimensional at
.
Instead of returning one char
scalar, it returns a string slice.
The kicker is that the substr
method has a boundary check for the first argument but not for the second.
Moreover, unless you enable “warnings as errors”, the compiler won’t even warn you about the negative arguments
It’s debatable, but Python’s support for negative indices is more intuitive.
Without it, you need to write s.substr(1, s.size() - 2)
instead of s.substr(1, -2)
.
With StringZilla, you can use negative indices and get the expected results:
|
|
We can’t have all the good things for now.
Due to language constraints, the "a:v"[-2:-1]
syntax is impossible.
So I’ve used an std::initializer_list
for the indices, and an underscore-prefixed literal for the view.
Missing Functionality
Continuing the topic of extended functionality, there are some very basic utilities missing in STL.
This includes lazy ranges for find
, split
, and bulk replace
.
Split and Search Lazy Ranges
Many production code bases have utility functions like:
Going from worst to best, they allocate memory at least for the std::vector
, and reallocate when it needs to grow.
Each allocation can be orders of magnitude more expensive than the search itself.
StringZilla provides lazily-evaluated ranges to avoid those, similar to Rust and some other systems languages.
Implementing them in C++ is not trivial and took about 400 lines of expression templates.
Now, StringZilla supports overlapping and non-overlapping substring search ranges.
Splits.
By string.
By character.
By character-set delimiters.
In normal and reverse order.
All SIMD-accelerated.
Here is a list:
haystack.[r]find_all(needle[, interleaving])
haystack.[r]find_all(char_set(""))
haystack.[r]split(needle)
haystack.[r]split(char_set(""))
For $N$ matches, the split functions will report $N + 1$ matches, potentially including empty strings.
Ranges provide begin()
and end()
forward-iterators and have a few convenience methods as well:
A special case of split
is partition
.
It’s one of the most neglected functions in Python strings.
StringZilla brings them to C++, returning a simple struct
, that can be unpacked with structured bindings.
|
|
Bulk Replace
Another way to know that the functionality is missing in STL is its frequency on StackOverflow and presence in Boost.
Boost has a replace_all
function, which is not in STL.
The std::string::replace
is a very different beast.
It replaces one predefined string slice with the given input.
The Boost function returns all occurrences of a substring with another substring, combining bulk-search functionality with several memmove
or memcpy
calls.
StringZilla supports that out of the box.
Updates and Allocations
Memory allocators are broken in C++, the same as in most languages. That’s not a big deal for most applications, but it is a problem in IoT and Big Data applications. The Unum stack is designed for the latter.
In Big Data, efficient software would always work in a memory-starved environment.
Even if you have 2 TB of RAM on a CPU socket, you will reach a point where 100% is used, you can’t allocate more, but you also can’t terminate
.
The fact that most containers in STL raise exceptions when memory allocations fail is an issue.
That’s why StringZilla provides “try” versions of all allocation functions and explicitly marks all public interfaces with noexcept
and noexcept(false)
.
This might be a niche use case.
A more common one is concatenating multiple strings together.
The STL provides std::string::operator+
and std::string::append
, but those are inefficient if many invocations are performed.
The efficient approach would be pre-allocating the memory and copying the strings into it.
That’s mouthful and error-prone.
StringZilla provides a more convenient concatenate
function, which takes variadic arguments.
It also overrides the operator|
to concatenate strings lazily without any allocations.
Generating Random Strings
Software developers often need to generate random strings for testing purposes.
The STL provides std::generate
and std::random_device
, that can be used with StringZilla.
|
|
Mouthful and slow.
StringZilla provides a C native method - sz_generate
and a convenient C++ wrapper - sz::generate
.
Similar to Python it also defines the commonly used character sets.
It uses precomputed multiplication and shift tables to avoid module and division operations.
Those are used to sample from the given alphabet fairly and are slow on most CPU architectures.
|
|
Performance and LibC
C++ is synonymous with performance.
The STL is not.
Every major shop in town has homegrown hash tables or prefers open-source alternatives to std::unordered_map
and std::unordered_set
.
The std::string
is not an exception.
For some operations, it calls down to LibC, which is much more optimized in general but still doesn’t reach the hardware potential and doesn’t cover all the needs of the C++ class.
The strstr
can only be used for substring search on NULL-terminated strings.
The memmem
is better and can be used with std::string_view
.
But there is a catch - there is no reverse search in LibC.
So, only one evaluation order is optimized.
Let’s see the numbers for exact search performance.
LibC | C++ Standard | Python | StringZilla |
---|---|---|---|
find the first occurrence of a random word from text, ≅ 5 bytes long | |||
strstr 1x86: 7.4 · arm: 2.0 GB/s | .find x86: 2.9 · arm: 1.6 GB/s | .find x86: 1.1 · arm: 0.6 GB/s | sz_find x86: 10.6 · arm: 7.1 GB/s |
find the last occurrence of a random word from text, ≅ 5 bytes long | |||
❌ | .rfind x86: 0.5 · arm: 0.4 GB/s | .rfind x86: 0.9 · arm: 0.5 GB/s | sz_rfind x86: 10.8 · arm: 6.7 GB/s |
find the first occurrence of any of 6 whitespaces 2 | |||
strcspn 1x86: 0.74 · arm: 0.29 GB/s | .find_first_of x86: 0.25 · arm: 0.23 GB/s | re.finditer x86: 0.06 · arm: 0.02 GB/s | sz_find_charset x86: 0.43 · arm: 0.23 GB/s |
find the last occurrence of any of 6 whitespaces 2 | |||
❌ | .find_last_of x86: 0.25 · arm: 0.25 GB/s | ❌ | sz_rfind_charset x86: 0.43 · arm: 0.23 GB/s |
Most benchmarks were conducted on a 1 GB English text corpus, with an average word length of 5 characters. The code was compiled with GCC 12, using
glibc
v2.35. The benchmarks performed on Arm-based Graviton3 AWSc7g
instances andr7iz
Intel Sapphire Rapids. Most modern Arm-based 64-bit CPUs will have similar relative speedups. Variance withing x86 CPUs will be larger. ¹ Unlike other libraries, LibC requires strings to be NULL-terminated. ² Six whitespace characters in the ASCII set are:\t\n\v\f\r
. Python’s and other standard libraries have specialized functions for those.
Summarizing, StringZilla is…:
- 3.5x faster than LibC for substring search.
- 4.4x faster than STL for substring search.
- 16.8x faster than STL for reverse order substring search.
You can read more about the SIMD tricks in the preceding articles.
Conclusion
The library has a lot of “work in progress” functionality that goes far beyond the “standard needs”. It packs Levenshtein edit distances, Needleman-Wunsch alignment scores for Bioinformatics, Rabin fingerprints for fuzzy matching, and fast Radix-based sorting. As it matures, it might be worth suggesting as a new baseline implementation for the standard library of C++ and the strings in other programming languages. Until then, it’s an easy-to-install tool for performance-sensitive applications. Give it a chance and let me know if there is some functionality you would like to see in the next release 🤗
This week StringZilla 🦖 received a new much friendlier mascot and improved 🦀 Rust support! And if it wasn’t enough to start using it, Google is also switching to Arm-based CPUs for cloud.
— Ash Vardanian (@ashvardanian) April 11, 2024
50% faster search than LibC on x86
250% faster on Arm 💪https://t.co/hCrHAybIsZ