- Python’s not the fastest language out there.
- Developers often use tools like Boost.Python and SWIG to wrap faster native C/C++ code for Python.
- PyBind11 is the most popular tool for the job not the quickest. NanoBind offers improvements, but when speed really matters, we turn to pure CPython C API bindings.
With StringZilla, I started with PyBind11 but switched to CPython to reduce latency. The switch did demand more coding effort, moving from modern C++17 to more basic C99, but the result is a 5x lower call latency! So if you want to make your bindings lighter and better understand how CPython works under the hood - continue reading 🤗
Why use CPython for StringZilla?
I designed StringZilla’s
Str class to be a faster version of Python’s native
str class. It was built for speed, handling large strings from memory-mapped files using SIMD Assembly instructions. Spending $50K on data-engineering tasks in public clouds this year, with StringZilla I was able to cut the processing time and costs by a factor of ten.
However, as more people suddenly started using StringZilla for smaller tasks in the last couple of months, PyBind11’s latency became noticeable:
- 1 microsecond using native
- 15 microseconds using PyBind11:
- 3 microseconds using CPython:
This is a major improvement, but you can do even better, and I’ve already done that in SimSIMD, mentioned in the previous post. Sounds interesting? Do you want to know CPython works under the hood and how I’m leveraging that in StringZilla and my other open-source projects? Continue reading!
- Why use CPython for StringZilla?
- How is a Module Loaded?
- How is a Function Invoked and Arguments Parsed?
- How is a Class Defined and Which Methods it Has?
- Our Accomplishments
How is a Module Loaded?
pip install stringzilla and subsequently running
import stringzilla, the CPython runtime begins its search for the associated dynamic library within the directories enumerated in
sys.path. This encompasses the current directory, any paths specified by the
PYTHONPATH environment variable, standard library directories, and notably, the
site-packages directory. Once CPython locates the corresponding library and verifies the presence of the initialization function,
PyInit_stringzilla, it integrates the module within the Python environment.
Insights from the Code:
- Module Definition: Every module in Python is essentially a heap-allocated object, primarily delineated by
- Reference Counting: Given the nature of CPython, reference management of objects is manual, necessitating the frequent utilization of functions like
- Dynamic Memory Allocation: The code introduces a static dynamically allocated
temporary_memory. Freeing this memory is vital to avoid “memory leaks”, thus the incorporation of the
Modules of this nature, rooted in C, can be compiled employing tools like CMake or Make. Alternatively, one can adapt
setup.py to encompass this native extension.
Upon completion of the compilation process, the resultant artifact is a
.so (on Linux/macOS) or
.pyd (on Windows) shared library. A practical demonstration on macOS can be conducted by simply starting the Python runtime, and checking if the newly compiled binding can be fetched, as follows:
Executing the above will display a comprehensive list, predominantly consisting of around 80 CPython’s default dependencies, with
stringzilla distinctly listed therein.
How is a Function Invoked and Arguments Parsed?
str class provides a wide array of methods, offering diverse string operations. In StringZilla, apart from member methods specific to
Str instances, I’ve also introduced global functions, enabling operations without needing an instance. These methods are enumerated in the
The highlighted flags combination in the snippet is
METH_VARARGS | METH_KEYWORDS. Several flag options are available, each catering to specific requirements:
METH_VARARGS: Standard convention, accepting a tuple with all arguments. Commonly combined with
METH_KEYWORDSto accept keyword arguments.
METH_KEYWORDS: For methods to receive keyword arguments, usually paired with either
METH_FASTCALL: A post-Python 3.7 optimization, this flag enhances the calling convention efficiency.
METH_NOARGS: For methods that don’t require any arguments.
METH_O: Suitable for methods taking just a single object argument.
METH_STATIC: Specify class and static methods, respectively.
METH_COEXIST: Permits a method’s loading even if another definition with the same name already exists.
As an illustrative example, consider the
Str_levenstein function which calculates the Levenshtein (Edit) distance between two strings.
Depending on the way you call it,
self, positional arguments
args and named arguments
kwargs will contain different values.
|Method||Global Function||Member Method|
|Example with Last Positional Arg.|
|Example with Last Named Arg.|
|Object in |
|Object in |
|Object in ||possibly ||possibly |
Here comes the code:
The function employs a substantial amount of boilerplate code:
- Just 3 lines to compute and return the distance.
- 8 lines for temporary memory allocations and potential errors.
- Over 30 lines meticulously parse the function arguments.
Most developers avoid that using the
PyArg_ParseTupleAndKeywords in turn calls
vgetargskeywords and creates a huge stack frame for all the temporary variables required to parse the provided arguments according to the
Things go downhill from there, as 3 separate
for-loops will have to be executed to perform the parsing.
Obviously, this is a lot more expensive then most functions inside of StringZilla, that we are going to wrap.
In pursuit of further enhancements, the fast calling convention
METH_KEYWORDS | METH_FASTCALL should be utilized, eliminating overhead associated with tuple-unpacking. If you’re interested, I’ve incorporated this approach in the recent release of SimSIMD. You can reference its implementation to grasp the improvements. Looking forward to your contributions 🤗
Do you use #NumPy or #SciPy to compute vector similarity? For @OpenAI Ada embeddings – SimSIMD does it 3-200x faster 🤯— Ash Vardanian (@ashvardanian) October 5, 2023
float16 works great on every arch - Arm NEON & SVE and x86 AVX2 & AVX-512. Intel AMX also looks excellent 🔜https://t.co/ouFdvMJd7g#opensource #Python
How is a Class Defined and Which Methods it Has?
Defining a Python class offers more versatility than merely specifying its methods. Let’s delve into the
Str class in StringZilla as an illustrative example. Alongside the
PyMethodDef methods, this class was designed to support Python’s “sequence” protocols, enabling:
- Determining the string’s length:
- Fetching a specific character:
- Checking substring presence:
"a" in Str("a").
To achieve the above, the class requires a
PySequenceMethods definition. Furthermore, to support slicing (e.g.,
text[10:-10] to omit the first and last 10 characters), a
PyMappingMethods definition becomes essential.
Other notable features include:
- Buffer Protocol (
Str_as_buffer): Commonly associated with Linear Algebra and Data Science libraries, this protocol facilitates zero-copy access. It treats
Stras a one-dimensional character array, making it reminiscent of C arrays.
- Numerical Operations (
Str_as_number): Currently, the class only supports string concatenation using the addition operator:
Str("fizz") + Str("buzz").
These methods, combined with other essential methods like
tp_hash for hashing (
tp_richcompare for comparisons (
< == > != <= >=), and more, are all unified under the
I put StringZilla to the test using the well-known Leipzig 1M dataset, which is 129,644,797 bytes in size. A standout feature of StringZilla is its ability to avoid unnecessary data copies. This is particularly beneficial when using functions like
split on large datasets, as it leads to significant memory savings. Moreover, the
File class empowers us to handle datasets that exceed available memory.
Here’s a comparative performance chart:
|150 ms||50 ms||3x|
|49 ms||11 ms||4.5x|
|223 ms||57 ms||3.9x|
|100 ms||34 ms||2.9x|
|8 ms||259 ns||31x|
|8 ms||264 ns||31x|
|247 ns||172 ns||1.4x ¹|
|70 ms||8 ms||8.8x|
¹ The accuracy of this value might be off. In reality, StringZilla could be slightly slower. The challenge lies in measuring such short durations, especially using tools like the Jupyter notebook I employed.
Lately, I’ve been primarily leveraging low-level direct CPython bindings in projects such as:
- SimSIMD – For computations that outpace SciPy and NumPy in vector similarity.
- UCall – A turbocharged alternative to FastAPI.
Looking forward to your contributions, and don’t forget to share with your friends, if the projects resonate 🤗