- Python’s not the fastest language out there.
- Developers often use tools like Boost.Python and SWIG to wrap faster native C/C++ code for Python.
- PyBind11 is the most popular tool for the job not the quickest. NanoBind offers improvements, but when speed really matters, we turn to pure CPython C API bindings.
With StringZilla, I started with PyBind11 but switched to CPython to reduce latency. The switch did demand more coding effort, moving from modern C++17 to more basic C99, but the result is a 5x lower call latency! So if you want to make your bindings lighter and better understand how CPython works under the hood - continue reading 🤗
Why use CPython for StringZilla?
I designed StringZilla’s Str
class to be a faster version of Python’s native str
class. It was built for speed, handling large strings from memory-mapped files using SIMD Assembly instructions. Spending $50K on data-engineering tasks in public clouds this year, with StringZilla I was able to cut the processing time and costs by a factor of ten.
However, as more people suddenly started using StringZilla for smaller tasks in the last couple of months, PyBind11’s latency became noticeable:
- 1 microsecond using native
str
class:"the future".find("solutions")
- 15 microseconds using PyBind11:
Str("the future").find("solutions")
- 3 microseconds using CPython:
Str("the future").find("solutions")
This is a major improvement, but you can do even better, and I’ve already done that in SimSIMD, mentioned in the previous post. Sounds interesting? Do you want to know CPython works under the hood and how I’m leveraging that in StringZilla and my other open-source projects? Continue reading!
- Why use CPython for StringZilla?
- How is a Module Loaded?
- How is a Function Invoked and Arguments Parsed?
- How is a Class Defined and Which Methods it Has?
- Our Accomplishments
How is a Module Loaded?
Upon executing pip install stringzilla
and subsequently running import stringzilla
, the CPython runtime begins its search for the associated dynamic library within the directories enumerated in sys.path
. This encompasses the current directory, any paths specified by the PYTHONPATH
environment variable, standard library directories, and notably, the site-packages
directory. Once CPython locates the corresponding library and verifies the presence of the initialization function, PyInit_stringzilla
, it integrates the module within the Python environment.
|
|
Insights from the Code:
- Module Definition: Every module in Python is essentially a heap-allocated object, primarily delineated by
PyModuleDef
. - Reference Counting: Given the nature of CPython, reference management of objects is manual, necessitating the frequent utilization of functions like
Py_INCREF
andPy_XDECREF
. - Dynamic Memory Allocation: The code introduces a static dynamically allocated
temporary_memory
. Freeing this memory is vital to avoid “memory leaks”, thus the incorporation of thestringzilla_cleanup
function.
Module Compilation:
Modules of this nature, rooted in C, can be compiled employing tools like CMake or Make. Alternatively, one can adapt setup.py
to encompass this native extension.
|
|
Upon completion of the compilation process, the resultant artifact is a .so
(on Linux/macOS) or .pyd
(on Windows) shared library. A practical demonstration on macOS can be conducted by simply starting the Python runtime, and checking if the newly compiled binding can be fetched, as follows:
|
|
Executing the above will display a comprehensive list, predominantly consisting of around 80 CPython’s default dependencies, with stringzilla
distinctly listed therein.
How is a Function Invoked and Arguments Parsed?
Python’s str
class provides a wide array of methods, offering diverse string operations. In StringZilla, apart from member methods specific to Str
instances, I’ve also introduced global functions, enabling operations without needing an instance. These methods are enumerated in the stringzilla_methods[]
array:
|
|
The highlighted flags combination in the snippet is METH_VARARGS | METH_KEYWORDS
. Several flag options are available, each catering to specific requirements:
METH_VARARGS
: Standard convention, accepting a tuple with all arguments. Commonly combined withMETH_KEYWORDS
to accept keyword arguments.METH_KEYWORDS
: For methods to receive keyword arguments, usually paired with eitherMETH_VARARGS
orMETH_FASTCALL
.METH_FASTCALL
: A post-Python 3.7 optimization, this flag enhances the calling convention efficiency.METH_NOARGS
: For methods that don’t require any arguments.METH_O
: Suitable for methods taking just a single object argument.METH_CLASS
andMETH_STATIC
: Specify class and static methods, respectively.METH_COEXIST
: Permits a method’s loading even if another definition with the same name already exists.
As an illustrative example, consider the Str_levenstein
function which calculates the Levenshtein (Edit) distance between two strings.
Depending on the way you call it, self
, positional arguments args
and named arguments kwargs
will contain different values.
Method | Global Function | Member Method |
---|---|---|
Example | sz.levenstein(a, b) | a.levenstein(b) |
Example with Last Positional Arg. | sz.levenstein(a, b, 10) | a.levenstein(b, 10) |
Example with Last Named Arg. | sz.levenstein(a, b, bound=10) | a.levenstein(b, bound=10) |
Object in self | sz module | a |
Object in args | a , b , possibly 10 | b , possibly 10 |
Object in kwargs | possibly bound=10 | possibly bound=10 |
Here comes the code:
|
|
The function employs a substantial amount of boilerplate code:
- Just 3 lines to compute and return the distance.
- 8 lines for temporary memory allocations and potential errors.
- Over 30 lines meticulously parse the function arguments.
Most developers avoid that using the PyArg_ParseTupleAndKeywords
function:
Once called, PyArg_ParseTupleAndKeywords
in turn calls vgetargskeywords
and creates a huge stack frame for all the temporary variables required to parse the provided arguments according to the OO|O
format:
Things go downhill from there, as 3 separate for
-loops will have to be executed to perform the parsing.
Obviously, this is a lot more expensive then most functions inside of StringZilla, that we are going to wrap.
|
|
In pursuit of further enhancements, the fast calling convention METH_KEYWORDS | METH_FASTCALL
should be utilized, eliminating overhead associated with tuple-unpacking. If you’re interested, I’ve incorporated this approach in the recent release of SimSIMD. You can reference its implementation to grasp the improvements. Looking forward to your contributions 🤗
Do you use #NumPy or #SciPy to compute vector similarity? For @OpenAI Ada embeddings – SimSIMD does it 3-200x faster 🤯
— Ash Vardanian (@ashvardanian) October 5, 2023
float16 works great on every arch - Arm NEON & SVE and x86 AVX2 & AVX-512. Intel AMX also looks excellent 🔜https://t.co/ouFdvMJd7g#opensource #Python
How is a Class Defined and Which Methods it Has?
Defining a Python class offers more versatility than merely specifying its methods. Let’s delve into the Str
class in StringZilla as an illustrative example. Alongside the PyMethodDef
methods, this class was designed to support Python’s “sequence” protocols, enabling:
- Determining the string’s length:
len(Str(""))
. - Fetching a specific character:
Str("a")[0]
. - Checking substring presence:
"a" in Str("a")
.
To achieve the above, the class requires a PySequenceMethods
definition. Furthermore, to support slicing (e.g., text[10:-10]
to omit the first and last 10 characters), a PyMappingMethods
definition becomes essential.
Other notable features include:
- Buffer Protocol (
Str_as_buffer
): Commonly associated with Linear Algebra and Data Science libraries, this protocol facilitates zero-copy access. It treatsStr
as a one-dimensional character array, making it reminiscent of C arrays.
- Numerical Operations (
Str_as_number
): Currently, the class only supports string concatenation using the addition operator:Str("fizz") + Str("buzz")
.
These methods, combined with other essential methods like tp_hash
for hashing (hash(Str(""))
), tp_richcompare
for comparisons (< == > != <= >=
), and more, are all unified under the StrType
structure:
|
|
Our Accomplishments
I put StringZilla to the test using the well-known Leipzig 1M dataset, which is 129,644,797 bytes in size. A standout feature of StringZilla is its ability to avoid unnecessary data copies. This is particularly beneficial when using functions like partition
and split
on large datasets, as it leads to significant memory savings. Moreover, the File
class empowers us to handle datasets that exceed available memory.
Here’s a comparative performance chart:
Task | Python | StringZilla | Speed Boost |
---|---|---|---|
.count("the") | 150 ms | 50 ms | 3x |
.count("\n") | 49 ms | 11 ms | 4.5x |
.split("the") | 223 ms | 57 ms | 3.9x |
.split("\n") | 100 ms | 34 ms | 2.9x |
.partition("the") | 8 ms | 259 ns | 31x |
.partition("\n") | 8 ms | 264 ns | 31x |
.find("\n") | 247 ns | 172 ns | 1.4x ¹ |
.find("wzyx") | 70 ms | 8 ms | 8.8x |
¹ The accuracy of this value might be off. In reality, StringZilla could be slightly slower. The challenge lies in measuring such short durations, especially using tools like the Jupyter notebook I employed.
Lately, I’ve been primarily leveraging low-level direct CPython bindings in projects such as:
- SimSIMD – For computations that outpace SciPy and NumPy in vector similarity.
- UCall – A turbocharged alternative to FastAPI.
I’m also contemplating their integration in USearch which we use to replace FAISS, and UForm, to serve faster multi-modal inference API.
Looking forward to your contributions, and don’t forget to share with your friends, if the projects resonate 🤗