March 12, 2024 · 9 min · 1729 words · Ash Vardanian
Downloaded over 5 Billion times, NumPy is the most popular library for numerical computing in Python.
It wraps low-level HPC libraries like BLAS and LAPACK, providing a high-level interface for matrix operations.
BLAS is mainly implemented in C, Fortran, or Assembly and is available for most modern chips, not just CPUs.
BLAS is fast, but bindings aren’t generally free.
So, how much of the BLAS performance is NumPy leaving on the table?
TLDR: up to 90% of throughput is lost on 1536-dimensional OpenAI Ada embeddings, more on shorter vectors.
SimSIMD can fix that, at least partly.
```python
import numpy as np

a = np.random.rand(1536)
b = np.random.rand(1536)

np.dot(a.astype(np.float64), b.astype(np.float64))  # cblas_ddot
np.dot(a.astype(np.float32), b.astype(np.float32))  # cblas_sdot
np.dot(a.astype(np.float16), b.astype(np.float16))  # no analog
np.dot(a.astype(np.float64).view(np.complex128), b.astype(np.float64).view(np.complex128))  # cblas_zdotu_sub
np.dot(a.astype(np.float32).view(np.complex64), b.astype(np.float32).view(np.complex64))  # cblas_cdotu_sub
```
The benchmarks below directly compare NumPy's default PyPI distribution against the same underlying OpenBLAS library called from the C layer.
For profiling, I chose the c7g instances on AWS, powered by Arm-based Graviton 3 CPUs with Scalable Vector Extensions (SVE) - variable-width SIMD Assembly instructions - my favorite example of convergence between CPUs and GPUs.
AWS Graviton 3 CPUs are an excellent baseline for the new wave of energy-efficient supercomputers and clouds.
Arm also powers Fugaku, the world's fastest supercomputer until 2022, which supports SVE as well.
Nvidia Grace CPUs are also based on Arm - Neoverse N2 cores, to be precise.
They support SVE instructions too and already power the Isambard 3 supercomputer in Bristol, one of the greenest in the world.
Similarly, Microsoft Azure is rolling out its Cobalt CPUs, and the Ampere One is gaining adoption.
| Numeric Type | OpenBLAS in C | OpenBLAS in NumPy | Slowdown |
| :--- | ---: | ---: | ---: |
| float64 ¹⁵³⁶ | 1,693 K | 479 K | 3.53 x |
| float32 ¹⁵³⁶ | 6,568 K | 752 K | 8.73 x |
| float16 ¹⁵³⁶ | - | 173 K | - |
| complex128 ⁷⁶⁸ | 1,706 K | 469 K | 3.64 x |
| complex64 ⁷⁶⁸ | 2,980 K | 632 K | 4.71 x |
| complex32 ⁷⁶⁸ | - | - | - |
The benchmarks were run on a single thread, and the numbers are dot-product operations per second.
In this setting, NumPy is 3.53x to 8.73x slower than the underlying BLAS library for 1536-dimensional real and 768-dimensional complex vectors.
The shorter the vectors, the more significant the slowdown.
The slowdown comes from multiple sources:

- Dynamic dispatch - as Python is a dynamic language, locating the dot symbol may involve several lookup tables and function calls,
- Type checking - as Python is dynamically typed, the native implementation of dot must check the argument types and whether they are compatible,
- Memory allocations - NumPy has to allocate memory for the result and then free it.
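These costs are per call, not per element, so their relative weight grows as the vectors shrink. A minimal way to see this with the standard timeit module - the sizes below are arbitrary illustrations, not the benchmark harness used for the tables in this article:

```python
import timeit

import numpy as np

a_long, b_long = np.random.rand(1536).astype(np.float32), np.random.rand(1536).astype(np.float32)
a_short, b_short = np.random.rand(16).astype(np.float32), np.random.rand(16).astype(np.float32)

calls = 100_000
long_ns = timeit.timeit(lambda: np.dot(a_long, b_long), number=calls) / calls * 1e9
short_ns = timeit.timeit(lambda: np.dot(a_short, b_short), number=calls) / calls * 1e9

# If the arithmetic dominated, the 16-dimensional call would be ~100x cheaper;
# in practice both are dominated by the same fixed per-call overhead.
print(f"1536-d: {long_ns:.0f} ns/call, 16-d: {short_ns:.0f} ns/call")
```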
In the v4 release of SimSIMD, I’ve added complex dot products.
SimSIMD now also supports half-precision complex numbers, which aren't available in NumPy or most BLAS implementations, though they are being considered for CuPy.
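NumPy has no complex32 dtype, so half-precision complex vectors have to live in interleaved float16 buffers. For reference, this is what the unconjugated complex dot product over such a layout looks like in plain NumPy - just a sketch of the data layout, not how SimSIMD implements it:

```python
import numpy as np

# 1536 interleaved half-precision scalars = 768 complex32 values: [re0, im0, re1, im1, ...]
a = np.random.rand(1536).astype(np.float16)
b = np.random.rand(1536).astype(np.float16)

a_re, a_im = a[0::2].astype(np.float32), a[1::2].astype(np.float32)
b_re, b_im = b[0::2].astype(np.float32), b[1::2].astype(np.float32)

# Unconjugated complex dot product, accumulated in single precision:
real = np.dot(a_re, b_re) - np.dot(a_im, b_im)
imag = np.dot(a_re, b_im) + np.dot(a_im, b_re)
print(complex(real, imag))
```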
Looking at hardware-accelerated dot-products specifically, this is how SimSIMD and NumPy compare in terms of functionality:
| | NumPy | SimSIMD |
| :--- | :--- | :--- |
| Real Types | float16, float32, float64 | float16, float32, float64 |
| Complex Types | complex64, complex128 | complex32, complex64, complex128 |
| Backends | Come from BLAS | NEON, SVE, Haswell, Skylake, Ice Lake, Sapphire Rapids |
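As a quick usage sketch - assuming the dot and vdot bindings that mirror NumPy's naming - swapping NumPy for SimSIMD looks roughly like this:

```python
import numpy as np
import simsimd  # pip install simsimd

a = np.random.rand(1536).astype(np.float16)
b = np.random.rand(1536).astype(np.float16)
print(simsimd.dot(a, b))  # hardware-accelerated, even for float16

x = np.random.rand(1536).astype(np.float32).view(np.complex64)
y = np.random.rand(1536).astype(np.float32).view(np.complex64)
print(simsimd.dot(x, y))   # unconjugated complex dot product
print(simsimd.vdot(x, y))  # conjugated, like numpy.vdot
```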
The library also contains benchmarks against OpenBLAS, which I’ve used in this article.
First, I’ve profiled the C layer, looking for ways to supersede OpenBLAS and comparing auto-vectorization with explicit SIMD kernels.
| Numeric Type | OpenBLAS in C | Serial C Code | SimSIMD in C | Improvement |
| :--- | ---: | ---: | ---: | ---: |
| float64 ¹⁵³⁶ | 1,693 K | 861 K | 2,944 K | + 73.9 % |
| float32 ¹⁵³⁶ | 6,568 K | 1,248 K | 5,464 K | - 16.8 % |
| float16 ¹⁵³⁶ | - | 845 K | 10,500 K | - |
| complex128 ⁷⁶⁸ | 1,706 K | 1,264 K | 2,061 K | + 20.8 % |
| complex64 ⁷⁶⁸ | 2,980 K | 1,335 K | 3,960 K | + 32.9 % |
| complex32 ⁷⁶⁸ | - | 881 K | 7,138 K | - |
The comparison wasn’t possible for half-precision representations.
In the remaining 4 cases, SimSIMD won 3 times and lost once.
Patches for the lost case are more than welcome 🤗
Putting it all together, this is how SimSIMD and NumPy compare when used from Python:
| Numeric Type | NumPy in Python | SimSIMD in Python | Improvement |
| :--- | ---: | ---: | ---: |
| float64 ¹⁵³⁶ | 479 K | 753 K | + 57.2 % |
| float32 ¹⁵³⁶ | 752 K | 1,215 K | + 61.6 % |
| float16 ¹⁵³⁶ | 173 K | 1,484 K | + 757.8 % |
| complex128 ⁷⁶⁸ | 469 K | 737 K | + 57.1 % |
| complex64 ⁷⁶⁸ | 632 K | 1,011 K | + 60.0 % |
| complex32 ⁷⁶⁸ | - | 1,173 K | - |
In all cases but one, the SimSIMD improvement over NumPy can be attributed to the quality of the binding layer - BLAS itself wasn't the bottleneck.
NumPy could have been just as fast as SimSIMD if it had a better binding layer.
In half-precision, however, NumPy doesn't call BLAS at all, and as a result it's more than 8x slower than SimSIMD.
Considering how often numpy.inner, numpy.dot, and numpy.vdot are used, I'll call this a win.
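To reproduce the half-precision gap on your own machine, a minimal side-by-side with timeit - again assuming the simsimd.dot binding - could look like this; absolute numbers will differ between CPUs:

```python
import timeit

import numpy as np
import simsimd  # pip install simsimd

a = np.random.rand(1536).astype(np.float16)
b = np.random.rand(1536).astype(np.float16)

calls = 100_000
numpy_ns = timeit.timeit(lambda: np.dot(a, b), number=calls) / calls * 1e9
simsimd_ns = timeit.timeit(lambda: simsimd.dot(a, b), number=calls) / calls * 1e9
print(f"numpy.dot: {numpy_ns:.0f} ns/call, simsimd.dot: {simsimd_ns:.0f} ns/call, speedup: {numpy_ns / simsimd_ns:.1f}x")
```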
So, if you want to speed up your pipelines - pip install simsimd 😉