Downloaded over 5 billion times, NumPy is the most popular library for numerical computing in Python. It wraps low-level HPC libraries like BLAS and LAPACK, providing a high-level interface for matrix operations. BLAS implementations are mostly written in C, Fortran, or Assembly, and are available for most modern chips, not just CPUs. BLAS is fast, but the bindings generally aren't free. So, how much of the BLAS performance is NumPy leaving on the table?

TLDR: up to 90% of throughput is lost on 1536-dimensional OpenAI Ada embeddings, more on shorter vectors. SimSIMD can fix that, at least partly.

NumPy standing on the shoulders of BLAS (meme)

We aren’t going to cover the basics of NumPy or BLAS. We won’t cover BLAS level 2 or 3 for vector-matrix and matrix-matrix operations, only level 1 vector-vector dot-products. We have already shown how pure Assembly dot-products can be over 2,000x faster than Python. We have also demonstrated how bindings across 10 programming languages work. Instead, let’s jump into benchmark results.

Baseline Benchmarks

I wrote two benchmark scripts: one in Python and one in C++ using Google Benchmark. In short, they benchmark the following NumPy calls against their BLAS counterparts:

import numpy as np
a = np.random.rand(1536)
b = np.random.rand(1536)

np.dot(a.astype(np.float64), b.astype(np.float64)) # cblas_ddot
np.dot(a.astype(np.float32), b.astype(np.float32)) # cblas_sdot
np.dot(a.astype(np.float16), b.astype(np.float16)) # no BLAS analog
np.dot(a.astype(np.float64).view(np.complex128), b.astype(np.float64).view(np.complex128)) # cblas_zdotu_sub
np.dot(a.astype(np.float32).view(np.complex64), b.astype(np.float32).view(np.complex64)) # cblas_cdotu_sub
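
Note that for complex inputs, numpy.dot multiplies elements without conjugation, matching the unconjugated cblas_zdotu_sub, while numpy.vdot conjugates its first argument, like cblas_zdotc_sub:

x = np.array([1 + 2j])
y = np.array([3 + 4j])
np.dot(x, y)   # (1+2j)*(3+4j) = -5+10j, unconjugated, like cblas_zdotu_sub
np.vdot(x, y)  # conj(1+2j)*(3+4j) = 11-2j, conjugated, like cblas_zdotc_sub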

These scripts directly compare NumPy's default PyPI distribution with the same underlying OpenBLAS library called from the C layer. For profiling, I chose the c7g instances on AWS, powered by Arm-based Graviton 3 CPUs with Scalable Vector Extensions (SVE) - variable-width SIMD instructions - my favorite example of convergence between CPUs and GPUs.
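
To verify which BLAS build your NumPy wheel links against, NumPy can print its build configuration; the stock PyPI wheel reports OpenBLAS, consistent with the logs in the appendices below:

import numpy as np
np.show_config()  # lists the BLAS/LAPACK libraries NumPy was built with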

AWS Graviton 3 CPUs are an excellent baseline for the new wave of energy-efficient supercomputers and clouds. Arm also powers Fugaku, the world's fastest supercomputer until 2022, which supports SVE as well. Nvidia Grace CPUs are Arm-based too - Neoverse V2 cores, to be precise - also support SVE, and already power the Isambard 3 supercomputer in Bristol, one of the greenest in the world. Similarly, Microsoft Azure is rolling out its Cobalt CPUs, and AmpereOne is gaining adoption.

| Numeric Type   | OpenBLAS in C | OpenBLAS in NumPy | Slowdown |
| :------------- | ------------: | ----------------: | -------: |
| float64 ¹⁵³⁶   |       1,693 K |             479 K |   3.53 x |
| float32 ¹⁵³⁶   |       6,568 K |             752 K |   8.73 x |
| float16 ¹⁵³⁶   |             - |             173 K |        - |
| complex128 ⁷⁶⁸ |       1,706 K |             469 K |   3.64 x |
| complex64 ⁷⁶⁸  |       2,980 K |             632 K |   4.71 x |
| complex32 ⁷⁶⁸  |             - |                 - |        - |

The benchmarks were run on a single thread; the numbers are dot-product operations per second. In this case, NumPy is 3.53x to 8.73x slower than the underlying BLAS library for 1536-dimensional real and 768-dimensional complex vectors. The shorter the vectors, the more significant the slowdown. It comes from multiple sources, illustrated by the sketch after this list:

  • Dynamic dispatch - as Python is a dynamic language, locating the dot symbol may involve several lookup tables and function calls,
  • Type checking - as Python is a dynamically typed language, the native implementation of dot must check the argument types and verify they are compatible,
  • Memory allocations - NumPy has to allocate memory for the result and then free it.
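
A minimal sketch of that fixed cost, using only timeit: for short vectors, the per-call overhead dominates, so the time per call grows far slower than the FLOP count.

import timeit
import numpy as np

# Time np.dot across vector sizes; for small inputs dispatch,
# type checks, and allocations dominate the actual arithmetic.
for n in (16, 128, 1536, 16384):
    a = np.random.rand(n).astype(np.float32)
    b = np.random.rand(n).astype(np.float32)
    secs = timeit.timeit(lambda: np.dot(a, b), number=100_000)
    print(f"{n:6d} dims: {secs / 100_000 * 1e9:8.1f} ns/call")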

We can partly mitigate that.

SimSIMD

In the v4 release of SimSIMD, I’ve added complex dot products. SimSIMD now also supports half-precision complex numbers, not supported by NumPy or most BLAS implementations, but considered for CuPy. Looking at hardware-accelerated dot-products specifically, this is how SimSIMD and NumPy compare in terms of functionality:

|               | NumPy                     | SimSIMD                                                |
| :------------ | :------------------------ | :----------------------------------------------------- |
| Real Types    | float16, float32, float64 | float16, float32, float64                               |
| Complex Types | complex64, complex128     | complex32, complex64, complex128                        |
| Backends      | Come from BLAS            | NEON, SVE, Haswell, Skylake, Ice Lake, Sapphire Rapids  |
| Compatibility | 35 wheels on PyPI         | 105 wheels on PyPI                                      |
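
Usage mirrors NumPy. A minimal sketch, assuming the simsimd package from PyPI, whose dot accepts pairs of 1D NumPy vectors - including float16, which NumPy can't route to BLAS:

import numpy as np
import simsimd

a = np.random.rand(1536).astype(np.float16)
b = np.random.rand(1536).astype(np.float16)
result = simsimd.dot(a, b)  # dispatches to the best SIMD kernel available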

The library also contains benchmarks against OpenBLAS, which I've used in this article. First, I profiled the C layer, looking for ways to outperform OpenBLAS and comparing auto-vectorization with explicit SIMD kernels.

| Numeric Type   | OpenBLAS in C | Serial C Code | SimSIMD in C | Improvement |
| :------------- | ------------: | ------------: | -----------: | ----------: |
| float64 ¹⁵³⁶   |       1,693 K |         861 K |      2,944 K |    + 73.9 % |
| float32 ¹⁵³⁶   |       6,568 K |       1,248 K |      5,464 K |    - 16.8 % |
| float16 ¹⁵³⁶   |             - |         845 K |     10,500 K |           - |
| complex128 ⁷⁶⁸ |       1,706 K |       1,264 K |      2,061 K |    + 20.8 % |
| complex64 ⁷⁶⁸  |       2,980 K |       1,335 K |      3,960 K |    + 32.9 % |
| complex32 ⁷⁶⁸  |             - |         881 K |      7,138 K |           - |

The comparison wasn’t possible for half-precision representations. In the remaining 4 cases, SimSIMD won 3 times and lost once. Patches for the lost case are more than welcome 🤗


Putting it all together, this is how SimSIMD and NumPy compare when used from Python:

| Numeric Type   | NumPy in Python | SimSIMD in Python | Improvement |
| :------------- | --------------: | ----------------: | ----------: |
| float64 ¹⁵³⁶   |           479 K |             753 K |    + 57.2 % |
| float32 ¹⁵³⁶   |           752 K |           1,215 K |    + 61.6 % |
| float16 ¹⁵³⁶   |           173 K |           1,484 K |   + 757.8 % |
| complex128 ⁷⁶⁸ |           469 K |             737 K |    + 57.1 % |
| complex64 ⁷⁶⁸  |           632 K |           1,011 K |    + 60.0 % |
| complex32 ⁷⁶⁸  |               - |           1,173 K |           - |

In all cases but one, SimSIMD's edge over NumPy comes from the quality of the binding layer - BLAS wasn't the bottleneck, and NumPy could have been just as fast with better bindings. In half-precision, however, NumPy doesn't call BLAS at all and ends up over 8x slower than SimSIMD. Considering how often numpy.inner, numpy.dot, and numpy.vdot are used, I'll call this a win. So, if you want to make your pipelines faster - pip install simsimd 😉
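
Before swapping it into a pipeline, a quick sanity check against NumPy is easy - a sketch assuming simsimd.dot returns a scalar comparable to numpy.dot's:

import numpy as np
import simsimd

a = np.random.rand(1536).astype(np.float32)
b = np.random.rand(1536).astype(np.float32)
expected = float(np.dot(a, b))
actual = float(simsimd.dot(a, b))
assert abs(expected - actual) < 1e-3 * abs(expected)  # tolerate rounding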

Replicating the Results

# Install OpenBLAS and clone the repo
$ sudo apt install libopenblas-dev
$ git clone https://github.com/ashvardanian/SimSIMD.git
$ cd SimSIMD

# To run C benchmarks for dot products
$ cmake -DCMAKE_BUILD_TYPE=Release -DSIMSIMD_BUILD_BENCHMARKS=1 -B build_release
$ cmake --build build_release --config Release
$ build_release/simsimd_bench --benchmark_filter="dot.*(serial|blas|sve)"

# To run Python benchmarks
$ pip install numpy scipy scikit-learn
$ python python/bench.py

Appendix 1: C++ Benchmark Logs

- Arm NEON support enabled: true
- Arm SVE support enabled: true
- x86 Haswell support enabled: false
- x86 Skylake support enabled: false
- x86 Ice Lake support enabled: false
- x86 Sapphire Rapids support enabled: false
- Compiler supports F16: true
- Benchmark against CBLAS: true

2024-03-13T05:58:15+00:00
Running build_release/simsimd_bench
Run on (4 X 2100 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB (x4)
  L1 Instruction 64 KiB (x4)
  L2 Unified 1024 KiB (x4)
  L3 Unified 32768 KiB (x1)
Load Average: 0.12, 0.23, 0.31
***WARNING*** Library was built as DEBUG. Timings may be affected.
----------------------------------------------------------------------------------------------------------
Benchmark                                                Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------------------
dot_f32_blas_1536d/min_time:10.000/threads:4          38.2 ns          152 ns     91962500 abs_delta=0 bytes=80.7067G/s pairs=6.56793M/s relative_error=0
dot_f64_blas_1536d/min_time:10.000/threads:4           149 ns          590 ns     23714404 abs_delta=0 bytes=41.6305G/s pairs=1.69395M/s relative_error=0
dot_f32c_blas_1536d/min_time:10.000/threads:4         84.0 ns          336 ns     41723636 abs_delta=0 bytes=36.6258G/s pairs=2.98061M/s relative_error=0
dot_f64c_blas_1536d/min_time:10.000/threads:4          147 ns          586 ns     23884736 abs_delta=0 bytes=41.9479G/s pairs=1.70687M/s relative_error=0
vdot_f32c_blas_1536d/min_time:10.000/threads:4        84.1 ns          336 ns     41718708 abs_delta=0 bytes=36.6236G/s pairs=2.98043M/s relative_error=0
vdot_f64c_blas_1536d/min_time:10.000/threads:4         147 ns          586 ns     23895620 abs_delta=0 bytes=41.9427G/s pairs=1.70665M/s relative_error=0
dot_f16_sve_1536d/min_time:10.000/threads:4           23.9 ns         95.2 ns    146982384 abs_delta=119.209u bytes=64.5152G/s pairs=10.5005M/s relative_error=158.405u
dot_f32_sve_1536d/min_time:10.000/threads:4           45.9 ns          183 ns     76501512 abs_delta=0 bytes=67.1458G/s pairs=5.46434M/s relative_error=0
dot_f64_sve_1536d/min_time:10.000/threads:4           85.0 ns          340 ns     41207148 abs_delta=0 bytes=72.3747G/s pairs=2.94493M/s relative_error=0
dot_f16c_sve_1536d/min_time:10.000/threads:4          35.1 ns          140 ns     99915388 abs_delta=817.418u bytes=43.8583G/s pairs=7.1384M/s relative_error=1094.91u
vdot_f16c_sve_1536d/min_time:10.000/threads:4         35.1 ns          140 ns     99876424 abs_delta=446.2u bytes=43.915G/s pairs=7.14763M/s relative_error=609.749u
dot_f32c_sve_1536d/min_time:10.000/threads:4          63.2 ns          253 ns     55436532 abs_delta=0 bytes=48.6621G/s pairs=3.96013M/s relative_error=0
vdot_f32c_sve_1536d/min_time:10.000/threads:4         63.1 ns          252 ns     55614928 abs_delta=0 bytes=48.8208G/s pairs=3.97305M/s relative_error=0
dot_f64c_sve_1536d/min_time:10.000/threads:4           122 ns          485 ns     28866668 abs_delta=0 bytes=50.6679G/s pairs=2.06168M/s relative_error=0
vdot_f64c_sve_1536d/min_time:10.000/threads:4          122 ns          485 ns     28876568 abs_delta=0 bytes=50.686G/s pairs=2.06242M/s relative_error=0
dot_f16_serial_1536d/min_time:10.000/threads:4         297 ns         1183 ns     11832888 abs_delta=0 bytes=5.19258G/s pairs=845.147k/s relative_error=0
dot_f32_serial_1536d/min_time:10.000/threads:4         201 ns          801 ns     17471224 abs_delta=0 bytes=15.3362G/s pairs=1.24806M/s relative_error=0
dot_f64_serial_1536d/min_time:10.000/threads:4         291 ns         1161 ns     12059628 abs_delta=0 bytes=21.1718G/s pairs=861.482k/s relative_error=0
dot_f64c_serial_1536d/min_time:10.000/threads:4        198 ns          791 ns     17704468 abs_delta=0 bytes=31.0815G/s pairs=1.26471M/s relative_error=0
dot_f32c_serial_1536d/min_time:10.000/threads:4        187 ns          749 ns     18696188 abs_delta=0 bytes=16.4102G/s pairs=1.33546M/s relative_error=0
dot_f16c_serial_1536d/min_time:10.000/threads:4        284 ns         1135 ns     12337624 abs_delta=0 bytes=5.41538G/s pairs=881.409k/s relative_error=0

Appendix 2: Python Benchmark Logs

- Vector dimensions: 1536
- Vectors count: 1000
- Hardware capabilities: serial, neon, sve
- SimSIMD version: 4.0.0
- SciPy version: 1.11.3
- scikit-learn version: 1.4.1.post1
- NumPy version: 1.26.0
-- NumPy BLAS dependency: openblas64
-- NumPy LAPACK dependency: openblas64

| Datatype | Method                |   Ops/s | SimSIMD Ops/s | SimSIMD Improvement |
| :------- | :-------------------- | ------: | ------------: | ------------------: |
| `f64`    | `scipy.cosine`        |  40,362 |       690,516 |             17.11 x |
| `f32`    | `scipy.cosine`        |  31,097 |     1,138,570 |             36.61 x |
| `f16`    | `scipy.cosine`        |  14,631 |     1,393,120 |             95.21 x |
| `i8`     | `scipy.cosine`        |  39,489 |     1,473,040 |             37.30 x |
| `f64`    | `scipy.sqeuclidean`   | 154,574 |       718,487 |              4.65 x |
| `f32`    | `scipy.sqeuclidean`   | 176,013 |     1,180,098 |              6.70 x |
| `f16`    | `scipy.sqeuclidean`   |  47,537 |     1,411,140 |             29.69 x |
| `i8`     | `scipy.sqeuclidean`   | 121,854 |     1,416,866 |             11.63 x |
| `f64`    | `numpy.dot`           | 478,866 |       752,517 |              1.57 x |
| `f32`    | `numpy.dot`           | 752,299 |     1,214,928 |              1.61 x |
| `f16`    | `numpy.dot`           | 172,994 |     1,484,285 |              8.58 x |
| `i8`     | `numpy.dot`           | 511,496 |     1,571,563 |              3.07 x |
| `f32c`   | `numpy.dot`           | 631,648 |     1,010,783 |              1.60 x |
| `f64c`   | `numpy.dot`           | 468,782 |       737,116 |              1.57 x |
| `f32c`   | `numpy.vdot`          | 610,647 |     1,032,112 |              1.69 x |
| `f64c`   | `numpy.vdot`          | 437,204 |       749,826 |              1.72 x |
| `f64`    | `scipy.jensenshannon` |  15,597 |        38,129 |              2.44 x |
| `f32`    | `scipy.jensenshannon` |  15,436 |       234,427 |             15.19 x |
| `f16`    | `scipy.jensenshannon` |   8,381 |       213,824 |             25.51 x |
| `f64`    | `scipy.kl_div`        |  49,725 |        61,399 |              1.23 x |
| `f32`    | `scipy.kl_div`        |  47,714 |       464,078 |              9.73 x |
| `f16`    | `scipy.kl_div`        |  36,418 |       422,291 |             11.60 x |
| `b8`     | `scipy.hamming`       | 836,733 |    11,655,827 |             13.93 x |
| `b8`     | `scipy.jaccard`       | 395,086 |    10,229,237 |             25.89 x |