NumKong: 2'000 Mixed Precision Kernels For All 🦍
I’m killing my SimSIMD project and re-launching it under a new name — NumKong — StringZilla’s big brother. It packs around 2'000 SIMD kernels for mixed-precision numerics across 200'000 lines of code & docstrings, covering 7 programming languages. It is one of the larger collections online — comparable to OpenBLAS, the default BLAS (Basic Linear Algebra Subprograms) backend for NumPy (detailed comparison below).

Highlights:

- RISC-V Vector Extensions, Intel AMX & Arm SME Tiles
- From Vectors to Matrices and Higher-rank Tensors
- From BFloat16 and Float16 to Float6 — E3M2 & E2M3 on any CPU
- Native Int4 & UInt4 Dot Products via Nibble Algebra
- Neumaier & Dot2 for Higher-than-BLAS Precision
- Ozaki Scheme for Float64 GEMMs via Float32 Tile Hardware
- Haversine & Vincenty for Geospatial — 5'300x faster than GeoPy
- Kabsch & Umeyama Mesh Alignment — 200x faster than BioPython
- Fused MaxSim for ColBERT — GPU-Free Late Interaction Scoring
- WebAssembly SIMD backend for AI Sandboxes, Edge, & Browsers
- C99, C++23, Rust, Swift, JavaScript, Go, & Python 🐍

All of it is tested against in-house 118-bit floating-point numbers and heavily profiled for both numerical stability and speed. Here’s a preview of the performance numbers for the most familiar part — GEMM (General Matrix Multiply)-like batched dot products: ...
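To give a flavor of what "higher-than-BLAS precision" means, here is a minimal pure-Python sketch of a Neumaier-compensated dot product. This is an illustration of the classic Kahan–Neumaier technique, not NumKong's actual implementation; the function name is mine.

```python
def neumaier_dot(a, b):
    """Dot product with Neumaier-compensated summation.

    Tracks the rounding error lost at each addition in a separate
    compensation term, so catastrophic cancellation in the running
    sum does not destroy small contributions.
    """
    total = 0.0
    compensation = 0.0  # accumulated low-order bits lost to rounding
    for x, y in zip(a, b):
        product = x * y
        candidate = total + product
        if abs(total) >= abs(product):
            # `total` dominates: the low-order bits of `product` were lost
            compensation += (total - candidate) + product
        else:
            # `product` dominates: the low-order bits of `total` were lost
            compensation += (product - candidate) + total
        total = candidate
    return total + compensation
```

With inputs like `a = [1e16, 1.0, -1e16]` and `b = [1.0, 1.0, 1.0]`, a naive left-to-right accumulation returns `0.0`, while the compensated version recovers the exact answer `1.0`.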
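The geospatial kernels vectorize well-known great-circle formulas. For reference, the scalar haversine distance that such a kernel accelerates can be sketched in a few lines of Python; the function name and the mean-Earth-radius constant here are illustrative assumptions, not NumKong's API.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an illustrative constant


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude)
    points given in degrees, via the haversine formula."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = radians(lat2 - lat1)
    d_lambda = radians(lon2 - lon1)
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))
```

A batched SIMD version evaluates this formula over whole arrays of coordinate pairs at once, which is where the speedup over a scalar library like GeoPy comes from.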