For High-Performance Computing engineers, here’s the gist:

On Intel CPUs, the vaddps instruction (vectorized float addition) executes on ports 0 and 5. The vfmadd132ps instruction (vectorized fused float multiply-add, or FMA) also executes on ports 0 and 5.

On AMD CPUs, however, the vaddps instruction goes to ports 2 and 3, while the vfmadd132ps instruction goes to ports 0 and 1. Since an FMA computes a * b + c, setting b = 1 turns it into a plain addition; by routing part of the additions through the otherwise idle FMA ports, we can drastically increase the throughput of addition-heavy numerical kernels.
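
In intrinsic terms, the trick is just two spellings of the same addition. Here is a minimal illustration, not the full kernel shown later:

#include <immintrin.h>

// Plain addition: compiles to `vaddps`, which Zen 4 schedules on ports 2 and 3.
inline __m512 add_plain(__m512 acc, __m512 x) noexcept { return _mm512_add_ps(acc, x); }

// The same addition spelled as `x * 1.0f + acc`: compiles to a `vfmadd*ps`,
// which Zen 4 schedules on ports 0 and 1.
inline __m512 add_via_fma(__m512 acc, __m512 x) noexcept {
    return _mm512_fmadd_ps(x, _mm512_set1_ps(1.0f), acc);
}

The two are numerically identical: multiplying by 1.0f is exact, so the fused operation rounds once, just like the plain addition.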


Every few years, I revisit my old projects to look for improvements and laugh at my younger self. Great experience! Totally recommend it!

While refactoring the mess that was my ParallelReductionsBenchmark, I found an interesting optimization opportunity. The throughput of an AVX-512 kernel running at 355 GB/s on a single AMD Zen 4 core on AWS can be further improved to reach 500 GB/s! Sadly, this doesn’t apply to Intel.

What Are CPU Ports and Latency Hiding?

In the x86 world, CPU instructions are broken down into micro-ops, then dispatched to specialized ports for integer, floating-point, or load/store tasks. The cryptic notation, such as 1*p15+1*p23, shows how many micro-ops go to which port group on a given micro-architecture: here, one micro-op is dispatched to port 1 or 5, and another to port 2 or 3.

Effective “port balancing” is crucial for saturating available bandwidth and hiding latency. For those interested, tools like uops.info or Agner Fog’s optimization guides compile this information for various CPU vendors and generations, saving you from diving into unstructured vendor- and generation-specific PDFs. The number of ports and their capabilities vary significantly between CPU models and vendors, as the table below shows.

| Instruction | Intel Ice Lake ports | AMD Zen 4 ports |
|---|---|---|
| AES ops like AESENC (XMM, XMM) | 0 | 0, 1 |
| CRC ops like CRC32 (R32, R32) | 1 | N/A on uops.info |
| Aligned loads like VMOVDQA64 (ZMM, M512) | 2, 3 | 0, 1, 2, 3 |
| Adding bytes like VPADDB (ZMM, ZMM, ZMM) | 0, 5 | 0, 1, 2, 3 |
| Adding doubles like VADDPD (ZMM, ZMM, ZMM) | 0, 5 | 2, 3 |
| FMA like VFMADD132PS (ZMM, ZMM, ZMM) | 0, 5 | 0, 1 |

Check out VSCATTERQPS (VSIB_ZMM, K, YMM) for a truly complex port signature. According to uops.info, it looks like 2*p0+1*p0156+8*p49+8*p78 on Ice Lake, while on Zen 4, it’s 2*FP12+3*FP123+3*FP23+18*FP45.

Old AVX-512 Kernel

Below is my old AVX-512 kernel designed for large input arrays aligned to at least 64 bytes.

I’m using non-temporal loads, which are generally recommended when handling large volumes of data in a streaming fashion. This prevents CPU cache pollution with data that will soon be evicted without reuse.

The kernel traverses the input array in two directions - forward and reverse. Modern CPUs easily predict both traversal patterns and prefetch accordingly. Pulling data from different ends helps keep the TLB caches warm.

class avx512_f32streamed_t {
    float const *const begin_ = nullptr;
    float const *const end_ = nullptr;

  public:
    avx512_f32streamed_t() = default;
    avx512_f32streamed_t(float const *b, float const *e) noexcept : begin_(b), end_(e) {}

    float operator()() const noexcept {
        auto it_begin = begin_;
        auto it_end = end_;

        __m512 acc1 = _mm512_setzero_ps(); // Accumulator for forward direction
        __m512 acc2 = _mm512_setzero_ps(); // Accumulator for reverse direction

        // Process in chunks of 32 floats in each direction
        for (; it_end - it_begin >= 64; it_begin += 32, it_end -= 32) {
            acc1 = _mm512_add_ps(acc1, _mm512_castsi512_ps(_mm512_stream_load_si512((void *)(it_begin))));
            acc2 = _mm512_add_ps(acc2, _mm512_castsi512_ps(_mm512_stream_load_si512((void *)(it_end - 32))));
        }

        // Combine the accumulators
        __m512 acc = _mm512_add_ps(acc1, acc2);
        float sum = _mm512_reduce_add_ps(acc); //! Not an instruction, but fine with GCC & Clang
        while (it_begin < it_end) sum += *it_begin++;
        return sum;
    }
};

A common suggestion for my libraries (mainly StringZilla and SimSIMD) is to unroll the loops. I generally oppose this idea in naive kernels like these. While you might gain a few points in synthetic micro-benchmarks, you’ll consume more L1i instruction cache, potentially hurting other parts of your program - and likely getting no improvements in return.

Check out how poorly the unrolled f32unrolled variants perform in the end.
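
For reference, an unrolled variant looks roughly like the sketch below. This is only an illustration of the idea, assuming 64-byte-aligned input, and not a verbatim copy of the repo's f32unrolled kernel:

#include <immintrin.h>

// Hypothetical 4-way unrolled, forward-only reduction.
inline float reduce_unrolled(float const *it, float const *end) noexcept {
    __m512 acc0 = _mm512_setzero_ps(), acc1 = _mm512_setzero_ps();
    __m512 acc2 = _mm512_setzero_ps(), acc3 = _mm512_setzero_ps();
    // Consume 64 floats per iteration across 4 independent accumulators
    for (; end - it >= 64; it += 64) {
        acc0 = _mm512_add_ps(acc0, _mm512_load_ps(it + 0));
        acc1 = _mm512_add_ps(acc1, _mm512_load_ps(it + 16));
        acc2 = _mm512_add_ps(acc2, _mm512_load_ps(it + 32));
        acc3 = _mm512_add_ps(acc3, _mm512_load_ps(it + 48));
    }
    __m512 acc = _mm512_add_ps(_mm512_add_ps(acc0, acc1), _mm512_add_ps(acc2, acc3));
    float sum = _mm512_reduce_add_ps(acc);
    while (it < end) sum += *it++;
    return sum;
}

Every one of those additions still competes for the same pair of ports, so the unroll mostly trades instruction-cache bytes for little extra throughput.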

New AVX-512 Kernel

The new kernel is similar but uses more registers! It employs 4 independent accumulators and a dedicated register of 1.0f values. As before, the _mm512_add_ps intrinsic (mapping to the vaddps zmm, zmm, zmm instruction) is called twice in the loop body. Additionally, we call the _mm512_fmadd_ps intrinsic (mapping to vfmadd132ps zmm, zmm, zmm) twice per iteration.

class avx512_f32interleaving_t {
    float const *const begin_ = nullptr;
    float const *const end_ = nullptr;

  public:
    avx512_f32interleaving_t() = default;
    avx512_f32interleaving_t(float const *b, float const *e) : begin_(b), end_(e) {}

    float operator()() const noexcept {
        auto it_begin = begin_;
        auto it_end = end_;

        __m512 acc1 = _mm512_setzero_ps(); // Accumulator for forward direction addition
        __m512 acc2 = _mm512_setzero_ps(); // Accumulator for reverse direction addition
        __m512 acc3 = _mm512_setzero_ps(); // Accumulator for forward direction FMAs
        __m512 acc4 = _mm512_setzero_ps(); // Accumulator for reverse direction FMAs
        __m512 ones = _mm512_set1_ps(1.0f);

        // Process in chunks of 64 floats in each direction
        for (; it_end - it_begin >= 128; it_begin += 64, it_end -= 64) {
            acc1 = _mm512_add_ps(acc1, _mm512_castsi512_ps(_mm512_stream_load_si512((void *)(it_begin))));
            acc2 = _mm512_add_ps(acc2, _mm512_castsi512_ps(_mm512_stream_load_si512((void *)(it_end - 32))));
            acc3 = _mm512_fmadd_ps(ones, _mm512_castsi512_ps(_mm512_stream_load_si512((void *)(it_begin + 32))), acc3);
            acc4 = _mm512_fmadd_ps(ones, _mm512_castsi512_ps(_mm512_stream_load_si512((void *)(it_end - 64))), acc4);
        }

        // Combine the accumulators
        __m512 acc = _mm512_add_ps(_mm512_add_ps(acc1, acc2), _mm512_add_ps(acc3, acc4));
        float sum = _mm512_reduce_add_ps(acc);
        while (it_begin < it_end) sum += *it_begin++;
        return sum;
    }
};

On AMD Zen 4 CPUs, the additions feeding acc1 and acc2 execute on ports 2 and 3, while the FMAs feeding acc3 and acc4 run on ports 0 and 1, keeping both pairs of floating-point pipes busy.

Benchmarks

For the environment, I used an AWS m7a.metal-48xlarge instance with 192 cores across 2 sockets. The code is compiled with GCC 14.2 on Ubuntu 24.04. For context, each core has 32 KiB of L1 data cache and 1 MiB of unified L2 cache on that machine.

With tiny arrays, like 1024 floats (just 4 KiB, fitting comfortably in the L1 data cache), we never need to touch RAM. The numbers exceed RAM throughput and instead reflect ALU and cache throughput, exceeding 500 GB/s for the newest latency-hiding variant:

Benchmark                                                               Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------------------
avx512/f32/streamed/min_time:60.000/real_time                        11.6 ns         11.6 ns   7233046409 bytes/s=352.369G/s
avx512/f32/unrolled/min_time:60.000/real_time                        13.2 ns         13.2 ns   6374235156 bytes/s=310.835G/s
avx512/f32/interleaving/min_time:60.000/real_time                    7.91 ns         7.90 ns   10598096927 bytes/s=517.603G/s

For a much larger dataset of 268,435,456 floats coming from RAM, the throughput is lower, but we still see a 40% improvement over the baseline:

Benchmark                                                               Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------------------
avx512/f32/streamed/min_time:60.000/real_time                    68607560 ns     68466926 ns         1231 bytes/s=15.6505G/s
avx512/f32/unrolled/min_time:60.000/real_time                    61290814 ns     61164627 ns         1369 bytes/s=17.5188G/s
avx512/f32/interleaving/min_time:60.000/real_time                48417119 ns     48317695 ns         1735 bytes/s=22.1769G/s

When running on multiple cores, we’re only as fast as we are lucky! Performance depends on core affinity to the target memory region, which is controlled by the Linux kernel.

Benchmark                                                               Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------------------
avx512/f32/streamed/std::threads/min_time:60.000/real_time       10556889 ns     10169673 ns         7938 bytes/s=101.71G/s
avx512/f32/interleaving/std::threads/min_time:60.000/real_time   10411117 ns     10031164 ns         8108 bytes/s=103.134G/s
avx512/f32/unrolled/std::threads/min_time:60.000/real_time       10510807 ns     10123395 ns         8049 bytes/s=102.156G/s

The next addition to ParallelReductionsBenchmark will likely be a lower-level thread pool, replacing std::thread with the POSIX API to control core affinity. I’m busy with something else right now, so if you want to collaborate, that task is up for grabs along with a few other “good first issues” 🤗
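
For the curious, pinning a worker to a specific core on Linux could look roughly like this. It’s a minimal sketch around the pthread_setaffinity_np extension, with a hypothetical pin_to_core helper, and not the thread pool planned for the benchmark:

#include <pthread.h> // pthread_setaffinity_np: GNU extension, defined by g++ on Linux
#include <sched.h>   // cpu_set_t, CPU_ZERO, CPU_SET
#include <thread>

// Pin the given std::thread to a single logical core; returns true on success.
bool pin_to_core(std::thread &worker, int core_id) noexcept {
    cpu_set_t cpu_set;
    CPU_ZERO(&cpu_set);
    CPU_SET(core_id, &cpu_set);
    return pthread_setaffinity_np(worker.native_handle(), sizeof(cpu_set_t), &cpu_set) == 0;
}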

Final Remarks

It’s uncommon to find operations as basic as addition where we can hide latency like this. It’s more common with higher-level operations involving complex logic that can be implemented in multiple ways. My favorite example is Pete Cawley’s CRC32 implementation, which combines the hardware-accelerated CRC instruction (issued via _mm_crc32_u64) with the carryless multiply instruction (issued via _mm_clmulepi64_si128). I’ve just linked it from my less_slow.cpp tutorial, and I recommend checking it out!