SIMD Integer Arithmetic

Feb 8, 2026 · updated May 22, 2026 · 🌿 growing

The problem

Modern x86 SIMD (SSE/AVX) has rich support for vectorized add, subtract, and multiply, but no vector integer division instruction. Division is the expensive odd-one-out: high latency, low throughput, and scalar-only.

Strategies

Multiply by reciprocal — for division by a known constant, precompute a fixed-point reciprocal and replace x / d with a multiply + shift. See Barrett reduction.
Limb-based long division — split 64-bit operands into 32-bit limbs and run Knuth Algorithm D across lanes.
Newton–Raphson reciprocal — iteratively refine a reciprocal estimate; effective when a floating-point unit is available.

Open questions

When does lane divergence (different limb counts) erase the SIMD win?
AVX-512 IFMA for wider intermediate products — worth it?

Related project: Vectorizing 64-bit Integer Division.

SIMDInteger arithmetic