SIMD Integer Arithmetic
The problem
Modern x86 SIMD (SSE/AVX) has rich support for vectorized add, subtract, and multiply, but no vector integer division instruction. Division is the expensive odd-one-out: high latency, low throughput, and scalar-only.
Strategies
- Multiply by reciprocal — for division by a known constant, precompute a
fixed-point reciprocal and replace
x / dwith a multiply + shift. See Barrett reduction. - Limb-based long division — split 64-bit operands into 32-bit limbs and run Knuth Algorithm D across lanes.
- Newton–Raphson reciprocal — iteratively refine a reciprocal estimate; effective when a floating-point unit is available.
Open questions
- When does lane divergence (different limb counts) erase the SIMD win?
- AVX-512
IFMAfor wider intermediate products — worth it?
Related project: Vectorizing 64-bit Integer Division.
SIMDInteger arithmetic