Vectorizing 64-bit Integer Division
Summary
Hardware 64-bit integer division is slow and not vectorized on most x86 CPUs.
This project explores emulating u64 / u64 using 32-bit SIMD lanes so that many
divisions can be processed in parallel.
Motivation
Many workloads — hashing, modular reduction, fixed-point math — are bottlenecked
on integer division. The scalar div instruction has high latency and blocks
the pipeline. If we can express division in terms of multiplies and shifts, we
can vectorize it across AVX2/AVX-512 lanes.
Technical Details
Three approaches were evaluated:
- Knuth Algorithm D — schoolbook long division on 32-bit limbs.
- Barrett reduction — replace division by a constant with a multiply by a precomputed reciprocal.
- SIMD 32-bit lanes — split 64-bit operands across lanes and recombine.
See the research notes on SIMD integer arithmetic and Knuth Algorithm D for the underlying math.
Results
| Approach | Speedup |
|---|---|
| Scalar baseline | 1.0× |
| Knuth Algorithm D | 1.6× |
| Barrett reduction | 2.1× |
| SIMD 32-bit lanes | 2.4× |
Full benchmark history lives on the data page.
Lessons Learned
- Correctness corner cases (overflow, divide-by-zero lanes) dominate the effort.
- The win depends heavily on operand distribution; constant divisors favor Barrett.
Links
- Research: SIMD integer arithmetic
- Benchmarks: Data
SIMDInteger arithmeticPerformance