← All projects

Vectorizing 64-bit Integer Division

Summary

Hardware 64-bit integer division is slow and not vectorized on most x86 CPUs. This project explores emulating u64 / u64 using 32-bit SIMD lanes so that many divisions can be processed in parallel.

Motivation

Many workloads — hashing, modular reduction, fixed-point math — are bottlenecked on integer division. The scalar div instruction has high latency and blocks the pipeline. If we can express division in terms of multiplies and shifts, we can vectorize it across AVX2/AVX-512 lanes.

Technical Details

Three approaches were evaluated:

  1. Knuth Algorithm D — schoolbook long division on 32-bit limbs.
  2. Barrett reduction — replace division by a constant with a multiply by a precomputed reciprocal.
  3. SIMD 32-bit lanes — split 64-bit operands across lanes and recombine.

See the research notes on SIMD integer arithmetic and Knuth Algorithm D for the underlying math.

Results

ApproachSpeedup
Scalar baseline1.0×
Knuth Algorithm D1.6×
Barrett reduction2.1×
SIMD 32-bit lanes2.4×

Full benchmark history lives on the data page.

Lessons Learned

SIMDInteger arithmeticPerformance