Vectorizing 64-bit Integer Division

May 22, 2026 · active

Summary

Hardware 64-bit integer division is slow and not vectorized on most x86 CPUs. This project explores emulating u64 / u64 using 32-bit SIMD lanes so that many divisions can be processed in parallel.

Motivation

Many workloads — hashing, modular reduction, fixed-point math — are bottlenecked on integer division. The scalar div instruction has high latency and blocks the pipeline. If we can express division in terms of multiplies and shifts, we can vectorize it across AVX2/AVX-512 lanes.

Technical Details

Three approaches were evaluated:

Knuth Algorithm D — schoolbook long division on 32-bit limbs.
Barrett reduction — replace division by a constant with a multiply by a precomputed reciprocal.
SIMD 32-bit lanes — split 64-bit operands across lanes and recombine.

See the research notes on SIMD integer arithmetic and Knuth Algorithm D for the underlying math.

Results

Approach	Speedup
Scalar baseline	1.0×
Knuth Algorithm D	1.6×
Barrett reduction	2.1×
SIMD 32-bit lanes	2.4×

Full benchmark history lives on the data page.

Lessons Learned