LLM Inference Optimization

May 14, 2026 · 🌿 growing

The two regimes

LLM inference splits into two very different phases:

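Prefill: process the whole prompt in one pass; compute-bound, highly parallel across prompt tokens.
Decode: generate tokens one at a time; memory-bandwidth-bound, because every step re-reads the model weights and the growing KV cache to produce a single token per sequence.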
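
A rough roofline estimate makes the split concrete. The sketch below counts FLOPs and bytes for a single square linear layer and compares the ratio against a hardware ridge point. The accelerator numbers are hypothetical round figures, not a real part, and the model ignores attention and KV-cache traffic entirely, so treat it as an illustration of the trend rather than a measurement.

```python
def matmul_intensity(n_tokens, d_in, d_out, bytes_per_weight=2, bytes_per_act=2):
    """FLOPs per byte moved for one (n_tokens x d_in) @ (d_in x d_out) layer."""
    flops = 2 * n_tokens * d_in * d_out                # one multiply + one add per weight per token
    weight_bytes = d_in * d_out * bytes_per_weight     # weights are read once per pass
    act_bytes = n_tokens * (d_in + d_out) * bytes_per_act  # read inputs, write outputs
    return flops / (weight_bytes + act_bytes)


# Hypothetical accelerator: round numbers chosen for illustration only.
PEAK_FLOPS = 300e12       # FLOP/s
PEAK_BANDWIDTH = 2e12     # bytes/s
RIDGE = PEAK_FLOPS / PEAK_BANDWIDTH  # intensity needed to become compute-bound


def regime(n_tokens, d=4096):
    """Classify a d x d layer processing n_tokens tokens together."""
    return "compute-bound" if matmul_intensity(n_tokens, d, d) >= RIDGE else "memory-bound"


prefill = matmul_intensity(2048, 4096, 4096)  # 2048 prompt tokens in one pass
decode = matmul_intensity(1, 4096, 4096)      # one new token, batch size 1

# Smallest number of tokens processed together that reaches the ridge point.
crossover = next(n for n in range(1, 4097) if matmul_intensity(n, 4096, 4096) >= RIDGE)

print(f"prefill: {prefill:.1f} FLOPs/byte ({regime(2048)})")
print(f"decode:  {decode:.2f} FLOPs/byte ({regime(1)})")
print(f"crossover at about {crossover} tokens per pass")
```

Here n_tokens is the number of tokens pushed through the layer together: prompt tokens during prefill, or concurrent sequences during decode. Under these assumptions a single-sequence decode step sits near 1 FLOP per byte with 16-bit weights, far below the ridge point, which is why batching more sequences per step helps.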

Levers

KV cache management: paged attention stores the cache in fixed-size blocks, which avoids fragmentation and lets you pack more concurrent sequences into GPU memory.
Speculative decoding: a small draft model proposes several tokens that the large model verifies in one batched step, amortizing memory traffic across every accepted token.
Quantization: lower-precision weights/activations cut the bytes moved per token, easing the memory-bandwidth bottleneck in decode.
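
To make the paged KV cache idea concrete, here is a toy block-table allocator. It only tracks which fixed-size blocks each sequence owns and stores no tensors. The class name, method names, and sizes are all hypothetical, and real serving engines add sharing, eviction, and preemption on top of this.

```python
class PagedKVAllocator:
    """Toy block table: each sequence owns a list of fixed-size physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (physical_block, offset)."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:  # no block yet, or the last one is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; caller must wait or preempt")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=4, block_size=16)
for _ in range(20):
    alloc.append_token("a")  # 20 tokens need 2 blocks
for _ in range(5):
    alloc.append_token("b")  # 5 tokens need 1 block

blocks_in_use = sum(len(t) for t in alloc.block_tables.values())
free_before = len(alloc.free_blocks)
alloc.free("a")
free_after = len(alloc.free_blocks)
```

Because blocks are handed out on demand, the two sequences above occupy three blocks in total. Reserving a contiguous region sized for a maximum length up front would tie up memory that short sequences never use.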
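
The draft-then-verify loop of speculative decoding can be sketched with greedy decoding and toy deterministic "models". Everything here is an assumption for illustration: target_next and draft_next are stand-in functions rather than real networks, and the verification loop only simulates what a real system computes in a single batched forward pass. Sampling variants use a probabilistic acceptance rule that is not shown.

```python
def speculative_step(target_next, draft_next, tokens, k):
    """One greedy speculative step: draft k tokens, keep the prefix the target agrees with."""
    # The cheap draft model proposes k tokens autoregressively.
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(tokens + proposal))

    # The target predicts the next token after each of the k+1 prefixes.
    # A real system gets all of these from ONE batched forward pass.
    target_preds = [target_next(tokens + proposal[:i]) for i in range(k + 1)]

    accepted = []
    for i in range(k):
        if proposal[i] != target_preds[i]:
            break
        accepted.append(proposal[i])

    # Always gain one target token: the correction at the first mismatch,
    # or a bonus token when every draft token was accepted.
    accepted.append(target_preds[len(accepted)])
    return tokens + accepted


# Toy stand-ins (hypothetical): the draft agrees with the target most of the time.
def target_next(seq):
    return (3 * seq[-1] + 1) % 17


def draft_next(seq):
    guess = (3 * seq[-1] + 1) % 17
    return guess if seq[-1] % 5 else (guess + 1) % 17


def greedy(tokens, n_new):
    out = list(tokens)
    for _ in range(n_new):
        out.append(target_next(out))
    return out


def speculative(tokens, n_new, k=4):
    out, steps = list(tokens), 0
    while len(out) < len(tokens) + n_new:
        out = speculative_step(target_next, draft_next, out, k)
        steps += 1
    return out[: len(tokens) + n_new], steps


spec_tokens, target_passes = speculative([2], 20)
```

With greedy verification the output matches plain greedy decoding from the target, while the number of target passes drops whenever the draft guesses correctly. Each pass reads the large model's weights once, so fewer passes means less memory traffic per generated token.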

Why it connects to the rest of this site

Decode performance is fundamentally an integer/low-precision arithmetic and memory-bandwidth problem, the same themes that run through SIMD integer arithmetic.
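
A small sketch of that link, assuming symmetric per-tensor int8 quantization (one of several possible schemes, chosen here for brevity): values are stored as 8-bit integers plus one scale, and the dot product accumulates in integers before a single rescale. The function names and sample values are made up for illustration.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: x is approximated by scale * q, q in [-127, 127]."""
    max_abs = max(abs(v) for v in values) or 1.0  # avoid a zero scale for all-zero input
    scale = max_abs / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def int8_dot(qa, scale_a, qb, scale_b):
    """Integer multiply-accumulate, then one floating-point rescale at the end."""
    acc = sum(x * y for x, y in zip(qa, qb))  # stays in integers, like a wide accumulator
    return acc * scale_a * scale_b


weights = [0.12, -0.5, 0.33, 0.9]
acts = [1.0, 0.25, -0.75, 0.5]

exact = sum(w * a for w, a in zip(weights, acts))
qw, sw = quantize_int8(weights)
qa, sa = quantize_int8(acts)
approx = int8_dot(qw, sw, qa, sa)

print(f"exact={exact:.4f} approx={approx:.4f}")
```

Each stored value takes one byte instead of two for a 16-bit float, so the same memory bandwidth delivers twice as many weights. The integer multiply-accumulate in the inner loop is the kind of operation that SIMD integer dot-product instructions are built to speed up.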

To explore

Continuous batching scheduling policies.
The arithmetic intensity crossover between prefill and decode.

LLM inference · Systems · Performance