LLM Inference Optimization
The two regimes
LLM inference splits into two very different phases:
- Prefill — process the prompt; compute-bound, highly parallel.
- Decode — generate tokens one at a time; memory-bandwidth-bound, because every step re-reads the model weights and the growing KV cache to produce just one token per sequence.
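The gap between the two phases can be made concrete with a rough roofline estimate. The sketch below counts FLOPs and bytes for a single weight matmul only; it ignores attention and the KV cache. The hardware figures, the layer width, and the function name `matmul_intensity` are illustrative assumptions, not measurements of any particular device or model.

```python
def matmul_intensity(n_tokens: int, d_in: int, d_out: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte moved for an (n_tokens x d_in) @ (d_in x d_out) matmul."""
    flops = 2 * n_tokens * d_in * d_out                      # one multiply and one add per MAC
    weight_bytes = d_in * d_out * bytes_per_elem             # weights are read once
    act_bytes = n_tokens * (d_in + d_out) * bytes_per_elem   # inputs read, outputs written
    return flops / (weight_bytes + act_bytes)


# Assumed, illustrative accelerator numbers (not a specific product).
PEAK_FLOPS = 300e12       # FLOP/s
PEAK_BANDWIDTH = 2e12     # bytes/s
RIDGE = PEAK_FLOPS / PEAK_BANDWIDTH   # FLOPs/byte needed to become compute-bound

D = 4096                  # assumed hidden size
prefill = matmul_intensity(n_tokens=2048, d_in=D, d_out=D)   # whole prompt at once
decode = matmul_intensity(n_tokens=1, d_in=D, d_out=D)       # one token, batch size 1

print(f"ridge point: {RIDGE:.0f} FLOPs/byte")
print(f"prefill:     {prefill:.1f} FLOPs/byte")
print(f"decode:      {decode:.3f} FLOPs/byte")
```

Under these assumptions the prefill matmul sits well above the ridge point and the decode matmul far below it, which is why batching more sequences into each decode step (raising `n_tokens`) is the usual way to recover utilization.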
Levers
- KV cache management — paged attention avoids fragmentation and lets you pack more concurrent sequences into GPU memory.
- Speculative decoding — a small draft model proposes several tokens that the large model verifies in one batched step, amortizing memory traffic.
- Quantization — lower-precision weights/activations cut the bytes moved per token, easing the memory-bandwidth bottleneck in decode.
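As one illustration of these levers, here is a minimal sketch of the greedy variant of speculative decoding. The `target` and `draft` functions are toy stand-ins invented for this example (real models return logits and would be verified in a single batched forward pass); the loop below only emulates that batching. Sampling-based variants use a rejection-sampling acceptance rule instead of the exact-match test shown here.

```python
from typing import Callable, List

Model = Callable[[List[int]], int]   # maps a token prefix to its greedy next token


def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    """Plain autoregressive decoding: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n_new: int, k: int = 4) -> List[int]:
    tokens = list(prompt)
    goal = len(prompt) + n_new
    while len(tokens) < goal:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target scores all k + 1 positions. A real system does this
        #    in one batched forward pass; this loop only emulates it.
        verified = [target(tokens + proposal[:i]) for i in range(k + 1)]
        # 3. Accept the longest prefix on which draft and target agree, then
        #    append the target's own token at the first mismatch (or the
        #    extra token after the proposal if everything was accepted).
        n_accept = 0
        while n_accept < k and proposal[n_accept] == verified[n_accept]:
            n_accept += 1
        tokens.extend(proposal[:n_accept])
        tokens.append(verified[n_accept])
    return tokens[:goal]


# Toy stand-ins (hypothetical): the draft agrees with the target most of the time.
def target(tokens: List[int]) -> int:
    return (3 * tokens[-1] + len(tokens)) % 11


def draft(tokens: List[int]) -> int:
    guess = target(tokens)
    return (guess + 1) % 11 if len(tokens) % 5 == 0 else guess


reference = greedy_decode(target, [1, 2, 3], n_new=20)
speculated = speculative_decode(target, draft, [1, 2, 3], n_new=20, k=4)
print(speculated == reference)
```

The output matches plain greedy decoding of the target regardless of how good the draft is; draft quality only changes how many tokens are accepted per verification step, and therefore how many times the large model's weights must be read.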
Why it connects to the rest of this site
Decode performance is fundamentally an integer/low-precision arithmetic and memory-bandwidth problem — the same themes as SIMD integer arithmetic.
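To make the link to integer arithmetic concrete, the sketch below shows symmetric per-tensor int8 quantization followed by a dot product accumulated in integers. The scheme, the helper names `quantize_int8` and `int8_dot`, and the sample values are assumptions chosen for illustration; production kernels typically use per-channel or per-group scales and SIMD or tensor-core instructions rather than a Python loop.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization: value is approximated by scale * q."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def int8_dot(qa: List[int], scale_a: float, qb: List[int], scale_b: float) -> float:
    """Integer multiply-accumulate, then a single floating-point rescale."""
    acc = sum(x * y for x, y in zip(qa, qb))   # int32 accumulator on real hardware
    return acc * scale_a * scale_b


weights = [0.5, -1.0, 0.25, 0.75]       # illustrative values
activations = [1.0, 2.0, -0.5, 0.0]

qw, sw = quantize_int8(weights)
qa, sa = quantize_int8(activations)

exact = sum(w * a for w, a in zip(weights, activations))
approx = int8_dot(qw, sw, qa, sa)
print(f"exact={exact:.4f} approx={approx:.4f}")
```

Each int8 value occupies one byte instead of two for fp16, so the same dot product reads half as many weight bytes, and the inner loop is exactly the multiply-accumulate pattern that SIMD integer instructions target.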
To explore
- Continuous batching scheduling policies.
- The arithmetic intensity crossover between prefill and decode.
Tags: LLM inference, Systems, Performance