The Inference Challenge
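Serving large language models at scale presents challenges that training does not. Training is largely compute bound, but token-by-token inference is usually memory-bandwidth bound: every decoding step has to read the model weights from memory to produce a single new token per sequence. Generation is also autoregressive, so each token depends on all previous tokens and the output cannot be computed in parallel.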
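A rough back-of-the-envelope calculation shows why memory bandwidth matters so much. The sketch below assumes that each decoding step reads every weight once and ignores compute, KV cache traffic, and batching. The model size and bandwidth figures are hypothetical round numbers, not measurements of any particular GPU.

```python
def min_ms_per_token(num_params: float, bytes_per_param: float,
                     bandwidth_bytes_per_s: float) -> float:
    """Lower bound on per-token latency if every weight is read once per step."""
    weight_bytes = num_params * bytes_per_param
    return weight_bytes / bandwidth_bytes_per_s * 1000.0


# Hypothetical example: 7B parameters in FP16 (2 bytes each), 1 TB/s of bandwidth.
latency = min_ms_per_token(7e9, 2, 1e12)
print(f"{latency:.1f} ms")  # prints 14.0 ms
```

Under these assumptions a single sequence cannot decode faster than about 14 ms per token, no matter how much compute is available. The weights read in one step can be shared by every sequence in a batch, which is why batching helps throughput.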
KV Cache
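The key-value (KV) cache stores the attention keys and values of previous tokens so they are not recomputed at every generation step. Without it, each forward pass would recompute the keys and values for the entire context. With it, only the newest token's key and value are computed and appended.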
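To make the saving concrete, the following sketch counts how many per-token key/value projections one attention layer performs while generating a sequence, with and without a cache. The function name and the example sizes are illustrative assumptions. The count ignores everything else in the forward pass.

```python
def kv_projections(prompt_len: int, new_tokens: int, use_cache: bool) -> int:
    """Count per-token key/value projections in one layer during generation."""
    total = 0
    for step in range(new_tokens):
        context_len = prompt_len + step  # tokens visible at this step
        if use_cache and step > 0:
            total += 1            # only the newest token is projected
        else:
            total += context_len  # the whole context is projected
    return total


print(kv_projections(1000, 100, use_cache=True))   # prints 1099
print(kv_projections(1000, 100, use_cache=False))  # prints 104950
```

With a cache the work grows linearly with the number of generated tokens. Without one it grows roughly with the product of context length and generated tokens.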
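import torch
from torch import Tensor


class KVCache:
    def __init__(self, max_seq_len: int, num_heads: int, head_dim: int):
        # One slot per token position: (max_seq_len, num_heads, head_dim).
        self.k_cache = torch.zeros(max_seq_len, num_heads, head_dim)
        self.v_cache = torch.zeros(max_seq_len, num_heads, head_dim)
        self.current_len = 0  # number of positions filled so far

    def update(self, k: Tensor, v: Tensor) -> tuple[Tensor, Tensor]:
        # k and v hold the newest token's keys and values: (num_heads, head_dim).
        if self.current_len >= self.k_cache.shape[0]:
            raise ValueError("KV cache is full")
        self.k_cache[self.current_len] = k
        self.v_cache[self.current_len] = v
        self.current_len += 1
        # Return views over the filled prefix for use in attention.
        return self.k_cache[:self.current_len], self.v_cache[:self.current_len]

Speculative Decoding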
Use a small draft model to propose several candidate tokens, then verify them all with the large model in a single parallel forward pass. Each accepted draft token saves one large-model pass, so when the draft is accurate we get multiple tokens per large-model forward pass.
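The propose-then-verify loop can be sketched with toy models. This is a minimal sketch of the greedy variant only, in which a draft token is accepted when it equals the large model's own greedy choice. Sampling-based variants use a probabilistic acceptance rule instead. Both `target_model` and `draft_model` are hypothetical stand-ins defined here for illustration, and the "parallel" verification is simulated with a loop that is counted as one pass.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # maps a token prefix to its greedy next token


def target_model(tokens: List[int]) -> int:
    # Hypothetical "large" model: a deterministic toy rule.
    return (tokens[-1] * 3 + 1) % 10


def draft_model(tokens: List[int]) -> int:
    # Hypothetical "small" model: agrees with the target except after token 3.
    return 5 if tokens[-1] == 3 else (tokens[-1] * 3 + 1) % 10


def speculative_decode(prompt: List[int], num_new: int, k: int,
                       draft: Model, target: Model) -> Tuple[List[int], int]:
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < num_new:
        # 1. The draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target scores all k + 1 positions. A real system does this
        #    in one batched forward pass; here it is simulated with a loop.
        target_passes += 1
        verified = [target(tokens + proposal[:i]) for i in range(k + 1)]
        # 3. Accept the longest prefix on which draft and target agree.
        n_accept = 0
        while n_accept < k and proposal[n_accept] == verified[n_accept]:
            n_accept += 1
        # 4. Add the target's own token at the first mismatch, or a bonus
        #    token if every draft token was accepted.
        tokens += proposal[:n_accept] + [verified[n_accept]]
    return tokens[:len(prompt) + num_new], target_passes


output, passes = speculative_decode([1], num_new=12, k=4,
                                    draft=draft_model, target=target_model)
```

Because every emitted token is either confirmed or supplied by the target model, the output of this greedy variant matches what the target would have generated on its own, using fewer target passes than tokens.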
Continuous Batching
Instead of static batching, where all sequences in a batch must complete before the next batch starts, continuous batching lets new requests join mid-generation as soon as a finished sequence frees a slot. This dramatically improves GPU utilization when output lengths vary.
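A small simulation illustrates the difference. This is a sketch under simplifying assumptions: all requests are already waiting at the start, each request needs a known number of decode steps (at least one), each step costs the same regardless of batch occupancy, and prefill and memory limits are ignored. The function names and request lengths are made up for illustration.

```python
from collections import deque
from typing import List


def static_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Free slots are refilled from the queue before every decode step."""
    waiting = deque(lengths)
    running: List[int] = []  # remaining tokens for each active sequence
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Every active sequence emits one token; finished ones leave the batch.
        running = [r - 1 for r in running if r > 1]
    return steps


lengths = [2, 8, 2, 2, 8, 2, 2, 2]  # hypothetical output lengths in tokens
print(static_batching_steps(lengths, batch_size=2))      # prints 20
print(continuous_batching_steps(lengths, batch_size=2))  # prints 14
```

In the static case, short sequences sit idle while the long sequence in their batch finishes. In the continuous case both slots stay busy for the whole run.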
Quantization
Reducing model weights from FP16 to INT8 or INT4 roughly halves or quarters the memory needed for the weights, respectively. Techniques like GPTQ and AWQ aim to keep the resulting quality degradation minimal.
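The basic idea of mapping floating-point weights to 8-bit integers can be sketched with plain round-to-nearest quantization. Note that this is not GPTQ or AWQ, which choose the quantized values more carefully. It is the simplest symmetric scheme, with one scale per row of weights, and the function names are illustrative.

```python
from typing import List, Tuple


def quantize_int8(row: List[float]) -> Tuple[List[int], float]:
    """Symmetric round-to-nearest quantization of one weight row to INT8."""
    absmax = max(abs(w) for w in row)
    scale = absmax / 127 if absmax > 0 else 1.0
    # Clamp to [-127, 127] so every value fits in a signed 8-bit integer.
    q = [max(-127, min(127, round(w / scale))) for w in row]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Recover approximate floating-point weights from INT8 values."""
    return [v * scale for v in q]


row = [0.5, -1.0, 0.25, 0.0]  # hypothetical weights
q, scale = quantize_int8(row)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(row, restored))
```

Each weight is stored in one byte instead of two, plus one scale per row, and the rounding error per weight is at most half the scale. A single large outlier in a row inflates the scale and hence the error on every other weight in it, which is one reason more careful methods exist.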
Conclusion
LLM inference optimization is a rapidly evolving field. Combined, KV caching, speculative decoding, and continuous batching can achieve 10–100x improvements in throughput over naive implementations, depending on the model, hardware, and workload.