Running LLMs in production is expensive. A single high-end GPU costs tens of thousands of dollars and can serve perhaps 50-100 concurrent users, depending on the model and latency requirements. At scale, inference costs dominate AI budgets, which makes understanding optimization techniques essential.
Understanding the Bottlenecks
LLM inference has two distinct phases with different characteristics.
Prefill, which processes the prompt, is compute-bound. The model processes all input tokens in parallel. Larger prompts mean longer prefill. This is where raw GPU compute matters most.
Decode, which generates output, is memory-bandwidth-bound. The model generates one token at a time, loading the full model weights for each token. Memory bandwidth matters more than compute here.
Most optimization efforts focus on decode because that's where users wait. But prefill optimization matters for long-context applications.
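The decode ceiling falls out of simple arithmetic: every generated token must stream the full model weights through memory, so bandwidth divided by model size bounds single-stream speed. The numbers below are illustrative assumptions, not measurements of any specific GPU.

```python
# Back-of-envelope decode ceiling: each generated token streams all model
# weights through memory, so bandwidth bounds tokens/sec.

def decode_tokens_per_sec(params_billion: float, bytes_per_param: float,
                          bandwidth_gb_s: float) -> float:
    """Upper bound on single-stream decode speed (ignores KV cache traffic)."""
    model_gb = params_billion * bytes_per_param
    return bandwidth_gb_s / model_gb

# A 70B FP16 model on a GPU with ~2000 GB/s of memory bandwidth:
print(round(decode_tokens_per_sec(70, 2.0, 2000), 1))  # 14.3 tokens/sec, at best
```

This is why decode optimizations center on moving fewer bytes per token (quantization, batching) rather than on raw FLOPs.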
Quantization
Full FP16 models use 2 bytes per parameter. A 70B model needs 140GB just for weights, requiring multiple GPUs. Quantization compresses this.
INT8 quantization uses 1 byte per parameter for a 50% memory reduction. Quality loss is minimal for most applications, making it a generally safe default.
INT4 and FP4 quantization use 0.5 bytes per parameter for 75% reduction. Quality loss is noticeable but acceptable for many use cases. Testing on specific workloads is important.
Quantization methods such as GPTQ, AWQ, and GGML take different approaches with different tradeoffs; the best choice depends on the deployment target.
The practical impact is that INT8 quantization can enable serving a 70B model on a single 80GB GPU instead of two.
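The arithmetic behind that claim is straightforward; the helper below counts weight memory only, ignoring KV cache and activation overhead:

```python
def weight_memory_gb(params_billion: float, bits: int) -> float:
    """Memory for weights alone, excluding KV cache and activations."""
    return params_billion * bits / 8

for bits in (16, 8, 4):
    print(f"70B at {bits}-bit: {weight_memory_gb(70, bits):.0f} GB")
# 16-bit: 140 GB (two 80GB GPUs), 8-bit: 70 GB (one), 4-bit: 35 GB
```

In practice you also need headroom for KV cache, so a 70 GB INT8 model on an 80 GB GPU is a tight but workable fit.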
Batching
Single-request inference wastes GPU capacity. Batching multiple requests amortizes the memory bandwidth cost of loading weights.
Static batching waits for N requests and processes them together. It's simple but adds latency for early arrivals.
Dynamic batching processes requests as they arrive, up to a maximum batch size. It has a better latency profile but is more complex to implement.
Continuous batching doesn't wait for all requests in a batch to complete. When one finishes, a new request joins. This is the current state of the art and can improve throughput dramatically.
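The difference can be sketched with a toy scheduler. The request tuples and the one-token-per-step loop below are illustrative stand-ins, not a real serving engine's API:

```python
import collections

def continuous_batch(requests, max_batch):
    """requests: list of (request_id, tokens_to_generate). Returns finish order."""
    queue = collections.deque(requests)
    active = {}                       # request_id -> tokens still to generate
    finished = []
    while queue or active:
        # Fill any free slot immediately instead of waiting for the whole
        # batch to drain -- this is the core of continuous batching.
        while queue and len(active) < max_batch:
            rid, n = queue.popleft()
            active[rid] = n
        # One decode step generates one token for every active sequence.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]
                finished.append(rid)
    return finished

# "b" finishes after one step, freeing its slot for "c" right away:
print(continuous_batch([("a", 3), ("b", 1), ("c", 2), ("d", 2)], max_batch=2))
```

With static batching, "c" and "d" would have waited for both "a" and "b" to finish; here a slot is reused the moment it frees up.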
KV Cache Management
The key-value cache stores attention computations from previous tokens. It's essential for efficient autoregressive generation but grows with sequence length and batch size.
PagedAttention manages KV cache like virtual memory, allocating pages on demand. This eliminates memory fragmentation and enables larger batch sizes.
For applications where requests share common prefixes like system prompts or few-shot examples, caching the KV state for the prefix eliminates redundant computation.
Storing KV cache in lower precision reduces memory pressure with minimal quality impact.
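To see why cache management matters, it helps to size the cache: each token stores a key and a value vector per layer per KV head. The configuration below is an illustrative GQA setup, not any specific model's:

```python
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """KV cache size: 2 (K and V) * layers * kv_heads * head_dim per token."""
    elems = 2 * layers * kv_heads * head_dim * seq_len * batch
    return elems * bytes_per_elem / 1e9

# 80 layers, 8 KV heads of dim 128, 4096-token context, batch of 8:
print(round(kv_cache_gb(80, 8, 128, 4096, 8), 2))     # FP16: 10.74 GB
print(round(kv_cache_gb(80, 8, 128, 4096, 8, 1), 2))  # 8-bit cache halves it: 5.37 GB
```

At these sizes, fragmentation or over-allocation directly costs batch slots, which is the problem PagedAttention's on-demand paging addresses.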
Speculative Decoding
The decode phase generates one token at a time, which underutilizes the GPU. Speculative decoding uses a smaller draft model to predict multiple tokens ahead, then verifies them with the full model in parallel.
When the predictions are correct, multiple tokens are accepted for the compute cost of one verification pass. When a prediction is wrong, the full model's token replaces it at that position and normal decoding resumes from there.
Speedups of 2-3x are realistic for tasks where the draft model predicts well. For highly creative tasks, gains are smaller.
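The accept/reject loop for greedy decoding can be sketched as follows. Here `draft()` and `verify()` are hypothetical stand-ins for the small draft model and the full model, and the toy token lists are made up for illustration:

```python
def speculative_step(prefix, draft, verify, k=4):
    """Draft k tokens, check them in one verification pass, keep the agreeing prefix."""
    drafted = draft(prefix, k)        # k cheap guesses from the small model
    target = verify(prefix, drafted)  # full model's token at each position, one pass
    accepted = []
    for d, t in zip(drafted, target):
        if d == t:
            accepted.append(d)        # guess confirmed, keep going
        else:
            accepted.append(t)        # mismatch: take the full model's token and stop
            break
    return prefix + accepted

# Toy models: the full model disagrees with the draft at position 3.
draft = lambda p, k: list("abcd")[:k]
full = lambda p, d: list("abXd")[:len(d)]
out = speculative_step(list("hi"), draft, full)
print(out)  # ['h', 'i', 'a', 'b', 'X'] -> 3 tokens from one verification pass
```

The speedup comes entirely from the acceptance rate: a draft that agrees often amortizes the verification pass over several tokens, which is why predictable text benefits more than highly creative generation.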
Model Parallelism
For models that don't fit on a single GPU, parallelism strategies are needed.
Tensor parallelism splits each layer across GPUs. It requires a fast interconnect and is good for inference because it reduces per-GPU memory and enables larger batch sizes.
Pipeline parallelism puts different layers on different GPUs. It works over slower interconnects but requires careful micro-batching.
The right choice depends on available interconnect and model characteristics.
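Tensor parallelism in miniature: split a weight matrix column-wise across "GPUs" (here, plain list shards), let each compute its output slice independently, then concatenate. The matrices are toy values chosen for illustration:

```python
def matmul(A, B):
    """Plain row-by-column matrix multiply on nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

x = [[1.0, 2.0, 3.0, 4.0]]                                     # activations
W = [[1, 0, 2, 0], [0, 1, 0, 2], [1, 1, 0, 0], [0, 0, 1, 1]]   # full weights

# Column-parallel split: "GPU 0" holds the first two columns, "GPU 1" the rest.
W0 = [row[:2] for row in W]
W1 = [row[2:] for row in W]
y0 = matmul(x, W0)       # each shard's matmul needs no communication
y1 = matmul(x, W1)
y = [y0[0] + y1[0]]      # an all-gather concatenates the output slices

assert y == matmul(x, W)  # identical to the unsharded computation
```

The communication cost lives in that final gather (and in corresponding reductions for row-parallel splits), which is why tensor parallelism wants a fast interconnect.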
Request Routing
At scale, multiple model replicas are needed. Round-robin is simple but ignores that some requests take longer than others. Least-connections routes to the replica with fewest active requests for better load distribution. Model-aware routing considers which models are loaded where to avoid cold-start latency.
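A least-connections router fits in a few lines. The replica names and acquire/release interface below are hypothetical, for illustration only:

```python
class Router:
    """Route each request to the replica with the fewest in-flight requests."""

    def __init__(self, replicas):
        self.active = {r: 0 for r in replicas}  # replica -> in-flight count

    def acquire(self):
        replica = min(self.active, key=self.active.get)  # ties: first listed
        self.active[replica] += 1
        return replica

    def release(self, replica):
        self.active[replica] -= 1

router = Router(["gpu-0", "gpu-1"])
a = router.acquire()   # gpu-0 (tie broken by order)
b = router.acquire()   # gpu-1
router.release(a)      # gpu-0's request finishes
c = router.acquire()   # gpu-0 again: it now has the fewest active requests
```

Round-robin would have sent the third request to gpu-0 regardless of load; least-connections reacts to the fact that gpu-1 is still busy.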
Caching
Not every request needs to hit the model. Semantic caching checks embedding similarity against previously served requests and returns a cached response when the match is close enough; this works well for FAQ-like queries. Exact caching hashes the prompt and returns the stored response only on an identical match.
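An exact-match cache is the simpler of the two and can be sketched with a hash of the normalized prompt. The dict-backed cache and the `generate` callable are illustrative; a real deployment would use a shared store with TTLs:

```python
import hashlib

cache = {}

def cached_generate(prompt, generate):
    """Return a cached response for a previously seen (normalized) prompt."""
    key = hashlib.sha256(prompt.strip().lower().encode()).hexdigest()
    if key not in cache:
        cache[key] = generate(prompt)  # only hit the model on a miss
    return cache[key]

calls = []
fake_model = lambda p: calls.append(p) or f"answer:{p}"  # records each model call
cached_generate("What is RAG?", fake_model)
cached_generate("what is RAG?  ", fake_model)  # normalizes to the same key
print(len(calls))  # 1 -- the second request never reached the model
```

Semantic caching replaces the hash lookup with a nearest-neighbor search over prompt embeddings, trading exactness for a higher hit rate.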
Response streaming sends tokens as they're generated. Perceived latency is much lower even if total time is unchanged.
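In code, streaming is just yielding tokens as they are produced instead of collecting them first; a minimal sketch:

```python
import time

def stream(tokens, delay=0.0):
    """Yield tokens one at a time; the client can render each immediately."""
    for tok in tokens:
        time.sleep(delay)  # stands in for per-token decode time
        yield tok

# The first chunk reaches the caller before generation finishes:
for chunk in stream(["Hel", "lo", "!"]):
    print(chunk, end="", flush=True)
```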
What to Measure
Time to first token measures prefill latency. Users notice delays before the response starts. Inter-token latency is the time between tokens and affects streaming smoothness. Tokens per second per user reflects user-perceived throughput. Tokens per second per GPU is the cost efficiency metric. And P99 latencies matter for user experience more than averages.
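The per-request metrics above fall out of per-token timestamps. The timestamps below are made-up numbers for illustration:

```python
def latency_metrics(request_start, token_times):
    """Derive TTFT, mean inter-token latency, and throughput from timestamps (s)."""
    ttft = token_times[0] - request_start
    itl = [b - a for a, b in zip(token_times, token_times[1:])]
    total = token_times[-1] - request_start
    return {
        "time_to_first_token_s": ttft,
        "mean_inter_token_latency_s": sum(itl) / len(itl),
        "tokens_per_sec": len(token_times) / total,
    }

m = latency_metrics(0.0, [0.40, 0.45, 0.50, 0.55, 0.60])
print(m)  # TTFT 0.4s, mean ITL ~0.05s, 5 tokens in 0.6s -> ~8.3 tok/s
```

For P99 reporting, collect these per request and take percentiles over the distribution rather than averaging.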
Conclusion
LLM inference optimization is a stack where quantization reduces memory requirements, batching improves utilization, KV cache management enables scale, and speculative decoding accelerates generation. Each layer compounds, and a well-optimized stack can serve significantly more users per GPU.