Infrastructure · Feb 26, 2026 · 8 min read

Building GPU Clusters for AI Inference

The explosion of large language models has fundamentally changed how we think about GPU infrastructure. What was once a niche concern for researchers is now a critical business consideration. Here's a framework for thinking about GPU cluster design for inference workloads.

Start with the Workload

The most common mistake is buying hardware before understanding workload characteristics. A cluster optimized for training looks very different from one optimized for inference.

For inference, the key questions are: What are the request patterns—real-time chat, batch processing, or both? What latency targets matter—100ms, 500ms, 2 seconds? How many tokens per second at peak? What model sizes are involved?

These answers should drive everything from GPU selection to network topology.
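Those answers translate directly into a sizing estimate. A minimal sketch, with illustrative numbers (the function, the headroom factor, and the per-GPU throughput figure are assumptions for illustration, not benchmarks):

```python
import math

def required_gpus(peak_tokens_per_sec: float,
                  tokens_per_sec_per_gpu: float,
                  headroom: float = 0.7) -> int:
    """GPUs needed to serve a peak token rate, keeping utilization
    below `headroom` so latency targets survive bursts."""
    return math.ceil(peak_tokens_per_sec / (tokens_per_sec_per_gpu * headroom))

# Example: 50k tokens/s at peak, ~2,500 tokens/s per GPU at acceptable latency.
print(required_gpus(50_000, 2_500))  # → 29
```

The headroom factor is the point: sizing to 100% utilization means the first traffic burst blows through your latency target.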

GPU Selection

The instinct is to reach for the latest hardware, but that's not always the right choice. For inference workloads under 13B parameters, older generation cards can be more cost-effective. The advantage of newer GPUs is most pronounced with larger models and when NVLink bandwidth is needed for tensor parallelism.

For high-end deployments, the difference between PCIe and SXM form factors matters enormously. SXM provides NVLink connectivity, essential for multi-GPU inference on large models. PCIe is cheaper but limits cross-GPU communication speeds.

Memory bandwidth is often the bottleneck for inference, not raw compute.
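The bandwidth point can be made concrete with a back-of-envelope: at batch size 1, generating each token reads essentially all the weights once, so decode speed is roughly memory bandwidth divided by weight bytes. A sketch with assumed figures:

```python
def decode_tokens_per_sec(params_billion: float,
                          bytes_per_param: float,
                          mem_bandwidth_gb_s: float) -> float:
    """Upper bound on batch-1 decode speed when weight reads dominate."""
    weight_gb = params_billion * bytes_per_param  # ~GB of weights
    return mem_bandwidth_gb_s / weight_gb

# 13B model at FP16 (2 bytes/param) on a ~2 TB/s card (illustrative spec):
print(round(decode_tokens_per_sec(13, 2, 2000), 1))  # → 76.9
```

No amount of extra FLOPS changes that ceiling; only more bandwidth, smaller weights (quantization), or larger batches do.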

Network Architecture

This is where many designs underinvest. GPU clusters are only as fast as their slowest link.

For multi-node inference with 70B+ models, standard Ethernet often won't suffice for cross-node tensor parallelism. Non-blocking fabric matters because oversubscription kills latency predictability. And default protocol settings are rarely optimal for these workloads.

For single-node inference serving multiple models, sufficient bandwidth prevents bottlenecks during model loading. Network topology affects overall latency more than people expect.
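The cross-node traffic claim can be estimated too. With tensor parallelism, each transformer layer performs roughly two all-reduces over the per-token activations; a ring all-reduce moves about 2·(n−1)/n of the message per GPU. A rough formula (illustrative; real stacks such as NCCL add overheads this ignores):

```python
def allreduce_bytes_per_token(hidden_dim: int, num_layers: int,
                              tp_degree: int, bytes_per_elem: int = 2) -> float:
    """Approximate per-token, per-GPU traffic for tensor parallelism."""
    msg = hidden_dim * bytes_per_elem                  # one token's activation
    per_ar = 2 * (tp_degree - 1) / tp_degree * msg     # ring all-reduce cost
    return per_ar * 2 * num_layers                     # ~2 all-reduces/layer

# 70B-class model (hidden 8192, 80 layers) split across 8 GPUs:
mb = allreduce_bytes_per_token(8192, 80, 8) / 1e6
print(round(mb, 2))  # → 4.59  (MB per generated token, per GPU)
```

Multiply by thousands of tokens per second and the gap between commodity Ethernet and a non-blocking fabric stops being academic.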

Power and Cooling

Modern high-end GPUs draw 700W each. An 8-GPU node can draw 10kW+ including CPUs, memory, and networking. Scale that to 100 nodes and you're looking at megawatt-level requirements.
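The arithmetic is worth checking explicitly (the host-overhead figure below is an assumption, standing in for CPUs, DRAM, NICs, and fans):

```python
gpu_w = 700            # per-GPU draw for a modern high-end part
gpus_per_node = 8
host_overhead_w = 4_400  # assumed: CPUs, memory, networking, cooling fans

node_kw = (gpu_w * gpus_per_node + host_overhead_w) / 1000
print(node_kw)                # → 10.0  (kW per node)
print(node_kw * 100 / 1000)   # → 1.0   (MW for a 100-node cluster)
```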

Air cooling has practical limits around 40kW per rack. Beyond that, direct liquid cooling becomes necessary. High-end GPUs are increasingly designed with liquid cooling in mind.

Power delivery requires careful planning. Redundancy matters because losing power mid-inference drops in-flight batches and pushes every affected request through recovery logic.

Software Stack

Hardware is half the equation. The software stack determines whether you actually utilize the capacity you've built.

Purpose-built inference servers like vLLM or TensorRT-LLM handle many optimizations, such as continuous batching, automatically. Orchestration needs GPU-aware scheduling that understands node topology, not just device counts. Metrics on thermal throttling, memory pressure, and utilization are essential. And load balancing that considers model state and queue depth prevents hot spots.
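That last point can be sketched as a tiny routing policy: prefer replicas that already have the model loaded (warm weights), then the shortest queue. The `Replica` structure and field names here are hypothetical, not any real scheduler's API:

```python
from dataclasses import dataclass, field

@dataclass
class Replica:
    name: str
    loaded_models: set = field(default_factory=set)
    queue_depth: int = 0

def pick_replica(replicas, model: str) -> Replica:
    # Sort key: (model not loaded?, queue depth) — warm replicas with
    # short queues win; cold replicas are a last resort.
    return min(replicas, key=lambda r: (model not in r.loaded_models,
                                        r.queue_depth))

replicas = [
    Replica("gpu-0", {"llama-70b"}, queue_depth=5),
    Replica("gpu-1", {"llama-70b"}, queue_depth=1),
    Replica("gpu-2", set(), queue_depth=0),
]
print(pick_replica(replicas, "llama-70b").name)  # → gpu-1
```

Note that gpu-2 has the emptiest queue but loses anyway: loading tens of gigabytes of weights costs far more than waiting behind one request.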

Cost Optimization

GPU infrastructure is expensive. Optimization should be a first-order concern.

Quantization to INT8 or FP8 can cut memory requirements significantly with minimal quality loss. Dynamic batching can increase throughput substantially. For batch inference, spot instances can cut costs dramatically. And matching GPU capability to actual workload requirements prevents overspending.
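The quantization savings follow directly from bits per parameter. Counting weights only (KV cache and activations add more on top):

```python
def weight_gb(params_billion: float, bits: int) -> float:
    """Approximate weight memory in GB: 1e9 params × (bits/8) bytes."""
    return params_billion * bits / 8

for bits in (16, 8):
    print(f"70B at {bits}-bit: {weight_gb(70, bits):.0f} GB")
# 70B at 16-bit: 140 GB
# 70B at 8-bit: 70 GB
```

Halving the footprint can be the difference between needing two nodes and fitting on one, which compounds with every other cost lever above.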

Conclusion

GPU cluster design for AI inference requires balancing hardware capabilities, software architecture, and operational constraints. Understanding workload deeply and designing accordingly matters more than chasing the latest hardware specs.

Justin Yamini
Founder