The capacity model was right — on paper
An infrastructure team sizes a GPU cluster based on published benchmark throughput. The math works out: N GPUs, each delivering X tokens per second, for a total capacity of N×X. They provision the cluster, deploy the workload, and discover that sustained throughput is 15-20% below the planning number. The cluster is undersized, SLAs are at risk, and the next procurement cycle is months away.
The planning wasn’t careless. The benchmark number was real. The error was structural: the planning used peak performance — the number a GPU delivers at boost clocks, thermal peak, and optimal conditions — as the input to a capacity model that requires steady-state performance. These are different numbers, and the gap between them is where capacity plans go wrong.
Peak performance is not a planning input
We’ve explored in detail how peak and steady-state performance diverge and the physical mechanisms that drive that divergence. For capacity planning purposes, the practical consequence is simple: peak throughput is the wrong number.
Peak throughput describes what the hardware can do briefly — during the thermal boost window, before clock settling, under clean-room conditions with no concurrent load. Capacity planning describes what the infrastructure must deliver continuously — for hours, days, or weeks, under production conditions, with thermal settling, memory pressure, concurrent workloads, and all the other factors that push sustained performance below peak.
The correct planning input is steady-state throughput: the throughput measured after the system has reached thermal equilibrium, with realistic workload patterns, over a measurement window long enough to capture the full range of normal variation.
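The warm-up discard the paragraph above describes can be sketched in a few lines. This is an illustrative helper, not a published API: it assumes a benchmark trace of `(seconds_since_start, tokens_per_second)` samples, and the 600-second warm-up window is an assumption to tune per platform.

```python
def steady_state_throughput(samples, warmup_s=600.0):
    """Mean throughput after discarding the warm-up/boost window.

    warmup_s is an assumption: it should be long enough for clocks and
    temperatures to settle on the hardware being measured.
    """
    settled = [tps for t, tps in samples if t >= warmup_s]
    if not settled:
        raise ValueError("measurement window shorter than warm-up period")
    return sum(settled) / len(settled)

# Example: a one-hour trace that starts at boost-clock throughput
# and settles lower once thermal equilibrium is reached.
trace = [(t, 1200.0 if t < 600 else 1000.0) for t in range(0, 3600, 60)]
print(steady_state_throughput(trace))  # mean over the settled portion only
```

A naive mean over the whole trace would blend the boost window in and overstate sustained capacity; discarding it first is what makes the number a valid planning input.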
Predictability matters more than maxima
In capacity planning, consistency is more valuable than occasional peaks. An infrastructure planner doesn’t need to know the highest throughput the system ever achieved; they need to know the lowest throughput the system reliably maintains under production conditions.
This is why steady-state measurement reveals what peak measurement hides. A system that averages 1,000 tokens/second but dips to 700 during thermal settling or GC pauses has an effective planning capacity of 700, not 1,000. Size to the average and the system runs 30% short of capacity during every dip, and requests arriving in those windows exceed their latency targets.
The relevant metric for capacity planning is often the P5 or P10 throughput — the 5th or 10th percentile of the sustained-throughput distribution, i.e. the level the system exceeds 95% (or 90%) of the time. This number accounts for the normal variation in production and provides a realistic basis for infrastructure sizing that actually meets SLA requirements.
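A minimal sketch of the mean-versus-percentile gap, using the 1,000/700 tokens-per-second figures from above. The nearest-rank percentile helper and the 20-sample trace are illustrative, not from any specific tool.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile of a sample (p in 0-100)."""
    ordered = sorted(values)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]

# 20 one-minute throughput samples: mostly 1,000 tok/s, with
# occasional dips to 700 tok/s during thermal settling / GC pauses.
samples = [1000.0] * 18 + [700.0] * 2

mean = sum(samples) / len(samples)  # 970.0 — flattering, but the wrong input
p5 = percentile(samples, 5)         # 700.0 — the defensible planning number
print(mean, p5)
```

Sizing against the 970 figure looks 38% better on a slide; sizing against the 700 figure is what actually holds the SLA through the dips.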
Cost efficiency is a sustained-throughput property
Hardware cost comparisons based on peak throughput produce misleading conclusions. GPU A delivers 1,200 tokens/second at peak; GPU B delivers 1,000. GPU A costs 20% more. Simple math suggests GPU A is the better deal — more throughput per dollar.
But if GPU A’s sustained throughput (after thermal settling in a dense node) is 950 tokens/second, and GPU B sustains 900 tokens/second (because it throttles less due to lower power draw), the cost-per-sustained-token picture changes substantially. The 20% price premium buys a 5.5% sustained throughput advantage, not the 20% advantage the peak numbers suggested.
Power costs amplify this effect. A GPU that draws 700W to sustain 950 tokens/second costs more per token in electricity than one drawing 400W to sustain 900 tokens/second, especially at data center scale over multi-year deployments. The TCO calculation that matters is performance-per-dollar-per-watt at sustained conditions, not the ratio of peak numbers.
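The cost comparison above can be made concrete. The throughput and power figures are the ones from the text; the GPU prices, three-year horizon, and $0.10/kWh electricity rate are illustrative assumptions, not vendor data.

```python
HOURS_3Y = 3 * 365 * 24  # deployment horizon (assumption)
KWH_PRICE = 0.10         # $/kWh (assumption)

def cost_per_million_tokens(price_usd, sustained_tps, watts,
                            hours=HOURS_3Y, kwh_price=KWH_PRICE):
    """Hardware + energy cost per million tokens at sustained throughput."""
    tokens = sustained_tps * 3600 * hours
    energy_cost = watts / 1000 * hours * kwh_price
    return (price_usd + energy_cost) / tokens * 1e6

# GPU A: 20% price premium, sustains 950 tok/s at 700 W.
# GPU B: sustains 900 tok/s at 400 W.
gpu_a = cost_per_million_tokens(price_usd=36_000, sustained_tps=950, watts=700)
gpu_b = cost_per_million_tokens(price_usd=30_000, sustained_tps=900, watts=400)
print(f"A: ${gpu_a:.4f}/Mtok  B: ${gpu_b:.4f}/Mtok")
```

Under these assumptions the lower-power GPU wins on cost per sustained token despite losing on every peak benchmark — which is exactly the inversion the peak-based comparison hides.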
From steady-state measurement to infrastructure sizing
Translating steady-state performance into infrastructure sizing requires a few additional inputs beyond raw throughput:
Demand profile. What’s the expected request arrival rate and its variance? Peak demand periods need to be served from sustained capacity, not from the brief peak-performance window.
Headroom requirements. How much spare capacity is needed for traffic bursts, failover, and maintenance windows? Headroom must be calculated against steady-state throughput, not peak. Headroom sized against peak numbers exists only on paper; the real margin at sustained throughput is smaller than intended.
Workload evolution. If model size, sequence length, or request volume is projected to grow, sizing should account for the trajectory. As explored in the context of choosing between throughput and latency optimization targets, the metric you optimize today may not be the one that constrains you tomorrow.
Scaling efficiency. Multi-GPU and multi-node scaling is never perfectly linear. Communication overhead, load imbalance, and scheduling inefficiency reduce aggregate throughput relative to single-GPU measurements. Scaling efficiency should be measured at steady state and factored into capacity models — often as a 10-20% reduction from naive linear scaling.
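The inputs above can be tied together in a minimal capacity model. This is a sketch, not a complete planning tool: the demand figure, 30% headroom, and 0.85 scaling efficiency are illustrative assumptions (the latter reflecting the 10-20% reduction from naive linear scaling discussed above).

```python
import math

def gpus_needed(peak_demand_tps, sustained_tps_per_gpu,
                headroom=0.3, scaling_efficiency=0.85):
    """GPUs required to serve peak demand from sustained capacity.

    headroom: spare fraction for bursts, failover, maintenance (assumption).
    scaling_efficiency: aggregate multi-GPU throughput vs naive linear
    scaling (assumption; measure it at steady state).
    """
    required = peak_demand_tps * (1 + headroom)
    effective_per_gpu = sustained_tps_per_gpu * scaling_efficiency
    return math.ceil(required / effective_per_gpu)

# Sizing the same 50,000 tok/s demand against a P5 steady-state figure
# (700 tok/s) versus a peak figure (1,000 tok/s):
print(gpus_needed(50_000, 700))    # sized to steady state
print(gpus_needed(50_000, 1_000))  # sized to peak — undersized in practice
```

Under these assumptions the two inputs differ by over 40% in GPU count — the gap between a cluster that meets its SLA and one that triggers an emergency procurement cycle.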
The budgeting conversation
Hardware procurement ultimately requires a budget justification, and budget discussions favor clear numbers. “We need N GPUs at $X each” is the deliverable. The risk is that the throughput number feeding the GPU count calculation is the attractive peak figure, because it produces a lower GPU count, a lower budget request, and a presentation that’s easier to defend.
The alternative is to present capacity models based on steady-state performance with explicit uncertainty ranges. This requires more nuance but produces more defensible outcomes: the cluster actually meets its SLA, emergency procurement cycles are avoided, and the planning team’s credibility survives contact with production reality.
Steady-state numbers make for less impressive slides but more reliable infrastructure. That trade-off is worth making.