The capacity model was right — on paper
An infrastructure team sizes a GPU cluster based on published benchmark throughput. The math works out: N GPUs, each delivering X tokens per second, for a total capacity of N×X. They provision the cluster, deploy the workload, and discover that sustained throughput is 15-20% below the planning number. The cluster is undersized, SLAs are at risk, and the next procurement cycle is months away.
The planning wasn’t careless. The benchmark number was real. The error was structural: the planning used peak performance — the number a GPU delivers at boost clocks, thermal peak, and optimal conditions — as the input to a capacity model that requires steady-state performance. These are different numbers, and the gap between them is where capacity plans go wrong.
Peak performance is not a planning input
We’ve explored in detail how peak and steady-state performance diverge and the physical mechanisms that drive that divergence. For capacity planning purposes, the practical consequence is simple: peak throughput is the wrong number.
Peak throughput describes what the hardware can do briefly — during the thermal boost window, before clock settling, under clean-room conditions with no concurrent load. Capacity planning describes what the infrastructure must deliver continuously — for hours, days, or weeks, under production conditions, with thermal settling, memory pressure, concurrent workloads, and all the other factors that push sustained performance below peak.
The correct planning input is steady-state throughput: the throughput measured after the system has reached thermal equilibrium, with realistic workload patterns, over a measurement window long enough to capture the full range of normal variation.
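The warm-up discard the paragraph above describes can be sketched in a few lines. This is an illustrative helper, not a published API: it assumes a benchmark trace of `(seconds_since_start, tokens_per_second)` samples, and the 600-second warm-up window is an assumption to tune per platform.

```python
def steady_state_throughput(samples, warmup_s=600.0):
    """Mean throughput after discarding the warm-up/boost window.

    warmup_s is an assumption: it should be long enough for clocks and
    temperatures to settle on the hardware being measured.
    """
    settled = [tps for t, tps in samples if t >= warmup_s]
    if not settled:
        raise ValueError("measurement window shorter than warm-up period")
    return sum(settled) / len(settled)

# Example: a one-hour trace that starts at boost-clock throughput
# and settles lower once thermal equilibrium is reached.
trace = [(t, 1200.0 if t < 600 else 1000.0) for t in range(0, 3600, 60)]
print(steady_state_throughput(trace))  # mean over the settled portion only
```

A naive mean over the whole trace would blend the boost window in and overstate sustained capacity; discarding it first is what makes the number a valid planning input.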
Predictability matters more than maxima
In capacity planning, consistency is more valuable than occasional peaks. An infrastructure planner doesn’t need to know the highest throughput the system ever achieved; they need to know the lowest throughput the system reliably maintains under production conditions.
This is why steady-state measurement reveals what peak measurement hides. A system that averages 1,000 tokens/second but dips to 700 during thermal settling or GC pauses has an effective planning capacity of 700, not 1,000. Size to the average and the system runs 30% short of capacity during every dip, and requests arriving in those windows exceed their latency targets.
The relevant metric for capacity planning is often the P5 or P10 throughput — the 5th or 10th percentile of the sustained-throughput distribution, i.e. the level the system exceeds 95% (or 90%) of the time. This number accounts for the normal variation in production and provides a realistic basis for infrastructure sizing that actually meets SLA requirements.
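A minimal sketch of the mean-versus-percentile gap, using the 1,000/700 tokens-per-second figures from above. The nearest-rank percentile helper and the 20-sample trace are illustrative, not from any specific tool.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile of a sample (p in 0-100)."""
    ordered = sorted(values)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]

# 20 one-minute throughput samples: mostly 1,000 tok/s, with
# occasional dips to 700 tok/s during thermal settling / GC pauses.
samples = [1000.0] * 18 + [700.0] * 2

mean = sum(samples) / len(samples)  # 970.0 — flattering, but the wrong input
p5 = percentile(samples, 5)         # 700.0 — the defensible planning number
print(mean, p5)
```

Sizing against the 970 figure looks 38% better on a slide; sizing against the 700 figure is what actually holds the SLA through the dips.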
Cost efficiency is a sustained-throughput property
Hardware cost comparisons based on peak throughput produce misleading conclusions. GPU A delivers 1,200 tokens/second at peak; GPU B delivers 1,000. GPU A costs 20% more. Simple math suggests GPU A is the better deal — more throughput per dollar.
But if GPU A’s sustained throughput (after thermal settling in a dense node) is 950 tokens/second, and GPU B sustains 900 tokens/second (because it throttles less due to lower power draw), the cost-per-sustained-token picture changes substantially. The 20% price premium buys a 5.5% sustained throughput advantage, not the 20% advantage the peak numbers suggested.
Power costs amplify this effect. A GPU that draws 700W to sustain 950 tokens/second costs more per token in electricity than one drawing 400W to sustain 900 tokens/second, especially at data center scale over multi-year deployments. The TCO calculation that matters is performance-per-dollar-per-watt at sustained conditions, not the ratio of peak numbers.
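The cost comparison above can be made concrete. The throughput and power figures are the ones from the text; the GPU prices, three-year horizon, and $0.10/kWh electricity rate are illustrative assumptions, not vendor data.

```python
HOURS_3Y = 3 * 365 * 24  # deployment horizon (assumption)
KWH_PRICE = 0.10         # $/kWh (assumption)

def cost_per_million_tokens(price_usd, sustained_tps, watts,
                            hours=HOURS_3Y, kwh_price=KWH_PRICE):
    """Hardware + energy cost per million tokens at sustained throughput."""
    tokens = sustained_tps * 3600 * hours
    energy_cost = watts / 1000 * hours * kwh_price
    return (price_usd + energy_cost) / tokens * 1e6

# GPU A: 20% price premium, sustains 950 tok/s at 700 W.
# GPU B: sustains 900 tok/s at 400 W.
gpu_a = cost_per_million_tokens(price_usd=36_000, sustained_tps=950, watts=700)
gpu_b = cost_per_million_tokens(price_usd=30_000, sustained_tps=900, watts=400)
print(f"A: ${gpu_a:.4f}/Mtok  B: ${gpu_b:.4f}/Mtok")
```

Under these assumptions the lower-power GPU wins on cost per sustained token despite losing on every peak benchmark — which is exactly the inversion the peak-based comparison hides.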
From steady-state measurement to infrastructure sizing
Translating steady-state performance into infrastructure sizing requires a few additional inputs beyond raw throughput:
Demand profile. What’s the expected request arrival rate and its variance? Peak demand periods need to be served from sustained capacity, not from the brief peak-performance window.
Headroom requirements. How much spare capacity is needed for traffic bursts, failover, and maintenance windows? Headroom must be calculated against steady-state throughput, not peak. Headroom sized against peak numbers exists only on paper; the real margin at sustained throughput is smaller than intended.
Workload evolution. If model size, sequence length, or request volume is projected to grow, sizing should account for the trajectory. As explored in the context of choosing between throughput and latency optimization targets, the metric you optimize today may not be the one that constrains you tomorrow.
Scaling efficiency. Multi-GPU and multi-node scaling is never perfectly linear. Communication overhead, load imbalance, and scheduling inefficiency reduce aggregate throughput relative to single-GPU measurements. Scaling efficiency should be measured at steady state and factored into capacity models — often as a 10-20% reduction from naive linear scaling.
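The inputs above can be tied together in a minimal capacity model. This is a sketch, not a complete planning tool: the demand figure, 30% headroom, and 0.85 scaling efficiency are illustrative assumptions (the latter reflecting the 10-20% reduction from naive linear scaling discussed above).

```python
import math

def gpus_needed(peak_demand_tps, sustained_tps_per_gpu,
                headroom=0.3, scaling_efficiency=0.85):
    """GPUs required to serve peak demand from sustained capacity.

    headroom: spare fraction for bursts, failover, maintenance (assumption).
    scaling_efficiency: aggregate multi-GPU throughput vs naive linear
    scaling (assumption; measure it at steady state).
    """
    required = peak_demand_tps * (1 + headroom)
    effective_per_gpu = sustained_tps_per_gpu * scaling_efficiency
    return math.ceil(required / effective_per_gpu)

# Sizing the same 50,000 tok/s demand against a P5 steady-state figure
# (700 tok/s) versus a peak figure (1,000 tok/s):
print(gpus_needed(50_000, 700))    # sized to steady state
print(gpus_needed(50_000, 1_000))  # sized to peak — undersized in practice
```

Under these assumptions the two inputs differ by over 40% in GPU count — the gap between a cluster that meets its SLA and one that triggers an emergency procurement cycle.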
The budgeting conversation
Hardware procurement ultimately requires a budget justification, and budget discussions favor clear numbers. “We need N GPUs at $X each” is the deliverable. The risk is that the throughput number feeding the GPU count calculation is the attractive peak figure, because it produces a lower GPU count, a lower budget request, and a presentation that’s easier to defend.
The alternative is to present capacity models based on steady-state performance with explicit uncertainty ranges. This requires more nuance but produces more defensible outcomes: the cluster actually meets its SLA, emergency procurement cycles are avoided, and the planning team’s credibility survives contact with production reality.
Steady-state numbers make for less impressive slides but more reliable infrastructure. That trade-off is worth making.