The first five minutes looked great. Then the system settled.
A new inference deployment goes live. The engineering team watches the initial metrics come in and feels good — throughput is hitting the target, latency is within bounds, the profiler shows the GPU humming along. An hour later, the numbers look different. Throughput has dropped. Clock speeds have stabilized at a lower point. The system is warmer, the memory allocator has settled into its long-run behavior, and the workload has moved past the transient phase where everything runs in the best possible alignment.
Nothing broke. The system transitioned from its peak regime to its sustained operating regime — the regime it will actually live in for the remaining 99.9% of its operational life.
This distinction between peak and steady-state performance is one of the most consequential and most underexplored dimensions in AI performance evaluation, and it is central to understanding why AI performance changes over time. Most performance narratives are built around best-case moments, but AI systems are deployed as sustained services and long-running jobs. If you only measure the peaks, you’re characterizing a regime the system barely visits.
Peak performance: real, but transient
Peak performance is what happens when the stars align. Caches are hot, clocks are at their turbo ceiling, contention is low, and the workload is in a phase where everything favors the best possible execution path. These conditions produce the highest throughput numbers — and those numbers are not fabricated. The system genuinely reached that performance, briefly.
The problem is what people infer from peak results. A transient best-case measurement becomes “the system’s performance” in slide decks and planning documents, even though the conditions that produced it are not sustainable. The peak is evidence that the hardware can reach a certain level under favorable conditions; it is not an operating promise for what will happen during a 24-hour serving window or a multi-day training run.
We pay careful attention to this because it's one of the most common disconnects between evaluation results and production reality. The evaluation captures the first-few-minutes regime; production lives in the long-run regime. Performance can genuinely differ between the two, with no defects and no mistakes anywhere.
Steady state: the regime that dominates real outcomes
Steady-state performance is what the system delivers once the transient phase is over and the operating conditions have stabilized. In this regime, GPU clocks have settled under thermal and power constraints, the memory allocator has reached its equilibrium behavior, runtime caching and compilation effects have played out, and the workload is experiencing the contention and scheduling patterns it will see persistently.
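One way to make the two regimes explicit in an evaluation is to time every iteration and summarize the transient and settled phases separately rather than reporting a single number. This is a minimal sketch: `measure_regimes` is a hypothetical helper, and the slowdown after the first 50 steps is simulated with a sleep rather than a real clock or allocator effect.

```python
import time
from statistics import median

def measure_regimes(step_fn, total_iters, warmup_iters):
    """Time every iteration, then summarize the transient (peak-ish)
    and settled (steady-state) phases separately."""
    durations = []
    for _ in range(total_iters):
        start = time.perf_counter()
        step_fn()
        durations.append(time.perf_counter() - start)
    return {
        "early_its_per_sec": 1.0 / median(durations[:warmup_iters]),
        "steady_its_per_sec": 1.0 / median(durations[warmup_iters:]),
    }

# Simulated workload whose per-iteration cost rises once the system
# "settles": the first 50 steps run hot, the rest at sustained speed.
state = {"i": 0}
def step():
    state["i"] += 1
    time.sleep(0.002 if state["i"] <= 50 else 0.010)

print(measure_regimes(step, total_iters=100, warmup_iters=50))
```

Reporting the two numbers side by side makes it much harder for the peak figure to quietly become "the" result in downstream documents.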
Steady state is not a single universal number — it depends on the workload, the system configuration, and the operating objective. But it represents the regime where the system spends the vast majority of its time, which means it’s the regime that dominates total output, total cost, and total user experience, as discussed in steady-state performance, cost, and capacity planning.
Reasoning about steady-state performance is not “being conservative.” It’s being realistic about which performance regime your infrastructure budget, your SLA, and your capacity plan actually depend on.
How systems move between regimes
The transition from peak to steady state is not always gradual, and it’s not always obvious when it happens.
GPU clocks drop as thermal mass accumulates and power management stabilizes at sustainable levels — these are exactly the power, thermal, and hidden governor effects that cap sustained performance. This can happen within minutes on some systems. Memory subsystem behavior shifts as allocators settle and fragmentation patterns emerge. Runtime behavior changes as JIT compilation effects play out (a torch.compile warmup pass can make the first few iterations look artificially different from the rest). In multi-GPU setups, NCCL collective communication patterns may behave differently after initial synchronization versus under steady traffic.
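The clock and thermal side of this transition is directly observable by polling the GPU over the life of a run instead of sampling it once at the start. The sketch below shells out to `nvidia-smi` (the `clocks.sm`, `temperature.gpu`, and `power.draw` query fields are real, but the polling cadence and helper names are illustrative), and only runs the polling loop if the tool is actually on the PATH.

```python
import shutil
import subprocess
import time

def parse_smi_line(line):
    """Parse one 'clocks.sm, temperature.gpu, power.draw' CSV row
    produced by nvidia-smi with --format=csv,noheader,nounits."""
    sm_mhz, temp_c, power_w = (field.strip() for field in line.split(","))
    return {"sm_mhz": int(sm_mhz), "temp_c": int(temp_c), "power_w": float(power_w)}

def poll_gpu(seconds, interval):
    """Sample SM clock, temperature, and power every `interval` seconds,
    so the peak-to-steady transition shows up as a curve over the run."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=clocks.sm,temperature.gpu,power.draw",
        "--format=csv,noheader,nounits",
    ]
    samples = []
    for _ in range(max(1, seconds // interval)):
        out = subprocess.check_output(cmd, text=True).splitlines()[0]
        samples.append(parse_smi_line(out))
        time.sleep(interval)
    return samples

if shutil.which("nvidia-smi"):
    # On a loaded GPU, expect sm_mhz to drift down and temp_c to drift
    # up across the samples as thermal mass accumulates.
    print(poll_gpu(seconds=30, interval=5))
```

Logging these samples alongside throughput makes it easy to correlate a drop in delivered performance with the moment clocks stopped holding their turbo ceiling.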
The result is that a benchmark that captures only the first phase of execution — warmup, peak clocks, fresh caches — is measuring a different system than the one that runs in production. Both measurements are real. They’re just measurements of different temporal regimes, and the one that matters for operational planning is almost always the longer-running one.
Why peak-oriented evaluation produces wrong decisions
When capacity planning is based on peak-phase measurements, the system gets sized for a throughput level it can reach only transiently. The practical consequence is that sustained throughput is lower than planned — not because of a failure, but because the plan was built on a characterization of the wrong regime.
This shows up in concrete ways: latency targets that are met in evaluation but violated under sustained load; throughput budgets that look adequate in a short-run test but fall short in continuous operation; GPU clusters that were sized for marketing-grade peak numbers and then need additional nodes to meet the actual steady-state requirement.
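The sizing gap is simple arithmetic, which is exactly why it is worth writing down. With purely illustrative numbers (a node that peaks at 1,200 req/s but sustains 900 req/s once clocks and caches settle), the same traffic target yields two different cluster sizes depending on which regime you planned against:

```python
from math import ceil

def nodes_needed(target_qps, per_node_qps):
    """Minimum node count to meet a sustained traffic target."""
    return ceil(target_qps / per_node_qps)

target = 10_000  # illustrative sustained traffic target, req/s

print(nodes_needed(target, 1200))  # sized from the peak number: 9 nodes
print(nodes_needed(target, 900))   # sized from steady state: 12 nodes
```

Three extra nodes on a 10,000 req/s target is the kind of discrepancy that surfaces weeks after launch as unexplained SLA violations, unless the steady-state number was the one in the capacity plan from the start.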
The correction isn’t to ignore peaks — they can be informative about the hardware’s capability envelope. The correction is to distinguish between “what the system can do briefly” and “what the system will do persistently,” and to make sure the distinction is explicit in every evaluation, comparison, and planning exercise.
Performance is a curve, not a point
If there’s one mental model shift that prevents the most confusion in AI performance evaluation, it’s this: performance is behavior over time, not a static snapshot. A peak measurement gives you one point on that curve — typically near the best possible moment. Steady-state reasoning asks what happens once the system has settled into the regime where it actually operates.
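Treating performance as a curve is cheap to operationalize: keep per-request completion timestamps and compute throughput over a trailing window instead of one aggregate. The sketch below uses a synthetic trace (requests every 10 ms at first, then every 20 ms after the simulated transition); `rolling_throughput` is an illustrative helper, not a standard API.

```python
def rolling_throughput(timestamps, window):
    """Turn per-request completion timestamps (seconds) into a
    throughput curve: requests/sec over each trailing window."""
    curve = []
    for i in range(window, len(timestamps)):
        span = timestamps[i] - timestamps[i - window]
        curve.append(window / span)
    return curve

# Synthetic trace: one request every 10 ms while the system runs hot,
# then every 20 ms once it settles into steady state.
ts, t = [], 0.0
for i in range(60):
    t += 0.010 if i < 30 else 0.020
    ts.append(t)

curve = rolling_throughput(ts, window=10)
print(round(curve[0]), round(curve[-1]))  # → 100 50
```

The first and last points of the curve are the peak and steady-state numbers; everything in between is the transition that a single-number benchmark throws away.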
Once you accept that temporal dimension, many “mysterious” results resolve. The performance didn’t randomly change — the system moved between regimes. The benchmark wasn’t wrong — it just measured a different phase. The production system isn’t underperforming — it’s performing in steady state, which is what steady state looks like.
For AI workloads where throughput and latency objectives can pull in different directions, the temporal dimension adds another layer: the trade-off between those objectives can itself shift as the system transitions from peak to steady-state behavior. Capturing the right regime is the difference between an evaluation that predicts production and one that produces a pleasant-looking number you’ll never see again.