The first 200 iterations looked great
A team kicks off a training run, watches the throughput counter climb during the first few hundred steps, and records the number for the weekly status report. Two hours later, the counter has dropped 15%. By the overnight checkpoint, it’s fluctuating between values that differ by 20%. Nobody changed anything.
This is normal behavior, and the fact that it surprises people reveals a widespread assumption: that hardware performance is a fixed property you measure once and then rely on. In practice, AI workload performance is a time-varying signal shaped by thermal dynamics, power management, memory allocation patterns, scheduling behavior, and framework-level optimization decisions that play out over minutes to hours.
Warmup effects are real and measurable
When a GPU begins executing a workload, several subsystems are still reaching their operating state.
CUDA contexts need to be initialized. Kernel launches incur higher overhead on first invocation because the runtime may still be JIT-compiling PTX into machine code and populating its kernel cache. Memory pools haven’t been pre-allocated yet, so early allocations trigger expensive system calls. The GPU’s clock frequency is ramping from idle to boost state. The host-side data pipeline — data loaders, augmentation routines, prefetch buffers — is still filling.
The combined effect is that the first several minutes of a workload are systematically unrepresentative of what follows. Throughput measured during this warmup phase is typically either lower than steady-state (because subsystems are still initializing) or briefly higher than steady-state (because the GPU is at peak boost frequency before thermal limits kick in).
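One practical consequence is that any benchmark harness should separate warmup iterations from the timed window. A minimal sketch (the function names and iteration counts here are illustrative, not from any particular framework):

```python
import time

def benchmark(fn, *, warmup_iters=10, measured_iters=100):
    """Run fn repeatedly, discarding warmup iterations before timing.

    The warmup loop lets caches, memory pools, and lazily compiled
    kernels reach a steadier state, so the measured window is more
    representative of sustained behavior.
    """
    for _ in range(warmup_iters):      # excluded from the result
        fn()
    start = time.perf_counter()
    for _ in range(measured_iters):
        fn()
    elapsed = time.perf_counter() - start
    return measured_iters / elapsed    # iterations per second

# Toy CPU workload standing in for a training step.
throughput = benchmark(lambda: sum(i * i for i in range(1000)))
```

The same structure applies whether the step is a toy loop or a full training iteration; the essential point is that the warmup loop exists and its count is recorded alongside the result.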
We’ve discussed the gap between peak and steady-state measurement in how peak and sustained performance diverge — warmup effects are one of the primary mechanisms through which that divergence manifests.
Thermal and power dynamics reshape the performance curve
After warmup completes and the GPU reaches sustained load, thermal dynamics become the dominant source of performance variation.
Modern data center GPUs like the NVIDIA A100 and H100 are designed to operate near their thermal limits under AI workloads. The GPU starts at boost clock frequencies, delivers peak throughput for minutes to tens of minutes, then gradually reduces clock speed as junction temperature rises toward the thermal limit. The clock reduction is not a failure — it’s by design. The power management firmware maintains temperature within safe operating range by trading clock frequency for thermal headroom.
The practical effect is a throughput curve that starts high and settles lower. The settled level can sit 5-15% below the initial peak, depending on workload intensity, cooling configuration, and ambient conditions. In dense multi-GPU nodes (eight GPUs sharing an enclosure), thermal interaction between neighboring cards makes the effect more pronounced: cards at interior positions run hotter and settle at lower clocks than those at the edges.
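Quantifying that settling is straightforward once you have a throughput time series. A sketch, with a synthetic curve standing in for real telemetry (window size and curve shape are assumptions for illustration):

```python
from statistics import median

def settling_drop(samples, window=20):
    """Percent drop from the early peak to the settled trailing median.

    `samples` is a list of throughput readings taken at regular
    intervals over a sustained run.
    """
    peak = max(samples[:window])          # boost-clock phase
    settled = median(samples[-window:])   # post-throttle plateau
    return 100.0 * (peak - settled) / peak

# Synthetic curve: boost at 1000 units, declining to a 900-unit plateau.
curve = [1000 - min(100, t) for t in range(120)]
drop = settling_drop(curve)   # -> 10.0
```

Comparing the trailing median against the early peak, rather than two single samples, keeps a momentary dip or spike from dominating the estimate.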
As detailed in how power and thermal constraints govern sustained performance, these physical constraints are first-class determinants of what the hardware actually delivers over time. Any measurement that ignores them captures a transient state, not the operating reality.
Memory pressure builds over time
Long-running workloads often experience performance changes driven by memory dynamics that only become visible after extended execution.
Framework-level memory allocators (PyTorch’s CUDACachingAllocator, for instance) manage GPU memory through pooling and caching strategies. Early in a run, the allocator is building its pool, and allocations are fast because free memory is abundant. As the run progresses and more memory is allocated, fragmentation can increase, occasionally forcing the allocator to flush its cached blocks and request fresh memory from the driver, which surfaces as intermittent latency spikes.
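The pooling idea can be captured in a toy model. This is a deliberately simplified sketch in the spirit of a caching allocator, not a description of PyTorch’s actual implementation: freed blocks are cached by size and reused, so steady-state allocation avoids the slow path.

```python
from collections import defaultdict

class CachingPool:
    """Toy pooling allocator: freed blocks are cached by size and
    reused, so repeated allocations of the same size skip the slow
    path after the first request."""

    def __init__(self):
        self.free = defaultdict(list)   # size -> cached blocks
        self.slow_allocs = 0            # stand-in for driver calls

    def alloc(self, size):
        if self.free[size]:
            return self.free[size].pop()   # fast path: reuse cached block
        self.slow_allocs += 1              # slow path: fresh allocation
        return bytearray(size)

    def release(self, size, block):
        self.free[size].append(block)      # cache instead of freeing

pool = CachingPool()
b = pool.alloc(4096)        # first allocation takes the slow path
pool.release(4096, b)
pool.alloc(4096)            # second is served from the cache
# pool.slow_allocs -> 1
```

The time-dependence follows directly: early iterations pay the slow path while the pool fills; later iterations mostly hit the cache, until fragmentation or a size-mix change forces slow allocations again.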
In inference serving, KV cache growth over long contexts can push memory utilization toward capacity limits, triggering eviction policies or degrading batch scheduling efficiency. The first thousand requests might perform well, but performance at the hundred-thousandth request — after hours of continuous serving — can look materially different.
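The KV cache growth is linear in context length, which makes the pressure easy to estimate. A back-of-envelope sketch (the model dimensions below are illustrative placeholders, not any specific model):

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128,
                   dtype_bytes=2, batch=1):
    """Approximate KV cache size: one K and one V tensor per layer,
    per token, per KV head. Dimensions here are assumed for
    illustration; substitute your model's actual values."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes
    return batch * seq_len * per_token

# At these dims, each token of context costs 128 KiB of cache,
# so a 1k-token context holds 128 MiB per sequence.
mib = kv_cache_bytes(1024) / 2**20   # -> 128.0
```

Multiply by concurrent sequences and the path from "plenty of headroom" to "eviction pressure" over hours of serving becomes concrete.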
Garbage collection behavior in Python-heavy stacks adds another time-dependent factor. Periodic GC pauses create throughput dips that are invisible in short benchmarks but visible in production monitoring.
Scheduling and system-level drift
Beyond the GPU itself, the broader system introduces its own time-dependent behavior.
OS-level scheduling decisions affect CPU-side preprocessing performance. Under sustained load, the kernel’s scheduler may migrate threads, compete with background processes, or encounter NUMA-related latency if memory affinity isn’t carefully managed.
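One mitigation for migration-induced jitter is to pin preprocessing workers to fixed CPUs. On Linux this is available through `os.sched_setaffinity`; the sketch below guards for platforms that lack the call and restores the original mask afterward:

```python
import os

if hasattr(os, "sched_setaffinity"):           # Linux-only API
    allowed = os.sched_getaffinity(0)          # CPUs we may run on
    os.sched_setaffinity(0, {min(allowed)})    # pin to a single core
    pinned = os.sched_getaffinity(0)           # now a one-CPU set
    os.sched_setaffinity(0, allowed)           # restore original mask
else:
    pinned = None                              # platform lacks the call
```

In practice you would pin each worker to a core on the NUMA node closest to its GPU rather than to `min(allowed)`; the single-core choice here is only to keep the sketch short.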
Network-attached storage or distributed file systems exhibit throughput variation under sustained read patterns, especially when multiple nodes compete for I/O bandwidth. A data pipeline that kept up for the first hour of training may fall behind as other jobs on the same storage fabric increase their load.
Multi-tenant environments add another layer. Performance on a shared cluster at 2 AM (light load) versus 2 PM (peak utilization) can differ substantially, not because of anything the workload did, but because the system context changed.
Why this matters for measurement
If AI performance is time-varying, then the question “what’s the throughput?” has no single correct answer. The answer depends on when you measured: during warmup, at thermal peak, after thermal settling, during a memory fragmentation event, or at steady state.
This is why measurement methodology must specify a temporal protocol. How long was the workload run before measurement began? Over what time window was the measurement averaged? Were warmup iterations excluded? Was the system pre-conditioned to a thermal steady state?
Without these details, a benchmark number is ambiguous. Two measurements of the same hardware running the same workload can disagree substantially if they were taken at different points on the performance-over-time curve. The disagreement isn’t noise — it’s the natural consequence of measuring a time-dependent phenomenon at different times.
The practical discipline is straightforward: run long enough to reach steady state, discard warmup, and report measurement windows alongside the numbers. Anything less captures a snapshot that may not predict what the system will do over the hours and days of production execution.
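That discipline can be folded into the harness itself, so the temporal protocol travels with the number instead of living in a footnote. A sketch (field names and default durations are assumptions, not a standard):

```python
import time
from statistics import mean, stdev

def measure(fn, *, warmup_s=30.0, window_s=120.0):
    """Report throughput together with the temporal protocol used,
    so results from different runs can be compared honestly."""
    end_warmup = time.perf_counter() + warmup_s
    while time.perf_counter() < end_warmup:
        fn()                                   # discarded warmup work
    rates = []
    end = time.perf_counter() + window_s
    while time.perf_counter() < end:
        t0 = time.perf_counter()
        fn()
        rates.append(1.0 / (time.perf_counter() - t0))
    return {
        "iters_per_s": mean(rates),
        "stdev": stdev(rates),       # spread, not just the mean
        "warmup_s": warmup_s,        # protocol travels with the number
        "window_s": window_s,
        "n_samples": len(rates),
    }

# Short durations here only to keep the example quick to run.
report = measure(lambda: sum(i * i for i in range(1000)),
                 warmup_s=0.2, window_s=0.5)
```

Reporting the standard deviation and window alongside the mean is the difference between a number that can be reproduced and one that can only be argued about.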