The first 200 iterations looked great
A team kicks off a training run, watches the throughput counter climb during the first few hundred steps, and records the number for the weekly status report. Two hours later, the counter has dropped 15%. By the overnight checkpoint, it’s fluctuating between values that differ by 20%. Nobody changed anything.
This is normal behavior, and the fact that it surprises people reveals a widespread assumption: that hardware performance is a fixed property you measure once and then rely on. In practice, AI workload performance is a time-varying signal shaped by thermal dynamics, power management, memory allocation patterns, scheduling behavior, and framework-level optimization decisions that play out over minutes to hours.
Warmup effects are real and measurable
When a GPU begins executing a workload, several subsystems are still reaching their operating state.
CUDA contexts need to be initialized. Kernel launches incur higher overhead on first invocation because the runtime may still be JIT-compiling PTX into machine code and populating its kernel cache. Memory pools haven’t been pre-allocated yet, so early allocations trigger expensive system calls. The GPU’s clock frequency is ramping from idle to boost state. The host-side data pipeline — data loaders, augmentation routines, prefetch buffers — is still filling.
The combined effect is that the first several minutes of a workload are systematically unrepresentative of what follows. Throughput measured during this warmup phase is typically either lower than steady-state (because subsystems are still initializing) or briefly higher than steady-state (because the GPU is at peak boost frequency before thermal limits kick in).
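One practical consequence is that any benchmark harness should separate warmup iterations from the timed window. A minimal sketch (the function names and iteration counts here are illustrative, not from any particular framework):

```python
import time

def benchmark(fn, *, warmup_iters=10, measured_iters=100):
    """Run fn repeatedly, discarding warmup iterations before timing.

    The warmup loop lets caches, memory pools, and lazily compiled
    kernels reach a steadier state, so the measured window is more
    representative of sustained behavior.
    """
    for _ in range(warmup_iters):      # excluded from the result
        fn()
    start = time.perf_counter()
    for _ in range(measured_iters):
        fn()
    elapsed = time.perf_counter() - start
    return measured_iters / elapsed    # iterations per second

# Toy CPU workload standing in for a training step.
throughput = benchmark(lambda: sum(i * i for i in range(1000)))
```

The same structure applies whether the step is a toy loop or a full training iteration; the essential point is that the warmup loop exists and its count is recorded alongside the result.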
We’ve discussed the gap between peak and steady-state measurement in how peak and sustained performance diverge — warmup effects are one of the primary mechanisms through which that divergence manifests.
Thermal and power dynamics reshape the performance curve
After warmup completes and the GPU reaches sustained load, thermal dynamics become the dominant source of performance variation.
Modern data center GPUs like the NVIDIA A100 and H100 are designed to operate near their thermal limits under AI workloads. The GPU starts at boost clock frequencies, delivers peak throughput for minutes to tens of minutes, then gradually reduces clock speed as junction temperature rises toward the thermal limit. The clock reduction is not a failure — it’s by design. The power management firmware maintains temperature within safe operating range by trading clock frequency for thermal headroom.
The practical effect is a throughput curve that starts high and settles lower. The settled level can sit 5-15% below the initial peak, depending on workload intensity, cooling configuration, and ambient conditions. In dense multi-GPU nodes (eight GPUs sharing an enclosure), thermal interaction between neighboring cards makes the effect more pronounced: cards at interior positions run hotter and settle at lower clocks than those at the edges.
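Quantifying that settling is straightforward once you have a throughput time series. A sketch, with a synthetic curve standing in for real telemetry (window size and curve shape are assumptions for illustration):

```python
from statistics import median

def settling_drop(samples, window=20):
    """Percent drop from the early peak to the settled trailing median.

    `samples` is a list of throughput readings taken at regular
    intervals over a sustained run.
    """
    peak = max(samples[:window])          # boost-clock phase
    settled = median(samples[-window:])   # post-throttle plateau
    return 100.0 * (peak - settled) / peak

# Synthetic curve: boost at 1000 units, declining to a 900-unit plateau.
curve = [1000 - min(100, t) for t in range(120)]
drop = settling_drop(curve)   # -> 10.0
```

Comparing the trailing median against the early peak, rather than two single samples, keeps a momentary dip or spike from dominating the estimate.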
As detailed in how power and thermal constraints govern sustained performance, these physical constraints are first-class determinants of what the hardware actually delivers over time. Any measurement that ignores them captures a transient state, not the operating reality.
Memory pressure builds over time
Long-running workloads often experience performance changes driven by memory dynamics that only become visible after extended execution.
Framework-level memory allocators (PyTorch’s CUDACachingAllocator, for instance) manage GPU memory through pooling and caching strategies. Early in a run, the allocator is building its pool, and allocations are fast because free memory is abundant. As the run progresses and more memory is allocated, fragmentation can increase, occasionally forcing the allocator to flush its cached blocks and request fresh memory from the driver, which surfaces as intermittent latency spikes.
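The pooling idea can be captured in a toy model. This is a deliberately simplified sketch in the spirit of a caching allocator, not a description of PyTorch’s actual implementation: freed blocks are cached by size and reused, so steady-state allocation avoids the slow path.

```python
from collections import defaultdict

class CachingPool:
    """Toy pooling allocator: freed blocks are cached by size and
    reused, so repeated allocations of the same size skip the slow
    path after the first request."""

    def __init__(self):
        self.free = defaultdict(list)   # size -> cached blocks
        self.slow_allocs = 0            # stand-in for driver calls

    def alloc(self, size):
        if self.free[size]:
            return self.free[size].pop()   # fast path: reuse cached block
        self.slow_allocs += 1              # slow path: fresh allocation
        return bytearray(size)

    def release(self, size, block):
        self.free[size].append(block)      # cache instead of freeing

pool = CachingPool()
b = pool.alloc(4096)        # first allocation takes the slow path
pool.release(4096, b)
pool.alloc(4096)            # second is served from the cache
# pool.slow_allocs -> 1
```

The time-dependence follows directly: early iterations pay the slow path while the pool fills; later iterations mostly hit the cache, until fragmentation or a size-mix change forces slow allocations again.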
In inference serving, KV cache growth over long contexts can push memory utilization toward capacity limits, triggering eviction policies or degrading batch scheduling efficiency. The first thousand requests might perform well, but performance at the hundred-thousandth request — after hours of continuous serving — can look materially different.
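The KV cache growth is linear in context length, which makes the pressure easy to estimate. A back-of-envelope sketch (the model dimensions below are illustrative placeholders, not any specific model):

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128,
                   dtype_bytes=2, batch=1):
    """Approximate KV cache size: one K and one V tensor per layer,
    per token, per KV head. Dimensions here are assumed for
    illustration; substitute your model's actual values."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes
    return batch * seq_len * per_token

# At these dims, each token of context costs 128 KiB of cache,
# so a 1k-token context holds 128 MiB per sequence.
mib = kv_cache_bytes(1024) / 2**20   # -> 128.0
```

Multiply by concurrent sequences and the path from "plenty of headroom" to "eviction pressure" over hours of serving becomes concrete.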
Garbage collection behavior in Python-heavy stacks adds another time-dependent factor. Periodic GC pauses create throughput dips that are invisible in short benchmarks but visible in production monitoring.
Scheduling and system-level drift
Beyond the GPU itself, the broader system introduces its own time-dependent behavior.
OS-level scheduling decisions affect CPU-side preprocessing performance. Under sustained load, the kernel’s scheduler may migrate threads, compete with background processes, or encounter NUMA-related latency if memory affinity isn’t carefully managed.
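One mitigation for migration-induced jitter is to pin preprocessing workers to fixed CPUs. On Linux this is available through `os.sched_setaffinity`; the sketch below guards for platforms that lack the call and restores the original mask afterward:

```python
import os

if hasattr(os, "sched_setaffinity"):           # Linux-only API
    allowed = os.sched_getaffinity(0)          # CPUs we may run on
    os.sched_setaffinity(0, {min(allowed)})    # pin to a single core
    pinned = os.sched_getaffinity(0)           # now a one-CPU set
    os.sched_setaffinity(0, allowed)           # restore original mask
else:
    pinned = None                              # platform lacks the call
```

In practice you would pin each worker to a core on the NUMA node closest to its GPU rather than to `min(allowed)`; the single-core choice here is only to keep the sketch short.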
Network-attached storage or distributed file systems exhibit throughput variation under sustained read patterns, especially when multiple nodes compete for I/O bandwidth. A data pipeline that kept up for the first hour of training may fall behind as other jobs on the same storage fabric increase their load.
Multi-tenant environments add another layer. Performance on a shared cluster at 2 AM (light load) versus 2 PM (peak utilization) can differ substantially, not because of anything the workload did, but because the system context changed.
Why this matters for measurement
If AI performance is time-varying, then the question “what’s the throughput?” has no single correct answer. The answer depends on when you measured: during warmup, at thermal peak, after thermal settling, during a memory fragmentation event, or at steady state.
This is why measurement methodology must specify a temporal protocol. How long was the workload run before measurement began? Over what time window was the measurement averaged? Were warmup iterations excluded? Was the system pre-conditioned to a thermal steady state?
Without these details, a benchmark number is ambiguous. Two measurements of the same hardware running the same workload can disagree substantially if they were taken at different points on the performance-over-time curve. The disagreement isn’t noise — it’s the natural consequence of measuring a time-dependent phenomenon at different times.
The practical discipline is straightforward: run long enough to reach steady state, discard warmup, and report measurement windows alongside the numbers. Anything less captures a snapshot that may not predict what the system will do over the hours and days of production execution.
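That discipline can be folded into the harness itself, so the temporal protocol travels with the number instead of living in a footnote. A sketch (field names and default durations are assumptions, not a standard):

```python
import time
from statistics import mean, stdev

def measure(fn, *, warmup_s=30.0, window_s=120.0):
    """Report throughput together with the temporal protocol used,
    so results from different runs can be compared honestly."""
    end_warmup = time.perf_counter() + warmup_s
    while time.perf_counter() < end_warmup:
        fn()                                   # discarded warmup work
    rates = []
    end = time.perf_counter() + window_s
    while time.perf_counter() < end:
        t0 = time.perf_counter()
        fn()
        rates.append(1.0 / (time.perf_counter() - t0))
    return {
        "iters_per_s": mean(rates),
        "stdev": stdev(rates),       # spread, not just the mean
        "warmup_s": warmup_s,        # protocol travels with the number
        "window_s": window_s,
        "n_samples": len(rates),
    }

# Short durations here only to keep the example quick to run.
report = measure(lambda: sum(i * i for i in range(1000)),
                 warmup_s=0.2, window_s=0.5)
```

Reporting the standard deviation and window alongside the mean is the difference between a number that can be reproduced and one that can only be argued about.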