A year ago, the same hardware ran everything
Early in most ML teams’ evolution, a single GPU or a small cluster handles both training and inference. The workloads are small enough that the distinction barely matters. A training run finishes in hours, and inference serves a handful of internal users at volumes that wouldn’t stress a laptop.
Then the model grows, the user base scales, and suddenly the team discovers that the hardware configuration optimized for training throughput is producing unacceptable inference latency — or that the low-latency inference setup can’t sustain the compute throughput needed for a reasonable training cycle. It’s not that the hardware broke. It’s that training and inference are fundamentally different workloads, and the system design that serves one well may serve the other poorly.
Different workloads, different bottlenecks
Training and inference differ in nearly every dimension that matters for hardware and system design.
Compute vs. memory pressure. Training is dominated by large matrix multiplications in the forward and backward pass — dense, regular, highly parallelizable operations that saturate compute units efficiently. Modern training runs on large batch sizes, which amortize overhead and keep tensor cores busy. Inference, especially autoregressive decoding in LLMs, is dominated by memory reads. Each token generation reads model weights and KV cache from HBM. The compute-to-memory ratio is radically different: training is compute-bound; much of inference is memory-bandwidth-bound.
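The compute-to-memory contrast can be made concrete with a back-of-envelope arithmetic-intensity calculation. The shapes and the 2-byte FP16/BF16 element size below are assumptions chosen purely for illustration:

```python
# Arithmetic intensity = FLOPs per byte of memory traffic.
# A training step multiplies a large activation matrix against each weight
# matrix, reusing the weights across the whole batch; autoregressive decode
# is effectively a matrix-vector product that streams every weight per token.

def matmul_intensity(m, k, n, bytes_per_el=2):
    """FLOPs per byte moved for an (m x k) @ (k x n) matmul, FP16/BF16 storage."""
    flops = 2 * m * k * n
    bytes_moved = bytes_per_el * (m * k + k * n + m * n)
    return flops / bytes_moved

# Training-style GEMM: 8192 tokens in the batch, hidden dim 4096.
train = matmul_intensity(8192, 4096, 4096)
# Decode-style GEMV: a single token against the same weight matrix.
decode = matmul_intensity(1, 4096, 4096)

print(f"training GEMM: {train:7.0f} FLOPs/byte")  # deep in compute-bound territory
print(f"decode GEMV:   {decode:7.2f} FLOPs/byte")  # pinned to memory bandwidth
```

Three orders of magnitude separate the two, which is why the same chip can be compute-bound in one phase and bandwidth-bound in the other.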
Batch dynamics. Training typically processes fixed, large batches. The batch size is tuned for throughput and convergence, and it remains stable through the run. Inference serves variable-length requests that arrive at unpredictable intervals. Batch formation in inference — dynamic batching, continuous batching, iteration-level scheduling — is a system design problem with no training analog.
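A toy sketch makes the iteration-level idea concrete: new requests join the running batch between decode steps instead of waiting for the whole batch to drain. All names here are illustrative, not any real serving framework's API:

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Toy continuous batching. requests: list of (request_id, tokens_to_generate).
    Returns request ids in the order they finish."""
    pending = deque(requests)
    active = {}          # request_id -> tokens still to generate
    finished = []
    while pending or active:
        # Admit waiting requests into free batch slots (iteration-level scheduling).
        while pending and len(active) < max_batch:
            rid, n = pending.popleft()
            active[rid] = n
        # One decode step: every active request produces one token.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]
                finished.append(rid)
    return finished

# Short requests finish and free their slot while long ones keep decoding.
print(continuous_batching([("a", 2), ("b", 5), ("c", 1), ("d", 3), ("e", 2)]))
# -> ['c', 'a', 'd', 'e', 'b']
```

With static batching, request "e" would wait for the entire first batch (including the 5-token "b") to finish; here it slips into the slot "c" vacates.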
Latency sensitivity. In training, wall-clock time matters but per-sample latency does not. Nobody cares if a single gradient step takes 200ms instead of 180ms, as long as epoch throughput meets the schedule. In inference, per-request latency directly affects user experience. A P99 latency of 500ms might be acceptable; 2 seconds might not. The optimization target shifts from aggregate throughput to latency distribution — a fundamentally different objective.
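The gap between aggregate and tail behavior is easy to demonstrate with synthetic latencies (the numbers below are invented for illustration, not measurements):

```python
import random

# A population where 2% of requests are slow stragglers: the mean looks
# healthy while the tail does not.
random.seed(0)
latencies_ms = ([random.gauss(120, 15) for _ in range(980)] +    # typical requests
                [random.gauss(1800, 200) for _ in range(20)])    # slow stragglers

def percentile(data, p):
    """Nearest-rank percentile over a sorted copy of the data."""
    s = sorted(data)
    return s[min(len(s) - 1, round(p / 100 * (len(s) - 1)))]

mean = sum(latencies_ms) / len(latencies_ms)
p99 = percentile(latencies_ms, 99)
print(f"mean latency: {mean:6.0f} ms")  # the number throughput-style reporting shows
print(f"P99 latency:  {p99:6.0f} ms")   # the number a latency SLO actually constrains
```

Optimizing the mean and optimizing the P99 pull system design in different directions, which is why latency distribution, not average, is the inference target.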
Precision requirements. Training often uses mixed precision (FP32 for master weights, BF16 or FP16 for forward/backward computation) to balance numerical stability with throughput. Inference can frequently operate at lower precision — INT8, FP8, even INT4 with careful quantization — because the model weights are frozen and the numerical tolerance for individual predictions is typically higher than for gradient accumulation.
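As a sketch of why frozen weights enable lower precision, here is minimal post-training symmetric per-tensor INT8 quantization. This is a toy illustration, not a production quantization scheme:

```python
import random

def quantize_int8(w):
    """Map floats to int8 with a single per-tensor scale factor."""
    scale = max(abs(x) for x in w) / 127.0
    q = [max(-127, min(127, round(x / scale))) for x in w]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

random.seed(0)
w = [random.gauss(0.0, 0.02) for _ in range(4096)]   # FP32 would use 4 B/weight
q, scale = quantize_int8(w)                          # INT8 uses 1 B/weight

err = max(abs(a - b) for a, b in zip(dequantize(q, scale), w))
print(f"4x memory reduction; max rounding error {err:.2e} <= scale/2 = {scale / 2:.2e}")
```

The rounding error is bounded and fixed, which a forward pass can tolerate; accumulating the same error into gradient updates over thousands of steps is a much harder proposition, which is why training keeps higher-precision master weights.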
Performance conclusions don’t transfer between phases
This is the practical consequence that catches organizations off guard: a benchmark result measured during training does not predict inference behavior, and vice versa.
Moving from an NVIDIA A100 to an H100 might yield a 2× training throughput improvement on a specific model. For inference on the same model, the improvement might be 1.3× or 3×, depending on whether the workload is compute-bound or memory-bound, whether the inference framework can exploit the H100’s FP8 tensor cores, and whether the serving architecture takes advantage of the larger L2 cache.
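A roofline-style back-of-envelope shows why no single speedup number exists: attainable throughput is the lower of the compute roof and the memory roof. The spec figures below are approximate public numbers (A100: ~312 TFLOPS BF16, ~2.0 TB/s HBM; H100 SXM: ~990 TFLOPS BF16, ~3.35 TB/s HBM3) and serve only to illustrate the shape of the argument:

```python
# Roofline model: attainable TFLOPS = min(compute roof, bandwidth * intensity),
# where intensity is in FLOPs/byte and bandwidth is in TB/s.

def attainable_tflops(peak_tflops, bw_tbs, intensity):
    return min(peak_tflops, bw_tbs * intensity)

a100 = (312, 2.0)    # (peak BF16 TFLOPS, HBM TB/s), approximate
h100 = (990, 3.35)

speedups = {}
for name, intensity in [("compute-bound GEMM", 1600), ("memory-bound GEMV", 1.0)]:
    speedups[name] = (attainable_tflops(*h100, intensity) /
                      attainable_tflops(*a100, intensity))
    print(f"{name} (~{intensity} FLOPs/byte): {speedups[name]:.2f}x")
```

The same upgrade yields roughly a 3.2× gain where peak compute is the limit and roughly a 1.7× gain where bandwidth is, before FP8 or cache effects enter the picture.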
The numbers are not contradictory. They reflect different workload characteristics exercising different hardware features under different conditions. As we discuss in how performance emerges from the full stack, the measured result is always a product of the entire execution context. Training and inference represent different execution contexts — different enough that performance generalizations between them are unreliable.
We see this confusion surface frequently when procurement decisions are based on training benchmarks but the primary production workload is inference (or the reverse). The benchmark answered a different question than the one the organization needed answered, and the mismatch only becomes visible after deployment.
System design implications
If training and inference are different workloads, they typically benefit from different system designs.
Training systems tend to prioritize GPU-to-GPU interconnect bandwidth (for gradient synchronization), high compute throughput, large HBM capacity (for model state and optimizer states), and power delivery for sustained full-load operation. A DGX-class node with NVSwitch or InfiniBand interconnects is optimized for this profile.
Inference systems tend to prioritize memory bandwidth (for weight and KV cache reads), low-latency host-to-device communication, efficient multi-request scheduling, and power efficiency under variable load. The ideal inference system might use fewer, smaller GPUs with high memory bandwidth, or specialized inference accelerators that trade peak compute for memory throughput.
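The memory-bandwidth emphasis follows from simple arithmetic: at batch size 1, each decoded token must stream roughly the full weight set from HBM, so bandwidth caps the token rate regardless of compute. The model size and bandwidth figures below are assumptions for illustration (a 70B-parameter model in FP16, an H100-SXM-class ~3.35 TB/s part; large models are served across several GPUs in practice, but the per-device ratio is what matters):

```python
# Batch-1 decode ceiling: tokens/s <= memory_bandwidth / model_bytes.
params = 70e9
bytes_per_param = 2                      # FP16/BF16
model_bytes = params * bytes_per_param   # ~140 GB of weights read per token

hbm_bandwidth = 3.35e12                  # bytes/s, approximate
ceiling = hbm_bandwidth / model_bytes
print(f"batch-1 decode ceiling: ~{ceiling:.0f} tokens/s")  # compute barely matters here
```

Larger batches amortize each weight read across more tokens and push the workload back toward compute, which is exactly why batch formation is such a central inference design problem.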
These aren’t rigid categories — some workloads blend characteristics, and hardware continues to converge — but the general principle holds: optimizing for one workload phase and deploying the same configuration for both is a common source of underperformance.
The practical takeaway is that any evaluation of GPU or accelerator performance needs to specify which workload it measured. “This GPU is fast” is incomplete. “This GPU delivers X tokens/second on LLM inference at batch size 32 with INT8 precision” is useful. “This GPU trains ResNet-50 at Y images/second” is also useful, but it tells you almost nothing about the inference scenario.
As explored in the context of understanding what GPUs do within their larger system, the accelerator is one component whose contribution depends on what the rest of the system demands. Training and inference demand different things — and the performance story changes accordingly.