A year ago, the same hardware ran everything

Early in most ML teams’ evolution, a single GPU or a small cluster handles both training and inference. The workloads are small enough that the distinction barely matters. A training run finishes in hours, and inference serves a handful of internal users at volumes that wouldn’t stress a laptop.

Then the model grows, the user base scales, and suddenly the team discovers that the hardware configuration optimized for training throughput is producing unacceptable inference latency, or that the low-latency inference setup can’t sustain the compute throughput needed for a reasonable training cycle.

It’s not that the hardware broke. It’s that training and inference are fundamentally different workloads, and the system design that serves one well may serve the other poorly.

Different workloads, different bottlenecks

Training and inference differ in nearly every dimension that matters for hardware and system design.

Compute vs. memory pressure. Training is dominated by large matrix multiplications in the forward and backward pass: dense, regular, highly parallelizable operations that saturate compute units efficiently. Modern training runs on large batch sizes, which amortize overhead and keep tensor cores busy. Inference, especially autoregressive decoding in LLMs, is dominated by memory reads. Each token generation reads model weights and KV cache from HBM. The compute-to-memory ratio is radically different: training is compute-bound; much of inference is memory-bandwidth-bound.

Batch dynamics. Training typically processes fixed, large batches. The batch size is tuned for throughput and convergence, and it remains stable through the run. Inference serves variable-length requests that arrive at unpredictable intervals. Batch formation in inference (dynamic batching, continuous batching, iteration-level scheduling) is a system design problem with no training analog.

Latency sensitivity.
In training, wall-clock time matters but per-sample latency does not. Nobody cares if a single gradient step takes 200ms instead of 180ms, as long as epoch throughput meets the schedule. In inference, per-request latency directly affects user experience. A P99 latency of 500ms might be acceptable; 2 seconds might not. The optimization target shifts from aggregate throughput to latency distribution, a fundamentally different objective.

Precision requirements. Training often uses mixed precision (FP32 for master weights, BF16 or FP16 for forward/backward computation) to balance numerical stability with throughput. Inference can frequently operate at lower precision (INT8, FP8, even INT4 with careful quantization) because the model weights are frozen and the numerical tolerance for individual predictions is typically higher than for gradient accumulation.

How training and inference differ as workloads

| Dimension | Training | Inference |
| --- | --- | --- |
| Compute profile | Dense matrix ops (forward + backward); compute-bound | Weight and KV cache reads; often memory-bandwidth-bound |
| Batch behavior | Fixed large batches, tuned for throughput and convergence | Variable requests, dynamic batching, continuous batching |
| Latency sensitivity | Per-sample latency irrelevant; epoch throughput matters | Per-request latency directly affects user experience |
| Precision | Mixed precision (FP32 master weights, BF16/FP16 compute) | Lower precision viable (INT8, FP8, INT4 with calibration) |
| Scaling priority | GPU-to-GPU interconnect bandwidth for gradient sync | Memory bandwidth, host-device communication, scheduling efficiency |
| System design | NVLink/InfiniBand, large HBM, sustained power delivery | High memory bandwidth, low-latency host I/O, power efficiency |

Why don’t training benchmarks predict inference performance?

This is the practical consequence that catches organizations off guard: a benchmark result measured during training does not predict inference behavior, and vice versa.
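
One way to see why results do not transfer between phases is a roofline-style estimate. The sketch below is a minimal illustration under stated assumptions, not a performance model: the `Device` class, both device entries, and the arithmetic-intensity values are hypothetical placeholders, not vendor specifications or measurements. It bounds attainable throughput by the smaller of peak compute and memory bandwidth multiplied by arithmetic intensity (FLOPs performed per byte read).

```python
from dataclasses import dataclass


@dataclass
class Device:
    """Hypothetical accelerator. The numbers used below are placeholders."""
    name: str
    peak_tflops: float   # peak dense compute, in TFLOP/s
    mem_bw_tbps: float   # memory bandwidth, in TB/s


def attainable_tflops(dev: Device, flops_per_byte: float) -> float:
    """Roofline bound: min(peak compute, bandwidth * arithmetic intensity)."""
    return min(dev.peak_tflops, dev.mem_bw_tbps * flops_per_byte)


def regime(dev: Device, flops_per_byte: float) -> str:
    """Classify a workload relative to the device's ridge point."""
    ridge = dev.peak_tflops / dev.mem_bw_tbps
    return "compute-bound" if flops_per_byte >= ridge else "memory-bound"


def speedup(old: Device, new: Device, flops_per_byte: float) -> float:
    """Ratio of roofline bounds for the same workload on two devices."""
    return attainable_tflops(new, flops_per_byte) / attainable_tflops(old, flops_per_byte)


# Two made-up devices: the newer one has 3x the compute but 1.5x the bandwidth.
old = Device("older-gpu", peak_tflops=300.0, mem_bw_tbps=2.0)
new = Device("newer-gpu", peak_tflops=900.0, mem_bw_tbps=3.0)

# Assumed intensities: large-batch training is high, single-stream decode is low.
for label, intensity in [("training-like", 400.0), ("decode-like", 2.0)]:
    print(f"{label}: {speedup(old, new, intensity):.2f}x speedup, "
          f"{regime(old, intensity)} on {old.name}")
```

With these made-up numbers, the same pair of devices shows a 3× gap at high arithmetic intensity and a 1.5× gap at low intensity, because each regime is limited by a different hardware resource.
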

Moving from an NVIDIA A100 to an H100 might show a 2× training throughput improvement on a specific model. For inference on the same model, the improvement might be 1.3× or 3× depending on whether the workload is compute-bound or memory-bound, whether the inference framework can exploit the H100’s FP8 tensor cores, and whether the serving architecture takes advantage of the larger L2 cache.

The numbers are not contradictory. They reflect different workload characteristics exercising different hardware features under different conditions. As we discuss in how performance emerges from the full stack, the measured result is always a product of the entire execution context. Training and inference represent different execution contexts, different enough that performance generalizations between them are unreliable.

We see this confusion surface frequently when procurement decisions are based on training benchmarks but the primary production workload is inference (or the reverse). The benchmark answered a different question than the one the organization needed answered, and the mismatch only becomes visible after deployment.

System design implications

If training and inference are different workloads, they typically benefit from different system designs.

Training systems tend to prioritize GPU-to-GPU interconnect bandwidth (for gradient synchronization), high compute throughput, large HBM capacity (for model state and optimizer states), and power delivery for sustained full-load operation. A DGX-class node with NVSwitch or InfiniBand interconnects is optimized for this profile.

Inference systems tend to prioritize memory bandwidth (for weight and KV cache reads), low-latency host-to-device communication, efficient multi-request scheduling, and power efficiency under variable load. The ideal inference system might use fewer, smaller GPUs with high memory bandwidth, or specialized inference accelerators that trade peak compute for memory throughput.
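
The emphasis on memory bandwidth and multi-request scheduling can be made concrete with a back-of-envelope bound on decode throughput. This is a rough sketch, not a serving model: it assumes every decode step reads all weights once (shared across the batch) plus the entire KV cache, and that memory reads are the only cost. The model shape, parameter count, and bandwidth figure are hypothetical values chosen for illustration, and the function names are ours.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem):
    """Bytes of keys plus values held for `batch` sequences of `seq_len` tokens."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


def decode_tokens_per_s(weight_bytes, kv_bytes, batch, mem_bw_bytes_per_s):
    """Upper bound on decode throughput if memory reads were the only cost.

    Each step reads the weights once and the whole KV cache, and emits one
    token per sequence in the batch.
    """
    step_seconds = (weight_bytes + kv_bytes) / mem_bw_bytes_per_s
    return batch / step_seconds


# Hypothetical model and device, for illustration only.
N_PARAMS = 7e9     # parameter count
MEM_BW = 2e12      # memory bandwidth in bytes per second
SHAPE = dict(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=2048)


def bound(batch, weight_bytes_per_param):
    """Throughput bound for a batch size and weight precision (KV kept at 2 bytes)."""
    kv = kv_cache_bytes(batch=batch, bytes_per_elem=2, **SHAPE)
    return decode_tokens_per_s(N_PARAMS * weight_bytes_per_param, kv, batch, MEM_BW)


for batch, wbytes, label in [(1, 2, "FP16 weights"),
                             (8, 2, "FP16 weights"),
                             (1, 1, "INT8 weights")]:
    print(f"batch={batch}, {label}: at most {bound(batch, wbytes):.0f} tokens/s")
```

Under these assumptions, batching raises the bound because the weight read is amortized across requests, and lower-precision weights raise it because fewer bytes move per step. Real systems will land below these figures once compute, scheduling, and host I/O are included.
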

These aren’t rigid categories (some workloads blend characteristics, and hardware continues to converge), but the general principle holds: optimizing for one workload phase and deploying the same configuration for both is a common source of underperformance.

The practical takeaway is that any evaluation of GPU or accelerator performance needs to specify which workload it measured. “This GPU is fast” is incomplete. “This GPU delivers X tokens/second on LLM inference at batch size 32 with INT8 precision” is useful. “This GPU trains ResNet-50 at Y images/second” is also useful, but it tells you almost nothing about the inference scenario.

As explored in the context of understanding what GPUs do within their larger system, the accelerator is one component whose contribution depends on what the rest of the system demands. Training and inference demand different things, and the performance story changes accordingly.

Related deep-dives

AI inference accelerators: what makes them a distinct category (why inference-specialized hardware exists and how to benchmark it on its own terms).

LynxBenchAI measures training and inference as distinct workloads with distinct metrics, rather than collapsing both into a single hardware score. It is a benchmarking methodology for AI hardware, measuring sustained performance across the complete hardware-and-software stack, reported per precision, with bounded optimisation.

Frequently Asked Questions

Why do training and inference stress different parts of a GPU system?

Training is dominated by dense forward/backward matrix multiplications that saturate compute units and rely on GPU-to-GPU interconnect bandwidth for gradient synchronization. Inference, especially autoregressive LLM decoding, is dominated by memory reads of weights and KV cache from HBM, plus per-request scheduling. The two phases exercise different parts of the silicon and the system around it.

How is GPU performance for inference different from GPU performance for training?

For training, the relevant measure is sustained compute throughput on large fixed batches with mixed precision. For inference, the relevant measure is per-request latency distribution under variable, dynamically batched load, often at lower precision (INT8, FP8, INT4). The same GPU can look very different on these two axes: a 2× training speedup between A100 and H100 might translate to anywhere from 1.3× to 3× on inference depending on the workload.

Why is inference often memory- and latency-bound while training tends to be compute-bound?

Training amortizes overhead across large batches, so tensor cores stay busy and the bottleneck is arithmetic. Inference, particularly token-by-token generation, reads model weights and KV cache from HBM for every step, making memory bandwidth the limit before compute is. Per-request latency also matters in inference because it directly affects user experience, while in training only epoch-level wall-clock time matters.

Why don’t training benchmark results generalize automatically to inference deployments?

Training and inference are different execution contexts exercising different hardware features under different conditions: batch size, precision, scheduling, and memory access pattern all change. A benchmark result captures performance at one point in that space. Generalizing across phases assumes the bottleneck transfers, but it usually doesn’t.

What changes about batching, memory pressure, and latency when moving the same model from training to production inference?

Batches go from fixed and large to variable and request-shaped, which forces dynamic or continuous batching and iteration-level scheduling. Memory pressure shifts from optimizer and activation state to weight reads and KV cache. Latency becomes a primary optimization target rather than a side effect of throughput.

Why does the “right GPU” question usually have a different answer for training and for inference?

Training systems benefit from high compute throughput, large HBM, NVLink/InfiniBand interconnects, and sustained power delivery. Inference systems benefit from high memory bandwidth, low-latency host I/O, efficient multi-request scheduling, and power efficiency under variable load. Optimizing one configuration for both phases is a common source of underperformance.