In one room, the throughput team celebrates. In the next room, the serving team is debugging latency spikes. They’re looking at the same system, the same GPU, the same deployment.

The throughput number went up after a batching change. Tokens per second are higher than they’ve ever been. Meanwhile, the p99 response time has doubled, the user-facing SLO is being violated, and the serving team is trying to figure out what went wrong.

Nothing went wrong. The system was optimized for one objective, and it degraded on another. This is not a bug; it’s the central tension in inference system design, and it shows up every time someone treats “performance” as a one-dimensional concept.

Throughput and latency are different axes, and in most practical AI systems they compete for the same resources. Optimizing for one without tracking the other is how you end up with a system that looks great by one measure and fails by another.

Throughput and latency answer different questions

Throughput measures volume over time: how many tokens, images, or inference steps the system can complete in a sustained window (exactly the planning lens in steady-state performance, cost, and capacity planning). It answers the capacity question: “how much work can this system handle?”

Latency measures time per unit of work: how long a single request takes from the caller’s perspective. It answers the responsiveness question: “how long does someone wait?”

These sound like they should be correlated: more throughput should mean faster individual requests, right? In practice, the relationship is often inverse once you push past the system’s natural operating point, because the mechanisms that increase throughput frequently increase per-request latency as a side effect.

Throughput vs. latency: two objectives that compete

| Dimension | Throughput optimization | Latency optimization |
| --- | --- | --- |
| Primary metric | Total work per unit time (tokens/s, images/s) | Time per request (p50, p99, TTFT) |
| Batching strategy | Large batches to amortize overhead | Small batches to minimize queueing delay |
| GPU utilization pattern | High, sustained; GPU stays busy | Variable; GPU may idle between small batches |
| Who cares | Batch processing, offline inference, cost planning | Interactive serving, real-time APIs, user-facing SLAs |
| Risk of over-optimizing | Individual request latency spikes | Under-utilized hardware, higher cost per request |

Batch size is where the trade-off becomes concrete

Batching is the clearest lever for this tension, and on GPU-based inference systems it’s usually the primary one.

Larger batches improve throughput because they amortize fixed overhead: kernel launch cost, memory allocation, scheduling, and framework-level bookkeeping. A batch of 32 requests pays these overheads once rather than 32 times, and the GPU’s parallel architecture means the incremental cost of additional items within a batch is often much less than the cost of processing them individually. So total tokens per second goes up.

But from an individual request’s perspective, larger batches mean waiting. Each request must wait until enough peers have arrived to form a batch (queueing delay), and then the entire batch must complete before any individual result is returned (processing delay). Both of these push per-request latency upward.

This isn’t a tuning problem with a universally correct answer. It’s a design trade-off, and the correct operating point depends entirely on what the system is optimizing for. A throughput-maximizing batch configuration looks very different from a latency-minimizing one, and a system optimized for one will underperform on the other, by design and not by defect.

We find that the teams who get into trouble are not the ones who make this trade-off explicitly.
They’re the ones who make it accidentally, by optimizing for whatever the dashboard emphasizes without realizing they’ve shifted the system into a regime that violates a different objective.

Averages hide what users actually experience

In latency-sensitive systems, average latency is one of the least useful statistics you can report, and it’s often the one that gets the most attention.

The problem is that the average is dominated by the easy cases. If 95% of requests complete in 40 ms and 5% take 500 ms, the average comes out to 63 ms (0.95 × 40 + 0.05 × 500), a number that describes neither the experience of the majority (40 ms) nor the experience of the tail (500 ms). The system looks “fine” by the average while 1 in 20 users has a genuinely degraded experience.

This is why percentile metrics (p50, p95, p99, sometimes p999) matter so much for serving workloads. They tell you what happens at different points in the distribution, including the tail, which is usually where operational pain concentrates: timeout-triggering latency, retries, and cascading failures in downstream systems.

Contention, queueing bursts, GC pauses, CUDA context switching, and intermittent memory pressure all show up in the tail long before they affect the mean. A system can have a stable average while the tail quietly grows worse, especially under increased load, which is exactly when it matters most.

How do model latency and system latency differ?

Another common confusion is treating “model latency” and “system latency” as interchangeable.

Model latency covers the forward pass execution on the device: the time from input tensors on the GPU to output tensors on the GPU. System latency includes everything else: request parsing, tokenization, batching policy decisions, queueing, memory management, output detokenization, and transport back to the caller.

In a well-optimized serving system, model latency may account for only a fraction of end-to-end time.
The rest is system overhead: not overhead in the pejorative sense, but the real mechanical cost of operating a service. If you measure only model latency (because it’s what profiling tools show most clearly), you may conclude the GPU is fast while the user is waiting on something the GPU has nothing to do with.

Collapsing these into one number creates an illusion of clarity while hiding the actual lever you’d need to pull. And when someone says “the GPU is fast but the service is slow,” the explanation is almost always in this gap.

Declaring the objective is not optional

The practical consequence of this tension is that every performance claim about an inference system needs to state what it’s optimizing for.

A throughput number without a latency constraint is incomplete. A latency number without a throughput context is incomplete. A benchmark that reports only “tokens per second”, without specifying the batch configuration, the concurrency model, and whether latency was constrained at all, is reporting a number you can cite but not one you can safely plan against.

As we discussed in the context of peak vs. steady-state behavior, the temporal regime of measurement matters too: a system’s throughput-latency trade-off can itself shift as the system transitions from peak to steady-state operation.

The organizations that navigate this well are the ones that declare their objective up front: “we optimize for p99 latency under this concurrency level” or “we optimize for sustained throughput with latency kept below X.” That declaration constrains the design space and makes performance results interpretable. Without it, you’re optimizing a number without knowing whether the thing the number represents is the thing your users actually care about.

Related deep-dives

Latency definition for AI inference: a domain-specific anchor. What latency means for inference and how it differs from networking and storage.
Latency testing for AI inference: a methodology beyond best-case numbers. The batch / concurrency / arrival axes a real latency test must declare.
Throughput definition for AI inference: why batch size is part of the number. The throughput-side definitional anchor and its inseparability from batch policy.

LynxBenchAI requires that objective declaration up front: results are scoped to declared throughput-and-latency operating points, not free-floating numbers that shift with the configuration. It is a benchmarking methodology for AI hardware that measures sustained performance across the complete hardware-and-software stack, reported per precision, with bounded optimization.

Frequently Asked Questions

Why do throughput and latency compete for the same resources in an AI inference system?

The mechanisms that raise throughput (larger batches, sustained GPU occupancy, amortized overhead) are the same mechanisms that add queueing and processing delay to any single request. Once you push past the system’s natural operating point, the relationship between the two often inverts: total tokens per second rises while per-request time gets worse. They share the GPU, the scheduler, and the memory subsystem, so a configuration tuned for one regime is structurally suboptimal for the other.

How does batch size reshape both throughput and latency together?

Batch size is the clearest lever for this trade-off on GPU-based inference. Larger batches amortize kernel launches, allocation, and framework bookkeeping across many requests, which pushes throughput up. But every request now waits for peers to arrive (queueing delay) and for the whole batch to finish (processing delay), which pushes latency up. There is no universally correct batch size, only a batch size that is correct for a declared objective.

Why is average latency a misleading metric in latency-sensitive AI systems, and what should be reported instead?

The average is dominated by the easy cases and hides the tail.
If 95% of requests complete in 40 ms and 5% take 500 ms, the mean of 63 ms describes neither population. Percentile metrics (p50, p95, p99, sometimes p999) should be reported instead, because operational pain (timeouts, retries, cascading downstream failures) concentrates in the tail, and the tail grows worse under load long before the mean shifts.

When is throughput the right optimization target, and when is latency the right one?

Throughput is the right target for batch processing, offline inference, and cost planning: workloads where total work per unit time and hardware efficiency are what matter. Latency is the right target for interactive serving, real-time APIs, and any user-facing SLA, where the time a single caller waits is the operational currency. The decision is not which metric is universally better; it is which question your users are actually asking.

What is the difference between model latency and end-to-end system latency in an inference benchmark?

Model latency is the forward pass on the device: input tensors on the GPU to output tensors on the GPU. System latency includes everything around it: request parsing, tokenization, batching policy decisions, queueing, memory management, detokenization, and transport back to the caller. In a well-optimized serving system, model latency is often only a fraction of end-to-end time, which is why measuring only the GPU pass can make the hardware look fast while users still wait.

Why is choosing the wrong target between throughput and latency one of the more expensive mistakes in inference engineering?

Because the resulting system looks correct by one measure and fails by another, and the failure usually surfaces in production rather than in benchmarks. Teams who pick the wrong target tend to do so accidentally, by optimizing whatever the dashboard emphasizes, and only discover the mismatch when an SLO breaks or capacity planning misses.
Declaring the objective up front, as covered in the section on declaring the objective, is what makes performance results interpretable and prevents that class of mistake.
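
Two of the points above can be made concrete with a small sketch: the 40 ms / 500 ms arithmetic behind the mean-versus-percentile argument, and what "declaring the objective" looks like when choosing a batch size. The Python below is a minimal illustration under stated assumptions, not a description of any real serving stack. The `ToyBatchModel` class, its constants (8 ms fixed overhead per batch, 0.5 ms per request, one arrival every 2 ms), and the `pick_batch_size` helper are all hypothetical, and the latency model ignores backlog by assuming the GPU is idle whenever a batch fills.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Iterable, Optional, Sequence


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile for an integer p between 1 and 100."""
    if not samples or not 1 <= p <= 100:
        raise ValueError("need at least one sample and 1 <= p <= 100")
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # integer ceil(p * n / 100)
    return ordered[rank - 1]


# The example from the text: 95% of requests at 40 ms, 5% at 500 ms.
latencies_ms = [40.0] * 95 + [500.0] * 5


@dataclass(frozen=True)
class ToyBatchModel:
    """Hypothetical cost model; every constant is an illustrative assumption."""
    overhead_ms: float = 8.0     # fixed cost paid once per batch
    per_item_ms: float = 0.5     # incremental cost per request in a batch
    arrival_gap_ms: float = 2.0  # one request arrives every 2 ms

    def batch_time_ms(self, batch_size: int) -> float:
        return self.overhead_ms + self.per_item_ms * batch_size

    def throughput_rps(self, batch_size: int) -> float:
        """Requests per second if full batches run back to back."""
        return 1000.0 * batch_size / self.batch_time_ms(batch_size)

    def worst_latency_ms(self, batch_size: int) -> float:
        """Latency of the first request to join a batch: it waits for the
        batch to fill (queueing delay), then for the whole batch to finish
        (processing delay). Assumes no backlog when the batch fills."""
        fill_wait_ms = (batch_size - 1) * self.arrival_gap_ms
        return fill_wait_ms + self.batch_time_ms(batch_size)


def pick_batch_size(model: ToyBatchModel,
                    candidates: Iterable[int],
                    latency_budget_ms: float) -> Optional[int]:
    """Declared objective: maximize throughput subject to a latency bound."""
    feasible = [b for b in candidates
                if model.worst_latency_ms(b) <= latency_budget_ms]
    if not feasible:
        return None  # no candidate satisfies the declared bound
    return max(feasible, key=model.throughput_rps)


if __name__ == "__main__":
    # Mean is 63.0 ms; p50 is 40.0 ms; p99 is 500.0 ms.
    print(mean(latencies_ms), percentile(latencies_ms, 50),
          percentile(latencies_ms, 99))
    model = ToyBatchModel()
    for b in (1, 4, 8, 16, 32, 64):
        print(b, round(model.throughput_rps(b), 1), model.worst_latency_ms(b))
    print(pick_batch_size(model, (1, 4, 8, 16, 32, 64), latency_budget_ms=50.0))
```

Under these assumed constants, both throughput and worst-case latency rise with batch size, so the "best" batch size only exists once a bound is declared: a 50 ms budget selects a batch of 16, while a 200 ms budget selects 64. Real systems would replace the toy model with measured percentiles at each operating point.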