In one room, the throughput team celebrates. In the next room, the serving team is debugging latency spikes.
They’re looking at the same system, the same GPU, the same deployment. The throughput number went up after a batching change. Tokens per second are higher than they’ve ever been. Meanwhile, the p99 response time has doubled, the user-facing SLO is being violated, and the serving team is trying to figure out what went wrong.
Nothing went wrong. The system was optimized for one objective, and it degraded on another. This is not a bug — it’s the central tension in inference system design, and it shows up every time someone treats “performance” as a one-dimensional concept.
Throughput and latency are different axes, and in most practical AI systems they compete for the same resources. Optimizing for one without tracking the other is how you end up with a system that looks great by one measure and fails by another.
Throughput and latency answer different questions
Throughput measures volume over time: how many tokens, images, or inference steps the system can complete in a sustained window — the same lens used for steady-state performance, cost, and capacity planning. It answers the capacity question — “how much work can this system handle?”
Latency measures time per unit of work: how long a single request takes from the caller’s perspective. It answers the responsiveness question — “how long does someone wait?”
These sound like they should be correlated — more throughput should mean faster individual requests, right? In practice, the relationship is often inverse once you push past the system’s natural operating point, because the mechanisms that increase throughput frequently increase per-request latency as a side effect.
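A toy queueing model makes the inverse relationship visible. Assuming an M/M/1 queue (a simplification; real serving systems have batching and parallelism this model ignores), mean time in system is 1/(μ − λ), which diverges as arrival rate approaches capacity — throughput climbs toward the ceiling while per-request latency blows up:

```python
# Toy illustration of why latency rises as throughput approaches capacity.
# Assumes an M/M/1 queue: service rate in requests/sec, arrival rate below
# it. Mean time in system = 1 / (service_rate - arrival_rate).

def mean_latency_ms(arrival_rate: float, service_rate: float) -> float:
    """Mean time in system (ms) for an M/M/1 queue."""
    if arrival_rate >= service_rate:
        raise ValueError("queue is unstable at or beyond capacity")
    return 1000.0 / (service_rate - arrival_rate)

service_rate = 100.0  # system completes 100 requests/sec at capacity
for utilization in (0.5, 0.8, 0.9, 0.99):
    lat = mean_latency_ms(utilization * service_rate, service_rate)
    print(f"utilization {utilization:.0%}: mean latency {lat:.1f} ms")
```

Going from 50% to 99% utilization roughly doubles throughput but grows mean latency fifty-fold — the “natural operating point” is wherever that curve starts to bend.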
Batch size is where the trade-off becomes concrete
Batching is the clearest lever for this tension, and on GPU-based inference systems it’s usually the primary one.
Larger batches improve throughput because they amortize fixed overhead: kernel launch cost, memory allocation, scheduling, and framework-level bookkeeping. A batch of 32 requests pays these overheads once rather than 32 times, and the GPU’s parallel architecture means the incremental cost of additional items within a batch is often much less than the cost of processing them individually. So total tokens per second goes up.
But from an individual request’s perspective, larger batches mean waiting. Each request must wait until enough peers have arrived to form a batch (queueing delay), and then the entire batch must complete before any individual result is returned (processing delay). Both of these push per-request latency upward.
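Both effects can be sketched with a simple cost model. The numbers below are assumptions for illustration, not measurements: each batch pays a fixed overhead once plus a per-item cost, and a request waits on average half the batch-fill time at the assumed arrival rate.

```python
# Sketch of the batch-size trade-off under an assumed cost model.

FIXED_OVERHEAD_MS = 10.0   # kernel launch, allocation, scheduling (assumed)
PER_ITEM_MS = 1.0          # incremental cost per item in a batch (assumed)
ARRIVAL_RATE_PER_MS = 0.5  # assumed request arrival rate

def batch_stats(batch_size: int) -> tuple[float, float]:
    """Return (throughput in items/ms, avg per-request latency in ms)."""
    processing = FIXED_OVERHEAD_MS + PER_ITEM_MS * batch_size
    throughput = batch_size / processing
    # A request waits, on average, for half the batch to fill behind it.
    avg_fill_wait = (batch_size - 1) / (2 * ARRIVAL_RATE_PER_MS)
    latency = avg_fill_wait + processing
    return throughput, latency

for b in (1, 8, 32, 128):
    tp, lat = batch_stats(b)
    print(f"batch {b:>3}: {tp:.3f} items/ms, {lat:.1f} ms per request")
```

Under this model, throughput and per-request latency both increase monotonically with batch size — which is exactly the tension: the same knob moves both numbers, in opposite directions of “good.”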
This isn’t a tuning problem with a universally correct answer. It’s a design trade-off, and the correct operating point depends entirely on what the system is optimizing for. A throughput-maximizing batch configuration looks very different from a latency-minimizing one, and a system optimized for one will underperform on the other — by design, not by defect.
In our experience, the teams who get into trouble are not the ones who make this trade-off explicitly; they’re the ones who make it accidentally — by optimizing for whatever the dashboard emphasizes without realizing they’ve shifted the system into a regime that violates a different objective.
Averages hide what users actually experience
In latency-sensitive systems, average latency is one of the least useful statistics you can report, and it’s often the one that gets the most attention.
The problem is that the average is dominated by the easy cases. If 95% of requests complete in 40ms and 5% take 500ms, the average looks like ~63ms — a number that describes neither the experience of the majority (40ms) nor the experience of the tail (500ms). The system looks “fine” by the average while 1 in 20 users has a genuinely degraded experience.
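The arithmetic is easy to verify on a synthetic sample matching the example above (nearest-rank percentiles on 100 requests):

```python
# 95% of requests at 40 ms, 5% at 500 ms — reproduce the mean vs. tail gap.
import statistics

latencies = sorted([40.0] * 95 + [500.0] * 5)

mean = statistics.mean(latencies)          # dominated by the easy cases
p50 = latencies[len(latencies) // 2]       # what the typical user sees
p99 = latencies[98]                        # what the unlucky user sees

print(f"mean={mean:.0f}ms  p50={p50:.0f}ms  p99={p99:.0f}ms")
```

The mean lands at 63 ms, a value no actual request experienced; p50 and p99 recover the two real populations.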
This is why percentile metrics — p50, p95, p99, sometimes p999 — matter so much for serving workloads. They tell you what happens at different points in the distribution, including the tail, which is usually where operational pain concentrates: timeout-triggering latency, retries, and cascading failures in downstream systems.
Contention, queueing bursts, GC pauses, CUDA context switching, and intermittent memory pressure all show up in the tail long before they affect the mean. A system can have a stable average while the tail quietly grows worse — especially under increased load, which is, of course, exactly when it matters most.
Model latency and system latency: don’t collapse them
Another common confusion is treating “model latency” and “system latency” as interchangeable.
Model latency covers the forward pass execution on the device — the time from input tensors on the GPU to output tensors on the GPU. System latency includes everything else: request parsing, tokenization, batching policy decisions, queueing, memory management, output detokenization, and transport back to the caller.
In a well-optimized serving system, model latency may account for only a fraction of end-to-end time. The rest is system overhead — not overhead in the pejorative sense, but the real mechanical cost of operating a service. If you measure only model latency (because it’s what profiling tools show most clearly), you may conclude the GPU is fast while the user is waiting on something the GPU has nothing to do with.
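One way to keep the two measurements separate is to instrument them separately at the request boundary. The sketch below uses hypothetical stand-ins (`handle_request`, `run_model` are illustrative, not from any serving framework); the point is the measurement structure, not the serving code:

```python
# Sketch: record model latency and end-to-end system latency as distinct
# numbers, so the gap between them stays visible.
import time

def run_model(tokens):
    time.sleep(0.005)  # stand-in for the GPU forward pass
    return tokens

def handle_request(payload: str):
    t0 = time.perf_counter()
    tokens = list(payload)        # stand-in for parsing + tokenization
    t1 = time.perf_counter()
    output = run_model(tokens)    # the only span "model latency" covers
    t2 = time.perf_counter()
    text = "".join(output)        # stand-in for detokenization + transport
    t3 = time.perf_counter()
    return text, {
        "model_ms": (t2 - t1) * 1000,
        "system_ms": (t3 - t0) * 1000,  # always >= model_ms
    }

_, timings = handle_request("hello")
print(timings)
```

If `system_ms` is 5× `model_ms`, no amount of kernel optimization will fix the user-visible number — the lever is elsewhere.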
Collapsing these into one number creates an illusion of clarity while hiding the actual lever you’d need to pull. And when someone says “the GPU is fast but the service is slow,” the explanation is almost always in this gap.
Declaring the objective is not optional
The practical consequence of this tension is that every performance claim about an inference system needs to state what it’s optimizing for. A throughput number without a latency constraint is incomplete. A latency number without a throughput context is incomplete. A benchmark that just reports “tokens per second” without specifying the batch configuration, the concurrency model, and whether latency was constrained to anything is reporting a number you can cite but not one you can safely plan against.
As we discussed in the context of peak vs. steady-state behavior, the temporal regime of measurement matters too — a system’s throughput-latency trade-off can itself shift as the system transitions from peak to steady-state operation.
The organizations that navigate this well are the ones that declare their objective up front: “we optimize for p99 latency under this concurrency level” or “we optimize for sustained throughput with latency bounded below X.” That declaration constrains the design space and makes performance results interpretable. Without it, you’re optimizing a number without knowing whether the thing the number represents is the thing your users actually care about.
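A declared objective can even be made executable, so a benchmark result is either valid under the stated constraints or rejected outright rather than silently reinterpreted. The field names below are illustrative, not from any particular benchmarking tool:

```python
# Sketch: encode "sustained throughput with p99 latency bounded below X at
# a declared concurrency" as a check rather than a caption.
from dataclasses import dataclass

@dataclass
class Objective:
    p99_latency_bound_ms: float  # hard constraint on the tail
    concurrency: int             # the regime the numbers apply to

@dataclass
class BenchmarkResult:
    tokens_per_sec: float
    p99_latency_ms: float
    concurrency: int

def meets_objective(result: BenchmarkResult, obj: Objective) -> bool:
    # A throughput number only counts if it was measured in the declared
    # regime AND the latency constraint held while it was measured.
    return (result.concurrency == obj.concurrency
            and result.p99_latency_ms <= obj.p99_latency_bound_ms)

obj = Objective(p99_latency_bound_ms=200.0, concurrency=64)
run = BenchmarkResult(tokens_per_sec=12000.0, p99_latency_ms=350.0,
                      concurrency=64)
print(meets_objective(run, obj))  # high throughput, but the bound was violated
```

The useful property is that the headline number cannot be quoted without the constraints it was measured under — the objective travels with the result.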