In one room, the throughput team celebrates. In the next room, the serving team is debugging latency spikes.
They’re looking at the same system, the same GPU, the same deployment. The throughput number went up after a batching change. Tokens per second are higher than they’ve ever been. Meanwhile, the p99 response time has doubled, the user-facing SLO is being violated, and the serving team is trying to figure out what went wrong.
Nothing went wrong. The system was optimized for one objective, and it degraded on another. This is not a bug — it’s the central tension in inference system design, and it shows up every time someone treats “performance” as a one-dimensional concept.
Throughput and latency are different axes, and in most practical AI systems they compete for the same resources. Optimizing for one without tracking the other is how you end up with a system that looks great by one measure and fails by another.
Throughput and latency answer different questions
Throughput measures volume over time: how many tokens, images, or inference steps the system can complete in a sustained window — the same lens used for steady-state performance, cost, and capacity planning. It answers the capacity question — “how much work can this system handle?”
Latency measures time per unit of work: how long a single request takes from the caller’s perspective. It answers the responsiveness question — “how long does someone wait?”
These sound like they should be correlated — more throughput should mean faster individual requests, right? In practice, the relationship is often inverse once you push past the system’s natural operating point, because the mechanisms that increase throughput frequently increase per-request latency as a side effect.
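A toy queueing model makes the inverse relationship visible. Assuming an M/M/1 queue (a simplification; real serving systems have batching and parallelism this model ignores), mean time in system is 1/(μ − λ), which diverges as arrival rate approaches capacity — throughput climbs toward the ceiling while per-request latency blows up:

```python
# Toy illustration of why latency rises as throughput approaches capacity.
# Assumes an M/M/1 queue: service rate in requests/sec, arrival rate below
# it. Mean time in system = 1 / (service_rate - arrival_rate).

def mean_latency_ms(arrival_rate: float, service_rate: float) -> float:
    """Mean time in system (ms) for an M/M/1 queue."""
    if arrival_rate >= service_rate:
        raise ValueError("queue is unstable at or beyond capacity")
    return 1000.0 / (service_rate - arrival_rate)

service_rate = 100.0  # system completes 100 requests/sec at capacity
for utilization in (0.5, 0.8, 0.9, 0.99):
    lat = mean_latency_ms(utilization * service_rate, service_rate)
    print(f"utilization {utilization:.0%}: mean latency {lat:.1f} ms")
```

Going from 50% to 99% utilization roughly doubles throughput but grows mean latency fifty-fold — the “natural operating point” is wherever that curve starts to bend.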
Batch size is where the trade-off becomes concrete
Batching is the clearest lever for this tension, and on GPU-based inference systems it’s usually the primary one.
Larger batches improve throughput because they amortize fixed overhead: kernel launch cost, memory allocation, scheduling, and framework-level bookkeeping. A batch of 32 requests pays these overheads once rather than 32 times, and the GPU’s parallel architecture means the incremental cost of additional items within a batch is often much less than the cost of processing them individually. So total tokens per second goes up.
But from an individual request’s perspective, larger batches mean waiting. Each request must wait until enough peers have arrived to form a batch (queueing delay), and then the entire batch must complete before any individual result is returned (processing delay). Both of these push per-request latency upward.
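Both effects can be sketched with a simple cost model. The numbers below are assumptions for illustration, not measurements: each batch pays a fixed overhead once plus a per-item cost, and a request waits on average half the batch-fill time at the assumed arrival rate.

```python
# Sketch of the batch-size trade-off under an assumed cost model.

FIXED_OVERHEAD_MS = 10.0   # kernel launch, allocation, scheduling (assumed)
PER_ITEM_MS = 1.0          # incremental cost per item in a batch (assumed)
ARRIVAL_RATE_PER_MS = 0.5  # assumed request arrival rate

def batch_stats(batch_size: int) -> tuple[float, float]:
    """Return (throughput in items/ms, avg per-request latency in ms)."""
    processing = FIXED_OVERHEAD_MS + PER_ITEM_MS * batch_size
    throughput = batch_size / processing
    # A request waits, on average, for half the batch to fill behind it.
    avg_fill_wait = (batch_size - 1) / (2 * ARRIVAL_RATE_PER_MS)
    latency = avg_fill_wait + processing
    return throughput, latency

for b in (1, 8, 32, 128):
    tp, lat = batch_stats(b)
    print(f"batch {b:>3}: {tp:.3f} items/ms, {lat:.1f} ms per request")
```

Under this model, throughput and per-request latency both increase monotonically with batch size — which is exactly the tension: the same knob moves both numbers, in opposite directions of “good.”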
This isn’t a tuning problem with a universally correct answer. It’s a design trade-off, and the correct operating point depends entirely on what the system is optimizing for. A throughput-maximizing batch configuration looks very different from a latency-minimizing one, and a system optimized for one will underperform on the other — by design, not by defect.
In our experience, the teams who get into trouble are not the ones who make this trade-off explicitly; they’re the ones who make it accidentally — by optimizing for whatever the dashboard emphasizes without realizing they’ve shifted the system into a regime that violates a different objective.
Averages hide what users actually experience
In latency-sensitive systems, average latency is one of the least useful statistics you can report, and it’s often the one that gets the most attention.
The problem is that the average is dominated by the easy cases. If 95% of requests complete in 40ms and 5% take 500ms, the average looks like ~63ms — a number that describes neither the experience of the majority (40ms) nor the experience of the tail (500ms). The system looks “fine” by the average while 1 in 20 users has a genuinely degraded experience.
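The arithmetic is easy to verify on a synthetic sample matching the example above (nearest-rank percentiles on 100 requests):

```python
# 95% of requests at 40 ms, 5% at 500 ms — reproduce the mean vs. tail gap.
import statistics

latencies = sorted([40.0] * 95 + [500.0] * 5)

mean = statistics.mean(latencies)          # dominated by the easy cases
p50 = latencies[len(latencies) // 2]       # what the typical user sees
p99 = latencies[98]                        # what the unlucky user sees

print(f"mean={mean:.0f}ms  p50={p50:.0f}ms  p99={p99:.0f}ms")
```

The mean lands at 63 ms, a value no actual request experienced; p50 and p99 recover the two real populations.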
This is why percentile metrics — p50, p95, p99, sometimes p999 — matter so much for serving workloads. They tell you what happens at different points in the distribution, including the tail, which is usually where operational pain concentrates: timeout-triggering latency, retries, and cascading failures in downstream systems.
Contention, queueing bursts, GC pauses, CUDA context switching, and intermittent memory pressure all show up in the tail long before they affect the mean. A system can have a stable average while the tail quietly grows worse — especially under increased load, which is, of course, exactly when it matters most.
Model latency and system latency: don’t collapse them
Another common confusion is treating “model latency” and “system latency” as interchangeable.
Model latency covers the forward pass execution on the device — the time from input tensors on the GPU to output tensors on the GPU. System latency includes everything else: request parsing, tokenization, batching policy decisions, queueing, memory management, output detokenization, and transport back to the caller.
In a well-optimized serving system, model latency may account for only a fraction of end-to-end time. The rest is system overhead — not overhead in the pejorative sense, but the real mechanical cost of operating a service. If you measure only model latency (because it’s what profiling tools show most clearly), you may conclude the GPU is fast while the user is waiting on something the GPU has nothing to do with.
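One way to keep the two measurements separate is to instrument them separately at the request boundary. The sketch below uses hypothetical stand-ins (`handle_request`, `run_model` are illustrative, not from any serving framework); the point is the measurement structure, not the serving code:

```python
# Sketch: record model latency and end-to-end system latency as distinct
# numbers, so the gap between them stays visible.
import time

def run_model(tokens):
    time.sleep(0.005)  # stand-in for the GPU forward pass
    return tokens

def handle_request(payload: str):
    t0 = time.perf_counter()
    tokens = list(payload)        # stand-in for parsing + tokenization
    t1 = time.perf_counter()
    output = run_model(tokens)    # the only span "model latency" covers
    t2 = time.perf_counter()
    text = "".join(output)        # stand-in for detokenization + transport
    t3 = time.perf_counter()
    return text, {
        "model_ms": (t2 - t1) * 1000,
        "system_ms": (t3 - t0) * 1000,  # always >= model_ms
    }

_, timings = handle_request("hello")
print(timings)
```

If `system_ms` is 5× `model_ms`, no amount of kernel optimization will fix the user-visible number — the lever is elsewhere.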
Collapsing these into one number creates an illusion of clarity while hiding the actual lever you’d need to pull. And when someone says “the GPU is fast but the service is slow,” the explanation is almost always in this gap.
Declaring the objective is not optional
The practical consequence of this tension is that every performance claim about an inference system needs to state what it’s optimizing for. A throughput number without a latency constraint is incomplete. A latency number without a throughput context is incomplete. A benchmark that just reports “tokens per second” without specifying the batch configuration, the concurrency model, and whether latency was constrained to anything is reporting a number you can cite but not one you can safely plan against.
As we discussed in the context of peak vs. steady-state behavior, the temporal regime of measurement matters too — a system’s throughput-latency trade-off can itself shift as the system transitions from peak to steady-state operation.
The organizations that navigate this well are the ones that declare their objective up front: “we optimize for p99 latency under this concurrency level” or “we optimize for sustained throughput with latency bounded below X.” That declaration constrains the design space and makes performance results interpretable. Without it, you’re optimizing a number without knowing whether the thing the number represents is the thing your users actually care about.
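A declared objective can even be made executable, so a benchmark result is either valid under the stated constraints or rejected outright rather than silently reinterpreted. The field names below are illustrative, not from any particular benchmarking tool:

```python
# Sketch: encode "sustained throughput with p99 latency bounded below X at
# a declared concurrency" as a check rather than a caption.
from dataclasses import dataclass

@dataclass
class Objective:
    p99_latency_bound_ms: float  # hard constraint on the tail
    concurrency: int             # the regime the numbers apply to

@dataclass
class BenchmarkResult:
    tokens_per_sec: float
    p99_latency_ms: float
    concurrency: int

def meets_objective(result: BenchmarkResult, obj: Objective) -> bool:
    # A throughput number only counts if it was measured in the declared
    # regime AND the latency constraint held while it was measured.
    return (result.concurrency == obj.concurrency
            and result.p99_latency_ms <= obj.p99_latency_bound_ms)

obj = Objective(p99_latency_bound_ms=200.0, concurrency=64)
run = BenchmarkResult(tokens_per_sec=12000.0, p99_latency_ms=350.0,
                      concurrency=64)
print(meets_objective(run, obj))  # high throughput, but the bound was violated
```

The useful property is that the headline number cannot be quoted without the constraints it was measured under — the objective travels with the result.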