Run a throughput benchmark, report tokens-per-second on a single GPU, pick the config with the biggest number. That is how most inference benchmarks get run, and it is how a serving config that looks fast in isolation ends up losing money on every request once you turn on the constraints production actually imposes.

The problem is not that tokens-per-second is wrong. It is that it answers a question your business never asked. A finance team does not approve infrastructure spend in tokens-per-second; it approves it in cost per request, gross margin, and whether the latency is good enough that users do not churn. A benchmark that ranks candidate configs by raw throughput produces a recommendation that has to be translated into those units later, and the translation regularly flips the ranking. A benchmark grounded in cost-per-request survives the translation because it was already measured in the units the business cares about.

What you measure matters more than how fast the headline number looks: the throughput-optimal config can be the wrong one, and a worked cost-per-request comparison between two serving configurations shows why.

What Should an Inference Benchmark Measure If the Goal Is Cost-Per-Request?

Start from the decision you are trying to make. You are choosing a serving configuration (a model, a runtime, a batch policy, a GPU type, a concurrency level) and you want the one that serves your real traffic at the lowest cost without breaking your latency commitment. That framing forces three measurements that a tokens-per-second headline omits.

The first is cost-per-request and cost-per-token at a fixed latency target, not at peak throughput. Peak throughput is achieved by batching aggressively, and aggressive batching trades latency for utilisation. If your product promises a p95 response time, you cannot spend that latency budget on batching to win a benchmark you will never run in production.
The second is p95 latency at each config, measured under the same offered load. Mean latency hides the tail, and the tail is what users feel and what SLOs are written against. Two configs with identical median latency can have p95 values that differ by a factor of two once a queue forms.

The third is the gross-margin delta between the throughput-optimal and the cost-optimal config. This is the number that decides things. It is the answer to “what does picking the fast-looking config instead of the cheap one cost us per month at our request volume?”, and it is frequently large enough to reverse the decision on its own.

This is the same shift we argue for in why cost-per-request is the right production AI optimisation target: the unit of optimisation should match the unit of the business, not the unit of the hardware spec sheet.

The principle that a benchmark is decision infrastructure (something you build to make a defensible choice, not a leaderboard you climb) is argued well in LynxBenchAI’s treatment of how to benchmark a system for AI work; the cost-per-request framing here is one instance of that decision-first discipline applied to a serving config.

Why Can a Throughput-Optimal Serving Config Still Lose Money Per Request?

The intuition is that a faster config is a cheaper config. More tokens per second from the same GPU means more requests served, which means lower cost per request. That holds only when you can actually use the throughput, and in production you usually cannot.

Throughput on a benchmark harness is measured by saturating the GPU with as much work as it will take. You feed it large batches, you keep the queue full, you report the steady-state token rate. That number is real, but it was earned under conditions your live traffic does not reproduce.

Real traffic arrives unevenly, in bursts and lulls. To hit peak batch sizes you have to wait for enough requests to accumulate, and that wait is latency you are adding to every request in the batch.
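The batch-fill wait can be estimated with a minimal sketch, assuming requests arrive at a steady mean rate (Poisson-like arrivals) and the server dispatches a batch either when it is full or when the batch window expires. The function name `batch_fill_wait_ms` and the example arrival rates are hypothetical, and the estimate uses mean inter-arrival time only, so it ignores variance.

```python
def batch_fill_wait_ms(batch_size: int, arrival_rate_per_s: float, window_ms: float) -> float:
    """First-order estimate of the wait imposed on the first request in a batch.

    Assumption: with a mean arrival rate of `arrival_rate_per_s`, the mean time
    for the remaining (batch_size - 1) requests to arrive is
    (batch_size - 1) / arrival_rate_per_s seconds. The batch window caps the wait.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if arrival_rate_per_s <= 0:
        # No traffic: the first request waits out the whole window.
        return window_ms
    fill_ms = (batch_size - 1) / arrival_rate_per_s * 1000.0
    return min(fill_ms, window_ms)


if __name__ == "__main__":
    # Hypothetical traffic levels for a batch of 32 with a 120 ms window.
    for label, rate in (("busy", 400.0), ("quiet", 20.0)):
        wait = batch_fill_wait_ms(batch_size=32, arrival_rate_per_s=rate, window_ms=120.0)
        print(f"{label}: first request waits about {wait:.1f} ms before the batch runs")
```

Under these assumptions the batch fills before the window expires when traffic is heavy, while in a lull every batch pays the full window and still runs partly empty, which is the worst case for both latency and utilisation.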
The config that posts the biggest tokens-per-second number is often the config that waits the longest to fill a batch.

So the throughput-optimal config does two things at once: it serves a lot of tokens when it is busy, and it blows your p95 latency budget when traffic is light. To bring p95 back under the SLO you reduce the batch window, which reduces the achieved throughput, which raises the real cost per request. The headline number evaporates the moment you apply the latency constraint that was always going to apply.

There is a second mechanism. Cost-per-request is the cost of the GPU-hours you are billed for divided by the requests served, and GPU-hours are billed whether the GPU is busy or idle. A config tuned for peak throughput often needs a larger or more expensive instance to reach that peak; if your actual concurrency rarely fills it, you are paying for capacity you do not use. We see this pattern regularly: a team provisions for the benchmark and pays for the headroom every hour of the month.

Distinguishing raw spend from delivered value is exactly the distinction LynxBenchAI draws in cost, efficiency, and value for AI hardware, and it is why the cost-optimal config is rarely the one with the highest peak number.

How Do You Fix Latency So the Comparison Is Fair?

A benchmark comparison is only meaningful if the configs are compared under the same constraint. The mistake is to let each config run at its own most-flattering operating point (one at maximum batch, one at minimum latency) and then put the two numbers side by side. That is not a comparison; it is two unrelated measurements.

The fix is to hold p95 latency fixed at your product’s actual target and measure cost-per-request at that target for every config. Pick the number your SLO commits to (say 800 ms p95 for a chat completion, or 2 seconds for a long-form generation) and tune each config until it meets that bound under the same offered load. Then read off the cost.
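Checking a config against the bound can be sketched as below. This is a minimal illustration, assuming per-request latencies have already been collected from a load replay; the helper names `p95` and `meets_slo` and the sample latencies are hypothetical, and nearest-rank is only one of several common percentile definitions.

```python
import statistics
from typing import Sequence


def p95(latencies_ms: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sample."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_slo(latencies_ms: Sequence[float], slo_ms: float = 800.0) -> bool:
    """True when the sample's p95 is at or under the latency target."""
    return p95(latencies_ms) <= slo_ms


if __name__ == "__main__":
    # Invented sample: most requests are fast, a few queue behind a full batch.
    sample = [300.0] * 18 + [1500.0] * 2
    print("mean:", statistics.mean(sample), "ms")
    print("p95:", p95(sample), "ms")
    print("meets 800 ms SLO:", meets_slo(sample))
```

In this invented sample the mean sits well under 800 ms while the p95 does not, which is why the bound is checked on the tail and not on the average.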
Now the configs differ on one axis only: how cheaply each one delivers the latency you require.

This is the empirical, workload-bound posture LynxBenchAI describes as AI performance requiring measurement under real workload conditions rather than spec-sheet extrapolation. The same offered-load trace, the same latency bound, the same prompt and output length distribution: fix all of it, and the only variable left is the one you are trying to decide.

A short checklist for keeping a serving-config benchmark honest:

- Fix the latency target to your real SLO (p95, not mean) and hold it constant across configs.
- Replay realistic traffic (bursty arrival, your actual prompt/output length distribution), not a synthetic constant stream.
- Bill the full instance, including idle time, when computing cost-per-request; do not credit unused capacity.
- Report cost-per-request and cost-per-token together; prefill-heavy and decode-heavy workloads diverge, and one number can hide the other.
- Measure each config at the same offered load, so concurrency is a controlled variable rather than a free one.

The per-config utilisation and latency figures that feed this comparison come from profiling the serving path under load; the methodology we describe in GPU profiling for AI inference workloads supplies the raw measurements these examples consume.

What Does a Worked Cost-Per-Request Comparison Look Like?

Here is an illustrative comparison between two serving configurations for the same model, evaluated against an 800 ms p95 target. The numbers below are illustrative (they show the shape of the calculation, not a benchmark of any specific hardware) and follow the rule of measuring each config at the fixed latency bound rather than its peak.

Assume a request averages 1,200 tokens (400 prompt, 800 generated), and the product serves roughly 5 million requests per month.
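The arithmetic behind a comparison of this kind can be sketched as follows. The instance prices and at-SLO throughputs are the illustrative figures from this section's comparison table; the class and function names are hypothetical. One input is an assumption that the article does not state: `ASSUMED_UTILISATION`. The value 0.10 is used only because it is the average utilisation at which the table's instance price and at-SLO throughput reproduce its per-1k-token costs; a real benchmark would measure it from the traffic replay.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServingConfig:
    name: str
    instance_usd_per_hr: float  # billed whether the GPU is busy or idle
    tokens_per_s_at_slo: float  # effective throughput at the p95 target


def cost_per_1k_tokens(cfg: ServingConfig, utilisation: float) -> float:
    """Full instance cost spread over the tokens actually delivered in an hour."""
    if not 0.0 < utilisation <= 1.0:
        raise ValueError("utilisation must be in (0, 1]")
    delivered_tokens_per_hr = cfg.tokens_per_s_at_slo * 3600.0 * utilisation
    return cfg.instance_usd_per_hr / delivered_tokens_per_hr * 1000.0


def cost_per_request(cfg: ServingConfig, tokens_per_request: int, utilisation: float) -> float:
    return cost_per_1k_tokens(cfg, utilisation) * tokens_per_request / 1000.0


def monthly_cost(cfg: ServingConfig, tokens_per_request: int,
                 requests_per_month: int, utilisation: float) -> float:
    return cost_per_request(cfg, tokens_per_request, utilisation) * requests_per_month


# Illustrative inputs from the article's table.
CONFIG_A = ServingConfig("A (throughput-tuned)", instance_usd_per_hr=4.10, tokens_per_s_at_slo=2400.0)
CONFIG_B = ServingConfig("B (cost-tuned)", instance_usd_per_hr=2.90, tokens_per_s_at_slo=3500.0)

# ASSUMPTION, not a figure from the article: average share of at-SLO capacity
# that real traffic actually fills.
ASSUMED_UTILISATION = 0.10

if __name__ == "__main__":
    for cfg in (CONFIG_A, CONFIG_B):
        per_1k = cost_per_1k_tokens(cfg, ASSUMED_UTILISATION)
        per_req = cost_per_request(cfg, 1200, ASSUMED_UTILISATION)
        month = monthly_cost(cfg, 1200, 5_000_000, ASSUMED_UTILISATION)
        print(f"{cfg.name}: ${per_1k:.4f}/1k tok, ${per_req:.4f}/request, ${month:,.0f}/month")
```

Because idle time is billed in full, utilisation enters the denominator directly: halving it doubles cost per request for both configs, so it needs to be measured and not assumed when the comparison is run on real traffic.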
| Measure | Config A (throughput-tuned) | Config B (cost-tuned) |
|---|---|---|
| Batch policy | Large batch, 120 ms window | Small batch, 25 ms window |
| Peak tokens/sec (harness) | ~6,000 | ~3,800 |
| p95 latency at offered load | ~1,350 ms (over SLO) | ~780 ms (under SLO) |
| Effective tokens/sec at 800 ms p95 | ~2,400 | ~3,500 |
| Instance cost ($/hr, illustrative) | $4.10 | $2.90 |
| Cost per 1k tokens at SLO | ~$0.0047 | ~$0.0023 |
| Cost per request (1,200 tok) | ~$0.0057 | ~$0.0028 |

Config A wins the harness benchmark decisively: 6,000 tokens/sec versus 3,800. But it cannot hold the 800 ms p95 target at that operating point; its 120 ms batch window pushes the tail over the SLO. Throttle it back to meet the SLO and its effective throughput collapses below Config B’s, because the larger, pricier instance is now running below the load it needs to be efficient.

Translated to the business: at 5 million requests per month, Config A costs roughly $28,500 and Config B roughly $14,000. The throughput headline pointed at the option that costs about twice as much per request at the latency the product actually ships. That gross-margin delta, not the tokens-per-second number, is the output that decides the config (an illustrative worked example; the framing, not the figures, is the transferable part).

Batching and concurrency are the levers that move these numbers. Widening the batch window raises peak throughput and lowers cost-per-token if you can keep batches full without breaching latency; raising concurrency improves utilisation until the queue forms and the tail blows out. The cost-optimal point is the largest batch and highest concurrency that still clears your p95 bound, which is almost never the point that maximises the harness number. For the broader treatment of how these per-request costs roll up into a margin model, see unit economics for production AI.

How Does This Relate to MLPerf Inference?

Standard suites like MLPerf Inference are genuinely useful and genuinely limited for this decision.
MLPerf does enforce latency constraints: its Server scenario measures throughput subject to a latency bound, which is exactly the discipline a fair comparison needs, and it is reproducible and auditable in a way ad-hoc benchmarks are not. If you want to know how two accelerators compare on a standardised workload at a standardised latency, MLPerf answers it.

Where it stops short is the translation to your margin. MLPerf reports queries-per-second under a fixed latency constraint on a defined workload; it does not report cost-per-request on your traffic distribution, your instance pricing, your prompt and output lengths, or your offered-load curve. Two configs can rank identically on an MLPerf workload and diverge sharply on cost-per-request once your actual traffic and billing are applied.

The suite is a calibrated reference standard for the hardware-and-runtime layer; the margin-based comparison is the layer above it that maps a calibrated result to a defensible business decision. Use MLPerf to trust the measurement method, then run the cost-per-request comparison on your own workload to make the call.

FAQ

How do benchmarking examples work, and what do they mean in practice?

A benchmarking example is a worked comparison of candidate serving configurations measured in the units the decision actually turns on. In practice that means running each config against the same realistic traffic at a fixed p95 latency target and reading off cost-per-request and cost-per-token, rather than reporting peak tokens-per-second on a saturated GPU. The example exists to produce a defensible choice, not a leaderboard ranking.

What should an inference benchmark measure if the goal is cost-per-request rather than throughput?

Three things: cost-per-request and cost-per-token measured at your real latency target (not at peak throughput), p95 latency under the same offered load for every config, and the gross-margin delta between the throughput-optimal and cost-optimal choice.
Those are the units a finance team approves spend in, so a benchmark measured in them needs no risky translation later.

Why can a throughput-optimal serving config still lose money per request?

Peak throughput is earned by aggressive batching, which adds latency by waiting to fill batches. Throttle that config back to meet your p95 SLO and its effective throughput collapses, while it often runs on a larger, pricier instance billed whether or not it is busy. The result is a higher real cost per request than a config that looked slower on the harness.

How do you fix latency (p95) when comparing serving configurations so the benchmark is fair?

Hold p95 latency fixed at your product’s actual SLO and tune each config to meet that bound under the same offered load, then measure cost at that point. This leaves a single variable (how cheaply each config delivers the required latency) rather than comparing each config at its own most-flattering operating point.

What does a worked cost-per-request benchmark comparison between two serving configs look like?

It tabulates batch policy, peak throughput, p95 latency, effective throughput at the SLO, instance cost, and cost-per-request for each config, then translates the cost-per-request to monthly spend at your request volume. The throughput-tuned config frequently wins the harness number but loses on cost-per-request once it is throttled to meet the latency target, and the monthly spend delta is the figure that decides.

How do batching and concurrency settings change cost-per-token in a benchmark?

Widening the batch window raises peak throughput and lowers cost-per-token only if batches stay full without breaching latency; raising concurrency improves utilisation until a queue forms and the tail latency blows out. The cost-optimal point is the largest batch and highest concurrency that still clears your p95 bound, which is almost never the point that maximises the harness throughput number.
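Picking that point from a sweep can be sketched as below. This is a minimal illustration, assuming each operating point has already been measured under the same offered load; the `OperatingPoint` type, the `cost_optimal` helper, and every number in the sweep are hypothetical and are not benchmark results.

```python
from typing import NamedTuple, Optional, Sequence


class OperatingPoint(NamedTuple):
    batch_window_ms: float
    concurrency: int
    p95_ms: float                # measured under the shared offered load
    cost_per_request_usd: float  # full instance billed, idle time included


def cost_optimal(points: Sequence[OperatingPoint], slo_ms: float) -> Optional[OperatingPoint]:
    """Cheapest operating point whose p95 clears the SLO, or None if none does."""
    feasible = [p for p in points if p.p95_ms <= slo_ms]
    if not feasible:
        return None
    return min(feasible, key=lambda p: p.cost_per_request_usd)


# Invented sweep: a wider window and more concurrency lower cost but stretch the tail.
SWEEP = [
    OperatingPoint(batch_window_ms=10, concurrency=8, p95_ms=520, cost_per_request_usd=0.0041),
    OperatingPoint(batch_window_ms=25, concurrency=16, p95_ms=780, cost_per_request_usd=0.0028),
    OperatingPoint(batch_window_ms=60, concurrency=32, p95_ms=990, cost_per_request_usd=0.0024),
    OperatingPoint(batch_window_ms=120, concurrency=64, p95_ms=1350, cost_per_request_usd=0.0021),
]

if __name__ == "__main__":
    print("cheapest ignoring latency:", min(SWEEP, key=lambda p: p.cost_per_request_usd))
    print("cheapest at 800 ms p95:   ", cost_optimal(SWEEP, slo_ms=800.0))
```

In the invented sweep the cheapest point overall breaches the latency bound, so the selection falls to the widest window that still clears it.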
How do you turn benchmark results into a defensible config-selection decision?

Compare configs on cost-per-request at a fixed latency target, compute the gross-margin delta against your request volume, and record the before/after numbers. A decision backed by “Config B costs roughly half as much per request at the latency we ship” survives a procurement or finance review in a way a tokens-per-second headline does not.

How does a cost-per-request benchmark relate to standard suites like MLPerf Inference, and where do those suites stop short of a margin-based comparison?

MLPerf Inference enforces a latency constraint and is reproducible, so it is a trustworthy reference for the hardware-and-runtime layer. It stops short of cost-per-request on your traffic distribution, instance pricing, and prompt/output lengths, so use MLPerf to trust the measurement method, then run the cost-per-request comparison on your own workload to make the call.

Where This Leaves the Config Decision

The honest version of “which serving config is cheapest” is not a single number; it is a comparison run in your units, at your latency, on your traffic. The throughput headline is a tempting shortcut precisely because it is easy to produce and easy to defend in a slide, until someone applies the p95 constraint and the gross-margin delta, and the ranking inverts.

If you want the comparison run against your own deployed serving path (to produce the before/after cost-per-request rather than an illustrative table), that is the work of the inference cost-cut pack, applied within a broader AI infrastructure practice. The failure class to watch for is the one this article opened on: a config selected on a number measured in units the business never uses.