Your AI feature shipped, the usage graph climbed, and then the inference bill climbed faster. The instinct in most teams is the same: the model is too expensive, so swap it for a cheaper or smaller one and hope the integration survives. That instinct is understandable, because model choice is the lever a product team controls most directly, but it is usually the wrong place to start.

An AI inference cost audit exists to answer one question before you touch the model: where do the cost and latency actually come from? It profiles the deployed serving path (batching, caching, request routing, quantisation, runtime, and kernel-level execution), names the bottleneck, and produces a ranked roadmap with a calibrated cost model. A model swap that survives that audit is a measured decision. A model swap that bypasses it is a guess, and the bill for a wrong guess is the next quarter’s roadmap.

Why Inference Cost Explodes After Launch

The cost profile of an AI feature at launch and the cost profile three months later are rarely the same shape, and the difference is almost never the model weights. Launch traffic is low and bursty, so the serving path runs with small batches, low GPU occupancy, and plenty of headroom. None of the inefficiencies hurt yet because the absolute volume is small.

Then adoption grows. The same per-request inefficiency that was invisible at a thousand requests a day becomes the dominant line item at a million. A serving path that processes requests one at a time leaves most of the GPU idle on every call, and the hardware is paid for by the hour whether or not the tensor cores are busy. A caching layer that was never built means identical or near-identical prompts are recomputed end to end. A routing policy that sends every request to the largest model means the share of traffic a smaller model could handle (say, 80%) is overpaying by design.
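The routing and caching effects described above reduce to simple expected-value arithmetic. The following is a minimal sketch under stated assumptions, not an audit tool: the function name, the per-request costs, the routing share, and the cache hit rate are all hypothetical inputs chosen for illustration, and a cache hit is assumed to cost nothing.

```python
def blended_cost_per_request(cost_large, cost_small, small_share, cache_hit_rate=0.0):
    """Expected cost of one request under routing and caching.

    cost_large, cost_small: measured cost of serving one request on each model.
    small_share: fraction of uncached traffic routed to the smaller model (0..1).
    cache_hit_rate: fraction of requests answered from cache (assumed free here).
    """
    if not 0.0 <= small_share <= 1.0:
        raise ValueError("small_share must be between 0 and 1")
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    # Cost of a request that misses the cache and is routed to one of the two models.
    routed = small_share * cost_small + (1.0 - small_share) * cost_large
    # Only cache misses pay the routed cost.
    return (1.0 - cache_hit_rate) * routed


# Illustrative inputs only: 0.010 and 0.002 are made-up per-request costs.
baseline = blended_cost_per_request(0.010, 0.002, small_share=0.0)
routed = blended_cost_per_request(0.010, 0.002, small_share=0.5)
routed_cached = blended_cost_per_request(0.010, 0.002, small_share=0.5, cache_hit_rate=0.2)

print(f"every request to the large model: {baseline:.4f}")
print(f"half routed to the small model:   {routed:.4f}")
print(f"plus a 20% cache hit rate:        {routed_cached:.4f}")
```

The sketch deliberately ignores answer quality and the cost of running the router and the cache themselves; a real estimate needs measured values for every input.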
In our experience auditing production deployments, the explosion is structural, not model-driven: the serving path was built to work, never tuned to scale, and the cost surfaced only when volume made the inefficiency expensive. That is an observed pattern across engagements, not a benchmarked rate, but it is consistent enough that “the model is too expensive” should be treated as a hypothesis to test, not a conclusion to act on.

What an Inference Cost Audit Actually Delivers

An audit is not a meeting where someone with strong opinions reviews your architecture. It is a measurement exercise that ends in a document a buyer can defend to their own leadership. The deliverable has four parts, and each one survives the engagement as evidence.

| Deliverable | What it contains | Why it matters |
| --- | --- | --- |
| Baseline metrics | Cost-per-request (and cost-per-token for LLM paths), p95 latency, GPU utilisation, throughput at current load | You cannot claim a saving without a measured “before” |
| Profiler findings | Where time and money are actually spent: kernel-level, batching, data movement, idle GPU, redundant compute | Names the bottleneck instead of guessing it |
| Ranked optimisation roadmap | Each intervention scored by expected impact and integration risk | Lets you sequence work by ROI, not by what is fashionable |
| Calibrated ROI model | Projected cost-per-request and latency after each intervention, with assumptions stated | Turns “we think this helps” into a defensible number |

The honest part of the roadmap is that sometimes it ranks model replacement first: quantising or swapping the model genuinely is the highest-impact lever in some deployments. The point of the audit is not to prove replacement is unnecessary; it is to establish whether replacement is necessary before you spend a quarter on an integration you cannot un-spend. We treat that ROI model as a living artefact: a buyer should be able to hand it to a CFO and have the numbers hold up.

Is the Bottleneck the Model, the Runtime, or the Hardware?
This is the question the audit is built to answer, and it is the one teams most often get wrong by assumption. The three layers fail in different ways and leave different fingerprints in a profiler.

- A model bottleneck shows up as high, irreducible per-token compute even when the GPU is fully saturated and batching is healthy. The work is genuinely there, and only a smaller or quantised model removes it.
- A runtime bottleneck shows up as a saturated GPU doing the wrong work efficiently: unfused kernels, suboptimal attention implementations, a serving stack that never batches, or a graph that an inference compiler like TensorRT or ONNX Runtime could collapse.
- A hardware bottleneck shows up as a GPU starved by data movement (HBM bandwidth limits, PCIe transfer overhead, or NUMA effects pinning a workload to the wrong memory), where the compute units sit idle waiting for inputs.

The diagnostic discipline that separates these layers is exactly the work covered in profiling AI inference and reading what the numbers actually mean, and the runtime and kernel-level interventions live in what performance tuning for AI inference means in practice. The reason the layers must be separated before action: each one points to a different fix, and applying the model fix to a runtime problem is how teams replace a perfectly good model and watch the bill barely move.

A Worked Example, With Assumptions Stated

Suppose a deployment serves an LLM endpoint at a measured cost-per-request that leadership flags as too high. The team’s first plan is to swap to a model roughly half the size. An audit profiles the path first and finds GPU utilisation sitting low under production load because requests are served one at a time, with no continuous batching. For example, if the profiler shows the GPU is idle most of the wall-clock time per request, the dominant cost is not the model; it is unbatched serving leaving paid-for hardware unused.
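To make the reasoning in this example concrete, here is a minimal sketch of the arithmetic, assuming a single GPU dedicated to the endpoint and billed by the hour. Every figure (the hourly price, the timings, the batch size) and both function names are hypothetical and chosen only to show the calculation; they are not measurements from any deployment.

```python
def unbatched_cost_breakdown(gpu_hourly_usd, wall_seconds, gpu_busy_seconds):
    """Split the cost of one serially served request into busy and idle GPU time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    if not 0 <= gpu_busy_seconds <= wall_seconds:
        raise ValueError("gpu_busy_seconds must lie between 0 and wall_seconds")
    # The GPU is billed for the whole wall-clock duration of the request.
    cost = gpu_hourly_usd / 3600.0 * wall_seconds
    idle_fraction = 1.0 - gpu_busy_seconds / wall_seconds
    return {
        "cost_per_request": cost,
        "idle_fraction": idle_fraction,
        "idle_cost": cost * idle_fraction,
    }


def batched_cost_per_request(gpu_hourly_usd, batch_wall_seconds, batch_size):
    """Cost per request when one batch of batch_size requests shares the GPU time."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return gpu_hourly_usd / 3600.0 * batch_wall_seconds / batch_size


# Hypothetical inputs: a 3.60 USD/hour GPU, 2.0 s per request, GPU busy for 0.5 s of it.
breakdown = unbatched_cost_breakdown(3.60, wall_seconds=2.0, gpu_busy_seconds=0.5)
# Hypothetical batched case: 8 requests complete together in 3.0 s on the same GPU.
batched = batched_cost_per_request(3.60, batch_wall_seconds=3.0, batch_size=8)

print(f"unbatched cost per request: {breakdown['cost_per_request']:.6f} USD")
print(f"share of that spent idle:   {breakdown['idle_fraction']:.0%}")
print(f"batched cost per request:   {batched:.6f} USD")
```

With these made-up inputs, three quarters of the unbatched cost pays for idle hardware, which is the signal that the serving path, not the model, deserves the first look.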
In a case like this, dynamic batching and request routing can recover a large share of the cost without changing the model at all, and the avoided cost of the model-replacement project (engineering time, re-evaluation, regression risk) becomes part of the ROI the audit reports. The example is illustrative, framed to show the reasoning; the actual figures come only from profiling the specific deployment.

How Batching and Routing Cut Cost Without Touching the Model

Two of the highest-leverage interventions never change a single weight. Continuous (or dynamic) batching groups concurrent requests so the GPU processes many sequences per forward pass, raising utilisation toward the level the hardware was billed for. Request routing sends easy traffic to a smaller model and reserves the large model for the requests that genuinely need it, aiming for the same answer quality at a fraction of the blended cost.

This is also where cost-per-token and cost-per-request connect. For an LLM path, cost-per-token is the granular unit, but the operationally relevant figure is cost-per-request, because a request may span many tokens and batching changes the per-token economics non-linearly. The unit-economics decision that anchors the whole audit (which number you optimise and why) is the subject of why cost-per-request is the right production AI optimisation target. Choosing the wrong unit means optimising a number that does not move the bill.

The deeper reason this works under real load, and the reason a benchmark on synthetic traffic can mislead you, is that sustained behaviour under production concurrency, not transient peak, determines cost. The cost side of why measured-under-load numbers behave the way they do is covered in the precision-as-an-economic-lever discussion in inference systems; the broader load-behaviour story sits in performance engineering for production AI under load.

When Is It Worth Profiling Instead of Guessing?
Profiling has a cost: engineering hours, instrumentation, a controlled load test. There is a threshold below which guessing is rational, and the audit framing is honest about it.

- Profile when inference is a material and growing line item, when latency is hurting the product, or when a model-replacement project is being proposed. The audit cost is small against a wrong quarter.
- Profile when the team cannot articulate which layer the bottleneck is in. Uncertainty about model vs runtime vs hardware is itself the signal.
- Guessing (cheaply) is defensible when the feature is pre-scale, the bill is immaterial, and the reversible fix is a one-line config change.
- Always profile before a migration. Porting runtime or hardware without a baseline means you cannot prove the migration paid off: the work in how runtime and hardware porting cuts cost without a model swap only has an ROI if you measured the “before”.

The avoided cost of an unnecessary model-replacement project is a real, reportable saving. It is directional, but consistent across the deployments we have audited. None of this should be read as a universal percentage: the only honest number is the one your serving path produces under measurement.

FAQ

How do we reduce inference cost without replacing the model?

Profile the deployed serving path first. Most launch-era cost comes from structural inefficiency (unbatched serving, missing caching, routing every request to the largest model), not from the model weights. Continuous batching, request routing, caching, and runtime optimisation often recover a large share of cost with no model change at all, and an audit ranks those interventions by measured impact before anyone touches the model.

Why did our AI feature cost explode after launch?

Launch traffic is low and bursty, so the serving path runs with small batches and low GPU occupancy where inefficiency is invisible.
As adoption grows, the same per-request inefficiency that cost nothing at low volume becomes the dominant line item. The explosion is usually structural (a path built to work but never tuned to scale), which is why the bill grows faster than usage.

What does an inference cost audit actually deliver?

Four artefacts that survive the engagement: baseline metrics (cost-per-request, p95 latency, GPU utilisation), profiler findings that name the actual bottleneck, a ranked optimisation roadmap scored by impact and integration risk, and a calibrated ROI model projecting cost and latency after each intervention. The deliverable is a defensible decision document a buyer can hand to their own leadership.

How do we tell whether the bottleneck is model, runtime, or hardware?

Each layer leaves a different fingerprint in a profiler. A model bottleneck is high irreducible per-token compute with the GPU already saturated and batching healthy. A runtime bottleneck is a saturated GPU doing the wrong work: unfused kernels, no batching, a graph a compiler could collapse. A hardware bottleneck is a GPU starved by data movement, sitting idle waiting on HBM bandwidth or PCIe transfers.

When is it worth profiling instead of guessing?

Profile when inference is a material and growing cost, when latency hurts the product, when a model-replacement project is being proposed, or when the team cannot say which layer the bottleneck is in. Cheap guessing is defensible only when the feature is pre-scale, the bill is immaterial, and the fix is a reversible one-line config change. Always profile before a migration so you have a baseline to prove it paid off.

How is cost-per-token related to cost-per-request when auditing an LLM inference deployment?

Cost-per-token is the granular unit, but cost-per-request is the operationally relevant figure because a request spans many tokens and batching changes the per-token economics non-linearly.
Optimising the wrong unit means optimising a number that does not move the actual bill. The audit anchors on cost-per-request and uses cost-per-token as a diagnostic input.

How do batching and request routing affect inference cost without changing the underlying model?

Continuous batching groups concurrent requests into a single forward pass, raising GPU utilisation toward the level the hardware was billed for. Request routing sends easy traffic to a smaller model and reserves the large model for requests that genuinely need it. Both lower the blended cost-per-request while leaving the model weights untouched, which is why they often rank above replacement in an audit roadmap.

The audit itself is packaged as an AI Inference Cost Cut Pack: it profiles your deployed serving path, names the actual bottleneck, and produces the ranked roadmap and calibrated ROI model described above. If you are deciding whether profiling is worth it, the cheapest way to find out is to measure the one number you cannot currently defend: what a single request actually costs you under production load. That is also where a broader R&D engagement scoped to your problem begins: not with a model swap, but with the evidence that tells you whether you need one.