How do I diagnose where AI inference latency is spent?

Profile four layers: model compute (per-kernel timing), memory (bandwidth, KV-cache pressure), batching (queue wait, batch formation), and transport (serialisation, network, tokenisation). Run the profiling against production traffic. The resulting latency Pareto chart decides what to optimise.

How do batching strategies trade throughput against tail latency?

Static batching gives high throughput but poor tail latency. Dynamic batching with a max-wait threshold trades some throughput for a predictable tail. Continuous batching (the LLM default) gives the best throughput-latency frontier for autoregressive workloads. The SLA defines the operating point.

When should I optimise the inference path vs scaling out?

Optimise first when per-request latency is the constraint, when cost-per-inference is the constraint, or when the hardware budget is fixed and headroom is uncertain. Scale out only when the optimisation runway is exhausted or the engineering cost exceeds the hardware cost. The default is optimise-then-scale.

How do I measure cost-per-inference to justify the engineering work?

Divide total infrastructure cost by inferences served. Take the baseline at production traffic and the post-optimisation figure at the same traffic profile; the delta times annual volume gives the savings. Report it as a first-class metric alongside latency and throughput, alarm on regressions, and review it monthly.

Machine Learning on GPU: A Faster Future

What is the most efficient GPU infrastructure for low-latency inference?

For LLMs of 7B–70B parameters: H100/H200 with vLLM or TensorRT-LLM, continuous batching, and FP8. For very large models: multi-GPU NVLink with tensor and pipeline parallelism. For CV: L40S/H100 with TensorRT INT8. For edge: L4/RTX A2000 with dynamic batching.

When does FP8/INT8 quantisation actually reduce serving latency?

Workloads that are compute-bound on tensor-core operations benefit (Hopper/Blackwell FP8/INT8 paths). Memory-bound workloads mainly gain memory headroom and concurrency; a lower single-sequence latency is not guaranteed. The test: measure latency at the candidate batch size with an FP16 baseline, then with FP8/INT8.

Introduction

“Machine learning on GPU: a faster future” reads as a slogan but the engineering reality is specific: faster inference on GPU infrastructure is a measurable outcome of disciplined latency diagnosis, batching choices, quantisation decisions, and cost-per-inference accounting — not a property of buying a bigger GPU. The applied example that makes this concrete is the inference-latency optimisation programme that every production ML team eventually runs: diagnose where the latency budget is being spent, apply the optimisations that match the actual bottleneck, and measure the cost-per-inference impact to justify the engineering work. See GPU engineering for the broader programme this applied example lives inside.

The naive read of “faster ML on GPU” is “throw more or bigger GPUs at it.” The expert read is that scaling out before optimising the inference path wastes capacity, that optimisations have specific envelopes (quantisation does not always help, batching trades throughput against tail latency), and that the cost-per-inference number is the metric that justifies or rules out the engineering investment.

What this means in practice

Latency diagnosis precedes optimisation; optimising the wrong bottleneck makes things worse.
Quantisation (FP8/INT8) sometimes reduces latency, sometimes only memory; the workload determines which.
Batching strategy trades throughput against tail latency; the SLA defines the right operating point.
Cost-per-inference is the metric that gates the optimisation investment.

How do I diagnose where AI inference latency is being spent — model compute, memory, batching, or transport?

The diagnostic toolkit has four layers. Model-compute profiling: per-kernel timing from NVIDIA Nsight, TensorRT profile reports, PyTorch profiler — locates the operators that dominate per-request compute time. Memory profiling: bandwidth utilisation, KV-cache pressure (for autoregressive models), tensor-size analysis — identifies workloads that are memory-bound rather than compute-bound. Batching profiling: queue-wait time, batch-formation time, batch-size distribution under production traffic — identifies whether the latency is in the model or in the request queue.

Transport profiling: gRPC/HTTP serialisation time, network round-trip from client, tokenisation/detokenisation time — identifies the latency budget that the model never sees. Run the four layers against representative production traffic, not against a synthetic benchmark. The latency Pareto chart that the diagnosis produces is the artefact that decides what to optimise; without it, optimisation work is guesswork that usually targets the wrong layer.
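As a rough illustration of the latency Pareto chart, the sketch below ranks per-stage timings by their share of the total and accumulates the percentage. The stage names and millisecond values are hypothetical placeholders, not measured data; in practice they would come from the four profiling layers run against production traffic.

```python
from typing import Dict, List, Tuple


def latency_pareto(stage_ms: Dict[str, float]) -> List[Tuple[str, float, float]]:
    """Rank stages by latency share; return (stage, share_pct, cumulative_pct)."""
    total = sum(stage_ms.values())
    if total <= 0:
        raise ValueError("total latency must be positive")
    rows = []
    cumulative = 0.0
    # Largest contributor first, as in a Pareto chart.
    for stage, ms in sorted(stage_ms.items(), key=lambda kv: kv[1], reverse=True):
        share = 100.0 * ms / total
        cumulative += share
        rows.append((stage, round(share, 1), round(cumulative, 1)))
    return rows


# Hypothetical mean per-request timings in milliseconds (illustrative only).
example = {
    "model_compute": 40.0,
    "queue_wait": 90.0,
    "tokenisation": 10.0,
    "serialisation": 5.0,
    "network_rtt": 55.0,
}

for stage, share, cum in latency_pareto(example):
    print(f"{stage:<14} {share:5.1f}%  cumulative {cum:5.1f}%")
```

With these made-up numbers, queue wait and network round-trip account for most of the budget, so kernel-level tuning would be aimed at the wrong layer.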

What is the most efficient GPU infrastructure for low-latency inference today?

The 2026 production-optimal stack depends on the model class. For LLMs at moderate scale (7B–70B parameters): H100 or H200 with vLLM or TensorRT-LLM, with continuous batching, paged attention, and FP8 quantisation where the model supports it. For very large models (100B+): multi-GPU H100/H200 with NVLink, tensor parallelism, and pipeline parallelism tuned to the latency-throughput operating point.

For CV inference: L40S or H100 with TensorRT, INT8 quantisation where accuracy permits, batch sizes tuned to the latency budget. For edge inference at the constrained envelope: L4 or RTX A2000 with TensorRT, INT8, and dynamic batching. The right hardware follows the workload class; specifying “the most efficient infrastructure” without specifying the workload class produces procurement that does not match the actual problem. See the data centre GPU procurement framing for the broader hardware-choice methodology.

When does FP8 / INT8 quantisation actually reduce serving latency, and when does it only save memory?

Quantisation reduces serving latency when the workload is compute-bound on tensor-core operations that the lower-precision path executes faster: matrix multiplies on the FP8 tensor cores of H100/H200, and integer matrix operations on the Hopper and Blackwell INT8 paths, both benefit substantially. Quantisation mainly saves memory when the workload is memory-bound (compute is not the bottleneck). In that case the smaller model keeps more of its working set in cache and fits more concurrent sequences into a given memory budget, but a lower per-request latency on a single sequence is not guaranteed and has to be measured.
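One common way to estimate whether an operation is compute-bound or memory-bound is a roofline-style comparison of its arithmetic intensity (FLOPs per byte moved) against the device's ratio of peak compute to memory bandwidth. The sketch below applies that idea to a single matrix multiply. The peak-FLOPs and bandwidth figures are round illustrative placeholders, not datasheet values, and the byte count ignores cache reuse.

```python
def ridge_point(peak_flops: float, mem_bandwidth_bytes: float) -> float:
    """FLOPs per byte above which a kernel is compute-bound on this device."""
    return peak_flops / mem_bandwidth_bytes


def matmul_intensity(m: int, n: int, k: int, bytes_per_elem: int) -> float:
    """Arithmetic intensity of an (m x k) @ (k x n) matmul, ignoring cache reuse."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def classify(intensity: float, ridge: float) -> str:
    return "compute-bound" if intensity >= ridge else "memory-bound"


# Assumed round numbers for illustration; substitute the real device figures.
PEAK_FLOPS = 1.0e15      # FLOP/s
BANDWIDTH = 3.0e12       # bytes/s
ridge = ridge_point(PEAK_FLOPS, BANDWIDTH)

# One 8192 x 8192 weight matrix in 2-byte precision, at two batch sizes.
for batch in (1, 512):
    ai = matmul_intensity(batch, 8192, 8192, bytes_per_elem=2)
    print(f"batch={batch:<4} intensity={ai:7.1f} FLOP/B -> {classify(ai, ridge)}")
```

Under these assumptions the batch-1 multiply sits far below the ridge point and the batch-512 multiply sits above it, which is why the same model can land on either side depending on the serving batch size.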

The honest test: measure latency at the candidate batch size with FP16 baseline, then with FP8/INT8; if the latency drops, the workload benefits from compute-side quantisation; if it does not, the benefit is memory and concurrency, not latency. Some workloads benefit on both axes; some benefit only on memory; some require careful accuracy validation that finds the quantisation is not lossless enough for the application. The decision is per-workload, not blanket.
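The measurement described above can be sketched as a small A/B harness. Everything here is an assumption for illustration: `run_batch` stands for whatever callable executes one batch on the serving stack, and the 5% threshold is an arbitrary cut-off. On a GPU the callable must block until device work has finished (for example by synchronising before returning), otherwise the timer measures only kernel launch.

```python
import statistics
import time
from typing import Callable, List


def measure_ms(run_batch: Callable[[], None], warmup: int = 5, iters: int = 50) -> List[float]:
    """Time repeated calls of run_batch; returns per-call latency in ms."""
    for _ in range(warmup):          # discard warm-up runs (caches, autotuning)
        run_batch()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        run_batch()                  # must block until the batch has completed
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def quantisation_verdict(fp16_ms: List[float], quant_ms: List[float],
                         min_gain: float = 0.05) -> str:
    """Compare median latencies; min_gain is a hypothetical noise threshold."""
    base = statistics.median(fp16_ms)
    cand = statistics.median(quant_ms)
    gain = (base - cand) / base
    if gain >= min_gain:
        return "latency benefit"
    return "memory/concurrency benefit only"


# Made-up samples standing in for measured latencies at one batch size.
print(quantisation_verdict([10.0, 10.2, 9.9], [7.1, 7.0, 7.3]))
print(quantisation_verdict([10.0, 10.2, 9.9], [9.9, 10.0, 9.8]))
```

Accuracy validation is a separate step that this harness does not cover.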

How do batching strategies (continuous, dynamic, static) trade throughput against tail latency?

Static batching (fixed batch size, formed at the request boundary) gives high throughput at the cost of tail latency — requests wait for the batch to fill, which adds queueing latency the SLA may not tolerate. Dynamic batching (the server forms a batch from whatever requests are queued up to a max-wait threshold) trades throughput for predictable tail latency — the max-wait bound caps the queueing latency.
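A toy simulation makes the queueing difference between static and dynamic batching visible. It is a simplified sketch: arrival times are hypothetical and sorted, the server is assumed to be free whenever a batch dispatches, compute time is ignored, and a final partial static batch is treated as dispatched at its last arrival.

```python
from typing import List


def static_waits(arrivals: List[float], batch_size: int) -> List[float]:
    """Queue wait per request when a batch dispatches only once it is full."""
    waits = []
    for i in range(0, len(arrivals), batch_size):
        batch = arrivals[i:i + batch_size]
        dispatch = batch[-1]  # the last arrival fills (or ends) the batch
        waits.extend(dispatch - t for t in batch)
    return waits


def dynamic_waits(arrivals: List[float], batch_size: int, max_wait: float) -> List[float]:
    """Queue wait when a batch dispatches on fill or max_wait after its first request."""
    waits = []
    i = 0
    while i < len(arrivals):
        deadline = arrivals[i] + max_wait
        j = i
        # Admit requests until the batch is full or the deadline passes.
        while j < len(arrivals) and j - i < batch_size and arrivals[j] <= deadline:
            j += 1
        full = (j - i == batch_size)
        dispatch = arrivals[j - 1] if full else deadline
        waits.extend(dispatch - t for t in arrivals[i:j])
        i = j
    return waits


# Hypothetical arrival times in milliseconds.
arrivals = [0.0, 1.0, 2.0, 50.0, 51.0, 120.0]
print("static  max wait:", max(static_waits(arrivals, batch_size=4)))
print("dynamic max wait:", max(dynamic_waits(arrivals, batch_size=4, max_wait=10.0)))
```

In this example the dynamic policy never queues a request longer than the 10 ms bound, at the price of dispatching smaller batches than the static policy does.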

Continuous batching (the modern default for LLM inference — requests join and leave the active batch at iteration boundaries) gives the best throughput-latency frontier for autoregressive workloads by avoiding the dead time of static batches finishing together. The SLA defines the operating point: tight p99 latency targets push toward dynamic batching with a low max-wait or continuous batching; throughput-dominated objectives without strict tail-latency limits accept the static-batching wait. The wrong batching strategy for the SLA produces either wasted capacity or missed SLAs; the right strategy is workload-and-SLA-specific.
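The dead time that continuous batching avoids can be illustrated by counting decode steps. This is a simplified model, not how any particular serving engine schedules work: each sequence produces one token per step, prefill is ignored, requests are admitted in FIFO order, and the output lengths are hypothetical.

```python
import heapq
from typing import List


def static_steps(lengths: List[int], slots: int) -> int:
    """Decode steps when each batch runs until its longest sequence finishes."""
    total = 0
    for i in range(0, len(lengths), slots):
        total += max(lengths[i:i + slots])
    return total


def continuous_steps(lengths: List[int], slots: int) -> int:
    """Decode steps when a freed slot is refilled at the next iteration boundary."""
    finish: List[int] = []  # min-heap of the step at which each slot becomes free
    for n in lengths:
        if len(finish) < slots:
            heapq.heappush(finish, n)
        else:
            free_at = heapq.heappop(finish)   # earliest slot to free up
            heapq.heappush(finish, free_at + n)
    return max(finish) if finish else 0


# Hypothetical output lengths in tokens: two long sequences among short ones.
lengths = [100, 10, 10, 10, 100, 10, 10, 10]
print("static steps:    ", static_steps(lengths, slots=4))
print("continuous steps:", continuous_steps(lengths, slots=4))
```

With four slots, the static schedule waits for each long sequence before starting the next batch, while the continuous schedule lets short sequences pass through the slots the long ones do not occupy.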

When should I optimise the inference path rather than scale out to more GPUs?

Optimise before scaling when the per-request latency is the constraint (more GPUs reduce throughput-bound latency by reducing queueing but do not change per-request inference time), when the cost-per-inference is the constraint (scaling out multiplies the cost; optimisation reduces it), or when the hardware budget is fixed and the workload’s headroom on existing GPUs is uncertain.

Scale out before optimising only when the optimisation runway is exhausted (the path has been profiled, the obvious optimisations applied, and the remaining gap requires hardware), when the engineering cost of further optimisation exceeds the hardware cost of scaling, or when the timeline forces immediate capacity expansion and leaves no time for optimisation work. The default pattern in 2026 production deployments is optimise-then-scale, not scale-then-hope; a few weeks of focused optimisation typically lowers cost-per-inference by more than a comparable spend on hardware would.

How do I measure cost-per-inference before and after optimisation to justify the engineering work?

The cost-per-inference calculation: total infrastructure cost (hardware amortisation or cloud rental at the production utilisation) divided by inferences served in the period. Measure the baseline at production traffic with no recent optimisation; measure the post-optimisation number at the same traffic profile; the delta multiplied by projected annual inference volume is the annualised savings.
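The calculation is simple enough to write down directly. In the sketch below all figures are invented for illustration: the same monthly traffic is served before and after optimisation, and the cost falls because the optimised path is assumed to need less hardware.

```python
def cost_per_inference(infra_cost: float, inferences: int) -> float:
    """Infrastructure cost for a period divided by inferences served in it."""
    if inferences <= 0:
        raise ValueError("inferences must be positive")
    return infra_cost / inferences


def annualised_savings(baseline_cpi: float, optimised_cpi: float,
                       annual_volume: int) -> float:
    """Cost-per-inference delta multiplied by projected annual volume."""
    return (baseline_cpi - optimised_cpi) * annual_volume


# Hypothetical monthly figures at the same traffic profile.
MONTHLY_INFERENCES = 60_000_000
baseline = cost_per_inference(30_000.0, MONTHLY_INFERENCES)   # before optimisation
optimised = cost_per_inference(18_000.0, MONTHLY_INFERENCES)  # after optimisation
savings = annualised_savings(baseline, optimised, 12 * MONTHLY_INFERENCES)

print(f"baseline  cost-per-inference: {baseline:.6f}")
print(f"optimised cost-per-inference: {optimised:.6f}")
print(f"annualised savings: {savings:,.0f}")
```

Keeping the traffic profile identical between the two measurements is what makes the delta attributable to the optimisation rather than to a change in load.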

The cost-per-inference number is the artefact that justifies the engineering work to the CFO and gives the procurement team grounds to decline an unnecessary scaling purchase. The discipline: report cost-per-inference as a first-class production metric alongside latency and throughput; alarm on regressions; review the trend monthly. Teams that optimise without measuring cost-per-inference produce technically impressive results that the finance team cannot turn into capacity decisions; teams that measure it produce the engineering programme that earns continued investment.
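A regression alarm on this metric can be as small as a month-over-month comparison. The following is a minimal sketch under stated assumptions: the 5% tolerance and the monthly values are hypothetical, and a production setup would normally live in the existing monitoring system rather than in a script.

```python
from typing import List, Tuple


def cpi_regressions(monthly_cpi: List[Tuple[str, float]],
                    tolerance: float = 0.05) -> List[str]:
    """Months whose cost-per-inference rose more than `tolerance` over the prior month."""
    flagged = []
    for (_, prev), (month, cur) in zip(monthly_cpi, monthly_cpi[1:]):
        if prev > 0 and (cur - prev) / prev > tolerance:
            flagged.append(month)
    return flagged


# Hypothetical monthly cost-per-inference series (illustrative values only).
history = [
    ("2026-01", 0.00050),
    ("2026-02", 0.00030),  # after an optimisation pass
    ("2026-03", 0.00031),
    ("2026-04", 0.00036),  # drift worth investigating
]

print(cpi_regressions(history))
```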

Limitations that remain

Even the well-executed inference-optimisation programme leaves limits. Quantisation accuracy floors exist for some workloads where the application requires the full FP16 dynamic range and the engineering team cannot ship the lower-precision path. Batching tail-latency targets that approach the per-request latency floor leave little batching headroom; the workload essentially serves one request at a time, with the cost structure that implies. Cost-per-inference reductions plateau as the optimisation runway exhausts — the third or fourth round of optimisation produces diminishing returns and the cost-discipline conversation shifts to scaling or to reducing the workload.

Model-architecture choices made at training time bound what serving-time optimisation can achieve — some optimisations require retraining or fine-tuning at lower precision, which is a separate investment with its own risks. The honest framing is that inference optimisation is a real and measurable lever, applied within a known envelope, with the next investment after exhaustion being either architectural or strategic rather than another optimisation pass.

How TechnoLynx Can Help

TechnoLynx works with production ML teams on inference latency optimisation from the diagnostic profile through quantisation and batching decisions, the cost-per-inference accounting that justifies the engineering work, and the optimise-vs-scale decision that keeps the GPU budget honest. If your team is scaling ML on GPU and needs the optimisation programme structured before the next hardware purchase, contact us.

Image credits: Freepik