Profiling Tools for AI Inference: What They Measure and How to Read the Output

A latency alert fires, p95 is up, and the first instinct is to open whatever dashboard is already on screen — usually GPU utilisation or the cloud bill. Both produce a number. Neither tells you where the time actually goes.

That gap is the reason profiling tools exist, and the reason they are so often misread. A profiler does not hand you a fix. It hands you a layered view of where time and money are spent along the serving path, and it is your job to interpret that view correctly. Read in isolation, a single profiler reading invites the wrong fix — a model swap to address a problem that was actually batching, or a hardware upgrade for a kernel that was never the bottleneck. Read against a map of the whole serving path, the same data becomes a decision you can defend.

This article is about how to read what these tools produce. Not how to run them — that is mechanical and well-documented — but how to interpret the output so that the number you act on is the one that matters.

What Profiling Tools Actually Measure

Profiling, in the inference context, means instrumenting the serving path so you can see the time and resource cost of each stage rather than the aggregate. The serving path is the full journey of a request: it arrives, waits in a queue, gets assembled into a batch, runs through a runtime that schedules GPU kernels, returns through the framework, and leaves. Each of those stages contributes latency, and each can be the bottleneck.

A useful way to think about it: aggregate metrics tell you that something is slow or expensive. Profilers tell you where. The four outputs you will most often work with are:

Per-stage latency contribution — how much of a request’s wall-clock time was spent queuing, batching, in the runtime, in compute, and in serialization.
GPU utilisation and occupancy — whether the device is busy, and whether the work it is doing actually fills its compute units.
Kernel-level time breakdown — which individual GPU operations (matrix multiplies, attention kernels, normalisations) consume the most time.
Cost-per-request attribution — what a single served request costs once you divide infrastructure spend by served volume and attribute it to a stage.

These are the inputs to a cost or latency decision. They are not the decision. We cover the general mechanics of how profiling produces these numbers in our walkthrough of how profiling AI inference works and what the numbers mean in practice; this article focuses on reading the output and choosing the right instrument for the question.
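As a rough illustration of the fourth output, cost-per-request attribution is simple arithmetic: divide infrastructure spend for a period by the requests served in that period, then split the result across stages. The sketch below splits it in proportion to each stage's share of wall-clock time. That proportional split, the function name, and every figure in it are assumptions made for illustration, not a billing method taken from any particular tool.

```python
def cost_per_request_by_stage(spend, requests_served, stage_ms):
    """Split cost per request across stages in proportion to stage time.

    spend           -- infrastructure spend for the period (any currency)
    requests_served -- requests served in the same period
    stage_ms        -- mapping of stage name to mean milliseconds per request
    """
    if requests_served <= 0:
        raise ValueError("requests_served must be positive")
    total_ms = sum(stage_ms.values())
    if total_ms <= 0:
        raise ValueError("stage timings must sum to a positive value")
    per_request = spend / requests_served
    return {stage: per_request * ms / total_ms for stage, ms in stage_ms.items()}


# Illustrative numbers only: a 180 ms request and 1,800 units of spend
# across one million served requests.
stage_ms = {
    "queue": 60.0,
    "batch": 10.0,
    "runtime": 15.0,
    "compute": 90.0,
    "serialize": 5.0,
}
attribution = cost_per_request_by_stage(
    spend=1800.0, requests_served=1_000_000, stage_ms=stage_ms
)
```

A time-proportional split is the simplest defensible choice when the GPU is reserved for the whole request. If some stages run on cheaper CPU-only hosts, a per-resource split would be more accurate.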

Why GPU Utilisation Can Look High While Latency Stays Bad

This is the single most common misread, so it is worth stating plainly: high GPU utilisation does not mean your inference is fast, and it does not mean your hardware is the bottleneck.

Utilisation, as most dashboards report it, measures the share of a sampling window in which the GPU was executing at least one kernel, not whether it was doing useful work efficiently. A device can show 95% utilisation while running kernels that are memory-bound, poorly batched, or stalled on data movement. Occupancy (in CUDA terms, the ratio of warps active on a multiprocessor to the maximum it can hold; loosely, how fully the running work fills the device) can be low while utilisation looks pegged. You can be busy and slow at the same time.

The trap is acting on the high number. A team sees 90%+ utilisation, concludes the GPU is saturated, and either buys more GPUs or starts planning a smaller model. Both are expensive, and both can leave p95 exactly where it was if the real cost was queuing latency from undersized batches or a runtime that was not fusing kernels. The idle-GPU illusion — where a utilisation figure masks how little useful throughput the device produced — is a measurement problem before it is an engineering one, and the reasoning behind it is laid out well in LynxBenchAI’s analysis of model FLOPs utilisation in AI training. Grounding a cost claim in that measurement reasoning, rather than re-deriving it from a single dashboard, is what keeps the conclusion defensible.

The correct frame: utilisation is a sanity check, not a verdict. If utilisation is low, you have headroom and the bottleneck is upstream — queuing, batching, or the runtime. If utilisation is high, you still cannot conclude the GPU is the limit until you have looked at occupancy and kernel timing.
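To make the busy-but-slow case concrete, here is a minimal sketch of a toy device model. It assumes kernels are recorded as (start_ms, end_ms, sms_used) tuples and that concurrent kernels do not share multiprocessors. The "filled" figure is a simplified stand-in for how fully the device is used, not NVIDIA's exact occupancy metric, and the 108-SM device and kernel sizes are illustrative.

```python
def busy_fraction(kernels, window_ms):
    """Fraction of the window in which at least one kernel was running."""
    covered, current_end = 0.0, 0.0
    for start, end, _ in sorted(kernels):
        start = max(start, current_end)  # skip time already counted
        if end > start:
            covered += end - start
            current_end = end
    return covered / window_ms


def filled_fraction(kernels, window_ms, total_sms):
    """Fraction of available SM-time that the kernels actually occupied."""
    sm_time = sum((end - start) * sms for start, end, sms in kernels)
    return sm_time / (window_ms * total_sms)


# 100 back-to-back 1 ms kernels, each using only 8 of 108 SMs.
WINDOW_MS = 100.0
TOTAL_SMS = 108
kernels = [(float(t), float(t + 1), 8) for t in range(100)]

busy = busy_fraction(kernels, WINDOW_MS)                  # device never idle
filled = filled_fraction(kernels, WINDOW_MS, TOTAL_SMS)   # most SMs unused
```

In this toy trace the device is never idle, so a utilisation-style reading is pegged, while under a tenth of the available SM-time does any work. That is the gap between the sanity check and the verdict.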

Request-Level Tracing vs Kernel-Level Profiling

These two terms get used interchangeably, but they answer entirely different questions. Confusing them is how teams end up profiling the wrong layer.

Request-level tracing follows a single request through the serving path and time-stamps each stage. It answers: of the 180 ms this request took, how much was queue wait, how much was batch assembly, how much was actual compute? This is the view that tells you which layer to investigate. Tools in this category include distributed tracing systems (OpenTelemetry-based traces, for example) and serving-framework instrumentation in Triton Inference Server or similar.
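The idea behind request-level tracing can be sketched in a few lines of standard-library Python. This is not the OpenTelemetry or Triton API; RequestTrace and the stage names are hypothetical, and the injected clock exists only so the example reproduces the 180 ms request described above deterministically.

```python
import time
from contextlib import contextmanager


class RequestTrace:
    """Accumulates wall-clock time per named stage for one request."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock      # returns seconds; injectable for testing
        self.stage_ms = {}

    @contextmanager
    def stage(self, name):
        start = self._clock()
        try:
            yield
        finally:
            elapsed_ms = (self._clock() - start) * 1000.0
            self.stage_ms[name] = self.stage_ms.get(name, 0.0) + elapsed_ms

    def shares(self):
        """Each stage's share of the total traced time."""
        total = sum(self.stage_ms.values())
        if total == 0:
            return {}
        return {name: ms / total for name, ms in self.stage_ms.items()}

    def dominant_stage(self):
        return max(self.stage_ms, key=self.stage_ms.get)


# Fake clock readings (seconds): 60 ms queue, 30 ms batching, 90 ms compute.
ticks = iter([0.000, 0.060, 0.060, 0.090, 0.090, 0.180])
trace = RequestTrace(clock=lambda: next(ticks))

with trace.stage("queue_wait"):
    pass  # stand-in for time spent waiting in the queue
with trace.stage("batch_assembly"):
    pass  # stand-in for batch formation
with trace.stage("compute"):
    pass  # stand-in for the model forward pass

shares = trace.shares()
```

In a real service the default clock would be used and the stage blocks would wrap the actual queue, batching, and inference calls. The per-stage shares are what tell you which layer deserves a closer look.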

Kernel-level profiling zooms into the GPU itself and breaks down the compute stage into individual operations. It answers: within the 90 ms of compute, which kernels ran, how long did each take, and were they compute-bound or memory-bound? This is where NVIDIA Nsight Systems (timeline of CUDA activity across the system) and Nsight Compute (deep single-kernel analysis) live, alongside the profiler hooks in PyTorch and TensorRT.
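The compute-bound versus memory-bound distinction is usually reasoned about with the roofline model, a standard technique the article does not name: compare a kernel's arithmetic intensity (FLOPs per byte moved) with the device's ridge point (peak FLOP/s divided by peak memory bandwidth). The sketch below shows only that reasoning. The device peaks are hypothetical round numbers, and the byte counts assume ideal data reuse, which a real profiler would measure rather than assume.

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte moved to or from device memory."""
    return flops / bytes_moved


def classify_kernel(flops, bytes_moved, peak_flops_per_s, peak_bytes_per_s):
    """Roofline-style check: compare intensity with the device ridge point."""
    ridge = peak_flops_per_s / peak_bytes_per_s
    if arithmetic_intensity(flops, bytes_moved) >= ridge:
        return "compute-bound"
    return "memory-bound"


# Hypothetical device: 100 TFLOP/s peak compute, 1 TB/s memory bandwidth,
# so the ridge point sits at 100 FLOPs per byte.
PEAK_FLOPS = 100e12
PEAK_BYTES = 1e12

# Elementwise add in fp16: 1 FLOP per element, two reads and one write
# of 2 bytes each.
n_elems = 1_000_000
elementwise = classify_kernel(n_elems, n_elems * 6, PEAK_FLOPS, PEAK_BYTES)

# Square fp16 matmul with ideal reuse: 2*n^3 FLOPs over three n*n
# matrices of 2-byte values.
n = 4096
matmul = classify_kernel(2 * n**3, 3 * n * n * 2, PEAK_FLOPS, PEAK_BYTES)
```

The elementwise kernel lands far below the ridge point and the large matmul far above it, which is why a fast device does little for the first and a great deal for the second.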

The order matters. Request-level tracing comes first because it tells you whether the GPU is even the problem. If a trace shows that 70% of a request’s time is queue wait, opening Nsight Compute to optimise an attention kernel is wasted effort — the kernel was never on the critical path. Kernel profiling is the right tool only once a trace has pointed you at compute as the dominant stage.

Which Profiler Answers Which Question

This is the decision most teams get wrong by defaulting to the tool they already have open. The mapping below is the one we reach for when diagnosing a serving path.

Question you are asking	Layer	Tool class	Representative tools
Where does a request’s wall-clock time go?	Serving path	Request-level tracing	OpenTelemetry traces, Triton metrics, framework instrumentation
Is the GPU busy, and is the work filling it?	Hardware	Utilisation + occupancy profiler	`nvidia-smi`, DCGM, Nsight Systems
Which GPU kernels dominate compute time?	Kernel	System-wide GPU timeline	NVIDIA Nsight Systems
Why is a specific kernel slow (memory- vs compute-bound)?	Kernel	Single-kernel deep profiler	NVIDIA Nsight Compute
What is a served request costing, and where?	Economics	Cost-per-request attribution	Cloud billing + traces joined on request volume

Read the table top to bottom: it is also a sequence. You move down a layer only when the layer above has told you the bottleneck lives below it. The application-monitoring layer that sits above all of this — request rates, error budgets, dashboards — is a different concern; we draw that boundary in our look at what application performance management tools for AI inference show and what they miss.

Reading the Output: Is the Bottleneck Batching, Runtime, Kernel, or Hardware?

Once you have the layered view, interpretation comes down to attributing the dominant cost to one of four layers. Each has a distinct signature in the profiler output.

Batching is the bottleneck when request-level traces show large queue-wait or batch-assembly time and GPU utilisation is low between bursts. The device is idle, waiting for enough requests to form an efficient batch, or the batch size is too small to amortise launch overhead. The fix lives in the serving configuration, not the model.

Runtime is the bottleneck when compute time is high but kernel profiling shows many small, unfused operations with launch overhead dominating. This is the signature of a graph that was never compiled or fused — the work the GPU does is fine, but it is dispatched inefficiently. Runtime-level remedies (kernel fusion, graph compilation, switching to TensorRT) apply here, and we treat them at length in our discussion of what performance tuning for AI inference actually means in practice.

Kernel is the bottleneck when a single operation — an attention kernel, a large matrix multiply — dominates the timeline and Nsight Compute shows it is genuinely compute-bound at high occupancy. This is the rarest honest finding, because most “the kernel is slow” conclusions turn out to be runtime or batching problems on inspection.

Hardware is the bottleneck only when utilisation and occupancy are both high, kernels are compute-bound and well-fused, and there is still no headroom. This is the last conclusion to reach, not the first, because it is the most expensive to act on. Most teams that “needed more GPUs” needed a batching or runtime change instead — an observed pattern across the cost engagements we run, not a benchmarked rate.
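The four signatures above can be written down as a small decision function, which makes the ordering explicit: cheaper explanations are tested first and hardware last. This is a sketch under stated assumptions. The Reading fields, the 0.8 and 0.5 thresholds, and the "inconclusive" fallback are all hypothetical choices for illustration; real cut-offs depend on the workload.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    queue_share: float            # queue wait + batch assembly, share of request time
    utilisation: float            # 0..1, dashboard-style utilisation
    occupancy: float              # 0..1
    launch_overhead_share: float  # share of compute time lost to kernel launches
    top_kernel_share: float       # share of compute time in the largest kernel
    top_kernel_compute_bound: bool


def attribute_bottleneck(r, high=0.8, dominant=0.5):
    """Attribute the dominant cost to a layer, cheapest explanation first."""
    if r.queue_share >= dominant and r.utilisation < high:
        return "batching"
    if r.launch_overhead_share >= dominant:
        return "runtime"
    if r.occupancy >= high and r.top_kernel_compute_bound:
        if r.top_kernel_share >= dominant:
            return "kernel"
        if r.utilisation >= high:
            return "hardware"
    return "inconclusive"


# Illustrative readings, one per signature.
batching = attribute_bottleneck(Reading(0.70, 0.30, 0.20, 0.10, 0.20, False))
runtime = attribute_bottleneck(Reading(0.10, 0.95, 0.30, 0.60, 0.10, False))
kernel = attribute_bottleneck(Reading(0.05, 0.95, 0.90, 0.05, 0.70, True))
hardware = attribute_bottleneck(Reading(0.05, 0.95, 0.90, 0.05, 0.20, True))
unclear = attribute_bottleneck(Reading(0.10, 0.95, 0.30, 0.10, 0.20, False))
```

The last reading is the one worth noticing: utilisation is at 95%, yet occupancy is low and the kernels are not compute-bound, so the function declines to call it a hardware problem.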

A profiler reading that fails to match real workloads is its own failure mode — benchmark numbers measured under synthetic load routinely diverge from production behaviour, a divergence LynxBenchAI documents in why GPU utilisation in benchmark testing fails to match real workloads. Profile the path under conditions that resemble the load you actually serve, or the attribution will be wrong before you start interpreting it.

How Profiler Data Becomes an Audit’s Before/After Baseline

Profiler output is most valuable when it is not read in isolation. A per-stage latency breakdown, a kernel timeline, and a cost-per-request figure are individual readings; placed together against a map of the serving path, they form a bottleneck map — the structured statement of where the time and money go, layer by layer.

That bottleneck map is the before baseline. When you act on it — change the batch policy, compile the graph, switch runtimes — you re-profile under the same conditions and compare. The difference between the two profiles is the evidence that the fix worked, and it is the only thing that distinguishes a change that moved p95 from a quarter spent swapping models that changed nothing.
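A before/after comparison can be as small as the sketch below, which assumes you have per-request latency samples from two profiling runs taken under the same load. The function names and the sample data are illustrative. The percentile uses Python's statistics.quantiles with the inclusive method, which is one of several reasonable p95 definitions.

```python
import statistics


def p95(samples_ms):
    """95th percentile of a list of latency samples, in milliseconds."""
    # n=100 yields 99 cut points; index 94 is the 95th percentile.
    return statistics.quantiles(samples_ms, n=100, method="inclusive")[94]


def compare_profiles(before_ms, after_ms):
    """Summarise the p95 change between two profiling runs."""
    before, after = p95(before_ms), p95(after_ms)
    return {
        "p95_before_ms": before,
        "p95_after_ms": after,
        "change_pct": (after - before) / before * 100.0,
    }


# Illustrative samples: the "after" run halves every latency.
before_ms = [float(x) for x in range(100, 200)]
after_ms = [x * 0.5 for x in before_ms]
report = compare_profiles(before_ms, after_ms)
```

The comparison only means something if both runs used the same load shape and request mix. Otherwise the change in p95 reflects the traffic, not the fix.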

This is exactly the work an AI inference cost audit performs to find the real bottleneck before you replace the model: it applies these profilers to your deployed serving path and turns their output into a legible, defensible bottleneck map. The audit’s inference cost-cut pack is the engagement that produces that map; the profilers are the instruments it uses. Cost-per-request attribution, in particular, only becomes actionable once it is tied to a stage — which is why cost-per-request is the right optimisation target for production AI rather than a raw cloud-bill total. The broader practice of turning these readings into engineering decisions is what our R&D consulting engagements are built around.

FAQ

How do profiling tools work, and what does that mean in practice?

Profiling instruments the inference serving path so you can see the time and resource cost of each stage — queuing, batching, runtime, compute, serialization — instead of a single aggregate number. In practice it produces a layered view: per-stage latency contribution, GPU utilisation and occupancy, kernel-level timing, and cost-per-request. The tool surfaces where the time goes; interpreting that to find the bottleneck is the engineering work.

Which profiling tools should I use for an AI inference serving path?

Start with request-level tracing (OpenTelemetry traces, serving-framework metrics from Triton) to see which stage dominates a request’s wall-clock time. If compute is the dominant stage, move to GPU timeline profilers (NVIDIA Nsight Systems) and single-kernel analysis (Nsight Compute). Use utilisation profilers like nvidia-smi or DCGM as a sanity check, not a verdict. The right tool depends on which layer the trace points you at.

How do I read profiler output to tell whether the bottleneck is batching, runtime, kernel, or hardware?

Each layer has a signature. Batching shows up as high queue-wait with low GPU utilisation between bursts. Runtime shows up as high compute time made of many small, unfused kernels with launch overhead dominating. A kernel bottleneck is a single compute-bound operation dominating the timeline at high occupancy. Hardware is the bottleneck only when utilisation and occupancy are both high, kernels are well-fused, and there is still no headroom — the last conclusion to reach, not the first.

What is the difference between request-level tracing and kernel-level profiling?

Request-level tracing follows one request through the serving path and time-stamps each stage, answering which layer is slow. Kernel-level profiling zooms into the GPU compute stage and breaks it into individual operations, answering why compute is slow. Tracing comes first; kernel profiling is only worth running once a trace has shown that compute is the dominant stage.

Why can GPU utilisation look high while latency is still bad?

Utilisation usually measures whether the GPU was doing something during a sampling window, not whether the work was efficient. A device can show 95% utilisation while running memory-bound or poorly batched kernels, with low occupancy and low useful throughput. Acting on the high number (buying more GPUs or shrinking the model) often leaves p95 unchanged because the real cost was queuing or runtime inefficiency.

How does profiler data become an audit’s before/after baseline?

Individual profiler readings — per-stage latency, kernel timeline, cost-per-request — combine into a bottleneck map: a layer-by-layer statement of where time and money go. That map is the before baseline. After a fix, you re-profile under the same conditions and compare; the difference is the defensible evidence that the change worked, rather than an unverified claim.

Which NVIDIA profiling tools (e.g. Nsight Systems, Nsight Compute) map to which questions about an inference serving path?

Nsight Systems gives a system-wide timeline of CUDA activity: use it to see which kernels dominate the compute stage and how the GPU timeline relates to the rest of the serving path. Nsight Compute does deep single-kernel analysis: use it to answer why one specific kernel is slow and whether it is memory- or compute-bound. For the busy-or-not question, nvidia-smi reports utilisation and DCGM’s profiling metrics add SM activity and occupancy; treat both as a sanity check above the kernel layer.

The honest closing position is restraint: a profiler number means very little until you know which layer it describes and whether the load that produced it resembles production. Read the trace before the kernel, treat utilisation as a sanity check rather than a verdict, and only call it a hardware problem once every cheaper explanation has been ruled out — that discipline is what separates a fix that moves p95 from a quarter spent on the wrong layer.