The most-watched number in the monitoring panel is also the most misread
Of all the metrics available in a GPU monitoring stack, utilization is the one people fixate on first. It shows up in nvidia-smi, in Grafana dashboards, in cloud provider consoles, and in every performance review meeting. High utilization makes people feel good. Low utilization triggers concern. And both reactions are frequently disconnected from what is actually happening in the system.
The reason is straightforward but easy to overlook: GPU utilization is not a performance metric. It’s a device-activity proxy — a signal about how much of a particular sampling window had active kernels — and the gap between that signal and actual system performance is wide enough to produce genuinely wrong conclusions in both directions.
This article is specifically about the metric — what it measures, what it hides, and how to read it without fooling yourself.
What the utilization counter actually reports
The utilization number you see in nvidia-smi or through NVML is typically the percentage of time over the last sampling interval (roughly 1/6 of a second to 1 second, depending on the product) during which one or more GPU kernels were executing. That’s it. Not “percentage of compute capacity used.” Not “fraction of theoretical throughput achieved.” Not “efficiency.”
This definition has important consequences. A kernel can be executing continuously — keeping the utilization counter at 100% — while doing inefficient work: poor memory access patterns, redundant computation, low arithmetic intensity. The GPU is “busy” in the counter’s definition, but productivity (actual useful work per unit time) may be far below what the hardware can deliver.
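The definition can be made concrete with a small simulation — not an NVML call, just the busy-time arithmetic the counter performs. The kernel intervals below are invented for illustration; the point is that the result depends only on *when* kernels were active, never on what they accomplished.

```python
def busy_fraction(kernel_intervals, window_start, window_end):
    """Fraction of [window_start, window_end) covered by >=1 active kernel."""
    # Clip intervals to the window, sort, then sum non-overlapping coverage.
    clipped = sorted(
        (max(s, window_start), min(e, window_end))
        for s, e in kernel_intervals
        if e > window_start and s < window_end
    )
    covered, cursor = 0.0, window_start
    for s, e in clipped:
        if e > cursor:
            covered += e - max(s, cursor)
            cursor = e
    return covered / (window_end - window_start)

# One kernel running the whole second reads as "100% utilized" no matter
# how inefficient its memory access pattern is.
print(busy_fraction([(0.0, 1.0)], 0.0, 1.0))                 # 1.0

# The same useful work done in dense bursts reads as lower utilization,
# even if throughput is higher.
bursts = [(0.0, 0.2), (0.3, 0.5), (0.6, 0.8)]
print(round(busy_fraction(bursts, 0.0, 1.0), 3))             # 0.6
```

The counter is, in effect, this coverage computation and nothing more — efficiency never enters the formula.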
Conversely, a workload can deliver high effective throughput while the utilization counter shows 60% because the work arrives in dense bursts followed by brief periods of host-side orchestration or memory-bound operations that the counter doesn’t classify as “active.” The system is productive. The metric just can’t see it in the way you’d expect.
Why utilization and throughput don’t track each other reliably
The disconnect between utilization and performance comes from the fact that utilization measures activity, not outcome.
In throughput terms, what matters is how much useful work the system completes per unit time — tokens generated, images processed, training steps completed. That depends on the full execution pipeline: whether kernels are efficient, whether data movement is well-organized, whether the software stack is exploiting the hardware’s strengths, whether the system is operating in a favorable regime.
Utilization doesn’t capture any of that. It captures whether kernels were scheduled. A high-utilization system running poorly chosen kernels with wasteful memory access patterns will show a “healthy” dashboard while delivering mediocre throughput. A well-optimized system that finishes work faster — with efficient attention kernels like FlashAttention, good operator fusion via torch.compile, and tight memory management — might actually show lower utilization because it completes work in shorter bursts rather than spreading it across the sampling window.
We see this inversion regularly in practice: an optimization that improves real throughput by 25% actually decreases the utilization number, because the work gets done faster per batch and the GPU spends more of each window idle between dispatches. If you’re using utilization as your success metric, you’ve just been told your optimization made things worse. It didn’t.
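The inversion falls out of simple arithmetic. The numbers below are hypothetical: a per-batch kernel time, plus a host-side gap during which the GPU is idle. Speeding up the kernel raises throughput and *lowers* the utilization counter at the same time.

```python
def summarize(kernel_ms, gap_ms):
    """Throughput and busy fraction for a steady per-batch loop (toy model)."""
    period = kernel_ms + gap_ms        # end-to-end ms per batch
    throughput = 1000.0 / period       # batches per second
    utilization = kernel_ms / period   # fraction of each period the GPU is busy
    return throughput, utilization

# Before: slow kernels dispatched back to back, no visible idle time.
before = summarize(kernel_ms=100.0, gap_ms=0.0)
# After: kernels 40% faster, but a fixed 20 ms host-side gap is now exposed.
after = summarize(kernel_ms=60.0, gap_ms=20.0)

print(f"before: {before[0]:.1f} batches/s at {before[1]:.0%} utilization")
print(f"after:  {after[0]:.1f} batches/s at {after[1]:.0%} utilization")
```

Here throughput rises from 10 to 12.5 batches per second — a 25% improvement — while utilization drops from 100% to 75%. A dashboard watching only the counter reports a regression.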
Averaging hides structure
Utilization is typically averaged over a sampling window, and that averaging destroys information about the workload’s temporal structure.
A serving workload that handles variable-length requests might have busy periods and quiet periods within each second-long sampling interval. The average utilization might land at 55%, but the actual execution pattern is bimodal: periods at 100% during computation and periods near 0% during queueing, preprocessing, or waiting for the next batch. The 55% is a statistically real number that describes no real moment in the system’s operation.
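The bimodal pattern is easy to demonstrate. The 1 ms samples below are fabricated: the simulated GPU is always either fully busy or fully idle, yet the windowed average reports a value the system never once exhibits.

```python
# One second of made-up 1 ms activity samples: 550 ms busy, 450 ms idle.
samples = [1.0] * 550 + [0.0] * 450

average = sum(samples) / len(samples)
print(average)                               # 0.55

# No individual sample is anywhere near the reported average.
print(any(0.1 < s < 0.9 for s in samples))   # False
```

The 55% figure is an artifact of the averaging window, not a state the hardware ever occupied.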
Workloads with distinct phases — torch.compile graph capture, warmup, steady-state inference, and intermittent GC pauses — produce utilization traces that average these phases together. The resulting number tells you nothing about which phase was dominant, which was the bottleneck, or what steady-state behavior actually looks like — one reason benchmarks fail to match real AI workloads.
Understanding that peak and steady-state performance reflect fundamentally different temporal regimes makes this problem sharper. Utilization averaged across those regimes is averaging across the most important dimension your system has — and presenting the result as if it were a single meaningful state.
The high-utilization trap
There’s a mirror-image problem that’s less discussed but equally dangerous: treating high utilization as confirmation that the system is performing well.
High utilization means the GPU has active kernels most of the time. It does not mean those kernels are doing the right work efficiently, that the system is producing useful output at a high rate, that end-to-end latency is acceptable, or that the user experience meets its target.
You can achieve high utilization by over-batching in a latency-sensitive serving system — the GPU stays busy because it’s always processing a large queue, but individual request latency spikes because each request waits longer before being served. The dashboard says “healthy.” The users disagree.
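A toy backlog model makes the trap explicit. All numbers are invented: batches take a fixed 50 ms, and the queue is kept deep enough that the GPU never idles — so the utilization counter reads 100% regardless of how long any individual request waits.

```python
PER_BATCH_MS = 50.0  # hypothetical fixed batch processing time

def utilization_with_backlog(depth):
    """Busy fraction: the GPU never idles while anything is queued."""
    return 1.0 if depth > 0 else 0.0

def request_latency_ms(depth):
    """A new request drains `depth` queued batches, then runs in its own."""
    return (depth + 1) * PER_BATCH_MS

for depth in (1, 10, 40):
    print(depth, utilization_with_backlog(depth), request_latency_ms(depth))
```

All three queue depths show the same 100% utilization while request latency spans 100 ms to over 2 seconds — a 20x difference the counter is structurally incapable of reflecting.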
You can also achieve high utilization with poorly optimized kernels that spin on the compute units without efficiently converting that activity into output. The GPU is busy. It’s just not busy doing the right thing at the right speed.
In both cases, the utilization metric is telling you something about activity level but nothing about whether the system’s actual objective — throughput, latency, cost-efficiency, user experience — is being met.
How to contextualize utilization without ignoring it
Utilization isn’t useless — it’s just insufficient on its own, and dangerous when elevated to the status of a performance metric.
The productive way to use it is as one signal among several, interpreted alongside metrics that directly measure outcomes: actual throughput, request latency distributions (p50, p95, p99), memory bandwidth utilization, kernel-level profiling from tools like Nsight Systems, and system-level timing that shows where time goes end-to-end.
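As a sketch of why latency distributions belong next to utilization on a dashboard: the latencies below are fabricated, and the percentile uses the nearest-rank rule (the smallest value with at least p% of the data at or below it). The mean and p50 look fine; the p95 exposes the tail.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: smallest value covering >= p% of the data."""
    ranked = sorted(values)
    k = math.ceil(p / 100 * len(ranked))
    return ranked[max(k, 1) - 1]

latencies_ms = [12, 14, 13, 15, 90, 13, 14, 16, 13, 200]  # fabricated

print(sum(latencies_ms) / len(latencies_ms))  # mean: 40.0
print(percentile(latencies_ms, 50))           # p50:  14
print(percentile(latencies_ms, 95))           # p95:  200
```

A mean of 40 ms and a p50 of 14 ms suggest a healthy system; the 200 ms p95 is what users actually notice — and none of it is visible in a utilization percentage.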
If utilization is low and throughput is also low, you have a potential signal worth investigating — but the investigation should focus on where time goes, not on “making the GPU busier.” If utilization is low and throughput is meeting its target, the utilization number is telling you about the workload’s structure, not about a problem.
The question that actually helps is never “why is utilization low?” — it’s “where does time go across the execution path, and is the system meeting its objective?” One question leads to dashboard-driven anxiety. The other leads to engineering.