You open the monitoring panel, and the number looks wrong:
GPU utilization: 38%.
If you’ve deployed AI workloads in production, you’ve probably had this moment. The number feels like an accusation — the hardware is “idle,” money is being burned, and something must be broken. The instinct is to treat utilization as a scoreboard: higher is better, lower means waste.
That instinct is understandable, but it leads to some of the most persistent misdiagnoses in AI infrastructure. Low utilization is common in real AI systems, and by itself it doesn’t tell you whether the hardware is being wasted. In many cases, the system is working as hard as the workload allows — the bottleneck just isn’t where the utilization counter is looking.
What the utilization number actually captures
GPU utilization, as reported by tools like nvidia-smi, is not “percentage of peak performance being used.” It is the fraction of a sampling window during which at least one kernel was executing on the device. That definition is narrower than most people realize.
It focuses primarily on compute unit activity and averages it across a time window that can hide the internal structure of the workload entirely. A workload that runs intense compute bursts separated by periods of memory-bound operations, synchronization, or host-side orchestration will report lower utilization than a workload that keeps the compute units continuously busy — even if both workloads are delivering comparable throughput for their respective tasks.
This isn’t a flaw in the metric. It’s a limitation of what a single aggregated number can express. Utilization is a proxy for one aspect of device activity, not a summary of system performance. The confusion happens when people treat it as the latter.
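To make the averaging effect concrete, here is a small simulation of how a windowed counter summarizes bursty activity. The 1 ms activity trace and the 100 ms window are illustrative stand-ins, not real GPU telemetry:

```python
# Sketch: how a windowed "utilization" counter summarizes bursty activity.
# We model device activity at 1 ms resolution: 1 = a kernel was running,
# 0 = the device was waiting (memory stalls, host orchestration, queue gaps).
# All numbers here are illustrative, not taken from a real GPU.

def windowed_utilization(activity, window):
    """Average per-millisecond activity over fixed sampling windows."""
    return [
        sum(activity[i:i + window]) / window
        for i in range(0, len(activity), window)
    ]

# A bursty workload: 4 ms of intense kernels, then 6 ms of stalls, repeated.
bursty = ([1] * 4 + [0] * 6) * 100   # 1000 ms total

# Every 100 ms window reports the same flat 40%, erasing the burst structure:
# the counter cannot distinguish "bursty but productive" from "slow and steady."
print(windowed_utilization(bursty, window=100))
```

The dashboard shows ten identical readings of 0.4; nothing in that output reveals whether the gaps were waste or the natural shape of the workload.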
AI workloads are often not compute-bound
A large fraction of real inference work — especially autoregressive decoding with transformer models — is dominated by memory behavior rather than arithmetic throughput. When a workload is memory-bandwidth-bound, the compute units physically cannot stay fully occupied because they spend time waiting for data. The utilization counter shows this as “low utilization,” but what it actually reflects is that compute is not the limiter.
The same pattern appears with irregular operators, small batch sizes, frequent synchronization between operations, and workloads with unfavorable memory access patterns. In all of these cases, the GPU is not “idle” in any meaningful sense — the system is doing real work, it’s just that the work doesn’t look like continuous arithmetic from the compute units’ perspective.
We see this regularly with serving workloads that handle variable-length sequences: the system is handling real requests and delivering real throughput, but the utilization dashboard makes it look like the GPU is coasting. The dashboard is measuring the wrong subsystem for that regime.
The pipeline reality: work arrives in stages, not as a steady stream
AI execution is not one long kernel that runs forever. It’s a pipeline. Work has to be prepared on the host side, moved to the device, scheduled, launched, executed, synchronized, and the results moved back. Some workloads involve multiple back-and-forth stages per step, especially when the framework does host-side orchestration between device kernels.
Utilization counters reward continuous device activity, but many AI workloads are inherently bursty or staged at the device level. Inference services, in particular, often alternate between short, intense bursts of GPU work and gaps where the system is doing queueing, preprocessing, or waiting on upstream components. If your monitoring window averages those gaps together with the bursts, you get a utilization number that makes a bursty-but-productive system look idle by construction.
The bottleneck rule is simple and powerful: the system can only run as fast as its slowest stage. When the slowest stage is outside the GPU compute units — a common situation — the compute units cannot remain saturated, and low utilization becomes the expected outcome, not a defect.
When “low utilization” is the correct operating point
There’s a particularly important scenario where low utilization isn’t a problem to fix — it’s a design choice.
Inference services that optimize for latency deliberately avoid the conditions that maximize utilization. Small batch sizes keep response times low and predictable. Avoiding aggressive queueing prevents tail-latency spikes. Multi-tenant isolation means giving up global packing efficiency in exchange for fairness and stability.
These trade-offs are often correct for the service’s actual objective. Pushing for higher utilization in a latency-sensitive system typically means increasing batching, which means individual requests wait longer, which means the service gets “busier” by the dashboard’s definition but worse by the user’s definition.
So treating utilization as a maximization target is only valid if throughput is your only objective. If latency, predictability, or isolation matter — and in serving workloads they almost always do — then the “right” utilization number might be substantially below 100%, and that’s engineering, not waste.
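A toy cost model makes the trade-off visible. It assumes each step pays a fixed overhead (reading the weights, which is batch-independent in the memory-bound regime) plus a small per-request cost; both constants are illustrative, not measurements:

```python
# Sketch: a toy batching model showing the utilization/latency trade-off.
# Assumes a step costs a fixed overhead plus a small per-request increment,
# a rough fit for memory-bound decode. All numbers are illustrative.

FIXED_MS = 10.0    # per-step cost of streaming weights (batch-independent)
PER_REQ_MS = 0.5   # incremental cost per request in the batch

def step_stats(batch_size):
    step_ms = FIXED_MS + PER_REQ_MS * batch_size
    throughput = batch_size / step_ms * 1000   # requests/second
    return step_ms, throughput

for b in (1, 8, 64):
    step_ms, tput = step_stats(b)
    print(f"batch={b:3d}: step={step_ms:5.1f} ms, throughput={tput:7.0f} req/s")

# Throughput scales ~16x from batch 1 to batch 64, and the dashboard looks
# "busier" -- but each request now also waits for 63 peers to arrive and
# shares a longer step, so individual latency gets worse, not better.
```

Which point on that curve is "right" depends on the service's latency budget, not on the utilization number.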
The question that actually helps
If utilization isn’t the diagnosis, what is?
The shift we find most useful is to stop asking “why is utilization low?” and start asking “where does time go?” Because the answer to the first question is often just “the bottleneck is somewhere else” — which is true but not actionable. The answer to the second question points to the actual limiter: memory traffic patterns, host-side scheduling overhead, PCIe transfer latency, synchronization contention, or something external to the GPU entirely.
Once you find the real limiter, you can decide whether it’s fixable, fundamental, or just the natural shape of the workload. And you can stop treating the utilization number as a moral judgment on your infrastructure.
Understanding that performance is an execution property of the full system is what makes this reframe stick. Utilization is one observation of one stage in a pipeline. It’s useful as a clue. It’s dangerous as a scoreboard.