Low GPU Utilization: Where the Real Bottlenecks Hide

You open the monitoring panel, and the number looks wrong

GPU utilization: 38%.

If you’ve deployed AI workloads in production, you’ve probably had this moment. The number feels like an accusation — the hardware is “idle,” money is being burned, and something must be broken. The instinct is to treat utilization as a scoreboard: higher is better, lower means waste.

That instinct is understandable, but it leads to some of the most persistent misdiagnoses in AI infrastructure. Low utilization is common in real AI systems, and by itself it doesn’t tell you whether the hardware is being wasted. In many cases, the system is working as hard as the workload allows — the bottleneck just isn’t where the utilization counter is looking.

What the utilization number actually captures

GPU utilization, as reported by tools like nvidia-smi, is not “percentage of performance being used.” It is the percentage of a sampling window during which at least one kernel was executing on the GPU. That definition is narrower than most people realize: it says nothing about how many compute units that kernel occupied or how efficiently it used them.

It focuses primarily on compute unit activity and averages it across a time window that can hide the internal structure of the workload entirely. A workload that runs intense compute bursts separated by periods of memory-bound operations, synchronization, or host-side orchestration will report lower utilization than a workload that keeps the compute units continuously busy — even if both workloads are delivering comparable throughput for their respective tasks.

This isn’t a flaw in the metric. It’s a limitation of what a single aggregated number can express. Utilization is a proxy for one aspect of device activity, not a summary of system performance. The confusion happens when people treat it as the latter.
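If you want to look at the raw counter yourself, the sketch below shells out to nvidia-smi in its CSV query mode and parses one sample per GPU. The field list and the helper names (QUERY_FIELDS, build_command, parse_line, sample_gpus) are our own choices for illustration, not part of any library, and the script assumes nvidia-smi is on the PATH.

```python
import subprocess

# Fields requested from nvidia-smi. This particular selection is an
# illustrative choice; `nvidia-smi --help-query-gpu` lists what is available.
QUERY_FIELDS = ["timestamp", "utilization.gpu", "utilization.memory", "memory.used"]


def build_command(fields=QUERY_FIELDS):
    """Build the nvidia-smi command line for a one-shot CSV query."""
    return [
        "nvidia-smi",
        "--query-gpu=" + ",".join(fields),
        "--format=csv,noheader,nounits",
    ]


def parse_line(line, fields=QUERY_FIELDS):
    """Turn one CSV line of output into a {field: value} dict of strings."""
    values = [value.strip() for value in line.split(",")]
    return dict(zip(fields, values))


def sample_gpus():
    """Return one dict per GPU. Requires an NVIDIA driver and nvidia-smi."""
    result = subprocess.run(
        build_command(), capture_output=True, text=True, check=True
    )
    return [parse_line(line) for line in result.stdout.splitlines() if line.strip()]
```

On a machine with an NVIDIA driver, sample_gpus() returns one dictionary per GPU. Each value is still a single windowed average, so treat it as a clue to log alongside other signals rather than as a verdict.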

AI workloads are often not compute-bound

A large fraction of real inference work — especially autoregressive decoding with transformer models — is dominated by memory behavior rather than arithmetic throughput. When a workload is memory-bandwidth-bound, the compute units physically cannot stay fully occupied because they spend time waiting for data. The utilization counter shows this as “low utilization,” but what it actually reflects is that compute is not the limiter.

The same pattern appears with irregular operators, small batch sizes, frequent synchronization between operations, and workloads with unfavorable memory access patterns. In all of these cases, the GPU is not “idle” in any meaningful sense — the system is doing real work, it’s just that the work doesn’t look like continuous arithmetic from the compute units’ perspective.

We see this regularly with serving workloads that handle variable-length sequences: the system is handling real requests and delivering real throughput, but the utilization dashboard makes it look like the GPU is coasting. The dashboard is measuring the wrong subsystem for that regime.
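A back-of-the-envelope roofline check makes the memory-bound regime concrete. The sketch below is a simplification under stated assumptions: it counts only weight traffic for a decode step (ignoring KV cache and activations), assumes 2 FLOPs per parameter per token, and uses a hypothetical device with round-number peak figures. The function names are ours.

```python
def decode_step_cost(n_params, batch_size, bytes_per_param=2):
    """Rough cost of one decode step: (FLOPs, bytes read from device memory).

    Assumes every weight is read once per step and each parameter costs
    2 FLOPs (multiply + add) per sequence in the batch.
    """
    flops = 2 * n_params * batch_size
    bytes_moved = n_params * bytes_per_param
    return flops, bytes_moved


def classify(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Return (regime, fraction of peak arithmetic throughput attainable)."""
    ridge = peak_flops / peak_bandwidth      # FLOPs/byte where both limits meet
    intensity = flops / bytes_moved          # FLOPs/byte of this workload
    attainable = min(peak_flops, intensity * peak_bandwidth)
    regime = "compute-bound" if intensity >= ridge else "memory-bandwidth-bound"
    return regime, attainable / peak_flops


# Hypothetical device: 100 TFLOP/s peak compute, 1 TB/s memory bandwidth.
PEAK_FLOPS = 100e12
PEAK_BANDWIDTH = 1e12

for batch in (1, 16, 256):
    regime, fraction = classify(*decode_step_cost(7e9, batch), PEAK_FLOPS, PEAK_BANDWIDTH)
    print(f"batch={batch:>3}  {regime:<24} attainable fraction of peak: {fraction:.2f}")
```

Under these assumptions, small-batch decoding sits far below the ridge point, so the arithmetic units cannot be kept busy no matter how well the kernels are written. Real numbers will differ; the point is the shape of the calculation.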

The pipeline reality: work arrives in stages, not as a steady stream

AI execution is not one long kernel that runs forever. It’s a pipeline. Work has to be prepared on the host side, moved to the device, scheduled, launched, executed, synchronized, and the results moved back. Some workloads involve multiple back-and-forth stages per step, especially when the framework does host-side orchestration between device kernels.

Utilization counters reward continuous device activity, but many AI workloads are inherently bursty or staged at the device level. Inference services, in particular, often alternate between short, intense bursts of GPU work and gaps where the system is doing queueing, preprocessing, or waiting on upstream components. If your monitoring window averages those gaps together with the bursts, you get a utilization number that makes a bursty-but-productive system look idle by construction.
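The averaging effect is easy to reproduce with a toy model. The sketch below treats device activity as a list of (start, end) busy intervals and computes the fraction of a monitoring window in which at least one interval is active. It is an illustration of window averaging, not a reimplementation of any vendor's counter, and the burst pattern is made up.

```python
def busy_fraction(busy_intervals, window_start, window_end):
    """Fraction of [window_start, window_end) covered by any busy interval."""
    # Clip intervals to the window and drop those that fall outside it.
    clipped = sorted(
        (max(start, window_start), min(end, window_end))
        for start, end in busy_intervals
        if end > window_start and start < window_end
    )
    covered = 0.0
    cursor = window_start  # everything before the cursor is already counted
    for start, end in clipped:
        start = max(start, cursor)  # avoid double-counting overlaps
        if end > start:
            covered += end - start
            cursor = end
    return covered / (window_end - window_start)


# Made-up service: a 20 ms burst of GPU work every 100 ms, for one second.
bursts = [(i * 0.1, i * 0.1 + 0.02) for i in range(10)]

print(busy_fraction(bursts, 0.0, 1.0))    # averaged over the whole second
print(busy_fraction(bursts, 0.0, 0.02))   # zoomed in on a single burst
```

The same activity reads as mostly idle over the long window and fully busy inside a burst. Which number you see depends on the window, not on whether the service is healthy.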

The bottleneck rule is simple and powerful: the system can only run as fast as its slowest stage. When the slowest stage is outside the GPU compute units — a common situation — the compute units cannot remain saturated, and low utilization becomes the expected outcome, not a defect.
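The bottleneck rule can be written down in a few lines. This sketch assumes an idealized steady-state pipeline in which stages overlap perfectly, so the step time equals the slowest stage. The stage names and timings are hypothetical.

```python
def pipeline_report(stage_seconds, gpu_stage="gpu_compute"):
    """Summarize an idealized, fully overlapped pipeline.

    stage_seconds maps stage name -> time per item in that stage.
    The slowest stage sets the pace; every other stage waits on it.
    """
    bottleneck = max(stage_seconds, key=stage_seconds.get)
    step_time = stage_seconds[bottleneck]
    return {
        "bottleneck": bottleneck,
        "throughput_per_s": 1.0 / step_time,
        "gpu_busy_fraction": stage_seconds[gpu_stage] / step_time,
    }


# Hypothetical per-request timings, in seconds.
stages = {
    "host_preprocess": 0.010,
    "h2d_copy": 0.002,
    "gpu_compute": 0.004,
    "d2h_copy": 0.001,
}

print(pipeline_report(stages))
```

With these made-up numbers the host stage sets the pace, and the GPU is busy for less than half of each step even though nothing on the device is misbehaving. Speeding up the GPU stage would change nothing; speeding up preprocessing would.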

Common bottleneck categories when utilization appears low

Bottleneck type	Symptoms	What utilization shows	Actual limiter
Compute-bound	GPU near thermal limits, high arithmetic activity	High (as expected)	Arithmetic throughput of the compute units
Memory-bandwidth-bound	High HBM traffic, moderate utilization	Lower than expected	Data movement speed, not compute capacity
Host-bound	Gaps between kernel launches, CPU at high load	Low, with idle periods	CPU preprocessing, data loading, or orchestration overhead
Pipeline-bound	Intermittent utilization spikes and dips	Variable, averaging low	Synchronization between stages, PCIe transfers, or upstream dependencies

When is low GPU utilization the correct operating point?

There’s a particularly important scenario where low utilization isn’t a problem to fix — it’s a design choice.

Inference services that optimize for latency deliberately avoid the conditions that maximize utilization. Low batching keeps response times predictable. Avoiding aggressive queueing prevents tail-latency spikes. Multi-tenant isolation means giving up global packing efficiency in exchange for fairness and stability.

These trade-offs are often correct for the service’s actual objective. Pushing for higher utilization in a latency-sensitive system typically means increasing batching, which means individual requests wait longer, which means the service gets “busier” by the dashboard’s definition but worse by the user’s definition.

So treating utilization as a maximization target is only valid if throughput is your only objective. If latency, predictability, or isolation matter — and in serving workloads they almost always do — then the “right” utilization number might be substantially below 100%, and that’s engineering, not waste.
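A simple model shows the trade-off in numbers. The sketch below assumes a server that processes batches back to back from a queue that is never empty. Each batch costs a host-side gap (during which the GPU is idle) plus GPU time that grows linearly with batch size. All constants are hypothetical, and queueing delay is deliberately left out, so real latency would be higher still.

```python
def batch_profile(batch_size, host_gap_s=0.002, gpu_fixed_s=0.004, gpu_per_item_s=0.0005):
    """Busy fraction, throughput, and service latency for one batch size.

    Assumed cost model (illustrative only):
      GPU time per batch = gpu_fixed_s + gpu_per_item_s * batch_size
      step time          = host_gap_s + GPU time per batch
    """
    gpu_time = gpu_fixed_s + gpu_per_item_s * batch_size
    step_time = host_gap_s + gpu_time
    return {
        "batch_size": batch_size,
        "gpu_busy_fraction": gpu_time / step_time,
        "throughput_per_s": batch_size / step_time,
        "service_latency_s": step_time,  # every request in the batch waits this long
    }


for size in (1, 8, 32):
    p = batch_profile(size)
    print(
        f"batch={p['batch_size']:>2}  busy={p['gpu_busy_fraction']:.2f}  "
        f"throughput={p['throughput_per_s']:.0f}/s  "
        f"latency={p['service_latency_s'] * 1000:.1f} ms"
    )
```

In this model the busy fraction and the throughput both rise with batch size, and so does the time each request spends being served. Whether that exchange is worth making depends on the service's latency budget, not on the dashboard.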

The question that actually helps

If utilization isn’t the diagnosis, what is?

The shift we find most useful is to stop asking “why is utilization low?” and start asking “where does time go?” Because the answer to the first question is often just “the bottleneck is somewhere else” — which is true but not actionable. The answer to the second question points to the actual limiter: memory traffic patterns, host-side scheduling overhead, PCIe transfer latency, synchronization contention, or something external to the GPU entirely.
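One low-tech way to start answering “where does time go?” is to wrap each stage of a request in a wall-clock timer and look at the shares. The StageTimer class below is a hypothetical helper written for this article, using only the Python standard library; the stage names and the sleep calls are stand-ins for real work.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named stage."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def breakdown(self):
        """Return {stage: share of total measured time}, largest first."""
        total = sum(self.totals.values())
        if total == 0:
            return {}
        ranked = sorted(self.totals.items(), key=lambda item: item[1], reverse=True)
        return {name: seconds / total for name, seconds in ranked}


# Stand-in workload: sleeps replace real preprocessing and inference.
# Note: GPU launches are usually asynchronous, so a real "inference" stage
# needs a device synchronization before the timer stops, or the time will
# be attributed to whichever later stage happens to wait on the device.
timer = StageTimer()
for _ in range(3):
    with timer.stage("preprocess"):
        time.sleep(0.010)
    with timer.stage("inference"):
        time.sleep(0.004)

for name, share in timer.breakdown().items():
    print(f"{name:<12} {share:.0%}")
```

Host-side timers like this are coarse. They are enough to tell whether the time is going to the host or the device, which is usually the first question; a proper profiler is the next step once you know where to look.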

Once you find the real limiter, you can decide whether it’s fixable, fundamental, or just the natural shape of the workload. And you can stop treating the utilization number as a moral judgment on your infrastructure.

There’s a useful sanity check before you start fixing anything: separate an expected idle pattern from a real stall. A memory-bandwidth-bound decode step or a deliberately low-batch latency service will report low utilization by construction; that is the workload’s natural shape, not a bug. A genuine stall looks different: the GPU sits idle while a host-side data loader, a synchronization barrier, or an upstream dependency keeps it starved of work it could otherwise be doing. The tell is whether the limiter is intrinsic to the computation or an avoidable gap in feeding it. The first is the workload; the second is worth fixing.

This is also where the so-called “30% rule” comes apart. People sometimes cite a threshold — utilization should sit above 30%, or 50%, or some other figure — as if a single number could certify health. It can’t. Utilization is context-dependent: the same model on the same hardware will land at very different numbers depending on batch size, framework orchestration, and input pipeline shape. There is no portable heuristic that maps a utilization percentage onto “healthy” or “wasted,” because the metric measures one subsystem’s occupancy, not whether the system is doing the right work efficiently. A threshold rule is a scoreboard dressed up as diagnosis.

Understanding that performance is an execution property of the full system is what makes this reframe stick. Utilization is one observation of one stage in a pipeline. It’s useful as a clue. It’s dangerous as a scoreboard.

LynxBenchAI applies this diagnostic framing at the methodology level, treating the full hardware-and-software stack as the unit of measurement rather than isolating individual device metrics such as utilization. It is a benchmarking methodology for AI hardware that measures sustained throughput under realistic load, reported per precision, with bounded optimization.

In a running system, moving from “why is utilization low?” to “where does time go?” is the first step of profiling AI inference: the practical work of locating the real limiter.

Frequently Asked Questions

Why is low GPU utilization common on AI workloads even when nothing is broken?

Because many AI workloads — particularly autoregressive decoding, small-batch inference, and pipelines with host-side orchestration — are limited by memory bandwidth, data movement, or scheduling rather than arithmetic throughput. The compute units cannot stay saturated when they are waiting on the slowest stage of the pipeline, so low utilization becomes the expected outcome, not a defect.

What does the GPU utilization percentage reported by tools like nvidia-smi actually count, and what does it miss?

nvidia-smi reports the percentage of a sampling window during which at least one kernel was executing on the GPU. Because that is a single figure averaged across the window, it misses the internal structure of the workload: memory-bound phases, synchronization gaps, host-side orchestration, and PCIe transfers are either invisible or counted as idle time, even when the system is doing productive work.

When is pushing for higher GPU utilization not the right optimization goal?

Whenever latency, predictability, or multi-tenant isolation matters more than raw throughput. Latency-sensitive inference services deliberately keep batch sizes low and avoid aggressive queueing, which lowers utilization but improves response times and tail-latency behavior. In that regime, the “right” utilization number can be well below 100%, and treating it as a maximization target degrades the service’s actual objective.

Where do AI workload bottlenecks usually live — compute, memory bandwidth, data movement, or scheduling?

In real systems they are spread across all four, and most often outside the compute units. The table above maps four common regimes: compute-bound, memory-bandwidth-bound, host-bound, and pipeline-bound. Memory bandwidth dominates transformer decoding; host preprocessing and orchestration dominate many serving stacks; pipeline synchronization and PCIe transfers show up as intermittent utilization patterns.

How should a team diagnose an “underutilized” GPU before assuming the hardware is wasted?

Stop asking “why is utilization low?” and start asking “where does time go?” That reframes the diagnosis toward the actual limiter — memory traffic, host scheduling, PCIe latency, synchronization, or upstream dependencies — rather than the symptom. Once the real limiter is identified, you can decide whether it is fixable, fundamental, or simply the natural shape of the workload.

Why does the same model show very different utilization under different batch sizes, frameworks, or input pipelines?

Because utilization is a function of how continuously the compute units have work queued, not of how much real work the system is doing. Larger batches pack more arithmetic per kernel launch and hide host-side gaps; different frameworks vary in how much orchestration they do between kernels; input pipelines with irregular operators or variable-length sequences create memory access patterns that leave compute units waiting. Same model, same hardware, different pipeline shape — different number on the dashboard.

When does a low GPU utilization number actually indicate a real data-pipeline or scheduling stall worth fixing, versus an expected idle pattern for the workload?

An expected idle pattern is intrinsic to the computation: a memory-bandwidth-bound decode step or a deliberately low-batch latency service will report low utilization by construction. A real stall is an avoidable gap in feeding the GPU — a host-side data loader, a synchronization barrier, or an upstream dependency keeping the device starved of work it could otherwise be doing. The diagnostic test is whether the limiter is intrinsic to the work or an avoidable gap in supplying it; the first is the workload’s natural shape, the second is worth fixing.

What is the so-called “30% rule” people cite for AI workloads, and does GPU utilization map to any such heuristic in a meaningful way?

The “30% rule” is the habit of citing a fixed utilization threshold — above 30%, or 50%, or some other figure — as if a single number could certify a workload as healthy or wasteful. It does not map to anything meaningful, because utilization is context-dependent: the same model on the same hardware lands at very different numbers depending on batch size, framework orchestration, and input pipeline shape. The metric measures one subsystem’s occupancy, not whether the system is doing the right work efficiently, so no portable threshold can diagnose performance.