The Hidden Cost of GPU Underutilisation

Most GPU workloads use 30–50% of available compute. Without profiling, bandwidth, occupancy, and serialisation waste is invisible — and expensive.

Written by TechnoLynx Published on 21 Apr 2026

Your GPUs are probably half idle

An NVIDIA A100 provides 312 TFLOPS of dense FP16 tensor-core compute. An H100 (SXM) provides 989 TFLOPS on the same measure. These are the numbers on the data sheet. The numbers in production are different. Typical AI training workloads achieve 30–50% of theoretical peak throughput (an observed range across our engagements, not a benchmarked industry rate). Inference workloads often achieve less. The gap between the hardware’s capability and the software’s utilisation of that capability is wasted compute budget: compute that was purchased (or rented) but not used.

This underutilisation is not a hardware deficiency. It is a software architecture problem — specifically, a mismatch between how the workload is structured and how the GPU hardware executes work. The GPU provides massive parallelism (thousands of cores), high-bandwidth memory (per NVIDIA’s published specifications, 2–3 TB/s on modern architectures), and specialised compute units (tensor cores). Exploiting this hardware requires that the workload is structured to saturate these resources simultaneously. When it is not — when the workload serialises operations that could be parallel, when memory access patterns waste bandwidth, or when kernel launch overhead dominates the compute time — the GPU sits partially idle while the wall-clock time extends beyond what the hardware could deliver.

The cost is real but invisible from the outside. A team running an 8× A100 cluster at a cloud rate of roughly £25/hour per GPU is spending £200/hour whether the silicon is working or waiting. The accounting line is the same; the work delivered is not. That gap is the subject of this article.

How do I calculate the true cost of an underutilised GPU fleet?

The honest unit of measurement is total cost of ownership per useful FLOP, not TCO per purchased FLOP. Purchased FLOPs are what appears on the invoice. Useful FLOPs are what the workload actually consumed. The ratio between them is the utilisation factor, and it is rarely the number the procurement spreadsheet assumed.

A worked example with explicit assumptions:

  • Cluster: 8× A100 80GB on a cloud provider, on-demand pricing roughly £25/hour per GPU (illustrative; rates vary by region, commitment, and provider).
  • Workload: a training run scheduled to occupy the cluster continuously for four weeks.
  • Purchased compute: 8 × 312 TFLOPS = 2,496 TFLOPS of FP16 capacity, available for 672 hours.
  • Cash cost: 8 × £25 × 672 ≈ £134,400 for the run.

Profile the workload and the picture changes. If achieved utilisation is 40% — well within the observed range we see in audits — then the useful FLOPs delivered are 40% of capacity, and the effective cost per useful FLOP is 2.5× the headline rate. Roughly £80,000 of the £134,400 invoice was paid for silicon that was waiting rather than computing. That figure is not theoretical; it is the line item the finance team will see, whether or not anyone names it.
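The arithmetic above can be sketched in a few lines. This is a minimal illustration using only the figures from the worked example; the function name `fleet_cost` and its fields are our own, and the £25/hour rate is the illustrative figure quoted above rather than a real price list.

```python
def fleet_cost(gpus, rate_per_gpu_hour, hours, utilisation):
    """Split a cluster invoice into useful spend and idle spend."""
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    cash = gpus * rate_per_gpu_hour * hours
    useful = cash * utilisation
    return {
        "cash_cost": cash,                    # what the invoice says
        "useful_spend": useful,               # share that bought useful FLOPs
        "idle_spend": cash - useful,          # share paid for waiting silicon
        "cost_multiplier": 1 / utilisation,   # effective vs headline cost per FLOP
    }


# 8 GPUs, £25/hour each (illustrative), four weeks, 40% achieved utilisation.
run = fleet_cost(gpus=8, rate_per_gpu_hour=25.0, hours=4 * 7 * 24, utilisation=0.40)
for name, value in run.items():
    print(f"{name}: {value:,.2f}")
```

Changing `utilisation` to a value measured by profiling gives the idle spend for a real run. The other inputs come straight from the invoice.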

The same calculation runs in reverse for procurement decisions. A team proposing to double cluster size to halve training time is, in effect, proposing to pay twice the hourly rate for a 2× speedup that profiling-guided optimisation could often deliver at the current rate. Until utilisation is measured, the proposal cannot be evaluated on its merits.

Where does the GPU utilisation go?

GPU underutilisation has specific, diagnosable causes. Identifying which cause dominates a given workload is the first step toward recovering the wasted compute. The table below maps the visible symptom to the underlying bottleneck and the profiler view that confirms it.

Symptom-to-bottleneck diagnostic map

Symptom | Likely bottleneck | Tool to confirm | Evidence class
--- | --- | --- | ---
Kernel operates below the compute roofline but near the memory-bandwidth ceiling | Memory bandwidth | Nsight Compute roofline analysis | benchmark (per-kernel measurement)
Compute units idle during warp stalls; insufficient warps to hide memory latency | Low occupancy | Nsight Compute occupancy analysis (reports achieved occupancy and limiting factor: registers, shared memory, block size) | benchmark (per-kernel measurement)
GPU spends more time idle between kernel launches than executing kernels | Host-device serialisation | Nsight Systems timeline profiling | benchmark (timeline measurement)
Kernel achieves a fraction of cuBLAS or tensor-core throughput | Inefficient kernel implementation | Nsight Compute warp-level execution statistics and memory throughput analysis | benchmark (per-kernel measurement)

Memory bandwidth bottleneck. The GPU’s compute throughput exceeds its memory bandwidth by a ratio that has grown with each hardware generation. Per NVIDIA’s published specifications, an A100 pairs 2 TB/s of HBM2e bandwidth with 312 TFLOPS of FP16 compute — meaning a kernel that reads one FP16 value per arithmetic operation is memory-bandwidth-bound, not compute-bound. Element-wise operations, batch normalisation, activation functions, and small matrix multiplications fall into this category. The compute units are idle waiting for data from memory, regardless of how many are physically available.
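The ratio in the paragraph above is the roofline model in miniature. The sketch below uses only the two data-sheet figures quoted there; the element-wise add is a hypothetical kernel chosen for illustration, and the model ignores caches, so treat the result as an order-of-magnitude ceiling rather than a prediction.

```python
def attainable_flops(intensity_flop_per_byte, peak_flops, bandwidth_bytes_per_s):
    """Roofline model: throughput is capped by compute or by memory traffic."""
    return min(peak_flops, intensity_flop_per_byte * bandwidth_bytes_per_s)


A100_PEAK_FP16 = 312e12  # FLOP/s, data-sheet figure quoted in the text
A100_HBM_BW = 2e12       # bytes/s, data-sheet figure quoted in the text

# Ridge point: the arithmetic intensity below which a kernel is bandwidth-bound.
ridge = A100_PEAK_FP16 / A100_HBM_BW  # FLOP per byte

# Hypothetical FP16 element-wise add: two 2-byte reads, one 2-byte write, 1 FLOP.
add_intensity = 1 / (3 * 2)
add_ceiling = attainable_flops(add_intensity, A100_PEAK_FP16, A100_HBM_BW)
fraction = add_ceiling / A100_PEAK_FP16

print(f"ridge point: {ridge:.0f} FLOP/byte")
print(f"element-wise add ceiling: {fraction:.2%} of peak")
```

Any kernel whose arithmetic intensity sits below the ridge point cannot reach the compute ceiling, however well it is written. Its ceiling is set by bandwidth.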

The diagnostic is direct. Nsight Compute’s roofline analysis places each kernel on a plot of arithmetic intensity against achieved performance; bandwidth-bound kernels sit on the diagonal bandwidth line rather than the horizontal compute ceiling. The fix depends on the operation: kernel fusion (combining several bandwidth-bound operations into a single kernel that reuses data in registers or shared memory), data-layout optimisation (ensuring coalesced memory access patterns), and mixed-precision computation (using FP16 or INT8 in place of FP32 to halve or quarter the bytes moved per value).
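A rough traffic model shows why fusion and reduced precision help bandwidth-bound chains. It is a sketch under simplifying assumptions: every unfused kernel reads and writes the full tensor through HBM, caches capture nothing, and a fused kernel keeps intermediates in registers. The function name and the three-operation chain are hypothetical.

```python
def elementwise_traffic_bytes(n_elements, n_ops, bytes_per_element, fused):
    """HBM bytes moved by a chain of unary element-wise operations."""
    # Unfused: one kernel per op, each a full read plus a full write.
    # Fused: a single pass, intermediates never leave the registers.
    passes = 1 if fused else n_ops
    return passes * 2 * n_elements * bytes_per_element


N = 100_000_000  # elements in the tensor (illustrative)

unfused_fp32 = elementwise_traffic_bytes(N, n_ops=3, bytes_per_element=4, fused=False)
fused_fp32 = elementwise_traffic_bytes(N, n_ops=3, bytes_per_element=4, fused=True)
fused_fp16 = elementwise_traffic_bytes(N, n_ops=3, bytes_per_element=2, fused=True)

print(f"unfused FP32: {unfused_fp32 / 1e9:.1f} GB")
print(f"fused FP32:   {fused_fp32 / 1e9:.1f} GB")
print(f"fused FP16:   {fused_fp16 / 1e9:.1f} GB")
```

For a bandwidth-bound chain, run time scales roughly with bytes moved, so under this model the two fixes compound.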

Low occupancy. GPU occupancy is the ratio of warps (groups of 32 threads) resident on a streaming multiprocessor to the maximum number that SM supports. Low occupancy means the GPU cannot hide memory latency through warp switching: when one warp stalls on a memory access, there are not enough other warps ready to execute, and the compute units sit idle. Common causes are excessive register usage per thread (reducing the number of threads that fit in an SM), excessive shared memory usage per thread block (reducing the number of blocks that can co-reside), and grid dimensions that do not saturate the GPU’s SMs.

The diagnostic is direct: Nsight Compute reports achieved occupancy and the limiting factor — registers, shared memory, or block size. The fix requires adjusting the kernel’s resource usage: reducing register pressure through algorithmic restructuring, reducing shared-memory usage through tiling strategies, or increasing the grid size to provide more concurrent blocks.
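The limiting-factor logic can be approximated by hand. The sketch below estimates theoretical occupancy, which is an upper bound and not the achieved figure Nsight Compute measures. The per-SM limits are our assumption of the values for an A100 (compute capability 8.0), and the model ignores register and shared-memory allocation granularity, so check any real kernel against the profiler.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMLimits:
    """Per-SM resource limits; defaults are assumed A100 (CC 8.0) values."""
    max_warps: int = 64
    max_blocks: int = 32
    registers: int = 65536
    shared_mem_bytes: int = 164 * 1024
    warp_size: int = 32


def theoretical_occupancy(threads_per_block, regs_per_thread, smem_per_block,
                          sm=SMLimits()):
    """Return (occupancy, limiting factor) for one kernel configuration."""
    warps_per_block = -(-threads_per_block // sm.warp_size)  # ceiling division
    regs_per_block = regs_per_thread * warps_per_block * sm.warp_size
    # How many blocks each resource allows to be resident on one SM.
    blocks_allowed = {
        "block size": min(sm.max_blocks, sm.max_warps // warps_per_block),
        "registers": sm.registers // regs_per_block,
        "shared memory": (sm.max_blocks if smem_per_block == 0
                          else sm.shared_mem_bytes // smem_per_block),
    }
    limiter = min(blocks_allowed, key=blocks_allowed.get)
    resident_warps = blocks_allowed[limiter] * warps_per_block
    return resident_warps / sm.max_warps, limiter


# Hypothetical kernels: one register-heavy, one shared-memory-heavy.
print(theoretical_occupancy(256, regs_per_thread=128, smem_per_block=0))
print(theoretical_occupancy(256, regs_per_thread=32, smem_per_block=48 * 1024))
```

The returned limiter names the resource to reduce first. Halving register use in the first configuration would double the number of resident blocks under this model.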

Host-device serialisation. Every kernel launch, memory transfer, and synchronisation point between the CPU host and the GPU device creates a serialisation boundary where the GPU may be idle waiting for the host. In workloads with many small operations — model inference with complex branching logic, training loops with frequent metric computation on the CPU — the cumulative host-device overhead can dominate execution time. We have profiled inference pipelines where the GPU spent more time idle between kernel launches than it spent executing kernels.

The fix is to reduce the number of host-device boundaries: batching operations into larger kernels, using CUDA graphs to replay a sequence of operations without per-launch overhead, moving decision logic from the CPU to the GPU where possible, and overlapping data transfer with computation using CUDA streams.
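The effect of cutting host-device boundaries can be estimated with a simple timing model. It assumes the host is the bottleneck (each launch is preceded by an idle gap with no overlap) and that batching does not change the compute time per operation. The per-operation and per-gap timings are illustrative placeholders, not measurements.

```python
def pipeline_time_us(n_ops, compute_per_op_us, host_gap_us, ops_per_launch=1):
    """Return (total time in microseconds, fraction of time the GPU is executing)."""
    launches = -(-n_ops // ops_per_launch)  # ceiling division
    busy = n_ops * compute_per_op_us        # time spent executing kernels
    idle = launches * host_gap_us           # time spent waiting on the host
    return busy + idle, busy / (busy + idle)


# 1,000 small operations, 5 us of compute each, 10 us host gap per launch.
one_per_launch = pipeline_time_us(1000, 5.0, 10.0, ops_per_launch=1)
batched = pipeline_time_us(1000, 5.0, 10.0, ops_per_launch=50)

print(f"one op per launch: {one_per_launch[0]:.0f} us, GPU active {one_per_launch[1]:.0%}")
print(f"50 ops per launch: {batched[0]:.0f} us, GPU active {batched[1]:.0%}")
```

Larger kernels and CUDA graphs both act on the `launches` term. Stream overlap attacks the assumption that gaps and compute cannot run concurrently.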

Inefficient kernel implementations. Custom CUDA kernels that do not exploit the hardware’s memory hierarchy, warp-level primitives, or tensor cores leave performance on the table. In our experience across GPU optimisation engagements, a naive matrix multiplication that does not use shared-memory tiling typically achieves on the order of 5–10% of the performance of a cuBLAS implementation (an observed range across our engagements, not a benchmarked industry rate). A convolution kernel that does not use tensor cores on Volta-and-later architectures achieves a fraction of the hardware’s potential throughput. The three reasons GPUs don’t work out often trace back to exactly these utilisation failures — the hardware was adequate, but the software did not exploit it.

Why GPU-busy percentage is misleading

The single most common mistake we encounter in utilisation discussions is treating nvidia-smi’s GPU-busy percentage (the GPU-Util column, or utilization.gpu in query output) as a measure of useful work. It is not. The field reports the fraction of the sample period during which at least one kernel was executing. It does not report how efficiently that kernel used the SMs, the bandwidth, or the tensor cores. A kernel that occupies one SM out of 108 on an A100 and reads memory inefficiently can still register as “100% busy” while delivering perhaps 1% of the hardware’s useful throughput.
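The gap between the two readings is easy to reproduce on paper. The sketch below computes a busy fraction and an SM-weighted fraction from a list of kernel intervals. The tuple format is a hypothetical simplification of a timeline export, and kernels are assumed not to overlap. Even the SM-weighted figure is only an upper bound, because it says nothing about bandwidth or tensor-core efficiency.

```python
def busy_and_sm_weighted(kernels, window_s, total_sms):
    """kernels: (start_s, end_s, sms_used) tuples with non-overlapping intervals.

    Returns (busy fraction, SM-weighted fraction) over the window.
    """
    busy_time = sum(end - start for start, end, _ in kernels)
    sm_time = sum((end - start) * sms for start, end, sms in kernels)
    return busy_time / window_s, sm_time / (window_s * total_sms)


# One kernel running for the whole one-second window on 1 of an A100's 108 SMs.
busy, weighted = busy_and_sm_weighted([(0.0, 1.0, 1)], window_s=1.0, total_sms=108)

print(f"GPU-busy reading:     {busy:.0%}")
print(f"SM-weighted fraction: {weighted:.2%}")
```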

This matters because the metric is widely deployed in capacity-planning dashboards. A fleet that reads as “90% utilised” in a Grafana panel may, under roofline analysis, be delivering a small fraction of its purchased FLOPs. The decision to procure additional capacity based on the dashboard reading is, in those cases, a decision to pay twice for compute that the existing fleet could deliver with a different kernel. For a deeper look at how utilisation metrics create misleading confidence, see the illusion of idle GPUs.

The honest measurement requires two NVIDIA tools rather than one. Nsight Systems gives a timeline view of when the GPU was actually executing kernels versus idle waiting for the host. Nsight Compute drills into individual kernels with roofline analysis, achieved occupancy, and memory-throughput statistics. Together they replace a single ambiguous percentage with a structured view of where the time and the bandwidth actually went.

Profile before you procure

The causes of GPU underutilisation are not visible from wall-clock timing alone. A training loop that takes 10 hours does not, by itself, indicate whether the GPU was 90% utilised (near-optimal; further gains require algorithmic changes or hardware upgrades) or 30% utilised (significant room for improvement through software optimisation). Both figures are illustrative points from the range we observe, not benchmarked rates. The only way to distinguish these cases is profiling.

The investment is small relative to the compute cost it can recover. A workload running on 8× A100 GPUs at roughly £25/hour per GPU is spending £200/hour. A profiling session that identifies a 2× throughput improvement halves the compute cost for the lifetime of the workload — the profiling ROI is measured in days, not months. For teams weighing on-demand versus reserved versus on-premise capacity, that same profiling output also feeds the longer-term decision; see our comparison of cloud GPU and on-premise AI accelerators for how utilisation factors into the cross-deployment-model TCO calculation.
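The payback claim reduces to one division. In the sketch below the cluster rate and the 2× improvement come from the paragraph above, while `profiling_cost` is a hypothetical placeholder to be replaced with a real quote or internal engineering cost.

```python
def payback_hours(cluster_rate_per_hour, speedup, profiling_cost):
    """Cluster-hours of the original workload needed to recoup a profiling spend."""
    if speedup <= 1:
        raise ValueError("speedup must exceed 1 for any saving to exist")
    # A speedup of s means the same work costs 1/s of the original hourly spend.
    saving_per_hour = cluster_rate_per_hour * (1 - 1 / speedup)
    return profiling_cost / saving_per_hour


# £200/hour cluster, 2x throughput improvement, placeholder profiling cost.
hours = payback_hours(cluster_rate_per_hour=200.0, speedup=2.0, profiling_cost=10_000.0)
print(f"break-even after {hours:.0f} cluster-hours ({hours / 24:.1f} days)")
```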

The pattern we encounter across our GPU Performance Audit engagements is consistent. A team has a workload running on GPU infrastructure. The workload is “too slow” — training takes too long, inference latency is too high, throughput does not meet the production requirement. The proposed solution is more GPUs or better GPUs. Profiling reveals that the existing GPUs are 30–50% utilised (an observed range across our engagements, not a benchmarked industry rate), and the performance gap can be closed through software optimisation rather than hardware procurement. The optimisation path is systematic: profile, identify the dominant bottleneck (memory bandwidth, occupancy, serialisation, kernel efficiency), apply the targeted fix, profile again, move to the next bottleneck. Each iteration recovers utilisation that translates directly to throughput improvement and compute cost reduction.

If your GPU workloads are not achieving the throughput you expected from the hardware — and the diagnosis has not included systematic profiling — a GPU Performance Audit identifies the specific utilisation gaps and the interventions that close them, starting with the profiling data rather than the hardware upgrade proposal.

FAQ

How do I calculate the true cost of an underutilised GPU fleet?

Measure TCO per useful FLOP, not TCO per purchased FLOP. Multiply the cluster’s headline cost by the inverse of achieved utilisation: a fleet running at 40% utilisation has an effective cost per useful FLOP that is 2.5× the headline rate. For an 8× A100 cluster at roughly £25/hour per GPU over a four-week training run, that maps to around £80,000 of a £134,400 invoice paid for silicon that was waiting rather than computing.

What does “GPU utilisation” actually measure — and why is the GPU-busy percentage misleading?

The GPU-busy field in nvidia-smi (shown as GPU-Util) reports the fraction of the sample period during which at least one kernel was executing. It says nothing about how many SMs were active, how much of the memory bandwidth was consumed, or whether the tensor cores were engaged. A kernel occupying one SM out of 108 can still register as 100% busy while delivering a small fraction of the hardware’s useful throughput. The honest measurement requires Nsight Systems for the timeline view and Nsight Compute for per-kernel roofline, occupancy, and memory-throughput analysis.

How do I compute total cost of ownership per useful FLOP rather than per purchased FLOP?

Take the cluster’s hourly cash cost and multiply by the run duration to get the total cash cost. Divide that by the FLOPs purchased (peak throughput multiplied by run duration) to get the headline cost per purchased FLOP, then divide by the achieved utilisation factor from profiling, for example Nsight Compute’s roofline analysis. The result is the cost per FLOP the workload actually consumed. The same calculation applies symmetrically to procurement decisions: a proposal to double cluster size to halve training time is, in effect, a proposal to pay 2× for a result that profiling-guided optimisation can often deliver at 1× cost.

Which workload patterns most often leave GPU capacity on the table?

Four bottlenecks account for almost every case we see in audits. Memory bandwidth saturation, where the kernel waits for HBM rather than for compute, hits element-wise operations, normalisation layers, activation functions, and small matrix multiplications. Low occupancy, where too few warps are active to hide memory latency, is driven by register pressure, shared-memory pressure, or undersized grids. Host-device serialisation dominates workloads with many small kernel launches. Inefficient custom kernels that bypass shared-memory tiling or tensor cores complete the set. Each has a distinct diagnostic signature in Nsight and a different fix.

Should I procure additional GPU capacity or first profile the utilisation of what I have?

Profile first. In the engagements we run, the workload is typically 30–50% utilised before any optimisation (an observed range across our engagements, not a benchmarked industry rate). Doubling utilisation through profiling-guided fixes — kernel fusion, CUDA graphs, mixed precision, tensor-core paths — usually closes the throughput gap that motivated the hardware-expansion proposal in the first place, at a fraction of the procurement cost and timeline.

What cost savings are realistic from optimising utilisation versus renting more cloud GPUs?

The arithmetic follows directly from the achieved-utilisation measurement. A workload running at 40% utilisation that is optimised to 80% halves the compute cost for the lifetime of that workload; on a £200/hour cluster, that is £100/hour saved for as long as the workload runs. Across a multi-week training run or a continuously served inference fleet, the saved spend exceeds the cost of the profiling engagement within days. The figure is workload-specific: the only honest projection comes from profiling the workload in question, not from a generic multiplier.

The thing to take away from a Performance Audit is rarely a single number. It is the structured view of where the purchased FLOPs went — which is the only basis on which the next procurement decision can be made on its merits.
