## GPU utilization percentage is a poor AI performance metric

nvidia-smi reports GPU utilization as a percentage. Teams running AI workloads often treat this number as a performance indicator: high utilization = good, low utilization = wasted compute. This interpretation is wrong often enough to cause real problems.

GPU utilization percentage measures how often the GPU was executing at least one kernel during the most recent sampling window. A GPU running a single inefficient kernel 100% of the time shows 100% utilization. A GPU running the same workload 10× faster also shows 100% utilization. The numbers are identical; the performance is not.

### What does nvidia-smi GPU utilization actually measure?

From NVIDIA's documentation, the GPU utilization metric is the "Percent of time over the past sample period during which one or more kernels was executing on the GPU." This is a binary measure: it tells you whether the GPU was doing anything at all, not how efficiently it was working.

### The utilization interpretation problems

| Situation | GPU utilization (nvidia-smi) | Actual state |
|---|---|---|
| Training a well-optimized large model | ~100% | Efficient: compute-bound |
| Training with a data-loading bottleneck | ~100% (during compute) | Inefficient: bubbles between compute bursts |
| Inference at batch=1 | Often 40–70% | Expected for latency-optimized serving |
| Memory-bandwidth-bound operation | ~100% | Expected: limited by memory, not compute |
| Poorly optimized kernel | ~100% | Inefficient: many compute units idle |

100% GPU utilization can mean either "the hardware is being used efficiently" or "a kernel is running that does not efficiently use the GPU's compute units." The metric does not distinguish between the two.

### Better metrics for AI GPU performance

| Metric | What it measures | How to get it |
|---|---|---|
| MFU (Model FLOPS Utilization) | Fraction of theoretical FLOPS achieved | Manual calculation from throughput |
| SM occupancy | Fraction of SMs with active warps | Nsight Compute |
| Memory bandwidth utilization | Fraction of peak bandwidth used | Nsight Compute |
| Actual throughput (items/sec) | The thing you actually care about | Application-level measurement |

### Correct interpretation pattern

When diagnosing AI performance:

1. Measure actual throughput (tokens/sec, images/sec).
2. Check whether GPU memory bandwidth is saturated (Nsight or DCGM).
3. Only then interpret GPU utilization as context.

*GPU utilization is not performance* covers the full analysis of why the utilization metric misleads and what to use instead.

### Why does high GPU utilization not mean high performance?

A GPU showing 100% utilization can still be performing poorly. The utilization metric from nvidia-smi indicates that at least one CUDA kernel was active during each sampling period; it says nothing about what that kernel was doing. A memory-copy kernel, a poorly parallelized custom kernel, or an inefficient attention implementation all show as 100% utilization while leaving most of the GPU's compute units idle.

The distinction matters for capacity planning. A system reporting 95% GPU utilization appears to have no headroom, but profiling might reveal that 40% of that time is spent on suboptimal kernels that could be replaced with fused or vendor-optimized alternatives. We have seen cases where replacing a custom CUDA kernel with a cuDNN-optimized equivalent reduced inference time by 35%, with no change in the GPU utilization percentage: it remained near 100%, but each unit of time processed more data.

For benchmark testing, GPU utilization should be reported alongside throughput (samples/second or tokens/second).
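The MFU metric from the table above makes this concrete: it changes when kernels get faster, while the nvidia-smi percentage does not. A minimal sketch of the calculation, using the common approximation of roughly 6 FLOPs per parameter per token for a transformer forward-and-backward pass; the model size, throughput, and the 312 TFLOPS BF16 peak below are illustrative assumptions, not figures from this article.

```python
# Minimal MFU (Model FLOPS Utilization) sketch for transformer training.
# Assumption: ~6 FLOPs per parameter per token covers forward + backward.
# Substitute your own model size, measured throughput, and hardware peak.

def training_mfu(params: float, tokens_per_sec: float, peak_flops: float) -> float:
    """Fraction of theoretical peak FLOPS actually achieved."""
    achieved_flops = 6.0 * params * tokens_per_sec  # forward + backward approximation
    return achieved_flops / peak_flops

# Hypothetical example: a 7B-parameter model at 3,000 tokens/sec on one GPU
# with a 312 TFLOPS BF16 peak.
mfu = training_mfu(params=7e9, tokens_per_sec=3_000, peak_flops=312e12)
print(f"MFU: {mfu:.1%}")  # ~40% here; nvidia-smi would report ~100% either way
```

If an optimized kernel doubles tokens/sec, the MFU doubles; the utilization percentage stays pinned near 100%.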
If two configurations both show 98% GPU utilization, but configuration A processes 1,200 samples/second and configuration B processes 800 samples/second, configuration A is 50% more efficient despite identical utilization. This happens when A uses optimized kernels (FlashAttention, torch.compile, TensorRT) that extract more useful work from each GPU cycle.

Our benchmarking reports include three GPU metrics: utilization percentage, achieved memory bandwidth (GB/s), and achieved arithmetic throughput (TFLOPS). The ratio of achieved to theoretical for each metric identifies the specific bottleneck, and therefore the specific optimization that would improve performance.

### Utilization metrics for different workload types

Different AI workload types produce different utilization signatures, and understanding these signatures helps diagnose performance issues.

Training workloads typically show high, steady GPU utilization (85–98%) with periodic dips corresponding to gradient synchronization in distributed training. The depth of the dips indicates communication overhead: shallow dips (2–3%) indicate efficient communication, deep dips (15–20%) indicate a communication bottleneck.

Inference serving workloads show variable utilization that tracks request load. At low load, utilization is low and latency is low. As load increases, utilization increases and latency remains stable until a saturation point, beyond which utilization plateaus near 100% and latency increases sharply. We identify the saturation point experimentally and set capacity alerts at 80% of that utilization level to maintain latency SLAs.

For teams implementing GPU benchmarking practices, we recommend collecting all three metrics (utilization percentage, memory bandwidth, arithmetic throughput) from the first benchmark run, even if only one metric seems relevant. The complete dataset enables retrospective analysis when performance questions arise later; those questions are impossible to answer without historical baseline data. Collecting incomplete metrics initially and adding more later produces a fragmented history with no consistent baseline for comparison across time periods.
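A minimal sketch of reporting utilization alongside throughput in a benchmark loop, assuming the NVML Python bindings (the `pynvml` module from the `nvidia-ml-py` package) and a hypothetical `run_inference_batch()` you would replace with your own workload. NVML only exposes the utilization percentages; achieved memory bandwidth (GB/s) and arithmetic throughput (TFLOPS) would come from Nsight Compute or DCGM, as noted above.

```python
# Sketch: log nvidia-smi-style utilization next to the throughput number that
# actually matters. Requires nvidia-ml-py (pip install nvidia-ml-py) and an
# NVIDIA GPU; run_inference_batch() is a placeholder for your real workload.
import time
import pynvml

def run_inference_batch() -> int:
    """Placeholder workload: replace with your real inference or training step."""
    time.sleep(0.01)  # stand-in for GPU work
    return 32         # items handled in this batch

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0

processed = 0
start = time.time()
for _ in range(100):
    processed += run_inference_batch()
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    # util.gpu is the same "was anything running" percentage nvidia-smi shows;
    # util.memory is the percent of time device memory was being read or written.
    print(f"gpu_util={util.gpu}%  mem_util={util.memory}%")

elapsed = time.time() - start
print(f"throughput: {processed / elapsed:.1f} items/sec")  # report this alongside utilization
pynvml.nvmlShutdown()
```

Logging both numbers from the first benchmark run gives the consistent baseline described above: the utilization trace alone cannot tell two configurations apart, but utilization plus throughput can.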