# Neither TOPS nor GPU utilization predicts AI throughput

Two commonly cited AI hardware metrics, TOPS (tera operations per second) and GPU utilization percentage, both fail to predict throughput for real AI workloads. TOPS is a theoretical ceiling that workloads never reach. GPU utilization percentage measures whether the GPU is busy, not how efficiently it is working. For capacity planning and performance debugging, neither metric is useful in isolation.

## What determines actual AI throughput

AI workload throughput is determined by the binding constraint in the performance roofline. A given operation is one of two things:

- **Compute-bound:** throughput is limited by GPU FLOPS. Measured as the fraction of peak FLOPS achieved.
- **Memory-bandwidth-bound:** throughput is limited by how fast data can be read from and written to GPU memory. Measured as the fraction of peak memory bandwidth utilized.

Most LLM inference at low batch sizes is memory-bandwidth-bound (the model weights must be read from HBM on each token-generation step). Most LLM training with large batches is compute-bound (matrix multiplications dominate).

## The roofline model for AI

| Workload | Bound | Relevant metric | What to optimize |
| --- | --- | --- | --- |
| LLM inference, batch=1 | Memory bandwidth | GB/s utilization | Quantize weights to INT8 |
| LLM inference, batch=64 | Compute | FLOPS utilization (MFU) | Increase batch, use FlashAttention |
| Diffusion inference | Compute + memory | Both | Profile per layer |
| CNN training | Compute | MFU | Larger batch, mixed precision |
| Feature extraction, embeddings | Memory bandwidth | GB/s utilization | Batching |

## How to determine your binding constraint

```python
import torch
from torch.profiler import profile, record_function, ProfilerActivity

# `model` and `inputs` are placeholders for your own CUDA model and batch.
with profile(activities=[ProfilerActivity.CUDA], with_flops=True) as prof:
    with record_function("model_inference"):
        output = model(inputs)

# Per-operation timing and FLOP counts, sorted by total CUDA time.
print(prof.key_averages().table(sort_by="cuda_time_total"))
```

The profiler output shows time and FLOP count per operation. Dividing FLOPs by runtime gives achieved FLOP/s: operations that sustain a large fraction of the GPU's peak FLOP/s are compute-bound, while operations with low FLOP counts but high memory bytes accessed (long runtimes despite little arithmetic) are memory-bound. A sketch of this classification appears at the end of this section.

## TOPS and utilization in context

- **TOPS:** use it as a rough ceiling to eliminate hardware that can't theoretically support the required compute. Don't use it to compare hardware options.
- **GPU utilization (nvidia-smi):** use it to detect gross inefficiency (utilization below 30% in a training loop suggests a data-loading bottleneck). Don't use it to measure efficiency: high utilization is compatible with both efficient and inefficient workloads.

*GPU utilization is not performance analysis* explains the full measurement problem in detail.

## How do you measure what your GPU is actually doing?

nvidia-smi reports GPU utilization as a percentage, but this metric measures time occupancy (was any kernel running on the GPU during the sampling window?), not computational efficiency (how much of the GPU's arithmetic capacity was used?). A GPU that runs a single memory-bound kernel 100% of the time reports 100% utilization while using perhaps only 15% of its compute capacity.

More informative metrics require profiling tools. NVIDIA's Nsight Compute reports SM occupancy, memory throughput achieved versus theoretical, and arithmetic throughput achieved versus theoretical, for each kernel individually. PyTorch's built-in profiler provides kernel-level timing that reveals which operations dominate execution time.
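As a rough sketch of that classification, the `prof` object from the snippet above can be post-processed directly. The 312 TFLOPS peak (an A100-class BF16 figure) and the 50% cutoff are assumptions for illustration, not numbers from this article:

```python
# Post-process the `prof` object captured above. PEAK_FLOPS is an assumed
# A100-class BF16 peak; the 0.5 cutoff is a heuristic, not a hard rule.
PEAK_FLOPS = 312e12

for evt in prof.key_averages():
    if evt.flops and evt.cuda_time_total:
        # Profiler times are reported in microseconds.
        achieved = evt.flops / (evt.cuda_time_total * 1e-6)
        verdict = "compute-bound" if achieved > 0.5 * PEAK_FLOPS else "likely memory-bound"
        print(f"{evt.key:<45} {achieved / 1e12:7.1f} TFLOP/s  {verdict}")
```

Operations with no recorded FLOPs (copies, normalizations, activations) are skipped here, but those are exactly the kernels worth inspecting in Nsight Compute for achieved memory bandwidth.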
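For LLM serving, MFU can also be estimated without a profiler, straight from the token rate. A minimal sketch, assuming a decoder-only transformer (roughly 2 × parameter count FLOPs per generated token); the 7B model size, 40 tokens/s rate, and 312 TFLOPS peak are illustrative numbers, not measurements from this article:

```python
def estimate_mfu(tokens_per_second: float, n_params: float, peak_flops: float) -> float:
    """Model FLOPS Utilization: achieved FLOP/s as a fraction of peak."""
    achieved = tokens_per_second * 2 * n_params  # ~2*N FLOPs per generated token
    return achieved / peak_flops

# Illustrative: a 7B-parameter model decoding 40 tokens/s at batch=1
# on a GPU with a 312 TFLOPS peak.
print(f"MFU: {estimate_mfu(40.0, 7e9, 312e12):.2%}")  # ~0.18%
```

An MFU this far below peak at batch=1 is the memory-bandwidth-bound regime from the roofline table: each generation step re-reads the full weight set, so raising the batch size amortizes those reads across more tokens and raises MFU.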
We use a two-stage profiling approach. First, `nvidia-smi dmon -s u` continuously monitors GPU utilization during a production workload and identifies periods of low utilization (which indicate pipeline stalls or CPU bottlenecks). Second, Nsight Systems profiles a representative window and identifies the specific operations causing the underutilization.

The most actionable metric for production AI is throughput per watt: (inferences per second) / (GPU power draw in watts). This single number captures the combined effect of hardware capability, software optimization, and workload fit. Two systems with identical GPU utilization percentages but different throughput-per-watt values are at different optimization levels: the lower value indicates waste that can be recovered through software tuning. A measurement sketch appears at the end of this article.

## Building a custom utilization dashboard

For production AI serving, we build custom utilization dashboards that combine GPU metrics with application-level metrics. The dashboard shows:

- requests/second (application load)
- tokens/second or images/second (application throughput)
- GPU SM occupancy (compute utilization)
- GPU memory bandwidth utilization (memory pressure)
- GPU power draw (energy efficiency)

Correlating these metrics reveals patterns invisible in any single one. A GPU showing 90% utilization with flat throughput despite increasing request load indicates that the serving framework's batching is suboptimal: requests are queuing but not being batched efficiently. A GPU showing 40% utilization at the model's maximum throughput indicates that the model is memory-bandwidth-bound and cannot saturate the GPU's compute capacity regardless of load.

These dashboards use NVIDIA DCGM (Data Center GPU Manager) for GPU metrics collection, exposed as Prometheus metrics and visualized in Grafana. The setup requires 2–3 hours of initial configuration but provides ongoing visibility that prevents performance problems from reaching production users.

This dashboard has prevented three production incidents in the past year in which GPU utilization appeared healthy while application throughput had degraded. In each case, the discrepancy between the GPU utilization percentage and the application-level throughput metrics triggered an investigation that revealed a software regression: a framework update that introduced suboptimal kernel selection, a configuration change that reduced batch size, and a memory leak that caused increasing garbage-collection pauses. Without the combined view, these issues would have persisted until user-facing impact triggered escalation.
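As a minimal sketch of the throughput-per-watt measurement described earlier, using the nvidia-ml-py (`pynvml`) bindings. `run_inference_batch` is a hypothetical stand-in for your serving loop, and the one-minute window is arbitrary:

```python
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

completed = 0
power_samples_w = []
start = time.time()
while time.time() - start < 60.0:  # arbitrary one-minute measurement window
    completed += run_inference_batch()  # hypothetical: returns inferences finished
    # nvmlDeviceGetPowerUsage reports milliwatts; convert to watts.
    power_samples_w.append(pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0)

elapsed = time.time() - start
throughput = completed / elapsed  # inferences per second
avg_power_w = sum(power_samples_w) / len(power_samples_w)
print(f"{throughput / avg_power_w:.4f} inferences/s/W")
pynvml.nvmlShutdown()
```

In production the same ratio comes from the dashboard rather than a one-off script: DCGM already exports the power reading, the serving layer exports the request counter, and Prometheus computes the division continuously.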