AI TOPS and GPU Utilization: When TOPS Is the Wrong Metric for Your Workload

TOPS and GPU utilization both mislead AI capacity planning. Learn when compute, memory bandwidth, or throughput is the right metric for your workload.

Written by TechnoLynx Published on 07 May 2026

Neither TOPS nor GPU utilization predicts AI throughput

Two of the most quoted AI hardware metrics — TOPS (Tera Operations Per Second) and GPU utilization percentage — collapse the moment you try to use them for capacity planning. TOPS is a theoretical ceiling that real workloads never approach. GPU utilization, as reported by nvidia-smi, tells you whether the GPU was busy during a sampling window, not whether the silicon was doing useful work. Worse, that coarse figure is not even the same quantity as SM utilization — the two can diverge sharply on the same workload, and confusing them is a common source of wrong conclusions. Neither answers the question buyers and operators actually want answered: how much throughput will I get for my workload, on this hardware, under realistic conditions?

This piece is specifically about TOPS as a metric paired with GPU utilization — what the two together can and cannot tell you about a running workload. Two adjacent questions live in companion articles: what TOPS on the spec sheet measures is covered in AI TOPS on the spec sheet; how the hardware-software stack turns TOPS into achieved throughput is covered in TOPS performance across the stack. This piece treats the metric pair, not the spec sheet or the stack, as the unit of analysis.

The mistake is to treat utilization as an outcome rather than a proxy. It is a proxy: a coarse, one-dimensional signal that lacks workload context. A GPU running a single memory-bound kernel can report 100% utilization while using roughly 15% of its arithmetic capacity. Another GPU running a heavily compute-bound kernel can report 40% utilization and still be saturating the relevant hardware unit. In both cases the number on the dashboard points away from the operational reality.

Coarse nvidia-smi utilization versus SM utilization

The single percentage nvidia-smi prints is the fraction of the sampling window during which at least one kernel was running on the device. It says nothing about how many of the GPU’s Streaming Multiprocessors were actually working. SM utilization — what NVIDIA Nsight Compute and DCGM expose as SM activity or achieved occupancy — measures the fraction of those SMs doing work. The two diverge sharply on the same workload: a single small memory-bound kernel keeps the device “busy” for the whole window, so coarse utilization reads near 100% while SM utilization sits in the low teens. That gap is exactly where the coarse metric stops being informative — a GPU pinned at 100% on nvidia-smi is not by itself a sign of trouble (a saturated training job should look like that), but it also does not confirm the SMs are saturated. Only the SM-level reading tells you which is the case.
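To make the gap concrete, here is a toy model of a single sampling window. It is a sketch of the two definitions above, not a description of how the driver or DCGM compute their counters. The `KernelSpan` type, the 100 ms window, and the 108-SM device are illustrative assumptions, and the model assumes concurrent kernels never share an SM.

```python
from dataclasses import dataclass

@dataclass
class KernelSpan:
    start_ms: float   # kernel start, relative to the window start
    end_ms: float     # kernel end
    active_sms: int   # SMs doing work while it runs (assumed constant)

def coarse_utilization(spans, window_ms):
    """Fraction of the window in which at least one kernel was running."""
    busy, cursor = 0.0, 0.0
    for span in sorted(spans, key=lambda s: s.start_ms):
        start = max(span.start_ms, cursor)   # skip time already counted
        end = min(span.end_ms, window_ms)
        if end > start:
            busy += end - start
            cursor = end
    return busy / window_ms

def sm_utilization(spans, window_ms, total_sms):
    """Time-weighted fraction of SMs doing work (assumes kernels never share an SM)."""
    sm_ms = 0.0
    for span in spans:
        duration = min(span.end_ms, window_ms) - max(span.start_ms, 0.0)
        sm_ms += max(duration, 0.0) * span.active_sms
    return min(sm_ms / (window_ms * total_sms), 1.0)

# Hypothetical window: one small kernel occupies 14 of 108 SMs for all 100 ms.
window = [KernelSpan(start_ms=0.0, end_ms=100.0, active_sms=14)]
coarse = coarse_utilization(window, window_ms=100.0)
sm = sm_utilization(window, window_ms=100.0, total_sms=108)
print(f"coarse: {coarse:.0%}, SM-level: {sm:.0%}")
```

With these made-up numbers the coarse figure is 100% while the SM-level figure is about 13%, which is the divergence described above.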

What actually determines AI throughput

Throughput is set by the binding constraint in the roofline model (Williams et al.), a framing widely adopted across the GPU performance literature. For any given operation, one of two things is true:

  1. The operation is compute-bound. Throughput is limited by GPU FLOPS. The relevant measure is MFU — Model FLOPs Utilization — the fraction of peak FLOPS the workload actually achieves.
  2. The operation is memory-bandwidth-bound. Throughput is limited by how fast weights and activations can be read from HBM. The relevant measure is achieved memory bandwidth as a fraction of peak.

Most LLM inference at low batch sizes is memory-bandwidth-bound — the model weights have to be streamed from HBM on every token-generation step, and arithmetic intensity per byte loaded is low. Most LLM training at large batch sizes is compute-bound, because the dense matrix multiplications inside attention and MLP layers dominate. Diffusion inference and many vision pipelines sit in between, with some layers compute-bound and others memory-bound.
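The batch-size effect can be sketched with a few lines of roofline arithmetic. The peak figures below are hypothetical round numbers rather than any real device's spec, and `linear_layer_cost` is an illustrative helper that counts only the weights plus input and output activations of one FP16 linear layer.

```python
PEAK_FLOPS = 300e12      # hypothetical peak, FLOP/s
PEAK_BANDWIDTH = 2e12    # hypothetical peak HBM bandwidth, bytes/s

def ridge_point(peak_flops, peak_bandwidth):
    """Arithmetic intensity (FLOP/byte) at which the two roofline edges meet."""
    return peak_flops / peak_bandwidth

def attainable_flops(intensity, peak_flops, peak_bandwidth):
    """Roofline ceiling in FLOP/s for a kernel with the given intensity."""
    return min(peak_flops, intensity * peak_bandwidth)

def classify(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Return (bound, intensity) for one operation."""
    intensity = flops / bytes_moved
    if intensity >= ridge_point(peak_flops, peak_bandwidth):
        return "compute-bound", intensity
    return "memory-bandwidth-bound", intensity

def linear_layer_cost(batch, n=4096, bytes_per_value=2):
    """FLOPs and bytes for a batch x n input times an n x n FP16 weight matrix."""
    flops = 2 * batch * n * n                                # multiply-accumulate = 2 FLOPs
    bytes_moved = bytes_per_value * (n * n + 2 * batch * n)  # weights + in/out activations
    return flops, bytes_moved

for batch in (1, 256):
    bound, intensity = classify(*linear_layer_cost(batch), PEAK_FLOPS, PEAK_BANDWIDTH)
    ceiling = attainable_flops(intensity, PEAK_FLOPS, PEAK_BANDWIDTH)
    print(f"batch={batch}: {intensity:.1f} FLOP/byte, {bound}, "
          f"ceiling {ceiling / PEAK_FLOPS:.1%} of peak")
```

Under these assumptions the batch-1 case sits at roughly 1 FLOP/byte, far below the ridge point of 150, while the batch-256 case clears it because the same weights are reused across the batch.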

This is the framing that makes utilization legible. Utilization without a roofline classification is just a number. Utilization paired with “this kernel is memory-bandwidth-bound at 78% of peak HBM bandwidth” is an actionable observation.

A quick map: what to measure, by workload

| Workload | Bound | Relevant metric | What to optimize |
|---|---|---|---|
| LLM inference, batch=1 | Memory bandwidth | GB/s utilization | INT8 / INT4 weight quantization |
| LLM inference, batch=64+ | Compute | MFU | Larger batch, FlashAttention, kernel fusion |
| Diffusion inference | Mixed | Both, per layer | Profile per kernel |
| CNN training | Compute | MFU | Larger batch, mixed precision |
| Embedding / feature extraction | Memory | Bandwidth | Batching, dtype reduction |

This is a decision rubric, not a benchmark. The point is that “the right metric” is workload-shaped — there is no single number, including utilization or TOPS, that survives across all five rows.

Why does a GPU show low utilization while still delivering high throughput?

Because utilization measures time occupancy, not work done. If the binding constraint for a workload is HBM bandwidth, then the moment memory bandwidth is saturated, additional SMs (Streaming Multiprocessors) sit idle by definition — there is nothing for them to compute on until the next batch of weights arrives. The GPU is operating at the workload’s ceiling, but the utilization metric reports the idle SMs as headroom.

This is the canonical case where “make utilization higher” is the wrong optimization goal. Pushing more concurrent work onto a memory-bandwidth-bound kernel does not raise throughput; it raises contention. The honest answer is that the workload’s arithmetic intensity is low and the only paths forward are (a) reduce bytes per operation through quantization, (b) restructure the kernel so more arithmetic happens per byte read, or (c) accept the ceiling.
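Option (a) can be put into rough numbers. The sketch below assumes a batch-1 decode step that reads every weight exactly once per token and ignores KV-cache reads, activations, and quantization metadata, so it gives an upper bound and not a prediction. The 7-billion-parameter model and the 2 TB/s bandwidth figure are hypothetical.

```python
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def decode_ceiling_tokens_per_s(n_params, dtype, peak_bandwidth):
    """Upper bound on batch-1 decode rate if each weight is read once per token."""
    weight_bytes = n_params * BYTES_PER_PARAM[dtype]
    return peak_bandwidth / weight_bytes

N_PARAMS = 7e9           # hypothetical model size
PEAK_BANDWIDTH = 2e12    # hypothetical HBM bandwidth, bytes/s

ceilings = {
    dtype: decode_ceiling_tokens_per_s(N_PARAMS, dtype, PEAK_BANDWIDTH)
    for dtype in BYTES_PER_PARAM
}
for dtype, tokens_per_s in ceilings.items():
    print(f"{dtype}: at most {tokens_per_s:.0f} tokens/s")
```

The ceiling scales inversely with bytes per parameter, which is why quantization moves the wall and added concurrency on the same kernel does not.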

We see this regularly in production LLM serving. A team chasing a higher nvidia-smi reading by increasing concurrency hits the same tokens-per-second wall, then concludes the GPU is “underused.” It isn’t. The metric is misreading the situation.

How do you actually measure what a GPU is doing?

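Three tiers of tooling cover most cases. The usual starting point is the framework-level profiler:

import torch
from torch.profiler import profile, record_function, ProfilerActivity

# Stand-in model and batch (assumptions); replace with your own, already on the GPU.
model = torch.nn.Linear(4096, 4096).cuda().eval()
inputs = torch.randn(64, 4096, device="cuda")

# Record CPU and CUDA activity so operator-level FLOP estimates sit next to GPU times.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             with_flops=True) as prof:
    with record_function("model_inference"), torch.no_grad():
        output = model(inputs)

# Rank operators by total time spent on the GPU.
print(prof.key_averages().table(sort_by="cuda_time_total"))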

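PyTorch’s profiler gives you per-operation time and, with with_flops=True, estimated FLOP counts for the operators it knows how to count, such as matrix multiplications and convolutions. Dividing FLOPs by GPU time gives achieved FLOP/s per operation: operations that run close to the device’s peak are compute-bound, while operations that take significant time for few FLOPs are likely memory-bound. The FLOP table does not include bytes moved, so treat this as a first classification of which roofline edge you are sitting on, and confirm it per kernel.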
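One way to turn per-operation FLOPs and GPU time into a roofline hint is to compute achieved FLOP/s as a fraction of peak. The sketch below works on hand-written `(name, flops, gpu_time_us)` records, not on the profiler's own objects, and the timings and the 300 TFLOP/s peak are made-up numbers for illustration.

```python
def achieved_fraction(flops, gpu_time_us, peak_flops):
    """Achieved FLOP/s for one operator, as a fraction of the device peak."""
    achieved_flops_per_s = flops / (gpu_time_us * 1e-6)
    return achieved_flops_per_s / peak_flops

PEAK_FLOPS = 300e12  # hypothetical peak, FLOP/s

# Hypothetical records: (operator name, estimated FLOPs, total GPU time in microseconds).
records = [
    ("aten::mm", 2 * 4096**3, 600.0),    # large square matrix multiply
    ("aten::add", 4096 * 4096, 100.0),   # elementwise add over the same shape
]

report = {name: achieved_fraction(flops, t_us, PEAK_FLOPS) for name, flops, t_us in records}
for name, fraction in report.items():
    print(f"{name}: {fraction:.2%} of peak")
```

In this example the matrix multiply runs at roughly three quarters of peak and is a compute-bound candidate, while the elementwise add achieves a tiny fraction of peak despite taking measurable time, which points toward memory traffic.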

For deeper analysis, NVIDIA Nsight Compute reports SM occupancy, achieved-vs-theoretical memory throughput, and achieved-vs-theoretical arithmetic throughput on a per-kernel basis. Nsight Systems gives the timeline view — useful for spotting pipeline stalls, CPU bottlenecks in data loading, and serialization between streams. nvidia-smi dmon -s u is the cheapest first pass: continuous utilization sampling to spot gross underutilization windows during a production run.

The two-stage pattern we use on engagements is: dmon to identify periods of suspicious behaviour, then Nsight Systems on a representative window to find the responsible kernels. Jumping straight to Nsight without first localising in time is how engineers lose afternoons.
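The first stage can be scripted. The sketch below parses captured `nvidia-smi dmon -s u` text and returns runs of consecutive samples whose `sm` column stays under a threshold. It assumes the usual layout of a `#`-prefixed header row naming the columns followed by whitespace-separated integer rows; the exact set of columns varies by driver version, so the parser reads names from the header. The sample text, the 30% threshold, and the three-sample minimum are illustrative.

```python
def parse_dmon(text):
    """Parse dmon text into a list of {column: int or None} rows."""
    columns, rows = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            names = line.lstrip("#").split()
            if columns is None and "gpu" in names:
                columns = names              # first header row carries column names
            continue                         # units row and repeated headers are skipped
        values = line.split()
        if columns and len(values) == len(columns):
            rows.append({c: int(v) if v.isdigit() else None
                         for c, v in zip(columns, values)})
    return rows

def low_utilization_windows(rows, threshold=30, min_samples=3, gpu=0):
    """Return (first, last) sample indices of runs where `sm` stays below threshold."""
    samples = [r["sm"] for r in rows if r["gpu"] == gpu]
    windows, start = [], None
    for i, sm in enumerate(samples):
        below = sm is not None and sm < threshold
        if below and start is None:
            start = i
        elif not below and start is not None:
            if i - start >= min_samples:
                windows.append((start, i - 1))
            start = None
    if start is not None and len(samples) - start >= min_samples:
        windows.append((start, len(samples) - 1))
    return windows

# Made-up capture for illustration.
SAMPLE = """
# gpu    sm   mem   enc   dec
# Idx     %     %     %     %
    0    95    40     0     0
    0    12     3     0     0
    0     8     2     0     0
    0    10     2     0     0
    0    97    41     0     0
    0    20     5     0     0
"""
suspicious = low_utilization_windows(parse_dmon(SAMPLE))
print(suspicious)
```

The returned index ranges are the windows worth handing to Nsight Systems; a single low sample at the end of the capture is ignored because it is shorter than the minimum run length.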

Where TOPS and nvidia-smi actually belong

Neither metric is useless. They are mis-deployed when treated as performance measurements.

  • TOPS: use as a coarse ceiling check. If a hardware option’s published TOPS is below what your workload theoretically needs, eliminate it. Do not use TOPS to rank two options that both clear the bar — vendors compute it under different sparsity and precision assumptions, and real-world achievable fraction varies by an order of magnitude (an observed pattern across the published MFU literature; not a benchmark of any specific system).
  • nvidia-smi utilization: use as a coarse efficiency tripwire. Sustained utilization under ~30% during a training loop is a strong indicator of a data-loading or CPU-side bottleneck. Above that threshold, the metric loses discriminative power — it is compatible with both well-tuned and badly-tuned workloads.

The broader argument — that utilization is a proxy, not an outcome — is developed in our analysis of why GPU utilization is not performance. That piece works through the measurement gap in detail.

When does the utilization metric mislead an investigation more than it helps?

When the workload is memory-bandwidth-bound and the team is chasing the percentage. When two hardware options are being compared on nvidia-smi readings rather than achieved throughput per dollar or per watt. When a serving framework reports 90%+ GPU utilization but tokens-per-second has flatlined — almost always a batching pathology that utilization cannot see. And when an autoscaler triggers on utilization thresholds for a workload whose ceiling sits well below 100% by design.

The general rule: utilization is suggestive at the extremes (very low = something is stalling; very high with stable throughput = saturated) and noisy everywhere in between. Drawing conclusions from utilization in the middle of its range, without a roofline classification, is the dominant failure mode.

Why idle GPU time is not automatically waste

There is a procurement-side version of the same confusion. Idle time on a GPU is read as “wasted hardware” — therefore consolidate, therefore push utilization up. But idle time can be the correct outcome of a system with bursty load and latency SLOs. A GPU that runs at 35% average utilization but serves p99 latency targets during traffic peaks is doing its job. Forcing it to higher average utilization (through co-tenancy, more aggressive batching, or workload packing) often breaks the tail-latency property that justified the deployment in the first place.

This is also why throughput per watt is the most actionable single number for production serving: (inferences per second) divided by (GPU power draw in watts). It captures hardware capability, software tuning, and workload fit in one figure, and it does not reward idle-time elimination at the expense of latency. Two systems with identical utilization percentages but different throughput-per-watt values have different optimisation states — the lower one has recoverable waste, the higher one mostly doesn’t.
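As a minimal sketch of the calculation, the function below averages paired samples of throughput and power and divides one by the other. The sample values for the two systems are invented for illustration, and the sketch assumes both series were sampled at the same instants.

```python
from statistics import mean

def throughput_per_watt(samples):
    """samples: (inferences_per_second, power_watts) pairs taken at the same instants."""
    samples = list(samples)
    avg_power = mean(watts for _, watts in samples)
    if avg_power <= 0:
        raise ValueError("average power draw must be positive")
    return mean(rate for rate, _ in samples) / avg_power

# Two hypothetical systems reporting the same utilization percentage.
system_a = [(1200, 400), (1180, 410), (1220, 390)]
system_b = [(900, 400), (880, 410), (920, 390)]

efficiency_a = throughput_per_watt(system_a)
efficiency_b = throughput_per_watt(system_b)
print(f"A: {efficiency_a:.2f} inf/s/W, B: {efficiency_b:.2f} inf/s/W")
```

Here system B draws the same power for three quarters of the output, so it is the one with recoverable waste, even though a utilization dashboard would show the two as identical.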

A dashboard that combines GPU metrics with application metrics

For production AI serving, the operationally useful view combines GPU metrics with application-level metrics on the same timeline:

  • Requests per second (application load)
  • Tokens per second or images per second (application throughput)
  • GPU SM occupancy (compute utilisation)
  • GPU memory bandwidth utilisation (memory pressure)
  • GPU power draw (energy efficiency)

Correlating these reveals patterns invisible in any single metric. A GPU at 90% utilization with flat throughput under rising request load indicates that the serving framework’s batching is suboptimal — requests are queuing rather than batching. A GPU at 40% utilization with maximum sustained throughput for the model indicates a memory-bandwidth ceiling; further load will not lift output regardless of how many SMs appear idle.
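The two patterns above can be written down as a rough rule check over aligned time series. This is a sketch, not a diagnostic tool: the 5% and 10% trend thresholds and the 90% and 50% utilization cut-offs are arbitrary assumptions, and a real dashboard would need smoothing and longer windows.

```python
from statistics import mean

def relative_change(series):
    """Change from the first half's mean to the second half's mean, as a fraction."""
    half = len(series) // 2
    before, after = mean(series[:half]), mean(series[half:])
    return (after - before) / before   # assumes a non-zero first-half mean

def diagnose(requests_per_s, tokens_per_s, gpu_util_pct, flat=0.05, rising=0.10):
    """Match the two correlation patterns; all thresholds are illustrative."""
    load_rising = relative_change(requests_per_s) > rising
    throughput_flat = abs(relative_change(tokens_per_s)) < flat
    utilization = mean(gpu_util_pct)
    if load_rising and throughput_flat and utilization >= 90:
        return "high utilization, flat throughput: inspect batching and queueing"
    if load_rising and throughput_flat and utilization <= 50:
        return "low utilization, flat throughput: check for a memory-bandwidth ceiling"
    return "no pattern matched: read the per-kernel profile"

# Made-up series: load rises by about half while throughput stays flat.
requests = [100, 110, 150, 160]
tokens = [5000, 5050, 5020, 5060]
print(diagnose(requests, tokens, gpu_util_pct=[92, 93, 94, 95]))
print(diagnose(requests, tokens, gpu_util_pct=[38, 40, 41, 42]))
```

The same load and throughput curves produce different diagnoses depending on the utilization reading, which is the reason for putting all three on one timeline.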

The implementation we typically reach for: NVIDIA DCGM (Data Center GPU Manager) collecting GPU metrics, exposed as Prometheus metrics, visualised in Grafana alongside application telemetry. Two to three hours of setup, and from then on the combined view does the work that a utilization dashboard alone cannot.
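For a quick look before any dashboard exists, the exporter's text output can be read directly. The sketch below parses Prometheus text exposition and compares the coarse utilization field with the SM activity field per GPU. It assumes the exporter is configured to emit `DCGM_FI_DEV_GPU_UTIL` (a percentage) and the profiling field `DCGM_FI_PROF_SM_ACTIVE` (a 0 to 1 ratio), which depends on the counter configuration in use. The sample text, its label set, and its values are invented.

```python
import re

LINE = re.compile(r'^(?P<name>[A-Za-z_:][\w:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
GPU_LABEL = re.compile(r'(?:^|,)gpu="([^"]*)"')

def read_gpu_metrics(exposition_text):
    """Group metric values by the `gpu` label: {gpu_id: {metric_name: value}}."""
    metrics = {}
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                          # skip HELP/TYPE comments
        match = LINE.match(line)
        gpu = GPU_LABEL.search(match["labels"]) if match else None
        if gpu:
            metrics.setdefault(gpu.group(1), {})[match["name"]] = float(match["value"])
    return metrics

def occupancy_gap(gpu_metrics):
    """Coarse utilization minus SM activity, both expressed as fractions of 1."""
    coarse = gpu_metrics["DCGM_FI_DEV_GPU_UTIL"] / 100.0   # reported in percent
    sm_active = gpu_metrics["DCGM_FI_PROF_SM_ACTIVE"]      # reported as a ratio
    return coarse - sm_active

# Invented sample; real output carries more labels and metrics.
SAMPLE = '''
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-example"} 99
DCGM_FI_PROF_SM_ACTIVE{gpu="0",UUID="GPU-example"} 0.13
DCGM_FI_PROF_DRAM_ACTIVE{gpu="0",UUID="GPU-example"} 0.81
DCGM_FI_DEV_POWER_USAGE{gpu="0",UUID="GPU-example"} 310.5
'''
by_gpu = read_gpu_metrics(SAMPLE)
gap = occupancy_gap(by_gpu["0"])
print(f"GPU 0: coarse utilization exceeds SM activity by {gap:.0%}")
```

A large gap alongside a high DRAM activity reading is the memory-bound signature discussed earlier; in Grafana the same comparison would be a query over the two series instead of a script.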

Frequently Asked Questions

What is the difference between coarse GPU utilization from nvidia-smi and SM utilization, and why can they diverge?

Coarse nvidia-smi utilization is the fraction of the sampling window during which any kernel was running on the device — a time-occupancy signal. SM utilization, exposed by NVIDIA Nsight Compute and DCGM, measures the fraction of the Streaming Multiprocessors actually doing work. A single memory-bound kernel keeps the device busy the whole window, so coarse utilization reads near 100% while SM utilization sits in the low teens. The two diverge sharply because one counts wall-clock occupancy and the other counts compute resources actually in use.

Is a GPU pinned at 100% utilization a sign of a problem, or is that what a saturated training job should look like?

By itself, 100% on nvidia-smi tells you only that a kernel was running across the whole sampling window — it is exactly what a healthy, saturated training job looks like. The figure becomes a problem signal only when paired with flat throughput: 90%+ utilization with tokens-per-second flatlined usually points to a batching pathology the coarse metric cannot see. To tell a saturated job from a stalled one, read SM utilization and an application-level throughput metric alongside the percentage.

Which tools expose SM utilization rather than just coarse device utilization?

NVIDIA Nsight Compute reports SM occupancy and achieved-versus-theoretical throughput on a per-kernel basis, and Nsight Systems gives the timeline view for spotting stalls and serialization. For continuous production monitoring, NVIDIA DCGM exposes SM activity as Prometheus metrics that can be visualised in Grafana alongside application telemetry. nvidia-smi dmon -s u remains the cheapest first pass to localise suspicious windows in time before reaching for the heavier profilers.
