GPU profiling is the only reliable way to know where time actually goes inside a workload. Without it, optimisation is guesswork, and the guesses are usually wrong, because the kernel that feels slow is rarely the one consuming the wall time. NVIDIA’s profiling ecosystem has several overlapping tools, and choosing the wrong one for the question you’re asking wastes a working day and produces misleading conclusions.

This article covers the two tools that matter for almost every modern GPU workload, Nsight Systems and Nsight Compute, and gives a practical workflow for moving from “my GPU code is slow” to “here is the specific bottleneck and what to do about it.” The deeper methodology behind the workflow lives in our hub article on how to profile GPU kernels to find the real bottleneck; this piece is the operator-level companion.

Which profiler answers which question?

The mistake we see most often is teams reaching for Nsight Compute first because it sounds more detailed. It is more detailed, and that is exactly why it is the wrong starting point. Per-kernel hardware counters are only useful once you already know which kernels are worth looking at.

| Tool | Question answered | Granularity | Overhead |
| --- | --- | --- | --- |
| Nsight Systems | Where does time go across the full system? | Timeline of API calls, CPU/GPU overlap, transfers | Low |
| Nsight Compute | Why is a specific kernel slow? | Per-kernel hardware metrics, warp stalls, Roofline | High |
| nvprof (deprecated) | Basic kernel timing on pre-Volta hardware | Kernel-level | Medium |

The decision tree is simple. Start with Nsight Systems for a system-level picture. Drill into specific kernels with Nsight Compute only after the timeline tells you which kernels deserve the attention. nvprof is still mentioned in older tutorials but does not support recent architectures and should be retired from any modern workflow.
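The "timeline first, counters second" rule amounts to a ranking step: total up kernel time from the timeline, then hand only the top few kernels to Nsight Compute. The sketch below illustrates that step under stated assumptions. The record format, the kernel names, the timings, and the `top_kernels` helper are all hypothetical; none of them is part of either tool's API or export format.

```python
from collections import defaultdict


def top_kernels(records, n=3):
    """Return the n kernels with the largest total GPU time.

    `records` is a hypothetical list of (kernel_name, duration_ms) pairs,
    for example copied out of a timeline summary. It is not a real
    Nsight Systems export format.
    """
    totals = defaultdict(float)
    for name, duration_ms in records:
        totals[name] += duration_ms
    # Largest aggregate time first; ties are broken by name for stable output.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]


# Made-up timings, for illustration only.
records = [
    ("gemm_kernel", 4.0),
    ("softmax_kernel", 0.5),
    ("gemm_kernel", 4.5),
    ("elementwise_add", 0.25),
    ("layernorm_kernel", 1.5),
    ("softmax_kernel", 0.5),
]

for name, total_ms in top_kernels(records):
    print(f"{name}: {total_ms:.2f} ms")
```

Ranking by aggregate time rather than by the slowest single launch matters: a cheap kernel launched thousands of times can outweigh one expensive launch.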
Nsight Systems: the timeline view first

Nsight Systems captures a unified timeline of CPU threads, CUDA API calls, kernel executions, memory transfers, NCCL collectives, and any NVTX ranges you have annotated. Its overhead is low enough to run against a realistic training step or inference batch without distorting the result. A representative invocation:

    nsys profile --trace=cuda,nvtx,osrt,cudnn \
      --output=report \
      python train.py

Open the resulting .nsys-rep file in the Nsight Systems GUI. The timeline shows CPU/GPU overlap or, more often, the lack of it. The patterns worth scanning for:

- GPU idle gaps between kernels. These usually indicate CPU-side data preparation, Python overhead in the DataLoader, or a synchronous .item() call pulling a tensor back to host.
- PCIe transfers (host-to-device or device-to-host) that block kernel execution rather than overlapping with it.
- Long tails of tiny kernels. Operator fusion (via torch.compile, TensorRT, or XLA) or larger batch shapes may collapse them.
- NCCL collectives stretching the step in distributed runs. This is a sign that the communication topology, not the compute, is the constraint.

In our experience, the most common finding at this stage is not a slow kernel at all. It is excessive host-device synchronisation, or memory transfers that could be pipelined but aren’t. Both are invisible to Nsight Compute because Nsight Compute only sees one kernel at a time.

Nsight Compute: kernel-level diagnosis

Once Nsight Systems has identified the two or three kernels dominating runtime, Nsight Compute provides per-kernel hardware metrics: memory throughput, compute throughput, achieved occupancy, warp stall reasons, and instruction mix. It is heavy. A full metric set can slow a kernel by 100× or more, which is fine for a single-kernel diagnostic but unusable for a whole training run.
    ncu --set full \
      --kernel-name regex:"my_hot_kernel.*" \
      --launch-skip 50 --launch-count 5 \
      --output report \
      python train.py

The --kernel-name filter and the --launch-skip / --launch-count flags are the practical difference between a profile that completes and a profile that runs overnight. Always target specific kernels you have already identified as hot.

Reading the Roofline

The Roofline chart plots your kernel’s achieved performance (FLOP/s) against its arithmetic intensity (FLOPs per byte of memory traffic), relative to the hardware ceilings of the device. The sloped part of the roof is the memory-bandwidth ceiling, the flat part is the compute ceiling, and the two meet at the ridge point. The kernel’s position is a direct, structurally meaningful diagnosis:

- Left of the ridge point, under the memory-bandwidth ceiling: memory-bound. The fix is in the access pattern: coalescing, tiling, reducing HBM round-trips, or moving data into shared memory.
- Right of the ridge point, under the compute ceiling: compute-bound. The fix is algorithmic: higher arithmetic intensity, lower precision (FP16 / BF16 / FP8 where appropriate), or eliminating redundant work.
- Far below both ceilings: launch-bound, occupancy-limited, or stalled on something else entirely. Check block/grid configuration and register pressure.

What are the common bottleneck patterns?

A small number of patterns account for most real-world kernel slowness:

- Long Scoreboard stalls, sometimes accompanied by LG Throttle when the queue of memory instructions backs up: warps are waiting on global memory loads. The usual cause is non-coalesced access; consecutive threads should access consecutive addresses.
- Low achieved occupancy: too many registers per thread, or too much shared memory per block, prevents the scheduler from keeping enough warps in flight to hide latency. Compile with --ptxas-options=-v to see register counts per kernel.
- High L2 miss rate: data is not being reused from the L2 cache. Tiling or blocking the algorithm typically improves locality.
- Unbalanced SM utilisation: some streaming multiprocessors finish much earlier than others.
  Usually caused by irregular work distribution; consider load balancing or work-stealing schemes across blocks.

These are observed patterns across the GPU-engineering work we do, not a benchmark, and the relative frequency varies by workload class (training vs inference vs classical HPC).

Profiling workflow checklist

A pragmatic order of operations that survives contact with real codebases:

1. Build with nvcc -lineinfo so Nsight Compute can correlate metrics back to source lines.
2. Add NVTX ranges around the meaningful phases (forward, backward, optimizer_step, dataloader) so the Nsight Systems timeline is readable.
3. Run Nsight Systems first. Identify the top three kernels by aggregate wall time, and check whether the GPU is ever idle waiting on the CPU.
4. Confirm CPU/GPU overlap is healthy before touching kernels. Fixing a synchronous transfer is almost always a larger win than micro-optimising a kernel.
5. Run Nsight Compute only on the top kernels, with --launch-skip set past the warm-up iterations.
6. Read the Roofline for each: memory-bound, compute-bound, or neither.
7. Check warp-stall reasons. They name the actual hardware constraint directly.
8. Re-profile after every change. Never assume an optimisation helped without a second measurement on the same hardware.

How should I interpret memory throughput numbers?

Nsight Compute reports memory throughput as a percentage of theoretical peak, and the percentage is routinely misread. A kernel achieving 60% of peak HBM bandwidth on an A100 is doing well; sustained 70–80% is achievable only for very regular, fully coalesced access patterns. Compute-bound kernels typically show 5–20% memory utilisation, which is exactly what you would expect, not a problem to solve.
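As a rough illustration of reading those percentages, the sketch below converts an achieved bandwidth figure into a percentage of peak. The `percent_of_peak` helper, the kernel labels, and the achieved figures are hypothetical, and the 2,000 GB/s peak is an assumed round number for an A100 80GB; substitute the figure for your own device.

```python
def percent_of_peak(achieved_gb_s, peak_gb_s):
    """Achieved memory bandwidth as a percentage of theoretical peak."""
    if peak_gb_s <= 0:
        raise ValueError("peak bandwidth must be positive")
    return 100.0 * achieved_gb_s / peak_gb_s


# Assumed peak for illustration (roughly an A100 80GB).
PEAK_GB_S = 2000.0

# Hypothetical measurements: (kernel label, achieved GB/s).
measurements = [
    ("streaming copy", 1500.0),
    ("strided gather", 1200.0),
    ("dense GEMM", 250.0),
]

for label, achieved in measurements:
    print(f"{label}: {percent_of_peak(achieved, PEAK_GB_S):.1f}% of peak")
```

With these made-up inputs, the first two kernels land at 75% and 60%, which the guidance above treats as healthy. The third lands at 12.5%, which is what the text expects from a compute-bound kernel and is not a problem in itself.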
| GPU | Peak memory bandwidth | Typical achievable in production |
| --- | --- | --- |
| NVIDIA A100 80GB | 2,000 GB/s | 1,400–1,600 GB/s |
| NVIDIA H100 SXM | 3,350 GB/s | 2,400–2,800 GB/s |
| NVIDIA RTX 4090 | 1,008 GB/s | 700–850 GB/s |

These ranges are observed-pattern ceilings from GPU performance work we have done across a mix of training and inference workloads. They are practical planning numbers, not benchmark results, and the exact figure for any specific kernel depends on access patterns, working-set size, and how well the kernel cooperates with the memory hierarchy. Note that the RTX 4090 uses GDDR6X rather than HBM, so its row reports GDDR bandwidth.

Connecting profiling to optimisation strategy

Profiling data should directly dictate the optimisation path, with no creative leaps in between:

- Memory-bound kernels need better access patterns, operator fusion, or improved cache locality.
- Compute-bound kernels need algorithmic restructuring or a precision change.
- Launch-bound workloads need larger batches or kernel consolidation, often via graph capture (CUDA Graphs, torch.compile, or TensorRT engine builds).
- Host-bound workloads need none of the above; they need the input pipeline fixed.

This is the bridge from the operator-level tooling here back to the full decision framework in our hub piece on profiling GPU kernels for the real bottleneck, which covers what to do when the profile reveals the constraint is architectural rather than kernel-level.

FAQ