How to Profile GPU Kernels to Find the Real Bottleneck

A CUDA kernel takes 12 milliseconds to execute. Is that slow? The question is unanswerable without context, and that gap between “the kernel took N milliseconds” and “the kernel is achieving X% of what this hardware allows” is where most GPU optimisation work goes wrong. Teams that skip profiling end up tuning kernel parameters because the GPU “looks busy,” when the actual constraint is memory bandwidth, host-device serialisation, or kernel launch overhead — none of which respond to the changes being made.

This article describes the profiling workflow we use across GPU Performance Audit engagements: how to capture the right traces with Nsight Systems and Nsight Compute, how to read a roofline chart to separate compute-bound kernels from memory-bound ones, and how to decide the optimisation order so the first intervention delivers the largest speedup rather than the smallest. The point is not that profiling is virtuous. The point is that without it, every subsequent decision is engineering by superstition.

Why wall-clock timing is not profiling

Return to the 12-millisecond kernel. To use a worked example from our GPU profiling engagements: if that kernel is performing 500 billion floating-point operations and the GPU’s peak throughput is roughly 312 TFLOPS, the theoretical minimum is around 1.6 milliseconds — the kernel is achieving about 13% of peak (an operational measurement from that profile, not an extrapolated benchmark), and there is significant room for improvement. If, on the other hand, the kernel is memory-bandwidth-bound — performing element-wise operations on 4 GB of data on a GPU with roughly 2 TB/s of memory bandwidth — the theoretical minimum is around 2 milliseconds. Now the kernel is achieving about 17% of the memory bandwidth ceiling, and the optimisation target is memory access patterns rather than compute efficiency.

Same 12 milliseconds. Two completely different problems. Two completely different fixes. Wall-clock timing cannot tell you which one you have. Profiling can.
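The arithmetic behind those two readings can be reproduced in a few lines. This is a minimal sketch using the round figures quoted above; the function name `time_floors_ms` is ours, chosen for illustration, and the peak numbers should be replaced with the values for your own device.

```python
def time_floors_ms(flops, bytes_moved, peak_flops_per_s, peak_bytes_per_s):
    """Return (compute_floor_ms, memory_floor_ms) for a single kernel."""
    compute_floor_ms = flops / peak_flops_per_s * 1e3
    memory_floor_ms = bytes_moved / peak_bytes_per_s * 1e3
    return compute_floor_ms, memory_floor_ms


MEASURED_MS = 12.0   # wall-clock time of the kernel
PEAK_FLOPS = 312e12  # roughly 312 TFLOPS, as in the example above
PEAK_BW = 2e12       # roughly 2 TB/s, as in the example above

# Scenario 1: 500 billion floating-point operations.
compute_floor, _ = time_floors_ms(500e9, 0, PEAK_FLOPS, PEAK_BW)
# Scenario 2: 4 GB of data moved by an element-wise kernel.
_, memory_floor = time_floors_ms(0, 4e9, PEAK_FLOPS, PEAK_BW)

print(f"compute floor {compute_floor:.1f} ms -> {compute_floor / MEASURED_MS:.0%} of peak")
print(f"memory floor {memory_floor:.1f} ms -> {memory_floor / MEASURED_MS:.0%} of bandwidth ceiling")
# compute floor 1.6 ms -> 13% of peak
# memory floor 2.0 ms -> 17% of bandwidth ceiling
```

The same measured time yields two different ratios because the denominator changes with the ceiling that applies, which is exactly the information wall-clock timing does not carry.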

Which GPU profiler should I use?

For NVIDIA hardware, the practical answer is both Nsight Systems and Nsight Compute, used in sequence. They answer different questions and the cost of conflating them is wasted weeks.

Nsight Systems captures a system-level timeline: kernel launches, CUDA stream activity, host-device memory transfers, synchronisation events, GPU idle gaps. It tells you what is happening at the workload level — whether the GPU is actually active, whether there is host-side serialisation, whether data transfer is overlapping computation. This is the right tool for the first question: where is the time going?

Nsight Compute profiles a single kernel in detail: achieved occupancy, memory throughput per cache level, warp execution efficiency, instruction mix, and the roofline chart that places the kernel against the hardware ceilings. This is the right tool for the second question: given that this kernel is dominant, why is it not faster?

Vendor alternatives exist — AMD’s rocprof and Radeon GPU Profiler, Intel’s VTune for GPU, and PyTorch’s built-in profiler for the framework layer — and the same two-stage logic applies on those platforms. The framework-level profiler in PyTorch or JAX is useful as a first cut to identify which operator is dominant, but it does not give you the kernel-level analysis that determines what to actually change.

A common failure mode: teams reach for Nsight Compute first, profile some kernel they suspect, and never discover that the real problem was a 40-millisecond cudaMemcpy blocking the launch queue. The system-level pass costs almost nothing and reorders the entire investigation.

Minimum viable profiling workflow

Before optimising anything, every team should complete these steps. The order matters more than the tooling.

1. Capture a system-level timeline with Nsight Systems. Run the full workload, not a synthetic micro-benchmark, and record kernel launches, memory transfers, synchronisation events, and GPU idle gaps.
2. Identify the top three to five kernels by cumulative execution time. In our experience across GPU profiling engagements, these dominant kernels typically account for 80% or more of GPU time (an observed pattern across our engagements, not a benchmarked industry rate). They are the only ones worth optimising.
3. Profile each dominant kernel with Nsight Compute. Collect the roofline chart, achieved occupancy, memory throughput breakdown, and warp execution efficiency for each target kernel.
4. Classify each kernel as compute-bound, memory-bound, or occupancy-limited. The roofline position and arithmetic intensity tell you which hardware ceiling is the binding constraint.
5. Quantify the gap between current and achievable performance. Compare the kernel’s achieved throughput against the theoretical ceiling. As a planning heuristic from our engagements (not a benchmarked industry rate), a kernel at 20% of peak implies up to a 5× potential improvement if the constraint can be addressed.
6. Select the single highest-impact optimisation per kernel. Match the identified bottleneck to a specific fix: coalescing for uncoalesced memory access, mixed precision for tensor core underutilisation, kernel fusion for launch overhead.
7. Re-profile after each change. Fixing one bottleneck shifts the constraint, so always re-measure before applying the next intervention.

That is the entire methodology in seven steps. Everything below is detail on how to execute it.
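The second step, ranking kernels by cumulative time, is easy to script once per-launch durations have been exported from the timeline. The sketch below assumes the export has already been reduced to `(kernel_name, duration_ms)` pairs; the kernel names and durations are made up for illustration, and `dominant_kernels` is a hypothetical helper, not part of any profiler.

```python
from collections import defaultdict


def dominant_kernels(records, coverage=0.80, max_kernels=5):
    """Rank kernels by cumulative time and keep the top few.

    records: iterable of (kernel_name, duration_ms), one entry per launch.
    Stops once the selected kernels cover `coverage` of total GPU time
    or `max_kernels` have been selected, whichever comes first.
    Returns a list of (name, total_ms, share_of_total).
    """
    totals = defaultdict(float)
    for name, duration_ms in records:
        totals[name] += duration_ms

    grand_total = sum(totals.values())
    if grand_total == 0:
        return []

    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    selected, covered = [], 0.0
    for name, total_ms in ranked[:max_kernels]:
        selected.append((name, total_ms, total_ms / grand_total))
        covered += total_ms
        if covered / grand_total >= coverage:
            break
    return selected


# Hypothetical per-launch durations, for illustration only.
records = [
    ("gemm_bwd", 6.0), ("gemm_bwd", 6.0),
    ("gemm_fwd", 4.5), ("gemm_fwd", 4.5),
    ("layernorm", 2.0), ("softmax", 2.0), ("relu", 1.0),
]
top = dominant_kernels(records)
for name, total_ms, share in top:
    print(f"{name}: {total_ms:.1f} ms ({share:.0%} of GPU time)")
```

Summing per launch matters: a kernel that runs for microseconds but is launched thousands of times can outrank a single long kernel.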

The roofline model: where does your kernel sit?

The roofline model is the foundational framework for GPU kernel performance analysis. It maps every kernel onto a two-dimensional space defined by two hardware limits: compute throughput in FLOPS, and memory bandwidth in bytes per second. The kernel’s arithmetic intensity — operations performed per byte of data accessed — determines which limit is binding.

Compute-bound kernels have high arithmetic intensity. Large matrix multiplications (GEMM), convolutions with large filter sizes, and dense attention computations fall into this category. The optimisation target is compute efficiency: are the tensor cores being used? Is warp execution efficient? Are there divergent branches forcing serial execution?

Memory-bound kernels have low arithmetic intensity. Element-wise operations (ReLU, sigmoid, addition), batch normalisation, and small matrix operations live here. The optimisation target is memory throughput: are global memory accesses coalesced? Is the data layout cache-friendly? Can adjacent operations be fused so data is reused in registers or shared memory rather than re-read from HBM?

Nsight Compute generates the roofline chart automatically for any profiled kernel. The kernel appears as a point; its position relative to the compute ceiling and memory bandwidth ceiling indicates which limit is binding and how much headroom remains. In our experience reading these charts, a kernel sitting at 80% of the memory bandwidth ceiling is already well-optimised for a memory-bound workload — further gains require reducing the memory access volume through fusion or algorithmic change, not improving the access pattern. A kernel sitting at 20% of the compute ceiling, by contrast, is poorly optimised, and there is roughly a 5× potential improvement available from better compute utilisation (an observed range across our engagements, not a guaranteed outcome).
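For readers who want to place a kernel on the roofline by hand, the model reduces to one `min`. The sketch below is a simplified single-ceiling version under stated assumptions: one compute peak, one memory bandwidth figure (the round numbers used earlier in this article), and a hypothetical element-wise kernel whose FLOP count, byte count, and runtime are invented for the example.

```python
def attainable_flops(arithmetic_intensity, peak_flops, peak_bandwidth):
    """Roofline ceiling in FLOP/s for a given arithmetic intensity (FLOP/byte)."""
    return min(peak_flops, arithmetic_intensity * peak_bandwidth)


def ridge_point(peak_flops, peak_bandwidth):
    """Arithmetic intensity at which the memory and compute ceilings meet."""
    return peak_flops / peak_bandwidth


def roofline_position(flops, bytes_moved, runtime_s, peak_flops, peak_bandwidth):
    """Classify a kernel and report how close it sits to its binding ceiling."""
    intensity = flops / bytes_moved
    achieved = flops / runtime_s
    ceiling = attainable_flops(intensity, peak_flops, peak_bandwidth)
    bound = "compute" if intensity >= ridge_point(peak_flops, peak_bandwidth) else "memory"
    return {
        "arithmetic_intensity": intensity,
        "bound": bound,
        "fraction_of_ceiling": achieved / ceiling,
        "headroom": ceiling / achieved,
    }


PEAK_FLOPS = 312e12  # assumed compute peak, FLOP/s
PEAK_BW = 2e12       # assumed memory bandwidth, bytes/s

# Hypothetical float32 element-wise add over 1e9 elements:
# 1 FLOP per element, 12 bytes per element (two reads, one write).
position = roofline_position(
    flops=1e9, bytes_moved=12e9, runtime_s=7.5e-3,
    peak_flops=PEAK_FLOPS, peak_bandwidth=PEAK_BW,
)
print(f"ridge point: {ridge_point(PEAK_FLOPS, PEAK_BW):.0f} FLOP/byte")
print(f"{position['bound']}-bound at {position['fraction_of_ceiling']:.0%} of ceiling, "
      f"headroom {position['headroom']:.2f}x")
# ridge point: 156 FLOP/byte
# memory-bound at 80% of ceiling, headroom 1.25x
```

With an arithmetic intensity far below the ridge point, this kernel can never approach the compute peak however well it is tuned; the only lever left is moving fewer bytes.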

How do I tell whether my kernel is compute-bound, memory-bound, or host-bound?

This is the question the entire workflow exists to answer. The diagnostic logic is straightforward once the traces are in hand.

Symptom in the trace	Likely class	First-cut fix
Roofline point near compute ceiling, low memory throughput	Compute-bound	Tensor cores, reduce divergence, raise ILP
Roofline point near memory bandwidth ceiling, low FLOPS	Memory-bound	Coalescing, fusion, shared memory tiling
Achieved occupancy <30%, low compute and memory throughput	Occupancy-limited	Reduce registers per thread or shared memory per block
GPU idle gaps between kernels in Nsight Systems	Host-bound or launch-overhead-bound	CUDA graphs, async streams, batch launches
Long `cudaMemcpy` blocks before kernel runs	I/O / transfer-bound	Pinned memory, overlap with `cudaMemcpyAsync`
Roofline far below both ceilings, no idle gaps	Mixed — re-examine warp efficiency and stalls	Inspect stall reasons in Nsight Compute

The last row is the one that catches teams out. A kernel can be far from both ceilings and still not be straightforwardly fixable, because the binding constraint is something more granular — long scoreboard stalls waiting on memory, warp serialisation from atomics, or shared memory bank conflicts. Nsight Compute exposes the stall-reason breakdown explicitly, and it is usually where the real story lives once the obvious causes have been ruled out.
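The table can be read as a decision procedure: rule out system-level causes first, then look at the roofline, then occupancy, and fall through to stall analysis. The sketch below encodes that order. Only the 30% occupancy figure comes from the table; the `near` and `idle_limit` thresholds are illustrative assumptions of ours and should be tuned to your workload, and the inputs are fractions you would read off the Nsight Systems and Nsight Compute reports by hand.

```python
def classify(idle_fraction, blocking_copy_fraction, compute_fraction,
             memory_fraction, occupancy, near=0.6, idle_limit=0.2):
    """Map profiler readings (all fractions in 0..1) to a bottleneck class.

    idle_fraction          : share of the run with no kernel executing
    blocking_copy_fraction : share of the run spent in copies that block compute
    compute_fraction       : achieved FLOP/s as a fraction of the compute ceiling
    memory_fraction        : achieved bandwidth as a fraction of the memory ceiling
    occupancy              : achieved occupancy
    near, idle_limit       : assumed thresholds, not profiler-defined values
    """
    # System-level causes come first: no kernel fix helps an idle GPU.
    if blocking_copy_fraction > idle_limit:
        return "transfer-bound"
    if idle_fraction > idle_limit:
        return "host-bound or launch-overhead-bound"
    # Kernel-level: which ceiling is the kernel closest to?
    if compute_fraction >= near and compute_fraction >= memory_fraction:
        return "compute-bound"
    if memory_fraction >= near:
        return "memory-bound"
    # Far from both ceilings: low occupancy is the next suspect.
    if occupancy < 0.30:
        return "occupancy-limited"
    return "mixed: inspect stall reasons"


print(classify(0.05, 0.00, 0.10, 0.85, 0.60))  # memory-bound
print(classify(0.40, 0.00, 0.10, 0.20, 0.60))  # host-bound or launch-overhead-bound
print(classify(0.05, 0.00, 0.15, 0.20, 0.70))  # mixed: inspect stall reasons
```

The ordering is the point of the sketch: a kernel-level classification is only meaningful once the timeline has shown that the GPU is actually kept busy.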

What does it mean when GPU utilisation looks high but throughput is low?

This is one of the most common patterns we see, and it is almost always a misreading of what nvidia-smi reports. The utilisation percentage in nvidia-smi is the fraction of time at least one kernel was running on the device. It says nothing about how efficiently those kernels used the hardware. A workload showing 95% GPU utilisation can easily be running kernels that achieve 10% of peak throughput — the GPU is busy doing very little, very consistently.

End-to-end throughput depends on three things the utilisation number does not see: kernel efficiency against the relevant ceiling, the overhead of launching and synchronising the work, and the fraction of time spent in non-kernel activity (data loading, host-side preprocessing, gradient synchronisation across devices). Profiling separates these out. The hidden cost of GPU underutilisation is almost always something a utilisation gauge cannot detect — and a roofline chart can.
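A rough way to see why the gauge misleads is to multiply the two quantities it conflates. The sketch below is a deliberately crude model, assuming kernel efficiency is uniform across the busy time; the second scenario is hypothetical and included only for contrast.

```python
def fraction_of_peak_delivered(busy_fraction, kernel_efficiency):
    """Share of the hardware ceiling actually delivered over the whole run.

    busy_fraction     : what a utilisation gauge reports (0..1)
    kernel_efficiency : average fraction of the relevant ceiling achieved
                        while kernels are running (0..1)
    """
    return busy_fraction * kernel_efficiency


busy_but_inefficient = fraction_of_peak_delivered(0.95, 0.10)
idle_but_efficient = fraction_of_peak_delivered(0.60, 0.50)

print(f"95% utilised, 10% efficient kernels: {busy_but_inefficient:.1%} of peak")
print(f"60% utilised, 50% efficient kernels: {idle_but_efficient:.1%} of peak")
# 95% utilised, 10% efficient kernels: 9.5% of peak
# 60% utilised, 50% efficient kernels: 30.0% of peak
```

In this model the workload with the lower utilisation reading delivers roughly three times the useful work, which is why the gauge alone cannot rank two configurations.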

What does profiling tell you that benchmarks do not?

Benchmark numbers — training throughput, inference latency, images per second — describe the current performance. Profiling describes the achievable performance and the specific path to reach it. The gap between the two is the optimisation opportunity, and its magnitude determines whether the engineering investment is worth making. This is part of why spec-sheet benchmarking fails for AI workloads — the headline numbers on the data sheet describe a ceiling that production workloads rarely reach without the profiling-driven work described here.

Across GPU Performance Audit engagements, we have seen profiling reveal opportunities ranging from negligible (the workload is already within 10% of the hardware ceiling, leave it alone) to transformative (the workload is at 15% of peak with a clear path to 60% through targeted kernel optimisation and memory layout changes) — an observed range across our engagements, not a benchmarked industry rate. The profiling data determines which situation you are in. It also prevents the more expensive mistake of provisioning more GPU memory or procuring additional hardware when the existing hardware is sitting underutilised.

Common findings and their fixes

From our audit engagements, four findings recur often enough that they are worth naming explicitly. The improvement ranges below are observed across our engagements (not benchmarked industry rates) and assume the diagnosis is correct — applying the wrong fix to the wrong bottleneck delivers nothing.

Uncoalesced global memory access. Threads within a warp access non-contiguous memory locations, forcing the memory subsystem to issue multiple transactions where one would suffice. The fix is structural: convert Array-of-Structures layouts to Structure-of-Arrays, or change the access pattern so consecutive threads touch consecutive addresses. Typical improvement: roughly 2–4× memory throughput on the affected kernel.
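The cost of a strided layout can be estimated before touching the kernel by counting how many memory sectors one warp-wide load touches. The sketch below is a simplified host-side model, not a measurement: it assumes 32 threads per warp, 32-byte sectors, and 4-byte elements, and it ignores caching effects. The 16-byte stride stands in for a hypothetical Array-of-Structures record holding four floats.

```python
WARP_SIZE = 32     # threads per warp
SECTOR_BYTES = 32  # assumed memory transaction granularity


def sectors_per_warp_load(stride_bytes, element_bytes=4, base_address=0):
    """Count distinct sectors touched when each lane loads one element.

    Lane i reads `element_bytes` starting at base_address + i * stride_bytes.
    """
    sectors = set()
    for lane in range(WARP_SIZE):
        first = base_address + lane * stride_bytes
        last = first + element_bytes - 1
        sectors.update(range(first // SECTOR_BYTES, last // SECTOR_BYTES + 1))
    return len(sectors)


# Structure-of-Arrays: consecutive lanes read consecutive floats.
soa_sectors = sectors_per_warp_load(stride_bytes=4)
# Array-of-Structures: each lane reads one float out of a 16-byte record.
aos_sectors = sectors_per_warp_load(stride_bytes=16)

print(f"SoA: {soa_sectors} sectors, AoS: {aos_sectors} sectors, "
      f"ratio {aos_sectors / soa_sectors:.0f}x")
# SoA: 4 sectors, AoS: 16 sectors, ratio 4x
```

Under these assumptions the strided layout fetches four times as many bytes to deliver the same useful data; the measured gain depends on how much of the extra traffic the caches absorb.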

Tensor core underutilisation. The kernel performs matrix multiplications in FP32 on hardware that supports tensor cores (Volta and later) but never invokes them. The fix is to move to mixed precision — FP16 or BF16 inputs with FP32 accumulation — using either the WMMA API directly or a tensor-core-aware library path in cuBLAS or cuDNN. Typical improvement on the matrix multiplication itself: roughly 4–8× throughput.

Excessive kernel launch overhead. Hundreds or thousands of small kernels are dispatched sequentially, and the host-side cost of each launch (on the order of 5–15 microseconds) exceeds the GPU execution time of the kernel itself. The fix is fusion — combining small kernels into larger ones — or using CUDA graphs to capture the launch sequence into a single replayable structure. Typical improvement for launch-overhead-dominated workloads: roughly 2–10× reduction in end-to-end time.
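A back-of-envelope model shows when launch overhead dominates and what capturing the sequence buys. The sketch below rests on simplifying assumptions of ours: every kernel has the same launch and execution cost, launches are asynchronous so the pipeline advances at the pace of the slower side, and a captured graph pays one launch cost for the whole sequence with execution time unchanged. Real graphs retain some per-node scheduling cost, so treat the result as an upper bound.

```python
def sequential_launch_time_us(num_kernels, launch_us, exec_us):
    """Steady-state time when each kernel is launched individually.

    Launches are asynchronous, so each kernel costs whichever is slower:
    the host issuing the launch or the GPU executing the kernel.
    """
    return num_kernels * max(launch_us, exec_us)


def graph_replay_time_us(num_kernels, launch_us, exec_us):
    """Idealised time when the whole sequence is replayed as one graph."""
    return launch_us + num_kernels * exec_us


# Hypothetical workload: 1000 small kernels, 10 us per launch, 3 us per kernel.
before = sequential_launch_time_us(1000, launch_us=10.0, exec_us=3.0)
after = graph_replay_time_us(1000, launch_us=10.0, exec_us=3.0)

print(f"individual launches: {before / 1e3:.2f} ms")
print(f"graph replay:        {after / 1e3:.2f} ms")
print(f"speedup:             {before / after:.1f}x")
# individual launches: 10.00 ms
# graph replay:        3.01 ms
# speedup:             3.3x
```

The model also shows when the fix is pointless: once execution time per kernel exceeds the launch cost, the two functions converge and the speedup approaches 1.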

Register pressure causing spills. The kernel demands more registers per thread than the compiler is allowed to allocate (the architectural per-thread limit, or a lower cap imposed through launch bounds or a register-count compiler option), and the surplus spills to local memory, which is backed by global memory and is roughly two orders of magnitude slower than register access. The fix is to refactor the kernel to reduce live-variable count: simplify expressions, reduce loop unrolling, or split the kernel into stages that need fewer simultaneous registers. Typical improvement: roughly 1.5–3× throughput recovery once the spill pressure is eliminated.
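Register count cuts both ways: too few registers per thread and the kernel spills, too many and fewer warps fit on the SM. The second effect can be estimated with integer arithmetic. The sketch below assumes 65,536 registers and a 64-warp limit per SM, which should be checked against the specification of your device, and it ignores register allocation granularity, block size, and shared memory limits.

```python
def register_limited_occupancy(regs_per_thread, regs_per_sm=65536,
                               max_warps_per_sm=64, warp_size=32):
    """Upper bound on occupancy imposed by register usage alone.

    regs_per_sm and max_warps_per_sm are assumed device limits; look up
    the real values for the GPU being profiled.
    """
    warps_by_registers = regs_per_sm // (regs_per_thread * warp_size)
    resident_warps = min(warps_by_registers, max_warps_per_sm)
    return resident_warps / max_warps_per_sm


for regs in (32, 64, 128, 255):
    print(f"{regs:>3} registers/thread -> "
          f"{register_limited_occupancy(regs):.1%} occupancy ceiling")
#  32 registers/thread -> 100.0% occupancy ceiling
#  64 registers/thread -> 50.0% occupancy ceiling
# 128 registers/thread -> 25.0% occupancy ceiling
# 255 registers/thread -> 12.5% occupancy ceiling
```

This is why capping registers to raise occupancy is not automatically a win: the cap can push the kernel into spilling, and the profile after the change decides which effect was larger.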

When is the bottleneck outside the kernel?

Often. This is the question Nsight Systems exists to answer, and skipping it is how teams spend weeks tuning a kernel that contributes 2% of total execution time.

The system-level timeline will show, plainly: whether the GPU is idle for significant fractions of the run (host or I/O bound); whether data transfers from host to device are blocking compute (overlap missing); whether small kernels finish faster than the host can launch them (launch overhead); and whether collective operations in multi-GPU training are stalling because one rank finished its compute long before the others (load imbalance or NCCL configuration). None of these is visible from kernel-level profiling alone, and none of them is fixable by changing the kernel.

The optimisation order that gives the largest speedup first is almost always: (1) eliminate idle gaps and host-device serialisation, (2) reduce launch overhead via fusion or CUDA graphs, (3) fix the dominant kernel against its binding ceiling, (4) iterate on the next dominant kernel as the workload’s profile shifts. Skip the first step and the rest will not matter.
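The reasoning behind this ordering is Amdahl's law, which we apply here ourselves: the end-to-end gain from any fix is capped by the share of runtime it touches. The sketch below applies the standard formula; the fractions and speedups are hypothetical, with the 2% case echoing the kernel mentioned above.

```python
def end_to_end_speedup(fraction_of_runtime, local_speedup):
    """Amdahl's law: overall speedup when only part of the run gets faster.

    fraction_of_runtime : share of total time spent in the part being fixed
    local_speedup       : how much faster that part becomes
    """
    remaining = (1.0 - fraction_of_runtime) + fraction_of_runtime / local_speedup
    return 1.0 / remaining


# Making a kernel that is 2% of the run ten times faster.
minor_kernel = end_to_end_speedup(0.02, 10.0)
# Halving idle gaps that account for 40% of the run.
idle_gaps = end_to_end_speedup(0.40, 2.0)

print(f"10x on a 2% kernel:      {minor_kernel:.2f}x end to end")
print(f"2x on 40% idle time:     {idle_gaps:.2f}x end to end")
# 10x on a 2% kernel:      1.02x end to end
# 2x on 40% idle time:     1.25x end to end
```

A modest improvement to the largest slice beats a heroic improvement to a small one, which is why the system-level pass decides where the kernel-level effort goes.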

FAQ

How do I tell whether my GPU kernel is compute-bound, memory-bound, or host-bound?

Profile the workload first with Nsight Systems to see whether the GPU is even busy — large idle gaps or blocking memory transfers indicate a host or I/O bottleneck. Then profile the dominant kernels with Nsight Compute and read the roofline chart: a kernel near the compute ceiling is compute-bound, a kernel near the memory bandwidth ceiling is memory-bound, and a kernel far below both with low achieved occupancy is occupancy-limited. The classification determines the fix.

Which GPU profiler should I use — Nsight Systems, Nsight Compute, or vendor alternatives?

On NVIDIA hardware, use Nsight Systems for the system-level timeline and Nsight Compute for per-kernel analysis. They answer different questions and you need both. On AMD hardware, the equivalents are rocprof and Radeon GPU Profiler; on Intel hardware, VTune for GPU. Framework-level profilers in PyTorch and JAX are useful as a first cut but do not replace kernel-level analysis.

What does it mean when GPU utilisation looks high but end-to-end throughput is low?

The utilisation reported by nvidia-smi is the fraction of time at least one kernel is running. It does not measure how efficiently those kernels use the hardware. A workload at 95% utilisation can still be achieving 10% of peak throughput because every kernel is running far below its ceiling, or because launch overhead dominates the wall clock. Roofline analysis exposes this; the utilisation gauge cannot.

How do I read a profiler trace to identify the real bottleneck rather than a symptom?

Start at the system level: identify the top three to five kernels by cumulative execution time, and confirm they actually dominate the run. Then for each one, look at the roofline position, the achieved occupancy, and the stall-reason breakdown in Nsight Compute. The bottleneck is whichever resource the kernel is closest to saturating — or, if the kernel is far from all ceilings, the dominant stall reason will name the deeper constraint.

When does profiling reveal the bottleneck is outside the kernel?

Whenever the system-level timeline shows substantial GPU idle time, long host-device transfers blocking compute, small kernels that finish faster than the host can launch them, or collective operations stalling in multi-GPU training. None of these is visible from kernel-level profiling alone, which is why Nsight Systems comes first in the workflow.

Once I’ve identified the bottleneck, what is the optimisation order that gives the largest speedup first?

Eliminate system-level idle gaps and host-device serialisation first. Reduce launch overhead next, through fusion or CUDA graphs. Then optimise the dominant kernel against its binding ceiling — memory access pattern for memory-bound, tensor cores or warp efficiency for compute-bound. Re-profile after each change because the binding constraint shifts as bottlenecks are removed.

The first step is always the same: profile the workload, identify the dominant bottleneck, and quantify the gap between current and achievable performance — which is what a GPU Performance Audit delivers.