# Why guessing which optimisation to apply wastes engineering time

The most common GPU performance anti-pattern we observe is this: an engineer identifies slow execution, assumes the cause, applies an optimisation, and sees no improvement, because the assumed bottleneck was wrong. GPU performance is governed by whichever hardware resource saturates first (compute throughput, memory bandwidth, or launch overhead), and the correct intervention depends entirely on which resource that is.

Profiling must precede optimisation. The highest-impact bottleneck is rarely where engineers assume it is. In our GPU engineering practice, we have seen teams spend weeks optimising kernel arithmetic only to discover the kernel was memory-bound, and that a simple change to memory access patterns delivered more improvement than all the compute optimisations combined.

## The profiling-first checklist

Before applying any performance fix, complete this sequence:

1. **Profile with Nsight Systems:** capture a timeline of the full workload to identify which kernels dominate execution time.
2. **Classify each dominant kernel:** use Nsight Compute roofline analysis to determine whether it is compute-bound or memory-bound.
3. **Quantify the gap:** compare achieved throughput to the theoretical hardware ceiling for the binding resource.
4. **Select the intervention that addresses the binding constraint,** not the one that seems most sophisticated.
5. **Re-profile after each change:** fixing one bottleneck shifts the constraint elsewhere.

## Memory bandwidth optimisation: the highest-impact category

Memory bandwidth optimisation typically delivers 2–5× more improvement than compute-bound optimisation for AI workloads. This is because the majority of operations in deep learning inference (element-wise activations, normalisation layers, attention score computation at small batch sizes) have low arithmetic intensity and are memory-bandwidth-limited.
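The classification step in the checklist can be sketched as a simple arithmetic-intensity check against the machine balance point. This is a back-of-envelope sketch, not a substitute for Nsight Compute; the peak FLOP/s and bandwidth figures below are illustrative placeholder values for an A100-class GPU:

```python
# Roofline-style classification: a kernel is memory-bound when its
# arithmetic intensity (FLOPs per byte of DRAM traffic) falls below
# the machine balance (peak FLOP/s divided by peak bytes/s).

PEAK_FLOPS = 312e12        # illustrative FP16 tensor-core peak, FLOP/s
PEAK_BANDWIDTH = 2.0e12    # illustrative HBM bandwidth, bytes/s
MACHINE_BALANCE = PEAK_FLOPS / PEAK_BANDWIDTH  # FLOPs per byte

def classify(flops: float, bytes_moved: float) -> str:
    """Return 'memory-bound' or 'compute-bound' for one kernel."""
    intensity = flops / bytes_moved
    return "memory-bound" if intensity < MACHINE_BALANCE else "compute-bound"

# Element-wise FP16 activation: ~1 FLOP per element, 4 bytes moved
# (2-byte read + 2-byte write) -> intensity 0.25, far below balance.
print(classify(flops=1.0, bytes_moved=4.0))      # memory-bound

# Large FP16 matmul (N=4096): 2*N^3 FLOPs over 3 N^2 2-byte matrices.
N = 4096
print(classify(2 * N**3, 3 * N * N * 2))         # compute-bound
```

The same arithmetic explains why the element-wise operations listed above are almost always bandwidth-limited: their intensity is fixed near one FLOP per byte regardless of problem size.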
| Optimisation | Addresses | Typical impact | When to apply |
|---|---|---|---|
| Memory coalescing | Scattered memory access patterns | 2–8× kernel speedup | When Nsight shows low memory throughput efficiency |
| Kernel fusion | Multiple kernel launches with intermediate writes | 1.5–3× end-to-end speedup | When the timeline shows many short kernels with gaps |
| FP16/BF16 precision | Memory bandwidth consumed by data movement | 1.5–2× throughput | When accuracy tolerates reduced precision |
| Shared memory caching | Repeated access to the same data | 2–4× for affected kernels | When the same data is read multiple times across threads |
| Batch size tuning | Underutilised GPU compute and memory bus | 1.5–4× throughput | When GPU utilisation is below 60% |

## Compute-bound optimisation: when it actually matters

Compute-bound kernels (large matrix multiplications, convolutions with large filter sizes) benefit from:

- **Tensor core utilisation:** ensuring matrix dimensions are multiples of 8 (FP16) or 16 (INT8) to enable tensor core acceleration, which delivers 4–16× throughput over standard CUDA cores.
- **Mixed-precision training/inference:** using FP16 for compute while maintaining FP32 for accumulation, doubling effective compute throughput.
- **Algorithm selection:** choosing Winograd convolution for small-filter cases or FFT-based convolution for large filters.

These interventions matter only when Nsight Compute confirms the kernel is compute-bound. Applying tensor core optimisation to a memory-bound kernel has zero effect.
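The kernel-fusion row in the table can be made concrete with a DRAM-traffic estimate. This is a simplified model (it ignores caching and assumes every unfused kernel round-trips its intermediate through HBM; the element count and dtype size are illustrative):

```python
# Why fusion helps memory-bound chains: each unfused element-wise
# kernel writes its intermediate to DRAM and the next kernel reads it
# back. A fused kernel keeps intermediates in registers, so only the
# original input is read and the final output written.

def traffic_bytes(n_elements: int, n_ops: int, dtype_bytes: int,
                  fused: bool) -> int:
    """Estimated DRAM traffic for a chain of n_ops element-wise ops."""
    if fused:
        # one read of the input + one write of the final result
        return 2 * n_elements * dtype_bytes
    # every op reads its input and writes its output
    return 2 * n_ops * n_elements * dtype_bytes

n = 1 << 20  # 1M FP16 elements
unfused = traffic_bytes(n, n_ops=3, dtype_bytes=2, fused=False)
fused = traffic_bytes(n, n_ops=3, dtype_bytes=2, fused=True)
print(unfused // fused)  # 3x less traffic for a 3-op chain
```

For a bandwidth-limited chain, the speedup tracks the traffic reduction, which is why fusing a three-op chain lands near the top of the 1.5–3× range quoted above.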
## The optimisation hierarchy for AI inference

For teams running AI inference workloads, the interventions ranked by typical impact (from our experience across inference latency optimisation engagements):

1. **Model compilation (TensorRT, torch.compile):** 2–5× improvement by eliminating framework overhead.
2. **Precision reduction (FP32 → FP16 → INT8):** 1.5–4× improvement per precision step.
3. **Kernel fusion:** 1.5–3× by reducing memory round-trips.
4. **Batch size optimisation:** 1.5–4× by improving hardware utilisation.
5. **Memory layout optimisation:** 1.2–2× by improving coalescing and cache behaviour.
6. **Custom kernel implementation:** highly variable; only justified when standard libraries don't cover the operation.

The order matters. Applying step 6 before steps 1–3 is almost always wasted effort: the framework overhead eliminated by compilation exceeds any custom kernel improvement.

## When to stop optimising

Performance optimisation has diminishing returns. The signal to stop is when the profiler shows the dominant kernels achieving 70%+ of the theoretical hardware ceiling for their binding resource. Beyond that point, further improvement requires algorithmic changes (a different model architecture, a different precision regime) rather than kernel-level optimisation. Pushing from 70% to 90% of peak typically requires 5–10× the engineering effort of reaching 70%, a trade-off that is rarely justified outside hyperscale deployments.