Why guessing which optimisation to apply wastes engineering time

The most common GPU performance improvement pattern we observe is: an engineer identifies slow execution, assumes the cause, applies an optimisation, and sees no improvement, because the assumed bottleneck was wrong. GPU performance is governed by whichever hardware resource is saturated first (compute throughput, memory bandwidth, or launch overhead), and the correct intervention depends entirely on which resource that is. Profiling must precede optimisation.

The highest-impact bottleneck is rarely where engineers assume it is. In our GPU engineering practice, we have seen teams spend weeks optimising kernel arithmetic only to discover the kernel was memory-bound, and a simple change to memory access patterns delivered more improvement than all the compute optimisations combined. The parent methodology, how to profile GPU kernels to find the real bottleneck, develops the diagnostic side of this in full.

The profiling-first checklist

Before applying any performance fix, complete this sequence:

1. Profile with Nsight Systems: capture a timeline of the full workload to identify which kernels dominate execution time.
2. Classify each dominant kernel: use Nsight Compute roofline analysis to determine if it is compute-bound or memory-bound.
3. Quantify the gap: compare achieved throughput to the theoretical hardware ceiling for the binding resource.
4. Select the intervention that addresses the binding constraint, not the one that seems most sophisticated.
5. Re-profile after each change: fixing one bottleneck shifts the constraint elsewhere.

Memory bandwidth optimisation: the highest-impact category

Memory bandwidth optimisation typically delivers 2–5× more improvement than compute-bound optimisations for AI inference workloads (observed-pattern across our GPU engagements; not a benchmarked rate).
This is because the majority of operations in deep learning inference (element-wise activations, normalisation layers, attention score computation at small batch sizes) have low arithmetic intensity and are memory-bandwidth-limited.

| Optimisation | Addresses | Typical impact (observed-pattern) | When to apply |
| --- | --- | --- | --- |
| Memory coalescing | Scattered memory access patterns | 2–8× kernel speedup | When Nsight shows low memory throughput efficiency |
| Kernel fusion | Multiple launches with intermediate writes | 1.5–3× end-to-end speedup | When timeline shows many short kernels with gaps |
| FP16/BF16 precision | Memory bandwidth consumed by data movement | 1.5–2× throughput | When accuracy tolerates reduced precision |
| Shared memory caching | Repeated access to same data | 2–4× for affected kernels | When the same data is read multiple times across threads |
| Batch size tuning | Underutilised GPU compute and memory bus | 1.5–4× throughput | When GPU utilisation is below 60% |

The ranges above are planning heuristics drawn from our own engagements; they are not externally benchmarked, and individual workloads vary significantly.

Compute-bound optimisation: when it actually matters

Compute-bound kernels (large matrix multiplications, convolutions with large filter sizes) benefit from:

- Tensor core utilisation: ensuring matrix dimensions are multiples of 8 (FP16) or 16 (INT8) to enable tensor core acceleration, which delivers a large multiplier over standard CUDA cores
- Mixed-precision training/inference: using FP16 for compute while maintaining FP32 for accumulation, roughly doubling effective compute throughput
- Algorithm selection: choosing Winograd convolution for small-filter cases or FFT-based convolution for large filters

These interventions matter only when Nsight Compute confirms the kernel is compute-bound. Applying tensor core optimisation to a memory-bound kernel has little to no effect, because the kernel is waiting on data movement rather than on arithmetic.
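The compute-bound versus memory-bound distinction that this section relies on can be estimated on paper before opening a profiler. The following is a minimal sketch, assuming a simple two-roof model; `DeviceCeilings`, `ridge_point`, `classify_kernel`, and `pad_to_multiple` are hypothetical helper names, and the hardware figures are illustrative round numbers rather than the specification of any real GPU. It is a rough cross-check only; the roofline view in Nsight Compute remains the authoritative classification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceCeilings:
    """Hypothetical hardware ceilings; replace with your GPU's datasheet figures."""
    peak_flops: float      # FLOP/s at the precision in use
    peak_bandwidth: float  # bytes/s of device memory bandwidth


def ridge_point(device: DeviceCeilings) -> float:
    """Arithmetic intensity (FLOP/byte) at which the two roofs meet."""
    return device.peak_flops / device.peak_bandwidth


def classify_kernel(flops: float, bytes_moved: float, device: DeviceCeilings) -> str:
    """Label a kernel by comparing its arithmetic intensity to the ridge point."""
    intensity = flops / bytes_moved
    return "compute-bound" if intensity >= ridge_point(device) else "memory-bound"


def pad_to_multiple(dim: int, multiple: int) -> int:
    """Round a matrix dimension up to the next multiple (8 for FP16, 16 for INT8)."""
    return ((dim + multiple - 1) // multiple) * multiple


# Illustrative round numbers, not the specification of any real GPU.
device = DeviceCeilings(peak_flops=100e12, peak_bandwidth=1e12)

# Element-wise FP16 activation: about 1 FLOP per element, 2 bytes read + 2 bytes written.
n = 1_000_000
activation = classify_kernel(flops=n, bytes_moved=4 * n, device=device)

# Square FP16 matmul of size m: 2*m^3 FLOPs, three m*m matrices of 2-byte values
# (an idealised count that assumes each matrix crosses the memory bus once).
m = 4096
matmul = classify_kernel(flops=2 * m**3, bytes_moved=3 * m * m * 2, device=device)

print(activation, matmul)        # memory-bound compute-bound
print(pad_to_multiple(1001, 8))  # 1008
```

With these assumed ceilings the ridge point is 100 FLOP/byte, so the activation (0.25 FLOP/byte) lands far on the memory-bound side and the large matmul lands on the compute-bound side, which is where tensor core alignment is worth checking.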
The optimisation hierarchy for AI inference

For teams running AI inference workloads, the interventions ranked by typical impact (observed-pattern from our inference latency optimisation engagements):

1. Model compilation (TensorRT, torch.compile): 2–5× improvement by eliminating framework overhead
2. Precision reduction (FP32 → FP16 → INT8): 1.5–4× improvement per precision step
3. Kernel fusion: 1.5–3× by reducing memory round-trips
4. Batch size optimisation: 1.5–4× by improving hardware utilisation
5. Memory layout optimisation: 1.2–2× by improving coalescing and cache behaviour
6. Custom kernel implementation: highly variable, only justified when standard libraries don't cover the operation

The order matters. Applying step 6 before steps 1–3 is almost always wasted effort: the framework overhead eliminated by compilation exceeds any custom kernel improvement.

When should you stop optimising GPU performance?

Performance optimisation has diminishing returns. The signal to stop is when the profiler shows the dominant kernels are achieving 70%+ of the theoretical hardware ceiling for their binding resource. Beyond that point, further improvement requires algorithmic changes (different model architecture, different precision regime) rather than kernel-level optimisation. Pushing from 70% to 90% of peak typically requires 5–10× the engineering effort of reaching 70%, a trade-off that is rarely justified outside hyperscale deployments.

For the underlying profiling workflow (which traces to capture, how to read a roofline, and how to tell host-bound from kernel-bound), see the parent article on profiling GPU kernels to find the real bottleneck.
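The stopping rule described above (dominant kernels at 70% or more of the ceiling for their binding resource) can be written down as a small check over profiler readings. This is a sketch under stated assumptions: `KernelSample`, `kernels_below_threshold`, and `should_stop` are hypothetical names, and the kernel names and figures are made-up placeholders that would be replaced by values read from your own profiler reports.

```python
from typing import List, NamedTuple


class KernelSample(NamedTuple):
    """One dominant kernel, measured against its binding resource."""
    name: str
    achieved: float  # achieved throughput (GB/s or GFLOP/s, matching `ceiling`)
    ceiling: float   # theoretical hardware ceiling in the same unit


def fraction_of_ceiling(sample: KernelSample) -> float:
    """Share of the theoretical ceiling the kernel currently reaches."""
    return sample.achieved / sample.ceiling


def kernels_below_threshold(samples: List[KernelSample], threshold: float = 0.70) -> List[str]:
    """Names of kernels that still have kernel-level headroom."""
    return [s.name for s in samples if fraction_of_ceiling(s) < threshold]


def should_stop(samples: List[KernelSample], threshold: float = 0.70) -> bool:
    """True when every dominant kernel is at or above the threshold."""
    return not kernels_below_threshold(samples, threshold)


# Placeholder measurements for illustration only.
samples = [
    KernelSample("layernorm", achieved=750.0, ceiling=1000.0),  # 75% of bandwidth
    KernelSample("attention", achieved=400.0, ceiling=1000.0),  # 40% of bandwidth
]

print(kernels_below_threshold(samples))  # ['attention']
print(should_stop(samples))              # False
```

The 0.70 default mirrors the threshold in the text; it is a planning heuristic rather than a hard limit, so the parameter is left adjustable.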