Optimising the wrong level
A GPU engineer profiles a kernel, identifies that it is achieving 35% of peak compute throughput, and spends two weeks tuning shared memory tiling, warp-level primitives, and register allocation to push it to 65%. The improvement is real — nearly 2× speedup on that kernel. Then a colleague restructures the upstream algorithm to eliminate 60% of the data the kernel processes. The restructured version, running the original unoptimised kernel on less data, is faster than the tuned kernel running on the full dataset.
This is not a contrived example. It is a pattern we encounter regularly in GPU optimisation engagements: engineering effort invested in kernel-level tuning when the higher-leverage intervention is algorithmic restructuring. The inverse also occurs — teams restructure algorithms when the bottleneck is a specific kernel implementation that is using 20% of the hardware’s capability. Knowing which level to optimise before committing effort is the difference between productive optimisation and wasted engineering time.
What kernel tuning actually addresses
Kernel tuning operates within a fixed algorithm and data volume. The algorithm has been chosen; the data it processes has been defined; the kernel implements the computation on the GPU. Tuning optimises how that computation maps to the hardware: memory access patterns, occupancy, warp utilisation, tensor core usage, instruction scheduling.
The ceiling for kernel tuning is the hardware’s theoretical peak throughput for the operation’s arithmetic intensity. A memory-bound kernel can be tuned to approach the memory bandwidth ceiling. A compute-bound kernel can be tuned to approach the compute throughput ceiling. Kernel tuning improves constant factors — it makes the existing algorithm run faster on the hardware, but it does not change what the algorithm computes or how much data it processes.
Kernel tuning is the right intervention when the profiling data shows a large gap between achieved and achievable performance for a kernel that processes the minimum necessary data volume. If the kernel is at 20% of its hardware-limited ceiling, there is a 5× potential improvement from tuning alone — and that improvement requires no changes to the algorithm or the data pipeline.
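The gap between achieved and achievable can be made concrete with a roofline estimate. A minimal sketch in Python, using illustrative hardware numbers (100 TFLOP/s peak, 2 TB/s bandwidth) rather than any specific GPU's:

```python
def attainable_ceiling(peak_flops, mem_bw, arithmetic_intensity):
    """Roofline model: a kernel's hardware-limited ceiling is the lesser of
    peak compute and memory bandwidth times arithmetic intensity (FLOP/byte)."""
    return min(peak_flops, mem_bw * arithmetic_intensity)

def tuning_headroom(achieved_flops, peak_flops, mem_bw, arithmetic_intensity):
    """Potential speedup from kernel tuning alone: ceiling / achieved."""
    ceiling = attainable_ceiling(peak_flops, mem_bw, arithmetic_intensity)
    return ceiling / achieved_flops

# Illustrative hardware: 100 TFLOP/s peak compute, 2 TB/s memory bandwidth.
PEAK, BW = 100e12, 2e12

# A memory-bound kernel at 4 FLOP/byte: its ceiling is 8 TFLOP/s, not 100.
print(attainable_ceiling(PEAK, BW, 4.0))       # 8 TFLOP/s ceiling
# Achieving 1.6 TFLOP/s (20% of that ceiling) leaves 5x tuning headroom.
print(tuning_headroom(1.6e12, PEAK, BW, 4.0))  # 5.0
```

The point of the model is that the relevant ceiling is the one set by the kernel's arithmetic intensity, not the headline peak of the card.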
What algorithmic restructuring addresses
Algorithmic restructuring changes what the computation does — not how it maps to hardware, but what work is performed. The interventions operate at a higher level:
Reducing the problem size. An object detection pipeline processes every frame at full resolution. Restructuring introduces a lightweight first-pass detector (running at reduced resolution or on a down-sampled image) that identifies regions of interest, and the full-resolution processing is applied only to those regions. The total compute volume drops proportionally to the selectivity of the first pass — if only 15% of each frame contains objects of interest, the full-resolution compute drops by 85%. No amount of kernel tuning on the full-resolution pipeline achieves this reduction.
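The arithmetic of a coarse-to-fine restructuring is worth making explicit. A sketch with hypothetical per-frame costs (the numbers are illustrative, not from a real pipeline):

```python
def coarse_to_fine_cost(full_cost, coarse_cost, selectivity):
    """Total cost per frame after restructuring: a cheap first pass over the
    whole frame, plus full-resolution processing on the selected fraction."""
    return coarse_cost + selectivity * full_cost

# Hypothetical figures: full-resolution pipeline costs 100 units/frame, the
# reduced-resolution first pass 2 units/frame, and 15% of each frame is selected.
full, coarse, sel = 100.0, 2.0, 0.15
restructured = coarse_to_fine_cost(full, coarse, sel)
print(restructured)         # 17.0 units/frame
print(full / restructured)  # roughly 5.9x less total compute
```

Note that the first pass is overhead added to every frame, so the net win depends on it being much cheaper than the work it avoids.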
Changing the algorithmic complexity. A naive attention mechanism in a Transformer is O(n²) in sequence length. FlashAttention restructures the computation to tile the attention matrix and process it in blocks that fit in SRAM, so the O(n²) intermediate matrix is never materialised in global memory; global memory traffic scales with the inputs and outputs rather than with the full n×n matrix. The FLOP count is essentially unchanged, but the memory traffic — which is the binding constraint — drops dramatically. The restructuring achieves a speedup that no amount of tuning on the naive implementation can match, because the naive implementation’s memory access pattern is fundamentally mismatched to the hardware.
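A simplified traffic model makes the difference concrete. The counts below are a rough sketch, not FlashAttention's exact accounting (which also depends on SRAM size and the number of passes over K and V):

```python
def naive_attention_hbm_bytes(n, d, dtype_bytes=2):
    """Rough global-memory traffic for naive attention: read Q, K, V and write
    O (each n x d), plus writing the n x n score matrix and re-reading it twice
    (softmax pass, then the value matmul). A simplified count."""
    io_qkvo = 4 * n * d * dtype_bytes
    scores = 3 * n * n * dtype_bytes
    return io_qkvo + scores

def tiled_attention_hbm_bytes(n, d, dtype_bytes=2):
    """With tiling, score blocks live entirely in SRAM and never touch global
    memory; traffic is dominated by the n x d inputs and output. (Ignores the
    extra passes over K and V that exact tiled algorithms require.)"""
    return 4 * n * d * dtype_bytes

# For a long sequence with a typical head dimension, the n x n score matrix
# dominates: at n=8192, d=64 the naive variant moves ~97x more global-memory
# bytes under this simplified model.
n, d = 8192, 64
print(naive_attention_hbm_bytes(n, d) / tiled_attention_hbm_bytes(n, d))
```

Because the ratio grows with n/d, the advantage of tiling widens exactly where the naive implementation hurts most: long sequences.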
Trading computation against memory. A training pipeline stores every intermediate activation so that the backward pass can compute gradients from them. Gradient checkpointing restructures the backward pass to recompute activations from a small set of checkpoints rather than storing them all — trading compute for memory. Conversely, caching intermediate results that are reused across operations eliminates redundant recomputation, trading memory for compute. These are algorithmic decisions that determine the total work and memory volume, not implementation decisions about how individual kernels execute.
Changing the numerical precision. Moving from FP32 to FP16 or INT8 computation halves or quarters the memory bandwidth and compute requirements per operation. This is an algorithmic decision (it changes what is computed — lower-precision approximations rather than full-precision results) that has a hardware-level effect (tensor cores operate on lower-precision operands at higher throughput). The decision requires numerical stability analysis, not kernel tuning.
How to decide which level to optimise
The decision framework follows from profiling data:
Profile the system first. Using the GPU profiling methodology, identify where execution time is spent and what the dominant bottleneck is. The system-level profile reveals whether the problem is in specific kernels (optimise kernels) or in the volume and pattern of work (restructure the algorithm).
Check the utilisation ceiling. If the dominant kernels are achieving 60%+ of their hardware-limited ceiling (compute or memory bandwidth, whichever is binding), kernel tuning will yield diminishing returns. The remaining improvement potential is at most about 1.7× even at the theoretical ceiling, and closer to 1.5× in practice. Algorithmic restructuring that reduces the workload is the higher-leverage intervention.
If the dominant kernels are achieving less than 30% of their ceiling, kernel tuning has significant headroom. A 2–3× improvement from tuning is achievable and likely worth the investment before considering algorithmic changes.
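The two thresholds can be captured in a small helper. The 30% and 60% cut-offs are the heuristics from this section, not universal constants, and the figures in the usage lines are illustrative:

```python
def recommend_intervention(achieved, ceiling):
    """Map a kernel's achieved fraction of its hardware-limited ceiling to the
    heuristic from this section: below 30%, tune the kernel first; above 60%,
    restructure the algorithm; in between, weigh both interventions."""
    frac = achieved / ceiling
    headroom = ceiling / achieved
    if frac < 0.30:
        return ("tune-kernel", headroom)
    if frac > 0.60:
        return ("restructure-algorithm", headroom)
    return ("assess-both", headroom)

# Illustrative: a kernel at 1.6 TFLOP/s against an 8 TFLOP/s ceiling (20%).
print(recommend_intervention(1.6e12, 8e12))  # tune first, 5x headroom
# The same kernel after tuning, now at 65% of its ceiling.
print(recommend_intervention(5.2e12, 8e12))  # restructure next
```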
Assess the restructuring difficulty. Algorithmic restructuring requires understanding the computation at a mathematical level — not just the implementation, but the problem structure. Changing the attention mechanism requires understanding attention theory. Changing the detection pipeline requires understanding the detection accuracy trade-offs of coarse-to-fine processing. The engineering effort for restructuring is typically higher than for kernel tuning, and the risk of introducing correctness issues is also higher.
Estimate the ROI of each intervention. Kernel tuning on a kernel that contributes 5% of total execution time yields at most a 5% end-to-end improvement, however good the tuning. Algorithmic restructuring that eliminates 50% of the total computation can halve the end-to-end time, a 2× speedup, if it is feasible. The ROI comparison accounts for both the potential improvement and the engineering effort required.
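The first half of that comparison is Amdahl's law. A sketch using the figures from this paragraph:

```python
def end_to_end_speedup(fraction, local_speedup):
    """Amdahl's law: end-to-end speedup when a fraction of execution time
    is accelerated by local_speedup."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)

# Tuning a kernel that is 5% of execution time, even with a 10x kernel win:
print(end_to_end_speedup(0.05, 10.0))          # ~1.05x end to end
# Restructuring that eliminates 50% of the total computation outright:
print(end_to_end_speedup(0.50, float("inf")))  # 2.0x end to end
```

The asymmetry is the point: a spectacular local win on a small fraction of the runtime is bounded by that fraction, while work elimination scales with how much of the total it removes.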
We have found that the most productive optimisation engagements apply both levels in sequence: algorithmic restructuring first (to reduce the total work to the minimum required for the application’s accuracy and correctness requirements), then kernel tuning on the dominant remaining kernels (to maximise hardware utilisation on the reduced workload). Applying kernel tuning first risks optimising code that will be eliminated by subsequent algorithmic changes. The cross-platform portability considerations also factor in: algorithmic restructuring is API-independent, while kernel tuning is often API-specific — and achieving cross-platform GPU performance portability demands algorithmic approaches that do not depend on vendor-specific kernel optimisations.
The practical implication for GPU teams
Our recommendation for teams seeking to improve GPU workload performance: do not start with kernel tuning. Start with profiling, then assess whether the dominant bottleneck is in the implementation (kernel tuning territory) or in the problem structure (algorithmic restructuring territory). The profiling data determines the right level of intervention — and prevents the expensive mistake of spending engineering weeks on kernel tuning when a day of algorithmic analysis would identify a higher-leverage improvement.
If your team needs to determine whether the performance gap in your GPU workload is best addressed through kernel tuning, algorithmic restructuring, or hardware scaling, a GPU Performance Audit provides the diagnostic framework. Our GPU engineering practice starts with the profiling data.