How to Increase GPU Performance for AI: Batch Sizing, Occupancy, and Operator Fusion

Increasing GPU performance for AI workloads is not primarily about changing hardware — it’s about using the hardware you have more effectively. In our experience, most production AI inference systems operate at 30–60% GPU utilization when first deployed. Getting to 80–90% is almost always an engineering problem, not a budget problem (observed pattern across our GPU-engineering engagements; not a benchmarked rate).

The techniques that actually move the needle, in rough order of impact, are: batch sizing, operator fusion, memory access optimization, and kernel occupancy tuning. Each addresses a different constraint. Applying the wrong fix for the bottleneck produces no improvement — which is why the order below starts with profiling, not tuning.

Why profile before you optimise?

This cannot be overstated. The approaches below target different bottlenecks — memory bandwidth, compute throughput, launch overhead, CPU synchronization. Without profiling, you don’t know which one limits your workload. The full profiler workflow lives in our parent methodology piece on how to profile GPU kernels to find the real bottleneck; the short version is: run Nsight Systems first to classify the workload, then drill into Nsight Compute only for the kernels that show up as hot.

The one-line decision: run Nsight Systems, confirm GPU utilization, then check whether the idle periods are compute gaps, memory transfer stalls, or CPU-side overhead. That single observation determines which of the sections below is worth your week.

Batch sizing: the highest-leverage lever

For most AI inference workloads, increasing batch size is the single highest-impact change for throughput. GPU hardware is designed for massively parallel execution. A batch of 1 leaves the vast majority of compute units idle. A batch of 32 amortizes kernel launch overhead and fills more of the available warp slots.

The relationship between batch size and throughput follows a curve we see repeatedly in practice (observed pattern across transformer and CNN inference deployments; specific shape depends on model and GPU):

Batch 1–4 — Typically memory-bandwidth-limited, low arithmetic intensity. Most of the GPU is idle.
Batch 8–32 — Throughput increases near-linearly for many models. This is the efficient operating region for many inference scenarios.
Batch 64–256 — Compute-bound for most transformer models. Throughput gains slow once arithmetic intensity passes the roofline ridge point, where compute rather than memory bandwidth becomes the limit.
Batch >256 — Often memory-bound again for LLMs, because KV cache traffic grows with batch size; CNN architectures typically stay compute-bound.
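The regimes above follow from arithmetic intensity, which a back-of-envelope calculation makes concrete. The sketch below is illustrative only: the function name is ours, the peak-FLOPs and bandwidth figures describe a hypothetical GPU rather than any real spec sheet, and it models a single fp16 linear layer, where "batch" means the number of rows processed per step (one row per sequence during LLM decode).

```python
def linear_layer_intensity(batch, d_in, d_out, bytes_per_elem=2):
    """FLOPs per byte of HBM traffic for y = x @ W, x: (batch, d_in), W: (d_in, d_out)."""
    flops = 2 * batch * d_in * d_out  # one multiply and one add per weight per row
    # Read the weights and the input once, write the output once.
    bytes_moved = bytes_per_elem * (d_in * d_out + batch * d_in + batch * d_out)
    return flops / bytes_moved


# Hypothetical accelerator (assumption, not a real spec): 100 TFLOP/s, 2 TB/s HBM.
PEAK_FLOPS = 100e12
HBM_BYTES_PER_S = 2e12
RIDGE = PEAK_FLOPS / HBM_BYTES_PER_S  # FLOP/byte needed to become compute-bound

for batch in [1, 8, 32, 64, 256]:
    ai = linear_layer_intensity(batch, d_in=4096, d_out=4096)
    regime = "compute-bound" if ai >= RIDGE else "memory-bound"
    print(f"batch {batch:>3}: {ai:6.1f} FLOP/byte -> {regime}")
```

For small batches the intensity is roughly equal to the batch size, because weight traffic dominates. That is why batch 1 sits so far below the ridge point on any modern GPU.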

The constraint is latency. Higher batch size increases time-to-first-response. For latency-sensitive APIs with a p99 budget under 100 ms, practical batch sizes are bounded. For throughput-optimized offline inference, batch size should be pushed to the memory limit and held there.

A short PyTorch sweep is usually enough to find the knee:

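import time
import torch

# `model` and `input_shape` are placeholders: substitute your own module
# and per-sample input shape, e.g. (3, 224, 224).
model = model.cuda().eval()
with torch.inference_mode():  # no autograd bookkeeping during timing
    for batch_size in [1, 2, 4, 8, 16, 32, 64, 128]:
        x = torch.randn(batch_size, *input_shape, device="cuda")
        # Warmup: absorbs one-off costs such as autotuning and allocator growth
        for _ in range(5):
            _ = model(x)
        torch.cuda.synchronize()  # drain queued kernels before starting the clock
        t0 = time.perf_counter()
        for _ in range(50):
            _ = model(x)
        torch.cuda.synchronize()  # wait for all 50 iterations to finish
        elapsed = (time.perf_counter() - t0) / 50
        print(f"Batch {batch_size}: {elapsed*1000:.1f}ms, {batch_size/elapsed:.0f} samples/s")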

The point at which samples-per-second flattens is your throughput-optimal batch. Whether you can actually run there depends on the SLA the service has to honour.
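Turning sweep output into a decision can be automated. The following is a minimal sketch, not a tuned heuristic: the function name, the 10% gain threshold, and the example numbers are all our own illustrative assumptions, and it presumes latency grows with batch size.

```python
def pick_batch(sweep, latency_budget_ms, min_gain=0.10):
    """Pick a batch size from sweep results.

    sweep: list of (batch_size, latency_ms) pairs, sorted by batch_size.
    Returns the largest batch that fits the latency budget, stopping early
    once a step up improves samples/s by less than min_gain. Returns None
    if even the smallest batch misses the budget.
    """
    best = None
    prev_throughput = None
    for batch_size, latency_ms in sweep:
        if latency_ms > latency_budget_ms:
            break  # over budget (assumes larger batches are slower still)
        throughput = batch_size / (latency_ms / 1000.0)
        if prev_throughput is not None and throughput < prev_throughput * (1 + min_gain):
            break  # past the knee: more latency for little extra throughput
        best = batch_size
        prev_throughput = throughput
    return best


# Made-up sweep results for illustration: (batch size, mean latency in ms).
example_sweep = [(1, 4.0), (2, 4.2), (4, 4.6), (8, 5.5),
                 (16, 8.0), (32, 14.0), (64, 27.0), (128, 54.0)]

print(pick_batch(example_sweep, latency_budget_ms=100))  # knee-limited
print(pick_batch(example_sweep, latency_budget_ms=10))   # budget-limited
```

The sweep reports mean latency, so for a p99 budget leave headroom or feed the function measured tail latencies instead.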

Operator fusion: fewer round-trips through HBM

Unfused inference executes each operation as a separate GPU kernel: linear projection, activation, another linear projection, layer norm — each reads from and writes back to HBM. Fused kernels chain these operations, keeping intermediate results in registers or shared memory and eliminating multiple HBM round-trips.
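A rough traffic count shows why this matters. The sketch below is an estimate under simplifying assumptions (a chain of element-wise operations over one fp16 tensor, no cache reuse, no extra operands); the function name and the tensor shape are ours, chosen for illustration.

```python
def hbm_traffic_bytes(num_elements, num_ops, bytes_per_elem=2, fused=False):
    """Rough HBM traffic for a chain of element-wise ops over one tensor.

    Unfused: every op reads its input from HBM and writes its output back.
    Fused: the chain reads the input once and writes the final result once.
    """
    passes = 1 if fused else num_ops
    return passes * 2 * num_elements * bytes_per_elem  # 2 = one read + one write


# Hypothetical activation tensor: batch 32, sequence 2048, hidden size 4096.
n = 32 * 2048 * 4096
unfused = hbm_traffic_bytes(n, num_ops=3)
fused = hbm_traffic_bytes(n, num_ops=3, fused=True)
print(f"unfused: {unfused / 2**30:.1f} GiB, fused: {fused / 2**30:.1f} GiB")
```

For a memory-bound chain, runtime tracks bytes moved, so the traffic ratio is a reasonable first guess at the achievable speedup before measuring.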

Concrete fusion opportunities for transformer inference:

Unfused operations	Fused version	Typical benefit (observed pattern, model-dependent)
QK^T scores → softmax → weighted sum over V	FlashAttention	2–4× attention kernel speedup
LayerNorm → linear projection	Custom or Triton fused kernel	1.3–1.8×
Element-wise activation + gate multiply (SwiGLU)	Fused kernel	1.5–2× for the fused op
Residual add + LayerNorm	apex FusedLayerNorm, or torch.compile	1.2–1.5×

The speedup ranges above are not benchmarks — they are observed patterns across our engagements on modern transformer architectures, and the actual number on a specific model and GPU pair must be measured.

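torch.compile performs automatic fusion through the TorchInductor backend in every mode. mode="reduce-overhead" additionally uses CUDA graphs to cut kernel launch overhead, and mode="max-autotune" also benchmarks alternative kernel configurations at the cost of longer compile time. This is the first thing to try before writing custom fused kernels: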

model = torch.compile(model, mode="max-autotune")

In our experience, torch.compile delivers a meaningful throughput improvement on modern transformer architectures with no code changes beyond this single line (observed across multiple engagements; magnitude varies by model and PyTorch version). For older models or non-standard architectures, the inductor backend sometimes falls back to eager mode for parts of the graph, and the gain is smaller — Nsight Systems will show you which sections were actually compiled.

Kernel occupancy: when SMs sit idle

Occupancy is the ratio of active warps to the maximum number of warps an SM can support. Low occupancy means the SM has idle cycles it cannot fill due to resource constraints — registers, shared memory, or block configuration.

Check occupancy with Nsight Compute: look at the Achieved Occupancy metric and compare it to the theoretical maximum. If achieved occupancy is below 50%, investigate three suspects:

Register pressure — Too many registers per thread limits how many threads can reside on an SM simultaneously. Compile with -maxrregcount=64 to cap registers and check whether spilling occurs.
Shared memory per block — Large shared memory allocations limit concurrent blocks per SM. Check with --ptxas-options=-v.
Block size — Very small block sizes (e.g. 32 threads) waste scheduler slots. 128–256 threads per block is a common starting point.
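The three suspects above can be checked with a back-of-envelope calculation before opening the profiler. This is a simplified sketch: the function name is ours, the per-SM limits are placeholder defaults that you should replace with the values for your architecture, and it ignores register and shared-memory allocation granularity, so Nsight Compute's reported theoretical occupancy remains the authority.

```python
def theoretical_occupancy(threads_per_block, regs_per_thread, smem_per_block,
                          max_warps_per_sm=64, max_blocks_per_sm=32,
                          regs_per_sm=65536, smem_per_sm=164 * 1024,
                          warp_size=32):
    """Estimate occupancy and name the limiting resource.

    Returns (occupancy in [0, 1], limiter). The SM limits are placeholder
    defaults (assumptions), not a spec for any particular GPU.
    """
    warps_per_block = -(-threads_per_block // warp_size)  # ceiling division
    # How many blocks fit on one SM according to each resource.
    limits = {
        "warps": max_warps_per_sm // warps_per_block,
        "blocks": max_blocks_per_sm,
        "registers": regs_per_sm // (regs_per_thread * warps_per_block * warp_size),
    }
    if smem_per_block > 0:
        limits["shared memory"] = smem_per_sm // smem_per_block
    limiter = min(limits, key=limits.get)  # tightest constraint wins
    resident_blocks = limits[limiter]
    return resident_blocks * warps_per_block / max_warps_per_sm, limiter


print(theoretical_occupancy(256, 32, 0))          # well-balanced launch
print(theoretical_occupancy(256, 128, 0))         # register pressure
print(theoretical_occupancy(32, 32, 0))           # tiny blocks
print(theoretical_occupancy(256, 32, 48 * 1024))  # large shared memory
```

Each call isolates one suspect, so the limiter in the return value indicates which knob to turn first.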

Occupancy is not always the binding constraint. A memory-bound kernel at 50% occupancy may already be saturating HBM bandwidth, and forcing more warps onto the SM will not help. Increasing occupancy improves throughput only for compute-bound kernels with insufficient warps to hide latency. The order of operations matters: classify the kernel first, then tune the right knob.

Memory coalescing: the hidden 50% penalty

For custom kernels, or for stock kernels where profiling shows low memory throughput, check memory access coalescing. An access is coalesced when the 32 threads of a warp read or write 32 consecutive elements in memory. The GPU can serve that with the minimum number of memory transactions; a fully scattered access may need up to 32 separate ones.

Signs of uncoalesced access in Nsight Compute:

L1/TEX cache hit rate near 0% and L2 cache hit rate also low
Memory throughput percentage far below bandwidth ceiling despite memory-bound classification
Global Memory Load Efficiency metric below 50%

Row-major matrix access in column-major traversal order, or transpose operations without shared memory tiling, are the usual culprits. The fix is to reorganise data layout or stage through shared memory with coalesced loads.
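The cost of a strided pattern can be estimated without a GPU. The sketch below counts how many memory sectors one warp touches for a given stride, assuming 32-byte sectors and 4-byte elements; both function names are ours, and the model ignores alignment offsets and caching.

```python
def sectors_per_warp(stride_elems, elem_bytes=4, warp_size=32, sector_bytes=32):
    """Count distinct sectors touched when lane i accesses element i * stride_elems."""
    sectors = {
        (lane * stride_elems * elem_bytes) // sector_bytes
        for lane in range(warp_size)
    }
    return len(sectors)


def load_efficiency(stride_elems, elem_bytes=4, warp_size=32, sector_bytes=32):
    """Fraction of fetched bytes that the warp actually uses."""
    useful = warp_size * elem_bytes
    fetched = sectors_per_warp(stride_elems, elem_bytes, warp_size, sector_bytes) * sector_bytes
    return useful / fetched


for stride in [1, 2, 8, 32]:
    print(f"stride {stride:>2}: {sectors_per_warp(stride):>2} sectors, "
          f"{load_efficiency(stride):.1%} efficiency")
```

Under these assumptions a stride of 2 already halves efficiency, and walking down a column of a row-major float matrix (stride equal to the row length) wastes most of every sector fetched.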

Asynchronous data loading

A common bottleneck that profiling reveals but developers overlook: the GPU is idle because the next batch isn’t ready yet. The CPU is busy preprocessing or loading data while the GPU waits. In Nsight Systems this appears as long gaps on the GPU row with the CPU row fully utilised — a tell that no amount of kernel tuning will fix.

Fix with pinned memory and prefetching:

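from torch.utils.data import DataLoader

# `dataset` is a placeholder for your own torch.utils.data.Dataset
loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,        # worker processes preprocess in parallel with the GPU
    pin_memory=True,      # batches land in page-locked host memory
    prefetch_factor=2,    # batches queued ahead per worker
)

for batch in loader:
    # Assumes each batch is a single tensor. non_blocking=True makes the
    # H2D copy asynchronous, which requires the pinned memory set above.
    batch = batch.cuda(non_blocking=True)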

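pin_memory=True places batches in pinned (non-pageable) host memory, which enables faster DMA transfers and lets copies issued with non_blocking=True overlap with GPU work. prefetch_factor=2 keeps two batches queued ahead per worker, so the next batch is already prepared while the GPU processes the current one; it only applies when num_workers > 0. Together these settings are usually enough to close a CPU-bound feed gap.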
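For pipelines that do not use DataLoader, the same prefetching idea can be sketched with the standard library. This is a conceptual illustration, not how PyTorch implements it: the `prefetch` helper is hypothetical, it uses a thread rather than worker processes, and it does not propagate exceptions raised by the producer.

```python
import queue
import threading


def prefetch(batches, depth=2):
    """Yield items from `batches` while a background thread loads up to `depth` ahead."""
    buf = queue.Queue(maxsize=depth)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for batch in batches:
            buf.put(batch)  # blocks while the buffer is full
        buf.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = buf.get()  # blocks only if the producer has fallen behind
        if batch is done:
            return
        yield batch


# Usage: wrap any iterable of batches; order is preserved.
for batch in prefetch(range(5), depth=2):
    print(batch)
```

The bounded queue is the design choice that matters: it caps memory use while keeping the consumer fed, which is the same trade-off prefetch_factor controls.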

Performance improvement checklist

A practical sequence we run on most engagements:

Profile with Nsight Systems — confirm GPU utilization and identify idle gaps
Increase batch size to the maximum allowed by the latency SLA and VRAM
Apply torch.compile(model, mode="max-autotune") as the first code change
Enable pinned memory and prefetching in the data pipeline
Check for synchronous CUDA operations blocking the CPU (.item(), .tolist(), .cpu().numpy() on GPU tensors)
Profile specific slow kernels with Nsight Compute — check occupancy and memory efficiency
For custom kernels: verify memory coalescing with the Global Memory Load Efficiency metric
Consider operator fusion for repeated sequences of element-wise operations

In brief

GPU performance improvement for AI starts with batch sizing (highest leverage, zero kernel work), proceeds through torch.compile-based operator fusion (low effort, significant gain), and then addresses specific kernel bottlenecks identified by profiling. Occupancy tuning pays off only for kernels with too few warps to hide latency, and memory coalescing only for memory-bound kernels with poor load efficiency; in both cases the profiling data has to confirm that this is the binding constraint. Profiling before every optimisation is not optional: it is how you avoid spending a week tuning the wrong kernel.