Measuring GPU Benchmarks for AI

A practical guide to GPU benchmarks for AI: what to measure, how to run fair tests, and how to turn results into procurement and SLA decisions.

Written by TechnoLynx Published on 15 Jan 2026

A useful GPU benchmark for AI is not the one that produces the highest number on a chart. It is the one that maps to how your AI workloads actually behave: your model mix, your prompt distribution, your batching rules, and your latency targets. A device that looks fast on a micro-test can stall in production when the data path, collectives, or KV cache hit their real limits.

This article sets out a practical approach to GPU benchmarking for AI. We cover training and inference, large language model (LLM) specifics, interconnect and memory effects, power and cost normalisation, and how to fold the Blackwell architecture into a suite without throwing the methodology away. The aim is decision-ready numbers, not lab trivia.

Why benchmark at all?

AI systems are complex. The same card can look fast on a micro-test yet underperform in production because of data stalls, small batches, or chatty collectives. Benchmarks earn their cost when they give you three things: a shared baseline for comparing devices and clusters, evidence to size capacity and budgets, and early warnings about bottlenecks that could block a launch.

The aim is not to chase the highest number. It is to measure the right numbers that map to how your AI workloads behave day to day. We see this pattern regularly — teams that benchmark against the wrong workload mix end up over-provisioning one part of the stack and starving another.

For background on why a single headline figure is misleading, see “Why GPU Performance Is Not a Single Number” and our broader work on training and inference as different workloads.

What “good” looks like

A useful benchmark suite does three things at once.

It represents your workload mix. Include the model types and shapes you run most often: image models, speech models, retrieval, recommendation, and at least one LLM use case that mirrors your prompt and context distribution.

It captures the full path. Measure the model and the data path around it — preprocessing, tokenisation, data loaders, post-processing, and any orchestration overhead that exists in production.

It produces decision-ready metrics. Track throughput (items/sec or tokens/sec), latency percentiles, time-to-target-accuracy for training, energy per item, and cost per item. Tie each number to a target so results dictate action. If a number will not change a decision, it does not belong in the headline table.
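As a rough illustration of the metrics above, the sketch below reduces one run to throughput and latency percentiles, then lists the metrics that miss their target. The function names, the dictionary layout, and the target values are our own assumptions for the example, not part of any standard tool.

```python
import statistics

def summarise(latencies_ms, items, wall_seconds):
    """Reduce one benchmark run to throughput and latency percentiles."""
    # quantiles(n=100) returns 99 cut points: index 49 is p50, 94 is p95, 98 is p99.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "throughput_per_s": items / wall_seconds,
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
    }

def misses(summary, targets):
    """Return the metrics that exceed their upper limit (limits are hypothetical)."""
    return {name: summary[name] for name, limit in targets.items() if summary[name] > limit}
```

A call such as `misses(summarise(latencies, n, seconds), {"p95_ms": 60})` returns an empty dictionary when the run meets the target, which keeps the headline table tied to a decision.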

Training versus inference

Training stresses long runs, high utilisation, and network collectives. The key questions are wall-clock time per epoch and how stable utilisation is over hours or days. The components that decide outcomes are compute kernels (convolutions, attention, GEMM), memory bandwidth and capacity, collectives (all-reduce, all-gather) implemented over NCCL, and I/O — dataloaders and storage can starve the device even when kernels are efficient.

Inference is a different problem. It is about latency and cost at a fixed quality. The main loops are prompt and token handling for an LLM, batching strategy, and the trade-off between throughput and tail latency. Small-batch, low-latency services behave very differently from high-throughput batch jobs. Benchmark both modes if you run both.

Keep the setup honest

Your benchmark should be boring in the best sense — predictable, reproducible, and close to production. Use the same containers, drivers, and runtime flags you use in production. Fix seeds, framework and CUDA versions, and write down every setting. Warm the system before taking measurements so you do not record first-run jitter from kernel autotuning or JIT compilation. Pin data sources to the same storage class and network path used in your environment. Record power draw at the wall if possible, not just device-reported numbers — fan curves and ambient temperature both move sustained results.
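A minimal timing loop that follows the warm-up advice might look like the sketch below. The name `timed_runs` and its parameters are assumptions for illustration. The optional `sync` hook stands in for whatever call blocks until queued device work has finished; with PyTorch on CUDA, `torch.cuda.synchronize` would be one candidate.

```python
import time

def timed_runs(step, warmup=10, iters=50, sync=None):
    """Time `step` after discarding warm-up iterations.

    `sync` is an optional callable that blocks until device work is done.
    """
    for _ in range(warmup):          # untimed: absorbs autotuning and JIT cost
        step()
    if sync:
        sync()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        step()
        if sync:
            sync()                   # without this, async launches look instant
        samples.append(time.perf_counter() - start)
    return samples
```

Keeping the raw per-iteration samples, rather than a single mean, makes it possible to report percentiles and variance later.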

When you publish results internally, include the artefacts: scripts, commit hashes, container digests, driver and CUDA versions, and a short description of the machine room or cloud region. Anything less makes the result unrepeatable, and an unrepeatable benchmark is not evidence.
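One way to make those artefacts hard to forget is to write a small manifest next to every result file. The sketch below is an assumption about what such a manifest could contain; the field names are hypothetical and every value is supplied by the caller rather than discovered automatically.

```python
import json
import platform
import sys

def build_manifest(commit, container_digest, driver, cuda, location, flags):
    """Bundle the facts needed to repeat a run. All arguments come from the caller."""
    return {
        "commit": commit,
        "container_digest": container_digest,
        "driver_version": driver,
        "cuda_version": cuda,
        "location": location,             # machine room or cloud region
        "runtime_flags": sorted(flags),   # sorted so diffs between runs stay readable
        "python": sys.version.split()[0],
        "host_os": platform.platform(),
    }

def write_manifest(manifest, path):
    """Store the manifest as stable, diff-friendly JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(manifest, fh, indent=2, sort_keys=True)
```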

Metrics that change decisions

A long list of counters can hide the truth. The short list that actually drives decisions is: throughput at a fixed quality target; p50, p95, and p99 latency per request, plus cold-start and spike behaviour; time-to-target-accuracy for training runs; joules per token or image at target settings; cost per item, normalised by actual cloud or on-prem rates and including interconnect charges; and stability, meaning variance over long runs, error rates, and retry counts.
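The energy and cost figures are simple arithmetic once power, time, and price are recorded. The sketch below shows one way to compute them; the function names are ours, and the hourly rate is whatever your cloud bill or on-prem cost model says, interconnect charges included.

```python
def joules_per_item(avg_power_watts, wall_seconds, items):
    """Energy per token or image: watts times seconds is joules."""
    return avg_power_watts * wall_seconds / items

def cost_per_million(hourly_rate, items_per_second):
    """Cost of one million items at a sustained rate, in the currency of `hourly_rate`."""
    items_per_hour = items_per_second * 3600
    return hourly_rate / items_per_hour * 1_000_000
```

Use the sustained rate measured at the fixed quality target, not a burst figure, so that the cost number matches what production will see.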

What to measure for benchmarks, fair comparisons, and delivered performance

The four questions below tend to dominate procurement conversations. Each row pairs the question with the measurement that actually answers it.

| Question | What to measure | Why the obvious answer is wrong |
| --- | --- | --- |
| Which benchmark figures matter for AI? | MLPerf Training/Inference plus your own harness on your top three workloads at p50 and p99 latency. | Spec-sheet TFLOPS describes peak silicon behaviour, not sustained delivered performance. |
| How do you compare accelerators fairly? | Same model, same precision, same framework version, same batch size, same sequence length. | Mixing FP8 against FP16 or TensorRT against default PyTorch makes most cross-vendor comparisons meaningless. |
| Peak vs delivered performance? | Sustained tokens/sec or images/sec at production batch and precision. | Well-tuned kernels typically deliver 30–70% of peak; naive code lands at 5–20%. |
| Can you trust vendor charts? | Re-run the headline claim on your own harness with disclosed methodology. | “Up to 4× faster” usually depends on a precision change or a workload that flatters one chip’s strongest path. |

The classes of evidence behind these rows differ, and a fair report says so. MLPerf is a benchmark-class result — named, audited, reproducible. Your in-house harness numbers are also benchmark-class when the project and configuration are named. Cross-engagement statements like “naive code typically delivers 5–20% of peak” are observed-pattern — not a benchmarked rate, but a pattern we see across our GPU performance audits.
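The fair-comparison rule (same model, precision, framework version, batch size, and sequence length) can be enforced mechanically before two results are placed side by side. The sketch below assumes each run is described by a plain dictionary; the field names are hypothetical.

```python
COMPARABLE_FIELDS = (
    "model", "precision", "framework_version", "batch_size", "sequence_length",
)

def config_mismatches(run_a, run_b, fields=COMPARABLE_FIELDS):
    """Return the fields on which two runs differ, with both values."""
    return {
        field: (run_a.get(field), run_b.get(field))
        for field in fields
        if run_a.get(field) != run_b.get(field)
    }
```

An empty result means the two runs are comparable on these axes. A non-empty one, for example `{"precision": ("fp8", "fp16")}`, is a reason to rerun rather than to publish.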

The data matters as much as the device

It is easy to overfit a benchmark to a toy dataset. Match sequence lengths to your real traffic, not a convenient fixed length. Preserve class imbalance and input-shape variety if your production traffic has it. Include preprocessing — tokenisation, resizing, normalisation — in the timed path if it runs at inference time in production. Test your top three workloads rather than one proxy.
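One simple way to match sequence lengths to real traffic is to resample from lengths observed in production logs instead of using a fixed value. The sketch below assumes you already have such a list; `sample_lengths` is a hypothetical helper, and the seed is fixed so the benchmark input stays reproducible.

```python
import random

def sample_lengths(observed_lengths, n, seed=0):
    """Draw n sequence lengths with replacement from observed traffic."""
    rng = random.Random(seed)   # private generator: no effect on global random state
    return [rng.choice(observed_lengths) for _ in range(n)]
```

Because the draw is with replacement from the raw observations, common lengths stay common and rare long inputs still appear at roughly their real frequency.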

LLM-specific benchmarks

LLM testing needs care because it mixes compute and memory pressure in a way that smaller models do not.

The prompt mix matters more than people expect. Use a realistic mix of short prompts, long prompts, and multi-turn contexts. Static, dynamic, and continuous batching shift the throughput-versus-latency balance in different ways; report each separately rather than picking the one that wins on your slide. KV-cache policy has a major effect on capacity and token rate — decide how you count cache memory and evictions, and report it. Speculative or assisted decoding can change both speed and quality, so measure the gain and the accuracy impact together.
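To make the KV-cache accounting explicit, a back-of-envelope estimate helps. The sketch below assumes a standard decoder-only transformer that stores one key and one value vector per layer, per KV head, per token, and it ignores paging overhead and any quantisation metadata. The parameter names and example shape are illustrative, not taken from a specific model.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Estimate KV-cache size; the leading 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Hypothetical shape: 32 layers, 8 KV heads of width 128, 8192 tokens, batch 1, 16-bit values.
size = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=8192, batch=1)
print(size / 2**30, "GiB")
```

Whatever convention you choose, state it alongside the result so that capacity figures from different runs count the same memory.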

For the engineering thread that connects these decisions to production deployment, see our hub article on how to optimise AI inference latency on GPU infrastructure.

Interconnects, topology, and the memory wall

Single-GPU performance is only the start. Many AI workloads rely on many devices acting as one. Within a node, NVLink-class or PCIe links govern how fast you can share tensors. Across nodes, your fabric — InfiniBand or Ethernet with the right settings — decides whether tensor, pipeline, or data parallelism scales. Topology awareness matters: ring versus tree collectives, rank placement, and NUMA-affinity settings can shift results by large margins. Your benchmark suite must include at least one multi-GPU and one multi-node run that mirrors the path you plan to operate.
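Two small calculations help interpret multi-GPU results. The sketch below uses the common bandwidth-only model of a ring all-reduce, in which each rank moves 2(N−1)/N of the message, and a plain scaling-efficiency ratio. It ignores latency, protocol overhead, and overlap with compute, so treat the first function as a lower bound rather than a prediction. The function names are ours.

```python
def ring_allreduce_seconds(message_bytes, ranks, link_bytes_per_s):
    """Bandwidth-only lower bound for a ring all-reduce (latency ignored)."""
    if ranks < 2:
        return 0.0
    # Each rank sends and receives 2 * (ranks - 1) / ranks of the message.
    return 2 * (ranks - 1) / ranks * message_bytes / link_bytes_per_s

def scaling_efficiency(throughput_n, throughput_1, devices):
    """Measured throughput on `devices` relative to perfect linear scaling."""
    return throughput_n / (throughput_1 * devices)
```

If the measured collective time is far above the lower bound, topology, rank placement, or fabric settings are worth checking before blaming the device.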

Capacity and bandwidth often decide outcomes more than FLOPs. Capacity limits batch size, context length, and model size without offloading. Bandwidth feeds the cores; if it cannot keep up, FLOPs go unused. A fair report names the limiter — bandwidth, capacity, kernel launch overhead, or collective time — rather than just saying “GPU X is slower than GPU Y.”
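A simple roofline estimate is one way to put a name on the limiter. The sketch below is a minimal version under the usual roofline assumptions: attainable throughput is the smaller of peak compute and memory bandwidth multiplied by arithmetic intensity. It only separates bandwidth-bound from compute-bound kernels; capacity, launch overhead, and collective time need their own measurements. All inputs are values you supply.

```python
def roofline(peak_flops_per_s, mem_bytes_per_s, flops, bytes_moved):
    """Return (attainable FLOP/s, limiter) for a kernel under the roofline model."""
    intensity = flops / bytes_moved               # FLOPs per byte of memory traffic
    bandwidth_ceiling = mem_bytes_per_s * intensity
    if bandwidth_ceiling < peak_flops_per_s:
        return bandwidth_ceiling, "bandwidth"
    return peak_flops_per_s, "compute"
```

Comparing the measured rate with the attainable rate shows how much headroom is left on the path that actually limits the kernel.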

Blackwell and the next generation

As new GPUs arrive, your benchmark suite should flex, not break. For the Blackwell architecture and similar generational shifts, plan for: new tensor data types and lower-precision paths that change both speed and accuracy trade-offs; larger memory pools or faster memory that shift the sweet spot for batch size and context length; interconnect upgrades that change multi-GPU scaling behaviour; and scheduler and compiler changes that affect kernel fusion and launch overheads. Carry over methods, not numbers. Even small runtime or driver changes can invalidate old conclusions.

Reporting results the business can use

Present numbers in the language of outcomes. For training: “Model A reaches target accuracy in 9.5 hours on System 1 versus 12.3 hours on System 2 (−23% wall time). Energy per epoch −18%. Cost per epoch −21%.” For inference: “At p95 ≤ 60 ms, System 1 serves 1.7× more tokens/sec. Cost per million tokens −28%.” For capacity: “Max context 64k with acceptable latency on System 1; System 2 requires KV offload above 48k.”
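The percentages in statements like these should be derived the same way every time. As a small sketch, with function names of our own choosing, the relative change is measured against the baseline system, and the training example above works out as follows.

```python
def pct_change(baseline, candidate):
    """Signed percentage change of candidate relative to baseline."""
    return (candidate - baseline) / baseline * 100

def speedup(baseline, candidate):
    """How many times higher the candidate's rate is than the baseline's."""
    return candidate / baseline

# 12.3 hours on System 2 (baseline) versus 9.5 hours on System 1.
print(f"{pct_change(12.3, 9.5):.0f}% wall time")   # prints "-23% wall time"
```

Stating which system is the baseline matters: the same pair of numbers gives a different percentage if the roles are swapped.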

Attach the “how”: environment, container, versions, flags, and test scripts. That is what lets others repeat the work and trust the conclusion.

FAQ

Which GPU benchmarks matter for AI workloads in 2026?

For training: MLPerf Training (industry standard, well-audited), throughput in tokens/sec on a representative model, and memory bandwidth on long-context attention kernels. For inference: MLPerf Inference, tokens/sec at p50 and p99 latency, time-to-first-token (TTFT) and inter-token latency (ITL) for LLMs, and effective batched throughput on your real prompt distribution. Vendor spec-sheet TFLOPS numbers are necessary but not sufficient — they describe peak, not sustained delivered performance.
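TTFT and ITL are easy to compute once each generated token is timestamped. The sketch below assumes you record the time the request was sent and the arrival time of every token on the same clock; the function name is hypothetical.

```python
def ttft_and_itl(request_sent, token_times):
    """Return (time-to-first-token, mean inter-token latency) in the clock's units."""
    ttft = token_times[0] - request_sent
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0   # single-token replies have no gaps
    return ttft, mean_itl
```

Collect these per request and report percentiles across requests, since a healthy mean can hide a slow tail.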

How do you compare NVIDIA, AMD, and other AI accelerators fairly?

Compare on the same model, same precision, same framework version, same batch size, and same input sequence length, ideally using MLPerf submissions or your own reproducible harness on identical workloads. Mixing precisions (FP16 on one device, FP8 on the other), comparing different software stacks (TensorRT versus default PyTorch), or quoting different batch sizes makes most cross-vendor comparisons meaningless. Published 2026 comparisons that meet this standard cover NVIDIA H100/H200/B200, AMD MI300X/MI325X, Google TPU v5p/v6e, AWS Trainium 2, and Intel Gaudi 3.

What is the difference between peak TFLOPS and delivered performance?

Peak TFLOPS is the spec-sheet number — the maximum the silicon can do under ideal conditions (perfect occupancy, no memory stalls, no kernel-launch overhead). Delivered performance is what your workload actually achieves, typically 30–70% of peak for well-optimised kernels and 5–20% of peak for naively-written code. The delta is mostly memory bandwidth, kernel-launch overhead, and synchronisation cost. Most production gains come from closing that delta, not from buying faster silicon.
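To see where a kernel sits in that range, divide the achieved rate by the spec-sheet peak. The sketch below does this for a dense matrix multiply, using the standard count of 2·m·n·k floating-point operations. The peak value is whatever the vendor quotes for the precision you ran; the function names are ours.

```python
def gemm_flops(m, n, k):
    """Floating-point operations in an (m x k) @ (k x n) matrix multiply."""
    return 2 * m * n * k   # one multiply and one add per inner-product term

def delivered_fraction(flops, seconds, peak_flops_per_s):
    """Achieved FLOP/s as a fraction of the quoted peak."""
    return (flops / seconds) / peak_flops_per_s
```

Use a timing taken after warm-up and with device synchronisation, otherwise the fraction will be overstated.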

Should you trust vendor benchmark numbers?

Trust them only after you have read the methodology: precision, batch size, framework, software version, model, sequence length, and whether the result is a median (p50) or a best-case peak. Vendor charts tend to pick the conditions that flatter their silicon: “up to 4× faster than competitor X” usually involves a precision change or a workload that hits the new chip’s strongest path. Independent MLPerf submissions and reproducible third-party benchmarks are the only sound basis for procurement decisions in the high tens or hundreds of millions of dollars.

For a deeper architectural walkthrough on this engineering thread, see our hub article on how to optimise AI inference latency on GPU infrastructure. For broader programme context across our engagements, explore our GPU performance engineering practice.

Image credits: Freepik.
