Benchmark testing measures controlled conditions, not production behaviour

A benchmark test runs a workload under controlled conditions and records a number. The number is real. The conditions are controlled. The problem is that controlled conditions rarely match production workloads, and in our experience across AI deployments, the gap between benchmark score and real performance typically falls in a 20–50% band (observed pattern, not a benchmarked rate).

That gap is not a calibration error. It reflects the structure of benchmark design. Benchmarks optimise for reproducibility: identical conditions, minimal environmental variation, stable execution paths. Production workloads run under load variation, alongside other processes, with data that does not match the benchmark distribution, in a software stack that has been modified since the benchmark was run. The more aggressively a benchmark controls for reproducibility, the less it predicts production behaviour.

This is the same underlying problem we cover from a different angle in why GPU spec-sheet benchmarking fails for AI: a static measurement, no matter how clean, cannot describe an execution property.

What does benchmark testing actually measure?

Benchmark testing measures performance under controlled conditions, but those conditions define what the benchmark is measuring, not what you are deploying. Three measurement choices shape what every benchmark captures, and each of them is a place where the result quietly stops applying to your workload.

Workload selection. A benchmark tests a specific workload. MLPerf runs a defined set of reference models at fixed input sizes. Geekbench runs synthetic workloads that stress specific CPU subsystems. The benchmark tells you performance on its workload. If your workload differs in structure, size, batch shape, or data characteristics, the result transfers partially at best, and “partially” is doing a lot of work in that sentence.

Duration.
Most benchmarks measure short runs: seconds to minutes. AI inference services run continuously for hours under variable load. GPU performance over extended runs is shaped by thermal state, memory fragmentation, allocator behaviour, and driver state. A benchmark that captures the first 60 seconds of execution and a production deployment running for 8 hours are measuring different things on the same hardware. We see this routinely when teams extrapolate from a short benchmark and discover, three weeks into capacity planning, that steady-state throughput sits well below the reported number.

Software state. Benchmarks run with a fixed software configuration. Production environments change: framework updates, driver patches, model updates, competing workloads. PyTorch, TensorRT, CUDA, cuDNN, NCCL: each of these has versions, and each version interacts with the others. The benchmark’s software state was true at the moment of measurement. It is unlikely to be true when you run the workload.

Why widely-used benchmarks optimise for reproducibility over representativeness

The most widely used benchmarks (Geekbench, 3DMark, MLPerf) optimise for reproducibility at the cost of real-world representativeness. This is a deliberate design choice, not a defect. Reproducibility is what makes benchmarks comparable across hardware, across time, and across organisations. If results vary based on uncontrolled environmental factors, the benchmark cannot serve its purpose as a standardised measurement.

The cost is that the controlled conditions required for reproducibility diverge from production:

- Geekbench runs fixed workloads at fixed sizes. It tells you hardware capability, not inference throughput for your model.
- 3DMark measures graphics rendering pipelines. AI compute performance on the same hardware is a different workload class entirely.
- MLPerf is the most rigorous AI benchmark available, but it measures a fixed submission configuration.
Vendors optimise their MLPerf submissions specifically for MLPerf conditions, which may not match their performance profile on your workload.

MLPerf results remain valuable. They provide the most credible cross-vendor AI performance comparison available (benchmark class, with submission rules public). The point is narrower: they are comparisons under MLPerf conditions, not under your conditions. Reading them as if they were the latter is the most common methodology error we see.

The benchmark testing methodology that predicts real AI workload behaviour

Meaningful benchmark testing requires defining the workload first and selecting the benchmark second. Most organisations do the opposite: they start with whatever benchmark is convenient and reshape their question to fit. The correct sequence is short, and almost mechanical once you accept it.

1. Define what you are measuring. Start with the production workload: the actual model you are deploying, the actual batch sizes your serving infrastructure uses, the precision format you will run at (FP16, BF16, FP8, INT8), and the latency or throughput target you are trying to hit. If you cannot write these down in one paragraph, no benchmark will rescue you.

2. Select or construct a benchmark that matches. A benchmark that does not test your workload class will not predict your performance. If no existing benchmark matches, running your own workload under controlled conditions is more valuable than running a standard benchmark that measures something adjacent.

3. Measure at steady state, not at cold start. GPU performance during the first seconds of execution (warm-up, memory allocation, kernel compilation, autotuner passes) is different from sustained steady-state performance. Benchmark runs must reach thermal and execution steady state before recording results. Peak-burst numbers are an interesting upper bound; sustained behaviour is what your capacity plan depends on.

4. Record the software context.
Framework version, driver version, CUDA/ROCm version, kernel library versions (cuDNN, FlashAttention, NCCL). A result without software context is not reproducible, and a result that is not reproducible is not a benchmark.

5. Test the conditions that stress your actual system. If your production system handles variable batch sizes, test variable batch sizes. If you serve under concurrent load, test under concurrent load. Benchmarking only the best-case scenario is common and produces results that do not survive deployment.

Benchmark testing framework for AI workloads

| Phase | What to do | What to avoid |
| --- | --- | --- |
| Definition | Specify model, batch size, precision, success metric | Starting with available benchmarks and fitting your workload to them |
| Measurement | Run at steady state for 10+ minutes; record min/mean/p99 latency | Recording only peak or average from a short run |
| Context | Document full software stack (framework, driver, runtime versions) | Reporting numbers without software context |
| Interpretation | Compare to your production success metrics | Comparing to vendor-published results from different stack configurations |
| Validation | Test under production-representative load conditions | Testing only under ideal, single-workload conditions |

The structural failure in most benchmark processes

The gap between benchmark and production is not a problem you solve by finding a better benchmark. It is a structural property of how benchmarks work. Our approach is to understand what the benchmark you are using actually measures, and then to add the measurement that covers what it misses.

Standard benchmarks remain useful for shortlisting and for cross-vendor comparisons where conditions are similar. They cannot replace workload-specific measurement for production capacity planning.
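As an illustration of the Measurement and Context phases of the framework, here is a minimal Python sketch of a workload-specific measurement: it discards a warm-up window, times each call during a sustained window, and reports min/mean/p99 latency alongside the software versions it could detect. The names `software_context`, `percentile`, `measure_steady_state`, and `run_once`, the default package list, and the 60-second warm-up are assumptions made for this example; only the 10-minute measurement window comes from the text above.

```python
import math
import platform
import statistics
import time
from importlib import metadata


def software_context(packages=("torch", "numpy")):
    """Collect version strings to store next to the result.

    The package names are placeholders; list whatever your stack uses.
    Packages that are not installed are recorded as None.
    """
    ctx = {"python": platform.python_version(), "platform": platform.platform()}
    for name in packages:
        try:
            ctx[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            ctx[name] = None
    return ctx


def percentile(sorted_values, q):
    """Nearest-rank percentile (q in 0..100) of an already sorted list."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(q * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def measure_steady_state(run_once, warmup_s=60.0, measure_s=600.0):
    """Run run_once() repeatedly; record latencies only after the warm-up."""
    start = time.perf_counter()
    while time.perf_counter() - start < warmup_s:
        run_once()  # executed but not recorded

    samples = []
    start = time.perf_counter()
    while time.perf_counter() - start < measure_s:
        t0 = time.perf_counter()
        run_once()
        samples.append(time.perf_counter() - t0)

    samples.sort()
    return {
        "n": len(samples),
        "min_s": samples[0],
        "mean_s": statistics.fmean(samples),
        "p99_s": percentile(samples, 99),
        "context": software_context(),
    }
```

For GPU work, `run_once` should block until the device has finished (in PyTorch, for example, by calling `torch.cuda.synchronize()` before returning); otherwise the loop times kernel launches instead of execution. A fixed warm-up window is only a proxy for thermal and execution steady state, so it is worth checking that latency has stopped drifting before trusting the measured window.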
Two GPUs with very similar spec sheets (same nominal FLOPs, same memory bandwidth, similar clock targets) can behave very differently on the same model, for reasons that have nothing to do with the spec sheet and everything to do with kernel coverage, scheduler behaviour, and memory hierarchy effects. Performance only exists while a workload is running: it is the joint outcome of the model graph, the runtime, the kernels invoked, the data path, and the operating point. That is the underlying reason a published benchmark number is meaningful only when its full execution context travels with it, and why GPU spec-sheet benchmarking fails for AI on questions that look, at first glance, like they should be one-line answers.

LynxBench AI is built around treating the workload, the precision regime, the AI Executor, and the operating point as inseparable inputs to a published number, because a benchmark methodology that allows any of those four to drift silently produces results that read like measurements but behave like opinions.

The methodology check on any AI benchmark before adopting it: are all four inputs (workload, precision regime, AI Executor, operating point) disclosed and reproducible by a third party, or is at least one implicit in a way the next batch of results cannot defend?

Frequently Asked Questions

How long should a benchmark run before the numbers are trustworthy for capacity planning?

Short runs of seconds to minutes capture warm-up, memory allocation, kernel compilation, and autotuner passes rather than sustained behaviour. We recommend running at steady state for 10+ minutes and recording min/mean/p99 latency, because thermal state, memory fragmentation, and allocator behaviour only stabilise after the system settles. Peak-burst numbers are a useful upper bound, but sustained steady-state throughput is what a capacity plan actually depends on.

Can I trust published MLPerf numbers for my own deployment decision?
MLPerf is the most rigorous AI benchmark available and the most credible cross-vendor comparison, but it measures a fixed submission configuration that vendors optimise specifically for MLPerf conditions. Those conditions are unlikely to match your model, batch shapes, precision regime, or software stack. Use MLPerf for shortlisting and cross-vendor comparison, not as a substitute for measuring your own workload under your own conditions.

What software context has to travel with a benchmark result for it to be reproducible?

A result needs the full software stack recorded: framework version, driver version, CUDA or ROCm version, and kernel library versions such as cuDNN, FlashAttention, and NCCL. Each of these has versions that interact with the others, so the benchmark’s software state was only true at the moment of measurement. A number reported without that context is not reproducible, and a number that is not reproducible is not really a benchmark.

How should I pick a benchmark when none of the standard ones match my workload?

Define the workload first (the model, batch sizes, precision format, and latency or throughput target), then select the benchmark second, never the other way around. If no existing benchmark tests your workload class, running your own workload under controlled conditions predicts your performance better than running a standard benchmark that measures something adjacent. A benchmark that does not test your workload class will not predict your performance, no matter how well-known it is.
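One way to make "define the workload first" concrete is to write the workload down as a small record and check a candidate benchmark against it before reading its numbers. The sketch below is illustrative only: `WorkloadSpec`, `coverage_gaps`, the field names, and the model names in the usage lines are hypothetical, not part of any benchmark suite.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WorkloadSpec:
    """The one-paragraph workload definition, as a record (hypothetical)."""
    model: str
    batch_sizes: frozenset
    precision: str  # e.g. "FP16", "BF16", "FP8", "INT8"
    p99_latency_target_ms: Optional[float] = None  # production-only field


def coverage_gaps(production, benchmark):
    """List the ways a benchmark configuration fails to cover production."""
    gaps = []
    if benchmark.model != production.model:
        gaps.append(
            f"model: benchmark ran {benchmark.model!r}, "
            f"production runs {production.model!r}"
        )
    missing = sorted(production.batch_sizes - benchmark.batch_sizes)
    if missing:
        gaps.append(f"batch sizes not covered: {missing}")
    if benchmark.precision != production.precision:
        gaps.append(
            f"precision: benchmark used {benchmark.precision}, "
            f"production uses {production.precision}"
        )
    return gaps


# Usage with made-up values: an empty list means the benchmark tests the
# same workload class; anything else is a reason to measure it yourself.
production = WorkloadSpec("model-a", frozenset({1, 8, 32}), "FP8", 50.0)
published = WorkloadSpec("model-a", frozenset({1, 8}), "FP16")
for gap in coverage_gaps(production, published):
    print(gap)
```

An exact-match check like this is deliberately strict. In practice a team might decide that some differences are tolerable for shortlisting, but that decision is easier to defend when the gaps are listed explicitly.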