“Stress test passed” is not a procurement-grade statement

A common shape of pre-procurement AI hardware evaluation is a brief stress test on the candidate device: run a synthetic workload at high utilization for a few minutes, observe that the system doesn’t crash and that temperatures stay within bounds, and conclude that the hardware is suitable.

The conclusion does not follow from the evidence. A system that survives a short synthetic stress test has demonstrated that it does not immediately fail under load. It has demonstrated nothing about sustained performance, thermal behavior at equilibrium, throughput under the candidate workload, or the cost-of-ownership profile the procurement decision should rest on.

A stress test that informs a procurement decision is a different artifact. It runs the candidate workload (not a synthetic substitute), drives the system to its saturation point, holds it there long enough for the steady-state governors to engage, and reports the methodology, the workload, the software stack, and the observed behavior in a form another team could reproduce. The methodology described here is for that artifact.

Why don’t short synthetic stress tests predict deployment behavior?

Synthetic stress utilities (gpu-burn, stress-ng, vendor-supplied stress harnesses) and short-duration high-utilization runs share three structural weaknesses for AI procurement purposes:

- They don’t exercise the candidate workload’s actual access pattern. A synthetic compute kernel pushes the device to a particular bottleneck (typically compute), which may or may not be the bottleneck the production workload encounters: memory bandwidth, kernel-launch overhead, KV-cache management for autoregressive models. A device that survives a synthetic-compute stress test can still be the wrong device for a memory-bound inference workload.

- They don’t reach thermal equilibrium. Modern AI accelerators take minutes to tens of minutes to reach thermal equilibrium with the cooling infrastructure they’re installed in. A stress test that runs for a few minutes captures the device on its way to equilibrium, not at it. The throughput, clocks, and power draw observed during transient warm-up do not predict the values at sustained operation. (See the thermal-throttling article for the underlying mechanism; a minimal way to observe the approach to equilibrium is sketched at the end of this section.)

- They don’t expose software-stack instability. A workload’s runtime stability depends on the AI Executor’s full stack — driver, runtime, framework, kernel libraries — under sustained operation, not just the silicon’s ability to accept load. Memory leaks, scheduling pathologies, and version interactions that surface after hours of operation are invisible in short tests.

A procurement decision built on this evidence rests on the assumption that nothing happens between minute one and month one of operation, an assumption AI infrastructure repeatedly demonstrates to be wrong.
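To make the second weakness concrete, here is a minimal monitoring sketch, assuming an NVIDIA device with nvidia-smi on the PATH (other vendors expose equivalent counters through their own tools). It logs temperature, SM clock, power draw, and utilization at a fixed interval, so the transient warm-up phase and the steady state show up as distinct regions in the data rather than being conflated into one number.

```python
#!/usr/bin/env python3
"""Log GPU temperature, SM clock, power draw, and utilization over time.

Minimal sketch for observing the approach to thermal equilibrium during
a sustained run. Assumes an NVIDIA GPU and nvidia-smi on the PATH; the
query fields used are standard nvidia-smi --query-gpu fields.
"""
import csv
import subprocess
import sys
import time

FIELDS = "timestamp,temperature.gpu,clocks.sm,power.draw,utilization.gpu"

def sample(gpu_index: int = 0) -> list[str]:
    """Return one sample row for the given GPU."""
    out = subprocess.run(
        ["nvidia-smi",
         f"--query-gpu={FIELDS}",
         "--format=csv,noheader,nounits",
         "-i", str(gpu_index)],
        capture_output=True, text=True, check=True,
    )
    return [v.strip() for v in out.stdout.strip().split(",")]

def main(duration_s: int = 3600, interval_s: int = 10) -> None:
    """Sample for duration_s seconds, writing CSV rows to stdout."""
    writer = csv.writer(sys.stdout)
    writer.writerow(FIELDS.split(","))
    end = time.monotonic() + duration_s
    while time.monotonic() < end:
        writer.writerow(sample())
        sys.stdout.flush()
        time.sleep(interval_s)

if __name__ == "__main__":
    main()
```

Plotting temperature and clocks.sm against time from a run like this typically shows the clocks stepping down as the device heats; the point where both curves flatten is the earliest defensible start of a measurement window.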
What a procurement-grade AI stress test actually does

The structure of a stress test that supports a procurement decision differs from the synthetic test along several dimensions:

- Workload-faithful. The test runs the production AI workload — or the most representative proxy the team can produce — at the production batch policy, precision regime, and request profile. If the deployment will run a specific model at FP8 with continuous batching, the stress test runs that model at FP8 with continuous batching. The synthetic test pattern is replaced with the workload pattern.

- Saturation-driven. The test loads the system to its saturation point and slightly past it, so the throughput-vs-latency curve is fully traced and the saturation knee is observed. This characterizes the operating envelope, not just the no-failure regime.

- Sustained. The test holds the saturation load long enough — typically hours, not minutes — for thermal equilibrium, for any one-time framework initialization to clear, and for slow-developing instability (memory growth, scheduling drift) to surface. The first phase of the test (the warm-up) is discarded; the measured behavior is the post-warm-up steady state.

- Multi-axis. The test sweeps configurations rather than fixing them: batch-size, concurrency, precision, and, where applicable, input-shape variations. The result is a behavioral surface, not a single number.

- Instrumented. The test records GPU utilization, memory utilization, temperature, power draw, and per-request latency throughout, so the steady-state measurements are paired with the system state that produced them. A throughput number disconnected from the temperature and power profile that produced it cannot be interpreted operationally.

A Linux-side stress-test methodology checklist

A stress test on Linux that produces procurement-grade evidence should satisfy:

- Workload identified. Production model, model size, precision regime, batch policy, expected input distribution.
- AI Executor specified. Accelerator + driver + runtime + framework + inference runtime + precision regime + batch policy. Versions captured at test start.
- Reproducible OS environment. Distribution and kernel version recorded. Kernel module versions (NVIDIA driver, AMD amdgpu, Intel GPU module) captured. Cgroup, NUMA, and CPU-pinning policy declared.
- Cooling and ambient declared. Server form factor, cooling configuration, expected data-center ambient temperature. The thermal envelope is part of the test conditions.
- Co-tenant load defined. Whether the test runs on a quiescent host or under realistic background load (host CPU, network, storage). The measured number changes if the assumption changes.
- Warm-up window defined and excluded. Long enough for thermal equilibrium (typically 10-30 minutes for sustained workloads). Measurements during warm-up are not used for steady-state characterization.
- Sustained measurement window. Hours of post-warm-up operation at saturation. Many failure modes do not surface in less.
- Saturation sweep. Batch size and concurrency varied across the operating envelope. A curve produced, not a point.
- Per-request latency distribution captured. p50, p95, p99 (and, where the SLO requires, p99.9) reported alongside throughput.
- System state correlated. Temperature, clock frequency, power draw, GPU memory utilization, host CPU and memory recorded throughout. Each performance number paired with the system state.
- Failure modes characterized. What happens past the saturation point: degraded latency, dropped requests, crash. Not just “does it work” but “how does it fail.”
- Multiple trials. The test re-run, ideally on multiple physical units of the candidate hardware, to distinguish unit-specific behavior from population behavior.
- Result reproducibility package. Test scripts, workload definition, software-stack inventory, observed-result tables. Another team can re-run and compare.

A test that satisfies this list produces evidence a procurement decision can defensibly rest on. A test that satisfies only a subset produces evidence whose generalization is bounded by what’s missing. (The sweep-and-measure core of such a test is sketched below.)
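The following sketch ties together three checklist items: the warm-up exclusion, the saturation sweep, and the per-request latency percentiles. The endpoint URL, request payload, and concurrency levels are placeholders, not a real API; a production harness would drive the actual inference runtime and record system state alongside (for example, with the monitor sketched earlier).

```python
#!/usr/bin/env python3
"""Sweep concurrency against an inference endpoint, excluding warm-up.

Sketch only: INFER_URL and the request payload are hypothetical
placeholders for the candidate workload's real endpoint and inputs.
"""
import json
import statistics
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

INFER_URL = "http://localhost:8000/v1/completions"  # placeholder endpoint
WARMUP_S = 20 * 60        # warm-up window to discard (thermal equilibrium)
MEASURE_S = 2 * 60 * 60   # sustained post-warm-up measurement window

def one_request() -> float:
    """Send one request; return its latency in seconds."""
    body = json.dumps({"prompt": "...", "max_tokens": 128}).encode()
    req = urllib.request.Request(
        INFER_URL, data=body, headers={"Content-Type": "application/json"})
    t0 = time.monotonic()
    with urllib.request.urlopen(req) as resp:
        resp.read()
    return time.monotonic() - t0

def run_level(concurrency: int) -> dict:
    """Hold one concurrency level; keep only post-warm-up latencies."""
    latencies: list[float] = []
    start = time.monotonic()
    end = start + WARMUP_S + MEASURE_S

    def worker() -> None:
        # A real harness would also count errors and timeouts here,
        # to characterize how the system fails past saturation.
        while time.monotonic() < end:
            lat = one_request()
            if time.monotonic() - start > WARMUP_S:  # discard warm-up
                latencies.append(lat)

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        for _ in range(concurrency):
            pool.submit(worker)

    qs = statistics.quantiles(latencies, n=100)  # 99 percentile cut points
    return {
        "concurrency": concurrency,
        "throughput_rps": len(latencies) / MEASURE_S,
        "p50_s": qs[49], "p95_s": qs[94], "p99_s": qs[98],
    }

if __name__ == "__main__":
    # Sweep past the expected saturation point so the knee is visible.
    for level in (1, 2, 4, 8, 16, 32, 64):
        print(run_level(level))
```

Each output row pairs with the system-state log recorded during the same window; the concurrency level where throughput flattens while p99 climbs is the saturation knee the procurement comparison is anchored to.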
Why these tests matter for the procurement frame

The output of a procurement-grade stress test is not a pass/fail flag. It is a characterization of how the candidate hardware behaves under conditions matching the deployment. The procurement frame uses that characterization to answer:

- Does this hardware sustain the throughput the deployment needs at the latency budget the SLO requires?
- Does it sustain it at the cost-of-energy and cost-of-cooling profile the budget assumes?
- Does it fail gracefully past saturation, or catastrophically?
- Does its behavior match the vendor’s claims, where vendor claims were used in shortlisting?
- What is the unit-to-unit variance, so that fleet sizing accounts for the distribution rather than expecting the median? (A sizing sketch closes this article.)

These are procurement questions, not silicon-capability questions. The evidence that answers them is workload-conditional and stack-disclosed, which is exactly what a procurement-grade stress test produces and what a synthetic short test does not. “How organizations should choose AI hardware” makes the broader case; the operational expression here is that the choice rests on workload-conditional evidence about sustained behavior on the production stack, and the pre-procurement stress test is what generates that evidence.

The framing that helps

A procurement-grade AI hardware stress test on Linux is workload-faithful, saturation-driven, sustained for hours rather than minutes, multi-axis across the operating envelope, fully instrumented for system state, and reported in a form another team can reproduce. A short synthetic stress test is none of these and cannot serve as procurement evidence, regardless of what utilization number it produced.

LynxBench AI is the methodology the procurement-grade stress test instantiates: the AI Executor is fully specified, the workload is candidate-workload-faithful, measurements are taken after thermal equilibrium under sustained load, the result is a curve across the operating envelope rather than a peak point, and the disclosure surface lets another team reproduce the test on their candidate hardware to make the same comparison.
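To illustrate the unit-to-unit variance question, a minimal fleet-sizing sketch. The throughput numbers are invented placeholders standing in for the steady-state measurements from multiple physical units; the point is that sizing against a conservative quantile of the measured distribution, rather than the median, is what turns variance into a concrete unit count.

```python
#!/usr/bin/env python3
"""Size a fleet from per-unit steady-state throughput measurements.

Sketch only: the throughput values below are invented placeholders.
In practice they come from the sustained-measurement windows of the
same stress test run on multiple physical units.
"""
import math
import statistics

# Hypothetical steady-state throughput (requests/s) per tested unit.
unit_rps = [412.0, 398.5, 405.2, 377.9, 401.1, 389.4]

required_rps = 4200.0  # placeholder deployment requirement

median = statistics.median(unit_rps)
# Conservative per-unit estimate: a low quantile of the measured
# distribution, so the fleet is sized for the weaker units too.
p10 = statistics.quantiles(unit_rps, n=10)[0]

print(f"median unit throughput: {median:.1f} rps")
print(f"p10 unit throughput:    {p10:.1f} rps")
print(f"fleet size @ median:    {math.ceil(required_rps / median)} units")
print(f"fleet size @ p10:       {math.ceil(required_rps / p10)} units")
```

The gap between the two fleet sizes is the cost of unit-to-unit variance; a procurement decision that sizes at the median silently assumes every delivered unit performs like the middle of the tested sample.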