## A reframe: benchmarks are not leaderboards

The dominant framing of AI hardware benchmarks in public discussion treats them as leaderboards — vendor X scored Y on benchmark Z, the chart ranks the contestants, the audience reads the rankings. The framing is consistent with how vendors deploy their benchmark spend: produce favorable numbers under favorable conditions, publish them in marketing materials, contest competitors’ numbers in similar materials. This is a real activity. It is not what benchmarks are for in procurement, and treating leaderboard numbers as procurement evidence is the source of a substantial fraction of AI hardware misprocurement.

The reframe that makes benchmarks useful in the procurement context is to treat them as decision infrastructure: the durable, reproducible measurement contract that makes a procurement decision auditable, defends the decision against later review, catches regression after deployment changes, and survives the staff turnover that would otherwise erase the decision rationale. This is a different category of artifact than a leaderboard score, and it is the category that actually supports the decision-making the procurement function exists to perform.

## Is the benchmark a guess or a contract?

A procurement decision without a benchmark contract is structurally a guess. Vendor-supplied performance numbers describe a vendor-chosen workload measured under vendor-chosen conditions on a vendor-chosen configuration, often optimized by a vendor-side engineering team specifically for the benchmark scenario. Copying those numbers into a procurement decision imports the vendor’s assumptions about which workload matters, which conditions apply, which configuration should be used, and which optimization effort is realistic — none of which the buyer’s deployment necessarily matches. The resulting buying decision rests on a single assumption: that the vendor’s scenario predicts the buyer’s deployment. When the assumption holds, the decision works out; when it doesn’t, the deployment underperforms the procurement projection in ways that are hard to attribute back to the source of the error, because the source was an unstated assumption rather than an explicit calculation.

The contract framing changes this. A benchmark that the buyer’s organization controls — methodology selected for the deployment, workload matching the production use case, configuration matching the deployment stack, optimization effort bounded and disclosed — produces evidence about the buyer’s question rather than the vendor’s. The procurement decision then rests on a measurement contract the buyer can defend: the protocol was deliberate, the conditions were the deployment conditions, the result holds under stated assumptions, and the assumptions are the buyer’s own.

A guess and a contract can both produce buying decisions. The contract supports the decision afterwards in ways the guess cannot.
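To make the contract concrete, the sketch below shows one way a buyer-owned measurement contract could be captured as a versioned artifact rather than a slide. It is a minimal illustration assuming a Python-based harness; every field name, value, and the acceptance check are hypothetical, not a reference to any particular benchmark tool.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BenchmarkContract:
    """Buyer-owned measurement contract: every field is a stated assumption,
    chosen for the deployment rather than inherited from a vendor default."""
    workload: str                    # production use case being exercised
    precision: str                   # precision regime the deployment will use
    batch_policy: str                # batching behavior the deployment will face
    max_concurrency: int             # concurrency profile at the serving layer
    serving_stack: str               # deployment configuration, not vendor-optimal
    optimization_budget_hours: int   # tuning effort: bounded and disclosed
    measurement_window_s: int        # sustained window, not a peak burst
    required_metrics: tuple = (
        "throughput_qps",            # performance
        "latency_p99_ms",            # latency under the stated concurrency
        "accuracy",                  # quality at the stated precision
        "power_w",                   # cost basis reported alongside throughput
    )

    def accepts(self, result: dict) -> bool:
        """A run counts as procurement evidence only if every contracted
        metric is present; a bare throughput number is rejected."""
        return all(metric in result for metric in self.required_metrics)

# Illustrative contract for a hypothetical chat-serving deployment.
contract = BenchmarkContract(
    workload="8B-parameter chat serving, 1k-token prompts",
    precision="fp8",
    batch_policy="dynamic batching, 50 ms max queue delay",
    max_concurrency=64,
    serving_stack="in-house serving image, pinned driver and runtime",
    optimization_budget_hours=16,
    measurement_window_s=3600,
)

print(contract.accepts({"throughput_qps": 1250.0}))   # False: brochure-style result
print(contract.accepts({"throughput_qps": 1250.0, "latency_p99_ms": 180.0,
                        "accuracy": 0.712, "power_w": 640.0}))  # True
```

The check at the end is the operative part of the sketch: a throughput figure that arrives without the contracted accuracy and cost fields is not admissible evidence under the buyer’s own contract.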
## The three properties that make a benchmark infrastructure

A benchmark functions as decision infrastructure when three properties hold simultaneously:

- **The workload is buyer-relevant.** The benchmark exercises the workload the deployment will run, at the precision regime the deployment will use, with the batch policy and concurrency profile the deployment will face. A workload that doesn’t match — even one that’s plausibly similar — produces evidence about a different question, and that gap between the evidence and the question is where the misprocurement risk enters.
- **The methodology is reproducible.** A different team with access to the matched configuration can re-run the benchmark and produce comparable results. Reproducibility distinguishes a measurement from an artifact, and it is what allows the benchmark to serve as a contract that any party can verify rather than a result that depends on the original measuring party’s word.
- **The cost basis is reported alongside throughput.** Procurement decisions are inherently economic; a benchmark that reports performance without the corresponding cost (energy, hardware, software, operational) reports only half of the trade-off the procurement is making. The cost-relevant accompanying metrics — power draw under the workload, accuracy at the precision regime, sustained behavior over the measurement window — convert a performance number into a procurement-relevant input.

A benchmark with all three properties is decision infrastructure. A benchmark with fewer — particularly one with a workload mismatch, an undisclosed methodology, or no reported cost basis — is leaderboard content that the procurement may use, but cannot rely on as the decision basis.

## What “outliving a single purchase” means

The infrastructure framing has a temporal property the leaderboard framing does not: a benchmark methodology treated as infrastructure outlives the procurement moment it was created for. The same methodology can:

- **Catch regression after driver updates.** A driver upgrade pushed across the production fleet should produce throughput, latency, and accuracy that match the pre-upgrade baseline within tolerance. Re-running the methodology on the new driver detects the deviation (a minimal sketch of such a check follows this list). Without a stable benchmark contract, regression detection is reactive rather than systematic.
- **Validate new hardware against known workloads.** When a refresh cycle adds new accelerator models to the candidate pool, the same methodology applied to the new candidates produces results comparable to the original procurement evidence. The decision proceeds against a stable measurement basis rather than starting the comparison from scratch.
- **Audit-defend the original decision.** When a procurement decision is questioned years after the fact (board review, audit, change of leadership), the methodology and its application during the original procurement are the artifacts that demonstrate the decision was deliberate. Because the methodology is durable rather than a one-time benchmark run, the audit trail is durable too.
- **Survive staff turnover.** The team that made the original procurement turns over, and a new team inherits the deployment. Without a benchmark methodology that documents the workload assumption and the measurement protocol, the new team cannot reproduce the basis for the original decision and effectively starts the evaluation over each time. With it, the methodology becomes institutional knowledge that persists across team changes.
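The regression-catching use in the first item is the most mechanical of the four, so it is the easiest to show. Below is a minimal sketch of such a check, assuming the procurement-time baseline is stored as a versioned artifact; the metric names, tolerance values, and comparison logic are illustrative placeholders rather than the method of any particular benchmark suite.

```python
# Per-metric tolerances for deviation from the procurement-time baseline.
# Values are illustrative; a real contract would state and justify each one.
TOLERANCES = {
    "throughput_qps": 0.03,   # more than a 3% throughput drop is a regression
    "latency_p99_ms": 0.05,   # more than a 5% p99 latency increase is a regression
    "accuracy":       0.001,  # accuracy should be essentially unchanged
    "power_w":        0.05,   # power creep erodes the cost basis just as surely
}

# Direction of "worse": lower is worse for throughput and accuracy,
# higher is worse for latency and power.
HIGHER_IS_BETTER = {"throughput_qps": True, "accuracy": True,
                    "latency_p99_ms": False, "power_w": False}

def check_regression(baseline: dict, current: dict) -> list[str]:
    """Compare a post-change run against the stored baseline within tolerance."""
    failures = []
    for metric, tolerance in TOLERANCES.items():
        base, cur = baseline[metric], current[metric]
        delta = (cur - base) / base
        worse = delta < -tolerance if HIGHER_IS_BETTER[metric] else delta > tolerance
        if worse:
            failures.append(f"{metric}: baseline {base}, current {cur} ({delta:+.1%})")
    return failures

# In practice the baseline is loaded from the versioned procurement artifact;
# both runs are inlined here, with invented numbers, for illustration.
baseline_run = {"throughput_qps": 1250, "latency_p99_ms": 180,
                "accuracy": 0.712, "power_w": 640}
post_driver_upgrade = {"throughput_qps": 1190, "latency_p99_ms": 184,
                       "accuracy": 0.712, "power_w": 655}

for line in check_regression(baseline_run, post_driver_upgrade):
    print(line)   # flags the 4.8% throughput drop; the other metrics stay in tolerance
```

The same comparison, run against a new accelerator candidate instead of a post-upgrade fleet, is the “validate new hardware against known workloads” case: the methodology stays fixed and only the system under test changes.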
The recurring pattern across these uses is that benchmarks-as-leaderboards are point-in-time content, while benchmarks-as-infrastructure are durable artifacts that produce ongoing value across the deployment lifecycle. The investment required to produce the infrastructure version is larger; its return is realized over the lifetime of the deployment, not at the procurement moment alone.

## The difference between a benchmark and a brochure

A brochure presents favorable numbers in a favorable framing to support a sales conversation. A benchmark, in the infrastructure sense, produces a methodology-specified, configuration-specified, workload-relevant, reproducible measurement that supports a procurement conclusion. The difference is not always visible at the headline level — both can present similar-looking numbers. The difference is in what sits behind the headline:

| Property | Brochure | Decision-infrastructure benchmark |
| --- | --- | --- |
| Number selection | Favorable to the seller | Comprehensive across the operating envelope |
| Methodology disclosure | Vague or absent | Complete and reproducible |
| Configuration | Vendor-optimal | Deployment-realistic |
| Workload | Vendor-chosen showcase | Buyer's actual or representative |
| Optimization effort | Maximum, undisclosed | Bounded and stated |
| Sustained vs. peak | Often peak | Typically sustained |
| Cost basis | Often absent | Required |
| Caveats | Minimized | Documented |
| Reproducibility | Often vendor-only | Open to any matched configuration |
| Lifetime utility | Marketing window | Across the deployment lifecycle |

A procurement decision that mistakes a brochure for an infrastructure benchmark is using a marketing artifact as decision evidence. The decision may be correct anyway; it is not defensibly correct, and the audit trail it leaves is not the kind that survives later interrogation. *Benchmarks as decision infrastructure* makes the broader case; the operational expression here is that benchmarks, when treated as infrastructure, function as the contract that makes a procurement decision auditable, and the failure to make this distinction explicit is the source of the recurring procurement-evidence gap.

## The framing that helps

Benchmarks are not leaderboards and not brochures; in the procurement context, they are the decision infrastructure that makes the buying decision auditable, defends it against later review, catches deployment-time regression, and outlives staff turnover. A benchmark functions as infrastructure when the workload is buyer-relevant, the methodology is reproducible, and the cost basis is reported alongside throughput. A benchmark missing any of these is leaderboard content that may inform the decision but cannot serve as the contract the procurement record needs.

LynxBench AI is structured as the benchmark methodology that satisfies the three properties — workload buyer-relevant, methodology reproducible, cost basis reported — because the procurement decision the methodology exists to support needs infrastructure-grade evidence, and infrastructure-grade evidence is what a benchmark produces when it is designed for the procurement question rather than the marketing one.