A question with no single right answer

A production model is too slow. The standing meeting fills with diagnoses. The ML team says the platform team should provision better hardware. The platform team says the ML team’s model is inefficient. The procurement team says the hardware specs are what was approved. The infrastructure team says the application’s batching is wrong. Each diagnosis is partly correct and entirely incomplete, and the meeting ends with the assignment “investigate further”, addressed to no team in particular.

The pattern recurs because AI performance is a property of the AI Executor, and the executor spans organizational boundaries that no single team owns. Asking whose problem the slowness is, as if it must belong to one team, is the wrong shape of question. The right shape is: which team owns each layer of the executor, which layers are contributing to the slowdown, and how do those teams collaborate without throwing the diagnosis over the wall to one another.

Why is AI performance attribution structurally hard?

The AI Executor that produces the workload’s actual performance has multiple layers, each owned by a different team in most organizations:

| Executor layer | Typical team owner |
| --- | --- |
| Application code, model architecture | ML / research |
| Model serving framework | ML platform / MLOps |
| Inference runtime, kernel libraries | ML platform / engineering |
| Framework version, dependency versions | Platform / SRE |
| OS, driver, kernel libraries (system) | Infrastructure / SRE |
| Accelerator hardware | Infrastructure / hardware engineering |
| Procurement of the hardware | Procurement / finance |
| Cooling, power, data-center infrastructure | Facilities |
| Workload demand, SLO definition | Product / business |

A performance issue can originate in any of these layers, and an issue in one layer can manifest as a symptom in another. A model whose architecture loads memory inefficiently (ML layer) shows up as low GPU utilization (platform symptom). A driver version that interacts poorly with a framework’s vendored CUDA libraries (infrastructure layer) shows up as a throughput regression after a rebuild (platform symptom). Under-provisioned cooling (facilities layer) shows up as throttled clocks during peak hours (infrastructure symptom). The team that sees the symptom is usually not the team that owns the cause.

The structural consequence is that single-team attribution is unreliable. A diagnosis that ends “it’s the hardware team’s problem” or “it’s the model’s fault” asserts an attribution that the diagnostic process never actually established.

Why hardware upgrades rarely fix software-bound systems

A common procurement response to AI performance complaints is to buy more or better hardware. The pattern has a defensible rationale (more capacity for unmistakably overloaded systems) and a frequent failure mode (buying capacity for a system that is not capacity-limited). A workload bottlenecked by data movement, batching policy, kernel-launch overhead, or precision configuration does not improve when the accelerator is upgraded. The bottleneck moves with the workload, not with the silicon. A faster GPU running the same inefficient batching pipeline produces the same throughput, with the new hardware sitting underutilized for the same reason the previous hardware was. The procurement spend produces no measurable performance improvement, which is a worse outcome than not spending at all. One quick way to see which case a workload is in is sketched below.
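A rough but useful signal is accelerator utilization measured alongside throughput at the production offered load: a software-bound pipeline leaves the GPU largely idle, while a genuinely hardware-bound one saturates it. A minimal sketch, assuming an NVIDIA GPU, the `pynvml` bindings, and a placeholder `run_inference` callable standing in for the real serving path:

```python
import time
import pynvml  # NVIDIA management library bindings (assumed installed)

def measure(run_inference, requests, gpu_index=0, duration_s=60.0):
    """Drive the serving path at full offered load and sample GPU utilization.

    run_inference: callable taking one request batch (placeholder for the real
    serving entry point). requests: an iterable of batches to replay.
    """
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)

    completed = 0
    util_samples = []
    start = time.monotonic()
    for batch in requests:
        run_inference(batch)
        completed += len(batch)
        # Sample instantaneous GPU utilization after each batch.
        util_samples.append(pynvml.nvmlDeviceGetUtilizationRates(handle).gpu)
        if time.monotonic() - start >= duration_s:
            break
    elapsed = time.monotonic() - start
    pynvml.nvmlShutdown()

    # Low utilization at saturation suggests a software-bound pipeline:
    # a faster accelerator would sit just as idle.
    return {
        "throughput_rps": completed / elapsed,
        "mean_gpu_util_pct": sum(util_samples) / max(len(util_samples), 1),
    }
```

The numbers are only indicative (device utilization is a coarse metric), but a pipeline showing 20% utilization at saturation will not be fixed by a faster accelerator.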
The diagnostic that distinguishes a hardware-bound from a software-bound performance issue is exactly what benchmark methodology is for: measure the workload at the production saturation point, characterize where the time is spent, identify the dominant bottleneck, and only then make the hardware-vs-software remediation decision. A procurement decision that skips this step is buying an option whose value depends on assumptions the diagnostic has never tested.

Performance engineering as a discipline

The pattern that escapes the cross-team blame loop is to treat performance engineering as a discipline that no single team owns exclusively but that every relevant team participates in. The discipline has three components (the second and third are sketched in code later in this section):

Measurement. Instrumented benchmarks of the production workload on the production AI Executor, run on a schedule, with results that any team can interrogate. The measurement is the shared substrate; without it, the diagnostic conversation has no common reference.

Attribution. A method for decomposing observed performance into contributions from each executor layer: profiling tools, framework-level breakdowns, kernel-level traces. Attribution makes “who owns the bottleneck” answerable rather than rhetorical.

Cross-stack iteration. A loop in which the team owning the identified bottleneck makes a change, the change is re-measured, and the result is reflected back into the shared measurement. This is the iteration discipline that produces accumulated improvement, as distinct from one-off heroics.

The discipline is cross-team because the executor is cross-team. It is sustained because the workload mix and the software stack continually shift. The benchmark methodology is the contract that lets the discipline operate without re-litigating the measurement basis every time.

Benchmarks as a cross-team measurement contract

When teams agree on what the benchmark measures, how it is run, and what the results mean, the benchmark becomes a cross-team contract. Performance discussions then proceed against shared evidence rather than competing intuitions. A throughput regression after a driver upgrade is no longer a contested narrative; it is a measurement that re-runs and reproduces, which the teams can investigate jointly because they trust the shared instrument.

The contract has to be neutral with respect to which team’s work it favors. A benchmark that the platform team owns and the ML team distrusts cannot be the cross-team contract, because the ML team will (correctly) suspect that the methodology embeds platform-favorable assumptions. The methodology must be agreed in advance, applied uniformly, and re-runnable by anyone with access to the executor, which is the disclosure-and-reproducibility property that distinguishes a benchmark methodology from a benchmark score. Performance is owned across the organizational boundary, and the only way that cross-boundary ownership functions is with shared measurement infrastructure that none of the teams can dispute on principle.
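The attribution component, in practice, usually starts with a framework-level profile of the serving path. A minimal sketch using PyTorch’s built-in profiler, where `model` and `batch` are placeholders for the real serving path; the exported trace gives the kernel-level view that infrastructure and platform teams can inspect independently:

```python
import torch
from torch.profiler import profile, ProfilerActivity

def profile_serving_step(model, batch, trace_path="serving_trace.json"):
    """Profile one serving step and break the time down by operator."""
    model.eval()
    activities = [ProfilerActivity.CPU]
    if torch.cuda.is_available():
        activities.append(ProfilerActivity.CUDA)

    with torch.no_grad(), profile(activities=activities, record_shapes=True) as prof:
        model(batch)

    # Framework-level breakdown: which operators dominate the step time.
    sort_key = "self_cuda_time_total" if torch.cuda.is_available() else "self_cpu_time_total"
    print(prof.key_averages().table(sort_by=sort_key, row_limit=15))

    # Kernel-level trace, viewable in chrome://tracing or Perfetto.
    prof.export_chrome_trace(trace_path)
```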
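The cross-stack iteration loop is mostly bookkeeping: the benchmark re-runs on a schedule, each result is recorded together with a fingerprint of the executor configuration it ran on, and drops beyond an agreed threshold are flagged to whichever layer changed. A minimal sketch of that bookkeeping; the file layout, field names, and 5% threshold are illustrative assumptions, not a prescribed format:

```python
import json
import platform
from pathlib import Path

RESULTS = Path("benchmark_results.jsonl")  # shared, append-only result log (assumed layout)
REGRESSION_THRESHOLD = 0.05                # 5% throughput drop triggers investigation

def executor_fingerprint(extra=None):
    """Record the executor layers the run executed on, so a regression can be
    attributed to whichever layer actually changed between runs."""
    fp = {"python": platform.python_version(), "os": platform.platform()}
    fp.update(extra or {})  # e.g. driver, framework, model, and container versions
    return fp

def record_and_check(throughput_rps, fingerprint):
    """Append today's result and compare it against the previous run."""
    previous = None
    if RESULTS.exists():
        lines = RESULTS.read_text().splitlines()
        if lines:
            previous = json.loads(lines[-1])

    entry = {"throughput_rps": throughput_rps, "executor": fingerprint}
    with RESULTS.open("a") as f:
        f.write(json.dumps(entry) + "\n")

    if previous:
        drop = 1.0 - throughput_rps / previous["throughput_rps"]
        if drop > REGRESSION_THRESHOLD:
            changed = {k: (previous["executor"].get(k), v)
                       for k, v in fingerprint.items()
                       if previous["executor"].get(k) != v}
            print(f"Regression of {drop:.1%}; executor layers that changed: {changed}")
```

The useful property is not the particular threshold but that every team reads the same log and the same fingerprint, so “what changed” has a recorded answer rather than a contested one.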
The framing that helps

AI performance failures cross organizational boundaries because the AI Executor crosses them. Single-team attribution is structurally unreliable. Hardware upgrades do not fix software-bound systems. Performance engineering is a cross-team discipline whose operation depends on shared, neutral, reproducible measurement, which is the role a benchmark methodology occupies when it is treated as a contract rather than as a score.

LynxBench AI is designed to be that cross-team measurement contract: the AI Executor is fully specified, the methodology is reproducible, and any team can re-run the same measurement on the same configuration to verify or contest a result. That is the property that lets the cross-team performance-engineering discipline operate against shared evidence instead of competing narratives.