## “CUDA 12” is not a configuration

A workload runs on one machine and fails, or runs noticeably slower, on another. Both machines report “CUDA 12.” Both have a recent NVIDIA GPU. Both have the same PyTorch version installed. The instinct is to look for a one-dimensional version mismatch, and there is none to find. The reason is that CUDA compatibility is not one-dimensional. It is a four-axis matrix (driver version, toolkit version, framework build, and GPU compute capability), and a workload that runs on one combination is not guaranteed to run, or to run at the same speed, on a different combination even when the headline CUDA version is identical.

This matters specifically for benchmark reproducibility: a benchmark report that names only one or two of the four axes is under-specifying the AI Executor in ways that prevent the result from transferring to a different host.

## What are the four axes of CUDA compatibility?

The four axes are independent. Each carries its own ABI and feature-support constraints, and the matrix of valid combinations is larger and more conditional than any single version number suggests.

**Driver version.** The kernel-mode driver is what the GPU actually talks to. Each driver version supports a range of CUDA toolkit versions (newer drivers generally run applications built against older toolkits, but the reverse is not guaranteed), and each exposes its own set of hardware features and bug fixes. A workload that depends on a feature exposed in driver version N may fail, or fall back to a different code path, on driver N-1, even when the application’s CUDA toolkit is the same.

**Toolkit version.** The CUDA toolkit provides headers, the compiler (nvcc), libraries (cuBLAS, cuDNN, cuFFT), and runtime support. Applications compiled against toolkit version M expect a runtime API of that version, and many libraries shipped with the toolkit have version-dependent behavior. Two systems with toolkit M.0 and M.1 can produce different cuBLAS kernel selections, different cuDNN heuristic decisions, and different observed throughput on identical hardware.

**Framework build.** PyTorch, TensorFlow, JAX, and similar frameworks ship pre-built CUDA components inside their wheels. The framework is built against a specific toolkit version and against specific cuDNN/cuBLAS versions, often statically linked or vendored into the wheel. Framework version X built against toolkit M is a different runtime artifact from framework version X built against toolkit M+1, even though the framework version string is the same.

**GPU compute capability.** The compute capability identifies the GPU’s architectural generation and determines which instructions and tensor-core operations are physically available. A toolkit version supports a range of compute capabilities, but the specific kernels selected for a given workload depend on the compute capability of the target GPU, and code paths that target newer compute capabilities are simply absent on older ones.

Each axis can shift the executor’s behavior independently. A change on any one of the four can change which kernels execute, which precision regimes are available, and what performance the workload achieves.
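All four values can be read programmatically from a running host. The sketch below, assuming a CUDA-enabled PyTorch install and `nvidia-smi` on the PATH, captures one value per axis; `cuda_axes` is an illustrative helper name for this article, not a standard API.

```python
# Capture the four CUDA compatibility axes on the current host.
# Assumes a CUDA-enabled PyTorch build and nvidia-smi on the PATH.
import subprocess

import torch

def cuda_axes() -> dict:
    """Return the (driver, toolkit, framework build, compute capability) tuple."""
    # Axis 1: kernel-mode driver version, as reported by nvidia-smi.
    driver = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        text=True,
    ).strip().splitlines()[0]

    # Axis 2: the CUDA toolkit version this PyTorch wheel was built against.
    # This is the framework-vendored toolkit, not the system install.
    toolkit = torch.version.cuda

    # Axis 3: the framework build. The version string (e.g. "2.3.1+cu121")
    # encodes the build's CUDA target; the cuDNN version pins the vendored library.
    framework = f"torch {torch.__version__}, cuDNN {torch.backends.cudnn.version()}"

    # Axis 4: compute capability of GPU 0, e.g. (9, 0) on an H100.
    major, minor = torch.cuda.get_device_capability(0)

    return {
        "driver": driver,
        "toolkit": toolkit,
        "framework_build": framework,
        "compute_capability": f"{major}.{minor}",
    }

if __name__ == "__main__":
    for axis, value in cuda_axes().items():
        print(f"{axis}: {value}")
```

Note that `torch.version.cuda` reports the toolkit the wheel was built against, which can differ from any system-wide toolkit install on the same machine; that distinction is exactly what the framework-build axis exists to capture.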
## How the matrix actually shifts behavior

The four axes interact, which is why a single mismatch can produce a non-obvious failure or a non-obvious slowdown.

| Combination shift | What can change |
|---|---|
| Driver newer, toolkit same | Same compiled code; potentially different runtime kernel selection; bug fixes or regressions in driver-side behavior |
| Toolkit newer, driver same | Compatibility limited to the forward-compatible toolkit features the older driver supports; library versions inside the toolkit shift; kernel heuristics change |
| Framework rebuilt against newer toolkit | Different vendored cuDNN/cuBLAS versions; different default kernel choices; sometimes ABI-incompatible with externally installed CUDA libraries |
| Different compute capability, same software | Different kernel paths selected; tensor-core paths may or may not exist; precision-format support differs |
| All four nominally same, different vendor builds of framework | Framework wheels from different sources can vendor different toolkit components, producing different observed behavior |

The practical implication is that “CUDA compatibility” as a single concept does not exist. The actual concept is “the specific (driver, toolkit, framework, compute capability) tuple under which this workload was validated, and the tuple under which it is being run.”

## The CUDA ecosystem as switching cost

The compatibility matrix is also the dominant source of switching cost when a team evaluates a non-NVIDIA accelerator. The matrix problem does not vanish on a different platform; it reproduces with new axes. A team moving from CUDA to ROCm trades the (NVIDIA driver × CUDA toolkit × framework build × compute capability) matrix for an (AMD driver × ROCm version × framework build × architecture target) matrix, and the validation work the team has done against the first matrix does not transfer. The same applies to oneAPI on Intel hardware, to MAX on Modular, and to vendor SDKs on SoC AI accelerators. Each ecosystem has its own four-axis-shaped matrix, and re-validating a workload across it reproduces the same compatibility problem from scratch on the new platform. The strategic argument lives in CUDA, frameworks, and ecosystem lock-in; operationally, the cost is not “learning a new API” but re-running the matrix-shaped validation that established performance and correctness on the original ecosystem.

## What this means for benchmark disclosure

A benchmark of a CUDA-using workload that reports only the CUDA toolkit version is reporting one of four required axes. The result is informative about the original measurement environment and indeterminate about any other host’s behavior. A reproducible CUDA benchmark report names all four:

- Driver version
- CUDA toolkit version (and the source of the toolkit: system install or framework-vendored)
- Framework version and the framework’s build origin
- GPU compute capability of the target hardware

A report that omits any of these has a comparability gap that the reader cannot close, and a re-run on a different host can produce a different number for reasons the report does not let the reader diagnose.
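Concretely, a report can attach these four items as a small structured record alongside the measured numbers. The sketch below is illustrative only: the schema, field names, and example values are assumptions, not a standard format.

```python
# An illustrative disclosure record covering all four axes, plus the
# toolkit's source (system install vs. framework-vendored), which the
# list above calls out. Field names and values are examples.
import json
from dataclasses import asdict, dataclass

@dataclass
class CudaDisclosure:
    driver_version: str       # e.g. "550.54.15"
    toolkit_version: str      # e.g. "12.1"
    toolkit_source: str       # "system" or "framework-vendored"
    framework_build: str      # e.g. "torch 2.3.1+cu121, cuDNN 8902"
    compute_capability: str   # e.g. "9.0" (H100)

record = CudaDisclosure(
    driver_version="550.54.15",
    toolkit_version="12.1",
    toolkit_source="framework-vendored",
    framework_build="torch 2.3.1+cu121, cuDNN 8902",
    compute_capability="9.0",
)

# Attach the record to the benchmark result so a re-run on another host
# can be diffed axis by axis.
print(json.dumps(asdict(record), indent=2))
```

Recording the toolkit source explicitly matters because a framework-vendored toolkit can differ from the system install on the same machine.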
## The framing that helps

CUDA compatibility is a four-axis matrix, not a version number. Benchmark reproducibility for CUDA workloads requires all four axes to be named, because each axis can shift the executor’s behavior independently and the interactions between axes can shift it non-obviously. The same compatibility-matrix shape reproduces on every alternative ecosystem, which is what makes ecosystem switching costly even when the per-axis equivalence looks straightforward.

LynxBench AI treats the (driver, toolkit, framework build, compute capability) tuple as part of the AI Executor specification, alongside the GPU model, because that tuple is what determines which kernels execute, which precision regimes are available, and which performance the benchmark actually measures.
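As an illustration of what that specification makes possible, the sketch below diffs the tuple a result was validated under against the tuple of the host about to re-run it. The `ExecutorTuple` type and `axis_mismatches` helper are hypothetical names for this article, not LynxBench AI’s actual API.

```python
# Compare the validated executor tuple against the current host's tuple
# and report which axes differ. Names and values are illustrative.
from typing import NamedTuple

class ExecutorTuple(NamedTuple):
    driver: str
    toolkit: str
    framework_build: str
    compute_capability: str

def axis_mismatches(validated: ExecutorTuple, running: ExecutorTuple) -> list[str]:
    """Return the axes on which the two environments differ."""
    return [
        axis
        for axis in ExecutorTuple._fields
        if getattr(validated, axis) != getattr(running, axis)
    ]

validated = ExecutorTuple("550.54.15", "12.1", "torch 2.3.1+cu121", "9.0")
running = ExecutorTuple("535.104.05", "12.1", "torch 2.3.1+cu121", "8.0")

# Any non-empty result means the benchmark number may not transfer as-is.
print(axis_mismatches(validated, running))  # ['driver', 'compute_capability']
```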