## “CUDA 12” is not a configuration

A workload runs on one machine and fails, or runs noticeably slower, on another. Both machines report “CUDA 12.” Both have a recent NVIDIA GPU. Both have the same PyTorch version installed. The instinct is to look for a one-dimensional version mismatch, and there is none to find. The reason is that CUDA compatibility is not one-dimensional. It is a four-axis matrix (driver version, toolkit version, framework build, and GPU compute capability), and a workload that runs on one combination is not guaranteed to run, or to run at the same speed, on a different combination even when the headline CUDA version is identical.

This matters specifically for benchmark reproducibility: a benchmark report that names only one or two of the four axes is under-specifying the AI Executor in ways that prevent the result from transferring to a different host.

## What are the four axes of CUDA compatibility?

The four axes are independent. Each carries its own ABI and feature-support constraints, and the matrix of valid combinations is larger and more conditional than any single version number suggests.

**Driver version.** The kernel-mode driver is what the GPU actually talks to. Each driver version supports a range of CUDA toolkit versions (newer drivers generally run applications built against older toolkits, but the reverse is not guaranteed), and each exposes its own set of hardware features and bug fixes. A workload that depends on a feature exposed in driver version N may fail, or fall back to a different code path, on driver N-1, even when the application’s CUDA toolkit is the same.

**Toolkit version.** The CUDA toolkit provides headers, the compiler (nvcc), libraries (cuBLAS, cuDNN, cuFFT), and runtime support. Applications compiled against toolkit version M expect a runtime API of that version, and many libraries shipped with the toolkit have version-dependent behavior. Two systems with toolkit M.0 and M.1 can produce different cuBLAS kernel selections, different cuDNN heuristic decisions, and different observed throughput on identical hardware.

**Framework build.** PyTorch, TensorFlow, JAX, and similar frameworks ship pre-built CUDA components inside their wheels. The framework is built against a specific toolkit version and against specific cuDNN/cuBLAS versions, often statically linked or vendored into the wheel. Framework version X built against toolkit M is a different runtime artifact from framework version X built against toolkit M+1, even though the framework version string is the same.

**GPU compute capability.** The compute capability identifies the GPU’s architectural generation and determines which instructions and tensor-core operations are physically available. A toolkit version supports a range of compute capabilities, but the specific kernels selected for a given workload depend on the compute capability of the target GPU, and code paths that target newer compute capabilities are simply absent on older ones.

Each axis can shift the executor’s behavior independently. A change on any one of the four can change which kernels execute, which precision regimes are available, and what performance the workload achieves.
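All four values can be read programmatically from a running host. The sketch below, assuming a CUDA-enabled PyTorch install and `nvidia-smi` on the PATH, captures one value per axis; `cuda_axes` is an illustrative helper name for this article, not a standard API.

```python
# Capture the four CUDA compatibility axes on the current host.
# Assumes a CUDA-enabled PyTorch build and nvidia-smi on the PATH.
import subprocess

import torch

def cuda_axes() -> dict:
    """Return the (driver, toolkit, framework build, compute capability) tuple."""
    # Axis 1: kernel-mode driver version, as reported by nvidia-smi.
    driver = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        text=True,
    ).strip().splitlines()[0]

    # Axis 2: the CUDA toolkit version this PyTorch wheel was built against.
    # This is the framework-vendored toolkit, not the system install.
    toolkit = torch.version.cuda

    # Axis 3: the framework build. The version string (e.g. "2.3.1+cu121")
    # encodes the build's CUDA target; the cuDNN version pins the vendored library.
    framework = f"torch {torch.__version__}, cuDNN {torch.backends.cudnn.version()}"

    # Axis 4: compute capability of GPU 0, e.g. (9, 0) on an H100.
    major, minor = torch.cuda.get_device_capability(0)

    return {
        "driver": driver,
        "toolkit": toolkit,
        "framework_build": framework,
        "compute_capability": f"{major}.{minor}",
    }

if __name__ == "__main__":
    for axis, value in cuda_axes().items():
        print(f"{axis}: {value}")
```

Note that `torch.version.cuda` reports the toolkit the wheel was built against, which can differ from any system-wide toolkit install on the same machine; that distinction is exactly what the framework-build axis exists to capture.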
## How the matrix actually shifts behavior

The four axes interact, which is why a single mismatch can produce a non-obvious failure or a non-obvious slowdown.

| Combination shift | What can change |
|---|---|
| Driver newer, toolkit same | Same compiled code; potentially different runtime kernel selection; bug fixes or regressions in driver-side behavior |
| Toolkit newer, driver same | Compatibility limited to the forward-compatible toolkit features the older driver supports; library versions inside the toolkit shift; kernel heuristics change |
| Framework rebuilt against newer toolkit | Different vendored cuDNN/cuBLAS versions; different default kernel choices; sometimes ABI-incompatible with externally installed CUDA libraries |
| Different compute capability, same software | Different kernel paths selected; tensor-core paths may or may not exist; precision-format support differs |
| All four nominally same, different vendor builds of framework | Framework wheels from different sources can vendor different toolkit components, producing different observed behavior |

The practical implication is that “CUDA compatibility” as a single concept does not exist. The actual concept is “the specific (driver, toolkit, framework, compute capability) tuple under which this workload was validated, and the tuple under which it is being run.”

## The CUDA ecosystem as switching cost

The compatibility matrix is also the dominant source of switching cost when a team evaluates a non-NVIDIA accelerator. The matrix problem does not vanish on a different platform; it reproduces with new axes. A team moving from CUDA to ROCm trades the (NVIDIA driver × CUDA toolkit × framework build × compute capability) matrix for an (AMD driver × ROCm version × framework build × architecture target) matrix, and the validation work the team has done against the first matrix does not transfer. The same applies to oneAPI on Intel hardware, to MAX on Modular, and to vendor SDKs on SoC AI accelerators. Each ecosystem has its own four-axis-shaped matrix, and re-validating a workload across it reproduces the same compatibility problem from scratch on the new platform. The strategic argument lives in CUDA, frameworks, and ecosystem lock-in; operationally, the cost is not “learning a new API” but re-running the matrix-shaped validation that established performance and correctness on the original ecosystem.

## What this means for benchmark disclosure

A benchmark of a CUDA-using workload that reports only the CUDA toolkit version is reporting one of four required axes. The result is informative about the original measurement environment and indeterminate about any other host’s behavior. A reproducible CUDA benchmark report names all four:

- Driver version
- CUDA toolkit version (and the source of the toolkit: system install or framework-vendored)
- Framework version and the framework’s build origin
- GPU compute capability of the target hardware

A report that omits any of these has a comparability gap that the reader cannot close, and a re-run on a different host can produce a different number for reasons the report does not let the reader diagnose.
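Concretely, a report can attach these four items as a small structured record alongside the measured numbers. The sketch below is illustrative only: the schema, field names, and example values are assumptions, not a standard format.

```python
# An illustrative disclosure record covering all four axes, plus the
# toolkit's source (system install vs. framework-vendored), which the
# list above calls out. Field names and values are examples.
import json
from dataclasses import asdict, dataclass

@dataclass
class CudaDisclosure:
    driver_version: str       # e.g. "550.54.15"
    toolkit_version: str      # e.g. "12.1"
    toolkit_source: str       # "system" or "framework-vendored"
    framework_build: str      # e.g. "torch 2.3.1+cu121, cuDNN 8902"
    compute_capability: str   # e.g. "9.0" (H100)

record = CudaDisclosure(
    driver_version="550.54.15",
    toolkit_version="12.1",
    toolkit_source="framework-vendored",
    framework_build="torch 2.3.1+cu121, cuDNN 8902",
    compute_capability="9.0",
)

# Attach the record to the benchmark result so a re-run on another host
# can be diffed axis by axis.
print(json.dumps(asdict(record), indent=2))
```

Recording the toolkit source explicitly matters because a framework-vendored toolkit can differ from the system install on the same machine.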
## The framing that helps

CUDA compatibility is a four-axis matrix, not a version number. Benchmark reproducibility for CUDA workloads requires all four axes to be named, because each axis can shift the executor’s behavior independently and the interactions between axes can shift it non-obviously. The same compatibility-matrix shape reproduces on every alternative ecosystem, which is what makes ecosystem switching costly even when the per-axis equivalence looks straightforward.

LynxBench AI treats the (driver, toolkit, framework build, compute capability) tuple as part of the AI Executor specification, alongside the GPU model, because that tuple is what determines which kernels execute, which precision regimes are available, and which performance the benchmark actually measures.
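As an illustration of what that specification makes possible, the sketch below diffs the tuple a result was validated under against the tuple of the host about to re-run it. The `ExecutorTuple` type and `axis_mismatches` helper are hypothetical names for this article, not LynxBench AI’s actual API.

```python
# Compare the validated executor tuple against the current host's tuple
# and report which axes differ. Names and values are illustrative.
from typing import NamedTuple

class ExecutorTuple(NamedTuple):
    driver: str
    toolkit: str
    framework_build: str
    compute_capability: str

def axis_mismatches(validated: ExecutorTuple, running: ExecutorTuple) -> list[str]:
    """Return the axes on which the two environments differ."""
    return [
        axis
        for axis in ExecutorTuple._fields
        if getattr(validated, axis) != getattr(running, axis)
    ]

validated = ExecutorTuple("550.54.15", "12.1", "torch 2.3.1+cu121", "9.0")
running = ExecutorTuple("535.104.05", "12.1", "torch 2.3.1+cu121", "8.0")

# Any non-empty result means the benchmark number may not transfer as-is.
print(axis_mismatches(validated, running))  # ['driver', 'compute_capability']
```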