CUDA compute capability is the hardware capability number that the CUDA toolkit version does not override: the per-architecture identifier that names what a given GPU generation can physically do. It is reported as a major-minor version (7.5, 8.0, 8.6, 9.0, and so on), and it tells the software stack which instructions, precision formats, and tensor-core operations are available on the target hardware. For AI workloads, this number matters more than the headline CUDA toolkit version, because it determines whether the kernels the framework wants to run will actually find hardware to run on.

The common confusion is to treat the CUDA toolkit version as the binding constraint, assuming that a workload that "supports CUDA 12" will run equivalently on any GPU the toolkit accepts. The toolkit accepts a range of compute capabilities, but the workload's actual behavior on each one depends on what that compute capability supports, not on what the toolkit version is.

## What does CUDA compute capability actually control?

Compute capability is the hardware-feature axis of the CUDA stack. Each compute capability adds, removes, or changes specific architectural features:

- **Precision-format support.** Tensor-core operations on FP16 require compute capability 7.0+. BF16 tensor-core operations require 8.0+. FP8 (E4M3, E5M2) tensor-core operations are available starting at 8.9 (Ada) and 9.0 (Hopper). INT8 tensor cores are available on 7.2/7.5+. A workload that uses BF16 matrix multiplications on tensor cores cannot do so on a 7.5 GPU, because the tensor cores there do not implement BF16, regardless of which CUDA toolkit is installed.
- **Tensor-core matrix shapes and precisions.** The warp-level matrix-multiply-accumulate shapes and the precision types each generation supports differ. Newer compute capabilities expose larger matrix shapes and more precision options, which the framework's kernel-selection logic uses to choose between tensor-core and general-purpose CUDA-core paths.
- **Memory-model features.** Asynchronous copies, distributed shared memory, thread-block clusters, and similar features are introduced at specific compute capabilities. Workloads optimized to use them on newer hardware fall back to non-optimized paths on older hardware.
- **Maximum threads per block, registers per thread, and shared memory per block.** These hardware limits change across generations and constrain how kernels are launched.

The framework's kernel-selection logic respects compute capability: it picks the best kernel for the target hardware among those compiled into the binary. If the binary does not include a kernel for the target compute capability, the framework either falls back to a generic CUDA-core path or fails outright.

## Why compute capability matters more for AI than the toolkit version

The CUDA toolkit version is what an application requests; compute capability is what the hardware provides. A toolkit version is meaningful only in conjunction with the compute capability the target hardware supports. Two GPUs accepted by the same CUDA toolkit can still differ substantially in observed behavior. A workload that uses BF16 tensor cores will run them on a compute-capability-8.0 GPU and fall back to FP32 CUDA cores on a 7.5 GPU, producing dramatically different throughput on identical software. A workload that uses FP8 tensor cores will run on a 9.0 GPU and either fall back or fail on an 8.0 GPU. A workload that depends on thread-block clusters (a 9.0 feature) will not run at all on older hardware, regardless of toolkit version.
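To make the distinction concrete, here is a minimal sketch of that decision as a framework-level check. It assumes PyTorch (the section speaks of frameworks generically, so the library choice is illustrative), and the thresholds simply restate the compute-capability requirements above; the point is that the regime selection reads the hardware's compute capability, and the toolkit version never enters the check.

```python
# Minimal sketch (PyTorch assumed): choose the tensor-core precision regime the
# installed GPU can actually execute. The toolkit version plays no role here.
import torch

def select_precision_regime(device: int = 0) -> str:
    """Return the highest tensor-core precision regime the GPU supports."""
    cc = torch.cuda.get_device_capability(device)  # (major, minor), e.g. (8, 6)
    if cc >= (8, 9):   # Ada (8.9) and Hopper (9.0) add FP8 tensor cores
        return "fp8"
    if cc >= (8, 0):   # Ampere adds BF16 tensor cores
        return "bf16"
    if cc >= (7, 0):   # Volta/Turing tensor cores support FP16 matmul
        return "fp16"
    return "fp32"      # pre-Volta: no tensor cores, plain CUDA-core FP32

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"{torch.cuda.get_device_name(0)}: "
          f"toolkit {torch.version.cuda}, "          # what the build targets
          f"compute capability {major}.{minor}, "    # what the hardware provides
          f"regime {select_precision_regime(0)}")    # what would actually run
```

On a 7.5 GPU this returns "fp16" no matter how new the installed toolkit is, which is exactly the fallback behavior described above.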
This is why a benchmark report that names the CUDA toolkit version but not the compute capability of the target hardware is under-specifying the AI Executor. The toolkit accepts; the hardware delivers; the kernel chosen is the intersection of what the toolkit knows how to compile and what the hardware can execute.

## Compute capability mapping at a glance

| Compute capability | Generation | Notable AI-relevant features |
| --- | --- | --- |
| 7.0 / 7.2 | Volta | First-generation tensor cores; FP16 matrix-multiply accumulation |
| 7.5 | Turing | INT8 / INT4 tensor cores |
| 8.0 | Ampere (A100) | BF16 tensor cores; structured sparsity; TF32 for training |
| 8.6 | Ampere (consumer / RTX 30) | Subset of 8.0 features; different shared-memory budget |
| 8.9 | Ada Lovelace | FP8 (E4M3 / E5M2) tensor cores |
| 9.0 | Hopper | Thread-block clusters; distributed shared memory; new tensor-core APIs |
| 10.0+ | Newer generations | Per-generation additions; consult the architecture's documentation |

A workload tuned for 9.0 features cannot run equivalently on 8.0; a workload that requires FP8 cannot run on 8.0 at all. The toolkit's role is to compile code paths for each target the binary intends to run on. The hardware's role is to actually execute the path the framework selects.

## What this means for benchmark interpretation

A CUDA-based AI benchmark report has to declare both the CUDA toolkit version and the target compute capability for the result to be interpretable. The toolkit version determines what was compiled; the compute capability determines what was executed. Two benchmarks with the same toolkit version on different compute capabilities are reporting on different executors, and the difference can be larger than any reasonable hardware-only comparison would suggest, because the precision regime that ran is itself different. The reverse case also matters: a benchmark on the same compute capability across different toolkit versions can show non-obvious shifts, because the framework's kernel-selection logic has access to a different set of compiled kernels.

Bounded optimization in benchmarking, the principle that the optimization effort applied to the system under test must be named and bounded, therefore extends to both the toolkit version and the compute capability. Building on CUDA, frameworks, and ecosystem lock-in, the practical point is that the ecosystem's value comes from the depth of the kernel library across (toolkit version × compute capability) combinations, and that depth is what a benchmark exercises when it runs.

## The framing that helps

CUDA compute capability is the hardware-feature axis that determines which precision formats, tensor-core operations, and memory features a given GPU generation can actually execute. The CUDA toolkit version is what the application requests; the compute capability is what the hardware provides; the kernel that runs is the intersection. Benchmark reports must declare both for the result to be interpretable. LynxBench AI treats the (toolkit version, compute capability, precision regime) tuple as part of the AI Executor specification, alongside the GPU model, because the precision regime that actually executes is determined by that tuple, and the per-precision performance the benchmark measures depends on which regime ran.
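As a closing illustration, the sketch below shows one way a benchmark run could capture that tuple at execution time. The field names and report shape are assumptions for illustration, not LynxBench AI's actual schema; the queries themselves are standard PyTorch runtime calls.

```python
# Illustrative sketch: assemble the (toolkit, compute capability, precision) tuple
# alongside the GPU model so a result names the executor it ran on.
# Field names are hypothetical, not LynxBench AI's schema.
import json
import torch

def executor_spec(device: int = 0, precision_regime: str = "bf16") -> dict:
    """Collect the executor tuple a benchmark result needs to be interpretable."""
    major, minor = torch.cuda.get_device_capability(device)
    return {
        "gpu_model": torch.cuda.get_device_name(device),
        "cuda_toolkit": torch.version.cuda,           # what was compiled
        "compute_capability": f"{major}.{minor}",     # what the hardware provides
        "precision_regime": precision_regime,         # what actually ran
    }

if torch.cuda.is_available():
    print(json.dumps(executor_spec(0), indent=2))
```

Attaching a record like this to each result is what makes two numbers with the same toolkit version but different compute capabilities distinguishable as results from different executors.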