CUDA Compute Capability Explained: What the Version Number Means for AI Workloads

Compute capability is not a CUDA version

CUDA compute capability is a hardware property of each NVIDIA GPU: a version number (e.g., 7.0, 8.0, 8.9, 9.0) that defines which features, instructions, and precision formats the GPU silicon supports. It is distinct from the CUDA toolkit version, which identifies the software SDK. A common confusion is that developers see “CUDA 12.x” installed and assume all features are available. In reality, compute capability determines which tensor core operations and precision formats a GPU can execute; the toolkit version only determines which APIs are callable.

What each compute capability enables

Compute Capability	Architecture	Key AI features
7.0	Volta (V100)	First-gen tensor cores (FP16), mixed-precision training
7.5	Turing (RTX 20xx, T4)	INT8 tensor cores, RT cores
8.0	Ampere (A100)	TF32 tensor cores, BF16, sparsity acceleration (2:4), 3rd-gen NVLink
8.6	Ampere (RTX 30xx)	Same AI precision formats as 8.0 (TF32, BF16, 2:4 sparsity), lower memory bandwidth
8.9	Ada Lovelace (RTX 40xx, L40)	4th-gen tensor cores with FP8
9.0	Hopper (H100)	FP8 tensor cores, Transformer Engine, DPX instructions for dynamic programming

The practical impact: a kernel that uses FP8 tensor core instructions will not run on a compute capability 8.0 GPU, because the hardware instruction set does not include FP8 operations. BF16 tensor core training requires compute capability 8.0 or higher, as does sparsity-accelerated inference (2:4 structured sparsity).
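The thresholds in the table above can be sketched as a small lookup. This is a minimal illustration, not a library API: the `MIN_CAPABILITY` dictionary, its keys, and the `supports` function are hypothetical names, and treating each feature as "available at this capability and everything above" is a simplifying assumption.

```python
# Minimum compute capability per tensor core feature, taken from the table above.
# Names and keys are illustrative assumptions, not part of CUDA or PyTorch.
MIN_CAPABILITY = {
    "fp16": (7, 0),          # Volta
    "int8": (7, 5),          # Turing
    "tf32": (8, 0),          # Ampere
    "bf16": (8, 0),          # Ampere
    "sparsity_2_4": (8, 0),  # Ampere
    "fp8": (8, 9),           # Ada Lovelace and Hopper
}


def supports(capability, feature):
    """Return True if a GPU with this (major, minor) capability has the feature."""
    # Tuples compare element by element, so (8, 9) >= (8, 0) is True.
    return tuple(capability) >= MIN_CAPABILITY[feature]


print(supports((8, 0), "bf16"))  # True
print(supports((8, 0), "fp8"))   # False
print(supports((9, 0), "fp8"))   # True
```

Keeping the capability as a (major, minor) tuple rather than a float avoids comparison surprises with values such as 8.10 versus 8.9.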

How to check your GPU’s compute capability

nvidia-smi --query-gpu=compute_cap --format=csv

Or programmatically in PyTorch: torch.cuda.get_device_capability() returns a (major, minor) tuple such as (8, 0) for the current device.
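For scripts that cannot depend on PyTorch, the nvidia-smi query above can be wrapped and parsed into the same (major, minor) tuples. The following is a rough sketch: the helper names `parse_compute_caps` and `query_compute_caps` are hypothetical, and the sample text is an assumed output (a header row followed by one value per GPU) rather than a captured log.

```python
import subprocess


def parse_compute_caps(csv_text):
    """Parse nvidia-smi CSV output into a list of (major, minor) tuples, one per GPU."""
    caps = []
    for line in csv_text.splitlines():
        line = line.strip()
        # Skip blank lines and the "compute_cap" header row.
        if not line or not line[0].isdigit():
            continue
        major, minor = line.split(".", 1)
        caps.append((int(major), int(minor)))
    return caps


def query_compute_caps():
    """Run the nvidia-smi query from the text; needs an NVIDIA driver on PATH."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_compute_caps(result.stdout)


# Assumed sample output for a machine with one 8.0 GPU and one 9.0 GPU.
sample = "compute_cap\n8.0\n9.0\n"
print(parse_compute_caps(sample))  # [(8, 0), (9, 0)]
```

Only `parse_compute_caps` is exercised here, so the snippet runs on a machine without a GPU; `query_compute_caps` raises an error if nvidia-smi is missing or fails.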

Why this matters for AI workload deployment

When choosing between CUDA, OpenCL, and SYCL, the compute capability of your target GPUs determines which optimisation paths are available. A model quantised to FP8 for maximum inference throughput cannot deploy on compute capability 8.0 hardware; it needs 8.9 (Ada Lovelace), 9.0 (Hopper), or later. Teams deploying across mixed GPU fleets must either compile multiple execution paths or accept the lowest common denominator.
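The two options for a mixed fleet can be sketched as follows. This is an illustrative example under stated assumptions: the fleet contents, the `PRECISION_PREFERENCE` ordering, and the function names are hypothetical, and the thresholds are the ones from the table earlier in the article.

```python
# Preferred precision first; each entry is (name, minimum compute capability).
# The ordering and names are illustrative assumptions for this sketch.
PRECISION_PREFERENCE = [("fp8", (8, 9)), ("bf16", (8, 0)), ("fp16", (7, 0))]


def best_precision(capability):
    """Pick the most preferred precision the given (major, minor) capability supports."""
    for name, minimum in PRECISION_PREFERENCE:
        if tuple(capability) >= minimum:
            return name
    return "fp32"  # Fallback when no tensor core format applies.


def plan_fleet(fleet):
    """Return (per-device precision map, single precision every device can run)."""
    per_device = {name: best_precision(cap) for name, cap in fleet.items()}
    lowest = min(fleet.values())  # Lowest capability in the fleet.
    return per_device, best_precision(lowest)


# Hypothetical fleet keyed by a device label.
fleet = {"a100": (8, 0), "l40": (8, 9), "h100": (9, 0), "t4": (7, 5)}
per_device, common = plan_fleet(fleet)
print(per_device)  # {'a100': 'bf16', 'l40': 'fp8', 'h100': 'fp8', 't4': 'fp16'}
print(common)      # fp16
```

The per-device map corresponds to maintaining multiple execution paths, while the single common value is the lowest-common-denominator option: here one T4 pulls the whole fleet down to FP16.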

For inference serving at scale, compute capability is the first filter in hardware selection, ahead of price, availability, and memory capacity. If your workload requires BF16 tensor core operations, any GPU below compute capability 8.0 is functionally incompatible regardless of its CUDA core count.
