## Driver installation is the foundation of the AI software stack

Every NVIDIA GPU-based AI workload depends on a driver → CUDA → cuDNN → framework version chain. Getting any step wrong produces failures that are often confusing to diagnose: mysterious CUDA errors, framework crashes, or silent performance degradation from using sub-optimal kernel paths. The correct sequence for Linux AI environments is well-defined but frequently done incorrectly.

### The version compatibility chain

```
GPU hardware
  ↓ requires
NVIDIA driver (e.g., 550.x)
  ↓ determines maximum supported
CUDA version (e.g., CUDA 12.4)
  ↓ combined with
cuDNN version (e.g., 8.9 or 9.x)
  ↓ required by
Framework version (PyTorch 2.x, TensorFlow 2.x)
```

Every link must be compatible. Installing the latest framework with an old driver is a common cause of failures.

### Recommended installation method for AI workloads

Use the official NVIDIA runfile or NVIDIA's own package repository, not distribution packages. Distribution-provided NVIDIA packages (Ubuntu's `nvidia-driver-xxx`) are often delayed by one or more minor versions and may not include components needed for AI (NCCL, the CUDA toolkit).

```bash
# Remove existing packages
sudo apt purge nvidia-* libnvidia-*
sudo apt autoremove

# Install from NVIDIA's package repository
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install cuda-toolkit-12-4   # Match to your target CUDA version

# Verify
nvidia-smi
nvcc --version
```

### Version compatibility table (as of mid-2026)

| PyTorch version | Required CUDA | Minimum driver |
| --- | --- | --- |
| 2.4.x | CUDA 12.1+ | 525.60 |
| 2.3.x | CUDA 12.1+ | 525.60 |
| 2.2.x | CUDA 11.8 or 12.1 | 450.80 |
| 2.1.x | CUDA 11.8 or 12.1 | 450.80 |

Always verify compatibility at the PyTorch installation matrix page for the specific version you need.

### What are the common failure modes?

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| `CUDA error: no kernel image available` | CUDA compute capability mismatch | Recompile PyTorch or use a compatible binary |
| `RuntimeError: CUDA not available` | Driver not installed or not found | Reinstall the driver; check `nvidia-smi` |
| Slow training without error | cuDNN determinism mode enabled | Disable deterministic mode; unset `CUBLAS_WORKSPACE_CONFIG` |
| OOM on first run | Driver version limits addressable VRAM | Update the driver |

The software stack dependency is a primary reason why identical GPUs often perform differently: driver and CUDA version differences between environments produce measurable throughput differences.

### What goes wrong during NVIDIA driver installation on Linux?

The most common failure modes during NVIDIA driver installation on Linux for AI workloads are: conflicting kernel module versions (nouveau vs nvidia), mismatched CUDA toolkit and driver versions, incomplete DKMS kernel module compilation, and Secure Boot enforcement blocking unsigned kernel modules.

To diagnose the root cause, run `nvidia-smi` after installation. If it fails with "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver", the kernel module is not loaded; check `dmesg | grep -i nvidia` for module loading errors. If `nvidia-smi` works but PyTorch's `torch.cuda.is_available()` returns False, the CUDA runtime is missing or incompatible with the driver version.
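The commands below walk that diagnostic path in order, from kernel module up to the framework. This is a minimal sketch: it assumes PyTorch is already installed in the active Python environment, and `mokutil` may need to be installed separately on some systems.

```bash
# Diagnostic sketch: check each layer of the driver → CUDA → framework chain in order.
nvidia-smi                            # Driver layer: reports driver version and max supported CUDA
dmesg | grep -i nvidia | tail -n 20   # If nvidia-smi fails: look for kernel module loading errors
mokutil --sb-state                    # Secure Boot enabled? Unsigned kernel modules will be blocked

# Framework layer: which CUDA version the PyTorch build was compiled against,
# and whether it can actually reach the GPU through the installed driver.
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
```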
We maintain a validated configuration matrix for production deployments:

- Ubuntu 22.04 LTS with kernel 5.15 (HWE kernel avoided unless specific hardware requires it)
- NVIDIA driver 550.x from the CUDA repository (not the Ubuntu repository)
- CUDA toolkit 12.4 installed via the NVIDIA runfile (not apt)
- PyTorch installed via pip with CUDA 12.4 binaries

This specific combination has been stable across several hundred production GPU nodes.

For Docker-based deployments, the NVIDIA Container Toolkit eliminates most driver compatibility issues by separating the host driver from the container's CUDA toolkit. The container sees the host GPU through the NVIDIA driver, but bundles its own CUDA runtime, cuDNN, and framework versions. This makes the host driver version the only external dependency, and it only needs to be ≥ the minimum version required by the container's CUDA runtime.

### Containerised deployments and driver management

For organisations running AI workloads in Docker or Kubernetes, the NVIDIA Container Toolkit changes the driver management model fundamentally. The host machine needs only the NVIDIA kernel driver: no CUDA toolkit, no cuDNN, no framework installation. All of these are bundled in the container image.

This separation simplifies driver management to a single variable: the host driver version. Container images specify their minimum required driver version in metadata, and the container toolkit validates compatibility at launch. If the host driver is too old, the container fails to start with an explicit error message rather than producing silent compute errors.

We manage host drivers on GPU nodes using a pinned package version in our configuration management system (Ansible). Driver updates are applied to one node at a time, validated with a smoke test (launch a PyTorch container, run a 60-second inference test; see the sketch below), then rolled to the next node. This rolling update strategy ensures that at least 75% of GPU capacity remains available during driver maintenance windows.
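The script below is a rough sketch of what such a post-update smoke test can look like. The container image tags, the 60-second budget, and the matrix-multiply workload are illustrative assumptions, not the exact tooling described above; it does assume the host has Docker and the NVIDIA Container Toolkit installed.

```bash
#!/usr/bin/env bash
# Post-driver-update smoke test (sketch). Exit code is non-zero if any step fails.
set -euo pipefail

# 1. Confirm the updated host driver is loaded.
nvidia-smi

# 2. Confirm the container toolkit can expose the GPU to a minimal CUDA container.
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

# 3. Run a short GPU workload inside a PyTorch container, bounded by a time budget.
#    The image tag and the 60-second limit are assumptions for illustration.
timeout 60 docker run --rm --gpus all pytorch/pytorch:2.4.0-cuda12.4-cudnn9-runtime \
  python -c "import torch; x = torch.randn(4096, 4096, device='cuda'); [x @ x for _ in range(50)]; torch.cuda.synchronize(); print('smoke test OK on', torch.cuda.get_device_name(0))"
```

Wiring the script's exit code into the Ansible play is one way to gate the rolling update, so a node that fails the smoke test pauses the rollout instead of propagating a bad driver across the fleet.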