Why can’t we just switch to a different accelerator?
The question comes up in nearly every hardware evaluation meeting. A new chip promises better performance per watt, a lower price point, or competitive benchmark scores. The spec sheet looks strong. The economics look attractive. And then someone from the ML platform team explains, with varying degrees of patience, why switching isn’t as simple as plugging in a different card.
The reason is rarely hardware. The reason is software — specifically, the depth and breadth of the software ecosystem that surrounds the hardware and that years of engineering investment have baked into the organization’s workflows.
CUDA is not just an API
When people talk about CUDA lock-in, they often frame it as a programming interface problem: “We’re locked in because our code is in CUDA.” That framing dramatically understates the issue.
CUDA is a programming model, yes. But more importantly, it’s an ecosystem. That ecosystem includes cuDNN for neural network primitives, cuBLAS for linear algebra, NCCL for multi-GPU communication, TensorRT for inference optimization, Nsight for profiling, and a vast collection of community-maintained libraries, tutorials, Stack Overflow answers, and institutional knowledge that has accumulated over more than fifteen years.
When a framework like PyTorch dispatches a matrix multiplication, it doesn’t call a single CUDA function. It routes through a stack of abstractions — torch.mm → ATen → cuBLAS (or a custom CUTLASS kernel, or a FlashAttention implementation) → the CUDA runtime → the GPU driver. Each layer in that stack has been tuned, debugged, and optimized specifically for NVIDIA hardware. The performance you observe is the product of that entire vertical stack, not just the hardware at the bottom.
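The layering can be sketched as a toy dispatch table. This is an illustrative model, not PyTorch's actual internals; every name here (`KERNEL_REGISTRY`, `dispatch`, the "cuBLAS-like" kernel) is invented for the sketch:

```python
# Toy model of a layered dispatch stack, loosely mirroring the
# torch.mm -> ATen -> backend-kernel routing described above.
# None of these names are real PyTorch internals.

# Backend kernel registry: maps (op, device) to an implementation.
KERNEL_REGISTRY = {}

def register_kernel(op, device):
    def wrap(fn):
        KERNEL_REGISTRY[(op, device)] = fn
        return fn
    return wrap

@register_kernel("matmul", "cuda")
def matmul_vendor_tuned(a, b):
    # Stand-in for a vendor-tuned kernel (think cuBLAS or CUTLASS).
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

@register_kernel("matmul", "cpu")
def matmul_reference(a, b):
    # Stand-in for a slow reference path with no vendor tuning.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def dispatch(op, device, *args):
    # Framework-level entry point: routes the op to whichever backend
    # kernel is registered for the target device.
    try:
        kernel = KERNEL_REGISTRY[(op, device)]
    except KeyError:
        raise NotImplementedError(f"no {op!r} kernel for device {device!r}")
    return kernel(*args)

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
result = dispatch("matmul", "cuda", a, b)  # routed to the "vendor" kernel
```

The point the toy makes: switching hardware means populating a new column of this table for every operator your workloads touch, and the incumbent vendor has both the most entries and the most heavily tuned ones.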
This is what makes the ecosystem valuable: it’s not one thing, it’s everything. And “everything” is very hard to replicate.
Switching costs are predominantly software-driven
The economic argument for switching accelerators usually focuses on hardware acquisition cost. But the harder cost to quantify — and the one that actually blocks transitions — is the software migration.
Moving from CUDA to an alternative ecosystem (ROCm for AMD, oneAPI for Intel, or a proprietary accelerator SDK) requires porting or replacing every CUDA-specific dependency in the stack. That includes not just the obvious kernels and library calls, but also the profiling tools the team relies on, the deployment infrastructure built around NVIDIA-specific tooling, the model optimization pipelines tuned for TensorRT, and the collective knowledge the engineering team has built about diagnosing CUDA-specific performance issues.
We’ve seen organizations estimate migration timelines of six months and spend eighteen. Not because the hardware didn’t work, but because the software ecosystem gap was larger than anyone mapped upfront. As discussed in how the software stack functions as a first-class performance component, the software layer isn’t a thin wrapper around hardware capability — it’s a substantial determinant of what capability gets realized.
The practical switching cost isn’t “recompile for a different target.” It’s “rebuild the performance engineering discipline your team has developed around a specific ecosystem.”
Ecosystem depth amplifies hardware capability
This cuts both ways. Lock-in is real and creates strategic risk, but the depth of a mature ecosystem also genuinely amplifies what the hardware can deliver.
Consider FlashAttention, a memory-efficient attention algorithm that achieves substantial speedups by tiling the attention computation so intermediate results stay in fast on-chip SRAM, rather than materializing the full attention matrix in HBM. The original implementation was CUDA-specific, hand-tuned for NVIDIA GPU memory hierarchies. Ports to other platforms exist but often lag in optimization maturity, so the same algorithm, on comparable hardware, can perform very differently depending on how much ecosystem investment has gone into tuning it for that specific target.
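The core algorithmic idea can be sketched in NumPy using the standard online-softmax formulation. This is a numerical illustration of the tiling trick only, not the real fused CUDA kernel; the actual speedup comes from where the tiles live in the memory hierarchy, which NumPy cannot express:

```python
import numpy as np

def naive_attention(Q, K, V):
    # Materializes the full N x N score matrix -- the HBM traffic
    # that the tiled formulation avoids.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def tiled_attention(Q, K, V, block=32):
    # Processes K/V in blocks, keeping running softmax statistics
    # (row max m and normalizer l) so the full score matrix never
    # exists at once.
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q, dtype=float)
    m = np.full(N, -np.inf)
    l = np.zeros(N)
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = (Q @ Kb.T) * scale                 # scores for this block only
        m_new = np.maximum(m, S.max(axis=-1))
        alpha = np.exp(m - m_new)              # rescale old accumulator
        P = np.exp(S - m_new[:, None])
        l = l * alpha + P.sum(axis=-1)
        O = O * alpha[:, None] + P @ Vb
        m = m_new
    return O / l[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((128, 16)) for _ in range(3))
assert np.allclose(naive_attention(Q, K, V), tiled_attention(Q, K, V))
```

Both functions compute identical results; the difference is that the tiled version never holds more than one block of scores at a time. In the real kernel, that block lives in SRAM while the naive version round-trips the full matrix through HBM, and exploiting that distinction requires hardware-specific tuning.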
Multiply that example across thousands of kernels, operators, and optimization passes. The cumulative effect is that hardware with a deep ecosystem consistently outperforms hardware with a shallow one — not because the silicon is inherently superior, but because the software has had more time, more contributors, and more production feedback to get fast.
This is the core tension: the very depth that creates lock-in is the same depth that creates performance. Organizations can’t easily have one without the other.
How to think about ecosystem risk without oversimplifying
The temptation is to reduce this to a binary: “CUDA good” or “lock-in bad.” Neither framing helps.
A more productive approach treats ecosystem dependency as a risk factor to be managed, not a verdict. Concretely, this means:
Auditing the dependency surface. How deep does the CUDA dependency go in your stack? Is it confined to a framework layer (PyTorch, JAX) that abstracts the backend, or does it penetrate into custom kernels, deployment tooling, and monitoring infrastructure? The deeper the dependency, the higher the switching cost.
Evaluating abstraction layers. Frameworks increasingly support multiple backends. PyTorch’s torch.compile can target different hardware through backend plugins. JAX’s XLA compiler is hardware-agnostic in principle. These abstraction layers don’t eliminate switching costs, but they can contain them — if the organization has invested in using them rather than bypassing them with hardware-specific code.
Benchmarking at the stack level. When evaluating alternative accelerators, comparing raw hardware specs is insufficient. The comparison must include the software ecosystem’s maturity for your specific workload — kernel coverage, framework support, profiling tooling, and community knowledge. A chip that benchmarks well in a vendor-controlled demo environment may underperform in your production stack because critical operators lack optimized implementations.
Planning for ecosystem evolution. Software ecosystems are not static. ROCm’s coverage has expanded substantially in recent years. Intel’s oneAPI is maturing. Alternative hardware vendors are investing in framework compatibility layers. The landscape a year from now will look different from the landscape today — which argues for periodic reassessment rather than permanent commitment.
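A first pass at auditing the dependency surface can be as simple as a static scan for CUDA-specific markers in source code. The marker lists below are illustrative examples, not an exhaustive audit, and the hard/soft split is a simplification:

```python
import re

# Markers that suggest a CUDA dependency. "hard" markers imply custom
# kernels or CUDA-only tooling (high switching cost); "soft" markers
# are device strings a framework abstraction could absorb.
# Illustrative lists only -- a real audit needs far more coverage.
CUDA_MARKERS = {
    "hard": [r"\bcudaMalloc\b", r"\bcudaMemcpy\b", r"<<<",
             r"\btensorrt\b", r"\bcupy\b", r"\bpycuda\b"],
    "soft": [r"\.to\(['\"]cuda", r"torch\.cuda\.",
             r"device\s*=\s*['\"]cuda"],
}

def audit(source: str) -> dict:
    # Return which markers appear in the source snippet, by severity.
    return {level: [p for p in pats if re.search(p, source)]
            for level, pats in CUDA_MARKERS.items()}

snippet = """
import torch
x = torch.randn(4).to("cuda")
# custom kernel launched elsewhere: my_kernel<<<grid, block>>>(args)
"""
report = audit(snippet)
```

A real audit would also cover dependency manifests, Dockerfiles, CI configuration, and build scripts. The useful distinction the sketch draws is between soft device references that a backend-agnostic framework layer can absorb and hard coupling, such as raw kernel launches or CUDA-only libraries, that has to be ported by hand.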
Lock-in is a system property, not a moral failing
No organization chose to be locked in. Lock-in emerged from rational decisions: adopt the most mature tools, optimize for the widest ecosystem, hire engineers with the most available expertise. Each individual decision was reasonable. The cumulative effect is a deep dependency that creates real strategic constraints.
Understanding that dependency — mapping its depth, quantifying its switching costs, and managing it as a technical risk rather than ignoring it or treating it as inevitable — is the difference between informed hardware strategy and reactive procurement. As explored in who actually owns performance outcomes, performance decisions don’t live in hardware or software alone; they live in the intersection. And the ecosystem is what fills that intersection.