Why can’t we just switch to a different accelerator? The question comes up in nearly every hardware evaluation meeting. A new chip promises better performance per watt, a lower price point, or competitive benchmark scores. The spec sheet looks strong. The economics look attractive. And then someone from the ML platform team explains, with varying degrees of patience, why switching isn’t as simple as plugging in a different card.

The reason is rarely hardware. The reason is software: specifically, the depth and breadth of the software ecosystem that surrounds the hardware and that years of engineering investment have baked into the organization’s workflows.

CUDA is not just an API

When people talk about CUDA lock-in, they often frame it as a programming interface problem: “We’re locked in because our code is in CUDA.” That framing dramatically understates the issue.

CUDA is a programming model, yes. But more importantly, it’s an ecosystem. That ecosystem includes cuDNN for neural network primitives, cuBLAS for linear algebra, NCCL for multi-GPU communication, TensorRT for inference optimization, Nsight for profiling, and a vast collection of community-maintained libraries, tutorials, Stack Overflow answers, and institutional knowledge that has accumulated over more than fifteen years.

When a framework like PyTorch dispatches a matrix multiplication, it doesn’t call a single CUDA function. It routes through a stack of abstractions: torch.mm → ATen → cuBLAS (or a custom CUTLASS kernel) → the CUDA runtime → the GPU driver. Attention operations take a similar path through fused kernels such as FlashAttention. Each layer in that stack has been tuned, debugged, and optimized specifically for NVIDIA hardware. The performance you observe is the product of that entire vertical, not just the hardware at the bottom.

This is what makes the ecosystem valuable: it’s not one thing, it’s everything. And “everything” is very hard to replicate.
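To make the layered stack concrete, here is a minimal sketch that reports which backend layers sit beneath a local PyTorch install. The function name `describe_stack` and the dictionary keys are illustrative choices, not part of any library; the sketch assumes only that PyTorch may or may not be installed, and it returns empty values when it is not.

```python
import importlib


def describe_stack():
    """Report the backend layers beneath the local PyTorch build.

    Every value stays None when PyTorch cannot be imported.
    """
    info = {
        "torch": None,        # framework version
        "cuda": None,         # CUDA toolkit version the build targets
        "hip": None,          # populated instead of "cuda" on ROCm builds
        "cudnn": None,        # primitives library version, if present
        "gpu_visible": None,  # whether a device is usable right now
    }
    try:
        torch = importlib.import_module("torch")
    except (ImportError, OSError):
        return info

    info["torch"] = str(torch.__version__)
    info["cuda"] = getattr(torch.version, "cuda", None)
    info["hip"] = getattr(torch.version, "hip", None)
    info["cudnn"] = torch.backends.cudnn.version()
    info["gpu_visible"] = torch.cuda.is_available()
    return info


if __name__ == "__main__":
    for layer, value in describe_stack().items():
        print(f"{layer:12s} {value}")
```

Even this short report names several independently versioned components, which is a small illustration of why the stack, rather than the card, is the unit that gets migrated.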
Switching costs are predominantly software-driven

The economic argument for switching accelerators usually focuses on hardware acquisition cost. But the harder cost to quantify, and the one that actually blocks transitions, is the software migration.

Moving from CUDA to an alternative ecosystem (ROCm for AMD, oneAPI for Intel, or a proprietary accelerator SDK) requires porting or replacing every CUDA-specific dependency in the stack. That includes not just the obvious kernels and library calls, but also the profiling tools the team relies on, the deployment infrastructure built around NVIDIA-specific tooling, the model optimization pipelines tuned for TensorRT, and the collective knowledge the engineering team has built about diagnosing CUDA-specific performance issues.

We’ve seen organizations estimate migration timelines of six months and spend eighteen. Not because the hardware didn’t work, but because the software ecosystem gap was larger than anyone mapped upfront. As discussed in how the software stack functions as a first-class performance component, the software layer isn’t a thin wrapper around hardware capability; it’s a substantial determinant of what capability gets realized.
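One way to map that gap before committing to a timeline is to scan the codebase for CUDA-specific markers. The sketch below is a rough heuristic, not a complete audit: the layer names, regular expressions, and file suffixes are all assumptions chosen for illustration and would need tuning for a real repository.

```python
import re
from collections import Counter
from pathlib import Path

# Hypothetical markers per dependency layer; extend for your own stack.
LAYER_PATTERNS = {
    "low_level_kernels": re.compile(r"__global__|cublas|cutlass", re.IGNORECASE),
    "optimization_libs": re.compile(r"tensorrt|cudnn|nccl", re.IGNORECASE),
    "profiling_tools": re.compile(r"\b(nsys|ncu|cuda-gdb|nvtx)\b", re.IGNORECASE),
    "framework_backend": re.compile(r"torch\.cuda|\.cuda\(\)|device=[\"']cuda"),
}
SOURCE_SUFFIXES = {".py", ".cu", ".cuh", ".cpp", ".h", ".sh", ".yaml", ".yml"}


def audit_cuda_surface(root):
    """Count lines that mention CUDA-specific markers, grouped by layer."""
    hits = Counter({layer: 0 for layer in LAYER_PATTERNS})
    for path in Path(root).rglob("*"):
        if not path.is_file() or path.suffix not in SOURCE_SUFFIXES:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for line in text.splitlines():
            for layer, pattern in LAYER_PATTERNS.items():
                if pattern.search(line):
                    hits[layer] += 1
    return dict(hits)
```

Calling `audit_cuda_surface("path/to/repo")` returns a per-layer line count. A text scan like this only finds dependencies that appear in source files; tooling habits and debugging expertise will not show up in it.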
The practical switching cost isn’t “recompile for a different target.” It’s “rebuild the performance engineering discipline your team has developed around a specific ecosystem.”

Ecosystem dependency layers and switching costs

| Dependency layer | CUDA ecosystem example | Switching cost |
| --- | --- | --- |
| Low-level kernels | Custom CUDA kernels, CUTLASS, cuBLAS calls | Must be rewritten or replaced with equivalents that may not exist or may be less optimized |
| Optimization libraries | TensorRT, cuDNN, NCCL | Ports exist but often lag in optimization maturity; performance may differ substantially |
| Profiling & debugging | Nsight Systems, Nsight Compute, cuda-gdb | Alternative toolchains are less mature; team expertise must be rebuilt |
| Framework integration | PyTorch CUDA backend, torch.compile CUDA paths | Framework abstraction layers help, but backend-specific code paths still exist |
| Institutional knowledge | Team experience debugging CUDA-specific performance issues | Cannot be transferred; must be rebuilt for the new ecosystem over months or years |

Ecosystem depth amplifies hardware capability

This cuts both ways. Lock-in is real and creates strategic risk, but the depth of a mature ecosystem also genuinely amplifies what the hardware can deliver.

Consider FlashAttention. It’s a memory-efficient attention algorithm that achieves substantial speedups by fusing operations and minimizing HBM reads. The original implementation was CUDA-specific, hand-tuned for NVIDIA GPU memory hierarchies. Ports to other platforms exist but often lag in optimization maturity, which means the same algorithm, on comparable hardware, can perform differently depending on how much ecosystem investment has gone into optimizing it for that specific target.

Multiply that example across thousands of kernels, operators, and optimization passes.
The cumulative effect is that hardware with a deep ecosystem consistently outperforms hardware with a shallow one, not because the silicon is inherently superior, but because the software has had more time, more contributors, and more production feedback to get fast.

This is the core tension: the very depth that creates lock-in is the same depth that creates performance. Organizations can’t easily have one without the other.

How to think about ecosystem risk without oversimplifying

The temptation is to reduce this to a binary: “CUDA good” or “lock-in bad.” Neither framing helps.

A more productive approach treats ecosystem dependency as a risk factor to be managed, not a verdict. Concretely, this means:

Auditing the dependency surface. How deep does the CUDA dependency go in your stack? Is it confined to a framework layer (PyTorch, JAX) that abstracts the backend, or does it penetrate into custom kernels, deployment tooling, and monitoring infrastructure? The deeper the dependency, the higher the switching cost.

Evaluating abstraction layers. Frameworks increasingly support multiple backends. PyTorch’s torch.compile can target different hardware through backend plugins. JAX’s XLA compiler is hardware-agnostic in principle. These abstraction layers don’t eliminate switching costs, but they can contain them, provided the organization has invested in using them rather than bypassing them with hardware-specific code.

Benchmarking at the stack level. When evaluating alternative accelerators, comparing raw hardware specs is insufficient. The comparison must include the software ecosystem’s maturity for your specific workload: kernel coverage, framework support, profiling tooling, and community knowledge. A chip that benchmarks well in a vendor-controlled demo environment may underperform in your production stack because critical operators lack optimized implementations.

Planning for ecosystem evolution. Software ecosystems are not static.
ROCm’s coverage has expanded substantially in recent years. Intel’s oneAPI is maturing. Alternative hardware vendors are investing in framework compatibility layers. The landscape a year from now will look different from the landscape today, which argues for periodic reassessment rather than permanent commitment.

Lock-in is a system property, not a moral failing

No organization chose to be locked in. Lock-in emerged from rational decisions: adopt the most mature tools, optimize for the widest ecosystem, hire engineers with the most available expertise. Each individual decision was reasonable. The cumulative effect is a deep dependency that creates real strategic constraints.

Understanding that dependency (mapping its depth, quantifying its switching costs, and managing it as a technical risk rather than ignoring it or treating it as inevitable) is the difference between informed hardware strategy and reactive procurement. As explored in who actually owns performance outcomes, performance decisions don’t live in hardware or software alone; they live in the intersection. And the ecosystem is what fills that intersection.

Related deep-dives

CUDA compatibility: the four-axis matrix behind the version number. The driver × toolkit × framework × compute-capability matrix that governs reproducibility.

CUDA compute capability: what it actually constrains for AI workloads. The hardware-feature axis that determines which precision regimes can execute.

torch.version.cuda explained: why PyTorch’s CUDA differs from your system’s. The disclosure surface a reproducible PyTorch CUDA benchmark requires.

LynxBenchAI accounts for ecosystem depth by evaluating the full hardware-and-software stack as the unit of measurement: results reflect what the combination of hardware, drivers, framework, and kernel library can sustain, not what the chip could theoretically deliver in isolation.
It is a benchmarking methodology for AI hardware, measuring sustained performance across the complete stack, reported per precision, with bounded optimization.

Frequently Asked Questions

Why is CUDA hard to replace even when competing hardware looks attractive on paper?

Because the lock-in isn’t in the API; it’s in the surrounding ecosystem. CUDA’s value comes from cuDNN, cuBLAS, NCCL, TensorRT, Nsight, and more than fifteen years of community-maintained libraries, tuned kernels, and institutional debugging knowledge. A competing chip can match raw specs and still underperform once it has to run through your actual production stack.

How does ecosystem depth amplify or suppress raw hardware capability?

When PyTorch dispatches a matmul, it routes through a deep vertical (ATen, cuBLAS or CUTLASS, the CUDA runtime, the driver), each layer tuned for NVIDIA hardware. FlashAttention on CUDA versus a less-mature port on comparable hardware illustrates the point: same algorithm, different realized performance. Hardware with a deep ecosystem consistently outperforms hardware with a shallow one, because the software has had more time and more production feedback to get fast.

Which forms of accelerator lock-in are predominantly software-driven rather than hardware-driven?

Almost all of them. The dependency layers that block transitions are custom kernels and CUTLASS code, optimization libraries like TensorRT and cuDNN, profiling and debugging toolchains like Nsight, framework-backend integration, and the institutional knowledge of how to diagnose ecosystem-specific performance issues. Hardware swaps are tractable; rebuilding the performance-engineering discipline around a different ecosystem is what actually takes eighteen months.

Why can comparing CUDA vs ROCm on a single benchmark be misleading as a procurement signal?

A single benchmark captures one operator on one workload in one configuration.
It does not capture kernel coverage across your actual model, framework backend maturity, profiling tool availability, or how quickly your team can diagnose regressions. A chip that benchmarks well in a vendor-controlled demo can underperform in production because critical operators lack optimized implementations. Procurement signals need to be evaluated at the stack level, not the kernel level.

What switching costs need to be inventoried before evaluating an accelerator change?

At minimum: low-level kernels and library calls (CUTLASS, cuBLAS), optimization libraries (TensorRT, cuDNN, NCCL), profiling and debugging toolchains, framework backend code paths, deployment and monitoring infrastructure tied to NVIDIA-specific tooling, and the team’s accumulated debugging expertise. The dependency table in the body lays these out layer by layer. The cost that surprises organizations is usually the last one: institutional knowledge cannot be ported.

How should an architect reason about ecosystem risk without trying to predict which platform “wins”?

Treat ecosystem dependency as a risk factor to manage, not a verdict to render. Audit the dependency surface (how deep does CUDA penetrate your stack?), evaluate abstraction layers like torch.compile and XLA that can contain switching costs, benchmark at the full-stack level rather than on isolated kernels, and plan for periodic reassessment as ROCm, oneAPI, and framework compatibility layers evolve. The goal is informed exposure, not prediction.
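Benchmarking at the full-stack level starts with knowing which of your model's operators a candidate backend actually covers. The sketch below shows that check in its simplest form. The operator lists are hypothetical inputs supplied by the reader; nothing here queries a real backend, and `operator_coverage` is an illustrative helper, not a library function.

```python
def operator_coverage(model_ops, backend_ops):
    """Return (fraction of needed operators covered, sorted missing operators)."""
    needed = set(model_ops)
    if not needed:
        return 1.0, []
    missing = sorted(needed - set(backend_ops))
    covered = (len(needed) - len(missing)) / len(needed)
    return covered, missing


# Hypothetical inputs: operators a model uses, and the subset a candidate
# backend is assumed to ship optimized implementations for.
model_ops = [
    "aten::mm",
    "aten::layer_norm",
    "aten::gelu",
    "aten::scaled_dot_product_attention",
]
candidate_backend_ops = {"aten::mm", "aten::gelu", "aten::layer_norm"}

fraction, missing = operator_coverage(model_ops, candidate_backend_ops)
print(f"coverage: {fraction:.0%}, missing: {missing}")
```

A coverage fraction says nothing about how fast the covered operators run, so it complements a stack-level benchmark rather than replacing it. The list of missing operators is usually the more useful output, because it shows where fallback paths would be hit in production.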