What Cross-Platform GPU Performance Portability Actually Requires

Portable GPU APIs translate code, not performance. What it actually takes to run fast on NVIDIA, AMD, and Intel from the same codebase.

Written by TechnoLynx. Published on 12 Jun 2026.

A team writes a CUDA kernel that runs beautifully on an NVIDIA A100, ships it, then gets asked to support AMD MI300X. They reach for a translation layer, the code compiles, the tests pass — and throughput collapses to a third of what the NVIDIA card delivered. Nothing is broken. The code is portable. The performance is not.

This is the misconception worth naming up front: a portable API gives you portable correctness, not portable performance. HIP will faithfully translate your CUDA calls to ROCm, and SYCL or oneAPI will give you a single-source path across Intel, NVIDIA, and AMD targets. What none of them translate is the set of hardware-specific assumptions baked into how your kernel touches memory, schedules work, and sizes its tiles. Those assumptions are where the speed lived, and they do not survive the trip.

What Does GPU Performance Portability Actually Require, Beyond a Portable API?

Performance portability is the property of a single codebase running well — not merely running — across multiple GPU architectures. The distinction matters because the two requirements are governed by different layers of your code.

API portability is a translation problem. It is solved by a layer that maps one vendor’s calls onto another’s: HIP maps CUDA to ROCm, SYCL compiles to multiple backends, and frameworks like PyTorch abstract the device entirely so the same model.to(device) call lands on whatever accelerator is present. This layer is mature, and for a large fraction of workloads it is genuinely good enough.

Performance portability is an algorithmic problem. The throughput of a GPU kernel is dominated by how well its memory-access pattern matches the target’s memory hierarchy, how its occupancy maps to the target’s register file and shared-memory budget, and whether its arithmetic intensity sits on the right side of that target’s roofline. These are properties of the algorithm and its data layout, not of the API. A translation layer cannot rewrite your tiling strategy or change your memory coalescing pattern — it only re-expresses the calls you already wrote.
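A minimal roofline sketch can make the "right side of the roofline" point concrete. The `Device` class, the two device entries, and every number below are illustrative assumptions, not vendor specifications. The point is only that one fixed arithmetic intensity can be compute-bound on one target and bandwidth-bound on another.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Hypothetical device model; the figures used below are placeholders."""
    name: str
    peak_gflops: float     # peak arithmetic throughput, GFLOP/s
    bandwidth_gbs: float   # peak memory bandwidth, GB/s

    @property
    def ridge_point(self) -> float:
        """Arithmetic intensity (FLOP/byte) where the two roofs meet."""
        return self.peak_gflops / self.bandwidth_gbs


def attainable_gflops(device: Device, flops_per_byte: float) -> float:
    """Roofline bound: the lower of the compute roof and the bandwidth roof."""
    return min(device.peak_gflops, device.bandwidth_gbs * flops_per_byte)


def is_memory_bound(device: Device, flops_per_byte: float) -> bool:
    """True when the kernel sits left of the ridge point on this device."""
    return flops_per_byte < device.ridge_point


# Placeholder devices: same bandwidth, different compute roofs.
device_a = Device("device_a", peak_gflops=10_000.0, bandwidth_gbs=1_000.0)
device_b = Device("device_b", peak_gflops=20_000.0, bandwidth_gbs=1_000.0)

kernel_intensity = 15.0  # FLOP per byte moved, a property of the algorithm

for dev in (device_a, device_b):
    bound = "memory" if is_memory_bound(dev, kernel_intensity) else "compute"
    print(dev.name, attainable_gflops(dev, kernel_intensity), bound)
```

With these placeholder numbers the same kernel is compute-bound on `device_a` and memory-bound on `device_b`, so the tuning that helps on one (more arithmetic efficiency) is not the tuning that helps on the other (less data movement).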

So the requirement is this: write code whose structure is hardware-aware in the abstract but not hardware-locked to a single vendor. That sounds paradoxical until you separate the two design surfaces. The choice of which target you reach is an API and toolchain question. The choice of how the algorithm is shaped is the question that determines whether the reached target runs fast. The second is where portability is won or lost, and it is closely related to the broader point that algorithmic restructuring often beats kernel micro-tuning for GPU speedups — the same lever, applied across vendors instead of within one.

Why Does CUDA Code Translated to ROCm or oneAPI Rarely Match Its NVIDIA Performance?

Because the original CUDA code was almost certainly tuned for NVIDIA, even if the author never thought of it that way. A few concrete divergences explain most of the gap we see in practice.

Wavefront vs warp width. NVIDIA executes in warps of 32 threads; AMD’s GCN and CDNA architectures execute in wavefronts of 64. A kernel whose block dimensions, shuffle operations, and branch-divergence assumptions are tuned around 32 will leave AMD lanes idle, or will pay for divergence at a coarser granularity than it was designed for. This is not a translation bug: HIP maps the warp-shuffle intrinsic across correctly. The granularity is simply wrong for the target.
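The granularity problem can be simulated on the host. The sketch below models a shuffle-down tree reduction across one warp or wavefront in plain Python; the function names are hypothetical and this is not device code. It shows what happens when an offset schedule written for 32 lanes runs on a 64-lane wave.

```python
def shuffle_down_reduce(lanes, first_offset):
    """Simulate a shuffle-down tree reduction across one warp/wavefront.

    At each step lane i adds the value held by lane i + offset, then the
    offset halves. Lanes whose partner is out of range keep their value.
    Returns the value lane 0 ends up holding.
    """
    values = list(lanes)
    width = len(values)
    offset = first_offset
    while offset > 0:
        # Read from a snapshot so every lane sees the pre-step values.
        snapshot = list(values)
        for i in range(width):
            if i + offset < width:
                values[i] = snapshot[i] + snapshot[i + offset]
        offset //= 2
    return values[0]


def wave_reduce(lanes):
    """Portable form: derive the first offset from the actual wave width."""
    return shuffle_down_reduce(lanes, first_offset=len(lanes) // 2)


wave64 = list(range(64))

# Offset schedule hard-coded for 32 lanes (16, 8, 4, 2, 1) on a 64-lane wave:
print(shuffle_down_reduce(wave64, first_offset=16))  # covers lanes 0..31 only

# Width-derived schedule (32, 16, 8, 4, 2, 1):
print(wave_reduce(wave64))                           # covers all 64 lanes
```

In this model the hard-coded schedule leaves lane 0 holding the sum of the first 32 lanes only, so the upper half of the wave needs a second pass. Deriving the first offset from the wave width removes that assumption without changing the algorithm.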

Shared-memory and register budgets. Occupancy on each architecture is bounded by per-block shared memory and per-thread registers, and the budgets differ between an NVIDIA SM and an AMD CU or an Intel Xe-core. A tile size that achieves high occupancy on one device can spill registers or exhaust shared memory on another, collapsing the number of concurrent blocks and starving the device of work to hide latency.
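A simplified occupancy estimate shows how the same launch configuration fits differently under two budgets. The `ComputeUnitBudget` class and its numbers are placeholders rather than real SM, CU, or Xe-core limits, and the model ignores allocation granularity and other hardware rules that a vendor occupancy calculator would account for.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComputeUnitBudget:
    """Hypothetical per-compute-unit limits; values below are placeholders."""
    shared_mem_bytes: int
    registers: int
    max_threads: int


def resident_blocks(budget, threads_per_block, regs_per_thread, smem_per_block):
    """Blocks that fit on one compute unit: the tightest of three limits."""
    by_threads = budget.max_threads // threads_per_block
    by_regs = budget.registers // (regs_per_thread * threads_per_block)
    if smem_per_block > 0:
        by_smem = budget.shared_mem_bytes // smem_per_block
    else:
        by_smem = by_threads  # shared memory is not a constraint
    return min(by_threads, by_regs, by_smem)


def occupancy(budget, threads_per_block, regs_per_thread, smem_per_block):
    """Fraction of the unit's thread slots kept resident by this kernel."""
    blocks = resident_blocks(budget, threads_per_block, regs_per_thread, smem_per_block)
    return blocks * threads_per_block / budget.max_threads


# Two placeholder budgets that differ only in shared memory.
unit_a = ComputeUnitBudget(shared_mem_bytes=96 * 1024, registers=65536, max_threads=2048)
unit_b = ComputeUnitBudget(shared_mem_bytes=64 * 1024, registers=65536, max_threads=2048)

# One kernel configuration: 256 threads, 32 registers each, 24 KiB of tile storage.
config = dict(threads_per_block=256, regs_per_thread=32, smem_per_block=24 * 1024)

print(occupancy(unit_a, **config))  # shared memory allows 4 blocks
print(occupancy(unit_b, **config))  # shared memory allows 2 blocks
```

Under these assumed budgets the tile that reaches half occupancy on one unit reaches a quarter on the other, with no change to the kernel source.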

Memory coalescing and cache line behaviour. Access patterns optimised for one memory subsystem (a stride that suits its memory channels, its L2 cache line size, the layout that keeps a load coalesced on NVIDIA) do not automatically coalesce identically elsewhere. A strided pattern that the NVIDIA cache hierarchy tolerated can become a bandwidth bottleneck on a different cache geometry.
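One way to reason about coalescing before touching hardware is to count how many cache lines a single warp or wavefront touches for a given stride. The helper names and the line sizes passed in below are assumptions for illustration; real line and transaction sizes should be taken from each vendor's documentation.

```python
def lines_touched(num_lanes, stride_elems, elem_bytes, line_bytes):
    """Count distinct cache lines hit when lane i reads element i * stride."""
    addresses = (lane * stride_elems * elem_bytes for lane in range(num_lanes))
    return len({addr // line_bytes for addr in addresses})


def load_efficiency(num_lanes, stride_elems, elem_bytes, line_bytes):
    """Useful bytes divided by bytes fetched, for one wave-wide load."""
    useful = num_lanes * elem_bytes
    fetched = lines_touched(num_lanes, stride_elems, elem_bytes, line_bytes) * line_bytes
    return useful / fetched


# 32 lanes reading 4-byte elements against an assumed 128-byte line.
for stride in (1, 2, 32):
    print(stride,
          lines_touched(32, stride, 4, 128),
          load_efficiency(32, stride, 4, 128))

# The same unit-stride pattern against a different assumed geometry:
# 64 lanes and a 64-byte line.
print(lines_touched(64, 1, 4, 64))
```

The access pattern is a property of the data layout, while the number of lines it costs depends on the lane count and line size passed in, which is why a layout needs re-validating per target rather than assuming it carries over.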

The net effect is consistent: in cross-platform porting work we have done, naively translated kernels typically land well below the original target’s throughput until the access patterns and launch geometry are re-tuned for the new hardware (observed across TechnoLynx GPU engagements; not a published benchmark). The fix is rarely rewriting the API calls — it is revisiting the algorithmic and memory-layout decisions that the API was hiding.

Which Algorithmic and Memory-Access Choices Keep GPU Code Performant Across NVIDIA, AMD, and Intel?

The choices that travel well share one trait: they are parameterised rather than hard-coded, so the shape of the computation can be re-fit to each target without restructuring the algorithm. A few patterns matter more than the rest.

| Design choice | Vendor-locked form (fragile) | Portable form (durable) |
| --- | --- | --- |
| Thread granularity | Hard-coded 32-wide assumptions | Block/tile sizes as compile-time or runtime parameters tuned per target |
| Tile / blocking size | One tile size picked for one device’s shared memory | Tile size parameterised against the target’s shared-memory budget |
| Memory layout | Layout chosen for one cache geometry | Layout chosen for coalescing as a property, re-validated per target |
| Reductions / scans | Warp-32 shuffle reduction | Reduction expressed over a configurable wavefront width |
| Library reliance | Hand-rolled GEMM tuned for one GPU | Vendor BLAS (cuBLAS / rocBLAS / oneMKL) behind a stable interface |

The single most reliable portability strategy is the last row: lean on vendor-tuned libraries wherever the workload allows. cuBLAS, rocBLAS, and oneMKL each ship hand-optimised kernels for their own hardware, and a codebase that routes its dense linear algebra through a thin interface over all three inherits each vendor’s tuning for free. The portability work then shrinks to the custom kernels that have no library equivalent — and those are exactly the kernels where parameterised tiling and configurable granularity pay off.
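The "thin interface" idea can be sketched as a small backend registry. Everything here is hypothetical: the registry, the backend names, and the plain-Python `reference_gemm` that stands in for a vendor library. No real cuBLAS, rocBLAS, or oneMKL binding is shown, because which bindings a project uses is outside what this article specifies.

```python
from typing import Callable, Dict, List

Matrix = List[List[float]]
GemmFn = Callable[[Matrix, Matrix], Matrix]

# Hypothetical registry: backend name -> GEMM implementation.
_BACKENDS: Dict[str, GemmFn] = {}


def register_backend(name: str, fn: GemmFn) -> None:
    """Make an implementation selectable by name."""
    _BACKENDS[name] = fn


def reference_gemm(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix multiply; a stand-in for a vendor library call."""
    inner = len(b)
    cols = len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def gemm(a: Matrix, b: Matrix, backend: str = "reference") -> Matrix:
    """Stable entry point: callers never name a vendor library directly."""
    try:
        impl = _BACKENDS[backend]
    except KeyError:
        raise ValueError(f"no GEMM backend registered for {backend!r}") from None
    return impl(a, b)


register_backend("reference", reference_gemm)

# In a real build, each vendor wrapper would be registered the same way,
# and "reference" would remain as the correctness oracle for tests.
print(gemm([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Keeping a slow reference implementation behind the same interface is a deliberate choice in this sketch: it gives every vendor backend a common result to be validated against.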

This is the central insight applied to a multi-vendor surface: algorithmic and data-layout choices, not API calls, determine GPU performance. The same principle that says restructure the algorithm before you tune the kernel on a single device says parameterise the algorithm before you translate the API across devices.
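As a small illustration of parameterising rather than hard-coding, a tile size can be derived from a shared-memory budget instead of being written as a constant. The function name, the two-buffer assumption, and the multiple-of-16 constraint are all choices made for this sketch, and the budgets passed in are example values rather than any device's real limit.

```python
def largest_square_tile(smem_budget_bytes, elem_bytes, buffers=2, multiple_of=16):
    """Largest tile edge T (a multiple of `multiple_of`) such that
    `buffers` tiles of T x T elements fit in the shared-memory budget."""
    best = 0
    tile = multiple_of
    while buffers * tile * tile * elem_bytes <= smem_budget_bytes:
        best = tile
        tile += multiple_of
    if best == 0:
        raise ValueError("budget too small for even the smallest tile")
    return best


# Same algorithm, three assumed budgets, fp32 elements, two staging buffers.
for budget_kib in (16, 48, 64):
    print(budget_kib, largest_square_tile(budget_kib * 1024, elem_bytes=4))
```

A derived value like this is a starting point for a per-target tuning sweep, not a replacement for one, since the largest tile that fits is not necessarily the fastest.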

What Is the Realistic Engineering Cost of Supporting Multiple GPU Vendors?

This is a decision, not a default, and it should be costed honestly. Supporting a second vendor is not free, and supporting it badly — translating without re-tuning — produces code that runs everywhere and runs fast nowhere.

The cost has three components, and they compound:

  • Re-tuning, not re-writing. If the algorithm was parameterised from the start, the per-vendor cost is a tuning sweep: finding the tile sizes, launch geometry, and occupancy targets that fit each architecture. If it was hard-coded to NVIDIA, the cost is a partial rewrite.
  • A second validation surface. Every target needs its own correctness and performance test on real hardware. Numerical results can diverge subtly across architectures, and a regression on one vendor must be caught before it ships.
  • Ongoing drift. New silicon, new driver versions, and new library releases all move the performance baseline. A multi-vendor codebase carries that maintenance on every axis it supports.
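The first two components can be sketched together as a tuning sweep with a correctness gate. This is a host-side toy under stated assumptions: `chunked_sum` stands in for a kernel whose summation order depends on launch geometry, and `toy_run` returns an invented cost where a real sweep would return a time measured on the target hardware.

```python
import itertools
import math


def chunked_sum(values, chunk):
    """Sum in chunks of `chunk` elements, mimicking a reduction whose
    grouping (and therefore rounding) depends on the launch geometry."""
    partials = [sum(values[i:i + chunk]) for i in range(0, len(values), chunk)]
    return sum(partials)


def sweep(candidates, run, reference, rel_tol=1e-9):
    """Try every configuration and keep the cheapest one whose result
    matches the reference within tolerance. `run(config)` must return
    a (result, cost) pair."""
    best = None
    for config in candidates:
        result, cost = run(config)
        if not math.isclose(result, reference, rel_tol=rel_tol):
            continue  # fast but wrong is not a candidate
        if best is None or cost < best[1]:
            best = (config, cost)
    if best is None:
        raise RuntimeError("no configuration passed validation")
    return best


data = [0.1] * 1000
reference = math.fsum(data)  # accurately rounded sum, used as the oracle


def toy_run(config):
    """Stand-in for launching and timing a kernel; the cost is invented."""
    chunk, tile = config
    result = chunked_sum(data, chunk)
    cost = abs(chunk - 64) + abs(tile - 32)  # placeholder for measured time
    return result, cost


candidates = itertools.product([32, 64, 128], [16, 32, 64])
print(sweep(candidates, toy_run, reference))
```

Comparing against a reference with a relative tolerance, rather than demanding bitwise equality, is the design choice that matters here: different groupings of a floating-point sum may differ in the last bits while both being correct.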

Against those costs sits the alternative: writing single-vendor code and paying the migration as a lump sum later, when a customer or a procurement constraint forces a second platform. We have seen this play out in real porting work — the V-Nova engagement porting GPU video processing from OpenCL to Metal was precisely this kind of cross-target performance characterisation, where the API translation was the easy part and matching the original throughput on the new target was the engineering. The honest framing: parameterising for portability up front is cheap insurance; retrofitting it after a vendor-locked codebase has accreted is the expensive path.

There is no universal right answer here. A team that will only ever ship on one cloud’s NVIDIA instances should not pay for portability it will never use. A team whose customers run mixed fleets, or whose procurement is exposed to GPU supply volatility, is buying optionality whose value becomes measurable when it is needed.

How Do I Structure a GPU Codebase So Future Migrations Are Not Full Rewrites?

The structural answer is to make the vendor a parameter of the build, not an assumption woven through the kernels. A maturity rubric we use to assess where a codebase sits:

  1. Single-target, hard-coded. Warp width, tile sizes, and memory layout assume one vendor. A migration is a rewrite. This is fine if you will never migrate.
  2. Single-target, parameterised. The algorithm runs on one vendor today, but its granularity and tiling are parameters, not constants. A migration is a tuning project, not a rewrite — the cheapest place to be if you might need a second vendor later.
  3. Multi-target via abstraction. A device-abstraction layer (SYCL, a HIP/CUDA shared source, or a framework like PyTorch handling the device) compiles to multiple backends from one source. Custom kernels are isolated behind interfaces.
  4. Multi-target, per-vendor tuned. The codebase compiles everywhere and carries validated per-vendor tuning profiles, with performance regression tests on each target’s real hardware.
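Levels 2 and 4 of the rubric both rest on keeping tuning values out of the kernels. A minimal sketch of that separation follows; the `LaunchProfile` fields, the target names, and every value in `PROFILES` are hypothetical placeholders, not validated settings for any real device.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LaunchProfile:
    """Tuning values a kernel reads instead of hard-coding (hypothetical fields)."""
    wave_width: int
    tile: int
    threads_per_block: int

    def __post_init__(self):
        # Reject profiles that would leave a partially filled wave per block.
        if self.threads_per_block % self.wave_width:
            raise ValueError("threads_per_block must be a multiple of wave_width")


# Placeholder profiles keyed by build target; values are illustrative only.
PROFILES = {
    "target_32": LaunchProfile(wave_width=32, tile=64, threads_per_block=256),
    "target_64": LaunchProfile(wave_width=64, tile=32, threads_per_block=256),
}


def profile_for(target: str) -> LaunchProfile:
    """Look up the tuning profile for a target, failing loudly if none exists."""
    try:
        return PROFILES[target]
    except KeyError:
        raise KeyError(f"no validated tuning profile for target {target!r}") from None


def waves_per_block(profile: LaunchProfile) -> int:
    """Derived quantity: computed from the profile, never assumed."""
    return profile.threads_per_block // profile.wave_width


for name in PROFILES:
    print(name, waves_per_block(profile_for(name)))
```

Failing loudly on an unknown target is intentional in this sketch: silently falling back to another vendor's profile would reproduce the "runs everywhere, fast nowhere" outcome described earlier.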

Most teams should aim for level 2 even when they only ship on one vendor, because the marginal cost is small and it removes the rewrite cliff. The same parameterisation discipline shows up when porting inference off the GPU entirely. Deciding when porting Python inference to C++ or WASM earns its engineering cost follows the same logic: isolate the hot path behind a stable interface, then move and re-tune it deliberately rather than dragging hidden assumptions along with the translation.

Cross-platform performance characterisation — measuring where a translated kernel actually lands on each target’s roofline, and what re-tuning closes the gap — is one dimension of a GPU performance audit. It is the dimension that tells you whether your “portable” code is portable in name or in throughput.

FAQ

What does GPU performance portability actually require, beyond a portable API?

It requires algorithmic and data-layout choices that are hardware-aware but not locked to one vendor. A portable API delivers portable correctness — the code compiles and runs on multiple targets — but performance is governed by memory-access patterns, occupancy, and tiling, which the API does not translate. Performance portability is won by parameterising those structural choices so they can be re-fit per target.

Why does CUDA code translated to ROCm or oneAPI rarely match its NVIDIA performance?

Because the original CUDA code was tuned for NVIDIA’s hardware characteristics, often implicitly. NVIDIA executes in 32-thread warps while AMD uses 64-wide wavefronts; shared-memory and register budgets differ across SM, CU, and Xe-core; and memory coalescing depends on each cache geometry. Translation faithfully re-expresses the API calls but cannot rewrite tile sizes, launch geometry, or access patterns, so throughput drops until those are re-tuned.

Which algorithmic and memory-access choices keep GPU code performant across NVIDIA, AMD, and Intel?

Parameterise thread granularity, tile sizes, and memory layout rather than hard-coding them, so each can be re-fit to a target’s wavefront width and shared-memory budget. Express reductions over a configurable wavefront width instead of assuming warp-32. Most reliably, route dense linear algebra through vendor-tuned libraries — cuBLAS, rocBLAS, oneMKL — behind a stable interface, which inherits each vendor’s tuning for free.

What is the realistic engineering cost of supporting multiple GPU vendors in a single accelerated-computing stack?

The cost is re-tuning per vendor (cheap if the algorithm was parameterised, a partial rewrite if it was hard-coded), a second validation surface that needs real hardware for both correctness and performance, and ongoing drift as silicon, drivers, and libraries change. Parameterising for portability up front is cheap insurance; retrofitting it onto a vendor-locked codebase later is the expensive path.

How do I structure a GPU codebase so future hardware migrations are not full rewrites?

Make the vendor a build parameter rather than an assumption woven through the kernels. Aim for at least a parameterised single-target codebase — granularity and tiling as parameters, not constants — so a migration becomes a tuning project rather than a rewrite. Isolate custom kernels behind interfaces and, where possible, compile to multiple backends via SYCL, shared HIP/CUDA source, or a framework that handles the device.

The question that decides everything else is not “can this code run on another vendor’s GPU?” — translation layers answer that cheaply. It is “have we written the algorithm so that reaching a new target is a tuning sweep, or a rewrite?” That answer is set the day the first kernel is written, not the day the second vendor is requested.
