Why is portable performance so much harder than portable code?

An OpenCL or SYCL kernel compiles and runs on NVIDIA, AMD, and Intel GPUs. The same source code, multiple hardware targets, one binary per target. This is source-level portability, and it generally works: the code compiles and produces correct results. What does not work is the assumption that a kernel which performs well on one GPU architecture will perform comparably on another. A kernel optimised for NVIDIA’s warp size, shared memory layout, and cache hierarchy may achieve roughly 60% of peak on an NVIDIA A100 and only 15% of peak on an AMD MI300X, not because the AMD hardware is weaker, but because the kernel’s implementation choices are mismatched to AMD’s compute unit architecture.

Cross-platform GPU performance portability, meaning competitive (not identical) performance across multiple GPU architectures from the same codebase, is an engineering discipline. It requires understanding the architectural differences between GPU vendors, designing abstractions that expose tuning parameters per architecture, and maintaining per-target optimisation configurations within a single source tree.

Benchmark suites paint a consistent picture, though specific numbers vary by workload and configuration. SYCL code typically achieves near-native performance on the vendor’s own hardware (Intel GPUs via DPC++), but published results on other vendors’ hardware via translation backends are typically lower, depending on kernel complexity. The HPC community has developed performance portability metrics that consistently show indices well below 1.0 for cross-vendor workloads. In our experience across cross-platform GPU engagements, achieving 40–80% of peak across architectures is typical for scientific computing workloads that have not been explicitly tuned per target; that is an observed pattern from project work, not a universal industry rate.
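
To make the idea of a portability index concrete, here is a minimal sketch of one widely cited definition: the harmonic mean of per-platform efficiencies, which drops to zero if any platform is unsupported. Whether this is the exact metric behind the results mentioned above is an assumption, and the function name performance_portability is ours, not part of any library.

```cpp
#include <cassert>
#include <vector>

// Harmonic mean of per-platform efficiencies, each expressed as a fraction
// of peak in (0, 1]. An efficiency of 0 marks a platform where the code does
// not run, in which case the index is defined as 0.
double performance_portability(const std::vector<double>& efficiencies) {
    assert(!efficiencies.empty());
    double sum_of_inverses = 0.0;
    for (double e : efficiencies) {
        if (e <= 0.0) return 0.0;  // unsupported platform
        sum_of_inverses += 1.0 / e;
    }
    return static_cast<double>(efficiencies.size()) / sum_of_inverses;
}
```

With the illustrative 60% and 15% figures from the opening example, performance_portability({0.60, 0.15}) gives 0.24, well below the arithmetic mean of 0.375. The harmonic mean is pulled toward the worst platform, which is why a single badly matched target dominates the index.
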
The actual portability index for a specific workload depends on its memory access patterns, parallelism structure, and use of hardware-specific features.

Where the architectures diverge

The fundamental parallel processing model is similar across vendors: thousands of lightweight threads grouped into execution units, a hierarchy of memory (registers, shared or local memory, caches, global memory), and hardware schedulers that manage thread execution. The specifics diverge in ways that affect kernel performance.

Execution width. NVIDIA GPUs execute threads in warps of 32. AMD GPUs execute in wavefronts of 32 or 64 (RDNA uses 32, CDNA uses 64). Intel Arc GPUs use subgroups of 8, 16, or 32 depending on the configuration. Code that assumes a specific execution width (using warp-shuffle operations, warp-level reductions, or warp-synchronous programming) will produce incorrect results or degraded performance on hardware with a different width.

Memory hierarchy. NVIDIA GPUs have a configurable L1/shared memory split per streaming multiprocessor. AMD GPUs have a fixed LDS (Local Data Share) per compute unit with a separate L1 cache. Intel GPUs have a shared local memory with a different capacity and access latency profile. A kernel tuned for NVIDIA’s shared memory size (per NVIDIA’s published specifications, 48 KB per block by default, with larger opt-in configurations such as 96 KB on Volta and up to 164 KB per SM on A100) may not fit in AMD’s LDS or may underutilise Intel’s shared local memory.

Scheduling model. NVIDIA hardware schedules warps at the SM level with implicit warp synchrony within a scheduling group. AMD hardware schedules wavefronts with a different latency hiding strategy. Intel hardware uses a thread-level scheduling model within the EU (Execution Unit). Kernels that rely on implicit scheduling assumptions (such as assuming warps within a block execute in a particular order) may behave differently across vendors.

Instruction set and throughput.
FMA throughput, special function unit availability, and integer-to-floating-point throughput ratios differ across architectures. An optimised kernel on NVIDIA may use intrinsics (for example, __shfl_down_sync, __ballot_sync) whose 32-lane semantics do not map one-to-one onto AMD or Intel hardware, requiring alternative implementations.

The API choice between CUDA, OpenCL, and SYCL determines which of these differences are visible to the programmer and which are abstracted away. CUDA exposes NVIDIA-specific features directly. OpenCL and SYCL abstract the hardware, but the abstraction does not eliminate the performance impact of architectural differences; it only hides the details that cause them.

Architecture-aware abstraction: the design pattern

The design pattern for portable performance is architecture-aware abstraction: a codebase that uses compile-time or runtime configuration to adapt kernel parameters to the target architecture, without duplicating the core algorithm.

Parameterised tile sizes. Instead of hard-coding a tile size of 32×32 (a common starting point for NVIDIA shared memory), the tile size becomes a compile-time parameter set per target. The following are illustrative configuration values, not benchmarked optima: 32×32 for NVIDIA, 64×16 for AMD CDNA to match the 64-wide wavefront, 16×16 for Intel. The core algorithm is identical; the tiling parameters are architecture-specific.

Execution width abstraction. Instead of calling warp-specific intrinsics, the codebase uses a subgroup abstraction that maps to warps on NVIDIA, wavefronts on AMD, and subgroups on Intel. SYCL provides this natively through sycl::sub_group. OpenCL exposes subgroup operations through the cl_khr_subgroups extension (available from OpenCL 2.0 and part of the core specification from 2.1). The abstraction layer dispatches to the vendor-specific implementation at compile time.

Memory configuration strategy. The kernel’s shared memory usage is parameterised to fit within the target architecture’s shared memory capacity. On NVIDIA hardware with configurable L1/shared memory, the configuration is set to maximise shared memory.
On AMD with fixed LDS, the kernel’s shared memory allocation is constrained to the LDS capacity. The allocation strategy is a per-target build configuration, not a runtime decision.

Kernel selection. For operations where the optimal algorithm differs across architectures (not just the parameters, but the approach), the codebase maintains multiple kernel implementations and selects the appropriate one based on the target. A reduction operation might use warp shuffles on NVIDIA, wavefront operations on AMD, and a different tree reduction on Intel. The selection logic is part of the build system, not the runtime.

How do you test for portable performance, not just correctness?

Portable performance requires testing on every target platform. In our experience across cross-platform GPU engagements, a kernel that compiles and produces correct results on all targets may perform roughly 3× slower on one target than another (an observed pattern, not a guaranteed outcome), and the difference is invisible in correctness testing.

We recommend a performance testing infrastructure that:

- Runs a benchmark suite on each target architecture after every code change that affects kernel code.
- Compares achieved throughput against architecture-specific performance targets, expressed as a fraction of theoretical peak rather than absolute numbers, because absolute peak differs across architectures.
- Flags regressions on any target architecture, even when other targets improve.

This infrastructure is the practical enforcement of portability. Without it, the codebase drifts toward optimisation for the developer’s primary test hardware and regresses elsewhere, because the regression is invisible without measurement.

A reproducible cross-platform benchmark protocol

The following protocol produces fair, reproducible performance comparisons across GPU architectures. We use this structure in our own cross-platform validation work.

Select a representative workload and fix the inputs.
Choose a single model and dataset that reflect the production workload (for example, ResNet-50 inference on a 1,000-image validation set, or a GEMM of fixed dimensions). Use identical input data across all platforms; store it in a portable format (NumPy .npy or raw binary) to eliminate data-loading variance.

Pin the software environment per target. Record the exact compiler, driver, and runtime versions for each platform, for example CUDA 12.4 with cuDNN 9.1 on NVIDIA, ROCm 6.1 with MIOpen on AMD, and oneAPI 2024.1 on Intel. Use containers (Docker or Singularity) to freeze the environment so the benchmark is reproducible months later.

Run warm-up iterations and discard them. Execute 10–50 warm-up iterations before measurement to ensure caches are populated, JIT compilation has completed, and GPU clocks have stabilised. Do not include warm-up iterations in reported timing.

Collect timing over a fixed number of measurement iterations. Run at least 100 measurement iterations (more for sub-millisecond kernels). Use device-side timing (CUDA Events, hipEvent_t, or SYCL profiling events) rather than host-side wall-clock timing, to exclude host overhead and driver latency.

Report median and percentile statistics, not the mean. Record the median, 5th percentile, and 95th percentile of per-iteration times. The median is robust to outliers from thermal throttling or OS scheduling interference. Report the coefficient of variation as well. As a planning heuristic from our engagements, if it exceeds 5%, investigate the source of variance before drawing conclusions.

Normalise results to each platform’s theoretical peak. Express performance as a percentage of each GPU’s theoretical peak throughput (compute TFLOPS or memory bandwidth GB/s, whichever is the binding resource). This separates optimisation quality from raw hardware capability.
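
The warm-up, statistics, and normalisation steps described so far can be sketched as a small host-side post-processing routine. This is an illustration under stated assumptions, not our production tooling: the per-iteration times are assumed to have been collected already with device-side events, the nearest-rank percentile rule is one choice among several, and the names TimingSummary, summarise, and fraction_of_peak are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Summary statistics for one benchmark run on one platform.
struct TimingSummary {
    double median_ms;
    double p5_ms;
    double p95_ms;
    double coeff_of_variation;  // sample standard deviation / mean
};

// Nearest-rank percentile of an ascending-sorted sample; q is in (0, 1].
double percentile_sorted(const std::vector<double>& sorted, double q) {
    assert(!sorted.empty() && q > 0.0 && q <= 1.0);
    auto rank = static_cast<std::size_t>(
        std::ceil(q * static_cast<double>(sorted.size())));
    return sorted[std::max<std::size_t>(rank, 1) - 1];
}

// Drops the first `warmup` iterations, then summarises the remainder.
TimingSummary summarise(const std::vector<double>& all_ms, std::size_t warmup) {
    assert(all_ms.size() >= warmup + 2);  // need at least two measured samples
    std::vector<double> ms(all_ms.begin() + static_cast<std::ptrdiff_t>(warmup),
                           all_ms.end());
    std::sort(ms.begin(), ms.end());

    const std::size_t n = ms.size();
    const double median =
        (n % 2 == 1) ? ms[n / 2] : 0.5 * (ms[n / 2 - 1] + ms[n / 2]);

    double mean = 0.0;
    for (double t : ms) mean += t;
    mean /= static_cast<double>(n);

    double squared_dev = 0.0;
    for (double t : ms) squared_dev += (t - mean) * (t - mean);
    const double stddev = std::sqrt(squared_dev / static_cast<double>(n - 1));

    return {median, percentile_sorted(ms, 0.05), percentile_sorted(ms, 0.95),
            stddev / mean};
}

// Achieved throughput as a fraction of the platform's theoretical peak.
// Both arguments must use the same unit (TFLOPS or GB/s).
double fraction_of_peak(double achieved, double theoretical_peak) {
    assert(theoretical_peak > 0.0);
    return achieved / theoretical_peak;
}
```

A CI job could call summarise on each platform's timings, compare coeff_of_variation against the 5% heuristic, and store fraction_of_peak per target as the baseline for regression checks.
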
In our experience across cross-platform GPU engagements, a kernel achieving roughly 70% of peak on both NVIDIA and AMD is well-optimised on both (an observed pattern, not a guaranteed outcome), even when absolute throughput differs.

Automate and version-control the entire protocol. Store the benchmark script, environment specifications, input data checksums, and result-parsing logic in version control alongside the kernel source. Use a CI job (for example, GitHub Actions with self-hosted runners per GPU type) to run the benchmark on every kernel code change and flag regressions against per-target performance baselines.

When is portability not worth the cost?

Cross-platform performance portability has real engineering costs. The abstraction layer adds complexity, per-target tuning adds maintenance burden, and multi-platform testing adds infrastructure cost. These costs are justified when the software must run on multiple GPU vendors: products deployed on customer hardware, cloud workloads that span providers, or organisations with mixed GPU fleets.

When the workload runs exclusively on one GPU vendor’s hardware (the case for most deep learning training on NVIDIA, most single-cloud deployments, and most embedded systems), the portability investment is waste. CUDA on NVIDIA, HIP on AMD, and oneAPI on Intel each provide deeper optimisation on their respective platforms than any cross-platform abstraction can match.

We have seen teams invest months in cross-platform portability for workloads that ended up running exclusively on NVIDIA hardware for the project’s entire lifetime. We have also seen teams hard-code CUDA-specific optimisations and later discover that a customer requirement or cloud migration demanded AMD or Intel support, at which point the algorithmic restructuring and kernel tuning decisions had to be revisited for a different architecture.
The portability decision should be made explicitly, based on the realistic hardware scope for the workload’s deployment lifetime, not on philosophical preference for open standards or vendor neutrality.

Without upfront hardware-scope analysis, teams risk either wasted portability investment or costly late-stage rewrites; a GPU Performance Audit quantifies both risks before the architecture is locked in.

FAQ

What does GPU performance portability actually require, beyond a portable API?

A portable API (OpenCL, SYCL, HIP) only guarantees that source compiles and runs across vendors. Performance portability additionally requires architecture-aware abstraction: parameterised tile sizes, subgroup abstractions over warps and wavefronts, per-target shared-memory configurations, and sometimes alternate kernel implementations chosen at build time. The API hides the syntax; the abstractions handle the performance.

Why does CUDA code translated to ROCm or oneAPI rarely match its NVIDIA performance?

Translation preserves semantics, not implementation choices. A CUDA kernel encodes NVIDIA-specific assumptions: 32-wide warps, configurable L1/shared split, warp-synchronous intrinsics, NVIDIA’s scheduling model. Translating to HIP or SYCL produces correct code, but the encoded assumptions remain mismatched to AMD’s wavefronts and LDS or Intel’s EU model. Recovering performance generally requires retuning the kernel for the target architecture, not just translating it.

Which algorithmic and memory-access choices keep GPU code performant across NVIDIA, AMD, and Intel?

Choices that hold up well across vendors include parameterised tile and block sizes, subgroup-level primitives instead of warp-specific intrinsics, coalesced global memory access patterns that do not depend on a specific cache line size, and shared-memory usage bounded by the smallest target’s capacity.
Choices that travel poorly include warp-synchronous programming, hard-coded execution widths, and reliance on vendor-specific instructions without alternative paths.

What is the realistic engineering cost of supporting multiple GPU vendors in a single accelerated-computing stack?

The cost has three components: the abstraction layer (one-time design plus ongoing maintenance), per-target tuning (recurring whenever new architectures arrive), and multi-platform CI infrastructure (self-hosted runners per GPU type). For teams whose deployment scope genuinely spans vendors, this is overhead that pays back. For teams whose deployment is single-vendor for the project lifetime, it is unrecovered investment, which is why the portability decision should be made explicitly against deployment scope, not by default.

How do I structure a GPU codebase so future hardware migrations are not full rewrites?

Keep the core algorithm vendor-neutral and push hardware-specific choices into compile-time configuration and a thin dispatch layer. Use subgroup abstractions rather than warp intrinsics, parameterise tile sizes and shared-memory budgets, and maintain alternate kernels only where the optimal approach genuinely differs by architecture. Combine this with per-target performance baselines in CI so regressions on any vendor are visible, not silent.
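
As a rough illustration of that structure, the sketch below keeps one width-agnostic reduction and moves every hardware-specific number into a per-target configuration resolved at compile time. It is plain host-side C++ that only models the pattern: the target tags, KernelConfig, kSharedBudgetBytes, and subgroup_sum are hypothetical names, the tile sizes are the illustrative values quoted earlier, the Intel subgroup width of 16 is one assumed choice among the supported widths, and the 32 KB budget is a placeholder rather than a vendor figure. In real device code the inner loop would be a subgroup shuffle or a group reduction.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical target tags. A real build would pick one through a compiler
// flag or build-system option rather than naming it at each call site.
struct NvidiaTarget {};
struct AmdCdnaTarget {};
struct IntelTarget {};

// Per-target tuning parameters. The primary template is left undefined so an
// unconfigured target fails at compile time.
template <typename Target> struct KernelConfig;

template <> struct KernelConfig<NvidiaTarget> {
    static constexpr int tile_rows = 32, tile_cols = 32, subgroup_width = 32;
};
template <> struct KernelConfig<AmdCdnaTarget> {
    static constexpr int tile_rows = 64, tile_cols = 16, subgroup_width = 64;
};
template <> struct KernelConfig<IntelTarget> {
    static constexpr int tile_rows = 16, tile_cols = 16, subgroup_width = 16;
};

// Placeholder shared-memory budget (an assumption, not a vendor figure).
constexpr std::size_t kSharedBudgetBytes = 32 * 1024;

// Bytes needed to hold one float tile for the given target.
template <typename Target>
constexpr std::size_t tile_bytes() {
    return sizeof(float) * KernelConfig<Target>::tile_rows *
           KernelConfig<Target>::tile_cols;
}

// Fail the build, not the benchmark, if a tile outgrows the budget.
static_assert(tile_bytes<NvidiaTarget>() <= kSharedBudgetBytes, "tile too large");
static_assert(tile_bytes<AmdCdnaTarget>() <= kSharedBudgetBytes, "tile too large");
static_assert(tile_bytes<IntelTarget>() <= kSharedBudgetBytes, "tile too large");

// Host-side stand-in for a shuffle-down tree reduction across one subgroup.
// The algorithm never mentions 32 or 64; the width comes from the config.
template <typename Target>
float subgroup_sum(std::vector<float> lanes) {
    constexpr int width = KernelConfig<Target>::subgroup_width;
    assert(lanes.size() == static_cast<std::size_t>(width));
    for (int offset = width / 2; offset > 0; offset /= 2) {
        for (int lane = 0; lane < offset; ++lane) {
            lanes[lane] += lanes[lane + offset];  // lane reads its partner
        }
    }
    return lanes[0];
}
```

Adding a new architecture in this layout means adding one KernelConfig specialisation and one static_assert; the reduction itself is untouched, which is the property that keeps a migration from turning into a rewrite.
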