Why is portable performance so much harder than portable code?

An OpenCL or SYCL kernel compiles and runs on NVIDIA, AMD, and Intel GPUs: the same source code, multiple hardware targets, one binary per target. This is source-level portability, and it generally works. The code compiles and produces correct results. What does not work is the assumption that a kernel which performs well on one GPU architecture will perform comparably on another. A kernel optimised for NVIDIA’s warp size, shared memory architecture, and cache hierarchy may achieve 60% of peak on an NVIDIA A100 and 15% of peak on an AMD MI300X, not because the AMD hardware is weaker, but because the kernel’s implementation choices are mismatched to AMD’s compute unit architecture.

Cross-platform GPU performance portability, meaning competitive (not identical, but competitive) performance across multiple GPU architectures from the same codebase, is an engineering discipline. It requires understanding the architectural differences between GPU vendors, designing abstractions that expose tuning parameters per architecture, and maintaining per-target optimisation configurations within a single source tree.

Benchmark suites paint a consistent picture, though the specific numbers vary by workload and configuration. SYCL code typically achieves near-native performance on the vendor’s own hardware (Intel GPUs via DPC++), but published results on other vendors’ hardware via translation backends are typically lower, depending on kernel complexity. The HPC community has developed performance portability metrics that consistently show indices well below 1.0 for cross-vendor workloads. In our experience across cross-platform GPU engagements, achieving 40–80% of peak across architectures is typical for scientific computing workloads that have not been explicitly tuned per target (an observed range, not a universal industry rate).
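
One formulation often used for such metrics is the harmonic mean of per-platform efficiencies, with the index defined as zero if the application fails to run on any platform in the set. The following is a minimal sketch of that idea, assuming "efficiency" means achieved performance as a fraction of each platform's peak. The function name is illustrative and not taken from any particular library.

```cpp
#include <cassert>
#include <vector>

// Harmonic-mean performance portability index over a set of platforms.
// Each entry is an efficiency in (0, 1], for example achieved throughput
// divided by the platform's theoretical peak. An entry <= 0 marks a
// platform where the kernel does not run, which forces the index to 0.
double performance_portability(const std::vector<double>& efficiencies) {
    assert(!efficiencies.empty() && "need at least one target platform");
    double inverse_sum = 0.0;
    for (double e : efficiencies) {
        if (e <= 0.0) return 0.0;  // unsupported somewhere: index is zero
        inverse_sum += 1.0 / e;
    }
    return static_cast<double>(efficiencies.size()) / inverse_sum;
}
```

For the illustrative 60% and 15% figures above, `performance_portability({0.60, 0.15})` evaluates to 0.24, noticeably below the arithmetic mean of 0.375. The harmonic mean is dominated by the worst-served target, which is why a single poorly tuned platform pulls the index well below 1.0.
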
These ranges are indicative only: the actual portability index for a specific workload depends on its memory access patterns, parallelism structure, and use of hardware-specific features.

Where the architectures diverge

The fundamental parallel processing model is similar across vendors: thousands of lightweight threads grouped into execution units, a hierarchy of memory (registers, shared/local memory, caches, global memory), and hardware schedulers that manage thread execution. The specifics diverge in ways that affect kernel performance:

Execution width. NVIDIA GPUs execute threads in warps of 32. AMD GPUs execute in wavefronts of 32 or 64 depending on the architecture (RDNA is natively 32-wide, CDNA is 64-wide). Intel Arc GPUs use subgroups of 8, 16, or 32 depending on the configuration. Code that assumes a specific execution width (through warp-shuffle operations, warp-level reductions, or warp-synchronous programming) will produce incorrect results or degraded performance on hardware with a different width.

Memory hierarchy. NVIDIA GPUs have a configurable L1/shared memory split per streaming multiprocessor. AMD GPUs have a fixed LDS (Local Data Share) per compute unit with a separate L1 cache. Intel GPUs have a shared local memory with a different capacity and access latency profile. A kernel tuned for one of NVIDIA’s larger shared memory configurations (96KB rather than 48KB, for example) will not fit in AMD’s fixed 64KB LDS, and a kernel sized for the smaller configuration may underutilise Intel’s shared local memory.

Scheduling model. NVIDIA hardware schedules warps at the SM level, historically with implicit warp synchrony within a scheduling group (Volta and later schedule threads independently, so this synchrony is no longer guaranteed even on NVIDIA). AMD hardware schedules wavefronts with a different latency hiding strategy. Intel hardware uses a thread-level scheduling model within the EU (Execution Unit). Kernels that rely on implicit scheduling assumptions, such as assuming warps within a block execute in a particular order, may behave differently across vendors.

Instruction set and throughput.
FMA throughput, special function unit availability, and integer-to-floating-point throughput ratios differ across architectures. An optimised kernel on NVIDIA may use intrinsics (e.g., __shfl_down_sync, __ballot_sync) that do not carry over unchanged to AMD or Intel, requiring alternative implementations.

The API choice between CUDA, OpenCL, and SYCL determines which of these differences are visible to the programmer and which are abstracted away. CUDA exposes NVIDIA-specific features directly. OpenCL and SYCL abstract the hardware, but the abstraction does not eliminate the performance impact of architectural differences; it only hides the details that cause them.

Architecture-aware abstraction: the design pattern

The design pattern for portable performance is architecture-aware abstraction: a codebase that uses compile-time or runtime configuration to adapt kernel parameters to the target architecture, without duplicating the core algorithm.

Parameterised tile sizes. Instead of hard-coding a tile size of 32×32 (optimal for NVIDIA shared memory), the tile size is a compile-time parameter that is set per target: 32×32 for NVIDIA, 64×16 for AMD CDNA (to match the 64-wide wavefront), 16×16 for Intel. These are configuration values, not a benchmarked rate. The core algorithm is identical; the tiling parameters are architecture-specific.

Execution width abstraction. Instead of calling warp-specific intrinsics (__shfl_down_sync on NVIDIA), the codebase uses a subgroup abstraction that maps to warps on NVIDIA, wavefronts on AMD, and subgroups on Intel. SYCL provides this abstraction natively through sycl::sub_group. OpenCL exposes subgroup operations through the cl_khr_subgroups extension in 2.0 and as a core feature in 2.1 (optional again in 3.0). The abstraction layer dispatches to the vendor-specific implementation at compile time.

Memory configuration strategy. The kernel’s shared memory usage is parameterised to fit within the target architecture’s shared memory capacity.
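
As a minimal sketch of how the tile-size, execution-width, and shared-memory parameters described here might live in one per-target table, the plain C++ below checks at compile time that the staged tiles fit the target's budget. The names `Target`, `KernelConfig`, `config_for`, `staged_tile_bytes`, `fits_local_memory`, and `tile_matches_subgroup` are hypothetical. The tile sizes and the 48KB and 64KB budgets are taken from the text; the Intel subgroup width of 16, the Intel memory budget, and the choice of two tiles in flight (as in a GEMM-style kernel) are placeholder assumptions.

```cpp
#include <cstddef>

// Hypothetical target tags; a real build system would select exactly one.
enum class Target { NvidiaGpu, AmdCdna, IntelGpu };

// Per-target tuning parameters. The core algorithm reads these and is
// otherwise identical across targets.
struct KernelConfig {
    std::size_t tile_rows;
    std::size_t tile_cols;
    std::size_t subgroup_width;   // warp / wavefront / subgroup size
    std::size_t local_mem_bytes;  // shared memory / LDS / SLM budget
};

constexpr KernelConfig config_for(Target t) {
    switch (t) {
        // Tile sizes and the 48KB / 64KB budgets follow the article's text.
        case Target::NvidiaGpu: return {32, 32, 32, 48 * 1024};
        case Target::AmdCdna:   return {64, 16, 64, 64 * 1024};
        // ASSUMPTION: subgroup width 16 (one of 8/16/32) and a 64KB budget
        // are placeholders; query the actual device before relying on them.
        case Target::IntelGpu:  return {16, 16, 16, 64 * 1024};
    }
    return {0, 0, 0, 0};  // not reached for valid enum values
}

// Bytes of local memory needed to stage `tiles_in_flight` tiles of type T.
// ASSUMPTION: two tiles (one per input matrix), as in a tiled GEMM.
template <typename T>
constexpr std::size_t staged_tile_bytes(const KernelConfig& c,
                                        std::size_t tiles_in_flight = 2) {
    return tiles_in_flight * c.tile_rows * c.tile_cols * sizeof(T);
}

template <typename T>
constexpr bool fits_local_memory(const KernelConfig& c) {
    return staged_tile_bytes<T>(c) <= c.local_mem_bytes;
}

// True if at least one tile dimension is a multiple of the subgroup width.
constexpr bool tile_matches_subgroup(const KernelConfig& c) {
    return c.tile_rows % c.subgroup_width == 0 ||
           c.tile_cols % c.subgroup_width == 0;
}

// An oversized or mismatched tile becomes a build error, not a launch failure.
static_assert(fits_local_memory<float>(config_for(Target::NvidiaGpu)),
              "NVIDIA tile exceeds shared memory budget");
static_assert(fits_local_memory<float>(config_for(Target::AmdCdna)),
              "AMD tile exceeds LDS budget");
static_assert(tile_matches_subgroup(config_for(Target::AmdCdna)),
              "AMD tile does not align with the 64-wide wavefront");
```

In a real build the target would be fixed by the build system (for example through a preprocessor definition), so only one row of the table is ever instantiated in a given binary.
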
On NVIDIA hardware with configurable L1/shared memory, the configuration is set to maximise shared memory. On AMD with fixed LDS, the kernel’s shared memory allocation is constrained to the LDS capacity. The allocation strategy is a per-target build configuration, not a runtime decision.

Kernel selection. For operations where the optimal algorithm differs across architectures (not just the parameters, but the approach), the codebase maintains multiple kernel implementations and selects the appropriate one based on the target. A reduction operation might use warp shuffles on NVIDIA, wavefront operations on AMD, and a different tree reduction on Intel. The selection logic is part of the build system, not the runtime.

The testing infrastructure for portable performance

Portable performance requires testing on every target platform. In our experience across cross-platform GPU engagements, a kernel that compiles and produces correct results on all targets may perform 3× slower on one target than another (an observed range, not a guaranteed outcome), and the performance difference is not visible in correctness testing.

We recommend a performance testing infrastructure that:

- Runs a benchmark suite on each target architecture after every code change that affects kernel code
- Compares achieved throughput against architecture-specific performance targets (expressed as a fraction of theoretical peak, not as absolute numbers, because the absolute peak differs across architectures)
- Flags regressions on any target architecture, even if other targets improved

This infrastructure is the practical enforcement of portability. Without it, the codebase will drift toward optimisation for the developer’s primary test hardware and regress on other targets, because the regression is invisible without measurement.

A reproducible cross-platform benchmark protocol

The following protocol produces fair, reproducible performance comparisons across GPU architectures.
We use this structure in our own cross-platform validation work:

Select a representative workload and fix the inputs. Choose a single model and dataset that reflect your production workload (e.g., ResNet-50 inference on a 1,000-image validation set, or a GEMM of fixed dimensions). Use identical input data across all platforms, stored in a portable format (NumPy .npy or raw binary) to eliminate data-loading variance.

Pin the software environment per target. Record the exact compiler, driver, and runtime versions for each platform (e.g., CUDA 12.4 + cuDNN 9.1 on NVIDIA, ROCm 6.1 + MIOpen on AMD, oneAPI 2024.1 on Intel). Use containers (Docker or Singularity) to freeze the environment so the benchmark is reproducible months later.

Run warm-up iterations and discard them. Execute 10–50 warm-up iterations before measurement to ensure caches are populated, JIT compilation (if any) has completed, and GPU clocks have stabilised to their sustained frequency. Do not include warm-up iterations in the reported timing.

Collect timing over a fixed number of measurement iterations. Run at least 100 measurement iterations (more for sub-millisecond kernels). Use device-side timing (CUDA Events, hipEvent_t, or SYCL profiling events) rather than host-side wall-clock timing to exclude host overhead and driver latency.

Report median and percentile statistics, not the mean. Record the median, 5th percentile, and 95th percentile of per-iteration times. The median is robust to outliers from thermal throttling or OS scheduling interference. Report the coefficient of variation as well. As a planning heuristic from our engagements, if it exceeds 5%, investigate the source of variance before drawing conclusions.

Normalise results to each platform’s theoretical peak. Express performance as a percentage of each GPU’s theoretical peak throughput (compute TFLOPS or memory bandwidth GB/s, whichever is the binding resource).
This separates optimisation quality from raw hardware capability: in our experience across cross-platform GPU engagements, a kernel achieving 70% of peak on both NVIDIA and AMD is well-optimised on both (an observed pattern, not a guaranteed outcome), even if the absolute throughput differs.

Automate and version-control the entire protocol. Store the benchmark script, environment specifications, input data checksums, and result-parsing logic in version control alongside the kernel source. Use a CI job (e.g., GitHub Actions with self-hosted runners per GPU type) to run the benchmark on every kernel code change and flag regressions against per-target performance baselines.

When portability is not worth the cost

Cross-platform performance portability has real engineering costs: the abstraction layer adds complexity, per-target tuning adds maintenance burden, and multi-platform testing adds infrastructure cost. These costs are justified when the software must run on multiple GPU vendors: products deployed on customer hardware, cloud workloads that span providers, or organisations with mixed GPU fleets.

When the workload runs exclusively on one GPU vendor’s hardware, which is the case for most deep learning training (NVIDIA), most single-cloud deployments, and most embedded systems, the portability investment is waste. CUDA on NVIDIA, HIP on AMD, and oneAPI on Intel each provide deeper optimisation on their respective platforms than any cross-platform abstraction can match.

We have seen teams invest months in cross-platform portability for workloads that ended up running exclusively on NVIDIA hardware for the project’s entire lifetime. We have also seen teams hard-code CUDA-specific optimisations and later discover that a customer requirement or cloud migration demanded AMD or Intel support, at which point the algorithmic restructuring and kernel tuning decisions had to be revisited for a different architecture.
The portability decision should be made explicitly, based on the realistic hardware scope for the workload’s deployment lifetime — not on philosophical preference for open standards or vendor neutrality. Without upfront hardware-scope analysis, teams risk either wasted portability investment or costly late-stage rewrites — a GPU Performance Audit quantifies both risks before the architecture is locked in.
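
Returning to the benchmark protocol described earlier: the warm-up, percentile, coefficient-of-variation, and normalisation steps reduce to a small amount of host-side arithmetic once the per-iteration times have been collected. The sketch below assumes the timings (in milliseconds) were already gathered with device-side events and are passed in as a plain vector. The names `TimingSummary`, `percentile_sorted`, `summarise`, and `fraction_of_peak` are illustrative, and the interpolation rule for percentiles is one common choice among several.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct TimingSummary {
    double median_ms;
    double p05_ms;
    double p95_ms;
    double coeff_of_variation;  // sample standard deviation / mean
};

// Percentile by linear interpolation between the two closest ranks.
// `sorted` must be non-empty and ascending; q is in [0, 1].
double percentile_sorted(const std::vector<double>& sorted, double q) {
    assert(!sorted.empty());
    const double pos = q * static_cast<double>(sorted.size() - 1);
    const std::size_t lo = static_cast<std::size_t>(std::floor(pos));
    const std::size_t hi = std::min(lo + 1, sorted.size() - 1);
    const double frac = pos - static_cast<double>(lo);
    return sorted[lo] + frac * (sorted[hi] - sorted[lo]);
}

// Discards the first `warmup` samples, then summarises the remainder.
TimingSummary summarise(std::vector<double> samples_ms, std::size_t warmup) {
    assert(samples_ms.size() > warmup);
    samples_ms.erase(samples_ms.begin(),
                     samples_ms.begin() + static_cast<std::ptrdiff_t>(warmup));
    std::sort(samples_ms.begin(), samples_ms.end());

    const double n = static_cast<double>(samples_ms.size());
    double sum = 0.0;
    for (double t : samples_ms) sum += t;
    const double mean = sum / n;

    double sq_dev = 0.0;
    for (double t : samples_ms) sq_dev += (t - mean) * (t - mean);
    const double stddev =
        samples_ms.size() > 1 ? std::sqrt(sq_dev / (n - 1.0)) : 0.0;

    return {percentile_sorted(samples_ms, 0.50),
            percentile_sorted(samples_ms, 0.05),
            percentile_sorted(samples_ms, 0.95),
            mean > 0.0 ? stddev / mean : 0.0};
}

// Achieved throughput as a fraction of the platform's theoretical peak,
// in whichever unit is the binding resource (TFLOPS or GB/s).
double fraction_of_peak(double achieved, double theoretical_peak) {
    assert(theoretical_peak > 0.0);
    return achieved / theoretical_peak;
}
```

A CI job could compare `coeff_of_variation` against the 5% heuristic (0.05) before trusting a run, and compare `fraction_of_peak` against the stored per-target baseline to flag regressions.
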