Performance Emerges from the Hardware × Software Stack

A driver update changes kernel scheduling. Throughput drops 12%. Nobody touched the model.

This is the kind of event that makes sense only after you abandon one mental model and replace it with another. Under the old model — “hardware has performance, software just uses it” — a driver update shouldn’t change throughput by double digits. Under the replacement model, it’s not only possible but predictable, because performance was never a property of the hardware alone.

AI performance is an emergent property of the hardware, the software stack that drives it, and the workload that shapes what both of them do. You don’t get a performance outcome by summing up layer contributions. You get it from interactions — and those interactions can be surprisingly sensitive to changes in any single layer. The stack, not the device, is the correct unit of performance reasoning for AI systems.

Emergence, not aggregation

When we say performance “emerges” from the hardware × software stack, we mean something specific: the final throughput, latency, or efficiency number is not decomposable into independent contributions from hardware and software. You can’t say “the GPU contributed 70% of the performance and PyTorch contributed 30%.” That’s not how the system works.

Consider what happens during a single forward pass of a transformer model. The framework decides how to lower the computation graph — which operators to fuse, which memory layout to use, which kernels to dispatch. The CUDA runtime schedules those kernels onto the device. cuDNN or a custom attention kernel (say, FlashAttention) determines the actual execution path on the hardware. The memory subsystem serves data according to access patterns that the software stack created. Thread scheduling, synchronization points, and stream management all shape how the hardware’s resources are actually consumed.

Change any one of those decisions — swap a kernel, alter a graph transformation pass, modify the runtime’s allocation strategy — and you’re not just “tuning.” You’re changing what the hardware actually does, which changes where the bottleneck lands, which changes the measured outcome. The coupling between layers is where the performance story lives, not in any single layer’s properties.
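A toy model can make the non-additivity concrete. The sketch below is purely illustrative: the function `step_time_ms`, its three inputs, and every millisecond value are assumptions invented for this example, not measurements or a real profiler API. It assumes compute and memory traffic overlap perfectly, so the slower of the two sets the step time, with a fixed overhead on top.

```python
def step_time_ms(compute_ms: float, memory_ms: float, overhead_ms: float) -> float:
    """Hypothetical step-time model: compute and memory traffic overlap,
    so the slower of the two dominates; overhead is paid on top."""
    return max(compute_ms, memory_ms) + overhead_ms


# All numbers are made up for illustration.
baseline = step_time_ms(compute_ms=8.0, memory_ms=10.0, overhead_ms=2.0)

# "Hardware" change alone: compute gets 2x faster, but the step is memory-bound.
faster_compute = step_time_ms(compute_ms=4.0, memory_ms=10.0, overhead_ms=2.0)

# "Software" change alone: a fused kernel halves the memory time.
fused_kernel = step_time_ms(compute_ms=8.0, memory_ms=5.0, overhead_ms=2.0)

# Both changes applied together.
both = step_time_ms(compute_ms=4.0, memory_ms=5.0, overhead_ms=2.0)

gain_compute = baseline - faster_compute  # time saved by the compute change alone
gain_fused = baseline - fused_kernel      # time saved by the kernel change alone
gain_both = baseline - both               # time saved by applying both

print(f"baseline={baseline} ms, both={both} ms")
print(f"separate gains sum to {gain_compute + gain_fused} ms, "
      f"combined gain is {gain_both} ms")
```

In this toy setup the separate gains do not add up to the combined gain, because each change moves the bottleneck that the other one sees. Real systems have many more terms, but the same structure is what makes a "70% hardware, 30% software" attribution meaningless.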

Why does hardware-only reasoning keep producing surprises?

Hardware matters. Nobody serious argues otherwise, and framing the argument as “hardware vs. software” misses the point entirely. The problem is more specific: treating hardware as the explanatory unit for performance outcomes.

That looks like “we upgraded from GPU A to GPU B — we should see a proportional speedup” and then being puzzled when the gain is smaller than expected, varies by model, or disappears under sustained load. It looks like comparing two systems by their theoretical FLOPS ratio and finding the actual throughput ratio is nothing like it. It looks like a team purchasing hardware based on a single spec-sheet advantage and then discovering that the workload spends most of its time somewhere the spec sheet doesn’t describe.
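One common way to see why a FLOPS ratio can fail to predict a throughput ratio is the roofline bound, which caps attainable throughput at the lower of peak compute and memory bandwidth times arithmetic intensity. The article does not name this model; it is brought in here as an illustration. The two devices and all figures below are hypothetical, not real spec sheets.

```python
def attainable_tflops(peak_tflops: float, bandwidth_tb_s: float,
                      flops_per_byte: float) -> float:
    """Roofline bound: throughput is capped by compute or by memory traffic,
    whichever limit is lower at this arithmetic intensity."""
    return min(peak_tflops, bandwidth_tb_s * flops_per_byte)


# Hypothetical devices; the figures are illustrative only.
gpu_a = {"peak_tflops": 100.0, "bandwidth_tb_s": 1.0}
gpu_b = {"peak_tflops": 200.0, "bandwidth_tb_s": 1.5}

# The ratio a spec-sheet comparison of peak compute would suggest.
spec_ratio = gpu_b["peak_tflops"] / gpu_a["peak_tflops"]


def speedup(flops_per_byte: float) -> float:
    """Ratio of attainable throughput, GPU B over GPU A, for one workload."""
    a = attainable_tflops(flops_per_byte=flops_per_byte, **gpu_a)
    b = attainable_tflops(flops_per_byte=flops_per_byte, **gpu_b)
    return b / a


low_intensity = speedup(50.0)    # both devices are bandwidth-bound here
high_intensity = speedup(500.0)  # both devices are compute-bound here

print(f"spec-sheet ratio: {spec_ratio}")
print(f"speedup at 50 FLOP/byte: {low_intensity}")
print(f"speedup at 500 FLOP/byte: {high_intensity}")
```

Under these assumptions the upgrade matches the spec-sheet ratio only for the workload with high arithmetic intensity. The bandwidth-bound workload sees a smaller gain, and which regime a workload lands in depends on the kernels and fusion decisions the software stack makes.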

We’ve found that when a team is surprised by an AI performance result, the explanation almost always involves a software or system-level interaction that the hardware-only model can’t account for. The hardware didn’t fail to perform — the hardware was never the whole story.

The stack is the performance definition

A pragmatic shift is to stop asking “what GPU is this?” and start asking “what execution stack is this?” — because the stack determines what actually runs.

The drivers and runtime decide how work is scheduled, synchronized, and allocated. The framework decides which operators execute and how graphs are partitioned. Libraries and kernels decide what instructions actually hit the device. The system topology (PCIe layout, NUMA configuration, NVLink connectivity) decides how data moves between components. And the workload itself decides what gets stressed, for how long, and in what pattern. It’s a reminder that GPUs are part of a larger system, not isolated performance islands.

Execution stack layers and their performance roles

| Stack layer | What it controls | How it affects measured performance |
| --- | --- | --- |
| Hardware | Compute units, memory, interconnect, power/thermal envelope | Defines the theoretical ceiling for arithmetic, bandwidth, and sustained operation |
| Drivers & runtime | Kernel scheduling, memory allocation, synchronization | Determines how efficiently hardware resources are claimed and released |
| Frameworks | Graph construction, operator fusion, kernel dispatch | Decides which execution path the workload takes through the hardware |
| Libraries & kernels | Actual device instructions (cuDNN, FlashAttention, NCCL) | Sets the practical throughput ceiling for individual operations |
| Workload | Model shape, batch size, sequence length, precision mode | Determines which hardware subsystem is stressed and where the bottleneck lands |

None of that is optional detail. Those are the mechanisms that produce the number. When someone presents a performance result without stack context, the number isn’t wrong, but the claim is incomplete, in the same way that a benchmark result with hidden methodology is a datapoint without interpretation. As we explored in our discussion of why identical GPUs can produce different results, the execution context is frequently the dominant source of variance, not the hardware identity.
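As a minimal sketch of what recording stack context alongside a result could look like, the snippet below gathers a few software-side fields. The helper `capture_stack_context` and its field names are hypothetical. PyTorch is assumed only as an example framework and is treated as optional, so the function still runs without it.

```python
import json
import platform


def capture_stack_context() -> dict:
    """Collect some of the software-side context a result should be reported with.
    Uses the standard library, plus PyTorch if it happens to be installed."""
    context = {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "framework": None,
        "cuda_runtime": None,
        "cudnn": None,
    }
    try:
        import torch  # optional; assumed here only as an example framework
    except ImportError:
        return context
    context["framework"] = f"torch {torch.__version__}"
    context["cuda_runtime"] = torch.version.cuda       # None on CPU-only builds
    context["cudnn"] = torch.backends.cudnn.version()  # None if cuDNN is unavailable
    return context


stack_context = capture_stack_context()
print(json.dumps(stack_context, indent=2))
```

This covers only part of the table above. The driver version, system topology, and workload settings are not captured here and would need to come from other sources, such as `nvidia-smi` output and the run configuration.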

What this means for performance claims and comparisons

If performance is a stack property, then performance claims need stack context to be meaningful. “This GPU delivers 500 tokens per second” is not a hardware statement — it’s a system statement, and the system includes the framework version, the CUDA runtime, the kernel libraries, the model configuration, the batch size, and the operating regime.
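One way to make this concrete is to treat the context as part of the claim itself. The sketch below is an assumption-laden illustration, not an established schema: the `StackContext` and `PerformanceClaim` classes, the `context_differences` helper, and every field value are hypothetical placeholders.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class StackContext:
    """Hypothetical record of the stack a number was measured on."""
    gpu: str
    driver: str
    cuda_runtime: str
    framework: str
    kernel_libraries: tuple
    model_config: str
    batch_size: int
    operating_regime: str


@dataclass(frozen=True)
class PerformanceClaim:
    tokens_per_second: float
    context: StackContext


def context_differences(a: PerformanceClaim, b: PerformanceClaim) -> list:
    """Return the names of the context fields on which two claims differ."""
    ctx_a, ctx_b = asdict(a.context), asdict(b.context)
    return [name for name in ctx_a if ctx_a[name] != ctx_b[name]]


# Placeholder values throughout; none of these are real versions or results.
base = StackContext(
    gpu="GPU A",
    driver="driver-1",
    cuda_runtime="runtime-1",
    framework="framework-1",
    kernel_libraries=("attention-kernel-1",),
    model_config="model-config-1",
    batch_size=8,
    operating_regime="sustained",
)
before = PerformanceClaim(tokens_per_second=500.0, context=base)
after = PerformanceClaim(tokens_per_second=440.0,
                         context=replace(base, driver="driver-2"))

print(context_differences(before, after))
```

With the context attached, the two numbers stop being a mystery: the comparison reports that only the driver field differs, which tells the reader where to look first.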

Teams that adopt the stack model gain a practical advantage: their performance discussions get calmer and more accurate, because performance ownership spans hardware and software teams. Discrepancies stop looking like mysteries and start looking like the natural consequence of running different stacks. Vendors’ claims become interpretable rather than confusing. And capacity planning shifts from “buy the GPU with the best number” to “validate performance under our actual execution context” — which is harder, but much more likely to produce useful results.

For a deeper look at how the software layer specifically creates performance ceilings and pathways, see our discussion of the software stack as a first-class performance component.

Not reductionism in reverse

One misreading of this argument is “so you’re saying hardware doesn’t matter and it’s all software.” That’s wrong in the opposite direction.

Hardware determines the envelope of what’s possible. A GPU with more memory can serve larger models; a device with higher bandwidth can move data faster when the access patterns are favorable; architectural features like hardware support for specific precision formats create capabilities the software stack can exploit. None of that is diminished by recognizing that the software stack mediates how those capabilities are realized.

This is also why the choice of software stack changes the effective performance you get from a fixed piece of hardware. The same GPU, driven by a newer driver, a different CUDA runtime, an upgraded framework, or a swapped kernel library, will exhibit a different effective ceiling. The silicon has not changed; the drivers, runtime, compiler, and framework collectively decide which fraction of that envelope the workload actually reaches. A fixed device is not a fixed performance number; it is a performance range whose realized value is set by the stack that drives it.

The point is not that one layer matters more than the other. It’s that the outcome belongs to the interaction, and reasoning about the layers in isolation produces incomplete and often misleading conclusions. If your performance model doesn’t include the stack, it’s not a performance model — it’s a hardware description with an implicit hope that everything else will be fine. In our experience, that hope is not a reliable engineering strategy.

Related deep-dives

System-on-a-chip for AI: why integration doesn’t eliminate the software stack (how the stack model applies in concentrated form to SoCs).

LynxBenchAI applies this principle operationally — treating the hardware-and-software stack as the unit of measurement, not the chip in isolation. It is a benchmarking methodology for AI hardware that measures sustained performance across the complete stack, reported per precision, with bounded optimisation.

Frequently Asked Questions

Why does AI performance emerge from the hardware × software stack rather than from hardware alone?

Because the throughput, latency, and efficiency numbers a team actually observes are produced by interactions between layers — kernels dispatched by the framework, scheduling decisions made by the runtime, memory access patterns shaped by the workload — not by hardware properties in isolation. The hardware sets the envelope of what is possible, but it does not choose which execution path the workload takes through that envelope. That choice belongs to the software stack, and the outcome belongs to the interaction.

How can identical hardware produce radically different performance under different software stacks?

Identical GPUs can be driven by different framework versions, CUDA runtimes, kernel libraries, and graph transformation passes, and each of those decides which instructions actually hit the device. Swap a cuDNN-backed attention path for a FlashAttention kernel, change an operator fusion strategy, or alter the runtime’s allocation behaviour, and the bottleneck moves even though the silicon is unchanged. That is why a driver update can shift throughput by double digits without anyone touching the model.

Why does reasoning about performance one layer at a time tend to break down on AI workloads?

Because the layers are not independent contributors that can be summed. The framework’s graph lowering decisions change what the runtime schedules; the runtime’s scheduling changes what the kernels can sustain; the workload’s shape changes which subsystem gets stressed. Reasoning layer-by-layer assumes a decomposition that does not hold, which is why hardware-only or software-only models keep producing surprises in practice.

What does it mean to treat AI performance as a systems problem instead of a hardware problem?

It means no longer asking “what GPU is this?” and asking “what execution stack is this?” instead, because the stack determines what actually runs. Performance claims then carry framework version, runtime, kernel libraries, model configuration, batch size, and operating regime as part of the claim itself. Capacity planning shifts from picking the best spec-sheet number to validating performance under the actual execution context.

Which interactions across hardware, software, and workload tend to dominate observed AI performance?

The dominant interactions are usually kernel selection and dispatch (which library and which kernel variant runs), memory access patterns created by the framework’s graph and the workload’s shape, scheduling and synchronization decisions in the runtime, and system topology effects across PCIe, NUMA, and NVLink. When a team is surprised by a result, the explanation almost always lives in one of these interactions rather than in a hardware property the spec sheet would have surfaced.

Why is it important for hardware and software to be reasoned about together when evaluating an AI system?

Because a hardware description with an implicit hope that everything else will be fine is not a performance model — and evaluations built on that hope tend to mispredict. Reasoning about hardware and software together produces interpretable vendor claims, calmer cross-team discussions, and capacity plans that survive contact with the real workload. It is also the basis for benchmarking methodologies like LynxBenchAI, which treat the complete stack as the unit of measurement rather than the chip in isolation.

How does the choice of software stack change the effective performance you get from a fixed piece of hardware like a GPU?

A fixed GPU is a performance range, not a single number. The drivers, runtime, compiler, and framework that drive it decide which fraction of the hardware’s envelope the workload actually reaches, so a newer driver, a different CUDA runtime, an upgraded framework, or a swapped kernel library can move the effective ceiling without any change to the silicon. The realized value is set by the stack, which is exactly why a driver update can shift throughput by double digits while the device stays the same.

What does a hardware × software stack actually look like in layers, and where do the interactions between those layers most often determine real-world AI performance?

The stack runs from hardware (compute units, memory, interconnect, thermal envelope), through drivers and runtime (scheduling, allocation, synchronization), through frameworks (graph construction, operator fusion, kernel dispatch), through libraries and kernels (cuDNN, FlashAttention, NCCL), to the workload itself (model shape, batch size, sequence length, precision mode). The interactions that most often determine real-world performance live at the seams between these layers — the execution stack layers table above maps each layer to what it controls and how it affects the measured number. Kernel selection, memory access patterns, runtime scheduling, and system topology across PCIe, NUMA, and NVLink are where the bottleneck usually lands.
