You’re comparing two GPUs for an inference cluster
The spec sheets are open side by side. One card advertises higher peak TFLOPS, the other higher memory bandwidth. The numbers feel decisive — concrete, comparable, clearly pointing to a winner. You could make a spreadsheet, sort by the metric that matters most, and call it done.
We’ve watched this process play out many times, and the pattern is remarkably consistent: the spreadsheet makes the decision feel safe, and then deployment tells a different story. Not because the spec sheets lied, but because they were answering a question nobody actually asked.
A spec sheet describes the theoretical ceiling of a component in isolation. AI performance lives in the gap between that ceiling and what actually happens when a real workload executes through a real software stack, on a real system, under sustained load. That gap is where most of the interesting — and most of the expensive — surprises live.
Theoretical limits and executed behavior are different things
A GPU spec sheet is a statement about capability under idealized conditions: maximum theoretical compute throughput, peak memory bandwidth, supported data types, clock domains, and power envelopes. These values define the outer boundary of what the hardware could do in carefully constructed, short-duration scenarios.
What they don’t capture is the interaction between the workload’s structure, the framework that lowers it to device operations, the runtime that schedules and synchronizes those operations, and the physical system that sustains the whole thing over time. Spec sheets describe the envelope; AI workloads live inside it, usually far from the boundary, and rarely at the same point for different models or configurations.
This distinction matters because people routinely treat peak metrics as predictions. When someone says “this GPU does 300 TFLOPS and that one does 200, so the first is 1.5× faster,” they have made at least four assumptions — about the workload being compute-bound, about the software stack hitting optimal paths, about data movement keeping pace, and about sustained thermal and power behavior — without examining any of them. In our experience, at least one of those assumptions breaks in every real deployment. Often several break at once.
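To make those hidden assumptions concrete, here is a minimal sketch in which each one is folded into a utilization factor. All the numbers are illustrative assumptions, not measurements of any real GPU; the point is only that the naive ratio and the modeled outcome can land on opposite sides of 1.0:

```python
# Hypothetical sketch: why a spec-sheet ratio is not a speedup prediction.
# Peak numbers and efficiency factors below are illustrative assumptions.

def predicted_speedup(peak_a_tflops, peak_b_tflops):
    """The naive spec-sheet claim: speedup = ratio of peak FLOPS."""
    return peak_a_tflops / peak_b_tflops

def achieved_speedup(peak_a, peak_b, util_a, util_b):
    """Achieved throughput = peak * utilization, where utilization folds in
    compute-boundedness, kernel quality, data movement, and sustained clocks."""
    return (peak_a * util_a) / (peak_b * util_b)

naive = predicted_speedup(300, 200)            # the "1.5x faster" claim
# If the higher-FLOPS card's software stack hits worse execution paths
# (say 45% utilization vs 70%), the advantage inverts:
real = achieved_speedup(300, 200, 0.45, 0.70)

print(f"naive prediction: {naive:.2f}x")       # 1.50x
print(f"modeled outcome:  {real:.2f}x")        # ~0.96x
```

The specific utilization values are invented, but the structure is not: every spec-based comparison implicitly multiplies the peak by factors like these, whether or not anyone writes them down.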
Peak FLOPS: the metric that misleads most often
FLOPS are the metric everyone reaches for first, probably because they look like the most direct proxy for “raw speed.” For AI workloads, treating peak FLOPS as a performance predictor only works if all of the following hold simultaneously: the workload is compute-bound, the compute units are saturated, data movement keeps up with compute demand, the software stack exploits the hardware’s fast execution paths, and the workload stays in a stable regime over time rather than bouncing between phases.
When a transformer inference workload is memory-bandwidth-bound — which it frequently is for autoregressive decoding — more FLOPS buys you nothing. The execution spends its time waiting on memory, not on arithmetic. Conversely, a workload that is compute-bound might still not saturate the device if the kernel doesn’t map well onto the hardware, or if synchronization and launch overhead eat into useful cycles. The relationship between the peak number on the spec sheet and the achieved throughput is contingent on so many intermediate factors that treating one as a proxy for the other is, in practice, a bet you’re making without seeing the odds.
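A roofline-style check makes the compute-bound/bandwidth-bound question explicit. This sketch uses hypothetical hardware numbers and a simplified fp16 batch-1 decode step (a matrix-vector multiply whose memory traffic is dominated by streaming the weights); the takeaway is that its arithmetic intensity sits far below either device's balance point, so the FLOPS ratio never enters the outcome:

```python
# Roofline-style sketch: is an autoregressive decode step compute-bound or
# bandwidth-bound? All hardware numbers are hypothetical assumptions.

def decode_step_intensity(d_model: int, d_ff: int) -> float:
    """Approximate arithmetic intensity (FLOPs per byte) of one fp16
    matrix-vector multiply: 2*m*n FLOPs while streaming ~2*m*n bytes
    of weights from memory."""
    m, n = d_model, d_ff
    flops = 2 * m * n
    bytes_moved = 2 * m * n          # fp16 weights dominate traffic at batch 1
    return flops / bytes_moved       # ~1 FLOP/byte regardless of layer size

def bound(intensity: float, peak_tflops: float, bandwidth_tbs: float) -> str:
    """Compare workload intensity against the machine balance point."""
    balance = peak_tflops / bandwidth_tbs   # FLOPs the GPU can do per byte
    return "compute-bound" if intensity > balance else "bandwidth-bound"

ai = decode_step_intensity(4096, 16384)      # ~1 FLOP/byte
# GPU A: 300 TFLOPS, 2.0 TB/s -> balance 150. GPU B: 200 TFLOPS, 3.0 TB/s -> ~67.
print(bound(ai, 300, 2.0))  # bandwidth-bound: the extra FLOPS sit idle
print(bound(ai, 200, 3.0))  # bandwidth-bound: bandwidth decides the race
```

On both hypothetical devices the decode step lands deep in the bandwidth-bound region, which is exactly why the "lower-spec" GPU with faster memory can win this comparison outright.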
Memory bandwidth has the same problem, just dressed differently
Memory bandwidth is often treated as the more “realistic” spec, especially by people who’ve already been burned by FLOPS comparisons. And it’s true that bandwidth matters more than peak FLOPS for many inference workloads — but it matters in context, not as an absolute.
Effective memory throughput depends on access patterns, operator fusion, cache hierarchy behavior, and runtime scheduling decisions. Two GPUs with similar advertised HBM bandwidth can deliver very different effective throughput depending on how the software stack organizes memory accesses. A PyTorch model with one attention implementation might achieve 80% of theoretical bandwidth; switch to a different kernel (say, FlashAttention vs. a naive implementation) and the effective bandwidth utilization changes substantially, even on the same hardware with the same advertised spec.
Bandwidth is not consumed directly by models — it’s mediated by execution. And that mediation is where the divergence lives.
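A back-of-envelope traffic count shows how large that mediation effect can be. The sketch below compares the extra DRAM traffic a naive attention kernel pays by materializing the L × L score matrix against the unavoidable Q/K/V/O traffic that a fused, FlashAttention-style kernel reduces it to; the sequence length, head dimension, and fp16 dtype are assumed for illustration:

```python
# Back-of-envelope sketch: why effective bandwidth depends on the kernel,
# not just the spec. Shapes and dtype are illustrative assumptions.

def naive_extra_traffic_bytes(seq_len: int, bytes_per_el: int = 2) -> int:
    """Naive attention writes the score matrix S (L x L) to DRAM, reads it
    back for softmax, writes the probabilities P, and reads P again for
    the P @ V multiply: four full L x L passes."""
    return 4 * seq_len * seq_len * bytes_per_el

def fused_io_bytes(seq_len: int, head_dim: int, bytes_per_el: int = 2) -> int:
    """A fused kernel keeps score tiles on-chip and only streams Q, K, V
    in and O out: four L x d tensors."""
    return 4 * seq_len * head_dim * bytes_per_el

L, d = 4096, 128
extra = naive_extra_traffic_bytes(L)
fused = fused_io_bytes(L, d)
print(f"naive extra traffic: {extra / 2**20:.0f} MiB per head")   # 128 MiB
print(f"fused unavoidable IO: {fused / 2**20:.0f} MiB per head")  # 4 MiB
print(f"ratio: ~{extra // fused}x more DRAM traffic when naive")  # ~32x
```

Same model, same GPU, same advertised bandwidth — yet one kernel moves roughly 32× more data through it. The spec didn't change; the execution did.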
The peak‑vs‑sustained mismatch
Spec sheets quietly blend two different regimes: burst behavior and sustained behavior. Boost clocks, peak throughput numbers, and turbo specifications describe what the hardware can reach for brief windows under favorable conditions.
AI workloads are rarely brief. Training runs last hours to weeks; inference services run continuously under variable traffic. Under sustained load, GPUs settle into operating regimes defined by power limits, thermal constraints, and stable clock states that can be significantly below the advertised peak. If you sized your capacity plan around the boost-clock number, you may find the system delivering 15–25% less sustained throughput than you expected, with no defect present — just physics doing what physics does.
We pay close attention to this distinction because it’s one of the most common sources of “the benchmarks said it would be faster” complaints. The benchmark was probably correct for the regime it measured; it just measured a regime the production system never stays in.
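A benchmark harness that wants to avoid this trap has to report the two regimes separately. Here is a minimal sketch: `run_batch` stands in for your real workload, and `simulated_device` is a purely hypothetical device model that throttles once its thermal headroom is spent, so the harness has something to measure:

```python
# Sketch of a harness that separates burst from sustained throughput.
# `simulated_device` is a hypothetical stand-in for a real workload runner.

def measure(run_batch, warmup_iters=50, measured_iters=500):
    """Report throughput for the first iterations (burst regime) and for
    the tail after sustained-load clocks settle (steady-state regime)."""
    burst = [run_batch(i) for i in range(warmup_iters)]
    steady = [run_batch(warmup_iters + i) for i in range(measured_iters)]
    return sum(burst) / len(burst), sum(steady) / len(steady)

def simulated_device(i, boost=1000.0, sustained=800.0, headroom=20):
    """Tokens/s: boost clocks for the first `headroom` iterations, then a
    power/thermal-limited plateau roughly 20% lower."""
    return boost if i < headroom else sustained

burst_tps, steady_tps = measure(simulated_device)
print(f"burst:     {burst_tps:.0f} tok/s")   # flattered by boost clocks
print(f"sustained: {steady_tps:.0f} tok/s")  # the number to plan capacity on
```

A real harness would time actual batches and run long enough for clocks to stabilize, but the reporting discipline is the same: if the burst and steady-state numbers diverge, publish both, and size capacity on the second.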
What actually determines AI performance
Once you accept that spec sheets describe limits rather than outcomes, the natural question is: what does determine performance? The honest answer is that it’s the interaction between hardware, software stack, and workload — operating as a coupled system over time.
The hardware provides capability and constraints. The software stack (drivers, runtime, framework, kernels) determines which execution paths are taken and how efficiently the hardware is used. The workload determines what gets stressed, for how long, and in what pattern. None of these are separable in the outcome. You can’t point at the GPU and say “that’s where the performance lives,” because a different framework version, a different kernel library, or a different batch configuration can move the bottleneck to a completely different subsystem.
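One concrete way the bottleneck moves: batch size. In the sketch below (hypothetical device, simplified fp16 decode model), each weight byte streamed from memory supports roughly one FLOP per request in the batch, because the weights are reused across the batch. Nothing about the hardware changes between runs; only the configuration does:

```python
# Sketch: the bottleneck is a property of the configuration, not the GPU.
# Device numbers (300 TFLOPS, 2 TB/s) are hypothetical assumptions.

def decode_bottleneck(batch, peak_tflops=300.0, bandwidth_tbs=2.0):
    """At batch B, fp16 weights streamed from HBM are reused across the
    batch, so arithmetic intensity is roughly B FLOPs per byte. Compare
    that to the machine balance (peak FLOPs per byte of bandwidth)."""
    intensity = float(batch)               # ~B FLOP/byte for batched decode
    balance = peak_tflops / bandwidth_tbs  # 150 FLOP/byte on this device
    return "compute" if intensity > balance else "memory"

for b in (1, 32, 256):
    print(f"batch {b:>3}: {decode_bottleneck(b)}-bound")
# batch 1 and 32 wait on memory; batch 256 waits on arithmetic
```

The same GPU is "memory-limited hardware" at batch 1 and "compute-limited hardware" at batch 256 — which is another way of saying that neither label belongs to the GPU at all.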
This is why benchmarks measure execution, not hardware — and it’s why reducing GPU performance to a single number loses the information you actually need to make a decision.
Better questions than “which GPU has better specs?”
The fix is not to find a more clever single metric. It’s to change the shape of the question. Instead of “which GPU has better specs?”, the questions that actually survive contact with deployment are:
What is the workload actually doing — is it compute-bound, memory-bound, or limited by something outside the device entirely? How does behavior change under sustained load versus the first few minutes? What software stack is being used, and does it exploit the hardware’s strengths or work around its limitations? What assumptions are embedded in the comparison, and are those assumptions true in your environment?
Performance conclusions that don’t state their assumptions aren’t conclusions — they’re guesses wearing a lab coat. Spec sheets make it easy to skip the assumptions, which is precisely why they keep leading to surprise.
The uncomfortable implication
None of this means spec sheets are useless, or that hardware selection doesn’t matter. Both obviously do. The point is narrower and harder to dodge: spec sheets are not performance measurements, and treating them as if they were is one of the most expensive mistakes teams make in AI infrastructure decisions.
Real performance is an execution property, not a static attribute. If you care about what your system will actually do — in production, under load, over time — the spec sheet is where the conversation starts, not where it ends.