Precision Choices Are Constrained by Hardware Architecture

“Can we just switch to FP8?”

The request sounds simple. FP8 offers roughly 2× the throughput of BF16 on supported hardware, halves the memory footprint, and enables larger models on fewer GPUs. The ML engineer or infrastructure planner hears “faster and cheaper” and reasonably asks: why aren’t we already using it?

The answer, more often than not, is hardware. The precision formats a GPU can accelerate natively are determined by its tensor core architecture, and that architecture varies across generations. A format the hardware doesn’t support natively doesn’t just run slower — it may offer no throughput benefit at all, or it may not be supported by the deployment framework for that hardware target.

Precision decisions are hardware-conditional. Understanding the constraints is the prerequisite to making them well.

Tensor core generations and their numerical affordances

NVIDIA’s tensor core architecture has evolved across GPU generations, and each generation added support for new numerical formats while maintaining backward compatibility:

Volta (V100): First-generation tensor cores. Native support for FP16 matrix multiply with FP32 accumulation. No BF16, no INT8 tensor core support, no FP8. The V100 was foundational for mixed-precision training, but its format menu is limited by today’s standards.

Ampere (A100): Third-generation tensor cores. Added native BF16 and TF32 (an internal 19-bit format used transparently for FP32 operations). Added INT8 and INT4 tensor core support for inference quantization. No native FP8. The A100 is still widely deployed and handles BF16 inference and training well, but cannot accelerate FP8 workloads.

Hopper (H100, H200): Fourth-generation tensor cores. Added native FP8 (both E4M3 and E5M2 variants), along with the Transformer Engine that manages dynamic per-tensor scaling for FP8 automatically. BF16 throughput also increased substantially over Ampere.

Blackwell (B100, B200): Further FP8 optimization and potential FP4 support, with enhanced Transformer Engine capabilities.

Each generation defines a different menu of viable precision choices. A deployment targeting V100s is limited to FP16 mixed precision for tensor core acceleration. A deployment targeting A100s can use BF16, INT8, or INT4, but not FP8. A deployment targeting H100s can use any of the above plus FP8.

This isn’t a software limitation that can be patched. It’s a hardware constraint: the silicon either has execution units for a given format or it doesn’t.

Tensor core format support by GPU generation

GPU generation	Native tensor core formats	Notable limitation
Volta (V100)	FP16	No BF16, no INT8, no FP8
Ampere (A100)	FP16, BF16, TF32, INT8, INT4	No FP8 — FP8 operations fall back to BF16 speed
Hopper (H100/H200)	FP16, BF16, TF32, INT8, INT4, FP8 (E4M3, E5M2)	Full format menu with Transformer Engine
Blackwell (B100/B200)	All above + FP4 (expected)	—
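
The table can also be held as a small lookup that a planning script consults before committing to a format. The sketch below transcribes the rows above; the names `NATIVE_FORMATS`, `is_natively_accelerated`, and `viable_generations` are illustrative, not part of any NVIDIA or framework API, and the Blackwell FP4 entry carries the same "expected" caveat as the table.

```python
# Native tensor core formats per generation, transcribed from the table above.
NATIVE_FORMATS = {
    "volta": {"FP16"},
    "ampere": {"FP16", "BF16", "TF32", "INT8", "INT4"},
    "hopper": {"FP16", "BF16", "TF32", "INT8", "INT4", "FP8"},
    # FP4 is listed as "expected" in the table, so treat this entry as provisional.
    "blackwell": {"FP16", "BF16", "TF32", "INT8", "INT4", "FP8", "FP4"},
}


def is_natively_accelerated(generation: str, fmt: str) -> bool:
    """Return True if the table lists fmt as a native tensor core format."""
    formats = NATIVE_FORMATS.get(generation.lower())
    if formats is None:
        raise ValueError(f"unknown generation: {generation!r}")
    return fmt.upper() in formats


def viable_generations(fmt: str) -> list:
    """List the generations (oldest first) whose tensor cores accelerate fmt."""
    return [gen for gen, formats in NATIVE_FORMATS.items() if fmt.upper() in formats]


if __name__ == "__main__":
    print(is_natively_accelerated("ampere", "FP8"))
    print(viable_generations("BF16"))
```

A check like this answers only the hardware question; it says nothing about framework or driver support.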

The penalty for unsupported formats

Running a precision format that the hardware doesn’t natively accelerate usually doesn’t cause an error: the framework will typically fall back to a supported format or use software emulation. But the performance implications are serious.

FP8 operations on A100 hardware execute on BF16 or FP16 tensor cores with conversion overhead, producing throughput roughly comparable to BF16, which means no FP8 advantage despite the lower precision. The memory savings from FP8 model representation still apply (the model is smaller in HBM), but the doubling of compute throughput that FP8 promises on Hopper hardware simply doesn’t materialize on Ampere.
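
The memory half of that statement is plain arithmetic, and it holds on any generation. A minimal sketch follows, assuming a hypothetical 70-billion-parameter model and counting weights only; KV cache, activations, and any FP8 scale factors are deliberately left out.

```python
# Storage cost per parameter for the formats discussed in this article.
BYTES_PER_PARAM = {"FP32": 4, "BF16": 2, "FP16": 2, "FP8": 1}


def weight_footprint_gb(num_params: float, fmt: str) -> float:
    """Weight storage in GB (10**9 bytes); excludes KV cache and activations."""
    return num_params * BYTES_PER_PARAM[fmt] / 1e9


# Hypothetical model size, chosen only to make the arithmetic concrete.
params = 70e9
bf16_gb = weight_footprint_gb(params, "BF16")
fp8_gb = weight_footprint_gb(params, "FP8")

if __name__ == "__main__":
    print(f"BF16 weights: {bf16_gb:.0f} GB")
    print(f"FP8 weights:  {fp8_gb:.0f} GB")
```

The footprint halves regardless of the GPU. Whether the compute speeds up is the separate, hardware-conditional question.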

Similarly, INT8 inference on V100 runs without dedicated tensor core support, falling back to CUDA cores with dramatically lower throughput than the INT8 tensor core path available on A100.

There is a subtler version of the same trap. A format can be nominally supported and still not be efficient, because the format alone is only half the story: the scaling machinery it depends on must also be present in silicon. FP8 is the clearest example. Sustained FP8 accuracy at scale relies on per-tensor or block scaling that rescales values to use the narrow exponent range well, and on Hopper the Transformer Engine handles that dynamic scaling automatically. Finer-grained blockwise quantization (scaling per small block of values rather than per tensor) is more demanding still. Where that scaling has no hardware acceleration, it runs on slower paths and erodes, and sometimes erases, the throughput the precision format was chosen to deliver. So “FP8 supported” on a spec sheet does not by itself tell an architect whether the block-scaling path that makes FP8 usable in practice is one the hardware does efficiently.
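
To make the difference between per-tensor and blockwise scaling concrete, here is a minimal pure-Python sketch of how the scale factors themselves are computed. It assumes the common absolute-maximum scheme and the E4M3 encoding whose largest finite magnitude is 448; the function names and sample values are illustrative, and the sketch does not perform the FP8 rounding or model how the Transformer Engine does this.

```python
E4M3_MAX = 448.0  # largest finite magnitude in the common FP8 E4M3 encoding


def per_tensor_scale(values):
    """One scale for the whole tensor: map the largest magnitude onto E4M3_MAX."""
    amax = max(abs(v) for v in values)
    return amax / E4M3_MAX if amax > 0 else 1.0


def blockwise_scales(values, block_size):
    """One scale per contiguous block, so an outlier only affects its own block."""
    return [
        per_tensor_scale(values[i:i + block_size])
        for i in range(0, len(values), block_size)
    ]


# Illustrative data: mostly small values plus a single large outlier.
values = [0.5, -0.25, 0.125, 0.75, 896.0, 1.0, -0.5, 0.25]

tensor_scale = per_tensor_scale(values)      # driven entirely by the outlier
block_scales = blockwise_scales(values, 4)   # first block is unaffected by it

if __name__ == "__main__":
    print("per-tensor scale:", tensor_scale)
    print("blockwise scales:", block_scales)
```

With a single scale, the outlier sets the scale for every element and squeezes the small values toward the low end of the representable range. With block scales, only the outlier's block pays that cost. The price is more scale factors to compute and apply, which is the work that needs an efficient hardware path.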

When tensor core support is absent or only partial for a format, the work does not stop — it falls back. The matrix multiply drops to a wider-format tensor core or to CUDA cores, and any scaling the format needs runs in software rather than on dedicated units. The practical implication: a precision strategy developed and benchmarked on one hardware generation cannot be applied to a different generation without re-evaluating whether the target format — and its scaling machinery — is natively supported. A reported throughput number measured on a fallback path describes the fallback, not the format; an architect reading it should ask which units actually ran the workload before trusting the figure. Benchmark results measured on H100 at FP8 tell you nothing about FP8 performance on A100 — because on A100, “FP8 performance” effectively doesn’t exist at the hardware acceleration level.

Framework and tooling dependencies

Hardware support is necessary but not sufficient. The deployment framework must also support the target precision format on the target hardware, with optimized kernels and correct numerical handling.

TensorRT, NVIDIA’s inference optimizer, added FP8 support alongside Hopper hardware. Earlier TensorRT versions targeting A100 support INT8 and FP16 but not FP8. PyTorch’s native FP8 support arrived with specific versions and requires Hopper hardware with compatible CUDA toolkit versions.

This creates a three-layer compatibility requirement: the hardware must support the format, the framework must support the format on that hardware, and the CUDA/driver stack must be at a version that enables the feature. A mismatch at any layer — current-generation hardware with an older framework version, or a current framework on previous-generation hardware — blocks the precision strategy.
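
The three-layer requirement can be sketched as a single check that reports which layer blocks a precision strategy. Everything here is hypothetical: `StackInfo`, `blocking_layers`, and especially the minimum version numbers in the usage lines are placeholders supplied by the caller, not real requirements of any framework or CUDA release.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StackInfo:
    hardware_has_format: bool   # e.g. taken from the generation table
    framework_version: tuple    # e.g. (2, 1) for a hypothetical "2.1"
    cuda_version: tuple         # e.g. (12, 1)


def blocking_layers(stack, min_framework, min_cuda):
    """Return the layers that block the format; an empty list means all three pass."""
    blockers = []
    if not stack.hardware_has_format:
        blockers.append("hardware")
    if stack.framework_version < min_framework:
        blockers.append("framework")
    if stack.cuda_version < min_cuda:
        blockers.append("cuda/driver")
    return blockers


if __name__ == "__main__":
    # Placeholder minimums: look up the real ones in the framework's release notes.
    MIN_FRAMEWORK, MIN_CUDA = (2, 0), (12, 0)
    new_hw_old_framework = StackInfo(True, (1, 9), (12, 1))
    old_hw_new_framework = StackInfo(False, (2, 3), (12, 4))
    print(blocking_layers(new_hw_old_framework, MIN_FRAMEWORK, MIN_CUDA))
    print(blocking_layers(old_hw_new_framework, MIN_FRAMEWORK, MIN_CUDA))
```

Returning the list of blockers rather than a single boolean keeps the two mismatch cases from the paragraph above distinguishable.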

As discussed in how FP8, FP16, and BF16 represent different regimes, the format choice encodes assumptions about numerical behavior. The hardware choice determines which of those assumptions can actually be realized efficiently.

Why this matters for hardware selection

Precision support should be an explicit factor in hardware evaluation, not a footnote.

If an organization’s inference workload benefits substantially from FP8 (large language models, high-throughput serving, memory-bandwidth-bound operations), then the hardware evaluation must weigh FP8 tensor core support as a first-class requirement — because the throughput and efficiency gains from FP8 often exceed the gains from raw compute improvement between generations.

Conversely, if the workload is precision-sensitive and will remain at BF16 or FP32, then the absence of FP8 support on the target hardware is irrelevant, and the hardware evaluation should focus on other criteria.

The mistake is evaluating hardware at one precision and deploying at another without accounting for the performance change, or assuming that a format available on the newest generation is available on currently deployed hardware.

An honest hardware evaluation declares: “this is the precision format we intend to deploy, this hardware supports it natively, and our benchmark results were measured at this format.” Anything less creates a gap between the evaluation and the deployment that the economics of precision choice will eventually expose. The hardware doesn’t bend to the precision strategy. The precision strategy must fit the hardware.

Hardware precision constraints: a generation-conditional decision — native acceleration vs emulation and the joint precision-and-procurement decision.

LynxBench AI reports results per precision format, at the hardware’s native acceleration level, so the benchmark reflects whether the precision-hardware combination you intend to deploy is what was actually measured. It is a benchmarking methodology for AI hardware: it measures sustained performance across the complete hardware-and-software stack, reported per precision, with bounded optimisation.

Frequently Asked Questions

How does GPU and accelerator architecture constrain which precision formats are viable on a given system?

The silicon either has execution units for a given format or it doesn’t. V100 tensor cores accelerate FP16 only; A100 adds BF16, TF32, INT8, and INT4; Hopper adds FP8; Blackwell extends toward FP4. A format outside the chip’s native menu may still run, but via emulation or fallback to a wider format — losing the throughput the format was chosen for in the first place.

Why does tensor core support shape which precisions are actually efficient, not just which are technically supported?

Almost any format can be made to “work” through software conversion, but efficiency comes from dedicated execution units. FP8 multiplied on BF16 tensor cores with conversion overhead delivers roughly BF16 throughput — not the 2× FP8 advantage seen on Hopper. The economic case for low precision collapses without the hardware path, so the relevant question is native acceleration, not nominal support.

How do different hardware generations differ in their numerical affordances for AI workloads?

Each generation defines a different menu. Volta gives you FP16 mixed precision. Ampere adds BF16, TF32, INT8, and INT4 for inference quantization. Hopper introduces native FP8 in both E4M3 and E5M2 variants plus the Transformer Engine. Blackwell pushes further toward FP4. Backward compatibility holds, but new formats do not retrofit to older silicon.

Why are precision decisions inherently hardware-conditional rather than universal?

Because the throughput, memory, and accuracy trade-offs that drive a precision choice only materialize when the hardware can execute the format natively. The same FP8 strategy that wins on H100 produces no compute benefit on A100. There is no precision answer independent of the chip — only a precision answer conditional on the hardware-framework-driver stack that will actually run the workload.

What should a benchmark disclose about hardware precision support so a reader can interpret the result?

The hardware generation, the precision format used, whether that format is natively accelerated on that hardware, and the framework and CUDA versions involved. Without those four facts, an FP8 number could mean Hopper tensor core throughput or an A100 fallback path. Those two results are not comparable, because the fallback delivers roughly BF16 throughput rather than the FP8 speedup. LynxBench AI reports results per precision format at the hardware’s native acceleration level for exactly this reason.
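
One way to enforce that disclosure is to carry the facts alongside every reported number and refuse to interpret a result when any is missing. The record below is a sketch under that assumption; `BenchmarkDisclosure` and `missing_facts` are illustrative names, not part of LynxBench AI or any other tool, and the sample values are invented for the example.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class BenchmarkDisclosure:
    hardware_generation: str = ""
    precision_format: str = ""
    natively_accelerated: Optional[bool] = None  # None means "not stated"
    framework_version: str = ""
    cuda_version: str = ""


def missing_facts(disclosure):
    """Names of fields left unstated; False is a valid, stated answer."""
    return [
        f.name
        for f in fields(disclosure)
        if getattr(disclosure, f.name) in ("", None)
    ]


if __name__ == "__main__":
    # Invented example: an FP8 figure with no word on native acceleration.
    partial = BenchmarkDisclosure(hardware_generation="Ampere", precision_format="FP8")
    print(missing_facts(partial))
```

A result whose `natively_accelerated` field is explicitly `False` is still interpretable: it is a fallback measurement and should be read as one.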

Why can the “right precision” for the same workload change when the underlying hardware changes?

The workload’s numerical requirements don’t change, but the set of formats that are cheap to execute does. A model that ran at BF16 on A100 because FP8 wasn’t accelerated may shift to FP8 on H100 where the Transformer Engine handles scaling automatically. The decision is a fit between numerical tolerance and the chip’s native menu — change the chip, and the fit changes with it.

How do block scaling and blockwise quantization interact with hardware support, and can a format be supported but still require scaling machinery the hardware lacks?

Yes — this is a common gap. FP8 in practice depends on per-tensor or block scaling to keep values inside its narrow range, and finer blockwise quantization scales per small block rather than per tensor. On Hopper the Transformer Engine accelerates that dynamic scaling automatically; where the scaling path has no hardware acceleration, it runs on slower software routines and eats into the throughput the format promised. So a precision format can be nominally supported while the scaling machinery that makes it usable is not efficiently provided on a given generation.

When tensor core support is absent or only partial for a format, what falls back to slower paths, and how should an architect read reported throughput?

The matrix multiply drops to a wider-format tensor core or to CUDA cores, and any required scaling runs in software instead of on dedicated units — both of which surrender the format’s advantage. A throughput number measured on such a fallback describes the fallback, not the format. An architect should ask which execution units actually ran the workload, and treat any FP8 or INT8 figure measured on hardware that lacks native support as a fallback result, not as evidence of what the format can do on hardware that accelerates it.

Related deep-dives

FP8, FP16, and BF16 Represent Different Operating Regimes

Precision Is an Economic Lever in Inference Systems

Hardware Precision Constraints: A Generation-Conditional Decision
