“Can we just switch to FP8?”

The request sounds simple. FP8 offers roughly 2× the throughput of BF16 on supported hardware, halves the memory footprint, and enables larger models on fewer GPUs. The ML engineer or infrastructure planner hears “faster and cheaper” and reasonably asks: why aren’t we already using it?

The answer, more often than not, is hardware. The precision formats a GPU can accelerate natively are determined by its tensor core architecture, and that architecture varies across generations. A format the hardware doesn’t support natively doesn’t just run slower: it may offer no throughput benefit at all, or the deployment framework may not support it for that hardware target.

Precision decisions are hardware-conditional. Understanding the constraints is the prerequisite to making them well.

Tensor core generations and their numerical affordances

NVIDIA’s tensor core architecture has evolved across GPU generations, and each generation added support for new numerical formats while maintaining backward compatibility:

Volta (V100): First-generation tensor cores. Native support for FP16 matrix multiply with FP32 accumulation. No BF16, no INT8 tensor core support, no FP8. The V100 was foundational for mixed-precision training, but its format menu is limited by today’s standards.

Ampere (A100): Third-generation tensor cores. Added native BF16 and TF32 (an internal 19-bit format used transparently for FP32 operations). Added INT8 and INT4 tensor core support for inference quantization. No native FP8. The A100 is still widely deployed and handles BF16 inference and training well, but cannot accelerate FP8 workloads.

Hopper (H100, H200): Fourth-generation tensor cores. Added native FP8 (both E4M3 and E5M2 variants), along with the Transformer Engine that manages dynamic per-tensor scaling for FP8 automatically. BF16 throughput also increased substantially over Ampere.
Blackwell (B100, B200): Further FP8 optimization and potential FP4 support, with enhanced Transformer Engine capabilities.

Each generation defines a different menu of viable precision choices. A deployment targeting V100s is limited to FP16 mixed precision for tensor core acceleration. A deployment targeting A100s can use BF16, INT8, or INT4, but not FP8. A deployment targeting H100s can use any of the above plus FP8.

This isn’t a software limitation that can be patched. It’s a hardware constraint: the silicon either has execution units for a given format or it doesn’t.

Tensor core format support by GPU generation

| GPU generation | Native tensor core formats | Notable limitation |
| --- | --- | --- |
| Volta (V100) | FP16 | No BF16, no INT8, no FP8 |
| Ampere (A100) | FP16, BF16, TF32, INT8, INT4 | No FP8; FP8 operations fall back to BF16 speed |
| Hopper (H100/H200) | FP16, BF16, TF32, INT8, INT4, FP8 (E4M3, E5M2) | Full format menu with Transformer Engine |
| Blackwell (B100/B200) | All above + FP4 (expected) | None listed |

The penalty for unsupported formats

Running a precision format that the hardware doesn’t natively accelerate typically doesn’t cause an error: the framework falls back to a supported format or uses software emulation. But the performance implications are serious.

FP8 operations on A100 hardware execute on BF16 or FP16 tensor cores with conversion overhead, producing throughput roughly comparable to BF16, which means no FP8 advantage despite the lower precision. The memory savings from FP8 model representation still apply (the model is smaller in HBM), but the doubling of compute throughput that FP8 promises on Hopper hardware simply doesn’t materialize on Ampere.

Similarly, INT8 inference on V100 runs without dedicated tensor core support, falling back to CUDA cores with dramatically lower throughput than the INT8 tensor core path available on A100.
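One way to keep this constraint visible in tooling is to encode the format table as data and check a planned precision against it before any benchmarking starts. The following is a minimal sketch under stated assumptions, not a definitive implementation: the names `NATIVE_FORMATS` and `is_natively_accelerated` are hypothetical, and the entries simply transcribe the table above. Blackwell is left out because the table lists its FP4 support only as expected.

```python
# Native tensor core formats per GPU generation, transcribed from the
# article's table. Hypothetical helper names; Blackwell is omitted because
# its FP4 support is listed only as "expected".
NATIVE_FORMATS = {
    "volta": {"FP16"},
    "ampere": {"FP16", "BF16", "TF32", "INT8", "INT4"},
    "hopper": {"FP16", "BF16", "TF32", "INT8", "INT4", "FP8"},
}


def is_natively_accelerated(generation: str, fmt: str) -> bool:
    """Return True if the table lists `fmt` as a native tensor core format."""
    key = generation.lower()
    if key not in NATIVE_FORMATS:
        raise ValueError(f"unknown GPU generation: {generation!r}")
    return fmt.upper() in NATIVE_FORMATS[key]


if __name__ == "__main__":
    for gen in ("volta", "ampere", "hopper"):
        status = "native" if is_natively_accelerated(gen, "FP8") else "fallback or emulation"
        print(f"FP8 on {gen}: {status}")
```

A check like this does not measure anything. It only flags, before a benchmark is run, that a planned format would land on a fallback path for the target generation.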
The practical implication: a precision strategy developed and benchmarked on one hardware generation cannot be applied to a different generation without re-evaluating whether the target format is natively supported. Benchmark results measured on H100 at FP8 tell you nothing about FP8 performance on A100, because on A100 “FP8 performance” effectively doesn’t exist at the hardware acceleration level.

Framework and tooling dependencies

Hardware support is necessary but not sufficient. The deployment framework must also support the target precision format on the target hardware, with optimized kernels and correct numerical handling.

TensorRT, NVIDIA’s inference optimizer, added FP8 support alongside Hopper hardware. Earlier TensorRT versions targeting A100 support INT8 and FP16 but not FP8. PyTorch’s native FP8 support arrived with specific versions and requires Hopper hardware with compatible CUDA toolkit versions.

This creates a three-layer compatibility requirement: the hardware must support the format, the framework must support the format on that hardware, and the CUDA/driver stack must be at a version that enables the feature. A mismatch at any layer (current-generation hardware with an older framework version, or a current framework on previous-generation hardware) blocks the precision strategy.

As discussed in how FP8, FP16, and BF16 represent different regimes, the format choice encodes assumptions about numerical behavior. The hardware choice determines which of those assumptions can actually be realized efficiently.

Why this matters for hardware selection

Precision support should be an explicit factor in hardware evaluation, not a footnote.
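The three-layer compatibility requirement described above can be made explicit in an evaluation checklist. The sketch below is illustrative only: `StackCheck`, `blocking_layers`, and `precision_viable` are hypothetical names, and the three booleans are assumed to be filled in by hand or by whatever probing an organization already does. Nothing here queries real hardware, frameworks, or drivers.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StackCheck:
    """Outcome of checking one precision format against one deployment stack."""
    hardware_native: bool      # the silicon has execution units for the format
    framework_supported: bool  # the framework supports the format on this hardware
    driver_stack_ok: bool      # the CUDA/driver versions enable the feature


def blocking_layers(check: StackCheck) -> List[str]:
    """Return the layers that block the precision strategy, in stack order."""
    layers = [
        ("hardware", check.hardware_native),
        ("framework", check.framework_supported),
        ("cuda/driver", check.driver_stack_ok),
    ]
    return [name for name, ok in layers if not ok]


def precision_viable(check: StackCheck) -> bool:
    """A precision strategy is viable only when no layer blocks it."""
    return not blocking_layers(check)


if __name__ == "__main__":
    # Current-generation hardware paired with an older framework version.
    stale_framework = StackCheck(True, False, True)
    print(precision_viable(stale_framework), blocking_layers(stale_framework))
```

Reporting the blocking layer by name, rather than a single pass/fail flag, keeps the conversation concrete: a framework upgrade can fix a framework mismatch, while a hardware mismatch is a procurement question.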
If an organization’s inference workload benefits substantially from FP8 (large language models, high-throughput serving, memory-bandwidth-bound operations), then the hardware evaluation must weigh FP8 tensor core support as a first-class requirement, because the throughput and efficiency gains from FP8 often exceed the gains from raw compute improvement between generations.

Conversely, if the workload is precision-sensitive and will remain at BF16 or FP32, then the absence of FP8 support on the target hardware is irrelevant, and the hardware evaluation should focus on other criteria.

The mistake is evaluating hardware at one precision and deploying at another without accounting for the performance change, or assuming that a format available on the newest generation is available on currently deployed hardware.

An honest hardware evaluation declares: “this is the precision format we intend to deploy, this hardware supports it natively, and our benchmark results were measured at this format.” Anything less creates a gap between the evaluation and the deployment that the economics of precision choice will eventually expose.

The hardware doesn’t bend to the precision strategy. The precision strategy must fit the hardware.

Related deep-dives

Hardware precision constraints: a generation-conditional decision. Covers native acceleration vs emulation and the joint precision-and-procurement decision.

LynxBench AI reports results per precision format, at the hardware’s native acceleration level, so the benchmark reflects whether the precision-hardware combination you intend to deploy is what was actually measured. It is a benchmarking methodology for AI hardware: it measures sustained performance across the complete hardware-and-software stack, reported per precision, with bounded optimisation.

Frequently Asked Questions

How does GPU and accelerator architecture constrain which precision formats are viable on a given system?
The silicon either has execution units for a given format or it doesn’t. V100 tensor cores accelerate FP16 only; A100 adds BF16, TF32, INT8, and INT4; Hopper adds FP8; Blackwell extends toward FP4. A format outside the chip’s native menu may still run, but via emulation or fallback to a wider format, losing the throughput the format was chosen for in the first place.

Why does tensor core support shape which precisions are actually efficient, not just which are technically supported?

Almost any format can be made to “work” through software conversion, but efficiency comes from dedicated execution units. FP8 multiplied on BF16 tensor cores with conversion overhead delivers roughly BF16 throughput, not the 2× FP8 advantage seen on Hopper. The economic case for low precision collapses without the hardware path, so the relevant question is native acceleration, not nominal support.

How do different hardware generations differ in their numerical affordances for AI workloads?

Each generation defines a different menu. Volta gives you FP16 mixed precision. Ampere adds BF16, TF32, INT8, and INT4 for inference quantization. Hopper introduces native FP8 in both E4M3 and E5M2 variants plus the Transformer Engine. Blackwell pushes further toward FP4. Backward compatibility holds, but new formats do not retrofit to older silicon.

Why are precision decisions inherently hardware-conditional rather than universal?

Because the throughput, memory, and accuracy trade-offs that drive a precision choice only materialize when the hardware can execute the format natively. The same FP8 strategy that wins on H100 produces no compute benefit on A100. There is no precision answer independent of the chip, only a precision answer conditional on the hardware-framework-driver stack that will actually run the workload.

What should a benchmark disclose about hardware precision support so a reader can interpret the result?
The hardware generation, the precision format used, whether that format is natively accelerated on that hardware, and the framework and CUDA versions involved. Without those four facts, an FP8 number could mean Hopper tensor core throughput or A100 fallback running at roughly BF16 speed, and the two are not comparable. LynxBench AI reports results per precision format at the hardware’s native acceleration level for exactly this reason.

Why can the “right precision” for the same workload change when the underlying hardware changes?

The workload’s numerical requirements don’t change, but the set of formats that are cheap to execute does. A model that ran at BF16 on A100 because FP8 wasn’t accelerated may shift to FP8 on H100, where the Transformer Engine handles scaling automatically. The decision is a fit between numerical tolerance and the chip’s native menu: change the chip, and the fit changes with it.
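The disclosure facts listed in the FAQ above can be enforced mechanically on a results record before a number is published or compared. This is a rough sketch rather than any tool’s actual schema: the field names, `REQUIRED_FACTS`, and `missing_disclosures` are all hypothetical, the fourth fact (framework and CUDA versions) is split into two fields for convenience, and the example values are placeholders, not measurements.

```python
from typing import Any, Dict, List

# Hypothetical field names for the disclosure facts named in the FAQ.
# "Framework and CUDA versions" is stored as two separate fields.
REQUIRED_FACTS = (
    "hardware_generation",
    "precision_format",
    "natively_accelerated",
    "framework_version",
    "cuda_version",
)


def missing_disclosures(result: Dict[str, Any]) -> List[str]:
    """List the required facts that are absent or empty in a result record.

    A value of False for `natively_accelerated` counts as disclosed: saying
    that a format ran on a fallback path is exactly the point.
    """
    missing = []
    for key in REQUIRED_FACTS:
        value = result.get(key)
        if value is None or value == "":
            missing.append(key)
    return missing


if __name__ == "__main__":
    # Placeholder record: a bare FP8 figure with no hardware context.
    bare_number = {"precision_format": "FP8", "throughput": 1.0}
    print(missing_disclosures(bare_number))
```

A record that fails this check is not necessarily wrong, but a reader cannot tell which hardware path produced it.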