Three years ago, the debate was FP32 vs FP16
The decision was conceptually simple: full precision or half precision? FP32 offered numerical safety; FP16 offered double the throughput with known stability risks. BF16 arrived as a pragmatic compromise — same 16-bit width as FP16, but with FP32’s exponent range, trading mantissa bits for dynamic range. Most teams adopted BF16 for training and inference on Ampere-generation hardware, and the conversation moved on.
Then FP8 entered production hardware. Hopper-generation GPUs (H100, H200) include native FP8 tensor cores, and suddenly the conversation shifted from a binary choice to a three-way comparison where each option encodes fundamentally different assumptions about numerical behavior.
Format properties shape operating regimes
Each format’s characteristics create a distinct operating envelope — not a simple good-better-best ranking, but genuinely different trade-off profiles:
FP32 (IEEE 754 single precision): 8 exponent bits, 23 mantissa bits. Wide dynamic range and high precision. The baseline against which everything else is measured. No hardware-accelerated throughput advantage on tensor cores — it’s the slow, safe option.
FP16 (IEEE 754 half precision): 5 exponent bits, 10 mantissa bits. Limited dynamic range (maximum finite value 65,504) with moderate precision. The narrow dynamic range makes it problematic for training (gradients can overflow or underflow) but workable for inference on well-behaved models. Historically the first “fast” option on tensor cores.
BF16 (Brain Float 16): 8 exponent bits, 7 mantissa bits. Same dynamic range as FP32 but substantially less precision. The engineering insight was that matching FP32’s range eliminates most overflow/underflow issues that plague FP16, even though individual values are less precise. This makes BF16 the practical default for training and many inference workloads on hardware that supports it.
FP8 (E4M3 and E5M2 variants): Two sub-formats compete. E4M3 has 4 exponent bits and 3 mantissa bits — narrower range than BF16 (maximum 448) but more mantissa precision than E5M2. E5M2 has 5 exponent bits and 2 mantissa bits — wider range, less precision. Hardware implementations typically support both, and inference frameworks can choose per-layer or per-tensor which variant to use.
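The maxima quoted above follow directly from the bit layouts. A small sketch, assuming the standard IEEE 754 convention that the top exponent code is reserved for infinities and NaNs (which holds for FP32, FP16, BF16, and E5M2, but notably not for E4M3):

```python
def ieee_max_finite(exp_bits: int, man_bits: int) -> float:
    """Largest finite value in an IEEE-754-style format, where the
    top exponent code is reserved for infinities/NaNs."""
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = (2 ** exp_bits - 2) - bias       # largest usable unbiased exponent
    return (2 - 2 ** -man_bits) * 2.0 ** max_exp

for name, (e, m) in {
    "FP32":     (8, 23),
    "FP16":     (5, 10),   # -> 65504
    "BF16":     (8, 7),
    "FP8 E5M2": (5, 2),    # -> 57344
}.items():
    print(f"{name:9s} max finite = {ieee_max_finite(e, m):.5g}")

# FP8 E4M3 breaks the IEEE pattern: the OCP FP8 spec drops infinities
# and keeps a single NaN encoding, freeing the top binade, so its max
# finite value is 1.75 * 2**8 = 448 rather than the 240 the formula
# above would predict for a (4, 3) layout.
E4M3_MAX = 448.0
```

Note how close BF16’s maximum lands to FP32’s — same exponent field, so the mantissa difference only trims the top of the final binade.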
These aren’t points on a linear scale. Moving from BF16 to FP8 doesn’t just halve the bit width. It changes which values can be represented, where the rounding errors fall, how scale factors must be managed, and what the hardware will do with values outside the representable range.
Throughput and density: the FP8 proposition
FP8’s primary appeal is raw throughput. On H100 tensor cores, FP8 matrix multiplications are designed to run at up to roughly 2× the rate of BF16 and 4× the rate of FP32, per NVIDIA’s published architectural targets. For memory-bandwidth-bound inference workloads, FP8 also halves the bytes read from HBM per weight compared to BF16, directly improving tokens-per-second for autoregressive decoding.
The density benefit is equally significant. A 70B-parameter model in FP8 fits in approximately 70 GB of HBM — roughly one H100 80GB card. The same model in BF16 requires approximately 140 GB, forcing multi-GPU deployment with the associated communication overhead. FP8 doesn’t just make each GPU faster; it changes the deployment topology.
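The arithmetic behind those figures is worth making explicit — it counts weight bytes only, ignoring KV cache, activations, and framework overhead, which is why the 70 GB figure only “roughly” fits an 80 GB card:

```python
def weight_footprint_gb(n_params: float, bits_per_param: int) -> float:
    """Weight storage alone: params * bits / 8 bytes, in (decimal) GB.
    Excludes KV cache, activations, and runtime overhead."""
    return n_params * bits_per_param / 8 / 1e9

for fmt, bits in [("FP32", 32), ("BF16", 16), ("FP8", 8)]:
    print(f"70B weights in {fmt}: {weight_footprint_gb(70e9, bits):.0f} GB")
```

At FP8 the weights leave roughly 10 GB of headroom on an 80 GB H100; at BF16 the same model cannot fit on fewer than two such cards.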
These are compelling advantages, which is why FP8 adoption is accelerating. But the advantages come with constraints that BF16 doesn’t impose.
Stability and risk: the FP8 trade-offs
FP8 E4M3 can represent values up to 448. Activations that exceed this range must be clipped or scaled. Unlike BF16, which shares FP32’s exponent range and rarely encounters overflow in practice, FP8 requires per-tensor scale factors to map the activation range into the representable interval.
This scaling is not optional — it’s a requirement for correctness. If scale factors are poorly calibrated, the model produces garbage. If they’re calibrated on data that doesn’t represent the production distribution, the model works on the calibration set and fails on edge cases.
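The mechanics can be sketched in a few lines. This is a float32 simulation of per-tensor scaling only — the function name is illustrative, and real FP8 kernels additionally round to the 3-bit mantissa, which this sketch omits:

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite FP8 E4M3 value

def fake_quant_e4m3(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Simulate per-tensor FP8 E4M3 scaling in float32 (no actual
    3-bit mantissa rounding). Returns the scaled tensor and the
    dequantization scale a kernel would store alongside it."""
    amax = float(np.abs(x).max())
    scale = E4M3_MAX / amax if amax > 0 else 1.0
    q = np.clip(x * scale, -E4M3_MAX, E4M3_MAX).astype(np.float32)
    return q, 1.0 / scale  # dequantize as q * (1/scale)

rng = np.random.default_rng(0)
acts = (rng.standard_normal((4, 16)) * 1000).astype(np.float32)  # far outside [-448, 448]
q, dequant_scale = fake_quant_e4m3(acts)
recovered = q * dequant_scale  # close to acts; real FP8 adds rounding error on top
```

The calibration problem described above lives in `amax`: in production, the scale is typically fixed from calibration data rather than recomputed per batch, so an activation larger than anything seen during calibration gets clipped at 448 after scaling — silently, and with whatever accuracy consequences follow.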
The E5M2 variant extends the range (maximum 57,344) at the cost of further reduced precision — only 2 mantissa bits means each representable value is an extremely coarse approximation. This variant is sometimes used for gradient representation in training, where range matters more than per-value accuracy, but it’s aggressive even for inference on precision-sensitive tasks.
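“Extremely coarse” can be made concrete: within one binade, adjacent representable values differ by a relative step of about 2 to the power of minus the mantissa bit count. A one-liner to compare:

```python
def worst_relative_step(man_bits: int) -> float:
    """Worst-case gap between adjacent representable values within a
    binade, relative to the value itself: 2**-man_bits."""
    return 2.0 ** -man_bits

print(f"E4M3 (3 mantissa bits): {worst_relative_step(3):.1%}")   # 12.5%
print(f"E5M2 (2 mantissa bits): {worst_relative_step(2):.1%}")   # 25.0%
print(f"BF16 (7 mantissa bits): {worst_relative_step(7):.3%}")
```

A 25% worst-case relative gap means E5M2 can represent, say, 1.0 and 1.25 but nothing in between — acceptable for gradient magnitudes, brutal for logits.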
The practical risk profile is: BF16 is numerically robust by default, requiring minimal per-deployment validation. FP8 is numerically viable but requires careful calibration, per-task validation, and awareness of failure modes that don’t exist at higher precisions.
Hardware support defines what’s viable
Precision format choice is not purely a software decision. The hardware must have dedicated execution units for the target format, or the throughput benefit disappears.
Tensor cores on Ampere (A100) natively accelerate FP16, BF16, and TF32. They do not accelerate FP8 — running FP8 on A100 requires software emulation with no throughput advantage. Hopper (H100, H200) adds native FP8 tensor cores. Previous-generation hardware (V100) supports only FP16 on tensor cores, with no BF16 acceleration.
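The capability matrix above is small enough to encode directly. The table and helper below are illustrative (not a real driver API); they just mechanize the “pick the most aggressive format the hardware actually accelerates” decision:

```python
# Illustrative capability table, per the generations discussed above.
TENSOR_CORE_FORMATS = {
    "V100": {"FP16"},
    "A100": {"FP16", "BF16", "TF32"},
    "H100": {"FP16", "BF16", "TF32", "FP8"},
}

def fastest_accelerated(gpu: str, preference=("FP8", "BF16", "FP16")) -> str:
    """Most aggressive format the target GPU's tensor cores accelerate;
    falls back to FP32 (always correct, no tensor-core speedup)."""
    supported = TENSOR_CORE_FORMATS.get(gpu, set())
    for fmt in preference:
        if fmt in supported:
            return fmt
    return "FP32"

print(fastest_accelerated("H100"))  # FP8
print(fastest_accelerated("A100"))  # BF16
print(fastest_accelerated("V100"))  # FP16
```

The `preference` ordering is itself a policy decision — a precision-sensitive deployment might put BF16 first even on Hopper, which is exactly the judgment call the rest of this section argues for.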
This means the “best” precision format is hardware-conditional. A deployment decision that assumes FP8 availability but targets A100 hardware gains nothing. A deployment optimized for BF16 on Hopper hardware leaves FP8 throughput on the table. The format choice must be made jointly with the hardware selection, not independently of it — which is precisely the interplay explored in how hardware architecture constrains precision decisions.
Comparing without ranking
The temptation is to arrange these formats as a progression: FP32 → BF16 → FP8, with each step being “better” (faster, more efficient). That framing is misleading because it implies a single axis of comparison, when the reality is multi-dimensional.
FP8 is not a universal improvement over BF16. It’s a different operating regime — one that offers higher throughput and density at the cost of narrower representable range, mandatory scale factor management, and higher sensitivity to calibration quality. For workloads where these constraints are manageable (well-behaved activations, representative calibration data, tasks with high quantization tolerance), FP8 is the clear efficiency choice. For workloads where precision sensitivity is high, activation distributions are unpredictable, or deployment conditions vary, BF16 provides robustness that FP8 does not.
The engineering discipline is selecting the format that matches the workload’s requirements and the hardware’s capabilities — not defaulting to the newest or most aggressive option. Each format is a tool for a specific set of conditions, and the conditions determine the right choice. As explored in the economic implications of precision decisions, the costs of getting this choice wrong aren’t just numerical — they’re operational and financial.