Three years ago, the debate was FP32 vs FP16
The decision was conceptually simple: full precision or half precision? FP32 offered numerical safety; FP16 offered double the throughput with known stability risks. BF16 arrived as a pragmatic compromise — same 16-bit width as FP16, but with FP32’s exponent range, trading mantissa bits for dynamic range. Most teams adopted BF16 for training and inference on Ampere-generation hardware, and the conversation moved on.
Then FP8 entered production hardware. Hopper-generation GPUs (H100, H200) include native FP8 tensor cores, and suddenly the conversation shifted from a binary choice to a three-way comparison where each option encodes fundamentally different assumptions about numerical behavior.
Format properties shape operating regimes
Each format’s characteristics create a distinct operating envelope — not a simple good-better-best ranking, but genuinely different trade-off profiles:
FP32 (IEEE 754 single precision): 8 exponent bits, 23 mantissa bits. Wide dynamic range and high precision. The baseline against which everything else is measured. No hardware-accelerated throughput advantage on tensor cores — it’s the slow, safe option.
FP16 (IEEE 754 half precision): 5 exponent bits, 10 mantissa bits. Limited dynamic range (maximum finite value 65,504) with moderate precision. The narrow dynamic range makes it problematic for training (gradients can overflow or underflow) but workable for inference on well-behaved models. Historically the first “fast” option on tensor cores.
BF16 (Brain Float 16): 8 exponent bits, 7 mantissa bits. Same dynamic range as FP32 but substantially less precision. The engineering insight was that matching FP32’s range eliminates most overflow/underflow issues that plague FP16, even though individual values are less precise. This makes BF16 the practical default for training and many inference workloads on hardware that supports it.
FP8 (E4M3 and E5M2 variants): Two sub-formats compete. E4M3 has 4 exponent bits and 3 mantissa bits — narrower range than BF16 (maximum 448) but more mantissa precision than E5M2. E5M2 has 5 exponent bits and 2 mantissa bits — wider range, less precision. Hardware implementations typically support both, and inference frameworks can choose per-layer or per-tensor which variant to use.
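The maxima quoted above follow directly from the bit layouts. A small sketch, assuming the standard IEEE 754 convention that the top exponent code is reserved for infinities and NaNs (which holds for FP32, FP16, BF16, and E5M2, but notably not for E4M3):

```python
def ieee_max_finite(exp_bits: int, man_bits: int) -> float:
    """Largest finite value in an IEEE-754-style format, where the
    top exponent code is reserved for infinities/NaNs."""
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = (2 ** exp_bits - 2) - bias       # largest usable unbiased exponent
    return (2 - 2 ** -man_bits) * 2.0 ** max_exp

for name, (e, m) in {
    "FP32":     (8, 23),
    "FP16":     (5, 10),   # -> 65504
    "BF16":     (8, 7),
    "FP8 E5M2": (5, 2),    # -> 57344
}.items():
    print(f"{name:9s} max finite = {ieee_max_finite(e, m):.5g}")

# FP8 E4M3 breaks the IEEE pattern: the OCP FP8 spec drops infinities
# and keeps a single NaN encoding, freeing the top binade, so its max
# finite value is 1.75 * 2**8 = 448 rather than the 240 the formula
# above would predict for a (4, 3) layout.
E4M3_MAX = 448.0
```

Note how close BF16’s maximum lands to FP32’s — same exponent field, so the mantissa difference only trims the top of the final binade.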
These aren’t points on a linear scale. Moving from BF16 to FP8 doesn’t just halve the bit width. It changes which values can be represented, where the rounding errors fall, how scale factors must be managed, and what the hardware will do with values outside the representable range.
Throughput and density: the FP8 proposition
FP8’s primary appeal is raw throughput. On H100 tensor cores, FP8 matrix multiplications are designed to run at up to roughly 2× the rate of BF16 and 4× the rate of FP32, per NVIDIA’s published architectural targets. For memory-bandwidth-bound inference workloads, FP8 also halves the bytes read from HBM per weight compared to BF16, directly improving tokens-per-second for autoregressive decoding.
The density benefit is equally significant. A 70B-parameter model in FP8 fits in approximately 70 GB of HBM — roughly one H100 80GB card. The same model in BF16 requires approximately 140 GB, forcing multi-GPU deployment with the associated communication overhead. FP8 doesn’t just make each GPU faster; it changes the deployment topology.
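The arithmetic behind those figures is worth making explicit — it counts weight bytes only, ignoring KV cache, activations, and framework overhead, which is why the 70 GB figure only “roughly” fits an 80 GB card:

```python
def weight_footprint_gb(n_params: float, bits_per_param: int) -> float:
    """Weight storage alone: params * bits / 8 bytes, in (decimal) GB.
    Excludes KV cache, activations, and runtime overhead."""
    return n_params * bits_per_param / 8 / 1e9

for fmt, bits in [("FP32", 32), ("BF16", 16), ("FP8", 8)]:
    print(f"70B weights in {fmt}: {weight_footprint_gb(70e9, bits):.0f} GB")
```

At FP8 the weights leave roughly 10 GB of headroom on an 80 GB H100; at BF16 the same model cannot fit on fewer than two such cards.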
These are compelling advantages, which is why FP8 adoption is accelerating. But the advantages come with constraints that BF16 doesn’t impose.
Stability and risk: the FP8 trade-offs
FP8 E4M3 can represent values up to 448. Activations that exceed this range must be clipped or scaled. Unlike BF16, which shares FP32’s exponent range and rarely encounters overflow in practice, FP8 requires per-tensor scale factors to map the activation range into the representable interval.
This scaling is not optional — it’s a requirement for correctness. If scale factors are poorly calibrated, the model produces garbage. If they’re calibrated on data that doesn’t represent the production distribution, the model works on the calibration set and fails on edge cases.
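The mechanics can be sketched in a few lines. This is a float32 simulation of per-tensor scaling only — the function name is illustrative, and real FP8 kernels additionally round to the 3-bit mantissa, which this sketch omits:

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite FP8 E4M3 value

def fake_quant_e4m3(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Simulate per-tensor FP8 E4M3 scaling in float32 (no actual
    3-bit mantissa rounding). Returns the scaled tensor and the
    dequantization scale a kernel would store alongside it."""
    amax = float(np.abs(x).max())
    scale = E4M3_MAX / amax if amax > 0 else 1.0
    q = np.clip(x * scale, -E4M3_MAX, E4M3_MAX).astype(np.float32)
    return q, 1.0 / scale  # dequantize as q * (1/scale)

rng = np.random.default_rng(0)
acts = (rng.standard_normal((4, 16)) * 1000).astype(np.float32)  # far outside [-448, 448]
q, dequant_scale = fake_quant_e4m3(acts)
recovered = q * dequant_scale  # close to acts; real FP8 adds rounding error on top
```

The calibration problem described above lives in `amax`: in production, the scale is typically fixed from calibration data rather than recomputed per batch, so an activation larger than anything seen during calibration gets clipped at 448 after scaling — silently, and with whatever accuracy consequences follow.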
The E5M2 variant extends the range (maximum 57,344) at the cost of further reduced precision — only 2 mantissa bits means each representable value is an extremely coarse approximation. This variant is sometimes used for gradient representation in training, where range matters more than per-value accuracy, but it’s aggressive even for inference on precision-sensitive tasks.
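“Extremely coarse” can be made concrete: within one binade, adjacent representable values differ by a relative step of about 2 to the power of minus the mantissa bit count. A one-liner to compare:

```python
def worst_relative_step(man_bits: int) -> float:
    """Worst-case gap between adjacent representable values within a
    binade, relative to the value itself: 2**-man_bits."""
    return 2.0 ** -man_bits

print(f"E4M3 (3 mantissa bits): {worst_relative_step(3):.1%}")   # 12.5%
print(f"E5M2 (2 mantissa bits): {worst_relative_step(2):.1%}")   # 25.0%
print(f"BF16 (7 mantissa bits): {worst_relative_step(7):.3%}")
```

A 25% worst-case relative gap means E5M2 can represent, say, 1.0 and 1.25 but nothing in between — acceptable for gradient magnitudes, brutal for logits.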
The practical risk profile is: BF16 is numerically robust by default, requiring minimal per-deployment validation. FP8 is numerically viable but requires careful calibration, per-task validation, and awareness of failure modes that don’t exist at higher precisions.
Hardware support defines what’s viable
Precision format choice is not purely a software decision. The hardware must have dedicated execution units for the target format, or the throughput benefit disappears.
Tensor cores on Ampere (A100) natively accelerate FP16, BF16, and TF32. They do not accelerate FP8 — running FP8 on A100 requires software emulation with no throughput advantage. Hopper (H100, H200) adds native FP8 tensor cores. Previous-generation hardware (V100) supports only FP16 on tensor cores, with no BF16 acceleration.
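The capability matrix above is small enough to encode directly. The table and helper below are illustrative (not a real driver API); they just mechanize the “pick the most aggressive format the hardware actually accelerates” decision:

```python
# Illustrative capability table, per the generations discussed above.
TENSOR_CORE_FORMATS = {
    "V100": {"FP16"},
    "A100": {"FP16", "BF16", "TF32"},
    "H100": {"FP16", "BF16", "TF32", "FP8"},
}

def fastest_accelerated(gpu: str, preference=("FP8", "BF16", "FP16")) -> str:
    """Most aggressive format the target GPU's tensor cores accelerate;
    falls back to FP32 (always correct, no tensor-core speedup)."""
    supported = TENSOR_CORE_FORMATS.get(gpu, set())
    for fmt in preference:
        if fmt in supported:
            return fmt
    return "FP32"

print(fastest_accelerated("H100"))  # FP8
print(fastest_accelerated("A100"))  # BF16
print(fastest_accelerated("V100"))  # FP16
```

The `preference` ordering is itself a policy decision — a precision-sensitive deployment might put BF16 first even on Hopper, which is exactly the judgment call the rest of this section argues for.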
This means the “best” precision format is hardware-conditional. A deployment decision that assumes FP8 availability but targets A100 hardware gains nothing. A deployment optimized for BF16 on Hopper hardware leaves FP8 throughput on the table. The format choice must be made jointly with the hardware selection, not independently of it — which is precisely the interplay explored in how hardware architecture constrains precision decisions.
Comparing without ranking
The temptation is to arrange these formats as a progression: FP32 → BF16 → FP8, with each step being “better” (faster, more efficient). That framing is misleading because it implies a single axis of comparison, when the reality is multi-dimensional.
FP8 is not a universal improvement over BF16. It’s a different operating regime — one that offers higher throughput and density at the cost of narrower representable range, mandatory scale factor management, and higher sensitivity to calibration quality. For workloads where these constraints are manageable (well-behaved activations, representative calibration data, tasks with high quantization tolerance), FP8 is the clear efficiency choice. For workloads where precision sensitivity is high, activation distributions are unpredictable, or deployment conditions vary, BF16 provides robustness that FP8 does not.
The engineering discipline is selecting the format that matches the workload’s requirements and the hardware’s capabilities — not defaulting to the newest or most aggressive option. Each format is a tool for a specific set of conditions, and the conditions determine the right choice. As explored in the economic implications of precision decisions, the costs of getting this choice wrong aren’t just numerical — they’re operational and financial.