## Two formats, different tradeoffs

BF16 and FP16 are both 16-bit floating-point formats, but they allocate their bits differently, and that allocation determines which AI workloads each format serves well.

- **FP16 (IEEE 754 half-precision):** 1 sign bit, 5 exponent bits, 10 mantissa bits. Higher precision per value, but limited dynamic range (max ~65,504).
- **BF16 (Brain Float 16):** 1 sign bit, 8 exponent bits, 7 mantissa bits. Same dynamic range as FP32 (max ~3.4 × 10³⁸), but less precision per value.

| Property | FP16 | BF16 |
| --- | --- | --- |
| Total bits | 16 | 16 |
| Exponent bits | 5 | 8 |
| Mantissa bits | 10 | 7 |
| Dynamic range | ±65,504 | ±3.4 × 10³⁸ |
| Precision (significant digits) | ~3.3 decimal | ~2.4 decimal |
| FP32-compatible range | No | Yes |

## Why the difference matters for AI

BF16 trades mantissa precision for dynamic range. In practice, that means it handles the gradient magnitudes seen in training better than FP16, at the cost of slightly less precise inference outputs. The practical consequences:

**Training gradients span many orders of magnitude.** Early-layer gradients can be around 10⁻⁸ while later-layer gradients reach 10². FP16's limited dynamic range cannot represent this span without loss scaling: gradients below ~6 × 10⁻⁸ underflow to zero, effectively stopping learning in those parameters. BF16's FP32-equivalent range eliminates this problem entirely.

**Inference activations benefit from precision.** During inference, the dynamic range of activations is typically narrower (the model is trained and the weights are stable). FP16's extra 3 mantissa bits give more precise intermediate computations, which can matter for tasks where small numerical differences affect output quality, particularly in attention score computation, where softmax amplifies small differences.

## When to choose each format

The choice between BF16 and FP16 depends on whether your workload is gradient-dominated (training) or activation-precision-dominated (inference), not on which format is "newer" or which hardware vendor promotes it.

**Choose BF16 when:**

- Training models, especially large models where gradient magnitudes vary widely
- Fine-tuning pre-trained models (the same gradient dynamics apply)
- Running mixed-precision training without wanting to tune loss-scaling hyperparameters
- Your hardware supports it natively (A100, H100, and AMD MI300 all have BF16 tensor cores)

**Choose FP16 when:**

- Running inference where activation precision affects output quality
- Deploying on hardware that lacks native BF16 support (older GPUs, some edge devices)
- Working with models trained in FP16 that expect FP16 inference (keeping the weights in FP16 requires no conversion; casting them to BF16 introduces rounding error)
- FP16 is as fast or faster on your specific hardware (some older tensor cores only accelerate FP16)

## The hardware dependency

Not all tensor core generations handle BF16 and FP16 identically. NVIDIA's A100 processes both at the same throughput (312 TFLOPS dense). The H100 also runs FP16 and BF16 at equal tensor-core rates, but the two formats differ in how they interact with the Transformer Engine's dynamic precision selection. AMD's MI300X supports both formats, and its CDNA 3 architecture processes BF16 matrix operations at the same rate as FP16. Intel's Gaudi accelerators favour BF16 natively. The implication: precision format choice interacts with the hardware operating regime in ways that cannot be determined from spec sheets alone.
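The numeric limits in the table and the FP16 underflow floor are easy to verify directly, and the same snippet can confirm whether the local GPU exposes native BF16. A minimal sketch in PyTorch (`torch.finfo`, tensor casts, and `torch.cuda.is_bf16_supported` are standard PyTorch calls; the 1e-8 value is just an illustrative gradient magnitude):

```python
import torch

# Numeric limits implied by the two bit layouts. torch.finfo reads these
# straight from the dtype definitions, so they match the table above.
for dtype in (torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{dtype}: max={info.max:.3e}, smallest normal={info.tiny:.3e}, eps={info.eps:.3e}")

# The underflow behaviour from the training discussion: a gradient-sized value
# below FP16's ~6e-8 floor flushes to zero in FP16 but survives in BF16.
tiny_grad = torch.tensor(1e-8)
print(tiny_grad.to(torch.float16).item())    # 0.0   -> the update is lost
print(tiny_grad.to(torch.bfloat16).item())   # ~1e-8 -> coarse, but non-zero

# Native BF16 support varies by GPU generation; check before committing to a format.
if torch.cuda.is_available():
    print("Native BF16 support:", torch.cuda.is_bf16_supported())
```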
The theoretical TFLOPS for each format on a given GPU tells you the hardware capability; it does not tell you what throughput your specific model architecture will achieve, because model structure determines how effectively the tensor cores are utilised.

## Practical decision framework

For teams choosing between BF16 and FP16:

- **If training:** Use BF16. The dynamic range advantage eliminates an entire class of numerical instability problems without meaningful accuracy cost (see the training sketch below).
- **If inference on modern hardware:** Test both. Measure actual output quality on your evaluation set, not just throughput. If quality is equivalent, choose whichever your hardware processes faster (see the benchmark sketch after the training example).
- **If deploying to edge:** Check hardware support. Many edge accelerators support FP16 but not BF16.
- **If uncertain:** Start with BF16 for training, and benchmark both for inference on your specific model and hardware combination.
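To make the training recommendation concrete, here is a minimal mixed-precision loop sketch, assuming PyTorch on a CUDA device; the model, batch, and optimiser are placeholders. The practical difference is that the BF16 path needs no `GradScaler` or loss-scaling hyperparameters, while the FP16 path does.

```python
import torch

model = torch.nn.Linear(512, 512).cuda()                   # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
data = torch.randn(64, 512, device="cuda")                 # placeholder batch
target = torch.randn(64, 512, device="cuda")

# BF16 mixed precision: no GradScaler and no loss-scaling hyperparameters,
# because BF16 covers the same exponent range as FP32.
for _ in range(10):
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = torch.nn.functional.mse_loss(model(data), target)
    loss.backward()
    optimizer.step()

# The FP16 equivalent needs a GradScaler to keep small gradients above
# FP16's ~6e-8 underflow floor.
scaler = torch.cuda.amp.GradScaler()
for _ in range(10):
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = torch.nn.functional.mse_loss(model(data), target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```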
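And a rough sketch of the "test both" advice for inference, again assuming PyTorch on CUDA: time both formats and compare their outputs against an FP32 reference. The toy MLP and the max-absolute-difference check are placeholders; substitute your real model, evaluation data, and quality metric.

```python
import copy
import time
import torch

def run(model, x, dtype, iters=100):
    """Run the forward pass in `dtype`; return FP32 outputs and ms per iteration."""
    m = copy.deepcopy(model).to(dtype)       # keep the FP32 weights untouched
    xd = x.to(dtype)
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        for _ in range(iters):
            out = m(xd)
    torch.cuda.synchronize()
    return out.float(), (time.perf_counter() - start) * 1000 / iters

# Placeholder model and batch -- substitute the real model and evaluation set.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
).cuda()
x = torch.randn(32, 1024, device="cuda")

ref, _ = run(model, x, torch.float32)
for dtype in (torch.float16, torch.bfloat16):
    out, ms = run(model, x, dtype)
    diff = (out - ref).abs().max().item()
    print(f"{dtype}: {ms:.3f} ms/iter, max abs diff vs FP32: {diff:.2e}")
```

A real comparison would add warm-up iterations and score the outputs with the task's own quality metric rather than a raw tensor difference.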