The cost line nobody expected
An inference team deploys a model in BF16, measures the per-request cost, and builds a unit economics model. Six months later, request volume has tripled. The GPU fleet is growing proportionally. Someone asks: what happens to cost if we shift the model to FP8?
The arithmetic is revealing. FP8 halves the memory footprint, so the model fits on fewer GPUs (or serves larger batches per GPU). FP8 tensor cores deliver roughly 2× the throughput of BF16. Combined, the effect isn’t just “inference is faster” — it’s “inference costs substantially less per request, at scale, over time.” The precision format change didn’t improve the model. It didn’t add features. It changed the economics of running the model in production.
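The arithmetic can be sketched as a back-of-envelope cost model. All the numbers below (GPU hourly price, baseline throughput) are illustrative assumptions, not measurements; only the ~2× FP8 speedup factor comes from the discussion above.

```python
def cost_per_request(gpu_hour_usd, requests_per_gpu_hour):
    """Serving cost per request for a single-GPU deployment."""
    return gpu_hour_usd / requests_per_gpu_hour

GPU_HOUR_USD = 3.00        # assumed hourly GPU price, for illustration only
BF16_THROUGHPUT = 1_000    # assumed requests per GPU-hour at BF16
FP8_SPEEDUP = 2.0          # ~2x tensor-core throughput at FP8 vs BF16

bf16 = cost_per_request(GPU_HOUR_USD, BF16_THROUGHPUT)
fp8 = cost_per_request(GPU_HOUR_USD, BF16_THROUGHPUT * FP8_SPEEDUP)
print(f"BF16: ${bf16:.4f}/request, FP8: ${fp8:.4f}/request")
# For a compute-bound workload that scales cleanly, the format change
# alone halves the per-request cost -- no model or feature change needed.
```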
This is what makes precision an economic lever, not just a technical parameter.
The three-axis impact
A precision reduction in an inference system affects at least three cost-relevant dimensions simultaneously:
Throughput. Lower precision means more operations per tensor core cycle (on hardware that supports it natively). FP8 on H100 tensor cores runs at roughly 2× the FLOPS of BF16. For compute-bound workloads, this translates directly to more requests processed per second per GPU. More throughput per GPU means fewer GPUs needed for the same request volume.
Memory. A model in FP8 uses half the HBM of the same model in BF16, and a quarter of FP32. This means either fitting a larger model on a single GPU (avoiding multi-GPU serving overhead) or serving more concurrent requests with larger batches. Both reduce cost per request.
Power. Lower-precision operations generally consume less energy per operation. At data center scale, power costs are a significant fraction of total infrastructure cost. A fleet of GPUs running FP8 inference at lower power-per-request extends the effective capacity of the power and cooling infrastructure.
These three effects compound. The throughput improvement reduces the GPU count needed. The memory improvement enables better batching, which further improves GPU utilization. The power reduction lowers operational cost on every GPU you do run. The total cost impact of a precision format change can exceed the impact of a hardware generation upgrade — without purchasing any new hardware.
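The compounding of the three effects can be made concrete with a toy fleet-cost ratio. The multipliers here (throughput gain, batching-driven utilization gain, per-GPU power ratio) are placeholder assumptions to be replaced with profiled figures from a real workload.

```python
def fleet_cost_ratio(throughput_gain, batch_util_gain, power_ratio):
    """Relative operating cost of the lower-precision fleet vs. baseline.

    throughput_gain: requests/GPU multiplier from faster math (e.g. 2.0)
    batch_util_gain: extra utilization from larger batches (e.g. 1.2)
    power_ratio:     power per GPU at the lower precision (e.g. 0.85)
    """
    gpu_count_ratio = 1.0 / (throughput_gain * batch_util_gain)
    # Fewer GPUs needed, and each remaining GPU draws less power.
    return gpu_count_ratio * power_ratio

# Assumed multipliers for illustration: 2x math, 1.2x batching, 0.85x power.
ratio = fleet_cost_ratio(throughput_gain=2.0, batch_util_gain=1.2, power_ratio=0.85)
print(f"Lower-precision fleet runs at ~{ratio:.0%} of baseline cost")
```

The point of the sketch is that the factors multiply: no single axis delivers the full saving, but their product can exceed a hardware-generation uplift.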
Higher precision can be economically wasteful
This is the less comfortable side of the argument. If a model’s output quality at BF16 and FP8 is equivalent within the application’s requirements (as it often is, since for many tasks the accuracy loss from lower precision is negligible), then running at BF16 is paying for precision the application doesn’t need.
It’s the equivalent of shipping all data via priority overnight courier when standard mail arrives on time — the premium buys nothing except the reassurance of having paid for it.
In an inference system serving millions of requests, the cost of unnecessary precision is real and cumulative. Each wasted bit of precision is extra HBM consumed, extra memory bandwidth used, extra power drawn, and extra GPU-seconds billed — without any change in the user-facing output.
This doesn’t mean lower precision is always the right choice. It means precision should be selected based on what the task requires, validated against quality metrics, and then deployed at the lowest precision that meets those requirements. Defaulting to “the most precision available” is not a conservative engineering choice; it’s an unexamined cost assumption.
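The selection rule above ("lowest precision that meets validated requirements") can be sketched as a small function. The candidate formats, their evaluation scores, and the quality threshold are all hypothetical inputs; in practice they come from your own evaluation suite.

```python
def select_precision(validated_quality, min_quality, cost_rank):
    """Return the cheapest format whose *measured* quality clears the bar.

    validated_quality: {format: empirically validated quality score}
    min_quality:       the application's quality requirement
    cost_rank:         candidate formats ordered cheapest-first
    """
    for fmt in cost_rank:
        if validated_quality.get(fmt, 0.0) >= min_quality:
            return fmt
    raise ValueError("no candidate precision meets the quality requirement")

# Hypothetical eval scores -- never assumed, always measured.
quality = {"int8": 0.891, "fp8": 0.902, "bf16": 0.905}
choice = select_precision(quality, min_quality=0.90,
                          cost_rank=["int8", "fp8", "bf16"])
print(choice)  # fp8: int8 misses the bar; bf16 buys precision nobody needs
```

Note that the function starts from the cheapest option and stops at the first one that passes, which is the opposite of defaulting to "the most precision available."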
Cost-optimal precision depends on workload and SLA
The right precision is not universal. It’s a function of the workload characteristics, the quality requirements, and the infrastructure constraints:
A large language model generating text for a customer-facing chatbot may need BF16 to preserve the subtle reasoning quality that users perceive. The same model powering internal document summarization — where summaries are reviewed by humans before use — may produce equivalent utility at INT8.
An image classification model in a real-time video pipeline may need the latency reduction that FP8 provides to meet frame-rate SLAs. A batch classification system processing overnight has ample time and may prioritize accuracy over throughput.
A system with fixed GPU capacity that must handle growing traffic has a different economic calculus than a system running on autoscaled cloud instances where GPU-hours are directly billed.
Each scenario produces a different precision optimum, which is why treating precision as a design parameter rather than a binary quality gate is essential. The design question is: what precision does this specific workload need, at this SLA, at this scale, on this hardware?
How precision interacts with infrastructure decisions
Precision choice feeds back into infrastructure decisions in ways that go beyond per-GPU performance:
Fleet composition. If the target precision (say FP8) requires Hopper-generation hardware, the fleet must include H100s or newer. If the acceptable precision is BF16, Ampere hardware remains viable. Precision choice can accelerate or defer hardware refresh cycles, with major capex implications.
Deployment topology. A 70B-parameter model at BF16 requires multi-GPU serving (140 GB of weights exceeds any single GPU’s HBM). At FP8, the weights fit on one H100 (70 GB on an 80 GB card). The precision change eliminates inter-GPU communication overhead, simplifies the serving architecture, and reduces failure modes. The economic impact of this topology change often exceeds the direct throughput improvement.
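The topology check in that example reduces to simple arithmetic: bytes per parameter times parameter count versus HBM capacity. This is a weights-only estimate; a real deployment must also budget KV cache and activations.

```python
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp8": 1}

def weights_gb(params_billion, fmt):
    """Approximate weight footprint in GB for a model of the given size."""
    return params_billion * BYTES_PER_PARAM[fmt]

def fits_single_gpu(params_billion, fmt, hbm_gb=80):
    """Weights-only fit check against a single GPU's HBM (80 GB assumed)."""
    return weights_gb(params_billion, fmt) <= hbm_gb

print(weights_gb(70, "bf16"), fits_single_gpu(70, "bf16"))  # 140 False
print(weights_gb(70, "fp8"), fits_single_gpu(70, "fp8"))    # 70 True
```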
Capacity planning. As explored in the broader context of how FP8, FP16, and BF16 represent different operating regimes, each format defines a different throughput-per-GPU, which means different GPU count requirements, different rack density, and different power budgets for the same request volume.
The total cost of ownership for an inference system is shaped by precision choice at every level of the infrastructure stack. Teams that treat precision as a late-stage optimization — something to consider after the fleet is provisioned — miss the opportunity to make fundamentally better infrastructure decisions from the start.
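The capacity-planning arithmetic above can be sketched end to end: each format’s throughput-per-GPU implies a fleet size and a power budget for the same request volume. The per-GPU throughput and wattage figures are assumptions for illustration.

```python
import math

def fleet_size(target_rps, rps_per_gpu):
    """GPUs needed to serve a target request rate, rounded up."""
    return math.ceil(target_rps / rps_per_gpu)

def fleet_power_kw(gpu_count, watts_per_gpu):
    """Total fleet power draw in kilowatts."""
    return gpu_count * watts_per_gpu / 1000

TARGET_RPS = 5_000
profiles = {                       # assumed per-GPU figures per format
    "bf16": {"rps": 10, "watts": 700},
    "fp8":  {"rps": 20, "watts": 650},
}
for fmt, p in profiles.items():
    n = fleet_size(TARGET_RPS, p["rps"])
    print(fmt, n, "GPUs,", fleet_power_kw(n, p["watts"]), "kW")
# bf16 500 GPUs, 350.0 kW
# fp8 250 GPUs, 162.5 kW
```

Under these assumptions the format choice halves rack count and more than halves the power budget, which is the sense in which precision shapes the infrastructure stack rather than just per-GPU performance.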
The operational conversation
Precision choice deserves a seat in the infrastructure planning process alongside hardware selection, capacity modeling, and SLA definition. The conversation should happen before procurement, not after deployment:
What precision can the target workload tolerate? Has this been validated empirically, not assumed? What hardware is required to accelerate that precision natively? What is the cost differential between precision options at the anticipated request volume and over the hardware’s projected lifespan?
These questions produce better infrastructure decisions than “buy the fastest GPUs and run at default precision.” The fastest GPU at default precision is often not the most cost-effective configuration. The most cost-effective configuration is usually the one where precision, hardware, and workload requirements are explicitly aligned — and validated before the purchase order goes out.
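The "cost differential over the hardware’s projected lifespan" question can be framed as a minimal total-cost comparison. Every input here (fleet sizes, per-GPU capex, power draw, electricity price, lifespan) is a placeholder assumption to be replaced with real quotes and measurements before any procurement decision.

```python
def lifetime_cost(gpu_count, capex_per_gpu, fleet_power_kw, kwh_usd, years):
    """Capex plus energy opex over the projected hardware lifespan."""
    hours = years * 365 * 24
    return gpu_count * capex_per_gpu + fleet_power_kw * hours * kwh_usd

# Assumed figures for illustration: same GPU price, different fleet sizes
# and power budgets implied by each precision's throughput-per-GPU.
bf16_tco = lifetime_cost(gpu_count=500, capex_per_gpu=30_000,
                         fleet_power_kw=350, kwh_usd=0.10, years=4)
fp8_tco = lifetime_cost(gpu_count=250, capex_per_gpu=30_000,
                        fleet_power_kw=162.5, kwh_usd=0.10, years=4)
print(f"BF16: ${bf16_tco:,.0f}  FP8: ${fp8_tco:,.0f}  "
      f"delta: ${bf16_tco - fp8_tco:,.0f}")
```

Running this comparison before the purchase order goes out, with empirically validated precision tolerances, is the alignment of precision, hardware, and workload the section argues for.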