## Vendor benchmarks are not your benchmarks

Inference infrastructure decisions should be driven by measured performance under your actual workload — vendor benchmarks use optimised batch sizes and precision settings that rarely match production conditions. An H100 that delivers 3,000 tokens/second on a vendor benchmark at batch size 256 may deliver 400 tokens/second under your production traffic pattern with batch size 4–8 and latency SLA constraints.

The gap exists because vendor benchmarks optimise for peak throughput (large batches, maximum GPU utilisation), while production inference operates under constraints that reduce utilisation: latency SLAs, variable request sizes, cold-start requirements, and traffic burstiness.

## What to measure instead

**Latency under load, not peak throughput.** The relevant metric is P95/P99 latency at your expected concurrent request volume — not maximum tokens/second with unlimited batching. A system that delivers 2,000 tok/s with 500 ms P99 latency is more production-ready than one delivering 3,000 tok/s with 2,000 ms P99. (A load-test sketch at the end of this section shows one way to measure this.)

**Time to first token (TTFT).** For interactive applications, TTFT determines perceived responsiveness. A system with high throughput but a 3-second TTFT feels unresponsive regardless of total generation speed. TTFT depends on prefill computation speed, which is a different bottleneck from decode throughput.

**Throughput at your batch size.** Production batch sizes depend on traffic concurrency and latency constraints. If your latency SLA forces batch sizes of 4–8, benchmarks at batch size 128 are irrelevant to your deployment.

**Cost per token under production conditions.** Total cost of ownership divided by actual production tokens served — not theoretical maximum throughput. Include idle time, auto-scaling overhead, and the cost of over-provisioning to meet latency SLAs during traffic spikes. (A worked cost sketch appears at the end of this section.)

## Infrastructure checklist for production inference

| Requirement | What to validate | Common failure |
| --- | --- | --- |
| Memory capacity | Model fits with KV-cache room for max context | OOM at long contexts under concurrent requests |
| Memory bandwidth | Sufficient for target throughput at operating batch size | Bandwidth-bound at small batches (most production traffic) |
| Interconnect (multi-GPU) | Parallelism strategy matches available bandwidth | TP over PCIe eliminates the bandwidth advantage |
| Autoscaling | Scale-up time meets traffic spike requirements | GPU cold-start takes 30–90 seconds; the traffic spike lasts 60 seconds |
| Redundancy | Graceful degradation under partial failure | Single GPU failure drops capacity to zero instead of degrading |
| Monitoring | Real-time visibility into latency, throughput, utilisation | Issues detected only when users complain |

## The batch size paradox

Larger batch sizes improve GPU utilisation and throughput — but they increase latency. Production inference systems must navigate this tradeoff dynamically:

- **During low traffic:** process requests immediately (batch size 1–2), accepting lower GPU utilisation
- **During moderate traffic:** batch arriving requests within a short window (5–50 ms), balancing utilisation and latency
- **During peak traffic:** larger batches form naturally from concurrent arrivals, but queue depth must be monitored to prevent latency explosion

Continuous batching (also called iteration-level scheduling) — where new requests join an in-progress batch at decode boundaries — improves this tradeoff significantly. But it requires inference engine support (vLLM, TensorRT-LLM, TGI) and adds implementation complexity.
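To make "latency under load" and TTFT concrete, here is a minimal load-test sketch. It assumes an OpenAI-compatible streaming completions endpoint (such as the server vLLM or TGI can expose) and the `httpx` library; the URL, model name, prompt, concurrency, and request counts are placeholders to replace with your own traffic profile, not values from this article.

```python
import asyncio
import time

import httpx  # assumed dependency; any async HTTP client would do

ENDPOINT = "http://localhost:8000/v1/completions"   # hypothetical OpenAI-compatible server
MODEL = "my-model"                                   # placeholder model name
PROMPT = "Summarise the quarterly report in three sentences."
CONCURRENCY = 8            # match the concurrent volume your SLA must survive
REQUESTS_PER_WORKER = 25


def pct(sorted_values: list[float], q: float) -> float:
    """Nearest-rank percentile over an already-sorted list."""
    return sorted_values[min(int(q * len(sorted_values)), len(sorted_values) - 1)]


async def one_request(client: httpx.AsyncClient) -> tuple[float, float]:
    """Return (TTFT, end-to-end latency) in seconds for one streamed completion."""
    payload = {"model": MODEL, "prompt": PROMPT, "max_tokens": 256, "stream": True}
    start = time.perf_counter()
    ttft = None
    async with client.stream("POST", ENDPOINT, json=payload, timeout=120) as resp:
        async for line in resp.aiter_lines():
            if ttft is None and line.strip():
                ttft = time.perf_counter() - start   # first streamed chunk ~ first token
    return ttft or 0.0, time.perf_counter() - start


async def worker(client: httpx.AsyncClient, results: list[tuple[float, float]]) -> None:
    for _ in range(REQUESTS_PER_WORKER):
        results.append(await one_request(client))


async def main() -> None:
    results: list[tuple[float, float]] = []
    async with httpx.AsyncClient() as client:
        await asyncio.gather(*(worker(client, results) for _ in range(CONCURRENCY)))
    ttfts = sorted(r[0] for r in results)
    latencies = sorted(r[1] for r in results)
    for name, values in (("TTFT", ttfts), ("end-to-end", latencies)):
        print(f"{name}: p50={pct(values, 0.50):.3f}s "
              f"p95={pct(values, 0.95):.3f}s p99={pct(values, 0.99):.3f}s")


if __name__ == "__main__":
    asyncio.run(main())
```

Run it at the concurrency your SLA must survive, not at the concurrency that maximises throughput; the P95/P99 numbers it prints are the ones to compare against the SLA.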
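The cost-per-token point is arithmetic rather than benchmarking, but it is easy to get wrong by dividing hardware cost by theoretical peak throughput. A rough sketch, in which every figure is a made-up placeholder to replace with your own billing and telemetry data:

```python
# Every figure here is an illustrative placeholder; substitute your own billing and telemetry data.
gpu_hourly_cost = 4.00        # $/GPU-hour: cloud price, or amortised purchase + power + hosting
gpus_provisioned = 4          # includes head-room kept online to meet latency SLAs during spikes
hours = 24 * 30               # one month

# Measured from production telemetry (includes idle periods and autoscaling overhead):
tokens_served = 1.8e9         # tokens actually generated over the month

total_cost = gpu_hourly_cost * gpus_provisioned * hours
production_cost_per_mtok = total_cost / (tokens_served / 1e6)

# The number a vendor datasheet implies: every GPU busy at peak throughput, all month.
peak_tokens = 3000 * 3600 * hours * gpus_provisioned
datasheet_cost_per_mtok = total_cost / (peak_tokens / 1e6)

print(f"Production cost: ${production_cost_per_mtok:.2f} per million tokens")
print(f"Datasheet cost:  ${datasheet_cost_per_mtok:.2f} per million tokens")
```

The gap between the two printed numbers is the over-provisioning, idle time, and latency head-room that vendor throughput figures never include.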
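For the memory-capacity row of the checklist, a back-of-the-envelope KV-cache estimate catches most sizing mistakes before procurement. The sketch below uses the standard per-token KV-cache formula; the layer, head, context, and batch figures are illustrative (roughly a 70B-class model with grouped-query attention), so substitute your model's actual configuration.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_len: int, batch_size: int, bytes_per_elem: int = 2) -> int:
    """KV cache size: 2 (K and V) * layers * kv_heads * head_dim * tokens * concurrent sequences * dtype bytes."""
    return 2 * num_layers * num_kv_heads * head_dim * context_len * batch_size * bytes_per_elem


# Illustrative figures only: roughly a 70B-class model with grouped-query attention in BF16.
weights_gb = 70e9 * 2 / 1e9   # ~70B parameters at 2 bytes each
kv_gb = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128,
                       context_len=8192, batch_size=8) / 1e9

print(f"Weights ~{weights_gb:.0f} GB, KV cache ~{kv_gb:.1f} GB, total ~{weights_gb + kv_gb:.0f} GB")
# If the total exceeds the HBM you plan to deploy, concurrent long-context requests will OOM.
```

The KV-cache term grows linearly with both context length and concurrency, which is why the common failure mode is an OOM at long contexts under concurrent requests rather than at model load time.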
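The window-based strategy in the list above can be sketched as a small asyncio queue loop. This is only the queuing logic around a placeholder `run_batch` call, not continuous batching itself; engines such as vLLM, TensorRT-LLM, and TGI implement the real thing at decode-iteration granularity inside the engine.

```python
import asyncio

MAX_BATCH = 8        # bounded by your latency SLA, not by peak-throughput benchmarks
WINDOW_S = 0.02      # 20 ms collection window for moderate traffic


async def run_batch(requests: list[str]) -> list[str]:
    await asyncio.sleep(0.05)                          # placeholder for actual model execution
    return [f"response to {r}" for r in requests]


async def batcher(queue: asyncio.Queue) -> None:
    while True:
        first = await queue.get()                      # low traffic: dispatch as soon as anything arrives
        batch = [first]
        deadline = asyncio.get_running_loop().time() + WINDOW_S
        while len(batch) < MAX_BATCH:                  # moderate traffic: fill the batch within the window
            remaining = deadline - asyncio.get_running_loop().time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
            except asyncio.TimeoutError:
                break
        # Peak traffic: watch queue.qsize() here; unbounded queue depth is how latency explodes.
        results = await run_batch([req for req, _ in batch])
        for (_, future), result in zip(batch, results):
            future.set_result(result)


async def submit(queue: asyncio.Queue, request: str) -> str:
    future = asyncio.get_running_loop().create_future()
    await queue.put((request, future))
    return await future


async def demo() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    task = asyncio.create_task(batcher(queue))
    replies = await asyncio.gather(*(submit(queue, f"req-{i}") for i in range(20)))
    print(f"{len(replies)} replies, first: {replies[0]!r}")
    task.cancel()


if __name__ == "__main__":
    asyncio.run(demo())
```

A real deployment would also cap queue depth and shed or reroute load once the cap is hit, rather than letting waiting requests blow through the latency SLA.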
## Making hardware decisions

The framework for how organisations should choose AI hardware applies directly: start with your workload requirements (model size, latency SLA, concurrent users, cost budget), measure candidate hardware under those conditions, and decide based on measured performance — not vendor claims. The investment in workload-specific benchmarking before procurement pays for itself many times over compared to discovering post-purchase that your chosen hardware cannot meet production requirements.
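As a minimal illustration of that decision rule, the sketch below filters candidates by whether they met the SLA in your own measurements (for example, from the load-test sketch earlier) and then ranks the survivors by measured cost per token. The hardware names and every number are hypothetical placeholders, not benchmark results.

```python
# Decision rule sketch: SLA compliance first, then cheapest cost per token among compliant options.
requirements = {"p99_latency_s": 1.0, "ttft_s": 0.5, "min_throughput_tps": 800}

# Hypothetical placeholder results, as measured under YOUR traffic pattern; not real benchmarks.
measured = {
    "candidate-A": {"p99_latency_s": 0.9, "ttft_s": 0.3, "throughput_tps": 950,  "cost_per_mtok": 7.1},
    "candidate-B": {"p99_latency_s": 1.4, "ttft_s": 0.6, "throughput_tps": 1100, "cost_per_mtok": 6.2},
    "candidate-C": {"p99_latency_s": 0.8, "ttft_s": 0.4, "throughput_tps": 820,  "cost_per_mtok": 7.8},
}

viable = {
    name: m for name, m in measured.items()
    if m["p99_latency_s"] <= requirements["p99_latency_s"]
    and m["ttft_s"] <= requirements["ttft_s"]
    and m["throughput_tps"] >= requirements["min_throughput_tps"]
}

if not viable:
    raise SystemExit("No candidate met the SLA; revisit the requirements or the candidate list.")

best = min(viable, key=lambda name: viable[name]["cost_per_mtok"])
print(f"Meets SLA: {sorted(viable)}; cheapest per token: {best}")
```

Note that candidate-B in this sketch has the best throughput and the lowest cost per token, yet is eliminated first because it misses the latency SLA: exactly the outcome vendor-style throughput comparisons hide.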