# "Fastest GPU" is the wrong question for inference

Teams selecting GPU infrastructure for AI inference commonly optimise for the wrong metric. They compare GPUs by peak TFLOPS, select the highest number, and discover in production that their inference workload runs no faster than on cheaper hardware, because the workload is memory-bandwidth-bound, not compute-bound, and the expensive GPU's extra FLOPS sit unused.

Inference efficiency is measured in performance-per-watt and cost-per-inference, not raw FLOPS. The infrastructure that delivers the most inferences per dollar is rarely the infrastructure with the highest spec-sheet performance number.

## The three metrics that actually determine inference efficiency

| Metric | What it measures | Why it matters more than FLOPS |
| --- | --- | --- |
| Cost-per-inference | Total cost (hardware amortisation + power + cooling) divided by inferences served | The business metric: what actually determines ROI |
| Performance-per-watt | Inferences per second per watt of power consumed | Determines operational cost at scale; a GPU with 2× the throughput that consumes 3× the power is less efficient |
| Throughput at target latency | Maximum inferences/second achievable while meeting the p99 latency SLA | The engineering constraint: raw throughput without latency bounds is meaningless for serving |

## What actually determines inference throughput

Batch size, precision format, and memory bandwidth, not just the GPU model, determine inference throughput. Understanding why requires understanding what inference actually does to the hardware:

**Memory bandwidth governs throughput for most inference workloads.** Loading model weights from GPU memory into the compute units is the bottleneck for any model that doesn't fit in on-chip cache. An A100 with 2 TB/s of memory bandwidth serves more inferences per second than a hypothetical GPU with 2× the FLOPS but 1 TB/s of bandwidth, because the weights cannot be fed to the compute units fast enough.

**Batch size determines utilisation.** Serving one request at a time on an A100 utilises perhaps 5–10% of available compute. Batching 8–32 requests together amortises the weight-loading cost across multiple inferences, increasing throughput near-linearly until the compute ceiling is reached. But larger batches increase latency per request; the throughput–latency tradeoff is the core engineering decision.

**Precision format determines both memory footprint and compute throughput.** INT8 inference uses half the memory bandwidth of FP16 and enables tensor-core acceleration, delivering a 2–4× throughput improvement on supported hardware. But INT8 requires calibration and may lose accuracy on some model architectures.
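To make the bandwidth and batching arguments concrete, the sketch below estimates the bandwidth-bound throughput ceiling for a decoder model. It is a back-of-the-envelope model, not a benchmark: the function name, the 7B-parameter size, and the 2 TB/s A100-class bandwidth are assumptions chosen to mirror the figures above, and the compute ceiling that eventually caps batching is not modelled.

```python
# Illustrative roofline for a memory-bandwidth-bound decoder model.
# Numbers are assumptions, not measurements: real throughput also depends on
# kernels, KV-cache traffic, interconnect, and the serving stack.

def bandwidth_bound_tokens_per_s(params_billion: float,
                                 bytes_per_param: float,
                                 mem_bandwidth_gb_s: float,
                                 batch_size: int) -> float:
    """Upper bound on decode throughput when weight loading dominates.

    Each decode step streams the full weight set from GPU memory once,
    regardless of batch size, so batching amortises that cost across requests.
    """
    weight_gb = params_billion * bytes_per_param   # model weights, in GB
    steps_per_s = mem_bandwidth_gb_s / weight_gb   # full weight reloads per second
    return steps_per_s * batch_size                # tokens/s across the batch


# Hypothetical 7B-parameter model on an A100-class GPU (~2 TB/s = 2000 GB/s).
for batch in (1, 8, 32):
    fp16 = bandwidth_bound_tokens_per_s(7, 2.0, 2000, batch)  # FP16: 2 bytes/param
    int8 = bandwidth_bound_tokens_per_s(7, 1.0, 2000, batch)  # INT8: 1 byte/param
    print(f"batch={batch:>2}  FP16 ceiling ~{fp16:5.0f} tok/s  INT8 ceiling ~{int8:5.0f} tok/s")
```

Even as a crude upper bound, it shows the behaviour described above: doubling FLOPS changes nothing in this regime, while doubling the batch size or halving bytes-per-parameter roughly doubles the ceiling.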
## Decision framework: matching infrastructure to workload

The total cost analysis of cloud GPU vs on-premise provides the financial framework. Within that framework, the infrastructure selection question is:

**For latency-sensitive serving** (chatbots, real-time APIs, interactive applications):

- Prioritise memory bandwidth and efficiency at low batch sizes
- H100 and L40S excel here due to high memory bandwidth per dollar
- Smaller, lower-power GPUs (T4, L4) often deliver better cost-per-inference for models under 7B parameters

**For throughput-optimised batch processing** (offline inference, document processing, embedding generation):

- Prioritise total compute at maximum batch size
- A100 80GB remains cost-effective due to a mature rental market and a large memory pool
- Multi-GPU parallelism across cheaper GPUs often beats a single expensive GPU

**For edge deployment** (on-device, constrained power):

- Prioritise performance-per-watt above all else
- NVIDIA Jetson, Intel Movidius, or custom NPUs, not data centre GPUs
- Model optimisation (quantisation, pruning, distillation) dominates hardware choice

## The utilisation trap

The most common inefficiency in inference infrastructure is over-provisioning: deploying more GPU capacity than the workload requires, resulting in GPUs sitting idle 60–80% of the time. Auto-scaling and request batching are operational necessities, not optimisations, for any inference deployment that experiences variable load. A single A100 serving 10 requests per minute at p99 < 100 ms is dramatically over-provisioned; a T4 could serve the same load at 10% of the cost.

Measuring actual utilisation (GPU compute utilisation, memory bandwidth utilisation, and power draw under production load) before committing to hardware, rather than selecting hardware based on peak capability, is the single highest-impact infrastructure decision most teams can make.
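As a starting point for that measurement, here is a minimal sketch that polls nvidia-smi for compute utilisation, memory-controller activity, and power draw while production traffic is running. It assumes an NVIDIA GPU with nvidia-smi on the PATH; the sampling interval, duration, and summary statistics are illustrative choices, and a production setup would more likely rely on DCGM or a metrics exporter.

```python
# Minimal sketch: sample GPU compute utilisation, memory-controller
# utilisation, and power draw under production load, then summarise.
# Assumes nvidia-smi is available; interval and duration are illustrative.
import statistics
import subprocess
import time

QUERY = "utilization.gpu,utilization.memory,power.draw"

def sample_gpu(interval_s: float = 1.0, duration_s: float = 60.0) -> list[tuple[float, float, float]]:
    samples = []
    deadline = time.time() + duration_s
    while time.time() < deadline:
        out = subprocess.run(
            ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout.strip().splitlines()
        for line in out:  # one line per GPU: "<gpu %>, <mem %>, <watts>"
            gpu, mem, power = (float(v) for v in line.split(","))
            samples.append((gpu, mem, power))
        time.sleep(interval_s)
    return samples

if __name__ == "__main__":
    data = sample_gpu()
    for label, idx in (("GPU util %", 0), ("Mem util %", 1), ("Power (W)", 2)):
        values = sorted(s[idx] for s in data)
        p95 = values[int(0.95 * (len(values) - 1))]
        print(f"{label:>10}: mean {statistics.mean(values):6.1f}   p95 {p95:6.1f}")
```

If measured compute utilisation sits in the single digits while memory-controller utilisation is high, the workload is bandwidth-bound, and a cheaper, lower-FLOPS GPU will likely serve it just as well.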