An H100 GPU server is a substantial capital commitment, and we see procurement decisions get made on transformer-era hype rather than on workload evidence. The honest question is narrower: does your inference or training workload exercise the parts of the Hopper architecture that justify the premium over an A100, and is the host, power, and cooling envelope around the GPU actually ready to feed it? For most teams, the answer depends less on the GPU spec sheet and more on whether the software pipeline can keep an H100 saturated.

This piece walks through where the H100 earns its price, where it does not, and the procurement and infrastructure mistakes we see repeatedly. For the broader question of how to reduce inference latency before reaching for new hardware, see our hub guide on how to optimise AI inference latency on GPU infrastructure.

What makes H100 GPU servers different from previous generations?

The NVIDIA H100 (Hopper architecture) introduces three capabilities that matter for AI workloads: the Transformer Engine (hardware-accelerated mixed precision for transformer models), higher HBM3 memory bandwidth (3.35 TB/s on SXM5), and fourth-generation NVLink (900 GB/s bidirectional per GPU). These are not incremental improvements: for specific workload profiles they amount to 2–3× performance gains over the A100. That range is what we have observed across the transformer-shaped engagements we have run, not a universal multiplier.

The Transformer Engine automatically manages precision between FP8 and FP16 during matrix multiplication, achieving near-FP8 throughput with FP16-level accuracy on supported operators. For transformer-based workloads (large language models, vision transformers, diffusion models) this provides roughly 2× the effective compute of an A100, albeit at a higher per-GPU power draw (700 W for the H100 SXM5 versus 400 W for the A100 SXM4).
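To make the FP8 point concrete, here is a minimal pure-Python sketch of why per-tensor scaling matters when values are stored in FP8. It simulates rounding to the E4M3 format (3 mantissa bits, largest finite value 448) and is an illustration only: the helper names `round_to_e4m3` and `quantize_with_scale` are hypothetical, the sample values are made up, and this is not how the Transformer Engine is implemented internally.

```python
import math

E4M3_MAX = 448.0           # largest finite value in FP8 E4M3
E4M3_MIN_NORMAL_EXP = -6   # exponent of the smallest normal number
MANTISSA_BITS = 3


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0 or math.isnan(x):
        return x
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)
    _, e = math.frexp(mag)                    # mag = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, E4M3_MIN_NORMAL_EXP)     # clamp into the subnormal range
    quantum = 2.0 ** (exp - MANTISSA_BITS)    # spacing between representable values
    return sign * min(round(mag / quantum) * quantum, E4M3_MAX)


def quantize_with_scale(values):
    """Scale a tensor so its largest magnitude maps to 448, round, then unscale."""
    amax = max(abs(v) for v in values)
    scale = E4M3_MAX / amax if amax > 0 else 1.0
    return [round_to_e4m3(v * scale) / scale for v in values], scale


# Hypothetical small-magnitude activations, chosen only for illustration.
activations = [0.0012, -0.0005, 0.0031]

unscaled = [round_to_e4m3(v) for v in activations]
scaled, scale = quantize_with_scale(activations)

print("unscaled:", unscaled)   # small values collapse toward zero
print("scaled:  ", scaled)     # relative error stays within a few percent
```

Without scaling, the smallest value rounds to zero because it sits below the format's resolution. With scaling, every value lands in the normal range, where the rounding error is bounded by half a mantissa step. Real FP8 pipelines track these scale factors per tensor over time, which is part of the calibration work the next paragraph refers to.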
The honest caveat: the multiplier only materialises when the model, the runtime (TensorRT-LLM, PyTorch with FP8 autocast, or a custom CUDA path), and the calibration pipeline are all set up to use FP8 cleanly. Drop any one of those and the H100 reverts to behaving like a faster A100, which is a much weaker case for the price.

When is the H100 investment justified?

| Workload | H100 advantage over A100 | Justification threshold |
|---|---|---|
| LLM training (>7B params) | 2–3× throughput | Training runs >$50K on A100 |
| LLM inference serving | 2–4× tokens/second | >1000 sustained requests/hour |
| Vision transformer training | 1.8–2.5× throughput | Iterating on architecture frequently |
| Standard CNN training | 1.3–1.5× throughput | Rarely justified; A100 sufficient |
| Small model inference | <1.2× improvement | Not justified at H100 pricing |

These ranges are an observed pattern across our GPU engagements rather than a published benchmark. Your numbers will move depending on sequence length, batch shape, and kernel coverage.

The pattern itself is stable: the H100’s advantages concentrate in transformer-heavy workloads. For convolutional networks, traditional object detection, and small-model inference, the A100 delivers adequate performance at materially lower cost. We advise clients to benchmark their specific workload on both platforms before committing to H100 procurement. The theoretical multiplier rarely lands exactly where the slide deck claims it will.

What are common H100 procurement mistakes?

The most frequent mistake is purchasing H100 PCIe cards when the workload actually requires H100 SXM5. The PCIe variant has lower memory bandwidth (2.0 TB/s vs 3.35 TB/s on SXM5), and its NVLink support is limited to bridging pairs of cards rather than the full NVLink fabric of an SXM5 baseboard. Those are the two features that provide the H100’s largest advantages over the A100 on multi-GPU and memory-bound workloads. An H100 PCIe often delivers only 1.3–1.5× the performance of an A100 SXM on the same job, which does not justify the price gap.

The second mistake is under-provisioning the host system. An H100 SXM5 server requires high-bandwidth CPU-to-GPU connectivity (PCIe Gen5), fast NVMe storage to feed the GPU’s data appetite, and sufficient host CPU capacity for data preprocessing. A system with 8× H100 SXM5 GPUs sitting behind a single-socket CPU and SATA storage will bottleneck at the host before the GPUs are ever saturated, wasting most of the capacity you just paid for.

The third, quieter mistake is buying H100s before profiling the existing inference pipeline. If the bottleneck is tokeniser cost, host-to-device copy, or a serving framework that cannot keep the GPU fed, replacing the A100 with an H100 will not fix it. The throughput floor is set by whatever serialised step is starving the device. This is the structural reason we push the inference latency diagnosis workflow before any hardware refresh.

How should you configure an H100 server?

For training workloads that need multi-GPU scaling, the configuration we recommend is 8× H100 SXM5 connected via NVLink, a dual-socket CPU (AMD EPYC 9004 or Intel Xeon Sapphire Rapids), 2 TB of system RAM, and NVMe storage with at least 25 GB/s of aggregate read throughput. This configuration runs $250K–$400K depending on the vendor and support contract. That is an observed price band from recent procurements, not a benchmarked figure.

For inference serving, a single H100 SXM5 or a pair of H100 PCIe cards is often sufficient, depending on model size and throughput target. A single H100 SXM5 can serve a 70B-parameter LLM at roughly 30–50 tokens per second per concurrent user under FP8, which is adequate for interactive applications with moderate concurrency. Beyond that load level, the question shifts from “which card” to “how many replicas behind which scheduler”, and the right answer depends on tail-latency SLAs, not peak throughput.
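As a rough sanity check on the tokens-per-second figure, the sketch below estimates a memory-bandwidth ceiling for single-stream decoding. It assumes each generated token requires reading every model weight from HBM once and ignores KV-cache traffic, batching, and kernel overheads, so it is an upper bound rather than a prediction. The bandwidth figures are the ones quoted in this piece; the 80 GB capacity is the commonly quoted H100 spec and is an assumption not stated in the article. The function names are hypothetical.

```python
H100_SXM5_BANDWIDTH_TB_S = 3.35   # quoted in this article
H100_PCIE_BANDWIDTH_TB_S = 2.0    # quoted in this article
H100_HBM_CAPACITY_GB = 80         # assumption: spec-sheet capacity, not from the article


def weight_footprint_gb(params_billion: float, bytes_per_param: float) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bytes_per_param


def decode_bound_tokens_per_s(params_billion: float,
                              bytes_per_param: float,
                              bandwidth_tb_s: float) -> float:
    """Upper bound on single-stream decode rate if every token reads all weights once."""
    return bandwidth_tb_s * 1000.0 / weight_footprint_gb(params_billion, bytes_per_param)


def fits_on_one_gpu(params_billion: float, bytes_per_param: float) -> bool:
    """True if the weights alone fit in HBM; the KV cache still needs extra headroom."""
    return weight_footprint_gb(params_billion, bytes_per_param) < H100_HBM_CAPACITY_GB


sxm5_bound = decode_bound_tokens_per_s(70, 1, H100_SXM5_BANDWIDTH_TB_S)
pcie_bound = decode_bound_tokens_per_s(70, 1, H100_PCIE_BANDWIDTH_TB_S)

print(f"70B @ FP8 on SXM5: up to ~{sxm5_bound:.1f} tokens/s per stream")
print(f"70B @ FP8 on PCIe: up to ~{pcie_bound:.1f} tokens/s per stream")
print("70B @ FP8 fits on one GPU:", fits_on_one_gpu(70, 1))
print("70B @ FP16 fits on one GPU:", fits_on_one_gpu(70, 2))
```

Under these assumptions the SXM5 ceiling sits near the top of the 30–50 tokens per second range, which is consistent with decode being bandwidth-bound at low concurrency. The same arithmetic shows why FP8 matters here: at FP16 the weights alone would not fit on a single card.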
Total cost of ownership extends well beyond hardware: power consumption (700 W per H100 SXM5 under load), cooling requirements (liquid cooling is common for dense SXM5 configurations), rack space, and the engineering time to optimise workloads for FP8 precision and the Transformer Engine. In our TCO calculations these operational costs typically add 30–50% to the hardware acquisition cost over a three-year deployment. Again, this is a planning heuristic rather than a benchmarked rate.

What cooling and power infrastructure does an H100 deployment require?

Infrastructure requirements for H100 GPU servers extend well beyond the server itself. Power and cooling are the most frequently underestimated cost and lead-time items in H100 deployments.

Power. A single 8× H100 SXM5 server draws approximately 10 kW under full load. The GPUs contribute around 5.6 kW (8 × 700 W), and the CPUs and the rest of the system contribute the remainder. A rack containing two such servers needs 20 kW of power delivery, which exceeds the capacity of many existing data-centre racks provisioned for 8–12 kW. Upgrading power distribution to support high-density GPU racks involves electrical work with lead times of 4–12 weeks depending on the facility.

Cooling. Dense H100 SXM5 deployments are often built around liquid cooling. Air-cooled configurations exist, but they require high-airflow chassis that push noise to 75–85 dBA, which is unsuitable for environments with human occupancy. Liquid cooling (direct-to-chip or rear-door heat exchangers) reduces noise and improves thermal efficiency but assumes plumbing infrastructure that most general-purpose data centres do not have pre-installed.

We have seen H100 servers delivered to facilities that could not power them: an expensive storage problem that delayed the project by months while infrastructure upgrades caught up.

UPS and redundancy. Training runs interrupted by power events lose hours of compute. Full UPS capacity for a 10 kW server is a significant battery investment, and we generally do not recommend sizing UPS for sustained operation at this density. The better pattern is checkpoint-based fault tolerance (saving model state every 15–30 minutes via PyTorch checkpointing or framework-native equivalents) combined with UPS protection sized for graceful shutdown rather than sustained operation. This combination is the cost-effective resilience pattern we deploy for training workloads.

The total infrastructure cost for an 8× H100 deployment, including power distribution upgrades, cooling installation, rack modifications, and UPS, typically adds $50K–$150K on top of hardware acquisition. Including this from the start prevents the scope and budget surprises that derail otherwise sound procurement plans.

Bringing it back to the inference question

The H100 is the right purchase for transformer-shaped workloads at scale, where FP8, HBM3 bandwidth, and NVLink all see genuine use. It is the wrong purchase as a generic “AI upgrade”, and it is almost always the wrong first purchase before the inference pipeline has been profiled end-to-end.

The decision tree we use with clients runs in this order: profile the existing pipeline, identify whether the bottleneck is compute, memory, batching, or host-device transport, fix the software-side constraints first, then size hardware against the remaining gap. More often than people expect, the gap closes before the H100 line item is needed.

When a procurement decision is being framed as “H100 or nothing”, the underlying question is almost always whether the inference path has been profiled honestly first. That is where we usually start.
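For budgeting purposes, the planning figures quoted in this piece can be combined into a rough three-year range. The sketch below is arithmetic only, under one loud assumption: it treats the 30–50% operational uplift and the $50K–$150K infrastructure cost as separate, additive items, which may overlap in practice. The function names are hypothetical, and the rack check simply uses the approximate 10 kW per-server draw given above.

```python
def three_year_budget_usd(hardware_usd: int, ops_uplift_pct: int, infrastructure_usd: int) -> int:
    """Hardware + operational uplift (as a % of hardware) + one-off infrastructure work.

    Assumption: the uplift and the infrastructure cost do not overlap.
    """
    ops_usd = hardware_usd * ops_uplift_pct // 100
    return hardware_usd + ops_usd + infrastructure_usd


def servers_per_rack(rack_capacity_kw: int, server_draw_kw: int = 10) -> int:
    """How many ~10 kW 8x H100 SXM5 servers a rack can power at full load."""
    return rack_capacity_kw // server_draw_kw


# Low and high ends of the ranges quoted in the article.
low = three_year_budget_usd(250_000, 30, 50_000)
high = three_year_budget_usd(400_000, 50, 150_000)
print(f"Three-year planning range: ${low:,} to ${high:,}")  # $375,000 to $750,000

# Racks provisioned for 8, 12, and 20 kW.
for capacity_kw in (8, 12, 20):
    print(f"{capacity_kw} kW rack: {servers_per_rack(capacity_kw)} server(s)")
```

Under these assumptions an 8 kW rack cannot host even one such server at full load and a 12 kW rack hosts one, which is the arithmetic behind the power-distribution lead times discussed earlier.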