The difference between a server GPU and a consumer GPU is not primarily about raw compute performance. It is about the assumptions each product is engineered around: consumer GPUs optimize for peak burst performance in a desktop chassis, while server GPUs are engineered for sustained, continuous operation under production inference conditions. For AI inference at scale, that distinction is not academic.

## What Makes a GPU a "Server GPU"

Server GPUs (sometimes called datacenter GPUs or compute GPUs) share a set of characteristics that consumer products typically lack:

- **Passive cooling:** Server GPUs use passive heatsinks without fans; cooling is handled by server chassis airflow. This eliminates the fan as a failure point and allows denser rack configurations.
- **ECC memory:** Error Correcting Code memory silently detects and corrects single-bit memory errors, and reports multi-bit errors. Production inference systems running continuously on consumer DRAM without ECC risk silent weight corruption: wrong outputs with no detectable error signal.
- **Extended warranty and RMA support:** Enterprise-grade support contracts, with hardware replacement in 24–48 hours. Consumer GPUs typically carry a 3-year limited warranty with consumer RMA timelines.
- **vGPU licensing:** Datacenter GPUs support NVIDIA's vGPU software for virtualized multi-tenant deployments. RTX consumer GPUs are explicitly restricted from this use case in NVIDIA's commercial software license.
- **Form factor:** Standard server GPUs use double-slot PCIe form factors rated for server chassis airflow requirements. Some (A100 SXM, H100 SXM) use the SXM socket for direct board-level integration and NVLink connectivity.

## What Are the Key Server GPU Options for AI Inference?

| GPU | Memory | BW (GB/s) | FP16 TFLOPS | Form Factor | MIG Support |
| --- | --- | --- | --- | --- | --- |
| NVIDIA L4 | 24 GB GDDR6 | 300 | 242 | PCIe (low power) | No |
| NVIDIA A10 | 24 GB GDDR6 | 600 | 125 | PCIe | No |
| NVIDIA A30 | 24 GB HBM2 | 933 | 165 | PCIe | Yes (4 slices) |
| NVIDIA A100 40GB | 40 GB HBM2e | 1,555 | 312 | PCIe / SXM | Yes (7 slices) |
| NVIDIA A100 80GB | 80 GB HBM2e | 2,000 | 312 | PCIe / SXM | Yes (7 slices) |
| NVIDIA H100 80GB | 80 GB HBM3 | 3,350 | 989 | PCIe / SXM | Yes (7 slices) |
| NVIDIA L40S | 48 GB GDDR6 | 864 | 733 | PCIe | No |

Throughput figures are NVIDIA's peak tensor-core numbers; note that the L4 and L40S values include structured sparsity (dense FP16 throughput is roughly half).

The L4 is notable: it fits in a 72 W thermal envelope with a single-slot form factor, so two L4s fit in the physical space of one double-slot GPU in some server configurations. For inference of models up to ~7B parameters at INT4, or smaller models at FP16, it offers competitive cost per query. The L40S at 48 GB GDDR6 covers 13B models at FP16 with headroom for KV cache.
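That sizing claim is easy to sanity-check with back-of-envelope arithmetic: weight memory is parameter count times bytes per parameter, and KV cache grows with layer count, batch size, and sequence length. The sketch below is a rough estimator, not a profiler; the function name and the Llama-2-13B-style dimensions (40 layers, KV width 5120, FP16 KV cache) are illustrative assumptions, and real deployments should measure actual allocation, which also includes activations and framework overhead.

```python
def estimate_inference_memory_gb(
    params_billions: float,
    bytes_per_param: float,   # 2.0 for FP16/BF16, 1.0 for INT8, 0.5 for INT4
    n_layers: int,
    kv_dim: int,              # n_kv_heads * head_dim (= hidden size without GQA)
    batch_size: int,
    seq_len: int,
    kv_bytes: float = 2.0,    # FP16 KV cache entries
) -> float:
    """Rough GPU memory estimate: weights + KV cache only."""
    weights = params_billions * 1e9 * bytes_per_param
    # One K and one V tensor per layer, each of shape [batch, seq_len, kv_dim]
    kv_cache = 2 * n_layers * batch_size * seq_len * kv_dim * kv_bytes
    return (weights + kv_cache) / 1e9

# Hypothetical 13B deployment at FP16 with Llama-2-13B-like dimensions
needed = estimate_inference_memory_gb(
    params_billions=13, bytes_per_param=2.0,
    n_layers=40, kv_dim=5120, batch_size=4, seq_len=2048,
)
print(f"~{needed:.1f} GB needed vs 48 GB on an L40S")  # ~32.7 GB
```

At batch 4 and 2,048-token contexts this lands around 33 GB, comfortably inside 48 GB; pushing batch size and context length higher erodes that headroom quickly, since the KV cache term scales linearly in both.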
## ECC Memory: The Production Requirement

Bit errors in GPU memory are rare but not zero. On modern GDDR6/HBM at production inference volumes, running 24/7 and processing millions of requests, silent data corruption can occur. The consequence is model weights or activation values being silently corrupted, producing incorrect inference outputs without any exception or error log.

For most AI inference applications we work with, a small number of silent errors is tolerable (the output is wrong, but it looks like a bad prediction, not a crash). For applications where inference outputs affect safety decisions, such as medical imaging, autonomous systems, or financial calculations, ECC is not optional.

Consumer GPUs (RTX, GTX) do not include ECC on their main memory. Some NVIDIA professional graphics cards (RTX A-series) offer ECC as an option, but at reduced throughput. (The monitoring sketch at the end of this article shows how to read ECC status and error counters at runtime.)

## Sustained Throughput vs Peak Throughput

Server GPU thermal design directly affects sustained throughput. Consumer GPUs boost to maximum clock speeds for short durations, then throttle back when junction temperature limits are reached. In a properly ventilated server chassis with passively cooled datacenter GPUs, there is no thermal throttle: the GPU runs at rated clocks indefinitely.

In our experience, an RTX 4090 deployed in a 1U/2U server chassis without dedicated per-GPU airflow channels sustains 75–85% of its peak throughput under continuous inference load. An A100 PCIe in the same chassis sustains 98–100% of rated throughput. Over a sustained production deployment, this gap compounds. (Current clock speeds and active throttle reasons can be checked at runtime; see the sketch at the end of this article.)

## Driver Support and Certification

NVIDIA maintains separate driver branches for datacenter GPUs (Data Center drivers, updated quarterly) and consumer GPUs (Game Ready and Studio drivers, updated frequently). Datacenter driver branches receive extended support and are certified for enterprise OS environments (RHEL, Ubuntu LTS). Consumer GPU drivers are not certified for these environments and are not covered by enterprise OS support contracts.

For Kubernetes/container-based inference deployments using NVIDIA's container toolkit and GPU Operator, datacenter GPU support is mature and well tested. Consumer GPUs work in these environments but are not an officially supported configuration.

## Decision Checklist for GPU Tier Selection

- Is sustained 24/7 operation required? → Server GPU
- Is ECC memory required (safety-critical outputs, regulated industry)? → Server GPU
- Is multi-tenant virtualization needed? → Server GPU (vGPU license)
- Is the deployment in a 1U/2U server chassis without consumer GPU airflow? → Server GPU (thermal)
- Is the workload a development environment or low-volume prototype? → Consumer GPU acceptable
- Is MIG partitioning needed for small model isolation? → A30, A100, or H100

The inference latency optimization stack, including how hardware tier selection interacts with batching strategy and serving architecture, is covered in How to Optimise AI Inference Latency on GPU Infrastructure.

## Summary

Server GPUs are not just expensive consumer GPUs. The engineering differences (passive cooling, ECC memory, vGPU support, certified driver branches) exist because production inference workloads have different requirements from gaming. The financial question is whether those requirements apply to your deployment. For sustained production inference, the additional cost of server-grade hardware typically pays for itself within the first year of operation through reduced failure rates, higher sustained utilization, and elimination of silent corruption risk.
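As a practical postscript: both the ECC counters and the thermal-throttle behaviour discussed above are observable at runtime through NVML. The sketch below uses the nvidia-ml-py (pynvml) bindings as a minimal health check, assuming the driver exposes these counters; consumer GPUs will typically report ECC as unsupported. It is a starting point, not a production monitoring system.

```python
# pip install nvidia-ml-py
import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    print("GPU:", pynvml.nvmlDeviceGetName(handle))

    # ECC: current mode plus lifetime corrected/uncorrected error counts.
    # Raises NVMLError (NOT_SUPPORTED) on consumer GPUs without ECC.
    try:
        current, _pending = pynvml.nvmlDeviceGetEccMode(handle)
        corrected = pynvml.nvmlDeviceGetTotalEccErrors(
            handle,
            pynvml.NVML_MEMORY_ERROR_TYPE_CORRECTED,
            pynvml.NVML_AGGREGATE_ECC,
        )
        uncorrected = pynvml.nvmlDeviceGetTotalEccErrors(
            handle,
            pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED,
            pynvml.NVML_AGGREGATE_ECC,
        )
        print(f"ECC enabled: {bool(current)}, "
              f"corrected: {corrected}, uncorrected: {uncorrected}")
    except pynvml.NVMLError:
        print("ECC not supported on this GPU")

    # Sustained throughput: current SM clock and any active thermal throttling.
    sm_clock = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)
    reasons = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)
    thermal = reasons & (pynvml.nvmlClocksThrottleReasonSwThermalSlowdown
                         | pynvml.nvmlClocksThrottleReasonHwThermalSlowdown)
    print(f"SM clock: {sm_clock} MHz, thermal throttling: {bool(thermal)}")
finally:
    pynvml.nvmlShutdown()
```

Sampling the SM clock and throttle-reason bitmask periodically under continuous load is one way to measure the sustained-vs-peak gap described above on your own hardware: a GPU that holds rated clocks with no thermal throttle bits set is delivering its rated throughput.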