The TPU-vs-GPU question is really a question about workload shape and operational context. GPUs (NVIDIA H100, H200, B100/B200; AMD MI300X) remain the default accelerator for almost every deep learning team because of framework coverage, deployment flexibility, and the depth of the surrounding tooling. TPUs (Google v5p, v6 Trillium) are competitive, sometimes superior, on a narrower band of workloads: large dense transformer training inside Google Cloud, where the systolic-array architecture and the ICI interconnect cleanly match the compute pattern. In our experience, teams that frame this as a binary “which is faster” question usually end up with the wrong answer; the right question is which architecture matches the workload you actually run.

Both processors handle matrix multiplication, convolutions, and attention kernels at high throughput. They differ in how much else they handle well, where they can run, and what the developer experience looks like once you move past a clean prototype.

What GPUs are good at

Graphics processing units are general-purpose parallel accelerators. They were built originally for rendering, but the same wide SIMT execution model that handles fragment shaders also handles the dense linear algebra at the heart of deep learning. That generality is the reason GPUs dominate: the same hardware runs a billion-parameter transformer, a CNN on medical imagery, a physics simulation, and a CUDA-accelerated data pipeline.

The properties that matter for AI workloads:

- High thread count with deep memory hierarchies (HBM3, large L2, register-rich SMs).
- Mature kernel libraries: cuDNN, cuBLAS, FlashAttention, NCCL, TensorRT.
- Framework support that lands first on GPU: PyTorch, JAX, TensorFlow, ONNX Runtime, Triton.
- Wide precision range, from FP64 down to FP4 on Blackwell, which is useful when an inference workload has different precision tolerance than the training workload that produced the model.

GPUs also have an operational property TPUs cannot match: portability. The same PyTorch code runs on a workstation, an on-prem cluster, AWS, GCP, Azure, Oracle, CoreWeave, or Lambda. This matters enormously when you are choosing inference infrastructure under real availability constraints, which is the subject we develop further in how to optimise AI inference latency on GPU infrastructure.

What TPUs are good at

A TPU is an application-specific integrated circuit (ASIC) built around a systolic array: a grid of multiply-accumulate units fed by a tightly choreographed dataflow. The architecture trades flexibility for efficiency on a specific compute pattern: large, regular matrix multiplications. When your workload looks like that, TPUs are very good. When it doesn’t, they’re awkward at best.

TPUs run inside Google Cloud. Pods scale into the thousands of chips using Google’s optical-circuit-switched ICI fabric, which avoids many of the all-reduce bottlenecks that complicate large GPU training runs. Mixed-precision (BF16, FP8 on newer generations) is the default path rather than a tuning project. For teams already on GCP doing dense transformer pre-training at very large scale, TPUs have a real cost-per-FLOP story.

The boundary conditions are sharper than marketing material suggests. TPUs are weaker on irregular control flow, sparse models, dynamic shapes, and anything that wants a custom CUDA kernel. JAX and TensorFlow are first-class; PyTorch via PyTorch/XLA works but with friction. And outside Google Cloud they do not run at all: there is no on-prem TPU, no workstation TPU, no third-party TPU cloud.

Core architectural differences

| Dimension | GPU | TPU |
| --- | --- | --- |
| Execution model | Many small cores, SIMT, flexible kernels | Systolic array, dataflow-driven, fixed pattern |
| Best-fit workload | Mixed, irregular, research-heavy | Large dense matmul, transformer pre-training |
| Memory | HBM3/HBM3e, large caches, complex hierarchy | HBM with high bandwidth tuned for matmul streaming |
| Interconnect | NVLink, NVSwitch, InfiniBand, RoCE | ICI optical-circuit-switched fabric |
| Precision range | FP64 → FP4 (Blackwell) | BF16, INT8, FP8 (newer generations) |
| Deployment | Cloud, on-prem, workstation, edge | Google Cloud only |
| Framework reach | PyTorch, JAX, TensorFlow, Triton, ONNX | JAX, TensorFlow (PyTorch via XLA, with friction) |

Both can train and serve deep learning models well. The honest summary is that GPUs handle a wide range of compute patterns competently, while TPUs handle one narrow pattern extremely well.

Training performance

Training performance depends on input shape, batch size, memory access pattern, and how much of the model time is spent in matrix multiplication versus everything else. The “everything else” (embedding lookups, attention masking, activation checkpointing, custom layers) is where workload shape starts to matter.

GPUs are forgiving when models change frequently. The toolchain has decades of optimisation behind it: profilers that surface kernel-level bottlenecks, robust support for new operators, and a debugging culture that assumes you’ll want to inspect intermediate tensors. For research teams iterating on architectures, GPUs are the path of least resistance.

TPUs reward stability. When a workload locks into a clean transformer shape (fixed sequence length, dense attention, predictable batch dimensions), TPUs achieve very high utilisation with fewer stalls than equivalent GPU clusters. The XLA compiler does aggressive whole-graph optimisation, which pays off when the graph doesn’t change. It also means recompilation costs are real, and dynamic shapes are penalised.

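One common way to soften the dynamic-shape penalty is to pad inputs up to a small, fixed set of lengths, so that a shape-specialising compiler only ever sees a handful of distinct shapes. The sketch below illustrates the idea in plain Python. It is not taken from XLA or any TPU tooling: the bucket sizes and the helper names `bucket_length`, `pad_to_bucket`, and `distinct_shapes` are hypothetical, and the code only counts shapes rather than compiling anything.

```python
from bisect import bisect_left

# Hypothetical bucket sizes; real values depend on the model and the traffic mix.
BUCKETS = (128, 256, 512, 1024)


def bucket_length(seq_len, buckets=BUCKETS):
    """Return the smallest bucket that can hold a sequence of seq_len tokens."""
    if seq_len < 1:
        raise ValueError("seq_len must be positive")
    i = bisect_left(buckets, seq_len)
    if i == len(buckets):
        raise ValueError(f"seq_len {seq_len} exceeds largest bucket {buckets[-1]}")
    return buckets[i]


def pad_to_bucket(tokens, pad_id=0, buckets=BUCKETS):
    """Right-pad a list of token ids to its bucket length."""
    target = bucket_length(len(tokens), buckets)
    return tokens + [pad_id] * (target - len(tokens))


def distinct_shapes(lengths, buckets=None):
    """Count the distinct sequence lengths a shape-specialising compiler would see."""
    if buckets is None:
        return len(set(lengths))
    return len({bucket_length(n, buckets) for n in lengths})


# Made-up request lengths for illustration.
lengths = [17, 90, 130, 131, 250, 400, 511, 600, 1000]
raw = distinct_shapes(lengths)                # one shape per unique length
bucketed = distinct_shapes(lengths, BUCKETS)  # at most len(BUCKETS) shapes
```

With these made-up lengths, nine distinct shapes collapse to four. The trade-off is wasted compute on padding, so bucket boundaries are usually chosen from the observed length distribution.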
For most teams the practical question is simpler than the benchmark question: how often does your model shape change? If frequently, GPU. If rarely, TPU becomes a serious option.

Inference performance

How do GPUs handle inference workloads?

GPUs serve inference with flexibility: variable batch sizes, multiple concurrent models, dynamic request patterns, low-latency tail behaviour. TensorRT, vLLM, Triton Inference Server, and SGLang all assume GPU as the deployment target. For production systems handling unstructured request shapes, this flexibility is the headline reason GPUs dominate inference deployments.

How do TPUs handle inference workloads?

TPUs do inference well in the right shape: large batches, predictable request patterns, models that map cleanly to the systolic array. Google’s own large-model serving runs on TPU. But the cloud-only constraint forecloses entire deployment categories: edge, on-prem, regulated environments, hybrid clouds. If any part of your serving strategy needs to run outside GCP, TPU is not a candidate.

For the deeper treatment of where inference latency is actually spent (and why architectural choice is usually a smaller lever than batching strategy or quantisation), see how to optimise AI inference latency on GPU infrastructure.

Framework and ecosystem support

GPU framework support is essentially complete. PyTorch and JAX are GPU-first projects; TensorFlow runs well on both; ONNX-based tooling targets GPU as the primary deployment surface. New research papers ship with GPU-runnable code. New optimisations (FlashAttention, paged attention, speculative decoding) land on GPU first because that is where the people writing them work.

TPU integration is strongest with JAX and TensorFlow. The XLA compiler and the surrounding ecosystem have matured significantly, and for teams already building on JAX the experience is clean.
PyTorch on TPU works through PyTorch/XLA but carries enough friction that most PyTorch shops simply don’t go there.

Scalability and large-scale workloads

GPUs scale well across many nodes when paired with fast interconnects (NVLink within a node, NVSwitch within a rack, InfiniBand or RoCE between racks). The scaling story is good but tuning-dependent: parallelism strategy, all-reduce algorithm, network topology, and storage all need attention.

TPUs scale natively. The ICI fabric and the JAX/XLA stack together hide much of the distributed-systems complexity, and TPU pods reach into the thousands of chips with predictable scaling behaviour. For very large dense training runs, this is the strongest argument for TPU.

Cost and availability

Cost depends on workload, commitment terms, region, and how well the workload matches the hardware. We see organisations save money with GPUs when availability lets them shop across providers, and we see organisations save money with TPUs when they’re already deep in GCP and the workload fits cleanly.

Availability is asymmetric. GPUs run everywhere: cloud, on-prem, workstation, edge devices, automotive. TPUs run in Google Cloud. That asymmetry shapes procurement strategy long before the cost-per-FLOP comparison matters.

Developer experience

GPU development is portable. A PyTorch model written on a laptop runs unchanged on a cluster. Profiling tools (Nsight Systems, Nsight Compute, PyTorch Profiler) are mature. The talent pool is large. The community answers most questions you’ll have.

TPU development is JAX-first and cloud-only. There is no local TPU workstation. The XLA compiler does powerful optimisation but also means certain classes of bugs surface only at compile time on the actual hardware. For a team already fluent in JAX and committed to GCP, the experience is clean. For everyone else, the GPU path is faster to first production.

Choosing for your workload

Choose GPUs when:

- The model architecture or training pipeline changes frequently.
- You need to deploy outside Google Cloud (on-prem, edge, multi-cloud, regulated).
- You depend on PyTorch and the broader research ecosystem.
- Your inference workload has irregular shapes or low-latency tail requirements.

Choose TPUs when:

- You are already on GCP and the workload is a stable, dense transformer at large scale.
- Your team is JAX-native.
- You’re doing pre-training runs large enough that the ICI interconnect advantage outweighs ecosystem cost.
- You don’t need on-prem or edge deployment paths.

Many teams end up using both: GPUs for experimentation and most production serving, TPUs for specific large pre-training runs. The mixed strategy isn’t a compromise; it’s recognising that the two architectures answer different operational questions.

Where this fits in the inference engineering picture

The TPU-vs-GPU question is downstream of a more important one: where is your inference latency actually being spent? In most engagements we’ve seen, the largest reductions in serving latency come from algorithmic and infrastructure choices that apply regardless of whether you’re on TPU or GPU: quantisation policy, batching strategy, KV-cache management, request routing. Hardware choice matters, but it’s usually not the dominant lever. We develop that argument further in how to optimise AI inference latency on GPU infrastructure, which is the parent piece in this engineering thread.

Frequently asked questions

How do I diagnose where AI inference latency is being spent — model compute, memory, batching, or transport?

Profile the pipeline end-to-end before assuming the model is the bottleneck. In our experience, the first measurement to take is the split between host-side preprocessing, host-device transfer, model compute, and post-processing.
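As a rough illustration of that four-way split, the sketch below wraps each stage of a request in a wall-clock timer and reports each stage's share of the total. Everything in it is a stand-in: `StageTimer` and `handle_request` are hypothetical names, and the stage bodies are trivial placeholders rather than real preprocessing, transfers, or model calls.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named pipeline stage."""

    def __init__(self):
        self.totals = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.totals[name] = self.totals.get(name, 0.0) + elapsed

    def shares(self):
        """Return each stage's fraction of the total measured time."""
        total = sum(self.totals.values())
        if total == 0:
            return {name: 0.0 for name in self.totals}
        return {name: t / total for name, t in self.totals.items()}


def handle_request(timer, payload):
    """Placeholder pipeline; each body stands in for the real stage."""
    with timer.stage("preprocess"):
        tokens = [ord(c) for c in payload]
    with timer.stage("host_to_device"):
        batch = list(tokens)          # stand-in for a copy to the accelerator
    with timer.stage("model_compute"):
        out = [t * 2 for t in batch]  # stand-in for the forward pass
    with timer.stage("postprocess"):
        result = sum(out)
    return result


timer = StageTimer()
result = handle_request(timer, "hello")
shares = timer.shares()
```

On a real accelerator, work is typically launched asynchronously, so host-side timers like this need a device synchronisation before each stop to attribute time to the right stage.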
Tools like Nsight Systems on GPU or the XLA profiler on TPU give you a kernel-level breakdown; combined with request-level traces from your serving framework (Triton, vLLM, SGLang), this is usually enough to identify whether the constraint is compute, memory bandwidth, batching inefficiency, or transport.

What is the most efficient GPU infrastructure for low-latency inference today?

For most workloads in 2026, NVIDIA H100, H200, or the Blackwell B100/B200 generation paired with TensorRT-LLM or vLLM gives the strongest latency-per-dollar profile. AMD MI300X and MI325X are credible alternatives where ROCm support is sufficient for the model. The “most efficient” infrastructure also depends on batch size, model size, and whether you can use FP8 or INT8 precision; the architecture choice is downstream of the precision and batching strategy.

When does FP8 / INT8 quantisation actually reduce serving latency, and when does it only save memory?

Quantisation reduces latency when the workload is compute-bound and the lower-precision tensor cores are faster than the higher-precision ones on your hardware. When the workload is memory-bandwidth-bound (common for autoregressive decoding with large KV caches), quantisation primarily reduces memory pressure and lets you fit larger batches or longer contexts; the per-token latency improvement comes from better batching, not faster math.

How do batching strategies (continuous, dynamic, static) trade throughput against tail latency?

Static batching maximises throughput but punishes tail latency because requests wait for a full batch. Dynamic batching with a short timeout gives a reasonable middle ground for traditional models. Continuous batching (in vLLM, TGI, SGLang) is the right default for LLM serving: it processes requests at the token level rather than the request level, which dramatically improves both throughput and tail latency for autoregressive workloads.

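The request-level versus token-level distinction can be seen in a toy simulation. The sketch below is a deliberately simplified model, not how vLLM, TGI, or SGLang are implemented: it assumes every request arrives at time zero, every decode step costs one time unit regardless of how many slots are occupied, and the function names and step counts are made up. Under those assumptions, static batching holds each batch until its longest member finishes, while continuous batching refills a slot as soon as a request completes.

```python
from collections import deque


def static_batching(steps_needed, batch_size):
    """Finish time per request when a batch runs until its longest member ends."""
    finish = [0] * len(steps_needed)
    clock = 0
    for start in range(0, len(steps_needed), batch_size):
        group = range(start, min(start + batch_size, len(steps_needed)))
        clock += max(steps_needed[i] for i in group)
        for i in group:
            finish[i] = clock  # results are released only when the batch ends
    return finish


def continuous_batching(steps_needed, batch_size):
    """Finish time per request when freed slots are refilled after every step."""
    if any(n < 1 for n in steps_needed):
        raise ValueError("every request needs at least one decode step")
    finish = [0] * len(steps_needed)
    waiting = deque(range(len(steps_needed)))
    active = {}  # request index -> remaining decode steps
    clock = 0
    while waiting or active:
        # Admit waiting requests into any free slots before the next step.
        while waiting and len(active) < batch_size:
            i = waiting.popleft()
            active[i] = steps_needed[i]
        clock += 1  # one decode step for every active request
        for i in list(active):
            active[i] -= 1
            if active[i] == 0:
                finish[i] = clock
                del active[i]
    return finish


# Made-up decode lengths (in steps) for six requests and two batch slots.
steps = [2, 8, 3, 2, 4, 1]
static_finish = static_batching(steps, batch_size=2)
continuous_finish = continuous_batching(steps, batch_size=2)
```

With these made-up lengths, the last request finishes at step 15 under static batching and step 11 under continuous batching, and short requests no longer wait behind the eight-step one. Real schedulers also handle arrivals over time, prefill cost, and KV-cache limits, all of which this sketch ignores.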
When should I optimise the inference path rather than scale out to more GPUs?

Almost always optimise first. Across the engagements we’ve worked on, algorithmic and infrastructure optimisations (better batching, quantisation, kernel selection, KV-cache management, request routing) typically yield larger latency reductions than adding hardware. Scaling out is the right answer when you’ve measured the current pipeline at high utilisation, the model is already well-tuned, and demand genuinely exceeds capacity.

How do I measure cost-per-inference before and after optimisation to justify the engineering work?

Measure three things: GPU-hours consumed per million tokens (or requests), the p50 and p99 latency at the target throughput, and the utilisation of the GPU during steady-state load. Before/after on those three numbers, multiplied by your cloud rate or amortised on-prem cost, gives you a defensible cost-per-inference comparison. The engineering work is usually justified when the optimisation drops cost-per-inference by more than the cost of a few additional GPUs would.

For broader programme context across our GPU engagements, explore our GPU performance engineering practice. For the inference latency thread specifically, the parent piece is how to optimise AI inference latency on GPU infrastructure.
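The before/after cost comparison described in the last FAQ answer reduces to a few lines of arithmetic. The sketch below is illustrative only: the `Measurement` fields mirror the three quantities named there, while the hourly rate, token volume, and every measured value are placeholder numbers, not benchmarks.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    gpu_hours_per_million_tokens: float
    p50_ms: float
    p99_ms: float
    gpu_utilisation: float  # fraction between 0.0 and 1.0


def cost_per_million_tokens(m, hourly_rate):
    """GPU-hours per million tokens multiplied by the cost of one GPU-hour."""
    return m.gpu_hours_per_million_tokens * hourly_rate


def monthly_saving(before, after, hourly_rate, million_tokens_per_month):
    """Cost difference between two measurements at a given monthly volume."""
    delta = (cost_per_million_tokens(before, hourly_rate)
             - cost_per_million_tokens(after, hourly_rate))
    return delta * million_tokens_per_month


# Placeholder numbers for illustration only.
before = Measurement(gpu_hours_per_million_tokens=0.50, p50_ms=180.0,
                     p99_ms=900.0, gpu_utilisation=0.45)
after = Measurement(gpu_hours_per_million_tokens=0.25, p50_ms=150.0,
                    p99_ms=600.0, gpu_utilisation=0.70)
RATE = 4.0  # hypothetical cost per GPU-hour

saving = monthly_saving(before, after, RATE, million_tokens_per_month=1000)
```

The latency and utilisation fields are carried alongside the cost figure so that a cheaper configuration that misses the p99 target is not mistaken for an improvement.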