## Two ways to distribute a model across GPUs

When a model is too large to fit on a single GPU — or when you need more throughput than one GPU provides — you must distribute computation across multiple GPUs. The two primary strategies split the model differently:

Tensor parallelism (TP) splits individual operations across GPUs. A single matrix multiplication is divided so each GPU computes a portion, then the results are combined via all-reduce communication. Every GPU participates in every layer's computation.

Pipeline parallelism (PP) splits model layers across GPUs. GPU 0 runs layers 1–10, GPU 1 runs layers 11–20, and so on. Each GPU runs complete operations, but only for its assigned layers. Data flows through the pipeline sequentially.

In short: tensor parallelism splits individual operations across GPUs (low latency, high bandwidth requirement); pipeline parallelism splits model layers across GPUs (tolerates lower bandwidth, adds pipeline bubble overhead).

### The tradeoff structure

| Dimension | Tensor parallelism | Pipeline parallelism |
| --- | --- | --- |
| Communication pattern | All-reduce after every operation | Point-to-point between adjacent stages |
| Bandwidth requirement | Very high (NVLink-class: 600+ GB/s) | Moderate (PCIe or InfiniBand sufficient) |
| Latency per token | Low (all GPUs compute simultaneously) | Higher (sequential stage execution) |
| GPU utilisation | High (all GPUs always active) | Reduced by the pipeline bubble (idle time between micro-batches) |
| Scaling limit | 4–8 GPUs per TP group (communication overhead grows) | Limited by bubble fraction and memory per stage |
| Memory efficiency | Every layer's weights are sharded across the TP group (each GPU holds a slice of every layer) | Each GPU holds the full weights of only its assigned layers |

### When tensor parallelism wins

TP is optimal when:

- GPUs are connected via a high-bandwidth interconnect (NVLink within a node: 900 GB/s on H100)
- Latency matters more than throughput (real-time inference, interactive applications)
- The model fits across a small number of GPUs (2–8) with TP alone
- You need every GPU contributing to every token's computation

The constraint: TP requires all-reduce communication after every tensor operation. On NVLink (900 GB/s), this adds microseconds. On PCIe (64 GB/s) or InfiniBand between nodes (400 Gb/s, i.e. roughly 50 GB/s), communication time dominates computation time, making TP impractical across nodes.

### When pipeline parallelism wins

PP is optimal when:

- GPUs are connected via lower-bandwidth links (cross-node InfiniBand)
- The model is large enough to require many GPUs (16+)
- Throughput matters more than per-request latency
- You can tolerate the pipeline bubble overhead by filling it with micro-batches

The pipeline bubble problem: when the pipeline starts, only GPU 0 is active, and the remaining GPUs enter the pipeline sequentially; at the end of a batch, they finish sequentially. The fraction of time GPUs sit idle (the bubble) is approximately (p - 1) / (p - 1 + m), where p is the number of pipeline stages and m is the number of micro-batches. With 8 stages and 32 micro-batches, the bubble overhead is roughly 18%. A small sanity check of this formula follows at the end of this section.

### The hybrid reality: combining TP + PP

Production deployments of large models almost always use both strategies simultaneously:

- TP within a node (leveraging NVLink's high bandwidth for low-latency intra-operation communication)
- PP across nodes (tolerating InfiniBand's lower bandwidth for inter-stage communication)

A 32-GPU deployment across 4 nodes might use TP=8 (within each 8-GPU node) and PP=4 (across the 4 nodes). This maximises the use of each interconnect tier's bandwidth characteristics.
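To sanity-check the bubble figure quoted above, here is a minimal sketch in plain Python (the function name is ours; no framework is assumed) that evaluates the approximation (p - 1) / (p - 1 + m) for a few stage and micro-batch counts:

```python
def pipeline_bubble_fraction(stages: int, micro_batches: int) -> float:
    """Approximate fraction of GPU time lost to the pipeline bubble:
    (p - 1) / (p - 1 + m), where p = pipeline stages, m = micro-batches."""
    return (stages - 1) / (stages - 1 + micro_batches)


if __name__ == "__main__":
    # 8 stages, 32 micro-batches: ~18%, the figure quoted above.
    print(f"{pipeline_bubble_fraction(8, 32):.1%}")   # 17.9%
    # Filling the pipeline with more micro-batches shrinks the bubble;
    # with too few micro-batches the bubble dominates.
    print(f"{pipeline_bubble_fraction(8, 128):.1%}")  # 5.2%
    print(f"{pipeline_bubble_fraction(8, 8):.1%}")    # 46.7%
```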
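And to make the 32-GPU hybrid example concrete, here is a small illustrative sketch (our own convention, not any particular framework's API) that maps global GPU ranks to a pipeline stage and a tensor-parallel rank under TP=8 within each node and PP=4 across the four nodes; the assumption that consecutive ranks share a node is ours:

```python
def hybrid_layout(num_gpus: int, tp: int, pp: int) -> dict[int, tuple[int, int]]:
    """Map each global GPU rank to a (pipeline_stage, tp_rank) pair.

    Assumes ranks are numbered node by node, so each block of `tp`
    consecutive ranks sits on one node and forms one NVLink-connected
    TP group, while pipeline stages communicate across nodes.
    """
    assert tp * pp == num_gpus, "tp * pp must equal the total GPU count"
    return {rank: (rank // tp, rank % tp) for rank in range(num_gpus)}


if __name__ == "__main__":
    # 32 GPUs across 4 nodes: TP=8 inside each node, PP=4 across nodes.
    layout = hybrid_layout(32, tp=8, pp=4)
    for stage in range(4):
        ranks = [r for r, (s, _) in layout.items() if s == stage]
        print(f"stage {stage}: ranks {ranks[0]}-{ranks[-1]} (one node, TP group of {len(ranks)})")
```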
The optimal parallelism strategy depends on interconnect bandwidth and model architecture — not just GPU count — which is why performance results on one cluster configuration do not transfer directly to another. A model that achieves X tokens/second with TP=4 on NVLink-connected A100s will achieve a very different number with TP=4 on PCIe-connected A100s.

### Data parallelism: the third dimension

Data parallelism (DP) — running identical model replicas on different GPUs, each processing different input data — combines with TP and PP for full 3D parallelism. DP is the simplest form: each GPU holds a complete model copy, processes a different batch, and synchronises gradients with the others. It scales throughput linearly with GPU count but requires each GPU to hold the full model in memory.

The 3D combination: TP within nodes for latency, PP across node groups for model size, DP across replica groups for throughput. The specific combination for your deployment depends on model size, available GPUs, interconnect topology, and whether you optimise for latency or throughput.

Understanding these tradeoffs is fundamental to interpreting benchmark results. As explored in our analysis of why training and inference are fundamentally different workloads, the parallelism strategy that is optimal for training (where batch size is flexible and throughput is king) differs from the one that is optimal for inference (where batch size is determined by traffic patterns and latency constraints matter).
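To close, a small sanity-check sketch of the 3D combination, again with hypothetical numbers (the ParallelPlan helper and the 64-GPU cluster below are illustrative assumptions, not a recommendation): a split is only viable when dp × pp × tp matches the GPU count, and keeping each TP group inside a single node respects the interconnect hierarchy described above.

```python
from dataclasses import dataclass


@dataclass
class ParallelPlan:
    """A candidate 3D split: data-, pipeline-, and tensor-parallel degrees."""
    dp: int  # number of identical model replicas
    pp: int  # pipeline stages within each replica
    tp: int  # GPUs in each tensor-parallel group

    def gpus_required(self) -> int:
        return self.dp * self.pp * self.tp

    def fits(self, num_gpus: int, gpus_per_node: int) -> bool:
        """True if the plan uses exactly num_gpus and each TP group can sit
        inside one node (assuming ranks are assigned node by node)."""
        return self.gpus_required() == num_gpus and gpus_per_node % self.tp == 0


if __name__ == "__main__":
    # Hypothetical cluster: 8 nodes of 8 GPUs (64 total).
    # TP=8 within each node, PP=4 across nodes, DP=2 replicas for throughput.
    plan = ParallelPlan(dp=2, pp=4, tp=8)
    print(plan.gpus_required(), plan.fits(num_gpus=64, gpus_per_node=8))  # 64 True
```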