## GPUs don’t do the same parallelism as CPUs

When engineers hear “parallel computing,” many think of CPU multithreading — 8, 16, or 64 threads executing independent tasks simultaneously. GPU parallelism is fundamentally different in architecture, granularity, and the class of problems it accelerates. A modern GPU has thousands of cores (an NVIDIA A100 has 6,912 CUDA cores), but each core is far simpler than a CPU core. The design trades per-thread sophistication for massive throughput on data-parallel workloads — problems where the same operation applies to thousands or millions of data elements simultaneously.

Understanding this distinction is a prerequisite for GPU optimisation. Engineers who approach GPU programming with a CPU threading mental model write code that underutilises the hardware by 10–100×.

### The execution model: warps, threads, and occupancy

GPU execution is organised around the SIMT (Single Instruction, Multiple Threads) model. Threads are grouped into warps of 32 (on NVIDIA hardware) that execute the same instruction simultaneously on different data. A GPU schedules thousands of warps concurrently, switching between them to hide memory latency — when one warp stalls waiting for data, another warp executes immediately.

| Dimension | CPU parallelism | GPU parallelism |
| --- | --- | --- |
| Core count | 8–128 complex cores | 1,000–16,000 simple cores |
| Thread independence | Fully independent threads | Threads in a warp execute the same instruction |
| Context switching | Expensive (OS-level) | Near-free (hardware warp scheduling) |
| Memory model | Large caches per core (MB) | Small shared memory per SM (KB), high-bandwidth global memory |
| Ideal workload | Task-parallel, complex branching | Data-parallel, uniform operations on large arrays |
| Latency strategy | Minimise latency per thread | Hide latency through massive thread occupancy |

### When GPU parallelism wins — and when it doesn’t

GPU parallelism exploits thousands of simple cores for data-parallel workloads, unlike CPU thread-level parallelism, which excels at task-parallel work with complex control flow. The performance gap between serial and parallel execution grows non-linearly with problem size — and Amdahl’s law governs the ceiling.

Where GPUs dominate:

- Matrix multiplication (the foundation of neural network computation) — scales near-linearly with core count
- Element-wise operations on large tensors (activation functions, normalisation)
- Convolution operations across spatial dimensions
- Batch processing where the same model processes thousands of inputs simultaneously

Where CPUs remain faster:

- Sequential algorithms with data dependencies between steps
- Workloads with heavy branching (if/else paths that diverge within a warp cause serialisation)
- Small problem sizes where GPU launch overhead exceeds the computation time
- Irregular memory access patterns that prevent coalesced reads

### Amdahl’s law: the parallel ceiling

Amdahl’s law states that the speedup from parallelisation is limited by the fraction of the workload that must remain serial. If 10% of a computation is inherently sequential, the maximum theoretical speedup from infinite parallel resources is 10× — regardless of how many GPU cores you add.

In practice, this means profiling to identify the serial fraction is the first step in any GPU parallelisation effort. The serial fraction is not always where engineers assume: data loading, preprocessing, host–device memory transfer, and synchronisation points all contribute serial time that limits the parallel speedup.
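As a quick sanity check on that ceiling, the sketch below evaluates the standard Amdahl's law formula, speedup(N) = 1 / (s + (1 − s) / N), for a hypothetical workload with a 10% serial fraction. The serial fraction, core counts, and function name are illustrative assumptions, not measurements from any profiler.

```python
# Minimal sketch of Amdahl's law: speedup(N) = 1 / (s + (1 - s) / N),
# where s is the serial fraction and N the number of parallel workers.
# The serial fraction and worker counts below are illustrative assumptions.

def amdahl_speedup(serial_fraction: float, workers: int) -> float:
    """Maximum theoretical speedup for a given serial fraction and worker count."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)

if __name__ == "__main__":
    s = 0.10  # 10% of the workload is inherently sequential
    for n in (8, 128, 6_912, 1_000_000):  # laptop CPU, large CPU, A100 CUDA cores, "infinite"
        print(f"{n:>9} workers -> {amdahl_speedup(s, n):5.2f}x speedup")
    # Even with effectively unlimited cores, the speedup approaches 1/s = 10x.
```

Running it shows the curve flattening quickly: going from 128 to 6,912 workers barely moves the result, because the serial 10% already dominates.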
### Practical implications for AI workloads

Modern AI frameworks (PyTorch, TensorFlow, JAX) abstract GPU parallelism behind high-level APIs — but the underlying execution model still determines performance. Understanding that a matrix multiplication fully utilises data parallelism, while a custom loss function with conditional logic may serialise across warps, explains why some model operations run 1000× faster on GPU while others show no speedup.

The gap between “runs on GPU” and “runs efficiently on GPU” is the gap between launching kernels and launching kernels that match the hardware’s execution model. Data layout, memory access patterns, and operation structure all determine whether the thousands of available cores are working or waiting.
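To make that distinction concrete, here is a rough PyTorch timing sketch (illustrative shapes and loop counts, not a rigorous benchmark): the same matrix multiply issued once as a single large data-parallel kernel, and again as thousands of tiny per-row launches. The `timed` helper and tensor sizes are assumptions for illustration.

```python
# Rough PyTorch sketch (not a rigorous benchmark): the same arithmetic issued
# as one large data-parallel kernel vs. many tiny kernel launches.
# Shapes and iteration counts are illustrative assumptions.
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

def timed(fn):
    """Time a callable, synchronising so GPU kernel time is actually measured."""
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    fn()
    if device == "cuda":
        torch.cuda.synchronize()
    return time.perf_counter() - start

a = torch.randn(4096, 1024, device=device)
b = torch.randn(1024, 1024, device=device)
_ = a @ b  # warm-up so CUDA context setup isn't included in the timing

# One launch, fully data-parallel: thousands of cores share one matmul.
t_fused = timed(lambda: a @ b)

# Same multiply split into 4096 row-sized launches: each kernel is too small
# to fill the GPU, so launch overhead and idle cores dominate.
def row_by_row():
    for i in range(a.shape[0]):
        _ = a[i:i + 1] @ b

t_looped = timed(row_by_row)

print(f"fused matmul : {t_fused * 1e3:8.2f} ms")
print(f"row-by-row   : {t_looped * 1e3:8.2f} ms")
```

Both versions “run on GPU”; only the fused one keeps the cores busy, which is the difference the section above describes.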