CPU vs GPU Comparison for AI: Why the Question Is Usually Misdirected

The comparison misframes the problem

“CPU vs GPU for AI” is framed as a competition, but modern AI workloads are not decided by choosing one or the other. They run on both. The question is which operations belong on each, and whether the boundary between them is causing performance problems.

Understanding the actual decision helps avoid two common mistakes: trying to run everything on GPU when CPU operations bottleneck the pipeline, and benchmarking CPUs for workloads that are architecturally GPU-bound. In our experience, teams that frame procurement as “CPU or GPU?” almost always skip a more useful question: how much of their existing GPU capacity is actually being used.

What each processor is for in AI systems

Operation	CPU	GPU
Data loading and I/O	Primary (I/O is not GPU-bound)	No
Tokenisation and text preprocessing	Primary	Sometimes (GPU tokenisers exist)
Data augmentation	Primary (can be parallelised)	Sometimes (DALI, torchvision GPU transforms)
Matrix multiplication (model forward pass)	Insufficient for large models	Primary
Attention computation	Too slow at scale	Primary (FlashAttention)
Postprocessing and sampling	CPU viable for small batches	GPU for high throughput
Control flow and orchestration	Primary	No

The GPU handles the compute-intensive inner loop. The CPU handles everything around it. If either side is slower than the other can consume work from it, you have a bottleneck — and the bottleneck is almost never where the procurement question assumed it would be.
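The bottleneck condition above can be written down as a small calculation. This is a minimal sketch, not a profiler: the function name and the two rates are hypothetical inputs that you would have to measure on your own pipeline.

```python
def pipeline_throughput(cpu_feed_rate: float, gpu_consume_rate: float) -> dict:
    """Steady-state throughput of a two-stage CPU -> GPU pipeline.

    Both rates are in samples per second and are assumed to be measured
    separately (CPU preprocessing alone, GPU compute alone).
    """
    if cpu_feed_rate <= 0 or gpu_consume_rate <= 0:
        raise ValueError("rates must be positive")
    # The pipeline can only run as fast as its slower stage.
    effective = min(cpu_feed_rate, gpu_consume_rate)
    return {
        "effective_rate": effective,
        "bottleneck": "cpu" if cpu_feed_rate < gpu_consume_rate else "gpu",
        # Share of GPU capacity left unused because the host cannot feed it.
        "gpu_idle_fraction": 1.0 - effective / gpu_consume_rate,
    }


# Hypothetical numbers: the host prepares 1,200 samples/s,
# the GPU could consume 3,000 samples/s.
result = pipeline_throughput(cpu_feed_rate=1200.0, gpu_consume_rate=3000.0)
print(result)
```

With these illustrative inputs the CPU side is the limiting stage, and more than half of the GPU's capacity would sit unused no matter how fast the GPU itself is.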

What is the actual throughput gap between CPU and GPU on AI tensor operations?

For the same tensor operations, the hardware envelopes are well known:

A modern server CPU (AMD EPYC, Intel Xeon) achieves roughly 1–4 TFLOPS FP32 for optimised matrix operations — observed range across vendor whitepapers, not a benchmarked rate for any specific deployment.
A modern data centre GPU achieves 100–1000 TFLOPS FP16/BF16 in vendor-published peak figures.

For AI model inference at batch sizes greater than one, GPU dominates by 20–100× for compute-intensive models. This is a directional industry-scale comparison, not an operational benchmark for your workload. At batch size one with small models and strict latency constraints, the gap narrows sharply, and CPU inference is viable for several real use cases.

The reason the gap collapses at small batch is structural: a GPU's compute units are only as busy as memory bandwidth can keep them. HBM on a data centre GPU sustains roughly 2–3 TB/s, while DDR5 on a server CPU sustains 50–100 GB/s. At batch one the model reads every parameter once and does minimal arithmetic per byte fetched, so on both processors the runtime is set by memory bandwidth rather than by peak FLOPS, and most of the GPU's compute advantage goes unused. What remains of the GPU's lead is further eroded by host-to-device transfer overhead (typically 0.5–2 ms per inference on PCIe-connected GPUs, an observed-pattern range we see across customer environments).
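A roofline-style estimate makes the batch-size effect concrete. The sketch below is a simplification under stated assumptions: it models one dense layer, counts only weight traffic, and uses illustrative hardware envelopes picked from inside the ranges quoted above (2 TFLOPS and 75 GB/s for the CPU, 200 TFLOPS and 2.5 TB/s for the GPU). None of these are measurements.

```python
def roofline_time_s(flops, bytes_moved, peak_flops, peak_bandwidth,
                    fixed_overhead_s=0.0):
    """Lower-bound runtime: the slower of compute time and memory time,
    plus any fixed per-call overhead (for example a PCIe transfer)."""
    return max(flops / peak_flops, bytes_moved / peak_bandwidth) + fixed_overhead_s


def dense_layer_cost(batch, d_in, d_out, bytes_per_param=2):
    """FLOPs and weight bytes for y = xW; activation traffic is ignored."""
    flops = 2.0 * batch * d_in * d_out
    weight_bytes = float(d_in * d_out * bytes_per_param)
    return flops, weight_bytes


# Illustrative envelopes (assumptions, not benchmarks).
GPU = {"peak_flops": 200e12, "peak_bandwidth": 2.5e12}
CPU = {"peak_flops": 2e12, "peak_bandwidth": 75e9}

ratios = {}
for batch in (1, 256):
    flops, nbytes = dense_layer_cost(batch, 4096, 4096)
    t_gpu = roofline_time_s(flops, nbytes, **GPU)
    t_cpu = roofline_time_s(flops, nbytes, **CPU)
    ratios[batch] = t_cpu / t_gpu
    print(f"batch={batch}: estimated CPU/GPU time ratio = {ratios[batch]:.1f}x")
```

Under these assumptions the batch-256 ratio tracks the compute ratio, while the batch-1 ratio falls to the bandwidth ratio. Passing `fixed_overhead_s=1e-3` to the GPU call at batch one pushes the estimate for this single small layer in the CPU's favour, which is the effect the transfer-overhead figure above describes.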

When CPU inference is practical

CPU-only inference is appropriate when the workload pins itself to a narrow operating point:

Model size below roughly 1B parameters.
Batch size of one (single request, synchronous).
Latency budget above 100 ms.
Hardware cost or power envelope precludes a GPU.
Deployment context (embedded device, regulated environment) makes GPU drivers a validation burden.

ONNX Runtime and OpenVINO optimise CPU inference for these cases using AVX-512 and AMX instructions. For models in the 100M–1B parameter range, ONNX Runtime on CPU with INT8 quantisation typically delivers 10–50 ms inference latency for classification and embedding tasks — observed across customer benchmarks, not a guarantee for arbitrary models. That is competitive with GPU latency once host-to-device transfer overhead is included, and for single-request latency-sensitive serving without batching, CPU can actually be faster than GPU for small models.
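Latency comparisons like the one above only hold if the timed region covers the whole request path, including any host-to-device copy. A minimal timing harness might look like the following. The names `measure_latency_ms` and `fake_inference` are hypothetical; in practice `run_once` would wrap your real CPU or GPU inference call.

```python
import statistics
import time
from typing import Callable


def measure_latency_ms(run_once: Callable[[], object],
                       warmup: int = 10, iters: int = 100) -> dict:
    """Time a single-request call and report median, p95 and max latency."""
    # Warm-up runs let caches, JIT compilation and lazy initialisation settle.
    for _ in range(warmup):
        run_once()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {
        "p50_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
        "max_ms": samples[-1],
    }


def fake_inference():
    """Stand-in workload; replace with the real end-to-end inference call."""
    return sum(i * i for i in range(10_000))


stats = measure_latency_ms(fake_inference, warmup=5, iters=50)
print(stats)
```

Running the same harness against a CPU path and a GPU path, with the same model and the same input, gives numbers that can be compared directly. Reporting p95 alongside the median matters because latency budgets are usually set on the tail.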

Above 1B parameters, GPU acceleration is necessary for practical inference speeds. For a 7B-parameter LLM, CPU inference generates 1–3 tokens per second while GPU inference generates 30–100 tokens per second at the same precision (observed-pattern range across the deployments we have profiled). That gap of 10× or more makes CPU deployment impractical for interactive applications regardless of cost framing.

Why the GPU-busy percentage is misleading

A benchmark that reports only GPU metrics misses half the picture. CPU-side preprocessing — tokenisation for LLMs, image decoding and augmentation for vision models, feature extraction for tabular data — can starve the GPU if the data pipeline cannot feed tensors fast enough. When we see GPU utilisation below 80% during training, the cause is more often a CPU-side data-loading bottleneck than a GPU scheduling problem.

The harder issue is that the headline GPU-busy percentage that nvidia-smi reports does not distinguish between “doing useful matmul” and “executing a trivial kernel while waiting for the next batch.” A GPU that reports 95% busy can still be wasting most of its compute on memory-stall cycles or under-sized batches. Diagnosing this requires measuring both CPU utilisation per core and GPU utilisation simultaneously, and ideally adding Nsight Systems or PyTorch Profiler traces to see what the GPU is actually executing.

If CPU utilisation on data-loading cores sits at 100% while GPU utilisation dips periodically, increasing the number of data-loader workers (num_workers in PyTorch’s DataLoader) or moving preprocessing onto the GPU (DALI, torchvision’s GPU transforms) typically resolves the immediate bottleneck. For inference serving, the CPU-side overhead includes request parsing, tokenisation, batching logic, and response serialisation. We typically allocate at least four CPU cores per GPU for inference serving workloads to keep the host side from becoming the limiting factor — an observed planning heuristic, not a benchmarked ratio.
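The starvation pattern described above, and a first guess at how many loader workers would relieve it, can be sketched as follows. Everything here is an assumption for illustration: the function names, the thresholds (the 0.80 dip level mirrors the 80% figure mentioned earlier), the 20% headroom, and the sample traces. The worker estimate is a starting point for `num_workers`, not a substitute for profiling.

```python
import math


def looks_cpu_starved(loader_cpu_util, gpu_util,
                      pegged=0.95, dip=0.80, min_dip_share=0.10):
    """Flag the pattern 'loader cores pegged while the GPU dips periodically'.

    Both arguments are equal-length lists of utilisation samples in [0, 1],
    taken at the same timestamps.
    """
    if len(loader_cpu_util) != len(gpu_util) or not gpu_util:
        raise ValueError("need two non-empty traces of equal length")
    cpu_pegged_share = sum(u >= pegged for u in loader_cpu_util) / len(loader_cpu_util)
    gpu_dip_share = sum(u < dip for u in gpu_util) / len(gpu_util)
    return cpu_pegged_share >= 0.9 and gpu_dip_share >= min_dip_share


def workers_needed(gpu_samples_per_s, cpu_seconds_per_sample, headroom=1.2):
    """CPU workers required to keep up with the GPU, with some headroom."""
    return math.ceil(gpu_samples_per_s * cpu_seconds_per_sample * headroom)


# Hypothetical traces: loader cores saturated, GPU dipping in 3 of 8 samples.
cpu_trace = [1.00, 0.99, 1.00, 1.00, 0.98, 1.00, 1.00, 0.99]
gpu_trace = [0.95, 0.60, 0.97, 0.55, 0.96, 0.94, 0.58, 0.97]
starved = looks_cpu_starved(cpu_trace, gpu_trace)

# Hypothetical rates: GPU consumes 2,000 samples/s, preprocessing costs 4 ms each.
workers = workers_needed(gpu_samples_per_s=2000, cpu_seconds_per_sample=0.004)
print(starved, workers)
```

The worker estimate assumes preprocessing scales linearly across cores, which stops being true once storage or memory bandwidth becomes the limit.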

Decision framework: model size, latency, scale

The decision framework we use has three inputs: model size, latency requirement, and deployment scale.

Model size	Latency budget	Scale	Recommendation
< 100M params	> 200 ms	< 100 req/s	CPU viable (ONNX Runtime, AVX-512)
100M–1B params	> 50 ms	Moderate	CPU competitive with quantisation; GPU if batching helps
100M–1B params	< 20 ms	High	GPU, with attention to host-to-device transfer
> 1B params	Any	Any	GPU required; HBM bandwidth is the constraint
Any	Edge / embedded	Any	CPU or accelerator-on-die; GPU rarely viable

The threshold where GPU acceleration earns its complexity overhead depends on parameter count, memory bandwidth pressure, and batch shape — not on category preference. LynxBench AI treats the CPU/GPU partition as a workload-bound and model-size-bound choice for exactly this reason.
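The table can be encoded as a lookup so that the gaps in it become explicit. This is a sketch of the table as printed, not a complete policy: the function name is hypothetical, `scale` is passed as a label because the table does not put numbers on “Moderate” or “High”, and checking the edge row first is our assumption about precedence.

```python
def recommend(params: float, latency_budget_ms: float, scale: str,
              edge: bool = False) -> str:
    """Map (model size, latency budget, scale) to the table's recommendation.

    scale is one of "low" (under 100 req/s, per the first row),
    "moderate" or "high"; the last two are caller-defined labels.
    """
    if edge:
        return "CPU or accelerator-on-die; GPU rarely viable"
    if params > 1e9:
        return "GPU required; HBM bandwidth is the constraint"
    if params < 100e6 and latency_budget_ms > 200 and scale == "low":
        return "CPU viable (ONNX Runtime, AVX-512)"
    if 100e6 <= params <= 1e9:
        if latency_budget_ms > 50 and scale == "moderate":
            return "CPU competitive with quantisation; GPU if batching helps"
        if latency_budget_ms < 20 and scale == "high":
            return "GPU, with attention to host-to-device transfer"
    # Combinations the table does not list are left undecided on purpose.
    return "Not covered by the table; profile the workload"


print(recommend(50e6, 250, "low"))
print(recommend(500e6, 30, "moderate"))
print(recommend(7e9, 500, "low"))
```

The second call falls between two rows of the table, which is why the function returns an explicit “not covered” result rather than guessing.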

What the question should really be

Before approving more GPU capacity, the more useful question is whether the GPUs already on the floor are being used. GPU underutilisation is the hidden cost behind this whole comparison: teams compare CPU and GPU on theoretical throughput, buy more GPU, and then run those GPUs at 40% utilisation because the CPU-side data pipeline cannot feed them. The CPU-vs-GPU benchmark was answering the wrong question.

The question to put to any CPU-vs-GPU recommendation for an AI workload is whether the comparison was scoped to the same model, the same operating point, and the same data pipeline as the deployment — or whether it generalised from a benchmark whose model size, batch shape, and host configuration do not match what you actually run.

FAQ

How do I calculate the true cost of an underutilised GPU fleet?

Multiply the hourly rate (cloud GPU rental or amortised on-prem TCO) by the hours billed and by the share of purchased capacity that does no useful work. If a fleet runs at 40% effective utilisation, 60% of the spend is paying for idle silicon. This is TCO per useful FLOP rather than TCO per purchased FLOP, and it is the only cost figure that survives contact with a real workload.
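As a worked example of that arithmetic, here is a minimal sketch. The function name, the hourly rate, the fleet size and the billing period are all hypothetical; only the 40% utilisation figure comes from the text above.

```python
def idle_spend(hourly_rate: float, gpu_count: int, hours: float,
               effective_utilisation: float) -> tuple:
    """Return (total spend, spend attributable to idle capacity)."""
    if not 0.0 <= effective_utilisation <= 1.0:
        raise ValueError("effective_utilisation must be between 0 and 1")
    total = hourly_rate * gpu_count * hours
    idle = total * (1.0 - effective_utilisation)
    return total, idle


# Hypothetical fleet: 8 GPUs at 2.50 per GPU-hour for a 720-hour month,
# running at 40% effective utilisation.
total, idle = idle_spend(hourly_rate=2.50, gpu_count=8, hours=720,
                         effective_utilisation=0.40)
print(f"total spend: {total:.2f}, idle spend: {idle:.2f}")
```

The utilisation input has to be the measured useful fraction, not the headline GPU-busy percentage, or the result understates the idle spend.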

What does “GPU utilisation” actually measure — and why is the GPU-busy percentage misleading?

The nvidia-smi busy percentage reports whether any kernel is executing, not whether the kernel is doing useful work. A GPU can show 95% busy while spending most of its cycles on memory stalls, under-sized batches, or trivial kernels. Useful utilisation requires profiling tensor-core occupancy, achieved memory bandwidth, and the ratio of arithmetic to data movement — which is what Nsight Systems and PyTorch Profiler are for.

How do I compute total cost of ownership per useful FLOP rather than per purchased FLOP?

Take the purchased FLOP capacity over the depreciation window, multiply by the measured useful-utilisation fraction (achieved tensor-core throughput divided by peak), then divide total cost by that figure. The useful-utilisation fraction is the term most procurement models omit; including it typically moves the TCO conclusion significantly.
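The steps above translate directly into code. This is a sketch under assumptions: the function name, the cost figure, the peak rate and the three-year window are illustrative placeholders, and the useful fraction is something you would obtain from profiling.

```python
def cost_per_useful_flop(total_cost: float, peak_flops_per_s: float,
                         window_seconds: float, useful_fraction: float) -> float:
    """Total cost divided by the FLOPs that did useful work over the window."""
    if not 0.0 < useful_fraction <= 1.0:
        raise ValueError("useful_fraction must be in (0, 1]")
    purchased_flops = peak_flops_per_s * window_seconds
    useful_flops = purchased_flops * useful_fraction
    return total_cost / useful_flops


THREE_YEARS_S = 3 * 365 * 24 * 3600  # assumed depreciation window

# Hypothetical device: cost of 30,000 and a 400 TFLOPS peak.
per_purchased = cost_per_useful_flop(30_000, 400e12, THREE_YEARS_S, 1.0)
per_useful = cost_per_useful_flop(30_000, 400e12, THREE_YEARS_S, 0.40)
print(f"cost multiplier from under-utilisation: {per_useful / per_purchased:.2f}x")
```

The multiplier depends only on the useful-utilisation fraction: whatever the cost and peak figures are, the cost per useful FLOP is the cost per purchased FLOP divided by that fraction.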

Which workload patterns most often leave GPU capacity on the table?

CPU-bound data pipelines (insufficient num_workers, CPU-side augmentation), batch sizes chosen for latency rather than throughput, sequential inference without dynamic batching, and training jobs whose collective operations stall on NCCL or interconnect bandwidth. In our experience, CPU-side starvation accounts for the largest share of “the GPU is bought but idle” cases.

Should I procure additional GPU capacity or first profile the utilisation of what I have?

Profile first. Procurement decisions made before profiling almost always over-buy, because the headline GPU-busy metric overstates useful work. A GPU performance audit measures actual utilisation per workload and identifies where capacity is wasted before the next procurement cycle commits more spend.

What cost savings are realistic from optimising utilisation versus renting more cloud GPUs?

The savings depend on the starting utilisation gap and the workload mix. Across the engagements we have run, recovering utilisation from data-pipeline and batching fixes typically reclaims double-digit percentages of effective capacity — an observed pattern, not a benchmarked rate. The actionable number for any specific fleet only emerges from profiling.