## Two different compute units, two different jobs

NVIDIA GPUs contain two distinct types of processing units that serve fundamentally different purposes:

- **CUDA cores** are general-purpose parallel processors. Each executes one floating-point or integer operation per clock cycle. A GPU with 16,384 CUDA cores can process 16,384 independent scalar operations simultaneously. They handle everything: graphics rendering, scientific simulation, data processing, and the non-matrix-multiply portions of AI workloads.
- **Tensor cores** are specialised matrix-multiply-accumulate (MMA) units. Each tensor core processes an entire small matrix multiplication (typically 4×4 or larger, depending on the generation) in a single operation. They exist specifically to accelerate the dense linear algebra that dominates neural network computation.

In short: CUDA cores handle general-purpose parallel computation, tensor cores accelerate matrix-multiply-accumulate operations, and AI inference throughput depends primarily on tensor core utilisation, not CUDA core count.

## Why tensor cores dominate AI performance

Neural network inference is overwhelmingly matrix multiplication. A transformer model processing a single token performs thousands of matrix-vector multiplications (in attention layers and feed-forward layers), and each of these maps directly to tensor core operations. The performance difference is substantial:

| GPU  | CUDA cores | Tensor cores | FP16 CUDA core TFLOPS | FP16 tensor core TFLOPS | Ratio |
|------|-----------:|-------------:|----------------------:|------------------------:|------:|
| A100 | 6,912      | 432          | 19.5                  | 312                      | 16×   |
| H100 | 16,896     | 528          | 51                    | 989                      | 19×   |
| L40S | 18,176     | 568          | 36.7                  | 366                      | 10×   |

Tensor cores deliver 10–19× more throughput for AI workloads than the CUDA cores on the same chip. A GPU with more CUDA cores but fewer or older-generation tensor cores will perform worse on AI inference than a GPU with fewer CUDA cores but more capable tensor cores. (A rough way to observe this gap on your own hardware is sketched at the end of this section.)

## Tensor core generations and precision support

A GPU's tensor core generation determines which precision formats (FP8, BF16, INT8) it can accelerate; this matters more for AI performance than total core count. Each generation adds support for additional data formats (the 2nd-generation tensor cores introduced with Turing are omitted here):

| Generation | First appeared in | Supported precisions |
|------------|-------------------|----------------------|
| 1st gen    | V100 (Volta)      | FP16 |
| 3rd gen    | A100 (Ampere)     | FP16, BF16, TF32, INT8, INT4 |
| 4th gen    | H100 (Hopper)     | FP16, BF16, TF32, FP8, INT8 |
| 5th gen    | B200 (Blackwell)  | FP16, BF16, TF32, FP8, FP4, INT8 |

This matters because quantised models (INT8, FP8) offer 2× or more throughput improvement over FP16, but only on hardware whose tensor cores support those formats. Running an INT8-quantised model on V100 tensor cores gains nothing, because those tensor cores only accelerate FP16.

## When CUDA cores still matter

CUDA cores are not irrelevant to AI workloads. They handle:

- **Non-linear activations** (ReLU, GELU, SiLU): element-wise operations that don't map to matrix multiplies
- **Normalisation layers** (LayerNorm, RMSNorm): reductions and element-wise computations
- **Attention score softmax**: exponential and normalisation operations
- **Tokenisation and pre/post-processing**: data preparation before and after model execution
- **Custom kernels**: any operation that doesn't decompose into standard matrix multiplication

For models where these non-MMA operations constitute a significant fraction of execution time (small-batch inference, models with many activation-heavy layers), CUDA core performance affects total throughput. But for the dominant case, the large matrix multiplications in transformer attention and FFN layers, tensor cores determine performance.
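To make the throughput gap in the table above concrete, here is a minimal PyTorch sketch, assuming a CUDA-capable GPU and a recent PyTorch build. It times a large matmul twice: once in FP32 with TF32 disabled, which keeps the work on CUDA cores, and once in FP16, which cuBLAS routes through tensor cores on Volta-class hardware and newer. The matrix size and iteration counts are arbitrary illustrations, not a rigorous benchmark.

```python
import time

import torch

def time_matmul(a: torch.Tensor, b: torch.Tensor, iters: int = 50) -> float:
    """Average seconds per matmul, synchronising so we time GPU work, not kernel launches."""
    for _ in range(5):          # warm-up iterations
        a @ b
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

n = 4096                        # illustrative size; large enough to keep the GPU busy
flops = 2 * n ** 3              # multiply-adds in one n x n matmul

# FP32 with TF32 disabled: cuBLAS keeps the matmul on CUDA cores.
torch.backends.cuda.matmul.allow_tf32 = False
a32 = torch.randn(n, n, device="cuda", dtype=torch.float32)
b32 = torch.randn(n, n, device="cuda", dtype=torch.float32)
print(f"FP32 (CUDA cores):   {flops / time_matmul(a32, b32) / 1e12:.1f} TFLOPS")

# FP16: eligible for tensor-core MMA instructions on Volta and newer.
a16, b16 = a32.half(), b32.half()
print(f"FP16 (tensor cores): {flops / time_matmul(a16, b16) / 1e12:.1f} TFLOPS")
```

The exact ratio depends on the GPU, clocks, and matrix shape, but on tensor-core-equipped hardware the FP16 figure should come out several times the FP32 one.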
## Implications for hardware selection

When evaluating GPUs for AI workloads, the relevant specifications are, in order of importance:

1. **Tensor core generation**: determines which precision formats are hardware-accelerated (a minimal way to check this on a given GPU is sketched below)
2. **Tensor core count**: determines peak matrix-multiply throughput
3. **Memory bandwidth**: determines how fast data reaches the tensor cores (often the actual bottleneck)
4. **CUDA core count**: determines throughput of non-MMA operations (a secondary factor)

Understanding this hierarchy explains why the CUDA ecosystem's influence on hardware selection extends beyond software compatibility: tensor core capabilities, and the framework support for them, determine which optimisations your workload can access.

Marketing materials emphasise CUDA core counts because they are large numbers (16,384 sounds impressive). Tensor core counts are smaller (528) but more relevant. Memory bandwidth (3.35 TB/s on the H100's HBM3) is often the actual performance determinant for inference workloads, because models are frequently memory-bandwidth-bound rather than compute-bound at typical batch sizes.
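As a rough illustration of the first point above, the following sketch maps a GPU's CUDA compute capability to the precision formats in the generation table earlier in this section. The mapping is a simplification assumed for illustration (it mostly collapses each major compute-capability version to one row of that table), and `tensor_core_precisions` is a hypothetical helper, not a PyTorch API.

```python
import torch

def tensor_core_precisions(major: int, minor: int) -> list[str]:
    """Approximate the precision formats a GPU's tensor cores accelerate,
    keyed on CUDA compute capability. Simplified mapping for illustration only."""
    if major >= 10:                 # Blackwell (e.g. B200)
        return ["FP16", "BF16", "TF32", "FP8", "FP4", "INT8"]
    if major == 9:                  # Hopper (e.g. H100)
        return ["FP16", "BF16", "TF32", "FP8", "INT8"]
    if (major, minor) == (8, 9):    # Ada Lovelace (e.g. L40S) adds FP8 to the Ampere set
        return ["FP16", "BF16", "TF32", "FP8", "INT8", "INT4"]
    if major == 8:                  # Ampere (e.g. A100)
        return ["FP16", "BF16", "TF32", "INT8", "INT4"]
    if (major, minor) == (7, 5):    # Turing adds INT8/INT4 to Volta's FP16
        return ["FP16", "INT8", "INT4"]
    if major == 7:                  # Volta (e.g. V100)
        return ["FP16"]
    return []                       # pre-Volta GPUs have no tensor cores

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    print(f"{torch.cuda.get_device_name()} (sm_{major}{minor}):",
          ", ".join(tensor_core_precisions(major, minor)) or "no tensor cores")
```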
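And to make the memory-bandwidth point concrete, here is a back-of-the-envelope arithmetic-intensity check. The H100 peak figures come from the text above; the layer width, batch sizes, and traffic model (weights plus input and output activations, nothing cached) are hypothetical and deliberately crude.

```python
# Back-of-the-envelope roofline check: at what batch size does an FP16 matmul on an
# H100 stop being limited by memory bandwidth? Peak figures are taken from the text
# above; the layer width, batch sizes, and traffic model are illustrative assumptions.
PEAK_FLOPS = 989e12          # FP16 tensor-core FLOP/s, H100 SXM
PEAK_BYTES = 3.35e12         # HBM3 bytes/s, H100 SXM

# A kernel is compute-bound only if its arithmetic intensity (FLOPs per byte of
# memory traffic) exceeds the hardware's ratio of peak compute to peak bandwidth.
hw_intensity = PEAK_FLOPS / PEAK_BYTES          # ~295 FLOPs/byte

d_in = d_out = 4096          # hypothetical FFN weight matrix, FP16 (2 bytes/element)
bytes_per_elem = 2

for batch in (1, 8, 64, 512):
    flops = 2 * batch * d_in * d_out                                     # multiply-adds
    traffic = bytes_per_elem * (d_in * d_out + batch * (d_in + d_out))   # weights + activations
    intensity = flops / traffic
    verdict = "compute-bound" if intensity > hw_intensity else "memory-bandwidth-bound"
    print(f"batch {batch:4d}: {intensity:7.1f} FLOPs/byte -> {verdict}")
```

Under these assumptions, batch sizes in the single and double digits sit far below the H100's roughly 295 FLOPs-per-byte break-even point, which is why the memory system, not the tensor cores, usually sets the ceiling for low-batch inference.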