A CUDA kernel is a function that runs on the GPU, executed simultaneously by thousands of threads. That one-line definition hides a lot of complexity that matters in practice. Understanding the thread hierarchy — and how it maps to physical hardware — is the difference between a kernel that saturates your GPU and one that leaves most of it idle.

## The Thread Hierarchy

CUDA organizes execution in a three-level hierarchy: threads, blocks, and grids.

**Threads** are the individual execution units. Each thread runs the same kernel code but operates on different data, determined by its unique index. A thread reads its position using `threadIdx.x`, `blockIdx.x`, and `blockDim.x` to compute a global data offset.

**Blocks** are groups of threads that share a fast on-chip memory (shared memory, which on recent architectures is carved from the same physical SRAM as the L1 cache) and can synchronize with each other via `__syncthreads()`. Threads in different blocks cannot synchronize directly — that design choice is deliberate, because it lets the scheduler run blocks in any order across any available Streaming Multiprocessor (SM).

**Grids** are collections of blocks that make up a full kernel launch. You define grid and block dimensions at launch time:

```cpp
kernel<<<gridDim, blockDim>>>(args);
```

The GPU scheduler distributes blocks across SMs. Each SM has a fixed budget of registers and shared memory and a maximum number of resident threads — exceeding those limits causes register spilling or reduced occupancy.

## How the Thread Hierarchy Maps to Hardware

| Software abstraction | Hardware mapping |
| --- | --- |
| Thread | CUDA core execution lane |
| Warp (32 threads) | SIMT execution unit on one SM |
| Block | One SM at a time |
| Grid | Entire GPU |

The warp is the actual execution granularity. Even if you launch blocks of 128 threads, the hardware executes 4 warps of 32. If threads within a warp follow different code paths (warp divergence), the hardware serializes those paths — execution time is the sum of all divergent branches, not the maximum.

## Writing a Minimal CUDA Kernel

```cpp
__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    // Global index: which element this thread owns.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}
```

The `__global__` qualifier tells the compiler this function is callable from the host (CPU) and runs on the device (GPU). The bounds check (`if (idx < n)`) handles cases where the last block has unused threads — a common source of out-of-bounds writes for beginners.

Launching 1 million elements with a block size of 256 would be:

```cpp
int blocks = (n + 255) / 256;  // ceiling division so every element is covered
vector_add<<<blocks, 256>>>(d_a, d_b, d_c, n);
```

## When to Write Custom Kernels vs. Using Libraries

Most GPU compute work does not require custom kernels. Libraries like cuBLAS, cuDNN, cuFFT, and Thrust cover the vast majority of standard operations and are heavily tuned by NVIDIA engineers. In our experience, teams that reach for custom kernels prematurely spend weeks gaining single-digit percentage improvements that a library update would have delivered for free.

### Decision Checklist

- Is the operation covered by cuBLAS, cuDNN, or Thrust? → Use the library.
- Is the bottleneck a kernel you already profiled and confirmed is memory-bound or compute-bound? → Investigate kernel tuning.
- Does your workflow involve a fused operation that no library supports? → A custom kernel is justified.
- Are you chaining multiple element-wise operations that each launch and terminate separately? → A fused custom kernel reduces launch overhead and memory round-trips (see the sketch after this list).
- Is the workload size too small for library overhead to amortize? → A custom lightweight kernel may be faster.
- Have you tried torch.compile or XLA before writing CUDA? → Do this first.
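To make the fusion point concrete, here is a minimal sketch of collapsing a chain of element-wise passes into a single kernel. The kernel name, the scale/bias/ReLU sequence, and the parameter names are hypothetical, chosen only to illustrate the pattern: the intermediate value stays in a register instead of making two extra round-trips to global memory.

```cpp
// Sketch only: a hypothetical scale -> bias -> ReLU chain fused into one kernel.
// Run as three separate element-wise kernels, the intermediate result would be
// written to and re-read from global memory twice; fused, it never leaves a register.
__global__ void scale_bias_relu(const float* in, float* out,
                                float scale, float bias, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float v = in[idx] * scale;   // pass 1: scale
        v += bias;                   // pass 2: add bias
        out[idx] = fmaxf(v, 0.0f);   // pass 3: ReLU
    }
}
```

It launches exactly like `vector_add` above, with the same grid and block calculation; the saving is two kernel launches and two global-memory round-trips per element.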
In our experience, the strongest case for a custom kernel is operator fusion: replacing a sequence of library calls (each writing intermediate results to HBM) with a single kernel that keeps data in registers or shared memory. FlashAttention is the textbook example — it fuses the attention score computation, softmax, and weighted sum into one kernel, reducing HBM reads and writes by an order of magnitude.

## Common Kernel Performance Mistakes

**Uncoalesced memory access** is the most frequent cause of underperforming kernels. Global memory accesses are coalesced when consecutive threads access consecutive memory addresses. If thread `i` accesses `data[i * stride]` with a large stride, each warp issues multiple memory transactions instead of one.

**Insufficient occupancy** happens when a kernel uses too many registers or too much shared memory per block, so the GPU cannot schedule enough warps per SM to hide memory latency. Use `nvcc --ptxas-options=-v` to see register usage per thread.

**Synchronization inside divergent branches**: calling `__syncthreads()` in a conditional that not every thread of the block reaches causes undefined behavior — a correctness issue, not just a performance one.

**Small block sizes** like 32 threads per block waste scheduling overhead. A block size of 128 or 256 is a reasonable default; 512 works if register pressure is low.

## Profiling Before Writing

Before writing any custom kernel, profile your application to confirm where time is actually spent. A custom CUDA kernel written for the wrong bottleneck is wasted engineering. The profiling workflow is covered in detail in the parent hub article: How to Profile GPU Kernels to Find the Real Bottleneck.

The short version: use Nsight Compute to get per-kernel metrics (memory throughput, compute throughput, occupancy), confirm whether the kernel is memory-bound or compute-bound, then address that specific constraint.

## The bottom line

CUDA kernels are the fundamental unit of GPU computation — functions that execute across thousands of threads organized into warps, blocks, and grids. The warp is the actual hardware execution unit at 32 threads; everything else is a scheduling abstraction. Custom kernels are justified for fused operations and workloads that no library covers well, but they require profiling to confirm the investment is warranted. Most teams should exhaust library options and compiler-based fusion (torch.compile, XLA) before writing a single line of CUDA C++.