CUDA Occupancy Calculator
Estimate theoretical occupancy for a CUDA kernel using your launch configuration and per-block resource usage.
This tool gives a theoretical estimate only. Real performance also depends on memory bandwidth, instruction mix, latency hiding, and scheduler behavior.
What CUDA occupancy means
CUDA occupancy is the ratio of active warps on a Streaming Multiprocessor (SM) to the maximum warps the SM can support. In simple terms, it answers: how full is each SM with runnable work?
Higher occupancy can improve latency hiding, but it does not guarantee higher performance. Many kernels run fastest at moderate occupancy, when the extra resources per thread buy instruction-level parallelism, better cache locality, or less register spilling.
Inputs used by the calculator
GPU limits (per SM)
- Max Threads per SM: Hardware limit on active threads per SM.
- Max Warps per SM: Hardware limit on active warps per SM.
- Max Blocks per SM: Architectural cap on resident thread blocks.
- Registers per SM: Total register file capacity available per SM.
- Shared Memory per SM: On-chip shared memory pool available per SM.
Kernel launch and kernel resource usage
- Threads per Block: Your launch configuration, e.g., <<<grid, block>>>.
- Registers per Thread: Register usage reported by compiler tools (e.g., nvcc -Xptxas -v output or Nsight Compute).
- Shared Memory per Block: Static + dynamic shared memory used per block.
How the occupancy is computed
The calculator evaluates how many blocks can fit on one SM under each resource constraint:
blocks_by_threads = floor(maxThreadsPerSM / threadsPerBlock)
warpsPerBlock = ceil(threadsPerBlock / 32)
blocks_by_warps = floor(maxWarpsPerSM / warpsPerBlock)
blocks_by_registers = floor(registersPerSM / (registersPerThread * threadsPerBlock))
blocks_by_smem = floor(sharedMemPerSM / sharedMemPerBlock)   (only if shared memory per block is non-zero)
The number of active blocks per SM is the minimum of those values and the architectural block limit. From there:
activeWarps = activeBlocks * warpsPerBlock
occupancy = activeWarps / maxWarpsPerSM
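The steps above can be sketched in Python. The per-SM limits used as defaults here are illustrative (roughly A100-class: 2048 threads, 64 warps, 32 blocks, 65536 registers, 100 KiB shared memory); substitute the values for your GPU.

```python
from math import ceil

def theoretical_occupancy(threads_per_block, regs_per_thread, smem_per_block,
                          max_threads_sm=2048, max_warps_sm=64, max_blocks_sm=32,
                          regs_sm=65536, smem_sm=100 * 1024):
    """Estimate theoretical occupancy as active warps / max warps per SM."""
    warps_per_block = ceil(threads_per_block / 32)
    limits = [
        max_threads_sm // threads_per_block,  # thread-count limit
        max_warps_sm // warps_per_block,      # warp-count limit
        max_blocks_sm,                        # architectural block cap
    ]
    if regs_per_thread > 0:
        limits.append(regs_sm // (regs_per_thread * threads_per_block))
    if smem_per_block > 0:
        limits.append(smem_sm // smem_per_block)
    active_blocks = min(limits)
    active_warps = active_blocks * warps_per_block
    return active_warps / max_warps_sm

# 256-thread blocks, 40 registers/thread, 8 KiB shared memory per block:
# registers are the binding constraint here (6 resident blocks, 48/64 warps).
print(theoretical_occupancy(256, 40, 8192))  # 0.75
```

Note this model ignores allocation granularity: real hardware rounds register and shared-memory allocations up to architecture-specific chunks, so the true ceiling can be slightly lower.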
Practical tuning guidance
1) Don’t chase 100% occupancy blindly
If reducing registers to increase occupancy introduces spills to local memory, the kernel may slow down. Always verify with timing and profiler metrics.
2) Watch register pressure
Registers often become the limiting factor in compute-heavy kernels. Tools like Nsight Compute and the CUDA compiler output help you confirm register usage and occupancy ceilings.
3) Shared memory can be the bottleneck
Tiling and staging often improve memory behavior but can reduce resident blocks. Balance reuse benefits against occupancy loss.
4) Block size matters
Occupancy can change significantly between 128, 256, and 512 threads per block. Test several options, especially for memory-bound kernels.
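As an illustration of why block size matters, consider a hypothetical kernel that uses 40 registers per thread and a fixed 32 KiB shared-memory tile per block, on the same illustrative A100-like SM limits as above (all numbers are assumptions, not measurements):

```python
from math import ceil

# Hypothetical per-SM limits (illustrative, not any specific GPU).
MAX_THREADS, MAX_WARPS, MAX_BLOCKS = 2048, 64, 32
REGS_SM, SMEM_SM = 65536, 100 * 1024

def occupancy(threads_per_block, regs_per_thread=40, smem_per_block=32 * 1024):
    warps = ceil(threads_per_block / 32)
    blocks = min(MAX_THREADS // threads_per_block,
                 MAX_WARPS // warps,
                 REGS_SM // (regs_per_thread * threads_per_block),
                 SMEM_SM // smem_per_block,
                 MAX_BLOCKS)
    return blocks * warps / MAX_WARPS

for tpb in (128, 256, 512):
    print(tpb, occupancy(tpb))
# The fixed 32 KiB tile caps residency at 3 blocks per SM regardless of
# block size, so larger blocks carry more warps per resident block:
# 128 -> 0.1875, 256 -> 0.375, 512 -> 0.75
```

The inverse happens when a per-block resource scales with block size; only measuring several configurations tells you which regime your kernel is in.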
Example workflow
- Start with a reasonable block size (e.g., 128 or 256).
- Read registers/thread and shared memory/block from build/profiler output.
- Use this calculator to estimate the occupancy ceiling.
- Benchmark and profile achieved occupancy, memory throughput, and warp stall reasons.
- Adjust block size and kernel resources, then repeat.
Limitations to keep in mind
This calculator is intentionally straightforward. It does not model every architecture-specific detail such as allocation granularities, scheduling nuances, special-function unit pressure, or cooperative launch constraints. Use it as a fast planning tool, then validate on hardware.