CUDA Occupancy Calculator
Estimate theoretical occupancy for a CUDA kernel using your launch configuration and per-block resource usage.
This tool gives a theoretical estimate only. Real performance also depends on memory bandwidth, instruction mix, latency hiding, and scheduler behavior.
What CUDA occupancy means
CUDA occupancy is the ratio of active warps on a Streaming Multiprocessor (SM) to the maximum warps the SM can support. In simple terms, it answers: how full is each SM with runnable work?
Higher occupancy can improve latency hiding, but it does not guarantee higher performance. Many kernels run fastest at moderate occupancy, when the extra resources per thread buy instruction-level parallelism, better cache locality, or less register spilling.
Inputs used by the calculator
GPU limits (per SM)
- Max Threads per SM: Hardware limit on active threads per SM.
- Max Warps per SM: Hardware limit on active warps per SM.
- Max Blocks per SM: Architectural cap on resident thread blocks.
- Registers per SM: Total register file capacity available per SM.
- Shared Memory per SM: On-chip shared memory pool available per SM.
Kernel launch and kernel resource usage
- Threads per Block: Your launch configuration, e.g., <<<grid, block>>>.
- Registers per Thread: Register usage reported by compiler tools (e.g., nvcc -Xptxas -v output or Nsight Compute).
- Shared Memory per Block: Static + dynamic shared memory used per block.
How the occupancy is computed
The calculator evaluates how many blocks can fit on one SM under each resource constraint:
blocks_by_threads = floor(maxThreadsPerSM / threadsPerBlock)
warpsPerBlock = ceil(threadsPerBlock / 32)
blocks_by_warps = floor(maxWarpsPerSM / warpsPerBlock)
blocks_by_registers = floor(registersPerSM / (registersPerThread * threadsPerBlock))
blocks_by_smem = floor(sharedMemPerSM / sharedMemPerBlock)   (only if shared memory per block is non-zero)
The number of active blocks per SM is the minimum of those values and the architectural block limit. From there:
activeWarps = activeBlocks * warpsPerBlock
occupancy = activeWarps / maxWarpsPerSM
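The steps above can be sketched in Python. The per-SM limits used as defaults here are illustrative (roughly A100-class: 2048 threads, 64 warps, 32 blocks, 65536 registers, 100 KiB shared memory); substitute the values for your GPU.

```python
from math import ceil

def theoretical_occupancy(threads_per_block, regs_per_thread, smem_per_block,
                          max_threads_sm=2048, max_warps_sm=64, max_blocks_sm=32,
                          regs_sm=65536, smem_sm=100 * 1024):
    """Estimate theoretical occupancy as active warps / max warps per SM."""
    warps_per_block = ceil(threads_per_block / 32)
    limits = [
        max_threads_sm // threads_per_block,  # thread-count limit
        max_warps_sm // warps_per_block,      # warp-count limit
        max_blocks_sm,                        # architectural block cap
    ]
    if regs_per_thread > 0:
        limits.append(regs_sm // (regs_per_thread * threads_per_block))
    if smem_per_block > 0:
        limits.append(smem_sm // smem_per_block)
    active_blocks = min(limits)
    active_warps = active_blocks * warps_per_block
    return active_warps / max_warps_sm

# 256-thread blocks, 40 registers/thread, 8 KiB shared memory per block:
# registers are the binding constraint here (6 resident blocks, 48/64 warps).
print(theoretical_occupancy(256, 40, 8192))  # 0.75
```

Note this model ignores allocation granularity: real hardware rounds register and shared-memory allocations up to architecture-specific chunks, so the true ceiling can be slightly lower.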
Practical tuning guidance
1) Don’t chase 100% occupancy blindly
If reducing registers to increase occupancy introduces spills to local memory, the kernel may slow down. Always verify with timing and profiler metrics.
2) Watch register pressure
Registers often become the limiting factor in compute-heavy kernels. Tools like Nsight Compute and the CUDA compiler output help you confirm register usage and occupancy ceilings.
3) Shared memory can be the bottleneck
Tiling and staging often improve memory behavior but can reduce resident blocks. Balance reuse benefits against occupancy loss.
4) Block size matters
Occupancy can change significantly between 128, 256, and 512 threads per block. Test several options, especially for memory-bound kernels.
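As an illustration of why block size matters, consider a hypothetical kernel that uses 40 registers per thread and a fixed 32 KiB shared-memory tile per block, on the same illustrative A100-like SM limits as above (all numbers are assumptions, not measurements):

```python
from math import ceil

# Hypothetical per-SM limits (illustrative, not any specific GPU).
MAX_THREADS, MAX_WARPS, MAX_BLOCKS = 2048, 64, 32
REGS_SM, SMEM_SM = 65536, 100 * 1024

def occupancy(threads_per_block, regs_per_thread=40, smem_per_block=32 * 1024):
    warps = ceil(threads_per_block / 32)
    blocks = min(MAX_THREADS // threads_per_block,
                 MAX_WARPS // warps,
                 REGS_SM // (regs_per_thread * threads_per_block),
                 SMEM_SM // smem_per_block,
                 MAX_BLOCKS)
    return blocks * warps / MAX_WARPS

for tpb in (128, 256, 512):
    print(tpb, occupancy(tpb))
# The fixed 32 KiB tile caps residency at 3 blocks per SM regardless of
# block size, so larger blocks carry more warps per resident block:
# 128 -> 0.1875, 256 -> 0.375, 512 -> 0.75
```

The inverse happens when a per-block resource scales with block size; only measuring several configurations tells you which regime your kernel is in.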
Example workflow
- Start with a reasonable block size (e.g., 128 or 256).
- Read registers/thread and shared memory/block from build/profiler output.
- Use this calculator to estimate the occupancy ceiling.
- Benchmark and profile achieved occupancy, memory throughput, and warp stall reasons.
- Adjust block size and kernel resources, then repeat.
Limitations to keep in mind
This calculator is intentionally straightforward. It does not model every architecture-specific detail such as allocation granularities, scheduling nuances, special-function unit pressure, or cooperative launch constraints. Use it as a fast planning tool, then validate on hardware.