Ceph Capacity Safety Planner

Estimate how much logical data your Ceph cluster can safely hold before you run into nearfull/full pressure and rebalance risk.


Why “safe” capacity matters in Ceph

Raw capacity is not the same as safely available capacity. In Ceph, you need to account for redundancy, nearfull/full thresholds, and free space required for backfill and recovery. Clusters that run too hot become difficult to rebalance and may degrade during failures.

This calculator focuses on conservative planning: how much logical data you can store while still keeping practical headroom for operations.

How this calculator works

1) Start with raw storage

Raw storage is the simple total:

  • Raw TB = OSD count × size per OSD
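The formula above can be sketched as a one-line helper (the function name and the 12 × 8 TB example cluster are illustrative, not from the tool):

```python
def raw_tb(osd_count: int, tb_per_osd: float) -> float:
    # Raw capacity is simply OSD count times per-OSD size,
    # before any redundancy or safety margin is applied.
    return osd_count * tb_per_osd

print(raw_tb(12, 8.0))  # 96.0
```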

2) Apply redundancy efficiency

Ceph overhead depends on the protection method:

  • Replicated: efficiency = 1 / replica size
  • Erasure coded: efficiency = k / (k + m)

Higher protection means lower usable data efficiency, but better resilience.
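The two efficiency formulas can be expressed directly; a minimal sketch (function names are hypothetical, and 3x replication vs. an EC 4+2 profile are just common example choices):

```python
def replicated_efficiency(replica_size: int) -> float:
    # Replicated pools store `replica_size` full copies of each object,
    # so usable efficiency is 1 / replica_size.
    return 1.0 / replica_size

def ec_efficiency(k: int, m: int) -> float:
    # Erasure coding stores k data chunks plus m coding chunks,
    # so usable efficiency is k / (k + m).
    return k / (k + m)

print(replicated_efficiency(3))  # ~0.333 for 3x replication
print(ec_efficiency(4, 2))       # ~0.667 for EC 4+2
```

Note how EC 4+2 tolerates two chunk losses, like 3x replication, at roughly double the space efficiency.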

3) Respect nearfull/full behavior

Most production teams avoid planning up to the full ratio. Nearfull (0.85 by default) is the practical limit: intervene there, before clients feel the impact of hitting full (0.95 by default, at which point Ceph blocks writes). The tool computes both a conservative (nearfull) and a hard-limit (full) estimate.
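Applying a fill ratio to raw capacity before converting to logical data might look like this (a sketch, assuming Ceph's default 0.85 nearfull and 0.95 full ratios; the example cluster and function name are hypothetical):

```python
def usable_at_ratio(raw_tb: float, efficiency: float, ratio: float) -> float:
    # Logical capacity if the cluster is filled only up to `ratio`
    # of its raw space, then converted through redundancy efficiency.
    return raw_tb * ratio * efficiency

raw = 12 * 8.0   # hypothetical: 12 OSDs x 8 TB = 96 TB raw
eff = 1 / 3      # 3x replication

print(usable_at_ratio(raw, eff, 0.85))  # conservative: plan to nearfull
print(usable_at_ratio(raw, eff, 0.95))  # hard limit: writes stop at full
```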

4) Keep failure-domain buffer

The biggest hidden risk is not leaving enough space to recover after a host/chassis failure. This calculator subtracts a buffer based on your largest failure domain (in OSDs) before reporting safe logical capacity.
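Putting the four steps together, a combined estimate might be computed as follows. This is a sketch of the approach, not the tool's exact formula; the default ratios and the 10% operational reserve are assumed example values:

```python
def safe_logical_tb(osd_count: int, tb_per_osd: float, efficiency: float,
                    largest_domain_osds: int, nearfull_ratio: float = 0.85,
                    operational_reserve: float = 0.10) -> float:
    # 1) Raw capacity.
    raw = osd_count * tb_per_osd
    # 2) Reserve enough raw space to re-protect everything that lived
    #    on the largest failure domain (e.g. one full host).
    raw_after_failure = raw - largest_domain_osds * tb_per_osd
    # 3) Plan only up to nearfull, then keep an extra operational
    #    reserve for imbalance, snapshots, and day-2 growth.
    planned_raw = raw_after_failure * nearfull_ratio * (1 - operational_reserve)
    # 4) Convert raw to logical via redundancy efficiency.
    return planned_raw * efficiency

# Hypothetical: 12x8TB OSDs, 3x replication, one 4-OSD host as the
# largest failure domain.
print(safe_logical_tb(12, 8.0, 1 / 3, 4))
```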

Input guide

  • Largest failure domain (OSDs): Count of OSDs that can fail together, typically a full node.
  • Operational reserve: Extra margin for fragmentation, skewed pools, snapshots, and temporary growth.
  • Current logical data: User-visible data, not replicated raw usage.
  • Growth rate: Helps estimate runway until safe threshold is reached.
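The runway estimate from the last two inputs reduces to simple division under a linear-growth assumption (the function name is illustrative; real ingest is rarely perfectly linear):

```python
def months_of_runway(current_tb: float, safe_tb: float,
                     monthly_growth_tb: float) -> float:
    # Months until logical data reaches the safe threshold,
    # assuming constant linear growth.
    if monthly_growth_tb <= 0:
        return float("inf")
    return max(0.0, (safe_tb - current_tb) / monthly_growth_tb)

# Hypothetical: 40 TB stored, 100 TB safe capacity, growing 5 TB/month.
print(months_of_runway(40.0, 100.0, 5.0))  # 12.0
```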

Quick planning tips

  • Do capacity reviews monthly, not quarterly.
  • Track both raw utilization and logical growth.
  • Keep room for rebalance after at least one host failure.
  • Avoid last-minute expansions during high ingest windows.
  • Set alerts before nearfull, not at nearfull.

Replication vs erasure coding

Replication is simpler and often better for latency-sensitive block workloads. Erasure coding gives better space efficiency and is common for object-heavy workloads, but has recovery and performance tradeoffs depending on profile and hardware.

If you run mixed workloads, model each major pool separately and sum the risk, rather than relying on one cluster-wide average.
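Per-pool modeling can be as simple as summing each pool's raw share times its efficiency (the pool names and sizes below are hypothetical examples, not recommendations):

```python
def total_usable_tb(pools: list[tuple[float, float]]) -> float:
    # pools: (raw TB allocated, redundancy efficiency) per major pool.
    return sum(raw * eff for raw, eff in pools)

pools = [
    (60.0, 1 / 3),  # hypothetical RBD pool, 3x replication
    (36.0, 4 / 6),  # hypothetical RGW pool, EC 4+2
]
print(total_usable_tb(pools))
```

Summing per pool exposes the skew a single cluster-wide average hides: here the EC pool yields more usable space from less raw capacity.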

Final note

This is a planning estimator, not a replacement for live Ceph telemetry. Always validate with current pool mix, CRUSH rules, PG distribution, and real nearfull/full settings in your environment.
