Cohen's Kappa Calculator (2×2 Table)
Enter counts from a 2-rater, 2-category confusion matrix. This tool computes observed agreement, expected agreement by chance, and Cohen's kappa coefficient.
| | Rater B: Positive | Rater B: Negative |
|---|---|---|
| Rater A: Positive | a | b |
| Rater A: Negative | c | d |
What is the kappa coefficient?
The kappa coefficient (usually Cohen's kappa, symbolized as κ) measures agreement between two raters while correcting for agreement that could happen purely by chance. It is commonly used in research, medicine, machine learning evaluation, and quality assurance when two people (or systems) classify the same items into categories.
Unlike raw percent agreement, kappa answers a better question: How much better is this agreement than random guessing?
Why percent agreement is not enough
Suppose two raters agree 90% of the time. That sounds excellent, but if one class is very common, both raters might achieve high agreement simply by always choosing that class. Kappa adjusts for this by estimating expected agreement from marginal totals.
- Percent agreement can be inflated by imbalanced categories.
- Kappa penalizes agreement that is likely due to chance.
- Interpretation is more meaningful when prevalence is skewed.
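As a concrete illustration (hypothetical counts): suppose a = 90, b = 5, c = 5, d = 0 out of N = 100. Then Po = 0.90, but because both raters call 95 of 100 cases positive, Pe = (95 × 95 + 5 × 5) / 100² = 0.905, so κ = (0.90 − 0.905) / (1 − 0.905) ≈ −0.05. Ninety percent raw agreement, yet kappa is slightly worse than chance.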
Formula used in this calculator
κ = (Po - Pe) / (1 - Pe)
Where:
- Po = observed agreement = (a + d) / N
- Pe = expected agreement by chance
- N = a + b + c + d
For a 2×2 table, expected agreement is calculated from the row and column totals:
Pe = [((a+b)(a+c)) + ((c+d)(b+d))] / N²
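The formulas above can be sketched in a few lines of Python. This is a minimal, illustrative implementation (the function name and the degenerate-case handling are my choices, not part of the calculator):

```python
def cohens_kappa(a, b, c, d):
    """Cohen's kappa for a 2x2 table.

    a: both raters Positive, b: A Positive / B Negative,
    c: A Negative / B Positive, d: both raters Negative.
    """
    n = a + b + c + d
    if n == 0:
        raise ValueError("table is empty")
    po = (a + d) / n  # observed agreement
    # expected agreement by chance, from row and column (marginal) totals
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    if pe == 1:  # degenerate case: all mass in one category
        return 1.0 if po == 1 else 0.0
    return (po - pe) / (1 - pe)
```

For the example table used later on this page (a = 50, b = 10, c = 5, d = 35), `cohens_kappa(50, 10, 5, 35)` returns roughly 0.694.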
How to use this kappa coefficient calculator
Step 1: Build your 2×2 table
Enter the number of items in each cell:
- a: both raters marked Positive
- b: rater A Positive, rater B Negative
- c: rater A Negative, rater B Positive
- d: both raters marked Negative
Step 2: Click “Calculate Kappa”
The tool returns total observations, observed agreement, expected agreement, kappa value, and an interpretation label.
Step 3: Interpret with context
Statistical interpretation guides are useful, but domain context matters. In high-stakes settings (e.g., diagnostics), even moderate kappa may be insufficient depending on consequences.
Common interpretation scale (Landis & Koch)
- < 0.00: Less than chance agreement
- 0.00–0.20: Slight agreement
- 0.21–0.40: Fair agreement
- 0.41–0.60: Moderate agreement
- 0.61–0.80: Substantial agreement
- 0.81–1.00: Almost perfect agreement
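The scale above is easy to encode. A sketch of the lookup this calculator's interpretation label could use (function name is mine):

```python
def kappa_label(kappa):
    """Map a kappa value to the Landis & Koch descriptive label."""
    if kappa < 0.0:
        return "Less than chance agreement"
    bands = [
        (0.20, "Slight agreement"),
        (0.40, "Fair agreement"),
        (0.60, "Moderate agreement"),
        (0.80, "Substantial agreement"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "Almost perfect agreement"
```

Note that the labels are conventions, not thresholds with statistical meaning; the boundaries are arbitrary cut points.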
Practical notes and limitations
1) Prevalence effect
When one category is very common, kappa may appear lower than expected even with high agreement.
2) Bias effect
If raters use categories differently (systematic bias), kappa can decrease even when raw agreement is high.
3) Not for all data types
This calculator is for two raters and two categories. For ordered categories, weighted kappa is often better. For more than two raters, consider Fleiss' kappa.
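Unweighted Cohen's kappa does extend naturally to more than two unordered categories, as long as there are still exactly two raters. A sketch that works directly from two parallel lists of labels rather than a 2×2 table (names are mine; for weighted kappa or Fleiss' kappa you would reach for a statistics library instead):

```python
from collections import Counter

def cohen_kappa_from_labels(ratings_a, ratings_b):
    """Unweighted Cohen's kappa from two equal-length lists of labels.

    Handles any number of categories, but exactly two raters.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(ratings_a)
    po = sum(x == y for x, y in zip(ratings_a, ratings_b)) / n
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    # expected agreement: product of each category's marginal proportions
    pe = sum(count_a[c] * count_b[c] for c in count_a) / n**2
    if pe == 1:
        return 1.0 if po == 1 else 0.0
    return (po - pe) / (1 - pe)
```

On binary labels this reproduces the 2×2 result exactly, so it can double as a cross-check for the table-based formula.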
Example scenario
Imagine two clinicians reviewing 100 images for disease presence:
- Both positive (a): 50
- A positive / B negative (b): 10
- A negative / B positive (c): 5
- Both negative (d): 35
This gives high observed agreement and a kappa around 0.69, which is usually interpreted as substantial agreement.
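Working through the numbers: N = 100; Po = (50 + 35) / 100 = 0.85. The marginal totals are 60 and 40 for rater A, 55 and 45 for rater B, so Pe = (60 × 55 + 40 × 45) / 100² = 0.51. Then κ = (0.85 − 0.51) / (1 − 0.51) = 0.34 / 0.49 ≈ 0.694.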
Final takeaway
A kappa coefficient calculator is a quick way to move beyond simple percent agreement and evaluate inter-rater reliability more rigorously. Use κ together with raw agreement, confusion matrix context, and domain-specific thresholds to make better decisions.