Cohen's Kappa Calculator (2 Raters, 2 Categories)
Enter the frequency counts from a 2×2 agreement table. This calculator returns observed agreement, expected agreement by chance, and the kappa statistic (κ).
| Rater A \ Rater B | Positive | Negative |
|---|---|---|
| Positive | a | b |
| Negative | c | d |
Formula used: κ = (Po − Pe) / (1 − Pe), where Po = (a + d) / N is the observed agreement and Pe = [(a + b)(a + c) + (c + d)(b + d)] / N² is the agreement expected by chance from the marginal totals.
What is the kappa statistic?
The kappa statistic is a reliability measure that tells you how much two raters agree beyond what would be expected by chance alone. If two people classify items into categories (for example, "disease present" vs. "disease absent"), kappa helps answer whether that agreement is truly meaningful.
Percent agreement alone can be misleading. Two raters can agree often simply because one category is very common. Kappa corrects for this by subtracting expected chance agreement, making it a more informative metric in many research, clinical, and quality assurance settings.
How to use this kappa statistic calculator
Step 1: Build your 2×2 table
Count how many cases fall into each cell:
- a: both raters said Positive
- b: rater A Positive, rater B Negative
- c: rater A Negative, rater B Positive
- d: both raters said Negative
Step 2: Enter values and calculate
Input the four counts in the calculator and click Calculate Kappa. The tool returns:
- Total sample size (N)
- Observed agreement (Po)
- Expected agreement by chance (Pe)
- Cohen's kappa (κ) and interpretation
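The computation behind these outputs is short enough to sketch directly. Here is a minimal Python version; the function name `cohens_kappa` is illustrative, not part of the calculator itself:

```python
def cohens_kappa(a, b, c, d):
    """Cohen's kappa for a 2x2 agreement table.

    a: both raters Positive      b: A Positive, B Negative
    c: A Negative, B Positive    d: both raters Negative
    """
    n = a + b + c + d
    po = (a + d) / n                                     # observed agreement
    # chance agreement from the marginal proportions of each rater
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    return (po - pe) / (1 - pe)
```

Calling `cohens_kappa(45, 5, 10, 40)` with the example counts further down this page returns 0.70.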
How to interpret kappa values
A commonly used interpretation (Landis and Koch) is:
- < 0.00: Poor agreement
- 0.00–0.20: Slight agreement
- 0.21–0.40: Fair agreement
- 0.41–0.60: Moderate agreement
- 0.61–0.80: Substantial agreement
- 0.81–1.00: Almost perfect agreement
These cutoffs are guidelines, not absolute rules. In high-stakes domains (e.g., medical diagnosis), even moderate kappa may be insufficient, while in exploratory contexts it might be acceptable.
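If you want to attach these labels programmatically, the benchmark bands above translate into a simple lookup. This is a sketch of the Landis and Koch bands as listed here, nothing more:

```python
def interpret_kappa(kappa):
    """Map a kappa value to the Landis & Koch descriptive label."""
    if kappa < 0.0:
        return "Poor agreement"
    if kappa <= 0.20:
        return "Slight agreement"
    if kappa <= 0.40:
        return "Fair agreement"
    if kappa <= 0.60:
        return "Moderate agreement"
    if kappa <= 0.80:
        return "Substantial agreement"
    return "Almost perfect agreement"
```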
Example calculation
Suppose two reviewers rated 100 abstracts for inclusion in a systematic review:
- a = 45
- b = 5
- c = 10
- d = 40
Observed agreement is Po = (45 + 40)/100 = 0.85. Expected chance agreement, computed from the marginal totals, is Pe = (50 × 55 + 50 × 45)/100² = 0.50, so κ = (0.85 − 0.50)/(1 − 0.50) = 0.70. This is exactly why kappa is useful: it gives a more conservative and realistic picture of inter-rater reliability than percent agreement alone.
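The example can be reproduced step by step in plain Python, using the counts from the table above:

```python
# Worked example: two reviewers screening 100 abstracts
a, b, c, d = 45, 5, 10, 40
n = a + b + c + d                                      # total cases, N = 100
po = (a + d) / n                                       # observed agreement
pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2    # chance agreement
kappa = (po - pe) / (1 - pe)
print(round(po, 2), round(pe, 2), round(kappa, 2))     # 0.85 0.5 0.7
```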
When kappa is useful (and when to be cautious)
Great use cases
- Two clinicians independently classifying patient outcomes
- Two annotators labeling text or images
- Two evaluators scoring pass/fail decisions
Important limitations
- Prevalence effect: If one category is very rare, kappa can be low even with high percent agreement.
- Bias effect: Systematic differences in rater tendencies (e.g., one rater is consistently stricter than the other) can distort kappa.
- Not for continuous data: Use intraclass correlation or correlation metrics for continuous measurements.
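The prevalence effect is easiest to see with numbers. The two tables below (hypothetical counts, chosen only for illustration) both have 90% observed agreement, yet their kappas differ drastically because one has a very rare Negative category:

```python
def cohens_kappa(a, b, c, d):
    """Cohen's kappa for a 2x2 agreement table."""
    n = a + b + c + d
    po = (a + d) / n
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    return (po - pe) / (1 - pe)

# Same 90% observed agreement in both tables:
balanced = cohens_kappa(45, 5, 5, 45)  # balanced categories -> kappa = 0.80
skewed = cohens_kappa(90, 5, 5, 0)     # rare Negative category -> kappa < 0
```

In the skewed table, chance agreement is so high (Pe ≈ 0.91) that 90% observed agreement is actually slightly worse than chance, and kappa comes out negative.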
Best practices for reporting kappa
When publishing or presenting results, report more than just κ:
- The full contingency table (a, b, c, d)
- Sample size (N)
- Observed agreement and kappa
- Context-specific interpretation
Transparent reporting helps your audience judge reliability in practical terms, not only statistical labels.
Final thoughts
This kappa statistic calculator is a fast way to estimate agreement quality between two raters in binary classification tasks. Use it as a decision-support tool, and always interpret values in the context of your domain, category prevalence, and consequences of disagreement.