F1 Score Calculator
Calculate F1 score from either confusion matrix values (TP, FP, FN) or from precision and recall directly.
Method 1: From TP, FP, FN
Method 2: From Precision and Recall
Tip: F1 score ranges from 0 to 1. Higher is better.
What is the F1 score?
The F1 score is a single machine learning metric that balances precision and recall. It is especially useful when class distribution is imbalanced (for example, fraud detection, medical diagnosis, spam filtering, or rare event prediction).
Accuracy can look great even when a model misses most positive cases. The F1 score addresses this by combining the answers to two questions:
- Precision: Out of all predicted positives, how many were correct?
- Recall: Out of all actual positives, how many did we find?
F1 score formula
F1 is the harmonic mean of precision and recall:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Because it is a harmonic mean, F1 penalizes extreme imbalance. If precision is high but recall is very low (or vice versa), the F1 score drops.
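The formulas above translate directly into a few lines of Python. This is a minimal sketch (the function name `f1_from_counts` is just for illustration), with guards for the zero-denominator edge cases:

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Compute the F1 score from raw confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# Extreme imbalance is penalized: precision 1.0 but recall 0.02 gives a low F1
print(round(f1_from_counts(tp=1, fp=0, fn=49), 4))  # 0.0392
```

Note how the harmonic mean behaves: even with perfect precision, a recall of 0.02 drags F1 down to about 0.04, far below the arithmetic mean of 0.51.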
Worked example
Suppose your classifier results are:
- TP = 42
- FP = 8
- FN = 10
Then:
- Precision = 42 / (42 + 8) = 0.84
- Recall = 42 / (42 + 10) ≈ 0.8077
- F1 = 2 × (0.84 × 0.8077) / (0.84 + 0.8077) ≈ 0.8235
So the model's F1 score is 0.8235 (or 82.35%).
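The arithmetic above can be checked in a few lines of Python:

```python
tp, fp, fn = 42, 8, 10

precision = tp / (tp + fp)  # 42 / 50 = 0.84
recall = tp / (tp + fn)     # 42 / 52 ≈ 0.8077
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 4), round(recall, 4), round(f1, 4))  # 0.84 0.8077 0.8235
```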
When to use F1 score vs other metrics
| Metric | Best For | Main Limitation |
|---|---|---|
| Accuracy | Balanced classes and equal error costs | Can be misleading with imbalanced data |
| Precision | When false positives are expensive | Ignores missed positives (FN) |
| Recall | When false negatives are expensive | Ignores false alarms (FP) |
| F1 Score | Need balance between precision and recall | Does not include true negatives directly |
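To see concretely why accuracy can mislead on imbalanced data, consider a degenerate classifier that predicts the negative class for every example (a made-up illustration; the `2·TP / (2·TP + FP + FN)` form used below is algebraically equivalent to the harmonic-mean formula):

```python
# 1000 samples, only 20 positives; the model predicts "negative" every time
y_true = [1] * 20 + [0] * 980
y_pred = [0] * 1000

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0

print(accuracy)  # 0.98 -- looks great
print(f1)        # 0.0  -- reveals the model finds no positives at all
```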
Binary, macro, micro, and weighted F1
Binary classification
Standard F1 is straightforward in binary tasks: one positive class, one negative class.
Multiclass classification
For multiclass problems, you typically compute:
- Macro F1: Average of per-class F1 scores (treats each class equally).
- Micro F1: Global TP/FP/FN across all classes (weights by total instances).
- Weighted F1: Per-class F1 weighted by class support.
If minority classes matter a lot, macro F1 is often more informative than accuracy or micro F1, because it does not let the majority classes dominate the average.
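The three averaging schemes can be made concrete with a small pure-Python sketch (scikit-learn's `f1_score` with `average="macro"`, `"micro"`, or `"weighted"` computes the same quantities; the toy labels here are made up):

```python
from collections import Counter

def per_class_f1(y_true, y_pred, cls):
    """One-vs-rest F1 for a single class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0

def multiclass_f1(y_true, y_pred):
    classes = sorted(set(y_true))
    support = Counter(y_true)
    f1s = {c: per_class_f1(y_true, y_pred, c) for c in classes}

    # Macro: unweighted mean of per-class F1 scores
    macro = sum(f1s.values()) / len(classes)

    # Micro: pool TP/FP/FN across all classes, then compute one F1
    tp = sum(sum(t == c and p == c for t, p in zip(y_true, y_pred)) for c in classes)
    fp = sum(sum(t != c and p == c for t, p in zip(y_true, y_pred)) for c in classes)
    fn = sum(sum(t == c and p != c for t, p in zip(y_true, y_pred)) for c in classes)
    micro = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0

    # Weighted: per-class F1 weighted by class support
    weighted = sum(f1s[c] * support[c] for c in classes) / len(y_true)
    return macro, micro, weighted

# Toy 3-class example; class 2 is a rare minority class the model never predicts
y_true = [0, 0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]
macro, micro, weighted = multiclass_f1(y_true, y_pred)
print(round(macro, 4), round(micro, 4), round(weighted, 4))  # 0.4444 0.625 0.5833
```

The missed minority class drags macro F1 down hard, while micro and weighted F1 stay higher. (In single-label multiclass problems, micro F1 equals plain accuracy.)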
Common mistakes when interpreting F1
- Comparing F1 scores from different datasets without context.
- Using only F1 when business cost of FP and FN is very asymmetric.
- Ignoring threshold tuning; F1 changes when classification threshold changes.
- Assuming a high F1 always means high precision and high recall (one can still be moderate).
How to improve your model's F1 score
- Optimize decision threshold on validation data for maximum F1.
- Use class weights or focal loss for imbalanced classification.
- Collect better labeled data, especially for minority classes.
- Engineer features that improve separation for positives.
- Track precision-recall curves, not just ROC-AUC.
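The first tip, threshold tuning, can be sketched as a simple sweep over candidate thresholds on held-out scores. The function name and validation scores below are made up for illustration:

```python
def best_f1_threshold(scores, labels, steps=101):
    """Sweep thresholds in [0, 1] and return (best_threshold, best_f1)."""
    best_t, best_f1 = 0.5, -1.0
    for i in range(steps):
        t = i / (steps - 1)
        preds = [s >= t for s in scores]
        tp = sum(p and y for p, y in zip(preds, labels))
        fp = sum(p and not y for p, y in zip(preds, labels))
        fn = sum((not p) and y for p, y in zip(preds, labels))
        f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

# Made-up validation scores: positives tend to score higher
scores = [0.95, 0.80, 0.65, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1,    1,    1,    0,    1,    0,    0,    0]
t, f1 = best_f1_threshold(scores, labels)
```

Here the F1-optimal threshold lands well below the default 0.5, because lowering it recovers the positive scored at 0.40. Run the sweep on a validation split, not the training data, to avoid overfitting the threshold.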
Final thoughts
If your project depends on correctly identifying positive cases while limiting false alarms, F1 score is one of the most practical model evaluation metrics. Use the calculator above to quickly compute F1 from raw confusion matrix counts or from precision and recall.