F1 Score Calculator
Calculate F1 score from either confusion matrix values (TP, FP, FN) or from precision and recall directly.
Method 1: From TP, FP, FN
Method 2: From Precision and Recall
Tip: F1 score ranges from 0 to 1. Higher is better.
What is the F1 score?
The F1 score is a single machine learning metric that balances precision and recall. It is especially useful when class distribution is imbalanced (for example, fraud detection, medical diagnosis, spam filtering, or rare event prediction).
Accuracy can look great even when a model misses most positive cases. The F1 score addresses this by combining the answers to two questions:
- Precision: Out of all predicted positives, how many were correct?
- Recall: Out of all actual positives, how many did we find?
F1 score formula
F1 is the harmonic mean of precision and recall:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Because it is a harmonic mean, F1 penalizes extreme imbalance. If precision is high but recall is very low (or vice versa), the F1 score drops.
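The formulas above translate directly into a few lines of Python. This is a minimal sketch (the function name `f1_from_counts` is just for illustration), with guards for the zero-denominator edge cases:

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Compute the F1 score from raw confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# Extreme imbalance is penalized: precision 1.0 but recall 0.02 gives a low F1
print(round(f1_from_counts(tp=1, fp=0, fn=49), 4))  # 0.0392
```

Note how the harmonic mean behaves: even with perfect precision, a recall of 0.02 drags F1 down to about 0.04, far below the arithmetic mean of 0.51.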
Worked example
Suppose your classifier results are:
- TP = 42
- FP = 8
- FN = 10
Then:
- Precision = 42 / (42 + 8) = 0.84
- Recall = 42 / (42 + 10) ≈ 0.8077
- F1 = 2 × (0.84 × 0.8077) / (0.84 + 0.8077) ≈ 0.8235
So the model's F1 score is 0.8235 (or 82.35%).
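The arithmetic above can be checked in a few lines of Python:

```python
tp, fp, fn = 42, 8, 10

precision = tp / (tp + fp)  # 42 / 50 = 0.84
recall = tp / (tp + fn)     # 42 / 52 ≈ 0.8077
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 4), round(recall, 4), round(f1, 4))  # 0.84 0.8077 0.8235
```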
When to use F1 score vs other metrics
| Metric | Best For | Main Limitation |
|---|---|---|
| Accuracy | Balanced classes and equal error costs | Can be misleading with imbalanced data |
| Precision | When false positives are expensive | Ignores missed positives (FN) |
| Recall | When false negatives are expensive | Ignores false alarms (FP) |
| F1 Score | Need balance between precision and recall | Does not include true negatives directly |
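To see concretely why accuracy can mislead on imbalanced data, consider a degenerate classifier that predicts the negative class for every example (a made-up illustration; the `2·TP / (2·TP + FP + FN)` form used below is algebraically equivalent to the harmonic-mean formula):

```python
# 1000 samples, only 20 positives; the model predicts "negative" every time
y_true = [1] * 20 + [0] * 980
y_pred = [0] * 1000

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0

print(accuracy)  # 0.98 -- looks great
print(f1)        # 0.0  -- reveals the model finds no positives at all
```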
Binary, macro, micro, and weighted F1
Binary classification
Standard F1 is straightforward in binary tasks: one positive class, one negative class.
Multiclass classification
For multiclass problems, you typically compute:
- Macro F1: Average of per-class F1 scores (treats each class equally).
- Micro F1: Global TP/FP/FN across all classes (weights by total instances).
- Weighted F1: Per-class F1 weighted by class support.
If minority classes matter a lot, macro F1 is often more informative than accuracy or micro F1, because it does not let the majority classes dominate the average.
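The three averaging schemes can be made concrete with a small pure-Python sketch (scikit-learn's `f1_score` with `average="macro"`, `"micro"`, or `"weighted"` computes the same quantities; the toy labels here are made up):

```python
from collections import Counter

def per_class_f1(y_true, y_pred, cls):
    """One-vs-rest F1 for a single class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0

def multiclass_f1(y_true, y_pred):
    classes = sorted(set(y_true))
    support = Counter(y_true)
    f1s = {c: per_class_f1(y_true, y_pred, c) for c in classes}

    # Macro: unweighted mean of per-class F1 scores
    macro = sum(f1s.values()) / len(classes)

    # Micro: pool TP/FP/FN across all classes, then compute one F1
    tp = sum(sum(t == c and p == c for t, p in zip(y_true, y_pred)) for c in classes)
    fp = sum(sum(t != c and p == c for t, p in zip(y_true, y_pred)) for c in classes)
    fn = sum(sum(t == c and p != c for t, p in zip(y_true, y_pred)) for c in classes)
    micro = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0

    # Weighted: per-class F1 weighted by class support
    weighted = sum(f1s[c] * support[c] for c in classes) / len(y_true)
    return macro, micro, weighted

# Toy 3-class example; class 2 is a rare minority class the model never predicts
y_true = [0, 0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]
macro, micro, weighted = multiclass_f1(y_true, y_pred)
print(round(macro, 4), round(micro, 4), round(weighted, 4))  # 0.4444 0.625 0.5833
```

The missed minority class drags macro F1 down hard, while micro and weighted F1 stay higher. (In single-label multiclass problems, micro F1 equals plain accuracy.)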
Common mistakes when interpreting F1
- Comparing F1 scores from different datasets without context.
- Using only F1 when business cost of FP and FN is very asymmetric.
- Ignoring threshold tuning; F1 changes when classification threshold changes.
- Assuming a high F1 always means high precision and high recall (one can still be moderate).
How to improve your model's F1 score
- Optimize decision threshold on validation data for maximum F1.
- Use class weights or focal loss for imbalanced classification.
- Collect better labeled data, especially for minority classes.
- Engineer features that improve separation for positives.
- Track precision-recall curves, not just ROC-AUC.
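The first tip, threshold tuning, can be sketched as a simple sweep over candidate thresholds on held-out scores. The function name and validation scores below are made up for illustration:

```python
def best_f1_threshold(scores, labels, steps=101):
    """Sweep thresholds in [0, 1] and return (best_threshold, best_f1)."""
    best_t, best_f1 = 0.5, -1.0
    for i in range(steps):
        t = i / (steps - 1)
        preds = [s >= t for s in scores]
        tp = sum(p and y for p, y in zip(preds, labels))
        fp = sum(p and not y for p, y in zip(preds, labels))
        fn = sum((not p) and y for p, y in zip(preds, labels))
        f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

# Made-up validation scores: positives tend to score higher
scores = [0.95, 0.80, 0.65, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1,    1,    1,    0,    1,    0,    0,    0]
t, f1 = best_f1_threshold(scores, labels)
```

Here the F1-optimal threshold lands well below the default 0.5, because lowering it recovers the positive scored at 0.40. Run the sweep on a validation split, not the training data, to avoid overfitting the threshold.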
Final thoughts
If your project depends on correctly identifying positive cases while limiting false alarms, F1 score is one of the most practical model evaluation metrics. Use the calculator above to quickly compute F1 from raw confusion matrix counts or from precision and recall.