Calculate accuracy, precision, recall, F1 score, and other classification metrics from a confusion matrix for machine learning models.
Last updated: March 2026
| Metric | Poor | Fair | Good | Excellent |
|---|---|---|---|---|
| Accuracy | <60% | 60–75% | 75–90% | >90% |
| Precision | <50% | 50–70% | 70–85% | >85% |
| Recall | <50% | 50–70% | 70–85% | >85% |
| F1 Score | <0.50 | 0.50–0.70 | 0.70–0.85 | >0.85 |
Note: "Excellent" ranges depend heavily on problem context. Medical diagnosis may require >99% recall, while spam detection prioritizes precision.
Classification accuracy is the most basic metric for evaluating machine learning classification models. It measures the proportion of correct predictions (both true positives and true negatives) out of all predictions made. The formula is: Accuracy = (TP + TN) / (TP + TN + FP + FN).
While accuracy is intuitive and easy to understand, it can be misleading with imbalanced datasets. For example, if 95% of your data is negative, a model that always predicts "negative" would achieve 95% accuracy despite being useless. This is why complementary metrics like precision, recall, and F1 score are crucial.
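The imbalance problem described above can be reproduced in a few lines. This is a minimal sketch on a made-up dataset (950 negatives, 50 positives); the labels and the always-negative "model" are illustrative assumptions, not data from this page.

```python
# Hypothetical imbalanced dataset: 950 negatives (0) and 50 positives (1).
y_true = [0] * 950 + [1] * 50
# A "model" that always predicts the negative class.
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: share of actual positives that were predicted positive.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(t == 1 for t in y_true)
recall = true_positives / actual_positives

print(f"accuracy = {accuracy:.2%}")  # 95.00%
print(f"recall   = {recall:.2%}")    # 0.00%
```

Accuracy looks strong while recall shows that not a single positive case was found.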
A confusion matrix is a table that visualizes a classifier's performance by showing the counts of:
- True Positives (TP): correctly predicted positive cases
- True Negatives (TN): correctly predicted negative cases
- False Positives (FP): incorrectly predicted as positive (Type I error)
- False Negatives (FN): incorrectly predicted as negative (Type II error)
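The four counts can be tallied directly from paired true and predicted labels. The sketch below assumes binary labels with `1` as the positive class; the function name `confusion_counts` and the sample labels are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, tn, fp, fn) for binary labels."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative (Type I error)
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive (Type II error)
            else:
                tn += 1  # predicted negative, actually negative
    return tp, tn, fp, fn


# Made-up labels for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
print(confusion_counts(y_true, y_pred))  # (3, 3, 1, 1)
```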
Accuracy: The overall correctness of the model, (TP + TN) / (TP + TN + FP + FN). Use when classes are balanced and all errors cost equally.
Precision: Of all positive predictions, how many were correct? Computed as TP / (TP + FP). Important when false positives are costly (e.g., spam detection, medical diagnosis).
Recall: Of all actual positive cases, how many did we catch? Computed as TP / (TP + FN). Important when false negatives are costly (e.g., disease screening, fraud detection).
F1 Score: The harmonic mean of precision and recall, 2 × Precision × Recall / (Precision + Recall). Useful when you need to balance both metrics and have imbalanced classes. Ranges from 0 (worst) to 1 (perfect).
Specificity: Of all actual negative cases, how many were correctly identified? Computed as TN / (TN + FP). Important in medical screening to avoid unnecessary treatments.
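The five metrics above can be computed together from the four confusion-matrix counts. This is a minimal sketch; the function name `classification_metrics` is hypothetical, and returning `None` when a denominator is zero is an assumption made here (other tools may report 0 or raise an error instead).

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, F1, and specificity.

    A metric whose denominator is zero is returned as None (an assumption
    of this sketch; the metric is undefined in that case).
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        # 2*TP / (2*TP + FP + FN) is algebraically the harmonic mean
        # of precision and recall.
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "specificity": ratio(tn, tn + fp),
    }


# Made-up counts for illustration.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```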
Evaluating a spam email classifier on 1,000 emails (the figures below imply TP = 120, FP = 40, FN = 20, TN = 820):
Overall accuracy (94%) is quite good, but let's look deeper:
Precision (75%): When the classifier says "spam," it's right 75% of the time. This means 25% of emails in your spam folder are actually legitimate (40 out of 160 marked as spam).
Recall (85.71%): The classifier catches 85.71% of actual spam emails. However, 14.29% of spam (20 emails) still reach your inbox.
Trade-off: You could tune the classifier to catch more spam (higher recall), but that would likely send more legitimate emails to spam (lower precision). The F1 score (0.80) balances these two metrics.
Recommendation: For spam filtering, false positives (legitimate emails marked as spam) are generally worse than false negatives (spam getting through), so you might prefer higher precision even if it means slightly lower recall.
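The figures in the spam example can be re-derived from its counts. The counts below are inferred from the text (40 of 160 flagged emails are legitimate, 20 spam emails are missed, 1,000 emails in total), so treat them as a reconstruction rather than the page's original input.

```python
# Counts inferred from the worked example (a reconstruction).
total = 1000
tp, fp, fn = 120, 40, 20
tn = total - (tp + fp + fn)  # 820

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
specificity = tn / (tn + fp)

print(f"accuracy    = {accuracy:.2%}")     # 94.00%
print(f"precision   = {precision:.2%}")    # 75.00%
print(f"recall      = {recall:.2%}")       # 85.71%
print(f"F1          = {f1:.4f}")           # 0.8000
print(f"specificity = {specificity:.2%}")  # 95.35%
```

Computing F1 from the unrounded counts gives 0.8; rounding recall to 85.71% before the F1 step yields a value slightly below that.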
Accuracy fails with imbalanced datasets. If 99% of cases are negative, a model that always predicts 'negative' reaches 99% accuracy despite being useless. Use precision, recall, F1, or AUC-ROC instead.
Precision asks: 'Of what I predicted as positive, how many were correct?' Recall asks: 'Of all actual positives, how many did I find?' High precision minimizes false alarms; high recall minimizes misses.
Use F1 when you need to balance precision and recall, especially with imbalanced classes. It's the harmonic mean, so it's closer to the lower value—both precision and recall must be high for a high F1.
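To see how the harmonic mean pulls toward the lower value, the sketch below compares F1 with the plain arithmetic mean for a few made-up precision/recall pairs. Returning 0.0 when both inputs are zero is a convention assumed here.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative (precision, recall) pairs, not real measurements.
for p, r in [(0.9, 0.9), (0.9, 0.5), (0.9, 0.1)]:
    arithmetic = (p + r) / 2
    print(f"P={p:.1f} R={r:.1f}  mean={arithmetic:.2f}  F1={f1_score(p, r):.2f}")
# P=0.9 R=0.9  mean=0.90  F1=0.90
# P=0.9 R=0.5  mean=0.70  F1=0.64
# P=0.9 R=0.1  mean=0.50  F1=0.18
```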
What counts as good accuracy depends on the problem and the baseline. For balanced datasets, 90%+ is often good. But compare against a baseline: if random guessing gives 50%, then 70% might be excellent. Context matters more than absolute numbers.
By adjusting your classifier's threshold, you can increase recall (catch more positives) at the cost of precision (more false alarms), or vice versa. Plot a precision-recall curve to visualize this trade-off.
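The threshold trade-off can be illustrated by sweeping a cutoff over predicted scores. The scores and labels below are invented for the example, and treating precision as 1.0 when nothing is predicted positive is a convention assumed in this sketch.

```python
# Hypothetical predicted probabilities and true labels (1 = positive).
scores = [0.95, 0.85, 0.70, 0.60, 0.45, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]


def precision_recall_at(threshold, scores, labels):
    """Precision and recall when scores >= threshold are predicted positive."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(predicted, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(predicted, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(predicted, labels))
    # Assumed conventions for empty denominators.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


for threshold in (0.8, 0.5, 0.15):
    p, r = precision_recall_at(threshold, scores, labels)
    print(f"threshold={threshold:.2f}  precision={p:.3f}  recall={r:.3f}")
# threshold=0.80  precision=1.000  recall=0.400
# threshold=0.50  precision=0.750  recall=0.600
# threshold=0.15  precision=0.625  recall=1.000
```

Lowering the threshold raises recall and lowers precision in this toy data, which is the pattern a precision-recall curve makes visible.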
Accuracy can be used for multi-class problems, but also look at per-class metrics. A model might be accurate overall but terrible at rare classes. Use a confusion matrix for all classes, and compute precision/recall/F1 for each class individually.
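Per-class metrics can be computed by treating each class in turn as "positive" and everything else as "negative" (one-vs-rest). This is a minimal sketch with made-up labels; the function name `per_class_metrics` is hypothetical, and reporting 0.0 for an empty denominator is an assumed convention.

```python
def per_class_metrics(y_true, y_pred):
    """One-vs-rest precision, recall, and F1 for every class."""
    results = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        results[c] = {
            # 0.0 for an empty denominator is an assumed convention.
            "precision": tp / (tp + fp) if (tp + fp) else 0.0,
            "recall": tp / (tp + fn) if (tp + fn) else 0.0,
            "f1": 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0,
        }
    return results


# Made-up data: the rare class "bird" is never predicted correctly.
y_true = ["cat"] * 8 + ["dog"] * 8 + ["bird"] * 2
y_pred = ["cat"] * 8 + ["dog"] * 8 + ["cat", "dog"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"overall accuracy = {accuracy:.2%}")  # 88.89%
for label, m in per_class_metrics(y_true, y_pred).items():
    print(label, {k: round(v, 3) for k, v in m.items()})
```

Overall accuracy is close to 89%, yet recall for the rare class is zero.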
Sensitivity and recall are the same thing. Sensitivity, recall, true positive rate (TPR), and hit rate all refer to TP / (TP + FN). Different fields use different names for historical reasons.
To improve a model's performance, collect more or better data, try different algorithms, tune hyperparameters, engineer better features, handle class imbalance (oversampling, undersampling, class weights), or adjust decision thresholds.
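As one illustration of the class-weight option, the sketch below computes inverse-frequency weights as n_samples / (n_classes × class_count). This is one common heuristic among several, the function name `balanced_class_weights` is hypothetical, and the labels are made up.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


# Hypothetical imbalanced labels: 90 negatives, 10 positives.
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
for cls, weight in sorted(weights.items()):
    print(f"class {cls}: weight {weight:.3f}")
# class 0: weight 0.556
# class 1: weight 5.000
```

The rare class receives the larger weight, so errors on it count for more during training in any learner that accepts per-class weights.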