Classification Accuracy Calculator

Calculate accuracy, precision, recall, F1 score, and other classification metrics from a confusion matrix for machine learning models.

Last updated: March 2026

Confusion Matrix

                    Predicted Positive     Predicted Negative
Actual Positive     True Positive (TP)     False Negative (FN)
Actual Negative     False Positive (FP)    True Negative (TN)

Metric                       Value     Formula / note
Overall Accuracy             94.00%    1,000 total samples
Precision (PPV)              75.00%    TP / (TP + FP)
Recall (Sensitivity/TPR)     85.71%    TP / (TP + FN)
Specificity (TNR)            95.35%    TN / (TN + FP)
F1 Score                     0.8000    Harmonic mean
Error Rate                   6.00%     1 - Accuracy
Total Samples                1,000     TP + TN + FP + FN

Typical Model Performance Ranges

Metric       Poor      Fair          Good          Excellent
Accuracy     <60%      60–75%        75–90%        >90%
Precision    <50%      50–70%        70–85%        >85%
Recall       <50%      50–70%        70–85%        >85%
F1 Score     <0.50     0.50–0.70     0.70–0.85     >0.85

Note: "Excellent" ranges depend heavily on problem context. Medical diagnosis may require >99% recall, while spam detection prioritizes precision.

What is Classification Accuracy?

Classification accuracy is the most basic metric for evaluating machine learning classification models. It measures the proportion of correct predictions (both true positives and true negatives) out of all predictions made. The formula is: Accuracy = (TP + TN) / (TP + TN + FP + FN).

While accuracy is intuitive and easy to understand, it can be misleading with imbalanced datasets. For example, if 95% of your data is negative, a model that always predicts "negative" would achieve 95% accuracy despite being useless. This is why complementary metrics like precision, recall, and F1 score are crucial.
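The always-negative case can be checked with a few lines of Python. This is a minimal sketch on made-up data (1,000 labels, 95% negative); the variable names are illustrative and not part of the calculator.

```python
# Hypothetical data: 1,000 labels, 95% negative (0) and 5% positive (1).
y_true = [1] * 50 + [0] * 950

# A "model" that ignores its input and always predicts negative.
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: share of actual positives that were predicted positive.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
recall = tp / (tp + fn)

print(f"accuracy = {accuracy:.2%}")  # accuracy = 95.00%
print(f"recall   = {recall:.2%}")    # recall   = 0.00%
```

The accuracy looks strong, yet recall shows that the model never finds a single positive case.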

A confusion matrix is a table that visualizes a classifier's performance by showing the counts of:
True Positives (TP): correctly predicted positive cases
True Negatives (TN): correctly predicted negative cases
False Positives (FP): incorrectly predicted as positive (Type I error)
False Negatives (FN): incorrectly predicted as negative (Type II error)
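If you have raw labels rather than the four counts, the counts can be tallied directly. The sketch below assumes binary labels encoded as 1 (positive) and 0 (negative); the function name `confusion_counts` and the sample data are hypothetical.

```python
from collections import Counter

def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, tn, fp, fn) for two equal-length lists of binary labels."""
    counts = Counter()
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            # Predicted positive: correct if the actual label is positive too.
            counts["tp" if actual == positive else "fp"] += 1
        else:
            # Predicted negative: a miss if the actual label is positive.
            counts["fn" if actual == positive else "tn"] += 1
    return counts["tp"], counts["tn"], counts["fp"], counts["fn"]

# Made-up labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)  # 3 3 1 1
```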

Understanding the Metrics

Accuracy

The overall correctness of the model. Use when classes are balanced and all errors cost equally.

(TP + TN) / (TP + TN + FP + FN)

Precision (Positive Predictive Value)

Of all positive predictions, how many were correct? Important when false positives are costly (e.g., spam detection, or a medical diagnosis that would trigger invasive treatment).

TP / (TP + FP)

Recall (Sensitivity, True Positive Rate)

Of all actual positive cases, how many did we catch? Important when false negatives are costly (e.g., disease screening, fraud detection).

TP / (TP + FN)

F1 Score

The harmonic mean of precision and recall. Useful when you need to balance both metrics and have imbalanced classes. Ranges from 0 (worst) to 1 (perfect).

2 × (Precision × Recall) / (Precision + Recall)

Specificity (True Negative Rate)

Of all actual negative cases, how many were correctly identified? Important in medical screening to avoid unnecessary treatments.

TN / (TN + FP)
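The five formulas above can be collected into one helper. This is a sketch, not the calculator's actual implementation: the function name `classification_metrics` is hypothetical, and returning 0.0 when a denominator is zero is an assumption (some tools report such cases as undefined instead).

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the metrics described above from the four confusion matrix counts."""

    def ratio(numerator, denominator):
        # Assumption: report 0.0 when the denominator is zero.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }

# Made-up counts for illustration only.
metrics = classification_metrics(tp=45, tn=40, fp=10, fn=5)
for name, value in metrics.items():
    print(f"{name:12s} {value:.4f}")
```

For these counts the helper reports accuracy 0.8500, precision 0.8182, recall 0.9000, specificity 0.8000, and F1 0.8571.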

Worked Example

Evaluating a spam email classifier on 1,000 emails:

Confusion Matrix:
True Positives (TP): 120 spam emails correctly identified
True Negatives (TN): 820 legitimate emails correctly identified
False Positives (FP): 40 legitimate emails wrongly marked as spam
False Negatives (FN): 20 spam emails that got through
Calculations:
Accuracy = (120 + 820) / 1000 = 0.94 = 94.00%
Precision = 120 / (120 + 40) = 0.75 = 75.00%
Recall = 120 / (120 + 20) ≈ 0.8571 = 85.71%
Specificity = 820 / (820 + 40) ≈ 0.9535 = 95.35%
F1 Score = 2 × (0.75 × 0.8571) / (0.75 + 0.8571) ≈ 0.8000
Interpretation:

Overall accuracy (94%) is quite good, but let's look deeper:

Precision (75%): When the classifier says "spam," it's right 75% of the time. This means 25% of emails in your spam folder are actually legitimate (40 out of 160 marked as spam).

Recall (85.71%): The classifier catches 85.71% of actual spam emails. However, 14.29% of spam (20 emails) still reach your inbox.

Trade-off: You could tune the classifier to catch more spam (higher recall), but that would likely send more legitimate emails to spam (lower precision). The F1 score (0.80) balances these two metrics.

Recommendation: For spam filtering, false positives (legitimate emails marked as spam) are generally worse than false negatives (spam getting through), so you might prefer higher precision even if it means slightly lower recall.
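The spam example can be reproduced from its four counts. The sketch below also computes F1 directly from the counts, using the equivalent form 2TP / (2TP + FP + FN), which avoids rounding recall first and gives the 0.8000 shown in the calculator output. Variable names are illustrative.

```python
# Counts from the spam classifier example.
tp, tn, fp, fn = 120, 820, 40, 20

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)

# Two algebraically equivalent ways to get F1.
f1_from_rates = 2 * precision * recall / (precision + recall)
f1_from_counts = 2 * tp / (2 * tp + fp + fn)

print(f"accuracy    = {accuracy:.2%}")        # accuracy    = 94.00%
print(f"precision   = {precision:.2%}")       # precision   = 75.00%
print(f"recall      = {recall:.2%}")          # recall      = 85.71%
print(f"specificity = {specificity:.2%}")     # specificity = 95.35%
print(f"F1 (rates)  = {f1_from_rates:.4f}")   # F1 (rates)  = 0.8000
print(f"F1 (counts) = {f1_from_counts:.4f}")  # F1 (counts) = 0.8000
```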

Frequently Asked Questions

When is accuracy not a good metric?

Accuracy fails with imbalanced datasets. If 99% of cases are negative, a model that always predicts 'negative' gets 99% accuracy despite being useless. Use precision, recall, F1, or AUC-ROC instead.

What's the difference between precision and recall?

Precision asks: 'Of what I predicted as positive, how many were correct?' Recall asks: 'Of all actual positives, how many did I find?' High precision minimizes false alarms; high recall minimizes misses.

When should I optimize for F1 score?

Use F1 when you need to balance precision and recall, especially with imbalanced classes. It's the harmonic mean, so it sits closer to the lower of the two values: both precision and recall must be high for a high F1.
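To see how the harmonic mean is pulled toward the lower value, compare it with a plain average for a lopsided pair. The precision and recall values below are made up for illustration.

```python
# Hypothetical, deliberately lopsided scores.
precision = 0.9
recall = 0.1

arithmetic_mean = (precision + recall) / 2
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"arithmetic mean = {arithmetic_mean:.2f}")  # arithmetic mean = 0.50
print(f"F1              = {f1:.2f}")               # F1              = 0.18
```

A plain average would rate this model as middling, while F1 stays close to the weak recall.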

What's a good accuracy score?

It depends on the problem and baseline. For balanced datasets, 90%+ is often good. But compare against a baseline: if random guessing gives 50%, then 70% might be excellent. Context matters more than absolute numbers.

What's the precision-recall trade-off?

By adjusting your classifier's threshold, you can increase recall (catch more positives) at the cost of precision (more false alarms), or vice versa. Plot a precision-recall curve to visualize this trade-off.
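A small threshold sweep illustrates the trade-off. This sketch assumes a classifier that outputs a score per example and predicts positive when the score is at or above the threshold; the scores, labels, and the function name `precision_recall_at` are hypothetical.

```python
def precision_recall_at(threshold, scores, labels):
    """Precision and recall when predicting positive for score >= threshold."""
    tp = fp = fn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif not predicted_positive and label == 1:
            fn += 1
    # Assumption: report 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up model scores and true labels (1 = positive).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

for threshold in (0.75, 0.50, 0.25):
    p, r = precision_recall_at(threshold, scores, labels)
    print(f"threshold={threshold:.2f}  precision={p:.3f}  recall={r:.3f}")
# threshold=0.75  precision=0.667  recall=0.500
# threshold=0.50  precision=0.600  recall=0.750
# threshold=0.25  precision=0.571  recall=1.000
```

On this data, lowering the threshold raises recall from 0.500 to 1.000 while precision falls from 0.667 to 0.571.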

Should I use accuracy for multiclass problems?

Yes, but also look at per-class metrics. A model might be accurate overall but terrible at rare classes. Use a confusion matrix for all classes, and compute precision/recall/F1 for each class individually.
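Per-class metrics can be computed by treating each class in turn as "positive" and every other class as "negative" (a one-vs-rest view). The sketch below uses made-up three-class labels; the function name `per_class_metrics` is hypothetical, and reporting 0.0 for an undefined ratio is an assumption.

```python
def per_class_metrics(y_true, y_pred):
    """Return {class: (precision, recall, f1)} using a one-vs-rest view."""
    pairs = list(zip(y_true, y_pred))
    results = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in pairs)
        fp = sum(t != cls and p == cls for t, p in pairs)
        fn = sum(t == cls and p != cls for t, p in pairs)
        # Assumption: report 0.0 when a denominator is zero.
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        results[cls] = (precision, recall, f1)
    return results

# Made-up labels: class "a" is common, "b" and "c" are rare.
y_true = ["a"] * 7 + ["b", "b"] + ["c"]
y_pred = ["a"] * 7 + ["b", "a"] + ["a"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"overall accuracy = {accuracy:.2f}")  # overall accuracy = 0.80

for cls, (p, r, f1) in per_class_metrics(y_true, y_pred).items():
    print(f"class {cls}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# class a: precision=0.778 recall=1.000 f1=0.875
# class b: precision=1.000 recall=0.500 f1=0.667
# class c: precision=0.000 recall=0.000 f1=0.000
```

Overall accuracy is 0.80, but class "c" is never predicted correctly, which only the per-class view reveals.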

What's the difference between sensitivity and recall?

They're the same thing! Sensitivity, recall, true positive rate (TPR), and hit rate all refer to TP/(TP+FN). Different fields use different names for historical reasons.

How do I improve my model's metrics?

Collect more/better data, try different algorithms, tune hyperparameters, engineer better features, handle class imbalance (oversampling, undersampling, class weights), or adjust decision thresholds.
