Intraclass Correlation Coefficient (ICC)

Measure reliability and agreement between multiple raters evaluating the same subjects.

Calculator inputs: Number of Subjects (n), Number of Raters (k), Between-Subjects Mean Square (BMS), Within-Subjects Mean Square (WMS).

Sample output: ICC value 0.6230 (moderate reliability); Interpretation: Moderate; Use Case: Training; BMS: 12.50; WMS: 2.10.

What is Intraclass Correlation Coefficient (ICC)?

The ICC measures the proportion of variance in measurements attributable to differences between subjects, rather than differences between raters. It quantifies inter-rater reliability: how consistently multiple raters assess the same subjects on a continuous scale. Values near 1 indicate near-perfect agreement, and values near 0 indicate no reliability (the ratings say little about differences between subjects). Sample estimates can also fall below 0.

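ICC(2,1): Two-way random-effects model with absolute agreement, single rater. Used when the raters are treated as a sample from a larger pool and a single rater's measurement, not the average, is the unit of interest.
Between-Subjects MS (BMS): Mean square from the ANOVA table for differences between subjects. A higher BMS indicates subjects are more distinct.
Within-Subjects MS (WMS): Mean square from the ANOVA table for differences among the ratings given to the same subject. A lower WMS indicates better agreement.
Formula: ICC = (BMS - WMS) / (BMS + (k-1)×WMS), where k is the number of raters. With BMS and WMS alone, this is the one-way form, ICC(1,1) in Shrout and Fleiss's notation; their ICC(2,1) additionally uses the between-raters mean square, which this calculator does not take as input.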
Interpretation: ICC > 0.9 is excellent; 0.75–0.9 is good; 0.5–0.75 is moderate; < 0.5 is poor.
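
The formula and interpretation bands above can be sketched in Python as follows. The function names `icc_from_mean_squares` and `interpret_icc` are illustrative and not part of the calculator, and the handling of values that fall exactly on a band boundary (0.5, 0.75, 0.9) is an assumption, since the bands as written do not say which side they belong to.

```python
def icc_from_mean_squares(bms: float, wms: float, k: int) -> float:
    """ICC = (BMS - WMS) / (BMS + (k - 1) * WMS), as stated in the formula above."""
    if k < 2:
        raise ValueError("k must be at least 2 raters")
    denominator = bms + (k - 1) * wms
    if denominator <= 0:
        raise ValueError("mean squares must give a positive denominator")
    return (bms - wms) / denominator


def interpret_icc(icc: float) -> str:
    """Map an ICC value to the bands listed above (boundary handling assumed)."""
    if icc > 0.9:
        return "excellent"
    if icc >= 0.75:
        return "good"
    if icc >= 0.5:
        return "moderate"
    return "poor"


icc = icc_from_mean_squares(bms=12.5, wms=2.1, k=3)
print(f"{icc:.3f} ({interpret_icc(icc)})")  # prints "0.623 (moderate)"
```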

How to Use This Calculator

Perform ANOVA Analysis: Collect ratings from k raters on n subjects and conduct a two-way ANOVA (subjects × raters).
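Extract Mean Squares: Obtain BMS (between-subjects effect) and WMS (within-subjects mean square) from your ANOVA output. If your software reports a two-way table, WMS is the rater and residual sums of squares added together and divided by n(k-1).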
Enter Parameters: Input the number of subjects, raters, BMS, and WMS into the calculator.
Review ICC: The ICC value and interpretation appear automatically; values > 0.75 suggest acceptable reliability.
Use Results: A high ICC (> 0.75) indicates raters can be used interchangeably. A lower ICC suggests a need for rater training or for averaging multiple raters.
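
If you have the raw ratings rather than an ANOVA table, the two mean squares can be computed directly. The sketch below is one way to do this under stated assumptions: `ratings` is a complete table with one row per subject and one column per rater, WMS is taken as the within-subjects mean square (all variation among ratings of the same subject, divided by n(k-1)), and the function name `mean_squares` and the toy data are hypothetical.

```python
def mean_squares(ratings):
    """Return (BMS, WMS) for a complete n-subjects by k-raters table."""
    n = len(ratings)
    k = len(ratings[0])
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need at least 2 subjects and 2 raters, no missing cells")

    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    subject_means = [sum(row) / k for row in ratings]

    # Between-subjects: spread of subject means around the grand mean.
    ss_between = k * sum((m - grand_mean) ** 2 for m in subject_means)
    # Within-subjects: spread of each rating around its own subject's mean.
    ss_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, subject_means) for x in row
    )
    return ss_between / (n - 1), ss_within / (n * (k - 1))


# Toy data: 3 subjects rated by 2 raters.
bms, wms = mean_squares([[1, 2], [3, 4], [5, 6]])
print(bms, wms)  # prints "8.0 0.5"
```

These two values, together with n and k, are what the calculator expects as input.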

Example: Medical Image Assessment

Three radiologists independently assess 30 CT scans on a severity scale (0–100). An ANOVA yields BMS = 12.5 (variation between scans) and WMS = 2.1 (variation among the radiologists' ratings of the same scan).

n = 30 subjects, k = 3 raters

BMS = 12.5, WMS = 2.1

ICC = (12.5 - 2.1) / (12.5 + 2×2.1)

ICC = 10.4 / 16.7 = 0.623

Moderate reliability: Some rater training beneficial.

ICC = 0.623 (moderate) suggests raters agree reasonably well but some variability remains. Radiologists might benefit from standardized scoring guidelines before averaging their assessments for clinical decisions.

Frequently Asked Questions

What is the difference between ICC and Pearson correlation?

Pearson correlation measures linear association and is unaffected by systematic bias (e.g., one rater consistently rating 5 points higher). An agreement ICC is lowered by such systematic differences, making it more appropriate for agreement studies. ICC also treats the raters as interchangeable.
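
To illustrate the point about systematic bias, the sketch below compares the two statistics on made-up data in which the second rater always scores exactly 5 points higher than the first. The helper names, the data, and the use of the calculator's formula on a within-subjects mean square are assumptions for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def icc_single(ratings):
    """ICC = (BMS - WMS) / (BMS + (k - 1) * WMS) from an n x k table."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(map(sum, ratings)) / (n * k)
    means = [sum(row) / k for row in ratings]
    bms = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    wms = sum(
        (x - m) ** 2 for row, m in zip(ratings, means) for x in row
    ) / (n * (k - 1))
    return (bms - wms) / (bms + (k - 1) * wms)


rater_a = [10, 20, 30, 40]
rater_b = [score + 5 for score in rater_a]  # constant +5 offset

r = pearson(rater_a, rater_b)
icc = icc_single([list(pair) for pair in zip(rater_a, rater_b)])
print(f"Pearson r = {r:.3f}, ICC = {icc:.3f}")  # Pearson r = 1.000, ICC = 0.928
```

The correlation is perfect because the offset does not change the linear relationship, while the ICC drops below 1 because the two raters never give the same score.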

When should I use ICC vs. Kappa?

ICC is for continuous measurements (ratings, scores); Kappa is for categorical data (agreement on categories). If you have discrete categories, use Kappa; if continuous scales, use ICC.

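What if ICC is negative?

A negative ICC arises when WMS exceeds BMS, meaning ratings of the same subject vary more than ratings of different subjects. This can reflect systematic disagreement, data entry errors, or sampling error when the true reliability is near zero. Clamp the reported value to [0, 1]; ICC = 0 means no agreement, which calls for rater retraining or a different measurement approach.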
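
A small sketch of the clamping step, using hypothetical mean squares in which WMS is larger than BMS. The function name `clamped_icc` is illustrative; keeping the raw value alongside the reported one is a design choice made here so the negative estimate is not silently lost.

```python
def clamped_icc(bms: float, wms: float, k: int) -> tuple[float, float]:
    """Return (raw, reported) ICC, where reported is clamped to [0, 1]."""
    raw = (bms - wms) / (bms + (k - 1) * wms)
    reported = min(1.0, max(0.0, raw))
    return raw, reported


# Hypothetical inputs: ratings within a subject vary more than between subjects.
raw, reported = clamped_icc(bms=1.0, wms=2.0, k=3)
print(f"raw ICC = {raw:.2f}, reported ICC = {reported:.2f}")
# prints "raw ICC = -0.20, reported ICC = 0.00"
```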

How many raters do I need?

At least 2 raters are required for ICC analysis. More raters (3–5) give a more precise estimate of reliability. The single-rater ICC does not systematically increase with k; it is the reliability of the averaged rating, ICC(2,k), that rises as raters are added.

Does ICC account for learning effects?

Standard ICC formulas do not account for order effects. If raters assess subjects in the same sequence, fatigue or learning may bias results. Randomize assessment order or counterbalance to minimize systematic biases affecting ICC magnitude.

What is the difference between ICC(2,1) and ICC(2,k)?

ICC(2,1) applies when using a single rater's measurement; ICC(2,k) applies when averaging all k raters' measurements. Averaging multiple raters (ICC(2,k)) yields higher reliability. This calculator computes ICC(2,1); writing that value as r, the Spearman-Brown step-up gives ICC(2,k) = k×r / (1 + (k-1)×r).
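
The conversion from the single-rater value to the averaged-rating value can be sketched as below, using the BMS = 12.5, WMS = 2.1, k = 3 figures from the example on this page. The function name `average_measures_icc` is illustrative.

```python
def average_measures_icc(single_icc: float, k: int) -> float:
    """Spearman-Brown step-up: k * r / (1 + (k - 1) * r)."""
    denominator = 1 + (k - 1) * single_icc
    if denominator == 0:
        raise ValueError("step-up is undefined when r equals -1 / (k - 1)")
    return k * single_icc / denominator


single = (12.5 - 2.1) / (12.5 + 2 * 2.1)   # single-rater value, about 0.623
averaged = average_measures_icc(single, k=3)
print(f"{averaged:.3f}")  # prints "0.832"
```

For these inputs the result equals (BMS - WMS) / BMS = 10.4 / 12.5, which is a quick way to cross-check the step-up.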

Does ICC require replicate observations?

Yes, in the sense that every subject must be rated more than once: each rater assesses each subject independently, so each subject × rater combination is a separate assessment. Repeated ratings by the same rater are not required. Do not use ICC if ratings have already been averaged or if subjects are nested within raters.

How do I improve ICC in my study?

(1) Provide raters with clearer rating guidelines and training. (2) Use objective measurement scales or checklists. (3) Reduce fatigue by limiting assessments per rater per session. (4) Select raters with relevant expertise. (5) Repeat measurements; averaging multiple raters increases ICC(2,k).
