| X | Y | Predicted | Residual |
|---|---|---|---|
| 1.00 | 2.10 | 2.0250 | 0.0750 |
| 2.00 | 4.00 | 4.0143 | -0.0143 |
| 3.00 | 5.80 | 6.0036 | -0.2036 |
| 4.00 | 8.20 | 7.9929 | 0.2071 |
| 5.00 | 9.80 | 9.9821 | -0.1821 |
| 6.00 | 12.10 | 11.9714 | 0.1286 |
| 7.00 | 14.00 | 13.9607 | 0.0393 |
| 8.00 | 15.90 | 15.9500 | -0.0500 |
Residuals, the differences between actual and predicted values, are central to assessing a regression model's quality, assumptions, and prediction accuracy.
A residual is the difference between an observed (actual) value and the value predicted by a regression model. For each data point, the residual shows how far the prediction missed: residual = actual value − predicted value. Residuals measure the "leftover" variation not explained by the regression line—essentially the model's prediction error at each point.
Understanding residuals is essential for regression analysis because they reveal:

- Whether the model fits the data well (small residuals = good fit)
- Whether regression assumptions are violated (patterns in residuals indicate problems)
- Which observations are poorly predicted (outliers with large residuals)
- Whether the linear relationship assumption is valid
Analyzing residual patterns (residual plots) helps identify model problems like non-linearity, heteroscedasticity, or non-normal errors. A good regression model produces residuals that are randomly scattered around zero with no obvious patterns, indicating the model has captured the true relationship.
Simple dataset with 4 points: (1, 2), (2, 3.5), (3, 5.2), (4, 6.8)
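As a worked illustration of residual = actual value − predicted value, the sketch below fits an ordinary least-squares line to these four points and prints each residual. The helper name `fit_line` is an assumption made for this example, not part of any library.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [1, 2, 3, 4]
ys = [2, 3.5, 5.2, 6.8]

slope, intercept = fit_line(xs, ys)
predicted = [intercept + slope * x for x in xs]
# Residual = actual value - predicted value, one per data point.
residuals = [y - p for y, p in zip(ys, predicted)]

for x, y, p, e in zip(xs, ys, predicted, residuals):
    print(f"x={x}  actual={y:.2f}  predicted={p:.2f}  residual={e:+.2f}")
```

For this dataset the fitted line works out to roughly ŷ = 0.35 + 1.61x, giving residuals of about 0.04, −0.07, 0.02, and 0.01. They sum to zero (up to rounding), which is always the case for a least-squares line with an intercept.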
Residuals tell us whether the regression model is appropriate. By examining residual patterns, we can detect violations of key regression assumptions: linearity, homoscedasticity (equal variance), independence, and normality. Good models produce residuals with no patterns.
A residual plot graphs residuals on the y-axis against predicted or x values on the x-axis. Random scatter around zero indicates the model is appropriate. Curved patterns suggest non-linearity (try polynomial regression). Funnel patterns indicate heteroscedasticity (unequal variance).
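A residual plot is normally drawn with a plotting library, but the curved pattern it would show can be seen in a text-only sketch. The example below assumes hypothetical data that follow y = x² exactly and fits a straight line to them; the `fit_line` helper is illustrative only.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical curved data (y = x^2) that a straight line cannot follow.
xs = [1, 2, 3, 4, 5]
ys = [x ** 2 for x in xs]

slope, intercept = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# One sign per point, ordered by x: a crude stand-in for the residual plot.
signs = "".join("+" if e > 0 else "-" for e in residuals)
print(residuals)
print(signs)
```

Here the residuals are 2, −1, −2, −1, 2, so the sign string is `+---+`: positive at both ends and negative in the middle. That systematic U shape, rather than a random mix of signs, is the signature of non-linearity.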
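SSE (Sum of Squared Errors) = Σ(eᵢ)² is the total of all squared residuals and measures overall model fit: for the same data, a lower SSE means a better fit. SSE is used to calculate MSE (SSE/(n-2)) and other error metrics. Comparing SSE across models fitted to the same data helps select the best fit, though adding parameters never increases SSE, so a lower value alone does not prove a model is better.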
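To make the comparison concrete, the sketch below computes SSE for two candidate lines on the eight data points from the table. The fitted coefficients (intercept 0.0357, slope 1.9893) are back-computed from the table's Predicted column and rounded; the alternative line ŷ = 2x is a hypothetical eyeballed guess. The function name `sse` is an assumption for this example.

```python
def sse(xs, ys, intercept, slope):
    """Sum of squared residuals for the line y_hat = intercept + slope * x."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# X and Y columns from the table.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.10, 4.00, 5.80, 8.20, 9.80, 12.10, 14.00, 15.90]

# Least-squares coefficients, rounded, inferred from the Predicted column.
sse_fit = sse(xs, ys, intercept=0.0357, slope=1.9893)
# Hypothetical alternative line chosen by eye: y_hat = 2x.
sse_guess = sse(xs, ys, intercept=0.0, slope=2.0)

print(f"SSE (least-squares line): {sse_fit:.4f}")
print(f"SSE (guess y = 2x):       {sse_guess:.4f}")
```

The least-squares line gives an SSE of about 0.144 against 0.150 for the guess. The gap is small because the guess is already close, but no other straight line can beat the least-squares SSE on these data.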
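MSE (Mean Squared Error) = SSE/(n-2) is roughly the average squared residual; dividing by n-2 rather than n accounts for the two estimated coefficients (slope and intercept) in simple linear regression. RMSE (Root MSE) = √MSE is in the same units as y, making it more interpretable. An RMSE of 2 means predictions are typically off by about 2 units from actual values.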
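The chain SSE → MSE → RMSE can be sketched directly from the Residual column of the table. The values are the table's rounded residuals, so the results are approximate; the variable names are illustrative.

```python
import math

# Residual column copied from the table (rounded to four decimals).
residuals = [0.0750, -0.0143, -0.2036, 0.2071, -0.1821, 0.1286, 0.0393, -0.0500]

n = len(residuals)
n_params = 2  # slope and intercept in simple linear regression

sse = sum(e ** 2 for e in residuals)
mse = sse / (n - n_params)  # SSE divided by the residual degrees of freedom
rmse = math.sqrt(mse)       # back in the units of y

print(f"SSE  = {sse:.4f}")
print(f"MSE  = {mse:.4f}")
print(f"RMSE = {rmse:.4f}")
```

For the table's data this gives SSE ≈ 0.144, MSE ≈ 0.024, and RMSE ≈ 0.155, meaning predictions are typically off by about 0.15 units of y.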
Residuals can help identify outliers. Points whose residuals are much larger in magnitude than the others (beyond about ±2 or ±3 times the RMSE) are potential outliers. They have large prediction errors and may heavily influence the regression coefficients. Investigate whether each one is a data entry error or a genuinely unusual observation.
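A minimal sketch of this screening rule follows. It reuses the table's data but replaces the fifth y value (9.80) with a hypothetical 14.00 to create an obvious outlier; the 2 × RMSE cutoff and the `fit_line` helper are assumptions for illustration.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


xs = [1, 2, 3, 4, 5, 6, 7, 8]
# Table values, except the fifth y is a hypothetical 14.00 instead of 9.80.
ys = [2.10, 4.00, 5.80, 8.20, 14.00, 12.10, 14.00, 15.90]

slope, intercept = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

rmse = math.sqrt(sum(e ** 2 for e in residuals) / (len(xs) - 2))
threshold = 2 * rmse  # screening cutoff; 3 * rmse is a stricter alternative

# x values whose residual magnitude exceeds the cutoff.
flagged = [x for x, e in zip(xs, residuals) if abs(e) > threshold]
print(f"RMSE = {rmse:.3f}, threshold = {threshold:.3f}, flagged x = {flagged}")
```

Only x = 5 is flagged. Note that the outlier itself inflates the RMSE (to about 1.52 here), which raises the threshold; in this example the point would not be flagged at 3 × RMSE, so the cutoff is a prompt for investigation rather than a definitive test.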
Linear regression assumes: (1) a linear relationship between x and y, (2) normally distributed errors, (3) constant error variance (homoscedasticity), (4) independent errors, and (5) no perfect multicollinearity (relevant only when there are several predictors). Residual plots help check the first four; multicollinearity is assessed from the predictors themselves.
If residual plots show a curved pattern (J-shape, inverted U), the relationship is non-linear. Try polynomial regression (quadratic, cubic). If residuals still show patterns, consider other models like exponential, logarithmic, or power regression.
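The effect of moving from a linear to a quadratic model can be sketched by comparing SSE for the two fits. The example assumes hypothetical data that follow y = x² exactly, and the `polyfit` helper below is an illustrative pure-Python solver based on the normal equations, not a library function. The normal equations become numerically unstable for higher degrees, so a numerical library is the safer choice in practice.

```python
def polyfit(xs, ys, degree):
    """Least-squares polynomial coefficients [c0, c1, ...] via normal equations."""
    m = degree + 1
    # Augmented matrix [A | b] with A[i][j] = sum(x^(i+j)), b[i] = sum(y * x^i).
    aug = [
        [sum(x ** (i + j) for x in xs) for j in range(m)]
        + [sum(y * x ** i for x, y in zip(xs, ys))]
        for i in range(m)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(m):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [a - factor * b for a, b in zip(aug[row], aug[col])]
    return [aug[i][m] / aug[i][i] for i in range(m)]


def sse(xs, ys, coeffs):
    """Sum of squared residuals for the polynomial with the given coefficients."""
    preds = [sum(c * x ** k for k, c in enumerate(coeffs)) for x in xs]
    return sum((y - p) ** 2 for y, p in zip(ys, preds))


# Hypothetical curved data: y = x^2.
xs = [1, 2, 3, 4, 5]
ys = [x ** 2 for x in xs]

sse_linear = sse(xs, ys, polyfit(xs, ys, 1))
sse_quadratic = sse(xs, ys, polyfit(xs, ys, 2))
print(f"linear SSE = {sse_linear:.4f}, quadratic SSE = {sse_quadratic:.4f}")
```

The linear fit leaves an SSE of 14, while the quadratic fit drives it to essentially zero. Because a higher-degree polynomial can never have a larger SSE than a lower-degree one on the same data, the drop should be judged together with the residual pattern, not on its own.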
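Minimum: n > 2. Two points define a line exactly, so at least three are needed for the residuals to carry any information. Practically, n of about 20 to 30 or more is advisable for reliable inference; more data gives more stable estimates. For k predictor variables, n > 10k is a rough guideline. Published analyses often treat n ≥ 30 as a rule-of-thumb minimum for credibility.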