Residual = Actual − Predicted (Can Be Positive or Negative)

Residual

The vertical distance between a data point and the graph of a regression equation. The residual is positive if the data point is above the graph. The residual is negative if the data point is below the graph. The residual is 0 only when the graph passes through the data point.

Key Formula

$$e_i = y_i - \hat{y}_i$$

Where:

$e_i$ = The residual for the $i$th data point
$y_i$ = The actual (observed) $y$-value of the $i$th data point
$\hat{y}_i$ = The predicted $y$-value from the regression equation for the $i$th data point

Worked Example

Problem: A regression line for study hours vs. test scores is ŷ = 50 + 5x. A student who studied 6 hours scored 83. Find the residual.

Step 1: Identify the actual $y$-value. The student's actual test score is 83.

$$y = 83$$

Step 2: Calculate the predicted $y$-value by substituting $x = 6$ into the regression equation.

$$\hat{y} = 50 + 5(6) = 80$$

Step 3: Compute the residual by subtracting the predicted value from the actual value.

$$e = y - \hat{y} = 83 - 80 = 3$$

Step 4: Interpret the result. The residual is positive, so the data point lies 3 units above the regression line. The student scored 3 points higher than the model predicted.

Answer: The residual is 3. The student scored 3 points above the predicted value.
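
The same four steps can be checked numerically. The following is a minimal Python sketch of the worked example; the function names `predict` and `residual` are illustrative choices, not part of the original problem.

```python
def predict(x, intercept=50, slope=5):
    """Predicted test score from the regression line y-hat = 50 + 5x."""
    return intercept + slope * x


def residual(actual, predicted):
    """Residual = actual - predicted. The order of subtraction matters."""
    return actual - predicted


hours, score = 6, 83

y_hat = predict(hours)        # Step 2: 50 + 5(6)
e = residual(score, y_hat)    # Step 3: actual minus predicted

# Step 4: the sign of the residual tells where the point lies.
position = "above" if e > 0 else "below" if e < 0 else "on"
print(f"predicted = {y_hat}, residual = {e}, point lies {position} the line")
```

Keeping the subtraction inside one small function makes it harder to reverse the order by accident.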

Another Example

This example computes several residuals at once. It illustrates all three sign cases (positive, negative, and zero) and shows why the residuals of a least-squares line sum to zero only when every data point used in the fit is included.

Problem: Using the same regression line ŷ = 50 + 5x, three students studied 4, 8, and 10 hours and scored 68, 95, and 100 respectively. Calculate each residual, then find the sum of the three residuals.

Step 1: Find the predicted score for each student.

$$\hat{y}_1 = 50 + 5(4) = 70, \quad \hat{y}_2 = 50 + 5(8) = 90, \quad \hat{y}_3 = 50 + 5(10) = 100$$

Step 2: Calculate each residual.

$$e_1 = 68 - 70 = -2, \quad e_2 = 95 - 90 = 5, \quad e_3 = 100 - 100 = 0$$

Step 3: Interpret the signs. The first student scored below the prediction (negative residual), the second scored above (positive), and the third landed exactly on the line (zero residual).

Step 4: Sum the residuals. The residuals of a least-squares regression line sum to exactly zero only when they are taken over every data point used to fit the line. These three students are just part of that data set, so their residuals need not sum to zero.

$$e_1 + e_2 + e_3 = -2 + 5 + 0 = 3$$

Answer: The residuals are −2, 5, and 0, and their sum is 3. This example shows that residuals can be negative, positive, or exactly zero, and that the residuals of a subset of the data need not sum to zero.
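
The calculation for all three students can be sketched in a few lines of Python. The variable names are illustrative; the data and the line ŷ = 50 + 5x come from the problem above.

```python
hours = [4, 8, 10]
scores = [68, 95, 100]

# Step 1: predicted score for each student from y-hat = 50 + 5x.
predicted = [50 + 5 * x for x in hours]

# Step 2: residual = actual - predicted, one per student.
residuals = [y - y_hat for y, y_hat in zip(scores, predicted)]

# Step 4: sum over these three points only (a subset of the full data set).
total = sum(residuals)

for x, y, y_hat, e in zip(hours, scores, predicted, residuals):
    print(f"hours = {x:2d}  actual = {y:3d}  predicted = {y_hat:3d}  residual = {e:+d}")
print("sum of residuals:", total)
```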

Frequently Asked Questions

What does a positive or negative residual mean?

A positive residual means the actual data point is above the regression line—the model underestimated the value. A negative residual means the actual point is below the line—the model overestimated. A residual of zero means the prediction was exactly correct for that point.

Why do the residuals of a least-squares regression line sum to zero?

The least-squares method minimizes the sum of squared residuals. When the line includes an intercept term, a mathematical consequence of this minimization is that the residuals sum to exactly zero. Equivalently, the regression line passes through the point $(\bar{x}, \bar{y})$, so overestimates and underestimates balance out.
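
One way to see this: setting the derivative of $\sum (y_i - b_0 - b_1 x_i)^2$ with respect to the intercept $b_0$ equal to zero gives $-2 \sum e_i = 0$. The Python sketch below checks the result numerically on a small data set. The data values are hypothetical, invented only for illustration, and `fit_least_squares` is an illustrative helper that uses the standard formulas $b_1 = S_{xy} / S_{xx}$ and $b_0 = \bar{y} - b_1 \bar{x}$.

```python
def fit_least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data, invented for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = fit_least_squares(xs, ys)

# Residuals over ALL points used in the fit.
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

print("sum of residuals:", round(sum(residuals), 10))
print("line evaluated at x-bar:", round(b0 + b1 * x_bar, 10), " y-bar:", y_bar)
```

Because of floating-point rounding, the computed sum may differ from zero by a tiny amount, so compare it to zero with a tolerance rather than with `==`.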

How do you use a residual plot to check a regression model?

Plot the residuals on the vertical axis against the $x$-values (or predicted values) on the horizontal axis. If the residuals scatter randomly with no pattern, a linear model is appropriate. If you see a curved pattern, the relationship may not be linear and a different model might fit better.
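
To illustrate what a curved pattern looks like, the sketch below fits a line to data that is deliberately nonlinear and lists the residuals in order. The data ($y = x^2$ for $x = 0, \dots, 4$) is hypothetical, and a text listing stands in for the plot; in practice you would draw the same pairs with a plotting library.

```python
def fit_least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


# Hypothetical curved data: y = x^2, chosen so a line is the wrong model.
xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]

b0, b1 = fit_least_squares(xs, ys)
predicted = [b0 + b1 * x for x in xs]
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]
signs = ["+" if e > 0 else "-" if e < 0 else "0" for e in residuals]

# Each row is one point of the residual plot: (predicted value, residual).
for y_hat, e, s in zip(predicted, residuals, signs):
    print(f"predicted = {y_hat:6.2f}   residual = {e:6.2f}   {s}")
```

The signs come out as +, -, -, -, +: positive at both ends and negative in the middle. That systematic U shape is the kind of curved pattern that suggests a linear model is not appropriate.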

Residual vs. Predicted Value (ŷ)

	Residual	Predicted Value (ŷ)
Definition	The difference between the observed and predicted y-values	The y-value estimated by the regression equation for a given x
Formula	$e_i = y_i - \hat{y}_i$	$\hat{y}_i = b_0 + b_1 x_i$
Sign	Can be positive, negative, or zero	Depends on the regression equation; can be any real number
Purpose	Measures prediction error for each data point	Provides the model's best estimate for a given input

Why It Matters

Residuals appear throughout statistics courses whenever you study regression. They are the foundation for assessing how well a model fits data—residual plots reveal whether a linear model is appropriate, and the sum of squared residuals is the quantity minimized when fitting a least-squares line. In AP Statistics and college-level courses, you will be asked to calculate residuals, construct residual plots, and use them to judge model quality.

Common Mistakes

Mistake: Subtracting in the wrong order, computing $\hat{y} - y$ instead of $y - \hat{y}$.

Correction: Always subtract the predicted value from the actual value: $e = y - \hat{y}$. Reversing the order flips the sign of every residual, which changes the interpretation of whether the model over- or underestimates.

Mistake: Confusing residuals with the distance formula or horizontal distance.

Correction: A residual is strictly a vertical distance: the difference in $y$-values only. It does not involve $x$-differences or the point-to-line perpendicular distance.
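
The difference is easy to see numerically. The sketch below reuses the point (6, 83) and the line ŷ = 50 + 5x from the worked example and compares the residual with the perpendicular point-to-line distance. The perpendicular distance uses the standard formula $|ax + by + c| / \sqrt{a^2 + b^2}$ for a line written as $ax + by + c = 0$; the variable names are illustrative.

```python
import math

intercept, slope = 50, 5   # regression line: y-hat = 50 + 5x
x, y = 6, 83               # data point from the worked example

# Residual: signed vertical difference between actual and predicted y.
vertical = y - (intercept + slope * x)

# Perpendicular distance from (x, y) to the line slope*x - y + intercept = 0.
# This is NOT the residual; it is shown only for comparison.
perpendicular = abs(slope * x - y + intercept) / math.hypot(slope, -1)

print("residual (vertical):", vertical)
print("perpendicular distance:", round(perpendicular, 3))
```

The residual is 3, while the perpendicular distance is about 0.59. The two agree only when the line is horizontal, which is why the distance formula gives the wrong answer for residuals.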

Related Terms

Regression Equation — The equation whose predictions residuals measure error from
Least-Squares Regression Line — The line that minimizes the sum of squared residuals
Scatterplot — Graph where data points and residuals are visualized
Vertical — Residuals measure vertical distance from the line
Graph of an Equation or Inequality — The regression curve from which residuals are measured
Positive Number — Residual is positive when data point is above the line
Negative Number — Residual is negative when data point is below the line