
Least-Squares Regression Line

Least-Squares Line
Least-Squares Fit
LSRL

The linear fit that matches the pattern of a set of paired data as closely as possible. Out of all possible linear fits, the least-squares regression line is the one that has the smallest possible value for the sum of the squares of the residuals.

 

[Figure: Scatterplot with x and y axes showing a least-squares regression line; vertical dotted lines mark the residuals from the data points to the line.]

 

See also

Least-squares regression equation, scatterplot

Key Formula

\hat{y} = b_0 + b_1 x, where

b_1 = \frac{n\sum x_i y_i - (\sum x_i)(\sum y_i)}{n\sum x_i^2 - (\sum x_i)^2} \qquad b_0 = \bar{y} - b_1\bar{x}
Where:
  • \hat{y} = The predicted value of the response variable
  • b_1 = The slope of the regression line (change in ŷ per unit change in x)
  • b_0 = The y-intercept of the regression line (predicted value when x = 0)
  • x_i, y_i = Individual data values for the explanatory and response variables
  • \bar{x}, \bar{y} = The means of the x-values and y-values
  • n = The number of data points
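As a sketch, the formulas above translate directly into Python (lsrl is a hypothetical helper name, not a function from any particular library):

```python
def lsrl(xs, ys):
    """Least-squares slope b1 and intercept b0 from raw paired data."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Slope: b1 = (n*Sum(xy) - Sum(x)*Sum(y)) / (n*Sum(x^2) - Sum(x)^2)
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept: b0 = y-bar - b1 * x-bar
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1
```

With the data from the worked example below, lsrl([1, 2, 3, 4, 5], [2, 5, 6, 8, 9]) returns (0.9, 1.7) up to rounding.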

Worked Example

Problem: Find the least-squares regression line for the data: (1, 2), (2, 5), (3, 6), (4, 8), (5, 9).
Step 1: Compute the necessary sums. With n = 5 data points:
\sum x_i = 15, \quad \sum y_i = 30, \quad \sum x_i y_i = 107, \quad \sum x_i^2 = 55
Step 2: Calculate the means of x and y.
\bar{x} = \frac{15}{5} = 3, \qquad \bar{y} = \frac{30}{5} = 6
Step 3: Compute the slope using the formula for b₁.
b_1 = \frac{5(107) - (15)(30)}{5(55) - (15)^2} = \frac{535 - 450}{275 - 225} = \frac{85}{50} = 1.7
Step 4: Compute the y-intercept using the formula for b₀.
b_0 = \bar{y} - b_1\bar{x} = 6 - 1.7(3) = 6 - 5.1 = 0.9
Step 5: Write the least-squares regression line.
\hat{y} = 0.9 + 1.7x
Answer: The least-squares regression line is ŷ = 0.9 + 1.7x. This means that for each 1-unit increase in x, the predicted y-value increases by 1.7.
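To confirm that ŷ = 0.9 + 1.7x really is the least-squares choice, a short Python check (a sketch using the data above) compares its sum of squared residuals against a few nearby lines:

```python
data = [(1, 2), (2, 5), (3, 6), (4, 8), (5, 9)]

def sse(b0, b1):
    """Sum of squared residuals for the line y-hat = b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in data)

best = sse(0.9, 1.7)
# Any line with a slightly different slope or intercept does worse:
assert best < sse(1.0, 1.7)
assert best < sse(0.9, 1.6)
assert best < sse(0.5, 1.8)
```

No nearby line beats the LSRL; its sum of squared residuals (here 1.1) is the minimum over all possible lines.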

Another Example

This example uses summary statistics (means, standard deviations, and the correlation coefficient) instead of raw data. Many AP Statistics problems provide these values directly, making this formula faster to apply.

Problem: A data set has x̄ = 10, ȳ = 25, sₓ = 4, sᵧ = 6, and correlation r = 0.8. Use the alternative formula with means and standard deviations to find the LSRL.
Step 1: When you know r, sₓ, and sᵧ, compute the slope using the alternative slope formula.
b_1 = r \cdot \frac{s_y}{s_x} = 0.8 \cdot \frac{6}{4} = 0.8 \cdot 1.5 = 1.2
Step 2: Compute the y-intercept. The LSRL always passes through the point (x̄, ȳ).
b_0 = \bar{y} - b_1\bar{x} = 25 - 1.2(10) = 25 - 12 = 13
Step 3: Write the equation of the regression line.
\hat{y} = 13 + 1.2x
Answer: The least-squares regression line is ŷ = 13 + 1.2x.
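The same two steps can be sketched in Python, assuming only the summary statistics are given:

```python
x_bar, y_bar = 10, 25      # given means
s_x, s_y, r = 4, 6, 0.8    # given standard deviations and correlation

b1 = r * s_y / s_x         # slope: b1 = r * (s_y / s_x)
b0 = y_bar - b1 * x_bar    # intercept: the line passes through (x-bar, y-bar)
print(f"y-hat = {b0} + {b1}x")
```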

Frequently Asked Questions

Why do we square the residuals instead of just adding them up?
If you simply add the residuals without squaring, positive and negative residuals cancel each other out, often giving a sum near zero for many different lines. Squaring ensures all residuals contribute positively, so the total reflects the actual size of every error. It also gives more weight to large deviations, which penalizes poor fits more heavily.
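The cancellation can be seen numerically with the line from the first worked example (a small check, assuming ŷ = 0.9 + 1.7x):

```python
data = [(1, 2), (2, 5), (3, 6), (4, 8), (5, 9)]
residuals = [y - (0.9 + 1.7 * x) for x, y in data]

raw_total = sum(residuals)                     # positives and negatives cancel: ~0
squared_total = sum(e * e for e in residuals)  # every error counts: ~1.1
```

The raw residuals (−0.6, 0.7, 0, 0.3, −0.4) sum to zero even though the line clearly misses several points; only the squared total reflects the actual size of the errors.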
What is the difference between the regression line and the correlation coefficient?
The regression line gives you an equation to predict y from x, while the correlation coefficient r measures the strength and direction of the linear relationship. You need r to know how reliable the regression line's predictions are: an r close to ±1 means predictions are accurate, while an r near 0 means the line fits poorly. The slope of the LSRL is directly related to r through the formula b₁ = r(sᵧ/sₓ).
Does the least-squares regression line always pass through the point (x̄, ȳ)?
Yes, always. This is a mathematical consequence of how b₀ is calculated: b₀ = ȳ − b₁x̄. Substituting x = x̄ into the equation gives ŷ = b₀ + b₁x̄ = ȳ. This property is useful for checking your work and understanding the geometry of the line.
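A quick numerical check, using the data and fitted line from the first worked example:

```python
import math

data = [(1, 2), (2, 5), (3, 6), (4, 8), (5, 9)]
x_bar = sum(x for x, _ in data) / len(data)  # 3.0
y_bar = sum(y for _, y in data) / len(data)  # 6.0

# Plug x-bar into the fitted line y-hat = 0.9 + 1.7x:
y_hat_at_mean = 0.9 + 1.7 * x_bar
print(math.isclose(y_hat_at_mean, y_bar))  # True
```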

Least-Squares Regression Line vs. Correlation Coefficient (r)

  • What it tells you: the LSRL gives an equation to predict y from x; r gives the strength and direction of the linear relationship.
  • Output: the LSRL is an equation, ŷ = b₀ + b₁x; r is a single number between −1 and 1.
  • Depends on variable roles: yes for the LSRL (switching x and y gives a different line); no for r (it is the same regardless of which variable is x).
  • Primary use: the LSRL is for making predictions; r is for assessing how well a line fits.

Why It Matters

The least-squares regression line is one of the most frequently tested topics in AP Statistics and introductory data analysis courses. You will use it whenever you need to predict one variable from another — for example, predicting test scores from hours studied, or estimating cost from quantity produced. Understanding the LSRL also lays the groundwork for multiple regression and more advanced statistical modeling you will encounter later.

Common Mistakes

Mistake: Switching the explanatory and response variables when computing the slope.
Correction: The LSRL of y on x is different from the LSRL of x on y. Always identify which variable you are predicting (response, ŷ) and which is the predictor (explanatory, x). The slope formula uses sᵧ/sₓ, not sₓ/sᵧ.
Mistake: Using the regression line to predict far beyond the range of the data (extrapolation).
Correction: The LSRL is only reliable within (or close to) the range of x-values in your data set. Predicting y for x-values far outside this range assumes the linear trend continues indefinitely, which is often false and can produce misleading results.
