Mathwords

Least-Squares Regression Equation

An equation of a particular form (linear, quadratic, exponential, etc.) that fits a set of paired data as closely as possible. The equation must be chosen so that the sum of the squares of the residuals is made as small as possible.

 

[Figure: a parabola fitted through scattered data points on an x-y graph; vertical dotted lines show the residuals between the points and the curve.]

 

See also

Least-squares regression line, scatterplot

Key Formula

\hat{y} = b_0 + b_1 x

where the slope and intercept are computed as:

b_1 = \frac{n\sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{n\sum x_i^2 - \left(\sum x_i\right)^2} \qquad b_0 = \bar{y} - b_1\bar{x}

Where:
  • \hat{y} = The predicted value of the response variable
  • x = The explanatory (independent) variable
  • b_1 = The slope of the regression line
  • b_0 = The y-intercept of the regression line
  • n = The number of data points
  • x_i, y_i = Individual paired data values
  • \bar{x}, \bar{y} = The means of the x-values and y-values, respectively
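The formulas above translate directly into a few lines of code. A minimal Python sketch (the helper name `least_squares_line` is illustrative, not from any library):

```python
def least_squares_line(xs, ys):
    """Return (b0, b1) for the least-squares line y-hat = b0 + b1*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Slope: b1 = (n*Sum(xy) - Sum(x)*Sum(y)) / (n*Sum(x^2) - (Sum(x))^2)
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept: b0 = y-bar - b1 * x-bar
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1

b0, b1 = least_squares_line([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(b0, 4), round(b1, 4))  # 2.2 0.6
```

In practice you would typically call a library routine such as `numpy.polyfit` or `scipy.stats.linregress`, which carry out the same minimization.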

Worked Example

Problem: Find the least-squares regression line for the data: (1, 2), (2, 4), (3, 5), (4, 4), (5, 5).
Step 1: Compute the needed sums. With n = 5 data points:
\sum x_i = 15,\quad \sum y_i = 20,\quad \sum x_i y_i = 66,\quad \sum x_i^2 = 55
Step 2: Find the means of x and y.
\bar{x} = \frac{15}{5} = 3,\qquad \bar{y} = \frac{20}{5} = 4
Step 3: Calculate the slope using the formula.
b_1 = \frac{5(66) - (15)(20)}{5(55) - (15)^2} = \frac{330 - 300}{275 - 225} = \frac{30}{50} = 0.6
Step 4: Calculate the y-intercept.
b_0 = \bar{y} - b_1\bar{x} = 4 - 0.6(3) = 4 - 1.8 = 2.2
Step 5: Write the least-squares regression equation.
\hat{y} = 2.2 + 0.6x
Answer: The least-squares regression equation is ŷ = 2.2 + 0.6x. For every 1-unit increase in x, the predicted y increases by 0.6.

Another Example

This example focuses on residuals and the sum of squares, demonstrating what 'least squares' actually minimizes rather than just computing the equation.

Problem: Using the regression equation ŷ = 2.2 + 0.6x from the first example, compute all the residuals and the sum of squared residuals (the value that the method minimizes).
Step 1: Compute the predicted value for each data point using ŷ = 2.2 + 0.6x.
\hat{y}_1 = 2.8,\quad \hat{y}_2 = 3.4,\quad \hat{y}_3 = 4.0,\quad \hat{y}_4 = 4.6,\quad \hat{y}_5 = 5.2
Step 2: Calculate each residual: residual = observed y − predicted ŷ.
e_1 = 2 - 2.8 = -0.8,\quad e_2 = 4 - 3.4 = 0.6,\quad e_3 = 5 - 4.0 = 1.0,\quad e_4 = 4 - 4.6 = -0.6,\quad e_5 = 5 - 5.2 = -0.2
Step 3: Square each residual and add them together.
\text{SSR} = (-0.8)^2 + (0.6)^2 + (1.0)^2 + (-0.6)^2 + (-0.2)^2 = 0.64 + 0.36 + 1.00 + 0.36 + 0.04 = 2.40
Step 4: No other line of the form ŷ = b₀ + b₁x produces a sum of squared residuals smaller than 2.40. This is the quantity the least-squares method minimizes.
\min \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = 2.40
Answer: The sum of squared residuals is 2.40. Any other straight line through these data would give a larger value.
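The minimizing property in Step 4 can be checked numerically. A short Python sketch (hard-coding the five example data points and the least-squares coefficients for them) computes the sum of squared residuals and confirms that nudging either coefficient only increases it:

```python
def sum_squared_residuals(xs, ys, b0, b1):
    """Sum of squared residuals for the line y-hat = b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

xs, ys = [1, 2, 3, 4, 5], [2, 4, 5, 4, 5]
best = sum_squared_residuals(xs, ys, 2.2, 0.6)  # least-squares coefficients
print(round(best, 4))  # 2.4

# Any perturbation of the intercept or slope strictly increases the SSR,
# because the SSR is a quadratic bowl with its minimum at (b0, b1).
for db0, db1 in [(0.1, 0), (-0.1, 0), (0, 0.05), (0, -0.05)]:
    assert sum_squared_residuals(xs, ys, 2.2 + db0, 0.6 + db1) > best
```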

Frequently Asked Questions

What is the difference between a least-squares regression equation and a least-squares regression line?
A least-squares regression line is specifically a straight line (ŷ = b₀ + b₁x). A least-squares regression equation is a broader term that includes any model form—linear, quadratic, exponential, and so on—as long as the parameters are chosen to minimize the sum of squared residuals. Every least-squares regression line is a least-squares regression equation, but not every least-squares regression equation is a line.
Why do we square the residuals instead of just adding them?
If you simply add the residuals without squaring, positive and negative errors cancel out, often giving a sum near zero even for a terrible fit. Squaring makes all residuals positive and also penalizes larger errors more heavily, which pushes the best-fit equation closer to the bulk of the data. This approach also leads to clean, closed-form formulas for the slope and intercept.
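The cancellation is easy to see numerically. A quick Python sketch (reusing the five example data points, with the least-squares coefficients for that data hard-coded) shows that the raw residuals sum to zero while the squared residuals measure the real scatter:

```python
xs, ys = [1, 2, 3, 4, 5], [2, 4, 5, 4, 5]
b0, b1 = 2.2, 0.6  # least-squares coefficients for this data set
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# Positive and negative residuals cancel: their sum is exactly zero (up to
# floating-point noise) for any least-squares line, regardless of fit quality.
print(abs(round(sum(residuals), 10)))           # 0.0
# Squares cannot cancel, so their sum honestly reflects the fit.
print(round(sum(e * e for e in residuals), 4))  # 2.4
```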
When should you use a least-squares regression equation?
Use it whenever you have paired data and want to model the relationship between an explanatory variable and a response variable. It is most common in statistics courses for prediction—estimating a y-value from a given x-value. Before fitting, always check a scatterplot to see whether a linear, quadratic, or other model shape is appropriate.

Least-Squares Regression Equation vs. Least-Squares Regression Line

|  | Least-Squares Regression Equation | Least-Squares Regression Line |
| --- | --- | --- |
| Definition | Any equation form (linear, quadratic, exponential, etc.) fit by minimizing squared residuals | Specifically a linear equation ŷ = b₀ + b₁x fit by minimizing squared residuals |
| Formula forms | ŷ = b₀ + b₁x, ŷ = ax² + bx + c, ŷ = abˣ, etc. | ŷ = b₀ + b₁x only |
| When to use | When the scatterplot suggests any particular curve or pattern | When the scatterplot shows a roughly linear trend |
| Complexity | May require more parameters and technology to compute | Slope and intercept computed with standard formulas or a calculator |

Why It Matters

The least-squares regression equation is central to AP Statistics, college introductory statistics, and data science. You use it every time you draw a trend line on a scatterplot or predict values from data. Understanding how it works also builds the foundation for more advanced topics like multiple regression and machine learning models.

Common Mistakes

Mistake: Swapping x and y when computing the slope, which gives the regression of x on y instead of y on x.
Correction: Always place the response variable (the one you want to predict) as y and the explanatory variable as x. The regression of y on x is not the same equation as the regression of x on y.
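This mistake is easy to demonstrate. A short Python sketch (the helper `fit` is illustrative; it uses the equivalent deviation form of the slope formula, b₁ = Sxy/Sxx) fits both directions on the example data and shows the two lines differ:

```python
def fit(xs, ys):
    """Least-squares intercept and slope for regressing ys on xs."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    # Deviation form of the slope formula: b1 = Sxy / Sxx
    b1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
          / sum((x - xbar) ** 2 for x in xs))
    b0 = ybar - b1 * xbar
    return b0, b1

xs, ys = [1, 2, 3, 4, 5], [2, 4, 5, 4, 5]
b0_yx, b1_yx = fit(xs, ys)  # regression of y on x: slope 0.6
b0_xy, b1_xy = fit(ys, xs)  # regression of x on y: slope 1.0, intercept -1.0
# Note: x = -1 + 1.0*y is NOT the algebraic inverse of the y-on-x line;
# inverting y = b0 + b1*x would give slope 1/b1 instead.
print(b1_yx, b1_xy)  # 0.6 1.0
```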
Mistake: Using the regression equation to predict far outside the range of the original x-values (extrapolation).
Correction: The least-squares equation is only reliable within (or near) the range of x-values in your data. Predicting well beyond that range assumes the same pattern continues, which may not be true.

Related Terms