Least-Squares Regression
Finding the line of best fit
Regression Line
Purpose: Find best-fit line through scatterplot
Equation: ŷ = a + bx
Where:
- ŷ = predicted value of y
- b = slope
- a = y-intercept
- x = value of explanatory variable
Least-Squares Criterion
Least-squares regression line: Line minimizing sum of squared residuals
Residual: Difference between observed and predicted
Least-squares minimizes: Σ(y - ŷ)², the sum of squared residuals
Why square? So positive and negative deviations don't cancel out
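The "minimizing" claim can be checked numerically. A minimal sketch in Python, using the perfectly linear hours/scores data from Practice Problem 1 below: the least-squares line ŷ = 55 + 5x gives a smaller sum of squared residuals than any tilted alternative.

```python
xs = [2, 3, 4, 5, 6]           # hours studied
ys = [65, 70, 75, 80, 85]      # test scores (perfectly linear here)

def sse(a, b):
    """Sum of squared residuals for the candidate line y-hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

print(sse(55, 5.0))    # least-squares line: SSE = 0 for this data
print(sse(55, 4.5))    # any other slope gives a larger SSE
```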
Formulas for Slope and Intercept
Slope: b = r(s_y/s_x)
Where:
- r = correlation
- s_y = standard deviation of y
- s_x = standard deviation of x
y-intercept: a = ȳ - b·x̄
Key insight: Line always passes through (x̄, ȳ)
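The slope and intercept formulas, b = r(s_y/s_x) and a = ȳ - b·x̄, translate directly to code. A minimal sketch; the summary statistics passed in are assumed, illustrative height/weight values:

```python
def regression_line(r, s_x, s_y, x_bar, y_bar):
    """Least-squares slope and intercept from summary statistics."""
    b = r * (s_y / s_x)        # slope: b = r * (s_y / s_x)
    a = y_bar - b * x_bar      # intercept: forces the line through (x_bar, y_bar)
    return a, b

# Hypothetical summary stats for heights (in) and weights (lb)
a, b = regression_line(r=0.80, s_x=3, s_y=15, x_bar=70, y_bar=158)
print(a, b)   # -122.0 4.0
```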
Example: Finding Regression Line
Data: Height (x) and weight (y) of 5 people
Summary statistics: x̄ = 70 inches, ȳ = 158 pounds
Suppose also r = 0.80, s_x = 3 inches, s_y = 15 pounds (illustrative values)
Slope: b = r(s_y/s_x) = 0.80 × (15/3) = 4 pounds per inch
Intercept: a = ȳ - b·x̄ = 158 - 4(70) = -122 pounds
Equation: ŷ = -122 + 4x
Interpretation: For each inch increase in height, predicted weight increases by 4 pounds.
Interpreting Slope
Slope b = change in ŷ per 1-unit increase in x
Template: "For each [1 unit] increase in [x], predicted [y] [increases/decreases] by [|b|] [y-units]."
Example: b = 4 in height/weight
"For each 1-inch increase in height, predicted weight increases by 4 pounds."
Negative slope: "decreases by..."
Interpreting y-Intercept
y-intercept a = predicted y when x = 0
Often meaningless!
- Height = 0 → weight = -122 pounds? Nonsense!
Only interpret if x = 0 is meaningful and within data range
Example where meaningful:
- y = test score, x = hours studied
- a = predicted score with 0 hours studying
Making Predictions
Substitute x into the equation: ŷ = a + bx
Example: Predict weight for height = 70 inches: ŷ = -122 + 4(70) = 158 pounds
Caution: Extrapolation (predicting outside data range) is risky!
Extrapolation
Interpolation: Predict within range of data ✓
Extrapolation: Predict outside range of data ⚠
Problem with extrapolation:
- Relationship may not continue
- May become nonlinear
- Other factors may matter
Example: Predicting weight for height = 100 inches
- Well outside typical range
- Relationship might not hold
- Prediction unreliable
Calculator Method
TI-83/84:
- Enter data in L1 (x) and L2 (y)
- STAT → CALC → 8:LinReg(a+bx)
- Read a, b, r, r² (run DiagnosticOn once if r and r² are not displayed)
Result shows:
- y = a + bx
- r (correlation)
- r² (coefficient of determination)
Properties of Regression Line
1. Passes through (x̄, ȳ)
2. Sum of residuals = 0
- Positive and negative balance out
3. Unique (only one least-squares line)
4. Sensitive to outliers
- One outlier can drastically change line
Residuals
Residual = observed - predicted = y - ŷ
Positive residual: Point above line (underestimate)
Negative residual: Point below line (overestimate)
Zero residual: Point on line (exact prediction)
Example: Actual weight = 160, predicted = 158
- Residual = 160 - 158 = 2 pounds
- Underestimated by 2 pounds
Influential Points
Influential point: Removing it substantially changes regression line
Usually:
- Outliers in x-direction (far from x̄)
- Have high leverage (pull line toward them)
Not all outliers are influential!
- Outlier in y-direction but near x̄ → less influential
Always identify and investigate influential points
Regression Toward the Mean
Phenomenon: Extreme x-values tend to predict less extreme y-values
Why? Correlation < 1 (not perfect relationship)
Example: Very tall parents tend to have children shorter than themselves (still tall, but less extreme)
The slope formula explains it: b = r(s_y/s_x)
- Since |r| < 1, an x one standard deviation above x̄ predicts a y only r standard deviations above ȳ (a smaller-than-proportional change)
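In standardized units the prediction equation becomes ẑ_y = r·z_x, which makes the shrinkage explicit. A sketch with an assumed parent-child correlation of r = 0.5:

```python
r = 0.5          # assumed parent-child height correlation
z_parent = 2.0   # parent is 2 standard deviations above the mean

z_child_hat = r * z_parent   # predicted child height, in SDs above the mean
print(z_child_hat)           # 1.0: still tall, but less extreme
```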
Switching x and y
Regression NOT symmetric!
Different lines:
- Regression of y on x: ŷ = a + bx (minimizes vertical squared deviations)
- Regression of x on y: x̂ = a' + b'y (minimizes horizontal squared deviations)
These are NOT equivalent!
Use: Predict y from x → use y on x line
Common Mistakes
❌ Interpreting the y-intercept when x = 0 is meaningless
❌ Extrapolating beyond data range
❌ Confusing slope units
❌ Thinking regression proves causation
❌ Using regression when relationship nonlinear
Causation Reminder
Regression line can be used for prediction
Does NOT prove causation!
Strong relationship ≠ cause-and-effect
Need: Controlled experiment to establish causation
Quick Reference
Equation: ŷ = a + bx
Slope: b = r(s_y/s_x)
Intercept: a = ȳ - b·x̄
Line passes through: (x̄, ȳ)
Residual: y - ŷ
Least-squares minimizes: Σ(y - ŷ)²
Remember: Regression gives best prediction line but doesn't prove causation. Beware extrapolation! Always check for influential points.
📚 Practice Problems
Problem 1 (medium)
❓ Question:
A study measures hours studied (x) and test scores (y) for 5 students: (2,65), (3,70), (4,75), (5,80), (6,85). Given x̄ = 4, ȳ = 75, calculate the least-squares regression line.
💡 Solution
Step 1: Calculate slope b₁
Formula: b₁ = Σ(x-x̄)(y-ȳ) / Σ(x-x̄)²
Create a table:
| x | y | x-x̄ | y-ȳ | (x-x̄)(y-ȳ) | (x-x̄)² |
|---|----|------|------|--------------|---------|
| 2 | 65 | -2 | -10 | 20 | 4 |
| 3 | 70 | -1 | -5 | 5 | 1 |
| 4 | 75 | 0 | 0 | 0 | 0 |
| 5 | 80 | 1 | 5 | 5 | 1 |
| 6 | 85 | 2 | 10 | 20 | 4 |
Σ(x-x̄)(y-ȳ) = 50 and Σ(x-x̄)² = 10
b₁ = 50/10 = 5
Step 2: Calculate y-intercept b₀
b₀ = ȳ - b₁x̄ = 75 - 5(4) = 75 - 20 = 55
Step 3: Write the equation
ŷ = 55 + 5x
Interpretation: Each additional hour studied predicts a 5-point increase in test score.
Answer: ŷ = 55 + 5x
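The table arithmetic can be double-checked in a few lines; a sketch using the same deviation-sums formula:

```python
xs = [2, 3, 4, 5, 6]
ys = [65, 70, 75, 80, 85]
x_bar = sum(xs) / len(xs)   # 4.0
y_bar = sum(ys) / len(ys)   # 75.0

num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))  # 50.0
den = sum((x - x_bar) ** 2 for x in xs)                       # 10.0
b1 = num / den              # slope: 5.0
b0 = y_bar - b1 * x_bar     # intercept: 55.0
print(b0, b1)
```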
Problem 2 (medium)
❓ Question:
For data with Σx = 50, Σy = 120, Σx² = 350, Σxy = 720, n = 10, find the least-squares regression line.
💡 Solution
Step 1: Calculate means
x̄ = Σx/n = 50/10 = 5
ȳ = Σy/n = 120/10 = 12
Step 2: Calculate slope
Formula: b₁ = [Σxy - n(x̄)(ȳ)] / [Σx² - n(x̄)²]
Numerator: Σxy - n(x̄)(ȳ) = 720 - 10(5)(12) = 720 - 600 = 120
Denominator: Σx² - n(x̄)² = 350 - 10(5)² = 350 - 250 = 100
b₁ = 120/100 = 1.2
Step 3: Calculate intercept
b₀ = ȳ - b₁x̄ = 12 - 1.2(5) = 12 - 6 = 6
Step 4: Write equation
ŷ = 6 + 1.2x
Verification: When x = 5, ŷ = 6 + 6 = 12 = ȳ ✓
Answer: ŷ = 6 + 1.2x
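The computational formula used here translates directly to code (a sketch):

```python
n = 10
sum_x, sum_y, sum_x2, sum_xy = 50, 120, 350, 720

x_bar = sum_x / n   # 5.0
y_bar = sum_y / n   # 12.0

# b1 = [Σxy - n·x̄·ȳ] / [Σx² - n·x̄²]
b1 = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)  # 120/100 = 1.2
b0 = y_bar - b1 * x_bar                                        # 6.0
print(b0, b1)
```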
Problem 3 (easy)
❓ Question:
A regression of car weight (x, in 1000s of lbs) on fuel efficiency (y, mpg) gives ŷ = 45 - 5.2x. Interpret the slope and predict mpg for a 3,500 lb car.
💡 Solution
Step 1: Interpret slope
Slope = -5.2 mpg per 1,000 lbs
Interpretation: "For each additional 1,000 pounds of car weight, fuel efficiency is predicted to DECREASE by 5.2 miles per gallon."
The negative slope makes sense: heavier cars use more fuel.
Step 2: Convert weight to correct units
Car weight = 3,500 lbs = 3.5 thousands of lbs, so x = 3.5
Step 3: Make prediction
ŷ = 45 - 5.2(3.5) = 45 - 18.2 = 26.8 mpg
Step 4: Complete interpretation
"A car weighing 3,500 pounds is predicted to have fuel efficiency of approximately 26.8 miles per gallon."
Answer: Each 1,000-lb increase predicts a 5.2-mpg decrease; predicted fuel efficiency is 26.8 mpg.
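The unit conversion is the step most often missed; a sketch that builds it in:

```python
def predicted_mpg(weight_lb):
    """Predict mpg from ŷ = 45 - 5.2x, where x is weight in 1000s of lbs."""
    x = weight_lb / 1000        # convert lbs to the model's units
    return 45 - 5.2 * x

print(round(predicted_mpg(3500), 1))   # 26.8
```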
Problem 4 (medium)
❓ Question:
Given x̄ = 15, ȳ = 240, sₓ = 4, sᵧ = 60, r = 0.75, find the regression line using b₁ = r(sᵧ/sₓ).
💡 Solution
Step 1: Calculate slope
Formula: b₁ = r(sᵧ/sₓ)
b₁ = 0.75 × (60/4) = 0.75 × 15 = 11.25
Step 2: Calculate y-intercept
b₀ = ȳ - b₁x̄ = 240 - 11.25(15) = 240 - 168.75 = 71.25
Step 3: Write regression equation
ŷ = 71.25 + 11.25x
Interpretation: Each 1-unit increase in x predicts an 11.25-unit increase in y.
Answer: ŷ = 71.25 + 11.25x
Problem 5 (medium)
❓ Question:
The regression of temperature (°F) vs ice cream sales ($) is ŷ = -2 + 0.8x. Is it appropriate to predict sales when temp = 0°F? Explain.
💡 Solution
Step 1: Make the prediction
ŷ = -2 + 0.8(0) = -2
This predicts -$2 in sales, which is IMPOSSIBLE!
Step 2: Identify the problem
This is EXTRAPOLATION: predicting outside the data range.
Issues:
- Temperature of 0°F likely outside original data range
- Linear relationship may not hold at extremes
- Model gives nonsensical result (negative sales)
- Y-intercept is just a mathematical constant, not meaningful here
Step 3: Proper approach
Only use regression for INTERPOLATION (within the data range). If the data were collected at 60-100°F, only predict in that range.
Answer: NO - This is inappropriate extrapolation resulting in an impossible prediction. Only use regression within the range of observed x-values.