Inference for Regression

Confidence intervals and tests for slope

Beyond Description

So far: Described relationship in sample data

Now: Make inferences about population relationship

  • Confidence interval for slope
  • Hypothesis test for slope
  • Prediction intervals

Conditions for Inference (LINE)

L - Linear relationship: Check scatterplot

I - Independent observations: Random sample, n < 10% of population size N

N - Normal distribution of residuals: Check histogram/normal plot of residuals

E - Equal variance: Check residual plot (constant spread)

Must check all before inference!

Slope as Parameter

Sample: b = slope from data

Population: β (beta) = true slope in population

Question: Is there really a relationship, or did we just see pattern by chance?

Hypothesis Test for Slope

Hypotheses:

  • H₀: β = 0 (no linear relationship)
  • Hₐ: β ≠ 0 (linear relationship exists)

If β = 0: x has no effect on y

Test statistic:

t = \frac{b - 0}{SE_b}

df = n - 2

SE_b (standard error of slope): Provided by calculator/computer

Example 1: Test for Slope

Height (x) and weight (y), n = 25:

b = 4, SE_b = 1.2

STATE:

  • β = true slope
  • H₀: β = 0
  • Hₐ: β ≠ 0
  • α = 0.05

PLAN:

  • t-test for slope
  • Conditions: LINE all checked ✓

DO:

t = \frac{4 - 0}{1.2} \approx 3.33

df = 25 - 2 = 23

P-value = 2 × P(t > 3.33) ≈ 0.003

CONCLUDE: P-value < 0.05, reject H₀. Significant linear relationship between height and weight.
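The DO step can be checked with a few lines of code. This is a sketch using only Python's standard library; rather than computing an exact P-value (which needs a t-distribution routine), it compares the t statistic against the two-sided 95% critical value t* = 2.069 for df = 23.

```python
# Hypothesis test for the slope: t = (b - 0) / SE_b, df = n - 2
b, se_b, n = 4, 1.2, 25

t = (b - 0) / se_b          # test statistic
df = n - 2                  # degrees of freedom for regression

# Two-sided 95% critical value for df = 23 (from a t-table)
t_star = 2.069

print(f"t = {t:.2f}, df = {df}")          # t = 3.33, df = 23
print("reject H0" if abs(t) > t_star else "fail to reject H0")
```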

Confidence Interval for Slope

Formula:

b \pm t^* SE_b

df = n - 2

Interpretation: "We are C% confident the true slope is between [L] and [U]."

Meaning: For each unit increase in x, y changes by between L and U units (on average in population)

Example 2: CI for Slope

Same data: b = 4, SE_b = 1.2, n = 25

95% CI:

df = 23, t* ≈ 2.069

CI = 4 \pm 2.069(1.2) = 4 \pm 2.48 = (1.52, 6.48)

Interpretation: "We are 95% confident that for each additional inch of height, weight increases by between 1.52 and 6.48 pounds on average."
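Running the same numbers through the CI formula takes only a few lines; a minimal sketch (standard library only, with t* = 2.069 read from a t-table):

```python
# 95% confidence interval for the slope: b ± t* · SE_b
b, se_b, n = 4, 1.2, 25
t_star = 2.069              # df = n - 2 = 23, 95% confidence

me = t_star * se_b          # margin of error ≈ 2.48
lo, hi = b - me, b + me

print(f"95% CI: ({lo:.2f}, {hi:.2f})")   # (1.52, 6.48)
```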

Relationship Between Test and CI

For two-sided test at α:

Check whether the (1 − α)·100% CI contains 0:

  • If 0 in CI → fail to reject H₀
  • If 0 not in CI → reject H₀

Example: 95% CI is (1.52, 6.48)

  • Doesn't contain 0
  • Reject H₀: β = 0 at α = 0.05

Prediction Interval

Different from confidence interval!

Confidence interval: For mean response
Prediction interval: For individual response

Prediction interval is wider (more uncertainty predicting individual)

Formula (approximate):

\hat{y} \pm t^* s

Where s = standard deviation of residuals

More precise formula accounts for:

  • Distance of x from \bar{x} (farther = wider interval)
  • Sample size

Example 3: Prediction Interval

Predict weight for height = 70:

\hat{y} = 158, s = 10, n = 25

95% prediction interval (rough):

158 \pm 2.069(10) = 158 \pm 20.69 = (137.31, 178.69)

Interpretation: "We predict an individual with height 70 inches will weigh between 137 and 179 pounds with 95% confidence."
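The rough interval is easy to reproduce; a sketch of the approximate formula ŷ ± t*·s (which ignores the distance-from-x̄ correction mentioned above):

```python
# Approximate 95% prediction interval: ŷ ± t* · s
y_hat, s, n = 158, 10, 25
t_star = 2.069              # df = n - 2 = 23, 95% confidence

me = t_star * s             # ≈ 20.69
lo, hi = y_hat - me, y_hat + me

print(f"95% PI (approx): ({lo:.2f}, {hi:.2f})")   # (137.31, 178.69)
```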

Standard Error of Slope

Formula:

SE_b = \frac{s}{\sqrt{\sum(x - \bar{x})^2}}

Where s = standard deviation of residuals

Factors making SE_b smaller:

  1. Smaller s (points closer to line)
  2. Larger sample size n
  3. More spread in x-values

Smaller SE_b → narrower CI → more precise estimate
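These factors can be seen numerically. The sketch below (standard library only, hypothetical data) fits the least-squares line, computes SE_b from the formula above, then doubles the spread of the x-values while keeping the same fit quality; the standard error is cut in half:

```python
from math import sqrt

def slope_and_se(xs, ys):
    """Least-squares slope b and its standard error SE_b = s / sqrt(Sxx)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s = sqrt(sse / (n - 2))          # std dev of residuals, df = n - 2
    return b, s / sqrt(sxx)

# Hypothetical data, roughly y = 2x
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

b1, se1 = slope_and_se(xs, ys)
b2, se2 = slope_and_se([2 * x for x in xs], ys)   # same y, doubled x-spread

print(f"SE_b = {se1:.4f}; with doubled x-spread: {se2:.4f}")
```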

Checking Conditions

Linearity:

  • Scatterplot roughly linear
  • Residual plot shows no curve

Independence:

  • Random sample
  • No time trends
  • Each observation independent

Normality:

  • Histogram of residuals roughly normal
  • Normal probability plot roughly linear
  • Less critical for large n (CLT)

Equal Variance:

  • Residual plot shows constant spread
  • No fan shape

What if Conditions Not Met?

Nonlinear: Transform variables or use nonlinear methods

Not normal (small n): Be cautious with inference

Not equal variance: Consider transformation or weighted regression

Not independent: Use time series or other methods

Don't ignore violations! Inference may be invalid

Prediction vs Confidence Interval

Confidence Interval for Mean Response:

  • "Average y for all individuals with x = x₀"
  • Narrower
  • Use: Policy decisions, understanding average effect

Prediction Interval for Individual:

  • "Single y value for one individual with x = x₀"
  • Wider (includes individual variability)
  • Use: Predicting specific outcome

Always wider: Prediction interval > confidence interval
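The width difference falls out of the exact standard-error formulas (which AP Stats leaves to software): the SE for a mean response is s·sqrt(1/n + (x₀ − x̄)²/Sxx), while the prediction SE adds 1 under the square root for individual variability. A sketch with hypothetical values for x̄ and Sxx (not from the examples above):

```python
from math import sqrt

# Hypothetical regression summary (x_bar and sxx are assumed values)
s, n = 10, 25          # residual std dev, sample size
x_bar, sxx = 68, 200   # mean of x, sum of (x - x̄)²
x0 = 70                # x-value at which we predict
t_star = 2.069         # df = n - 2 = 23, 95% confidence

se_mean = s * sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)       # mean response
se_pred = s * sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)   # individual

print(f"CI half-width: {t_star * se_mean:.2f}")   # narrower
print(f"PI half-width: {t_star * se_pred:.2f}")   # always wider
```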

Multiple Regression Preview

So far: One explanatory variable

Multiple regression: Several explanatory variables

\hat{y} = a + b_1x_1 + b_2x_2 + \dots + b_kx_k

Can test each slope: Does this variable help predict y (controlling for others)?

Beyond AP Statistics, but important to know it exists

Common Mistakes

❌ Not checking LINE conditions
❌ Using normal instead of t-distribution
❌ Confusing prediction and confidence intervals
❌ Using df = n instead of n - 2
❌ Making inference when conditions violated

Practical Significance

Statistical significance (P < 0.05) doesn't mean practical importance

Example: Slope = 0.01, P = 0.001

  • Statistically significant
  • But is 0.01 change per unit practically meaningful?

Consider:

  • Effect size (magnitude of slope)
  • Context
  • Practical implications

Quick Reference

Test for slope: t = \frac{b}{SE_b}, df = n - 2

CI for slope: b \pm t^* SE_b

Conditions: LINE (Linear, Independent, Normal, Equal variance)

Prediction interval: Wider than confidence interval

0 in CI for slope? → fail to reject H₀ (no significant linear relationship)

Remember: Check LINE conditions before inference! Inference lets us extend conclusions beyond our sample to the broader population, but only if conditions are met.

📚 Practice Problems

Problem 1 (medium)

Question:

A regression of study hours (x) on test scores (y) gives slope b₁ = 5.2 with SE = 1.3, n = 20. Construct a 95% confidence interval for the true slope β₁.

💡 Solution

Step 1: Identify given information
  Slope: b₁ = 5.2
  Standard error: SE = 1.3
  Sample size: n = 20
  Confidence level: 95%

Step 2: Find degrees of freedom
  df = n - 2 = 20 - 2 = 18 (use n - 2 for regression, not n - 1)

Step 3: Find t* critical value
  From t-table with df = 18, 95% confidence: t* = 2.101

Step 4: Calculate margin of error
  ME = t* × SE = 2.101 × 1.3 ≈ 2.73

Step 5: Construct confidence interval
  CI = b₁ ± ME = 5.2 ± 2.73 = (2.47, 7.93)

Step 6: Interpret
  "We are 95% confident that for each additional hour studied, the true mean increase in test score is between 2.47 and 7.93 points."

Note: Since 0 is NOT in the interval, there is significant evidence of a positive relationship (we can reject H₀: β₁ = 0).

Answer: 95% CI: (2.47, 7.93) points per hour
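The arithmetic in Steps 4–5 can be verified with a few lines (standard library only):

```python
# Verify Problem 1: 95% CI for the slope, b ± t* · SE
b, se, n = 5.2, 1.3, 20
t_star = 2.101              # df = n - 2 = 18, 95% confidence

me = t_star * se            # margin of error ≈ 2.73
print(f"CI: ({b - me:.2f}, {b + me:.2f})")   # (2.47, 7.93)
```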

Problem 2 (medium)

Question:

Test H₀: β₁ = 0 vs Hₐ: β₁ ≠ 0 given b₁ = 3.5, SE = 1.2, n = 25, α = 0.05.

💡 Solution

Step 1: Set up hypotheses
  H₀: β₁ = 0 (no relationship)
  Hₐ: β₁ ≠ 0 (relationship exists)
  Two-tailed test, α = 0.05

Step 2: Check conditions (LINE)
  Linear: Assume scatterplot is linear ✓
  Independent: Random sample, n < 10% of population ✓
  Normal: Residuals approximately normal ✓
  Equal variance: Residual plot shows constant spread ✓

Step 3: Calculate test statistic
  df = n - 2 = 25 - 2 = 23
  t = (b₁ - 0)/SE = 3.5/1.2 ≈ 2.917

Step 4: Find p-value
  From t-table with df = 23, two-tailed: t = 2.917 falls between t = 2.807 (p = 0.01) and t = 3.767 (p = 0.001)
  So 0.001 < p-value < 0.01; more precisely, p-value ≈ 0.0077

Step 5: Make decision
  p-value (0.0077) < α (0.05), so REJECT H₀

Step 6: Conclusion in context
  "There is significant evidence (p ≈ 0.008) that a linear relationship exists between x and y. The slope is significantly different from zero."

Answer: t ≈ 2.92, p-value ≈ 0.008. Reject H₀. Significant evidence of a linear relationship.
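A quick check of the test statistic and the table bracketing used above (2.807 and 3.767 are the two-tailed p = 0.01 and p = 0.001 cutoffs for df = 23):

```python
# Verify Problem 2: t statistic and p-value bracket
b, se, n = 3.5, 1.2, 25
t = (b - 0) / se            # ≈ 2.917
df = n - 2                  # 23

# Two-tailed cutoffs for df = 23 from a t-table
t_p01, t_p001 = 2.807, 3.767

print(f"t = {t:.3f}, df = {df}")
if t_p01 < t < t_p001:
    print("0.001 < p-value < 0.01 -> reject H0 at alpha = 0.05")
```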

Problem 3 (medium)

Question:

What are the conditions (LINE) for inference in regression? Explain each briefly.

💡 Solution

The LINE conditions for regression inference:

L - LINEAR: The relationship between x and y is linear.
  Check: scatterplot shows a linear pattern; residual plot shows no curve.

I - INDEPENDENT: Observations are independent.
  Check: random sampling; n < 10% of population (if sampling without replacement); no time series or repeated measures.

N - NORMAL: Residuals are approximately normally distributed.
  Check: histogram or normal probability plot of residuals. Not critical if n is large (n ≥ 30); just need no strong skewness or outliers.

E - EQUAL VARIANCE (also called homoscedasticity): Variability of y is constant for all x.
  Check: residual plot shows roughly equal vertical spread; no fan shape or other pattern.

Why these matter:

  • LINEAR: For model to be appropriate
  • INDEPENDENT: For formulas to be valid
  • NORMAL: For t-distribution to apply (especially small samples)
  • EQUAL VARIANCE: For standard errors to be correct

If violations:

  • Not linear → transform or use nonlinear model
  • Not independent → use different methods (time series, etc.)
  • Not normal → okay if n ≥ 30; otherwise transform
  • Not equal variance → transform or use weighted regression

Answer: LINE = Linear relationship, Independent observations, Normal residuals, Equal variance. Check using scatterplot, residual plot, and normal probability plot.

Problem 4 (medium)

Question:

Computer output shows: b₁ = 2.4, SE(b₁) = 0.8, t = 3.0, p = 0.006, n = 22. Interpret the p-value in context.

💡 Solution

Step 1: Identify the test
  Testing H₀: β₁ = 0 (no relationship) against Hₐ: β₁ ≠ 0 (relationship exists)
  Given: p-value = 0.006

Step 2: What the p-value means statistically
  The probability of observing a slope as extreme as 2.4 (or more extreme) IF the true slope is actually 0.

Step 3: Interpret in context
  "If there were truly no linear relationship between x and y (β₁ = 0), the probability of obtaining a sample slope of 2.4 or more extreme (in either direction) is 0.006, or 0.6%."

Step 4: Practical interpretation
  This is very unlikely (less than a 1% chance), so there is strong evidence against H₀. The relationship is statistically significant.

Step 5: Decision at α = 0.05
  Since p-value (0.006) < α (0.05): REJECT H₀
  "There is strong evidence of a significant linear relationship. The slope is significantly different from zero (p = 0.006)."

Step 6: What this does NOT mean
  ✗ Does not mean the slope is definitely 2.4
  ✗ Does not mean x causes y
  ✗ Does not mean the model fits well (could still have problems)
  ✓ Only means: the slope is significantly different from zero

Answer: If true slope were 0, probability of getting b₁ = 2.4 or more extreme is only 0.006. This provides strong evidence the slope is not zero - there is a significant linear relationship.

Problem 5 (hard)

Question:

Why do we use t-distribution with df = n-2 for regression inference instead of df = n-1?

💡 Solution

Step 1: Compare to a one-sample t-test

One-sample t-test: df = n - 1

  • Estimate 1 parameter: μ
  • Lose 1 df

Regression: df = n - 2

  • Estimate 2 parameters: β₀ AND β₁
  • Lose 2 df

Step 2: What we're estimating

In regression, we estimate:

  1. Intercept (β₀)
  2. Slope (β₁)

Both use up degrees of freedom!

Step 3: Degrees of freedom explained

Start with n observations:

  • Use one to estimate β₀ (intercept)
  • Use one to estimate β₁ (slope)
  • Left with n - 2 for error estimation

df = n - 2

Step 4: Why it matters

Smaller df → larger t* critical values → wider CIs

Example: n = 10, 95% confidence

  • One-sample (df = 9): t* = 2.262
  • Regression (df = 8): t* = 2.306

Regression CI slightly wider (more uncertainty).

Step 5: As n increases

For large n, the difference is minimal:

  • df = 100 vs 98 → nearly same t*
  • Both approach z* = 1.96

Step 6: General pattern

Degrees of freedom = n - (number of parameters estimated)

  • Mean only: n - 1
  • Regression: n - 2
  • Multiple regression with k predictors: n - k - 1

Answer: We estimate TWO parameters (β₀ and β₁), so we lose 2 degrees of freedom, giving df = n - 2. This accounts for the extra uncertainty from estimating both intercept and slope.
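The general pattern in Step 6 is simple enough to encode; a one-line sketch:

```python
def regression_df(n, k=1):
    """Degrees of freedom: n minus (k slopes + 1 intercept) estimated."""
    return n - (k + 1)

print(regression_df(20))        # simple regression, n = 20 -> 18
print(regression_df(25))        # simple regression, n = 25 -> 23
print(regression_df(30, k=3))   # 3 predictors, n = 30 -> 26
```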