Coefficient of Determination

Understanding r-squared

What is r²?

Coefficient of Determination (r²): The proportion of the variability in y that is explained by the linear relationship with x

Formula:

$$r^2 = (r)^2$$

Where r is the correlation coefficient

Range: 0 ≤ r² ≤ 1 (or 0% to 100%)
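As a minimal sketch of the formula above, the following Python computes r from paired data and then squares it. The data values and the helper name `correlation` are hypothetical, chosen only for illustration.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient r for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: hours studied and test scores
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 77]

r = correlation(hours, scores)
r_squared = r ** 2  # coefficient of determination
print(f"r = {r:.3f}, r^2 = {r_squared:.3f}")
```

Whatever data are used, the squared value always lands between 0 and 1.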

Interpreting r²

Template: "About [r² × 100]% of the variability in [y] is explained by the linear relationship with [x]."

Example: r² = 0.64

"About 64% of the variability in test scores is explained by the linear relationship with study hours."

Remaining variability (1 - r²):

  • Due to other variables
  • Random variation
  • Unexplained by this model

Example 1: Calculating r²

Height and weight: r = 0.8

$$r^2 = (0.8)^2 = 0.64$$

Interpretation: "About 64% of the variability in weight is explained by the linear relationship with height. The remaining 36% is due to other factors."
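The interpretation template can be filled in mechanically. The sketch below assumes a hypothetical helper, `interpret_r_squared`, that takes r and the two variable names; it is only a convenience for practice, not a standard function.

```python
def interpret_r_squared(r, y_name, x_name):
    """Fill in the interpretation template from a correlation r."""
    percent = round(r ** 2 * 100)
    return (f"About {percent}% of the variability in {y_name} is explained "
            f"by the linear relationship with {x_name}.")

print(interpret_r_squared(0.8, "weight", "height"))
```

Passing r = -0.8 produces the same sentence, which is a reminder that r² carries no information about direction.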

r² vs r

r (correlation):

  • Shows strength AND direction
  • Range: -1 to 1
  • Negative values meaningful

r² (coefficient of determination):

  • Shows strength only (no direction)
  • Range: 0 to 1
  • Never negative (can equal 0)
  • Easier to interpret as percentage

From r² alone, you cannot determine whether the relationship is positive or negative!

  • Need to also report r or slope
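To see why the sign is lost, the sketch below builds two hypothetical data sets that are mirror images of each other. One slopes upward and one slopes downward, yet both give the same r². The data and the helper name `correlation` are assumptions made for illustration.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient r for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
y_up = [2, 4, 5, 4, 5]          # tends to increase with x
y_down = [-v for v in y_up]     # same data flipped, tends to decrease with x

r_up = correlation(x, y_up)
r_down = correlation(x, y_down)

print(f"r_up = {r_up:.3f},   r^2 = {r_up ** 2:.3f}")
print(f"r_down = {r_down:.3f}, r^2 = {r_down ** 2:.3f}")
```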

What r² Means

r² = 0.90: Model explains 90% of variability (excellent fit)

r² = 0.70: Model explains 70% of variability (good fit)

r² = 0.50: Model explains 50% of variability (moderate fit)

r² = 0.25: Model explains 25% of variability (weak fit)

r² = 0: Model explains none of variability (no linear relationship)

Note: These are rough guidelines; what counts as a good fit depends on context!

Visualizing r²

Think of variability in y:

Total variability: How much the y-values spread out from $\bar{y}$

Explained variability: How much $\hat{y}$ varies (due to linear relationship)

Unexplained variability: How much points deviate from line (residuals)

$$\text{Total} = \text{Explained} + \text{Unexplained}$$

$$r^2 = \frac{\text{Explained}}{\text{Total}}$$

Formal Definition

$$r^2 = \frac{\sum(\hat{y} - \bar{y})^2}{\sum(y - \bar{y})^2}$$

Numerator: Variability in predictions
Denominator: Total variability in y

Equivalently:

$$r^2 = 1 - \frac{\sum(y - \hat{y})^2}{\sum(y - \bar{y})^2}$$

$$r^2 = 1 - \frac{\text{Unexplained}}{\text{Total}}$$
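The two forms of the definition can be checked numerically. The sketch below fits a least-squares line to a small hypothetical data set, computes the total, explained, and unexplained sums of squares, and confirms that both formulas give the same r². The helper name `fit_line` and the data are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = fit_line(x, y)
y_hat = [a + b * xi for xi in x]
y_bar = sum(y) / len(y)

total = sum((yi - y_bar) ** 2 for yi in y)                    # spread of y around its mean
explained = sum((yh - y_bar) ** 2 for yh in y_hat)            # spread of the predictions
unexplained = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat)) # squared residuals

print(f"total = {total:.2f}, explained = {explained:.2f}, unexplained = {unexplained:.2f}")
print(f"explained / total       = {explained / total:.3f}")
print(f"1 - unexplained / total = {1 - unexplained / total:.3f}")
```

The decomposition Total = Explained + Unexplained holds here because the line was fit by least squares with an intercept.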

Example 2: Detailed Calculation

Data: 5 points with $\bar{y} = 10$

Total variability: $\sum(y - \bar{y})^2 = 100$

Unexplained (residuals): $\sum(y - \hat{y})^2 = 25$

$$r^2 = 1 - \frac{25}{100} = 1 - 0.25 = 0.75$$

Interpretation: 75% of variability explained, 25% unexplained

What r² Does NOT Mean

❌ r² is NOT probability

  • Not "probability model is correct"
  • Not "probability prediction is right"

❌ r² does NOT prove causation

  • High r² doesn't mean x causes y
  • Could be coincidence or confounding

❌ r² alone doesn't guarantee good model

  • Could have high r² but residuals show pattern
  • Always check residual plot!

❌ r² doesn't tell about prediction accuracy for individuals

  • Use s (standard error) for that
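The warning about residual patterns can be illustrated with data that follow a curve. In the sketch below, y = x² for x from 1 to 10 (a made-up example), and a straight line is fit anyway. The helper name `fit_line` is an assumption.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b

x = list(range(1, 11))
y = [xi ** 2 for xi in x]   # curved (quadratic) relationship

a, b = fit_line(x, y)
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
y_bar = sum(y) / len(y)
sse = sum(e ** 2 for e in residuals)
sst = sum((yi - y_bar) ** 2 for yi in y)
r_squared = 1 - sse / sst

print(f"r^2 = {r_squared:.3f}")
print("residuals:", [round(e, 1) for e in residuals])
```

The r² is high, yet the residuals are positive at both ends and negative in the middle. That curved pattern signals that a line is the wrong model even though r² looks good.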

When is r² High?

High r² occurs when:

  1. Strong linear relationship (|r| close to 1)
  2. Points close to regression line
  3. Little unexplained variability
  4. x is good predictor of y

Does NOT require:

  • Large sample size (can have high r² with small n)
  • Causation
  • Practical importance

When is r² Low?

Low r² occurs when:

  1. Weak linear relationship
  2. Lots of scatter around line
  3. Much unexplained variability
  4. x is poor predictor of y

Possible reasons:

  • No relationship exists
  • Relationship is nonlinear
  • Other variables more important
  • High natural variability in y
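A low r² does not always mean there is no relationship. In the sketch below, y is completely determined by x through y = x², but because the x-values are symmetric around zero the linear correlation is zero. The data and the helper name `correlation` are hypothetical.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient r for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]   # perfect nonlinear relationship

r = correlation(x, y)
print(f"r = {r:.3f}, r^2 = {r ** 2:.3f}")
```

A scatterplot would reveal the U-shape immediately, which is why r² should be read alongside a plot.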

Comparing Models

Use r² to compare models on same data:

Model 1: Height predicting weight, r² = 0.64
Model 2: Age predicting weight, r² = 0.45

Conclusion: Height explains more variability (better predictor)

Caution: Only compare r² for same response variable!
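A comparison like this can be sketched in code. The data below are made up and do not reproduce the 0.64 and 0.45 above; the point is only that both predictors are scored against the same response variable. The helper name `r_squared` is an assumption.

```python
def r_squared(xs, ys):
    """r^2 for a simple linear regression of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)

# Hypothetical measurements for the same five people
weight = [60, 65, 70, 72, 80]        # response variable (same in both models)
height = [160, 165, 170, 175, 180]   # predictor for Model 1
age = [30, 25, 40, 22, 35]           # predictor for Model 2

r2_height = r_squared(height, weight)
r2_age = r_squared(age, weight)

print(f"height -> weight: r^2 = {r2_height:.3f}")
print(f"age    -> weight: r^2 = {r2_age:.3f}")
```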

Adjusted r²

For multiple regression (multiple explanatory variables)

Problem: Adding variables never decreases r², and almost always increases it (even for useless variables!)

Adjusted r²: Penalizes for number of variables

$$r_{adj}^2 = 1 - \frac{(1-r^2)(n-1)}{n-k-1}$$

Where n = number of observations and k = number of explanatory variables

Use: Compare models with different numbers of variables
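The adjusted r² formula is simple to evaluate directly. In the sketch below, the function name `adjusted_r_squared` and the example values (r² = 0.64, n = 30) are assumptions for illustration.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted r^2 for n observations and k explanatory variables."""
    if n - k - 1 <= 0:
        raise ValueError("need n > k + 1 observations")
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# Same r^2 and sample size, different numbers of explanatory variables
for k in (1, 3, 5):
    print(f"k = {k}: adjusted r^2 = {adjusted_r_squared(0.64, 30, k):.4f}")
```

For a fixed r² and n, the adjusted value drops as k grows, so an extra variable has to raise r² enough to offset the penalty.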

Relationship to Standard Error

Related concepts:

r²: Proportion of variability explained

s: Typical prediction error (in original units)

Both measure model fit:

  • High r² ↔ small s (relative to the spread of y)
  • Low r² ↔ large s (relative to the spread of y)

s often more useful for predictions (gives actual error magnitude)
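The link between the two measures can be sketched numerically. The code below assumes the usual definition for simple linear regression, s = sqrt(SSE / (n - 2)), which this lesson does not state explicitly, and uses a small hypothetical data set. Because SSE = (1 - r²) × SST, the same s can be recovered from r².

```python
import math

def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

a, b = fit_line(x, y)
y_bar = sum(y) / n
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
sst = sum((yi - y_bar) ** 2 for yi in y)

r_squared = 1 - sse / sst
s = math.sqrt(sse / (n - 2))                          # typical residual size, in units of y
s_from_r2 = math.sqrt((1 - r_squared) * sst / (n - 2))

print(f"r^2 = {r_squared:.3f}, s = {s:.3f}, s from r^2 = {s_from_r2:.3f}")
```

For a given data set, a larger r² forces a smaller s, but s is reported in the units of y while r² is unitless.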

Common Mistakes

❌ Saying "r² is probability"
❌ Thinking high r² proves causation
❌ Using r² alone without checking residual plot
❌ Comparing r² across different response variables
❌ Not reporting direction of relationship (r² loses sign)

Practical Significance

Statistical vs Practical:

High r² in context:

  • Social sciences: r² > 0.50 often considered good
  • Physical sciences: r² > 0.90 often expected
  • Individual predictions: Even r² = 0.90 may not be precise enough

Consider:

  • What's typical in your field?
  • What's needed for practical use?
  • What's the cost of prediction errors?

Reporting Results

Complete report includes:

  1. Correlation (r): Shows direction and strength
  2. r²: Shows percent variability explained
  3. Equation: $\hat{y} = a + bx$
  4. Standard error (s): Typical prediction error
  5. Residual plot: Visual check of model appropriateness

Don't report r² alone!

Quick Reference

r²: Proportion of variability in y explained by x

Formula: r² = (correlation)²

Range: 0 to 1 (0% to 100%)

Interpretation: "[r² × 100]% of variability in y explained by linear relationship with x"

High r²: Good fit, points close to line
Low r²: Poor fit, much unexplained variability

Remember: r² measures how well x predicts y, but doesn't prove causation. Always check residual plot! High r² alone doesn't guarantee good model.

📚 Practice Problems

Problem 1 (easy)

Question:

A regression has correlation r = 0.8. Calculate and interpret R².

Solution:

Step 1: Calculate R²
Formula: R² = r²

R² = (0.8)² = 0.64

Step 2: Express as percentage
R² = 0.64 = 64%

Step 3: Interpret
"64% of the variability in y is explained by the linear relationship with x."

The remaining 36% is unexplained variation (random error, other factors).

Step 4: Implications
R² = 0.64 suggests:

  • Strong relationship (64% explained)
  • Model captures most of pattern
  • Useful for predictions
  • But 36% still unexplained

Answer: R² = 0.64 or 64%. This means 64% of the variation in y is explained by the linear relationship with x.

Problem 2 (easy)

Question:

Model A has R² = 0.85, Model B has R² = 0.45. Both predict the same response variable from the same data. Which is better for predictions?

Solution:

Step 1: Compare R² values
Model A: R² = 0.85 = 85% explained
Model B: R² = 0.45 = 45% explained

Step 2: Model A interpretation

  • 85% of variation explained
  • Very strong relationship
  • Only 15% unexplained
  • More accurate predictions

Step 3: Model B interpretation

  • 45% of variation explained
  • Moderate relationship
  • 55% unexplained
  • Less accurate predictions

Step 4: Conclusion
Model A is BETTER because:

  • More variation explained (85% vs 45%)
  • Smaller residuals on average
  • More reliable predictions
  • Stronger relationship

Answer: Model A is better. It explains 85% of variation versus only 45% for Model B, meaning more accurate predictions.

Problem 3 (medium)

Question:

A regression has R² = 0.49. What is the correlation r? Can you determine the sign?

Solution:

Step 1: Calculate |r|
R² = r²
0.49 = r²
r = ±√0.49 = ±0.7

So |r| = 0.7

Step 2: Determine sign
From R² ALONE, cannot determine sign!

Both r = +0.7 and r = -0.7 give R² = 0.49

Step 3: How to find sign
Need additional information:

  • Look at slope (same sign as r)
  • Look at scatterplot direction
  • Context (should relationship be positive or negative?)

Step 4: Why R² loses sign
R² = r² means squaring eliminates sign:
(+0.7)² = 0.49
(-0.7)² = 0.49

Answer: |r| = 0.7, but CANNOT determine sign from R² alone. Need slope sign or scatterplot to determine if r = +0.7 or -0.7.
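The reasoning in this solution can be sketched as a small function: the square root gives |r|, and the slope of the regression line supplies the sign. The function name `r_from_r_squared` and the example slopes are hypothetical.

```python
import math

def r_from_r_squared(r_squared, slope):
    """Recover r from R^2 using the sign of the regression slope."""
    return math.copysign(math.sqrt(r_squared), slope)

print(r_from_r_squared(0.49, slope=2.5))    # positive slope, so r is positive
print(r_from_r_squared(0.49, slope=-2.5))   # negative slope, so r is negative
```

Without the slope (or a scatterplot), only the magnitude 0.7 is known.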

Problem 4 (medium)

Question:

Explain why R² must be between 0 and 1.

Solution:

Step 1: R² definition
R² = r² = (correlation)²

Step 2: Why R² ≥ 0
Any number squared is non-negative:

  • Even negative r gives positive R²
  • (-0.7)² = 0.49 ≥ 0
  • Minimum R² = 0 (no relationship)

Step 3: Why R² ≤ 1
Correlation is bounded: -1 ≤ r ≤ 1

Squaring preserves this:

  • Maximum |r| = 1
  • Maximum r² = 1² = 1
  • Cannot exceed 100% of variation

Step 4: Interpretation
R² = 0: No linear relationship (0% explained)
R² = 1: Perfect linear relationship (100% explained)

You cannot explain less than 0% or more than 100%!

Step 5: If you see R² = 1.5 or R² = -0.3
CALCULATION ERROR! Recheck your work.

Answer: R² must be 0 ≤ R² ≤ 1 because it equals r² (always non-negative) and correlation is bounded by -1 ≤ r ≤ 1. Cannot explain less than 0% or more than 100% of variation.

Problem 5 (medium)

Question:

A model has SST = 500 and SSE = 125. Calculate and interpret R².

Solution:

Step 1: Understand sum of squares
SST = Total Sum of Squares = total variation
SSE = Sum of Squared Errors = unexplained variation
SSR = Regression Sum of Squares = explained variation

Relationship: SST = SSR + SSE

Step 2: Calculate SSR
SSR = SST - SSE
SSR = 500 - 125 = 375

Step 3: Calculate R²
Formula: R² = SSR/SST

R² = 375/500 = 0.75

Alternative: R² = 1 - SSE/SST = 1 - 125/500 = 1 - 0.25 = 0.75 ✓

Step 4: Interpret
R² = 0.75 = 75%

"75% of the total variation in y is explained by the regression model."

Explained: 375/500 = 75%
Unexplained: 125/500 = 25%

Answer: R² = 0.75 or 75%. The model explains 375 out of 500 total units of variation, leaving 125 units (25%) unexplained.
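The same calculation can be written as a short function. The name `r_squared_from_sums` is hypothetical; it takes SST and SSE and returns R² together with SSR.

```python
def r_squared_from_sums(sst, sse):
    """Return (R^2, SSR) given total and error sums of squares."""
    if sst <= 0:
        raise ValueError("SST must be positive")
    ssr = sst - sse            # explained variation
    return ssr / sst, ssr

r2, ssr = r_squared_from_sums(sst=500, sse=125)
print(f"SSR = {ssr}, R^2 = {r2}")
```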