Coefficient of Determination
Understanding r-squared
Coefficient of Determination (r²)
What is r²?
Coefficient of Determination (r²): Proportion of variability in y explained by linear relationship with x
Formula: r² = (r)²
Where r is the correlation coefficient
Range: 0 ≤ r² ≤ 1 (or 0% to 100%)
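As a minimal sketch of the definition, the snippet below computes r from paired data and then squares it. The data (hours studied and test scores) are made up for illustration, and `correlation` is a helper written here, not a library function.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: hours studied (x) and test score (y)
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 72]

r = correlation(hours, scores)
r_squared = r ** 2  # coefficient of determination
print(f"r = {r:.3f}, r² = {r_squared:.3f}")
```

Because r always lies between -1 and 1, the squared value always lands between 0 and 1.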
Interpreting r²
Template: "About [r² × 100]% of the variability in [y] is explained by the linear relationship with [x]."
Example: r² = 0.64
"About 64% of the variability in test scores is explained by the linear relationship with study hours."
Remaining variability (1 - r²):
- Due to other variables
- Random variation
- Unexplained by this model
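The interpretation template can be sketched as a small helper that fills in the blanks. The function name `interpret_r_squared` is hypothetical and exists only for this illustration.

```python
def interpret_r_squared(r_squared, y_name, x_name):
    """Fill in the interpretation template for a given r²."""
    if not 0 <= r_squared <= 1:
        raise ValueError("r² must be between 0 and 1")
    explained = round(r_squared * 100)  # percent explained
    return (
        f"About {explained}% of the variability in {y_name} is explained "
        f"by the linear relationship with {x_name}; the remaining "
        f"{100 - explained}% is not explained by this model."
    )

sentence = interpret_r_squared(0.64, "test scores", "study hours")
print(sentence)
```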
Example 1: Calculating r²
Height and weight: r = 0.8
r² = (0.8)² = 0.64
Interpretation: "About 64% of the variability in weight is explained by the linear relationship with height. The remaining 36% is due to other factors."
r² vs r
r (correlation):
- Shows strength AND direction
- Range: -1 to 1
- Negative values meaningful
r² (coefficient of determination):
- Shows strength only (no direction)
- Range: 0 to 1
- Always positive
- Easier to interpret as percentage
From r² alone, you cannot tell whether the relationship is positive or negative!
- You also need to report r or the slope
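To see how squaring discards direction, here is a small sketch with two made-up data sets: one rising and one that is its mirror image. The helper `correlation` is written here for illustration.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]
rising = [2, 4, 5, 4, 6]          # hypothetical y-values that trend upward
falling = [-y for y in rising]    # same values flipped, so they trend downward

r_up = correlation(xs, rising)
r_down = correlation(xs, falling)

# The two r values have opposite signs, but their squares agree.
print(f"r_up = {r_up:.3f}, r_down = {r_down:.3f}")
print(f"r_up² = {r_up ** 2:.3f}, r_down² = {r_down ** 2:.3f}")
```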
What r² Means
r² = 0.90: Model explains 90% of variability (excellent fit)
r² = 0.70: Model explains 70% of variability (good fit)
r² = 0.50: Model explains 50% of variability (moderate fit)
r² = 0.25: Model explains 25% of variability (weak fit)
r² = 0: Model explains none of variability (no linear relationship)
Note: These are rough guidelines; what counts as a good fit depends on context!
Visualizing r²
Think of variability in y:
Total variability: How much the y-values spread out from their mean ȳ
Explained variability: How much the predicted values ŷ vary around ȳ (due to the linear relationship)
Unexplained variability: How much the points deviate from the line (residuals)
Formal Definition
r² = Σ(ŷ − ȳ)² / Σ(y − ȳ)²
Numerator: Variability in predictions
Denominator: Total variability in y
Equivalently:
r² = 1 − Σ(y − ŷ)² / Σ(y − ȳ)²
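The sketch below checks numerically that the two forms of the definition agree for a least-squares line: explained variability over total variability, and one minus unexplained variability over total variability. The data are made up, and the names `sst`, `ssr`, and `sse` (total, explained, and unexplained sums of squares) are labels chosen for this example.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y-hat = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    a = mean_y - b * mean_x
    return a, b

# Hypothetical data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 6]

a, b = fit_line(xs, ys)
preds = [a + b * x for x in xs]
mean_y = sum(ys) / len(ys)

sst = sum((y - mean_y) ** 2 for y in ys)             # total variability
ssr = sum((p - mean_y) ** 2 for p in preds)          # explained variability
sse = sum((y - p) ** 2 for y, p in zip(ys, preds))   # unexplained (residuals)

r_squared_explained = ssr / sst
r_squared_residual = 1 - sse / sst
print(f"{r_squared_explained:.4f} {r_squared_residual:.4f}")
```

The two values match because, for a least-squares line with an intercept, total variability splits exactly into explained plus unexplained variability.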
Example 2: Detailed Calculation
Data: 5 points with ȳ = 10
Total variability: Σ(y − ȳ)² = 100
Unexplained (residuals): Σ(y − ŷ)² = 25
r² = 1 − 25/100 = 0.75
Interpretation: 75% of variability explained, 25% unexplained
What r² Does NOT Mean
❌ r² is NOT probability
- Not "probability model is correct"
- Not "probability prediction is right"
❌ r² does NOT prove causation
- High r² doesn't mean x causes y
- Could be coincidence or confounding
❌ r² alone doesn't guarantee good model
- Could have high r² but residuals show pattern
- Always check residual plot!
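A quick sketch of why the residual plot matters: the made-up data below follow a curve (y = x²), yet a straight line still earns a high r². The residuals give the problem away, because they are positive at both ends and negative in the middle.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y-hat = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    a = mean_y - b * mean_x
    return a, b

# Hypothetical curved data: y = x² for x = 0, 1, ..., 10
xs = list(range(11))
ys = [x ** 2 for x in xs]

a, b = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

mean_y = sum(ys) / len(ys)
sst = sum((y - mean_y) ** 2 for y in ys)
sse = sum(e ** 2 for e in residuals)
r_squared = 1 - sse / sst

print(f"r² = {r_squared:.3f}")
print([round(e, 1) for e in residuals])  # U-shaped pattern, not random scatter
```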
❌ r² doesn't tell you how accurate predictions are for individuals
- Use s (standard error) for that
When is r² High?
High r² occurs when:
- Strong linear relationship (|r| close to 1)
- Points close to regression line
- Little unexplained variability
- x is good predictor of y
Does NOT require:
- Large sample size (can have high r² with small n)
- Causation
- Practical importance
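As an extreme illustration that a high r² does not require a large sample, any two points with different x-values and different y-values lie exactly on a line, so r² = 1 with n = 2. The points are made up, and `correlation` is a helper written for this sketch.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Two hypothetical points: a line always fits them perfectly
r_two_points = correlation([1, 2], [3, 10])
r_squared_two_points = r_two_points ** 2
print(f"r² = {r_squared_two_points:.3f}")
```

A perfect fit here says nothing about whether the relationship would hold up with more data.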
When is r² Low?
Low r² occurs when:
- Weak linear relationship
- Lots of scatter around line
- Much unexplained variability
- x is poor predictor of y
Possible reasons:
- No relationship exists
- Relationship is nonlinear
- Other variables more important
- High natural variability in y
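The "relationship is nonlinear" case can be sketched with made-up data where y is completely determined by x (y = x²) but the pattern is symmetric, so the linear correlation is zero.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: a perfect parabola centered at x = 0
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r = correlation(xs, ys)
r_squared = r ** 2
print(f"r² = {r_squared:.3f}")  # low r² despite a perfect (nonlinear) relationship
```

A low r² therefore rules out a strong linear relationship, not a relationship of any kind.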
Comparing Models
Use r² to compare models on same data:
Model 1: Height predicting weight, r² = 0.64
Model 2: Age predicting weight, r² = 0.45
Conclusion: Height explains more variability (better predictor)
Caution: Only compare r² for same response variable!
Adjusted r²
For multiple regression (multiple explanatory variables)
Problem: Adding a variable never decreases r² and almost always increases it (even for useless variables!)
Adjusted r²: Penalizes for number of variables
Adjusted r² = 1 − (1 − r²)(n − 1)/(n − k − 1)
Where n = number of observations and k = number of explanatory variables
Use: Compare models with different numbers of variables
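A minimal sketch of the penalty, assuming the usual textbook formula adjusted r² = 1 − (1 − r²)(n − 1)/(n − k − 1), where n is the number of observations. The two models and their r² values are invented for illustration.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted r² for n observations and k explanatory variables."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than k + 1")
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# Hypothetical comparison on the same 30 observations
one_variable = adjusted_r_squared(0.64, n=30, k=1)
five_variables = adjusted_r_squared(0.65, n=30, k=5)

print(f"1 variable:  adjusted r² = {one_variable:.3f}")
print(f"5 variables: adjusted r² = {five_variables:.3f}")
```

In this made-up case the five-variable model has the slightly higher r² but the lower adjusted r², which suggests the extra variables are not earning their place.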
Relationship to Standard Error
Related concepts:
r²: Proportion of variability explained
s: Typical prediction error (in original units)
Both measure model fit, and for the same response data they move together:
- High r² ↔ small s
- Low r² ↔ large s
s often more useful for predictions (gives actual error magnitude)
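The sketch below computes s for a simple regression, assuming the common definition s = √(SSE/(n − 2)). It also rescales the made-up y-values by 10 to show that s changes with the units of y, while r² would stay the same.

```python
from math import sqrt

def residual_standard_error(xs, ys):
    """s = sqrt(SSE / (n - 2)) for a least-squares line y-hat = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    a = mean_y - b * mean_x
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return sqrt(sse / (n - 2))

# Hypothetical data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 6]

s = residual_standard_error(xs, ys)
s_rescaled = residual_standard_error(xs, [10 * y for y in ys])

print(f"s = {s:.3f} (original units)")
print(f"s = {s_rescaled:.3f} (y multiplied by 10)")
```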
Common Mistakes
❌ Saying "r² is probability"
❌ Thinking high r² proves causation
❌ Using r² alone without checking residual plot
❌ Comparing r² across different response variables
❌ Not reporting direction of relationship (r² loses sign)
Practical Significance
Statistical vs Practical:
High r² in context:
- Social sciences: r² > 0.50 often considered good
- Physical sciences: r² > 0.90 often expected
- Individual predictions: Even r² = 0.90 may not be precise enough
Consider:
- What's typical in your field?
- What's needed for practical use?
- What's the cost of prediction errors?
Reporting Results
Complete report includes:
- Correlation (r): Shows direction and strength
- r²: Shows percent variability explained
- Equation: ŷ = a + bx
- Standard error (s): Typical prediction error
- Residual plot: Visual check of model appropriateness
Don't report r² alone!
Quick Reference
r²: Proportion of variability in y explained by x
Formula: r² = (correlation)²
Range: 0 to 1 (0% to 100%)
Interpretation: "[r² × 100]% of variability in y explained by linear relationship with x"
High r²: Good fit, points close to line
Low r²: Poor fit, much unexplained variability
Remember: r² measures how well x predicts y, but doesn't prove causation. Always check residual plot! High r² alone doesn't guarantee good model.
📚 Practice Problems
Problem 1 (easy)
❓ Question:
A regression has correlation r = 0.8. Calculate and interpret R².
💡 Solution
Step 1: Calculate R²
Formula: R² = r²
R² = (0.8)² = 0.64
Step 2: Express as percentage
R² = 0.64 = 64%
Step 3: Interpret
"64% of the variability in y is explained by the linear relationship with x."
The remaining 36% is unexplained variation (random error, other factors).
Step 4: Implications
R² = 0.64 suggests:
- Strong relationship (64% explained)
- Model captures most of pattern
- Useful for predictions
- But 36% still unexplained
Answer: R² = 0.64 or 64%. This means 64% of the variation in y is explained by the linear relationship with x.
Problem 2 (easy)
❓ Question:
Model A has R² = 0.85, Model B has R² = 0.45. Which is better for predictions?
💡 Solution
Step 1: Compare R² values
Model A: R² = 0.85 = 85% explained
Model B: R² = 0.45 = 45% explained
Step 2: Model A interpretation
- 85% of variation explained
- Very strong relationship
- Only 15% unexplained
- More accurate predictions
Step 3: Model B interpretation
- 45% of variation explained
- Moderate relationship
- 55% unexplained
- Less accurate predictions
Step 4: Conclusion
Model A is BETTER because:
- More variation explained (85% vs 45%)
- Smaller residuals on average
- More reliable predictions
- Stronger relationship
Answer: Model A is better, provided both models predict the same response variable. It explains 85% of the variation versus only 45% for Model B, so its predictions are typically more accurate.
Problem 3 (medium)
❓ Question:
A regression has R² = 0.49. What is the correlation r? Can you determine the sign?
💡 Solution
Step 1: Calculate |r|
R² = r²
0.49 = r²
r = ±√0.49 = ±0.7
So |r| = 0.7
Step 2: Determine sign
From R² ALONE, cannot determine sign!
Both r = +0.7 and r = -0.7 give R² = 0.49
Step 3: How to find sign
Need additional information:
- Look at slope (same sign as r)
- Look at scatterplot direction
- Context (should relationship be positive or negative?)
Step 4: Why R² loses sign
R² = r² means squaring eliminates sign:
(+0.7)² = 0.49
(-0.7)² = 0.49
Answer: |r| = 0.7, but CANNOT determine sign from R² alone. Need slope sign or scatterplot to determine if r = +0.7 or -0.7.
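The recovery of r from R² plus the slope's sign can be sketched as follows. The function name `r_from_r_squared` and the slope values are hypothetical, chosen only to show both cases.

```python
from math import sqrt

def r_from_r_squared(r_squared, slope):
    """Recover r from R², taking its sign from the regression slope."""
    magnitude = sqrt(r_squared)  # this is |r|
    if slope > 0:
        return magnitude
    if slope < 0:
        return -magnitude
    return 0.0  # a zero slope means no linear relationship

r_positive = r_from_r_squared(0.49, slope=2.5)    # hypothetical upward slope
r_negative = r_from_r_squared(0.49, slope=-2.5)   # hypothetical downward slope
print(f"{r_positive:.2f} {r_negative:.2f}")
```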
Problem 4 (medium)
❓ Question:
Explain why R² must be between 0 and 1.
💡 Solution
Step 1: R² definition
R² = r² = (correlation)²
Step 2: Why R² ≥ 0
Any number squared is non-negative:
- Even negative r gives positive R²
- (-0.7)² = 0.49 ≥ 0
- Minimum R² = 0 (no relationship)
Step 3: Why R² ≤ 1
Correlation is bounded: -1 ≤ r ≤ 1
Squaring preserves this:
- Maximum |r| = 1
- Maximum r² = 1² = 1
- Cannot exceed 100% of variation
Step 4: Interpretation
R² = 0: No linear relationship (0% explained)
R² = 1: Perfect linear relationship (100% explained)
You cannot explain less than 0% or more than 100%!
Step 5: If you see R² = 1.5 or R² = -0.3
CALCULATION ERROR! Recheck your work.
Answer: R² must be 0 ≤ R² ≤ 1 because it equals r² (always non-negative) and correlation is bounded by -1 ≤ r ≤ 1. Cannot explain less than 0% or more than 100% of variation.
Problem 5 (medium)
❓ Question:
A model has SST = 500 and SSE = 125. Calculate and interpret R².
💡 Solution
Step 1: Understand sum of squares
SST = Total Sum of Squares = total variation
SSE = Sum of Squared Errors = unexplained variation
SSR = Regression Sum of Squares = explained variation
Relationship: SST = SSR + SSE
Step 2: Calculate SSR
SSR = SST - SSE
SSR = 500 - 125 = 375
Step 3: Calculate R²
Formula: R² = SSR/SST
R² = 375/500 = 0.75
Alternative: R² = 1 - SSE/SST = 1 - 125/500 = 1 - 0.25 = 0.75 ✓
Step 4: Interpret
R² = 0.75 = 75%
"75% of the total variation in y is explained by the regression model."
Explained: 375/500 = 75%
Unexplained: 125/500 = 25%
Answer: R² = 0.75 or 75%. The model explains 375 out of 500 total units of variation, leaving 125 units (25%) unexplained.
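The arithmetic in this problem can be checked with a few lines, using the SST and SSE values given in the question. The variable names are chosen for this sketch.

```python
sst = 500  # total sum of squares (given)
sse = 125  # sum of squared errors (given)

ssr = sst - sse                      # explained variation
r_squared = ssr / sst                # R² = SSR / SST
r_squared_alt = 1 - sse / sst        # R² = 1 - SSE / SST

print(f"SSR = {ssr}, R² = {r_squared}, check = {r_squared_alt}")
```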