Scatter Plots and Correlation
Visualizing and measuring linear relationships
Scatterplots
Scatterplot: Graph showing relationship between two quantitative variables
- x-axis: Explanatory variable (independent)
- y-axis: Response variable (dependent)
- Each point represents one individual
Purpose: Visualize relationship, identify patterns, detect outliers
Describing Scatterplots: DCFS
Direction: Positive, negative, or no association
Positive: As x increases, y tends to increase
Negative: As x increases, y tends to decrease
No association: No clear pattern
Cluster: Data grouped together or spread evenly
Form: Linear or nonlinear
Linear: Points follow straight-line pattern
Nonlinear: Curved pattern (quadratic, exponential, etc.)
Strength: How closely points follow pattern
Strong: Points close to pattern
Moderate: Some scatter but clear pattern
Weak: Lots of scatter, vague pattern
Outliers: Points far from overall pattern
Correlation Coefficient (r)
Measures: Strength and direction of linear relationship
Formula: r = Σ(x-x̄)(y-ȳ) / √[Σ(x-x̄)² × Σ(y-ȳ)²]
Properties:
- Range: -1 ≤ r ≤ 1
- r = 1: Perfect positive linear relationship
- r = -1: Perfect negative linear relationship
- r = 0: No linear relationship
- r > 0: Positive association
- r < 0: Negative association
Interpreting |r|
|r| = 1: Perfect linear relationship
0.8 ≤ |r| < 1: Strong linear relationship
0.5 ≤ |r| < 0.8: Moderate linear relationship
0 < |r| < 0.5: Weak linear relationship
|r| = 0: No linear relationship
Note: These are rough guidelines; context matters!
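The guideline cutoffs can be written as a small lookup. This is only a sketch: the function name `describe_strength` is hypothetical, and placing the boundary values 0.5 and 0.8 in the higher category is an assumption, since the cutoffs themselves are rough.

```python
def describe_strength(r):
    """Label the strength of a linear relationship from r.

    Uses the rough cutoffs 0.5 and 0.8. Assigning the boundary values
    to the higher category is an assumption, not a fixed rule.
    """
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and 1")
    size = abs(r)  # strength depends only on |r|, not on the sign
    if size == 1:
        return "perfect"
    if size >= 0.8:
        return "strong"
    if size >= 0.5:
        return "moderate"
    if size > 0:
        return "weak"
    return "none"


for value in (0.93, -0.62, 0.15):
    print(value, describe_strength(value))
```

The sign of r is ignored here on purpose: direction is read separately from r > 0 or r < 0.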
Important Properties of r
1. Unitless: No units (standardized)
2. Not affected by units: Changing the units of x or y (e.g., inches to centimeters) doesn't change r
3. Not affected by which variable is x or y: Switching variables doesn't change r
4. Affected by outliers: Single outlier can dramatically change r
5. Measures linear relationship only: Can be 0 even if strong nonlinear relationship exists!
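Properties 2 and 3 can be checked numerically. The sketch below uses made-up height and weight values (hypothetical data, not from any study) and a helper named `pearson_r` that is written here for illustration, not taken from a library.

```python
import math


def pearson_r(xs, ys):
    """Correlation from deviations: Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data for illustration only.
heights_in = [60, 63, 66, 69, 72, 75]
weights_lb = [115, 130, 128, 160, 155, 190]

# Same individuals, different units.
heights_cm = [h * 2.54 for h in heights_in]
weights_kg = [w / 2.2046 for w in weights_lb]

r_original = pearson_r(heights_in, weights_lb)
r_converted = pearson_r(heights_cm, weights_kg)
r_swapped = pearson_r(weights_lb, heights_in)

# All three agree up to floating-point rounding.
print(r_original, r_converted, r_swapped)
```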
Example: Calculating r
Data: (1, 2), (2, 4), (3, 5), (4, 7), (5, 8)
x̄ = 3, ȳ = 5.2
Σ(x-x̄)(y-ȳ) = 15, Σ(x-x̄)² = 10, Σ(y-ȳ)² = 22.8
r = 15 / √(10 × 22.8) ≈ 0.993 (strong positive)
In practice: Use calculator!
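For readers without a graphing calculator, the same calculation can be sketched in a few lines of Python. The function applies the deviation form r = Σ(x-x̄)(y-ȳ) / √[Σ(x-x̄)² × Σ(y-ȳ)²] to the five data points above; the name `pearson_r` is an illustrative choice.

```python
import math


def pearson_r(xs, ys):
    """Correlation coefficient computed from deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Data from the example: (1, 2), (2, 4), (3, 5), (4, 7), (5, 8)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 7, 8]

r = pearson_r(xs, ys)
print(round(r, 3))  # 0.993
```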
Calculator Method
TI-83/84:
- Enter x-values in L1, y-values in L2
- STAT → CALC → 8:LinReg(a+bx)
- r appears in the output if diagnostics are on (to turn them on: 2nd 0 for CATALOG → DiagnosticOn → ENTER)
Correlation vs Causation
CRITICAL: Correlation does NOT imply causation!
r = 0.9 means:
- Strong linear relationship exists
- x and y tend to vary together
r = 0.9 by itself does NOT show that:
- x causes y
- y causes x
Possible explanations for correlation:
- x causes y
- y causes x
- Confounding variable causes both
- Coincidence
Example: Spurious Correlation
Ice cream sales and drowning deaths: r ≈ 0.9
NOT because:
- Ice cream causes drowning
- Drowning causes ice cream sales
ACTUALLY:
- Confounding variable: Summer/temperature
- Both increase in summer
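A confounding variable can be imitated with made-up numbers. In the sketch below, every value is hypothetical: ice cream sales and drownings are each generated from temperature plus a small fixed wobble, and neither is computed from the other. They still end up strongly correlated.

```python
import math


def pearson_r(xs, ys):
    """Correlation coefficient computed from deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical monthly average temperatures (the confounder).
temps = [30, 35, 45, 55, 65, 75, 80, 78, 70, 58, 45, 35]

# Fixed, made-up wobbles standing in for everything else that varies.
ice_wobble = [5, -5] * 6
drown_wobble = [0.5, 0.5, -0.5, -0.5] * 3

# Each variable depends on temperature only, never on the other one.
ice_cream = [50 + 3 * t + e for t, e in zip(temps, ice_wobble)]
drownings = [1 + 0.1 * t + d for t, d in zip(temps, drown_wobble)]

r_ice_drown = pearson_r(ice_cream, drownings)
print(round(r_ice_drown, 2))  # high, although neither variable causes the other
```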
Outliers and Influential Points
Outlier: Point far from overall pattern
Effect on r:
- Can increase or decrease r
- Can change sign of r
- Single outlier can dominate
Always: Identify outliers, consider their impact
Influential point: If removed, would substantially change r or regression line
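The effect of a single influential point is easy to see numerically. In this sketch the first five points follow a tight upward trend, and the added point (15, -10) is a hypothetical outlier chosen to be extreme; the helper `pearson_r` is written here for illustration.

```python
import math


def pearson_r(xs, ys):
    """Correlation coefficient computed from deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 7, 8]
r_clean = pearson_r(xs, ys)

# Add one hypothetical point far from the overall pattern.
r_with_outlier = pearson_r(xs + [15], ys + [-10])

print(round(r_clean, 3), round(r_with_outlier, 3))
# The sign flips from positive to negative because of one point.
```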
When Correlation Is Inappropriate
Don't use r if:
- Relationship is nonlinear (r only measures linear!)
- Severe outliers present (distort r)
- Categorical variables (need different analysis)
Always plot data first! Don't rely on r alone.
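A quick numerical illustration of why plotting comes first: in this sketch y is determined exactly by x through y = x², yet the correlation is zero because the x-values are symmetric about 0. The helper `pearson_r` is an illustrative name.

```python
import math


def pearson_r(xs, ys):
    """Correlation coefficient computed from deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = list(range(-5, 6))      # -5, -4, ..., 5
ys = [x ** 2 for x in xs]    # perfect quadratic relationship

r_quadratic = pearson_r(xs, ys)
print(r_quadratic)  # zero: the positive and negative products cancel
```

A scatterplot of these points would show the parabola immediately, which is information r cannot provide.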
Describing Associations
Template: "There is a [strength] [direction] [form] association between [x] and [y]."
Example: "There is a strong positive linear association between study hours and test scores."
Add: "With no outliers" or "With one outlier at..."
Association vs Relationship
Association: Variables vary together (correlation)
Relationship: Generic term (could be causal or not)
Causation: x directly causes changes in y
Always distinguish!
Quick Reference
DCFS: Direction, Cluster, Form, Strength (+ outliers)
Correlation r:
- Range: -1 to 1
- Measures linear relationship only
- Unitless
- Affected by outliers
Key: Correlation ≠ Causation
Remember: Always make a scatterplot first! r alone can be misleading. Two variables with a nonlinear relationship might have r ≈ 0 yet still be strongly related!
📚 Practice Problems
Problem 1 (easy)
❓ Question:
A scatterplot shows a strong positive linear relationship between hours studied and test scores. What does this tell you about the correlation coefficient r?
💡 Solution
For a strong positive linear relationship:
- r will be close to +1
- r will be positive (between 0 and 1)
- The closer r is to +1, the stronger the relationship
Typical ranges:
- r = 0.8 to 1.0: Strong positive
- r = 0.5 to 0.8: Moderate positive
- r = 0.2 to 0.5: Weak positive
Key point: "Strong" and "positive" both describe r. The correlation coefficient quantifies what we see visually in the scatterplot.
Problem 2 (medium)
❓ Question:
For five students, hours studied (x) and test scores (y) are: (2,70), (3,75), (4,85), (5,90), (6,95). Calculate the correlation coefficient r.
💡 Solution
Step 1: Calculate means
x̄ = (2+3+4+5+6)/5 = 4
ȳ = (70+75+85+90+95)/5 = 83
Step 2: Calculate deviations and products
(x, y)     x-x̄    y-ȳ    (x-x̄)(y-ȳ)    (x-x̄)²    (y-ȳ)²
(2, 70)    -2     -13    26            4         169
(3, 75)    -1     -8     8             1         64
(4, 85)    0      2      0             0         4
(5, 90)    1      7      7             1         49
(6, 95)    2      12     24            4         144
Step 3: Calculate sums
Σ(x-x̄)(y-ȳ) = 26+8+0+7+24 = 65
Σ(x-x̄)² = 4+1+0+1+4 = 10
Σ(y-ȳ)² = 169+64+4+49+144 = 430
Step 4: Calculate r
r = Σ(x-x̄)(y-ȳ) / √[Σ(x-x̄)² × Σ(y-ȳ)²]
r = 65 / √(10 × 430)
r = 65 / √4300
r ≈ 65 / 65.57
r ≈ 0.991
The very high positive correlation (r ≈ 0.99) indicates a strong positive linear relationship.
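The answer can be double-checked with an equivalent form of the formula, r = (1/(n-1)) Σ z_x z_y, where each z is a deviation divided by the sample standard deviation. The sketch below is one way to do that with Python's standard `statistics` module; the function name `pearson_r_zscores` is hypothetical.

```python
import statistics


def pearson_r_zscores(xs, ys):
    """Correlation as the average product of z-scores (dividing by n - 1)."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)  # sample standard deviations
    products = (
        ((x - x_bar) / s_x) * ((y - y_bar) / s_y)
        for x, y in zip(xs, ys)
    )
    return sum(products) / (n - 1)


hours = [2, 3, 4, 5, 6]
scores = [70, 75, 85, 90, 95]

r_check = pearson_r_zscores(hours, scores)
print(round(r_check, 3))  # 0.991
```

Both forms give the same value because the (n-1) factors inside the standard deviations cancel against the 1/(n-1) in front.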
Problem 3 (medium)
❓ Question:
A study finds r = -0.85 between temperature and heating costs. Interpret this value and explain what it does NOT mean.
💡 Solution
Interpretation:
- Strong negative linear association (|r| = 0.85 is strong)
- As temperature increases, heating costs tend to decrease
- The points fall fairly close to a downward-sloping line, though not exactly on it
What r does NOT mean:
- NOT causation: Correlation alone doesn't prove that temperature causes heating cost changes (though context makes a causal link plausible here)
- NOT a percentage: r = -0.85 does not mean "85% of the variation is explained." That kind of statement belongs to r², and here r² = (-0.85)² ≈ 0.72
- NOT evidence about nonlinear patterns: r only measures linear association
- NOT robust to outliers: r can be heavily influenced by extreme points
- NOT a slope: r = -0.85 doesn't mean "heating costs decrease by $0.85 per degree"
Common mistake: Confusing r with r² or interpreting r as causation.
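To make the distinction between r, r², and slope concrete, the sketch below squares r and then uses the standard least-squares relation slope = r × (s_y / s_x). The standard deviations are hypothetical values chosen only to show that the same r is compatible with very different slopes.

```python
r = -0.85

# r squared is a separate quantity from r.
r_squared = r ** 2
print(round(r_squared, 4))  # 0.7225


def slope_from_r(r, s_x, s_y):
    """Least-squares slope from r and the two standard deviations."""
    return r * s_y / s_x


# Hypothetical spreads: same r, different variability in heating cost.
slope_small_spread = slope_from_r(r, s_x=10.0, s_y=40.0)
slope_large_spread = slope_from_r(r, s_x=10.0, s_y=80.0)

print(slope_small_spread, slope_large_spread)
# The slopes differ by a factor of two even though r is identical.
```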
Problem 4 (hard)
❓ Question:
Two variables have r = 0.02. A researcher concludes there is no relationship between the variables. Why might this conclusion be incorrect? Give two reasons.
💡 Solution
Reason 1: Nonlinear relationships
r only measures LINEAR association. The variables could have a strong curved relationship (quadratic, exponential, etc.) that r = 0.02 doesn't detect.
Example: y = x² has r ≈ 0 for x ranging from -5 to 5, but there's a perfect (nonlinear) relationship.
Reason 2: Restricted range
If the data cover only a small portion of the full range of values, r might be near zero even if a strong relationship exists over a wider range.
Example: Height vs. weight for people aged 30-31 might show only a weak correlation, but for people aged 2-80, whose heights and weights span a far wider range, r would be much stronger.
Other possibilities:
- Outliers depressing r
- Separate groups each with their own relationship
- Measurement error reducing observed correlation
Conclusion: Always examine the scatterplot! r alone doesn't tell the full story.
Problem 5 (hard)
❓ Question:
A scatterplot of (height in inches, weight in pounds) has r = 0.70. If height is converted to centimeters and weight to kilograms, what is the new correlation? What if we swap which variable is x and which is y?
💡 Solution
Converting units: New correlation = 0.70 (unchanged)
Explanation: Correlation is unitless and unaffected by linear changes of units (multiplying by a positive constant or adding a constant). Converting inches to cm (multiply by 2.54) and pounds to kg (divide by about 2.2) are both changes of this kind.
Swapping x and y: New correlation = 0.70 (unchanged)
Explanation: Correlation is symmetric. The strength of linear association between height and weight is the same as between weight and height.
Properties of r:
- -1 ≤ r ≤ 1
- Unitless
- Symmetric: r(x,y) = r(y,x)
- Unchanged by linear changes of units (positive multiplier)
- Measures only LINEAR relationships
What WOULD change r:
- Nonlinear transformations (log, square, etc.)
- Adding/removing data points
- Changing the data itself
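By contrast with a unit change, a nonlinear transformation does change r. The sketch below uses hypothetical y-values that grow roughly exponentially with x; taking the natural log of y straightens the pattern, and r rises accordingly. The helper `pearson_r` is written here for illustration.

```python
import math


def pearson_r(xs, ys):
    """Correlation coefficient computed from deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data with roughly exponential growth.
xs = [1, 2, 3, 4, 5]
ys = [2.7, 7.4, 20.1, 54.6, 148.4]

r_raw = pearson_r(xs, ys)

# A nonlinear transformation (natural log) of y.
log_ys = [math.log(y) for y in ys]
r_log = pearson_r(xs, log_ys)

print(round(r_raw, 3), round(r_log, 3))
# r_log is much closer to 1 because log(y) is nearly linear in x.
```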