Scatter Plots and Correlation

Visualizing and measuring linear relationships

Scatterplots

Scatterplot: Graph showing relationship between two quantitative variables

  • x-axis: Explanatory variable (independent)
  • y-axis: Response variable (dependent)
  • Each point represents one individual

Purpose: Visualize relationship, identify patterns, detect outliers

Describing Scatterplots: DCFS

Direction: Positive, negative, or no association

Positive: As x increases, y tends to increase
Negative: As x increases, y tends to decrease
No association: No clear pattern

Cluster: Data grouped together or spread evenly

Form: Linear or nonlinear

Linear: Points follow straight-line pattern
Nonlinear: Curved pattern (quadratic, exponential, etc.)

Strength: How closely points follow pattern

Strong: Points close to pattern
Moderate: Some scatter but clear pattern
Weak: Lots of scatter, vague pattern

Outliers: Points far from overall pattern

Correlation Coefficient (r)

Measures: Strength and direction of linear relationship

Formula:

$$r = \frac{1}{n-1} \sum \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$
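The formula can be sketched in Python as follows. This is a minimal illustration using only the standard library; the function name pearson_r and the test points are assumptions made for this example, not part of the notes.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation as the sum of z-score products divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations (divide by n - 1)
    s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    # Multiply the standardized x and y for each point, add up, divide by n - 1
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (n - 1)

# Hypothetical points lying exactly on a rising line and on a falling line
r_rising = pearson_r([1, 2, 3, 4], [3, 5, 7, 9])
r_falling = pearson_r([1, 2, 3, 4], [9, 7, 5, 3])
print(round(r_rising, 6), round(r_falling, 6))
```

Each term in the sum is positive when a point is above (or below) average in both variables, which is why points along a rising line push r toward +1.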

Properties:

  • Range: -1 ≤ r ≤ 1
  • r = 1: Perfect positive linear relationship
  • r = -1: Perfect negative linear relationship
  • r = 0: No linear relationship
  • r > 0: Positive association
  • r < 0: Negative association

Interpreting |r|

|r| = 1: Perfect linear relationship
0.8 ≤ |r| < 1: Strong linear relationship
0.5 ≤ |r| < 0.8: Moderate linear relationship
0 < |r| < 0.5: Weak linear relationship
|r| = 0: No linear relationship

Note: These are rough guidelines, context matters!
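As a rough sketch, the guidelines above could be turned into a small helper. The function name describe_r is hypothetical, and placing the boundary values 0.5 and 0.8 in the stronger category is an assumption made for this example.

```python
def describe_r(r):
    """Return a rough verbal label for a correlation coefficient."""
    if not -1 <= r <= 1:
        raise ValueError("r must be between -1 and 1")
    size = abs(r)
    if size == 0:
        return "no linear relationship"
    direction = "positive" if r > 0 else "negative"
    # Cutoffs follow the rough guidelines; boundary handling is an assumption
    if size == 1:
        strength = "perfect"
    elif size >= 0.8:
        strength = "strong"
    elif size >= 0.5:
        strength = "moderate"
    else:
        strength = "weak"
    return f"{strength} {direction} linear relationship"

print(describe_r(0.9))
print(describe_r(-0.6))
```

The labels are only a starting point; the same value of r can count as strong in one field and weak in another.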

Important Properties of r

1. Unitless: No units (standardized)

2. Not affected by units: Converting x or y doesn't change r

3. Not affected by which variable is x or y: Switching variables doesn't change r

4. Affected by outliers: Single outlier can dramatically change r

5. Measures linear relationship only: Can be 0 even if strong nonlinear relationship exists!
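Properties 2 and 3 can be checked numerically. The sketch below uses made-up heights and weights (not real data) and a helper named pearson_r, which is an assumption for this example; it computes r from sums of deviations, which is algebraically the same as the z-score formula.

```python
from math import sqrt

def pearson_r(xs, ys):
    """r from sums of deviations; equivalent to the z-score formula."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# Made-up values for illustration only
heights_in = [60, 63, 66, 69, 72]
weights_lb = [115, 130, 128, 160, 171]

r_original = pearson_r(heights_in, weights_lb)

# Property 2: change units (inches to cm, pounds to kg, roughly)
heights_cm = [h * 2.54 for h in heights_in]
weights_kg = [w / 2.2 for w in weights_lb]
r_converted = pearson_r(heights_cm, weights_kg)

# Property 3: swap which variable is x and which is y
r_swapped = pearson_r(weights_lb, heights_in)

print(round(r_original, 4), round(r_converted, 4), round(r_swapped, 4))
```

All three printed values should agree, because standardizing removes the units and the product of z-scores does not depend on the order of the variables.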

Example: Calculating r

Data: (1, 2), (2, 4), (3, 5), (4, 7), (5, 8)

$\bar{x} = 3$, $s_x \approx 1.58$
$\bar{y} = 5.2$, $s_y \approx 2.39$

$$r = \frac{1}{4}\left[\left(\frac{-2}{1.58}\right)\left(\frac{-3.2}{2.39}\right) + \cdots + \left(\frac{2}{1.58}\right)\left(\frac{2.8}{2.39}\right)\right]$$

$r \approx 0.993$ (strong positive)

In practice: Use calculator!
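Software works just as well as a calculator. The following sketch recomputes the example step by step with the Python standard library, so the means, standard deviations, and r can be checked against the hand calculation; the variable names are chosen for this illustration.

```python
from math import sqrt

# Data from the example
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 7, 8]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Sample standard deviations (divide by n - 1)
s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))

# Sum of z-score products, divided by n - 1
r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)) / (n - 1)

print("x_bar =", x_bar, " s_x =", round(s_x, 2))
print("y_bar =", y_bar, " s_y =", round(s_y, 2))
print("r =", round(r, 3))
```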

Calculator Method

TI-83/84:

  1. Enter x-values in L1, y-values in L2
  2. STAT → CALC → 8:LinReg(a+bx)
  3. r appears in the output. If it does not, turn diagnostics on (2nd 0 → DiagnosticOn → ENTER) and run the regression again

Correlation vs Causation

CRITICAL: Correlation does NOT imply causation!

r = 0.9 means:

  • Strong linear relationship exists
  • x and y tend to vary together

r = 0.9 does NOT by itself mean:

  • x causes y
  • y causes x

Possible explanations for correlation:

  1. x causes y
  2. y causes x
  3. Confounding variable causes both
  4. Coincidence

Example: Spurious Correlation

Ice cream sales and drowning deaths: r ≈ 0.9

NOT because:

  • Ice cream causes drowning
  • Drowning causes ice cream sales

ACTUALLY:

  • Confounding variable: Summer/temperature
  • Both increase in summer
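The confounding pattern can be illustrated with a toy data set. Every number below is invented for this sketch (it is not real sales or drowning data), and pearson_r is a hypothetical helper. Both series were chosen to rise with temperature, and neither was built from the other.

```python
from math import sqrt

def pearson_r(xs, ys):
    """r from sums of deviations."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# Invented monthly values, for illustration only
temperature = [10, 15, 20, 25, 30, 35]        # degrees C
ice_cream_sales = [35, 42, 65, 70, 95, 100]   # chosen to rise with temperature
drownings = [2, 4, 3, 6, 6, 8]                # chosen to rise with temperature

r_sales_temp = pearson_r(temperature, ice_cream_sales)
r_drown_temp = pearson_r(temperature, drownings)
r_sales_drown = pearson_r(ice_cream_sales, drownings)

print(round(r_sales_temp, 2), round(r_drown_temp, 2), round(r_sales_drown, 2))
```

Sales and drownings end up strongly correlated with each other only because each one tracks temperature.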

Outliers and Influential Points

Outlier: Point far from overall pattern

Effect on r:

  • Can increase or decrease r
  • Can change sign of r
  • Single outlier can dominate

Always: Identify outliers, consider their impact

Influential point: If removed, would substantially change r or regression line
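A short sketch shows how much one point can matter. It takes the five points from the earlier example and adds a single made-up outlier at (10, 0); the outlier and the helper name pearson_r are assumptions for this illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """r from sums of deviations."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 7, 8]
r_without = pearson_r(xs, ys)

# Add one hypothetical point far from the pattern
r_with = pearson_r(xs + [10], ys + [0])

print(round(r_without, 3), round(r_with, 3))
```

Here the single added point is influential: it turns a strong positive correlation into a negative one.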

When Correlation Is Inappropriate

Don't use r if:

  1. Relationship is nonlinear (r only measures linear!)
  2. Severe outliers present (distort r)
  3. Categorical variables (need different analysis)

Always plot data first! Don't rely on r alone.
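To see why a plot is needed, consider a perfect curve. In this sketch y is exactly x² for x from -5 to 5, yet the computed correlation is zero; pearson_r is a hypothetical helper name.

```python
from math import sqrt

def pearson_r(xs, ys):
    """r from sums of deviations."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

xs = list(range(-5, 6))        # -5, -4, ..., 5
ys = [x ** 2 for x in xs]      # y is completely determined by x

r_curve = pearson_r(xs, ys)
print(r_curve)
```

The positive products on the right side of the parabola cancel the negative products on the left, so r reports no linear trend even though y depends entirely on x.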

Describing Associations

Template: "There is a [strength] [direction] [form] association between [x] and [y]."

Example: "There is a strong positive linear association between study hours and test scores."

Add: "With no outliers" or "With one outlier at..."

Association vs Relationship

Association: Variables vary together (correlation)

Relationship: Generic term (could be causal or not)

Causation: x directly causes changes in y

Always distinguish!

Quick Reference

DCFS: Direction, Cluster, Form, Strength (+ outliers)

Correlation r:

  • Range: -1 to 1
  • Measures linear relationship only
  • Unitless
  • Affected by outliers

Key: Correlation ≠ Causation

Remember: Always make a scatterplot first! r alone can be misleading. Two variables can have r ≈ 0 and still be strongly related in a nonlinear way.

📚 Practice Problems

Problem 1 (easy)

Question:

A scatterplot shows a strong positive linear relationship between hours studied and test scores. What does this tell you about the correlation coefficient r?

Solution:

For a strong positive linear relationship:

  • r will be positive (between 0 and 1)
  • r will be close to +1
  • The closer r is to +1, the stronger the relationship

Typical ranges:

  • r from 0.8 to 1.0: Strong positive
  • r from 0.5 to 0.8: Moderate positive
  • r from 0 to 0.5: Weak positive

Key point: "Positive" describes the sign of r and "strong" describes how close |r| is to 1. The correlation coefficient quantifies what we see visually in the scatterplot.

Problem 2 (medium)

Question:

For five students, hours studied (x) and test scores (y) are: (2,70), (3,75), (4,85), (5,90), (6,95). Calculate the correlation coefficient r.

Solution:

Step 1: Calculate means

x̄ = (2+3+4+5+6)/5 = 4
ȳ = (70+75+85+90+95)/5 = 83

Step 2: Calculate deviations and products

(x, y)     x-x̄    y-ȳ    (x-x̄)(y-ȳ)    (x-x̄)²    (y-ȳ)²
(2, 70)    -2     -13    26             4          169
(3, 75)    -1     -8     8              1          64
(4, 85)    0      2      0              0          4
(5, 90)    1      7      7              1          49
(6, 95)    2      12     24             4          144

Step 3: Calculate sums

Σ(x-x̄)(y-ȳ) = 26+8+0+7+24 = 65
Σ(x-x̄)² = 4+1+0+1+4 = 10
Σ(y-ȳ)² = 169+64+4+49+144 = 430

Step 4: Calculate r

This form is equivalent to the z-score formula, because the factors of n-1 cancel.

r = Σ(x-x̄)(y-ȳ) / √[Σ(x-x̄)² × Σ(y-ȳ)²]
r = 65 / √(10 × 430)
r = 65 / √4300
r = 65 / 65.57
r ≈ 0.991

The very high positive correlation (r ≈ 0.99) indicates a strong positive linear relationship.
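The arithmetic in this solution can be double-checked with a few lines of Python. This is only a verification sketch; the variable names s_xy, s_xx, and s_yy are chosen here for the three sums.

```python
from math import sqrt

# Data from Problem 2
hours = [2, 3, 4, 5, 6]
scores = [70, 75, 85, 90, 95]
n = len(hours)

x_bar = sum(hours) / n
y_bar = sum(scores) / n

# The three sums used in the solution
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores))
s_xx = sum((x - x_bar) ** 2 for x in hours)
s_yy = sum((y - y_bar) ** 2 for y in scores)

r = s_xy / sqrt(s_xx * s_yy)
print(s_xy, s_xx, s_yy, round(r, 3))
```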

Problem 3 (medium)

Question:

A study finds r = -0.85 between temperature and heating costs. Interpret this value and explain what it does NOT mean.

Solution:

Interpretation:

  • Strong negative linear association (|r| = 0.85 falls in the strong range)
  • As temperature increases, heating costs tend to decrease
  • The points lie fairly close to a downward-sloping line, but not exactly on it

What r does NOT mean:

  1. NOT causation: Correlation doesn't prove temperature causes heating cost changes (though we might infer this from context)

  2. NOT the percent of variation explained: That is r² = (-0.85)² ≈ 0.72, so saying "85% of the variation is explained" is wrong

  3. NOT a measure of nonlinear association: r only measures linear association

  4. NOT robust to outliers: r can be heavily influenced by extreme points

  5. NOT a slope: r = -0.85 doesn't mean "heating costs decrease by $0.85 per degree"

Common mistake: Confusing r with r² or interpreting r as causation.

Problem 4 (hard)

Question:

Two variables have r = 0.02. A researcher concludes there is no relationship between the variables. Why might this conclusion be incorrect? Give two reasons.

Solution:

Reason 1: Nonlinear relationships

r only measures LINEAR association. The variables could have a strong curved relationship (quadratic, exponential, etc.) that r = 0.02 doesn't detect.

Example: y = x² has r ≈ 0 when x ranges evenly from -5 to 5, but there's a perfect (nonlinear) relationship.

Reason 2: Restricted range

If the data only covers a small portion of the full range of values, r might be near zero even if a strong relationship exists over a wider range.

Example: Height vs. weight for people aged 30-31 might show only a weak r, but for people aged 2-80, whose heights and weights span a much wider range, r would be much stronger.

Other possibilities:

  • Outliers depressing r
  • Separate groups each with their own relationship
  • Measurement error reducing observed correlation

Conclusion: Always examine the scatterplot! r alone doesn't tell the full story.

Problem 5 (hard)

Question:

A scatterplot of (height in inches, weight in pounds) has r = 0.70. If height is converted to centimeters and weight to kilograms, what is the new correlation? What if we swap which variable is x and which is y?

Solution:

Converting units: New correlation = 0.70 (unchanged)

Explanation: Correlation is unitless and unaffected by linear transformations with a positive slope (multiplying by a positive constant or adding a constant). Converting inches to cm (multiply by 2.54) and pounds to kg (divide by about 2.2) are transformations of this kind.

Swapping x and y: New correlation = 0.70 (unchanged)

Explanation: Correlation is symmetric. The strength of linear association between height and weight is the same as between weight and height.

Properties of r:

  • -1 ≤ r ≤ 1
  • Unitless
  • Symmetric: r(x,y) = r(y,x)
  • Unchanged by linear transformations with a positive slope
  • Measures only LINEAR relationships

What WOULD change r:

  • Multiplying one variable by a negative constant (flips the sign of r)
  • Nonlinear transformations (log, square, etc.)
  • Adding/removing data points
  • Changing the data itself
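Two transformations that do change r can be tried out directly. This sketch reuses the five points from the earlier worked example; pearson_r is a hypothetical helper name, and multiplying by -1 and squaring x are just two convenient choices for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """r from sums of deviations."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 7, 8]
r = pearson_r(xs, ys)

# Multiplying x by a negative constant flips the sign of r
r_negated = pearson_r([-x for x in xs], ys)

# A nonlinear transformation (squaring x) changes the size of r
r_after_squaring = pearson_r([x ** 2 for x in xs], ys)

print(round(r, 3), round(r_negated, 3), round(r_after_squaring, 3))
```

Negating x keeps |r| the same but reverses the direction, while squaring x bends the pattern so the points sit less closely along a line.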