Transformations for Linearity

Linearizing nonlinear relationships

Why Transform?

Problem: Many relationships are nonlinear

Solution: Transform one or both variables to make relationship linear

Benefits:

  • Can use linear regression tools
  • Easier interpretation
  • Better predictions

When to Transform

Signs that a transformation is needed:

  1. Scatterplot shows curve (not line)
  2. Residual plot shows pattern (not random)
  3. Low r² despite clear relationship

Don't transform if:

  • Relationship already linear
  • Residual plot looks good

Common Transformations

For y:

  • log(y): Exponential growth/decay
  • √y: Moderate curve
  • 1/y: Inverse relationship

For x:

  • log(x): Logarithmic curve
  • x²: Quadratic relationship
  • √x: Moderate curve

Both:

  • log(y) vs log(x): Power relationship

Exponential Model

Original relationship: $y = ab^x$

Curved scatterplot, exponential growth/decay

Transform: Take log of y

Becomes linear: $\log(y) = \log(a) + x\log(b)$

Regression: log(y) on x gives linear relationship

Example: Population growth, compound interest, radioactive decay
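The exponential linearization can be sketched in a few lines of Python. This is a minimal illustration, assuming hypothetical noise-free data generated from y = 3 · 2^x and a hand-written least-squares helper (`fit_line` is an illustrative name, not a library function):

```python
import math

def fit_line(xs, ys):
    """Least-squares line ys ~ intercept + slope * xs; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical noise-free data from y = a * b^x with a = 3, b = 2.
xs = [0, 1, 2, 3, 4, 5]
ys = [3 * 2 ** x for x in xs]

# Regress log10(y) on x: the intercept estimates log10(a), the slope log10(b).
intercept, slope = fit_line(xs, [math.log10(y) for y in ys])

a_hat = 10 ** intercept  # undo the log to recover a
b_hat = 10 ** slope      # undo the log to recover b
print(round(a_hat, 6), round(b_hat, 6))
```

With real, noisy data the recovered values would only approximate a and b.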

Example 1: Exponential Transformation

Bacteria population over time:

Original data shows exponential growth (curved)

Transform: Calculate log(population) for each time

New scatterplot: log(population) vs time is linear!

Regression: $\log(\hat{y}) = 2 + 0.3x$

Back-transform for predictions:

$\hat{y} = 10^{2 + 0.3x}$

Power Model

Original relationship: $y = ax^p$

Curved relationship

Transform: Take log of both

Becomes linear: $\log(y) = \log(a) + p\log(x)$

Regression: log(y) on log(x) gives linear relationship

Example: Area vs radius, metabolic rate vs body mass

Example 2: Power Transformation

Planet orbital period vs distance from sun:

Both variables on logarithmic scale → linear!

Regression: $\log(\text{period}) = a + b\log(\text{distance})$

Slope b ≈ 1.5 (Kepler's third law: $p \propto d^{1.5}$)
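As a rough check of the log-log fit, the sketch below uses rounded, commonly cited values for six planets (distance in astronomical units, period in Earth years). The figures are approximate, and `fit_line` is an illustrative hand-written helper:

```python
import math

def fit_line(xs, ys):
    """Least-squares line ys ~ intercept + slope * xs; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# (distance from sun in AU, orbital period in years), rounded approximate values.
planets = {
    "Mercury": (0.387, 0.241),
    "Venus": (0.723, 0.615),
    "Earth": (1.000, 1.000),
    "Mars": (1.524, 1.881),
    "Jupiter": (5.203, 11.86),
    "Saturn": (9.54, 29.46),
}

log_distance = [math.log10(d) for d, _ in planets.values()]
log_period = [math.log10(p) for _, p in planets.values()]

# Regress log(period) on log(distance); the slope estimates the power.
intercept, slope = fit_line(log_distance, log_period)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

The slope should come out very close to 1.5, and the intercept close to 0 because Earth sits at (1 AU, 1 year).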

Square Root and Squaring

√y transformation:

  • Moderate upward curve
  • Spread that increases as y increases

x² transformation:

  • Quadratic relationship (parabola)
  • Works only when the data lie on one side of the vertex (for example, all x ≥ 0)

Example: Free-fall distance (d) vs time (t)

$d = \frac{1}{2}gt^2$ suggests regressing d on t²
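A minimal sketch of the free-fall example, assuming hypothetical noise-free measurements generated with g = 9.8 m/s² (`fit_line` is an illustrative helper, not a library function):

```python
def fit_line(xs, ys):
    """Least-squares line ys ~ intercept + slope * xs; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical times (s) and distances (m) from d = 0.5 * g * t^2 with g = 9.8.
times = [0.5, 1.0, 1.5, 2.0, 2.5]
distances = [0.5 * 9.8 * t ** 2 for t in times]

# Transform x only: regress d on t^2 instead of on t.
t_squared = [t ** 2 for t in times]
intercept, slope = fit_line(t_squared, distances)

g_hat = 2 * slope  # the slope estimates g / 2
print(round(g_hat, 3), round(intercept, 3))
```

Here the prediction needs no back-transformation, because only x was transformed.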

Choosing the Right Transformation

Trial and error approach:

  1. Try transformation
  2. Make scatterplot of transformed data
  3. Check residual plot
  4. Check r²
  5. If not linear, try different transformation

Guided approach:

  • Exponential pattern → log(y)
  • Power relationship → log-log
  • Quadratic → x²
  • Fan shape in residuals → log(y)

Interpreting Transformed Models

Log(y) on x:

Slope interpretation: "For each unit increase in x, y is multiplied by $10^b$"

Example: Slope = 0.301 in log(population) vs time

"Each year, population multiplies by $10^{0.301} \approx 2$"

(Population doubles each year)

Log(y) on log(x):

Slope interpretation: "A 1% increase in x is associated with approximately b% increase in y"
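Both slope interpretations can be checked with quick arithmetic. The sketch below assumes base-10 logs for the log(y)-on-x model and a hypothetical log-log slope of 1.5:

```python
# Log(y) on x with base-10 logs: each unit of x multiplies y by 10**slope.
slope_semilog = 0.301
growth_factor = 10 ** slope_semilog  # close to 2, so y roughly doubles

# Log(y) on log(x): multiplying x by 1.01 multiplies y by 1.01**slope.
slope_loglog = 1.5  # hypothetical slope for illustration
pct_change_in_y = (1.01 ** slope_loglog - 1) * 100  # close to 1.5 percent

print(round(growth_factor, 3), round(pct_change_in_y, 3))
```

The log-log figure is only approximately equal to the slope, and the approximation weakens for larger percentage changes in x.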

Back-Transformation

After fitting model on transformed data:

Make predictions on transformed scale, then back-transform

Example: Model is $\log(\hat{y}) = 2 + 0.3x$

For x = 10:

$\log(\hat{y}) = 2 + 0.3(10) = 5$

$\hat{y} = 10^5 = 100{,}000$

Don't fit the model to the original data and then transform its predictions! Fit on the transformed scale first, then back-transform.
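The back-transformation step can be sketched as a small helper. `back_transform` is a hypothetical name, and the base must match whichever log was used when fitting:

```python
import math

def back_transform(log_prediction, base=10.0):
    """Undo a log transformation: returns base ** log_prediction."""
    return base ** log_prediction

# Base-10 model log(y-hat) = 2 + 0.3x, evaluated at x = 10.
log_y_hat = 2 + 0.3 * 10
y_hat = back_transform(log_y_hat)  # 10^5

# The same helper works for a natural-log model by passing base=math.e.
y_hat_ln = back_transform(7, base=math.e)  # e^7

print(round(y_hat), round(y_hat_ln, 2))
```

Using the wrong base here is one of the easiest ways to back-transform incorrectly.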

Checking the Transformation

Good transformation produces:

  1. Linear scatterplot
  2. Random residual plot
  3. Higher r²
  4. Roughly constant spread

Compare before/after:

  • Original r² vs transformed r²
  • Original residual plot vs transformed residual plot
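A before/after residual comparison might look like the sketch below. It assumes hypothetical noise-free data from y = 5 · 1.5^x, and `fit_line` and `residuals` are illustrative helpers rather than library functions:

```python
import math

def fit_line(xs, ys):
    """Least-squares line ys ~ intercept + slope * xs; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def residuals(xs, ys):
    """Observed minus fitted values for the least-squares line of ys on xs."""
    intercept, slope = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Hypothetical exponential data.
xs = list(range(8))
ys = [5 * 1.5 ** x for x in xs]

raw_resid = residuals(xs, ys)                              # y on x
log_resid = residuals(xs, [math.log10(y) for y in ys])     # log10(y) on x

# Raw residuals are positive at both ends and negative in the middle (a curve);
# residuals after the log transformation are essentially zero.
print([round(r, 2) for r in raw_resid])
print([round(r, 10) for r in log_resid])
```

With noisy data the transformed residuals would not be zero, but they should scatter randomly around zero instead of tracing a curve.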

Multiple Transformations

Sometimes it helps to try several transformations:

Example: Comparing transformations for curved data

  • log(y) vs x: r² = 0.85
  • √y vs x: r² = 0.92
  • y vs x²: r² = 0.78

Choose: √y vs x (highest r²), provided its residual plot also shows random scatter
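Trying several transformations can be automated. The sketch below assumes hypothetical noise-free data from y = (2 + 0.5x)², so its r² values differ from the illustrative ones listed above; `r_squared` is a hand-written helper:

```python
import math

def r_squared(xs, ys):
    """Squared correlation between xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)

# Hypothetical data where the square root of y is exactly linear in x.
xs = list(range(1, 11))
ys = [(2 + 0.5 * x) ** 2 for x in xs]

candidates = {
    "log(y) vs x": r_squared(xs, [math.log10(y) for y in ys]),
    "sqrt(y) vs x": r_squared(xs, [math.sqrt(y) for y in ys]),
    "y vs x^2": r_squared([x ** 2 for x in xs], ys),
}

best = max(candidates, key=candidates.get)
for name, value in candidates.items():
    print(f"{name}: r^2 = {value:.4f}")
print("highest r^2:", best)
```

Because each candidate uses a different response or explanatory variable, r² is only a rough guide here; the residual plot of the chosen model still needs checking.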

Common Patterns and Transformations

| Pattern | Try |
|---------|-----|
| Exponential growth/decay | log(y) |
| Power relationship | log(y) and log(x) |
| Quadratic (parabola) | x² |
| Moderate upward curve | √y or √x |
| Spread increases with y | log(y) |

Residual Plot After Transformation

Must check! Transformation successful if:

  • No pattern in residuals
  • Random scatter around 0
  • Constant spread

If a pattern remains: Try a different transformation

Linearizable vs Non-linearizable

Linearizable: Can be made linear with transformation

  • Exponential: y = ab^x
  • Power: y = ax^p
  • Quadratic with no linear term: y = a + cx² (regress y on x²)

Non-linearizable: Cannot be easily linearized

  • Some periodic functions
  • Complex curves
  • May need nonlinear regression

Common Mistakes

❌ Not checking residual plot after transformation
❌ Back-transforming incorrectly
❌ Transforming when already linear
❌ Misinterpreting slope of transformed model
❌ Judging by r² alone when comparing before and after (the y variable is different, so also compare residual plots!)

Practical Considerations

Pros of transformation:

  • Use simple linear methods
  • Often theoretically motivated
  • Can improve predictions

Cons of transformation:

  • Harder to interpret
  • Must back-transform for predictions
  • Not all relationships linearizable

Alternative: Modern nonlinear regression (beyond AP Stats)

Example 3: Complete Transformation

Original: y vs x is curved (r² = 0.40, residuals show pattern)

Transform: Use log(y)

New: log(y) vs x is linear (r² = 0.95, random residuals)

Equation: $\log(\hat{y}) = 1.5 + 0.2x$

Interpretation: "Each unit increase in x multiplies y by $10^{0.2} \approx 1.58$"

For prediction at x = 10:

$\log(\hat{y}) = 1.5 + 0.2(10) = 3.5$

$\hat{y} = 10^{3.5} \approx 3162$

Quick Reference

Exponential (y = ab^x): Use log(y) vs x

Power (y = ax^p): Use log(y) vs log(x)

Quadratic: Use y vs x²

Goal: Linear scatterplot, random residuals, high r²

Check: Always examine residual plot of transformed data

Interpret carefully: Slopes mean different things after transformation

Remember: Transform to fix nonlinearity, but always check if transformation worked! Linear models are powerful when applied to appropriately transformed data.

📚 Practice Problems

Problem 1 (medium)

Question:

A scatterplot of x vs y shows a curved exponential pattern. The residual plot for ŷ = a + bx is curved. Try plotting log(y) vs x. What pattern should you see if this transformation works?

Solution

Step 1: Understand the original problem

  • Scatterplot shows exponential curve (y = ae^(bx))
  • Linear model residuals are curved
  • Need to linearize the relationship

Step 2: Why try log(y) vs x?

Exponential relationship: y = ae^(bx)

Take the natural log of both sides: ln(y) = ln(a) + bx

This is LINEAR in x!

Step 3: What to look for after transformation

If the log transformation is appropriate:

  ✓ Scatterplot of log(y) vs x should be LINEAR
  ✓ Residual plot should show RANDOM scatter
  ✓ No curved pattern in residuals

Step 4: How to check

  1. Create new variable: y' = ln(y)
  2. Plot y' vs x (should be linear)
  3. Fit regression: ŷ' = b₀ + b₁x
  4. Check residual plot (should be random)

Step 5: Interpretation

After transformation:

  • Can use linear regression on ln(y) vs x
  • To predict y: ŷ = e^(b₀ + b₁x)
  • Or: ŷ = e^(b₀) × e^(b₁x)

Answer: After log transformation, the plot of log(y) vs x should show a LINEAR pattern, and residuals should be randomly scattered with no curve.

Problem 2 (hard)

Question:

Data shows a power relationship: y = ax^b. What transformation will linearize this relationship?

Solution

Step 1: Identify the relationship

Power model: y = ax^b (Example: area = πr², where b = 2)

Step 2: Apply log transformation to BOTH variables

Take log of both sides:

log(y) = log(a × x^b)
log(y) = log(a) + log(x^b)
log(y) = log(a) + b·log(x)

Step 3: Recognize linear form

Let: Y = log(y), X = log(x), A = log(a)

Then: Y = A + bX

This is LINEAR!

Step 4: How to transform

  1. Create Y = log(y)
  2. Create X = log(x)
  3. Plot Y vs X (should be linear)
  4. Fit regression: Ŷ = b₀ + b₁X

Step 5: Interpret coefficients

After regression:

  • b₁ = power (exponent b)
  • b₀ = log(a), so a = e^(b₀) if using natural log, or a = 10^(b₀) if using log base 10

To predict original y:

ŷ = e^(b₀) × x^(b₁) [if using natural log]
ŷ = 10^(b₀) × x^(b₁) [if using log base 10]

Example: If Ŷ = 2 + 1.5X (using log base 10)

Then ŷ = 10² × x^1.5 = 100x^1.5

Answer: Take log of BOTH variables. Plot log(y) vs log(x), which linearizes power relationships.

Problem 3 (hard)

Question:

After fitting y vs x, the residual plot fans out (variance increases). You try log(y) vs x and get a better residual plot. Why does this help?

Solution

Step 1: Identify the original problem

Fan-shaped residuals mean:

  • Variance increases with x
  • Violates constant variance assumption
  • Often occurs when y grows exponentially

Step 2: Why log(y) helps with variance

When y is exponential or multiplicative:

  • Larger y values have larger variability
  • Standard deviation is roughly proportional to the mean
  • log transformation STABILIZES variance

Mathematical reason: If the variance of y is proportional to the square of its mean μ: Var(y) ∝ μ²

Then: Var(log(y)) ≈ constant (by the delta method, a first-order Taylor approximation)
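The variance argument can be written out as a short derivation. This is a sketch under the assumption that y has mean μ and standard deviation kμ, where k is a hypothetical constant introduced for illustration, and that the natural log is used:

```latex
\operatorname{Var}\big(g(y)\big) \approx \big[g'(\mu)\big]^2 \operatorname{Var}(y)
\qquad \text{with } g(y) = \ln y,\quad g'(\mu) = \frac{1}{\mu}

\operatorname{Var}(\ln y) \approx \frac{1}{\mu^2}\,(k\mu)^2 = k^2
```

The result no longer depends on μ, which is why the fan shape tends to disappear on the log scale. With a base-10 log the constant changes, but the conclusion is the same.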

Step 3: Additional benefit

Log transformation often:

  ✓ Linearizes exponential relationships
  ✓ Stabilizes variance (fixes fan shape)
  ✓ Makes distribution more symmetric
  ✓ Reduces impact of outliers

Step 4: When to use log transformation

Use log(y) when you see:

  • Exponential growth pattern
  • Fan-shaped residuals
  • Right-skewed distribution
  • Multiplicative relationships
  • Variance increases with mean

Step 5: Check after transformation

After using log(y):

  1. Residual plot should show equal spread
  2. No fan shape
  3. Random scatter
  4. Valid for inference

Answer: Log transformation stabilizes variance. When the spread grows in proportion to the mean (fan shape), log(y) typically has roughly constant variance, fixing the heteroscedasticity problem.

Problem 4 (medium)

Question:

You fit log(y) = 2 + 0.5x using natural log. Predict y when x = 10.

Solution

Step 1: Understand the model

Fitted equation: log(y) = 2 + 0.5x

This uses NATURAL LOG (ln)

Step 2: Predict log(y) for x = 10

log(y) = 2 + 0.5(10)
log(y) = 2 + 5
log(y) = 7

Step 3: Back-transform to get y

Since we used natural log (ln): ln(y) = 7

To solve for y, use the exponential: y = e^7

Step 4: Calculate

y = e^7 ≈ 1,096.63

Step 5: Interpretation

"When x = 10, y is predicted to be approximately 1,097."

Important notes:

  • Must back-transform using e^(predicted value)
  • If using log₁₀, would use 10^(predicted value)
  • Always specify which log was used!

Alternative form:

Original model: y = e^(2 + 0.5x) = e² × e^(0.5x) ≈ 7.39 × e^(0.5x)

When x = 10: y = 7.39 × e^5 ≈ 1,097

Answer: y = e^7 ≈ 1,097

Problem 5 (hard)

Question:

A residual plot shows both curvature AND fan shape. What transformations might you try?

Solution

Step 1: Identify TWO problems

  1. Curvature → nonlinear relationship
  2. Fan shape → non-constant variance

Need transformation that fixes BOTH!

Step 2: Try log(y) vs x

Often works for:

  • Exponential relationships (fixes curve)
  • Multiplicative error (fixes fan)
  • Right-skewed data

Check result:

  ✓ Should be linear
  ✓ Should have constant variance

Step 3: If log(y) doesn't work completely

Try other transformations:

  • √y vs x (square root)
  • 1/y vs x (reciprocal)
  • log(y) vs log(x) (both sides)

Step 4: Systematic approach

  1. Try log(y) vs x first (most common)
  2. Check residual plot
  3. If still curved, try log-log or other
  4. If variance still not constant, try different transformation

Step 5: Decision guide

Pattern → Try transformation:

  • Exponential curve + fan → log(y) vs x
  • Power relationship → log(y) vs log(x)
  • Moderate curve → √y vs x
  • Strong right skew → log(y)

Step 6: After transformation

Must verify:

  ✓ Scatterplot is linear
  ✓ Residuals randomly scattered
  ✓ Constant variance (no fan)
  ✓ Approximately normal residuals

Answer: Try log(y) vs x first, as it often fixes both curvature (exponential) and fan shape (non-constant variance). Check residual plot; if issues remain, try other transformations like √y or log-log.