Outliers in Data
Identify and analyze outliers
What is an Outlier?
An outlier is a data value that is significantly different from the other values in a data set.
Think of it as: A data point that "stands out" or "doesn't fit"
Examples:
- Test scores: 78, 82, 85, 79, 83, 15 (15 is an outlier!)
- Heights: 65, 67, 64, 68, 120 (120 inches is an outlier!)
- Prices: 10, 12, 11, 13, 95 (95 is an outlier!)
Key point: Outliers are unusually high OR unusually low
Why Outliers Matter
1. Affect mean (average): The mean is sensitive to outliers!
Example: Salaries: 40k, 42k, 45k, 43k, 200k
- With outlier: Mean = 74k
- Without outlier: Mean = 42.5k
Big difference!
2. Don't affect median much: The median is resistant to outliers
Same example:
- With outlier: Median = 43k
- Without outlier: Median = 42.5k
Small difference!
3. Can indicate errors
- Measurement mistakes
- Data entry errors
- Recording problems
4. Can reveal important information
- Exceptional cases
- New discoveries
- Special circumstances
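The salary figures in points 1 and 2 can be checked with a short Python sketch. It assumes the standard `statistics` module, and the variable names are illustrative only.

```python
from statistics import mean, median

# Salaries in thousands of dollars, from the example above
salaries = [40, 42, 45, 43, 200]
without_outlier = [s for s in salaries if s != 200]

print("Mean with outlier:     ", mean(salaries))           # 74
print("Mean without outlier:  ", mean(without_outlier))    # 42.5
print("Median with outlier:   ", median(salaries))         # 43
print("Median without outlier:", median(without_outlier))  # 42.5
```

The mean moves by more than 30k when the single large salary is dropped, while the median moves by only 0.5k.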
Identifying Outliers: The IQR Method
Most common method: 1.5 × IQR rule
Steps:
Step 1: Find Q1 and Q3
Step 2: Calculate IQR = Q3 - Q1
Step 3: Calculate boundaries
- Lower boundary: Q1 - 1.5(IQR)
- Upper boundary: Q3 + 1.5(IQR)
Step 4: Any value outside boundaries is an outlier
Example: Data: 2, 5, 7, 8, 9, 10, 12, 15, 40
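Q1 = 7 (25th percentile)
Q3 = 12 (75th percentile)
IQR = 12 - 7 = 5
Note: These quartiles count the median (9) in both halves, so Q1 is the median of 2, 5, 7, 8, 9 and Q3 is the median of 9, 10, 12, 15, 40. If your class leaves the median out of both halves, you get Q1 = 6 and Q3 = 13.5 instead; 40 is an outlier either way.
Lower boundary: 7 - 1.5(5) = 7 - 7.5 = -0.5
Upper boundary: 12 + 1.5(5) = 12 + 7.5 = 19.5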
Outlier: 40 (greater than 19.5)
Example 2: Test scores: 65, 70, 72, 75, 78, 80, 82, 85, 88, 20
Order: 20, 65, 70, 72, 75, 78, 80, 82, 85, 88
Q1 = 70
Q3 = 82
IQR = 82 - 70 = 12
Lower: 70 - 1.5(12) = 70 - 18 = 52
Upper: 82 + 1.5(12) = 82 + 18 = 100
Outlier: 20 (less than 52)
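The four steps above can be sketched in Python. This is one possible implementation, not the only one: the function names and the `include_median` switch are assumptions made for illustration. The switch controls whether the overall median is counted in both halves when the data set has an odd number of values, because textbooks differ on that point.

```python
from statistics import median

def quartiles(data, include_median=True):
    """Return (Q1, Q3) using the median-of-halves method.

    include_median only matters when the number of values is odd:
    True counts the overall median in both halves, False leaves it out.
    """
    values = sorted(data)
    n = len(values)
    half = n // 2
    if n % 2 == 1 and include_median:
        lower, upper = values[:half + 1], values[half:]
    else:
        lower, upper = values[:half], values[n - half:]
    return median(lower), median(upper)

def iqr_outliers(data, include_median=True):
    """Return (lower boundary, upper boundary, list of outliers)."""
    q1, q3 = quartiles(data, include_median)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    outliers = [x for x in data if x < lower or x > upper]
    return lower, upper, outliers

print(iqr_outliers([2, 5, 7, 8, 9, 10, 12, 15, 40]))            # (-0.5, 19.5, [40])
print(iqr_outliers([65, 70, 72, 75, 78, 80, 82, 85, 88, 20]))   # (52.0, 100.0, [20])
```

Both calls reproduce the boundaries worked out by hand in the two examples.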
Why 1.5 × IQR?
The 1.5 multiplier is a convention:
- Widely accepted in statistics
- Balances sensitivity (finding real outliers) with specificity (not flagging too many)
- Works well for many distributions
- Used by box plots
Alternatives exist:
- 2 × IQR (more conservative, fewer outliers)
- 3 × IQR (very conservative, extreme outliers only)
- Standard deviation method (for normal distributions)
In Algebra 1: Stick with 1.5 × IQR unless told otherwise
Visual Identification
From dot plots, histograms, box plots:
Look for values far separated from the main cluster
Example: Dot plot
Most of the data points are clustered at values 10 through 15, but a single point sits at 25, far from the cluster. That point is likely an outlier.
From box plots: Outliers often shown as individual points beyond whiskers
From scatter plots: Points far from the trend line or main cluster
Types of Outliers
1. Mild outliers:
- Between 1.5 and 3 IQRs below Q1 or above Q3
- Somewhat unusual
2. Extreme outliers:
- More than 3 IQRs below Q1 or above Q3
- Very unusual
Example: Q1 = 10, Q3 = 20, IQR = 10
Mild outlier range:
- Lower: 10 - 1.5(10) to 10 - 3(10) = -5 to -20
- Upper: 20 + 1.5(10) to 20 + 3(10) = 35 to 50
Extreme outlier:
- Below -20 or above 50
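The mild/extreme split can be sketched as a small Python function. The name `classify` and its return labels are hypothetical. Treating a value exactly 3 IQRs away as mild rather than extreme is an assumption, since the text does not say which side that boundary falls on.

```python
def classify(value, q1, q3):
    """Label a value as 'not an outlier', 'mild outlier', or 'extreme outlier'."""
    iqr = q3 - q1
    # Distance beyond the nearer quartile; 0 if the value lies between Q1 and Q3
    distance = max(q1 - value, value - q3, 0)
    if distance > 3 * iqr:
        return "extreme outlier"
    if distance > 1.5 * iqr:
        return "mild outlier"
    return "not an outlier"

# Same quartiles as the example above: Q1 = 10, Q3 = 20
for v in (15, 40, -10, 60):
    print(v, "->", classify(v, q1=10, q3=20))
```

With Q1 = 10 and Q3 = 20, the values 40 and -10 fall in the mild ranges (35 to 50 and -20 to -5), while 60 lies above 50 and is extreme.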
Effect on Measures of Center
Mean:
- Very sensitive to outliers
- Pulled toward outlier
- Can be misleading with outliers
Example: 10, 12, 13, 14, 15, 100
Mean with outlier: (10+12+13+14+15+100)/6 ≈ 27.3
Mean without: (10+12+13+14+15)/5 = 12.8
Huge difference!
Median:
- Resistant to outliers
- Not pulled significantly
- Better measure when outliers present
Same example:
Median with outlier: 13.5
Median without: 13
Small difference!
Mode:
- Not affected by outliers
- Only shows most frequent value
Effect on Measures of Spread
Range:
- Very sensitive (uses min and max)
- Outliers inflate range
Example: 5, 7, 8, 9, 10, 50
Range with outlier: 50 - 5 = 45
Range without: 10 - 5 = 5
IQR:
- Resistant to outliers
- Only uses middle 50%
- Better measure when outliers present
Same example:
IQR with outlier: Q3 - Q1 = 10 - 7 = 3
IQR without: Q3 - Q1 = 9 - 7 = 2 (for 5, 7, 8, 9, 10, counting the median 8 in both halves)
Less dramatic change
Standard deviation:
- Sensitive to outliers (covered in more advanced statistics)
- Outliers increase variability
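To see how strongly one value can inflate the range and the standard deviation, here is a rough Python sketch using the data 5, 7, 8, 9, 10, 50 from the range example. It assumes the sample standard deviation (`statistics.stdev`, which divides by n - 1); the helper name `value_range` is made up for illustration.

```python
from statistics import stdev

data = [5, 7, 8, 9, 10, 50]
trimmed = [x for x in data if x != 50]   # same data without the outlier

def value_range(values):
    """Range = maximum - minimum."""
    return max(values) - min(values)

print("Range with outlier:   ", value_range(data))      # 45
print("Range without outlier:", value_range(trimmed))   # 5
print("SD with outlier:      ", round(stdev(data), 1))
print("SD without outlier:   ", round(stdev(trimmed), 1))
```

Both measures shrink sharply once 50 is removed, which is why the IQR is usually preferred when outliers are present.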
Causes of Outliers
1. Measurement error:
- Instrument malfunction
- Human error reading/recording
- Transcription mistake
Example: Recording 150 instead of 15.0
2. Data entry error:
- Typo when entering data
- Extra or missing digit
- Wrong decimal place
Example: Typing 1000 instead of 100
3. Sampling error (a biased sample):
- Wrong population sampled
- Non-random selection
4. Natural variation:
- True extreme value
- Rare but real occurrence
Example: Unusually tall person, genius IQ, record temperature
5. Different population:
- Value from different group
Example: Adult height in data of children's heights
What to Do with Outliers
Option 1: Investigate
- Check for errors
- Verify measurement
- Look for explanation
Option 2: Keep
- If legitimate data point
- If represents true variation
- Document its presence
Option 3: Remove
- If proven error
- If not from target population
- Report that you removed it!
NEVER: Remove without reason or justification!
Best practice:
- Analyze data both with and without outlier
- Report both results
- Explain any removal decision
Reporting Outliers
When writing about data:
"The data set contains one outlier (value = 95), which is more than 1.5 IQRs above Q3. This value appears to be a data entry error based on the source document, so it was excluded from further analysis."
OR:
"One outlier (150) was identified but retained because it represents a legitimate extreme value."
Be transparent!
Real-World Examples
Example 1: Income Data
Incomes: 35k, 40k, 42k, 38k, 45k, 2M (CEO)
2M is an outlier
- Median better than mean for "typical" income
- Outlier is real (some people earn much more)
- Keep it, but use median for reporting
Example 2: Test Scores
Scores: 78, 82, 85, 88, 90, 15
15 is an outlier
- Likely student left early or didn't try
- Or answer sheet error
- Investigate before deciding
Example 3: Product Weights
Weights (grams): 100, 101, 99, 102, 150
150 is an outlier
- Possible production error
- Check batch records
- May need quality control adjustment
Example 4: Reaction Times
Times (seconds): 0.8, 0.9, 0.85, 0.82, 5.2
5.2 is an outlier
- Person distracted?
- Timer error?
- Investigate before removing
Outliers in Different Contexts
Science experiments:
- May indicate errors
- Could be breakthrough discovery
- Repeat to verify
Quality control:
- Often indicate defects
- Trigger inspection
- May lead to process improvement
Sports statistics:
- Record-breaking performances
- Exceptional talent
- Keep for historical record
Economic data:
- Market crashes/booms
- Unusual events
- Important to analyze separately
Multiple Outliers
Data can have more than one!
Example: 5, 8, 10, 12, 13, 15, 16, 18, 75, 80
Q1 = 10, Q3 = 18, IQR = 8, so the upper boundary is 18 + 1.5(8) = 30
Both 75 and 80 are outliers (using IQR method)
Clustered outliers:
- Multiple outliers grouped together
- May indicate subpopulation
- Consider separate analysis
Outliers in Box Plots
Standard representation:
- Draw whiskers to last non-outlier
- Mark outliers as individual points (dots)
- Clearly visible
Example: In a box plot, an outlier would be marked as a dot beyond the whiskers, with the whiskers extending only to the last non-outlier value in the normal data range.
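The whisker rule can be sketched in Python without drawing anything. The function below is a hypothetical helper that takes quartiles already found by hand and returns where each whisker would end, plus the points that would be drawn as dots. The sample call uses the data and quartiles from the first IQR example (Q1 = 7, Q3 = 12).

```python
def box_plot_parts(data, q1, q3):
    """Return (low whisker end, high whisker end, outliers) for given quartiles."""
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    inside = [x for x in data if lower <= x <= upper]
    outliers = [x for x in data if x < lower or x > upper]
    # Whiskers stop at the most extreme values that are not outliers
    return min(inside), max(inside), outliers

print(box_plot_parts([2, 5, 7, 8, 9, 10, 12, 15, 40], q1=7, q3=12))   # (2, 15, [40])
```

Here the upper whisker stops at 15, the largest value inside the boundaries, and 40 is plotted as a separate dot.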
Benefits:
- Quick visual identification
- See number of outliers
- See if high or low
Z-Score Method (Preview)
Alternative method using standard deviation:
z = (value - mean) / standard deviation
Rule: If |z| > 3, the value is likely an outlier (some contexts use |z| > 2)
Example: Mean = 50, SD = 5
Value = 70
z = (70 - 50) / 5 = 4
Since 4 > 3, value 70 is an outlier
Note: This is more common in advanced statistics
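As a preview, the z-score rule can be written as a short Python sketch. The function names and the default `threshold=3` parameter are illustrative choices; as noted above, some contexts use 2 instead.

```python
def z_score(value, mean, sd):
    """Number of standard deviations a value lies from the mean."""
    return (value - mean) / sd

def is_z_outlier(value, mean, sd, threshold=3):
    """True if the value is more than `threshold` standard deviations from the mean."""
    return abs(z_score(value, mean, sd)) > threshold

print(z_score(70, mean=50, sd=5))        # 4.0
print(is_z_outlier(70, mean=50, sd=5))   # True
print(is_z_outlier(62, mean=50, sd=5))   # False, since z = 2.4
```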
Practice Identifying Outliers
Example 1: 12, 15, 18, 20, 22, 25, 28, 65
Order: Already ordered
Q1 = 16.5, Q3 = 26.5, IQR = 10
Lower: 16.5 - 15 = 1.5
Upper: 26.5 + 15 = 41.5
Outlier: 65 (> 41.5)
Example 2: 2, 3, 5, 7, 8, 9, 10, 11, 12
Q1 = 5, Q3 = 10, IQR = 5
Lower: 5 - 7.5 = -2.5
Upper: 10 + 7.5 = 17.5
No outliers (all values between -2.5 and 17.5)
Example 3: 50, 55, 60, 62, 65, 68, 70, 72, 120
Q1 = 60, Q3 = 70, IQR = 10
Lower: 60 - 15 = 45
Upper: 70 + 15 = 85
Outlier: 120 (> 85)
Common Mistakes to Avoid
- Automatically removing outliers: Must investigate first!
- Using range instead of IQR: IQR is resistant to outliers, range is not
- Wrong IQR calculation: IQR = Q3 - Q1 (not max - min!)
- Forgetting both boundaries: Check both lower and upper limits
- Calculation errors with 1.5: 1.5 × IQR, not 1.5 + IQR!
- Not considering context: Is the outlier meaningful or an error?
- Not reporting removals: Always document if you exclude data
Outliers and Technology
Calculators:
- Many show outliers on box plots
- Can calculate quartiles automatically
Spreadsheets:
- Use QUARTILE function (it interpolates between values, so its quartiles can differ slightly from ones found by hand)
- Create formulas for boundaries
- Conditional formatting to highlight
Statistical software:
- Automatic outlier detection
- Multiple methods available
- Visual displays
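Software does not always use the same quartile convention as a hand calculation, so it is worth checking which one a tool applies. As one illustration, Python's standard `statistics.quantiles` function (Python 3.8 and later) offers two methods. The sketch below runs both on the data from the first IQR example.

```python
from statistics import quantiles

data = [2, 5, 7, 8, 9, 10, 12, 15, 40]

# n=4 asks for the three cut points Q1, Q2 (median), Q3
q1_inc, _, q3_inc = quantiles(data, n=4, method="inclusive")
q1_exc, _, q3_exc = quantiles(data, n=4, method="exclusive")

print("inclusive:", q1_inc, q3_inc)   # 7.0 12.0
print("exclusive:", q1_exc, q3_exc)   # 6.0 13.5
```

For this data set the "inclusive" result matches Q1 = 7 and Q3 = 12 from the worked example. The "exclusive" result gives a wider IQR, although 40 is still flagged under either one.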
When Outliers Are Most Important
Quality control: Outliers indicate defects
Medical data: Unusual values may indicate health issues
Fraud detection: Unusual transactions flagged
Climate data: Extreme values important for planning
Safety analysis: Worst-case scenarios matter
Outliers vs Extreme Values
Not all extreme values are outliers!
Extreme value: At the far end of the distribution
Outlier: Statistically defined as more than 1.5 × IQR below Q1 or above Q3
Example: Tallest person in class
Might be extreme (tallest) but not an outlier (still within 1.5 × IQR of Q3)
Example 2: Record high temperature
Extreme and probably an outlier
Quick Reference
Outlier: Data value far from others
IQR Method:
- Lower boundary: Q1 - 1.5(IQR)
- Upper boundary: Q3 + 1.5(IQR)
- Outside boundaries = outlier
Effects:
- Mean: Very sensitive
- Median: Resistant
- Range: Sensitive
- IQR: Resistant
Actions:
- Investigate
- Keep if legitimate
- Remove if error (and report!)
Never: Remove without reason
In box plots: Shown as individual points
Practice Strategy
- Calculate IQR carefully
- Don't forget the 1.5 multiplier
- Check both upper and lower boundaries
- Consider context and cause
- Practice with various data sets
- Learn to identify visually from graphs
- Understand effect on mean vs median
- Compare statistics with and without outliers
- Read real-world examples
- Use technology to verify
- Always investigate before removing
- Document your decisions
- Understand that outliers aren't always errors
- Practice explaining outliers to others
- Apply to real data from your life
Understanding outliers is crucial for accurate data analysis. They can reveal errors, exceptional cases, or important patterns. Master this skill and you'll be a more critical and careful data analyst!
Practice Problems
Problem 1 (easy)
Question:
Is 2 an outlier in the data set: 12, 15, 18, 20, 22, 25, 2?
Solution:
Step 1: Arrange data in order: 2, 12, 15, 18, 20, 22, 25
Step 2: Visual inspection: 2 is much smaller than all other values (which range from 12-25). It appears to be an outlier.
Step 3: Use the IQR method to confirm:
Find Q1 and Q3:
Q1 = 12 (median of lower half: 2, 12, 15)
Q2 = 18 (median overall)
Q3 = 22 (median of upper half: 20, 22, 25)
Step 4: Calculate IQR and boundaries:
IQR = 22 - 12 = 10
Lower boundary: Q1 - 1.5(IQR) = 12 - 15 = -3
Upper boundary: Q3 + 1.5(IQR) = 22 + 15 = 37
Step 5: Check if 2 is outside the boundaries:
2 > -3 and 2 < 37
2 is NOT outside the boundaries by the 1.5 × IQR rule.
Answer: By the IQR method, 2 is technically NOT an outlier, though it appears unusual visually.
Problem 2 (easy)
Question:
For the data set 5, 8, 10, 12, 15, 40, identify any outliers using the 1.5 × IQR rule.
Solution:
Step 1: Data is already in order. Find quartiles:
Q1 = 8 (median of 5, 8, 10)
Q2 = 11 (median of all: between 10 and 12)
Q3 = 15 (median of 12, 15, 40)
Step 2: Calculate IQR:
IQR = Q3 - Q1 = 15 - 8 = 7
Step 3: Calculate boundaries:
Lower: Q1 - 1.5(IQR) = 8 - 1.5(7) = 8 - 10.5 = -2.5
Upper: Q3 + 1.5(IQR) = 15 + 1.5(7) = 15 + 10.5 = 25.5
Step 4: Check each value:
5 > -2.5 ✓
8, 10, 12, 15 all within boundaries ✓
40 > 25.5 ✗ (exceeds upper boundary)
Step 5: Identify outliers: 40 is an outlier
Answer: 40 is an outlier
Problem 3 (medium)
Question:
A data set has Q1 = 30, Q3 = 50. What values would be considered outliers?
Solution:
Step 1: Calculate IQR:
IQR = Q3 - Q1 = 50 - 30 = 20
Step 2: Calculate lower boundary:
Lower boundary = Q1 - 1.5(IQR) = 30 - 1.5(20) = 30 - 30 = 0
Step 3: Calculate upper boundary:
Upper boundary = Q3 + 1.5(IQR) = 50 + 1.5(20) = 50 + 30 = 80
Step 4: Determine outlier ranges:
Any value less than 0 is a low outlier
Any value greater than 80 is a high outlier
Answer: Values below 0 or above 80 are outliers
Problem 4 (medium)
Question:
Explain how outliers affect the mean and median differently.
Solution:
Step 1: Effect on the mean (average):
The mean is calculated by adding all values and dividing by the count. Outliers significantly affect the mean because they are included in the sum.
Example: Data set: 10, 12, 13, 15, 100
Without 100: mean = (10 + 12 + 13 + 15)/4 = 12.5
With 100: mean = (10 + 12 + 13 + 15 + 100)/5 = 30
The outlier (100) dramatically increases the mean from 12.5 to 30.
Step 2: Effect on the median (middle value):
The median is the middle value when data is ordered. Outliers have little to no effect on the median.
Same data: 10, 12, 13, 15, 100
Median = 13 (middle value)
If we remove 100: 10, 12, 13, 15
Median = (12 + 13)/2 = 12.5
The median changed only slightly (13 to 12.5).
Step 3: Conclusion:
- Mean is sensitive to outliers (not resistant)
- Median is resistant to outliers
- When outliers exist, median often better represents "typical" value
Answer: Outliers significantly affect the mean but have minimal effect on the median. The median is resistant to outliers.
Problem 5 (hard)
Question:
Test scores: 72, 75, 78, 80, 82, 85, 88, 90, 45. Is 45 an outlier? Should it be removed from the data?
Solution:
Step 1: Order the data and find quartiles:
45, 72, 75, 78, 80, 82, 85, 88, 90
Q1 = 75 (median of 45, 72, 75, 78, 80, counting the overall median in both halves)
Q2 = 80
Q3 = 85 (median of 80, 82, 85, 88, 90)
Step 2: Calculate IQR and boundaries:
IQR = 85 - 75 = 10
Lower: 75 - 1.5(10) = 75 - 15 = 60
Upper: 85 + 1.5(10) = 85 + 15 = 100
Step 3: Check if 45 is an outlier:
45 < 60, so YES, 45 is an outlier
Step 4: Investigate the cause:
Ask: Why is this score so different?
Possible reasons:
- Student was absent and made up test later
- Student had an emergency during test
- Data entry error (typed 45 instead of 85?)
- Student genuinely struggled
Step 5: Decide whether to remove it:
- If it's a data error: Remove or correct it
- If it's a legitimate score: Keep it, but note it
- Report statistics with and without the outlier
- Use median instead of mean to reduce its impact
Step 6: Calculate both scenarios:
With 45: mean ≈ 77.2, median = 80
Without 45: mean = 81.25, median = 81
Answer: Yes, 45 is an outlier. Whether to remove it depends on why it occurred. If legitimate, keep it but use resistant measures like median. If it's an error, investigate and correct.