Describing Distributions
Shape, outliers, center, and spread (SOCS)
Introduction
Looking at a graph is just the first step. To fully understand data, we must describe what we see using precise statistical language. The framework SOCS (Shape, Outliers, Center, Spread) provides a systematic approach to describing any distribution.
Shape
Shape describes the overall pattern of the distribution.
Symmetry
Symmetric Distribution:
- Left side mirrors right side
- Mean ≈ Median
- Balanced around center
Examples:
- Normal (bell-shaped) distributions
- Uniform distributions
- Heights of adult males
How to identify: If you fold the distribution at the center, both sides match
Skewness
Right-Skewed (Positively Skewed):
- Tail extends to the right
- Mean > Median
- Most data on left, few high values pull mean right
Examples:
- Income (most people earn moderate amounts, few earn very high)
- Home prices
- Test scores when the test is hard (most score low, few score high)
Visual: Peak on left, tail stretches right
Left-Skewed (Negatively Skewed):
- Tail extends to the left
- Mean < Median
- Most data on right, few low values pull mean left
Examples:
- Age at death (most live to old age, few die young)
- Test scores when the test is easy (most score high, few score low)
Visual: Peak on right, tail stretches left
Memory aid: Skewness direction = direction of the tail (not the peak!)
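The mean-versus-median comparison above can be turned into a rough check. This is a minimal sketch, not a formal skewness test: the function name `skew_direction`, the `tolerance` cutoff, and both datasets are made up for illustration.

```python
from statistics import mean, median

def skew_direction(data, tolerance=0.05):
    """Guess the skew direction by comparing the mean with the median.

    `tolerance` is an assumed cutoff: a mean-median gap smaller than this
    fraction of the range is treated as "roughly symmetric".
    """
    spread = max(data) - min(data)
    if spread == 0:
        return "roughly symmetric"
    gap = (mean(data) - median(data)) / spread
    if gap > tolerance:
        return "right-skewed"   # mean pulled toward a right tail
    if gap < -tolerance:
        return "left-skewed"    # mean pulled toward a left tail
    return "roughly symmetric"

incomes = [30, 32, 35, 38, 40, 42, 45, 150]      # thousands of dollars, made up
easy_test = [40, 78, 82, 85, 88, 90, 92, 95]     # percent, made up

print(skew_direction(incomes))     # right-skewed
print(skew_direction(easy_test))   # left-skewed
```

A check like this only supports what the graph shows. Always look at the histogram or dotplot before naming the shape.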
Modality
Number of peaks (modes) in distribution:
Unimodal: One clear peak
- Most common pattern
- Examples: heights within one sex, standardized test scores
Bimodal: Two distinct peaks
- Suggests two different groups
- Examples: Heights of adults (male peak and female peak)
Multimodal: More than two peaks
- Multiple distinct groups
- Less common
Uniform: No peaks, all values equally likely
- Flat distribution
- Example: Rolling a fair die
How to determine: Count prominent "humps" in the distribution
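Counting humps can be sketched on a list of histogram bin counts. This is a crude illustration under stated assumptions: `count_peaks` is a hypothetical helper, the bin counts are made up, and every small bump counts as a peak, so judging which humps are "prominent" still has to be done by eye.

```python
def count_peaks(counts):
    """Count histogram bins that are strictly taller than both neighbors."""
    padded = [0] + list(counts) + [0]   # treat the space outside the histogram as empty
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i] > padded[i - 1] and padded[i] > padded[i + 1]
    )

unimodal = [1, 3, 7, 9, 6, 2]
bimodal = [2, 6, 9, 5, 2, 4, 8, 7, 3]
flat = [5, 5, 5, 5]

print(count_peaks(unimodal))  # 1
print(count_peaks(bimodal))   # 2
print(count_peaks(flat))      # 0 (uniform: no bin stands above its neighbors)
```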
Special Shapes
Normal (Bell-Shaped):
- Symmetric
- Unimodal
- Mean = Median = Mode
- Most data near center, decreasing towards extremes
- Follows empirical rule (68-95-99.7)
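The 68-95-99.7 figures can be checked with Python's standard library. The sketch below uses `statistics.NormalDist`; the helper name `share_within` is made up for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.1%}")
# within 1 SD: 68.3%
# within 2 SD: 95.4%
# within 3 SD: 99.7%
```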
Uniform:
- All values equally likely
- Rectangular shape
- No mode
Exponential:
- Decreasing pattern
- Extremely right-skewed
- Many small values, few large values
Outliers
Outliers are observations that fall notably far from the overall pattern.
Identifying Outliers
Visual method:
- Look for isolated points
- Values separated from main cluster
1.5 × IQR Rule (for boxplots):
- Calculate $\text{IQR} = Q_3 - Q_1$
- Lower fence: $Q_1 - 1.5 \times \text{IQR}$
- Upper fence: $Q_3 + 1.5 \times \text{IQR}$
- Outliers fall beyond fences
Example:
- Q1 = 65, Q3 = 85
- IQR = 85 - 65 = 20
- Lower fence: 65 - 1.5(20) = 65 - 30 = 35
- Upper fence: 85 + 1.5(20) = 85 + 30 = 115
- Values below 35 or above 115 are outliers
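The fence arithmetic in this example can be written as a short function. This is a minimal sketch: `iqr_fences` is a hypothetical helper name, and the list of scores being checked is made up.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower fence, upper fence) for the k x IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

lower, upper = iqr_fences(65, 85)
print(lower, upper)  # 35.0 115.0

scores = [30, 72, 88, 120]  # made-up values to check against the fences
outliers = [x for x in scores if x < lower or x > upper]
print(outliers)  # [30, 120]
```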
Standard deviation method:
- Flag values more than 2 or 3 standard deviations from the mean
- Less commonly used
- Appropriate for symmetric distributions
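A sketch of the standard deviation method, assuming the sample standard deviation is used. The helper `sd_outliers` and its `threshold` parameter are made-up names for illustration.

```python
from statistics import mean, stdev

def sd_outliers(data, threshold=2):
    """Return values more than `threshold` sample standard deviations from the mean."""
    center = mean(data)
    spread = stdev(data)
    return [x for x in data if abs(x - center) > threshold * spread]

data = [12, 15, 18, 20, 22, 25, 100]
print(sd_outliers(data, threshold=2))  # [100]
print(sd_outliers(data, threshold=3))  # []
```

The second result shows one weakness of this method: an extreme value inflates the standard deviation, which can hide that same value when the cutoff is 3 standard deviations.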
Reporting Outliers
Always:
- Note their presence: "There is one outlier at 150"
- Give actual values if possible
- Consider potential causes
Potential causes:
- Measurement error: Mistake in recording
- Data entry error: Typo when entering data
- Legitimate extreme value: Unusual but real observation
- Different population: Doesn't belong in this group
What to do:
- Investigate cause if possible
- Report with and without outliers (if they affect conclusions)
- Don't automatically delete (unless proven error)
Center
Center describes the "typical" or "middle" value.
Mean vs. Median
When to use each:
Mean ($\bar{x}$):
- Symmetric distributions
- No outliers
- Want to use all data values
- Mathematical properties needed
Median:
- Skewed distributions
- Presence of outliers
- Want resistant measure
- Ordinal data
Relationship to shape:
- Symmetric: Mean ≈ Median
- Right-skewed: Mean > Median (mean pulled right by tail)
- Left-skewed: Mean < Median (mean pulled left by tail)
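To see what "resistant" means in practice, compare both measures before and after one extreme value is introduced. The scores are made up, and the 900 stands in for a hypothetical data entry error.

```python
from statistics import mean, median

scores = [70, 75, 80, 85, 90]
with_typo = [70, 75, 80, 85, 900]   # hypothetical typo: 90 entered as 900

print(mean(scores), median(scores))         # mean 80, median 80
print(mean(with_typo), median(with_typo))   # mean 242, median 80
```

One bad value moves the mean from 80 to 242, while the median does not move at all.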
Mode
Definition: Most frequently occurring value
When reported:
- Categorical data
- Describing bimodal distributions
- Identifying popular values
Limitations:
- May not exist (all values occur once)
- May not be unique (multiple modes)
- Not useful for continuous data with no repeated values
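These limitations are easy to see with `statistics.multimode` (available in Python 3.8 and later), which returns every value tied for most frequent. The three small datasets are made up.

```python
from statistics import multimode

colors = ["red", "blue", "red", "green"]   # categorical data: one clear mode
ratings = [1, 1, 2, 2, 3]                  # two values tie for most frequent
weights = [4.1, 5.3, 6.8]                  # continuous data, no repeats

print(multimode(colors))    # ['red']
print(multimode(ratings))   # [1, 2]
print(multimode(weights))   # [4.1, 5.3, 6.8]
```

When no value repeats, every value is returned, which is another way of saying the mode carries no information for that dataset.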
Spread
Spread describes the variability or dispersion of data.
Range
Definition: Maximum - Minimum
Formula: $\text{Range} = \text{Maximum} - \text{Minimum}$
Advantages:
- Easy to calculate
- Easy to understand
- Gives sense of total spread
Disadvantages:
- Affected by outliers
- Ignores distribution between extremes
- Only uses two values
Example:
- Data: 12, 15, 18, 20, 22, 25, 100
- Range = 100 - 12 = 88
- Dominated by outlier (100)
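The same example in code, with the range recomputed after setting the outlier aside. `value_range` is a hypothetical helper name.

```python
def value_range(data):
    """Range = maximum - minimum."""
    return max(data) - min(data)

data = [12, 15, 18, 20, 22, 25, 100]
without_outlier = [x for x in data if x != 100]

print(value_range(data))             # 88
print(value_range(without_outlier))  # 13
```

A single value accounts for most of the range here, which is why the range alone is a poor summary when outliers are present.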
Interquartile Range (IQR)
Definition: Range of middle 50% of data
Formula: $\text{IQR} = Q_3 - Q_1$
Advantages:
- Resistant to outliers
- Focuses on middle of distribution
- Useful for boxplots
Disadvantages:
- Ignores lowest 25% and highest 25%
- Less intuitive than range
Example:
- Q1 = 65, Q3 = 85
- IQR = 85 - 65 = 20
- Middle 50% of data spans 20 points
Interpretation: "The middle half of the data spans [IQR] [units]"
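Quartiles can be found by taking the median of each half of the sorted data. The sketch below assumes the common textbook convention of leaving out the middle value when the number of observations is odd. Software packages may use other conventions and report slightly different quartiles. The helper name `quartiles` and the data are for illustration only.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumes at least two observations; the middle value is excluded
    from both halves when the count is odd.
    """
    if len(data) < 2:
        raise ValueError("need at least two observations")
    ordered = sorted(data)
    half = len(ordered) // 2
    return median(ordered[:half]), median(ordered[-half:])

data = [12, 15, 18, 20, 22, 25, 100]
q1, q3 = quartiles(data)
print(q1, q3, q3 - q1)  # 15 25 10
```

The value 100 has no effect on the IQR here, which is what "resistant to outliers" means.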
Standard Deviation
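Definition: Roughly the typical distance of observations from the mean (formally, the square root of the average squared deviation from the mean)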
Interpretation: Typical deviation from mean
Advantages:
- Uses all data values
- Has important mathematical properties
- Basis for many statistical methods
Disadvantages:
- Affected by outliers
- Less intuitive than range
- Most informative for roughly symmetric distributions
When to report:
- Symmetric distributions
- No extreme outliers
- Want to use standard statistical methods
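A step-by-step computation may make the standard deviation less abstract. This sketch assumes the sample version (dividing by n - 1) and checks the result against `statistics.stdev`. The helper `sample_sd` and the heights are made up.

```python
from math import sqrt
from statistics import mean, stdev

def sample_sd(data):
    """Sample standard deviation: sqrt of summed squared deviations over (n - 1)."""
    center = mean(data)
    squared_deviations = [(x - center) ** 2 for x in data]
    return sqrt(sum(squared_deviations) / (len(data) - 1))

heights = [64, 66, 68, 70, 72]  # inches, made up

print(round(sample_sd(heights), 2))  # 3.16
print(round(stdev(heights), 2))      # 3.16 (library result agrees)

# For comparison, the plain average of the absolute distances from the mean:
mean_abs_distance = mean(abs(x - mean(heights)) for x in heights)
print(mean_abs_distance)             # 2.4
```

The last line shows that the standard deviation is close to, but not the same as, the literal average distance from the mean.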
Context Matters!
Units
Always include units in descriptions:
❌ "The mean is 68"
✓ "The mean height is 68 inches"
❌ "The standard deviation is 3.5"
✓ "The standard deviation of test scores is 3.5 points"
Comparison
Describe in context of:
- What you'd expect
- Other groups
- Previous studies
Examples:
- "Students averaged 85%, which is higher than last year's 78%"
- "The standard deviation of 15 points shows high variability"
Complete Description Template
A complete distribution description includes:
Shape: "The distribution of [variable] is [symmetric/right-skewed/left-skewed] and [unimodal/bimodal/etc.]"
Outliers: "There is/are [number] outlier(s) at [value(s)]" or "There are no apparent outliers"
Center: "The [mean/median] [variable] is [value with units]"
Spread: "The [variable] ranges from [min] to [max] [units]" or "The standard deviation is [value] [units]"
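The numeric parts of this template (outliers, center, spread) can be computed before writing the sentences. The sketch below is one possible helper, not a standard function: `socs_summary` is a made-up name, the scores are hypothetical, and quartiles are taken as medians of the lower and upper halves. Shape is left out on purpose because it should be judged from a graph.

```python
from statistics import median

def socs_summary(data):
    """Collect the numbers needed for the O, C, and S parts of a SOCS description."""
    ordered = sorted(data)
    half = len(ordered) // 2
    q1, q3 = median(ordered[:half]), median(ordered[-half:])
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "outliers": [x for x in ordered if x < low_fence or x > high_fence],
        "median": median(ordered),
        "iqr": iqr,
        "min": ordered[0],
        "max": ordered[-1],
    }

scores = [45, 74, 77, 79, 81, 82, 84, 87, 89, 93, 98]  # hypothetical test scores, percent
print(socs_summary(scores))
# {'outliers': [45], 'median': 82, 'iqr': 12, 'min': 45, 'max': 98}
```

Each number still needs units and context when it goes into the written description.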
Example:
Data: Test scores in AP Statistics class
"The distribution of test scores is slightly right-skewed and unimodal with one outlier at 45%. The median score is 82%, indicating that half the students scored below 82%. Scores range from 45% to 98%, with an IQR of 12 percentage points, meaning the middle 50% of students scored within a 12-point range. The outlier at 45% is notably below the main cluster of scores between 70% and 98%."
Common Patterns and Interpretations
What Shape Tells Us
Symmetric:
- Process or measurement is balanced
- Natural variation around center
- Use mean and standard deviation
Right-skewed:
- Floor effect (minimum limit)
- Most values small, few very large
- Use median and IQR
Left-skewed:
- Ceiling effect (maximum limit)
- Most values large, few very small
- Use median and IQR
Bimodal:
- Two distinct groups mixed together
- Consider separating and analyzing separately
What Outliers Tell Us
Potential meanings:
- Errors (investigate and possibly correct)
- Unusual but legitimate cases
- Different population mixed in
- Rare but important events
Impact:
- Affect mean more than median
- Affect standard deviation more than IQR
- Can change conclusions if not addressed
What Spread Tells Us
Large spread:
- High variability
- Data quite different from typical value
- Less predictability
Small spread:
- Low variability
- Data close to typical value
- More consistency, predictability
Comparing Distributions
When comparing two or more distributions:
Address each of SOCS:
Shape:
- "Group A is symmetric while Group B is right-skewed"
Outliers:
- "Both groups have outliers, but Group A's are more extreme"
Center:
- "Group A has a higher median (75) than Group B (68)"
Spread:
- "Group A shows more variability (SD = 12) than Group B (SD = 8)"
Example comparison:
"Both male and female height distributions are roughly symmetric and unimodal. Males have a higher mean height (70 inches) compared to females (64 inches), a difference of 6 inches. Both distributions have similar spreads, with standard deviations of approximately 3 inches. Neither distribution shows outliers."
Common Mistakes
❌ Confusing skewness direction (it's the tail, not the peak!)
❌ Using mean with skewed data (median is more appropriate)
❌ Reporting center without spread (both are needed!)
❌ Ignoring units (always include them)
❌ Incomplete descriptions (use full SOCS framework)
❌ Not describing in context (relate to actual situation)
❌ Confusing SD and IQR (they measure spread differently)
Quick Reference
SOCS Framework:
- Shape: Symmetric? Skewed (which direction)? Unimodal/bimodal?
- Outliers: Present? Where? How many?
- Center: Mean or median (with units!)
- Spread: Range, IQR, or SD (with units!)
Mean vs. Median:
- Symmetric, no outliers → Use mean
- Skewed or outliers → Use median
SD vs. IQR:
- Symmetric, no outliers → Use SD
- Skewed or outliers → Use IQR
Skewness:
- Right-skewed: Mean > Median, tail to right
- Left-skewed: Mean < Median, tail to left
- Symmetric: Mean ≈ Median
Remember: A complete description tells the story of the data. Don't just report numbers — interpret them in context and explain what they mean!
📚 Practice Problems
Problem 1 (easy)
❓ Question:
Describe the shape of a distribution that is:
a) Symmetric
b) Skewed right
c) Skewed left
💡 Solution
Step 1: Symmetric distribution
- Mirror image on both sides of center
- Mean ≈ Median
- Example: Normal distribution, heights
- Tail length equal on both sides
Step 2: Skewed right (positive skew)
- Tail extends to the right
- Mean > Median (pulled toward tail)
- Example: Income, house prices
- Most data on left, few high values
Step 3: Skewed left (negative skew)
- Tail extends to the left
- Mean < Median (pulled toward tail)
- Example: Test scores (when easy), age at death
- Most data on right, few low values
Memory trick: "The skew points where the tail points"
Visual summary:
- Symmetric: <-center->
- Right skew: <-center---->
- Left skew: <----center->
Answer:
a) Symmetric: balanced on both sides, mean ≈ median
b) Skewed right: long right tail, mean > median
c) Skewed left: long left tail, mean < median
Problem 2 (easy)
❓ Question:
A dataset has the following properties: Mean = 75, Median = 80. What can you conclude about the shape of the distribution?
💡 Solution
Step 1: Compare mean and median
- Mean = 75
- Median = 80
- Mean < Median
Step 2: Recall the relationship
When Mean < Median:
- Distribution is skewed LEFT (negative skew)
- Tail points to lower values
- A few low values pull the mean down
Step 3: Explain why
- The mean is sensitive to extreme values
- The median is resistant to outliers
- If the mean is pulled below the median, the data likely has some low outliers or a left tail
Step 4: Visualize
- Most data is clustered around 80 (the median)
- Some values fall well below 75
- These low values drag the mean down below the median
Example: If test scores are mostly in 70s-90s, but a few students scored in 40s-50s, mean would be pulled down while median stays high.
Answer: The distribution is skewed LEFT (negatively skewed) because Mean < Median, indicating a long tail toward lower values.
Problem 3 (medium)
❓ Question:
Identify whether each distribution is unimodal, bimodal, or multimodal:
a) Heights of adult humans (all genders)
b) Test scores where most students got A or F
c) Ages of people at a kids movie theater
💡 Solution
Step 1: Understand modes
- Unimodal: One clear peak
- Bimodal: Two distinct peaks
- Multimodal: More than two peaks
Step 2: Analyze each scenario
a) Heights of all adult humans
- Women cluster around ~5'4" (163 cm)
- Men cluster around ~5'9" (175 cm)
- Two distinct groups
Answer: BIMODAL
b) Test scores with mostly A or F
- Cluster around 90-100 (A students)
- Cluster around 0-60 (F students)
- Few in between (B, C, D)
Answer: BIMODAL
c) Ages at kids movie
- Young children (ages 5-12)
- Parents (ages 30-45)
- Possibly grandparents (ages 60-75)
- Could have 2-3 distinct groups
Answer: BIMODAL or MULTIMODAL (likely 2-3 peaks)
Answer:
a) Bimodal (male and female heights)
b) Bimodal (A and F peaks)
c) Bimodal/Multimodal (children and adults)
Problem 4 (medium)
❓ Question:
Describe this distribution using the SOCS framework (Shape, Outliers, Center, Spread): Data shows exam scores with most values between 70-85, mean = 77, median = 78, one score at 45, and range = 40.
💡 Solution
SOCS Framework for describing distributions:
S - SHAPE:
- Mean (77) ≈ Median (78), which suggests a roughly symmetric distribution
- However, the low outlier (45) suggests a slight left skew
- Overall: roughly symmetric, possibly slight left skew
O - OUTLIERS:
- The score of 45 is notably low: 25 points below the bottom of the main cluster (70-85)
- It should be checked with the 1.5 × IQR rule, but it appears to be an outlier
C - CENTER:
- Mean = 77, Median = 78
- A typical exam score is around 77-78
- The mean is pulled slightly down by the low outlier
S - SPREAD:
- Range = 40 points (from 45 to 85)
- Most data falls in the 70-85 range (about 15 points)
- Without the outlier, the spread would be smaller
- IQR is likely around 10-15 points
Complete SOCS description: "The distribution of exam scores is roughly symmetric with a possible slight left skew due to one low outlier at 45. The center of the distribution is around 77-78 (mean and median nearly equal). The scores spread from 45 to 85, a range of 40 points, though most scores cluster between 70-85. The score of 45 appears to be an outlier, sitting well below the main body of data."
Answer: Symmetric/slight left skew, one low outlier (45), center ~77-78, range=40 with most data in 15-point range.
Problem 5 (hard)
❓ Question:
Two distributions have the same mean (50) and same range (20-80). Distribution A is uniform (flat), while Distribution B is normal (bell-shaped). Which distribution would have a larger standard deviation, and why?
💡 Solution
Step 1: Visualize both distributions
Both: Mean = 50, Range = 60 (from 20 to 80)
Distribution A (Uniform):
- Data spread EVENLY from 20 to 80
- Every value equally likely
- Flat histogram
Distribution B (Normal):
- Data concentrated near mean (50)
- Fewer values at extremes (20 and 80)
- Bell-shaped curve
Step 2: Understand standard deviation
- SD measures roughly the typical distance from the mean
- Larger SD = more spread out from center
Step 3: Compare spread from mean
Distribution A (Uniform):
- Many values far from mean (50)
- Values at 20 and 80 are 30 units from mean
- Lots of data at extremes
- Higher average distance from mean
Distribution B (Normal):
- Most data near mean (50)
- Few values at 20 and 80
- Less data at extremes
- Lower average distance from mean
Step 4: Calculate mental estimate
- Uniform: Roughly SD ≈ range/3.5 ≈ 60/3.5 ≈ 17
- Normal: Roughly SD ≈ range/6 ≈ 60/6 ≈ 10
(These are approximations)
Answer: Distribution A (uniform) has LARGER standard deviation because more of its data is spread far from the mean, while Distribution B (normal) has most data clustered near the center. Even with the same range, uniform distributions have more variability than normal distributions.
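The mental estimates in Step 4 can be checked numerically. This sketch relies on two assumptions: the standard deviation of a continuous uniform distribution is its range divided by √12 (about 3.46, which is where the "3.5" comes from), and nearly all of a normal distribution fits within about 6 standard deviations. The evenly spaced integers are a stand-in for uniform data.

```python
from math import sqrt
from statistics import pstdev

low, high = 20, 80

uniform_sd = (high - low) / sqrt(12)   # exact SD of a continuous uniform on [20, 80]
normal_sd = (high - low) / 6           # rule of thumb: range covers about 6 SDs

evenly_spaced = list(range(low, high + 1))   # 20, 21, ..., 80 as stand-in uniform data

print(round(uniform_sd, 1))             # 17.3
print(round(normal_sd, 1))              # 10.0
print(round(pstdev(evenly_spaced), 1))  # 17.6
```

Both uniform figures are well above the normal estimate of 10, which agrees with the answer.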