Displaying Distributions with Graphs

Histograms, dot plots, stem plots, and boxplots

Introduction

"A picture is worth a thousand words" — especially in statistics! Graphs help us visualize data distributions, identify patterns, spot outliers, and communicate findings effectively. Choosing the right graph type depends on your data type and what you want to show.

Graphs for Categorical Data

Bar Graph (Bar Chart)

Purpose: Compare frequencies or percentages across categories

Structure:

  • Categorical variable on x-axis
  • Frequency or percentage on y-axis
  • Bars have gaps between them (not touching)
  • Heights represent frequencies

When to use:

  • Categorical data
  • Comparing categories
  • Showing frequencies or percentages

Example: Favorite ice cream flavors among students

  • Chocolate: 45 students
  • Vanilla: 32 students
  • Strawberry: 18 students
  • Other: 15 students

Key features:

  • Bars can be ordered (by frequency) or kept in natural order
  • Easy to compare categories visually
  • Clear and simple
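The counts in the ice cream example can be converted to percentages and sketched as a text bar graph. This is a minimal illustration, not a plotting recipe: the function names and the `scale` parameter are hypothetical, chosen only for this example.

```python
# Counts taken from the ice cream example above.
counts = {"Chocolate": 45, "Vanilla": 32, "Strawberry": 18, "Other": 15}


def relative_frequencies(counts):
    """Return each category's share of the total as a percentage."""
    total = sum(counts.values())
    return {category: 100 * n / total for category, n in counts.items()}


def text_bar_graph(counts, scale=5):
    """Draw one row per category; each '#' stands for about `scale` observations."""
    lines = []
    for category, n in counts.items():
        bar = "#" * round(n / scale)
        lines.append(f"{category:<12}{bar} {n}")
    return "\n".join(lines)


percentages = relative_frequencies(counts)
print(text_bar_graph(counts))
for category, pct in percentages.items():
    print(f"{category}: {pct:.1f}%")
```

Plotting percentages instead of counts changes only the y-axis scale; the relative bar heights stay the same.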

Pie Chart

Purpose: Show parts of a whole

Structure:

  • Circle divided into slices
  • Each slice represents a category
  • Slice size proportional to percentage

When to use:

  • Want to show proportions
  • Have relatively few categories (3-6 ideal)
  • Emphasizing "part of whole" relationship

Example: Student transportation methods

  • Bus: 40%
  • Car: 30%
  • Walk: 20%
  • Bike: 10%
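Since slice size is proportional to percentage, each slice's central angle is its share of 360 degrees. A short sketch using the transportation percentages above (the function name `slice_angles` is made up for this example):

```python
# Percentages from the student transportation example above.
shares = {"Bus": 40, "Car": 30, "Walk": 20, "Bike": 10}


def slice_angles(shares):
    """Convert each category's share into a central angle in degrees."""
    total = sum(shares.values())
    return {category: 360 * value / total for category, value in shares.items()}


angles = slice_angles(shares)
for category, angle in angles.items():
    print(f"{category}: {angle:.0f} degrees")
```

Here the bus slice spans 144 degrees and the bike slice 36 degrees, so the angles add up to the full circle.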

Advantages:

  • Shows proportions clearly
  • Visually appealing
  • Good for presentations

Disadvantages:

  • Hard to compare similar-sized slices
  • Difficult with many categories
  • Can be misleading with 3D effects

Segmented Bar Chart

Purpose: Compare distributions across multiple groups

Structure:

  • Bars divided into segments
  • Each segment represents a category
  • Can show counts or percentages

When to use:

  • Comparing categorical distributions across groups
  • Two categorical variables
  • Want to see both totals and breakdowns

Example: Transportation method by grade level

  • Each grade has a bar
  • Bars divided by transportation type
  • Can compare both across and within grades

Graphs for Quantitative Data

Dotplot

Purpose: Display individual values for small to moderate datasets

Structure:

  • Number line showing possible values
  • Dot for each observation
  • Dots stack when values repeat

When to use:

  • Small datasets (n < 50)
  • Want to see individual values
  • Looking for clusters, gaps, outliers

Example: Test scores: 75, 80, 80, 82, 85, 85, 85, 90, 95

  • Stack three dots above 85
  • Stack two dots above 80
  • Single dots for 75, 82, 90, 95
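The stacking described above can be sketched in text form. As a simplification, this version only lists the values that occur, so the horizontal spacing is not to scale the way a real number line would be; `dotplot_rows` is a hypothetical helper name.

```python
from collections import Counter

# Test scores from the example above.
scores = [75, 80, 80, 82, 85, 85, 85, 90, 95]


def dotplot_rows(values):
    """Return text rows for a dotplot: tallest stack level first, axis labels last."""
    counts = Counter(values)
    distinct = sorted(counts)
    tallest = max(counts.values())
    rows = []
    for level in range(tallest, 0, -1):
        # Place a dot above each value that occurs at least `level` times.
        marks = ["o" if counts[v] >= level else " " for v in distinct]
        rows.append("".join(f"{m:^4}" for m in marks))
    rows.append("".join(f"{v:^4}" for v in distinct))
    return rows


print("\n".join(dotplot_rows(scores)))
```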

Advantages:

  • Shows every data point
  • Easy to create
  • Good for small datasets

Disadvantages:

  • Impractical for large datasets
  • Can become cluttered

Stemplot (Stem-and-Leaf Plot)

Purpose: Display data while retaining actual values

Structure:

  • Split each value into "stem" (leading digit(s)) and "leaf" (trailing digit)
  • Stems listed vertically
  • Leaves listed horizontally

When to use:

  • Small to moderate datasets
  • Want to preserve actual data values
  • Quick hand-drawn analysis

Example: Test scores: 67, 72, 75, 78, 81, 83, 85, 85, 92

Stem  Leaf
6     7
7     2 5 8
8     1 3 5 5
9     2

Key: 7 | 2 represents 72
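The splitting rule above can be sketched in a few lines. This assumes non-negative whole numbers, with the ones digit as the leaf and everything before it as the stem; the function name `stemplot` is just a label for this example.

```python
from collections import defaultdict


def stemplot(values):
    """Return stemplot lines with tens as stems and ones digits as leaves."""
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        leaves[stem].append(leaf)
    lines = []
    # List every stem from smallest to largest so that gaps stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        row = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        lines.append(f"{stem} | {row}".rstrip())
    return lines


scores = [67, 72, 75, 78, 81, 83, 85, 85, 92]
print("\n".join(stemplot(scores)))
```

Keeping empty stems in the output is a deliberate choice: dropping them would hide gaps in the distribution.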

Back-to-back stemplot: Compare two distributions

  • Shared stems in middle
  • One dataset's leaves on left
  • Other dataset's leaves on right

Advantages:

  • Retains actual values
  • Shows distribution shape
  • Can reconstruct original data

Disadvantages:

  • Tedious for large datasets
  • Choice of stems affects appearance

Histogram

Purpose: Display the distribution of quantitative data

Structure:

  • Quantitative variable on x-axis (divided into bins)
  • Frequency or relative frequency on y-axis
  • Bars touching (continuous data)
  • Bar height = frequency in that interval

When to use:

  • Large datasets
  • Continuous or discrete quantitative data
  • Want to see distribution shape

Example: Heights of students (in inches), assuming each bin includes its lower endpoint but not its upper endpoint

  • 60-62: 5 students
  • 62-64: 12 students
  • 64-66: 23 students
  • 66-68: 18 students
  • 68-70: 8 students

Important considerations:

Bin width:

  • Too narrow → choppy, hard to see pattern
  • Too wide → lose detail, miss features
  • Experiment to find appropriate width

Number of bins:

  • General rule: $\sqrt{n}$ or $\log_2(n) + 1$
  • 5-20 bins usually works well
  • More data → can use more bins
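Both rules of thumb are easy to compute. The sketch below rounds up to a whole number of bins, which is an assumption on my part since the rules above do not say how to round; the function names are hypothetical.

```python
import math


def sqrt_rule(n):
    """Square-root rule: about sqrt(n) bins, rounded up."""
    return math.ceil(math.sqrt(n))


def sturges_rule(n):
    """Log rule: about log2(n) + 1 bins, with the logarithm rounded up."""
    return math.ceil(math.log2(n)) + 1


for n in (30, 66, 100, 1000):
    print(n, sqrt_rule(n), sturges_rule(n))
```

The two rules can disagree noticeably for large n, so treat either result as a starting point and adjust the bin width by eye.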

Advantages:

  • Shows distribution shape clearly
  • Handles large datasets
  • Identifies outliers, gaps, clusters

Disadvantages:

  • Appearance depends on bin choices
  • Loses individual data values
  • Can mislead if bins chosen poorly

Boxplot (Box-and-Whisker Plot)

Purpose: Display five-number summary and identify outliers

Structure:

  • Box from Q1 to Q3 (contains middle 50%)
  • Line at median inside box
  • Whiskers extend to min and max (excluding outliers)
  • Outliers plotted individually

Five-number summary:

  1. Minimum (smallest observation)
  2. Q1 (first quartile, 25th percentile)
  3. Median (50th percentile)
  4. Q3 (third quartile, 75th percentile)
  5. Maximum (largest observation)

Outlier definition:

  • Below: $Q1 - 1.5 \times IQR$
  • Above: $Q3 + 1.5 \times IQR$
  • Where $IQR = Q3 - Q1$
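The outlier rule can be sketched as follows. This assumes one common textbook convention, in which Q1 and Q3 are the medians of the lower and upper halves of the sorted data (the middle value is left out when n is odd); software packages often use other conventions and give slightly different quartiles. The function names are hypothetical, and the data must contain at least two values.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q3) as medians of the lower and upper halves of the data."""
    data = sorted(values)
    half = len(data) // 2
    return median(data[:half]), median(data[-half:])


def fences(q1, q3):
    """Return the lower and upper 1.5 x IQR fences."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def outliers(values):
    """Return the values that fall outside the fences."""
    low, high = fences(*quartiles(values))
    return [v for v in values if v < low or v > high]


scores = [67, 72, 75, 78, 81, 83, 85, 85, 92]
q1, q3 = quartiles(scores)
print(q1, q3, fences(q1, q3), outliers(scores))
```

For these nine scores, Q1 = 73.5 and Q3 = 85, so the fences sit at 56.25 and 102.25 and no score is flagged.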

When to use:

  • Comparing multiple distributions
  • Identifying outliers
  • Showing spread and center
  • Large datasets

Modified boxplot:

  • Whiskers go to last value within 1.5 × IQR
  • Outliers plotted as individual points
  • More informative than regular boxplot

Side-by-side boxplots:

  • Compare distributions across groups
  • Same scale for all boxes
  • Easy to see differences in center, spread, shape

Advantages:

  • Compact display
  • Shows spread clearly
  • Easy to compare groups
  • Identifies outliers automatically

Disadvantages:

  • Doesn't show distribution shape well
  • Can hide bimodality or other features
  • Less detail than histogram

Cumulative Frequency Plot (Ogive)

Purpose: Show cumulative frequencies or percentages

Structure:

  • Data values on x-axis
  • Cumulative frequency/percentage on y-axis
  • Line connects points
  • Always increasing (or flat)

When to use:

  • Want to find percentiles
  • Show how data accumulates
  • Identify median and quartiles

Uses:

  • Read off percentiles directly
  • See what percentage falls below a value
  • Identify quartile locations
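The points of a cumulative frequency plot are running totals of the bin counts, plotted at each bin's upper edge. A minimal sketch using the student-height counts from the histogram example (variable and function names are made up for illustration):

```python
from itertools import accumulate

# Upper bin edges (inches) and counts from the student-height histogram example.
upper_edges = [62, 64, 66, 68, 70]
counts = [5, 12, 23, 18, 8]


def cumulative_percentages(counts):
    """Return the running total of counts as a percentage of all observations."""
    total = sum(counts)
    return [100 * running / total for running in accumulate(counts)]


ogive_points = list(zip(upper_edges, cumulative_percentages(counts)))
for edge, pct in ogive_points:
    print(f"below {edge} in: {pct:.1f}%")
```

Reading this table, about 26% of heights fall below 64 inches and about 61% below 66 inches, so the median lies in the 64-66 bin.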

Describing Distributions (SOCS)

When analyzing any graph, describe the distribution using SOCS:

S - Shape

Symmetric: Balanced around center (mirror image)

  • Normal (bell-shaped)
  • Uniform (flat, rectangular)

Skewed:

  • Right-skewed (positive): Tail extends to right, typically mean > median
  • Left-skewed (negative): Tail extends to left, typically mean < median

Modality:

  • Unimodal: One peak
  • Bimodal: Two peaks
  • Multimodal: Multiple peaks
  • Uniform: No peaks

O - Outliers

Outliers: Observations unusually far from bulk of data

Identify:

  • Visual inspection (far from others)
  • 1.5 × IQR rule (for boxplots)
  • More than 2-3 standard deviations from mean

Report:

  • Note presence
  • Give values if possible
  • Consider causes (error? legitimate?)

C - Center

Typical value: Where data tends to cluster

Measures:

  • Median (middle value)
  • Mean (average)
  • Mode (most common)

In description: "The center is around [value]" or "The median is [value]"

S - Spread

Variability: How spread out data is

Measures:

  • Range (max - min)
  • IQR (Q3 - Q1)
  • Standard deviation

In description: "Values range from [min] to [max]" or "Most values fall between [Q1] and [Q3]"
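The three spread measures can be computed with Python's standard library. This sketch reuses the test scores from the stemplot example and assumes the median-of-halves convention for quartiles and the sample (n - 1) standard deviation; `spread_summary` is a hypothetical name.

```python
from statistics import median, stdev

# Test scores from the stemplot example.
scores = [67, 72, 75, 78, 81, 83, 85, 85, 92]


def spread_summary(values):
    """Return the range, IQR, and sample standard deviation of the data."""
    data = sorted(values)
    half = len(data) // 2
    q1, q3 = median(data[:half]), median(data[-half:])
    return {
        "range": data[-1] - data[0],
        "iqr": q3 - q1,
        "stdev": stdev(data),
    }


summary = spread_summary(scores)
print(summary)
```

The range depends only on the two most extreme values, while the IQR ignores them, which is why the IQR is usually preferred when outliers are present.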

Choosing the Right Graph

Decision Guide

Categorical data:

  • Few categories, show proportions → Pie chart
  • Compare categories → Bar graph
  • Compare across groups → Segmented bar chart

Quantitative data:

  • Small dataset (roughly n < 50) → Dotplot or stemplot
  • Show distribution shape → Histogram
  • Compare groups → Side-by-side boxplots
  • Identify outliers → Boxplot
  • Find percentiles → Cumulative frequency plot

Common Mistakes to Avoid

  • Pie charts for quantitative data
  • 3D or decorative effects (distort perception)
  • Inconsistent scales when comparing
  • Too many/too few bins in histograms
  • Bar graph with touching bars (that's a histogram!)
  • Missing labels on axes
  • No scale on axes

Best Practices

  • Label axes clearly with variable names and units
  • Include title describing what graph shows
  • Use consistent scales when comparing
  • Choose appropriate graph type for data
  • Make it readable (not too small, cluttered)
  • Describe using SOCS in analysis
  • Note any outliers or unusual features

Quick Reference

Graph Selection:

  • Categorical: Bar graph or pie chart
  • Small quantitative: Dotplot or stemplot
  • Large quantitative: Histogram or boxplot
  • Comparisons: Side-by-side boxplots or segmented bar charts
  • Percentiles: Cumulative frequency plot

SOCS Description:

  • Shape: symmetric, skewed (left/right), unimodal/bimodal
  • Outliers: identify and report
  • Center: median, mean
  • Spread: range, IQR, standard deviation

Remember: The best graph clearly communicates the story in your data. When in doubt, try multiple types and choose the one that reveals the most!

Practice Problems

Problem 1 (easy)

Question:

What type of graph would be most appropriate for displaying each of the following?

  a) The distribution of test scores (0-100) for a class
  b) The number of students in each major at a university
  c) The relationship between study hours and exam scores

Solution

Step 1: Match data type to graph type

a) Test scores (0-100) - Quantitative, continuous

Best choice: HISTOGRAM

  • Shows distribution shape
  • Can see center, spread, outliers

Alternative: Boxplot, or dotplot (for small datasets)

b) Number of students in each major - Categorical

Best choice: BAR GRAPH

  • Each major is a category
  • Height shows frequency/count
  • Bars should NOT touch (categorical)

c) Study hours vs exam scores - Two quantitative variables

Best choice: SCATTERPLOT

  • Shows relationship between two quantitative variables
  • Each point represents one student
  • Can assess correlation

Answer: a) Histogram b) Bar graph c) Scatterplot

Problem 2 (easy)

Question:

Given this data on ages of 20 people: 18, 19, 19, 20, 20, 20, 21, 21, 22, 22, 23, 23, 24, 25, 26, 27, 30, 35, 40, 55. Create a stemplot (stem-and-leaf plot) for this data.

Solution

Step 1: Organize by stems (tens place)

  • Stem = tens digit
  • Leaf = ones digit

Step 2: List all data points by stem

1 | 8, 9, 9
2 | 0, 0, 0, 1, 1, 2, 2, 3, 3, 4, 5, 6, 7
3 | 0, 5
4 | 0
5 | 5

Step 3: Create the stemplot with key

Stem-and-Leaf Plot:

1 | 8 9 9
2 | 0 0 0 1 1 2 2 3 3 4 5 6 7
3 | 0 5
4 | 0
5 | 5

Key: 1|8 = 18 years old

Step 4: Observations

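  • Most people are in their 20s (13 of the 20 values)
  • Possible high outliers at 40 and 55 (with Q1 = 20 and Q3 = 26.5, both lie above the upper 1.5 × IQR fence of 36.25)
  • Right-skewed overall: roughly symmetric in the 18-27 range, with a long tail toward higher ages
  • Gaps in the upper tail, the largest between 40 and 55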

Answer: See stemplot above
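One way to check the overall shape of these ages numerically is to compare the mean with the median, since a long right tail tends to pull the mean above the median. A short sketch using only the standard library:

```python
from statistics import mean, median

# Ages from Problem 2.
ages = [18, 19, 19, 20, 20, 20, 21, 21, 22, 22,
        23, 23, 24, 25, 26, 27, 30, 35, 40, 55]

age_mean = mean(ages)
age_median = median(ages)

print(f"mean = {age_mean}, median = {age_median}")
if age_mean > age_median:
    print("Mean exceeds median: consistent with a right-skewed distribution")
```

The mean is 25.5 and the median is 22.5, which matches the long upper tail visible in the stemplot.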

Problem 3 (medium)

Question:

The following histogram shows test scores. Describe the shape, center, and spread of the distribution. [Histogram with bins: 50-59(2), 60-69(5), 70-79(12), 80-89(8), 90-99(3)]

Solution

Step 1: Determine the shape

Look at the overall pattern:

  • Peak at 70-79 (most frequent)
  • Decreases on both sides of peak
  • Roughly symmetric, slight left skew
  • One mode (unimodal)

Shape: Roughly symmetric, unimodal, slightly skewed left

Step 2: Estimate the center

  • Peak bin: 70-79
  • Most data in the 70-89 range
  • Approximate mean/median: around 75-77

Step 3: Describe the spread

  • Range: 50 to 99 (approximately 50 points)
  • Most data spans about 30-40 points (60-90)
  • Variability: moderate spread

Step 4: Look for unusual features

  • Small tail on left (50s and 60s)
  • Very few extreme scores
  • No major outliers
  • No scores below 50

Answer:

  • Shape: Unimodal, roughly symmetric with slight left skew
  • Center: Around 75-77
  • Spread: Scores range from 50s to 90s, with most between 60-90
  • Unusual: Small left tail, no scores below 50
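The center estimate can be checked from the bin counts alone by treating every score in a bin as if it sat at the bin's midpoint. This is an approximation, since the individual scores are not given, and the helper names below are hypothetical.

```python
# (low, high, count) for each bin of the Problem 3 histogram.
bins = [(50, 59, 2), (60, 69, 5), (70, 79, 12), (80, 89, 8), (90, 99, 3)]


def grouped_mean(bins):
    """Estimate the mean by weighting each bin's midpoint by its count."""
    total = sum(count for _, _, count in bins)
    weighted = sum((low + high) / 2 * count for low, high, count in bins)
    return weighted / total


def median_bin(bins):
    """Return the first bin whose cumulative count reaches half the data."""
    total = sum(count for _, _, count in bins)
    running = 0
    for low, high, count in bins:
        running += count
        if running >= total / 2:
            return (low, high)


print(round(grouped_mean(bins), 1), median_bin(bins))
```

The estimated mean is about 76.2 and the median falls in the 70-79 bin, both in line with the center described above.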

Problem 4 (medium)

Question:

Compare histograms and boxplots. What are the advantages and disadvantages of each for displaying distributions?

Solution

HISTOGRAMS:

Advantages:

  1. Show the shape of distribution clearly
  2. Can see multiple modes (bimodal, multimodal)
  3. Display frequency/count information
  4. Show gaps in data
  5. Can see actual data density

Disadvantages:

  1. Appearance depends on bin width choice
  2. Harder to compare multiple distributions
  3. Don't show specific summary statistics
  4. Take more space for multiple groups

BOXPLOTS:

Advantages:

  1. Show 5-number summary clearly (min, Q1, median, Q3, max)
  2. Excellent for comparing multiple distributions side-by-side
  3. Clearly identify outliers
  4. Compact representation
  5. Good for large datasets

Disadvantages:

  1. Don't show the shape as clearly
  2. Can't see multiple modes
  3. Hide detailed distribution features
  4. Don't show sample size
  5. Can't see gaps in data

WHEN TO USE EACH:

Use Histogram when:

  • Need to see detailed shape
  • Checking for normality
  • Looking for multiple modes
  • Single distribution to display

Use Boxplot when:

  • Comparing multiple groups
  • Quick summary needed
  • Identifying outliers is priority
  • Limited space available

Answer: Histograms show shape better; boxplots better for comparisons and outlier detection. Choice depends on analysis goals.

Problem 5 (hard)

Question:

Create a boxplot for this five-number summary: Min=12, Q1=18, Median=23, Q3=29, Max=45. Then identify if there are any outliers using the 1.5×IQR rule.

Solution

Step 1: Calculate IQR

IQR = Q3 - Q1 = 29 - 18 = 11

Step 2: Calculate outlier boundaries

Lower fence = Q1 - 1.5×IQR = 18 - 1.5(11) = 18 - 16.5 = 1.5

Upper fence = Q3 + 1.5×IQR = 29 + 1.5(11) = 29 + 16.5 = 45.5

Step 3: Identify outliers

Any value < 1.5 or > 45.5 is an outlier.

Check our values:

  • Min = 12: Is 12 < 1.5? No → Not an outlier
  • Max = 45: Is 45 > 45.5? No → Not an outlier

Step 4: Draw the boxplot

No outliers, so whiskers extend to actual min and max.

Boxplot (not to scale):

|------[====|====]------|
12     18   23   29     45

  • Box: From Q1 (18) to Q3 (29)
  • Line in box: Median (23)
  • Left whisker: To Min (12)
  • Right whisker: To Max (45)

Step 5: Observations

  • Median closer to Q1 than Q3 (slight right skew)
  • Right whisker longer than left (confirms right skew)
  • No outliers

Answer: No outliers. Boxplot shows slight right skew with all data within fences.