Types of Data and Sampling

Categorical vs quantitative data, sampling methods

Introduction

Statistics is the science of collecting, organizing, analyzing, and interpreting data. Understanding the different types of data and proper sampling methods is fundamental to conducting valid statistical analyses.

Types of Data

Categorical vs. Quantitative

Categorical (Qualitative) Data:

  • Describes characteristics or qualities
  • Places individuals into categories
  • Cannot be measured numerically in a meaningful way

Examples:

  • Eye color (blue, brown, green)
  • Political party (Democrat, Republican, Independent)
  • Type of car (sedan, SUV, truck)
  • Opinion rating (agree, neutral, disagree)

Quantitative (Numerical) Data:

  • Consists of numerical measurements or counts
  • Can be added, averaged, or otherwise manipulated mathematically

Examples:

  • Height (68 inches, 72 inches)
  • Test score (85, 92, 78)
  • Number of siblings (0, 1, 2, 3)
  • Temperature (72°F, 85°F)

Discrete vs. Continuous

Within quantitative data, we distinguish:

Discrete Data:

  • Countable values
  • Usually whole numbers
  • Often from counting

Examples:

  • Number of students in a class (25, 30, 18)
  • Number of cars owned (0, 1, 2, 3)
  • Number of errors on a test (2, 5, 0)

Continuous Data:

  • Can take any value in an interval
  • Usually from measuring
  • Infinitely many possible values between any two points

Examples:

  • Height (5.7 feet, 5.75 feet, 5.752 feet...)
  • Weight (142.3 lbs, 142.35 lbs...)
  • Time (3.2 seconds, 3.25 seconds...)

Levels of Measurement

Understanding the level of measurement helps determine appropriate statistical analyses.

Nominal

Characteristics:

  • Categories with no inherent order
  • Most basic level
  • Can only count frequencies

Examples:

  • Blood type (A, B, AB, O)
  • Gender (male, female, non-binary)
  • Favorite color (red, blue, green)

Valid operations: Count, mode

Ordinal

Characteristics:

  • Categories with meaningful order
  • Differences between ranks not necessarily equal
  • Cannot measure exact distance between values

Examples:

  • Class rank (1st, 2nd, 3rd)
  • Letter grades (A, B, C, D, F)
  • Satisfaction rating (very satisfied, satisfied, neutral, dissatisfied)

Valid operations: Count, mode, median

Interval

Characteristics:

  • Numerical scale with equal intervals
  • No true zero point
  • Zero doesn't mean "absence of"

Examples:

  • Temperature in Celsius or Fahrenheit (0°F doesn't mean "no temperature")
  • IQ scores
  • Calendar years (year 0 is arbitrary)

Valid operations: Count, mode, median, mean, addition/subtraction

Ratio

Characteristics:

  • Numerical scale with equal intervals
  • Has true zero point
  • Zero means complete absence
  • Can form ratios (twice as much, half as big)

Examples:

  • Height (0 inches = no height)
  • Weight (0 lbs = no weight)
  • Age (0 years = newborn)
  • Income (0 dollars = no income)

Valid operations: All mathematical operations
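
The "valid operations" listed for each level can be sketched in a few lines of Python. This is only an illustration: the data values and the variable names (blood_types, grades, temps_c, heights_in) are made up for the example, and the grade ordering is an assumption that must be supplied by hand.

```python
import statistics

# Hypothetical data, one list per level of measurement.
blood_types = ["A", "O", "B", "O", "AB", "O"]   # nominal
grades = ["B", "A", "C", "B", "F", "B", "A"]    # ordinal
temps_c = [10.0, 20.0, 15.0]                    # interval
heights_in = [60.0, 72.0, 66.0]                 # ratio

# Nominal: only counting and the mode are meaningful.
nominal_mode = statistics.mode(blood_types)

# Ordinal: the median is meaningful once the order is made explicit.
grade_order = ["F", "D", "C", "B", "A"]         # worst to best
ranks = [grade_order.index(g) for g in grades]
ordinal_median = grade_order[statistics.median_low(ranks)]

# Interval: means and differences are meaningful, ratios are not.
interval_mean = statistics.mean(temps_c)
interval_diff = temps_c[1] - temps_c[0]

# Ratio: ratios are meaningful because zero means "none".
height_ratio = heights_in[1] / heights_in[0]

print(nominal_mode, ordinal_median, interval_mean, interval_diff, height_ratio)
```

Note that 20°C is not "twice as hot" as 10°C, which is why the sketch computes a difference for the interval data but a ratio only for the heights.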

Populations vs. Samples

Population

Definition: The entire group of individuals or items we want to study

Characteristics:

  • Complete collection
  • Often too large or expensive to study completely
  • Denoted by N for size

Examples:

  • All students in the United States
  • All adults registered to vote in California
  • Every car manufactured by Toyota in 2024

Parameters: Numerical characteristics of populations

  • Population mean: μ (mu)
  • Population standard deviation: σ (sigma)
  • Population proportion: p

Sample

Definition: A subset of the population, selected for study

Characteristics:

  • Ideally a representative portion of the population
  • Practical and economical to study
  • Denoted by n for size

Examples:

  • 500 randomly selected U.S. students
  • 1,000 California voters surveyed
  • 100 Toyota cars tested from 2024 production

Statistics: Numerical characteristics of samples

  • Sample mean: x̄ (x-bar)
  • Sample standard deviation: s
  • Sample proportion: p̂ (p-hat)

Key relationship: We use statistics from samples to make inferences about parameters of populations.
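
To make the parameter/statistic distinction concrete, here is a minimal sketch using Python's standard library. The population of test scores is randomly generated and purely hypothetical; the seed is fixed only so the run is repeatable.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: test scores for N = 1000 students.
population = [rng.randint(50, 100) for _ in range(1000)]
N = len(population)
mu = statistics.mean(population)        # parameter: population mean
sigma = statistics.pstdev(population)   # parameter: population standard deviation

# Sample of n = 50 students drawn without replacement.
sample = rng.sample(population, 50)
n = len(sample)
x_bar = statistics.mean(sample)         # statistic: sample mean
s = statistics.stdev(sample)            # statistic: sample standard deviation (n - 1 divisor)

print(f"N = {N}, mu = {mu:.2f}, sigma = {sigma:.2f}")
print(f"n = {n}, x_bar = {x_bar:.2f}, s = {s:.2f}")
```

In practice mu and sigma are unknown; the sample values x_bar and s are what we actually compute and use as estimates.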

Sampling Methods

Random Sampling

Simple Random Sample (SRS):

  • Every individual has an equal chance of selection
  • Every possible group of n individuals has an equal chance of being the sample
  • "Gold standard" of sampling

How to obtain:

  • Assign numbers to all population members
  • Use random number generator
  • Select corresponding individuals

Example: Put all 500 student names in a hat, mix thoroughly, draw 50 names

Advantages:

  • Unbiased
  • Simple to understand
  • Known probability of selection

Disadvantages:

  • Requires complete list of population
  • May not represent subgroups well
  • Can be impractical for large populations
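
The "how to obtain" steps above can be sketched in Python. The function name simple_random_sample and the student list are hypothetical; random.Random.sample draws without replacement, so every group of n members is equally likely.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Return n distinct members; every subset of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, n)

# Hypothetical sampling frame: a complete list of 500 students.
students = [f"student_{i}" for i in range(1, 501)]

# Equivalent to drawing 50 names from a well-mixed hat.
chosen = simple_random_sample(students, 50, seed=1)
print(len(chosen), chosen[:3])
```

The seed argument is there only to make the example repeatable; a real study would typically leave it unset or record it for auditing.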

Stratified Random Sample

Method:

  • Divide population into homogeneous groups (strata)
  • Take SRS from each stratum
  • Combine samples

Example: Divide school by grade level (9th, 10th, 11th, 12th), randomly sample 25 students from each grade

When to use:

  • Want to ensure representation of subgroups
  • Strata are internally similar but different from each other
  • Interested in comparing groups

Advantages:

  • Guarantees representation from each stratum
  • More precise estimates
  • Can compare strata

Disadvantages:

  • Requires knowledge of population characteristics
  • More complex than SRS
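
A minimal sketch of the stratified method, assuming the strata are already known and stored as a dictionary. The function name stratified_sample and the school data are invented for illustration, and the sketch uses equal allocation (the same number from each stratum), as in the grade-level example.

```python
import random

def stratified_sample(strata, n_per_stratum, seed=None):
    """Take a simple random sample of n_per_stratum members from every stratum."""
    rng = random.Random(seed)
    return {
        name: rng.sample(members, n_per_stratum)
        for name, members in strata.items()
    }

# Hypothetical school: 200 students in each of grades 9 through 12.
school = {
    grade: [f"grade{grade}_student{i}" for i in range(200)]
    for grade in (9, 10, 11, 12)
}

sample = stratified_sample(school, 25, seed=2)
total_sampled = sum(len(members) for members in sample.values())
print(total_sampled)
```

Every grade is guaranteed to appear in the result, which an SRS of 100 students cannot promise.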

Cluster Sample

Method:

  • Divide population into groups (clusters)
  • Randomly select some clusters
  • Study ALL individuals in selected clusters

Example: Divide city into neighborhoods (clusters), randomly select 5 neighborhoods, survey all households in those 5

When to use:

  • No complete population list available
  • Geographically dispersed population
  • Cost-effective approach needed

Advantages:

  • Practical and economical
  • No need for complete population list
  • Reduces travel/contact costs

Disadvantages:

  • Usually less precise than an SRS of the same size
  • Works well only when each cluster is heterogeneous (like a mini-population)
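
The cluster method can be sketched as follows. The function name cluster_sample and the neighborhood data are hypothetical; the key difference from stratified sampling is that only some groups are chosen, and everyone in a chosen group is studied.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly pick n_clusters whole clusters and keep every member of each."""
    rng = random.Random(seed)
    picked = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in picked}

# Hypothetical city: 20 neighborhoods, each a list of household IDs.
city = {
    f"neighborhood_{i}": [f"n{i}_house_{j}" for j in range(30 + i)]
    for i in range(1, 21)
}

surveyed = cluster_sample(city, 5, seed=3)
households = [h for members in surveyed.values() for h in members]
print(sorted(surveyed), len(households))
```

Only a list of neighborhoods is needed up front, not a list of every household in the city.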

Systematic Sample

Method:

  • Select every kth individual from list
  • Random starting point
  • k = N/n (population size / sample size)

Example: From a list of 1,000 students, pick a random start between 1 and 10, then select every 10th student to get a sample of 100

When to use:

  • Have organized list
  • Want easy implementation
  • List has no cyclical pattern that lines up with the interval k

Advantages:

  • Simple to implement
  • Spreads sample across population
  • Often as good as SRS

Disadvantages:

  • Problems if list has hidden patterns
  • Not an SRS: only groups spaced k apart can be chosen
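
A minimal sketch of systematic sampling, assuming the sample size is no larger than the list. The function name systematic_sample is hypothetical, and the interval is computed with integer division, so the result is trimmed when the population size is not an exact multiple of the sample size.

```python
import random

def systematic_sample(items, n, seed=None):
    """Pick every k-th item after a random start, with k = N // n."""
    k = len(items) // n              # sampling interval (assumes n <= N)
    rng = random.Random(seed)
    start = rng.randrange(k)         # random 0-based start among the first k items
    return items[start::k][:n]       # trim in case N is not a multiple of n

# Hypothetical ordered list of 1000 student ID numbers.
students = list(range(1, 1001))

sample = systematic_sample(students, 100, seed=4)
print(len(sample), sample[:3])
```

Once the start is fixed, the whole sample is determined, which is why a hidden pattern in the list with period k can bias the result.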

Sampling Bias

Types of Bias

Selection Bias:

  • Some individuals more likely to be selected
  • Sample not representative of population

Example: Surveying only people in shopping mall (excludes those who don't shop there)

Voluntary Response Bias:

  • Individuals choose to participate
  • Often those with strong opinions respond

Example: Online poll where anyone can vote (those who care most will participate)

Undercoverage:

  • Some groups systematically excluded
  • Sampling frame incomplete

Example: Phone survey excludes those without phones

Nonresponse Bias:

  • Selected individuals don't respond
  • Respondents differ from non-respondents

Example: Survey with a 20% response rate (the 80% who did not respond may hold different views from the 20% who did)
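
A small simulation can show how nonresponse distorts an estimate even when the selection itself is random. Everything here is assumed for illustration: the population of satisfaction scores is randomly generated, and the response probabilities (60% for unhappy people, 10% for everyone else) are invented.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed for repeatability

# Hypothetical population of 10,000 satisfaction scores (1 = very unhappy, 5 = very happy).
population = [rng.randint(1, 5) for _ in range(10_000)]

def responds(score):
    """ASSUMPTION: unhappy people (scores 1-2) reply 60% of the time, others 10%."""
    p = 0.6 if score <= 2 else 0.1
    return rng.random() < p

# The selection is a genuine SRS of 2,000 people, but only some of them reply.
selected = rng.sample(population, 2_000)
respondents = [score for score in selected if responds(score)]

true_mean = statistics.mean(population)
respondent_mean = statistics.mean(respondents)
response_rate = len(respondents) / len(selected)

print(f"true mean: {true_mean:.2f}")
print(f"respondent mean: {respondent_mean:.2f} (response rate {response_rate:.0%})")
```

Under these assumptions the respondents' average falls well below the population average, because dissatisfied people are overrepresented among those who reply.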

Best Practices

For Valid Sampling:

  • Use random selection when possible
  • Define population clearly
  • Ensure sampling frame matches population
  • Minimize nonresponse
  • Watch for sources of bias
  • Use stratification when subgroups matter
  • Make sample size adequate for precision needed

Common Mistakes to Avoid:

❌ Convenience sampling (just because it's easy)
❌ Voluntary response (self-selection bias)
❌ Assuming bigger is always better (quality > quantity)
❌ Ignoring nonresponse
❌ Using outdated sampling frame

Quick Reference

Data Type Decision Tree:

  1. Is it a numerical measurement or count (not just a numeric label)? → Quantitative (otherwise Categorical)
  2. Can it be counted? → Discrete (otherwise Continuous)
  3. Does it have true zero? → Ratio (otherwise Interval)
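
The decision tree can be written as a small function. This is only a sketch: the function name classify_variable and its three yes/no arguments are hypothetical, and answering the questions correctly for a given variable is still up to you.

```python
def classify_variable(is_numerical, is_counted=False, has_true_zero=False):
    """Walk the three questions of the decision tree in order."""
    if not is_numerical:
        return "categorical"
    kind = "discrete" if is_counted else "continuous"
    level = "ratio" if has_true_zero else "interval"
    return f"quantitative, {kind}, {level}"

print(classify_variable(False))               # eye color
print(classify_variable(True, True, True))    # number of siblings
print(classify_variable(True, False, False))  # temperature in Celsius
```

The second and third questions are asked only for quantitative variables, which is why the function returns early for categorical ones.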

Sampling Method Selection:

  • Want simplicity and have complete list → SRS
  • Need to ensure subgroup representation → Stratified
  • Population spread out geographically → Cluster
  • Have organized list, want efficiency → Systematic

Remember: Good sampling is the foundation of valid statistical inference. A biased sample, no matter how large, leads to invalid conclusions!

📚 Practice Problems

Problem 1 (easy)

Question:

Classify each variable as categorical or quantitative:

  a) Eye color of students
  b) Number of siblings
  c) Brand of smartphone
  d) Height in centimeters

Solution

Step 1: Understand the distinction

  • Categorical: Places individuals into groups/categories
  • Quantitative: Takes numerical values with meaningful operations

Step 2: Analyze each variable

  a) Eye color: Categories (blue, brown, green, etc.) → CATEGORICAL
  b) Number of siblings: Numerical count, can calculate average → QUANTITATIVE
  c) Brand of smartphone: Categories (Apple, Samsung, etc.) → CATEGORICAL
  d) Height in centimeters: Numerical measurement, can calculate mean → QUANTITATIVE

Answer: a) Categorical b) Quantitative c) Categorical d) Quantitative

Problem 2 (easy)

Question:

A survey asks students: "Rate your satisfaction with the cafeteria food on a scale of 1-5." Is this categorical or quantitative? Explain.

Solution

Step 1: Analyze the data type

Scale: 1-5 (numbers are used)

Step 2: Consider the nature of the scale

  • Numbers represent categories of satisfaction (very unsatisfied → very satisfied)
  • The numbers are ordinal (ordered categories)
  • Differences between numbers aren't necessarily equal
  • Can't meaningfully say "2 is twice as satisfied as 1"

Step 3: Classify

This is CATEGORICAL (specifically ordinal categorical data)

  • Even though numbers are used, they represent categories
  • The numbers are labels for satisfaction levels
  • Also called "Likert scale" data

Note: Some statisticians treat ordinal data as quantitative in certain contexts, but strictly speaking, it's categorical with an order.

Answer: Categorical (ordinal)

Problem 3 (medium)

Question:

Identify whether each sampling method is a Simple Random Sample (SRS), Stratified, Cluster, or Systematic:

  a) Select every 10th person entering a store
  b) Divide school by grade level, then randomly select students from each grade
  c) Randomly select 5 classrooms and survey all students in those classrooms

Solution

Step 1: Review sampling methods

  • SRS: Every group of n individuals has an equal chance of being selected
  • Stratified: Divide into groups (strata), sample from each
  • Cluster: Divide into groups, randomly select some groups, use ALL individuals from the selected groups
  • Systematic: Select every kth individual

Step 2: Classify each method

a) Every 10th person
Pattern: Select at regular intervals
This is SYSTEMATIC sampling

b) Divide by grade, sample from each
Pattern: Create homogeneous groups (grades), sample from ALL groups
This is STRATIFIED sampling

c) Select 5 classrooms, survey all students
Pattern: Groups (clusters) selected, then ALL within those groups surveyed
This is CLUSTER sampling

Answer: a) Systematic b) Stratified c) Cluster

Problem 4 (medium)

Question:

A researcher wants to estimate the average income in a city. She divides the city into neighborhoods based on property values (low, medium, high), then randomly samples 50 households from each neighborhood. What type of sampling is this, and why might it be better than a simple random sample?

Solution

Step 1: Identify the sampling method

Process:

  1. Divide population into groups (neighborhoods by property value)
  2. Sample from EACH group
  3. Here the allocation is equal: 50 households from each stratum

This is STRATIFIED sampling

Step 2: Explain advantages over SRS

Why stratified is better here:

  1. Ensures representation: Guarantees all income levels represented
  2. Reduces variability: Within each stratum, incomes are more similar
  3. Increases precision: Can get more accurate estimates with same sample size
  4. Allows subgroup analysis: Can compare neighborhoods

With SRS:

  • Might randomly miss low-income or high-income areas
  • Higher chance of sampling error
  • Less efficient estimation

Step 3: Statistical benefit

Stratified sampling reduces the standard error of the estimate when:

  • Strata are homogeneous within
  • Strata are heterogeneous between
  • Income varies greatly by neighborhood (which it does!)

Answer: Stratified sampling. It's better because it ensures all income levels are represented, reduces sampling variability, and provides more precise estimates than SRS when the population has distinct subgroups.

Problem 5 (hard)

Question:

A college has 10,000 students: 6,000 freshmen, 2,500 sophomores, 1,000 juniors, and 500 seniors. Design a stratified random sample of 200 students that maintains the same proportions. How many students should be selected from each class?

Solution

Step 1: Find the proportion of each class

Total students: 10,000

  • Freshmen: 6,000/10,000 = 0.60 = 60%
  • Sophomores: 2,500/10,000 = 0.25 = 25%
  • Juniors: 1,000/10,000 = 0.10 = 10%
  • Seniors: 500/10,000 = 0.05 = 5%

Step 2: Apply proportions to sample size

Sample size: 200 students

  • Freshmen: 200 × 0.60 = 120 students
  • Sophomores: 200 × 0.25 = 50 students
  • Juniors: 200 × 0.10 = 20 students
  • Seniors: 200 × 0.05 = 10 students

Step 3: Verify total

120 + 50 + 20 + 10 = 200 ✓

Step 4: Verify proportions maintained

  • Freshmen: 120/200 = 60% ✓
  • Sophomores: 50/200 = 25% ✓
  • Juniors: 20/200 = 10% ✓
  • Seniors: 10/200 = 5% ✓

Answer: Freshmen: 120 Sophomores: 50 Juniors: 20 Seniors: 10
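
The same proportional allocation can be checked with a short Python sketch. The function name proportional_allocation is hypothetical, and the simple rounding used here is an assumption: it works because these numbers divide evenly, but in general rounded counts may not add up to exactly n and would need adjusting.

```python
def proportional_allocation(stratum_sizes, n):
    """Split a total sample size n across strata in proportion to their sizes."""
    total = sum(stratum_sizes.values())
    return {
        name: round(n * size / total)
        for name, size in stratum_sizes.items()
    }

college = {"freshmen": 6000, "sophomores": 2500, "juniors": 1000, "seniors": 500}
allocation = proportional_allocation(college, 200)

print(allocation)
# {'freshmen': 120, 'sophomores': 50, 'juniors': 20, 'seniors': 10}
print(sum(allocation.values()))
# 200
```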