Types of Data and Sampling
Categorical vs quantitative data, sampling methods
Introduction
Statistics is the science of collecting, organizing, analyzing, and interpreting data. Understanding the different types of data and proper sampling methods is fundamental to conducting valid statistical analyses.
Types of Data
Categorical vs. Quantitative
Categorical (Qualitative) Data:
- Describes characteristics or qualities
- Places individuals into categories
- Cannot be measured numerically in a meaningful way
Examples:
- Eye color (blue, brown, green)
- Political party (Democrat, Republican, Independent)
- Type of car (sedan, SUV, truck)
- Opinion rating (agree, neutral, disagree)
Quantitative (Numerical) Data:
- Consists of numerical measurements or counts
- Can be added, averaged, or otherwise manipulated mathematically
Examples:
- Height (68 inches, 72 inches)
- Test score (85, 92, 78)
- Number of siblings (0, 1, 2, 3)
- Temperature (72°F, 85°F)
Discrete vs. Continuous
Within quantitative data, we distinguish:
Discrete Data:
- Countable values
- Usually whole numbers
- Often from counting
Examples:
- Number of students in a class (25, 30, 18)
- Number of cars owned (0, 1, 2, 3)
- Number of errors on a test (2, 5, 0)
Continuous Data:
- Can take any value in an interval
- Usually from measuring
- Infinite possible values between any two points
Examples:
- Height (5.7 feet, 5.75 feet, 5.752 feet...)
- Weight (142.3 lbs, 142.35 lbs...)
- Time (3.2 seconds, 3.25 seconds...)
Levels of Measurement
Understanding the level of measurement helps determine appropriate statistical analyses.
Nominal
Characteristics:
- Categories with no inherent order
- Most basic level
- Can only count frequencies
Examples:
- Blood type (A, B, AB, O)
- Gender (male, female, non-binary)
- Favorite color (red, blue, green)
Valid operations: Count, mode
Ordinal
Characteristics:
- Categories with meaningful order
- Differences between ranks not necessarily equal
- Cannot measure exact distance between values
Examples:
- Class rank (1st, 2nd, 3rd)
- Letter grades (A, B, C, D, F)
- Satisfaction rating (very satisfied, satisfied, neutral, dissatisfied)
Valid operations: Count, mode, median
Interval
Characteristics:
- Numerical scale with equal intervals
- No true zero point
- Zero doesn't mean "absence of"
Examples:
- Temperature in Celsius or Fahrenheit (0°F doesn't mean "no temperature")
- IQ scores
- Calendar years (year 0 is arbitrary)
Valid operations: Count, mode, median, mean, addition/subtraction
Ratio
Characteristics:
- Numerical scale with equal intervals
- Has true zero point
- Zero means complete absence
- Can form ratios (twice as much, half as big)
Examples:
- Height (0 inches = no height)
- Weight (0 lbs = no weight)
- Age (0 years = newborn)
- Income (0 dollars = no money)
Valid operations: All mathematical operations
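The difference between interval and ratio scales can be checked numerically. The sketch below is illustrative only: the temperatures and heights are made-up values, and Kelvin (not mentioned above) is brought in as a temperature scale with a true zero. It shows that a "ratio" of two Celsius readings changes when the unit changes, while a ratio of two heights does not.

```python
def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 32

def celsius_to_kelvin(c):
    return c + 273.15

warm_c, cool_c = 20.0, 10.0  # hypothetical temperature readings

# Interval scale: the "ratio" depends on which unit is used
ratio_c = warm_c / cool_c                                                # 2.0
ratio_f = celsius_to_fahrenheit(warm_c) / celsius_to_fahrenheit(cool_c)  # 68 / 50 = 1.36
ratio_k = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)          # about 1.035

# Ratio scale: height has a true zero, so the ratio is the same in any unit
tall_m, short_m = 2.0, 1.0  # hypothetical heights in meters
ratio_m = tall_m / short_m                    # 2.0
ratio_cm = (tall_m * 100) / (short_m * 100)   # 2.0

print(ratio_c, ratio_f, round(ratio_k, 3))
print(ratio_m, ratio_cm)
```

Because the Celsius and Fahrenheit ratios disagree, saying "20°C is twice as hot as 10°C" is not meaningful, whereas "2 m is twice as tall as 1 m" is.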
Populations vs. Samples
Population
Definition: The entire group of individuals or items we want to study
Characteristics:
- Complete collection
- Often too large or expensive to study completely
- Size denoted by N
Examples:
- All students in the United States
- All adults registered to vote in California
- Every car manufactured by Toyota in 2024
Parameters: Numerical characteristics of populations
- Population mean: μ (mu)
- Population standard deviation: σ (sigma)
- Population proportion: p
Sample
Definition: A subset of the population, selected for study
Characteristics:
- Ideally a representative portion of the population
- Practical and economical to study
- Size denoted by n
Examples:
- 500 randomly selected U.S. students
- 1,000 California voters surveyed
- 100 Toyota cars tested from 2024 production
Statistics: Numerical characteristics of samples
- Sample mean: x̄ (x-bar)
- Sample standard deviation: s
- Sample proportion: p̂ (p-hat)
Key relationship: We use statistics from samples to make inferences about parameters of populations.
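To make the parameter/statistic distinction concrete, here is a minimal sketch using Python's standard library. The population is simulated (heights drawn from a normal distribution with an assumed mean of 68 inches and SD of 3), so every number it prints is hypothetical; the point is only which quantity is computed from which group.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: heights (inches) of N = 10,000 individuals
population = [random.gauss(68, 3) for _ in range(10_000)]
N = len(population)

# Parameters: computed from the ENTIRE population
mu = statistics.fmean(population)                 # population mean
sigma = statistics.pstdev(population)             # population SD (divides by N)
p = sum(h > 70 for h in population) / N           # proportion taller than 70 in

# Statistics: computed from a sample of n = 100
sample = random.sample(population, 100)
n = len(sample)
x_bar = statistics.fmean(sample)                  # sample mean
s = statistics.stdev(sample)                      # sample SD (divides by n - 1)
p_hat = sum(h > 70 for h in sample) / n           # sample proportion

print(f"mu = {mu:.2f}, x_bar = {x_bar:.2f}")
print(f"sigma = {sigma:.2f}, s = {s:.2f}")
print(f"p = {p:.3f}, p_hat = {p_hat:.3f}")
```

The sample values will usually be close to, but not equal to, the population values. In practice only the sample values are observed, and they serve as estimates of the unknown parameters.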
Sampling Methods
Random Sampling
Simple Random Sample (SRS):
- Every individual has equal chance of selection
- Every group of size n has an equal chance of being the sample
- "Gold standard" of sampling
How to obtain:
- Assign numbers to all population members
- Use random number generator
- Select corresponding individuals
Example: Put all 500 student names in a hat, mix thoroughly, draw 50 names
Advantages:
- Unbiased
- Simple to understand
- Known probability of selection
Disadvantages:
- Requires complete list of population
- May not represent subgroups well
- Can be impractical for large populations
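The "names in a hat" procedure can be sketched with Python's `random.sample`, which draws without replacement so that every group of 50 is equally likely. The student labels below are placeholders, not real data.

```python
import random

random.seed(1)  # fixed seed only so the example is reproducible

# Hypothetical sampling frame: a complete list of all 500 students
students = [f"student_{i:03d}" for i in range(1, 501)]

# Simple random sample of 50, drawn without replacement
srs = random.sample(students, k=50)

print(len(srs))   # 50
print(srs[:5])    # first few selected names
```

Note that the code needs the full list `students` before it can draw anything, which is exactly the "requires complete list of population" disadvantage listed above.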
Stratified Random Sample
Method:
- Divide population into homogeneous groups (strata)
- Take SRS from each stratum
- Combine samples
Example: Divide school by grade level (9th, 10th, 11th, 12th), randomly sample 25 students from each grade
When to use:
- Want to ensure representation of subgroups
- Strata are internally similar but different from each other
- Interested in comparing groups
Advantages:
- Guarantees representation from each stratum
- More precise estimates
- Can compare strata
Disadvantages:
- Requires knowledge of population characteristics
- More complex than SRS
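A minimal sketch of the grade-level example, assuming a hypothetical school with 200 students per grade. The helper name `stratified_sample` is made up for illustration; it simply takes an SRS inside each stratum and combines the results.

```python
import random

def stratified_sample(strata, per_stratum, rng):
    """Take an SRS of size per_stratum from each stratum and combine them."""
    combined = []
    for name, members in strata.items():
        for member in rng.sample(members, per_stratum):
            combined.append((name, member))
    return combined

rng = random.Random(7)  # seeded generator for reproducibility

# Hypothetical strata: grade level -> list of students in that grade
grades = {
    g: [f"grade{g}_student{i}" for i in range(1, 201)]
    for g in (9, 10, 11, 12)
}

stratified = stratified_sample(grades, 25, rng)
print(len(stratified))  # 100 students in total, 25 from each grade
```

Every grade is guaranteed exactly 25 students here, something an SRS of 100 from the whole school would not guarantee.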
Cluster Sample
Method:
- Divide population into groups (clusters)
- Randomly select some clusters
- Study ALL individuals in selected clusters
Example: Divide city into neighborhoods (clusters), randomly select 5 neighborhoods, survey all households in those 5
When to use:
- No complete population list available
- Geographically dispersed population
- Cost-effective approach needed
Advantages:
- Practical and economical
- No need for complete population list
- Reduces travel/contact costs
Disadvantages:
- Usually less precise than an SRS of the same size
- Works well only when each cluster is heterogeneous (like a mini-population)
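The neighborhood example can be sketched as follows. The city, its 20 neighborhoods, and the household counts are all invented for illustration, and `cluster_sample` is a hypothetical helper name.

```python
import random

def cluster_sample(clusters, n_clusters, rng):
    """Randomly pick n_clusters clusters, then keep EVERY member of each."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    surveyed = [member for name in chosen for member in clusters[name]]
    return chosen, surveyed

rng = random.Random(3)  # seeded generator for reproducibility

# Hypothetical city: 20 neighborhoods with 30 to 80 households each
neighborhoods = {
    f"neighborhood_{i:02d}": [
        f"n{i:02d}_household_{j}" for j in range(rng.randint(30, 80))
    ]
    for i in range(1, 21)
}

chosen, surveyed = cluster_sample(neighborhoods, 5, rng)
print(chosen)
print(len(surveyed), "households surveyed")
```

Randomness enters only at the cluster level: the code never needs a list of all households in the city, just a list of neighborhoods. The final sample size is not fixed in advance because it depends on how large the chosen clusters are.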
Systematic Sample
Method:
- Select every k-th individual from the list
- Random starting point
- k = N / n (population size / sample size)
Example: From a list of 1,000 students, pick a random start between 1 and 10, then select every 10th student to get a sample of 100
When to use:
- Have organized list
- Want easy implementation
- List order has no cyclical (periodic) pattern
Advantages:
- Simple to implement
- Spreads sample across population
- Often as good as SRS
Disadvantages:
- Problems if list has hidden patterns
- Not an SRS: only the starting point is random, so most groups of individuals can never be selected together
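A short sketch of the 1-in-10 example, assuming a hypothetical roster of student IDs 1 through 1,000. The helper name `systematic_sample` is illustrative, and it uses integer division for the interval, which is one reasonable choice when the population size is not an exact multiple of the sample size.

```python
import random

def systematic_sample(items, n, rng):
    """Select every k-th item after a random start, with k = N // n."""
    k = len(items) // n          # sampling interval
    start = rng.randrange(k)     # random 0-based start among the first k items
    return items[start::k][:n]   # trim in case N is not a multiple of n

rng = random.Random(5)  # seeded generator for reproducibility

roster = list(range(1, 1001))  # hypothetical roster: student IDs 1..1000
chosen_ids = systematic_sample(roster, 100, rng)

print(len(chosen_ids))   # 100
print(chosen_ids[:5])    # IDs spaced 10 apart
```

Only one random number is drawn (the start). Everything after that is determined by the list order, which is why hidden patterns in the list are a concern.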
Sampling Bias
Types of Bias
Selection Bias:
- Some individuals more likely to be selected
- Sample not representative of population
Example: Surveying only people in shopping mall (excludes those who don't shop there)
Voluntary Response Bias:
- Individuals choose to participate
- Often those with strong opinions respond
Example: Online poll where anyone can vote (those who care most will participate)
Undercoverage:
- Some groups systematically excluded
- Sampling frame incomplete
Example: Phone survey excludes those without phones
Nonresponse Bias:
- Selected individuals don't respond
- Respondents differ from non-respondents
Example: Survey with a 20% response rate; the 80% who did not respond may hold different views from those who did
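A small simulation can show why self-selection matters. Everything below is assumed for illustration: the population of satisfaction scores, and especially the response probabilities, which are invented so that people with strong (mostly negative) opinions respond far more often than neutral ones.

```python
import random
import statistics

rng = random.Random(0)  # seeded generator for reproducibility

# Hypothetical population: satisfaction scores 1..5, most people neutral
population = rng.choices([1, 2, 3, 4, 5], weights=[10, 20, 40, 20, 10], k=50_000)
true_mean = statistics.fmean(population)

# ASSUMED chance that a person with each score volunteers a response
respond_prob = {1: 0.60, 2: 0.10, 3: 0.02, 4: 0.05, 5: 0.20}
volunteers = [x for x in population if rng.random() < respond_prob[x]]

# Comparison: a simple random sample of 1,000 people
srs = rng.sample(population, 1000)

print(f"true mean:       {true_mean:.2f}")
print(f"volunteer mean:  {statistics.fmean(volunteers):.2f} (n = {len(volunteers)})")
print(f"SRS mean:        {statistics.fmean(srs):.2f} (n = {len(srs)})")
```

Under these assumptions the volunteer group is several times larger than the random sample, yet its mean sits well below the true mean, while the smaller SRS lands close to it. A larger sample does not repair a biased selection process.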
Best Practices
For Valid Sampling:
✓ Use random selection when possible
✓ Define population clearly
✓ Ensure sampling frame matches population
✓ Minimize nonresponse
✓ Watch for sources of bias
✓ Use stratification when subgroups matter
✓ Make sample size adequate for precision needed
Common Mistakes to Avoid:
❌ Convenience sampling (just because it's easy)
❌ Voluntary response (self-selection bias)
❌ Assuming bigger is always better (quality > quantity)
❌ Ignoring nonresponse
❌ Using outdated sampling frame
Quick Reference
Data Type Decision Tree:
- Is it a numerical measurement or count (not just a numeric label)? → Quantitative (otherwise Categorical)
- Can it be counted? → Discrete (otherwise Continuous)
- Does it have true zero? → Ratio (otherwise Interval)
Sampling Method Selection:
- Want simplicity and have complete list → SRS
- Need to ensure subgroup representation → Stratified
- Population spread out geographically → Cluster
- Have organized list, want efficiency → Systematic
Remember: Good sampling is the foundation of valid statistical inference. A biased sample, no matter how large, leads to invalid conclusions!
📚 Practice Problems
Problem 1 (easy)
❓ Question:
Classify each variable as categorical or quantitative:
a) Eye color of students
b) Number of siblings
c) Brand of smartphone
d) Height in centimeters
💡 Solution
Step 1: Understand the distinction
- Categorical: Places individuals into groups/categories
- Quantitative: Takes numerical values with meaningful operations
Step 2: Analyze each variable
a) Eye color: Categories (blue, brown, green, etc.) → CATEGORICAL
b) Number of siblings: Numerical count, can calculate average → QUANTITATIVE
c) Brand of smartphone: Categories (Apple, Samsung, etc.) → CATEGORICAL
d) Height in centimeters: Numerical measurement, can calculate mean → QUANTITATIVE
Answer: a) Categorical b) Quantitative c) Categorical d) Quantitative
Problem 2 (easy)
❓ Question:
A survey asks students: "Rate your satisfaction with the cafeteria food on a scale of 1-5." Is this categorical or quantitative? Explain.
💡 Solution
Step 1: Analyze the data type
Scale: 1-5 (numbers are used)
Step 2: Consider the nature of the scale
- Numbers represent categories of satisfaction (very unsatisfied → very satisfied)
- The numbers are ordinal (ordered categories)
- Differences between numbers aren't necessarily equal
- Can't meaningfully say "2 is twice as satisfied as 1"
Step 3: Classify
This is CATEGORICAL (specifically ordinal categorical data)
- Even though numbers are used, they represent categories
- The numbers are labels for satisfaction levels
- Also called "Likert scale" data
Note: Some statisticians treat ordinal data as quantitative in certain contexts, but strictly speaking, it's categorical with an order.
Answer: Categorical (ordinal)
Problem 3 (medium)
❓ Question:
Identify whether each sampling method is: Simple Random Sample (SRS), Stratified, Cluster, or Systematic.
a) Select every 10th person entering a store
b) Divide school by grade level, then randomly select students from each grade
c) Randomly select 5 classrooms and survey all students in those classrooms
💡 Solution
Step 1: Review sampling methods
- SRS: Every individual, and every group of the same size, has an equal chance of selection
- Stratified: Divide into groups (strata), sample from each
- Cluster: Divide into groups, randomly select some groups, use ALL from selected
- Systematic: Select every k-th individual
Step 2: Classify each method
a) Every 10th person
Pattern: Select at regular intervals
This is SYSTEMATIC sampling
b) Divide by grade, sample from each
Pattern: Create homogeneous groups (grades), sample from ALL groups
This is STRATIFIED sampling
c) Select 5 classrooms, survey all students
Pattern: Groups (clusters) selected, then ALL within those groups surveyed
This is CLUSTER sampling
Answer: a) Systematic b) Stratified c) Cluster
Problem 4 (medium)
❓ Question:
A researcher wants to estimate the average income in a city. She divides the city into neighborhoods based on property values (low, medium, high), then randomly samples 50 households from each neighborhood. What type of sampling is this, and why might it be better than a simple random sample?
💡 Solution
Step 1: Identify the sampling method
Process:
- Divide population into groups (neighborhoods by property value)
- Sample from EACH group
- Allocation here is equal (50 households from each stratum) rather than proportional
This is STRATIFIED sampling
Step 2: Explain advantages over SRS
Why stratified is better here:
- Ensures representation: Guarantees all income levels represented
- Reduces variability: Within each stratum, incomes are more similar
- Increases precision: Can get more accurate estimates with same sample size
- Allows subgroup analysis: Can compare neighborhoods
With SRS:
- Might randomly miss low-income or high-income areas
- Higher chance of sampling error
- Less efficient estimation
Step 3: Statistical benefit
Stratified sampling reduces the standard error of the estimate when:
- Strata are homogeneous within
- Strata are heterogeneous between
- Income varies greatly by neighborhood (which it does!)
Answer: Stratified sampling. It's better because it ensures all income levels are represented, reduces sampling variability, and provides more precise estimates than SRS when the population has distinct subgroups.
Problem 5 (hard)
❓ Question:
A college has 10,000 students: 6,000 freshmen, 2,500 sophomores, 1,000 juniors, and 500 seniors. Design a stratified random sample of 200 students that maintains the same proportions. How many students should be selected from each class?
💡 Solution
Step 1: Find the proportion of each class
Total students: 10,000
- Freshmen: 6,000/10,000 = 0.60 = 60%
- Sophomores: 2,500/10,000 = 0.25 = 25%
- Juniors: 1,000/10,000 = 0.10 = 10%
- Seniors: 500/10,000 = 0.05 = 5%
Step 2: Apply proportions to sample size
Sample size: 200 students
- Freshmen: 200 × 0.60 = 120 students
- Sophomores: 200 × 0.25 = 50 students
- Juniors: 200 × 0.10 = 20 students
- Seniors: 200 × 0.05 = 10 students
Step 3: Verify total
120 + 50 + 20 + 10 = 200 ✓
Step 4: Verify proportions maintained
- Freshmen: 120/200 = 60% ✓
- Sophomores: 50/200 = 25% ✓
- Juniors: 20/200 = 10% ✓
- Seniors: 10/200 = 5% ✓
Answer:
- Freshmen: 120
- Sophomores: 50
- Juniors: 20
- Seniors: 10
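The same proportional allocation can be checked in a few lines of Python. The function name `proportional_allocation` is made up for this sketch. It rounds each stratum's share to a whole number, which is exact for this problem; with other numbers the rounded values might not add up to the target sample size and would need a small manual adjustment.

```python
def proportional_allocation(stratum_sizes, sample_size):
    """Allocate sample_size across strata in proportion to stratum size."""
    total = sum(stratum_sizes.values())
    return {
        name: round(sample_size * size / total)
        for name, size in stratum_sizes.items()
    }

college = {"freshmen": 6000, "sophomores": 2500, "juniors": 1000, "seniors": 500}
allocation = proportional_allocation(college, 200)

print(allocation)
# {'freshmen': 120, 'sophomores': 50, 'juniors': 20, 'seniors': 10}
print(sum(allocation.values()))  # 200
```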