3.4. Sampling with and without replacement
This notebook introduces the idea of sampling and the pandas function df.sample().
When we sample from a population or parent distribution, we can do so with or without replacement.
Sampling without replacement is what we usually do when running an experiment or survey. A real-world example would be giving 100 students a wellbeing questionnaire: each student in our sample is a member of the parent population (all students, or all students in the college we are sampling from, etc.), and we only want to survey each student once.
Sampling with replacement is often a good way to model random events. A real-world example of sampling with replacement is rolling a die. Each roll yields an outcome (1-6) that is a sample from an infinite supply of possible outcomes - if I roll a three on one turn, I don’t ‘use up’ the three - it is still possible to roll a three on the next turn.
A direct comparison would be drawing cards from a deck. Without replacement, each card once drawn is set aside, so it is impossible to draw the same card twice. With replacement, each card is tucked back into the deck after being drawn, so it can be drawn again.
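The card and die comparison can be sketched in a few lines of code. This is only an illustration, not part of the course data: the deck is a hypothetical one with cards simply numbered 0-51, the seed is arbitrary, and the sampling is done with NumPy's `Generator.choice` rather than pandas.

```python
import numpy as np

# Seeded generator so the sketch is repeatable; the seed value is arbitrary
rng = np.random.default_rng(1)

# A hypothetical 52-card deck, with cards numbered 0-51 for simplicity
deck = np.arange(52)

# Without replacement: deal five cards, no card can appear twice
hand = rng.choice(deck, size=5, replace=False)

# With replacement: ten rolls of a six-sided die, repeats are allowed
rolls = rng.choice(np.arange(1, 7), size=10, replace=True)

print("hand: ", hand)
print("rolls:", rolls)
print("distinct cards in hand:  ", len(set(hand)))   # always 5
print("distinct values in rolls:", len(set(rolls)))  # at most 6
```

Ten rolls of a die must repeat at least one value, because there are only six faces to choose from; the dealt hand never can.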
3.4.1. Set up Python libraries
As usual, run the code cell below to import the relevant Python libraries:
# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import seaborn as sns
sns.set_theme(style='white')
import statsmodels.api as sm
import statsmodels.formula.api as smf
import warnings
warnings.simplefilter('ignore', category=FutureWarning)
3.4.2. Toy example
Let’s explore the idea of sampling with and without replacement using a very simple example (a simple example designed just to illustrate a point is sometimes called a toy example).
Say I have a dataset listing four children’s pets:
[cat, dog, cat, rabbit]
If I sample from this dataset, I get a new list of pets. Say I draw a sample of size \(n=2\); my sample might be [cat, cat] (if I’m lucky - I love cats!)
Without replacement
If I sample without replacement, after I have ‘drawn’ my first sample pet from the original dataset, I cannot draw it again - my next sample pet will be drawn from the remaining three. The consequence of this is that every sample of size \(n=4\) contains all of the original 4 pets, albeit possibly in a different order:
[cat, cat, dog, rabbit]
[rabbit, cat, dog, cat]
[cat, dog, rabbit, cat]
etc
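To check this claim, here is a minimal sketch that stores the toy list in a pandas Series (the name `toy` is just for illustration) and draws a few samples of size 4 without replacement. Sorting each sample should always give back the same four pets.

```python
import pandas as pd

# The toy dataset from above, stored as a pandas Series
toy = pd.Series(['cat', 'dog', 'cat', 'rabbit'])

# Draw three samples of size 4 without replacement
for i in range(3):
    s = toy.sample(4, replace=False)
    # The order varies, but the sorted contents are always
    # ['cat', 'cat', 'dog', 'rabbit']
    print(list(s), '->', sorted(s))
```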
With replacement
If I sample with replacement, each ‘draw’ can be any of the four animals (think of it like pulling a card from a deck, checking which animal is on it, and then replacing the card in the deck before the next sample is drawn).
So I could get:
[cat, cat, cat, cat]
if I’m really lucky!
or more likely:
[cat, dog, cat, cat]
[rabbit, dog, cat, rabbit]
… etc
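How lucky would I need to be to get four cats? Two of the four pets are cats, so each draw is a cat with probability 2/4, and the chance of four cats in a row is \((2/4)^4 = 1/16\). The sketch below checks this by simulation; the number of repetitions and the seed are arbitrary choices, and NumPy is used here only because it can draw many samples in one call.

```python
import numpy as np

# Seeded generator so the sketch is repeatable; the seed value is arbitrary
rng = np.random.default_rng(0)

toy = np.array(['cat', 'dog', 'cat', 'rabbit'])

# Draw many samples of size 4 with replacement (one sample per row)
n_reps = 20000
samples = rng.choice(toy, size=(n_reps, 4), replace=True)

# Which samples are all cats?
all_cats = (samples == 'cat').all(axis=1)

p_exact = (2 / 4) ** 4       # 0.0625
p_sim = all_cats.mean()      # proportion of simulated samples that are all cats

print("exact probability:    ", p_exact)
print("simulated proportion: ", p_sim)
```

The simulated proportion should land close to the exact value, though it will not match it exactly.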
3.4.3. Sampling from a Pandas dataframe
Pandas has a handy built-in sampling function called df.sample().
Let’s see it at work:
pets = pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/pets.csv')
pets
| | Child | Pet |
---|---|---|
0 | Anna | cat |
1 | Betty | cat |
2 | Charley | cat |
3 | David | dog |
4 | Egbert | cat |
5 | Freddie | rabbit |
6 | Georgia | dog |
7 | Henrietta | cat |
# draw a sample of size n=3 without replacement
pets.sample(3, replace=False)
| | Child | Pet |
---|---|---|
3 | David | dog |
0 | Anna | cat |
1 | Betty | cat |
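Without replacement, the sample cannot be larger than the dataset, because every row can be drawn only once. The sketch below tries to draw 12 rows from the 8-row table. To keep it self-contained, the dataframe is rebuilt by hand from the table printed above rather than downloaded, so treat it as a stand-in for the real file.

```python
import pandas as pd

# Hand-built copy of the pets table shown above (a stand-in for pets.csv)
pets = pd.DataFrame({
    'Child': ['Anna', 'Betty', 'Charley', 'David',
              'Egbert', 'Freddie', 'Georgia', 'Henrietta'],
    'Pet': ['cat', 'cat', 'cat', 'dog', 'cat', 'rabbit', 'dog', 'cat'],
})

# Asking for 12 rows without replacement fails: there are only 8 rows
try:
    pets.sample(12, replace=False)
    outcome = 'sample drawn'
except ValueError as err:
    outcome = f'ValueError: {err}'
print(outcome)

# The same request with replacement is fine, because rows can repeat
print(len(pets.sample(12, replace=True)))  # 12
```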
# draw a sample of size n=12 with replacement
pets.sample(12, replace=True)
| | Child | Pet |
---|---|---|
5 | Freddie | rabbit |
3 | David | dog |
6 | Georgia | dog |
0 | Anna | cat |
3 | David | dog |
2 | Charley | cat |
7 | Henrietta | cat |
4 | Egbert | cat |
4 | Egbert | cat |
4 | Egbert | cat |
1 | Betty | cat |
0 | Anna | cat |
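Every time the cell is run, a different sample comes out. If a repeatable sample is needed, `df.sample()` accepts a `random_state` argument. The sketch below is a small illustration with an arbitrary seed, again using a hand-built copy of the pets table so that it runs on its own.

```python
import pandas as pd

# Hand-built copy of the pets table shown above (a stand-in for pets.csv)
pets = pd.DataFrame({
    'Child': ['Anna', 'Betty', 'Charley', 'David',
              'Egbert', 'Freddie', 'Georgia', 'Henrietta'],
    'Pet': ['cat', 'cat', 'cat', 'dog', 'cat', 'rabbit', 'dog', 'cat'],
})

# The same seed gives the same sample; the value 42 is arbitrary
a = pets.sample(12, replace=True, random_state=42)
b = pets.sample(12, replace=True, random_state=42)
print(a.equals(b))  # True

# 12 draws from only 8 rows must repeat at least one row
print(a.index.duplicated().any())  # True
```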
Summarizing samples
Often we are not interested in the exact contents of the sample, but some summary value - for example, how many cats are there?
# Make a new sample and just get the column 'Pet'
pets.sample(8, replace=True).Pet
5 rabbit
6 dog
2 cat
6 dog
3 dog
1 cat
7 cat
3 dog
Name: Pet, dtype: object
# Make a new sample and count the cats!
sum(pets.sample(8, replace=True).Pet=='cat')
5
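A single count like this changes from sample to sample. One way to see how much it varies is to repeat the sampling many times and tabulate the counts. The sketch below is one possible approach, not the only one: the number of repetitions is arbitrary, each repetition uses its own seed so the result is repeatable, and the dataframe is a hand-built copy of the pets table.

```python
import pandas as pd

# Hand-built copy of the pets table shown above (a stand-in for pets.csv)
pets = pd.DataFrame({
    'Child': ['Anna', 'Betty', 'Charley', 'David',
              'Egbert', 'Freddie', 'Georgia', 'Henrietta'],
    'Pet': ['cat', 'cat', 'cat', 'dog', 'cat', 'rabbit', 'dog', 'cat'],
})

# Repeat: draw 8 pets with replacement and count the cats
n_reps = 1000
cat_counts = pd.Series([
    (pets.sample(8, replace=True, random_state=seed).Pet == 'cat').sum()
    for seed in range(n_reps)
])

# How often did each number of cats occur?
print(cat_counts.value_counts().sort_index())

# 5 of the 8 pets are cats, so the average count should be near 8 * 5/8 = 5
print(cat_counts.mean())
```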