7.3. Sampling with and without replacement#
Set up Python libraries#
As usual, run the code cell below to import the relevant Python libraries
#Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas
import seaborn as sns
sns.set_theme()
Toy example#
Let’s explore the idea of sampling with and without replacement using a very simple example. (A simple example designed just to illustrate a point is sometimes called a toy example.)
Say I have a dataset listing four children’s pets:
[cat, dog, cat, rabbit]
If I sample from this dataset, I get a new list of pets. Say I draw a sample of size $n=2$: my sample might be [cat, cat] (if I’m lucky, since I love cats!)
Without replacement#
If I sample without replacement, then once I have ‘drawn’ my first sample pet from the original dataset, I cannot draw it again; my next sample pet will be drawn from the remaining three. The consequence of this is that every sample of size $n=4$ contains all four of the original pets, albeit in a different order:
[cat, cat, dog, rabbit]
[rabbit, cat, dog, cat]
[cat, dog, rabbit, cat]
etc
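A related consequence is that, without replacement, a sample can never be larger than the dataset itself. The short sketch below illustrates this with numpy.random.choice(); the variable names are made up for illustration, and the pets are kept as text rather than numbers.

```python
import numpy as np

pets = ['cat', 'dog', 'cat', 'rabbit']

# n = 4 without replacement: always a reshuffle of the full dataset
print(np.random.choice(pets, 4, replace=False))

# n = 5 without replacement: impossible, as there are only 4 pets to draw
try:
    np.random.choice(pets, 5, replace=False)
    oversample_failed = False
except ValueError as err:
    oversample_failed = True
    print('ValueError:', err)
```

The first call prints the four pets in a random order; the second raises a ValueError because the fifth draw has nothing left to draw from.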
With replacement#
If I sample with replacement, each ‘draw’ can be any of the four animals (think of it like pulling a card from a deck, checking which animal is on it, and then replacing the card in the deck before the next sample is drawn).
So I could get:
[cat, cat, cat, cat]
if I’m really lucky!
or more likely:
[cat, dog, cat, cat]
[rabbit, dog, cat, rabbit]
… etc
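To put rough numbers on how lucky ‘really lucky’ is, here is a small sketch that lists every possible ordered sample of size 4 drawn with replacement from the four pets and counts two cases of interest. It assumes each of the four ‘cards’ is equally likely on every draw; the variable names are just for illustration.

```python
from itertools import product
from collections import Counter

pets = ['cat', 'dog', 'cat', 'rabbit']   # the four 'cards' in the deck

# every ordered way of drawing 4 cards with replacement: 4**4 outcomes
all_samples = list(product(pets, repeat=len(pets)))
n_total = len(all_samples)

# samples made up entirely of cats
n_all_cats = sum(s == ('cat', 'cat', 'cat', 'cat') for s in all_samples)

# samples with the same contents as the original data (2 cats, 1 dog, 1 rabbit)
n_same_contents = sum(Counter(s) == Counter(pets) for s in all_samples)

print(n_total, n_all_cats, n_same_contents)   # 256 16 48
```

So 16 of the 256 equally likely samples (about 6%) are all cats, and only 48 of 256 (about 19%) contain exactly the same animals as the original dataset. Most samples drawn with replacement therefore have more or fewer of some animal than the original.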
Let’s try it, replacing the animals with numbers: dog = 1, cat = 2, rabbit = 3
# Sampling without replacement
data = [2,1,3,2]
nReps = 10
for i in range(nReps):
    # draw all 4 values, never reusing a value that has already been drawn
    sample = np.random.choice(data, 4, replace=False)
    print('sample ' + str(i) + ' = ' + str(sample))
sample 0 = [1 3 2 2]
sample 1 = [3 2 2 1]
sample 2 = [1 3 2 2]
sample 3 = [2 2 1 3]
sample 4 = [2 3 1 2]
sample 5 = [3 1 2 2]
sample 6 = [1 2 2 3]
sample 7 = [2 1 2 3]
sample 8 = [3 2 2 1]
sample 9 = [2 2 3 1]
Because each sample here is the same size as the dataset ($n=4$), sampling without replacement should give samples that are identical in contents (albeit in a random order). This is clearer if I sort the sample values in ascending order:
# Sampling without replacement
data = [2,1,3,2]
nReps = 10
for i in range(nReps):
    sample = np.random.choice(data, 4, replace=False)
    # sort the values so the contents of each sample are easy to compare
    print('sample ' + str(i) + ' = ' + str(np.sort(sample)))
sample 0 = [1 2 2 3]
sample 1 = [1 2 2 3]
sample 2 = [1 2 2 3]
sample 3 = [1 2 2 3]
sample 4 = [1 2 2 3]
sample 5 = [1 2 2 3]
sample 6 = [1 2 2 3]
sample 7 = [1 2 2 3]
sample 8 = [1 2 2 3]
sample 9 = [1 2 2 3]
- Can you change the code above to sample with replacement?
- In your samples drawn with replacement, look for those which are not simply permutations of the original four datapoints (animals) but contain more or fewer cats/dogs/rabbits than expected
Real world examples#
A real world example of sampling without replacement would be giving 100 students a wellbeing questionnaire. Each student in our sample is a member of the parent distribution (all students, or all students in the college we are sampling from, etc.), and we will only want to survey each student once.
A real world example of sampling with replacement is rolling a dice. Each dice roll yields an outcome (1-6) that is a sample from an infinite selection of possible outcomes (think of it as drawing a card from a deck of cards with the numbers 1-6 on them, but because there are an infinite number of cards, ‘using up’ a six doesn’t reduce the chance of the next dice roll being a six).
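The dice example can be simulated with the same numpy.random.choice() function used in the toy example. The sketch below is only an illustration: the number of rolls is an arbitrary choice and the variable names are made up. It compares how often a six comes up overall with how often a six comes up immediately after another six.

```python
import numpy as np

faces = [1, 2, 3, 4, 5, 6]
n_rolls = 6000   # arbitrary choice, for illustration

# each roll is an independent draw with replacement from the six faces
rolls = np.random.choice(faces, n_rolls, replace=True)

# proportion of sixes overall
p_six = np.mean(rolls == 6)

# proportion of sixes among the rolls that immediately follow a six
after_six = rolls[1:][rolls[:-1] == 6]
p_six_after_six = np.mean(after_six == 6)

print('proportion of sixes overall:     ', p_six)
print('proportion of sixes after a six: ', p_six_after_six)
```

Both proportions should come out close to 1/6 (the exact values change on every run), which is what we expect if rolling a six does not ‘use up’ any sixes.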
Sampling from a Pandas dataframe#
In the example above our ‘pets’ were stored in a simple list of numbers (which numpy treats as an array), but more often our data are supplied as a dataframe (a table containing columns with text and numbers, with headings etc).
Pandas has a handy built-in sampling method, sample(), which does a similar job to numpy.random.choice() but for sampling rows from a Pandas dataframe.
Let’s see it at work:
pets = pandas.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook/main/data/pets.csv')
pets
| | Child | Pet |
|---|---|---|
| 0 | Anna | cat |
| 1 | Betty | cat |
| 2 | Charley | cat |
| 3 | David | dog |
| 4 | Egbert | cat |
| 5 | Freddie | rabbit |
| 6 | Georgia | dog |
| 7 | Henrietta | cat |
# draw a sample of size n=3 without replacement
pets.sample(3, replace=False)
| | Child | Pet |
|---|---|---|
| 5 | Freddie | rabbit |
| 2 | Charley | cat |
| 6 | Georgia | dog |
# draw a sample of size n=8 with replacement
pets.sample(8, replace=True)
| | Child | Pet |
|---|---|---|
| 2 | Charley | cat |
| 2 | Charley | cat |
| 6 | Georgia | dog |
| 5 | Freddie | rabbit |
| 1 | Betty | cat |
| 4 | Egbert | cat |
| 4 | Egbert | cat |
| 1 | Betty | cat |
# just get the column 'Pet'
pets.sample(8, replace=True)['Pet']
0 cat
6 dog
2 cat
3 dog
4 cat
6 dog
6 dog
4 cat
Name: Pet, dtype: object
# count the cats!
sum(pets.sample(8, replace=True)['Pet']=='cat')
5
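The count of cats above comes from a single random sample, so it will change each time the cell is run. As a rough sketch of how much it can vary, the code below repeats the count many times, with and without replacement. To keep it self-contained, the dataframe is typed in by hand from the pets table shown earlier rather than downloaded; the number of repeats and the variable names are arbitrary choices for illustration.

```python
import pandas

# the same eight rows as the pets table shown earlier, typed in by hand
pets = pandas.DataFrame({
    'Child': ['Anna', 'Betty', 'Charley', 'David',
              'Egbert', 'Freddie', 'Georgia', 'Henrietta'],
    'Pet': ['cat', 'cat', 'cat', 'dog', 'cat', 'rabbit', 'dog', 'cat'],
})

n_reps = 1000   # arbitrary number of repeats, for illustration

# with replacement: the number of cats varies from sample to sample
cats_with = [sum(pets.sample(8, replace=True)['Pet'] == 'cat')
             for i in range(n_reps)]

# without replacement: a sample of all 8 rows always contains the same 5 cats
cats_without = [sum(pets.sample(8, replace=False)['Pet'] == 'cat')
                for i in range(n_reps)]

print('with replacement:    min', min(cats_with), 'max', max(cats_with))
print('without replacement: min', min(cats_without), 'max', max(cats_without))
```

Without replacement, every sample of size 8 is just the whole dataset reshuffled, so the count is always 5. With replacement the count moves around from one sample to the next.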
OK, now we are ready to meet The Bootstrap!