3.4. Sampling with and without replacement#

This notebook introduces the idea of sampling and the pandas function df.sample().

When we sample from a population or parent distribution, we can do so with or without replacement.

Sampling without replacement is what we usually do when running an experiment or survey. A real-world example of sampling without replacement would be giving 100 students a wellbeing questionnaire: each student in our sample is a member of the parent population (all students, or all students in the college we are sampling from, etc.), and we only want to survey each student once.

Sampling with replacement is often a good way to model random events. A real-world example of sampling with replacement is rolling a die. Each roll yields an outcome (1-6) that is a sample from an effectively infinite supply of possible outcomes - if I roll a three on one turn, I don’t ‘use up’ the three - it is still possible to roll a three on the next turn.

A direct comparison would be drawing cards from a deck. Without replacement, each card once drawn is set aside, so it is impossible to draw the same card twice. With replacement, each card is tucked back into the deck after being drawn, so it can be drawn again.
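The die and card examples above can be sketched in code. This is a minimal illustration rather than part of the course material: it assumes NumPy's `Generator.choice`, labels the 52 cards simply as the numbers 0-51, and uses an arbitrary seed so that the output is repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, chosen only for repeatability

# Rolling a die ten times: sampling WITH replacement from the faces 1-6,
# so the same face can come up more than once
rolls = rng.choice(np.arange(1, 7), size=10, replace=True)

# Drawing a five-card hand: sampling WITHOUT replacement from cards 0-51,
# so no card can appear twice in the hand
hand = rng.choice(np.arange(52), size=5, replace=False)

print(rolls)
print(hand)
```

Ten rolls of a six-sided die must repeat at least one face, whereas the five cards in the hand are always distinct.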

3.4.1. Set up Python libraries#

As usual, run the code cell below to import the relevant Python libraries:

# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import seaborn as sns
sns.set_theme(style='white')
import statsmodels.api as sm
import statsmodels.formula.api as smf
import warnings 
warnings.simplefilter('ignore', category=FutureWarning)

3.4.2. Toy example#

Let’s explore the idea of sampling with and without replacement using a very simple example (a simple example designed just to illustrate a point is sometimes called a toy example).

Say I have a dataset listing four children’s pets:

[cat, dog, cat, rabbit]

If I sample from this dataset, I get a new list of pets. Say I draw a sample of size $n = 2$; my sample might be [cat, cat] (if I’m lucky - I love cats!)

Without replacement#

If I sample without replacement, after I have ‘drawn’ my first sample pet from the original dataset, I cannot draw it again - my next sample pet will be drawn from the remaining three. The consequence of this is that all samples of size $n=4$ contain all of the original 4 pets, albeit in a different order:

[cat, cat, dog, rabbit]

[rabbit, cat, dog, cat]

[cat, dog, rabbit, cat]

etc
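As a quick check of this idea, here is a small sketch that puts the four toy pets into a pandas Series (the name `toy_pets` is made up for this illustration) and samples from it without replacement. It also shows what happens if we ask for more pets than exist.

```python
import pandas as pd

# The toy dataset of four pets (toy_pets is a hypothetical name)
toy_pets = pd.Series(['cat', 'dog', 'cat', 'rabbit'])

# A sample of size 4 without replacement is just a reshuffle of the original
shuffled = toy_pets.sample(4, replace=False)
print(list(shuffled))
print(sorted(shuffled))  # always the same four pets once sorted

# Asking for 5 pets without replacement is impossible: only 4 exist
try:
    toy_pets.sample(5, replace=False)
except ValueError as err:
    print('Cannot draw 5 without replacement:', err)
```

However many times the cell is run, sorting the sample gives two cats, a dog and a rabbit; only the order changes.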

With replacement#

If I sample with replacement, each ‘draw’ can be any of the four animals (think of it like pulling a card from a deck, checking which animal is on it, and then replacing the card in the deck before the next sample is drawn).

So I could get:

[cat, cat, cat, cat]

if I’m really lucky!

or more likely:

[cat, dog, cat, cat]

[rabbit, dog, cat, rabbit]

… etc
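How lucky is ‘really lucky’? Two of the four pets are cats, so each draw is a cat with probability $2/4$, and four cats in a row has probability $(2/4)^4 = 1/16$. The sketch below checks this by simulation; it assumes NumPy's `Generator.choice`, and the seed and the number of simulated samples are arbitrary choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
toy_pets = np.array(['cat', 'dog', 'cat', 'rabbit'])

# 10,000 samples of size 4, each drawn with replacement (one sample per row)
samples = rng.choice(toy_pets, size=(10_000, 4), replace=True)

# For each sample (row), check whether all four draws were cats
all_cats = (samples == 'cat').all(axis=1)

# Proportion of all-cat samples; should be close to (2/4)**4 = 0.0625
print(all_cats.mean())
```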

3.4.3. Sampling from a `Pandas` dataframe#

Pandas has a handy built-in sampling function called df.sample().

Let’s see it at work:

pets = pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/pets.csv')
pets

	Child	Pet
0	Anna	cat
1	Betty	cat
2	Charley	cat
3	David	dog
4	Egbert	cat
5	Freddie	rabbit
6	Georgia	dog
7	Henrietta	cat

# draw a sample of size n=3 without replacement
pets.sample(3, replace=False)

	Child	Pet
1	Betty	cat
6	Georgia	dog
2	Charley	cat

# draw a sample of size n=12 with replacement
pets.sample(12, replace=True)

	Child	Pet
4	Egbert	cat
1	Betty	cat
4	Egbert	cat
5	Freddie	rabbit
1	Betty	cat
7	Henrietta	cat
1	Betty	cat
6	Georgia	dog
1	Betty	cat
2	Charley	cat
2	Charley	cat
3	David	dog
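Notice that some children (for example Betty) appear several times in the sample drawn with replacement, while no child appears twice in the sample drawn without replacement. The sketch below checks this using the row index. To keep it self-contained, it rebuilds the `pets` table by hand from the values shown above rather than downloading the file, and it passes `random_state` (an optional argument of `df.sample()`) only so that the result is repeatable; the value 42 is arbitrary.

```python
import pandas as pd

# Rebuild the pets table shown above so this cell runs on its own
pets = pd.DataFrame({
    'Child': ['Anna', 'Betty', 'Charley', 'David',
              'Egbert', 'Freddie', 'Georgia', 'Henrietta'],
    'Pet': ['cat', 'cat', 'cat', 'dog', 'cat', 'rabbit', 'dog', 'cat'],
})

without_rep = pets.sample(3, replace=False, random_state=42)
with_rep = pets.sample(12, replace=True, random_state=42)

# Without replacement, every row label in the sample is distinct
print(without_rep.index.is_unique)

# With replacement, 12 draws from only 8 rows must repeat at least one row
print(with_rep.index.is_unique)
```

The first check prints `True` and the second prints `False`, whatever seed is used.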

Summarizing samples#

Often we are not interested in the exact contents of the sample, but in some summary value - for example, how many cats are there?

# Make a new sample and just get the column 'Pet'
pets.sample(8, replace=True).Pet

0       cat
5    rabbit
4       cat
0       cat
7       cat
5    rabbit
2       cat
1       cat
Name: Pet, dtype: object

# Make a new sample and count the cats!
sum(pets.sample(8, replace=True).Pet=='cat')
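The one-liner above draws a sample and counts the cats in a single step, so we never see the sample itself. As an alternative sketch, we can store the sample first and then summarize it in more than one way. The table is rebuilt by hand here so the cell runs on its own, and the names `sample` and `n_cats` are made up for this illustration.

```python
import pandas as pd

# Rebuild the pets table shown earlier so this cell runs on its own
pets = pd.DataFrame({
    'Child': ['Anna', 'Betty', 'Charley', 'David',
              'Egbert', 'Freddie', 'Georgia', 'Henrietta'],
    'Pet': ['cat', 'cat', 'cat', 'dog', 'cat', 'rabbit', 'dog', 'cat'],
})

# Store the sample so that every summary describes the same draw
sample = pets.sample(8, replace=True)

# Count the cats: comparing to 'cat' gives True/False, and True counts as 1
n_cats = (sample['Pet'] == 'cat').sum()
print(n_cats)

# Count every kind of pet in the sample at once
print(sample['Pet'].value_counts())
```

Because the sample is drawn with replacement, the number of cats changes from run to run, even though the original table always contains 5 cats.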