Extension Work - Central Limit Theorem (reading)

1.10. Extension Work - Central Limit Theorem (reading)#

The relationship between the binomial and normal distributions has parallels with a really important result in statistics - the Central Limit Theorem.

The central limmit theorem tells us about the random variation we would expect to get in the sample mean, if we were to take many many sammples from the same underlying population.

In this notebook, we start by looking at the unusual case in which the population or parent distribution is available to us. We will simulate the process of drawing many samples of size \(n\) from a parent distribution, and taking the mean of each sample. The distribution of these means, known as the sampling distribution of the mean, describes the expected random variability in the mean for different samples.

We see that due to the Central Limit Theorem, when \(n\) is large enough (above about 50) the sampling distribution of the mean is well approximated by a normal distribution, whose standard deviation is the standard error of the mean.

1.10.1. Set up Python libraries#

As usual, run the code cell below to import the relevant Python libraries

# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import seaborn as sns
sns.set_theme(style='white')
import statsmodels.api as sm
import statsmodels.formula.api as smf

1.10.2. Load and plot the data#

We will work with a distinctly non-normal data distribution - the scores for a large number of individuals on a fictional 100-item political questionairre called BrexDex.

The questions are designed and scored so that a high score overall score on the questionairre indicates an attitude agains Brexit, and a low score indicates an attitude in favour of Brexit.

Because the scores relate to a polarizing topic, the data distribution is bimodal

(These are made up data by the way!)

First load the data:

UKBrexdex=pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/UKBrexdex.csv')
UKBrexdex

	ID_code	score
0	186640	53
1	588140	90
2	977390	30
3	948470	42
4	564360	84
...	...	...
9995	851780	81
9996	698340	45
9997	693580	51
9998	872730	78
9999	385642	88

10000 rows × 2 columns

We can see that the dataset contains 10,000 individuals’ scores on the BrexDex questionnaire.

Let’s plot them to get a sense of the distribution:

sns.histplot(UKBrexdex['score'], bins=range(101))
plt.xlabel('score on BrexDex')
plt.ylabel('frequency')
plt.show()

/opt/anaconda3/anaconda3/lib/python3.11/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):

../_images/a47163e1c385e39ce24ed0fd8b021bb293a39d45240dec593c699cc35a7c00f6.png

Obviously, 10,000 is a very large sample (perhaps a national sample). We might be interested giving the questionnarie to a smaller group (say, 100 first year statistics students in Oxford) to see if their attitudes were similar to the larger national sample.

What kind of distribution of responses would we expect to get in a sample of 100?

1.10.3. The sample distribution resembles the parent distribution#

If the Oxford students have attitudes typical of the national sample, we could simulate what their data might look like by drawing a random sample of 100 from our national dataset.

We can do this using the tool numpy.random.choice which makes a random selection of datapoints from a larger dataset:

sample = np.random.choice(UKBrexdex['score'], 100, replace=False)
print(sample)

[84 42 13  7 54 44 24 30  9 26 76 23 89 86 22 10 33 15 17 22 41 54 19 34
78 79 28 28 33 57 56 75 77 74 26 12 26 13 31 17 58 79 84 83 58 79 83
20 39 75 52 77 44 27 34 11 81 82 35 36 80 79 27 33 28 72 44 78 36 77
17 81 28 21 74 41 78 79 45 78 63 33 20 27 26 48 44 32 47 21 59 41 34
24 26 36]

Let’s plot the data and compare to our national sample:

plt.subplot(2,1,1)
sns.histplot(UKBrexdex['score'], bins=range(101))
plt.subplot(2,1,2)
sns.histplot(sample, bins=range(0,101,5)) # use wider bins for the sample as there are fewer datapoints
plt.xlabel('Score on Brexdex (%)')
plt.show()

/opt/anaconda3/anaconda3/lib/python3.11/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):

/opt/anaconda3/anaconda3/lib/python3.11/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):

../_images/bbe0ebc3cc4d5e454fd7b77604610f753de48a839ad8c5841a44eafa778481b7.png

Hopefully we can see that the distribution within the sample resembles the shape of the distribution in the national sample, with two peaks, although somewhat noisier

Let’s draw a few more random samples, each time of size 100, to check this:

plt.figure()

for i in range(1,13):
    sample = np.random.choice(UKBrexdex['score'], 100, replace=False)
    plt.subplot(3,4,i)
    sns.histplot(sample, bins=range(0,101,5)) # use wider bins for the sample as there are fewer datapoints   

plt.tight_layout()    
plt.show()

/opt/anaconda3/anaconda3/lib/python3.11/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):
/opt/anaconda3/anaconda3/lib/python3.11/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):
/opt/anaconda3/anaconda3/lib/python3.11/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):

/opt/anaconda3/anaconda3/lib/python3.11/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):
/opt/anaconda3/anaconda3/lib/python3.11/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):
/opt/anaconda3/anaconda3/lib/python3.11/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):
/opt/anaconda3/anaconda3/lib/python3.11/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):

/opt/anaconda3/anaconda3/lib/python3.11/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):
/opt/anaconda3/anaconda3/lib/python3.11/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):
/opt/anaconda3/anaconda3/lib/python3.11/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):
/opt/anaconda3/anaconda3/lib/python3.11/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):
/opt/anaconda3/anaconda3/lib/python3.11/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):

../_images/a43ec2c5c520e886e774ca443ba399d860c61c1fd4cce53ab30038ea8036ca3f.png

Notice that we always manage to reproduce the bimodal shape, albeit with random variability.

The distribution within each sample resembles the parent distribution from which it is drawn, ie the UK national sample.

1.11. The sampling distribution of the mean#

The mean Brexdex score can be obtained from the UK national sample as follows:

UKBrexdex['score'].mean()

49.8748

The mean score in the national sample is 49.9%.

Given that each of our smaller samples (with \(n=100\)) resemble the parent distribution, we might expect that the mean of each of these samples approximates the mean of the UK national sample.

Let’s try drawing a large number of random samples with \(n=100\), and getting the mean of each one:

nSamples = 10000 # we will draw 10,000 samples
n=100 # each sample contains n people

m=np.empty(nSamples) # make an array to store the means

for i in range(nSamples):
    sample = np.random.choice(UKBrexdex['score'], 100, replace=False)
    m[i]=sample.mean()

Let’s plot the resulting means:

sns.histplot(m)
plt.xlabel('sample mean')

/opt/anaconda3/anaconda3/lib/python3.11/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):

Text(0.5, 0, 'sample mean')

../_images/f97842e9f0ef826ee041db403751bc1577270daf21f1f7dba2f0ebf106d94740.png

The distribution of sample means is approximately normal.

1.11.1. Expected value of the mean#

The expected value of the sample mean is simply the mean of the parent distribution. In other words, the means obtained from our 10,000 samples cluster around the mean of the UK national sample, which was 49.9.

m.mean() # get the mean of the 10,000 sample means

49.895796

1.11.2. Standard Error of the mean#

Although the sample means group around the mean of the parent distrbution, there is some random variation, as some samples (by chance) contain higher Brexdex scores than others.

The variability of the sample means is quantified by the standard deviation of the sampling distribution of the mean (ie the sd of the data in the histogram above), which is about 2.46.

m.std()

2.4684589294505184

As we saw in the lecture, the standard deviation of the sampling distribution of the mean has a special name, the standard error of the mean or SEM, and is given by the formula:

\[ SEM = \frac{\sigma}{\sqrt{n}} \]

… where \(\sigma\) is that standard deviation of the parent distribution, which in this case where we (unusually) have access to the UK national sample of 10000 individuals, we can obtain as follows:

UKBrexdex['score'].std()

24.792720561876358

Shall we check if the formula for the SEM gives us a match to the standard deviation of the sampling distribution of the mean?

print('sd of sampling distribution (from simulation) = ' + str(m.std()))

SEM = UKBrexdex['score'].std()/(n**0.5) # n to the power 0.5 is sqrt of n
print('SEM from the formula = ' + str(SEM))

sd of sampling distribution (from simulation) = 2.4684589294505184
SEM from the formula = 2.479272056187636

This is not a bad match!

1.11.3. \(SEM \propto \frac{1}{\sqrt{n}} \)#

The standard error of the mean is inversely proportional to \(\sqrt{n}\)

In other words, the random variability in sample means decreases as sample size \(n\) increases - but in proportion to \(\sqrt{n}\) not \(n\) itself

We can see this if we construct the sampling distribution of the mean for samples of different sizes, say \(n=50, 100, 1000\)

nSamples = 10000 # we will draw 10,000 samples
m=np.empty(nSamples) # make an array to store the means

nVals = [50,100,1000] # values of n to try

# this loop tries different values of n
for j in range(len(nVals)): 
    
    # this loop draws 10,000 samples of size n
    for i in range(nSamples):
        sample = np.random.choice(UKBrexdex['score'], nVals[j], replace=False)
        m[i]=sample.mean()
        
    # plot the distribution for each value of n
    plt.subplot(3,1,j+1)
    sns.histplot(m)
    plt.legend('n = ' + str(nVals[j]))
    plt.xlim([40,60])

plt.tight_layout()
plt.xlabel('sample mean')
plt.show()

/opt/anaconda3/anaconda3/lib/python3.11/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):

/opt/anaconda3/anaconda3/lib/python3.11/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):

/opt/anaconda3/anaconda3/lib/python3.11/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):

../_images/1c397346d7a6867408529f842fe28c3a30cd085a2470e429ea32840accf9707b.png

1.11.4. Normal Distribution#

The Central Limit Theorem states that when \(n\) is sufficiently large, the sampling distribution of the means of samples of size \(n\) is a normal distribution, with a mean equivalent to the mean of the parent distribution, and a standard deviation equivalent to the SEM.

Let’s check how well our sampling distribution for samples of size n=100 matches the predicted normal distribution.

First we regenerate our 10,000 samples of size 100:

nSamples = 10000 # we will draw 10,000 samples
n=100 # each sample contains n people

m=np.empty(nSamples) # make an array to store the means

for i in range(nSamples):
    sample = np.random.choice(UKBrexdex['score'], n, replace=False)
    m[i]=sample.mean()
    
sns.histplot(m)
plt.xlabel('sample mean')
plt.show()

/opt/anaconda3/anaconda3/lib/python3.11/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):

../_images/2ebfa68f10d380399c697193fae2332713670a803fd7284a4e24e88041b3ed61.png

Now we make our predicted normal sampling distribution of the mean.

Its mean \(\mu\) should be the mean of the parent distribution:

mu = UKBrexdex['score'].mean()
print(mu)

49.8748

Its standard deviation should be the SEM:

\[SEM = \frac{\sigma}{\sqrt{n}} \]

where \(\sigma\) is the standard deviaition of the parent distribution:

SEM = UKBrexdex['score'].std()/(n**0.5)
print(SEM)

2.479272056187636

Then we obtain the PDF of the normal distribution \(\mathcal{N}(\mu, SEM)\) for a suitable range of x-axis values (based on the histogram above):

x = np.arange(40.5,60.5) # x axis values are from 40.5 to 60.5 (the .5 is to match the middle of the bars in the histogram)
p = stats.norm.pdf(x,mu,SEM) 
freq = p*nSamples # expected frequency of each sample mean is the probability of that sample mean, time total number of samples (10,000 in our example)

sns.histplot(m, bins=range(40,60))
plt.plot(x,freq,'r')
plt.xlabel('sample mean')
plt.plot()

/opt/anaconda3/anaconda3/lib/python3.11/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):

[]

../_images/7549fc4da76428c09f68abec7c301514f840c11283dacb082853101243569092.png

This is not a bad match!

1.11.5. How unusual is my sample mean?#

Say we give the Brexdex questionnaire to 100 statistics undergraduates in Oxford, and their mean score is 55.1, as opposed to the mean in the UK national sample, 49.9. Can we infer that the students have different political attitude to the UK population as a whole?

The sampling distribution of the mean tells us the distribution of sample means we expect to get if we draw samples from the parent population (the UK national sample)

Using the simulated sampling distribution of the mean#

What proportion of sample means in our simulated sampling distribution exceeded 55.1?

# regenerate the sampling distribubtion
nSamples = 10000 # we will draw 10,000 samples
n=100 # each sample contains n people

m=np.empty(nSamples) # make an array to store the means

for i in range(nSamples):
    sample = np.random.choice(UKBrexdex['score'], 100, replace=False)
    m[i]=sample.mean()
    
print('proportion of sample means greater than 55.1 = ' + str((m>55.1).mean()) +  ' or ' + str((m>55.1).mean()*100) + '%')

proportion of sample means greater than 55.1 = 0.0193 or 1.9300000000000002%

In other words, the mean Brexdex score of the students was high enough that we would expect it to occur less than 2% of the time for samples drawn from the UK national population - we might conclude that the political attitudes of the students differ from those of the population as a whole.

Using the Normal distribution#

We can ask how likely this mean was to have occurred under the assumption that the Oxford students have similar political attitudes to the population as a whole, using the CDF of the normal distribution:

mu = UKBrexdex['score'].mean()
SEM = UKBrexdex['score'].std()/(n**0.5)

p = 1-stats.norm.cdf(55.1,mu,SEM) 
print('proportion of sample means greater than 55.1 = ' + str(p) +  ' or ' + str(p*100) + '%')

proportion of sample means greater than 55.1 = 0.01753478848973844 or 1.753478848973844%

Hopefully this proportion agrees fairly well with the proportion of actual sample means exceeding 55.1 in the simulation