1.10. Permutation testing - more practice#

The aim of these exercises is to give you some practice running permutation tests and deciding what to permute and what your test statistic is in each case!

1.10.1. Set up Python libraries#

As usual, run the code cell below to import the relevant Python libraries

# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import seaborn as sns
sns.set_theme(style='white')
import statsmodels.api as sm
import statsmodels.formula.api as smf
import warnings 
warnings.simplefilter('ignore', category=FutureWarning)

1.10.2. Import and view the data#

We will work with a fictional dataset containing wellbeing scores pre- and post the vacation for 300 Oxford students.

For each student was also have the following information:

  • subject studied

  • college

wb = pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/WellbeingSample.csv')
wb
ID_code College Subject Score_preVac Score_postVac
0 247610 Lonsdale PPE 60 35
1 448590 Lonsdale PPE 43 44
2 491100 Lonsdale engineering 79 69
3 316150 Lonsdale PPE 55 61
4 251870 Lonsdale engineering 62 65
... ... ... ... ... ...
296 440570 Beaufort history 75 70
297 826030 Beaufort maths 52 49
298 856260 Beaufort Biology 83 84
299 947060 Beaufort engineering 62 65
300 165780 Beaufort PPE 48 56

301 rows × 5 columns

1.10.3. 1. For engineering students, is wellbeing increased after the vacation compared to before?#

We hypothesise that engineering students will feel better after the vacation. Test the hypothesis using a permutation test.

First let’s remind ourselves how to get just the rows of the dataframe containing engineering students:

wb.query('Subject == "engineering"')
ID_code College Subject Score_preVac Score_postVac
2 491100 Lonsdale engineering 79 69
4 251870 Lonsdale engineering 62 65
6 841260 Lonsdale engineering 71 58
7 960120 Lonsdale engineering 54 54
15 670880 Lonsdale engineering 70 69
... ... ... ... ... ...
266 842870 Beaufort engineering 52 58
278 414020 Beaufort engineering 76 82
291 384240 Beaufort engineering 78 89
294 457900 Beaufort engineering 72 62
299 947060 Beaufort engineering 62 65

61 rows × 5 columns

Things you need to decide:#

  • what is our null hypothesis

  • what is our alternative hypothesis?

Is it a paired or unpaired test?

  • therefore which permutation_type is needed, samples, pairings or independent?

  • are we testing for the mean difference (within pairs) mean(x-y), or the difference of means mean(x)-mean(y)

  • what actually are x and y in this case?

Is it a one- or two-tailed test?

  • therefore which alternative hypothesis type is needed, two-sided, greater or less?

What \(\alpha\) value will you use?

  • what value must \(p\) be smaller than, to reject the null hypothesis?

  • this is the experimenter’s choice but usually 0.05 is used (sometimes 0.001 or 0.001)

The answers to the things you need to decide!#

\(\mathcal{H_0}\): The mean difference in wellbeing post-pre vacation is zero for engineering students

\(\mathcal{H_a}\): The mean difference in wellbeing post-pre vacation is greater than zero for engineering students, ie wellbeing improves over the vacation.

This is a paired test as we are comparing each person’s wellbeing before the vacation to their own wellbeing after the vacation. Therefore we are testing the mean difference mean(x-y) and need to shuffle the samples (flip pre and post labels with pairs, but don’t break the pairings)

# define a funtion that gives the mean difference for two series x and y
def mDiff(x,y):
    return np.mean(x-y)

It’s a one-tailed test because our hypothesis is directional (we expect wellbeing to increase after the vacation) and if we calculate of difference as post-pre, a significant effect would be a big positive difference, so we need alternative='greater'

We will test at the 5% (\(\alpha=0.05\)) level

Run the test!#

Here is some code to run the test

post = wb.query('Subject == "engineering"').Score_postVac
pre = wb.query('Subject == "engineering"').Score_preVac

stats.permutation_test((post,pre), mDiff, alternative='greater', permutation_type='samples', n_resamples=10000)
PermutationTestResult(statistic=1.0491803278688525, pvalue=0.21377862213778623, null_distribution=array([ 0.45901639, -2.59016393,  0.03278689, ..., -0.72131148,
        0.32786885, -0.91803279]))

Is it significant?#

Reading the test output:

statistic=1.0491803278688525

The mean post-pre wellfare score difference in our ‘real’ sample was 1.049 points (welfare was 1.049 points higher after the vac - this is a small increase as it’s a 100-point scale)

pvalue=0.214

The probability of obtaining such a large increase in welfare due to chance if there was really no effect of the vacation is 21.4%

Conclusions#

We fail to reject the null hypothesis as \(p\) is not smaller than \(\alpha\)

How would you write that up in a journal article?#

We hypothesised that wellbeing increased over the vacation. We compared wellbeing in engineering students (n=61) before and after the vacation using a paired samples permutation test. There was no significant increase in wellbeing after the vacation (mean difference is an increase of 1.049 points, p=0.214, one tailed).

1.10.4. 2. Do does wellbeing (before the vacation) differ between history and maths students?#

We hypothesise that students studying different subjects may have different levels of wellbeing but we don’t have an a priori expectation about which subjects will have higher wellbeing.

Test the hypothesis using a permutation test.

Let’s get just the prevacation wellbeing scores for history students and just maths students:

history = wb.query('Subject == "history"').Score_preVac
maths = wb.query('Subject == "maths"').Score_preVac

Things you need to decide:#

  • what is our null hypothesis

  • what is our alternative hypothesis?

Is it a paired or unpaired test?

  • therefore which permutation_type is needed, samples, pairings or independent?

  • are we testing for the mean difference (within pairs) mean(x-y), or the difference of means mean(x)-mean(y)

  • what actually are x and y in this case?

Is it a one- or two-tailed test?

  • therefore which alternative hypothesis type is needed, two-sided, greater or less?

What \(\alpha\) value will you use?

  • what value must \(p\) be smaller than, to reject the null hypothesis?

  • this is the experimenter’s choice but usually 0.05 is used (sometimes 0.001 or 0.001)

The answers to the things you need to decide!#

\(\mathcal{H_0}\): The difference mean wellbeing pre vacation for history and maths students is zero

\(\mathcal{H_a}\): The difference mean wellbeing pre vacation for history and maths students is non-zero, ie one subject has higher wellbbeing

This is an independent samples test as we are comparing each person’s wellbeing before the vacation to their own wellbeing after the vacation. Therefore we are testing the mean difference mean(x)-mean(y) and need permutation_type = 'independent' (shuffle which datapoints are labelled ‘history’ or ‘maths’ but keep the number of students labelled as history and maths matching the original sample)

# define a funtion that gives the mean difference for two series x and y
def dMeans(x,y):
    return np.mean(x)-np.mean(y)

It’s a two-tailed test because our hypothesis is non-directional (we don’t have particular predictions about which subject we might have higher wellbeing, we are just interested in whether there is a difference) so we need alternative=’two-sided’

We will test at the 5% (\(𝛼=0.05\)) level

Run the test!#

Here is some code to run the test:

stats.permutation_test((history,maths), dMeans, alternative='two-sided', permutation_type='independent', n_resamples=10000)
PermutationTestResult(statistic=3.486388384755003, pvalue=0.18018198180181982, null_distribution=array([3.83484574, 0.69872958, 0.35027223, ..., 2.87658802, 1.56987296,
       3.92196007]))

Is it significant?#

Reading the test output:

statistic=3.486388384755003

The difference in mean welfare score in our ‘real’ sample was 3.49 points (welfare was 3.49 points higher for history students)

pvalue=0.16718328167183283

The probability of obtaining such a large difference in welfare due to chance if there was really no effect of the subject is 16.7%

Conclusions#

We fail to reject the null hypothesis as \(p\) is not smaller than \(\alpha\)

How would you write that up in a journal article?#

We hypothesised that wellbeing might differ between students in different subjects. We compared wellbeing in maths students (n=29) and history students (n=19) before the vacation using an independent samples samples permutation test. There was no significant difference in wellbeing between subjects (difference of means = 3.48 points, p=0.167, two tailed).

Note#

If we plot the distribution of wellbeing in the two subjects, we can see taht there is a lot of overlap

sns.kdeplot(history, color='b')
sns.kdeplot(maths, color='r')
plt.legend(['history','maths'])
plt.show()
../_images/d6058fd2e28b0b53dc3f2f321bcb8821ff14f26d23a51b20fe57744fca8a34a7.png

1.10.5. 3. Are welfare scores pre- and post-vac correlated for biology students?#

We hypothesise that there is a strong individual difference effect on welfare scores and that scores pre- and post-vac are correlated. Let’s say the sample we have availablbe to test this are biology students.

Test the hypothesis using a permutation test.

Let’s get two series containing the data we need:

prevac = wb.query('Subject=="Biology"').Score_preVac
postvac = wb.query('Subject=="Biology"').Score_postVac

Plot the data#

A scatterplot may help us get a sense of whether there is a correlation

sns.scatterplot(x=prevac, y=postvac)
plt.show()
../_images/0b48fb3aa1706aea896b18ac4f07aebd36a6ba086ec6747f1f496129b452c8ac.png

…looks like there is a correlation. Let’s test it.

The answers to the things you need to decide!#

\(\mathcal{H_0}\): The correlation in wellbeing pre- and post- vacation is zero for students in Lonsdale college

\(\mathcal{H_a}\): The correlation in wellbeing pre- and post- vacation is greater than zero (students with high wellbeing bbefore the vac also have high wellbeing after the vac)

As this is a correlation, we need permutation_type = 'pairings' (shuffle which datapoints are paired with which)

# define a function that gives the correlation for two series x and y
# note that np.corrcoef returns a 2x2 matrix so we need to get just the relevant element
def mycorr(x,y):
    c = np.corrcoef(x,y)
    return c[0,1]

It’s a bit ambiguous whether a one- or two-tailed test is needed here.

The question does not state the expected direction of effect, which would suggest a two tailed test.

However, there is really only one reasonable direction in which a correlation could be expected, namely a positive correlation (we expect the students with the highest wellbeing before the vac also have highest wellbeing after the vac, hence we expect a positive correlation coefficient) - the reverse (students with high wellbeing before the vac have low wellbeing after) is nonsensical. So on the basis of common sense you could argue for a one-tailed test (and hence use alternative='greater').

Either of these is an acceptable answer as long as you briefly explain your decision.

We will test at the 5% (\(𝛼=0.05\)) level

Run the test!#

Here is some code to run the test:

stats.permutation_test((prevac,postvac), mycorr, alternative='greater', permutation_type='pairings', n_resamples=10000)
PermutationTestResult(statistic=0.7401813753937694, pvalue=9.999000099990002e-05, null_distribution=array([-0.11474344,  0.12424848,  0.14747036, ..., -0.11446699,
        0.06494973, -0.16187834]))

Is it significant?#

Reading the test output:

statistic=0.7401813753937694

The correlation between pre and post vac wellbeing scores in out ‘real’ sample was 0.74

pvalue=9.999000099990002e-05

The probability of obtaining such a large difference in welfare due to chance if there was really no effect of the subject is 9.99 x \(10^{-5}\)%, which is 0.0000999 or about one in ten thousand

Conclusions#

We reject the null hypothesis as \(p\) is much smaller than \(\alpha\) and conclude there is a correlation between wellbeing scores pre and post vacation

How would you write that up in a journal article?#

We hypothesised that wellbeing before and after the vacation would be correlated across individuals. We calculated Pearson’s \(r\) for a group of biology students (n=51) before and after the vacation, and tested its significance using a permutation test. There was a highly significant positive correlation (r=0.74, p<0.0001, one tailed), meaning that students with higher wellbeing before the vacation also had high wellbeing after the vacation.