3.8. Tutorial Exercises: t-test and non-parametric equivalents

Here are some more exercises on comparing means using the t-test and its non-parametric equivalents.

These exercises are very similar to what you did in the t-test and Mann-Whitney/Wilcoxon examples, so in most cases you will be able to copy and adapt code and text from the examples.

3.8.1. Set up Python libraries

As usual, run the code cell below to import the relevant Python libraries.

# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import seaborn as sns
sns.set_theme(style='white')
import statsmodels.api as sm
import statsmodels.formula.api as smf
import warnings 
warnings.simplefilter('ignore', category=FutureWarning)

3.8.2. 1. Whose peaches are heavier?


As last week:

Mr Robinson’s juice factory buys peaches from farmers by the tray. Each tray contains 50 peaches. Farmer MacDonald claims that this is unfair as his peaches are juicier and therefore weigh more than the peaches of his rival, Mr McGregor.

Mr Robinson weighs eight trays of Farmer MacDonald’s peaches and eight trays of Mr McGregor’s peaches. The weights, in kilograms, are given in the file peaches.csv.

Investigate whether MacDonald’s claim is justified by testing for a difference in weight between MacDonald’s and McGregor’s peaches. Use both a parametric and a non-parametric test.

a) Load the data into a Pandas dataframe

peaches = pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2025/main/data/peaches.csv')
peaches
   McGregor  MacDonald
0     7.867      8.289
1     7.637      7.972
2     7.652      8.237
3     7.772      7.789
4     7.510      7.345
5     7.743      7.861
6     7.356      7.779
7     7.944      7.974

b) Plot the data and comment on whether they are normally distributed.

A KDE plot (to show the distribution) and a rug plot (to show individual data points) would be a good choice here. You should comment on whether the data appear to be normally distributed and hence on the suitability of the t-test.

# your code here to plot the data

d) We can assume (based on the Central Limit Theorem) that these data points are normally distributed. Explain why.

Your text here explaining why the data should be Normal according to the CLT
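As a hint towards the explanation, the simulation below sketches the Central Limit Theorem for sums: each tray weight is the total of 50 individual peach weights. The gamma distribution used for single peaches and its parameters are purely hypothetical choices made so that the individual weights are clearly skewed; they are not estimated from peaches.csv.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical, deliberately skewed distribution of single-peach weights (kg).
# The shape and scale are illustrative assumptions, not estimates from the data.
n_trays, peaches_per_tray = 10_000, 50
single = rng.gamma(shape=2.0, scale=0.078, size=(n_trays, peaches_per_tray))

# Each simulated tray weight is the sum of 50 single-peach weights
tray = single.sum(axis=1)

# Skewness near 0 indicates a symmetric, more Normal-looking distribution
skew_single = stats.skew(single.ravel())
skew_tray = stats.skew(tray)
print(f"skewness of single peaches: {skew_single:.2f}")
print(f"skewness of tray totals:    {skew_tray:.2f}")
```

The tray totals should come out far less skewed than the single-peach weights, even though every tray is built from the same skewed distribution.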

e) Conduct a t-test to test Farmer MacDonald’s claim

  • State your hypotheses

  • State relevant descriptive statistics

  • Carry out the test using the built-in function from scipy.stats with appropriate option choices

  • State your conclusions

Your answer here! You will need to add additional cells
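A minimal sketch of the test step is given below, assuming the two farmers' trays are independent samples and that a one-tailed alternative (MacDonald heavier) matches the claim. It uses scipy.stats.ttest_ind with its default equal-variance setting, which is an assumption you should justify or change in your own answer. The data are retyped from the table in part (a).

```python
import pandas as pd
from scipy import stats

# Tray weights (kg), copied from the table printed in part (a)
peaches = pd.DataFrame({
    "McGregor":  [7.867, 7.637, 7.652, 7.772, 7.510, 7.743, 7.356, 7.944],
    "MacDonald": [8.289, 7.972, 8.237, 7.789, 7.345, 7.861, 7.779, 7.974],
})

# Descriptive statistics for each farmer
print(peaches.agg(["mean", "std", "count"]))

# H0: mean tray weight is the same for both farmers
# H1: MacDonald's mean tray weight is greater than McGregor's
res = stats.ttest_ind(peaches["MacDonald"], peaches["McGregor"],
                      alternative="greater")
print(f"t = {res.statistic:.3f}, one-tailed p = {res.pvalue:.4f}")
```

The order of the two arguments matters for a one-tailed test: with alternative="greater", the first sample is the one hypothesised to have the larger mean.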

f) Look back at the rank-based and permutation tests we carried out on the same data in the previous section. How do the results of the three tests differ? Which test was the best choice, and why?
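To put the three results side by side, something like the sketch below could be used. It assumes the earlier rank-based test was a one-tailed Mann-Whitney U test and that the permutation test used the difference of means as its statistic; if the previous section made different choices, adapt the calls accordingly. The helper mean_diff is a name introduced here for illustration.

```python
import numpy as np
from scipy import stats

# Tray weights (kg), copied from the table printed in part (a)
mcgregor = np.array([7.867, 7.637, 7.652, 7.772, 7.510, 7.743, 7.356, 7.944])
macdonald = np.array([8.289, 7.972, 8.237, 7.789, 7.345, 7.861, 7.779, 7.974])

def mean_diff(x, y):
    """Difference of sample means (illustrative permutation statistic)."""
    return np.mean(x) - np.mean(y)

t_res = stats.ttest_ind(macdonald, mcgregor, alternative="greater")
u_res = stats.mannwhitneyu(macdonald, mcgregor, alternative="greater")

# n_resamples is larger than the 12,870 ways of splitting 16 trays into two
# groups of 8, so scipy evaluates every split instead of sampling at random
p_res = stats.permutation_test((macdonald, mcgregor), mean_diff,
                               permutation_type="independent",
                               alternative="greater", n_resamples=20_000)

print(f"t-test:           p = {t_res.pvalue:.4f}")
print(f"Mann-Whitney U:   p = {u_res.pvalue:.4f}")
print(f"permutation test: p = {p_res.pvalue:.4f}")
```

When comparing the p-values, keep in mind that each test rests on different assumptions, which is what the question about the "best choice" is asking you to weigh up.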

3.8.3. 2. IQ and vitamins


The VitalVit company claim that after taking their VitalVit supplement, IQ is increased.

They run a trial in which 22 participants complete a baseline IQ test, then take VitalVit for six weeks, then complete another IQ test.

a) Load the data into a Pandas dataframe

vitamin = pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2025/main/data/vitalVit.csv')
vitamin
    ID_code   before    after
0    688870   82.596   83.437
1    723650  117.200  119.810
2    445960   85.861   83.976
3    708780  125.640  127.680
4    109960   96.751   99.103
5    968530  105.680  106.890
6    164930  142.410  145.550
7    744410  109.650  109.320
8    499380  128.210  125.110
9    290560   84.773   87.249
10   780690  110.470  112.650
11   660820  100.870   99.074
12   758780   94.117   95.951
13   363320   96.952   96.801
14   638840   86.280   87.669
15   483930   89.413   94.379
16   102800   85.283   88.316
17   581620   94.477   96.300
18   754980   90.649   94.158
19   268960  103.190  104.300
20   314040   92.880   94.556
21   324960   97.843   97.969

b) The requirement for a paired t-test is that the pairwise differences in scores are normally distributed. Plot the data in such a way as to check this assumption. Comment on your plot.

  • A KDE plot of the pairwise differences, after - before, would be a good choice here.

  • A scatterplot would also be a good choice, as these are paired data.

# Your code here

In real IQ tests, IQ scores are normally distributed by design (the tests are designed to yield a normal distribution of scores). Therefore, we should be able to use a t-test to compare the scores from before and after taking VitalVit.

e) Conduct a t-test to test VitalVit’s claim

  • State your hypotheses

  • State relevant descriptive statistics

  • Carry out the test using the built-in function from scipy.stats with appropriate option choices

  • State your conclusions

Your answer here.
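A minimal sketch of the test step follows, assuming a paired design (each participant measured twice) and a one-tailed alternative that scores increase after taking the supplement. It uses scipy.stats.ttest_rel; the scores are retyped from the table in part (a), with ID_code left out.

```python
import pandas as pd
from scipy import stats

# IQ scores copied from the table printed in part (a); ID_code omitted
vitamin = pd.DataFrame({
    "before": [82.596, 117.200, 85.861, 125.640, 96.751, 105.680, 142.410,
               109.650, 128.210, 84.773, 110.470, 100.870, 94.117, 96.952,
               86.280, 89.413, 85.283, 94.477, 90.649, 103.190, 92.880, 97.843],
    "after":  [83.437, 119.810, 83.976, 127.680, 99.103, 106.890, 145.550,
               109.320, 125.110, 87.249, 112.650, 99.074, 95.951, 96.801,
               87.669, 94.379, 88.316, 96.300, 94.158, 104.300, 94.556, 97.969],
})

# Descriptive statistics for both sessions and for the paired differences
print(vitamin.agg(["mean", "std"]))
diff = vitamin["after"] - vitamin["before"]
print(f"mean difference = {diff.mean():.3f}, sd of differences = {diff.std():.3f}")

# H0: mean difference (after - before) is zero; H1: it is greater than zero
res = stats.ttest_rel(vitamin["after"], vitamin["before"], alternative="greater")
print(f"t = {res.statistic:.3f}, one-tailed p = {res.pvalue:.4f}")
```

Passing after as the first argument means alternative="greater" tests for an increase; swapping the arguments would test the opposite direction.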

f) Look back to the rank-based and permutation tests on the same data, which you carried out last week. How do the results differ? Which test was the best choice, and why?

Your answer here.

3.8.4. 3. Who has the tallest students?

A student from Lonsdale college claims that Lonsdale students are taller than students from Beaufort college.

Heights of 30 randomly selected male undergraduates from each college are found in the file heightsCollege.csv.

Test the student’s hypothesis using a t-test (this is justified as heights are generally normally distributed) and write up your report as if for a scientific publication. Your report should include the following elements:

  • A plot of the data to show the data distribution

  • The relevant descriptive statistics

  • The results of the t-test

  • A conclusion

You can use the write-up sections of the t-test example notebooks as a model.

# Load the data
heights = pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2025/main/data/heightsCollege.csv')