3.8. Tutorial Exercises: \(t\)-test and non-parametric equivalents#
Here are some more exercises on comparing means using the t-test and its non-parametric equivalents.
These exercises are very similar to what you did in the t-test and Mann-Whitney/Wilcoxon examples, so in most cases you will be able to copy and adapt code and text from the examples.
3.8.1. Set up Python libraries#
As usual, run the code cell below to import the relevant Python libraries:
# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import seaborn as sns
sns.set_theme(style='white')
import statsmodels.api as sm
import statsmodels.formula.api as smf
import warnings
warnings.simplefilter('ignore', category=FutureWarning)
3.8.2. 1. Whose peaches are heavier?#
As last week:
Mr Robinson’s juice factory buys peaches from farmers by the tray. Each tray contains 50 peaches. Farmer MacDonald claims that this is unfair as his peaches are juicier and therefore weigh more than the peaches of his rival, Mr McGregor.
Mr Robinson weighs eight trays of Farmer MacDonald’s peaches and eight trays of Mr McGregor’s peaches. The weights, in kilograms, are given in the file peaches.csv.
Investigate whether MacDonald’s claim is justified by testing for a difference in weight between MacDonald’s and McGregor’s peaches. Use both a parametric and a non-parametric test.
a) Load the data into a Pandas dataframe
peaches = pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2025/main/data/peaches.csv')
peaches
| | McGregor | MacDonald |
|---|---|---|
| 0 | 7.867 | 8.289 |
| 1 | 7.637 | 7.972 |
| 2 | 7.652 | 8.237 |
| 3 | 7.772 | 7.789 |
| 4 | 7.510 | 7.345 |
| 5 | 7.743 | 7.861 |
| 6 | 7.356 | 7.779 |
| 7 | 7.944 | 7.974 |
b) Plot the data and comment on whether they are normally distributed.
A KDE plot (to show the distribution) and rug plot (to show individual data points) would be a good choice here. You should comment on whether the data appear to be normally distributed and hence on the suitability of the t-test.
# your code here to plot the data
d) We can assume (based on the Central Limit Theorem) that these data points are normally distributed. Explain why.
Your text here explaining why the data should be Normal according to the CLT
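The idea behind the Central Limit Theorem can be explored with a small simulation. The sketch below assumes, purely for illustration, that individual peach weights follow a strongly skewed (exponential) distribution with a mean of 0.155 kg; neither the distribution nor the mean comes from the peaches data. It then compares the skewness of single peaches with the skewness of trays of 50.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# ASSUMPTION (made up for illustration): single peach weights are
# exponentially distributed with mean 0.155 kg, i.e. clearly non-normal
n_trays, peaches_per_tray = 10_000, 50
single_peaches = rng.exponential(scale=0.155, size=(n_trays, peaches_per_tray))

# Each tray weight is the sum of 50 individual peach weights
tray_weights = single_peaches.sum(axis=1)

# Skewness near 0 indicates a symmetric, more normal-looking distribution
skew_single = stats.skew(single_peaches.ravel())
skew_tray = stats.skew(tray_weights)
print(f'skewness of single peaches: {skew_single:.2f}')
print(f'skewness of tray totals:    {skew_tray:.2f}')
```

The tray totals should come out much less skewed than the single peaches, which is the behaviour your written answer needs to explain.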
e) Conduct a t-test to test Farmer MacDonald’s claim
State your hypotheses
State relevant descriptive statistics
Carry out the test using the built-in function from scipy.stats with appropriate option choices
State your conclusions
Your answer here! You will need to add additional cells
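As a starting point, the test itself might look something like the sketch below. It assumes an independent-samples design (different trays from each farmer) and a one-tailed alternative in the direction of MacDonald's claim. It also leaves `equal_var` at its default, which assumes equal variances. Check each of these choices against your own hypotheses before relying on the output.

```python
import pandas as pd
from scipy import stats

# Tray weights (kg) copied from the table printed above
peaches = pd.DataFrame({
    'McGregor':  [7.867, 7.637, 7.652, 7.772, 7.510, 7.743, 7.356, 7.944],
    'MacDonald': [8.289, 7.972, 8.237, 7.789, 7.345, 7.861, 7.779, 7.974],
})

# Descriptive statistics for each farmer
print(peaches.agg(['mean', 'std', 'count']))

# H0: the mean tray weight is the same for both farmers
# Ha: MacDonald's mean tray weight is greater than McGregor's (one-tailed)
result = stats.ttest_ind(peaches['MacDonald'], peaches['McGregor'],
                         alternative='greater')
print(f't = {result.statistic:.3f}, p = {result.pvalue:.4f}')
```

The order of the two arguments matters when `alternative='greater'` is used: the first sample is the one hypothesised to have the larger mean.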
f) Look back at the rank-based and permutation tests we carried out on the same data in the previous section. How do the results of the three tests differ? Which test was the best choice, and why?
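To make the comparison concrete, the rank-based test can be rerun next to the t-test. The sketch below uses `scipy.stats.mannwhitneyu` with the same one-tailed alternative. The permutation test is left out, and the choice of which tests to put side by side is an assumption about what you ran previously.

```python
import pandas as pd
from scipy import stats

# Tray weights (kg) copied from the table printed above
peaches = pd.DataFrame({
    'McGregor':  [7.867, 7.637, 7.652, 7.772, 7.510, 7.743, 7.356, 7.944],
    'MacDonald': [8.289, 7.972, 8.237, 7.789, 7.345, 7.861, 7.779, 7.974],
})

# Same one-tailed alternative for both tests: MacDonald > McGregor
ttest = stats.ttest_ind(peaches['MacDonald'], peaches['McGregor'],
                        alternative='greater')
mwu = stats.mannwhitneyu(peaches['MacDonald'], peaches['McGregor'],
                         alternative='greater')

print(f't-test:       t = {ttest.statistic:.3f}, p = {ttest.pvalue:.4f}')
print(f'Mann-Whitney: U = {mwu.statistic:.1f}, p = {mwu.pvalue:.4f}')
```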
3.8.3. 2. IQ and vitamins#
The VitalVit company claim that after taking their VitalVit supplement, IQ is increased.
They run a trial in which 22 participants complete a baseline IQ test, then take VitalVit for six weeks, then complete another IQ test.
a) Load the data into a Pandas dataframe
vitamin = pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2025/main/data/vitalVit.csv')
vitamin
| | ID_code | before | after |
|---|---|---|---|
| 0 | 688870 | 82.596 | 83.437 |
| 1 | 723650 | 117.200 | 119.810 |
| 2 | 445960 | 85.861 | 83.976 |
| 3 | 708780 | 125.640 | 127.680 |
| 4 | 109960 | 96.751 | 99.103 |
| 5 | 968530 | 105.680 | 106.890 |
| 6 | 164930 | 142.410 | 145.550 |
| 7 | 744410 | 109.650 | 109.320 |
| 8 | 499380 | 128.210 | 125.110 |
| 9 | 290560 | 84.773 | 87.249 |
| 10 | 780690 | 110.470 | 112.650 |
| 11 | 660820 | 100.870 | 99.074 |
| 12 | 758780 | 94.117 | 95.951 |
| 13 | 363320 | 96.952 | 96.801 |
| 14 | 638840 | 86.280 | 87.669 |
| 15 | 483930 | 89.413 | 94.379 |
| 16 | 102800 | 85.283 | 88.316 |
| 17 | 581620 | 94.477 | 96.300 |
| 18 | 754980 | 90.649 | 94.158 |
| 19 | 268960 | 103.190 | 104.300 |
| 20 | 314040 | 92.880 | 94.556 |
| 21 | 324960 | 97.843 | 97.969 |
b) The requirement for a paired t-test is that the pairwise differences in scores are normally distributed. Plot the data in such a way as to check this assumption. Comment on your plot.
A KDE plot of the pairwise differences, after-before, would be a good choice here.
A scatterplot would be a good choice as these are paired data.
# Your code here
In real IQ tests, IQ scores are normally distributed by design (the tests are designed to yield a normal distribution of scores). Therefore we should be able to use a t-test to compare the scores from before and after taking VitalVit.
e) Conduct a t-test to test VitalVit’s claim
State your hypotheses
State relevant descriptive statistics
Carry out the test using the built-in function from scipy.stats with appropriate option choices
State your conclusions
Your answer here.
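Because each participant is measured twice, a paired test is the natural candidate. The sketch below uses `scipy.stats.ttest_rel` with a one-tailed alternative in the direction of VitalVit's claim. The direction and the argument order are assumptions you should check against the hypotheses you wrote down.

```python
import pandas as pd
from scipy import stats

# IQ scores copied from the table printed above (same participant order)
vitamin = pd.DataFrame({
    'before': [82.596, 117.200, 85.861, 125.640, 96.751, 105.680, 142.410,
               109.650, 128.210, 84.773, 110.470, 100.870, 94.117, 96.952,
               86.280, 89.413, 85.283, 94.477, 90.649, 103.190, 92.880, 97.843],
    'after':  [83.437, 119.810, 83.976, 127.680, 99.103, 106.890, 145.550,
               109.320, 125.110, 87.249, 112.650, 99.074, 95.951, 96.801,
               87.669, 94.379, 88.316, 96.300, 94.158, 104.300, 94.556, 97.969],
})

# Descriptive statistics, including the mean within-person change
print(vitamin.agg(['mean', 'std', 'count']))
print('mean change:', (vitamin['after'] - vitamin['before']).mean())

# H0: mean difference (after - before) is zero
# Ha: mean difference is greater than zero (one-tailed)
result = stats.ttest_rel(vitamin['after'], vitamin['before'],
                         alternative='greater')
print(f't = {result.statistic:.3f}, p = {result.pvalue:.4f}')
```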
f) Look back to the rank-based and permutation tests on the same data, which you carried out last week. How do the results differ? Which test was the best choice, and why?
Your answer here.
3.8.4. 3. Who has the tallest students?#
A student from Lonsdale college claims that Lonsdale students are taller than students from Beaufort college.
Heights of 30 randomly selected male undergraduates from each college are found in the file heightsCollege.csv.
Test the student’s hypothesis using a t-test (this is justified as heights are generally normally distributed) and write up your report as if for a scientific publication. Your report should include the following elements:
A plot of the data to show the data distribution
The relevant descriptive statistics
The results of the t-test
A conclusion
You can use the write-up sections of the t-test example notebooks as a model.
# Load the data
heights = pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2025/main/data/heightsCollege.csv')
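The column layout of heightsCollege.csv is not shown here, so the sketch below does not touch the real file. It defines a hypothetical helper, `describe_and_test`, that gathers the numbers a write-up usually needs, and demonstrates it on simulated placeholder heights. Swap in the real columns once you have inspected `heights`.

```python
import numpy as np
from scipy import stats

def describe_and_test(a, b):
    """Descriptive statistics plus a one-tailed independent t-test (Ha: mean a > mean b)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    result = stats.ttest_ind(a, b, alternative='greater')
    return {
        'mean_a': a.mean(), 'sd_a': a.std(ddof=1), 'n_a': a.size,
        'mean_b': b.mean(), 'sd_b': b.std(ddof=1), 'n_b': b.size,
        # Degrees of freedom for the equal-variance t-test
        't': result.statistic, 'df': a.size + b.size - 2, 'p': result.pvalue,
    }

# PLACEHOLDER DATA: simulated heights (cm), NOT the values in heightsCollege.csv
rng = np.random.default_rng(1)
lonsdale_sim = rng.normal(178, 7, size=30)
beaufort_sim = rng.normal(176, 7, size=30)

summary = describe_and_test(lonsdale_sim, beaufort_sim)
print(f"t({summary['df']}) = {summary['t']:.2f}, p = {summary['p']:.3f}")
```

The printed line follows the usual reporting pattern of test statistic, degrees of freedom and p-value. The means and standard deviations in `summary` can go into the descriptive statistics part of the report.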