{
"cells": [
{
"cell_type": "markdown",
"id": "49887e83",
"metadata": {},
"source": [
"# Tutorial Exercises: non-parametric tests \n",
"\n",
" \n",
"Here are some exercises on comparing groups of data (medians or means) using rank-based non-parametric tests or permutation tests.\n"
]
},
{
"cell_type": "markdown",
"id": "741220b6",
"metadata": {},
"source": [
"### Set up Python libraries\n",
"\n",
"As usual, run the code cell below to import the relevant Python libraries."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "692abf91",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Set-up Python libraries - you need to run this but you don't need to change it\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import scipy.stats as stats\n",
"import pandas as pd\n",
"import seaborn as sns\n",
"sns.set_theme(style='white')\n",
"import statsmodels.api as sm\n",
"import statsmodels.formula.api as smf\n",
"import warnings \n",
"warnings.simplefilter('ignore', category=FutureWarning)"
]
},
{
"cell_type": "markdown",
"id": "91eef546",
"metadata": {},
"source": [
"## 1. Whose peaches are heavier?\n",
"\n",
"\n",
"\n",
"\n",
"Mr Robinson’s juice factory buys peaches from farmers by the tray. Each tray contains 50 peaches. Farmer McDonald claims that this is unfair as his peaches are juicier and therefore weigh more than the peaches of his rival, Mr McGregor. \n",
"\n",
"Mr Robinson weighs eight trays of Farmer McDonald’s peaches and eight trays of Mr McGregor’s peaches.\n",
"\n",
"Investigate whether McDonald's claim is justified by testing for a difference in weight between McDonald's and McGregor's peaches using a non-parametric (rank-based) test."
]
},
{
"cell_type": "markdown",
"id": "134615c7",
"metadata": {},
"source": [
"a) Load the data into a Pandas dataframe"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "e524eb4b",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
" McGregor MacDonald\n",
"0 7.867 8.289\n",
"1 7.637 7.972\n",
"2 7.652 8.237\n",
"3 7.772 7.789\n",
"4 7.510 7.345\n",
"5 7.743 7.861\n",
"6 7.356 7.779\n",
"7 7.944 7.974"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"peaches = pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/peaches.csv')\n",
"peaches"
]
},
{
"cell_type": "markdown",
"id": "cf39faa0",
"metadata": {},
"source": [
"b) Plot the data and comment. \n",
"\n",
"A kernel density estimate plot (to show the distribution) and rug plot (to show individual data points) would be a good choice here. You should comment on the data distribution."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "ecb9049b",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# your code here to plot the data\n"
]
},
{
"cell_type": "markdown",
"id": "be0c251b",
"metadata": {},
"source": [
"c) Conduct an appropriate rank-based non-parametric test of Farmer McDonald's claim\n",
"\n",
"* State your hypotheses\n",
"* State relevant descriptive statistics\n",
"* Carry out the test using the built-in function from `scipy.stats` with appropriate option choices\n",
"* State your conclusions"
]
},
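{
"cell_type": "markdown",
"id": "c3f1a0d2-mwu-reminder",
"metadata": {},
"source": [
"As a reminder of what a rank-based test for two independent samples computes, here is a sketch assuming the standard Mann-Whitney U definition (the symbols $n_1$, $n_2$ and $R_1$ are notation introduced for this note, not taken from the exercise). Pool all $n_1 + n_2$ observations, rank them from smallest to largest, and let $R_1$ be the sum of the ranks belonging to group 1. Then\n",
"\n",
"$$U_1 = R_1 - \\frac{n_1(n_1+1)}{2}, \\qquad U_2 = n_1 n_2 - U_1$$\n",
"\n",
"With eight trays per farmer, $n_1 = n_2 = 8$, so in the absence of ties $U_1$ ranges from 0 (every group 1 tray lighter than every group 2 tray) to 64 (every group 1 tray heavier)."
]
},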
{
"cell_type": "code",
"execution_count": 4,
"id": "6d79296d-b266-44d5-ab6b-c8d6de3aa949",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# your code here"
]
},
{
"cell_type": "markdown",
"id": "a9d9b2b9-c8e0-4d8f-8b07-4188e0ca3fb9",
"metadata": {
"tags": []
},
"source": [
"d) Conduct a permutation test of the same claim\n",
"\n",
"* State your hypotheses\n",
"* State relevant descriptive statistics\n",
"* Carry out the test using the built-in function from `scipy.stats` with appropriate option choices\n",
"* State your conclusions"
]
},
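{
"cell_type": "markdown",
"id": "d4e2b1c3-perm-reminder",
"metadata": {},
"source": [
"The logic of a permutation test can be summarised in one formula. This is a sketch under assumed notation ($T$ for the chosen test statistic, such as the difference of group means, and $B$ for the number of shuffles), not a statement of how any particular library computes it. If $T_{\\text{obs}}$ is the statistic for the real data and $T_b$ is its value after the $b$-th random reassignment of trays to farmers, a one-sided $p$-value is estimated as\n",
"\n",
"$$\\hat{p} = \\frac{\\#\\{\\, b : T_b \\ge T_{\\text{obs}} \\,\\}}{B}$$\n",
"\n",
"Treating the two farmers' trays as independent samples, there are $\\binom{16}{8} = 12870$ distinct ways to split the 16 trays into two groups of eight, so the shuffles can be sampled at random or, since this number is small, enumerated in full. Some implementations add 1 to both numerator and denominator when the shuffles are sampled at random, so that the estimate is never exactly zero."
]
},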
{
"cell_type": "code",
"execution_count": 5,
"id": "5a021668-2dc2-4592-8ffc-3013e123c5c7",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# your code here"
]
},
{
"cell_type": "markdown",
"id": "bf2702d0",
"metadata": {},
"source": [
"## 2. IQ and vitamins\n",
"\n",
"\n",
"\n",
"The VitalVit company claim that after taking their VitalVit supplement, IQ is increased. \n",
"\n",
"They run a trial in which 22 participants complete a baseline IQ test, then take VitalVit for six weeks, then complete another IQ test."
]
},
{
"cell_type": "markdown",
"id": "3ea0f132",
"metadata": {},
"source": [
"a) What kind of design is this?"
]
},
{
"cell_type": "markdown",
"id": "d0f6401b",
"metadata": {},
"source": [
"< your answer here >\n"
]
},
{
"cell_type": "markdown",
"id": "4c839052",
"metadata": {},
"source": [
"b) What are the advantages and possible disadvantages of this type of design? Should the company have done something different or additional to rule out confounding factors?"
]
},
{
"cell_type": "markdown",
"id": "2f1b84e7",
"metadata": {},
"source": [
"< your answer here >"
]
},
{
"cell_type": "markdown",
"id": "36be5eb5",
"metadata": {},
"source": [
"c) Load the data into a Pandas dataframe"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "864de087",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
" ID_code before after\n",
"0 688870 82.596 83.437\n",
"1 723650 117.200 119.810\n",
"2 445960 85.861 83.976\n",
"3 708780 125.640 127.680\n",
"4 109960 96.751 99.103\n",
"5 968530 105.680 106.890\n",
"6 164930 142.410 145.550\n",
"7 744410 109.650 109.320\n",
"8 499380 128.210 125.110\n",
"9 290560 84.773 87.249\n",
"10 780690 110.470 112.650\n",
"11 660820 100.870 99.074\n",
"12 758780 94.117 95.951\n",
"13 363320 96.952 96.801\n",
"14 638840 86.280 87.669\n",
"15 483930 89.413 94.379\n",
"16 102800 85.283 88.316\n",
"17 581620 94.477 96.300\n",
"18 754980 90.649 94.158\n",
"19 268960 103.190 104.300\n",
"20 314040 92.880 94.556\n",
"21 324960 97.843 97.969"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vitamin = pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/vitalVit.csv')\n",
"vitamin"
]
},
{
"cell_type": "markdown",
"id": "fa85f514",
"metadata": {},
"source": [
"d) Plot the data and comment. \n",
"A scatterplot would be a good choice as these are paired data. \n",
"You could add the line of equality (the line x=y) to the graph so we can see whether most people score higher on the IQ test before or after taking VitalVit."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "1ff48986",
"metadata": {},
"outputs": [],
"source": [
"# Your code here for a scatter plot."
]
},
{
"cell_type": "markdown",
"id": "8a3f3840",
"metadata": {},
"source": [
"e) Conduct a suitable rank-based non-parametric test of VitalVit's claim\n",
"\n",
"* State your hypotheses\n",
"* State relevant descriptive statistics\n",
"* Carry out the test using the built-in function from `scipy.stats` with appropriate option choices\n",
"* State your conclusions"
]
},
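{
"cell_type": "markdown",
"id": "e5f3c2d4-wsr-reminder",
"metadata": {},
"source": [
"As a reminder of what a rank-based test for paired data computes, here is a sketch assuming the standard Wilcoxon signed-rank definition (the symbols $d_i$, $r_i$, $T^+$ and $T^-$ are notation introduced for this note). For each participant take the difference $d_i = \\text{after}_i - \\text{before}_i$, discard any zero differences, and let $r_i$ be the rank of $|d_i|$ among the remaining $n$ absolute differences. Then\n",
"\n",
"$$T^+ = \\sum_{i:\\, d_i > 0} r_i, \\qquad T^- = \\sum_{i:\\, d_i < 0} r_i, \\qquad T^+ + T^- = \\frac{n(n+1)}{2}$$\n",
"\n",
"With 22 participants and no zero differences, the two sums add up to $22 \\times 23 / 2 = 253$; under the null hypothesis each is expected to be about half of that."
]
},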
{
"cell_type": "code",
"execution_count": 8,
"id": "c592df98-dd5c-42e5-9b6a-4910f43f0e31",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# your code here"
]
},
{
"cell_type": "markdown",
"id": "8e2cd017",
"metadata": {},
"source": [
"f) Conduct a suitable permutation test of VitalVit's claim\n",
"\n",
"* State your hypotheses\n",
"* State relevant descriptive statistics\n",
"* Carry out the test using the built-in function from `scipy.stats` with appropriate option choices\n",
"* State your conclusions"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "3a944291-9de8-4d62-8f75-b005480e3d42",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# your code here"
]
},
{
"cell_type": "markdown",
"id": "e874c135-6079-408e-8ab6-040a3fcc3db3",
"metadata": {},
"source": [
"## 3. Socks\n",
"\n",
"In the section on permutation testing, we introduced a dataset on sock ownership (number of pairs of socks owned) for 14 husband-wife couples. We noticed that when using a permutation test for difference of means, the null distribution of the difference of means was strongly affected by the presence of an outlier: \n",
"* in one couple the husband owned about 30 more pairs of socks than the wife\n",
"* whether the difference of means in each permutation was positive or negative depended disproportionately on whether this couple were 'flipped' or not in that particular permutation\n",
"\n",
"Let's compare the rank-based test (Wilcoxon's Signed-Rank test) with the permutation test for the mean difference.\n",
"\n",
"**a. Load the data (done for you)**"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "bc76e3f0-ed73-45cf-bc85-82a81d0be296",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
" Husband Wife\n",
"0 10 12\n",
"1 17 13\n",
"2 48 20\n",
"3 28 25\n",
"4 23 18\n",
"5 16 14\n",
"6 18 13\n",
"7 34 26\n",
"8 27 22\n",
"9 22 14\n",
"10 12 10\n",
"11 13 17\n",
"12 22 21\n",
"13 15 16"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"socks = pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/socks.csv')\n",
"socks"
]
},
{
"cell_type": "markdown",
"id": "a906cde9-a9cc-40be-a7a9-301cac7d08c0",
"metadata": {
"tags": []
},
"source": [
"**b. Plot the data (done for you)**"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "b8a62d76-9d5f-45ae-b60b-e268bd6b835b",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"sns.barplot(data=socks, color=[0.8,0.8,0.8])\n",
"sns.lineplot(data=socks.T, marker='o')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "e449fcbc-d629-4069-90a5-21fd32be91d3",
"metadata": {},
"source": [
"**c. Carry out a suitable rank-based non-parametric test of the hypothesis that men own more socks than women**"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "4ceb291d-513a-49f2-953c-deeb780acecf",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# your code here"
]
},
{
"cell_type": "markdown",
"id": "58c4dcf7-dde9-49f1-95e9-a05082e9cbe2",
"metadata": {},
"source": [
"**d. Carry out a suitable permutation test of the hypothesis that men own more socks than women**"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "0c8671ef-8cf5-4085-a6d8-ed4e0cb14761",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# your code here"
]
},
{
"cell_type": "markdown",
"id": "535f5ead-26d5-4be3-ba2b-3f9e9642d487",
"metadata": {},
"source": [
"**e. Compare the two tests.**\n",
"\n",
"In this case the rank-based test has a (slightly) smaller $p$-value than the permutation test. \n",
"\n",
"The permutation test preserves the following features of the data:\n",
"1. In each couple one partner usually has more socks (what we shuffle is *which* partner this is)\n",
"2. One couple has an extreme difference in sock-counts (we shuffle whether it is the husband or wife who has more socks)\n",
"3. We retain the sample sizes and overall distribution of difference of means\n",
"\n",
"The rank-based test 'neutralizes' one of these features. Which is it, and what is the effect?"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f9a9a976-f7cd-424e-9e6c-5fc874dfdb80",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}