{ "cells": [ { "cell_type": "markdown", "id": "c33197dd", "metadata": {}, "source": [ "# The Wilcoxon Sign-Rank Test\n", "\n", "The Wilcoxon Sign-Rank Test is a rank-based test for the median **difference** in paired samples. It tests whether the median difference between the members of each pair differs from zero (or, in a one-tailed test, is greater or less than zero). As such it is often considered to be a non-parametric equivalent of the **paired samples** t-test (which we will meet next week).\n", "\n", "The Wilcoxon Sign-Rank Test is **not** the same as the Wilcoxon Rank Sum test (Mann-Whitney U test), which is for independent samples.\n", "\n", "We will use a Python function called `stats.wilcoxon()` from the `scipy.stats` library to run the test." ] }, { "cell_type": "markdown", "id": "b563240d", "metadata": {}, "source": [ "## Set up Python libraries\n", "\n", "As usual, run the code cell below to import the relevant Python libraries." ] }, { "cell_type": "code", "execution_count": 1, "id": "93d0feed", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Set-up Python libraries - you need to run this but you don't need to change it\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import scipy.stats as stats\n", "import pandas as pd\n", "import seaborn as sns\n", "sns.set_theme(style='white')\n", "import statsmodels.api as sm\n", "import statsmodels.formula.api as smf" ] }, { "cell_type": "markdown", "id": "8b8bd7b3", "metadata": {}, "source": [ "## Example: the Sign-Rank Test\n", "\n", "It has been argued that birth order in families affects how independent individuals are as adults - either that first-born children tend to be more independent than later-born children or vice versa.\n", "\n", "In a (fictional!) study, a researcher identified 20 sibling pairs, each comprising a first- and second-born child from a two-child family. The participants were young adults; each participant was interviewed at the age of 21. 
\n", "\n", "The researcher scored independence for each participant, using a 25 point scale where a higher score means the person is more independent, based on a structured interview.\n", "\n", "Carry out a statistical test for a difference in independence scores between the first- and second-born children.\n", "\n", "Note that this is a paired samples design - each member of one group (the first-borns) has a paired member of the other group (second-borns).\n", "\n", "\n", "### Inspect the data\n", "\n", "The data are provided in a text (.csv) file.\n", "\n", "Let's load the data as a Pandas dataframe, and plot them to get a sense for their distribution (is it normal?) and any outliers\n" ] }, { "cell_type": "code", "execution_count": 3, "id": "da622d51", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " FirstBorn SecondBorn\n", "0 12 10\n", "1 18 12\n", "2 13 15\n", "3 17 13\n", "4 8 9\n", "5 15 12\n", "6 16 13\n", "7 5 8\n", "8 8 10\n", "9 12 8\n", "10 13 8\n", "11 5 9\n", "12 14 8\n", "13 20 10\n", "14 19 14\n", "15 17 11\n", "16 2 7\n", "17 5 7\n", "18 15 13\n", "19 18 12" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# load the data and have a look\n", "birthOrder = pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/BirthOrderIndependence.csv')\n", "birthOrder" ] }, { "cell_type": "markdown", "id": "5481f8e9", "metadata": {}, "source": [ "### Scatterplot\n", "\n", "In the case of paired data, the most effective way to get a sense of the data is a scatterplot:" ] }, { "cell_type": "code", "execution_count": 4, "id": "ce6a5214", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sns.scatterplot(data = birthOrder, x=\"FirstBorn\", y=\"SecondBorn\")\n", "plt.xlabel(\"independence: first born\")\n", "plt.ylabel(\"independence: second born\")\n", "\n", "# add the line x=y (i.e. a line from point (0,0) to (20,20)) for reference\n", "plt.plot([0,20],[0,20],'r--')" ] }, { "cell_type": "markdown", "id": "850462ee", "metadata": {}, "source": [ "Comments:\n", " \n", "" ] }, { "cell_type": "markdown", "id": "1e0d04e2", "metadata": {}, "source": [ "### Check the data distribution\n", "\n", "In the case of paired data, we are interested in the distribution of *differences* within pairs.\n", "\n", "Let's add a column to our dataframe to contain the differences." ] }, { "cell_type": "code", "execution_count": 7, "id": "ce83eb84", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " FirstBorn SecondBorn Diff\n", "0 12 10 2\n", "1 18 12 6\n", "2 13 15 -2\n", "3 17 13 4\n", "4 8 9 -1\n", "5 15 12 3\n", "6 16 13 3\n", "7 5 8 -3\n", "8 8 10 -2\n", "9 12 8 4\n", "10 13 8 5\n", "11 5 9 -4\n", "12 14 8 6\n", "13 20 10 10\n", "14 19 14 5\n", "15 17 11 6\n", "16 2 7 -5\n", "17 5 7 -2\n", "18 15 13 2\n", "19 18 12 6" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "birthOrder['Diff'] = birthOrder.FirstBorn - birthOrder.SecondBorn\n", "birthOrder" ] }, { "cell_type": "markdown", "id": "3953720c", "metadata": {}, "source": [ "Now let's plot the differences to get a sense of whether they are normally distributed." ] }, { "cell_type": "code", "execution_count": 9, "id": "8f5e728d", "metadata": { "tags": [] }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sns.kdeplot(data=birthOrder, x='Diff', color='b', fill=True)\n", "sns.rugplot(data=birthOrder, x='Diff', height=0.1, color='b')\n", "plt.xlabel('Difference 1st-2nd born')\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "41d53375", "metadata": {}, "source": [ "The distribution has a hint of bimodality (two peaks)." ] }, { "cell_type": "markdown", "id": "30e5dff0", "metadata": {}, "source": [ "### Hypotheses\n", "\n", "$\\mathcal{H_o}$: the median difference in independence between first- and second-born siblings is zero\n", "\n", "$\\mathcal{H_a}$: the median difference in independence is not zero\n", " \n", "This is a two-tailed test as the researcher's hypothesis (described above) is not directional.\n", "\n", "We will test at the $\\alpha = 0.05$ significance level." ] }, { "cell_type": "markdown", "id": "0ed79570", "metadata": {}, "source": [ "### Descriptive statistics\n", "\n", "We obtain some relevant descriptive statistics. \n", "\n", "Since we are testing the median difference, we will want the median of the differences as well as the median for each group; it would also be useful to have a measure of spread, and the sample size." ] }, { "cell_type": "code", "execution_count": 19, "id": "11de87b4", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " FirstBorn SecondBorn Diff\n", "count 20.000000 20.000000 20.000000\n", "mean 12.600000 10.450000 2.150000\n", "std 5.364601 2.438183 4.120232\n", "min 2.000000 7.000000 -5.000000\n", "25% 8.000000 8.000000 -2.000000\n", "50% 13.500000 10.000000 3.000000\n", "75% 17.000000 12.250000 5.250000\n", "max 20.000000 15.000000 10.000000" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "birthOrder.describe()" ] }, { "cell_type": "markdown", "id": "b7ab9c7d", "metadata": {}, "source": [ "### Carry out the test\n", "\n", "We carry out the test using the function `wilcoxon()` from `scipy.stats`, here loaded as `stats`." ] }, { "cell_type": "code", "execution_count": 29, "id": "85858db8", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "WilcoxonResult(statistic=164.0, pvalue=0.0133209228515625)" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stats.wilcoxon(birthOrder.FirstBorn, birthOrder.SecondBorn, alternative='greater')\n", "#help(stats.wilcoxon)" ] }, { "cell_type": "markdown", "id": "3dcd664c", "metadata": {}, "source": [ "The inputs to `stats.wilcoxon()` are:\n", "\n", "* the two samples to be compared (the values of FirstBorn and SecondBorn from our Pandas data frame birthOrder)\n", "* the argument `alternative='greater'`, which tells the computer to run a one-tailed test of whether the median difference (FirstBorn minus SecondBorn) is greater than zero. To run a two-tailed test instead, use `alternative='two-sided'` (the default).\n", " \n", "The outputs are a value of the test statistic ($T=164$) and a p-value ($p=0.0133$) - if the p-value is less than our $\\alpha$ value of 0.05, there is a significant difference.\n", "\n", "More explanation of how $T$ is calculated is given below." ] }, { "cell_type": "markdown", "id": "69dcbf20", "metadata": {}, "source": [ "### Draw conclusions\n", "\n", "As the p value of 0.0133 is less than our alpha value of 0.05, the test is significant. 
\n", "\n", "We can conclude that the median difference in independence is positive, i.e. the first-borns are more independent." ] }, { "cell_type": "markdown", "id": "1a80a337", "metadata": {}, "source": [ "### How the Wilcoxon Sign-Rank test works\n", "\n", "The mechanism of the test is similar in principle to the rank sum test, except that here we work with ranked *differences*.\n" ] }, { "cell_type": "markdown", "id": "7ea850fa-1f9c-4cea-bee4-7b70ba7360b2", "metadata": {}, "source": [ "### How to do the test (if you were doing it with pencil and paper)\n", "\n", "1. Obtain the difference (in independence score) for each pair\n", "\n", "1. Remove pairs with zero difference, then rank the remaining differences by size regardless of sign (e.g. a difference of +4 ranks above a difference of -3, which ranks above a difference of +2)\n", "\n", "1. Calculate the sum of ranks assigned to pairs with a positive difference (first-born more independent than second-born) - this is $R+$\n", "1. Calculate the sum of ranks assigned to pairs with a negative difference (first-born less independent than second-born) - this is $R-$\n", "\n", "1. The test statistic $T$ is either:\n", " * $R+$ if we expect positive differences to have the larger ranks (in this case, that equates to expecting first-borns to have higher scores)\n", " * $R-$ if we expect negative differences to have the larger ranks (in this case, that equates to expecting second-borns to have higher scores)\n", " * The smaller of $R+$ and $R-$ for a two-tailed test (used when we have no a priori hypothesis about the direction of effect)\n", "\n", "1. $T$ is compared with a null distribution (the expected distribution of $T$ obtained in samples drawn from a population in which there is no true difference between groups)\n", "\n", "\n", "We will not build code to do this here, although if you are feeling brave you are welcome to have a try yourself, using the between groups rank sum test as a model. 
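
For anyone who does want to try, here is a minimal sketch of steps 1 to 4 of the pencil-and-paper recipe. It assumes that tied differences share the average of the ranks they span (the default behaviour of `scipy.stats.rankdata`). The names `first_born`, `second_born`, `r_plus` and `r_minus` are illustrative only, and the scores are typed in from the table earlier in this notebook rather than read from the csv file. It stops before the comparison with a null distribution.

```python
import numpy as np
from scipy.stats import rankdata

# Independence scores copied from the data table above (20 sibling pairs)
first_born = np.array([12, 18, 13, 17, 8, 15, 16, 5, 8, 12,
                       13, 5, 14, 20, 19, 17, 2, 5, 15, 18])
second_born = np.array([10, 12, 15, 13, 9, 12, 13, 8, 10, 8,
                        8, 9, 8, 10, 14, 11, 7, 7, 13, 12])

# Step 1: difference within each pair
diff = first_born - second_born

# Step 2: drop zero differences, then rank by absolute size
# (tied values receive the average of the ranks they span)
diff = diff[diff != 0]
ranks = rankdata(np.abs(diff))

# Steps 3 and 4: sum the ranks separately for positive and negative differences
r_plus = ranks[diff > 0].sum()
r_minus = ranks[diff < 0].sum()

print('R+ =', r_plus)
print('R- =', r_minus)
```

With these data the sketch gives $R+ = 164$ and $R- = 46$; the value of $R+$ agrees with the statistic reported by `stats.wilcoxon()` with `alternative='greater'` above.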
\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.18" } }, "nbformat": 4, "nbformat_minor": 5 }