{ "cells": [ { "cell_type": "markdown", "id": "627c4901", "metadata": {}, "source": [ "# ANOVA and Kruskal-Wallis in Python\n", "\n", "We’ll use the data from the chimps examples to demonstrate the python code. First, as usual, we’ll import the relevant packages and import the data as a `.csv` file. \n", "\n", "### Set up Python libraries\n", "\n", "As usual, run the code cell below to import the relevant Python libraries" ] }, { "cell_type": "code", "execution_count": 2, "id": "1f55cac1", "metadata": {}, "outputs": [], "source": [ "# Set-up Python libraries - you need to run this but you don't need to change it\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import scipy.stats as stats\n", "import pandas \n", "import seaborn as sns\n", "import statsmodels.api as sm\n", "import statsmodels.formula.api as smf" ] }, { "cell_type": "markdown", "id": "a311658a", "metadata": {}, "source": [ "### Import and view the data" ] }, { "cell_type": "code", "execution_count": 3, "id": "36563533", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
chimpgroupyawnsage3
013Baboons2young
112Baboons1young
214Baboons1young
310Baboons2middle
411Baboons0old
515Control (human, no yawn)0young
616Control (human, no yawn)2young
719Control (human, no yawn)1middle
818Control (human, no yawn)0old
917Control (human, no yawn)1old
104Familiar humans5young
115Familiar humans3middle
122Familiar humans4middle
133Familiar humans2old
141Familiar humans3old
157Unfamiliar humans3young
169Unfamiliar humans5middle
176Unfamiliar humans4middle
188Unfamiliar humans3old
\n", "
" ], "text/plain": [ " chimp group yawns age3\n", "0 13 Baboons 2 young\n", "1 12 Baboons 1 young\n", "2 14 Baboons 1 young\n", "3 10 Baboons 2 middle\n", "4 11 Baboons 0 old\n", "5 15 Control (human, no yawn) 0 young\n", "6 16 Control (human, no yawn) 2 young\n", "7 19 Control (human, no yawn) 1 middle\n", "8 18 Control (human, no yawn) 0 old\n", "9 17 Control (human, no yawn) 1 old\n", "10 4 Familiar humans 5 young\n", "11 5 Familiar humans 3 middle\n", "12 2 Familiar humans 4 middle\n", "13 3 Familiar humans 2 old\n", "14 1 Familiar humans 3 old\n", "15 7 Unfamiliar humans 3 young\n", "16 9 Unfamiliar humans 5 middle\n", "17 6 Unfamiliar humans 4 middle\n", "18 8 Unfamiliar humans 3 old" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "chimps=pandas.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook/main/data/chimps.csv')\n", "chimps" ] }, { "cell_type": "markdown", "id": "7ecc08ca", "metadata": {}, "source": [ "Have a look at the data\n", "\n", "Each row is a 'participant' (a chimp). \n", "\n", "We have the following information on each chimp;\n", "* ID number\n", "* Experimental group (who did the chimp see yawning?)\n", "* yawns (number of yawns produced)\n", "* age3 (age of the chimp, in three categories)\n", "\n", "\n", "What is the dependent variable? What is the independent variabble? What are the control variables?\n", "\n", "## Running the ANOVA\n", "\n", "We want to run an ANOVA to test whether the species of the yawner affects the number of yawns produced by the chimp.\n", "\n", "Here is some code to do so:" ] }, { "cell_type": "code", "execution_count": 10, "id": "faf358cc", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " sum_sq df F PR(>F)\n", "group 31.607895 3.0 11.66343 0.000333\n", "Residual 13.550000 15.0 NaN NaN\n" ] } ], "source": [ "# First we create the ANOVA model:\n", "chimps_lm = smf.ols('yawns ~ group', data=chimps).fit()\n", "\n", "# Then output the ANOVA table\n", "table = sm.stats.anova_lm(chimps_lm, typ=2) \n", "print(table)" ] }, { "cell_type": "markdown", "id": "8f677858", "metadata": {}, "source": [ "The python output confirms our longhand example above. The p-value is 0.0003. There is a statistically significant difference in yawn rates between the groups.\n", "\n", "### Control variable\n", "\n", "We can add a control variable in a “two-way ANOVA”. Here we want to control for age which is a categorical variable of the chimp’s age: young, middle-aged, and old" ] }, { "cell_type": "code", "execution_count": 11, "id": "1cf52c49", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " sum_sq df F PR(>F)\n", "group 27.701754 3.0 13.907182 0.000237\n", "age3 4.918421 2.0 3.703811 0.053331\n", "Residual 8.631579 13.0 NaN NaN\n" ] } ], "source": [ "# First we create the ANOVA model:\n", "chimps_lm = smf.ols('yawns ~ group + age3', data=chimps).fit()\n", "\n", "# Then output the ANOVA table\n", "table = sm.stats.anova_lm(chimps_lm, typ=2) \n", "print(table)" ] }, { "cell_type": "markdown", "id": "ad995bc2", "metadata": {}, "source": [ "The results show that the experimental treatment group is statistically significant (p=0.0002) but age is (just) not statistically significant (p=0.0533). " ] }, { "cell_type": "markdown", "id": "6950fad7", "metadata": {}, "source": [ "### Interaction terms\n", "\n", "ANOVA can also handle interaction terms (as we explored with Regression Analysis last term).\n", "\n", "Let's look at whether there is an interaction between `group` and `age3` (this would mean that the yawning behaviour of young and old chimpanzees was differently affected by the species of the yawner)" ] }, { "cell_type": "code", "execution_count": 14, "id": "a868e626", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " sum_sq df F PR(>F)\n", "group 27.701754 3.0 13.850877 0.002496\n", "age3 4.918421 2.0 3.688816 0.080526\n", "group:age3 3.964912 6.0 0.991228 0.496102\n", "Residual 4.666667 7.0 NaN NaN\n" ] } ], "source": [ "# First we create the ANOVA model:\n", "chimps_lm = smf.ols('yawns ~ group + age3 + group:age3', data=chimps).fit()\n", "\n", "# Then output the ANOVA table\n", "table = sm.stats.anova_lm(chimps_lm, typ=2) \n", "print(table)" ] }, { "cell_type": "markdown", "id": "65783b49", "metadata": {}, "source": [ "The interaction is not statistically significant (p=0.4961), which we can interpret to mean that the effect of the treatment group was the same for chimps of different ages. \n", "\n", "## Kruskal-Wallis Test" ] }, { "cell_type": "markdown", "id": "1353503e", "metadata": {}, "source": [ "We can also run a Kruskal-Wallis Test in python, using a function from `scipy.stats` called `kruskal`\n", "\n", "Here is the syntax - remember the Kruskall-Wallis test is similar to a one-way ANOVA, in that it tests for the effect of only one (categorical) varialbe, no control variables or interactions:" ] }, { "cell_type": "code", "execution_count": 18, "id": "f4c48119", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "KruskalResult(statistic=13.314130434782625, pvalue=0.004004258990022785)" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# annoyingly, we have to give stats.kruskall each group's data as a separate series\n", "\n", "stats.kruskal(chimps[chimps.group == 'Baboons']['yawns'], \n", " chimps[chimps.group == 'Control (human, no yawn)']['yawns'],\n", " chimps[chimps.group == 'Familiar humans']['yawns'],\n", " chimps[chimps.group == 'Unfamiliar humans']['yawns'],)" ] }, { "cell_type": "markdown", "id": "5327ae29", "metadata": {}, "source": [ "The Kruskal-Wallis test produces an H-statistic of 12.894 and a p-value of 0.0049. It therefore gives the same result as the one-way ANOVA, suggesting a statistically significant difference between treatment groups in the chimp experiment." ] }, { "cell_type": "code", "execution_count": null, "id": "a4ee522f", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" } }, "nbformat": 4, "nbformat_minor": 5 }