{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "627c4901",
   "metadata": {},
   "source": [
    "# ANOVA and Kruskal-Wallis in Python\n",
    "\n",
    "We’ll use the data from the chimps examples to demonstrate the python code. First, as usual, we’ll import the relevant packages and import the data as a `.csv` file. \n",
    "\n",
    "### Set up Python libraries\n",
    "\n",
    "As usual, run the code cell below to import the relevant Python libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "1f55cac1",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set-up Python libraries - you need to run this but you don't need to change it\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import scipy.stats as stats\n",
    "import pandas \n",
    "import seaborn as sns\n",
    "import statsmodels.api as sm\n",
    "import statsmodels.formula.api as smf"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a311658a",
   "metadata": {},
   "source": [
    "### Import and view the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "36563533",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>chimp</th>\n",
       "      <th>group</th>\n",
       "      <th>yawns</th>\n",
       "      <th>age3</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>13</td>\n",
       "      <td>Baboons</td>\n",
       "      <td>2</td>\n",
       "      <td>young</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>12</td>\n",
       "      <td>Baboons</td>\n",
       "      <td>1</td>\n",
       "      <td>young</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>14</td>\n",
       "      <td>Baboons</td>\n",
       "      <td>1</td>\n",
       "      <td>young</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>10</td>\n",
       "      <td>Baboons</td>\n",
       "      <td>2</td>\n",
       "      <td>middle</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>11</td>\n",
       "      <td>Baboons</td>\n",
       "      <td>0</td>\n",
       "      <td>old</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>15</td>\n",
       "      <td>Control (human, no yawn)</td>\n",
       "      <td>0</td>\n",
       "      <td>young</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>16</td>\n",
       "      <td>Control (human, no yawn)</td>\n",
       "      <td>2</td>\n",
       "      <td>young</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>19</td>\n",
       "      <td>Control (human, no yawn)</td>\n",
       "      <td>1</td>\n",
       "      <td>middle</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>18</td>\n",
       "      <td>Control (human, no yawn)</td>\n",
       "      <td>0</td>\n",
       "      <td>old</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>17</td>\n",
       "      <td>Control (human, no yawn)</td>\n",
       "      <td>1</td>\n",
       "      <td>old</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>4</td>\n",
       "      <td>Familiar humans</td>\n",
       "      <td>5</td>\n",
       "      <td>young</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>5</td>\n",
       "      <td>Familiar humans</td>\n",
       "      <td>3</td>\n",
       "      <td>middle</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>2</td>\n",
       "      <td>Familiar humans</td>\n",
       "      <td>4</td>\n",
       "      <td>middle</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>3</td>\n",
       "      <td>Familiar humans</td>\n",
       "      <td>2</td>\n",
       "      <td>old</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>1</td>\n",
       "      <td>Familiar humans</td>\n",
       "      <td>3</td>\n",
       "      <td>old</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>7</td>\n",
       "      <td>Unfamiliar humans</td>\n",
       "      <td>3</td>\n",
       "      <td>young</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>9</td>\n",
       "      <td>Unfamiliar humans</td>\n",
       "      <td>5</td>\n",
       "      <td>middle</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>6</td>\n",
       "      <td>Unfamiliar humans</td>\n",
       "      <td>4</td>\n",
       "      <td>middle</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>8</td>\n",
       "      <td>Unfamiliar humans</td>\n",
       "      <td>3</td>\n",
       "      <td>old</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    chimp                     group  yawns    age3\n",
       "0      13                   Baboons      2   young\n",
       "1      12                   Baboons      1   young\n",
       "2      14                   Baboons      1   young\n",
       "3      10                   Baboons      2  middle\n",
       "4      11                   Baboons      0     old\n",
       "5      15  Control (human, no yawn)      0   young\n",
       "6      16  Control (human, no yawn)      2   young\n",
       "7      19  Control (human, no yawn)      1  middle\n",
       "8      18  Control (human, no yawn)      0     old\n",
       "9      17  Control (human, no yawn)      1     old\n",
       "10      4           Familiar humans      5   young\n",
       "11      5           Familiar humans      3  middle\n",
       "12      2           Familiar humans      4  middle\n",
       "13      3           Familiar humans      2     old\n",
       "14      1           Familiar humans      3     old\n",
       "15      7         Unfamiliar humans      3   young\n",
       "16      9         Unfamiliar humans      5  middle\n",
       "17      6         Unfamiliar humans      4  middle\n",
       "18      8         Unfamiliar humans      3     old"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "chimps=pandas.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook/main/data/chimps.csv')\n",
    "chimps"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7ecc08ca",
   "metadata": {},
   "source": [
    "Have a look at the data\n",
    "\n",
    "Each row is a 'participant' (a chimp). \n",
    "\n",
    "We have the following information on each chimp;\n",
    "* ID number\n",
    "* Experimental group (who did the chimp see yawning?)\n",
    "* yawns (number of yawns produced)\n",
    "* age3 (age of the chimp, in three categories)\n",
    "\n",
    "\n",
    "What is the dependent variable? What is the independent variabble? What are the control variables?\n",
    "\n",
    "## Running the ANOVA\n",
    "\n",
    "We want to run an ANOVA to test whether the species of the yawner affects the number of yawns produced by the chimp.\n",
    "\n",
    "Here is some code to do so:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "faf358cc",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "             sum_sq    df         F    PR(>F)\n",
      "group     31.607895   3.0  11.66343  0.000333\n",
      "Residual  13.550000  15.0       NaN       NaN\n"
     ]
    }
   ],
   "source": [
    "# First we create the ANOVA model:\n",
    "chimps_lm = smf.ols('yawns ~ group', data=chimps).fit()\n",
    "\n",
    "# Then output the ANOVA table\n",
    "table = sm.stats.anova_lm(chimps_lm, typ=2) \n",
    "print(table)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8f677858",
   "metadata": {},
   "source": [
    "The python output confirms our longhand example above. The p-value is 0.0003. There is a statistically significant difference in yawn rates between the groups.\n",
    "\n",
    "### Control variable\n",
    "\n",
    "We can add a control variable in a “two-way ANOVA”. Here we want to control for age which is a categorical variable of the chimp’s age: young, middle-aged, and old"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "1cf52c49",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "             sum_sq    df          F    PR(>F)\n",
      "group     27.701754   3.0  13.907182  0.000237\n",
      "age3       4.918421   2.0   3.703811  0.053331\n",
      "Residual   8.631579  13.0        NaN       NaN\n"
     ]
    }
   ],
   "source": [
    "# First we create the ANOVA model:\n",
    "chimps_lm = smf.ols('yawns ~ group + age3', data=chimps).fit()\n",
    "\n",
    "# Then output the ANOVA table\n",
    "table = sm.stats.anova_lm(chimps_lm, typ=2) \n",
    "print(table)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ad995bc2",
   "metadata": {},
   "source": [
    "The results show that the experimental treatment group is statistically significant (p=0.0002) but age is (just) not statistically significant (p=0.0533). "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6950fad7",
   "metadata": {},
   "source": [
    "### Interaction terms\n",
    "\n",
    "ANOVA can also handle interaction terms (as we explored with Regression Analysis last term).\n",
    "\n",
    "Let's look at whether there is an interaction between `group` and `age3` (this would mean that the yawning behaviour of young and old chimpanzees was differently affected by the species of the yawner)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "a868e626",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "               sum_sq   df          F    PR(>F)\n",
      "group       27.701754  3.0  13.850877  0.002496\n",
      "age3         4.918421  2.0   3.688816  0.080526\n",
      "group:age3   3.964912  6.0   0.991228  0.496102\n",
      "Residual     4.666667  7.0        NaN       NaN\n"
     ]
    }
   ],
   "source": [
    "# First we create the ANOVA model:\n",
    "chimps_lm = smf.ols('yawns ~ group + age3 + group:age3', data=chimps).fit()\n",
    "\n",
    "# Then output the ANOVA table\n",
    "table = sm.stats.anova_lm(chimps_lm, typ=2) \n",
    "print(table)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "65783b49",
   "metadata": {},
   "source": [
    "The interaction is not statistically significant (p=0.4961), which we can interpret to mean that the effect of the treatment group was the same for chimps of different ages. \n",
    "\n",
    "## Kruskal-Wallis Test"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1353503e",
   "metadata": {},
   "source": [
    "We can also run a Kruskal-Wallis Test in python, using a function from `scipy.stats` called `kruskal`\n",
    "\n",
    "Here is the syntax - remember the Kruskall-Wallis test is similar to a one-way ANOVA, in that it tests for the effect of only one (categorical) varialbe, no control variables or interactions:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "f4c48119",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "KruskalResult(statistic=13.314130434782625, pvalue=0.004004258990022785)"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# annoyingly, we have to give stats.kruskall each group's data as a separate series\n",
    "\n",
    "stats.kruskal(chimps[chimps.group == 'Baboons']['yawns'], \n",
    "              chimps[chimps.group == 'Control (human, no yawn)']['yawns'],\n",
    "              chimps[chimps.group == 'Familiar humans']['yawns'],\n",
    "              chimps[chimps.group == 'Unfamiliar humans']['yawns'],)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5327ae29",
   "metadata": {},
   "source": [
    "The Kruskal-Wallis test produces an H-statistic of 12.894 and a p-value of 0.0049. It therefore gives the same result as the one-way ANOVA, suggesting a statistically significant difference between treatment groups in the chimp experiment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a4ee522f",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}