4.8. ANOVA and Kruskal-Wallis in Python#
We’ll use the data from the chimps examples to demonstrate the python code. First, as usual, we’ll import the relevant packages and import the data as a .csv
file.
4.8.1. Set up Python libraries#
As usual, run the code cell below to import the relevant Python libraries
# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import seaborn as sns
sns.set_theme(style='white')
import statsmodels.api as sm
import statsmodels.formula.api as smf
4.8.2. Import and view the data#
chimps=pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/chimps.csv')
chimps
chimp | group | yawns | age3 | |
---|---|---|---|---|
0 | 13 | Baboons | 2 | young |
1 | 12 | Baboons | 1 | young |
2 | 14 | Baboons | 1 | young |
3 | 10 | Baboons | 2 | middle |
4 | 11 | Baboons | 0 | old |
5 | 15 | Control (human, no yawn) | 0 | young |
6 | 16 | Control (human, no yawn) | 2 | young |
7 | 19 | Control (human, no yawn) | 1 | middle |
8 | 18 | Control (human, no yawn) | 0 | old |
9 | 17 | Control (human, no yawn) | 1 | old |
10 | 4 | Familiar humans | 5 | young |
11 | 5 | Familiar humans | 3 | middle |
12 | 2 | Familiar humans | 4 | middle |
13 | 3 | Familiar humans | 2 | old |
14 | 1 | Familiar humans | 3 | old |
15 | 7 | Unfamiliar humans | 3 | young |
16 | 9 | Unfamiliar humans | 5 | middle |
17 | 6 | Unfamiliar humans | 4 | middle |
18 | 8 | Unfamiliar humans | 3 | old |
Have a look at the data
Each row is a ‘participant’ (a chimp).
We have the following information on each chimp;
ID number
Experimental group (who did the chimp see yawning?)
yawns (number of yawns produced)
age3 (age of the chimp, in three categories)
What is the dependent variable? What is the independent variabble? What are the control variables?
4.8.3. Running the ANOVA#
We want to run an ANOVA to test whether the species of the yawner affects the number of yawns produced by the chimp.
Here is some code to do so:
# First we create the ANOVA model:
chimps_lm = smf.ols('yawns ~ group', data=chimps).fit()
# Then output the ANOVA table
table = sm.stats.anova_lm(chimps_lm, typ=2)
print(table)
sum_sq df F PR(>F)
group 31.607895 3.0 11.66343 0.000333
Residual 13.550000 15.0 NaN NaN
The python output confirms our longhand example above. The p-value is 0.0003. There is a statistically significant difference in yawn rates between the groups.
Control variable#
We can add a control variable in a “two-way ANOVA”. Here we want to control for age which is a categorical variable of the chimp’s age: young, middle-aged, and old
# First we create the ANOVA model:
chimps_lm = smf.ols('yawns ~ group + age3', data=chimps).fit()
# Then output the ANOVA table
table = sm.stats.anova_lm(chimps_lm, typ=2)
print(table)
sum_sq df F PR(>F)
group 27.701754 3.0 13.907182 0.000237
age3 4.918421 2.0 3.703811 0.053331
Residual 8.631579 13.0 NaN NaN
The results show that the experimental treatment group is statistically significant (p=0.0002) but age is (just) not statistically significant (p=0.0533).
Interaction terms#
ANOVA can also handle interaction terms (as we explored with Regression Analysis last term).
Let’s look at whether there is an interaction between group
and age3
(this would mean that the yawning behaviour of young and old chimpanzees was differently affected by the species of the yawner)
# First we create the ANOVA model:
chimps_lm = smf.ols('yawns ~ group + age3 + group:age3', data=chimps).fit()
# Then output the ANOVA table
table = sm.stats.anova_lm(chimps_lm, typ=2)
print(table)
sum_sq df F PR(>F)
group 27.701754 3.0 13.850877 0.002496
age3 4.918421 2.0 3.688816 0.080526
group:age3 3.964912 6.0 0.991228 0.496102
Residual 4.666667 7.0 NaN NaN
The interaction is not statistically significant (p=0.4961), which we can interpret to mean that the effect of the treatment group was the same for chimps of different ages.
4.8.4. Kruskal-Wallis Test#
We can also run a Kruskal-Wallis Test in python, using a function from scipy.stats
called kruskal
Here is the syntax - remember the Kruskall-Wallis test is similar to a one-way ANOVA, in that it tests for the effect of only one (categorical) varialbe, no control variables or interactions:
# annoyingly, we have to give stats.kruskall each group's data as a separate series
stats.kruskal(chimps.query('group == "Baboons"').yawns,
chimps.query('group == "Control (human, no yawn)"').yawns,
chimps.query('group == "Familiar humans"').yawns,
chimps.query('group == "Unfamiliar humans"').yawns)
KruskalResult(statistic=13.314130434782625, pvalue=0.004004258990022785)
The Kruskal-Wallis test produces an H-statistic of 13.314 and a p-value of 0.004. It therefore gives the same conclusion as the one-way ANOVA, suggesting a statistically significant difference between treatment groups in the chimp experiment.