4.8. ANOVA and Kruskal-Wallis in Python#

We’ll use the data from the chimps examples to demonstrate the python code. First, as usual, we’ll import the relevant packages and import the data as a .csv file.

4.8.1. Set up Python libraries#

As usual, run the code cell below to import the relevant Python libraries

# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import seaborn as sns
sns.set_theme(style='white')
import statsmodels.api as sm
import statsmodels.formula.api as smf

4.8.2. Import and view the data#

chimps=pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/chimps.csv')
chimps
chimp group yawns age3
0 13 Baboons 2 young
1 12 Baboons 1 young
2 14 Baboons 1 young
3 10 Baboons 2 middle
4 11 Baboons 0 old
5 15 Control (human, no yawn) 0 young
6 16 Control (human, no yawn) 2 young
7 19 Control (human, no yawn) 1 middle
8 18 Control (human, no yawn) 0 old
9 17 Control (human, no yawn) 1 old
10 4 Familiar humans 5 young
11 5 Familiar humans 3 middle
12 2 Familiar humans 4 middle
13 3 Familiar humans 2 old
14 1 Familiar humans 3 old
15 7 Unfamiliar humans 3 young
16 9 Unfamiliar humans 5 middle
17 6 Unfamiliar humans 4 middle
18 8 Unfamiliar humans 3 old

Have a look at the data

Each row is a ‘participant’ (a chimp).

We have the following information on each chimp;

  • ID number

  • Experimental group (who did the chimp see yawning?)

  • yawns (number of yawns produced)

  • age3 (age of the chimp, in three categories)

What is the dependent variable? What is the independent variabble? What are the control variables?

4.8.3. Running the ANOVA#

We want to run an ANOVA to test whether the species of the yawner affects the number of yawns produced by the chimp.

Here is some code to do so:

# First we create the ANOVA model:
chimps_lm = smf.ols('yawns ~ group', data=chimps).fit()

# Then output the ANOVA table
table = sm.stats.anova_lm(chimps_lm, typ=2) 
print(table)
             sum_sq    df         F    PR(>F)
group     31.607895   3.0  11.66343  0.000333
Residual  13.550000  15.0       NaN       NaN

The python output confirms our longhand example above. The p-value is 0.0003. There is a statistically significant difference in yawn rates between the groups.

Control variable#

We can add a control variable in a “two-way ANOVA”. Here we want to control for age which is a categorical variable of the chimp’s age: young, middle-aged, and old

# First we create the ANOVA model:
chimps_lm = smf.ols('yawns ~ group + age3', data=chimps).fit()

# Then output the ANOVA table
table = sm.stats.anova_lm(chimps_lm, typ=2) 
print(table)
             sum_sq    df          F    PR(>F)
group     27.701754   3.0  13.907182  0.000237
age3       4.918421   2.0   3.703811  0.053331
Residual   8.631579  13.0        NaN       NaN

The results show that the experimental treatment group is statistically significant (p=0.0002) but age is (just) not statistically significant (p=0.0533).

Interaction terms#

ANOVA can also handle interaction terms (as we explored with Regression Analysis last term).

Let’s look at whether there is an interaction between group and age3 (this would mean that the yawning behaviour of young and old chimpanzees was differently affected by the species of the yawner)

# First we create the ANOVA model:
chimps_lm = smf.ols('yawns ~ group + age3 + group:age3', data=chimps).fit()

# Then output the ANOVA table
table = sm.stats.anova_lm(chimps_lm, typ=2) 
print(table)
               sum_sq   df          F    PR(>F)
group       27.701754  3.0  13.850877  0.002496
age3         4.918421  2.0   3.688816  0.080526
group:age3   3.964912  6.0   0.991228  0.496102
Residual     4.666667  7.0        NaN       NaN

The interaction is not statistically significant (p=0.4961), which we can interpret to mean that the effect of the treatment group was the same for chimps of different ages.

4.8.4. Kruskal-Wallis Test#

We can also run a Kruskal-Wallis Test in python, using a function from scipy.stats called kruskal

Here is the syntax - remember the Kruskall-Wallis test is similar to a one-way ANOVA, in that it tests for the effect of only one (categorical) varialbe, no control variables or interactions:

# annoyingly, we have to give stats.kruskall each group's data as a separate series

stats.kruskal(chimps.query('group == "Baboons"').yawns, 
              chimps.query('group == "Control (human, no yawn)"').yawns, 
              chimps.query('group == "Familiar humans"').yawns, 
              chimps.query('group == "Unfamiliar humans"').yawns)
KruskalResult(statistic=13.314130434782625, pvalue=0.004004258990022785)

The Kruskal-Wallis test produces an H-statistic of 13.314 and a p-value of 0.004. It therefore gives the same conclusion as the one-way ANOVA, suggesting a statistically significant difference between treatment groups in the chimp experiment.