Spearman’s Rank Correlation

2.7. Spearman’s Rank Correlation#

In Chapter 1: Describing Data we looked at Spearman’s Rank correlation coefficient, which is a robust correlation based on ranks.

If you are unsure about correlation coefficients, please revisit the page on correlation in Chapter 1: Describing Data

In this section on rank-based tests, we revisit Spearman’s $r$ and see how to get a $p$ -value for it using scipy.stats

The reasons for using Spearman’srank correlation rather than Pearson’s correlation are recapped there.

# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import seaborn as sns
sns.set_theme(style='white')
import statsmodels.api as sm
import statsmodels.formula.api as smf

2.7.1. Load the data#

Let’s use the CO2 data discussed in the section on correlation in Chapter 1: Describing Data. The dataset contains GDP (weath) and carbon emissions per person for 164 countries.

carbon = pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/CO2vGDP.csv')
carbon

	Country	CO2	GDP	population
0	Afghanistan	0.2245	1934.555054	36686788
1	Albania	1.6422	11104.166020	2877019
2	Algeria	3.8241	14228.025390	41927008
3	Angola	0.7912	7771.441895	31273538
4	Argentina	4.0824	18556.382810	44413592
...	...	...	...	...
159	Venezuela	4.1602	10709.950200	29825652
160	Vietnam	2.3415	6814.142090	94914328
161	Yemen	0.3503	2284.889893	30790514
162	Zambia	0.4215	3534.033691	17835898
163	Zimbabwe	0.8210	1611.405151	15052191

164 rows × 4 columns

2.7.2. Plot the data#

From a scatter plot, we can see that the data are unsuitable for Pearson’s correlation (please check the notes for Correlation in the section Describing Data if unsure why)

sns.regplot(data=carbon, x='GDP', y='CO2')
plt.show()

../_images/e6d2f5f199a5af8d915e4d27abd1fe006ab4f8934e63e3f544ef725a62ec002d.png

2.7.3. Calculating correlation#

We have seen that we can get the correlation ( $r$ -value) between all pairs of columns using a pandas function df.corr() as follows:

carbon.corr(numeric_only=True, method='spearman')

	CO2	GDP	population
CO2	1.000000	0.914369	-0.098554
GDP	0.914369	1.000000	-0.122920
population	-0.098554	-0.122920	1.000000

Or between two particular columns like this:

carbon.GDP.corr(carbon.CO2, method='spearman')

0.9143688871356085

However, the pandas function df.corr() doesn’t calculate the significance of the correlation. We could calculate it using a permutation test (as last week) but we can also use a built in function from scipy.stats, called stats.spearmanr

stats.spearmanr(carbon.GDP, carbon.CO2)

SignificanceResult(statistic=0.9143688871356085, pvalue=1.6676605949335523e-65)

This gives us the correlation coefficient $r = 0.79$ (which is a very strong correlation) and the $p$ -value $4.9 \times 10^{- 37}$ (it is highliy significant)

Note on Hypotheses#

For Pearson’s correlation (the ‘standard’ correlation coefficient, calculated on actual data values rather than ranks) we might express our null and alternative hypotheses as follows:

$H_{o}$ There is no linear relationship between GDP and CO2 emissions per capita

$H_{a}$ There is a positive linear relationship between GDP and CO2 emissions per capita

in plain English, CO2 emissions are proportional to GDP

(remember from the section on correlation in Describing Data that Pearson’s correlation assumes that the relationship, if there is one, is a straight line)

For Spearman’s rank correlation coefficient, our null and alternative hypotheses are slightly different:

$H_{o}$ There is no relationship between GDP and CO2 emissions per capita

$H_{a}$ There is a relationship between CO2 and GDP rank

in plain English, richer a country is, the higher its carbon emissions