Spearman’s Rank Correlation

4.7. Spearman’s Rank Correlation#

Climate Example#

We will again use the a dataset containing carbon emissions, GDP and population for 164 countries (data from 2018).

These data came from Our World in Data, a fabulous Oxford-based organization that provides datasets and visualizations addressing global issues.

Set up Python libraries#

As usual, run the code cell below to import the relevant Python libraries

# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas 
import seaborn as sns
sns.set_theme() # use pretty defaults

Load and inspect the data#

Load the data from the file CO2vGDP.csv.

CO2vGDP = pandas.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook/main/data/CO2vGDP.csv')
display(CO2vGDP)

	Country	CO2	GDP	population
0	Afghanistan	0.2245	1934.555054	36686788
1	Albania	1.6422	11104.166020	2877019
2	Algeria	3.8241	14228.025390	41927008
3	Angola	0.7912	7771.441895	31273538
4	Argentina	4.0824	18556.382810	44413592
...	...	...	...	...
159	Venezuela	4.1602	10709.950200	29825652
160	Vietnam	2.3415	6814.142090	94914328
161	Yemen	0.3503	2284.889893	30790514
162	Zambia	0.4215	3534.033691	17835898
163	Zimbabwe	0.8210	1611.405151	15052191

164 rows × 4 columns

We noted previously that our climate dataset is not suitable for a ‘normal’ Pearson’s correlation because it has one of the three freatures that violate the assumptions of Pearson’s correlation:

Non-Straight-Line relationship between x and y
Heteroscedasticity
Outliers

In this case the problem was heteroscedasticity

Spearman’s Rank Correlation#

We therefore use a rank-based form of correlation called Spearman’s rank correlation coefficient $r_s$, which does not rely on the same assumptions as Pearson’s correlation.

$r_s$ can be calculated using the same built-in function pandas.df.corr():

CO2vGDP.corr(method='spearman')

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[3], line 1
----> 1 CO2vGDP.corr(method='spearman')

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/frame.py:10054, in DataFrame.corr(self, method, min_periods, numeric_only)
  10052 cols = data.columns
  10053 idx = cols.copy()
> 10054 mat = data.to_numpy(dtype=float, na_value=np.nan, copy=False)
  10056 if method == "pearson":
  10057     correl = libalgos.nancorr(mat, minp=min_periods)

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/frame.py:1838, in DataFrame.to_numpy(self, dtype, copy, na_value)
   1836 if dtype is not None:
   1837     dtype = np.dtype(dtype)
-> 1838 result = self._mgr.as_array(dtype=dtype, copy=copy, na_value=na_value)
   1839 if result.dtype is not dtype:
   1840     result = np.array(result, dtype=dtype, copy=False)

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/internals/managers.py:1732, in BlockManager.as_array(self, dtype, copy, na_value)
   1730         arr.flags.writeable = False
   1731 else:
-> 1732     arr = self._interleave(dtype=dtype, na_value=na_value)
   1733     # The underlying data was copied within _interleave, so no need
   1734     # to further copy if copy=True or setting na_value
   1736 if na_value is not lib.no_default:

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/internals/managers.py:1794, in BlockManager._interleave(self, dtype, na_value)
   1792     else:
   1793         arr = blk.get_values(dtype)
-> 1794     result[rl.indexer] = arr
   1795     itemmask[rl.indexer] = 1
   1797 if not itemmask.all():

ValueError: could not convert string to float: 'Afghanistan'

The ‘methods’ available are:

Pearson ('normal' correlation),
Spearman (rank based) and
Kendal (categorical; rarely used).

The default (the method used if no method is specified) is Pearson - we can see this by comparing the results with method specified as Pearson and no method specified:

CO2vGDP.corr(method='pearson')

	CO2	GDP	population
CO2	1.000000	0.795214	-0.003745
GDP	0.795214	1.000000	-0.054502
population	-0.003745	-0.054502	1.000000

CO2vGDP.corr()

	CO2	GDP	population
CO2	1.00000	0.79512	-0.00019
GDP	0.79512	1.00000	-0.02783
population	-0.00019	-0.02783	1.00000

The equation#

Let’s check our understanding by calculating Spearman’s correlation coefficient oursleves.

The equation is:

$$ r_s = 1 - 6 \sum{\frac{d^2}{n(n^2 - 1)}} $$

… where d is the difference in ranks between paired datapoints.

Let’s walk that through.

First, we create new columns containing the rank of each datapoint -

the lowest GDP will have rank 1 and the highest will have rank 165
the lowest CO2 emissions will have rank 1 and the highest will have rank 165
the GDP aand CO2 ranks for each country need not match, but if GDP and CO2 are correlated,
high-ranked countries for GDP should also be high-ranked for CO2

CO2vGDP['CO2_rank'] = CO2vGDP['CO2'].rank()
CO2vGDP['GDP_rank'] = CO2vGDP['GDP'].rank()

# We can see these most clearly if we sort the dataframe before displaying
display(CO2vGDP.sort_values(by='CO2'))

	Country	CO2	GDP	population	CO2_rank	GDP_rank
39	Democratic Republic of Congo	0.0331	859.381714	87087352	1.0	4.0
26	Central African Republic	0.0471	623.488892	5094795	2.0	1.0
21	Burundi	0.0608	651.358887	11493476	3.0	2.0
27	Chad	0.0675	2046.363159	15604213	4.0	24.0
108	Niger	0.0830	964.660095	22577060	5.0	5.0
...	...	...	...	...	...	...
128	Saudi Arabia	18.4541	50304.750000	35018132	160.0	154.0
9	Bahrain	20.7778	39498.765630	1487346	161.0	143.0
79	Kuwait	23.1008	65520.738280	4317190	162.0	160.0
148	Trinidad and Tobago	29.1223	28549.408200	1504707	163.0	128.0
122	Qatar	38.4397	153764.171900	2766743	164.0	164.0

164 rows × 6 columns

You can see that countries with low ranks for CO2 do tend to have low ranks for GDP.

Let’s plot the data, and the ranked data, on scatterplots. You can see that the ranked data do not have the same heteroscedasticity issue as the data themselves.

plt.subplot(1,2,1)
sns.scatterplot(data=CO2vGDP, x='CO2', y='GDP')

plt.subplot(1,2,2)
sns.scatterplot(data=CO2vGDP, x='CO2_rank', y='GDP_rank')

plt.subplots_adjust(wspace = 0.5) # shift the plots sideways so they don't overlap

_images/cde852ba218160ff732a09b13120d5eca1b5887a85f421d7fb8c6061d4f937a4.png

To continue the process of applying the equation, we make a new column containing the difference of ranks:

CO2vGDP['d'] = CO2vGDP['CO2_rank']-CO2vGDP['GDP_rank']

… and apply the formula:

n = len(CO2vGDP)

r_s = 1 - 6*sum(CO2vGDP['d']**2)/(n*(n**2 - 1))
print('r = ' + str(r_s))

r = 0.9143688871356087

Ta-daa! This should match the value from the built in function.

Spearman = Pearson on Ranks#

The equation for Spearman’s $r_s$ looks pretty weird, doesn’t it? What is that 6 doing there?!

I can’t tell you the derivation (although I believe I did once know it), but I can tell you a fun thing.

Spearman’s $r_s$ is exactly the same value as Pearson’s $r$ claculated on the ranks.

Let’s try it!

# make a new dataframe with only ranks
CO2vGDP_ranks = CO2vGDP[['CO2_rank','GDP_rank']]
display(CO2vGDP_ranks)

	CO2_rank	GDP_rank
0	16.0	22.0
1	58.0	73.0
2	94.0	93.0
3	41.0	59.0
4	99.0	105.0
...	...	...
160	77.0	55.0
161	106.0	96.0
162	24.0	26.0
163	29.0	36.0
164	44.0	14.0

165 rows × 2 columns

# Calculate **Pearson's** correlation on the ranks
CO2vGDP_ranks.corr(method='pearson')

	CO2_rank	GDP_rank
CO2_rank	1.000000	0.913841
GDP_rank	0.913841	1.000000

Yep, the correlation is 0.9138 - the exact same value as $r_s$