{ "cells": [ { "cell_type": "markdown", "id": "42b9ccaa-fc57-446a-8168-5253839647e8", "metadata": {}, "source": [ "# Spearman's Rank Correlation\n", "\n", "In Chapter 1: Describing Data we looked at Spearman's rank correlation coefficient, which is a robust correlation based on ranks.\n", "\n", "**If you are unsure about correlation coefficients, please revisit the page on correlation in Chapter 1: Describing Data**\n", "\n", "In this section on rank-based tests, we revisit Spearman's $r$ and see how to get a $p$-value for it using `scipy.stats`.\n", "\n", "The reasons for using Spearman's rank correlation rather than Pearson's correlation are recapped on that page in Chapter 1." ] }, { "cell_type": "code", "execution_count": 2, "id": "be001071-90c6-462a-ba35-2e731a4096a3", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Set-up Python libraries - you need to run this but you don't need to change it\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import scipy.stats as stats\n", "import pandas as pd\n", "import seaborn as sns\n", "sns.set_theme(style='white')\n", "import statsmodels.api as sm\n", "import statsmodels.formula.api as smf" ] }, { "cell_type": "markdown", "id": "c6d7c73e-abbe-4d5a-9ce3-01bd155a993c", "metadata": {}, "source": [ "## Load the data\n", "\n", "Let's use the CO2 data discussed in the section on correlation in Chapter 1: Describing Data. The dataset contains GDP (wealth) and carbon emissions per person for 164 countries." ] }, { "cell_type": "code", "execution_count": 3, "id": "dc0ce091-df7a-4796-bbc9-9b6a752189d8", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Country CO2 GDP population\n", "0 Afghanistan 0.2245 1934.555054 36686788\n", "1 Albania 1.6422 11104.166020 2877019\n", "2 Algeria 3.8241 14228.025390 41927008\n", "3 Angola 0.7912 7771.441895 31273538\n", "4 Argentina 4.0824 18556.382810 44413592\n", ".. ... ... ... ...\n", "159 Venezuela 4.1602 10709.950200 29825652\n", "160 Vietnam 2.3415 6814.142090 94914328\n", "161 Yemen 0.3503 2284.889893 30790514\n", "162 Zambia 0.4215 3534.033691 17835898\n", "163 Zimbabwe 0.8210 1611.405151 15052191\n", "\n", "[164 rows x 4 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "carbon = pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/CO2vGDP.csv')\n", "carbon" ] }, { "cell_type": "markdown", "id": "471a1535-7ec6-4f4c-8883-bdd3a493e459", "metadata": {}, "source": [ "## Plot the data\n", "\n", "From a scatter plot, we can see that the data are unsuitable for Pearson's correlation (please check the notes for Correlation in the section **Describing Data** if unsure why)" ] }, { "cell_type": "code", "execution_count": 9, "id": "a0ebb0d3-3924-4836-9902-9d2e485e0f82", "metadata": { "tags": [] }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sns.regplot(data=carbon, x='GDP', y='CO2')\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "af1276d2-8515-4bf6-80f4-305a510240d0", "metadata": {}, "source": [ "## Calculating correlation\n", "\n", "We have seen that we can get the correlation ($r$-value) between all pairs of columns using a `pandas` function `df.corr()` as follows:" ] }, { "cell_type": "code", "execution_count": 4, "id": "3b5c78e4-6e93-40f9-a38e-928bcfbfadba", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " CO2 GDP population\n", "CO2 1.000000 0.914369 -0.098554\n", "GDP 0.914369 1.000000 -0.122920\n", "population -0.098554 -0.122920 1.000000" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "carbon.corr(numeric_only=True, method='spearman')" ] }, { "cell_type": "markdown", "id": "4ca28233-2a51-459c-9587-84068b22e720", "metadata": {}, "source": [ "Or between two particular columns like this:" ] }, { "cell_type": "code", "execution_count": 5, "id": "ed788a1e-bef9-469b-9fa0-78a6a34dfb7d", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "0.9143688871356085" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "carbon.GDP.corr(carbon.CO2, method='spearman')" ] }, { "cell_type": "markdown", "id": "572cec75-c862-41c8-ae21-3d70966ced6e", "metadata": {}, "source": [ "However, the `pandas` function `df.corr()` doesn't calculate the significance of the correlation. We could calculate it using a permutation test (as last week), but we can also use a built-in function from `scipy.stats`, called `stats.spearmanr`." ] }, { "cell_type": "code", "execution_count": 6, "id": "d959c076-88e5-4720-bb72-8a84ec0db59b", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "SignificanceResult(statistic=0.9143688871356085, pvalue=1.6676605949335523e-65)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stats.spearmanr(carbon.GDP, carbon.CO2)" ] }, { "cell_type": "markdown", "id": "fda80ee3-31d0-47d1-8258-03e564950553", "metadata": { "tags": [] }, "source": [ "This gives us the correlation coefficient $r=0.91$ (which is a very strong correlation) and the $p$-value $1.7 \\times 10^{-65}$ (it is highly significant)." ] }, { "cell_type": "markdown", "id": "a59e0754-28e8-425f-9102-51e5ee8a8c88", "metadata": {}, "source": [ "### Note on Hypotheses\n", "\n", "For Pearson's correlation (the 'standard' correlation 
coefficient, calculated on actual data values rather than ranks) we might express our null and alternative hypotheses as follows:\n", "\n", "$\\mathcal{H_o}$ There is no linear relationship between GDP and CO2 emissions per capita\n", "\n", "$\\mathcal{H_a}$ There is a positive linear relationship between GDP and CO2 emissions per capita\n", "* in plain English, CO2 emissions are proportional to GDP\n", "\n", "(remember from the section on correlation in **Describing Data** that Pearson's correlation assumes that the relationship, if there is one, is a straight line)\n", "\n", "For Spearman's rank correlation coefficient, our null and alternative hypotheses are slightly different:\n", "\n", "$\\mathcal{H_o}$ There is no relationship between GDP and CO2 emissions per capita\n", "\n", "$\\mathcal{H_a}$ There is a positive relationship between the rank of GDP and the rank of CO2 emissions per capita\n", "* in plain English, the richer a country is, the higher its carbon emissions" ] }, { "cell_type": "code", "execution_count": null, "id": "1c5874c1-e95f-4495-bf8c-52f578a9a1c0", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.18" } }, "nbformat": 4, "nbformat_minor": 5 }