{
"cells": [
{
"cell_type": "markdown",
"id": "8501b536",
"metadata": {},
"source": [
"# Climate example"
]
},
{
"cell_type": "markdown",
"id": "06a3540a",
"metadata": {},
"source": [
"## Example\n",
"\n",
"We will look at a dataset containing carbon emissions, GDP and population for 164 countries (data from 2018).\n",
"\n",
"These data are adapted from a dataset downloaded from Our World in Data, a fabulous Oxford-based organization that provides datasets and visualizations addressing global issues.\n",
"\n",
"\n",
"### Set up Python libraries\n",
"\n",
"As usual, run the code cell below to import the relevant Python libraries"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "7f1d34e0",
"metadata": {},
"outputs": [],
"source": [
"# Set-up Python libraries - you need to run this but you don't need to change it\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import scipy.stats as stats\n",
"import pandas \n",
"import seaborn as sns\n",
"sns.set_theme() # use pretty defaults"
]
},
{
"cell_type": "markdown",
"id": "5f633741",
"metadata": {},
"source": [
"### Load and inspect the data\n",
"\n",
"Load the data from the file CO2vGDP.csv"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "075f78c7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Country CO2 GDP population\n",
"0 Afghanistan 0.2245 1934.555054 36686788\n",
"1 Albania 1.6422 11104.166020 2877019\n",
"2 Algeria 3.8241 14228.025390 41927008\n",
"3 Angola 0.7912 7771.441895 31273538\n",
"4 Argentina 4.0824 18556.382810 44413592\n",
".. ... ... ... ...\n",
"160 Vietnam 2.3415 6814.142090 94914328\n",
"161 World 4.8022 15212.415040 7683789824\n",
"162 Yemen 0.3503 2284.889893 30790514\n",
"163 Zambia 0.4215 3534.033691 17835898\n",
"164 Zimbabwe 0.8210 1611.405151 15052191\n",
"\n",
"[165 rows x 4 columns]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"CO2vGDP = pandas.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook/main/data/CO2vGDP.csv')\n",
"display(CO2vGDP)"
]
},
{
"cell_type": "markdown",
"id": "d3cce993",
"metadata": {},
"source": [
"Aside - \n",
"I notice that the GDP values contain loads of decimal places which makes them hard to read. \n",
"Let's just round those:"
]
},
{
"cell_type": "code",
"execution_count": 39,
"id": "732fb048",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Country CO2 GDP population\n",
"0 Afghanistan 0.2245 1935.0 36686788\n",
"1 Albania 1.6422 11104.0 2877019\n",
"2 Algeria 3.8241 14228.0 41927008\n",
"3 Angola 0.7912 7771.0 31273538\n",
"4 Argentina 4.0824 18556.0 44413592\n",
".. ... ... ... ...\n",
"160 Vietnam 2.3415 6814.0 94914328\n",
"161 World 4.8022 15212.0 7683789824\n",
"162 Yemen 0.3503 2285.0 30790514\n",
"163 Zambia 0.4215 3534.0 17835898\n",
"164 Zimbabwe 0.8210 1611.0 15052191\n",
"\n",
"[165 rows x 4 columns]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"CO2vGDP['GDP']=CO2vGDP['GDP'].round()\n",
"display(CO2vGDP)"
]
},
{
"cell_type": "markdown",
"id": "2abcbb1a",
"metadata": {},
"source": [
"It is easier to compare the values now, as the larger GDPs actually take up more space!"
]
},
{
"cell_type": "markdown",
"id": "abc7e702",
"metadata": {},
"source": [
"### Plot the data\n",
"\n",
"Let's plot the data. A scatterplot is a good choice for bivariate data such as these."
]
},
{
"cell_type": "code",
"execution_count": 40,
"id": "65493236",
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/plain": [
"Text(0, 0.5, 'CO2 emissions: tonnes/person/year')"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sns.scatterplot(data=CO2vGDP, x='GDP', y='CO2')\n",
"sns.scatterplot(data=CO2vGDP[CO2vGDP['Country']=='United Kingdom'], x='GDP', y='CO2',color='r') # see what I did there to plot the UK in red?\n",
"plt.xlabel('GDP: $/person/year')\n",
"plt.ylabel('CO2 emissions: tonnes/person/year')"
]
},
{
"cell_type": "markdown",
"id": "ac0bb52f",
"metadata": {},
"source": [
"### Calculate the correlation\n",
"\n",
"We can calculate the correlation using the built-in pandas method DataFrame.corr()"
]
},
{
"cell_type": "code",
"execution_count": 42,
"id": "7da276fe",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" CO2 GDP population\n",
"CO2 1.000000 0.795121 -0.000190\n",
"GDP 0.795121 1.000000 -0.027832\n",
"population -0.000190 -0.027832 1.000000"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
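],
"source": [
"CO2vGDP.corr(numeric_only=True) # numeric_only=True skips the text column 'Country'; pandas 2.0+ raises an error without it"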
]
},
{
"cell_type": "markdown",
"id": "b528b58b",
"metadata": {},
"source": [
"Humph, population was included in my correlation matrix, which I didn't really want. \n",
"\n",
"The method pandas.DataFrame.corr() returns the matrix of correlations between all pairs of numeric variables in your dataframe. \n",
"\n",
"This isn't a big problem in the current case, but if you had a big dataframe with many irrelevant columns, it would be an issue, because we don't want the reader to have to search through a huge correlation matrix to find the correlation we are interested in.\n",
"\n",
"We have two options to avoid this - one is to create a new dataframe with only the columns you want to correlate, like this:"
]
},
{
"cell_type": "code",
"execution_count": 43,
"id": "b4e3fbea",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" CO2 GDP\n",
"CO2 1.000000 0.795121\n",
"GDP 0.795121 1.000000"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"CO2vGDP_reduced = CO2vGDP[['CO2','GDP']] # new dataframe has only columns 'CO2' and 'GDP'\n",
"CO2vGDP_reduced.corr()"
]
},
{
"cell_type": "markdown",
"id": "c14907a2",
"metadata": {},
"source": [
"The other is to correlate just the two columns we want, rather than getting the whole correlation matrix:"
]
},
{
"cell_type": "code",
"execution_count": 47,
"id": "4e998580",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.7951213612309438"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"CO2vGDP['CO2'].corr(CO2vGDP['GDP'])"
]
},
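{
"cell_type": "markdown",
"id": "added-pearson-md",
"metadata": {},
"source": [
"As a rough sketch of what the correlation measures, Pearson's $r$ is the covariance of the two variables divided by the product of their standard deviations. The small dataframe toy below is an assumption for illustration only: it holds just the first five rows shown in the table above (with rounded GDP), not the full dataset, so its $r$ will differ from the value for all countries. The names toy, r_manual and r_pandas are made up for this sketch."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "added-pearson-code",
"metadata": {},
"outputs": [],
"source": [
"import pandas\n",
"\n",
"# Five rows typed in by hand from the table displayed above (illustration only, not the full dataset)\n",
"toy = pandas.DataFrame({\n",
"    'CO2': [0.2245, 1.6422, 3.8241, 0.7912, 4.0824],\n",
"    'GDP': [1935.0, 11104.0, 14228.0, 7771.0, 18556.0],\n",
"})\n",
"\n",
"# Pearson's r by hand: covariance divided by the product of the standard deviations\n",
"cov_xy = toy['CO2'].cov(toy['GDP'])\n",
"r_manual = cov_xy / (toy['CO2'].std() * toy['GDP'].std())\n",
"\n",
"# The same quantity from the pandas method\n",
"r_pandas = toy['CO2'].corr(toy['GDP'])\n",
"\n",
"print(r_manual, r_pandas) # the two values should agree"
]
},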
{
"cell_type": "markdown",
"id": "3203dcc3",
"metadata": {},
"source": [
"### Outliers\n",
"\n",
"The correlation between GDP and CO2 looks quite high, $r$=0.795.\n",
"\n",
"However, looking at our scatterplot, I can see a problem - there is one bad outlier with very high GDP and high CO2 emissions.\n",
"\n",
"Any guesses what this country is? \n",
"\n",
"We can find out by sorting the dataframe by GDP:"
]
},
{
"cell_type": "code",
"execution_count": 44,
"id": "17f150d6",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Country CO2 GDP population\n",
"122 Qatar 38.4397 153764.0 2766743\n",
"112 Norway 8.3307 84580.0 5312321\n",
"154 United Arab Emirates 16.0112 76398.0 9140172\n",
"133 Singapore 7.9898 68402.0 5814543\n",
"79 Kuwait 23.1008 65521.0 4317190\n",
".. ... ... ... ...\n",
"108 Niger 0.0830 965.0 22577060\n",
"39 Democratic Republic of Congo 0.0331 859.0 87087352\n",
"85 Liberia 0.2252 818.0 4889396\n",
"21 Burundi 0.0608 651.0 11493476\n",
"26 Central African Republic 0.0471 623.0 5094795\n",
"\n",
"[165 rows x 4 columns]"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"CO2vGDP.sort_values(by='GDP', ascending=False) # sort in descending order to put the richest country at the top"
]
},
{
"cell_type": "markdown",
"id": "8cdcd6bf",
"metadata": {},
"source": [
"It's Qatar - maybe not what you expected?"
]
},
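{
"cell_type": "markdown",
"id": "added-idxmax-md",
"metadata": {},
"source": [
"If we only want the single richest country rather than the whole sorted table, one possible alternative is idxmax(), which returns the index label of the largest value in a column. This is just a sketch: the dataframe top3 below is typed in by hand from three rows of the sorted table above, and the names top3 and richest are made up for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "added-idxmax-code",
"metadata": {},
"outputs": [],
"source": [
"import pandas\n",
"\n",
"# Three rows copied by hand from the sorted table above (illustration only)\n",
"top3 = pandas.DataFrame({\n",
"    'Country': ['Norway', 'Qatar', 'United Arab Emirates'],\n",
"    'CO2': [8.3307, 38.4397, 16.0112],\n",
"    'GDP': [84580.0, 153764.0, 76398.0],\n",
"})\n",
"\n",
"# idxmax() gives the index label of the row with the largest GDP; .loc fetches that row\n",
"richest = top3.loc[top3['GDP'].idxmax()]\n",
"\n",
"print(richest['Country'])"
]
},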
{
"cell_type": "markdown",
"id": "8038acab",
"metadata": {},
"source": [
"### Remove outlier\n",
"\n",
"Let's exclude Qatar from our dataset and re-calculate the correlation.\n",
"\n",
"We could set the CO2 and GDP values for Qatar to NaN, but in this case, since they are not misrecorded but just unusual values, let's not do that, as we don't want to hide the data point.\n",
"\n",
"Instead we conduct the correlation on the dataframe excluding Qatar:"
]
},
{
"cell_type": "code",
"execution_count": 45,
"id": "951a05b8",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" CO2 GDP population\n",
"CO2 1.000000 0.732323 0.005751\n",
"GDP 0.732323 1.000000 -0.025626\n",
"population 0.005751 -0.025626 1.000000"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"CO2vGDP[CO2vGDP['Country']!='Qatar'].corr(numeric_only=True) # numeric_only=True skips the text column 'Country'"
]
},
{
"cell_type": "markdown",
"id": "20ff0ed2",
"metadata": {},
"source": [
"Hm, the correlation went down from $r$=0.795 to $r$=0.732 - lower but still strong.\n",
"\n",
"Here's a plot of the data with Qatar excluded:"
]
},
{
"cell_type": "code",
"execution_count": 46,
"id": "9dc42324",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Text(0, 0.5, 'CO2 emissions: tonnes/person/year')"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sns.scatterplot(data=CO2vGDP[CO2vGDP['Country']!='Qatar'], x='GDP', y='CO2')\n",
"plt.xlabel('GDP: $/person/year')\n",
"plt.ylabel('CO2 emissions: tonnes/person/year')"
]
},
{
"cell_type": "markdown",
"id": "36a341c1",
"metadata": {},
"source": [
"We no longer have an obvious outlier, but we do have a problem, called heteroscedasticity.\n",
"\n",
"Heteroscedasticity is when the variance of the data in $y$ depends on the value in $x$. In this case, CO2 emissions are more variable for high income countries (which can be high- or low-polluting) compared to low income countries.\n",
"\n",
"This property violates the assumptions of Pearson's correlation coefficient, so for this dataset we would be better off using Spearman's rank correlation coefficient, as explored in the next section."
]
},
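{
"cell_type": "markdown",
"id": "added-hetero-md",
"metadata": {},
"source": [
"To make the idea of heteroscedasticity concrete, here is a minimal sketch using simulated data (not the CO2 data): we generate $y$ values whose noise grows with $x$, fit a straight line, and compare the spread of the residuals for the lower and upper halves of $x$. The simulation settings and the names sim, spread_low and spread_high are all assumptions chosen for illustration; splitting at the median is just one simple way to look at this, not a formal test."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "added-hetero-code",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas\n",
"\n",
"rng = np.random.default_rng(0) # fixed seed so the sketch is repeatable\n",
"x = rng.uniform(1, 100, 500)\n",
"y = 0.1 * x + rng.normal(0, 0.05 * x) # the noise standard deviation grows with x\n",
"sim = pandas.DataFrame({'x': x, 'y': y})\n",
"\n",
"# Fit a straight line and keep the residuals (distance of each point from the line)\n",
"slope, intercept = np.polyfit(sim['x'], sim['y'], 1)\n",
"sim['resid'] = sim['y'] - (slope * sim['x'] + intercept)\n",
"\n",
"# Compare the spread of the residuals below and above the median of x\n",
"is_high = sim['x'] > sim['x'].median()\n",
"spread_low = sim.loc[~is_high, 'resid'].std()\n",
"spread_high = sim.loc[is_high, 'resid'].std()\n",
"\n",
"print(spread_low, spread_high) # a much larger spread for high x suggests heteroscedasticity"
]
},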
{
"cell_type": "code",
"execution_count": null,
"id": "2ffff910",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}