4.6. Climate example#

Example#

We will look at a dataset containing carbon emissions, GDP and population for 164 countries (data from 2018).

These data are adapted from a dataset downloaded from Our World in Data, a fabulous Oxford-based organization that provides datasets and visualizations addressing global issues.

Set up Python libraries#

As usual, run the code cell below to import the relevant Python libraries

# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas 
import seaborn as sns
sns.set_theme() # use pretty defaults

Load and inspect the data#

Load the data from the file CO2vGDP.csv

CO2vGDP = pandas.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook/main/data/CO2vGDP.csv')
display(CO2vGDP)
Country CO2 GDP population
0 Afghanistan 0.2245 1934.555054 36686788
1 Albania 1.6422 11104.166020 2877019
2 Algeria 3.8241 14228.025390 41927008
3 Angola 0.7912 7771.441895 31273538
4 Argentina 4.0824 18556.382810 44413592
... ... ... ... ...
159 Venezuela 4.1602 10709.950200 29825652
160 Vietnam 2.3415 6814.142090 94914328
161 Yemen 0.3503 2284.889893 30790514
162 Zambia 0.4215 3534.033691 17835898
163 Zimbabwe 0.8210 1611.405151 15052191

164 rows × 4 columns

Aside - I notice that the GDP values contain loads of decimal places which makes them hard to read. Let’s just round those:

CO2vGDP['GDP']=CO2vGDP['GDP'].round()
display(CO2vGDP)
Country CO2 GDP population
0 Afghanistan 0.2245 1935.0 36686788
1 Albania 1.6422 11104.0 2877019
2 Algeria 3.8241 14228.0 41927008
3 Angola 0.7912 7771.0 31273538
4 Argentina 4.0824 18556.0 44413592
... ... ... ... ...
159 Venezuela 4.1602 10710.0 29825652
160 Vietnam 2.3415 6814.0 94914328
161 Yemen 0.3503 2285.0 30790514
162 Zambia 0.4215 3534.0 17835898
163 Zimbabwe 0.8210 1611.0 15052191

164 rows × 4 columns

It is easier to comapre the values now as the larger GDPs actually take up more space!

Plot the data#

Let’s plot the data. A scatterplot is a good choice for bivariate data such as these.

sns.scatterplot(data=CO2vGDP, x='GDP', y='CO2')
plt.xlabel('GDP: $/person/year')
plt.ylabel('CO2 emissions: tonnes/person/year')
Text(0, 0.5, 'CO2 emissions: tonnes/person/year')
_images/c48fb77c492f72174e9dad3a0f56615dbbd5481f93ff81b52657657187b68cb7.png

Let’s find the UK on that graph:

sns.scatterplot(data=CO2vGDP, x='GDP', y='CO2')
sns.scatterplot(data=CO2vGDP[CO2vGDP['Country']=='United Kingdom'], x='GDP', y='CO2',color='r') # see what I did there to plot the UK in red?
plt.xlabel('GDP: $/person/year')
plt.ylabel('CO2 emissions: tonnes/person/year')
Text(0, 0.5, 'CO2 emissions: tonnes/person/year')
_images/1a65feca418b04f5512e554c8c5dd6a77c3c232dbd7bfc6007d5c4e83cf7e780.png

Calculate the correlation#

We can calculate the correlation using the built in function pandas.df.corr()

CO2vGDP.corr()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[6], line 1
----> 1 CO2vGDP.corr()

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/frame.py:10054, in DataFrame.corr(self, method, min_periods, numeric_only)
  10052 cols = data.columns
  10053 idx = cols.copy()
> 10054 mat = data.to_numpy(dtype=float, na_value=np.nan, copy=False)
  10056 if method == "pearson":
  10057     correl = libalgos.nancorr(mat, minp=min_periods)

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/frame.py:1838, in DataFrame.to_numpy(self, dtype, copy, na_value)
   1836 if dtype is not None:
   1837     dtype = np.dtype(dtype)
-> 1838 result = self._mgr.as_array(dtype=dtype, copy=copy, na_value=na_value)
   1839 if result.dtype is not dtype:
   1840     result = np.array(result, dtype=dtype, copy=False)

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/internals/managers.py:1732, in BlockManager.as_array(self, dtype, copy, na_value)
   1730         arr.flags.writeable = False
   1731 else:
-> 1732     arr = self._interleave(dtype=dtype, na_value=na_value)
   1733     # The underlying data was copied within _interleave, so no need
   1734     # to further copy if copy=True or setting na_value
   1736 if na_value is not lib.no_default:

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/internals/managers.py:1794, in BlockManager._interleave(self, dtype, na_value)
   1792     else:
   1793         arr = blk.get_values(dtype)
-> 1794     result[rl.indexer] = arr
   1795     itemmask[rl.indexer] = 1
   1797 if not itemmask.all():

ValueError: could not convert string to float: 'Afghanistan'

Humph, population was included in my correlation matrix, which I didn’t really want.

The function pandas.df.corr() returns the matrix of correlations between all pairs of variables in your dataframe.

This isn’t a big problem in the current case, but if you had a big dataframe with many irrelevant columns, it would be an issue, because we don’t want the reader to have to search through a huge correlation matrix to find the correlation we are interested in.

We have two options to avoid this - one is to create a new dataframe with only the columns you want to correlate, like this:

CO2vGDP_reduced = CO2vGDP[['CO2','GDP']] # new dataframe has only columns 'CO2' and 'GDP'
CO2vGDP_reduced.corr()
CO2 GDP
CO2 1.000000 0.795121
GDP 0.795121 1.000000

The other is to correlate just the two columns we want, rather than getting the whole correlation matrix:

CO2vGDP['CO2'].corr(CO2vGDP['GDP'])
0.7951213612309438

Outliers#

The correlation between GDP and C02 looks quite high, 0.79.

However, looking at our scatterplot, I can see a problem - there is one bad outlier with very high GDP and high CO2 emissions.

Any guesses what this country is?

We can find out by sorting the dataframe by GDP:

CO2vGDP.sort_values(by='GDP', ascending=False) # sort in descending order to put the richest country at the top
Country CO2 GDP population
122 Qatar 38.4397 153764.0 2766743
112 Norway 8.3307 84580.0 5312321
154 United Arab Emirates 16.0112 76398.0 9140172
133 Singapore 7.9898 68402.0 5814543
79 Kuwait 23.1008 65521.0 4317190
... ... ... ... ...
108 Niger 0.0830 965.0 22577060
39 Democratic Republic of Congo 0.0331 859.0 87087352
85 Liberia 0.2252 818.0 4889396
21 Burundi 0.0608 651.0 11493476
26 Central African Republic 0.0471 623.0 5094795

165 rows × 4 columns

It’s Qatar - maybe not what you expected?

Remove outlier#

Let’s exclude Qatar from our dataset and re-calculate the correlation.

We erase the values for Qatar data values for CO2 and GDP for Qatar to Nan but in this case, since they are not misrecorded but just unusual values, let’s not do that, as we don’t want to hide the data point.

Instead we conduct the correlation on the dataframe excluding Qatar:

CO2vGDP[CO2vGDP['Country']!='Qatar'].corr()
CO2 GDP population
CO2 1.000000 0.732323 0.005751
GDP 0.732323 1.000000 -0.025626
population 0.005751 -0.025626 1.000000

Hm, the correlation went down from $r$=0.79 to $r$=0.073 - lower but still strong

Here’s a plot of the data with Qatar excluded

sns.scatterplot(data=CO2vGDP[CO2vGDP['Country']!='Qatar'], x='GDP', y='CO2')
plt.xlabel('GDP: $/person/year')
plt.ylabel('CO2 emissions: tonnes/person/year')
Text(0, 0.5, 'CO2 emissions: tonnes/person/year')
_images/6307a1ab7b2d5f9324a3f06a329d5967433448a7a0f875de1a433abb4e298c2c.png

We no longer have an obvious outlier, but we do have a problem, called heteroscedasticity

Heteroscedasticty is when the variance of the data in $y$ depends on the value in $x$. In this case, CO2 emissions are more variable for high income countries (which can be high- or low poluting) compared to low income countries

This property violates the assumptions of Pearson’s correlation coefficient, so for these dataset we would be better off using Spearman’s rank correlation coefficient, as explored in the next section.