Regression models in Python

1.5. Regression models in Python#

We will be using the statsmodels package in Python, so we will need to import this along with the other Python packages we have been using.

1.5.1. Set up Python libraries#

As usual, run the code cell below to import the relevant Python libraries

# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import seaborn as sns
sns.set_theme(style='white')
import statsmodels.api as sm
import statsmodels.formula.api as smf

1.5.2. Happiness toy example#

The python code, once you get the hang of it, is pretty straightforward. The output that Python gives is a whole table of statistics, some of which are important to us in this class, and some will be important later, and some we won’t need in the course at all.

So, when looking at the Python output, the key objective for the moment is to know where to find the key things, e.g., the intercept and slope coefficients.

Let’s run a regression model in Python for the ‘toy’ country happiness data.

Import and view data#

# import and view data
happiness=pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/happiness10.csv')
happiness

	Country	GDPpc	LifeSat
0	Austria	51.9	7.2
1	Czechia	38.5	6.9
2	Finland	47.2	7.8
3	Greece	27.1	5.9
4	Ireland	90.8	7.0
5	Luxembourg	112.6	7.4
6	Poland	32.4	6.1
7	Slovakia	30.5	6.4
8	Sweden	50.9	7.4
9	Ukraine	12.4	5.1

Run the regression using `statsmodels`#

# run the regression model

# first we run this line to tell statsmodels where to find the data and the explanatory variables
reg_formula = sm.regression.linear_model.OLS.from_formula(data = happiness, formula = 'LifeSat ~ GDPpc')

# then we run this line to fit the regression (work out the values of intercept and slope)
# the output is a structure which we will call reg_results
reg_results = reg_formula.fit()

# let's view a summary of the regression results
reg_results.summary() 

# note you get a warning message because the dataset is quite small - disregard this
                                             

/opt/anaconda3/anaconda3/lib/python3.11/site-packages/scipy/stats/_stats_py.py:1736: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=10
  warnings.warn("kurtosistest only valid for n>=20 ... continuing "

OLS Regression Results
Dep. Variable:	LifeSat	R-squared:	0.418
Model:	OLS	Adj. R-squared:	0.345
Method:	Least Squares	F-statistic:	5.738
Date:	Thu, 26 Sep 2024	Prob (F-statistic):	0.0435
Time:	11:27:01	Log-Likelihood:	-9.1084
No. Observations:	10	AIC:	22.22
Df Residuals:	8	BIC:	22.82
Df Model:	1
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	5.8485	0.421	13.878	0.000	4.877	6.820
GDPpc	0.0176	0.007	2.395	0.043	0.001	0.035

Omnibus:	0.349	Durbin-Watson:	1.689
Prob(Omnibus):	0.840	Jarque-Bera (JB):	0.448
Skew:	0.289	Prob(JB):	0.799
Kurtosis:	2.139	Cond. No.	113.

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Interpreting the output#

The summary output is three tables

The top table tells us about the regression we ran, including

what was the dependent variable
how many observations there were
type of regression model used (in this case Ordinary Least Squares, OLS - don’t worry about this for now)
various measures of model fit such as $R^2$, AIC, BIC, Log-Likelihood - some (but not all) of these will be covered in later weeks

The middle table gives the output of the regression

coefficients (beta values) - in this simple regression these are the intercept and the slope
statistical significance of the effects (t and p values)

The bottom table contains some information about whether the assummptions of linear regression were satisfied

Predicted values, residuals#

For now let’s focus on the middle table

find the values for the intercept and the slope. Can you write out the regression equation in the form

\[ \hat{y} = \beta_0 + \beta_1 x \]

… replacing the $\beta$ values with numbers?

You could work out the predicted value of $y$, namely $\hat{y}$, for each value of $x4 from the equation. However statsmodels can give us the predicted values and residuals.

Run the following code which will give us the predicted values $\hat{y}$ for each row of the data. Display the result.

# return predicted values
reg_results.fittedvalues

  6.763547
  6.527300
  6.680684
  6.326313
  7.449370
  7.833713
  6.419754
  6.386256
  6.745917
  6.067146
dtype: float64

… and the residuals

# return residuals
reg_results.resid

  0.436453
  0.372700
  1.119316
 -0.426313
 -0.449370
 -0.433713
 -0.319754
  0.013744
  0.654083
 -0.967146
dtype: float64

These would be easier to view if we added them in as new columns in our dataframe:

happiness['yhat']=reg_results.fittedvalues
happiness['residuals']=reg_results.resid
happiness

	Country	GDPpc	LifeSat	yhat	residuals
0	Austria	51.9	7.2	6.763547	0.436453
1	Czechia	38.5	6.9	6.527300	0.372700
2	Finland	47.2	7.8	6.680684	1.119316
3	Greece	27.1	5.9	6.326313	-0.426313
4	Ireland	90.8	7.0	7.449370	-0.449370
5	Luxembourg	112.6	7.4	7.833713	-0.433713
6	Poland	32.4	6.1	6.419754	-0.319754
7	Slovakia	30.5	6.4	6.386256	0.013744
8	Sweden	50.9	7.4	6.745917	0.654083
9	Ukraine	12.4	5.1	6.067146	-0.967146

This should look familiar! We had the same table back in concepts section. (There may be small differences due to rounding).

Hopefully you can see that:

The $\hat{y}$ values are fairly similar to the corresponding values of $y$, LifeSat
The residuals are positive when $y>\hat{y}$ and negative $y<\hat{y}$
The size of the residual is larger when the difference between $y$ and $\hat{y}$ is larger

Do you think there are any outliers in the data? As we noted earlier, there was one very high and one very low value among the residuals.

Let’s try re-running the regression model without these two potential outliers: Finland and Ukraine. Let’s replace the whole row of the dataframe with ‘NaN’ for these two countries and re-run the regression command.

# Code for changing Finland and Ukraine to NaN
happiness_clean = happiness.copy()
happiness_clean.loc[(happiness_clean.Country=='Ukraine')]=np.nan
happiness_clean.loc[(happiness_clean.Country=='Finland')]=np.nan
happiness_clean

	Country	GDPpc	LifeSat	yhat	residuals
0	Austria	51.9	7.2	6.763547	0.436453
1	Czechia	38.5	6.9	6.527300	0.372700
2	NaN	NaN	NaN	NaN	NaN
3	Greece	27.1	5.9	6.326313	-0.426313
4	Ireland	90.8	7.0	7.449370	-0.449370
5	Luxembourg	112.6	7.4	7.833713	-0.433713
6	Poland	32.4	6.1	6.419754	-0.319754
7	Slovakia	30.5	6.4	6.386256	0.013744
8	Sweden	50.9	7.4	6.745917	0.654083
9	NaN	NaN	NaN	NaN	NaN

# re-run the regression model

# first we run this line to tell statsmodels where to find the data and the explanatory variables
reg_formula = sm.regression.linear_model.OLS.from_formula(data = happiness_clean, formula = 'LifeSat ~ GDPpc')

# then we run this line to fit the regression (work out the values of intercept and slope)
# the output is a structure which we will call reg_results
reg_results = reg_formula.fit()

# let's view a summary of the regression results
reg_results.summary() 

/opt/anaconda3/anaconda3/lib/python3.11/site-packages/scipy/stats/_stats_py.py:1736: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=8
  warnings.warn("kurtosistest only valid for n>=20 ... continuing "

OLS Regression Results
Dep. Variable:	LifeSat	R-squared:	0.467
Model:	OLS	Adj. R-squared:	0.379
Method:	Least Squares	F-statistic:	5.266
Date:	Thu, 26 Sep 2024	Prob (F-statistic):	0.0615
Time:	11:27:01	Log-Likelihood:	-3.9966
No. Observations:	8	AIC:	11.99
Df Residuals:	6	BIC:	12.15
Df Model:	1
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	6.0904	0.345	17.671	0.000	5.247	6.934
GDPpc	0.0128	0.006	2.295	0.062	-0.001	0.027

Omnibus:	1.068	Durbin-Watson:	1.229
Prob(Omnibus):	0.586	Jarque-Bera (JB):	0.669
Skew:	0.315	Prob(JB):	0.716
Kurtosis:	1.731	Cond. No.	131.

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

What changed when we excluded outliers?#

Has the slope value become bigger or smaller?
Has the conclusion changed?

1.6. The calculation for the slope and intercept#

Here’s a new data set looking at height and finger length among a sample of adults, which we will use to demonstrate how it is possible to calculate regression coefficients, using the equations.

The equation for finding $b$ is

\[ b = \frac{s_{xy}}{s^2_x} \]

# load and view the data
heightFinger = pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook/main/data/HeightFingerInches.csv')
display(heightFinger)

	Height	FingerLength
0	56	3.875
1	57	4.000
2	58	3.875
3	58	4.000
4	58	4.000
...	...	...
2995	74	4.875
2996	74	5.000
2997	74	5.000
2998	74	5.250
2999	77	4.375

3000 rows × 2 columns

The equation for finding $b$ is

\[ b = \frac{s_{xy}}{s^2_x} \]

Let’s apply that in Python

s_x = heightFinger.Height.std()
s_y = heightFinger.FingerLength.std()
s_xy = heightFinger.Height.cov(heightFinger.FingerLength)

b = s_xy/(s_x**2)
print('b = ' + str(b))

b = 0.0550010848727285

What is the value of the intercept? The equation for finding the intercept is as follows:

\[ a = \bar{y} - b\bar{x} \]

$\bar{x}$ and $\bar{y}$ are the means of $x$ and $y$ for the 10 data points.

Use Python to find the mean of $y$ and $x$.

x_bar = heightFinger.Height.mean()
y_bar = heightFinger.FingerLength.mean()

Can you calculate $a$?

a = y_bar - b*x_bar
a

0.9465806367945135

Let’s run a regression model in Python for the finger length data, to check our results.

# run the regression model on height/finger data

# first we run this line to tell statsmodels where to find the data and the explanatory variables
reg_formula = sm.regression.linear_model.OLS.from_formula(data = heightFinger, formula = 'FingerLength ~ Height')

# then we run this line to fit the regression (work out the values of intercept and slope)
# the output is a structure which we will call reg_results
reg_results = reg_formula.fit()

# let's view a summary of the regression results
reg_results.summary() 

OLS Regression Results
Dep. Variable:	FingerLength	R-squared:	0.423
Model:	OLS	Adj. R-squared:	0.423
Method:	Least Squares	F-statistic:	2200.
Date:	Thu, 26 Sep 2024	Prob (F-statistic):	0.00
Time:	11:27:01	Log-Likelihood:	1163.6
No. Observations:	3000	AIC:	-2323.
Df Residuals:	2998	BIC:	-2311.
Df Model:	1
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	0.9466	0.077	12.321	0.000	0.796	1.097
Height	0.0550	0.001	46.909	0.000	0.053	0.057

Omnibus:	20.597	Durbin-Watson:	0.182
Prob(Omnibus):	0.000	Jarque-Bera (JB):	28.786
Skew:	-0.062	Prob(JB):	5.61e-07
Kurtosis:	3.464	Cond. No.	1.68e+03

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.68e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Find $a$ and $b$ in the regression output tablbe (they are not called $a$ and $b$ in the table - what are they called?). Do they match the values we calculated ourselves using the equation?

Great! Our calculations based on the equations are confirmed by Python’s regression table.

1.6.1. Quick extra questions#

Here are some extra questions on the Regression output table, to see if you can begin to pick out all the important information.

In the top left, it gives the method as ‘Least Squares’. Above, it gives the model type as ‘OLS’. Do you know what OLS stand for?
- Ordinary Least Squares
How many observations are in the Height vs Finger model according to the regression output table?
- 3000
What do you think ‘std err’ might stand for, in the column after ‘coef’?
- Standard Error. You have learned about this important concept alreaady! We’ll come back to it again in a couple of weeks.