5.6. Logistic Regression in Python#

![regression5_darwin](https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook/main/images/regression5_darwin.jpg)

Example: The General Social Survey is a regular survey of the population of the United States that has gathered information on people’s opinions since 1972.

Here we use data from 2018, when the following question was put to the respondents: “Human beings, as we know them today, developed from earlier species of animals. True or false?”

In this analysis, \(y\) = opinion about evolution (1 = true, 0 = false).

We have three \(x\) variables for the analysis:

  • age (measured in years),

  • political ideology (measured from 1 = extremely conservative to 7 = extremely liberal, which we treat as continuous here), and

  • whether the participant studied science at college (yes, no).

Before we start, have a think about these \(x\) variables: which direction do you expect the relationship to go? E.g., will older or younger people be more likely to answer “true”?

5.6.1. Set up Python libraries#

As usual, run the code cell below to import the relevant Python libraries.

# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import seaborn as sns
sns.set_theme(style='white')
import statsmodels.api as sm
import statsmodels.formula.api as smf

5.6.2. Import and view the data#

evolution = pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/evolution.csv')
evolution
     age colsci  evolved  polviews
0     43     no        0         2
1     42    yes        1         3
2     30    yes        0         2
3     34     no        1         2
4     56    yes        0         2
..   ...    ...      ...       ...
498   38     no        1         6
499   57    yes        1         6
500   39    yes        1         6
501   75     no        0         3
502   72    yes        1         2

503 rows × 4 columns

Which column is which variable, from the list above?

5.6.3. Logistic regression#

We ask Python to run a logistic regression, predicting people’s views on evolution (true or false) from their age, political views, and whether they studied science at college:

# create the logistic regression model and fit it
logistic_model = smf.logit('evolved ~ age + polviews + colsci', data=evolution).fit()

# print out the summary table
logistic_model.summary()
Optimization terminated successfully.
         Current function value: 0.629955
         Iterations 5
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                evolved   No. Observations:                  503
Model:                          Logit   Df Residuals:                      499
Method:                           MLE   Df Model:                            3
Date:                Wed, 25 Sep 2024   Pseudo R-squ.:                 0.08140
Time:                        14:12:09   Log-Likelihood:                -316.87
converged:                       True   LL-Null:                       -344.95
Covariance Type:            nonrobust   LLR p-value:                 3.892e-12
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept        -0.5169      0.383     -1.351      0.177      -1.267       0.233
colsci[T.yes]     0.2923      0.191      1.533      0.125      -0.081       0.666
age              -0.0178      0.005     -3.471      0.001      -0.028      -0.008
polviews          0.3837      0.068      5.670      0.000       0.251       0.516
=================================================================================

The standard model output looks rather like the output table from linear regression.

The coefficients in the column “coef” are in log odds.

We can interpret these coefficients in the following way:

  • as people get older they are less likely to agree that humans evolved from earlier species (hence the negative coef)

  • as people’s political views get more liberal they are more likely to agree (hence the positive coef)

  • the log odds coefficient for studying science at college is also positive; however, it is not statistically significant

Because log odds are rather hard to interpret intuitively (for most of us, anyway), we can exponentiate them to convert them into odds ratios, which helps with interpretation.

# logistic_model.params contains the beta coefficients for each explanatory variable

# use numpy's exp function to exponentiate them:
np.exp(logistic_model.params)
Intercept        0.596346
colsci[T.yes]    1.339446
age              0.982374
polviews         1.467663
dtype: float64

Interpreting the odds ratios#

Remember the practice session with logs above? The same principles apply here.

Odds of 1 correspond to a 50/50 probability; in a logistic regression model, an odds ratio of 1 therefore means no effect (i.e., the \(x\) variable is not associated with a higher or lower probability of the outcome occurring).

Positive log odds become odds ratios with a value greater than 1, and negative log odds become odds ratios with a value below 1.
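As a quick reminder, odds convert to probability via \(p = \text{odds}/(1+\text{odds})\); a minimal check:

# odds of 1 correspond to a probability of 0.5
odds = 1
print(odds / (1 + odds))   # 0.5, i.e. a 50/50 chance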

Here, we can interpret the model output as follows:

  • The odds ratio of 1.47 for political views suggests that each additional point along the scale from conservative to liberal is associated with 1.47 times greater odds of agreeing that humans evolved from earlier species.

  • For age, the odds ratio tells us that for each additional year the odds of answering ‘true’ are about 0.02 (or 2%) lower (using 1 - 0.98 for the effect size). I will not interpret ‘colsci’ here, as it is not statistically significant. You can recover these odds ratios directly from the table coefficients, as shown in the quick check below.
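As a sanity check, we can exponentiate the coefficients from the summary table ourselves; because the printed coefficients are rounded, the results are approximate:

# exponentiate the rounded log odds coefficients from the summary table above
print(np.exp(0.3837))   # polviews: about 1.47 times greater odds per scale point
print(np.exp(-0.0178))  # age: about 0.98, i.e. roughly 2% lower odds per year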

Making a prediction#

Python will compute a predicted probability for us.

Let’s find out the predicted probability of believing in evolution for a 50-year-old with a score of 5 on the political views scale (towards the liberal end), who did not study science at college.

The syntax for this is a bit faffy, as we need to use a dictionary (dictionaries in Python are actually really useful, but we haven’t used them on this course, so don’t worry too much about the details).

vals = dict(age=50, polviews=5, colsci='no')

# Code for calculating predicted probability
logistic_model.predict(vals)
0    0.62534
dtype: float64

The predicted probability that this person believes in evolution is about 0.625, or 62.5%.

  • Try plugging in some different values - what about yourself or your family members?
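To see what logistic_model.predict() is doing under the hood, here is a minimal hand-check using the (rounded) coefficients from the summary table above; the result is approximate because of the rounding:

# linear predictor (log odds) for age=50, polviews=5, colsci='no'
# coefficients are the rounded values from the summary table
log_odds = -0.5169 + (-0.0178 * 50) + (0.3837 * 5) + (0.2923 * 0)

# the inverse logit transform converts log odds into a probability
p = 1 / (1 + np.exp(-log_odds))
print(p)   # approximately 0.625, matching logistic_model.predict(vals)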

5.6.4. Assessing the model#

Just as we did in linear regression, we can obtain predicted values from the model for the people in the original dataset - that is, for each person in the original dataframe, we get the model’s predicted probability that they believe in evolution (we also know the ground truth: whether they replied that they do believe in evolution, or not).

The predicted values can help us to understand how well our model did. They take a value between 0 and 1 and can be treated as a predicted probability of each individual answering ‘true’ to the survey question, given the \(x\) variables that we have modelled.

Here, instead of a dictionary, we give the function logistic_model.predict() a reduced version of the pandas dataframe, containing the columns ['age','polviews','colsci']:

# Get predicted values for each row of the dataframe
logistic_model.predict(evolution[['age','polviews','colsci']])
0      0.374196
1      0.544744
2      0.502294
3      0.412362
4      0.388606
         ...   
498    0.752007
499    0.743402
500    0.799603
501    0.331893
502    0.323507
Length: 503, dtype: float64

We can compare how well the model predictions match the observed data by classifying each case as predicted true or false (using \(Pr(y=1)>0.5\) as the cut-off), and comparing these classifications to whether the observed value was true or false.

# Get predicted values for each row of the dataframe
yhat = logistic_model.predict(evolution[['age', 'polviews', 'colsci']])

# add a column to the dataframe with the probability, and another with the binary prediction
evolution['yhat'] = yhat
evolution['prediction'] = (yhat > 0.5)


# work out the proportion correctly classified
# a participant (row) is correctly classified if columns 'evolved' and 'prediction' match
evolution['correct'] = 0
evolution.loc[(evolution.evolved == evolution.prediction), 'correct'] = 1

sum(evolution.correct) / len(evolution.correct)
0.6500994035785288
evolution.count()
age           503
colsci        503
evolved       503
polviews      503
yhat          503
prediction    503
correct       503
dtype: int64

Overall, 65% of the cases were correctly classified by this model.
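To see where the misclassifications fall, we can cross-tabulate the observed outcomes against the model’s binary predictions (a confusion matrix); here is a minimal sketch using pandas’ crosstab function and the columns created above:

# cross-tabulate observed outcomes against the model's binary predictions
pd.crosstab(evolution['evolved'], evolution['prediction'])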

While linear regression uses the method of Ordinary Least Squares (OLS) to fit the model, logistic regression uses Maximum Likelihood Estimation (MLE). This method chooses the values of the model parameters that are most consistent with the observed data, so that with the intercept and slope values estimated by the model, the observed data have a greater chance of occurring than with any other candidate values (see Agresti, Chapter 5).

The log-likelihood is computed by summing the logs of the model’s predicted probabilities of the actual outcomes, and tells us how much unexplained information there is after the model has been fitted (thus analogous to the residual sum of squares in linear regression). The model log-likelihood can be found in the model summary table.
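As an illustrative check (a sketch that assumes the yhat column we added to the dataframe above), we can compute this sum ourselves and compare it with the value stored by statsmodels:

# log-likelihood by hand: sum the log predicted probabilities of the actual outcomes
p_actual = np.where(evolution['evolved'] == 1, evolution['yhat'], 1 - evolution['yhat'])
print(np.sum(np.log(p_actual)))   # about -316.87, as in the summary table
print(logistic_model.llf)         # statsmodels' stored log-likelihood, for comparison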

As with \(R^2\) in linear regression, we compare against a ‘baseline’ model with no \(x\) variables (an intercept-only model that predicts the same probability for everyone), and ask how much the log-likelihood improves when the \(x\) variables are added. The likelihood-ratio test (reported in Python as the LLR \(p\)-value) produces a \(p\)-value so that we can tell whether our model is statistically significant. Low \(p\)-values (below 0.05) tell us that the model with the \(x\)s is significantly better at predicting the outcome than the baseline model.
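To illustrate where that \(p\)-value comes from, here is a sketch of the likelihood-ratio test computed by hand from attributes of the fitted statsmodels results object:

# twice the improvement in log-likelihood over the null (intercept-only) model,
# compared against a chi-squared distribution with df = number of x variables
lr_stat = 2 * (logistic_model.llf - logistic_model.llnull)
p_value = stats.chi2.sf(lr_stat, df=logistic_model.df_model)
print(lr_stat, p_value)   # should match the LLR p-value in the summary table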