3.9. Tutorial Exercises

The tutorial exercises this week will form the basis of your next hand-in assignment. You will be using real data that were collected in 2019 about people’s perception of their social standing and their happiness. These data were collected online by the well-respected polling company YouGov. The data are intended to be representative of the UK population.

Note: as well as completing these exercises, it’s a good idea to review the instructions for the assignment ahead of the tutorial, so that you can check your understanding with your tutor.

The variables are as follows:

  • happy (a continuous measure ranging from 0-10, where higher scores indicate greater happiness)

  • ladder (a continuous measure ranging from 1-11 on which participants rate their own standing in society; the lowest rung of the ladder was labelled “bottom of society” and the top rung “top of society”)

  • age (a continuous measure in years)

  • marital (a categorical measure of marital status with three categories)

  • work (a categorical measure of working status with four categories)

  • educ (a categorical measure of educational qualifications summarised into three categories)

  • sex (male, female)

  • leftout (a categorical variable in which people state whether they agree or disagree that they feel left out of society)

  • income (a categorical variable with four categories)

  • region (a categorical variable with twelve categories)

3.9.1. Set up Python libraries

As usual, run the code cell below to import the relevant Python libraries.

# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import seaborn as sns
sns.set_theme(style='white')
import statsmodels.api as sm
import statsmodels.formula.api as smf

3.9.2. Import and view the data

# Your code here to import the data from the file Happy.csv
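As a rough sketch of the import-and-inspect step, the cell below reads a small in-memory stand-in rather than the real file, so that it runs on its own. The rows and the name `happy_toy` are made up for illustration and are not the YouGov data; with the real file you would pass `'Happy.csv'` to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Stand-in for Happy.csv: a few made-up rows, NOT the real YouGov data.
# With the real file you would call pd.read_csv('Happy.csv') instead.
toy_csv = io.StringIO(
    "happy,ladder,age,sex\n"
    "7,6,34,female\n"
    "5,4,51,male\n"
    "99,7,29,female\n"
)
happy_toy = pd.read_csv(toy_csv)

print(happy_toy.head())   # first few rows
print(happy_toy.shape)    # (number of rows, number of columns)
happy_toy.info()          # column types and non-null counts
```

Looking at `head()`, `shape` and `info()` first is one way to check that every variable listed above has arrived with a sensible type before you start cleaning.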

3.9.3. Data cleaning

First, get to know your data and do any necessary data cleaning.

YouGov uses codes of 99 to indicate missing data. Change these to NaN.

# Your code here to recode missing values (99) as NaN!
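One possible way to recode the 99s is sketched below on a made-up data frame (the values and the name `toy` are hypothetical). It assumes that 99 is only a missing-data code in `happy` and `ladder`; whether a 99 in `age` is a code or a genuine age is something to check in the real data before recoding that column.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; 99 stands for "missing" as described above.
toy = pd.DataFrame({
    "happy":  [7, 99, 5, 8],
    "ladder": [6, 4, 99, 7],
    "age":    [34, 51, 29, 99],   # a 99 here could be a genuine age
})

# Recode only the columns where 99 cannot be a valid answer
# (assumption: happy and ladder; inspect value_counts() on the real data).
coded_cols = ["happy", "ladder"]
toy[coded_cols] = toy[coded_cols].replace(99, np.nan)

print(toy.isna().sum())   # number of missing values per column
```

Replacing 99 across the whole data frame in one go is shorter, but it would also wipe out any legitimate value of 99, which is why the sketch names the columns explicitly.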

3.9.4. Control Variables

  • The outcome variable here is happy

  • The main explanatory variable is ladder

  • There is a set of eight possible control variables.

Which do you think might be important controls here?

There is no right or wrong answer here, but think about your reasons for selecting your control variables (don’t just throw all of them in!).

Specify two regression models. Model 1 includes just the main explanatory variable; Model 2 keeps the main explanatory variable and adds the control variables of your choice. Calculate the RMSE for both.

# Your code here!
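The two-model set-up could look something like the sketch below. It runs on synthetic data generated inside the cell (not the YouGov data), the controls `age` and `sex` are an arbitrary choice for illustration, and `rmse` is a hypothetical helper that takes the RMSE to be the square root of the mean squared residual. Your course notes may define RMSE slightly differently (for example, dividing by the residual degrees of freedom).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT the YouGov data), so the sketch runs on its own.
rng = np.random.default_rng(0)
n = 200
sim = pd.DataFrame({
    "ladder": rng.integers(1, 12, size=n),           # whole numbers 1 to 11
    "age": rng.integers(18, 90, size=n),
    "sex": rng.choice(["male", "female"], size=n),
})
sim["happy"] = 3 + 0.4 * sim["ladder"] + 0.01 * sim["age"] + rng.normal(0, 1.5, size=n)

# Model 1: main explanatory variable only
m1 = smf.ols("happy ~ ladder", data=sim).fit()
# Model 2: keeps ladder and adds controls (an arbitrary illustrative choice)
m2 = smf.ols("happy ~ ladder + age + C(sex)", data=sim).fit()

def rmse(results):
    """Hypothetical helper: square root of the mean squared residual."""
    return np.sqrt(np.mean(results.resid ** 2))

print(f"RMSE model 1: {rmse(m1):.3f}")
print(f"RMSE model 2: {rmse(m2):.3f}")
```

When both models are fitted to exactly the same rows, the in-sample RMSE of Model 2 cannot be larger than that of Model 1. With real data, the formula interface drops rows that have missing values in any variable used, so the two models may end up fitted to different numbers of rows; comparing `m1.nobs` and `m2.nobs` is a quick check.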

3.9.5. Compare your two models:

  • Which is better fitting in terms of the \(R^2\)?

  • And which has a smaller spread of values around the regression line (that is, a smaller RMSE)?

# Your answer here!
# You may need to add extra cells
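One way to put the fit statistics side by side is sketched here, again on synthetic data rather than the YouGov data. The names `model1`, `model2` and `comparison` are illustrative; `rsquared` and `rsquared_adj` are attributes of a fitted statsmodels OLS results object.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT the YouGov data)
rng = np.random.default_rng(1)
n = 300
sim = pd.DataFrame({
    "ladder": rng.integers(1, 12, size=n),
    "age": rng.integers(18, 90, size=n),
})
sim["happy"] = 3 + 0.4 * sim["ladder"] + 0.02 * sim["age"] + rng.normal(0, 1.5, size=n)

model1 = smf.ols("happy ~ ladder", data=sim).fit()
model2 = smf.ols("happy ~ ladder + age", data=sim).fit()

# Collect the fit statistics for both models in one small table
comparison = pd.DataFrame(
    {
        "R2": [model1.rsquared, model2.rsquared],
        "adj_R2": [model1.rsquared_adj, model2.rsquared_adj],
        "RMSE": [
            np.sqrt(np.mean(model1.resid ** 2)),
            np.sqrt(np.mean(model2.resid ** 2)),
        ],
    },
    index=["Model 1", "Model 2"],
)
print(comparison.round(3))
```

Plain \(R^2\) never goes down when predictors are added to a model fitted on the same rows, so the adjusted \(R^2\), which penalises extra terms, is worth reading alongside it.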

3.9.6. Interpret your regression models.

Make some notes:

  • Which coefficients are significant?

  • What are the confidence intervals around the slope for ‘ladder’?

  • Does the coefficient for ‘ladder’ change much between model 1 and model 2?

  • What can we conclude about the relationship between perceptions of social standing and happiness?

  • Looking at the association between the control variables and happiness, are these as you might have expected, or are there any surprises here?

# Your answer here!
# You may need to add extra cells
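The quantities asked about in the notes above can be read off a fitted results object, as in this sketch on synthetic data (not the YouGov data). The 0.05 threshold and the variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT the YouGov data)
rng = np.random.default_rng(2)
n = 300
sim = pd.DataFrame({
    "ladder": rng.integers(1, 12, size=n),
    "age": rng.integers(18, 90, size=n),
})
sim["happy"] = 3 + 0.4 * sim["ladder"] + 0.02 * sim["age"] + rng.normal(0, 1.5, size=n)

reg_results = smf.ols("happy ~ ladder + age", data=sim).fit()
print(reg_results.summary())   # full table: coefficients, p-values, intervals

# 95% confidence interval for each coefficient (columns 0 and 1 = lower, upper)
ci = reg_results.conf_int(alpha=0.05)
ladder_low, ladder_high = ci.loc["ladder"]
print(f"95% CI for ladder slope: [{ladder_low:.3f}, {ladder_high:.3f}]")

# Terms whose p-value falls below the (assumed) 0.05 threshold
significant = reg_results.pvalues[reg_results.pvalues < 0.05].index.tolist()
print("Significant at 0.05:", significant)
```

To see whether the `ladder` coefficient changes between your two models, you can compare `params["ladder"]` from each fitted model, ideally together with the two confidence intervals.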

3.9.7. Check the regression assumptions.

First, check whether the residuals are normally distributed by plotting a histogram of the residuals (refer back to the preparatory notebook for an example). Do you think this assumption has been met?

# Your code here
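A residual histogram might be drawn along the lines below. The sketch uses synthetic data with normally distributed noise (not the YouGov data) and plain matplotlib; the preparatory notebook may use a different plotting call, and the choice of 20 bins is arbitrary.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT the YouGov data)
rng = np.random.default_rng(3)
n = 300
sim = pd.DataFrame({"ladder": rng.integers(1, 12, size=n)})
sim["happy"] = 3 + 0.4 * sim["ladder"] + rng.normal(0, 1.5, size=n)

reg_results = smf.ols("happy ~ ladder", data=sim).fit()

# Histogram of the residuals; a roughly symmetric bell shape centred on zero
# is what we would hope to see if the normality assumption is reasonable.
fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(reg_results.resid, bins=20, edgecolor="white")
ax.set_xlabel("Residual")
ax.set_ylabel("Count")
ax.set_title("Distribution of residuals")
```

In a notebook the figure is displayed at the end of the cell. Strong skew or heavy tails in the histogram would be a reason to doubt the assumption.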

Let’s also try checking the assumption of constant variance: can you plot a scatter plot of residuals (\(y\)-axis) against \(\hat{y}\) (\(x\)-axis)?

Assuming your fitted model is stored in reg_results, the residuals are found in reg_results.resid and \(\hat{y}\) is obtained using reg_results.predict().

Do you think the variance in the residuals looks roughly constant for all values of \(\hat{y}\)?

# Your code here

The variance of the residuals looks roughly similar at all values of \(\hat{y}\). It looks like the constant variance assumption has been met.
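For reference, the residuals-versus-\(\hat{y}\) plot described above could be sketched as follows. The data are synthetic with constant noise variance (not the YouGov data), so this shows only the mechanics of the plot, not the result you should expect from the real data.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT the YouGov data)
rng = np.random.default_rng(4)
n = 300
sim = pd.DataFrame({
    "ladder": rng.integers(1, 12, size=n),
    "age": rng.integers(18, 90, size=n),
})
sim["happy"] = 3 + 0.4 * sim["ladder"] + 0.02 * sim["age"] + rng.normal(0, 1.5, size=n)

reg_results = smf.ols("happy ~ ladder + age", data=sim).fit()

y_hat = reg_results.predict()    # fitted values for the rows used in the fit
residuals = reg_results.resid

fig, ax = plt.subplots()
ax.scatter(y_hat, residuals, alpha=0.4)
ax.axhline(0, color="grey", linestyle="--")   # reference line at zero
ax.set_xlabel("Predicted happiness (y-hat)")
ax.set_ylabel("Residual")
```

A band of points with roughly the same vertical spread from left to right is consistent with constant variance; a funnel shape that widens or narrows would suggest the assumption is violated.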