1.2. Regression Concepts

Let’s begin with a ‘toy’ dataset, by which I mean a dataset with a small number of data points so that we can easily see what is going on.

There are only 10 data points here, but the data are real! These are country average levels of life satisfaction and GDP per capita for a selection of countries in Europe in 2020.

I downloaded these data from the Our World in Data website for all of Europe, then randomly selected 10 countries. The relationship between wealth and happiness is the subject of many real-world studies.

Here, we’ll focus on the concepts. We’ll get onto the Python code for regression later.

https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/images/regression1_happinessTable.png

First, we can examine the relationship visually in a scatter plot:

https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/images/regression1_happinessScatter.png
  • What can you say about the relationship from eyeballing the scatter plot?

The correlation between life satisfaction and GDP per capita is 0.65, meaning, as you will recall from topic 4, that there is a positive relationship of moderate strength. As GDP per capita increases, average life satisfaction tends to increase too.

  • The regression equation takes the form \(\hat{y}=a+bx\) where \(\hat{y}\) (known as “y-hat”) is the predicted value of \(y\), \(a\) is the intercept, and \(b\) is the slope. What are the \(y\) and \(x\) in the life satisfaction example?

  • In regression analysis, you need to be careful about saying which variable is \(y\) and which is \(x\). Why is this?

Other terms to remember are dependent variable (\(y\)) and independent variable (\(x\)). We can have multiple independent variables in a regression model, a point we will examine in detail next week.

  • So, let’s get to our regression equation for the life satisfaction data. We’ll get to the calculation later, so for now I am just providing the regression equation for you. The regression equation here is Life Satisfaction = 5.85 + 0.018(GDPpc). Look back at the scatter plot: do the coefficients make sense?

https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/images/regression1_happinessRegplot.png
  • One use of regression is to “predict” the outcome variable for a hypothetical country. Imagine we want to know: what would be the predicted level of life satisfaction in a country with a GDP per capita of 150 thousand dollars (a very rich country!)? Plug in ‘150’ in place of \(x\) in the equation and find \(\hat{y}\) (just use a calculator, or Excel, or pen and paper at this point).
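
If you would rather check your arithmetic in Python, the same plug-in calculation can be sketched as below. This is only a minimal sketch: the coefficients are the ones quoted above, but the function name `predict_life_satisfaction` and the input value of 40 (thousand dollars) are my own illustrative choices, not values from the data table. Swap in 150 to answer the question yourself.

```python
# Coefficients quoted in the text: Life Satisfaction = 5.85 + 0.018(GDPpc)
INTERCEPT = 5.85  # a
SLOPE = 0.018     # b


def predict_life_satisfaction(gdp_pc):
    """Return y-hat for a GDP per capita given in thousands of dollars.

    The function name is hypothetical, chosen for this illustration.
    """
    return INTERCEPT + SLOPE * gdp_pc


# 40 is an arbitrary illustrative value, not one of the 10 countries
y_hat = predict_life_satisfaction(40)
print(round(y_hat, 2))
```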

In the same way that we can plug in a hypothetical value (like 150 thousand dollars), we could also plug in the actual (or “observed”) values of \(x\) for our 10 countries. If we calculate \(\hat{y}\) for each country, we can see the difference between the predicted level of life satisfaction and the observed value. These have been added to the data table.

https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook/main/images/regression1_happinessTable_res.png

  • What is the name for \(y\) minus \(\hat{y}\)?

Looking at the residuals (\(y - \hat{y}\)), which are the largest? (Take “largest” here to mean largest in absolute value, i.e., furthest from zero, whether positive or negative.) How might you interpret those residuals?
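
For those who want to see the residual calculation spelled out, here is a rough sketch in Python. The coefficients come from the regression equation above, but the three data points are made up purely for illustration and are NOT the countries in the table; the function name `residuals` is likewise my own choice.

```python
# Coefficients quoted in the text: Life Satisfaction = 5.85 + 0.018(GDPpc)
INTERCEPT = 5.85
SLOPE = 0.018


def residuals(gdp_pc, life_sat):
    """Return y - y_hat for each (x, y) pair."""
    return [y - (INTERCEPT + SLOPE * x) for x, y in zip(gdp_pc, life_sat)]


# Made-up values for illustration only; NOT the 10 countries in the table
example_gdp = [20, 50, 90]             # GDP per capita, thousands of dollars
example_life_sat = [6.0, 7.0, 7.2]     # observed life satisfaction

res = residuals(example_gdp, example_life_sat)

# Position of the residual furthest from zero (largest absolute value)
largest = max(range(len(res)), key=lambda i: abs(res[i]))

for x, y, r in zip(example_gdp, example_life_sat, res):
    print(f"GDPpc={x}, observed={y}, residual={r:+.2f}")
print("Largest absolute residual is at position", largest)
```

A positive residual means the observed life satisfaction is higher than the regression line predicts for that GDP per capita; a negative residual means it is lower.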