15.4. Conditional Distributions

A conditional distribution describes the spread of $y$ values around the regression line at a given value of $x$.

In this example, $y$ is income and $x$ is years of education, and the regression line is: expected income $= -5000 + 3000 \times$ years of education.

For people with 12 years of education ($x$ = 12), the expected (mean) income is $-5000 + 3000 \times 12 = 31{,}000$.

However, not everyone with 12 years of education earns exactly this amount. The predicted value is best thought of as a predicted mean, and we assume that individual values are normally distributed around that mean. The word ‘conditional’ here refers to the distribution of $y$, conditional on a given value of $x$.
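As a rough illustration of this idea, the sketch below draws simulated incomes at two values of $x$ using the regression line from the example. The spread of 8,000, the sample size, and the names `expected_income` and `ASSUMED_SD` are assumptions made up for illustration; the text does not give a value for the spread.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed so the sketch is reproducible


def expected_income(years_of_education):
    """Predicted mean income from the example regression line."""
    return -5000 + 3000 * years_of_education


ASSUMED_SD = 8000  # hypothetical spread around the line (not from the text)

samples = {}
for years in (12, 16):
    # Simulated incomes for people who all share the same value of x
    samples[years] = rng.normal(expected_income(years), ASSUMED_SD, size=10_000)
    print(
        f"x = {years}: predicted mean = {expected_income(years)}, "
        f"simulated mean = {samples[years].mean():.0f}, "
        f"simulated SD = {samples[years].std(ddof=1):.0f}"
    )
```

Each value of $x$ gets its own distribution of $y$: the centre moves along the regression line, while the spread is assumed to stay the same.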

https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook/main/images/regression3_conditional.png

We can use a statistic called the “root mean square error” (or RMSE) to estimate the spread around the regression line.

The RMSE provides the estimated standard deviation of the conditional distribution of $y$ at each value of $x$. It is also known as the standard deviation of the residuals.

The equation for the RMSE is:

$$ \text{RMSE} = \sqrt{\text{MSE}} = s_y\sqrt{1-r^2} $$

Again, you’ll notice that it is made up of familiar components, namely the standard deviation of $y$ and $r^2$.
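To see that the formula and the “standard deviation of the residuals” description agree, here is a minimal sketch on simulated data. The data are hypothetical: they are generated from the income example's regression line with an assumed noise SD of 8,000, and the sample size and variable names are likewise made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical data: years of education (8 to 20) and income generated from
# the example line plus normal noise with an assumed SD of 8000
n = 500
x = rng.integers(8, 21, size=n).astype(float)
y = -5000 + 3000 * x + rng.normal(0, 8000, size=n)

# Fit the least-squares line; np.polyfit returns the slope first
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

# Route 1: standard deviation of the residuals
sd_residuals = residuals.std(ddof=1)

# Route 2: the formula s_y * sqrt(1 - r^2)
s_y = y.std(ddof=1)
r = np.corrcoef(x, y)[0, 1]
rmse_formula = s_y * np.sqrt(1 - r**2)

print(f"SD of residuals:     {sd_residuals:.1f}")
print(f"s_y * sqrt(1 - r^2): {rmse_formula:.1f}")
```

The two routes give the same number here because both standard deviations use the same $n - 1$ denominator. Software that divides the squared residuals by $n - 2$ instead will report a slightly larger value.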

  • Coming back to the immigration data from last week, where $s_y$ = 2.533, and $r$ = -0.1572, plug these values into the equation and find the RMSE.

  • How do we interpret the RMSE?