3.4. Conditional Distributions

A conditional distribution describes the spread of values around the regression line at a given value of \(x\).

In this example, where \(y\) = income, and \(x\) = years of education, the expected value of income = -5000 + 3000*years of education.

For people with 12 years of education (\(x\) = 12), the expected income is -5000 + 3000*12 = 31,000.
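As a quick check of the arithmetic, here is a minimal Python sketch. The coefficients come from the example above; the function name `expected_income` is just an illustrative choice.

```python
# Coefficients taken from the example in the text
INTERCEPT = -5000   # expected income at 0 years of education
SLOPE = 3000        # expected change in income per extra year of education

def expected_income(years_of_education):
    """Return the predicted mean income for a given number of years of education."""
    return INTERCEPT + SLOPE * years_of_education

print(expected_income(12))  # 31000
```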

However, not everyone with 12 years of education earns exactly this amount. We can think of the predicted value as a predicted mean, with a normal distribution of values around that mean. The word ‘conditional’ here refers to the distribution of \(y\), conditional on a given value of \(x\).

https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook/main/images/regression3_conditional.png
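To make the idea concrete, the sketch below simulates incomes for a group of people who all have \(x\) = 12. The predicted mean comes from the regression equation in the text, but the standard deviation (`assumed_sd`) and the sample size are hypothetical values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

predicted_mean = -5000 + 3000 * 12   # 31000, from the regression equation in the text
assumed_sd = 8000                    # hypothetical spread, NOT a value from the text

# Simulate incomes for 10,000 hypothetical people who all have 12 years of education
incomes_at_12 = rng.normal(loc=predicted_mean, scale=assumed_sd, size=10_000)

# The simulated values scatter around the predicted mean
print(incomes_at_12.mean())        # close to the predicted mean
print(incomes_at_12.std(ddof=1))   # close to the assumed standard deviation
```

A histogram of `incomes_at_12` would show one conditional distribution: the distribution of \(y\) at a single value of \(x\).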

We can use a statistic called the “root mean square error” (or RMSE) to estimate the spread around the regression line.

The RMSE provides the estimated standard deviation of the conditional distribution of \(y\) at each value of \(x\). It is also known as the standard deviation of the residuals.

The equation for the RMSE is:

\[ \text{RMSE} = \sqrt{MSE} = s_y\sqrt{1-r^2} \]

Again, you’ll notice that it is made up of familiar components, namely the standard deviation of \(y\) and \(r^2\) (the squared correlation, which is equal to \(R^2\) in simple regression).
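The formula can be sketched in Python and checked against the standard deviation of the residuals. The data here are simulated and entirely hypothetical, and the function name `rmse_from_summary` is an illustrative choice, not part of any library.

```python
import numpy as np

def rmse_from_summary(s_y, r):
    """Estimate the RMSE as s_y * sqrt(1 - r^2)."""
    return s_y * np.sqrt(1 - r**2)

# Simulated (hypothetical) data: income depends linearly on education, plus noise
rng = np.random.default_rng(seed=1)
x = rng.uniform(8, 20, size=500)
y = -5000 + 3000 * x + rng.normal(0, 8000, size=500)

# Fit a straight line; np.polyfit returns the highest-degree coefficient first
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

s_y = y.std(ddof=1)           # sample standard deviation of y
r = np.corrcoef(x, y)[0, 1]   # Pearson correlation between x and y

rmse_formula = rmse_from_summary(s_y, r)
sd_residuals = residuals.std(ddof=1)

# The two values agree (up to floating-point rounding)
print(rmse_formula, sd_residuals)
```

Note that regression software often divides by \(n - 2\) rather than \(n - 1\) when reporting the residual standard error, so a reported value may differ slightly from this formula in small samples.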

  • Coming back to the immigration data from last week, where \(s_y\) = 2.533, and \(r\) = -0.1572, plug these values into the equation and find the RMSE.

  • How do we interpret the RMSE?