Conditional Distributions

3.4. Conditional Distributions#

Conditional distributions refer to spread of value around the regression line.

In this example, where $y$ = income, and $x$ = years of education, the expected value of income = -5000 + 3000*years of education.

For people with 12 years of education ( $𝑥$ = 12), the expected mean income is 31,000.

However, we can think of the predicted value as a predicted mean, and that there will be a normal distribution of values around that mean. The word ‘conditional’ here refers to the distribution of $y$ , conditional on a given value of $x$ .

https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook/main/images/regression3_conditional.png

We can use a statistic called the “Root mean square error” (Or RMSE) to estimate spread around the regression line.

The RMSE provides the estimated standard deviation of conditional distribution of $y$ at each value of $x$ . It is also known as the standard deviation of the residuals.

The equation for the RMSE is:

\sqrt{M S E} = s_{y} \sqrt{1 - r^{2}}

Again, you’ll notice that it is comprised of familiar components, namely, the standard deviation of $y$ , and $R^{2}$ .

Coming back to the immigration data from last week, where $s_{y}$ = 2.533, and $r$ = -0.1572, plug these values into the equation and find the RMSE.

Click to reveal answer

R M S E = 2.533 \sqrt{0.975288)} = 2.50

How do we interpret the RMSE?

Click to reveal answer

We expect 95% of the values of $y$ at a given $x$ to fall within 1.96 standard deviations of the mean.

note that if the sample size were smaller, we would not use 1.96, which is the critical value of $t$ for $α = 0.05$ and for large sample sizes (over about 50). Instead we would use the critical values of $t$ is for $α$ =0.05 and out actual sample size - there is a section on $t$ confidence intervals in the final lecture from Michaelmas Term.

Let’s look at an example when $x$ = 65. The regression line gives us a mean expected answer to the immigration question of 5.380 (based on the equation $\hat{y} = 6.790 - 0.0217 \times a g e$ ).

5.380 $\pm$ (1.96*2.50) = [0.480, 10.280]

When $x$ = 65, we expect 95% of values to fall between 0.480 and 10.280, which is a wide spread of values around the regression line.