3.4. Conditional Distributions
Conditional distributions refer to the spread of values around the regression line.
In this example, where \(y\) = income, and \(x\) = years of education, the expected value of income = -5000 + 3000*years of education.
For people with 12 years of education (\(x\) = 12), the expected mean income is 31,000.
However, the predicted value is best thought of as a predicted mean, with a normal distribution of values around that mean. The word ‘conditional’ here refers to the distribution of \(y\), conditional on a given value of \(x\).
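As a small illustration, the predicted mean for a given \(x\) can be computed directly from the regression equation in the example. The function name `predicted_mean_income` is made up for this sketch; the coefficients are the ones quoted above.

```python
# Coefficients taken from the income example in the text
INTERCEPT = -5000
SLOPE = 3000


def predicted_mean_income(years_of_education):
    """Return the predicted mean income for a given number of years of education."""
    return INTERCEPT + SLOPE * years_of_education


# Predicted mean income for people with 12 years of education
print(predicted_mean_income(12))  # 31000
```

This value is the centre of the conditional distribution of income at \(x\) = 12, not a prediction that every such person earns exactly that amount.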
We can use a statistic called the “root mean square error” (RMSE) to estimate the spread around the regression line.
The RMSE provides the estimated standard deviation of the conditional distribution of \(y\) at each value of \(x\). It is also known as the standard deviation of the residuals.
The equation for the RMSE is:
\[RMSE = s_y\sqrt{1-R^2}\]
Again, you’ll notice that it is made up of familiar components, namely the standard deviation of \(y\) and \(R^2\).
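To see why the RMSE is also called the standard deviation of the residuals, the sketch below fits a least-squares line to a small set of invented numbers and compares two quantities: the standard deviation of the residuals, and \(s_y\sqrt{1-R^2}\). It assumes the RMSE takes that common form, which uses the two components named above. The data are hypothetical and exist only for illustration.

```python
import math
import statistics

# Hypothetical toy data, invented purely for illustration
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 2.9, 3.2, 4.8, 5.1, 5.5, 7.2, 7.9]

mean_x = statistics.mean(x)
mean_y = statistics.mean(y)

# Sums of squares and cross-products around the means
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)

# Least-squares slope and intercept
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residuals: observed y minus the value predicted by the line
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# R squared and the sample standard deviation of y (n - 1 denominator)
r_squared = sxy ** 2 / (sxx * syy)
s_y = statistics.stdev(y)

rmse_formula = s_y * math.sqrt(1 - r_squared)
rmse_residuals = statistics.stdev(residuals)  # also uses n - 1

print(rmse_formula, rmse_residuals)
```

The two printed values agree up to floating-point error. Note that some software divides the sum of squared residuals by \(n-2\) rather than \(n-1\), which gives a slightly larger figure in small samples.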
Coming back to the immigration data from last week, where \(s_y\) = 2.533, and \(r\) = -0.1572, plug these values into the equation and find the RMSE.
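One way to check the arithmetic is sketched below. It assumes the RMSE formula takes the form \(s_y\sqrt{1-R^2}\), with \(R^2\) obtained by squaring \(r\); the variable names are chosen for this sketch only.

```python
import math

# Values quoted in the text for the immigration data
s_y = 2.533   # standard deviation of y
r = -0.1572   # correlation between x and y

# R squared is the square of the correlation coefficient
r_squared = r ** 2

# Assumed form of the RMSE: s_y * sqrt(1 - R^2)
rmse = s_y * math.sqrt(1 - r_squared)

print(f"R squared = {r_squared:.4f}")  # R squared = 0.0247
print(f"RMSE = {rmse:.2f}")            # RMSE = 2.50
```

Because \(R^2\) is so small here, the RMSE is only slightly smaller than \(s_y\) itself: knowing \(x\) barely narrows the spread of \(y\).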
How do we interpret the RMSE?
We expect 95% of the values of \(y\) at a given \(x\) to fall within 1.96 standard deviations (that is, within 1.96 × RMSE) of the predicted mean.
Note that if the sample size were smaller, we would not use 1.96, which is the critical value of \(t\) for \(\alpha=0.05\) with large sample sizes (over about 50). Instead, we would use the critical value of \(t\) for \(\alpha=0.05\) and our actual sample size. There is a section on \(t\) confidence intervals in the final lecture from Michaelmas Term.
Let’s look at an example when \(x\) = 65. The regression line gives us a mean expected answer to the immigration question of 5.380 (based on the equation \(\hat{y}=6.790-0.0217\times age\)).
\(5.380 \pm (1.96 \times 2.50) = [0.480, 10.280]\)
When \(x\) = 65, we expect 95% of values to fall between 0.480 and 10.280, which is a wide spread of values around the regression line.
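The same calculation can be sketched in a few lines of code. The coefficients, the RMSE of 2.50 and the multiplier 1.96 are taken from the text; the variable names are illustrative. The code keeps the unrounded predicted mean (about 5.3795), whereas the text rounds it to 5.380 first, so the results are printed to two decimal places.

```python
# Regression equation from the text: y_hat = 6.790 - 0.0217 * age
intercept = 6.790
slope = -0.0217

rmse = 2.50   # estimated standard deviation of y around the line
z = 1.96      # large-sample critical value for a 95% range

age = 65
predicted_mean = intercept + slope * age

# Range expected to contain about 95% of y values at this age
margin = z * rmse
lower = predicted_mean - margin
upper = predicted_mean + margin

print(f"predicted mean = {predicted_mean:.2f}")   # predicted mean = 5.38
print(f"95% range = [{lower:.2f}, {upper:.2f}]")  # 95% range = [0.48, 10.28]
```

Changing `age` shifts the centre of the range but not its width, since the same RMSE is used at every value of \(x\).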