15.5. Statistical Significance in Regression

  • What is statistical inference?

To be able to draw conclusions about the population from the results of a regression model, we need to check two things. First, whether the slope is statistically significantly different to zero, and second, whether the regression assumptions have been met. We’ll start by thinking about testing the slope for significance.

When we test a regression slope for significance, we are running a hypothesis test. We can set up the hypotheses in the following way, where $\beta$ is the slope for $x$ in the population.

The null hypothesis can be written as $\mathcal{H_0}: \beta = 0$, and the alternative hypothesis as $\mathcal{H_a}: \beta \neq 0$.

Note: In regression we use the two-tailed test, because we are interested in whether there is an association at all, not in predicting the direction of that association.

The test for significance of a slope in regression can also be called a test of independence. We consider $x$ and $y$ to be independent when the population mean of $y$ is identical at each $x$-value. In other words, the distribution of $y$ is the same at each $x$-value. For the linear regression function $y = \alpha + \beta x$, this happens when the slope $\beta = 0$.

The null hypothesis for statistical independence is thus: $\mathcal{H_0}: \beta = 0$.

We test the slope for significance with a $t$-test.
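For reference, the usual form of this test statistic is given below, writing $b$ for the sample slope, $SE(b)$ for its standard error, and $n$ for the sample size. This is the standard textbook form for a regression with a single predictor, stated here as a reminder rather than as something specific to this course:

$$t = \frac{b - 0}{SE(b)} = \frac{b}{SE(b)}, \qquad df = n - 2$$

The further $t$ is from zero, the stronger the evidence against $\mathcal{H_0}: \beta = 0$.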

Testing a slope for significance: Example

https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook/main/images/regression3_hat.png

The General Social Survey sampled 2,428 respondents, asking about $y$ = number of years of education and $x$ = number of years of mother’s education. The prediction equation is $\hat{y} = 10.5 + 0.294x$. The standard error (se) of the slope is 0.0149.

  • How do you go about testing the null hypothesis that these variables are independent? (Without Python).

  • How would you find a 95% confidence interval for the population slope?
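Once you have worked through the two questions above by hand, a short script can be used to check your arithmetic. The sketch below is one possible way to do it using only the Python standard library. It assumes that, with $n - 2 = 2426$ degrees of freedom, the $t$ distribution is close enough to the standard normal that a normal approximation is acceptable. The variable names are illustrative and not taken from the course materials.

```python
import math
from statistics import NormalDist

# Values quoted in the education example
b = 0.294     # sample slope
se = 0.0149   # standard error of the slope
n = 2428      # sample size

# Test statistic for H0: beta = 0
t_stat = b / se

# Assumption: df = n - 2 is large, so the t distribution is approximated
# by the standard normal. For a standard normal, the two-tailed p-value
# 2 * P(Z > |t|) equals erfc(|t| / sqrt(2)).
p_value = math.erfc(abs(t_stat) / math.sqrt(2))

# 95% confidence interval: slope +/- critical value * standard error
crit = NormalDist().inv_cdf(0.975)   # approximately 1.96
ci_lower = b - crit * se
ci_upper = b + crit * se

print(f"t = {t_stat:.2f}, two-tailed p = {p_value:.3g}")
print(f"95% CI for the slope: ({ci_lower:.3f}, {ci_upper:.3f})")
```

With a smaller sample you would want the critical value from the $t$ distribution with $n - 2$ degrees of freedom rather than the normal approximation used here.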

In the education example, the standard error was rather conveniently provided for us. But where did this come from?

The equation for calculating the standard error is:

$$SE(b) = \frac{b\sqrt{1-r^2}}{r\sqrt{n-2}}$$

Note its familiar components! It is computed from values we already know: the slope $b$, the correlation $r$, $r^2$ (which equals $R^2$ in a regression with one predictor), and the sample size $n$.
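One way to see where this formula comes from is to rely on a standard result for regression with a single predictor: the $t$ statistic for the slope, $b / SE(b)$, is equal to the $t$ statistic for the correlation. Setting the two equal and rearranging for $SE(b)$ gives the formula above:

$$t = \frac{b}{SE(b)} = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \quad\Longrightarrow\quad SE(b) = \frac{b\sqrt{1-r^2}}{r\sqrt{n-2}}$$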

In the immigration data from last week’s tute, the correlation between age and immigration attitudes is -0.1572, the regression slope is -0.0217, and $n$ = 2,155. Test your ability to work with equations by plugging these values into Excel or a calculator to find the standard error of the slope.

  • You can check the SE calculated by Python in the regression results summary table. Does it match what you calculated here?

  • Just by eyeballing the slope and the standard error, can you tell if the slope is statistically significant?
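If you would like to check your calculator work, the standard error formula $SE(b) = \frac{b\sqrt{1-r^2}}{r\sqrt{n-2}}$ can be sketched as a small Python function. The function name `slope_standard_error` is hypothetical, chosen for illustration only. The inputs are the immigration values quoted above.

```python
import math


def slope_standard_error(b, r, n):
    """Standard error of a simple-regression slope.

    Implements SE(b) = b * sqrt(1 - r^2) / (r * sqrt(n - 2)),
    where b is the slope, r the correlation, and n the sample size.
    """
    return (b * math.sqrt(1 - r**2)) / (r * math.sqrt(n - 2))


# Values quoted for the immigration data
b = -0.0217   # regression slope
r = -0.1572   # correlation between age and immigration attitudes
n = 2155      # sample size

se_b = slope_standard_error(b, r, n)
t_stat = b / se_b   # test statistic for H0: beta = 0

print(f"SE(b) = {se_b:.5f}")
print(f"t = {t_stat:.2f}")
```

As a rough guide for eyeballing, in a large sample a slope that lies more than about two standard errors from zero is significant at the 5% level in a two-tailed test.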