2.2. Multivariate Analysis Concepts

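Let’s begin by thinking about spuriousness. A “spurious association” is one in which two variables are correlated not because one causes the other, but because both depend on a third variable. This is easiest to see with examples.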

A famous example is the firefighters example: For all fires in London last year, data are available on \(x\) = number of firefighters at the fire and \(y\) = cost of damages due to the fire. The correlation between \(x\) and \(y\) is positive.

  • Does this mean that having more firefighters at a fire causes the damage to be worse? Can you identify a third variable that could be a common cause of \(x\) and \(y\)?

Another example is shoe size and literacy among children. This spurious association can be explained by age: as children get older, their feet grow and their reading skills improve. We can illustrate a spurious relationship graphically, with variable labels and arrows, as in the figure below.

https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook/main/images/regression2_BooksShoes.png

It is important to think about spuriousness, and about different types of multivariate analysis, because science is interested in establishing causal relationships. It is not enough to cite an association between two variables; researchers try to understand how the world works, which means furthering our understanding of causal processes. In real-world research it can be difficult to establish causality, especially with observational data (more on this in Trinity Term).

Recall from the lecture: what are the three criteria for establishing causality?

An important concept in multivariate research is the control variable. This concept relates to point 3) in the causal criteria. A control variable can be used in an analysis to help us rule out alternative explanations. Thinking about our examples of spurious relationships, if we really wanted to study the relationship between the number of firefighters and fire damage, we would need to control for the size of the fire. If we wanted to examine the link between shoe size and reading, we would need to control for age. In this last example, I would expect the association between shoe size and reading skills to disappear completely once we control for age.
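To make "controlling for age" concrete, here is a minimal simulation sketch. The data are entirely made up: the age range, effect sizes, and noise levels are assumptions chosen for illustration, and the variable names (`age`, `shoe_size`, `reading`) are hypothetical. Age drives both shoe size and reading score, and there is no direct link between the two. One simple way to control for age is to remove its linear effect from each variable and correlate what is left over (a partial correlation).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Simulated (not real) data: age is the common cause of both variables.
age = rng.uniform(5, 12, n)                   # age in years (assumed range)
shoe_size = 0.8 * age + rng.normal(0, 1, n)   # depends on age only
reading = 10 * age + rng.normal(0, 10, n)     # depends on age only


def residuals(y, x):
    """Return what is left of y after a least-squares straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)


# Raw association, ignoring age
raw_r = np.corrcoef(shoe_size, reading)[0, 1]

# Association after removing the linear effect of age from both variables
partial_r = np.corrcoef(residuals(shoe_size, age), residuals(reading, age))[0, 1]

print(f"raw correlation: {raw_r:.2f}")
print(f"correlation after controlling for age: {partial_r:.2f}")
```

In this simulation the raw correlation should be clearly positive, while the correlation after controlling for age should be close to zero, because age was built in as the only source of the association.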

These examples are not very subtle!

As scientists in the social, human, and medical spheres, we often deal with complex and overlapping concepts, which can make causal links harder to establish. Take the link between income and health. One theory is that people with higher incomes have better health because they have access to better resources. An alternative theory proposes that people with higher incomes tend to have more education, and people with more education tend to have healthier lifestyles. So, is it about resources or education? To find out, we would want to examine the association between income and health after controlling for education (and possibly for health behaviours too, e.g., smoking). In short, to understand relationships between variables well enough to answer real-world research questions, we need multivariate analysis.
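The income and health question can be sketched the same way, this time using regression. The example below is a simulation under stated assumptions, not an analysis of real data: all three variables are in arbitrary standardised units, the effect sizes are invented, and `ols_coefs` is a hypothetical helper written for this sketch. Education is set up to influence both income and health, and income is given a smaller direct effect on health. We then compare the income coefficient with and without education in the model.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Simulated (not real) data in standardised units; effect sizes are assumptions.
education = rng.normal(0, 1, n)
income = 0.7 * education + rng.normal(0, 1, n)
health = 0.2 * income + 0.5 * education + rng.normal(0, 1, n)


def ols_coefs(y, *predictors):
    """Least-squares coefficients: intercept first, then one per predictor."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs


# Income coefficient when income is the only predictor
b_simple = ols_coefs(health, income)[1]

# Income coefficient after controlling for education
b_adjusted = ols_coefs(health, income, education)[1]

print(f"income coefficient, no controls: {b_simple:.2f}")
print(f"income coefficient, controlling for education: {b_adjusted:.2f}")
```

Here the income coefficient should shrink once education is included, but not to zero, because the simulation gives income some direct effect of its own. With real data we would not know the true effects in advance, which is exactly why the choice of control variables matters.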