3.6. Data Disaggregation#

Disaggregation means describing or plotting data separately for different categories of individuals.

As we saw in the first lecture of the series, data in a single dataset can arise from different causal processes, for example:

  • The distribution of age at death in 1840 includes a set of deaths caused by infant/child mortality, and a set caused by old age

  • The distribution of reaction times in a psychological experiment may include a mixture of ‘true’ responses, false starts, and missed trials

Disaggregating data so that we are reporting statistics separately for these different groups is an important part of describing and analyzing data. For example:

  • We would like to report the mean reaction time for each condition of our psychological experiment based on ‘true’ responses, not including missed trials, which contribute a lot of noise to our estimate of the mean.
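As a rough sketch of this point, the toy example below uses made-up reaction times. The column names `condition`, `rt` and `outcome`, the 2000 ms timeout and all of the values are hypothetical and are not taken from any dataset in this course. It compares the mean per condition with and without the missed trials.

```python
import pandas as pd

# Made-up reaction times in ms (hypothetical data);
# missed trials are recorded at an assumed 2000 ms timeout
trials = pd.DataFrame({
    'condition': ['easy', 'easy', 'easy', 'easy', 'hard', 'hard', 'hard', 'hard'],
    'rt':        [400, 420, 380, 2000, 500, 520, 480, 2000],
    'outcome':   ['true', 'true', 'true', 'missed', 'true', 'true', 'true', 'missed'],
})

# Mean per condition using every trial (the missed trials inflate the mean)
mean_all = trials.groupby('condition')['rt'].mean()

# Mean per condition using only the 'true' responses
true_trials = trials[trials['outcome'] == 'true']
mean_true = true_trials.groupby('condition')['rt'].mean()

print(mean_all)
print(mean_true)
```

In this toy data a single missed trial per condition is enough to roughly double the mean, which is why we would report the disaggregated version.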

Disaggregation becomes even more important when we think about making predictions based on data. For example:

  • If a patient presents with chest pain, is it more likely to be indigestion or a heart attack? The answer to this question partly depends on the age of the patient (heart attacks are much less likely in young patients), but the effect of age is itself different for men and women.

3.6.1. Equality#

If a dataset includes a majority and a minority group (for example, if the dataset contains more men than women, or more white people than black people), then failure to disaggregate the data results in findings being biased towards the majority group.
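The arithmetic behind this can be sketched with a pooled mean, which is a weighted average of the group means with weights given by group size. The group sizes and means below are invented purely for illustration.

```python
# Hypothetical group sizes and group means (illustrative numbers only)
n_majority, mean_majority = 90, 10.0
n_minority, mean_minority = 10, 20.0

# The pooled mean weights each group mean by the size of the group
pooled_mean = (n_majority * mean_majority + n_minority * mean_minority) / (n_majority + n_minority)

print(pooled_mean)  # 11.0
```

The pooled value of 11 sits very close to the majority group's mean of 10 and tells us almost nothing about the minority group, whose mean is 20.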

3.6.2. Disaggregation skills#

Working out which categories of data should be presented in disaggregated form is a skill that you will learn through practice. Too little disaggregation can obscure important group differences or retain noise that could be removed, but too much disaggregation can result in an ocean of graphs and statistics that makes it hard to see the big picture.

In this section we will look at disaggregation in the context of the heart attack dataset.

3.6.3. Set up Python Libraries#

As usual, you will need to run this code block to import the relevant Python libraries:

# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import seaborn as sns
sns.set_theme(style='white')
import statsmodels.api as sm
import statsmodels.formula.api as smf

3.6.4. Import a dataset to work with#

Let’s continue with the NYC heart attack dataset:

hospital=pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/heartAttack.csv')
display(hospital)
        CHARGES  LOS   AGE  SEX    DRG  DIED
0       4752.00   10  79.0    F  122.0   0.0
1       3941.00    6  34.0    F  122.0   0.0
2       3657.00    5  76.0    F  122.0   0.0
3       1481.00    2  80.0    F  122.0   0.0
4       1681.00    1  55.0    M  122.0   0.0
...         ...  ...   ...  ...    ...   ...
12839  22603.57   14  79.0    F  121.0   0.0
12840       NaN    7  91.0    F  121.0   0.0
12841  14359.14    9  79.0    F  121.0   0.0
12842  12986.00    5  70.0    M  121.0   0.0
12843       NaN    1  81.0    M  123.0   1.0

12844 rows × 6 columns

Clean the data#

We have reloaded the dataframe from the .csv file, so we need to re-implement the data cleaning steps we already decided were necessary:

hospital.replace(9999, np.nan, inplace=True)
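hospital['AGE'] = hospital['AGE'].replace(774, np.nan) # assign back: an in-place replace on a single selected column may not update the dataframe in recent pandas versions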
hospital.describe() # check it worked
            CHARGES           LOS           AGE           DRG          DIED
count  12145.000000  12843.000000  12840.000000  12841.000000  12841.000000
mean    9879.087615      7.567858     66.288162    121.690523      0.109805
std     6558.399650      5.114357     13.654237      0.658289      0.312658
min        3.000000      0.000000     20.000000    121.000000      0.000000
25%     5422.200000      4.000000     57.000000    121.000000      0.000000
50%     8445.000000      7.000000     67.000000    122.000000      0.000000
75%    12569.040000     10.000000     77.000000    122.000000      0.000000
max    47910.120000     38.000000    103.000000    123.000000      1.000000
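Another way to check a cleaning step like this, alongside `describe()`, is to count the missing values in each column before and after the replacement. The sketch below uses a tiny made-up dataframe called `toy` (not the real file) containing the same two placeholder codes, 9999 and 774, that were replaced above.

```python
import numpy as np
import pandas as pd

# Tiny made-up dataframe that mimics the placeholder codes (hypothetical values)
toy = pd.DataFrame({
    'LOS': [10, 9999, 5, 2],
    'AGE': [79, 34, 774, 80],
})

# Before cleaning, the placeholders look like real numbers, so nothing counts as missing
print(toy.isna().sum())

# Replace the placeholder codes with NaN, assigning the results back
toy = toy.replace(9999, np.nan)
toy['AGE'] = toy['AGE'].replace(774, np.nan)

# After cleaning, each column should report one missing value
n_missing = toy.isna().sum()
print(n_missing)
print(toy['AGE'].max())
```

The same `isna().sum()` call could be run on `hospital` to see how many values are missing in each column of the real data.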

3.6.5. Age vs Sex#

Is the age distribution of heart attack patients the same regardless of sex?

Let’s find out by plotting the age distribution separately for men and women:

sns.kdeplot(data=hospital, x='AGE', hue='SEX', fill=True) 
plt.show()
[Figure: kernel density plot of AGE, with a separate shaded curve for each SEX]

Note-

  • More men had heart attacks than women

  • The female patients tend to be older.
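Both observations can also be put into numbers by grouping on SEX. The sketch below does this on a small made-up dataframe, `toy`, which borrows the column names `SEX` and `AGE` from the hospital data but contains invented values. To try it on the real data you would use `hospital` in place of `toy`.

```python
import pandas as pd

# Small made-up sample with the same column names as the hospital data
toy = pd.DataFrame({
    'SEX': ['M', 'M', 'M', 'M', 'M', 'M', 'F', 'F', 'F', 'F'],
    'AGE': [50, 55, 60, 65, 70, 75, 65, 70, 75, 80],
})

# Number of patients and median age, separately for each sex
summary = toy.groupby('SEX')['AGE'].agg(['count', 'median'])
print(summary)
```

The `count` column corresponds to the area under each curve in a KDE plot split by `hue`, and the `median` column to where each curve is centred.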

Think

A 40-year-old patient presents with chest pain. It could be a heart attack or it could be indigestion. The doctor needs to decide the likely cause. Does it matter whether the patient is a man or a woman?

Length of stay#

Let’s plot the distribution of Length of Stay in hospital:

sns.histplot(data=hospital, x='LOS', bins=range(0,40))
plt.show()
[Figure: histogram of LOS (length of stay in days)]

Note-

There is something unusual here: the distribution is bimodal, with a large number of people staying just one day in hospital.

Often a bimodal distribution is a hint that the data are a mixture arising from two different causes. In other words, we suspect the length of stay data could be meaningfully disaggregated.

If we disaggregate these by the categorical variable DIED, we can get a clearer picture of what happened:

sns.histplot(data=hospital, x='LOS', hue='DIED', bins=range(0,40))
plt.show()
[Figure: histogram of LOS, coloured by DIED]

Note-

People who sadly died from their heart attack tended to have short stays in hospital, with many dying on the same day they were brought to hospital.

For those who eventually survived, it was more typical to stay in hospital for 7-10 days.
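A numerical companion to the histogram is to summarise length of stay separately for each value of DIED. The sketch below uses a made-up dataframe, `toy`, with invented values. The helper column `short_stay` and the cut-off of one day are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up lengths of stay (days) for patients who survived (0) or died (1)
toy = pd.DataFrame({
    'LOS':  [6, 7, 8, 9, 10, 1, 0, 1, 1, 2, 9],
    'DIED': [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})

# Flag stays of one day or less (hypothetical cut-off for a 'short' stay)
toy['short_stay'] = toy['LOS'] <= 1

# Median stay and proportion of short stays, separately for each outcome
summary = toy.groupby('DIED').agg(
    median_LOS=('LOS', 'median'),
    prop_short=('short_stay', 'mean'),
)
print(summary)
```

Taking the mean of a True/False column gives the proportion of True values, so `prop_short` is the fraction of patients in each group with a very short stay.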

3.6.6. Mortality by sex#

At first glance, female patients are much more likely to die than males:

sns.barplot(data=hospital, y='DIED', x='SEX')
plt.ylabel('proportion who died')
plt.show()
[Figure: bar plot of the proportion who died, by SEX]
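The bar heights can be read as proportions because, by default, `sns.barplot` shows the mean of the y variable in each category, and the mean of a variable coded 0/1 is the proportion of 1s. A minimal sketch on made-up data (the dataframe `toy` and its values are invented):

```python
import pandas as pd

# Made-up outcomes: DIED is coded 1 for died and 0 for survived
toy = pd.DataFrame({
    'SEX':  ['M', 'M', 'M', 'M', 'F', 'F', 'F', 'F', 'F'],
    'DIED': [0, 0, 0, 1, 0, 0, 1, 1, 0],
})

# Mean of a 0/1 variable within each group
prop_died = toy.groupby('SEX')['DIED'].mean()

# The same thing computed by hand: number who died divided by number of patients
grouped = toy.groupby('SEX')['DIED']
manual = grouped.sum() / grouped.count()

print(prop_died)
print(manual)
```

On the real data, `hospital.groupby('SEX')['DIED'].mean()` would give the heights of the two bars.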

Is this due to a difference in the severity of heart attacks by sex, or perhaps differences in the effectiveness of treatment?

Probably not. We noticed earlier that the female patients were older than the males, and it is reasonable to wonder whether younger patients are more likely to survive.

It turns out this is true: younger patients are more likely to survive. We can see this clearly by plotting the proportion who died at each age:

  • This is a plot of mortality conditional upon age

    • If you are interested you can have a look at the syntax to produce this graph, which is actually a form of KDE plot

    • remember that if you google sns.kdeplot() you will find the manual page for KDE plot, which explains this form

    • you won’t be required to reproduce this type of plot

plt.figure(figsize=(10,2))
plt.ylabel('Proportion Died')
sns.kdeplot(data=hospital, x='AGE', hue='DIED', multiple='fill', legend=False)

plt.tight_layout()
plt.show()
[Figure: proportion who died as a function of AGE]
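A rougher alternative to the smoothed KDE plot is to cut age into bands and compute the proportion who died in each band. The sketch below is not how the figure was produced. It uses a made-up dataframe, `toy`, and the band edges and labels are arbitrary choices for illustration.

```python
import pandas as pd

# Made-up ages and outcomes (DIED coded 1 for died, 0 for survived)
toy = pd.DataFrame({
    'AGE':  [35, 42, 48, 55, 58, 63, 67, 72, 78, 85],
    'DIED': [0, 0, 0, 0, 1, 0, 1, 0, 1, 1],
})

# Assign each patient to an age band (each band includes its upper edge)
toy['age_band'] = pd.cut(
    toy['AGE'],
    bins=[20, 40, 60, 80, 100],
    labels=['21-40', '41-60', '61-80', '81-100'],
)

# Proportion who died within each age band
prop_by_band = toy.groupby('age_band', observed=True)['DIED'].mean()
print(prop_by_band)
```

Binning trades smoothness for simplicity: each value is an ordinary proportion, but the result depends on where the band edges are placed.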

If we plot mortality conditional upon age disaggregated by men and women, we can see that across all ages men are actually more likely to die than women; the higher overall mortality for women is explained by the presence of more older women in the sample (which in turn probably reflects the fact that women have longer life expectancy than men).

plt.figure(figsize=(10,4))

plt.subplot(2,1,1)
plt.title('Men')
plt.ylabel('Proportion Died')
sns.kdeplot(data=hospital.query('SEX == "M"'), x='AGE', hue='DIED', multiple='fill', legend=False)

plt.subplot(2,1,2)
plt.title('Women')
plt.ylabel('Proportion Died')
sns.kdeplot(data=hospital.query('SEX == "F"'), x='AGE', hue='DIED', multiple='fill', legend=False)

plt.tight_layout()
plt.show()
[Figure: proportion who died as a function of AGE, in separate panels for men and women]
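To see how such a reversal is arithmetically possible, here is a toy sketch with invented counts (these are not the hospital figures, and the two age bands are hypothetical). Within each age band the men have the higher mortality, yet the women have the higher mortality overall because most of the women fall in the older band. This pattern is sometimes called Simpson's paradox.

```python
import pandas as pd

# Invented counts chosen to illustrate the reversal (not real data)
counts = pd.DataFrame({
    'SEX':      ['M', 'M', 'F', 'F'],
    'age_band': ['younger', 'older', 'younger', 'older'],
    'patients': [800, 200, 200, 800],
    'died':     [40, 60, 8, 200],
})

# Mortality within each age band, separately for each sex
counts['rate'] = counts['died'] / counts['patients']
by_band = counts.pivot(index='age_band', columns='SEX', values='rate')

# Mortality for each sex when the age bands are pooled
totals = counts.groupby('SEX')[['patients', 'died']].sum()
overall = totals['died'] / totals['patients']

print(by_band)
print(overall)
```

The pooled rates are weighted averages of the band rates, so a group that is concentrated in the high-risk band can have the higher pooled rate even when it has the lower rate in every band.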

Conclusion#

You can learn a lot by disaggregating data!

The process of breaking data down to find evidence of different underlying distributions and relationships between variables is at the core of what a good data scientist, or indeed a research scientist, does.