Data Disaggregation

3.6. Data Disaggregation

Disaggregation means describing or plotting data separately for different categories of individuals.

As we saw in the first lecture of the series, data in a single dataset can arise from different causal processes, for example:

The distribution of age at death in 1840 includes a set of deaths caused by infant/child mortality, and a set caused by old age
The distribution of reaction times in a psychological experiment may include a mixture of ‘true’ responses, false starts, and missed trials

Disaggregating data so that we are reporting statistics separately for these different groups is an important part of describing and analyzing data. For example:

We would like to report the mean reaction time for each condition of our psychological experiment based on ‘true’ responses, not including missed trials, which contribute a lot of noise to our estimate of the mean.
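As a minimal sketch of this idea, we can compare the mean reaction time per condition with and without the missed trials. The data below are invented purely for illustration, and the column names (condition, rt_ms, outcome) are assumptions rather than part of any dataset used in this course.

```python
import pandas as pd

# Toy data, invented for illustration only (not from a real experiment).
# Missed trials are recorded here with a long time-out value of 3000 ms.
trials = pd.DataFrame({
    'condition': ['A', 'A', 'A', 'B', 'B', 'B'],
    'rt_ms':     [420, 450, 3000, 510, 530, 3000],
    'outcome':   ['response', 'response', 'missed',
                  'response', 'response', 'missed'],
})

# Mean reaction time per condition, with all trial types pooled together
mean_all = trials.groupby('condition')['rt_ms'].mean()

# Mean reaction time per condition, based on 'true' responses only
mean_true = (trials.query('outcome == "response"')
                   .groupby('condition')['rt_ms'].mean())

print(mean_all)
print(mean_true)
```

In this toy example the pooled means are dominated by the time-out values, whereas the means based on 'true' responses reflect the difference between conditions.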

Disaggregation becomes even more important when we think about making predictions based on data. For example:

If a patient presents with chest pain, is it more likely to be indigestion or a heart attack? The answer to this question partly depends on the age of the patient (heart attacks are much less likely in young patients), but the relationship between age and risk is different again for men and women.

3.6.1. Equality

If a dataset includes a majority and a minority group (for example, if the dataset consists of more men than women, or more white people than black people), then failure to disaggregate the data results in findings being biased towards the majority group.

For example, shockingly, black women are four times more likely to die in childbirth than white women in the UK, a statistic that long went unremarked because data on maternal outcomes were not routinely disaggregated by race.
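The arithmetic behind this bias can be sketched with a toy example. The numbers below are invented for illustration and do not come from any real study; the group labels 'maj' and 'min' are hypothetical.

```python
import pandas as pd

# Toy data, invented for illustration:
# 90 people in a majority group ('maj') and 10 in a minority group ('min').
# outcome = 1 means the adverse outcome occurred.
toy = pd.DataFrame({
    'group':   ['maj'] * 90 + ['min'] * 10,
    'outcome': [1] * 9 + [0] * 81 + [1] * 3 + [0] * 7,
})

# A single pooled rate for everyone
pooled_rate = toy['outcome'].mean()

# One rate per group (the disaggregated version)
rate_by_group = toy.groupby('group')['outcome'].mean()

print(pooled_rate)
print(rate_by_group)
```

Because the majority group contributes 90% of the rows, the pooled rate sits close to the majority group's rate, and the much higher rate in the minority group is only visible once the data are split by group.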

3.6.2. Disaggregation skills

Working out which categories of data should be presented in disaggregated form is a skill that you will learn through practice. Too little disaggregation can obscure important group differences or retain noise that could be removed; but too much disaggregation can result in an ocean of graphs and statistics that makes it hard to see the big picture.

In this section we will look at disaggregation in the context of the heart attack dataset.

3.6.3. Set up Python Libraries

As usual, you will need to run this code block to import the relevant Python libraries.

# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import seaborn as sns
sns.set_theme(style='white')
import statsmodels.api as sm
import statsmodels.formula.api as smf

3.6.4. Import a dataset to work with

Let’s continue with the NYC heart attack dataset:

hospital=pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/heartAttack.csv')
display(hospital)

	CHARGES	LOS	AGE	SEX	DRG	DIED
0	4752.00	10	79.0	F	122.0	0.0
1	3941.00	6	34.0	F	122.0	0.0
2	3657.00	5	76.0	F	122.0	0.0
3	1481.00	2	80.0	F	122.0	0.0
4	1681.00	1	55.0	M	122.0	0.0
...	...	...	...	...	...	...
12839	22603.57	14	79.0	F	121.0	0.0
12840	NaN	7	91.0	F	121.0	0.0
12841	14359.14	9	79.0	F	121.0	0.0
12842	12986.00	5	70.0	M	121.0	0.0
12843	NaN	1	81.0	M	123.0	1.0

12844 rows × 6 columns

Clean the data

We have reloaded the dataframe from the .csv file, so we need to re-implement the data cleaning steps we already decided were necessary:

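# Recode the values flagged during data cleaning as missing (NaN)
hospital.replace(9999, np.nan, inplace=True)
# Assign the result back to the column: an inplace replace on hospital.AGE
# may not update the dataframe in recent versions of pandas
hospital['AGE'] = hospital['AGE'].replace(774, np.nan)
hospital.describe() # check it worked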

	CHARGES	LOS	AGE	DRG	DIED
count	12145.000000	12843.000000	12840.000000	12841.000000	12841.000000
mean	9879.087615	7.567858	66.288162	121.690523	0.109805
std	6558.399650	5.114357	13.654237	0.658289	0.312658
min	3.000000	0.000000	20.000000	121.000000	0.000000
25%	5422.200000	4.000000	57.000000	121.000000	0.000000
50%	8445.000000	7.000000	67.000000	122.000000	0.000000
75%	12569.040000	10.000000	77.000000	122.000000	0.000000
max	47910.120000	38.000000	103.000000	123.000000	1.000000
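If you want to convince yourself what the two replace steps do, the following sketch applies them to a tiny invented dataframe and then counts the missing values in each column. The toy values are assumptions chosen to mimic the cleaning steps above, not rows from the heart attack dataset.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({'LOS': [3, 9999, 7],
                    'AGE': [61.0, 774.0, 80.0]})

toy = toy.replace(9999, np.nan)                # applies to every column
toy['AGE'] = toy['AGE'].replace(774, np.nan)   # applies to the AGE column only

# Number of missing values in each column after cleaning
print(toy.isna().sum())
```

Running the same `.isna().sum()` check on the hospital dataframe is one way to see how many values are missing in each column after cleaning.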

3.6.5. Age vs Sex

Is the age distribution of heart attack patients the same regardless of sex?

Let’s find out by plotting the age distribution separately for men and women:

sns.kdeplot(data=hospital, x='AGE', hue='SEX', fill=True) 
plt.show()

../_images/d4af5f059c4f5eeaf88565104b750d3868589b7dab3837ebe0ffb1e0035ec927.png

Note-

The sample contains more male than female heart attack patients.
The female patients tend to be older.
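These two observations can also be checked numerically by summarising age separately for each sex. The sketch below uses a small invented dataframe with the same column names as the hospital data (AGE and SEX), so the numbers are illustrative only.

```python
import pandas as pd

# Toy data, invented for illustration (same column names as the hospital data)
toy = pd.DataFrame({
    'AGE': [55, 62, 48, 70, 66, 78, 81, 74],
    'SEX': ['M', 'M', 'M', 'M', 'M', 'F', 'F', 'F'],
})

# Number of patients, mean age and median age, separately for each sex
summary = toy.groupby('SEX')['AGE'].agg(['count', 'mean', 'median'])
print(summary)
```

Assuming the hospital dataframe has been loaded and cleaned as above, the same summary could be produced with `hospital.groupby('SEX')['AGE'].agg(['count', 'mean', 'median'])`.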

Think

A 40 year old patient presents with chest pain. It could be a heart attack or it could be indigestion. The doctor needs to decide the likely cause. Does it matter whether the patient is a man or a woman?

Length of stay

Let’s plot the distribution of Length of Stay in hospital:

sns.histplot(data=hospital, x='LOS', bins=range(0,40))
plt.show()

../_images/5aebcc734ba7ece58c1b0da6df514bf95401339f522ab9b5603710fac8a6b5ca.png

Note-

There is something unusual here: the distribution is bimodal, with a large number of people staying just one day in hospital.

Often a bimodal distribution is a hint that the data are a mixture arising from two different causes. In other words, we suspect the length of stay data could be meaningfully disaggregated.

If we disaggregate the length of stay data by the categorical variable DIED, we can get a clearer picture of what happened:

sns.histplot(data=hospital, x='LOS', hue='DIED', bins=range(0,40))
plt.show()

../_images/5685003e2f87375fff34d169332bc5da382bb41e3dc4f6bcbb25ed2a7cb97988.png

Note-

People who sadly died from their heart attack tended to have short stays in hospital, with many dying on the same day they were brought to hospital.

For those who eventually survived, it was more typical to stay in hospital for 7-10 days.
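One way to put numbers on this pattern is to compare the typical length of stay, and the share of very short stays, between the two outcome groups. The sketch below uses invented toy data with the column names LOS and DIED; the helper column short_stay and the one-day cut-off are assumptions made for this example.

```python
import pandas as pd

# Toy data, invented for illustration (same column names as the hospital data)
toy = pd.DataFrame({
    'LOS':  [1, 1, 2, 7, 8, 9, 10, 6],
    'DIED': [1, 1, 1, 0, 0, 0, 0, 0],
})

# Median length of stay, separately for survivors (0) and those who died (1)
median_los = toy.groupby('DIED')['LOS'].median()

# Proportion of stays lasting one day or less, by outcome
toy['short_stay'] = toy['LOS'] <= 1   # hypothetical helper column
share_short = toy.groupby('DIED')['short_stay'].mean()

print(median_los)
print(share_short)
```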

3.6.6. Mortality by sex

At first glance, female patients are much more likely to die than males:

sns.barplot(data=hospital, y='DIED', x='SEX')
plt.ylabel('proportion who died')
plt.show()

../_images/a7e208ba9ab14041ebe6f90aef467d4bb3152f147b2ee8b0854fda20b0716c51.png

Is this due to a difference in the severity of heart attacks by sex, or perhaps to differences in the effectiveness of treatment?

Probably not. We noticed earlier that the female patients were older than the males, and it is reasonable to wonder whether younger patients are more likely to survive.

It turns out this is true: younger patients are more likely to survive. We can see this clearly by plotting the proportion who died at each age.

This is a plot of mortality conditional upon age.
- If you are interested, you can have a look at the syntax that produces this graph, which is actually a form of KDE plot.
- Remember that if you google sns.kdeplot() you will find the manual page for KDE plots, which explains this form.
- You won’t be required to reproduce this type of plot.

plt.figure(figsize=(10,2))
plt.ylabel('Proportion Died')
sns.kdeplot(data=hospital, x='AGE', hue='DIED', multiple='fill', legend=False)

plt.tight_layout()
plt.show()

../_images/996ecb83ff8930d0c9fa5729988eea55e9b746d15a1dd9bfced7add1ebb23a3b.png
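A simpler, table-based alternative to this plot is to cut age into bands and compute the proportion who died in each band. The sketch below uses invented toy data, and the band edges (20, 60, 80, 105) are arbitrary choices made for this example.

```python
import pandas as pd

# Toy data, invented for illustration (same column names as the hospital data)
toy = pd.DataFrame({
    'AGE':  [35, 45, 52, 58, 63, 67, 72, 76, 84, 88],
    'DIED': [0,  0,  0,  0,  0,  1,  0,  1,  1,  1],
})

# Assign each patient to an age band: (20, 60], (60, 80] or (80, 105]
toy['age_band'] = pd.cut(toy['AGE'], bins=[20, 60, 80, 105])

# DIED is coded 0/1, so its mean within a band is the proportion who died
died_by_band = toy.groupby('age_band', observed=True)['DIED'].mean()
print(died_by_band)
```

Because DIED is coded as 0 or 1, taking the mean within each band gives the proportion who died directly.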

If we plot mortality conditional upon age, disaggregated by sex, we can see that across all ages men are actually more likely to die than women; the higher overall mortality for women is explained by the presence of more older women in the sample (which in turn probably reflects the fact that women have a longer life expectancy than men).

plt.figure(figsize=(10,4))

plt.subplot(2,1,1)
plt.title('Men')
plt.ylabel('Proportion Died')
sns.kdeplot(data=hospital.query('SEX == "M"'), x='AGE', hue='DIED', multiple='fill', legend=False)

plt.subplot(2,1,2)
plt.title('Women')
plt.ylabel('Proportion Died')
sns.kdeplot(data=hospital.query('SEX == "F"'), x='AGE', hue='DIED', multiple='fill', legend=False)

plt.tight_layout()
plt.show()

../_images/3d02640640e65f4d5e8439c12a6972a827d5168d84ed5b1fdb44f528c45ad3d4.png
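The way an overall difference can reverse once we disaggregate is easier to see with small numbers. The counts below are invented for illustration and are not taken from the heart attack dataset; they are chosen so that men have the higher mortality within each age band, while women have the higher mortality overall because most of the women are in the older band.

```python
import pandas as pd

# Toy counts, invented for illustration
counts = pd.DataFrame({
    'SEX':      ['M', 'M', 'F', 'F'],
    'age_band': ['younger', 'older', 'younger', 'older'],
    'n':        [80, 20, 20, 80],   # number of patients
    'died':     [8, 8, 1, 24],      # number who died
})

# Mortality within each age band, separately by sex
counts['rate'] = counts['died'] / counts['n']
within_band = counts.pivot(index='age_band', columns='SEX', values='rate')

# Mortality by sex, ignoring age
totals = counts.groupby('SEX')[['died', 'n']].sum()
overall = totals['died'] / totals['n']

print(within_band)
print(overall)
```

In this toy example the within-band rates are higher for men in both bands, yet the overall rate is higher for women, which is the same kind of pattern described above.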

Conclusion

You can learn a lot by disaggregating data!

The process of breaking data down to find evidence of different underlying distributions and relationships between variables is at the core of what a good data scientist, or indeed a research scientist, does.
