{ "cells": [ { "cell_type": "markdown", "id": "213984e9", "metadata": {}, "source": [ "# Data Disaggregation\n", "\n", "**Disaggregation** means describing or plotting data separately for different categories of individuals.\n", "\n", "As we saw in the first lecture of the series, data in a single dataset can arise from different causal processes, for example:\n", "* The distribution of age at death in 1840 includes a set of deaths caused by infant/child mortality, and a set caused by old age\n", "* The distribution of reaction times in a psychological experiment may include a mixture of 'true' responses, false starts, and missed trials\n", "\n", "*Disaggregating* data so that we are reporting statistics separately for these different groups is an important part of describing and analyzing data. For example:\n", "* We would like to report the mean reaction time for each condition of our psychological experiment based on 'true' responses, not including missed trials, which contribute a lot of noise to our estimate of the mean.\n", "\n", "\n", "Disaggregation becomes even more important when we think about making predictions based on data. For example:\n", "* If a patient presents with chest pain, is it more likely to be indigestion or a heart attack? 
The answer to this question partly depends on the age of the patient (heart attacks are much less likely in young patients), BUT the relationship with age differs again between men and women.\n", "\n", "\n", "#### Equality\n", "\n", "If a dataset includes a majority and minority group (for example, if the dataset consists of more men than women, or more white people than black people), then failure to disaggregate data results in findings being biased towards the majority group.\n", "\n", "* For example, shockingly, black women are four times more likely to die in childbirth than white women in the UK, a statistic that was long unremarked because data on maternal outcomes were not routinely disaggregated by race\n", "\n", " \n", "#### Disaggregation skills\n", " \n", "Working out which categories of data should be presented in disaggregated form is a skill that you will learn through practice. Too little disaggregation can obscure important group differences or retain noise that could be removed; but too much disaggregation can result in an ocean of graphs and statistics that makes it hard to see the big picture.\n", "\n", "In this section we will look at disaggregation in the context of the heart attack dataset. 
\n", " \n", "### Set up Python Libraries\n", "\n", "As usual you will need to run this code block to import the relevant Python libraries" ] }, { "cell_type": "code", "execution_count": 1, "id": "be5b5d06", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Set-up Python libraries - you need to run this but you don't need to change it\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import scipy.stats as stats\n", "import pandas as pd\n", "import seaborn as sns\n", "sns.set_theme(style='white')\n", "import statsmodels.api as sm\n", "import statsmodels.formula.api as smf" ] }, { "cell_type": "markdown", "id": "c7fee4fc", "metadata": {}, "source": [ "### Import a dataset to work with\n", "\n", "Let's continue with the NYC heart attack dataset:" ] }, { "cell_type": "code", "execution_count": 2, "id": "ac9c9d93", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n",
"|  | CHARGES | LOS | AGE | SEX | DRG | DIED |\n",
"|---|---|---|---|---|---|---|\n",
"| 0 | 4752.00 | 10 | 79.0 | F | 122.0 | 0.0 |\n",
"| 1 | 3941.00 | 6 | 34.0 | F | 122.0 | 0.0 |\n",
"| 2 | 3657.00 | 5 | 76.0 | F | 122.0 | 0.0 |\n",
"| 3 | 1481.00 | 2 | 80.0 | F | 122.0 | 0.0 |\n",
"| 4 | 1681.00 | 1 | 55.0 | M | 122.0 | 0.0 |\n",
"| ... | ... | ... | ... | ... | ... | ... |\n",
"| 12839 | 22603.57 | 14 | 79.0 | F | 121.0 | 0.0 |\n",
"| 12840 | NaN | 7 | 91.0 | F | 121.0 | 0.0 |\n",
"| 12841 | 14359.14 | 9 | 79.0 | F | 121.0 | 0.0 |\n",
"| 12842 | 12986.00 | 5 | 70.0 | M | 121.0 | 0.0 |\n",
"| 12843 | NaN | 1 | 81.0 | M | 123.0 | 1.0 |\n",
"\n",
"12844 rows × 6 columns
\n", "\n",
"|  | CHARGES | LOS | AGE | DRG | DIED |\n",
"|---|---|---|---|---|---|\n",
"| count | 12145.000000 | 12843.000000 | 12840.000000 | 12841.000000 | 12841.000000 |\n",
"| mean | 9879.087615 | 7.567858 | 66.288162 | 121.690523 | 0.109805 |\n",
"| std | 6558.399650 | 5.114357 | 13.654237 | 0.658289 | 0.312658 |\n",
"| min | 3.000000 | 0.000000 | 20.000000 | 121.000000 | 0.000000 |\n",
"| 25% | 5422.200000 | 4.000000 | 57.000000 | 121.000000 | 0.000000 |\n",
"| 50% | 8445.000000 | 7.000000 | 67.000000 | 122.000000 | 0.000000 |\n",
"| 75% | 12569.040000 | 10.000000 | 77.000000 | 122.000000 | 0.000000 |\n",
"| max | 47910.120000 | 38.000000 | 103.000000 | 123.000000 | 1.000000 |\n",
"