{ "cells": [ { "cell_type": "markdown", "id": "f50bfb80", "metadata": {}, "source": [ "# Handling NaNs\n", "\n", "`NaN` (Not a Number) is a special value used to indicate missing data in many scientific programming languages.\n", "\n", "Using `NaN` instead of a numerical dummy value like 9999 or 0 is helpful because Python functions either ignore `NaN`s by default, or can be set to ignore `NaN`s using an optional function argument. So for example, if you get the mean of a column that includes 9999 as a dummmy value, those 9999s will be included in the calculation of the mean; but if the dummy value is `NaN`, the will not.\n", "\n", "In this section we will review:\n", "\n", "* Why `NaN` is better than a numerical dummy value\n", "* How to check for `NaN`s in a dataframe\n", "* Setting the `NaN`-handling in Python functions\n", "\n", "Set up Python Libraries\n", "\n", "As usual you will need to run this code block to import the relevant Python libraries" ] }, { "cell_type": "code", "execution_count": 1, "id": "3ee92b38", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Set-up Python libraries - you need to run this but you don't need to change it\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import scipy.stats as stats\n", "import pandas as pd\n", "import seaborn as sns\n", "sns.set_theme(style='white')\n", "import statsmodels.api as sm\n", "import statsmodels.formula.api as smf" ] }, { "cell_type": "markdown", "id": "a366df9a", "metadata": {}, "source": [ "### Import a dataset to work with\n", "\n", "We again work with the NYC heart attack dataset\n", "\n", "\n", "The data will be automatically loaded fromt he internet when you run this code block:" ] }, { "cell_type": "code", "execution_count": 2, "id": "ca767deb", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CHARGESLOSAGESEXDRGDIED
04752.001079.0F122.00.0
13941.00634.0F122.00.0
23657.00576.0F122.00.0
31481.00280.0F122.00.0
41681.00155.0M122.00.0
.....................
1283922603.571479.0F121.00.0
12840NaN791.0F121.00.0
1284114359.14979.0F121.00.0
1284212986.00570.0M121.00.0
12843NaN181.0M123.01.0
\n", "

12844 rows × 6 columns

\n", "
" ], "text/plain": [ " CHARGES LOS AGE SEX DRG DIED\n", "0 4752.00 10 79.0 F 122.0 0.0\n", "1 3941.00 6 34.0 F 122.0 0.0\n", "2 3657.00 5 76.0 F 122.0 0.0\n", "3 1481.00 2 80.0 F 122.0 0.0\n", "4 1681.00 1 55.0 M 122.0 0.0\n", "... ... ... ... .. ... ...\n", "12839 22603.57 14 79.0 F 121.0 0.0\n", "12840 NaN 7 91.0 F 121.0 0.0\n", "12841 14359.14 9 79.0 F 121.0 0.0\n", "12842 12986.00 5 70.0 M 121.0 0.0\n", "12843 NaN 1 81.0 M 123.0 1.0\n", "\n", "[12844 rows x 6 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "hospital=pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/heartAttack.csv')\n", "display(hospital)" ] }, { "cell_type": "markdown", "id": "cec11d6f", "metadata": {}, "source": [ "## `NaN` is not a number!\n", "\n", "Humans may recognise dummy values like 9999 for what they are, but the computer will treat them as numbers.\n", "\n", "Say we want to find the mean and standard deviation of the age of patients in out hospital dataset (remembering that missing data were coded as 9999):" ] }, { "cell_type": "code", "execution_count": 3, "id": "d1afa59d", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "67.11672636660957\n", "88.92599753124392\n" ] } ], "source": [ "print(hospital.AGE.mean())\n", "print(hospital.AGE.std())\n" ] }, { "cell_type": "markdown", "id": "7d3bc231", "metadata": {}, "source": [ "**Think** is the value for standard deviation realistic?\n", "\n", "These values include the 9999s just as if there were really people 9999 years old in the sample.\n", "\n", "If we replace the 9999s with `NaN` we get the correct mean and standard deviation for the 'real' values, excluding the missing data" ] }, { "cell_type": "code", "execution_count": 4, "id": "1923bd18", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "66.34327544583755\n", "15.014263316814123\n" ] } ], "source": [ "hospital.AGE.replace(9999, np.nan, inplace=True)\n", "print(hospital.AGE.mean())\n", "print(hospital.AGE.std())" ] }, { "cell_type": "markdown", "id": "05fab5e1", "metadata": {}, "source": [ "The mean has changed slightly, and the standard deviation is now much more reasonable." ] }, { "cell_type": "markdown", "id": "0a4868a9", "metadata": {}, "source": [ "## Creating `NaN`s\n", "\n", "If we want to set a value to `NaN`, we can't just type NaN or \"NaN\"\n", "\n", "Istead, we 'create' the value `NaN` using the `numpy` function `np.nan`, for example:\n", "\n", "``\n", "hospital.loc[1, 'CHARGES']=np.nan # set the value of CHARGES in row 2 to be `NaN`\n", "``\n" ] }, { "cell_type": "markdown", "id": "c91ffe17", "metadata": {}, "source": [ "## Check for `NaNs`\n", "\n", "- `df.isna()`\n", "- `df.isna().sum()`\n", "\n", "`NaN`s are ignored by many Python functions, however you may still want to know if there were any (and how many) in any given set of data.\n", "\n", "To check for missing values, coded as `NaN`, we use the function `df.isna()`:" ] }, { "cell_type": "code", "execution_count": 5, "id": "e72c3b6f", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 False\n", "3 False\n", "4 False\n", " ... \n", "12839 False\n", "12840 False\n", "12841 False\n", "12842 False\n", "12843 False\n", "Name: AGE, Length: 12844, dtype: bool" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hospital.AGE.isna()" ] }, { "cell_type": "markdown", "id": "09c677c5", "metadata": {}, "source": [ "`df.isna()` returned a column with True or False for each value of AGE - True for people where the age is coded as `NaN` and False otherwise.\n", "\n", "This isn't very readable, but if we want to know how many `NaN`s were in the column, we can use a trick: Python treats True as 1 and False as 0. So if we just take the sum of the column, we get the total nuber of `NaN`s:" ] }, { "cell_type": "code", "execution_count": 6, "id": "cf4e26cd", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "3" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hospital.AGE.isna().sum()" ] }, { "cell_type": "markdown", "id": "462978ed", "metadata": {}, "source": [ "Three people's age was coded as `NaN`.\n", "\n", "## NaN handling by Python functions\n", "\n", "Many Python functions automatically ignore NaNs.\n", "\n", "These include \n", "* `df.mean()`\n", "* `df.std()`\n", "* `df.quantile()`\n", ".... and most other descriptive statistics\n", "\n", "* `sns.histogram()`\n", "* `sns.scatter()`\n", "... and most other `Seaborn` and `Matplotlib` functions\n", "\n", "However, some functions do *not* automatically ignore `NaN`s, and instead will give an error message, or return the value `NaN`, if the input data contains `NaN`s.\n", "\n", "This includes a lot of functions from the library `scipy.stats`, which we will use later in the course. For example, say I want to use a $t$-test to ask if the male patients are older than the females\n", "* don't worry if you don't yet know what a $t$-test is - this will make sense when you return to it for revision" ] }, { "cell_type": "code", "execution_count": 7, "id": "f61363f9", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "Ttest_indResult(statistic=nan, pvalue=nan)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stats.ttest_ind(hospital.query('SEX == \"M\"').AGE, hospital.query('SEX == \"F\"').AGE)" ] }, { "cell_type": "markdown", "id": "4d102b3d", "metadata": {}, "source": [ "The function `stats.ttest_ind()` performs an independent samples $t$-test between the two samples we gave it (the ages of male and female patients) and should return a $t$-value (statistic) and a $p$ value (pvalue)\n", " \n", "Right now both of these are `NaN` because the `NaN`s in the input were not ignored.\n", " \n", "We can tell the function `stats.ttest_ind()` to ignore `NaN`s, using the argumment `nan_policy='omit'`:" ] }, { "cell_type": "code", "execution_count": 8, "id": "bd6efa6a", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "Ttest_indResult(statistic=-32.5161215541649, pvalue=6.0679255155843785e-223)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stats.ttest_ind(hospital.query('SEX == \"M\"').AGE, hospital.query('SEX == \"F\"').AGE, nan_policy='omit')" ] }, { "cell_type": "markdown", "id": "f1608265", "metadata": {}, "source": [ "Now we have actual values instead of NaN: $t = -35.4$ and $p = 3.1 x 10^{-262}$ (a very small number)\n", "\n", "**If you run a Python function and the output is `NaN`, you very probably need to change how the function handles `NaN`s using an argument. Check the function's help page online to get the correct syntax.**" ] }, { "cell_type": "code", "execution_count": null, "id": "3c257b5a", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.18" } }, "nbformat": 4, "nbformat_minor": 5 }