{ "cells": [ { "cell_type": "markdown", "id": "f50bfb80", "metadata": {}, "source": [ "# Handling NaNs\n", "\n", "`NaN` (Not a Number) is a special value used to indicate missing data in many scientific programming languages.\n", "\n", "Using `NaN` instead of a numerical dummy value like 9999 or 0 is helpful because many Python functions (for example, in `pandas`) either ignore `NaN`s by default, or can be set to ignore `NaN`s using an optional function argument. For example, if you take the mean of a column that includes 9999 as a dummy value, those 9999s will be included in the calculation of the mean; but if the dummy value is `NaN`, they will not.\n", "\n", "In this section we will review:\n", "\n", "* Why `NaN` is better than a numerical dummy value\n", "* How to check for `NaN`s in a dataframe\n", "* How to set `NaN` handling in Python functions\n", "\n", "### Set up Python Libraries\n", "\n", "As usual, you will need to run this code block to import the relevant Python libraries." ] }, { "cell_type": "code", "execution_count": 1, "id": "3ee92b38", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Set-up Python libraries - you need to run this but you don't need to change it\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import scipy.stats as stats\n", "import pandas as pd\n", "import seaborn as sns\n", "sns.set_theme(style='white')\n", "import statsmodels.api as sm\n", "import statsmodels.formula.api as smf" ] }, { "cell_type": "markdown", "id": "a366df9a", "metadata": {}, "source": [ "### Import a dataset to work with\n", "\n", "We again work with the NYC heart attack dataset.\n", "\n", "\n", "The data will be automatically loaded from the internet when you run this code block:" ] }, { "cell_type": "code", "execution_count": 2, "id": "ca767deb", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "|  | CHARGES | LOS | AGE | SEX | DRG | DIED |\n", "
|---|---|---|---|---|---|---|\n", "
| 0 | 4752.00 | 10 | 79.0 | F | 122.0 | 0.0 |\n", "
| 1 | 3941.00 | 6 | 34.0 | F | 122.0 | 0.0 |\n", "
| 2 | 3657.00 | 5 | 76.0 | F | 122.0 | 0.0 |\n", "
| 3 | 1481.00 | 2 | 80.0 | F | 122.0 | 0.0 |\n", "
| 4 | 1681.00 | 1 | 55.0 | M | 122.0 | 0.0 |\n", "
| ... | ... | ... | ... | ... | ... | ... |\n", "
| 12839 | 22603.57 | 14 | 79.0 | F | 121.0 | 0.0 |\n", "
| 12840 | NaN | 7 | 91.0 | F | 121.0 | 0.0 |\n", "
| 12841 | 14359.14 | 9 | 79.0 | F | 121.0 | 0.0 |\n", "
| 12842 | 12986.00 | 5 | 70.0 | M | 121.0 | 0.0 |\n", "
| 12843 | NaN | 1 | 81.0 | M | 123.0 | 1.0 |\n", "
\n", "
12844 rows × 6 columns\n", "