{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "213984e9",
   "metadata": {},
   "source": [
    "# Neutralizing bad datapoints\n",
    "\n",
    "Once you have found dummy values (such as 9999), bad datapoints or outliers, you will want to neutralize them in some way.\n",
    "\n",
    "You have about three options. Most often, the first option is best.\n",
    "\n",
    "* Replace the bad values with a dummy value (such as `NaN`)\n",
    "    - This retains as much information as possible\n",
    "<br>\n",
    "* Replace the whole record (row of the dataframe) with a dummy value (such as `NaN`)\n",
    "    - Useful if the bad value for one variable casts doubt on the data quality for the whole record, or for some reason you need to retain only complete records\n",
    "<br>\n",
    "* Delete the whole record (row of the dataframe)\n",
    "    - Generally not recommended as could be seen as dishonest (see below)\n",
    "    - Exception would be obvious duplicate records or completely blank records\n",
    "<br>\n",
    " \n",
    "Here is a video introducing the `NaN` (Not a Number) placeholder:\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "1c2463d9-7872-4c24-a141-8881b9b144f6",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<iframe width=\"560\" height=\"315\" src=\"https://www.youtube.com/embed/nxhUlpsuj10?si=jUq6W-NtbNoVv0Ev\" title=\"YouTube video player\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen></iframe>\n"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "%%HTML\n",
    "<iframe width=\"560\" height=\"315\" src=\"https://www.youtube.com/embed/nxhUlpsuj10?si=jUq6W-NtbNoVv0Ev\" title=\"YouTube video player\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen></iframe>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d38c977f-6544-4f64-bf8b-92b475f9bc54",
   "metadata": {},
   "source": [
    "<center>\n",
    "\n",
    "<iframe width=\"560\" height=\"315\" src=\"https://www.youtube.com/embed/UpKwt_e02So?si=l_ab6HFYNb0GiqF3\" title=\"YouTube video player\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" allowfullscreen></iframe>\n",
    "\n",
    "</center>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "60b4d9af-6bf2-4ae9-8fd3-dc3fad7f02dc",
   "metadata": {},
   "source": [
    "\n",
    "### Set up Python Libraries\n",
    "\n",
    "As usual you will need to run this code block to import the relevant Python libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "be5b5d06",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set-up Python libraries - you need to run this but you don't need to change it\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import scipy.stats as stats\n",
    "import pandas as pd\n",
    "import seaborn as sns\n",
    "sns.set_theme(style='white')\n",
    "import statsmodels.api as sm\n",
    "import statsmodels.formula.api as smf"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c7fee4fc",
   "metadata": {},
   "source": [
    "### Import a dataset to work with\n",
    "\n",
    "\n",
    "The data will be automatically loaded from the internet when you run this code block:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "ac9c9d93",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>CHARGES</th>\n",
       "      <th>LOS</th>\n",
       "      <th>AGE</th>\n",
       "      <th>SEX</th>\n",
       "      <th>DRG</th>\n",
       "      <th>DIED</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>4752.00</td>\n",
       "      <td>10</td>\n",
       "      <td>79.0</td>\n",
       "      <td>F</td>\n",
       "      <td>122.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>3941.00</td>\n",
       "      <td>6</td>\n",
       "      <td>34.0</td>\n",
       "      <td>F</td>\n",
       "      <td>122.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3657.00</td>\n",
       "      <td>5</td>\n",
       "      <td>76.0</td>\n",
       "      <td>F</td>\n",
       "      <td>122.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1481.00</td>\n",
       "      <td>2</td>\n",
       "      <td>80.0</td>\n",
       "      <td>F</td>\n",
       "      <td>122.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1681.00</td>\n",
       "      <td>1</td>\n",
       "      <td>55.0</td>\n",
       "      <td>M</td>\n",
       "      <td>122.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12839</th>\n",
       "      <td>22603.57</td>\n",
       "      <td>14</td>\n",
       "      <td>79.0</td>\n",
       "      <td>F</td>\n",
       "      <td>121.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12840</th>\n",
       "      <td>NaN</td>\n",
       "      <td>7</td>\n",
       "      <td>91.0</td>\n",
       "      <td>F</td>\n",
       "      <td>121.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12841</th>\n",
       "      <td>14359.14</td>\n",
       "      <td>9</td>\n",
       "      <td>79.0</td>\n",
       "      <td>F</td>\n",
       "      <td>121.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12842</th>\n",
       "      <td>12986.00</td>\n",
       "      <td>5</td>\n",
       "      <td>70.0</td>\n",
       "      <td>M</td>\n",
       "      <td>121.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12843</th>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "      <td>81.0</td>\n",
       "      <td>M</td>\n",
       "      <td>123.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>12844 rows × 6 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "        CHARGES  LOS   AGE SEX    DRG  DIED\n",
       "0       4752.00   10  79.0   F  122.0   0.0\n",
       "1       3941.00    6  34.0   F  122.0   0.0\n",
       "2       3657.00    5  76.0   F  122.0   0.0\n",
       "3       1481.00    2  80.0   F  122.0   0.0\n",
       "4       1681.00    1  55.0   M  122.0   0.0\n",
       "...         ...  ...   ...  ..    ...   ...\n",
       "12839  22603.57   14  79.0   F  121.0   0.0\n",
       "12840       NaN    7  91.0   F  121.0   0.0\n",
       "12841  14359.14    9  79.0   F  121.0   0.0\n",
       "12842  12986.00    5  70.0   M  121.0   0.0\n",
       "12843       NaN    1  81.0   M  123.0   1.0\n",
       "\n",
       "[12844 rows x 6 columns]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "hospital=pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/heartAttack.csv')\n",
    "display(hospital)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2887910e",
   "metadata": {},
   "source": [
    "## Replace only the bad values\n",
    "\n",
    "- `df.replace(old_value, new_value, inplace=True)`\n",
    "- `df.loc[row_index, column_index] = new_value`\n",
    "\n",
    "In most cases the best option is to replace only the bad values, retaining the rest of the record.\n",
    "\n",
    "We will replace the values with `NaN` - 'not a number' which is a dummy value that will be ignored by most Python functions (for example, if we calculate the mean of a column containing NaNs, Pandas just calculates the mean of all the non-NaN values)\n",
    "\n",
    "### Replace a dummy value with `df.replace()`\n",
    "\n",
    "If all the bad datapoints have the same value, for example <tt>9999</tt>, we can easily replace them as follows: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "2438d50c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>CHARGES</th>\n",
       "      <th>LOS</th>\n",
       "      <th>AGE</th>\n",
       "      <th>DRG</th>\n",
       "      <th>DIED</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>12145.000000</td>\n",
       "      <td>12843.000000</td>\n",
       "      <td>12840.000000</td>\n",
       "      <td>12841.000000</td>\n",
       "      <td>12841.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>9879.087615</td>\n",
       "      <td>7.567858</td>\n",
       "      <td>66.288162</td>\n",
       "      <td>121.690523</td>\n",
       "      <td>0.109805</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>6558.399650</td>\n",
       "      <td>5.114357</td>\n",
       "      <td>13.654237</td>\n",
       "      <td>0.658289</td>\n",
       "      <td>0.312658</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>3.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>20.000000</td>\n",
       "      <td>121.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>5422.200000</td>\n",
       "      <td>4.000000</td>\n",
       "      <td>57.000000</td>\n",
       "      <td>121.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>8445.000000</td>\n",
       "      <td>7.000000</td>\n",
       "      <td>67.000000</td>\n",
       "      <td>122.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>12569.040000</td>\n",
       "      <td>10.000000</td>\n",
       "      <td>77.000000</td>\n",
       "      <td>122.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>47910.120000</td>\n",
       "      <td>38.000000</td>\n",
       "      <td>103.000000</td>\n",
       "      <td>123.000000</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            CHARGES           LOS           AGE           DRG          DIED\n",
       "count  12145.000000  12843.000000  12840.000000  12841.000000  12841.000000\n",
       "mean    9879.087615      7.567858     66.288162    121.690523      0.109805\n",
       "std     6558.399650      5.114357     13.654237      0.658289      0.312658\n",
       "min        3.000000      0.000000     20.000000    121.000000      0.000000\n",
       "25%     5422.200000      4.000000     57.000000    121.000000      0.000000\n",
       "50%     8445.000000      7.000000     67.000000    122.000000      0.000000\n",
       "75%    12569.040000     10.000000     77.000000    122.000000      0.000000\n",
       "max    47910.120000     38.000000    103.000000    123.000000      1.000000"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# note use argument inplace=True to edit the original dataframe\n",
    "hospital.LOS.replace(9999, np.NaN, inplace=True) \n",
    "hospital.AGE.replace(9999, np.NaN, inplace=True) \n",
    "\n",
    "# check they have gone - max values for LOS and AGE should no longer be 9999\n",
    "hospital.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b293acd6",
   "metadata": {},
   "source": [
    "### Replace values over a cutoff with df.loc[]\n",
    "\n",
    "In some cases we want to replace a range of values; say for example we decided to remove charges over 30000 dollars.\n",
    "\n",
    "To do this we unfortunately have to use different syntax from our regular indexing with `df.query()`, as we are *setting* values in the dataframe. We use the function `df.loc[]` which accesses specific *locations* in the dataframe defined by row and column numbers.\n",
    "\n",
    "Say if I wanted to replace the value of <tt>SEX</tt> in the first row to <tt>'bananas'</tt> (!):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "defa7bd5",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>CHARGES</th>\n",
       "      <th>LOS</th>\n",
       "      <th>AGE</th>\n",
       "      <th>SEX</th>\n",
       "      <th>DRG</th>\n",
       "      <th>DIED</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>4752.0</td>\n",
       "      <td>10.0</td>\n",
       "      <td>79.0</td>\n",
       "      <td>bananas</td>\n",
       "      <td>122.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>3941.0</td>\n",
       "      <td>6.0</td>\n",
       "      <td>34.0</td>\n",
       "      <td>F</td>\n",
       "      <td>122.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3657.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>76.0</td>\n",
       "      <td>F</td>\n",
       "      <td>122.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1481.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>80.0</td>\n",
       "      <td>F</td>\n",
       "      <td>122.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1681.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>55.0</td>\n",
       "      <td>M</td>\n",
       "      <td>122.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   CHARGES   LOS   AGE      SEX    DRG  DIED\n",
       "0   4752.0  10.0  79.0  bananas  122.0   0.0\n",
       "1   3941.0   6.0  34.0        F  122.0   0.0\n",
       "2   3657.0   5.0  76.0        F  122.0   0.0\n",
       "3   1481.0   2.0  80.0        F  122.0   0.0\n",
       "4   1681.0   1.0  55.0        M  122.0   0.0"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "hospital.loc[0,'SEX']='bananas' # remember row zero is the first row in Python!\n",
    "hospital.head() # check it worked"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d1db17ad",
   "metadata": {},
   "source": [
    "Now, instead of giving a row number, I can use a criterion (<tt>hospital.CHARGES > 30000</tt>) to find the rows I want to edit:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "e1f02d13",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>CHARGES</th>\n",
       "      <th>LOS</th>\n",
       "      <th>AGE</th>\n",
       "      <th>DRG</th>\n",
       "      <th>DIED</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>11928.000000</td>\n",
       "      <td>12843.000000</td>\n",
       "      <td>12840.000000</td>\n",
       "      <td>12841.000000</td>\n",
       "      <td>12841.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>9402.119925</td>\n",
       "      <td>7.567858</td>\n",
       "      <td>66.288162</td>\n",
       "      <td>121.690523</td>\n",
       "      <td>0.109805</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>5537.499427</td>\n",
       "      <td>5.114357</td>\n",
       "      <td>13.654237</td>\n",
       "      <td>0.658289</td>\n",
       "      <td>0.312658</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>3.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>20.000000</td>\n",
       "      <td>121.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>5369.150000</td>\n",
       "      <td>4.000000</td>\n",
       "      <td>57.000000</td>\n",
       "      <td>121.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>8329.775000</td>\n",
       "      <td>7.000000</td>\n",
       "      <td>67.000000</td>\n",
       "      <td>122.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>12278.375000</td>\n",
       "      <td>10.000000</td>\n",
       "      <td>77.000000</td>\n",
       "      <td>122.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>30000.000000</td>\n",
       "      <td>38.000000</td>\n",
       "      <td>103.000000</td>\n",
       "      <td>123.000000</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            CHARGES           LOS           AGE           DRG          DIED\n",
       "count  11928.000000  12843.000000  12840.000000  12841.000000  12841.000000\n",
       "mean    9402.119925      7.567858     66.288162    121.690523      0.109805\n",
       "std     5537.499427      5.114357     13.654237      0.658289      0.312658\n",
       "min        3.000000      0.000000     20.000000    121.000000      0.000000\n",
       "25%     5369.150000      4.000000     57.000000    121.000000      0.000000\n",
       "50%     8329.775000      7.000000     67.000000    122.000000      0.000000\n",
       "75%    12278.375000     10.000000     77.000000    122.000000      0.000000\n",
       "max    30000.000000     38.000000    103.000000    123.000000      1.000000"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "hospital.loc[(hospital.CHARGES > 30000),'CHARGES']=np.nan # remember row zero is the first row in Python!\n",
    "hospital.describe() # check it worked"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ff75acda",
   "metadata": {},
   "source": [
    "* **Note** the maximum value of CHARGES should now be 30000 dollars"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f097b844",
   "metadata": {},
   "source": [
    "### Replace the whole record\n",
    "\n",
    "Occasionally we decide a whole record or row of the data table (often corresponding to an individual) should be replaced with `NaN`.\n",
    "\n",
    "Some situations in which we would replace the whole record with `NaN` would be:\n",
    "\n",
    "* We realise after the fact that a participant didn't meet includion criteria for our study - for example, we are studying unmedicated patients witha  certain condition, and they disclosed after data collection that they are already on medication\n",
    "* They have bad values for several variables\n",
    "* we wish to only retain records that are complete\n",
    "\n",
    "We can replace the whole record with `NaN` using `df.loc[]` and simply not specifying a column. For example, let's make the whole second row of the dataframe become `NaN`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "111d184b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>CHARGES</th>\n",
       "      <th>LOS</th>\n",
       "      <th>AGE</th>\n",
       "      <th>SEX</th>\n",
       "      <th>DRG</th>\n",
       "      <th>DIED</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>4752.0</td>\n",
       "      <td>10.0</td>\n",
       "      <td>79.0</td>\n",
       "      <td>bananas</td>\n",
       "      <td>122.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3657.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>76.0</td>\n",
       "      <td>F</td>\n",
       "      <td>122.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1481.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>80.0</td>\n",
       "      <td>F</td>\n",
       "      <td>122.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1681.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>55.0</td>\n",
       "      <td>M</td>\n",
       "      <td>122.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   CHARGES   LOS   AGE      SEX    DRG  DIED\n",
       "0   4752.0  10.0  79.0  bananas  122.0   0.0\n",
       "1      NaN   NaN   NaN      NaN    NaN   NaN\n",
       "2   3657.0   5.0  76.0        F  122.0   0.0\n",
       "3   1481.0   2.0  80.0        F  122.0   0.0\n",
       "4   1681.0   1.0  55.0        M  122.0   0.0"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "hospital.loc[1] = np.nan # remember we count from zero in Python so second row = row 1 !\n",
    "hospital.head() # check it worked"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "676aeffe",
   "metadata": {},
   "source": [
    "Or say we want to replace the whole record for anyone with <tt>CHARGES</tt> under 100 dollars with `NaN`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "c8a82645",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>CHARGES</th>\n",
       "      <th>LOS</th>\n",
       "      <th>AGE</th>\n",
       "      <th>DRG</th>\n",
       "      <th>DIED</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>11919.000000</td>\n",
       "      <td>12834.000000</td>\n",
       "      <td>12831.000000</td>\n",
       "      <td>12832.000000</td>\n",
       "      <td>12832.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>9408.858857</td>\n",
       "      <td>7.568568</td>\n",
       "      <td>66.294053</td>\n",
       "      <td>121.689994</td>\n",
       "      <td>0.109570</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>5534.051880</td>\n",
       "      <td>5.113899</td>\n",
       "      <td>13.651050</td>\n",
       "      <td>0.658085</td>\n",
       "      <td>0.312365</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>101.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>20.000000</td>\n",
       "      <td>121.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>5374.960000</td>\n",
       "      <td>4.000000</td>\n",
       "      <td>57.000000</td>\n",
       "      <td>121.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>8334.000000</td>\n",
       "      <td>7.000000</td>\n",
       "      <td>67.000000</td>\n",
       "      <td>122.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>12285.550000</td>\n",
       "      <td>10.000000</td>\n",
       "      <td>77.000000</td>\n",
       "      <td>122.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>30000.000000</td>\n",
       "      <td>38.000000</td>\n",
       "      <td>103.000000</td>\n",
       "      <td>123.000000</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            CHARGES           LOS           AGE           DRG          DIED\n",
       "count  11919.000000  12834.000000  12831.000000  12832.000000  12832.000000\n",
       "mean    9408.858857      7.568568     66.294053    121.689994      0.109570\n",
       "std     5534.051880      5.113899     13.651050      0.658085      0.312365\n",
       "min      101.000000      0.000000     20.000000    121.000000      0.000000\n",
       "25%     5374.960000      4.000000     57.000000    121.000000      0.000000\n",
       "50%     8334.000000      7.000000     67.000000    122.000000      0.000000\n",
       "75%    12285.550000     10.000000     77.000000    122.000000      0.000000\n",
       "max    30000.000000     38.000000    103.000000    123.000000      1.000000"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "hospital.loc[(hospital.CHARGES<100)] = np.nan # remember we count from zero in Python so second row = row 1 !\n",
    "hospital.describe() # check it worked - min charge should now be >= $100"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3d61fb43",
   "metadata": {},
   "source": [
    "### Drop the record from the dataframe\n",
    "\n",
    "* `df.drop(row_index)`\n",
    "\n",
    "**Generally not recommended** - in general we replace missing data with `NaN` so that it is transparent how many bad datapoints were collected. \n",
    "\n",
    "Imagine if we conducted a survey including a question \"have you ever committed a crime\". Perhaps some people would not answer this question (and indeed this may be more likely if the answer is 'yes'). \n",
    "\n",
    "If we deleted all the records for people who skipped the question, this could be misleading.\n",
    "\n",
    "**However** sometimes we need to just remove rows. An example would be if a pile of paper forms were automatically scanned, and it turned out a lot of the were completely blank (spare forms). We would want to delete the records corresponding to those spare forms.\n",
    "\n",
    "Another example would be if a record was an exact duplicate of another (if the same form was scanned twice, for example)\n",
    "\n",
    "To remove records (rows) we use `df.drop()`. \n",
    "\n",
    "For example, perhaps we want to remove row 0 (the first row), as we this it is a prank entry (<tt>SEX == 'bananas'</tt>):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "7c9df25f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>CHARGES</th>\n",
       "      <th>LOS</th>\n",
       "      <th>AGE</th>\n",
       "      <th>SEX</th>\n",
       "      <th>DRG</th>\n",
       "      <th>DIED</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3657.00</td>\n",
       "      <td>5.0</td>\n",
       "      <td>76.0</td>\n",
       "      <td>F</td>\n",
       "      <td>122.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1481.00</td>\n",
       "      <td>2.0</td>\n",
       "      <td>80.0</td>\n",
       "      <td>F</td>\n",
       "      <td>122.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1681.00</td>\n",
       "      <td>1.0</td>\n",
       "      <td>55.0</td>\n",
       "      <td>M</td>\n",
       "      <td>122.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>6378.64</td>\n",
       "      <td>9.0</td>\n",
       "      <td>84.0</td>\n",
       "      <td>M</td>\n",
       "      <td>121.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12839</th>\n",
       "      <td>22603.57</td>\n",
       "      <td>14.0</td>\n",
       "      <td>79.0</td>\n",
       "      <td>F</td>\n",
       "      <td>121.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12840</th>\n",
       "      <td>NaN</td>\n",
       "      <td>7.0</td>\n",
       "      <td>91.0</td>\n",
       "      <td>F</td>\n",
       "      <td>121.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12841</th>\n",
       "      <td>14359.14</td>\n",
       "      <td>9.0</td>\n",
       "      <td>79.0</td>\n",
       "      <td>F</td>\n",
       "      <td>121.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12842</th>\n",
       "      <td>12986.00</td>\n",
       "      <td>5.0</td>\n",
       "      <td>70.0</td>\n",
       "      <td>M</td>\n",
       "      <td>121.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12843</th>\n",
       "      <td>NaN</td>\n",
       "      <td>1.0</td>\n",
       "      <td>81.0</td>\n",
       "      <td>M</td>\n",
       "      <td>123.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>12843 rows × 6 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "        CHARGES   LOS   AGE  SEX    DRG  DIED\n",
       "1           NaN   NaN   NaN  NaN    NaN   NaN\n",
       "2       3657.00   5.0  76.0    F  122.0   0.0\n",
       "3       1481.00   2.0  80.0    F  122.0   0.0\n",
       "4       1681.00   1.0  55.0    M  122.0   0.0\n",
       "5       6378.64   9.0  84.0    M  121.0   0.0\n",
       "...         ...   ...   ...  ...    ...   ...\n",
       "12839  22603.57  14.0  79.0    F  121.0   0.0\n",
       "12840       NaN   7.0  91.0    F  121.0   0.0\n",
       "12841  14359.14   9.0  79.0    F  121.0   0.0\n",
       "12842  12986.00   5.0  70.0    M  121.0   0.0\n",
       "12843       NaN   1.0  81.0    M  123.0   1.0\n",
       "\n",
       "[12843 rows x 6 columns]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "hospital.drop(0, inplace=True)\n",
    "hospital # check it's gone"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "77f37fe7",
   "metadata": {},
   "source": [
    "### Exercises\n",
    "\n",
    "Complete the following exercises to get used to the syntax for replacing values.\n",
    "It's boring but if you master it now it will save some stress later on."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "f7c5f79c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 1. For all cases where AGE is greater than 120, replace the value of AGE with `NaN`\n",
    "# Reload the data:\n",
    "hospital=pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/heartAttack.csv')\n",
    "\n",
    "# YOUR CODE HERE\n",
    "\n",
    "# uncomment the next line to check it worked\n",
    "#hospital.sort_values(by='AGE')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "da554730",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 2. For all cases where AGE is greater than 120, replace the entire record with `NaN`\n",
    "# Reload the data:\n",
    "hospital=pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/heartAttack.csv')\n",
    "\n",
    "# YOUR CODE HERE\n",
    "\n",
    "# uncomment the next line to check it worked\n",
    "#hospital.sort_values(by='AGE')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "c66b460b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 3. For all values 9999 in all columns, replace the 9999 with NaN\n",
    "# Reload the data:\n",
    "hospital=pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/heartAttack.csv')\n",
    "\n",
    "# YOUR CODE HERE\n",
    "\n",
    "# uncomment the next line to check it worked\n",
    "# hospital.sort_values(by='LOS')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "ddd656b9",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 4. Replace the code DRG = 123 with the string 'DIED'\n",
    "# Reload the data:\n",
    "hospital=pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/heartAttack.csv')\n",
    "\n",
    "# YOUR CODE HERE\n",
    "\n",
    "# uncomment the next line to check it worked\n",
    "# hospital"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0b039b83",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}