{ "cells": [ { "cell_type": "markdown", "id": "5c0aae22", "metadata": {}, "source": [ "# Locating bad datapoints\n", "\n", "In this notebook we cover:\n", "* Approaches for **locating outliers**\n", "* Replacing missing data with **NaN** (how to do it; what choices we need to make)\n", "\n", "\n", "Outliers, by definition, have extreme values (very large or very small values). Therefore, if a dataset contains outliers, they can distort the calculated values of statistics such as the mean and standard deviation.\n", "\n", "In real datasets, outliers are common, often arising from one of the following causes:\n", "\n", "* Real but unusual values \n", " * many basketball players are outliers in terms of height\n", "* Noise in a data recording system \n", " * in brain imaging data, motion artefacts generated by head movements (MRI) or blinks (EEG) are much larger than the real brain activity we are trying to record\n", "* Data entry error \n", " * a human types the wrong number, uses the wrong units or misplaces a decimal point\n", "* Dummy values\n", " * in some datasets an 'obviously wrong' numerical value, such as 9999, is used to indicate a missing datapoint\n", "\n", "Identifying and removing outliers and bad data points is a crucial step in preparing our data for analysis, a process sometimes called *data wrangling*.\n", "\n", "\n", "## Set up Python Libraries\n", "\n", "As usual, you will need to run this code block to import the relevant Python libraries." ] }, { "cell_type": "code", "execution_count": 2, "id": "0e9731e8", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Set-up Python libraries - you need to run this but you don't need to change it\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import scipy.stats as stats\n", "import pandas as pd\n", "import seaborn as sns\n", "sns.set_theme(style='white')\n", "import statsmodels.api as sm\n", "import statsmodels.formula.api as smf" ] }, { "cell_type": "markdown", "id": "5b538a34", 
"metadata": {}, "source": [ "## Import a dataset to work with\n", "\n", "We will work with the file heartAttack.csv, which contains data on several thousand patients admitted to hospital in New York City, diagnosed with a heart attack.\n", "\n", "From this dataset we can explore how demographic and disease factors affect the duration of stay in hospital and the dollar cost of treatment.\n", "\n", "The dataset was downloaded with thanks (and with slight modifications for teaching purposes) from the website DASL (Data and Story Library).\n", "\n", "The data will be automatically loaded from the internet when you run this code block:" ] }, { "cell_type": "code", "execution_count": 3, "id": "659aba7e", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
| | CHARGES | LOS | AGE | SEX | DRG | DIED |
|---|---|---|---|---|---|---|
| 0 | 4752.00 | 10 | 79.0 | F | 122.0 | 0.0 |
| 1 | 3941.00 | 6 | 34.0 | F | 122.0 | 0.0 |
| 2 | 3657.00 | 5 | 76.0 | F | 122.0 | 0.0 |
| 3 | 1481.00 | 2 | 80.0 | F | 122.0 | 0.0 |
| 4 | 1681.00 | 1 | 55.0 | M | 122.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 12839 | 22603.57 | 14 | 79.0 | F | 121.0 | 0.0 |
| 12840 | NaN | 7 | 91.0 | F | 121.0 | 0.0 |
| 12841 | 14359.14 | 9 | 79.0 | F | 121.0 | 0.0 |
| 12842 | 12986.00 | 5 | 70.0 | M | 121.0 | 0.0 |
| 12843 | NaN | 1 | 81.0 | M | 123.0 | 1.0 |

12844 rows × 6 columns

| | CHARGES | LOS | AGE | DRG | DIED |
|---|---|---|---|---|---|
| count | 12145.000000 | 12844.000000 | 12842.000000 | 12841.000000 | 12841.000000 |
| mean | 9879.087615 | 8.345765 | 67.116726 | 121.690523 | 0.109805 |
| std | 6558.399650 | 88.309430 | 88.925998 | 0.658289 | 0.312658 |
| min | 3.000000 | 0.000000 | 20.000000 | 121.000000 | 0.000000 |
| 25% | 5422.200000 | 4.000000 | 57.000000 | 121.000000 | 0.000000 |
| 50% | 8445.000000 | 7.000000 | 67.000000 | 122.000000 | 0.000000 |
| 75% | 12569.040000 | 10.000000 | 77.000000 | 122.000000 | 0.000000 |
| max | 47910.120000 | 9999.000000 | 9999.000000 | 123.000000 | 1.000000 |