3.5. Handling NaNs#
NaN (Not a Number) is a special value used to indicate missing data in many scientific programming languages.
Using NaN instead of a numerical dummy value like 9999 or 0 is helpful because Python functions either ignore NaNs by default, or can be set to ignore NaNs using an optional function argument. For example, if you take the mean of a column that includes 9999 as a dummy value, those 9999s will be included in the calculation of the mean; but if the dummy value is NaN, they will not.
In this section we will review:
Why NaN is better than a numerical dummy value
How to check for NaNs in a dataframe
Setting the NaN-handling in Python functions
Set up Python Libraries
As usual, you will need to run this code block to import the relevant Python libraries:
# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import seaborn as sns
sns.set_theme(style='white')
import statsmodels.api as sm
import statsmodels.formula.api as smf
3.5.1. Import a dataset to work with#
We again work with the NYC heart attack dataset.
The data will be automatically loaded from the internet when you run this code block:
hospital=pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/heartAttack.csv')
display(hospital)
| | CHARGES | LOS | AGE | SEX | DRG | DIED |
---|---|---|---|---|---|---|
0 | 4752.00 | 10 | 79.0 | F | 122.0 | 0.0 |
1 | 3941.00 | 6 | 34.0 | F | 122.0 | 0.0 |
2 | 3657.00 | 5 | 76.0 | F | 122.0 | 0.0 |
3 | 1481.00 | 2 | 80.0 | F | 122.0 | 0.0 |
4 | 1681.00 | 1 | 55.0 | M | 122.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... |
12839 | 22603.57 | 14 | 79.0 | F | 121.0 | 0.0 |
12840 | NaN | 7 | 91.0 | F | 121.0 | 0.0 |
12841 | 14359.14 | 9 | 79.0 | F | 121.0 | 0.0 |
12842 | 12986.00 | 5 | 70.0 | M | 121.0 | 0.0 |
12843 | NaN | 1 | 81.0 | M | 123.0 | 1.0 |
12844 rows × 6 columns
3.5.2. NaN is not a number!#
Humans may recognise dummy values like 9999 for what they are, but the computer will treat them as numbers.
Say we want to find the mean and standard deviation of the age of patients in our hospital dataset (remembering that missing data were coded as 9999):
print(hospital.AGE.mean())
print(hospital.AGE.std())
67.83507241862638
124.70055361883249
Think: is the value for the standard deviation realistic?
These values include the 9999s just as if there were really people 9999 years old in the sample.
If we replace the 9999s with NaN, we get the correct mean and standard deviation for the ‘real’ values, excluding the missing data:
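# assign the result back to the column (inplace=True on a single column triggers a chained-assignment warning in recent pandas versions)
hospital['AGE'] = hospital['AGE'].replace(9999, np.nan)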
print(hospital.AGE.mean())
print(hospital.AGE.std())
66.28816199376946
13.654236726825335
The mean has changed slightly, and the standard deviation is now much more reasonable.
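The same conversion can also be done at the moment the file is read, using the na_values argument of pd.read_csv. The sketch below is only an illustration: the small CSV text and the names csv_text, raw and clean are made up for the example and are not part of the hospital dataset.

```python
import io

import numpy as np
import pandas as pd

# A tiny made-up CSV in which 9999 stands for a missing age (illustrative data only)
csv_text = "AGE,LOS\n70,3\n9999,5\n80,2\n9999,7\n60,1\n"

# Read naively: 9999 is treated as a real age
raw = pd.read_csv(io.StringIO(csv_text))

# Tell pandas that 9999 in the AGE column means 'missing'
clean = pd.read_csv(io.StringIO(csv_text), na_values={"AGE": [9999]})

print(raw.AGE.mean())          # includes the two 9999s
print(clean.AGE.mean())        # ignores the two NaNs
print(clean.AGE.isna().sum())  # number of values converted to NaN
```

Passing a dictionary to na_values restricts the conversion to the named column, so a genuine value of 9999 in another column (for example a hospital charge) would not be turned into NaN by accident.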
3.5.3. Creating NaNs#
If we want to set a value to NaN, we can’t just type NaN or ”NaN”.
Instead, we ‘create’ the value NaN using the numpy constant np.nan, for example:
hospital.loc[1, 'CHARGES']=np.nan # set the value of CHARGES in the row with index 1 (the second row) to be NaN
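To see why the text ”NaN” is not a substitute for np.nan, here is a minimal sketch using two made-up series (ages and labels are illustrative names, not columns of the hospital data). It also shows that NaN is stored as a floating-point value and is not equal to anything, including itself, which is why we test for it with isna() rather than ==.

```python
import numpy as np
import pandas as pd

ages = pd.Series([70.0, 80.0, 60.0])     # made-up values for illustration
ages.loc[1] = np.nan                     # a genuine missing value

labels = pd.Series(["70", "NaN", "60"])  # here "NaN" is just a piece of text

print(type(np.nan))         # NaN is stored as a floating-point value
print(np.nan == np.nan)     # NaN is not equal to anything, even itself
print(ages.isna().sum())    # the genuine NaN is detected as missing
print(labels.isna().sum())  # the string "NaN" is not detected as missing
```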
3.5.4. Check for NaNs#
df.isna()
df.isna().sum()
NaNs are ignored by many Python functions; however, you may still want to know if there were any (and how many) in any given set of data.
To check for missing values, coded as NaN, we use the function df.isna():
hospital.AGE.isna()
0 False
1 False
2 False
3 False
4 False
...
12839 False
12840 False
12841 False
12842 False
12843 False
Name: AGE, Length: 12844, dtype: bool
df.isna() returned a column with True or False for each value of AGE: True for people whose age is coded as NaN and False otherwise.
This isn’t very readable, but if we want to know how many NaNs were in the column, we can use a trick: Python treats True as 1 and False as 0. So if we just take the sum of the column, we get the total number of NaNs:
hospital.AGE.isna().sum()
4
Four people’s ages were coded as NaN.
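The same idea extends to a whole dataframe. The following sketch uses a small made-up dataframe (toy is a hypothetical name, and the values are invented) to count the NaNs in every column at once, pick out the rows where a particular column is missing, and count the valid values.

```python
import numpy as np
import pandas as pd

# Small made-up dataframe (not the hospital data)
toy = pd.DataFrame({
    "AGE": [79.0, np.nan, 76.0, 80.0],
    "CHARGES": [4752.0, 3941.0, np.nan, np.nan],
})

nan_counts = toy.isna().sum()           # number of NaNs in each column
rows_missing_age = toy[toy.AGE.isna()]  # only the rows where AGE is NaN
n_valid_age = toy.AGE.count()           # count() counts the non-NaN values

print(nan_counts)
print(rows_missing_age)
print(n_valid_age)
```

Looking at the rows with missing values, rather than only counting them, can help you judge whether the data are missing for a systematic reason.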
3.5.5. NaN handling by Python functions#
Many Python functions automatically ignore NaNs.
These include:
df.mean()
df.std()
df.quantile()
… and most other descriptive statistics
sns.histplot()
sns.scatterplot()
… and most other Seaborn and Matplotlib functions
However, some functions do not automatically ignore NaNs, and instead will give an error message, or return the value NaN, if the input data contains NaNs.
This includes a lot of functions from the library scipy.stats, which we will use later in the course. For example, say I want to use a \(t\)-test to ask if the male patients are older than the females (don’t worry if you don’t yet know what a \(t\)-test is; this will make sense when you return to it for revision):
stats.ttest_ind(hospital.query('SEX == "M"').AGE, hospital.query('SEX == "F"').AGE)
Ttest_indResult(statistic=nan, pvalue=nan)
The function stats.ttest_ind() performs an independent samples \(t\)-test between the two samples we gave it (the ages of male and female patients) and should return a \(t\)-value (statistic) and a \(p\) value (pvalue).
Right now both of these are NaN because the NaNs in the input were not ignored.
We can tell the function stats.ttest_ind() to ignore NaNs, using the argument nan_policy='omit':
stats.ttest_ind(hospital.query('SEX == "M"').AGE, hospital.query('SEX == "F"').AGE, nan_policy='omit')
Ttest_indResult(statistic=-35.41617555682539, pvalue=3.1864909732541125e-262)
Now we have actual values instead of NaN: \(t = -35.4\) and \(p = 3.2 \times 10^{-262}\) (a very small number).
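To make the effect of nan_policy='omit' concrete, the sketch below compares it with removing the NaNs by hand before running the test. The two small groups are made-up numbers chosen only for illustration (group_a and group_b are hypothetical names), so the results say nothing about the hospital data.

```python
import numpy as np
from scipy import stats

# Made-up ages for two small groups; one value in each group is missing
group_a = np.array([61.0, 70.0, np.nan, 65.0, 72.0])
group_b = np.array([68.0, np.nan, 75.0, 71.0, 80.0])

# Default behaviour: the NaNs propagate into the result
res_default = stats.ttest_ind(group_a, group_b)

# nan_policy='omit': NaNs are dropped before the test is computed
res_omit = stats.ttest_ind(group_a, group_b, nan_policy='omit')

# Dropping the NaNs by hand first should give the same answer
res_manual = stats.ttest_ind(group_a[~np.isnan(group_a)], group_b[~np.isnan(group_b)])

print(res_default.statistic, res_default.pvalue)
print(res_omit.statistic, res_omit.pvalue)
print(res_manual.statistic, res_manual.pvalue)
```

The second and third lines of output should agree, which shows that 'omit' simply leaves the missing values out of each sample.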
If you run a Python function and the output is NaN, you very probably need to change how the function handles NaNs using an argument. Check the function’s help page online to get the correct syntax.
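The way NaN-handling is controlled differs between libraries, which is why checking the help page matters. As a minimal sketch with made-up numbers (values and s are illustrative names), NumPy offers separate NaN-aware functions such as np.nanmean, whereas pandas uses the skipna argument:

```python
import numpy as np
import pandas as pd

values = np.array([2.0, 4.0, np.nan, 6.0])  # made-up data with one NaN

print(np.mean(values))     # nan: NumPy's plain mean does not skip NaNs
print(np.nanmean(values))  # 4.0: the NaN-aware version skips them

s = pd.Series(values)
print(s.mean())              # 4.0: pandas skips NaNs by default
print(s.mean(skipna=False))  # nan: unless we ask it not to
```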