3.4. Neutralizing bad datapoints#

Once you have found dummy values (such as 9999), bad datapoints, or outliers, you will want to neutralize them in some way.

You have three main options. Most often, the first option is best.

  • Replace the bad values with a dummy value (such as NaN)

    • This retains as much information as possible

  • Replace the whole record (row of the dataframe) with a dummy value (such as NaN)

    • Useful if the bad value for one variable casts doubt on the data quality for the whole record, or for some reason you need to retain only complete records

  • Delete the whole record (row of the dataframe)

    • Generally not recommended, as it could be seen as dishonest (see below)

    • An exception would be obvious duplicate records or completely blank records

3.4.1. Set up Python Libraries#

As usual you will need to run this code block to import the relevant Python libraries.

# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import seaborn as sns
sns.set_theme(style='white')
import statsmodels.api as sm
import statsmodels.formula.api as smf

3.4.2. Import a dataset to work with#

The data will be automatically loaded from the internet when you run this code block:

hospital=pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/heartAttack.csv')
display(hospital)
CHARGES LOS AGE SEX DRG DIED
0 4752.00 10 79.0 F 122.0 0.0
1 3941.00 6 34.0 F 122.0 0.0
2 3657.00 5 76.0 F 122.0 0.0
3 1481.00 2 80.0 F 122.0 0.0
4 1681.00 1 55.0 M 122.0 0.0
... ... ... ... ... ... ...
12839 22603.57 14 79.0 F 121.0 0.0
12840 NaN 7 91.0 F 121.0 0.0
12841 14359.14 9 79.0 F 121.0 0.0
12842 12986.00 5 70.0 M 121.0 0.0
12843 NaN 1 81.0 M 123.0 1.0

12844 rows × 6 columns
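
Before neutralizing anything, it can be worth reminding yourself where the problems are; an implausible maximum in the summary statistics is a giveaway that a dummy value is present. A quick check:

hospital.describe() # implausibly large max values (such as 9999) suggest a dummy code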

3.4.3. Replace only the bad values#

  • df.replace(old_value, new_value, inplace=True)

  • df.loc[row_index, column_index] = new_value

In most cases the best option is to replace only the bad values, retaining the rest of the record.

We will replace the values with NaN ('not a number'), a dummy value that is ignored by most Pandas functions (for example, if we calculate the mean of a column containing NaNs, Pandas just calculates the mean of all the non-NaN values).
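
As a minimal illustration of that behaviour (the toy Series here is made up for demonstration):

# Pandas skips NaNs when computing summary statistics
pd.Series([1.0, 2.0, np.nan, 3.0]).mean() # returns 2.0, the mean of the three non-NaN values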

Replace a dummy value with df.replace()#

If all the bad datapoints have the same value, for example 9999, we can easily replace them as follows:

# note: assign the result back to each column to edit the original dataframe
# (editing a single column with inplace=True is unreliable in newer versions of Pandas)
hospital['LOS'] = hospital['LOS'].replace(9999, np.nan)
hospital['AGE'] = hospital['AGE'].replace(9999, np.nan)

# check they have gone - max values for LOS and AGE should no longer be 9999
hospital.describe()
CHARGES LOS AGE DRG DIED
count 12145.000000 12843.000000 12840.000000 12841.000000 12841.000000
mean 9879.087615 7.567858 66.288162 121.690523 0.109805
std 6558.399650 5.114357 13.654237 0.658289 0.312658
min 3.000000 0.000000 20.000000 121.000000 0.000000
25% 5422.200000 4.000000 57.000000 121.000000 0.000000
50% 8445.000000 7.000000 67.000000 122.000000 0.000000
75% 12569.040000 10.000000 77.000000 122.000000 0.000000
max 47910.120000 38.000000 103.000000 123.000000 1.000000
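
To check how many values were replaced, we can count the NaNs in each column using the standard Pandas methods isna() and sum():

# count the NaNs in each column - LOS and AGE should have gained some
hospital.isna().sum()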

Replace values over a cutoff with df.loc[]#

In some cases we want to replace a range of values; suppose, for example, we decide to remove charges over 30000 dollars.

To do this we unfortunately have to use different syntax from our regular indexing with df.query(), as we are setting values in the dataframe. We use the indexer df.loc[], which accesses specific locations in the dataframe defined by row and column labels.

Say I wanted to replace the value of SEX in the first row with 'bananas' (!):

hospital.loc[0,'SEX']='bananas' # remember row zero is the first row in Python!
hospital.head() # check it worked
CHARGES LOS AGE SEX DRG DIED
0 4752.0 10.0 79.0 bananas 122.0 0.0
1 3941.0 6.0 34.0 F 122.0 0.0
2 3657.0 5.0 76.0 F 122.0 0.0
3 1481.0 2.0 80.0 F 122.0 0.0
4 1681.0 1.0 55.0 M 122.0 0.0

Now, instead of giving a row number, I can use a criterion (hospital.CHARGES > 30000) to find the rows I want to edit:

hospital.loc[(hospital.CHARGES > 30000), 'CHARGES'] = np.nan # replace all CHARGES over 30000 with NaN
hospital.describe() # check it worked
CHARGES LOS AGE DRG DIED
count 11928.000000 12843.000000 12840.000000 12841.000000 12841.000000
mean 9402.119925 7.567858 66.288162 121.690523 0.109805
std 5537.499427 5.114357 13.654237 0.658289 0.312658
min 3.000000 0.000000 20.000000 121.000000 0.000000
25% 5369.150000 4.000000 57.000000 121.000000 0.000000
50% 8329.775000 7.000000 67.000000 122.000000 0.000000
75% 12278.375000 10.000000 77.000000 122.000000 0.000000
max 30000.000000 38.000000 103.000000 123.000000 1.000000
  • Note the maximum value of CHARGES should now be no more than 30000 dollars
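
The same df.loc[] pattern works with compound criteria. As a sketch (the age cutoffs here are invented for illustration, not taken from this dataset):

# replace AGE values outside a plausible range with NaN (cutoffs made up for illustration)
hospital.loc[(hospital.AGE < 18) | (hospital.AGE > 110), 'AGE'] = np.nan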

Replace the whole record#

Occasionally we decide a whole record or row of the data table (often corresponding to an individual) should be replaced with NaN.

Some situations in which we would replace the whole record with NaN are:

  • We realise after the fact that a participant didn't meet inclusion criteria for our study - for example, we are studying unmedicated patients with a certain condition, and they disclosed after data collection that they are already on medication

  • They have bad values for several variables

  • We wish to retain only complete records

We can replace the whole record with NaN using df.loc[] and simply not specifying a column. For example, let's make the whole second row of the dataframe become NaN:

hospital.loc[1] = np.nan # remember we count from zero in Python so second row = row 1 !
hospital.head() # check it worked
CHARGES LOS AGE SEX DRG DIED
0 4752.0 10.0 79.0 bananas 122.0 0.0
1 NaN NaN NaN NaN NaN NaN
2 3657.0 5.0 76.0 F 122.0 0.0
3 1481.0 2.0 80.0 F 122.0 0.0
4 1681.0 1.0 55.0 M 122.0 0.0
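
If you later want to check how many records have been blanked out like this, one option using standard Pandas methods is:

# count rows in which every value is NaN
hospital.isna().all(axis=1).sum()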

Or say we want to replace the whole record for anyone with CHARGES under 100 dollars with NaN:

hospital.loc[(hospital.CHARGES < 100)] = np.nan # replace the whole record with NaN where CHARGES is under 100
hospital.describe() # check it worked - min charge should now be >= $100
CHARGES LOS AGE DRG DIED
count 11919.000000 12834.000000 12831.000000 12832.000000 12832.000000
mean 9408.858857 7.568568 66.294053 121.689994 0.109570
std 5534.051880 5.113899 13.651050 0.658085 0.312365
min 101.000000 0.000000 20.000000 121.000000 0.000000
25% 5374.960000 4.000000 57.000000 121.000000 0.000000
50% 8334.000000 7.000000 67.000000 122.000000 0.000000
75% 12285.550000 10.000000 77.000000 122.000000 0.000000
max 30000.000000 38.000000 103.000000 123.000000 1.000000
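
Relatedly, if you need to work only with complete records, df.dropna() returns a copy of the dataframe with all NaN-containing rows removed. Here it is assigned to a new variable (hospital_complete, a name chosen just for this example) so the original dataframe stays intact:

# keep only rows with no NaN in any column
hospital_complete = hospital.dropna()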

Drop the record from the dataframe#

  • df.drop(row_index)

Generally not recommended - we prefer to replace bad data with NaN so that it is transparent how many bad datapoints were collected.

Imagine if we conducted a survey including the question "have you ever committed a crime?". Perhaps some people would not answer this question (and indeed this may be more likely if the answer is 'yes').

If we deleted all the records for people who skipped the question, this could be misleading.

However, sometimes we need to just remove rows. An example would be if a pile of paper forms were automatically scanned, and it turned out a lot of them were completely blank (spare forms). We would want to delete the records corresponding to those spare forms.

Another example would be if a record was an exact duplicate of another (if the same form was scanned twice, for example).
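
For the duplicate case, Pandas has a built-in method, df.drop_duplicates(), which removes rows that are exact copies of an earlier row:

# drop rows that exactly duplicate an earlier row (the first occurrence is kept)
hospital.drop_duplicates(inplace=True)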

To remove records (rows) we use df.drop().

For example, perhaps we want to remove row 0 (the first row), as we think it is a prank entry (SEX == 'bananas'):

hospital.drop(0, inplace=True)
hospital # check it's gone
CHARGES LOS AGE SEX DRG DIED
1 NaN NaN NaN NaN NaN NaN
2 3657.00 5.0 76.0 F 122.0 0.0
3 1481.00 2.0 80.0 F 122.0 0.0
4 1681.00 1.0 55.0 M 122.0 0.0
5 6378.64 9.0 84.0 M 121.0 0.0
... ... ... ... ... ... ...
12839 22603.57 14.0 79.0 F 121.0 0.0
12840 NaN 7.0 91.0 F 121.0 0.0
12841 14359.14 9.0 79.0 F 121.0 0.0
12842 12986.00 5.0 70.0 M 121.0 0.0
12843 NaN 1.0 81.0 M 123.0 1.0

12843 rows × 6 columns
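
Note that the row labels now start at 1, and any dropped rows leave gaps in the index. If you would rather have the rows renumbered consecutively from 0, Pandas provides df.reset_index():

# renumber the rows 0, 1, 2, ... and discard the old index
hospital.reset_index(drop=True, inplace=True)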

Exercises#

Complete the following exercises to get used to the syntax for replacing values. It’s boring but if you master it now it will save some stress later on.

# 1. For all cases where AGE is greater than 120, replace the value of AGE with `NaN`
# Reload the data:
hospital=pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/heartAttack.csv')

# YOUR CODE HERE

# uncomment the next line to check it worked
#hospital.sort_values(by='AGE')

# 2. For all cases where AGE is greater than 120, replace the entire record with `NaN`
# Reload the data:
hospital=pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/heartAttack.csv')

# YOUR CODE HERE

# uncomment the next line to check it worked
#hospital.sort_values(by='AGE')

# 3. For all values 9999 in all columns, replace the 9999 with NaN
# Reload the data:
hospital=pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/heartAttack.csv')

# YOUR CODE HERE

# uncomment the next line to check it worked
# hospital.sort_values(by='LOS')

# 4. Replace the code DRG = 123 with the string 'DIED'
# Reload the data:
hospital=pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/heartAttack.csv')

# YOUR CODE HERE

# uncomment the next line to check it worked
# hospital