2.6. Data cleaning#

As we have seen, outliers can distort the values of statistics such as the mean and standard deviation.

In real datasets, outliers are common, arising from one of the following:

  • Real but unusual values (eg many basketball players are outliers in terms of height)
  • Noise in a data recording system (eg in brain imaging data, noise signals from head movement are much larger than the real brain activity we are trying to record)
  • Data entry error (human types the wrong number)

Identifying and removing outliers and bad data points is a crucial step in preparing our data for analysis, a process sometimes called data wrangling.

Set up Python libraries#

As usual, run the code cell below to import the relevant Python libraries.

# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas 
import seaborn as sns
sns.set_theme()
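As a quick illustration of how much a single bad value can distort things (the numbers here are invented for demonstration), compare the mean and standard deviation of five plausible human heights with and without a data entry error:

# five plausible human heights (m), plus one data entry error (18.5 instead of 1.85)
heights = np.array([1.60, 1.72, 1.75, 1.80, 1.85])
heights_bad = np.append(heights, 18.5)

print(np.mean(heights), np.std(heights))          # about 1.74 and 0.08
print(np.mean(heights_bad), np.std(heights_bad))  # about 4.54 and 6.25

One wrong entry has dragged the mean well outside the range of the real data and inflated the standard deviation by a factor of nearly 80.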

Removing an outlier#

In some cases, we can identify that an outlier datapoint should not be in our dataset, and remove it.

Let’s try an example:

Import the data#

Let’s import a dataframe with size information on a random sample of cars

cars = pandas.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook/main/data/cars_outlier1.csv')
display(cars)
length height width type
0 3.9187 1.5320 1.8030 car
1 4.6486 1.5936 1.6463 car
2 3.5785 1.5447 1.7140 car
3 3.5563 1.5549 1.7331 car
4 4.0321 1.5069 1.7320 car
... ... ... ... ...
978 5.0897 1.8396 1.8356 car
979 4.0224 1.5860 1.7862 car
980 3.4771 1.5340 1.7418 car
981 5.2096 1.6396 1.8229 car
982 3.7549 1.5096 1.8274 car

983 rows × 4 columns

Get descriptives#

We can get the descriptive statistics using the describe() method in pandas:

cars.describe()
length height width
count 983.000000 983.000000 983.000000
mean 4.219838 1.586132 1.793355
std 0.708656 0.131954 0.056570
min 3.110900 1.430400 1.624100
25% 3.816750 1.540050 1.760250
50% 4.121600 1.574600 1.790400
75% 4.520250 1.612350 1.821000
max 15.361000 4.201700 2.499800

Hm, the maximum value for car length looks very high.

Let’s plot the data (don’t worry about the plotting code; there is a session on this later):

sns.histplot(cars['length'])
[Output: histogram of car ‘length’, with ‘Count’ on the y axis]

There must be cars of length up to 15.36m (the max length in the descriptives table) but we can’t see them in the plot, although we can see the x axis is stretched out to accommodate them.

Let’s zoom in a bit:

plt.subplot(1,2,1)
sns.histplot(cars['length'])

plt.subplot(1,2,2)
sns.histplot(cars['length'])
plt.ylim((0,10))

plt.subplots_adjust(wspace = 0.5)
[Output: two histograms of ‘length’ side by side; in the right-hand panel the y axis is limited to 0-10, making the outliers visible]

Hm, a couple of outliers.

Let’s check it in the dataframe by pulling out any cars longer than 8m:

cars[cars['length']>8]
length height width type
100 15.361 4.1914 2.4998 truck
101 14.508 4.2017 2.4890 truck

Ahh, the value of type for the two vehicles that exceed 8m is ‘truck’.

Let’s check what other types we have:

cars['type'].value_counts()
type
car      981
truck      2
Name: count, dtype: int64

Hm, 981 cars and 2 trucks. I think the trucks were included by mistake. Let’s drop them:

cars_clean = cars.drop([100,101]) # 100 and 101 are the row indices of the trucks - you can see them above
cars_clean['type'].value_counts() # check if we still have any trucks in the sample
type
car    981
Name: count, dtype: int64

We got the row indices for the trucks from the cell above, where we pulled out rows with length >8m and inspected them.

If you were working with lots of files, it could be useful to cut out the middle man and find the row indices using code:

ix = cars.index[cars['type']=='truck'] # find the row indices for the trucks and save them as 'ix'
cars_clean = cars.drop(ix)
cars_clean['type'].value_counts() # check if we still have any trucks in the sample
type
car    981
Name: count, dtype: int64

Replacing an outlier value with NaN#

Sometimes, we don’t want to remove an entire entry.

Say for example there is a data entry error for one of the vehicle heights. We might not be that interested in vehicle height, and wish to retain other data on this vehicle.

Then we can replace just the offending value with the value NaN.

NaN stands for Not a Number and is code across many programming languages for a missing value that should be ignored.
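A minimal sketch of how this works in numpy: ordinary functions are contaminated by a NaN, while their nan-aware counterparts skip over it - which is what pandas does in its descriptive statistics:

x = np.array([1.0, 2.0, np.nan, 4.0])
print(np.mean(x))    # nan - the missing value contaminates the ordinary mean
print(np.nanmean(x)) # 2.333... - the nan-aware version ignores the NaN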

Let’s try an example:

Import the data#

Let’s import a dataframe with size information on a random sample of cars

cars = pandas.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook/main/data/cars_outlier2.csv')
display(cars)

Get descriptives#

We can get the descriptive statistics using the describe() method in pandas:

cars.describe() # get the descriptive statistics, as in the example above

It looks like there are some implausibly high vehicles in the sample.

Let’s have a look at the rows of the dataframe containing the very high cars

# display the rows of the dataframe with car height greater than 8m, as in the example above
cars[cars['height']>8]

Hm, these height values appear to be about 10x the mean height.

Maybe this is a data entry error (the decimal point is in the wrong place).

We can replace the offending values with NaN:

cars_clean = cars.copy() # make a copy of the dataframe to work on

cars_clean.loc[[100,101],['height']]=np.nan
# or cut out the middle man and find the offending rows using code:
#cars_clean.loc[(cars_clean['height']>8),['height']]=np.nan

# let's view the edited dataframe
display(cars_clean[98:105])
length height width type
98 4.3451 1.6417 1.8773 car
99 4.4613 1.6514 1.7431 car
100 15.3610 NaN 2.4998 truck
101 14.5080 NaN 2.4890 truck
102 3.9694 1.5171 1.7158 car
103 4.2810 1.5981 1.7840 car
104 3.4394 1.5124 1.7424 car

We can see that the mean and max height have now gone back to reasonable values, as Python simply ignores the NaNs when calculating the descriptive statistics:

cars_clean.describe()
length height width
count 983.000000 981.000000 983.000000
mean 4.219838 1.580810 1.793355
std 0.708656 0.059263 0.056570
min 3.110900 1.430400 1.624100
25% 3.816750 1.540000 1.760250
50% 4.121600 1.574500 1.790400
75% 4.520250 1.611900 1.821000
max 15.361000 1.899300 2.499800
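Note the .copy() in the code above: a plain assignment cars_clean = cars would just create a second name for the same dataframe, so editing cars_clean would silently edit cars too. A minimal sketch of the difference:

df = pandas.DataFrame({'x':[1.0, 2.0]})
same = df          # another name for the same dataframe
indep = df.copy()  # an independent copy
same.loc[0,'x'] = 99
print(df.loc[0,'x'])    # 99.0 - editing 'same' changed 'df'
print(indep.loc[0,'x']) # 1.0  - the copy is unaffected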

Find NaNs#

If you download a dataset, it can be useful to check which rows have missing values.

You can do this using the method isna(), which returns True for NaN values and False elsewhere.

For example, we can check the section of the dataframe where we replaced some values with NaNs:

cars_clean[98:105].isna()
length height width type
98 False False False False
99 False False False False
100 False True False False
101 False True False False
102 False False False False
103 False False False False
104 False False False False
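Since True counts as 1 and False as 0, chaining sum() onto isna() gives a quick count of the missing values in each column:

cars_clean.isna().sum() # here: 0 for length, 2 for height, 0 for width, 0 for type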

If you want to find the rows in which ‘height’ is NaN, you can use the output of isna() to index the dataframe, like this:

cars_clean[cars_clean['height'].isna()]
length height width type
100 15.361 NaN 2.4998 truck
101 14.508 NaN 2.4890 truck
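If you did want to discard every row containing a missing value, pandas provides the dropna() method. A sketch (in this example we prefer to keep these rows):

cars_dropped = cars_clean.dropna() # remove any row containing at least one NaN
len(cars_dropped) # 981 - the two rows with NaN heights are gone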

Delete or NaN?#

When should you delete the entire row and when should you replace a missing value with a NaN?

In general, it is better to replace bad or missing values with a NaN.

I think there are three main considerations:

Partial data records are valuable#

If you have many data values for each record or row in your dataframe (in this case, records correspond to individual cars, but they could be patients, for example), you may not wish to junk all the data just because some variables are incomplete or have bad values.

For example, say we run a study which involves a hospital visit for a battery of tests (blood pressure, thyroid function, levels of vitamin B12…) and the lab loses a batch of samples for the B12 analysis. The remaining data on the patients may still be useful - maybe we didn’t even care that much about vitamin B12.

In this case, it is better to put a missing value for B12 in the affected patients and retain the rest of their data as normal.
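In pandas terms that might look something like the sketch below - the dataframe patients, the list lost_batch and the column names are all made up for illustration:

# hypothetical mini-dataset: three patients, two of whose B12 samples were lost
patients = pandas.DataFrame({'patient_id':[1, 2, 3],
                             'BP':[120, 135, 110],
                             'B12':[210.0, 180.0, 450.0]})
lost_batch = [2, 3] # hypothetical ids of the lost samples
patients.loc[patients['patient_id'].isin(lost_batch), 'B12'] = np.nan
display(patients) # BP is retained for everyone; B12 is NaN for patients 2 and 3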

Deleting data could be misleading#

Say one of the measures in our patient study is a follow-up questionnaire, and only 60% of patients complete it.

If we simply delete all the patients who didn’t complete the follow-up from our database, we will have a biased sample (and no-one will ever know!)

For this reason it is good practice to retain incomplete data records, with NaNs as appropriate, so another researcher can see the full picture.

Some errors really are errors#

However, if a record is clearly an error (completely empty) or a duplicate (two identical records for the same NHS number), it would be appropriate to delete it.
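For exact duplicates, pandas provides a built-in method, drop_duplicates(); a sketch:

cars_clean = cars_clean.drop_duplicates() # remove rows that are exact copies of an earlier row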