Learning Objectives

3.2. Learning Objectives

3.2.1. Conceptual

Data cleaning

Understand how outliers and bad datapoints may be identified
Understand the problems caused by retaining outliers and bad datapoints in a dataset
Understand the factors determining how we should deal with an outlier or bad datapoint:
- delete the whole data entry
- replace the whole data entry with NaN
- replace individual variable values with NaN
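
The three options above can be sketched in pandas as follows. This is a minimal illustration, not a recommendation: the dataframe, the column names, and the 250 cm plausibility threshold are all made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the third entry has an implausible height (cm)
df = pd.DataFrame({
    "height": [172.0, 165.0, 1650.0, 180.0],
    "weight": [70.0, 58.0, 61.0, 82.0],
})

# Assumed plausibility threshold, chosen only for this example
bad = df["height"] > 250

# Option 1: delete the whole data entry (row)
dropped = df[~bad]

# Option 2: keep the row, but replace every value in it with NaN
row_nan = df.copy()
row_nan.loc[bad, :] = np.nan

# Option 3: replace only the affected variable with NaN
cell_nan = df.copy()
cell_nan.loc[bad, "height"] = np.nan
```

Option 3 keeps the plausible weight value for that entry, while options 1 and 2 discard it; which is appropriate depends on whether the rest of the entry can be trusted.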

Data normalization

Understand how we calculate Z-scored or normalized values, and why these are useful
Understand when it is useful to express a datapoint as a centile of a distribution

Data disaggregation

Explore a dataset thoughtfully and make your own decisions about when to disaggregate data
- write text justifying these decisions to the reader
Understand the possible consequences of not disaggregating data
- inability to draw conclusions
- conclusions that may not be relevant to under-represented groups

This material is covered in the lecture.

3.2.2. Python skills

After this week you should be able to do the following:

General

Create a new column in a dataframe
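
As a small sketch of what this looks like, a new column can be created by assigning to a column name that does not yet exist. The dataframe and column names here are hypothetical.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    "height_cm": [172.0, 165.0, 180.0],
    "weight_kg": [70.0, 58.0, 82.0],
})

# Assigning to a new column name creates the column
df["height_m"] = df["height_cm"] / 100

# New columns can be computed from several existing columns
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2
```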

Data cleaning

Quickly plot data to check for outlier values
Sort a pandas dataframe by a given column to identify outlier values
Replace a given value in a pandas dataframe, or a particular column of a dataframe, using df.replace()
Replace values in a certain range (e.g., outliers) using df.loc[] with a Boolean condition
Be aware that different Python functions handle NaN values differently, and use help pages to ensure these are handled as intended
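
The cleaning steps listed above might look like the sketch below. The data, the 9999 missing-data code, and the 250 cm cutoff are assumptions made for illustration. The range-based replacement uses a Boolean condition with df.loc, which is one common way to do it.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 9999 is a missing-data code, 1650 is a likely typo
df = pd.DataFrame({"height": [172.0, 9999.0, 165.0, 1650.0, 180.0]})

# Sorting puts extreme values at the top, where they are easy to spot
print(df.sort_values("height", ascending=False).head())

# Replace one specific value in one column
df["height"] = df["height"].replace(9999.0, np.nan)

# Replace all values in a range using a Boolean condition
df.loc[df["height"] > 250, "height"] = np.nan

# NaN handling differs between functions
pandas_mean = df["height"].mean()               # pandas skips NaN by default
numpy_mean = np.mean(df["height"].to_numpy())   # NumPy returns nan
```

For a quick visual check before cleaning, a histogram such as `df["height"].plot(kind="hist")` is often enough, assuming matplotlib is installed. When NaNs should be ignored on a NumPy array, `np.nanmean` is the NaN-aware counterpart of `np.mean`.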

Data normalization

Convert data values to Z scores using a combination of df.mean() and df.std()
Scale histograms and KDE plots either within or across categories
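
A Z score expresses each value as its distance from the mean in units of standard deviation. A minimal sketch, using a made-up column of scores:

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({"score": [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]})

# Z score = (value - mean) / standard deviation
# Note: pandas .std() uses the sample standard deviation (ddof=1) by default
df["score_z"] = (df["score"] - df["score"].mean()) / df["score"].std()
```

After this transformation the new column has mean 0 and standard deviation 1. For the plotting objective, and assuming the seaborn library is being used, `sns.histplot` accepts `stat="density"` together with `common_norm=False` to scale each category separately, and `sns.kdeplot` accepts `common_norm=False` in the same way.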

Data disaggregation

Plot data separately, and report data separately, for different cases of a categorical variable
Create a new categorical variable to categorize a continuous variable
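
Both disaggregation skills can be sketched with `groupby` and `pd.cut`. The data, the age bins, and the group labels below are hypothetical choices for the example; in practice the bin edges are a decision that should be justified to the reader.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    "age": [15, 22, 34, 47, 58, 71],
    "sex": ["F", "M", "F", "M", "F", "M"],
    "score": [55.0, 60.0, 62.0, 58.0, 70.0, 65.0],
})

# Report a statistic separately for each case of a categorical variable
by_sex = df.groupby("sex")["score"].mean()

# Create a categorical variable from a continuous one
# Bins are right-inclusive by default: (0, 18], (18, 40], (40, 65], (65, 120]
df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 18, 40, 65, 120],
    labels=["child", "young adult", "middle-aged", "older adult"],
)

# The new variable can then be used to disaggregate in the same way
by_age = df.groupby("age_group", observed=True)["score"].mean()
```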