3. Data Wrangling

This week we are looking at “data wrangling”.

Data wrangling is the process of getting a real-life dataset cleaned up and parcelled out, such that you can report meaningful statistical insights. This is a key skill for working with real data, such as those you would collect yourself in an experimental setting, or datasets downloaded from the internet.

We will cover three key topics:

Data cleaning

  • Identifying outliers

  • Removing outliers

  • Using NaN for missing values

Data normalization

  • Scaling data from different individuals or sources to make them comparable

  • The Z score

Data disaggregation

  • Separating out data according to a categorical variable

  • Creating categorical variables to separate out data
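As a rough preview of how these three topics fit together, here is a minimal sketch using pandas. The dataset, the column names (`group`, `height`) and the plausible range of 50 to 250 cm are all made up for illustration; the guided exercises use real data and may take a different approach.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: heights in cm for two made-up groups.
# 999 stands in for an implausible entry (e.g. a missing-data code).
df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B"],
    "height": [160.0, 170.0, 999.0, 150.0, 160.0, 170.0],
})

# Data cleaning: treat values outside a plausible range as outliers
# and replace them with NaN rather than deleting the whole row.
# The 50-250 cm range is an assumption chosen for this example.
plausible = df["height"].between(50, 250)
df["height_clean"] = df["height"].where(plausible, np.nan)

# Data normalization: Z score = (value - mean) / standard deviation.
# pandas skips NaN by default when computing mean() and std().
mean = df["height_clean"].mean()
sd = df["height_clean"].std()
df["height_z"] = (df["height_clean"] - mean) / sd

# Data disaggregation: summarise separately for each category.
group_means = df.groupby("group")["height_clean"].mean()

print(df)
print(group_means)
```

Replacing the outlier with NaN, rather than dropping the row, keeps the rest of that row available for other analyses while excluding the bad value from the mean and standard deviation.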

This should prepare you for the first Hand-In Assignment, in which you will use your skills to clean and analyze a real, large dataset.

3.1. Tasks for this week

Conceptual material is covered in the lecture. In addition to the live lecture, you can find lecture videos on Canvas.

Please work through the guided exercises in this section (everything except the page labelled “Tutorial Exercises”) in advance of the computer-based tutorial session.

To complete the guided exercises you will need to either:

  • open the pages in Google Colab (simply click the Colab button on each page), or

  • download them as Jupyter Notebooks to your own computer and work with them locally (e.g. in JupyterLab)

If you find something difficult or have questions, you can discuss them with your tutor in the computer-based tutorial session.

Assignment

This week we will set the first hand-in assignment.

You will download the assignment sheet and dataset from Canvas.

Your tutor will discuss this with you and tell you when you need to hand it in.