3. Data Wrangling
This week we are looking at “data wrangling”.
Data wrangling is the process of getting a real-life dataset cleaned up and parcelled out, such that you can report meaningful statistical insights. This is a key skill for working with real data, such as those you would collect yourself in an experimental setting, or datasets downloaded from the internet.
We will cover three key topics:
Data cleaning
Identifying outliers
Removing outliers
Using NaN for missing values
Data normalization
Scaling data from different individuals or sources to make them comparable
The Z score
Data disaggregation
Separating out data according to a categorical variable
Creating categorical variables to separate out data
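As a rough preview of how these three topics fit together, here is a minimal sketch using pandas. The toy dataset, the column names (sex, height) and the 250 cm cutoff are all made up for illustration; the guided exercises may use different data and a different approach.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset: heights in cm, with one implausible entry (1750)
df = pd.DataFrame({
    "sex": ["F", "M", "F", "M", "F", "M"],
    "height": [162.0, 178.0, 158.0, 1750.0, 170.0, 183.0],
})

# Data cleaning: replace the implausible value with NaN
# (the 250 cm cutoff is an assumption chosen for this example)
df.loc[df["height"] > 250, "height"] = np.nan

# Data normalization: Z score = (value - mean) / standard deviation
# pandas ignores NaN by default when computing the mean and std
df["height_z"] = (df["height"] - df["height"].mean()) / df["height"].std()

# Data disaggregation: summarise height separately for each category of sex
mean_by_sex = df.groupby("sex")["height"].mean()

print(df)
print(mean_by_sex)
```

Note that the outlier is set to NaN rather than deleted, so the row stays in the table and the other information it holds (here, sex) is preserved.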
Hopefully this will prepare you for the first hand-in assignment, in which you will use these skills to clean and analyze a real, large dataset.
3.1. Tasks for this week
Conceptual material is covered in the lecture. In addition to the live lecture, you can find lecture videos on Canvas.
Please work through the guided exercises in this section (everything except the page labelled “Tutorial Exercises”) in advance of the computer-based tutorial session.
To complete the guided exercises you will need to either:
open the pages in Google Colab (simply click the Colab button on each page), or
download them as Jupyter Notebooks to your own computer and work with them locally (e.g. in JupyterLab)
If you find something difficult or have questions, you can discuss them with your tutor in the computer-based tutorial session.
Assignment
This week we will set the first hand-in assignment.
You will download the assignment sheet and dataset from Canvas.
Your tutor will discuss this with you and tell you when you need to hand it in.