1.2. Learning Objectives
1.2.1. Conceptual
This week we are thinking about how to describe data – covering measures of centre (mean, median, mode), measures of spread (variance, standard deviation, interquartile range, percentiles), and description of distributions (shape and skew).
After this week you should understand:
Summary Statistics
Conceptual difference between the mean, median and mode, and when each is used
Conceptual difference between the standard deviation and interquartile range and when each is used
Why measures based on ranks (median and interquartile range) are robust to outliers
Why the mean is useful in predicting the behaviour of large samples
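As a small illustration of the robustness point above, the sketch below compares the four summary statistics on a made-up sample before and after one value is replaced by an extreme outlier. The data and the helper name `summarise` are assumptions for illustration only, not part of the course materials.

```python
import pandas as pd

# Hypothetical data: a small sample, and the same sample with its
# largest value replaced by an extreme outlier
values = pd.Series([2, 3, 3, 4, 5, 5, 6])
with_outlier = pd.Series([2, 3, 3, 4, 5, 5, 60])

def summarise(s):
    """Return mean, median, standard deviation and interquartile range."""
    iqr = s.quantile(0.75) - s.quantile(0.25)
    return {"mean": s.mean(), "median": s.median(), "sd": s.std(), "iqr": iqr}

before = summarise(values)
after = summarise(with_outlier)

print(before)
print(after)
```

In this sample the median and interquartile range are unchanged by the outlier, because they depend only on the rank order of the values, while the mean and standard deviation are both pulled upwards.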
Shape of distributions
Describe the shape and skew of a distribution in words (based on viewing a data plot)
Make predictions about the shape of a distribution from summary statistics (for example, what is the skew for a distribution where the median is higher than the mean?)
Appreciate common factors affecting the shape of distributions (what happens when a measure can only take values above zero for example).
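The link between summary statistics and skew can be checked on small made-up samples. The sketch below uses two hypothetical data sets: one with a long left tail, and one bounded below by zero with a long right tail. The relationship between mean, median and skew shown here is a rule of thumb that holds for typical unimodal data, not a guarantee for every data set.

```python
import pandas as pd

# Hypothetical sample with a long left tail (one unusually low value)
left_tailed = pd.Series([1, 7, 8, 8, 9, 9, 10])

# Hypothetical sample that cannot go below zero, with a long right tail
right_tailed = pd.Series([1, 1, 2, 2, 3, 4, 15])

for name, s in [("left_tailed", left_tailed), ("right_tailed", right_tailed)]:
    print(name, "mean:", s.mean(), "median:", s.median(), "skew:", s.skew())
```

In the first sample the median is higher than the mean and the skew is negative; in the second the mean is higher than the median and the skew is positive.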
Correlation
Understand what correlation is, and what correlated data look like on a scatter plot
Understand the assumptions of Pearson’s correlation coefficient, and when to use Spearman’s and Pearson’s correlation coefficients
This material is covered in the lecture (and in the lecture videos on Canvas), with the exception of correlation.
Note that the material on correlation is not covered in the lecture, but is covered in the notes and videos on this website. You will need to work through this yourself.
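To illustrate the difference between Pearson's and Spearman's correlation coefficients listed under Correlation, the sketch below uses made-up data in which y always increases with x, but not in a straight line. The column names `x` and `y` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data: y increases with x, but the relationship is curved
df = pd.DataFrame({"x": range(1, 9)})
df["y"] = df["x"] ** 3

# df.corr() returns a correlation matrix; .loc picks out the x-y entry
pearson = df.corr(method="pearson").loc["x", "y"]
spearman = df.corr(method="spearman").loc["x", "y"]

print("Pearson: ", pearson)
print("Spearman:", spearman)
```

Spearman's coefficient works on ranks, so it equals 1 for any relationship that is perfectly increasing. Pearson's coefficient measures how well a straight line describes the data, so it is lower here.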
1.2.2. Python skills
We are working with pandas dataframes and some of their associated methods
After this week you should be able to:
Read data from a .csv file into a pandas dataframe using pandas.read_csv()
View a dataframe using display(df), including viewing only certain rows (selected by row index or condition)
Obtain a set of descriptive statistics using df.describe(), including obtaining statistics for a subset of rows or columns
Obtain specific descriptive statistics using methods such as df.mean(), df.count(), df.quantile(), and df.corr(), including for a subset of rows or columns
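The skills listed above might look roughly like this in practice. This is a minimal sketch, not the course notebooks: the data, the column names (`name`, `group`, `height`, `weight`) and the variable names are all made up, and the CSV is held in a string so the example runs without a file.

```python
import io
import pandas as pd

# Hypothetical CSV contents; with a real file you would pass its path
# to pd.read_csv() instead of an in-memory buffer
csv_text = """name,group,height,weight
Ana,A,160,55
Ben,A,175,72
Cat,B,168,60
Dev,B,182,85
Eli,B,171,66
"""
df = pd.read_csv(io.StringIO(csv_text))

# Viewing only certain rows
by_index = df.iloc[1:3]            # rows at positions 1 and 2
tall = df[df["height"] > 170]      # rows meeting a condition

# Descriptive statistics for all numeric columns, then for a subset
summary = df.describe()
summary_b = df[df["group"] == "B"][["height", "weight"]].describe()

# Specific statistics for one column or a pair of columns
mean_height = df["height"].mean()
n_heights = df["height"].count()
q75_height = df["height"].quantile(0.75)
corr_matrix = df[["height", "weight"]].corr()

print(tall)
print(summary_b)
print(mean_height, n_heights, q75_height)
print(corr_matrix)
```

In a Jupyter notebook, leaving a dataframe as the last line of a cell displays it as a formatted table, which is often more readable than `print()`.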
This material is covered in the Jupyter Notebooks in this section
There is also a DataCamp module on pandas; you may wish to revisit this for further Python practice