2.8. Python skills check#
Here we will review all the Python skills you should know by the end of this week.
Set up Python libraries#
As usual, run the code cell below to import the relevant Python libraries
# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas
import seaborn as sns
sns.set_theme()
Load the data#
Let’s load some data about the passengers of the Titanic from the file “data/titanic.csv”
titanic = pandas.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook/main/data/titanic.csv')
display(titanic)
 | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
887 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
888 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
889 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
890 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 11 columns
You can find some information about this dataset on Kaggle, including explanations of the less obvious column headers.
Get descriptives#
Let’s get some descriptive statistics, just for practice:
# How many people were in each class? Hint - use df.value_counts() which we saw on the page on data cleaning
# What was the mean fare in each class? Hint - use .mean() and .groupby()
# What was the standard deviation of fare in each class? Hint - use .std() and .groupby()
# What were the 10th and 90th centiles of age overall?
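If you are stuck on the descriptives above, here is a minimal sketch of the same operations on a made-up dataframe. The dataframe `toy` and its columns `house` and `score` are invented for illustration and are not part of the Titanic data, so you will still need to work out which Titanic columns to use.

```python
import pandas

# Hypothetical toy data (not the Titanic file): six pupils in three houses
toy = pandas.DataFrame({
    'house': ['A', 'A', 'B', 'B', 'B', 'C'],
    'score': [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})

# Count the rows in each category
counts = toy['house'].value_counts()

# Mean and standard deviation of score within each category
mean_by_house = toy.groupby('house')['score'].mean()
std_by_house = toy.groupby('house')['score'].std()

# 10th and 90th centiles of score across all rows
centiles = toy['score'].quantile([0.1, 0.9])

print(counts)
print(mean_by_house)
print(std_by_house)
print(centiles)
```

Notice that the standard deviation for house C comes out as NaN: that group has only one row, and `.std()` divides by n-1 by default.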
# display rows 400-420 of the dataframe
# display only passengers under 12 years old
# display only passengers whose age is unknown (NaN)
# count how many passengers' age was unknown
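The row-selection exercises above all follow the same few patterns, sketched here on a made-up dataframe. The names `toy`, `name` and `height` are invented for illustration and are not part of the Titanic data.

```python
import numpy as np
import pandas

# Hypothetical toy data; np.nan marks a missing height
toy = pandas.DataFrame({
    'name': ['Ana', 'Ben', 'Cal', 'Dee', 'Eve'],
    'height': [150.0, np.nan, 172.0, 181.0, np.nan],
})

# Rows by position: the start is included and the stop is excluded,
# so this gives rows 1, 2 and 3
middle_rows = toy[1:4]

# Rows meeting a condition. Rows with a missing height are not selected,
# because comparing NaN with a number gives False
short_rows = toy[toy['height'] < 175]

# Rows where the value is missing
missing_rows = toy[toy['height'].isna()]

# Number of missing values: True counts as 1 when summed
n_missing = toy['height'].isna().sum()

print(middle_rows)
print(short_rows)
print(missing_rows)
print(n_missing)
```

Because the stop of a slice is excluded, a slice that should include a particular last row needs a stop one higher than that row's position.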
# display only passengers over 70 years old
Wait a minute!
There was something strange in that last dataframe. Maybe someone’s age was mis-recorded?
# replace the misrecorded age with NaN - hint - check the page on data cleaning
# and display the relevant part of the dataframe to check
titanic[420:425]
 | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
420 | 0 | 3 | Gheorgheff, Mr. Stanio | male | NaN | 0 | 0 | 349254 | 7.8958 | NaN | C |
421 | 0 | 3 | Charters, Mr. David | male | 21.0 | 0 | 0 | A/5. 13032 | 7.7333 | NaN | Q |
422 | 0 | 3 | Zimmerman, Mr. Leo | male | 290.0 | 0 | 0 | 315082 | 7.8750 | NaN | S |
423 | 0 | 3 | Danbom, Mrs. Ernst Gilbert (Anna Sigrid Maria ... | female | 28.0 | 1 | 1 | 347080 | 14.4000 | NaN | S |
424 | 0 | 3 | Rosblom, Mr. Viktor Richard | male | 18.0 | 1 | 1 | 370129 | 20.2125 | NaN | S |
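In the output above, row 422 still shows an age of 290, so the replacement has not yet been done. Here is a minimal sketch of two common ways to set an implausible value to NaN, using a made-up dataframe. The names `toy`, `cleaned`, `cleaned2` and the cut-off of 120 years are assumptions chosen for this illustration, and the page on data cleaning may use a different approach.

```python
import numpy as np
import pandas

# Hypothetical toy data with one impossible age
toy = pandas.DataFrame({
    'name': ['Ana', 'Ben', 'Cal', 'Dee'],
    'age': [34.0, 290.0, 7.0, 61.0],
})

# Option 1: set every age above a plausibility cut-off to NaN
# (the cut-off of 120 is an assumption for this example)
cleaned = toy.copy()
cleaned.loc[cleaned['age'] > 120, 'age'] = np.nan

# Option 2: replace one specific known-bad value with NaN
cleaned2 = toy.copy()
cleaned2['age'] = cleaned2['age'].replace(290.0, np.nan)

print(cleaned)
print(cleaned2)
```

Working on a copy leaves the original dataframe untouched, which makes it easy to compare before and after. After replacing the value, display the affected rows again to check that the change has worked.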