2.8. Python skills check

Here we will review the Python skills you should have by the end of this week.

Set up Python libraries

As usual, run the code cell below to import the relevant Python libraries.

# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas 
import seaborn as sns
sns.set_theme()

Load the data

[Image: picture of the Titanic]

Let’s load some data about the passengers of the Titanic from the file “data/titanic.csv”:

titanic = pandas.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook/main/data/titanic.csv')
display(titanic)
|     | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|-----|----------|--------|------|-----|-----|-------|-------|--------|------|-------|----------|
| 0   | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1   | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2   | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3   | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4   | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
| 887 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 888 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
| 889 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
| 890 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |

891 rows × 11 columns

You can find some information about this dataset on Kaggle, including explanations of the less obvious column headers.

Get descriptives

Let’s get some descriptive statistics, just for practice:

# How many people were in each class? Hint: use df.value_counts(), which we saw on the page on data cleaning
# What was the mean fare in each class? Hint: use .mean() and .groupby()
# What was the standard deviation of fare in each class? Hint: use .std() and .groupby()
# What were the 10th and 90th centiles of age overall?
# display rows 400-420 of the dataframe
# display only passengers under 12 years old
# display only passengers whose age is unknown (NaN)
# count how many passengers' age was unknown
# display only passengers over 70 years old
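
The exercises above could be approached along the following lines. This is only a sketch on a toy dataframe with made-up values: the name `toy` and its contents are assumptions for illustration, not the real Titanic data, although the column names match those in the dataframe displayed earlier.

```python
import numpy as np
import pandas

# Toy stand-in for the Titanic data: these values are made up for illustration
toy = pandas.DataFrame({
    'Pclass': [1, 1, 2, 3, 3, 3],
    'Fare':   [80.0, 60.0, 20.0, 8.0, 7.0, 9.0],
    'Age':    [40.0, 75.0, 30.0, 5.0, np.nan, 22.0],
})

# Number of passengers in each class
class_counts = toy['Pclass'].value_counts()

# Mean and standard deviation of fare within each class
mean_fare = toy.groupby('Pclass')['Fare'].mean()
sd_fare = toy.groupby('Pclass')['Fare'].std()

# 10th and 90th centiles of age (NaN values are skipped by default)
age_centiles = toy['Age'].quantile([0.1, 0.9])

# Rows 2 to 4 by position (the end of the slice is excluded)
some_rows = toy[2:5]

# Passengers under 12, and passengers over 70
children = toy[toy['Age'] < 12]
over_70 = toy[toy['Age'] > 70]

# Passengers whose age is unknown, and how many there are
age_unknown = toy[toy['Age'].isna()]
n_age_unknown = toy['Age'].isna().sum()

print(class_counts)
print(mean_fare)
print(sd_fare)
print(age_centiles)
print(n_age_unknown)
```

On the full dataframe, whose row index runs from 0 to 890, `titanic[400:421]` would select rows 400 to 420 inclusive, because the end of a slice is excluded.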

Wait a minute!

There was something strange in that last dataframe. Maybe someone’s age was mis-recorded?

# replace the mis-recorded age with NaN (hint: check the page on data cleaning)

# and display the relevant part of the dataframe to check
titanic[420:425]
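|     | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|-----|----------|--------|------|-----|-----|-------|-------|--------|------|-------|----------|
| 420 | 0 | 3 | Gheorgheff, Mr. Stanio | male | NaN | 0 | 0 | 349254 | 7.8958 | NaN | C |
| 421 | 0 | 3 | Charters, Mr. David | male | 21.0 | 0 | 0 | A/5. 13032 | 7.7333 | NaN | Q |
| 422 | 0 | 3 | Zimmerman, Mr. Leo | male | 290.0 | 0 | 0 | 315082 | 7.8750 | NaN | S |
| 423 | 0 | 3 | Danbom, Mrs. Ernst Gilbert (Anna Sigrid Maria ... | female | 28.0 | 1 | 1 | 347080 | 14.4000 | NaN | S |
| 424 | 0 | 3 | Rosblom, Mr. Viktor Richard | male | 18.0 | 1 | 1 | 370129 | 20.2125 | NaN | S |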
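
An age of 290.0, as recorded for row 422 above, cannot be right. Below is a minimal sketch of one way to replace such a value with NaN. It uses a toy dataframe with made-up values, and the cut-off of 120 years is an assumption chosen for this sketch rather than something taken from the dataset.

```python
import numpy as np
import pandas

# Toy stand-in for the Titanic data: made-up values, with one impossible age
toy = pandas.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Age':  [21.0, 290.0, 28.0, np.nan],
})

# Treat any age above a plausibility threshold as missing.
# The cut-off of 120 is an assumption made for this sketch.
toy.loc[toy['Age'] > 120, 'Age'] = np.nan

# Check the result: the 290.0 entry should now be NaN
print(toy)
```

Using `.loc` with a condition and a column name changes the values in the original dataframe in place, so displaying the relevant rows again afterwards is a quick way to confirm that the replacement worked.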