1.10. Tutorial Exercises#

This week’s tutorial exercises focus on indexing and obtaining descriptive statistics.

1.10.1. Set up Python Libraries#

As usual, you will need to run this code block to import the relevant Python libraries.

# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import seaborn as sns
sns.set_theme(style='white')
import statsmodels.api as sm
import statsmodels.formula.api as smf

1.10.2. Import a dataset to work with#

The code block below reads the file OxfordWeather.csv directly from the course GitHub repository. Alternatively, you can download the file from Canvas to your computer and import it from there.

weather = pd.read_csv("https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/OxfordWeather.csv")
display(weather)
       YYYY  Month   MM   DD  DD365  Tmax   Tmin  Tmean  Trange  Rainfall_mm
0      1827    Jan    1    1      1   8.3    5.6    7.0     2.7          0.0
1      1827    Jan    1    2      2   2.2    0.0    1.1     2.2          0.0
2      1827    Jan    1    3      3  -2.2   -8.3   -5.3     6.1          9.7
3      1827    Jan    1    4      4  -1.7   -7.8   -4.8     6.1          0.0
4      1827    Jan    1    5      5   0.0  -10.6   -5.3    10.6          0.0
...     ...    ...  ...  ...    ...   ...    ...    ...     ...          ...
71338  2022    Apr    4   26    116  15.2    4.1    9.7    11.1          0.0
71339  2022    Apr    4   27    117  10.7    2.6    6.7     8.1          0.0
71340  2022    Apr    4   28    118  12.7    3.9    8.3     8.8          0.0
71341  2022    Apr    4   29    119  11.7    6.7    9.2     5.0          0.0
71342  2022    Apr    4   30    120  17.6    1.0    9.3    16.6          0.0

71343 rows × 10 columns

1.10.3. Exercises#

In the following questions, we use descriptive statistics and indexing to answer some questions about the weather and climate in Oxford.

Where you are asked to calculate a value (such as the mean) rather than output a table, you should report your answer in words in the text box below the code block.

Where the question asks you to “comment”, you are simply being asked to engage with the data and explain what you notice in plain English. Please discuss with your fellow students and your tutor, as this is a really important skill for data analysis.

Part 1: Heat#

a. What was the hottest temperature on record?#

Note that the dataset ends in April 2022 and therefore does not include the record heatwave of summer 2022.

# Your code here

Your text here

b. On what date did the hottest temperature occur?#

Hint: you could use df.query() to help you here
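As a reminder of how df.query() selects rows, here is a minimal sketch on a small made-up dataframe. The dataframe `toy` and its columns `city` and `height_m` are invented for illustration and are not part of the Oxford data.

```python
import pandas as pd

# Made-up example data (not the Oxford weather dataset)
toy = pd.DataFrame({
    "city": ["A", "B", "C", "D"],
    "height_m": [12, 85, 40, 85],
})

# Find the largest value in a column
tallest = toy["height_m"].max()

# Keep only the rows matching a condition;
# the @ prefix refers to a Python variable defined outside the query string
tallest_rows = toy.query("height_m == @tallest")
print(tallest_rows)
```

Note that more than one row can match if the largest value occurs more than once.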

# Your code here

Your text here

c. Display the 10 hottest days on record and comment#

Hint: you can use df.sort_values() and df.head() or df.tail() to help you here
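The following sketch shows the sort-then-slice pattern on a made-up dataframe. The names `toy`, `name` and `score` are assumptions for illustration only.

```python
import pandas as pd

# Made-up example data (not the Oxford weather dataset)
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e"],
    "score": [3, 9, 1, 7, 5],
})

# Sort from largest to smallest, then keep the first 2 rows
top2 = toy.sort_values("score", ascending=False).head(2)

# Alternative: sort from smallest to largest and keep the last 2 rows
# (this gives the same rows, but in the opposite order)
top2_tail = toy.sort_values("score").tail(2)

print(top2)
print(top2_tail)
```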

# Your code here

Your comment here

d. Find the mean of maximum daily temperature (Tmax) for each month and comment#

Hint: you can use df.groupby() to help you here
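As a reminder, df.groupby() splits the rows into groups and then applies a statistic within each group. A minimal sketch on made-up data (the names `toy`, `group` and `value` are invented for illustration):

```python
import pandas as pd

# Made-up example data (not the Oxford weather dataset)
toy = pd.DataFrame({
    "group": ["x", "x", "y", "y"],
    "value": [1.0, 3.0, 10.0, 20.0],
})

# Mean of 'value' within each level of 'group'
group_means = toy.groupby("group")["value"].mean()
print(group_means)
```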

# Your code here

Your comment here

e. Make a table displaying the mean and standard deviation of Tmax in each month#

Hint: A combination of df.agg() and df.groupby() will help you here
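One possible way to combine the two methods is sketched below on made-up data (the names `toy`, `group` and `value` are invented for illustration). Passing a list of statistic names to agg() produces one output column per statistic.

```python
import pandas as pd

# Made-up example data (not the Oxford weather dataset)
toy = pd.DataFrame({
    "group": ["x", "x", "y", "y"],
    "value": [1.0, 3.0, 10.0, 20.0],
})

# One row per group, one column per requested statistic
summary = toy.groupby("group")["value"].agg(["mean", "std"])
print(summary)
```

Note that pandas reports the sample standard deviation (dividing by n-1) by default.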

# Your code here

f. Make a table displaying the mean of Tmax and Tmin in each month#

Hint: A combination of df.agg() and df.groupby() will help you here

# Your code here

Part 2: Rain#

a. Run this code block to add a column called wet containing a True for days on which it rained and False otherwise#

We will practice adding columns in a later session

# Run this code - it adds a column 'wet' that is True on days with rainfall above zero
weather['wet'] = weather.Rainfall_mm > 0
weather
       YYYY  Month   MM   DD  DD365  Tmax   Tmin  Tmean  Trange  Rainfall_mm    wet
0      1827    Jan    1    1      1   8.3    5.6    7.0     2.7          0.0  False
1      1827    Jan    1    2      2   2.2    0.0    1.1     2.2          0.0  False
2      1827    Jan    1    3      3  -2.2   -8.3   -5.3     6.1          9.7   True
3      1827    Jan    1    4      4  -1.7   -7.8   -4.8     6.1          0.0  False
4      1827    Jan    1    5      5   0.0  -10.6   -5.3    10.6          0.0  False
...     ...    ...  ...  ...    ...   ...    ...    ...     ...          ...    ...
71338  2022    Apr    4   26    116  15.2    4.1    9.7    11.1          0.0  False
71339  2022    Apr    4   27    117  10.7    2.6    6.7     8.1          0.0  False
71340  2022    Apr    4   28    118  12.7    3.9    8.3     8.8          0.0  False
71341  2022    Apr    4   29    119  11.7    6.7    9.2     5.0          0.0  False
71342  2022    Apr    4   30    120  17.6    1.0    9.3    16.6          0.0  False

71343 rows × 11 columns

b. What is the proportion of wet days overall?#

Hint: The values True and False can be treated as 1 and 0 respectively.

To get the proportion of days on which wet==True, we can use a programming trick, which is simply to take the mean of the column wet:

  • say there are 100 days in my sample

    • say 66 of them, wet==True==1

    • for the other 34, wet==False==0

  • If we take the mean, this gives us the proportion of wet days because we:

    • add up all the values (answer=66)

    • divide by the number of cases (100)

    • result is 66/100 = 0.66 or 66%, the proportion of wet days
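The trick can be checked on a small made-up sample. This is only a sketch: the Series `wet_toy` below is invented (100 days, of which 66 are wet and 34 are dry) and is not the Oxford data.

```python
import pandas as pd

# Made-up sample: 100 days, 66 wet (True) and 34 dry (False)
wet_toy = pd.Series([True] * 66 + [False] * 34)

n_wet = wet_toy.sum()      # True counts as 1, so the sum is the number of wet days
prop_wet = wet_toy.mean()  # sum divided by the number of days

print(n_wet)     # 66
print(prop_wet)  # 0.66
```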

# your code here

Your text here

c. What is the proportion of wet days in each month? Comment on your findings#

Hint: use df.groupby()

# your code here

Your comments here

d. What is the mean quantity of rainfall (in mm) in each month? Comment on your findings#

# your code here

Your comment here

e. Display the 10 wettest days on record and comment#

# Your code here

Your comment here

f. Compare and contrast the different findings in Part 2 c, d, and e#

Different descriptive statistics tell us different things about the same data!

Your comments here!

Part 3: Snow#

a. Create a dataframe WhiteChristmas containing the weather on Christmas day, for all the years in which there was a White Christmas#

Hint: we don’t have a column telling us when it has snowed, but it is reasonable to assume this happens when the minimum temperature dips below zero and Rainfall_mm is above zero.
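Selecting rows that satisfy two conditions at once can be sketched as follows, on made-up data. The names `toy`, `temp` and `rain` are invented for illustration, and either of the two forms shown should work.

```python
import pandas as pd

# Made-up example data (not the Oxford weather dataset)
toy = pd.DataFrame({
    "temp": [-2.0, 3.0, -1.0, 5.0],
    "rain": [0.0, 2.5, 4.0, 0.0],
})

# Boolean indexing: wrap each condition in parentheses and join them with &
both = toy[(toy["temp"] < 0) & (toy["rain"] > 0)]

# The same selection written with df.query()
both_q = toy.query("temp < 0 and rain > 0")

print(both)
```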

# Your code here
# WhiteChristmas = 

b. Sort the dataframe WhiteChristmas by year and comment#

# Your code here

Your comments here

c. Any issues with our definition of ‘snow’?#

We defined snow as when the Tmin falls below zero and Rainfall is non-zero.

  • Do you think this over- or underestimates the number of snowy days?

  • Why?

Your comments here

d. How common is ‘proper’ snowfall in Oxford?#

Let’s focus on days with enough snowfall to make at least a tiny snowman! Assume that this happens when Tmin is below zero and there is more than 4 mm of rainfall.

  • 4 mm of rain makes about 5 cm of soggy snow in Oxford conditions, although it would make a much greater depth of powder in a cold, dry atmosphere like Utah or Colorado

Create a dataframe called SnowDays containing only days with enough snow to make a snowman.

You can check how often this happened in recent years using df.tail()

# Your code here

Your comments here