3.5. Boxplot#
Sometimes less is more!
We saw in the lecture that if we want to compare several data distributions, it can be useful to have a plot that highlights key features (the median and quartiles) whilst eliminating unnecessary detail
The boxplot can do this job
Oxford Weather example#
We will work with historical data from the Oxford weather centre
Set up Python libraries#
As usual, run the code cell below to import the relevant Python libraries
# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas
import seaborn as sns
sns.set_theme()
Load and inspect the data#
Let’s load some historical data about the weather in Oxford, from the file “OxfordWeather.csv”
weather = pandas.read_csv("https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook/main/data/OxfordWeather.csv")
display(weather)
YYYY | MM | DD | Tmax | Tmin | Tmean | Trange | Rainfall_mm | |
---|---|---|---|---|---|---|---|---|
0 | 1827 | 1 | 1 | 8.3 | 5.6 | 7.0 | 2.7 | 0.0 |
1 | 1827 | 1 | 2 | 2.2 | 0.0 | 1.1 | 2.2 | 0.0 |
2 | 1827 | 1 | 3 | -2.2 | -8.3 | -5.3 | 6.1 | 9.7 |
3 | 1827 | 1 | 4 | -1.7 | -7.8 | -4.8 | 6.1 | 0.0 |
4 | 1827 | 1 | 5 | 0.0 | -10.6 | -5.3 | 10.6 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
71338 | 2022 | 4 | 26 | 15.2 | 4.1 | 9.7 | 11.1 | 0.0 |
71339 | 2022 | 4 | 27 | 10.7 | 2.6 | 6.7 | 8.1 | 0.0 |
71340 | 2022 | 4 | 28 | 12.7 | 3.9 | 8.3 | 8.8 | 0.0 |
71341 | 2022 | 4 | 29 | 11.7 | 6.7 | 9.2 | 5.0 | 0.0 |
71342 | 2022 | 4 | 30 | 17.6 | 1.0 | 9.3 | 16.6 | 0.0 |
71343 rows × 8 columns
Have a look at the dataframe.
What do you think is contained in each column?
- Each row is a single day from 1827 to 2022. The columns YYYY,MM,DD give the date.
- The columns Tmax, Tmin and Tmean give information about the temperature
- We also have a record of the rainfall each day
Plot the temperature#
Say we want to plot the mean temperature in each month of the year. We have almost 200 datapoints for every date (and 30ish dates within each month, so 6000 measurements per month!)
We can summarise the distribution of temperatures in each month using a boxplot:
sns.boxplot(data=weather, x="MM", y="Tmax")
<Axes: xlabel='MM', ylabel='Tmax'>
Using a simple boxplot for each month, we can easily see the trend across months for warmer weather in the summer and cooler weather in the winter.
Within each month we can also see some information about the distribution - for example:
- Temperatures are more variable in winter and summer, than in autumn and spring
- In winter, the distribution of temperatures has negative skew (there are some unusually cold years) but in summer the converse is true - this is evident from the position of the median within each box but is clearer in a violinplot (see below)
Comparing patterns#
Has the pattern of temperatures changed over the years?
Let’s compare temperatures in the 19th, 20th and 21st century.
To do so we add a ‘century’ column to our dataframe. We use the pandas.cut function to categorise the values of year into 19thC, 20thC and 21stC
weather['CCCC'] = pandas.cut(x=weather['YYYY'], bins=[1800,1900,2000,2100], labels=["19thC","20thC","21stC"])
display(weather)
YYYY | MM | DD | Tmax | Tmin | Tmean | Trange | Rainfall_mm | CCCC | |
---|---|---|---|---|---|---|---|---|---|
0 | 1827 | 1 | 1 | 8.3 | 5.6 | 7.0 | 2.7 | 0.0 | 19thC |
1 | 1827 | 1 | 2 | 2.2 | 0.0 | 1.1 | 2.2 | 0.0 | 19thC |
2 | 1827 | 1 | 3 | -2.2 | -8.3 | -5.3 | 6.1 | 9.7 | 19thC |
3 | 1827 | 1 | 4 | -1.7 | -7.8 | -4.8 | 6.1 | 0.0 | 19thC |
4 | 1827 | 1 | 5 | 0.0 | -10.6 | -5.3 | 10.6 | 0.0 | 19thC |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
71338 | 2022 | 4 | 26 | 15.2 | 4.1 | 9.7 | 11.1 | 0.0 | 21stC |
71339 | 2022 | 4 | 27 | 10.7 | 2.6 | 6.7 | 8.1 | 0.0 | 21stC |
71340 | 2022 | 4 | 28 | 12.7 | 3.9 | 8.3 | 8.8 | 0.0 | 21stC |
71341 | 2022 | 4 | 29 | 11.7 | 6.7 | 9.2 | 5.0 | 0.0 | 21stC |
71342 | 2022 | 4 | 30 | 17.6 | 1.0 | 9.3 | 16.6 | 0.0 | 21stC |
71343 rows × 9 columns
Now we can use the argument hue in seaborn.boxplot to produce parallel box plots.
sns.boxplot(data=weather, x="MM", y="Tmean", hue="CCCC")
<Axes: xlabel='MM', ylabel='Tmean'>
It looks like Oxford has been getting a bit warmer over the centuries.
Violinplot#
Using Python, you can make a slighly fancier version of the boxplot called a violinplot.
The violinplot shows the full distribution of data rather than the summary captured in a boxplot - the violin body is basically a KDE plot.
Let’s give it a try using the function sns.violinplot
sns.violinplot(data=weather, x="MM", y="Tmax")
<Axes: xlabel='MM', ylabel='Tmax'>
This is a nice compromise - still easy to “eyeball” the pattern across categories (in this case, across months) but giving plenty of detail within each category also
Note for example that the trend for:
- negative skew in temperature in winter (outliers are cold days)
- positive skew in summer (outliers are hot days)