3.5. Boxplot

Sometimes less is more!

We saw in the lecture that if we want to compare several data distributions, it can be useful to have a plot that highlights key features (the median and quartiles) whilst eliminating unnecessary detail.

The boxplot does this job.
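To make the "key features" concrete, here is a minimal sketch of the numbers a boxplot draws. The temperature values are made up for illustration, and the whisker rule (most extreme points within 1.5 × IQR of the box) is the common convention, which matches the default `whis=1.5` setting in seaborn.

```python
import numpy as np

# Hypothetical sample of daily maximum temperatures (deg C), made up for illustration
temps = np.array([2.0, 4.5, 5.0, 6.5, 7.0, 8.0, 9.5, 10.0, 11.0, 21.0])

# The box spans the lower quartile to the upper quartile; the line inside is the median
q1, median, q3 = np.percentile(temps, [25, 50, 75])
iqr = q3 - q1

# Assumed whisker convention: whiskers reach the most extreme data points
# that lie within 1.5 * IQR of the box; anything beyond is drawn as an outlier
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
inside = temps[(temps >= lower_limit) & (temps <= upper_limit)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = temps[(temps < lower_limit) | (temps > upper_limit)]

print("Q1, median, Q3:", q1, median, q3)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers)
```

Everything else about the sample (the exact values, how they cluster) is discarded, which is what makes the boxplot easy to compare across groups.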

Oxford Weather example

We will work with historical data from the Oxford weather centre.

Set up Python libraries

As usual, run the code cell below to import the relevant Python libraries.

# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas 
import seaborn as sns
sns.set_theme()

Load and inspect the data

Let’s load some historical data about the weather in Oxford, from the file “OxfordWeather.csv”.

weather = pandas.read_csv("https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook/main/data/OxfordWeather.csv")
display(weather)
       YYYY  MM  DD  Tmax   Tmin  Tmean  Trange  Rainfall_mm
0      1827   1   1   8.3    5.6    7.0     2.7          0.0
1      1827   1   2   2.2    0.0    1.1     2.2          0.0
2      1827   1   3  -2.2   -8.3   -5.3     6.1          9.7
3      1827   1   4  -1.7   -7.8   -4.8     6.1          0.0
4      1827   1   5   0.0  -10.6   -5.3    10.6          0.0
...     ...  ..  ..   ...    ...    ...     ...          ...
71338  2022   4  26  15.2    4.1    9.7    11.1          0.0
71339  2022   4  27  10.7    2.6    6.7     8.1          0.0
71340  2022   4  28  12.7    3.9    8.3     8.8          0.0
71341  2022   4  29  11.7    6.7    9.2     5.0          0.0
71342  2022   4  30  17.6    1.0    9.3    16.6          0.0

71343 rows × 8 columns

Have a look at the dataframe.

What do you think is contained in each column?

  • Each row is a single day from 1827 to 2022. The columns YYYY,MM,DD give the date.
  • The columns Tmax, Tmin and Tmean give information about the temperature
  • We also have a record of the rainfall each day

Plot the temperature

Say we want to plot the daily maximum temperature (Tmax) for each month of the year. We have almost 200 data points for every calendar date (one per year of the record) and about 30 dates within each month, so roughly 6000 measurements per month!
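The arithmetic above can be checked with a short sketch. It does not use the weather file itself; instead it assumes the record has exactly one row per calendar day from 1 January 1827 to 30 April 2022 with no gaps (the names `days`, `calendar`, `per_month` and `per_date` are introduced here for illustration).

```python
import pandas

# Assumption: one row per calendar day, 1 Jan 1827 to 30 Apr 2022, with no gaps
days = pandas.date_range("1827-01-01", "2022-04-30", freq="D")
calendar = pandas.DataFrame({"MM": days.month, "DD": days.day})

# Total number of days in the assumed record
print(len(calendar))

# Number of days falling in each month of the year
per_month = calendar.groupby("MM").size()
print(per_month)

# Number of years in which each calendar date (month, day) occurs
per_date = calendar.groupby(["MM", "DD"]).size()
print(per_date.loc[(1, 1)])
```

Under that assumption the total comes to 71343 days, the same as the number of rows in the dataframe displayed above, and each month has somewhere around 6000 days.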

We can summarise the distribution of temperatures in each month using a boxplot:

sns.boxplot(data=weather, x="MM", y="Tmax")
[Figure: boxplots of Tmax (y-axis) for each month MM (x-axis)]

Using a simple boxplot for each month, we can easily see the trend across months for warmer weather in the summer and cooler weather in the winter.

Within each month we can also see some information about the distribution - for example:

  • Temperatures are more variable in winter and summer than in autumn and spring
  • In winter, the distribution of temperatures has negative skew (there are some unusually cold days), whereas in summer it has positive skew (there are some unusually hot days). This is evident from the position of the median within each box, but is clearer in a violinplot (see below)
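If you want numbers to go with these visual impressions, one option is to compute the spread and skewness for each month with `groupby`. The sketch below uses a tiny made-up dataframe called `toy` (not the Oxford data) with deliberately exaggerated tails, so the values themselves mean nothing; only the pattern of signs matters.

```python
import pandas

# Made-up data for illustration only: 7 "January" days with a tail of cold days,
# and 7 "July" days with a tail of hot days
toy = pandas.DataFrame({
    "MM":   [1] * 7 + [7] * 7,
    "Tmax": [-6.0, 3.0, 5.0, 6.0, 7.0, 8.0, 9.0,
             18.0, 19.0, 20.0, 21.0, 22.0, 24.0, 33.0],
})

grouped = toy.groupby("MM")["Tmax"]

# Median and interquartile range correspond to the line and height of each box;
# skew is negative for a long lower tail and positive for a long upper tail
summary = pandas.DataFrame({
    "median": grouped.median(),
    "IQR": grouped.quantile(0.75) - grouped.quantile(0.25),
    "skew": grouped.skew(),
})
print(summary)
```

The same three lines should work on the real data by replacing `toy` with `weather`, assuming the column names MM and Tmax shown above.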

Comparing patterns

Has the pattern of temperatures changed over the years?

Let’s compare temperatures in the 19th, 20th and 21st centuries.

To do so we add a ‘century’ column (CCCC) to our dataframe. We use the pandas.cut function to categorise the values of the year column YYYY into 19thC, 20thC and 21stC.

weather['CCCC'] = pandas.cut(x=weather['YYYY'], bins=[1800,1900,2000,2100], labels=["19thC","20thC","21stC"])
display(weather)
       YYYY  MM  DD  Tmax   Tmin  Tmean  Trange  Rainfall_mm   CCCC
0      1827   1   1   8.3    5.6    7.0     2.7          0.0  19thC
1      1827   1   2   2.2    0.0    1.1     2.2          0.0  19thC
2      1827   1   3  -2.2   -8.3   -5.3     6.1          9.7  19thC
3      1827   1   4  -1.7   -7.8   -4.8     6.1          0.0  19thC
4      1827   1   5   0.0  -10.6   -5.3    10.6          0.0  19thC
...     ...  ..  ..   ...    ...    ...     ...          ...    ...
71338  2022   4  26  15.2    4.1    9.7    11.1          0.0  21stC
71339  2022   4  27  10.7    2.6    6.7     8.1          0.0  21stC
71340  2022   4  28  12.7    3.9    8.3     8.8          0.0  21stC
71341  2022   4  29  11.7    6.7    9.2     5.0          0.0  21stC
71342  2022   4  30  17.6    1.0    9.3    16.6          0.0  21stC

71343 rows × 9 columns
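It is worth checking what happens to years that sit exactly on a bin edge. By default pandas.cut treats each bin as open on the left and closed on the right, so with the bins used above the year 1900 is labelled 19thC and the year 2000 is labelled 20thC. A small sketch with a handful of hand-picked years (not taken from the dataframe) shows this:

```python
import pandas

# Hand-picked years, including the two bin edges 1900 and 2000
years = pandas.Series([1827, 1900, 1901, 2000, 2001, 2022])

# Same bins and labels as in the cell above; bins are (1800, 1900], (1900, 2000], (2000, 2100]
century = pandas.cut(x=years, bins=[1800, 1900, 2000, 2100], labels=["19thC", "20thC", "21stC"])

print(pandas.DataFrame({"YYYY": years, "CCCC": century}))
```

If you wanted 1900 to count as 20thC instead, you could pass `right=False` to pandas.cut, which closes each bin on the left rather than the right.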

Now we can use the argument hue in seaborn.boxplot to produce parallel box plots.

sns.boxplot(data=weather, x="MM", y="Tmean", hue="CCCC")
[Figure: boxplots of Tmean (y-axis) for each month MM (x-axis), with separate boxes for each century CCCC]

It looks like Oxford has been getting a bit warmer over the centuries.

Violinplot

Using Python, you can make a slightly fancier version of the boxplot called a violinplot.

The violinplot shows the full distribution of data rather than the summary captured in a boxplot - the violin body is basically a KDE plot.
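To see what "basically a KDE plot" means, here is a rough sketch of a Gaussian kernel density estimate written with numpy only. The function name `gaussian_kde_sketch`, the sample values and the bandwidth of 2.0 are all assumptions made for illustration; seaborn chooses a bandwidth for you and its internals may differ in detail.

```python
import numpy as np

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average of Gaussian bumps, one centred on each sample (illustrative only)."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardised distance from every grid point to every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    # Standard normal density evaluated at those distances
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    # Sum the bumps and normalise so the curve integrates to 1
    return kernels.sum(axis=1) / (len(samples) * bandwidth)

# Hypothetical temperatures (deg C), made up for illustration
temps = np.array([2.0, 4.5, 5.0, 6.5, 7.0, 8.0, 9.5, 10.0, 11.0, 21.0])
grid = np.linspace(-10, 40, 501)
density = gaussian_kde_sketch(temps, grid, bandwidth=2.0)

# Approximate the area under the curve; it should be close to 1
area = np.sum(density) * (grid[1] - grid[0])
print(area)
```

A violin is this kind of smooth curve drawn vertically and mirrored about the centre line, so wide parts of the violin correspond to temperatures that occur often.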

Let’s give it a try using the function sns.violinplot

sns.violinplot(data=weather, x="MM", y="Tmax")
[Figure: violin plots of Tmax (y-axis) for each month MM (x-axis)]

This is a nice compromise: it is still easy to “eyeball” the pattern across categories (in this case, across months), but there is also plenty of detail within each category.

Note, for example, that two trends are much more clearly visible in the violin plot than in a box plot:

  • negative skew in temperature in winter (outliers are cold days)
  • positive skew in summer (outliers are hot days)