1.8. Tables of descriptive statistics#

When you are writing a report on a dataset you might like to generate a nice table with the descriptive statistics that you need all together in one place

Below we review a couple of useful pieces of Python syntax for making tables:

  • df.describe()

  • df.agg()

  • df.groupby()

1.8.1. Set up Python Libraries#

As usual you will need to run this code block to import the relevant Python libraries

# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import seaborn as sns
sns.set_theme(style='white')
import statsmodels.api as sm
import statsmodels.formula.api as smf

1.8.2. Import a dataset to work with#

We will work with weather data from the Oxford weather station. This code block will read it automatically from the internet.

weather = pd.read_csv("https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/OxfordWeather.csv")
display(weather)
YYYY Month MM DD DD365 Tmax Tmin Tmean Trange Rainfall_mm
0 1827 Jan 1 1 1 8.3 5.6 7.0 2.7 0.0
1 1827 Jan 1 2 2 2.2 0.0 1.1 2.2 0.0
2 1827 Jan 1 3 3 -2.2 -8.3 -5.3 6.1 9.7
3 1827 Jan 1 4 4 -1.7 -7.8 -4.8 6.1 0.0
4 1827 Jan 1 5 5 0.0 -10.6 -5.3 10.6 0.0
... ... ... ... ... ... ... ... ... ... ...
71338 2022 Apr 4 26 116 15.2 4.1 9.7 11.1 0.0
71339 2022 Apr 4 27 117 10.7 2.6 6.7 8.1 0.0
71340 2022 Apr 4 28 118 12.7 3.9 8.3 8.8 0.0
71341 2022 Apr 4 29 119 11.7 6.7 9.2 5.0 0.0
71342 2022 Apr 4 30 120 17.6 1.0 9.3 16.6 0.0

71343 rows × 10 columns

1.8.3. Quick look: df.describe()#

  • df.describe()

We can output a standard set of descriptives for all the numerical columns in the dataframe by just using df.describe()

weather.describe()
YYYY MM DD DD365 Tmax Tmin Tmean Trange Rainfall_mm
count 71343.000000 71343.000000 71343.000000 71343.000000 71342.000000 71341.000000 71341.000000 71338.000000 71343.000000
mean 1924.165174 6.516210 15.729154 182.917077 13.890153 6.156018 10.046208 7.734244 1.786964
std 56.387642 3.450063 8.799927 105.478891 6.490258 5.235341 5.632085 3.513996 3.976377
min 1827.000000 1.000000 1.000000 1.000000 -9.600000 -17.800000 -12.100000 0.000000 0.000000
25% 1875.000000 4.000000 8.000000 91.500000 9.200000 2.200000 5.900000 5.200000 0.000000
50% 1924.000000 7.000000 16.000000 183.000000 13.700000 6.400000 10.100000 7.400000 0.000000
75% 1973.000000 10.000000 23.000000 274.000000 18.900000 10.200000 14.500000 10.000000 1.700000
max 2022.000000 12.000000 31.000000 366.000000 36.500000 21.200000 27.400000 23.100000 87.900000

Note-

  • Month is missing from the table because the values are not numerical

This can be good for a quick look but most likely you will not need all these stats on all the variables - if you were producing a report for an assignment (or a client) you would want to customize the table

1.8.4. Custom table: df.agg()#

df.agg({'columname':['mean', 'std', 'count']})

You can create a table with the just the stats you need using the function df.agg()

The syntax is a bit fiddly but you can copy from this example which gives the mean and standard deviation for Tmin and Tmax:

# get mean and sd for Tmin and Tmax
weather.agg({'Tmin':['mean', 'std'], 'Tmax':['mean', 'std']})
Tmin Tmax
mean 6.156018 13.890153
std 5.235341 6.490258

Finally, instead of doing that for the whole dataframe weather, you can select rows using df.query() as before:

# get mean and sd for Tmin and Tmax in June
weather.query('MM == 6').agg({'Tmin':['mean', 'std'], 'Tmax':['mean', 'std']})
Tmin Tmax
mean 10.328291 20.011487
std 2.664211 3.585932

1.8.5. Disaggregate by category: df.groupby()#

The syntax df.groupby() breaks a dataframe down by the categories you have chosen to group by. Any function you then apply will be done sparately for each of these categories.

For example, say we want to report the mean value of the peak daily temperature in each month. Remember to get the mean value of the peak daily temperature overall we did:

weather.Tmax.mean()
13.890153065515406

To break it down by month, we insert .groupby('MM'):

weather.groupby('MM').Tmax.mean()
MM
1      6.554444
2      7.401048
3      9.944914
4     13.187517
5     16.795252
6     20.011487
7     21.799007
8     21.192936
9     18.451043
10    14.112639
11     9.640041
12     7.290571
Name: Tmax, dtype: float64

We can of course combine groupby() with agg() to make a custom table for each group of data, eg

# get mean and sd for Tmin and Tmax in each month
weather.groupby('MM').agg({'Tmin':['mean', 'std'], 'Tmax':['mean', 'std']})
Tmin Tmax
mean std mean std
MM
1 1.319437 4.034070 6.554444 3.831624
2 1.470683 3.861644 7.401048 3.723290
3 2.396840 3.472544 9.944914 3.641816
4 4.301786 3.144024 13.187517 3.648047
5 7.165062 3.108733 16.795252 3.761523
6 10.328291 2.664211 20.011487 3.585932
7 12.238098 2.457236 21.799007 3.511055
8 11.965261 2.627916 21.192936 3.232944
9 9.824855 3.241090 18.451043 3.088003
10 6.874028 3.825368 14.112639 3.090256
11 3.692014 3.870762 9.640041 3.254649
12 2.067461 4.131333 7.290571 3.801597
  • Note groupby() is handy to break down datasets according to a categorical variable. In this sense it works similarly to the hue argument in some of the plotting functions you will see next week, which allows us to plot data sepaarately for different cases of a categoricaal variable.

1.8.6. Practice#

Try these quick practice questions:

# Make a table containing the mean rainfall in each month
# Check: the mean in October is 2.17mm
# Get the min and max temperature for each month in the year 2000
# Check: the min and max in May are 4.9 and 7.2 degrees respectively