Tables of descriptive statistics

1.9. Tables of descriptive statistics#

When you are writing a report on a dataset you might like to generate a nice table with the descriptive statistics that you need all together in one place

Below we review a couple of useful pieces of Python syntax for making tables:

df.describe()
df.agg()
df.groupby()

1.9.1. Set up Python Libraries#

As usual you will need to run this code block to import the relevant Python libraries

# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import seaborn as sns
sns.set_theme(style='white')
import statsmodels.api as sm
import statsmodels.formula.api as smf
import warnings 
warnings.simplefilter('ignore', category=FutureWarning)

1.9.2. Import a dataset to work with#

We will work with weather data from the Oxford weather station. This code block will read it automatically from the internet.

weather = pd.read_csv("https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/OxfordWeather.csv")
display(weather)

	YYYY	Month	MM	DD	DD365	Tmax	Tmin	Tmean	Trange	Rainfall_mm
0	1827	Jan	1	1	1	8.3	5.6	7.0	2.7	0.0
1	1827	Jan	1	2	2	2.2	0.0	1.1	2.2	0.0
2	1827	Jan	1	3	3	-2.2	-8.3	-5.3	6.1	9.7
3	1827	Jan	1	4	4	-1.7	-7.8	-4.8	6.1	0.0
4	1827	Jan	1	5	5	0.0	-10.6	-5.3	10.6	0.0
...	...	...	...	...	...	...	...	...	...	...
71338	2022	Apr	4	26	116	15.2	4.1	9.7	11.1	0.0
71339	2022	Apr	4	27	117	10.7	2.6	6.7	8.1	0.0
71340	2022	Apr	4	28	118	12.7	3.9	8.3	8.8	0.0
71341	2022	Apr	4	29	119	11.7	6.7	9.2	5.0	0.0
71342	2022	Apr	4	30	120	17.6	1.0	9.3	16.6	0.0

71343 rows × 10 columns

1.9.3. Quick look: `df.describe()`#

df.describe()

We can output a standard set of descriptives for all the numerical columns in the dataframe by just using df.describe()

weather.describe()

	YYYY	MM	DD	DD365	Tmax	Tmin	Tmean	Trange	Rainfall_mm
count	71343.000000	71343.000000	71343.000000	71343.000000	71342.000000	71341.000000	71341.000000	71338.000000	71343.000000
mean	1924.165174	6.516210	15.729154	182.917077	13.890153	6.156018	10.046208	7.734244	1.786964
std	56.387642	3.450063	8.799927	105.478891	6.490258	5.235341	5.632085	3.513996	3.976377
min	1827.000000	1.000000	1.000000	1.000000	-9.600000	-17.800000	-12.100000	0.000000	0.000000
25%	1875.000000	4.000000	8.000000	91.500000	9.200000	2.200000	5.900000	5.200000	0.000000
50%	1924.000000	7.000000	16.000000	183.000000	13.700000	6.400000	10.100000	7.400000	0.000000
75%	1973.000000	10.000000	23.000000	274.000000	18.900000	10.200000	14.500000	10.000000	1.700000
max	2022.000000	12.000000	31.000000	366.000000	36.500000	21.200000	27.400000	23.100000	87.900000

Note-

Month is missing from the table because the values are not numerical

This can be good for a quick look but most likely you will not need all these stats on all the variables - if you were producing a report for an assignment (or a client) you would want to customize the table

1.9.4. Custom table: `df.agg()`#

df.agg({'columname':['mean', 'std', 'count']})

You can create a table with the just the stats you need using the function df.agg()

The syntax is a bit fiddly but you can copy from this example which gives the mean and standard deviation for Tmin and Tmax:

# get mean and sd for Tmin and Tmax
weather.agg({'Tmin':['mean', 'std'], 'Tmax':['mean', 'std']})

	Tmin	Tmax
mean	6.156018	13.890153
std	5.235341	6.490258

Finally, instead of doing that for the whole dataframe weather, you can select rows using df.query() as before:

# get mean and sd for Tmin and Tmax in June
weather.query('MM == 6').agg({'Tmin':['mean', 'std'], 'Tmax':['mean', 'std']})

	Tmin	Tmax
mean	10.328291	20.011487
std	2.664211	3.585932

1.9.5. Disaggregate by category: df.groupby()#

The syntax df.groupby() breaks a dataframe down by the categories you have chosen to group by. Any function you then apply will be done sparately for each of these categories.

For example, say we want to report the mean value of the peak daily temperature in each month. Remember to get the mean value of the peak daily temperature overall we did:

weather.Tmax.mean()

13.890153065515406

To break it down by month, we insert .groupby('MM'):

weather.groupby('MM').Tmax.mean()

MM
    6.554444
    7.401048
    9.944914
   13.187517
   16.795252
   20.011487
   21.799007
   21.192936
   18.451043
  14.112639
   9.640041
   7.290571
Name: Tmax, dtype: float64

We can of course combine groupby() with agg() to make a custom table for each group of data, eg

# get mean and sd for Tmin and Tmax in each month
weather.groupby('MM').agg({'Tmin':['mean', 'std'], 'Tmax':['mean', 'std']})

	Tmin		Tmax
	mean	std	mean	std
MM
1	1.319437	4.034070	6.554444	3.831624
2	1.470683	3.861644	7.401048	3.723290
3	2.396840	3.472544	9.944914	3.641816
4	4.301786	3.144024	13.187517	3.648047
5	7.165062	3.108733	16.795252	3.761523
6	10.328291	2.664211	20.011487	3.585932
7	12.238098	2.457236	21.799007	3.511055
8	11.965261	2.627916	21.192936	3.232944
9	9.824855	3.241090	18.451043	3.088003
10	6.874028	3.825368	14.112639	3.090256
11	3.692014	3.870762	9.640041	3.254649
12	2.067461	4.131333	7.290571	3.801597

Note groupby() is handy to break down datasets according to a categorical variable. In this sense it works similarly to the hue argument in some of the plotting functions you will see next week, which allows us to plot data separately for different cases of a categorical variable.

1.9.6. Practice#

Try these quick practice questions:

# Make a table containing the mean rainfall in each month
# Check: the mean in October is 2.17mm

# Get the min and max temperature for each month in the year 2000
# Check: the min and max in May are 4.9 and 25.3 degrees respectively