1.9. Tables of descriptive statistics#
When you are writing a report on a dataset you might like to generate a nice table with the descriptive statistics that you need all together in one place
Below we review a couple of useful pieces of Python syntax for making tables:
df.describe()
df.agg()
df.groupby()
1.9.1. Set up Python Libraries#
As usual you will need to run this code block to import the relevant Python libraries
# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import seaborn as sns
sns.set_theme(style='white')
import statsmodels.api as sm
import statsmodels.formula.api as smf
1.9.2. Import a dataset to work with#
We will work with weather data from the Oxford weather station. This code block will read it automatically from the internet.
weather = pd.read_csv("https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/OxfordWeather.csv")
display(weather)
YYYY | Month | MM | DD | DD365 | Tmax | Tmin | Tmean | Trange | Rainfall_mm | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1827 | Jan | 1 | 1 | 1 | 8.3 | 5.6 | 7.0 | 2.7 | 0.0 |
1 | 1827 | Jan | 1 | 2 | 2 | 2.2 | 0.0 | 1.1 | 2.2 | 0.0 |
2 | 1827 | Jan | 1 | 3 | 3 | -2.2 | -8.3 | -5.3 | 6.1 | 9.7 |
3 | 1827 | Jan | 1 | 4 | 4 | -1.7 | -7.8 | -4.8 | 6.1 | 0.0 |
4 | 1827 | Jan | 1 | 5 | 5 | 0.0 | -10.6 | -5.3 | 10.6 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
71338 | 2022 | Apr | 4 | 26 | 116 | 15.2 | 4.1 | 9.7 | 11.1 | 0.0 |
71339 | 2022 | Apr | 4 | 27 | 117 | 10.7 | 2.6 | 6.7 | 8.1 | 0.0 |
71340 | 2022 | Apr | 4 | 28 | 118 | 12.7 | 3.9 | 8.3 | 8.8 | 0.0 |
71341 | 2022 | Apr | 4 | 29 | 119 | 11.7 | 6.7 | 9.2 | 5.0 | 0.0 |
71342 | 2022 | Apr | 4 | 30 | 120 | 17.6 | 1.0 | 9.3 | 16.6 | 0.0 |
71343 rows × 10 columns
1.9.3. Quick look: df.describe()
#
df.describe()
We can output a standard set of descriptives for all the numerical columns in the dataframe by just using df.describe()
weather.describe()
YYYY | MM | DD | DD365 | Tmax | Tmin | Tmean | Trange | Rainfall_mm | |
---|---|---|---|---|---|---|---|---|---|
count | 71343.000000 | 71343.000000 | 71343.000000 | 71343.000000 | 71342.000000 | 71341.000000 | 71341.000000 | 71338.000000 | 71343.000000 |
mean | 1924.165174 | 6.516210 | 15.729154 | 182.917077 | 13.890153 | 6.156018 | 10.046208 | 7.734244 | 1.786964 |
std | 56.387642 | 3.450063 | 8.799927 | 105.478891 | 6.490258 | 5.235341 | 5.632085 | 3.513996 | 3.976377 |
min | 1827.000000 | 1.000000 | 1.000000 | 1.000000 | -9.600000 | -17.800000 | -12.100000 | 0.000000 | 0.000000 |
25% | 1875.000000 | 4.000000 | 8.000000 | 91.500000 | 9.200000 | 2.200000 | 5.900000 | 5.200000 | 0.000000 |
50% | 1924.000000 | 7.000000 | 16.000000 | 183.000000 | 13.700000 | 6.400000 | 10.100000 | 7.400000 | 0.000000 |
75% | 1973.000000 | 10.000000 | 23.000000 | 274.000000 | 18.900000 | 10.200000 | 14.500000 | 10.000000 | 1.700000 |
max | 2022.000000 | 12.000000 | 31.000000 | 366.000000 | 36.500000 | 21.200000 | 27.400000 | 23.100000 | 87.900000 |
Note-
Month is missing from the table because the values are not numerical
We have got descriptives for YYYY and
This can be good for a quick look but most likely you will not need all these stats on aall the variables - if you were producing a report for an assignment (or a client) you would want to customize the table
1.9.4. Custom table: df.agg()
#
df.agg({'columname':['mean', 'std', 'count']})
You can create a table with the just the stats you need using the function df.agg()
This is not essential (although it does look nice in a report) - the syntax is a bit fiddly but you can copy from this example which gives the mean and standard deviation for Tmin and Tmax:
# get mean and sd for Tmin and Tmax
weather.agg({'Tmin':['mean', 'std'], 'Tmax':['mean', 'std']})
Tmin | Tmax | |
---|---|---|
mean | 6.156018 | 13.890153 |
std | 5.235341 | 6.490258 |
Finally, instead of doing that for the whole dataframe weather
, you can select rows using df.query()
as before:
# get mean and sd for Tmin and Tmax in June
weather.query('MM == 6').agg({'Tmin':['mean', 'std'], 'Tmax':['mean', 'std']})
Tmin | Tmax | |
---|---|---|
mean | 10.328291 | 20.011487 |
std | 2.664211 | 3.585932 |
1.9.5. Disaggregate by category: df.groupby()#
The syntax df.groupby()
breaks a dataframe down by the categories you have chosen to group by. Any function you then apply will be done sparately for each of these categories.
For example, say we want to report the mean value of the peak daily temperature in each month. Remember to get the mean value of the peak daily temperature overall we did:
weather.Tmax.mean()
13.890153065515406
To break it down by month, we insert .groupby('MM')
:
weather.groupby('MM').Tmax.mean()
MM
1 6.554444
2 7.401048
3 9.944914
4 13.187517
5 16.795252
6 20.011487
7 21.799007
8 21.192936
9 18.451043
10 14.112639
11 9.640041
12 7.290571
Name: Tmax, dtype: float64
We can of course combine groupby()
with agg()
to make a custom table for each group of data, eg
# get mean and sd for Tmin and Tmax in each mmonth
weather.groupby('MM').agg({'Tmin':['mean', 'std'], 'Tmax':['mean', 'std']})
Tmin | Tmax | |||
---|---|---|---|---|
mean | std | mean | std | |
MM | ||||
1 | 1.319437 | 4.034070 | 6.554444 | 3.831624 |
2 | 1.470683 | 3.861644 | 7.401048 | 3.723290 |
3 | 2.396840 | 3.472544 | 9.944914 | 3.641816 |
4 | 4.301786 | 3.144024 | 13.187517 | 3.648047 |
5 | 7.165062 | 3.108733 | 16.795252 | 3.761523 |
6 | 10.328291 | 2.664211 | 20.011487 | 3.585932 |
7 | 12.238098 | 2.457236 | 21.799007 | 3.511055 |
8 | 11.965261 | 2.627916 | 21.192936 | 3.232944 |
9 | 9.824855 | 3.241090 | 18.451043 | 3.088003 |
10 | 6.874028 | 3.825368 | 14.112639 | 3.090256 |
11 | 3.692014 | 3.870762 | 9.640041 | 3.254649 |
12 | 2.067461 | 4.131333 | 7.290571 | 3.801597 |
Note
groupby()
is handy to break down datasets according to a categorical variable. In this sense it works similarly to thehue
argument in some of the plotting functions you will see next week, which allows us to plot data sepaarately for different cases of a categoricaal variable.