1.8. Python: Descriptives and Indexing#
In this notebook we cover some key Pandas syntax and functions:
- Syntax for getting various descriptive statistics from a Pandas dataframe
- Syntax for indexing a dataframe - finding the rows and columns that you need
  - this allows you to get descriptives for specific rows and columns
Here we meet indexing in the context of descriptive statistics, but indexing is something you will do every single time you write code for data analysis. Incorrect indexing syntax is the single biggest source of bugs for students on this course, so it is well worth spending the time to get to grips with it.
You absolutely should work through all the exercises in this notebook in advance of the tutorial.
1.8.1. Set up Python Libraries#
As usual you will need to run this code block to import the relevant Python libraries
# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import seaborn as sns
sns.set_theme(style='white')
import statsmodels.api as sm
import statsmodels.formula.api as smf
import warnings
warnings.simplefilter('ignore', category=FutureWarning)
1.8.2. Import a dataset to work with#
We will work with weather data from the Oxford weather station. This code block will read it automatically from the internet.
weather = pd.read_csv("https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/OxfordWeather.csv")
display(weather)
|   | YYYY | Month | MM | DD | DD365 | Tmax | Tmin | Tmean | Trange | Rainfall_mm |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1827 | Jan | 1 | 1 | 1 | 8.3 | 5.6 | 7.0 | 2.7 | 0.0 |
| 1 | 1827 | Jan | 1 | 2 | 2 | 2.2 | 0.0 | 1.1 | 2.2 | 0.0 |
| 2 | 1827 | Jan | 1 | 3 | 3 | -2.2 | -8.3 | -5.3 | 6.1 | 9.7 |
| 3 | 1827 | Jan | 1 | 4 | 4 | -1.7 | -7.8 | -4.8 | 6.1 | 0.0 |
| 4 | 1827 | Jan | 1 | 5 | 5 | 0.0 | -10.6 | -5.3 | 10.6 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 71338 | 2022 | Apr | 4 | 26 | 116 | 15.2 | 4.1 | 9.7 | 11.1 | 0.0 |
| 71339 | 2022 | Apr | 4 | 27 | 117 | 10.7 | 2.6 | 6.7 | 8.1 | 0.0 |
| 71340 | 2022 | Apr | 4 | 28 | 118 | 12.7 | 3.9 | 8.3 | 8.8 | 0.0 |
| 71341 | 2022 | Apr | 4 | 29 | 119 | 11.7 | 6.7 | 9.2 | 5.0 | 0.0 |
| 71342 | 2022 | Apr | 4 | 30 | 120 | 17.6 | 1.0 | 9.3 | 16.6 | 0.0 |
71343 rows × 10 columns
1.8.3. Get descriptive statistics#
We will look at indexing in the context of getting descriptive statistics for the different variables in a dataframe, and/or for a selection of rows.
Pandas has a number of built-in functions to get descriptive statistics:
- `df.mean()` - gets the mean
- `df.median()` - gets the median
- `df.var()` - gets the variance
- `df.std()` - gets the standard deviation
- `df.min()` - gets the minimum value
- `df.max()` - gets the maximum value
- `df.corr()` - gets the correlation coefficient (Pearson or Spearman)
In the function names above, `df` is just a placeholder for "dataframe". You can replace `df` with the name of your dataframe (and we need to tell the function to use only columns containing numerical data):
weather.mean(numeric_only=True)
YYYY 1924.165174
MM 6.516210
DD 15.729154
DD365 182.917077
Tmax 13.890153
Tmin 6.156018
Tmean 10.046208
Trange 7.734244
Rainfall_mm 1.786964
dtype: float64
Which columns from the dataframe are missing from the table of means?
Month is missing because the values are strings (words); we only get the mean for numerical columns
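To see what `numeric_only=True` is doing, here is a minimal sketch on a toy dataframe (the values are invented, for illustration only) with the same mix of string and numerical columns:

```python
import pandas as pd

# Toy dataframe (invented values) mirroring the structure of the
# weather data: one string column plus two numerical columns
toy = pd.DataFrame({'Month': ['Jan', 'Jan', 'Feb'],
                    'Tmax': [8.3, 2.2, 5.0],
                    'Rainfall_mm': [0.0, 9.7, 1.2]})

# numeric_only=True tells pandas to skip the string column 'Month'
m = toy.mean(numeric_only=True)
print(m)   # means for Tmax and Rainfall_mm only
```

Just as with the weather data, the string column is silently dropped from the result.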
1.8.4. Descriptives for a single column#
Format:
df.column.mean()
You might just want the mean for one column, in which case you can include the column name as follows:
weather.Tmax.mean()
13.890153065515406
Instead of a whole table of means, many of which were irrelevant, we now have just the single number we wanted - the mean of Tmax.
Think - check that the value obtained this way matches the one in the table above.
Why does it work?#
The command weather.Tmax.mean() has two steps.
Step 1:
The syntax weather.Tmax (or more generally, df.column) returns just one column of the dataframe as a series (essentially a one-column table).
Look:
weather.Tmax
0 8.3
1 2.2
2 -2.2
3 -1.7
4 0.0
...
71338 15.2
71339 10.7
71340 12.7
71341 11.7
71342 17.6
Name: Tmax, Length: 71343, dtype: float64
If we liked, we could give this single-column series a name and 'save' it:
DailyMaxima = weather.Tmax
print(DailyMaxima)
0 8.3
1 2.2
2 -2.2
3 -1.7
4 0.0
...
71338 15.2
71339 10.7
71340 12.7
71341 11.7
71342 17.6
Name: Tmax, Length: 71343, dtype: float64
However, we don't need to do that, as we can just tack the .mean() function onto it directly, as follows:
Step 2:
The syntax .mean() gets the mean of the thing before the dot (i.e. the series you created with the command weather.Tmax):
weather.Tmax.mean()
13.890153065515406
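As an aside (standard pandas, not specific to this notebook), bracket indexing df['column'] is exactly equivalent to the dot syntax used above, and is the safer choice when a column name contains spaces or clashes with a dataframe method name. A quick sketch with toy (invented) values:

```python
import pandas as pd

# Toy dataframe (invented values)
df = pd.DataFrame({'Tmax': [8.3, 2.2, 5.0]})

a = df.Tmax.mean()      # dot syntax, as used in this notebook
b = df['Tmax'].mean()   # bracket syntax - exactly equivalent here
print(a, b)
```

Both forms return the same series, so either can be followed by .mean(), .std(), and so on.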
Exercises#
Complete the following quick practice questions.
Note that the numerical answers are provided so you can check that you have done it correctly.
# get the mean daily rainfall
# answer = 1.79
# get the median daily rainfall
# answer = 0.0
# think - why??
# get the standard deviation of the maximum daily temperature
# answer = 6.49
# find the minimum value of "year"
# answer = 1827
1.8.5. Get descriptives for a subset of rows#
df.query('columnname == number')
df.query('columnname == "string"')
df.query('columnname1 == value1 and columnname2 == value2')
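Before applying this to the weather data, here is a minimal sketch of the three formats above on a toy dataframe (the values are invented, for illustration only):

```python
import pandas as pd

# Toy dataframe (invented values)
df = pd.DataFrame({'YYYY': [1921, 1921, 1922],
                   'Month': ['Jan', 'Feb', 'Jan'],
                   'Tmax': [8.3, 2.2, 5.0]})

rows_year = df.query('YYYY == 1921')                     # match a number
rows_month = df.query('Month == "Jan"')                  # match a string
rows_both = df.query('YYYY == 1921 and Month == "Jan"')  # match both

print(len(rows_year), len(rows_month), len(rows_both))   # 2 2 1
```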
Say I want to know the mean daily maximum temperatures for the year 1921.
Step 1: find the relevant rows#
I first need to somehow pull out the rows of the table where the value for YYYY is 1921
I can do this using the function df.query(), for example weather.query('YYYY == 1921')
- note the quote marks surrounding the whole query
- note the double equals sign ==, which is used for checking whether two values are equal (as opposed to a single =, which sets a value)
# get all the rows where year is 1921
weather.query('YYYY == 1921')

|   | YYYY | Month | MM | DD | DD365 | Tmax | Tmin | Tmean | Trange | Rainfall_mm |
|---|---|---|---|---|---|---|---|---|---|---|
| 34333 | 1921 | Dec | 12 | 31 | 1 | 9.1 | 3.3 | 6.2 | 5.8 | 0.0 |
| 34334 | 1921 | Jan | 1 | 1 | 2 | 12.2 | 7.1 | 9.7 | 5.1 | 9.7 |
| 34335 | 1921 | Jan | 1 | 2 | 3 | 11.9 | 8.7 | 10.3 | 3.2 | 6.1 |
| 34336 | 1921 | Jan | 1 | 3 | 4 | 9.9 | 5.9 | 7.9 | 4.0 | 0.0 |
| 34337 | 1921 | Jan | 1 | 4 | 5 | 12.3 | 9.8 | 11.1 | 2.5 | 5.5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 34693 | 1921 | Dec | 12 | 26 | 361 | 10.8 | -3.1 | 3.9 | 13.9 | 5.6 |
| 34694 | 1921 | Dec | 12 | 27 | 362 | 11.2 | 5.4 | 8.3 | 5.8 | 1.3 |
| 34695 | 1921 | Dec | 12 | 28 | 363 | 13.9 | 3.9 | 8.9 | 10.0 | 0.0 |
| 34696 | 1921 | Dec | 12 | 29 | 364 | 7.6 | 1.7 | 4.7 | 5.9 | 0.8 |
| 34697 | 1921 | Dec | 12 | 30 | 365 | 11.7 | 1.9 | 6.8 | 9.8 | 4.5 |

365 rows × 10 columns
Note the size of this table - there are 365 rows. Why?

- In the original table weather, there is one row per day
- 365 of those rows (days) match the criterion 'YYYY == 1921', because there are 365 days in a year
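Incidentally, the same rows can also be selected with a boolean mask, an alternative syntax you may meet in other pandas code. A sketch with a toy dataframe (invented values):

```python
import pandas as pd

# Toy dataframe (invented values)
df = pd.DataFrame({'YYYY': [1920, 1921, 1921],
                   'Tmax': [7.0, 14.0, 18.0]})

by_query = df.query('YYYY == 1921')   # as used in this notebook
by_mask = df[df['YYYY'] == 1921]      # boolean-mask equivalent

print(len(by_query), len(by_mask))    # both select the same 2 rows
```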
If I wanted to, I could give this table a name and ‘save’ it for later use:
weather1921 = weather.query('YYYY == 1921') # create a new dataframe for just 1921
weather1921 # look at my new dataframe
|   | YYYY | Month | MM | DD | DD365 | Tmax | Tmin | Tmean | Trange | Rainfall_mm |
|---|---|---|---|---|---|---|---|---|---|---|
| 34333 | 1921 | Dec | 12 | 31 | 1 | 9.1 | 3.3 | 6.2 | 5.8 | 0.0 |
| 34334 | 1921 | Jan | 1 | 1 | 2 | 12.2 | 7.1 | 9.7 | 5.1 | 9.7 |
| 34335 | 1921 | Jan | 1 | 2 | 3 | 11.9 | 8.7 | 10.3 | 3.2 | 6.1 |
| 34336 | 1921 | Jan | 1 | 3 | 4 | 9.9 | 5.9 | 7.9 | 4.0 | 0.0 |
| 34337 | 1921 | Jan | 1 | 4 | 5 | 12.3 | 9.8 | 11.1 | 2.5 | 5.5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 34693 | 1921 | Dec | 12 | 26 | 361 | 10.8 | -3.1 | 3.9 | 13.9 | 5.6 |
| 34694 | 1921 | Dec | 12 | 27 | 362 | 11.2 | 5.4 | 8.3 | 5.8 | 1.3 |
| 34695 | 1921 | Dec | 12 | 28 | 363 | 13.9 | 3.9 | 8.9 | 10.0 | 0.0 |
| 34696 | 1921 | Dec | 12 | 29 | 364 | 7.6 | 1.7 | 4.7 | 5.9 | 0.8 |
| 34697 | 1921 | Dec | 12 | 30 | 365 | 11.7 | 1.9 | 6.8 | 9.8 | 4.5 |

365 rows × 10 columns
… but I don’t need to do this as I can tack on more commands to complete my goal in a single line of code.
Step 2: find the relevant column#
Now that I have grabbed the relevant rows, I narrow it down to column Tmax
weather.query('YYYY == 1921').Tmax
34333 9.1
34334 12.2
34335 11.9
34336 9.9
34337 12.3
...
34693 10.8
34694 11.2
34695 13.9
34696 7.6
34697 11.7
Name: Tmax, Length: 365, dtype: float64
Step 3: add in the actual function#
Finally, I can tack on the function I want to run on my selected rows and column: mean():
weather.query('YYYY == 1921').Tmax.mean()
15.651232876712331
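Putting the three steps together, the whole pattern can be sketched on a toy dataframe (the values are invented, for illustration only):

```python
import pandas as pd

# Toy dataframe (invented values)
df = pd.DataFrame({'YYYY': [1920, 1921, 1921],
                   'Tmax': [7.0, 14.0, 18.0]})

# query() picks the rows, .Tmax picks the column, .mean() summarises
result = df.query('YYYY == 1921').Tmax.mean()
print(result)  # (14.0 + 18.0) / 2 = 16.0
```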
Exercises#
Complete the following quick practice questions:
# Get the mean daily maximum temperature in 2006
# answer = 15.7
# Get the mean daily minimum temperature in 1947
# answer = 6.16
1.8.6. Get descriptives for one category#
What about getting the mean daily maximum temperature for, say, October? Now I need to pull out all the rows in which Month matches 'Oct'.

The syntax is very similar, but now I am matching a string "Oct", so it needs to be in quotes:
weather.query('Month == "Oct"').Tmax.mean()
14.112638544251446
note you can actually use single and double quotes interchangeably in Python, but in cases like this where you have quotes within quotes, it is better to use both types to avoid the computer getting confused about which opening and closing quotes to pair up.

So you could do either of these:

weather.query('Month == "Oct"').Tmax.mean()
weather.query("Month == 'Oct'").Tmax.mean()

... but not these:

weather.query('Month == 'Oct'').Tmax.mean()
weather.query("Month == "Oct"").Tmax.mean()
Exercises#
Complete the following quick practice questions:
# Get the mean daily rainfall in January
# answer = 1.77
# Get the mean daily rainfall in June
# answer = 1.81
# Is this surprising? why?
1.8.7. Match multiple conditions#
What if I want to know the mean daily maximum temperature in 1921 for June only?
We can simply use and (or equivalently &) to pass df.query() multiple conditions:
# mean daily maximum temperature in 1921
weather.query('YYYY == 1921').Tmax.mean()
15.651232876712331
# mean daily maximum temperature in June 1921
weather.query('YYYY == 1921 & MM == 6').Tmax.mean()
21.443333333333335
Think - hopefully the value for June is higher than the average for the whole year. If not, something has gone wrong!
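To see that and and & really are interchangeable inside df.query(), here is a sketch on a toy dataframe (invented values):

```python
import pandas as pd

# Toy dataframe (invented values)
df = pd.DataFrame({'YYYY': [1921, 1921, 1922],
                   'MM': [6, 7, 6],
                   'Tmax': [21.0, 25.0, 18.0]})

# 'and' and '&' give the same rows inside a query string
a = df.query('YYYY == 1921 and MM == 6').Tmax.mean()
b = df.query('YYYY == 1921 & MM == 6').Tmax.mean()
print(a, b)  # the same answer either way
```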
Exercises#
Complete the following quick practice questions:
# Get the mean daily rainfall in June 2007
# answer = 2.62
# Get the minimum temperature in January 1947
# answer = -12.4. This was an unusually cold winter.
# note that you were asked for the min temperature in that month, not the mean of the daily minimum (Tmin)
1.8.8. Syntax for correlation#
When we run a correlation, we need to index two columns - the two we are correlating.
Say we want to get the correlation between the minimum and maximum daily temperatures, Tmin and Tmax:
Correlation is 'symmetrical' (it doesn't matter which variable is \(x\) and which is \(y\)) but for the purposes of the pandas function we are using, we first find one of the columns (it doesn't matter which one):
weather.Tmin
… and call the pandas function df.corr() to operate on it:
weather.Tmin.corr()
Inside the brackets, we tell it which column to correlate weather.Tmin against:
# get the correlation between columns Tmin and Tmax of dataframe weather
weather.Tmin.corr(weather.Tmax)
0.8414800914062079
We could change which column we correlate Tmin against by changing what is inside the brackets, e.g.:
# get the correlation between columns Tmin and Tmean of dataframe weather
weather.Tmin.corr(weather.Tmean)
0.9502479355058366
Since correlation is ‘symmetrical’ (it doesn’t matter which variable is \(x\) and which is \(y\)) we can also switch the variables around:
# get the correlation between columns Tmin and Tmean of dataframe weather
weather.Tmean.corr(weather.Tmin)
0.9502479355058366
… we just need one column specified outside the brackets and one inside
Correlation type#
We can set the type of correlation to Pearson or Spearman as appropriate (the default is Pearson’s \(r\)):
# get the correlation between columns Tmin and Tmax of dataframe weather
weather.Tmin.corr(weather.Tmax, method='spearman')
0.8438796074978867
Exercises#
Complete the following practice questions to help you get to grips with the syntax for correlation:
# Find the Spearman correlation between daily mean temperature Tmean, and rainfall
# answer = -0.019
# Find the correlation between year and daily mean temperature
# answer = 0.083
Think - are these strong or weak correlations? Is this surprising, did you expect something different?
All-by-all correlation matrix#
We can also quickly output all the correlations between all possible pairs of columns in a dataframe by simply not giving any column index (but we do have to tell it to only use the numeric columns):
weather.corr(numeric_only=True)
|   | YYYY | MM | DD | DD365 | Tmax | Tmin | Tmean | Trange | Rainfall_mm |
|---|---|---|---|---|---|---|---|---|---|
| YYYY | 1.000000 | -0.003411 | -0.000059 | -0.003372 | 0.071631 | 0.089683 | 0.083044 | -0.001257 | 0.008117 |
| MM | -0.003411 | 1.000000 | 0.010567 | 0.995580 | 0.179681 | 0.235401 | 0.213082 | -0.018820 | 0.043672 |
| DD | -0.000059 | 0.010567 | 1.000000 | 0.092771 | 0.001217 | 0.002876 | 0.002035 | -0.002055 | 0.005315 |
| DD365 | -0.003372 | 0.995580 | 0.092771 | 1.000000 | 0.177016 | 0.233248 | 0.210545 | -0.020536 | 0.043925 |
| Tmax | 0.071631 | 0.179681 | 0.001217 | 0.177016 | 1.000000 | 0.841480 | 0.967881 | 0.593339 | -0.008807 |
| Tmin | 0.089683 | 0.235401 | 0.002876 | 0.233248 | 0.841480 | 1.000000 | 0.950248 | 0.064379 | 0.086181 |
| Tmean | 0.083044 | 0.213082 | 0.002035 | 0.210545 | 0.967881 | 0.950248 | 1.000000 | 0.371965 | 0.035037 |
| Trange | -0.001257 | -0.018820 | -0.002055 | -0.020536 | 0.593339 | 0.064379 | 0.371965 | 1.000000 | -0.144654 |
| Rainfall_mm | 0.008117 | 0.043672 | 0.005315 | 0.043925 | -0.008807 | 0.086181 | 0.035037 | -0.144654 | 1.000000 |
The problem with this is that the figure we want is often buried amongst a lot of irrelevant correlation coefficients.
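If you do want a single number out of this matrix, one option (standard pandas indexing, not used elsewhere in this notebook) is to index the matrix with .loc. A sketch with a toy dataframe (invented values); with the real data you would index weather.corr(numeric_only=True) in the same way:

```python
import pandas as pd

# Toy dataframe (invented values)
df = pd.DataFrame({'Tmin': [1.0, 2.0, 3.0, 4.0],
                   'Tmax': [2.0, 4.0, 5.0, 9.0]})

r_matrix = df.corr(numeric_only=True)

# .loc['row', 'column'] pulls out just the coefficient we want
r = r_matrix.loc['Tmin', 'Tmax']
print(r)
```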
Correlation using scipy.stats#
Note that here we are using the pandas function df.corr(). Later in the course we will use two functions from a library called scipy.stats:
- stats.pearsonr()
- stats.spearmanr()
These are actually a bit more useful as they give us a \(p\)-value for the correlation as well as the correlation coefficient
don’t worry if you don’t know what a \(p\)-value is yet - this will be covered later in the course
scipy.stats will be used extensively in the module on hypothesis testing.
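As a preview (the call signatures below are standard scipy.stats, but the data values are invented), these functions return the coefficient and the p-value together:

```python
import scipy.stats as stats

# Toy data (invented values); with the real data you would pass
# weather.Tmin and weather.Tmax
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8]

r, p = stats.pearsonr(x, y)        # Pearson's r and its p-value
rho, p_s = stats.spearmanr(x, y)   # Spearman's rho and its p-value

print(round(r, 3), round(rho, 3))
```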