1.6. Python Functions#

This week, we will need to create our own Python functions as part of running a permutation test.

Here we will review what functions are and how we can create our own using Python code.

This is a kind of Python tangent to our main stats objective for the week.

1.6.1. Set up Python libraries#

# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import seaborn as sns
sns.set_theme(style='white')
import statsmodels.api as sm
import statsmodels.formula.api as smf
import warnings 
warnings.simplefilter('ignore', category=FutureWarning)

1.6.2. Import the data#

We need some data to work with. Let’s use the good old Oxford Weather dataset.

weather = pd.read_csv("https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/OxfordWeather.csv")
display(weather)
YYYY Month MM DD DD365 Tmax Tmin Tmean Trange Rainfall_mm
0 1827 Jan 1 1 1 8.3 5.6 7.0 2.7 0.0
1 1827 Jan 1 2 2 2.2 0.0 1.1 2.2 0.0
2 1827 Jan 1 3 3 -2.2 -8.3 -5.3 6.1 9.7
3 1827 Jan 1 4 4 -1.7 -7.8 -4.8 6.1 0.0
4 1827 Jan 1 5 5 0.0 -10.6 -5.3 10.6 0.0
... ... ... ... ... ... ... ... ... ... ...
71338 2022 Apr 4 26 116 15.2 4.1 9.7 11.1 0.0
71339 2022 Apr 4 27 117 10.7 2.6 6.7 8.1 0.0
71340 2022 Apr 4 28 118 12.7 3.9 8.3 8.8 0.0
71341 2022 Apr 4 29 119 11.7 6.7 9.2 5.0 0.0
71342 2022 Apr 4 30 120 17.6 1.0 9.3 16.6 0.0

71343 rows × 10 columns

1.6.3. What is a function?#

A function is a computer programme that takes in some information (an input), does something with it, and returns an output.

  • Functions were introduced in DataCamp; you may find it helpful to review that material after reading this section.

We have been using Python functions for the last several weeks, mainly from the function libraries pandas and seaborn. For example, the function df.mean() gets the mean of each column in a dataframe.
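As a quick reminder of what df.mean() does, here is a minimal sketch on a tiny made-up dataframe (the name toy and its values are invented for illustration and are not part of the Oxford Weather data):

```python
import pandas as pd

# A tiny made-up dataframe with two numeric columns (illustrative values only)
toy = pd.DataFrame({
    "Tmax": [8.0, 2.0, -1.0],
    "Rainfall_mm": [0.0, 0.0, 9.0],
})

# df.mean() returns one mean per column
column_means = toy.mean()
print(column_means)

# Selecting a single column first gives just that column's mean
rain_mean = toy.Rainfall_mm.mean()
print(rain_mean)
```

Both columns here have a mean of 3.0, which is easy to confirm by hand.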

Let’s make our own simple function to get the mean for a single column of a dataframe:

def myMean(x):
    m=sum(x)/len(x)
    return m
  • The input is x

  • The output is m

  • Inside the function, we calculate m from x

You will notice if you ran the code block above that nothing seemed to happen. That’s because we just created the function (computer programme) but didn’t run it.

Now, having created the function, we can run or call it as follows:

myMean(weather.Rainfall_mm)
1.7869643833314295

What happened?

  • We called the function by saying myMean()

  • We gave it an input (inside the brackets), weather.Rainfall_mm

  • The function calculated the mean by adding up the values in the input column and dividing by the number of values (the length of the column)

  • The function gave us an output (shown below the code box), of 1.79mm, which is the mean rainfall
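To follow the same steps on numbers small enough to check by hand, here is a minimal sketch that defines myMean and then calls it on a short made-up list (the list rain is invented for illustration):

```python
# Step 1: define the function (nothing is calculated yet)
def myMean(x):
    m = sum(x) / len(x)   # add up the values and divide by how many there are
    return m

# Step 2: call the function on an input
rain = [0.0, 2.5, 1.5, 4.0]   # made-up rainfall values, total 8.0
result = myMean(rain)
print(result)                 # 8.0 / 4 = 2.0
```

The value that comes back from the function can be stored in a variable (here result) and used later, just like any other number.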

Let’s just check using the built-in function that we are used to, df.mean()

weather.Rainfall_mm.mean()
1.7869643833312312

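Yep, same answer. The two results differ only in the last few decimal places, which is a tiny rounding difference that arises because the two methods add up the numbers in slightly different ways.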

Note#

You have to run the code block defining the function before you can call it, otherwise it won’t have been created and won’t exist!
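The sketch below illustrates this point using a hypothetical function name, notYetDefined. Calling it before its definition has been run raises a NameError, which we catch here only so that the example keeps running:

```python
# Calling a function before it has been defined raises a NameError
try:
    notYetDefined([1, 2, 3])
except NameError as err:
    message = str(err)
    print("Python says:", message)

# Once the definition has been run, the same call works
def notYetDefined(x):
    return sum(x) / len(x)

value = notYetDefined([1, 2, 3])
print(value)   # 6 / 3 = 2.0
```

In a notebook, the same error appears if you restart the kernel and run the cell that calls the function without first re-running the cell that defines it.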

1.6.4. Difference of means#

As another example, let’s define a function that takes in two inputs and finds the difference of their means:

def dMeans(x,y):
    mx = sum(x)/len(x)
    my = sum(y)/len(y)
    diff = mx-my
    return diff

Note that this function now has two inputs: x and y.

The function does the following:

  • calculate the mean for x as mx

  • calculate the mean for y as my

  • get the difference mx-my

Let’s use it to calculate the difference in mean rainfall between November and May:

# find the relevant rows and column in the dataframe and give them a name
nov = weather.query('Month == "Nov"').Rainfall_mm
may = weather.query('Month == "May"').Rainfall_mm

# run the function dMeans
dMeans(nov,may)

# note we could have done the same thing in a single line:
# dMeans(weather.query('Month == "Nov"').Rainfall_mm, weather.query('Month == "May"').Rainfall_mm)
# the only reason I didn't do this was that I think the version above is a bit easier to follow as a student
0.37674993107251487

Apparently it rains more in November than in May, which is unsurprising; the mean daily rainfall is about 0.38 mm greater in November.

Note that which input (nov or may) gets called x and which gets called y within the function is determined by the order in which we write them within the function’s parentheses.

In the function definition we have:

def dMeans(x,y):

meaning that when we call the function, whatever is first in the brackets becomes x and whatever is second becomes y. So when we call

dMeans(nov,may)

  • nov becomes x and

  • may becomes y

The function returns mean(x) - mean(y), so this is the mean rainfall in November minus the mean rainfall in May; if the output is a positive number, this means that there was more rain in November than in May.

If we called dMeans(may,nov) we would get the mean rainfall in May minus the mean rainfall in November, which is the same number with the sign flipped (negative here, as the rainfall in November is higher).
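The effect of argument order can be checked on two short made-up lists (wet and dry are invented for illustration). The last line also shows that Python lets you name the inputs explicitly when calling, in which case the order you write them in no longer matters:

```python
def dMeans(x, y):
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    diff = mx - my
    return diff

wet = [4.0, 6.0]   # made-up values, mean 5.0
dry = [1.0, 3.0]   # made-up values, mean 2.0

forward = dMeans(wet, dry)       # wet becomes x, dry becomes y: 5.0 - 2.0
backward = dMeans(dry, wet)      # dry becomes x, wet becomes y: 2.0 - 5.0
named = dMeans(y=dry, x=wet)     # inputs matched by name rather than position

print(forward, backward, named)  # 3.0 -3.0 3.0
```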

1.6.5. Mean difference#

Finally, let’s define a function that takes in two inputs made up of matched pairs and finds the mean difference (within pairs).

For example, say we want to know if the weather was warmer in 2001 than in 1901.

Instead of comparing the average temperature for all 365 days in 1901 to the average for all 365 days in 2001, we could compare each date in 1901 to the same date in 2001 - so we find the difference in temperature between Jan 1st 1901 and Jan 1st 2001, then the same for Jan 2nd etc.

Naturally this will only work if the data are actually matched and hence the two samples have the same \(n\).

def mDiff(x,y):
    diff = x-y
    meanDiff = sum(diff)/len(diff)
    return meanDiff
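One practical caveat when passing pandas Series to mDiff: subtracting two Series matches values by their index labels, not by their position. Rows pulled out of different parts of a dataframe (for example with query) keep their original row labels, so x-y can come out as all NaN. The sketch below uses a made-up toy dataframe (temps, t1901 and t2001 are illustrative names and values, not the Oxford data) and converts the columns to NumPy arrays with .to_numpy() so that they are subtracted position by position. This is one possible workaround under those assumptions, not necessarily the approach used later in the course.

```python
import pandas as pd

def mDiff(x, y):
    diff = x - y
    meanDiff = sum(diff) / len(diff)
    return meanDiff

# Made-up daily mean temperatures for three matched dates in two years
temps = pd.DataFrame({
    "YYYY":  [1901, 1901, 1901, 2001, 2001, 2001],
    "Tmean": [3.0, 4.0, 5.0, 5.0, 5.0, 8.0],
})

t1901 = temps.query("YYYY == 1901").Tmean   # row labels 0, 1, 2
t2001 = temps.query("YYYY == 2001").Tmean   # row labels 3, 4, 5

# No row labels in common, so every element of the difference is NaN
misaligned = mDiff(t2001, t1901)
print(misaligned)

# Converting to NumPy arrays pairs the values by position instead
aligned = mDiff(t2001.to_numpy(), t1901.to_numpy())
print(aligned)   # (2.0 + 1.0 + 3.0) / 3 = 2.0
```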

Eagle-eyed readers may realise that the two functions dMeans and mDiff, given the same matched data, give the same answer. However, we will see later in this chapter that once we start randomly shuffling data (as in a permutation test), the difference of means and mean difference behave quite differently.
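The agreement between the two functions on unshuffled, matched data can be checked with a minimal sketch. The arrays x and y below are made-up values chosen so the arithmetic is easy to do by hand:

```python
import numpy as np

def dMeans(x, y):
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    return mx - my

def mDiff(x, y):
    diff = x - y
    return sum(diff) / len(diff)

# Made-up matched pairs (same number of values in each sample)
x = np.array([10.0, 12.0, 14.0])   # mean 12.0
y = np.array([9.0, 10.0, 11.0])    # mean 10.0

difference_of_means = dMeans(x, y)   # 12.0 - 10.0
mean_difference = mDiff(x, y)        # (1.0 + 2.0 + 3.0) / 3

print(difference_of_means, mean_difference)   # 2.0 2.0
```

The two agree because, when both samples have the same \(n\), the sum of the pairwise differences equals the difference of the two sums.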