1.6. Python Functions#
This week, we will need to create our own Python functions as part of running a permutation test.
Here we will review what functions are and how we can create our own using Python code.
This is a kind of Python tangent to our main stats objective for the week.
1.6.1. Set up Python libraries#
# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import seaborn as sns
sns.set_theme(style='white')
import statsmodels.api as sm
import statsmodels.formula.api as smf
import warnings
warnings.simplefilter('ignore', category=FutureWarning)
1.6.2. Import the data#
We need some data to work with. Let’s use the good old Oxford Weather dataset.
weather = pd.read_csv("https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/OxfordWeather.csv")
display(weather)
 | YYYY | Month | MM | DD | DD365 | Tmax | Tmin | Tmean | Trange | Rainfall_mm |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1827 | Jan | 1 | 1 | 1 | 8.3 | 5.6 | 7.0 | 2.7 | 0.0 |
1 | 1827 | Jan | 1 | 2 | 2 | 2.2 | 0.0 | 1.1 | 2.2 | 0.0 |
2 | 1827 | Jan | 1 | 3 | 3 | -2.2 | -8.3 | -5.3 | 6.1 | 9.7 |
3 | 1827 | Jan | 1 | 4 | 4 | -1.7 | -7.8 | -4.8 | 6.1 | 0.0 |
4 | 1827 | Jan | 1 | 5 | 5 | 0.0 | -10.6 | -5.3 | 10.6 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
71338 | 2022 | Apr | 4 | 26 | 116 | 15.2 | 4.1 | 9.7 | 11.1 | 0.0 |
71339 | 2022 | Apr | 4 | 27 | 117 | 10.7 | 2.6 | 6.7 | 8.1 | 0.0 |
71340 | 2022 | Apr | 4 | 28 | 118 | 12.7 | 3.9 | 8.3 | 8.8 | 0.0 |
71341 | 2022 | Apr | 4 | 29 | 119 | 11.7 | 6.7 | 9.2 | 5.0 | 0.0 |
71342 | 2022 | Apr | 4 | 30 | 120 | 17.6 | 1.0 | 9.3 | 16.6 | 0.0 |
71343 rows × 10 columns
1.6.3. What is a function?#
A function is a computer programme that takes in some information (an input), does something with it, and returns an output.
Functions were introduced in DataCamp and you could review this if helpful after reading this section.
We have been using Python functions for the last several weeks, mainly from the function libraries pandas and seaborn. For example, the function df.mean() gets the mean of each column in a dataframe.
Let’s make our own simple function to get the mean for a single column of a dataframe:
def myMean(x):
    m = sum(x)/len(x)
    return m
The input is x
The output is m
Inside the function, we calculate m from x
You will notice if you ran the code block above that nothing seemed to happen. That’s because we just created the function (computer programme) but didn’t run it.
Now, having created the function, we can run or call it as follows:
myMean(weather.Rainfall_mm)
1.7869643833314295
What happened?
We called the function by saying myMean()
We gave it an input (inside the brackets), weather.Rainfall_mm
The function calculated the mean by adding up the values in the input column and dividing by the number of values (the length of the column)
The function gave us an output (shown below the code box) of 1.79 mm, which is the mean rainfall
Let’s just check using the built-in function that we are used to, df.mean()
weather.Rainfall_mm.mean()
1.7869643833312312
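Yep, same answer. The two results agree to 12 decimal places; the tiny difference after that is just floating-point rounding, because the two methods add up the values in a slightly different way.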
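If you want to convince yourself that the function really is doing the arithmetic described above, it can help to try it on a handful of numbers small enough to check by hand. The sketch below is only an illustration: the list toy is made up for this example (it is not taken from the Oxford Weather data), and it uses a plain Python list rather than a dataframe column, which works because sum() and len() accept both.

```python
# Re-define the function so this block can be run on its own
def myMean(x):
    m = sum(x) / len(x)
    return m

# Made-up values for illustration only (not from the weather dataset)
toy = [2.0, 4.0, 6.0, 8.0]

# By hand: (2 + 4 + 6 + 8) / 4 = 20 / 4 = 5
toy_mean = myMean(toy)
print(toy_mean)
```

Because the input is so small, you can confirm the printed value against the calculation in the comment.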
Note#
You have to run the code block defining the function before you can call it, otherwise it won’t have been created and won’t exist!
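To see what "won't exist" looks like in practice, here is a minimal sketch. The function name notYetDefined is hypothetical and chosen only for this illustration; the try/except wrapper is there so that the error is caught and printed rather than stopping the notebook.

```python
# Calling a function before its definition has been run raises an error
try:
    notYetDefined([1, 2, 3])   # hypothetical function, not defined yet at this point
except NameError as err:
    message = str(err)
    print("Python says:", message)

# Now we run the definition...
def notYetDefined(x):
    return sum(x) / len(x)

# ...and the same call works
result = notYetDefined([1, 2, 3])
print(result)
```

In a notebook you would normally just see the error message directly; the fix is to go back and run the cell containing the def.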
1.6.4. Difference of means#
As another example, let’s define a function that takes in two inputs and finds the difference of their means:
def dMeans(x,y):
    mx = sum(x)/len(x)
    my = sum(y)/len(y)
    diff = mx-my
    return diff
Note that this function now has two inputs: x and y
The function does the following:
calculate the mean for x as mx
calculate the mean for y as my
get the difference mx-my
Let’s use it to calculate the difference in mean rainfall between November and May
# find the relevant rows and column in the dataframe and give them a name
nov = weather.query('Month == "Nov"').Rainfall_mm
may = weather.query('Month == "May"').Rainfall_mm
dMeans(nov,may)
# note we could have done the same thing in a single line:
# dMeans(weather.query('Month == "Nov"').Rainfall_mm, weather.query('Month == "May"').Rainfall_mm)
# the only reason I didn't do this was that I think the version above is a bit easier to follow as a student
0.37674993107251487
Apparently it rains more in November than May, which is unsurprising; the mean daily rainfall is about 0.38 mm greater in November.
Note that which input (nov or may) gets called x and which gets called y within the function is determined by the order in which we write them within the function’s parentheses.
In the function definition we have:
def dMeans(x,y):
meaning that when we call the function, whatever is first in the brackets becomes x and whatever is second becomes y. So when we call
dMeans(nov,may)
nov becomes x and may becomes y
The function returns mean(x) - mean(y), so this is rainfall in November minus rainfall in May; if the output is a positive number, this means that there was more rain in November than May.
If we called dMeans(may,nov) we would get rainfall in May minus rainfall in November, which would be a negative number of the same size, as the rainfall in November is higher.
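The effect of swapping the inputs can be checked on a small example. The sketch below uses two made-up lists, a and b (hypothetical values, not the weather data), so that the means are easy to work out by hand. The last line goes slightly beyond what this section covers: Python also lets you name the inputs explicitly when calling a function, in which case the order you write them in no longer matters.

```python
# Re-define the function so this block can be run on its own
def dMeans(x, y):
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    diff = mx - my
    return diff

# Made-up values for illustration only
a = [3.0, 5.0]   # mean is 4.0
b = [1.0, 2.0]   # mean is 1.5

forward = dMeans(a, b)       # a becomes x, b becomes y: 4.0 - 1.5
backward = dMeans(b, a)      # b becomes x, a becomes y: 1.5 - 4.0
named = dMeans(y=b, x=a)     # inputs named explicitly, so a is x even though it is written second

print(forward, backward, named)
```

Swapping the order flips the sign of the result but not its size, which is why it is worth checking which way round you have written the inputs before interpreting a difference of means.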