{ "cells": [ { "cell_type": "markdown", "id": "b41ea2df-83c4-4847-a73b-09adff7e46f5", "metadata": {}, "source": [ "# Python Functions\n", "\n", "This week, we will need to create our own Python functions as part of running a permutation test.\n", "\n", "Here we will review what functions are and how we can create our own using Python code.\n", "\n", "This is something of a Python tangent from our main stats objective for the week.\n", "\n", "### Set up Python libraries" ] }, { "cell_type": "code", "execution_count": 27, "id": "4f5be0ff-234c-40fb-af9b-03d89ed584fe", "metadata": {}, "outputs": [], "source": [ "# Set-up Python libraries - you need to run this but you don't need to change it\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import scipy.stats as stats\n", "import pandas as pd\n", "import seaborn as sns\n", "sns.set_theme(style='white')\n", "import statsmodels.api as sm\n", "import statsmodels.formula.api as smf\n", "import warnings \n", "warnings.simplefilter('ignore', category=FutureWarning)" ] }, { "cell_type": "markdown", "id": "351a97c5-e190-4ae4-9f72-959b078b5967", "metadata": {}, "source": [ "### Import the data\n", "\n", "We need some data to work with. Let's use the good old Oxford Weather dataset." ] }, { "cell_type": "code", "execution_count": 28, "id": "c95740c8-1c03-493c-ace4-9e4c068e9a8e", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table>\n", "<thead>\n",
"<tr><th></th><th>YYYY</th><th>Month</th><th>MM</th><th>DD</th><th>DD365</th><th>Tmax</th><th>Tmin</th><th>Tmean</th><th>Trange</th><th>Rainfall_mm</th></tr>\n",
"</thead>\n", "<tbody>\n",
"<tr><th>0</th><td>1827</td><td>Jan</td><td>1</td><td>1</td><td>1</td><td>8.3</td><td>5.6</td><td>7.0</td><td>2.7</td><td>0.0</td></tr>\n",
"<tr><th>1</th><td>1827</td><td>Jan</td><td>1</td><td>2</td><td>2</td><td>2.2</td><td>0.0</td><td>1.1</td><td>2.2</td><td>0.0</td></tr>\n",
"<tr><th>2</th><td>1827</td><td>Jan</td><td>1</td><td>3</td><td>3</td><td>-2.2</td><td>-8.3</td><td>-5.3</td><td>6.1</td><td>9.7</td></tr>\n",
"<tr><th>3</th><td>1827</td><td>Jan</td><td>1</td><td>4</td><td>4</td><td>-1.7</td><td>-7.8</td><td>-4.8</td><td>6.1</td><td>0.0</td></tr>\n",
"<tr><th>4</th><td>1827</td><td>Jan</td><td>1</td><td>5</td><td>5</td><td>0.0</td><td>-10.6</td><td>-5.3</td><td>10.6</td><td>0.0</td></tr>\n",
"<tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"<tr><th>71338</th><td>2022</td><td>Apr</td><td>4</td><td>26</td><td>116</td><td>15.2</td><td>4.1</td><td>9.7</td><td>11.1</td><td>0.0</td></tr>\n",
"<tr><th>71339</th><td>2022</td><td>Apr</td><td>4</td><td>27</td><td>117</td><td>10.7</td><td>2.6</td><td>6.7</td><td>8.1</td><td>0.0</td></tr>\n",
"<tr><th>71340</th><td>2022</td><td>Apr</td><td>4</td><td>28</td><td>118</td><td>12.7</td><td>3.9</td><td>8.3</td><td>8.8</td><td>0.0</td></tr>\n",
"<tr><th>71341</th><td>2022</td><td>Apr</td><td>4</td><td>29</td><td>119</td><td>11.7</td><td>6.7</td><td>9.2</td><td>5.0</td><td>0.0</td></tr>\n",
"<tr><th>71342</th><td>2022</td><td>Apr</td><td>4</td><td>30</td><td>120</td><td>17.6</td><td>1.0</td><td>9.3</td><td>16.6</td><td>0.0</td></tr>\n",
"</tbody>\n", "</table>\n",
"<p>71343 rows × 10 columns</p>\n", "</div>
" ], "text/plain": [ " YYYY Month MM DD DD365 Tmax Tmin Tmean Trange Rainfall_mm\n", "0 1827 Jan 1 1 1 8.3 5.6 7.0 2.7 0.0\n", "1 1827 Jan 1 2 2 2.2 0.0 1.1 2.2 0.0\n", "2 1827 Jan 1 3 3 -2.2 -8.3 -5.3 6.1 9.7\n", "3 1827 Jan 1 4 4 -1.7 -7.8 -4.8 6.1 0.0\n", "4 1827 Jan 1 5 5 0.0 -10.6 -5.3 10.6 0.0\n", "... ... ... .. .. ... ... ... ... ... ...\n", "71338 2022 Apr 4 26 116 15.2 4.1 9.7 11.1 0.0\n", "71339 2022 Apr 4 27 117 10.7 2.6 6.7 8.1 0.0\n", "71340 2022 Apr 4 28 118 12.7 3.9 8.3 8.8 0.0\n", "71341 2022 Apr 4 29 119 11.7 6.7 9.2 5.0 0.0\n", "71342 2022 Apr 4 30 120 17.6 1.0 9.3 16.6 0.0\n", "\n", "[71343 rows x 10 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "weather = pd.read_csv(\"https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/OxfordWeather.csv\")\n", "display(weather)" ] }, { "cell_type": "markdown", "id": "8cce932a-c3d7-4ac1-87c0-d8800e29ce0a", "metadata": {}, "source": [ "## What is a function?\n", "\n", "A function is a computer programme that takes in some information (an *input*), does something with it, and returns an *output*. \n", "\n", "* *Functions were introduced in DataCamp; you could review that material after reading this section if helpful.*\n", "\n", "We have been using Python functions for the last several weeks, mainly from the function libraries `pandas` and `seaborn`. 
For example, the function `df.mean()` returns the mean of each column in a dataframe.\n", "\n", "Let's make our own simple function to get the mean for a single column of a dataframe:" ] }, { "cell_type": "code", "execution_count": 29, "id": "d5117647-9265-4058-8c6f-c2e5a8dcaadd", "metadata": {}, "outputs": [], "source": [ "def myMean(x):\n", "    # mean = sum of the values divided by the number of values\n", "    m = sum(x) / len(x)\n", "    return m" ] }, { "cell_type": "markdown", "id": "131e9016-a93a-4660-b510-b13785b5fd03", "metadata": {}, "source": [ "* The input is `x`\n", "* The output is `m`\n", "* Inside the function, we calculate `m` from `x`" ] }, { "cell_type": "markdown", "id": "2912b65e-2f9c-47a5-ab1b-3fa46b3f8120", "metadata": {}, "source": [ "If you ran the code block above, you will have noticed that nothing seemed to happen. That's because we only created (*defined*) the function (computer programme); we didn't run it.\n", "\n", "Now, having created the function, we can run or *call* it as follows:" ] }, { "cell_type": "code", "execution_count": 30, "id": "e59d8b26-2c98-48b0-84d1-a53d29c12fc7", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "1.7869643833314295" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myMean(weather.Rainfall_mm)" ] }, { "cell_type": "markdown", "id": "43459002-b5ea-4923-ae16-e63b1484a810", "metadata": {}, "source": [ "What happened? 
\n", "\n", "* We *called* the function by saying `myMean()`\n", "* We gave it an input (inside the brackets), `weather.Rainfall_mm`\n", "* The function calculated the mean by adding up the values in the input column and dividing by the number of values (the length of the column)\n", "* The function gave us an output (shown below the code box) of 1.79 mm, which is the mean daily rainfall\n", " \n", "Let's just check using the built-in function that we are used to, `df.mean()`" ] }, { "cell_type": "code", "execution_count": 31, "id": "8269bfea-8de4-4c5a-9cf1-203bf4c26cf7", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "1.7869643833312312" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "weather.Rainfall_mm.mean()" ] }, { "cell_type": "markdown", "id": "12d77f21-d02b-43ae-9129-b445d3fec728", "metadata": { "tags": [] }, "source": [ "Yep, same answer (the tiny difference in the last few decimal places is just floating-point rounding).\n", "\n", "### Note\n", "\n", "You have to run the code block defining the function before you can call it, otherwise it won't have been created and won't exist!" 
] }, { "cell_type": "markdown", "id": "c6a35112-0753-44ff-9532-948d1b5cc7a4", "metadata": {}, "source": [ "## Difference of means\n", "\n", "As another example, let's define a function that takes in two inputs and finds the difference of their means:" ] }, { "cell_type": "code", "execution_count": 33, "id": "049f8d74-412a-4e9f-90ef-a09213f73035", "metadata": { "tags": [] }, "outputs": [], "source": [ "def dMeans(x,y):\n", "    mx = sum(x)/len(x)  # mean of the first input\n", "    my = sum(y)/len(y)  # mean of the second input\n", "    diff = mx-my\n", "    return diff" ] }, { "cell_type": "markdown", "id": "cbcd41bf-f5cd-4ca5-99a4-3233d857ac0d", "metadata": {}, "source": [ "Note that this function now has two inputs: `x` and `y`\n", "\n", "The function does the following:\n", "* calculate the mean for x as `mx`\n", "* calculate the mean for y as `my`\n", "* get the difference `mx-my`\n", "\n", "Let's use it to calculate the difference in mean rainfall between November and May" ] }, { "cell_type": "code", "execution_count": 34, "id": "349adfdf-3c18-4b98-b84f-78098513514a", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "0.5064019851116548" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# find the relevant rows and column in the dataframe and give them a name\n", "nov = weather.query('Month == \"Nov\"').Rainfall_mm\n", "may = weather.query('Month == \"May\"').Rainfall_mm\n", "\n", "dMeans(nov,may)\n", "\n", "# note we could have done the same thing in a single line:\n", "# dMeans(weather.query('Month == \"Nov\"').Rainfall_mm, weather.query('Month == \"May\"').Rainfall_mm)\n", "# the only reason I didn't do this was that I think the version above is a bit easier to follow as a student" ] }, { "cell_type": "markdown", "id": "d16e62c2-6458-471b-b82a-85554f8df7f8", "metadata": {}, "source": [ "Apparently it rains more in November than May, which is unsurprising; the mean daily rainfall is 0.51 mm greater in November.\n", "\n", "Note that which input (nov or may) gets called 
x and y within the function is determined by the order in which we write them within the parentheses when we call the function.\n", "\n", "In the function definition we have:\n", "\n", "`def dMeans(x,y):`\n", "\n", "meaning that when we call the function, whatever is first in the brackets becomes `x` and whatever is second becomes `y`. So when we call\n", "\n", "`dMeans(nov,may)`\n", "\n", "* nov becomes x and \n", "* may becomes y\n", "\n", "The function returns mean(x) - mean(y), so this is mean rainfall in November minus mean rainfall in May; if the output is a positive number, this means that there was more rain in November than May.\n", "\n", "If we called `dMeans(may,nov)` we would get mean rainfall in May minus mean rainfall in November, which is the same number with the opposite sign (about -0.51 mm), as the rainfall in November is higher.\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "ab436195-0edc-4bc6-bd36-db79b96c51f5", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.7" } }, "nbformat": 4, "nbformat_minor": 5 }