{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "69252ce6",
   "metadata": {},
   "source": [
    "# Python: Descriptives and Indexing\n",
    "\n",
    "In this notebook we cover some key <tt>Pandas</tt> syntax and functions:\n",
    "\n",
    "* Syntax for getting various descriptive statistics from a `Pandas` dataframe\n",
    "* Syntax for **indexing** a dataframe - finding the rows and columns that you need\n",
    "    * this allows you to get descriptives for specific rows and columns\n",
    "\n",
    "Here we meet **indexing** in the context of descriptive statistics, but indexing is something you will do every single time you write code for data analysis. Incorrect syntax in indexing is the number one biggest source of bugs for student on this course, so it is well worth spending the time to get to grips with it.\n",
    "\n",
    "You absolutely should work through all the exercises in this notebook in advance of the tutorial.\n",
    "\n",
    "## Set up Python Libraries\n",
    "\n",
    "As usual you will need to run this code block to import the relevant Python libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "7420037e",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Set-up Python libraries - you need to run this but you don't need to change it\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import scipy.stats as stats\n",
    "import pandas as pd\n",
    "import seaborn as sns\n",
    "sns.set_theme(style='white')\n",
    "import statsmodels.api as sm\n",
    "import statsmodels.formula.api as smf"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "14d3d992",
   "metadata": {},
   "source": [
    "## Import a dataset to work with\n",
    "\n",
    "We will work with weather data from the Oxford weather station. This code block will read it automatically from the internet."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "53648053",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>YYYY</th>\n",
       "      <th>Month</th>\n",
       "      <th>MM</th>\n",
       "      <th>DD</th>\n",
       "      <th>DD365</th>\n",
       "      <th>Tmax</th>\n",
       "      <th>Tmin</th>\n",
       "      <th>Tmean</th>\n",
       "      <th>Trange</th>\n",
       "      <th>Rainfall_mm</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1827</td>\n",
       "      <td>Jan</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>8.3</td>\n",
       "      <td>5.6</td>\n",
       "      <td>7.0</td>\n",
       "      <td>2.7</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1827</td>\n",
       "      <td>Jan</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2.2</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.1</td>\n",
       "      <td>2.2</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1827</td>\n",
       "      <td>Jan</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>-2.2</td>\n",
       "      <td>-8.3</td>\n",
       "      <td>-5.3</td>\n",
       "      <td>6.1</td>\n",
       "      <td>9.7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1827</td>\n",
       "      <td>Jan</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "      <td>4</td>\n",
       "      <td>-1.7</td>\n",
       "      <td>-7.8</td>\n",
       "      <td>-4.8</td>\n",
       "      <td>6.1</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1827</td>\n",
       "      <td>Jan</td>\n",
       "      <td>1</td>\n",
       "      <td>5</td>\n",
       "      <td>5</td>\n",
       "      <td>0.0</td>\n",
       "      <td>-10.6</td>\n",
       "      <td>-5.3</td>\n",
       "      <td>10.6</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>71338</th>\n",
       "      <td>2022</td>\n",
       "      <td>Apr</td>\n",
       "      <td>4</td>\n",
       "      <td>26</td>\n",
       "      <td>116</td>\n",
       "      <td>15.2</td>\n",
       "      <td>4.1</td>\n",
       "      <td>9.7</td>\n",
       "      <td>11.1</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>71339</th>\n",
       "      <td>2022</td>\n",
       "      <td>Apr</td>\n",
       "      <td>4</td>\n",
       "      <td>27</td>\n",
       "      <td>117</td>\n",
       "      <td>10.7</td>\n",
       "      <td>2.6</td>\n",
       "      <td>6.7</td>\n",
       "      <td>8.1</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>71340</th>\n",
       "      <td>2022</td>\n",
       "      <td>Apr</td>\n",
       "      <td>4</td>\n",
       "      <td>28</td>\n",
       "      <td>118</td>\n",
       "      <td>12.7</td>\n",
       "      <td>3.9</td>\n",
       "      <td>8.3</td>\n",
       "      <td>8.8</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>71341</th>\n",
       "      <td>2022</td>\n",
       "      <td>Apr</td>\n",
       "      <td>4</td>\n",
       "      <td>29</td>\n",
       "      <td>119</td>\n",
       "      <td>11.7</td>\n",
       "      <td>6.7</td>\n",
       "      <td>9.2</td>\n",
       "      <td>5.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>71342</th>\n",
       "      <td>2022</td>\n",
       "      <td>Apr</td>\n",
       "      <td>4</td>\n",
       "      <td>30</td>\n",
       "      <td>120</td>\n",
       "      <td>17.6</td>\n",
       "      <td>1.0</td>\n",
       "      <td>9.3</td>\n",
       "      <td>16.6</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>71343 rows × 10 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       YYYY Month  MM  DD  DD365  Tmax  Tmin  Tmean  Trange  Rainfall_mm\n",
       "0      1827   Jan   1   1      1   8.3   5.6    7.0     2.7          0.0\n",
       "1      1827   Jan   1   2      2   2.2   0.0    1.1     2.2          0.0\n",
       "2      1827   Jan   1   3      3  -2.2  -8.3   -5.3     6.1          9.7\n",
       "3      1827   Jan   1   4      4  -1.7  -7.8   -4.8     6.1          0.0\n",
       "4      1827   Jan   1   5      5   0.0 -10.6   -5.3    10.6          0.0\n",
       "...     ...   ...  ..  ..    ...   ...   ...    ...     ...          ...\n",
       "71338  2022   Apr   4  26    116  15.2   4.1    9.7    11.1          0.0\n",
       "71339  2022   Apr   4  27    117  10.7   2.6    6.7     8.1          0.0\n",
       "71340  2022   Apr   4  28    118  12.7   3.9    8.3     8.8          0.0\n",
       "71341  2022   Apr   4  29    119  11.7   6.7    9.2     5.0          0.0\n",
       "71342  2022   Apr   4  30    120  17.6   1.0    9.3    16.6          0.0\n",
       "\n",
       "[71343 rows x 10 columns]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "weather = pd.read_csv(\"https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/OxfordWeather.csv\")\n",
    "display(weather)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "af6131ce",
   "metadata": {},
   "source": [
    "## Get descriptive statistics\n",
    "\n",
    "We will look at indexing in the context of getting descriptive statistics for the different variables in a dataframe, and/or for a selection of rows.\n",
    "\n",
    "Pandas has a number of built-in functions to get descriptive statistics:\n",
    "    \n",
    "- `df.mean()`   -  gets the mean\n",
    "- `df.median()` - gets the median\n",
    "- `df.var()` - gets the variance\n",
    "- `df.std()` - gets the standard deviation\n",
    "- `df.min()` - gets the minimum value\n",
    "- `df.max()` - gets the maximum value\n",
    "- `df.corr()` - gets the correlation coefficient (Pearson or Spearman)\n",
    "\n",
    "in the function names above, `df` is just a placeholder for \"dataframe\" \n",
    "\n",
    "You can replace `df` with the name of your dataframe (and we need to tell the function to use only columns containing numerical data)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "0e4c6baf",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "YYYY           1924.165174\n",
       "MM                6.516210\n",
       "DD               15.729154\n",
       "DD365           182.917077\n",
       "Tmax             13.890153\n",
       "Tmin              6.156018\n",
       "Tmean            10.046208\n",
       "Trange            7.734244\n",
       "Rainfall_mm       1.786964\n",
       "dtype: float64"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "weather.mean(numeric_only=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ddd45678",
   "metadata": {},
   "source": [
    "Which columns from the dataframe are missing from the table of means?\n",
    "* Month is missing because the values are strings (words); we only get the mean for numerical columns\n",
    "\n",
    "## Decriptives for a single column\n",
    "\n",
    "Format:\n",
    "- `df.column.mean()`\n",
    "\n",
    "You might just want the mean for one column, in which case you can include the column name as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "60351416",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "13.890153065515406"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "weather.Tmax.mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "772ff0e0",
   "metadata": {},
   "source": [
    "Instead of a whole table of means, many of which were irrelevant, we now just have the single number we wanted - the mean of <tt>Tmax</tt>\n",
    "\n",
    "* *Think* check that the value obtained this way matches the one in the table above.\n",
    "\n",
    "#### Why does it work?\n",
    "\n",
    "The commmand `weather.Tmax.mean()` has two steps.\n",
    "\n",
    "**Step 1:**\n",
    "\n",
    "The syntax `weather.Tmax` (or more generally, `df.column`) returns just the one column of the dataframe as a <tt>series</tt> (essentially a one-column table).\n",
    "\n",
    "Look:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "ca537aa1",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0         8.3\n",
       "1         2.2\n",
       "2        -2.2\n",
       "3        -1.7\n",
       "4         0.0\n",
       "         ... \n",
       "71338    15.2\n",
       "71339    10.7\n",
       "71340    12.7\n",
       "71341    11.7\n",
       "71342    17.6\n",
       "Name: Tmax, Length: 71343, dtype: float64"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "weather.Tmax"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "73fdc0b4",
   "metadata": {},
   "source": [
    "If we liked, we could give this single-column series a name a 'save' it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "a85759d0",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0         8.3\n",
      "1         2.2\n",
      "2        -2.2\n",
      "3        -1.7\n",
      "4         0.0\n",
      "         ... \n",
      "71338    15.2\n",
      "71339    10.7\n",
      "71340    12.7\n",
      "71341    11.7\n",
      "71342    17.6\n",
      "Name: Tmax, Length: 71343, dtype: float64\n"
     ]
    }
   ],
   "source": [
    "DailyMaxima  = weather.Tmax\n",
    "print(DailyMaxima)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0ccb6bb4",
   "metadata": {},
   "source": [
    "However, we don't need to do that as we can just tack the `.mean()` function onto it directly as follows:"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b9672939",
   "metadata": {},
   "source": [
    "**Step 2:**\n",
    "\n",
    "The syntax `.mean()` gets the mean of the thing before the dot (i.e. the series you created with the command `weather.Tmax`)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "2eb628aa",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "13.890153065515406"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "weather.Tmax.mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3b115731",
   "metadata": {},
   "source": [
    "### Exercises\n",
    "\n",
    "Complete the following quick practice questions. \n",
    "\n",
    "Note that the numerical answers are provided so you can check that you have done it correctly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "id": "6bccd623",
   "metadata": {},
   "outputs": [],
   "source": [
    "# get the mean daily rainfall\n",
    "# answer = 1.79\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "id": "9ac0f7eb",
   "metadata": {},
   "outputs": [],
   "source": [
    "# get the median daily rainfall\n",
    "# answer = 0.0\n",
    "# think - why??\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "id": "403db34f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# get the standard deviation of the maximum daily temperature\n",
    "# answer = 5.63\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "id": "d93cc437",
   "metadata": {},
   "outputs": [],
   "source": [
    "# find the minimum value of \"year\" \n",
    "# answer = 1827\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "62d816dc",
   "metadata": {},
   "source": [
    "## Get descriptives for a subset of rows\n",
    "\n",
    "- `df.query('columnname == number')`\n",
    "- `df.query('columnname == \"string\"')`\n",
    "- `df.query('columnname1 == value1 and columnname2 == value2')`\n",
    "\n",
    "\n",
    "Say I want to know the mean daily maximum temperatures for the year 1921.\n",
    "\n",
    "#### Step 1: find the relevant rows\n",
    "\n",
    "I first need to somehow pull out the rows of the table where the value for <tt>YYYY</tt> is <tt>1921</tt>\n",
    "    \n",
    "I can do this using the function `df.query`, for example <tt>weather.query('YYYY == 1921')</tt>\n",
    "   * **note** the quote marks surrounding the whole query\n",
    "   * **note** the double equals sign `==`, which is used for *checking* if two values are equal (as opposed to setting a value)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "98393d39",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>YYYY</th>\n",
       "      <th>Month</th>\n",
       "      <th>MM</th>\n",
       "      <th>DD</th>\n",
       "      <th>DD365</th>\n",
       "      <th>Tmax</th>\n",
       "      <th>Tmin</th>\n",
       "      <th>Tmean</th>\n",
       "      <th>Trange</th>\n",
       "      <th>Rainfall_mm</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>34333</th>\n",
       "      <td>1921</td>\n",
       "      <td>Dec</td>\n",
       "      <td>12</td>\n",
       "      <td>31</td>\n",
       "      <td>1</td>\n",
       "      <td>9.1</td>\n",
       "      <td>3.3</td>\n",
       "      <td>6.2</td>\n",
       "      <td>5.8</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>34334</th>\n",
       "      <td>1921</td>\n",
       "      <td>Jan</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>12.2</td>\n",
       "      <td>7.1</td>\n",
       "      <td>9.7</td>\n",
       "      <td>5.1</td>\n",
       "      <td>9.7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>34335</th>\n",
       "      <td>1921</td>\n",
       "      <td>Jan</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "      <td>11.9</td>\n",
       "      <td>8.7</td>\n",
       "      <td>10.3</td>\n",
       "      <td>3.2</td>\n",
       "      <td>6.1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>34336</th>\n",
       "      <td>1921</td>\n",
       "      <td>Jan</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>4</td>\n",
       "      <td>9.9</td>\n",
       "      <td>5.9</td>\n",
       "      <td>7.9</td>\n",
       "      <td>4.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>34337</th>\n",
       "      <td>1921</td>\n",
       "      <td>Jan</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "      <td>5</td>\n",
       "      <td>12.3</td>\n",
       "      <td>9.8</td>\n",
       "      <td>11.1</td>\n",
       "      <td>2.5</td>\n",
       "      <td>5.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>34693</th>\n",
       "      <td>1921</td>\n",
       "      <td>Dec</td>\n",
       "      <td>12</td>\n",
       "      <td>26</td>\n",
       "      <td>361</td>\n",
       "      <td>10.8</td>\n",
       "      <td>-3.1</td>\n",
       "      <td>3.9</td>\n",
       "      <td>13.9</td>\n",
       "      <td>5.6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>34694</th>\n",
       "      <td>1921</td>\n",
       "      <td>Dec</td>\n",
       "      <td>12</td>\n",
       "      <td>27</td>\n",
       "      <td>362</td>\n",
       "      <td>11.2</td>\n",
       "      <td>5.4</td>\n",
       "      <td>8.3</td>\n",
       "      <td>5.8</td>\n",
       "      <td>1.3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>34695</th>\n",
       "      <td>1921</td>\n",
       "      <td>Dec</td>\n",
       "      <td>12</td>\n",
       "      <td>28</td>\n",
       "      <td>363</td>\n",
       "      <td>13.9</td>\n",
       "      <td>3.9</td>\n",
       "      <td>8.9</td>\n",
       "      <td>10.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>34696</th>\n",
       "      <td>1921</td>\n",
       "      <td>Dec</td>\n",
       "      <td>12</td>\n",
       "      <td>29</td>\n",
       "      <td>364</td>\n",
       "      <td>7.6</td>\n",
       "      <td>1.7</td>\n",
       "      <td>4.7</td>\n",
       "      <td>5.9</td>\n",
       "      <td>0.8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>34697</th>\n",
       "      <td>1921</td>\n",
       "      <td>Dec</td>\n",
       "      <td>12</td>\n",
       "      <td>30</td>\n",
       "      <td>365</td>\n",
       "      <td>11.7</td>\n",
       "      <td>1.9</td>\n",
       "      <td>6.8</td>\n",
       "      <td>9.8</td>\n",
       "      <td>4.5</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>365 rows × 10 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       YYYY Month  MM  DD  DD365  Tmax  Tmin  Tmean  Trange  Rainfall_mm\n",
       "34333  1921   Dec  12  31      1   9.1   3.3    6.2     5.8          0.0\n",
       "34334  1921   Jan   1   1      2  12.2   7.1    9.7     5.1          9.7\n",
       "34335  1921   Jan   1   2      3  11.9   8.7   10.3     3.2          6.1\n",
       "34336  1921   Jan   1   3      4   9.9   5.9    7.9     4.0          0.0\n",
       "34337  1921   Jan   1   4      5  12.3   9.8   11.1     2.5          5.5\n",
       "...     ...   ...  ..  ..    ...   ...   ...    ...     ...          ...\n",
       "34693  1921   Dec  12  26    361  10.8  -3.1    3.9    13.9          5.6\n",
       "34694  1921   Dec  12  27    362  11.2   5.4    8.3     5.8          1.3\n",
       "34695  1921   Dec  12  28    363  13.9   3.9    8.9    10.0          0.0\n",
       "34696  1921   Dec  12  29    364   7.6   1.7    4.7     5.9          0.8\n",
       "34697  1921   Dec  12  30    365  11.7   1.9    6.8     9.8          4.5\n",
       "\n",
       "[365 rows x 10 columns]"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# get all the rows where year is 1921\n",
    "weather.loc[weather.YYYY == 1921]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c1e00495",
   "metadata": {},
   "source": [
    "Note the size of this table - there are 365 rows. Why?\n",
    "* In the original table `weather`, there is one row per day\n",
    "* 365 of those rows (days) match the criterion 'YYYY = 1921' because there are 365 days in a year.\n",
    "\n",
    "If I wanted to, I could give this table a name and 'save' it for later use:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "ee284f7a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>YYYY</th>\n",
       "      <th>Month</th>\n",
       "      <th>MM</th>\n",
       "      <th>DD</th>\n",
       "      <th>DD365</th>\n",
       "      <th>Tmax</th>\n",
       "      <th>Tmin</th>\n",
       "      <th>Tmean</th>\n",
       "      <th>Trange</th>\n",
       "      <th>Rainfall_mm</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>34333</th>\n",
       "      <td>1921</td>\n",
       "      <td>Dec</td>\n",
       "      <td>12</td>\n",
       "      <td>31</td>\n",
       "      <td>1</td>\n",
       "      <td>9.1</td>\n",
       "      <td>3.3</td>\n",
       "      <td>6.2</td>\n",
       "      <td>5.8</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>34334</th>\n",
       "      <td>1921</td>\n",
       "      <td>Jan</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>12.2</td>\n",
       "      <td>7.1</td>\n",
       "      <td>9.7</td>\n",
       "      <td>5.1</td>\n",
       "      <td>9.7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>34335</th>\n",
       "      <td>1921</td>\n",
       "      <td>Jan</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "      <td>11.9</td>\n",
       "      <td>8.7</td>\n",
       "      <td>10.3</td>\n",
       "      <td>3.2</td>\n",
       "      <td>6.1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>34336</th>\n",
       "      <td>1921</td>\n",
       "      <td>Jan</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>4</td>\n",
       "      <td>9.9</td>\n",
       "      <td>5.9</td>\n",
       "      <td>7.9</td>\n",
       "      <td>4.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>34337</th>\n",
       "      <td>1921</td>\n",
       "      <td>Jan</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "      <td>5</td>\n",
       "      <td>12.3</td>\n",
       "      <td>9.8</td>\n",
       "      <td>11.1</td>\n",
       "      <td>2.5</td>\n",
       "      <td>5.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>34693</th>\n",
       "      <td>1921</td>\n",
       "      <td>Dec</td>\n",
       "      <td>12</td>\n",
       "      <td>26</td>\n",
       "      <td>361</td>\n",
       "      <td>10.8</td>\n",
       "      <td>-3.1</td>\n",
       "      <td>3.9</td>\n",
       "      <td>13.9</td>\n",
       "      <td>5.6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>34694</th>\n",
       "      <td>1921</td>\n",
       "      <td>Dec</td>\n",
       "      <td>12</td>\n",
       "      <td>27</td>\n",
       "      <td>362</td>\n",
       "      <td>11.2</td>\n",
       "      <td>5.4</td>\n",
       "      <td>8.3</td>\n",
       "      <td>5.8</td>\n",
       "      <td>1.3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>34695</th>\n",
       "      <td>1921</td>\n",
       "      <td>Dec</td>\n",
       "      <td>12</td>\n",
       "      <td>28</td>\n",
       "      <td>363</td>\n",
       "      <td>13.9</td>\n",
       "      <td>3.9</td>\n",
       "      <td>8.9</td>\n",
       "      <td>10.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>34696</th>\n",
       "      <td>1921</td>\n",
       "      <td>Dec</td>\n",
       "      <td>12</td>\n",
       "      <td>29</td>\n",
       "      <td>364</td>\n",
       "      <td>7.6</td>\n",
       "      <td>1.7</td>\n",
       "      <td>4.7</td>\n",
       "      <td>5.9</td>\n",
       "      <td>0.8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>34697</th>\n",
       "      <td>1921</td>\n",
       "      <td>Dec</td>\n",
       "      <td>12</td>\n",
       "      <td>30</td>\n",
       "      <td>365</td>\n",
       "      <td>11.7</td>\n",
       "      <td>1.9</td>\n",
       "      <td>6.8</td>\n",
       "      <td>9.8</td>\n",
       "      <td>4.5</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>365 rows × 10 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       YYYY Month  MM  DD  DD365  Tmax  Tmin  Tmean  Trange  Rainfall_mm\n",
       "34333  1921   Dec  12  31      1   9.1   3.3    6.2     5.8          0.0\n",
       "34334  1921   Jan   1   1      2  12.2   7.1    9.7     5.1          9.7\n",
       "34335  1921   Jan   1   2      3  11.9   8.7   10.3     3.2          6.1\n",
       "34336  1921   Jan   1   3      4   9.9   5.9    7.9     4.0          0.0\n",
       "34337  1921   Jan   1   4      5  12.3   9.8   11.1     2.5          5.5\n",
       "...     ...   ...  ..  ..    ...   ...   ...    ...     ...          ...\n",
       "34693  1921   Dec  12  26    361  10.8  -3.1    3.9    13.9          5.6\n",
       "34694  1921   Dec  12  27    362  11.2   5.4    8.3     5.8          1.3\n",
       "34695  1921   Dec  12  28    363  13.9   3.9    8.9    10.0          0.0\n",
       "34696  1921   Dec  12  29    364   7.6   1.7    4.7     5.9          0.8\n",
       "34697  1921   Dec  12  30    365  11.7   1.9    6.8     9.8          4.5\n",
       "\n",
       "[365 rows x 10 columns]"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "weather1921 = weather.query('YYYY == 1921') # create a new dataframe for just 1921\n",
    "weather1921 # look at my new dataframe"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a9e16c44",
   "metadata": {},
   "source": [
    "... but I don't need to do this as I can tack on more commands to complete my goal in a single line of code.\n",
    "\n",
    "#### Step 2: find the relevant column\n",
    "\n",
    "Now that I have grabbed the relevant rows, I narrow it down to column <tt>Tmax</tt>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "9582155a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "33967     3.2\n",
       "33968    13.2\n",
       "33969     4.5\n",
       "33970     3.7\n",
       "33971     3.7\n",
       "         ... \n",
       "34328    12.1\n",
       "34329    11.9\n",
       "34330    12.3\n",
       "34331    12.4\n",
       "34332    12.1\n",
       "Name: Tmax, Length: 366, dtype: float64"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "weather.query('YYYY == 1920').Tmax"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ef1a7e64",
   "metadata": {},
   "source": [
    "#### Step 4: add in the actual function\n",
    "\n",
    "FInally, I can tack on teh function I want to run on my selected rows and column: `mean()`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "a2547cbb",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "13.701092896174861"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "weather.query('YYYY == 1920').Tmax.mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "88c44d08",
   "metadata": {},
   "source": [
    "### Exercises\n",
    "\n",
    "Complete the following quick practice questions:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "id": "5dba672e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get the mean daily maximum temperature in 2006\n",
    "# answer = 15.7"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "id": "6eabd979",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get the mean daily minimum temperature in 1947\n",
    "# answer = 6.16"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "23a2ac8e",
   "metadata": {},
   "source": [
    "## Get descriptives for one category\n",
    "\n",
    "What about getting the mean daily maximum temperatures for, say, October? So I need to pull out all the rows in which <tt>'MM'</tt> matches <tt>'Oct'</tt>\n",
    "\n",
    "The syntax is very similar, but now I am matching a string `\"Oct\"`, so it needs to be in quotes:\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "id": "28646cb0",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "14.112638544251446"
      ]
     },
     "execution_count": 48,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "weather.query('Month == \"Oct\"').Tmax.mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0404bdd3",
   "metadata": {},
   "source": [
    "* **note** you can actually use single and double quotes interchangeably in Python, but in cases like this where you have quotes within quotes, it is better to use both types to avoid the computer getting confused about which opening and closing quotes to pair up.\n",
    "\n",
    "so you could do either of these\n",
    "\n",
    "- weather.query('Month == \"Oct\"').Tmax.mean()\n",
    "- weather.query(\"Month == 'Oct'\").Tmax.mean()\n",
    "\n",
    "but not these\n",
    "\n",
    "- weather.query('Month == 'Oct'').Tmax.mean()\n",
    "- weather.query(\"Month == \"Oct\"\").Tmax.mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "10985313",
   "metadata": {},
   "source": [
    "### Exercises\n",
    "\n",
    "Complete the following quick practice questions:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "id": "30a4ed22",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get the mean daily rainfall in January\n",
    "# answer = 1.77"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "id": "a6cfbcdd",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get the mean daily rainfall in June\n",
    "# answer = 1.81\n",
    "# Is this surprising? why?"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ce908842",
   "metadata": {},
   "source": [
    "## Match multiple conditions\n",
    "\n",
    "What if I want to know the mean daily maximum temperature in 1920 for June only?\n",
    "\n",
    "We can simply use `and` or `&` to pass `df.query()` multiple conditions:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "22cd5fd6",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "13.701092896174861"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# mean daily maximum temperature in 1920\n",
    "weather.query('YYYY == 1920').Tmax.mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "2ba8f15b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "19.553333333333335"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# mean daily maximumm temperature in 1920\n",
    "weather.query('YYYY == 1920 & MM == 6').Tmax.mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1bb36dd2",
   "metadata": {},
   "source": [
    "* **Think** hopefully the value for June is higher than the average for the whole year? If not something has gone wrong!"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a61d8b31",
   "metadata": {},
   "source": [
    "### Exercises\n",
    "\n",
    "Complete the following quick practice questions:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "id": "e61f9992",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get the mean daily rainfall in June 2007\n",
    "# answer = 2.62"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "id": "73d2f3c9",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get the minimum temperature in January 1947\n",
    "# answer = -12.4. This was an unusually cold winter.\n",
    "# note that you were asked for the min temperature in that month, not the mean of the daily minimum Tmin)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a2938b77",
   "metadata": {},
   "source": [
    "## Syntax for correlation\n",
    "\n",
    "When we run a correlation, we need to index *two* columns - the two we are correlating. \n",
    "\n",
    "Say we want to get the correlation between the minimum and maximum daily temperatures,`Tmin` and `Tmax`:\n",
    "\n",
    "Correlation is 'symmetrical' (it doesn't matter which variable is $x$ and which is $y$) but for the purposes of the `pandas` function we are using, we first find one of the columns (doesn't matter which one):\n",
    "\n",
    "`weather.Tmin`\n",
    "\n",
    "... and call the `pandas` function `df.corr()` to operate on it:\n",
    "\n",
    "`weather.Tmin.corr()`\n",
    "\n",
    "Inside the brackets, we tell it which column to correlate against `weather.Tmin`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "39e65c71",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.8414800914062079"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# get the correlation between columns Tmin and Tmax of dataframe weather\n",
    "weather.Tmin.corr(weather.Tmax)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "65b086ce",
   "metadata": {},
   "source": [
    "We could change what we correlate against `Tmin` by changing what is inside the brackets, eg:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "5cee1d38-0059-4151-8116-47e3b99d159d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.9502479355058366"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# get the correlation between columns Tmin and Tmean of dataframe weather\n",
    "weather.Tmin.corr(weather.Tmean)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6ab0f17b-13c1-4576-b66a-9dd72b9ef5a0",
   "metadata": {},
   "source": [
    "Since correlation is 'symmetrical' (it doesn't matter which variable is $x$ and which is $y$) we can also switch the variables around:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "d766d48a-c480-41d9-b3aa-52d6d2ec07ce",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.9502479355058366"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# get the correlation between columns Tmin and Tmean of dataframe weather\n",
    "weather.Tmean.corr(weather.Tmin)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ce34b0b3-cecd-474d-a4e0-9df084ade760",
   "metadata": {},
   "source": [
    "... we just need one column specified outside the brackets and one inside"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b58fcf91-870a-4503-9de8-018c19d88b52",
   "metadata": {},
   "source": [
    "### Correlation type\n",
    "\n",
    "We can set the type of correlation to Pearson or Spearman as appropriate (the default is Pearson's $r$):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "7ec28f45",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.8438796074978867"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# get the correlation between columns Tmin and Tmax of dataframe weather\n",
    "weather.Tmin.corr(weather.Tmax, method='spearman')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "71036f0c",
   "metadata": {},
   "source": [
    "#### Exercises\n",
    "\n",
    "Complete the following practice questions to help you get to grips with the syntax for correlation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "id": "071284d5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Find the Spearman correlation between daily mean temperature Tmean, and rainfall\n",
    "# answer = -0.019"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "7eac1df4",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.08304423160159403"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Find the correlation between year and daily mean temperature\n",
    "# answer = 0.083"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "af3d7111-bd34-4d18-816b-21171474daf3",
   "metadata": {},
   "source": [
    "Think - are these strong or weak correlations? Is this surprising, did you expect something different?"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d68d7c61",
   "metadata": {},
   "source": [
    "#### All-by-all correlation matrix\n",
    "\n",
    "We can also quickly output all the correlations between all possible pairs of columns in a dataframe by simply not giving any column index (but we do have to tell it to only use the numeric columns):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "d04b1b96",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>YYYY</th>\n",
       "      <th>MM</th>\n",
       "      <th>DD</th>\n",
       "      <th>DD365</th>\n",
       "      <th>Tmax</th>\n",
       "      <th>Tmin</th>\n",
       "      <th>Tmean</th>\n",
       "      <th>Trange</th>\n",
       "      <th>Rainfall_mm</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>YYYY</th>\n",
       "      <td>1.000000</td>\n",
       "      <td>-0.003411</td>\n",
       "      <td>-0.000059</td>\n",
       "      <td>-0.003372</td>\n",
       "      <td>0.071631</td>\n",
       "      <td>0.089683</td>\n",
       "      <td>0.083044</td>\n",
       "      <td>-0.001257</td>\n",
       "      <td>0.008117</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>MM</th>\n",
       "      <td>-0.003411</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.010567</td>\n",
       "      <td>0.995580</td>\n",
       "      <td>0.179681</td>\n",
       "      <td>0.235401</td>\n",
       "      <td>0.213082</td>\n",
       "      <td>-0.018820</td>\n",
       "      <td>0.043672</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>DD</th>\n",
       "      <td>-0.000059</td>\n",
       "      <td>0.010567</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.092771</td>\n",
       "      <td>0.001217</td>\n",
       "      <td>0.002876</td>\n",
       "      <td>0.002035</td>\n",
       "      <td>-0.002055</td>\n",
       "      <td>0.005315</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>DD365</th>\n",
       "      <td>-0.003372</td>\n",
       "      <td>0.995580</td>\n",
       "      <td>0.092771</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.177016</td>\n",
       "      <td>0.233248</td>\n",
       "      <td>0.210545</td>\n",
       "      <td>-0.020536</td>\n",
       "      <td>0.043925</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Tmax</th>\n",
       "      <td>0.071631</td>\n",
       "      <td>0.179681</td>\n",
       "      <td>0.001217</td>\n",
       "      <td>0.177016</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.841480</td>\n",
       "      <td>0.967881</td>\n",
       "      <td>0.593339</td>\n",
       "      <td>-0.008807</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Tmin</th>\n",
       "      <td>0.089683</td>\n",
       "      <td>0.235401</td>\n",
       "      <td>0.002876</td>\n",
       "      <td>0.233248</td>\n",
       "      <td>0.841480</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.950248</td>\n",
       "      <td>0.064379</td>\n",
       "      <td>0.086181</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Tmean</th>\n",
       "      <td>0.083044</td>\n",
       "      <td>0.213082</td>\n",
       "      <td>0.002035</td>\n",
       "      <td>0.210545</td>\n",
       "      <td>0.967881</td>\n",
       "      <td>0.950248</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.371965</td>\n",
       "      <td>0.035037</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Trange</th>\n",
       "      <td>-0.001257</td>\n",
       "      <td>-0.018820</td>\n",
       "      <td>-0.002055</td>\n",
       "      <td>-0.020536</td>\n",
       "      <td>0.593339</td>\n",
       "      <td>0.064379</td>\n",
       "      <td>0.371965</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>-0.144654</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Rainfall_mm</th>\n",
       "      <td>0.008117</td>\n",
       "      <td>0.043672</td>\n",
       "      <td>0.005315</td>\n",
       "      <td>0.043925</td>\n",
       "      <td>-0.008807</td>\n",
       "      <td>0.086181</td>\n",
       "      <td>0.035037</td>\n",
       "      <td>-0.144654</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                 YYYY        MM        DD     DD365      Tmax      Tmin  \\\n",
       "YYYY         1.000000 -0.003411 -0.000059 -0.003372  0.071631  0.089683   \n",
       "MM          -0.003411  1.000000  0.010567  0.995580  0.179681  0.235401   \n",
       "DD          -0.000059  0.010567  1.000000  0.092771  0.001217  0.002876   \n",
       "DD365       -0.003372  0.995580  0.092771  1.000000  0.177016  0.233248   \n",
       "Tmax         0.071631  0.179681  0.001217  0.177016  1.000000  0.841480   \n",
       "Tmin         0.089683  0.235401  0.002876  0.233248  0.841480  1.000000   \n",
       "Tmean        0.083044  0.213082  0.002035  0.210545  0.967881  0.950248   \n",
       "Trange      -0.001257 -0.018820 -0.002055 -0.020536  0.593339  0.064379   \n",
       "Rainfall_mm  0.008117  0.043672  0.005315  0.043925 -0.008807  0.086181   \n",
       "\n",
       "                Tmean    Trange  Rainfall_mm  \n",
       "YYYY         0.083044 -0.001257     0.008117  \n",
       "MM           0.213082 -0.018820     0.043672  \n",
       "DD           0.002035 -0.002055     0.005315  \n",
       "DD365        0.210545 -0.020536     0.043925  \n",
       "Tmax         0.967881  0.593339    -0.008807  \n",
       "Tmin         0.950248  0.064379     0.086181  \n",
       "Tmean        1.000000  0.371965     0.035037  \n",
       "Trange       0.371965  1.000000    -0.144654  \n",
       "Rainfall_mm  0.035037 -0.144654     1.000000  "
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "weather.corr(numeric_only=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "828bb885",
   "metadata": {},
   "source": [
    "The problem with this is that the figure we want is often buried amongst a lot of irrelevant correlation coefficients.\n",
    "\n",
    "#### Correlation using `scipy.stats`\n",
    "\n",
    "Note that here we are using the `pandas` function `df.corr()`\n",
    "\n",
    "Later in the course we will use two functions from a library called `scipy.stats`:\n",
    "\n",
    "* `stats.pearsonr()`\n",
    "* `stats.spearmanr()`\n",
    "\n",
    "These are actually a bit more useful as they give us a $p$-value for the correlation as well as the correlation coefficient\n",
    "* don't worry if you don't know what a $p$-value is yet - this will be covered later in the course\n",
    "\n",
    "`scipy.stats` will be used extensively in the module on hypothesis testing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d033e382",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}