{ "cells": [ { "cell_type": "markdown", "id": "3cc40844", "metadata": {}, "source": [ "# Regression models in Python\n", "\n", "We will be using the `statsmodels` package in Python, so we will need to import this along with the other Python packages we have been using. \n", "\n", "### Set up Python libraries\n", "\n", "As usual, run the code cell below to import the relevant Python libraries" ] }, { "cell_type": "code", "execution_count": 1, "id": "fa6f15a4", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Set-up Python libraries - you need to run this but you don't need to change it\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import scipy.stats as stats\n", "import pandas as pd\n", "import seaborn as sns\n", "sns.set_theme(style='white')\n", "import statsmodels.api as sm\n", "import statsmodels.formula.api as smf" ] }, { "cell_type": "markdown", "id": "660ca2e6", "metadata": {}, "source": [ "## Happiness toy example\n", "\n", "The python code, once you get the hang of it, is pretty straightforward. The output that Python gives is a whole table of statistics, some of which are important to us in this class, and some will be important later, and some we won’t need in the course at all. \n", "\n", "So, when looking at the Python output, the key objective for the moment is to know where to find the key things, e.g., the intercept and slope coefficients.\n", "\n", "Let’s run a regression model in Python for the ‘toy’ country happiness data. \n", "\n", "### Import and view data" ] }, { "cell_type": "code", "execution_count": 2, "id": "b7031217", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CountryGDPpcLifeSat
0Austria51.97.2
1Czechia38.56.9
2Finland47.27.8
3Greece27.15.9
4Ireland90.87.0
5Luxembourg112.67.4
6Poland32.46.1
7Slovakia30.56.4
8Sweden50.97.4
9Ukraine12.45.1
\n", "
" ], "text/plain": [ " Country GDPpc LifeSat\n", "0 Austria 51.9 7.2\n", "1 Czechia 38.5 6.9\n", "2 Finland 47.2 7.8\n", "3 Greece 27.1 5.9\n", "4 Ireland 90.8 7.0\n", "5 Luxembourg 112.6 7.4\n", "6 Poland 32.4 6.1\n", "7 Slovakia 30.5 6.4\n", "8 Sweden 50.9 7.4\n", "9 Ukraine 12.4 5.1" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# import and view data\n", "happiness=pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/happiness10.csv')\n", "happiness" ] }, { "cell_type": "markdown", "id": "01e31be0", "metadata": {}, "source": [ "### Run the regression using `statsmodels`" ] }, { "cell_type": "code", "execution_count": 3, "id": "b4d4be70", "metadata": { "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/joreilly/opt/anaconda3/lib/python3.9/site-packages/scipy/stats/_stats_py.py:1806: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=10\n", " warnings.warn(\"kurtosistest only valid for n>=20 ... continuing \"\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
OLS Regression Results
Dep. Variable: LifeSat R-squared: 0.418
Model: OLS Adj. R-squared: 0.345
Method: Least Squares F-statistic: 5.738
Date: Fri, 09 Feb 2024 Prob (F-statistic): 0.0435
Time: 13:59:40 Log-Likelihood: -9.1084
No. Observations: 10 AIC: 22.22
Df Residuals: 8 BIC: 22.82
Df Model: 1
Covariance Type: nonrobust
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
coef std err t P>|t| [0.025 0.975]
Intercept 5.8485 0.421 13.878 0.000 4.877 6.820
GDPpc 0.0176 0.007 2.395 0.043 0.001 0.035
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
Omnibus: 0.349 Durbin-Watson: 1.689
Prob(Omnibus): 0.840 Jarque-Bera (JB): 0.448
Skew: 0.289 Prob(JB): 0.799
Kurtosis: 2.139 Cond. No. 113.


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified." ], "text/latex": [ "\\begin{center}\n", "\\begin{tabular}{lclc}\n", "\\toprule\n", "\\textbf{Dep. Variable:} & LifeSat & \\textbf{ R-squared: } & 0.418 \\\\\n", "\\textbf{Model:} & OLS & \\textbf{ Adj. R-squared: } & 0.345 \\\\\n", "\\textbf{Method:} & Least Squares & \\textbf{ F-statistic: } & 5.738 \\\\\n", "\\textbf{Date:} & Fri, 09 Feb 2024 & \\textbf{ Prob (F-statistic):} & 0.0435 \\\\\n", "\\textbf{Time:} & 13:59:40 & \\textbf{ Log-Likelihood: } & -9.1084 \\\\\n", "\\textbf{No. Observations:} & 10 & \\textbf{ AIC: } & 22.22 \\\\\n", "\\textbf{Df Residuals:} & 8 & \\textbf{ BIC: } & 22.82 \\\\\n", "\\textbf{Df Model:} & 1 & \\textbf{ } & \\\\\n", "\\textbf{Covariance Type:} & nonrobust & \\textbf{ } & \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "\\begin{tabular}{lcccccc}\n", " & \\textbf{coef} & \\textbf{std err} & \\textbf{t} & \\textbf{P$> |$t$|$} & \\textbf{[0.025} & \\textbf{0.975]} \\\\\n", "\\midrule\n", "\\textbf{Intercept} & 5.8485 & 0.421 & 13.878 & 0.000 & 4.877 & 6.820 \\\\\n", "\\textbf{GDPpc} & 0.0176 & 0.007 & 2.395 & 0.043 & 0.001 & 0.035 \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "\\begin{tabular}{lclc}\n", "\\textbf{Omnibus:} & 0.349 & \\textbf{ Durbin-Watson: } & 1.689 \\\\\n", "\\textbf{Prob(Omnibus):} & 0.840 & \\textbf{ Jarque-Bera (JB): } & 0.448 \\\\\n", "\\textbf{Skew:} & 0.289 & \\textbf{ Prob(JB): } & 0.799 \\\\\n", "\\textbf{Kurtosis:} & 2.139 & \\textbf{ Cond. No. } & 113. \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "%\\caption{OLS Regression Results}\n", "\\end{center}\n", "\n", "Notes: \\newline\n", " [1] Standard Errors assume that the covariance matrix of the errors is correctly specified." ], "text/plain": [ "\n", "\"\"\"\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: LifeSat R-squared: 0.418\n", "Model: OLS Adj. 
R-squared: 0.345\n", "Method: Least Squares F-statistic: 5.738\n", "Date: Fri, 09 Feb 2024 Prob (F-statistic): 0.0435\n", "Time: 13:59:40 Log-Likelihood: -9.1084\n", "No. Observations: 10 AIC: 22.22\n", "Df Residuals: 8 BIC: 22.82\n", "Df Model: 1 \n", "Covariance Type: nonrobust \n", "==============================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "------------------------------------------------------------------------------\n", "Intercept 5.8485 0.421 13.878 0.000 4.877 6.820\n", "GDPpc 0.0176 0.007 2.395 0.043 0.001 0.035\n", "==============================================================================\n", "Omnibus: 0.349 Durbin-Watson: 1.689\n", "Prob(Omnibus): 0.840 Jarque-Bera (JB): 0.448\n", "Skew: 0.289 Prob(JB): 0.799\n", "Kurtosis: 2.139 Cond. No. 113.\n", "==============================================================================\n", "\n", "Notes:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", "\"\"\"" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# run the regression model\n", "\n", "# first we run this line to tell statsmodels where to find the data and the explanatory variables\n", "reg_formula = sm.regression.linear_model.OLS.from_formula(data = happiness, formula = 'LifeSat ~ GDPpc')\n", "\n", "# then we run this line to fit the regression (work out the values of intercept and slope)\n", "# the output is a structure which we will call reg_results\n", "reg_results = reg_formula.fit()\n", "\n", "# let's view a summary of the regression results\n", "reg_results.summary() \n", "\n", "# note you get a warning message because the dataset is quite small - disregard this\n", " " ] }, { "cell_type": "markdown", "id": "e0a268ed", "metadata": {}, "source": [ "\n", "### Interpreting the output\n", "\n", "The summary output is three tables\n", "\n", "The **top table** tells us about the regression 
we ran, including \n", "* what was the dependent variable\n", "* how many observations there were\n", "* type of regression model used (in this case Ordinary Least Squares, OLS - don't worry about this for now)\n", "* various measures of model fit such as $R^2$, AIC, BIC, Log-Likelihood - some (but not all) of these will be covered in later weeks\n", "\n", "The **middle table** gives the output of the regression\n", "* coefficients (beta values) - in this simple regression these are the intercept and the slope\n", "* statistical significance of the effects (t and p values)\n", "\n", "The **bottom table** contains some information about whether the assumptions of linear regression were satisfied\n", "\n", "### Predicted values, residuals\n", "\n", "For now let's focus on the middle table\n", "\n", "* Find the values for the intercept and the slope. Can you write out the regression equation in the form\n", "\n", "$$ \\hat{y} = \\beta_0 + \\beta_1 x $$\n", "\n", "... replacing the $\\beta$ values with numbers?\n", "\n", "\n", "You could work out the predicted value of $y$, namely $\\hat{y}$, for each value of $x$ from the equation. \n", "However `statsmodels` can give us the predicted values and residuals. \n", "\n", "* Run the following code which will give us the predicted values $\\hat{y}$ for each row of the data. Display the result. \n" ] }, { "cell_type": "code", "execution_count": 4, "id": "8892d911", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "0    6.763547\n", "1    6.527300\n", "2    6.680684\n", "3    6.326313\n", "4    7.449370\n", "5    7.833713\n", "6    6.419754\n", "7    6.386256\n", "8    6.745917\n", "9    6.067146\n", "dtype: float64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# return predicted values\n", "reg_results.fittedvalues" ] }, { "cell_type": "markdown", "id": "faf6c0b2", "metadata": {}, "source": [ "... 
and the residuals" ] }, { "cell_type": "code", "execution_count": 5, "id": "2655d19d", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "0 0.436453\n", "1 0.372700\n", "2 1.119316\n", "3 -0.426313\n", "4 -0.449370\n", "5 -0.433713\n", "6 -0.319754\n", "7 0.013744\n", "8 0.654083\n", "9 -0.967146\n", "dtype: float64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# return residuals\n", "reg_results.resid" ] }, { "cell_type": "markdown", "id": "9e6df5b3", "metadata": {}, "source": [ "These would be easier to view if we added them in as new columns in our dataframe:" ] }, { "cell_type": "code", "execution_count": 6, "id": "1ac6970d", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
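Behind the scenes, the slope and intercept in the middle table come from the least-squares formulas that appear later in this chapter ($b = s_{xy}/s^2_x$, $a = \bar{y} - b\bar{x}$). As a cross-check (not part of the course workflow, just a numpy-only sketch using the toy data from the table above), we can reproduce both coefficients by hand:

```python
import numpy as np

# toy country data, copied from the happiness table above
gdp = np.array([51.9, 38.5, 47.2, 27.1, 90.8, 112.6, 32.4, 30.5, 50.9, 12.4])
lifesat = np.array([7.2, 6.9, 7.8, 5.9, 7.0, 7.4, 6.1, 6.4, 7.4, 5.1])

# sample covariance matrix: cov[0, 0] is s²_x, cov[0, 1] is s_xy
cov = np.cov(gdp, lifesat, ddof=1)

b = cov[0, 1] / cov[0, 0]            # slope
a = lifesat.mean() - b * gdp.mean()  # intercept

print(round(b, 4), round(a, 4))  # → 0.0176 5.8485
```

The printed values agree with the `coef` column of the summary table (the `GDPpc` and `Intercept` rows).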
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CountryGDPpcLifeSatyhatresiduals
0Austria51.97.26.7635470.436453
1Czechia38.56.96.5273000.372700
2Finland47.27.86.6806841.119316
3Greece27.15.96.326313-0.426313
4Ireland90.87.07.449370-0.449370
5Luxembourg112.67.47.833713-0.433713
6Poland32.46.16.419754-0.319754
7Slovakia30.56.46.3862560.013744
8Sweden50.97.46.7459170.654083
9Ukraine12.45.16.067146-0.967146
\n", "
" ], "text/plain": [ " Country GDPpc LifeSat yhat residuals\n", "0 Austria 51.9 7.2 6.763547 0.436453\n", "1 Czechia 38.5 6.9 6.527300 0.372700\n", "2 Finland 47.2 7.8 6.680684 1.119316\n", "3 Greece 27.1 5.9 6.326313 -0.426313\n", "4 Ireland 90.8 7.0 7.449370 -0.449370\n", "5 Luxembourg 112.6 7.4 7.833713 -0.433713\n", "6 Poland 32.4 6.1 6.419754 -0.319754\n", "7 Slovakia 30.5 6.4 6.386256 0.013744\n", "8 Sweden 50.9 7.4 6.745917 0.654083\n", "9 Ukraine 12.4 5.1 6.067146 -0.967146" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "happiness['yhat']=reg_results.fittedvalues\n", "happiness['residuals']=reg_results.resid\n", "happiness" ] }, { "cell_type": "markdown", "id": "bc1ad087", "metadata": {}, "source": [ "This should look familiar! We had the same table back in concepts section. (There may be small differences due to rounding). \n", "\n", "Hopefully you can see that:\n", "\n", "* The $\\hat{y}$ values are fairly similar to the corresponding values of $y$, LifeSat\n", "* The residuals are positive when $y>\\hat{y}$ and negative $y<\\hat{y}$ \n", "* The size of the residual is larger when the difference between $y$ and $\\hat{y}$ is larger\n", "\n", "Do you think there are any outliers in the data? As we noted earlier, there was one very high and one very low value among the residuals. \n", "\n", "Let’s try re-running the regression model without these two potential outliers: Finland and Ukraine. Let’s replace the whole row of the dataframe with ‘NaN’ for these two countries and re-run the regression command.\n" ] }, { "cell_type": "code", "execution_count": 15, "id": "d7714a9f", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CountryGDPpcLifeSatyhatresiduals
0Austria51.97.26.7635470.436453
1Czechia38.56.96.5273000.372700
2NaNNaNNaNNaNNaN
3Greece27.15.96.326313-0.426313
4Ireland90.87.07.449370-0.449370
5Luxembourg112.67.47.833713-0.433713
6Poland32.46.16.419754-0.319754
7Slovakia30.56.46.3862560.013744
8Sweden50.97.46.7459170.654083
9NaNNaNNaNNaNNaN
\n", "
" ], "text/plain": [ " Country GDPpc LifeSat yhat residuals\n", "0 Austria 51.9 7.2 6.763547 0.436453\n", "1 Czechia 38.5 6.9 6.527300 0.372700\n", "2 NaN NaN NaN NaN NaN\n", "3 Greece 27.1 5.9 6.326313 -0.426313\n", "4 Ireland 90.8 7.0 7.449370 -0.449370\n", "5 Luxembourg 112.6 7.4 7.833713 -0.433713\n", "6 Poland 32.4 6.1 6.419754 -0.319754\n", "7 Slovakia 30.5 6.4 6.386256 0.013744\n", "8 Sweden 50.9 7.4 6.745917 0.654083\n", "9 NaN NaN NaN NaN NaN" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Code for changing Finland and Ukraine to NaN\n", "happiness_clean = happiness.copy()\n", "happiness_clean.loc[(happiness_clean.Country=='Ukraine')]=np.nan\n", "happiness_clean.loc[(happiness_clean.Country=='Finland')]=np.nan\n", "happiness_clean" ] }, { "cell_type": "code", "execution_count": 8, "id": "9afe1dd0", "metadata": { "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/joreilly/opt/anaconda3/lib/python3.9/site-packages/scipy/stats/_stats_py.py:1806: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=8\n", " warnings.warn(\"kurtosistest only valid for n>=20 ... continuing \"\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
OLS Regression Results
Dep. Variable: LifeSat R-squared: 0.467
Model: OLS Adj. R-squared: 0.379
Method: Least Squares F-statistic: 5.266
Date: Fri, 09 Feb 2024 Prob (F-statistic): 0.0615
Time: 13:59:41 Log-Likelihood: -3.9966
No. Observations: 8 AIC: 11.99
Df Residuals: 6 BIC: 12.15
Df Model: 1
Covariance Type: nonrobust
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
coef std err t P>|t| [0.025 0.975]
Intercept 6.0904 0.345 17.671 0.000 5.247 6.934
GDPpc 0.0128 0.006 2.295 0.062 -0.001 0.027
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
Omnibus: 1.068 Durbin-Watson: 1.229
Prob(Omnibus): 0.586 Jarque-Bera (JB): 0.669
Skew: 0.315 Prob(JB): 0.716
Kurtosis: 1.731 Cond. No. 131.


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified." ], "text/latex": [ "\\begin{center}\n", "\\begin{tabular}{lclc}\n", "\\toprule\n", "\\textbf{Dep. Variable:} & LifeSat & \\textbf{ R-squared: } & 0.467 \\\\\n", "\\textbf{Model:} & OLS & \\textbf{ Adj. R-squared: } & 0.379 \\\\\n", "\\textbf{Method:} & Least Squares & \\textbf{ F-statistic: } & 5.266 \\\\\n", "\\textbf{Date:} & Fri, 09 Feb 2024 & \\textbf{ Prob (F-statistic):} & 0.0615 \\\\\n", "\\textbf{Time:} & 13:59:41 & \\textbf{ Log-Likelihood: } & -3.9966 \\\\\n", "\\textbf{No. Observations:} & 8 & \\textbf{ AIC: } & 11.99 \\\\\n", "\\textbf{Df Residuals:} & 6 & \\textbf{ BIC: } & 12.15 \\\\\n", "\\textbf{Df Model:} & 1 & \\textbf{ } & \\\\\n", "\\textbf{Covariance Type:} & nonrobust & \\textbf{ } & \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "\\begin{tabular}{lcccccc}\n", " & \\textbf{coef} & \\textbf{std err} & \\textbf{t} & \\textbf{P$> |$t$|$} & \\textbf{[0.025} & \\textbf{0.975]} \\\\\n", "\\midrule\n", "\\textbf{Intercept} & 6.0904 & 0.345 & 17.671 & 0.000 & 5.247 & 6.934 \\\\\n", "\\textbf{GDPpc} & 0.0128 & 0.006 & 2.295 & 0.062 & -0.001 & 0.027 \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "\\begin{tabular}{lclc}\n", "\\textbf{Omnibus:} & 1.068 & \\textbf{ Durbin-Watson: } & 1.229 \\\\\n", "\\textbf{Prob(Omnibus):} & 0.586 & \\textbf{ Jarque-Bera (JB): } & 0.669 \\\\\n", "\\textbf{Skew:} & 0.315 & \\textbf{ Prob(JB): } & 0.716 \\\\\n", "\\textbf{Kurtosis:} & 1.731 & \\textbf{ Cond. No. } & 131. \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "%\\caption{OLS Regression Results}\n", "\\end{center}\n", "\n", "Notes: \\newline\n", " [1] Standard Errors assume that the covariance matrix of the errors is correctly specified." ], "text/plain": [ "\n", "\"\"\"\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: LifeSat R-squared: 0.467\n", "Model: OLS Adj. 
R-squared: 0.379\n", "Method: Least Squares F-statistic: 5.266\n", "Date: Fri, 09 Feb 2024 Prob (F-statistic): 0.0615\n", "Time: 13:59:41 Log-Likelihood: -3.9966\n", "No. Observations: 8 AIC: 11.99\n", "Df Residuals: 6 BIC: 12.15\n", "Df Model: 1 \n", "Covariance Type: nonrobust \n", "==============================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "------------------------------------------------------------------------------\n", "Intercept 6.0904 0.345 17.671 0.000 5.247 6.934\n", "GDPpc 0.0128 0.006 2.295 0.062 -0.001 0.027\n", "==============================================================================\n", "Omnibus: 1.068 Durbin-Watson: 1.229\n", "Prob(Omnibus): 0.586 Jarque-Bera (JB): 0.669\n", "Skew: 0.315 Prob(JB): 0.716\n", "Kurtosis: 1.731 Cond. No. 131.\n", "==============================================================================\n", "\n", "Notes:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", "\"\"\"" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# re-run the regression model\n", "\n", "# first we run this line to tell statsmodels where to find the data and the explanatory variables\n", "reg_formula = sm.regression.linear_model.OLS.from_formula(data = happiness_clean, formula = 'LifeSat ~ GDPpc')\n", "\n", "# then we run this line to fit the regression (work out the values of intercept and slope)\n", "# the output is a structure which we will call reg_results\n", "reg_results = reg_formula.fit()\n", "\n", "# let's view a summary of the regression results\n", "reg_results.summary() \n" ] }, { "cell_type": "markdown", "id": "f8ed07e7", "metadata": {}, "source": [ "### What changed when we excluded outliers? \n", "\n", "* Has the slope value become bigger or smaller? \n", "\n", "* Has the conclusion changed?" 
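Before answering, you can verify the change in slope numerically. This sketch (numpy only, using the covariance formula for the slope rather than `statsmodels`) fits the line first with all ten countries and then with Finland and Ukraine dropped:

```python
import numpy as np

# toy country data, as in the happiness table above
gdp = np.array([51.9, 38.5, 47.2, 27.1, 90.8, 112.6, 32.4, 30.5, 50.9, 12.4])
lifesat = np.array([7.2, 6.9, 7.8, 5.9, 7.0, 7.4, 6.1, 6.4, 7.4, 5.1])

def slope(x, y):
    """Slope of the least-squares line, b = s_xy / s²_x."""
    cov = np.cov(x, y, ddof=1)
    return cov[0, 1] / cov[0, 0]

b_all = slope(gdp, lifesat)

# drop rows 2 and 9 (Finland and Ukraine)
keep = [0, 1, 3, 4, 5, 6, 7, 8]
b_no_outliers = slope(gdp[keep], lifesat[keep])

print(round(b_all, 4), round(b_no_outliers, 4))  # → 0.0176 0.0128
```

The two values match the `GDPpc` coefficients in the two summary tables: removing the outliers makes the slope smaller.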
] }, { "cell_type": "markdown", "id": "ae2b6636", "metadata": {}, "source": [ "# The calculation for the slope and intercept \n", "\n", "Here’s a new data set looking at height and finger length among a sample of adults, which we will use to demonstrate how it is possible to calculate regression coefficients, using the equations.\n", "\n", "The equation for finding $b$ is\n", "\n", "$$ b = \\frac{s_{xy}}{s^2_x} $$" ] }, { "cell_type": "code", "execution_count": 16, "id": "7f4e256f", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
HeightFingerLength
0563.875
1574.000
2583.875
3584.000
4584.000
.........
2995744.875
2996745.000
2997745.000
2998745.250
2999774.375
\n", "

3000 rows × 2 columns

\n", "
" ], "text/plain": [ " Height FingerLength\n", "0 56 3.875\n", "1 57 4.000\n", "2 58 3.875\n", "3 58 4.000\n", "4 58 4.000\n", "... ... ...\n", "2995 74 4.875\n", "2996 74 5.000\n", "2997 74 5.000\n", "2998 74 5.250\n", "2999 77 4.375\n", "\n", "[3000 rows x 2 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# load and view the data\n", "heightFinger = pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook/main/data/HeightFingerInches.csv')\n", "display(heightFinger)" ] }, { "cell_type": "markdown", "id": "bcf927f7", "metadata": {}, "source": [ "The equation for finding $b$ is\n", "\n", "$$ b = \\frac{s_{xy}}{s^2_x} $$\n", "\n", "Let's apply that in Python" ] }, { "cell_type": "code", "execution_count": 17, "id": "4a45e526", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "b = 0.0550010848727285\n" ] } ], "source": [ "s_x = heightFinger.Height.std()\n", "s_y = heightFinger.FingerLength.std()\n", "s_xy = heightFinger.Height.cov(heightFinger.FingerLength)\n", "\n", "b = s_xy/(s_x**2)\n", "print('b = ' + str(b))" ] }, { "cell_type": "markdown", "id": "bcab0a39", "metadata": {}, "source": [ "What is the value of the intercept? The equation for finding the intercept is as follows:\n", "\n", "$$ a = \\bar{y} - b\\bar{x} $$\n", "\n", "$\\bar{x}$ and $\\bar{y}$ are the means of $x$ and $y$ for the 10 data points. \n", "\n", "Use Python to find the mean of $y$ and $x$. " ] }, { "cell_type": "code", "execution_count": 18, "id": "78e57ee0", "metadata": {}, "outputs": [], "source": [ "x_bar = heightFinger.Height.mean()\n", "y_bar = heightFinger.FingerLength.mean()" ] }, { "cell_type": "markdown", "id": "22ba54d0", "metadata": {}, "source": [ "Can you calculate $a$? 
" ] }, { "cell_type": "code", "execution_count": 19, "id": "5687615e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.9465806367945135" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = y_bar - b*x_bar\n", "a" ] }, { "cell_type": "markdown", "id": "5d6f3dd7", "metadata": {}, "source": [ "Let’s run a regression model in Python for the finger length data, to check our results. " ] }, { "cell_type": "code", "execution_count": 20, "id": "f66591ea", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
OLS Regression Results
Dep. Variable: FingerLength R-squared: 0.423
Model: OLS Adj. R-squared: 0.423
Method: Least Squares F-statistic: 2200.
Date: Fri, 09 Feb 2024 Prob (F-statistic): 0.00
Time: 14:05:31 Log-Likelihood: 1163.6
No. Observations: 3000 AIC: -2323.
Df Residuals: 2998 BIC: -2311.
Df Model: 1
Covariance Type: nonrobust
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
coef std err t P>|t| [0.025 0.975]
Intercept 0.9466 0.077 12.321 0.000 0.796 1.097
Height 0.0550 0.001 46.909 0.000 0.053 0.057
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
Omnibus: 20.597 Durbin-Watson: 0.182
Prob(Omnibus): 0.000 Jarque-Bera (JB): 28.786
Skew: -0.062 Prob(JB): 5.61e-07
Kurtosis: 3.464 Cond. No. 1.68e+03


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.68e+03. This might indicate that there are
strong multicollinearity or other numerical problems." ], "text/latex": [ "\\begin{center}\n", "\\begin{tabular}{lclc}\n", "\\toprule\n", "\\textbf{Dep. Variable:} & FingerLength & \\textbf{ R-squared: } & 0.423 \\\\\n", "\\textbf{Model:} & OLS & \\textbf{ Adj. R-squared: } & 0.423 \\\\\n", "\\textbf{Method:} & Least Squares & \\textbf{ F-statistic: } & 2200. \\\\\n", "\\textbf{Date:} & Fri, 09 Feb 2024 & \\textbf{ Prob (F-statistic):} & 0.00 \\\\\n", "\\textbf{Time:} & 14:05:31 & \\textbf{ Log-Likelihood: } & 1163.6 \\\\\n", "\\textbf{No. Observations:} & 3000 & \\textbf{ AIC: } & -2323. \\\\\n", "\\textbf{Df Residuals:} & 2998 & \\textbf{ BIC: } & -2311. \\\\\n", "\\textbf{Df Model:} & 1 & \\textbf{ } & \\\\\n", "\\textbf{Covariance Type:} & nonrobust & \\textbf{ } & \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "\\begin{tabular}{lcccccc}\n", " & \\textbf{coef} & \\textbf{std err} & \\textbf{t} & \\textbf{P$> |$t$|$} & \\textbf{[0.025} & \\textbf{0.975]} \\\\\n", "\\midrule\n", "\\textbf{Intercept} & 0.9466 & 0.077 & 12.321 & 0.000 & 0.796 & 1.097 \\\\\n", "\\textbf{Height} & 0.0550 & 0.001 & 46.909 & 0.000 & 0.053 & 0.057 \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "\\begin{tabular}{lclc}\n", "\\textbf{Omnibus:} & 20.597 & \\textbf{ Durbin-Watson: } & 0.182 \\\\\n", "\\textbf{Prob(Omnibus):} & 0.000 & \\textbf{ Jarque-Bera (JB): } & 28.786 \\\\\n", "\\textbf{Skew:} & -0.062 & \\textbf{ Prob(JB): } & 5.61e-07 \\\\\n", "\\textbf{Kurtosis:} & 3.464 & \\textbf{ Cond. No. } & 1.68e+03 \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "%\\caption{OLS Regression Results}\n", "\\end{center}\n", "\n", "Notes: \\newline\n", " [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. \\newline\n", " [2] The condition number is large, 1.68e+03. This might indicate that there are \\newline\n", " strong multicollinearity or other numerical problems." 
], "text/plain": [ "\n", "\"\"\"\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: FingerLength R-squared: 0.423\n", "Model: OLS Adj. R-squared: 0.423\n", "Method: Least Squares F-statistic: 2200.\n", "Date: Fri, 09 Feb 2024 Prob (F-statistic): 0.00\n", "Time: 14:05:31 Log-Likelihood: 1163.6\n", "No. Observations: 3000 AIC: -2323.\n", "Df Residuals: 2998 BIC: -2311.\n", "Df Model: 1 \n", "Covariance Type: nonrobust \n", "==============================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "------------------------------------------------------------------------------\n", "Intercept 0.9466 0.077 12.321 0.000 0.796 1.097\n", "Height 0.0550 0.001 46.909 0.000 0.053 0.057\n", "==============================================================================\n", "Omnibus: 20.597 Durbin-Watson: 0.182\n", "Prob(Omnibus): 0.000 Jarque-Bera (JB): 28.786\n", "Skew: -0.062 Prob(JB): 5.61e-07\n", "Kurtosis: 3.464 Cond. No. 1.68e+03\n", "==============================================================================\n", "\n", "Notes:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", "[2] The condition number is large, 1.68e+03. 
This might indicate that there are\n", "strong multicollinearity or other numerical problems.\n", "\"\"\"" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# run the regression model on height/finger data\n", "\n", "# first we run this line to tell statsmodels where to find the data and the explanatory variables\n", "reg_formula = sm.regression.linear_model.OLS.from_formula(data = heightFinger, formula = 'FingerLength ~ Height')\n", "\n", "# then we run this line to fit the regression (work out the values of intercept and slope)\n", "# the output is a structure which we will call reg_results\n", "reg_results = reg_formula.fit()\n", "\n", "# let's view a summary of the regression results\n", "reg_results.summary() \n" ] }, { "cell_type": "markdown", "id": "49a1995b", "metadata": {}, "source": [ "Find $a$ and $b$ in the regression output table (they are not called $a$ and $b$ in the table - what are they called?). Do they match the values we calculated ourselves using the equation?\n", "\n", "Great! Our calculations based on the equations are confirmed by Python’s regression table. " ] }, { "cell_type": "markdown", "id": "b32cd25e", "metadata": {}, "source": [ "### Quick extra questions\n", "\n", "Here are some extra questions on the regression output table, to see if you can begin to pick out all the important information. \n", "\n", "1. In the top left, it gives the method as ‘Least Squares’. Above, it gives the model type as ‘OLS’. Do you know what OLS stands for?\n", "    * Ordinary Least Squares\n", "1. How many observations are in the Height vs Finger model according to the regression output table?\n", "    * 3000\n", "1. What do you think ‘std err’ might stand for, in the column after ‘coef’?\n", "    * Standard Error. You have learned about this important concept already! We’ll come back to it again in a couple of weeks. 
\n" ] }, { "cell_type": "code", "execution_count": null, "id": "77e1403d", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "fba2ac0e", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.18" } }, "nbformat": 4, "nbformat_minor": 5 }