3.6. Creating new variables/columns#
Sometimes it is helpful to create new variables that recode data in meaningful ways, particularly, you may want to categorize continuous variables, or gather many different categories together
Here we look at how to do that
3.6.1. Set up Python Libraries#
As usual you will need to run this code block to import the relevant Python libraries
# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import seaborn as sns
sns.set_theme(style='white')
import statsmodels.api as sm
import statsmodels.formula.api as smf
3.6.2. Import a dataset to work with#
Let’s use the OxfordWeather data:
weather = pd.read_csv("https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook/main/data/OxfordWeather.csv")
display(weather)
YYYY | MM | DD | Tmax | Tmin | Tmean | Trange | Rainfall_mm | |
---|---|---|---|---|---|---|---|---|
0 | 1827 | 1 | 1 | 8.3 | 5.6 | 7.0 | 2.7 | 0.0 |
1 | 1827 | 1 | 2 | 2.2 | 0.0 | 1.1 | 2.2 | 0.0 |
2 | 1827 | 1 | 3 | -2.2 | -8.3 | -5.3 | 6.1 | 9.7 |
3 | 1827 | 1 | 4 | -1.7 | -7.8 | -4.8 | 6.1 | 0.0 |
4 | 1827 | 1 | 5 | 0.0 | -10.6 | -5.3 | 10.6 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
71338 | 2022 | 4 | 26 | 15.2 | 4.1 | 9.7 | 11.1 | 0.0 |
71339 | 2022 | 4 | 27 | 10.7 | 2.6 | 6.7 | 8.1 | 0.0 |
71340 | 2022 | 4 | 28 | 12.7 | 3.9 | 8.3 | 8.8 | 0.0 |
71341 | 2022 | 4 | 29 | 11.7 | 6.7 | 9.2 | 5.0 | 0.0 |
71342 | 2022 | 4 | 30 | 17.6 | 1.0 | 9.3 | 16.6 | 0.0 |
71343 rows × 8 columns
3.6.3. Categorize a continuous variable#
Perhaps we would like to plot the weather in the 19th, 20th and 21st centuries separately.
First we create a new column and fill is with NaNs (because we don’t have any real numbers to put in it yet)
weather['CCCC'] = np.NaN
weather
YYYY | MM | DD | Tmax | Tmin | Tmean | Trange | Rainfall_mm | CCCC | |
---|---|---|---|---|---|---|---|---|---|
0 | 1827 | 1 | 1 | 8.3 | 5.6 | 7.0 | 2.7 | 0.0 | NaN |
1 | 1827 | 1 | 2 | 2.2 | 0.0 | 1.1 | 2.2 | 0.0 | NaN |
2 | 1827 | 1 | 3 | -2.2 | -8.3 | -5.3 | 6.1 | 9.7 | NaN |
3 | 1827 | 1 | 4 | -1.7 | -7.8 | -4.8 | 6.1 | 0.0 | NaN |
4 | 1827 | 1 | 5 | 0.0 | -10.6 | -5.3 | 10.6 | 0.0 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
71338 | 2022 | 4 | 26 | 15.2 | 4.1 | 9.7 | 11.1 | 0.0 | NaN |
71339 | 2022 | 4 | 27 | 10.7 | 2.6 | 6.7 | 8.1 | 0.0 | NaN |
71340 | 2022 | 4 | 28 | 12.7 | 3.9 | 8.3 | 8.8 | 0.0 | NaN |
71341 | 2022 | 4 | 29 | 11.7 | 6.7 | 9.2 | 5.0 | 0.0 | NaN |
71342 | 2022 | 4 | 30 | 17.6 | 1.0 | 9.3 | 16.6 | 0.0 | NaN |
71343 rows × 9 columns
Use df.loc[]
#
We can use df.loc[]
to set the values of CCCC based on the values of YYYY:
weather.loc[weather.YYYY<1900, 'CCCC']="19th"
weather.loc[(weather.YYYY>=1900)&(weather.YYYY<2000), 'CCCC']="20th"
weather.loc[weather.YYYY>2000, 'CCCC']="21st"
weather.query('YYYY == 1981')
/var/folders/ft/hqqrzz3d29xfyct7ct4630x00000gt/T/ipykernel_92694/2190974986.py:1: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas. Value '19th' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
weather.loc[weather.YYYY<1900, 'CCCC']="19th"
YYYY | MM | DD | Tmax | Tmin | Tmean | Trange | Rainfall_mm | CCCC | |
---|---|---|---|---|---|---|---|---|---|
56248 | 1981 | 1 | 1 | 8.4 | 5.1 | 6.8 | 3.3 | 0.5 | 20th |
56249 | 1981 | 1 | 2 | 10.7 | 5.0 | 7.9 | 5.7 | 0.2 | 20th |
56250 | 1981 | 1 | 3 | 10.1 | 8.2 | 9.2 | 1.9 | 0.0 | 20th |
56251 | 1981 | 1 | 4 | 5.8 | 1.5 | 3.7 | 4.3 | 0.1 | 20th |
56252 | 1981 | 1 | 5 | 6.6 | -1.1 | 2.8 | 7.7 | 1.0 | 20th |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
56608 | 1981 | 12 | 27 | 3.1 | 0.4 | 1.8 | 2.7 | 0.8 | 20th |
56609 | 1981 | 12 | 28 | 3.9 | 0.0 | 2.0 | 3.9 | 15.3 | 20th |
56610 | 1981 | 12 | 29 | 9.4 | 1.8 | 5.6 | 7.6 | 7.1 | 20th |
56611 | 1981 | 12 | 30 | 9.7 | 1.8 | 5.8 | 7.9 | 0.4 | 20th |
56612 | 1981 | 12 | 31 | 7.8 | 2.6 | 5.2 | 5.2 | 7.1 | 20th |
365 rows × 9 columns
Use pd.cut()
#
We can use a hand pandas
function, pd.cut()
to bin data
# reload the dataframe
weather = pd.read_csv("https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook/main/data/OxfordWeather.csv")
weather['CCCC'] = pd.cut(weather.YYYY, bins=[0,1900,2000,9999], labels=['19th','20th','21st'])
weather
YYYY | MM | DD | Tmax | Tmin | Tmean | Trange | Rainfall_mm | CCCC | |
---|---|---|---|---|---|---|---|---|---|
0 | 1827 | 1 | 1 | 8.3 | 5.6 | 7.0 | 2.7 | 0.0 | 19th |
1 | 1827 | 1 | 2 | 2.2 | 0.0 | 1.1 | 2.2 | 0.0 | 19th |
2 | 1827 | 1 | 3 | -2.2 | -8.3 | -5.3 | 6.1 | 9.7 | 19th |
3 | 1827 | 1 | 4 | -1.7 | -7.8 | -4.8 | 6.1 | 0.0 | 19th |
4 | 1827 | 1 | 5 | 0.0 | -10.6 | -5.3 | 10.6 | 0.0 | 19th |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
71338 | 2022 | 4 | 26 | 15.2 | 4.1 | 9.7 | 11.1 | 0.0 | 21st |
71339 | 2022 | 4 | 27 | 10.7 | 2.6 | 6.7 | 8.1 | 0.0 | 21st |
71340 | 2022 | 4 | 28 | 12.7 | 3.9 | 8.3 | 8.8 | 0.0 | 21st |
71341 | 2022 | 4 | 29 | 11.7 | 6.7 | 9.2 | 5.0 | 0.0 | 21st |
71342 | 2022 | 4 | 30 | 17.6 | 1.0 | 9.3 | 16.6 | 0.0 | 21st |
71343 rows × 9 columns
This can be handy just to group the data into equal sized bins, for example (as we can use a number of bins rather than a list of bin boundaries)
pd.qcut()
#
You can use the related function pd.qcut()
to split the data into quantiles