3.6. Creating new variables/columns#

Sometimes it is helpful to create new variables that recode data in meaningful ways. In particular, you may want to categorize a continuous variable, or gather many different categories together.

Here we look at how to do that.

3.6.1. Set up Python Libraries#

As usual, you will need to run this code block to import the relevant Python libraries:

# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import seaborn as sns
sns.set_theme(style='white')
import statsmodels.api as sm
import statsmodels.formula.api as smf

3.6.2. Import a dataset to work with#

Let’s use the OxfordWeather data:

weather = pd.read_csv("https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook/main/data/OxfordWeather.csv")
display(weather)
YYYY MM DD Tmax Tmin Tmean Trange Rainfall_mm
0 1827 1 1 8.3 5.6 7.0 2.7 0.0
1 1827 1 2 2.2 0.0 1.1 2.2 0.0
2 1827 1 3 -2.2 -8.3 -5.3 6.1 9.7
3 1827 1 4 -1.7 -7.8 -4.8 6.1 0.0
4 1827 1 5 0.0 -10.6 -5.3 10.6 0.0
... ... ... ... ... ... ... ... ...
71338 2022 4 26 15.2 4.1 9.7 11.1 0.0
71339 2022 4 27 10.7 2.6 6.7 8.1 0.0
71340 2022 4 28 12.7 3.9 8.3 8.8 0.0
71341 2022 4 29 11.7 6.7 9.2 5.0 0.0
71342 2022 4 30 17.6 1.0 9.3 16.6 0.0

71343 rows × 8 columns

3.6.3. Categorize a continuous variable#

Perhaps we would like to plot the weather in the 19th, 20th and 21st centuries separately.

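First we create a new column and fill it with NaNs (because we don’t have any real values to put in it yet):

# np.nan is the spelling that works across NumPy versions (the np.NaN alias was removed in NumPy 2.0)
weather['CCCC'] = np.nan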
weather
YYYY MM DD Tmax Tmin Tmean Trange Rainfall_mm CCCC
0 1827 1 1 8.3 5.6 7.0 2.7 0.0 NaN
1 1827 1 2 2.2 0.0 1.1 2.2 0.0 NaN
2 1827 1 3 -2.2 -8.3 -5.3 6.1 9.7 NaN
3 1827 1 4 -1.7 -7.8 -4.8 6.1 0.0 NaN
4 1827 1 5 0.0 -10.6 -5.3 10.6 0.0 NaN
... ... ... ... ... ... ... ... ... ...
71338 2022 4 26 15.2 4.1 9.7 11.1 0.0 NaN
71339 2022 4 27 10.7 2.6 6.7 8.1 0.0 NaN
71340 2022 4 28 12.7 3.9 8.3 8.8 0.0 NaN
71341 2022 4 29 11.7 6.7 9.2 5.0 0.0 NaN
71342 2022 4 30 17.6 1.0 9.3 16.6 0.0 NaN

71343 rows × 9 columns

Use df.loc[]#

We can use df.loc[] to set the values of CCCC based on the values of YYYY:

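# label each row by century, based on the year in YYYY
weather.loc[weather.YYYY<1900, 'CCCC']="19th"
weather.loc[(weather.YYYY>=1900)&(weather.YYYY<2000), 'CCCC']="20th"
# use >= so that the year 2000 itself is labelled rather than left as NaN
weather.loc[weather.YYYY>=2000, 'CCCC']="21st"
# check the result for one example year
weather.query('YYYY == 1981')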
/var/folders/ft/hqqrzz3d29xfyct7ct4630x00000gt/T/ipykernel_96077/2190974986.py:1: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas. Value '19th' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
  weather.loc[weather.YYYY<1900, 'CCCC']="19th"
YYYY MM DD Tmax Tmin Tmean Trange Rainfall_mm CCCC
56248 1981 1 1 8.4 5.1 6.8 3.3 0.5 20th
56249 1981 1 2 10.7 5.0 7.9 5.7 0.2 20th
56250 1981 1 3 10.1 8.2 9.2 1.9 0.0 20th
56251 1981 1 4 5.8 1.5 3.7 4.3 0.1 20th
56252 1981 1 5 6.6 -1.1 2.8 7.7 1.0 20th
... ... ... ... ... ... ... ... ... ...
56608 1981 12 27 3.1 0.4 1.8 2.7 0.8 20th
56609 1981 12 28 3.9 0.0 2.0 3.9 15.3 20th
56610 1981 12 29 9.4 1.8 5.6 7.6 7.1 20th
56611 1981 12 30 9.7 1.8 5.8 7.9 0.4 20th
56612 1981 12 31 7.8 2.6 5.2 5.2 7.1 20th

365 rows × 9 columns
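
The FutureWarning shown above appears because CCCC was created as a column of floats (NaN) and then filled with strings. As a minimal sketch of one way to sidestep this, the labels can be built in a single step with np.select(), so no placeholder column is needed. The small data frame toy and the fallback label 'unknown' are made up for illustration; they are not part of the OxfordWeather data.

```python
import numpy as np
import pandas as pd

# Hypothetical mini data frame standing in for the weather data
toy = pd.DataFrame({'YYYY': [1827, 1899, 1900, 1999, 2000, 2022]})

# One condition per century; np.select uses the first condition that is True
conditions = [
    toy.YYYY < 1900,
    (toy.YYYY >= 1900) & (toy.YYYY < 2000),
    toy.YYYY >= 2000,
]
labels = ['19th', '20th', '21st']

# Rows matching no condition would get the (assumed) fallback label 'unknown'
toy['CCCC'] = np.select(conditions, labels, default='unknown')
print(toy)
```

The same pattern should carry over to the full data frame by replacing toy with weather.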

Use pd.cut()#

We can use a handy pandas function, pd.cut(), to bin data:

# reload the dataframe
weather = pd.read_csv("https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook/main/data/OxfordWeather.csv")

weather['CCCC'] = pd.cut(weather.YYYY, bins=[0,1900,2000,9999], labels=['19th','20th','21st'])
weather
YYYY MM DD Tmax Tmin Tmean Trange Rainfall_mm CCCC
0 1827 1 1 8.3 5.6 7.0 2.7 0.0 19th
1 1827 1 2 2.2 0.0 1.1 2.2 0.0 19th
2 1827 1 3 -2.2 -8.3 -5.3 6.1 9.7 19th
3 1827 1 4 -1.7 -7.8 -4.8 6.1 0.0 19th
4 1827 1 5 0.0 -10.6 -5.3 10.6 0.0 19th
... ... ... ... ... ... ... ... ... ...
71338 2022 4 26 15.2 4.1 9.7 11.1 0.0 21st
71339 2022 4 27 10.7 2.6 6.7 8.1 0.0 21st
71340 2022 4 28 12.7 3.9 8.3 8.8 0.0 21st
71341 2022 4 29 11.7 6.7 9.2 5.0 0.0 21st
71342 2022 4 30 17.6 1.0 9.3 16.6 0.0 21st

71343 rows × 9 columns
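
One detail worth checking is which bin the boundary years fall into. By default pd.cut() includes the right-hand edge of each bin, so with the boundaries above the year 1900 is labelled '19th' and the year 2000 is labelled '20th'. The sketch below, using a made-up series of boundary years (not taken from the dataset), compares the default with right=False, which includes the left-hand edge instead.

```python
import pandas as pd

# Hypothetical boundary years chosen to expose the edge behaviour
years = pd.Series([1899, 1900, 1999, 2000])
edges = [0, 1900, 2000, 9999]
names = ['19th', '20th', '21st']

# Default (right=True): bins are (0, 1900], (1900, 2000], (2000, 9999]
right_closed = pd.cut(years, bins=edges, labels=names)

# right=False: bins are [0, 1900), [1900, 2000), [2000, 9999)
left_closed = pd.cut(years, bins=edges, labels=names, right=False)

print(pd.DataFrame({'year': years,
                    'right_closed': right_closed,
                    'left_closed': left_closed}))
```

Which convention is appropriate depends on whether you count 1900 and 2000 as the last year of one century or the first year of the next.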

pd.cut() can also be handy simply to group the data into bins of equal width: instead of a list of bin boundaries, we can pass the number of bins we want.
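
As a rough sketch of passing a number of bins, the example below splits a handful of made-up temperatures into four equal-width bins. The values and the label names are invented for illustration.

```python
import pandas as pd

# Hypothetical maximum temperatures (degrees C), not taken from the dataset
tmax = pd.Series([-2.2, 0.0, 2.2, 8.3, 10.7, 12.7, 15.2, 17.6])

# bins=4 asks pandas to split the range from min to max into 4 equal-width bins
tmax_binned = pd.cut(tmax, bins=4, labels=['coldest', 'cool', 'warm', 'warmest'])

# Equal width does not mean equal counts: some bins may hold few or no values
print(tmax_binned.value_counts(sort=False))
```

If the labels argument is left out, pd.cut() labels each row with the interval it falls into, which is a convenient way to see where the bin edges were placed.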

pd.qcut()#

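You can use the related function pd.qcut() to split the data into quantiles, so that each bin contains (roughly) the same number of rows rather than covering the same range of values.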
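
A minimal sketch of pd.qcut(), assuming a small made-up series of mean temperatures and invented quartile labels:

```python
import pandas as pd

# Hypothetical mean temperatures (degrees C), not taken from the dataset
tmean = pd.Series([-5.3, -4.8, 1.1, 6.7, 7.0, 8.3, 9.2, 9.7])

# q=4 splits the data at the quartiles, so each bin gets a quarter of the rows
tmean_quartile = pd.qcut(tmean, q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

print(pd.DataFrame({'Tmean': tmean, 'quartile': tmean_quartile}))
print(tmean_quartile.value_counts(sort=False))
```

Note that pd.qcut() can raise an error when many values are identical (for example, a rainfall column that is mostly zeros), because several quantile boundaries then coincide; its duplicates='drop' argument merges the repeated boundaries, at the cost of producing fewer bins.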