Creating new variables/columns

3.6. Creating new variables/columns#

Sometimes it is helpful to create new variables that recode data in meaningful ways, particularly, you may want to categorize continuous variables, or gather many different categories together

Here we look at how to do that

3.6.1. Set up Python Libraries#

As usual you will need to run this code block to import the relevant Python libraries

# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import seaborn as sns
sns.set_theme(style='white')
import statsmodels.api as sm
import statsmodels.formula.api as smf

3.6.2. Import a dataset to work with#

Let’s use the OxfordWeather data:

weather = pd.read_csv("https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook/main/data/OxfordWeather.csv")
display(weather)

	YYYY	MM	DD	Tmax	Tmin	Tmean	Trange	Rainfall_mm
0	1827	1	1	8.3	5.6	7.0	2.7	0.0
1	1827	1	2	2.2	0.0	1.1	2.2	0.0
2	1827	1	3	-2.2	-8.3	-5.3	6.1	9.7
3	1827	1	4	-1.7	-7.8	-4.8	6.1	0.0
4	1827	1	5	0.0	-10.6	-5.3	10.6	0.0
...	...	...	...	...	...	...	...	...
71338	2022	4	26	15.2	4.1	9.7	11.1	0.0
71339	2022	4	27	10.7	2.6	6.7	8.1	0.0
71340	2022	4	28	12.7	3.9	8.3	8.8	0.0
71341	2022	4	29	11.7	6.7	9.2	5.0	0.0
71342	2022	4	30	17.6	1.0	9.3	16.6	0.0

71343 rows × 8 columns

3.6.3. Categorize a continuous variable#

Perhaps we would like to plot the weather in the 19th, 20th and 21st centuries separately.

First we create a new column and fill is with NaNs (because we don’t have any real numbers to put in it yet)

weather['CCCC'] = np.NaN
weather

	YYYY	MM	DD	Tmax	Tmin	Tmean	Trange	Rainfall_mm	CCCC
0	1827	1	1	8.3	5.6	7.0	2.7	0.0	NaN
1	1827	1	2	2.2	0.0	1.1	2.2	0.0	NaN
2	1827	1	3	-2.2	-8.3	-5.3	6.1	9.7	NaN
3	1827	1	4	-1.7	-7.8	-4.8	6.1	0.0	NaN
4	1827	1	5	0.0	-10.6	-5.3	10.6	0.0	NaN
...	...	...	...	...	...	...	...	...	...
71338	2022	4	26	15.2	4.1	9.7	11.1	0.0	NaN
71339	2022	4	27	10.7	2.6	6.7	8.1	0.0	NaN
71340	2022	4	28	12.7	3.9	8.3	8.8	0.0	NaN
71341	2022	4	29	11.7	6.7	9.2	5.0	0.0	NaN
71342	2022	4	30	17.6	1.0	9.3	16.6	0.0	NaN

71343 rows × 9 columns

Use `df.loc[]`#

We can use df.loc[] to set the values of CCCC based on the values of YYYY:

weather.loc[weather.YYYY<1900, 'CCCC']="19th"
weather.loc[(weather.YYYY>=1900)&(weather.YYYY<2000), 'CCCC']="20th"
weather.loc[weather.YYYY>2000, 'CCCC']="21st"
weather.query('YYYY == 1981')

/var/folders/ft/hqqrzz3d29xfyct7ct4630x00000gt/T/ipykernel_47607/2190974986.py:1: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas. Value '19th' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
  weather.loc[weather.YYYY<1900, 'CCCC']="19th"

	YYYY	MM	DD	Tmax	Tmin	Tmean	Trange	Rainfall_mm	CCCC
56248	1981	1	1	8.4	5.1	6.8	3.3	0.5	20th
56249	1981	1	2	10.7	5.0	7.9	5.7	0.2	20th
56250	1981	1	3	10.1	8.2	9.2	1.9	0.0	20th
56251	1981	1	4	5.8	1.5	3.7	4.3	0.1	20th
56252	1981	1	5	6.6	-1.1	2.8	7.7	1.0	20th
...	...	...	...	...	...	...	...	...	...
56608	1981	12	27	3.1	0.4	1.8	2.7	0.8	20th
56609	1981	12	28	3.9	0.0	2.0	3.9	15.3	20th
56610	1981	12	29	9.4	1.8	5.6	7.6	7.1	20th
56611	1981	12	30	9.7	1.8	5.8	7.9	0.4	20th
56612	1981	12	31	7.8	2.6	5.2	5.2	7.1	20th

365 rows × 9 columns

Use `pd.cut()`#

We can use a hand pandas function, pd.cut() to bin data

# reload the dataframe
weather = pd.read_csv("https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook/main/data/OxfordWeather.csv")

weather['CCCC'] = pd.cut(weather.YYYY, bins=[0,1900,2000,9999], labels=['19th','20th','21st'])
weather

	YYYY	MM	DD	Tmax	Tmin	Tmean	Trange	Rainfall_mm	CCCC
0	1827	1	1	8.3	5.6	7.0	2.7	0.0	19th
1	1827	1	2	2.2	0.0	1.1	2.2	0.0	19th
2	1827	1	3	-2.2	-8.3	-5.3	6.1	9.7	19th
3	1827	1	4	-1.7	-7.8	-4.8	6.1	0.0	19th
4	1827	1	5	0.0	-10.6	-5.3	10.6	0.0	19th
...	...	...	...	...	...	...	...	...	...
71338	2022	4	26	15.2	4.1	9.7	11.1	0.0	21st
71339	2022	4	27	10.7	2.6	6.7	8.1	0.0	21st
71340	2022	4	28	12.7	3.9	8.3	8.8	0.0	21st
71341	2022	4	29	11.7	6.7	9.2	5.0	0.0	21st
71342	2022	4	30	17.6	1.0	9.3	16.6	0.0	21st

71343 rows × 9 columns

This can be handy just to group the data into equal sized bins, for example (as we can use a number of bins rather than a list of bin boundaries)

`pd.qcut()`#

You can use the related function pd.qcut() to split the data into quantiles

Creating new variables/columns

Contents