{ "cells": [ { "cell_type": "markdown", "id": "213984e9", "metadata": {}, "source": [ "# Standardizing data\n", "\n", "Some data are recorded in naturally meaningful units; examples are \n", "* height of adults in cm\n", "* temperature in $^{\\circ}C$\n", "\n", "In other cases, units may be hard to interpret because we don't have a sense of what a typical score is, based on general knowledge\n", "* scores on an IQ test marked out of 180\n", "* height of 6-year-olds in cm\n", "\n", "A further problem is quantifying how unusual a data value is when values are presented as different units\n", "* High school grades from different countries or systems (A-levels vs IB vs Abitur vs.....)\n", "\n", "In all cases it can be useful to present data in standard units.\n", "\n", "Two common ways of doing this are:\n", "* Convert data to Z-scores\n", "* Convert data to quantiles\n", "\n", "In this section we will review both these approaches.\n", "\n", "\n", "### Set up Python Libraries\n", "\n", "As usual you will need to run this code block to import the relevant Python libraries" ] }, { "cell_type": "code", "execution_count": 2, "id": "be5b5d06", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Set-up Python libraries - you need to run this but you don't need to change it\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import scipy.stats as stats\n", "import pandas as pd\n", "import seaborn as sns\n", "sns.set_theme(style='white')\n", "import statsmodels.api as sm\n", "import statsmodels.formula.api as smf" ] }, { "cell_type": "markdown", "id": "c7fee4fc", "metadata": {}, "source": [ "### Import a dataset to work with\n", "\n", "Let's look at a fictional dataset containing some body measurements for 50 individuals" ] }, { "cell_type": "code", "execution_count": 4, "id": "ac9c9d93", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " ID sex height weight age\n", "0 101708 M 161 64.8 35\n", "1 101946 F 165 68.1 42\n", "2 108449 F 175 76.6 31\n", "3 108796 M 180 81.0 31\n", "4 113449 F 179 80.1 31\n", "5 114688 M 172 74.0 42\n", "6 119187 F 148 54.8 45\n", "7 120679 F 160 64.0 44\n", "8 120735 F 188 88.4 32\n", "9 124269 F 172 74.0 29\n", "10 124713 M 175 76.6 26\n", "11 127076 M 180 81.0 28\n", "12 131626 M 162 65.6 35\n", "13 132218 M 170 72.3 29\n", "14 132609 F 172 74.0 41\n", "15 134660 F 159 63.2 34\n", "16 135195 M 169 71.4 42\n", "17 140073 F 168 70.6 34\n", "18 140114 M 195 95.1 41\n", "19 145185 F 157 61.6 45\n", "20 146279 F 180 81.0 30\n", "21 146519 F 172 74.0 34\n", "22 151451 F 171 73.1 37\n", "23 152597 M 172 74.0 27\n", "24 154672 M 167 69.7 39\n", "25 155594 F 165 68.1 25\n", "26 158165 M 175 76.6 45\n", "27 159457 F 176 77.4 36\n", "28 162323 M 173 74.8 31\n", "29 166948 M 174 75.7 28\n", "30 168411 M 175 76.6 29\n", "31 168574 F 163 66.4 30\n", "32 169209 F 159 63.2 45\n", "33 171236 F 164 67.2 34\n", "34 172289 M 181 81.9 27\n", "35 173925 M 189 89.3 25\n", "36 176598 F 169 71.4 37\n", "37 177002 F 180 81.0 36\n", "38 178659 M 181 81.9 26\n", "39 180992 F 177 78.3 31\n", "40 183304 F 176 77.4 30\n", "41 184706 M 183 83.7 40\n", "42 185138 M 169 71.4 28\n", "43 185223 F 170 72.3 41\n", "44 186041 M 175 76.6 25\n", "45 186887 M 154 59.3 26\n", "46 187016 M 161 64.8 32\n", "47 198157 M 180 81.0 33\n", "48 199112 M 172 74.0 33\n", "49 199614 F 164 67.2 31" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "data=pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/BodyData.csv')\n", "display(data)" ] }, { "cell_type": "markdown", "id": "f3ddc6de", "metadata": {}, "source": [ "## Z score\n", "\n", "The Z-score tells us how many standard deviations above or below the mean of the distribution a given value lies.\n", "\n", "Let's convert our weights to Z-scores. 
We will need to know the mean and standard deviation of weight:" ] }, { "cell_type": "code", "execution_count": 6, "id": "e8dd34d4-c183-4515-8fca-fc07ef891bca", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "73.73\n", "7.891438140058334\n" ] } ], "source": [ "# Mean and sample standard deviation of weight (pandas .std() divides by n-1 by default)\n", "print(data.weight.mean())\n", "print(data.weight.std())" ] }, { "cell_type": "markdown", "id": "518f8f5c-663b-48d8-99b4-e65b8649e4b4", "metadata": { "tags": [] }, "source": [ "We can then calculate a Z-score for each person's weight. \n", "* Someone whose weight is exactly on the mean (73.73kg) will have a Z-score of 0.\n", "* Someone whose weight is one standard deviation below the mean (about 65.8kg) will have a Z-score of -1.\n", "\n", "and so on." ] }, { "cell_type": "code", "execution_count": 8, "id": "63a2e8db-4762-4115-823a-20701b048575", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " ID sex height weight age WeightZ\n", "0 101708 M 161 64.8 35 -1.131606\n", "1 101946 F 165 68.1 42 -0.713431\n", "2 108449 F 175 76.6 31 0.363685\n", "3 108796 M 180 81.0 31 0.921252\n", "4 113449 F 179 80.1 31 0.807204\n", "5 114688 M 172 74.0 42 0.034214\n", "6 119187 F 148 54.8 45 -2.398802\n", "7 120679 F 160 64.0 44 -1.232982\n", "8 120735 F 188 88.4 32 1.858977\n", "9 124269 F 172 74.0 29 0.034214\n", "10 124713 M 175 76.6 26 0.363685\n", "11 127076 M 180 81.0 28 0.921252\n", "12 131626 M 162 65.6 35 -1.030230\n", "13 132218 M 170 72.3 29 -0.181209\n", "14 132609 F 172 74.0 41 0.034214\n", "15 134660 F 159 63.2 34 -1.334358\n", "16 135195 M 169 71.4 42 -0.295257\n", "17 140073 F 168 70.6 34 -0.396632\n", "18 140114 M 195 95.1 41 2.707998\n", "19 145185 F 157 61.6 45 -1.537109\n", "20 146279 F 180 81.0 30 0.921252\n", "21 146519 F 172 74.0 34 0.034214\n", "22 151451 F 171 73.1 37 -0.079833\n", "23 152597 M 172 74.0 27 0.034214\n", "24 154672 M 167 69.7 39 -0.510680\n", "25 155594 F 165 68.1 25 -0.713431\n", "26 158165 M 175 76.6 45 0.363685\n", "27 159457 F 176 77.4 36 0.465061\n", "28 162323 M 173 74.8 31 0.135590\n", "29 166948 M 174 75.7 28 0.249638\n", "30 168411 M 175 76.6 29 0.363685\n", "31 168574 F 163 66.4 30 -0.928855\n", "32 169209 F 159 63.2 45 -1.334358\n", "33 171236 F 164 67.2 34 -0.827479\n", "34 172289 M 181 81.9 27 1.035299\n", "35 173925 M 189 89.3 25 1.973024\n", "36 176598 F 169 71.4 37 -0.295257\n", "37 177002 F 180 81.0 36 0.921252\n", "38 178659 M 181 81.9 26 1.035299\n", "39 180992 F 177 78.3 31 0.579109\n", "40 183304 F 176 77.4 30 0.465061\n", "41 184706 M 183 83.7 40 1.263395\n", "42 185138 M 169 71.4 28 -0.295257\n", "43 185223 F 170 72.3 41 -0.181209\n", "44 186041 M 175 76.6 25 0.363685\n", "45 186887 M 154 59.3 26 -1.828564\n", "46 187016 M 161 64.8 32 -1.131606\n", "47 198157 M 180 81.0 33 0.921252\n", "48 199112 M 172 74.0 33 0.034214\n", "49 199614 F 164 67.2 31 -0.827479" ] }, "execution_count": 8, 
"metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create a new column and put the calcualted z-scores in it\n", "data['WeightZ'] = (data.weight - data.weight.mean())/data.weight.std()\n", "data" ] }, { "cell_type": "markdown", "id": "40516f48-d8ca-44f9-ba83-1c0ad3d08b79", "metadata": {}, "source": [ "Look down the table for some heavy and light people. Do their z-scores look like you would expect?" ] }, { "cell_type": "markdown", "id": "c57ea956", "metadata": {}, "source": [ "### Disadvantages of the Z score\n", "\n", "Z score tells us how many standard deviations above or below the mean a datapoint lies.\n", "\n", "We can use some hand 'rules of thumb' to know how unusual a Z-score is, as long as the data distribution is approximately normal:\n", " * Don't worry if you don't know what the Normal distribution is yet - you will learn about this in detail later in the course\n", " \n", "\n", "\n", "\n", "The Z-score does have a couple of disadvantages:\n", "* it is only really meaningful for symmetrical data distributions (especially the Normal distribution) - for skewed distributions, there will be momre datapoints with a Z-score of, say, +2, than -2\n", " \n", "\n", " \n", "Additionally, the Z-score is not easily understood by non statistically trained people\n", " \n", "It is therefore sometimes more meaningful to standardize data by presenting them as *quantiles*" ] }, { "cell_type": "markdown", "id": "2aaad28d", "metadata": {}, "source": [ "## Quantiles\n", "\n", "Quantiles (or centiles) tell us what proportion of data points are expected to exceed a certain value. This is easy to interpret. \n", "\n", "For example, say my six year old daughter is 125cm tall, would you say she is tall for her age? 
You probably have no idea - this is in contrast to adult heights, where people might have a sense of the distribution from general knowledge (eg 150cm is small and 180cm is tall)\n", "\n", "In fact, a 6-year-old with height 125cm lies on the 95th centile, which means they are taller than 95% of children the same age (and will definitely look tall in the playground).\n", "\n", "To calculate a given quantile of a dataset we use `df.quantile()`, eg" ] }, { "cell_type": "code", "execution_count": 18, "id": "18fe4e26", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "181.0" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# find the 90th centile for height in our dataframe\n", "data.height.quantile(q=0.9) # get 90th centile" ] }, { "cell_type": "markdown", "id": "2a2fb8ff-c311-420d-a5ca-374a2a6b7dca", "metadata": {}, "source": [ "The 90th centile is 181cm, ie about 10% of people in the sample are taller than 181cm." ] }, { "cell_type": "markdown", "id": "9f917630-d33d-43b5-bacb-2380c5ccb947", "metadata": {}, "source": [ "Adding quantiles to the table can be done using `pd.qcut()`, which categorizes the data into quantiles. For example, I can produce a table saying which decile each person's weight falls into as follows:\n", "\n", "* Deciles are 10ths, in the same way that centiles are 100ths\n", " * if someone's weight is in the 0th decile, that means their weight is in the bottom 10% of the sample\n", " * if someone's weight is in the 9th decile, it means they are in the top 10% of the sample (heavier than 90% of people)" ] }, { "cell_type": "code", "execution_count": 19, "id": "33b57717-b626-4edb-a308-c1a8f0702c6c", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " ID sex height weight age WeightZ weightQ\n", "0 101708 M 161 64.8 35 -1.131606 1\n", "1 101946 F 165 68.1 42 -0.713431 2\n", "2 108449 F 175 76.6 31 0.363685 6\n", "3 108796 M 180 81.0 31 0.921252 7\n", "4 113449 F 179 80.1 31 0.807204 7\n", "5 114688 M 172 74.0 42 0.034214 4\n", "6 119187 F 148 54.8 45 -2.398802 0\n", "7 120679 F 160 64.0 44 -1.232982 1\n", "8 120735 F 188 88.4 32 1.858977 9\n", "9 124269 F 172 74.0 29 0.034214 4\n", "10 124713 M 175 76.6 26 0.363685 6\n", "11 127076 M 180 81.0 28 0.921252 7\n", "12 131626 M 162 65.6 35 -1.030230 1\n", "13 132218 M 170 72.3 29 -0.181209 3\n", "14 132609 F 172 74.0 41 0.034214 4\n", "15 134660 F 159 63.2 34 -1.334358 0\n", "16 135195 M 169 71.4 42 -0.295257 3\n", "17 140073 F 168 70.6 34 -0.396632 3\n", "18 140114 M 195 95.1 41 2.707998 9\n", "19 145185 F 157 61.6 45 -1.537109 0\n", "20 146279 F 180 81.0 30 0.921252 7\n", "21 146519 F 172 74.0 34 0.034214 4\n", "22 151451 F 171 73.1 37 -0.079833 4\n", "23 152597 M 172 74.0 27 0.034214 4\n", "24 154672 M 167 69.7 39 -0.510680 2\n", "25 155594 F 165 68.1 25 -0.713431 2\n", "26 158165 M 175 76.6 45 0.363685 6\n", "27 159457 F 176 77.4 36 0.465061 7\n", "28 162323 M 173 74.8 31 0.135590 5\n", "29 166948 M 174 75.7 28 0.249638 5\n", "30 168411 M 175 76.6 29 0.363685 6\n", "31 168574 F 163 66.4 30 -0.928855 1\n", "32 169209 F 159 63.2 45 -1.334358 0\n", "33 171236 F 164 67.2 34 -0.827479 2\n", "34 172289 M 181 81.9 27 1.035299 8\n", "35 173925 M 189 89.3 25 1.973024 9\n", "36 176598 F 169 71.4 37 -0.295257 3\n", "37 177002 F 180 81.0 36 0.921252 7\n", "38 178659 M 181 81.9 26 1.035299 8\n", "39 180992 F 177 78.3 31 0.579109 7\n", "40 183304 F 176 77.4 30 0.465061 7\n", "41 184706 M 183 83.7 40 1.263395 9\n", "42 185138 M 169 71.4 28 -0.295257 3\n", "43 185223 F 170 72.3 41 -0.181209 3\n", "44 186041 M 175 76.6 25 0.363685 6\n", "45 186887 M 154 59.3 26 -1.828564 0\n", "46 187016 M 161 64.8 32 -1.131606 1\n", "47 198157 M 180 81.0 33 0.921252 7\n", 
"48 199112 M 172 74.0 33 0.034214 4\n", "49 199614 F 164 67.2 31 -0.827479 2" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['weightQ'] = pd.qcut(data.weight, 10, labels = False) \n", "data" ] }, { "cell_type": "markdown", "id": "a4c202bd-5cbc-46c6-ae1f-22ad6a72e558", "metadata": {}, "source": [ "**NOTE** this is a bit fiddly as `df.qcut` won't create empty bins. Since this dataset is quite small, we can't create one bin for each centile as naturally some will be empty (as there are less than 100 datapoints)" ] }, { "cell_type": "code", "execution_count": null, "id": "ae6859a6-e7b2-485c-8b14-075b19c1a542", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.18" } }, "nbformat": 4, "nbformat_minor": 5 }