3.10. Tutorial Exercises#
These tutorial exercises are designed to help you prepare for the first assignment.
As a researcher, there are two distinct phases to data analysis:
Understanding the dataset yourself - this involves making lots of quick plots and descriptive statistics to
check for outliers
find out the data distributions
look for differences between categories
look for associations between variables
Preparing a report for a reader - this involves a focus on readability and the reader
explain any key features of the dataset
highlight key results with descriptive statistics and figures
figures should be well labelled and tweaked to make your point as clearly as possible
there should be clear, readable explanatory text
for most readers/clients, non-technical language should be used
in all cases, jargon should be avoided
In these tutorial exercises, you will complete some guided tasks (and some open-ended ones) to explore the dataset for yourself.
For the hand-in assignment, you will produce a report on the same dataset for a specified reader.
3.10.1. Crime Survey Data#
We will work with a dataset extracted from the Crime Survey England and Wales 2013.
I obtained the data from the UK Data Service, a data repository run by the UK Research Councils. This text is from their introduction to the dataset:
The Crime Survey for England and Wales (CSEW) is a face-to-face victimisation survey in which people resident in households in England and Wales are asked about their experiences of a range of crimes in the 12 months prior to the interview. Respondents to the survey are also asked about their perceptions of crime and attitudes towards crime related issues such as the police and criminal justice system.
The dataset I have given you contains only some of the questions that respondents were asked. These cover the respondents’ individual demographic features, neighbourhood, perceptions of crime and confidence in the police.
The brief for the hand-in report will be to write a short report for the Home Secretary addressing two topics:
Which groups are the most likely to be victims of crime? and
What factors affect confidence in policing?
Note that the idea is to write for a generic Home Secretary - they have responsibility for Law and Order and, as a politician, are interested in how different sections of the public perceive these issues. You can assume they have no statistical training. However, there is no need to accommodate the political attitudes or personal characteristics of any particular Home Secretary.
In these preparatory exercises you will play around with the data to try and work out which factors are important predictors of that confidence.
I have put my own conclusions at the bottom of this page - this is just to give an idea of the kinds of things you might look at.
Note#
The survey was conducted in 2013 in the UK. Events of recent years may have affected the confidence of certain groups in the police; this would not be reflected in the data used here.
Set up Python libraries#
# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import seaborn as sns
sns.set_theme(style='white')
import statsmodels.api as sm
import statsmodels.formula.api as smf
import warnings
warnings.simplefilter('ignore', category=FutureWarning)
Import the data#
Download the dataset from Canvas and import it as a dataframe called crime.
# WARNING! This will only work once you:
# download the datafile and
# put it in the right place on your computer!
crime = pd.read_csv('../data/CrimeData_2013.csv')
crime
 | ID | Sex | Age | AgeGroup | EthnicGroup | Education | SES | DeprivationIndex | Victim | confpolice | antisoc |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 135230170.0 | Male | 45.0 | 4 | White | University | 2.0 | 3.0 | 0.0 | -2.290506 | -3.42 |
1 | 135230210.0 | Male | 28.0 | 2 | White | University | 1.0 | 4.0 | 0.0 | 0.349198 | 0.52 |
2 | 135231010.0 | Female | 58.0 | 5 | Black or Black British | 99 | 5.0 | 2.0 | 0.0 | -0.381797 | -2.27 |
3 | 135231210.0 | Male | 70.0 | 6 | Asian or Asian British | GCSE | 3.0 | 4.0 | 0.0 | 99.000000 | 99.00 |
4 | 135233210.0 | Female | 64.0 | 5 | White | Other | 5.0 | 5.0 | 0.0 | 0.613168 | -0.84 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9295 | 147638210.0 | Male | 43.0 | 3 | White | University | 1.0 | 99.0 | 0.0 | 1.029429 | -0.31 |
9296 | 147639090.0 | Male | 70.0 | 6 | White | 99 | 5.0 | 99.0 | 0.0 | -1.051876 | 0.45 |
9297 | 147639130.0 | Female | 80.0 | 7 | White | 99 | 5.0 | 99.0 | 0.0 | -0.808211 | -0.27 |
9298 | 147639250.0 | Male | 86.0 | 7 | White | University | 1.0 | 99.0 | 0.0 | -1.711802 | 0.56 |
9299 | 147639290.0 | Male | 70.0 | 6 | White | 99 | 2.0 | 99.0 | 0.0 | 0.115685 | 1.22 |
9300 rows × 11 columns
Variables in the dataset#
Information about the respondent and their neighbourhood:
ID a unique number for each participant
Sex
Age in years
Age Group ages in 10-year groups
Ethnic Group the categories given are the ones recorded in the original survey
Education highest level of education completed; modern British qualifications are used as a short hand for any equivalent, for example ‘A-Levels’ includes any equivalent of completing high school to age 18.
SES socio-economic status
Managerial and professional occs
Intermediate occs
Small employers and own account workers
Lower supervisory and technical occupations
Semi-routine and routine occupations
Never worked and long term unemployed
Full-time students
Not classified
Deprivation Index this is a neighbourhood-level measure of poverty, in quintiles
1 is the most deprived (poorest) 20% of neighbourhoods
5 is the least deprived (wealthiest) 20%
Victim has the respondent been a victim of crime in the last 12 months?
The following variables are constructed variables summarizing the respondent’s attitudes on the following points:
confpolice how confident are you in the policing of your neighbourhood?
antisoc how much antisocial behaviour is there in your neighbourhood?
Each variable actually reflects a combination of the respondent’s answers to several questions. For example, antisoc is based on several questions asking about different antisocial behaviours: ‘is there vandalism in your neighbourhood’, ‘are there gangs present in your neighbourhood’, ‘is there a fly tipping problem in your neighbourhood’, etc.
3.10.2. Check for bad values#
Are there any outliers or filler values (such as 9999) in the dataset?
Check, and deal with them appropriately.
I would suggest starting with df.describe()
# Your code here - you may need multiple code blocks in this section
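As a rough illustration of the check described above, here is a minimal sketch on a small made-up dataframe (`toy` and its values are hypothetical, not the real `crime` data). It assumes the filler value is 99 and that it should only be replaced in columns where 99 cannot be a genuine value; a column such as Age could in principle contain a real 99, so it is left alone here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a few numeric columns of `crime`
toy = pd.DataFrame({
    'Age': [45.0, 28.0, 58.0, 70.0],
    'DeprivationIndex': [3.0, 4.0, 99.0, 99.0],
    'confpolice': [-2.29, 0.35, 99.0, 0.61],
})

# A maximum of 99 on a 1-5 scale, or on an attitude score, is a red flag
print(toy.describe().loc[['min', 'max']])

# Replace the filler with NaN, but only in columns where 99 cannot be real
# (the choice of columns is an assumption made for this sketch)
filler_cols = ['DeprivationIndex', 'confpolice']
toy[filler_cols] = toy[filler_cols].replace(99, np.nan)

# Count the missing values that result
print(toy.isna().sum())
```

Once the fillers are NaN, pandas methods such as `mean()` and `corr()` skip them by default.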
Have we got rid of all the 99’s now?#
We actually haven’t. When we checked the dataframe using df.describe(), we only checked the numerical variables, but some of the variables coded as strings also have missing values coded as 99.
Have a look back at the dataframe and see if you can spot (and remove) them…
Let’s check what string variables were in the dataframe:
# Display the dataframe again to remind yourself what string variables there are
Now we check whether there are any ‘99’s using df.column.unique():
# eg print(crime.Sex.unique())
Now we replace the 99’s.
Hint: the 99’s in these string variables are the string ‘99’, not the number 99!
# Your code here
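The difference between the string ‘99’ and the number 99 can be sketched on a made-up dataframe (the `toy` values below are hypothetical). Replacing the number 99 would leave a string column untouched, so the string has to be matched explicitly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a filler value stored as a string
toy = pd.DataFrame({
    'Sex': ['Male', 'Female', 'Male'],
    'Education': ['University', '99', 'GCSE'],
})

# unique() reveals the filler among the genuine categories
print(toy['Education'].unique())

# Match the string '99', not the number 99
toy['Education'] = toy['Education'].replace('99', np.nan)
print(toy['Education'].unique())
```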
3.10.3. Explore the distribution of the variables#
Now you will explore each variable individually by making suitable graphs. Complete each code block to produce a suitable plot or descriptive statistic. There are no right answers but in each case you should look at what you produced and evaluate whether you learned something from it!
# Are there more men or women in the sample?
# What ages were included in the survey and what is the distribution of respondents' ages?
# What are the bins used for the variable AgeGroup?
# hint: the `hue` argument of sns.histplot can help you here
# How many respondents came from each ethnic group?
# Hint: you may need to resize the figure so the x-axis labels are readable
# plt.figure(figsize=(12,2))
# What proportion of respondents have been a victim of crime in the last 12 months?
# Hint: `Victim` is coded as 1 or 0
# For each of the constructed variables (confpolice and antisoc) plot the distribution
NOTE - We note that these variables take positive and negative values
For confidence in the police there is a strong spike around zero (many people have a neutral attitude - or are they disengaged or have no experience with the police? - later we will break this down by victims/non victims) - there is also a strong positive tail
For perceptions of ‘antisocial behaviour in my neighbourhood’ there is a strong positive tail - could it depend on neighbourhood characteristics?
# Is there a correlation between people's experience of antisocial behaviour, and their confidence in the police?
crime.confpolice.corr(crime.antisoc)
0.1394095893261098
# For the attitude variables (confpolice and antisoc) what is the mean and standard deviation?
# Can you guess how these attitude variables ended up with that mean and standard deviation (think back to the section on standardizing data)?
crime.agg({'confpolice':['mean', 'std'], 'antisoc':['mean', 'std']})
 | confpolice | antisoc |
---|---|---|
mean | 6.589355 | 6.643392 |
std | 24.696669 | 24.788734 |
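Once the 99 filler values have been removed, the mean is approximately 0 and the sd approximately 1 for each - they have been Z-scored. The much larger values in the output shown here appear to come from filler values that were still in the data when it was run.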
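To make the idea of Z-scoring concrete, here is a minimal sketch on invented numbers (`raw` is hypothetical, not survey data). It shows that a Z-scored variable has mean 0 and sd 1, and that even a single leftover 99 filler value is enough to distort both.

```python
import numpy as np
import pandas as pd

# Hypothetical raw scores
raw = pd.Series([2.0, 4.0, 4.0, 5.0, 7.0, 8.0])

# Z-score: subtract the mean, then divide by the standard deviation
z = (raw - raw.mean()) / raw.std()
print(z.mean(), z.std())

# One leftover 99 filler pulls the mean and sd far away from 0 and 1
with_filler = pd.concat([z, pd.Series([99.0])], ignore_index=True)
print(with_filler.mean(), with_filler.std())

# Replacing the filler with NaN restores them, because pandas skips NaN
cleaned = with_filler.replace(99, np.nan)
print(cleaned.mean(), cleaned.std())
```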
# Which variables have a lot of missing data?
# HINT use df.isna() and sum()
crime.isna().sum()
ID 0
Sex 0
Age 0
AgeGroup 0
EthnicGroup 0
Education 0
SES 0
DeprivationIndex 0
Victim 0
confpolice 0
antisoc 0
dtype: int64
3.10.4. Who is most likely to be a victim of crime?#
Explore which demographic variables make a difference to the chance of being a victim of crime. Are more men than women victims of crime? etc
HINT: as Victim is coded as 1 (if they have been a victim of crime in the past 12 months) and 0 (otherwise), you can obtain the proportion of people who have been a victim by taking the mean value of the column Victim.
You can also use sns.barplot() with the x and hue arguments to plot the proportion who are victims of crime within each category (each age group, etc).
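The hint about taking the mean of a 0/1 column can be sketched as follows, using a small invented dataframe (the `toy` values are hypothetical and say nothing about the real survey).

```python
import pandas as pd

# Hypothetical toy data: Victim is 1 for a victim of crime, 0 otherwise
toy = pd.DataFrame({
    'Sex': ['Male', 'Male', 'Female', 'Female', 'Female', 'Male'],
    'Victim': [1, 0, 0, 0, 1, 1],
})

# The mean of a 0/1 column is the proportion of 1s
overall = toy['Victim'].mean()

# The same idea within each category
by_sex = toy.groupby('Sex')['Victim'].mean()

print(overall)
print(by_sex)
```

With the real data, `sns.barplot(data=crime, x='Sex', y='Victim')` should plot the same group means, since the bar height is the mean by default.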
# You will add several code blocks here to explore the data
3.10.5. Disaggregate the data#
In the lecture we heard that a pattern that holds in one group of respondents may not hold for another group. We can check this by disaggregating the data.
Are students victims of crime because they live in deprived neighbourhoods?#
Try plotting the proportion of people who are victims of crime broken down by deprivation index (as above), but further broken down by SES.
HINT: Use the hue argument to get clusters of bars for each level of SES
# Your code here
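Before plotting, it can help to look at the same breakdown as a table. The sketch below uses invented data, and the SES labels are hypothetical placeholders (the real column uses numeric codes); it computes the proportion of victims for every combination of the two variables.

```python
import pandas as pd

# Hypothetical toy data; the SES labels are placeholders for the real codes
toy = pd.DataFrame({
    'DeprivationIndex': [1, 1, 1, 1, 5, 5, 5, 5],
    'SES': ['Student', 'Student', 'Managerial', 'Managerial',
            'Student', 'Student', 'Managerial', 'Managerial'],
    'Victim': [1, 1, 1, 0, 1, 0, 0, 0],
})

# Rows: deprivation index; columns: SES; cells: proportion who are victims
table = toy.pivot_table(index='DeprivationIndex', columns='SES',
                        values='Victim', aggfunc='mean')
print(table)
```

Each cell of this table corresponds to one bar in a plot such as `sns.barplot(data=crime, x='DeprivationIndex', y='Victim', hue='SES')`.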
Do the patterns hold for all ethnicities?#
The proportion of people who are victims of crime falls with age overall.
But is this true for all ethnicities?
The survey respondents were overwhelmingly white, so even if there was a quite different pattern of victimization in other ethnic groups, when we simply average everyone together this effect would be ‘swamped’.
Try plotting the proportion of people who are victims, broken down by AgeGroup, and further broken down by EthnicGroup
# Your code here
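The ‘swamping’ effect can be illustrated with a deliberately artificial example (the groups ‘A’ and ‘B’ and all values below are invented). The overall trend follows the large group, while the small group shows the opposite pattern; reporting the count alongside the mean also shows how little data the small group's estimate rests on.

```python
import pandas as pd

# Hypothetical toy data: group 'A' is four times larger than group 'B'
toy = pd.DataFrame({
    'AgeGroup': [1] * 5 + [2] * 5,
    'EthnicGroup': ['A', 'A', 'A', 'A', 'B'] * 2,
    'Victim': [1, 1, 1, 0, 0,
               0, 0, 0, 0, 1],
})

# Averaging everyone together: the proportion falls with age group
overall = toy.groupby('AgeGroup')['Victim'].mean()

# Disaggregated: the proportion and the number of respondents in each cell
by_group = toy.groupby(['EthnicGroup', 'AgeGroup'])['Victim'].agg(['mean', 'count'])

print(overall)
print(by_group)
```

In this made-up example the pattern in group ‘B’ runs the other way, but each of its cells contains a single respondent, so it would be unwise to read much into it.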
3.10.6. Do attitudes differ depending on demographics?#
Looking at the attitude variables (confpolice and antisoc), which demographic factors seem to influence these?
I found it most helpful to make barplots for each attitude variable, broken down by demographic factors (such as deprivation index or age group).
Go ahead and explore the breakdown of confpolice and antisoc by AgeGroup, EthnicGroup and DeprivationIndex
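A numerical counterpart to the barplots is a table of group means with their standard errors. This is a minimal sketch on invented values (the `toy` numbers are hypothetical and do not reflect the survey results).

```python
import pandas as pd

# Hypothetical toy data for two age groups
toy = pd.DataFrame({
    'AgeGroup': [1, 1, 1, 2, 2, 2],
    'confpolice': [-1.0, 0.0, 1.0, 0.0, 0.5, 1.0],
    'antisoc': [0.2, 0.4, 0.6, -0.3, 0.0, 0.3],
})

# Mean and standard error of each attitude variable within each age group
summary = toy.groupby('AgeGroup')[['confpolice', 'antisoc']].agg(['mean', 'sem'])
print(summary)
```

The standard error gives a sense of how much a group mean could vary by chance, which matters most for the smaller groups.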
Age#
# Your code here
# Hint: use plt.subplot to create pairs of plots - this keeps things tidy
3.10.7. Deprivation Index#
# Your code here
# Hint: use plt.subplot to create pairs of plots - this keeps things tidy
3.10.8. Ethnic Group#
# Your code here
# Hint: use plt.subplot to create pairs of plots - this keeps things tidy
3.10.9. Lurking Variables?#
Earlier we noticed that people who reported higher levels of antisocial behaviour in their neighbourhood also had higher confidence in the police, which is somewhat surprising.
Looking at how confpolice and antisoc vary by AgeGroup, can you think of an explanation for why this correlation might be spurious?
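The general idea of a lurking variable can be sketched with simulated data (everything below is synthetic; `group`, `x` and `y` are hypothetical names, not columns of the survey). Here `x` and `y` have no direct relationship, but both depend on `group`, which is enough to produce a correlation when all the data are pooled.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 30000

# A lurking variable with three levels
group = rng.integers(0, 3, size=n)

# x and y each depend on the group, plus independent noise
x = group + rng.normal(size=n)
y = group + rng.normal(size=n)
sim = pd.DataFrame({'group': group, 'x': x, 'y': y})

# Pooled over all groups, x and y are clearly correlated
overall_r = sim['x'].corr(sim['y'])

# Within each group, the correlation is close to zero
within_r = pd.Series({g: d['x'].corr(d['y']) for g, d in sim.groupby('group')})

print(overall_r)
print(within_r)
```

Comparing the pooled correlation with the within-group correlations is one way to check whether a third variable might account for an association.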
3.10.10. Conclusions#
Now think about how you might summarize what you have learned from the dataset:
who is most likely to be a victim of crime?
who is most likely to report high levels of antisocial behaviour in their neighbourhood?
what factors predict confidence in the police and experience of antisocial behaviour?
what are the limitations of the survey? Are the experiences of all groups equally well documented?