3.10. Tutorial Exercises#

These tutorial exercises are designed to help you prepare for the first assignment.

As a researcher, there are two distinct phases to data analysis:

  • Understanding the dataset yourself - this involves making lots of quick plots and descriptive statistics to

    • check for outliers

    • find out the data distributions

    • look for differences between categories

    • look for associations between variables

  • Preparing a report for a reader - this involves a focus on readability and the reader

    • explain any key features of the dataset

    • highlight key results with descriptive statistics and figures

    • figures should be well labelled and tweaked to make your point as clearly as possible

    • there should be clear, readable explanatory text

    • for most readers/clients, non-technical language should be used

    • in all cases, jargon should be avoided

In these tutorial exercises, you will complete some guided tasks (and some open-ended ones) to explore the dataset for yourself.

For the hand-in assignment, you will produce a report on the same dataset for a specified reader.

3.10.1. Crime Survey Data#

We will work with a dataset extracted from the Crime Survey England and Wales 2013.

I obtained the data from the UK Data Service, a data repository run by the UK Research Councils. This text is from their introduction to the dataset:

The Crime Survey for England and Wales (CSEW) is a face-to-face victimisation survey in which people resident in households in England and Wales are asked about their experiences of a range of crimes in the 12 months prior to the interview. Respondents to the survey are also asked about their perceptions of crime and attitudes towards crime related issues such as the police and criminal justice system.

The dataset I have given you contains only some of the questions that respondents were asked, covering the respondents’ individual demographic features, neighbourhood, perceptions of crime and confidence in the police.

The brief for the hand-in report will be to write a short report for the Home Secretary addressing two topics:

  1. Which groups are the most likely to be victims of crime? and

  2. What factors affect confidence in policing?

Note that the idea is to write for a generic Home Secretary - they have responsibility for Law and Order and, as a politician, are interested in how different sections of the public perceive these issues. You can assume they have no statistical training. However, there is no need to accommodate the political attitudes or personal characteristics of any particular Home Secretary.

In these preparatory exercises you will play around with the data to try to work out which factors are important predictors of being a victim of crime and of confidence in the police.

I have put my own conclusions at the bottom of this page - this is just to give an idea of the kinds of things you might look at.

Note#

The survey was conducted in 2013 in the UK. Events of recent years may have affected the confidence of certain groups in the police; this would not be reflected in the data used here.

Set up Python libraries#

# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import seaborn as sns
sns.set_theme(style='white')
import statsmodels.api as sm
import statsmodels.formula.api as smf
import warnings 
warnings.simplefilter('ignore', category=FutureWarning)

Import the data#

Download the dataset from Canvas and import it as a dataframe called crime

# WARNING! This will only work once you:
# download the datafile and 
# put it in the right place on your computer!
crime  = pd.read_csv('../data/CrimeData_2013.csv')
crime
               ID     Sex   Age  AgeGroup             EthnicGroup   Education  SES  DeprivationIndex  Victim  confpolice  antisoc
0     135230170.0    Male  45.0         4                   White  University  2.0               3.0     0.0   -2.290506    -3.42
1     135230210.0    Male  28.0         2                   White  University  1.0               4.0     0.0    0.349198     0.52
2     135231010.0  Female  58.0         5  Black or Black British          99  5.0               2.0     0.0   -0.381797    -2.27
3     135231210.0    Male  70.0         6  Asian or Asian British        GCSE  3.0               4.0     0.0   99.000000    99.00
4     135233210.0  Female  64.0         5                   White       Other  5.0               5.0     0.0    0.613168    -0.84
...           ...     ...   ...       ...                     ...         ...  ...               ...     ...         ...      ...
9295  147638210.0    Male  43.0         3                   White  University  1.0              99.0     0.0    1.029429    -0.31
9296  147639090.0    Male  70.0         6                   White          99  5.0              99.0     0.0   -1.051876     0.45
9297  147639130.0  Female  80.0         7                   White          99  5.0              99.0     0.0   -0.808211    -0.27
9298  147639250.0    Male  86.0         7                   White  University  1.0              99.0     0.0   -1.711802     0.56
9299  147639290.0    Male  70.0         6                   White          99  2.0              99.0     0.0    0.115685     1.22

9300 rows × 11 columns

Variables in the dataset#

Information about the respondent and their neighbourhood:

  • ID a unique number for each participant

  • Sex

  • Age in years

  • Age Group ages in 10-year groups

  • Ethnic Group the categories given are the ones recorded in the original survey

  • Education highest level of education completed; modern British qualifications are used as a short hand for any equivalent, for example ‘A-Levels’ includes any equivalent of completing high school to age 18.

  • SES socio-economic status

      1. Managerial and professional occs

      2. Intermediate occs

      3. Small employers and own account workers

      4. Lower supervisory and technical occupations

      5. Semi-routine and routine occupations

      6. Never worked and long term unemployed

      7. Full-time students

      8. Not classified

  • Deprivation Index this is a neighbourhood-level measure of poverty, in quintiles

    • 1 is the most deprived (poorest) 20% of neighbourhoods

    • 5 is the least deprived (wealthiest) 20%

  • Victim has the respondent been a victim of crime in the last 12 months?

The remaining variables are constructed variables summarizing the respondent’s attitudes on the following points:

  • confpolice how confident are you in the policing of your neighbourhood?

  • antisoc how much antisocial behaviour is there in your neighbourhood?

Each variable actually reflects a combination of the respondent’s answers to several questions; for example, antisoc is based on several questions asking about different antisocial behaviours (‘is there vandalism in your neighbourhood’, ‘are there gangs present in your neighbourhood’, ‘is there a fly tipping problem in your neighbourhood’, etc.).

3.10.2. Check for bad values#

Are there any outliers or filler values (such as 9999) in the dataset?

  • check, and deal with them appropriately

I would suggest starting with df.describe()

# Your code here - you may need multiple code blocks in this section
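As a rough sketch of what this check might look like, the example below uses a small made-up dataframe (`toy`, with invented values, not the real survey data) in which 99 stands in for a missing value. `describe()` exposes the problem because the maximum is far outside the plausible range, and `replace` then converts the filler to NaN. Which columns of `crime` need this treatment is for you to work out.

```python
import numpy as np
import pandas as pd

# Toy data with invented values - 99 is used as a filler meaning 'missing'
toy = pd.DataFrame({
    'SES':        [2.0, 1.0, 5.0, 99.0, 3.0],
    'confpolice': [-0.5, 0.3, 99.0, 1.2, -1.0],
})

# A max of 99 in a column that should only take small values is a red flag
print(toy.describe())

# Replace the filler with NaN so that mean(), std() and plots ignore it
toy_clean = toy.replace(99, np.nan)
print(toy_clean.describe())
```

After the replacement, the `count` row of `describe()` drops for the affected columns, which is a quick way to see how many values were treated as missing.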

Have we got rid of all the 99’s now?#

We actually haven’t. When we checked the dataframe using df.describe(), we only checked the numerical variables, but some of the variables coded as strings also have missing values coded as 99.

Have a look back at the dataframe and see if you can spot (and remove) them…

Let’s check what string variables were in the dataframe:

# Display the dataframe again to remind yourself what string variables there are

Now we check whether there are any ‘99’s using df.column.unique():

# eg print(crime.Sex.unique())

Now we replace the 99’s.

Hint: the 99’s in these string variables are the string ‘99’, not the number 99!

# Your code here
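The idea can be illustrated on a toy column (invented values, not the survey data): `unique()` lists the categories so that a stray ‘99’ stands out, and replacing the string `'99'` (rather than the number 99) turns it into a missing value. This is only one possible approach.

```python
import numpy as np
import pandas as pd

# Toy string column with invented values - missing data is coded as the string '99'
toy = pd.DataFrame({'Education': ['University', '99', 'GCSE', '99', 'Other']})

# List the distinct values - '99' appears alongside the real categories
print(toy['Education'].unique())

# Count how many entries are the string '99'
n_filler = (toy['Education'] == '99').sum()
print(n_filler)

# Replace the string '99' with NaN and check the categories again
toy['Education'] = toy['Education'].replace('99', np.nan)
print(toy['Education'].unique())
```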

3.10.3. Explore the distribution of the variables#

Now you will explore each variable individually by making suitable graphs. Complete each code block to produce a suitable plot or descriptive statistic. There are no right answers, but in each case you should look at what you produced and evaluate whether you learned something from it!

# Are there more men or women in the sample?

# What ages were included in the survey and what is the distribution of respondents' ages?

# What are the bins used for the variable AgeGroup?
# hint: the `hue` argument of sns.histplot can help you here

# How many respondents came from each ethnic group?
# Hint: you may need to resize the figure so the x-axis labels are readable
# plt.figure(figsize=(12,2))

# What proportion of respondents have been a victim of crime in the last 12 months?
# Hint: `Victim` is coded as 1 or 0

# For each of the constructed variables (confpolice and antisoc) plot the distribution

NOTE - these variables take both positive and negative values

  • For confidence in the police there is a strong spike around zero (many people have a neutral attitude - or are they disengaged or have no experience with the police? - later we will break this down by victims/non victims) - there is also a strong positive tail

  • For perceptions of ‘antisocial behaviour in my neighbourhood’ there is a strong positive tail - could it depend on neighbourhood characteristics?

# Is there a correlation between people's experience of antisocial behaviour, and their confidence in the police?
crime.confpolice.corr(crime.antisoc)
0.1394095893261098
# For the attitude variables (confpolice and antisoc) what is the mean and standard deviation?
# Can you guess how these attitude variables ended up with that mean and standard deviation (think back to the section on standardizing data)?
crime.agg({'confpolice':['mean', 'std'], 'antisoc':['mean', 'std']})
      confpolice    antisoc
mean    6.589355   6.643392
std    24.696669  24.788734

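Once the 99 filler values have been removed, the mean is approximately 0 and the sd approximately 1 for each - they have been Z-scored. Means near 6.6 and sds near 24.7, as in the output above, indicate that the 99s are still in the data.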
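To see why Z-scored data should have mean 0 and sd 1, and how a handful of 99s distorts both, here is a small simulation. The scores are randomly generated (they are not survey data), and the number of contaminated values is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Made-up raw scores (simulated, not survey data)
rng = np.random.default_rng(0)
raw = pd.Series(rng.normal(loc=50, scale=10, size=1000))

# Z-score: subtract the mean, then divide by the standard deviation
z = (raw - raw.mean()) / raw.std()
print(z.mean(), z.std())          # mean is (numerically) 0, sd is 1

# Overwrite an arbitrary 7% of values with the filler 99
z_bad = z.copy()
z_bad.iloc[:70] = 99
print(z_bad.mean(), z_bad.std())  # both are now badly inflated

# Replacing the filler with NaN restores sensible statistics
z_fixed = z_bad.replace(99, np.nan)
print(z_fixed.mean(), z_fixed.std())
```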

# Which variables have a lot of missing data?
# HINT use df.isna() and sum()
crime.isna().sum()
ID                  0
Sex                 0
Age                 0
AgeGroup            0
EthnicGroup         0
Education           0
SES                 0
DeprivationIndex    0
Victim              0
confpolice          0
antisoc             0
dtype: int64

3.10.4. Who is most likely to be a victim of crime?#

Explore which demographic variables make a difference to the chance of being a victim of crime. Are more men than women victims of crime? etc

HINT as Victim is coded as 1 (if they have been a victim of crime in the past 12 months) and 0 (otherwise), you can obtain the proportion of people who have been a victim by taking the mean value of the column Victim.

You can also use sns.barplot() with the x and hue arguments to plot the proportion who are victims of crime within each category (each age group, etc).
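The hint about taking the mean of a 0/1 column can be sketched on a toy dataframe (invented values, not the survey data). Combining it with `groupby` gives the proportion within each category, which is the same quantity a bar plot of `Victim` by category would show.

```python
import pandas as pd

# Toy data with invented values - Victim is coded 1 (victim) or 0 (not a victim)
toy = pd.DataFrame({
    'Sex':    ['Male', 'Male', 'Female', 'Female', 'Male', 'Female'],
    'Victim': [1, 0, 0, 0, 1, 1],
})

# The mean of a 0/1 column is the proportion of 1s
overall = toy['Victim'].mean()
print(overall)

# Proportion of victims within each category of Sex
by_sex = toy.groupby('Sex')['Victim'].mean()
print(by_sex)
```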

# You will add several code blocks here to explore the data

3.10.5. Disaggregate the data#

In the lecture we heard that a pattern that holds in one group of respondents may not hold for another group. We can check this by disaggregating the data.

Are students victims of crime because they live in deprived neighbourhoods?#

Try plotting the proportion of people who are victims of crime broken down by deprivation index (as above), but further broken down by SES.

HINT: Use the hue argument to get clusters of bars for each level of SES

# Your code here
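Alongside the plot, it can help to look at the same two-way breakdown as a table. The sketch below uses a toy dataframe with invented values; `pivot_table` is one of several ways to get the proportion of victims in each combination of two categories.

```python
import pandas as pd

# Toy data with invented values
toy = pd.DataFrame({
    'DeprivationIndex': [1, 1, 1, 1, 5, 5, 5, 5],
    'SES':              [1, 1, 7, 7, 1, 1, 7, 7],
    'Victim':           [0, 1, 1, 1, 0, 0, 0, 1],
})

# Rows: deprivation quintile, columns: SES category, cells: proportion who are victims
prop = toy.pivot_table(index='DeprivationIndex', columns='SES',
                       values='Victim', aggfunc='mean')
print(prop)
```

Reading across a row compares SES categories within one deprivation quintile; reading down a column compares quintiles within one SES category.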

Do the patterns hold for all ethnicities?#

The proportion of people who are victims of crime falls with age overall.

But is this true for all ethnicities?

The survey respondents were overwhelmingly white, so even if there were a quite different pattern of victimization in other ethnic groups, it would be ‘swamped’ when we simply average everyone together.

Try plotting the proportion of people who are victims, broken down by AgeGroup, and further broken down by EthnicGroup

# Your code here
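When one group dominates the sample, some combinations of categories contain very few people, and a proportion based on a handful of respondents is unreliable. The sketch below (toy data with invented values; the column names `proportion` and `n` are my own labels) reports the group size next to each proportion so that small cells are easy to spot.

```python
import pandas as pd

# Toy data with invented values - one ethnic group is much larger than the other
toy = pd.DataFrame({
    'AgeGroup':    [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'EthnicGroup': ['White'] * 4 + ['Mixed'] + ['White'] * 4 + ['Mixed'],
    'Victim':      [1, 0, 0, 0, 1, 0, 0, 1, 0, 0],
})

# Proportion of victims and number of respondents in each cell
summary = (toy.groupby(['AgeGroup', 'EthnicGroup'])['Victim']
              .agg(proportion='mean', n='count'))
print(summary)
```

In this toy example the ‘Mixed’ cells swing between 0 and 1 purely because each contains a single respondent.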

3.10.6. Do attitudes differ depending on demographics?#

Looking at the attitude variables (confpolice and antisoc), which demographic factors seem to influence these?

I found it most helpful to make barplots for each attitude variable, broken down by demographic factors (such as deprivation index or age group).

Go ahead and explore the breakdown of confpolice and antisoc by AgeGroup, EthnicGroup and DeprivationIndex

Age#

# Your code here
# Hint: use plt.subplot to create pairs of plots - this keeps things tidy
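As an illustration of the `plt.subplot` hint, the sketch below draws a pair of bar charts side by side from a toy dataframe (invented values). It uses plain matplotlib bars on group means as a stand-in for `sns.barplot`, so unlike seaborn it shows no error bars.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data with invented values
toy = pd.DataFrame({
    'AgeGroup':   [1, 1, 2, 2, 3, 3],
    'confpolice': [-0.2, 0.0, 0.1, 0.3, 0.4, 0.6],
    'antisoc':    [0.5, 0.3, 0.1, -0.1, -0.2, -0.4],
})

# Mean of each attitude variable within each age group
means = toy.groupby('AgeGroup')[['confpolice', 'antisoc']].mean()

# One row, two columns of plots - one panel per attitude variable
fig = plt.figure(figsize=(8, 3))
for i, col in enumerate(['confpolice', 'antisoc']):
    plt.subplot(1, 2, i + 1)
    plt.bar(means.index.astype(str), means[col])
    plt.xlabel('AgeGroup')
    plt.ylabel('mean ' + col)
plt.tight_layout()
```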

3.10.7. Deprivation Index#

# Your code here
# Hint: use plt.subplot to create pairs of plots - this keeps things tidy

3.10.8. Ethnic Group#

# Your code here
# Hint: use plt.subplot to create pairs of plots - this keeps things tidy

3.10.9. Lurking Variables?#

Earlier we noticed that people who reported higher levels of antisocial behaviour in their neighbourhood also had higher confidence in the police, which is somewhat surprising.

Looking at how confpolice and antisoc vary by AgeGroup, can you think of a spurious explanation for this correlation?
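A simulation can show how a lurking variable produces a correlation between two measures that have no direct link. Everything below is simulated: `group`, `a` and `b` are hypothetical names, and the effect sizes are arbitrary. Both `a` and `b` depend on `group` but not on each other.

```python
import numpy as np
import pandas as pd

# Simulated data - 'group' is a hypothetical lurking variable with 7 levels
rng = np.random.default_rng(1)
n = 5000
group = rng.integers(1, 8, size=n)

# a and b each depend on group, but are otherwise independent of each other
sim = pd.DataFrame({
    'group': group,
    'a': 0.3 * group + rng.normal(size=n),
    'b': 0.3 * group + rng.normal(size=n),
})

# Correlation across the whole sample
overall_r = sim['a'].corr(sim['b'])
print(overall_r)

# Correlation within each level of the lurking variable
within_r = pd.Series({k: g['a'].corr(g['b']) for k, g in sim.groupby('group')})
print(within_r)
```

The overall correlation is clearly positive, whereas the correlations within each level of `group` are close to zero. Checking within-group correlations in the same way is one option for the real data.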

3.10.10. Conclusions#

Now think about how you might summarize what you have learned from the dataset

  • who is most likely to be a victim of crime?

  • who is most likely to report high levels of antisocial behaviour in their neighbourhood?

  • what factors predict confidence in the police and experience of antisocial behaviour?

  • what are the limitations of the survey? Are the experiences of all groups equally well documented?