Movie data

8.1. Movie data#

The file ‘imdb.csv’ contains data about 6178 movies and TV shows, scraped from the website IMDb.com (I downloaded the data from Kaggle.com).

Your task is to investigate what makes a good movie and write up your claims in a blog-style report, “10 ingredients for a good movie”. You don’t have to have exactly 10 (try for at least 5!).

Our metric of a ‘good movie’ shall be the IMDb rating (based on users’ votes). You may like to comment on whether this is a reliable metric.

This is a blog for geeky movie enthusiasts, so you can support your claims with some numbers (correlations, descriptive statistics and the like). Try to avoid jargon, though, and use the data to support the points you are making in the text rather than robotically reporting statistics.

Below I will suggest some avenues to explore but I encourage you to investigate the data yourself - play around with plotting the relationship between different variables before you settle down to write your report.

What makes a good report?#

Write a (very) short introduction explaining what the data are and what you are using as a measure of movie ‘quality’

It’s up to you to find some interesting features in the data and show them effectively using plots

You should support your claims using descriptive statistics such as the mean rating for different groups of movies or correlation coefficients

No need to use this notebook for your report - you can start a new ‘clean’ one without my notes in.

You might even like to start two notebooks: one for rough work (experimenting with plots and saving any analyses you want to discuss with your tutor) and one for your final report.
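As a concrete illustration of the descriptive statistics mentioned in this list, here is a minimal sketch of getting the mean rating for different groups of movies. The little table `toy` is invented for illustration (it is not taken from imdb.csv); with the real data you would use `movies` and whichever grouping column interests you.

```python
import pandas as pd

# Toy data, invented for illustration (not taken from imdb.csv)
toy = pd.DataFrame({
    'Certificate': ['R', 'R', 'PG', 'PG', 'PG-13'],
    'Rating': [6.0, 7.0, 7.5, 6.5, 8.0],
})

# Mean rating and number of movies in each certificate group
summary = toy.groupby('Certificate')['Rating'].agg(['mean', 'count'])
print(summary.round(2))
```

Reporting the count alongside the mean is worth the extra column: a group mean based on two or three movies should not carry much weight in your blog.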

Set up Python libraries#

As usual, run the code cell below to import the relevant Python libraries

# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas 
import seaborn as sns
sns.set_theme()
sns.set_style('whitegrid')

Import the data#

imdb=pandas.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook/main/data/imdb.csv')

# get only the movies (ignoring TV shows)
movies=imdb[imdb['Type']=='Film']
movies

	Name	Date	Rating	Votes	Genre	Duration	Type	Certificate	Episodes	Nudity	Violence	Profanity	Alcohol	Frightening
0	No Time to Die	2021	7.6	107163.0	Action, Adventure, Thriller	163.0	Film	PG-13	NaN	Mild	Moderate	Mild	Mild	Moderate
1	The Guilty	2021	6.3	64375.0	Crime, Drama, Thriller	90.0	Film	R	NaN	NaN	NaN	Severe	NaN	Moderate
2	The Many Saints of Newark	2021	6.4	27145.0	Crime, Drama	120.0	Film	R	NaN	Moderate	Severe	Severe	Moderate	Moderate
3	Venom: Let There Be Carnage	2021	6.4	30443.0	Action, Adventure, Sci-Fi	97.0	Film	PG-13	NaN	NaN	Moderate	Moderate	Mild	Moderate
4	Dune	2021	8.3	84636.0	Action, Adventure, Drama	155.0	Film	PG-13	NaN	NaN	Moderate	NaN	Mild	Moderate
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
6173	The Human Centipede II (Full Sequence)	2011	3.8	37492.0	Horror	91.0	Film	Not Rated	NaN	Severe	Severe	Severe	Mild	Severe
6174	Double Indemnity	1944	8.3	150448.0	Crime, Drama, Film-Noir	107.0	Film	Passed	NaN	NaN	Mild	NaN	Mild	Mild
6175	Before the Devil Knows You're Dead	2007	7.3	100668.0	Crime, Drama, Thriller	117.0	Film	R	NaN	Severe	Moderate	Severe	Severe	Severe
6176	Queen Bees	2021	6.0	887.0	Comedy, Drama, Romance	100.0	Film	PG-13	NaN	NaN	NaN	Mild	Moderate	NaN
6177	Death Race	2008	6.3	203578.0	Action, Sci-Fi, Thriller	105.0	Film	R	NaN	Mild	Severe	Severe	Mild	Moderate

4446 rows × 14 columns

Where to start?#

Here are some myths (or truths?) for you to investigate to get you started

Movies are getting worse over time
Longer movies are better
- As well as investigating the general relationship, can you find some points of interest - for example, name cases of long movies that were also bad (certainly to be avoided!)? Are there any surprisingly long movies in the dataset? What about those that were short but sweet?
More nudity makes for a highly rated movie
Movies rated 15 are the worst!
- To do this you will need to create a new column converting the US and international certificates into their closest UK equivalent: U, PG, 12, 15 or 18
Straight-to-TV movies have lower ratings than those shown in cinemas
- The straight-to-TV movies have certificates beginning in 'TV'
Popular movies are better (you can use the number of votes as a measure of the popularity of a movie, since people should only vote if they watched it - you may like to comment on whether this is a reasonable measure of popularity or whether it may be biased)
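For the 'points of interest' suggested in the list above (long movies that were bad, short movies that were good), a minimal sketch of one way to pick them out is to combine conditions and sort. The table `toy` is invented for illustration, and the thresholds (150 and 90 minutes, ratings of 5 and 8) are my assumptions, so choose your own after looking at the real distributions.

```python
import pandas as pd

# Toy data, invented for illustration (not taken from imdb.csv)
toy = pd.DataFrame({
    'Name': ['Film A', 'Film B', 'Film C', 'Film D', 'Film E'],
    'Duration': [200.0, 95.0, 80.0, 170.0, 160.0],
    'Rating': [4.0, 7.0, 8.5, 8.0, 4.5],
})

# Long AND low-rated: combine two conditions with & (each in its own brackets)
long_and_bad = toy[(toy['Duration'] >= 150) & (toy['Rating'] < 5.0)].sort_values('Rating')

# Short but sweet
short_and_sweet = toy[(toy['Duration'] <= 90) & (toy['Rating'] >= 8.0)]

# The two longest movies, whatever their rating
longest = toy.nlargest(2, 'Duration')

print(long_and_bad)
print(short_and_sweet)
print(longest)
```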

Examples:#

Here are a couple of examples to get you started:

Example 1: Are movies getting worse over time?#

According to IMDb raters, yes they are. Or to be more accurate, the number of low-rated movies increased over time, whilst a steady flow of highly rated movies was also maintained.

sns.histplot(data=movies, x='Date', y='Rating')
plt.xlim([1920,2020])
plt.ylim([0,10])
plt.suptitle('Movie ratings over time', fontsize=16)

Text(0.5, 0.98, 'Movie ratings over time')

[Figure: 2D histogram of Rating (y-axis) against Date (x-axis)]

Rating was negatively correlated with year of release (Spearman’s r = -0.30)

movies[['Date','Rating']].corr(method='spearman')

	Date	Rating
Date	1.000000	-0.298428
Rating	-0.298428	1.000000

Thinking point -#

No need to put this in your blog, but you could test whether that correlation is likely to have arisen by chance (is it significant?). One way is to resample the data (for example, shuffling the ratings relative to the release dates) and see how often you get a correlation as far from zero as r = -0.30.
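Here is a minimal sketch of one way to do that check. Note that I have used a permutation (shuffling) test here rather than a bootstrap: shuffling the ratings breaks any real link with the date, so the shuffled correlations show what chance alone produces. The data are simulated for illustration, the function names `spearman` and `permutation_p_value` are hypothetical, and Spearman's r is computed as Pearson's r on the ranks.

```python
import numpy as np
import pandas as pd

def spearman(x, y):
    # Spearman's r is Pearson's r calculated on the ranks of the data
    return x.rank().corr(y.rank())

def permutation_p_value(df, x_col, y_col, n_reps=2000, seed=0):
    rng = np.random.default_rng(seed)
    clean = df[[x_col, y_col]].dropna()
    observed = spearman(clean[x_col], clean[y_col])
    x_ranks = clean[x_col].rank().to_numpy()
    y_ranks = clean[y_col].rank().to_numpy()
    count = 0
    for _ in range(n_reps):
        # Shuffle one variable so any true relationship is destroyed
        shuffled = rng.permutation(y_ranks)
        r = np.corrcoef(x_ranks, shuffled)[0, 1]
        if abs(r) >= abs(observed):
            count += 1
    # Proportion of shuffles at least as far from zero as the observed r
    return observed, count / n_reps

# Simulated data, invented for illustration (not the IMDb data)
rng = np.random.default_rng(0)
dates = rng.integers(1950, 2021, size=200)
ratings = 8.0 - 0.02 * (dates - 1950) + rng.normal(0, 0.5, size=200)
toy = pd.DataFrame({'Date': dates, 'Rating': ratings})

observed, p = permutation_p_value(toy, 'Date', 'Rating')
print(observed, p)
```

With the real data you would pass `movies` in place of `toy`. A small proportion means that a correlation this strong rarely turns up by chance.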

Example 2: Does swearing spoil a good movie?#

Perhaps surprisingly, yes it does. At least, movies with no swearing (Profanity) at all seem to get slightly higher ratings.

In the plot below there is a small but perceptible advantage in ratings for movies with no Profanity - the average rating for a movie with no profanity was a third of one point (out of ten) higher than one with profanity!

print('mean rating - no profanity = ' + str(movies[movies['Profanity']=='None']['Rating'].mean()))
print('mean rating - with profanity = ' + str(movies[movies['Profanity']!='None']['Rating'].mean()))

mean rating - no profanity = nan
mean rating - with profanity = 6.625366193908393
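A note on the `nan` in the output above: one likely explanation is that recent versions of pandas treat the text 'None' in a CSV file as a missing value (NaN) by default, so no row ever equals the string 'None'. Because NaN counts as 'not equal' to everything, the `!= 'None'` group then contains every movie. The sketch below shows one possible workaround on a tiny invented CSV: tell `read_csv` that only empty cells are missing. Whether that is the right choice for imdb.csv is an assumption you should check against the raw file.

```python
import io
import pandas as pd

# A tiny CSV, invented for illustration (not taken from imdb.csv)
toy_csv = io.StringIO(
    "Name,Rating,Profanity\n"
    "Film A,7.0,None\n"
    "Film B,6.0,Mild\n"
    "Film C,8.0,None\n"
    "Film D,5.0,\n"
)

# Keep the text 'None' as a real category; treat only empty cells as missing
toy = pd.read_csv(toy_csv, keep_default_na=False, na_values=[''])

no_prof = toy[toy['Profanity'] == 'None']
# Exclude missing values explicitly, otherwise they land in the 'with profanity' group
with_prof = toy[toy['Profanity'].notna() & (toy['Profanity'] != 'None')]

print('mean rating - no profanity = ' + str(no_prof['Rating'].mean()))
print('mean rating - with profanity = ' + str(with_prof['Rating'].mean()))
```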

# plot ratings by profanity level of the movie
# note that I have set the order of the violins to increasing profanity, rather than the default order which wasn't meaningful
sns.violinplot(data=movies, x='Profanity', y='Rating', order=['None','Mild','Moderate','Severe'])

<Axes: xlabel='Profanity', ylabel='Rating'>

[Figure: violin plots of Rating for each Profanity level, ordered None, Mild, Moderate, Severe]

That said, there was an explosion in profanity from the 1960s onwards - so maybe this just reflects the fact that the old classics are the best movies!

sns.set_palette('rocket')

sns.kdeplot(
    data=movies, 
    x='Date',  
    hue='Profanity', 
    multiple='stack',
    hue_order=['None','Mild','Moderate','Severe'])

plt.suptitle('Distribution of profanity ratings over time', fontsize=16)

Text(0.5, 0.98, 'Distribution of profanity ratings over time')

[Figure: stacked density plot of release Date, coloured by Profanity level]

By the way….#

Don’t put this in your blog as it is too technical, but you might be interested to see it:

It looks like there is only a tiny effect of Profanity on ratings, but could it have occurred due to chance?

We construct a bootstrapped confidence interval for the difference in ratings between movies with and without profanity:

nReps = 10000
mDiff=np.empty(nReps)
n1=len(movies[movies['Profanity']=='None'])
n2=len(movies[movies['Profanity']!='None'])

print(n1)
print(n2)

for i in range(nReps):
    sample1 = movies[movies['Profanity']=='None'].sample(n1, replace=True)
    sample2 = movies[movies['Profanity']!='None'].sample(n2, replace=True)
    mDiff[i]= sample1['Rating'].mean()-sample2['Rating'].mean()
    
sns.histplot(mDiff)
plt.xlabel('Difference of means: no profanity- profanity')

0
4446

Text(0.5, 0, 'Difference of means: no profanity- profanity')

[Figure: histogram of the bootstrapped difference of means (no profanity minus profanity)]

Since the difference of means is almost always greater than zero, we can be confident that movies with no profanity really do have higher mean ratings - even if the difference is less than 0.3/10 points!
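To put a number on 'almost always greater than zero', one option is to read a 95% confidence interval off the bootstrapped differences using percentiles. This is a minimal sketch on simulated ratings (the two groups, their sizes and their means are invented for illustration, not taken from the IMDb data). It also does the resampling in one step with NumPy rather than in a loop.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated ratings, invented for illustration (not the IMDb data)
group_a = rng.normal(7.0, 1.0, size=300)    # stands in for 'no profanity'
group_b = rng.normal(6.6, 1.0, size=1000)   # stands in for 'with profanity'

n_reps = 2000
# Resample each group with replacement: one row per bootstrap repetition
means_a = rng.choice(group_a, size=(n_reps, group_a.size), replace=True).mean(axis=1)
means_b = rng.choice(group_b, size=(n_reps, group_b.size), replace=True).mean(axis=1)
m_diff = means_a - means_b

# 95% percentile confidence interval for the difference of means
lower, upper = np.percentile(m_diff, [2.5, 97.5])
# Proportion of bootstrap repetitions in which the difference was above zero
prop_above_zero = (m_diff > 0).mean()

print(lower, upper, prop_above_zero)
```

If the whole interval sits above zero, the data are hard to reconcile with 'no difference' between the groups.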

Useful information: Movie Certificates#

The movie certificates given are mostly US certificates but there are different schemes in use for movies shown in cinema and on TV. There are a fair few certificates in use in the database, some of which relate to only a few titles:

movies['Certificate'].value_counts()

Certificate
R            1885
PG-13        1145
PG            530
Not Rated     189
TV-MA         109
G              70
Approved       42
TV-14          35
Passed         24
NC-17          14
TV-PG          13
E              12
(Banned)       10
TV-G           10
GP              6
X               6
TV-Y7           5
M               3
M/PG            2
Name: count, dtype: int64

There are also some old (pre-1968) or international certificates mixed in there.

To meaningfully use these data (for example to answer the question, are movies rated 15 the worst?) it would be helpful to group these into a smaller set of classifications - for example the closest UK equivalent

UK movie ratings are:

U - Universal - suitable for all (including young children)
PG - Parental Guidance - may be unsuitable for very young children
12, 15, 18 - Not suitable for persons under the age of 12, 15 or 18 respectively

Alternatively, to answer the question about straight-to-TV movies, you will want to separate out those with a certificate beginning ‘TV’ (only ever shown on TV) from the others
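A minimal sketch of separating out the 'TV' certificates, assuming you are happy to treat movies with a missing certificate as not made for TV (that is my assumption; you might prefer to exclude them). The table `toy` and the column name `made_for_tv` are invented for illustration.

```python
import pandas as pd

# Toy data, invented for illustration (not taken from imdb.csv)
toy = pd.DataFrame({
    'Certificate': ['TV-MA', 'R', 'TV-G', None, 'PG-13'],
    'Rating': [6.0, 7.0, 5.0, 6.5, 8.0],
})

# True where the certificate starts with 'TV'; na=False makes missing certificates False
toy['made_for_tv'] = toy['Certificate'].str.startswith('TV', na=False)

tv_mean = toy.loc[toy['made_for_tv'], 'Rating'].mean()
cinema_mean = toy.loc[~toy['made_for_tv'], 'Rating'].mean()

print(tv_mean, cinema_mean)
```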

The meaning of certificates should be Googlable but here are my notes on the certificates used in the dataset:

TV-Y: Designed to be appropriate for all children
TV-Y7: Suitable for ages 7 and up
G: Suitable for General Audiences
TV-G: Suitable for General Audiences
PG: Parental Guidance suggested
TV-PG: Parental Guidance suggested
PG-13: Parents strongly cautioned. May be inappropriate for ages 12 and under.
TV-14: Parents strongly cautioned. May not be suitable for ages 14 and under.
R: Restricted. May be inappropriate for ages 17 and under.
TV-MA: For Mature Audiences. May not be suitable for ages 17 and under.
NC-17: Inappropriate for ages 17 and under
M: Mature
(Banned), Approved, Passed: pre-1968 classifications

In reclassifying the movies you may consider excluding some cases that don’t clearly fall into one category or another, for example the pre-1968 movies.

There are also some remaining mysteries. For example, one certificate I was not able to understand was ‘E’: Google tells me that ‘E’ stands for educational or exempt, but if I look at which movies have this rating, I’m not so sure this is correct:

movies[movies['Certificate']=='E']

	Name	Date	Rating	Votes	Genre	Duration	Type	Certificate	Episodes	Nudity	Violence	Profanity	Alcohol	Frightening
429	Luca	2021	7.5	112694.0	Animation, Adventure, Comedy	95.0	Film	E	NaN	NaN	Mild	NaN	NaN	Mild
521	Space Jam: A New Legacy	2021	4.4	56919.0	Animation, Adventure, Comedy	115.0	Film	E	NaN	NaN	Mild	Mild	Mild	Mild
526	Godzilla vs. Kong	2021	6.4	176046.0	Action, Sci-Fi, Thriller	113.0	Film	E	NaN	NaN	Moderate	Mild	Mild	Moderate
581	PAW Patrol: The Movie	2021	6.2	3898.0	Animation, Adventure, Comedy	86.0	Film	E	NaN	Mild	Mild	NaN	NaN	Mild
851	In the Heights	2021	7.4	35374.0	Drama, Musical, Romance	143.0	Film	E	NaN	Mild	NaN	Mild	Mild	NaN
872	The Boss Baby: Family Business	2021	5.9	11863.0	Animation, Adventure, Comedy	107.0	Film	E	NaN	Mild	Mild	Mild	NaN	Mild
1978	The Comeback Trail	2020	5.6	4961.0	Comedy, Crime	104.0	Film	E	NaN	Mild	Moderate	Moderate	Mild	Mild
2654	Peter Rabbit 2: The Runaway	2021	6.2	7310.0	Animation, Adventure, Comedy	93.0	Film	E	NaN	NaN	Mild	NaN	NaN	Mild
2877	Spirit Untamed	2021	5.4	2585.0	Animation, Adventure, Comedy	88.0	Film	E	NaN	NaN	Mild	Mild	NaN	Mild
3156	AINBO: Spirit of the Amazon	2021	4.7	933.0	Animation, Adventure, Comedy	84.0	Film	E	NaN	No Rate	No Rate	No Rate	No Rate	No Rate
4814	Dragon Rider	2020	5.6	1199.0	Animation, Adventure, Comedy	91.0	Film	E	NaN	NaN	Mild	Mild	Mild	Mild
5964	Dragon Rider	2020	5.6	1199.0	Animation, Adventure, Comedy	91.0	Film	E	NaN	NaN	Mild	Mild	Mild	Mild

You may find this example useful in creating a new column for UK classifications:

# first, create a new column with the default value 'none'
movies = movies.assign(classification = 'none')

# np.isin does a boolean check on every element of the array to see if it is in a list (your list of interest)
# np.where assigns a new value where that condition is met, or keeps the existing value where it is not
movies['classification'] = np.where(np.isin(movies.Certificate, ['G', 'TV-G', 'TV-Y', 'TV-Y7']), 'U', movies.classification)
movies['classification'] = np.where(np.isin(movies.Certificate, ['PG', 'TV-PG']), 'PG', movies.classification)


# the syntax for this is:
# np.where(condition, value to use if the condition is met, value to use otherwise)
# np.isin(array_to_test, list of values you want to check each element against)

# so the code is doing:
# finding where the movie certificate is either G, TV-G, TV-Y or TV-Y7
# setting the value of 'classification' to 'U' if that movie meets the condition
# if not, it sets the value of 'classification' to itself (i.e. it does nothing)

movies[movies['classification']=='U']

	Name	Date	Rating	Votes	Genre	Duration	Type	Certificate	Episodes	Nudity	Violence	Profanity	Alcohol	Frightening	classification
392	Halloweentown	1998	6.7	15617.0	Adventure, Comedy, Family	84.0	Film	TV-G	NaN	NaN	Mild	NaN	NaN	Mild	U
426	Love Strikes Twice	2021	7.2	458.0	Comedy, Mystery, Romance	84.0	Film	TV-G	NaN	NaN	NaN	NaN	NaN	NaN	U
889	Cars	2006	7.1	388952.0	Animation, Adventure, Comedy	117.0	Film	G	NaN	NaN	NaN	Mild	NaN	Mild	U
1002	2001: A Space Odyssey	1968	8.3	631754.0	Adventure, Sci-Fi	149.0	Film	G	NaN	NaN	Mild	NaN	NaN	Moderate	U
1073	The Lion King	1994	8.5	986883.0	Animation, Adventure, Drama	88.0	Film	G	NaN	NaN	Mild	NaN	NaN	Moderate	U
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
5701	Pocahontas	1995	6.7	177837.0	Animation, Adventure, Drama	81.0	Film	G	NaN	NaN	Mild	NaN	NaN	Mild	U
5803	Robin Hood	1973	7.6	120934.0	Animation, Adventure, Comedy	83.0	Film	G	NaN	NaN	Mild	NaN	Mild	Mild	U
5903	The Polar Express	2004	6.6	188010.0	Animation, Adventure, Comedy	100.0	Film	G	NaN	NaN	NaN	NaN	NaN	Mild	U
5944	Chitty Chitty Bang Bang	1968	6.9	43331.0	Adventure, Family, Fantasy	144.0	Film	G	NaN	NaN	Mild	NaN	Mild	Mild	U
6064	The Princess Diaries 2: Royal Engagement	2004	5.8	77450.0	Comedy, Family, Romance	113.0	Film	G	NaN	NaN	NaN	NaN	NaN	NaN	U

80 rows × 15 columns
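If you end up with many certificates to convert, repeating `np.where` for each one gets long. An alternative sketch is to write the conversion as a dictionary and apply it with `map`. The 'U' and 'PG' entries follow the example above (using the 'TV-Y7' label as it appears in the certificate counts); the 'PG-13' to '12' entry is purely my assumption to show how you would extend it. The table `toy` and the name `uk_equivalent` are invented for illustration.

```python
import pandas as pd

# Toy data, invented for illustration (not taken from imdb.csv)
toy = pd.DataFrame({'Certificate': ['G', 'TV-Y7', 'PG', 'PG-13', 'Approved', None]})

# Dictionary from certificate to closest UK equivalent
uk_equivalent = {
    'G': 'U', 'TV-G': 'U', 'TV-Y': 'U', 'TV-Y7': 'U',
    'PG': 'PG', 'TV-PG': 'PG',
    'PG-13': '12',   # assumption, for illustration only
}

# Certificates not in the dictionary (or missing) become NaN, then 'none'
toy['classification'] = toy['Certificate'].map(uk_equivalent).fillna('none')

print(toy)
```

Anything left as 'none' is a case you have not classified yet, which makes it easy to see what still needs a decision (or should be excluded).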