8.1. Movie data#

The file ‘imdb.csv’ contains data about 6178 movies and TV shows, which were scraped from the website IMDb.com (and which I downloaded from Kaggle.com).

Your task is to investigate what makes a good movie and write up your claims in a blog-style report, “10 ingredients for a good movie”. You don’t have to have exactly 10 (try for at least 5!).

Our metric of a ‘good movie’ shall be the IMDb rating (based on users’ votes). You may like to comment on whether this is a reliable metric.

This is a blog for geeky movie enthusiasts, so you can support your claims with some numbers (correlations, descriptive statistics and the like). But try to avoid jargon, and use the data to support the points you are making in the text rather than robotically reporting statistics.

Below I will suggest some avenues to explore but I encourage you to investigate the data yourself - play around with plotting the relationship between different variables before you settle down to write your report.

What makes a good report?#

Write a (very) short introduction explaining what the data are and what you are using as a measure of movie ‘quality’

It’s up to you to find some interesting features in the data and show them effectively using plots

You should support your claims using descriptive statistics such as the mean rating for different groups of movies or correlation coefficients

No need to use this notebook for your report - you can start a new ‘clean’ one without my notes in.

You might even like to start two notebooks - one for rough work (where you experiment with plotting the data and save any analyses you want to discuss with your tutor) and one for your final report.

Set up Python libraries#

As usual, run the code cell below to import the relevant Python libraries

# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas 
import seaborn as sns
sns.set_theme()
sns.set_style('whitegrid')

Import the data#

imdb=pandas.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook/main/data/imdb.csv')

# get only the movies (ignoring TV shows)
movies=imdb[imdb['Type']=='Film']
movies
Name Date Rating Votes Genre Duration Type Certificate Episodes Nudity Violence Profanity Alcohol Frightening
0 No Time to Die 2021 7.6 107163.0 Action, Adventure, Thriller 163.0 Film PG-13 NaN Mild Moderate Mild Mild Moderate
1 The Guilty 2021 6.3 64375.0 Crime, Drama, Thriller 90.0 Film R NaN NaN NaN Severe NaN Moderate
2 The Many Saints of Newark 2021 6.4 27145.0 Crime, Drama 120.0 Film R NaN Moderate Severe Severe Moderate Moderate
3 Venom: Let There Be Carnage 2021 6.4 30443.0 Action, Adventure, Sci-Fi 97.0 Film PG-13 NaN NaN Moderate Moderate Mild Moderate
4 Dune 2021 8.3 84636.0 Action, Adventure, Drama 155.0 Film PG-13 NaN NaN Moderate NaN Mild Moderate
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6173 The Human Centipede II (Full Sequence) 2011 3.8 37492.0 Horror 91.0 Film Not Rated NaN Severe Severe Severe Mild Severe
6174 Double Indemnity 1944 8.3 150448.0 Crime, Drama, Film-Noir 107.0 Film Passed NaN NaN Mild NaN Mild Mild
6175 Before the Devil Knows You're Dead 2007 7.3 100668.0 Crime, Drama, Thriller 117.0 Film R NaN Severe Moderate Severe Severe Severe
6176 Queen Bees 2021 6.0 887.0 Comedy, Drama, Romance 100.0 Film PG-13 NaN NaN NaN Mild Moderate NaN
6177 Death Race 2008 6.3 203578.0 Action, Sci-Fi, Thriller 105.0 Film R NaN Mild Severe Severe Mild Moderate

4446 rows × 14 columns

Where to start?#

Here are some myths (or truths?) for you to investigate to get you started

  1. Movies are getting worse over time
  2. Longer movies are better
    • As well as investigating the general relationship, can you find some points of interest - for example, name cases of long movies that were also bad (certainly to be avoided!)? Are there any surprisingly long movies in the dataset? What about those that were short but sweet?
  3. More nudity makes for a highly rated movie
  4. Movies rated 15 are the worst!
    • To do this you will need to create a new column converting the US and international certificates into their closest UK equivalent - U, PG, 12, 15 or 18
  5. Straight-to-TV movies have lower ratings than those shown in cinemas
    • The straight-to-TV movies have certificates beginning in 'TV'
  6. Popular movies are better (you can use the number of votes as a measure of a movie's popularity, since people should only vote if they watched it - you may like to comment on whether this is a reasonable measure of popularity or whether it may be biased)
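
To make myth 5 a little more concrete, here is a minimal sketch of flagging straight-to-TV movies by the start of their certificate. It uses a tiny made-up dataframe (toy) in place of movies, so the names and ratings are purely illustrative, and the column name madeForTV is my own choice rather than something in the dataset.

```python
import pandas

# a tiny made-up stand-in for the movies dataframe (values are illustrative only)
toy = pandas.DataFrame({
    'Name': ['A', 'B', 'C', 'D', 'E'],
    'Certificate': ['TV-MA', 'R', 'TV-G', None, 'PG-13'],
    'Rating': [6.0, 7.0, 5.0, 8.0, 6.5],
})

# True where the certificate starts with 'TV'; missing certificates count as False
toy['madeForTV'] = toy['Certificate'].str.startswith('TV', na=False)

# mean rating and number of movies in each group
summary = toy.groupby('madeForTV')['Rating'].agg(['mean', 'count'])
print(summary)
```

The same two lines should work on the real movies dataframe, although you may want to think about whether movies with no certificate at all belong in the 'cinema' group.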

Examples:#

Here are a couple of examples to get you started:

Example 1: Are movies getting worse over time?#

According to IMDb raters, yes they are. Or to be more accurate, the number of low-rated movies increased over time, whilst a steady flow of highly rated movies was also maintained.

sns.histplot(data=movies, x='Date', y='Rating')
plt.xlim([1920,2020])
plt.ylim([0,10])
plt.suptitle('Movie ratings over time', fontsize=16)
Text(0.5, 0.98, 'Movie ratings over time')
[Figure: 2D histogram of Rating against Date, titled 'Movie ratings over time']

Rating was negatively correlated with year of release (Spearman’s r = -0.30).

movies[['Date','Rating']].corr(method='spearman')
Date Rating
Date 1.000000 -0.298428
Rating -0.298428 1.000000

Thinking point#

No need to put this in your blog, but you could check whether a correlation this strong is likely to have arisen by chance (is it significant?). One way is to bootstrap a confidence interval for the correlation and see whether it excludes zero. Another is to shuffle the ratings across movies many times and see how often you get a correlation as far from zero as r = -0.30.
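
Here is a rough sketch of the bootstrapping idea, offered as one possible approach rather than the definitive one. It runs on made-up data (toy) so that it is self-contained; the numbers it prints say nothing about the real movies, and to try it for real you would swap toy for movies[['Date','Rating']].

```python
import numpy as np
import pandas

rng = np.random.default_rng(0)

# made-up data standing in for movies[['Date','Rating']] (illustrative only)
n = 200
date = rng.integers(1920, 2021, size=n)
rating = 8 - 0.02 * (date - 1920) + rng.normal(0, 1, size=n)
toy = pandas.DataFrame({'Date': date, 'Rating': rating})

nReps = 1000
r = np.empty(nReps)
for i in range(nReps):
    # resample whole rows with replacement, so each Date stays paired with its Rating
    resample = toy.sample(n, replace=True, random_state=rng)
    r[i] = resample.corr(method='spearman').iloc[0, 1]

# 95% confidence interval from the 2.5th and 97.5th percentiles of the bootstrapped correlations
ci = np.percentile(r, [2.5, 97.5])
print(ci)
```

If the whole interval sits below zero, a negative correlation is unlikely to be a fluke of which movies happened to be in the sample.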

Example 2: Does swearing spoil a good movie?#

Perhaps surprisingly, yes it does. At least, movies with no swearing (Profanity) at all seem to get slightly higher ratings.

In the plot below there is a small but perceptible advantage in ratings for movies with no Profanity - the average rating for a movie with no profanity was a third of one point (out of ten) higher than one with profanity!

# movies with no profanity appear to be stored as NaN in the Profanity column
# (recent versions of pandas read the text 'None' as a missing value), so select them with isna()
noProfanity = movies['Profanity'].isna()
print('mean rating - no profanity = ' + str(movies[noProfanity]['Rating'].mean()))
print('mean rating - with profanity = ' + str(movies[~noProfanity]['Rating'].mean()))
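# plot ratings by the level of profanity in the movie
# note that I have set the order of the violins to increasing profanity, rather than the default order which wasn't meaningful
# movies with no profanity appear to be stored as NaN in the Profanity column, so relabel them 'None' for plotting
sns.violinplot(data=movies.fillna({'Profanity':'None'}), x='Profanity', y='Rating', order=['None','Mild','Moderate','Severe'])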
<Axes: xlabel='Profanity', ylabel='Rating'>
[Figure: violin plot of Rating for each level of Profanity]

That said, there was an explosion in profanity from the 1960s onwards - so maybe this just reflects the fact that the old classics are the best movies!

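sns.set_palette('rocket')

# movies with no profanity appear to be stored as NaN in the Profanity column, so relabel them 'None' for plotting
sns.kdeplot(
    data=movies.fillna({'Profanity':'None'}), 
    x='Date',  
    hue='Profanity', 
    multiple='stack',
    hue_order=['None','Mild','Moderate','Severe'])

plt.suptitle('Distribution of profanity ratings over time', fontsize=16)
Text(0.5, 0.98, 'Distribution of profanity ratings over time')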
[Figure: stacked density plot of release Date, split by Profanity level]
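
One way to probe whether the profanity effect is really just an 'old classics' effect is to compare ratings within each decade. The sketch below does this on a small made-up dataframe (toy), so the numbers are illustrative only; the column names Decade and anyProfanity are my own inventions. In the real dataframe you may need isna() rather than a comparison with 'None' if the no-profanity movies show up as NaN.

```python
import pandas

# made-up stand-in for the movies dataframe (values are illustrative only)
toy = pandas.DataFrame({
    'Date':      [1955, 1958, 1957, 1995, 1998, 1992, 1999, 1994],
    'Profanity': ['None', 'None', 'Mild', 'None', 'Severe', 'Mild', 'Severe', 'None'],
    'Rating':    [8.0, 7.8, 7.9, 6.5, 6.4, 6.8, 6.0, 6.7],
})

# decade of release, e.g. 1957 -> 1950
toy['Decade'] = (toy['Date'] // 10) * 10

# True if the movie has any profanity at all
toy['anyProfanity'] = toy['Profanity'] != 'None'

# mean rating for each decade (rows), split by whether the movie has profanity (columns)
byDecade = toy.pivot_table(index='Decade', columns='anyProfanity', values='Rating', aggfunc='mean')
print(byDecade)
```

If the gap between the two columns shrinks or vanishes once you look decade by decade, that would suggest the era of release explains much of the apparent effect.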

By the way….#

Don’t put this in your blog as it is too technical, but you might be interested to see it:

It looks like there is only a tiny effect of Profanity on ratings, but could it have occurred due to chance?

We construct a bootstrapped confidence interval for the difference in ratings between movies with and without profanity:

nReps = 10000
mDiff=np.empty(nReps)

# movies with no profanity appear to be stored as NaN in the Profanity column, so select them with isna()
noProf = movies[movies['Profanity'].isna()]
withProf = movies[movies['Profanity'].notna()]
n1=len(noProf)
n2=len(withProf)

print(n1)
print(n2)

for i in range(nReps):
    # resample each group with replacement, keeping the original group sizes
    sample1 = noProf.sample(n1, replace=True)
    sample2 = withProf.sample(n2, replace=True)
    mDiff[i]= sample1['Rating'].mean()-sample2['Rating'].mean()
    
sns.histplot(mDiff)
plt.xlabel('Difference of means: no profanity- profanity')
Text(0.5, 0, 'Difference of means: no profanity- profanity')
[Figure: histogram of the bootstrapped difference of means (no profanity - profanity)]

Since the difference of means is almost always greater than zero, we can be confident that movies with no profanity really do have higher mean ratings - even if the difference is only about 0.3 points out of 10!
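
To put a number on 'almost always', you could read a confidence interval straight off the bootstrapped differences. The sketch below is one way to do that, using made-up ratings for two groups (groupA and groupB are hypothetical, not taken from imdb.csv) so that it runs on its own.

```python
import numpy as np

rng = np.random.default_rng(1)

# made-up ratings for two groups (illustrative only, not taken from imdb.csv)
groupA = rng.normal(7.1, 1.0, size=300)   # stands in for 'no profanity'
groupB = rng.normal(6.6, 1.0, size=900)   # stands in for 'with profanity'

nReps = 2000
mDiff = np.empty(nReps)
for i in range(nReps):
    # resample each group with replacement, keeping the original group sizes
    sampleA = rng.choice(groupA, size=len(groupA), replace=True)
    sampleB = rng.choice(groupB, size=len(groupB), replace=True)
    mDiff[i] = sampleA.mean() - sampleB.mean()

# 95% confidence interval, and the proportion of resamples with a difference above zero
ci = np.percentile(mDiff, [2.5, 97.5])
propPositive = np.mean(mDiff > 0)
print(ci, propPositive)
```

With the real data you would apply the last three lines to the mDiff array from the cell above.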

Useful information: Movie Certificates#

The movie certificates given are mostly US certificates but there are different schemes in use for movies shown in cinema and on TV. There are a fair few certificates in use in the database, some of which relate to only a few titles:

movies['Certificate'].value_counts()
Certificate
R            1885
PG-13        1145
PG            530
Not Rated     189
TV-MA         109
G              70
Approved       42
TV-14          35
Passed         24
NC-17          14
TV-PG          13
E              12
(Banned)       10
TV-G           10
GP              6
X               6
TV-Y7           5
M               3
M/PG            2
Name: count, dtype: int64

There are also some old (pre-1968) or international certificates mixed in there.

To meaningfully use these data (for example to answer the question, are movies rated 15 the worst?) it would be helpful to group these into a smaller set of classifications - for example the closest UK equivalent

UK movie ratings are:

  • U - Universal - suitable for all (including young children)
  • PG - Parental Guidance - may be unsuitable for very young children
  • 12, 15, 18 - Not suitable for persons under the age of 12, 15 or 18 respectively

Alternatively, to answer the question about straight-to-TV movies, you will want to separate out those with a certificate beginning ‘TV’ (only ever shown on TV) from the others

The meaning of certificates should be Googlable but here are my notes on the certificates used in the dataset:

  • TV-Y: Designed to be appropriate for all children
  • TV-Y7: Suitable for ages 7 and up
  • G: Suitable for General Audiences
  • TV-G: Suitable for General Audiences
  • PG: Parental Guidance suggested
  • TV-PG: Parental Guidance suggested
  • PG-13: Parents strongly cautioned. May be inappropriate for ages 12 and under.
  • TV-14: Parents strongly cautioned. May not be suitable for ages 14 and under.
  • R: Restricted. May be inappropriate for ages 17 and under.
  • TV-MA: For Mature Audiences. May not be suitable for ages 17 and under.
  • NC-17: Inappropriate for ages 17 and under
  • M: Mature
  • (Banned)/Approved/Passed: pre-1968 classifications

In reclassifying the movies you may consider excluding some cases that don’t clearly fall into one category or another, for example the pre-1968 movies.

There are also some remaining mysteries. For example, one certificate I was not able to understand was ‘E’: Google tells me that ‘E’ stands for educational or exempt, but when I look at which movies have this rating, I’m not so sure this is correct:

movies[movies['Certificate']=='E']
Name Date Rating Votes Genre Duration Type Certificate Episodes Nudity Violence Profanity Alcohol Frightening
429 Luca 2021 7.5 112694.0 Animation, Adventure, Comedy 95.0 Film E NaN NaN Mild NaN NaN Mild
521 Space Jam: A New Legacy 2021 4.4 56919.0 Animation, Adventure, Comedy 115.0 Film E NaN NaN Mild Mild Mild Mild
526 Godzilla vs. Kong 2021 6.4 176046.0 Action, Sci-Fi, Thriller 113.0 Film E NaN NaN Moderate Mild Mild Moderate
581 PAW Patrol: The Movie 2021 6.2 3898.0 Animation, Adventure, Comedy 86.0 Film E NaN Mild Mild NaN NaN Mild
851 In the Heights 2021 7.4 35374.0 Drama, Musical, Romance 143.0 Film E NaN Mild NaN Mild Mild NaN
872 The Boss Baby: Family Business 2021 5.9 11863.0 Animation, Adventure, Comedy 107.0 Film E NaN Mild Mild Mild NaN Mild
1978 The Comeback Trail 2020 5.6 4961.0 Comedy, Crime 104.0 Film E NaN Mild Moderate Moderate Mild Mild
2654 Peter Rabbit 2: The Runaway 2021 6.2 7310.0 Animation, Adventure, Comedy 93.0 Film E NaN NaN Mild NaN NaN Mild
2877 Spirit Untamed 2021 5.4 2585.0 Animation, Adventure, Comedy 88.0 Film E NaN NaN Mild Mild NaN Mild
3156 AINBO: Spirit of the Amazon 2021 4.7 933.0 Animation, Adventure, Comedy 84.0 Film E NaN No Rate No Rate No Rate No Rate No Rate
4814 Dragon Rider 2020 5.6 1199.0 Animation, Adventure, Comedy 91.0 Film E NaN NaN Mild Mild Mild Mild
5964 Dragon Rider 2020 5.6 1199.0 Animation, Adventure, Comedy 91.0 Film E NaN NaN Mild Mild Mild Mild

You may find this example useful in creating a new column for UK classifications:

#first, create a new column with the default value 'none'
movies = movies.assign(classification = 'none')

#np.isin to do a boolean check on every element of the array and see if it is in a different list (your list of interest)
# use np.where to get booleans where that condition is met, assign a new value if true, or do nothing if not true
movies['classification'] = np.where(np.isin(movies.Certificate, ['G', 'TV-G', 'TV-Y', 'TV-Y7']), 'U', movies.classification)
movies['classification'] = np.where(np.isin(movies.Certificate, ['PG', 'TV-PG']), 'PG', movies.classification)


#the syntax for this is:
# np.where(set condition, do this if condition is met, otherwise do this)
# np.isin(array_to_test, list of values you want to check if the element matches)

#so the code is doing:
# finding where the movie certificate is either G, TV-G, TV-Y or TV-Y7
# setting the value of 'classification' to 'U' if that movie meets the condition
# if not, it sets the value of 'classification' to itself (i.e. it does nothing)
movies[movies['classification']=='U']
Name Date Rating Votes Genre Duration Type Certificate Episodes Nudity Violence Profanity Alcohol Frightening classification
392 Halloweentown 1998 6.7 15617.0 Adventure, Comedy, Family 84.0 Film TV-G NaN NaN Mild NaN NaN Mild U
426 Love Strikes Twice 2021 7.2 458.0 Comedy, Mystery, Romance 84.0 Film TV-G NaN NaN NaN NaN NaN NaN U
889 Cars 2006 7.1 388952.0 Animation, Adventure, Comedy 117.0 Film G NaN NaN NaN Mild NaN Mild U
1002 2001: A Space Odyssey 1968 8.3 631754.0 Adventure, Sci-Fi 149.0 Film G NaN NaN Mild NaN NaN Moderate U
1073 The Lion King 1994 8.5 986883.0 Animation, Adventure, Drama 88.0 Film G NaN NaN Mild NaN NaN Moderate U
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5701 Pocahontas 1995 6.7 177837.0 Animation, Adventure, Drama 81.0 Film G NaN NaN Mild NaN NaN Mild U
5803 Robin Hood 1973 7.6 120934.0 Animation, Adventure, Comedy 83.0 Film G NaN NaN Mild NaN Mild Mild U
5903 The Polar Express 2004 6.6 188010.0 Animation, Adventure, Comedy 100.0 Film G NaN NaN NaN NaN NaN Mild U
5944 Chitty Chitty Bang Bang 1968 6.9 43331.0 Adventure, Family, Fantasy 144.0 Film G NaN NaN Mild NaN Mild Mild U
6064 The Princess Diaries 2: Royal Engagement 2004 5.8 77450.0 Comedy, Family, Romance 113.0 Film G NaN NaN NaN NaN NaN NaN U

80 rows × 15 columns
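
As an alternative to repeated np.where calls, here is a sketch that uses a dictionary as a lookup table together with map. It is only one possible approach, shown on a made-up dataframe (toy). The dictionary ukEquivalent is hypothetical: in particular the PG-13, R and NC-17 entries are my own guesses at the nearest UK rating, and you may well disagree with them.

```python
import pandas

# made-up stand-in for the movies dataframe (illustrative only)
toy = pandas.DataFrame({
    'Name': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Certificate': ['G', 'TV-PG', 'PG-13', 'R', 'Passed', 'NC-17'],
})

# hypothetical lookup table from certificate to nearest UK classification;
# the PG-13, R and NC-17 entries are guesses and are open to debate
ukEquivalent = {
    'G': 'U', 'TV-G': 'U', 'TV-Y': 'U', 'TV-Y7': 'U',
    'PG': 'PG', 'TV-PG': 'PG',
    'PG-13': '12',
    'R': '15',
    'NC-17': '18',
}

# certificates missing from the lookup (e.g. 'Passed') become NaN
toy['classification'] = toy['Certificate'].map(ukEquivalent)

# drop the movies that could not be classified
classified = toy.dropna(subset=['classification'])
print(classified)
```

A nice side effect of this approach is that any certificate you leave out of the dictionary is automatically excluded, which suits the pre-1968 movies.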