kahlee.info

Insights on the IMDb Dataset

Part 1 - Top 500 movies and their genres

April 2025

Like most people, I love movies. There are some I could watch endlessly, some I put on and find something new every time, and some I'm happy to watch only once but still find things to appreciate.

The IMDb dataset has been made available for personal and non-commercial use. It's a great dataset to explore when practicing data science skills for a number of reasons:

  • It's huge, containing millions of records dating back decades.
  • It's not perfectly clean (great for practicing data exploration and cleansing).
  • It's relevant and engaging as most people would be aware of IMDb and the information it contains.

I will be writing a few posts exploring this dataset using Python and data analysis packages. To start, I've been investigating the top 500 movies of the past 50 years to see what trends exist around genres.

IMDb allows users to rate movies and TV shows between 1 and 10, with 1 being lowest and 10 being highest. The final rating is a weighted average, and the exact algorithm isn't disclosed by IMDb. The rating on its own also says nothing about how many people voted: if we decide the top 500 movies are just those with a rating of 10, then a movie with 5 votes and a 10 rating would be equal to a movie with 10,000 votes and a 10 rating.

To account for this, I created a new column to hold the average rating multiplied by the number of votes to give a rough indication of the "top" movies (really, it would be more appropriate to say "most popular" movies). I also limited this to the last 50 years to keep the insights more recent and relevant.
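As a rough sketch of this ranking step (not the exact code behind this post), the snippet below scores each movie as rating times votes, keeps releases from 1975 onwards, and takes the top N. The `Movie` record, the `top_movies` helper, and the sample rows are all made up for illustration, and it sticks to the standard library so it runs without any extra packages.

```python
from typing import NamedTuple


class Movie(NamedTuple):
    title: str
    year: int       # release year
    rating: float   # average rating, 1 to 10
    votes: int      # number of votes behind that rating


def top_movies(movies, n=500, since=1975):
    """Rank movies released in or after `since` by rating x votes."""
    recent = [m for m in movies if m.year >= since]
    return sorted(recent, key=lambda m: m.rating * m.votes, reverse=True)[:n]


# Made-up rows, only to show the effect of the score.
sample_movies = [
    Movie("Obscure Gem", 2001, 10.0, 5),          # perfect rating, 5 votes
    Movie("Crowd Favourite", 1999, 8.5, 10_000),
    Movie("Old Classic", 1960, 9.0, 50_000),      # outside the 50-year window
    Movie("Solid Hit", 2010, 7.0, 8_000),
]

ranked = top_movies(sample_movies, n=2)
print([m.title for m in ranked])
```

With this score, the 10-rated movie with 5 votes no longer outranks widely rated movies, which is the behaviour described above.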

In the dataset, a movie can have up to 3 genres (e.g. Action, Adventure, Drama), and most movies have 3 genres (mean 2.6, median 3). To find an accurate split of genres, I counted each genre individually rather than treating each combination as its own category (e.g. "La La Land" counts in the "Comedy" genre, the "Drama" genre, and the "Music" genre as opposed to a single genre called "Comedy, Drama, Music").
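Counting genres individually could look something like the sketch below. It assumes each movie's genres arrive as one comma-separated string; the `count_genres` helper and the sample strings are illustrative only.

```python
from collections import Counter


def count_genres(genre_strings):
    """Count every genre separately instead of counting each combination."""
    counts = Counter()
    for genres in genre_strings:
        # "Comedy,Drama,Music" adds one to each of the three genres.
        counts.update(g.strip() for g in genres.split(",") if g.strip())
    return counts


# Illustrative genre strings, one per movie.
sample_genres = ["Comedy,Drama,Music", "Action,Adventure,Drama", "Drama"]

genre_counts = count_genres(sample_genres)
print(genre_counts.most_common())
```

Because a movie can add to several genres, the counts sum to more than the number of movies, so they are best read as "movies tagged with this genre" rather than shares of a whole.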

Drama, Action, and Adventure appear to be the top genres in the top 500. The next question to ask is whether this is a consistent trend throughout the last 50 years or whether it is a peak during a certain time period. For example, Westerns and Musicals might have been popular during certain decades, but have been less so in recent years.

To analyse this, I've looked at the top 6 genres vs. the bottom 2 genres:

A few interesting things to note here. As expected, "Westerns" seemed to drop out of the top 500 in 2015. "Music" movies started picking up after that, but dropped out again in 2018. The biggest spike is for "Action" movies, which peaked in 2014, but all genres quickly started declining after that.
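A genre-by-year breakdown like this one can be built by counting each (genre, release year) pair. The following is a minimal sketch under the same comma-separated-genres assumption; `genre_counts_by_year` and the sample rows are hypothetical and not the data behind the chart.

```python
from collections import Counter, defaultdict


def genre_counts_by_year(movies):
    """Return {genre: Counter({year: number of movies})}.

    `movies` is an iterable of (release_year, genres_string) pairs.
    """
    table = defaultdict(Counter)
    for year, genres in movies:
        for genre in genres.split(","):
            table[genre.strip()][year] += 1
    return table


# Made-up (year, genres) rows.
sample_rows = [
    (2012, "Western,Drama"),
    (2014, "Action,Adventure"),
    (2014, "Action,Drama"),
    (2015, "Drama"),
]

by_year = genre_counts_by_year(sample_rows)
print(dict(by_year["Action"]))
print(max(by_year["Western"]))  # last release year with a Western in the list
```

Looking up a year with no movies in a genre returns 0, which makes it easy to see where a genre drops out.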

Let's explore the drop after 2014 further. Does this indicate a drop in the specific genres, or does it mean fewer films made after 2014 are making the top 500? To find out, I compared the top 500 movies by release year against the total number of movies released each year.
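One way to set up that comparison is to count releases per year for both lists and divide one by the other. This is a sketch with invented numbers; `top_share_by_year` is a hypothetical helper, and it assumes every top-500 movie also appears in the full list.

```python
from collections import Counter


def top_share_by_year(all_years, top_years):
    """Map year -> (top-list count, total releases, share in the top list)."""
    total = Counter(all_years)
    top = Counter(top_years)
    return {
        year: (top[year], total[year], top[year] / total[year])
        for year in sorted(total)
    }


# Invented release years: every movie, and the subset in the top list.
all_release_years = [2013] * 4 + [2014] * 5 + [2015] * 8
top_release_years = [2013, 2013, 2014, 2014, 2014, 2015]

share = top_share_by_year(all_release_years, top_release_years)
for year, (in_top, released, fraction) in share.items():
    print(year, in_top, released, round(fraction, 3))
```

Keeping the raw counts next to the fraction shows whether a falling share comes from fewer top movies, more releases overall, or both.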

The top 500 movies are always just a fraction of the total number of movies released every year. This chart shows that the quantity of movies in the top 500 drops after 2014, but the total number of movies released keeps increasing until 2021.

There could be several reasons contributing to a decline like this:

  • The COVID-19 pandemic clearly had an impact around 2020. While there is a significant uptick in movies released in 2021 and 2022 (possibly including movies which would have been released earlier but were delayed due to the pandemic), it is possible the pandemic had longer-lasting effects, as the number of movies in the top 500 stays low after 2020.
  • In 2023 there were two strikes which may have impacted the top 500 movies and the total number of movies released - the SAG-AFTRA strike and the Writers Guild of America strike.
  • It's possible that fewer of the movies released in recent years are good enough to knock some older beloved movies out of the top 500.
  • It's possible that people are engaging less with the ratings system than in previous years.

The "top" movies are based on user votes. Data on the IMDb user base demographics is not public, and neither is the algorithm to determine average ratings, so we can't analyse these further. I can, however, analyse the total number of votes per movie since 1975 to help understand the final point further.
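A vote summary per release year might be sketched as follows. The `votes_by_year` helper and the figures are made up; the median is included alongside the total as an extra assumption of mine, since a handful of blockbusters can dominate a yearly total.

```python
from collections import defaultdict
from statistics import median


def votes_by_year(movies):
    """Summarise vote counts per release year.

    `movies` is an iterable of (release_year, vote_count) pairs.
    """
    grouped = defaultdict(list)
    for year, votes in movies:
        grouped[year].append(votes)
    return {
        year: {"total": sum(v), "median": median(v), "movies": len(v)}
        for year, v in sorted(grouped.items())
    }


# Made-up (year, votes) pairs.
sample_votes = [(2013, 1000), (2013, 3000), (2016, 200), (2016, 400), (2016, 600)]

vote_summary = votes_by_year(sample_votes)
for year, stats in vote_summary.items():
    print(year, stats)
```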

It appears that movies released after 2014 are attracting fewer ratings than movies released in earlier years. This could suggest several things:

  • Fewer people have been interacting with the "ratings" functionality since 2014, though they may be interacting with the website in other ways.
  • Fewer movies are being released that make people want to interact with the "ratings" functionality. People often only want to express opinions if something makes them feel strongly one way or another. Recent releases might not be encouraging people to put their ratings online.
  • As more movies are being released, fewer movies might appeal to people willing to rate movies on IMDb. For example, many recent Academy Award nominees are not in the top 500 list, but would still be considered very good movies.

There is definitely scope for further analysis, which will be explored in future blog posts. For now, it is interesting to see these insights across top movies and their genres.

Information courtesy of IMDb (https://www.imdb.com). Used with permission.