Creating a TV Show & Movies Database

ROLE:

Team Lead & Database Engineer


TIMELINE:

January – April 2025

TEAM:

1 Database Engineer

2 Query Engineers

(Lousiane Gauthier & Mark Schwartz)

SKILLS:

Relational databases

Query Engineering

SQL Code


  • Our team used data from IMDb and Rotten Tomatoes reviews to create a media reviews database with over 20,000 reviews for TV shows & movies. We also wrote 15 queries to test the database and analyze trends in the review data.

  • Total TV & Movie Reviews: 20,132

    • 2,929 TV Show Reviews

    • 17,203 Movie Reviews

    Key Data Features:

    1. Title

    2. Release year

    3. Media type (show/movie)

    4. Runtime in minutes (avg. episode runtime used for shows)

    5. Synopsis

    6. Content rating

    7. Actor names

    8. Reviewer names

    9. Review text

  • This database was transformed from 4 datasets of IMDb and Rotten Tomatoes reviews: 2 consisting of movies and 2 of TV shows. To create our database, we organized the data into 6 strong entities (shown in green) and 3 weak entities (shown in red). Many names were repeated across the actors and directors tables (e.g., Angelina Jolie appears as both an actor and a director), so to adhere to Boyce–Codd normal form, both tables reference a centralized “people” table. Deciding which columns in the original datasets should become entities was challenging, and we examined several configurations against the constraints of each normal form.
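
    The “people” pattern described above can be sketched in DDL roughly like this. This is a minimal illustration, not our actual schema — table and column names (media, cast_credits, etc.) are assumptions for the example:

```sql
-- Centralized table: each name (e.g., Angelina Jolie) is stored exactly once.
CREATE TABLE people (
    person_id  SERIAL PRIMARY KEY,
    name       TEXT NOT NULL
);

-- Actors and directors become role tables that reference people,
-- removing the repeated-name redundancy that violated BCNF.
CREATE TABLE actors (
    person_id  INT PRIMARY KEY REFERENCES people(person_id)
);

CREATE TABLE directors (
    person_id  INT PRIMARY KEY REFERENCES people(person_id)
);

CREATE TABLE media (
    media_id      SERIAL PRIMARY KEY,
    title         TEXT NOT NULL,
    release_year  INT,
    media_type    TEXT  -- 'show' or 'movie'
);

-- A weak entity: a cast credit only exists relative to a media row
-- and a person row, so its key is composite.
CREATE TABLE cast_credits (
    media_id   INT REFERENCES media(media_id),
    person_id  INT REFERENCES actors(person_id),
    PRIMARY KEY (media_id, person_id)
);
```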

  • Example Scenario 1: Demographic analysts want their horror film to receive high praise from top critics, so they plan to build a profile of what top critics dislike in horror films. Find the 5 horror movies rated worst by top critics and compare those ratings with each movie's overall score. Be sure to also include the title, runtime, year, and MPA rating.

    Example Scenario 2: A team of writers wants to create a novel and nuanced experience that defies genre categorization and draws inspiration from many different genres. Find the 5 media titles with the most genres and compare their scores. Is there any correlation between genre count and score?

    Example Scenario 3: A media reviews publisher wants to understand the relationship between review volume and average scores. By grouping publishers into brackets of 100 reviews (e.g., 200–299, 300–399), they can see how they compare to others within their specific review-output tier. Filter to publishers with 200 to 2,000 total reviews to focus on medium-to-large scale creators.

    Example Scenario 4: A film review analytics team wants to see how the scores top critics gave comedies changed year over year from 2000 to 2020. Find the average score given by top critics to comedies each year and determine the change (+/− percent) from the year before.
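
    Scenario 1 could be answered with a query along these lines. This is a sketch against a hypothetical schema — names like media, reviews, genres, and is_top_critic are assumptions, not our exact columns:

```sql
-- Sketch for Scenario 1: the 5 horror movies rated worst by top critics,
-- with each movie's overall average score for comparison.
SELECT m.title,
       m.runtime_minutes,
       m.release_year,
       m.content_rating,
       AVG(r.score) FILTER (WHERE r.is_top_critic) AS top_critic_avg,
       AVG(r.score)                                AS overall_avg
FROM media m
JOIN media_genres mg ON mg.media_id = m.media_id
JOIN genres g        ON g.genre_id  = mg.genre_id
JOIN reviews r       ON r.media_id  = m.media_id
WHERE m.media_type = 'movie'
  AND g.name = 'Horror'
GROUP BY m.media_id, m.title, m.runtime_minutes,
         m.release_year, m.content_rating
HAVING COUNT(*) FILTER (WHERE r.is_top_critic) > 0  -- need at least one top-critic review
ORDER BY top_critic_avg ASC
LIMIT 5;
```

    PostgreSQL's FILTER clause lets one pass compute the top-critic average and the overall average side by side, which is exactly the comparison the scenario asks for.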

  • C1: Ensure that no invalid review scores are inserted into the database.

    This trigger function throws an exception if a media review's score falls outside the range 0–10. The trigger fires when new review records are inserted and when existing records are updated.

    C2: Warn users when deleting records in the media table if they still have reviews linked to them.

    This trigger function counts the reviews associated with the media_id being deleted. If that count is greater than 0, it notifies the user that all of those reviews will be deleted as well.
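
    The two constraints could be implemented as PostgreSQL triggers roughly as follows. This is an illustrative sketch, not our exact code — the table and column names (reviews, media, score, media_id) are assumptions:

```sql
-- C1: reject review scores outside 0-10 on INSERT or UPDATE.
CREATE FUNCTION check_review_score() RETURNS trigger AS $$
BEGIN
    IF NEW.score < 0 OR NEW.score > 10 THEN
        RAISE EXCEPTION 'Review score % is outside the valid range 0-10',
            NEW.score;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER review_score_check
BEFORE INSERT OR UPDATE ON reviews
FOR EACH ROW EXECUTE FUNCTION check_review_score();

-- C2: warn when deleting a media row that still has linked reviews.
CREATE FUNCTION warn_linked_reviews() RETURNS trigger AS $$
DECLARE
    review_count INT;
BEGIN
    SELECT COUNT(*) INTO review_count
    FROM reviews
    WHERE media_id = OLD.media_id;

    IF review_count > 0 THEN
        RAISE NOTICE 'Deleting media % will also delete % linked review(s)',
            OLD.media_id, review_count;
    END IF;
    RETURN OLD;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER media_delete_warning
BEFORE DELETE ON media
FOR EACH ROW EXECUTE FUNCTION warn_linked_reviews();
```

    RAISE EXCEPTION aborts the offending statement entirely, while RAISE NOTICE only emits a message, so the delete in C2 still proceeds after the warning.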

  • With the proper queries, our PostgreSQL media reviews database can be an effective tool for media review analysis. With its abundance of records, we extracted detailed insights about reviewers, reviews, publishers, actors, and directors. With its built-in reliability constraints and regular maintenance, our reviews database could serve as a powerful tool for analyzing demographic and marketing trends.

  • 1. Too few data integrity constraints to cover common integrity failures

    A more in-depth analysis of possible domain, entity (primary key), and referential integrity violations would allow us to create a more robust set of constraints.

    2. Only one review per piece of media.

    The original CSV files containing IMDb and Rotten Tomatoes reviews were organized into our schema by a long Python script. Late in the project’s timeline, I discovered that the section of the script that deleted duplicate entries in the media table also deleted “duplicate” reviews (any 2 reviews for the same piece of media). This prevented us from analyzing trends across a single piece of media’s reviews. I plan to rebuild the dataset with the newly fixed script to make this possible.

Read FULL REPORT
Learn more on GitHub

I also plan to develop a React front-end for this database so that anyone can query my dataset in their browser!