Datasets


The data sources we used

CMU Movie Summary Corpus

You can find the data here.
This is the main dataset we used for our analysis. It contains 42,306 movie plot summaries extracted from Wikipedia, together with aligned metadata extracted from Freebase: box office revenue, genre, release date, runtime, and language.
It also includes character names and aligned information about the actors who portray them, including gender and estimated age at the time of the movie’s release.

TMDB

You can find the data here.
TMDB (The Movie Database) is a comprehensive movie database that provides details such as titles, ratings, release dates, revenue, and genres.
This dataset contains a collection of 1,000,000 movies from the TMDB database. The features available in the dataset include id, title, release_date, vote_average, vote_count, status, revenue, runtime, adult, original_language, popularity, genres, and poster_path.
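As a rough illustration of working with these columns, the sketch below reads TMDB-style rows with Python's standard csv module and converts the numeric fields. The sample rows are made up, and the storage format of genres (a single comma-separated string), the True/False encoding of adult, and the helper name load_movies are assumptions for the example, not something the dataset description confirms.

```python
import csv
import io

# Columns named in the dataset description.
TMDB_COLUMNS = [
    "id", "title", "release_date", "vote_average", "vote_count", "status",
    "revenue", "runtime", "adult", "original_language", "popularity",
    "genres", "poster_path",
]

# Hypothetical two-row sample standing in for the real file; all values are made up.
SAMPLE_CSV = ",".join(TMDB_COLUMNS) + "\n" + (
    '1,Example Movie,2001-05-04,7.5,120,Released,1000000,95,False,en,12.3,"Drama, Comedy",/example.jpg\n'
    "2,Another Movie,,0.0,0,Planned,0,0,False,fr,0.6,,\n"
)


def load_movies(handle):
    """Read TMDB-style rows from a file-like object and convert column types."""
    movies = []
    for row in csv.DictReader(handle):
        movies.append({
            "id": int(row["id"]),
            "title": row["title"],
            # Empty strings become None so missing values are explicit.
            "release_date": row["release_date"] or None,
            "vote_average": float(row["vote_average"]),
            "vote_count": int(row["vote_count"]),
            "status": row["status"],
            "revenue": int(row["revenue"]),
            "runtime": int(row["runtime"]),
            # Assumption: the adult flag is stored as the text "True" or "False".
            "adult": row["adult"] == "True",
            "original_language": row["original_language"],
            "popularity": float(row["popularity"]),
            # Assumption: genres are stored as one comma-separated string.
            "genres": [g.strip() for g in row["genres"].split(",") if g.strip()],
            "poster_path": row["poster_path"] or None,
        })
    return movies


movies = load_movies(io.StringIO(SAMPLE_CSV))
```

To read an actual download instead of the sample, the same function could be passed an open file handle, adjusting the conversions to whatever encoding the real file uses.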

MovieLens (with Reviews)

You can find the data here.
This dataset contains 32 million ratings that we used to perform sentiment analysis and compute a success score metric alongside other features.