Building a Predictor for Movie Ratings Final Report Haowen Cao [email protected] Daniel Holstein [email protected] Casatrina Lee [email protected]
Total Page:16
File Type:pdf, Size:1020Kb
Building a Predictor for Movie Ratings Final Report Haowen Cao [email protected] Daniel Holstein [email protected] Casatrina Lee [email protected] Abstract directors who are involved in it. We modeled the database as a directed bipartite graph with actors, We investigate the degree to which movie directors and movies as nodes. We chose to focus popularity is a function of cast and genre. primarily on directors and actors based on the as- We mined the open-source IMDb dataset, sumption that those two roles would be the most and restructured it into two bipartite graphs visible to audiences and therefore most influential (a graph of movies to actors, and a graph in determining the popularity of the movie. Edges of movies to genres). Using the HITS algo- are drawn from actors and directors to movies when rithm, we assign each actor and director a the particular person has worked on that particular hub score and an authority score based on movie. the popularity of future projects with other Using the HITS algorithm, we would be able to collaborators. Using a similar technique, assign each actor and director a hub score and an we also use the ratings of the movie to pre- authority score based on the popularity of his past dict the popularity of a particular genre. We projects, and therefore train a model to be able to also used PageRank on 2 bipartite graphs predict the popularity of future projects with other (movies to actors, and movies to directors). collaborators. Using a similar technique, we also While PageRank did not achieve another use the ratings of the movie to predict the popularity rating prediction method, it did yield in- of a particular genre. teresting insights into the structure of the PageRank, HITS close cousin, was also run on bipartite graphs. two derived IMDb graphs: (1) the bipartite graph from movies to actors, and then the bipartite graph 1 Introduction from movies to directors. Naturally, we expected IMDb is often used as a platform by which users results similar to those obtained via the HITS algo- can obtain relevant information about specific rithm, and to obtain yet another rating prediction movies and artistes. Such information includes method. While this was not the case, applying objective characteristics such as the cast and pro- PageRank did yield some interesting insight into duction crew, awards information, genre and pro- the structure of the bipartite graphs. duction dates of the movies, as well as the biogra- 2 Review of Prior Work phies, accolades and other projects of a particular artiste in question. The database also provides an We reviewed other papers who had conducted var- insight into the popularity of the movie as it allows ious analyses on the IMDb dataset and focused users to rate movies, as well as leave comments on mainly on papers that strove to utilize network fea- message boards. In this project, we focus mainly tures to predict movie popularity. using ratings as a quantitative measure of the popu- One paper (Oghina et al., 2012) utilized sev- larity of a movie, relying on the fact that the vast eral social media platforms to predict IMDb movie number of users who rate the movies, combined ratings. Using a cross-channel prediction model, with IMDb’s weighted-mean algorithm, allow for the authors focused mainly on data from Twitter the rating information to provide a somewhat accu- and Youtube, extracting the number of tweets, the rate representation of the movie’s popularity. number of reviews and the number of upvotes and Our project hopes to be able to build a predic- downvotes to draw conclusions on the popularity of tor of a movie’s success based on the actors and the movie. The main statistical method used were mean absolute error (MAE) and root mean squared obtain the data relevant to our investigation. For error (RMSE). To distinguish between negative and movies, we extracted the title of the movie, the year positive reviews, the authors parsed the reviews and of release, the number of ratings it has received utilized sentiment analysis and the Spearman co- from users, as well as its rating as shown on the efficient to classify reviews such that they would IMDb website. For actors, actresses and directors, become qualitative indicators. The success of this we curated a list of movies that each actor or direc- model gives us a basis that social networks can tor has worked on previously. Finally, we grouped act as accurate predictors of a movie’s popularity. movies according to genre in order to investigate We hope to build on this finding by further incor- the correlation of genre and the popularity of the porating objective features of movies (eg. cast, respective movies. crew, genre, etc) and zoom in on the IMDb dataset Table 1 shows the basic relevant features of our instead of cross referencing other social media net- network. works. feature value Another particularly relevant paper by Grujic, movie nodes 20828 2008, had also modeled the dataset as a bipartite actor nodes 62863 graph to create a movie recommendation network. actress nodes 34204 However, she focused mainly on comments by director nodes 5671 users on movies rather than on ratings. She used cast movie edges 866702 the movies that a user had commented on as a pre- ! movie genre edges 50741 dictor of the kind of movies that the user would be ! Average labels of movies 2.43 interested in and constructed three distinct graphs Average rating of movies 6.39 to investigate clustering within the network. Her first and most basic model was to utilize the pres- Table 1: the relevant features of the network ence of a comment as an indication of interest by a user on a movie. Her second graph generated rec- ommendation lists for movies based on the number 4 The Algorithm of users each pair of movies had in common. Her 4.1 HITS final network was a user-preferential random net- We used the HITS algorithm as mentioned in class work in which movies were randomly connected to in order to analyze the links between the nodes of 10 other movies chosen from a probability distribu- our graphs. However, we did make some modifi- tion proportional to the number of users who had cations so as to model our dataset more accurately. commented on those movies. Figure 1 shows the modified HITS model we use Grujic found that there was a strong case of clus- for our network. In this network it contains three tering within the IMDb network, and our project different types of nodes: can build on this presence of clustering to build our predictor. While both papers mentioned above fo- The people nodes p : people including actors, • i cused on comments, we choose to focus on a more actresses and directors. quantitative measure - ratings. This allows us to The movie nodes m : movies including the have an idea of the how popular the movie is since a • i training set and the test set. movie with a rating of 9/10 is clearly more popular than one with 2/10. This should be an improve- The genre nodes gi: genres for all the movies. ment over the binary variable of comments where • Grujic did not make a distinction between positive First, we create a network with movies, people comments and negative comments. A universally to use the HITS algorithm. This would be a bi- unpopular film could receive numerous negative partite network in which all edges can be drawn comments but would appear as popular’ as a box from an person to a movie that they have worked office hit in this model. on. In our model, the movie nodes are hubs and the people nodes are authorities. We evaluate the 3 Data Collection Process movie’s ratings and model it as the authority score representing the quality as a content and the total We obtained the raw data using IMDb API. Based sum of votes by experts. Correspondingly, the peo- on what we need to use, we parsed part of it to ple nodes represent the quality as an expert since genres, it makes no sense to make the sum as actor 1 or some other fixed number. movie genre actor 2. We have general weights for actors, actresses and directors. This is based on the assump- movie tion that they each have different extents genre of influence on the popularity of the movie. director Specifically, the director was given the high- est weight since we believe that the role of movie the director is most important in producing a director good movie. 3. For the authority score of people, instead of Figure 1: The HITS graph model for people, only storing the average score, we also stored movies and genres the total number of movies that the person has participated in. Therefore we are able to use they are crucial factors in making up the popularity the weighted score in the algorithm. of the film. Second, we create a bipartite network with 4.2 PageRank movies and genres to use the HITS algorithm. We explored the structure of the IMDb graph us- Edges are drawn from a movie to a genre if a movie ing the PageRank algorithm. In this approach, we is part of that genre. This time, movie nodes are ran PageRank on two distinct, undirected, bipar- hubs and genre nodes are authorities. The genre tite graphs: one from movies to actors, the other score is useful to adjust the score of movies since from movies to directors.