Movie Success Prediction Using K-Nearest Neighbor Algorithm Bramesh S M, Assistant Professor, P

International Journal for Research in Engineering Application & Management (IJREAM) ISSN : 2454-9150 Vol-06, Issue-05, Aug 2020 Movie Success Prediction using k-Nearest Neighbor Algorithm Bramesh S M, Assistant Professor, P. E. S. College of Engineering, Mandya, India, [email protected] Puttaswamy B S, Assistant Professor, P. E. S. College of Engineering, Mandya, India, [email protected] Abstract - The quantity of movies produced and / or released in the globe is growing exponentially except during an outbreak of a pandemic disease like COVID-19. The success proportion of the movie is of highest importance since huge amount of money is financed in the creation of each of these movies. In such a situation, prior knowledge about the failure and / or success of a specific movie and knowing the factors affecting the movie success will benefit the production communities, since these forecasts will give them a reasonable idea of how to go about with the campaigning and advertising, which itself is an expensive issue altogether. So, the prediction of the failure and or success of a movie is very much important to the film industry. Hence, we aim to explore Internet Movie Database (IMDb) in order to predict the imdb_score, which will reflect the movie success, using k-Nearest Neighbor (k-NN) algorithm. From the experimental analysis, we found that k-NN on the test data got the highest accuracy of 68.64% when k = 9. Obtained results finally show that the prediction model using k-NN has achieved a decent prediction accuracy on the IMDb dataset. Keywords — k-NN, Predictive analytics, Movie success, Critical review, IMDb score, R. I. INTRODUCTION office revenue ([3], [4], [5]). Few earlier work ([5], [6], [7]) has used IMDb data, to signify gross of a movie depending The global Box office revenue during the year 2019 was on stochastic and regression models. Few of them have 42.2 billion USD [1] with many new movies produced categorized movies has either flop or success based on their every year. In such a scenario, there exist both successful profits and applied binary classifications for prediction. The movies and also failure movies. Consequently, how can we extent of movie success does not exclusively depend on select a successful movie to relax and enjoy on our profits. Success of movies depends on a several issues like weekends? Usually, what we do to answer this question is? director, actresses / actors, background story, the time of We look at the score of a movie or see its review on the release, etc. Further few researchers have developed a website. The www.imdb.com website is just a good prediction model with some pre-released data which were selection to refer at this time. The www.imdb.com contains used as their features [8]. In majority of the work, information about movies and also the observations from researchers have considered a very limited feature. Thus, viewers. The scores given on the IMDb website are their model’s performance was poor. Moreover, they have extremely recognized by the public, reflecting the content also ignored involvement of viewers on whom success of a excellence as well as the viewer’s choice to some level. If movie typically depends. Few researchers have also adopted one can come up with a model, which will predict the score several applications of Natural Language Processing (NLP) on IMDb website, then this model would be a powerful for sentiment analysis ([9], [10]) and collected movie model to identify how successful a movie will be? Even reviews for their test domain. In such a scenario, the before it is released. Hence this research work, try to prediction accuracy lies on how large the test domain is. display the vital factors inducing the score on IMDb website Having a domain of smaller size is not a better idea for and proposes an effective approach to forecast the movie measurement. Yet again, most of them did not consider success using the IMDb dataset [2]. critic’s reviews into account. Also, viewers’ reviews can be II. LITERATURE SURVEY biased as a fan of actress / actor may fail to give an unbiased view. In [11] A. Sivasantoshreddy et. al has used Movie success mainly be subjected to the perceptions hype analysis to predict a movie box-office opening that, in what way the movie has been justified by the prediction. In some cases, movie success prediction was viewers. In the early period, many people ranked gross box made through neural network analysis ([8]). Some 133 | IJREAMV06I0565056 DOI : 10.35291/2454-9150.2020.0565 © 2020, IJREAM All Rights Reserved. International Journal for Research in Engineering Application & Management (IJREAM) ISSN : 2454-9150 Vol-06, Issue-05, Aug 2020 researchers have come up with the prediction model using variables. In this step, we also found that the response social media, social network and hype analysis ([12], [13]) variable “imdb_score” was numerical, and the predictors where they calculated positivity and the number of were mixed with numerical and categorical variables. It is comments related to a particular movie. The majority of the also observed that IMDb data had duplicates and hence we work seems to be focused towards analysis of movie removed duplicates from the IMDb data. After removing reviews or user-specific preferences. However, using the duplicates, we found number of observations left = machine learning techniques to analyze the attributes of a 4998. We also observed that movie titles had (Â) as a movie is relatively unexplored method for predicting its special character at the end and some movie titles had success. Hence this paper aims at using IMDb 5000 movie whitespaces, hence we deleted them from the movie titles. dataset for analyzing the attributes of a movie and then to Finally, in this step, we found that there was a small amount develop a prediction model to predict movie success using of difference in the averages of imdb_score related to k-NN. diverse genres, and also found that nearly all the averages are in the same range of 6 ~ 8. So, we removed input III. DATA DESCRIPTION variable “genres” as it was not really related to the The IMDb dataset used in this paper is obtained from imdb_score. Kaggle website. The IMDb dataset has 28 variables for C. Data Cleaning 5043 movies, spanning across a hundred years in sixty-six countries. In this work, we have used “imdb_score” has the In this step, firstly we dealt with missing values in each output variable and other 27 variables have been used as column, by using colSums() function in R to total NA in possible input variables. every column. Then we used a heatmap to visualize the missing values. IV. PROPOSED METHODOLOGY Fig. 2: Heatmap to visualize missing values From the heatmap, we found that the budget and gross input variables have many missing values, and as we want to retain these two variables for further analysis, we deleted rows with null values for budget and gross. The result of this operation is not too bad, because this operation omitted Fig. 1: The proposed methodology 23% of the observations. Now our dataset has 3857 As shown in Fig. 1, the proposed methodology includes observations. Secondly, we analyzed aspect_ratio input the following steps: variable and found that aspect_ratio input variable had the A. Data Extraction highest number of missing values in the IMDb dataset. Then from the means of imdb_score for diverse aspect_ratios, we In this step, data is extracted (IMDb data) from Kaggle found that there was no significant difference, and all the website in order to analyze which type of movies are more means was falling in the range of 6.3 ~ 6.8. Hence, we successful, in other words, get higher imdb_score? removed this input variable as it won’t affect our following B. Data Exploration analysis. Then, we dealt with 0s and NAs. i.e., in the case of numeric predictors, we deal with 0s and NAs by replacing In this step, the data extracted in the previous step (IMDb 0s with NA, and then replacing all NAs with their respective data) is loaded into the R workspace by loading the suitable column mean and in the case of categorical predictors, packages in R. After loading the IMDb data, we applied Blanks (“”) represent missing values and these missing str(object, …) function in R on the IMDb data to understand values cannot be replaced with sensible data, hence we the data at the high level. i.e., knowing about number of deleted those rows. Since IMDb data has gross and budget observations, number of variables in the IMDb data. We information, we added two columns: profit and percentage found that the IMDb data had 5043 observations and 28 return on investment for further analysis. We removed input 134 | IJREAMV06I0565056 DOI : 10.35291/2454-9150.2020.0565 © 2020, IJREAM All Rights Reserved. International Journal for Research in Engineering Application & Management (IJREAM) ISSN : 2454-9150 Vol-06, Issue-05, Aug 2020 variables like colored and language from further analysis as Fig.5 shows the top twenty movies based on its it was observed that both input variables were almost Percentage Return on Investment. Since movie profit does constant. Finally, around 8% movies are from UK, 79% not give a clear picture about its financial success over the from USA, 13% from other countries, we clustered other years, this analysis, over the absolute value of the Return on countries together to make this categorical input variable Investment (ROI) across its Budget, would provide good (countries) with less levels: UK, USA, Others. results. Analysis on the Commercial Success acclaimed by the movie (Gross earnings and profit earned) versus its D.

Movie Success Prediction Using K-Nearest Neighbor Algorithm Bramesh S M, Assistant Professor, P

Elmore Leonard, 1925-2013

EMMY®-WINNING COMEDIAN PATTON OSWALT SET to HOST the 21St ANNUAL ART DIRECTORS GUILD’S EXCELLENCE in PRODUCTION DESIGN AWARDS

Patton Oswalt

JUSTIFICATION and Knowledgenew Studies in Epistemology JONATHAN BENNETT, University Board of Consulting Editors: GEORGE S

Newsday - Long Island, N.Y

TASER™ Guidelines

You'll Never Leave Harlan Alive": Using FX's Justified Ot Form a Cultural Understanding of Crime in Harlan County, Kentucky Morgan Stone Eastern Kentucky University

Electronic Formatting Requirements and Guidelines of the Second District Pursuant to California Rules of Court Rules 8.72(A) and 8.74(B)

Scott D. Rowland, PE, LEED AP

World Health Forum World Health Forum a Quarterly Record of Ideas, Arguments, and Experiences Contributed by Health Professionals the World Over

Were Standard Oil's Fault Than Justifying Their Existence

Newsday, Inc. V. Morgenthau Supreme Court of New York, Appellate Division, First Department February 17, 2004, Decided ; February 17, 2004, Entered 2868N