Forecasting the Popularity of Applications: An Analysis of Textual and Graphical Properties

Harro van der Kroft
Master Thesis for Econometrics - Big Data Track
Faculty of Economics and Business, Section Econometrics

Abstract

This thesis contributes to the scarce literature on App Store content popularity prediction. By scraping data from the Apple App Store, we form feature sets in the textual and graphical domains. The methodology employed accommodates data from other online content sources and combines the feature sets by means of late fusion. This thesis investigates the predictive power of Neural Networks and Support Vector Machines in parallel and, by layering different feature sets, shows that there is an added benefit in combining them. We reveal that the methodology outlined in this thesis has predictive power.


Acknowledgments

I would like to sincerely thank my supervisor Prof. Dr. M. Worring for his supervision, patience, and enthusiasm. Marcel's passion has furthered my interest in the field of AI more than I could have hoped for. Furthermore, I would like to thank Leo Huberts, Diederik van Krieken, Frederique Arntz, and Dominique van der Vlist for their input and constructive criticism.

Statement of Originality

This document is written by Harro van der Kroft, who declares to take full responsibility for the contents of this document.

I declare that the text and the work presented in this document are original and that no sources other than those mentioned in the text and its references have been used in creating it.

The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.

Contents

1 Introduction
2 Literature Review
2.1 Internet Movie Database
2.2 Online content analysis
2.3 Popularity Prediction
2.4 Deep Learning and Image classification
2.5 Modality in features
3 Theory
3.1 Statistics
3.2 Natural Language Processing
3.2.1 TF-IDF
3.2.2 LDA
3.2.3 Pre-processing & stemming
3.2.4 Topic Number Estimation
3.3 Artificial Neural Networks
3.3.1 Feed-forward network
3.3.2 Activation layers
3.3.3 Network Training
3.3.4 Loss function
3.3.5 Other layers
3.3.6 Normalization
3.4 Support Vector Machine
3.4.1 Ensemble Learning
3.4.2 Kernel
3.4.3 Parameters
3.5 Synthetic Sampling
4 Methodology
4.1 Pre-processing
4.2 Feature Extraction
4.2.1 Image
4.2.2 LDA
4.2.3 Genres
4.3 Prediction Goal
4.3.1 Continuous
4.4 Prediction
4.4.1 Sampling
4.4.2 Support Vector Machine
4.4.3 Neural Network
4.5 Summary
5 Experiment
5.1 Origin & Explanation
5.1.1 Statistics
5.2 Genres Feature Set
5.2.1 Statistics
5.2.2 Results
5.3 Image Feature Set
5.3.1 Results
5.4 Description Feature Set
5.4.1 Parameters
5.4.2 Results
5.5 Title Feature Set
5.5.1 Parameters
5.5.2 Results
5.6 Fusion
5.6.1 Neural Network
5.6.2 Support Vector Machine
5.7 Remarks
6 Conclusion
6.1 Future Work
References

1 | Introduction

With the introduction of the Apple iPhone in 2007, smartphones have become a fixture in the online consumption of media. There are over 3.9 billion active mobile data subscriptions worldwide, with estimates for 2022 reaching 6.9 billion (Ericsson, 2017, p. 2). Furthermore, in 2014 the monthly data transfer associated with these active subscriptions already exceeded 2.1 GiB (about 3 Compact Discs). The fact that people spend an ever-increasing amount of time on their phones (Meeker, 2014) means that online content consumption is a large and growing part of people's lives. Companies such as Google, Netflix, Amazon, Hulu, Apple, Microsoft, and many more try to captivate users with applications, movies, and online content related to their respective fields and businesses.

The mobile app development market is large. On August 16th, AppShopper.com reported that there were 1.6 million applications available in the Apple App Store (AppShopper, 2017). A recent article by Forbes.com showed that over the 2016 calendar year the total amount spent in the App Store was $30 billion, with developers receiving over $20 billion (Forbes, 2017). All in all, these numbers show that there is a lot of revenue to be made in the online content business, with the App Store being a prime example of a medium serving online content.

Online content, however, is very diverse: it ranges from images on Flickr to Microsoft Excel in the Android app store. There is a lot of variety, and people's attention span is intrinsically short (Szabo and Huberman, 2010a, p. 88). Therefore, the added value of each item has to be clearly communicated to the consumer. When doing so, one must consider the different feature sets pertaining to an item. Firstly, the graphical domain: thumbnails, videos, layouts, and presentation. Secondly, the textual domain: descriptions, titles, and reviews. Lastly, more meta attributes may be considered: awards, mentions in other online content, and, for movies, actors.

However, this diversity means nothing without a common denominator on which to pin the added value per consumer. A clear example is the rating of an item. Ratings allow the consumer to express sentiment, and allow the content provider to have a proxy for the statistic they actually need: popularity.

Popularity is a vague construct, so we need to quantify it. One candidate measure is the number of views (henceforth simply views). Views capture a good part of popularity, but there is a fatal flaw in using this statistic as a proxy: it does not show the sentiment for a particular item. An item may have a large number of views because of marketing but still fall short of consumer expectations. As stated before, many content providers allow for the rating of an item; an example is the rating of an application in Apple's App Store, a 1-5 rating to show sentiment.

Companies such as the aforementioned giants need to anticipate the effect of their next move. The biggest problem for most companies is: how will my future content perform? Will HBO produce another season of their latest TV show, or will Netflix produce a new series in its entirety? An approximation of the success that content can garner online gives these companies more security.

This paper answers the following question by developing a tool set/algorithm:

Is it possible to predict the average rating of App Store content?

With the following sub-questions:

1. How do Support Vector Machines (SVM) and Neural Networks (NN) perform?
2. How does performance depend on the exploitation of different feature sets?

The tools used within this paper are based in the realm of machine learning: Neural Networks, Support Vector Machines, Latent Dirichlet Allocation (LDA), and ensemble learning. NN and SVM were chosen because they allow a classification problem to be solved.

The expectation is that a decent approximation of the average rating of online content can be ascertained, with some probable caveats. Firstly, meta data that would be relevant to the popularity classification but is not readily available is most probably omitted (e.g. the marketing budget for an application, or the popularity of an actor at the time of release). Secondly, companies are not too keen on disclosing all information regarding their online content. Lastly, the algorithms used have benefits but also disadvantages, which will be discussed.

This paper first covers the relevant literature pertaining to textual and graphical analysis of content in chapter 2. Afterwards, chapter 3 introduces the theoretical constructs, laying the foundation for the reader to understand the methodology outlined in chapter 4. An experiment is then performed on App Store data in chapter 5. Finally, a conclusion is drawn in chapter 6.

2 | Literature Review

The main focuses of this chapter are: the textual analysis of online content, the analysis of graphical online content, and online content analysis in general. The general developments in the field of machine learning pertaining to classifying images will also be discussed.

2.1 Internet Movie Database

Recent research on online content popularity has mainly focused on the popularity of movies using the Internet Movie Database (IMDb), in papers such as those by Eren and Sert (2017) and Pramod and Joshi (2017). The former focuses on a binary classification, flop or success; the latter focuses on predicting a rating. The work by Eren and Sert (2017) is of particular importance for this thesis because it combines mixed data types. Other work includes that by Latif and Afzal (2016), which couples econometric regressions and machine learning to attain a rating.

In the paper by Hsu et al. (2014) the authors tackle a problem similar to this thesis: predicting popularity. The authors analyze 32,968 movies, with a focus on the graphical part, and achieve strong performance with neural networks (an absolute prediction error of 0.82). They use 31,406 movies as a training set, a 95/5% training/testing split. The authors use key image components to identify features contained in images: using color histograms, gradient histograms, texture, and objects, the popularity of an image is predicted by means of Support Vector Regression. The work by Oghina et al. (2012) uses YouTube commentary sentiment for prediction; others focus only on box office revenue (Mestyán et al., 2013). In short: IMDb is a well-structured data set which has been analyzed thoroughly.

2.2 Online content analysis

Since IMDb data is well prepared and thoroughly analyzed, we shift our focus: other online categories with lower-quality data are of interest. This section analyzes relevant papers on this subject.

The paper by Khosla et al. (2014) focuses on the popularity of online images. Their paper uses a data set consisting of 2.3 million images from Flickr, an image sharing site. They use meta-data consisting of views and social cues; for them, a social cue is the number of friends of the photo's uploader. The paper does not, however, concern itself with the rating of the online content, but with the number of views. This is because Flickr does not allow for an average rating on a nominal scale; it does allow for thumbs up or down, which provides less information than an integer scale from e.g. 1 to 10. The authors used views as a proxy for popularity, and this thesis will likewise use a proxy for popularity.

Other branches of online content that have been extensively researched are YouTube and Twitter, with research such as the works by Bae and Lee (2012) and Szabo and Huberman (2010b). The former investigates the factors that drive the popularity of messages on Twitter, mostly based on sentiment. The latter focuses on the popularity of videos on YouTube by looking at the then-popular website Digg.com, associating the popularity of YouTube videos with the number of likes on Digg. The paper by Szabo and Huberman (2010a) also concerns itself with Digg.com, using the information gathered on the site to model the popularity of applications. Despite the novel approach, it relies on data outside the applications themselves, which we see as a limitation. We will instead use data more closely related to the online content itself (its meta data) as a base.

2.3 Popularity Prediction

Harvey et al. (2011) predict the rating given by a user. An interesting point is that by predicting the rating of a user, one can essentially predict the overall popularity of an item with the same algorithm. The paper by Malmi (2014) investigates the connection between the usage of an application and its popularity. The data set used in the paper describes the usage of the application and the user's phone, and the paper tries to quantify the correlation between popularity and usage data. They find a surprisingly small correlation between the popularity of an application and its usage.

In the paper by Mazloom et al. (2016), an analysis is done of the different driving forces behind the popularity of brand-related online content posts on social media. It combines features from the visual and textual properties of online content. One of the foremost results of the paper is that the visual and textual properties complement each other. In this thesis, however, the properties are not distilled into engagement properties: we are not interested in engagement or sentiment, but the use of visual and textual properties is of direct interest.

2.4 Deep Learning and Image classification

This thesis uses deep learning for feature extraction from images, and the following section will review the important contributions made in image classification.

In 2009 the cleanly-labelled ImageNet data set (Deng et al., 2009) was introduced, consisting of millions of sorted and labelled images. Because a perfectly labelled image set is a gold mine within the field of machine learning, it initiated an arms race in picture labeling and object recognition. The ImageNet data is the basis for the Large Scale Visual Recognition Challenge (ILSVRC). This competition, and the rules it specified, allowed algorithms to compete in image recognition and object classification. The two trophy accuracy tests were the top-5 and top-1 error rates. The first is defined as the percentage of classifications whereby the correct label is not among the top-5 labels predicted by the model; the latter is defined in the same manner for the single most probable label. The competition first took place in 2009, and the research community concerned with image classification and object recognition was boosted significantly, with more sophisticated models and better error rates as a result.

Among these methodologies is the work by Krizhevsky et al. (2012). The network described in their paper is called AlexNet, a convolutional neural network consisting of only 8 layers: the first 5 are convolutional layers, followed by 3 fully connected layers with dropout layers in between. AlexNet was also noteworthy in that it was trained across two GPUs. It reached a top-5 error rate of 15.3%, a full 10.8 percentage points ahead of the runner-up. The concepts of these layers are explained in section 3.3. AlexNet is discussed here as it is a widely used and well-documented entrant to the ILSVRC; there are, however, stronger entrants. The teams at the top of ILSVRC 2016 were CUImage, HIKVision, and Nuist (ILSVRC, 2016).

2.5 Modality in features

The previous sections concentrated on the gathering of data and therefore of features. When combining these data points, it is necessary to talk about the fusion of features. The paper by Snoek et al. (2005) concerns itself with the fusion of multiple types of features, illustrated by two types of fusion: early and late fusion. The former fuses the modalities in feature space, for example by concatenating price data and colour data into one data set to forecast the sales of ice cream. In late fusion, the feature sets are trained on individually to form intermediate outcomes; these outcomes, and the per-class probabilities that arise, are then trained upon again. This can be applied to regression, SVM, and NN.
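To make this concrete, the sketch below shows one way late fusion can be set up; the helper name `late_fusion_fit`, the SVM base classifiers, and the logistic-regression meta-classifier are illustrative assumptions, not the exact setup of Snoek et al. (2005).

```python
# Minimal late-fusion sketch: one classifier per feature set (modality),
# whose class probabilities are fused and fed to a second-stage learner.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

def late_fusion_fit(feature_sets, y):
    """feature_sets: list of (n_samples, n_features_m) arrays, one per modality."""
    base_models, probas = [], []
    for X in feature_sets:
        clf = SVC(probability=True).fit(X, y)  # first-stage classifier
        base_models.append(clf)
        probas.append(clf.predict_proba(X))    # per-class probabilities
    meta_X = np.hstack(probas)                 # fuse in decision space
    meta = LogisticRegression(max_iter=1000).fit(meta_X, y)
    return base_models, meta
```

In practice the meta-classifier is usually fit on held-out (cross-validated) probabilities rather than in-sample ones, to avoid over-fitting the fusion stage.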

The main take-away from this analysis of recent research is that a lot of work has been done in the respective fields. However, the research is either very concentrated on a certain field of online content (IMDb, Flickr, YouTube, Twitter) or on a use case (fraud detection, revenue prediction, usage prediction). To overcome this disadvantage, this paper posits a more general model, with an experiment on App Store data as an example.

3 | Theory

This chapter lays the theoretical groundwork for the methodology outlined in chapter 4. Firstly, section 3.1 covers some basic statistics pertaining to the field of machine learning. Secondly, section 3.2 covers the textual analysis part. Thirdly, section 3.3 covers Neural Networks, and section 3.4 covers the theoretical side of SVM. Finally, section 3.5 covers data sampling.

3.1 Statistics

This section includes some primer information on Bayesian statistics, which is widely used within the field of machine learning. In this form of statistics, the so-called prior probability of an event is the probability assigned before relevant data is taken into account. Conversely, the posterior probability is the probability distribution of an unknown random variable after relevant data has been taken into account; "posterior" here means taking into account relevant evidence from an experiment. The posterior probability can be written in textual form as:

Posterior probability ∝ Likelihood × Prior probability

We shall illustrate with an example: given an experiment (data) $d$ and parameters $\theta$, we may write the above relation as

$$P(\theta \mid d) = \frac{P(d \mid \theta)\,P(\theta)}{P(d)}.$$

If the posterior $P(\theta \mid d)$ belongs to the same family of distributions as the prior $P(\theta)$, the prior and posterior are called conjugate distributions, and the prior is called the conjugate prior (Pratt et al., 1995).

The Dirichlet distribution, in this thesis denoted by $\text{Dirichlet}(\alpha)$, is a family of continuous multivariate probability distributions parametrized by a vector $\alpha$ of positive real numbers. The Dirichlet distribution is the multivariate generalization of the Beta distribution. Dirichlet distributions are often used in Bayesian statistics, as the Dirichlet distribution is the conjugate prior of the Multinomial distribution.


3.2 Natural Language Processing

The field of Natural Language Processing (NLP) is a field of artificial intelligence concerned with processing human language in a way that computers are able to handle; in particular, it is concerned with processing large corpora of text. The challenges in natural language processing involve speech recognition, dialogue systems, and generating natural language, amongst others. We define some notation to be used in this section:

Token A unit from the vocabulary, indexed by $\{1, \dots, T\}$. A token can be seen as a lowercase version of a word without punctuation. Using unit vectors we distinguish between the tokens used: if token $i$ is used, the vector $e_i$ represents this word, where $e_i$ is a vector of all zeroes except for a one in the $i$-th spot. If a token is not used in a document, the resulting vector is the zero vector.

Document A sequence of tokens. A document is defined as the set of tokens contained within it,

$$d = \{t_1, t_2, \dots, t_{W-1}, t_W\}, \tag{3.1}$$

with $W$ the number of words present in the document after processing.

Corpus The set of all documents, defined as $C = \{d_1, d_2, \dots, d_{N-1}, d_N\}$, with $N$ the number of documents in the corpus.

To save space and avoid computational troubles, all zero vectors can be omitted.

3.2.1 TF-IDF

Within NLP the need arises to rank words by their significance: either a word does not contain information relevant to the task ('and', 'or', 'the') or the word is too rare within the corpus. There needs to be a balance between rarity and information. Before tackling the concept of TF-IDF we introduce some notation. For any set $S$, we denote the number of elements in the set by $|S|$; for example, $|\{1, 5, 50, 512\}| = 4$.

Now suppose that we have a corpus of text documents and we wish to rank which documents are most relevant to our query 'a tidy room'. A simple way of querying this data set is by eliminating all documents that do not contain the words 'a', 'tidy', and 'room'. This, however, creates a two-fold problem: we are left with a lot of documents, and the documents that are left are not ranked by relevance. To distinguish between the leftover documents we count the frequency of the terms in each document; this is aptly named the Term Frequency (TF). This insight was first formulated by Luhn (1957). This thesis

uses the following definition for TF:

$$\text{TF}(t, d) = f_{t,d} = |\{t \in d\}|, \tag{3.2}$$

with $t$ indicating the token as discussed in section 3.2, $d$ the document being analyzed, and $f_{t,d}$ the raw count of the token in that particular document.

This notion of TF, however, does not control for the number of documents available in total. Take the word 'a' as an example: it is a common word, so the term frequency as defined by equation 3.2 will wrongly assign high importance to documents containing the word 'a'. We would assume that the words 'tidy' and 'room' carry more weight in defining the importance of a text. We therefore employ the concept of Inverse Document Frequency (IDF):

$$\text{IDF}(t, C) = \log\left(\frac{N}{|\{d \in C : t \in d\}|}\right),$$

where $t$ again constitutes the token being researched, $C$ the corpus, $N$ the number of documents in the corpus, and $|\{d \in C : t \in d\}|$ the number of documents where the term $t$ is present. The assumption here is that only tokens actually present in the corpus are researched. The work by Sparck Jones (1972) created the basis for Inverse Document Frequency.

By combining Term Frequency and Inverse Document Frequency, we obtain a ranking function that is a trade-off between term frequency on the document level, and the frequency the token appears in the corpus. We calculate it as follows:

$$\text{TF-IDF}(t, d, C) = \text{TF}(t, d) \times \text{IDF}(t, C) \tag{3.3}$$
$$= |\{t \in d\}| \times \log\left(\frac{N}{|\{d^* \in C : t \in d^*\}|}\right) \tag{3.4}$$

This ranking function can be used to make topic modelling computationally more efficient: not all tokens present in the original texts remain in the documents after parsing.
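As an illustration, the following sketch computes equations 3.2 through 3.4 directly; the toy corpus and helper names are assumptions for demonstration only.

```python
import math
from collections import Counter

def tf(token, doc):
    # Raw count of `token` in the tokenized document `doc` (equation 3.2).
    return Counter(doc)[token]

def idf(token, corpus):
    # Log of (number of documents) over (documents containing `token`);
    # assumes the token occurs in at least one document, as in the text.
    n_containing = sum(1 for doc in corpus if token in doc)
    return math.log(len(corpus) / n_containing)

def tf_idf(token, doc, corpus):
    return tf(token, doc) * idf(token, corpus)

corpus = [["a", "tidy", "room"], ["a", "dog"], ["a", "tidy", "desk"]]
print(tf_idf("tidy", corpus[0], corpus))  # positive weight for 'tidy'
print(tf_idf("a", corpus[0], corpus))     # 0: 'a' occurs in every document
```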

3.2.2 LDA

Within the realm of machine learning and NLP, a topic model is a statistical tool for extracting the abstract 'topics' that are assumed to be hidden in a collection of documents. For our purposes, a topic model can lay bare the hidden semantic structures of a given text. This subsection covers a specific type of topic modelling: the Latent Dirichlet Allocation (LDA) model. First described by Blei et al. (2003), this type of topic model allows the differences between documents to be explained by latent factors, described by the presence of certain topics within documents.


3.2.2.1 Model

LDA is a generative model, which means that it models the joint probability distribution of words, documents, and topics: it describes a probabilistic process by which documents are generated, starting with a set of priors which are updated upon learning new information. This relates to the prior and posterior probabilities of section 3.1. Documents are represented as random mixtures over latent topics drawn from $\{1, \dots, K\}$; each topic in turn is characterized by a distribution over the words. For each document $d \in C$, LDA assumes the following generative process, as described by Blei et al. (2003):

1. We have a corpus $C$ consisting of $N$ documents, each with $N_i$ words.
2. For each document $i \in \{1, \dots, N\}$, choose a $K$-dimensional topic mixture $\theta_i \sim \text{Dirichlet}(\alpha)$, with $\alpha$ a prior.
3. For each topic $k \in \{1, \dots, K\}$, choose a distribution over the vocabulary $\varphi_k \sim \text{Dirichlet}(\beta)$, with $\beta$ a prior.
4. For each document-word combination $i, j$, with $i \in \{1, \dots, N\}$ and $j \in \{1, \dots, N_i\}$:
   a) Choose a topic $z_{i,j} \sim \text{Multinomial}(\theta_i)$.
   b) Choose a word $w_{i,j} \sim \text{Multinomial}(\varphi_{z_{i,j}})$.

In other words, we choose a set of priors $(\alpha, \beta)$ and build our model around them, assigning a topic and word distribution for each combination. The paper by Hong and Davison (2010) shows that the method of choosing priors is of importance; topic models typically assume symmetric Dirichlet priors, where $\alpha$ and $\beta$ are chosen so that each topic, word, and document probability is the same. The paper by Wallach et al. (2009) suggests that an asymmetric $\alpha$ and a symmetric $\beta$ allow for better performance than uniformly distributed priors. Intuitively, as explained in the paper by Andrzejewski et al. (2009, p. 1), this makes sense: in general a word or document will have a preference towards a certain topic, and this information should be incorporated in the priors, if known.
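As a minimal sketch of fitting such a model, the snippet below uses the gensim implementation of LDA with an asymmetric $\alpha$ and a symmetric $\beta$ (gensim's `eta`), in line with Wallach et al. (2009); the toy corpus is an assumption.

```python
# Fitting LDA with gensim; alpha='asymmetric' follows Wallach et al. (2009).
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [["tidy", "room", "puzzle"], ["arcade", "racing", "game"],
         ["puzzle", "game", "room"]]
dictionary = Dictionary(texts)
bow = [dictionary.doc2bow(t) for t in texts]   # bag-of-words corpus

lda = LdaModel(corpus=bow, id2word=dictionary, num_topics=2,
               alpha="asymmetric",   # asymmetric document-topic prior
               eta="symmetric",      # symmetric topic-word prior
               passes=5)
print(lda.print_topics())
```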

3.2.3 Pre-processing & stemming

Before the model can analyze a text, some parsing needs to occur: by removing all punctuation, the words can be converted to tokens. Because texts are written to be grammatically sound for a human, "work", "working", and "worked" would be seen as separate tokens. Within natural language there are families of related words that share the same semantic meaning; by correcting for these differences through stemming, we retain more information than if we did not (Lovins, 1968).
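A minimal pre-processing sketch along these lines is shown below; the concrete stemmer (Porter, via NLTK) is an illustrative choice, since the text cites Lovins (1968) for stemming in general.

```python
# Lowercase, strip punctuation, tokenize, then stem so that
# "work"/"working"/"worked" collapse to one token.
import re
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def preprocess(text):
    text = re.sub(r"[^\w\s]", " ", text.lower())  # remove punctuation
    return [stemmer.stem(tok) for tok in text.split()]

print(preprocess("Working hard, she worked on the work."))
# ['work', 'hard', 'she', 'work', 'on', 'the', 'work']
```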

3.2.4 Topic Number Estimation

In the field of topic modelling it is critical to have a correct intuition about the topics: they have to be modelled to actually represent the text being analyzed. By creating a suitable criterion, we are able to deduce the number of topics $K$.

One way to do so is via perplexity. Perplexity is used in information theory as a measurement of how well a probability distribution or model predicts a given test sample. In our case it can be used to determine the value of $K$ for which the fitted distribution best fits the sample (Blei et al., 2003, p. 1008).

There are, however, problems with the use of perplexity for estimating $K$. Most notably, perplexity does not correlate strongly with human judgment, as outlined in the paper by Chang et al. (2009): they tested multiple measures of model likelihood and correlated them with human judgment in large-scale user studies. They conclude that optimizing for model likelihood can lead to fewer semantically meaningful topics.

An alternative method for evaluating the optimal number of topics in LDA is based on so-called topic coherence, as suggested in Chang et al. (2009). Topic coherence (in the paper: $C_v$) is a measure of how interpretable the topics are to humans. Coherence starts by picking the top-$N$ words, sorted by term weight within the topic, and then calculates how similar these words are to each other. There are multiple methods for doing so, almost all of which are outlined in the paper by Röder et al. (2015). The authors performed an analysis of the various methods and correlated them with human judgment; the method called $C_v$ was found to be the most highly correlated of all. The method makes multiple passes over the corpus $C$, accumulating both term occurrence and co-occurrence counts (how many times a word is used in conjunction with another word), and does so for the top-$N$ words in each topic.
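A small sketch of scoring candidate $K$ values with the gensim implementation of $C_v$; the toy corpus and the candidate values are assumptions standing in for a real sweep.

```python
# Selecting K by C_v coherence (Röder et al., 2015) as implemented in gensim.
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

texts = [["tidy", "room", "clean"], ["arcade", "racing", "game"],
         ["puzzle", "game", "clean"], ["tidy", "clean", "room"]]
dictionary = Dictionary(texts)
bow = [dictionary.doc2bow(t) for t in texts]

for k in (2, 3):  # a real application sweeps a much larger grid of K
    lda = LdaModel(corpus=bow, id2word=dictionary, num_topics=k, passes=5)
    cv = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                        coherence="c_v").get_coherence()
    print(k, cv)   # pick the K with the best coherence score
```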

3.3 Artificial Neural Networks

(Artificial) Neural Networks, or NN, are computing systems inspired by the biological neural networks that constitute brains. These systems are thought to learn progressively, without being programmed for the specific task at hand.

A Neural Network is a collection of units (outputs) and weights in a mesh called a net, analogous to the synapses and axons in a human brain. Each connection (synapse) between neurons can transmit a signal to another neuron. The receiving (post-synaptic) neuron can process the signal and apply a signal to downstream neurons connected to it. Neurons have a state, represented by a real number typically ranging between 0 and 1. Neurons and their connections have weights that may vary during learning; these weights can increase or decrease the strength of the signal sent downstream. The strength of a signal is learned from feedback given by a loss function associated with the output. Furthermore, the weights may have a threshold such that the aggregate signal from connected neurons is only propagated when it exceeds a certain level. An example task for a Neural Network would be the binary classification between "a building" and "not a building": the network would train on a data set containing features that constitute "a building" and "not a building", and from there learn to classify new samples.

3.3.1 Feed-forward network

The basic form of a Neural Network with one hidden layer consists of linear combinations of the $I$ input variables (or features) $\{x_i\}_{i=1}^{I}$ in the form

$$a_j = \sum_{i=1}^{I} w_{ji}^{(1)} x_i + b^{(1)}, \qquad j \in \{1, \dots, H\},$$

with $H$ the number of nodes in the hidden layer. The variable $b^{(1)}$ is a bias with respect to this particular layer. The quantities $a_j$ are called integrations. These integrations are transformed by the use of a (non-linear) activation function, which computes the new activation $z_j$, defined as

$$z_j = h(a_j).$$

The function $h(\cdot)$ is often chosen with the aim of keeping the result within certain boundaries for the next layers to work with (examples are given in subsection 3.3.2). The next layer uses the resulting $z_j$:

$$y_k = h\left(\sum_{j=1}^{H} w_{kj}^{(2)} z_j + b^{(2)}\right),$$

where the $w_{kj}^{(2)}$ are the weights associated with the output level $y_k$, and $k \in \{1, \dots, K\}$ with

$K$ the total number of outputs. This transformation $z_j \to y_k$ constitutes going from the hidden layer to the output layer; we again introduce a bias for this level. Graphically, the network can be represented as follows:

[Figure: input layer ($x_1$, $x_2$), hidden layer ($z_1$, $z_2$, $z_3$), output layer ($y_1$, $y_2$), with biases $b^{(1)}$ and $b^{(2)}$.]

Figure 3.3.1: A simple example of a neural network

The arrows in figure 3.3.1 going to and from the input, hidden, and output layers have associated weights $w_{ji}^{(l)}$, with $l \in \{1, 2\}$. Combining all stages and the output layer, we obtain

$$y_k(x, w) = h\left(\sum_{j=1}^{H} w_{kj}^{(2)} \, h\left(\sum_{i=1}^{I} w_{ji}^{(1)} x_i + b^{(1)}\right) + b^{(2)}\right) \tag{3.5}$$


We call the act of evaluating equation 3.5 forward propagation of information through the network (Bishop, 2006, p. 229). The network as defined in this section may easily be expanded by introducing new layers with their own biases, weights, and transformations (see subsection 3.3.5).
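The forward pass of equation 3.5 can be transcribed almost literally into code; the sketch below uses NumPy, with illustrative layer sizes and tanh as the activation $h$.

```python
# Direct NumPy transcription of equation 3.5 (forward propagation only).
import numpy as np

def h(a):                     # activation function (here: tanh)
    return np.tanh(a)

I, H, K = 4, 3, 2             # input, hidden, and output sizes (illustrative)
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(H, I)), 0.1   # first-layer weights and bias
W2, b2 = rng.normal(size=(K, H)), 0.1   # second-layer weights and bias

def forward(x):
    a = W1 @ x + b1           # integrations a_j
    z = h(a)                  # hidden activations z_j
    return h(W2 @ z + b2)     # outputs y_k

print(forward(rng.normal(size=I)))
```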

3.3.2 Activation layers

As mentioned in the previous subsection, the outputs between layers use activation functions. A number of activation functions with desirable properties will be discussed here: ReLU, sigmoid, tanh, and softmax.

1. $a_{\text{ReLU}}(x) = \max(0, x)$ (Nair and Hinton, 2010)
2. $a_{\text{sigmoid}}(x) = \frac{1}{1 + e^{-x}}$
3. $a_{\tanh}(x) = \frac{2}{1 + e^{-2x}} - 1 = 2\,a_{\text{sigmoid}}(2x) - 1$
4. $a_{\text{softmax}}(x)_j = \frac{e^{x_j}}{\sum_{k} e^{x_k}}$

$a_{\text{ReLU}}$ is less computationally expensive than $a_{\tanh}$ and $a_{\text{sigmoid}}$ because ReLU involves simpler mathematical operations and creates a less analog output (half of the output is 0). The outputs of $a_{\text{softmax}}$ lie in the range $[0, 1]$ and add up to 1, which makes it suitable for producing class probabilities.

3.3.3 Network Training

Given a training set comprising input vectors $\{x_n\}_{n=1}^{N}$ and associated target vectors $\{t_n\}_{n=1}^{N}$, we minimize the error function, taking the sum of squared errors as an example:

$$E(w) = \frac{1}{2} \sum_{n=1}^{N} \| y(x_n, w) - t_n \|^2 \tag{3.6}$$

To derive the optimal weights and biases in the network, the gradient of the error function must be found. We evaluate the gradient for each weight individually:

$$\frac{\partial E_n}{\partial w_{ji}} = \frac{\partial E_n}{\partial y_i} \frac{\partial y_i}{\partial z_j} \frac{\partial z_j}{\partial w_{ji}}$$

and find the local error signal involved in changing a weight. The Neural Network is optimized by observing and acting upon the change in the error function defined above. Each full pass over the training set is called an epoch.

3.3.4 Loss function

For many applications the objective is more complex than minimizing the number of misclassifications, because different errors carry different costs. An example can be seen in medicine: for a doctor it is more important to correctly identify an ill person than to avoid treating a healthy one. In other words: a patient with an illness should not be turned away (here a type-I error), but a healthy person could receive treatment that is unneeded (a type-II error).


We can formalize this issue by identifying a loss function, alternatively called a (negative) cost function, which provides a single overall measure of the loss incurred in taking the actions and decisions defined by the Neural Network. The goal of the Neural Network is to minimize this function; the optimal solution is the one which minimizes the loss function.

One potential (non-linear) loss function is the Cross Entropy Loss, which is useful when training a classification problem with $n$ classes. The loss for class $i$, using the data vector $d$, is

$$\text{loss}(d, i) = -\log\left(\frac{e^{d_i}}{\sum_j e^{d_j}}\right) \tag{3.7}$$
$$= -d_i + \log\left(\sum_j \exp(d_j)\right) \tag{3.8}$$

where $d$ is a vector containing the data (the raw class scores). The Cross Entropy Loss function may be generalized for imbalanced data sets: one can include class weights $w_i$, with $\sum_i w_i = 1$,

$$\text{loss}(d, i) = w_i \left(-d_i + \log\left(\sum_j \exp(d_j)\right)\right).$$

Other loss functions that may be considered are the L1 ($\sum |y - t|$) and L2 ($\sum |y - t|^2$) losses.
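For illustration, the (weighted) cross-entropy of equations 3.7 and 3.8 is available directly in PyTorch; the logits, target, and weights below are toy values. Note that PyTorch does not require the weights to sum to 1, so the normalization is an assumption carried over from the text.

```python
# Plain versus class-weighted cross entropy on a single toy sample.
import torch
import torch.nn as nn

logits = torch.tensor([[2.0, 0.5, 0.1]])   # raw class scores d
target = torch.tensor([0])                 # true class i
weights = torch.tensor([0.2, 0.3, 0.5])    # per-class weights w_i

plain = nn.CrossEntropyLoss()(logits, target)
weighted = nn.CrossEntropyLoss(weight=weights)(logits, target)
print(plain.item(), weighted.item())
```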

3.3.5 Other layers

The hidden layers described earlier are fully-connected layers: all nodes are connected to all of the previous layer's nodes and the next layer's nodes. There is a multitude of layer types, but the ones worth noting for this thesis are convolutional layers, pooling layers, and dropout layers; all are used by AlexNet (Krizhevsky et al., 2012).

The problem with plain Neural Networks is that they do not scale well with the inclusion of images. In AlexNet the input image size is 224 x 224 x 3 (3 because of the three main colour channels: Red (R), Green (G), and Blue (B)), a vector of size 150,528. Although a single layer of this size would be manageable, we would certainly want multiple layers; full connectivity would then lead to either over-fitting or an untrainable model. When dealing with high-dimensional inputs one may encounter the "dimensionality curse", where the number of possible parameters is larger than the number of training samples. Within the context of Neural Networks, pruning or lowering the number of parameters has been shown to be useful (Bengio and Bengio, 2000, p. 1).

Convolutional layers use the inputs of an image in a more geometric sense: the neurons are arranged in 3 dimensions: width, height, and depth. A filter of (for example) 5x5x3 then slides over the input volume. At each position we take the dot product with the filter $w$,

$$w^T x + b,$$


which in our case is a 75-dimensional dot product with a scalar as result. This operation is called a convolution. For a 32x32x3 input volume, for example, sliding the 5x5x3 filter over all unique positions yields a 28x28x1 output volume; the 28 comes from the fact that 32 - 5 + 1 = 28 unique positions are possible in each spatial direction. A convolutional layer is built up from a number of such filters applied to the same input; using more than one filter allows different types of information to be captured. Because there are multiple passes over the same pixels, the spatial information is preserved.
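The shape arithmetic of the 32x32 example can be checked directly; the sketch below is a minimal PyTorch verification.

```python
# One 5x5x3 filter over a 32x32x3 volume yields 28x28x1 (32 - 5 + 1 = 28).
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)   # (batch, channels, height, width)
conv = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=5)
print(conv(x).shape)            # torch.Size([1, 1, 28, 28])
```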

AlexNet contains so-called pooling layers (Krizhevsky et al., 2012, p. 4). These layers progressively reduce the size of the spatial representation by summarizing the content of previous nodes, thereby reducing the dimensionality of the net. By reducing the number of parameters, a control is exerted on the net to prevent over-fitting (i.e. the dimensionality curse). A way of thinking of a pooling layer is down-sampling: reducing a 500x500x3 image to 100x100x3 preserves the spatial information but reduces the number of parameters to be optimized.

Dropout layers, as introduced by Srivastava et al. (2014), are very specific in their function. Like pooling layers they offer a control on over-fitting, but they do not conserve spatial information. The idea of a dropout layer is to randomly "drop out" certain activations in a layer by setting them to zero. This forces the network to create redundant paths to the same answer.

3.3.6 Normalization

Within the field of statistics, normalizing a random variable is a way of forcing the random variable to a certain distribution, after which an analysis with regard to the source material can more easily be made. The field of image processing is similar: we want the input parameters (pixels in our case) to be similarly distributed, which makes convergence faster when training the network (Ioffe and Szegedy, 2015, p. 8). To accomplish normalization, we first define $X$ to be the training set and $Y$ the test set.

1. For each colour channel $c \in \{R, G, B\}$, calculate the mean $\mu_c$ and standard deviation $\sigma_c$ using the information in the training set $x_i \in X$.
2. For both $x_i \in X$ and $y_j \in Y$, and for each channel $c$, apply

$$c^* = \frac{c - \mu_c}{\sigma_c}.$$

If we did not scale our input vectors, the ranges of the feature-value distributions would likely differ per feature, and the learning rate would therefore cause corrections in each dimension that differ from one another. In other words: the inputs should be distributed similarly.

3.4 Support Vector Machine

Support Vector Machines, or SVM, are a machine learning algorithm used for both classification and regression problems, though mostly for classification. In this algorithm each data item is a point in an $n$-dimensional space, with the value of each feature being the value of the corresponding coordinate. We can perform classification by finding a hyper-plane that differentiates the $k$ classes.

To understand the concept of Support Vector Machines, it is important to know the definition and possible applications of hyper-planes. We use the definition by Curtis (1968), accompanied by a geometric interpretation: a hyper-plane is a subspace of one dimension less than the space it resides in (the ambient space). For example, if a space is 4-dimensional, its hyper-planes are 3-dimensional; in general, if the space is $n$-dimensional, its hyper-planes are $(n-1)$-dimensional. The notion of hyper-planes can be used in any space where the notion of sub-spaces is defined.

A Support Vector Machine creates a hyper-plane (or a multitude of hyper-planes) in an $n$-dimensional space. With $n$ of a high order, this space can be used for regression analysis and binary classification. The method separates classes by creating hyper-planes that have the greatest distance to the nearest training data point of any class, the so-called margin.

There are two approaches to multi-class classification problems: casting the multi-class problem as a single optimization problem (Crammer and Singer, 2001), or reducing the single multi-class problem into multiple binary classification problems, the variant found to perform best by Duan and Keerthi (2005). In this paper the latter is used, as the run-time is significantly reduced (Pedregosa et al., 2011).

3.4.1 Ensemble Learning

Support Vector Machines are computationally intensive (Chapelle, 2007), with a computational complexity of $O(\max\{n, d\} \cdot \min\{n, d\}^2)$, where $n$ is the number of points and $d$ the number of dimensions. To counter this, ensemble methods may be employed.

In statistics, and machine learning in particular, ensemble methods obtain a classifier that is a combination of multiple classifiers. Multiple methods of obtaining a classifier using subsets are available to reduce the computational complexity. Breiman (1999) creates a classifier using random subsets of samples. Ho (1998) randomly samples subsets of features, calling this Random Subspaces. When sampling both the samples and the features, we speak of Random Patches (Louppe and Geurts, 2012). Finally, when drawing samples with replacement, one speaks of Bagging (Breiman, 1996).
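As an illustrative sketch, bagging around SVM base learners is available in scikit-learn; the subset sizes and the stand-in data below are assumptions.

```python
# Bagging (Breiman, 1996) over SVM base learners: several small SVMs,
# each fit on a bootstrap subset, instead of one large SVM.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)  # stand-in data
ensemble = BaggingClassifier(SVC(kernel="rbf"),
                             n_estimators=10,   # ten smaller SVMs
                             max_samples=0.2,   # each sees 20% of the data
                             bootstrap=True)    # draw with replacement
ensemble.fit(X, y)
```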

3.4.2 Kernel

A kernel, with regard to Support Vector Machines, is a way of computing the similarity of two vectors $x$ and $y$ in some high-dimensional feature space (Bishop, 2006, p. 292). Suppose we have a function $\phi$ that maps our vectors from $\mathbb{R}^n$ to $\mathbb{R}^m$; we may define the dot product in that space as $\phi(x)^T \phi(y)$. A kernel $k(\cdot, \cdot)$ is a function that corresponds to this dot product in that space: in our example, $k(x, y) = \phi(x)^T \phi(y)$.


This kernel is also called the linear kernel (Bishop, 2006, p. 292). Another example is the Radial Basis Function (RBF), an often-used kernel for training non-linear SVM problems (Chang et al., 2010, p. 1475). It is defined as

$$k(x, y) = \exp(-\gamma \|x - y\|^2).$$

The parameter associated with RBF is γ, which will be discussed in the next subsection.

3.4.3 Parameters

The parameters associated with SVM depend on the kernel used. Here we discuss the parameters associated with SVM and the RBF kernel. The first parameter of interest for SVM itself is $C$, a positive real number that controls the cost of misclassification.

The $\gamma$ coefficient is a parameter of the RBF kernel and allows tuning in the case of over-fitting: a higher $\gamma$ is associated with over-fitting. A smaller $\gamma$ implies a Gaussian shape with a large variance, so that the influence of 'surrounding' points (i.e. those close by in the higher-dimensional space) is greater; a large $\gamma$ implies the opposite: no wide-spread influence. The choice of $\gamma$ therefore entertains the classic bias-variance trade-off: a large $\gamma$ leads to low bias and high variance, and vice versa.

3.5 Synthetic Sampling

In the field of Machine Learning, a great source of bias and misclassification in multi-class classification problems is an unbalanced data set. If a data set totals 1,000 objects, of which 980 belong to the class "dog" and the remaining 20 to the class "cat", it becomes hard to predict unseen data belonging to "cat": the classifier will tend towards the more prevalent class to maximize accuracy. Imbalanced data may arise from a multitude of causes: the data is skewed because of the latent nature of the variable at hand, there may be a sample selection bias, or other problems may be at play. To properly analyze a data set, one must identify, and where possible correct for, such biases.

To counteract the bias of unbalanced data sets, one can apply sampling of classes. There are three cases to discuss: over-sampling the minority class ("cat" in the example), under-sampling the majority class ("dog" in the example), and a combination of over- and under-sampling.

SMOTE (Chawla et al., 2002) is a method for over-sampling a data set to counter the classification problems posed by an imbalanced dependent-variable distribution. SMOTE generates synthetic cases $\hat{x}_i$ when a rare target value $y_i$ is requested, using an interpolation strategy: one of the target value's $k$-nearest neighbours is selected, and a new observation $\hat{x}_i$ is generated by interpolating between the two.


In the paper by He et al. (2008), the ADAptive SYNthetic sampling approach (ADASYN) is introduced. The authors build upon the methodology of SMOTE (Chawla et al., 2002) by focusing on the minority classes which are difficult to learn. ADASYN generates synthetic data for the minority classes, but increases the amount generated for those minority examples which are harder to learn. The ADASYN algorithm works as follows:

We start with a training set $D$ containing in total $n$ samples of labelled, paired data $\{(x_i, y_i)\}_{i=1}^{n}$, with $x_i \in X$ and $y_i \in Y = \{-1, 1\}$ identifying the class associated with the corresponding $x_i$. We now define two subsets of $D$: $D_-$ (of size $n_-$) and $D_+$ (of size $n_+$), the minority and majority classes respectively. We therefore have $n = n_- + n_+$, $n_- < n_+$, and $D_- \cup D_+ = D$.

1. Calculate the number of samples to generate for the minority class:

$$G = (n_+ - n_-) \times \beta \tag{3.9}$$

with $\beta \in [0, 1]$ a parameter specifying the balance level after generation of the synthetic data: $\beta = 0$ indicates no synthetic samples being generated, and $\beta = 1$ indicates a fully balanced data set after synthetic data generation.

2. For each item $x_i \in D_-$, find the $K$ nearest neighbours based on the Euclidean distance. Using this information, calculate the ratio

$$r_i = \Delta_i / K, \qquad i = 1, \dots, n_-,$$

where $\Delta_i$ is defined as the number of examples among the $K$ nearest neighbours of $x_i$ that belong to the majority class $D_+$; we therefore have $r_i \in [0, 1]$.

3. Normalize $r_i$ as follows:

$$\hat{r}_i = \frac{r_i}{\sum_{i=1}^{n_-} r_i},$$

so that $\sum_i \hat{r}_i = 1$.

4. Calculate the number of synthetic data points to be generated for every minority example $x_i$:

$$g_i = \hat{r}_i \times G \tag{3.10}$$

with $G$ being the total number of synthetic data examples to be generated, in accordance with equation 3.9.

5. We now know how many data points are to be generated for each $x_i \in D_-$. For each of the $g_i$ points, randomly pick one point from the $K$ nearest neighbours, call it $x_{k_i}$, and generate a synthetic point by

$$s_i = x_i + (x_{k_i} - x_i)\lambda,$$

with λ a randomly generated number between 0 and 1.


This algorithm can be generalized for multi-class imbalanced data sets by handling one minority class at a time. The introduction of synthesized data is known to improve the performance of classification algorithms; in the original paper by He et al. (2008), the classification accuracy was better than SMOTE's in all but one of the cases presented.
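For reference, a maintained implementation of this algorithm exists in the imbalanced-learn package; the sketch below applies it to toy data, where the 90/10 class split is an assumption standing in for the imbalanced rating bins.

```python
# ADASYN (He et al., 2008) via imbalanced-learn; handles the multi-class
# case one minority class at a time, as described above.
from collections import Counter
from imblearn.over_sampling import ADASYN
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
X_res, y_res = ADASYN(n_neighbors=5).fit_resample(X, y)
print(Counter(y))      # imbalanced class counts before
print(Counter(y_res))  # approximately balanced counts after
```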

4 | Methodology

The main focus of this chapter is the methodology used to answer the research question. First, the pre-processing of the data is discussed in section 4.1. Secondly, feature extraction is discussed in section 4.2. Thirdly, section 4.3 discusses the prediction goal. Fourthly, section 4.4 discusses prediction and the fusion of feature sets. Lastly, section 4.5 summarizes the methodology.

Figure 4.0.1: The structure of the methodology at a glance

The methodology described in this chapter is graphically explained in figure 4.0.1. Section 4.1 is linked to part 1 in blue, section 4.2 to part 2, section 4.3 describes part 3, and finally part 4 is described by section 4.4.

4.1 Pre-processing

In pre-processing (part 1 of figure 4.0.1), an 80%/20% train/test split is created. We do not use a 95/5 split as in the literature review, as those authors employed 20-fold cross-validation; the choice for 80/20 stems from computational constraints. Some restrictions are put on the input data of this method: the target variable needs to consist of at least one rating, so that the average target rating lies between the minimum and maximum values of the rating scale.


4.2 Feature Extraction

For the analysis of the online content given, we define the feature sets to be used. Four feature sets will be discussed here, namely:

• Image of the item (4.2.1)
• Textual feature sets (title & description) (4.2.2)
• Genre information (4.2.3)

4.2.1 Image

To generate a feature set from images, the pre-trained Convolutional Neural Network AlexNet is used. By using data from ImageNet we are able to ascertain spatial and object information from the image associated with an item. AlexNet uses the following normalization parameters:

$(\mu_R, \mu_G, \mu_B) = (0.485, 0.456, 0.406), \qquad (\sigma_R, \sigma_G, \sigma_B) = (0.229, 0.224, 0.225)$

These values are dictated by the use of a pre-trained model (PyTorch, 2017). The output of AlexNet is a 1,000-dimensional feature vector.
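A minimal sketch of this extraction step with torchvision's pre-trained AlexNet; the file name thumbnail.png is a placeholder, and the resize/crop pipeline is the conventional one for this model rather than something specified in the text.

```python
# Extracting the 1,000-dimensional AlexNet output for one thumbnail,
# using the normalization constants quoted above.
import torch
from torchvision import models, transforms
from PIL import Image

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

model = models.alexnet(pretrained=True).eval()
img = preprocess(Image.open("thumbnail.png").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    features = model(img)   # shape: (1, 1000)
```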

4.2.2 LDA

As explained in the theory chapter, Latent Dirichlet Allocation is used to extract textual features from a text; in this specific case, the textual features contained within the title and description of an item. The parameters of interest in this feature set are the number of topics $k$ and the number of iterations over the data set. For computational convenience we set the number of iterations to 20. The end result of this analysis is two feature sets (title and description), each containing $k_i$ features. The variant from the paper by Hoffman et al. (2010) is used: an online, and therefore sequentially run, variant of LDA.

4.2.2.1 Parameters

As discussed in the theory chapter, the optimal number of topics is to be determined, and there are multiple methods for doing so. We shall use the coherence measure $C_v$ to determine the optimal $k$, choosing the $k$ with the best coherence score among $k \in \{25, 50, 75, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000\}$, with 5 passes over the corpus.

4.2.3 Genres

The assumption of this thesis is that an online content item carries information of a categorical nature, which may be interpreted as the genre of the item. This is added as a feature set with one dummy variable per genre/category.


4.3 Prediction Goal

In defining the goal of an algorithm, the need to define accuracy arises. This thesis uses a binned scale: a scale cut into a pre-defined number of pieces of width $\varepsilon$. The corresponding definition of accuracy $a(\cdot)$ is

$$a(\varepsilon) = \frac{1}{N} \sum_{i=1}^{N} b(\varepsilon, \text{actual}_i, \text{expected}_i),$$

with $b(\cdot, \cdot, \cdot)$ defined as:

$$b(\varepsilon, x, y) = \mathbb{I}_{\,y \in [x - (x \bmod \varepsilon),\; x - (x \bmod \varepsilon) + \varepsilon) \cap D}, \qquad D = [\min\{\text{ratings}\}, \max\{\text{ratings}\}]$$

Note the half-open boundaries $[\cdot, \cdot)$; the boundary value $\max D$ is included in the last bin. We assume that the bins start on an integer or a multiple of $\varepsilon$. $N$ is the number of items in the relevant set. The goal of all algorithms is to predict the rating of an item as accurately as possible. To compare this work with other work, we facilitate two accuracy definitions:

1. Spot-on, where $\varepsilon = 0.5$.
2. Next-to, where $\varepsilon = 1.0$.

The most interesting definition is spot-on. This definition allows for a more thorough learning process with regard to Neural Networks and SVM, as it preserves the difference between e.g. an item rating of 3.2 and one of 3.6; this information is lost when viewing the rating of an item through the lens of the next-to definition.
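A literal Python reading of the binned accuracy above is given below, under the assumption that ratings live in $[1, 5]$; mapping both values to a shared bin index is equivalent to checking membership of the prediction in the actual value's bin, and the max rating is assigned to the last bin as required.

```python
# Binned accuracy a(eps): eps = 0.5 is spot-on, eps = 1.0 is next-to.
import math

def bin_index(rating, eps, lo=1.0, hi=5.0):
    # Map a rating in [lo, hi] to its bin; hi falls in the last bin.
    n_bins = int(round((hi - lo) / eps))
    return min(int((rating - lo) // eps), n_bins - 1)

def accuracy(eps, actuals, predictions):
    hits = sum(bin_index(a, eps) == bin_index(p, eps)
               for a, p in zip(actuals, predictions))
    return hits / len(actuals)

print(accuracy(0.5, [3.2, 3.6, 4.9], [3.4, 3.3, 5.0]))  # spot-on: 2/3
```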

4.3.1 Continuous

Binning the target variable raises the issue of information loss when going from a continuous variable to a discrete one. This is a valid point; the reasoning behind the choice is that we would like to compare this algorithm with previous work (Glisovic, 2016; Wieser, 2016). Furthermore, the discretisation of the variable allows for faster run-times with regard to Neural Networks and SVM (which would otherwise have to be changed to Support Vector Regression).

4.4 Prediction

This section describes the final two steps: generating the feature sets and the fusion of those feature spaces.

4.4.1 Sampling

Before any classification algorithm is applied to the data set, the data set is run through ADASYN to generate a balanced data set.


4.4.2 Support Vector Machine

Before any training can take place, the original unbalanced training set is supplemented with ADASYN, as described in section 3.5, to ensure a balanced data set. The target value is the rating of the online content, split by use of the spot-on definition. To ascertain the parameters for this intermediate step, a grid search is performed over the following variables:

• kernel, one of
  – linear: $\langle x, x' \rangle$
  – radial basis function: $\exp(-\gamma \|x - x'\|^2)$
• $\gamma \in \{0.01, 0.1, 0.5, 0.9, 0.99\}$
• $C \in \{1, 10, 100, 1000\}$

Here $\gamma$ is only adjusted in the case of the radial basis function (RBF) kernel. This grid search is done on a random sample (10%) of an ADASYN-sampled training set. As for the method of SVM multi-classification, the method proposed by Duan and Keerthi (2005) is used, reducing the multi-class optimization problem to a multitude of binary classification problems. The SVM classifier is trained with the spot-on definition as defined in section 4.3.
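A sketch of this grid search using scikit-learn (Pedregosa et al., 2011) is shown below; the stand-in data replaces the 10% subsample, and the cross-validation fold count is an assumption.

```python
# Grid search over the kernels and parameter values listed above.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)  # stand-in data
param_grid = [
    {"kernel": ["linear"], "C": [1, 10, 100, 1000]},
    {"kernel": ["rbf"], "C": [1, 10, 100, 1000],
     "gamma": [0.01, 0.1, 0.5, 0.9, 0.99]},  # gamma only for RBF
]
search = GridSearchCV(SVC(), param_grid, cv=3).fit(X, y)
print(search.best_params_)
```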

The outputs used from SVM are the so-called probabilities: in the binary case, the method by Platt et al. (1999) may be used to ascertain per-class probabilities. It fits a logistic regression on the SVM scores by means of an additional cross-validation on the data. However, because we are in a multi-class case, we use the extension described by Wu et al. (2004). Furthermore, as SVM is computationally intensive, the Bagging ensemble method with overlapping subsets is used.

4.4.3 Neural Network

A neural network with two hidden layers of 400 nodes each is trained to ascertain a $1 \times q$ vector of outputs to use for classification, where $q$ depends on the $\varepsilon$ chosen (8 for spot-on and 4 for next-to). The activation function used is either $a_{\text{ReLU}}$, for ease of computation, or $a_{\text{softmax}}$ if the first does not converge successfully. The number of epochs used is 20,000. To counteract over-fitting, a dropout layer is included between the first and second hidden layer. The loss function used is the Cross Entropy Loss function as defined in subsection 3.3.4. The output of the Neural Network is normalized according to the same procedure as the image normalization, by taking the column vector of each feature ($y$) and applying

$$\text{normalize}(y) = \sigma_y^{-1}(y - \iota \mu_y),$$

with $\iota = [1, \dots, 1]'$. Using the normalized, concatenated input feature sets, the same Neural Network as before is trained.
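A minimal PyTorch sketch of the described architecture is given below; the dropout probability is not specified in the text and is an assumption here.

```python
# Two hidden layers of 400 nodes, ReLU activations, dropout between them,
# and a q-class output head trained with cross entropy (q = 8 for spot-on).
import torch.nn as nn

def make_net(n_features, q=8, p_dropout=0.5):  # p_dropout is an assumption
    return nn.Sequential(
        nn.Linear(n_features, 400), nn.ReLU(),
        nn.Dropout(p=p_dropout),               # counteracts over-fitting
        nn.Linear(400, 400), nn.ReLU(),
        nn.Linear(400, q),                     # logits for CrossEntropyLoss
    )

net = make_net(n_features=1000)
criterion = nn.CrossEntropyLoss()
```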


4.5 Summary

To grant the reader a more structured view of this chapter we summarize the methodology:

Preprocessing Split the data set into 80% training and 20% test set.
Feature Extraction Extract genre, image, and description and title textual information. Balance with ADASYN.
Feature preparation Prepare the features for fusion by applying Support Vector Machines and Neural Networks, training on the spot-on definition.
Fusion Use a Support Vector Machine and a Neural Network to train a classifier using the spot-on definition.
Prediction Predict on the test set and enumerate the spot-on and next-to accuracies per rating bin.

5 | Experiment

The results from an experiment using scraped application data are outlined here. Firstly, the data is described in section 5.1. Secondly, the feature sets and their generation are outlined in sections 5.2 through 5.5. Thirdly, the fusion step is described in section 5.6. Finally, the results are summarized in section 5.7.

5.1 Origin & Explanation

The data used for this chapter is App Store data, from the iOS App Store by Apple. The data is provided through a collaboration between the University of Amsterdam and AppTweak (http://www.apptweak.com). The tools of AppTweak scraped the top-rated applications per genre, collecting data such as title, description, thumbnail link, gallery videos, images, price, and reviews. A subselection has been made with regard to the data used: only apps with the categorization 'Game' have been included.

5.1.1 Statistics

The general statistics of the relevant variables are displayed in table 5.1.1.

Values         Test     Train     Total
mean           4.05     3.97      3.99
std. dev.      0.72     0.74      0.74
skew          -1.62    -1.30     -1.36
kurtosis       3.20     1.84      2.07
#(1.0 - 1.5)     86      293       379
#(1.5 - 2.0)     87      335       422
#(2.0 - 2.5)    118      739       857
#(2.5 - 3.0)    268    1,507     1,775
#(3.0 - 3.5)    717    3,646     4,363
#(3.5 - 4.0)  1,250    5,093     6,343
#(4.0 - 4.5)  2,552    9,477    12,029
#(4.5 - 5.0)  1,828    6,530     8,358
all           6,906   27,620    34,526

Table 5.1.1: Summary statistics for ratings


Figure 5.1.1: Train (left) and test (right) rating distribution (spot-on)

Table 5.1.1, combined with the information shown in figure 5.1.1, shows that the average rating is skewed towards the higher ratings. This can be explained by the way AppTweak has scraped the data: only the top-rated applications per genre are collected. Furthermore, the mean of each subset is close to that of the total set (no more than a 10% difference), as are the standard deviation and the skew. The kurtosis differs, which can be seen in figure 5.1.1. From figure 5.1.1, and using the information contained within table 5.1.1, we see that the two subsets are similarly distributed.

The results are displayed in table format. For the next-to columns, a predicted rating within [1.0-1.5] is checked against actual values within [1.0-2.0], to allow for a comparison with previous work. As an example, if a rating is predicted to be 1.0-1.5 and the actual value is 1.7, the prediction is considered correct under the next-to definition. A small sketch of these definitions follows, after which we turn our attention to the feature sets.
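A minimal sketch of the two accuracy definitions (the function names are ours):

```python
import numpy as np

def to_bin(r):
    """Map a rating in [1.0, 5.0] to a bin index 0..7 ([1.0-1.5) -> 0, ...)."""
    return int(min((r - 1.0) // 0.5, 7))

def accuracies(pred_bins, true_ratings):
    """Return (spot-on, next-to) accuracy for binned predictions."""
    true_bins = np.array([to_bin(r) for r in true_ratings])
    pred_bins = np.asarray(pred_bins)
    spot_on = np.mean(pred_bins == true_bins)
    next_to = np.mean(np.abs(pred_bins - true_bins) <= 1)
    return spot_on, next_to

# The example above: predicted bin 1.0-1.5 (index 0), actual rating 1.7
# (bin index 1): wrong under spot-on, correct under next-to.
print(accuracies([0], [1.7]))  # (0.0, 1.0)
```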


5.2 Genres Feature Set

This section describes the results for the Genres feature set. To grant the reader some insight into the data, we describe the statistics and distribution of the genres in subsection 5.2.1. We then focus on the results garnered in subsection 5.2.2.

5.2.1 Statistics

The following table shows the number of apps and the average rating per genre, with the added note that an app can pertain to more than one genre.

Genre Name     # Apps   Average Rating
Action         12,825   3.68
Adventure      11,582   3.77
Arcade         12,046   3.68
Board           5,617   3.56
Card            4,653   3.59
Casino          4,619   3.63
Dice            2,779   3.58
Educational     8,492   3.53
Family         13,702   3.67
Kids                0   -
Music           2,840   3.71
Puzzle         14,239   3.80
Racing          5,250   3.63
Role Playing    9,165   3.80
Simulation     10,600   3.53
Sports          4,655   3.41
Strategy        8,593   3.66
Trivia          6,057   3.64
Word            4,803   3.86

Table 5.2.1: Genre information

As can be seen from table 5.2.1, the genre Kids has no applications associated with it; it is therefore dropped. We now apply oversampling to all non-majority classes to achieve the new distribution of the ratings:

Figure 5.2.1: Before (left) and after (right) applying ADASYN
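A minimal sketch of this oversampling step with imbalanced-learn's ADASYN (He et al., 2008); `X_train` and `y_train` are hypothetical names for the genre features and binned ratings.

```python
from imblearn.over_sampling import ADASYN

# ADASYN interpolates between minority-class neighbours, so the synthetic
# genre indicators become continuous even though the originals are binary
# (a point revisited in section 5.7).
X_bal, y_bal = ADASYN(random_state=0).fit_resample(X_train, y_train)
```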

On the balanced data set we apply the methods outlined in the methodology chapter: Support Vector Machines and Neural Networks.


5.2.2 Results

This section describes the results for the Support Vector Machine and Neural Network applied to the Genres feature set. We begin by applying a grid search to find γ and C on a random 10% sample of the balanced data set:

C      γ      Test   Train
1      0.01   21     17
1      0.1    34     25
1      0.5    34     30
1      0.9    35     33
1      0.99   34     33
10     0.01   18     20
10     0.1    35     29
10     0.5    34     38
10     0.9    35     43
10     0.99   36     44
100    0.01   33     26
100    0.1    34     34
100    0.5    36     47
100    0.9    36     50
100    0.99   36     51
1000   0.01   36     30
1000   0.1    36     41
1000   0.5    36     51
1000   0.9    37     52
1000   0.99   36     52

Table 5.2.2: Grid search results for SVM with an RBF kernel (10% sample)

             spot-on          next-to
Rating       Test    Train    Test    Train
1.0 - 1.5    23.26   31.76    25.58   46.82
1.5 - 2.0     5.75   24.80    17.24   65.44
2.0 - 2.5     3.39   39.55     8.47   58.57
2.5 - 3.0     0.75   12.51     8.21   46.24
3.0 - 3.5     4.60   10.65    22.45   25.04
3.5 - 4.0    16.80   11.49    61.28   44.34
4.0 - 4.5    41.89   40.20    85.50   85.69
4.5 - 5.0    39.93   34.37    75.55   57.84
total        30.02   25.91    66.02   53.87

Table 5.2.3: Accuracies associated with optimal SVM parameters (10% sample)

C      Test   Train
1      21     20
10     21     20
100    21     20
1000   21     20

Table 5.2.4: Results with a linear kernel

From the grid search in table 5.2.2 we see that C = 1000 and γ = 0.99 perform best, and that an RBF kernel is therefore needed. With these parameters we attain an accuracy of 25.91% on the training set (table 5.2.3). Figures 5.2.2 and 5.2.3 show the predicted and actual ratings for this sample.

Figure 5.2.2: Prediction values (10% sample)

Figure 5.2.3: Actual values (10% sample)

As the predicted values in figure 5.2.2 show a distribution close to that of the actual values (figure 5.2.3), we continue on to the full data set, using C = 1000, γ = 0.99. From this we get the following results and distributions:

Figure 5.2.4: Prediction ratings

Figure 5.2.5: Actual values of the test ratings

The following tables show the results of applying both the Neural Network methodology and the Support Vector Machine with the acquired parameters.

             spot-on          next-to
Rating       Test    Train    Test    Train
1.0 - 1.5    10.47   67.87    11.63   76.52
1.5 - 2.0     0.00   65.96     2.30   81.88
2.0 - 2.5     1.69   58.21     5.08   72.24
2.5 - 3.0     1.87   47.93     7.09   61.89
3.0 - 3.5     3.63   28.52    14.50   42.03
3.5 - 4.0    10.24   19.96    82.08   64.31
4.0 - 4.5    69.91   69.71    92.71   93.55
4.5 - 5.0    23.19   32.91    89.00   74.52
all          34.43   48.77    74.72   71.00

Table 5.2.5: Results for full training set (SVM)

             spot-on          next-to
Rating       Test    Train    Test    Train
1.0 - 1.5    22.09   21.16    22.09   21.16
1.5 - 2.0     0.00    0.00     3.45    5.07
2.0 - 2.5     0.00    0.00     0.85    0.41
2.5 - 3.0     0.37    0.40     0.37    0.40
3.0 - 3.5     0.00    0.66     3.63    6.47
3.5 - 4.0     4.64    6.36     4.88    6.68
4.0 - 4.5    86.60   88.39    96.51   96.92
4.5 - 5.0    13.46   14.58    97.65   97.30
all          36.69   35.28    63.12   58.66

Table 5.2.6: Results for full training set (NN)

We attain a result of 34.43% using the spot-on definition and 74.72% using the next-to definition with SVM. Of note here is the very low accuracy from the 1.5-2.0 through the 3.5-4.0 bins for the Neural Network; it seems that even with ADASYN sampling, the Neural Network prefers the higher-valued ratings. Both the SVM and the Neural Network perform badly on the 1.5-2.0 range, indicating a lack of information contained within the data set for differentiating a 1.5-2.0 rating. Seeing as most genres have a fairly even distribution with respect to their average rating, a low accuracy is not an unexpected result when this feature set is used in isolation.


5.3 Image Feature Set

This section describes the intermediate results associated with the Images feature set. Before putting the images through AlexNet, they have to be adjusted. The images from the app scraping process ranged from 1024x1024 to 75x75 pixels. To consolidate this, and to be able to use AlexNet, the input had to be converted to RGB values (as some pictures were stored in black and white) and re-sized to 227x227. The images also have to be normalized. The balanced data set is comparable in distribution to figure 5.2.1. Figure 5.3.1 shows a random sampling of images to illustrate the source material.

Figure 5.3.1: Random selection of images
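A sketch of this image preparation using torchvision (cited in the references) is given below; the normalization constants are the standard ImageNet values and are an assumption, as the thesis does not list the exact statistics used.

```python
from PIL import Image
from torchvision import transforms

prep = transforms.Compose([
    transforms.Resize((227, 227)),            # AlexNet input size used here
    transforms.ToTensor(),                    # PIL image -> CxHxW float tensor
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),   # (assumed)
])

img = Image.open("thumbnail.png").convert("RGB")  # handles grayscale inputs
x = prep(img).unsqueeze(0)                        # 1x3x227x227 batch for AlexNet
```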

5.3.1 Results

Below are the grid search results for the parameters (table 5.3.1) and the accuracies associated with the optimal SVM parameters (table 5.3.2).

C       γ      Test   Train
1.0     0.01   22     49
1.0     0.1    30     98
1.0     0.5    18     99
1.0     0.9    18     100
1.0     0.99   18     100
10.0    0.01   23     92
10.0    0.1    30     99
10.0    0.5    18     100
10.0    0.9    18     100
10.0    0.99   18     100
100.0   0.01   23     99
100.0   0.1    30     100
100.0   0.5    18     100
100.0   0.9    18     100
100.0   0.99   18     100
1000.0  0.01   24     100
1000.0  0.1    31     100
1000.0  0.5    18     100
1000.0  0.9    18     100
1000.0  0.99   18     100

Table 5.3.1: Grid search results for SVM with an RBF kernel (10% sample)

             spot-on          next-to
Rating       Test    Train    Test    Train
1.0 - 1.5     1.16   99.78     1.16   99.89
1.5 - 2.0     0.00   99.90     3.49   99.90
2.0 - 2.5     0.85  100.00     5.13  100.00
2.5 - 3.0     5.22   99.69     6.34   99.69
3.0 - 3.5     2.65  100.00    28.07  100.00
3.5 - 4.0    21.28  100.00    25.12  100.00
4.0 - 4.5    69.02  100.00    71.18  100.00
4.5 - 5.0     2.52  100.00    70.99  100.00
all          30.54   99.92    52.96   99.93

Table 5.3.2: Accuracies associated with optimal SVM parameters (10% sample)

C       Test   Train
1.0     11     48
10.0    10     48
100.0   11     47
1000.0  11     47

Table 5.3.3: Results using a linear kernel

Table 5.3.1 shows the results of the grid search, with table 5.3.2 showing the distribution of the accuracies associated with the optimal C and γ. As the SVM estimator does not simply guess a single rating, we continue. Tables 5.3.1 and 5.3.3 illustrate that the best choice within this random sample is the Radial Basis Function kernel with C = 10, γ = 0.1; we also see that the choice of C matters little. Using these settings we train on the entire data set.

             spot-on          next-to
Rating       Test    Train    Test    Train
1.0 - 1.5    23.26   95.85    46.51   98.26
1.5 - 2.0    22.09   92.56    33.72   97.56
2.0 - 2.5    17.95   73.73    31.62   78.63
2.5 - 3.0    14.55   58.42    29.10   68.94
3.0 - 3.5    13.83   34.54    29.47   43.71
3.5 - 4.0    15.92   30.57    30.56   43.32
4.0 - 4.5    13.76   21.03    19.84   26.17
4.5 - 5.0     7.39   15.07    19.76   26.45
all          12.80   53.39    23.83   61.03

Table 5.3.4: Results for full training set (SVM)

             spot-on          next-to
Rating       Test    Train    Test    Train
1.0 - 1.5     1.16   96.97     2.33   98.20
1.5 - 2.0     2.33   95.81     2.33   97.28
2.0 - 2.5     0.85   93.35     5.13   95.78
2.5 - 3.0     2.99   87.92     4.10   91.30
3.0 - 3.5    11.03   71.33    31.56   81.11
3.5 - 4.0    18.80   67.54    32.00   76.89
4.0 - 4.5    44.12   74.32    65.96   86.78
4.5 - 5.0    25.78   57.63    67.27   79.81
all          27.86   81.03    51.57   88.52

Table 5.3.5: Results for full training set (NN)

We can see a disparity between the two results: where the neural network is good (> 25%) at categorizing the higher end of the ratings spectrum, the SVM classifier's accuracy is biased towards the lower end of the spectrum, with less skew. We also see that the SVM classifier does not attain a high accuracy on the higher end of the spectrum, whilst the Neural Network performs more evenly across the board.


5.4 Description Feature Set

This section describes the results attained for the description feature set. First, the parameters of the LDA model are determined in subsection 5.4.1, followed by a discussion of the results in subsection 5.4.2.

5.4.1 Parameters

We first calculate the TF-IDF values for each individual word after stemming; after observing the TF-IDF values, a cutoff is chosen. From visual inspection of figure 5.4.1, we set the cutoff to lie within [2.5, 6.5]. Tables 5.4.1, 5.4.2, 5.4.3, and 5.4.4 list the differences before and after the cull.

Figure 5.4.1: TF-IDF graph

Word     TF-IDF      Word   TF-IDF
up       0.925657    thi    0.5981
new      0.909615    it     0.587034
have     0.888618    play   0.510886
by       0.879416    on     0.460175
will     0.872721    is     0.329279
as       0.846381    for    0.290002
be       0.843011    with   0.286958
that     0.818909    in     0.257535
more     0.760925    your   0.224049
from     0.759918    game   0.189414
are      0.736809    you    0.186967
all      0.699321    of     0.174215
or       0.686651    to     0.0905064
featur   0.636971    and    0.0740746

Table 5.4.1: Lowest TF-IDF words pre-cull (description)

Word            TF-IDF     Word          TF-IDF
behindthescen   10.2263    kidz          10.2263
brog            10.2263    pogu          10.2263
superpong       10.2263    jonesin       10.2263
spiffywar       10.2263    chubukov      10.2263
waffleturtl     10.2263    jirbo         10.2263
sorel           10.2263    pixio         10.2263
doublet         10.2263    ijezzbal      10.2263
provision       10.2263    stephenflem   10.2263
easthaven       10.2263    galley        10.2263
fanni           10.2263    glovercom     10.2263
agnew           10.2263    aki           10.2263
askew           10.2263    xerc          10.2263
saratoga        10.2263    cupertino     10.2263
tournement      10.2263    sextupl       10.2263

Table 5.4.2: Highest TF-IDF words pre-cull (description)


Word           TF-IDF    Word       TF-IDF
cant           4.23733   playabl    4.21503
paid           4.23733   text       4.21503
experienc      4.22984   career     4.21503
credit         4.22984   wont       4.21503
gmail          4.22984   sequel     4.21503
wit            4.22984   illustr    4.21503
owner          4.22736   agre       4.21014
regular        4.22736   kick       4.2077
bigfishtwitt   4.22488   dive       4.2077
seem           4.22241   bottom     4.20527
condit         4.22241   undo       4.20285
ahead          4.21994   grid       4.20285
bigfi          4.21994   corner     4.20285
ten            4.21748   straight   4.20285

Table 5.4.3: Lowest TF-IDF words past-cull (description)

Word         TF-IDF    Word         TF-IDF
pellet       8.14685   sherman      8.14685
inki         8.14685   fischer      8.14685
blinki       8.14685   tribun       8.14685
appstoreapp  8.14685   cappuccino   8.14685
droplet      8.14685   elvi         8.14685
mulligan     8.14685   rgb          8.14685
symmetr      8.14685   gopher       8.14685
joseph       8.14685   zu           8.14685
glutton      8.14685   geolog       8.14685
seventh      8.14685   interven     8.14685
unsurpass    8.14685   roach        8.14685
misfortun    8.14685   multius      8.14685
unclutt      8.14685   interpol     8.14685
flatten      8.14685   lcd          8.14685

Table 5.4.4: Highest TF-IDF words past-cull (description)
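A sketch of how scores of this kind, and the subsequent cull, can be computed with gensim is shown below. `docs` (a list of stemmed token lists) is an assumed name, and the exact TF-IDF weighting used in this thesis is not stated, so the scores are illustrative.

```python
from gensim.corpora import Dictionary
from gensim.models import TfidfModel

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
tfidf = TfidfModel(corpus)

# Average TF-IDF score per word over the documents it appears in, then cull
# everything outside the visually chosen [2.5, 6.5] window.
totals, counts = {}, {}
for bow in corpus:
    for word_id, score in tfidf[bow]:
        totals[word_id] = totals.get(word_id, 0.0) + score
        counts[word_id] = counts.get(word_id, 0) + 1
keep = [w for w, s in totals.items() if 2.5 <= s / counts[w] <= 6.5]
dictionary.filter_tokens(good_ids=keep)
```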

The information contained within tables 5.4.1-5.4.4 shows the effect clearly: words such as 'illustr' (pertaining to illustrating, illustrate, et cetera) carry more information than 'your' or 'by'. Although less informative words such as 'wont' do still come up, they are less prone to appear. After determining, and consequently filtering on, the cutoff point, the attention shifts to determining k, the number of topics in the LDA model. We search for k ∈ {25, 50, 75, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000} and attain the following values for the coherence measure:

k      Cv
25     0.47
50     0.49
75     0.47
100    0.46
200    0.47
300    0.45
400    0.47
500    0.49
600    0.49
700    0.48
800    0.50
900    0.52
1000   0.51

Table 5.4.5: The coherence values associated with figure 5.4.2

Figure 5.4.2: Coherence estimates
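A sketch of this topic-number search with gensim: train an LDA model per candidate k and score it with the C_v coherence measure (Röder et al., 2015). `corpus`, `dictionary`, and `docs` are assumed from the TF-IDF step; a single pass per candidate is used here to keep the search cheap, with the final model trained with more passes as described below.

```python
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

for k in [25, 50, 75, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]:
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k, passes=1)
    cv = CoherenceModel(model=lda, texts=docs, dictionary=dictionary,
                        coherence="c_v").get_coherence()
    print(k, cv)
```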

We will be using k = 300, as it attains the lowest coherence estimate. Now that we know the optimal value of k, we run 30 passes over the corpus to generate the LDA features for the SVM and NN. We first turn our attention to Support Vector Machines, where we determine the optimal C and γ by grid-searching a 10% sample:


C       γ      Test   Train
1.0     0.01   10     14
1.0     0.1    11     14
1.0     0.5    12     19
1.0     0.9    13     21
1.0     0.99   16     22
10.0    0.01   10     15
10.0    0.1     8     21
10.0    0.5    10     24
10.0    0.9    10     25
10.0    0.99   12     26
100.0   0.01   12     20
100.0   0.1    10     25
100.0   0.5    12     27
100.0   0.9    14     28
100.0   0.99   14     28
1000.0  0.01   10     25
1000.0  0.1    11     27
1000.0  0.5    18     30
1000.0  0.9    22     31
1000.0  0.99   21     30

Table 5.4.6: SVM grid search results for the description LDA feature set

             spot-on          next-to
Rating       Test    Train    Test    Train
1.0 - 1.5    29.07   60.49    40.70   70.25
1.5 - 2.0    12.64   51.02    28.74   69.40
2.0 - 2.5    11.02   23.64    38.14   50.83
2.5 - 3.0    24.25   41.94    33.96   52.05
3.0 - 3.5     1.12    3.51     9.48   12.71
3.5 - 4.0    10.40   11.12    10.88   12.62
4.0 - 4.5    36.48   36.37    39.54   39.49
4.5 - 5.0     5.91    7.11    38.24   37.98
all          18.69   30.00    30.52   43.37

Table 5.4.7: Results for the 10% sample

C       Test   Train
1.0      7     20
10.0     8     23
100.0   12     24
1000.0  19     27

Table 5.4.8: Accuracies for a linear kernel

From these results we see that the optimal choice of (C,γ) is (1000,0.9) with a Radial Basis Function kernel.

5.4.2 Results

We now proceed by applying these settings to the entire data set for the SVM. We also train a Neural Network; the results can be seen in tables 5.4.9 and 5.4.10 for SVM and NN respectively.

             spot-on          next-to
Rating       Test    Train    Test    Train
1.0 - 1.5    29.07   60.49    40.70   70.25
1.5 - 2.0    12.64   51.02    28.74   69.40
2.0 - 2.5    11.02   23.64    38.14   50.83
2.5 - 3.0    24.25   41.94    33.96   52.05
3.0 - 3.5     1.12    3.51     9.48   12.71
3.5 - 4.0    10.40   11.12    10.88   12.62
4.0 - 4.5    36.48   36.37    39.54   39.49
4.5 - 5.0     5.91    7.11    38.24   37.98
all          18.69   30.00    30.52   43.37

Table 5.4.9: Results for full data set (SVM)

             spot-on          next-to
Rating       Test    Train    Test    Train
1.0 - 1.5     8.14   79.63    11.63   83.62
1.5 - 2.0     4.60   75.55     6.90   80.23
2.0 - 2.5     5.08   60.36     9.32   69.46
2.5 - 3.0     2.61   47.30     4.85   56.92
3.0 - 3.5     3.21   27.80    10.74   33.22
3.5 - 4.0     8.32   15.92    13.36   26.45
4.0 - 4.5    75.20   75.57    83.15   83.07
4.5 - 5.0    12.04   14.73    82.44   70.90
all          33.16   49.71    56.66   62.25

Table 5.4.10: Results for full data set (NN)

Again, as was the case for the Images feature set, we see that the Neural Network performs well on the higher end of the ratings spectrum relative to the SVM, which performs significantly (3x) better in accuracy on the lower end of the ratings spectrum.


5.5 Title Feature Set

This section describes the parameter search (5.5.1) and the results (5.5.2) for the title feature set.

5.5.1 Parameters

Here we first calculate the TF-IDF values, where, after stemming and observing the words, a cutoff is chosen. From visual inspection of figure 5.5.1, the TF-IDF needs to lie within [3, 6.5]. Furthermore, we list the differences before and after the cull in tables 5.5.1-5.5.4.

Figure 5.5.1: TF-IDF graph

Word       TF-IDF    Word     TF-IDF
hidden     4.11483   race     3.71753
war        4.05459   casino   3.7101
hd         4.04009   simul    3.65822
world      4.03393   word     3.6415
salon      3.98993   girl     3.59561
my         3.95909   quiz     3.53545
trivia     3.95342   slot     3.43958
in         3.93844   puzzl    3.23696
guess      3.93103   kid      3.17098
with       3.84787   and      2.98924
adventur   3.84617   of       2.83797
up         3.84448   for      2.68249
pro        3.80793   free     2.33198
fun        3.76015   the      2.29825

Table 5.5.1: Lowest TF-IDF words for title pre-cull

Word          TF-IDF    Word           TF-IDF
cro           10.2263   beecel         10.2263
superbal      10.2263   wooli          10.2263
garf          10.2263   wordtouch      10.2263
pictureflip   10.2263   killersudoku   10.2263
chart         10.2263   bandicoot      10.2263
mimeo         10.2263   lumen          10.2263
ijezzbal      10.2263   blockstouch    10.2263
aki           10.2263   marblejump     10.2263
trism         10.2263   isudoku        10.2263
sextupl       10.2263   partner        10.2263
lumina        10.2263   blocksclass    10.2263
yulan         10.2263   idic           10.2263
imangi        10.2263   tyranno        10.2263
muddl         10.2263   chessclock     10.2263

Table 5.5.2: Highest TF-IDF words for title pre-cull


Word       TF-IDF    Word       TF-IDF
hero       4.22736   with       3.84787
babi       4.22736   adventur   3.84617
dress      4.17656   up         3.84448
machin     4.17421   pro        3.80793
anim       4.16251   fun        3.76015
to         4.15556   edit       3.72952
hidden     4.11483   race       3.71753
war        4.05459   casino     3.7101
hd         4.04009   simul      3.65822
world      4.03393   word       3.6415
salon      3.98993   girl       3.59561
my         3.95909   quiz       3.53545
trivia     3.95342   slot       3.43958
in         3.93844   puzzl      3.23696

Table 5.5.3: Lowest TF-IDF words for title after cull

Word        TF-IDF    Word       TF-IDF
volleybal   7.92371   toilet     7.92371
deuc        7.92371   ragdol     7.92371
random      7.92371   bloon      7.92371
cow         7.92371   present    7.92371
type        7.92371   submarin   7.92371
luxuri      7.92371   airlin     7.92371
expert      7.92371   never      7.92371
astrawar    7.92371   conflict   7.92371
pick        7.92371   alli       7.92371
against     7.92371   bay        7.92371
terror      7.92371   rang       7.92371
spanish     7.92371   potter     7.92371
bomb        7.92371   fireman    7.92371
lunar       7.92371   web        7.92371

Table 5.5.4: Highest TF-IDF words for title after cull

After stemming the words and culling within the range defined earlier, we turn our attention to finding the optimal value of k for the title feature set. We again employ the coherence measure to search for k ∈ {25, 50, 75, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000}, and attain the following values:

k      Cv
25     0.51
50     0.43
75     0.42
100    0.40
200    0.47
300    0.53
400    0.55
500    0.56
600    0.56
700    0.56
800    0.56
900    0.56
1000   0.57

Table 5.5.5: The coherence values associated with figure 5.5.2

Figure 5.5.2: Coherence estimates

We observe the lowest coherence estimate to be 0.40, at k = 100. This choice of k will be used to generate the LDA features fed into both the SVM and the NN. We now perform a grid search to obtain the optimal parameters (C, γ) on a random 10% sample, and attain the following results:


C       γ      Test   Train
1.0     0.01   10     13
1.0     0.1    14     16
1.0     0.5    19     20
1.0     0.9    23     21
1.0     0.99   20     21
10.0    0.01   13     16
10.0    0.1    16     19
10.0    0.5    22     23
10.0    0.9    24     24
10.0    0.99   26     24
100.0   0.01   16     17
100.0   0.1    24     22
100.0   0.5    28     26
100.0   0.9    32     28
100.0   0.99   29     29
1000.0  0.01   20     20
1000.0  0.1    28     25
1000.0  0.5    30     30
1000.0  0.9    30     32
1000.0  0.99   32     33

Table 5.5.6: Grid search results for the title feature set

             spot-on          next-to
Rating       Test    Train    Test    Train
1.0 - 1.5     1.16   40.13     3.49   56.09
1.5 - 2.0     4.60   43.80     4.60   58.26
2.0 - 2.5     0.00   28.09     1.69   37.09
2.5 - 3.0     0.00   24.38     0.00   32.16
3.0 - 3.5     7.67   18.33    13.53   23.53
3.5 - 4.0     7.92   13.95    14.64   20.77
4.0 - 4.5    70.81   76.43    82.84   85.71
4.5 - 5.0    12.75   14.98    85.50   74.47
all          31.84   32.59    57.43   48.47

Table 5.5.7: Results from the 10% sample

C       Test   Train
1.0      6     10
10.0     7     15
100.0   10     21
1000.0  13     23

Table 5.5.8: Accuracies for a linear kernel

From these tables we see that the optimal (C, γ) combination is (1000,0.99) with a Radial Basis Function kernel.

5.5.2 Results

This subsection describes the results from both the Neural Network and Support Vector Machine. We apply the parameters as defined in the previous subsection, and run SVM and NN on the entire data set.

             spot-on          next-to
Rating       Test    Train    Test    Train
1.0 - 1.5     1.16   44.64     6.98   58.37
1.5 - 2.0     4.60   38.48     6.90   55.29
2.0 - 2.5     0.00   30.30     1.69   36.20
2.5 - 3.0     0.75   20.22     0.75   29.14
3.0 - 3.5     3.07   17.78    11.16   24.07
3.5 - 4.0     7.76   15.97    10.32   20.56
4.0 - 4.5    61.01   65.36    84.01   87.00
4.5 - 5.0    26.97   30.65    86.27   73.79
all          31.51   32.50    57.14   47.60

Table 5.5.9: Results for the entire data set for title (SVM)

             spot-on          next-to
Rating       Test    Train    Test    Train
1.0 - 1.5     5.81   28.71    19.77   46.72
1.5 - 2.0     8.05   28.61    12.64   47.33
2.0 - 2.5     4.24   37.58     7.63   42.88
2.5 - 3.0     1.87    7.38     7.46   37.23
3.0 - 3.5     8.23   11.73    16.04   15.92
3.5 - 4.0     7.20    5.81    14.56   13.27
4.0 - 4.5    61.13   59.52    71.55   71.46
4.5 - 5.0    15.15   13.92    70.62   59.59
all          29.08   23.95    50.26   41.34

Table 5.5.10: Results Neural Network for the title feature set

We see a failure on the SVM side to correctly identify the 2.0-3.0 range, indicating a lack of information in this block. Although the NN performs better in this range, its accuracy is still within 0-5%, also indicating a lack of ability to train on this range.

5.6 Fusion

This section describes the end results: combining the feature sets. Within this section we entertain the notion of combining the different available feature sets:

1. Images
2. Images + Genres
3. Images + Genres + Title
4. Images + Genres + Title + Description

By doing so, we are able to ascertain the added value of each feature set, if any.
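A minimal sketch of the late-fusion step: the per-feature-set classifiers emit class probabilities, which are normalized, concatenated, and fed to a second-stage classifier (an SVM here; the Neural Network fusion is analogous). The first-stage classifier and feature-matrix names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

first_stage = [(clf_img, X_img), (clf_gen, X_gen),
               (clf_ttl, X_ttl), (clf_dsc, X_dsc)]

# Concatenate per-class probabilities and normalize each column, mirroring
# normalize(y) = sigma_y^{-1}(y - iota * mu_y) from subsection 4.4.3.
Z = np.hstack([clf.predict_proba(X) for clf, X in first_stage])
Z = StandardScaler().fit_transform(Z)

fusion = SVC(kernel="rbf", C=1000, gamma=0.99)  # values from table 5.6.9
fusion.fit(Z, y_bal)
```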

5.6.1 Neural Network

Here the results for the individual combinations are presented in tables 5.6.1-5.6.4:

             spot-on          next-to
Rating       Test    Train    Test    Train
1.0 - 1.5     4.65   98.26     4.65   99.10
1.5 - 2.0     1.16   97.21     1.16   98.11
2.0 - 2.5     2.56   88.82     9.40   93.34
2.5 - 3.0     8.21   73.95    12.31   84.25
3.0 - 3.5    14.94   53.30    30.87   62.42
3.5 - 4.0    19.20   46.73    33.60   59.64
4.0 - 4.5    36.43   67.09    60.86   82.23
4.5 - 5.0    29.45   55.20    65.02   74.40
all          26.72   72.63    49.71   81.67

Table 5.6.1: Images full set results

             spot-on          next-to
Rating       Test    Train    Test    Train
1.0 - 1.5     4.65   98.26     4.65   99.10
1.5 - 2.0     1.16   97.21     1.16   98.11
2.0 - 2.5     2.56   88.82     9.40   93.34
2.5 - 3.0     8.21   73.95    12.31   84.25
3.0 - 3.5    14.94   53.30    30.87   62.42
3.5 - 4.0    19.20   46.73    33.60   59.64
4.0 - 4.5    36.43   67.09    60.86   82.23
4.5 - 5.0    29.45   55.20    65.02   74.40
all          26.72   72.63    49.71   81.67

Table 5.6.3: Images + Genres full set results

             spot-on          next-to
Rating       Test    Train    Test    Train
1.0 - 1.5     2.33   98.61     2.33   99.09
1.5 - 2.0     2.33   98.53     2.33   99.02
2.0 - 2.5     1.71   91.13     9.40   94.26
2.5 - 3.0     8.96   75.35    13.43   85.44
3.0 - 3.5    14.53   54.10    32.68   63.91
3.5 - 4.0    20.48   49.67    35.12   61.95
4.0 - 4.5    39.45   70.92    60.20   82.38
4.5 - 5.0    26.77   50.81    64.59   73.57
all          27.32   73.76    49.84   82.46

Table 5.6.2: Images + Genres + Title full set results

             spot-on          next-to
Rating       Test    Train    Test    Train
1.0 - 1.5     5.81   98.76     5.81   99.24
1.5 - 2.0     1.16   98.59     1.16   99.05
2.0 - 2.5     1.71   91.68     9.40   95.54
2.5 - 3.0     7.46   79.70    10.07   87.14
3.0 - 3.5    12.71   53.37    32.40   63.90
3.5 - 4.0    22.40   52.96    35.52   63.56
4.0 - 4.5    37.92   69.46    60.16   82.51
4.5 - 5.0    28.46   55.00    64.31   74.83
all          27.33   75.05    49.70   83.23

Table 5.6.4: All feature set results

These results indicate that each added feature set is of incremental value to the accuracy of the predictor. We do, however, see no increase in the accuracy for the 1.5-2.0 rating after adding the description feature set. The third feature set combination has the highest accuracy in this range, which would suggest that it holds information pertaining to this rating range that the other feature set combinations do not.

5.6.2 Support Vector Machine

As done for the individual feature sets, we need to ascertain the optimal (C, γ) combination for each feature set combination based on a random 10% sample. The results are presented per cell as training and test accuracy, divided by a comma. The linear kernel has not been tested here, as it has shown not to be effective on the individual feature sets. Because of the large number of classifiers that need to be trained, a bagging classifier is employed to cut down on computation time.

          C = 1          C = 10         C = 100        C = 1000
γ = 0.01  54.69, 26.62   58.58, 26.04   59.90, 25.80   60.95, 25.88
γ = 0.1   58.98, 26.22   60.89, 25.87   62.18, 25.39   63.70, 24.74
γ = 0.5   60.98, 25.91   63.10, 25.42   65.23, 24.93   68.23, 25.71
γ = 0.9   61.89, 25.51   64.38, 25.10   67.38, 25.57   71.70, 26.16
γ = 0.99  62.03, 25.48   64.67, 25.13   67.94, 25.78   72.43, 26.35

Table 5.6.5: Images (each cell: train, test)

          C = 1          C = 10         C = 100        C = 1000
γ = 0.01  53.80, 26.78   57.64, 25.96   59.00, 25.87   60.18, 25.43
γ = 0.1   57.90, 26.04   60.02, 25.46   62.57, 25.19   65.32, 25.22
γ = 0.5   60.57, 25.48   64.27, 25.28   68.80, 25.29   75.29, 25.51
γ = 0.9   61.95, 25.67   66.68, 25.54   73.01, 25.22   80.58, 24.72
γ = 0.99  62.34, 25.83   67.27, 25.45   73.97, 25.43   81.77, 24.55

Table 5.6.6: Images + Genres (each cell: train, test)

          C = 1          C = 10         C = 100        C = 1000
γ = 0.01  54.98, 26.74   58.98, 26.59   60.95, 26.42   62.59, 25.80
γ = 0.1   59.25, 26.41   62.47, 26.09   65.44, 25.26   69.88, 25.04
γ = 0.5   62.51, 26.14   67.34, 25.38   74.88, 25.68   83.89, 25.10
γ = 0.9   64.32, 26.07   71.16, 26.01   80.49, 25.51   90.67, 24.26
γ = 0.99  64.60, 26.00   72.06, 25.96   81.83, 25.45   91.32, 24.26

Table 5.6.7: Images + Genres + Title (each cell: train, test)


          C = 1          C = 10         C = 100        C = 1000
γ = 0.01  55.11, 27.07   59.08, 26.36   61.30, 26.54   64.96, 26.19
γ = 0.1   60.29, 26.65   64.63, 26.48   70.54, 25.58   79.90, 25.13
γ = 0.5   65.53, 26.43   75.62, 25.86   87.66, 24.62   96.59, 22.83
γ = 0.9   69.29, 26.52   82.46, 25.83   94.17, 23.65   99.38, 22.86
γ = 0.99  69.87, 26.43   83.86, 25.99   95.09, 23.80   99.50, 22.43

Table 5.6.8: Images + Genres + Title + Description (each cell: train, test)

From these grid searches we obtain the following optimal settings:

Feature set combination                  C      γ      Train   Test
Images                                   1000   0.99   72.43   26.35
Images + Genres                          1000   0.99   81.77   24.55
Images + Genres + Title                  1000   0.99   91.32   24.26
Images + Genres + Title + Description    1000   0.99   99.50   22.43

Table 5.6.9: Results of the grid search

We now apply these settings to the full data set:

             spot-on          next-to
Rating       Test    Train    Test    Train
1.0 - 1.5     3.49   95.43     5.81   98.84
1.5 - 2.0     2.33   90.61     2.33   96.74
2.0 - 2.5     3.42   69.42     9.40   76.02
2.5 - 3.0     8.21   53.44    11.94   68.27
3.0 - 3.5    10.89   48.99    30.45   59.29
3.5 - 4.0    20.72   53.09    32.56   63.24
4.0 - 4.5    36.82   63.22    58.16   72.85
4.5 - 5.0    26.05   59.97    59.39   71.50
all          25.84   66.72    47.00   75.81

Table 5.6.10: Images full set results

             spot-on          next-to
Rating       Test    Train    Test    Train
1.0 - 1.5     2.33   97.69     5.81   99.29
1.5 - 2.0     2.33   96.20     2.33   98.22
2.0 - 2.5     5.13   79.81    14.53   84.72
2.5 - 3.0     7.09   61.30    13.43   74.50
3.0 - 3.5    12.71   52.61    31.42   62.07
3.5 - 4.0    20.32   54.66    33.44   64.72
4.0 - 4.5    37.18   65.28    57.29   73.57
4.5 - 5.0    25.12   60.40    59.44   71.77
all          25.81   70.98    47.10   78.60

Table 5.6.12: Images + Genres full set results

             spot-on          next-to
Rating       Test    Train    Test    Train
1.0 - 1.5     3.49   98.71     6.98   99.48
1.5 - 2.0     1.16   97.09     2.33   98.59
2.0 - 2.5     7.69   87.23    14.53   90.25
2.5 - 3.0     6.34   68.99    12.69   80.42
3.0 - 3.5    13.13   56.52    32.68   64.68
3.5 - 4.0    20.88   58.16    36.24   68.51
4.0 - 4.5    36.00   66.68    54.12   74.17
4.5 - 5.0    23.48   60.36    57.31   70.98
all          25.10   74.24    45.99   80.91

Table 5.6.11: Images + Genres + Title full set results

             spot-on          next-to
Rating       Test    Train    Test    Train
1.0 - 1.5     1.16   99.10     9.30   99.59
1.5 - 2.0     2.33   97.91     2.33   99.08
2.0 - 2.5     5.13   91.50    17.09   93.55
2.5 - 3.0     7.46   77.84    14.18   86.81
3.0 - 3.5    17.32   62.55    34.50   68.53
3.5 - 4.0    19.52   60.95    36.16   71.44
4.0 - 4.5    34.82   68.70    50.63   74.67
4.5 - 5.0    21.02   61.20    54.19   71.57
all          24.19   77.52    44.17   83.21

Table 5.6.13: Images + Genres + Title + Description full set results

We see here, from the falling test accuracy [25.84, 25.81, 25.10, 24.19] and the increasing training set accuracy [66.72, 70.98, 74.24, 77.52], that the classifier is over-fitting on the data set. This is a common problem with SVM; however, the end result still shows decent performance, with no sub-1% accuracy in the test set.

5.7 Remarks

This section summarizes the findings of the previous sections. Tables 5.7.1 and 5.7.2 show the summarized results.

             SVM              NN
Feature Set  Test     Train   Test    Train
Genres       34.4     48.8    36.7    35.3*
Image        12.8**   53.4    27.9    81.03
Title        31.5     32.5    29.08   23.95
Description  18.7     30.0    33.2    49.8

Table 5.7.1: Individual results per feature set

                         SVM             NN
Feature Set              Test    Train   Test   Train
Images                   25.8    66.7    26.7   72.6
Images + Genres          25.8    71.0    26.7   72.6
Images + Genres + Title  25.1    74.2    27.3   73.8
All                      24.2    77.5    27.3   75.1

Table 5.7.2: Accuracies of combinations of feature sets

A few things should be noted here. The result marked with * is highly skewed towards the higher ratings, which are more available; hence the disparity between test and training accuracy. The result marked with ** performs very well on the lower ratings and is therefore lower in overall accuracy, but performs better than its NN counterpart there. Between the two methodologies, table 5.7.2 shows that the NN generalizes better and that the SVM is prone to over-fitting for this data set. The disparity between the Image feature set in the combinations and as an individual feature set arises because a Bagging method has been applied in the combinations.

Genres shows poor accuracy for the 1.5-2.0 range for both SVM and NN, and the NN performs sub-par on the 1.5-3.5 range. This could be due to a lack of information contained within the Genres feature set. A possible explanation could be the generation of ADASYN samples: they allow continuous values for the genres, although the original data set only includes a binary indicator for an app pertaining to a specific genre.

As for the Image feature set, we see that the SVM classifier (with C = 10, γ = 0.1 and a Radial Basis Function kernel) trains very well on the training sample and classifies the ratings evenly across the board, with a high accuracy for the lower ratings. This, combined with the fact that the Neural Network performs badly on the lower ratings but well on the high ratings, indicates the difference in approach. A combination of the two methods could lead to a higher accuracy.


The description feature set tells a different story with regard to accuracy: although the NN performs skewed towards the higher ratings, the SVM (bar 3.5-4.0) performs evenly across the ratings. A bad performance on 3.5-4.0 is present in the title feature set as well, indicating a lack of textual information pertaining to that rating group. The title feature set shows a very bad performance on the 1.0-3.5 range using both SVM and NN, again indicating a lack of information after generating synthetic samples.

A crucial point is to compare this work to other work. The work by Wieser (2016) applies the same pre-processing as this thesis, but uses a rounding function to the nearest integer for its ratings. The author attained an overall accuracy of 46.0% using Support Vector Machines on all feature sets available to him. The caveat here is that although this accuracy seems higher than the 44.2% attained in this chapter, the distribution of the accuracy is highly skewed towards the 4 rating class. Although it is not described clearly in that work, one would assume this to correspond to our '3.5-4.0' and '4.0-4.5' classes. The results for the 1 and 2 classes attain an accuracy of 0% and 0.16% respectively. From this we conclude that the classifier proposed in this thesis performs better.

The work by Glisovic (2016) uses the same definition and classes as Wieser (2016). The accuracies attained in that work are 62.8% and 63.5% for SVM and NN respectively. It does, however, have 0.0% accuracy on (translated to the definition used in this thesis) the '1.5-2.0' and '2.0-2.5' ratings. Although the classifier proposed in this thesis performs badly on these classes relative to the others, it never attains a 0.0% accuracy, indicating a better performance overall.

6 | Conclusion

The thesis is divided into five chapters preceding this conclusion: chapter 1 gave a short introduction to the research and its subject; chapter 2 reviewed literature pertaining to online content popularity and its prediction; chapter 3 described the theoretical groundwork of the research; chapter 4 proposed a method of predicting the average rating of an online content item; and chapter 5 discussed the application of the methodology in an experiment on App Store data, together with the results.

The thesis statement was formulated as follows: 'Is it possible to predict the average rating of App Store content?', with two sub-questions. Firstly, we wanted to check whether Support Vector Machines or Neural Networks show a performance difference. Secondly, we wanted to see how the performance depends on the exploitation of different feature sets, by means of an experiment. The experiment entailed investigating the information (spatial and objects) contained within the images associated with an app. Also investigated was the predictive power of the textual information contained within the title and description of an item, using Latent Dirichlet Allocation. Furthermore, the categorical information contained within the genres was added as a feature set. These feature sets were combined by means of late fusion: a separate classifier was trained for each feature set, after which another classifier was trained on their outputs. This was done for both the Support Vector Machine and the Neural Network.

The first sub-question concerns the performance of Support Vector Machines and Neural Networks. It was answered by means of an experiment on scraped data of 34,526 applications from the App Store, gathered with the help of AppTweak, which allowed us to check the performance of the suggested methodology. The results showed that Support Vector Machines are prone to over-fitting and therefore generalize poorly. Neural Networks, although prone to problems with the lower ratings, showed good generalization and performed better. We therefore conclude, for this sub-question, that Neural Networks perform better in this particular case.

The other sub-question concerns the performance contribution of the different feature sets. As described earlier, the Support Vector Machine classifier generalized poorly, so the Neural Network classifier is discussed here. Every addition of a feature set brought a boost to the classifier accuracy. There is, however, an important distinction to be made within the feature sets: the title feature set made the [1.5-2.0] accuracy drop to 1.16, indicating either that the title poorly describes the information for this range or problems with the feature set itself. Performance increased in general, however, with the addition of the description feature set. We therefore conclude that the performance relies on the addition and mixture of all available feature sets, and that the fusion aids accuracy.

We conclude that it is possible to predict the average rating of App Store content, but not yet to an acceptable degree of certainty. The caveat of this thesis lies mostly in the gathering of data. The groundwork laid here, however, can serve as a stepping stone for future research with more fine-tuned models.

6.1 Future Work

For future work we suggest transitioning to a continuous variable instead of binning. Although it is hard to pinpoint the informational difference between an application rated 4.2 and one rated 4.21, it is safe to assume that there is at least some information loss when 'binning' the rating of an application, and that this loss is highly dependent on the method of binning. The use of a continuous rating would, however, require a different approach with regard to the methodology described in this thesis.

Another point of interest is the Images feature set. Although the spatial and object information contained within these images is analyzed, a more app-specific approach could be taken. By creating a labeled set of training data more relevant to app icons (or icons of whatever category the online content is associated with), e.g. cartoon images, and training a custom image classifier on this data, it is not unlikely that a higher accuracy can be achieved.

As applications in particular have reviews, we suggest using the sentiment and/or textual information contained within these reviews as a feature set. The inclusion of engagement/sentiment from reddit.com, facebook.com, or twitter.com could serve as an additional feature set as well.

Sampling of data is done in this thesis by means of ADASYN, which, however, treats the data as continuous. Improvements can be made by applying a ceiling function to the generated data, or by using a sampling algorithm more suited to binary data when applying it to categorical binary features. As far as the data itself is concerned, a prime necessity in general is well-distributed data. Although ADASYN mitigates this need somewhat, a scraping tool more focused on lower-rated apps could improve the algorithm performance.

Pertaining to the sampling of the data and the split into training and test sets, a more thorough approach using k-fold cross-validation should be employed, making the choice of a particular training/test split moot; a minimal sketch follows below.
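A minimal sketch of the suggested cross-validation, assuming scikit-learn and placeholder names `X`, `y`:

```python
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Each of the 5 folds serves as the test set exactly once, so no single
# train/test split dominates the reported accuracy.
scores = cross_val_score(SVC(kernel="rbf", C=1000, gamma=0.99), X, y,
                         cv=5, n_jobs=-1)
print(scores.mean(), scores.std())
```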

A more thorough approach to determining the topic number k is also warranted, as the current search is very coarse. At the moment an optimal k is found in one pass; a second pass searching the range [k − δ, k + δ] could yield a more specific k than currently used. Also, in the context of Latent Dirichlet Allocation: although TF-IDF goes far in describing the tokens used, the cutoff points were chosen in a non-structural way. A more structured approach, for instance using the perplexity attained after the TF-IDF cutoffs have been chosen, could enhance accuracy.

This thesis treats Support Vector Machines and Neural Networks separately; a suggestion for future work is their combined use, as the results shown in the previous chapter indicate that, for the Images feature set, Support Vector Machines perform well on the lower ratings while Neural Networks perform well on the higher ratings.

The combination of all these enhancements should make for an interesting continuation of the research posited in this thesis, allowing for better insight into the prediction of online content popularity.


References

D. Andrzejewski, X. Zhu, and M. Craven. Incorporating domain knowledge into topic modeling via dirichlet forest priors. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 25–32. ACM, 2009.

AppShopper. Homepage, 2017. URL https://web.archive.org/web/20170816125951/http://appshopper.com/.

Y. Bae and H. Lee. Sentiment analysis of twitter audiences: Measuring the positive or negative influence of popular twitterers. Journal of the American Society for Information Science and Technology, 63(12): 2521–2535, 2012. ISSN 1532-2890. 10.1002/asi.22768. URL http://dx.doi.org/10.1002/asi.22768.

S. Bengio and Y. Bengio. Taking on the curse of dimensionality in joint distributions using neural networks. IEEE Transactions on Neural Networks, 11(3):550–557, 2000.

C. M. Bishop. Pattern recognition and machine learning. springer, 2006.

D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. Journal of machine Learning research, 3(Jan):993–1022, 2003.

L. Breiman. Bagging predictors. Machine learning, 24(2):123–140, 1996.

L. Breiman. Pasting small votes for classification in large databases and on-line. Machine Learning, 36(1): 85–103, 1999.

J. Chang, S. Gerrish, C. Wang, J. L. Boyd-Graber, and D. M. Blei. Reading tea leaves: How humans interpret topic models. In Advances in neural information processing systems, pages 288–296, 2009.

Y.-W. Chang, C.-J. Hsieh, K.-W. Chang, M. Ringgaard, and C.-J. Lin. Training and testing low-degree polynomial data mappings via linear svm. Journal of Machine Learning Research, 11(Apr):1471–1490, 2010.

O. Chapelle. Training a support vector machine in the primal. Neural computation, 19(5):1155–1178, 2007.


N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. Smote: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16:321–357, 2002.

K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of machine learning research, 2(Dec):265–292, 2001.

C. W. Curtis. Linear Algebra. Allyn and Bacon, 2 edition, 1968.

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.

K. Duan and S. S. Keerthi. Which is the best multiclass svm method? an empirical study. Multiple classifier systems, 3541:278–285, 2005.

A. Ö. Eren and M. Sert. Movie rating prediction using ensemble learning and mixed type attributes. In Signal Processing and Communications Applications Conference (SIU), 2017 25th, pages 1–4. IEEE, 2017.

Ericsson. Ericsson mobility report 2017, 2017. URL https://web.archive.org/web/20170723121909/https://www.ericsson.com/en/mobility-report.

Forbes. Apple's app store generating meaningful revenue, 2017. URL https://web.archive.org/web/20170107174901/https://www.forbes.com/sites/chuckjones/2017/01/06/apples-app-store-generating-meaningful-revenue/#45c0b09e33db.

V. Glisovic. Forecasting the success of game-apps based on reviews. Master's thesis, University of Amsterdam, 2016.

M. Harvey, M. J. Carman, I. Ruthven, and F. Crestani. Bayesian latent variable models for collaborative item rating prediction. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM '11, pages 699–708, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0717-8. 10.1145/2063576.2063680. URL http://doi.acm.org/10.1145/2063576.2063680.

H. He, Y. Bai, E. A. Garcia, and S. Li. Adasyn: Adaptive synthetic sampling approach for imbalanced learning. In Neural Networks, 2008. IJCNN 2008.(IEEE World Congress on Computational Intelligence). IEEE International Joint Conference on, pages 1322–1328. IEEE, 2008.

T. K. Ho. The random subspace method for constructing decision forests. IEEE transactions on pattern analysis and machine intelligence, 20(8):832–844, 1998.

M. Hoffman, F. R. Bach, and D. M. Blei. Online learning for latent dirichlet allocation. In advances in neural information processing systems, pages 856–864, 2010.


L. Hong and B. D. Davison. Empirical study of topic modeling in twitter. In Proceedings of the first workshop on social media analytics, pages 80–88. ACM, 2010.

P.-Y. Hsu, Y.-H. Shen, and X.-A. Xie. Predicting movies user ratings with imdb attributes. In International Conference on Rough Sets and Knowledge Technology, pages 444–453. Springer, 2014.

ILSVRC. Ilsvrc2016, 2016. URL http://image-net.org/challenges/LSVRC/2016/results.

S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In F. Bach and D. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 448–456, Lille, France, 07–09 Jul 2015. PMLR. URL http://proceedings.mlr.press/v37/ioffe15.html.

A. Khosla, A. Das Sarma, and R. Hamid. What makes an image popular? In Proceedings of the 23rd international conference on World wide web, pages 867–876. ACM, 2014.

A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.

M. H. Latif and H. Afzal. Prediction of movies popularity using machine learning techniques. International Journal of Computer Science and Network Security (IJCSNS), 16(8):127, 2016.

G. Louppe and P. Geurts. Ensembles on random patches. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 346–361. Springer, 2012.

J. B. Lovins. Development of a stemming algorithm. Mech. Translat. & Comp. Linguistics, 11(1-2):22–31, 1968.

H. P. Luhn. A statistical approach to mechanized encoding and searching of literary information. IBM Journal of research and development, 1(4):309–317, 1957.

E. Malmi. Quality matters: Usage-based app popularity prediction. In Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct Publication, UbiComp '14 Adjunct, pages 391–396, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-3047-3. 10.1145/2638728.2641698. URL http://doi.acm.org.proxy.uba.uva.nl:2048/10.1145/2638728.2641698.

M. Mazloom, R. Rietveld, S. Rudinac, M. Worring, and W. van Dolen. Multimodal popularity prediction of brand-related social media posts. In Proceedings of the 2016 ACM on Multimedia Conference, MM ’16, pages 197–201, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-3603-1. 10.1145/2964284.2967210. URL http://doi.acm.org/10.1145/2964284.2967210.

M. Meeker. 2014 internet trends, 2014. URL https://web.archive.org/web/20170712050029/http://www.kpcb.com/blog/2014-internet-trends.


M. Mestyán, T. Yasseri, and J. Kertész. Early prediction of movie box office success based on wikipedia activity big data. PloS one, 8(8):e71226, 2013.

V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 807–814, 2010.

A. Oghina, M. Breuss, M. Tsagkias, and M. de Rijke. Predicting imdb movie ratings using social media. In European Conference on Information Retrieval, pages 503–507. Springer, 2012.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

J. Platt et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers, 10(3):61–74, 1999.

S. Pramod and A. Joshi. Prediction of movie success for real world movie data sets. 2017.

J. W. Pratt, H. Raiffa, and R. Schlaifer. Introduction to statistical decision theory. MIT press, 1995.

PyTorch. PyTorch vision models, 2017. URL https://web.archive.org/web/20170929201855/http://pytorch.org/docs/master/torchvision/models.html.

M. Röder, A. Both, and A. Hinneburg. Exploring the space of topic coherence measures. In Proceedings of the eighth ACM international conference on Web search and data mining, pages 399–408. ACM, 2015.

C. G. M. Snoek, M. Worring, and A. W. M. Smeulders. Early versus late fusion in semantic video analysis. In Proceedings of the 13th Annual ACM International Conference on Multimedia, MULTIMEDIA ’05, pages 399–402, New York, NY, USA, 2005. ACM. ISBN 1-59593-044-2. 10.1145/1101149.1101236. URL http://doi.acm.org.proxy.uba.uva.nl:2048/10.1145/1101149.1101236.

K. Sparck Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of documentation, 28(1):11–21, 1972.

N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of machine learning research, 15(1):1929–1958, 2014.

G. Szabo and B. A. Huberman. Predicting the popularity of online content. Commun. ACM, 53(8):80–88, Aug. 2010a. ISSN 0001-0782. 10.1145/1787234.1787254.


G. Szabo and B. A. Huberman. Predicting the popularity of online content. Commun. ACM, 53(8): 80–88, Aug. 2010b. ISSN 0001-0782. 10.1145/1787234.1787254. URL http://doi.acm.org/10.1145/ 1787234.1787254.

H. M. Wallach, D. M. Mimno, and A. McCallum. Rethinking lda: Why priors matter. In Advances in neural information processing systems, pages 1973–1981, 2009.

S. Wieser. Forecasting the success of apps based on their visual appearance. Master’s thesis, University of Amsterdam, 2016.

T.-F. Wu, C.-J. Lin, and R. C. Weng. Probability estimates for multi-class classification by pairwise coupling. Journal of Machine Learning Research, 5(Aug):975–1005, 2004.
