Forecasting the Popularity of Applications: An Analysis of Textual and Graphical Properties

Harro van der Kroft
Master Thesis for Econometrics - Big Data Track
Faculty of Economics and Business, Section Econometrics

Abstract

This thesis contributes to the scarce literature on App Store content popularity prediction. By scraping data from the Apple App Store, we form feature sets in the textual and graphical domains. The methodology employed accommodates data from other online content sources and combines the feature sets by means of late fusion. This thesis investigates the predictive power of Neural Networks and Support Vector Machines in parallel and, by layering different feature sets, shows that there is an added benefit in combining them. We reveal that the methodology outlined in this thesis has predictive power.


Acknowledgments

I would like to sincerely thank my supervisor Prof. Dr. M. Worring for his supervision, patience, and enthusiasm. Marcel's passion has furthered my interest in the field of AI more than I could have hoped for. Furthermore, I would like to thank Leo Huberts, Diederik van Krieken, Frederique Arntz, and Dominique van der Vlist for their input and constructive criticism.

Statement of Originality

This document is written by Harro van der Kroft, who declares to take full responsibility for the contents of this document.

I declare that the text and the work presented in this document are original and that no sources other than those mentioned in the text and its references have been used in creating it.

The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.

Contents

1 Introduction
2 Literature Review
2.1 Internet Movie Database
2.2 Online content analysis
2.3 Popularity Prediction
2.4 Deep Learning and Image classification
2.5 Modality in features
3 Theory
3.1 Statistics
3.2 Natural Language Processing
3.2.1 TF-IDF
3.2.2 LDA
3.2.3 Pre-processing & stemming
3.2.4 Topic Number Estimation
3.3 Artificial Neural Networks
3.3.1 Feed-forward network
3.3.2 Activation layers
3.3.3 Network Training
3.3.4 Loss function
3.3.5 Other layers
3.3.6 Normalization
3.4 Support Vector Machine
3.4.1 Ensemble Learning
3.4.2 Kernel
3.4.3 Parameters
3.5 Synthetic Sampling
4 Methodology
4.1 Pre-processing
4.2 Feature Extraction
4.2.1 Image
4.2.2 LDA
4.2.3 Genres
4.3 Prediction Goal
4.3.1 Continuous
4.4 Prediction
4.4.1 Sampling
4.4.2 Support Vector Machine
4.4.3 Neural Network
4.5 Summary
5 Experiment
5.1 Origin & Explanation
5.1.1 Statistics
5.2 Genres Feature Set
5.2.1 Statistics
5.2.2 Results
5.3 Image Feature Set
5.3.1 Results
5.4 Description Feature Set
5.4.1 Parameters
5.4.2 Results
5.5 Title Feature Set
5.5.1 Parameters
5.5.2 Results
5.6 Fusion
5.6.1 Neural Network
5.6.2 Support Vector Machine
5.7 Remarks
6 Conclusion
6.1 Future Work
References

1 | Introduction

With the introduction of the Apple iPhone in 2007, smartphones have become a fixture in the online consumption of media. There are over 3.9 billion active mobile data subscriptions worldwide, with estimates for 2022 reaching 6.9 billion (Ericsson, 2017, p. 2). Furthermore, in 2014 the monthly data transfer associated with these active subscriptions already exceeded 2.1 GiB (about 3 Compact Discs). The fact that people spend an ever-increasing amount of time on their phones (Meeker, 2014) means that online content consumption is a large and growing part of people's lives. Companies such as Google, Netflix, Amazon, Hulu, Apple, Microsoft, and many more try to captivate users with applications, movies, and online content related to their respective fields and businesses.

The mobile app development market is large. On August 16th, AppShopper.com reported that there were 1.6 million applications available in the Apple App Store (AppShopper, 2017). A recent article by Forbes.com showed that over the 2016 calendar year the total amount spent in the App Store was $30 billion, with developers receiving over $20 billion (Forbes, 2017). All in all, these numbers show that there is a lot of revenue to be made in the online content business, with the App Store being a prime example of a medium serving online content.

Online content, however, is very diverse: it ranges from images on Flickr to Microsoft Excel in the Android app store. There is a lot of variety, and people's attention span is intrinsically short (Szabo and Huberman, 2010a, p. 88). Therefore, the added value of each item has to be clearly communicated to the consumer. When doing so, one must consider the different feature sets pertaining to an item. Firstly, the graphical domain: thumbnails, videos, layouts, and presentation. Secondly, the textual domain: descriptions, titles, and reviews. Lastly, more meta attributes may be considered: awards, mentions in other online content, and, for movies, actors.

However, this diversity means nothing without a common denominator on which to pin the added value per consumer. A clear example is the rating of an item. Ratings allow the consumer to express sentiment, and allow the content provider to have a proxy for the statistic they actually need: popularity.

Popularity is a vague construct, so we need to quantify it. One candidate measure is the number of views (henceforth simply views). Views capture a good part of popularity, but there is a fatal flaw in using this statistic as a proxy: it does not show the sentiment for a particular item. An item may have a large number of views because of marketing but still fall short of consumer expectations. As stated before, many content providers allow for the rating of an item; an example is the rating of an application in Apple's App Store, a 1-5 rating to show sentiment.

Companies such as the aforementioned giants need to anticipate the effect of their next move. The biggest problem for most companies is: how will my future content perform? Will HBO produce another season of their latest TV show, or will Netflix produce a new series in its entirety? An approximation of the success that content can garner online gives these companies more security.

This paper answers the following question by developing a tool set/algorithm:

Is it possible to predict the average rating of App Store content?

With the following sub-questions:

1. How do Support Vector Machines (SVM) and Neural Networks (NN) perform?
2. How does performance depend on the exploitation of different feature sets?

The tools used within this paper are based in the realm of machine learning: Neural Networks, Support Vector Machines, Latent Dirichlet Allocation (LDA), and ensemble learning. NN and SVM were chosen because they allow a classification problem to be solved.

The expectation is that a decent approximation of the average rating of online content can be ascertained, with some probable caveats. Firstly, meta data that would be relevant to the popularity classification but is not readily available is most probably omitted (e.g. the marketing budget for an application, or the popularity of an actor at the time of release). Secondly, companies are not too keen on disclosing all information regarding their online content. Lastly, the algorithms used have benefits but also disadvantages, which will be discussed.

This paper first covers the relevant literature pertaining to textual and graphical analysis of content in chapter 2. Afterwards, chapter 3 introduces the theoretical constructs, laying the foundation for the reader to understand the methodology outlined in chapter 4. An experiment is then performed on App Store data in chapter 5. Finally, a conclusion is drawn in chapter 6.

2 | Literature Review

The main focuses of this chapter are: the textual analysis of online content, the analysis of graphical online content, and online content analysis in general. The general developments in the field of machine learning pertaining to classifying images will also be discussed.

2.1 Internet Movie Database

Recent research on online content popularity has mainly focused on the popularity of movies using the Internet Movie Database (IMDb), in papers such as those by Eren and Sert (2017) and Pramod and Joshi (2017). The former focuses on a binary classification, flop or success; the latter focuses on predicting a rating. The work by Eren and Sert (2017) is of particular importance for this thesis because it combines mixed data types. Other work includes that by Latif and Afzal (2016), which couples econometric regressions and machine learning to attain a rating.

In the paper by Hsu et al. (2014) the authors tackle a problem similar to this thesis: predicting popularity. The authors analyze 32,968 movies, with a focus on the graphical part, and achieve strong performance with neural networks (an absolute prediction error of 0.82). They use 31,406 movies as a training set, a 95/5% training/testing split. The authors use key image components to identify features contained in images: using color histograms, gradient histograms, texture, and objects, the popularity of an image is predicted by means of Support Vector Regression. The work by Oghina et al. (2012) uses YouTube commentary sentiment for prediction; others focus only on box office revenue (Mestyán et al., 2013). In short: IMDb is a well-structured data set which has been analyzed thoroughly.

2.2 Online content analysis

Since IMDb data is well prepared and thoroughly analyzed, we shift our focus: other online categories with lower-quality data are of interest. This section analyzes relevant papers on this subject.

The paper by Khosla et al. (2014) focuses on the popularity of online images. Their paper uses a data set consisting of 2.3 million images from Flickr, an image sharing site. They use meta-data consisting of views and social cues; for them, a social cue is the number of friends of the photo's uploader. The paper does not, however, concern itself with the rating of the online content, but with the number of views. This is because Flickr does not allow for an average rating on a nominal scale; it does allow for thumbs up or down, which provides less information than an integer scale from e.g. 1 to 10. The authors used views as a proxy for popularity, and this thesis will likewise use a proxy for popularity.

Other branches of online content that have been extensively researched are YouTube and Twitter, with research such as the works by Bae and Lee (2012) and Szabo and Huberman (2010b). The former investigates the factors that drive the popularity of messages on Twitter, mostly based on sentiment. The latter focuses on the popularity of videos on YouTube by looking at the then-popular website Digg.com, associating the popularity of YouTube videos with the number of likes on Digg. The paper by Szabo and Huberman (2010a) also concerns itself with Digg.com, using the information gathered on the site to model the popularity of applications. Despite the novel approach, it relies on data outside the applications themselves, which we see as a limitation. We will instead use data more closely related to the online content itself (its meta data) as a base.

2.3 Popularity Prediction

Harvey et al. (2011) predict the rating given by a user. An interesting point is that by predicting the rating of a user, one can essentially predict the overall popularity of an item with the same algorithm. The paper by Malmi (2014) investigates the connection between the usage of an application and its popularity. The data set used in the paper describes the usage of the application and the user's phone, and the paper tries to quantify the correlation between popularity and usage data. They find a surprisingly small correlation between the popularity of an application and its usage.

In the paper by Mazloom et al. (2016), an analysis is done of the different driving forces behind the popularity of brand-related online content posts on social media. It combines features from the visual and textual properties of online content. One of the foremost results of the paper is that the visual and textual properties complement each other. In this thesis, however, the properties are not distilled into engagement properties: we are not interested in engagement or sentiment, but the use of visual and textual properties is of direct interest.

2.4 Deep Learning and Image classification

This thesis uses deep learning for feature extraction from images, and the following section will review the important contributions made in image classification.

In 2009 the cleanly-labelled ImageNet data set (Deng et al., 2009) was introduced, consisting of millions of sorted and labelled images. Because a perfectly labelled image set is a gold mine within the field of machine learning, it initiated an arms race in picture labeling and object recognition. The ImageNet data is the basis for the Large Scale Visual Recognition Challenge (ILSVRC). This competition, and the rules it specified, allowed algorithms to compete in image recognition and object classification. The two trophy accuracy tests were the top-5 and top-1 error rates. The first is defined as the percentage of classifications whereby the correct label is not among the top-5 labels predicted by the model; the latter is defined in the same manner for the single most probable label. The competition first took place in 2009, and the research community concerned with image classification and object recognition was boosted significantly, with more sophisticated models and better error rates as a result.

Among these methodologies is the work by Krizhevsky et al. (2012). The network described in their paper is called AlexNet, a convolutional neural network consisting of only 8 layers: the first 5 are convolutional layers, followed by 3 fully connected layers with dropout layers in between. AlexNet was also noteworthy in that it was trained across two GPUs. It reached a top-5 error rate of 15.3%, a full 10.8 percentage points ahead of the runner-up. The concepts of these layers are explained in section 3.3. AlexNet is discussed here as it is a widely used and well-documented entrant to the ILSVRC; there are, however, stronger entrants. The teams at the top of ILSVRC 2016 were CUImage, HIKVision, and Nuist (ILSVRC, 2016).

2.5 Modality in features

The previous sections concentrated on the gathering of data and therefore of features. When combining these data points, it is necessary to talk about the fusion of features. The paper by Snoek et al. (2005) concerns itself with the fusion of multiple types of features, illustrated by two types of fusion: early and late fusion. The former fuses the modalities in feature space, for example by concatenating price data and colour data into one data set to forecast the sales of ice cream. In late fusion, the feature sets are trained on individually to form intermediate outcomes; these outcomes, and the per-class probabilities that arise, are then trained upon again. This can be applied to regression, SVM, and NN.
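To make this concrete, the sketch below shows one way late fusion can be set up; the helper name `late_fusion_fit`, the SVM base classifiers, and the logistic-regression meta-classifier are illustrative assumptions, not the exact setup of Snoek et al. (2005).

```python
# Minimal late-fusion sketch: one classifier per feature set (modality),
# whose class probabilities are fused and fed to a second-stage learner.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

def late_fusion_fit(feature_sets, y):
    """feature_sets: list of (n_samples, n_features_m) arrays, one per modality."""
    base_models, probas = [], []
    for X in feature_sets:
        clf = SVC(probability=True).fit(X, y)  # first-stage classifier
        base_models.append(clf)
        probas.append(clf.predict_proba(X))    # per-class probabilities
    meta_X = np.hstack(probas)                 # fuse in decision space
    meta = LogisticRegression(max_iter=1000).fit(meta_X, y)
    return base_models, meta
```

In practice the meta-classifier is usually fit on held-out (cross-validated) probabilities rather than in-sample ones, to avoid over-fitting the fusion stage.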

The main take-away from this analysis of recent research is that a lot of work has been done in the respective fields. However, the research is either very concentrated on a certain field of online content (IMDb, Flickr, YouTube, Twitter) or on a use case (fraud detection, revenue prediction, usage prediction). To overcome this disadvantage, this paper posits a more general model, with an experiment on App Store data as an example.

3 | Theory

This chapter lays the theoretical groundwork for the methodology outlined in chapter 4. Firstly, section 3.1 covers some basic statistics pertaining to the field of machine learning. Secondly, section 3.2 covers the textual analysis part. Thirdly, section 3.3 covers Neural Networks, and section 3.4 covers the theoretical side of SVM. Finally, section 3.5 covers data sampling.

3.1 Statistics

This section includes some primer information on Bayesian statistics, which is widely used within the field of machine learning. In this form of statistics, the so-called prior probability of an event is the probability assigned before relevant data is taken into account. Conversely, the posterior probability is the probability distribution of an unknown random variable after relevant data has been taken into account; "posterior" here means taking into account relevant evidence from an experiment. The posterior probability can be written in textual form as:

Posterior probability ∝ Likelihood × Prior probability

We shall illustrate with an example: given an experiment (data) $d$ and parameters $\theta$, we may write the above relation as

$$P(\theta \mid d) = \frac{P(d \mid \theta)\,P(\theta)}{P(d)}.$$

If the posterior $P(\theta \mid d)$ belongs to the same family of distributions as the prior $P(\theta)$, the prior and posterior are called conjugate distributions, and the prior is called the conjugate prior (Pratt et al., 1995).

The Dirichlet distribution, in this thesis denoted by $\text{Dirichlet}(\alpha)$, is a family of continuous multivariate probability distributions parametrized by a vector $\alpha$ of positive real numbers. The Dirichlet distribution is the multivariate generalization of the Beta distribution. Dirichlet distributions are often used in Bayesian statistics, as the Dirichlet distribution is the conjugate prior of the Multinomial distribution.


3.2 Natural Language Processing

The field of Natural Language Processing (NLP) is a field of artificial intelligence concerned with processing human language in a way that computers are able to handle; in particular, it is concerned with processing large corpora of text. The challenges in natural language processing involve speech recognition, dialogue systems, and generating natural language, amongst others. We define some notation to be used in this section:

Token A unit from the vocabulary, indexed by $\{1, \dots, T\}$. A token can be seen as a lowercase version of a word without punctuation. Using unit vectors we distinguish between the tokens used: if token $i$ is used, the vector $e_i$ represents this word, where $e_i$ is a vector of all zeroes except for a one in the $i$-th spot. If a token is not used in a document, the resulting vector is the zero vector.

Document A sequence of tokens. A document is defined as the set of tokens contained within it,

$$d = \{t_1, t_2, \dots, t_{W-1}, t_W\}, \tag{3.1}$$

with $W$ the number of words present in the document after processing.

Corpus The set of all documents, defined as $C = \{d_1, d_2, \dots, d_{N-1}, d_N\}$, with $N$ the number of documents in the corpus.

To save space and avoid computational troubles, all zero vectors can be omitted.

3.2.1 TF-IDF

Within NLP the need arises to rank words by their significance: either a word does not contain information relevant to the task ('and', 'or', 'the') or the word is too rare within the corpus. There needs to be a balance between rarity and information. Before tackling the concept of TF-IDF we introduce some notation. For any set $S$, we denote the number of elements in the set by $|S|$; for example, $|\{1, 5, 50, 512\}| = 4$.

Now suppose that we have a corpus of text documents and we wish to rank which documents are most relevant to our query 'a tidy room'. A simple way of querying this data set is by eliminating all documents that do not contain the words 'a', 'tidy', and 'room'. This, however, creates a two-fold problem: we are left with a lot of documents, and the documents that are left are not ranked by relevance. To distinguish between the leftover documents we count the frequency of the terms in each document; this is aptly named the Term Frequency (TF). This insight was first formulated by Luhn (1957). This thesis

uses the following definition for TF:

$$\text{TF}(t, d) = f_{t,d} = |\{t \in d\}|, \tag{3.2}$$

with $t$ indicating the token as discussed in section 3.2, $d$ the document being analyzed, and $f_{t,d}$ the raw count of the token in that particular document.

This notion of TF, however, does not control for the number of documents available in total. Take the word 'a' as an example: it is a common word, so the term frequency as defined by equation 3.2 will wrongly assign high importance to documents containing the word 'a'. We would assume that the words 'tidy' and 'room' carry more weight in defining the importance of a text. We therefore employ the concept of Inverse Document Frequency (IDF):

$$\text{IDF}(t, C) = \log\left(\frac{N}{|\{d \in C : t \in d\}|}\right),$$

where $t$ again constitutes the token being researched, $C$ the corpus, $N$ the number of documents in the corpus, and $|\{d \in C : t \in d\}|$ the number of documents where the term $t$ is present. The assumption here is that only tokens actually present in the corpus are researched. The work by Sparck Jones (1972) created the basis for Inverse Document Frequency.

By combining Term Frequency and Inverse Document Frequency, we obtain a ranking function that is a trade-off between term frequency on the document level, and the frequency the token appears in the corpus. We calculate it as follows:

$$\text{TF-IDF}(t, d, C) = \text{TF}(t, d) \times \text{IDF}(t, C) \tag{3.3}$$
$$= |\{t \in d\}| \times \log\left(\frac{N}{|\{d^* \in C : t \in d^*\}|}\right) \tag{3.4}$$

This ranking function can be used to make topic modelling computationally more efficient: not all tokens present in the original texts remain in the documents after parsing.
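As an illustration, the following sketch computes equations 3.2 through 3.4 directly; the toy corpus and helper names are assumptions for demonstration only.

```python
import math
from collections import Counter

def tf(token, doc):
    # Raw count of `token` in the tokenized document `doc` (equation 3.2).
    return Counter(doc)[token]

def idf(token, corpus):
    # Log of (number of documents) over (documents containing `token`);
    # assumes the token occurs in at least one document, as in the text.
    n_containing = sum(1 for doc in corpus if token in doc)
    return math.log(len(corpus) / n_containing)

def tf_idf(token, doc, corpus):
    return tf(token, doc) * idf(token, corpus)

corpus = [["a", "tidy", "room"], ["a", "dog"], ["a", "tidy", "desk"]]
print(tf_idf("tidy", corpus[0], corpus))  # positive weight for 'tidy'
print(tf_idf("a", corpus[0], corpus))     # 0: 'a' occurs in every document
```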

3.2.2 LDA

Within the realm of machine learning and NLP, a topic model is a statistical tool for extracting the abstract 'topics' that are assumed to be hidden in a collection of documents. For our purposes, a topic model can lay bare the hidden semantic structures of a given text. This subsection covers a specific type of topic modelling: the Latent Dirichlet Allocation (LDA) model. First described by Blei et al. (2003), this type of topic model allows the differences between documents to be explained by latent factors, described by the presence of certain topics within documents.


3.2.2.1 Model

LDA is a generative model, which means that it models the joint probability distribution of words, documents, and topics: it describes a probabilistic process by which documents are generated, starting with a set of priors which are updated upon learning new information. This relates to the prior and posterior probabilities of section 3.1. Documents are represented as random mixtures over latent topics drawn from $\{1, \dots, K\}$; each topic in turn is characterized by a distribution over the words. For each document $d \in C$, LDA assumes the following generative process, as described by Blei et al. (2003):

1. We have a corpus $C$ consisting of $N$ documents, each with $N_i$ words.
2. For each document $i \in \{1, \dots, N\}$, choose a $K$-dimensional topic mixture $\theta_i \sim \text{Dirichlet}(\alpha)$, with $\alpha$ a prior.
3. For each topic $k \in \{1, \dots, K\}$, choose a distribution over the vocabulary $\varphi_k \sim \text{Dirichlet}(\beta)$, with $\beta$ a prior.
4. For each document-word combination $i, j$, with $i \in \{1, \dots, N\}$ and $j \in \{1, \dots, N_i\}$:
   a) Choose a topic $z_{i,j} \sim \text{Multinomial}(\theta_i)$.
   b) Choose a word $w_{i,j} \sim \text{Multinomial}(\varphi_{z_{i,j}})$.

In other words, we choose a set of priors $(\alpha, \beta)$ and build our model around them, assigning a topic and word distribution for each combination. The paper by Hong and Davison (2010) shows that the method of choosing priors is of importance; topic models typically assume symmetric Dirichlet priors, where $\alpha$ and $\beta$ are chosen so that each topic, word, and document probability is the same. The paper by Wallach et al. (2009) suggests that an asymmetric $\alpha$ and a symmetric $\beta$ allow for better performance than uniformly distributed priors. Intuitively, as explained in the paper by Andrzejewski et al. (2009, p. 1), this makes sense: in general a word or document will have a preference towards a certain topic, and this information should be incorporated in the priors, if known.
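As a minimal sketch of fitting such a model, the snippet below uses the gensim implementation of LDA with an asymmetric $\alpha$ and a symmetric $\beta$ (gensim's `eta`), in line with Wallach et al. (2009); the toy corpus is an assumption.

```python
# Fitting LDA with gensim; alpha='asymmetric' follows Wallach et al. (2009).
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [["tidy", "room", "puzzle"], ["arcade", "racing", "game"],
         ["puzzle", "game", "room"]]
dictionary = Dictionary(texts)
bow = [dictionary.doc2bow(t) for t in texts]   # bag-of-words corpus

lda = LdaModel(corpus=bow, id2word=dictionary, num_topics=2,
               alpha="asymmetric",   # asymmetric document-topic prior
               eta="symmetric",      # symmetric topic-word prior
               passes=5)
print(lda.print_topics())
```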

3.2.3 Pre-processing & stemming

Before the model can analyze a text, some parsing needs to occur: by removing all punctuation, the words can be converted to tokens. Because texts are written to be grammatically sound for a human, "work", "working", and "worked" would be seen as separate tokens. Within natural language there are families of related words that share the same semantic meaning; by correcting for these differences through stemming, we retain more information than if we did not (Lovins, 1968).
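A minimal pre-processing sketch along these lines is shown below; the concrete stemmer (Porter, via NLTK) is an illustrative choice, since the text cites Lovins (1968) for stemming in general.

```python
# Lowercase, strip punctuation, tokenize, then stem so that
# "work"/"working"/"worked" collapse to one token.
import re
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def preprocess(text):
    text = re.sub(r"[^\w\s]", " ", text.lower())  # remove punctuation
    return [stemmer.stem(tok) for tok in text.split()]

print(preprocess("Working hard, she worked on the work."))
# ['work', 'hard', 'she', 'work', 'on', 'the', 'work']
```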

3.2.4 Topic Number Estimation

In the field of topic modelling it is critical to have a correct intuition about the topics: they have to be modelled to actually represent the text being analyzed. By creating a suitable criterion, we are able to deduce the number of topics $K$.

One way to do so is via perplexity. Perplexity is used in information theory as a measurement of how well a probability distribution or model predicts a given test sample. In our case it can be used to determine the value of $K$ for which the fitted distribution best fits the sample (Blei et al., 2003, p. 1008).

There are, however, problems with the use of perplexity for estimating $K$. Most notably, perplexity does not correlate strongly with human judgment, as outlined in the paper by Chang et al. (2009): they tested multiple measures of model likelihood and correlated them with human judgment in large-scale user studies. They conclude that optimizing for model likelihood can lead to fewer semantically meaningful topics.

An alternative method for evaluating the optimal number of topics in LDA is based on so-called topic coherence, as suggested in Chang et al. (2009). Topic coherence (in the paper: $C_v$) is a measure of how interpretable the topics are to humans. Coherence starts by picking the top-$N$ words, sorted by term weight within the topic, and then calculates how similar these words are to each other. There are multiple methods for doing so, almost all of which are outlined in the paper by Röder et al. (2015). The authors performed an analysis of the various methods and correlated them with human judgment; the method called $C_v$ was found to be the most highly correlated of all. The method makes multiple passes over the corpus $C$, accumulating both term occurrence and co-occurrence counts (how many times a word is used in conjunction with another word), and does so for the top-$N$ words in each topic.
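A small sketch of scoring candidate $K$ values with the gensim implementation of $C_v$; the toy corpus and the candidate values are assumptions standing in for a real sweep.

```python
# Selecting K by C_v coherence (Röder et al., 2015) as implemented in gensim.
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

texts = [["tidy", "room", "clean"], ["arcade", "racing", "game"],
         ["puzzle", "game", "clean"], ["tidy", "clean", "room"]]
dictionary = Dictionary(texts)
bow = [dictionary.doc2bow(t) for t in texts]

for k in (2, 3):  # a real application sweeps a much larger grid of K
    lda = LdaModel(corpus=bow, id2word=dictionary, num_topics=k, passes=5)
    cv = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                        coherence="c_v").get_coherence()
    print(k, cv)   # pick the K with the best coherence score
```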

3.3 Artificial Neural Networks

(Artificial) Neural Networks, or NN, are computing systems inspired by the biological neural networks that constitute brains. These systems are thought to learn progressively, without being programmed for the specific task at hand.

A Neural Network is a collection of units (outputs) and weights in a mesh called a net, analogous to the synapses and axons in a human brain. Each connection (synapse) between neurons can transmit a signal to another neuron. The receiving (post-synaptic) neuron can process the signal and apply a signal to downstream neurons connected to it. Neurons have a state, represented by a real number typically ranging between 0 and 1. Neurons and their connections have weights that may vary during learning; these weights can increase or decrease the strength of the signal sent downstream. The strength of a signal is learned from feedback given by a loss function associated with the output. Furthermore, the weights may have a threshold such that the aggregate signal from connected neurons is only propagated when it exceeds a certain level. An example task for a Neural Network would be the binary classification between "a building" and "not a building": the network would train on a data set containing features that constitute "a building" and "not a building", and from there learn to classify new samples.

3.3.1 Feed-forward network

The basic form of a Neural Network with one hidden layer consists of linear combinations of the $I$ input variables (or features) $\{x_i\}_{i=1}^{I}$ in the form

$$a_j = \sum_{i=1}^{I} w_{ji}^{(1)} x_i + b^{(1)}, \qquad j \in \{1, \dots, H\},$$

with $H$ the number of nodes in the hidden layer. The variable $b^{(1)}$ is a bias with respect to this particular layer. The quantities $a_j$ are called integrations. These integrations are transformed by the use of a (non-linear) activation function, which computes the new activation $z_j$, defined as

$$z_j = h(a_j).$$

The function $h(\cdot)$ is often chosen with the aim of keeping the result within certain boundaries for the next layers to work with (examples are given in subsection 3.3.2). The next layer uses the resulting $z_j$:

$$y_k = h\left(\sum_{j=1}^{H} w_{kj}^{(2)} z_j + b^{(2)}\right),$$

where the $w_{kj}^{(2)}$ are the weights associated with the output level $y_k$, and $k \in \{1, \dots, K\}$ with

$K$ the total number of outputs. This transformation $z_j \to y_k$ constitutes going from the hidden layer to the output layer; we again introduce a bias for this level. Graphically, the network can be represented as follows:

[Figure: input layer ($x_1$, $x_2$), hidden layer ($z_1$, $z_2$, $z_3$), output layer ($y_1$, $y_2$), with biases $b^{(1)}$ and $b^{(2)}$.]

Figure 3.3.1: A simple example of a neural network

The arrows in figure 3.3.1 going to and from the input, hidden, and output layers have associated weights $w_{ji}^{(l)}$, with $l \in \{1, 2\}$. Combining all stages and the output layer, we obtain

$$y_k(x, w) = h\left(\sum_{j=1}^{H} w_{kj}^{(2)} \, h\left(\sum_{i=1}^{I} w_{ji}^{(1)} x_i + b^{(1)}\right) + b^{(2)}\right) \tag{3.5}$$


We call the act of evaluating equation 3.5 forward propagation of information through the network (Bishop, 2006, p. 229). The network as defined in this section may easily be expanded by introducing new layers with their own biases, weights, and transformations (see subsection 3.3.5).
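The forward pass of equation 3.5 can be transcribed almost literally into code; the sketch below uses NumPy, with illustrative layer sizes and tanh as the activation $h$.

```python
# Direct NumPy transcription of equation 3.5 (forward propagation only).
import numpy as np

def h(a):                     # activation function (here: tanh)
    return np.tanh(a)

I, H, K = 4, 3, 2             # input, hidden, and output sizes (illustrative)
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(H, I)), 0.1   # first-layer weights and bias
W2, b2 = rng.normal(size=(K, H)), 0.1   # second-layer weights and bias

def forward(x):
    a = W1 @ x + b1           # integrations a_j
    z = h(a)                  # hidden activations z_j
    return h(W2 @ z + b2)     # outputs y_k

print(forward(rng.normal(size=I)))
```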

3.3.2 Activation layers

As mentioned in the previous subsection, the outputs between layers use activation functions. A number of activation functions with desirable properties will be discussed here: ReLU, sigmoid, tanh, and softmax.

1. $a_{\text{ReLU}}(x) = \max(0, x)$ (Nair and Hinton, 2010)
2. $a_{\text{sigmoid}}(x) = \frac{1}{1 + e^{-x}}$
3. $a_{\tanh}(x) = \frac{2}{1 + e^{-2x}} - 1 = 2\,a_{\text{sigmoid}}(2x) - 1$
4. $a_{\text{softmax}}(x)_j = \frac{e^{x_j}}{\sum_{k} e^{x_k}}$

$a_{\text{ReLU}}$ is less computationally expensive than $a_{\tanh}$ and $a_{\text{sigmoid}}$ because ReLU involves simpler mathematical operations and creates a less analog output (half of the output is 0). The outputs of $a_{\text{softmax}}$ lie in the range $[0, 1]$ and add up to 1, which makes it suitable for producing class probabilities.

3.3.3 Network Training

Given a training set comprising input vectors $\{x_n\}_{n=1}^{N}$ and associated target vectors $\{t_n\}_{n=1}^{N}$, we minimize the error function, taking the sum of squared errors as an example:

$$E(w) = \frac{1}{2} \sum_{n=1}^{N} \| y(x_n, w) - t_n \|^2 \tag{3.6}$$

To derive the optimal weights and biases in the network, the gradient of the error function must be found. We evaluate the gradient for each weight individually:

$$\frac{\partial E_n}{\partial w_{ji}} = \frac{\partial E_n}{\partial y_i} \frac{\partial y_i}{\partial z_j} \frac{\partial z_j}{\partial w_{ji}}$$

and find the local error signal involved in changing a weight. The Neural Network is optimized by observing and acting upon the change in the error function defined above. Each full pass over the training set is called an epoch.

3.3.4 Loss function

For many applications the objective is more complex than minimizing the number of misclassifications, because different errors carry different costs. An example can be seen in medicine: for a doctor it is more important to correctly identify an ill person than to avoid treating a healthy one. In other words: a patient with an illness should not be turned away (here a type-I error), but a healthy person could receive treatment that is unneeded (a type-II error).


We can formalize this issue by identifying a loss function, alternatively called a (negative) cost function, which provides a single overall measure of the loss incurred in taking the actions and decisions defined by the Neural Network. The goal of the Neural Network is to minimize this function; the optimal solution is the one which minimizes the loss function.

One potential (non-linear) loss function is the Cross Entropy Loss, which is useful when training a classification problem with $n$ classes. The loss for class $i$, using the data vector $d$, is

$$\text{loss}(d, i) = -\log\left(\frac{e^{d_i}}{\sum_j e^{d_j}}\right) \tag{3.7}$$
$$= -d_i + \log\left(\sum_j \exp(d_j)\right) \tag{3.8}$$

where $d$ is a vector containing the data (the raw class scores). The Cross Entropy Loss function may be generalized for imbalanced data sets: one can include class weights $w_i$, with $\sum_i w_i = 1$,

$$\text{loss}(d, i) = w_i \left(-d_i + \log\left(\sum_j \exp(d_j)\right)\right).$$

Other loss functions that may be considered are the L1 ($\sum |y - t|$) and L2 ($\sum |y - t|^2$) losses.
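For illustration, the (weighted) cross-entropy of equations 3.7 and 3.8 is available directly in PyTorch; the logits, target, and weights below are toy values. Note that PyTorch does not require the weights to sum to 1, so the normalization is an assumption carried over from the text.

```python
# Plain versus class-weighted cross entropy on a single toy sample.
import torch
import torch.nn as nn

logits = torch.tensor([[2.0, 0.5, 0.1]])   # raw class scores d
target = torch.tensor([0])                 # true class i
weights = torch.tensor([0.2, 0.3, 0.5])    # per-class weights w_i

plain = nn.CrossEntropyLoss()(logits, target)
weighted = nn.CrossEntropyLoss(weight=weights)(logits, target)
print(plain.item(), weighted.item())
```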

3.3.5 Other layers

The hidden layers described earlier are fully-connected layers: all nodes are connected to all of the previous layer's nodes and the next layer's nodes. There is a multitude of layer types, but the ones worth noting for this thesis are convolutional layers, pooling layers, and dropout layers; all are used by AlexNet (Krizhevsky et al., 2012).

The problem with plain Neural Networks is that they do not scale well with the inclusion of images. In AlexNet the input image size is 224 x 224 x 3 (3 because of the three main colour channels: Red (R), Green (G), and Blue (B)), a vector of size 150,528. Although a single layer of this size would be manageable, we would certainly want multiple layers; full connectivity would then lead to either over-fitting or an untrainable model. When dealing with high-dimensional inputs one may encounter the "dimensionality curse", where the number of possible parameters is larger than the number of training samples. Within the context of Neural Networks, pruning or lowering the number of parameters has been shown to be useful (Bengio and Bengio, 2000, p. 1).

Convolutional layers use the inputs of an image in a more geometric sense: the neurons are arranged in 3 dimensions: width, height, and depth. A filter of (for example) 5x5x3 then slides over the input volume. At each position we take the dot product with the filter $w$,

$$w^T x + b,$$


which in our case is a 75-dimensional dot product with a scalar as result. This operation is called a convolution. For a 32x32x3 input volume, for example, sliding the 5x5x3 filter over all unique positions yields a 28x28x1 output volume; the 28 comes from the fact that 32 - 5 + 1 = 28 unique positions are possible in each spatial direction. A convolutional layer is built up from a number of such filters applied to the same input; using more than one filter allows different types of information to be captured. Because there are multiple passes over the same pixels, the spatial information is preserved.
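The shape arithmetic of the 32x32 example can be checked directly; the sketch below is a minimal PyTorch verification.

```python
# One 5x5x3 filter over a 32x32x3 volume yields 28x28x1 (32 - 5 + 1 = 28).
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)   # (batch, channels, height, width)
conv = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=5)
print(conv(x).shape)            # torch.Size([1, 1, 28, 28])
```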

AlexNet contains so-called pooling layers (Krizhevsky et al., 2012, p. 4). These layers progressively reduce the size of the spatial representation by summarizing the content of previous nodes, thereby reducing the dimensionality of the net. By reducing the number of parameters, a control is exerted on the net to prevent over-fitting (i.e. the dimensionality curse). A way of thinking of a pooling layer is down-sampling: reducing a 500x500x3 image to 100x100x3 preserves the spatial information but reduces the number of parameters to be optimized.

Dropout layers, as introduced by Srivastava et al. (2014), are very specific in their function. Like pooling layers they offer a control on over-fitting, but they do not conserve spatial information. The idea of a dropout layer is to randomly "drop out" certain activations in a layer by setting them to zero. This forces the network to create redundant paths to the same answer.

3.3.6 Normalization

Within the field of statistics, normalizing a random variable is a way of forcing the random variable to a certain distribution, after which an analysis with regard to the source material can more easily be made. The field of image processing is similar: we want the input parameters (pixels in our case) to be similarly distributed, which makes convergence faster when training the network (Ioffe and Szegedy, 2015, p. 8). To accomplish normalization, we first define $X$ to be the training set and $Y$ the test set.

1. For each colour channel $c \in \{R, G, B\}$, calculate the mean $\mu_c$ and standard deviation $\sigma_c$ using the information in the training set $x_i \in X$.
2. For both $x_i \in X$ and $y_j \in Y$, and for each channel $c$, apply

$$c^* = \frac{c - \mu_c}{\sigma_c}.$$

If we did not scale our input vectors, the ranges of the feature-value distributions would likely differ per feature, and the learning rate would therefore cause corrections in each dimension that differ from one another. In other words: the inputs should be distributed similarly.

3.4 Support Vector Machine

Support Vector Machines, or SVM, are a machine learning algorithm used for both classification and regression problems, though mostly for classification. In this algorithm each data item is a point in an $n$-dimensional space, with the value of each feature being the value of the corresponding coordinate. We can perform classification by finding a hyper-plane that differentiates the $k$ classes.

To understand the concept of Support Vector Machines, it is important to know the definition and possible applications of hyper-planes. We use the definition by Curtis (1968), accompanied by a geometric interpretation: a hyper-plane is a subspace of one dimension less than the space it resides in (the ambient space). For example, if a space is 4-dimensional, its hyper-planes are 3-dimensional; in general, if the space is $n$-dimensional, its hyper-planes are $(n-1)$-dimensional. The notion of hyper-planes can be used in any space where the notion of sub-spaces is defined.

A Support Vector Machine creates a hyper-plane (or a multitude of hyper-planes) in an $n$-dimensional space. With $n$ of a high order, this space can be used for regression analysis and binary classification. The method separates classes by creating hyper-planes that have the greatest distance to the nearest training data point of any class, the so-called margin.

There are two approaches to multi-class classification problems: casting the multi-class problem as a single optimization problem (Crammer and Singer, 2001), or reducing the single multi-class problem into multiple binary classification problems, the variant found to perform best by Duan and Keerthi (2005). In this paper the latter is used, as the run-time is significantly reduced (Pedregosa et al., 2011).

3.4.1 Ensemble Learning

Support Vector Machines are computationally intensive (Chapelle, 2007), with a computational complexity of $O(\max\{n, d\} \cdot \min\{n, d\}^2)$, where $n$ is the number of points and $d$ the number of dimensions. To counter this, ensemble methods may be employed.

In statistics, and machine learning in particular, ensemble methods obtain a classifier that is a combination of multiple classifiers. Multiple methods of obtaining a classifier using subsets are available to reduce the computational complexity. Breiman (1999) creates a classifier using random subsets of samples. Ho (1998) randomly samples subsets of features, calling this Random Subspaces. When sampling both the samples and the features, we speak of Random Patches (Louppe and Geurts, 2012). Finally, when drawing samples with replacement, one speaks of Bagging (Breiman, 1996).
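As an illustrative sketch, bagging around SVM base learners is available in scikit-learn; the subset sizes and the stand-in data below are assumptions.

```python
# Bagging (Breiman, 1996) over SVM base learners: several small SVMs,
# each fit on a bootstrap subset, instead of one large SVM.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)  # stand-in data
ensemble = BaggingClassifier(SVC(kernel="rbf"),
                             n_estimators=10,   # ten smaller SVMs
                             max_samples=0.2,   # each sees 20% of the data
                             bootstrap=True)    # draw with replacement
ensemble.fit(X, y)
```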

3.4.2 Kernel

A kernel, with regard to Support Vector Machines, is a way of computing the similarity of two vectors $x$ and $y$ in some high-dimensional feature space (Bishop, 2006, p. 292). Suppose we have a function $\phi$ that maps our vectors from $\mathbb{R}^n$ to $\mathbb{R}^m$; we may define the dot product in that space as $\phi(x)^T \phi(y)$. A kernel $k(\cdot, \cdot)$ is a function that corresponds to this dot product in that space: in our example, $k(x, y) = \phi(x)^T \phi(y)$.


This kernel is also called the linear kernel (Bishop, 2006, p. 292). Another example is the Radial Basis Function (RBF), an often-used kernel for training non-linear SVM problems (Chang et al., 2010, p. 1475). It is defined as

$$k(x, y) = \exp(-\gamma \|x - y\|^2).$$

The parameter associated with RBF is γ, which will be discussed in the next subsection.

3.4.3 Parameters

The parameters associated with SVM depend on the kernel used. Here we discuss the parameters associated with SVM and the RBF kernel. The first parameter of interest for SVM itself is $C$, a positive real number that controls the cost of misclassification.

The $\gamma$ coefficient is a parameter of the RBF kernel and allows tuning in the case of over-fitting: a higher $\gamma$ is associated with over-fitting. A smaller $\gamma$ implies a Gaussian shape with a large variance, so that the influence of 'surrounding' points (i.e. those close by in the higher-dimensional space) is greater; a large $\gamma$ implies the opposite: no wide-spread influence. The choice of $\gamma$ therefore entertains the classic bias-variance trade-off: a large $\gamma$ leads to low bias and high variance, and vice versa.

3.5 Synthetic Sampling

In the field of Machine Learning, a great source of bias and misclassification in multi-class classification problems is an unbalanced data set. If a data set totals 1,000 objects, of which 980 belong to the class "dog" and the remaining 20 to the class "cat", it becomes hard to predict unseen data belonging to "cat": the classifier will tend towards the more prevalent class to maximize accuracy. Imbalanced data may arise from a multitude of causes: the data is skewed because of the latent nature of the variable at hand, there may be a sample selection bias, or other problems may be at play. To properly analyze a data set, one must identify, and where possible correct for, such biases.

To counteract the bias of unbalanced data sets, one can apply sampling of classes. There are three cases to discuss: over-sampling the minority class ("cat" in the example), under-sampling the majority class ("dog" in the example), and a combination of over- and under-sampling.

SMOTE (Chawla et al., 2002) is a method for over-sampling a data set to counter the classification problems posed by an imbalanced dependent-variable distribution. SMOTE generates synthetic cases $\hat{x}_i$ when a rare target value $y_i$ is requested, using an interpolation strategy: one of the target value's $k$-nearest neighbours is selected, and a new observation $\hat{x}_i$ is generated by interpolating between the two.


In the paper by He et al. (2008), the ADAptive SYNthetic sampling approach (ADASYN) is introduced. The authors build upon the methodology of SMOTE (Chawla et al., 2002) by focusing on the minority classes which are difficult to learn. ADASYN generates synthetic data for the minority classes, but increases the amount generated for those minority examples which are harder to learn. The ADASYN algorithm works as follows:

We start with a training set $D$ containing in total $n$ samples of labelled, paired data $\{(x_i, y_i)\}_{i=1}^{n}$, with $x_i \in X$ and $y_i \in Y = \{-1, 1\}$ identifying the class associated with the corresponding $x_i$. We now define two subsets of $D$: $D_-$ (of size $n_-$) and $D_+$ (of size $n_+$), the minority and majority classes respectively. We therefore have $n = n_- + n_+$, $n_- < n_+$, and $D_- \cup D_+ = D$.

1. Calculate the number of samples to generate for the minority class:

$$G = (n_+ - n_-) \times \beta \tag{3.9}$$

with $\beta \in [0, 1]$ a parameter specifying the balance level after generation of the synthetic data: $\beta = 0$ indicates no synthetic samples being generated, and $\beta = 1$ indicates a fully balanced data set after synthetic data generation.

2. For each item $x_i \in D_-$, find the $K$ nearest neighbours based on the Euclidean distance. Using this information, calculate the ratio

$$r_i = \Delta_i / K, \qquad i = 1, \dots, n_-,$$

where $\Delta_i$ is defined as the number of examples among the $K$ nearest neighbours of $x_i$ that belong to the majority class $D_+$; we therefore have $r_i \in [0, 1]$.

3. Normalize $r_i$ as follows:

$$\hat{r}_i = \frac{r_i}{\sum_{i=1}^{n_-} r_i},$$

so that $\sum_i \hat{r}_i = 1$.

4. Calculate the number of synthetic data points to be generated for every minority example $x_i$:

$$g_i = \hat{r}_i \times G \tag{3.10}$$

with $G$ being the total number of synthetic data examples to be generated, in accordance with equation 3.9.

5. We now know how many data points are to be generated for each $x_i \in D_-$. For each of the $g_i$ points, randomly pick one point from the $K$ nearest neighbours, call it $x_{k_i}$, and generate a synthetic point by

$$s_i = x_i + (x_{k_i} - x_i)\lambda,$$

with λ a randomly generated number between 0 and 1.


This algorithm can be generalized for multi-class imbalanced data sets by handling one minority class at a time. The introduction of synthesized data is known to improve the performance of classification algorithms; in the original paper by He et al. (2008), the classification accuracy was better than SMOTE's in all but one of the cases presented.
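For reference, a maintained implementation of this algorithm exists in the imbalanced-learn package; the sketch below applies it to toy data, where the 90/10 class split is an assumption standing in for the imbalanced rating bins.

```python
# ADASYN (He et al., 2008) via imbalanced-learn; handles the multi-class
# case one minority class at a time, as described above.
from collections import Counter
from imblearn.over_sampling import ADASYN
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
X_res, y_res = ADASYN(n_neighbors=5).fit_resample(X, y)
print(Counter(y))      # imbalanced class counts before
print(Counter(y_res))  # approximately balanced counts after
```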

4 | Methodology

The main focus of this chapter is the methodology used to answer the research question. First, the pre-processing of the data is discussed in section 4.1. Secondly, feature extraction is discussed in section 4.2. Thirdly, section 4.3 discusses the prediction goal. Fourthly, section 4.4 discusses prediction and the fusion of feature sets. Lastly, section 4.5 summarizes the methodology.

Figure 4.0.1: The structure of the methodology at a glance

The methodology described in this chapter is graphically explained in figure 4.0.1. Section 4.1 is linked to part 1 in blue, section 4.2 to part 2, section 4.3 describes part 3, and finally part 4 is described by section 4.4.

4.1 Pre-processing

In pre-processing (part 1 of figure 4.0.1), an 80%/20% train/test split is created. We do not use a 95/5 split as in the literature review, as those authors employed 20-fold cross-validation; the choice for 80/20 stems from computational constraints. Some restrictions are put on the input data of this method: the target variable needs to consist of at least one rating, so that the average target rating lies between the minimum and maximum values of the rating scale.


4.2 Feature Extraction

For the analysis of the online content given, we define the feature sets to be used. Four feature sets will be discussed here, namely:

• Image of the item (4.2.1)
• Textual feature sets (title & description) (4.2.2)
• Genre information (4.2.3)

4.2.1 Image

To generate a feature set from images, the pre-trained Convolutional Neural Network AlexNet is used. By using data from ImageNet we are able to ascertain spatial and object information from the image associated with an item. AlexNet uses the following normalization parameters:

$(\mu_R, \mu_G, \mu_B) = (0.485, 0.456, 0.406), \qquad (\sigma_R, \sigma_G, \sigma_B) = (0.229, 0.224, 0.225)$

These values are dictated by the use of a pre-trained model (PyTorch, 2017). The output of AlexNet is a 1,000-dimensional feature vector.
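A minimal sketch of this extraction step with torchvision's pre-trained AlexNet; the file name thumbnail.png is a placeholder, and the resize/crop pipeline is the conventional one for this model rather than something specified in the text.

```python
# Extracting the 1,000-dimensional AlexNet output for one thumbnail,
# using the normalization constants quoted above.
import torch
from torchvision import models, transforms
from PIL import Image

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

model = models.alexnet(pretrained=True).eval()
img = preprocess(Image.open("thumbnail.png").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    features = model(img)   # shape: (1, 1000)
```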

4.2.2 LDA

As explained in the theory chapter, Latent Dirichlet Allocation is used to extract textual features from a text; in this specific case, the textual features contained within the title and description of an item. The parameters of interest in this feature set are the number of topics $k$ and the number of iterations over the data set. For computational convenience we set the number of iterations to 20. The end result of this analysis is two feature sets (title and description), each containing $k_i$ features. The variant from the paper by Hoffman et al. (2010) is used: an online, and therefore sequentially run, variant of LDA.

4.2.2.1 Parameters

As discussed in the theory chapter, the optimal number of topics is to be determined, and there are multiple methods for doing so. We shall use the coherence measure $C_v$ to determine the optimal $k$, choosing the $k$ with the best coherence score among $k \in \{25, 50, 75, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000\}$, with 5 passes over the corpus.

4.2.3 Genres

The assumption of this thesis is that an online content item carries information of a categorical nature, which may be interpreted as the genre of the item. This is added as a feature set with one dummy variable per genre/category.


4.3 Prediction Goal

In defining the goal of an algorithm, the need to define accuracy arises. This thesis uses a binned scale: a scale cut into a pre-defined number of pieces of width $\varepsilon$. The corresponding definition of accuracy $a(\cdot)$ is

$$a(\varepsilon) = \frac{1}{N} \sum_{i=1}^{N} b(\varepsilon, \text{actual}_i, \text{expected}_i),$$

with $b(\cdot, \cdot, \cdot)$ defined as:

$$b(\varepsilon, x, y) = \mathbb{I}_{\,y \in [x - (x \bmod \varepsilon),\; x - (x \bmod \varepsilon) + \varepsilon) \cap D}, \qquad D = [\min\{\text{ratings}\}, \max\{\text{ratings}\}]$$

Note the half-open boundaries $[\cdot, \cdot)$; the boundary value $\max D$ is included in the last bin. We assume that the bins start on an integer or a multiple of $\varepsilon$. $N$ is the number of items in the relevant set. The goal of all algorithms is to predict the rating of an item as accurately as possible. To compare this work with other work, we facilitate two accuracy definitions:

1. Spot-on, where $\varepsilon = 0.5$.
2. Next-to, where $\varepsilon = 1.0$.

The most interesting definition is spot-on. This definition allows for a more thorough learning process with regard to Neural Networks and SVM, as it preserves the difference between e.g. an item rating of 3.2 and one of 3.6; this information is lost when viewing the rating of an item through the lens of the next-to definition.
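A literal Python reading of the binned accuracy above is given below, under the assumption that ratings live in $[1, 5]$; mapping both values to a shared bin index is equivalent to checking membership of the prediction in the actual value's bin, and the max rating is assigned to the last bin as required.

```python
# Binned accuracy a(eps): eps = 0.5 is spot-on, eps = 1.0 is next-to.
import math

def bin_index(rating, eps, lo=1.0, hi=5.0):
    # Map a rating in [lo, hi] to its bin; hi falls in the last bin.
    n_bins = int(round((hi - lo) / eps))
    return min(int((rating - lo) // eps), n_bins - 1)

def accuracy(eps, actuals, predictions):
    hits = sum(bin_index(a, eps) == bin_index(p, eps)
               for a, p in zip(actuals, predictions))
    return hits / len(actuals)

print(accuracy(0.5, [3.2, 3.6, 4.9], [3.4, 3.3, 5.0]))  # spot-on: 2/3
```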

4.3.1 Continuous

Binning the target variable raises the issue of information loss when going from a continuous variable to a discrete one. This is a valid point; the reasoning behind the choice is that we would like to compare this algorithm with previous work (Glisovic, 2016; Wieser, 2016). Furthermore, the discretisation of the variable allows for faster run-times with regard to Neural Networks and SVM (which would otherwise have to be changed to Support Vector Regression).

4.4 Prediction

This section describes the final two steps: generating the feature sets and the fusion of those feature spaces.

4.4.1 Sampling

Before any classification algorithm is applied to the data set, the data set is run through ADASYN to generate a balanced data set.


4.4.2 Support Vector Machine

Before any training can take place, the original unbalanced training set is supplemented with ADASYN, as described in section 3.5, to ensure a balanced data set. The target value is the rating of the online content, split by use of the spot-on definition. To ascertain the parameters for this intermediate step, a grid search is performed over the following variables:

• kernel, one of
  – linear: $\langle x, x' \rangle$
  – radial basis function: $\exp(-\gamma \|x - x'\|^2)$
• $\gamma \in \{0.01, 0.1, 0.5, 0.9, 0.99\}$
• $C \in \{1, 10, 100, 1000\}$

Here $\gamma$ is only adjusted in the case of the radial basis function (RBF) kernel. This grid search is done on a random sample (10%) of an ADASYN-sampled training set. As for the method of SVM multi-classification, the method proposed by Duan and Keerthi (2005) is used, reducing the multi-class optimization problem to a multitude of binary classification problems. The SVM classifier is trained with the spot-on definition as defined in section 4.3.
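A sketch of this grid search using scikit-learn (Pedregosa et al., 2011) is shown below; the stand-in data replaces the 10% subsample, and the cross-validation fold count is an assumption.

```python
# Grid search over the kernels and parameter values listed above.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)  # stand-in data
param_grid = [
    {"kernel": ["linear"], "C": [1, 10, 100, 1000]},
    {"kernel": ["rbf"], "C": [1, 10, 100, 1000],
     "gamma": [0.01, 0.1, 0.5, 0.9, 0.99]},  # gamma only for RBF
]
search = GridSearchCV(SVC(), param_grid, cv=3).fit(X, y)
print(search.best_params_)
```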

The outputs used from SVM are the so-called probabilities: in the binary case, the method by Platt et al. (1999) may be used to ascertain per-class probabilities. It fits a logistic regression on the SVM scores by means of an additional cross-validation on the data. However, because we are in a multi-class case, we use the extension described by Wu et al. (2004). Furthermore, as SVM is computationally intensive, the Bagging ensemble method with overlapping subsets is used.

4.4.3 Neural Network

A neural network with two hidden layers of 400 nodes each is trained to ascertain a $1 \times q$ vector of outputs to use for classification, where $q$ depends on the $\varepsilon$ chosen (8 for spot-on and 4 for next-to). The activation function used is either $a_{\text{ReLU}}$, for ease of computation, or $a_{\text{softmax}}$ if the first does not converge successfully. The number of epochs used is 20,000. To counteract over-fitting, a dropout layer is included between the first and second hidden layer. The loss function used is the Cross Entropy Loss function as defined in subsection 3.3.4. The output of the Neural Network is normalized according to the same procedure as the image normalization, by taking the column vector of each feature ($y$) and applying

$$\text{normalize}(y) = \sigma_y^{-1}(y - \iota \mu_y),$$

with $\iota = [1, \dots, 1]'$. Using the normalized, concatenated input feature sets, the same Neural Network as before is trained.
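A minimal PyTorch sketch of the described architecture is given below; the dropout probability is not specified in the text and is an assumption here.

```python
# Two hidden layers of 400 nodes, ReLU activations, dropout between them,
# and a q-class output head trained with cross entropy (q = 8 for spot-on).
import torch.nn as nn

def make_net(n_features, q=8, p_dropout=0.5):  # p_dropout is an assumption
    return nn.Sequential(
        nn.Linear(n_features, 400), nn.ReLU(),
        nn.Dropout(p=p_dropout),               # counteracts over-fitting
        nn.Linear(400, 400), nn.ReLU(),
        nn.Linear(400, q),                     # logits for CrossEntropyLoss
    )

net = make_net(n_features=1000)
criterion = nn.CrossEntropyLoss()
```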


4.5 Summary

To grant the reader a more structured view of this chapter we summarize the methodology:

Preprocessing Split the data set into 80% training and 20% test set.
Feature Extraction Extract genre, image, and description and title textual information. Balance with ADASYN.
Feature preparation Prepare the features for fusion by applying Support Vector Machines and Neural Networks, training on the spot-on definition.
Fusion Use a Support Vector Machine and a Neural Network to train a classifier using the spot-on definition.
Prediction Predict on the test set and enumerate the spot-on and next-to accuracies per rating bin.

5 | Experiment

The results from an experiment using scraped application data are outlined here. Firstly, the data is described in section 5.1. Secondly, the feature sets and their generation are outlined in sections 5.2 through 5.5. Thirdly, the fusion step is described in section 5.6. Finally, the results are summarized in section 5.7.

5.1 Origin & Explanation

The data used for this chapter is App Store data, from the iOS App Store by Apple. The data is provided through a collaboration between the University of Amsterdam and AppTweak (http://www.apptweak.com). The tools of AppTweak scraped the top-rated applications per genre, collecting data such as title, description, thumbnail link, gallery videos, images, price, and reviews. A subselection has been made with regard to the data used: only apps with the categorization 'Game' have been included.

5.1.1 Statistics

The general statistics of the relevant variables are displayed in table 5.1.1.

Values         Test     Train     Total
mean           4.05     3.97      3.99
std. dev.      0.72     0.74      0.74
skew          -1.62    -1.30     -1.36
kurtosis       3.20     1.84      2.07
#(1.0 - 1.5)     86      293       379
#(1.5 - 2.0)     87      335       422
#(2.0 - 2.5)    118      739       857
#(2.5 - 3.0)    268    1,507     1,775
#(3.0 - 3.5)    717    3,646     4,363
#(3.5 - 4.0)  1,250    5,093     6,343
#(4.0 - 4.5)  2,552    9,477    12,029
#(4.5 - 5.0)  1,828    6,530     8,358
all           6,906   27,620    34,526

Table 5.1.1: Summary statistics for ratings


Figure 5.1.1: Train (left) and test (right) rating distribution (spot-on)

Table 5.1.1, combined with the information shown in figure 5.1.1, shows that the average rating is skewed towards the higher ratings. This can be explained by the way AppTweak has scraped the data: only the top-rated applications per genre are collected. Furthermore, the mean of each subset is close to that of the total set (no more than a 10% difference), as are the standard deviation and the skew. The kurtosis differs, which can be seen in figure 5.1.1. From figure 5.1.1, and using the information contained within table 5.1.1, we see that the two subsets are similarly distributed.

The results are displayed in table format. For the next-to columns, a predicted rating within [1.0-1.5] is checked against actual values within [1.0-2.0], to allow for a comparison with previous work. As an example, if a rating is predicted to be 1.0-1.5 and the actual value is 1.7, the prediction is considered correct under the next-to definition. A small sketch of these definitions follows, after which we turn our attention to the feature sets.
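A minimal sketch of the two accuracy definitions (the function names are ours):

```python
import numpy as np

def to_bin(r):
    """Map a rating in [1.0, 5.0] to a bin index 0..7 ([1.0-1.5) -> 0, ...)."""
    return int(min((r - 1.0) // 0.5, 7))

def accuracies(pred_bins, true_ratings):
    """Return (spot-on, next-to) accuracy for binned predictions."""
    true_bins = np.array([to_bin(r) for r in true_ratings])
    pred_bins = np.asarray(pred_bins)
    spot_on = np.mean(pred_bins == true_bins)
    next_to = np.mean(np.abs(pred_bins - true_bins) <= 1)
    return spot_on, next_to

# The example above: predicted bin 1.0-1.5 (index 0), actual rating 1.7
# (bin index 1): wrong under spot-on, correct under next-to.
print(accuracies([0], [1.7]))  # (0.0, 1.0)
```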


5.2 Genres Feature Set

This section describes the results for the Genres feature set. To grant the reader some insight into the data, we describe the statistics and distribution of the genres in subsection 5.2.1. We then focus on the results garnered in subsection 5.2.2.

5.2.1 Statistics

The following table shows the number of apps and the average rating per genre, with the added note that an app can pertain to more than one genre.

Genre Name     # Apps   Average Rating
Action         12,825   3.68
Adventure      11,582   3.77
Arcade         12,046   3.68
Board           5,617   3.56
Card            4,653   3.59
Casino          4,619   3.63
Dice            2,779   3.58
Educational     8,492   3.53
Family         13,702   3.67
Kids                0   -
Music           2,840   3.71
Puzzle         14,239   3.80
Racing          5,250   3.63
Role Playing    9,165   3.80
Simulation     10,600   3.53
Sports          4,655   3.41
Strategy        8,593   3.66
Trivia          6,057   3.64
Word            4,803   3.86

Table 5.2.1: Genre information

As can be seen from table 5.2.1, the genre Kids has no applications associated with it; it is therefore dropped. We now apply oversampling to all non-majority classes to achieve the new distribution of the ratings:

Figure 5.2.1: Before (left) and after (right) applying ADASYN
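A minimal sketch of this oversampling step with imbalanced-learn's ADASYN (He et al., 2008); `X_train` and `y_train` are hypothetical names for the genre features and binned ratings.

```python
from imblearn.over_sampling import ADASYN

# ADASYN interpolates between minority-class neighbours, so the synthetic
# genre indicators become continuous even though the originals are binary
# (a point revisited in section 5.7).
X_bal, y_bal = ADASYN(random_state=0).fit_resample(X_train, y_train)
```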

On the balanced data set we apply the methods outlined in the methodology chapter: Support Vector Machines and Neural Networks.


5.2.2 Results

This section describes the results for the Support Vector Machine and Neural Network applied to the Genres feature set. We begin by applying a grid search to find γ and C on a random 10% sample of the balanced data set:

C      γ      Test   Train
1      0.01   21     17
1      0.1    34     25
1      0.5    34     30
1      0.9    35     33
1      0.99   34     33
10     0.01   18     20
10     0.1    35     29
10     0.5    34     38
10     0.9    35     43
10     0.99   36     44
100    0.01   33     26
100    0.1    34     34
100    0.5    36     47
100    0.9    36     50
100    0.99   36     51
1000   0.01   36     30
1000   0.1    36     41
1000   0.5    36     51
1000   0.9    37     52
1000   0.99   36     52

Table 5.2.2: Grid search results for SVM with an RBF kernel (10% sample)

             spot-on          next-to
Rating       Test    Train    Test    Train
1.0 - 1.5    23.26   31.76    25.58   46.82
1.5 - 2.0     5.75   24.80    17.24   65.44
2.0 - 2.5     3.39   39.55     8.47   58.57
2.5 - 3.0     0.75   12.51     8.21   46.24
3.0 - 3.5     4.60   10.65    22.45   25.04
3.5 - 4.0    16.80   11.49    61.28   44.34
4.0 - 4.5    41.89   40.20    85.50   85.69
4.5 - 5.0    39.93   34.37    75.55   57.84
total        30.02   25.91    66.02   53.87

Table 5.2.3: Accuracies associated with optimal SVM parameters (10% sample)

C      Test   Train
1      21     20
10     21     20
100    21     20
1000   21     20

Table 5.2.4: Results with a linear kernel

From the grid search in table 5.2.2 we see that C = 1000 and γ = 0.99 perform best, and that an RBF kernel is therefore needed. With these parameters we attain an accuracy of 25.91% on the training set (table 5.2.3). Figures 5.2.2 and 5.2.3 show the predicted and actual ratings for this sample.

Figure 5.2.2: Prediction values (10% sample)

Figure 5.2.3: Actual values (10% sample)

As the predicted values in figure 5.2.2 show a distribution close to that of the actual values (figure 5.2.3), we continue on to the full data set, using C = 1000, γ = 0.99. From this we get the following results and distributions:

Figure 5.2.4: Prediction ratings

Figure 5.2.5: Actual values of the test ratings

The following tables show the results of applying both the Neural Network methodology and the Support Vector Machine with the acquired parameters.

             spot-on          next-to
Rating       Test    Train    Test    Train
1.0 - 1.5    10.47   67.87    11.63   76.52
1.5 - 2.0     0.00   65.96     2.30   81.88
2.0 - 2.5     1.69   58.21     5.08   72.24
2.5 - 3.0     1.87   47.93     7.09   61.89
3.0 - 3.5     3.63   28.52    14.50   42.03
3.5 - 4.0    10.24   19.96    82.08   64.31
4.0 - 4.5    69.91   69.71    92.71   93.55
4.5 - 5.0    23.19   32.91    89.00   74.52
all          34.43   48.77    74.72   71.00

Table 5.2.5: Results for full training set (SVM)

             spot-on          next-to
Rating       Test    Train    Test    Train
1.0 - 1.5    22.09   21.16    22.09   21.16
1.5 - 2.0     0.00    0.00     3.45    5.07
2.0 - 2.5     0.00    0.00     0.85    0.41
2.5 - 3.0     0.37    0.40     0.37    0.40
3.0 - 3.5     0.00    0.66     3.63    6.47
3.5 - 4.0     4.64    6.36     4.88    6.68
4.0 - 4.5    86.60   88.39    96.51   96.92
4.5 - 5.0    13.46   14.58    97.65   97.30
all          36.69   35.28    63.12   58.66

Table 5.2.6: Results for full training set (NN)

We attain a result of 34.43% using the spot-on definition and 74.72% using the next-to definition with SVM. Of note here is the very low accuracy from the 1.5-2.0 through the 3.5-4.0 bins for the Neural Network; it seems that even with ADASYN sampling, the Neural Network prefers the higher-valued ratings. Both the SVM and the Neural Network perform badly on the 1.5-2.0 range, indicating a lack of information contained within the data set for differentiating a 1.5-2.0 rating. Seeing as most genres have a fairly even distribution with respect to their average rating, a low accuracy is not an unexpected result when this feature set is used in isolation.


5.3 Image Feature Set

This section describes the intermediate results associated with the Images feature set. Before putting the images through AlexNet, they have to be adjusted. The images from the app scraping process ranged from 1024x1024 to 75x75 pixels. To consolidate this, and to be able to use AlexNet, the input had to be converted to RGB values (as some pictures were stored in black and white) and re-sized to 227x227. The images also have to be normalized. The balanced data set is comparable in distribution to figure 5.2.1. Figure 5.3.1 shows a random sampling of images to illustrate the source material.

Figure 5.3.1: Random selection of images
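A sketch of this image preparation using torchvision (cited in the references) is given below; the normalization constants are the standard ImageNet values and are an assumption, as the thesis does not list the exact statistics used.

```python
from PIL import Image
from torchvision import transforms

prep = transforms.Compose([
    transforms.Resize((227, 227)),            # AlexNet input size used here
    transforms.ToTensor(),                    # PIL image -> CxHxW float tensor
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),   # (assumed)
])

img = Image.open("thumbnail.png").convert("RGB")  # handles grayscale inputs
x = prep(img).unsqueeze(0)                        # 1x3x227x227 batch for AlexNet
```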

5.3.1 Results

Below are the grid search results for the parameters (table 5.3.1) and the accuracies associated with the optimal SVM parameters (table 5.3.2).

C       γ      Test   Train
1.0     0.01   22     49
1.0     0.1    30     98
1.0     0.5    18     99
1.0     0.9    18     100
1.0     0.99   18     100
10.0    0.01   23     92
10.0    0.1    30     99
10.0    0.5    18     100
10.0    0.9    18     100
10.0    0.99   18     100
100.0   0.01   23     99
100.0   0.1    30     100
100.0   0.5    18     100
100.0   0.9    18     100
100.0   0.99   18     100
1000.0  0.01   24     100
1000.0  0.1    31     100
1000.0  0.5    18     100
1000.0  0.9    18     100
1000.0  0.99   18     100

Table 5.3.1: Grid search results for SVM with an RBF kernel (10% sample)

             spot-on          next-to
Rating       Test    Train    Test    Train
1.0 - 1.5     1.16   99.78     1.16   99.89
1.5 - 2.0     0.00   99.90     3.49   99.90
2.0 - 2.5     0.85  100.00     5.13  100.00
2.5 - 3.0     5.22   99.69     6.34   99.69
3.0 - 3.5     2.65  100.00    28.07  100.00
3.5 - 4.0    21.28  100.00    25.12  100.00
4.0 - 4.5    69.02  100.00    71.18  100.00
4.5 - 5.0     2.52  100.00    70.99  100.00
all          30.54   99.92    52.96   99.93

Table 5.3.2: Accuracies associated with optimal SVM parameters (10% sample)

C       Test   Train
1.0     11     48
10.0    10     48
100.0   11     47
1000.0  11     47

Table 5.3.3: Results using a linear kernel

Table 5.3.1 shows the results of the grid search, with table 5.3.2 showing the distribution of the accuracies associated with the optimal C and γ. As the SVM estimator does not simply guess a single rating, we continue. Tables 5.3.1 and 5.3.3 illustrate that the best choice within this random sample is the Radial Basis Function kernel with C = 10, γ = 0.1; we also see that the choice of C matters little. Using these settings we train on the entire data set.

             spot-on          next-to
Rating       Test    Train    Test    Train
1.0 - 1.5    23.26   95.85    46.51   98.26
1.5 - 2.0    22.09   92.56    33.72   97.56
2.0 - 2.5    17.95   73.73    31.62   78.63
2.5 - 3.0    14.55   58.42    29.10   68.94
3.0 - 3.5    13.83   34.54    29.47   43.71
3.5 - 4.0    15.92   30.57    30.56   43.32
4.0 - 4.5    13.76   21.03    19.84   26.17
4.5 - 5.0     7.39   15.07    19.76   26.45
all          12.80   53.39    23.83   61.03

Table 5.3.4: Results for full training set (SVM)

             spot-on          next-to
Rating       Test    Train    Test    Train
1.0 - 1.5     1.16   96.97     2.33   98.20
1.5 - 2.0     2.33   95.81     2.33   97.28
2.0 - 2.5     0.85   93.35     5.13   95.78
2.5 - 3.0     2.99   87.92     4.10   91.30
3.0 - 3.5    11.03   71.33    31.56   81.11
3.5 - 4.0    18.80   67.54    32.00   76.89
4.0 - 4.5    44.12   74.32    65.96   86.78
4.5 - 5.0    25.78   57.63    67.27   79.81
all          27.86   81.03    51.57   88.52

Table 5.3.5: Results for full training set (NN)

We can see a disparity between the two results: where the neural network is good (> 25%) at categorizing the higher end of the ratings spectrum, the SVM classifier's accuracy is biased towards the lower end of the spectrum, with less skew. We also see that the SVM classifier does not attain a high accuracy on the higher end of the spectrum, whilst the Neural Network performs more evenly across the board.


5.4 Description Feature Set

This section describes the results attained for the description feature set. First, the parameters of the LDA model are determined in subsection 5.4.1, followed by a discussion of the results in subsection 5.4.2.

5.4.1 Parameters

We first calculate the TF-IDF values for each individual word after stemming; after observing the TF-IDF values, a cutoff is chosen. From visual inspection of figure 5.4.1, we set the cutoff to lie within [2.5, 6.5]. Tables 5.4.1, 5.4.2, 5.4.3, and 5.4.4 list the differences before and after the cull.

Figure 5.4.1: TF-IDF graph

Word     TF-IDF      Word   TF-IDF
up       0.925657    thi    0.5981
new      0.909615    it     0.587034
have     0.888618    play   0.510886
by       0.879416    on     0.460175
will     0.872721    is     0.329279
as       0.846381    for    0.290002
be       0.843011    with   0.286958
that     0.818909    in     0.257535
more     0.760925    your   0.224049
from     0.759918    game   0.189414
are      0.736809    you    0.186967
all      0.699321    of     0.174215
or       0.686651    to     0.0905064
featur   0.636971    and    0.0740746

Table 5.4.1: Lowest TF-IDF words pre-cull (description)

Word            TF-IDF     Word          TF-IDF
behindthescen   10.2263    kidz          10.2263
brog            10.2263    pogu          10.2263
superpong       10.2263    jonesin       10.2263
spiffywar       10.2263    chubukov      10.2263
waffleturtl     10.2263    jirbo         10.2263
sorel           10.2263    pixio         10.2263
doublet         10.2263    ijezzbal      10.2263
provision       10.2263    stephenflem   10.2263
easthaven       10.2263    galley        10.2263
fanni           10.2263    glovercom     10.2263
agnew           10.2263    aki           10.2263
askew           10.2263    xerc          10.2263
saratoga        10.2263    cupertino     10.2263
tournement      10.2263    sextupl       10.2263

Table 5.4.2: Highest TF-IDF words pre-cull (description)


Word           TF-IDF    Word       TF-IDF
cant           4.23733   playabl    4.21503
paid           4.23733   text       4.21503
experienc      4.22984   career     4.21503
credit         4.22984   wont       4.21503
gmail          4.22984   sequel     4.21503
wit            4.22984   illustr    4.21503
owner          4.22736   agre       4.21014
regular        4.22736   kick       4.2077
bigfishtwitt   4.22488   dive       4.2077
seem           4.22241   bottom     4.20527
condit         4.22241   undo       4.20285
ahead          4.21994   grid       4.20285
bigfi          4.21994   corner     4.20285
ten            4.21748   straight   4.20285

Table 5.4.3: Lowest TF-IDF words past-cull (description)

Word         TF-IDF    Word         TF-IDF
pellet       8.14685   sherman      8.14685
inki         8.14685   fischer      8.14685
blinki       8.14685   tribun       8.14685
appstoreapp  8.14685   cappuccino   8.14685
droplet      8.14685   elvi         8.14685
mulligan     8.14685   rgb          8.14685
symmetr      8.14685   gopher       8.14685
joseph       8.14685   zu           8.14685
glutton      8.14685   geolog       8.14685
seventh      8.14685   interven     8.14685
unsurpass    8.14685   roach        8.14685
misfortun    8.14685   multius      8.14685
unclutt      8.14685   interpol     8.14685
flatten      8.14685   lcd          8.14685

Table 5.4.4: Highest TF-IDF words past-cull (description)
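A sketch of how scores of this kind, and the subsequent cull, can be computed with gensim is shown below. `docs` (a list of stemmed token lists) is an assumed name, and the exact TF-IDF weighting used in this thesis is not stated, so the scores are illustrative.

```python
from gensim.corpora import Dictionary
from gensim.models import TfidfModel

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
tfidf = TfidfModel(corpus)

# Average TF-IDF score per word over the documents it appears in, then cull
# everything outside the visually chosen [2.5, 6.5] window.
totals, counts = {}, {}
for bow in corpus:
    for word_id, score in tfidf[bow]:
        totals[word_id] = totals.get(word_id, 0.0) + score
        counts[word_id] = counts.get(word_id, 0) + 1
keep = [w for w, s in totals.items() if 2.5 <= s / counts[w] <= 6.5]
dictionary.filter_tokens(good_ids=keep)
```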

The information contained within tables 5.4.1-5.4.4 shows the effect clearly: words such as 'illustr' (pertaining to illustrating, illustrate, et cetera) carry more information than 'your' or 'by'. Although less informative words such as 'wont' do still come up, they are less prone to appear. After determining, and consequently filtering on, the cutoff point, the attention shifts to determining k, the number of topics in the LDA model. We search for k ∈ {25, 50, 75, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000} and attain the following values for the coherence measure:

k      Cv
25     0.47
50     0.49
75     0.47
100    0.46
200    0.47
300    0.45
400    0.47
500    0.49
600    0.49
700    0.48
800    0.50
900    0.52
1000   0.51

Table 5.4.5: The coherence values associated with figure 5.4.2

Figure 5.4.2: Coherence estimates
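A sketch of this topic-number search with gensim: train an LDA model per candidate k and score it with the C_v coherence measure (Röder et al., 2015). `corpus`, `dictionary`, and `docs` are assumed from the TF-IDF step; a single pass per candidate is used here to keep the search cheap, with the final model trained with more passes as described below.

```python
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

for k in [25, 50, 75, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]:
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k, passes=1)
    cv = CoherenceModel(model=lda, texts=docs, dictionary=dictionary,
                        coherence="c_v").get_coherence()
    print(k, cv)
```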

We will be using k = 300, as it attains the lowest coherence estimate. Now that we know the optimal value of k, we run 30 passes over the corpus to generate the LDA features for the SVM and NN. We first turn our attention to Support Vector Machines, where we determine the optimal C and γ by grid-searching a 10% sample:


C       γ      Test   Train
1.0     0.01   10     14
1.0     0.1    11     14
1.0     0.5    12     19
1.0     0.9    13     21
1.0     0.99   16     22
10.0    0.01   10     15
10.0    0.1     8     21
10.0    0.5    10     24
10.0    0.9    10     25
10.0    0.99   12     26
100.0   0.01   12     20
100.0   0.1    10     25
100.0   0.5    12     27
100.0   0.9    14     28
100.0   0.99   14     28
1000.0  0.01   10     25
1000.0  0.1    11     27
1000.0  0.5    18     30
1000.0  0.9    22     31
1000.0  0.99   21     30

Table 5.4.6: SVM grid search results for the description LDA feature set

             spot-on          next-to
Rating       Test    Train    Test    Train
1.0 - 1.5    29.07   60.49    40.70   70.25
1.5 - 2.0    12.64   51.02    28.74   69.40
2.0 - 2.5    11.02   23.64    38.14   50.83
2.5 - 3.0    24.25   41.94    33.96   52.05
3.0 - 3.5     1.12    3.51     9.48   12.71
3.5 - 4.0    10.40   11.12    10.88   12.62
4.0 - 4.5    36.48   36.37    39.54   39.49
4.5 - 5.0     5.91    7.11    38.24   37.98
all          18.69   30.00    30.52   43.37

Table 5.4.7: Results for the 10% sample

C       Test   Train
1.0      7     20
10.0     8     23
100.0   12     24
1000.0  19     27

Table 5.4.8: Accuracies for a linear kernel

From these results we see that the optimal choice of (C,γ) is (1000,0.9) with a Radial Basis Function kernel.

5.4.2 Results

We now proceed by applying these settings to the entire data set for the SVM. We also train a Neural Network; the results can be seen in tables 5.4.9 and 5.4.10 for SVM and NN respectively.

             spot-on          next-to
Rating       Test    Train    Test    Train
1.0 - 1.5    29.07   60.49    40.70   70.25
1.5 - 2.0    12.64   51.02    28.74   69.40
2.0 - 2.5    11.02   23.64    38.14   50.83
2.5 - 3.0    24.25   41.94    33.96   52.05
3.0 - 3.5     1.12    3.51     9.48   12.71
3.5 - 4.0    10.40   11.12    10.88   12.62
4.0 - 4.5    36.48   36.37    39.54   39.49
4.5 - 5.0     5.91    7.11    38.24   37.98
all          18.69   30.00    30.52   43.37

Table 5.4.9: Results for full data set (SVM)

             spot-on          next-to
Rating       Test    Train    Test    Train
1.0 - 1.5     8.14   79.63    11.63   83.62
1.5 - 2.0     4.60   75.55     6.90   80.23
2.0 - 2.5     5.08   60.36     9.32   69.46
2.5 - 3.0     2.61   47.30     4.85   56.92
3.0 - 3.5     3.21   27.80    10.74   33.22
3.5 - 4.0     8.32   15.92    13.36   26.45
4.0 - 4.5    75.20   75.57    83.15   83.07
4.5 - 5.0    12.04   14.73    82.44   70.90
all          33.16   49.71    56.66   62.25

Table 5.4.10: Results for full data set (NN)

Again, as was the case for the Images feature set, we see that the Neural Network performs well on the higher end of the ratings spectrum relative to the SVM, which performs significantly (3x) better in accuracy on the lower end of the ratings spectrum.


5.5 Title Feature Set

This section describes the parameter search (5.5.1) and the results (5.5.2) for the title feature set.

5.5.1 Parameters

Here we first calculate the TF-IDF values, where, after stemming and observing the words, a cutoff is chosen. From visual inspection of figure 5.5.1, the TF-IDF needs to lie within [3, 6.5]. Furthermore, we list the differences before and after the cull in tables 5.5.1-5.5.4.

Figure 5.5.1: TF-IDF graph

Word       TF-IDF    Word     TF-IDF
hidden     4.11483   race     3.71753
war        4.05459   casino   3.7101
hd         4.04009   simul    3.65822
world      4.03393   word     3.6415
salon      3.98993   girl     3.59561
my         3.95909   quiz     3.53545
trivia     3.95342   slot     3.43958
in         3.93844   puzzl    3.23696
guess      3.93103   kid      3.17098
with       3.84787   and      2.98924
adventur   3.84617   of       2.83797
up         3.84448   for      2.68249
pro        3.80793   free     2.33198
fun        3.76015   the      2.29825

Table 5.5.1: Lowest TF-IDF words for title pre-cull

Word          TF-IDF    Word           TF-IDF
cro           10.2263   beecel         10.2263
superbal      10.2263   wooli          10.2263
garf          10.2263   wordtouch      10.2263
pictureflip   10.2263   killersudoku   10.2263
chart         10.2263   bandicoot      10.2263
mimeo         10.2263   lumen          10.2263
ijezzbal      10.2263   blockstouch    10.2263
aki           10.2263   marblejump     10.2263
trism         10.2263   isudoku        10.2263
sextupl       10.2263   partner        10.2263
lumina        10.2263   blocksclass    10.2263
yulan         10.2263   idic           10.2263
imangi        10.2263   tyranno        10.2263
muddl         10.2263   chessclock     10.2263

Table 5.5.2: Highest TF-IDF words for title pre-cull


Word       TF-IDF    Word       TF-IDF
hero       4.22736   with       3.84787
babi       4.22736   adventur   3.84617
dress      4.17656   up         3.84448
machin     4.17421   pro        3.80793
anim       4.16251   fun        3.76015
to         4.15556   edit       3.72952
hidden     4.11483   race       3.71753
war        4.05459   casino     3.7101
hd         4.04009   simul      3.65822
world      4.03393   word       3.6415
salon      3.98993   girl       3.59561
my         3.95909   quiz       3.53545
trivia     3.95342   slot       3.43958
in         3.93844   puzzl      3.23696

Table 5.5.3: Lowest TF-IDF words for title after cull

Word        TF-IDF    Word       TF-IDF
volleybal   7.92371   toilet     7.92371
deuc        7.92371   ragdol     7.92371
random      7.92371   bloon      7.92371
cow         7.92371   present    7.92371
type        7.92371   submarin   7.92371
luxuri      7.92371   airlin     7.92371
expert      7.92371   never      7.92371
astrawar    7.92371   conflict   7.92371
pick        7.92371   alli       7.92371
against     7.92371   bay        7.92371
terror      7.92371   rang       7.92371
spanish     7.92371   potter     7.92371
bomb        7.92371   fireman    7.92371
lunar       7.92371   web        7.92371

Table 5.5.4: Highest TF-IDF words for title after cull

After stemming the words and culling within the range defined earlier, we turn our attention to finding the optimal value of k for the title feature set. We again employ the coherence measure to search for k ∈ {25, 50, 75, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000}, and attain the following values:

k      Cv
25     0.51
50     0.43
75     0.42
100    0.40
200    0.47
300    0.53
400    0.55
500    0.56
600    0.56
700    0.56
800    0.56
900    0.56
1000   0.57

Table 5.5.5: The coherence values associated with figure 5.5.2

Figure 5.5.2: Coherence estimates

We observe the lowest coherence estimate to be 0.40, at k = 100. This choice of k will be used to generate the LDA features fed into both the SVM and the NN. We now perform a grid search to obtain the optimal parameters (C, γ) on a random 10% sample, and attain the following results:


C       γ      Test   Train
1.0     0.01   10     13
1.0     0.1    14     16
1.0     0.5    19     20
1.0     0.9    23     21
1.0     0.99   20     21
10.0    0.01   13     16
10.0    0.1    16     19
10.0    0.5    22     23
10.0    0.9    24     24
10.0    0.99   26     24
100.0   0.01   16     17
100.0   0.1    24     22
100.0   0.5    28     26
100.0   0.9    32     28
100.0   0.99   29     29
1000.0  0.01   20     20
1000.0  0.1    28     25
1000.0  0.5    30     30
1000.0  0.9    30     32
1000.0  0.99   32     33

Table 5.5.6: Grid search results for the title feature set

             spot-on          next-to
Rating       Test    Train    Test    Train
1.0 - 1.5     1.16   40.13     3.49   56.09
1.5 - 2.0     4.60   43.80     4.60   58.26
2.0 - 2.5     0.00   28.09     1.69   37.09
2.5 - 3.0     0.00   24.38     0.00   32.16
3.0 - 3.5     7.67   18.33    13.53   23.53
3.5 - 4.0     7.92   13.95    14.64   20.77
4.0 - 4.5    70.81   76.43    82.84   85.71
4.5 - 5.0    12.75   14.98    85.50   74.47
all          31.84   32.59    57.43   48.47

Table 5.5.7: Results from the 10% sample

C       Test   Train
1.0      6     10
10.0     7     15
100.0   10     21
1000.0  13     23

Table 5.5.8: Accuracies for a linear kernel

From these tables we see that the optimal (C, γ) combination is (1000,0.99) with a Radial Basis Function kernel.

5.5.2 Results

This subsection describes the results from both the Neural Network and Support Vector Machine. We apply the parameters as defined in the previous subsection, and run SVM and NN on the entire data set.

             spot-on          next-to
Rating       Test    Train    Test    Train
1.0 - 1.5     1.16   44.64     6.98   58.37
1.5 - 2.0     4.60   38.48     6.90   55.29
2.0 - 2.5     0.00   30.30     1.69   36.20
2.5 - 3.0     0.75   20.22     0.75   29.14
3.0 - 3.5     3.07   17.78    11.16   24.07
3.5 - 4.0     7.76   15.97    10.32   20.56
4.0 - 4.5    61.01   65.36    84.01   87.00
4.5 - 5.0    26.97   30.65    86.27   73.79
all          31.51   32.50    57.14   47.60

Table 5.5.9: Results for the entire data set for title (SVM)

             spot-on          next-to
Rating       Test    Train    Test    Train
1.0 - 1.5     5.81   28.71    19.77   46.72
1.5 - 2.0     8.05   28.61    12.64   47.33
2.0 - 2.5     4.24   37.58     7.63   42.88
2.5 - 3.0     1.87    7.38     7.46   37.23
3.0 - 3.5     8.23   11.73    16.04   15.92
3.5 - 4.0     7.20    5.81    14.56   13.27
4.0 - 4.5    61.13   59.52    71.55   71.46
4.5 - 5.0    15.15   13.92    70.62   59.59
all          29.08   23.95    50.26   41.34

Table 5.5.10: Results Neural Network for the title feature set

We see a failure on the SVM side to correctly identify the 2.0-3.0 range, indicating a lack of information in this block. Although the NN performs better in this range, its accuracy is still within 0-5%, also indicating a lack of ability to train on this range.

5.6 Fusion

This section describes the end results: combining the feature sets. Within this section we entertain the notion of combining the different available feature sets:

1. Images
2. Images + Genres
3. Images + Genres + Title
4. Images + Genres + Title + Description

By doing so, we are able to ascertain the added value of each feature set, if any.
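A minimal sketch of the late-fusion step: the per-feature-set classifiers emit class probabilities, which are normalized, concatenated, and fed to a second-stage classifier (an SVM here; the Neural Network fusion is analogous). The first-stage classifier and feature-matrix names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

first_stage = [(clf_img, X_img), (clf_gen, X_gen),
               (clf_ttl, X_ttl), (clf_dsc, X_dsc)]

# Concatenate per-class probabilities and normalize each column, mirroring
# normalize(y) = sigma_y^{-1}(y - iota * mu_y) from subsection 4.4.3.
Z = np.hstack([clf.predict_proba(X) for clf, X in first_stage])
Z = StandardScaler().fit_transform(Z)

fusion = SVC(kernel="rbf", C=1000, gamma=0.99)  # values from table 5.6.9
fusion.fit(Z, y_bal)
```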

5.6.1 Neural Network

Here the results for the individual combinations are presented in tables 5.6.1-5.6.4:

             spot-on          next-to
Rating       Test    Train    Test    Train
1.0 - 1.5     4.65   98.26     4.65   99.10
1.5 - 2.0     1.16   97.21     1.16   98.11
2.0 - 2.5     2.56   88.82     9.40   93.34
2.5 - 3.0     8.21   73.95    12.31   84.25
3.0 - 3.5    14.94   53.30    30.87   62.42
3.5 - 4.0    19.20   46.73    33.60   59.64
4.0 - 4.5    36.43   67.09    60.86   82.23
4.5 - 5.0    29.45   55.20    65.02   74.40
all          26.72   72.63    49.71   81.67

Table 5.6.1: Images full set results

             spot-on          next-to
Rating       Test    Train    Test    Train
1.0 - 1.5     4.65   98.26     4.65   99.10
1.5 - 2.0     1.16   97.21     1.16   98.11
2.0 - 2.5     2.56   88.82     9.40   93.34
2.5 - 3.0     8.21   73.95    12.31   84.25
3.0 - 3.5    14.94   53.30    30.87   62.42
3.5 - 4.0    19.20   46.73    33.60   59.64
4.0 - 4.5    36.43   67.09    60.86   82.23
4.5 - 5.0    29.45   55.20    65.02   74.40
all          26.72   72.63    49.71   81.67

Table 5.6.3: Images + Genres full set results

             spot-on          next-to
Rating       Test    Train    Test    Train
1.0 - 1.5     2.33   98.61     2.33   99.09
1.5 - 2.0     2.33   98.53     2.33   99.02
2.0 - 2.5     1.71   91.13     9.40   94.26
2.5 - 3.0     8.96   75.35    13.43   85.44
3.0 - 3.5    14.53   54.10    32.68   63.91
3.5 - 4.0    20.48   49.67    35.12   61.95
4.0 - 4.5    39.45   70.92    60.20   82.38
4.5 - 5.0    26.77   50.81    64.59   73.57
all          27.32   73.76    49.84   82.46

Table 5.6.2: Images + Genres + Title full set results

             spot-on          next-to
Rating       Test    Train    Test    Train
1.0 - 1.5     5.81   98.76     5.81   99.24
1.5 - 2.0     1.16   98.59     1.16   99.05
2.0 - 2.5     1.71   91.68     9.40   95.54
2.5 - 3.0     7.46   79.70    10.07   87.14
3.0 - 3.5    12.71   53.37    32.40   63.90
3.5 - 4.0    22.40   52.96    35.52   63.56
4.0 - 4.5    37.92   69.46    60.16   82.51
4.5 - 5.0    28.46   55.00    64.31   74.83
all          27.33   75.05    49.70   83.23

Table 5.6.4: All feature set results

These results indicate that each added feature set is of incremental value to the accuracy of the predictor. We do, however, see no increase in the accuracy for the 1.5-2.0 rating after adding the description feature set. The third feature set combination has the highest accuracy in this range, which would suggest that it holds information pertaining to this rating range that the other feature set combinations do not.

5.6.2 Support Vector Machine

As done for the individual feature sets, we need to ascertain the optimal (C, γ) combination for each feature set combination based on a random 10% sample. The results are presented per cell as training and test accuracy, divided by a comma. The linear kernel has not been tested here, as it has shown not to be effective on the individual feature sets. Because of the large number of classifiers that need to be trained, a bagging classifier is employed to cut down on computation time.

          C = 1          C = 10         C = 100        C = 1000
γ = 0.01  54.69, 26.62   58.58, 26.04   59.90, 25.80   60.95, 25.88
γ = 0.1   58.98, 26.22   60.89, 25.87   62.18, 25.39   63.70, 24.74
γ = 0.5   60.98, 25.91   63.10, 25.42   65.23, 24.93   68.23, 25.71
γ = 0.9   61.89, 25.51   64.38, 25.10   67.38, 25.57   71.70, 26.16
γ = 0.99  62.03, 25.48   64.67, 25.13   67.94, 25.78   72.43, 26.35

Table 5.6.5: Images (each cell: train, test)

          C = 1          C = 10         C = 100        C = 1000
γ = 0.01  53.80, 26.78   57.64, 25.96   59.00, 25.87   60.18, 25.43
γ = 0.1   57.90, 26.04   60.02, 25.46   62.57, 25.19   65.32, 25.22
γ = 0.5   60.57, 25.48   64.27, 25.28   68.80, 25.29   75.29, 25.51
γ = 0.9   61.95, 25.67   66.68, 25.54   73.01, 25.22   80.58, 24.72
γ = 0.99  62.34, 25.83   67.27, 25.45   73.97, 25.43   81.77, 24.55

Table 5.6.6: Images + Genres (each cell: train, test)

          C = 1          C = 10         C = 100        C = 1000
γ = 0.01  54.98, 26.74   58.98, 26.59   60.95, 26.42   62.59, 25.80
γ = 0.1   59.25, 26.41   62.47, 26.09   65.44, 25.26   69.88, 25.04
γ = 0.5   62.51, 26.14   67.34, 25.38   74.88, 25.68   83.89, 25.10
γ = 0.9   64.32, 26.07   71.16, 26.01   80.49, 25.51   90.67, 24.26
γ = 0.99  64.60, 26.00   72.06, 25.96   81.83, 25.45   91.32, 24.26

Table 5.6.7: Images + Genres + Title (each cell: train, test)


          C = 1          C = 10         C = 100        C = 1000
γ = 0.01  55.11, 27.07   59.08, 26.36   61.30, 26.54   64.96, 26.19
γ = 0.1   60.29, 26.65   64.63, 26.48   70.54, 25.58   79.90, 25.13
γ = 0.5   65.53, 26.43   75.62, 25.86   87.66, 24.62   96.59, 22.83
γ = 0.9   69.29, 26.52   82.46, 25.83   94.17, 23.65   99.38, 22.86
γ = 0.99  69.87, 26.43   83.86, 25.99   95.09, 23.80   99.50, 22.43

Table 5.6.8: Images + Genres + Title + Description (each cell: train, test)

From these grid searches we obtain the following optimal settings:

Feature set combination                  C      γ      Train   Test
Images                                   1000   0.99   72.43   26.35
Images + Genres                          1000   0.99   81.77   24.55
Images + Genres + Title                  1000   0.99   91.32   24.26
Images + Genres + Title + Description    1000   0.99   99.50   22.43

Table 5.6.9: Results of the grid search

We now apply these settings to the full data set:

             spot-on          next-to
Rating       Test    Train    Test    Train
1.0 - 1.5     3.49   95.43     5.81   98.84
1.5 - 2.0     2.33   90.61     2.33   96.74
2.0 - 2.5     3.42   69.42     9.40   76.02
2.5 - 3.0     8.21   53.44    11.94   68.27
3.0 - 3.5    10.89   48.99    30.45   59.29
3.5 - 4.0    20.72   53.09    32.56   63.24
4.0 - 4.5    36.82   63.22    58.16   72.85
4.5 - 5.0    26.05   59.97    59.39   71.50
all          25.84   66.72    47.00   75.81

Table 5.6.10: Images full set results

             spot-on          next-to
Rating       Test    Train    Test    Train
1.0 - 1.5     2.33   97.69     5.81   99.29
1.5 - 2.0     2.33   96.20     2.33   98.22
2.0 - 2.5     5.13   79.81    14.53   84.72
2.5 - 3.0     7.09   61.30    13.43   74.50
3.0 - 3.5    12.71   52.61    31.42   62.07
3.5 - 4.0    20.32   54.66    33.44   64.72
4.0 - 4.5    37.18   65.28    57.29   73.57
4.5 - 5.0    25.12   60.40    59.44   71.77
all          25.81   70.98    47.10   78.60

Table 5.6.12: Images + Genres full set results

             spot-on          next-to
Rating       Test    Train    Test    Train
1.0 - 1.5     3.49   98.71     6.98   99.48
1.5 - 2.0     1.16   97.09     2.33   98.59
2.0 - 2.5     7.69   87.23    14.53   90.25
2.5 - 3.0     6.34   68.99    12.69   80.42
3.0 - 3.5    13.13   56.52    32.68   64.68
3.5 - 4.0    20.88   58.16    36.24   68.51
4.0 - 4.5    36.00   66.68    54.12   74.17
4.5 - 5.0    23.48   60.36    57.31   70.98
all          25.10   74.24    45.99   80.91

Table 5.6.11: Images + Genres + Title full set results

             spot-on          next-to
Rating       Test    Train    Test    Train
1.0 - 1.5     1.16   99.10     9.30   99.59
1.5 - 2.0     2.33   97.91     2.33   99.08
2.0 - 2.5     5.13   91.50    17.09   93.55
2.5 - 3.0     7.46   77.84    14.18   86.81
3.0 - 3.5    17.32   62.55    34.50   68.53
3.5 - 4.0    19.52   60.95    36.16   71.44
4.0 - 4.5    34.82   68.70    50.63   74.67
4.5 - 5.0    21.02   61.20    54.19   71.57
all          24.19   77.52    44.17   83.21

Table 5.6.13: Images + Genres + Title + Description full set results

We see here, from the falling test accuracy [25.84, 25.81, 25.10, 24.19] and the increasing training set accuracy [66.72, 70.98, 74.24, 77.52], that the classifier is over-fitting on the data set. This is a common problem with SVM; however, the end result still shows decent performance, with no sub-1% accuracy in the test set.

5.7 Remarks

This section summarizes the findings of the previous sections. Tables 5.7.1 and 5.7.2 show the summarized results.

             SVM              NN
Feature Set  Test     Train   Test    Train
Genres       34.4     48.8    36.7    35.3*
Image        12.8**   53.4    27.9    81.03
Title        31.5     32.5    29.08   23.95
Description  18.7     30.0    33.2    49.8

Table 5.7.1: Individual results per feature set

                         SVM             NN
Feature Set              Test    Train   Test   Train
Images                   25.8    66.7    26.7   72.6
Images + Genres          25.8    71.0    26.7   72.6
Images + Genres + Title  25.1    74.2    27.3   73.8
All                      24.2    77.5    27.3   75.1

Table 5.7.2: Accuracies of combinations of feature sets

A few things should be noted here. The result marked with * is highly skewed towards the higher ratings, which are more available; hence the disparity between test and training accuracy. The result marked with ** performs very well on the lower ratings and is therefore lower in overall accuracy, but performs better than its NN counterpart there. Between the two methodologies, table 5.7.2 shows that the NN generalizes better and that the SVM is prone to over-fitting for this data set. The disparity between the Image feature set in the combinations and as an individual feature set arises because a Bagging method has been applied in the combinations.

Genres shows poor accuracy for the 1.5-2.0 range for both SVM and NN, and the NN performs sub-par on the 1.5-3.5 range. This could be due to a lack of information contained within the Genres feature set. A possible explanation could be the generation of ADASYN samples: they allow continuous values for the genres, although the original data set only includes a binary indicator for an app pertaining to a specific genre.

As for the Image feature set, we see that the SVM classifier (with C = 10, γ = 0.1 and a Radial Basis Function kernel) trains very well on the training sample and classifies the ratings evenly across the board, with a high accuracy for the lower ratings. This, combined with the fact that the Neural Network performs badly on the lower ratings but well on the high ratings, indicates the difference in approach. A combination of the two methods could lead to a higher accuracy.


The description feature set tells a different story with regard to accuracy: although the NN performs skewed towards the higher ratings, the SVM (bar 3.5-4.0) performs evenly across the ratings. A bad performance on 3.5-4.0 is present in the title feature set as well, indicating a lack of textual information pertaining to that rating group. The title feature set shows a very bad performance on the 1.0-3.5 range using both SVM and NN, again indicating a lack of information after generating synthetic samples.

A crucial point is to compare this work to other work. The work by Wieser (2016) applies the same pre-processing as this thesis, but uses a rounding function to the nearest integer for its ratings. The author attained an overall accuracy of 46.0% using Support Vector Machines on all feature sets available to him. The caveat here is that although this accuracy seems higher than the 44.2% attained in this chapter, the distribution of the accuracy is highly skewed towards the 4 rating class. Although it is not described clearly in that work, one would assume this to correspond to our '3.5-4.0' and '4.0-4.5' classes. The results for the 1 and 2 classes attain an accuracy of 0% and 0.16% respectively. From this we conclude that the classifier proposed in this thesis performs better.

The work by Glisovic (2016) uses the same definition and classes as Wieser (2016). The accuracies attained in that work are 62.8% and 63.5% for SVM and NN respectively. It does, however, have 0.0% accuracy on (translated to the definition used in this thesis) the '1.5-2.0' and '2.0-2.5' ratings. Although the classifier proposed in this thesis performs badly on these classes relative to the others, it never attains a 0.0% accuracy, indicating a better performance overall.

6 | Conclusion

The thesis is divided into five chapters preceding this conclusion: chapter 1 gave a short introduction to the research and its subject; chapter 2 reviewed literature pertaining to online content popularity and its prediction; chapter 3 described the theoretical groundwork of the research; chapter 4 proposed a method of predicting the average rating of an online content item; and chapter 5 discussed the application of the methodology in an experiment on App Store data, together with the results.

The thesis statement was formulated as follows: 'Is it possible to predict the average rating of App Store content?', with two sub-questions. Firstly, we wanted to check whether Support Vector Machines or Neural Networks show a performance difference. Secondly, we wanted to see how the performance depends on the exploitation of different feature sets, by means of an experiment. The experiment entailed investigating the information (spatial and objects) contained within the images associated with an app. Also investigated was the predictive power of the textual information contained within the title and description of an item, using Latent Dirichlet Allocation. Furthermore, the categorical information contained within the genres was added as a feature set. These feature sets were combined by means of late fusion: a separate classifier was trained for each feature set, after which another classifier was trained on their outputs. This was done for both the Support Vector Machine and the Neural Network.

The first sub-question concerns the performance of Support Vector Machines and Neural Networks. It was answered by means of an experiment on scraped data of 34,526 applications from the App Store, gathered with the help of AppTweak, which allowed us to check the performance of the suggested methodology. The results showed that Support Vector Machines are prone to over-fitting and therefore generalize poorly. Neural Networks, although prone to problems with the lower ratings, showed good generalization and performed better. We therefore conclude, for this sub-question, that Neural Networks perform better in this particular case.

The other sub-question concerns the performance contribution of the different feature sets. As described earlier, the Support Vector Machine classifier generalized poorly, so the Neural Network classifier is discussed here. Every addition of a feature set brought a boost to the classifier accuracy. There is, however, an important distinction to be made within the feature sets: the title feature set made the [1.5-2.0] accuracy drop to 1.16, indicating either that the title poorly describes the information for this range or problems with the feature set itself. Performance increased in general, however, with the addition of the description feature set. We therefore conclude that the performance relies on the addition and mixture of all available feature sets, and that the fusion aids accuracy.

We conclude that it is possible to predict the average rating of App Store content, but not yet to an acceptable degree of certainty. The caveat of this thesis lies mostly in the gathering of data. The groundwork laid here, however, can serve as a stepping stone for future research with more fine-tuned models.

6.1 Future Work

For future work we suggest transitioning to a continuous variable instead of binning. Although it is hard to pinpoint the informational difference between an application rated 4.2 and one rated 4.21, it is safe to assume that there is at least some information loss when 'binning' the rating of an application, and that this loss is highly dependent on the method of binning. The use of a continuous rating would, however, require a different approach with regard to the methodology described in this thesis.

Another point of interest is the Images feature set. Although the spatial and object information contained within these images is analyzed, a more app-specific approach could be taken. By creating a labeled set of training data more relevant to app icons (or icons of whatever category the online content is associated with), e.g. cartoon images, and training a custom image classifier on this data, it is not unlikely that a higher accuracy can be achieved.

As applications in particular have reviews, we suggest using the sentiment and/or textual information contained within these reviews as a feature set. The inclusion of engagement/sentiment from reddit.com, facebook.com, or twitter.com could serve as an additional feature set as well.

Sampling of data is done in this thesis by means of ADASYN, which, however, treats the data as continuous. Improvements can be made by applying a ceiling function to the generated data, or by using a sampling algorithm more suited to binary data when applying it to categorical binary features. As far as the data itself is concerned, a prime necessity in general is well-distributed data. Although ADASYN mitigates this need somewhat, a scraping tool more focused on lower-rated apps could improve the algorithm performance.

Pertaining to the sampling of the data and the split into training and test sets, a more thorough approach using k-fold cross-validation should be employed, making the choice of a particular training/test split moot; a minimal sketch follows below.
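A minimal sketch of the suggested cross-validation, assuming scikit-learn and placeholder names `X`, `y`:

```python
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Each of the 5 folds serves as the test set exactly once, so no single
# train/test split dominates the reported accuracy.
scores = cross_val_score(SVC(kernel="rbf", C=1000, gamma=0.99), X, y,
                         cv=5, n_jobs=-1)
print(scores.mean(), scores.std())
```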

A more thorough approach to determining the topic number k is also warranted, as the current search is very coarse. At the moment an optimal k is found in one pass; a second pass searching the range [k − δ, k + δ] could yield a more specific k than currently used. Also, in the context of Latent Dirichlet Allocation: although TF-IDF goes far in describing the tokens used, the cutoff points were chosen in a non-structural way. A more structured approach, for instance using the perplexity attained after the TF-IDF cutoffs have been chosen, could enhance accuracy.

This thesis treats Support Vector Machines and Neural Networks separately; a suggestion for future work is their combined use, as the results shown in the previous chapter indicate that, for the Images feature set, Support Vector Machines perform well on the lower ratings while Neural Networks perform well on the higher ratings.

The combination of all these enhancements should make for an interesting continuation of the research posited in this thesis, allowing for better insight into the prediction of online content popularity.


References

D. Andrzejewski, X. Zhu, and M. Craven. Incorporating domain knowledge into topic modeling via dirichlet forest priors. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 25–32. ACM, 2009.

AppShopper. Homepage, 2017. URL https://web.archive.org/web/20170816125951/http://appshopper.com/.

Y. Bae and H. Lee. Sentiment analysis of twitter audiences: Measuring the positive or negative influence of popular twitterers. Journal of the American Society for Information Science and Technology, 63(12): 2521–2535, 2012. ISSN 1532-2890. 10.1002/asi.22768. URL http://dx.doi.org/10.1002/asi.22768.

S. Bengio and Y. Bengio. Taking on the curse of dimensionality in joint distributions using neural networks. IEEE Transactions on Neural Networks, 11(3):550–557, 2000.

C. M. Bishop. Pattern recognition and machine learning. springer, 2006.

D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. Journal of machine Learning research, 3(Jan):993–1022, 2003.

L. Breiman. Bagging predictors. Machine learning, 24(2):123–140, 1996.

L. Breiman. Pasting small votes for classification in large databases and on-line. Machine Learning, 36(1): 85–103, 1999.

J. Chang, S. Gerrish, C. Wang, J. L. Boyd-Graber, and D. M. Blei. Reading tea leaves: How humans interpret topic models. In Advances in neural information processing systems, pages 288–296, 2009.

Y.-W. Chang, C.-J. Hsieh, K.-W. Chang, M. Ringgaard, and C.-J. Lin. Training and testing low-degree polynomial data mappings via linear svm. Journal of Machine Learning Research, 11(Apr):1471–1490, 2010.

O. Chapelle. Training a support vector machine in the primal. Neural computation, 19(5):1155–1178, 2007.


N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. Smote: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16:321–357, 2002.

K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of machine learning research, 2(Dec):265–292, 2001.

C. W. Curtis. Linear Algebra. Allyn and Bacon, 2 edition, 1968.

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.

K. Duan and S. S. Keerthi. Which is the best multiclass svm method? an empirical study. Multiple classifier systems, 3541:278–285, 2005.

A. Ö. Eren and M. Sert. Movie rating prediction using ensemble learning and mixed type attributes. In Signal Processing and Communications Applications Conference (SIU), 2017 25th, pages 1–4. IEEE, 2017.

Ericsson. Ericsson mobility report 2017, 2017. URL https://web.archive.org/web/20170723121909/https://www.ericsson.com/en/mobility-report.

Forbes. Apple's app store generating meaningful revenue, 2017. URL https://web.archive.org/web/20170107174901/https://www.forbes.com/sites/chuckjones/2017/01/06/apples-app-store-generating-meaningful-revenue/#45c0b09e33db.

V. Glisovic. Forecasting the success of game-apps based on reviews. Master's thesis, University of Amsterdam, 2016.

M. Harvey, M. J. Carman, I. Ruthven, and F. Crestani. Bayesian latent variable models for collaborative item rating prediction. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM '11, pages 699–708, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0717-8. 10.1145/2063576.2063680. URL http://doi.acm.org/10.1145/2063576.2063680.

H. He, Y. Bai, E. A. Garcia, and S. Li. Adasyn: Adaptive synthetic sampling approach for imbalanced learning. In Neural Networks, 2008. IJCNN 2008.(IEEE World Congress on Computational Intelligence). IEEE International Joint Conference on, pages 1322–1328. IEEE, 2008.

T. K. Ho. The random subspace method for constructing decision forests. IEEE transactions on pattern analysis and machine intelligence, 20(8):832–844, 1998.

M. Hoffman, F. R. Bach, and D. M. Blei. Online learning for latent dirichlet allocation. In advances in neural information processing systems, pages 856–864, 2010.


L. Hong and B. D. Davison. Empirical study of topic modeling in twitter. In Proceedings of the first workshop on social media analytics, pages 80–88. ACM, 2010.

P.-Y. Hsu, Y.-H. Shen, and X.-A. Xie. Predicting movies user ratings with imdb attributes. In International Conference on Rough Sets and Knowledge Technology, pages 444–453. Springer, 2014.

ILSVRC. Ilsvrc2016, 2016. URL http://image-net.org/challenges/LSVRC/2016/results.

S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In F. Bach and D. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 448–456, Lille, France, 07–09 Jul 2015. PMLR. URL http://proceedings.mlr.press/v37/ioffe15.html.

A. Khosla, A. Das Sarma, and R. Hamid. What makes an image popular? In Proceedings of the 23rd international conference on World wide web, pages 867–876. ACM, 2014.

A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.

M. H. Latif and H. Afzal. Prediction of movies popularity using machine learning techniques. International Journal of Computer Science and Network Security (IJCSNS), 16(8):127, 2016.

G. Louppe and P. Geurts. Ensembles on random patches. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 346–361. Springer, 2012.

J. B. Lovins. Development of a stemming algorithm. Mech. Translat. & Comp. Linguistics, 11(1-2):22–31, 1968.

H. P. Luhn. A statistical approach to mechanized encoding and searching of literary information. IBM Journal of research and development, 1(4):309–317, 1957.

E. Malmi. Quality matters: Usage-based app popularity prediction. In Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct Publication, UbiComp '14 Adjunct, pages 391–396, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-3047-3. 10.1145/2638728.2641698. URL http://doi.acm.org.proxy.uba.uva.nl:2048/10.1145/2638728.2641698.

M. Mazloom, R. Rietveld, S. Rudinac, M. Worring, and W. van Dolen. Multimodal popularity prediction of brand-related social media posts. In Proceedings of the 2016 ACM on Multimedia Conference, MM ’16, pages 197–201, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-3603-1. 10.1145/2964284.2967210. URL http://doi.acm.org/10.1145/2964284.2967210.

M. Meeker. 2014 internet trends, 2014. URL https://web.archive.org/web/20170712050029/http://www.kpcb.com/blog/2014-internet-trends.


M. Mestyán, T. Yasseri, and J. Kertész. Early prediction of movie box office success based on wikipedia activity big data. PloS one, 8(8):e71226, 2013.

V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 807–814, 2010.

A. Oghina, M. Breuss, M. Tsagkias, and M. de Rijke. Predicting imdb movie ratings using social media. In European Conference on Information Retrieval, pages 503–507. Springer, 2012.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

J. Platt et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers, 10(3):61–74, 1999.

S. Pramod and A. Joshi. Prediction of movie success for real world movie data sets. 2017.

J. W. Pratt, H. Raiffa, and R. Schlaifer. Introduction to statistical decision theory. MIT press, 1995.

PyTorch. PyTorch vision models, 2017. URL https://web.archive.org/web/20170929201855/http://pytorch.org/docs/master/torchvision/models.html.

M. Röder, A. Both, and A. Hinneburg. Exploring the space of topic coherence measures. In Proceedings of the eighth ACM international conference on Web search and data mining, pages 399–408. ACM, 2015.

C. G. M. Snoek, M. Worring, and A. W. M. Smeulders. Early versus late fusion in semantic video analysis. In Proceedings of the 13th Annual ACM International Conference on Multimedia, MULTIMEDIA ’05, pages 399–402, New York, NY, USA, 2005. ACM. ISBN 1-59593-044-2. 10.1145/1101149.1101236. URL http://doi.acm.org.proxy.uba.uva.nl:2048/10.1145/1101149.1101236.

K. Sparck Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of documentation, 28(1):11–21, 1972.

N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of machine learning research, 15(1):1929–1958, 2014.

G. Szabo and B. A. Huberman. Predicting the popularity of online content. Commun. ACM, 53(8):80–88, Aug. 2010a. ISSN 0001-0782. 10.1145/1787234.1787254.


G. Szabo and B. A. Huberman. Predicting the popularity of online content. Commun. ACM, 53(8): 80–88, Aug. 2010b. ISSN 0001-0782. 10.1145/1787234.1787254. URL http://doi.acm.org/10.1145/ 1787234.1787254.

H. M. Wallach, D. M. Mimno, and A. McCallum. Rethinking lda: Why priors matter. In Advances in neural information processing systems, pages 1973–1981, 2009.

S. Wieser. Forecasting the success of apps based on their visual appearance. Master’s thesis, University of Amsterdam, 2016.

T.-F. Wu, C.-J. Lin, and R. C. Weng. Probability estimates for multi-class classification by pairwise coupling. Journal of Machine Learning Research, 5(Aug):975–1005, 2004.
