Forecasting the Popularity of Applications: An Analysis of Textual and Graphical Properties

Harro van der Kroft
Master Thesis for Econometrics - Big Data Track
Faculty of Economics and Business, Section Econometrics

Abstract

This thesis contributes to the scarce literature on App Store content popularity prediction. By scraping data from the Apple App Store, we form feature sets from the textual and graphical domains. The methodology employed allows for the use of other data, from other online content sources, and fuses these feature sets by means of late fusion. This thesis researches the predictive power of Neural Networks and Support Vector Machines in parallel, and by layering different feature sets it ascertains that there is an added benefit in combining different feature sets. We reveal that there is predictive power in using the methodology outlined in this thesis.

Acknowledgments

I would like to sincerely thank my supervisor Prof. Dr. M. Worring for his supervision, patience, and enthusiasm. The passion shown by Marcel has furthered my interest in the field of AI more than I could have hoped for. Furthermore, I would like to thank Leo Huberts, Diederik van Krieken, Frederique Arntz, and Dominique van der Vlist for their input and constructive criticism.

Statement of Originality

This document is written by Student Harro van der Kroft who declares to take full responsibility for the contents of this document. I declare that the text and the work presented in this document are original and that no sources other than those mentioned in the text and its references have been used in creating it. The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.

Contents

1 Introduction
2 Literature Review
    2.1 Internet Movie Database
    2.2 Online content analysis
    2.3 Popularity Prediction
    2.4 Deep Learning and Image classification
    2.5 Modality in features
3 Theory
    3.1 Statistics
    3.2 Natural Language Processing
        3.2.1 TF-IDF
        3.2.2 LDA
        3.2.3 Pre-processing & stemming
        3.2.4 Topic Number Estimation
    3.3 Artificial Neural Networks
        3.3.1 Feed-forward network
        3.3.2 Activation layers
        3.3.3 Network Training
        3.3.4 Loss function
        3.3.5 Other layers
        3.3.6 Normalization
    3.4 Support Vector Machine
        3.4.1 Ensemble Learning
        3.4.2 Kernel
        3.4.3 Parameters
    3.5 Synthetic Sampling
4 Methodology
    4.1 Pre-processing
    4.2 Feature Extraction
        4.2.1 Image
        4.2.2 LDA
        4.2.3 Genres
    4.3 Prediction Goal
        4.3.1 Continuous
    4.4 Prediction
        4.4.1 Sampling
        4.4.2 Support Vector Machine
        4.4.3 Neural Network
    4.5 Summary
5 Experiment
    5.1 Origin & Explanation
        5.1.1 Statistics
    5.2 Genres Feature Set
        5.2.1 Statistics
        5.2.2 Results
    5.3 Image Feature Set
        5.3.1 Results
    5.4 Description Feature Set
        5.4.1 Parameters
        5.4.2 Results
    5.5 Title Feature Set
        5.5.1 Parameters
        5.5.2 Results
    5.6 Fusion
        5.6.1 Neural Network
        5.6.2 Support Vector Machine
    5.7 Remarks
6 Conclusion
    6.1 Future Work
References

1 | Introduction

With the introduction of the Apple iPhone in 2007, smartphones have become a fixture in the online consumption of media. There are over 3.9 billion active mobile data subscriptions worldwide, with an estimated 6.9 billion by 2022 (Ericsson, 2017, p. 2). Furthermore, the monthly data transfer associated with these active subscriptions was over 2.1 GiB (about 3 Compact Discs) in 2014.
The fact that people spend an ever-increasing amount of time on their phones (Meeker, 2014) means that online content consumption is a large and growing part of people's lives. Companies such as Google, Netflix, Amazon, Hulu, Apple, Microsoft, and many more try to captivate users with applications, movies, and online content related to their respective fields and businesses.

The mobile app development market is large. On August 16th, AppShopper.com reported that there were 1.6 million applications available in the Apple App Store (AppShopper, 2017). A recent article by Forbes.com showed that for the 2016 calendar year the total money spent in the App Store was $30 billion, with developers receiving over $20 billion (Forbes, 2017). All in all, these numbers show that there is a lot of revenue to be made in the online content business, with the App Store being a prime example of a medium serving online content.

Online content, however, is very diverse. The content ranges from images on Flickr to Microsoft Excel in the Android App Store. There is a lot of variety, and the attention span of people is intrinsically biased (or: short) (Szabo and Huberman, 2010a, p. 88). Therefore, the added value of each item has to be clearly communicated to the consumer. When doing so, one must consider the different feature sets pertaining to an item. Firstly, the graphical domain: thumbnails, videos, layouts, and presentation. Secondly, the textual domain: descriptions, titles, reviews. Lastly, more meta attributes may be considered: awards, mentions in other online content, and for movies: actors. However, this diversity means nothing without a common denominator to pin the added value per consumer on. A clear example is the rating of an item.
These ratings allow the consumer to show sentiment, and allow the content provider to have a proxy for the statistic they actually need: popularity.

Popularity is a vague construct, so we need to quantify it. One way to do so is by the number of views (henceforth simply: views). The views capture a good part of popularity, but there is a fatal flaw in using this statistic as a proxy for popularity: it does not show the sentiment for a particular item. An item may have a large number of views because of marketing but still fall short of consumer expectations of the particular content. As stated before, many content providers allow items to be rated. An example is the rating of an application in the Apple App Store: a 1-5 rating to show sentiment.

Companies such as the aforementioned giants need to anticipate the effect of their next move. The biggest problem for most companies is: how will my future content evolve? Will HBO produce another season of their latest TV show, or will Netflix produce a new series in its entirety? An approximation of online success can give these companies more security.

This paper answers the following question by developing a tool set/algorithm: Is it possible to predict the average rating of App Store content? With the following sub-questions:

1. How do Support Vector Machines (SVM) and Neural Networks (NN) perform?
2. How does performance depend on the exploitation of different feature sets?

The tools used within this paper are based in the realm of Machine Learning: NN, SVM, Latent Dirichlet Allocation (LDA), and Ensemble Learning. NN and SVM were chosen because they allow a classification problem to be solved.

The expectation of the results of this paper is that
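The late fusion named in the abstract, applied to the feature sets of the second sub-question, could be sketched roughly as follows. This is a minimal illustration and not the thesis's implementation: the excerpt does not state the fusion rule, so the weighted average used here is an assumption, as are the three rating bins, the feature-set names, the example scores, and the helper names `late_fusion` and `predict`. Each feature set is assumed to have its own already-trained classifier (an SVM or NN, say) that outputs one score per rating bin.

```python
from typing import Dict, List, Optional, Sequence


def late_fusion(
    scores: Dict[str, Sequence[float]],
    weights: Optional[Dict[str, float]] = None,
) -> List[float]:
    """Fuse per-feature-set class scores by a weighted average.

    Assumption: a weighted average is used as the fusion rule; the
    excerpt does not say which rule the thesis applies.
    """
    if not scores:
        raise ValueError("need scores from at least one feature set")
    names = list(scores)
    n_classes = len(scores[names[0]])
    if any(len(scores[name]) != n_classes for name in names):
        raise ValueError("every feature set must score the same classes")
    if weights is None:
        # Default: every feature set counts equally.
        weights = {name: 1.0 for name in names}
    total = sum(weights[name] for name in names)
    return [
        sum(weights[name] * scores[name][c] for name in names) / total
        for c in range(n_classes)
    ]


def predict(fused: Sequence[float]) -> int:
    """Return the index of the class with the highest fused score."""
    return max(range(len(fused)), key=lambda c: fused[c])


# Hypothetical scores for one app over three rating bins (low, mid, high),
# one score vector per feature set.
example_scores = {
    "title": [0.2, 0.5, 0.3],
    "description": [0.1, 0.3, 0.6],
    "image": [0.3, 0.3, 0.4],
}
fused = late_fusion(example_scores)
print(predict(fused))  # prints 2 (the "high" bin)
```

In this made-up example the title scores alone would favour the middle bin, while the fused scores favour the high bin, which is the kind of effect the second sub-question asks about. Fusing at the decision level also keeps each feature set's model independent, so a feature set from another content source could be added without retraining the others.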