DEGREE PROJECT IN COMPUTER ENGINEERING, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2018

Using Random Forest model to predict image engagement rate

FELIX EDER

MARKO LAZIC

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Master in Computer Science
Date: June 4, 2018
Supervisor: Jens Lagergren
Examiner: Pawel Herman
Swedish title: Användning av Random Forest model för att förutspå bildengagemangsfrekvens
School of Electrical Engineering and Computer Science


Abstract

The purpose of this research is to investigate whether the Google Cloud Vision API combined with the Random Forest machine learning algorithm is advanced enough to build software that can evaluate how much an Instagram photo contributes to the image of a brand. The data set contains images scraped from the public Instagram feed filtered by #Nike, together with the metadata of each post. Each image was processed by the Google Cloud Vision API in order to obtain a set of descriptive labels for the content of the image. The data set was then sent to the Random Forest algorithm in order to train the predictor. The results of the research show that the predictor guesses the correct score in only about 4% of cases. The results are not very accurate, mostly because of the limiting factors of the Google Cloud Vision API. The conclusion drawn is that it is not possible to create software that can accurately predict the engagement rate of an image with the technology that is publicly available today.

Sammanfattning

The purpose of this research is to investigate whether the Google Cloud Vision API combined with Random Forest machine learning algorithms is sufficiently advanced to create software that can reliably evaluate how much an Instagram post contributes to the image of a brand. The data set contains images fetched from Instagram's public feed filtered by #Nike, together with the metadata of each post. Every image was processed by the Google Cloud Vision API in order to obtain a set of descriptive labels for the content of the image. The data set was then sent to the Random Forest algorithm to train its model. The results of the study are not very accurate, which is mainly due to the limiting factors of the Google Cloud Vision API. The conclusion drawn is that it is not possible to reliably predict the quality of an image with the technology that is publicly available today.

Contents

1 Introduction
  1.1 Problem statement
  1.2 Scope
  1.3 Thesis overview

2 Background
  2.1 Terminology
  2.2 Machine Learning
    2.2.1 Decision tree
    2.2.2 Bootstrap aggregating
    2.2.3 Random Forests
  2.3 Image API and classification
    2.3.1 Cloud Vision API
    2.3.2 Clarifai Predict
    2.3.3 IBM Watson Visual Recognition
    2.3.4 Amazon Rekognition
  2.4 Related work

3 Methods
  3.1 Points formula
  3.2 Image Source and Scraping
  3.3 Image Classification API
  3.4 Machine Learning Run
    3.4.1 Comparing image scores
    3.4.2 Noise reduction
    3.4.3 Algorithm parameters

4 Results
  4.1 Regression models
    4.1.1 R1
    4.1.2 R2
    4.1.3 R3
    4.1.4 R4
    4.1.5 R5

5 Discussion
  5.1 Result analysis
  5.2 Limitations
  5.3 Future research

6 Conclusion

Bibliography

A Source Code

Chapter 1

Introduction

Our methods of communication throughout history have been ever changing, from developing our first languages to writing letters and telegrams. But the signature method of communication in the twenty-first century is undoubtedly social media. It allows us to express our opinions and beliefs as well as keep in touch with our loved ones [15]. Social media has also had a significant impact on companies and brands, which now have a tool for open and direct communication with their customers, in both directions. Websites such as Twitter, Facebook and LinkedIn offer a sense of community and connection not only between companies and people, but also between customers and fans themselves. If Facebook were its own country, it would be the world's third most populous one (after China and India), which means that good communication between people and brands is all but mandatory in today's ever-changing world. But this also means that companies have to be more careful with their online marketing and overall behavior, as customer backlash can quickly damage a brand's image [15].

For all these reasons social media has become an important aspect for all types of brands, as it is viewed as a great channel for communication and customer satisfaction. This area is however still quite young, and there are not many tools for brands to directly reward their influencers for the marketing they do for them. A lot of people upload images connected to certain brands out of loyalty or love for a product, but there are few tools for brands to find the specific people who contribute the most to their online appearance and community. For example, the hash tag #Nike has reached over 33 million people on social media right now (according to the Keyhole hash-tag tracker for 2018-03-12, http://keyhole.co/), but there is no way for Nike to actually go through all these posts to see which images relate to the brand appearance they want on social media and which images detract from it. It is impossible to manually go through all this data, and since social media user bases only continue to grow, this problem will only get bigger for brands that try to establish their online image.

The purpose of this thesis is to see if it is possible to build a software program that uses Machine Learning to decide which photos will be an advantage for a brand to associate with their online image and which photos will detract from it. Such software would save a lot of time and money and could be an invaluable tool for brands trying to build their business in the modern world of social media.

1.1 Problem statement

This thesis aims to investigate whether it is possible to combine existing Image Classification APIs with Machine Learning in order to determine if a photo will contribute to the desired online image of a brand. This would be an invaluable tool for brands trying to build their image. The question that this thesis will address is:

• Are random forest algorithms combined with the Google Cloud Vision API sufficiently advanced in order to determine the engagement rate of images based on a list of requirements?

1.2 Scope

The Random Forest Machine Learning algorithm will be used. The training data for the algorithm will be 80 000 different images posted under a specific hash tag on Instagram, and 25 000 images will be used as test data for the trained algorithm. One Image Classification API will be used in order to classify all images and give them a set of labels for the Machine Learning algorithm.

1.3 Thesis overview

Chapter 2 will introduce the Random Forest Machine Learning algorithm as well as related work. It will also describe the most popular commonly accessible Image Classification APIs and compare their results by sending in the same test image. Chapter 3 motivates the choice of Machine Learning algorithm as well as which Image Classification API to use, and describes how the test data was sampled. It will also explain how the research was conducted. Chapter 4 will present the results from the research, and in chapter 5 the results will be analyzed. In chapter 6 conclusions will be drawn based on the previous discussions.

Chapter 2

Background

2.1 Terminology

The following are a few new concepts used throughout the report:

Ghost followers, also referred to as ghosts, ghost accounts or lurkers, are users on social media platforms who remain inactive or do not engage in activity. They are usually created by bots in order to boost follower counts, but they can also be created by people [4].

2.2 Machine Learning

Machine learning is a form of AI that enables a system to learn from data rather than through explicit programming [14]. A more formal definition of machine learning algorithms is provided by Tom M. Mitchell in his book from 1997: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E" [13]. Machine learning algorithms use data to learn to make predictions about unseen data in the future. Machine learning algorithms can roughly be categorized into two types depending on the learning technique that is used:

• Supervised learning: the data presented to the algorithm is labeled with the desired output.

• Unsupervised learning: no labels are available with the learning data.


We will focus on analyzing results from a Random Decision Forest supervised learning algorithm.

2.2.1 Decision tree

The basic idea of decision trees is to test attributes sequentially, asking a question about the target at each step. Decision trees can be used for both regression and classification problems, but for our purposes we will use only regression. There are two main steps in building a decision tree:

1. Divide the set of possible values $X_1, X_2, \ldots, X_p$ into $J$ distinct and non-overlapping regions $R_1, R_2, \ldots, R_J$.

2. For every observation in region $R_j$, make the same prediction, namely the mean of the response values for the training observations in $R_j$.

The goal is to find regions $R_1, R_2, \ldots, R_J$ that minimize the residual sum of squares (RSS), which is given by the formula:

$$\sum_{j=1}^{J} \sum_{i \in R_j} \left( y_i - \hat{y}_{R_j} \right)^2$$

where $\hat{y}_{R_j}$ is the mean response for the training observations within the $j$th region. Because it is infeasible to consider each possible partition of the feature space into $J$ regions, a top-down, greedy approach is used to split the tree. It starts at the top of the tree and then successively splits the predictor space. It takes the best split at each particular step, rather than looking ahead and picking the split that will lead to the best tree in the future [17].
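To make the greedy splitting step concrete, here is a minimal sketch (ours, not from the thesis) in Python with NumPy that finds the single split point on one feature minimizing the RSS of the two resulting regions:

```python
import numpy as np

def best_split(x, y):
    """Find the threshold on feature x that minimizes the RSS
    of the two regions {x <= t} and {x > t} (one greedy step)."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_t, best_rss = None, np.inf
    # Candidate thresholds: midpoints between consecutive distinct values
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue
        t = (x_sorted[i] + x_sorted[i - 1]) / 2
        left, right = y_sorted[:i], y_sorted[i:]
        # RSS of a region is the sum of squared deviations from its mean
        rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if rss < best_rss:
            best_t, best_rss = t, rss
    return best_t, best_rss

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([5.0, 6.0, 5.5, 20.0, 21.0, 19.0])
print(best_split(x, y))  # splits at x = 6.5, separating the two clusters
```

A full tree simply applies this step recursively to each resulting region until a stopping criterion (such as a minimum leaf size) is met.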

2.2.2 Bootstrap aggregating

Decision trees usually suffer from high variance. This means that if we split the training data into two parts at random and fit a decision tree to each half, the results we get could be quite different. Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method, and it is particularly useful in the context of decision trees. Given a set of $n$ independent observations $Z_1, \ldots, Z_n$, each with variance $\sigma^2$, the variance of the mean $\bar{Z}$ of the observations is given by $\sigma^2/n$, which means that averaging a set of observations reduces variance. So we generate $B$ different bootstrapped training data sets from our original data set, make $B$ predictions, and finally average them to get our final prediction [17]:

$$\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x)$$
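As an illustration of the averaging step (our sketch, not the thesis's code; it assumes NumPy arrays and scikit-learn, which the thesis uses later):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_predict(X_train, y_train, X_test, B=100, seed=0):
    """Average the predictions of B trees, each fit on a bootstrap sample."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    preds = np.zeros((B, len(X_test)))
    for b in range(B):
        idx = rng.integers(0, n, size=n)      # sample n rows with replacement
        tree = DecisionTreeRegressor().fit(X_train[idx], y_train[idx])
        preds[b] = tree.predict(X_test)
    return preds.mean(axis=0)                 # f_bag(x) = (1/B) * sum_b f*b(x)
```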

2.2.3 Random Forests

Random forests bring one improvement over bagged trees by decorrelating the trees. Many different decision trees are built on bootstrapped training samples in the same way as in bagging, but this time, each time a split in a tree is considered, a random sample of $m$ predictors is chosen as split candidates from the full set of $p$ predictors. Typically we choose $m \approx \sqrt{p}$. This means that while building the random forest, at each split in the tree, the algorithm is not allowed to consider the majority of the available predictors. This method overcomes the problem that can occur if we have one strong predictor, which would make the trees highly correlated [17].
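In scikit-learn, which is used later in this thesis, this restriction corresponds to the max_features parameter. A minimal sketch with synthetic data of our own:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))            # 200 samples, p = 16 features
y = X[:, 0] * 3 + rng.normal(size=200)    # one strong predictor plus noise

# max_features="sqrt" means each split considers only ~sqrt(p) = 4 of the
# 16 predictors, so not every tree can lean on the single strong feature.
forest = RandomForestRegressor(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X, y)
print(forest.predict(X[:3]))
```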

2.3 Image API and classification

Our software solution in this project will utilize an Image API in order to analyze many different images quickly. Therefore we will look into the most popular openly available Image APIs today in order to assess which would be most suitable for our project and report. As a small test, the same selected image will be sent to each of the APIs and their returned labels will be compared.

2.3.1 Cloud Vision API

Google's Cloud Vision API was released to the public in February of 2016, and its main draw is that it can recognize individual objects in an image and return a set of labels. For example, if you submit an image of a hike through the woods, you might get a result such as:

• Walking: 90%

• Woods: 85%

• Spring: 87%

It also supports advanced face recognition, and Google claims that it has trained its algorithm to detect over a thousand different objects [9].
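For reference, a minimal label-detection call. This sketch is ours and assumes the google-cloud-vision Python client library (v2 or later) with service-account credentials configured; the file name is hypothetical:

```python
from google.cloud import vision

client = vision.ImageAnnotatorClient()

with open("test_image.jpg", "rb") as f:   # hypothetical local file
    image = vision.Image(content=f.read())

# Ask the API for descriptive labels and their confidence scores
response = client.label_detection(image=image)
for label in response.label_annotations:
    print(f"{label.description}: {label.score:.0%}")
```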

The selected test image was sent to the API and a set of labels was retrieved.

As can be seen from the labels, everything the API labeled is in the picture, but some labels are still missing, such as "people", "man", etc., that could have been included as well. None of the labels has a really high confidence percentage, which does not make the API seem very precise in its labeling.

2.3.2 Clarifai Predict

Clarifai has an open Image API called Clarifai Predict which works similarly to Google's Vision API. You input an image and get a list of labels, each with a probability score that it is present in the image. The API also has a feature called models, where a model is a category of images. Say you work with this API and are only interested in the colors of the images you input. Then you could choose a color model, which only returns labels like contrast, dominant colors and other color-related labels. This is useful when you do not want all the available labels, but only a select few. You can also define your own custom model with the labels you are interested in, in order to only get the data you need. Clarifai also supports video classification, which classifies one frame per second and returns a set of labels for each classification [3].

The selected test image was sent to the Clarifai API and a set of labels was returned.

The Clarifai API returns many labels that are accurate and present in the picture (e.g. "man", "outdoors", etc.), both more numerous and more accurate than those of the Cloud Vision API. It does however make a few assumptions about the image that might not be accurate ("family", "couple"), as well as some that are not present in the image at all ("woman", for example).

2.3.3 IBM Watson Visual Recognition

IBM has a number of AI services called Watson, one of which classifies images: Watson Visual Recognition. This API works in much the same way as the two previous APIs; you enter an image and get back a list of labels with corresponding percentages. The API also includes face detection, which according to IBM should be able to distinguish age and gender. As with the Clarifai Predict API, Watson also supports models, which can either be chosen from a list of pre-made models or custom-made [5].

Our previously selected test image was sent to this API as well and a set of labels was returned.

The Watson Visual Recognition API seems to focus a lot on smaller details with lower accuracy, such as "rice" and "trouser". The only high-accuracy label is for a color, and simple labels such as "person" have a fairly low accuracy value.

2.3.4 Amazon Rekognition

Amazon has its own image recognition API, named simply Amazon Rekognition. It offers mainly the same features as the previous APIs: input images and get a list of labels and percentages back. Like Clarifai, it supports both image and video recognition, and Amazon boasts more advanced facial analysis than its competitors, with support for things like recognition across multiple images, facial comparison (how similar do two faces look?) as well as beard detection [6].

Amazon Rekognition returned a set of labels for our test image.

The API manages to detect some simple labels like "human" and "person" with very high accuracy, as well as minor things like "Sunglasses" with fairly good accuracy. Some labels, such as "Nature" and "Outdoors", have a fairly low accuracy even though they make up a big portion of the image.

2.4 Related work

There is not a lot of work and research done in the field of image classification regarding social media. A study from 2012 at National Taiwan University experimented with image classification using data sets from social media. They wanted to find out if using image resources from social media could help increase the accuracy of the classification itself. Their data set of images came from Flickr, where they manually labeled 873 images. They then randomly selected 10 000 images to serve as backgrounds for the labeled images and merged the labeled images with the backgrounds to create a data set of 13 000 images. They found that by using this crowd-sourced image data set, the accuracy of the classified images improved by 27% compared to using state-of-the-art image training data. They concluded that this increased accuracy was due to the more visually diverse training images that were possible with a data set from Flickr [18].

Another study on image classification was conducted at Stanford University, which used crowd-sourced images as a data set in order to assess whether the metadata collected from the selected social media platform (in this paper Flickr was used) can be harnessed to label and classify images. It also aimed to answer the question of what types of metadata are useful for labeling images. They used three different types of metadata to predict images: image labels, tags and groups. Labels on Flickr are decided entirely by humans and can be set outside of Flickr. Tags are a bit less structured and can be set by both humans and computers. Tags do not need to be related to the content of the image itself, but could be information like the brand of the camera the image was taken with. The group of the image is an optional parameter set by the uploader of the image. In order to evaluate the results of their research, they compared images both visually and with their metadata. In regards to the labels, they noticed a 7% improvement in the Mean Average Precision (MAP) of classifying images' labels compared to the best visual methods. Similar results were encountered when the same tests were performed but the images were classified based on their tags and groups [12].

A similar study done at Kitware Inc. in New York wanted to see if they could use Convolutional Neural Networks (CNN) with the metadata of images found on social media (Flickr specifically was used in this paper) in order to boost the quality of image labeling. Since most programs that label an image go through it pixel by pixel in order to determine its contents, this study wanted to see if it was possible to label the image based on metadata such as comments, groups, tags and other images posted by the same user. They conducted their experiment with a subset of images under Flickr's Creative Commons licenses, which contained 6000 images for training their algorithm and 3182 testing images. In the end they found that their own CNN algorithm trained with metadata from Flickr outperformed the current state-of-the-art methods [11].

Chapter 3

Methods

105 000 images with the hash tag #Nike were scraped from the public Instagram feed, and the images with their accompanying metadata were stored locally. The images were sent to the Google Cloud Vision API, which was the API chosen, and the returned labels were stored locally, with an average of slightly more than 8 labels per image. The metadata and set of labels for 80 000 of these images were then sent to our machine learning algorithm.

3.1 Points formula

The number of points is determined by the post's metadata and is calculated by the following formula:

$$l \cdot w_1(f) \cdot w_2(l, c) \cdot w_3(l, f) \qquad (3.1)$$

where $l$ is the number of likes, $f$ the number of followers, $c$ the number of comments, and:

$$w_1(f) = \begin{cases} 2, & \text{if } f > 100000 \\ 1.8, & \text{if } 10000 \le f < 100000 \\ 1.6, & \text{if } 5000 \le f < 10000 \\ 1.4, & \text{if } 1000 \le f < 5000 \\ 1.4, & \text{otherwise} \end{cases}$$

$$w_2(l, c) = \begin{cases} 0.6, & \text{if } c = 0 \\ 1.2, & \text{if } l/c > 0.2 \\ 0.8, & \text{otherwise} \end{cases}$$

$$w_3(l, f) = \begin{cases} 0, & \text{if } f = 0 \\ 0.8, & \text{if } 0 \le l/f < 0.03 \\ 1, & \text{if } 0.03 \le l/f < 0.05 \\ 1.2, & \text{if } 0.05 \le l/f < 0.1 \\ 1.4, & \text{if } 0.1 \le l/f < 0.2 \\ 1.6, & \text{otherwise} \end{cases}$$

This formula was derived by testing several different ways to calculate the points, and the main idea behind it is to measure how big an impact the photo made in the user's community. This is done by taking into account the number of followers the user has, the likes/followers ratio and the likes/comments ratio. The number of likes is the most important factor in deciding how many points a photo should receive, but it is not decisive, because one like is weighted differently depending on these three factors.

After all images had been input into the machine learning algorithm, 20 000 new images that went through the same process through the Google Vision API were sent to the Random Forest algorithm in order to predict how many points each particular photo would get on Instagram.
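Formula 3.1 transcribes directly into Python. This is our sketch, not the thesis's implementation (which lives in the appendix repository); in particular, we read the garbled $w_3$ cases as thresholds on the likes/followers ratio $l/f$, matching its signature and the prose above:

```python
def w1(f):
    # Follower-count weight; the source lists 1.4 both for the
    # 1000-5000 band and for everything below it
    if f > 100_000: return 2.0
    if f >= 10_000: return 1.8
    if f >= 5_000:  return 1.6
    return 1.4

def w2(l, c):
    # Likes/comments weight
    if c == 0:      return 0.6
    if l / c > 0.2: return 1.2
    return 0.8

def w3(l, f):
    # Likes/followers weight (reconstructed as l/f thresholds)
    if f == 0:   return 0.0
    r = l / f
    if r < 0.03: return 0.8
    if r < 0.05: return 1.0
    if r < 0.1:  return 1.2
    if r < 0.2:  return 1.4
    return 1.6

def points(likes, followers, comments):
    """Engagement score per formula 3.1."""
    return likes * w1(followers) * w2(likes, comments) * w3(likes, followers)

print(points(likes=500, followers=8000, comments=40))  # 500*1.6*1.2*1.2 = 1152.0
```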

3.2 Image Source and Scraping

The images used by the machine learning algorithm as input data for both learning and testing were scraped from the public Instagram feed, sorted by #Nike. Instagram was chosen since it is currently the biggest photo-sharing social platform [7], where posts mainly consist of personal images. The hash tag was chosen since it represents one of the largest brands on the platform, with over 60 million associated posts [8]. As of 10/1/2017, the Instagram API only allows basic permissions regarding fetching data [10]. This means that you can only get information about your own profile and media with an API key.

In order to get enough data to perform the required research, Instagram posts were scraped for images and metadata. Public feeds on Instagram are open and can easily be queried for specific hash tags, which made web scraping a possible solution. The Instagram hash tag search feed provides photos in chronological order with the latest posts first, so a minimum time interval of 7 days was imposed between the first and second page in order to give users enough time to like and comment on a photo, yielding data that is as relevant as possible.
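To illustrate the 7-day cutoff, a sketch of the scraping loop. fetch_hashtag_page is a hypothetical helper standing in for the actual scraper, and the post fields are our assumptions; the real code is in the repository linked in appendix A:

```python
import time

SEVEN_DAYS = 7 * 24 * 60 * 60  # minimum age of a post, in seconds

def scrape(fetch_hashtag_page, tag="nike", pages=100):
    """Collect posts old enough for likes and comments to have accumulated.

    fetch_hashtag_page(tag, cursor) is assumed to return (posts, cursor),
    where each post dict has 'taken_at' (unix time), an image URL and metadata.
    """
    cutoff = time.time() - SEVEN_DAYS
    kept, cursor = [], None
    for _ in range(pages):
        posts, cursor = fetch_hashtag_page(tag, cursor)
        # The feed is newest-first: keep only posts at least 7 days old
        kept.extend(p for p in posts if p["taken_at"] <= cutoff)
        if cursor is None:
            break
    return kept
```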

3.3 Image Classification API

All Image Classification APIs discussed previously in the background section are for commercial use and therefore need to be paid for. Google, Clarifai and Amazon offer free credits on sign-up, but only Google's and Clarifai's offers were sufficient for the amount of data used. Google's Vision API was ultimately chosen because it provides more code examples and much clearer documentation than Clarifai. As for the APIs' label detection, all of them except IBM Watson gave highly accurate labels, so choosing Google over Clarifai and Amazon is not a big problem, since those APIs are all roughly equivalent.

3.4 Machine Learning Run

The Random Forest machine learning algorithm was mainly chosen because of its high execution speed [2]. Because of the amount of data used, relative to the limited computational power that was available, speed was crucial in the choice of machine learning algorithm. The library used is scikit-learn, a free machine learning library for the Python programming language [16].
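A sketch of how such a pipeline can be wired together with scikit-learn (our illustration, not the thesis's exact code; the label sets and scores below are made up): the label sets returned by the Vision API are turned into binary feature vectors, which the forest is trained on.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical training data: one label set (from the Vision API) and one
# points score (from formula 3.1) per image.
label_sets = [["shoe", "footwear", "grass"], ["t-shirt", "product"], ["shoe", "product"]]
scores = [1152.0, 96.8, 23.0]

# One binary column per distinct label seen in training
binarizer = MultiLabelBinarizer()
X = binarizer.fit_transform(label_sets)

model = RandomForestRegressor(n_estimators=10, max_features="sqrt", random_state=0)
model.fit(X, scores)

# Predict the score of an unseen image from its labels; labels never seen
# during training are ignored by transform()
X_new = binarizer.transform([["shoe", "grass", "sneakers"]])
print(model.predict(X_new))
```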

3.4.1 Comparing image scores

After the algorithm has been trained with the provided data, the labels of 20 000 test images are sent to the algorithm, which predicts what score each image would get. After the algorithm has run, the actual score, calculated from the points formula stated above, is compared with the prediction of the algorithm. Since an image can get several thousands of points, it would be near impossible to predict the exact number of points for an image. Therefore a range of points is used: if the predicted value is within ±10% of the actual value, the prediction is deemed correct. All predictions outside this range are considered incorrect, but multiple error ranges are used in order to determine the magnitude of the error. These error ranges are ±20%, ±30%, ±40% and ±50%. These ranges are then used to compare the prediction quality of the different runs of the algorithm.
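Reading G1-G5 as disjoint bands (within 10%, within 20% but outside 10%, and so on, which matches the counts in table 4.1), the grouping can be expressed compactly; a minimal sketch of ours:

```python
def error_group(predicted, actual):
    """Return 1..5 for relative errors within 10%..50% of the actual
    score (G1-G5), or 6 (G6) for predictions that are further off."""
    if actual == 0:
        return 6
    relative_error = abs(predicted - actual) / abs(actual)
    for group, bound in enumerate((0.1, 0.2, 0.3, 0.4, 0.5), start=1):
        if relative_error <= bound:
            return group
    return 6

print(error_group(96.8, 100.0))     # 1: within 10% of the actual score
print(error_group(392852.9, 96.8))  # 6: off by orders of magnitude
```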

3.4.2 Noise reduction

In one of the machine learning runs, non-representative data was removed prior to the run. All images with zero likes or zero followers were filtered out of the data set, as well as images whose likes amounted to less than 1% of followers. Some of the posts collected from Instagram were just stock images coming from fake accounts with no followers at all, or from accounts with ghost followers. These images would get zero points by formula 3.1 despite the fact that they are high-quality stock images. The set of labels belonging to such an image would then be characterized as low quality, which could affect the performance of the machine learning algorithm.
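A sketch of that filter (ours; we read the 1% threshold as likes relative to followers):

```python
def is_representative(post):
    """Drop likely fake or ghost-follower posts before training."""
    likes, followers = post["likes"], post["followers"]
    if likes == 0 or followers == 0:
        return False
    return likes / followers >= 0.01   # at least 1% of followers liked it

posts = [{"likes": 500, "followers": 8000}, {"likes": 3, "followers": 90000}]
filtered = [p for p in posts if is_representative(p)]
print(len(filtered))  # 1: the second post falls under the 1% threshold
```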

3.4.3 Algorithm parameters

A total of five Random Forest regressors were made by testing out different values for these three parameters [1], as shown in the sketch after this list:

• max_features: the number of features to consider when looking for the best split.

• n_estimators: the number of trees in the forest.

• min_samples_leaf: the minimum number of samples required to be at a leaf node.
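In scikit-learn these map directly onto the RandomForestRegressor constructor; for example, the configurations in the table below translate roughly as follows (our sketch):

```python
from sklearn.ensemble import RandomForestRegressor

# R1: max_features=sqrt(n_features), 10 trees, leaves may hold a single sample
r1 = RandomForestRegressor(max_features="sqrt", n_estimators=10, min_samples_leaf=1)

# R3 swaps the split-candidate rule, R4 the leaf size, R5 the tree count:
r3 = RandomForestRegressor(max_features="log2", n_estimators=10, min_samples_leaf=1)
r4 = RandomForestRegressor(max_features="sqrt", n_estimators=10, min_samples_leaf=50)
r5 = RandomForestRegressor(max_features="sqrt", n_estimators=20, min_samples_leaf=1)
```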

All machine learning runs are presented in the following table:

Regressor  max_features      n_estimators  min_samples_leaf  filtered
R1         sqrt(n_features)  10            1                 no
R2         sqrt(n_features)  10            1                 yes
R3         log2(n_features)  10            1                 yes
R4         sqrt(n_features)  10            50                yes
R5         sqrt(n_features)  20            1                 yes

Chapter 4

Results

In this section the results of the machine learning runs will be presented, one regressor at a time.

4.1 Regression models

This table displays the resulting information for the five different regressors created in the experiment. The unit in the table is number of images. The columns G1-G6 represent relative error ranges: G1 holds predictions within 10% of the actual score, G2 those within 20% but outside 10%, and so forth. G6 holds predictions off by more than 50%, which are deemed incorrect. Each value is the number of images predicted to have a score within that specific range relative to the actual score of the image.

Regressor  G1    G2    G3    G4    G5    G6
R1         1001  1083  1065  1024  1061  17468
R2         1005  1036  1064  1063  983   17551
R3         991   1040  1026  998   1017  17630
R4         634   617   602   614   636   19599
R5         997   977   1073  1051  977   17627

Table 4.1: Results of regressions


Figure 4.1: This graph shows the percentages of the groups of R1

4.1.1 R1

Table 4.1 shows that the regressor R1 only predicted the scores of about 1000 images correctly (within the ten percent range), and then fairly evenly predicted about 1000 images for every following percentage boundary. The majority of the predictions were however incorrect (over the 50 percent range from the actual value). As can be seen in figure 4.1, the distribution of prediction accuracy is very even for the first 50%, but G6 sticks out with almost 80%, about 20 times as high as the other groups.

The highest rated image predicted by R1 is shown in figure 4.2. It has a predicted score of 392852.864 points, but the actual calculated score of the image is 96.768. The set of labels retrieved for this image was: orange, food, product, t-shirt, cuisine, shoe. The lowest rated image predicted by R1 is shown in figure 4.3. It has a predicted score of 0.86073436, but the actual calculated score of the image is 1.344, and the set of labels is: comics, cartoon, facial expression, nose, text, fictional character, fiction, male, emotion, comic book.

Figure 4.2: This is the highest rated image predicted by R1

Figure 4.3: This is the lowest rated image predicted by R1

4.1.2 R2

As can be seen in Table 4.1, R2 has very similar results to R1, with about 1000 images in each of groups 1-5 and G6 holding the clear majority of the images. Figure 4.4 shows that R2's accuracy distribution looks very similar to that of R1, with G6 being almost 20 times as high as the other groups. The image predicted by R2 to have the highest score is the same as the highest predicted image of R1, figure 4.2, but with a predicted score of 422180.9216 points. The lowest rated image predicted by R2 is shown in figure 4.5, with a predicted total of 1.152 points. Its actual score is 12.096, with the set of labels: footwear, pink, shoe, purple, magenta, outdoor shoe, athletic shoe, sneakers, grass, product.

4.1.3 R3

R3 predicted about 1000 images in each of groups 1-5, with G1 and G4 being a bit lower than for regressors 1 and 2. The majority of the predictions were however outside the 50% boundary.

Figure 4.4: This graph shows the percentages of the groups of R2

Figure 4.5: This is the image with the worst score predicted by R2

Figure 4.6: This graph shows the percentages of the groups of R3

Figure 4.7: This image has the lowest prediction by R3

As can be seen in figure 4.6, groups 1-5 have very similar results, all being around 4-5% of the images, with G6 holding almost 78% of all images. The image predicted by R3 to have the highest score is the same image as the highest rated image for R1 and R2, but with a predicted score of 765349.5936 points. The lowest rated image predicted by R3 is shown in figure 4.7, scoring 1.344 points. Its actual score is 7.68, with the set of labels: fashion accessory, wallet, product, selling, bag, product, product design, coin purse, brand, font.

Figure 4.8: This graph shows the percentages of the groups of R4

4.1.4 R4

R4 overall has a lower accuracy than any of the other regressors, with an average of about 600 images in each of groups 1-5 and G6 holding almost 20 000 images. The lower average for groups 1-5 can be seen clearly in figure 4.8, as well as the overwhelming majority of incorrect predictions, represented by group 6.

The highest rated image predicted by R4 is shown in figure 4.9, with a predicted score of 12961.21818094 points. Its actual calculated value is 23.04, with the set of labels: close up, finger, hand, textile, wood, peach, material, font. The lowest rated image predicted by R4 is shown in figure 4.10, with a predicted score of 5.77996294 points. Its actual score is 6.72, with the set of labels: handbag, bag, fashion accessory, product, shoulder bag, product, leather, strap, selling, brand.

Figure 4.9: This image has the highest prediction by R4

Figure 4.10: This image has the lowest prediction by R4

Figure 4.11: This graph shows the percentages of the groups of R5

4.1.5 R5

R5's results are very similar to those of regressors 1-3, with about 1000 images in each of groups 1-5. G6 has a clear majority with almost 18 000 images. Figure 4.11 shows that groups 1-5 have very similar results, with G3 and G4 standing out a bit over the others, but all between 4-5%. G6 holds most of the images, with almost 80% of all images. The image predicted by R5 to have the highest score is the same as the highest scoring image for R1, R2 and R3, but with a predicted score of 441765.5616 points. The lowest scored image by R5 is the same as the lowest scored image by R1, but with a predicted score of 1.55674667 points.

Chapter 5

Discussion

5.1 Result analysis

The results for regressor R1 show that it has a very similar distribution of images over the first groups (G1-G5), with around 1000 images per percentage range. This contrasts greatly with G6, which has a clear majority with almost 17 500 images. Looking at figure 4.1, groups 1-5 are all between 4-5%, while G6 has almost 80% of all the images. A clear majority of the predictions are therefore deemed incorrect, which is inadequate. The images predicted by the algorithm to be within 10% of the actual score make up only a little more than 4% of the total number of images. Less than 25% of the images are within 50% accuracy, which is remarkably low.

Tuning the parameters of the random forest algorithm for regression models 2-5 yielded similar results. Regressors 2, 3 and 5 have extremely similar results to R1, with about 1000 images per percentage group and about the same number of incorrect predictions. R4 sticks out with an overall lower accuracy, with only about 2.79% within the 10% range and 86.33% incorrect predictions.

The difference between regressors 1 and 2 is that the first run is the only one that did not filter out the ghost accounts. This indicates that filtering out such accounts had no impact on the results of the algorithm run. The similar results between R1 and R3 indicate that changing the number of features considered at each split in the decision tree has no significant impact on the results. Increasing the number of trees in the forest by 100% did not change the result in any meaningful way.


The regressor that sticks out from the others is R4, which has considerably worse accuracy. This shows that using a more generalized model, which does not allow leaf nodes to have fewer than 50 samples, results in a more inaccurate prediction. Together, these regressors show that changing the parameters of the Random Forest algorithm does not have a big impact on the final results.

The highest scored image for R1 is figure 4.2, with a predicted score of 392852.86 points, but the actual calculated score of the image is 96.77 points. This is visually a very pleasing image which would be of benefit for Nike, but there is an issue with the difference between the predicted score and the actual score. This could be the best image of the test data, but according to the calculated points it should not be a very good image. This shows a big disconnect between our points formula and our predicted score. Looking at the labels for figure 4.2, it can be seen that they are not very specific, especially broad words like "orange", "product", etc. Some labels are also incorrect, like "food" and "cuisine".

This difference in score could also be explained by duplicated images on Instagram. For example, if a person takes an image from the official Nike account and posts it as their own, the official picture would get many likes and therefore a lot of points according to our formula. The re-posted image would probably not get as many likes, despite the fact that both have the same set of labels. If the original picture was part of the training data, it would get a high score, and all the re-posted images that are part of the test data would therefore get a high predicted score from our algorithm. This would cause the big difference between the predicted and calculated scores in the research.

The worst rated image by R1 is figure 4.3, with a predicted score of 0.86 points but a calculated value of 1.34 points. Although the prediction was within half a point of the actual value, the difference in percent is greater than 50, which puts it in group 6. This is the same group as the best prediction for R1, whose absolute point difference is almost 400 000 points. This phenomenon occurs because groups 1-6 use a difference in percentage, which can be problematic for images with low scores, since the error margin becomes very small. This makes it very hard to predict a score with reasonable accuracy for low values, but easier for greater values.

5.2 Limitations

This research has been limited by a number of different factors, the biggest being that the labels returned from the Cloud Vision API are too few and too general to differentiate all images. As can be seen in the labels returned for figure 4.2 and figure 4.3, they are very general and do not do a good job of describing a unique image; the same labels could describe a lot of different images. Since the Cloud Vision API only returns about 8 labels on average, that is simply too few to get a precise enough classification of the images, which is vital for the research. This limitation stems not from the research itself, but from the technology of today: the Google Cloud Vision API is among the best image classification technologies publicly available.

Another limiting factor is the duplicated images on Instagram. A single image can be re-posted multiple times, which gives it the same labels from the API but different metadata, resulting in different numbers of points calculated by the points formula. Similar issues apply to stock images, which are usually posted from fake accounts. Such images could get a low number of points although they could be good.

A third limitation of the research is that the metadata of an image is not a reliable source of information regarding the engagement rate of an image. That a photo has a lot of likes does not guarantee that it is a positive contribution to the image of a brand. The same applies to comments, which could be negative, and to the number of followers, which has no real impact on the engagement rate of an image. Even though the points formula derived in this research was supposed to calculate a score determining the actual engagement rate of an image, the number of likes is still the most heavily weighted factor, and the image with more likes will almost always get a higher score.

A critical limitation of this research was the size of the data set, which was limited by the resources and computational power available.

5.3 Future research

Future research in the field of image classification would need to filter out the duplicated images and stock images of Instagram. A solution to this could be that every image in the data set is searched for in Google's image search engine in order to see if the same photo can be found. If it is found but has no direct connection to the original poster, the image would not be included in the data set, as it is most likely a re-posted or stock image.

Instead of using the metadata and deriving a formula in order to determine the engagement rate of an image, a complete data set of images where every image is already scored according to brand standards could be used. In future research the data set should also be much larger, which would require more resources and computational power in order to complete the experiment in a reasonable time.

Chapter 6

Conclusion

The answer to the question stated in the problem statement, whether it is possible to determine the engagement rate of images based on a list of requirements with random forest algorithms and the Google Cloud Vision API, is no, at least with the technology available to us today. The limitations of the Image API are too great to provide data with enough distinction for the machine learning algorithms used.

Bibliography

[1] sklearn.ensemble.RandomForestRegressor - scikit-learn 0.19.1 documentation. http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html (accessed on 04/26/2018).

[2] Leo Breiman. Classification and regression trees. Routledge, 2017.

[3] Clarifai. Clarifai Predict. 2018. URL: https://www.clarifai.com/products (visited on 03/12/2018).

[4] Ghost followers - Wikipedia. https://en.wikipedia.org/wiki/Ghost_followers (accessed on 04/25/2018).

[5] IBM. Watson Visual Recognition. 2018. URL: https://www.ibm.com/watson/services/visual-recognition (visited on 03/12/2018).

[6] Amazon.com Inc. Amazon Rekognition - Video and Image - AWS. 2018. URL: https://aws.amazon.com/rekognition (visited on 03/12/2018).

[7] Top 15 Most Popular Photo Sharing Sites | July 2017. URL: http://www.ebizmba.com/articles/photo-sharing-sites (visited on 03/30/2018).

[8] Top 200 HashTags on Instagram. URL: https://top-hashtags.com/instagram/101 (visited on 03/30/2018).

[9] Google Inc. Cloud Vision API. 2017. URL: https://cloud.google.com/vision/ (visited on 03/12/2018).

[10] Instagram Developer Documentation. https://www.instagram.com/developer/authorization/ (accessed on 03/30/2018).

[11] Chengjiang Long et al. "Deep Neural Networks In Fully Connected CRF For Image Labeling With Social Network Metadata". In: arXiv preprint arXiv:1801.09108 (2018).


[12] Julian McAuley and Jure Leskovec. "Image labeling on a network: using social-network metadata for image classification". In: European conference on computer vision. Springer, 2012, pp. 828-841.

[13] Tom M. Mitchell. Machine learning. McGraw-Hill Book Company, 1997.

[14] John Paul Mueller and Luca Massaron. Machine Learning for Dummies. John Wiley & Sons, 2016.

[15] M Saravanakumar and T SuganthaLakshmi. "Social media marketing". In: Life Science Journal 9.4 (2012), pp. 4444-4451.

[16] scikit-learn: Machine Learning in Python. http://scikit-learn.org/stable/.

[17] Robert Tibshirani et al. An Introduction to Statistical Learning with Applications in R. 2013.

[18] Sheng-Yuan Wang et al. "Learning by expansion: Exploiting social media for image classification with few training examples". In: Neurocomputing 95 (2012), pp. 117-125.

Appendix A

Source Code

All source code written for this research can be found in the git repository: https://gits-15.sys.kth.se/mlazic/photoScraper
