And the Winner Is... Predicting the Outcome of Melodifestivalen by Analyzing the Sentiment Value of Tweets
DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2017

ALEXANDER KOSKI AND JENNIFER PERSSON

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION

Bachelor’s Thesis in Computer Science
Date: June 5, 2017
Supervisor: Iolanda Leite
Examiner: Örjan Ekeberg
Swedish title: Och vinnaren är... Att med sentimentalanalys på tweets förutspå resultatet av Melodifestivalen

Abstract

As more and more people post their feelings on social media, interest has grown in using sentiment analysis to collect and understand these feelings. This thesis aims to investigate the possibility of predicting the outcome of a television competition, decided partly by the viewers’ votes, using sentiment analysis on tweets. The lexicon AFINN and a Swedish translated version of it were used for the lexical sentiment analysis in this report. After pre-processing the tweets gathered from Twitter with the competition’s hashtag, the tweets were analysed and mapped to the different competitors. Each mapped tweet was scored with a sentiment value according to the lexicons. Six different predictions were derived from the sentiment values of the tweets. The predictions were compared to the real result of the competition using Kendall tau distance, where a shorter distance indicates greater similarity between the lists. The results show that it is possible to make a rough prediction of the outcome of the competition, where the best prediction was achieved by ranking the top 5 artists based on the sum of positive sentiment value for the songs.
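The Kendall tau distance mentioned in the abstract counts the pairs of items that two rankings place in opposite order. The following is a minimal sketch of that idea, not the thesis’s own implementation. The function name `kendall_tau_distance` and the artist labels are hypothetical, and the sketch assumes both lists contain exactly the same items.

```python
from itertools import combinations


def kendall_tau_distance(predicted, actual):
    """Count the pairs of items that the two rankings order differently.

    Assumption: both lists rank exactly the same items, without repeats.
    """
    if sorted(predicted) != sorted(actual):
        raise ValueError("both rankings must contain the same items")

    # Position of every item in each ranking (0 = first place).
    pos_pred = {item: i for i, item in enumerate(predicted)}
    pos_act = {item: i for i, item in enumerate(actual)}

    discordant = 0
    for a, b in combinations(predicted, 2):
        # A pair is discordant when the two rankings disagree on
        # which of the two items comes first.
        if (pos_pred[a] - pos_pred[b]) * (pos_act[a] - pos_act[b]) < 0:
            discordant += 1
    return discordant


# Hypothetical top-5 lists, labelled A to E for illustration only.
actual = ["A", "B", "C", "D", "E"]
predicted = ["B", "A", "C", "E", "D"]

distance = kendall_tau_distance(predicted, actual)
max_distance = len(actual) * (len(actual) - 1) // 2  # every pair reversed
print(distance, distance / max_distance)
```

In this example two pairs are swapped (A/B and D/E), so the distance is 2 out of a possible 10. Dividing by the maximum gives a value between 0 (identical lists) and 1 (fully reversed lists), which makes lists of different lengths easier to compare.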
Sammanfattning (Swedish abstract, in English translation)

In a world where people post their feelings and opinions on social media, interest in compiling and understanding them through sentiment analysis has grown. This thesis investigates whether it is possible to predict the result of a competition in which the winner is chosen partly by the viewers’ votes, by performing sentiment analysis on tweets. The English lexicon AFINN and a Swedish translation of it were used for the lexical sentiment analysis in this report. After the tweets carrying the competition’s hashtag had been collected from Twitter and pre-processed, they were analysed and mapped to the different entries. Each mapped tweet was assigned a sentiment value with the help of the lexicons. Six alternative result lists were produced from the sentiment values of the sorted tweets. These lists were compared with the actual outcome of the competition, with similarity measured by Kendall tau distance, where a shorter distance indicates greater similarity between the lists. The results show that a rough prediction of the competition result can be made using sentiment analysis, and that the alternative result that agreed best with reality was obtained by ranking the 5 most popular entries by the sum of their positive sentiment values.

Contents

1 Introduction
  1.1 Problem Statement
  1.2 Scope
2 Background
  2.1 Melodifestivalen
  2.2 Data-Mining using Twitter
  2.3 Sentiment Analysis
    2.3.1 Lexical Analysis
  2.4 Kendall tau distance
  2.5 Related works
3 Method
  3.1 Data-gathering
  3.2 Data pre-processing
  3.3 Producing ranking lists
    3.3.1 Scoring the tweets using lexical analysis
    3.3.2 Metrics to calculate from the existing data
    3.3.3 Kendall Tau comparison
4 Result
  4.1 Six ways of ranking
  4.2 Kendall tau distance comparison
    4.2.1 Complete list of rankings
    4.2.2 Top 5 list of ranking
5 Discussion
  5.1 Real World Comparison
  5.2 Limitations
  5.3 Further Research
6 Conclusion
Bibliography

Chapter 1 Introduction

Computers have long been used to predict the outcome of future events with the help of mathematical models. The big advantage of using computers for making predictions is their ability to handle large amounts of data quickly. A human could not analyse the same amount of data in reasonable time, while a computer is able to do it in a matter of minutes. Today computers’ predictive power is used in many areas, with models that predict traffic patterns on highway systems (Allström 2005), the rise and fall of stock prices (Aase 2011) and much more.

Over the past decade social media has grown rapidly, with more and more people connecting and sharing their opinions and views on different matters. This has opened the door to the mass collection of opinions from the public. This new data can be used to predict outcomes that relate to people’s feelings on specific topics. Political elections and brand evaluation are examples of areas where data from the masses is used to draw conclusions about a population’s future behavior (Chin, Zappone, and Zhao 2016) (Brandwatch 2017).

To analyse all of this new data, a field called sentiment analysis has emerged. Sentiment analysis is a way of analysing texts with the purpose of extracting their sentiment value. There are many different methods of performing sentiment analysis on texts; one of them is called lexical analysis. Lexical analysis classifies a given text as either positive, neutral or negative by scoring the sentiment of its individual words with the help of a pre-defined lexicon (Thelwall et al. 2010).
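The lexical analysis described above can be sketched in a few lines: look up each word of a text in a lexicon, sum the word scores, and read the polarity from the sign of the sum. This is an illustrative sketch only. The tiny lexicon, its scores, and the function names `score_tweet` and `polarity` are assumptions made for the example and are not taken from AFINN or from the thesis’s code.

```python
import re

# Hypothetical toy lexicon: word -> sentiment score. A real lexicon such as
# AFINN holds far more entries; these values are illustrative only.
TOY_LEXICON = {
    "love": 3,
    "great": 3,
    "bra": 3,      # Swedish for "good", standing in for a translated entry
    "boring": -2,
    "awful": -3,
}


def score_tweet(text, lexicon=TOY_LEXICON):
    """Sum the lexicon scores of all words in the text (unknown words count 0)."""
    words = re.findall(r"\w+", text.lower())
    return sum(lexicon.get(word, 0) for word in words)


def polarity(score):
    """Map a summed score to positive, neutral or negative."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for tweet in [
    "I love this song, great show! #melfest",
    "So boring and awful",
    "The show starts now",
]:
    total = score_tweet(tweet)
    print(tweet, "->", total, polarity(total))
```

Summing word scores is the simplest possible aggregation. It ignores negation ("not great") and sarcasm, which is one reason a lexicon-based score should be read as a rough signal.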
By using sentiment analysis, this thesis aims to investigate the possibility of predicting the outcome of a competition where the viewers’ opinions play a large part in the result. A popular Swedish music competition is Melodifestivalen, in which artists compete with song and dance performances. The winner is determined partly by the votes cast by the viewers of the competition. Recently the competition has encouraged its viewers to interact with the show by tweeting using the competition’s official hashtag (Account 2017). This has created the opportunity to find out people’s opinions about the competing artists before the final voting results have been presented on live television.

1.1 Problem Statement

The aim of this thesis is to investigate how well a sentiment analysis of what people post online can reflect the outcome of a competition based mostly on votes. This will give insight into how well sentiment analysis can be applied to predicting future events and how it may best be used. The research question follows:

• Is it possible to predict the outcome of the competition Melodifestivalen by doing a sentiment analysis of tweets posted by viewers during the show before the result is announced?

1.2 Scope

The focus of this report is to perform sentiment analysis on posts made by viewers of the show on social media, specifically on Twitter. Limiting the research to a single social media platform makes it possible to be more specific in the data gathering, and the decision to look only at posts on Twitter was made because of the limited length of tweets and the ease of finding related data through hashtags. This study is based on the assumption that the fetched tweets are a sample of how the viewers of the competition are planning to vote.
A decision was made to look only at tweets written in Swedish and English, since the majority of tweets were expected to be written in these languages. The decision was also based on the difficulty of guaranteeing the quality of lexicons in other languages.

Finally, the voting period coincides with the air time of the competition, so this research only analyses tweets posted during that time. The tweets included in the study will therefore only reflect the sentiment of the viewers before the winner was announced.

Chapter 2 Background

This chapter introduces the competition Melodifestivalen and the concept of data mining using Twitter. It also presents the concept of sentiment analysis and explains the method of lexical sentiment analysis in more detail. After that, a method for comparing two lists with each other is shown. Lastly, some related work is presented.

2.1 Melodifestivalen

Melodifestivalen is an annual Swedish music competition arranged by the Swedish national public TV broadcaster, Sveriges Television (SVT). The competition is divided into a number of qualification heats before the final. Twelve artists compete in every heat by giving their own musical performance on stage, and two winners from each heat qualify for the grand finale with a chance of winning the whole competition. The winner of the grand finale is decided by a jury in combination with the popular vote, and will later represent Sweden in the Eurovision Song Contest, competing against other European countries. The jury gives half of the final points and the popular vote decides the other half, and the song with the highest combined score is the winner. During the final of the 2016 competition, roughly 12.5 million votes were registered over telephone (SVT 2016), which shows how popular this contest is in Sweden.
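As a rough illustration of the scoring rule in which the jury and the popular vote each decide half of the final points, the sketch below combines the two halves into one ranking. It assumes that the viewers’ half is shared out in proportion to each song’s share of the votes, which the text above does not state. The function name `combined_ranking`, the song labels and all numbers are hypothetical.

```python
def combined_ranking(jury_points, viewer_votes):
    """Combine jury points and viewer votes into a final ranking.

    Assumption: the viewers hand out as many points in total as the jury,
    split in proportion to each song's share of the votes.
    """
    total_jury = sum(jury_points.values())
    total_votes = sum(viewer_votes.values())

    combined = {}
    for song, jury in jury_points.items():
        viewer_points = total_jury * viewer_votes[song] / total_votes
        combined[song] = jury + viewer_points

    # Highest combined score first.
    ranking = sorted(combined, key=combined.get, reverse=True)
    return ranking, combined


# Hypothetical figures for three songs.
jury_points = {"A": 50, "B": 30, "C": 20}
viewer_votes = {"A": 1000, "B": 3000, "C": 1000}

ranking, combined = combined_ranking(jury_points, viewer_votes)
print(ranking, combined)
```

In this made-up example the jury prefers song A, but song B wins on the combined score because of its larger share of the viewer votes. A prediction built only on viewer sentiment can therefore capture at most the viewers’ half of the result.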
In the last couple of years, Melodifestivalen has had its own hashtag, #melfest, on Twitter (Account 2017), to encourage the viewers of the show to tweet their opinion about the competition.

2.2 Data-Mining using Twitter

Twitter is a microblogging network with 319 million monthly active users in the fourth quarter of 2016 (Twitter 2017).