Institutionen För Datavetenskap Department of Computer and Information Science
Bachelor thesis
Twitter as the Second Channel
by Matteus Hemström and Anton Niklasson
LIU-IDA/LITH-EX-G--14/063--SE
2014-06-04
Linköpings universitet, SE-581 83 Linköping, Sweden
Supervisor: Niklas Carlsson
Examiner: Nahid Shahmehri

Students in the five-year Information Technology program complete a semester-long software development project during their sixth semester (third year). The project is completed in mid-sized groups, and the students implement a mobile application intended to be used in a multi-actor setting, currently a search and rescue scenario. In parallel they study several topics relevant to the technical and ethical considerations in the project. The project culminates in a demonstration of a working product and a written report documenting the results of the practical development process, including requirements elicitation. During the final stage of the semester, students form small groups and specialise in one topic, resulting in a bachelor thesis. The current report represents the results obtained during this specialization work. Hence, the thesis should be viewed as part of a larger body of work required to pass the semester, including the conditions and requirements for a bachelor thesis.

Abstract

People share a big part of their lives and opinions on platforms such as Facebook and Twitter. The companies behind these sites do their absolute best to collect as much data as possible. This data could be used to extract opinions in many different ways. Every company, organization or public person is probably curious about what is being said about them right now. There are also areas where opinions are related to the outcome of an event.
Examples of such events are presidential elections or the Eurovision Song Contest. In these events, people's votes directly determine the outcome of the election or contest. We have developed a simple prototype that is able to predict the result of the Eurovision Song Contest using sentiment analysis on tweets. The prototype collects tweets about the event, performs sentiment analysis, and uses different filters to predict the ranks of the contestants. We evaluated our predictions against the actual voting results of the event and found a Pearson correlation of approximately 0.65. With more time and resources we believe that it is possible to create a highly accurate prediction model. Such a system could be used in many different contexts. Politicians and their parties could use it to evaluate their campaigns. The press could use it to create more interesting news reports. Companies would be able to investigate their brand appreciation.

Contents

1 Introduction
  1.1 Motivation
  1.2 Problem Statement
  1.3 Contributions
2 Theory and Related Work
  2.1 Sentiment Analysis
  2.2 Collaborative Filtering
3 Expectations
4 Methodology
  4.1 Overview
  4.2 Data Collection
  4.3 Sentiment Analysis
  4.4 Visualization
  4.5 Filters
5 Results
  5.1 Dataset Characteristics
  5.2 Sentiment Analysis
  5.3 Prediction Results
6 Discussion
  6.1 Methodology
    6.1.1 Filters
  6.2 Results
  6.3 Ethics
7 Conclusion
  7.1 Future Work
A Entity Mentions
B Language Distribution
C Correlation Plots
D Results From Visualization

Chapter 1 Introduction

1.1 Motivation

The Eurovision Song Contest is a very popular event in most countries across Europe. It engages hundreds of millions of people over the course of a few weeks each spring. The whole show is broadcast live by multiple TV channels, and people gather at home to support their favourite act.
Although most people follow the show by watching TV, there is a lot of activity on social media as well. The fact that the result of the contest is based on people's votes, and that viewers continuously share their opinions for anyone to read, creates a great opportunity for analysis. Our goal is to collect tweets via the Twitter API, analyse their sentiment, and create a prediction of the final results. We would like to find out whether our prediction matches the outcome of the event with sufficient accuracy.

1.2 Problem Statement

Using simple entity extraction and sentiment analysis, this thesis explores how information in tweets can be used to predict the outcomes of competitions such as the Eurovision Song Contest. The core question of this thesis is:

• Is it possible to predict the result of the Eurovision Song Contest by running sentiment analysis on tweets related to the topic?

1.3 Contributions

We have created a system that we call Eagle. It includes three modules:

• Data Collection
This module is responsible for communicating with the Twitter REST API. It downloads tweets and users, and saves the data in a MySQL database.

• Data Analysis
This module is responsible for extracting entities and analysing sentiment using our heuristics. The entity extraction connects each tweet with one or more of our pre-defined entities. We do this by simple comparisons between the tweet's body text and a list of identifiers that we have chosen manually.

• Visualization
This module is responsible for presenting the analysed tweets. It provides a simple web interface for building database queries. The website then presents a bar graph showing the result. It comes with three heuristics, two filters, and a language option.

We have also collected 737,793 tweets tagged with #eurovision. Since Twitter does not allow API access to tweets older than 8-10 days, this data could be of interest to other projects.
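The entity extraction described for the Data Analysis module can be sketched as follows. This is a minimal illustration, not Eagle's actual code: the identifier lists, the function name `extract_entities`, and the choice to match on token boundaries are all assumptions made for the example.

```python
import re

# Hypothetical identifier lists; the identifiers actually used in Eagle
# are not given in this section.
ENTITY_IDENTIFIERS = {
    "austria": ["austria", "#aut"],
    "sweden": ["sweden", "#swe"],
}


def extract_entities(tweet_text, identifiers=ENTITY_IDENTIFIERS):
    """Return the sorted names of all entities mentioned in the tweet text."""
    text = tweet_text.lower()
    found = set()
    for entity, names in identifiers.items():
        for name in names:
            # Assumption: match whole tokens only, so that "#swe" does not
            # match inside "#sweet".
            pattern = r"(?<![\w#])" + re.escape(name.lower()) + r"(?!\w)"
            if re.search(pattern, text):
                found.add(entity)
                break  # one matching identifier is enough for this entity
    return sorted(found)


print(extract_entities("Go #AUT! Better than Sweden tonight"))
# ['austria', 'sweden']
```

A tweet may match several entities, in which case it is connected to all of them, as the module description allows.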
Chapter 2 Theory and Related Work

Only a couple of years ago, nowhere near as much data was available as there is today. Since then, many studies have been done in this field. There is also a growing commercial interest in language analysis.

2.1 Sentiment Analysis

Sentiment analysis aims to determine the opinions expressed in text. Often the opinions are directed towards an entity, such as a political party or a competitor of some kind. We focus on the Eurovision Song Contest, so we use it as an example: we treat each individual act as an entity, which means that each tweet can express an opinion on each entity.

There are many tools for performing sentiment analysis. A popular technique for characterizing sentiment in short texts is Linguistic Inquiry and Word Count (LIWC), which has been used to determine the sentiment of tweets [4]. LIWC is a commercial product; the paper The Psychological Meaning of Words: LIWC and Computerized Text Analysis Methods [9] describes how it was created.

Another algorithm that has been used for sentiment analysis of tweets is SentiStrength [10]. SentiStrength was designed to extract sentiment from MySpace comments [10]. Thelwall et al. [11] claim that "[..] the accurate detection of sentiment is domain-dependant" and that the SentiStrength algorithm is suitable for Twitter comments since they are similar to MySpace comments. They also state that informal language and abbreviations may be common, which somewhat contradicts the work of Hu et al. [3], who claim that Twitter is an "evolving medium whose language is a projection of the language of more formal media like news and blogs into a space restricted by size".

2.2 Collaborative Filtering

As explained by Kim et al. [4], the entities and users can be organized in a matrix. Each column represents an entity, and each row represents a user.
Each cell then holds a rating of that user's sentiment towards that entity. Users that share sentiment towards a few entities are likely to have similar opinions on other entities. It is therefore possible to extrapolate opinions by identifying similar users. This is called collaborative filtering. It is most commonly used for recommending content to users. Even though our work is not centred around recommendations, this is an interesting technique, as recommendations are essentially predictions of opinions.

While we did not implement any collaborative filtering in our analysis, we expect that it could be added as a complement to the summarization model that we developed.

Chapter 3 Expectations

Our expectation for this project is that the system will correctly predict three of the actual top five entities, disregarding their internal order. Initially we had the idea of predicting the complete results. A few weeks in, we felt that this was too ambitious. Predicting 60 % of the top contestants is still useful while also being achievable from our point of view.

Predicting 60 % is relevant in the context of the Eurovision Song Contest. That number would not mean anything if we were to predict a presidential election or another event with fewer entities. Having many entities makes the task harder in some ways. A big hurdle is deciding which entity a tweet mentions.

Sentiment analysis is difficult, even for humans. An interesting fact is that humans are about 80 % accurate when it comes to deciding sentiment [8]. This means that a computer which is correct 10 out of 10 times would still not be considered correct by a human in every case, making it difficult to get a good end result.
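The success criterion stated at the start of this chapter (three of the actual top five, ignoring order) can be checked mechanically. The sketch below is only an illustration of that criterion: the function names and the placeholder entity labels are assumptions and are not taken from our system or from any real contest result.

```python
def top_k_overlap(predicted, actual, k=5):
    """Count the entities shared by the first k places of two rankings, ignoring order."""
    return len(set(predicted[:k]) & set(actual[:k]))


def meets_expectation(predicted, actual, k=5, required=3):
    """Return True if at least `required` of the actual top k were predicted in the top k."""
    return top_k_overlap(predicted, actual, k) >= required


# Placeholder labels, not real contestants or results.
predicted_ranking = ["A", "B", "C", "D", "E", "F"]
actual_ranking = ["C", "A", "F", "G", "B", "H"]

print(top_k_overlap(predicted_ranking, actual_ranking))      # 3
print(meets_expectation(predicted_ranking, actual_ranking))  # True
```

Because the comparison uses sets, swapping the order of entities within either top five does not change the outcome, which matches the "without any internal order" condition.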