Troll Detection: A Comparative Study in Detecting Troll Farms on Twitter Using Cluster Analysis
DEGREE PROJECT IN COMPUTER ENGINEERING, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2016

Troll Detection
A comparative study in detecting troll farms on Twitter using cluster analysis

FELIX DE SILVA
MARTIN ENGELIN

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION

Swedish title: Trolldetektion. En jämförande studie i att upptäcka trollfarmar på Twitter med hjälp av klusteralgoritmer

Degree project in computer science, DD143X
Supervisor: Dilian Gurov
Examiner: Örjan Ekeberg
CSC, KTH 2016-05-11

Abstract

The purpose of this research is to test whether clustering algorithms can be used to detect troll farms in social networks. Troll farms are professional organizations that spread disinformation online via fake personas. The research involves a comparative study of two different clustering algorithms and a dataset of Twitter users and posts that includes a fabricated troll farm. By comparing the results and the implementations of the K-means and DBSCAN algorithms, we conclude that cluster analysis can be used to detect troll farms and that DBSCAN is better suited to this particular problem than K-means.

Sammanfattning

The goal of this report is to test whether clustering algorithms can be used to identify troll farms on social media. Troll farms are professional organizations that spread disinformation online using false identities. This report is a comparative study of two different clustering algorithms and a dataset of Twitter users and tweets that includes a fabricated troll farm. By comparing the results and the implementations of the K-means and DBSCAN algorithms, we conclude that clustering algorithms can be used to identify troll farms and that DBSCAN is better suited to this problem than K-means.

Contents

1 Introduction
  1.1 Problem definition
  1.2 Scope and constraints
2 Background
  2.1 Twitter
    2.1.1 Twitter REST API
  2.2 IFTTT
  2.3 Troll
    2.3.1 History
    2.3.2 Characteristics
  2.4 Cluster Analysis
    2.4.1 What is a cluster?
    2.4.2 Similarity between data points
    2.4.3 Hierarchical Clustering
    2.4.4 Partitive Clustering
    2.4.5 Model-Based Clustering
    2.4.6 Density-based Clustering
3 Method
  3.1 Generate trolls
  3.2 Construct the cluster algorithms
  3.3 Collect Twitter data
  3.4 Generate results
    3.4.1 DBSCAN
    3.4.2 K-means
  3.5 Method reasoning
4 Results
  4.1 Twitter Data
  4.2 Algorithms
    4.2.1 K-means
    4.2.2 DBSCAN
5 Discussion
  5.1 Future research
  5.2 Method discussion
6 Conclusion
7 Appendix
  7.1 Twitter data
  7.2 K-means results - multiple run
  7.3 DBSCAN results

1 Introduction

For millions of people around the world, social media sites are an integral part of daily life. There are hundreds of different social media sites supporting a wide range of practices and interests [5]. Social networks such as Facebook and Twitter have become a source of news and a platform for political and moral debate for many users. Stories with varying degrees of truthfulness are spread, and little source criticism is applied by regular people as well as journalists. [10]

The act of spreading disinformation on social media has developed from being the work of bored youths to being commercialized by organisations and political blocs in the form of troll farms. A troll farm is an organization whose sole purpose is to affect public opinion by means of social media. A practical system or piece of software that can identify troll farms could be used to stop them and thereby limit the spread of disinformation. Such an implementation would be of interest to the politicians, media, social networks or organizations that are targeted, since it could be used to clear their names.

1.1 Problem definition

The aim of this project is to investigate ways to detect troll farms on Twitter using clustering algorithms and the Twitter API. The approach is to study clustering algorithms, apply them to a database of tweets and analyze the results. Clustering algorithms depend heavily on the cluster structure, so no single algorithm works on every problem instance. The goal is to research the Twitter REST API to uncover which kind of clustering algorithm is the most appropriate for clustering Twitter users. The research will also cover different kinds of clustering models and appropriate algorithms for them, in order to find out whether there are any comparative advantages or disadvantages between different clustering models. Therefore, the problem statement is:

• Which clustering model is the most appropriate for clustering Twitter data in search of troll farms?

1.2 Scope and constraints

For this project we will research what types of clustering models there are and, based on that research, choose two models to use and analyze. Our search will be based on the activity of users rather than the content of their statuses. The main parameters to be analyzed are:

• Time of day activity
• Rate of tweets

Future research on this topic can analyze more models as well as other search parameters. A more detailed description of the search parameters can be found in section 3.

2 Background

2.1 Twitter

Twitter is a social network platform based on one-to-many communication which enables users to post messages of 140 or fewer characters. These posts, called Tweets, can include plain text, links or other media hosted on different web servers. This simple design enables several different uses of Twitter, making it a sort of mashup of text, email, IM, news forum, microblog and social network. [13]

Twttr, which was the original name for Twitter, was created in 2006. In the beginning it was a text message-based communication tool for groups.
Text messages (SMS) are for historical reasons limited to 160 characters, which resulted in Twitter's character limit of 140: 20 characters for the username and 140 characters for the message. The first real success for the platform came at SXSW Interactive in 2007, where it won the SXSW Web Award in the Blog category. At this point smartphones had not yet become a big hit and the user base of phones that could only text was large. This was an important reason for Twitter's success: anyone could engage in social media without a computer. Today Twitter has evolved into a web-based product with simple but smart APIs. [13]

2.1.1 Twitter REST API

API stands for Application Programming Interface. An API is a set of routines, protocols, and tools for building software applications, and it is what allows an application to share its data with the rest of the world. Like a website, it is accessed through URL requests, but instead of returning web pages it returns structured data.

The Twitter API was originally divided into two REST APIs and a Streaming API. REpresentational State Transfer (REST) is an architectural style that makes sure that data is stateless, layered and well defined. This increases scalability and flexibility as well as ease of development. [9]

Twitter had two REST APIs for historical reasons. A company called Summize, Inc provided search capability for Twitter data. When Summize was later acquired by Twitter, it proved difficult to fully integrate Twitter Search and its API into the Twitter codebase. This took several years, but today both are integrated into a single REST API. [13] The REST API uses OAuth to identify Twitter users and applications. [8]

2.2 IFTTT

IFTTT, or "If This Then That", is a web service that can connect and aggregate many other web apps into one platform and then perform a specific action when certain criteria are met. IFTTT gives users creative control over apps and products by letting them create recipes to perform these actions.
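
The "if this then that" idea can be pictured as a condition paired with an action. The sketch below is only an illustration in Python and does not use the IFTTT service or any IFTTT API; the names `Rule`, `run_rules` and `post_on_new_item`, the event fields, and the example feed item are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

MAX_STATUS_LENGTH = 140  # Twitter's character limit as described in section 2.1


@dataclass
class Rule:
    """Hypothetical 'if this then that' rule: a condition paired with an action."""
    condition: Callable[[Dict], bool]  # the "this": inspects an incoming event
    action: Callable[[Dict], str]      # the "that": here it just builds a status text


def run_rules(rules: List[Rule], event: Dict) -> List[str]:
    """Run the action of every rule whose condition holds for the event."""
    return [rule.action(event) for rule in rules if rule.condition(event)]


# Assumed example: when a feed publishes a new item, compose a status text
# truncated to the character limit.
post_on_new_item = Rule(
    condition=lambda e: e.get("type") == "new_feed_item",
    action=lambda e: ("New post: " + e["title"])[:MAX_STATUS_LENGTH],
)

results = run_rules([post_on_new_item], {"type": "new_feed_item", "title": "Hello"})
```

An event that does not satisfy the condition produces no action, which is the behaviour the "if" part of the name suggests.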
A recipe simply connects apps and web services together to create an action that is performed when some criterion has been met. There are two types of recipes, IF-recipes and DO-recipes, which perform their actions in different ways. [7]

• IF-recipes run automatically in the background and perform their action when the recipe's IF-condition has been fulfilled.
• DO-recipes simply run their action when manually executed.

2.3 Troll

An Internet Troll does not in any way resemble the mythical creature of old Scandinavian folklore. An Internet Troll (henceforth referred to as a Troll) is a person who interrupts, harasses or tries to impose his/her own opinions on others. [16] In the early days trolls were mostly considered a small nuisance on online forums. Since then the Internet has grown, and the problem of trolling with it. From being an activity primarily performed by bored individuals, it has evolved to be industrialized by states and terrorist organizations. These professional groups are sometimes called Troll farms. [2]

Trolls and their activities, trolling, can exist in any kind of social media. Twitter has become a popular platform for trolling activity due to its role as a news forum as well as the fact that anyone can create multiple accounts.

2.3.1 History

Troll farms and their activities are often criminal, or at least morally questionable, and because of this their history is not completely clear. Professional troll farms have been known to exist in Russia since at least 2008, but possibly even before that and probably all around the world.