Five incidents, one theme: Twitter spam as a weapon to drown voices of protest

John-Paul Verkamp, Minaxi Gupta
School of Informatics and Computing, Indiana University
{verkampj, minaxi}@cs.indiana.edu

Abstract

Social networking sites, such as Twitter and Facebook, have become an impressive force in the modern world with user bases larger than many individual countries. With such influence, they have become important in the process of worldwide politics. Those seeking to be elected often use social networking accounts to promote their agendas while those opposing them may seek to either counter those views or drown them in a sea of noise. Building on previous work that analyzed a Russian event where Twitter spam was used as a vehicle to suppress political speech, we inspect five political events from 2011 and 2012: two related to China and one each from Syria, Russia, and Mexico. Each of these events revolved around popular Twitter hashtags which were inundated with spam tweets intended to overwhelm the original content.

We find that the nature of spam varies sufficiently across incidents such that generalizations are hard to draw. Also, spammers are evolving to mimic human activity closely. However, a common theme across all incidents was that the accounts used to send spam were registered in blocks and had automatically generated usernames. Our findings can be used to guide defense mechanisms to counter political spam on social networks.

1 Introduction

Social networks, such as Facebook, Twitter, and Google+, are an increasingly important part of the daily lives of billions of users. With politicians and other important figures increasingly reaching out to social networks to communicate, it is only to be expected that those with malicious intent would follow. Indeed, multiple sources such as [2, 5, 10, 12, 15] have shown how Twitter can be used to spread spam and malicious content. Others have shown how both legitimate and compromised accounts on social networks are manipulated to make spam campaigns more effective [1, 4, 11].

More to the topic of this paper, researchers have recently studied the use of Twitter spam as a vehicle to spread propaganda or to suppress political expression [7, 13]. Several examples of the latter have appeared in the news over the last couple of years. These include attempts to suppress protests against the disputed Russian parliamentary elections [6]; suppression of information regarding the arrest of the Chinese artist Ai Weiwei [3]; attempts to deter Twitter users from learning about Tibet [9]; inundation of pro-revolution tweets from Arab Spring incidents [16]; and dilution of protests against the Mexican presidential candidate, Enrique Peña Nieto [8].

Recent work by Thomas et al. studied how twenty-five thousand fraudulent Twitter accounts were marshaled to send hundreds of thousands of spam tweets in an attempt to disrupt political conversations following the announcement of Russia’s parliamentary election results [13]. These accounts were drawn from a pool of over one million fraudulent accounts serving the spam-as-a-service marketplace. The authors found that fraudulent accounts were created using machines located all over the world. They logged on to Twitter from geographically diverse locations as well. In contrast, more than half of the legitimate accounts logged in only from Russia. Also, the IP addresses of almost 40% of machines posting spam appeared in blacklists, suggesting that they were already known to be compromised.

Given that political speech on Twitter has been suppressed on multiple other occasions, this paper is motivated by the desire to identify characteristics of diverse events from various countries. Doing so will enable us to judge whether it is possible to filter politically-motivated spam. This question is important because Thomas et al.’s work reported that only about half of the spam in the Russian incident was filtered by Twitter’s existing spam filtering mechanisms.

Toward our goal, we analyze five different incidents spread over 14 months where political speech on Twitter was suppressed via spam. Two of these events are related to China and one each is related to Syria, Russia, and Mexico. Each of these events revolved around popular Twitter hashtags which were inundated with spam tweets intended to dilute their content. Of these, Thomas et al. studied the Russian incident. Overall, we confirm a few previously known behaviors and identify a few new ones. To our dismay, we find that spammer behaviors vary sufficiently across incidents such that generalizations are hard to draw. Further, we also find that spammers are evolving to become indistinguishable from legitimate users. These observations in turn imply that previous approaches, such as training supervised machine learning classifiers, are unlikely to be directly applicable and further research is needed to address the problem of politically motivated spam.

The key observations from comparing various incidents are the following:

• Spam tweets in three incidents follow a distinct spiking pattern. Spam in the two other incidents is either sustained or dwarfed by non-spam.
• Two of the incidents exhibit strong signs of scheduled activity. However, spammers in these incidents took care to mimic diurnal patterns typical of human activity, perhaps in order to escape detection.
• Non-spam tweets use more conjunctions and prepositions compared to spam tweets. However, this analysis is challenging for Chinese language tweets because of the lack of word breaks.
• In two incidents, URLs in tweets are less common than the baseline while in one other incident they are significantly more common.
• In two incidents, spammers target users directly using @ mentions. In the others, spammers rely primarily on hashtag popularity.
• Spam accounts are registered in blocks in each incident and the usernames used are automatically generated.
• Spammers are increasingly customizing account profiles in newer incidents while older incidents relied heavily on default profiles.

2 Methodology and data overview

We analyzed five different political events (two from China, one from Russia (previously analyzed by Thomas et al. [13]), one from Syria, and one from Mexico) where Twitter spam is thought to have played a significant role in suppressing event-related tweets. In order to collect data for these incidents, we used a portion of Twitter’s firehose data, which gave us a statistically valid sampling of an estimated one in ten tweets¹. Although we have access to the full 10% for each time period, we begin by filtering out all but a single hashtag for each incident, gathered from news stories [3, 6, 8, 9, 16]. We refer to these hashtags as “seed” hashtags. The primary seed hashtag for each incident is shown in Table 1, along with the dates of each incident.

| Incident | Dates | Primary hashtag | Interpretation | Expanded set of hashtags |
|---|---|---|---|---|
| Syria | 1-13 April 2011 | #syria | Syria | #syria, #bahrain, #egypt, #libya, #jan25 (Egypt), #feb14, #tahrir (Egypt), #yemen, #feb17 (Libya), #kuwait |
| China’11 | 4-6 April 2011 | #aiweiwei | Chinese artist, Ai Weiwei | #aiww, #aiweiwei, #cn417 (Jasmine), #5mao (5 May), #freeaiww, #freeaiweiwei, #cn424 (Jasmine), #tateaww, #cnjasmine |
| Russia | 5-6 December 2011 | #триумфальная | Triumphal Square in Moscow | #чп (abbr. of Чрезвычайное Происшествие, extraordinary incident), #6дек (Dec 6), #5дек (Dec 5), #выборы (elections), #митинг (meeting), #триумфальная (Triumphal Square), #победазанами (victory is ours), #5dec, #навальный (surname, likely Navalny), #ridus |
| China’12 | 12-15 March 2012 | #freetibet | Free Tibet | #tibet, #freetibet, #china, #monday, #西藏 (Tibet), #beijing, #shanghai, #india, #apple, #hongkong |
| Mexico | 19-20 May 2012 | #marchaAntiEPN | March against EPN (initials of presidential candidate) | #marchaantiepn, #marchaantipeña, #marchamundialantiepn, #marchayosoy132 (I am 132nd to march), #votomatacopete (vote for another), #epn, #epnveracruznotequiere (no more EPN), #pr, #amlocomp (initials of competitor), #yosoy132 |

Table 1: Dates and hashtags of interest for each of the five Twitter spam incidents considered (non-English hashtags translated where possible)

The seed hashtags are good starting points but do not paint a complete picture of the magnitude of each incident. Therefore, for each incident, we started by initializing a set of seed hashtags S and collected all available tweets T involving any hashtag in S. We then updated S to contain the top n most common hashtags in T. We chose n = 10 hashtags to focus on the key hashtags related to each incident and also to avoid noise in our data set arising from irrelevant hashtags. We repeated the process of hashtag expansion until S stabilized for each individual incident. In each case, the algorithm took no more than three iterations.

the possibility of compromised accounts taking part in the attacks, we have yet to find any compromised account being used in any incident based on a manual inspection of spam accounts.

One final aspect that we investigated was account activity for both spam and non-spam accounts that did not directly involve one of the hashtags in our lists. We found that spam accounts in general had very few other tweets while legitimate accounts maintained a steady flow of other activity, peaking slightly during each incident. Since the behavior in each incident is the same when considering all tweets or only hashtag-related tweets, we use the former for our analysis.

Table 2 shows a summary of the numbers of spam and legitimate tweets and accounts involved in each incident. As this table shows, the five incidents varied widely. The Syrian incident had the most overall tweets but the percentage of spam tweets was significantly lower than for any other attack. In contrast, the percentages of spam tweets in Russia, China’12, and Mexico were much higher, between 62% and 73%. When comparing spam accounts, we find that while both the percentage of spam tweets in China’12 and
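The hashtag expansion procedure of Section 2 (start from a seed set S, collect the tweets T that use any hashtag in S, replace S with the top n hashtags in T, and repeat until S stabilizes) can be illustrated with a minimal Python sketch. This is not the authors' code. The function name `expand_hashtags`, the representation of each tweet as a list of lowercase hashtags, the `max_iter` safety cap, and the tie-breaking behavior are all assumptions made for illustration.

```python
from collections import Counter


def expand_hashtags(tweets, seed, n=10, max_iter=20):
    """Iteratively expand a seed hashtag set until it stops changing.

    tweets:   iterable of tweets, each given as a list of lowercase hashtags
              (an assumed, simplified representation).
    seed:     initial set of seed hashtags S.
    n:        number of most common hashtags kept per round (the paper uses 10).
    max_iter: hypothetical safety cap, not part of the described procedure.

    Returns the final hashtag set and the number of iterations performed.
    """
    tweets = [set(tags) for tags in tweets]  # count each hashtag once per tweet
    current = set(seed)
    for iteration in range(1, max_iter + 1):
        # T: all tweets involving any hashtag in the current set S.
        matching = [tags for tags in tweets if current & tags]
        # Count how often each hashtag occurs within T.
        counts = Counter(tag for tags in matching for tag in tags)
        # New S: the n most common hashtags in T (ties broken by first occurrence).
        top = {tag for tag, _ in counts.most_common(n)}
        if top == current:
            return current, iteration  # S has stabilized
        current = top
    return current, max_iter
```

On a toy corpus, `expand_hashtags([["#a", "#b"], ["#a", "#b"], ["#a", "#c"], ["#c", "#b"], ["#b", "#d"], ["#d", "#e"], ["#x"]], {"#a"}, n=3)` grows the seed `{"#a"}` to `{"#a", "#b", "#c"}` and stops on the second pass, when the top-3 set no longer changes.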