1 54 2 Thousands of Small, Constant Rallies: 55 3 56 4 A Large-Scale Analysis of Partisan WhatsApp Groups 57 5 58 6 Victor S. Bursztyn Larry Birnbaum 59 7 [email protected] [email protected] 60 8 Department of Computer Science Department of Computer Science 61 9 Northwestern University Northwestern University 62 10 Evanston, Illinois, USA Evanston, Illinois, USA 63 11 64 ABSTRACT 1 INTRODUCTION 12 65 13 There is growing concern about the use of social platforms On October 28th 2018, amid strong polarization and con- 66 14 to push political narratives during elections. One very recent spiracy theories that flooded social media, Brazilians elected 67 15 case is Brazil’s, where WhatsApp is now widely perceived far-right candidate Jair Bolsonaro their next president. With 68 16 as a key enabler of the far-right’s rise to power. In this paper, over 120M users in Brazil – the second largest market in 69 17 we perform a large-scale analysis of partisan WhatsApp the world – the role that WhatsApp played in this electoral 70 18 groups to shed light on how both right- and left-wing users process has emerged as a major focus of attention and con- 71 19 leveraged the platform in the 2018 Brazilian presidential troversy due to its alleged importance in a large number of 72 20 election. Across its two rounds, we collected more than 2.8M successful political campaigns. 73 21 messages from over 45k users in 232 public groups (175 right- Indeed, WhatsApp in Brazil connects an audience com- 74 22 wing vs. 57 left-wing). After describing how we obtained a parable in scale to television’s to content created and dis- 75 23 sample that is many times larger than previous works, we tributed with almost no barriers or filters other than user 76 24 compare right- and left-wing users on their social network curation. Although the promise of inexpensive one-to-one 77 25 metrics, regional distribution, content-sharing habits, and mobile communication may have sparked this popularity – 78 26 most characteristic news sources. WhatsApp’s 1.5B users are largely in developing countries 79 27 – group chats arguably made it catch fire. The app allows 80 28 CCS CONCEPTS users to create groups where up to 256 users can share text 81 29 • Networks → Online social networks. and multimedia messages, transforming these groups into 82 30 highly active social spaces. Users can also create public in- 83 31 KEYWORDS vites to their groups and share them as URLs across the web, 84 transforming these groups into small public forums. 32 chat applications, WhatsApp, elections, partisanship, data 85 However, the fact that all messages circulate with end-to- 33 collection, social network analysis 86 34 end encryption hinders transparency and even law-enforcement, 87 making WhatsApp a fertile ground for bad actors. A recent 35 ACM Reference Format: 88 36 Victor S. Bursztyn and Larry Birnbaum. 2018. Thousands of Small, study on the Brazilian electorate found that false stories 89 37 Constant Rallies: A Large-Scale Analysis of Partisan WhatsApp that circulated massively through WhatsApp’s network were 90 38 Groups. In Woodstock ’18: ACM Symposium on Neural Gaze Detection, more far-reaching than initially assumed, as it revealed that 91 39 June 03–05, 2018, Woodstock, NY. ACM, New York, NY, USA, 5 pages. 90% of Bolsonaro’s supporters think they are truthful [2]. But 92 40 https://doi.org/10.1145/1122445.1122456 measuring effects on the surface only to speculate about such 93 41 a complex network isn’t enough to protect future elections 94 42 from the use of WhatsApp as a political weapon. Instead, it’s 95 43 Permission to make digital or hard copies of all or part of this work for necessary to learn how to measure partisan activity at scale. 96 personal or classroom use is granted without fee provided that copies are not 44 In practice, this is challenging because it requires finding 97 made or distributed for profit or commercial advantage and that copies bear a large number of partisan WhatsApp groups and following 45 this notice and the full citation on the first page. Copyrights for components 98 46 of this work owned by others than ACM must be honored. Abstracting with the digital rallies that happen in them. Next, we need to find 99 47 credit is permitted. To copy otherwise, or republish, to post on servers or to meaningful ways to analyze such rallies and characterize 100 48 redistribute to lists, requires prior specific permission and/or a fee. Request each partisan group. There are many analyses that can be 101 permissions from [email protected]. 49 made to compare right- and left-wing users in WhatsApp: 102 Woodstock ’18, June 03–05, 2018, Woodstock, NY 50 How their social networks are structured, how much do they 103 © 2018 Association for Computing Machinery. represent the larger population of voters, and what types of 51 ACM ISBN 978-1-4503-9999-9/18/06...$15.00 104 52 https://doi.org/10.1145/1122445.1122456 content are consumed and shared by them. 105 53 1 106 Woodstock ’18, June 03–05, 2018, Woodstock, NY Bursztyn and Birnbaum

107 Addressing the challenges involved in analyzing partisan 3 METHODOLOGY 160 108 activity in WhatsApp, we make the following contributions: Our data collection started on September 1, 2018, and was 161 109 162 • concluded for the purpose of this analysis on November 1, 110 We introduce a new data collection method capable of 163 growing a sample as much as necessary and towards 2018, thus spanning exactly two months. It included mean- 111 ingful events that preceded the electoral race (e.g., Brazil’s 164 112 specific directions in the political spectrum; 165 • Supreme Court barring former President Lula from the race 113 Our code, released to the research community upon 166 publication, can mine invites to other public WhatsApp on September 11) as well as the first and second rounds of 114 the election (on October 8 and 28, respectively). 167 115 groups from the groups already joined; 168 • All data was collected using a dedicated smartphone with 116 We analyze the first large-scale dataset of partisan 169 activity in WhatsApp using a variety of methods and 64GB of storage, allowing for a maximum of 1GB of data per 117 average day in our study. WhatsApp’s daily backups were 170 118 standpoints, contributing with real measurements about 171 right- vs. left-wing users in the platform. observed so that our portfolio of groups would never exceed 119 this safe margin. For our setting, in practice, it allowed the 172 120 tracking of 700-800 public groups – the total fluctuated as 173 2 RELATED WORK 121 groups were closed or added. 174 122 Several works indicate a recent rise of interest in analyzing In this section we describe our data collection method- 175 123 public WhatsApp groups: ology in two parts. First, we discuss the basic steps also 176 124 Rosenfeld et al. [8] provided an initial study of WhatsApp implemented by previous works [1, 3, 6, 7]. Second, we de- 177 125 messages with a particular focus on predictive analysis. Al- scribe our improvements, which substantially enhance data 178 126 though it comprised 4M messages, the dataset spanned only collection, now made available at a public repository. As 179 127 100 users and didn’t include the actual contents of those with previous works, we emphasize that the resulting data 180 128 messages – only a handful of meta-data. They noted that a collection complies to WhatsApp’s privacy policy. 181 129 small majority of messages originated from group messaging 182 130 and used it to distinguish WhatsApp from plain texting. Data Collection - Base Method 183 131 Garimella and Tyson [3] collected the first large-scale Public WhatsApp groups are characterized by the ability 184 132 WhatsApp dataset. To do so, they looked for invites to public to join them through a public URL created by their own- 185 133 WhatsApp groups in websites that aggregated and organized ers, called “an URL invite”. At the same time, all WhatsApp 186 134 information about such groups as well as by searching for the groups are limited to 256 users. Considering this, Caetano et 187 135 string “chat..com” in Google. After automating the al. [1, Figure 2] summarizes the base method in three steps: 188 136 process of joining groups using Selenium over WhatsApp’s 1) Searching the web for invites to public WhatsApp groups; 189 137 web client, their work was able to collect 454,000 messages 2) Trying to join them; and 3) Extracting data from them. 190 138 from 45,794 users in a six month period, encompassing a For step #1, a typical solution is to look for invites to pub- 191 139 total of 178 groups about a wide variety of themes. lic WhatsApp groups in a set of publicly accessible sources. 192 140 Caetano et al. [1] provided an initial study of political dis- These sources include: (i) websites aimed at organizing in- 193 141 cussions in Brazilian WhatsApp groups and, at the same time, formation about public WhatsApp groups; (ii) the web, in 194 142 offered valuable measurements for a non-electoral setting. general, by searching for the string that is the prefix of URL 195 143 Their dataset comprised 273,468 messages from 6,967 users in invites (“chat.whatsapp.com”) in Google; and (iii) the social 196 144 a one month period, encompassing a total of 81 groups. From web, by performing the same search in , Twitter, 197 145 these, only eight groups were selected and further analyzed YouTube, and . Furthermore, within these sources, 198 146 – four being political and four non-political. More recently, it’s often possible to filter groups dedicated to political dis- 199 147 Resende et al. [7] described a system aimed at helping jour- cussion: (i) by selecting categories most likely to host such 200 148 nalists to report on WhatsApp activity during the electoral groups; (ii) and (iii), by compiling a list of keywords referring 201 149 process. Their data comprised 210,609 messages from 6,314 to the political right and to the political left (e.g., candidate 202 150 users in a one month period, encompassing a total of 127 names, vice-presidents, parties), and combining this list with 203 151 public groups dedicated to political themes. A fraction of this the string “chat.whatsapp.com”. In our implementation, this 204 152 sample, however, may be considered politically neutral or process resulted in about 100 valid URL invites.1 205 153 mixed: there were 26 “debate groups” and 30 “news sharing For step #2, a typical solution is to use the Selenium-based 206 154 groups”, so partisan groups may be limited to 71 groups. Python script published by Garimella and Tyson [3] to auto- 207 155 In [6], Resende et al. provide a more up-to-date view on mate the joining process. Their code: (i) receives a list of URL 208 156 their collected dataset: it comprises 789,914 messages from 209 157 18,725 users, but it’s still limited to the first round of the 1The full list of keywords we used is available at https://github.com/ 210 158 Brazilian election and filled with heterogeneous groups. vbursztyn/whatsapp-data-collection/blob/master/keywords_invites.csv 211 159 2 212 Thousands of Small, Constant Rallies: A Large-Scale Analysis of Partisan WhatsApp Groups Woodstock ’18, June 03–05, 2018, Woodstock, NY

213 invites, (ii) opens a browser window, (iii) loads WhatsApp’s disseminated by group owners. This work is similarly com- 266 214 web client, and (iv) simply tries to join each group sequen- pliant with WhatsApp’s privacy policy as it states that all 267 215 tially. These attempts aren’t necessarily successful because users must be aware that their data will be shared with other 268 216 of the group size limitation. However, since the joining pro- members once they participate of a group, public or not. Un- 269 217 cess is now automated, new attempts can be scheduled until like previous works, no third-party tools were used for data 270 218 a spot appears. extraction: our data collection used WhatsApp’s chat export 271 219 For step #3, the solution varies. It ranges from simply ex- feature, meaning that all data we have accessed, processed, 272 220 porting group activity using WhatsApp’s chat export feature and analyzed were selected and sent by email through the 273 221 to obtaining access to WhatsApp’s local DB [3] or using a WhatsApp app. Many other systems rely on the same chat 274 222 third-party API to scrape WhatsApp’s web client [6, 7]. export feature, which is limited to the 10,000 most recent 275 223 messages if multimedia files are included or to the 40,000 276 224 Data Collection - Enhancements most recent messages otherwise. 277 225 We added step #4 to Caetano et al. [1, Figure 2]: 4) Mining 278 226 new URL invites sent to the groups already joined. 4 PARTISANSHIP: GENERAL CHARACTERISTICS 279 227 Based on the automated joining script by Garimella and Social Network Metrics 280 228 Tyson [3], our code: (i) opens a browser window, (ii) loads 281 229 WhatsApp’s web client, (iii) inserts the string “chat.whatsapp.com” In this analysis we evaluate structural properties of the net- 282 230 into the search bar, (iv) waits for the results, (v) scrolls an ar- works formed by right- and left-wing users in the 232 parti- 283 231 bitrary amount of times, (vi) mines all invites that are loaded san groups identified. To do so, we construct networks where 284 232 in the browser, (vii) retrieves each group’s information, and nodes represent active users (i.e., users who sent at least one 285 233 finally (viii) saves all information in a table that canbe a. message to at least one group) and edges represent pairs of 286 234 manually managed, and b. passed on to the automated join- users co-participating in a same group. Since we have three 287 235 ing script. Step #4 is particularly powerful as it creates a loop times as many right-wing groups (175 vs. 57), networks will 288 236 between joining groups and extracting more invites from have different sizes. Indeed, as Table 1 shows, the right-wing 289 237 their messages. This loop can be used to grow the sample network has 39,035 nodes (users) and 8.4M edges, while the 290 238 as much as necessary and towards specific directions – for left-wing has 6,242 nodes and 0.9M edges. 291 239 instance, by mining more invites from left-wing groups. In However, a more detailed analysis tells a different story. 292 240 practice, we are able to explore the interconnected nature of The difference in the number of connected components isn’t 293 241 partisan WhatsApp groups. proportional to the difference in sizes, which is confirmed 294 242 Last, in our work, we inspected each public WhatsApp by the extent of their largest connected components (LCCs): 295 243 group to assess whether its cover photo, group title or group 95% of nodes in the right-wing network belong to a single 296 244 description would explicitly support a specific candidate. 232 connected component, while this value is down to only 80% 297 245 groups had all the three elements explicitly supporting a of nodes in the left-wing network. Additionally, despite the 298 246 candidate, thus being deemed partisan. Among these, 175 difference in sizes, right-wing users have a smaller average 299 247 groups were clearly right-wing, as they supported far-right path length (APL): 3.03 vs. 3.13 for left-wing users. In other 300 248 candidate Jair Bolsonaro, whereas 57 groups were classified words, we find that right-wing users are more tightly 301 249 as left-wing. These 57 groups either supported Fernando connected in WhatsApp. 302 250 Haddad, the left-wing runner-up, or Ciro Gomes and Marina It’s also worth noting that our results for clearly parti- 303 251 Silva, who were center-to-left candidates that didn’t make it san groups show a substantially smaller APL compared to 304 252 to the second round and then declared support for Haddad. Resende et al.’s [6, Table 4] results for “political groups”, in 305 253 Therefore, as the BBC [5] did in their analysis of WhatsApp general: 3.03 & 3.13 (ours) vs. 3.95 (theirs). This happens 306 254 in India, we aggregate candidates to the left of Bolsonaro to despite the fact that our networks have 4+ times as many 307 255 represent the political left although the two partisan groups nodes (45,277 vs. 10,860). Therefore, we find that partisan 308 256 aren’t equidistant from the center. groups are more tightly connected when compared to 309 257 Our code is available at https://github.com/vbursztyn/ political groups, in general. 310 258 whatsapp-data-collection 259 Representation of Real Population of Voters 311 312 260 WhatsApp’s Privacy Policy In this analysis we evaluate a possible link between partisan 313 261 Like previous projects [1, 3, 6, 7], our work is based on pub- activity in WhatsApp and the real population of voters, as 314 262 lic WhatsApp groups, which are accessible to any user with we conjecture that the regional distribution of users in our 315 263 valid URL invites. These invites, especially in the case of sample should reflect the regional distribution of voters ina 316 264 partisan groups during the Brazilian election, were widely constituency. 265 317 3 318 Woodstock ’18, June 03–05, 2018, Woodstock, NY Bursztyn and Birnbaum

Table 2: Content-sharing habits. 319 Table 1: User network metrics. 372 320 373 Right-Wing Left-Wing 321 Right-Wing Left-Wing # of messages (%) 2,392,851 (100%) 429,835 (100%) 374 # of groups 175 57 322 # of multimedia 1,113,821 129,328 375 # of nodes 39,035 6,242 messages (%) (46.55%) (30.09%) 323 # of edges 8,423,514 872,957 # of messages 279,196 50,608 376 # of components 1,830 1,249 with URLs (%) (11.67%) (11.77%) 324 Largest 377 # from YouTube (%) 157,208 (56.31%) 22,378 (44,22%) 325 connected 95.31% 80.01% 378 component # from WhatsApp (%) 42,414 (15.19%) 9,902 (19.57%) Average # from Facebook (%) 30,172 (10.81%) 10,127 (20.01%) 326 3.03 3.13 379 path length # from Twitter (%) 5,602 (2.01%) 3,111 (6.15%) 327 # from Instagram (%) 6,586 (2.36%) 663 (1.31%) 380 328 381 329 382 330 383 331 384 332 385 333 386 334 387 335 388 336 389 337 390 338 391 339 Figure 1: Right- and left-wing users organized by state and normalized. 392 340 393 341 394 342 395 343 396 344 397 345 398 346 399 347 400 348 401 349 402 350 403 351 404 352 405 353 406 354 407 355 408 356 409 357 410 358 411 359 Figure 2: Largest rank order differences indicate news sources that are most characteristic of each partisan group. 412 360 413 361 414 362 First, in our sample, we obtain two regional distributions bars would decay similarly if the same ratio held. This way, 415 363 by processing the state area codes extracted from the dis- states where the “Sample” bar exceeds the “Constituency” 416 364 tinct phone numbers found in each partisan group. Next, we bar are overrepresented in our sample (e.g. “MG” in right- 417 365 compare these distributions with the final election results for wing groups), while states where the opposite happens are 418 366 each candidate in each state [9]. We normalize the popula- underrepresented (e.g., “BA” in left-wing groups). 419 367 tions from each state by the ones from São Paulo (“SP”), as it’s It’s worth noting that two regions are extraordinarily 420 368 where both Bolsonaro and Haddad garnered the most votes overrepresented in both partisan groups: the one com- 421 369 (15.3M and 7.2M, respectively). Note in Figure 1 that “SP” is prised of Brazil’s administrative district (“DF”), which re- 422 370 represented by the highest bars, normalized to 1.0. All other volves around political activity, and the one comprised of 423 371 4 424 Thousands of Small, Constant Rallies: A Large-Scale Analysis of Partisan WhatsApp Groups Woodstock ’18, June 03–05, 2018, Woodstock, NY

425 LW ( ) 478 426 voters who live abroad (“Int”). Although a possible explana- top 30). Likewise, Rankindex α returns the rank order of α 427 tion could be that these two regions were disproportionately among left-wing messages (adopting 30 as fallback). 479 428 engaged in political activism, a deeper analysis of the inter- We calculate a score for α among right-wing users by 480 429 national numbers is warranted as they could include ghost calculating the difference in rank orders, as follows: 481 430 accounts created through third-party services. 482 RW LW RW 483 431 Also, among right-wing users, most regions are overrepre- Scoreα = Rankindex (α) − Rankindex (α) 432 sented. It should be noted that these distortions could have Resulting in the most characteristic news sources shown 484 433 several origins: from differences in regional populations of in Figure 2, which matches domain knowledge at the same 485 434 WhatsApp users to possible differences in the sharing of time that it uncovers lesser-known sources. 486 435 invites between groups, which would cause our data collec- 487 436 tion method to mine more groups from more interconnected 5 CONCLUSION & FUTURE WORK 488 437 regions. The body of knowledge on Twitter mining suggests In this paper, we performed the first large-scale analysis of 489 438 there could be a number of biases in this population [4], thus partisan WhatsApp groups in the context of Brazil 2018 pres- 490 439 some analyses should be made with caution – especially if idential election. The methodology we disclosed allowed a 491 440 describing the actual constituencies. sample that is, at the same time, more specialized and sub- 492 441 stantially larger than described in previous works. We were 493 Content-Sharing Habits 442 able to analyze how right- and left-wing users organized 494 443 In this analysis we evaluate the 2.39M messages sent in right- a myriad of small, constant rallies in WhatsApp, finding a 495 444 wing groups and the 0.43M messages from left-wing groups number of distinct characteristics within right-wing groups 496 445 to outline the content landscape in each partisan group, as – right-wing users are more prevalent, tightly connected, ge- 497 446 seen in Table 2. ographically distributed, and shared more multimedia mes- 498 447 Interestingly, right-wing users send multimedia messages sages and YouTube videos. Future work will target more spe- 499 448 at a substantially higher rate: 46.55% vs. 30.09%. Despite cific behaviors such as expression of distrust or promotion 500 449 the small sample from Caetano et al. [1, Figure 5], we high- of certain types of information across the political spectrum. 501 450 light that these numbers are much higher than the 20% they 502 451 had found for political groups in a non-electoral setting. REFERENCES 503 452 Considering their baseline, use of multimedia messages [1] Josemar Alves Caetano, Jaqueline Faria de Oliveira, Hélder Seixas Lima, 504 453 by right-wing users more than doubled compared to Humberto T Marques-Neto, Gabriel Magno, Wagner Meira Jr, and 505 Virgílio AF Almeida. 2018. Analyzing and Characterizing Political Dis- 454 what was seen one year before. 506 cussions in WhatsApp Public Groups. arXiv preprint arXiv:1804.00397 507 455 Previous works [1, 6, 7] found that roughly 10% of all text (2018). 456 messages contain URLs. Among these, they further found [2] Folha de São Paulo. 2018. 90% of Bolsonaro’s Supporters Believe in Fake 508 457 that YouTube tops the list of popular domains followed by News, Study Says. https://folha.com/0futoq3n 509 458 Facebook and WhatsApp. Based on a substantially larger [3] Kiran Garimella and Gareth Tyson. 2018. WhatsApp, Doc? A First Look 510 at WhatsApp Public Group Data. In Proceedings of the ICWSM (2018). 459 sample, our results seem to confirm theirs. 511 [4] Alan Mislove, Sune Lehmann, Yong-Yeol Ahn, Jukka-Pekka Onnela, and 512 460 Our results suggest that the electoral process could be J Niels Rosenquist. 2011. Understanding the Demographics of Twitter 461 a strong driver for the use of multimedia messages Users. ICWSM 11, 5th (2011), 25. 513 462 in partisan groups, especially among right-wing users, [5] BBC Audiences Research. 2018. Duty, Identity, Credibility: “Fake News” 514 463 whereas the same effect isn’t seen in the total amount of and the Ordinary Citizen in India. https://downloads.bbc.co.uk/ 515 mediacentre/duty-identity-credibility.pdf 464 URLs shared. However, similarly to what was noted in the 516 [6] Gustavo Resende, Philipe Melo, Hugo Sousa, Johnnatan Messias, 517 465 United States, YouTube appears to play a role in informa- Marisa Vasconcelos, Jussara Almeida, and Fabrício Benevenuto. 2019. 466 tion diffusion especially for the political right – 56.31% of (Mis)Information Dissemination in WhatsApp: Gathering, Analyzing 518 467 right-wing URLs are YouTube videos. and Countermeasures. In Proceedings of the 2019 World Wide Web Con- 519 468 ference. 520 News Consumption [7] Gustavo Resende, Johnnatan Messias, Márcio Silva, Jussara Almeida, 469 521 Marisa Vasconcelos, and Fabrício Benevenuto. 2018. A System for 522 470 In this analysis we evaluate how right- and left-wing users Monitoring Public Political Groups in WhatsApp. In Proceedings of the 471 consume news by calculating their most characteristic news 24th Brazilian Symposium on Multimedia and the Web. ACM, 387–390. 523 472 sources. To do so, we count the most frequent news sources [8] Avi Rosenfeld, Sigal Sina, David Sarne, Or Avidov, and Sarit Kraus. 524 473 among right-wing messages and compile a rank with their 2016. WhatsApp Usage Patterns and Prediction Models. ICWSM/IUSSP 525 RW LW Workshop on Social Media and Demographic Research (2016). 474 top 30 sources (Rank ), doing the same for the left (Rank ). 526 [9] Tribunal Superior Eleitoral. 2018. 2018 Electoral Results. http://divulga. RW ( ) 527 475 Consequently, for a given source α, consider that Rankindex α tse.jus.br/oficial/index.html 476 returns the rank order of α among right-wing messages (or 528 477 30 if α isn’t in the rank, referring to the last position of a 529 5 530