Characterizing the public perception of WhatsApp through the lens of media

Josemar Alves Caetano¹, Gabriel Magno¹, Evandro Cunha¹,², Wagner Meira Jr.¹, Humberto T. Marques-Neto³, Virgilio Almeida¹,⁴
{josemarcaetano, magno, evandrocunha, meira}@dcc.ufmg.br

¹ Dept. of Computer Science, Universidade Federal de Minas Gerais (UFMG), Brazil
² Leiden University Centre for Linguistics (LUCL), The Netherlands
³ Dept. of Computer Science, Pontifícia Universidade Católica de Minas Gerais (PUC Minas), Brazil
⁴ Berkman Klein Center for Internet & Society, Harvard University, USA

1 Introduction The messaging service WhatsApp is, as of 2018, one Abstract of the most rapidly growing components of the global information and communication infrastructure, count- ing with 1.5 billion users who send around 60 billion WhatsApp is, as of 2018, a significant com- per day [Con18]. This tool combines one-to- ponent of the global information and commu- one, one-to-many and group communication by offer- nication infrastructure, especially in develop- ing private chats, broadcasts and public group chats, ing countries. However, probably due to its through which users are able to send text and media strong end-to-end encryption, WhatsApp be- (audio, image and video), as well as files in various came an attractive place for the dissemina- formats. tion of misinformation, extremism and other According to data published by Statista [Sta18], forms of undesirable behavior. In this pa- more than half of the population of Saudi Arabia, per, we investigate the public perception of Malaysia, Germany, Brazil, Mexico and Turkey were WhatsApp through the lens of media. We an- active WhatsApp users in 2017. Also, the Reuters alyze two large datasets of news and show the Institute Digital News Report 2018 [NFK+18] shows kind of content that is being associated with a rise in the use of messaging applications, including WhatsApp in different regions of the world WhatsApp, as sources of news in several parts of the and over time. Our analyses include the ex- world. This report indicates that WhatsApp use for amination of named entities, general vocabu- news has almost tripled since 2014 and it has surpassed lary and topics addressed in news articles that Twitter as a communication system in many countries. mention WhatsApp, as well as the polarity of One of the alleged reasons for this is that users are these texts. Among other results, we demon- looking for more private and secure spaces to com- strate that the vocabulary and topics around municate. 
In addition to this, WhatsApp turned out the term “” in the media have been to be an important platform for political propaganda changing over the years and in 2018 concen- and election campaigns, having held a central role in trate on matters related to misinformation, elections in Brazil, India [Goe18], Kenya, Malaysia, politics and criminal scams. More generally, Mexico and Zimbabwe, for instance. Also, WhatsApp our findings are useful to understand the im- has been frequently associated with the spread of mis- pact that tools like WhatsApp play in the con- information and disinformation [Wat18]. temporary society and how they are seen by Despite its prominence, continued growth and opac- the communities themselves. ity, there has been an insufficient number of studies exploring the various aspects of WhatsApp and sim- Copyright © CIKM 2018 for the individual papers by the papers' ilar mobile messaging applications [GWCG18]. Since authors. Copyright © CIKM 2018 for the volume as a collection WhatsApp provides encrypted end-to-end communi- by its editors. This volume and its papers are published under the Creative Commons License Attribution 4.0 International (CC BY 4.0). cation, it is a great challenge to conduct large-scale and cultural trends through quantitative analyses of analyses on the behavior of its users. In this work, texts, using sources like large collections of digitized we take a different approach: instead of looking at books. Several studies explore this method to in- inside the system, we focus on the public perception vestigate topics such as the dynamics of birth and of WhatsApp from outside sources. The goals of this death of words [PTHS12], semantic change [GB11], paper are: emotions in literary texts [ALGB13] and character- istics of modern societies [Rot14]. 
Some works pro- • to characterize how media in different countries pose a complementary approach to culturomics by us- interpret the role of WhatsApp in society; ing historical news data [Lee11], analyzing European news media [FTA+10] or the writing style and gen- • to analyze the evolution of the perception of der bias of particular topics in large corpora of news WhatsApp over time, from its creation until its articles [FALW+13]. Other works concentrate in spe- massive popularization; cific events in history, such as the Fukushima nuclear • to comprehend how sensitive topics, such as disaster [LWSVC14], by using large datasets of media politics, crime and extremism, are related to reports to understand aspects such as how the media WhatsApp in different regions of the world and polarity towards a topic changes over time. in distinct periods of time. Employing methods similar to the ones presented here, [CMC+18] investigate the perception and the To achieve these goals, we explore different techniques: conceptualization of the term “fake news” in the me- analysis of Web search behavior, co-occurring named dia, showing that contextual changes around this ex- entities and vocabulary, co-occurrence networks, top- pression might be observed after the United States ics addressed and textual polarity. According to our presidential election of 2016. However, as far as we are understanding, each of these methods is able to pro- concerned, this is the first work that uses these meth- vide additional information about the perception of ods to examine in detail how the term “whatsapp” is WhatsApp in the news articles investigated. 
As a being reported by news media in different parts of the whole, our results indicate that the media has sig- world, making us able to analyze how important top- nificantly changed its perception and portrayal of ics, such as misinformation, manipulation and extrem- WhatsApp: while in the period before 2013 the focus ism, might be associated with WhatsApp by societies. of the news was on WhatsApp features, in the follow- ing years the tool started to be more associated with On WhatsApp social issues, including the dissemination of misinfor- mation. Despite the increasing use of WhatsApp in the world, This paper is organized as follows: in Section 2, we few quantitative and large-scale studies about this review a selection of works on WhatsApp and, more application are currently avail- generally, on the use of textual datasets to under- able. [GT18] propose a data collection methodology for stand social phenomena; in Section 3, we describe our this application and perform a statistical exploration methodology of data collection and the overall char- to indicate how data from WhatsApp public groups acterization of the datasets used in this investigation; can be collected and analyzed. Also, [MGB17] collect next, in Section 4, we characterize the vocabulary, an- WhatsApp messages to monitor critical events during alyze the topics addressed and evaluate the polarity of Ghana’s 2016 presidential election, and [CdO13] an- the news articles contained in our datasets; finally, in alyze differences between WhatsApp and SMS mes- Section 5, we conclude the paper and present future saging system using a large-scale survey. [FCSD15] directions of work. investigate Facebook and WhatsApp traces collected from an European national wide mobile network and 2 Related Work characterize the usage of both applications. 
The work of [SHS+16] surveys users to investigate the usage of On the use of textual datasets to understand social WhatsApp groups and, more specifically, its implica- phenomena tions for mobile network traffic, while [RSS+18] collect Analyzing how a term is used over time and in a personal information and messages from one hundred geographic location is important to help in the un- WhatsApp users with the aim of understanding their derstanding of how cultural values, societal issues usage patterns. and customs are perceived by society and expressed All of these works investigate a limited part of through language [Cam13, Mat53]. Culturomics, for WhatsApp, therefore offering a restricted understand- example, is a concept proposed by [MSA+11] refer- ing of how this application is used. Nevertheless, here ring to a method for the study of human behavior we study this tool using large datasets of external data provided by news articles containing the term indicates the most common associated terms and the “whatsapp” in different regions of the world and cov- countries from which the highest volume of searches ering the whole WhatsApp history, thus shedding light are originated from. It is also possible to filter these not exactly on its usage, but on how it is viewed from results for given periods. For our investigations, we outside sources. collected data from searches made between 2010 and 2018, and use this information in Section 4.1. 3 Data Collection 4 Analyses and Results We use two large datasets of news articles in this study. The first one is a collection of texts from the Corpus In this section, we discuss the outcomes of differ- of News on the Web (NOW Corpus), which contains ent analyses aimed to understand the perception of articles from online newspapers and magazines writ- WhatsApp in the media. Each characterization is in- ten in English in 20 different countries from 2010 to troduced by a description of how it may contribute the present time [Dav13]. 
This corpus is available for to accomplish our goals, followed by the methodology download and online exploration1 and, according to employed and, finally, by a presentation and discussion its author, it is, at the moment of our data collection, of the results found. the largest corpus available in full-text format. In 31 May 2018, we gathered all the news articles containing 4.1 Web search behavior the 33,185 occurrences of the term “whatsapp” in the Before analyzing the public perception of WhatsApp NOW Corpus. These news articles cover every year through the lens of news articles from different regions in the corpus (from 2010 to 2018) and comprise all of the world, we investigate whether it is possible to 20 countries represented. These countries were then observe a change in the Web search behavior regard- grouped into six regions based on their geographic ing the term “whatsapp” through time. We use data locations (Africa, British Isles, Indian subcontinent, collected from Google Trends to perform this analysis. Oceania, Southeast Asia and the Americas). Our results show that, unsurprisingly, the number Our second dataset includes articles collected from of queries on the Google Search engine for the term Brazilian online newspapers and magazines, all written “whatsapp” is constantly growing since the release of in Portuguese, also containing the term “whatsapp”. this tool for Android devices in 2010, as indicated in We searched for articles starting from 2010, but did Figure 1. Also, Table 2 lists the five most frequent not find any from 2010 and 2011 containing the term search terms employed by users who also searched for “whatsapp”, so our second dataset contains news from “whatsapp” from 2010 to 2018. Here, we notice a shift 2012 to 2018. 
To build this dataset, we used the tool in the related terms through the years: in the first two Selenium2 to automate Web searches with the term years, most of the words are concerned with the down- “whatsapp” in the following ten major Brazilian news load of the app (“download”, “descargar”) and de- websites: Exame, Folha de S. Paulo, Gazeta do Povo, vice compatibility (“blackberry”, “iphone”, “nokia”); G1, O Estado de S. Paulo, R7, Terra, Universo On- then, from 2012 onwards, queries for “whatsapp” start (UOL), Valor Econˆomico and Veja. The total to be linked to different topics, especially features of number of occurrences of “whatsapp” extracted from the tool (“status unavailable”, “whatsapp encryption”, these websites on 31 May 2018 is 4,047. Finally, we “video status download”), but also content shared in used the Python library newspaper3 to collect the full WhatsApp (“imagens para whatsapp”, “el negro del texts of these news articles. whatsapp”). In Sections 4.2 to 4.6, we analyze the news texts from the two previously described datasets. Table 1 shows the number of news containing the term 4.2 Co-occurring named entities “whatsapp” in our two datasets, according to the geo- In natural language processing, named entity recog- graphical origin of the corresponding news media and nition is the task of extracting mentions of named the year of publication of the news article. entities – that is, definite noun phrases referring to In addition to these datasets, we also collected data individuals, organizations, dates, locations – in a from Google Trends4, an online tool that indicates the text [BLK09]. We here extract the most mentioned frequency of particular terms in the total volume of named entities in our NOW Corpus dataset for each searches in the Google Search engine. 
This tool also region and year of publication of the articles in order to understand who are the main actors related to the 1 https://corpus.byu.edu/now/ tool WhatsApp according to the media. In this paper, 2https://www.seleniumhq.org/ 3https://pypi.org/project/newspaper/ the co-occurrence is computed on a document level, 4https://trends.google.com/trends/ so we consider all the entities that are mentioned in Table 1: (a) Number of news articles containing the term “whatsapp” in our NOW Corpus dataset according to the geographical origin of the corresponding news media; (b) Number of news articles containing the term “whatsapp” in both NOW Corpus and Brazilian news articles datasets according to the year of publication.

(a) Geographical origin of news articles in our NOW Corpus dataset Region Country Occurrences United States 1,244 The Ameri- Canada 507 cas Jamaica 151 Total: 5.73% / 1,902 Singapore 2,889 Southeast Malaysia 2,578 Asia Philippines 253 Hong Kong 124 Total: 17.61% / 5,844 Great Britain 2,251 British Isles Ireland 2,152 Total: 13.27% / 4,403 Region Country Occurrences South Africa 5,274 Nigeria 1,607 Africa Kenya 1,585 Ghana 754 Tanzania 3 Total: 27.79% / 9,223 Australia 895 Oceania New Zealand 306 Total: 3.62% / 1,201 India 8,991 Indian subconti- Pakistan 1,353 nent Sri Lanka 186 Bangladesh 82 Total: 31.98% / 10,612

(b) Year of publication of news articles in both NOW Corpus and Brazilian news articles datasets Year 2010 2011 2012 2013 2014 2015 2016 2017 2018 Total Occurrences in NOW Corpus 4 41 145 393 1,101 1,642 7,266 11,677 14,636 33,185 Occurrences in Brazilian news articles 0 0 4 91 427 785 904 888 948 4,047 our news articles as co-occurring with the key-term ties accompanying the term “whatsapp” are usually “whatsapp”. other social media companies (“Facebook”, “Twit- To perform the named entity recognition, we use the ter”), countries (“US”, “India”), cities (“Dublin”, Natural Language Toolkit (NLTK)5 classifier trained “Delhi”) and demonyms (“African”, “Australian”). to recognize named entities. Since this tool does not When we analyze the continuation of the lists (not support texts in Portuguese, we do not include the displayed here due to space constraints), we also find dataset containing the Brazilian news articles in this that US-American individuals like Mark Zuckerberg analysis. and Donald Trump are highly mentioned across the Table 3 lists the ten most mentioned entities in globe. However, local entities are also mentioned in each different region considered in this investigation. their respective regions: among the entities not dis- Overall, we observe that the most mentioned enti- played in the table, the most mentioned persons or organized groups in each region are Mark Zuckerberg 5http://www.nltk.org/ Table 3: Most mentioned named entities in each region 35 (the entity “whatsapp” is excluded from the lists)

30 Region Entities 25 Facebook, Google, US, 20 The Americas Twitter, Instagram, Apple,

15 Android, American, Europe, China Facebook, Malaysia, India, 10 Southeast Asia Singapore, US, Malaysian, 5 Google, Indian, China, Chinese Normalized Volume of Searches

0 Facebook, Ireland, US, British Isles London, Irish, Google, 5 2010 2011 2012 2013 2014 2015 2016 2017 2018 British, Android, Dublin, Twitter Year Facebook, Twitter, South Africa, Figure 1: Normalized volume of queries for the term Africa Nigeria, African, Kenya, Instagram, Africa, Nigerian, US “whatsapp” in Google Search from 2010 to 2018 Facebook, US, Australia, Google, Australian, Apple, Oceania Table 2: Most frequent queries related to “whatsapp” Instagram, Twitter, on Google Search per year , New Zealand India, Facebook, Indian, Year Search terms Indian subcontinent Delhi, Mumbai, Pakistan, 2010 blackberry, iphone, service, download, android BJP, US, Twitter, Google 2011 blackberry, nokia, app, download, descargar to be related to social and political situations (Islamic error status, status unavailable, 2012 zello, sniffer download, double check State and BJP) from 2015 onwards, showing that WhatsApp gained importance outside of the world of 7, imagens para whatsapp, intell app up, 2013 pagare whatsapp, baixa whatsapp technology and business. blaue haken, for nokia xl, masti.com, 2014 facebook compra whatsapp, blue ticks on whatsapp 4.3 Semantic fields of the surrounding vocab- whatsapp web, whatsapp reborn, caling feature, ulary 2015 whatsapp transparante, llamadas whatsapp Besides the analysis of the named entities that appear negrita whatsapp, gb whatsapp, whatsapp encryption, 2016 in the same news articles as the term “whatsapp”, the el negro del whatsapp, cartas y whatsapp investigation of the general vocabulary co-occurring video status download, whatsapp plus 2017, 2017 with it is also valuable. One of the possible methods status tamil, wasap weed, whatsapp storing of performing such analysis is by observing the seman- gb whatsapp 2018, plus 2018, 2018 tic fields (i.e. 
groups to which semantically related call girls group link, browserling, whatsapp business items belong) of the words that appear in our news articles datasets, so to detect relevant concepts men- (the Americas and Oceania), Barisan Nasional (South- tioned in the texts [CMG+14]. Here, we use the tool east Asia), Paddy Jackson (British Isles), Uhuru Keny- Empath6 [FCB16], which provides a set of 194 lexical atta (Africa) and Narendra Modi (Indian subconti- categories representing different semantic fields, each nent). These findings suggest that news regarding the containing a list of words. Since Empath is available WhatsApp tool might deal with locally relevant enti- only in English, the dataset containing Brazilian arti- ties – which also are, most of the times, related to the cles was again not included in this analysis. local political scenarios. For this task, we first extracted all the words of each The ten most mentioned entities in each year are article and applied lemmatization – that is, we grouped displayed in Table 4. Among the entities that do not together their inflected forms so that they could be an- appear in the table due to space limitations, the most alyzed as single items based on their dictionary forms mentioned persons or organized groups in each year (lemmas). Lemmatization was performed employing are: Steve Jobs (2011), Neil Papworth (2012), Mark the WordNet Lemmatizer function provided by the Zuckerberg (2013 and 2014), Islamic State (2015 and Natural Language Toolkit and using verb as the part- 2016) and the Bharatiya Janata Party – BJP (2017 of-speech argument for the lemmatization method, as and 2018). This indicates that, in general, the most in [CMC+18]. Then, we counted the number of lem- relevant entities in the articles ceased to be linked to technology (Jobs, Papworth, Zuckerberg) and started 6https://github.com/Ejhfast/empath-client Table 4: Most mentioned named entities in each year ● Africa ● Ind. subc. ● S.E. 
Asia region (the entity “whatsapp” is excluded from the lists) ● Brit. Ils. ● Oceania ● America

crime government law Year Entities 2

● ● ● Android, BlackBerry Messenger, BlackBerry, ● ● ● ● ● ● ● ● ● 1 Africa ● ● ● ● ● ● ● ● ● 2010 Kik, iPhone, Nokia, WiFi, ● ● ● Nokia N8, Symbian, India 0

2 ● BlackBerry, iPhone, Facebook, ● ●

● ● ● 1 ● ● ● ● ● Brit. Ils. ● ● ● 2011 Android, , SMS, ● ● ● ● ● ● ● ● ● ● Google, Nokia, Apple, US 0 Facebook, SMS, Android, 2 ● ● ● ● ● ● 2012 India, iPhone, BlackBerry, ● ● ● ● ● ● ● 1 ● ● ● ● Ind. subc. ● ● ● US, Nokia, Skype, Twitter ● ● ● ● ● ● 0 ● Facebook, Android, Twitter, 2013 Google, Apple, India, 2 Words (avg. %) (avg. Words ● ● ● ● ● ● ● ● ● Skype, SMS, BlackBerry, Indian 1 ● ● ● ● Oceania ● ● ● ● ● ● ● ● ● ● ● ● Facebook, Google, Twitter, 0 ● ● 2014 US, Android, India, Apple, 2 ● ● ● ● ● WeChat, China, Instagram ● ● ● ● ● ● ● ● ● 1 ●

S.E. Asia ● ● ● ● Facebook, India, Twitter, ● ● ● ● 0 ● 2015 Google, US, South Africa, Android, Instagram, Skype, Apple 2 ● ● ● ● ● ● ● ● ● ● 1 ● ● Facebook, India, Twitter, America ● ● ● ● ● ● ● ● ● ● ● ● 2016 US, Google, Indian, Android, 0 ● ● ● Instagram, Apple, iPhone 2010 2012 2014 2016 2018 2010 2012 2014 2016 2018 2010 2012 2014 2016 2018 Publish Year Facebook, India, Twitter, 2017 US, Indian, Instagram, Google, London, South Africa, China Figure 2: Average percentage of use of words from the semantic fields crime, government and law in different Facebook, India, Twitter, regions and years (bars indicate standard errors of the 2018 US, Indian, Google, Instagram, Delhi, South African, mean values) 4.4 Co-occurrence networks matized words that appeared in each one of the Em- Another possible analysis on the vocabulary accom- path categories. In this phase, instead of using the panying a key-term in a corpus can be made through absolute frequency of words, we normalized it by di- the observation of co-occurrence networks. In our case, viding the frequency of words in each category by the this method enables the visualization of the most rel- total number of categorized words. evant words that appear in the same news articles as the term “whatsapp” through the means of graphs. In Since analyzing all the 194 Empath categories is this section, we consider both NOW Corpus and the impractical, we manually selected three relevant and Brazilian news articles dataset. noteworthy categories to scrutinize: crime, govern- For this analysis, we first extracted all the words ment and law. In Figure 2, we present the average pro- from the articles and removed stop words using the portion of words belonging to these categories in news lists provided by the Natural Language Toolkit for articles representing different regions across the years. English and Portuguese. 
Then, we extracted the On the whole, we observe an overall increase in the most relevant words from each document by using proportion of words belonging to the three analyzed the term frequency-inverse document frequency (tf-idf) categories, with most of the peaks (such as the ones technique, that reflects how important a word is to a of 2013 in Oceania) probably due to political events document in a corpus [RU11]. We calculated the tf- (e.g. Australian federal election of 2013). This finding idf for each pair (document, word) and extracted from indicates that words from the semantic fields crime, the document the top 50 words with the highest tf-idf government and law are being gradually more asso- scores. ciated with WhatsApp in news from different regions In the following step, we counted the number of of the world, corroborating the finding of Section 4.2 co-occurrences of the pairs of words. For each doc- that shows an increase in the association of WhatsApp ument, we obtained the list with its 50 most relevant with social and political situations in recent years. words (according to tf-idf) and incremented by one the (a) Africa (b) The Americas

(c) British Isles (d) Indian subcontinent

(e) Oceania (f) Southeast Asia

Figure 3: Co-occurrence networks for NOW Corpus news articles counter relative to each pair of words in this list (com- a graph in which vertices represent words and edges bination two by two). Instead of using the absolute indicate their co-occurrence in the same texts. count of articles in which two words co-occur, we nor- malized this value by dividing it by the total number Since there is a considerable number of documents of articles. At the end of this process, we obtained and news articles can be relatively long, the number of vertices and edges is large. For this reason, and due to the fact that our goal is to identify the most rele- (LDA) [BNJ03] to automatically discover topics dis- vant relationships, we selected only the top 200 edges cussed in texts. For this task, we first lowercased and with the highest weights. Finally, we calculated the tokenized all the words in the datasets. Then, we re- maximum spanning tree out of the remaining graph, moved stop words using, once again, the lists provided generating a graph that depicts the most relevant re- by the Natural Language Toolkit (after having added lationships in the format of a tree. the word “whatsapp” to the lists, since it appears in The final networks for the news written in En- all texts). Finally, we ran the LDA algorithm using glish are presented in Figure 3 and clearly show some the Python library spaCy7 for topic modeling. We clusters that generally represent different themes or used topic coherence score [NLGB10] to choose the specific events. Some of the most relevant ones are: optimum number of topics k to be returned by the al- the “data” clusters, related to privacy, regulation and gorithm. For each region and year, the LDA model data protection, containing words like “privacy” and returned these k topics containing terms ordered by the name of information technology companies; the importance in the corresponding text. 
We then se- “encryption” clusters, related to the discussion to- lected the most important topic as the representative wards WhatsApp’s end-to-end encryption and con- of each region and year. taining words like “security” and “message”; and the Table 5 shows the top-ranked ten terms produced “crime” clusters, with words like “police”, “attack” by our LDA model representing the main topic for each and “arrested”. region in each year. Here, for the Brazilian articles, we The network regarding Brazilian news articles, pre- translated the terms from Portuguese to English. sented in Figure 4, also shows two of the afore- It is interesting to observe that, between years 2010 mentioned clusters: the “data” cluster (“dados”, and 2013, the main topic in almost all regions was re- “usu´arios”, “mensagens”) and the “crime” cluster lated to WhatsApp features, device compatibility and (“pol´ıcia”,“civil”). Besides that, it presents at least differences between this application and other tech- two other particularly interesting clusters. The first nologies, like SMS. In the Indian subcontinent, how- one is related to the government blocking WhatsApp ever, the main topic of 2013 was about riots and poli- in Brazil, with words like “bloqueio”, “justi¸ca”and tics. “operadoras”; and the second one is the “truck drivers’ In the Americas, in years 2016-2017, the main topics strike” cluster, represented by the words “camin- of the news articles were also related to WhatsApp honeiros”, “greve” and “governo”. features. However, in 2014 and 2015 we can observe words like “refugee”, “libya” and “jihadist”, probably associated with events in the Arab world. In 2018, the main topic is related to the royal British wedding. In Brazil, in 2014, we observe a topic shift to news related to criminal scams in WhatsApp. 
It is interest- ing to note that the main topic of 2015 is related to a Brazilian court decision to block WhatsApp in the whole country (because the company did not cooper- ate in a criminal investigation). In 2016, year of the impeachment of president Dilma Rousseff, the main topic contains words like “dilma”, “impeachment” and “lula”, while the main topic in 2017 is also about politics, but containing more generic terms, such as “politics” and “government”. In 2018, however, we observe a clear dominance of terms related to Brazil truck drivers’ strike, considered the biggest strike in Figure 4: Co-ccurrence network for Brazilian news ar- the history of the country [Phi18]. In this occasion, ticles WhatsApp played an essential role in the organization of the strike, differently from previous protests that were mostly coordinated through Facebook and Twit- 4.5 Topics addressed ter. This result reinforces the claims that WhatsApp is a valuable tool to communicate and also to share In addition to investigate the vocabulary present in political ideas in Brazil. news articles mentioning WhatsApp, it is also possible In the Indian subcontinent, we note that, between to find the main topics addressed in the texts included in our datasets. We used latent Dirichlet allocation 7https://spacy.io/ Table 5: Main topics for each year in each region

Year The Americas Southeast Asia British Isles Africa Oceania Indian subcontinent Brazil land, asli, app, phone, free, kik, service, federal, right, community, nokia, download, top, message, text, user, message, customary, smartphone, iphone, major, 2010 say, livingston, indigenous, — — mobile, — blackberry, rim, blackberry, malaysian, skype, america, service, growth, contact government, india, ovi, store unlimited recognition charge, dutch, message, phone, business, small, free, text, mxit, knott craig, skype, iphone, net neutrality, handset, android, wilton, owner, call, app, market, platform, call, , extra, mobile, service, 2011 blackberry,patricio, viber, iphone, cheap, attention, free, phone, — internet, law, blackberry, product, plan, network, phone, charge, buy, facebok, mobile, platform, issue, state, iphone, symbiam, entepreneur, service service mobile, smartphone android, text kpn, skype text, app world, system, mobile, app, call, client, message, hutterite, wipf, hong kong, customer, app, year, game, stop, late, phone, nokia, nigerian, app, reconquer, colony, website, data, free, service, iphone, apple malware, asha, service, telecom, telefonica, 2012 waldner,help, people, messaging, communication, ransomware, price, launch, constitution, europe, launch, life, medium, roam, hutchison, social, lync, sonicwall, company, internet, growth, industry, viber, intensify, social application, lead facebook, network prevent,learn cost, news service, call, send instantaneous wannacry communal, market, technology, screen, small, science, kelemu, privacy, user, blackberry, , lau, speak, india, people, brazil, china, blackberry, feel, agricultural, woman, data, user, company, mobile, tencent, indian, blood, consume, keyboard, camera, research, african, message, 2013 device, playbook, commercial,chat, muslim, riot, difficulty, application, woman, school, dutch, policy, service, app, application, martin, politician, emerging, attempt, distraction, fibre, international, 
server, agency, popularity, release ltd, conference secular, domesticate, carbon, design develop, award address book muzaffarnagar invade minister, party, johnson, party, leader, pakistan, indian, canada, refugee, government, right, criminal, victim, cohen, birmingham, show, music, event, government, terrorist, game, family, party, state, page, victim, former, leader, african, art prime minister, attack, muslim 2014 libya, trip, political, law, false, security, britain, secretary, host, competition, election, kill, freedom, help, team, country, leader, click, virus, prime minister, world, winner, black national, senator, kashmir, army, furniture, huddle election browser, federal vote, election republican, journalist president geeta, pakistan, jihadist, cabinet, terrorist, security, carriers, social media, job, burundi, man, white, refugee, camp, woman, girl, edward, plausible, attack, intelligence, application, linkedin, online, election, protest, boat, data, web, ansar burney, outrage, nisman, law, business, service, 2015 twitter, facebook, nkurunziza, president, australian, karachi, nude, csec, communication, block, justice, brazil, company, professional, police, party, service, phone, sushma swaraj, guido, chamber, cameron, gchq, decision, jobseeker, network bujumbura use, communication police, comissioner, trove, inherently encryption voice, president raghavan business, canada, refugee, syrian, government, hong kong, china, scotland yard, nakuru, group, police, karachi, chera, border, aleppo, alkhuder, year, president, chow, market, religious, muslim, political, youth, punjab, ranger, cbsa, trade, homescreen, fear, lula, 2016 hktdc, president, country, british, party, member, arrest, pakistan, red tape, cfib, syria, earthquake, dilma, work, , product, authority, murder, jubilee, nyamira, kashmir, kill, government, agency, greece, aircraft, impeachment, fair, party cafe, pope, bbc governor, leadership medium, protest raise assad brazil, police police, politics, 
attack, school, immigrant, afghanistan, market, report, group, lam, chat, president, nasa, government, police, masood, country, kashmir, india, company, service, chat group, kenyan, raila, woman, westminster immigration, taliban, trump, 2017 user, include, help, responsible, content, leader, election, demonstration, isis, terrorist, trump, employee, policy, war, facebook, mobile, campaign, china, political, party, security, geddel, arrest, kill, visa, ban, refugee, obama, american, information service, team iebc, court prisoner, regime, birmingham policy, president troop arming facebook, people, china, wechat, government, meghan, harry, national, party, facebook, zuckerberg, world, internet, student, school, tencent, president, church, england, government, data, scandal, company, university, high, newsguard, app, truck drivers, ceremony leader, state, 2018 mistake, platform, technology, education, parent, chinese, company, strike, minister, wedding, support, michael, prevent, ad, social media, water, health, traffic, military coup, prince, kate, australian, authority, issue online, disease, study mall road, group, support, vow, royal wedding election, member change, new authority deputy, world cup

years 2013 and 2017, the main topics were related to political themes. In 2013, for example, words like “riot”, “muslim” and “muzaffarnagar” are associated with the riots in Muzaffarnagar, when some rioters used WhatsApp to promote violence. In the years 2014-2016, rumors about terrorist attacks were disseminated through WhatsApp. In 2017, the main topic seems to be associated with the decision of US president Donald Trump not to withdraw US troops from Afghanistan.

In Africa, in 2014, the words “burundi”, “election” and “protest” are related to protests that occurred during the Burundian election. On this occasion, the government temporarily blocked messaging services, including Facebook, WhatsApp and Twitter [Vir16]. In 2017, words like “election” and “president” are associated with the suspicion that disinformation and fake news were being used to influence Kenyans during the elections [Sam17].

There is also a clear dominance of words associated with terrorist attacks in 2015 and 2017 in the British Isles. These words are related to the use of WhatsApp to organize these acts [BBC17]. In Southeast Asia, news on WhatsApp is generally associated with comparisons with WeChat and, in 2018, news in this region was associated with the Facebook–Cambridge Analytica data scandal. In Oceania, in the years 2015-2017, the main topics were associated with refugees and immigration. WhatsApp played an important role during the Syrian Civil War in these years, since journalists and individuals living there used WhatsApp to communicate with people in foreign countries [Boh17].

These results show that WhatsApp usage is highly associated with important political events in several regions of the world – particularly in Africa, Brazil and India. The shift in the main topics addressed in the regions, from WhatsApp features and device compatibility before 2013 to political and criminal themes in the following years, confirms results presented in previous sections that indicate a gradual increase in the association of this application with social and political situations.

4.6 Polarity

Our final investigation sheds light on another dimension of the news articles containing the term “whatsapp”: we now analyze the polarities of the articles – that is, whether the opinions expressed in the texts are mostly positive, negative or neutral. Here, we are interested in analyzing how the polarity of news articles related to WhatsApp changes over time and across regions.

To do this, we performed sentiment analysis on each of the articles in our datasets using SentiStrength [TBP+10], a tool that estimates the strength of positive and negative polarities in texts. This tool receives pieces of text as input and returns a score that varies from -4 (negative) to +4 (positive).

Figure 5 depicts the average polarity of the news articles in each region and in each year, both in the NOW Corpus and in the dataset of Brazilian articles. We observe a clear dominance of negative polarities in almost all regions and years, especially after 2013. News articles containing the term “whatsapp” are becoming more negative over time, probably because of the nature of the news articles themselves: in Africa, for instance, the term “whatsapp” occasionally appeared in news articles about refugees⁸; in India, in articles about the spread of fake news that resulted in violence⁹; in Southeast Asia and the Americas, in news about the promotion of violence¹⁰; in Brazil, in news concerning criminal scams¹¹.

4.7 Summary of results

The most relevant findings presented in this section can be summarized as follows:

• the interest in the term “whatsapp” has been constantly increasing over the years, as indicated by the rise in news about this tool and in Google Search queries for this term (Section 4.1);
• this interest is accompanied by a change of framing around the term “whatsapp” in the media – from topics regarding WhatsApp features and technology to those related to misinformation, politics and criminal scams (Sections 4.1, 4.3, 4.4, 4.5);
• the polarity of news articles containing the term “whatsapp” is becoming more negative over time, probably because this tool is being gradually more associated with crimes, violence and fake news (Section 4.6).
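As an illustration of the aggregation behind the polarity analysis (Section 4.6), the following minimal sketch averages article-level scores per region and year. It assumes that each article has already been assigned a single score on the -4 to +4 scale by a sentiment tool; the function name `average_polarity`, the record layout and the sample values are hypothetical and are not taken from our datasets or from SentiStrength itself.

```python
from collections import defaultdict
from statistics import mean


def average_polarity(records):
    """Return {(region, year): mean score} for (region, year, score) records.

    Each score is assumed to be an article-level polarity that was already
    produced by a sentiment tool, on the -4 (negative) to +4 (positive) scale.
    """
    grouped = defaultdict(list)
    for region, year, score in records:
        # Reject values outside the assumed scale instead of averaging them.
        if not -4 <= score <= 4:
            raise ValueError(f"score out of range: {score}")
        grouped[(region, year)].append(score)
    return {key: mean(scores) for key, scores in grouped.items()}


# Hypothetical, made-up records used only to exercise the function.
sample_records = [
    ("Brazil", 2012, 1),
    ("Brazil", 2012, 0),
    ("Brazil", 2017, -2),
    ("Brazil", 2017, -1),
    ("Africa", 2017, -3),
]

averages = average_polarity(sample_records)
for (region, year), value in sorted(averages.items()):
    print(f"{region} {year}: {value:+.2f}")
```

Grouping by the pair (region, year) yields one value per point of a region's time series, which is the shape of data a plot of average polarity over time requires.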
[Figure 5: line chart of average polarity (y-axis, from −1.0 to 1.0) per year (x-axis, 2010 to 2018), with one line per region: Africa, British Isles, Oceania, The Americas, Brazil, Indian subcontinent, Southeast Asia]

Figure 5: Average polarity of news articles from different regions containing the term “whatsapp” over time

⁸ https://bit.ly/2GSTlo7
⁹ https://bit.ly/30qdcmn
¹⁰ https://nyti.ms/2EehjIJ
¹¹ https://bit.ly/2VNnoGY

5 Concluding Remarks

In this paper, we present a quantitative analysis of the public perception of the messaging tool WhatsApp in news articles. To conduct our analyses, we used two datasets that cover the whole history of the application, from its release for Android devices in 2010 until May 2018. The first of these datasets is a corpus of news articles written in English and published from 2010 to 2018 in 20 countries, while the second one contains Brazilian news articles published from 2012 to 2018. We also used data collected from Google Trends in one of our analyses.

Here, we investigated how media sources from different parts of the world have been reporting stories related to WhatsApp and whether the rise of the public interest in this application over time was accompanied by changes in its perception by the media. We observed changes in the vocabulary, in the mentioned entities, in the addressed topics and in the polarity of the articles mentioning WhatsApp in our datasets. In particular, we noticed a shift in media perception in almost all analyzed regions from the period before 2013, when the focus was on WhatsApp features and device compatibility, to the following years, when the application started to be gradually more associated with misinformation, manipulation and extremism, as well as with political and criminal activities.

The techniques and approaches proposed here can be used to measure the media perception of any company (or entity in general), but WhatsApp was chosen due to its influence on the spread of information (and misinformation) and to the fact that it has been related to topics such as extremism, corruption and political propaganda. In future work, we intend to add more analyses, use news articles from other regions of the world where WhatsApp is popular (e.g. Germany, Indonesia, Malaysia) and compare the perception of WhatsApp in the media with its perception in other sources, like social networks and comments on news articles. Also, we plan to compare the public perception of WhatsApp with that of similar tools (e.g. Telegram, Facebook Messenger, WeChat) in order to understand which of them are more likely to be mentioned in certain types of news – for instance, in political or crime-related news.

6 Acknowledgements

This work was partially supported by CNPq, CAPES, FAPEMIG and the projects InWeb, MASWEB and INCT-Cyber.

References

[ALGB13] Alberto Acerbi, Vasileios Lampos, Philip Garnett, and R Alexander Bentley. The expression of emotions in 20th century books. PLOS ONE, 8(3):e59030, 2013.

[BBC17] BBC News. WhatsApp must not be ‘place for terrorists to hide’. Retrieved from https://bbc.in/2ZIHKz8. Accessed on May 16, 2019, 2017.

[BLK09] Steven Bird, Edward Loper, and Ewan Klein. Natural language processing with Python. O’Reilly Media Inc., 2009.

[BNJ03] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.

[Boh17] Lauren Bohn. Syrian history is unfolding on WhatsApp. Retrieved from https://bit.ly/2Va1VaU. Accessed on May 16, 2019, 2017.

[Cam13] César Nardelli Cambraia. Da lexicologia social a uma lexicologia sócio-histórica: caminhos possíveis. Revista de Estudos da Linguagem, 21(1):157–188, 2013.

[CdO13] Karen Church and Rodrigo de Oliveira. What’s up with whatsapp?: Comparing mobile instant messaging behaviors with traditional SMS. In Proceedings of the 15th International Conference on Human-computer Interaction with Mobile Devices and Services, MobileHCI ’13, pages 352–361, New York, NY, USA, 2013. ACM.

[CMC+18] Evandro Cunha, Gabriel Magno, Josemar Caetano, Douglas Teixeira, and Virgilio Almeida. Fake news as we feel it: perception and conceptualization of the term “fake news” in the media. In Proceedings of the 10th International Conference on Social Informatics (SocInfo 2018), 2018.

[CMG+14] Evandro Cunha, Gabriel Magno, Marcos André Gonçalves, César Cambraia, and Virgilio Almeida. How you post is who you are: Characterizing Google+ status updates across social groups. In Proceedings of the 25th ACM Conference on Hypertext and Social Media (HT’14), pages 212–217, New York, NY, USA, September 2014. Association for Computing Machinery (ACM).

[Con18] John Constine. WhatsApp hits 1.5 billion monthly users. $19b? not so bad. Retrieved from https://tcrn.ch/2LdlavD. Accessed on May 16, 2019, 2018.

[Dav13] Mark Davies. Corpus of News on the Web (NOW): 3+ billion words from 20 countries, updated every day. Available online at https://corpus.byu.edu/now/, 2013.

[FALW+13] Ilias Flaounas, Omar Ali, Thomas Lansdall-Welfare, Tijl De Bie, Nick Mosdell, Justin Lewis, and Nello Cristianini. Research methods in the age of digital journalism: Massive-scale automated analysis of news-content – topics, style and gender. Digital Journalism, 1(1):102–116, 2013.

[FCB16] Ethan Fast, Binbin Chen, and Michael S. Bernstein. Empath: Understanding topic signals in large-scale text. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, CHI ’16, pages 4647–4657, New York, NY, USA, 2016. ACM.

[FCSD15] P. Fiadino, P. Casas, M. Schiavone, and A. D’Alconzo. Online social networks anatomy: On the analysis of facebook and whatsapp in cellular networks. In 2015 IFIP Networking Conference (IFIP Networking), pages 1–9, May 2015.

[FTA+10] Ilias Flaounas, Marco Turchi, Omar Ali, Nick Fyson, Tijl De Bie, Nick Mosdell, Justin Lewis, and Nello Cristianini. The structure of the EU mediasphere. PLOS ONE, 5(12):e14243, 2010.

[GB11] Kristina Gulordava and Marco Baroni. A distributional similarity approach to the detection of semantic change in the Google Books Ngram corpus. In Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics, pages 67–71. Association for Computational Linguistics, 2011.

[Goe18] Vindu Goel. In India, Facebook’s WhatsApp plays central role in elections. Retrieved from https://nyti.ms/2Il6uV3. Accessed on May 16, 2019, 2018.

[GT18] Kiran Garimella and Gareth Tyson. Whatsapp, doc? A first look at whatsapp public group data. CoRR, abs/1804.01473, 2018.

[GWCG18] Gaoyang Guo, Chaokun Wang, Jun Chen, and Pengcheng Ge. Who is answering to whom? finding “reply-to” relations in group chats with long short-term memory networks. In Wookey Lee, Wonik Choi, Sungwon Jung, and Min Song, editors, Proceedings of the 7th International Conference on Emerging Databases, pages 161–171, Singapore, 2018. Springer Singapore.

[Lee11] Kalev Leetaru. Culturomics 2.0: Forecasting large-scale human behavior using global news media tone in time and space. First Monday, 16(9), 2011.

[LWSVC14] Thomas Lansdall-Welfare, Saatviga Sudhahar, Giuseppe A Veltri, and Nello Cristianini. On the coverage of science in the media: A big data study on the impact of the Fukushima disaster. In 2014 IEEE International Conference on Big Data, pages 60–66. IEEE, 2014.

[Mat53] Georges Matoré. La méthode en lexicologie: domaine français. Didier, Paris, 1953.

[MGB17] Andres Moreno, Philip Garrison, and Karthik Bhat. Whatsapp for monitoring and response during critical events: Aggie in the ghana 2016 election. In 14th Int. Conf. on Information Systems for Crisis Response and Management, 2017.

[MSA+11] Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K Gray, Joseph P Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, et al. Quantitative analysis of culture using millions of digitized books. Science, 331(6014):176–182, 2011.

[NFK+18] Nic Newman, Richard Fletcher, Antonis Kalogeropoulos, David AL Levy, and Rasmus Kleis Nielsen. Reuters institute digital news report 2018. http://www.digitalnewsreport.org/. Accessed on May 4, 2018, 2018.

[NLGB10] David Newman, Jey Han Lau, Karl Grieser, and Timothy Baldwin. Automatic evaluation of topic coherence. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT ’10, pages 100–108, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.

[Phi18] Dom Phillips. Truckers’ strike highlights ’a dangerous moment’ for Brazil’s democracy. Retrieved from https://bit.ly/2HlmQMm. Accessed on May 16, 2019, 2018.

[PTHS12] Alexander M Petersen, Joel Tenenbaum, Shlomo Havlin, and H Eugene Stanley. Statistical laws governing fluctuations in word use from word birth to word death. Scientific Reports, 2, 2012.

[Rot14] Steffen Roth. Fashionable functions: A Google Ngram view of trends in functional differentiation (1800-2000). International Journal of Technology and Human Interaction, 10(2):34–58, 2014.

[RSS+18] Avi Rosenfeld, Sigal Sina, David Sarne, Or Avidov, and Sarit Kraus. A study of whatsapp usage patterns and prediction models without message content. CoRR, abs/1802.03393, 2018.

[RU11] Anand Rajaraman and Jeffrey David Ullman. Data Mining, page 1–17. Cambridge University Press, 2011.

[Sam17] Nanjira Sambuli. How Kenya became the latest victim of ‘fake news’. Retrieved from https://bit.ly/2XUS5GM. Accessed on May 16, 2019, 2017.

[SHS+16] Michael Seufert, Tobias Hoßfeld, Anika Schwind, Valentin Burger, and Phuoc Tran-Gia. Group-based communication in whatsapp. In IFIP Networking Conf. (IFIP Networking) and Workshops, 2016, pages 536–541. IEEE, 2016.

[Sta18] Statista. Share of population in selected countries who are active WhatsApp users as of 3rd quarter 2017. Retrieved from https://bit.ly/2k9ZV0y. Accessed on May 16, 2019, 2018.

[TBP+10] Mike Thelwall, Kevan Buckley, Georgios Paltoglou, Di Cai, and Arvid Kappas. Sentiment strength detection in short informal text. J. Am. Soc. Inf. Sci. Technol., 61(12):2544–2558, December 2010.

[Vir16] Thierry Vircoulon. Burundi turns to WhatsApp as political turmoil brings media blackout. Retrieved from https://bit.ly/1U6m7OS. Accessed on May 16, 2019, 2016.

[Wat18] Jim Waterson. Fears mount over WhatsApp’s role in spreading fake news. Retrieved from https://bit.ly/2MzEHD6. Accessed on May 16, 2019, 2018.