Arxiv:2004.11480V1 [Cs.SI] 23 Apr 2020
Total Page:16
File Type:pdf, Size:1020Kb
Characterising User Content on a Multi-lingual Social Network Pushkal Agarwal,1 Kiran Garimella,2 Sagar Joglekar,1 Nishanth Sastry,1,4 Gareth Tyson,3,4 1King’s College London, 2MIT, 3Queen Mary University of London, 4Alan Turing Institute [email protected], [email protected], [email protected], [email protected], [email protected] Abstract We take as our case study the 2019 Indian General Elec- tion, considered to be the largest ever exercise in repre- Social media has been on the vanguard of political infor- sentative democracy, with over 67% of the 900 million mation diffusion in the 21st century. Most studies that look strong electorate participating in the polls. Similar to recent into disinformation, political influence and fake-news focus elections in other countries (Metaxas and Mustafaraj 2012; on mainstream social media platforms. This has inevitably Agarwal, Sastry, and Wood 2019), it is also undeniable made English an important factor in our current understand- ing of political activity on social media. As a result, there has that social media played an important role in the Indian only been a limited number of studies into a large portion General Elections, helping mobilise voters and spreading of the world, including the largest, multilingual and multi- information about the different major parties (Rao 2019; cultural democracy: India. In this paper we present our char- Patel 2019). Given the rich linguistic diversity in India, we acterisation of a multilingual social network in India called wish to understand the manner in which such a large scale ShareChat. We collect an exhaustive dataset across 72 weeks national effort plays out on social media despite the differ- before and during the Indian general elections of 2019, across ences in languages among the electorate. We focus our ef- 14 languages. We investigate the cross lingual dynamics by forts on understanding whether and to what extent informa- clustering visually similar images together, and exploring tion on social media crosses regional and linguistic divides. how they move across language barriers. We find that Tel- ugu, Malayalam, Tamil and Kannada languages tend to be To explore this, we have collected the first large-scale dominant in soliciting political images (often referred to as dataset, consisting of over 1.2 million posts across 14 lan- memes), and posts from Hindi have the largest cross-lingual guages during and before the election campaigning period, diffusion across ShareChat (as well as images containing text from ShareChat.1 This is a media and text-sharing platform in English). In the case of images containing text that cross designed for Indian users, with over 50 million monthly language barriers, we see that language translation is used to active users.2 In functionality, it is similar to services like widen the accessibility. That said, we find cases where the Instagram, with users sharing mostly multimedia content. same image is associated with very different text (and there- However, it has one major difference that aids our research fore meanings). This initial characterisation paves the way design: Unlike other global platforms, different languages for more advanced pipelines to understand the dynamics of are separated into communities of their own, which creates fake and political content in a multi-lingual and non-textual setting. a silo effect and a strong sense of commonality. Despite this regional appeal, it has accumulated tens of millions of users in just 4 years of existence, with most being first time Inter- 1 Introduction net users (Levin 2019). arXiv:2004.11480v1 [cs.SI] 23 Apr 2020 Figure 1 presents the ShareChat homepage, in which users Soon after its independence, India was divided internally must first select the language community into which they into states, based mainly on the different languages spo- post. Note that there is no support for English language dur- ken in each region, thus creating a distinct regional iden- ing the registration process,3 thereby creating a unique social tity among its citizens in addition to the idea of nation- media platform for regional languages. Thus, by mirroring hood (Guha 2017). There have often been conflicts and dis- the language-based divisions of the political map of India, agreements across state borders, where separate regional ShareChat offers a fascinating environment to study Indian identities have been asserted over national unity (Gupta identity along national and regional lines. This is in contrast 1970). Within this historical and political context, we wish to to other relevant platforms, such as WhatsApp, which do not understand the extent to which information is shared across different languages. 1An anonymised version of the dataset is available for non- commercial research usage from https://tiny.cc/share-chat. Copyright c 2020, Association for the Advancement of Artificial 2https://we.sharechat.com/ Intelligence (www.aaai.org). All rights reserved. 3https://techcrunch.com/2019/08/15/sharechat-twitter-seriesd/ a number of works looking into the multi-lingual nature of certain platforms. For example, language differences amongst Wikipedia editors (Kaffee and Simperl 2018), as well as general differences amongst language editions (Hale 2014; Hecht and Gergle 2010; Hale 2015). Often this goes beyond textual differences, showing that even images dif- fer amongst language editions (He et al. 2018). There have also been efforts to bridge these gaps (Bao et al. 2012), al- though often differences extend beyond language to include differences between facts and foci. This often leads editors to work on their non-native language editions (Hale 2015). Although highly relevant, our work differs in that we focus on a social community platform, in which individuals and Figure 1: ShareChat homepage and user interactions. groups communicate. This contrasts starkly with the use of community-driven online encyclopedias, which are focused on communal curation of knowledge. have such formal language boundaries (Melo et al. 2019; More closely related are the range of studies looking at Rao 2019; Garimella and Tyson 2018). the multi-lingual use of social platforms, e.g., blogs (Hale Our key hypothesis is that these language barriers may be 2012), reviews (Hale 2016) and Twitter (Nguyen, Tri- overcome via image-based content, which is less dependent eschnigg, and Cornips 2015; Hong, Convertino, and Chi on linguistic understanding than text or audio. To explore 2011). In most cases, these studies find a dominance of En- this, we formulate the following two research questions: glish language material. For example, Hong et al. (Hong, Convertino, and Chi 2011) found that 51% of tweets are in RQ-1 Can image content transcend language silos? If so, English. Curiously, it has been found that individuals of- how often does this happen, and amongst which lan- ten change their language choices to reflect different po- guages? sitions or opinions (Murthy et al. 2015). Another major RQ-2 What kinds of images are most successful in tran- part of this paper is the study of memes and image shar- scending language silos, and do their semantic meanings ing. Recent works have looked at meme sharing on plat- mutate? forms like Reddit, 4Chan and Gab, where they are often used to spread hate and fake news (Zannettou et al. 2018). To answer these questions, we exploit our ShareChat data These modalities are powerful, in that they often transcend and propose a methodology to cluster perceptually similar languages, and have sometimes provoked real world con- images, across languages, and understand how they change sequences (Zannettou et al. 2019b) or have been used as per community. We find that a number of these images are a mode of information warfare (Zannettou et al. 2019a; contain associated text (Zannettou et al. 2018). We therefore Raman et al. 2019). extract language features within images via Optical Char- Our research differs substantially from the works above. acter Recognition (OCR) and further annotate multi-lingual The above social media platform does not embed language images with translations and semantic tags. directly into their operations. Instead, language communities Using this data, we investigate the characteristics of im- form naturally. In contrast, ShareChat is themed around lan- ages that have a wider cross-lingual adoption. We find that a guage with strict distinctions between communities. Further- majority of the images have embedded text in them, which more, it has a strong focus on image-based sharing, which is recognised by OCR. This presence of text strongly affects we posit may not be impacted by language in the same way the sharing of images across languages. We find, for exam- as textual posts. As such, we are curious to understand the ple, that sharing is easier when the image text is in languages interplay between these strictly defined language communi- such as Hindi and English, which are lingua franca widely ties, and the capacity of images to cross language bound- understood across large portions of India. We also find that aries. To the best of our knowledge, this is the first paper the text of the image is translated when shared across dif- to empirically explore user and language behaviour at such ferent language communities, and that images spread more scale on a multimedia heavy, multi-lingual platform, across easily across linguistically related languages or in languages languages used (in India). spoken in adjacent regions. We even observe that sometimes message change during translation, e.g., political memes are 3 Data Collection altered to make them more consumable in a specific lan- guage, or the meaning is altered on purpose in the shared We have gathered, to the best of our knowledge, the first text. Our results have key implications for understanding po- large-scale multi-lingual social network dataset. To collect litical communications in multi-lingual environments.