Information Privacy Opinions on : A Cross-Language Study

Felipe González Andrea Figueroa UTFSM UTFSM Santiago, Chile Valparaíso, Chile [email protected] [email protected]

Claudia López Cecilia Aragon UTFSM University of Washington Valparaíso, Chile Seattle, United States [email protected] [email protected]

ABSTRACT The Cambridge Analytica scandal triggered a conversation on Twitter about data practices and their implications. Our research proposes to leverage this conversation to extend the understanding of how This collaboration was possible thanks to the information privacy is framed by users worldwide. We collected tweets about the scandal written in support of the Fulbright Program, under a 2017- Spanish and English between April and July 2018. We created a word embedding to create a reduced 18 Fulbright Fellowship award. This work was multi-dimensional representation of the tweets in each language. For each embedding, we conducted arXiv:1912.02852v1 [cs.IR] 5 Dec 2019 also partially funded by CONICYT Chile, under open coding to characterize the semantic contexts of key concepts: “information”, “privacy”, “company” grant Conicyt-Fondecyt Iniciación 11161026. and “users” (and their Spanish translations). Through a comparative analysis, we found a broader The first author acknowledges the support emphasis on privacy-related words associated with companies in English. We also identified more of the PIIC program from Universidad Téc- terms related to data collection in English and fewer associated with security mechanisms, control, nica Federico Santa María and CONICYT- PFCHA/MagísterNacional/2019-22190332 and risks. Our findings hint at the potential of cross-language comparisons of text to extend the understanding of worldwide differences in information privacy perspectives.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). CSCW ’19 Companion, November 9–13, 2019, Austin, TX, USA © 2019 Copyright held by the owner/author(s). ACM ISBN 978-1-4503-6692-2/19/11. https://doi.org/10.1145/3311957.3359501 Poster Presentation CSCW ’19 Companion, November 9–13, 2019, Austin, TX, USA

ACM Reference Format: Felipe González, Andrea Figueroa, Claudia López, and Cecilia Aragon. 2019. Information Privacy Opinions on Twitter: A Cross-Language Study. In 2019 Computer Supported Cooperative Work and Social Computing Companion Publication (CSCW ’19 Companion), November 9–13, 2019, Austin, TX, USA. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3311957.3359501

INTRODUCTION Information privacy has been defined as “the ability of individuals to control the terms under which their personal information is acquired and used” [3]. According to public opinion polls, privacy is Table 1: Datasets before and after data one of the major concerns of people nowadays [14]. Different measurement instruments have been cleaning developed to identify, analyze and evaluate privacy concerns [1, 17]. Two well-known instruments are the Concern for Information Privacy (CFIP) [15] and Internet Users’ Information Privacy Concerns Dataset Spanish English (IUIPC) [10]. IUIPC adapts CFIP into the internet context [14] and is widely used even today [12, 17]. While it is expected that individuals from different world regions have different cultures, values #Tweets #Users #Tweets #Users and laws that can result in differences in their perceptions of information privacy and its impacts [1], Total 472,363 222,352 7,476,988 1,846,542 there is still a limited understanding of such differences. Original 106,656 47,951 1,572,371 574,452 In 2018, it was revealed that the of 87 million users were exposed and Human 74,644 36,056 975,678 410,180 used by Cambridge Analytica to support political campaigns [9]. This Cambridge Analytica scandal sparked a worldwide conversation on Twitter about this particular misuse of user data. Our project seeks to use these online public communications to identify differences and similarities on data privacy perspectives by people who write in different languages, which we see as a proxy to represent different world regions. We report our preliminary results and discuss their potential to deepen the understanding of information privacy perspectives worldwide.

RELATED WORK AND RESEARCH QUESTIONS The IUIPC instrument was designed to reflect and identify Internet users’ concerns about information privacy from a user perspective of fairness [10]. It contains three dimensions of concerns: the trade-off between personal data collection by others and perceived benefits, the users’ ability to control their personal information, and their awareness of organizational information privacy practices. Most research on privacy concerns has been conducted through questionnaires [16], such as the IUIPC [12]. While widely used to understand personal privacy concerns in North America and Europe (e.g., [5, 8]), these surveys have been less frequently applied in other world regions. Thus, this tendency has left open many research questions about how privacy perspectives vary across theglobe. Recent work has explored text mining as an alternative research method. Raber and Krüger found that IUIPC dimensions can be derived from written text12 [ ]. They observed a correlation between IUIPC concerns, as measured by a questionnaire, and LIWC language features of posts Poster Presentation CSCW ’19 Companion, November 9–13, 2019, Austin, TX, USA

from 100 users. Inspired by this line of work, we propose to analyze the semantic context of privacy- Table 2: Emergent categories during open related words in online communications in different languages to explore its potential for revealing coding worldwide differences in information privacy perspectives. In particular, we seek to use tweetsabout the Cambridge Analytica scandal in two languages as a corpus to observe similarities and differences Categories Description in how people conceptualize information privacy. Two research questions guide our work: Data & Informa- Direct references to these concepts and • RQ1: Do IUIPC dimensions emerge from online communications after a data breach scandal? tion examples of user data and information • RQ2: Are there differences in the semantic contexts of privacy-related words that comefrom Companies & Or- Entities that manipulate user data for online communications written in two different languages? ganizations their own purposes Users Data owners Data collection, Technologies or techniques to obtain, col- DATA AND METHODS handling and/or lect and/or handle data To answer our research questions, we collected tweets written in Spanish and English between April storage 1st and July 10th, 2018. We used the Twitter API to capture tweets that include hashtags or keywords Privacy & secu- Words associated with data privacy and related to the Cambridge Analytica scandal or data privacy, such as “#CambridgeAnalytica” and rity terms security “Facebook privacy”. We retrieved more than 7.4 million tweets written in English, and more than Security mecha- Tools and techniques that implement se- 470,000 tweets in Spanish (see Table 1). nisms curity services We cleaned our dataset in two ways. First, we removed all retweets to focus on original opinions. Privacy & secu- Entities or bad practices that can compro- This step downsized both datasets by 80%. Second, we attempted to eliminate tweets generated by rity risks mise sensitive data automated accounts so our study could indeed reflect people’s opinions. We chose Botometer [4] to Ownership Control over personal information identify potential bots. More details about this process can be found in [6]. Our final dataset includes agency Regulation Law, rule or regulation that controls the 74,644 tweets in Spanish and 975,678 tweets in English (see Table 1). use of user data Following Rho et al.’s approach [13], we used word embeddings [11] to analyze the semantic context Synonymous Words with the same meaning than the in which a concept under study is framed. Based on co-occurrence of terms, word embeddings create token a reduced multi-dimensional representation of a corpus of text that allows assessing the semantic Attribute or char- A characteristic of the token proximity among terms in a corpus. Thus, analyzing the closest terms of a given word can reveal the acteristic context in which it is used [13]. Action Action or activity linked to the token To enable a cross-language comparison, we built a word embedding for all tweets written in the Third party Entity that can not be categorized as User same language. Before creating them, we removed stopwords and transformed the text to lowercase. or Company because there is not suffi- We customized our stopwords to ensure that digits and symbols like “#” were removed but not the cient contextual information to do so words that contain it. Links and usernames were removed. As a result, our corpus comprised 76,128 Reaction or atti- Way of feeling or acting toward a person, unique words in English and 21,736 in Spanish. tude thing or situation. We considered seven word embedding architectures that involve Word2Vec/FastText, CBOW/Skipgram, Undetermined The relation between the token and the and different numbers of dimensions and epochs. Each word embedding architecture for the Eng- word is not exactly known, established or Google Analogy Test Set MTurk-287 defined lish corpus was evaluated over 15 evaluation methods [7] (e.g., , , ESSLI_1a). Considering all terms that appear at least 3 times and a window of 5 terms, a Word2Vec Token-category: yellow Privacy: light blue Other: gray CBOW architecture with 300 dimensions trained during 50 epochs achieved the best performance. Poster Presentation CSCW ’19 Companion, November 9–13, 2019, Austin, TX, USA

The same architecture was used for the Spanish corpus. Gender bias in our embeddings was reduced Table 3: Representation of privacy cate- using Bolukbasi’s methodology [2]. gories per each token in the Spanish and To analyze the semantic context in which information privacy was framed in Spanish and English, English word embeddings two of the authors conducted open coding of the 40 closest words to four privacy-related tokens: privacy, information, users, and company. Their respective Spanish translation were also used: privacidad, Spanish (%) English (%) información, usuarios, and empresa. Through an iterative process, the coders consolidated the open Company 2.50 12.50 codes into 15 categories (see Table 2). A total of 320 words were independently re-classified in these Users 30.00 30.00 categories. Considering the eight tokens, the average Cohen’s kappa score was 0.707. Information 32.50 27.50 Privacy 62.50 62.50 RESULTS AND CONCLUSIONS The coding process resulted in 15 categories. Three of them match our initial tokens: information, com- pany, and users. Instead, the token privacy can be associated with six different categories. Answering RQ1, five of them are related to the IUIPC dimensions. The IUIPC’s collection dimension refers to the “degree to which a person is concerned about the amount of individual-specific data possessed by others relative to the value of benefits received” [10]. Through open coding, we identify a data collection, handling and/or storage category that contains words associated with technology or techniques useful to obtain, collect or handle data, including databases, services, app and website. The control dimension denotes concerns about control over personal information. This is often exercised through approval, modification and opportunity to opt-in or opt-out [10]. Terms related to this dimension appear in the coding phase (e.g., consent, opt, permission) and are categorized as ownership agency. This category also includes advice directed to users and good privacy practices (e.g., prevent, protect , and avoid in Spanish along with cuidatusdatos, which means take care of your data). The third IUIPC dimension is awareness that refers to individual concerns about her/his awareness of organizational information privacy practices [10]. Three of our categories are associated with this dimension. Privacy and security terms comprise words such as confidentiality, transparency, safety, seguridaddigital (digital security), ciberseguridad (cybersecurity). Security mechanisms include, among others, the following terms: contraseñas (passwords) and encryption. Finally, privacy & security risks refers to entities or bad practices that can compromise sensitive data, for example: troyano (trojan), databreach, grooming, ciberdelincuente (cybercriminal). Through a comparative analysis of categories by token, we observe that a broader proportion of words is covered by privacy-related categories in English than in Spanish (see Table 3). This difference Figure 1: Representation of data privacy is largely explained by the context around the company token, which is 5 times larger in English. categories in each word embedding Regarding the IUIPC dimensions, there is a broader emphasis on collection as the category data collection, handling and/or storage cover more words in the English tokens, compared to the Spanish ones (see Figure 1). In turn, we note that the three categories related to awareness (privacy and security Poster Presentation CSCW ’19 Companion, November 9–13, 2019, Austin, TX, USA

terms, risks and security mechanisms) cover more words in Spanish than English. Lastly, ownership agency (control in IUIPC) appears slightly more in Spanish, especially regarding the token user. Full data is available at https://github.com/ Finally, another privacy-related category that emerges from our coding but can not be directly gonzalezf/Information-Privacy-Opinions-on- associated with the IUIPC dimensions is regulations. This category was highly relevant when Twitter-A-Cross-Language-Study analyzing the semantic context of the token privacy in both languages, but slightly more in Spanish. Overall, the preliminary results reported here suggest that social media text written in two languages can be used to reveal different emphasis on information privacy perspectives (RQ2).

REFERENCES [1] France Bélanger and Robert E Crossler. 2011. Privacy in the digital age: a review of information privacy research in information systems. MIS quarterly 35, 4 (2011), 1017–1042. [2] Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. 2016. Man is to computer programmer as woman is to homemaker? debiasing word embeddings. In Advances in neural information processing systems. 4349–4357. [3] Mary J Culnan and Robert J Bies. 2003. Consumer privacy: Balancing economic and justice considerations. Journal of social issues 59, 2 (2003), 323–342. [4] Clayton Allen Davis, Onur Varol, Emilio Ferrara, Alessandro Flammini, and Filippo Menczer. 2016. Botornot: A system to evaluate social bots. In Proceedings of the 25th International Conference Companion on World Wide Web. International World Wide Web Conferences Steering Committee, 273–274. [5] Tamara Dinev, Massimo Bellotto, Paul Hart, Vincenzo Russo, Ilaria Serra, and Christian Colautti. 2006. Privacy calculus model in e-commerce–a study of and the United States. European Journal of Information Systems 15, 4 (2006), 389–402. [6] Felipe González, Yihan Yu, Andrea Figueroa, Claudia López, and Cecilia Aragon. 2019. Global Reactions to the Cambridge Analytica Scandal: A Cross-Language Social Media Study. In Companion Proceedings of The 2019 World Wide Web Conference. ACM, 799–806. [7] Stanisław Jastrzebski, Damian Leśniak, and Wojciech Marian Czarnecki. 2017. How to evaluate word embeddings? on importance of data efficiency and simple supervised tasks. (2017). [8] Naz Kaya and Margaret J Weber. 2003. Cross-cultural differences in the perception of crowding and privacy regulation: American and Turkish students. Journal of environmental psychology 23, 3 (2003), 301–309. [9] Jean-Rémi Lapaire. 2018. Why content matters. Zuckerberg, Vox Media and the Cambridge Analytica data leak. ANTARES: Letras e Humanidades 10, 20 (2018), 88–110. [10] Naresh K Malhotra, Sung S Kim, and James Agarwal. 2004. Internet users’ information privacy concerns (IUIPC): The construct, the scale, and a causal model. Information systems research 15, 4 (2004), 336–355. [11] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems. 3111–3119. [12] Frederic Raber and Antonio Krüger. 2018. Privacy Perceiver: Using Social Network Posts to Derive Users’ Privacy Measures. In Adjunct Publication of the 26th Conference on User Modeling, Adaptation and Personalization. ACM, 227–232. [13] Eugenia Ha Rim Rho, Gloria Mark, and Melissa Mazmanian. 2018. Fostering Civil Discourse Online: Linguistic Behavior in Comments of #MeToo Articles across Political Perspectives. Proceedings of the ACM on Human-Computer Interaction 2, CSCW (2018), 147. Poster Presentation CSCW ’19 Companion, November 9–13, 2019, Austin, TX, USA

[14] H Jeff Smith, Tamara Dinev, and Heng Xu. 2011. Information privacy research: an interdisciplinary review. MIS quarterly 35, 4 (2011), 989–1016. [15] H Jeff Smith, Sandra J Milberg, and Sandra J Burke. 1996. Information privacy: measuring individuals’ concerns about organizational practices. MIS quarterly (1996), 167–196. [16] Blase Ur and Yang Wang. 2013. A cross-cultural framework for protecting user privacy in online social media. In Proceedings of the 22nd International Conference on World Wide Web. ACM, 755–762. [17] Haejung Yun, Gwanhoo Lee, and Dan J Kim. 2019. A chronological review of empirical research on personal information privacy concerns: An analysis of contexts and research constructs. Information & Management 56, 4 (2019), 570–601.