Creating a WhatsApp Dataset to Study Pre-Teen Cyberbullying

Rachele Sprugnoli¹, Stefano Menini¹, Sara Tonelli¹, Filippo Oncini¹,², Enrico Maria Piras¹,³
¹Fondazione Bruno Kessler, Via Sommarive 18, Trento, Italy
²Department of Sociology and Social Research, University of Trento, Via Giuseppe Verdi 26, Trento
³School of Medicine and Surgery, University of Verona, Piazzale L. A. Scuro 10, Verona, Italy
fsprugnoli;menini;satonelli;[email protected] [email protected]

Proceedings of the Second Workshop on Abusive Language Online (ALW2), pages 51–59, Brussels, Belgium, October 31, 2018. © 2018 Association for Computational Linguistics

Abstract

Although WhatsApp is used by teenagers as one major channel of cyberbullying, such interactions remain invisible due to the app's privacy policies, which do not allow ex-post data collection. Indeed, most of the information on these phenomena relies on surveys based on self-reported data.

In order to overcome this limitation, we describe in this paper the activities that led to the creation of a WhatsApp dataset to study cyberbullying among Italian students aged 12-13. We present not only the collected chats with annotations about user role and type of offense, but also the living lab created in a collaboration between researchers and schools to monitor and analyse cyberbullying. Finally, we discuss some open issues, dealing with ethical, operational and epistemic aspects.

1 Introduction

Due to the profound changes in ICT technologies over the last decades, teenagers' communication has undergone a major shift. According to the latest report by the Italian Statistical Institute (ISTAT, 2014), in Italy 82.6% of children aged 11-17 use the mobile phone every day and 56.9% access the web on a daily basis. Despite being of fundamental importance for teenagers’ social life, the use of these new technologies has paved the way for undesirable side effects, among them the digitalisation of traditional forms of harassment. We refer to these forms of harassment as “cyberbullying”. Under a narrow definition, cyberbullying refers only to actions repeated over time with the aim of hurting someone. By definition, cyberbullying is in fact ‘an aggressive, intentional act carried out by a group or individual, using electronic forms of contact, repeatedly and over time against a victim who cannot easily defend him or herself’ (Smith et al., 2008). In everyday life, however, the notion of cyberbullying indicates any episode of online activity aimed at offending, menacing, harassing or stalking another person. The latter definition of cyberbullying is the one currently adopted in most of the research conducted on this phenomenon and in the present work. The phenomenon has been recognised as a ubiquitous public health issue, as the literature clearly highlights negative consequences for teenagers: studies show that victims are more likely to suffer from psychosocial difficulties, affective disorders and lower school performance (Tokunaga, 2010). Since the difference between cyberbullying and traditional forms of bullying lies in the intentional use of electronic forms of contact against the designated victim, data on verbal harassment and cyberbullying instances are particularly useful to analyse the phenomenon. However, due to the private nature of these verbal exchanges, very few datasets are available for the computational analysis of language. It should be noted that the possibility offered by social networking platforms of sharing content privately among users, combined with the increasing digital literacy of teenagers, has the paradoxical effect of hindering the scrutiny and study of actual cyberbullying activities. For instance, although WhatsApp is used by teenagers as one major channel of cyberbullying (Aizenkot and Kashy-Rosenbaum, 2018), their interactions remain invisible due to the privacy policies that impede ex-post data collection. Most of the information on these phenomena therefore relies on surveys based on self-reported data. Precisely for this reason, the possibility to study how cyberbullying interactions and offenses emerge through instant messaging app conversations is crucial to fight and prevent digital harassment.

In this light, this paper presents an innovative corpus of data on cyberbullying interaction gathered through a WhatsApp experimentation with lower secondary school students. After outlining the CREEP project¹ (CybeRbullying EffEcts Prevention), during which the activities were carried out, and the living lab approach that led to the creation of the corpus, we present the provisional results of the computational analysis.

In the discussion, we address the main ethical concerns raised by the experimentation and we discuss the implications of such a methodology as a tool for both research and cyberbullying prevention programs. Annotated data, in a stand-off XML format, and annotation guidelines are available online².

NOTE: This paper contains examples of language which may be offensive to some readers. They do not represent the views of the authors.

¹http://creep-project.eu/
²http://dh.fbk.eu/technologies/whatsapp-dataset

2 Related Work

In this Section we provide an overview of datasets created to study cyberbullying, highlighting the differences with respect to our WhatsApp corpus.

Cyberbullying datasets have been built from several sources, covering many different websites and social networks. However, the main differences are not in the source of the data but in the granularity and detail of the annotations. Reynolds et al. (2011) propose a dataset of questions and answers from Formspring.me, a website with a high amount of cyberbullying content. It consists of 12,851 posts annotated for the presence of cyberbullying and severity. Another resource, developed by Bayzick et al. (2011), consists of conversation transcripts (thread-style) extracted from MySpace.com. The conversations are divided into groups of 10 posts and annotated for presence and typology of cyberbullying (e.g. denigration or harassment, flaming, trolling). Other studies cover more popular social networks such as Instagram and Twitter. For example, the dataset collected from Instagram by Hosseinmardi et al. (2015) consists of 2,218 media sessions (groups of 15+ messages associated with a media item such as a video or a photo), with a single annotation for each media session indicating the presence of cyberaggressive behavior. The work by Sui (2015) is instead on Twitter data, presenting a corpus of 7,321 tweets manually annotated for the presence of cyberbullying. In this dataset, other than cyberbullying, the authors provide different layers of associated information, for example the role of the writer, the typology of attack and the emotion. The aforementioned works have two main differences with respect to the data presented in this paper. First, each annotation refers to an entire message (or a group of messages) and not to specific expressions and portions of text. Second, the categories used for the annotation are more generic than the ones adopted in our guidelines.

A more detailed investigation of cyberbullying dynamics can be found in Van Hee et al. (2015b,a), presenting the work carried out within the project AMiCA³, which aims at monitoring web resources and automatically tracing harmful content in Dutch. The authors propose a detailed analysis of cyberbullying on a dataset of 85,485 posts collected from Ask.fm and manually annotated following a fine-grained annotation scheme that covers roles, typology of harassment and level of harmfulness. This scheme has recently been applied also to English data (Van Hee et al., 2018). This approach is different from the previous ones since its annotations are not necessarily related to entire messages but can be limited to shorter strings (e.g., single words or short sequences of words). We found this approach the most suitable for our data, allowing us to obtain a detailed view of pre-teens' use of offensive language and of the strategies involved in a group with ongoing cyberbullying. The details on how we use the guidelines from the AMiCA project (Van Hee et al., 2015c) as the starting point to define our annotation scheme can be found in Section 5.

Another important difference between our corpus and the datasets previously discussed is the source of our data. Indeed, among the many available instant messaging and social media platforms, WhatsApp is not much investigated because it is private, thus an explicit donation from the chat participants is required to collect data (Verheijen and Stoop, 2016). Although in the last few years WhatsApp corpora have been released in several languages including Italian (see for example the multilingual corpus built in the context of the “What’s up, Switzerland?” project⁴ (Ueberwasser and Stark, 2017)), these corpora have been collected with the goal to investigate specific linguistic traits such as code-switching, or to study whether the informal language used in the chats affects the writing skills of students (Dorantes et al., 2018; Verheijen and Spooren, 2017). With respect to the aforementioned works focused on the analysis of linguistic phenomena in WhatsApp, our aim is to study a social phenomenon, that is cyberbullying, by analysing how it is linguistically encoded in WhatsApp messages. Even if the relation between cyberbullying and What- [...] hand.

The creation of the corpus required the involvement of students in a role-playing simulation of cyberbullying and has so far involved three lower secondary school classes of teenagers aged 12-13 from two different schools based in Trento, in the North-East of Italy. The experimentation itself, described in the next section, was embedded in a larger process that required four to five meetings, one per week, each involving two social scientists, two computational linguists and at least two teachers for the class.

³https://www.lt3.ugent.be/projects/amica
⁴https://www.whatsup-switzerland.ch/index.php/en/
