Mastodon Content Warnings: Inappropriate Contents in a Microblogging Platform
Matteo Zignani, Christian Quadri, Alessia Galdeman, Sabrina Gaito, Gian Paolo Rossi
Computer Science Department, University of Milan, Milan, Italy
Proceedings of the Thirteenth International AAAI Conference on Web and Social Media (ICWSM 2019)

Abstract

Our social communications and the expression of our beliefs and thoughts are becoming increasingly mediated and diffused by online social media. Beyond countless other advantages, this democratization and freedom of expression is also entailing the transfer of unpleasant offline behaviors to the online life, such as cyberbullying, sexting, hate speech and, in general, any behavior not suitable for the online community people belong to. To mitigate or even remove these threats from their platforms, most social media providers are implementing solutions for the automatic detection and filtering of such inappropriate contents. However, the data they use to train their tools are not publicly available. In this context, we release a dataset gathered from Mastodon, a distributed online social network which is formed by communities that impose the rules of publication, and which allows its users to mark their posts as inappropriate if they perceive them as not suitable for the community they belong to. The dataset consists of all the posts with public visibility published by users hosted on servers which support the English language. These data have been collected by implementing an ad-hoc tool for downloading the public timelines of the servers, namely instances, that form the Mastodon platform, along with the meta-data associated to them. The overall corpus contains over 5 million posts, spanning the entire life of Mastodon. We associate to each post a label indicating whether or not its content is inappropriate, as perceived by the user who wrote it. Moreover, we also provide the full description of each instance. Finally, we present some basic statistics about the production of inappropriate posts and the characteristics of their associated textual content.

Introduction

In little over a decade, online social networks and social media have come to permeate almost every aspect of the lives of billions of people all over the world. Undoubtedly, all these social platforms have facilitated communication and broadly increased the audience any sort of message can reach. In fact, people have a large number of online communities available to express, discuss and exchange any kind of information, from their beliefs and views on various topics to artistic expressions. Beyond countless other advantages, this democratization and freedom of expression (Gayo-Avello 2015) is also entailing the transfer of unpleasant offline behaviors to the online life. It is becoming increasingly clear that these platforms open space for contents and conversations which often become inappropriate and hurtful towards the online communities that these systems are supporting. This negative phenomenon is underscored by a recent statement by the Facebook VP of product management saying that the Facebook community "submit tens of millions of reports a week about potentially objectionable contents"1; and the problem is becoming more urgent since children and adolescents easily access social media but are more susceptible to these contents and more exposed to threats such as cyberbullying (Dinakar, Reichart, and Lieberman 2011), online harassment or sexting (O'Keeffe, Clarke-Pearson, and others 2011).

To defeat these threats, social media providers, first and foremost Facebook, are developing AI systems for automatically identifying inappropriate contents (Yenala et al. 2017) on the basis of the large content-moderation datasets at their disposal. However, this kind of data is not publicly available to the research community, which in recent years has been very active in providing tools for the automatic identification of specific types of inappropriate contents such as hate speech (Davidson et al. 2017; Vigna et al. 2017; Ribeiro et al. 2017), sexually explicit texts (Jha and Mamidi 2017) or images, offensive language (Xiang et al. 2012) or racism (Founta et al. 2018). In this respect, we aim at providing the community with a dataset that could help carry on the research on this topic.

1 https://www.fastcompany.com/40566786/heres-how-facebook-uses-ai-to-detect-many-kinds-of-bad-content
With this in mind, our attention has been directed to a decentralized social network whose features are well suited to the purpose: Mastodon, a new and fast-emerging decentralized microblogging platform. In particular, it exhibits the following two features: i) it is organized into communities that impose clearly stated rules of publication and censorship, which must be accepted by the user who wants to register; and ii) it enables the user to publish an alert if she thinks that the post she is publishing may be perceived as inappropriate by any reader. Thus, we release a dataset gathered from Mastodon consisting of two elements: i) all the posts with public visibility published by users hosted on Mastodon servers, namely instances, which support the English language; and ii) the policy, the code of conduct and the prohibited contents of each instance. Since Mastodon comes with a built-in function which allows users to mark their posts as "inappropriate" or "sensitive", we associate to each item a label indicating its appropriateness. Further, the definition of inappropriate – "Something that is inappropriate is not useful or suitable for a particular situation or purpose."2 – asks for a context in order to distinguish what is suitable. In our case the context is provided by the instance meta-data, and we assume that users are aware of the policy of the instance they belong to, as they choose it at registration time. The latter characteristics represent a novelty and a change of paradigm in today's social media ecosystem, since they give the responsibility of publication, and of a possible censorship, back to the users themselves. In practice, the dataset has been collected by implementing an ad-hoc spider that retrieves the timelines of the instances which form the Mastodon platform, along with their meta-data. The overall corpus contains over 5M posts, spanning the entire life of Mastodon. The DOI associated to the dataset, hosted at https://dataverse.harvard.edu, is 10.7910/DVN/R1HKVS.

Dataset

Mastodon characteristics. Before delving into the details of the released dataset and the methodology we used to collect it, we briefly summarize what Mastodon is and the features we exploit to get and label inappropriate contents. Mastodon is a decentralized online social network with microblogging features, where each server runs open source software. The main aim of the project, which dates to 2016, is to restore control of the content distribution channels to the people by avoiding the insertion of sponsored users or posts in the feeds. Due to its decentralized architecture, Mastodon is oriented towards small/medium communities, called instances, which explicitly specify the topics their users should be interested in. To this aim, each instance declares the contents which are not allowed and the spirit of the community it supports, through a full description and a list of allowed topics. All these pieces of information are available at registration time, so that users are aware of the instance policy and of whether or not a content is inappropriate for a given community. In fact, the community orientation strongly impacts the moderation procedures. For instance, one of the most popular European instances, "mastodon.social", bans contents that are illegal [...]

[Figure 1: The box to compose a toot with a content warning.]

The basic act in Mastodon is the creation of a textual content, also called a "toot". A toot, or post, cannot exceed 500 characters in length. Each toot has a privacy option, and users can choose whether the post is public or private (Salve, Mori, and Ricci 2018). Public messages can be aggregated and displayed on the federated timeline – a real-time wall which collects the public contents coming from all the Mastodon instances – while private messages are shared on the timelines of the user's followers only. A third timeline, namely the local timeline, shows messages from users hosted on a single instance only. It is worth noting that Mastodon only allows a chronological ordering of the timelines, thus avoiding any ranking mechanism based on advertisement or other recommendation algorithms. Finally, for the purpose of this paper, we describe a further feature regarding the publication of toots. When a user wants to post something that should not be immediately visible, or that may potentially be inappropriate for the community, she can "hide" it behind a content warning, as shown in Figure 1. By clicking on the "CW" button, a user can enter a short summary of what the "body" of her post contains, namely a spoiler-text, together with the full content of her toot. The system automatically marks this toot as "sensitive" and only shows the spoiler-text in all the timelines. We exploit this latter feature to build our released dataset. This way the toots are labelled by the users themselves, and we assume that they are aware of the policy of the instance and of what is appropriate or not for their community.
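To make the labelling rule concrete, the following minimal sketch shows how such a label can be derived from the fields that the Mastodon REST API returns for each status: a toot hidden behind a content warning is delivered with sensitive set to true and a non-empty spoiler-text. The field names (sensitive, spoiler_text, content) belong to the public Mastodon API; the helper label_toot and the example values are purely illustrative and are not taken from the released dataset.

def label_toot(status):
    """Return 1 if the author marked the toot as sensitive (content warning), 0 otherwise."""
    return int(bool(status.get("sensitive")) or bool(status.get("spoiler_text")))

# Example status, trimmed to the fields relevant for labelling (values are illustrative).
toot = {
    "id": "101",
    "created_at": "2019-01-15T10:21:00.000Z",
    "visibility": "public",
    "sensitive": True,
    "spoiler_text": "uspol",                  # short summary shown in the timelines
    "content": "<p>full text of the toot</p>",
}

assert label_toot(toot) == 1

Under this rule a toot is labelled as inappropriate exactly when its author chose to hide it, which matches the self-labelling assumption described above.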
Collection Methodology. Here we describe the collection methodology for the two main elements of our dataset: i) the instance meta-data, and ii) the local timelines of all the instances which allow toots written in English.
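For illustration only, the minimal sketch below shows the kind of requests such a collection involves, assuming the standard Mastodon REST endpoints /api/v1/instance and /api/v1/timelines/public with local=true; it is not meant to reproduce the actual collection tool, and filtering on the languages field of the instance meta-data is only one possible way to select English-speaking instances.

import requests

def fetch_instance_metadata(domain):
    # /api/v1/instance returns the instance title, description, declared languages, stats, ...
    return requests.get(f"https://{domain}/api/v1/instance", timeout=30).json()

def fetch_local_timeline(domain, pages=5, limit=40):
    """Yield public statuses from the instance's local timeline, newest first."""
    url = f"https://{domain}/api/v1/timelines/public"
    params = {"local": "true", "limit": limit}
    for _ in range(pages):
        statuses = requests.get(url, params=params, timeout=30).json()
        if not statuses:
            break
        yield from statuses
        # paginate backwards in time from the oldest status seen so far
        params["max_id"] = statuses[-1]["id"]

if __name__ == "__main__":
    meta = fetch_instance_metadata("mastodon.social")
    if "en" in meta.get("languages", []):
        for status in fetch_local_timeline("mastodon.social", pages=2):
            print(status["id"], status["sensitive"], status["spoiler_text"])

An actual crawler would iterate these requests over the full list of known instances and respect each instance's rate limits.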