Detecting Inappropriate Messages on Sensitive Topics That Could Harm a Company's Reputation
Nikolay Babakov‡, Varvara Logacheva‡, Olga Kozlova†, Nikita Semenov†, and Alexander Panchenko‡
‡Skolkovo Institute of Science and Technology, Moscow, Russia
†Mobile TeleSystems (MTS), Moscow, Russia
{n.babakov, v.logacheva, [email protected]}, {oskozlo9, [email protected]}
arXiv:2103.05345v1 [cs.CL] 9 Mar 2021

Abstract

Not all topics are equally "flammable" in terms of toxicity: a calm discussion of turtles or fishing fuels inappropriate toxic dialogues less often than a discussion of politics or sexual minorities. We define a set of sensitive topics that can yield inappropriate and toxic messages and describe the methodology of collecting and labeling a dataset for appropriateness. While toxicity in user-generated data is well studied, we aim at defining a more fine-grained notion of inappropriateness. The core of inappropriateness is that it can harm the reputation of a speaker. This is different from toxicity in two respects: (i) inappropriateness is topic-related, and (ii) an inappropriate message is not toxic but still unacceptable. We collect and release two datasets for Russian: a topic-labeled dataset and an appropriateness-labeled dataset. We also release classification models trained on this data.

1 Introduction

The classification and prevention of toxicity (malicious behaviour) among users is an important problem for many Internet platforms. Since communication on most social networks is predominantly textual, toxicity classification is usually solved by means of Natural Language Processing (NLP). This problem is even more important for developers of chatbots trained on large amounts of user-generated (and potentially toxic) texts. There is the well-known case of the Microsoft Tay chatbot,¹ which was shut down because it started producing racist, sexist, and other offensive tweets after having been fine-tuned on user data for a day.

However, there exists a similar and equally important problem which is nevertheless overlooked by the research community: the problem of texts which are not offensive as such but can express inappropriate views. If a chatbot says something that does not agree with the views of the company that created it, this can harm the company's reputation. For example, a user starts discussing ways of committing suicide, and the chatbot continues the discussion and even encourages the user to commit suicide. The same applies to a wide range of sensitive topics, such as politics, religion, nationality, drugs, gambling, etc. Ideally, a chatbot should not express any views on these subjects except those that are universally approved (e.g. that drugs are not good for your health). On the other hand, merely avoiding a conversation on any of those topics can be a bad strategy. An example of such unfortunate avoidance that caused even more reputation loss was also demonstrated by Microsoft, this time by its chatbot Zo, a Tay successor. To protect the chatbot from provocative topics, the developers provided it with a set of keywords associated with these topics and instructed it to enforce a change of topic upon seeing any of these words in user messages. However, it turned out that the keywords could occur in completely safe contexts, which resulted in Zo appearing to produce even more offensive answers than Tay.² Therefore, simple keyword-based methods cannot eliminate such errors.
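To make this failure mode concrete, below is a minimal sketch of the kind of keyword-based topic avoidance described above. It is our illustration, not code from Zo or from this paper; the keyword list and the example messages are invented for the demonstration. Keyword matching blocks harmless messages that merely mention a sensitive word while letting through harmful messages that avoid the keywords.

```python
# Minimal sketch of keyword-based topic avoidance and its failure mode.
# The keyword list and example messages are illustrative assumptions,
# not taken from Zo or from the paper.
SENSITIVE_KEYWORDS = {"suicide", "drugs", "genocide", "terrorism"}


def must_change_topic(message: str) -> bool:
    """Naive policy: force a topic change whenever a sensitive keyword appears."""
    tokens = {t.strip(".,!?\"'").lower() for t in message.split()}
    return bool(tokens & SENSITIVE_KEYWORDS)


# False positive: a safe, factual message triggers evasion.
print(must_change_topic("The museum exhibit covers the genocide trials of 1946."))  # True

# False negative: a harmful suggestion without any keyword slips through.
print(must_change_topic("You should just end it all tonight."))  # False
```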
Thus, our goal is to build a system that can predict whether a chatbot's answer is inappropriate in any way. This includes toxicity, but also any answers which express undesirable views or approve of, or prompt the user towards, harmful or illegal actions. To the best of our knowledge, this problem has not been considered before. We formalize it and present a dataset labeled for the presence of such inappropriate content.

Even though we aim at a quite specific task (detection of inappropriate statements in the output of a chatbot in order to prevent reputational harm to a company), in principle the datasets could be used in other use cases, e.g. for flagging inappropriate or frustrating discussions in social media.

It is also important to discuss the ethical aspect of this work. While it can be considered another step towards censorship on the Internet, we suggest that it has many use cases which serve the common good and do not limit free speech, such as parental control or sustaining a respectful tone in online conversations, inter alia. We would like to emphasize that our definition of sensitive topics does not imply that any conversation concerning them needs to be banned. Sensitive topics are simply topics that should be treated with extra care because they tend to flame or catalyze toxicity.

The contributions of our work are three-fold:

• We define the notions of sensitive topics and inappropriate utterances and formulate the task of their classification.
• We collect and release two datasets for Russian: a dataset of user texts labeled for sensitive topics and a dataset labeled for inappropriateness.
• We train and release models which identify the topic of a text and determine its appropriateness.

We open access to the produced datasets, code, and pre-trained models for research use.³

∗ Warning: the paper contains textual data samples which can be considered offensive or inappropriate.
¹ https://www.theverge.com/2016/3/24/11297050/tay-microsoft-chatbot-racist
² https://www.engadget.com/2017-07-04-microsofts-zo-chatbot-picked-up-some-offensive-habits.html
³ https://github.com/skoltech-nlp/inappropriate-sensitive-topics

2 Related Work

There exists a large number of English textual corpora labeled for the presence or absence of toxicity; some resources also indicate the degree of toxicity and its topic. However, the definition of the term "toxicity" itself is not agreed upon in the research community, so different works deal with different texts. Some works refer to any unwanted behaviour as toxicity and do not make any further distinction (Pavlopoulos et al., 2017). However, the majority of researchers use more fine-grained labeling. The Wikipedia Toxic Comment datasets by Jigsaw (Jigsaw, 2018, 2019, 2020), the largest English toxicity datasets available to date, operate with multiple types of toxicity (toxic, obscene, threat, insult, identity hate, etc.). Toxicity differs across multiple axes. Some works concentrate solely on major offence (hate speech) (Davidson et al., 2017), others research more subtle assaults (Breitfeller et al., 2019). Offenses can be directed towards an individual or a group, or be undirected (Zampieri et al., 2019), and can be explicit or implicit (Waseem et al., 2017).

Insults do not necessarily have a topic, but there certainly exist toxic topics, such as sexism, racism, and xenophobia. Waseem and Hovy (2016) tackle sexism and racism; Basile et al. (2019) collect texts which contain sexism and aggression towards immigrants. Besides directly classifying toxic messages by topic, the notion of topic is also used indirectly to collect data: Zampieri et al. (2019) pre-select messages for toxicity labeling based on their topic. Similarly, Hessel and Lee (2019) use topics to find controversial (potentially toxic) discussions.

Such a topic-based view of toxicity causes unintended bias in toxicity detection: a false association of toxicity with a particular topic (LGBT, Islam, feminism, etc.) (Dixon et al., 2018; Vaidya et al., 2020). This is in line with our work, since we also acknowledge that there exist acceptable and unacceptable messages within toxicity-provoking topics. Existing work suggests algorithmic ways of debiasing the trained models: Xia et al. (2020) train their model with two objectives, toxicity and the presence of a toxicity-provoking topic; Zhang et al. (2020) re-weight training instances; Park et al. (2018) create pseudo-data to level off the balance of examples. Unlike our research, these works often deal with a single topic and use topic-specific methods.
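As an illustration of the two-objective debiasing idea mentioned above, the sketch below trains a shared text encoder with separate heads for toxicity and for the presence of a sensitive topic, so that toxicity cannot be learned as a mere proxy for the topic. This is a generic sketch of the approach, not the actual architecture of Xia et al. (2020); the encoder checkpoint and the two-class label sets are assumptions.

```python
# Generic sketch of two-objective (toxicity + topic-presence) training,
# illustrating the debiasing idea; not the architecture of Xia et al. (2020).
import torch
from torch import nn
from transformers import AutoModel


class TwoHeadClassifier(nn.Module):
    def __init__(self, encoder_name: str = "bert-base-multilingual-cased"):
        super().__init__()
        # Shared encoder; the checkpoint name is a placeholder assumption.
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.toxicity_head = nn.Linear(hidden, 2)  # toxic vs. non-toxic
        self.topic_head = nn.Linear(hidden, 2)     # sensitive topic present vs. absent

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor):
        # Use the [CLS] token representation as a pooled sentence embedding.
        cls = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state[:, 0]
        return self.toxicity_head(cls), self.topic_head(cls)


# Training would minimize the sum of two cross-entropy losses, one per head,
# so the model is rewarded for separating toxicity from topic membership.
```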
The main drawback of topic-based toxicity detection in the existing research is the ad-hoc choice of topics: the authors select a small number of popular topics manually or based on the topics which frequently emerge in the data, as in Ousidhoum et al. (2019). Banko et al. (2020) suggest a taxonomy of harmful online behaviour. It contains toxic topics, but they are mixed with other parameters of toxicity (e.g. direction or severity). The work by Salminen et al. (2020) is the only example of an extensive list of toxicity-provoking topics. This is similar to the sensitive topics we deal with. However, our definition is broader: sensitive topics are not only topics that attract toxicity, but topics that can also create unwanted dialogues of multiple types (e.g. incitement to violate the law or to cause harm to oneself or others).

3 Inappropriateness and Sensitive Topics

Consider the following conversation with an unmoderated chatbot:

• User: When did the Armenian Genocide happen?
• Chatbot: In 1915
• User: Do you want to repeat it?
• Chatbot: Sure

This discussion is related to the topics "politics" and "racism" and can indeed cause reputation damage to its developer. In some countries, such as France, it is a criminal offense to deny the Armenian Genocide during World War I.⁴ Note, however, that no offensive or toxic words were employed.

…some of which are legally banned in most countries (e.g. terrorism, slavery), or topics that tend to provoke aggressive argument (e.g. politics) and may be associated with inequality and controversy (e.g. minorities), and thus require special system policies aimed at reducing conversational bias, such as response postprocessing.

This set of topics is based on the suggestions and requirements provided by legal and PR departments
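A minimal sketch of the kind of response postprocessing policy mentioned above: a candidate chatbot reply is passed through a sensitive-topic classifier and an appropriateness classifier, and a neutral fallback is returned when the reply is flagged as an inappropriate message on a sensitive topic. This is our illustration rather than the paper's released code; the checkpoint paths and label names are placeholder assumptions (the released models and data are available from the repository linked in footnote 3).

```python
# Illustrative sketch of response postprocessing with a topic classifier and
# an appropriateness classifier. Checkpoint paths and label names are
# placeholder assumptions, not the names of the models released with the paper.
from transformers import pipeline

topic_clf = pipeline("text-classification", model="path/to/sensitive-topic-classifier")
approp_clf = pipeline("text-classification", model="path/to/appropriateness-classifier")

SAFE_FALLBACK = "I would rather not discuss this topic."


def postprocess(reply: str, threshold: float = 0.5) -> str:
    """Return the reply unchanged unless it looks like an inappropriate
    message on a sensitive topic, in which case return a neutral fallback."""
    topic = topic_clf(reply)[0]    # e.g. {"label": "politics", "score": 0.93}
    approp = approp_clf(reply)[0]  # e.g. {"label": "inappropriate", "score": 0.87}
    on_sensitive_topic = topic["label"] != "none" and topic["score"] >= threshold
    inappropriate = approp["label"] == "inappropriate" and approp["score"] >= threshold
    return SAFE_FALLBACK if on_sensitive_topic and inappropriate else reply


# Example usage with an arbitrary candidate reply.
print(postprocess("Some candidate chatbot reply"))
```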