Deep Learning for User Comment Moderation

John Pavlopoulos, StrainTek, Athens, Greece ([email protected])
Prodromos Malakasiotis, StrainTek, Athens, Greece ([email protected])
Ion Androutsopoulos, Department of Informatics, Athens University of Economics and Business, Greece ([email protected])

Abstract

Experimenting with a new dataset of 1.6M user comments from a Greek news portal and existing datasets of English Wikipedia comments, we show that an RNN outperforms the previous state of the art in moderation. A deep, classification-specific attention mechanism improves further the overall performance of the RNN. We also compare against a CNN and a word-list baseline, considering both fully automatic and semi-automatic moderation.

[Figure 1: Semi-automatic moderation. Comments the system is confident about are automatically accepted or rejected; uncertain comments are shown to a moderator.]

1 Introduction

User comments play a central role in social media and online discussion fora. News portals often also allow their readers to comment in order to get feedback, engage their readers, and build customer loyalty. User comments, however, and more generally user content, can also be abusive (e.g., bullying, profanity, hate speech). Social media platforms are increasingly under pressure to combat abusive content. News portals also suffer from abusive user comments, which damage their reputation and make them liable to fines, e.g., when hosting comments encouraging illegal actions. They often employ moderators, who are frequently overwhelmed by the volume of comments. Readers are disappointed when non-abusive comments do not appear quickly online because of moderation delays. Smaller news portals may be unable to employ moderators, and some are forced to shut down their comments sections entirely.1

We examine how deep learning (Goodfellow et al., 2016; Goldberg, 2016) can be used to moderate user comments. We experiment with a new dataset of approx. 1.6M manually moderated user comments from a Greek sports portal (Gazzetta), which we make publicly available.2 Furthermore, we provide word embeddings pre-trained on 5.2M comments from the same portal. We also experiment on the datasets of Wulczyn et al. (2017), which contain English Wikipedia comments labeled for personal attacks, aggression, and toxicity.

In a fully automatic scenario, a system directly accepts or rejects comments. Although this scenario may be the only available one, e.g., when portals cannot afford moderators, it is unrealistic to expect that fully automatic moderation will be perfect, because abusive comments may involve irony, sarcasm, harassment without profanity etc., which are particularly difficult for machines to handle. When moderators are available, it is more realistic to develop semi-automatic systems that assist rather than replace them, a scenario that has not been considered in previous work. Comments for which the system is uncertain (Fig. 1) are shown to a moderator to decide; all other comments are accepted or rejected by the system. We discuss how moderation systems can be tuned, depending on the availability and workload of moderators. We also introduce additional evaluation measures for the semi-automatic scenario.

1 See, for example, http://niemanreports.org/articles/the-future-of-comments/.
2 The portal is http://www.gazzetta.gr/. Instructions to obtain the Gazzetta data will be posted at http://nlp.cs.aueb.gr/software.html.

On both Gazzetta and Wikipedia comments and for both scenarios (automatic, semi-automatic), we show that a recurrent neural network (RNN) outperforms the system of Wulczyn et al. (2017), the previous state of the art for comment moderation, which employed logistic regression (LR) or a multi-layered Perceptron (MLP). We also propose an attention mechanism that improves the overall performance of the RNN. Our attention differs from most previous ones (Bahdanau et al., 2015; Luong et al., 2015) in that it is used in text classification, where there is no previously generated output subsequence to drive the attention, unlike sequence-to-sequence models (Sutskever et al., 2014). In effect, our attention mechanism detects the words of a comment that mostly affect the classification decision (accept, reject), by examining them in the context of the particular comment.

Our main contributions are: (i) We release a new dataset of 1.6M moderated user comments. (ii) We are among the first to apply deep learning to user comment moderation, and we show that an RNN with a novel classification-specific attention mechanism outperforms the previous state of the art. (iii) Unlike previous work, we also consider a semi-automatic scenario, along with threshold tuning and evaluation measures for it.

2 Datasets

We first discuss the datasets we used, to help acquaint the reader with the problem.

2.1 Gazzetta dataset

There are approx. 1.45M training comments (covering Jan. 1, 2015 to Oct. 6, 2016) in the Gazzetta dataset; we call them G-TRAIN-L (Table 1). Some experiments use only the first 100K comments of G-TRAIN-L, called G-TRAIN-S. An additional set of 60,900 comments (Oct. 7 to Nov. 11, 2016) was split to a development set (G-DEV, 29,700 comments), a large test set (G-TEST-L, 29,700), and a small test set (G-TEST-S, 1,500). Gazzetta's moderators (2 full-time, plus journalists occasionally helping) are occasionally instructed to be stricter (e.g., during violent events). To get a more accurate view of performance in normal situations, we manually re-moderated (labeled as 'accept' or 'reject') the comments of G-TEST-S, producing G-TEST-S-R. The reject ratio is approximately 30% in all subsets, except for G-TEST-S-R, where it drops to 22%, because there are no occasions where the moderators were instructed to be stricter in G-TEST-S-R.

Table 1: Statistics of the datasets used.

Dataset/Split  | Accepted       | Rejected       | Total
G-TRAIN-L      | 960,378 (66%)  | 489,222 (34%)  | 1.45M
G-TRAIN-S      | 67,828 (68%)   | 32,172 (32%)   | 100,000
G-DEV          | 20,236 (68%)   | 9,464 (32%)    | 29,700
G-TEST-L       | 20,064 (68%)   | 9,636 (32%)    | 29,700
G-TEST-S       | 1,068 (71%)    | 432 (29%)      | 1,500
G-TEST-S-R     | 1,174 (78%)    | 326 (22%)      | 1,500
W-ATT-TRAIN    | 61,447 (88%)   | 8,079 (12%)    | 69,526
W-ATT-DEV      | 20,405 (88%)   | 2,755 (12%)    | 23,160
W-ATT-TEST     | 20,422 (88%)   | 2,756 (12%)    | 23,178
W-TOX-TRAIN    | 86,447 (90%)   | 9,245 (10%)    | 95,692
W-TOX-DEV      | 29,059 (90%)   | 3,069 (10%)    | 32,128
W-TOX-TEST     | 28,818 (90%)   | 3,048 (10%)    | 31,866

Each G-TEST-S-R comment was re-moderated by 5 annotators. Krippendorff's (2004) alpha was 0.4762, close to the value (0.45) reported by Wulczyn et al. (2017) for Wikipedia comments. Using Cohen's Kappa (Cohen, 1960), the mean pairwise agreement was 0.4749. The mean pairwise percentage of agreement (% of comments each pair of annotators agreed on) was 81.33%. Cohen's Kappa and Krippendorff's alpha lead to moderate scores, because they account for agreement by chance, which is high when there is class imbalance (22% reject, 78% accept in G-TEST-S-R).

We also provide 300-dimensional word embeddings, pre-trained on approx. 5.2M comments (268M tokens) from Gazzetta using WORD2VEC (Mikolov et al., 2013a,b).3 This larger dataset cannot be used to train classifiers, because most of its comments are from a period (before 2015) when Gazzetta did not employ moderators.

2.2 Wikipedia datasets

Wulczyn et al. (2017) created three datasets containing English Wikipedia talk page comments.

Attacks dataset: This dataset contains approx. 115K comments, which were labeled as personal attacks (reject) or not (accept) using crowdsourcing. Each comment was labeled by at least 10 annotators. Inter-annotator agreement, measured on a random sample of 1K comments using Krippendorff's (2004) alpha, was 0.45. The gold label of each comment is determined by the majority of annotators, leading to binary labels (accept, reject). Alternatively, the gold label is the percentage of annotators that labeled the comment as 'accept' (or 'reject'), leading to probabilistic labels.4

3 We used CBOW, window size 5, min. term freq. 5, negative sampling, obtaining a vocabulary size of approx. 478K.
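The inter-annotator agreement figures reported above (mean pairwise Cohen's Kappa and percentage agreement over the 5 re-moderation annotators of G-TEST-S-R) can be reproduced roughly as follows. This is a minimal sketch, not the authors' code; the annotation matrix shown is a made-up placeholder with 1 = 'accept' and 0 = 'reject'.

```python
from itertools import combinations
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical annotation matrix: one row per comment, one column per annotator.
labels = np.array([
    [1, 1, 1, 0, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
])

kappas, agreements = [], []
for i, j in combinations(range(labels.shape[1]), 2):
    kappas.append(cohen_kappa_score(labels[:, i], labels[:, j]))   # pairwise Cohen's Kappa
    agreements.append(np.mean(labels[:, i] == labels[:, j]))       # pairwise % agreement

print("mean pairwise Cohen's Kappa:", np.mean(kappas))
print("mean pairwise % agreement:", 100 * np.mean(agreements))
```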

The dataset is split in three parts (Table 1): training (W-ATT-TRAIN, 69,526 comments), development (W-ATT-DEV, 23,160), and test (W-ATT-TEST, 23,178 comments). In all three parts, the rejected comments are 12%, but this ratio is artificial (in effect, Wulczyn et al. oversampled comments posted by banned users), unlike the Gazzetta subsets, where the truly observed accept/reject ratios are used.

Toxicity dataset: This dataset was created like the previous one, but contains more comments (159,686), now labeled as toxic (reject) or not (accept). Inter-annotator agreement was not reported. Again, binary or probabilistic gold labels can be used. The dataset is split in three parts (Table 1): training (W-TOX-TRAIN, 95,692 comments), development (W-TOX-DEV, 32,128), and test (W-TOX-TEST, 31,866). In all three parts, the rejected (toxic) comments are 10%, again an artificial ratio.

Wikipedia comments are longer (median 38 and 39 tokens for attacks, toxicity) compared to Gazzetta's (median 25). Wulczyn et al. (2017) also created an 'aggression' dataset containing the same comments as the personal attacks one, but now labeled as aggressive or not. The (probabilistic) labels of the two datasets are very highly correlated (0.8992 Spearman, 0.9718 Pearson) and we do not consider the aggression dataset further.

3 Methods

We experimented with an RNN operating on word embeddings, the same RNN enhanced with our attention mechanism (a-RNN), several variants of a-RNN, a vanilla convolutional neural network (CNN) also operating on word embeddings, the DETOX system of Wulczyn et al. (2017), and a baseline that uses word lists with precision scores.

3.1 DETOX

DETOX (Wulczyn et al., 2017) was the previous state of the art in comment moderation, in the sense that it had the best reported results on the Wikipedia datasets (Section 2.2), the largest previously publicly available datasets of moderated user comments.5 DETOX represents each comment as a bag of word n-grams (n ≤ 2, each comment becomes a bag containing its 1-grams and 2-grams) or a bag of character n-grams (n ≤ 5, each comment becomes a bag containing its character 1-grams, ..., 5-grams). DETOX can rely on a logistic regression (LR) or multi-layer Perceptron (MLP) classifier, and use binary or probabilistic gold labels (Section 2.2) during training. We used the DETOX implementation of Wulczyn et al. and the same grid search to tune the hyper-parameters that select word or character n-grams, classifier (LR or MLP), and gold labels (binary or probabilistic). For Gazzetta, only binary gold labels were possible, since G-TRAIN-L and G-TRAIN-S have a single gold label per comment. Unlike Wulczyn et al., we tuned the hyper-parameters by evaluating (computing AUC and Spearman, Section 4) on a random 2% of held-out comments of W-ATT-TRAIN, W-TOX-TRAIN, or G-TRAIN-S, instead of the development subsets, to be able to obtain more realistic results from the development sets while developing the methods. The tuning always selected character n-grams, as in the work of Wulczyn et al., and preferred LR over MLP, whereas Wulczyn et al. reported slightly higher performance for the MLP on W-ATT-DEV.6 The tuning also selected probabilistic labels when available (Wikipedia datasets), as in the work of Wulczyn et al.

3.2 RNN-based methods

RNN: The RNN method is a chain of GRU cells (Cho et al., 2014) that transforms the tokens w_1, ..., w_k of each comment to hidden states h_1, ..., h_k, followed by an LR layer that uses h_k to classify the comment (accept, reject). Formally, given the vocabulary V, a matrix E ∈ R^{d×|V|} containing d-dimensional word embeddings, an initial h_0, and a comment c = ⟨w_1, ..., w_k⟩, the RNN computes h_1, ..., h_k as follows (h_t ∈ R^m):

  h̃_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t-1}) + b_h)
  h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t
  z_t = σ(W_z x_t + U_z h_{t-1} + b_z)
  r_t = σ(W_r x_t + U_r h_{t-1} + b_r)

where h̃_t ∈ R^m is the proposed hidden state at position t, obtained by considering the word embedding x_t of token w_t and the previous hidden state h_{t-1}; ⊙ denotes element-wise multiplication; r_t ∈ R^m is the reset gate (for r_t all zeros, it allows the RNN to forget the previous state h_{t-1}); z_t ∈ R^m is the update gate (for z_t all zeros, it allows the RNN to ignore the new proposed h̃_t, hence also x_t, and copy h_{t-1} as h_t); σ is the sigmoid function; W_h, W_z, W_r ∈ R^{m×d}; U_h, U_z, U_r ∈ R^{m×m}; b_h, b_z, b_r ∈ R^m. Once h_k has been computed, the LR layer estimates the probability that comment c should be rejected, with W_p ∈ R^{1×m}, b_p ∈ R:

  P_RNN(reject | c) = σ(W_p h_k + b_p)

4 We also construct probabilistic gold labels (in addition to binary ones) for G-TEST-S-R, where there are 5 annotators.
5 Two of the co-authors of Wulczyn et al. (2017) are with Jigsaw, who recently announced Perspective, a system to detect 'toxic' comments. Perspective is not the same as DETOX (personal communication), but we were unable to obtain scientific articles describing it. We have applied for access to its API (http://www.perspectiveapi.com/).
6 Wulczyn et al. (2017) report results only on W-ATT-DEV. We repeated the tuning by evaluating on W-ATT-DEV, and again character n-grams with LR were selected.
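The RNN method just described can be sketched roughly as follows in Keras (the authors mention using Keras with the TensorFlow back-end). The dimensions follow Section 3.2 (d = 300, m = 128), but the vocabulary size, maximum comment length, masking, and initialization of E are assumptions rather than the authors' released code.

```python
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 100_000   # |V|, assumed; the actual vocabularies differ per dataset
MAX_LEN = 200          # maximum comment length in tokens, assumed
d, m = 300, 128        # embedding and hidden-state sizes (Section 3.2)

comment = keras.Input(shape=(MAX_LEN,), dtype="int32")        # token ids w_1 .. w_k
x = layers.Embedding(VOCAB_SIZE, d, mask_zero=True)(comment)  # word embeddings x_t (matrix E)
h_k = layers.GRU(m)(x)                                        # GRU chain; returns the last hidden state h_k
p_reject = layers.Dense(1, activation="sigmoid")(h_k)         # LR layer: P_RNN(reject | c)

rnn = keras.Model(comment, p_reject)
rnn.compile(optimizer="adam", loss="binary_crossentropy")     # cross-entropy loss and Adam, as in the paper
```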

a-RNN: When the attention mechanism is added, the LR layer considers the weighted sum h_sum of all the hidden states, instead of just h_k (Fig. 2):

  h_sum = Σ_{t=1}^{k} a_t h_t    (1)
  P_a-RNN(reject | c) = σ(W_p h_sum + b_p)

[Figure 2: Illustration of a-RNN. The word embeddings x_1, ..., x_k are turned into hidden states h_1, ..., h_k by the RNN; an attention MLP followed by a softmax produces the weights a_1, ..., a_k; logistic regression on the weighted sum of the hidden states yields the rejection (and acceptance) probability.]

The weights a_t are produced by an attention mechanism, which is an MLP with l layers:

  a_t^{(1)} = ReLU(W^{(1)} h_t + b^{(1)})    (2)
  ...
  a_t^{(l-1)} = ReLU(W^{(l-1)} a_t^{(l-2)} + b^{(l-1)})
  a_t^{(l)} = W^{(l)} a_t^{(l-1)} + b^{(l)}
  a_t = softmax(a_t^{(l)}; a_1^{(l)}, ..., a_k^{(l)})

where a_t^{(1)}, ..., a_t^{(l-1)} ∈ R^r, a_t^{(l)}, a_t ∈ R, W^{(1)} ∈ R^{r×m}, W^{(2)}, ..., W^{(l-1)} ∈ R^{r×r}, W^{(l)} ∈ R^{1×r}, b^{(1)}, ..., b^{(l-1)} ∈ R^r, b^{(l)} ∈ R. The softmax operates across all the a_t^{(l)} (t = 1, ..., k), making the attention weights a_t sum to 1. Our attention mechanism differs from most previous ones (Mnih et al., 2014; Bahdanau et al., 2015; Xu et al., 2015; Luong et al., 2015) in that it is used in a classification setting, where there is no previously generated output subsequence (e.g., partly generated translation) to drive the attention (e.g., assign more weight to source words to translate next), unlike seq2seq models (Sutskever et al., 2014). It assigns larger weights a_t to hidden states h_t corresponding to positions where there is more evidence that the comment should be accepted or rejected.

Yang et al. (2016) use a similar attention mechanism, but ours is deeper. In effect, they always set l = 2, whereas we allow l to be larger (tuning selects l = 4).7 On the other hand, the attention mechanism of Yang et al. is part of a classification method for longer texts (e.g., product reviews). Their method uses two GRU RNNs, both bidirectional (Schuster and Paliwal, 1997), one turning the word embeddings of each sentence to a sentence embedding, and one turning the sentence embeddings to a document embedding, which is then fed to an LR layer. Yang et al. use their attention mechanism in both RNNs, to assign attention scores to words and sentences. We consider shorter texts (comments), we have a single RNN, and we assign attention scores to words only.8

da-RNN: In a variant of a-RNN, called da-RNN (direct attention), the input to the first layer of the attention mechanism is the embedding x_t of word w_t, rather than h_t (cf. Eq. 2; W^{(1,x)} ∈ R^{r×d}):

  a_t^{(1)} = ReLU(W^{(1,x)} x_t + b^{(1)})    (3)

Intuitively, the attention of a-RNN considers each word embedding x_t in its (left) context, modelled by h_t, whereas the attention of da-RNN considers directly x_t without its context, but h_sum is still the weighted sum of the hidden states (Eq. 1).

eq-RNN: In another variant of a-RNN, called eq-RNN, we assign equal attention to all the hidden states. The feature vector of the LR layer is now the average h_sum = (1/k) Σ_{t=1}^{k} h_t (cf. Eq. 1).

da-CENT: For ablation testing, we also experiment with a variant, called da-CENT, that does not use the hidden states of the RNN. The input to the attention mechanism is now directly the embedding x_t instead of h_t (as in da-RNN, Eq. 3), and h_sum is the weighted average (centroid) of the word embeddings, h_sum = Σ_{t=1}^{k} a_t x_t (cf. Eq. 1).9

7 Yang et al. use tanh instead of ReLU in Eq. 2, which works worse in our case, and no bias b^{(l)} in the l-th layer.
8 We tried a bidirectional instead of unidirectional GRU chain in our methods, also replacing the LR layer by a deeper classification MLP, but there were no improvements.
9 We also tried tf-idf scores in the h_sum of da-CENT, instead of attention scores, but preliminary results were poor.
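A rough sketch of the a-RNN attention of Eqs. 1-2, written as a custom Keras layer: an l-layer MLP scores each hidden state h_t, a softmax over positions yields the weights a_t, and the layer returns h_sum. This is one reading of the equations above, not the authors' implementation; the layer sizes follow the r = 128, l = 4 values mentioned below, and padding/masking of short comments is ignored for simplicity.

```python
import tensorflow as tf
from tensorflow.keras import layers

class DeepAttention(layers.Layer):
    """Classification-specific attention: an MLP with num_layers layers scores each
    hidden state, a softmax over positions t = 1..k yields weights a_t, and the
    layer returns h_sum = sum_t a_t * h_t (Eqs. 1-2)."""

    def __init__(self, r=128, num_layers=4, **kwargs):
        super().__init__(**kwargs)
        # num_layers - 1 hidden ReLU layers of width r, then a linear layer to a scalar score
        self.hidden = [layers.Dense(r, activation="relu") for _ in range(num_layers - 1)]
        self.score = layers.Dense(1)   # a_t^{(l)}: one unnormalized score per position

    def call(self, h):                 # h: (batch, k, m) hidden states h_1 .. h_k
        a = h
        for dense in self.hidden:
            a = dense(a)
        a = self.score(a)              # (batch, k, 1)
        a = tf.nn.softmax(a, axis=1)   # softmax across positions, so the a_t sum to 1
        return tf.reduce_sum(a * h, axis=1)   # h_sum, shape (batch, m)

# Usage (replacing the plain GRU output of the earlier sketch):
# h = layers.GRU(m, return_sequences=True)(x)   # all hidden states h_1 .. h_k
# h_sum = DeepAttention(r=128, num_layers=4)(h)
# p_reject = layers.Dense(1, activation="sigmoid")(h_sum)
```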

eq-CENT: For further ablation, we also experiment with eq-CENT, which uses neither the RNN nor the attention mechanism. The feature vector of the LR layer is now simply the average of the word embeddings, h_sum = (1/k) Σ_{t=1}^{k} x_t (cf. Eq. 1).

We set l = 4, d = 300, m = r = 128, having tuned the hyper-parameters of RNN and a-RNN on the same 2% held-out training comments used to tune DETOX; da-RNN, eq-RNN, da-CENT, and eq-CENT use the same hyper-parameter values as a-RNN, to make their results more directly comparable and to save time. We use Glorot initialization (Glorot and Bengio, 2010), cross-entropy loss, and Adam (Kingma and Ba, 2015).10 Early stopping evaluates on the same held-out subsets. For Gazzetta, word embeddings are initialized to the WORD2VEC embeddings we provide (Section 2.1). For the Wikipedia datasets, they are initialized to GLOVE embeddings (Pennington et al., 2014).11 In both cases, the embeddings are updated during backpropagation. Out-of-vocabulary (OOV) words, meaning words not encountered in the training set and/or words we have no initial embeddings for, are mapped (during training and testing) to a single randomly initialized embedding, which is also updated during training.12

3.3 CNN

We also compare against a vanilla CNN operating on word embeddings. We describe the CNN only briefly, because it is very similar to that of Kim (2014); see also Goldberg (2016) for an introduction to CNNs, and Zhang and Wallace (2015).

For Wikipedia comments, we use a 'narrow' convolution layer, with kernels sliding (stride 1) over (entire) embeddings of word n-grams of sizes n = 1, ..., 4. We use 300 kernels for each n value, a total of 1,200 kernels. The outputs of each kernel, obtained by applying the kernel to the different n-grams of a comment c, are then max-pooled, leading to a single output per kernel. The resulting feature vector (1,200 max-pooled outputs) goes through a dropout layer (Hinton et al., 2012) (p = 0.5), and then to an LR layer, which provides P_CNN(reject | c). For Gazzetta, the CNN is the same, except that n = 1, ..., 5, leading to 1,500 features per comment. All hyper-parameters were tuned on the 2% held-out training comments used to tune the other methods. Again, we use 300-dimensional word embeddings, which are now randomly initialized, since tuning indicated this was better than initializing to pre-trained embeddings. OOV words are treated as in the RNN-based methods. All embeddings are updated. Early stopping evaluates on the held-out subsets. Again, we use Glorot initialization, cross-entropy loss, and Adam.13

10 We used Keras (http://keras.io/) with the TensorFlow back-end (http://www.tensorflow.org/).
11 See https://nlp.stanford.edu/projects/glove/. We use 'Common Crawl' (840B tokens).
12 For Gazzetta, words encountered only once in the training set (G-TRAIN-L or G-TRAIN-S) are also treated as OOV.
13 We implemented the CNN directly in TensorFlow.
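A sketch of this vanilla CNN, written in Keras for consistency with the other sketches even though footnote 13 says the authors implemented theirs directly in TensorFlow: kernel sizes n = 1..4 with 300 kernels each, max-pooling, dropout 0.5, and an LR layer. The vocabulary size, comment length, and convolution activation are assumptions not stated in the paper.

```python
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE, MAX_LEN, d = 100_000, 200, 300              # assumed sizes, as in the RNN sketch

comment = keras.Input(shape=(MAX_LEN,), dtype="int32")
x = layers.Embedding(VOCAB_SIZE, d)(comment)            # randomly initialized word embeddings

pooled = []
for n in range(1, 5):                                   # word n-grams of sizes 1..4 (1..5 for Gazzetta)
    conv = layers.Conv1D(300, n, activation="relu")(x)  # 300 kernels per n-gram size; activation assumed
    pooled.append(layers.GlobalMaxPooling1D()(conv))    # one max-pooled output per kernel

features = layers.Concatenate()(pooled)                 # 1,200 features (1,500 for Gazzetta)
features = layers.Dropout(0.5)(features)
p_reject = layers.Dense(1, activation="sigmoid")(features)   # LR layer: P_CNN(reject | c)

cnn = keras.Model(comment, p_reject)
cnn.compile(optimizer="adam", loss="binary_crossentropy")
```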
The outputs of jection threshold t , it is directly accepted if p is each kernel, obtained by applying the kernel to r below an acceptance threshold t , and it is shown the different n-grams of a comment c, are then a to a moderator if t p t (gray zone of Fig.3). max-pooled, leading to a single output per ker- a ≤ ≤ r In our experience, moderators (or their employ- nel. The resulting feature vector (1,200 max- ers) can easily specify the approximate percentage 9 We also tried tf-idf scores in the hsum of da-CENT, instead of comments they can afford to check manually of attention scores, but preliminary results were poor. (e.g., 20% daily) or, equivalently, the approximate 10We used Keras (http://keras.io/) with the Ten- sorFlow back-end (http://www.tensorflow.org/). percentage of comments the system should han- 11See https://nlp.stanford.edu/projects/ dle automatically. We call coverage the latter per- glove/. We use ‘Common Crawl’ (840B tokens). centage; hence, 1 coverage is the approximate 12For Gazzetta, words encountered only once in the train- − ing set (G-TRAIN-L or G-TRAIN-S) are also treated as OOV. 13We implemented the CNN directly in TensorFlow.

4 Experimental results

Following Wulczyn et al. (2017), we report in Tables 2-3 AUC scores (area under the ROC curve), along with Spearman correlations between system-generated probabilities P(accept | c) and human probabilistic gold labels (Section 2.2), when probabilistic gold labels are available.15

Table 2: Results on Gazzetta comments.

Training dataset: G-TRAIN-S
System   | G-DEV AUC | G-TEST-L AUC | G-TEST-S AUC | G-TEST-S-R AUC | G-TEST-S-R Spearman
RNN      | 75.75     | 75.10        | 74.40        | 80.27          | 51.89
a-RNN    | 76.19     | 76.15        | 75.83        | 80.41          | 52.51
da-RNN   | 75.96     | 75.90        | 74.25        | 80.05          | 52.49
eq-RNN   | 74.31     | 74.01        | 73.28        | 77.73          | 45.77
da-CENT  | 75.09     | 74.96        | 74.20        | 79.92          | 51.04
eq-CENT  | 73.93     | 73.82        | 73.80        | 78.45          | 48.14
CNN      | 70.97     | 71.34        | 70.88        | 76.03          | 42.88
DETOX    | 72.50     | 72.06        | 71.59        | 75.67          | 43.80
LIST     | 61.47     | 61.59        | 61.26        | 64.19          | 24.33

Training dataset: G-TRAIN-L
System   | G-DEV AUC | G-TEST-L AUC | G-TEST-S AUC | G-TEST-S-R AUC | G-TEST-S-R Spearman
RNN      | 79.50     | 79.41        | 79.23        | 84.17          | 59.31
a-RNN    | 79.64     | 79.58        | 79.67        | 84.69          | 60.87
da-RNN   | 79.60     | 79.56        | 79.38        | 84.40          | 60.83
eq-RNN   | 77.45     | 77.76        | 77.28        | 82.11          | 55.01
da-CENT  | 78.73     | 78.64        | 78.62        | 83.53          | 57.82
eq-CENT  | 76.76     | 76.85        | 76.30        | 82.38          | 53.28
CNN      | 77.57     | 77.35        | 78.16        | 83.98          | 55.90
DETOX    | -         | -            | -            | -              | -
LIST     | 67.04     | 67.06        | 66.17        | 69.51          | 33.61

A first observation is that increasing the size of the Gazzetta training set (G-TRAIN-S to G-TRAIN-L, Table 2) significantly improves the performance of all methods; we do not report DETOX results for G-TRAIN-L, because its implementation could not handle the size of G-TRAIN-L. Tables 2-3 also show that RNN is always better than CNN and DETOX; there is no clear winner between CNN and DETOX. Furthermore, a-RNN is always better than RNN on Gazzetta comments (Table 2), but not always on Wikipedia comments (Table 3). Another observation is that da-RNN is always worse than a-RNN (Tables 2-3), confirming that the hidden states of the RNN are a better input to the attention mechanism than word embeddings. The performance of da-RNN deteriorates further when equal attention is assigned to the hidden states (eq-RNN), when the weighted sum of hidden states (h_sum) is replaced by the weighted sum of word embeddings (da-CENT), or both (eq-CENT). Also, da-CENT outperforms eq-CENT, indicating that the attention mechanism improves the performance of simply averaging word embeddings. The Wikipedia subsets are easier (all methods perform better on the Wikipedia subsets, compared to Gazzetta).

Figure 4 shows F_2(P_reject, P_accept) on G-TEST-L, G-TEST-S, W-ATT-TEST, and W-TOX-TEST, when t_a, t_r are tuned on the corresponding development sets for varying coverage. For the Gazzetta datasets, we show results training on G-TRAIN-S (solid lines) and G-TRAIN-L (dashed). The differences between RNN and a-RNN are again small, but it is now easier to see that a-RNN is overall better. Again, a-RNN and RNN are better than CNN and DETOX, and the results improve with a larger training set (dashed).

15 When computing AUC, the gold label is the majority label of the annotators. When computing Spearman, the gold label is probabilistic (% of annotators that accepted the comment). The decisions of the systems are always probabilistic.

Table 3: Results on Wikipedia comments.

Training dataset: W-ATT-TRAIN
System   | W-ATT-DEV AUC | W-ATT-DEV Spearman | W-ATT-TEST AUC | W-ATT-TEST Spearman
RNN      | 97.39         | 71.92              | 97.71          | 72.79
a-RNN    | 97.46         | 71.59              | 97.68          | 72.32
da-RNN   | 97.02         | 71.49              | 97.31          | 72.11
eq-RNN   | 92.66         | 60.77              | 92.85          | 60.16
da-CENT  | 96.73         | 70.13              | 97.06          | 71.08
eq-CENT  | 92.30         | 57.21              | 92.81          | 56.33
CNN      | 96.91         | 70.06              | 97.07          | 70.21
DETOX    | 96.26         | 67.75              | 96.71          | 68.09
LIST     | 93.05         | 55.39              | 92.91          | 54.55

Training dataset: W-TOX-TRAIN
System   | W-TOX-DEV AUC | W-TOX-DEV Spearman | W-TOX-TEST AUC | W-TOX-TEST Spearman
RNN      | 98.20         | 68.84              | 98.42          | 68.89
a-RNN    | 98.22         | 68.95              | 98.38          | 68.90
da-RNN   | 98.05         | 68.59              | 98.28          | 68.55
eq-RNN   | 94.72         | 55.48              | 95.04          | 55.86
da-CENT  | 97.83         | 67.86              | 97.94          | 67.74
eq-CENT  | 94.31         | 53.35              | 94.61          | 52.93
CNN      | 97.76         | 65.50              | 97.86          | 65.56
DETOX    | 97.16         | 63.57              | 97.13          | 63.24
LIST     | 93.96         | 51.35              | 93.95          | 51.18

[Figure 4: F2 scores for varying coverage. Dashed lines were obtained using a larger training set.]

On W-ATT-TEST and W-TOX-TEST, a-RNN obtains P_accept, P_reject ≥ 0.94 for all coverages (Fig. 4, call-outs). On the more difficult Gazzetta datasets, a-RNN still obtains P_accept, P_reject ≥ 0.85 when tuned for 50% coverage. When tuned for 100% coverage, comments for which the system is uncertain (gray zone) cannot be avoided and there are inevitably more misclassifications; the use of F_2 during threshold tuning places more emphasis on avoiding wrongly accepted comments, leading to high P_accept (≥ 0.82), at the expense of wrongly rejected comments, i.e., sacrificing P_reject (≥ 0.56). On the re-moderated G-TEST-S-R (similar diagrams, not shown), P_accept, P_reject become 0.96, 0.88 for coverage 50%, and 0.92, 0.48 for coverage 100%.

5 Related work

Napoles et al. (2017b) developed an annotation scheme for online conversations, with 6 dimensions for comments (e.g., sentiment, tone, off-topic) and 3 dimensions for threads. The scheme was used to label a dataset, called YNACC, of 9.2K comments (2.4K threads) from Yahoo News and 16.6K comments (1K threads) from the Internet Argument Corpus (Walker et al., 2012; Abbott et al., 2016). Abusive comments were filtered out, hence YNACC cannot be used for our purposes, but it may be possible to extend the annotation scheme for abusive comments, to predict more fine-grained labels, instead of 'accept' or 'reject'. Napoles et al. also reported that up/down votes, a form of social filtering, are inappropriate proxies for comment and thread quality. Lee et al. (2014) discuss social filtering in detail and propose features (e.g., thread depth, number of revisiting users) to assess the quality of a thread without processing the texts of its comments. Diakopoulos (2015) discusses how editors select high quality comments.

In further work, Napoles et al. (2017a) aimed to identify high quality threads. Their best method converts each comment to a comment embedding using DOC2VEC (Le and Mikolov, 2014). An ensemble of Conditional Random Fields (CRFs) (Lafferty et al., 2001) assigns labels (from their annotation scheme, e.g., for sentiment, off-topic) to the comments of each thread, viewing each thread as a sequence of DOC2VEC embeddings. The decisions of the CRFs are then used to convert each thread to a feature vector (total count and mean marginal probability of each label in the thread), which is passed on to an LR classifier. Further improvements were observed when additional features were added, BOW counts and POS n-grams being the most important ones. Napoles et al. (2017a) also experimented with a CNN, similar to that of Section 3.3, which was not however a top-performer, presumably because of the small size of the training set (2.1K YNACC threads).

Djuric et al. (2015) experimented with 952K manually moderated comments from Yahoo Finance, but their dataset is not publicly available. They convert each comment to a DOC2VEC embedding, which is fed to an LR classifier.

Nobata et al. (2016) experimented with approx. 3.3M manually moderated comments from Yahoo Finance and News; their data are also not available.16 They used Vowpal Wabbit17 with character n-grams (n = 3, ..., 5) and word n-grams (n = 1, 2), hand-crafted features (e.g., comment length, number of capitalized or black-listed words), features based on dependency trees, averages of WORD2VEC embeddings, and DOC2VEC-like embeddings. Character n-grams were the best, on their own outperforming Djuric et al. (2015). The best results, however, were obtained using all features. By contrast, we use no hand-crafted features and parsers, making our methods easily portable to other domains and languages.

Wulczyn et al. (2017) experimented with character and word n-grams, based on the findings of Nobata et al. (2016). We included their dataset and moderation system (DETOX) in our experiments. Wulczyn et al. also used DETOX (trained on W-ATT-TRAIN) as a proxy (instead of human annotators) to automatically classify 63M Wikipedia comments, which were then used to study the problem of personal attacks (e.g., the effect of allowing anonymous comments, how often personal attacks were followed by moderation actions). Our methods could replace DETOX in studies of this kind, since they perform better.

Waseem and Hovy (2016) used approx. 17K tweets annotated for hate speech. Their best method was an LR classifier with character n-grams (n = 1, ..., 4) and a gender feature. Badjatiya et al. (2017) experimented with the same dataset using LR, SVMs (Cortes and Vapnik, 1995), Random Forests (Ho, 1995), Gradient Boosted Decision Trees (GBDT) (Friedman, 2002), a CNN (similar to that of Section 3.3), an LSTM (Greff et al., 2015), and FastText (Joulin et al., 2017). They also considered alternative feature sets: character n-grams, tf-idf vectors, word embeddings, averaged word embeddings. Their best results were obtained using GBDT with averaged word embeddings learned by the LSTM, starting from random embeddings.

Warner and Hirschberg (2012) aimed to detect anti-semitic speech, experimenting with 9K paragraphs and a linear SVM. Their features consider windows of up to 5 tokens, the tokens of each window, their order, POS tags, Brown clusters etc., following Yarowsky (1994).

Cheng et al. (2015) predict which users would be banned from online communities. Their best system uses a Random Forest or LR classifier, with features examining readability, activity (e.g., number of posts daily), community and moderator reactions (e.g., up-votes, number of deleted posts).

Lukin and Walker (2013) experimented with 5.5K utterances from the Internet Argument Corpus (Walker et al., 2012; Abbott et al., 2016) annotated with nastiness scores, and 9.9K utterances from the same corpus annotated for sarcasm.18 In a bootstrapping manner, they manually identified cue words and phrases (indicative of nastiness or sarcasm), used the cue words to obtain training comments, and extracted patterns from the training comments. Xiang et al. (2012) also employed bootstrapping to identify users whose tweets frequently or never contain profane words, and collected 381M tweets from the two user types. They trained decision tree, Random Forest, or LR classifiers to distinguish between tweets from the two user types, testing on 4K tweets manually labeled as containing profanity or not. The classifiers used topical features, obtained via LDA (Blei et al., 2003), and a feature indicating the presence of at least one of approx. 330 known profane words.

Sood et al. (2012a; 2012b) experimented with 6.5K comments from Yahoo Buzz, moderated via crowdsourcing. They showed that a linear SVM, representing each comment as a bag of word bigrams and stems, performs better than word lists. Their best results were obtained by combining the SVM with a word list and edit distance.

Yin et al. (2009) used posts from chat rooms and discussion fora (<15K posts in total) to train an SVM to detect online harassment. They used TF-IDF, sentiment, and context features (e.g., similarity to other posts in a thread).19 Our methods might also benefit by considering threads, rather than individual comments. Yin et al. point out that unlike other abusive content, spam in comments or discussion fora (Mishne et al., 2005; Niu et al., 2007) is off-topic and serves a commercial purpose. Spam is unlikely in Wikipedia discussions and extremely rare so far in Gazzetta comments.

Mihaylov and Nakov (2016) identify comments posted by opinion manipulation trolls.

16 According to Nobata et al., their clean test dataset (2K comments) would be made available, but it is currently not.
17 See http://hunch.net/~vw/.
18 For sarcasm, see Davidov et al. (2010), Gonzalez-Ibanez et al. (2011), Joshi et al. (2015), Oraby et al. (2016).
19 Sentiment features have been used by several methods, but sentiment analysis (Pang and Lee, 2008; Liu, 2015) is typically not directly concerned with abusive content.

Dinakar et al. (2011) and Dadvar et al. (2013) detect cyberbullying. Chandrinos et al. (2000) detect pornographic web pages, using a Naive Bayes classifier with text and image features. Spertus (1997) flags flame messages in Web feedback forms, using decision trees and hand-crafted features. A Kaggle dataset for insult detection is also available.20 It contains 6.6K comments (3,947 train, 2,647 test) labeled as insults or not. However, abusive comments that do not directly insult other participants of the same discussion are not classified as insults, even if they contain profanity, hate speech, insults to third persons etc.

6 Conclusions

We experimented with a new publicly available dataset of 1.6M moderated user comments from a Greek sports news portal and two existing datasets of English Wikipedia talk page comments. We showed that a GRU RNN operating on word embeddings outperforms the previous state of the art, which used an LR or MLP classifier with character or word n-gram features. It also outperforms a vanilla CNN operating on word embeddings, and a baseline that uses an automatically constructed word list with precision scores. A novel, deep, classification-specific attention mechanism improves further the overall results of the RNN. The attention mechanism also improves the results of a simpler method that averages word embeddings. We considered both fully automatic and semi-automatic moderation, along with threshold tuning and evaluation measures for both.

We plan to consider user-specific information (e.g., ratio of comments rejected in the past) and thread statistics (e.g., thread depth, number of revisiting users) (Dadvar et al., 2013; Lee et al., 2014; Cheng et al., 2015; Waseem and Hovy, 2016). We also plan to explore character-level RNNs or CNNs (Zhang et al., 2015), for example to produce embeddings of unknown or obfuscated words from characters (dos Santos and Zadrozny, 2014; Ling et al., 2015). We are also exploring how the attention scores of a-RNN can be used to highlight 'suspicious' words or phrases when showing gray comments to moderators.

Acknowledgments

This work was funded by Google's Digital News Initiative (project ML2P, contract 362826).21 We are grateful to Gazzetta for the data they provided. We also thank Gazzetta's moderators for their feedback, insights, and advice.

20 See http://www.kaggle.com/, data description of the competition 'Detecting Insults in Social Commentary'.
21 See https://digitalnewsinitiative.com/.

References

R. Abbott, B. Ecker, P. Anand, and M. A. Walker. 2016. Internet Argument Corpus 2.0: An SQL schema for dialogic social media and the corpora to go with it. In LREC. Portoroz, Slovenia.

P. Badjatiya, S. Gupta, M. Gupta, and V. Varma. 2017. Deep learning for hate speech detection in tweets. In WWW (Companion). Perth, Australia, pages 759-760.

D. Bahdanau, K. Cho, and Y. Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR. San Diego, CA.

D. M. Blei, A. Y. Ng, and M. I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research 3:993-1022.

K. V. Chandrinos, I. Androutsopoulos, G. Paliouras, and C. D. Spyropoulos. 2000. Automatic Web rating: Filtering obscene content on the Web. In Proc. of the 4th European Conference on Research and Advanced Technology for Digital Libraries. Lisbon, Portugal, pages 403-406.

J. Cheng, C. Danescu-Niculescu-Mizil, and J. Leskovec. 2015. Antisocial behavior in online discussion communities. In Proc. of the International AAAI Conference on Web and Social Media. Oxford University, England, pages 61-70.

K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP. Doha, Qatar, pages 1724-1734.

J. Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20(1):37-46.

C. Cortes and V. Vapnik. 1995. Support-Vector Networks. Machine Learning 20(3):273-297.

M. Dadvar, D. Trieschnigg, R. Ordelman, and F. de Jong. 2013. Improving cyberbullying detection with user context. In ECIR. Moscow, Russia, pages 693-696.

D. Davidov, O. Tsur, and A. Rappoport. 2010. Semi-supervised recognition of sarcastic sentences in Twitter and Amazon. In CoNLL. Uppsala, Sweden, pages 107-116.

N. Diakopoulos. 2015. Picking the NYT picks: Editorial criteria and automation in the curation of online news comments. Journal of the International Symposium on Online Journalism 5:147-166.

K. Dinakar, R. Reichart, and H. Lieberman. 2011. Modeling the detection of textual cyberbullying. In The Social Mobile Web. Barcelona, Spain, volume WS-11-02 of AAAI Workshops, pages 11-17.

N. Djuric, J. Zhou, R. Morris, M. Grbovic, V. Radosavljevic, and N. Bhamidipati. 2015. Hate speech detection with comment embeddings. In WWW. Florence, Italy, pages 29-30.

C. N. dos Santos and B. Zadrozny. 2014. Learning character-level representations for part-of-speech tagging. In ICML. Beijing, China, pages 1818-1826.

J. H. Friedman. 2002. Stochastic gradient boosting. Computational Statistics and Data Analysis 38(4):367-378.

X. Glorot and Y. Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proc. of the International Conference on Artificial Intelligence and Statistics. Sardinia, Italy, pages 249-256.

Y. Goldberg. 2016. A primer on neural network models for natural language processing. Journal of Artificial Intelligence Research 57:345-420.

R. I. Gonzalez-Ibanez, S. Muresan, and N. Wacholder. 2011. Identifying sarcasm in Twitter: A closer look. In ACL. Portland, Oregon, pages 581-586.

I. Goodfellow, Y. Bengio, and A. Courville. 2016. Deep Learning. MIT Press.

K. Greff, R. K. Srivastava, J. Koutnik, B. R. Steunebrink, and J. Schmidhuber. 2015. LSTM: A search space Odyssey. CoRR abs/1503.04069.

G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. CoRR abs/1207.0580.

T. K. Ho. 1995. Random Decision Forests. In Proc. of the 3rd International Conference on Document Analysis and Recognition. Montreal, Canada, volume 1, pages 278-282.

A. Joshi, V. Sharma, and P. Bhattacharyya. 2015. Harnessing context incongruity for sarcasm detection. In ACL. Beijing, China, pages 757-762.

A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov. 2017. Bag of tricks for efficient text classification. In EACL (short papers). Valencia, Spain, pages 427-431.

Y. Kim. 2014. Convolutional neural networks for sentence classification. In EMNLP. Doha, Qatar, pages 1746-1751.

D. P. Kingma and J. Ba. 2015. Adam: A method for stochastic optimization. In ICLR. San Diego, CA.

K. Krippendorff. 2004. Content Analysis: An Introduction to Its Methodology (2nd edition). Sage Publications.

J. D. Lafferty, A. McCallum, and F. C. N. Pereira. 2001. Conditional Random Fields: Probabilistic models for segmenting and labeling sequence data. In ICML. Williamstown, MA, pages 282-289.

Q. V. Le and T. Mikolov. 2014. Distributed representations of sentences and documents. In ICML. Beijing, China, pages 1188-1196.

J.-T. Lee, M.-C. Yang, and H.-C. Rim. 2014. Discovering high-quality threaded discussions in online forums. Journal of Computer Science and Technology 29(3):519-531.

W. Ling, C. Dyer, A. W. Black, I. Trancoso, R. Fermandez, S. Amir, L. Marujo, and T. Luis. 2015. Finding function in form: Compositional character models for open vocabulary word representation. In EMNLP. Lisbon, Portugal, pages 1520-1530.

B. Liu. 2015. Sentiment Analysis: Mining Opinions, Sentiments, and Emotions. Cambridge University Press.

S. Lukin and M. Walker. 2013. Really? well. apparently bootstrapping improves the performance of sarcasm and nastiness classifiers for online dialogue. In Proc. of the Workshop on Language in Social Media. Atlanta, Georgia, pages 30-40.

T. Luong, H. Pham, and C. D. Manning. 2015. Effective approaches to attention-based neural machine translation. In EMNLP. Lisbon, Portugal, pages 1412-1421.

T. Mihaylov and P. Nakov. 2016. Hunting for troll comments in news community forums. In ACL. Berlin, Germany, pages 399-405.

T. Mikolov, K. Chen, G. Corrado, and J. Dean. 2013a. Efficient estimation of word representations in vector space. In ICLR. Scottsdale, AZ.

T. Mikolov, W.-t. Yih, and G. Zweig. 2013b. Linguistic regularities in continuous space word representations. In NAACL-HLT. Atlanta, GA, pages 746-751.

G. Mishne, D. Carmel, and R. Lempel. 2005. Blocking blog spam with language model disagreement. In Proc. of the International Workshop on Adversarial Information Retrieval on the Web. Chiba, Japan.

V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu. 2014. Recurrent models of visual attention. In NIPS. Montreal, Canada, pages 2204-2212.

C. Napoles, A. Pappu, and J. Tetreault. 2017a. Automatically identifying good conversations online (yes, they do exist!). In Proc. of the International AAAI Conference on Web and Social Media.

C. Napoles, J. Tetreault, E. Rosato, B. Provenzale, and A. Pappu. 2017b. Finding good conversations online: The Yahoo News annotated comments corpus. In Proc. of the Linguistic Annotation Workshop. Valencia, Spain, pages 13-23.

Y. Niu, Y.-M. Wang, H. Chen, M. Ma, and F. Hsu. 2007. A quantitative study of forum spamming using context-based analysis. In Proc. of the Annual Network and Distributed System Security Symposium. San Diego, CA, pages 79-92.

C. Nobata, J. Tetreault, A. Thomas, Y. Mehdad, and Y. Chang. 2016. Abusive language detection in online user content. In WWW. Montreal, Canada, pages 145-153.

S. Oraby, V. Harrison, L. Reed, E. Hernandez, E. Riloff, and M. A. Walker. 2016. Creating and characterizing a diverse corpus of sarcasm in dialogue. In SIGDial. Los Angeles, CA, pages 31-41.

B. Pang and L. Lee. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval 2(1-2):1-135.

J. Pennington, R. Socher, and C. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP. Doha, Qatar, pages 1532-1543.

M. Schuster and K. K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45(11):2673-2681.

S. Sood, J. Antin, and E. F. Churchill. 2012a. Profanity use in online communities. In SIGCHI. Austin, TX, pages 1481-1490.

S. Sood, J. Antin, and E. F. Churchill. 2012b. Using crowdsourcing to improve profanity detection. In AAAI Spring Symposium: Wisdom of the Crowd. Stanford, CA, pages 69-74.

E. Spertus. 1997. Smokey: Automatic recognition of hostile messages. In Proc. of the National Conference on Artificial Intelligence and the Innovative Applications of Artificial Intelligence Conference. Providence, Rhode Island, pages 1058-1065.

I. Sutskever, O. Vinyals, and Q. V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS. Montreal, Canada, pages 3104-3112.

M. A. Walker, J. E. Fox Tree, P. Anand, R. Abbott, and J. King. 2012. A corpus for research on deliberation and debate. In LREC. Istanbul, Turkey, pages 4445-4452.

W. Warner and J. Hirschberg. 2012. Detecting hate speech on the World Wide Web. In Proc. of the 2nd Workshop on Language in Social Media. Montreal, Canada, pages 19-26.

Z. Waseem and D. Hovy. 2016. Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter. In Proc. of the NAACL Student Research Workshop. San Diego, CA, pages 88-93.

E. Wulczyn, N. Thain, and L. Dixon. 2017. Ex machina: Personal attacks seen at scale. In WWW. Perth, Australia, pages 1391-1399.

G. Xiang, B. Fan, L. Wang, J. Hong, and C. Rose. 2012. Detecting offensive tweets via topical feature discovery over a large scale twitter corpus. In CIKM. Maui, Hawaii, pages 1980-1984.

K. Xu, J. Ba, J. R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In ICML. Lille, France, pages 2048-2057.

Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy. 2016. Hierarchical attention networks for document classification. In NAACL-HLT. San Diego, CA, pages 1480-1489.

D. Yarowsky. 1994. Decision lists for lexical ambiguity resolution: Application to accent restoration in Spanish and French. In ACL. Las Cruces, NM, pages 88-95.

D. Yin, Z. Xue, L. Hong, B. D. Davison, A. Kontostathis, and L. Edwards. 2009. Detection of harassment on Web 2.0. In Proc. of the WWW Workshop on Content Analysis in the Web 2.0. Madrid, Spain.

X. Zhang, J. Zhao, and Y. LeCun. 2015. Character-level convolutional networks for text classification. In NIPS. Montreal, Canada, pages 649-657.

Y. Zhang and B. C. Wallace. 2015. A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification. CoRR abs/1510.03820.
