Deep Learning for User Comment Moderation

John Pavlopoulos Prodromos Malakasiotis Ion Androutsopoulos StrainTek StrainTek Department of Informatics Athens, Greece Athens, Greece Athens University of Economics [email protected] [email protected] and Business, Greece [email protected]

Abstract x x Experimenting with a new dataset of 1.6M ✓ ✓ user comments from a Greek news portal ? ?

and existing datasets of English ? ? x x x comments, we show that an RNN outper- x x forms the previous state of the art in mod- ✓ ✓ ✓

eration. A deep, classification-specific at- ✓ ✓ ✓ tention mechanism improves further the overall performance of the RNN. We also Figure 1: Semi-automatic moderation. compare against a CNN and a word-list baseline, considering both fully automatic and semi-automatic moderation. comments from a Greek sports portal (Gazzetta), which we make publicly available.2 Furthermore, 1 Introduction we provide word embeddings pre-trained on 5.2M User comments play a central role in social me- comments from the same portal. We also exper- dia and online discussion fora. News portals and iment on the datasets of Wulczyn et al. (2017), often also allow their readers to comment which contain English Wikipedia comments la- in order to get feedback, engage their readers, beled for personal attacks, aggression, toxicity. and build customer loyalty. User comments, how- In a fully automatic scenario, a system directly ever, and more generally user content can also accepts or rejects comments. Although this sce- be abusive (e.g., bullying, profanity, hate speech). nario may be the only available one, e.g., when are increasingly under pressure to portals cannot afford moderators, it is unrealistic combat abusive content. News portals also suf- to expect that fully automatic moderation will be fer from abusive user comments, which damage perfect, because abusive comments may involve their reputation and make them liable to fines, e.g., irony, sarcasm, harassment without profanity etc., when hosting comments encouraging illegal ac- which are particularly difficult for machines to tions. They often employ moderators, who are fre- handle. When moderators are available, it is more quently overwhelmed by the volume of comments. realistic to develop semi-automatic systems to as- Readers are disappointed when non-abusive com- sist rather than replace them, a scenario that has ments do not appear quickly online because of not been considered in previous work. Comments moderation delays. Smaller news portals may be for which the system is uncertain (Fig.1) are unable to employ moderators, and some are forced shown to a moderator to decide; all other com- to shut down their comments sections entirely.1 ments are accepted or rejected by the system. We We examine how deep learning (Goodfellow discuss how moderation systems can be tuned, de- et al., 2016; Goldberg, 2016) can be used to mod- pending on the availability and workload of mod- erate user comments. We experiment with a new erators. We also introduce additional evaluation dataset of approx. 1.6M manually moderated user 2The portal is http://www.gazzetta.gr/. In- 1See, for example, http://niemanreports.org/ structions to obtain the Gazzetta data will be posted at http: articles/the-future-of-comments/. //nlp.cs.aueb.gr/software.html.

25 Proceedings of the First Workshop on Abusive Language Online, pages 25–35, Vancouver, Canada, July 30 - August 4, 2017. c 2017 Association for Computational Linguistics Dataset/Split Accepted Rejected Total G-TRAIN-L, called G-TRAIN-S. An additional set G-TRAIN-L 960,378 (66%) 489,222 (34%) 1,45M G-TRAIN-S 67,828 (68%) 32,172 (32%) 100,000 of 60,900 comments (Oct. 7 to Nov. 11, 2016) G-DEV 20,236 (68%) 9,464 (32%) 29,700 was split to development (G-DEV, 29,700 com- G-TEST-L 20,064 (68%) 9,636 (32%) 29,700 ments), large test (G-TEST-L, 29,700), and small G-TEST-S 1,068 (71%) 432 (29%) 1,500 G-TEST-S-R 1,174 (78%) 326 (22%) 1,500 test set (G-TEST-S, 1,500). Gazzetta’s moderators W-ATT-TRAIN 61,447 (88%) 8,079 (12%) 69,526 (2 full-time, plus journalists occasionally helping) W-ATT-DEV 20,405 (88%) 2,755 (12%) 23,160 are occasionally instructed to be stricter (e.g., dur- W-ATT-TEST 20,422 (88%) 2,756 (12%) 23,178 W-TOX-TRAIN 86,447 (90%) 9,245 (10%) 95,692 ing violent events). To get a more accurate view W-TOX-DEV 29,059 (90%) 3,069 (10%) 32,128 of performance in normal situations, we manually W-TOX-TEST 28,818 (90%) 3,048 (10%) 31,866 re-moderated (labeled as ‘accept’ or ‘reject’) the Table 1: Statistics of the datasets used. comments of G-TEST-S, producing G-TEST-S-R. The reject ratio is approximately 30% in all sub- sets, except for G-TEST-S-R where it drops to 22%, measures for the semi-automatic scenario. because there are no occasions where the modera- On both Gazzetta and Wikipedia comments and tors were instructed to be stricter in G-TEST-S-R. for both scenarios (automatic, semi-automatic), Each G-TEST-S-R comment was re-moderated we show that a recursive neural network (RNN) by 5 annotators. Krippendorff’s (2004) alpha was outperforms the system of Wulczyn et al. (2017), 0.4762, close to the value (0.45) reported by Wul- the previous state of the art for comment modera- czyn et al. (2017) for Wikipedia comments. Using tion, which employed logistic regression (LR) or a Cohen’s Kappa (Cohen, 1960), the mean pairwise multi-layered Perceptron (MLP). We also propose agreement was 0.4749. The mean pairwise per- an attention mechanism that improves the over- centage of agreement (% of comments each pair all performance of the RNN. Our attention differs of annotators agreed on) was 81.33%. Cohen’s from most previous ones (Bahdanau et al., 2015; Kappa and Krippendorff’s alpha lead to moder- Luong et al., 2015) in that it is used in text clas- ate scores, because they account for agreement by sification, where there is no previously generated chance, which is high when there is class imbal- output subsequence to drive the attention, unlike ance (22% reject, 78% accept in G-TEST-S-R). sequence-to-sequence models (Sutskever et al., We also provide 300-dimensional word em- 2014). In effect, our attention mechanism detects beddings, pre-trained on approx. 5.2M comments the words of a comment that affect mostly the clas- (268M tokens) from Gazzetta using WORD2VEC sification decision (accept, reject), by examining (Mikolov et al., 2013a,b).3 This larger dataset can- them in the context of the particular comment. not be used to train classifiers, because most of its Our main contributions are: (i) We release a comments are from a period (before 2015) when new dataset of 1.6M moderated user comments. Gazzetta did not employ moderators. (ii) We are among the first to apply deep learning to user comment moderation, and we show that an 2.2 Wikipedia datasets RNN with a novel classification-specific attention Wulczyn et al. (2017) created three datasets con- mechanism outperforms the previous state of the taining English Wikipedia talk page comments. art. (iii) Unlike previous work, we also consider Attacks dataset: This dataset contains approx. a semi-automatic scenario, along with threshold 115K comments, which were labeled as personal tuning and evaluation measures for it. attacks (reject) or not (accept) using crowdsourc- ing. Each comment was labeled by at least 10 an- 2 Datasets notators. Inter-annotator agreement, measured on We first discuss the datasets we used, to help ac- a random sample of 1K comments using Krippen- quaint the reader with the problem. dorff’s (2004) alpha, was 0.45. The gold label of each comment is determined by the majority of an- 2.1 Gazzetta dataset notators, leading to binary labels (accept, reject). There are approx. 1.45M training comments (cov- Alternatively, the gold label is the percentage of ering Jan. 1, 2015 to Oct. 6, 2016) in the Gazzetta annotators that labeled the comment as ‘accept’ dataset; we call them G-TRAIN-L (Table1). Some 3We used CBOW, window size 5, min. term freq. 5, nega- experiments use only the first 100K comments of tive sampling, obtaining a vocabulary size of approx. 478K.

26 (or ‘reject’), leading to probabilistic labels.4 The bag of word n-grams (n 2, each comment be- ≤ dataset is split in three parts (Table1): training ( W- comes a bag containing its 1-grams and 2-grams) ATT-TRAIN, 69,526 comments), development (W- or a bag of character n-grams (n 5, each com- ≤ ATT-DEV, 23,160), and test (W-ATT-TEST, 23,178 ment becomes a bag containing character 1-grams, comments). In all three parts, the rejected com- . . . , 5-grams). DETOX can rely on a logistic regres- ments are 12%, but this ratio is artificial (in effect, sion (LR) or multi-layer Perceptron (MLP) clas- Wulczyn et al. oversampled comments posted by sifier, and use binary or probabilistic gold labels banned users), unlike Gazzetta subsets where the (Section 2.2) during training. We used the DETOX truly observed accept/reject ratios are used. implementation of Wulczyn et al. and the same Toxicity dataset: This dataset was created like grid search to tune the hyper-parameters that se- the previous one, but contains more comments lect word or character n-grams, classifier (LR or (159,686), now labeled as toxic (reject) or not (ac- MLP), and gold labels (binary or probabilistic). cept). Inter-annotator agreement was not reported. For Gazzetta, only binary gold labels were pos- Again, binary or probabilistic gold labels can be sible, since G-TRAIN-L and G-TRAIN-S have a used. The dataset is split in three parts (Table1): single gold label per comment. Unlike Wulczyn training (W-TOX-TRAIN, 95,692 comments), de- et al., we tuned the hyper-parameters by evalu- velopment (W-TOX-DEV, 32,128), and test (W- ating (computing AUC and Spearman, Section4) TOX-TEST, 31,866). In all three parts, the rejected on a random 2% of held-out comments of W-ATT- (toxic) comments are 10%, again an artificial ratio. TRAIN, W-TOX-TRAIN, or G-TRAIN-S, instead of Wikipedia comments are longer (median 38 the development subsets, to be able to obtain more and 39 tokens for attacks, toxicity) compared to realistic results from the development sets while Gazzetta’s (median 25). Wulczyn et al. (2017) developing the methods. The tuning always se- also created an ‘aggression’ dataset containing the lected character n-grams, as in the work of Wul- same comments as the personal attacks one, but czyn et al., and LR to MLP, whereas Wulczyn et al. now labeled as aggressive or not. The (proba- reported slightly higher performance for the MLP 6 bilistic) labels of the two datasets are very highly on W-ATT-DEV. The tuning also selected proba- correlated (0.8992 Spearman, 0.9718 Pearson) and bilistic labels when available (Wikipedia datasets), we do not consider the aggression dataset further. as in the work of Wulczyn et al. 3.2 RNN-based methods 3 Methods RNN: The RNN method is a chain of GRU We experimented with an RNN operating on word cells (Cho et al., 2014) that transforms the to- embeddings, the same RNN enhanced with our at- kens w1 . . . , wk of each comment to hidden states tention mechanism (a-RNN), several variants of h1 . . . , hk, followed by an LR layer that uses hk a-RNN, a vanilla convolutional neural network to classify the comment (accept, reject). Formally, CNN d V ( ) also operating on word embeddings, the given the vocabulary V , a matrix E R ×| | con- ∈ DETOX system of Wulczyn et al. (2017), and a taining d-dimensional word embeddings, an initial baseline that uses word lists with precision scores. h0, and a comment c = w1, . . . , wk , the RNN h i m computes h1, . . . , hk as follows (ht R ): 3.1 DETOX ∈ ˜ DETOX (Wulczyn et al., 2017) was the previous ht = tanh(Whxt + Uh(rt ht 1) + bh) − state of the art in comment moderation, in the ht = (1 zt) ht 1 + zt h˜t sense that it had the best reported results on the − − zt = σ(Wzxt + Uzht 1 + bz) Wikipedia datasets (Section 2.2), the largest previ- − rt = σ(Wrxt + Urht 1 + br) ous publicly available datasets of moderated user − 5 comments. DETOX represents each comment as a ˜ m where ht R is the proposed hidden state at po- ∈ 4We also construct probabilistic gold labels (in addition to sition t, obtained by considering the word embed- binary ones) for G-TEST-S-R, where there are 5 annotators. ding xt of token wt and the previous hidden state 5Two of the co-authors of Wulczyn et al. (2017) are with Jigsaw, who recently announced Perspective, a system to de- API (http://www.perspectiveapi.com/). tect ‘toxic’ comments. Perspective is not the same as DETOX 6Wulczyn et al. (2017) report results only on W-ATT-DEV. (personal communication), but we were unable to obtain sci- We repeated the tuning by evaluating on W-ATT-DEV, and entific articles describing it. We have applied for access to its again character n-grams with LR were selected.

27 ... α ht 1; denotes element-wise multiplication; rt α1 × h1 α2 × h2 k × hk m− ∈ R is the reset gate (for rt all zeros, it allows the m RNN to forget the previous state ht 1); zt R softmax − ∈ z is the update gate (for t all zeros, it allows the (l) (l) (l) α1 α2 αk rejection RNN to ignore the new proposed h˜t, hence also probability x , and copy h as h ); σ is the sigmoid func- acceptance t t 1 t probability − m d m m Logistic tion; Wh,Wz,Wr R × ; Uh,Uz,Ur R × ; Regression m ∈ ∈ ...... bh, bz, br R . Once hk has been computed, the

∈ Attention MLP LR layer estimates the probability that comment c 1 m should be rejected, with Wp R × , bp R: h h ... h ∈ ∈ h0 1 2 k RNN

PRNN(reject c) = σ(Wphk + bp) x x ... x | 1 2 k a-RNN: When the attention mechanism is added, Figure 2: Illustration of a-RNN. the LR layer considers the weighted sum hsum of all the hidden states, instead of just hk (Fig.2): k mechanism of Yang et al. is part of a classification hsum = atht (1) method for longer texts (e.g., product reviews). t=1 X Their method uses two GRURNNs, both bidirec- Pa RNN(reject c) = σ(Wphsum + bp) − | tional (Schuster and Paliwal, 1997), one turning the word embeddings of each sentence to a sen- The weights at are produced by an attention mech- tence embedding, and one turning the sentence anism, which is an MLP with l layers: embeddings to a document embedding, which is (1) (1) (1) then fed to an LR layer. Yang et al. use their at- at = ReLU(W ht + b ) (2) ... tention mechanism in both RNNs, to assign atten- (l 1) (l 1) (l 2) (l 1) tion scores to words and sentences. We consider a − = ReLU(W a − + b ) t − t − shorter texts (comments), we have a single RNN, (l) (l) (l 1) (l) 8 at = W at − + b and we assign attention scores to words only. (l) (l) (l) RNN In a variant of a-RNN, called da-RNN (di- at = softmax(at ; a1 , . . . , ak ) da- : rect attention), the input to the first layer of the at- (1) (l 1) r (l) (1) where at , . . . , at − R , at , at R, W tention mechanism is the embedding xt of word r m (2) (l ∈1) r r ∈(l) 1 r∈ (1,x) r d R × , W ,...,W − R × , W R × , wt, rather than ht (cf. Eq.2; W R × ): (1) (l 1) r (l)∈ ∈ ∈ b , . . . , b − R , b R. The softmax ∈ (l) ∈ (1) (1,x) (1) operates across all the at (t = 1, . . . , k), making at = ReLU(W xt + b ) (3) the attention weights at sum to 1. Our attention mechanism differs from most previous ones (Mnih Intuitively, the attention of a-RNN considers each et al., 2014; Bahdanau et al., 2015; Xu et al., 2015; word embedding xt in its (left) context, modelled Luong et al., 2015) in that it is used in a classifi- by ht, whereas the attention of da-RNN considers cation setting, where there is no previously gen- directly xt without its context, but hsum is still the erated output subsequence (e.g., partly generated weighted sum of the hidden states (Eq.1). translation) to drive the attention (e.g., assign more eq-RNN: In another variant of a-RNN, called eq- weight to source words to translate next), unlike RNN, we assign equal attention to all the hidden seq2seq models (Sutskever et al., 2014). It assigns states. The feature vector of the LR layer is now 1 k larger weights at to hidden states ht correspond- the average hsum = k t=1 ht (cf. Eq.1). ing to positions where there is more evidence that da-CENT: For ablationP testing, we also experi- the comment should be accepted or rejected. ment with a variant, called da-CENT, that does not Yang et al. (2016) use a similar attention mech- use the hidden states of the RNN. The input to the anism, but ours is deeper. In effect they always attention mechanism is now directly the embed- set l = 2, whereas we allow l to be larger (tuning ding xt instead of ht (as in da-RNN, Eq.3), and selects l = 4).7 On the other hand, the attention 8We tried a bidirectional instead of unidirectional GRU 7 Yang et al. use tanh instead of ReLU in Eq.2, which chain in our methods, also replacing the LR layer by a deeper (l) works worse in our case, and no bias b in the l-th layer. classification MLP, but there were no improvements.

28 h is the weighted average (centroid) of word sum 0.0 accept gray reject 1.0 k 9 embeddings hsum = t=1 atxt (cf. Eq.1). ta : accept tr : reject eq-CENT: For furtherP ablation, we also experi- threshold threshold ment with eq-CENT, which uses neither the RNN nor the attention mechanism. The feature vector Figure 3: Illustration of threshold tuning. of the LR layer is now simply the average of word 1 k embeddings hsum = k t=1 xt (cf. Eq.1). pooled outputs) goes through a dropout layer (Hin- We set l = 4, d P= 300, m = r = 128, ton et al., 2012)(p = 0.5), and then to an LR layer, having tuned the hyper-parameters of RNN and which provides PCNN(reject c). For Gazzetta, the | a-RNN on the same 2% held-out training com- CNN is the same, except that n = 1,..., 5, lead- ments used to tune DETOX; da-RNN, eq-RNN, da- ing to 1,500 features per comment. All hyper- CENT, and eq-CENT use the same hyper-parameter parameters were tuned on the 2% held-out train- values as a-RNN, to make their results more di- ing comments used to tune the other methods. rectly comparable and save time. We use Glo- Again, we use 300-dimensional word embeddings, rot initialization (Glorot and Bengio, 2010), cross- which are now randomly initialized, since tuning entropy loss, and Adam (Kingma and Ba, 2015).10 indicated this was better than initializing to pre- Early stopping evaluates on the same held-out sub- trained embeddings. OOV words are treated as in sets. For Gazzetta, word embeddings are initial- the RNN-based methods. All embeddings are up- ized to the WORD2VEC embeddings we provide dated. Early stopping evaluates on the held-out (Section 2.1). For the Wikipedia datasets, they subsets. Again, we use Glorot initialization, cross- 13 are initialized to GLOVE embeddings (Pennington entropy loss, and Adam. et al., 2014).11 In both cases, the embeddings are updated during backpropagation. Out of vocabu- 3.4 LIST baseline lary (OOV) words, meaning words not encountered A baseline, called LIST, collects every word w that in the training set and/or words we have no ini- occurs in more than 10 (for W-ATT-TRAIN, W- tial embeddings for, are mapped (during training TOX-TRAIN, G-TRAIN-S) or 100 comments (for and testing) to a single randomly initialized em- G-TRAIN-L) in the training set, along with the pre- bedding, which is also updated during training.12 cision of w, i.e., the ratio of rejected training com- ments containing w divided by the total number 3.3 CNN of training comments containing w. The resulting We also compare against a vanilla CNN operating lists contain 10,423, 11,360, 16,864, and 21,940 W ATT TRAIN W TOX on word embeddings. We describe the CNN only word types, when using - - , - - briefly, because it is very similar to that of of Kim TRAIN, G-TRAIN-S, G-TRAIN-L, respectively. For a comment c, P (reject c) is the maximum pre- (2014); see also Goldberg (2016) for an introduc- LIST | c tion to CNNs, and Zhang and Wallace (2015). cision of all the words in . For Wikipedia comments, we use a ‘narrow’ 3.5 Tuning thresholds convolution layer, with kernels sliding (stride 1) All methods produce a p = P (reject c) per com- over (entire) embeddings of word n-grams of sizes | ment c. In semi-automatic moderation (Fig.1), a n = 1,..., 4. We use 300 kernels for each n comment is directly rejected if its p is above a re- value, a total of 1,200 kernels. The outputs of jection threshold t , it is directly accepted if p is each kernel, obtained by applying the kernel to r below an acceptance threshold t , and it is shown the different n-grams of a comment c, are then a to a moderator if t p t (gray zone of Fig.3). max-pooled, leading to a single output per ker- a ≤ ≤ r In our experience, moderators (or their employ- nel. The resulting feature vector (1,200 max- ers) can easily specify the approximate percentage 9 We also tried tf-idf scores in the hsum of da-CENT, instead of comments they can afford to check manually of attention scores, but preliminary results were poor. (e.g., 20% daily) or, equivalently, the approximate 10We used Keras (http://keras.io/) with the Ten- sorFlow back-end (http://www.tensorflow.org/). percentage of comments the system should han- 11See https://nlp.stanford.edu/projects/ dle automatically. We call coverage the latter per- glove/. We use ‘Common Crawl’ (840B tokens). centage; hence, 1 coverage is the approximate 12For Gazzetta, words encountered only once in the train- − ing set (G-TRAIN-L or G-TRAIN-S) are also treated as OOV. 13We implemented the CNN directly in TensorFlow.

29 percentage of comments to be checked manually. Training dataset: G-TRAIN-S G-DEVG-TEST-LG-TEST-SG-TEST-S-R System By contrast, moderators are baffled when asked to AUCAUCAUCAUC Spearman tune tr and ta directly. Consequently, we ask them RNN 75.75 75.10 74.40 80.27 51.89 to specify the approximate desired coverage. We a-RNN 76.19 76.15 75.83 80.41 52.51 da-RNN 75.96 75.90 74.25 80.05 52.49 G then sort the comments of the development set ( - eq-RNN 74.31 74.01 73.28 77.73 45.77 DEV, W-ATT-DEV, W-TOX-DEV) by p, and slide da-CENT 75.09 74.96 74.20 79.92 51.04 ta from 0.0 to 1.0 (Fig.3). For each ta value, eq-CENT 73.93 73.82 73.80 78.45 48.14 we set t to the value that leaves a 1 coverage CNN 70.97 71.34 70.88 76.03 42.88 r − DETOX 72.50 72.06 71.59 75.67 43.80 percentage of development comments in the gray LIST 61.47 61.59 61.26 64.19 24.33 zone (ta p tr). We then select the ta (and Training dataset: G-TRAIN-L ≤ ≤ G-DEVG-TEST-LG-TEST-SG-TEST-S-R tr) that maximizes the weighted harmonic mean System F (P ,P ) AUCAUCAUCAUC Spearman β reject accept on the development set: RNN 79.50 79.41 79.23 84.17 59.31 a-RNN 79.64 79.58 79.67 84.69 60.87 (1 + β2) P P F (P ,P ) = · reject · accept da-RNN 79.60 79.56 79.38 84.40 60.83 β reject accept 2 eq-RNN 77.45 77.76 77.28 82.11 55.01 β Preject + Paccept · da-CENT 78.73 78.64 78.62 83.53 57.82 eq-CENT 76.76 76.85 76.30 82.38 53.28 where Preject is the rejection precision (correctly CNN 77.57 77.35 78.16 83.98 55.90 rejected comments divided by rejected comments) DETOX ––––– and Paccept is the acceptance precision (correctly LIST 67.04 67.06 66.17 69.51 33.61 accepted divided by accepted). Intuitively, cover- Table 2: Results on Gazzetta comments. age sets the width of the gray zone, whereas Preject and Paccept show how certain we can be that the red (reject) and green (accept) zones are free of also show that RNN is always better than CNN and misclassified comments. We set β = 2, emphasiz- DETOX; there is no clear winner between CNN and ing P , because moderators are more worried accept DETOX. Furthermore, a-RNN is always better than about wrongly accepting abusive comments than RNN 14 on Gazzetta comments (Table2), but not al- wrongly rejecting non-abusive ones. The se- ways on Wikipedia comments (Table3). Another lected t , t (tuned on development data) are then a r observation is that da-RNN is always worse than used in experiments on test data. In fully auto- a-RNN (Tables2–3), confirming that the hidden matic moderation, coverage = 100% and t = t ; a r states of the RNN are a better input to the attention otherwise, threshold tuning is identical. mechanism than word embeddings. The perfor- 4 Experimental results mance of da-RNN deteriorates further when equal attention is assigned to the hidden states (eq-RNN), Following Wulczyn et al. (2017), we report when the weighted sum of hidden states (hsum) is in Tables2–3 AUC scores (area under ROC replaced by the weighted sum of word embeddings curve), along with Spearman correlations be- (da-CENT), or both (eq-CENT). Also, da-CENT tween system-generated probabilities P (accept c) outperforms eq-CENT, indicating that the atten- | and human probabilistic gold labels (Section 2.2) tion mechanism improves the performance of sim- when probabilistic gold labels are available.15 ply averaging word embeddings. The Wikipedia A first observation is that increasing the size of subsets are easier (all methods perform better on the Gazzetta training set (G-TRAIN-S to G-TRAIN- Wikipedia subsets, compared to Gazzetta). L, Table2) significantly improves the performance Figure4 shows F2(Preject,Paccept) on G-TEST- of all methods; we do not report DETOX results L, G-TEST-S, W-ATT-TEST, W-TOX-TEST, when for G-TRAIN-L, because its implementation could ta, tr are tuned on the corresponding develop- not handle the size of G-TRAIN-L. Tables2–3 ment tests for varying coverage. For the Gazzetta 14 More precisely, when computing Fβ , we reorder the de- datasets, we show results training on G-TRAIN-S velopment comments by time posted, and split them into (solid lines) and G-TRAIN-L (dashed). The differ- batches of 100. For each ta (and tr) value, we compute Fβ per batch and macro-average across batches. The resulting ences between RNN and a-RNN are again small, thresholds lead to Fβ scores that are more stable over time. but it is now easier to see that a-RNN is overall 15 When computing AUC, the gold label is the majority la- better. Again, a-RNN and RNN are better than CNN bel of the annotators. When computing Spearman, the gold label is probabilistic (% of annotators that accepted the com- and DETOX, and the results improve with a larger ment). The decisions of the systems are always probabilistic. training set (dashed). On W-ATT-TEST and W-

30 Training dataset: W-ATT-TRAIN W-ATT-DEVW-ATT-TEST System AUC Spearman AUC Spearman RNN 97.39 71.92 97.71 72.79 a-RNN 97.46 71.59 97.68 72.32 da-RNN 97.02 71.49 97.31 72.11 eq-RNN 92.66 60.77 92.85 60.16 da-CENT 96.73 70.13 97.06 71.08 eq-CENT 92.30 57.21 92.81 56.33 CNN 96.91 70.06 97.07 70.21 (%) (%) DETOX 96.26 67.75 96.71 68.09 LIST 93.05 55.39 92.91 54.55 Training dataset: W-TOX-TRAIN W-TOX-DEVW-TOX-TEST System AUC Spearman AUC Spearman RNN 98.20 68.84 98.42 68.89 a-RNN 98.22 68.95 98.38 68.90 da-RNN 98.05 68.59 98.28 68.55 (%) (%) eq-RNN 94.72 55.48 95.04 55.86 da-CENT 97.83 67.86 97.94 67.74 eq-CENT 94.31 53.35 94.61 52.93 Figure 4: F2 scores for varying coverage. Dashed CNN 97.76 65.50 97.86 65.56 lines were obtained using a larger training set. DETOX 97.16 63.57 97.13 63.24 LIST 93.96 51.35 93.95 51.18 ject’. Napoles et al. also reported that up/down Table 3: Results on Wikipedia comments. votes, a form of social filtering, are inappropriate proxies for comment and thread quality. Lee et al. TOX-TEST, a-RNN obtains Paccept,Preject 0.94 (2014) discuss social filtering in detail and propose ≥ for all coverages (Fig.4, call-outs). On the more features (e.g., thread depth, no. of revisiting users) difficult Gazzetta datasets, a-RNN still obtains to assess the quality of a thread without processing Paccept,Preject 0.85 when tuned for 50% cov- the texts of its comments. Diakopoulos (2015) dis- ≥ erage. When tuned for 100% coverage, comments cusses how editors select high quality comments. for which the system is uncertain (gray zone) can- In further work, Napoles et al. (2017a) aimed to not be avoided and there are inevitably more mis- identify high quality threads. Their best method classifications; the use of F2 during threshold tun- converts each comment to a comment embedding ing places more emphasis on avoiding wrongly using DOC2VEC (Le and Mikolov, 2014). An accepted comments, leading to high P ( accept ≥ ensemble of Conditional Random Fields (CRFs) 0.82), at the expense of wrongly rejected com- (Lafferty et al., 2001) assigns labels (from their an- ments, i.e., sacrificing P ( 0.56). On the reject ≥ notation scheme, e.g., for sentiment, off-topic) to re-moderated G-TEST-S-R (similar diagrams, not the comments of each thread, viewing each thread shown), Paccept,Preject become 0.96, 0.88 for cov- as a sequence of DOC2VEC embeddings. The de- erage 50%, and 0.92, 0.48 for coverage 100%. cisions of the CRFs are then used to convert each thread to a feature vector (total count and mean 5 Related work marginal probability of each label in the thread), Napoles et al. (2017b) developed an annotation which is passed on to an LR classifier. Further scheme for online conversations, with 6 dimen- improvements were observed when additional fea- sions for comments (e.g., sentiment, tone, off- tures were added, BOW counts and POS n-grams topic) and 3 dimensions for threads. The scheme being the most important ones. Napoles et al. was used to label a dataset, called YNACC, of (2017a) also experimented with a CNN, similar to 9.2K comments (2.4K threads) from Yahoo News that of Section 3.3, which was not however a top- and 16.6K comments (1K threads) from the Inter- performer, presumably because of the small size net Argument Corpus (Walker et al., 2012; Ab- of the training set (2.1K YNACC threads). bott et al., 2016). Abusive comments were fil- Djuric et al. (2015) experimented with 952K tered out, hence YNACC cannot be used for our manually moderated comments from Yahoo Fi- purposes, but it may be possible to extend the an- nance, but their dataset is not publicly available. notation scheme for abusive comments, to predict They convert each comment to a DOC2VEC em- more fine-grained labels, instead of ‘accept’ or ‘re- bedding, which is fed to an LR classifier. No-

31 bata et al. (2016) experimented with approx. 3.3M Cheng et al. (2015) predict which users would manually moderated comments from Yahoo Fi- be banned from on-line communities. Their best nance and News; their data are also not avail- system uses a Random Forest or LR classifier, with able.16 They used Vowpal Wabbit17 with char- features examining readability, activity (e.g., num- acter n-grams (n = 3,..., 5) and word n-grams ber of posts daily), community and moderator re- (n = 1, 2), hand-crafted features (e.g., com- actions (e.g., up-votes, number of deleted posts). ment length, number of capitalized or black-listed Lukin and Walker (2013) experimented with words), features based on dependency trees, aver- 5.5K utterances from the Argument Cor- ages of WORD2VEC embeddings, and DOC2VEC- pus (Walker et al., 2012; Abbott et al., 2016) an- like embeddings. Character n-grams were the notated with nastiness scores, and 9.9K utterances best, on their own outperforming Djuric et al. from the same corpus annotated for sarcasm.18 In (2015). The best results, however, were obtained a bootstrapping manner, they manually identified using all features. By contrast, we use no hand- cue words and phrases (indicative of nastiness or crafted features and parsers, making our methods sarcasm), used the cue words to obtain training easily portable to other domains and languages. comments, and extracted patterns from the train- Wulczyn et al. (2017) experimented with char- ing comments. Xiang et al. (2012) also employed acter and word n-grams, based on the findings of bootstrapping to identify users whose tweets fre- Nobata et al. (2016). We included their dataset and quently or never contain profane words, and col- moderation system (DETOX) in our experiments. lected 381M tweets from the two user types. They Wulczyn et al. also used DETOX (trained on W- trained decision tree, Random Forest, or LR clas- ATT-TRAIN) as a proxy (instead of human anno- sifiers to distinguish between tweets from the two tators) to automatically classify 63M Wikipedia user types, testing on 4K tweets manually labeled comments, which were then used to study the as containing profanity or not. The classifiers problem of personal attacks (e.g., the effect of used topical features, obtained via LDA (Blei et al., allowing anonymous comments, how often per- 2003), and a feature indicating the presence of at sonal attacks were followed by moderation ac- least one of approx. 330 known profane words. tions). Our methods could replace DETOX in stud- Sood et al. (2012a; 2012b) experimented with ies of this kind, since they perform better. 6.5K comments from Yahoo Buzz, moderated via Waseem et al. (2016) used approx. 17K tweets crowdsourcing. They showed that a linear SVM, annotated for hate speech. Their best method representing each comment as a bag of word bi- was an LR classifier with character n-grams (n = grams and stems, performs better than word lists. 1,..., 4) and a gender feature. Badjatiya et al. Their best results were obtained by combining the (2017) experimented with the same dataset us- SVM with a word list and edit distance. ing LR, SVMs (Cortes and Vapnik, 1995), Ran- Yin et al. (2009) used posts from chat rooms dom Forests (Ho, 1995), Gradient Boosted Deci- and discussion fora (<15K posts in total) to train sion Trees (GBDT)(Friedman, 2002), CNN (similar an SVM to detect online harassment. They used to that of Section 3.3), LSTM (Greff et al., 2015), TF-IDF, sentiment, and context features (e.g., sim- FastText (Joulin et al., 2017). They also consid- ilarity to other posts in a thread).19 Our methods ered alternative feature sets: character n-grams, tf- might also benefit by considering threads, rather idf vectors, word embeddings, averaged word em- than individual comments. Yin et al. point out that beddings. Their best results were obained using unlike other abusive content, spam in comments GBDT with averaged word embeddings learned by or discussion fora (Mishne et al., 2005; Niu et al., the LSTM, starting from random embeddings. 2007) is off-topic and serves a commercial pur- Warner and Hirschberg (2012) aimed to detect pose. Spam is unlikely in Wikipedia discussions anti-semitic speech, experimenting with 9K para- and extremely rare so far in Gazzetta comments. graphs and a linear SVM. Their features consider Mihaylov and Nakov (2016) identify comments windows of up to 5 tokens, the tokens of each win- posted by opinion manipulation trolls. Dinakar et dow, their order, POS tags, Brown clusters etc., fol- lowing Yarowsky (1994). 18For sarcasm, see Davidov et al. (2010), Gonzalez-Ibanez et al. (2011), Joshi et al. (2015), Oraby et al. (2016). 16According to Nobata et al., their clean test dataset (2K 19Sentiment features have been used by several methods, comments) would be made available, but it is currently not. but sentiment analysis (Pang and Lee, 2008; Liu, 2015) is 17See http://hunch.net/˜vw/. typically not directly concerned with abusive content.

32 al. (2011) and Dadvar et al. (2013) detect cyber- Acknowledgments bullying. Chandrinos et al. (2000) detect porno- This work was funded by Google’s Digital News graphic web pages, using a Naive Bayes classifier Initiative (project ML2P, contract 362826).21 We with text and image features. Spertus (1997) flag are grateful to Gazzetta for the data they pro- flame messages in Web feedback forms, using de- vided. We also thank Gazzetta’s moderators for cision trees and hand-crafted features. A Kaggle their feedback, insights, and advice. dataset for insult detection is also available.20 It contains 6.6K comments (3,947 train, 2,647 test) labeled as insults or not. However, abusive com- References ments that do not directly insult other participants R. Abbott, B. Ecker, P. Anand, and M. A. Walker. 2016. of the same discussion are not classified as insults, Internet Argument Corpus 2.0: An SQL schema for even if they contain profanity, hate speech, insults dialogic social media and the corpora to go with it. to third persons etc. In LREC. Portoroz, Slovenia. P. Badjatiya, S. Gupta, M. Gupta, and V. Varma. 2017. Deep learning for hate speech detection in tweets. 6 Conclusions In WWW (Companion). Perth, Australia, pages 759– 760. We experimented with a new publicly available D. Bahdanau, K. Cho, and Y. Bengio. 2015. Neural dataset of 1.6M moderated user comments from a machine translation by jointly learning to align and Greek sports news portal and two existing datasets translate. In ICLR. San Diego, CA. of English Wikipedia talk page comments. We D. M. Blei, A. Y. Ng, and M. I. Jordan. 2003. Latent showed that a GRURNN operating on word em- Dirichlet Allocation. Journal of Machine Learning beddings outperforms the previous state of the art, Research 3:993–1022. which used an LR or MLP classifier with char- K.V. Chandrinos, I. Androutsopoulos, G. Paliouras, acter or word n-gram features. It also outper- and C.D. Spyropoulos. 2000. Automatic Web rat- forms a vanilla CNN operating on word embed- ing: Filtering obscene content on the Web. In Proc. of the 4th European Conference on Research and dings, and a baseline that uses an automatically Advanced Technology for Digital Libraries. Lisbon, constructed word list with precision scores. A Portugal, pages 403–406. novel, deep, classification-specific attention mech- J. Cheng, C. Danescu-Niculescu-Mizil, and anism improves further the overall results of the J. Leskovec. 2015. Antisocial behavior in on- RNN. The attention mechanism also improves the line discussion communities. In Proc. of the results of a simpler method that averages word em- International AAAI Conference on Web and Social beddings. We considered both fully automatic and Media. Oxford University, England, pages 61–70. semi-automatic moderation, along with threshold K. Cho, B. van Merrienboer, C. Gulcehre, D. Bah- tuning and evaluation measures for both. danau, F. Bougares, H. Schwenk, and Y. Bengio. 2014. Learning phrase representations using RNN We plan to consider user-specific information encoder–decoder for statistical machine translation. (e.g., ratio of comments rejected in the past) and In EMNLP. Doha, Qatar, pages 1724–1734. thread statistics (e.g., thread depth, number of re- J. Cohen. 1960. A coefficient of agreement for nom- visiting users) (Dadvar et al., 2013; Lee et al., inal scales. Educational and Psychological Mea- 2014; Cheng et al., 2015; Waseem and Hovy, surement 20(1):37–46. 2016). We also plan to explore character-level C. Cortes and Vladimir Vapnik. 1995. Support-Vector RNNs or CNNs (Zhang et al., 2015), for example Networks. Machine Learning 20(3):273–297. to produce embeddings of unknown or obfuscated words from characters (dos Santos and Zadrozny, M. Dadvar, D. Trieschnigg, R. Ordelman, and 2014; Ling et al., 2015). We are also exploring F. de Jong. 2013. Improving cyberbullying detec- tion with user context. In ECIR. Moscow, Russia, how the attention scores of a-RNN can be used pages 693–696. to highlight ‘suspicious’ words or phrases when showing gray comments to moderators. D. Davidov, O. Tsur, and A. Rappoport. 2010. Semi- supervised recognition of sarcastic sentences in and Amazon. In CoNLL. Uppsala, Sweden, pages 107–116. 20See http://www.kaggle.com/, data description of the competition ‘Detecting Insults in Social Commentary’. 21See https://digitalnewsinitiative.com/.

33 N. Diakopoulos. 2015. Picking the NYT picks: Edito- D. P. Kingma and J. Ba. 2015. Adam: A method for rial criteria and automation in the curation of online stochastic optimization. In ICLR. San Diego, CA. news comments. Journal of the International Sym- posium on Online Journalism 5:147–166. K. Krippendorff. 2004. Content Analysis: An Introduc- tion to Its Methodology (2nd edition). Sage Publica- K. Dinakar, R. Reichart, and H. Lieberman. 2011. tions. Modeling the detection of textual cyberbullying. In The Social Mobile Web. Barcelona, Spain, volume J. D. Lafferty, A. McCallum, and F. C. N. Pereira. WS-11-02 of AAAI Workshops, pages 11–17. 2001. Conditional Random Fields: Probabilistic models for segmenting and labeling sequence data. N. Djuric, J. Zhou, R. Morris, M. Grbovic, V. Ra- In ICML. Williamstown, MA, pages 282–289. dosavljevic, and N. Bhamidipati. 2015. Hate speech detection with comment embeddings. In WWW. Q. V. Le and T. Mikolov. 2014. Distributed represen- Florence, Italy, pages 29–30. tations of sentences and documents. In ICML. Bei- jing, China, pages 1188–1196. C. N. dos Santos and B. Zadrozny. 2014. Learning character-level representations for part-of-speech J.-T. Lee, M.-C. Yang, and H.-C. Rim. 2014. Discov- tagging. In ICML. Beijing, China, pages 1818– ering high-quality threaded discussions in online fo- 1826. rums. Journal of Computer Science and Technology 29(3):519–531. J.H. Friedman. 2002. Stochastic gradient boost- ing. Computational Statistics and Data Analysis W. Ling, C. Dyer, A. W. Black, I. Trancoso, R. Fer- 38(4):367–378. mandez, S. Amir, L. Marujo, and T. Lu´ıs. 2015. Finding function in form: Compositional character X. Glorot and Y. Bengio. 2010. Understanding the dif- models for open vocabulary word representation. In ficulty of training deep feedforward neural networks. EMNLP. Lisbon, Portugal, pages 1520–1530. In Proc. of the International Conference on Artifi- cial Intelligence and Statistics. Sardinia, Italy, pages B. Liu. 2015. Sentiment Analysis – Mining Opinions, 249–256. Sentiments, and Emotions. Cambridge University Press. Y. Goldberg. 2016. A primer on neural network mod- els for natural language processing. Journal of Arti- S. Lukin and M. Walker. 2013. Really? well. ap- ficial Intelligence Research 57:345–420. parently bootstrapping improves the performance of sarcasm and nastiness classifiers for online dialogue. R. I. Gonzalez-Ib´ a´nez,˜ S. Muresan, and N. Wacholder. In Proc. of the Workshop on Language in Social Me- 2011. Identifying sarcasm in Twitter: A closer look. dia. Atlanta, Georgia, pages 30–40. In ACL. Portland, Oregon, pages 581–586. T. Luong, H. Pham, and C. D. Manning. 2015. Effec- I. Goodfellow, Y. Bengio, and A. Courville. 2016. tive approaches to attention-based neural machine Deep Learning. MIT Press. translation. In EMNLP. Lisbon, Portugal, pages 1412–1421. K. Greff, R.K. Srivastava, J. Koutn´ık, B.R. Steune- brink, and J. Schmidhuber. 2015. LSTM: A search T. Mihaylov and P. Nakov. 2016. Hunting for troll space Odyssey. CoRR abs/1503.04069. comments in news community forums. In ACL. Berlin, Germany, pages 399–405. G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. 2012. Improv- T. Mikolov, K. Chen, G. Corrado, and J. Dean. 2013a. ing neural networks by preventing co-adaptation of Efficient estimation of word representations in vec- feature detectors. CoRR abs/1207.0580. tor space. In ICLR. Scottsdale, AZ. T.K. Ho. 1995. Random Decision Forests. In Proc. T. Mikolov, W.-t. Yih, and G. Zweig. 2013b. Linguis- of the 3rd International Conference on Document tic regularities in continuous space word representa- Analysis and Recognition. Montreal, Canada, vol- tions. In NAACL-HLT. Atlanta, GA, pages 746–751. ume 1, pages 278–282. G. Mishne, D. Carmel, and R. Lempel. 2005. Blocking A. Joshi, V. Sharma, and P. Bhattacharyya. 2015. Har- spam with language model disagreement. In nessing context incongruity for sarcasm detection. Proc. of the International Workshop on Adversarial In ACL. Beijing, China, pages 757–762. Information Retrieval on the Web. Chiba, Japan. A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov. V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu. 2017. Bag of tricks for efficient text classification. 2014. Recurrent models of visual attention. In In EACL (short papers). Valencia, Spain, pages 427– NIPS. Montreal, Canada, pages 2204–2212. 431. C. Napoles, A. Pappu, and J. Tetreault. 2017a. Au- Y. Kim. 2014. Convolutional neural networks for sen- tomatically identifying good conversations online tence classification. In EMNLP. Doha, Qatar, pages (yes, they do exist!). In Proc. of the International 1746–1751. AAAI Conference on Web and Social Media.

34 C. Napoles, J. Tetreault, E. Rosato, B. Provenzale, and E. Wulczyn, N. Thain, and L. Dixon. 2017. Ex A. Pappu. 2017b. Finding good conversations on- machina: Personal attacks seen at scale. In WWW. line: The Yahoo News annotated comments corpus. Perth, Australia, pages 1391–1399. In Proc. of the Linguistic Annotation Workshop. Va- lencia, Spain, pages 13–23. G. Xiang, B. Fan, L. Wang, J. Hong, and C. Rose. 2012. Detecting offensive tweets via topical feature dis- Y. Niu, Y.-M. Wang, H. Chen, M. Ma, and F. Hsu. covery over a large scale twitter corpus. In CIKM. 2007. A quantitative study of forum spamming us- Maui, Hawaii, pages 1980–1984. ing context-based analysis. In Proc. of the Annual Network and Distributed System Security Sympo- K. Xu, J. Ba, J.R. Kiros, K. Cho, A.C. Courville, sium. San Diego, CA, pages 79–92. R. Salakhutdinov, R.S. Zemel, and Y. Bengio. 2015. Show, attend and tell: Neural image caption gener- C. Nobata, J. Tetreault, A. Thomas, Y. Mehdad, and ation with visual attention. In ICML. Lille, France, Y. Chang. 2016. Abusive language detection in pages 2048–2057. online user content. In WWW. Montreal, Canada, pages 145–153. Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy. 2016. Hierarchical attention networks S. Oraby, V. Harrison, L. Reed, E. Hernandez, for document classification. In NAACL-HLT. San E. Riloff, and M. A. Walker. 2016. Creating and Diego, CA, pages 1480–1489. characterizing a diverse corpus of sarcasm in dia- logue. In SIGDial. Los Angeles, CA, pages 31–41. D. Yarowsky. 1994. Decision lists for lexical ambi- guity resolution: Application to accent restoration B. Pang and L. Lee. 2008. Opinion mining and senti- in Spanish and French. In ACL. Las Cruces, NM, ment analysis. Foundations and Trends in Informa- pages 88–95. tion Retrieval 2(1-2):1–135. D. Yin, Z. Xue, L. Hong, B. D. Davison, A. Kon- J. Pennington, R. Socher, and C. Manning. 2014. tostathis, and L. Edwards. 2009. Detection of ha- GloVe: Global vectors for word representation. In rassment on Web 2.0. In Proc. of the WWW work- EMNLP. Doha, Qatar, pages 1532–1543. shop on Content Analysis in the Web 2.0. Madrid, Spain. M. Schuster and K. K. Paliwal. 1997. Bidirectional re- current neural networks. IEEE Transacions of Sig- X. Zhang, J. Zhao, and Y. LeCun. 2015. Character- nal Processing 45(11):2673–2681. level convolutional networks for text classification. S. Sood, J. Antin, and E. F. Churchill. 2012a. Profanity In NIPS. Montreal, Canada, pages 649–657. use in online communities. In SIGCHI. Austin, TX, Y. Zhang and B. C. Wallace. 2015. A sensitivity anal- pages 1481–1490. ysis of (and practitioners’ guide to) convolutional S. Sood, J. Antin, and E. F. Churchill. 2012b. Us- neural networks for sentence classification. CoRR ing crowdsourcing to improve profanity detection. abs/1510.03820. In AAAI Spring Symposium: Wisdom of the Crowd. Stanford, CA, pages 69–74. E. Spertus. 1997. Smokey: Automatic recognition of hostile messages. In Proc. of the National Con- ference on Artificial Intelligence and the Innovative Applications of Artificial Intelligence Conference. Providence, Rhode Island, pages 1058–1065. I. Sutskever, O. Vinyals, and Q. V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS. Montreal, Canada, pages 3104–3112. M. A. Walker, J. E. Fox Tree, P. Anand, R. Abbott, and J. King. 2012. A corpus for research on deliberation and debate. In LREC. Istanbul, Turkey, pages 4445– 4452. W. Warner and J. Hirschberg. 2012. Detecting hate speech on the World Wide Web. In Proc. of the 2nd Workshop on Language in Social Media. Montreal, Canada, pages 19–26. Z. Waseem and D. Hovy. 2016. Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter. In Proc. of the NAACL Student Research Workshop. San Diego, CA, pages 88–93.

35