StRE: Self Attentive Edit Quality Prediction in Wikipedia
Soumya Sarkar*1, Bhanu Prakash Reddy*2, Sandipan Sikdar3, Animesh Mukherjee4
IIT Kharagpur, India1,2,4; RWTH Aachen, Germany3
[email protected], [email protected], [email protected], [email protected]
Abstract

Wikipedia can easily be justified as a behemoth, considering the sheer volume of content that is added or removed every minute across its several projects. This creates an immense scope, in the field of natural language processing, for developing automated tools for content moderation and review. In this paper we propose the Self Attentive Revision Encoder (StRE), which leverages orthographic similarity of lexical units toward predicting the quality of new edits. In contrast to existing propositions, which primarily employ features like page reputation, editor activity or rule based heuristics, we utilize the textual content of the edits, which we believe contains superior signatures of their quality. More specifically, we deploy deep encoders to generate representations of the edits from their text content, which we then leverage to infer quality. We further contribute a novel dataset containing ~21M revisions across 32K Wikipedia pages and demonstrate that StRE outperforms existing methods by a significant margin (at least 17% and at most 103%). Our pre-trained model achieves this result after retraining on a set as small as 20% of the edits in a wikipage. This, to the best of our knowledge, is also the first attempt toward employing deep language models in the enormous domain of automated content moderation and review in Wikipedia.

1 Introduction

Wikipedia is the largest multilingual encyclopedia known to mankind, with the current English version consisting of more than 5M articles on highly diverse topics segregated into categories, constructed by a large editor base of more than 32M editors (Hube and Fetahu, 2019). To encourage transparency and openness, Wikipedia allows anyone to edit its pages, albeit with certain guidelines for them¹.

Problem: The inherent openness of Wikipedia has also made it vulnerable to external agents who intentionally attempt to divert the unbiased, objective discourse toward a narrative aligned with the interests of malicious actors. Our pilot study on 100 manually annotated Wikipedia pages from four categories (25 pages per category) shows that at most 30% of the edits are reverted (see Fig. 1). The global average fraction of reverted damaging edits is ~9%². This makes manual intervention to detect these edits with potentially inconsistent content infeasible. Wikipedia hence deploys machine learning based classifiers (West et al., 2010; Halfaker and Taraborelli, 2015) which primarily leverage hand-crafted features from three aspects of a revision: (i) basic text features like repeated characters, long words, capitalized words etc., (ii) temporal features like inter-arrival time between events of interest, and (iii) dictionary based features like the presence of curse words or informal words (e.g., 'hello', 'yolo'). Other feature based approaches include (Daxenberger and Gurevych, 2013; Bronner and Monz, 2012), which generally follow a similar archetype.

Proposed model: In most cases, edits are reverted because they fail to abide by the edit guidelines, e.g., usage of inflammatory wording or expressing opinion instead of fact, among others (see Fig. 2). These flaws are fundamentally related to the textual content rather than the temporal patterns or editor behavior exploited in existing methods. Although dictionary based approaches do look into the text to a small extent (swear words, long words etc.), they account for only a small subset of the edit patterns.

¹ en.wikipedia.org/Wikipedia:List of policies
² stats.wikimedia.org/EN/PlotsPngEditHistoryTop.htm
* Both authors contributed equally.
Figure 1: Average number of edits and average number of damaging edits, i.e., reverted edits, for four different categories of pages (Company, Person, Technology, Concept). A fraction (at most 30%) of user generated edits are damaging edits.

Figure 2: Examples of edits on the Facebook and Google Wikipedia pages. The blue bubbles are the original sentences. The orange bubbles indicate damaging edits while the green bubbles indicate 'good faith' edits. Good faith edits are unbiased, formal English sentences, while damaging edits often correspond to incoherent use of language, abusive language, imperative mood, opinionated sentences etc.

We further hypothesize that owing to the volume and variety of Wikipedia data, it is impossible to develop a feature driven approach which can encompass the wide array of dependencies present in text. In fact, we show that such approaches are inefficacious in identifying most of the damaging edits owing to these obvious limitations. We hence propose the Self Attentive Revision Encoder (StRE), which extracts rich feature representations of an edit that can be further utilized to predict whether the edit has damaging intent. Specifically, we use two stacked recurrent neural networks to encode the semantic information from the sequence of characters and the sequence of words, which serves a twofold advantage. While character embeddings extract information from out-of-vocabulary tokens, i.e., repeated characters, misspelled words, maliciously capitalized characters, unnecessary punctuation etc., word embeddings extract meaningful features from curse words, informal words, imperative tone, facts without references etc. We further employ attention mechanisms (Bahdanau et al., 2014) to quantify the importance of a particular character/word. Finally, we leverage this learned representation to classify an edit as damaging or valid. Note that StRE is reminiscent of the structured self attentive model proposed in (Lin et al., 2017), albeit used in a different setting.

Findings: To determine the effectiveness of our model, we develop an enormous dataset consisting of ~21M edits across 32K wikipages. We observe that StRE outperforms the closest baseline by at least 17% and at most 103% in terms of AUPRC. Since it is impossible to develop a universal model which performs equally well for all categories, we develop a transfer learning (Howard and Ruder, 2018) setup which allows us to deploy our model to newer categories without training from scratch. This further allows us to employ our model on pages with a lower number of edits.

Contributions: Our primary contributions in this paper are summarized below:
(i) We propose a deep neural network based model to predict edit quality in Wikipedia which utilizes language modeling techniques to encode semantic information in natural language.
(ii) We develop a novel dataset consisting of ~21M unique edits extracted from ~32K Wikipedia pages. Our proposed model outperforms all the existing methods in detecting damaging edits on this dataset.
(iii) We further develop a transfer learning setup which allows us to deploy our model to newer categories without the need for training from scratch.
Code and sample data related to the paper are available at https://github.com/bhanu77prakash/StRE.
2 The Model

In this section we give a detailed description of our model. We consider an edit to be a pair of sentences, one representing the original (P_or) and the other the edited version (P_ed). The input to the model is the concatenation of P_or and P_ed (say P = {P_or || P_ed}) separated by a delimiter ('||'). We assume P consists of w_i words and c_i characters. Essentially, we consider two levels of encoding: (i) character level, to extract patterns like repeated characters, misspelled words, unnecessary punctuation etc., and (ii) word level, to identify curse words, imperative tone, opinionated phrases etc. In the following we present how we generate a representation of the edit and utilize it to detect malicious edits. The overall architecture of StRE is presented in Fig. 3.
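As a concrete illustration of the input construction, the following minimal sketch (ours, not the authors' released code) concatenates an edit pair with the '||' delimiter and derives the word-level and character-level sequences; the whitespace tokenization is an assumption made for brevity.

```python
# Minimal sketch (not the authors' code): build the input P = {P_or || P_ed}
# and derive its word-level and character-level views.
def build_edit_input(original: str, edited: str, delimiter: str = "||"):
    p = f"{original} {delimiter} {edited}"   # concatenated edit pair
    words = p.split()                        # word sequence (whitespace tokenization assumed)
    chars = list(p)                          # character sequence
    return p, words, chars

# Example pair taken from Figure 2.
p, words, chars = build_edit_input(
    "Eric Schmidt took over as Google's CEO when co-founder Larry Page stepped down.",
    "Dr Eric Schmidt, FORMER CEO OF NOVELL, took over as Google's CEO when co-founder Larry Page stepped down.",
)
```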
2.1 Word encoder

Given an edit P with w_i, i ∈ [0, L] words, we first embed the words through a pre-trained embedding matrix W_e such that x_i = W_e w_i. This sequence of embedded words is then provided as input to a bidirectional LSTM (Hochreiter and Schmidhuber, 1997), which provides representations of the words by summarizing information from both directions.

x_i = W_e w_i,  i ∈ [0, L]    (1)
→v_i = →LSTM(x_i),  i ∈ [0, L]    (2)
←v_i = ←LSTM(x_i),  i ∈ [L, 0]    (3)

We obtain the representation for each word by concatenating the forward and backward hidden states, v_i = [→v_i, ←v_i]. Since not all words contribute equally to the context, we deploy an attention mechanism to quantify the importance of each word. The final representation is then a weighted aggregation over the words.

u_i = σ(W_w v_i + b_w)    (4)
β_i = exp(u_i^T u_w) / Σ_{j=0}^{L} exp(u_j^T u_w)    (5)
R_w = Σ_i β_i v_i    (6)

To calculate the attention weight for a hidden state, the state is first fed through a single layer perceptron and a softmax function is then used to obtain the weights. Note that we use a word context vector u_w which is randomly initialized and learnt during the training process. The use of a context vector as a higher level representation of a fixed query has been argued for in (Sukhbaatar et al., 2015; Kumar et al., 2016). Note that the attention score calculation is reminiscent of the one proposed in (Yang et al., 2016).

2.2 Character encoder

The character encoder module is similar to the word encoder module with minor differences. Formally, we consider P ({P_or || P_ed}) as a sequence of T characters c_i, i ∈ [0, T]. Instead of using pre-trained embeddings as in the case of the word encoder, we define an embedding module whose parameters are also learned during training, which is basically an MLP. Each embedded character is then passed through a bidirectional LSTM to obtain the hidden states for each character. Formally, we have

y_i = σ(W_c c_i + b_c),  i ∈ [0, T]    (7)
→h_i = →LSTM(y_i),  i ∈ [0, T]    (8)
←h_i = ←LSTM(y_i),  i ∈ [T, 0]    (9)

We next calculate the attention scores for each hidden state h_i as

z_i = σ(W_c h_i + b_c)    (10)
α_i = exp(z_i^T u_c) / Σ_{j=0}^{T} exp(z_j^T u_c)    (11)
R_c = Σ_i α_i h_i    (12)

Note that u_c is a character context vector which is learned during training.
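To make the encoder concrete, here is a minimal PyTorch sketch of one encoder branch (word or character level) following Eqs. (1)-(6) and (7)-(12); the class name, the dimensions and the use of a sigmoid projection are our assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class AttentiveBiLSTMEncoder(nn.Module):
    """One StRE encoder branch: embed -> bidirectional LSTM -> additive
    attention with a learned context vector -> weighted sum (R_w or R_c)."""
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden_dim, 2 * hidden_dim)      # W_w / W_c and b_w / b_c
        self.context = nn.Parameter(torch.randn(2 * hidden_dim))   # u_w / u_c, learned

    def forward(self, token_ids):                 # token_ids: (batch, seq_len)
        x = self.embed(token_ids)                 # token embeddings
        h, _ = self.bilstm(x)                     # concatenated forward/backward states
        u = torch.sigmoid(self.proj(h))           # Eq. (4) / (10)
        scores = u @ self.context                 # unnormalized attention scores
        weights = torch.softmax(scores, dim=1)    # Eq. (5) / (11)
        return (weights.unsqueeze(-1) * h).sum(dim=1)   # Eq. (6) / (12): R_w or R_c
```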
2.3 Edit classification

The edit vector E_p (for an edit P) is the concatenation of the character and word level encodings, E_p = [R_c, R_w], which we then use to classify whether an edit is valid or damaging. Specifically, we compute

p = softmax(W_p E_p + b_p)

Finally, we use the binary cross entropy between the predicted and the true labels as our training loss.

2.4 Transfer learning setup

Note that it is not feasible to train the model from scratch every time a new page in an existing or a new category is introduced. Hence we propose a transfer learning setup whereby, for a new page, we use the pre-trained model and only update the weights of the dense layers during training. The advantages are twofold: (i) the model needs only a limited amount of training data and hence can easily be trained on the new pages, and (ii) we benefit significantly in training time.
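The fine-tuning step described above amounts to freezing the pre-trained encoders and retraining only the dense classification layers. A minimal sketch follows; the attribute name `classifier` and the optimizer choice are our assumptions.

```python
import torch

def prepare_for_new_page(model, lr=1e-3):
    """Transfer learning setup sketch: freeze all pre-trained weights and
    update only the dense (classification) layers for a new page/category."""
    for param in model.parameters():
        param.requires_grad = False              # freeze encoders and embeddings
    for param in model.classifier.parameters():  # 'classifier' is a hypothetical attribute
        param.requires_grad = True               # keep the dense layers trainable
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=lr)
```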
Figure 3: Overall architecture of StRE. The character encoding and the word encoding components of the model are shown on the left and right respectively. This is followed by the attention layer, followed by concatenation and softmax.

Resources        Count
Pages            32,394
Total edits      21,848,960
Positive edits   15,791,575
Negative edits    6,057,385

Table 1: Summary of the dataset.

3 Dataset

Wikipedia provides access to all Wikimedia project pages in the form of XML dumps, which are periodically updated³. We collect data from the dumps made available by the English Wikipedia project in June 2017, which contain information about 5.5M pages.

We extract a subset of pages related to the Computer Science category in Wikipedia. Utilizing the category hierarchy⁴ (Auer et al., 2007) (typically a directed graph containing parent and child categories), we extract all articles under the Computer Science category up to a depth of four levels, which accounts for 48.5K Wikipedia pages across 1.5K categories⁵. We retain only pages with at least 100 edits, which leaves us with 32K pages. For each page in our dataset we perform a pairwise difference operation⁶ between its current and previous versions to obtain a set of pairs, each consisting of a sentence and its subsequent modified version.

³ https://dumps.wikimedia.org/enwiki/20181120
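The pairwise difference operation⁶ can be realized with Python's difflib, as sketched below; the naive sentence splitting and the choice to keep only 'replace' opcodes are simplifications of ours, so the authors' preprocessing may differ in detail.

```python
import difflib

def sentence_pairs(prev_revision: str, curr_revision: str):
    """Sketch of the pairwise difference operation between two consecutive
    revisions: align sentences and keep those modified by the edit."""
    prev_sents = prev_revision.split(". ")   # naive sentence splitting (assumption)
    curr_sents = curr_revision.split(". ")
    matcher = difflib.SequenceMatcher(a=prev_sents, b=curr_sents)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":                 # a sentence was changed by this edit
            pairs.append((" ".join(prev_sents[i1:i2]), " ".join(curr_sents[j1:j2])))
    return pairs
```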
Edit quality: In order to train our model to identify quality edits versus damaging edits, we need a deterministic score for an edit. Our quality score is based on the intuition that if the changes introduced by an edit are preserved, the edit was beneficial, whereas if the changes are reverted, the edit likely had a negative effect. This idea is adapted from the previous work of Adler et al. (2011).

Consider a particular article and denote by v_k its k-th revision (i.e., the state of the article after the k-th edit). Let d(u, v) be the Levenshtein distance between two revisions. We define the quality of edit k from the perspective of the article's state after ℓ ≥ 1 subsequent edits as

q_{k|ℓ} = (d(v_{k−1}, v_{k+ℓ}) − d(v_k, v_{k+ℓ})) / d(v_{k−1}, v_k)

Intuitively, the quantity q_{k|ℓ} captures the proportion of work done in edit k that remains in revision k+ℓ, and it varies in the range q_{k|ℓ} ∈ [−1, 1]; when the value falls outside this range, it is capped to these two values. We compute the mean quality of the edit by averaging over multiple future revisions as

q_k = (1/L) Σ_{ℓ=1}^{L} q_{k|ℓ}

where L is the minimum of the number of subsequent revisions of the article and 10, consistent with the previous work of Yardım et al. (2018).

Edit label: For each pair of edits we compute the edit quality score. If the quality score is ≥ 0 we label the edit −1, i.e., done in good faith; all edits with quality score < 0 are labeled 1, i.e., damaging edits. We further check that bad quality edits are indeed damaging edits by calculating what fraction of low score edits are reverted and what fraction of high score edits are not reverted. This result is illustrated in Figure 4. Whether an edit is reverted or not can be determined by mining Wikipedia's revert graph, following the technique illustrated by (Kittur et al., 2007). The results clearly show that a large proportion of bad quality edits are indeed reverted by the editors and, similarly, a large fraction of good quality edits are not reverted. Though bad quality edits are often reverted, not all reverted edits are bad: malicious agents often engage in interleaving reverts, i.e., edit wars (Kiesel et al., 2017), as well as pseudo reverts. Hence we use the quality metric to label damaging edits, which is well accepted in the literature (Adler et al., 2008). We provide a summary of the data in Table 1. Our final data can be represented by a triplet <s_i, s_f, l>, where s_i is the initial sentence, s_f is the modified sentence and l indicates the edit label.
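The scoring and labeling above can be summarized in a short sketch; the Levenshtein helper and the guard against a zero denominator are ours, and the authors' implementation may differ.

```python
def levenshtein(u: str, v: str) -> int:
    """Standard dynamic-programming edit distance d(u, v)."""
    prev = list(range(len(v) + 1))
    for i, cu in enumerate(u, 1):
        curr = [i]
        for j, cv in enumerate(v, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (cu != cv)))  # substitution
        prev = curr
    return prev[-1]

def edit_quality(revisions, k, max_lookahead=10):
    """Mean quality q_k of edit k, averaged over up to `max_lookahead` future
    revisions, with each q_{k|l} capped to [-1, 1] as described in the text."""
    lookahead = min(max_lookahead, len(revisions) - 1 - k)
    scores = []
    for ell in range(1, lookahead + 1):
        denom = levenshtein(revisions[k - 1], revisions[k]) or 1   # zero-distance guard (assumption)
        q = (levenshtein(revisions[k - 1], revisions[k + ell])
             - levenshtein(revisions[k], revisions[k + ell])) / denom
        scores.append(max(-1.0, min(1.0, q)))
    return sum(scores) / len(scores) if scores else 0.0

def edit_label(q_k: float) -> int:
    """Label rule from the paper: quality >= 0 -> good faith (-1), else damaging (1)."""
    return -1 if q_k >= 0 else 1
```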
⁴ Dbpedia.org
⁵ Typical categories include 'Computational Science', 'Artificial Intelligence' etc.
⁶ https://docs.python.org/2/library/difflib.html

4 Experiments

In this section we demonstrate the effectiveness of our model compared to other existing techniques.