Automatically Labeling Low Quality Content on Wikipedia by Leveraging Patterns in Editing Behavior

ANONYMOUS AUTHOR(S)

Wikipedia articles aim to be definitive sources of encyclopedic content. Yet, only 0.6% of Wikipedia articles are rated as high quality on Wikipedia's own quality scale, owing to the insufficient number of Wikipedia editors relative to the enormous number of articles. Supervised Machine Learning (ML) quality improvement approaches that can automatically identify and fix content issues rely on manual labels of individual Wikipedia sentence quality. However, current labeling approaches are tedious and produce noisy labels. Here, we propose an automated labeling approach that identifies the semantic category (e.g., adding citations, clarifications) of historic Wikipedia edits and uses the modified sentences prior to the edit as examples that require that semantic improvement. Statements from the highest-rated articles serve as examples that no longer need semantic improvements. We show that training existing sentence quality classification algorithms on our labels improves their performance compared to training them on existing labels. Our work shows that the editing behaviors of Wikipedia editors provide better labels than labels generated by crowdworkers who lack the context to make judgments that the editors would agree with.

CCS Concepts: • Human-centered computing → Social recommendation; Computer supported cooperative work; Empirical studies in collaborative and social computing; Wikis; Social tagging systems.

Additional Key Words and Phrases: Wikipedia, labeling, Machine Learning.

ACM Reference Format:
Anonymous Author(s). 2020. Automatically Labeling Low Quality Content on Wikipedia by Leveraging Patterns in Editing Behavior. 1, 1 (October 2020), 19 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION

Wikipedia [27], an online encyclopedia, aims to be the ultimate source of encyclopedic knowledge by achieving high quality for all its articles. High quality articles are definitive sources of knowledge on their topic and serve the purpose of providing information to Wikipedia readers in a concise manner, without causing confusion or wasting time [25]. Thus, Wikipedia editors have defined comprehensive content assessment criteria, called the WP1.0 Article Quality Assessment scale [29], to grade article quality on a scale from the most basic "stub" (articles with basic information about the topic, without proper citations and Wikipedia-defined structure) to the exemplary "Featured Articles" (well-written, well-structured, comprehensive, and properly cited articles).

Article maintenance, as opposed to creating new articles and content, has become a significant portion of what Wikipedia editors do [17]. Currently, editors rate article quality and identify and make required improvements manually, which is taxing and time-consuming. Because Wikipedia is a collaborative editing platform, articles are in a constant state of churn, and current assessments quickly become outdated as articles are modified by others. For the limited number of experienced editors on Wikipedia, performing such assessments across a set of 6.5 million Wikipedia articles is a huge bottleneck [23]; currently only about 7,000 articles have "Featured Article" status and only about 33,000 have the second best "Good Article" status [29].
With a continuously declining number of editors on Wikipedia [24], automating quality assessment tasks could reduce the workload of the remaining editors. Supervised Machine Learning (ML) has already automated tasks like vandalism detection [12] and overall article quality prediction [26]. Such ML approaches require labeled sets of examples of Wikipedia content that requires improvement (positive examples) and content that does not (negative examples).

Fig. 1. Our proposed pipeline for labeling low-quality statements on Wikipedia. We start with our automated labeling approach (top row), where we obtain a large corpus of historic Wikipedia statement edits and label their semantic intent using programmatic rules. We extract positive statements from relevant semantic edits and negative statements from Featured Articles. We then use our labels to train existing Machine Learning models, and test them by comparing with labeling approaches from past research (middle row). Existing models trained on our labels can then be deployed to automatically detect Wikipedia statements that require improvement (bottom row).

One of the main reasons for the success of those existing ML approaches [12, 26] (both have been deployed to Wikipedia) is the relative ease of obtaining labels, either because the target is visually salient (e.g., in the case of vandalism) or because the labels are already part of existing practices (e.g., editors manually record article quality on the talk pages of Wikipedia articles as part of existing article assessment).

However, automating other quality assessment tasks (e.g., identifying sentences that require citation, sentences with a non-neutral point of view, sentences that require clarification) requires labels at the Wikipedia sentence level, which makes automating such tasks difficult. Wikipedia editors rarely manually flag outstanding Wikipedia statement quality issues as part of their editing process [1]. Even existing crowdsourcing-based labeling methods [15, 22, 34] can produce noisy Wikipedia statement quality labels, especially when crowdworkers, who are not domain experts, lack knowledge about Wikipedia policies on content quality [8, 11, 16].

Here, we present a method for automatically labeling Wikipedia statement quality across improvement categories directly from past Wikipedia editors' editing behavior to enable article quality improvements (Figure 1). To label positive examples (statements that need improvements), we implemented Wikipedia's core content principles and guidelines [30] as syntax-based rules that capture the meaning or intent of a historic edit (e.g., added citations, removed bias, clarified statement) for each statement quality category we want to classify (e.g., needs citation, needs bias-removal, or needs clarification). Each historic edit then indicates that the edited statement needed that particular improvement, resulting in a positive example. We follow the approach of Redi et al. [22] and label all statements in Featured Articles as negative examples (statements that do not need improvements).

To illustrate our approach, we built three statement quality detection pipelines (including the corresponding rules) for three Wikipedia quality improvement categories: 1) citations (adding or modifying references and citations for verifiability), 2) Neutral Point of View (NPOV) edits (rewriting using an encyclopedic, neutral tone; removing bias), and 3) clarifications (specifying or explaining an existing fact or meaning by example or discussion without adding new information). We validated our automated labeling approach by comparing the performance of existing deep learning models [2]

trained using existing, baseline labeling approaches (e.g., implicit labeling [22], crowdsourcing [15]) and our automatically extracted labels. Our results showed that existing models trained using our automatic labeling method achieved 20% and 15% improvements in F1-score for citations and NPOV, respectively, over the same models trained on data labeled using existing approaches.

Our work provides further evidence that the edits produced by Wikipedians working in their context provide a better signal for supporting their work than labels generated by crowdworkers who lack the context to make judgments about sentence quality that Wikipedians would agree with. Learning from the implicit editing behavior of Wikipedia editors allowed us to produce labels that capture the nuances of Wikipedia quality policies. Our work has implications for the growth of collaborative content spaces where different people come together to curate content adhering to the standards and purpose of the space.

2 CHALLENGES OF LABELING LOW QUALITY CONTENT ON WIKIPEDIA

Automated approaches to improving and maintaining good quality of articles on Wikipedia have received considerable attention. For example, Wikipedia has deployed automatic vandalism detection [12] that effectively relieves editors of the burden of manually fighting vandals. This has made fighting vandalism a relatively easy task, as bots have taken over most of the responsibility of detecting and reverting vandalism edits [9], leaving the editors to make more content-related edits. Existing article quality models [5, 26] already automatically rate Wikipedia article quality based on content and structure.

Such automated efforts have been possible in part because of the availability of quality labels for such tasks. For example, a small subset of visually salient, hand-labeled examples is sufficient for even simple ML models to identify vandalism with high accuracy [9]. Also, training existing article quality models [26] involves using existing quality labels for over 6.5 million articles that Wikipedians have already generated when manually rating article quality as part of existing processes.

Unfortunately, that kind of automated assistance to editors has not extended beyond these two tasks, because labels for other quality assessment and improvement tasks are not immediately available. Although Wikipedia encourages editors to manually flag outstanding content issues with cleanup template markup (e.g., marking a sentence with an inline cleanup template [33]) or to label Wikipedia edits with a free-form edit intent summary (e.g., point-of-view), their usage is not standardised and only a few of the Wikipedia sentences that need improvement, or of the Wikipedia edits, actually have them [1].

Existing attempts to supplement such labels via crowdsourcing [15, 22, 34] produce few labels when using Wikipedia editors as labelers, or produce noisy labels when using crowdworkers who are not Wikipedia editors and do not always provide reliable judgments on what content needs improvement, due to their lack of knowledge about the nuances of Wikipedia policies [8, 16].
Although crowdsourcing has been used in the past [19] to successfully label examples of vandalism, it is important to note that annotating vandalism is simpler than annotating examples related to concepts like the need for citations, neutrality of point of view, and clarifications, since the concept of vandalism is commonly shared between lay web users and Wikipedians. In the absence of a widely accepted, clear standard for categorizing Wikipedia statement quality, most of the other tasks that editors perform are hard to label.

To get around explicitly asking editors or crowdworkers to label the quality of Wikipedia statements, existing research [22] has attempted to obtain labels implicitly by propagating the article quality label to all sentences in the article. For example, Redi et al. [22] showed that citation labels are easy to obtain because the presence or absence of citations in "Featured Articles" acts as an implicit label: statements with citations needed them and those without did not. Although such implicit

labels can be used to label negative examples across semantic improvement categories, they cannot extract positive examples of needed improvements for categories such as NPOV or clarification.

Recently, Yang et al. [34] created a taxonomy of Wikipedia edits based on semantic intentions. This taxonomy comprehensively covers the tasks Wikipedians do, ranging from fighting vandalism and copy-editing to making content clarifications and simplifications. Although Yang et al. [34] used crowdsourcing to label Wikipedia statements according to their taxonomy, such a taxonomy also lends itself to being directly converted into programmatic rules. Such rule-based analyses have shown promise in identifying vandalism-related edit reverts [20] and in distinguishing between conflict and non-conflict edit revert activity [10]. However, it is not immediately obvious how such rules can be adapted to the problem of automatically labeling Wikipedia statement quality using the taxonomy above.

3 METHOD FOR AUTOMATICALLY LABELING LOW QUALITY CONTENT

Building machine learning models for Wikipedia statement quality identification and improvement requires quality labels on individual statements from Wikipedia articles so that patterns in low quality statements can be learnt. While English Wikipedia has cleanup templates [33] that can be used to annotate statements having quality issues (e.g., needing clarification, needing citations, needing grammar improvements) inline, their usage is not uniform across editors. Further, other language Wikipedias do not have well defined cleanup templates for tagging quality issues with statements in the way English Wikipedia does. On the other hand, collaborative editing with the aim of improving content is at the heart of the wiki ideology. Thus, it is meaningful to learn quality improving behaviors directly from edits. In this scenario, starting with identifying the semantic intent of the edit has two benefits: 1) editing is common across all Wikipedia languages, hence it provides a common approach for all languages; 2) semantic identification of edits is easier given their syntax compared to identifying issues in free-form natural language statements.

Existing research around identifying the semantic intent of Wikipedia edits [34] is useful for getting the overall intent of a Wikipedia edit, which may be composed of multiple changed paragraphs with their own intentions. To identify statements that were semantically improved, we need semantic intent at the statement level. Thus, our proposed pipeline described below starts with the identification of the semantic intention of Wikipedia edits using regular expressions to realize our goal of getting statements with quality improvement labels.

Our proposed pipeline for labeling low-quality statements on Wikipedia includes three steps: 1) using a combination of rules to pre-process edits and identify parts of the edit corresponding to a particular semantic type; 2) for a given semantic type (e.g., citations), extracting positive statements from edits of that semantic type ("needing the semantic improvement") and extracting an equivalent number of statements from "Featured Articles" as negative examples ("not needing the semantic improvement"); and 3) training GRU [4] based RNN models on these sets of labeled statements to identify when an unknown statement needs that kind of semantic improvement.
The effectiveness of our labeling approach relies on two key factors: 1) Wikipedia has millions of edits. Even if our rules make major exclusions, the enormous number of edits ensures that we are still left with a good amount of data to train effective deep learning models. 2) Syntax-based regular expression rules allow us to target specific changes made in an edit with zero ambiguity and discard the rest. This allows us to extract useful information from many more edits than just the edits where single-line changes are made, to which previous works were restricted.

We build statement quality detection pipelines for three semantic categories: 1) Citations - add or modify references and citations; remove unverified text; 2) Point-of-View (POV) - rewrite using an encyclopedic, neutral tone; remove bias; apply due weight; 3) Clarifications - specify or explain an existing fact or meaning by example or discussion without adding new information.

These semantic category definitions are derived from the definitions of the semantic edit types formulated by Yang et al. [34] in consultation with Wikipedia editors. Since semantic edit types represent the intent with which a particular edit (improvement) is made, it is logical to use the same definition for the semantic improvement category (what is improved). The definitions are not absolute, as their interpretations still vary (https://en.wikipedia.org/wiki/Wikipedia_talk:Labels/Edit_types/Taxonomy), but they provide a good starting point to base future work on.

Note that Wikipedia's Neutral Point of View (NPOV) policy [32] is a very broad policy covering a variety of cases of bias in text. To fix bias in content, two major types of point-of-view edits happen on Wikipedia: 1) inline point-of-view edits, where one or two biased words are deleted or replaced with a more neutral word in the sentence to fix the issue (e.g., https://en.wikipedia.org/wiki/?diff=745183020), and 2) full deletion of paragraphs or sentences not following the WP:NPOV policy [32] (e.g., https://en.wikipedia.org/wiki/?diff=745912558). We only address inline point-of-view edits in this paper. Any reference, henceforth, to point-of-view edits will refer to inline point-of-view edits. We discuss our rationale for this in Section 4.2.

We now define the generic rules used in our pre-processing step. Later, in Section 4, we illustrate how to use these rules to detect the citation, point-of-view, and clarification semantic categories of edits.

3.1 Identifying semantic intents in Wikipedia edits

Typical Wikipedia edits (e.g., https://en.wikipedia.org/wiki/?diff=743059193) contain a variety of changes performed with different intentions. For example, certain paragraphs may be made more neutral, certain others may be deleted because of a lack of citations, and some others may be copy-edited (fixing style and grammar). These changes can be additions, modifications, or deletions of free text or of markup-related syntax.

3.1.1 Preprocessing Diff. The first step towards getting high quality diffs of edits that were done with a specific semantic intention is pre-processing the diffs so that syntax-based rules can be applied effectively. Since the paragraph is the smallest logical unit describing an idea or piece of content, we group the individual changes in an edit (also called segments) by paragraph and also keep track of the context for each segment. We use the deltas library (https://pythonhosted.org/deltas/) for processing the diffs.

We call these grouped paragraph segments paragraph_edits. Each of these paragraph segments is a four-tuple structure containing the prior context of the segment, the deleted and inserted segments, and the post context of the segment. Since the context is the paragraph, all the text up to the beginning and the end of the paragraph is taken as the before and after context, respectively. Figure 2 shows an example edit with one segment in a paragraph containing a single deletion with a corresponding insertion.

Fig. 2. An example of a segment in an edit with a deletion and a corresponding insertion. Before context is shown with a blue dashed line and after context with a green dashed line.
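To make the pre-processing step concrete, the sketch below shows one way to recover a (prior context, deleted, inserted, post context) four-tuple for a changed segment. It is a simplified stand-in that uses Python's standard difflib instead of the deltas library our pipeline actually uses, and it only handles changes within a single paragraph; the name paragraph_edit and the whitespace tokenization are illustrative, not part of our implementation.

```python
import difflib
from typing import List, Tuple

def paragraph_edit(old_par: str, new_par: str) -> List[Tuple[str, str, str, str]]:
    """Return (prior_context, deleted, inserted, post_context) tuples for one paragraph.

    Simplified stand-in for the deltas-based pre-processing: tokenize by whitespace,
    diff the two versions, and wrap each changed segment in its paragraph context.
    """
    old_tokens, new_tokens = old_par.split(), new_par.split()
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens)
    segments = []
    for op, a1, a2, b1, b2 in matcher.get_opcodes():
        if op == "equal":
            continue
        segments.append((
            " ".join(old_tokens[:a1]),    # prior context (start of paragraph up to the change)
            " ".join(old_tokens[a1:a2]),  # deleted text ("" for pure insertions)
            " ".join(new_tokens[b1:b2]),  # inserted text ("" for pure deletions)
            " ".join(old_tokens[a2:]),    # post context (rest of the paragraph)
        ))
    return segments

# Example: a single deletion with a corresponding insertion, as in Figure 2.
old = "While the exact cause is unknown, it is believed to involve genetic factors."
new = "While the exact cause is unknown, Tourette's is believed to involve genetic factors."
print(paragraph_edit(old, new))
```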

Rule | Description | Type: Example
deleted_length_words | Number of words in the deleted segment, after markup and stopword removal | Range: [0, 30]
inserted_length_words | Number of words in the inserted segment, after markup and stopword removal | Range: [0, 30]
deleted_inserted_diff_words | Number of words differing between deleted and inserted, after markup and stopword removal | Ordinal
has_template_inserted | Complete template inserted | Discrete: {-1, 0, 1}
has_template_deleted | Complete template deleted | Discrete: {-1, 0, 1}
has_citation_inserted | Has a complete citation inserted | Discrete: {-1, 0, 1}
has_citation_deleted | Has a complete citation deleted | Discrete: {-1, 0, 1}
has_wikilink_inserted | Has a complete wikilink inserted | Discrete: {-1, 0, 1}
has_wikilink_deleted | Has a complete wikilink deleted | Discrete: {-1, 0, 1}
is_template_inserted | Inserted part is an internal modification of a template | Discrete: {-1, 0, 1}
is_template_deleted | Deleted part is an internal modification of a template | Discrete: {-1, 0, 1}
is_wikilink_inserted | Inserted part is an internal modification of a wikilink | Discrete: {-1, 0, 1}
is_wikilink_deleted | Deleted part is an internal modification of a wikilink | Discrete: {-1, 0, 1}
is_citation_inserted | Inserted part is an internal modification of a citation | Discrete: {-1, 0, 1}
is_citation_deleted | Deleted part is an internal modification of a citation | Discrete: {-1, 0, 1}
is_infobox_inserted | Inserted part is an internal modification of an infobox | Discrete: {-1, 0, 1}
is_infobox_deleted | Deleted part is an internal modification of an infobox | Discrete: {-1, 0, 1}
is_multiline_inserted | Inserted part is multiline | Discrete: {-1, 0, 1}
is_multiline_deleted | Deleted part is multiline | Discrete: {-1, 0, 1}
is_list_inserted | Inserted part is part of a list | Discrete: {-1, 0, 1}
is_list_deleted | Deleted part is part of a list | Discrete: {-1, 0, 1}
para_changes | Number of changes in the paragraph | Ordinal
para_length_words | Length of the containing paragraph (words) | Ordinal

Table 1. Intention labeling rules

3.1.2 Rule Based Semantic Intention Labeling. We then apply regular expression rules to these paragraph edits to select segments of interest. The rules exploit the syntactic structure of the edit segment to make predictions about its semantic intent. Table 1 lists all the rules we define to identify the different semantic intention categories.

The rules take on three types of values:

• Range - specifying an upper and lower bound on words or characters.
• Ordinal - a numerical value ranging from 0 to infinity, e.g., the number of words added.
• Discrete - taking three values {-1, 0, 1} corresponding to {exclude, don't care, include}. E.g., is_template_inserted set to -1 means the rule will not be satisfied if the inserted segment contains a template modification.

We allow a don't care value for all rules so they can be skipped if they are not of interest for a particular category. For example, has_wikilink_inserted is a don't care for the citation category, as that category is only concerned with looking for citation tags in the edit. Each rule corresponds to a specific syntax of wiki markup, and a subset of the rules is helpful for semantic labeling of a specific category. In Section 4 we discuss the intuition behind the rules and how to craft rules for new semantic types based on their definitions. Table 2 specifies the subset of rules we use for identifying each of citation, point-of-view, and clarification edits in edit segments.
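As an illustration of how such rules can be applied, the sketch below encodes a rule set as a mapping from rule names (Table 1) to the value types above and checks one pre-processed segment against it. The feature extraction is stubbed out with simple regular expressions, and the distinction between complete elements and internal modifications (has_* vs. is_* in Table 1) is collapsed; the names segment_features and satisfies are illustrative, not part of our implementation.

```python
import re

# Simplified wiki-markup detectors; the real rules cover many more syntax cases.
CITATION_RE = re.compile(r"<ref[ >]")
TEMPLATE_RE = re.compile(r"\{\{")
WIKILINK_RE = re.compile(r"\[\[")

def segment_features(deleted: str, inserted: str) -> dict:
    """Compute a few Table 1 features for one (deleted, inserted) segment."""
    return {
        "is_citation_inserted": bool(CITATION_RE.search(inserted)),
        "is_citation_deleted": bool(CITATION_RE.search(deleted)),
        "is_template_inserted": bool(TEMPLATE_RE.search(inserted)),
        "is_template_deleted": bool(TEMPLATE_RE.search(deleted)),
        "is_wikilink_inserted": bool(WIKILINK_RE.search(inserted)),
        "is_multiline_inserted": "\n" in inserted,
        "is_multiline_deleted": "\n" in deleted,
        "inserted_length_words": len(inserted.split()),
        "deleted_length_words": len(deleted.split()),
    }

def satisfies(features: dict, rules: dict) -> bool:
    """Check one segment's features against a rule set.

    Discrete rules: 1 = include (must hold), -1 = exclude (must not hold), 0/absent = don't care.
    Range rules: a (low, high) pair bounding a numeric feature.
    """
    for name, value in rules.items():
        observed = features.get(name, 0)
        if isinstance(value, tuple):
            low, high = value
            if not (low <= observed <= high):
                return False
        elif value == 1 and not observed:
            return False
        elif value == -1 and observed:
            return False
    return True

# A reduced clarification rule set in the spirit of Table 2: no markup changes,
# no multiline insertions, and only a few words inserted or deleted.
CLARIFICATION_RULES = {
    "is_citation_inserted": -1, "is_citation_deleted": -1,
    "is_template_inserted": -1, "is_template_deleted": -1,
    "is_multiline_inserted": -1, "is_multiline_deleted": -1,
    "inserted_length_words": (0, 10), "deleted_length_words": (0, 5),
}
features = segment_features(deleted="it", inserted="Tourette's")
print(satisfies(features, CLARIFICATION_RULES))  # True: a small wording clarification
```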

Category | Rule | Value
Citations | is_citation_inserted | 1
Clarification | is_citation_inserted or deleted | -1 (False)
Clarification | is_template_inserted or deleted | -1 (False)
Clarification | is_wikilink_inserted or deleted | -1 (False)
Clarification | is_infobox_inserted or deleted | -1 (False)
Clarification | is_multiline_inserted or deleted | -1 (False)
Clarification | inserted_length_words | [0, 10]
Clarification | deleted_length_words | [0, 5]
Point-of-view | is_citation_inserted or deleted | -1 (False)
Point-of-view | is_template_inserted or deleted | -1 (False)
Point-of-view | is_wikilink_inserted or deleted | -1 (False)
Point-of-view | is_infobox_inserted or deleted | -1 (False)
Point-of-view | is_multiline_inserted or deleted | -1 (False)
Point-of-view | comment_matches | "POV"
Point-of-view | para_changes | 1
Point-of-view | inserted_length_words | [0, 2]
Point-of-view | deleted_length_words | [0, 3]

Table 2. Rules for citation, point-of-view, and clarification edits

3.2 Extracting weakly labeled statements

We use the rules described in the previous section to identify citation, point-of-view, and clarification changes in 6.5 million Wikipedia edits extracted from 100,000 Wikipedia articles of quality ranging from the most basic "stub" articles to "Featured Articles".

3.2.1 Reverts Labeling. Before labeling the semantic category of the 6.5 million edits, we filter out the ones that were reverted, because most reverts [6] are due to vandalism. Learning from edits that are vandalism does not add any value, as they have no valid intention. We use a revert window of 15 edits and a maximum revert time of 2 days to identify edits that were reverted [7]. We use the mwreverts library (https://pythonhosted.org/mwreverts/) to label edits that were reverted and exclude them from the labeling step.

3.2.2 Statement extraction. On the remaining set of non-reverted edits, we run our rule-based system to tag segments in the edits with their semantic intent. After we identify the semantic intent of each segment in the edit, we extract the statements from the relevant semantically labeled segments as positive examples for the semantic category. For example, Figure 3 shows a Wikipedia edit whose semantic intent is clarification, and the corresponding sentence that was clarified by adding information. The prior version of the modified statement is "While the exact cause is unknown, it is believed to involve a combination of [[Genetics|genetic]] and environmental factors.". This sentence is a positive example of "needing clarification" because it was clarified to "While the exact cause is unknown, Tourette's is believed to involve a combination of [[Genetics|genetic]] and environmental factors."
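A minimal sketch of this extraction step is shown below: given a matched segment and its paragraph context, it recovers the full prior version of the modified sentence as a positive example. It assumes the four-tuple segment layout from Section 3.1.1 and uses a naive sentence splitter; both are simplifications of our actual implementation.

```python
import re

SENTENCE_END_RE = re.compile(r"(?<=[.!?])\s+")  # naive sentence boundary; no abbreviation handling

def positive_statement(prior_context: str, deleted: str, post_context: str) -> str:
    """Reconstruct the prior version of the sentence touched by a labeled segment.

    The statement before the edit is prior_context + deleted + post_context, restricted
    to the sentence that contains the change.
    """
    old_paragraph = " ".join(part for part in (prior_context, deleted, post_context) if part)
    change_offset = len(prior_context) + 1 if prior_context else 0
    start = 0
    for sentence in SENTENCE_END_RE.split(old_paragraph):
        end = start + len(sentence)
        if start <= change_offset <= end:
            return sentence.strip()
        start = end + 1
    return old_paragraph.strip()

# Example based on Figure 3: the word "it" was replaced with "Tourette's".
print(positive_statement(
    prior_context="While the exact cause is unknown,",
    deleted="it",
    post_context="is believed to involve a combination of genetic and environmental factors. Tics wax and wane.",
))
```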

Fig. 3. An edit having clarification as one of its intents, and the clarified statement. Note that the edit also contains other changes, but for the purpose of "clarification" we discard the rest of the changes, which are not caught by the rules, as irrelevant.

3.3 Training Statement Quality Models

With datasets of weakly labeled statements, we train deep learning models to identify statement quality along the proposed semantic category dimensions so that statements can be improved. We use sequence-based RNN models, as they have shown promise in similar classification tasks in the past [22]. Further, using the same models as the previous works we compare against eliminates the possibility of variations caused by model changes.

4 EVALUATION OF SEMANTIC INTENT RULES

Before evaluating our statement quality labels, we provide an evaluation of the rules used to extract the statements for the three semantic categories. As stated in Section 3, we build rules for the three semantic categories: 1) citations, 2) point-of-view, and 3) clarifications. We provide explanations for the rules for each of the three semantic edit types and justify them based on their definitions [34]. Refer to Table 2 for the semantic category rules. Table 3 contains a qualitative evaluation of our rules for the three categories on specific diff examples. The column revision-id is the unique identifier for accessing the given Wikipedia diff from which the example is obtained; the full diff URL is https://en.wikipedia.org/wiki/?diff= followed by the revision-id.

revision-id | Line No. | Category | Statement | Comment
879909094 | 17 | citation | ...DHS was "fully prepared" for the new Address. | new "<ref>" added at the end, regarding news of the announcement
293538899 | 1 | citation | ...despite its "mostly scathing" reception among historians | new "<ref>" added inline
852409759 | 18 | POV | ..student must successfully pass three.. | one word, "successfully", was deleted, so we extract the full sentence
621178816 | 71 | POV | "Let's Be Cops" has been met with mostly negative reviews. | "mostly" replaced with "generally", which satisfies our rule criteria
275534794 | 1 | clarification | The venue for the summit meetings was the State Guesthouse in Tokyo. | "Tokyo" clarified to "Tokyo, Japan" (0-2 words addition rule)
174763849 | 1 | clarification | "'Tokudane!"' is a morning [[news program]] airing on [[Fuji Television]], a [[television station]] in [[Japan]]. | "morning" clarified to "weekday morning" (0-2 words addition rule)
669670844 | 17 | clarification | Though Sonnenblick's ideas about the structure and function the human heart today constitute medical-scientific commonsense, they were utterly novel at the time. | clarified to "...ideas about the relationship between the..." (0-2 non-stop words addition rule)

Table 3. Qualitative evaluation of rules

4.1 Citations

Citations on Wikipedia are added by adding citation templates [28] and referencing them by adding "<ref>" tags in the wikitext of the article. Wikitext is the markup in which articles on Wikipedia are written. Since citations are easily identifiable by the presence of a "<ref>" tag, identifying whether a Wikipedia edit has the semantic type "citation" simplifies to looking for the addition of the "<ref>" tag in the added portion of the diff.
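A sketch of this citation check, under the same simplifying assumptions as the rule sketch in Section 3.1.2, is shown below: it flags an edit segment as a citation edit when the inserted portion of the diff adds a "<ref>" tag. The regular expression is a simplified version of the markup patterns we actually handle, and it does not distinguish a brand new citation from the reuse of an already defined named reference.

```python
import re

REF_OPEN_RE = re.compile(r"<ref(\s[^>]*)?>", re.IGNORECASE)  # opening <ref> or <ref name=...>

def is_citation_edit(inserted_segments, deleted_segments):
    """Return True when the edit adds a <ref> tag that was not present before."""
    added_refs = sum(len(REF_OPEN_RE.findall(s)) for s in inserted_segments)
    removed_refs = sum(len(REF_OPEN_RE.findall(s)) for s in deleted_segments)
    return added_refs > removed_refs

inserted = ['the DHS was "fully prepared" for the new address.<ref>{{cite news|title=...}}</ref>']
print(is_citation_edit(inserted, deleted_segments=[]))  # True
```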

4.2 Point-of-View (POV)

Labeling point-of-view edits is the hardest because of the linguistic nature of the category. As Recasens et al. [21] note, even humans who are not well versed in Wikipedia's Neutral Point of View (NPOV) policies face difficulty in correctly identifying point-of-view edits. Neutral Point of View (NPOV) is one of the most debated policies of English Wikipedia, and even Wikipedians have disagreements over it [32]. As noted in Section 3, inline point-of-view edits and major deletions of paragraphs because of point-of-view issues are fundamentally different from the lens of identification. Therefore, we argue that they should be targeted by different models. Inline point-of-view edits require contextual information to discriminate specific bias-inducing words and are harder to address. We leave the identification of biased paragraphs for future work.

To get high quality examples of point-of-view edits, we restrict ourselves to edits with a single change and a "POV" comment. This ensures that the comment "POV" is indeed meant for the change in the edit we will be extracting. Further, inline point-of-view edits mean that one to two words will be deleted or replaced by a neutral wording. Thus, the inserted_length_words rule is set to [0, 2] and deleted_length_words is set to [0, 3].

We skip edits whose comments match "POV" but that fail the rule of having a single change in the paragraph (e.g., https://en.wikipedia.org/wiki/?diff=745970378). While at least one of the changes in such edits will be a point-of-view edit, the rest of the changes are not guaranteed to be point-of-view edits. Including these non point-of-view changes introduces noise into the data. We can afford to discard these noisy point-of-view edits since we are still able to get enough training data from the inline edits described previously.
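A minimal sketch of this inline point-of-view filter is given below. It assumes an edit is represented by its comment plus the per-paragraph segments from Section 3.1.1; the stopword handling and markup stripping of the real rules are omitted, and the comment pattern is only an approximation of the summaries we match.

```python
import re

POV_COMMENT_RE = re.compile(r"\bN?POV\b", re.IGNORECASE)  # edit summaries like "POV", "npov fix"

def is_inline_pov_edit(comment: str, paragraph_segments: list) -> bool:
    """Keep only edits whose summary mentions POV and that change exactly one small segment.

    paragraph_segments: list of (deleted_words, inserted_words) token lists for one paragraph,
    as produced by the pre-processing step (simplified: no markup or stopword removal).
    """
    if not POV_COMMENT_RE.search(comment):
        return False
    if len(paragraph_segments) != 1:                      # para_changes == 1
        return False
    deleted, inserted = paragraph_segments[0]
    return len(inserted) <= 2 and len(deleted) <= 3       # inserted/deleted_length_words bounds

# Example: "mostly" replaced with "generally" in a single sentence, edit summary "POV".
print(is_inline_pov_edit("POV", [(["mostly"], ["generally"])]))  # True
```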

442 • "without adding new information" - no multiline addition - neither a new statement nor a 443 newline (is_multiline_inserted or deleted set to -1 (False)) 444 • "specify or explain an existing fact or meaning by example" - addition of a few significant 445 words (non-stopwords) (See the rule inserted_length_words set to [0,10]) 446 • is_citation/template/wikilink/infobox_inserted or deleted set to -1 (False) make general syntax 447 based exclusions which cannot be clarifications. 448 • All other rules are set to 0 (don’t care) as they are not relevant to the clarifications syntax. 449 450 5 EVALUATION OF STATEMENT QUALITY LABELS 451 In order to evaluate the effectiveness of our automatically generated statement quality labels, we 452 turn to training Machine Learning models with our labeled statements to perform the task of 453 whether a Wikipedia statement "needs" a specific semantic improvement or not. We will refer 454 to statement quality labels generated by our approach as "Edit-labels" henceforth. We compare 455 the effectiveness of our labels on two classification tasks attempted previously - 1) citations- 456 given a Wikipedia statement, identify whether it needs citations or not [22], 2) point-of-view - 457 given a Wikipedia statement, identify whether it is biased or not according to Wikipedia’s NPOV 458 policy [15]. For the semantic category clarification, we do not have any prior work to compare 459 against. We provide our own analysis for this category and the interpretations of the model 460 output. For comparison, we use the same RNN models from previous works and only change the 461 labels they are trained on. Specifically, we use GRU based Recurrent Neural Networks with global 462 attention [2,4] as they have shown promise in both the approaches. 463 464 5.1 Statement Representation 465 Recurrent Neural Networks (RNN) [13] are a class of deep learning algorithms that work on 466 sequential data and are good at finding patterns and dependencies between the elements ofthe 467 sequence. Thus they have wide use in Natural Language Processing in text classification tasks 468 because a sentence or a document can be thought of as a sequence of words with dependencies 469 between words. The input to these models is the sentence represented as a sequence of numerical 470 features, one for each word. We define the following input features: 471 472 5.1.1 Word representation. The sentence is represented as a sequence of word embeddings (푤1,푤2...푤푛). 473 We use GloVe word embeddings [18] to represent each word. The embedding dimension 푊푒푚푏 ∈ 474 R100 475 476 5.1.2 Part of Speech representation. Part-of-speech (POS) tags [3] have found extensive use in the 477 NLP community because of their usefulness in capturing additional information about relations 478 between words. We additionally represent as a sequence of pos-tag for each word (푝1, 푝2, ...푝푛). Like 100 479 the word feature, each pos-tag is represented in the 100-dimensional embedding space 푊푝표푠 ∈ R . 480 Unlike GloVe word embeddings which are pre-trained, we train the pos-tag embeddings with the 481 classification task. We use POS tag represenation for training only point-of-view and clarification 482 detection models. This is because baseline work on citations[22] does not use POS tags as an input. 483 484 5.1.3 Section representation. Previous work in citations [22] has shown that adding the section 485 representation alongwith word representations improves the performance of citation detection. 
5.1.3 Section representation. Previous work on citations [22] has shown that adding a section representation along with the word representations improves the performance of citation detection. This is because a section represents a coherent group of information, and different sections need different levels of citations. For example, sections like History, which describe historical events, need more citations than a section on the plot of a movie. We train the section embedding matrix W_S ∈ R^100 as part of the classification objective and use it in combination with the word vector input. We use the section input only for the citations task; the baseline work on point-of-view does not use a section input.
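To make the model side concrete, below is a minimal Keras sketch of a GRU encoder with a simple global-attention pooling layer over the word-embedding input, in the spirit of the architectures borrowed from prior work [2, 15, 22]. The hyperparameters (vocabulary size, sequence length, hidden units) are placeholders, only the word input is shown, and the attention formulation is simplified; it is not a faithful reimplementation of the baselines.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE = 50_000   # placeholder vocabulary size
MAX_LEN = 40          # placeholder maximum statement length (tokens)
EMB_DIM = 100         # GloVe dimensionality used in the paper

def build_statement_classifier() -> Model:
    tokens = layers.Input(shape=(MAX_LEN,), dtype="int32", name="tokens")
    # Word embeddings; in the full pipeline this matrix would be initialised from GloVe.
    x = layers.Embedding(VOCAB_SIZE, EMB_DIM)(tokens)
    # Bidirectional GRU encoder returning one hidden state per token.
    h = layers.Bidirectional(layers.GRU(64, return_sequences=True))(x)
    # Simple global attention: score each timestep, normalise, take the weighted sum.
    scores = layers.Dense(1, activation="tanh")(h)
    weights = layers.Softmax(axis=1, name="attention_weights")(scores)
    context = layers.Lambda(lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([h, weights])
    output = layers.Dense(1, activation="sigmoid", name="needs_improvement")(context)
    return Model(tokens, output)

model = build_statement_classifier()
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.Precision(), tf.keras.metrics.Recall()])
```

The POS-tag and section inputs described above would be added as parallel embedding inputs concatenated with the word embeddings before the GRU layer.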

5.2 Training models to identify the need for semantic improvement

Dataset | Train | Validation | Test
Citations | 68,000 | 9,780 | 19,561
Point-of-view | 129,500 | 18,500 | 37,000
Clarifications | 139,608 | 19,944 | 39,888

Table 4. Dataset splits for the citation, point-of-view, and clarification categories created from edit labels.

Table 4 shows our dataset statistics for each of the three semantic categories: citations, point-of-view, and clarifications. We use the weakly labeled statements extracted in Section 3 as positive examples of needing the semantic improvement for the RNN models. For the negative samples, we randomly sample an equal number of statements from Wikipedia "Featured Articles". We extract these statements without any pre-processing, i.e., with full wiki-markup. When training the models, we strip the input sentences of all wiki-markup, remove special characters, and convert all text to lowercase. For the citation detection task, we normalize the section input using the mediawiki-utilities library (https://pythonhosted.org/mediawiki-utilities/). Our train-validation-test split is 70%, 10%, and 20%. We use TensorFlow with Keras for training the models.
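A sketch of the statement pre-processing applied at training time is shown below, assuming the mwparserfromhell library for markup stripping; the paper does not name the specific tool used for this step.

```python
import re
import mwparserfromhell  # assumed here for markup stripping; pip install mwparserfromhell

def clean_statement(wikitext: str) -> str:
    """Strip wiki-markup, drop special characters, and lowercase a statement."""
    plain = mwparserfromhell.parse(wikitext).strip_code()   # removes templates, refs, link syntax
    plain = re.sub(r"[^a-z0-9\s]", " ", plain.lower())      # keep letters, digits, whitespace
    return re.sub(r"\s+", " ", plain).strip()

print(clean_statement(
    "While the exact cause is unknown, Tourette's is believed to involve "
    "[[Genetics|genetic]] and environmental factors.<ref>{{cite journal|...}}</ref>"
))
# e.g. "while the exact cause is unknown tourette s is believed to involve genetic and environmental factors"
```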
5.3 Evaluation of citations

We evaluate our edit-labels for citations by training a GRU based RNN model with global attention on Edit-labels as well as on the Featured Article dataset from previous work [22]. In the Featured Articles dataset, the presence of a citation on a "Featured Article" statement is taken as a positive label and its absence as a negative label. The argument is that Featured Articles go through an extensive review process over multiple stages, and statements that needed citations would have been cited, because "No Original Research" is one of the three core policies of Wikipedia [30].

Table 6 shows the results for citations. We test both the model trained on the Featured Article dataset and our model on two datasets: 1) LQN-full and 2) Edit-labels. The "LQN-full" dataset is extracted from the same articles as "LQN" in Redi et al. [22]. We use both citations and "citation-needed" tags as positive signals. "Citation-needed" tags are added on Wikipedia pages by editors where they think a statement needs a citation. Thus, in addition to actual citations, they are also a positive signal for "needing citations", because at least one Wikipedia editor thought so. Statements with "citation-needed" tags or actual citations in these articles are taken as positive examples of needing citations. Statements in paragraphs with no citations or citation-needed tags are taken as negative examples. We call our dataset "LQN-full" because we extract all the statements from these articles, giving us a large set of test statements to evaluate citations (~300,000) compared to the ~20,000 in the "LQN" dataset of the previous work. The "Edit-labels test" dataset is the test split of the citation dataset described in Table 4.

For both testing datasets, we can see that the model trained on edit-labels outperforms the model trained on the limited examples from Featured Articles. We also see that, when tested on a large set of statements from low quality articles, both approaches do not perform very well at identifying the need for citations.

Training | Testing | Training Positive | Training Negative | Testing Positive | Testing Negative
Featured Articles | LQN-full | 10,000 | 10,000 | 149,000 | 151,000
Featured Articles | Edit-labels | 10,000 | 10,000 | 9,782 | 9,782
Edit-labels | LQN-full | 39,122 | 39,122 | 149,000 | 151,000
Edit-labels | Edit-labels | 39,122 | 39,122 | 9,782 | 9,782

Table 5. Statistics for the citation datasets used for training and testing.

Training | LQN-full (300,000 samples): P / R / F-1 | Edit-labels test: P / R / F-1
Featured Articles | 0.65 / 0.38 / 0.48 | 0.48 / 0.34 / 0.40
Edit-labels | 0.54 / 0.67 / 0.60 | 0.65 / 0.73 / 0.69

Table 6. Testing results for citations when trained on labels from Featured Articles vs. Edit-labels.

One of the reasons the model trained on statements from Featured Articles does not perform very well is that the RNN model also learns patterns particular to Featured Articles; in this case, the length of the statement. Figure 4 shows the proportion of statements of varying lengths (in words) in Featured Articles and low-quality articles. The majority of Featured Article statements lie in the range of 20-40 words, whereas low-quality article statements have lengths in the range of 10-30 words. This is expected, because Featured Article statements are well groomed as a result of the extensive review process (https://en.wikipedia.org/wiki/Wikipedia:Featured_article_criteria), whereas statements in low-quality articles may just be stating a fact, not necessarily conveying the information in an elegant manner. Training on edit-labels captures this variation because we train on statements from across the entire encyclopedia. We see that models trained on a large set of statements extracted from edits across articles of all quality levels are better at generalizing to the task of identifying the need for citations.

We re-iterate our goal here: to show that using edit labels for building models is better because they come from the same set of articles where the model's predictions will actually be used. Improving the model classification accuracy is future work.

5.4 Evaluation of point-of-view

Training | Testing | Training Positive | Training Negative | Testing Positive | Testing Negative
Crowdsourced | Crowdsourced | 1,290 | 1,290 | 370 | 370
Edit-labels | Crowdsourced | 74,000 | 74,000 | 1,843 | 1,843
Edit-labels | Edit-labels | 74,000 | 74,000 | 18,500 | 18,500

Table 7. Statistics for the POV datasets used for training and testing. Crowdsourced is from previous work [15].

We evaluate our edit-labels for point-of-view by training a GRU based RNN model with global attention on Edit-labels as well as on the crowdsourced dataset from the baseline point-of-view work [15]. Table 7 shows the statistics of the datasets used for testing point-of-view. "Crowdsourced" refers to the test sets from previous work [15], obtained via crowdsourcing; we compare our edit-labels against these.

Fig. 4. Distribution of statement length (words) in Featured and low-quality articles.

Training | Featured (Crowdsourced): P / R / F-1 | cw-hard (Crowdsourced): P / R / F-1 | Edit-labels: P / R / F-1 / ROC-AUC
Crowdsourced trained | 0.89 / 0.71 / 0.79 | 0.71 / 0.64 / 0.67 | 0.78 / 0.67 / 0.72 / 0.70
Edit-labels trained | 0.79 / 0.85 / 0.82 | 0.42 / 0.85 / 0.56 | 0.81 / 0.86 / 0.83 / 0.92

Table 8. Testing results for point-of-view with crowdsourced and edit-labels training.

The "crowdsourced" set consists of two datasets: "featured" and "cw-hard". The positive statements in both of these datasets are the same. They were obtained by randomly sampling statements from Wikipedia edits with the comment "POV" and a single-line addition, deletion, or replacement. A randomly sampled subset (5,000) of these statements was labeled as neutral or biased by crowdworkers; the 1,843 statements from this set labeled as biased by the crowd were taken as positive examples. Negative statements for "featured" come from Featured Articles, similar to our work. Negative statements for "cw-hard" are the statements that the crowdworkers labeled as "neutral". Note that extracting single-line changes with the comment "POV" makes them inline point-of-view edits as discussed in Section 4.2, which keeps our classification objective consistent with the previous work for comparison.

To compare the effectiveness of edit-labels against crowdsourced labels, we use the same GRU based attention model as Hube et al. [15]. We only use global attention [2] for comparison. The inputs to the model are vectors from the two sentence representations: 1) word representations, and 2) POS representations, i.e., embeddings for the part-of-speech tags.

Table 8 shows testing statistics on the two crowdsourced datasets from previous work and our test split of the edit-labels dataset. We did not reproduce the results of the previous work (1st row, 1st and 2nd columns), i.e., test results on the crowdsourced datasets when trained on the crowdsourced datasets. We report the numbers from one of the best models in the baseline work [15] (also matching our experimental setup): a GRU based RNN with global attention and word and POS tags as input.

Statement | Comment
1. After the Battle of Nicopolis in 1396 and the fall of the Vidin Tsardom three years later, the Ottomans conquered all Bulgarian lands south of the Danube, with sporadic resistance ending when the Ottomans gained a firm hold on the Balkans by conquering Constantinople in 1453. | OK: This looks alright to me assuming it comes with a citation that supports the firm (long term) control of the Balkans. If it is not supported by the citation, it would be POV at best.
2. It is widely accepted that the band is joke and is collectively thought to be the biggest sale out in rock history. | POV: "joke" is problematic.
3. Also known to have lots of friends, and it is possibly due to the fact they are known to be charming as well as loyal to friends and family members. | POV: "charming" should be removed.
4. The family was a branch of the FitzGerald dynasty, or Geraldines, related to the Earls of Desmond (extinct), who were questionably granted extensive lands in County Limerick by the Duke of Normandy by way of conquest. | POV: "questionably" needs to be cited and it is not phrased in a neutral tone. E.g., I would prefer "... questioned the granted lands".
5. In fact, this participation may be a reaction to the Catholic church's active political involvement. | POV: May be? Is? {{who}} said that?

Table 9. Manual assessment of a small sample of crowd-labeled neutral statements.

The Edit-labels trained model performs better on the "Featured" and "Edit-labels" datasets. The "cw-hard" dataset consists of negative examples which the crowdworkers labeled as "neutral". However, these statements were sourced from Wikipedia edits with the comment "POV". We manually assessed some "neutral" labeled examples from the "cw-hard" dataset for clarity. One of the authors, who has researched Wikipedia editors for about a decade and is also a Wikipedia editor, identified some issues with the crowd-labeled neutral statements in the "cw-hard" dataset. Table 9 shows a small random sample of the negative (neutral labeled) examples from the "cw-hard" dataset with our own assessments along with the reasons. Four out of five statements have point-of-view issues, but they are not obvious without knowledge of the Wikipedia NPOV policies. One of the common reasons for a statement having a point-of-view issue is not having a citation: a citation attributes the opinion to the cited source rather than to the content writer, which is acceptable (https://en.wikipedia.org/wiki/Wikipedia:Neutral_point_of_view#Explanation_of_the_neutral_point_of_view). Crowdworkers cannot be expected to be aware of such nuances unless they are explicitly explained.

5.5 Evaluation of clarification

Training | Testing | P | R | F-1 | ROC-AUC
Edit-labels | Edit-labels | 0.75 | 0.75 | 0.75 | 0.83

Table 10. Testing results for the clarification category.

No prior research exists around detecting whether a statement needs clarification, hence we report and interpret our evaluation on the test set for clarifications. Table 10 shows the results of the clarification evaluation. We are able to achieve an F1-score of 0.75 for the clarification category and a ROC-AUC of 0.83. In Table 11 we show some examples where clarifications were made, what clarifications were made, and whether the model correctly flagged the need for them. TP stands for true positive and FP stands for false positive. Edit-labels are used for both training and testing. Refer to Table 4 for the statistics of the training and testing splits of the Edit-labels dataset for the clarification category.

revision-id | Statement | Prediction | Clarification
21577990 | In sum, NLP promotes methods which are verifiable and have so far been found to be largely false, inaccurate or ineffective. | TP | ..which are largely verifiable..
459001268 | It debuted at #1 on the "New York Times" Bestseller List ("Fallen" came in at #2), remaining at that position through the week of October 17. | TP | ..came in that week at..
669670844 | During this period, Sonnenblick made the two great contributions that would define his career. | TP | ..two great scientific contributions..
N/A | In his 2013 autobiography, Jackson stated that there was, and that Martin and some white Yankees would tell racist jokes. | FP | N/A
810160688 | with his leg injured he is barely able to get away but is rescued by a kingdom soldier that is still alive | TP | ..rescued by Alavaro, a kingdom..

Table 11. Manual assessment of clarification examples.

5.6 Attention

RNN models with attention [2] place different weights on different words in the statement while making predictions. We use these weights to visualize the focus that the classifier places on different words in a given statement. This is useful for analyzing the model outputs by users in actual deployments: editors can look at the most important words for a given task and take quick decisions instead of reading the full statement. Figure 5 shows the attention weights for some statements from each of the citation, point-of-view, and clarification categories. Blue is used for true positives and red for false positives. The darker the word, the more attention is placed on that word for the prediction on that specific statement.
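With the model sketched in Section 5.1, for instance, the per-word attention weights for a statement could be read out of the named attention layer and paired with the tokens for display. The sketch below is a hedged illustration of how such a visualization could be wired up, not the exact code behind Figure 5.

```python
import tensorflow as tf
from tensorflow.keras import Model

def attention_for_statement(model: Model, token_ids, vocab_inverse):
    """Return the model score and (word, attention weight) pairs for one encoded statement.

    Assumes the classifier exposes a softmax layer named "attention_weights", as in the
    earlier sketch, and that token_ids is an integer array of shape (1, MAX_LEN).
    """
    probe = Model(inputs=model.input,
                  outputs=[model.output, model.get_layer("attention_weights").output])
    prediction, weights = probe.predict(token_ids, verbose=0)
    weights = weights[0, :, 0]                       # (MAX_LEN,) attention distribution
    words = [vocab_inverse.get(int(i), "<pad>") for i in token_ids[0]]
    return float(prediction[0, 0]), sorted(zip(words, weights), key=lambda p: -p[1])
```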

Fig. 5. Attention visualization of examples for the three categories.

For citations, attention is given to reporting verbs, as in "damages done" or "produce remarkably good estimates", which require proof of the opinion. This is in line with previous work [22], which reports that such verbs or the presence of facts lead to a high likelihood of needing citations. For point-of-view, attention seems to be particularly helpful, as focus is given to words such as "very popular", "formidable", or "plotted their massacre", which are the typical deletions made in inline point-of-view edits as per the guideline of avoiding stating opinions as facts (https://en.wikipedia.org/wiki/Wikipedia:Neutral_point_of_view#Explanation_of_the_neutral_point_of_view). If such words are uncited, they cause a point-of-view issue, as the opinion becomes an opinion of the content writer. For clarifications, attention is placed near the words where additions were made in the statement. For example, consider statement 1 in the clarification true positives (see Table 11, last example): "rescued by a kingdom soldier" is clarified to "rescued by Alavaro, a kingdom soldier", and high attention is placed on "rescued".

6 DISCUSSION

Our flexible content flaw detection pipeline has several implications both for Wikipedia and for collaborative systems in general. By learning implicitly from expert behaviors, we 1) eliminate the expensive hand labeling of examples, either through crowdwork (noisy) [23] or experts (difficult to reach), required to train ML models, and 2) learn the processes underlying the collaborative system directly from past data, without the need for explicit feature engineering to capture these processes.

Further, existing research [14] has shown that better data collection is of greater importance than model development towards building fairer ML systems. Traditional ML systems trained on fixed hand-labeled datasets cannot respond quickly to user feedback and challenges of fairness without modifying or reworking the training datasets. Our flexible rule based semantic edit labeling system, followed by the statement quality extraction pipeline, makes for a swift feedback loop wherein Wikipedia editors can give feedback on a small set of model predictions, which will inform which rules in the semantic edit labeling step need rework. The entire pipeline can be run again, leading to fast model changes and deployments based on real-time feedback. The semantic edit-intention labeling step is interpretable, making the entire feedback process participatory.

Our pipeline also allows learning new or improved quality identification models by using an appropriate semantic edit labeling model. For example, a copy-edit (fixing grammar, improving style [31]) edit intent identification model can be plugged into the edit-labeling step of the pipeline, and it will extract positive examples that need copy-editing and grammar improvements. As another example, a model that can semantically label simplification edits can be used to extract a large set

of complex-simple sentence pairs for text simplification tasks. This would be a direct improvement over current text simplification approaches [35], which use statements from corresponding articles of Simple English Wikipedia and English Wikipedia and align them to generate a large set of complex-simple sentence pairs.

7 CONCLUSION AND FUTURE WORK

We show how the content policies of Wikipedia can be coded to identify meaningful semantic improvements across Wikipedia, which can then be used to label the content that was improved as a positive example of needing the semantic improvement. We further show that effective deep learning models can be trained using a large set of these semantically labeled statements. As a first step, using our proposed pipeline, several quality identification models can be built that flag issues in statements along various dimensions (e.g., grammar, simplicity, etc.). By combining models for several semantic categories, better statement quality identification models can be built. A very compelling case is the combination of citations and point-of-view: often, if a fact is cited, it is not considered to be pushing a point-of-view, because, according to the NPOV guidelines [32], opinions in the content are alright, but opinions of the curator of the content are not allowed. The pipeline can also be used to personalize models to a specific space. This space can be specific topics, to address content issues that are sensitive to topic spaces, like point-of-view, or specific classes of articles. For example, a C-class article may need more copyedits and grammar fixes than an A-class article, which may need more structuring. In such cases, building statement quality detection models for C-class articles will benefit from learning from edits that happen in C-class articles.

Statement quality detection on Wikipedia articles will be followed by actually improving the quality of articles as they evolve through various quality stages. Further, directly identifying statement quality along different semantic dimensions allows us to do fine grained analysis of how good quality articles evolve and of what improvements were made by different actors over time that led to quality improvements. Using such analysis, models can be built that accelerate quality improvements in collaborative systems or assist users in doing so.

REFERENCES

[1] Anderka, M., Stein, B., and Lipka, N. Predicting quality flaws in user-generated content: the case of Wikipedia. In The 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '12, Portland, OR, USA, August 12-16, 2012 (2012), W. R. Hersh, J. Callan, Y. Maarek, and M. Sanderson, Eds., ACM, pp. 981-990.
[2] Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings (2015), Y. Bengio and Y. LeCun, Eds.
[3] Biber, D. Variation across Speech and Writing. Cambridge University Press, 1991.
[4] Cho, K., van Merrienboer, B., Gülçehre, Ç., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation.
The pipeline can also be used to personalize models to a specific space. This space can be specific topics, to address content issues that are sensitive to topic (such as point-of-view), or specific classes of articles. For example, a C-class article may need more copyedits and grammar fixes than an A-class article, which may need more structuring. In such cases, statement quality detection models for C-class articles will benefit from learning from edits that happen in C-class articles.
Statement quality detection on Wikipedia articles will be followed by actually improving the quality of articles as they evolve through various quality stages. Further, directly identifying statement quality along different semantic dimensions allows fine-grained analysis of how good-quality articles evolve and which improvements, made by different actors over time, led to quality gains. Using such analysis, models can be built to accelerate quality improvements in collaborative systems or to assist users in making them.

REFERENCES
[1] Anderka, M., Stein, B., and Lipka, N. Predicting quality flaws in user-generated content: The case of Wikipedia. In The 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '12, Portland, OR, USA, August 12-16, 2012 (2012), W. R. Hersh, J. Callan, Y. Maarek, and M. Sanderson, Eds., ACM, pp. 981-990.
[2] Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings (2015), Y. Bengio and Y. LeCun, Eds.
[3] Biber, D. Variation across Speech and Writing. Cambridge University Press, 1991.
[4] Cho, K., van Merrienboer, B., Gülçehre, Ç., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar (2014), A. Moschitti, B. Pang, and W. Daelemans, Eds., ACL, pp. 1724-1734.
[5] Dalip, D. H., Gonçalves, M. A., Cristo, M., and Calado, P. Automatic quality assessment of content created collaboratively by web communities: A case study of Wikipedia. In Proceedings of the 2009 Joint International Conference on Digital Libraries, JCDL 2009, Austin, TX, USA, June 15-19, 2009 (2009), F. Heath, M. L. Rice-Lively, and R. Furuta, Eds., ACM, pp. 295-304.
[6] Ekstrand, M. D., and Riedl, J. rv you're dumb: Identifying discarded work in wiki article history. In Proceedings of the 2009 International Symposium on Wikis, Orlando, Florida, USA, October 25-27, 2009 (2009), D. Riehle and A. Bruckman, Eds., ACM.
[7] Flöck, F., Vrandecic, D., and Simperl, E. Revisiting reverts: Accurate revert detection in Wikipedia. In 23rd ACM Conference on Hypertext and Social Media, HT '12, Milwaukee, WI, USA, June 25-28, 2012 (2012), E. V. Munson and M. Strohmaier, Eds., ACM, pp. 3-12.
[8] Forte, A., Andalibi, N., Gorichanaz, T., Kim, M. C., Park, T., and Halfaker, A. Information fortification: An online citation behavior. In Proceedings of the 2018 ACM Conference on Supporting Groupwork, GROUP '18 (New York, NY, USA, 2018), Association for Computing Machinery, pp. 83-92.
[9] Geiger, R. S., and Halfaker, A. When the levee breaks: Without bots, what happens to Wikipedia's quality control processes? In Proceedings of the 9th International Symposium on Open Collaboration, Hong Kong, China, August 05-07, 2013 (2013), A. Aguiar and D. Riehle, Eds., ACM, pp. 6:1-6:6.
[10] Geiger, R. S., and Halfaker, A. Operationalizing conflict and cooperation between automated software agents in Wikipedia: A replication and expansion of 'Even good bots fight'. Proc. ACM Hum.-Comput. Interact. 1, CSCW (2017), 49:1-49:33.
[11] Geiger, R. S., Yu, K., Yang, Y., Dai, M., Qiu, J., Tang, R., and Huang, J. Garbage in, garbage out? Do machine learning application papers in social computing report where human-labeled training data comes from? In FAT* (2020), ACM, pp. 325-336.
[12] Halfaker, A. Interpolating quality dynamics in Wikipedia and demonstrating the Keilana effect. In OpenSym (2017), ACM, pp. 19:1-19:9.
[13] Hochreiter, S., and Schmidhuber, J. Long short-term memory. Neural Comput. 9, 8 (1997), 1735-1780.
[14] Holstein, K., Wortman Vaughan, J., Daumé III, H., Dudik, M., and Wallach, H. Improving fairness in machine learning systems: What do industry practitioners need? In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (2019), pp. 1-16.
[15] Hube, C., and Fetahu, B. Neural based statement classification for biased language. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, WSDM 2019, Melbourne, VIC, Australia, February 11-15, 2019 (2019), J. S. Culpepper, A. Moffat, P. N. Bennett, and K. Lerman, Eds., ACM, pp. 195-203.
[16] Kittur, A., Chi, E. H., and Suh, B. Crowdsourcing user studies with Mechanical Turk. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '08 (New York, NY, USA, 2008), Association for Computing Machinery, pp. 453-456.
[17] Kittur, A., Suh, B., Pendleton, B. A., and Chi, E. H. He says, she says: Conflict and coordination in Wikipedia. In CHI (2007), ACM, pp. 453-462.
[18] Pennington, J., Socher, R., and Manning, C. D. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar (2014), A. Moschitti, B. Pang, and W. Daelemans, Eds., ACL, pp. 1532-1543.
[19] Potthast, M. Crowdsourcing a Wikipedia vandalism corpus. In SIGIR (2010), ACM, pp. 789-790.
[20] Priedhorsky, R., Chen, J., Lam, S. K., Panciera, K. A., Terveen, L. G., and Riedl, J. Creating, destroying, and restoring value in Wikipedia. In Proceedings of the 2007 International ACM SIGGROUP Conference on Supporting Group Work, GROUP 2007, Sanibel Island, Florida, USA, November 4-7, 2007 (2007), T. Gross and K. Inkpen, Eds., ACM, pp. 259-268.
[21] Recasens, M., Danescu-Niculescu-Mizil, C., and Jurafsky, D. Linguistic models for analyzing and detecting biased language. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, ACL 2013, 4-9 August 2013, Sofia, Bulgaria, Volume 1: Long Papers (2013), The Association for Computational Linguistics, pp. 1650-1659.
[22] Redi, M., Fetahu, B., Morgan, J. T., and Taraborelli, D. Citation needed: A taxonomy and algorithmic assessment of Wikipedia's verifiability. In The World Wide Web Conference, WWW 2019, San Francisco, CA, USA, May 13-17, 2019 (2019), L. Liu, R. W. White, A. Mantrach, F. Silvestri, J. J. McAuley, R. Baeza-Yates, and L. Zia, Eds., ACM, pp. 1567-1578.
[23] Stvilia, B., Twidale, M. B., Smith, L. C., and Gasser, L. Information quality work organization in Wikipedia. Journal of the American Society for Information Science and Technology 59, 6 (2008), 983-1001.
[24] Suh, B., Convertino, G., Chi, E. H., and Pirolli, P. The singularity is not near: Slowing growth of Wikipedia. In Proceedings of the 2009 International Symposium on Wikis, Orlando, Florida, USA, October 25-27, 2009 (2009), D. Riehle and A. Bruckman, Eds., ACM.
[25] Wang, R. Y., and Strong, D. M. Beyond accuracy: What data quality means to data consumers. J. Manag. Inf. Syst. 12, 4 (1996), 5-33.
[26] Warncke-Wang, M., Cosley, D., and Riedl, J. Tell me more: An actionable quality model for Wikipedia. In OpenSym (2013), ACM, pp. 8:1-8:10.
[27] Wikipedia. Wikipedia, Sep 2020.
[28] Wikipedia. Wikipedia:Citation templates, Sep 2020.
[29] Wikipedia. Wikipedia:Content assessment, Sep 2020.
[30] Wikipedia. Wikipedia:Core content policies, Sep 2020.
[31] Wikipedia. Wikipedia:Manual of Style (MoS), Sep 2020.
[32] Wikipedia. Wikipedia:Neutral point of view, Sep 2020.
[33] Wikipedia. Wikipedia:Template cleanup, Sep 2020.
[34] Yang, D., Halfaker, A., Kraut, R. E., and Hovy, E. H. Identifying semantic edit intentions from revisions in Wikipedia. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017 (2017), M. Palmer, R. Hwa, and S. Riedel, Eds., Association for Computational Linguistics, pp. 2000-2010.
[35] Yatskar, M., Pang, B., Danescu-Niculescu-Mizil, C., and Lee, L. For the sake of simplicity: Unsupervised extraction of lexical simplifications from Wikipedia. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (Los Angeles, California, June 2010), Association for Computational Linguistics, pp. 365-368.