Automatically Labeling Low Quality Content on Wikipedia by Leveraging Patterns in Editing Behavior
ANONYMOUS AUTHOR(S)

Wikipedia articles aim to be definitive sources of encyclopedic content. Yet, only 0.6% of Wikipedia articles are of high quality according to its quality scale, due to the insufficient number of Wikipedia editors and the enormous number of articles. Supervised Machine Learning (ML) quality improvement approaches that can automatically identify and fix content issues rely on manual labels of individual Wikipedia sentence quality. However, current labeling approaches are tedious and produce noisy labels. Here, we propose an automated labeling approach that identifies the semantic category (e.g., adding citations, clarifications) of historic Wikipedia edits and uses the modified sentences prior to the edit as examples that require that semantic improvement. Highest-rated article statements are examples that no longer need semantic improvements. We show that training existing sentence quality classification algorithms on our labels improves their performance compared to training them on existing labels. Our work shows that the editing behaviors of Wikipedia editors provide better labels than labels generated by crowdworkers who lack the context to make judgments that the editors would agree with.

CCS Concepts: • Human-centered computing → Social recommendation; Computer supported cooperative work; Empirical studies in collaborative and social computing; Wikis; Social tagging systems.

Additional Key Words and Phrases: Wikipedia, labeling, Machine Learning.

ACM Reference Format:
Anonymous Author(s). 2020. Automatically Labeling Low Quality Content on Wikipedia by Leveraging Patterns in Editing Behavior. 1, 1 (October 2020), 19 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION

Wikipedia [27], an online encyclopedia, aims to be the ultimate source of encyclopedic knowledge by achieving high quality for all its articles. High quality articles are definitive sources of knowledge on their topic and serve the purpose of providing information to Wikipedia readers in a concise manner, without causing confusion or wasting time [25]. Thus, Wikipedia editors have defined comprehensive content assessment criteria, called the WP1.0 Article Quality Assessment scale [29], to grade article quality on a scale from the most basic "stub" (articles with basic information about the topic, without proper citations and Wikipedia-defined structure) to the exemplary "Featured Articles" (well-written, well-structured, comprehensive, and properly cited articles).

Article maintenance, as opposed to creating new articles and content, has become a significant portion of what Wikipedia editors do [17]. Currently, editors rate article quality and identify and make required improvements manually, which is taxing and time-consuming. Because Wikipedia is a collaborative editing platform, articles are in a constant state of churn, and current assessments quickly become outdated as articles are modified by others. For the limited number of experienced editors on Wikipedia, performing such assessments across a set of 6.5 million Wikipedia articles is a huge bottleneck [23]; currently only about 7,000 articles have "Featured Article" status and only about 33,000 have the second-best "Good Article" status [29].
With a continuously declining number of editors on Wikipedia [24], automating quality assessment tasks could reduce the workload of the remaining editors. Supervised Machine Learning (ML) has already automated tasks like vandalism detection [12] and overall article quality prediction [26]. Such ML approaches require labeled sets of examples of Wikipedia content that requires improvement (positive examples) and content that does not (negative examples). One of the main reasons for the success of those existing ML approaches [12, 26] (both have been deployed to Wikipedia) is the relative ease of obtaining labels, either because they are visually salient (e.g., in the case of vandalism) or already part of existing practices (e.g., editors manually record article quality on the talk pages of Wikipedia articles as part of existing article assessment).

Fig. 1. Our proposed pipeline for labeling low-quality statements on Wikipedia. We start with our automated labeling approach (top row), where we obtain a large corpus of historic Wikipedia statement edits and label their semantic intent using programmatic rules. We extract positive statements from relevant semantic edits and negative statements from Featured Articles. We then use our labels to train existing Machine Learning models, and test them by comparing with labeling approaches from past research (middle row). Existing models trained on our labels can then be deployed to automatically detect Wikipedia statements that require improvement (bottom row).

However, automating other quality assessment tasks (e.g., identifying sentences that require citation, sentences with a non-neutral point of view, or sentences that require clarification) requires labels at the Wikipedia sentence level, which makes automating such tasks difficult. Wikipedia editors rarely flag outstanding Wikipedia statement quality issues manually as part of their editing process [1]. Even existing crowdsourcing-based labeling methods [15, 22, 34] can produce noisy Wikipedia statement quality labels, especially when crowdworkers, who are not domain experts, lack knowledge about Wikipedia policies on content quality [8, 11, 16].

Here, we present a method for automatically labeling Wikipedia statement quality across improvement categories directly from past Wikipedia editors' editing behavior to enable article quality improvements (Figure 1). To label positive examples (statements that need improvements), we implemented Wikipedia's core content policy guidelines [30] as syntax-based rules that capture the meaning or intent of a historic edit (e.g., added citations, removed bias, clarified statement) for each statement quality category we want to classify (e.g., needs citation, needs bias-removal, or needs clarification). Each historic edit then indicates that the edited statement needed that particular improvement, resulting in a positive example. We follow Redi et al.'s [22] approach and label all statements in Featured Articles as negative examples (statements that do not need improvements).
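For illustration, a minimal sketch of such a syntax-based rule, assuming each historic edit is available as a (pre-edit sentence, post-edit sentence, edit comment) triple, could look like the following Python; the regular expressions and category names here are simplified placeholders rather than the full rule set used in our pipelines:

import re

# Illustrative, simplified patterns; the actual rules are derived from
# Wikipedia's core content policy guidelines [30].
REF_TAG = re.compile(r"<ref[^>]*>|\{\{\s*cite", re.IGNORECASE)
NPOV_COMMENT = re.compile(r"\b(npov|pov|neutral|bias)\w*", re.IGNORECASE)
CLARIFY_COMMENT = re.compile(r"\b(clarif|explain|reword)\w*", re.IGNORECASE)

def label_edit(old_sentence, new_sentence, edit_comment):
    """Return the semantic category of a historic sentence edit, or None.

    The sentence *before* the edit (old_sentence) becomes a positive
    example for the returned category.
    """
    # Citation edit: a reference appears in the new version but not the old one.
    if REF_TAG.search(new_sentence) and not REF_TAG.search(old_sentence):
        return "needs-citation"
    # NPOV edit: the editor's comment signals bias removal or tone rewriting.
    if NPOV_COMMENT.search(edit_comment):
        return "needs-npov"
    # Clarification edit: the comment signals explaining existing content.
    if CLARIFY_COMMENT.search(edit_comment):
        return "needs-clarification"
    return None

# Usage: the pre-edit sentence is labeled as a positive "needs-citation" example.
old = "The city has the largest port in the region."
new = "The city has the largest port in the region.<ref>{{cite web|url=...}}</ref>"
print(label_edit(old, new, "added source"))  # -> needs-citation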
To illustrate our approach, we built three statement quality detection pipelines (including corresponding rules) for three Wikipedia quality improvement categories: 1) citations (adding or modifying references and citations for verifiability), 2) Neutral Point of View (NPOV) edits (rewriting using an encyclopedic, neutral tone; removing bias), and 3) clarifications (specifying or explaining an existing fact or meaning by example or discussion without adding new information). We validated our automated labeling approach by comparing the performance of existing deep learning models [2] trained using existing, baseline labeling approaches (e.g., implicit labeling [22], crowdsourcing [15]) and using our automatically extracted labels. Our results showed that existing models trained with our automatic labeling method achieved 20% and 15% improvements in F1-score for citations and NPOV, respectively, over the same models trained on data labeled using existing approaches.

Our work provides further evidence that the edits produced by Wikipedians working in their context provide a better signal for supporting their work than labels generated by crowdworkers who lack the context to make judgments about sentence quality that Wikipedians would agree with. Learning from the implicit editing behavior of Wikipedia editors allowed us to produce labels that capture the nuances of Wikipedia quality policies. Our work has implications for the growth of collaborative content spaces where different people come together to curate content adhering to the standards and purpose of the space.

2 CHALLENGES OF LABELING LOW QUALITY CONTENT ON WIKIPEDIA

Automated approaches to improving and maintaining good quality of articles on Wikipedia have received considerable attention. For example, Wikipedia has deployed automatic vandalism detection [12] that effectively relieves editors of the burden of manually fighting vandals. This has made fighting vandalism on Wikipedia a relatively easy task, as bots have taken over most of the responsibility for detecting and reverting vandalism edits [9], leaving the editors to make more content-related edits. Existing article quality models [5, 26] already automatically rate Wikipedia article quality based on content and structure.

Such automated efforts have been possible in part because of the availability of quality labels for such tasks. For example, a small subset of visually salient, hand-labeled examples is sufficient for even simple ML models to identify vandalism with high accuracy [9]. Also, training existing