Automatically Labeling Low Quality Content on Wikipedia by Leveraging Patterns in Editing Behavior


ANONYMOUS AUTHOR(S)

Wikipedia articles aim to be definitive sources of encyclopedic content. Yet only 0.6% of Wikipedia articles have high quality according to its quality scale, due to an insufficient number of Wikipedia editors and an enormous number of articles. Supervised Machine Learning (ML) quality improvement approaches that can automatically identify and fix content issues rely on manual labels of individual Wikipedia sentence quality. However, current labeling approaches are tedious and produce noisy labels. Here, we propose an automated labeling approach that identifies the semantic category (e.g., adding citations, clarifications) of historic Wikipedia edits and uses the modified sentences prior to the edit as examples that require that semantic improvement. Statements from the highest-rated articles serve as examples that no longer need semantic improvements. We show that training existing sentence quality classification algorithms on our labels improves their performance compared to training them on existing labels. Our work shows that editing behaviors of Wikipedia editors provide better labels than labels generated by crowdworkers who lack the context to make judgments that the editors would agree with.

CCS Concepts: • Human-centered computing → Social recommendation; Computer supported cooperative work; Empirical studies in collaborative and social computing; Wikis; Social tagging systems.

Additional Key Words and Phrases: Wikipedia, labeling, Machine Learning.

ACM Reference Format:
Anonymous Author(s). 2020. Automatically Labeling Low Quality Content on Wikipedia by Leveraging Patterns in Editing Behavior. 1, 1 (October 2020), 19 pages.
https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION

Wikipedia [27], an online encyclopedia, aims to be the ultimate source of encyclopedic knowledge by achieving high quality for all its articles. High quality articles are definitive sources of knowledge on their topic and serve the purpose of providing information to Wikipedia readers in a concise manner, without causing confusion or wasting time [25]. Thus, Wikipedia editors have defined comprehensive content assessment criteria, called the WP1.0 Article Quality Assessment scale [29], to grade article quality on a scale from the most basic "stub" (articles with basic information about the topic, without proper citations and Wikipedia-defined structure) to the exemplary "Featured Articles" (well-written and well-structured, comprehensive and properly cited articles).

Article maintenance, as opposed to creating new articles and content, has become a significant portion of what Wikipedia editors do [17]. Currently, editors rate article quality and identify and make required improvements manually, which is taxing and time-consuming. Because Wikipedia is a collaborative editing platform, articles are in a constant state of churn, and current assessments quickly become outdated as others modify the articles. For the limited number of experienced editors on Wikipedia, performing such assessments across a set of 6.5 million Wikipedia articles is a huge bottleneck [23]; currently only about 7,000 articles have "Featured Article" status and only about 33,000 have the second best "Good Article" status [29].

With a continuously declining number of editors on Wikipedia [24], automating quality assessment tasks could reduce the workload of the remaining editors. Supervised Machine Learning (ML) has already automated tasks like vandalism detection [12] and overall article quality prediction [26].
Such ML approaches require labeled sets of examples of Wikipedia content that requires improvement (positive examples) and content that does not (negative examples). One of the main reasons for the success of those existing ML approaches [12, 26] (both have been deployed to Wikipedia) is the relative ease of obtaining labels, either because they are visually salient (e.g., in the case of vandalism) or already part of existing practices (e.g., editors manually record article quality on talk pages of Wikipedia articles as part of existing article assessment).

Fig. 1. Our proposed pipeline for labeling low-quality statements on Wikipedia. We start with our automated labeling approach (top row), where we obtain a large corpus of historic Wikipedia statement edits, and label their semantic intent using programmatic rules. We extract positive statements from relevant semantic edits and negative statements from Featured Articles. We then use our labels to train existing Machine Learning models, and test them by comparing with labeling approaches from past research (middle row). Existing models trained on our labels can then be deployed to automatically detect Wikipedia statements that require improvement (bottom row).

However, automating other quality assessment tasks (e.g., identifying sentences that require citation, sentences with non-neutral point of view, sentences that require clarification) requires labels at the Wikipedia sentence level, which makes automating such tasks difficult. Wikipedia editors rarely manually flag outstanding Wikipedia statement quality issues as part of their editing process [1].
Even existing crowdsourcing-based labeling methods [15, 22, 34] could produce noisy Wikipedia statement quality labels, especially when crowdworkers, who are not domain experts, lack knowledge about Wikipedia policies on content quality [8, 11, 16].

Here, we present a method for automatically labeling Wikipedia statement quality across improvement categories directly from past Wikipedia editors' editing behavior to enable article quality improvements (Figure 1). To label positive examples (statements that need improvements), we implemented Wikipedia's core content principles guidelines [30] as syntax-based rules to capture the meaning or intent of a historic edit (e.g., added citations, removed bias, clarified statement) for each statement quality category we want to classify (e.g., needs citation, needs bias-removal, or needs clarification). Each historic edit then indicates that the edited statement needed that particular improvement, resulting in a positive example. We follow the approach of Redi et al. [22] and label all statements in featured articles as negative examples (statements that do not need improvements).

To illustrate our approach, we built three statement quality detection pipelines (including corresponding rules) for three Wikipedia quality improvement categories: 1) citations (adding or modifying references and citations for verifiability), 2) Neutral Point of View (NPOV) edits (rewriting using encyclopedic, neutral tone; removing bias), and 3) clarifications (specifying or explaining an existing fact or meaning by example or discussion without adding new information). We validated our automated labeling approach by comparing the performance of existing deep learning models [2] trained using existing, baseline labeling approaches (e.g., implicit labeling [22], crowdsourcing [15]) and our automatically extracted labels. Our results showed that existing models trained using our automatic labeling method achieved 20% and 15% improvements in F1-score for citations and NPOV, respectively, over the same models trained on data labeled using existing approaches.

Our work provides further evidence that the edits produced by Wikipedians working in their context provide a better signal for supporting their work than labels generated by crowdworkers who lack the context to make judgments about sentence quality that Wikipedians would agree with. Learning from the implicit editing behavior of Wikipedia editors allowed us to produce labels that capture the nuances of Wikipedia quality policies. Our work has implications for the growth of collaborative content spaces where different people come together to curate content adhering to the standards and purpose of the space.

2 CHALLENGES OF LABELING LOW QUALITY CONTENT ON WIKIPEDIA

Automated approaches to improving and maintaining good quality of articles on Wikipedia have received considerable attention. For example, Wikipedia has deployed automatic vandalism detection [12] that effectively relieves editors of the burden of manually fighting vandals. This has made fighting vandalism on Wikipedia a relatively easy task, as bots have taken over most of the responsibility of detecting and reverting vandalism edits [9], leaving the editors to make more content related edits. Existing article quality models [5, 26] already automatically rate Wikipedia article quality based on content and structure.

Such automated efforts have been possible in part because of the availability of quality labels for such tasks.
For example, a small subset of visually salient, hand-labeled examples is sufficient for even simple ML models to identify vandalism with high accuracy [9]. Also, training existing
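
The labeling idea outlined in the introduction (classify the intent of a historic edit with syntax-based rules, treat the pre-edit sentence as a positive example, and take negatives from Featured Articles) can be illustrated with a minimal Python sketch. This excerpt does not give the authors' actual rules, so everything below is an assumption made for illustration: the function names `label_edit` and `build_examples`, the regular expressions, and the short `LOADED_WORDS` list are hypothetical. The sketch relies only on the common wikitext conventions of `<ref>` tags, `{{cite ...}}` templates, and the inline `{{clarify}}` tag.

```python
import re

# Hypothetical rule set for illustration; not the rules used in the paper.
REF_PATTERN = re.compile(r"<ref[\s>]|\{\{\s*cite\b", re.IGNORECASE)
CLARIFY_TAG = re.compile(r"\{\{\s*clarify\b", re.IGNORECASE)
# Small invented list of loaded words that a neutrality edit might remove.
LOADED_WORDS = {"legendary", "world-class", "notorious", "obviously", "clearly"}


def count_refs(wikitext):
    """Count reference markers (<ref> tags and {{cite ...}} templates)."""
    return len(REF_PATTERN.findall(wikitext))


def words(wikitext):
    """Return the set of lowercase word tokens in a sentence."""
    return set(re.findall(r"[a-z][a-z\-]*", wikitext.lower()))


def label_edit(before, after):
    """Return the improvement categories that an edit appears to address."""
    labels = set()
    # Rule 1: the edit added at least one reference marker.
    if count_refs(after) > count_refs(before):
        labels.add("citation")
    # Rule 2: the edit removed a word from the loaded-word list.
    if (words(before) - words(after)) & LOADED_WORDS:
        labels.add("npov")
    # Rule 3: the edit resolved an inline {{clarify}} tag.
    if CLARIFY_TAG.search(before) and not CLARIFY_TAG.search(after):
        labels.add("clarification")
    return labels


def build_examples(edits, featured_sentences):
    """Positives: pre-edit sentences paired with each detected category.
    Negatives: sentences taken from Featured Articles."""
    positives = [(before, category)
                 for before, after in edits
                 for category in sorted(label_edit(before, after))]
    negatives = [(sentence, "none") for sentence in featured_sentences]
    return positives, negatives
```

For instance, calling `label_edit("He is a legendary guitarist.", "He is a guitarist.")` flags the edit as a neutrality fix under these toy rules, so the pre-edit sentence would become a positive "npov" example. Rules applied to real revision histories would need to be far more robust than this sketch.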