Approaches to NLP-Supported Information Quality Management in Wikipedia


The Quality of Content in Open Online Collaboration Platforms
Approaches to NLP-supported Information Quality Management in Wikipedia

Dissertation approved by the Department of Computer Science (Fachbereich Informatik) of Technische Universität Darmstadt for the academic degree of Doktor-Ingenieur (Dr.-Ing.), submitted by Oliver Ferschke, M.A., born in Würzburg

Date of submission: 15 May 2014
Date of defense: 15 July 2014
Referees: Prof. Dr. Iryna Gurevych, Darmstadt; Prof. Dr. Hinrich Schütze, München; Assoc. Prof. Carolyn Penstein Rosé, PhD, Pittsburgh
Darmstadt 2014
D17

Please cite this document as
URN: urn:nbn:de:tuda-tuprints-40929
URL: http://tuprints.ulb.tu-darmstadt.de/4092

This document is provided by tuprints, E-Publishing-Service of the TU Darmstadt
http://tuprints.ulb.tu-darmstadt.de
[email protected]

This work is published under the following Creative Commons license: Attribution – Non Commercial – No Derivative Works 2.0 Germany
http://creativecommons.org/licenses/by-nc-nd/2.0/de/deed.en

Abstract

Over the past decade, the paradigm of the World Wide Web has shifted from static web pages towards participatory and collaborative content production. The main properties of this user-generated content are a low publication threshold and little or no editorial control. While this has improved the variety and timeliness of the available information, it causes an even higher variance in quality than the already heterogeneous quality of traditional web content. Wikipedia is the prime example of a successful, large-scale, collaboratively created resource that reflects the spirit of the open collaborative content creation paradigm. Even though recent studies have confirmed that the overall quality of Wikipedia is high, there is still a wide gap that must be bridged before Wikipedia reaches the state of a reliable, citable source.

A key prerequisite to reaching this goal is a quality management strategy that can cope both with the massive scale of Wikipedia and with its open and almost anarchic nature. This includes an efficient communication platform for work coordination among the collaborators as well as techniques for monitoring quality problems across the encyclopedia. This dissertation shows how natural language processing approaches can be used to assist information quality management on a massive scale.

In the first part of this thesis, we establish the theoretical foundations for our work. We first introduce the relatively new concept of open online collaboration with a particular focus on collaborative writing and proceed with a detailed discussion of Wikipedia and its role as an encyclopedia, a community, an online collaboration platform, and a knowledge resource for language technology applications. We then proceed with the three main contributions of this thesis.

Even though there have been previous attempts to adapt existing information quality frameworks to Wikipedia, no quality model has yet incorporated writing quality as a central factor. Since Wikipedia is not only a repository of mere facts but rather consists of full-text articles, the writing quality of these articles has to be taken into consideration when judging article quality. As the first main contribution of this thesis, we therefore define a comprehensive article quality model that aims to consolidate both the quality of writing and the quality criteria defined in multiple Wikipedia guidelines and policies into a single model. The model comprises 23 dimensions segmented into the four layers of intrinsic quality, contextual quality, writing quality and organizational quality.

As a second main contribution, we present an approach for automatically identifying quality flaws in Wikipedia articles.
Even though the general idea of quality detection has been introduced in previous work, we dissect the approach to find that the task is inherently prone to a topic bias which results in unrealistically high cross-validated evaluation results that do not reflect the classifier's real performance on real-world data. We solve this problem with a novel data sampling approach based on the full article revision history that is able to avoid this bias. It furthermore allows us not only to identify flawed articles but also to find reliable counterexamples that do not exhibit the respective quality flaws. For automatically detecting quality flaws in unseen articles, we present FlawFinder, a modular system for supervised text classification. We evaluate the system on a novel corpus of Wikipedia articles with neutrality and style flaws. The results confirm the initial hypothesis that the reliable classifiers tend to exhibit a lower cross-validated performance than the biased ones, but the scores more closely resemble their actual performance in the wild.

As a third main contribution, we present an approach for automatically segmenting and tagging the user contributions on article Talk pages to improve the work coordination among Wikipedians. These unstructured discussion pages are not easy to navigate, and information is likely to get lost over time in the discussion archives. By automatically identifying the quality problems that have been discussed in the past and the solutions that have been proposed, we can help users to make informed decisions in the future.

Our contribution in this area is threefold: (i) We describe a novel algorithm for segmenting the unstructured dialog on Wikipedia Talk pages using their revision history. In contrast to related work, which mainly relies on the rudimentary markup, this new algorithm can reliably extract metadata, such as the identity of a user, and is moreover able to handle discontinuous turns.
(ii) We introduce a novel scheme for annotating the turns in article discussions with dialog act labels for capturing the coordination efforts of article improvement. The labels reflect the types of criticism discussed in a turn, for example missing information or inappropriate language, as well as any actions proposed for solving the quality problems. (iii) Based on this scheme, we created two automatically segmented and manually annotated discussion corpora extracted from the Simple English Wikipedia (SEWD) and the English Wikipedia (EWD). We evaluate how well text classification approaches can learn to assign the dialog act labels from our scheme to unseen discussion pages and achieve a cross-validated performance of F1 = 0.82 on the SEWD corpus, while we obtain an average performance of F1 = 0.78 on the larger and more complex EWD corpus.

Summary (Zusammenfassung)

Over the past ten years, the focus of the World Wide Web has shifted from primarily static web pages to collaboratively created content. The most important properties of this new paradigm are a low publication threshold and little or no editorial control. Although this has improved the variety and timeliness of the available information, it also increases the heterogeneity of web content with respect to its quality. Wikipedia is the prime example of a large, successful, collaboratively created resource that reflects the spirit of free collaboration. Even though recent studies have confirmed that the overall quality of Wikipedia is high, there is still a long way to go before Wikipedia becomes a reliable and citable source.

An important prerequisite for reaching this goal is a quality management strategy that can cope both with the size of Wikipedia and with its open, almost anarchic organizational structure.
Such a strategy includes an efficient communication platform for work coordination among the users as well as techniques for monitoring quality problems in the encyclopedia. This dissertation shows how language technology methods can effectively support the existing approaches to information quality management in Wikipedia.

In the first part of the dissertation, we introduce the theoretical foundations of our work. We first discuss the relatively new concept of open online collaboration, with particular attention to collaborative writing. This introduction is completed by a detailed discussion of Wikipedia. The three main contributions of this work build on these foundations.

Although there have been attempts to adapt existing frameworks for capturing information quality to the needs of Wikipedia, no model has so far considered text and writing quality as a central factor. However, since Wikipedia is not merely a collection of facts but consists of full-text articles, the text and writing quality of these articles must be given a central role in any assessment of quality. As the first central contribution of this dissertation, we therefore define a comprehensive article quality model that brings together both text and writing quality and the specific quality criteria of Wikipedia in a single model. It comprises a total of 23 quality dimensions in the categories intrinsic quality, contextual quality, text and writing quality, and structural quality.

In the second central contribution of this work, we present an approach for automatically detecting quality flaws in Wikipedia articles. Although the idea has already been described in earlier work, we found in our experiments that this approach is inherently prone to