
The Quality of Content in Open Online Collaboration Platforms: Approaches to NLP-supported Information Quality Management in Wikipedia

Vom Fachbereich Informatik der Technischen Universität Darmstadt genehmigte

Dissertation

zur Erlangung des akademischen Grades Doktor-Ingenieur (Dr.-Ing.)

vorgelegt von

Oliver Ferschke, M.A. geboren in Würzburg

Tag der Einreichung: 15. Mai 2014
Tag der Disputation: 15. Juli 2014

Referenten:
Prof. Dr. Iryna Gurevych, Darmstadt
Prof. Dr. Hinrich Schütze, München
Assoc. Prof. Carolyn Penstein Rosé, PhD, Pittsburgh

Darmstadt 2014
D17

Please cite this document as
URN: urn:nbn:de:tuda-tuprints-40929
URL: http://tuprints.ulb.tu-darmstadt.de/4092

This document is provided by tuprints, E-Publishing-Service of the TU Darmstadt
http://tuprints.ulb.tu-darmstadt.de
[email protected]

This work is published under the following Creative Commons license: Attribution – Non Commercial – No Derivative Works 2.0 Germany
http://creativecommons.org/licenses/by-nc-nd/2.0/de/deed.en

Abstract

Over the past decade, the paradigm of the World Wide Web has shifted from static web pages towards participatory and collaborative content production. The main properties of this user-generated content are a low publication threshold and little or no editorial control. While this has improved the variety and timeliness of the available information, it causes an even higher variance in quality than the already heterogeneous quality of traditional web content. Wikipedia is the prime example of a successful, large-scale, collaboratively created resource that reflects the spirit of the open collaborative content creation paradigm. Even though recent studies have confirmed that the overall quality of Wikipedia is high, there is still a wide gap that must be bridged before Wikipedia reaches the state of a reliable, citable source. A key prerequisite to reaching this goal is a quality management strategy that can cope both with the massive scale of Wikipedia and with its open and almost anarchic nature. This includes an efficient communication platform for work coordination among the collaborators as well as techniques for monitoring quality problems across the encyclopedia. This dissertation shows how natural language processing approaches can be used to assist information quality management on a massive scale.

In the first part of this thesis, we establish the theoretical foundations for our work. We first introduce the relatively new concept of open online collaboration with a particular focus on collaborative writing and proceed with a detailed discussion of Wikipedia and its role as an encyclopedia, a community, an online collaboration platform, and a knowledge resource for language technology applications. We then proceed with the three main contributions of this thesis.

Even though there have been previous attempts to adapt existing information quality frameworks to Wikipedia, no quality model has yet incorporated writing quality as a central factor. Since Wikipedia is not only a repository of mere facts but rather consists of full-text articles, the writing quality of these articles has to be taken into consideration when judging article quality. As the first main contribution of this thesis, we therefore define a comprehensive article quality model that aims to consolidate both the quality of writing

and the quality criteria defined in multiple Wikipedia guidelines and policies into a single model. The model comprises 23 dimensions segmented into the four layers of intrinsic quality, contextual quality, writing quality and organizational quality.

As a second main contribution, we present an approach for automatically identifying quality flaws in Wikipedia articles. Even though the general idea of quality flaw detection has been introduced in previous work, we dissect the approach to find that the task is inherently prone to a topic bias which results in unrealistically high cross-validated evaluation results that do not reflect the classifier's real performance on real-world data. We solve this problem with a novel data sampling approach based on the full article revision history that is able to avoid this bias. It furthermore allows us not only to identify flawed articles but also to find reliable counterexamples that do not exhibit the respective quality flaws. For automatically detecting quality flaws in unseen articles, we present FlawFinder, a modular system for supervised text classification. We evaluate the system on a novel corpus of Wikipedia articles with neutrality and style flaws. The results confirm the initial hypothesis that the reliable classifiers tend to exhibit a lower cross-validated performance than the biased ones, but their scores more closely resemble their actual performance in the wild.

As a third main contribution, we present an approach for automatically segmenting and tagging the user contributions on article Talk pages to improve the work coordination among Wikipedians. These unstructured discussion pages are not easy to navigate and information is likely to get lost over time in the discussion archives. By automatically identifying the quality problems that have been discussed in the past and the solutions that have been proposed, we can help users to make informed decisions in the future. Our contribution in this area is threefold: (i) We describe a novel algorithm for segmenting the unstructured dialog on Wikipedia Talk pages using their revision history. In contrast to related work, which mainly relies on the rudimentary markup, this new algorithm can reliably extract metadata, such as the identity of a user, and is moreover able to handle discontinuous turns. (ii) We introduce a novel scheme for annotating the turns in article discussions with dialog act labels for capturing the coordination efforts of article improvement. The labels reflect the types of criticism discussed in a turn, for example missing information or inappropriate language, as well as any actions proposed for solving the quality problems. (iii) Based on this scheme, we created two automatically segmented and manually annotated discussion corpora extracted from the Simple English Wikipedia (SEWD) and the English Wikipedia (EWD). We evaluate how well text classification approaches can learn to assign the dialog act labels from our scheme to unseen discussion pages and achieve a cross-validated performance of F1 = 0.82 on the SEWD corpus, while we obtain an average performance of F1 = 0.78 on the larger and more complex EWD corpus.

Zusammenfassung

In den vergangenen zehn Jahren hat sich der Fokus des World Wide Web von primär statischen Webseiten hin zu kollaborativ erstellten Inhalten verlagert. Die wichtigsten Eigenschaften dieses neuen Paradigmas sind eine niedrige Veröffentlichungsschwelle und wenig oder gänzlich fehlende redaktionelle Kontrolle. Wenngleich dadurch die Vielfalt und Aktualität der verfügbaren Informationen verbessert wurde, fördert es zugleich auch die Heterogenität der Webinhalte hinsichtlich ihrer Qualität. Wikipedia ist das Paradebeispiel für eine große, erfolgreiche, kollaborativ erstellte Ressource, die den Geist freier Kollaboration widerspiegelt. Auch wenn jüngste Studien bestätigt haben, dass die Qualität von Wikipedia insgesamt hoch ist, ist es immer noch ein weiter Weg, Wikipedia zu einer zuverlässigen und zitierbaren Quelle zu machen.

Eine wichtige Voraussetzung zur Erreichung dieses Ziels ist eine Qualitätsmanagementstrategie, die sowohl mit der Größe von Wikipedia als auch mit ihrer offenen, nahezu anarchischen Organisationsstruktur umgehen kann. Eine solche Strategie schließt eine effiziente Kommunikationsplattform für die Arbeitskoordination zwischen den Nutzern sowie Techniken zur Überwachung von Qualitätsproblemen in der Enzyklopädie mit ein. Diese Dissertation zeigt auf, wie sprachtechnologische Methoden die bestehenden Ansätze zum Informationsqualitätsmanagement in Wikipedia effektiv unterstützen können.

Im ersten Teil der Dissertation führen wir die theoretischen Grundlagen für unsere Arbeit ein. Wir erörtern zunächst das relativ neue Konzept der freien Online-Kollaboration unter besonderer Berücksichtigung kollaborativen Schreibens. Vervollständigt wird diese Einführung mit einer ausführlichen Diskussion der Wikipedia. Auf Basis dieser Grundlagen folgen die drei Hauptbeiträge der vorliegenden Arbeit.

Wenngleich es bereits Versuche gab, bestehende Frameworks zur Erfassung von Informationsqualität an die Bedürfnisse der Wikipedia anzupassen, hat bisher kein Modell die Text- und Schreibqualität als zentralen Faktor berücksichtigt. Da Wikipedia jedoch nicht nur eine Ansammlung von Fakten ist, sondern aus Volltextartikeln besteht, ist der Text- und Schreibqualität dieser Artikel eine zentrale Rolle bei den Qualitätsbetrachtungen zuzuschreiben. Als ersten zentralen Beitrag dieser Dissertation definieren wir daher ein umfassendes

Artikelqualitätsmodell, welches sowohl die Text- und Schreibqualität als auch die spezifischen Qualitätskriterien der Wikipedia in einem einzigen Modell zusammenführt. Es umfasst insgesamt 23 Qualitätsdimensionen in den Kategorien intrinsische Qualität, kontextbezogene Qualität, Text- und Schreibqualität und strukturelle Qualität.

Im zweiten zentralen Beitrag dieser Arbeit stellen wir einen Ansatz zur automatischen Erkennung von Qualitätsmängeln in Wikipedia-Artikeln vor. Auch wenn die Idee hierzu bereits in früheren Arbeiten beschrieben wurde, haben wir in unseren Experimenten herausgefunden, dass dieser Ansatz von Natur aus anfällig für einen Themenbias ist, welcher zu unrealistisch hohen Werten in der Kreuzvalidierung von Klassifikationsmodellen führt. Die tatsächliche Leistung auf realen Daten liegt weit unter den Ergebnissen, die in früheren Arbeiten berichtet wurden. Wir lösen dieses Problem mit einem neuen Samplingverfahren basierend auf der Artikelrevisionsgeschichte. Dieser Ansatz vermag es nicht nur, fehlerhafte Artikel zu identifizieren, sondern auch zuverlässige Gegenbeispiele zu finden, die nicht die entsprechenden Qualitätsmängel aufweisen. Zur automatischen Erkennung von Qualitätsmängeln haben wir FlawFinder entwickelt, ein modulares System für überwachte Textklassifikation. Wir evaluieren das System auf einem Korpus aus Wikipedia-Artikeln mit Qualitätsmängeln in den Bereichen Neutralität und Stilistik. Die gewonnenen Ergebnisse bestätigen unsere Ausgangshypothese, dass auf ausgeglichenen Daten trainierte Klassifikatoren zwar zu einer geringeren kreuzvalidierten Leistung neigen, jedoch die tatsächliche Leistung in realen Anwendungsszenarien realistischer widerspiegeln.

Als dritten zentralen Beitrag dieser Arbeit stellen wir einen Ansatz für die automatische Segmentierung und Klassifikation von Nutzerbeiträgen in Artikeldiskussionsseiten vor. Es hat sich gezeigt, dass Nutzer der Wikipedia Probleme haben, sich auf diesen unstrukturierten Diskussionsseiten zurechtzufinden, und archivierte Informationen mit der Zeit nur noch schwer auffindbar sind. Indem wir automatisch die Qualitätsprobleme und Lösungsvorschläge identifizieren, die in vergangenen Diskussionen erörtert wurden, können wir den Nutzern helfen, fundierte Entscheidungen in der Zukunft zu treffen. Der Beitrag unterteilt sich in folgende drei Teile: (i) Wir beschreiben einen neuen Algorithmus zur Segmentierung des unstrukturierten Dialogs auf Wikipedia-Diskussionsseiten mit Hilfe ihrer Revisionsgeschichte. (ii) Wir stellen ein neuartiges Annotationsschema für Beiträge in Artikeldiskussionen vor. Die darin definierten Dialogakte spiegeln wider, welche Kritik an einem Artikel geäußert wurde, wie zum Beispiel fehlende Informationen oder unangemessene Sprache, und welche Lösungen vorgeschlagen wurden. (iii) Basierend auf diesem Schema haben wir zwei automatisch segmentierte und manuell annotierte Korpora aus Artikeln der Simple English Wikipedia (SEWD) und der englischen Wikipedia (EWD) erstellt. Wir nutzen diese Korpora, um Klassifikationsmodelle zu trainieren, die die Dialogakte in unbekannten Diskussionsseiten identifizieren können. In unserer Evaluation erreichen wir auf dem SEWD-Korpus eine Leistung von F1 = 0.82, während wir auf dem komplexeren EWD-Korpus durchschnittlich F1 = 0.78 beobachten konnten.

Acknowledgements

I would like to thank my advisor, Iryna Gurevych, for her guidance, enthusiasm and encouragement during my nearly five years at the UKP lab. I am grateful for her patience and support and for all the time she spared despite her constantly busy schedule. I am also indebted to my committee members, Carolyn Rosé and Hinrich Schütze, for reviewing my work and providing valuable feedback.

I wholeheartedly thank Torsten Zesch, a most gifted researcher and inspiring mentor, who not only guided me through my first steps as a researcher but was also a constant source of ideas, wisdom and moral support. I furthermore thank Christian Meyer for the countless stimulating discussions, his helpful feedback on my work and for the many theater visits that managed to take my mind off work for a while. Without the tireless dedication of Richard Eckart de Castilho, I would have surrendered more often than I would like to admit to the technical challenges of my work. Special thanks also go to my closest collaborator, Johannes Daxenberger, with whom I had the pleasure of co-authoring several publications and who was the most knowledgeable person with whom to discuss anything related to Wikipedia.

I thank Christian Meyer, Emily Jamison, Johannes Daxenberger, Lucie Flekova and Lisa Beinborn for reviewing my thesis and providing me with helpful feedback and many invaluable ideas. I furthermore thank all colleagues at UKP for the many joyful and enlightening hours at the lab and all student research assistants who contributed to my work.

Finally, I would like to express my deepest gratitude to my family and friends. Without their support, I would not have been able to successfully go through the roller coaster ride of graduate school. I especially thank my parents for always having an open ear for any problems and my nephew Maximilian for cheering me up whenever I visit.

This work has been supported by the Hessian research excellence program "Landes-Offensive zur Entwicklung Wissenschaftlich-Ökonomischer Exzellenz" (LOEWE) as part of the research center "Digital Humanities".

Contents

1 Introduction
  1.1 Main Contributions
  1.2 Publication Record
  1.3 Thesis Organization
  1.4 Terminology and Conventions

2 Open Online Collaboration
  2.1 Open Online Collaboration
  2.2 Collaborative Writing
  2.3 Collaborative Online Writing Systems
  2.4 Chapter Summary

3 Wikipedia
  3.1 Overview
  3.2 Structure and Organization
    3.2.1 Namespaces and Naming Conventions
    3.2.2 Organizational Structures
    3.2.3 Inner Article Structure
    3.2.4 Template System
  3.3 Community
    3.3.1 User Groups and Roles
    3.3.2 Soft Security
    3.3.3 Systemic Bias
    3.3.4 ......
  3.4 Revision History
  3.5 User Discussions
  3.6 Processing Wikipedia
    3.6.1 Data Sources

    3.6.2 Data Access
  3.7 Other Wikimedia Projects
  3.8 Chapter Summary

4 Information Quality
  4.1 Information Quality
  4.2 Text and Writing Quality
  4.3 Quality Management Mechanisms in Wikipedia
  4.4 An Article Quality Model for Wikipedia
    4.4.1 Intrinsic Article Quality
    4.4.2 Contextual Article Quality
    4.4.3 Article Writing Quality
    4.4.4 Organizational Article Quality
  4.5 Chapter Summary

5 Quality Flaw Detection in Wikipedia Articles
  5.1 Motivation and Overview
  5.2 Quality Flaws in Wikipedia
    5.2.1 Properties of Quality Flaws in Wikipedia
    5.2.2 Definition of the Quality Flaw Detection Task
    5.2.3 Reliability of Cleanup Templates as Quality Flaw Markers
    5.2.4 Coverage of the Article Quality Model by Cleanup Templates
    5.2.5 Quality Flaw Markers in Non-English Wikipedias
  5.3 Quality Flaw Corpora
    5.3.1 The CLEF Corpus
    5.3.2 Reliability of Training Instances
    5.3.3 Topic Bias
    5.3.4 The NSTYLE Corpus
  5.4 A System for Quality Flaw Detection
    5.4.1 System Architecture
    5.4.2 Features
  5.5 Experiments
    5.5.1 Experiment Setup and Optimization
    5.5.2 Evaluation and Error Analysis
  5.6 Mining Flaw Corrections from the Revision History
  5.7 Limitations in the Predictability of Quality Flaws
  5.8 Chapter Summary

6 Dialog Analysis of Wikipedia Talk Pages
  6.1 Motivation and Overview
  6.2 Linguistic Background
  6.3 Related Work
    6.3.1 Work Coordination and Conflict Resolution
    6.3.2 Authority and Social Alignment
    6.3.3 User Interaction
    6.3.4 Information Quality
  6.4 Wikipedia Article Discussion Corpora
    6.4.1 Dialog Segmentation
    6.4.2 Data Sampling
  6.5 Annotating Wikipedia Article Discussions
    6.5.1 Annotation Schemes for Article Discussions
    6.5.2 Corpus Annotation Process and Gold Standard Creation
    6.5.3 Inter-Annotator Agreement
    6.5.4 Corpus Analysis
  6.6 Automatic Prediction of Dialog Act Labels
    6.6.1 Experiment and System Setup
    6.6.2 Features
    6.6.3 Evaluation and Error Analysis
  6.7 Application Scenario
  6.8 Chapter Summary

7 Summary and Conclusions
  7.1 Summary
  7.2 Future Research Directions

Appendix
  A Open Source Software
  B Annotation Guidelines
  C Cleanup Templates in the English Wikipedia

List of Tables

List of Figures

Bibliography

Index

Chapter 1 Introduction

“Begin at the beginning,” the King said gravely, “and go on till you come to the end: then stop.” — Lewis Carroll, Alice in Wonderland

User-generated content is the main driving force of the increasingly social web. Participatory and collaborative content production has largely replaced the traditional ways of information sharing and makes up a large part of the daily information consumed by web users. The main properties of user-generated content are a low publication threshold and little or no editorial control. While this has positively affected the variety and timeliness of the available information, it causes an even higher variance in quality than the already heterogeneous quality of traditional web content. Wikipedia is the prime example of a successful, large-scale, collaboratively created resource that reflects the spirit of the open collaborative content creation paradigm. One of the main characters in the popular TV series The Office sums up the main idea of Wikipedia in the following ironic quote: "Wikipedia is the best thing ever. Anyone in the world, can write anything they want about any subject. So you know you are getting the best possible information."1 In fact, studies (Giles, 2005; Casebourne et al., 2012; Koistinen, 2013) have confirmed that the overall quality of Wikipedia is high despite its open access policy and lack of rigid regulation. However, there is still a wide gap that must be bridged before Wikipedia reaches the state of a reliable, citable source. In the early days of Wikipedia, the main concern of the community was to increase the coverage of the encyclopedia in order to avoid the problems of its predecessors, which died of insignificance. Today2, the English Wikipedia alone

1 Quote from the TV series The Office, Series 3, Episode 18.
2 Feb 28, 2014

contains as many as 4.5 million articles and the main concern is now "to make Wikipedia as high-quality as possible. [Encyclopædia] Britannica or better quality is the goal." (LaVallee, 2009) In order to achieve this goal and provide the "best possible information", Wikipedia requires a quality management strategy that can cope both with the scale of Wikipedia and its open and almost anarchic nature. Given the fact that less than 10% of Wikipedia users are responsible for more than 90% of the contributions (Ortega, 2009), this quality management strategy cannot rely on the many eyes principle alone to sufficiently assure the quality of all Wikipedia articles. The relatively small core group of active Wikipedians is rather in need of technical assistance to ensure that even less popular topics in Wikipedia satisfy a basic quality standard.

In this thesis, we discuss how natural language processing approaches can be used to assist the community around Wikipedia in this endeavor. To this end, we consider two basic strategies, a data-driven approach and a process-driven approach. The data-driven approach aims at analyzing and modifying the information directly in order to assess and improve information quality. The latter strategy, on the other hand, aims at improving the established processes involved in analyzing and maintaining the information in order to improve the overall quality of the resource indirectly (Sidi et al., 2012).

Following the data-driven strategy, we present an approach to automatically identify quality flaws in Wikipedia articles using state-of-the-art text classification techniques so that passive Wikipedia users, who mainly read articles but are not involved in their maintenance, can be made aware of potential problems in the articles, while active contributors can use this information to solve issues in articles they are interested and experienced in. As a process-driven strategy, we present an approach to improve work coordination between Wikipedians by automatically segmenting and tagging user contributions on article Talk pages, the place where Wikipedia users mainly plan the future development of the articles. These largely unstructured discussion pages are not easy to navigate and information is likely to get lost over time in the discussion archives. By automatically identifying the issues that have been discussed in the past and the solutions that have been proposed, we can help users to make informed decisions. The research community can furthermore use this information to gain insights into the collaborative processes and identify what differentiates successful work coordination from unsuccessful attempts.

Figure 1.1 shows how the two approaches presented in this thesis relate to each other and to the information quality management process in Wikipedia. In the following section, we give an overview of the contributions of this thesis.


Figure 1.1: Overview of the two approaches to NLP-assisted information quality management in Wikipedia presented in this thesis.

1.1 Main Contributions

The main contributions of this thesis can be divided into a practice-oriented part, i.e. the implications of our work for the information quality management process in Wikipedia, and a theory-oriented part, i.e. the relevance of our contributions to the field of natural language processing. This section gives an overview of our contributions as well as the software and datasets that will be made available to the research community.

– Even though there have been previous attempts to adapt existing information quality frameworks to Wikipedia, no quality model has yet incorporated text and writing quality as a central factor. Since Wikipedia is not only a repository of mere facts but mainly consists of full-text articles, the writing quality of these articles has to be taken into consideration when judging article quality. We therefore define a comprehensive article quality model that is grounded in information science and aims


to consolidate both the quality of writing and the quality criteria defined in multiple Wikipedia guidelines and policies into a single model.

– We present a novel corpus of articles with neutrality and style flaws mined from the English Wikipedia. The corpus contains both articles with the particular flaws and documents that are reliable examples of articles without these flaws. To the best of our knowledge, this is the first corpus of this kind that provides both positive and negative examples for quality flaws.

– For the first time, we establish that automatic quality flaw identification in Wikipedia articles is prone to a topic bias that results in skewed classifiers and unrealistically high cross-validated evaluation results that do not reflect the classifier's real performance on real-world data. We furthermore describe a data sampling approach that is able to avoid this bias in the training data (see the illustrative sketch at the end of this list).

– We introduce FlawFinder – a system for supervised text classification designed for quality flaw detection. While FlawFinder has been developed particularly for the flaw detection task, it can be applied to general text classification problems and has been adapted as a general purpose text classification framework that is described in appendix A.2.

– We describe an approach for mining a corpus of quality flaw corrections from Wikipedia's article revision history which can be used as a starting point for identifying the quality flaws within articles instead of merely tagging whole articles that contain certain flaws.

– We present a novel algorithm for segmenting the unstructured dialog on Wikipedia article Talk pages using the revision history. In contrast to the approaches described in related work, which rely on the rudimentary markup and optional user signatures, our algorithm can reliably extract meta information such as the contributor identity and post timestamp even though this information is not contained on the actual page. The algorithm is furthermore able to handle discontinuous turns and inserted replies, which is beyond the reach of the approaches described in related work.

– We introduce a novel scheme for annotating the turns in article discussions with dialog act labels in order to capture the coordination efforts of article improvement. The labels are intended to reflect the types of criticism discussed in a turn as well as any actions proposed for solving the quality problems. This reflects the core purpose of the discourse on the Wikipedia article Talk pages.

– We present two novel corpora of Wikipedia article discussions extracted from the Simple English Wikipedia (SEWD corpus) and the English Wikipedia (EWD corpus).


The corpora are segmented and manually annotated with dialog act labels defined in our annotation scheme.

– We evaluate how well text classification approaches can learn to assign the dialog act labels from our scheme to unseen discussion pages. Such classifiers will enable novel applications suitable to improve the work coordination processes in Wikipedia. By automatically identifying the problems that have been discussed and the solutions that have been proposed, we can furthermore gain deeper insights into how the Wikipedia community works and how good work coordination differs from unsuccessful work coordination.
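To make the sampling idea from the flaw-related contributions above more concrete, the following sketch pairs the last tagged revision of an article with a later revision of the same article from which the cleanup template was removed. This is only an illustration of the general principle under simplifying assumptions, not the corpus construction pipeline actually used in this thesis (which builds on the Java-based tools described in chapter 5 and appendix A); the Revision tuples and the helper functions are hypothetical.

import re
from typing import Iterable, Optional, Tuple

# Hypothetical revision record: (timestamp, wiki markup of the article at that time)
Revision = Tuple[str, str]

def has_cleanup_marker(text: str, template: str) -> bool:
    """Check whether a revision contains a given cleanup template,
    e.g. {{Advert}} or {{Peacock}}, ignoring case and template parameters."""
    pattern = r"\{\{\s*" + re.escape(template) + r"\s*[|}]"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

def sample_training_pair(revisions: Iterable[Revision],
                         template: str) -> Optional[Tuple[str, str]]:
    """Return a (flawed_text, clean_text) pair for one article, or None.

    flawed_text: the last revision in which the cleanup template was present.
    clean_text:  a later revision of the *same* article from which the template
                 was removed, i.e. a reliable counterexample that controls for
                 topic, because both instances describe the same subject.
    """
    flawed, clean = None, None
    for _timestamp, text in revisions:            # chronological order assumed
        if has_cleanup_marker(text, template):
            flawed, clean = text, None            # remember the tagged state
        elif flawed is not None:
            clean = text                          # template was removed later
    if flawed is not None and clean is not None:
        return flawed, clean
    return None

Note that before training, the cleanup template markup itself would have to be stripped from the flawed instance, since otherwise a classifier could trivially learn to spot the tag instead of the underlying flaw.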

1.2 Publication Record

We have previously published the main contributions of this thesis in peer-reviewed conference or workshop proceedings of major events in natural language processing and related fields, such as ACL, EACL, CLEF and WWW. The chapters which build upon these publications are indicated accordingly. A full bibliography of the author's publications can be found in the appendix.

Johannes Daxenberger, Oliver Ferschke, Iryna Gurevych and Torsten Zesch: ‘DKPro TC: A Java-based Framework for Supervised Learning Experiments on Textual Data’, in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. System Demonstrations, pp. 61–66, Baltimore, MD, USA, June 2014. (chapters 5, A.2)

Lucie Flekova, Oliver Ferschke and Iryna Gurevych: ‘What Makes a Good Biography? Multidimensional Quality Analysis Based on Wikipedia Article Feedback Data’, in: Proceedings of the 23rd International World Wide Web Conference (WWW 2014), pp. 855–866, Seoul, Korea, April 2014. (chapter 4)

Oliver Ferschke, Iryna Gurevych and Marc Rittberger: ‘The Impact of Topic Bias on Quality Flaw Prediction in Wikipedia’, in: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 721–730, Sofia, Bulgaria, August 2013. (chapter 5)

Oliver Ferschke, Johannes Daxenberger and Iryna Gurevych: ‘A Survey of NLP Methods and Resources for Analyzing the Collaborative Writing Process in Wikipedia’, in Iryna Gurevych and Jungi Kim (Eds.): The People’s Web Meets NLP: Collaboratively Constructed Language Resources, Chapter 5, pp. 121–160, Springer, April 2013. (chapters 2,3,6)

Oliver Ferschke, Iryna Gurevych and Marc Rittberger. ‘FlawFinder: A Modular System for Predicting Quality Flaws in Wikipedia - Notebook for PAN at CLEF 2012’, in: CLEF 2012


Labs and Workshop, Notebook Papers, Online Proceedings, Rome, Italy, September 2012. (chapter 5)

Oliver Ferschke, Iryna Gurevych and Yevgen Chebotar. ‘Behind the Article: Recognizing Dialog Acts in Wikipedia Talk Pages’, in: Proceedings of the 13th Conference of the European Chapter of the ACL (EACL 2012), pp. 777–786, Avignon, France, April 2012. (chapter 6)

Oliver Ferschke, Torsten Zesch and Iryna Gurevych. ‘Wikipedia Revision Toolkit: Efficiently Accessing Wikipedia’s Edit History’, in: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. System Demonstrations, pp. 97–102, Portland, OR, USA, June 2011. (chapters 3,5,6,A.1)

1.3 Thesis Organization

In the remainder of this chapter, we give an overview of the organization of this dissertation.

Chapter 2 discusses the foundations of open collaboration and introduces the main characteristics of collaborative and open collaborative writing. It furthermore gives a brief overview of systems for joint online writing and how they can benefit from language technology.

Chapter 3 introduces the free online encyclopedia Wikipedia and defines the terminology necessary to perform the succeeding analyses. We introduce both the main characteristics of the encyclopedia and the culture and community that emerged around it. Additionally, we introduce technological aspects such as the different possibilities to process the content of Wikipedia.

Chapter 4 discusses the theory of information quality and how it applies to a collaboratively created resource such as Wikipedia. We discuss the processes involved in defining an information quality model and finally adapt an established, generic model to the specific needs of Wikipedia under particular consideration of text and writing quality.

Chapter 5 presents our novel approach for automatically identifying quality flaws in Wikipedia based on supervised detection of cleanup templates. These templates are assigned to articles by Wikipedia users and identify concrete shortcomings of an article. We argue that these markers are suitable proxies for quality flaws and, in turn, an adequate means for quality assessment and the basis for assisting users in quality improvement. Moreover, we identify a methodological problem in existing approaches for quality flaw

detection and establish that the flaw prediction task inherently suffers from a topic bias which has to be accounted for in any machine learning attempt. We describe a solution to this problem in our approach and present a novel corpus of neutrality and style flaws that is already controlled for the topic bias.

Chapter 6 introduces our approach to analyzing Wikipedia article discussion pages by means of dialog act analysis. We develop a scheme for identifying user contributions discussing quality problems and suggesting actions to solve these problems. We present two annotated corpora extracted from the Simple English Wikipedia and the English Wikipedia respectively. We furthermore carry out text classification experiments on both corpora in order to evaluate how well a machine learning algorithm can automatically apply labels to unseen turns in order to automate the dialog act analysis task.
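To illustrate the revision-based segmentation idea underlying that chapter, the sketch below diffs consecutive revisions of a Talk page and attributes every newly inserted block of lines to the author and timestamp of the revision that introduced it. This is a strongly simplified sketch under illustrative assumptions, not the segmentation algorithm developed in chapter 6 (which, among other things, also handles discontinuous turns and replies inserted into earlier threads); the TalkRevision and Turn structures are assumed for the example only.

import difflib
from dataclasses import dataclass
from typing import List

@dataclass
class TalkRevision:
    author: str      # contributor recorded in the revision metadata
    timestamp: str   # e.g. an ISO-8601 timestamp string
    text: str        # full wiki markup of the Talk page at this revision

@dataclass
class Turn:
    author: str
    timestamp: str
    text: str

def segment_turns(revisions: List[TalkRevision]) -> List[Turn]:
    """Attribute text added between consecutive revisions to its contributor.

    Every block of lines present in revision i+1 but not in revision i is
    treated as (part of) a turn contributed by the author of revision i+1.
    """
    turns: List[Turn] = []
    previous_lines: List[str] = []
    for rev in revisions:                         # chronological order assumed
        current_lines = rev.text.splitlines()
        matcher = difflib.SequenceMatcher(a=previous_lines, b=current_lines)
        inserted: List[str] = []
        for op, _i1, _i2, j1, j2 in matcher.get_opcodes():
            if op in ("insert", "replace"):       # lines added or rewritten
                inserted.extend(current_lines[j1:j2])
        if inserted:
            turns.append(Turn(rev.author, rev.timestamp, "\n".join(inserted)))
        previous_lines = current_lines
    return turns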

Chapter 7 draws conclusions from the preceding chapters and summarizes both the solved problems and the challenges that still remain to be addressed in future work.

1.4 Terminology and Conventions

Unless otherwise specified, any references to Wikipedia and Wikipedia content refer to the English Wikipedia. Whenever any specific content on a wiki page is referenced or cited, we provide the revision or access date of the corresponding page, e.g. http://en.wikipedia.org/wiki/index.php?oldid=596488753. In cases where a policy or guideline page is referenced as a concept without referring to the specific content on this page, we provide the shortcut to the page, e.g. http://en.wikipedia.org/wiki/WP:MOS for the Wikipedia Manual of Style.


Chapter 2 Open Online Collaboration

“Individually, we are one drop. Together, we are an ocean.” — Ryunosuke Satoro

In this chapter, we discuss the properties of open collaboration with a focus on collaborative writing. We first identify the main characteristics of this fairly new modus operandi in joint work and analyze the important factors for implementing a suitable information quality management strategy with language technology assistance (section 2.1). We then discuss collaborative writing as a special instance of collaboration, how it differs from individual writing and how open online collaboration adds additional levels of complexity to the writing task (section 2.2). We finally provide a brief discussion regarding the requirements of systems for collaborative online writing (section 2.3) and conclude the chapter with a summary of our findings (section 2.4).

2.1 Open Online Collaboration

The term computer supported cooperative work (CSCW) has existed since 1984, when it was coined by Irene Greif and Paul Cashman as the title of a workshop on understanding and supporting collaboration (Grudin and Poltrock, 2013). While CSCW initially focused mainly on the utilization of computer-mediated communication, such as email, to support online collaboration in research or corporate workgroups, the field soon transcended these realms and now increasingly penetrates other aspects of our daily social and work interactions. Most notably, with the rise of the general availability of the Internet, CSCW has expanded from the confines of small groups to open, heterogeneous communities.

Closed group vs. Open Collaboration. While online collaboration in closed groups often relies on fixed role and task assignments and is frequently steered by a higher authority,

open collaboration is an egalitarian and meritocratic process in which everyone who joins the group can contribute to the best of their ability to an iteratively improving product, while the merits of each contribution can be publicly discussed by the community. Instead of assigned tasks and fixed roles, open collaboration is self-organizing to a high degree, with the objective that every collaborator can find their own mode of participation and perform tasks they are interested in and qualified for.

The difference between closed group and open collaboration is well illustrated by two metaphors that originated in the context of open software development (Raymond, 1999). The cathedral model compares the collaborative process with the construction of a cathedral that has to be planned in advance, supervised by experts and carried out by a fixed group of contractors who collaborate to build a single, final product – the cathedral. This resembles traditional closed-group collaboration, which is often directed at a particular final goal which is to be reached with a predefined plan and fixed task assignments. If any collaborator fails, the whole project is endangered.

In contrast, the bazaar model compares the collaborative process to a bazaar on which many people trade their goods without being controlled by a central authority. Each marketeer has equal opportunities, equal rights and is able to choose their individual contribution to the community on their own. The bazaar as a whole is complete and functional even with individual stalls and merchants being absent. This resembles open collaboration, which is an inherently iterative process without a fixed workflow or static role assignments. The product organically evolves as a result of swarm creativity, i.e. the aggregated individual contributions of the changing set of collaborators, and does not necessarily reach a final state of a finished product but rather remains in constant evolution.

Applications. The range of applications for open online collaboration is wide and often closely connected to the concept of social networks (Forte and Lampe, 2013). Platforms embracing the Web 2.0 spirit attract a large crowd of users whose workforce is put to joint use for the collaborative creation of online maps (e.g. OpenStreetMap3), news (e.g. Slashdot4, Digg5), dictionaries (e.g. Wiktionary6) or encyclopedias (e.g. Wikipedia7), just to name a few examples.

Challenges in Open Collaboration. According to Forte et al. (2012), self-organizing communities have to face three main challenges. First, since participation in open collaboration is usually intrinsically motivated, incentives and motivation play essential roles in achieving long-term success. New users have to be attracted while current members of the

3 http://www.openstreetmap.org
4 http://www.slashdot.org
5 http://www.digg.com
6 http://www.wiktionary.org
7 http://www.wikipedia.org

community must be retained and kept motivated to actively contribute to the collaborative project. Second, it is necessary to bring the people to the work instead of counting on any individual to self-select the ideal job. In other words, the available work force has to be distributed and allocated to open tasks. It is necessary for the whole community to be aware at all times which open issues have to be addressed and where the greatest demand for contributions is. Since no central management exists in open collaboration, this has to be achieved by means of collaborative work coordination. Third, since centralized decision making is unfeasible in very large heterogeneous groups, sub-communities with nested organizational structures have to emerge which concentrate on particular sets of tasks. These sub-communities develop their own social dynamics and thereby help to maintain the morale and trust among their members. Together, the output of the individual sub-communities contributes to the overall product of the open community. In Wikipedia, for example, this is mainly achieved with so-called WikiProjects, sub-communities that focus on particular subject areas or maintenance tasks.

Universal Properties. Even though no two open collaboration communities are alike and each project has different goals, one can identify universal properties of open collaboration. The power law of participation implies a long tail of many collaborators with few contributions and little impact on the system, while a small group of elite users is responsible for the main body of work. Consequently, the overall collaborative system and the product it produces are mainly shaped by the small groups of experts (Ortega et al., 2008). It furthermore has to be taken into account that the reasons for participation in open collaboration are different for each individual and that these reasons shape their level of activity and the type of work they do. It is therefore not sufficient to attract many users but rather important to attract enough users for every critical task (Forte and Lampe, 2013).

Collaboration Support Systems. While open collaboration is not necessarily confined to the realm of the world wide web, online cooperation is the most common mode of open collaboration. A key factor for its success is the support by a suitable online collaboration system that assists the open community in their endeavors. Forte and Lampe (2013) identify four dimensions of socio-technical systems for open online collaboration that have to be taken into account. In short, an open online collaboration system has to provide assistance for the collective production of an artifact, it has to support the social aspects of collaboration and work coordination by providing suitable means of communication, it has to reduce the complexity in order to lower the entry barrier for new collaborators, and it has to assist the development and upkeep of social structures in order to retain the active population of the community.


Quality Management. In open, collaborative content production, a key element of success is the quality of the content produced by the community. Therefore, a suitable quality management strategy is an absolute requirement for the success of such a community. We will address the issue of information quality management in the context of the collaboratively created encyclopedia Wikipedia in chapter 4 and discuss in the succeeding chapter how language technology can assist the process.

2.2 Collaborative Writing

In the narrowest sense, writing is the externalization of natural language in a visual or tactile form based on a formalized writing system. Rather than looking at the mechanics of writing, we are interested in the intellectual process of text production, including all related activities such as brainstorming, idea generation, planning, organizing, drafting and revising (Rice and Huguley J.T., 1994). In the latter sense, writing is an open-ended design task without a single correct end result that could be reached in a clearly defined chain of operations. It rather is a creative process that involves nondeterministic sequences of edits, revisions, deletions and amendments which finally lead to one of many possible texts that are suitable solutions for the given writing task (Sharples et al., 1993).

Traditionally, writing is thought of as a process involving a single author who iteratively plans, drafts and reviews his work (Lowry et al., 2004). In order to cope with large-scale writing tasks in a limited amount of time, authors have, however, always collaborated with each other. Collaborative writing resembles individual writing in many ways but is inherently more complex on a social, intellectual and procedural level (Galegher and Kraut, 1994). While writing is ultimately a process of externalizing thoughts and ideas, a large part of the writing process takes place in the mind. Collaborating with other writers towards the joint goal of creating a single, collaboratively created text therefore inevitably requires externalizing the otherwise hidden thoughts during the writing process for the sake of work coordination. However, despite this added complexity, research on collaborative writing has shown that the result of good collaboration is often more than the sum of its parts (Sharples, 1993).

Several disjunct theories and models for collaborative writing have been developed across the fields, each with their own terminology. In an attempt to build an interdisciplinary taxonomy and nomenclature of collaborative writing, Lowry et al. (2004) identify the common aspects of collaborative writing across different fields of research and define four basic strategies of joint writing:

Group Single-Author Writing: A single author compiles the results of a collaborative planning phase. The writing process itself is largely self-directed, thus resembling the process of individual writing more than collaborative writing. (figure 2.1a)


Figure 2.1: Collaborative writing strategies according to Lowry et al. (2004): (a) Group Single-Author, (b) Sequential, (c) Parallel, (d) Reactive

Sequential Writing: An iterative writing strategy in which the text is passed on from group member to group member. Only one author is writing at the same time. The required work coordination during the active writing phase is minimized, while the overall planning process and the contributions of the group members can be distributed. On the downside, social interaction can be limited in this approach. (figure 2.1b)

Parallel Writing: In this strategy, all group members simultaneously work on the same document. Parallel writing can be divided into horizontal division writing and stratified division writing. In the former case, each member of the group works on a separate sub-document which is finally merged into the final document. In the latter case, each member takes a different role in the writing process, such as editor, author and reviewer, and processes the document from a different point of view. (figure 2.1c)

Reactive Writing: The most challenging strategy involves concurrent editing of the same document by all group members. This approach is only viable with suitable support by the collaboration system and is the main strategy in open online collaboration. It demands the highest degree of coordination. (figure 2.1d)

While the collaborative aspect in group single-author writing is restricted to the planning phase of writing, sequential writing involves the incorporation of different writing styles into a single document. However, sequential writing is only efficient for small groups and not suitable for larger documents. The output volume of this strategy is furthermore limited because only a single group member is allowed to actively contribute to the document at a given point in time. In parallel writing, these problems are solved by distributing the writing task across all group members, allowing everyone to edit simultaneously but in different sections of the text. This increases the requirements for coordination but makes the task more efficient at the same time. However, the explicit assignment of authors to sections of the text also restricts the strategy to small groups. Even though reactive writing is technically the most challenging strategy with the highest requirements for coordination, it is the only approach that scales to web size, i.e. is suitable for large-scale open online collaboration. Any group member can directly edit any part of a document, giving them the opportunity to decide for themselves how and what to contribute to the final product. We therefore consider this strategy to be the only suitable approach for open online collaborative writing.

Depending on the work mode of the collaboration system, group members either edit in real time (live editing) or in near real time (asynchronous, revision-based editing).

Collaborative writing, or any kind of writing for that matter, is not a single, atomic activity but rather a complex process made up of different interlocked sub-activities. These activities have to be properly supported by the collaboration system in order to facilitate the joint writing process. Lowry et al. (2004) define seven activities of collaborative writing:

Brainstorming: Development of new ideas by all group members.
Convergence on brainstorming: Ranking and filtering of the ideas developed in the brainstorming phase.
Outlining: Organization of ideas in a high-level outline that sketches the rough structure of the document.
Drafting: Filling the outline with content in order to create a first incomplete draft of the text.
Reviewing: Adding comments and suggestions for corrections to the draft.
Revising: Responding to the comments from the review phase in order to create an improved draft.
Copyediting: Making final corrections to the draft in order to create a final, consistent document.

These activities should not be seen as strictly sequential. They can rather be combined in an arbitrary order and be revisited in any stage of the writing process. While it is possible for any group member to participate in every phase of the writing process, it is often the case that individual contributors specialize in particular tasks, i.e. they take different roles in the writing process such as author, copy-editor or reviewer. While roles are strictly pre-assigned in writing strategies such as stratified-division writing (see above), open forms of collaborative writing leave it up to the contributors to self-select for individual roles.

Open Collaborative Writing. As we have established before, collaborative writing adds additional complexity to the writing process compared to individual writing, mainly due to the added coordination efforts necessary to synchronize the work of all co-authors. Open collaboration in writing adds yet another level of complexity to the process since the coordination strategies have to scale to a large, heterogeneous group with different levels of expertise and different agendas. Open collaborative writing is still a relatively rare phenomenon compared to the other efforts of joint work that we listed above. Many efforts never exceeded the state of exploratory experiments. Rettberg (2005), for example, lists several activities revolving around the idea of constructive narratives following the concepts of exploratory and

constructive hypertext coined by Joyce (1988). Collaborative writing projects such as the Hypertext Hotel8, The Unknown9 or 1001 Nights Cast10 resemble long-term writing events producing a work in progress which is intended to remain in constant evolution rather than becoming a single finished product. One of the main lessons learned in these experiments was that it is important to limit the extent of openness in open collaboration in order to reach satisfactory results. That is, in order to synchronize the efforts of all participants, they have to agree upon constraints that everyone has to abide by and come to explicit agreements regarding their cooperation. Without coordination and agreement, the collaborative efforts are prone to incoherence and previous work is likely to be reverted by later contributors.

This holds true not only for literary experiments like the ones listed above, but also for more down-to-earth projects with real-life relevance such as Wikipedia. Even though participation in Wikipedia is open and not regulated by predefined, fixed rules, it is necessary to coordinate the work of many in order to reach a common goal – a high-quality encyclopedia. Wikipedia is introduced in detail in chapter 3, while the issues of quality management in the context of open collaboration will be discussed in chapter 4.

2.3 Collaborative Online Writing Systems

According to Lowry et al. (2004, p. 92), a collaborative writing system is a piece of "[s]oftware that allows collaborative writing groups to produce a shared document and assist collaborative writing groups to perform the major collaborative writing tasks." This subsumes a wide range of tools ranging from mere version control to full-blown writing environments. Noël and Robert (2004) carried out an empirical study interviewing 33 individuals in a web survey about the most important aspects of collaborative writing tools. Among the highest ranked answers, the participants mentioned synchronous access to the documents, version control, communication between collaborators, the ability to add stand-off comments to the text, visualization of the version history and spaces to plan and schedule the future work on the documents. Even though there was a high variation in the answers, the above mentioned aspects occurred multiple times, stressing their importance in different application scenarios.

It is beyond the scope of this work to define a hierarchy of collaborative writing systems and weigh the pros and cons of each possible incarnation. Noël and Robert (2003) give a detailed overview of 19 web-based systems for collaborative writing and discuss the merits and problems of each solution. While Noël and Robert exclusively focus on asynchronous collaboration systems, many recent tools foster synchronous collaboration, i.e. they allow

8 http://netlern.net/hyperdis/hyphotel accessed on Feb 27, 2014
9 http://unknownhypertext.com accessed on Feb 27, 2014
10 http://1001.net.au accessed on Feb 27, 2014

users to collaborate in real time on the same document. Examples for such tools are EtherPad11, Google Docs12 or Zoho Docs13, but also traditional word processing software, such as Microsoft Word14, expands its scope into the web in order to support real-time collaboration.

Wikis. Even though not originally designed as a writing tool, the wiki has particularly taken hold as a collaborative writing system. The term wiki stems from the Hawaiian expression wiki wiki, which translates to "very fast" and illustrates the main focus of the technology – fast and easy content production and management with minimal overhead (Leuf and Cunningham, 2001).

Wikis are web-based, asynchronous co-authoring tools whose content is structured with lightweight markup that is translated into HTML by the wiki system. The markup is restricted to a small set of keywords, which lowers the entrance barrier for new users and reduces the effort required to participate. Many wiki systems even offer visual editors that automatically produce the desired page layout without users having to know the markup language.

A unique characteristic of wikis is the automatic documentation of the revision history, which keeps track of every change that is made to a wiki page and can also be visually represented. With this information, it is possible to reconstruct the writing process from the beginning to the end and revert malicious changes in order to restore an earlier, clean version of the document. Additionally, many wikis offer their users a communication platform, the Talk pages, where they can discuss the ongoing writing process with other users.

Thus, wikis satisfy all the above listed requirements for good collaborative writing platforms. In the course of this thesis, we focus on Wikipedia, a wiki-based encyclopedia, which is one of the most successful collaborative online projects on the world wide web.
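As a concrete illustration of the revision history that wikis record automatically, the short sketch below queries the public MediaWiki web API for the most recent revisions of an article in the English Wikipedia and prints who changed the page and when. It is only a minimal sketch using the standard action=query / prop=revisions endpoint; the analyses in this thesis instead rely on the data sources discussed in section 3.6, and the article title used here is chosen arbitrarily.

import json
import urllib.parse
import urllib.request

def fetch_recent_revisions(title: str, limit: int = 5) -> list:
    """Retrieve revision metadata for a page from the English Wikipedia API."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|timestamp|user|comment",
        "rvlimit": str(limit),
        "format": "json",
    }
    url = "https://en.wikipedia.org/w/api.php?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as response:
        data = json.load(response)
    # The result is keyed by an internal page id; take the first (and only) page.
    page = next(iter(data["query"]["pages"].values()))
    return page.get("revisions", [])

if __name__ == "__main__":
    for rev in fetch_recent_revisions("Darmstadt"):
        print(rev["timestamp"], rev["user"], rev.get("comment", ""))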

Chances for NLP assistance. While the openness of wikis, their low entry barrier and the lightweight markup are the main reasons for their success, they are, at the same time, the major points of concern for large-scale projects that aim at high-quality content. As wikis and their user base grow, they tend to become unstructured and unorganized (Buffa, 2006). Recent research has suggested using NLP to automatically improve the structure of wikis and transform them into self-organizing content management systems while retaining openness and ease of use and without imposing too many restrictions on the users (Hoffart et al., 2009; Bär et al., 2011). While these efforts are mainly directed towards improving the usability of intranet wikis that are used as knowledge management systems and

11 http://etherpad.org
12 https://docs.google.com
13 https://www.zoho.com/docs
14 Microsoft Office 365 is a subscription-based online software suite that offers, among others, collaborative word processing and spreadsheet capabilities.

thus rather resemble the closed group collaboration paradigm, large-scale, open wikis aimed at collaborative content production, such as Wikipedia, rather demand a quality management strategy that informs the users at all times of any quality problems and demands for improvement while offering an effective communication platform on which the work can be coordinated. NLP can offer improvements in both of these aspects. Using the example of Wikipedia, this thesis will present two approaches to improve information quality management at large scale in open collaborative environments.

2.4 Chapter Summary

In this chapter, we have discussed the foundations of open collaboration. We identified the key differences between closed group and open collaboration and discussed the main challenges involved in the latter, such as motivation, coordination and work allocation. We further discussed universal properties of open collaboration that can be found across all collaborative platforms. In particular, the power law of participation suggests an unequal work distribution across all users, while the motivation of each user to contribute to the collaborative project differs and thus influences the nature of their contribution.

In the second part of the chapter, we turned to collaborative writing and analyzed how it differs from individual writing. We discussed the main collaborative writing strategies and identified the typical writing activities involved in order to inform any decisions that aim at improving the work coordination and quality management in a collaborative environment. We finally addressed open collaborative writing, which adds another level of complexity to the collaborative writing task. We established that it is necessary to reduce the complexity of the task by explicitly constraining the openness with policies and guidelines upon which all participants have to agree. This is not done top-down by a central authority, but by coordination of all users. Therefore, effective means for work coordination and policy making are necessary.

We closed the chapter by reviewing the typical requirements for collaborative writing systems and identified the wiki as a system that satisfies the most important requirements. We furthermore suggested that NLP can help to overcome the inherent problems that come along with the openness of these systems, their low entry barrier and their lack of a central, regulatory authority.


Chapter 3

Wikipedia

“Wikipedia is the best thing ever. Anyone in the world, can write anything they want about any subject. So you know you are getting the best possible information.” — Michael Scott (played by Steve Carell), The Office

This chapter aims at introducing Wikipedia and defining the basic terminology. We first give a general overview of the online encyclopedia and its evolution (section 3.1) and introduce its main structure and organization (section 3.2). We proceed with a discussion of the community around Wikipedia, the understanding of which is vital for any quality-related analysis (section 3.3). We then examine Wikipedia’s versioning system – the revision history (section 3.4) – and its communication hub – the discussion pages (section 3.5). We furthermore discuss technical aspects of automatically processing the large amount of data Wikipedia has to offer (section 3.6). We conclude the chapter by introducing the most important sister projects of Wikipedia (section 3.7) and summarize our findings (section 3.8).

3.1 Overview

As the previous chapter has shown, wikis have proven to be a suitable and well-accepted technology for large-scale online collaboration. The most prominent example of a successful, large-scale wiki is Wikipedia, a free, collaboratively created online encyclopedia which is available in 287 languages and dialects15. Its main website, wikipedia.org, ranks in the top 10 of the most visited pages on the web according to the Alexa web traffic report16. While the term Wikipedia usually refers to this website, it also describes the community behind the encyclopedia that plans, discusses and creates its content.

15According to http://meta.wikimedia.org/wiki/List_of_Wikipedias as of 6 Sept 2013
16Rank 7 according to http://www.alexa.com/siteinfo/wikipedia.org as of 2 Sept 2013


Figure 3.1: Main page of the English Wikipedia http://en.wikipedia.org on 2 Sept 2013

History. The idea of creating a collaborative online encyclopedia goes back to 1993 when Rick Gates first proposed his project Interpedia which, however, never left the planning stage. The idea was picked up again in 2000 by Jimmy Wales, who started Nupedia as a free online encyclopedia. Unlike its successor Wikipedia, Nupedia was not an open collaborative platform, but relied solely on expert content and employed a complex seven-layer review system. This restrictive policy, however, caused Nupedia to attract little attention from both readers and contributors. In an attempt to attract a wider audience, Wales established Wikipedia as a side project and, at the same time, an incubator for Nupedia articles. Founded on January 15, 2001, Wikipedia quickly became popular by word-of-mouth, resulting in 1,000 articles being created within the first year. Nupedia could never step out from under Wikipedia’s shadow and was closed in 2003 with only 24 completed articles (Ayers et al., 2008; Reagle, 2010).

United in diversity. Even though Wikipedia is often referred to as a multilingual encyclopedia, it is more precisely described as an encyclopedia that comes in many interlinked language versions. While the former definition implies a uniformity in the organizational and administrative structures across all languages, the latter definition better captures the individual nature and culture of each language community. Each language version of Wikipedia has its unique governance policies, quality standards and perception of what may and may not be included in the encyclopedia. These regulations have naturally grown over the years in a collaborative attempt to find a common ground within the language community that best serves the needs of both the creators and consumers.


Figure 3.2: The origin of policies in Wikipedia using the example of the English Wikipedia

Despite all differences, there is a set of six founding principles17 which are shared by all Wikimedia projects (see section 3.7 for an overview) and, by extension, all language editions of Wikipedia. These generic principles have been adapted to reflect the peculiarities of an encyclopedia, resulting in the so-called five pillars of Wikipedia18:

1. Wikipedia is an encyclopedia.
2. Wikipedia is written from a neutral point of view.
3. Wikipedia is free content that anyone can edit, use, modify, and distribute.
4. Editors should treat each other with respect and civility.
5. Wikipedia does not have firm rules.

This specific interpretation of the six founding principles is shared by most language editions of Wikipedia with only minor deviations. In particular, the fifth pillar has often been the subject of debate and is not accepted by some language versions, such as the German Wikipedia19. In addition to defining the main characteristics of Wikipedia, most language versions also provide a set of demarcation criteria which define the limits of Wikipedia20. Together, they form the Wikipedia Foundations and represent the basis for all further policies and practical guidelines (see figure 3.2):

– Wikipedia is not a paper encyclopedia.
– Wikipedia is not a dictionary.
– Wikipedia is not a publisher of original thought.
– Wikipedia is not a soapbox or means of promotion.
– Wikipedia is not a mirror or a repository of links, images, or media files.
– Wikipedia is not a blog, Web hosting service, or social networking service.
– Wikipedia is not a directory.
– Wikipedia is not a manual, guidebook, textbook, or scientific journal.

17http://meta.wikimedia.org/wiki/Founding_principles/de
18http://en.wikipedia.org/wiki/WP:5
19http://de.wikipedia.org/w/index.php?oldid=122405554 accessed on 10 Sept 2013
20http://en.wikipedia.org/wiki/WP:NOT


– Wikipedia is not a crystal ball.
– Wikipedia is not a newspaper.
– Wikipedia is not an indiscriminate collection of information.
– Wikipedia is not censored.

Within the confines of the Wikipedia Foundations, all language communities establish their own cultures and regulations. This inevitably results in the development of different philosophies regarding key aspects such as article organization21, notability standards22, article development23 and various other issues24. Even though these phenomena can also be observed within any larger Wikipedia, they are most evident when comparing different language versions. Despite the pursuit of a neutral point of view as one of the founding principles, the different cultural backgrounds of the individual language communities inevitably introduce point of view differences across the Wikipedias. A seemingly neutral and objective article may exhibit a notable cultural bias that describes the subject matter in a more positive or negative light than the same article in a different language version of Wikipedia (Massa and Scrinzi, 2011; Al Khatib et al., 2012). This so-called systemic bias25 is particularly evident in larger Wikipedias with users from different cultural backgrounds, such as the English Wikipedia, because the cultural diversity of the community is a prerequisite for becoming aware of this problem. Less culturally diverse Wikipedias might suffer from the same problem which, however, may remain undetected by the community (also see section 3.3.3).

3.2 Structure and Organization

As a compromise between the low publication threshold of the wiki concept and the structural requirements of a large-scale encyclopedia, Wikipedia takes the middle ground between an unstructured and a semi-structured resource by exhibiting traits of both worlds. While the content in Wikipedia is strongly interconnected with different types of links and redirects, contains structured elements, such as infoboxes, and makes use of a sophisticated category system providing a high degree of ontologization, Wikipedia still relies on a low entrance barrier which makes it possible for new users to contribute without any particular training (Hovy et al., 2013). In the remainder of this section, we will first discuss the macrostructure of Wikipedia, i.e. the inter-page organization, and then proceed with the microstructure, i.e. the intra-page organization.

21Lumpers vs. Splitters: http://en.wikipedia.org/wiki/Lumpers_and_splitters
22Deletionists vs. Inclusionists: http://en.wikipedia.org/wiki/Deletionism_and_inclusionism
23Eventualists vs. Immediatists: Rettberg (2005)
24http://meta.wikimedia.org/wiki/Conflicting_Wikipedia_philosophies
25http://en.wikipedia.org/wiki/Wikipedia:Systemic_bias


Namespace       English            French            German            Spanish           Russian
Main       A    10,454,480 (58%)   2,708,931 (47%)   2,719,812 (41%)   2,491,636 (59%)   2,110,411 (50%)
           T    5,213,744 (11%)    1,242,508 (10%)   573,230           232,007 (1%)      412,143
User       A    1,740,863 (7%)     201,732 (4%)      364,470 (4%)      140,336 (4%)      91,181 (4%)
           T    9,057,248          1,216,634         379,595 (1%)      1,125,806         295,688
Wikipedia  A    793,364 (13%)      36,257 (23%)      40,303 (11%)      24,759 (9%)       33,801 (9%)
           T    210,651 (25%)      5,517 (26%)       11,187 (14%)      2,009 (13%)       2,006 (4%)
File       A    843,156            41,654            172,730           0a                155,219
           T    175,705            1,582             2,173             0a                958
MediaWiki  A    1,895 (1%)         1,027             1,786             1,508             988
           T    1,081 (11%)        139 (14%)         230 (12%)         94 (2%)           144 (3%)
Template   A    524,227 (19%)      162,243 (7%)      54,526 (2%)       19,573 (19%)      100,493 (14%)
           T    204,315 (12%)      6,230 (7%)        3,865 (5%)        1,132 (2%)        5,581 (2%)
Help       A    1,327 (52%)        931 (45%)         854 (71%)         240 (45%)         25 (72%)
           T    629 (24%)          391 (26%)         375 (22%)         77 (9%)           5
Category   A    1,053,314 (2%)     246,800           185,706           215,871           246,182
           T    676,963            14,962            6,621             3,008             4,553
Portal     A    120,635 (8%)       49,445 (10%)      16,518 (7%)       12,465 (9%)       19,219 (3%)
           T    30,069 (5%)        3,339 (41%)       4,566 (16%)       511 (6%)          722 (1%)
Other      A    5,007 (10%)        121 (2%)          69                51                198
           T    4,705 (8%)         27 (4%)           26 (92%)          16 (6%)           6
Total           31,113,117 (23%)   5,940,443 (24%)   4,538,616 (25%)   4,271,083 (35%)   3,479,517 (31%)

a No pages in this namespace. This Wikipedia exclusively embeds media information from Wikimedia Commons.

Table 3.1: Number of pages per namespace for the five largest Wikipedias as of 4 Sept 2013. The percentage of redirects is provided in parentheses, if applicable. A denotes article namespaces, T the corresponding talk namespaces.

3.2.1 Namespaces and Naming Conventions

While the best known artifacts in Wikipedia are the articles, which contain the encyclopedic content, a substantial fraction of all pages serve administrative and communicative purposes. Wikipedia is organized in so-called namespaces, a system of thematic layers which group pages according to their main purpose. In addition to the main encyclopedic layer, there are, for example, namespaces that hold administrative pages, help pages, user pages and descriptions of media assets. Each of these namespaces has an associated Talk namespace, which holds discussion pages related to the corresponding content pages (see section 3.5). Overall, eight default subject namespaces are predefined by the MediaWiki software26:

Main: Contains encyclopedic articles, lists, disambiguation pages and redirects.
User: Contains user pages and sub-pages created by individual users. This namespace is often used as an incubator for new content in the main namespace.

26http://www.mediawiki.org/wiki/Manual:Namespace


Project: The project namespace contains pages about the wiki-project itself. In the case of Wikipedia, this project namespace is also named Wikipedia. It contains policy pages, best practices, workflows and essays about the work in Wikipedia.
File: Contains pages with descriptions of media items including links to these media items. The actual media items are hosted on the Wikimedia Commons platform27. Pages in this namespace are only used to override the original media descriptions on Wikimedia Commons.
MediaWiki: Contains internal content provided by the MediaWiki installation, such as standard system messages.
Template: Contains template pages that can be inserted into other pages (see section 3.2.4).
Help: Contains help pages for passive and active users of Wikipedia.
Category: Contains pages for every category which list the members of this category along with an optional category description.

In addition to these default namespaces, Wikipedia defines custom namespaces that hold pages for Wikipedia-specific features such as thematic portals or Wikipedia book projects. These namespaces might vary across the different language versions. Table 3.1 shows statistics of page numbers per namespace for the five largest Wikipedias.

Like the entries of most traditional encyclopedias, Wikipedia articles correspond to single concepts. The article naming conventions28 hereby ensure that the titles are recognizable, natural, precise, concise and consistent with titles of similar articles. Since natural language tends to be ambiguous, it is necessary to handle polysemous page titles, i.e. titles that may refer to multiple concepts. In Wikipedia, this is achieved by means of natural disambiguation, comma-separated disambiguation or, in most cases, parenthetical disambiguation.

Natural Disambiguation: An alternative, non-ambiguous title is used that also meets the naming conventions. This is the preferred disambiguation form. Example: Instead of English, use English language or English people
Comma-separated Disambiguation: The disambiguation term is added as a comma-separated suffix to the title, if it stands in a hierarchical relationship to the main concept. It is most commonly used with geographic names. Example: Lincoln, Nebraska; Lincoln, England; Lincoln, New Hampshire
Parenthetical Disambiguation: The disambiguation term is added as a parenthetical suffix to the title. This is the most common disambiguation form. Example: Apple (fruit); Apple (computer); Apple (album)

In order to establish a link between disambiguated terms and their polysemous lemma, the latter is used as the title of a disambiguation page. A disambiguation page lists all senses of a

27https://commons.wikimedia.org
28http://en.wikipedia.org/wiki/WP:NAMINGCRITERIA

polysemous term which are represented in Wikipedia and links to their respective articles. These pages are internally flagged as disambiguation pages, since they do not count as content pages. Another type of page without content is the redirect. Redirect pages are used to represent multiple variants of page titles that refer to the same concept. This way, synonymous terms, writing variants (e.g. British English vs. American English) and different word forms (e.g. verb inflections) can be mapped to a single page. If a concept in Wikipedia can be described by several terms that equally qualify as page titles, the most salient term is chosen as the title of the content page for this concept, while the other possible terms are used as titles for redirect pages. Redirect pages simply forward to the corresponding content page and are marked as such in the database. Table 3.1 lists the percentage of redirects in each namespace in parentheses.

3.2.2 Organizational Structures

In order to facilitate the navigation through the encyclopedia beyond the full text search, Wikipedia provides several organizational structures such as categories, lists and portals. As we will show later, the proper use of these structures has a great impact on the overall quality of the encyclopedia, since not only the quality of the content is important, but also its organization.

Category System. The Wikipedia category system is the most comprehensive organizational structure in Wikipedia, comprising over a million categories in the English Wikipedia as of 4 September 2013, with four different category types:

Administrative categories: Indicate the maintenance status of articles. Example: “Articles needing cleanup”
Container categories: Group other categories, but do not directly apply to articles. Example: “People categories by parameter”
Set categories: Represent lists of articles. Example: “Cities in Germany”
Topic categories: Group articles related to a particular topic. Example: “History of Germany”

Even though the category system is organized hierarchically, it is rather a thematic classification used for tagging wiki pages than a well-defined taxonomy for document categorization (Nagata et al., 2010; Syed and Finin, 2010). The hierarchical relationships between the categories form a large graph structure, the Wikipedia Category Graph (WCG), which was found to be a scale-free, small-world graph similar to lexico-semantic networks such as WordNet (Zesch and Gurevych, 2007).


Lists. Besides the extensive category system, Wikipedia offers other means of organizing articles. Lists are pages that link thematically related articles. In contrast to the set category, which also serves the purpose of grouping related articles, lists can also include links to non-existing articles (so-called redlinks) as reminders that these articles still have to be written. Lists are created manually and are subject to similar quality standards as articles.29 Like categories, lists are organized hierarchically on up to three levels (see Lists of lists of lists30). Since these pages are maintained by hand, they are prone to inconsistency and incompleteness. In addition to generic lists, Wikipedia offers several other devices for content organization which resemble lists to a large extent. These include glossaries for listing term definitions, outlines and overviews for hierarchically organizing links to the main articles on particular topics, and indexes, which provide alphabetical, automatically created overviews of all articles in particular subject areas.

Portals. In contrast to categories and lists, portals do not merely provide links to articles; rather, they serve as an entry point to the main topics in Wikipedia. They provide the reader with excerpts of well-selected articles from each subject area and even include links to related external resources or news items related to the subject. Portals are often associated with and maintained by a WikiProject (see section 3.3), which manages and coordinates the article development in a specific subject area. The Wikipedia main page (see figure 3.1) can be regarded as a special portal for the purpose of bringing distinguished or particularly noteworthy content to the attention of the reader.

3.2.3 Inner Article Structure

While no particular article layout is enforced by the wiki software, the Wikipedia guidelines demand a basic common article structure (Ayers et al., 2008). Figure 3.3 shows an overview of an example article and its building blocks. The body of the article is usually segmented into titled sections. The titles of these sections are used by the wiki software to automatically create a table of contents for the article. The introductory or lead paragraphs sum up the content of the article in a short, self-contained text and give an overview of the scope of the article. The optional infobox on the top of the page is a fixed-format table which summarizes basic facts about the article subject or provides links to related articles. These infoboxes are one of the few sources of structured information in Wikipedia. Throughout the article text, links to other wiki pages, so-called wikilinks, are provided in order to interconnect related pages and to provide definitions for concepts that are not explained in the article itself.

29The criteria for distinguished lists can be found under http://en.wikipedia.org/wiki/WP:FLCR. Similar criteria exist for articles, as will be discussed in detail in chapter 4.
30http://en.wikipedia.org/wiki/List_of_lists_of_lists


(a) Article head (b) Article bottom

Figure 3.3: Structure of an article: a) article title with parenthetical disambiguation, b) hatnote, c) warning messages, d) introductory/lead paragraph, e) table of contents, f) first section with section title, g) infobox, h) links to other articles, i) bibliography, k) links to external resources, m) category memberships. Source of example: http://en.wikipedia.org/wiki/index.php?oldid=584601250

Links colored in red (redlinks) point to articles that do not yet exist and serve as a reminder to create these articles. On the top of the page, so-called hatnotes display important information such as disambiguation terms or redirects. Optional warning messages on the top of the page furthermore inform the reader of existing problems with the article or ongoing debates. They are produced by templates, which are discussed in the following section. The bottom of each page displays footnotes and references and provides the bibliography for the article. Furthermore, it contains links to external resources, since these are not allowed to be placed directly in the article text. Every article closes with a list of all categories of which the article is a member. Outside of the main article frame, in the left navigation bar, interwiki links are listed, which link to corresponding articles in other language versions of Wikipedia. As of now, these links have to be maintained separately in each article of every language version. However, the Wikidata project currently attempts to develop a centralized system for maintaining these interwiki connections (see section 3.7). Articles are written and formatted in wiki markup, a lightweight markup language that is designed to hide the complexity of richer markup languages, such as (X)HTML, from the user. This way, using wiki markup is intended to be as simple as using natural language. However, with the rising demand for complex article layouts and integrated functions for automation purposes, the simplicity of the wiki markup language could no longer be maintained. While the core of the markup language is the same as in the early days, countless additions to the instruction set have made it difficult to use for new users and even

impossible to parse reliably for computers.31 The latter issue, the irregularity and ambiguities of the wiki markup language, is the reason why no WYSIWYG editor has been developed to date that is fit to serve the purposes of the Wikipedia community. In an ongoing large-scale project, the Wikimedia Foundation develops the Visual Editor32, which is supposed to be able to reliably parse and generate wiki markup and thus enable WYSIWYG editing of wiki pages. This is believed to further lower the entry barrier for new Wikipedia contributors and to reduce the time necessary to edit an article.

3.2.4 Template System

In principle, templates are small wiki pages that can be embedded in another page in order to centralize repetitive content. A common use case for templates is to embed info banners, system messages, warnings or navigational boxes into articles or other wiki pages. Depending on the particular template, the embedded content is either transcluded (i.e., inserted into the page at runtime but not in the source code) or substituted (i.e., inserted directly in the source code). The content of templates is usually not static, but it can be controlled with a set of parameters passed to the template in the wiki markup. For example, the structure of an infobox can be centrally defined in a template while the actual content is defined in the embedding articles. This way, the information provided by a certain type of infobox is uniform across all articles using it. The Wikidata project (see section 3.7) further attempts to centralize the data for infoboxes across different language versions of the same articles in order to improve the overall consistency of the encyclopedia. While parameters are often used to inject messages or data into the template, like in the case of infoboxes, the range of possibilities is much wider. Via the MediaWiki extension Scribunto, it is possible to include scripts written in the Lua33 programming language directly in the source code of the template pages. These scripts can be controlled with the parameters that are passed to the templates and produce the content that is finally displayed on the embedding page. Besides including recurring content or embedding output of Lua scripts, templates are frequently used as a tagging system. A prominent example for this usage are cleanup templates, which aim at identifying articles with particular deficiencies. Whenever a cleanup template is embedded on a wiki page, it both displays a message on the page and adds the page to the corresponding cleanup category. This way, it is easy to keep track of the problems marked by the templates via the category system while the problems are at the same time communicated to the readers via the embedded message. In chapter 5, we use these cleanup templates as human assigned labels to create corpora of quality flaws in Wikipedia.
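To illustrate how cleanup templates can be exploited as machine-readable quality labels, the following minimal sketch scans the wiki markup of a page for a small set of cleanup template names and reports which flaw markers it carries. It is not the corpus creation approach of chapter 5; the selected template names, the class name and the simplified regular expression (which ignores nested templates) are illustrative assumptions.

import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Scans raw wiki markup for an illustrative subset of cleanup templates.
public class CleanupTemplateScanner {

    // Illustrative subset of neutrality and style related cleanup templates.
    private static final List<String> CLEANUP_TEMPLATES =
            Arrays.asList("POV", "Advert", "Peacock", "Weasel", "Tone");

    // Matches {{TemplateName}} or {{TemplateName|param=...}}, ignoring case.
    private static final Pattern TEMPLATE_PATTERN = Pattern.compile(
            "\\{\\{\\s*(" + String.join("|", CLEANUP_TEMPLATES) + ")\\s*(\\|[^}]*)?\\}\\}",
            Pattern.CASE_INSENSITIVE);

    public static Set<String> findCleanupTemplates(String wikiMarkup) {
        Set<String> found = new TreeSet<>();
        Matcher m = TEMPLATE_PATTERN.matcher(wikiMarkup);
        while (m.find()) {
            found.add(m.group(1).toLowerCase());
        }
        return found;
    }

    public static void main(String[] args) {
        String markup = "{{POV|date=May 2013}}\n'''Example''' is an article ...\n{{Advert}}";
        System.out.println(findCleanupTemplates(markup)); // prints [advert, pov]
    }
}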

31http://www.mediawiki.org/wiki/Markup_spec
32http://en.wikipedia.org/wiki/WP:VE
33http://www.lua.org


3.3 Community

Even though we mainly regard Wikipedia as a collaboratively created resource, it is not only an encyclopedia but also a community. In order to understand how the resource is constructed and how it evolves, one has to obtain a basic understanding of the community behind the encyclopedia.

3.3.1 User Groups and Roles

While the five pillars of Wikipedia state that anyone can edit the encyclopedia, different user groups define who is or is not allowed to perform particular actions in Wikipedia and furthermore manage the responsibilities and competences within the self-administration of the community. The following list describes the major user status groups in Wikipedia34:

Unregistered User: Any user who is not registered or not logged in is identified with their IP address. Depending on the protection status of certain pages, unregistered users might not be able to perform simple edits.
Registered User: Any logged-in user who is registered with a valid email address. Most non-privileged actions are available for this status group.
Reviewer: Trusted user who is allowed to review edits of other users made to articles under the pending changes protection or with flagged revisions (see section 3.4).
Administrator: Users with increased edit privileges who can delete and restore pages, block users, protect pages, manage some user groups and perform other maintenance functions. The administrator role is local to a certain Wikipedia language version.
Bureaucrat: Users with privileges necessary for user right management such as user group assignment or change of user names.
Ombudsman: A small group of users with increased access rights who investigate violations of privacy policies across Wikimedia projects.
Steward: A small group of users with complete access to all Wikimedia wikis, including the ability to change user rights and groups.
Developer: Users with the highest degree of technical access, since they are able to directly make changes to the MediaWiki software and the underlying data.

Even though anonymity is an important aspect for many members of the Wikipedia community, accountability still has to be ensured. Therefore, editing anonymously as an unregistered user is frowned upon, while registered users usually use pseudonyms which protect

34A full list of additional user groups along with a detailed description of the respective access rights can be found under http://en.wikipedia.org/wiki/Wikipedia:User_access_levels for the English Wikipedia and under https://meta.wikimedia.org/wiki/User_groups for general Wikimedia projects.

their privacy but still ensure accountability of their actions (Ayers et al., 2008). Furthermore, it is expected that every registered individual only participates with a single account. Multiple accounts owned by a single user are called sock puppets, which might be used to create the illusion of greater support or rejection of a particular issue and hence influence the collaborative decision making process towards the goals of the user. Automatically detecting sock puppets can be regarded as an instance of authorship attribution and can contribute to the quality management process (Solorio et al., 2013).

3.3.2 Soft Security

Even though the available user groups provide some security by limiting general access to non-registered users with technical means, the open collaboration principle still allows most changes to be made by any registered user. Rather than imposing rigid restrictions on users willing to contribute to the project, Wikipedia follows the approach of soft security. By assuming good faith in every (anonymous) participant and relying on the many eyes principle, it is assumed that high quality output can be reached and any damages made to the resource in the course of the collaborative process can be kept within tolerable limits. Furthermore, with a transparent and open administrative system, everyone can be included in the decisions that govern the collaborative process (Ayers et al., 2008; Reagle, 2010). Even though this approach has been very successful for Wikipedia so far, some Wikipedias have now reached a size where peer review and the many eyes principle alone can no longer assure the integrity and quality of the whole resource without any technological assistance, as the remainder of this thesis will show.

3.3.3 Systemic Bias

Since Wikipedia articles are the collaborative product of the Wikipedia community, they naturally represent the views and values of this community. The principle of open participation and the international availability of Wikipedia suggest that the community is a balanced representation of the world’s population. However, studies have shown that this is not the case (Lam et al., 2011; Glott et al., 2010; Hill and Shaw, 2013). The majority of Wikipedia authors and active community members are white males from western countries, resulting in a gender gap and a western-centric world view. This systemic bias is in conflict with Wikipedia’s policy of the neutral point of view, but is one of the hardest issues to resolve.


3.3.4 WikiProjects

WikiProjects35 are communities of interest aimed at providing a forum for contributors who work together on a particular subject area or provide a particular service for the community such as article maintenance. As of October 2013, there exist over 2,200 WikiProjects in the English Wikipedia36, which are centrally registered in the WikiProject Directory37. While each WikiProject has to obey the general Wikipedia guidelines, larger groups establish their own policies, quality management workflows and quality standards. Since WikiProjects and their members are considered to be most knowledgeable about their particular subject area, they are responsible for providing quality feedback for the articles in their field and for deciding about inclusion and exclusion of particular topics. Even though the quality standards might differ across WikiProjects, they all make use of the same labels indicating the same quality levels. This way, it is possible to aggregate the quality information centrally to gain a general overview of the quality of Wikipedia at a given point in time (see chapter 4.3).

3.4 Revision History

A key characteristic of Wikipedia is its revision history which keeps track of all changes that have ever been made to any wiki page. Every time a page is edited by a Wikipedia user, a new version of this page is created. We call each individual version of a Wikipedia page a revision, denoted as r_v, where v is a number between 0 and n, r_0 is the first and r_n the newest version of the page. In addition to the full text of the page with markup, the wiki system stores for each revision additional metadata, such as the user who changed the page and thus created the new revision, the creation time of the revision, an optional commit comment and a flag indicating whether minor or major changes have been made to the page (also see section 3.6 for a detailed overview of all information stored in the database). One of the main goals of this versioning system is to provide the possibility to revert the changes made to a page in one or more revisions and thus return to an earlier content state of the page. Each revert will again result in a new revision of the reverted page. Since every single revision is self-contained and the text of the page is stored in full for each revision, the content of the revision history is highly redundant. This drastically increases the amount of space necessary to store the data and results in large data dumps (see section 3.6). In addition to revisions, we define a diff to be the set of all changes between two revisions, while each individual, atomic change is called an edit. A single diff can therefore comprise one or multiple edits. The MediaWiki software allows displaying diffs on

35http://en.wikipedia.org/wiki/WP:PROJ
36http://en.wikipedia.org/wiki/index.php?oldid=576344994
37http://en.wikipedia.org/wiki/WP:PROJDIR


(a) Excerpt of a revision history

(b) Diff between two revisions displayed on a DiffPage in the MediaWiki software

Figure 3.4: Revisions of the article “Natural Language Processing” in the English Wikipedia, accessed on 5 Jan 2014

DiffPages, highlighting all changes identified in a line-by-line comparison of two revisions. Figure 3.4 shows an excerpt of the revision history as it is presented in the MediaWiki software along with a DiffPage comparing two adjacent revisions of an article. Privileged users can limit page changes to users of particular status groups; for instance, editing can be restricted to registered users or users with particular privileges, such as administrators. These restrictions can either prevent particular types of edits, such as renaming a page, or can apply to any kinds of changes. A detailed description of the available protection levels can be found in the Wikipedia Protection Policies38. In rare cases, for instance if confidential information has been provided on a page that might violate the privacy of an individual, particular revisions or the revision history of whole pages can be deleted by privileged users in order to prevent any access by the public.
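To make the notion of diffs and edits more concrete, the following minimal sketch computes a line-based diff between two revision texts, in the spirit of the line-by-line comparison shown on DiffPages. It is an illustrative longest-common-subsequence implementation; it is neither the algorithm used by MediaWiki nor part of the tools described later in this thesis, and the class name and example revisions are made up.

import java.util.ArrayList;
import java.util.List;

// Naive line-based diff between two revisions r_{v-1} and r_v.
public class RevisionDiff {

    public static List<String> diff(String oldText, String newText) {
        String[] a = oldText.split("\n");
        String[] b = newText.split("\n");

        // Longest common subsequence table over lines, filled from the end.
        int[][] lcs = new int[a.length + 1][b.length + 1];
        for (int i = a.length - 1; i >= 0; i--) {
            for (int j = b.length - 1; j >= 0; j--) {
                lcs[i][j] = a[i].equals(b[j]) ? lcs[i + 1][j + 1] + 1
                                              : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }

        // Backtrack: unchanged lines are skipped, deleted and inserted lines are reported as edits.
        List<String> edits = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i].equals(b[j])) { i++; j++; }
            else if (lcs[i + 1][j] >= lcs[i][j + 1]) edits.add("- " + a[i++]);
            else edits.add("+ " + b[j++]);
        }
        while (i < a.length) edits.add("- " + a[i++]);
        while (j < b.length) edits.add("+ " + b[j++]);
        return edits;
    }

    public static void main(String[] args) {
        String r0 = "NLP is a field of computer science.\nIt deals with text.";
        String r1 = "NLP is a field of computer science and linguistics.\nIt deals with text.";
        diff(r0, r1).forEach(System.out::println);
    }
}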

38http://en.wikipedia.org/wiki/WP:PP


There have been efforts to introduce quality control mechanisms on the revision level in the form of so-called flagged revisions39, which allow experienced users to moderate the edit activities of new users. This editorial review was intended to minimize the risk of vandalism and improve the accuracy and overall quality of the articles by having experienced Wikipedia authors approve revisions before they go public. While this approach was accepted by the community of the German Wikipedia very early and is being used successfully as part of the information quality management process, other Wikipedia communities display mixed sentiment regarding the system. The English Wikipedia now uses a modified version of the flagged revisions called pending changes40, which currently only requires edits of unregistered and newly registered users to be reviewed. While the benefit of maintaining a revision history is obvious for the users of Wikipedia, it also serves as an invaluable resource for natural language processing applications. We provide a detailed list of related work based on the Wikipedia revision history in Ferschke et al. (2013).

3.5 User Discussions

As we have discussed in chapter 2, authors of collaboratively written texts have to externalize processes that remain hidden in individual writing, such as the planning and organization of the text. Work coordination is particularly important in open collaboration, since explicit workflows which regulate the writing process do not exist and individual users might have different goals regarding the further development of an article. In Wikipedia, the main platform for work coordination and user communication are the Talk pages. Technically speaking, a Talk page is a normal wiki page located in one of the Talk namespaces (see table 3.1 and table 3.2). Similar to a web forum, Talk pages are divided into discussions (or topics) and contributions (or turns). What distinguishes wiki discussions from a regular web forum, however, is the lack of a fixed, rigid thread structure. There are no dedicated formatting devices for structuring the Talk pages besides the regular wiki markup. The basic structure of an article Talk page can be seen in the example shown in figure 3.5. Each Talk page is implicitly connected to a content page by its page name, e.g. the Talk page Talk:Germany corresponds to the article Germany. It is, however, not possible to establish explicit connections between individual discussions on the page and the section of the article that is being discussed. Each namespace in Wikipedia has a corresponding Talk

39http://en.wikipedia.org/wiki/WP:FLR
40http://en.wikipedia.org/wiki/WP:PC


Figure 3.5: Structure of a Talk page: a) Talk page title, b) untitled discussion topic, c) titled discussion topic, d) topic title, e) unsigned turns, f) signed turns. Source of example: http://simple.wikipedia.org/w/index.php?oldid=4633184. Visualization first appeared in Ferschke et al. (2012a)

namespace, resulting in a total of ten41 major types of Talk pages in the English Wikipedia (table 3.2), which can be categorized into four functional classes:

Article Talk pages are mainly used for the coordination and planning of articles.
User Talk pages are used as the main communication channel and social networking platform for the Wikipedians.
Meta Talk pages serve as a platform for policy making and technical support.
Item-specific Talk pages are dedicated to the discussion of individual media items (e.g. pictures) or structural devices (e.g. categories and templates).

The users are asked to structure their contributions using paragraphs and indentation. One turn may consist of one or more paragraphs, but no paragraph may span over several turns. Turns that reply to another contribution are supposed to be indented to simulate a thread structure. We call this soft threading as opposed to explicit threading in web forums. Users are furthermore encouraged to append signatures to their contributions to indicate the end of a turn (see figure 3.6). There are extensive policies42 that govern the usage and format of signatures. They usually should contain the username of the author and the time and date of the contribution.

41While there are additional special purpose namespaces, which have been subsumed under Other in table 3.1, we only list the Book namespace here, since it is the only one with significant discussion activity. Other minor namespaces in the English Wikipedia include Education Program, TimedText and Module.
42http://en.wikipedia.org/wiki/WP:SIGNATURE


Content namespaces   Talk namespaces   Functional class
Main                 Talk              Article
User                 User talk         User
Wikipedia            Wikipedia talk    Meta
MediaWiki            MediaWiki talk    Meta
Help                 Help talk         Meta
File                 File talk         Item
Template             Template talk     Item
Category             Category talk     Item
Portal               Portal talk       Item
Book                 Book talk         Item

Table 3.2: Wikipedia namespaces and functional Talk page classes

However, users’ signatures do not adhere to a uniform format, which makes reliable parsing of user signatures a complex task. Moreover, less than 70% of all users explicitly sign their posts (Viégas et al., 2007). In some cases, depending on the setup of an individual Talk page, automatic scripts, so-called “bots”, take over whenever an unsigned comment is posted to a Talk page and add the missing signature (see figure 3.6, signature 3.5). While this is helpful for signature-based discourse segmentation, which relies on the presence of a user signature to identify turn boundaries, it is misleading when it comes to author identification, where the actual content of the signature is important.

Due to the lack of discussion-specific markup, contribution boundaries are not always clear-cut. They may even change over time, for instance if users insert their own comments into existing contributions of other users, which results in non-linear discussions. This makes automatic segmentation of Talk pages a challenging task and demands a substantial amount of preprocessing. We will again refer to this phenomenon under the term in-text replies when discussing our own approach for dialog segmentation in chapter 6.4.1.

There are ongoing attempts to improve the usability of the discussion spaces with extensions for explicit threading43 and visual editing44. However, these enhancements have been tested only in selected, small Wikimedia projects and have not yet been deployed to the larger wikis.

In order to prevent individual Talk pages from becoming too long and disorganized, individual discussions can be moved to a discussion archive45. A general policy states that Talk pages with more than ten discussion topics or a size of more than 75 kilobytes should be archived. However, the requirements differ depending on the discussion activity of a given Talk page. Discussion archives are marked with an “Archive” suffix and usually numbered consecutively.

43http://www.mediawiki.org/wiki/Extension:LiquidThreads
44http://en.wikipedia.org/wiki/WP:VE
45http://en.wikipedia.org/wiki/WP:ARCHIVE


Figure 3.6: Examples for user signatures on Talk pages: (3.1) Standard signature with username, link to user Talk page and timestamp (3.2) Signature of an anonymous user (3.3) Simple signature without timestamp (3.4, 3.5) Bot-generated signatures (3.6, 3.7) Signatures using colors and special Unicode characters as design elements

The Rambling Man (talk) 18:20, 27 February 2012 (UTC) (3.1)

–66.53.136.85 21:41, 2004 Aug 3 (UTC) (3.2)
– Taku (3.3)
– Preceding unsigned comment added by 121.54.2.122 (talk) 05:33, 10 February 2012 (UTC) (3.4)
–SineBot (talk) 08:43, 31 August 2009 (UTC) (3.5)
Imzadi 1979 > 09:20, 20 May 2011 (UTC) (3.6)
– 14:14, 17 December 2010 (UTC) (3.7)
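The heterogeneity of the signature formats shown above is the main obstacle for signature-based turn segmentation. As a minimal sketch, the following code matches only the standard signature format with a user link and a UTC timestamp (cf. example 3.1) on raw wiki markup; anonymous, bot-generated and freely styled signatures would require additional, more tolerant patterns. The class name, the example turn and the exact regular expression are illustrative assumptions, not the segmentation approach developed in chapter 6.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Detects a standard user signature ([[User:...]] plus UTC timestamp) in wiki markup.
public class SignatureMatcher {

    private static final Pattern SIGNATURE = Pattern.compile(
            "\\[\\[User(?: talk)?:([^\\]|]+)(?:\\|[^\\]]*)?\\]\\].*?"
            + "(\\d{2}:\\d{2}, \\d{1,2} \\w+ \\d{4} \\(UTC\\))");

    public static void main(String[] args) {
        String turn = "I agree with the proposal. "
                + "[[User:The Rambling Man|The Rambling Man]] ([[User talk:The Rambling Man|talk]]) "
                + "18:20, 27 February 2012 (UTC)";
        Matcher m = SIGNATURE.matcher(turn);
        if (m.find()) {
            System.out.println("author: " + m.group(1) + ", signed at: " + m.group(2));
        }
    }
}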

The oldest discussion archive page for the article “Germany”, for example, is named Talk:Germany/Archive_1. There are two possible procedures for archiving a Talk page: the cut-and-paste procedure and the move procedure. While it is not possible to determine directly which method has been used to create an archive, the choice has important implications for processing these pages. The cut-and-paste procedure copies the text from an existing Talk page to a newly created archive page. All revisions of this Talk page remain in the revision history of the original page. The move procedure renames (i.e., moves) an existing Talk page and adds the numbered archive suffix to its page title. Afterwards, a new Talk page is created that is then used as the new active Talk space. Archives created with the latter procedure maintain their own revision history, which simplifies the revision-based processing of these pages. Furthermore, in addition to automatic archiving, topic-specific sub-pages might be created for particularly focused discussions, e.g. the discussion of a request for deletion (RFD) or the review process involved when an article is nominated for promotion to featured or good article status (see chapter 4.3). Although there is no discussion-specific markup to structure Talk pages, templates (see section 3.2.4) can be used to better organize the discussions. A specific subset of templates is used as a tagset for labeling articles and Talk pages. For example, by adding the template {{controversial}} to a Talk page, an information banner is placed in the lead section of the Talk page and the associated article is tagged as controversial. A complete overview of Talk space specific templates can be found on the corresponding Wikipedia policy pages46.

46http://en.wikipedia.org/wiki/WP:TTALK

The cleanup and flaw markers are especially helpful criteria for filtering articles and Talk pages for corpus creation or further analysis. In chapter 6, we focus on article Talk pages and discuss how they are employed for work coordination and how we can exploit them as a resource for computational linguistics. While the different kinds of Talk pages are the main communication platform on Wikipedia, some aspects are discussed outside of the confines of the MediaWiki software and communicated via mailing lists47, Internet Relay Chat (IRC) channels48 or real-life meetings49.

3.6 Processing Wikipedia

Wikipedia runs on the MediaWiki wiki software50, which offers multiple possibilities for accessing its contents. Depending on the requirements of the application at hand, be it structured access to particular pieces of information or large-scale processing of the whole data contained in Wikipedia, several options are available, which we discuss in this section. First, we describe the different available sources for Wikipedia data, then we discuss APIs, software libraries and services which can be used to access these data sources. A more detailed description of the software developed in the course of this thesis can be found in appendix A.

3.6.1 Data Sources

The MediaWiki software stores all content and meta information in an SQL database with over fifty tables51. Except for sensitive user information and some other privileged artifacts, this database can be fully accessed in various ways, as the next subsection will show. For many purposes, however, offline images of the data are needed, which represent a fixed snapshot of the whole resource and which can be processed locally. In order to provide the highest degree of interoperability, XML dumps of the MySQL databases are provided for download52. Partial dumps can furthermore be created manually via the export function of the MediaWiki software53. The information included in these dumps is described in the document type definition of the XML format, which is listed in figure 3.7.

47https://lists.wikimedia.org
48http://en.wikipedia.org/wiki/WP:CHAT
49Regular face-to-face meetings in smaller interest groups (http://en.wikipedia.org/wiki/WP:MEET) are organized, as well as larger, non-academic conventions for users of Wikimedia projects, such as Wikimania (http://wikimania2014.wikimedia.org) or WikiCon (http://de.wikipedia.org/wiki/WP:CON).
50http://www.mediawiki.org
51http://www.mediawiki.org/wiki/Manual:Database_layout
52http://dumps.wikimedia.org
53http://en.wikipedia.org/wiki/Special:Export


<!ELEMENT mediawiki (siteinfo,page*)>
<!ATTLIST mediawiki
  version CDATA #REQUIRED
  xmlns CDATA #FIXED "http://www.mediawiki.org/xml/export-0.3/"
  xmlns:xsi CDATA #FIXED "http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation CDATA #FIXED "http://www.mediawiki.org/xml/export-0.3.xsd"
  xml:lang CDATA #IMPLIED
>
<!ELEMENT siteinfo (sitename,base,generator,case,namespaces)>
<!ELEMENT sitename (#PCDATA)>
<!ELEMENT base (#PCDATA)>            <!-- url of the main page -->
<!ELEMENT generator (#PCDATA)>
<!ELEMENT case (#PCDATA)>            <!-- how cases in page names are handled -->
<!ELEMENT namespaces (namespace+)>
<!ELEMENT namespace (#PCDATA)>
<!ATTLIST namespace key CDATA #REQUIRED>
<!ELEMENT page (title,id?,restrictions?,(revision|upload)*)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT id (#PCDATA)>              <!-- unique id of page -->
<!ELEMENT restrictions (#PCDATA)>
<!ELEMENT revision (id?,timestamp,contributor,minor?,comment?,text)>
<!ELEMENT timestamp (#PCDATA)>
<!ELEMENT minor EMPTY>               <!-- minor revision flag -->
<!ELEMENT comment (#PCDATA)>
<!ELEMENT text (#PCDATA)>            <!-- wiki markup -->
<!ATTLIST text xml:space CDATA #FIXED "preserve">
<!ELEMENT contributor ((username,id) | ip)>
<!ELEMENT username (#PCDATA)>
<!ELEMENT ip (#PCDATA)>              <!-- ip of contributor -->
<!ELEMENT upload (timestamp,contributor,comment?,filename,src,size)>
<!ELEMENT filename (#PCDATA)>
<!ELEMENT src (#PCDATA)>             <!-- location of uploaded file -->
<!ELEMENT size (#PCDATA)>            <!-- size of uploaded file -->

Figure 3.7: Document Type Definition (DTD) for the MediaWiki export format describing the content of a Wikipedia dump. Adapted from http://meta.wikimedia.org/wiki/Help:Export, accessed on 20 Dec 2013.

Dumps of large Wikipedia language versions including page revisions are very large in size, since every revision is self-contained and contains the full page text (see section 3.4). As of 2013, the decompressed XML dump for the English Wikipedia exceeds eight terabytes. In order to handle amounts of data of this size, the dumps are split into several, independent XML files and individually compressed. While the XML dumps contain all page contents of Wikipedia, more volatile information, such as page view statistics, user group assignments and page ratings, can be downloaded as additional SQL files.
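To make the dump format concrete, the following minimal sketch streams over such an XML dump with the standard Java StAX API and prints the title and the number of revisions of every page. It assumes an uncompressed dump file whose name is a placeholder, relies only on the elements defined in the DTD in figure 3.7, and is not one of the tools discussed in the next subsection.

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Streams a MediaWiki XML dump and counts the revisions per page.
public class DumpReader {

    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(
                new FileInputStream("enwiki-pages-meta-history.xml")); // placeholder file name

        String currentTitle = null;
        int revisionCount = 0;

        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                String name = reader.getLocalName();
                if (name.equals("title")) {
                    currentTitle = reader.getElementText(); // text content of the <title> element
                } else if (name.equals("revision")) {
                    revisionCount++;
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && reader.getLocalName().equals("page")) {
                System.out.println(currentTitle + "\t" + revisionCount + " revisions");
                revisionCount = 0;
            }
        }
        reader.close();
    }
}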


Name              Functionality                         API           License
MediaWiki API     access all publicly available data    web service   –
Wikimedia Labs    access all publicly available data    database      –
JWPL              access articles and Talk pages        Java          LGPL
WRT               access page revisions                 Java          LGPL
Wikipedia Miner   access articles                       Java          GPL
WikiXRay          quantitative statistics               Python, R     GPL
WikiHadoop        process Wikipedia dumps in Hadoop     Java          ASL

Table 3.3: Tools and services for accessing Wikipedia. References are provided in section 3.6.2

3.6.2 Data Access

This section gives an overview of different tools and services for accessing the information in Wikipedia. We do not include small, special purpose scripts provided by Wikipedia users54, but rather restrict the discussion to general purpose APIs, software libraries and services (see table 3.3).

The MediaWiki API provides direct access to the databases of the MediaWiki installations which underlie each Wikimedia project.55 It is available as a web service56 with wrappers for various programming languages57. While the API supports many queries, provides various output formats and delivers up-to-date information, it is not suitable for most moderate- and large-scale processing tasks, since the performance of the web service is very limited and poses severe restrictions on non-privileged users. The API is used mostly by bots (maintenance scripts), which receive privileged access rights in order to perform maintenance activities within Wikimedia projects. In contrast to most other access tools, the MediaWiki API can not only be used to retrieve data, but also to update and add data.
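As a minimal sketch of such a web service call, the following code requests the five most recent revisions of a single article (author, timestamp and comment) as JSON and prints the raw response. The query parameters are standard MediaWiki API parameters, the article title is an arbitrary example, parsing of the JSON is omitted, and large-scale crawling in this fashion would quickly run into the limitations mentioned above.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

// Fetches revision metadata for one article via the MediaWiki web service API.
public class MediaWikiApiExample {

    public static void main(String[] args) throws Exception {
        String url = "https://en.wikipedia.org/w/api.php"
                + "?action=query&prop=revisions&titles=Natural%20language%20processing"
                + "&rvprop=user%7Ctimestamp%7Ccomment&rvlimit=5&format=json";

        BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(url).openStream(), "UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line); // raw JSON response
        }
        in.close();
    }
}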

The Wikimedia Labs is a scalable, cloud-based test and development environment that provides virtual machines on which individual code can be run with direct access to replicated versions of the live Wikimedia databases, including all language versions of Wikipedia and its sister projects.58 As the successor project of the Wikimedia Toolserver59, the Wikimedia Labs aim at providing both a development environment and a place for hosting online tools intended to be used directly by the community.

54A compilation of these can be found under http://en.wikipedia.org/wiki/WP:WikiProject_User_scripts/Scripts
55http://www.mediawiki.org/wiki/API
56http://en.wikipedia.org/w/api.php
57http://www.mediawiki.org/wiki/API:Client_code
58https://wikitech.wikimedia.org
59http://toolserver.org


The unique advantage of running software in the Labs environment is the direct database access. While not granting access to restricted information, these databases offer all information that is publicly available for each Wikimedia project. This includes a wider range of information than the downloadable XML data dumps have to offer. Furthermore, the databases are always up-to-date. Beyond that, no particular API is provided to access the data in a structured manner. Depending on the software requirements, code can either be hosted on an individual virtual machine, or it can be deployed to the Tool Labs, which aggregate smaller tools related to Wikimedia projects. This offers a new way of disseminating applications which originate in research projects for use by the wider public. As of the time of writing, no long-term experiences with the Wikimedia Labs have been reported regarding the performance of the runtime environment, because the project is still in an early beta phase. The former Toolserver struggled with performance issues, which made large-scale processing of Wikipedia data infeasible. However, this is supposed to be solved in the Wikimedia Labs.

The Java Wikipedia Library (JWPL) (Zesch et al., 2008) offers a Java-based programming interface for accessing all information in different language versions of Wikipedia in a structured manner. It includes a MediaWiki markup parser for an in-depth analysis of page contents. JWPL works with a database in the background; the content of the database comes from a dump, i.e. a static snapshot of a Wikipedia version. JWPL offers methods to access and process properties like in- and outlinks, templates, categories, page text (parsed and plain) and other features. The Data Machine is responsible for generating the JWPL database from raw dumps. Depending on what data are needed, different dumps can be used, either including or excluding the Talk page namespace.
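The following minimal usage sketch, modeled after the JWPL tutorials, connects to a previously imported JWPL database and retrieves one article. The connection settings are placeholders, and the exact class and method names should be checked against the JWPL documentation and appendix A; the sketch only illustrates the general access pattern.

import de.tudarmstadt.ukp.wikipedia.api.DatabaseConfiguration;
import de.tudarmstadt.ukp.wikipedia.api.Page;
import de.tudarmstadt.ukp.wikipedia.api.Wikipedia;
import de.tudarmstadt.ukp.wikipedia.api.WikiConstants.Language;

// Structured access to a single article via JWPL (placeholder connection settings).
public class JwplExample {

    public static void main(String[] args) throws Exception {
        DatabaseConfiguration dbConfig = new DatabaseConfiguration();
        dbConfig.setHost("localhost");
        dbConfig.setDatabase("wikipedia_en");
        dbConfig.setUser("wikiuser");
        dbConfig.setPassword("secret");
        dbConfig.setLanguage(Language.english);

        Wikipedia wiki = new Wikipedia(dbConfig);
        Page page = wiki.getPage("Natural language processing");

        System.out.println(page.getTitle());
        System.out.println(page.getPlainText());            // markup-free article text
        System.out.println(page.getCategories().size() + " categories");
    }
}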

The Wikipedia Revision Toolkit (WRT) (Ferschke et al., 2011) expands JWPL with the ability to access Wikipedia’s revision history. To this end, it is divided into two tools, the TimeMachine and the RevisionMachine. The TimeMachine is capable of restoring any past state of the encyclopedia, including a user-defined interval of past versions of the pages. The RevisionMachine provides access to the entire revision history of all Wikipedia articles. It stores revisions in a compressed form, keeping only differences between adjacent revisions. The Revision Toolkit additionally provides an API for accessing Wikipedia revisions along with metadata like the comment, timestamp and information about the user who made the revision. A more detailed description is provided in appendix A.1.

Wikipedia Miner (Milne and Witten, 2009) offers a Java-based toolkit to access and process different types of information contained in Wikipedia articles. Similar to JWPL, it has an API for structured access to basic information of an article. Categories, links, redirects and

the article text, plain or as MediaWiki markup, can also be accessed as Java classes. It runs a preprocessed Java Berkeley database in the background to store the information contained in Wikipedia. Wikipedia Miner has a focus on concepts and semantic relations within Wikipedia. It is able to detect and sense-disambiguate Wikipedia topics in documents, i.e. it can be used to wikify plain text. Furthermore, the framework compares terms and concepts in Wikipedia, calculating their semantic relatedness or related concepts based on structural article properties (e.g. in-links) or machine learning. In contrast to JWPL, it cannot be used to access and process the revision history of an article. The capability of its parser is limited, e.g. no templates or infoboxes can be processed.

WikiXRay is a collection of Python and GNU R scripts for the quantitative analysis of Wikipedia data (Ortega, 2009). It parses plain Wikimedia dumps and imports the extracted data into a database. This database is used to provide general quantitative statistics about editors, pages and revisions.

WikiHadoop is a stream-based input format for Hadoop60, an open-source software framework for distributed computing. While WikiHadoop61 does not provide any direct support for accessing and processing Wikipedia data, it manages the segmentation of the large data dump for distribution on a compute cluster. Provided with a compressed version of the XML data dump, WikiHadoop decompresses the data on the fly, splits the stream into chunks of single page revisions or pairs of adjacent revisions and feeds these chunks to individual processing units (i.e. mappers). This simplifies the development of code for large-scale analysis of Wikipedia data which is only concerned with local information, i.e. the content of a single revision or a pair of revisions.

3.7 Other Wikimedia Projects

While this thesis in general and this chapter in particular focus on Wikipedia, it is nevertheless important to mention some of its sister projects that closely interact with Wikipedia and share many of its foundational concepts. A full overview can be found on the website of the Wikimedia Foundation62.

Wiktionary is a multilingual, collaboratively created, online dictionary that emerged from the desire to exclude linguistic and lexicographic information from Wikipedia articles (Meyer, 2013). Established in December 2002 as a companion to Wikipedia, Wiktionary has grown into a large project in its own right.

60 http://hadoop.apache.org
61 https://github.com/whym/wikihadoop
62 http://wikimediafoundation.org/wiki/Our_projects


Instead of encyclopedic knowledge, which is the center of attention in Wikipedia, Wiktionary is primarily concerned with word definitions, including additional information such as etymology, pronunciation, and lexical-semantic relations. Going beyond a traditional dictionary, Wiktionary further includes supplemental content such as a thesaurus, a rhyme guide, phrase books or language statistics.63

Like Wikipedia, Wiktionary is available in different language versions, where each language version may also describe words from other languages, using the native language of the respective version as a point of reference. For example, the English Wiktionary “aims to describe all words of all languages using definitions and descriptions in English”63. In short, each Wiktionary language version is a monolingual dictionary for multiple languages.

Since Wiktionary emerged from Wikipedia, it shares many of its basic principles and technical details, such as user discussion pages for the coordination of collaborative efforts and cleanup templates for identifying quality problems. This is why the techniques for quality assessment described in this thesis are, at heart, also suitable for Wiktionary. However, as an online dictionary, Wiktionary has different aims and different quality standards than Wikipedia. Therefore, the underlying quality model which we define for Wikipedia in this work (see chapter 4) has to be adapted to Wiktionary.

Wikidata is a multilingual, collaboratively created, structured knowledge base intended to serve as a central information repository for all Wikimedia projects.64 At its core, Wikidata is intended to centralize the management and storage of interwiki links, infoboxes and lists in Wikipedia. Instead of managing the cross-language links between language versions in every article separately, Wikidata provides a central repository of concepts which contain links to all corresponding articles in every language version. This means that a newly added article for a given topic will automatically receive all correct interwiki links once the article is connected to the corresponding Wikidata concept.

In addition to centralizing interwiki links, Wikidata also contains structured information for many of the concepts stored in the database. A Wikidata entry for a person, for instance, will typically hold information about their date and place of birth, occupation and other relevant information. This information can be used to automatically fill the infoboxes in all language versions of Wikipedia, thus eradicating the problem of out-of-sync information and helping to keep Wikipedia as a whole up to date. Once deployed in full to all language editions of Wikipedia and other major projects, such as Wiktionary, Wikidata has the potential to increase the consistency and currency of each individual resource and improve the interconnectedness of all Wikimedia projects.
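As a small illustration of this centralization, the following Python sketch retrieves a Wikidata item via the public wbgetentities API call and prints its English label and its sitelink to the German Wikipedia. The item ID Q42 and the selected fields are merely illustrative examples, and the exact response layout may change over time.

import json
import urllib.request

# Fetch one Wikidata item and show how it centralizes labels and the
# interwiki sitelinks for all language versions of Wikipedia.
URL = ("https://www.wikidata.org/w/api.php"
       "?action=wbgetentities&ids=Q42&props=labels|sitelinks&format=json")

with urllib.request.urlopen(URL) as response:
    entity = json.load(response)["entities"]["Q42"]

print(entity["labels"]["en"]["value"])          # label in English
print(entity["sitelinks"]["dewiki"]["title"])   # linked article in the German Wikipedia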

63 http://en.wiktionary.org/w/index.php?oldid=23894302
64 http://www.wikidata.org/w/index.php?oldid=73888516


3.8 Chapter Summary

In this chapter, we introduced Wikipedia, its main structures and properties, its community and ways to process the large amounts of data it contains. We established that the policies governing Wikipedia and shaping its content are collaboratively defined and change over time. While large parts of these policies are shared across the different language versions, each edition has an individual take on the wiki philosophy, which leads to a different culture in each Wikipedia. Even though Wikipedia contains an almost incomprehensibly large set of rules and guidelines, the basic principles can be boiled down to the five pillars of Wikipedia, which build the foundation for a soft security system.

A unique characteristic of Wikipedia is the revision history that is kept for every page and which allows keeping track of every change ever made to the encyclopedia. At the same time, the revision history is the reason for the large amount of data Wikipedia sums up to, which makes it difficult to process as a whole. User communication is mainly performed on the different Talk pages, an unstructured discussion space in dedicated namespaces. Article Talk pages are used to coordinate article development and to discuss the future of an article. User Talk pages, on the other hand, are used as the main means of communication between the users. There are different ways to access Wikipedia, ranging from direct access to the live databases via a web API over manual processing of downloadable XML dumps to dedicated, database-driven programming interfaces. The best solution depends on the application's need for data currency and speed.

While the main reason for Wikipedia's success is its policy that everyone can contribute, the same policy also constitutes its greatest challenge. In order to establish Wikipedia as a trustworthy and comprehensive reference work with a quality level equal to edited encyclopedias, Wikipedia needs a quality management process that can cope with the almost anarchic culture that Wikipedia is based on. Taking into account the unprecedented size of the larger Wikipedia editions, a satisfactory solution can only be reached with computational assistance. The remainder of this thesis will therefore sketch how computational methods can assist the community in managing and improving the quality of Wikipedia.


Chapter 4 Information Quality

“Quality is never an accident; it is always the result of high intention, sincere effort, intelligent direction and skillful execution; it represents the wise choice of many alternatives.” — William A. Foster

In this chapter, we discuss the concept of information quality along with theoretical and practical considerations of its measurement. We start with a literature review and identify existing theories and models targeted at describing and quantifying information quality (section 4.1). We then narrow our focus to writing quality and discuss how a model for textually represented information can be derived from a generic information quality model, and furthermore examine the factors that pertain to information quality management (section 4.2). We finally review the mechanisms and policies regarding quality assessment and assurance in Wikipedia (section 4.3), from which we derive an article quality model (section 4.4). We conclude the chapter with a summary of our findings (section 4.5).

4.1 Information Quality

Claude Shannon was among the first to develop a quantitative definition of information in order to provide a sound theoretical foundation for his model of communication. His definition employs the mathematical uncertainty measure of entropy as a means to quantify the amount of information in a message, i.e. the information content of a message (Shannon, 1948). While his quantitative definition is useful for the examination of data transmissions and machine-to-machine communication, it falls short of capturing the semantic aspects that we usually associate with the term information and which Shannon regards as “irrelevant to the engineering problem” (Shannon, 1948, p. 379).
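For reference, the entropy Shannon uses to quantify the information content of a message drawn from a source with symbol probabilities $p_1, \dots, p_n$ can be written in its textbook form (reproduced here from the general literature rather than from a formula in this thesis) as

\[ H = -\sum_{i=1}^{n} p_i \log_2 p_i \]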

In fact, Shannon's definition has no connection to the semantic content of a message at all and even declares that there is potentially more information, i.e. entropy, in chaos and randomness than in structure (Sveiby, 1996). In order to approach a more qualitative definition of the concept of information that more closely resembles the intuitive meaning of the word, one has to take semantic, pragmatic and even aesthetic factors into consideration, which are not as easily captured in codes and numbers. We will refrain from attempting to define a generic concept of information and rather approximate its meaning indirectly by discussing information quality.

It is not surprising that quality considerations of something as intangible as information are subject to a wide range of uncertainties arising from the different interpretations of the concept, especially given that quality itself is also not an easy concept to define. Information quality (IQ) is generally regarded as a multi-faceted, multi-dimensional concept. In a review of the information quality literature, Eppler and Wittig (2000) identify seven basic definitions of information quality, which are often combined in different ways. In summary, high quality information must be fit for use by information consumers in a particular context, meet a set of predefined specifications or requirements, and meet or exceed user expectations. Thus, high quality information provides a particularly high value to the end user.

Models, Frameworks and Standards. An information quality model is a concise system of evaluable criteria which instantiate the aforementioned definitions in such a way that they can be incorporated in an information quality framework. Such frameworks ideally ground the quality model in an underlying theory, define a scheme for analyzing and solving quality problems and provide metrics for quality measurement (Eppler and Wittig, 2000). Quality standards furthermore provide a frame of reference necessary for interpreting the output of the quality measures. In other words, while the quality model defines the dimensions along which we measure quality, the metrics define how we measure the quality along each dimension and the standards define how we interpret the output of these measurements.

As surveys of IQ frameworks (Eppler and Wittig, 2000; Knight and Burn, 2005) show, there is a large dimensional overlap between most available IQ models. These overlaps have often been used as indicators for the most salient and most important dimensions, which, however, ignores the fact that the various models operate at different levels of granularity, in different application contexts and for different types of information.

Since the concept of quality is inherently context-specific, there is no universal IQ model that truly captures all aspects of information quality. However, we can categorize IQ frameworks and models with respect to the extent to which they have been adapted to particular contexts. We regard a quality framework as generic if it has not clearly been designed for a specific application context or for a particular information type. We furthermore distinguish

between three adaptation processes, which can be used to customize a generic framework for a given task.

Medium Adaptation: Adaptation with respect to the representation of the information (e.g. text, video, numerical data) and the way of its distribution or storage (e.g. web, print).

Application Context Adaptation: Adaptation with respect to the intended application context in which the information is used or in the context of which it is evaluated. For example, the information quality requirements of a database for medical records are different from the requirements of a public encyclopedia.

User Adaptation: Adaptation with respect to the users that interact with the information. This includes factors such as the number of users, their expertise and their way of interacting with the information (e.g. production, consumption or processing of information) (Lee et al., 2002).

These adaptation processes are not mutually exclusive. In practice, information quality models often exhibit mixtures of adaptations e.g. with respect to a certain medium and application context. While generic models aim to represent the universal aspects of quality and minimize the adaptation to a specific context or information type by means of more coarse-grained dimensions, specific models employ a more fine-grained set of dimensions to reflect the particular needs of the task at hand. Within a single model, be it generic or specific, all dimensions must be disjunct without any semantic overlap (Rohweder et al., 2008). Even though generic models are designed without a specific application context in mind so that they can be applied to many different settings, they can only be interpreted when they are contextualized in an application setting.

Wang and Strong Model. One of the most cited models of information quality has been developed by Wang and Strong (1996). Not only has this model been widely used in the last 17 years, it is also recognized as one of the few generic approaches to information quality that take a middle ground between a solid theoretical foundation and practical applicability (Eppler and Wittig, 2000). In a two-stage survey, Wang and Strong asked data consumers with varying backgrounds to identify the individual aspects of data quality along with their perceived importance on a scale from 1 to 9. Overall, they identified 118 different attributes with a high average importance score and a sufficient stability across all participants. Grouping these attributes into higher level categories and merging similar concepts finally led to the information quality model shown in figure 4.1. The model consists of 15 quality dimensions that are organized into four categories: the intrinsic, contextual, representational and accessibility category.

The intrinsic category captures properties innate to information entities and suggests that data has a certain quality in its own right – independent from its usage, the user or the creator.


Figure 4.1: Hierarchical information quality model after Wang and Strong (1996). The model arranges 15 quality dimensions under four categories: Intrinsic (Believability, Accuracy, Objectivity, Reputation), Contextual (Value-added, Relevancy, Timeliness, Completeness, Amount of information), Representational (Interpretability, Ease of understanding, Consistency, Conciseness) and Accessibility (Accessibility, Access security).

The contextual category captures the quality in the context of the task at hand within which the information entity is used. The representational category furthermore takes into account how the information is represented and whether it can be efficiently processed by the user. Finally, the accessibility category is concerned with the trade-off between security and accessibility, i.e. the ease of accessing information by permitted users.

Due to its strong empirical foundation and its balance between generality and practical applicability, the Wang and Strong model has served as the basis for many IQ frameworks. Among others, the German Association for Information and Data Quality (DGIQ) directly adapted this framework with minor adjustments65 as the standard for a user-centric IQ model (Rohweder et al., 2008).

Relationships Between Quality Dimensions. While quality dimensions are supposed to be disjunct within a single model and should be observable in isolation, they are seldom truly independent of each other. Eppler and Wittig (2000) list typical trade-offs between certain pairs of dimensions, such as security vs. accessibility, currency vs. accuracy, or conciseness vs. scope, which suggest that a piece of information, or, by extension, a whole information system, can only be optimized for one of the dimensions in each pair. That is, when a document becomes more elaborate, it becomes less concise. While some applications call for an optimization towards one end of the trade-off spectrum (e.g. security over accessibility), other applications rather seek a balanced calibration.

65 The dimension access security was replaced by ease of manipulation, which was originally included in an earlier stage of the Wang and Strong model under the term ease of operation. This adaptation stresses the user focus of the DGIQ model (Rohweder et al., 2008).


Schaal et al. (2012) give a systematic overview of the relationships between a set of quality dimensions for the social web context. Their model not only contains the trade-off relationship, but also enabling relationships such as “verifiability helps believability”. However, one can argue that many of the dimensions in an enabling relationship are merely fine-grained distinctions of the same higher-level dimension and are thus positively correlated. That is, verifiability and believability might be considered as a single dimension, trustworthiness.

The relationships between quality dimensions, particularly the trade-off relations, are important to consider when defining quality standards for the purpose of quality assurance. It is not enough to define what constitutes high quality within every dimension of a model; it also has to be considered how to balance the quality across dimensions.

Measurability of Quality Dimensions. While quality models build the formal foundation of a quality framework and identify the dimensions along which the quality of information is to be evaluated in a given context, not all of these dimensions are equally well observable in the data, let alone measurable automatically. The usefulness of an information quality framework therefore largely relies on the provided metrics for measuring quality along the dimensions the framework defines. The problem of measurability decomposes into four separate aspects (cf. Bizer, 2007, pp. 36–39):

Consistency: How well and how consistently can humans rate quality along each dimension? In other words, is each dimension well enough defined to reach reliable judgments with sufficient agreement?

Subjectivity: Are quality judgments along a given dimension inherently subject to subjective preference, or can the dimension be rated objectively? This aspect also influences the consistency mentioned above.

Operationalizability: How can we operationalize the judgment along each dimension and what are the indicators on which the judgment is based? In other words, what are the features on which a quality assessment metric can be based for each dimension?

Interpretability: Is it possible to map the output of a given metric to a quality standard in order to be able to interpret the quality ratings on a scale?

In the context of automatic quality assessment, operationalizability is the most immediate issue. Yaari et al. (2011) distinguish between measurable and non-measurable criteria, i.e. whether criteria can be reliably assigned by a computer from the text alone without any human intervention, and propose a set of metrics for automatically rating articles according to the criteria of the former category. Similarly, Stvilia et al. (2007) provide a list of measures that correlate with user judgments for the subset of measurable dimensions in their quality model (see section 4.4). Non-measurable dimensions cannot be operationalized

directly but need human intervention in the form of manual quality judgments from which correlated aspects can be learned with statistical methods.

Arazy and Kopak (2011) studied the consistency of quality judgments by 270 undergraduate students rating 100 Wikipedia articles on a Likert scale from 1 to 7 along four dimensions66 and report poor intra-class agreement levels which did not exceed 0.17. One of the main conclusions drawn from this study is that users have a hard time judging quality within the constraints of an abstract model on a fixed scale. This is not so much a problem of the users not being able to consistently agree on the quality of an article (or, in general, of an information entity) but rather a problem of expressing the subjective quality perception in an abstract rating system. We therefore argue that quality judgments provided in natural language but analyzed according to a well-defined model will provide better insights into information quality than having a large crowd of non-experts provide ratings in an abstract form. One of the approaches to assessing article quality presented in this thesis therefore analyzes the discussions of Wikipedia users with respect to information quality judgments.

Information Quality Management. Having defined the terms IQ framework, IQ model and IQ standard earlier in this section, we close with a definition of the processes in which they are employed. IQ assessment is the general task of judging the quality of an information entity. Along the lines of the argumentation in this thesis, this is achieved within an IQ framework. IQ assurance aims at maintaining a high standard of information quality by continuously monitoring the information quality of all information entities in a resource. This is particularly achieved by identifying and avoiding quality problems. Quality problems are defined as violations of a quality standard in any of the quality dimensions defined by the quality model. IQ improvement furthermore steers towards a higher quality level, e.g. by providing feedback to the community about quality problems and inconsistencies along with guidelines on how to resolve these issues. IQ management finally combines IQ assurance and IQ improvement into an integrated process that is tailored towards a particular application context and community. However, it is not only directed at the information entities alone but also at the policies, guidelines and tools responsible for maintaining, securing and storing them (Stvilia et al., 2008), and it furthermore shapes the decision making processes involved (Ge, 2009). It has to be noted that these terms are not used consistently across the literature and are often employed interchangeably in different contexts (Eppler, 2003).

66A subset of the Wang and Strong model: accuracy, completeness, objectivity, representation


4.2 Text and Writing Quality

A large amount of information that people interact with on a daily basis is represented in the form of text. Since texts are carriers of information, text quality can be analyzed in terms of information quality. However, as natural language is very powerful, expressive and subject to many subtleties, applying a generic, coarse-grained information quality model to text will leave many facets unexplained and will thus be insufficient.

A quality model adapted for textual information should therefore consider the peculiarities of the medium. Above all, this involves the incorporation of a notion of well-writtenness, or, in other words, the writing quality of a text. While text quality models capture the overall quality of textually represented information, writing quality is a subordinate concept that mainly reflects the quality of its representation. Thereby, it captures both formal aspects of language correctness, sometimes also referred to as linguistic quality, and creative aspects of language use, such as the development of ideas within a text or the effective use of rhetorical devices. In the following, we discuss the major aspects of writing quality, which we will further break down in section 4.4 when introducing our Wikipedia article quality model.

Language correctness. In its essence, language correctness concerns the proper use of the lexicon and the compliance of a text with the standards and conventions prescribed by the grammar of the given language. However, since languages are not static systems but rather subject to constant development, correctness can only be partially derived from a prescriptive grammar and lexicon and has to be evaluated in the light of the dynamics of language use. Furthermore, the binary notion of correctness is insufficient for real world texts and should rather be regarded as a graded scale of language acceptability (Gordesch and Dretzke, 1998). For example, a spell checker might define the colloquial expression wassup67 as incorrect, although the term might very well be acceptable in the context of social media. Also, non-standard grammatical and syntactic constructions might be considered incorrect in one context while being acceptable in another. Therefore, automatic spell and grammar checkers are helpful resources for text quality assessment, especially in the context of encyclopedias, which aim for standard language usage. However, since collaboratively created texts will always reflect the dynamics of language, the concept of language correctness has to be taken with a grain of salt.

Writing Traits and Rubrics. Beyond mere correctness, academic standards in the language arts define traits of writing quality with the goal of standardizing the assessment of student writing and giving students feedback on their state of writing proficiency.

67 Meaning what's up, a simplified form of greeting, see http://www.urbandictionary.com/define.php?term=Wassup, accessed on April 4th, 2014

A widely adopted framework is the six traits scoring rubric for writing assessment described by Spandel (2012). Derived from a large-scale analysis of student essays in order to find common characteristics of good writing, the six traits aim to deliver a guide for assessing and teaching writing at any level of proficiency and for every audience and text type. In short, the six traits capture the following aspects of a text:

Ideas and development: Development of ideas, clarity and focus of the text, level of detail.
Organization: Order, presentation and structure of the text.
Voice: Choice of an appropriate68 tone and stylistic level.
Word choice: Choice of appropriate68 words and register.
Sentence fluency: Rhythm and flow of language; readability and understandability.
Conventions: Language correctness.

While they have been developed for language teaching, the six traits have successfully been applied in the area of language technology as a base model for automatically assessing the quality of scientific journalism (Louis, 2013). Following the argumentation of Louis, we argue that these rubrics can be good indicators of writing quality in the context of an encyclopedia if they are sufficiently adapted to the genre. In our Wikipedia article quality model described in section 4.4, many of the dimensions in the writing quality category are therefore based on the theoretical foundation of the six traits.

Readability. From a computational perspective, readability is among the oldest and best researched aspects of writing quality. In order to determine the level of reading competency needed to understand a text and to quantify the clarity of writing, many of the early readability metrics rely on textual surface features. To this end, shallow properties, such as the average number of words per sentence or characters per word, are combined in different formulas (Kincaid et al., 1975; Smith and Senter, 1967; Coleman and Liau, 1975; Flesch, 1948; McLaughlin, 1969; Gunning, 1969). In a recent empirical study (Pitler and Nenkova, 2008), these shallow features have not proven to correlate highly with human readability judgments. The study could rather show that lexical, syntactic, semantic or discourse features are more predictive of how well a text is written with respect to reading ease, since these aspects include text organization and lexical difficulty in the equation. Ultimately, readability assessment is an audience-specific endeavor, since it aims to determine the comprehensibility of a text for a particular target audience. This notion, however, is incompatible with writing quality assessment intended for a general audience. We therefore have to reinterpret readability scores as an absolute measure of complexity and learn from the data which readability level is regarded as appropriate by the majority of the readers (Louis, 2013).
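To make the notion of a surface-based readability formula concrete, the following Python sketch computes the Flesch Reading Ease score from naive word, sentence and syllable counts. The syllable counter is a crude vowel-group heuristic and the tokenization is deliberately simple, so this should be read as a sketch of the general approach rather than a faithful reimplementation of any of the cited metrics.

import re

def count_syllables(word):
    # Very rough heuristic: count groups of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    # Flesch (1948): 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

if __name__ == "__main__":
    print(round(flesch_reading_ease("The cat sat on the mat. It was warm."), 1))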

68Appropriate for the genre of the text, the audience, the publication medium and also the writing proficiency of the author as indicated in the six traits guidelines.


Text Organization. While readability is mainly concerned with the sentence level, text organization on the document level is responsible for how well a text is readable as a whole and how easily the argumentation can be followed. Coherence describes the internal consistency of a text exhibited by a “continuity of senses” (de Beaugrande and Dressler, 1981, p. 84). This means that concepts and arguments must be logically connected in order for the recipient to be able to make sense of the text as a whole. On the surface level, the cohesion of a text captures how well the sentences in a text hold together. Cohesion helps to follow the argumentation in a text and is mainly reached with the help of coreference chains, ellipsis, substitution, conjunction and lexical chains (Halliday and Hasan, 1976). While a cohesive text has a higher probability of being coherent, it is still possible for a text to be both cohesive and incoherent.

Motivated by theories of discourse structure (Grosz and Sidner, 1986) and local coherence between adjacent sentence pairs (Grosz et al., 1995), many metrics have been developed that incorporate discourse connectives and coreference analyses into measures of text organization. Louis (2013) gives a detailed overview of related work in this area.

Moreover, psycholinguistic coherence measures, such as Coh-Metrix (Graesser et al., 2004, 2011; McCarthy et al., 2006), aim to reflect the coherence of texts with a combination of textual features and metrics and thereby seek to replace the readability metrics on the sentence level while linking the output to psycholinguistic theories. Coh-Metrix, as one example, combines 54 metrics ranging from shallow surface features similar to the ones employed in readability assessment over latent semantic analysis and frequency-based lexical models to a cue-based analysis of discourse connectives.

While most of the above mentioned approaches rely on a strict linguistic theory of discourse structure, the statistical revolution in NLP also produced new data-driven approaches that aim to determine patterns in the discourse structure from the data. Barzilay and Lapata (2005), for example, apply centering theory to automatically learn entity transitions between adjacent sentences and thus overcome the need for explicit computation. Louis (2013) furthermore proposes a data-driven approach for measuring the intentional structure of writing by using syntax as a rough proxy rather than explicitly annotating the intentional structure for each text genre manually.
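A very simple, purely lexical proxy for the cohesion devices mentioned above is the word overlap between adjacent sentences. The toy sketch below is not one of the cited measures (Coh-Metrix and the entity-based models are far more elaborate); it merely illustrates the kind of adjacent-sentence signal such measures build on.

def adjacent_overlap(sentences):
    # Mean Jaccard word overlap between adjacent sentences (a crude cohesion proxy).
    scores = []
    for prev, curr in zip(sentences, sentences[1:]):
        a, b = set(prev.lower().split()), set(curr.lower().split())
        if a or b:
            scores.append(len(a & b) / len(a | b))
    return sum(scores) / len(scores) if scores else 0.0

# Example: a repeated entity ("the treaty") lexically links the two sentences.
print(adjacent_overlap([
    "The treaty was signed in 1815.",
    "The treaty ended more than a decade of war.",
]))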

4.3 Quality Management Mechanisms in Wikipedia

In an open, collaborative environment such as Wikipedia, no central institution or committee regulates how quality is to be measured and what standards are to be used as a frame of reference. The quality management process is rather defined by an agglomeration of constantly changing guidelines and policies which are largely fragmented and distributed over different places in Wikipedia.


Three central directories give a comprehensive overview of the available guidelines69, policies70 and best practices71.

Distinguished Content. The central frame of reference regarding content quality in Wikipedia is the distinguished content certification. The highest level of distinction in the English Wikipedia is the featured content level, which means that the content meets all quality criteria for Wikipedia content72, has been evaluated in peer review, and is thus eligible to be featured on the Wikipedia main page on a rolling basis. This certificate not only exists for articles, but also for other kinds of Wikipedia content, such as lists, pictures, sounds, portals or topics73. The featured article criteria74 state that an eligible article, in addition to abiding by the aforementioned content policies, should meet the following requirements:

– Be well written, comprehensive, well researched, neutral and stable
– Have a concise lead section
– Be well structured, well referenced and well illustrated
– Have an adequate length and an appropriate level of detail

While most of these criteria are illustrated by accompanying guidelines, there is no exact definition of what differentiates, say, very good writing from mediocre writing and where one should draw the line. Rather than employing a fixed, external frame of reference for quality assessment, the quality judgment is made by comparison to other articles with featured status. Since these existing featured articles are further improved over time, the standard-by-comparison rises, making it more and more difficult for new articles to qualify for featured status. This becomes most evident when comparing the featured articles from years ago with today's featured articles. If a featured article gets demoted, i.e. the featured status is removed in another peer review process, this is rarely caused by a degradation of its quality. Rather, the article has not improved as fast as the collective standard has risen. Due to this fact, and due to the complexity of the peer review process described below, only a very small number of articles has featured status (4,782 or about 0.1% in the English Wikipedia).

Articles that largely comply with the featured article criteria but fall short in some of the categories can qualify for good article status, a lower-level distinction that only exists for articles. As of the time of writing, less than 0.3% of all articles in the English Wikipedia have this status.

69 http://en.wikipedia.org/wiki/WP:LGL
70 http://en.wikipedia.org/wiki/WP:LOP
71 http://en.wikipedia.org/wiki/WP:MOS
72 http://en.wikipedia.org/wiki/WP:CONPOL
73 http://en.wikipedia.org/wiki/WP:FC
74 http://en.wikipedia.org/wiki/WP:FACR


Quality \ Importance     Top        High       Mid        Low          Unassessed   Total
A                        178        316        512        281          70           1,357
B                        10,399     19,847     29,877     21,626       12,421       94,170
C                        7,868      22,263     49,194     58,698       32,800       170,823
Start                    14,978     63,960     255,432    565,328      224,062      1,123,760
Stub                     3,894      26,742     188,795    1,337,343    870,398      2,427,172
List                     2,325      8,732      24,082     61,827       49,701       146,667
Featured Articles        1,002      1,514      1,401      794          162          4,873
Featured Lists           133        511        593        542          125          1,904
Good Articles            1,637      3,736      7,303      6,867        1,509        21,052
Overall Assessed         42,414     147,621    557,189    2,053,306    1,191,248    3,991,778
Overall Unassessed       117        319        1,379      16,001       466,465      484,281
Total                    42,531     147,940    558,568    2,069,307    1,657,713    4,476,059

Table 4.1: Number of articles per WikiProject quality level and importance category in the English Wikipedia according to http://tools.wmflabs.org/enwp10/cgi-bin/table2.fcgi accessed on 04 April 2014. A,B,C, Start, Stub, List = WikiProject Quality Grades

Considering that less than 0.4% of Wikipedia articles have received a distinction for excellent content, it is safe to assume that these articles cannot be representative of all the good content in Wikipedia.

Peer Review. Featured and good articles are determined in a peer review process75. An article first has to be nominated by a community member and will only be accepted if no major cleanup templates are assigned to the page. If accepted, the article is listed among the featured or good article candidates. In a next step, reviewers are recruited from within the community to manage the further process. Reviewing takes place on a dedicated subpage in the Talk namespace of the article. In collaboration with a group of volunteers, open issues are addressed until a final verdict is reached. In case the article is eligible for featured or good status, the corresponding category is assigned. Otherwise, the case is closed and the peer review can be restarted after two weeks' time. Peer reviews can also be requested in other contexts related to quality assurance and assessment.

WikiProject Article Quality Ratings. Since the peer review process described above is too complex and labor intensive for judging the quality of every article in Wikipedia, this task has been distributed across the individual WikiProject subgroups (see chapter 3.3.4). Assuming that WikiProject members are experts in their subject area, they are

75http://en.wikipedia.org/wiki/WP:REVIEW

asked to rate the importance of an article for their field along with its quality on a predefined scale. Article importance ratings range from top to low importance, while the quality levels range from A to C in descending order. Overly short or newly created articles can furthermore receive a stub or start grade, respectively. Table 4.1 shows the numbers of articles per quality level and importance category. It additionally lists the number of pages that received a particular distinguished content certification (see above).

The quality judgments are centrally gathered by a bot (WP 1.0 bot) and used to monitor the overall quality status of Wikipedia. Furthermore, the ratings are the basis for compiling Wikipedia offline releases, which are created on an irregular basis.76 While the WikiProject quality assessment scale is fixed in terms of quality levels, no exhaustive criteria are defined for each grade77. Moreover, while featured and good article status is assigned in a well-defined peer review process, the WikiProject quality assessment is subject to the local customs of the respective WikiProject. Consequently, articles with the same quality level assessed by members of different WikiProjects might differ considerably in their actual quality. In short, the assigned quality levels are not necessarily comparable across Wikipedia. Finally, even though most articles in the English Wikipedia have been rated with this rating scheme at least once, there is no information available as to how recent a certain rating is. How well the ratings reflect the current state of affairs largely depends on the activity of the members of a WikiProject.

User Feedback. While distinguished content and WikiProject quality ratings are grades assigned by active Wikipedia authors, a large fraction of Wikipedia users are purely passive, i.e. they exclusively read but do not contribute to the encyclopedia. In order to incorporate the opinions and views of these users into the quality management process, the Article Feedback Tool (AFT) has been developed. It allows the whole Wikipedia community to evaluate articles along the dimensions Trustworthy, Objective, Well written and Complete on a five-star scale. In addition to the actual article ratings, the users were furthermore encouraged to provide information about their knowledge in the subject area of the article in order to put these anonymous ratings into perspective. The user interface is displayed in figure 4.2. Even though the ratings ask for judgments on a five-star scale, users tend to rate with extreme scores that merely reflect a binary classification in each dimension (good vs. bad) (Flekova et al., 2014). Additionally, the high correlation between the four dimensions renders separate analyses difficult.

Cleanup Templates. As described in chapter 3.2.4, the template system in Wikipedia is not only used to embed recurring content into Wikipedia pages, but also as a tagging system for various applications.

76 http://en.wikipedia.org/wiki/WP:1.0
77 http://en.wikipedia.org/wiki/WP:ASSESS gives a brief description of each category along with a sample article


Figure 4.2: The rating interface of the Article Feedback Tool (v4) as displayed at the bottom of an article.

One of these applications is marking open issues in articles that need the attention of a contributor. While these cleanup templates leave a note on the tagged page, they also list that page in the corresponding cleanup category. Furthermore, short notes can be added as parameters to the template in order to document the issue and provide additional details. Thus, the cleanup template system constitutes an issue tracker for quality assurance and is used as the basis for our approach to automatically identifying quality flaws in Wikipedia articles that is described in chapter 5.
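To make the issue-tracker character of the template system concrete, the following Python sketch scans the wikitext of an article for a small set of cleanup templates and their optional parameter notes. The template names are merely illustrative examples (the real inventory is much larger and changes over time), and real markup with nested templates would call for a proper parser rather than a regular expression.

import re

# Illustrative subset of cleanup template names; the actual set is much larger.
CLEANUP_TEMPLATES = ("POV", "Advert", "Citation needed", "Cleanup", "Tone")

TEMPLATE_RE = re.compile(
    r"\{\{\s*(?P<name>" + "|".join(re.escape(t) for t in CLEANUP_TEMPLATES) + r")"
    r"\s*(\|(?P<args>[^{}]*))?\}\}",
    re.IGNORECASE,
)

def find_cleanup_tags(wikitext):
    # Return (template name, raw parameter string) pairs found in the markup.
    return [(m.group("name"), (m.group("args") or "").strip())
            for m in TEMPLATE_RE.finditer(wikitext)]

sample = "{{POV|date=May 2012}} The product is simply the best on the market."
print(find_cleanup_tags(sample))   # [('POV', 'date=May 2012')]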

Flagged Revisions and Pending Changes. As described in chapter 3.4, the flagged revisions extension and, in the English Wikipedia, the pending changes extension are used to hide new changes from the general public until they have been reviewed by trusted community members. While this review process is, in practice, not really suited to improving article quality, it helps to prevent vandalism and the associated deterioration of article quality.

Article Discussions. Even though the Wikipedia Talk pages (see chapter 3.5) are not, by design, instruments for quality management, the article Talk pages are nevertheless the central platform for any communication regarding article development and work coordination. On these pages, both the active contributors and the more passive readers exchange their thoughts on how the article can best be improved and share criticism regarding its quality. However, since these discussions are largely unstructured and often spread over several Talk archives, it is not easy to keep track of decisions made in the past. Furthermore, a study by Schneider et al. (2011) has shown that new users are easily confused by the lack of discourse structure and therefore do not contribute in larger discussions. We therefore

propose an approach in chapter 6 to automatically analyze these discussions in order to provide structured feedback regarding quality management decisions.

Overall, while the quality management mechanisms based on community decisions alone have been successful in the past in making Wikipedia a reliable information source and one of the central reference works on the Internet, they are unlikely to be able to cope with the exploding size of Wikipedia. It is therefore necessary to provide computational assistance without patronizing the users and imposing too many restrictions on the community, their work and their decisions.

4.4 An Article Quality Model for Wikipedia

As we have seen at the outset of this chapter, the concept of information quality largely depends on different contexts, for example the intended application in which an information entity is supposed to be used, the target audience, or the form of representation. Therefore, a quality model for Wikipedia articles has to consider the characteristics and purpose of an encyclopedia, has to reflect its collaborative and intercultural nature and has to address the peculiarities of its representation.

Several attempts have been made to adapt existing information quality models and frameworks to encyclopedias in general and to Wikipedia in particular (Crawford, 2001; Stvilia et al., 2007, 2008; Lichtenstein and Parker, 2009). The Wikipedia community also formed a Quality Task Force (QTF) in order to develop a quality model that can measure “the ability of a [W]ikipedia article to meet the expectations and needs of the article's target audience, i.e. the readers of the article”78. In four categories, the QTF model defines quality dimensions capturing requirements of content, demand, form and the project. While the first three categories consider quality aspects of individual articles, the latter applies to greater structures, such as whole subject areas and the integration of individual articles within these areas.

While these attempts succeed in incorporating the requirements of a collaboratively created encyclopedia with respect to intrinsic and contextual information quality, the representational aspects fall short in most of these models. The quality of writing, while being a key aspect for a text resource such as Wikipedia, is not clearly distinguished from the intrinsic quality aspects or is merely represented in an undifferentiated, aggregate form. As the only exception, the QTF model attempts to explicitly include representational aspects of Wikipedia articles. However, the model has never left the stage of a working definition and is neither based on a sound theoretical foundation nor integrated in a quality assessment framework.

78http://strategy.wikimedia.org/w/index.php?oldid=65341


Figure 4.3: Proposed model for article quality in Wikipedia. The model comprises 23 dimensions in four categories: Intrinsic (Verifiability, Accuracy, Neutrality, Completeness, Coherence, Currency, Reputation), Contextual (Value-added, Amount of information, Complexity, Volatility), Writing (Development, Structure, Tone, Word choice, Fluency, Conventions, Understandability, Cohesion, Illustration, Grammaticality and Spelling) and Organizational (Categorization, Connectivity).

We therefore build upon the previously discussed generic model of information quality by Wang and Strong (1996) and, in consideration of the related adaptations of this model for Wikipedia, define a unified model of article quality that pays special attention to textual properties. This model is supposed to represent the writing quality of Wikipedia articles in the sense that we defined in section 4.2. It is supposed to serve as an orientational map for quality management that can be used to identify aspects of information quality to be monitored and assessed by available methods and mechanisms and also to indicate gaps in the coverage of existing quality assurance processes. We will later refer to this model when we introduce our automatic quality assessment methods in chapters 5 and 6.

Similar to Wang and Strong, we distinguish between four categories of quality dimensions. Following their rationale, we define a category of intrinsic quality and contextual quality. While the former captures the internal characteristics of the information that is expressed by an article, the latter focuses on its appropriateness for the audience, medium and application. Instead of a generic category of representational quality, we account for

the textual nature of Wikipedia with a dedicated writing quality category that includes linguistic and stylistic properties. Finally, while the first three categories assess individual articles in isolation, the fourth, organizational category focuses on their integration within the wider confines of Wikipedia. Figure 4.3 shows an overview of the model. In the following, we give a short description of each dimension in the model.

4.4.1 Intrinsic Article Quality

Following Wang and Strong and their category of intrinsic data quality, intrinsic article quality captures the internal characteristics of the information contained in an article and is largely detached from its representation or application. Judging article quality along the dimensions in this category demands the greatest level of knowledge about the article topic.

Verifiability: Originally defined as believability by Wang and Strong, we define verifiability to assess how well the information represented is referenced and can thus be verified. This affects both the authority and quality of any given sources as well as the absolute number of sources contained in the article and the relative coverage of the article content by references.

Accuracy: Refers to the factual correctness and the preciseness of an article.

Neutrality: A neutral article is not supposed to take particular sides and should provide a balanced view on a subject. Issues regarding article neutrality are often discussed under the term NPOV – the neutral point of view. Even though there is a general distinction between neutrality and objectivity (an article can be objectively written while not being neutral), we subsume both concepts under the same dimension.

Completeness: The dimension completeness is strongly related to the amount of information on the contextual level. However, in contrast to the amount of information, completeness is not concerned with the verbosity of the article, but rather with how well the article topic is covered. This dimension is also present in the user rating dataset described in section 4.3. However, in this dataset, completeness scores mainly correlate with the article length, which is an important factor but is by no means sufficient to rate the quality along this dimension reliably. Judgments along this dimension demand extensive knowledge of the subject area in order to identify coverage gaps.

Coherence: Describes the internal consistency of the information. Coherent texts exhibit a “continuity of senses” (de Beaugrande and Dressler, 1981, p. 84), meaning that concepts and arguments must be logically connected in order for the recipient to be able to make sense of the text as a whole. Coherence is strongly related to cohesion (see below), which captures the linguistic aspects of coherence (Halliday and Hasan, 1976).

Currency: Captures how up to date the information is, i.e. whether the article reflects the current state of affairs and the current state of knowledge.


Reputation: While the reputation of traditional publications is often associated with the respective reputation of the author and the publisher, reputation in the Wikipedia context is more concerned with the trustworthiness of the sources from which the information was taken. This dimension is strongly related to verifiability and captures the quality of the references rather than the coverage of the article with references.

4.4.2 Contextual Article Quality

The category of contextual article quality consists of dimensions that capture how well the article fits into an encyclopedia and how well it satisfies the typical requirements of the audience. First and foremost, the article needs to fill a knowledge gap within the encyclopedia and concisely describe its subject on an appropriate level of detail.

Value-added: The added value of an encyclopedic article is mainly that it delivers relevant and concise information that is most likely of interest for the typical reader without repeating information that is already available somewhere else in the encyclopedia. It is therefore also often described with the duality of informativeness and redundancy (Stvilia et al., 2007, 2008).

Amount of information: Related to the dimension completeness, the amount of information relates to an adequate level of verbosity of the article. An article should be as verbose as necessary while being as brief as possible in order to fulfill all requirements along the other quality dimensions.

Complexity: This dimension relates to the level of abstraction and level of detail at which the topic of the article is described. It is strongly related to the dimension of understandability, which covers the linguistic complexity of the article. The present dimension rather captures the level of complexity at which the information is presented.

Volatility: The volatility of an article is determined by the stability of its content. Depending on the article topic, a certain level of constant revision is needed to fulfill the requirements of the currency dimension. Apart from that, a high quality article should not be subject to frequent larger revisions.

4.4.3 Article Writing Quality

In accordance with our previous description of writing quality, this category captures how well an article text is developed and subsumes both linguistic and stylistic properties. Similar to previous work on scientific journalism (Louis, 2013), the dimensions in this category are loosely based on the six traits scoring rubric for informative texts (see discussion in

section 4.2) with genre-specific adaptations and expansions for encyclopedic texts under consideration of the Wikipedia Manual of Style.79

Development: Ideas expressed in the article have to be logically organized.

Structure: The article should be well structured according to the Wikipedia style guidelines and use sectioning with meaningful headlines and paragraphs within the sections.

Tone: The tone of the text should be suitable for a formal, informative text and neither be too casual and intimate nor too stilted. It should be neutral and distant rather than opinionated and engaging.

Word choice: The right choice of words involves the selection of an appropriate register (in accordance with the tone dimension) and should account for precise and natural sounding language. In line with the requirements of the understandability dimension, the use of technical terms should be limited to the necessary minimum.

Fluency: The article should read fluently as a single text rather than represent a collection of independent text snippets related to the same topic.

Conventions: The article has to maintain the conventions defined in the Wikipedia Manual of Style regarding aspects such as abbreviations, capitalization and punctuation.

Understandability: Strongly related to the dimensions complexity and word choice, understandability captures linguistic aspects such as the syntactic complexity of the text. It subsumes concepts of readability and reading ease, which have been discussed in section 4.2.

Cohesion: Cohesion refers to the linguistic aspects of the related dimension of coherence and captures how well the sentences in a text hold together. Cohesion helps to follow the argumentation in a text and is mainly reached with the help of coreference chains, ellipsis, substitution, conjunction and lexical chains (Halliday and Hasan, 1976).

Illustration: While the adequate illustration of a text with images, tables or graphs in order to support the effective delivery of an article's message is an extratextual aspect, we still regard it as a trait of writing quality, since it creates a direct link between the textual and the visual level. This aspect has recently also been incorporated in the six traits rubric in the form of an additional presentation category (Spandel, 2012).

Grammaticality and Spelling: This dimension refers to the previously discussed aspect of language correctness and comprises the correct use of grammar and spelling. It furthermore defines how language varieties and the use of non-standard language are to be treated. For instance, the English Wikipedia allows both American English and

79http://en.wikipedia.org/wiki/WP:MOS


British English to be used but requires each variety to be employed consistently within a single article.

4.4.4 Organizational Article Quality

While the previously described categories of quality dimensions assess individual articles in isolation, the organizational category is concerned with the integration of the articles within the wider confines of Wikipedia.

Categorization: Wikipedia makes use of a comprehensive system of categories (see section 3.2.2) which helps to improve navigation through the encyclopedia, improves the findability of articles and supports the automatic creation of topical lists and overviews. Therefore, the correct and appropriate categorization of articles is vital for the overall quality of Wikipedia.

Connectivity: Wikipedia articles are hypertext documents that strongly rely on their interconnection with other articles. In order to avoid redundancy across Wikipedia while maintaining the understandability of individual articles by providing all necessary information, articles link to existing content rather than reproducing it. Therefore, when judging the quality of a single article, we have to consider its integration within the wider scope of a superordinate WikiProject, the whole Wikipedia, or even the network of Wikimedia projects.

4.5 Chapter Summary

In this chapter, we discussed the concept of information quality and its application to information quality management. We have established that information quality, in the broadest sense, is a measure of the “fitness for use” of an information entity in a given application scenario. While it is not possible to define a single universal model of information quality, existing models differ in the extent to which they have been adapted to a particular application, medium or user group. The notion of text quality refers to an information quality model for textually represented information which particularly takes the writing quality of a text into account. In order to construct an information quality model for Wikipedia articles, we reviewed the existing mechanisms for information quality management in Wikipedia to gain an overview of how the concept of quality is interpreted in this community. Based on the widely accepted generic IQ model by Wang and Strong (1996), we then described a hierarchical article quality model with 23 dimensions in four categories that particularly includes writing quality as a major component. The role of this model in the remainder of this thesis is to provide a means of orientation with respect to the aspects of quality that can be assessed with our proposed methods and also to show the gaps that remain.


Chapter 5 Quality Flaw Detection in Wikipedia Articles

“Conceal a flaw, and the world will imagine the worst.” — Marcus Valerius Martial

A major part of information quality management is quality assessment, the goal of which is to measure the quality of a given information entity according to a predefined quality model and standard. Often, the output of the quality assessment process is an abstract score that gives no rationale regarding the concrete quality problems of the information entity and therefore cannot directly contribute to improving the information. In this chapter, we discuss our approach to quality flaw detection in Wikipedia, which identifies particular violations of a quality standard and therefore directly supports quality improvement efforts.

We first give an overview of our motivation (section 5.1) and proceed with a formal introduction of the concept of quality flaws and how they are manifested in Wikipedia (section 5.2). We then introduce two corpora of Wikipedia articles with selected quality problems and discuss the problem of selecting reliable documents while avoiding a topic bias (section 5.3). Finally, we investigate how these corpora can be used to automatically identify quality flaws in unseen articles (section 5.5) and examine a method to mine flaw corrections from the article revision history (section 5.6). We conclude the chapter with a discussion of the limitations in the predictability of cleanup flaws (section 5.7) and a summary of our findings (section 5.8).


5.1 Motivation and Overview

In the previous chapter, we have introduced the concept of information quality and have described the major aspects of information quality management in the context of a collaboratively created encyclopedia. While quality assessment is an important part of information quality management, it is not enough to just quantify the quality of an information entity – i.e. a Wikipedia article – with an abstract quality score or by assigning coarse-grained labels identifying exceptional content. This, however, has been the central approach of related work, which mainly focused on the prediction of good and featured article labels (Wilkinson and Huberman, 2007; Lipka and Stein, 2010; Javanmardi and Lopes, 2010) or on automatically assigning the community-defined WikiProjects article quality grades (Hu et al., 2007; Rassbach et al., 2007; Hasan Dalip et al., 2009; Han et al., 2011b,a). The main problem with these approaches is that they do not provide any rationale as to why an article received a particular quality rating, what the quality problems are and how to improve the article and its quality. We argue that it is rather important to identify concrete quality problems in order to inform the Wikipedia community where their efforts are most needed and to assist them in improving the overall quality of the encyclopedia.

In this chapter, we present an approach for identifying quality flaws in Wikipedia articles based on supervised text classification using cleanup templates assigned by Wikipedia users as training data. Quality flaw detection constitutes a data-driven quality management strategy directly aimed at assessing and improving the data. The task has first been introduced by Anderka et al. (2012), who showed the feasibility of this approach for the ten most frequent cleanup templates. In our experiments presented in this chapter, we first extend the scope of the task to a wider set of more subtle quality flaws in the categories neutrality and style, which have been identified as particularly important for article quality by the Wikipedia community. We put particular emphasis on the data sampling techniques employed, because an analysis of related work has shown that this task is prone to a severe topic bias in the training data. This results in overly optimistic cross-validated classification results that do not realistically reflect the classifier’s true performance. We therefore present a technique to factor out the topic bias and extract reliable training instances from the article revision history. We furthermore show how this approach can be extended to mine quality flaw corrections from the history.

The main contributions of this chapter can be summarized as follows:

Contribution 5.1: We present a new corpus of neutrality and style flaws mined from the English Wikipedia. The corpus contains both articles with the particular flaws and documents that are reliable examples for articles without these flaws. (section 5.3)

Contribution 5.2: We identify that the quality flaw identification task based on cleanup template detection is prone to a topic bias that results in unrealistically high cross-validated evaluation results that do not reflect the classifier’s real performance on real-world data. We furthermore propose a data sampling approach that is able to avoid this bias in the training data. (section 5.3)

Contribution 5.3: We introduce FlawFinder – a system for supervised text classification designed for quality flaw detection (section 5.4). While FlawFinder has been developed particularly for the task described in this chapter, it can be applied to general text classification problems and has been adapted as a general-purpose text classification framework that is described in appendix A.2.

Contribution 5.4: We evaluate the performance of FlawFinder trained both on the corpora used by related work and on the newly created neutrality and style corpora and perform a detailed error analysis. (section 5.5)

Contribution 5.5: We finally describe an approach for mining a corpus of quality flaw corrections from Wikipedia’s article revision history. (section 5.6)

5.2 Quality Flaws in Wikipedia

While aggregated quality scores in Wikipedia, as they are represented by the featured and good article labels or the WikiProject quality grades (see chapter 4.3), might separate high quality articles from the rest, they fail at representing the intermediate range on the quality scale and do not provide any rationale for the assignment of a particular label, i.e. they do not identify the quality problems of an article. Quality problems can be defined as violations of the quality guidelines, i.e. deviations from the quality standard in any quality dimension.

Cleanup templates (see chapter 3.2.4), on the other hand, represent to-do markers assigned to articles by Wikipedia users in order to identify concrete shortcomings and deficiencies of an article that have to be fixed by the community. These human-assigned labels are therefore very good indicators for quality problems and thus a promising resource for training quality flaw classifiers which can identify these problems in unseen articles. In contrast to labels that simply assign a quality grade to an article, these atomic markers identify single problems, which directly gives actionable feedback to the community with respect to possible improvements. In turn, the aggregated set of all quality flaw markers assigned to an article can also give an overall impression of its quality status.

In the following, we give an overview of the properties of quality flaws in Wikipedia, formally define the task of quality flaw detection and analyze the reliability of cleanup templates as quality flaw markers.

5.2.1 Properties of Quality Flaws in Wikipedia

The system of cleanup templates, like most organizational structures in Wikipedia, has grown organically and is still subject to constant change. Rather than being a well-drafted taxonomy that corresponds to a quality model such as the one we defined in chapter 4, it is a loose agglomeration of tags with different granularity and semantic overlaps. Therefore, we distinguish between quality flaws on a conceptual level and cleanup templates on a concrete level, with the templates being manifestations of the flaws. In the following, we discuss the characteristics of both cleanup templates and flaws and examine how they are related to each other.

5.2.1.1 Template Scope

Since quality flaws are represented by cleanup templates, it is important to consider the scope of each template in order to locate its respective point of reference. We distinguish between three different scopes. Inline templates are placed directly in the text and refer to the sentence or paragraph they are placed in. Templates with a section parameter refer to the section they are placed in. The majority of templates, however, refer to a whole page. Figure 5.1 shows examples for each scope.

The consideration of template scope is of particular importance for quality flaw recognition problems. For example, the presence of a cleanup template which marks a single section as not notable does not entail that the whole article is not notable. In other cases, however, inline- or section-scope templates can be extended to the whole page. For instance, if a section is marked to contain original research, this also holds true for the complete article. In these cases, templates with a narrower scope help to locate the problems in the tagged article.

5.2.1.2 Template Clusters

Since several cleanup templates might represent different manifestations of the same quality flaw, there is a 1:n relationship between quality flaws and cleanup templates. For instance, the templates pov-check80, pov81 and npov language82 can all be mapped to the same flaw concerning the neutral point of view of an article. The degree of similarity between two templates can differ. We can roughly distinguish three cases:

(1) Two cleanup templates are fully synonymous if one template redirects to the other.
(2) Two cleanup templates are similar, if they capture the same problem type but differ in scope or granularity/specificity.
(3) Two cleanup templates are unrelated, if they are neither synonymous nor similar.

Fully synonymous templates are easy to determine automatically by extracting the redirects from and to the template information pages. The names of synonymous templates are usually spelling variations of each other (e.g. POV statement and Pov-statement) or consist of synonymous terms for the same concept (e.g. POV statement and Neutrality disputed).

Similar templates are more difficult to identify. They describe the same quality problem on different levels of granularity or with a different scope. For example, the template POV-title focuses solely on the neutrality of article titles, while POV-section or POV indicate neutrality problems in individual sections or whole articles respectively. Even though the target problem of all of these templates is neutrality, it will depend on the application whether they should all be aggregated under the same flaw. The similarity of one template to others is often indicated by a reference in the “See also” section of the template information page. However, these pages are not structured consistently, which makes automatic extraction of this information impossible. The template category system83 is also not a reliable resource for determining template similarity, since it is merely a functional classification of the templates rather than a well-drafted semantic taxonomy. Therefore, we regard the selection of similar templates to be a manual task.

Finally, unrelated templates are identified by ruling out any similarity or synonymy according to the definition above.

80The article has been nominated for a neutrality check
81The neutrality of the article is disputed
82The article contains a non-neutral style of writing

Figure 5.1: Examples for cleanup templates with a) page scope, b) section scope, c) inline scope. Source of example: http://en.wikipedia.org/wiki/index.php?oldid=583105148

83http://en.wikipedia.org/wiki/Category:Wikipedia_maintenance_templates


For each cluster, one template is defined as the nucleus, which is used as the label for the flaw that is represented by the cluster. We choose the template with the highest number of synonyms as the nucleus. In cases of ties, i.e. multiple templates with the same number of synonyms, we choose the template with the most comprehensive template information page, since a detailed description indicates a higher importance of the template.
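The clustering and nucleus-selection procedure lends itself to a compact sketch. The following Python fragment is purely illustrative and not part of the toolchain described in this thesis; the redirect mapping and description lengths are hypothetical stand-ins for the information that would be collected from the template information pages.

# Illustrative sketch: grow a template cluster by following redirects and
# pick the nucleus as the template with the most synonyms (ties broken by
# the most comprehensive template information page).
def build_cluster(seed, redirects):
    cluster, frontier = {seed}, [seed]
    while frontier:
        template = frontier.pop()
        for synonym in redirects.get(template, []):
            if synonym not in cluster:
                cluster.add(synonym)
                frontier.append(synonym)
    return cluster

def pick_nucleus(cluster, redirects, description_length):
    return max(cluster, key=lambda t: (len(redirects.get(t, [])),
                                       description_length.get(t, 0)))

# Hypothetical redirect data for a POV-like cluster (template -> templates redirecting to it)
redirects = {"POV": ["NPOV", "Neutrality disputed"], "NPOV": ["Pov"]}
description_length = {"POV": 1200, "NPOV": 300, "Pov": 50, "Neutrality disputed": 150}

cluster = build_cluster("POV", redirects)
print(pick_nucleus(cluster, redirects, description_length))  # -> POV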

5.2.1.3 Topical Restriction

Many cleanup templates have restrictions concerning the pages they may be applied to. A hard restriction is the page type (or namespace) a template might be used in. For example, some templates can only be used in articles while others can only be applied to discussion pages. This is usually enforced by maintenance scripts running on the Wikimedia servers. A soft restriction, on the other hand, is the set of topics of the articles a template can be used in. Many cleanup templates can only be applied to articles from certain subject areas. An example with a particularly obvious restriction is the template in-universe (see table 5.3), which should only be applied to articles about fiction. This topical restriction is neither explicitly defined nor automatically enforced, but it plays an important role in the quality flaw recognition task, as the remainder of this chapter will show.

While flaws merely concerning the structural or linguistic properties of an article are less restricted to individual topics, they are still affected by a certain degree of topical preference. Many subject areas in Wikipedia are organized in WikiProjects84, which have their own ways of reviewing and ensuring quality within their topical scope. Depending on the quality assurance processes established in a WikiProject, different importance is given to individual types of flaws. Thus, the distribution of cleanup templates regarding structural or grammatical flaws is also biased towards certain topics. We will henceforth subsume the concept of topical preference under the term topical restriction.

5.2.2 Definition of the Quality Flaw Detection Task

Based on the above definition of quality flaws, we define the quality flaw detection task similarly85 to Anderka et al. (2012) as follows:

Given a sample of articles in which each article has been tagged with any cleanup template τi from a specific template cluster Tf, thus marking all articles in the sample with a quality flaw f, it has to be decided whether or not an unseen article suffers from f.

84http://en.wikipedia.org/wiki/WP:PROJ
85Anderka et al. (2012) consider each flaw to be represented by a single cleanup template rather than by a cluster of similar templates.


Flaw           κ     F1
Advert        .60   .80
Confusing     .60   .80
Copy-edit     .00   .50
Essay-like    .60   .80
Globalize     .60   .80
In-universe   .80   .90
Peacock       .70   .84
POV           .60   .80
Technical     .90   .95
Tone          .40   .70
Trivia        .20   .60
Weasel        .50   .74

Table 5.1: Agreement of the human annotator with the gold standard. The corpus for this small study consists of 20 articles per flaw, half of which are flawed.

We cast this task as a binary classification problem in which a classifier trained on a set of articles that contain the quality flaw f (positive instances) and a set of articles that do not contain f (negative instances) learns to identify unseen articles suffering from f. It is therefore necessary to provide both reliable examples and reliable counterexamples of flawed articles in order to achieve a sufficiently high classification performance. However, no articles are marked as not containing a particular quality flaw. Consequently, there is no straightforward way to sample flawless articles for a given flaw f. We therefore propose an approach to extract reliable positive and negative training instances from the article revision history in section 5.3.2. On the data extracted with this approach, we train individual binary classifiers for each quality flaw. It is possible to combine these classifiers in an ensemble method in order to achieve joint classification of multiple flaws (Fujino et al., 2008).

5.2.3 Reliability of Cleanup Templates as Quality Flaw Markers

Arazy and Kopak (2011) discuss that it is important to assess how well humans agree in their quality judgments. This is even more the case when predicting quality flaws that are represented by multiple community-assigned labels. Our approach to quality flaw detection in Wikipedia is based on the assumption that cleanup templates are valid markers of quality flaws. In order to test the reliability of these user-assigned templates as quality flaw markers, we carried out an annotation study in which one human annotator was asked to perform the binary flaw detection task manually. For this study, we selected the same set of quality flaws that we define for the NSTYLE corpus described in section 5.3.4. For each flaw, the human rater was provided with the description of the nucleus of the template cluster, which we extracted from the respective template information page. We extracted the plain text of 10 random flawed articles and 10 random untagged articles for each flaw and presented the texts to the annotator. The annotator had to decide for each flaw individually whether a given text belonged to a flawed article or not. She was not informed about the ratio of flawed to untagged articles.

Table 5.1 lists the chance-corrected agreement between the human annotator and the gold standard using Cohen’s κ (Carletta, 1996), which is commonly used to assess agreement between two human raters. The metric is defined as

\[ \kappa = \frac{p_0 - p_c}{1 - p_c} \]

where p0 refers to the observed agreement between the human rater and the gold standard and pc refers to the chance agreement, which is 0.5 in all cases of this balanced sample corpus. We furthermore report the corresponding F1 performance of the human predictions against the gold standard, defined as follows:

\[ F_1 = \frac{2 \cdot \text{true positives}}{2 \cdot \text{true positives} + \text{false negatives} + \text{false positives}} \]

The templates copy-edit and trivia yielded the lowest performance in the study. Even though copy-edit templates are assigned to whole articles, they refer to grammatical and stylistic problems of relatively small portions of the text. That is, they mark local phenomena rather than the overall state of an article. This increases the risk of overlooking a problematic span of text, especially in longer articles. The trivia template, on the other hand, designates sections that contain miscellaneous information that is not well integrated in the article. Upon manual inspection, we found a wide range of possible manifestations of this flaw, ranging from an agglomeration of incoherent factoids to well-structured sections that did not exactly match the focus of the article, which is the main reason for the low agreement.

Even though this small-scale study is not exhaustive, it gives a clear indication that the scope of cleanup templates has to match the scope of the quality flaw. That is, if a flaw only concerns a small portion of an article, it should be represented by inline- or section-scope templates rather than by an article-scope template. This issue is not yet addressed in this thesis as it focuses on article classification and thus on article-scope templates alone. However, section 5.6 discusses first attempts towards sentence-level classification using cleanup templates. Section 5.7 furthermore considers the general limits of the predictability of quality flaws in Wikipedia.
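For concreteness, both agreement measures can be computed from raw annotation counts as in the following small Python sketch. The counts are hypothetical, chosen so that they reproduce a κ of .60 and an F1 of .80 as in several rows of table 5.1; the chance agreement of 0.5 follows from the balanced samples used in the study.

def cohens_kappa(p_observed, p_chance=0.5):
    # chance agreement is 0.5 for the balanced samples of this study
    return (p_observed - p_chance) / (1.0 - p_chance)

def f1_score(tp, fp, fn):
    return 2 * tp / (2 * tp + fn + fp)

# Hypothetical confusion counts for one flaw (10 flawed, 10 untagged articles)
tp, fp, fn, tn = 8, 2, 2, 8
p0 = (tp + tn) / (tp + fp + fn + tn)
print(round(cohens_kappa(p0), 2), round(f1_score(tp, fp, fn), 2))  # 0.6 0.8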

5.2.4 Coverage of the Article Quality Model by Cleanup Templates

While cleanup templates identify a wide range of quality flaws, not all dimensions defined in the Wikipedia article quality model (see chapter 4.4) are equally well covered. Figure 5.2 gives an overview of how well each dimension is represented. The classification is based both on the number of different templates that can be assigned to a specific dimension and on how frequently the particular templates are used. We therefore use the cleanup template categorization provided in the Wikipedia cleanup template listing86 and manually assign to each category all relevant quality dimensions (see appendix C). Given the number of templates assigned to each quality dimension and the number of occurrences of these templates in Wikipedia, we derive three levels of coverage – full, partial and inadequate – color coded in green, yellow and red. The judgments have been made manually, without assigning absolute thresholds of necessary template assignments to each coverage level. This categorization is therefore subjective. However, it gives an impression of how well the individual dimensions are covered by cleanup templates and of the quality aspects for which we have to employ other means of assessment.

Figure 5.2: Dimensional coverage of the article quality model by cleanup templates, shown as a tree of the four categories (intrinsic, contextual, writing and organizational quality) and their dimensions. Green indicates full coverage, yellow partial coverage and red inadequate coverage of the dimension.

86http://en.wikipedia.org/wiki/WP:TC


5.2.5 Quality Flaw Markers in Non-English Wikipedias

Even though this dissertation mainly focuses on the English language and the experiments described in this chapter have been carried out on the English Wikipedia, the methodology itself is language independent. However, in order for the approach to be applicable to a particular language other than English, the respective language edition of Wikipedia needs to make use of a similar system of cleanup templates. It is not easy to exhaustively determine how many of the 287 Wikipedia language versions make active use of cleanup templates as part of their quality assurance process. According to Wikidata (see chapter 3.7), 21 language editions have a cleanup template overview page87 similar to the English Wikipedia, while 18 Wikipedias have a cleanup template category88. When factoring out the overlap between the two, we can assume that at least 31 wikis employ cleanup templates to mark quality flaws.

The English Wikipedia, as the language version with the biggest community, has the most comprehensive system of cleanup templates. However, especially the smaller Wikipedias (e.g. Sinhalese, Oriya, Khmer) tend to directly adapt the English system of cleanup templates and therefore exhibit a sophisticated system of flaw markers. Due to the small size of these wikis, the amount of available training data is nevertheless limited. The German Wikipedia, the second largest language edition in terms of the number of articles, employs a highly reduced set of cleanup templates. They follow the rationale that a large number of tags cannot be handled consistently by a large number of untrained community members and that the reduction to a core selection of templates will therefore be the best solution. The German Wikipedia currently employs 10 templates in four categories. The usage of these templates and the cleanup activities associated with them are centrally monitored on the corresponding WikiProject page89.

5.3 Quality Flaw Corpora

In this section, we describe two corpora that we use for our quality flaw detection experiments. The CLEF corpus has been introduced by Anderka and Stein (2012) in the Competition on Quality Flaw Prediction in Wikipedia as part of the PAN lab at the 2012 Conference and Labs of the Evaluation Forum (CLEF). It is based on the English Wikipedia and represents the ten most common quality flaws.

In our experiments with the CLEF corpus, we found several aspects to negatively influence the reliability and performance of text classifiers trained on this dataset. We therefore discuss these problems in detail and show how they affect machine learning algorithms in real-life scenarios. Finally, we present the NSTYLE corpus, a topically balanced corpus of neutrality and style flaws with reliable negative examples, i.e. documents without the respective flaws, which we designed to solve the problems of the CLEF corpus. It furthermore represents two classes of quality problems that are of particularly high importance for the Wikipedia community and that are also relevant for textual resources other than Wikipedia.

87http://www.wikidata.org/wiki/Q9136874
88http://www.wikidata.org/wiki/Q8219083
89http://de.wikipedia.org/wiki/WP:WPWB

Flaw                Training    Test   Description
Advert                 1,109   2,000   The article appears to be written like an advertisement and should be rewritten from a neutral point of view.
Empty section          5,757   2,000   The article has at least one section that is empty.
No footnotes           3,150   2,000   The article includes a list of references, related reading or external links, but its sources remain unclear because it lacks inline citations.
Notability             6,068   2,000   The article does not meet the general notability guideline.
Original research        507   1,014   The article may contain original research and should be improved by verifying the claims made and adding references.
Orphan                21,356   2,000   The article is an orphan, as no other articles link to it.
Primary sources        3,682   2,000   The article relies on references to primary sources or sources affiliated with the subject and does not contain sufficient citations from reliable and independent sources.
Refimprove            23,144   1,998   The article needs additional citations for verification.
Unreferenced          37,572   2,000   The article does not cite any references or sources.
Wikify                 1,771   1,998   The article needs to be wikified, i.e. internal and external links should be added.
Untagged              50,000       –   Article without any cleanup templates.

Table 5.2: Flaw definitions and numbers of training and test instances per flaw. The training sets exclusively contain articles tagged with the respective flaw (except for untagged). The test sets contain a balanced number of flawed and untagged articles.

5.3.1 The CLEF Corpus

The CLEF corpus90 reflects the ten most frequently tagged quality flaws in Wikipedia and consists of a training and a test set for each flaw. It is a subsample of the PAN-WQF-12 corpus91 and has been compiled for the 2012 Competition on Quality Flaw Prediction in Wikipedia (Anderka and Stein, 2012). The training set consists of 104,116 articles extracted from the English Wikipedia snapshot from January 4, 2012, which are labeled with the respective quality flaws. Furthermore, a set of 50,000 untagged articles is provided. While it is not guaranteed that articles without cleanup tags do not have quality problems, the assumption underlying this corpus is that they provide reasonable examples for articles without quality flaws. The test set contains a balanced number of flawed and untagged articles and has a total size of 19,010 documents. Among the untagged articles in the test corpus, 10% are featured articles.

It has to be noted that each flaw in the CLEF corpus is represented by only a single template. Similar templates have not been aggregated into flaw clusters as we suggested in section 5.2.1. We implemented this aggregation approach for the NSTYLE corpus that is introduced in section 5.3.4.

Table 5.2 shows the definitions of all flaws in the CLEF corpus as they are displayed on the template information pages and lists the numbers of articles for each flaw in the respective training and test sets. All articles are provided as plain text in the original MediaWiki markup. In the test set, the cleanup templates representing the quality flaw of the respective class have been removed from the markup, but have been made available as ground truth in a separate file.

90Available under http://www.webis.de/research/events/pan-12/pan12-web/wikipedia-quality.html
91http://www.uni-weimar.de/en/media/chairs/webis/research/corpora/corpus-pan-wqf-12

Figure 5.3: Concept of one-class classification according to Tax (2001, pp. 4, 14): (a) one-class classifiers separate all given labeled instances from any outliers, while (b) two-class classifiers separate the two differently labeled classes.

5.3.2 Reliability of Training Instances

A central problem of the quality flaw recognition approach based on cleanup template prediction is the fact that no articles are tagged to not contain a particular quality problem. In other words, there are no explicit textual or formal indicators that can be used to retrieve counterexamples for flawed articles. However, the majority of supervised machine learning algorithms for classification problems are two- or multi-class approaches that need both positive and negative examples for learning a decision boundary that is supported from both sides by example instances (Tax, 2001). So far, two approaches have been proposed by related work to circumvent this problem.


One-Class Classification. Anderka et al. (2012) tackle the problem with a one-class classifier that is trained on the positive instances alone, thus eradicating the need for negative instances in the training phase (see figure 5.3). Tax (2001) describes three main approaches to one-class classification, i.e. density estimation, boundary methods and reconstruction methods. For all of these approaches, the learner has to be able to measure the distance of any unknown document to the given training examples and has to learn a threshold on the distance for deciding the class assignment. Anderka et al. use a combination of density estimation and class probability estimation based on the cleanup template frequency in Wikipedia. For evaluating the performance of this classifier in terms of precision, however, it is necessary to provide a set of representative examples that can serve as outliers for the given target class. This closely resembles the initial problem of non-existing negative examples. The authors circumvent the issue by evaluating their classifiers on a set of random untagged instances and a set of featured articles and claim that the actual performance of detecting the quality flaws lies between the two. Therefore, we argue that the original problem is only partially solved by the one-class classification approach.

PU Learning. Ferretti et al. (2012) follow a two-step classification approach designed for learning from positive examples and unlabeled data (PU learning). The idea is to employ different classifiers for preselecting suitable training instances and for performing the actual predictions (Liu et al., 2002, 2003). In the first phase, the authors use a Naive Bayes classifier trained on positive instances and random untagged articles to pre-classify the data. The assumption behind this is that the negatives identified in the pre-classification step will be better counterexamples than the initially selected random untagged articles. In the second phase, they use these negatives together with the original set of positive instances to train a Support Vector Machine that produces the final predictions. Figure 5.4 shows a schematic overview of the concept. Even though the Naive Bayes classifier is supposed to identify reliable negatives, the authors found no significant improvement over a random selection of negative instances. This, however, effectively renders the PU learning approach redundant.

Since none of the approaches to circumvent the need for negative examples have been effective, we argue that we need to develop a dedicated method for identifying reliable negatives to perform the quality flaw prediction task efficiently and reliably. Our solution to this problem is described in section 5.3.4, where we discuss the construction of our NSTYLE corpus.
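The two-phase idea can be sketched compactly. The following fragment uses scikit-learn purely as an illustration of the workflow and is not the setup used by Ferretti et al. (2012); it assumes that P and U are lists of plain-text articles and that phase 1 yields at least some predicted negatives.

import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

def pu_learn(P, U):
    """P: texts tagged with the flaw; U: untagged (unlabeled) texts."""
    vec = TfidfVectorizer()
    X = vec.fit_transform(P + U)
    X_p, X_u = X[:len(P)], X[len(P):]

    # Phase 1: a Naive Bayes classifier trained on P vs. U pre-classifies U;
    # the documents it predicts as negative are kept as "reliable negatives".
    nb = MultinomialNB().fit(X, [1] * len(P) + [0] * len(U))
    rn_idx = np.flatnonzero(nb.predict(X_u) == 0)
    X_rn = X_u[rn_idx]

    # Phase 2: an SVM trained on P vs. the reliable negatives yields the
    # final flaw classifier.
    svm = LinearSVC().fit(vstack([X_p, X_rn]),
                          [1] * X_p.shape[0] + [0] * X_rn.shape[0])
    return vec, svm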

5.3.3 Topic Bias

In addition to the reliability of the training instances, another central aspect has to be considered when compiling training corpora for quality flaw classifiers. In section 5.2.1, we discussed that cleanup templates can be topically restricted, i.e. they occur exclusively or more likely in articles from particular subject areas. In return, sets of articles that are tagged with the same cleanup template will be biased towards these topics. If this bias is not considered when sampling the negative instances, the positive and the negative set will substantially differ in topic, thus resulting in a topically biased dataset. As a consequence, a classifier intended for quality flaw detection is likely to degenerate into a topic classifier and show unrealistically high cross-validated performance in evaluation.

Topic bias is a known problem in text classification. Mikros and Argiri (2007) investigate the topic influence in authorship attribution. They found that even simple stylometric features, such as sentence and token length, readability measures or word length distributions, show considerable correlations with the topic. They argue that many features that were largely considered to be topic neutral are in fact topic-dependent variables. Consequently, results obtained on multi-topic corpora are prone to be biased by the correlation of authors with specific topics. Therefore, several authors introduce topic-controlled corpora for applications such as author identification (Koppel and Schler, 2003; Luyckx and Daelemans, 2004) or genre detection (Finn and Kushmerick, 2006).

Brooke and Hirst (2011) measured the topic bias in the International Corpus of Learner English and found that it causes a substantial skew in classifiers for native language detection. In accordance with Mikros and Argiri, the authors found that even non-lexicalized meta features, such as vocabulary size or length statistics, depend on topics and cause cross-validated performance evaluations to be unrealistically high. In a practical setting, these biased classifiers hardly exceed chance performance.

Figure 5.4: Concept of PU learning according to Ferretti et al. (2012, p. 4). Classifier 1 identifies reliable negatives (RNs) among the untagged articles U, which are then used as input for classifier 2. Both classifiers use the same positive instances P.

In the context of Wikipedia quality flaw detection, figure 5.5a illustrates the problem that arises from topic-agnostic sampling as exhibited by the one-class approach and the PU learning approach92 described earlier. Both approaches sample random negative instances Arnd for any given set of flawed articles Af from a set of untagged articles Au, without taking into account the topical restriction for the given flaw f. The articles that conform to the topical restriction of f are indicated by the set Atopic, which contains articles with a topic distribution similar to the flawed articles in Af. A flaw that is restricted to a very narrow set of topics, such as in-universe, consequently has a small Atopic, while flaws with a topical preference rather than a topical restriction will have a larger Atopic. In any case, the topic distribution of Atopic is clearly skewed compared to the near-random topic distribution of Au. Consequently, the topical differences between Arnd and Af are a predominant feature for a classifier to pick up on.

In order to factor out the article topics as a major characteristic for distinguishing flawed articles from the set of outliers, reliable negative instances Arel have to be sampled from the restricted topic set Atopic (see figure 5.5b). This will avoid the systematic bias and result in a more realistic performance evaluation. The fact that it is much easier to determine a decision boundary between Af and Arnd than between Af and Arel explains why topic-agnostic classifiers show unrealistically high cross-validated results in the evaluation. In the following section, we describe the NSTYLE corpus, in which the topic distributions of negative and positive instances are controlled by dedicated sampling techniques. We furthermore describe our approach for extracting reliable negative and positive training instances from the Wikipedia article revision history.

92Even though the PU learning approach selects negative instances with a meta classifier rather than performing random sampling, the result is similar to the random sampling approach, as we discussed before. Thus, without loss of generality, we consider the subsample to be random.

Figure 5.5: Sampling of negative instances for a given set of flawed articles (Af): (a) random negatives (Arnd) are sampled from articles without any cleanup templates (Au); (b) reliable negatives (Arel) are sampled from the set of articles (Atopic) with the same topic distribution as Af.

5.3.4 The NSTYLE Corpus

With the NSTYLE corpus, we pursue two separate goals. First, we extend the scope of the quality flaw detection experiments from the ten most frequent flaws to a newly compiled set of 12 flaws from the categories neutrality and style. Neutrality is one of the most discussed aspects in the Wikipedia community and directly anchored in the five pillars of Wikipedia (see chapter 3.1). Stylistic aspects, on the other hand, are mainly situated in the writing quality layer of our article quality model and are largely underrepresented by existing quality assessment procedures (see section 5.2.4). We therefore chose these two categories to show that the quality flaw detection approach does not only work for the most frequent flaw types, which was the selection criterion for the flaws in the CLEF corpus. Second, we aim at demonstrating how the influence of the topic bias that was discussed earlier in this section can be factored out by sampling reliable training instances from the revision history.

5.3.4.1 Selection of Flawed Articles

We start with selecting all cleanup templates listed under the categories neutrality and style of writing in the topology of cleanup templates shown in appendix C. Each of the selected templates serves as the nucleus of a template cluster that potentially represents a quality flaw. To each cluster, we add all templates that are synonymous to the nucleus. The synonyms are listed in the template description under redirects or shortcuts. Then we iteratively add all synonyms of the newly added templates until no more redirects can be found. Furthermore, we manually inspect the lists of similar templates in the see also sections of the template descriptions and include all templates that refer to the same concept as the other templates in the cluster. As mentioned earlier, this is a subjective task and largely depends on the desired granularity of the flaw definitions. We finally merge semantically similar template clusters to avoid too fine-grained flaw distinctions. As a result, we obtain a total number of 94 template clusters representing 60 style flaws and 34 neutrality flaws.

From each of these clusters, we remove templates with inline or section scope for the reasons outlined in section 5.2.1.1. We also remove all templates that are restricted to pages other than articles (e.g. discussion or user pages). We use JWPL (see chapter 3.6.2) to extract all articles marked with the selected templates. We only regard flaws with at least 500 affected articles in the snapshot of the English Wikipedia from January 4, 2012. Table 5.3 shows an overview of the flaws represented in the NSTYLE corpus. For each flaw, the nucleus of the template cluster is provided along with a description, the number of affected articles, and the size of the template cluster.

5.3.4.2 Extraction of Reliable Instances

As we have argued in section 5.3.2, the extraction of documents that do or do not exhibit a particular quality flaw is an important and non-trivial task. The quality of a machine learning classifier will largely depend on the quality of the data it is trained on and therefore on the success of the data sampling process. In the following, we present our approach to extracting reliable negative training instances that conform with the topical restrictions of the cleanup templates.

Reliable Negatives. Without loss of generality, we assume that an article from which a cleanup template τ ∈ Tf is deleted at a point in time dτ no longer suffers from flaw f at that point in time. Thus, the revision rdτ is a reliable negative instance for the flaw f. Additionally, since the article was once tagged with τ ∈ Tf, it belongs to the same restricted topic set Atopic as the positive instances for flaw f.

            Flaw          Articles   Cluster size   Description
Neutrality  Advert(a)        7,332        2         The article appears to be written like an advertisement and is thus not neutral
            POV              5,086       10         The neutrality of this article is disputed
            Globalize        1,609        1         The article may not represent a worldwide view of the subject
            Peacock          1,195        1         The article may contain wording that merely promotes the subject without imparting verifiable information
            Weasel             704        4         The article contains vague phrasing that often accompanies biased or unverifiable information
Style       Tone             4,563        6         The tone of the article is not encyclopedic according to the Wikipedia Manual of Style
            In-universe      2,227        1         The article describes a work or element of fiction in a primarily in-universe style(b)
            Copy-edit        1,954        6         The article requires copy editing for grammar, style, cohesion, tone, or spelling
            Trivia           1,282        2         Contains lists of miscellaneous information
            Essay-like       1,244        1         The article is written like a personal reflection or essay
            Confusing        1,084        1         The article may be confusing or unclear to readers
            Technical          690        2         The article may be too technical for most readers to understand

(a) Also represented in the CLEF corpus.
(b) According to the Wikipedia Manual of Style, an in-universe perspective describes the article subject matter from the perspective of characters within a fictional universe as if it were real.

Table 5.3: NSTYLE corpus of neutrality and style flaws. The cluster size refers to the number of templates used to represent the particular flaw (see section 5.2.1).

We use the Apache Hadoop93 framework and WikiHadoop94, an input format for Wikipedia XML dumps, for crawling the whole revision history of the English Wikipedia on a compute cluster to create an index of reliable negative instances for all templates found in the dataset (see figure 5.6). WikiHadoop allows each Hadoop mapper to receive adjacent revision pairs, which makes it possible to compare the changes made from one revision to the next. For every template τ found in the dataset, we extract all pairs of adjacent revisions (rdτ−1, rdτ), in which the first revision contains τ and the second one does not contain τ, and store them in an aggregated index. From this index, we can retrieve all reliable negative instances for any template.

For extracting the final set of reliable negative instances for a given flaw f, we retrieve from the index all revisions for each template in the template cluster of f, i.e. all revisions for any τ ∈ Tf.

93http://hadoop.apache.org
94https://github.com/whym/wikihadoop


Figure 5.6: Distributed extraction of reliable negative training instances from Wikipedia XML dump on a compute cluster using Hadoop

Since there are occasions in which a template is replaced by another template from the same cluster rather than being deleted, we ensure that rdτ does not contain any other template from cluster Tf before we finally add the revision to the set of reliable negatives for flaw f.

The main effort of this approach lies in the one-time creation of the index as described above. Creating the actual sets of reliable revisions for individual flaws from this index can be achieved with little work. The performance of the index creation process could further be improved by employing more than one reducer in the MapReduce process. However, in this case, the output of all reducers would have to be aggregated again to obtain a single index in the end. We did not attempt this in our experiments. Table 5.5 lists the number of reliable negative instances extracted for each flaw.
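The core of the extraction can be illustrated with a small, non-distributed Python sketch. It is only an illustration of the logic: the actual extraction runs as a Hadoop/WikiHadoop job over the complete XML dump, and the simple regular expression used here for template detection is a stand-in for proper markup parsing. The `revisions` argument is assumed to be an iterable of (revision id, wiki markup) pairs in chronological order, and `cluster` a set of lowercased template names.

import re

def contains_template(markup, cluster):
    """True if the markup transcludes any template from the cluster (crude check)."""
    names = re.findall(r"\{\{\s*([^|}]+?)\s*[|}]", markup)
    return any(name.strip().lower() in cluster for name in names)

def reliable_negatives(revisions, cluster):
    """Yield revision ids in which a cluster template has just been removed
    and no other template from the same cluster remains."""
    previous_markup = None
    for rev_id, markup in revisions:
        if previous_markup is not None:
            if contains_template(previous_markup, cluster) and not contains_template(markup, cluster):
                yield rev_id  # template deleted here -> reliable negative
        previous_markup = markup

# Hypothetical mini example
cluster = {"pov", "npov", "pov-check", "npov language"}
revisions = [(1, "Text {{POV}} more text"), (2, "Text more text")]
print(list(reliable_negatives(revisions, cluster)))  # -> [2]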

Reliable Positives. Even though the issue of finding reliable negative instances is the most pressing one, since we do not have any labels indicating articles without a particular problem, it is also important to ensure the quality of the positive instances, for which we do have these labels. The main assumption of the quality flaw detection approach is that Wikipedia articles exhibit a particular quality problem as long as they are marked with the corresponding cleanup template and that the cleanup template is removed as soon as the quality problem is solved. However, as a manual inspection of our corpora has shown, it is possible that quality problems are solved while the corresponding cleanup templates remain in the article. A scenario for this could be that a user who recently improved an article might first consult the Talk page of the article (see chapter 6) to confirm whether they sufficiently solved the problem before removing the tag. Depending on the discussion activity of the particular article, this could take several days. As a consequence, there is a period of time in which the article carries a cleanup template without exhibiting the corresponding flaw. These instances constitute false positives in the training data.

In order to solve this problem, we have to identify reliable positive instances. Similarly to the approach described above, we backtrack the revision history of every article with a particular cleanup template until we find the revision in which the template first appeared. This revision is regarded as the reliable positive instance. For our experiments, we only consider pages as positive instances that are marked with a flaw at the time the Wikipedia dump was created. For these pages, we extract the reliable revision as described above. In order to increase the amount of available training data, it is possible to extract additional instances from the revision history to include pages that once suffered from a particular flaw.

Having defined the concepts of reliable negatives and reliable positives, we now describe the three different dataset configurations that we compile from the NSTYLE corpus for our machine learning experiments. The NSTYLE-BASE configuration resembles the sampling methodology of the CLEF corpus and contains the latest revisions of any tagged article as positive instances and random untagged articles as negative instances. The NSTYLE-RELP configuration makes use of the reliable sampling technique for positive instances as described above. As negative instances, we again sample random untagged articles. Finally, for the NSTYLE-RELALL configuration, we sample both reliable positives and reliable negatives.
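To make the reliable-positive backtracking described above concrete, the following companion sketch (under the same simplifying assumptions as the previous one) walks the chronologically ordered revisions of a currently tagged article and remembers the revision that started the current tagging period; treating that revision as "the revision in which the template first appeared" is a simplification.

import re

def has_cluster_template(markup, cluster):
    names = re.findall(r"\{\{\s*([^|}]+?)\s*[|}]", markup)
    return any(n.strip().lower() in cluster for n in names)

def reliable_positive(revisions, cluster):
    """revisions: list of (rev_id, markup) pairs, oldest first."""
    first_tagged = None
    for rev_id, markup in revisions:
        if has_cluster_template(markup, cluster):
            if first_tagged is None:
                first_tagged = rev_id   # start of the current tagging period
        else:
            first_tagged = None         # tag removed; a later re-tagging starts over
    return first_tagged

# Hypothetical mini example
cluster = {"advert"}
revisions = [(1, "Plain text"), (2, "Text {{Advert}}"), (3, "Text {{Advert|date=May 2011}}")]
print(reliable_positive(revisions, cluster))  # -> 2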

5.3.4.3 Measuring the Topic Bias

As we have argued in section 5.3.3, articles with the same cleanup templates tend to share particular subject areas or topics and can easily be separated from random articles. The more restricted the set of topics of a particular set of positive instances is, the easier it can be separated from random articles with simple lexical features. This topic bias, however, limits the usefulness of the dataset for quality flaw detection experiments, since we are not interested in topic clustering but in identifying quality flaws along with the most descriptive features for these flaws. We therefore first describe a method to quantify the topical similarity between two sets of articles and then measure the similarities between the training sets for the NSTYLE-BASE and the NSTYLE-RELALL configurations in order to show that the topic bias is largely eradicated in the latter approach.

In Wikipedia, the topic of an article is captured by the categories assigned to it. In order to compare two sets of articles with respect to their topical similarity, we represent each article set as a category frequency vector. Formally, we calculate for each set the vector C⃗ = (wc1, wc2, …, wcn), with wci being the weight of category ci, i.e. the number of times it occurs in the set, and n being the total number of categories in Wikipedia. We can then estimate the topical similarity of two article sets by calculating the cosine similarity of their category frequency vectors C⃗1 (denoted A) and C⃗2 (denoted B) as

\[ \mathrm{sim}(A,B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i \times B_i}{\sqrt{\sum_{i=1}^{n} (A_i)^2} \times \sqrt{\sum_{i=1}^{n} (B_i)^2}} \]

Flaw          sim(Af, Arel)   sim(Af, Arnd)
Advert             .996            .118
Confusing          .996            .084
Copy-edit          .993            .197
Essay-like         .996            .132
Globalize          .992            .023
In-universe        .996            .014
Peacock            .995            .310
POV                .994            .252
Technical          .995            .018
Tone               .996            .228
Trivia             .980            .184
Weasel             .976            .252

Table 5.4: Cosine similarity scores between the category frequency vectors of the flawed article sets and the respective random or reliable negatives.

Table 5.4 gives an overview of the similarity scores between each positive training set and the corresponding reliable negative set as well as between each positive set and a random set of untagged articles. We can see that the topics of articles in the positive training sets are highly similar to the topics of the corresponding reliable negative articles, while they show little similarity to the articles in the random set. This implies that the systematic bias introduced by the topical restriction has largely been eradicated by our approach.

Individual flaws have differently strong topical restrictions. The strength of this restriction depends on the size of Atopic, i.e. the set of articles with the same topic distribution as the flawed articles. In other words, a flaw such as in-universe is restricted to a very narrow selection of articles, while a flaw such as copy-edit can be applied to most articles and rather shows a topical preference, for the reasons outlined in section 5.2.1. It is therefore to be expected that flaws with a small Atopic are more prone to the topic bias.
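The measure translates directly into code. The following Python sketch is purely illustrative; the mapping from article ids to Wikipedia categories shown here is hypothetical example data.

from collections import Counter
from math import sqrt

def category_vector(article_ids, article_categories):
    """Build a category frequency vector for a set of articles."""
    counts = Counter()
    for aid in article_ids:
        counts.update(article_categories.get(aid, []))
    return counts

def cosine_similarity(a, b):
    dot = sum(a[c] * b.get(c, 0) for c in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical article-to-category data
article_categories = {1: ["Fiction", "Novels"], 2: ["Fiction", "Films"], 3: ["Physics"]}
flawed = category_vector([1, 2], article_categories)
negatives = category_vector([2, 3], article_categories)
print(round(cosine_similarity(flawed, negatives), 3))  # -> 0.707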


5.3.4.4 Corpus Analysis

We close this section with an overview of the properties of the NSTYLE corpus. Table 5.5 lists the number of reliable positive and negative instances for each flaw type. It furthermore shows the total number of tagged revisions in the snapshot of the English Wikipedia from January 4, 2012 and the date of the first appearance of each flaw. The latter has been computed by identifying the oldest article revision in the Wikipedia dump that contained any cleanup template from the template cluster of the respective flaw.

While the flaws Technical and Weasel hardly exceed the minimum of 500 affected articles that we defined in the sampling process, the majority of flaws exhibit between 1,000 and 2,000 affected articles. The most frequently observed flaws, Advert, POV and Tone, occur nearly 17,000 times – more often than all the other nine flaws combined.

The number of reliable negatives differs more substantially across all flaw types, depending on how actively a particular set of cleanup templates has been used, how fast the flaw can be corrected and how long the type of flaw has already existed in the English Wikipedia. While the number of positive instances indicates the current95 articles that suffer from this flaw, the number of negatives indicates the total number of times the particular flaw was corrected. We can therefore see the ratio of positives to negatives as a rough proxy for how easily, and potentially how fast, a particular flaw can be corrected. For instance, 5,086 articles were marked with the flaw POV at the time the Wikipedia dump was created, while cleanup templates belonging to the POV template cluster have been removed from 105,066 articles in the past. In contrast, the flaw Advert occurred 7,332 times at the time of dump creation, but was only corrected 39,133 times before. Even though the Advert flaw appeared for the first time roughly one year after the POV flaw, these numbers allow the assumption that POV flaws are corrected faster than Advert flaws. Anderka (2013) performed an analysis of average correction times for cleanup templates without considering their aggregation to template clusters. He found that the average time needed to fix an article-scope template is 176 days.

We provide descriptive statistics for the flawed articles in the NSTYLE corpus in figure 5.7, including an overview of the article age, number of unique contributors, number of revisions and the article length in tokens. This should give an impression of the properties of the articles in the corpus rather than characterize the individual quality flaws.

5.4 A System for Quality Flaw Detection

In this section, we first describe the architecture and setup of FlawFinder, our quality flaw prediction system, and then provide an overview of the features used to capture quality flaws. In the following section, we furthermore proceed with a description of our experiments both on the CLEF corpus96 and the NSTYLE corpus, followed by a detailed evaluation and error analysis.

95At the point in time when the Wikipedia dump was created.


Figure 5.7: Descriptive statistics for the flawed articles in the NSTYLE corpus: (a) article age in days, (b) number of unique contributors, (c) number of revisions, and (d) number of tokens. The article age is displayed on a linear scale while the other properties are plotted on a logarithmic scale.


Flaw          Positives   Negatives   Total revisions   First appearance
Advert            7,332      39,133           627,844   2005-06-06
Confusing         1,084       6,225           208,296   2005-03-20
Copy-edit         1,954       2,878           168,423   2004-12-30
Essay-like        1,244       3,898           164,243   2007-04-23
Globalize         1,609       8,196           439,264   2005-09-08
In-universe       2,227       5,270           332,159   2006-06-20
Peacock           1,195       7,022           169,199   2006-02-19
POV               5,086     105,066         2,442,626   2004-05-31
Technical           690       2,056            77,518   2005-02-25
Tone              4,563      20,166           948,227   2005-01-01
Trivia            1,282      70,304         2,601,217   2005-04-13
Weasel              704      12,710           397,238   2005-10-07

Table 5.5: This table lists the number of positive and negative instances per quality flaw in the NSTYLE corpus. The column total revisions furthermore lists the number of revisions in the Wikipedia snapshot from January 4, 2012 that are tagged with any cleanup template from the template cluster of the respective flaw. The first appearance refers to the timestamp of the oldest article revision in the dump that contains a template from the respective cluster.

5.4.1 System Architecture

FlawFinder has been implemented as a modular and highly flexible text classification system based on the Unstructured Information Management Architecture (UIMA) (Ferrucci and Lally, 2004). Even though FlawFinder has been developed to predict quality flaws in unseen texts, its basic design can be used for generic text classification tasks. In fact, the system has been further developed into a generic system for supervised learning on textual data and made publicly available as the DKPro Text Classification Framework (DKPro TC) on Google Code (Daxenberger et al., 2014). This general-purpose framework is further described in appendix A.2.

The component software architecture of UIMA enables applications that implement this framework to be decomposed into reusable components that can be arranged into processing pipelines. Within these processing pipelines, the documents are passed on as a common analysis structure (CAS) that can be consumed by every downstream component. These components do not alter the document directly, which remains immutable throughout the process; any analysis output or generated information is instead stored in the CAS as a standoff annotation.

96Since the CLEF corpus does not define template clusters, we regard the provided cleanup template to represent a single-element cluster. Thus, the same task definition can be used for this dataset.


Figure 5.8: High-level system architecture of FlawFinder.

For additional flexibility and modularity, we employ DKPro Lab (Eckart de Castilho and Gurevych, 2011) as a runtime environment for FlawFinder. DKPro Lab is a lightweight framework that allows independent NLP pipelines to be combined into one integrated and highly configurable task-based system. Each task is a self-sufficient processing unit containing a single UIMA pipeline and is responsible for its own data management. Configuration parameters can be injected into each task, and the results of the task under each configuration are stored and re-used whenever possible. Furthermore, it is possible to attach reports to each task in order to monitor, summarize or post-process the intermediate task output or final experiment results. The main advantage of DKPro Lab is its parameter sweeping functionality: given value ranges for each parameter an experiment depends on, DKPro Lab determines the parameter combinations that have to be run and reuses any intermediate output that has already been computed for an earlier configuration.

Overall, FlawFinder consists of five main components: a corpus reader, a linguistic preprocessing engine, a feature extraction unit, a module for training and evaluating classification models, and a report writer.
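To illustrate the parameter sweeping idea, the following Python sketch (purely illustrative; DKPro Lab is a Java framework and its actual API differs, and all names below are hypothetical) shows how an experiment grid with cached intermediate task output might be organized:

```python
import itertools
import hashlib
import json

# Hypothetical cache of intermediate task results, keyed by the configuration
# that produced them. DKPro Lab persists such results on disk; here we simply
# keep them in memory.
_cache = {}

def run_task(task_name, config, compute):
    """Run a task for one configuration, reusing cached output if available."""
    key = hashlib.md5(json.dumps((task_name, config), sort_keys=True).encode()).hexdigest()
    if key not in _cache:
        _cache[key] = compute(config)  # the expensive step runs only once per configuration
    return _cache[key]

def sweep(parameter_space, preprocess, train_and_eval):
    """Evaluate every parameter combination in the space."""
    results = []
    keys = sorted(parameter_space)
    for values in itertools.product(*(parameter_space[k] for k in keys)):
        config = dict(zip(keys, values))
        # Preprocessing depends only on a subset of the parameters, so its
        # output is shared by all configurations that agree on that subset.
        pre_cfg = {k: config[k] for k in ("ngram_min_freq",) if k in config}
        features = run_task("preprocess", pre_cfg, preprocess)
        results.append((config, train_and_eval(features, config)))
    return results

# Example parameter space resembling the ranges used in the experiments.
space = {"num_features": [250, 500, 1000, 1500],
         "ngram_min_freq": [2, 5, 10],
         "classifier": ["naive_bayes", "c45"]}
```

The essential point is that expensive intermediate results are keyed by the parameters they actually depend on, so configurations sharing those parameters reuse the same output.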


Corpus Reader. FlawFinder has been designed as a binary, single-label text classification system. That is, a single run of the system always focuses on an individual flaw f and learns how to determine whether an article suffers from this flaw. Thus, the corpus reader needs to provide the corresponding training instances that are marked with f, i.e. positive instances, and instances that do not exhibit f, i.e. negative instances. For the sake of flexibility, our corpora do not consist of full text but merely of IDs linking to the corresponding articles or article revisions in a preprocessed Wikipedia database. The database is created and accessed with JWPL (section 3.6.2), while access to the revision history is provided by the Wikipedia Revision Toolkit (see appendix A.1). This way, we are able to obtain any available metadata later on without having to decide in advance what information to include in the corpora and how to structure the data. The corpus readers are fed with lists of IDs that have been selected with the sampling techniques described in section 5.3. Depending on the configuration of the experiment, these IDs refer either to articles or to particular article revisions (reliable training instances). Each experiment consists of two sets of IDs containing references to flawed and flawless articles respectively. The reader loads the articles from the database into the processing pipeline and marks them with the flaw label and any credentials necessary for accessing the article in the database. This way, any downstream component that merely requires the article text can directly work on the document that passes through the pipeline, while any component demanding additional information from the database, such as the number of pages linking to the particular article, can access this information later on (see figure 5.8).
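The reading strategy can be sketched as follows (hypothetical Python; the actual reader is a UIMA collection reader backed by a JWPL database, and the names used here are made up for illustration):

```python
from dataclasses import dataclass
from typing import Callable, Iterator

@dataclass
class Instance:
    article_id: int   # ID (or revision ID) in the preprocessed Wikipedia database
    text: str         # article text, fetched from the database on demand
    label: bool       # True = flawed (positive), False = flawless (negative)

def read_instances(positive_ids: list[int],
                   negative_ids: list[int],
                   fetch_text: Callable[[int], str]) -> Iterator[Instance]:
    """Yield labeled training instances from two ID lists.

    The corpora themselves only store IDs; the text and any further metadata
    are retrieved lazily via `fetch_text`, a stand-in for the JWPL / Wikipedia
    Revision Toolkit access layer."""
    for article_id in positive_ids:
        yield Instance(article_id, fetch_text(article_id), True)
    for article_id in negative_ids:
        yield Instance(article_id, fetch_text(article_id), False)
```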

Linguistic Preprocessing. The main goal of the linguistic preprocessing module is to prepare the documents for later processing by the feature extractors. It uses NLP components from DKPro Core (Gurevych et al., 2007) for sentence splitting, tokenization, stop word annotation and named entity recognition and can be extended with additional components depending on the requirements of the feature extractors. We furthermore use the SWEBLE parser (Dohrn and Riehle, 2011a) for parsing the wiki markup and creating a Wikitext Object Model (WOM) representation of the article (Dohrn and Riehle, 2011b). This WOM is an abstract syntax tree that serves the same purpose as a document object model (DOM) for an HTML document. It can be used to query the content of an article in a structured manner.

Feature Extraction. The feature extraction module has been implemented using ClearTK (Ogren et al., 2008), a UIMA-based framework for developing statistical NLP components. It offers interfaces for creating feature extractors that can be used independently of the utilized machine learning algorithm. Even though the extractors are UIMA components and thus can consume annotations created by the preprocessing module, they do not store their output as annotations in the CAS. They rather pass the extracted features to a central feature store which is maintained by the ClearTK framework. When the extraction process

is finished for the whole document collection, the contents of the feature store are converted into the particular format that is required by the machine learning components used in the setup. This decoupling of feature extraction mechanics and machine learning specific formatting makes it possible to create a highly configurable and modular feature extraction pipeline without imposing restrictions on the downstream components for training and classification.
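The decoupling can be pictured schematically as follows (an illustrative Python sketch, not ClearTK's actual interface):

```python
def extract_features(doc, extractors):
    """Run each extractor on a document and collect name/value pairs in a
    plain feature store that is independent of any ML toolkit."""
    store = {}
    for extractor in extractors:
        store.update(extractor(doc))
    return store

def to_csv_row(store, feature_names):
    """Convert one feature store into a toolkit-specific row (here: a simple
    comma-separated line) as a separate, final formatting step."""
    return ",".join(str(store.get(name, 0)) for name in feature_names)
```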

Machine Learning. For training the classifiers and evaluating their performance, the machine learning module employs two different machine learning toolkits. For the experiments on the CLEF corpus we use Mallet, the Machine Learning for Language Toolkit (McCallum, 2002), since it offers a small collection of widely used text classification algorithms and is directly supported by the ClearTK framework. For the experiments on the NSTYLE corpus, we employ Weka (Hall et al., 2009), a Java-based data mining toolkit that provides a larger selection of machine learning algorithms.

Reporting. The reporting module gathers information from the other tasks and generates both individual, detailed reports for each experiment configuration and an overall report with a summary of the results of all configuration runs in a particular setup. This makes it possible to easily compare the performance of different classifiers and classification parameters.

5.4.2 Features

In order to capture the ten quality flaws represented in the CLEF corpus, we initially defined a set of 29 feature types that – according to a manual inspection of flawed articles and according to the flaw definitions – most likely indicate the presence or absence of any of the selected quality flaws. Consequently, these features are very specific to the flaws they are supposed to predict.

For the experiments on the NSTYLE corpus, we aimed at finding universal features that indicate style and neutrality issues rather than tailoring particular features to detect single flaw types. Furthermore, in order to gain insights into how individual feature categories perform in detecting style and neutrality flaws, we grouped the features into four feature sets. NSTYLE-NONGRAM excludes all lexical features, while NSTYLE-NGRAM is restricted to lexical features. NSTYLE-NOWIKI excludes all wiki-specific features such as markup, link structures or categories; we compiled this set in order to identify textual characteristics that can be transferred to texts other than Wikipedia articles. Finally, NSTYLE-ALL includes all features relevant for the NSTYLE corpus without any additional restrictions.


In the remainder of this section, we describe the features used in the CLEF and NSTYLE experiments. An overview of all features and how they are combined in the classification experiments is shown in table 5.6.

Structural Features are supposed to capture basic structural properties and surface characteristics of the Wikipedia articles. As described in the system architecture, we use the SWEBLE parser to create a Wikitext Object Model (WOM) of each page. From this model, we extract all article sections along with their headers. We use the number of sections, the mean length of the section texts and the number of empty sections as features. Furthermore, we extract a plain text representation without wiki markup from the WOM and calculate the ratio of markup to plain text as a fourth surface feature.
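For illustration, these surface statistics could be computed roughly as follows (Python sketch; `sections` stands in for the list of (header, text) pairs obtained from the WOM, and the markup ratio is approximated on the raw wikitext):

```python
def structural_features(sections, wikitext, plain_text):
    """Compute the four structural/surface features described above.

    sections   : list of (header, section_text) pairs extracted from the WOM
    wikitext   : raw article source including markup
    plain_text : plain text rendering without markup
    """
    num_sections = len(sections)
    section_lengths = [len(text.split()) for _, text in sections]
    mean_section_length = (sum(section_lengths) / num_sections) if num_sections else 0.0
    num_empty_sections = sum(1 for _, text in sections if not text.strip())
    markup_to_text_ratio = (len(wikitext) - len(plain_text)) / max(len(plain_text), 1)
    return {
        "num_sections": num_sections,
        "mean_section_length": mean_section_length,
        "num_empty_sections": num_empty_sections,
        "markup_to_text_ratio": markup_to_text_ratio,
    }
```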

Reference Features capture aspects regarding the use of citations in the article. There are basically two types of references, footnote style references and bibliography style references. Footnote style references are marked with <ref>…</ref> tags directly in the text and are automatically listed at the bottom of the page97. Bibliography style references are manually listed at the end of the article, usually in the References section. They can either be created as manually formatted list items or can be marked with cite or citation tags for automatic reference formatting. First, we check whether manually created bibliography items exist in the References section and how many elements it contains. Then we count the number of all inline references in the article and determine their average number per sentence. Finally, we determine the ratio of the number of all references to the length of the article. Analogously to lists of references, it is possible to define lists of explanatory notes using the {{notelist}} template. It is usually placed in the Notes section and gathers all occurrences of explanatory notes which are defined within the text with efn templates. We extract this information in the same way as the references.
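A rough approximation of these counts on raw wikitext might look as follows (hedged Python sketch; the actual features are computed on the parsed WOM rather than with regular expressions):

```python
import re

def reference_features(wikitext, num_sentences, num_tokens):
    """Approximate the citation-related features on raw wikitext."""
    # Inline footnote-style references: <ref>...</ref> and named <ref name="x"> variants.
    inline_refs = len(re.findall(r"<ref[ >]", wikitext, flags=re.IGNORECASE))
    # Explanatory notes defined inline with {{efn}} templates.
    explanatory_notes = len(re.findall(r"\{\{\s*efn", wikitext, flags=re.IGNORECASE))
    return {
        "num_inline_references": inline_refs,
        "references_per_sentence": inline_refs / max(num_sentences, 1),
        "references_to_text_ratio": inline_refs / max(num_tokens, 1),
        "num_explanatory_notes": explanatory_notes,
    }
```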

Network Features reflect the connections of an article within the whole network of Wikipedia articles and to external resources. Since the number of inbound links (i.e. the number of times other articles link to a given article) cannot be determined by parsing the articles in the provided corpora alone, we use the respective information from our JWPL Wikipedia database. When creating a new Wikipedia database from a Wikipedia data dump, JWPL automatically parses the articles using the JWPL Wikitext parser and stores the link information in the database. For each article, we determine the number of wiki-internal inbound links, wiki-internal outbound links and links to resources outside of Wikipedia.

97 Depending on the setup of the page, the references might appear in different sections such as References, Notes or Citations.


Table 5.6: Feature sets used in the experiments on the CLEF and NSTYLE corpora (# indicates counts). The table marks for each feature type whether it is included in the feature sets CLEF-ALL, NSTYLE-NONGRAM, NSTYLE-NGRAM, NSTYLE-NOWIKI and NSTYLE-ALL. The feature types are grouped by category:

Lexical: article ngrams; info to noise ratio
Network: # external links; # inlinks; # inlinks < 3; no inlinks; # outlinks; # outlinks per sentence; # language links
References: has reference list; # references; # references per sentence; references to text ratio; has references
Revision: # revisions; # unique contributors; # registered contributors; article age
Structure: # empty sections; mean section size; # sections; # lists; question rate; markup to text ratio
Readability: ARI; Coleman-Liau; Flesch; Flesch-Kincaid; Gunning Fog; Lix; SMOG grading
Named Entity: # person entities; # organization entities; # location entities; # person / organization / location entities per sentence
Misc: # characters; # sentences; # tokens; average sentence length; article lead length; lead to article ratio; # discussions


Named Entity Features capture the number of named entities in the article. We use the Stanford Named Entity Recognizer (Finkel et al., 2005) with the 3-class model with distributional similarity features98 for tagging all entities of the types Person, Organization and Location. We use both the overall named entity counts and the average number of named entities per sentence as features.

Revision-based Features are based on metadata derived from the article revision history. We use the Wikipedia Revision Toolkit (WRT) (Ferschke et al., 2011) to determine the number of revisions for each article. Furthermore, we count the number of unique users that edited the page in the past. Since this number also includes anonymous users, who might be counted several times due to changing IP addresses, we additionally determine the number of unique registered users. Finally, we capture the age of the article in days. The WRT is described in more detail in appendix A.1.

Lexical Features are extracted from the plain article text that we obtain from the Wikitext Object Model created by the SWEBLE parser. Any wiki markup is removed except for internal and external links. All links are replaced with a generic EXPLICITLINK label. Furthermore, we perform stopword filtering using the stopword list from the Snowball stemmer99, which we augmented with punctuation marks. We extract all token unigrams, bigrams and trigrams from each article and disregard any ngrams with a frequency lower than 5 across the corpus. This cutoff value was determined empirically during the parameter optimization run. We found that a value of 5 was optimal for all flaws.
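Conceptually, the ngram extraction with frequency cutoff corresponds to the following simplified sketch (Python; a stand-in for the ClearTK ngram extractors actually used):

```python
from collections import Counter

def extract_ngrams(tokenized_docs, n_values=(1, 2, 3), min_freq=5, stopwords=frozenset()):
    """Count token n-grams over a corpus and discard rare ones.

    tokenized_docs: list of token lists (links already replaced by EXPLICITLINK,
                    stopwords and punctuation to be filtered out).
    """
    counts = Counter()
    for tokens in tokenized_docs:
        filtered = [t.lower() for t in tokens if t.lower() not in stopwords]
        for n in n_values:
            for i in range(len(filtered) - n + 1):
                counts[tuple(filtered[i:i + n])] += 1
    # Keep only n-grams that occur at least `min_freq` times across the corpus.
    return {ngram: c for ngram, c in counts.items() if c >= min_freq}
```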

Readability Features measure the clarity of writing and the level of reading competency needed to understand a text. Most of the prominent metrics rely on surface features that consider average word and sentence length along with the number of syllables per sentence. We use the metrics implemented in the readability package of DKPro Core (Gurevych et al., 2007), including the Flesch-Kincaid grade level metric (Kincaid et al., 1975), the Automated Readability Index (ARI) (Smith and Senter, 1967), the LIX index (Björnsson, 1968), the Coleman-Liau index (Coleman and Liau, 1975), the Flesch reading ease test (Flesch, 1948), the SMOG grade metric (McLaughlin, 1969) and the Gunning-Fog index (Gunning, 1969).
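Two of these metrics reduce to simple surface statistics; the following sketch shows the standard formulas for the Automated Readability Index and the Flesch reading ease score (the system itself uses the DKPro Core implementations):

```python
def automated_readability_index(num_chars, num_words, num_sentences):
    # ARI (Smith and Senter, 1967): higher values indicate harder text.
    return 4.71 * (num_chars / num_words) + 0.5 * (num_words / num_sentences) - 21.43

def flesch_reading_ease(num_words, num_sentences, num_syllables):
    # Flesch (1948): higher values indicate easier text.
    return 206.835 - 1.015 * (num_words / num_sentences) - 84.6 * (num_syllables / num_words)
```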

Other Features include character counts, token counts and sentence counts per article. Furthermore, we measure the discussion activity by counting the number of individual discussion topics on the Talk page associated with the article. Following Ferschke et al. (2012a), we regard each titled section on the Talk page as a discussion topic. We refrain from using lexical features from Talk pages, since the Talk page could explicitly discuss the

98 http://nlp.stanford.edu/software/CRF-NER.shtml
99 http://snowball.tartarus.org/algorithms/english/stop.txt

cleanup tags that are supposed to be predicted. This information leak would thus lead to biased results.

5.5 Experiments

In this section, we first describe the experiment setup on both corpora and how we optimize the experiment configuration, followed by an evaluation of the classifier performance and an error analysis.

5.5.1 Experiment Setup and Optimization

The experiments on the CLEF and the NSTYLE corpora have both been carried out with the FlawFinder system, using different machine learning toolkits for training and evaluating the classifiers. In particular, we switched from the Mallet machine learning toolkit (McCallum, 2002) to Weka (Hall et al., 2009) due to its wider range of machine learning algorithms and easier-to-use data format. This section outlines the setup of each experiment and how we determined the best configuration for the final evaluation.

5.5.1.1 CLEF Experiment Setup

For the experiments on the CLEF corpus, we use two machine learning algorithms from the Mallet machine learning toolkit (McCallum, 2002), a Naive Bayes classifier and C4.5 decision trees. For efficiently training the Naive Bayes classifier, we perform unsupervised discretization of numeric features using equal interval binning as suggested by Witten et al. (2011), since the algorithm does not cope well with real-valued features and the Mallet toolkit is not able to perform feature discretization automatically. The decision trees were trained using adaptive boosting with 100 rounds and were limited to a depth of five due to memory restrictions.

We experimentally derived the best configuration for each flaw in a parameter optimization run, which consists of several training iterations on the same training subset using different parameters. To this end, we evaluate the performance of both algorithms for each flaw with 10-fold cross validation using 500 positive and 500 negative instances from the training set. We parameterize each run with the number of selected features (between 250 and 1,500), the use of a stop-word filter and the frequency cut-off for discarding rare ngrams in order to obtain the best setting.

We use the Information Gain feature selection approach (Mitchell, 1997) to rank and prune the feature space. Table 5.7 shows the result of the feature selection process for each flaw and lists the selected features along with their feature utility scores. The scores depict the discriminativeness of each feature for a given flaw and are the basis for the feature ranking we derived during training.

This information sheds light on which types of features work best to represent the individual flaws. A detailed evaluation of the classifier performance along with an error analysis will be provided in section 5.5.2.

It is not surprising that the best indicators for structural flaws are the corresponding structural properties, such as has empty section for Empty Section. For other flaws, the feature ranking is more interesting. For Original Research, for instance, the best ranked feature is the discussion activity. This suggests that the discussion content might also be informative for identifying this flaw and that the Talk pages should be further exploited for feature extraction. For the flaw Advert, the most discriminative non-lexical features are links pointing to external resources. Taking into account the content to which these external links point could further improve the classification performance. It has to be noted that the utility scores cannot be directly compared across flaws. They are only significant as indicators for the ranking within a given flaw. Lexical features are most effective for the flaws Advert, Notability and Original Research, while the other flaws only show little performance gain when adding ngrams to the feature sets. This is to be expected, since structural flaws such as Empty Section or Wikify are not expressed by the vocabulary but by the article structure and the markup.

5.5.1.2 NSTYLE Experiment Setup

While the experiments on the NSTYLE corpus are based on the same system as the CLEF experiments, we made minor adjustments to the experiment setup. Mainly, we replaced the Mallet machine learning toolkit with Weka in order to gain access to a larger collection of classification algorithms. We furthermore adapted our feature selection approach to a two-step strategy that improved the time efficiency of the parameter estimation process. We first filter the ngrams according to their document frequency in the training corpus and discard all ngrams that occur in less than x% or more than y% of all documents. Several values for x and y have been evaluated in parameter tuning experiments; the best results have been achieved with x = 2 and y = 90. In a second step, similar to the CLEF setup, we apply the Information Gain feature selection approach to the remaining set to determine the most useful features.

As we have discussed in the corpus description, we employ three different dataset configurations derived from the NSTYLE corpus. The NSTYLE-BASE configuration uses the newest version of each flawed article as positive instances and a random set of untagged articles as negative instances. The NSTYLE-RELP configuration uses reliable positives, as described in section 5.3.2, in combination with random outliers. Finally, the NSTYLE-RELALL configuration employs reliable positives in combination with the respective reliable negatives.
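The two-step selection can be sketched as follows (illustrative Python; the experiments use Weka's information gain ranking rather than this re-implementation):

```python
import math
from collections import Counter

def df_filter(doc_ngrams, min_df=0.02, max_df=0.90):
    """Step 1: keep n-grams whose document frequency lies between min_df and max_df."""
    num_docs = len(doc_ngrams)
    df = Counter(ng for ngrams in doc_ngrams for ng in set(ngrams))
    return {ng for ng, c in df.items() if min_df <= c / num_docs <= max_df}

def information_gain(feature_present, labels):
    """Step 2: information gain of one binary feature with respect to binary labels."""
    def entropy(ys):
        total = len(ys)
        if total == 0:
            return 0.0
        return -sum((c / total) * math.log2(c / total) for c in Counter(ys).values())
    with_f = [y for f, y in zip(feature_present, labels) if f]
    without_f = [y for f, y in zip(feature_present, labels) if not f]
    n = len(labels)
    return entropy(labels) - (len(with_f) / n) * entropy(with_f) \
                           - (len(without_f) / n) * entropy(without_f)
```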

Table 5.7: Overview of the feature utility scores (information gain) of non-lexical features per quality flaw on the CLEF corpus, together with the selected configuration per flaw. Missing values indicate that the feature has not been selected by the feature selector. The values for lexical features are numbers of selected ngrams per feature type.

Flaw:               Advert  Empty Section  Notability  Original Research  Refimprove  Unreferenced  Orphan  Wikify  No Footnotes  Primary Sources
Selected features:  1,500   500            250         250                250         1,000         1,500   1,500   250           1,500
Classifier:         NB      C4.5           NB          NB                 NB          C4.5          C4.5    NB      C4.5          NB
Selected unigrams:  772     126            84          238                0           0             855     769     91            860
Selected bigrams:   602     210            97          2                  155         639           399     646     90            404
Selected trigrams:  116     161            68          0                  82          350           229     364     56            223

The scored non-lexical features are grouped into the categories Revision (# revisions, # contributors, # registered contributors, article age), Surface (has empty section, markup to text ratio, mean section length, # sections), Reference (# references, # references per sentence, references to text ratio, has references, has reference list), Network (# external links, # inlinks, # outlinks, inlinks < 3, no inlinks), Named Entity (person, organization and location entity counts and per-sentence rates) and Other (# discussions on Talk page, # characters, # sentences, # tokens). As discussed in the text, the highest ranked features include has empty section for Empty Section, the number of discussions on the Talk page for Original Research, the number of external links for Advert, and inlinks < 3 for Orphan.


Table 5.8: Average F1-scores over all flaws on NSTYLE-RELP using NSTYLE-ALL features.

Algorithm                        Average F1
SVM RBF Kernel                       0.82
AdaBoost (decision stumps)           0.80
SVM Poly Kernel                      0.79
RBF Network                          0.78
SVM Linear Kernel                    0.77
SVM PUK Kernel                       0.76
J48                                  0.75
Naive Bayes                          0.72
MultiBoostAB (decision stumps)       0.71
LibSVM One Class                     0.67
Logistic Regression                  0.60

While we restricted the experiments on the CLEF corpus to two classifiers from the Mallet toolkit, we explored a wider range of learning algorithms from the Weka toolkit that are known to work well in similar tasks in order to assess their suitability for quality flaw detection on the NSTYLE corpus. This exploratory evaluation was carried out on the NSTYLE-RELP configuration using all available features. A list of all learning algorithms along with the average F1-score achieved on NSTYLE-RELP is shown in table 5.8. The performance has been evaluated with 10-fold cross validation on 2,000 documents split equally into positive and negative instances. One-class classifiers are trained on the positive instances alone. We determined the best parameters for each algorithm in a parameter optimization run and only list the results of the best configuration.

Overall, Support Vector Machines with RBF kernels yielded the best average results and outperformed the other algorithms on every flaw. We used a sequential minimal optimization (SMO) algorithm (Platt, 1998) to train the SVMs and used different γ-values for the RBF kernel function. In contrast to Ferretti et al. (2012), we did not see significant improvements when optimizing γ for each individual flaw, so we determined one best setting for each dataset. Since SVMs with RBF kernels are a special case of RBF networks that fit a single basis function to the data, we also used general RBF networks that can employ multiple basis functions, but we did not achieve better results with that approach.

One-class classification, as proposed by Anderka et al. (2012), did not perform well within our setup. Even though we used an out-of-the-box one-class classifier, we achieve similar results as Anderka et al. in their pessimistic setting, which best resembles our configuration. However, the performance still lags behind the other approaches in our experiments. The best performing algorithm on the CLEF corpus, AdaBoost with decision stumps as a weak learner, showed the second best results in the exploratory evaluation on NSTYLE.
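For reference, a rough equivalent of this setup expressed with scikit-learn (used here only as a stand-in for the Weka SMO training we actually performed; the toy data is hypothetical so the snippet runs as-is):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# X: feature matrix (e.g. n-gram counts), y: binary flaw labels.
# Hypothetical random data in place of the real corpus features.
rng = np.random.default_rng(0)
X = rng.random((200, 50))
y = rng.integers(0, 2, 200)

# SVM with an RBF kernel; gamma corresponds to the kernel width tuned per
# dataset (e.g. 0.01 for NSTYLE-BASE/RELP, 0.001 for NSTYLE-RELALL).
clf = SVC(kernel="rbf", gamma=0.01, C=1.0)
scores = cross_val_score(clf, X, y, cv=10, scoring="f1")
print("mean F1:", scores.mean())
```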


Figure 5.9: Classifier performance on CLEF in terms of precision, recall and F1-score: (a) cross-validation on the training set, (b) train-test evaluation.

5.5.2 Evaluation and Error Analysis

In this section, we separately evaluate the performance of the final classifiers trained on CLEF and NSTYLE with the best configuration derived in the parameter optimization phase discussed above, and we analyze the systematic errors made by the classifiers.

5.5.2.1 CLEF

Figure 5.9 shows an overview of the classification performance on the training and test set. Figure 5.9a shows the results on the training data with the best configuration obtained in the parameter optimization run, derived with 10-fold cross validation. Figure 5.9b shows the performance of the final classifiers trained on the whole training data and evaluated on the test data, which was compiled by the organizers of the quality flaw prediction competition.

The good performance on the Advert flaw comes as a surprise, since it initially seemed to be a hard task due to the subjectiveness and subtlety of this flaw. The lexical features are good indicators for the presence of this flaw. The most highly ranked ngrams mainly consist of references to business and industry, which can be seen in this list of the top ten ngrams for this flaw: companies, 's, based, business, company, offers, based in, management, services, products. The selected lexical features are thus highly relevant for the advert context and very predictive of the flaw. On the other hand, the relatively weak performance on Wikify was

not expected. The prediction of this flaw particularly suffered from the selection of negative instances in the training set, to which we proposed a solution earlier and which we demonstrate on the NSTYLE corpus.

We furthermore found that the categories Original Research, Refimprove and Primary Sources have fuzzy boundaries and that Wikipedians do not use these flaw markers consistently. They are often confused with each other, which results in biased training data. Cleanup tags related to references and citations should be consolidated into fewer labels with distinct boundaries.

Compared to the competing systems, FlawFinder achieved the second best results on the CLEF corpus in terms of overall F1-score. We argue that the precision of a quality flaw classifier is more important than its recall, because it is supposed to facilitate the human review of articles by listing the most likely candidates suffering from particular flaws. With respect to precision, FlawFinder achieved the best results on seven out of ten flaws. This is also reflected by the average F0.5-score, which, in contrast to the balanced F1-score, puts an emphasis on precision. In the general case, the Fβ-score is calculated as

$F_\beta = (1 + \beta^2) \cdot \dfrac{\mathrm{precision} \cdot \mathrm{recall}}{\beta^2 \cdot \mathrm{precision} + \mathrm{recall}}$
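As a quick sanity check of the formula (hypothetical precision and recall values):

```python
def f_beta(precision, recall, beta):
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# A high-precision classifier is rewarded more by F0.5 than by the balanced F1.
print(f_beta(0.80, 0.60, 1.0))   # ≈ 0.686
print(f_beta(0.80, 0.60, 0.5))   # = 0.750
```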

FlawFinder achieves the highest average F0.5-score in the field. A comparative overview of all participants in the competition on quality flaw prediction, as determined by the organizing committee (Anderka and Stein, 2012), can be seen in table 5.9.

We carried out a detailed error analysis for each flaw in order to identify the main types of errors made by the classifier. The numbers of false positive and false negative instances according to the cross validation on the training set can be seen in table 5.10.

The 71 false positives for Advert mostly contain articles about institutions such as universities or government bodies. The descriptions of these institutions resemble the descriptions of companies. However, for companies the same way of writing is more often regarded as advert-style by Wikipedia users than for public institutions. The 94 false negatives are short articles with an average length of 690 tokens. Many of them do not exceed 250 tokens. These articles do not contain enough text to be reliably classified, since the Advert flaw largely relies on lexical features.

The 200 false positives for the Notability flaw contain many pages about individual persons, organizations, books or movies. Even Wikipedia users have difficulties judging whether a specific subject qualifies for being included in the encyclopedia. Without world knowledge about the article topic, a reliable judgment cannot be made. Furthermore, the notability criteria in Wikipedia are highly disputed in the community and are not interpreted consistently by all users100. For a large fraction of the 63 false negatives, the Notability template has been removed in newer revisions without a major change of the content

100 http://en.wikipedia.org/wiki/Deletionism_and_inclusionism_in_Wikipedia


Flaw                System                     Precision   Recall   F1     F0.5
Advert              Ferschke et al. (2012b)    .853        .826     .839   .847
                    Ferretti et al. (2012)     .736        .929     .821   .768
                    Pistol and Iftene (a)      .047        .582     .086   .058
Empty section       Ferschke et al. (2012b)    .876        .912     .894   .883
                    Ferretti et al. (2012)     .742        .921     .822   .772
                    Pistol and Iftene (a)      .056        1.00     .107   .069
No footnotes        Ferschke et al. (2012b)    .730        .902     .807   .759
                    Ferretti et al. (2012)     .720        .969     .826   .759
                    Pistol and Iftene (a)      .035        .170     .057   .042
Notability          Ferschke et al. (2012b)    .661        .852     .745   .692
                    Ferretti et al. (2012)     .740        .858     .794   .761
                    Pistol and Iftene (a)      .055        .477     .099   .067
Original research   Ferschke et al. (2012b)    .740        .767     .753   .745
                    Ferretti et al. (2012)     .647        .931     .764   .689
                    Pistol and Iftene (a)      .023        .542     .044   .028
Orphan              Ferschke et al. (2012b)    .863        .925     .893   .875
                    Ferretti et al. (2012)     .830        .979     .899   .856
                    Pistol and Iftene (a)      .017        .241     .031   .021
Primary sources     Ferschke et al. (2012b)    .736        .866     .796   .759
                    Ferretti et al. (2012)     .717        .923     .807   .751
                    Pistol and Iftene (a)      .052        .423     .093   .063
Refimprove          Ferschke et al. (2012b)    .615        .751     .676   .638
                    Ferretti et al. (2012)     .735        .970     .836   .772
                    Pistol and Iftene (a)      .035        .357     .064   .043
Unreferenced        Ferschke et al. (2012b)    .780        .884     .829   .799
                    Ferretti et al. (2012)     .745        .954     .836   .779
                    Pistol and Iftene (a)      .056        1.00     .107   .069
Wikify              Ferschke et al. (2012b)    .678        .844     .752   .706
                    Ferretti et al. (2012)     .742        .737     .740   .741
                    Pistol and Iftene (a)      .056        1.00     .107   .069
Average             Ferschke et al. (2012b)    .753        .853     .798   .770
                    Ferretti et al. (2012)     .735        .917     .815   .765
                    Pistol and Iftene (a)      .043        .579     .079   .053

(a) There is no full description available for this rule-based classification system. It is described in short by Anderka and Stein (2012).

Table 5.9: Comparison of classifier performance on CLEF in terms of precision, recall and F-measure across all three participants in the competition on quality flaw detection. In addition to the balanced F1-score, we also report the F0.5-score, which puts more emphasis on precision than on recall.

Flaw                 False positives   False negatives
Advert                      71                94
Empty Section               73                27
Notability                 200                63
Original Research           93               143
Refimprove                  71                35
Unreferenced               158                51
Orphan                      71                35
Wikify                     250                71
No Footnotes               164                74
Primary Sources            164                68

Table 5.10: Overview of classification errors per flaw on CLEF.

(for example in the article on the Bigfoot Trail101 or the Hong Kong Gold Coast102). This suggests that the template has been incorrectly assigned to the training article by the Wikipedia users.

Many of the 158 false positives for the flaw Unreferenced actually had no references at all or just contained an external links section. This suggests that the classifier correctly identified the problem, but the templates were missing in the article. The 51 false negatives are subject to the same problem. In this case, the Unreferenced template has been used for marking articles that suffer from the Refimprove flaw. For example, in the corpus version of the article “Robert Hartmann”, the used template was Unreferenced but it has been changed to the correct Refimprove template in a later version103. Similar confusion can be observed in the misclassified instances of the other flaws related to references and citations, such as Original Research, No Footnotes, and Primary Sources. This suggests that the templates should be better defined and consolidated into fewer categories. Other false negative instances for Unreferenced are due to the inline usage of the templates that we have already discussed in section 5.2.1.1. According to the flaw definition, the template applies to articles that do not have any references. However, when used inline in the form {{Unreferenced|section}}, it only refers to the section it appears in, while the rest of the article may cite references104. In order to account for this, each section has to be classified separately instead of the article as a whole.

According to the instructions provided by the organizers of the competition on flaw prediction, the Orphan template is to be assigned to any article that “has fewer than three incoming links”. Therefore, we introduced the feature inlinks < 3, which proved to be the most discriminative one for this flaw. However, the template description in Wikipedia states to “only place the {{orphan}} tag if the article has ZERO incoming links from other articles”105. This discrepancy accounts for most of the false negatives, which have one or two incoming links from other articles.

101 http://en.wikipedia.org/wiki/index.php?diff=502614831&oldid=407680228
102 http://en.wikipedia.org/wiki/index.php?diff=502889724&oldid=461337252
103 http://en.wikipedia.org/wiki/index.php?diff=474150162&oldid=466987161
104 http://en.wikipedia.org/wiki/index.php?oldid=463206537
105 http://en.wikipedia.org/wiki/index.php?oldid=593368301#Criteria


Figure 5.10: Classifier performance on NSTYLE in terms of precision, recall and F1-score, evaluated with 10-fold cross validation on 2,000 articles per flaw: (a) NSTYLE-BASE, (b) NSTYLE-RELP, (c) NSTYLE-RELALL.

Removing the above-mentioned feature and using the inlink counts alone can solve this issue.

The false positives for the flaw Wikify mainly consist of short articles. Wikification is not an issue commonly addressed in short articles, and it becomes more important as the article grows. The network and surface features used by the classifier consequently do not work well with short articles.

No regularities could be found for the misclassifications of the flaw Empty section. It is likely that the main reason for misclassification is parsing errors. We found that sections containing mainly structured elements such as tables, infoboxes or expanded templates are particularly hard to cope with.

5.5.2.2 NSTYLE

The SVMs achieve a similar cross-validated performance on all feature sets that contain ngrams. They only showed minor improvements for individual flaws when adding non-lexical features. This suggests that the classifiers largely depend on the ngrams and that other features do not contribute significantly to the classification performance. While structural quality flaws can be well captured by special purpose features or intensional modeling, as related work has shown, more subtle content flaws such as the neutrality and style flaws are mainly captured by the wording itself. Textual features beyond the ngram level, such as syntactic and semantic properties of the text, could further improve the classification performance of these flaws and should be addressed in future work. Table 5.11 shows the performance of the SVMs with RBF kernel106 on each dataset using the NSTYLE-NGRAM feature set. The average performance based on NSTYLE-NOWIKI is slightly lower, while using NSTYLE-ALL features results in slightly higher average F1-scores.

106 γ = 0.01 for NSTYLE-BASE and NSTYLE-RELP, γ = 0.001 for NSTYLE-RELALL.


Flaw          BASE F1   BASE F0.5   RELP F1   RELP F0.5   RELALL F1   RELALL F0.5
Advert          .866       .871       .880       .872        .657        .637
Confusing       .734       .743       .783       .764        .729        .670
Copy edit       .753       .753       .749       .750        .793        .821
Essay-like      .812       .814       .825       .816        .704        .645
Globalize       .871       .865       .875       .866        .769        .719
In-universe     .957       .945       .946       .932        .704        .645
Peacock         .768       .776       .815       .813        .687        .620
POV             .775       .774       .810       .803        .740        .686
Technical       .868       .879       .887       .883        .690        .617
Tone            .716       .727       .775       .767        .730        .651
Trivia          .745       .758       .756       .756        .769        .748
Weasel          .725       .751       .772       .775        .739        .685
Average         .799       .805       .823       .816        .726        .679

Table 5.11: Fβ scores for the 10-fold cross validation of the SVMs with RBF kernel on all datasets using NSTYLE-NGRAM features.

However, the differences are not statistically significant and are thus omitted. Classifiers using the NSTYLE-NONGRAM feature set achieved average F1-scores below 0.50 on all datasets. The results have been obtained by 10-fold cross validation on 2,000 documents per flaw.

The classifiers trained on reliable positives and random untagged articles (NSTYLE-RELP) outperform the respective classifiers based on the NSTYLE-BASE dataset for most flaws. This confirms our original hypothesis that using the appropriate revision of each tagged article is superior to using the latest available version from the dump. The performance on the NSTYLE-RELALL dataset, in which the topic bias has been factored out, yields lower F1-scores than the two other configurations.

Flaws that are restricted to a very narrow set of topics (i.e. Atopic in figure 5.5b is small), such as the in-universe flaw, show the biggest drop in performance. Since the topic bias plays a major role in the quality flaw detection task, as we have shown earlier, the topic-controlled classifier cannot take advantage of the topic information, while the classifiers trained on the other corpora can make use of these characteristics as the most discriminative features. In the NSTYLE-RELALL setting, however, the differences between the positive and negative instances are largely determined by the flaws alone. Classifiers trained on such a dataset therefore come closer to recognizing the actual quality flaws, which makes them more useful in a practical setting despite lower cross-validated scores.

In addition to cross-validation, we performed a cross-corpus evaluation of the classifiers for each flaw. To this end, we evaluated the performance of the unbiased classifiers (trained

on RELALL) on the biased data (NSTYLE-RELP) and vice versa. Hereby, the positive training and test instances remain the same in both settings, while the unbiased data contains negative instances sampled from Arel and the biased data negative instances sampled from Arnd (see figure 5.5). With the NSTYLE-NGRAM feature set, the reliable classifiers outperformed the unreliable classifiers on all flaws that can be well identified with lexical cues, such as Advert or Technical. In the biased case, we found both topic-related and flaw-specific ngrams among the most highly ranked ngram features. For example, in the case of the flaw Technical, we saw many general ngrams related to mathematics and science and technical terms from these areas. In the unbiased case, most of the informative ngrams were flaw specific; in the example of Technical articles, we mainly observed technical terms. Consequently, biased classifiers fail on the unbiased dataset, in which the positive and negative classes are sampled from the same topics, which renders the highly ranked topic ngrams unusable. Flaws that do not largely rely on lexical cues, however, cannot be predicted more reliably with the unbiased classifier. This means that additional features are needed to capture these flaws. We tested this hypothesis by using the full feature set NSTYLE-ALL and saw a substantial improvement on the side of the unbiased classifier because of the added features, while the performance of the biased classifier remained unchanged. This indicates that the predictive power of the biased classifier mainly depends on the generic ngram features, which capture the topic cues in the dataset, while it cannot be improved with the additional features. Since the topic bias is ruled out in the unbiased case, the generic ngrams are less efficient and the classifier can gain from the additional features.

A direct comparison of our results to related work is difficult, since neutrality and style flaws have not been targeted before in a similar manner. However, the Advert flaw was also part of the ten flaw types in the PAN Quality Flaw Recognition Task (Anderka and Stein, 2012).

The best system achieved an F1-score of 0.839, which is just below the results of our system on the NSTYLE-BASE dataset, the configuration that most closely resembles the PAN setup.

5.6 Mining Flaw Corrections from the Revision History

In addition to finding articles with potential quality flaws, another important use case of automatic quality flaw recognition is the identification of quality problems at a specific position within a given article. We therefore transfer the quality flaw recognition task from the article level to the sentence level.

In order to build a sentence classifier for a given flaw type, we have to create corpora of quality flaw corrections, i.e. pairs of flawed and flawless sentences. Analogously to the experiments on the article level, the training instances have to be reliable. We therefore follow a similar approach as in the previous experiment and extract pairs of flawed and flawless article revisions from the revision history as described in section 5.3. Instead of using article-scope templates in the corpus creation process, we now use inline- and section-scope templates (see section 5.2.1.1), which mark specific sentences or sections as flawed. We align each pair of flawed and flawless revisions on the section level by matching the section headlines in both revisions and using Greedy String Tiling (Wise, 1996) for comparing the section texts. We then discard all section pairs that do not contain any flaw templates. The remaining section pairs are split into sentences and processed with a text difference (diff) algorithm (Myers, 1986). From the sentences aligned in this way, we finally extract all pairs that contain at least one flaw template in order to create a parallel corpus of flawed sentences and their corrections.
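The sentence-level alignment step can be approximated with standard sequence matching, as in the following Python sketch (difflib is used as a stand-in for the Greedy String Tiling and diff components of the actual pipeline, and the template pattern is illustrative):

```python
import difflib
import re

# Illustrative pattern for inline uncertainty templates in the raw wikitext of a sentence.
FLAW_TEMPLATE = re.compile(r"\{\{\s*(weasel|peacock)[^}]*\}\}", re.IGNORECASE)

def aligned_sentence_pairs(flawed_sentences, revised_sentences):
    """Align the sentences of a flawed and a revised section and keep pairs
    in which at least one side carries an inline flaw template."""
    matcher = difflib.SequenceMatcher(a=flawed_sentences, b=revised_sentences)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "replace":
            continue  # only changed regions can contain an aligned correction
        for flawed, revised in zip(flawed_sentences[i1:i2], revised_sentences[j1:j2]):
            if FLAW_TEMPLATE.search(flawed) or FLAW_TEMPLATE.search(revised):
                pairs.append((flawed, revised))
    return pairs
```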


Table 5.12: Sample sentence pairs from the uncertainty corpus.

S1 Flawed:  Some believe that Iran controls the majority of terrorism in Israel.
   Revised: Israeli intelligence believes that Iran controls the majority of terrorism in Israel.
S2 Flawed:  The adaptation was highly praised and was subsequently released on audio cassette.
   Revised: The adaptation was subsequently released on audio cassette.
S3 Flawed:  The group were told that the Ghost would not come as they were making too much noise.
   Revised: Moore told the group the ghost would not come as they were making too much noise.
S4 Flawed:  The Iraq War troop surge of 2007 was part of this "new way forward" and has been credited by some with a dramatic decrease in violence and an increase in political and communal reconciliation in Iraq.
   Revised: The Iraq War troop surge of 2007 was part of this "new way forward".
S5 Flawed:  According to theorists, there are many signs that will confirm these claims.
   Revised: According to theorists, such as David Icke, there are many signs that will confirm these claims.
S6 Flawed:  In one of his major works he also showed that Indian philosophy, once translated into standard academic language, is worthy of being called philosophy by Western standards.
   Revised: He wrote books on Indian philosophy according to Western academic standards, and made Indian philosophy worthy of serious consideration in the West.

A similar approach has been suggested by Recasens et al. (2013), who extract edits from the article revision history which were meant to remove bias from the article text. To this end, the authors retrieve all articles that are, or at any point in time had been, members of the Wikipedia NPOV category indicating that their neutrality is disputed. From the revision histories of these articles, they extract all commits that contained a comment mentioning POV or NPOV, thus hinting at the neutrality dispute. They finally discard all edits that merely added a URL or changed less than four characters.

We carried out experiments for the flaws Weasel and Peacock, which we combined into a single corpus due to their similar purpose, which is concerned with textual uncertainty. The extracted uncertainty corpus contains 16,241 sentence pairs. Table 5.12 shows a set of sample sentence pairs. Upon manual inspection of a random sample of 200 sentence pairs, we identified six main types of corrections for uncertainty flaws: pronoun replacement (S1), intensifier removal (S2), passive-active transformation (S3), clipping (S4), expansion (S5), and paraphrasing (S6).

On this corpus, we trained a binary uncertainty classifier for sentences using the FlawFinder system described in section 5.4 with only NGRAM features. With this baseline approach, we achieved an F1-score of 0.65 in identifying sentences expressing uncertainty. The task carried out in this experiment is similar to the uncertainty detection subtask


(Task1W) of the CoNLL 2010 shared task on hedge detection (Farkas et al., 2010). The winning team of that task achieved a performance of F1 = 0.60 on identifying uncertain sentences using a dictionary of hedge cues as their most predictive features. On the same corpus, FlawFinder achieved an F1-score of 0.59 using the same configuration as on the parallel corpus described above. Even though the two corpora are not directly comparable, they are similar enough to allow the tentative conclusion that the parallel data mined with our approach helps to improve the classification performance compared to a corpus that does not contain explicit negative instances. This goes against the intuition we have gained in the document-level experiments, where the cross-validated performance dropped on the unbiased dataset. The main reason for this is that the shorter texts in the sentence corpora are less affected by the topic bias than longer documents, which was the main reason for the unrealistically high cross-validated performance.

Apart from the sentence-level classification task, the proposed approach for mining flawed sentences and their corrections from Wikipedia gives rise to many opportunities in analytical linguistic research. Such corpora can easily be created for any inline- or section-scope cleanup templates, which makes it possible to obtain parallel corpora for a wide range of linguistic phenomena.

5.7 Limitations in the Predictability of Quality Flaws

In section 5.2.3, we evaluated how well a human annotator can manually perform the quality flaw prediction task in order to gain an impression of the reliability of cleanup templates as quality flaw markers. The study showed that flaws which usually affect only parts of an article are harder for humans to detect than flaws that affect the article as a whole. This was not necessarily the case in the machine learning experiments. We now return to this issue and review, on a broader level, the limitations of cleanup templates as the basis for training quality flaw classifiers.

We have already established in chapter 4 that quality dimensions have to be measurable in order to be useful for information quality management. In particular, we discussed the consistency, subjectivity, operationalizability and interpretability of quality dimensions. Three of these properties can also be directly translated to the quality flaw prediction task and can be described by the following questions:

Flaw Subjectivity: How much room for interpretation does a given template or flaw definition allow, and how consistent are the label assignments across different raters?

Flaw Operationalizability: Is it possible to identify descriptive features that are predictive of the given quality flaw and that can be automatically extracted from the available data, i.e. the article and its metadata?


Flaw Consistency: Given that a set of features has been identified that are indicative of a given flaw, do the same feature values indicate the presence or absence of the flaw consistently in every possible situation?

In order to illustrate the meaning of these three properties, we discuss two exemplary quality flaws from the CLEF and NSTYLE corpora in the light of their subjectivity, operationalizability and consistency.

The Orphan flaw identifies articles that are not connected to other articles via hyperlinks. Given only this basic definition, the decision whether an article should or should not be marked with the flaw is purely objective, without any room for interpretation or personal judgment. In practice, however, the orphan criteria107 include cases of weakly linked articles or small cliques of articles and leave the decision whether to assign the label in these fringe cases up to the community. Hence, the orphan flaw cannot be considered purely objective but is subjective to a low degree. Since the properties that characterize an article as orphaned are all governed by technical features such as incoming and outgoing links, the flaw has a high degree of operationalizability. The consistency of the flaw is only influenced by the fringe cases included in its definition, i.e. whether or not a weakly linked article is considered orphaned. Apart from that, the clear-cut features result in a high degree of consistency.

This article may be too technical for most readers to understand. Please helpim- prove this page to make it understandable to non-experts, without removing the technical details. The talk page may contain suggestions.108

In fact, whether or not an article can be considered to be too technical for a general audience heavily depends on the familiarity of the article maintainer with the subject matter and on their perception of the level of understanding that the general audience possesses. Even more severe is the issue of flaw consistency in the case of Technical articles. Since the notion of this flaw is not absolute but relative to the article topic, i.e. the article text is considered to be more technical than it needs to be, we need to consider the article topic as a frame of reference. For example, it might not be possible to write a comprehensive article about a technical subject without ample use of technical vocabulary, while the same number of technical terms might be considered excessive in an equally long non-technical article. Moreover, terms that might be considered overly technical in one article might be necessary vocabulary in another and considered appropriate in that context. Incorporating the topic

107 http://en.wikipedia.org/wiki/WP:O
108 http://en.wikipedia.org/wiki/index.php?oldid=582079420

as a frame of reference is therefore important when modeling flaws that are interpreted differently across different subject areas in Wikipedia.

5.8 Chapter Summary

In this chapter, we presented an approach to automatically identify quality flaws in Wikipedia articles by means of cleanup template prediction. While cleanup templates are good proxies for quality flaws and thus a viable resource for compiling quality flaw corpora as training data for machine learning classifiers, we found that many templates exhibit a topic bias that negatively influences the classifier performance and even biases manual analyses.

We found that certain templates exhibit a topical preference, i.e. they tend to occur in articles about particular topics, or even show a topical restriction, i.e. the templates exclusively occur in articles about particular topics. This fact has to be taken into account when sampling the data for quality flaw corpora in order to avoid a topic bias that influences both any data analyses and machine learning classifiers trained on this data. We therefore introduced an approach to extract reliable positive and negative training instances from the article revision history which factors out the topic bias and improves the overall data quality.

We furthermore presented a corpus of articles with neutrality and style flaws that has been sampled with this technique. Our machine learning experiments on this corpus show that the reliable classifiers tend to exhibit a lower cross-validated performance than classifiers trained on the biased datasets, but the scores more closely resemble their actual performance in practical settings.

Finally, we described an approach for mining quality flaw corrections from the revision history. This method can both be used to create a new parallel corpus of flawed and flawless language as well as for identifying the position of quality flaws within articles rather than merely identifying flawed articles.

We closed the chapter by discussing the limitations of the quality flaw prediction task based on cleanup templates. While some flaws are predictable on a global scale using all available training data, like most structural and organizational flaws, others have to be considered within the narrow context of the subject area they are used in. That is, the features which indicate a flaw in one subject area might not be predictive of the same flaw in another subject area and rather result in a higher rate of false positives.

Chapter 6

Dialog Analysis of Wikipedia Talk Pages

“A conversation is a dialogue, not a monologue. That’s why there are so few good conversations: due to scarcity, two intelligent talkers seldom meet.” — Truman Capote

Every article in Wikipedia has an associated discussion page – or Talk page – on which the active contributors discuss the future development of the article, coordinate their work and collaboratively decide how conflicting plans regarding the improvement of the article should be resolved. In this chapter, we discuss how the information on the article Talk pages can be leveraged for information quality management purposes and how an analysis of these pages provides us with insights into the collaborative writing process that complement the knowledge we can gain from analyzing the article revision history.

We first give an overview of our motivation (section 6.1) and present the theoretical background for our work (section 6.2). We then discuss related work on computational dialog analysis in general and the analysis of Wikipedia discussions in particular (section 6.3). We proceed with a detailed examination of two Wikipedia discussion corpora (section 6.4), which we annotated with an annotation scheme to capture the coordination efforts for article improvement (section 6.5). Finally, we investigate how these corpora can be used to automatically tag unseen discussions with dialog act labels which identify quality problems discussed on the Talk pages and the solutions proposed by the contributors (section 6.6). We conclude the chapter with an overview of a real world application of the Talk page classification system (section 6.7) and a summary of our findings (section 6.8).


6.1 Motivation and Overview

The article Talk pages in Wikipedia are the main communication hub for the discussions related to article improvement, work coordination and quality assessment. On these pages, decisions are made that shape the evolution of the associated articles and have a vital impact on their quality. As we have discussed in chapter 3.5, Talk pages are largely unstructured wiki pages which mimic the appearance of traditional threaded web forums. The disparity between the lack of explicit structure on the one hand and the structured form it seeks to resemble on the other hand is one of the main reasons why these Talk pages are difficult to use for novice users (Schneider et al., 2011) and why they are hard to process computationally (Ferschke et al., 2012a).

As studies have shown, the overall discussion activity in Wikipedia is on the rise, indicated by an increasing proportion of Talk page edits, while the relative number of article edits constantly declines (Schneider et al., 2010; Stvilia et al., 2008). High discussion activity naturally results in fast growing Talk pages. However, threads that reach a certain size and age and that are inactive for some time are automatically archived and thus no longer directly visible on the main article Talk page. While the archived information is technically retained, the rudimentary search capabilities109 and the lack of dialog structure do not allow users to easily retrieve old content from the archives, which effectively renders old information lost.

As a consequence, important decisions once made about an article are likely to be forgotten over time, and many recurring issues and topics have to be discussed over and again, which unnecessarily binds much of the available workforce. Furthermore, while Talk pages are directly connected to individual articles, topics discussed in the context of one article might still be relevant for related articles. However, there is no easy way to make this connection as a user without actively monitoring the Talk pages of all related articles.

Finally, even though Talk pages are mainly used by Wikipedia users who actively contribute to the encyclopedia, these discussions often hold information that could be interesting to the general public. Linking the information on the Talk pages to the related sections in the corresponding article could help casual users of Wikipedia to gain access to this additional information source. However, without a basic understanding of the discourse structure of the Talk pages, it is not possible to establish this link automatically or semi-automatically.

Extracting the essential information about work coordination with a particular focus on the article quality improvement activities will help to gain an overview of the relevant topics covered and the decisions made in past discussions and thus improve the consistency

109 By default, no search in Talk page archives is available. A rudimentary search function can be manually enabled by including the corresponding template (e.g. Template:Search_archives) on the main Talk page of an article.

and availability of this important information across Wikipedia. Such a system will take a key role in the global quality management process and even holds opportunities to improve the user experience of the encyclopedia. One of the main challenges on the way to achieving this goal is to overcome the unstructured nature of the Talk pages and to reliably segment the dialog while retaining relevant meta information, such as the identity of the contributors, the time stamp of each contribution and the basic thread structure. This is the prerequisite for employing semi-supervised machine learning to classify the contributions into predefined categories tailored towards quality assessment activities, which is the main topic of this chapter.

From a scientific point of view, article Talk pages are a unique type of web discourse and a valuable resource for the humanities and writing sciences, since the discussions develop in parallel with the discussed articles and provide insights into the meta level of the collaborative writing process that normally remains hidden. With structured access to this resource, linguists and researchers in the writing sciences have the unparalleled possibility to observe these hidden processes without having to conduct interviews or carry out supervised field experiments.

The main contributions of this chapter can be summarized as follows:

Contribution 6.1: We present an algorithm for dialog segmentation of Wikipedia article discussions based on the revision history (section 6.4.1).
Contribution 6.2: We compile two corpora of Wikipedia article discussions from the Simple English Wikipedia and the English Wikipedia (section 6.4).
Contribution 6.3: We introduce annotation schemes for annotating turns in article discussions to capture the coordination efforts of article improvement (section 6.5).
Contribution 6.4: We annotate the corpora with our newly developed annotation schemes and analyze the resulting datasets (section 6.5).
Contribution 6.5: We develop a system for automatically labeling turns in Wikipedia article discussions with dialog act labels from our annotation scheme and evaluate the performance of the classifiers (section 6.6).

6.2 Linguistic Background

Early models of human communication predominantly followed a positivist view, which regards logic and reason as the governing principle of language. It postulates that human language merely serves as a passive container of meaning that can readily be extracted by anyone able to decode the signs, i.e. the words of the language (Krippendorff, 1994). This rather simplistic view of language and communication was soon superseded by more advanced models following the theoretical paradigm of structural linguistics, which acknowledge that language serves different functions at the same time. Human communication must therefore be analyzed on multiple levels simultaneously, taking into consideration the context and the participants in the communication (Bühler, 1934; Jakobson, 1960).

Speech Act Theory. John Austin finally shifted the focus of linguistics from the mere declarative use of language as a means for making factual statements towards its non-declarative use as a tool for performing actions. In his influential work “How to do things with words” (1962), Austin argues that a large part of human language goes beyond mere statements, assertions, or propositions and involves the performance of actions. He furthermore claimed that there are utterances that cannot be analyzed in terms of truth conditions, on which most previous theories have relied. This newfound concept of language as a means to perform actions ultimately led to the theory of speech acts. Austin posits that language performs actions on three levels simultaneously: the locutionary level, the illocutionary level and the perlocutionary level.

The locutionary act describes the performance of the utterance itself, i.e. its verbalization and pronunciation. At the same time, the illocutionary act concerns the pragmatic level of the utterance, i.e. it captures the intention of the speaker. Finally, the perlocutionary act is directed at the recipient of the message and the effect the utterance has achieved on him or her. Only by analyzing communication on all three levels is it possible to achieve a full understanding of its meaning. Austin’s speech act theory was further systematized by Searle (1969), who, among other refinements of the theory, introduced a taxonomy of illocutionary acts (Searle, 1976). He distinguishes between five different illocutionary classes:

Assertives/Representatives: communicate a proposition which the sender of the message believes to be true. Example: It’s raining today.
Directives: cause the recipient to perform an action. Example: Close the door!
Commissives: commit the sender to perform an action in the future. Example: I’ll be back.
Expressives: express the sender’s attitude or emotions towards the proposition of the utterance. Example: Good job, congratulations.
Declarations: constitute an act that directly changes reality. Example: I now pronounce you husband and wife.

This taxonomy of illocutionary acts has become an important instrument for the analysis of human utterances and has often been used as the starting point for the development of new schemes for speech act analyses, both in traditional linguistics (Sadock, 2006) and in computational linguistics (Jurafsky, 2006).


Dialog Acts. According to Bunt and Black (2000), classifying utterances with respect to the performed speech acts promises deep insights into the pragmatic structures of the discourse. The concept is of particular importance for the analysis of human-human dialog. While a monologue can be considered a form of unidirectional communication between a sender (writer, speaker) and a receiver (reader, listener), we define a dialog as the bidirectional communication between multiple agents who exchange coherent messages and switch between the roles of sender and receiver in the course of the communication. Although it is possible to have multiple senders at the same time (e.g. several people talking or writing at the same time), we only consider the case of a single sender at any given time. As long as an agent is assigned the role of the sender, it is considered to be his or her turn. Passing the sender role from one agent to another is therefore defined as turn-taking.110

In dialog settings, speech acts are usually referred to as dialog acts. The exact definition of this term differs across the literature (Bunt and Black, 2000; Jurafsky, 2006), but can be summarized as a specialized speech act defining the function of an utterance in the context of a particular dialog. Other terms, such as communicative acts, conversation acts, conversational moves or dialog moves, roughly translate to similar concepts (Traum, 2000).

Dialog Act Identification for IQ Management. Since the discussions on Wikipedia article Talk pages mainly revolve around the development of the associated articles and the improvement of their quality, identifying dialog acts tailored towards this kind of discourse can help to identify and organize the main intentions of the contributions in these discussions. Instead of applying the generic classification scheme of illocutionary speech acts proposed by Searle, we have to define a more fine-grained set of specialized dialog acts that satisfy the requirements of the information quality management setting. We propose such a scheme in section 6.5 of this chapter.

6.3 Related Work

While linguistic theory provides the theoretical framework for computationally analyzing human dialog, it is necessary to operationalize the linguistic concepts in a concrete scheme in order to annotate, process, and analyze real-world examples of human dialog. A well-known, domain- and task-independent annotation scheme is DAMSL – Dialog Act Markup in Several Layers (Core and Allen, 1997). It was created as the standard annotation scheme for dialog tagging on the utterance level by the Discourse Resource Initiative. It uses a four-dimensional tagset that allows arbitrary label combinations for each utterance. Jurafsky et al. (1997) augmented the DAMSL scheme to fit the peculiarities of the Switchboard corpus.

110 For the sake of brevity, a more detailed introduction to conversation analysis, the mechanics of dialog situations, conversational implicature or indirect speech acts has been omitted. A comprehensive overview can be found in Hutchby and Wooffitt (2008).


The resulting SWDB-DAMSL scheme contained more than 220 distinct labels, which have been clustered into 42 coarse-grained labels. Both schemes have often been adapted for special purpose annotation tasks.

While DAMSL was originally designed for annotating transcripts of spoken dialog, a large part of current research is directed at written online discourse. In addition to analyzing web forums (Kim et al., 2010a), chats (Carpenter and Fujioka, 2011) and emails (Cohen et al., 2004), Wikipedia Talk pages have recently moved into the center of attention of the research community. In the remainder of this section, we will discuss the different aspects of Wikipedia discussions that have been investigated in related work and which are highly relevant for our efforts to analyze quality management activities on Talk pages.

6.3.1 Work Coordination and Conflict Resolution

Viégas et al. (2007) were among the first to draw attention to Wikipedia Talk pages as an important resource in their own right. In an empirical study, they discovered that articles with Talk pages have, on average, 5.8 times more edits and 4.8 times more participating users than articles without any Talk activity. Furthermore, they found that the number of new Talk pages increased faster than the number of content pages. In order to better understand how the rapidly increasing number of Talk pages is used by Wikipedians, they performed a qualitative analysis of selected discussions. The authors manually annotated 25 “purposefully chosen”111 Talk pages with a set of 11 labels in order to analyze the aim and purpose of each user contribution. Each turn was tagged with one of the following labels:

– request for editing coordination
– request for information
– reference to vandalism
– reference to Wikipedia guidelines
– reference to internal Wikipedia resources
– off-topic remark
– poll
– request for peer review
– information boxes
– images
– other

The first two categories, requests for coordination (58.8%) and information (10.2%), were most frequently found in the analyzed discussions, followed by off-topic remarks (8.5%),

111 According to Viégas et al. (2007), “[t]he sample was chosen to include a variety of controversial and non-controversial topics and span a spectrum from hard science to pop culture.”

guideline references (7.9%), and references to internal resources (5.4%). This shows that Talk pages are not used just for the “retroactive resolution of disputes”, as the authors hypothesized in their preliminary work (Viégas et al., 2004); rather, they are used for proactive coordination and planning of the editorial work.

Schneider et al. (2010, 2011) pick up on the findings of Viégas et al. and manually analyze 100 Talk pages with an extended annotation scheme. In order to obtain a representative sample for their study, they define five article categories to choose the Talk pages from: most-edited articles, most-viewed articles, controversial articles, featured articles, and a random set of articles. In addition to the 11 labels defined by Viégas et al., Schneider et al. classify the user contributions as

– references to sources outside Wikipedia
– references to reverts, removed material or controversial edits
– references to edits the discussant made
– requests for help with another article

The authors evaluated the annotations from each category separately and found that the most frequent labels differ between the five classes. Characteristic peaks in the class distribution could be found for the “reverts” label, which is a strong indicator for discussions of controversial articles. Interestingly, the controversial articles did not have an above-average discussion activity, which was initially expected due to a high demand for coordination. The labels “off-topic”, “info-boxes”, and “info-requests” peak in the random category, which is apt to contain shorter Talk pages than the average items from the other classes. In accordance with Viégas et al., coordination requests are the most frequent labels in all article categories, running in the 50% to 70% range.

The observed distribution patterns alone are not discriminative enough for identifying the type of article a Talk page belongs to, but they nevertheless serve as valuable features for the Talk page analysis. Furthermore, the labels can be used to filter or highlight specific contributions in a long Talk page to improve the usability of the Talk platform. Schneider et al. (2011) perform a user study in which they evaluate a system that allows discussants to manually tag their contribution with one of the labels. Most of the 11 participants in the study perceived this as a significant improvement in the usability of the Talk page, which they initially regarded as confusing. Given enough training data, this classification task can be tackled automatically using machine learning algorithms.

In a large-scale quantitative analysis, Kittur et al. (2007) confirm earlier findings by Viégas et al. and demonstrate that the amount of work on content pages in Wikipedia is decreasing while the indirect work is on the rise. They define indirect work as “excess work in the system that does not directly lead to new article content.” Besides the efforts for work coordination, indirect work comprises the resolution of conflicts in the growing community of Wikipedians. In order to automatically identify conflict hot spots or even


to prevent future disputes, the authors developed a model of conflict on the article level and demonstrate that a machine learning algorithm can predict the amount of conflict in an article with high accuracy. In contrast to the works discussed above, Kittur et al. do not employ a hand-crafted coding scheme to generate a manually annotated corpus; rather, they extract the “controversial” tags that have been assigned to articles with disputed content by Wikipedia editors. This human-labeled conflict data is obtained from a full Wikipedia dump with all page revisions (revision dump) using the Hadoop112 framework for distributed processing. The authors define a measure called Controversial Revision Count (CRC) as “the number of revisions in which the ‘controversial’ tag was applied to the article”. These scores are used as a proxy for the amount of conflict in a specific article and are predicted by a Support Vector Machine regression algorithm from raw data. The model is trained on all articles that are marked as controversial in their latest revision and evaluated by means of five-fold cross validation. As features, the authors define a set of page-level metrics based on both articles and Talk pages (see table 6.1). They evaluated the usefulness of each feature, which is indicated by the individual ranks as superscript numbers in the table.

The authors report that the model was able to account for almost 90% of the variation in the CRC scores (R2 = 0.897). They furthermore validate their model in a user study by having Wikipedia administrators evaluate the classification results on 28 manually selected articles that have not been tagged as controversial. The results of this study showed that the CRC model generalizes well to articles that have never been tagged as controversial. This opens up future applications like identifying controversial articles before a critical point is reached.

Table 6.1: Page-level features proposed by Kittur et al. (2007)

    Feature                            Page
    Revisions (a)                      Article (4), Talk (1), Article/Talk
    Page length                        Article, Talk, Article/Talk
    Unique editors (a)                 Article (5), Talk, Article/Talk
    Unique editors (a) / Revisions (a) Article, Talk (3)
    Links from other articles (a)      Article, Talk
    Links to other articles (a)        Article, Talk
    Anonymous edits (a,b)              Article (7), Talk (6)
    Administrator edits (a,b)          Article, Talk
    Minor edits (a,b)                  Article, Talk (2)
    Reverts (a,c)                      Article

    (a) Raw counts   (b) Percentage   (c) By unique editors   (1)–(7) Feature utility rank
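To make this kind of prediction setup concrete, the following minimal sketch shows how a feature-based regression with five-fold cross-validation could be assembled from off-the-shelf tools. It is only an illustration under our own assumptions (random placeholder data, generic feature and target names) and does not reproduce the actual implementation of Kittur et al.

    # Illustrative sketch of a CRC-style regression setup (placeholder data,
    # not the original implementation). X would hold page-level features such
    # as revision counts or the percentage of anonymous edits, y the
    # Controversial Revision Count of each article.
    import numpy as np
    from sklearn.svm import SVR
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.random((200, 10))      # placeholder: 200 articles, 10 page-level features
    y = rng.random(200) * 50       # placeholder CRC scores

    model = SVR(kernel="rbf")
    r2_scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print("mean R^2 over 5 folds:", r2_scores.mean())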

112 http://hadoop.apache.org


6.3.2 Authority and Social Alignment

Discussions on article Talk pages often aim at keeping articles in line with Wikipedia’s guidelines for quality, neutrality and notability. The outcome of these discussions can therefore have a big impact on the future development of an article. If such a discussion is not grounded in authoritative facts but rather in the subjective opinions of individual users, a dispute about content removal, for example, may lead to the unjustified removal of valuable information. Wikipedia Talk pages are, for the most part, pseudonymous discussion spaces, and most of the discussants do not know each other personally. This raises the question of how the users of Talk pages decide which claim or statement in a discussion can be trusted and whether an interlocutor is reliable and qualified.

Oxley et al. (2010) analyze how users establish credibility on Talk pages. They define six categories of authority claims with which users account for their reliability and trustworthiness (see table 6.2). Based on this classification, Bender et al. (2011) created a corpus of social acts in Wikipedia Talk pages (AAWD). In addition to authority claims, the authors define a second annotation layer to capture alignment moves, i.e. expressions of solidarity or signs of disagreement among the discussants. At least two annotators labeled each of the 5,636 turns extracted from 47 randomly sampled Talk pages from the English Wikipedia. The authors report an overall inter-annotator agreement of κ = 0.59 for authority claims and κ = 0.50 for alignment moves.

Marin et al. (2011) use the AAWD corpus to perform machine learning experiments targeted at automatically detecting authority claims of the forum type (cf. Table 6.2) in unseen discussions. They particularly focus on exploring strategies for extracting lexical features from sparse data. Instead of relying only on n-gram features, which are prone to overfitting when used with sparse data, they employ knowledge-assisted methods to extract meaningful lexical features. They extract word lists from Wikipedia policy pages to capture policy-related vocabulary and from the articles associated with the Talk pages to capture vocabulary related to editor discussions. Furthermore, they manually create six word lists related to the labels in the annotation scheme. Finally, they augment their features with syntactic context gained from parse trees in order to incorporate a higher level of linguistic context and to avoid the explosion of the lexical feature space that is often a side effect of higher-order n-grams. Based on these features, the authors train a maximum entropy classifier to decide for each sentence whether it contains a forum claim or not.113 The decision is then propagated to the turn level if the turn contains at least one forum claim. The authors report an F1-score of 0.66 on the evaluation set. Besides being a potential resource for social studies and online communication research, the AAWD corpus and approaches to the automatic classification of social acts can be used to identify controversial discussions and online trolls.114

113 The corpus was split into a training set (67%), a development set (17%) and a test set (16%).


    Claim type            Based on
    Credentials           Education, Work experience
    Experiential          Personal involvement in an event
    Institutional (a)     Position within the organizational structure
    Forum                 Policies, Norms, Rules of behavior (in Wikipedia)
    External              Outside authority or resource
    Social Expectations   Beliefs, Intentions, Expectations of social groups

    (a) Not encoded in the AAWD corpus

Table 6.2: Authority claims proposed by Oxley et al. (2010) and Bender et al. (2011)

In an attempt to investigate how users of different status groups interact, Danescu-Niculescu-Mizil et al. (2012) created a corpus of 5,657 Talk pages with overall 125,292 discussion threads. Their hypothesis was that the amount of language coordination in a conversation depends on the social status of the participants. The authors define language coordination as the stylistic mimicry of the interlocutor, which describes the tendency of a person to adapt to the usage of function words of his or her communication partner. Social status in the Wikipedia context is furthermore defined as the user role of the discussants (see chapter 3.3.1), such as registered user or admin. The authors find that people with a lower social status exhibit a greater tendency towards language coordination than users with more power, and that a change in status will also trigger a change in the coordination behavior. Furthermore, the intention to convince a communication partner with an opposing view of one’s own opinion results in a power deficit and thus triggers a higher level of language coordination. This effect can not only be observed in Wikipedia, but is a stable phenomenon in other kinds of communication such as Supreme Court meetings (Danescu-Niculescu-Mizil et al., 2012).

6.3.3 User Interaction

It is not only the content of Talk pages which has been the focus of recent research, but also the social network of the users who participate in the discussions. Laniado et al. (2011) create Wikipedia discussion networks from Talk pages in order to capture structural patterns of interaction. They extract the thread structure from all article and user Talk pages in the English Wikipedia and create tree structures of the discussions. For this, they rely on user signatures and turn indentation. The authors consider only registered users, since IP addresses are not unique identifiers for the discussants.

114 A troll is a participant in online discussions with the primary goal of posting disruptive, off-topic messages or provoking emotional responses.

In the directed article reply graph, a user node A is connected to a node B if A has ever written a reply to any contribution from B on any article Talk page. They furthermore create two graphs based on User Talk pages which cover the interactions in the personal discussion spaces in a similar manner. The authors analyze the directed degree assortativity of the extracted graphs. In the article discussion network, they found that users who reply to many different users tend to interact mostly with inexperienced Wikipedians, while users who receive messages from many users tend to interact mainly with each other. They furthermore analyzed the discussion trees for each individual article, which revealed characteristic patterns for individual semantic fields. This suggests that tree representations of discussions are a good basis for metrics characterizing different types of Talk pages, while the analysis of User Talk pages might be a good foundation for identifying social roles by comparing the different discussion fingerprints of the users.

A different aspect of social network analysis in Wikipedia is examined by Massa (2011). He aims at reliably extracting social networks from User Talk pages. Similarly to Laniado et al. (2011), he creates a directed graph of user interactions. The interaction strength between two users is furthermore quantified by weighted edges, with weights derived from the number of messages exchanged by the users. The study is based on networks extracted from the Venetian Wikipedia. Massa employs two approaches to extract the graphs automatically, one based on parsing user signatures (signature-based approach) and the other based on the revision history, regarding every commit by a user as an individual message (history-based approach). He compares the results with a manually created gold standard and finds that the revision-based approach produces more reliable results than the signature-based approach, which suffers from the extreme variability of the signatures. However, history-based processing often resulted in higher edge weights, because several edits of a contribution are counted as individual messages. Massa furthermore identifies several factors that impede the network extraction, like noise in the form of bot messages and vandalism, inconsistently used usernames, and unsigned messages. While these insights might be a good basis for future work on network extraction tasks, they are limited by the small Venetian Wikipedia on which the study is based. Talk pages in larger Wikipedias are much longer and more complex and are apt to contain pitfalls not recognized by this work.

6.3.4 Information Quality

Related work that uses Wikipedia Talk pages for information quality analyses is scarce, but most relevant for the work presented in this thesis. As we have argued before, the information on Talk pages contains valuable insights into the readers’ judgments of articles and comments about their potential deficiencies.


Stvilia et al. (2005, 2007, 2008) analyze 60 discussion pages in order to identify which types of information quality (IQ) problems have been discussed by the community. Based on this analysis, they determine twelve IQ problems along with a set of related causal factors for each problem and actions that have been suggested by the community to tackle them. For instance, IQ problems in the quality dimension complexity may be caused by low readability or complex language and might be tackled by replacing, rewriting, simplifying, moving, or summarizing the problematic article content. They furthermore identify trade-offs among these quality dimensions of which the discussants on Talk pages are largely aware. For example, an improvement in the dimension completeness might result in a deterioration in relevance, i.e. the more details are added to an article, the higher the chance of incorporating irrelevant information.

To the best of our knowledge, no previous work has yet attempted to use machine learning to automatically classify user contributions in Wikipedia Talk pages with respect to the article improvement efforts they express. This is the subject of our work presented in this chapter.

6.4 Wikipedia Article Discussion Corpora

For our experiments, we created two corpora of Wikipedia Talk pages from the Simple English Wikipedia (SEWD corpus) and the English Wikipedia (EWD corpus). While the English Wikipedia is the largest language version (see figure 3.1) with the biggest community, the Simple English Wikipedia is a small, special-purpose Wikipedia that hosts articles written with a basic English vocabulary and simple syntax so that they are easy to understand for non-native speakers. In the remainder of this section, we first describe the data extraction and discourse segmentation techniques used for retrieving and preprocessing the data for the corpora. We then discuss the sampling strategies taken for selecting the documents. In the subsequent section (6.5), we introduce the annotation schemes used for annotating the corpora, describe the annotation process and provide a detailed corpus analysis.

As already discussed in chapter 3.6, there are multiple possibilities to access Wikipedia programmatically. The best approach depends on the demands of the particular task at hand. Similarly to the experiments in chapter 5, we take a hybrid approach in which we combine a raw text corpus containing the desired Talk pages with a preprocessed database of the full Wikipedia from which the pages have been extracted. This way, while having a fixed corpus for human annotation, we can use the corresponding Wikipedia database at any time to directly retrieve additional information about the Talk pages, the associated articles or the involved users (see figure 6.1). We use the Java Wikipedia Library (JWPL) for creating the Wikipedia database representations from the freely available Wikipedia XML data dumps, paired with the Wikipedia Revision Toolkit (WRT) to include the revision history of all articles and Talk pages. Both JWPL and WRT have been discussed in chapter 3.6.2.


Figure 6.1: Creation and utilization of the Wikipedia Talk page corpora. The corpus creation process is marked in gray, the human annotation task in blue and the computational analysis and machine learning task in red. The XML dumps of the remote Wikipedia databases are provided as downloads by Wikimedia under http://dumps.wikimedia.org

For the SEWD corpus, we use a snapshot of the Simple English Wikipedia115 from 6th April 2011. For the EWD corpus, we use a snapshot of the English Wikipedia116 from 5th April 2011.

6.4.1 Dialog Segmentation

As shown in chapter 3.5, Talk pages are regular wiki pages without any explicit markup of the discourse structure. This lack of structure not only causes considerable confusion and disorientation among the discussing users (Schneider et al., 2011), it also makes automatic processing of these pages challenging. In order to properly analyze the user discussions, we have to segment the discussion pages and extract the basic discourse structure. We therefore have to (i) identify all discussion threads on the page, (ii) segment each thread into individual turns, and (iii) retrieve meta information for each turn, such as the contributing user and the time stamp of the contribution.

115 http://dumps.wikimedia.org/simplewiki
116 http://dumps.wikimedia.org/enwiki
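Before turning to the segmentation algorithm itself, the following sketch illustrates one possible representation of the segmented discussions and the metadata listed above. The class and field names are our own illustration for this chapter and are not part of JWPL or the WRT.

    # Illustrative data structures for segmented Talk page discussions
    # (hypothetical names, not a specific API). A discussion thread is a list
    # of turns; each turn carries the metadata recovered from the revision history.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Turn:
        text: str             # concatenated paragraphs of the contribution
        author: str           # user name recovered from the revision of origin
        timestamp: int        # creation time, e.g. seconds since epoch
        revision_id: int      # revision in which the turn was created
        indentation: int = 0  # indentation level recorded as additional metadata

    @dataclass
    class DiscussionThread:
        title: str                        # topic title taken from the section headline
        turns: List[Turn] = field(default_factory=list)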


(a) Original contribution by user Cas Liber

(b) Same contribution with in-text replies

Figure 6.2: In-text replies on a Talk page of the article Monaro Highway (English Wikipedia, revision IDs 575007442 and 575010159, emphasis added)

Laniado et al. (2011) and Danescu-Niculescu-Mizil et al. (2012) tackle the dialog segmentation problem by using text indentation and inserted user signatures as clues. However, according to a study by Viégas et al. (2007), only 67% of all contributions on Wikipedia Talk pages are signed, which makes signatures an unreliable predictor for turn boundaries and thus insufficient for a reliable reconstruction of the thread structure. And even from the available signatures, it is not always possible to retrieve the necessary meta information (see figure 3.6 in chapter 3.5).

Another factor that limits the reliability of a rule-based discourse parsing approach is the non-standard usage of the Talk pages. In contrast to conventional threaded discussions, such as web forums or newsgroups, Wikipedia Talk pages might exhibit cases of in-text replies, an approach to insert a new message in an existing contribution of a different user in order to reply to a specific part of said contribution. While editing the contributions

of other users is frowned upon according to the Talk page policies, it is not prohibited by technical means and is allowed under special circumstances117. In-text replies are often used as the preferred way to respond to very long turns, especially when they contain multiple questions or action items. The latter can frequently be observed in threads discussing a list of necessary cleanup tasks which have to be addressed before an article can be promoted to good or featured status. Figure 6.2 shows an example of in-text replies inserted into an existing post that reviews the quality status of the associated article. In the remainder of this section, we describe our approach to reliably segment user discussions.

6.4.1.1 Topic Segmentation

While there is no explicit markup for the inner discourse structure, Talk pages make light use of general-purpose MediaWiki markup to provide a rough separation into discussion topics. Therefore, we can employ a MediaWiki markup parser to identify the outer boundaries of the discussion threads. We use the built-in parser of the JWPL software, which allows retrieving section elements and headlines corresponding to the discussion topics and topic titles on Talk pages. In our experiments, the markup-based boundary identification resulted in a perfect segmentation of the discussion topics. As the only exception, we found that discussion pages of newly created or low-profile articles with little discussion activity do not make use of explicit sectioning. However, as the discussion activity increases, a coarse-grained discourse structure emerges automatically.
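Our implementation relies on the JWPL parser; the following simplified sketch only illustrates the underlying idea of using section headlines as thread boundaries, replacing the full MediaWiki parser with a plain regular expression for the sake of the example.

    # Simplified illustration of topic segmentation: split a Talk page into
    # discussion threads at section headlines ("== Title =="). The actual system
    # uses the JWPL markup parser; this regex-based version is only a sketch.
    import re

    HEADLINE = re.compile(r"^==+\s*(.*?)\s*==+\s*$", re.MULTILINE)

    def split_into_threads(talk_page_text):
        threads = []
        matches = list(HEADLINE.finditer(talk_page_text))
        for i, m in enumerate(matches):
            start = m.end()
            end = matches[i + 1].start() if i + 1 < len(matches) else len(talk_page_text)
            threads.append((m.group(1), talk_page_text[start:end].strip()))
        return threads  # list of (topic title, unsegmented discussion text)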

6.4.1.2 Turn Segmentation

As a second step, we have to identify the turn boundaries within each discussion thread. Despite existing conventions that define how the dialog is supposed to be formatted by the users118, we refrain from making too many assumptions about the format, since the guidelines and best practices are often not precisely followed by all users. Our only fixed assumption about the discussion format is that every turn starts with a new line. We therefore consider every end-of-line (EOL) character in the discussion text to mark the boundary to a new paragraph and every paragraph to be a turn candidate. Based on this assumption, we can reliably preprocess the discussion text in order to provide a semi-structured format for our dialog segmentation algorithm.

The main idea of our segmentation algorithm is that the revision history of the discussion page contains all necessary information that is needed to identify the author and creation point of each paragraph, which makes it possible to aggregate associated turn candidates into actual turns.

117 http://en.wikipedia.org/wiki/WP:TPO
118 http://en.wikipedia.org/wiki/WP:TP


Data: unsegmented text of a single discussion d in talk page tp
Result: list of paragraphs with metadata

 1  parlist ← split d at EOL ;                  /* identify paragraphs */
    /* find creation point of each paragraph */
 2  foreach paragraph p in parlist do
        /* check all talk page revisions starting with the oldest */
 3      for rev ← tp.oldest to tp.newest do
 4          if rev contains p then              /* string matching */
                /* we found the revision of origin for p */
 5              p.author = rev.author ;         /* collect metadata */
 6              p.timestamp = rev.timestamp ;
 7              p.revisionid = rev.revisionid ;
 8              break ;                         /* goto next paragraph */
 9          end
10      end
11  end
12  return parlist

Figure 6.3: Identification of paragraph creation points with forward checking.

We now introduce a naive algorithm which implements the basic idea of this approach, but makes several simplifying assumptions. After that, we identify the problems of the naive algorithm and generalize the simplifications.

Naive Algorithm with Forward Checking. For the naive algorithm, we start by sequentially searching the revision history of the discussion page once for each paragraph to find the revision in which the paragraph first appeared. We define this revision to be the revision of origin of that paragraph. The search always moves forward in time, starting with the first (oldest) revision of the discussion page. The paragraph identification is achieved with a simple string matching technique, i.e. the algorithm checks if the current Talk page revision contains the paragraph as a substring. Having identified the revision of origin for each paragraph, we can retrieve the corresponding author, creation point and revision id from the revision metadata. Assuming that each revision produces exactly one turn, i.e. all paragraphs created in a single revision belong to the same turn, we can aggregate all paragraphs with identical revision IDs into single turns and thus retrieve the overall turn structure of the discussion. Figure 6.3 shows the pseudocode for the naive version of the paragraph creation point identification algorithm.

While this naive version illustrates the main idea behind our revision-based segmentation approach, it makes several simplifying assumptions that first have to be generalized in order to make the approach applicable to real-world problems.


Data: unsegmented text of a single discussion d in talk page tp
Result: list of paragraphs with metadata

    /* Initialize paragraphs */
 1  parlist ← split d at EOL ;                  /* identify paragraphs */
 2  foreach paragraph p in parlist do
 3      p.startIndex = start position in Talk page ;
 4      p.endIndex = end position in Talk page ;
 5      p.contributor = last contributor to Talk page ;
 6      p.timestamp = timestamp of latest revision of Talk page ;
 7  end
    /* find creation point of each paragraph */
 8  foreach paragraph p in parlist do
        /* check all talk page revisions starting with the newest */
 9      for rev ← tp.newest to tp.oldest do
10          foreach DiffAction da in rev do
11              if da occurred within limits of p then
                    /* p was changed for the first time: update metadata
                       of p and move to next paragraph */
12                  p.author = rev.author ;
13                  p.timestamp = rev.timestamp ;
14                  p.revisionid = rev.revisionid ;
15                  break ; goto next paragraph ;
16              else
                    /* p was not changed: recalculate position of p according
                       to the changes made on the Talk page */
17                  update indexes of p according to da ;
18              end
19          end
20      end
21  end
22  return parlist

Figure 6.4: Identification of paragraph creation points with backward checking.


Backward Checking. One of the simplifications of the naive algorithm is the use of string matching for finding the revision of origin of a given paragraph. While this is a viable approach for longer paragraphs, which are most likely unique in the whole revision history, the probability of identifying an incorrect revision of origin increases with decreasing length of the paragraph text. In order to solve this issue, we reverse the processing order of the revision history and start the process with the newest revision of the Talk page. We then backtrack the history revision by revision and check whether the monitored paragraph has been altered. This can be achieved without relying on string matching with the help of the so-called DiffAction information provided by the WRT-API (see appendix A.1). These DiffActions identify all changes that have been made in a single revision of a wiki page and provide the span of the change with start and end index in the page. This way, we are able to determine whether any of these changes occur within a particular paragraph by comparing the begin and end indexes of DiffAction and paragraph. The process is continued until we find the first revision in which the monitored paragraph is either altered or disappears completely, which indicates that we have found the creation point of the paragraph. Figure 6.4 shows the pseudocode of the creation point identification algorithm with backward checking.

Vandalism. Backward checking introduces another problem that was not relevant in the naive setup. Similar to Wikipedia articles, Talk pages can be subject to vandalism and malicious edits. While there are many possible categories of malicious edits to wiki pages, an extreme example best illustrates the impact of vandalism on our algorithm. Page blanking is defined as the action of removing all content from a page or replacing all content on a page with a new message. These acts of vandalism are usually repaired very quickly by someone reverting the malicious change and restoring the previous version of the Talk page. However, such page blanking acts will cause the segmentation algorithm to falsely select the malicious revision as the revision of origin for any paragraph created before the vandalism act. Therefore, we have to account for cases of vandalism by identifying malicious edits. Since we found that vandalism is reverted much faster on discussion pages than in articles, we do not employ any additional vandalism detection heuristics. We rather check whether a change to the currently monitored paragraph is reverted shortly afterwards within a so-called lookahead window, i.e. within the next n revisions after the change. In other words, we identify cases of paragraph blanking with an n-revision lookahead. If we find such a case, we assume an act of vandalism and disregard the malicious change. We then proceed with the search for the revision of origin of the monitored paragraph. While smaller lookahead windows potentially cause incorrect decisions of the algorithm, larger windows increase the processing time. On the SEWD corpus, the algorithm was able to detect all cases of paragraph blanking with a lookahead window of n = 10, while we had to increase the value to n = 20 on the EWD corpus.
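The following sketch illustrates the lookahead check in isolation. The revision objects and their text attribute are assumptions made for the example; the actual implementation operates on the DiffActions provided by the WRT rather than on raw revision texts.

    # Sketch of the n-revision lookahead used to skip reverted paragraph
    # blankings (hypothetical revision objects with a .text attribute).
    def is_reverted_blanking(revisions, i, paragraph, lookahead=10):
        """Return True if the change to `paragraph` in the chronologically
        i-th revision is undone within the next `lookahead` revisions, i.e.
        the paragraph text reappears and the change is treated as vandalism
        when searching for the revision of origin."""
        window = revisions[i + 1 : i + 1 + lookahead]
        return any(paragraph in rev.text for rev in window)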


Edit-Turn Equivalence. The final simplification of the naive approach that we have to address is the edit-turn equivalence. We assumed that one revision equals one turn. However, this is not always the case. While in many cases we actually do have a 1:1 correspondence between turns and revisions, we also find cases with 1:N, M:1 or even M:N ratios. This means that a single turn was written in many revisions, multiple turns in one revision, or multiple turns in multiple revisions. We have to consider these cases in the paragraph aggregation part of the algorithm. In order to account for turns that have been written in multiple revisions, we regard all consecutive revisions by the same user within a window of 10 minutes as belonging to the same turn. The value of this time window was derived experimentally and turned out to be the optimal value for our experiments. We evaluated the threshold in a manual review of all incorrectly segmented turns both in the SEWD and the EWD corpus and found that the same threshold worked equally well in both cases. Multi-turn edits, i.e. revisions in which a user contributes to multiple discussion threads on a single page, are not a problem for the revised algorithm, since the paragraph-based revision backtracking approach captures these cases equally well.
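The aggregation step can be sketched as follows; the paragraph objects are assumed to carry the metadata recovered by the creation point identification, and the 10-minute window corresponds to the experimentally derived threshold described above.

    # Sketch of paragraph aggregation into turns (illustrative only):
    # consecutive paragraphs by the same user within a 10-minute window are
    # merged into one turn.
    def aggregate_turns(paragraphs, window_seconds=600):
        turns, last = [], None
        for p in sorted(paragraphs, key=lambda p: p.timestamp):
            same_turn = (last is not None and last.author == p.author
                         and p.timestamp - last.timestamp <= window_seconds)
            if same_turn:
                turns[-1]["text"] += "\n" + p.text   # turn written across several revisions
            else:
                turns.append({"text": p.text, "author": p.author,
                              "timestamp": p.timestamp, "revision_id": p.revision_id})
            last = p
        return turns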

Indentation. Even though we do not regard indentation as a reliable indicator for the conversational structure, we nevertheless record this information for every turn in order to gain further insights into the relationships between the contributions. According to the Talk page conventions119, indentation should be used to indicate which part of the conversation a contribution replies to. We store this information as additional metadata for the turn. In cases where a single turn contains indentation as a means of formatting the contribution rather than indicating a reply, the indentation level of the least indented paragraph is used for the whole turn.

In-Text Replies. In the case of in-text replies (see figure 6.2), users insert their messages within an existing contribution. We can consider this to be an act of splitting the original contribution into smaller parts or, in other words, partial turns, to which the inserted messages reply. As long as the original contribution is split along paragraph boundaries, our backward checking algorithm will still find the correct creation points for each partial turn. However, if the split is performed somewhere in the middle of a paragraph, we have to account for this fact when monitoring the indexes of the paragraphs. After each index recalculation (line 17), we have to check if the newly calculated span still starts and ends at paragraph boundaries. If this is no longer the case, we have to expand the span to the paragraph boundaries it is enclosed in. Then, the algorithm will be able to perform as expected.
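The boundary correction can be sketched as a simple expansion of the recalculated span to the enclosing EOL characters; the function below is only an illustration of this step, not the actual WRT-based implementation.

    # Sketch of the span correction for in-text replies: expand a recalculated
    # span outwards to the enclosing paragraph boundaries (EOL characters) if
    # it no longer starts or ends at such a boundary.
    def expand_to_paragraph_boundaries(text, start, end):
        while start > 0 and text[start - 1] != "\n":
            start -= 1
        while end < len(text) and text[end] != "\n":
            end += 1
        return start, end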

119http://en.wikipedia.org/wiki/WP:INDENT


Discussion Archives. As discussed in chapter 3.5, old discussion threads can be archived in order to prevent Talk pages from getting too long. Depending on the configuration of the particular Talk page, the old content is either copied to a new or already existing archive page before being removed from the main page (cut-and-paste procedure), or the whole Talk page is renamed and thus transformed into an archive while a new main Talk page is created (move procedure). In the latter case, the segmentation algorithm works as expected, since the revision history remains unchanged on the same page as the content, i.e. the renamed archive page. In the case of the cut-and-paste procedure, the revision history related to the copied content remains on the main Talk page. Consequently, the algorithm has to use the revision history of the main page starting at the point in time when the content was archived. This point in time can be retrieved from the revision history of the archive page. Since there is no explicit flag indicating the archiving strategy, we initially assume that the move procedure has been employed. If the algorithm fails upon segmenting the page, we automatically switch to the cut-and-paste mode.

6.4.1.3 Evaluation of the Segmentation Approach

In the following, we first evaluate the performance of the segmentation approach in terms of its time complexity and then proceed with an empirical evaluation of its accuracy on the SEWD and the EWD corpus.

Performance Estimation. The main computational effort of the segmentation algorithm lies in the identification of the paragraph creation points. The naive forward checking algorithm requires, on average, p ∗ 0.5r string matches for p paragraphs and r Talk page revisions. If the Talk page contains the history of archived content, these revisions have to be unsuccessfully searched for every paragraph, since the algorithm always starts the search with the oldest revision.

The backward checking algorithm requires p(x ∗ r) check and update operations.120 If the Talk page does not contain any archived revisions, x is 0.5 as in the case of forward checking, because, on average, we have to search half of the history. If any archived revisions are contained in the history of the page, x will be smaller than 0.5, since the backward checking algorithm will never enter the part of the revision history with archived revisions. However, compared to the naive approach, the check and update operations are computationally more complex than string matching. Therefore, the backward checking algorithm will only be faster in cases with many archived discussion threads using the cut-and-paste procedure and thus resulting in smaller values of x.

120 The check and update operations have to be carried out for each DiffAction in a revision. Since this number is, on average, small, we handle this as a fixed operation. The initialization of the paragraphs in the beginning does not significantly affect the overall runtime, since the properties of every paragraph are assigned fixed values.


Paragraph aggregation into turns can be achieved in a single iteration over all processed paragraphs if they are first sorted according to creation time. The handling of in-text replies increases the complexity of the check and update operations of the backward checking algorithm, since in-paragraph splits have to be handled as described above. However, in practice, these operations do not affect the overall runtime significantly, since cases of in-text replies are infrequent compared to standard replies.

Empirical Evaluation of Segmentation Accuracy. We evaluated the accuracy of the final segmentation algorithm with backward checking both for the Simple English Wikipedia and the English Wikipedia. We manually analyzed all Talk pages in the SEWD corpus and the EWD corpus for segmentation errors as part of the annotation process described in section 6.5.

In the case of the Simple English Wikipedia, we evaluated on a per-turn basis, which means that each turn was judged individually for the acceptability of its boundaries and the correctness of the associated metadata. For the gold standard of the corpus (see section 6.5.2), any turns with segmentation errors were excluded. Overall, 94% of the 1,450 turns were correctly segmented. An analysis of the incorrectly segmented turns showed two major sources of error. First, the assumption that turn boundaries always coincide with paragraph boundaries (i.e. turns are single or multiple paragraphs) did not always hold true. There are rare cases in which this convention is disregarded by the users and new turns are not started on a new line. Second, the algorithm expects any cases of vandalism to be reverted within a certain time window after they occurred. There are cases in which the revert is done outside of the lookahead window of our algorithm as well as cases in which no revert is performed at all. These cases will not be detected by our approach. Minor sources of error were bot interventions, i.e. automatic edits performed by maintenance scripts, which could not always be handled correctly. Finally, there are cases in which the algorithm fails to keep track of the paragraph spans after recalculating the begin and end indexes according to the performed DiffActions. This is probably caused by inconsistencies in the WRT-API and not by an error in the segmentation algorithm.

For the larger EWD corpus, we evaluated the segmentation accuracy on a per-thread basis, which means that each thread with any segmentation error is immediately marked as erroneous and not further evaluated. This was done in order to exclude whole threads from the gold standard rather than removing individual turns. 8% of all threads have been marked erroneous by three annotators who were asked to evaluate the segmentation accuracy (see also section 6.5.2), while 31% of all threads have been marked with an error tag by at least one annotator. Upon later examination, we found that one of the annotators made a systematic error in judging the segmentation accuracy.


Therefore, we consider only the ratings of the other two annotators and obtain an overall thread-error-rate of 10%121. Besides the fact that longer threads and older discussion pages with a more extensive revision history were more frequently subject to parse errors than short discussions and newly created Talk pages, the interference of automatic maintenance scripts (bots) caused the algorithm to falsely identify turn boundaries on some occasions. Beyond that, no systematic error sources could be identified among the segmentation errors.

Table 6.3: Descriptive statistics for the Simple English Wikipedia (Apr 6th 2011) and the English Wikipedia (Apr 5th 2011), from which the SEWD and the EWD corpora have been extracted. Turn counts have not been determined for the English Wikipedia due to the extensive runtime of the dialog segmentation process.

                                Simple English   English
    Articles                    69 900           3 477 738
    Non-empty Talk pages        5 783            3 353 180
    Discussion topics           7 560            2 901 532
    Turns                       14 335           N/A
    Talk pages with > 3 turns   683              N/A

6.4.2 Data Sampling

In order to sample a well-balanced set of documents for each corpus, we defined selection criteria that reflect both the peculiarities of the respective Wikipedia from which the documents were sampled and the task at hand, i.e. the analysis of coordination efforts for article improvement. Due to the different characteristics of the Simple English Wikipedia and the English Wikipedia, the selection criteria differ between the SEWD corpus and the EWD corpus.

6.4.2.1 Simple English Wikipedia

Due to the limited size of the Simple English Wikipedia (see table 6.3), its smaller community and consequently its lower discussion activity, we chose the discussion length as the only selection criterion for Talk pages to be included in the corpus. From a JWPL database based on a Wikipedia data dump from Apr 6th 2011, we first extract all Talk pages and segment the dialog as described in section 6.4.1. From the set of segmented Talk pages, we discard all instances with fewer than four turns, which results in a remaining set of 683 Talk pages. We then analyze the distribution of turn counts per discussion page in the remaining set of pages and manually define three classes: (i) discussion pages with 4–10 turns, (ii) pages with 11–20 turns, and (iii) pages with more than 20 turns.

121 Thread-error-rate is defined as the percentage of threads with at least one incorrectly segmented turn.


Distinguished articles
    Featured Articles
    Good Articles
Flawed articles
    Incomplete articles or articles with lack of detail (CRITCOMPL)
        – templates from category Cleanup/Expand and add
    Articles with lack of accuracy, correctness or neutrality (CRITACC)
        – templates from category Cleanup/Contradiction and confusion
        – templates from category Cleanup/Neutrality and factual accuracy
        – template bad summary
    Articles with deficiencies in language or style (CRITLANG)
        – templates from category Cleanup/Style of writing
        – templates from category Cleanup/Translation
    Articles with deficiencies in structure or layout (CRITSTRUCT)
        – templates from category Cleanup/Structure, formatting and sections
        – templates from category Cleanup/Move
        – templates from category Cleanup/Merge
        – templates from category Cleanup/Split
    Articles with unsuitable content (CRITSUIT)
        – templates from category Cleanup/Potentially unwanted content
        – templates from category Cleanup/Importance and notability
        – templates from category Cleanup/Context and detail
    Articles with insufficient sources or references (CRITAUTH)
        – templates from category Cleanup/Verifiability and sources
Neutral articles
    Articles with none of the above characteristics

Figure 6.5: Selection criteria for Talk pages in the EWD corpus based on the quality status of the associated articles. The template categories are explained in chapter 5, while the templates used in this setup are listed in appendix B.2. The labels in parentheses refer to the corresponding criticism class as defined in the annotation scheme used to annotate the EWD corpus (see section 6.5.1.2).

We decided to explicitly define these three classes, since random sampling from a small set of documents might exclude rare document types, i.e. in our case longer discussions. We then randomly extracted 50 discussion pages from class (i), 40 pages from class (ii) and 10 pages from class (iii). This way, we obtain 100 Talk pages with a total of 1,450 turns. After removing all segmentation errors, 1,367 turns remain for annotation.
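The class-based sampling for the SEWD corpus can be summarized in a few lines. The page objects and their turns attribute are assumptions made for the sake of the sketch; the real pipeline operates on the segmented Talk pages described in section 6.4.1.

    # Sketch of the class-based sampling for the SEWD corpus (illustrative only).
    # `pages` is assumed to be the set of segmented Talk pages with >= 4 turns.
    import random

    def sample_sewd(pages, seed=42):
        random.seed(seed)
        small  = [p for p in pages if 4 <= len(p.turns) <= 10]
        medium = [p for p in pages if 11 <= len(p.turns) <= 20]
        large  = [p for p in pages if len(p.turns) > 20]
        return random.sample(small, 50) + random.sample(medium, 40) + random.sample(large, 10)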

6.4.2.2 English Wikipedia

Since the English Wikipedia, as the biggest of all language versions, offers a much larger amount of data (see table 6.3), we are able to define more complex selection criteria for the documents in the EWD corpus. Rather than only taking the discussion length into account, we also include the quality status of the articles associated with the discussion pages in the set of criteria.


From a JWPL database based on a Wikipedia data dump from Apr 4th 2011, we first extract all Talk pages with at least one discussion and with a total text length between 1,000 and 40,000 characters122. We further categorize the retrieved pages according to the quality status of the associated article, which can either be distinguished, flawed, or neutral.

Distinguished articles: Are marked as featured or good articles (see section 4.3).
Flawed articles: Contain cleanup templates that indicate particular quality flaws which correspond to the criticism categories that we introduce in the annotation scheme (see section 6.5.1.2). The concept of quality flaws in Wikipedia has been discussed in chapter 5.
Neutral articles: Are neither distinguished nor flawed123.

For each of these categories, we sample a random set of 72 Talk pages with a balanced distribution of different discussion sizes124, resulting in a total of 216 Talk pages. After segmenting the pages with the algorithm described in section 6.4.1, we manually remove all pages with segmentation errors and obtain a corpus of 200 pages with 8,531 turns in 2,689 topics for further annotation. Figure 6.5 gives a more detailed overview of the selection criteria for the EWD corpus.

6.5 Annotating Wikipedia Article Discussions

In this section, we first introduce the two annotation schemes designed for the SEWD and the EWD corpus as well as a mapping between the two. We then describe the annotation process and analyze the annotations in both corpora.

6.5.1 Annotation Schemes for Article Discussions

In section 6.2, we have defined dialog acts as specialized speech acts that identify the function of an utterance in the context of a particular dialog. Rather than performing a fine-grained analysis of the discourse structure in Wikipedia Talk page conversations, we are more interested in the types of quality assessment issues and the coordination efforts for article improvement that are reflected in each contribution. Dialog acts are a suitable tool for this kind of analysis. We chose to perform the dialog analysis on the turn level rather than on the utterance level to avoid the added complexity of an additional utterance segmentation step, which would be a further source of noise.

122 Since the turn extraction for all discussion pages would demand a substantial amount of time, we preselect candidate pages according to their overall text length (excluding markup). Turn segmentation is performed at a later stage.
123 While neutral articles do not contain any templates defined in the flawed category, they might still exhibit unmarked flaws or contain other types of cleanup templates which are not considered in this setup.
124 We split the range between 1,000 and 40,000 characters per article into six equidistant bins and categorized the articles accordingly. We then sampled the same number of articles from each bin.


Since a single turn may consist of several utterances, it may consequently comprise multiple dialog acts. Therefore, we designed the annotation study as a multi-label classification task, i.e. the annotators can assign one or more labels to each annotation unit while each label is chosen independently. This furthermore reflects the fact that even a single utterance might perform several acts at the same time. For example, the author of the turn

“This part needs to be extended. I will have a look into it later.”

identifies a lack of detail or missing information in the article and, at the same time, self-commits to improving the article later. The annotation schemes described in the following subsections have been developed in succession, with the EWD scheme constituting a refinement of the SEWD scheme based on the lessons learned in our initial experiments.

6.5.1.1 Simple English Wikipedia

In order to define a scheme for annotating the SEWD corpus, we manually analyzed a random set of thirty Talk pages from the Simple English Wikipedia to identify the types of article deficiencies that are discussed and the way article improvement is coordinated. We first identified four high level categories that need to be considered in an analysis of the information quality management process.

Article Criticism: Identifies the types of deficiencies in the article. The criticism can refer to the article as a whole or to individual parts of the article.
Explicit Performative: Comprises announcements, reports or suggestions of editing activities.
Information Content: Captures the purpose of the communication. A contribution can be used to communicate new information to others, to request information, or to suggest changes to established facts.
Interpersonal: Refers to the attitude that is expressed towards other participants in the discussion and/or their comments.

Within each of these four categories, we identified all related speech acts that occurred in the Talk page sample more than once. Moreover, we analyzed the instructions regarding article quality discussions in the Wikipedia Manual of Style in order to identify additional labels for each category. The scheme was iteratively revised on another set of 20 random Talk pages to ensure that the labels identified in the first iteration generalize. The resulting final tagset consists of 17 labels, which are listed in table 6.4 along with their respective definitions and an example turn from the SEWD corpus.


Article Criticism
  CM – Content incomplete or lacking detail. Example: It should be added (1) that voters may skip preferences, but (2) that skipping preferences has no impact on the result of the elections.
  CW – Lack of accuracy or correctness. Example: Kris Kringle is NOT a Germanic god, but an English mispronunciation of Christkind, a German word that means “the baby Jesus”.
  CU – Unsuitable or unnecessary content. Example: The references should be removed. The reason: The references are too complicated for the typical reader of simple Wikipedia.
  CS – Structural problems. Example: Also use sectioning, and interlinking
  CL – Deficiencies in language or style. Example: This section needs to be simplified further; there are a lot of words that are too complex for this wiki.
  COBJ – Objectivity issues. Example: This article seems to take a clear pro-Christian, anti-commercial view.
  CO – Other kind of criticism. Example: I have started an article on Google. It needs improvement though.

Explicit Performative
  PSR – Explicit suggestion, recommendation or request. Example: This section needs to be simplified further
  PREF – Explicit reference or pointer. Example: Got it. The URL is http://www.dmbeatles.com/history.php?year=1968
  PFC – Commitment to an action in the future. Example: Okay, I forgot to add that, I’ll do so later tonight.
  PPC – Report of a performed action. Example: I took and hopefully simplified the “[[en:Prehistoric music|Prehistoric music]]” article from EnWP

Information Content
  IP – Information providing. Example: “Depression” is the most basic term there is.
  IS – Information seeking. Example: So what kind of theory would you use for your music composing?
  IC – Information correcting. Example: In linguistics and generally speaking, when talking about the lexicon in a language, words are usually categorized as ’nouns’, ’verbs’, ’adjectives’ and so on. The term ’doing word’ does not exist.

Interpersonal
  ATT+ – Positive attitude towards other contributor or acceptance. Example: Thank you.
  ATTP – Partial acceptance or partial rejection. Example: Okay, I can understand that, but some citations are going to have to be included for [[WP:V]].
  ATT- – Negative attitude towards other contributor or rejection. Example: Now what? You think you know so much about everything, and you are not even helping⁈

Table 6.4: Annotation scheme for the dialog act classification in Wikipedia discussion pages with examples from the SEWD Corpus. Some examples have been shortened to fit the table.


Topic level
  ERROR – Segmentation errors
  REFOBJ-PART – Comment about a specific section of the article
  REFOBJ-WHOLE – Comment about the whole article
  REFOBJ-META – Meta comment not directly referring to the article

Turn level
  Article Criticism
    CRITCOMPL – Information is incomplete or lacks detail
    CRITACC – Lack of accuracy, correctness or neutrality
    CRITLANG – Deficiencies in language and style
    CRITSUIT – Content not suitable for an encyclopedia
    CRITSTRUCT – Deficiencies in structure, organization or visual appearance
    CRITAUTH – Lack of authority
  Self Commitment
    ACTF – Commitment to action in the future
    ACTP – Report of past action
  Requests
    REQEDIT – Request for article edit
    REQMAINT – Request for admin or maintenance action
  Interpersonal
    ATTPOS – Positive attitude
    ATTNEG – Negative attitude

Table 6.5: Revised annotation scheme for the dialog act classification in the EWD Corpus. Topic level labels are assigned to whole discussion threads while turn level labels are assigned to individual turns.

6.5.1.2 English Wikipedia

Based on the results of the inter-annotator analysis on the SEWD corpus that is discussed in section 6.5.3 and the feedback we received from the annotators, we revised the SEWD scheme before applying it to the EWD corpus. Most notably, we expanded the annotation scheme with an additional topic scope layer in addition to the turn scope that we already defined for the SEWD experiments. In other words, besides the dialog act labels on the turn level, we add an annotation layer on the topic level that refers to whole discussion topics. The topic scope labels are intended to provide background information about whole discussion threads. The ERROR label marks segmentation or parse errors in the corpus. If any turn is incorrectly parsed or segmented, the whole thread is marked with an error label and potentially rejected from inclusion in the gold standard. Furthermore, the multi-class REFOBJ label defines the point of reference of the discussion topic. A discussion can either refer to the article as a whole (REFOBJ-WHOLE), to a particular section of the article (REFOBJ-PART), or to external content outside of Wikipedia or an off-topic subject (REFOBJ-META).



Figure 6.6: Mapping between the SEWD and EWD annotation schemes. Dashed lines indicate partial equivalence of labels. Red borders indicate labels without equivalence in the other scheme.

For example, a discussion whether the article Computational Linguistics should be merged with Natural Language Processing refers to the article(s) as a whole. The request to improve the lead section of this article would refer to a particular part. Discussions about current events in natural language processing or general remarks about Wikipedia policies would be regarded as meta discussions. The choice between the three points of reference is mutually exclusive, hence the REFOBJ label was defined as a multi-class label instead of three binary labels. That is, a discussion topic can either be marked with REFOBJ-WHOLE, REFOBJ-PART or REFOBJ-META.

On the turn level, the very unspecific CO label was removed due to very low inter-annotator agreement. We furthermore merged the PREF label with the PSR label into a single REQEDIT label, since the two have frequently been confused. We additionally introduced the REQMAINT label indicating requests for maintenance activities that can only be carried out by privileged users such as administrators. All labels from the Information Content category have been discarded, because the IP label proved to be too unspecific while the other labels suffered either from low frequency or from low inter-annotator agreement. We finally removed the ATTP label that originally expressed partial agreement between the users in a discussion. Instead, we defined partial agreement to be represented by assigning both the labels for positive and negative attitude.


To improve the overall inter-annotator agreement and to take the increased complexity of the longer discussions into account, we created a more comprehensive annotator’s manual that was particularly tailored towards providing guidelines for unclear cases. An abridged version of this manual can be found in appendix B.2. Table 6.5 shows the revised annotation scheme for the EWD corpus, while figure 6.6 demonstrates how both annotation schemes can be mapped to each other and which of the labels do not have an equivalent in the other scheme.

6.5.2 Corpus Annotation Process and Gold Standard Creation

We use the MMAX2 annotation tool (Müller and Strube, 2006) for annotating both the SEWD and the EWD corpus. To this end, we convert the segmented Talk pages into the MMAX2 XML format. For the SEWD corpus, we define all turn boundaries as markables, i.e. units annotatable according to the predefined annotation scheme, whereas we define two markable levels for the EWD corpus based on turn and topic boundaries. The screenshot in figure 6.7 shows the MMAX2 annotation tool being used for labeling an in-text reply, i.e. a turn inserted into an existing turn by another user. The inserted turn is highlighted in yellow, marking it as ready for annotation, while the turn in which the in-text reply was inserted becomes a discontinuous turn and is marked in gray. Both parts of the discontinuous turn are still connected and can be annotated as one unit. Figure 6.8 shows the topic annotation screen. All turns in the topic are highlighted while the topic metadata is displayed in the annotation window.

Gold Standard Creation. The SEWD gold standard was created by a third expert annotator who manually consolidated the annotations of the two trained annotators. This consolidation process was carried out within the MMAX2 system that was already used in the annotation process. In order to improve the consolidation process for the bigger EWD corpus and make it easier to handle a larger amount of annotations, we designed a dedicated expert support system. We first read the annotations of both annotators into a UIMA pipeline and store them in individual annotation layers in the same CAS125. In a second step, we remove all discussion threads that have been marked with an ERROR label by any annotator. While this potentially discards usable data, we achieve the highest possible accuracy and reliability. This way, we obtain an overall number of 4,884 turns not marked with any error. In a third step, we identify all cases of disagreement between the annotators and mark them as separate annotations in the CAS. Cases with perfect agreement between the annotators are directly accepted by the system, thus rendering additional review by the expert unnecessary.

125Common Analysis Structure, the object based data structure of the UIMA framework that holds both the data and any standoff annotations.


Figure 6.7: Annotation of a Talk page from the EWD corpus on the turn level in MMAX2. The currently selected turn is highlighted in yellow. A blue line indicates a reply to a previous turn. In this example, the yellow turn is an inserted reply which was placed within a previous turn (marked in gray) that was thereby split in two parts. The turn-level labels from the annotation scheme are displayed on the right.

Figure 6.8: Annotation of a Talk page from the EWD corpus on the topic level in MMAX2. All turns belonging to the currently selected topic are highlighted in yellow. The labels from the annotation scheme are displayed on the right.


Figure 6.9: UIMA CasEditor as the front end of the expert support system for creating the EWD gold standard.

Using the Apache UIMA CasEditor126, the expert annotator can then navigate the disagreement annotations, review the annotator decisions, and enter the final gold standard labels. The resulting gold standard corpus can finally be saved in the UIMA XMI127 format, which allows further processing with the UIMA framework.
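The consolidation logic just described can be summarized in a few lines. The sketch below only illustrates the decision rules (discard ERROR threads, auto-accept perfect agreement, route disagreements to the expert); it is not the actual UIMA/CAS implementation, and the data layout as well as the `expert_resolve` callback are assumptions made for the example.

```python
def build_gold_standard(threads, expert_resolve):
    """Consolidate two annotators' labels into a gold standard.

    `threads` is assumed to map a thread id to a dict with the topic-level
    label sets of both annotators ('topic_a1', 'topic_a2') and a list of
    turns, each carrying the turn-level label sets 'a1' and 'a2'.
    `expert_resolve` is a callback returning the expert's decision.
    """
    gold = {}
    for thread_id, thread in threads.items():
        # Discard whole threads that either annotator marked as ERROR.
        if "ERROR" in thread["topic_a1"] or "ERROR" in thread["topic_a2"]:
            continue
        for i, turn in enumerate(thread["turns"]):
            if turn["a1"] == turn["a2"]:
                gold[(thread_id, i)] = set(turn["a1"])  # accepted automatically
            else:
                gold[(thread_id, i)] = expert_resolve(thread_id, i, turn)
    return gold
```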

6.5.3 Inter-Annotator Agreement

To evaluate the reliability of our datasets, we perform a detailed inter-rater agreement study. For measuring the agreement of the individual labels, we report the observed agreement, Kappa statistics (Carletta, 1996), and F1-scores. The latter are computed by treating one annotator as the gold standard and the other one as predictions (Hripcsak and Rothschild, 2005). The detailed scores for the SEWD corpus along with descriptive statistics regarding label assignments are shown in table 6.6 while the EWD corpus is summarized in table 6.7. A breakdown of the dialog act label assignments per topic is furthermore presented in table 6.8.

For the SEWD corpus, the average observed agreement across all labels is P̄O = 0.94. The individual Kappa scores largely fall into the range that Landis and Koch (1977) regard as substantial agreement, while three labels are above the more strict 0.80 threshold for reliable annotations (Artstein and Poesio, 2008).

126http://uima.apache.org/downloads/releaseDocs/2.3.0-incubating/docs/html/tools/tools.html#ugr.tools. ce accessed on Feb 20, 2014 127http://uima.apache.org/downloads/releaseDocs/2.3.0-incubating/docs/html/references/references.html# ugr.ref.xmi accessed on Feb 20, 2014


              Annotator 1        Annotator 2        Inter-Annotator Agreement       Gold Standard
Label         N      %           N      %           NA1∪A2    PO    κ     F1        N      %
Article Criticism
CM            183    13.4%       105    7.7%        193       .93   .63   .66       116    8.5%
CW            106    7.8%        57     4.2%        120       .95   .52   .55       70     5.1%
CU            69     5.0%        35     2.6%        83        .95   .38   .40       42     3.1%
CS            164    12.0%       101    7.4%        174       .94   .66   .69       136    9.9%
CL            195    14.3%       199    14.6%       244       .93   .73   .77       219    16.0%
COBJ          27     2.0%        23     1.7%        29        .99   .84   .84       27     2.0%
CO            20     1.5%        59     4.3%        71        .95   .18   .20       48     3.5%
Explicit Performative
PSR           458    33.5%       351    25.7%       503       .86   .66   .76       406    29.7%
PREF          43     3.1%        31     2.3%        51        .98   .61   .62       45     3.3%
PFC           73     5.3%        65     4.8%        86        .98   .76   .77       77     5.6%
PPC           357    26.1%       340    24.9%       371       .97   .92   .94       358    26.2%
Information Content
IP            1084   79.3%       1027   75.1%       1135      .89   .69   .93       1070   78.3%
IS            228    16.7%       208    15.2%       256       .95   .80   .83       220    16.1%
IC            187    13.7%       109    8.0%        221       .89   .46   .51       130    9.5%
Interpersonal
ATT+          71     5.2%        140    10.2%       151       .94   .55   .58       144    10.5%
ATTP          71     5.2%        30     2.2%        79        .96   .42   .44       33     2.4%
ATT-          67     4.9%        74     5.4%        100       .96   .56   .58       87     6.4%

Table 6.6: Label frequencies and inter-annotator agreement for the SEWD corpus. NA1∪A2 denotes the number of turns that have been labeled with the given label by at least one annotator. PO denotes the observed agreement.

Furthermore, we obtain an overall pooled Kappa (De Vries et al., 2008) of κpool = 0.67, which is defined as

$$\kappa_{pool} = \frac{\bar{P}_O - \bar{P}_E}{1 - \bar{P}_E} \quad \text{with} \quad \bar{P}_O = \frac{1}{L}\sum_{l=1}^{L} P_{O_l}\,, \qquad \bar{P}_E = \frac{1}{L}\sum_{l=1}^{L} P_{E_l}$$

where L denotes the number of labels, $P_{E_l}$ the expected agreement and $P_{O_l}$ the observed agreement of the l-th label. κpool is regarded to be more accurate than the averaged Kappa.
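For concreteness, the per-label agreement values and the pooled Kappa defined above can be computed as in the following sketch. It assumes that each annotator's decisions for a binary label are given as parallel 0/1 lists; it only illustrates the formulas and is not the evaluation code used for the thesis.

```python
def observed_expected_agreement(a1, a2):
    """Observed and chance-expected agreement for one binary label.

    `a1` and `a2` are parallel lists of 0/1 decisions by the two annotators.
    """
    n = len(a1)
    p_o = sum(1 for x, y in zip(a1, a2) if x == y) / n
    p1, p2 = sum(a1) / n, sum(a2) / n
    p_e = p1 * p2 + (1 - p1) * (1 - p2)
    return p_o, p_e

def cohen_kappa(a1, a2):
    p_o, p_e = observed_expected_agreement(a1, a2)
    return (p_o - p_e) / (1 - p_e)

def pooled_kappa(per_label_decisions):
    """kappa_pool: average P_O and P_E over all labels first, then combine."""
    pairs = [observed_expected_agreement(a1, a2) for a1, a2 in per_label_decisions]
    p_o_bar = sum(p_o for p_o, _ in pairs) / len(pairs)
    p_e_bar = sum(p_e for _, p_e in pairs) / len(pairs)
    return (p_o_bar - p_e_bar) / (1 - p_e_bar)
```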


              Annotator 1        Annotator 2        Inter-Annotator Agreement       Gold Standard
Label         N      %           N      %           NA1∪A2    PO    κ     F1        N      %
Article Criticism
CRITCOMPL     323    6.6%        404    8.3%        501       .94   .59   .62       373    7.6%
CRITACC       671    13.8%       603    12.4%       842       .92   .63   .64       605    12.4%
CRITLANG      233    4.8%        234    4.8%        330       .96   .57   .58       235    4.8%
CRITSUIT      457    9.4%        293    6.0%        579       .92   .41   .51       321    6.6%
CRITSTRUCT    311    6.4%        329    6.8%        467       .94   .51   .55       294    6.0%
CRITAUTH      306    6.3%        369    7.6%        498       .93   .49   .62       315    6.4%
Self Commitment
ACTF          244    5.0%        276    5.7%        352       .96   .63   .63       221    4.5%
ACTP          681    14.0%       652    13.4%       810       .94   .75   .71       551    11.3%
Requests
REQEDIT       518    10.6%       1024   21.0%       1178      .83   .39   .60       419    8.6%
REQMAINT      23     0.5%        79     1.6%        98        .89   .07   .10       13     0.3%
Interpersonal
ATTPOS        452    9.3%        529    10.9%       646       .94   .65   .66       457    9.4%
ATTNEG        200    2.9%        143    2.9%        254       .97   .50   .36       206    4.2%

Table 6.7: Label frequencies and inter-annotator agreement for the EWD corpus. NA1∪A2 denotes the number of turns that have been labeled with the given label by at least one annotator. PO denotes the observed agreement.

For assessing the overall inter-rater reliability of the label set assignments per turn, we chose Krippendorff’s Alpha (Krippendorff, 1980) using MASI, a measure of agreement on set-valued items, as the distance function (Passonneau, 2006). MASI accounts for partial agreement if the label sets of both annotators overlap in at least one label. We achieved an Alpha score of α = 0.75 on the SEWD corpus. According to Krippendorff, datasets with this score are considered reliable and allow tentative conclusions to be drawn.
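The MASI distance used above can be stated compactly. The following function is a sketch of the measure as described by Passonneau (2006), combining the Jaccard coefficient with a monotonicity weight; it is meant as an illustration rather than the exact implementation used in our agreement study.

```python
def masi_distance(set1, set2):
    """Distance between two label sets: 1 - Jaccard * monotonicity.

    The monotonicity weight is 1 for identical sets, 2/3 if one set is a
    subset of the other, 1/3 for partial overlap, and 0 for disjoint sets,
    so partial agreement is rewarded.
    """
    if not set1 and not set2:
        return 0.0
    jaccard = len(set1 & set2) / len(set1 | set2)
    if set1 == set2:
        monotonicity = 1.0
    elif set1 <= set2 or set2 <= set1:
        monotonicity = 2 / 3
    elif set1 & set2:
        monotonicity = 1 / 3
    else:
        monotonicity = 0.0
    return 1.0 - jaccard * monotonicity

# Example: masi_distance({"ACTP", "REQEDIT"}, {"ACTP"}) is roughly 0.67
```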

For the EWD corpus, the average observed agreement across all labels is P̄O = 0.94, similar to the results we achieved in the experiments on the SEWD corpus. However, the individual Kappa scores are generally lower and largely fall into the range that Landis and Koch (1977) regard as moderate agreement, except for three labels on which the annotators achieved substantial agreement. Overall, we obtain a pooled Kappa of κpool = 0.55. Krippendorff’s Alpha measuring the turn-level agreement shows an overall value of α = 0.57, again using MASI as a distance metric. Since the requests category contained labels with particularly low agreement, the reasons for which we discuss later in this section, we also calculated

the overall agreement for all labels except REQMAINT and REQEDIT. After excluding the requests category, we obtain an Alpha of α = 0.60 and a pooled Kappa of κpool = 0.59.

In the SEWD corpus, the CO label showed the lowest agreement of only κ = 0.18. The label was supposed to cover any criticism that is not covered by a dedicated label. However, the annotators reported that they chose this label when they were unsure whether a particular criticism label would fit a certain turn or not. Labels in the interpersonal category all show agreement scores below 0.60. It turned out that the annotators had a different understanding of these labels. While one annotator assigned the labels for any kind of positive or negative sentiment, the other one used the labels to express agreement and disagreement between the participants of a discussion.

In the EWD corpus, the two labels in the request category show the lowest agreement of 0.39 (REQEDIT) and 0.07 (REQMAINT), respectively. The low agreement on the latter label was mainly due to the very low frequency of the maintenance requests in the dataset. The former label, on the other hand, was frequently not recognized in turns that both contain a type of self commitment and a request, or in which the request is less pronounced. For example, the turn

Ok, I’ve restored the links, even the com is legitimate. Feel free to remove any link that you think is spam, but please explain. Thanks. [[User:Pm master|Pm master]] 23:17, 19 July 2007 (UTC)

should have been marked both with an ACTP label, due to the restored links, and with a REQEDIT label, due to the suggestion to review the list and remove any links that do not fit. Cases like this, with more subtle requests, often were not assigned this label correctly. Furthermore, annotator 2 had the tendency to falsely interpret ordinary questions as edit requests, thus causing twice as many assignments of REQEDIT labels as annotator 1 and many false positives.

The relatively high inter-annotator agreement on the assignment of the ACTP label was due to the clear lexical cues associated with this category. The annotators reported that they were mainly looking for terms like I edited, I removed, or I revised to decide whether or not to assign this label. This also explains why the prediction performance of the machine learning classifier described later in this chapter is the highest for this label. Interestingly, the assignment of the similar ACTF label, which indicates future commitment, could less reliably be decided based on lexical cues.

A common problem for all labels in both corpora were contributions with a high degree of indirectness and implicitness. Indirect contributions have to be interpreted in the light of conversational implicature theory (Grice, 1975), which requires contextual knowledge for decoding the intentions of a speaker. For example, the message

Is population density allowed to be n/a?

has the surface form of a question. However, the context of the discussion revealed that the author tried to draw attention to the missing figure in the article and requested it to be filled or removed. The annotators rarely made use of the context, which was a major source of disagreement in the study.

Another difficulty for the annotators were long discussion turns. While, in the SEWD corpus, the average turn consists of 42 tokens, the largest contribution in the corpus is 658 tokens long. In the EWD corpus, this problem was even more severe, with the average turn length being 108 tokens and the longest contribution spanning 3,344 tokens. Turns of this size can cover many topics and subjects and thus comprise many different dialog acts, which increases the probability of disagreement among the annotators. As we have mentioned before, we initially decided to annotate the corpus on the turn level since we were mainly interested in a coarse-grained, turn-based dialog act analysis to identify the types of quality assessment issues and the coordination efforts for article improvement that are reflected in each contribution. Given the low inter-annotator agreement on discussions with long turns, the markables should be reduced to smaller units in order to simplify the individual decisions the annotators have to make and thus improve the overall agreement. In future work, this can be addressed by going from the turn level to the utterance level, because individual utterances are much more limited in their goals and actions than whole turns, which makes consistent annotation easier. This, however, will involve additional efforts for discourse parsing, because the turns have to be further segmented into utterances, which introduces additional noise to the already error-prone parsing process that we introduced in this chapter.

A comparison of our results with the agreement reported for other datasets shows that the reliability of our annotations lies well within the range of the related work. Bender et al. (2011) carried out an annotation study of social acts in 365 discussions from 47 Wikipedia Talk pages. They report Kappa scores for thirteen labels in two categories ranging from 0.13 to 0.66 per label. The overall agreement for each category was κ = 0.50 and κ =

0.59, respectively, which is considerably lower than our κpool = 0.67, but comparable to the agreement our annotators achieved on the EWD corpus. Kim et al. (2010b) annotate pairs of posts taken from an online forum. They use a dialog act tagset with twelve labels customized for modeling troubleshooting-oriented forum discussions. For their corpus of 1,334 posts, they report an overall Kappa of 0.59. Kim et al. (2010a) identify unresolved discussions in student online forums by annotating 1,135 posts with five different speech acts. They report Kappa scores per speech act between 0.72 and 0.94. Their better results might be due to a more coarse-grained label set.


6.5.4 Corpus Analysis

In the following, we provide an analysis of both the SEWD and the EWD gold standard (see section 6.5.2).

6.5.4.1 SEWD Corpus

The SEWD corpus contains 313 discussions consisting of 1,367 turns by 337 users. The average length of a turn is 42 words. 208 of the 337 contributors are registered Wikipedia users, 129 wrote anonymously. On average, each contributor wrote 168 words in 4 turns. However, there was a cluster of 16 people with ≥ 20 contributions.

Table 6.6 shows the frequencies of all labels in the SEWD corpus. The most frequent labels are information providing (IP), requests (PSR) and reports of performed edits (PPC). The IP label was assigned to more than 78% of all 1,367 turns, because almost every contribution provides a certain amount of information. The label was only omitted if a turn merely consisted of a discussion template but did not contain any text or if it exclusively contained questions.

More than a quarter of the turns are labeled with PSR and PPC, respectively. This indicates that edit requests and reports of performed edits are the main subject of discussion. Generally, it is more common that edits are reported after they have been made than to announce them before they are carried out, as can be seen in the ratio of PPC to PFC labels. The number of turns labeled with PSR is almost the same as the number of contributions labeled with either PPC or PFC. This allows the tentative conclusion that nearly all requests potentially lead to an edit action. As a matter of fact, the most common label adjacency pair128 in the corpus is PSR→PPC, which substantiates this assumption.

Article criticism labels have been assigned to 39.4% of all turns. Almost half (241) of the labels from this class are assigned to the first turn of a discussion. This shows that it is common to open a discussion in reference to a particular deficiency of the article. The large number of CL labels compared to other labels from the same category is due to the fact that the Simple English Wikipedia requires authors to write articles in a way that they are understandable for non-native speakers of English. Therefore, the use of adequate language is one of the major concerns of the Simple English Wikipedia community.
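The label adjacency pairs mentioned above (see footnote 128) can be counted with a few lines of code. The sketch below assumes that each discussion is available as a chronologically ordered list of per-turn label sets; it only illustrates the bookkeeping and is not the analysis script used for the corpus study.

```python
from collections import Counter

def adjacency_pairs(discussions):
    """Count label transitions A -> B between adjacent turns.

    `discussions` is a list of discussions, each given as a list of label
    sets, one set per turn in chronological order.
    """
    counts = Counter()
    for turns in discussions:
        for prev_labels, curr_labels in zip(turns, turns[1:]):
            for a in prev_labels:
                for b in curr_labels:
                    counts[(a, b)] += 1
    return counts

# counts.most_common(1) would, for the SEWD data, surface the PSR -> PPC pair.
```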

6.5.4.2 EWD Corpus

The EWD corpus contains 1,864 discussions consisting of 4,923 turns by 2,438 users. The average length of a turn is 109 tokens. 1,682 of the 2,438 contributors are registered Wikipedia users while 750 wrote anonymously. On average, each contributor produced 220 tokens in 2 turns, while the user with the most contributions in the corpus produced 102 turns and a total of 17,304 tokens.

128A label transition A → B is recorded if two adjacent turns are labeled with A and B, respectively.


While, in the SEWD corpus, every turn received at least one dialog act label due to the universal applicability of the labels in the information content category, the EWD corpus contains only 2,729 labeled turns. No dialog act labels were applicable to the remaining 2,194 turns. This was to be expected after redesigning the annotation scheme, since not every turn contributes to article quality assessment and improvement activities. Across all four high-level dialog act categories, criticism labels have been assigned most frequently, with overall 1,778 turns being marked with at least one of these labels. Self commitments could be identified in 749 turns while 432 turns request either article edits or maintenance activities. Finally, positive or negative sentiment or attitudes towards other participants in the discussion are expressed in 655 turns. Table 6.7 shows the frequencies of all individual labels in the EWD corpus.

Most of the discussion topics in the EWD corpus refer to a specific aspect of the associated article as identified by the REFOBJ label. While 130 discussions refer to the article as a whole, i.e. the concept represented by the article, 1,687 discussions refer to specific sections in the article. This reflects the expected usage of the Talk pages, since the work coordination that can be observed there mainly focuses on small, fine-grained steps rather than big-picture discussions. 47 discussions are about topics not directly related to the article and mainly discuss general Wikipedia policies. Table 6.8 gives a breakdown of dialog acts according to the scope of the topic they occur in.

Label              WHOLE   PART    META
CRITCOMPL          43      330     –
CRITACC            62      539     4
CRITLANG           5       230     –
CRITSUIT           42      278     1
CRITSTRUCT         77      216     1
CRITAUTH           33      282     –
ACTF               29      189     3
ACTP               84      464     3
REQEDIT            56      363     –
REQMAINT           4       9       –
ATTPOS             83      373     1
ATTNEG             38      157     11
Total topics       130     1,687   47
Unlabeled topics   13      340     39

Table 6.8: Distribution of dialog acts in the EWD corpus broken down according to the scope of the topic they occur in as defined by the REFOBJ label. It identifies whether a discussion refers to the whole article (WHOLE), a specific part of the article (PART) or to content outside of Wikipedia (META). Unlabeled topics do not contain any turns labeled with any dialog act label from our annotation scheme.


6.6 Automatic Prediction of Dialog Act Labels

As we have already discussed in section 6.5.1, dialog act classification is a multi-label classification task. That is, given a turn t ∈ T and a set of dialog act labels C = {c1, c2, ..., cn}, we want to label each turn t with L ⊂ C, where L is the set of relevant or true labels and |L| ≥ 1. Each label is considered to be binary, i.e. it is either assigned to a turn or not. According to Tsoumakas and Katakis (2007), multi-label classification problems can be approached in two different ways, either by adapting single-label learning algorithms to directly support multi-labeled training data and thus incorporate label interdependencies in the training and classification process (algorithm adaptation) or by decomposing the classification problem into multiple single-label problems (problem transformation). In our experience, decomposing the multi-label problem into individual binary classification problems is the superior solution for noisy data and performs better than algorithms that tackle the multi-label classification in its full form. We therefore choose the problem transformation approach for our experiments. It allows us to train individual classifiers for each label which can either be employed separately or combined in an ensemble method (Fujino et al., 2008). The sequential nature of the dialog will furthermore be explicitly reflected in the feature set. In the following, we first describe the setup of our classification system, present the features employed in our experiments and finally evaluate the performance of the classifiers, including an error analysis.
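The problem transformation (binary relevance) setting described above can be illustrated as follows. The sketch uses scikit-learn only as a stand-in for the Weka-based setup of the thesis; the estimator choice and data layout are assumptions made for the example.

```python
from sklearn.base import clone
from sklearn.naive_bayes import MultinomialNB

def train_binary_relevance(X, Y, labels, base_clf=None):
    """Train one independent binary classifier per dialog act label.

    X is an (n_turns, n_features) matrix; Y maps each label name to a
    binary target vector over the same turns.
    """
    base_clf = base_clf if base_clf is not None else MultinomialNB()
    classifiers = {}
    for label in labels:
        clf = clone(base_clf)
        clf.fit(X, Y[label])
        classifiers[label] = clf
    return classifiers

def predict_label_set(classifiers, x):
    """Return the set of labels whose classifier fires for one instance x
    (x is a single feature row, e.g. X[i:i+1])."""
    return {label for label, clf in classifiers.items() if clf.predict(x)[0] == 1}
```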

6.6.1 Experiment and System Setup

Similar to our approach in the quality flaw prediction experiments described in chapter 5, we developed a UIMA-based (Ferrucci and Lally, 2004) text classification system using the Weka data-mining software (Hall et al., 2009) as a downstream machine learning toolkit. In contrast to the FlawFinder system, which is organized in several independent and self-sustained processing tasks, the dialog act classification system consists of a single UIMA pipeline containing all preprocessing and classification components.

Preprocessing. All necessary preprocessing steps have already been carried out during the dialog segmentation process described in section 6.4.1. The discussions of the gold standard corpus are already segmented into discussion topics, turns, sentences and tokens, and basic meta information about the contributors and their contributions is provided in the corpus. Following the hybrid corpus approach described in section 6.4, additional information can be accessed via the JWPL database containing the full Talk page revision history as well as the history of the associated article.


Classification Algorithms. For the classification task, we use three machine learning algorithms from the Weka data-mining software that have proven to work particularly well for similar tasks described in related work: a Naive Bayes classifier, J48, an implementation of the C4.5 decision tree algorithm (Quinlan, 1992), and SMO, an optimization algorithm for training support vector machines (Platt, 1998). We only employed the default configurations for each machine learning algorithm as defined by the Weka software and did not perform hyperparameter optimization, since we were more interested in the feature engineering aspects than in tweaking the configurations of the algorithms. However, we evaluate the performance of each learning algorithm separately on every dialog act label in order to identify the best classifier combination for the final ensemble pipeline.
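The per-label selection of the best learner for the ensemble can be sketched as below. Again, the scikit-learn estimators only stand in for Weka's NaiveBayes, J48 and SMO, and the cross-validation settings are illustrative assumptions.

```python
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

def select_best_per_label(X, Y, labels, cv=10):
    """For every dialog act label, pick the learner with the best
    cross-validated F1 score; the winners form the ensemble pipeline."""
    candidates = {
        "NaiveBayes": MultinomialNB(),
        "J48": DecisionTreeClassifier(),
        "SMO": LinearSVC(),
    }
    best = {}
    for label in labels:
        scores = {
            name: cross_val_score(clf, X, Y[label], cv=cv, scoring="f1").mean()
            for name, clf in candidates.items()
        }
        best[label] = max(scores, key=scores.get)
    return best
```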

Handling Class Imbalance. Since the number of positive instances for each label is small compared to the number of negative instances, we create a balanced dataset which contains an equal amount of positive and negative instances. To this end, we randomly select the appropriate number of negative instances and discard the rest. This way, we prevent the classifier from being biased towards the majority class. A similar effect can also be reached with cost-sensitive learning (Ling and Sheng, 2010). However, we found that undersampling the majority class achieves similar results while improving the training speed, and is also superior to oversampling the minority class.
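A minimal sketch of this undersampling step is given below, assuming the training data for one label is available as a list of (feature vector, label set) pairs; the seed and data layout are illustrative.

```python
import random

def undersample_negatives(instances, label, seed=0):
    """Balance the data for one binary label by discarding surplus negatives."""
    random.seed(seed)
    positives = [inst for inst in instances if label in inst[1]]
    negatives = [inst for inst in instances if label not in inst[1]]
    kept_negatives = random.sample(negatives, min(len(positives), len(negatives)))
    balanced = positives + kept_negatives
    random.shuffle(balanced)
    return balanced
```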

Feature Selection. We evaluated two different feature selection approaches to prune the feature space and select the most meaningful features for each label, Information Gain (Mitchell, 1997) and the χ² metric (Yang and Pedersen, 1997), but we did not see systematic differences between the two. Even though we first attempted to select a suitable feature selection approach separately for each label, we finally decided to use χ² at all times. We now give an overview of the feature types employed in our experiments.

6.6.2 Features

Since we expect lexical cues to be among the most prominent features for the dialog act classification task, we employ token uni-, bi- and trigrams that occurred in at least three different turns of the corpus. Similar to the flaw prediction experiments described in chapter 5, we replace all links to external pages with a generic EXTERNALLINK label while we mark wiki-internal links with an INTERNALLINK label. We furthermore perform stopword filtering using the stopword list from the snowball stemmer129, which we augmented with punctuation marks.

129http://snowball.tartarus.org/algorithms/english/stop.txt
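The lexical feature extraction described above could be set up roughly as follows. The regular expressions, the abbreviated stopword list and the scikit-learn vectorizer are illustrative assumptions, not the actual feature extractors used in the thesis.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

def normalize_links(text):
    """Replace wiki-internal and external links with generic placeholder tokens."""
    text = re.sub(r"\[\[[^\]]+\]\]", " INTERNALLINK ", text)      # [[Page|label]]
    text = re.sub(r"\[?https?://\S+\]?", " EXTERNALLINK ", text)  # bare or bracketed URLs
    return text

# Hypothetical excerpt of a stopword list standing in for the snowball list;
# punctuation is already stripped by the default tokenizer in this sketch.
STOPWORDS = ["the", "a", "an", "and", "or", "is", "are", "to", "of"]

def build_ngram_vectorizer():
    """Token uni-, bi- and trigrams occurring in at least three different turns."""
    return CountVectorizer(
        ngram_range=(1, 3),
        min_df=3,                     # at least three turns (documents)
        preprocessor=normalize_links,
        stop_words=STOPWORDS,
    )
```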


Dialog has a sequential nature, i.e. the probability of a particular turn being tagged with a specific dialog act label depends on the labels of the previous turn and influences the succeeding turn. This aspect can be accounted for by following a dedicated sequence classification approach. Instead of restricting our system to sequence classifiers, we chose to incorporate the sequential information on the feature level. To this end, each turn is not only represented by the ngrams extracted from its own text but also includes the ngrams of the previous and the next turn. This way, the preceding and succeeding turns influence the label assignment of the given turn.

Since user discussions are likely to suffer from spelling mistakes, the quality and predictive power of the lexical features can be improved by incorporating spelling error correction in a preprocessing step before feature extraction. We did not include this in our system but strongly suggest doing so in future work.

Besides lexical features, we included surface information, such as the length of the current, previous and next turn (in tokens), temporal information, such as the time distance of a turn to the previous and the next turn (in seconds), and positional information, such as the position of a turn within the discussion, its indentation level and two binary features indicating whether a turn references or is referenced by another turn. We assume that a turn t2 references a preceding turn t1 if the indentation level of t2 is one level deeper than that of t1.

Indentation and temporal distance to the preceding turn proved to be the best ranked non-lexical features overall. Additionally, the turn position within the topic was a crucial feature for most labels in the criticism class and for the labels PSR and REQEDIT, respectively. This is not surprising, because article criticism and suggestions or requests tend to occur in the beginning of a discussion. The two reference features have not proven to be useful, since the relational information was already covered by the indentation feature. The subjective quality of the lexical features seems to be correlated with the inter-annotator agreement of the respective labels. Features for labels with low agreement contain many n-grams without any recognizable semantic connection to the label. For labels with good agreement, the feature lists almost exclusively contain meaningful lexical cues.
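Putting the contextual, surface, temporal and positional features together, a single turn's feature representation could be assembled as in the sketch below. The turn dictionary layout (keys 'tokens', 'timestamp', 'position', 'indentation') is an assumption for the example, and only uni- and bigrams are shown for brevity.

```python
def turn_features(turns, i):
    """Feature representation of turn i: its own n-grams plus those of the
    previous and next turn, and simple surface, temporal and positional cues.
    """
    def ngrams(turn, prefix):
        toks = turn["tokens"]
        grams = toks + [" ".join(pair) for pair in zip(toks, toks[1:])]
        return {f"{prefix}_{g}": 1 for g in grams}

    prev_t = turns[i - 1] if i > 0 else None
    next_t = turns[i + 1] if i + 1 < len(turns) else None

    feats = ngrams(turns[i], "cur")
    if prev_t is not None:
        feats.update(ngrams(prev_t, "prev"))
        feats["len_prev"] = len(prev_t["tokens"])
        feats["delta_prev"] = turns[i]["timestamp"] - prev_t["timestamp"]
        # A turn is assumed to reference its predecessor if it is indented one level deeper.
        feats["references_prev"] = int(turns[i]["indentation"] == prev_t["indentation"] + 1)
    if next_t is not None:
        feats.update(ngrams(next_t, "next"))
        feats["len_next"] = len(next_t["tokens"])

    feats["len_cur"] = len(turns[i]["tokens"])
    feats["position"] = turns[i]["position"]
    feats["indentation"] = turns[i]["indentation"]
    return feats
```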

6.6.3 Evaluation and Error Analysis

We used the SEWD corpus as a test bed for identifying the best system configuration, which we then also apply to the larger EWD corpus using the label mapping shown in figure 6.6. On the EWD corpus, we only train classifiers for labels on the turn level, but use the topic level labels as a filter to only include turns about the article content while discarding mere meta discussions (REFOBJ-META). Table 6.9 gives an overview of the performance of all learning algorithms per label on the SEWD corpus as well as the performance of the final ensemble classification pipeline evaluated with 10-fold cross-validation.


Table 6.9: F1-Scores for all classifiers trained on the balanced dataset from the SEWD corpus obtained with 10-fold cross-validation. Best refers to our final ensemble classification pipeline.

Label           Naive Bayes   J48    SMO    Best
CM              .68           .48    .66    .68
CW              .70           .20    .56    .70
CU              .66           .35    .59    .66
CS              .67           .67    .75    .75
CL              .70           .66    .73    .73
COBJ            .78           .51    .63    .78
CO              .61           .06    .39    .61
PSR             .72           .70    .76    .76
PREF            .76           .41    .64    .76
PFC             .70           .62    .73    .73
PPC             .74           .82    .85    .85
IP              .83           .93    .93    .93
IS              .79           .86    .85    .86
IC              .67           .32    .59    .67
ATT+            .61           .65    .72    .72
ATTP            .72           .25    .62    .72
ATT-            .52           .30    .52    .52
Macro average   .70           .52    .68    .73
Micro average   .74           .75    .80    .82

Naive Bayes performed surprisingly well and showed the best macro averaged scores among the three learners, while SMO showed the best micro averaged performance. We furthermore compare our results to two random baselines and to the performance of the human annotators (cf. figure 6.10). While baseline 1 assigns labels according to their frequency distribution in the unbalanced dataset, baseline 2 assigns the labels randomly on the balanced dataset. Our final classifier outperformed the baselines on all labels. The comparison with the human performance shows that our system is able to reach the human performance. In most cases, the annotation agreement is reliable, and so are the results of the automatic classification. For the labels CU and CO, the inter-annotator agreement is low. The comparatively good performance of the classifiers on these labels shows that the instances do have shared characteristics that make automatic classification possible, but they might not be salient enough for human raters to pick up on in manual annotation.

On the EWD corpus, we employ the best configuration obtained on the SEWD corpus. Due to the small number of instances available for the REQMAINT label, we exclude it from the experiments, since the amount of data is not sufficient for training a classifier with it.



Figure 6.10: F1-Scores for the classification pipeline (Best), the human performance and baseline performance on the SEWD corpus. Baseline 1 assigns labels according to their frequency distribution in the unbalanced dataset, while baseline 2 assigns labels at random on the balanced dataset.

While the classifiers trained on the full dataset only achieve an average performance of F1 = 0.56, training on a balanced dataset improves the performance to an average of F1 = 0.78, comparable to the results we achieved on SEWD. This was unexpected, because the inter-annotator agreement was, on average, substantially lower on EWD than on SEWD, which also suggested a lower performance in the automatic classification task. These results weaken the concern that our approach might not be suitable for long turns and show that even larger contributions can be reliably tagged with dialog act labels. Table 6.10 shows an overview of the classifier performance both on the undersampled, balanced dataset and the full, unbalanced dataset, while figure 6.11 compares it to the performance of the human annotation task and the same two baselines used on the SEWD corpus.

To our knowledge, none of the related work on discourse analysis of Wikipedia Talk pages performed automatic dialog act classification. However, there has been previous work on classifying speech acts in other discourse types. Kim et al. (2010a) use Support Vector Machines (SVM) and Transformation Based Learning (TBL) for the automatic assignment of five speech acts to posts taken from student online forums. They report individual

F1-scores per label which result in a macro average of 0.59 for SVM and 0.66 for TBL. Cohen et al. (2004) classify speech acts in emails. They train five binary classifiers using several learners on 1,375 emails and report F1 scores per speech act between 0.44 and 0.85. Despite the larger tagset, we achieved an average F1-score of 0.82 on the SEWD corpus and 0.78 on the EWD corpus, which is comparable to the top results in the related work. In future work, the performance on the dialog act classification task can be further improved by leveraging the higher flexibility of the DKPro TC framework, a flexible generalization of the


Table 6.10: F1-Scores for the classifiers trained on the EWD corpus obtained with 10-fold cross-validation, both on a balanced and the full, unbalanced dataset.

Label           Unbalanced   Balanced
CRITCOMPL       .44          .75
CRITACC         .51          .69
CRITLANG        .49          .77
CRITSUIT        .47          .76
CRITSTRUCT      .60          .85
CRITAUTH        .38          .72
ACTF            .66          .77
ACTP            .71          .87
REQEDIT         .53          .77
ATTPOS          .69          .81
ATTNEG          .53          .78
Macro average   .55          .78
Micro average   .56          .78

FlawFinder system described in chapter 5, in order to integrate a larger number of features and utilize the parameter optimization capabilities of the framework to tune the hyperparameters of the classifiers trained. Also, the use of sequence classification algorithms might be able to make better use of the sequential nature of the discourse than our solution, in which we incorporate features from the previous and the next turn into the representation of each turn to be classified.

6.7 Application Scenario

In order to illustrate the applicability of dialog act classification in Wikipedia Talk page discussions for the information quality management process, we now discuss an application scenario in which the dialog act classifiers are used in a practical setting.

As we have established before, the global discussion activities in the English Wikipedia are on the rise and constitute the main outlet for work coordination in the open Wikipedia community (Schneider et al., 2010; Stvilia et al., 2008). At the same time, the unstructured nature of the Talk pages raises the entry barrier for new community members while exacerbating the navigation through the growing discussion archives.

An enhancement of the discussion subsystem in the MediaWiki software could drastically improve the user experience and increase the productivity of the community. This has often been suggested both by community members and researchers, and activities in this area are ongoing130. Moving from simple wiki pages as a medium for communication to a dedicated, structured discussion system requires substantial investments on the software side.

130http://www.mediawiki.org/wiki/Flow_Portal



Figure 6.11: F1-Scores for the classifiers trained on the balanced and unbalanced dataset, the human performance and two random baselines on the EWD corpus. Baseline 1 assigns labels according to their frequency distribution in the unbalanced dataset, while baseline 2 assigns labels at random on the balanced dataset.

Furthermore, content created before the change to a new discussion system will not be available in a structured form in the new system, and the information thus might become unavailable in the long run.

Our dialog segmentation algorithm as well as the dialog act classifiers can be used as a more lightweight solution by integrating them either as a MediaWiki plugin or an external third-party script (see the Wikimedia Labs discussed in chapter 3.6.2) on top of the existing system. This way, it is possible to provide a more structured representation of the discourse with added meta information that can be used for filtering and searching through the discussion archives. Rather than providing only a pure chronological listing of all past discussions, an augmented user interface allows users to select the aspects of the discourse they are interested in.
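As a toy illustration of such filtering, the snippet below flags archived discussion threads in which article criticism was raised but no subsequent edit was reported. The data layout and the `classify_turn` callback (returning the predicted dialog act labels of a turn) are assumptions made for the example, not part of an existing MediaWiki extension.

```python
CRITICISM_LABELS = {"CRITCOMPL", "CRITACC", "CRITLANG",
                    "CRITSUIT", "CRITSTRUCT", "CRITAUTH"}

def open_issues(discussions, classify_turn):
    """Return titles of threads with unresolved article criticism.

    `discussions` is a list of threads, each a dict with a 'title' and a
    chronologically ordered list of 'turns'.
    """
    flagged = []
    for thread in discussions:
        predicted = [classify_turn(turn) for turn in thread["turns"]]
        has_criticism = any(labels & CRITICISM_LABELS for labels in predicted)
        edit_reported = any("ACTP" in labels for labels in predicted)
        if has_criticism and not edit_reported:
            flagged.append(thread["title"])
    return flagged
```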

6.8 Chapter Summary

In this chapter, we discussed how the information on Wikipedia article Talk pages can be leveraged for information quality management purposes by automatically tagging the contributions with dialog act labels capturing the coordination efforts regarding article improvement.

We first presented an approach to reliably segment the unstructured discourse into individual discussion threads and user turns with the help of the revision history. Thereby,

we are able to retrieve additional meta information for each turn, such as the identity of its author, which is not possible by relying on the optional user signatures alone, as has been done in related work.

We furthermore presented two corpora extracted from the Simple English Wikipedia and the English Wikipedia, which we manually annotated with a novel annotation scheme aimed at reflecting information quality management activities. After a detailed analysis of these corpora and the manual annotations, we employed the data for training machine learning classifiers in order to automatically label unseen turns with the dialog act labels from our scheme. We achieved an average cross-validated performance of F1 = 0.82 on the smaller and simpler SEWD corpus, while we reached an average performance of F1 = 0.78 on the larger EWD corpus.

These results suggest that the classifiers can be utilized in practical systems aimed at supporting the information quality management process in Wikipedia. For instance, the dialog act information can be used to filter the unstructured discussions in order to identify open issues. Furthermore, together with the segmentation algorithm presented before, it can be used to improve access to the old discussions contained in the ever-growing discussion archives.


Chapter 7

Summary and Conclusions

“We have finished the job, what shall we do with the tools?” — Haile Selassie

Information quality management in open collaborative environments is a complex yet vital task. In this work, we have approached the topic from the natural language processing perspective with the overarching question of how language technology can help to improve quality management processes in large communities of open content production such as Wikipedia, and presented two use cases: quality flaw detection in Wikipedia articles and dialog act analysis of article Talk pages. In this chapter, we give a brief summary of each chapter and provide an outlook on future research directions.

7.1 Summary

Collaboration We discussed the foundations of open collaboration in chapter 2 and introduced the main characteristics of collaborative writing, how it differs from individual writing and how open online collaboration adds an additional level of complexity to the writing task. We furthermore provided a brief description of successful systems for collaborative online writing with a particular focus on wiki technology.

Wikipedia In chapter 3, we then narrowed our focus to Wikipedia as one of the most successful online platforms for open content production and discussed its main structures and properties, its community and different approaches to process the large amounts of data the resource contains. We established that the policies governing Wikipedia and shaping its content are collaboratively defined and change over time. While large parts of these policies are shared across the different language versions, each edition has an individual take on the philosophy, which leads to a different culture in each Wikipedia.


Even though Wikipedia contains an almost incomprehensibly large set of rules and guidelines, the basic principles can be boiled down to the five pillars of Wikipedia, which form the foundation for a soft security system. A unique characteristic of Wikipedia is the revision history that is kept for every page and which allows keeping track of every change ever made to the encyclopedia. At the same time, the revision history is the reason for the large amount of data Wikipedia sums up to, which makes it difficult to process as a whole.

User communication is mainly performed on the Talk pages, an unstructured discussion space in dedicated namespaces. Article Talk pages are used to coordinate the article development and discuss the future fate of an article. User talk pages, on the other hand, are used as the main means of communication between the users. There are different ways to access Wikipedia, ranging from direct access to the live databases via a web API, over manual processing of downloadable XML dumps, to dedicated, database-driven programming interfaces. The best solution depends on the application’s need for data currency and speed and is always a compromise.

While the main reason for Wikipedia’s success is its policy that everyone can contribute, the same policy also constitutes the greatest challenge. In order to establish Wikipedia as a trustworthy and comprehensive reference work with a quality level equal to edited encyclopedias, Wikipedia needs a quality management process that can cope with the almost anarchic culture that Wikipedia is based on. Taking into account the unparalleled size of the larger Wikipedia editions, a satisfactory solution can only be reached with computational assistance.

Information Quality In chapter 4, we discussed the concept of information quality and its application for information quality management. We have established that information quality, in the broadest sense, is a measure of the “fitness for use” of an information entity in a given application scenario. While it is not possible to define a single universal model of information quality, the models differ in how far they have been adapted to a particular application, medium or user group. The notion of text quality refers to an information quality model for textually represented information which particularly takes the writing quality of a text into account. In order to construct an information quality model for Wikipedia articles, we reviewed the existing mechanisms for information quality management in Wikipedia to gain an overview of how the concept of quality is interpreted in this community. Based on the widely accepted generic IQ model by Wang and Strong, we then described an article quality model with 23 dimensions in four layers that particularly includes writing quality as a major component. The role of this model is to provide a means of orientation with respect to the aspects of quality that can be assessed with our proposed methods and also to show the gaps that remain.


Quality Flaw Detection Chapter 5 contains one of the main contributions of this work. We presented an approach to automatically identify quality flaws in Wikipedia articles by means of cleanup template prediction. While cleanup templates are good proxies for quality flaws and thus a viable resource for compiling quality flaw corpora as training data for machine learning classifiers, we found that many templates exhibit a topic bias that negatively influences the classifier performance and even biases manual analyses. We found that certain templates exhibit a topical preference, i.e. they tend to occur in articles about particular subjects, or even show a topical restriction, i.e. the templates exclusively occur in articles about particular topics. This fact has to be taken into account when sampling the data for quality flaw corpora in order to avoid a topic bias that influences both any data analyses and machine learning classifiers trained on this data. We therefore introduced an approach to extract reliable positive and negative training instances from the article revision history which factors out the topic bias and improves the overall data quality. We furthermore presented a corpus of articles with neutrality and style flaws that has been sampled with this technique. Our machine learning experiments on this corpus show that the reliable classifiers tend to exhibit a lower cross-validated performance than the classifiers trained on the biased datasets, but the scores more closely resemble their actual performance in the wild.

We closed the chapter by describing an approach for mining quality flaw corrections from the revision history. This method can be used both to create a new parallel corpus of flawed and unflawed language and to identify quality flaws within articles rather than just identifying flawed articles.

Dialog Analysis The second major contribution of this work was introduced in chapter 6, where we discussed how the content of Wikipedia article Talk pages can be leveraged for information quality management purposes by automatically tagging the user contributions with dialog act labels capturing the coordination efforts regarding article improvement. We first presented an approach to reliably segment the unstructured discourse into individual discussion threads and user turns with the help of the revision history. Thereby, we are able to retrieve additional meta information for each turn, such as the identity of its author, which is not possible by relying on the optional user signatures alone, as has been done in related work. We furthermore presented two corpora extracted from the Simple English Wikipedia and the English Wikipedia, which we manually annotated with a novel annotation scheme aimed at reflecting information quality management activities. After a detailed analysis of these corpora and the manual annotations, we employed the data for training machine learning classifiers in order to automatically label unseen turns with the dialog act labels from our scheme. We achieve an average cross-validated

157 Chapter 7. Summary and Conclusions

performance of F1 = 0.82 on the smaller and simpler SEWD corpus while we reach an average performance of F1 = 0.78 on the larger and more complex EWD corpus. These results suggest that the classifiers can be utilized in practical systems aimedat supporting the information quality management process in Wikipedia. For instance, the dialog act information can be used to filter the unstructured discussions in order to identify open issues. Furthermore, together with the segmentation algorithm presented before, it can be used to improve access to old discussions contained in the ever growing discussion archives.

7.2 Future Research Directions

In this final section, we identify the limitations of the current work and outline how they can be addressed in future work. We furthermore point out future research directions that can build upon the work presented in this thesis.

Connections between Article Discussions and Article Revisions. In this work, we have discussed the applicability of dialog analysis of Wikipedia Talk pages for improving the information quality management process, in particular the work coordination aspects. However, the Talk pages are also an invaluable resource for gaining deeper insights into the collaborative writing process. The side-by-side development of the articles on the one hand and the associated meta discussions on the other hand, which refer to the evolution of the article, are unparalleled information sources for analyzing the interaction between text reception and production by the so-called prosumers. Prosumers are members of collaborative content production communities who switch between the roles of consumers and producers of information. The interaction between these two processes has a unique impact on the resulting content that has so far not been researched in detail due to a lack of sufficient data. Bringing the dialog analysis together with related work on processing the article revision history therefore promises a leap forward in understanding open collaboration in writing. A first study on this topic has recently been carried out by Daxenberger and Gurevych (2014), who utilize the Talk page segmentation approach presented in chapter 6 for creating their dataset. The authors achieve an accuracy of 0.86 in automatically identifying corresponding pairs of article edits and discussion turns on the article Talk page with the help of a machine learning classifier. In the future, such a system could be combined with the dialog act tagset proposed in this thesis in order to obtain a more comprehensive understanding of the relation between work coordination and edit activities.

Dialog Segmentation Accuracy and Speed. To the best of our knowledge, the dialog segmentation algorithm introduced in chapter 6.4.1 was the first approach to go beyond mere markup parsing and use the revision history of the Talk pages as an additional information source for the segmentation process. This enabled us to reflect phenomena such as discontinuous turns or inserted replies, which is not possible with a markup based segmenter. However, as the discussion pages get older and thus the revision history grows, the speed of the segmentation drastically decreases. While this is not a big issue for batch processing, it impedes the applicability in a real-time setting, for instance for improving the Talk page presentation and organization on the client side without having to alter the system setup on the wiki server.

Dialog Act Sequences and Co-Occurrences. In this thesis, we aimed to account for the sequential aspects of the dialog by incorporating the text of the previous and next turn as additional features for any given turn. However, this approach cannot make use of previous classifier decisions when labeling a given turn with dialog acts. When classifying a turn, it might not only be useful to look at the text of the previous turn, but also at the labels the previous turn received. This can be achieved with a sequence classification approach, such as Conditional Random Fields or Hidden Markov Models, which can make better use of the inter-turn dependencies. The same rationale also applies to dependencies between labels within a given turn. That is, future work should also take into account label co-occurrences in the classification task. With a deeper incorporation of vertical (inter-turn) and horizontal (intra-turn) dialog act patterns, it will be possible to develop models that can predict the future, i.e. the dialog acts of the upcoming turns in a discussion. This, in turn, leads to interesting applications such as predicting whether a current thread is already resolved or demands further attention of the community.
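To make the idea of exploiting inter-turn dependencies concrete, the following sketch shows a first-order Viterbi decoder that selects the most likely label sequence for a thread from per-turn classifier scores and label transition scores. It only illustrates the generic decoding step shared by HMM- and CRF-style sequence models; the score matrices are assumed inputs and the code is not part of the system described in this thesis.

import java.util.Arrays;

public class TurnSequenceDecoder {

    // emission[t][l]: log-score of label l for turn t, e.g. taken from a per-turn classifier.
    // transition[k][l]: log-score of label l directly following label k (assumed to be given).
    // Returns the highest-scoring label index for every turn of the thread.
    public static int[] decode(double[][] emission, double[][] transition) {
        int turns = emission.length;
        int labels = emission[0].length;
        double[][] score = new double[turns][labels];
        int[][] backPointer = new int[turns][labels];

        // Initialization: the first turn is scored by its emissions alone.
        score[0] = Arrays.copyOf(emission[0], labels);

        // Recursion: pick the best predecessor label for every (turn, label) pair.
        for (int t = 1; t < turns; t++) {
            for (int l = 0; l < labels; l++) {
                double best = Double.NEGATIVE_INFINITY;
                int bestPrev = 0;
                for (int k = 0; k < labels; k++) {
                    double s = score[t - 1][k] + transition[k][l];
                    if (s > best) {
                        best = s;
                        bestPrev = k;
                    }
                }
                score[t][l] = best + emission[t][l];
                backPointer[t][l] = bestPrev;
            }
        }

        // Termination and backtracking of the best path.
        int[] sequence = new int[turns];
        int bestLast = 0;
        for (int l = 1; l < labels; l++) {
            if (score[turns - 1][l] > score[turns - 1][bestLast]) {
                bestLast = l;
            }
        }
        sequence[turns - 1] = bestLast;
        for (int t = turns - 1; t > 0; t--) {
            sequence[t - 1] = backPointer[t][sequence[t]];
        }
        return sequence;
    }
}

A full CRF would additionally learn the transition scores from the annotated corpora instead of treating them as fixed inputs, but the decoding over turns remains the same.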

Granularity of Annotation Units. We have found that the granularity of the annotation units used in the dialog act labeling task and the quality flaw recognition has a strong impact on the reliability of the training data.
In the dialog act labeling task, we have used turns as basic annotation units. Each turn is manually labeled with multiple dialog act labels. As we have discussed in chapter 6, this is a viable approach for shorter turns, but turn-level annotation is subject to a low inter-annotator agreement if long turns are involved. By extension, these turns also cause problems in the automatic classification tasks, since they introduce too much noise for a precise classification. Future work should therefore consider using utterances as annotation units instead of labeling whole turns. This adds the additional complexity of utterance segmentation, which is a research topic of its own. While some sentences can contain multiple utterances, a single utterance could also span multiple sentences. It has to be evaluated whether the added complexity is justified by the possible gains in data reliability.
In the quality flaw prediction task, we have used cleanup labels assigned to whole Wikipedia articles by Wikipedia users. Thus, the annotation units are whole documents. The annotation study in chapter 5 has shown that a human annotator can have problems reliably identifying the presence of a single flaw in a whole document. The ability to decide the presence or absence of a flaw in an article strongly depends on the nature of the flaw and how it is represented in the article. While it is fairly easy to identify a bad article structure from looking at the article as a whole, it is more difficult to reliably identify grammatical mistakes in a longer text. Thus, structural flaws might be well represented by article-scope flaw markers, while fine-grained flaws, such as language errors, should be marked directly in the text as inline- or section-scope flaws. We have already presented a method to perform the quality flaw prediction task on the sentence level in chapter 5.6 and encourage future work to extend this approach. It has to be decided for each flaw individually on which level of granularity the given flaw should be handled.

Domain Adaptation of Quality Flaw Classifiers. This work is largely focused on Wikipedia as its main subject of analysis. While Wikipedia is one of the largest online communities for open, collaborative content production, it is not the only platform that could benefit from technology-assisted information quality management. A yet unsolved question is how we can extrapolate the knowledge gained in the context of Wikipedia to improve information quality assessment processes in other collaborative platforms. In a first step, the approaches presented in this work could be transferred to platforms based on a similar technology, i.e. the MediaWiki software. In a further step, the models learned on Wikipedia can be adapted to other resources. For example, quality flaw detection could not only be carried out on Wikipedia articles but on arbitrary texts such as online news articles, e.g. in order to detect neutrality issues or biased language. However, it is safe to assume that not all of the community-defined quality flaw labels will generalize equally well. Furthermore, Wikipedia-specific features have to be avoided. With an appropriate domain adaptation technique, the Wikipedia quality flaw data could be bootstrapped to be used outside of the wiki context.

Integration in the Quality Management Process. The focus of this work was to provide the theoretical foundations for improving information quality management with the help of natural language processing. However, putting theory to practical use is often a complex task of its own. We have already sketched in chapter 6.7 how the dialog act classification system may be integrated into Wikipedia as a user script hosted on the Wikimedia Labs platform. The quality flaw classifiers could furthermore be used to automatically identify quality problems to be reviewed by experienced Wikipedia users. This would reduce the manual labor necessary in the review process for featured and good articles and might eventually lead to an increase in the number of articles marked as excellent content.

Appendix

A Open Source Software

In this section, we give an overview of the open source software that has been developed in the course of this thesis or that is based on work presented in this thesis. Since open source projects are joint efforts of several developers, the descriptions in this chapter indicate our own contribution to each project and how it relates to the work presented in this thesis.

A.1 Wikipedia Revision Toolkit

The Wikipedia revision history is a valuable resource for NLP and has been used for a wide variety of applications such as spelling error detection, text simplification, text summarization or paraphrasing (Ferschke et al., 2013). Even though article revisions are available from the official Wikipedia revision dumps, accessing this information on a large scale is still a computationally intensive and thus complex task. This is due to two main problems. First, the revision dump contains all revisions as full text. Whenever a single character is changed in an article, the whole article is stored again in full. This results in a massive amount of data which requires bulk processing on powerful hardware and does not easily allow structured access to arbitrary content. Second, without an efficient API for accessing article revisions on a large scale, any research endeavor has to reinvent the wheel whenever information from the revision history is needed.
In order to tackle these two problems, we have developed the RevisionMachine as part of the Wikipedia Revision Toolkit (WRT)131. First, we describe our solution to the storage problem. Second, we present several use cases of the RevisionMachine and show how its API simplifies experimental setups.

131Beside the RevisionMachine, the WRT also contains the TimeMachine which re-creates arbitrary earlier states of Wikipedia from a single revision dump (Ferschke et al., 2011). The software is open source and available under http://jwpl.googlecode.com. It is a joint effort of several developers under the lead of Oliver Ferschke and Torsten Zesch. The individual contributions are recorded in the public SVN history on Google Code and visualized on http://www.ohloh.net/p/jwpl. The algorithms used in the RevisionMachine are based on preliminary work by Kulessa (2008).

A.1.1 Revision Storage

As each revision of a Wikipedia article stores the full article text, the revision history obviously contains a lot of redundant data. The RevisionMachine makes use of this fact and utilizes a dedicated storage format which stores a revision only by means of the changes that have been made to the previous revision. For this purpose, we have tested existing diff libraries, like Javaxdelta132 or java-diff133, which calculate the differences between two texts. However, both their runtime and the size of the resulting output were prohibitive for the given size of the data. Therefore, we have developed our own diff algorithm, which is based on a longest common substring search and constitutes the foundation for our revision storage format. The processing of two subsequent revisions can be divided into four steps:

– First, the RevisionMachine searches for all common substrings with a user-defined minimal length.

– Then, the revisions are divided into blocks of equal length. Corresponding blocks of both revisions are then compared. If a block is contained in one of the common substrings, it can be marked as unchanged. Otherwise, we have to categorize the kind of change that occurred in this block. We differentiate between five possible actions: Insert, Delete, Replace, Cut and Paste134. This information is stored in each block and is later on used to encode the revision.

– In the next step, the current revision is represented by means of a sequence of actions performed on the previous revision. For example, in the adjacent revision pair

  r1: This is the very first sentence!
  r2: This is the second sentence

  r2 can be encoded as

  REPLACE 12 10 'second'
  DELETE 31 1

– Finally, the string representation of this action sequence is compressed and stored in the database.

With this approach, we reduce the disk space required for an English Wikipedia dump from June 15, 2010 containing all article revisions from 5,470 GB to only 96 GB, i.e. by 98%, while maintaining direct access to any data record, which is a key advantage over compressing the dump as a whole with a standard compression algorithm. The converted and compressed data records are stored in a MySQL database, which provides sophisticated indexing mechanisms for high-performance access to the data.
Obviously, storing only the changes instead of the full text of each revision trades speed for space. Accessing a certain revision now requires reconstructing the text of the revision from a list of changes. As articles often have several thousand revisions, this might take too long. Thus, in order to speed up the recovery of the revision text, every n-th revision is stored as a full revision. A low value of n decreases the time needed to access a certain revision, but increases the demand for storage space. We have found n = 1000 to yield a good trade-off. If hard disk space is no limiting factor, the parameter can be set to 1 or another small number to avoid the compression of the revisions and maximize the performance. This parameter, among a few other possibilities to fine-tune the process, can be set in a graphical user interface provided with the RevisionMachine (see figure A.1).

132http://javaxdelta.sourceforge.net
133http://www.incava.org/projects/java/java-diff
134Cut and Paste operations always occur pairwise. In addition to the other operations, they can make use of an additional temporary storage register to save the text that is being moved.

Figure A.1: Configuration GUI for the RevisionMachine
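To illustrate how a revision text can be recovered from this storage scheme, the following sketch starts from the nearest preceding full revision and re-applies the stored action sequences of all intermediate revisions. The StoredRevision interface, the action encoding and the assumption that offsets refer to the current state of the buffer are simplifications for illustration only and do not reproduce the actual RevisionMachine implementation.

import java.util.List;

public class RevisionReconstructor {

    // Illustrative storage record: either a full revision or a list of actions,
    // e.g. {"REPLACE", "12", "10", "second"}, {"INSERT", "12", "text"}, {"DELETE", "31", "1"}.
    public interface StoredRevision {
        boolean isFullRevision();
        String getFullText();        // only valid for full revisions
        List<String[]> getActions(); // only valid for diff-encoded revisions
    }

    // Reconstructs the text of history.get(index) by walking back to the nearest
    // full revision and re-applying the stored diffs of all revisions in between.
    // For simplicity, offsets are interpreted relative to the current buffer state.
    public static String reconstruct(List<StoredRevision> history, int index) {
        int start = index;
        while (!history.get(start).isFullRevision()) {
            start--;
        }
        StringBuilder text = new StringBuilder(history.get(start).getFullText());

        for (int i = start + 1; i <= index; i++) {
            for (String[] action : history.get(i).getActions()) {
                int offset = Integer.parseInt(action[1]);
                switch (action[0]) {
                    case "REPLACE":
                        text.replace(offset, offset + Integer.parseInt(action[2]), action[3]);
                        break;
                    case "INSERT":
                        text.insert(offset, action[2]);
                        break;
                    case "DELETE":
                        text.delete(offset, offset + Integer.parseInt(action[2]));
                        break;
                    default:
                        throw new IllegalArgumentException("Unknown action: " + action[0]);
                }
            }
        }
        return text.toString();
    }
}

The sketch also makes the speed/space trade-off tangible: a smaller n shortens the backward walk to the nearest full revision and thus the number of diffs that have to be re-applied.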

A.1.2 Revision Access

After the converted revisions have been stored in the revision database, the database can either be used stand-alone or in combination with additional data extracted from the Wikipedia dump by the JWPL (Zesch et al., 2008). The latter option makes it possible to combine the capabilities of the RevisionMachine with other components like the JWPL parser for the MediaWiki syntax.
In order to set up the RevisionMachine, it is only necessary to provide the configuration details for the database connection (see listing A.1). Upon first access, the database user has to have write permission on the database, as indexes have to be created. For later use, read permission is sufficient.

// Set up database connection
DatabaseConfiguration db = new DatabaseConfiguration();
db.setDatabase("dbname");
db.setHost("hostname");
db.setUser("username");
db.setPassword("pwd");
db.setLanguage(Language.english);
// Create API objects
Wikipedia wiki = WikiConnectionUtils.getWikipediaConnection(db);
RevisionIterator revIt = new RevisionIterator(db);
RevisionApi revApi = new RevisionApi(db);

Listing A.1: Setting up the RevisionMachine

Access to the RevisionMachine is achieved via two API objects. The RevisionIterator allows iterating over all revisions in Wikipedia. The RevisionApi grants access to the revisions of individual articles. In addition to that, the Wikipedia object provides access to JWPL functionalities. In the following, we describe three use cases of the RevisionMachine API, which demonstrate how it is easily integrated into experimental setups.

// Iterate over all revisions of all articles
while (revIt.hasNext()) {
    Revision rev = revIt.next();
    rev.getTimestamp();
    rev.getArticleID();
    // process revision
    ...
}

Listing A.2: Iteration over all revisions of all articles

Processing all article revisions in Wikipedia The first use case focuses on the utilization of the complete set of article revisions in a Wikipedia snapshot. Listing A.2 shows how to iterate over all revisions. Thereby, the iterator ensures that successive revisions always correspond to adjacent revisions of a single article in chronological order. The start of a new article can easily be detected by checking the timestamp and the article id. This approach is especially useful for applications in statistical natural language processing, where large amounts of training data are a vital asset.

Processing revisions of individual articles The second use case shows how the RevisionMachine can be used to access the edit history of a specific article. The example in listing A.3 illustrates how all revisions for the article Automobile can be retrieved by first performing a page query with the JWPL API and then retrieving all revision timestamps for this page, which can finally be used to access the revision objects.

// Get article with title "Automobile"
Page article = wiki.getPage("Automobile");
int id = article.getPageId();
// Get all revisions for the article
Collection<Timestamp> revisionTimeStamps = revApi.getRevisionTimestamps(id);
for (Timestamp t : revisionTimeStamps) {
    Revision rev = revApi.getRevision(id, t);
    // process revision
    ...
}

Listing A.3: Accessing the revisions of a specific article

Accessing the meta data of a revision The third use case illustrates the access to the meta data of individual revisions. The meta data includes the name or IP of the contributor, the additional user comment for the revision and a flag that identifies a revision as minor or major. Listing A.4 shows how the number of edits and unique contributors can be used to indicate the level of edit activity for an article.

// Meta data provided by the RevisionApi
StringBuffer s = new StringBuffer();
s.append("The article has " + revApi.getNumberOfRevisions(pageId) + " revisions\n");
s.append("It has " + revApi.getNumberOfUniqueContributors(pageId) + " unique contributors\n");
s.append(revApi.getNumberOfUniqueContributors(pageId, true) + " are registered users\n");
// Meta data provided by the Revision object
s.append((rev.isMinor() ? "Minor" : "Major") + " revision by: " + rev.getContributorID());
s.append("\nComment: " + rev.getComment());

Listing A.4: Accessing the meta data of a revision

A.2 DKPro Text Classification Framework

The DKPro Text Classification Framework (DKPro TC)135 is an open source text classification system that emerged from an enhancement and generalization of the FlawFinder system described in chapter 5.4.
For the quality flaw prediction task, we required a system for exploring a wide range of machine learning algorithms that allows automatically optimizing the hyperparameters for each algorithm and the configuration of the preprocessing and that, above all, provides easily extensible feature extraction capabilities. We already described in chapter 5.4.1 how we designed the FlawFinder system to fulfill all of the above requirements for the particular task of flaw detection.
DKPro TC takes the FlawFinder approach to the next level by scaling to generic supervised learning problems involving textual data. The main goal of DKPro TC is to move the focus away from the mere technical aspects of machine learning experiments and instead stress the importance of higher-level design decisions and the development of an expressive feature set for the task at hand. Therefore, DKPro TC automates as many aspects of the experiment workflow as possible while still letting the researcher control and monitor any aspect of the process. At the time of writing, DKPro TC supports three different experiment modes:

135The software is open source and available under http://dkpro-tc.googlecode.com. It is a joint effort of several developers under the lead of Johannes Daxenberger, Oliver Ferschke and Torsten Zesch and based on the FlawFinder system developed by Oliver Ferschke in the course of this thesis. The individual contributions are recorded in the public SVN history on Google Code and visualized on http://www.ohloh.net/p/dkpro-tc.

                Single-label                                      Multi-label                                  Regression
Document Mode   Spam Detection, Sentiment Detection               Text Categorization, Keyphrase Assignment    Text Readability
Unit Mode       Word Sense Disambiguation, Sentiment Aspect Detection   Dialog Act Tagging                     Word Difficulty
Pair Mode       Paraphrase Identification, Textual Entailment     Relation Extraction                          Text Similarity

Table A.1: Supervised learning scenarios supported by DKPro TC with exemplary NLP applications

– In document mode, each input document is treated as an individual entity to be classified, e.g. an email classified as spam or ham.

– In unit mode, each input document contains several units to be classified. It is usually not possible to split the document into separate documents, because the context of each unit needs to be preserved, e.g. in word sense disambiguation with Lesk (Lesk, 1986).

– The pair mode is intended for problems which require a pair of texts as input, e.g. a pair of sentences to be classified as paraphrase or non-paraphrase.

Each mode can either be employed to perform a single-label classification task, a multi-label classification task or a regression problem. Table A.1 gives an overview of exemplary machine learning tasks that can be solved with the individual combinations of experiment modes and learning problems. The overall concept of DKPro TC is similar to the original FlawFinder system but has been refined and generalized in several aspects. The system architecture can be separated into the following six components, which we will describe briefly in the remainder of this section.

Workflow Engine Like FlawFinder, DKPro TC uses the DKPro Lab (Eckart de Castilho and Gurevych, 2011) as a runtime environment, which allows defining configurable, task-based experiment workflows. Most of the modules described in the remainder of this section are implemented as Lab-Tasks which each contain a single UIMA pipeline. All tasks are wired together in the experiment definition in order to set up the overall experiment workflow.

Each task can furthermore be parameterized in order to control the execution of the inner NLP pipeline. This way, different settings for the UIMA processing components can be employed, thus enabling the researcher to identify an optimal configuration of the experiment. The DKPro Lab thereby makes sure that intermediate output is not unnecessarily recalculated if the parameters for the given task did not change since the last execution. In order to shield the user from the complexity of the task definitions and task wiring, several standard application scenarios, such as machine learning with cross validation or train/test evaluation, are already included in the framework and can be used out of the box.

Reading Input Data In order to make a dataset available to the DKPro TC framework, a UIMA reader has to be created that converts the dataset into the UIMA Common Analysis Structure (CAS). Depending on the experiment mode, the DKPro TC framework furthermore requires the user to define the gold standard labels for each classification unit in an outcome annotation which will be used for training and evaluation by the framework.

Preprocessing Once the dataset has been imported into DKPro TC, standard UIMA components can be used to preprocess the data according to the requirements of the downstream feature extractors. DKPro Core136, for example, provides a large collection of general-purpose NLP components that can be used for this purpose.

Feature Extraction Defining an expressive feature set is one of the main aspects of machine learning experiments. Feature extractors in DKPro TC are UIMA analysis engines that can make use of all the annotations created by the preprocessing components. The framework defines interfaces for the different available experiment modes which govern the behavior of the extractors. That is, while a document feature extractor extracts features from a whole CAS, unit feature extractors extract features only from a particular span of text. Document pair extractors furthermore extract features from pairs of documents. All features are stored in a global feature store, which makes it possible to use the extracted features independently of the downstream machine learning algorithm or toolkit.
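As an illustration of what such an extractor looks like conceptually, the sketch below computes a single document-level feature, the average sentence length. The Feature class and the extractor shape are simplified stand-ins for the corresponding DKPro TC concepts rather than the exact API; a real extractor would read Sentence and Token annotations from the CAS produced during preprocessing instead of splitting raw text itself.

import java.util.Collections;
import java.util.List;

// Illustrative document-level feature extractor, not the actual DKPro TC interface.
public class AvgSentenceLengthExtractor {

    public static final class Feature {
        final String name;
        final double value;
        Feature(String name, double value) {
            this.name = name;
            this.value = value;
        }
        @Override
        public String toString() {
            return name + "=" + value;
        }
    }

    // Emits a single feature: the average number of whitespace-separated tokens per sentence.
    public List<Feature> extract(String documentText) {
        String[] sentences = documentText.split("(?<=[.!?])\\s+");
        int tokenCount = 0;
        for (String sentence : sentences) {
            if (!sentence.trim().isEmpty()) {
                tokenCount += sentence.trim().split("\\s+").length;
            }
        }
        double avg = sentences.length == 0 ? 0.0 : (double) tokenCount / sentences.length;
        return Collections.singletonList(new Feature("AvgSentenceLength", avg));
    }
}

Because all emitted features are written to the global feature store, a sketch like this stays agnostic to whether the downstream learner is Weka, Mallet or Mulan.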

Supervised Learning DKPro TC does not provide its own implementations of machine learning algorithms but rather contains interfaces to established machine learning toolkits. At the time of writing, DKPro TC provides full support for the algorithms in the Weka machine learning toolkit (Hall et al., 2009) while providing basic support for the classification algorithms in the Mallet toolkit (McCallum, 2002). Multi-label classification experiments furthermore make use of the Mulan library (Tsoumakas et al., 2010). DKPro TC takes care of preparing the data to fit the requirements of the chosen machine learning software.

136http://dkpro-core-asl.googlecode.com

Evaluation and Reporting Depending on the experiment mode, DKPro TC provides an overview of the precision, recall and F1-scores achieved with each experiment configuration. It furthermore provides the option to record confusion matrices, list the actual predictions assigned to each document by the classifiers and give an overview of the feature rankings if a feature selection algorithm has been employed. The user is furthermore free to attach additional report modules to the experiment which can then record arbitrary additional information.
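For reference, the per-label scores listed in these reports can be derived from the raw true-positive, false-positive and false-negative counts as in the following generic sketch; this only illustrates the standard definitions and is not DKPro TC code.

// Generic per-label precision, recall and F1 from raw counts.
public final class ClassificationScores {

    public static double precision(int truePositives, int falsePositives) {
        return truePositives + falsePositives == 0
                ? 0.0 : (double) truePositives / (truePositives + falsePositives);
    }

    public static double recall(int truePositives, int falseNegatives) {
        return truePositives + falseNegatives == 0
                ? 0.0 : (double) truePositives / (truePositives + falseNegatives);
    }

    public static double f1(int truePositives, int falsePositives, int falseNegatives) {
        double p = precision(truePositives, falsePositives);
        double r = recall(truePositives, falseNegatives);
        return p + r == 0.0 ? 0.0 : 2 * p * r / (p + r);
    }
}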

B Annotation Guidelines

In the following sections, we reproduce the annotation guidelines provided to the annotators of the SEWD and EWD corpora. The guidelines have been reformatted and shortened, where appropriate. In addition to these documents, the annotators received additional instructions and training on demand.

B.1 Annotation guidelines for the SEWD corpus

In contrast to the EWD corpus, the annotation of the SEWD corpus was carried out by two annotators who were mainly trained orally without providing an exhaustive annotator's manual. The annotators were furthermore supervised during a training period that preceded the annotation task. In this training period, a small set of Talk pages was annotated. Afterwards, the annotations were discussed with the instructor and the annotators were asked to justify all of their decisions. The Talk pages annotated in the pre-study have not been included in the SEWD corpus. The annotation process is further described in chapter 6. Apart from the oral instructions, the annotators received the annotation scheme (see table 6.4) along with the following short instructions:

CM Some information, statement or utterance is not present in the article but should be present.

CW Factual errors. Some information, statement or utterance should be corrected or rephrased in order to be correct.

CU Some information, statement or utterance is present in the article but should not be present, because it is unsuitable, unnecessary, obsolete, or too detailed.

CS Concerns the inner structure of the article or the position of the article within the wider framework of Wikipedia. Also, merging or splitting of an article falls into this category.

CL Unsuitable language or style, unclear formulation and any need for rephrasing in order to express the facts correctly.

COBJ Lack of neutrality (NPOV)

CO Any kind of criticism not covered by the categories above including fuzzy criticism (“The article/section is odd”).

PSR
– Could anybody do X?
– Please do X
– I would say somebody should X
– "It should be neutral"
– "The following should be clear"
– The section must be changed.

PREF This class does not apply to citations or reference material included in the user contribution. Generally, it is not applied when material is referenced to support one's own statement. It applies to an action of referencing or pointing to some external or internal subject matter, e.g.
– Please see X.
– Please look at the section Y in the article
– As I have stated in a previous discussion

PFC Commits to an action in the future, e.g.
– "I will change X."
– "Changing X."
– "Moving X"

PPC Report of a performed action, e.g.
– "Done"
– "Fixed"

IP Covers any information providing acts, such as answers, replies, elaborations, statements, announcements, quotes and comments.

IS Any information seeking acts, such as questions. Note that rhetorical questions rarely seek information. Thus, they should just be labeled with IP. Also, requests or suggestions alone do not always seek information.

IC Correcting an already established fact by providing the corrected fact. Contributions marked with this label are usually also marked with IP.

B.2 Annotation guidelines for the EWD corpus

The annotation scheme used in this study was designed to reflect the ways Wikipedia users coordinate article improvement. Your task as an annotator is to identify contributions that point out faults or a lack of quality in the discussed article, offer solutions to the identified problems, and announce actions towards improving the article. At the same time, the attitude of one participant to another on an interpersonal level is recorded.
The corpus consists of a selection of Wikipedia Talk pages taken from the English Wikipedia from April 6, 2011. Each Talk page has been segmented into discussions (i.e. the individual topics discussed on a Talk page) and turns (i.e. the individual user contributions within a discussion). In order to be selected for the corpus, a Talk page must have more than one discussion and its size must be between 1,000 and 40,000 characters137. The Talk pages are selected according to the cleanup templates that occur in the Talk page or in the article associated with the Talk page. The same number of articles from each category – distinguished article, flawed article and neutral article – is selected for the corpus.

B.2.1 General Guidelines

The annotation scheme has not been designed to cover all possible aspects of human conversation. It particularly focuses on the aspects described in the introduction. Consequently, it might not be possible to label every contribution in a discussion. This is not a problem, as we are only interested in the contributions about article improvement.
Discussions are labeled first on the discussion level, i.e. the level of an individual discussion topic on the discussion page (cf. section B.2.2), and then on the turn level, i.e. the level of the individual user contributions (cf. section B.2.3).

137The articles are categorized into six size classes: 1,000-7,500, 7,501-14,000, 14,001-25,000, 25,001-27,000, 27,001-33,500, 33,501-40,000. From each class, the same number of articles is selected.

Segmentation Errors In some cases, discussions are not segmented correctly, e.g. contributions of more than one user are treated as a single contribution. In those cases, the whole discussion can be marked as rejected by assigning the ERROR label. As a consequence, the whole discussion is rejected for further use in the experiment (cf. section B.2.2).

Discontinuous Contributions In some cases, contributions can be discontinuous, i.e. they may contain (correctly segmented) inserted contributions by other authors. When selecting such a discontinuous contribution, all parts belonging to the turn will be highlighted while leaving out the inserted contributions. The inserted contributions can be selected and annotated separately. This should not be confused with segmentation errors (i.e. a selection that highlights the contributions of more than one user at once).

Surface Structure vs. Intention Do not be influenced too much by the surface structure of the text, i.e. do not give too much attention to the individual sentence types (question, statement). What matters is the content and the intention of each contribution. A "question" like "Shouldn't the invention of the transistor be given a lot more attention?" is not just a request for information, but (also) criticism regarding a lack of detail and a suggestion to expand the information.

Certainty and Uncertainty Certainty and uncertainty are not covered by the annotation scheme, so "I thought that FACTX" or "It might be the case that FACTX" are treated the same way as "FACTX".

The Role of the Topic Title When labeling the first contribution in a discussion, you should also include the topic title in your analysis. In most cases, the discussion title has been written by the first contributor and contains additional information which might even be necessary to interpret the first turn. The title can be seen as part of the first contribution.

Figure B.2: Example for a segmentation error displayed in the MMAX2 annotation tool

Repetition If a user repeats a statement that has already been labeled, e.g. as some kind of criticism, in an earlier contribution, it should be labeled again as such. For an example, see the discussion in section B.2.4. In that example, Turn 3 paraphrases/repeats the criticism from Turn 1 ("GDP is wrong and he/she knows it"). Consequently, it has to be labeled the same way as the criticism in Turn 1. The same is true for the other label categories besides criticism.

B.2.2 Discussion Level

ERROR: Segmentation Errors If a discussion was not segmented correctly, it might be problematic to annotate the user contributions. In that case, the ERROR label should be assigned. The discussion is then rejected for further use in this annotation experiment. No further annotation has to be performed on this discussion. A strong indicator for a segmentation error would be the presence of a user signature of some user Y WITHIN the contribution of some user X, because the signature suggests a contribution boundary that was missed by the parser. However, if there are signatures of X inside a contribution of X, it does not pose a problem. This is most likely due to an aggregation of several subsequent contributions by the same author into a single turn. Wrong author attribution (i.e. the case that the correct author for a contribution could not be found) does not qualify for this label.

REFOBJ: Reference Object The "Reference Object" category defines the main focus of a discussion. This category is the only non-binary one. Only one of the three possible labels can be chosen for an individual discussion.

PART: Article Part The focus of the discussion is an individual aspect or part of the article, e.g. the discussion of a particular defect, a lack of quality, or a missing fact.

WHOLE: Whole Article The focus of the discussion is the "article as a whole" and not a single detail of the article or an individual fact.

Examples
– Article protection
– Discussion page protection
– Article vandalism
– User bans
– Article status (featured, good, stub)

META: Meta Discussion The discussion is completely detached from the article (Off-Topic) or refers to resources outside of Wikipedia.

B.2.3 Turn Level

Article Criticism All labels in this section refer to contributions mentioning a lack of quality in one of six categories. They do not only apply to contributions explicitly stating criticism (e.g. "FACTX is missing"), but also to contributions implicitly stating subjects of improvement, for example

– suggestions for improvement: "FACTX should be added"
– implicit suggestions for improvement: "Shouldn't FACTX be added?"
– requests for improvement: "ARTICLEX states that 1+1=3. Please correct this." (this is criticism regarding "Accuracy and Correctness", cf. CRITACC)

CRITCOMPL: Incompleteness or Lack of Detail The main content of the article is incomplete and/or parts of the article are not detailed enough.
Possible reasons for choosing this label
– Lack of detail
– Missing facts
– Missing images
– Other missing content
– Suggestions for content that should be added
– "X needs clarification"
Not to be confused with
– Missing references: ⇒ CRITAUTH
– Missing links: ⇒ CRITSTRUCT
– Missing templates: ⇒ CRITSTRUCT
– Missing categories: ⇒ CRITSTRUCT
– Incorrect or obsolete content: ⇒ CRITACC
Note: This label does not apply to missing structural elements or references, only to the main text body of the article.

CRITACC: Lack of Accuracy, Correctness and Neutrality The article or part of the article is inaccurate, incorrect or biased/not neutral.
Possible reasons for choosing this label
– Wrong facts
– Erroneous content
– Content not up to date (the real world has changed and the content has to be updated)
– Inaccurate description
– NPOV (non-neutral point of view)
– Biased content
– Article is not objective
– "Poor quality" (if no specific information is given about the reason)
– Wrong terminology used (in combination with CRITLANG)
– Passage not understandable ("What does X mean") if there is no indication that it is due to CRITLANG issues or a missing explanation (CRITCOMPL)
Not to be confused with
– Unclear or fuzzy formulation: ⇒ CRITLANG (in some cases, both labels can be assigned to one contribution)
– Missing facts or images: ⇒ CRITCOMPL
Note: In the actual discussion, this label might occur when people rephrase text from the article. In this case, it must be determined whether the paraphrase was really made to correct incorrect or inaccurate information (then this is the right label). If the paraphrase was made to improve the language in order to make the statement more understandable or less ambiguous, Language and Style (CRITLANG) is the correct label.

CRITLANG: Deficiencies in Language and Style The article or part of the article contains bad language or style.
Possible reasons for choosing this label
– Typing slips, language errors
– Language too complex
– Low readability, obscure language
– Too much use of foreign language
– Unclear formulation (the content might still be accurate and correct)
– Ambiguous formulation
– Inconsistent use of vocabulary: using different vocabulary for the same concept within the article or category
– Text is incohesive
– Text does not flow well
– Terminological issues (wrong terminology used): depending on the context, it can be combined with CRITACC
Not to be confused with
– Incorrect or inaccurate statements: ⇒ CRITACC (in some cases, both labels can be assigned to one contribution)
– Structural problems, formal issues (citation style), formatting issues: ⇒ CRITSTRUCT
Note: In the actual discussion, this label might occur when people rephrase text from the article. In this case, it must be determined whether the paraphrase was really made to improve the language in order to make the statement more understandable or less ambiguous (then this is the right label). If the paraphrase was made to correct incorrect or inaccurate information, Accuracy and Correctness (CRITACC) is the correct label.

CRITSUIT: Unsuitability The article or part of the article contains unsuitable content.
Possible reasons for choosing this label
– Content redundancy
– Content irrelevant or outside the scope of the article
– Content too detailed
– Content of an image used in the article is unsuitable
– Quality of an image used in the article is unsuitable
– Media with unsuitable licence / non-free image
– Content that violates copyright (e.g. copy-and-paste text)
– Vandalism
Not to be confused with
– Missing licence information: ⇒ CRITAUTH

CRITSTRUCT: Deficiencies in Structure, Organization and Visual Appearance The article or part of the article has a bad structure, is visually not appealing, or is not correctly placed and connected within the broader framework of Wikipedia.
Possible reasons for choosing this label
– Poor organization/structure of the article
– Content/sections should be rearranged/reordered
– Headings should be renamed
– Nonconformity to the suggested style guides
– Inconsistent usage of structures and styles: content is structured differently than in similar articles
– Missing/inadequate/too many TEMPLATES
– Missing/inadequate/too many TAGS
– Missing/inadequate/too many CATEGORIES
– Missing/inadequate/too many LINKS
– Too many redlinks
– Dead (external) links
– Article should be split into several articles
– Article should be merged with another article
– Cosmetic problems with the typesetting
Not to be confused with
– Incorrect language, typos: ⇒ CRITLANG

CRITAUTH: Lack of Authority The article lacks authority and verifiability or has (media) licencing issues.
Possible reasons for choosing this label
– Lack of supporting sources
– Plagiarism
– Lack of academic scrutiny of the sources (sources are given, but they are not reliable)
– Known bias of the sources
– Lack of references to original sources
– Lack of accessibility of original sources
– Claims or details in the article cannot be verified
– Contains uncited content
– Lack of proper licencing information (for media)
Not to be confused with
– NPOV (non-neutral point of view): ⇒ CRITACC
– Media with unsuitable (non-free) licence: ⇒ CRITSUIT

Self Commitment

ACTF: Commitment to Future Action This label relates to announcements of future article-related actions. It is chosen if the author
– announces future commitment (I will do this)
– offers future commitment (I could do this. I can do this. If nobody else does it, I might do it)

ACTP: Report of Past Action This label relates to reports of already performed article-related actions. It is chosen if the author
– reports an action (e.g. "I already fixed the error in the article")
– uses a template like fixed or done

Requests

Request labels do not cover all possible requests, such as the request of a discussion contributor that someone else should fix an error in the article. This "implicit" request is already contained in the criticism labels and is not encoded again if the request is made explicitly. The request labels only cover requests that fit one of the following classes.

REQEDIT: Request for Article Edit This label is chosen if the author requests that the article should be edited (e.g. to fix an error that was identified by a contribution with a CRIT* label).
Note: This label is used if discussion contributors ask the community to edit the article. If the user announces to do the editing him/herself, the ACTF label should be assigned. If the user reports an already performed action, the ACTP label should be assigned. If the requested action is an admin or maintenance action and not a simple article edit, the REQMAINT label should be assigned.

REQMAINT: Request for Admin/Maintenance Action This label is chosen if the author
– specifically requests an admin to protect/semiprotect the article
– specifically requests an admin to remove article protection
– specifically requests an admin or reviewer to review the article for promotion/demotion (e.g. to featured/good status)
– specifically requests an admin to join two articles
– specifically requests an admin to split an article into two (or more) articles
– specifically requests an admin to move the article to a different namespace
– specifically requests other maintenance actions
Note: This label specifically addresses maintenance actions that cannot be performed by normal users. Simple article edits or adding the article to a category are not covered by this label (use REQEDIT for that). This label also includes the request for an article review by an admin or reviewer to evaluate if the article should be promoted/demoted to featured/good status!

Interpersonal

The interpersonal categories are only to be used if the attitude towards another participant of the discussion is made explicit. They should only be used to characterize the attitude of an author towards another user or their contributions and/or whether they agree or disagree with other contributions. The labels are polar: states between positive and negative do not exist. If such a case occurs, it should not get an Interpersonal label. If an author shows positive attitude towards one user and negative attitude towards another user, both labels can be assigned. The attitude towards people not taking part in the discussion is not covered by the annotation scheme and should not be labeled with Interpersonal labels.

ATTPOS: Positive Attitude / Support / Agreement This label is chosen if the author
– supports/agrees with the contribution/opinion/idea of another author
– confirms the contribution of another author
– accepts the contribution/opinion of another author ("I agree", "You're right")
– compliments another author ("Good work")
– praises another contribution or author
– shows gratitude ("Thanks")
– shows appreciation ("I like the idea that…")

ATTNEG: Negative Attitude / Reject / Disagreement This label is chosen if the author
– rejects or objects to the contribution/opinion/idea of another author (either by explicitly expressing an opinion or by taking the counter-position)
– disagrees with the contribution/opinion/idea of another author ("I disagree", "You are wrong")
– threatens another author ("If you don't stop, I'll report you")
– dislikes another author or their contribution ("I am not fond of this way of thinking")
– blames another author ("You messed up the article")

B.2.4 Example

Figure B.3 shows a very short example discussion loaded into the annotation tool MMAX2. The discussion consists of only three contributions. The following subsections show how this discussion should be annotated:

Figure B.3: Example discussion about the article Algeria in MMAX2

Discussion Level The main focus of this discussion is a wrong figure in the article about Algeria. Thus, the reference object of the discussion is PART. It can be argued that, as the discussion develops, the focus changes towards blocking a "Troll" from the article, which would demand the label WHOLE, but the anchor of the whole discussion is the incorrect GDP value, so the final label should be PART. There are no segmentation errors in this discussion, so the ERROR label must not be assigned (i.e. its value should remain false).

Turn 1: 2010-12-01 The user criticizes a wrong figure in the article (CRITACC). The request of the user that someone should change the value is not modeled in the annotation scheme.

Turn 2: 2010-12-30 The user agrees with the contribution of the user in Turn 1 (ATTPOS). He reports that he has corrected the error in the article (ACTP). He further requests that the article be protected, which is a maintenance action (REQMAINT).

Turn 3: 2011-01-12 The user in Turn 3 again requests a maintenance action, i.e. to block a user (REQMAINT). He further repeats the criticism mentioned in Turn 1 (CRITACC). The negative attitude towards the "Troll" is not modeled in the annotation scheme, because the "Troll" takes no part in the discussion.

B.2.5 Cases with unclear label assignments

Reference object cannot clearly be chosen Some discussions contain contributions with a clear PART focus and, at the same time, contributions with a clear WHOLE focus. If the number of contributions that indicate one particular reference object is much larger than the other, use the one with the larger support. If this is not the case, use the label that you would choose when only reading the first contribution of the discussion.

CRITACC or CRITLANG? Sometimes, the boundaries between CRITACC and CRITLANG become fuzzy. For example, in the contribution

"How can the Tomb of the Unknown Soldier in Paris have 'inspired' the Tomb of the Unknown Soldier in Westminster Abbey? They were both done at the same time."
the lack of accuracy and correctness is caused by unclear formulation. In this case, both labels can be assigned. This is also the case when the author discusses whether it is correct to express something in a certain way, e.g. in this contribution

"Is it right to say 'the' tomb of the Unknown soldier, shouldn't that be the the French tomb of the Unknown soldier or the tomb of the Unknown soldier in France?"
In most cases, however, you can decide for one of the two labels.

C Cleanup Templates in the English Wikipedia

This section lists all cleanup templates of the English Wikipedia as of 16 July 2012 according to http://en.wikipedia.org/wiki/WP:TC excluding writing variants and synonymous templates listed in the "see also" sections of the template description pages. The functional groups are based on the categorization in the original template listing, but have been adapted where appropriate. Wherever possible, we assigned to each category the corresponding quality dimensions defined in chapter 4. Since these assignments are done per category and not per label, a category could contain outliers that do not fit the dimensions assigned to the category.

General cleanup, cleanup AfD, cleanup-remainder, cleanup-rewrite, cleanup-articletitle

Copy Edit (→ Grammaticality and Spelling, Word choice, Understandability, Conventions) copy edit, copy edit-section

Subject Specific CIA, cleanup Congress Bio, cleanup-book, cleanup-chartable, cleanup-comics, cleanup-IPA, cleanup-school, cleanup-university, game cleanup, game guide, hadith authenticity, local, metricate, toLCleanup, USRD-wrongdir

Fiction all plot, book-fiction, fiction, in-universe, plot, dubious conversion, need-IPA

Style of Writing (→ Tone, Conventions, Word Choice) abbreviations, db-spam, buzzword, cleanup-tense, crystal, debate, editorial, essay-like, howto, inappropriate person, like resume, news release, db-spam, news, release section, obituary, pro and con list, repetition, review, story, technical, tone, travel guide, over-quotation, capitalization

Structure and format (→ Structure) cleanup-reorganize, importance-section, section-diffuse, sections, spacing, lead missing, format footnotes, sub-sections

180 Amount of information (→ Amount of Information) condense, duplication, too many see alsos, very long, lead too long, lead too short

Unwanted Content (→ Amount of Information, Value added, Connectivity) cleanup-spam, Cleanup Red Link, close paraphrasing, copypaste, criticism section, external links, further reading cleanup, in popular culture, MOS, non-free, NOT, overlinked, schedule, trivia, contact information, spam link, off-topic

Context and detail (→ Amount of information, Complexity) context, generalize, generalize-section, over detailed, specific

Expand and add (→ Completeness) cleanup-biography, cleanup-weighted, expand section, formula missing descriptions, ISBN, kmposts, mileposts, Lacking overview, missing information, biblio

Time-sensitive (→ Currency, Volatility) out of date, recently revised, time-context, update, update after, clarify timeframe

Contradiction and Confusing (→ Understandability, Accuracy) confusing, contradict, contradict-other, contradict-other-multiple, incoherent, incoherent-topic, misleading, unclear date, contradiction-inline, expand, acronym, inconsistent, vague

Importance and Notability (→ Value added) notability, puffery

Accuracy (→ Accuracy) disputed, disputed-section, dubious, clarify, bad unit conversions, bad summary, lead rewrite, inadequate lead, expert-subject, Expert-talk, expert-verify

Neutrality (→ Neutrality) advert, cherry picked, coat rack, COI, geographical imbalance, globalize, peacock, POV, Neutrality, POV-check, POV-lead, POV-section, POV-title, recentism, unbalanced, undue, weasel, peacock-inline, weasel-inline, editorializing, lopsided, POV-statement

Reliability, Reputation and Trustworthiness (→ Reputation) verify credibility, unreliable medical source, unreliable sources, vague, verify source, syn, original research, or, self-published, dubious, elucidate, examples, failed verification, self-published inline

Verifiability (→ Verifiability) BLP IMDb refimprove, BLP sources, BLP sources section, BLP unsourced, citation style, citations broken, citations missing, cite check, cleanup-link rot, ibid, better source, medref, more footnotes, no footnotes, one source, page numbers improve, page numbers needed, primary sources, refimprove, ref improve section, religious text primary, symbolism, third-party, unreferenced, unreferenced section, film IMDb refimprove, attribution needed, by whom, chronology, citation needed, citation broken, citation needed, citation needed (lead), cite quote, clarify, copyvio link, dead link, disambiguation needed, full, medical citation needed, nonspecific, page needed, quantify, registration required, request quotation, season needed, specify, subscription required, third-party-inline, volume needed, when, where, which?, who, whom?, whosequote, why?, year needed, find sources, find sources 3, search

Categories (→ Categorization) cat improve, category relevant?, category unsourced, recategorize, uncategorized, uncategorized stub

Images (→ Illustration) cleanup-gallery, cleanup-images, image requested, reqdiagram, reqmap, reqscreenshot, too many photos

Lists cleanup-laundry, create-list, disputed-list, list fact, example farm, fictionrefs, in popular culture, list to table, MOSLOW, prose

WikiTech (→ Connectivity) cleanup-HTML, dead end, disambiguation cleanup, disambiguation, incoming links, more-specific-links, orphan, wikify, shadowsCommons, prod

Infobox infobox requested, newinfobox, ship infobox request, single infobox request, cleanup-infobox

Merge Afd-merge from, afd-merged-from, Afd-merge to, merge, merge from, merge to, merged-from, merged-to, merging, cleanup-combine

Move move header, move to userspace, movenotice, moveoptions, convert to SVG and copy to Wikimedia Commons, copy to Meta, copy to , copy to Wikibooks, Cookbook,

182 softredirect, copy to Wikimedia Commons, copy to Wikiquote, copy to , copy to , now Commons

Split (→ Structure) cleanup split, split, split-apart, split dab, split section, split sections

Translations and Language issues (→ Grammaticality and Spelling, Understandability, Word choice) cleanup-translation, expand Spanish, not English, not English-inline, rough translation, translated page, translatePassage, translation WIP, TWCleanup

Completeness (→ Completeness) Stub138

138In addition to the generic stub template, topic-specific stub templates are available and more commonly used. A list can be found under http://en.wikipedia.org/wiki/WP:STUBSHORT

183 184 List of Tables

3.1 Number of pages per namespace for the five largest Wikipedias ...... 23 3.2 Wikipedia namespaces and functional Talk page classes ...... 35 3.3 Tools and services for accessing Wikipedia ...... 39

4.1 Number of articles per WikiProject quality level ...... 55

5.1 Agreement of human annotator with gold standard. The corpus for this small study consist of 20 articles per flaw, half of which are flawed...... 71 5.2 Flaw definitions and numbers of training and test instances per flaw in the CLEF corpus ...... 75 5.3 NSTYLE corpus of neutrality and style flaws ...... 81 5.4 Cosine similarity scores between the category frequency vectors of the flawed article sets and the respective random or reliable negatives ...... 84 5.5 Positive and negative instances per flaw in NSTYLE ...... 87 5.6 Feature sets used in the experiments on the CLEF and NSTYLE corpora . . 92 5.7 Overview of the feature utility scores on the CLEF corpus ...... 96

5.8 Average F1-scores over all flaws on NSTYLE-RELP using NSTYLE-ALL features 97 5.9 Comparison of classifier performance on CLEF across participants ...... 100 5.10 Overview of classification errors per flaw on CLEF...... 101

5.11 Fβ scores for the 10-fold cross validation of the SVMs with RBF kernel on all datasets using NSTYLE-NGRAM features ...... 103 5.12 Sample sentence pairs from the uncertainty corpus ...... 105

6.1 Page-level features proposed by Kittur et al. (2007) ...... 116 6.2 Overview of authority claims in Wikipedia discussions ...... 118 6.3 Descriptive statistics for the Simple English Wikipedia and the English Wikipedia ...... 130 6.4 Annotation scheme for the SEWD corpus ...... 134

6.5 Annotation scheme for the EWD corpus ...... 135 6.6 Label frequencies and inter-annotator agreement for the SEWD corpus. . . . 140 6.7 Label frequencies and inter-annotator agreement for the EWD corpus . . . . 141 6.8 Distribution of dialog acts according to topic scope ...... 145 6.9 Classifier performance on the SEWD corpus ...... 149 6.10 Classifier performance on the EWD corpus ...... 151

A.1 Supervised learning scenarios supported by DKPro TC with exemplary NLP applications ...... 166

List of Figures

1.1 NLP-assisted information quality management in Wikipedia ...... 3

2.1 Collaborative writing strategies ...... 13

3.1 Main page of the English Wikipedia on 2 Sept 2013 ...... 20 3.2 The origin of Wikipedia policies ...... 21 3.3 Structure of an article ...... 27 3.4 Revisions of the article “Natural Language Processing” ...... 32 3.5 Structure of a Talk page ...... 34 3.6 Examples for user signatures on Talk pages ...... 36 3.7 DTD for the MediaWiki export format ...... 38

4.1 Hierarchical information quality model after Wang and Strong ...... 48 4.2 The rating interface of the Article Feedback Tool ...... 57 4.3 Proposed model for article quality in Wikipedia ...... 59

5.1 Cleanup templates with page-, section-, and inline scope ...... 69 5.2 Dimensional coverage of the article quality model by cleanup templates . . 73 5.3 Concept of one-class classification ...... 76 5.4 Concept of PU learning ...... 78 5.5 Sampling of negative instances for quality flaw detection ...... 79 5.6 Distributed extraction of reliable negative instances using Hadoop . . . . . 82 5.7 Descriptive statistics for the NSTYLE corpus ...... 86 5.8 High-level system architecture of the FlawFinder ...... 88

5.9 Classifier performance on CLEF in terms of precision, recall and F1-score. . 98 5.10 Classifier performance on the NSTYLE corpus ...... 102

6.1 Creation and utilization of the Wikipedia Talk page corpora ...... 121 6.2 In-text replies on a Talk page for the article Monaro Highway ...... 122

6.3 Identification of paragraph creation points with forward checking...... 124 6.4 Identification of paragraph creation points with backward checking. . . . . 125 6.5 Selection criteria for Talk pages in the EWD corpus ...... 131 6.6 Mapping between the SEWD and EWD annotation schemes ...... 136 6.7 Turn Annotation in MMAX2 ...... 138 6.8 Topic Annotation in MMAX2 ...... 138 6.9 Expert support system for creating the EWD gold standard ...... 139 6.10 Classifier performance on the SEWD corpus ...... 150 6.11 Classifier performance on the EWD corpus ...... 152

A.1 Configuration GUI for the RevisionMachine ...... 163 B.2 Example for a segmentation error ...... 172 B.3 Example discussion about the article Algeria in MMAX2 ...... 178

Bibliography

Khalid Al Khatib, Hinrich Schütze, and Cathleen Kantner: ‘Automatic Detection of Point of View Differences in Wikipedia’, in: Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012), pp. 33–50, Mumbai, India, 2012. (Cited on page 22) Maik Anderka: Analyzing and Predicting Quality Flaws in User-generated Content :The Case of Wikipedia, Phd dissertation, Bauhaus Universität Weimar, 2013, Online: http://www.uni-weimar.de/medien/webis/publications/papers/anderka_2013.pdf. (Cited on page 85) Maik Anderka and Benno Stein: ‘Overview of the 1st International Competition on Quality Flaw Prediction in Wikipedia’, in: CLEF 2012 Evaluation Labs and Workshop – Working Notes Papers, Rome, Italy, 2012. (Cited on pages 74, 75, 99, 100 and 104) Maik Anderka, Benno Stein, and Nedim Lipka: ‘Predicting Quality Flaws in User-generated Content: The Case of Wikipedia’, in: 35th International ACM Conference on Research and Development in Information Retrieval, pp. 981–990, Portland, OR, USA, 2012. (Cited on pages 66, 70, 76, 77 and 97) Ofer Arazy and Rick Kopak: ‘On the measurability of information quality’, Journal of the American Society for Information Science and Technology 62 (1): 89–99, January 2011. (Cited on pages 50 and 71) Ron Artstein and Massimo Poesio: ‘Inter-Coder Agreement for Computational Linguistics’, Computational Linguistics 34 (4): 555–596, 2008. (Cited on page 140) John Longshaw Austin: How to Do Things with Words, Clarendon Press, Cambridge, UK, 1962. (Cited on page 112) Phoebe Ayers, Charles Matthews, and Ben Yates: How Wikipedia Works: And how You Can be a Part of it , No Starch Press Series, No Starch Press, 2008. (Cited on pages 20, 26 and 30)

189 Daniel Bär, Nicolai Erbs, Torsten Zesch, and Iryna Gurevych: ‘Wikulu: An Extensible Architecture for Integrating Natural Language Processing Techniques with Wikis’, in: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. System Demonstrations, pp. 74–79, 2011. (Cited on page 16) Regina Barzilay and Mirella Lapata: ‘Modeling local coherence: An entity-based approach’, in: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pp. 141–148, 2005. (Cited on page 53) Robert-Alain de Beaugrande and Wolfgang Ulrich Dressler: Introduction to Text Linguistics, Longman linguistics library, Longman, 1981. (Cited on pages 53 and 60) Emily M. Bender, Jonathan T. Morgan, Meghan Oxley, Mark Zachry, Brian Hutchinson, Alex Marin, Bin Zhang, and Mari Ostendorf: ‘Annotating Social Acts: Authority Claims and Alignment Moves in Wikipedia Talk Pages’, in: Proceedings of the Workshop on Language in Social Media, pp. 48–57, Portland, OR, USA, 2011. (Cited on pages 117, 118 and 143) Christian Bizer: Quality-Driven Information Filtering- In the Context of Web-Based Information Systems, VDM Verlag, Saarbrücken, Germany, July 2007. (Cited on page 49) Carl-Hugo Björnsson: Läsbarhet: Lesbarkeit durch Lix. (Aus dem Schwedischen), (Pedagogiskt Utvecklingsarbete vid Stockholms Skolor. 6.), Liber, 1968. (Cited on page 93) Julian Brooke and Graeme Hirst: ‘Native Language Detection with ’Cheap’ Learner Corpora’, in: Proceedings of Learner Corpus Research 2011, Louvain-la-Neuve, Belgium, 2011. (Cited on page 78) Michel Buffa: ‘Intranet wikis’, in: Proceedings of the IntraWebs Workshop 2006 at the 15th International World Wide Web Conference, Edinburgh, Scotland, UK, 2006. (Cited on page 16) Karl Bühler: Sprachtheorie: Die Darstellungsfunktion der Sprache, Verlag von Gustav Fischer in Jena, 1934. (Cited on page 112) Harry Bunt and William Black: ‘The ABC of computational pragmatics’, in Harry Bunt and William Black (Eds.): Abduction, Belief and Context in Dialogue: Studies in Computational Pragmatics, pp. 1–46, John Benjamins Publishing Co., 2000. (Cited on page 113) Jean Carletta: ‘Assessing agreement on classification tasks: The kappa statistic’, Computational Linguistics 22 (2): 249–254, 1996. (Cited on pages 72 and 139) Tamitha Carpenter and Emi Fujioka: ‘The Role and Identification of Dialog Acts in Online Chat’, in: Workshops at the 25th AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 2011. (Cited on page 114)

190 Imogen Casebourne, Chris Davies, Michelle Fernandes, and Naomi Norman: ‘Assessing the accuracy and quality of Wikipedia entries compared to popular online encyclopaedias: A comparative preliminary study across disciplines in English, Spanish and Arabic.’, Technical report , Epic, Brighton, 2012. (Cited on page 1) William W. Cohen, Vitor R. Carvalho, and Tom M. Mitchell: ‘Learning to Classify Email into “Speech Acts”’, in: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pp. 309–316, Barcelona, Spain, 2004. (Cited on pages 114 and 150) Meri Coleman and T. L. Liau: ‘A computer readability formula designed for machine scoring.’, Journal of Applied Psychology 60 (2): 283, 1975. (Cited on pages 52 and 93) Mark G. Core and James F. Allen: ‘Coding Dialogues with the DAMSL Annotation Scheme’, in: Working Notes of the AAAI Fall Symposium on Communicative Action in Humans and Machines, November, 1997. (Cited on page 113) Holly Crawford: ‘Encyclopedias’, in Richard E. Bopp and Linda C. Smith (Eds.): Reference and Information Services: An Introduction, pp. 433–459, Libraries Unlimited, Eaglewood, CO, USA, 3rd edition, 2001. (Cited on page 58) Cristian Danescu-Niculescu-Mizil, Lillian Lee, Bo Pang, and Jon Kleinberg: ‘Echoes of power: Language effects and power differences in social interaction’, in: Proceedings of the 21st International World Wide Web Conference (WWW 2012), pp. 699–708, Lyon, France, 2012. (Cited on pages 118 and 122) Johannes Daxenberger, Oliver Ferschke, Iryna Gurevych, and Torsten Zesch: ‘DKPro TC: A Java-based Framework for Supervised Learning Experiments on Textual Data’, in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. System Demonstrations, pp. 61–66, Baltimore, MD, USA, 2014. (Cited on page 87) Johannes Daxenberger and Iryna Gurevych: ‘Automatically Detecting Corresponding Edit-Turn-Pairs in Wikipedia’, in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Short Papers, p. (to appear), Baltimore, MD, USA, 2014. (Cited on page 158) Han De Vries, Marc N Elliott, David E Kanouse, and Stephanie S. Teleki: ‘Using Pooled Kappa to Summarize Interrater Agreement across Many Items’, Field Methods 20 (3): 272–282, March 2008. (Cited on page 140) Hannes Dohrn and Dirk Riehle: ‘Design and implementation of the Sweble Wikitext parser’, in: Proceedings of the International Symposium on Wikis and Open Collaboration (WikiSym ’11), pp. 72–81, Mountain View, CA, USA, 2011a. (Cited on page 89) Hannes Dohrn and Dirk Riehle: ‘Wom: An object model for wikitext.’, Technical report , University of Erlangen, Erlangen, Germany, 2011b. (Cited on page 89)

191 Richard Eckart de Castilho and Iryna Gurevych: ‘A Lightweight Framework for Reproducible Parameter Sweeping in Information Retrieval’, in: Proceedings of the Workshop on Data Infrastructures for Supporting Information Retrieval Evaluation, pp. 7–10, Glasgow, UK, 2011. (Cited on pages 88 and 166) Martin Eppler: Managing Information Quality - Increasing the Value of Information in Knowledge-intensive Products, Springer Berlin / Heidelberg, 1st edition, 2003. (Cited on page 50) Martin Eppler and Dörte Wittig: ‘Conceptualizing Information Quality: A Review of Information Quality Frameworks from the Last Ten Years’, in: Proceedings of the 2000 Conference on Information Quality, 2000. (Cited on pages 46, 47 and 48) Richárd Farkas, Veronika Vincze, György Móra, János Csirik, and György Szarvas: ‘The CoNLL-2010 shared task: learning to detect hedges and their scope in natural language text’, in: Proceedings of the Fourteenth Conference on Computational Natural Language Learning, CoNLL ’10: Shared Task, pp. 1–12, Uppsala, Sweden, 2010. (Cited on page 106) Edgardo Ferretti, Donato Hernández Fusilier, Rafael Guzmán-Cabrera, Manuel Montes-y Gómez, Marcelo Errecalde, Paolo Rosso, and Manuel Montes y Gómez: ‘On the Use of PU Learning for Quality Flaw Prediction in Wikipedia’, in: CLEF 2012 Evaluation Labs and Workshop – Working Notes Papers, Rome, Italy, 2012. (Cited on pages 77, 78, 97 and 100) David Ferrucci and Adam Lally: ‘UIMA: An Architectural Approach to Unstructured Information Processing in the Corporate Research Environment’, Natural Language Engineering 10 (3-4): 327–348, 2004. (Cited on pages 87 and 146) Oliver Ferschke, Johannes Daxenberger, and Iryna Gurevych: ‘A Survey of NLP Methods and Resources for Analyzing the Collaborative Writing Process in Wikipedia’, in Iryna Gurevych and Jungi Kim (Eds.): The People’s Web Meets NLP: Collaboratively Constructed Language Resources, Theory and Applications of Natural Language Processing, chapter 5, Springer, 2013. (Cited on pages 33 and 161) Oliver Ferschke, Iryna Gurevych, and Yevgen Chebotar: ‘Behind the Article: Recognizing Dialog Acts in Wikipedia Talk Pages’, in: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp. 777–786, Avignon, France, 2012a. (Cited on pages 34, 93 and 110) Oliver Ferschke, Iryna Gurevych, and Marc Rittberger: ‘FlawFinder: A Modular System for Predicting Quality Flaws in Wikipedia’, in: CLEF 2012 Evaluation Labs and Workshop – Working Notes Papers, Rome, Italy, 2012b. (Cited on page 100) Oliver Ferschke, Torsten Zesch, and Iryna Gurevych: ‘Wikipedia Revision Toolkit: Efficiently Accessing Wikipedia’s Edit History’, in: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.

192 System Demonstrations, pp. 97–102, Portland, OR, USA, 2011. (Cited on pages 40, 93 and 161) Jenny Rose Finkel, Trond Grenager, and Christopher Manning: ‘Incorporating non-local information into information extraction systems by Gibbs sampling’, in: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pp. 363–370, Association for Computational Linguistics, Morristown, NJ, USA, June 2005. (Cited on page 93) Aidan Finn and Nicholas Kushmerick: ‘Learning to classify documents according to genre’, Journal of the American Society for Information Science and Technology 57 (11): 1506–1518, 2006. (Cited on page 78) Lucie Flekova, Oliver Ferschke, and Iryna Gurevych: ‘What Makes a Good Biography? Multidimensional Quality Analysis Based on Wikipedia Article Feedback Data’, in: Proceedings of the 23rd International World Wide Web Conference (WWW 2014), pp. 855–866, Seoul, Korea, April 2014. (Cited on page 56) Rudolf Flesch: ‘A new readability yardstick.’, The Journal of applied psychology 32 (3): 221, 1948. (Cited on pages 52 and 93) Andrea Forte, Aniket Kittur, Vanessa Larco, Haiyi Zhu, Amy Bruckman, and Robert E. Kraut: ‘Coordination and Beyond: Social Functions of Groups in Open Content Production’, in: Proceedings of the 2012 ACM Conference on Computer Supported Cooperative Work, Seattle, WA, USA, 2012. (Cited on page 10) Andrea Forte and Cliff Lampe: ‘Defining, Understanding, and Supporting Open Collaboration: Lessons From the Literature’, American Behavioral Scientist 2013. (Cited on pages 10 and 11) Akinori Fujino, Hideki Isozaki, and Jun Suzuki: ‘Multi-label Text Categorization with Model Combination based on F1-score Maximization’, in: Proceedings of the Third International Joint Conference on Natural Language Processing, pp. 823–828, Hyderabad, India, 2008. (Cited on pages 71 and 146) Jolene Galegher and Robert E. Kraut: ‘Computer-mediated communication for intellectual teamwork: a field experiment in group writing’, Information Systems Research 5 (2): 110–138, September 1994. (Cited on page 12) Mouzhi Ge: Information quality assessment and effects on inventory decision-making, Phd dissertation, Dublin City University, 2009. (Cited on page 50) Jim Giles: ‘Internet encyclopaedias go head to head’, Nature 438 (7070): 900, 2005. (Cited on page 1) Ruediger Glott, Philipp Schmidt, and Rishab Ghosh: ‘Wikipedia Survey – Overview of Results’, Technical report , United Nations University, Maastricht, 2010. (Cited on page 30)

193 Johannes Gordesch and Burkhard Dretzke: ‘Correctness in language: A formal theory’, Journal of Quantitative Linguistics 5 (1-2): 13–26, 1998. (Cited on page 51) Arthur C. Graesser, Danielle S. McNamara, and Jonna M. Kulikowich: ‘Coh-Metrix: Providing Multilevel Analyses of Text Characteristics’, Educational Researcher 40 (5): 223–234, June 2011. (Cited on page 53) Arthur C. Graesser, Danielle S. McNamara, Max M. Louwerse, and Zhiqiang Cai: ‘Coh-metrix: analysis of text on cohesion and language.’, Behavior research methods, instruments, & computers : a journal of the Psychonomic Society, Inc 36 (2): 193–202, May 2004. (Cited on page 53) Herbert Paul Grice: ‘Logic and Conversation’, in Peter Cole and Jerry L Morgan (Eds.): Syntax and Semantics, Vol. 3, New York: Academic Press, 1975. (Cited on page 142) Barbara Grosz and Candace Sidner: ‘Attention, intentions, and the structure of discourse’, Computational linguistics 12 (3), 1986. (Cited on page 53) Barbara Grosz, Scott Weinstein, and Aravind Joshi: ‘Centering: A framework for modeling the local coherence of discourse’, Computational linguistics 21: 203–225, 1995. (Cited on page 53) Jonathan Grudin and Steven Poltrock: Computer Supported Cooperative Work, chapter 27, The Interaction Design Foundation, Aarhus, Denmark, 2013. (Cited on page 9) Robert Gunning: ‘The fog index after twenty years’, Journal of Business Communication 6 (2): 3–13, 1969. (Cited on pages 52 and 93) Iryna Gurevych, Max Mühlhäuser, Christof Müller, Jürgen Steimle, Markus Weimer, and Torsten Zesch: ‘Darmstadt Knowledge Processing Repository Based on UIMA’, in: Proceedings of the First Workshop on Unstructured Information Management Architecture at Biannual Conference of the Society for Computational Linguistics and Language Technology, Tübingen, Germany, 2007. (Cited on pages 89 and 93) Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian Witten: ‘The WEKA Data Mining Software: An Update’, SIGKDD Explorations 11 (1): 10–18, 2009. (Cited on pages 90, 94, 146 and 167) Michael A. K. Halliday and Ruqaiya Hasan: Cohesion in English, Longman, 1976. (Cited on pages 53, 60 and 62) Jingyu Han, Xiong Fu, Kejia Chen, and Chuandong Wang: ‘Web Article Quality Assessment in Multi-dimensional Space’, in: Proceedings of the 12th International Conference on Web-age Information Management , pp. 214–225, Springer Berlin Heidelberg, Wuhan, China, 2011a. (Cited on page 66)

194 Jingyu Han, Chuandong Wang, and Dawei Jiang: ‘Probabilistic Quality Assessment Based on Article’s Revision History’, in: Proceedings of the 22nd International Conference on Database and Expert Systems Applications, pp. 574–588, Toulouse, France, 2011b. (Cited on page 66) Daniel Hasan Dalip, Marcos André Gonçalves, Marco Cristo, Pável Calado, Daniel Hasan Dalip, and Marcos André Gonçalves: ‘Automatic Quality Assessment of Content Created Collaboratively by Web Communities’, in: Proceedings of the Joint International Conference on Digital Libraries, pp. 295–304, ACM Press, Austin, TX, USA, June 2009. (Cited on page 66) Benjamin Mako Hill and Aaron Shaw: ‘The Wikipedia Gender Gap Revisited: Characterizing Survey Response Bias with Propensity Score Estimation’, PLoS ONE 8 (6), 2013. (Cited on page 30) Johannes Hoffart, Torsten Zesch, and Iryna Gurevych: ‘An Architecture to Support Intelligent User Interfaces for Wikis by Means of Natural Language Processing’, in: Proceedings of the International Symposium on Wikis and Open Collaboration (WikiSym ’09), Orlando, FL, USA, October 2009. (Cited on page 16) Eduard Hovy, Roberto Navigli, and Simone Paolo Ponzetto: ‘Collaboratively built semi-structured content and Artificial Intelligence: The story so far’, Artificial Intelligence 194: 2–27, January 2013. (Cited on page 22) George Hripcsak and Adam S Rothschild: ‘Agreement, the F-Measure, and Reliability in Information Retrieval’, Journal of the American Medical Informatics Association 12 (3): 296–298, 2005. (Cited on page 139) Meiqun Hu, Ee-Peng Lim, Aixin Sun, Hady Wirawan Lauw, and Ba-Quy Vuong: ‘Measuring article quality in wikipedia: models and evaluation’, in: Proceedings of the 16th ACM Conference on Information and Knowledge Management , pp. 243–252, Lisbon, Portugal, 2007. (Cited on page 66) Ian Hutchby and Robin Wooffitt: Conversation Analysis, John Wiley & Sons, 2nd edition, 2008. (Cited on page 113) Roman Jakobson: ‘Closing Statements: Linguistics and Poetics’, Style in Language pp. 350–377, 1960. (Cited on page 112) Sara Javanmardi and Cristina Lopes: ‘Statistical measure of quality in Wikipedia’, in: Proceedings of the First Workshop on Social Media Analytics, pp. 132–138, ACM Press, Washington DC, DC, USA, July 2010. (Cited on page 66) Michael T Joyce: ‘Siren Shapes: Exploratory and Constructive Hypertexts’, Academic Computing 3, 1988. (Cited on page 15)

195 Dan Jurafsky: ‘Pragmatics and Computational Linguistics’, in Laurence R. Horn and Gregory Ward (Eds.): Handbook of Pragmatics, pp. 578–604, Blackwell Publishing Ltd, 2006. (Cited on pages 112 and 113) Dan Jurafsky, Elizabeth Shriberg, and Debra Biasca: ‘Switchboard SWBD-DAMSL Shallow-Discourse-Function Annotation Coders Manual’, Technical Report Draft 13, University of Colorado, Institute of Cognitive Science, 1997. (Cited on page 113) Jihie Kim, Jia Li, and Taehwan Kim: ‘Towards Identifying Unresolved Discussions in Student Online Forums’, in: Proceedings of the NAACL HLT 2010 Fifth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 84–91, Los Angeles, CA, USA, 2010a. (Cited on pages 114, 143 and 150) Su Nam Kim, Li Wang, and Timothy Baldwin: ‘Tagging and linking web forum posts’, in: Proceedings of the Fourteenth Conference on Computational Natural Language Learning, CoNLL ’10, pp. 192–202, Stroudsburg, PA, USA, 2010b. (Cited on page 143) Peter Kincaid, Robert Fishburne Jr, Richard Rogers, and Brad Chissom: ‘Derivation of New Readability Formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy Enlisted Personnel.’, Technical report , DTIC Document, 1975. (Cited on pages 52 and 93) Aniket Kittur, Bongwon Suh, Bryan A. Pendleton, H. Chi, and Ed H. Chi: ‘He Says,She Says: Conflict and Coordination in Wikipedia’, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 453–462, San Jose, CA, USA, 2007. (Cited on pages 115, 116 and 185) Shirlee-ann Knight and Janice Burn: ‘Developing a framework for assessing information quality on the World Wide Web’, Informing Science Journal 8, 2005. (Cited on page 46) Olavi Koistinen: ‘World’s largest study on Wikipedia: Better than its reputation’, 2013, Online: http://www.helsinkitimes.fi/finland/finland-news/domestic/ 8619-world-s-largest-study-on-wikipedia-better-than-it-s-reputation.html. (Cited on page 1) Moshe Koppel and Jonathan Schler: ‘Exploiting Stylistic Idiosyncrasies for Authorship Attribution’, in: Workshop on Computational Approaches to Style Analysis and Synthesis, pp. 69–72, 2003. (Cited on page 78) Klaus Krippendorff: Content Analysis: An Introduction to Its Methodology, Sage Publications, 1980. (Cited on page 141) Klaus Krippendorff: ‘Der verschwundene Bote. Metaphern und Modelle der Kommunikation’, in Klaus Merten, SiegfriedJ. Schmidt, and Siegfried Weischenberg (Eds.): Die Wirklichkeit der Medien, pp. 79–113, VS Verlag für Sozialwissenschaften, 1994. (Cited on page 111)

196 Simon Kulessa: Extracting paraphrases from the Wikipedia revision history, Diplomarbeit, Technische Universität Darmstadt, 2008. (Cited on page 161) Shyong K. Lam, Anuradha Uduwage, Zhenhua Dong, Shilad Sen, David R. Musicant, Loren Terveen, and John Riedl: ‘WP:Clubhouse? An Exploration of Wikipedia’s Gender Imbalance’, in: Proceedings of the International Symposium on Wikis and Open Collaboration (WikiSym ’11), Mountain View, CA, 2011. (Cited on page 30) J Richard Landis and Gary Koch: ‘An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers’, Biometrics 33 (2): 363–374, 1977. (Cited on pages 139 and 141) David Laniado, Riccardo Tasso, Andreas Kaltenbrunner, Politecnico Milano, and Yana Volkovich: ‘When the Wikipedians Talk : Network and Tree Structure of Wikipedia Discussion Pages’, in: Proceedings of the 5th International Conference on Weblogs and Social Media, pp. 177–184, Barcelona, Spain, 2011. (Cited on pages 118, 119 and 121) Andrew LaVallee: ‘Jimmy Wales on Wikipedia Quality and Tips for Contributors’, 2009, Online: http://blogs.wsj.com/digits/2009/11/06/ jimmy-wales-on-wikipedia-quality-and-tips-for-contributors. (Cited on page 2) Yang W. Lee, Diane M. Strong, Beverly K. Kahn, and Richard Y. Wang: ‘AIMQ: a methodology for information quality assessment’, Information & Management 40 (2): 133–146, December 2002. (Cited on page 47) Michael Lesk: ‘Automatic Sense Disambiguation Using Machine Readable Dictionaries: How to Tell a Pine Cone from an Ice Cream Cone’, in: Proceedings of the 5th Annual International Conference on Systems Documentation, SIGDOC ’86, pp. 24–26, ACM, Toronto, Canada, 1986. (Cited on page 166) Bo Leuf and Ward Cunningham: The Wiki Way: Quick Collaboration on theWeb, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2001. (Cited on page 16) Sharman Lichtenstein and Craig M Parker: ‘Wikipedia model for collective intelligence: a review of information quality’, International Journal of Knowledge and Learning 5 (3/4): 254–272, 2009. (Cited on page 58) Charles Ling and Victor Sheng: ‘Cost-Sensitive Learning’, in Claude Sammut and GeoffreyI. Webb (Eds.): Encyclopedia of Machine Learning, pp. 231–235, Springer US, 2010. (Cited on page 147) Nedim Lipka and Benno Stein: ‘Identifying featured articles in wikipedia’, in: Proceedings of the 19th International World Wide Web Conference (WWW 2010), p. 1147, ACM Press, Raleigh, NC, USA, April 2010. (Cited on page 66) Bing Liu, Yang Dai, Xiaoli Li, Wee Sun Lee, and Philip S. Yu: ‘Building Text Classifiers Using Positive and Unlabeled Examples’, in: Proceedings of the Third IEEE International

197 Conference on Data Mining, pp. 179–186, November 2003. (Cited on page 77) Bing Liu, W. S. Lee, P. S. Yu, and Xiaoli Li: ‘Partially supervised classification of text documents’, Proceedings of the 19th International Conference on Machine Learning 2002. (Cited on page 77) Annie Louis: Predicting text quality: Metrics for content, organization and reader interest , Phd, University of Pennsylvania, 2013. (Cited on pages 52, 53 and 61) Paul B. Lowry, Aaron Curtis, and Michelle René Lowry: ‘Building a Taxonomy and Nomenclature of Collaborative Writing to Improve Interdisciplinary Research and Practice’, Journal of Business Communication 41 (1): 66–99, January 2004. (Cited on pages 12, 13, 14 and 15) Kim Luyckx and Walter Daelemans: ‘Shallow Text Analysis and Machine Learning for Authorship Attribution’, in: Proceedings of the Fifteenth Meeting of Computational Linguistics in the Netherlands (CLIN 2004), pp. 149–160, Utrecht, Netherlands, 2004. (Cited on page 78) Alex Marin, Bin Zhang, and Mari Ostendorf: ‘Detecting Forum Authority Claims in Online Discussions’, in: Proceedings of the Workshop on Languages in Social Media, pp. 39–47, Portland, OR, USA, June 2011. (Cited on page 117) Paolo Massa: ‘Social Networks of Wikipedia’, in: Proceedings of the 22nd ACM Conference on Hypertext and Hypermedia, pp. 221–230, Eindhoven, Netherlands, 2011. (Cited on page 119) Paolo Massa and Federico Scrinzi: ‘Exploring Linguistic Points of View of Wikipedia’, in: Proceedings of the 7th International Symposium on Wikis and Open Collaboration, pp. 213–214, New York, NY, USA, 2011. (Cited on page 22) Andrew McCallum: ‘MALLET: A Machine Learning for Language Toolkit’, 2002, Online: http://mallet.cs.umass.edu. (Cited on pages 90, 94 and 167) Philip McCarthy, Gwyneth Lewis, David Dufty, and Danielle McNamara: ‘Analyzing Writing Styles with Coh-Metrix’, in: Proceedings of the Florida Artificial Intelligence Research Society International Conference, pp. 764–769, Melbourne Beach, FL, USA, 2006. (Cited on page 53) G Harry McLaughlin: ‘SMOG grading: A new readability formula’, Journal of Reading 12 (8): 639–646, 1969. (Cited on pages 52 and 93) Christian M. Meyer: Wiktionary - The Metalexicographic and the Natural Language Processing Perspective, Phd dissertation, Technische Universität Darmstadt, 2013, Online: http://tuprints.ulb.tu-darmstadt.de/3654. (Cited on page 41)

198 George K. Mikros and Eleni K. Argiri: ‘Investigating Topic Influence in Authorship Attribution’, in: Proceedings of the SIGIR 2007 International Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection, Amsterdam, Netherlands, 2007. (Cited on page 78) David Milne and Ian Witten: ‘An Open-source Toolkit for Mining Wikipedia’, in: Proceedings of the New Zealand Computer Science Research Student Conference, Auckland, New Zealand, 2009. (Cited on page 40) Thomas Mitchell: Machine Learning, McGraw-Hill Education, New York, NY, USA, 1st edition, 1997. (Cited on pages 94 and 147) Christoph Müller and Michael Strube: ‘Multi-level annotation of linguistic data with MMAX2’, in Sabine Braun, Kurt Kohn, and Joybrato Mukherjee (Eds.): Corpus Technology and Language Pedagogy: New Resources, New Tools, New Methods, pp. 197–214, Peter Lang, Frankfurt a.M., DE, 2006. (Cited on page 137) Eugene W. Myers: ‘An O(ND) Difference Algorithm and Its Variations’, Algorithmica 1: 251–266, 1986. (Cited on page 105) Masaaki Nagata, Yumi Shibaki, and Kazuhide Yamamoto: ‘Using Goi-Taikei as an Upper Ontology to Build a Large-Scale Japanese Ontology from Wikipedia’, in: Proceedings of the 6th Workshop on Ontologies and Lexical Resources (Ontolex 2010), pp. 11–18, Bejing, China, 2010. (Cited on page 25) Sylvie Noël and Jean-Marc Robert: ‘How the Web is used to support collaborative writing’, Behaviour & Information Technology 22 (4): 245–262, 2003. (Cited on page 15) Sylvie Noël and Jean-Marc Robert: ‘Empirical Study on Collaborative Writing: What Do Co-authors Do, Use, and Like?’, Computer Supported Cooperative Work 13 (1): 63–89, 2004. (Cited on page 15) Philip V. Ogren, Philipp G. Wetzler, and Steven Bethard: ‘ClearTK: A UIMA toolkit for statistical natural language processing’, in: Towards Enhanced Interoperability for Large HLT Systems: UIMA for NLP workshop at Language Resources and Evaluation Conference (LREC), pp. 32–38, Marrakech, Morocco, 2008. (Cited on page 89) Felipe Ortega: Wikipedia: A Quantitative Analysis, Phd dissertation, Universidad Rey Juan Carlos, 2009, Online: http://eciencia.urjc.es/handle/10115/11239. (Cited on pages 2 and 41) Felipe Ortega, Jesus M. Gonzalez-Barahona, and Gregorio Robles: ‘On the Inequality of Contributions to Wikipedia’, in: Proceedings of the Proceedings of the 41st Annual Hawaii International Conference on System Sciences, HICSS ’08, pp. 304—-, IEEE Computer Society, Washington, DC, USA, 2008. (Cited on page 11)

199 Meghan Oxley, Jonathan T. Morgan, and Brian Hutchinson: ‘”What I Know Is…”: Establishing Credibility on Wikipedia Talk Pages’, in: Proceedings of the 6th International Symposium on Wikis and Open Collaboration, pp. 2–3, Gdańsk, Poland, 2010. (Cited on pages 117 and 118) Rebecca J. Passonneau: ‘Measuring agreement on set-valued items (MASI) for semantic and pragmatic annotation’, in: Proceedings of the Fifth International Conference on Language Resources and Evaluation, Genoa, Italy, 2006. (Cited on page 141) Emily Pitler and Ani Nenkova: ‘Revisiting readability: a unified framework for predicting text quality’, in: Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, number October in EMNLP ’08, pp. 186–195, Association for Computational Linguistics, Honolulu, HI, USA, 2008. (Cited on page 52) John C. Platt: ‘Fast training of support vector machines using sequential minimal optimization’, in Bernhard Schölkopf, Christopher J. C. Burges, and Alexander J. Smola (Eds.): Advances in Kernel Methods: Support Vector Learning, pp. 185–208, MIT Press, 1998. (Cited on pages 97 and 147) Ross Quinlan: C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers Inc., 1st edition, 1992. (Cited on page 147) Laura Rassbach, Trevor Pincock, and Brian Mingus: ‘Exploring the Feasibility of Automatically Rating Online Article Quality’, in: Proceedings of the 2007 International Wikimedia Conference (WikiMania), Taipei, Taiwan, 2007. (Cited on page 66) Eric S. Raymond: The Cathedral and the Bazaar , O’Reilly & Associates, Inc., Sebastopol, CA, USA, October 1999. (Cited on page 10) Joseph Michael Jr. Reagle: Good Faith Collaboration: The Culture of Wikipedia (History and Foundations of Information Science), The MIT Press, 2010. (Cited on pages 20 and 30) Marta Recasens, Cristian Danescu-Niculescu-Mizil, and Dan Jurafsky: ‘Linguistic Models for Analyzing and Detecting Biased Language’, in: Proceedings of the 51st Annual Meeting on Association for Computational Linguistics, pp. 1650–1659, Sofia, Bulgaria, 2013. (Cited on page 105) Scott Rettberg: ‘All together now: Collective knowledge, collective narratives, and architectures of participation’, in: Proceesing of the 2005 Digital Arts and Culture Conference, Copenhagen, DK, 2005. (Cited on pages 14 and 22) R. P. Rice and Jr. Huguley J.T.: ‘Describing collaborative forms: a profile of the team-writing process’, IEEE Transactions on Professional Communication 37 (3): 163–170, 1994. (Cited on page 12) Jan P. Rohweder, Gerhard Kasten, Dirk Malzahn, Andrea Piro, and Joachim Schmid: ‘Informationsqualität — Definitionen, Dimensionen und Begriffe’, in Knut Hildebrand,

200 Marcus Gebauer, Holger Hinrichs, and Michael Mielke (Eds.): Daten- und Informationsqualität , chapter 2, pp. 25–45, Vieweg+Teubner Verlag, Wiesbaden, 2008. (Cited on pages 47 and 48) Jerrold Sadock: ‘Speech Acts’, in Laurence R. Horn and Gregory Ward (Eds.): Handbook of Pragmatics, chapter 3, pp. 53–73, Blackwell Publishing Ltd, 2006. (Cited on page 112) Markus Schaal, Barry Smyth, Roland M. Mueller, and Rutger MacLean: ‘Information quality dimensions for the social web’, in: Proceedings of the International Conference on Management of Emergent Digital EcoSystems - MEDES ’12, p. 53, ACM Press, New York, New York, USA, October 2012. (Cited on page 48) Jodi Schneider, Alexandre Passant, and John G. Breslin: ‘A Content Analysis: How Wikipedia Talk Pages Are Used’, in: Proceedings of the 2nd International Conference of Web Science, pp. 1–7, Raleigh, NC, USA, 2010. (Cited on pages 110, 115 and 151) Jodi Schneider, Alexandre Passant, and John G. Breslin: ‘Understanding and Improving Wikipedia Article Discussion Spaces’, in: Proceedings of the 2011 ACM Symposium on Applied Computing, pp. 808–813, Taichung, Taiwan, 2011. (Cited on pages 57, 110, 115 and 121) John Rogers Searle: Speech Acts, Cambridge University Press, Cambridge, UK., 1969. (Cited on page 112) John Rogers Searle: ‘A classification of illocutionary acts’, Language in Society 5 (1): 1–23, 1976. (Cited on pages 112 and 113) Claude E Shannon: ‘A Mathematical Theory of Communication’, Bell System Technical Journal 27 (3): 379—-423, 1948. (Cited on pages 45 and 46) Mike Sharples: ‘Introduction’, in Mike Sharples (Ed.): Computer Supported Collaborative Writing, Computer Supported Cooperative Work, pp. 1–7, Springer London, 1993. (Cited on page 12) Mike Sharples, J. Goodlet, Eevi Beck, C. Wood, S. Easterbrook, and L. Plowman: ‘Research Issues in the Study of Computer Supported Collaborative Writing’, in Mike Sharples (Ed.): Computer Supported Collaborative Writing, Computer Supported Cooperative Work, pp. 9–28, Springer London, 1993. (Cited on page 12) Fatimah Sidi, Payam Hassany Shariat Panah, Lilly Suriani Affendey, Marzanah A. Jabar, Hamidah Ibrahim, and Aida Mustapha: ‘Data quality: A survey of data quality dimensions’, in: Proceedings of the 2012 International Conference on Information Retrieval & Knowledge Management , pp. 300–304, Kuala Lumpur, Malaysia, 2012. (Cited on page 2) Edgar A. Smith and R. J. Senter: Automated Readability Index , AMRL-TR-66-220, Aerospace Medical Research Laboratories, 1967. (Cited on pages 52 and 93)

201 Thamar Solorio, Ragib Hasan, and Mainul Mizan: ‘A Case Study of Sockpuppet Detection in Wikipedia’, in: Workshop on Language Analysis in Social Media (LASM) at NAACL HLT 2013, pp. 59–68, Association for Computational Linguistics, Atlanta, Georgia, 2013. (Cited on page 30) Vicki Spandel: Creating Writers: 6 Traits, Process, Workshop, and Literature, Pearson, Saddle River, NJ, USA, 6th edition, 2012. (Cited on pages 52 and 62) Besiki Stvilia, Les Gasser, Michael B. Twidale, and Linda C. Smith: ‘A Framework for Information Quality Assessment’, Journal of the American Society for Information Science 58 (12): 1720–1733, 2007. (Cited on pages 49, 58, 61 and 120) Besiki Stvilia, Michael B. Twidale, Linda C. Smith, and Les Gasser: ‘Information quality discussions in wikipedia’, Technical report , University of Illinois at Urbana Champaign, 2005. (Cited on page 119) Besiki Stvilia, Michael B. Twidale, Linda C. Smith, and Les Gasser: ‘Information quality work organization in wikipedia’, Journal of the American Society for Information Science and Technology 59 (6): 983–1001, April 2008. (Cited on pages 50, 58, 61, 110, 120 and 151) Karl-Erik Sveiby: ‘Transfer of knowledge and the information processing professions’, European Management Journal 14 (4): 379–388, 1996. (Cited on page 46) Zareen Syed and Tim Finin: ‘Unsupervised techniques for discovering ontology elements from Wikipedia article links’, in: Proceedings of the NAACL HLT 2010 First International Workshop on Formalisms and Methodology for Learning by Reading, pp. 78–86, Los Angeles, CA, USA, 2010. (Cited on page 25) David Tax: One-class classification, Phd dissertation, Technische Universiteit Delft, 2001, Online: http://homepage.tudelft.nl/n9d04/thesis.pdf. (Cited on pages 76 and 77) David R. Traum: ‘20 questions on dialogue act taxonomies’, Journal of Semantics 17: 7–30, 2000. (Cited on page 113) Grigorios Tsoumakas and Ioannis Katakis: ‘Multi-label classification: An overview’, International Journal of Data Warehousing and Mining 3 (3): 1–13, 2007. (Cited on page 146) Grigorios Tsoumakas, Ioannis Katakis, and Ioannis Vlahavas: ‘Mining Multi-label Data’, in Oded Maimon and Lior Rokach (Eds.): Data Mining and Knowledge Discovery Handbook, chapter 34, pp. 667–685, Springer, 2010. (Cited on page 167) Fernanda B. Viégas, Martin Wattenberg, and Kushal Dave: ‘Studying Cooperation and Conflict Between Authors with History Flow Visualizations’, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 575–582, Vienna, Austria, 2004. (Cited on page 115)

202 Fernanda B. Viégas, Martin Wattenberg, Jesse Kriss, and Frank Ham: ‘Talk Before You Type: Coordination in Wikipedia’, in: Proceedings of the 40th Annual Hawaii International Conference on System Sciences, pp. 78–78, Big Island, HI, USA, 2007. (Cited on pages 35, 114, 115 and 122) Richard Y. Wang and Diange M. Diane M. Strong: ‘Beyond accuracy: What data quality means to data consumers’, Journal of Management Information Systems 12 (4): 5–33, March 1996. (Cited on pages 47, 48, 50, 59, 60, 63 and 156) Dennis M. Wilkinson and Bernardo A. Huberman: ‘Cooperation and Quality in the Wikipedia’, in: Proceedings of the International Symposium on Wikis and Open Collaboration (WikiSym ’07), pp. 157–164, Montreal, Canada, 2007. (Cited on page 66) Michael J Wise: ‘YAP3: Improved Detection Of Similarities In Computer Program And Other Texts’, in: Proceedings of the twenty-seventh SIGCSE technical symposium on Computer science education, pp. 130–134, Philadelphia, PA, USA, 1996. (Cited on page 105) Ian Witten, Eibe Frank, and Mark Hall: Data mining : Practical Machine Learning Tools and Techniques, Morgan Kaufmann, Burlington, MA, 2011. (Cited on page 94) Eti Yaari, Shifra Baruchson-Arbib, and Judith Bar-Ilan: ‘Information quality assessment of community generated content: A user study of Wikipedia’, Journal of Information Science 37 (5): 487–498, August 2011. (Cited on page 49) Yiming Yang and Jan O Pedersen: ‘A Comparative Study on Feature Selection in Text Categorization’, in: Proceedings of the Fourteenth International Conference on Machine Learning, pp. 412–420, San Francisco, CA, USA, 1997. (Cited on page 147) Torsten Zesch and Iryna Gurevych: ‘Analysis of the Wikipedia Category Graph for NLP Applications’, in: Proceedings of the TextGraphs-2 Workshop (NAACL-HLT 2007), pp. 1–8, Rochester, NY, USA, 2007. (Cited on page 25) Torsten Zesch, Christof Müller, and Iryna Gurevych: ‘Extracting Lexical Semantic Knowledge from Wikipedia and Wiktionary’, in: Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco, 2008. (Cited on pages 40 and 163)

Index

1001 Nights Cast, 15 conversation act, 113 conversational move, 113 accessibility quality, 47 cut-and-paste procedure, 36 administrative pages, 23 algorithm adaptation, 146 Data Machine, 40 Article Feedback Tool, 56 data-driven IQ management, 2 article naming conventions, 24 Declarations, 112 Assertives/Representatives, 112 demarcation criteria, 21 dialog, 113 bazaar model, 10 dialog act, 113, 132 category graph, 25 dialog move, 113 category system, 25 Diff, 31 cathedral model, 10 DiffAction, 126 Claude Shannon, 45 DiffPage, 32 cleanup templates, 28, 57, 67, 68 Directives, 112 ClearTK, 89, 90 disambiguation page, 24 closed group collaboration, 9 discourse segmentation, 121 coherence, 53 discourse structure, 121 cohesion, 53 discussion pages, 23 collaboration support system, 11 distinguished content certification, 54 collaborative writing, 12 DKPro TC, 87, 165 collaborative writing system, 15 edit, 31 collaborative writing tool, 15 encyclopedic content, 23 Commissives, 112 entropy, 45 communicative act, 113 explicit threading, 34 computer supported cooperative work, 9 exploratory hypertext, 14 constructive hypertext, 15 Expressives, 112 constructive narratives, 14 content pages, 23 featured article criteria, 54 contextual quality, 47 featured content, 54

205 five pillars, 21 label reliability, 71 flagged revisions, 33, 57 language version, 20 FlawFinder, 85, 165 linguistic quality, 51 founding principles, 21 Lists, 26 locutionary act, 112 generic information quality framework, 46 longest common substring, 162 glossaries, 26 Lua, 28 good articles, 54 group single-author writing, 13 macrostructure, 22 Mallet, 90, 167 Hadoop, 41, 81 MapReduce, 82 hatnotes, 27 Microsoft Office 365, 16 help pages, 23 microstructure, 22 horizontal division writing, 13 model of communication, 45 human-human dialog, 113 monologue, 113 Hypertext Hotel, 15 motivation, 10 illocutionary act, 112 move procedure, 36 in-text replies, 122 namespace, 23 indexes, 26 Nupedia, 20 infobox, 26 information, 45 one-class classification, 77 information content, 45 online encyclopedia, 19 information quality framework, 46 open collaboration, 10 information quality model, 46 open collaborative writing, 14 inline scope, 68 outlines, 26 inter-page organization, 22 overviews, 26 Interpedia, 20 page blanking, 126 interwiki links, 27 page scope, 68 intra-page organization, 22 PAN Challenge, 74 intrinsic article quality, 60 paragraph blanking, 126 intrinsic quality, 47 parallel writing, 13 IQ assessment, 50 partial turns, 127 IQ assurance, 50 peer review, 55 IQ improvement, 50 pending changes, 33, 57 IQ management, 50 perlocutionary act, 112 Jimmy Wales, 20 portals, 26 John Austin, 112 power law of participation, 11 JWPL, 40, 80, 89, 91, 120, 130, 132, 146 problem transformation, 146 process-driven IQ managment, 2

206 PU learning, 77 templates, 28, 36 text quality, 51 quality assessment, 65 The Unknown, 15 quality flaw detection, 70 TimeMachine, 40, 161 quality flaws, 68 topic bias, 78 quality management, 12 topical preference, 70 quality problems, 50, 67 topical restriction, 70 quality standard, 46 trade-off relations, 49 reactive writing, 13 turn, 34, 113 readability, 52 turn-taking, 113 redirects, 25 UIMA, 87, 146, 166 redlinks, 26, 27 unstructured resource, 22 reliable negative, 81 user pages, 23 reliable positive, 82 user status groups, 29 representational quality, 47 requirements of content, 58 vandalism, 126 requirements of demand, 58 Visual Editor, 28 requirements of form, 58 requirements of the project, 58 Weka, 90, 146, 167 revision, 31 wiki, 16 revision history, 31, 161 wiki markup, 27 revision of origin, 124 Wikidata, 27, 28, 42, 74 RevisionMachine, 40, 161 WikiHadoop, 41, 81 wikilinks, 26 Scribunto, 28 Wikimedia Labs, 39 section scope, 68 Wikimedia Toolserver, 39 semi-structured resource, 22 Wikipedia, 15, 16, 19, 53, 161 sequential writing, 13 Wikipedia Foundations, 21 sock puppets, 30 Wikipedia Miner, 40 soft security, 30 Wikipedia Protection Policies, 32 soft threading, 34 WikiProject, 26, 31 standard-by-comparison, 54 WikiProjects article quality grades, 66, 67 stratified division writing, 13 Wiktionary, 41 SWEBLE, 89 work coordination, 11, 33 systemic bias, 22, 30 writing, 12 writing quality, 51 Talk namespace, 23 writing tool, 16 Talk pages, 33, 57 writing traits, 51 template substitution, 28 template transclusion, 28

Wissenschaftlicher Werdegang des Verfassers¶

2004–2009 Studium für das Lehramt an Gymnasien, Hauptfächer Informatik und Englisch, Julius-Maximilians-Universität Würzburg
2005–2009 Magisterstudium, Hauptfach Englische Sprachwissenschaft, Nebenfächer Informatik und Englische Literaturwissenschaft, Julius-Maximilians-Universität Würzburg
Juni 2009 Abschluss als Magister Artium, Magisterarbeit: „Semantic Relations in WordNet and the BNC“ im Bereich Englische Sprachwissenschaft, Referent: Prof. Dr. Ilka Mindt
Dezember 2009 Abschluss des 1. Staatsexamens für das Lehramt an Gymnasien
2010–2013 Wissenschaftlicher Mitarbeiter am Ubiquitous Knowledge Processing Lab, Technische Universität Darmstadt
seit 2014 Wissenschaftlicher Mitarbeiter am Ubiquitous Knowledge Processing Lab, Deutsches Institut für Internationale Pädagogische Forschung, Frankfurt am Main

Ehrenwörtliche Erklärung ‡

Hiermit erkläre ich, die vorgelegte Arbeit zur Erlangung des akademischen Grades „Doktor-Ingenieur (Dr.-Ing.)“ mit dem Titel „The Quality of Content in Open Online Collaboration Platforms: Approaches to NLP-supported Information Quality Management in Wikipedia“ selbständig und ausschließlich unter Verwendung der angegebenen Hilfsmittel erstellt zu haben. Ich habe bisher noch keinen Promotionsversuch unternommen.

Darmstadt, den 15. Mai 2014

Oliver Ferschke, M.A.

¶ Gemäß § 20 Abs. 3 der Promotionsordnung der Technischen Universität Darmstadt.
‡ Gemäß § 9 Abs. 1 der Promotionsordnung der Technischen Universität Darmstadt.

Publikationsverzeichnis des Verfassers

Johannes Daxenberger, Oliver Ferschke, Iryna Gurevych and Torsten Zesch: ‘DKPro TC: A Java-based Framework for Supervised Learning Experiments on Textual Data’, in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. System Demonstrations, pp. 61–66, Baltimore, MD, USA, June 2014.

Lucie Flekova, Oliver Ferschke and Iryna Gurevych: ‘What Makes a Good Biography? Multidimensional Quality Analysis Based on Wikipedia Article Feedback Data’, in: Proceedings of the 23rd International World Wide Web Conference (WWW 2014), pp. 855–866, Seoul, Korea, April 2014.

Oliver Ferschke, Iryna Gurevych and Marc Rittberger: ‘The Impact of Topic Bias on Quality Flaw Prediction in Wikipedia’, in: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 721–730, Sofia, Bulgaria, August 2013.

Oliver Ferschke, Johannes Daxenberger and Iryna Gurevych: ‘A Survey of NLP Methods and Resources for Analyzing the Collaborative Writing Process in Wikipedia’, in Iryna Gurevych and Jungi Kim (Eds.): The People’s Web Meets NLP: Collaboratively Constructed Language Resources, Chapter 5, pp. 121–160, Springer, April 2013.

Oliver Ferschke, Iryna Gurevych and Marc Rittberger: ‘FlawFinder: A Modular System for Predicting Quality Flaws in Wikipedia - Notebook for PAN at CLEF 2012’, in: CLEF 2012 Labs and Workshop, Notebook Papers, Online Proceedings, Rome, Italy, September 2012.

Johannes Daxenberger, Oliver Ferschke and Iryna Gurevych: ‘Wikipedia-based Corpora for Analyzing Revisions, Discussions and Text Quality in Collaborative Writing’, in: Workshop on Automatic Processing of Non-Standard Data Sources in Corpus-Based Research (Extended Abstract), Cologne, August 2012.

Oliver Ferschke, Iryna Gurevych and Yevgen Chebotar: ‘Behind the Article: Recognizing Dialog Acts in Wikipedia Talk Pages’, in: Proceedings of the 13th Conference of the European Chapter of the ACL (EACL 2012), pp. 777–786, Avignon, France, April 2012.

Oliver Ferschke, Torsten Zesch and Iryna Gurevych: ‘Wikipedia Revision Toolkit: Efficiently Accessing Wikipedia’s Edit History’, in: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. System Demonstrations, pp. 97–102, Portland, OR, USA, June 2011.

Oliver Ferschke: ‘Semantic Relations in WordNet and the BNC’, M.A. Thesis, Universität Würzburg, January 2009.