The Writing Process in Online Mass Collaboration: NLP-Supported Approaches to Analyzing Collaborative Revision and User Interaction


Dissertation approved by the Department of Computer Science of the Technische Universität Darmstadt in fulfillment of the requirements for the degree of Dr.-Ing. Submitted by Johannes Daxenberger, M.A., born in Rosenheim. Date of submission: 28 May 2015. Date of defense: 21 July 2015. Referees: Prof. Dr. Iryna Gurevych, Darmstadt; Prof. Dr. Karsten Weihe, Darmstadt; Assoc. Prof. Ofer Arazy, Ph.D., Alberta. Darmstadt 2016. D17.

Please cite this document as URN: urn:nbn:de:tuda-tuprints-52259, URL: http://tuprints.ulb.tu-darmstadt.de/5225. This document is provided by tuprints, the e-publishing service of the TU Darmstadt, http://tuprints.ulb.tu-darmstadt.de, [email protected]. This work is published under the following Creative Commons license: Attribution – Non Commercial – No Derivative Works 3.0 Germany, http://creativecommons.org/licenses/by-nc-nd/3.0/de/deed.en.

Abstract

In the past 15 years, the rapid development of web technologies has created novel ways of collaborative editing. Open online platforms have attracted millions of users from all over the world. The open encyclopedia Wikipedia, started in 2001, has become a very prominent example of a largely successful platform for collaborative editing and knowledge creation. The wiki model has enabled collaboration at a new scale, with more than 30,000 monthly active users on the English Wikipedia.

Traditional writing research deals with questions concerning revision and the writing process itself. The analysis of collaborative writing additionally raises questions about the interaction of the involved authors. Interaction takes place when authors write on the same document (indirect interaction), or when they coordinate the collaborative writing process by means of communication (direct interaction). The study of collaborative writing in online mass collaboration poses several interesting challenges. First and foremost, the writing process in open online collaboration is typically characterized by a large number of revisions from many different authors. Therefore, it is important to understand the interplay and the sequences of different revision categories. As the quality of documents produced in a collaborative writing process varies greatly, the relationship between collaborative revision and document quality is an important field of study. Furthermore, the impact of direct user interaction through background discussions on the collaborative writing process is largely unknown. In this thesis, we tackle these challenges in the context of online mass collaboration, using one of the largest collaboratively created resources, Wikipedia, as our data source. We will also discuss to which extent our conclusions are valid beyond Wikipedia.

We will be dealing with three aspects of collaborative writing in Wikipedia. First, we carry out a content-oriented analysis of revisions in the Wikipedia revision history. This includes the segmentation of article revisions into human-interpretable edits. We develop a taxonomy of edit categories such as spelling error corrections, vandalism, or information adding, and verify our taxonomy in an annotation study on a corpus of edits from the English and German Wikipedia.
We use the annotated corpora as training data to create models which enable the automatic classification of edits. To show that our model is able to generalize beyond our own data, we train and test it on a second corpus of English Wikipedia revisions. We analyze the distribution of edit categories and frequent patterns in edit sequences within a larger set of article revisions. We also assess the relationship between edit categories and article quality, finding that the information content in high-quality articles tends to become more stable after their promotion and that high-quality articles show a higher degree of homogeneity with respect to frequent collaboration patterns as compared to random articles.

Second, we investigate activity-based roles of users in Wikipedia and how they relate to the collaborative writing process. We automatically classify all revisions in a representative sample of Wikipedia articles and cluster users in this sample into seven intuitive roles. The roles are based on the editing behavior of the users. We find roles such as Vandals, Watchdogs, or All-round Contributors. We also analyze the stability of our discovered roles across time and analyze role transitions. The results show that although the nature of roles remains stable across time, more than half of the users in our sample changed their role between two time periods.

Third, we analyze the correspondence between indirect user interaction through collaborative editing and direct user interaction through background discussion. We analyze direct user interaction using the notion of turns, which has been established in previous work. Turns are snippets from Wikipedia discussion pages. We introduce the notion of corresponding edit-turn-pairs. A corresponding edit-turn-pair consists of a turn and an edit from the same Wikipedia article; the turn forms an explicit performative and the edit corresponds to this performative. This happens, for example, when a user complains about a missing reference in the discussion about an article, and another user adds an appropriate reference to the article itself. We identify the distinctive properties of corresponding edit-turn-pairs and use them to create a model for the automatic detection of corresponding and non-corresponding edit-turn-pairs. We show that the percentage of corresponding edit-turn-pairs in a corpus of flawed English Wikipedia articles is typically below 5% and varies considerably across different articles.

The thesis is concluded with a summary of our main contributions and findings. The growing number of collaborative platforms in commercial applications and education, e.g. in massive open online learning courses, demonstrates the need to understand the collaborative writing process and to support collaborating authors. We also discuss several open issues with respect to the questions addressed in the main parts of the thesis and point out possible directions for future work. Many of the experiments we carried out in the course of this thesis rely on supervised text classification. In the appendix, we explain the concepts and technologies underlying these experiments. We also introduce the DKPro TC framework, which was substantially extended as part of this thesis.

Zusammenfassung

The continued development of web technologies over the past 15 years has given rise to entirely new forms of collaborative writing on the web.
Open-access online platforms have millions of users distributed across the entire globe. The online encyclopedia Wikipedia, founded in 2001, has developed into one of the best-known and most successful platforms for collaborative writing and knowledge generation. The wiki model makes collaboration possible on a new scale: in the English Wikipedia, for example, more than 30,000 users are active every month.

Traditional writing research deals with questions about revision and the writing process. The analysis of collaborative writing is additionally interested in the interaction of the participating users. Such interaction takes place when authors write on the same document (indirect interaction), or when authors coordinate the collaborative writing process by means of spoken or written communication (direct interaction). The study of collaborative writing under mass collaboration on online platforms involves several interesting challenges.

The collaborative writing process on the web is characterized by a typically very high number of changes originating from many different authors. Accordingly, it is indispensable to understand the interplay and the sequences of different revision types. Since the quality of the content of collaboratively created documents varies greatly, the study of the correlation between collaborative revision and document quality is another important field. Furthermore, the influence of direct user interaction through background discussions on the collaborative writing process is largely unknown. In the present thesis, we address these challenges in the context of mass collaboration on online platforms, using Wikipedia, one of the largest collaboratively created online resources, as our data source. We will also discuss to what extent our findings are valid beyond Wikipedia.

Three main aspects of collaborative writing in Wikipedia form the framework of this thesis. First, we carry out a content-oriented analysis of revision types in the Wikipedia revision history, which also includes the segmentation of article revisions into smaller edits that are easier to interpret. We develop a taxonomy of edit types covering, for example, spelling corrections, vandalism, or additions of information. The taxonomy is tested in an annotation study on edits from the English and the German Wikipedia. We use the annotated corpora as training data to build a model for the automatic classification
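Two of the machine-learning steps summarized above (supervised classification of edits into the taxonomy's categories, and clustering of users into activity-based roles) can be sketched as follows. This is a minimal illustration assuming scikit-learn; the thesis itself builds on the Java-based DKPro TC framework, and the labels, features, and data below are invented for illustration only.

```python
# Minimal sketch of the two ML steps described in the abstract, assuming
# scikit-learn; NOT the thesis's DKPro TC pipeline. Labels and features
# are illustrative placeholders.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Step 1: supervised edit classification on annotated (edit text, category) pairs.
train_edits = ["teh -> the", "BUY CHEAP PILLS http://spam.example", "added 1971 founding date"]
train_labels = ["spelling", "vandalism", "information-adding"]
edit_classifier = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # character n-grams of the edited text
    LogisticRegression(max_iter=1000),
)
edit_classifier.fit(train_edits, train_labels)

# Step 2: cluster users into roles based on their editing behavior, here
# represented as a per-user distribution over predicted edit categories.
categories = edit_classifier.classes_
user_edits = {
    "UserA": ["teh -> the", "recieve -> receive"],
    "UserB": ["VANDAL SPAM http://spam.example"],
}
profiles = []
for user, edits in user_edits.items():
    predicted = edit_classifier.predict(edits)
    counts = np.array([(predicted == c).sum() for c in categories], dtype=float)
    profiles.append(counts / counts.sum())  # normalized category distribution

roles = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(np.array(profiles))
print(dict(zip(user_edits, roles)))  # role id per user (the thesis finds seven roles)
```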
Recommended publications
  • Short-Format Workshops Build Skills and Confidence for Researchers to Work with Data
    Paper ID #23855. Short-format Workshops Build Skills and Confidence for Researchers to Work with Data. Kari L. Jordan Ph.D., The Carpentries: Dr. Kari L. Jordan is the Director of Assessment and Community Equity for The Carpentries, a non-profit that develops and teaches core data science skills for researchers. Marianne Corvellec, Institute for Globally Distributed Open Research and Education (IGDORE): Marianne Corvellec has worked in industry as a data scientist and a software developer since 2013. Since then, she has also been involved with The Carpentries, pursuing interests in community outreach, education, and assessment. She holds a PhD in statistical physics from École Normale Supérieure de Lyon, France (2012). Elizabeth D. Wickes, University of Illinois at Urbana-Champaign: Elizabeth Wickes is a Lecturer at the School of Information Sciences at the University of Illinois, where she teaches foundational programming and information technology courses. She was previously a Data Curation Specialist for the Research Data Service at the University Library of the University of Illinois, and the Curation Manager for Wolfram|Alpha. She currently co-organizes the Champaign-Urbana Python user group, has been a Carpentries instructor since 2015 and trainer since 2017, and is an elected member of The Carpentries' Executive Council for 2018. Dr. Naupaka B. Zimmerman, University of San Francisco. Mr. Jonah M. Duckles, Software Carpentry: Jonah Duckles works to accelerate data-driven inquiry by catalyzing digital skills and building organizational capacity. As a part of the leadership team, he helped to grow Software and Data Carpentry into a financially sustainable non-profit with a robust organization membership in 10 countries.
  • A Topic-Aligned Multilingual Corpus of Wikipedia Articles for Studying Information Asymmetry in Low Resource Languages
    Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 2373–2380, Marseille, 11–16 May 2020. © European Language Resources Association (ELRA), licensed under CC-BY-NC. A Topic-Aligned Multilingual Corpus of Wikipedia Articles for Studying Information Asymmetry in Low Resource Languages. Dwaipayan Roy, Sumit Bhatia, Prateek Jain. GESIS - Cologne, IBM Research - Delhi, IIIT - Delhi. [email protected], [email protected], [email protected]. Abstract: Wikipedia is the largest web-based open encyclopedia covering more than three hundred languages. However, different language editions of Wikipedia differ significantly in terms of their information coverage. We present a systematic comparison of information coverage in English Wikipedia (most exhaustive) and Wikipedias in eight other widely spoken languages (Arabic, German, Hindi, Korean, Portuguese, Russian, Spanish and Turkish). We analyze the content present in the respective Wikipedias in terms of the coverage of topics as well as the depth of coverage of topics included in these Wikipedias. Our analysis quantifies and provides useful insights about the information gap that exists between different language editions of Wikipedia and offers a roadmap for the Information Retrieval (IR) community to bridge this gap. Keywords: Wikipedia, Knowledge base, Information gap. 1. Introduction: Wikipedia is the largest web-based encyclopedia covering more than three hundred languages. However, different language editions differ from each other with respect to the coverage of topics as well as the amount of information about overlapping topics.
  • Modeling Popularity and Reliability of Sources in Multilingual Wikipedia
    Information, Article: Modeling Popularity and Reliability of Sources in Multilingual Wikipedia. Włodzimierz Lewoniewski *, Krzysztof Węcel and Witold Abramowicz. Department of Information Systems, Poznań University of Economics and Business, 61-875 Poznań, Poland; [email protected] (K.W.); [email protected] (W.A.). * Correspondence: [email protected]. Received: 31 March 2020; Accepted: 7 May 2020; Published: 13 May 2020. Abstract: One of the most important factors impacting the quality of content in Wikipedia is the presence of reliable sources. By following references, readers can verify facts or find more details about the described topic. A Wikipedia article can be edited independently in any of over 300 languages, even by anonymous users; therefore, information about the same topic may be inconsistent. This also applies to the use of references in different language versions of a particular article, so the same statement can have different sources. In this paper we analyzed over 40 million articles from the 55 most developed language versions of Wikipedia to extract information about over 200 million references and find the most popular and reliable sources. We presented 10 models for the assessment of the popularity and reliability of the sources based on analysis of meta information about the references in Wikipedia articles, page views, and authors of the articles. Using DBpedia and Wikidata we automatically identified the alignment of the sources to a specific domain. Additionally, we analyzed the changes of popularity and reliability over time and identified growth leaders in each of the considered months. The results can be used for quality improvements of the content in different language versions of Wikipedia.
  • Operational Transformation in Co-Operative Editing
    INTERNATIONAL JOURNAL OF SCIENTIFIC & TECHNOLOGY RESEARCH, VOLUME 5, ISSUE 01, JANUARY 2016, ISSN 2277-8616. Operational Transformation in Co-Operative Editing. Mandeep Kaur, Manpreet Singh, Harneet Kaur, Simran Kaur. ABSTRACT: Real-time cooperative editing systems allow a virtual team to view and edit a shared document at the same time. The shared document must be synchronized in order to ensure consistency for all the participants. This paper describes Operational Transformation, the evolution of its techniques, its various applications, major issues, and achievements. In addition, this paper presents the working of a platform where two users can edit a code (programming) file at the same time. Index Terms: CodeMirror, code sharing, consistency maintenance, Google Wave, Google Docs, group editors, operational transformation, real-time editing, Web Sockets. 1 INTRODUCTION: Collaborative editing enables mobile professionals to share and edit the same documents at the same time. Fundamentally, it is based on operational transformation, that is, it regulates the order of operations according to the transformed execution order. The idea is to transform a received update operation before its execution on a replica of the object. Operational Transformation consists of a centralized/decentralized integration procedure and a transformation function. Collaborative writing can be defined as making changes to one document by a number of persons; the writing can be organized into sub-tasks assigned to each group member. 1.1 MAIN ISSUES RELATED TO REAL-TIME CO-OPERATIVE EDITING: Three inconsistency problems arise. Divergence: contents at all sites in the final document are different. Causality-violation: execution order is different from the cause-effect order (e.g., O1 → O3). Intention-violation: the actual effect is different from the intended effect.
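The transformation function at the heart of operational transformation can be illustrated with a minimal sketch for two concurrent single-character insertions. This is a textbook-style toy example, not the CodeMirror/WebSockets platform the paper describes; the position-shifting rule and tie-breaking by site id are standard assumptions.

```python
# Minimal sketch of an operational-transformation function for two
# concurrent character insertions; a textbook-style illustration, not
# the paper's actual system.
from dataclasses import dataclass

@dataclass
class Insert:
    pos: int   # index in the shared text buffer
    ch: str    # single character to insert
    site: int  # site id, used to break ties deterministically

def transform(op: Insert, against: Insert) -> Insert:
    """Shift `op` so it can be applied after `against` has executed."""
    if against.pos < op.pos or (against.pos == op.pos and against.site < op.site):
        return Insert(op.pos + 1, op.ch, op.site)
    return op

def apply(text: str, op: Insert) -> str:
    return text[:op.pos] + op.ch + text[op.pos:]

# Two sites edit "abc" concurrently; transforming the remote operation
# before applying it makes both sites converge (no divergence).
o1 = Insert(pos=1, ch="X", site=1)  # site 1 inserts X at index 1
o2 = Insert(pos=1, ch="Y", site=2)  # site 2 inserts Y at index 1
site1 = apply(apply("abc", o1), transform(o2, o1))
site2 = apply(apply("abc", o2), transform(o1, o2))
assert site1 == site2 == "aXYbc"
```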
  • Omnipedia: Bridging the Wikipedia Language Gap
    Omnipedia: Bridging the Wikipedia Language Gap. Patti Bao*†, Brent Hecht†, Samuel Carton†, Mahmood Quaderi†, Michael Horn†§, Darren Gergle*†. *Communication Studies, †Electrical Engineering & Computer Science, §Learning Sciences, Northwestern University. {patti,brent,sam.carton,quaderi}@u.northwestern.edu, {michael-horn,dgergle}@northwestern.edu. ABSTRACT: We present Omnipedia, a system that allows Wikipedia readers to gain insight from up to 25 language editions of Wikipedia simultaneously. Omnipedia highlights the similarities and differences that exist among Wikipedia language editions, and makes salient information that is unique to each language as well as that which is shared more widely. We detail solutions to numerous front-end and algorithmic challenges inherent to providing users with a multilingual Wikipedia experience. These include visualizing content in a language-neutral way and aligning data in the face of diverse information organization strategies. We present a study of Omnipedia that characterizes how people interact with information using a multilingual lens. Each language edition contains its own cultural viewpoints on a large number of topics [7, 14, 15, 27]. On the other hand, the language barrier serves to silo knowledge [2, 4, 33], slowing the transfer of less culturally imbued information between language editions and preventing Wikipedia's 422 million monthly visitors [12] from accessing most of the information on the site. In this paper, we present Omnipedia, a system that attempts to remedy this situation at a large scale. It reduces the silo effect by providing users with structured access in their native language to over 7.5 million concepts from up to 25 language editions of Wikipedia. At the same time, it highlights similarities and differences between each of the language editions.
  • Lock-Free Collaboration Support for Cloud Storage Services With
    Lock-Free Collaboration Support for Cloud Storage Services with Operation Inference and Transformation. Jian Chen, Minghao Zhao, and Zhenhua Li, Tsinghua University; Ennan Zhai, Alibaba Group Inc.; Feng Qian, University of Minnesota - Twin Cities; Hongyi Chen, Tsinghua University; Yunhao Liu, Michigan State University & Tsinghua University; Tianyin Xu, University of Illinois Urbana-Champaign. https://www.usenix.org/conference/fast20/presentation/chen. This paper is included in the Proceedings of the 18th USENIX Conference on File and Storage Technologies (FAST '20), February 25–27, 2020, Santa Clara, CA, USA. ISBN 978-1-939133-12-0. Abstract: This paper studies how today's cloud storage services support collaborative file editing. As a tradeoff for transparency/user-friendliness, they do not ask collaborators to use version control systems but instead implement their own heuristics for handling conflicts, which however often lead to unexpected results. Pattern 1, losing updates: Alice is editing a file; suddenly, her file is overwritten by a new version from her collaborator, Bob. Sometimes, Alice can even lose her edits on the older version (observed in all studied cloud storage services). Pattern 2, conflicts despite coordination: Alice coordinates her edits with Bob through emails to avoid conflicts by enforcing a sequential order (observed in all studied cloud storage services).
  • The Culture of Wikipedia
    Good Faith Collaboration: The Culture of Wikipedia. Joseph Michael Reagle Jr. Foreword by Lawrence Lessig. The MIT Press, Cambridge, MA. Web edition, Copyright © 2011 by Joseph Michael Reagle Jr., CC-NC-SA 3.0. Wikipedia's style of collaborative production has been lauded, lambasted, and satirized. Despite unease over its implications for the character (and quality) of knowledge, Wikipedia has brought us closer than ever to a realization of the centuries-old pursuit of a universal encyclopedia. Good Faith Collaboration: The Culture of Wikipedia is a rich ethnographic portrayal of Wikipedia's historical roots, collaborative culture, and much debated legacy. Contents include: 1. Nazis and Norms; 2. The Pursuit of the Universal Encyclopedia; 3. Good Faith Collaboration; 4. The Puzzle of Openness. "Reagle offers a compelling case that Wikipedia's most fascinating and unprecedented aspect isn't the encyclopedia itself — rather, it's the collaborative culture that underpins it: brawling, self-reflexive, funny, serious, and full-tilt committed to the project, even if it means setting aside personal differences. Reagle's position as a scholar and a member of the community makes him uniquely situated to describe this culture." —Cory Doctorow, Boing Boing. "Reagle provides ample data regarding the everyday practices and cultural norms of the community which collaborates to produce Wikipedia. His rich research and nuanced appreciation of the complexities of cultural digital media research are well presented."
  • Amplifying the Impact of Open Access: Wikipedia and the Diffusion of Science
    (forthcoming in the Journal of the Association for Information Science and Technology). Amplifying the Impact of Open Access: Wikipedia and the Diffusion of Science. Misha Teplitskiy, Dept. of Sociology and KnowledgeLab, University of Chicago, [email protected]; Grace Lu, KnowledgeLab, University of Chicago, [email protected]; Eamon Duede, Computation Institute and KnowledgeLab, University of Chicago, [email protected]. (773) 834-4787, 5735 South Ellis Avenue, Chicago, Illinois 60637. Abstract: With the rise of Wikipedia as a first-stop source for scientific knowledge, it is important to compare its representation of that knowledge to that of the academic literature. Here we identify the 250 most heavily used journals in each of 26 research fields (4,721 journals, 19.4M articles in total) indexed by the Scopus database, and test whether topic, academic status, and accessibility make articles from these journals more or less likely to be referenced on Wikipedia. We find that a journal's academic status (impact factor) and accessibility (open access policy) both strongly increase the probability of its being referenced on Wikipedia. Controlling for field and impact factor, the odds that an open access journal is referenced on the English Wikipedia are 47% higher compared to paywall journals. One of the implications of this study is that a major consequence of open access policies is to significantly amplify the diffusion of science, through an intermediary like Wikipedia, to a broad audience. Word count: 7894. Introduction: Wikipedia, one of the most visited websites in the world, has become a destination for information of all kinds, including information about science (Heilman & West, 2015; Laurent & Vickers, 2009; Okoli, Mehdi, Mesgari, Nielsen, & Lanamäki, 2014; Spoerri, 2007).
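The 47% figure is an odds ratio, i.e. the exponential of a logistic-regression coefficient on an open-access indicator. A minimal sketch of how such a number is estimated, on synthetic data and assuming statsmodels (not the study's actual model or dataset):

```python
# Minimal sketch of estimating an odds ratio for open access status with
# logistic regression; synthetic data, NOT the study's model or dataset.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5000
open_access = rng.integers(0, 2, n)   # 1 = open access journal
log_impact = rng.normal(0.0, 1.0, n)  # control: log impact factor
# Synthetic ground truth: log-odds increase of ~0.385 (= ln 1.47) for OA.
logits = -1.0 + 0.385 * open_access + 0.8 * log_impact
referenced = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

X = sm.add_constant(np.column_stack([open_access, log_impact]))
fit = sm.Logit(referenced, X).fit(disp=0)
odds_ratio = np.exp(fit.params[1])  # e^coefficient of the OA indicator
print(f"OA odds ratio ~ {odds_ratio:.2f}")  # about 1.47, i.e. ~47% higher odds
```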
  • Linfield College, Computer Science Department COMP 161: Basic Programming Syllabus Spring 2015 Prof
    Linfield College, Computer Science Department. COMP 161: Basic Programming, Syllabus, Spring 2015, Prof. Daniel Ford. COURSE DESCRIPTION: Introduces the basic concepts of programming: reading and writing unambiguous descriptions of sequential processes. Emphasizes introductory algorithmic strategies and corresponding structures. (QR). COMP 160 is required for a major in electronic arts and is a prerequisite for COMP 161 and all other required courses for a major or minor in computer science. Extensive hands-on experience with programming projects. Lecture/discussion, lab. PREREQUISITES: COMP 160. CREDITS: 3. LECTURES: Renshaw 211, Section 1: MWF 2:00–2:50 PM. WORKSHOPS: Renshaw 211, Mondays and Wednesdays, 8–9 PM. Additional help and tutoring is available by request and through the learning center. PROFESSOR: Daniel Ford. Office: Renshaw 210. Phone: x2706. E-mail: [email protected]. Office Hours: MWF 10:30–11:30 AM, TTh 2:00–3:00 PM, and by appointment. RECOMMENDED BOOKS: § Cay S. Horstmann, Big Java, http://www.horstmann.com/bigjava.html. Recommended for COMP 160 & 161. Wiley, 5th Edition (2012, Early Objects) ~ $90; 4th Edition, 3rd Edition, …, 2nd Edition (2005) ~ $10. § Y. Daniel Liang, Introduction to Java Programming, Comprehensive Version, http://cs.armstrong.edu/liang/intro9e/. Recommended for COMP 160 & 161. Faster, more advanced, but with fewer tips than Big Java. 9th Edition (2012) ~ $90; 8th Edition, …, 6th Edition (2006) ~ $10. § Cay S. Horstmann & Gary Cornell, Core Java™, Volume I – Fundamentals and Volume II – Advanced Features (9th Edition), 2012, http://www.horstmann.com/corejava.html. The best reference manuals for Java. Recommended for CS majors and professional Java programmers.
  • Wikipedia Editing History in Dbpedia Fabien Gandon, Raphael Boyer, Olivier Corby, Alexandre Monnin
    Wikipedia editing history in DBpedia: extracting and publishing the encyclopedia editing activity as linked data. Fabien Gandon, Raphael Boyer, Olivier Corby, Alexandre Monnin. To cite this version: Fabien Gandon, Raphael Boyer, Olivier Corby, Alexandre Monnin. Wikipedia editing history in DBpedia: extracting and publishing the encyclopedia editing activity as linked data. IEEE/WIC/ACM International Joint Conference on Web Intelligence (WI '16), Oct 2016, Omaha, United States. HAL Id: hal-01359575, https://hal.inria.fr/hal-01359575, submitted on 2 Sep 2016. HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. Université Côte d'Azur, Inria, CNRS, I3S, France; Wimmics, Sophia Antipolis, France; [email protected]. Abstract: DBpedia is a huge dataset essentially extracted from the content and structure of Wikipedia. We present a new extraction producing a linked data representation of the editing history of Wikipedia pages. This supports custom querying and combining with other data, providing new indicators and insights. For example, the French editing history dump represents 2 TB of uncompressed data. This data extraction is performed by stream in Node.js with a MongoDB instance. It takes 4 days to extract 55 GB of RDF in Turtle on 8 Intel(R) Xeon(R) CPU E5-
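As an illustration of what a linked-data representation of editing activity might look like, here is a minimal sketch using rdflib. The vocabulary (a hypothetical ex: namespace with page/editedBy/timestamp/comment terms) is assumed for illustration and is not the schema or extraction pipeline used by the authors.

```python
# Minimal sketch of publishing revision metadata as RDF with rdflib.
# The ex: vocabulary is hypothetical, NOT the schema used in the paper.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

EX = Namespace("http://example.org/wikihistory#")
g = Graph()
g.bind("ex", EX)

revision = URIRef("http://example.org/revision/123456")
g.add((revision, RDF.type, EX.Revision))
g.add((revision, EX.page, URIRef("http://dbpedia.org/resource/Darmstadt")))
g.add((revision, EX.editedBy, Literal("ExampleUser")))
g.add((revision, EX.timestamp,
       Literal("2015-05-28T12:00:00", datatype=XSD.dateTime)))
g.add((revision, EX.comment, Literal("fixed a spelling error")))

print(g.serialize(format="turtle"))  # Turtle, as in the dump mentioned above
```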
  • The Case of Croatian Wikipedia: Encyclopaedia of Knowledge Or Encyclopaedia for the Nation?
    The Case of Croatian Wikipedia: Encyclopaedia of Knowledge or Encyclopaedia for the Nation? Authorial statement: This report represents the evaluation of the Croatian disinformation case by an external expert on the subject matter, who, after conducting a thorough analysis of the Croatian community setting, provides three recommendations to address the ongoing challenges. The views and opinions expressed in this report are those of the author and do not necessarily reflect the official policy or position of the Wikimedia Foundation. The Wikimedia Foundation is publishing the report for transparency. Executive Summary: Croatian Wikipedia (Hr.WP) has been struggling with content- and conduct-related challenges, causing repeated concerns in the global volunteer community for more than a decade. With the support of the Wikimedia Foundation Board of Trustees, the Foundation retained an external expert to evaluate the challenges faced by the project. The evaluation, conducted between February and May 2021, sought to assess whether there have been organized attempts to introduce disinformation into Croatian Wikipedia and whether the project has been captured by ideologically driven users who are structurally misaligned with Wikipedia's five pillars guiding the traditional editorial project setup of the Wikipedia projects. Croatian Wikipedia represents the Croatian standard variant of the Serbo-Croatian language. Unlike other pluricentric Wikipedia language projects, such as English, French, German, and Spanish, Serbo-Croatian Wikipedia's community was split up into Croatian, Bosnian, Serbian, and the original Serbo-Croatian wikis starting in 2003. The report concludes that this structure enabled local language communities to sort by points of view on each project, often falling along political party lines in the respective regions.
  • Language-Agnostic Relation Extraction from Abstracts in Wikis
    Information, Article: Language-Agnostic Relation Extraction from Abstracts in Wikis. Nicolas Heist, Sven Hertling and Heiko Paulheim *. Data and Web Science Group, University of Mannheim, Mannheim 68131, Germany; [email protected] (N.H.); [email protected] (S.H.). * Correspondence: [email protected]. Received: 5 February 2018; Accepted: 28 March 2018; Published: 29 March 2018. Abstract: Large-scale knowledge graphs, such as DBpedia, Wikidata, or YAGO, can be enhanced by relation extraction from text, using the data in the knowledge graph as training data, i.e., using distant supervision. While most existing approaches use language-specific methods (usually for English), we present a language-agnostic approach that exploits background knowledge from the graph instead of language-specific techniques and builds machine learning models only from language-independent features. We demonstrate the extraction of relations from Wikipedia abstracts, using the twelve largest language editions of Wikipedia. From those, we can extract 1.6 M new relations in DBpedia at a level of precision of 95%, using a RandomForest classifier trained only on language-independent features. We furthermore investigate the similarity of models for different languages and show an exemplary geographical breakdown of the information extracted. In a second series of experiments, we show how the approach can be transferred to DBkWik, a knowledge graph extracted from thousands of Wikis. We discuss the challenges and first results of extracting relations from a larger set of Wikis, using a less formalized knowledge graph. Keywords: relation extraction; knowledge graphs; Wikipedia; DBpedia; DBkWik; Wiki farms. 1. Introduction: Large-scale knowledge graphs, such as DBpedia [1], Freebase [2], Wikidata [3], or YAGO [4], are usually built using heuristic extraction methods, by exploiting crowd-sourcing processes, or both [5].
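The described pipeline (distant supervision from the knowledge graph, a RandomForest over language-independent features) can be sketched roughly as follows; the features and data are invented for illustration and scikit-learn is assumed, not the authors' actual feature set or tooling.

```python
# Rough sketch of distantly supervised relation extraction with a
# RandomForest on language-independent features; invented toy data,
# NOT the authors' feature set or pipeline.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Each row: one (entity pair, abstract) candidate, described only by
# language-independent features, e.g. position of the entity mention in
# the abstract and one-hot knowledge-graph types of subject and object.
X = np.array([
    [0, 1, 0, 1],  # mention in 1st sentence, subj type City, obj type Country
    [3, 0, 1, 0],
    [1, 1, 0, 1],
    [4, 0, 1, 0],
] * 25)
# Distant-supervision labels: 1 if the knowledge graph already contains
# the candidate relation for this entity pair, else 0.
y = np.array([1, 0, 1, 0] * 25)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())  # mean held-out accuracy
```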