Efficient Extraction and Query Benchmarking of Wikipedia Data

Total Page:16

File Type:pdf, Size:1020Kb

Efficient Extraction and Query Benchmarking of Wikipedia Data Efficient Extraction and Query Benchmarking of Wikipedia Data Der Fakult¨atf¨urMathematik und Informatik der Universit¨atLeipzig eingereichte DISSERTATION zur Erlangung des akademischen Grades Doktor-Ingenieur (Dr. Ing.) im Fachgebiet Informatik vorgelegt von M.Sc. Mohamed Mabrouk Mawed Morsey geboren am 15. November 1980 in Kairo, Agypten¨ Leipzig, den 13.4.2014 Acknowledgement First of all I would like to thank my supervisors, Dr. Jens Lehmann, Prof. S¨oren Auer, and Prof. Klaus-Peter F¨ahnrich, without whom I could not start my Ph.D. at the Leipzig University. Special thanks to my direct supervisor Dr. Jens Lehmann, with whom I have started work on my Ph.D. proposal submitted to Deutscher Akademischer Aus- tauschdienst (DAAD), in order to pursue my Ph.D. at the Agile Knowledge Engineering and Semantic Web (AKSW) group. He has continuously supported me through my Ph.D. work, giving advices and recommendations for further re- search steps. His comments and notes were very helpful for me particularly during writing the papers we have published together. I would like to thank him also for proofreading this thesis and for his helpful feedback, which led to improving the quality of that thesis. Special thanks also to thank Prof. S¨orenAuer, whom I have first contacted asking for a vacancy to conduct my Ph.D. research at his research group. As the head of our group, Prof. S¨orenAuer has established a weekly meeting, called "Writing Group", for discussing how to write robust scientific papers. His notes and comments, during these meetings, were very useful for directing me to the correct way to write a good scientific paper. I would like also to thank Prof. Klaus-Peter F¨ahnrich, for the regular follow-up meetings he has managed in order to evaluate the performance of all Ph.D. students. During these meetings, he has proposed several directions, for me and for other Ph.D. students as well, on how to deepen and extend our research points. I would like to thank all of my colleagues in the Machine Learning and Ontology Engineering (MOLE) group, for providing me with their useful comments and guidelines, especially during the initial phase of my Ph.D.. I would like to dedicate this work to the souls of my parents, without whom I could not do anything in my life. With their help and support, I could take my first steps in my scientific career. Special thanks goes to all my family members, my son Momen, my wife Hebaalla, and my dearest sisters Reham and Nermeen. III Bibliographic Data Title: Efficient Extraction and Query Benchmarking of Wikipedia Data Author: Mohamed Mabrouk Mawed Morsey Institution: Universit¨atLeipzig, Fakult¨atf¨urMathematik und Informatik Statistical Information: 128 pages, 29 Figures, 19 tables, 2 appendices, 98 literature references Abstract Knowledge bases are playing an increasingly important role for integrating infor- mation between systems and over the Web. Today, most knowledge bases cover only specific domains, they are created by relatively small groups of knowledge engineers, and it is very cost intensive to keep them up-to-date as domains change. In parallel, Wikipedia has grown into one of the central knowledge sources of mankind and is maintained by thousands of contributors. The DBpedia (http://dbpedia.org) project makes use of this large collaboratively edited knowledge source by ex- tracting structured content from it, interlinking it with other knowledge bases, and making the result publicly available. DBpedia had and has a great effect on the Web of Data and became a crystallization point for it. Furthermore, many companies and researchers use DBpedia and its public services to improve their applications and research approaches. However, the DBpedia release process is heavy-weight and the releases are sometimes based on several months old data. Hence, a strategy to keep DBpedia always in synchronization with Wikipedia is highly required. In this thesis we propose the DBpedia Live framework, which reads a continuous stream of updated Wikipedia articles, and processes it. DBpedia Live processes that stream on-the-fly to obtain RDF data and updates the DBpedia knowledge base with the newly extracted data. DBpedia Live also publishes the newly added/deleted facts in files, in order to enable synchronization between our DBpedia endpoint and other DBpedia mirrors. Moreover, the new DBpedia Live framework incorporates several significant features, e.g. abstract extraction, ontology changes, and changesets publication. Basically, knowledge bases, including DBpedia, are stored in triplestores in order to facilitate accessing and querying their respective data. Furthermore, the triplestores constitute the backbone of increasingly many Data Web applications. It is thus evident that the performance of those stores is mission critical for individual projects as well as for data integration on the Data Web in general. Consequently, IV it is of central importance during the implementation of any of these applications to have a clear picture of the weaknesses and strengths of current triplestore implementations. We introduce a generic SPARQL benchmark creation procedure, which we apply to the DBpedia knowledge base. Previous approaches often compared relational and triplestores and, thus, settled on measuring performance against a relational database which had been converted to RDF by using SQL-like queries. In contrast to those approaches, our benchmark is based on queries that were actually issued by humans and applications against existing RDF data not resembling a relational schema. Our generic procedure for benchmark creation is based on query-log mining, clustering and SPARQL feature analysis. We argue that a pure SPARQL benchmark is more useful to compare existing triplestores and provide results for the popular triplestore implementations Virtuoso, Sesame, Apache Jena-TDB, and BigOWLIM. The subsequent comparison of our results with other benchmark results indicates that the performance of triplestores is by far less homogeneous than suggested by previous benchmarks. Further, one of the crucial tasks when creating and maintaining knowledge bases is validating their facts and maintaining the quality of their inherent data. This task include several subtasks, and in thesis we address two of those major subtasks, specifically fact validation and provenance, and data quality The subtask fact validation and provenance aim at providing sources for these facts in order to ensure correctness and traceability of the provided knowledge This subtask is often addressed by human curators in a three-step process: issuing appropriate keyword queries for the statement to check using standard search engines, retrieving potentially relevant documents and screening those documents for relevant content. The drawbacks of this process are manifold. Most importantly, it is very time- consuming as the experts have to carry out several search processes and must often read several documents. We present DeFacto (Deep Fact Validation), which is an algorithm for validating facts by finding trustworthy sources for it on the Web. DeFacto aims to provide an effective way of validating facts by supplying the user with relevant excerpts of webpages as well as useful additional information including a score for the confidence DeFacto has in the correctness of the input fact. On the other hand the subtask of data quality maintenance aims at evaluating and continuously improving the quality of data of the knowledge bases. We present a methodology for assessing the quality of knowledge bases' data, which comprises of a manual and a semi-automatic process. The first phase includes the detection of common quality problems and their representation in a quality problem taxonomy. In the manual process, the second phase comprises of the evaluation of a large number of individual resources, according to the quality problem taxonomy via crowdsourcing. This process is accompanied by a tool wherein a user assesses an individual resource and evaluates each fact for correctness. The semi-automatic process involves the generation and verification of schema axioms. We report the results obtained by applying this methodology to DBpedia. V Contents 1. Introduction1 1.1. Motivation...............................1 1.2. Contributions.............................3 1.3. Chapter Overview...........................5 2. Semantic Web Technologies7 2.1. Semantic Web Definition.......................7 2.2. Resource Description Framework - RDF..............7 2.2.1. RDF Resource.........................8 2.2.2. RDF Property.........................8 2.2.3. RDF Statement........................9 2.2.4. RDF Serialization Formats.................. 10 2.2.4.1. N-Triples...................... 10 2.2.4.2. RDF/XML..................... 11 2.2.4.3. N3.......................... 11 2.2.4.4. Turtle........................ 12 2.2.5. Ontology............................ 12 2.2.6. Ontology Languages..................... 13 2.2.6.1. RDFS........................ 13 2.2.6.2. OWL........................ 14 2.2.7. SPARQL Query Language.................. 14 2.2.8. Triplestore........................... 16 3. Overview on the DBpedia Project 17 3.1. Introduction to DBpedia....................... 17 3.2. DBpedia Extraction Framework................... 19 3.2.1. General Architecture..................... 19 3.2.2. Extractors........................... 20 3.2.3. Raw Infobox Extraction................... 22 3.2.4. Mapping-Based Infobox Extraction............. 22 3.2.5. URI Schemes......................... 25 3.2.6.
Recommended publications
  • Universality, Similarity, and Translation in the Wikipedia Inter-Language Link Network
    In Search of the Ur-Wikipedia: Universality, Similarity, and Translation in the Wikipedia Inter-language Link Network Morten Warncke-Wang1, Anuradha Uduwage1, Zhenhua Dong2, John Riedl1 1GroupLens Research Dept. of Computer Science and Engineering 2Dept. of Information Technical Science University of Minnesota Nankai University Minneapolis, Minnesota Tianjin, China {morten,uduwage,riedl}@cs.umn.edu [email protected] ABSTRACT 1. INTRODUCTION Wikipedia has become one of the primary encyclopaedic in- The world: seven seas separating seven continents, seven formation repositories on the World Wide Web. It started billion people in 193 nations. The world's knowledge: 283 in 2001 with a single edition in the English language and has Wikipedias totalling more than 20 million articles. Some since expanded to more than 20 million articles in 283 lan- of the content that is contained within these Wikipedias is guages. Criss-crossing between the Wikipedias is an inter- probably shared between them; for instance it is likely that language link network, connecting the articles of one edition they will all have an article about Wikipedia itself. This of Wikipedia to another. We describe characteristics of ar- leads us to ask whether there exists some ur-Wikipedia, a ticles covered by nearly all Wikipedias and those covered by set of universal knowledge that any human encyclopaedia only a single language edition, we use the network to under- will contain, regardless of language, culture, etc? With such stand how we can judge the similarity between Wikipedias a large number of Wikipedia editions, what can we learn based on concept coverage, and we investigate the flow of about the knowledge in the ur-Wikipedia? translation between a selection of the larger Wikipedias.
    [Show full text]
  • Volunteer Contributions to Wikipedia Increased During COVID-19 Mobility Restrictions
    Volunteer contributions to Wikipedia increased during COVID-19 mobility restrictions Thorsten Ruprechter1,*, Manoel Horta Ribeiro2, Tiago Santos1, Florian Lemmerich3, Markus Strohmaier3,4, Robert West2, and Denis Helic1 1Graz University of Technology, 8010 Graz, Austria 2EPFL, 1015 Lausanne, Switzerland 3RWTH Aachen University, 52062 Aachen, Germany 4GESIS – Leibniz Institute for the Social Sciences, 50667 Cologne, Germany *Corresponding author ([email protected]) Wikipedia, the largest encyclopedia ever created, is a global initiative driven by volunteer contribu- tions. When the COVID-19 pandemic broke out and mobility restrictions ensued across the globe, it was unclear whether Wikipedia volunteers would become less active in the face of the pandemic, or whether they would rise to meet the increased demand for high-quality information despite the added stress inflicted by this crisis. Analyzing 223 million edits contributed from 2018 to 2020 across twelve Wikipedia language editions, we find that Wikipedia’s global volunteer community responded remarkably to the pandemic, substantially increasing both productivity and the number of newcom- ers who joined the community. For example, contributions to the English Wikipedia increased by over 20% compared to the expectation derived from pre-pandemic data. Our work sheds light on the response of a global volunteer population to the COVID-19 crisis, providing valuable insights into the behavior of critical online communities under stress. Wikipedia is the world’s largest encyclopedia, one of the most prominent volunteer-based information systems in existence [18, 29], and one of the most popular destinations on the Web [2]. On an average day in 2019, users from around the world visited Wikipedia about 530 million times and editors voluntarily contributed over 870 thousand edits to one of Wikipedia’s language editions (Supplementary Table 1).
    [Show full text]
  • Multilingual Ranking of Wikipedia Articles with Quality and Popularity Assessment in Different Topics
    computers Article Multilingual Ranking of Wikipedia Articles with Quality and Popularity Assessment in Different Topics Włodzimierz Lewoniewski * , Krzysztof W˛ecel and Witold Abramowicz Department of Information Systems, Pozna´nUniversity of Economics and Business, 61-875 Pozna´n,Poland * Correspondence: [email protected]; Tel.: +48-(61)-639-27-93 Received: 10 May 2019; Accepted: 13 August 2019; Published: 14 August 2019 Abstract: On Wikipedia, articles about various topics can be created and edited independently in each language version. Therefore, the quality of information about the same topic depends on the language. Any interested user can improve an article and that improvement may depend on the popularity of the article. The goal of this study is to show what topics are best represented in different language versions of Wikipedia using results of quality assessment for over 39 million articles in 55 languages. In this paper, we also analyze how popular selected topics are among readers and authors in various languages. We used two approaches to assign articles to various topics. First, we selected 27 main multilingual categories and analyzed all their connections with sub-categories based on information extracted from over 10 million categories in 55 language versions. To classify the articles to one of the 27 main categories, we took into account over 400 million links from articles to over 10 million categories and over 26 million links between categories. In the second approach, we used data from DBpedia and Wikidata. We also showed how the results of the study can be used to build local and global rankings of the Wikipedia content.
    [Show full text]
  • Florida State University Libraries
    )ORULGD6WDWH8QLYHUVLW\/LEUDULHV 2020 Wiki-Donna: A Contribution to a More Gender-Balanced History of Italian Literature Online Zoe D'Alessandro Follow this and additional works at DigiNole: FSU's Digital Repository. For more information, please contact [email protected] THE FLORIDA STATE UNIVERSITY COLLEGE OF ARTS & SCIENCES WIKI-DONNA: A CONTRIBUTION TO A MORE GENDER-BALANCED HISTORY OF ITALIAN LITERATURE ONLINE By ZOE D’ALESSANDRO A Thesis submitted to the Department of Modern Languages and Linguistics in partial fulfillment of the requirements for graduation with Honors in the Major Degree Awarded: Spring, 2020 The members of the Defense Committee approve the thesis of Zoe D’Alessandro defended on April 20, 2020. Dr. Silvia Valisa Thesis Director Dr. Celia Caputi Outside Committee Member Dr. Elizabeth Coggeshall Committee Member Introduction Last year I was reading Una donna (1906) by Sibilla Aleramo, one of the most important ​ ​ works in Italian modern literature and among the very first explicitly feminist works in the Italian language. Wanting to know more about it, I looked it up on Wikipedia. Although there exists a full entry in the Italian Wikipedia (consisting of a plot summary, publishing information, and external links), the corresponding page in the English Wikipedia consisted only of a short quote derived from a book devoted to gender studies, but that did not address that specific work in great detail. As in-depth and ubiquitous as Wikipedia usually is, I had never thought a work as important as this wouldn’t have its own page. This discovery prompted the question: if this page hadn’t been translated, what else was missing? And was this true of every entry for books across languages, or more so for women writers? My work in expanding the entry for Una donna was the beginning of my exploration ​ ​ into the presence of Italian women writers in the Italian and English Wikipedias, and how it relates back to canon, Wikipedia, and gender studies.
    [Show full text]
  • Does Wikipedia Matter? the Effect of Wikipedia on Tourist Choices Marit Hinnosaar, Toomas Hinnosaar, Michael Kummer, and Olga Slivko Discus­­ Si­­ On­­ Paper No
    Dis cus si on Paper No. 15-089 Does Wikipedia Matter? The Effect of Wikipedia on Tourist Choices Marit Hinnosaar, Toomas Hinnosaar, Michael Kummer, and Olga Slivko Dis cus si on Paper No. 15-089 Does Wikipedia Matter? The Effect of Wikipedia on Tourist Choices Marit Hinnosaar, Toomas Hinnosaar, Michael Kummer, and Olga Slivko First version: December 2015 This version: September 2017 Download this ZEW Discussion Paper from our ftp server: http://ftp.zew.de/pub/zew-docs/dp/dp15089.pdf Die Dis cus si on Pape rs die nen einer mög lichst schnel len Ver brei tung von neue ren For schungs arbei ten des ZEW. Die Bei trä ge lie gen in allei ni ger Ver ant wor tung der Auto ren und stel len nicht not wen di ger wei se die Mei nung des ZEW dar. Dis cus si on Papers are inten ded to make results of ZEW research prompt ly avai la ble to other eco no mists in order to encou ra ge dis cus si on and sug gesti ons for revi si ons. The aut hors are sole ly respon si ble for the con tents which do not neces sa ri ly repre sent the opi ni on of the ZEW. Does Wikipedia Matter? The Effect of Wikipedia on Tourist Choices ∗ Marit Hinnosaar† Toomas Hinnosaar‡ Michael Kummer§ Olga Slivko¶ First version: December 2015 This version: September 2017 September 25, 2017 Abstract We document a causal influence of online user-generated information on real- world economic outcomes. In particular, we conduct a randomized field experiment to test whether additional information on Wikipedia about cities affects tourists’ choices of overnight visits.
    [Show full text]
  • Wikipedia Matters∗
    Wikipedia Matters∗ Marit Hinnosaar† Toomas Hinnosaar‡ Michael Kummer§ Olga Slivko¶ September 29, 2017 Abstract We document a causal impact of online user-generated information on real-world economic outcomes. In particular, we conduct a randomized field experiment to test whether additional content on Wikipedia pages about cities affects tourists’ choices of overnight visits. Our treatment of adding information to Wikipedia increases overnight visits by 9% during the tourist season. The impact comes mostly from improving the shorter and incomplete pages on Wikipedia. These findings highlight the value of content in digital public goods for informing individual choices. JEL: C93, H41, L17, L82, L83, L86 Keywords: field experiment, user-generated content, Wikipedia, tourism industry 1 Introduction Asymmetric information can hinder efficient economic activity. In recent decades, the Internet and new media have enabled greater access to information than ever before. However, the digital divide, language barriers, Internet censorship, and technological con- straints still create inequalities in the amount of accessible information. How much does it matter for economic outcomes? In this paper, we analyze the causal impact of online information on real-world eco- nomic outcomes. In particular, we measure the impact of information on one of the primary economic decisions—consumption. As the source of information, we focus on Wikipedia. It is one of the most important online sources of reference. It is the fifth most ∗We are grateful to Irene Bertschek, Avi Goldfarb, Shane Greenstein, Tobias Kretschmer, Thomas Niebel, Marianne Saam, Greg Veramendi, Joel Waldfogel, and Michael Zhang as well as seminar audiences at the Economics of Network Industries conference in Paris, ZEW Conference on the Economics of ICT, and Advances with Field Experiments 2017 Conference at the University of Chicago for valuable comments.
    [Show full text]
  • Wikipedia Matters∗
    Wikipedia Matters∗ Marit Hinnosaar† Toomas Hinnosaar‡ Michael Kummer§ Olga Slivko¶ July 14, 2019 Abstract We document a causal impact of online user-generated information on real-world economic outcomes. In particular, we conduct a randomized field experiment to test whether additional content on Wikipedia pages about cities affects tourists’ choices of overnight visits. Our treatment of adding information to Wikipedia in- creases overnight stays in treated cities compared to non-treated cities. The impact is largely driven by improvements to shorter and relatively incomplete pages on Wikipedia. Our findings highlight the value of content in digital public goods for informing individual choices. JEL: C93, H41, L17, L82, L83, L86 Keywords: field experiment, user-generated content, Wikipedia, tourism industry ∗We are grateful to Irene Bertschek, Avi Goldfarb, Shane Greenstein, Tobias Kretschmer, Michael Luca, Thomas Niebel, Marianne Saam, Greg Veramendi, Joel Waldfogel, and Michael Zhang as well as seminar audiences at the Economics of Network Industries conference (Paris), ZEW Conference on the Economics of ICT (Mannheim), Advances with Field Experiments 2017 Conference (University of Chicago), 15th Annual Media Economics Workshop (Barcelona), Conference on Digital Experimentation (MIT), and Digital Economics Conference (Toulouse) for valuable comments. Ruetger Egolf, David Neseer, and Andrii Pogorielov provided outstanding research assistance. Financial support from SEEK 2014 is gratefully acknowledged. †Collegio Carlo Alberto and CEPR, [email protected] ‡Collegio Carlo Alberto, [email protected] §Georgia Institute of Technology & University of East Anglia, [email protected] ¶ZEW, [email protected] 1 1 Introduction Asymmetric information can hinder efficient economic activity (Akerlof, 1970). In recent decades, the Internet and new media have enabled greater access to information than ever before.
    [Show full text]
  • Companies and Wikipedia: Friend Or Foe?
    Lundquist Wikipedia Research 2017. Since 2008 COMPANIES AND WIKIPEDIA: FRIEND OR FOE? In this report: European, Italian, German and Swiss editions (Updated in May 2017) Wikipedia is the fifth most visited website in the world, with articles about companies perennially well positioned on the first page of search results. Yet despite this visibility, the articles about the 100 largest companies in Europe, the 100 largest in Italy, the 30 German companies included in the DAX 30 index and the 48 largest in Switzerland often lack information and present critical issues. With the already small number of active Wikipedia editors decreasing, this situation is likely to worsen. Some companies think that by editing articles about themselves they have an easy workaround. However, any company that does so runs the risk of violating Wikipedia’s rules, therefore creating a hostile environment. The potential resulting reputational backlash would be enormous. Since the first edition in 2008 our research revealed the low quality of the vast majority of company Wikipedia pages. For this reason, Lundquist has refined a set of tested guidelines, in order to help companies to safely approach the free encyclopedia as well as engage with its vast online community in a constructive manner. This proposed alliance entails abiding by Wikipedia’s rules so as to ensure information is accurate. When done correctly, a Wikipedia article is a win for both the encyclopedia and companies. 1 - Lundquist Wikipedia Research 2015 LUNDQUIST WIKIPEDIA RESEARCH As part of its research into online corporate CONTENTS information, the Lundquist Wikipedia Research covers the article content of major corporations.
    [Show full text]
  • Quality Assessment Process in Wikipedia's Vetrina
    Observatorio (OBS*) Journal, 8 (2009), 148-161 1646-5954/ERC123483/2009 148 Quality assessment process in Wikipedia’s Vetrina: the role of the community’s policies and rules Sara Monaci, Università degli Studi di Torino Abstract The increasing growth of Wikipedia poses many questions about its organizational model and its development as a free-open knowledge repository. Yochai Benkler describes Wikipedia as a CBPP (commons-based peer production) system: a platform which enables users to easily generate knowledge contents and to manage them collaboratively and on free-voluntary basis. Quality is one of the main concerns related to such a system. How would a CBPP environment guarantee at the same time the openness of its organization and a good level of accreditation? The paper offers an overview of the quality assessment processes in it.wiki’s Vetrina section. It also suggests an explanation to quality assessment which questions Benkler’s hypothesis. Thanks to a qualitative analysis carried out through in-depth interviews to Wikipedia users and through a period of ethnographic observation, the paper outlines Vetrina’s organization and the factors related to the evaluation of quality contents. Copyright © 2009 (Sara Monaci). Licensed under the Creative Commons Attribution Noncommercial No Derivatives (by-nc-nd). Available at http://obs.obercom.pt. Observatorio (OBS*) Journal, 8 (2009) Sara Monaci 139 Introduction. Wikipedia as a tool for collective knowledge production Since its appearance in 2001, Wikipedia gained progressively the attention of media and Internet users as an exemplary project of on line collaboration. The biggest Web Encyclopaedia includes today 10 million articles in more than 250 languages and involves about 75.000 active users distributed worldwide (http://en.wikipedia.org/wiki/Wikipedia:About).
    [Show full text]
  • Wikipedia @ 20
    Wikipedia @ 20 Wikipedia @ 20 Stories of an Incomplete Revolution Edited by Joseph Reagle and Jackie Koerner The MIT Press Cambridge, Massachusetts London, England © 2020 Massachusetts Institute of Technology This work is subject to a Creative Commons CC BY- NC 4.0 license. Subject to such license, all rights are reserved. The open access edition of this book was made possible by generous funding from Knowledge Unlatched, Northeastern University Communication Studies Department, and Wikimedia Foundation. This book was set in Stone Serif and Stone Sans by Westchester Publishing Ser vices. Library of Congress Cataloging-in-Publication Data Names: Reagle, Joseph, editor. | Koerner, Jackie, editor. Title: Wikipedia @ 20 : stories of an incomplete revolution / edited by Joseph M. Reagle and Jackie Koerner. Other titles: Wikipedia at 20 Description: Cambridge, Massachusetts : The MIT Press, [2020] | Includes bibliographical references and index. Identifiers: LCCN 2020000804 | ISBN 9780262538176 (paperback) Subjects: LCSH: Wikipedia--History. Classification: LCC AE100 .W54 2020 | DDC 030--dc23 LC record available at https://lccn.loc.gov/2020000804 Contents Preface ix Introduction: Connections 1 Joseph Reagle and Jackie Koerner I Hindsight 1 The Many (Reported) Deaths of Wikipedia 9 Joseph Reagle 2 From Anarchy to Wikiality, Glaring Bias to Good Cop: Press Coverage of Wikipedia’s First Two Decades 21 Omer Benjakob and Stephen Harrison 3 From Utopia to Practice and Back 43 Yochai Benkler 4 An Encyclopedia with Breaking News 55 Brian Keegan 5 Paid with Interest: COI Editing and Its Discontents 71 William Beutler II Connection 6 Wikipedia and Libraries 89 Phoebe Ayers 7 Three Links: Be Bold, Assume Good Faith, and There Are No Firm Rules 107 Rebecca Thorndike- Breeze, Cecelia A.
    [Show full text]
  • Computational Social Roles
    COMPUTATIONAL SOCIAL ROLES Diyi Yang CMU-LTI-19-001 Language Technologies Institute School of Computer Science Carnegie Mellon University 5000 Forbes Ave., Pittsburgh, PA 15213 www.lti.cs.cmu.edu Thesis Committee: Robert E. Kraut, Co-Chair (Carnegie Mellon University) Eduard Hovy, Co-Chair (Carnegie Mellon University) Brandy L. Aven (Carnegie Mellon University) Dan Jurafsky (Stanford University) Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Language and Information Technologies Copyright c 2019 Diyi Yang Keywords: Computational social science, conversational acts, computer-supported cooperative work, generative model, human-computer interaction, machine learning, natural language processing, neural network, online health communities, recommender system, social network, social support, semantics, self-disclosure, social roles, social science, well-being, Wikipedia Abstract Millions of people participate in online communities, exchange expertise and ideas, and collaborate to produce complex artifacts. They often enact a variety of social roles in the process of helping their communities and the public at large, which strongly influence the amount and types of work they do, and how they coordinate their activities. Better understanding members’ roles benefits members by clarifying how they should behave to participate effectively and also benefits the community overall by encouraging members to contribute in ways that best use their skills and interests. Social sciences have provided rich theoretic taxonomies of social roles within groups, while natural language processing techniques enable us to automate the identification of social roles in online communities. However, most social science work has focused on generic roles without accommo- dating the activities associated with tasks in specific contexts or automat- ing the process of role identification.
    [Show full text]
  • A Systematic Review of Scholarly Research on Wikipedia
    Downloaded from orbit.dtu.dk on: Oct 06, 2021 The People’s Encyclopedia Under the Gaze of the Sages: A Systematic Review of Scholarly Research on Wikipedia Okoli, Chitu; Mehdi, Mohamad; Mesgari, Mostafa; Nielsen, Finn Årup; Lanamäki, Arto Link to article, DOI: 10.2139/ssrn.2021326 Publication date: 2012 Document Version Publisher's PDF, also known as Version of record Link back to DTU Orbit Citation (APA): Okoli, C., Mehdi, M., Mesgari, M., Nielsen, F. Å., & Lanamäki, A. (2012). The People’s Encyclopedia Under the Gaze of the Sages: A Systematic Review of Scholarly Research on Wikipedia. https://doi.org/10.2139/ssrn.2021326 General rights Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights. Users may download and print one copy of any publication from the public portal for the purpose of private study or research. You may not further distribute the material or use it for any profit-making activity or commercial gain You may freely distribute the URL identifying the publication in the public portal If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately and investigate your claim. The people’s encyclopedia under the gaze of the sages: A systematic review of scholarly research on Wikipedia Chitu Okoli John Molson School of Business,
    [Show full text]