Building Knowledge Graphs: Marcus Klang

Building Knowledge Graphs: Processing Infrastructure and Named Entity Linking

Marcus Klang

Doctoral Dissertation, 2019
ISBN 978-91-7895-286-1 (printed version)
ISBN 978-91-7895-287-8 (electronic version)
ISSN 1404-1219
LU-CS-DISS: 2019-04
Dissertation 64, 2019

Department of Computer Science
Lund University
Box 118
SE-221 00 Lund
Sweden

Email: [email protected]
WWW: http://cs.lth.se/marcus-klang

Typeset using LaTeX
Printed in Sweden by Tryckeriet i E-huset, Lund, 2019
© 2019 Marcus Klang

Abstract

Things such as organizations, persons, or locations are ubiquitous in all texts circulating on the internet, particularly in the news, forum posts, and social media. Today, there is more written material than any single person can read through during a typical lifespan. Automatic systems can help us amplify our abilities to find relevant information, where, ideally, a system would learn knowledge from our combined written legacy. Ultimately, this would enable us, one day, to build automatic systems that have reasoning capabilities and can answer any question in any human language.

In this work, I explore methods to represent linguistic structures in text, build processing infrastructures, and combine them to process a comprehensive collection of documents. The goal is to extract knowledge from text via things: entities. As text, I focused on encyclopedic resources such as Wikipedia. As knowledge representation, I chose graphs, where the entities correspond to graph nodes. To populate such graphs, I created a named entity linker that can find entities in multiple languages, such as English, Spanish, and Chinese, and associate them with unique identifiers. In addition, I describe a published state-of-the-art Swedish named entity recognizer that finds mentions of entities in text, which I evaluated on the four majority classes in the Stockholm-Umeå Corpus (SUC) 3.0.

To collect the text resources needed for the implementation of the algorithms and the training of the machine-learning models, I also describe a document representation, Docria, that consists of multiple layers of annotations: a model capable of representing structures found in Wikipedia and beyond. Finally, I describe how to construct processing pipelines for large-scale processing of Wikipedia using Docria.
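To make the layered-annotation idea concrete, the sketch below models a document as a text plus named layers of spans, where an entity layer ties mentions to unique knowledge-base identifiers, the structure that lets linked mentions become graph nodes. This is a minimal illustration in Python, not the actual Docria API; the class names, the whitespace tokenizer, and the Wikidata-style identifiers are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """One annotation: a span over the document text plus arbitrary properties."""
    start: int
    end: int
    props: Dict[str, str] = field(default_factory=dict)


@dataclass
class Document:
    """A text with named annotation layers (tokens, sentences, entities, ...)."""
    text: str
    layers: Dict[str, List[Node]] = field(default_factory=dict)

    def add_layer(self, name: str) -> List[Node]:
        return self.layers.setdefault(name, [])

    def spans(self, layer: str) -> List[str]:
        return [self.text[n.start:n.end] for n in self.layers.get(layer, [])]


doc = Document("Lund University is located in Sweden.")

# Token layer: naive whitespace tokenization, stored as offsets into the text.
tokens = doc.add_layer("token")
offset = 0
for word in doc.text.split():
    start = doc.text.index(word, offset)
    tokens.append(Node(start, start + len(word)))
    offset = start + len(word)

# Entity layer: mentions linked to unique identifiers (illustrative Wikidata-style ids).
entities = doc.add_layer("entity")
entities.append(Node(0, 15, {"label": "ORG", "id": "Q218506"}))  # "Lund University"
entities.append(Node(30, 36, {"label": "LOC", "id": "Q34"}))     # "Sweden"

print(doc.spans("token"))
# ['Lund', 'University', 'is', 'located', 'in', 'Sweden.']
print([(doc.text[n.start:n.end], n.props["id"]) for n in doc.layers["entity"]])
# [('Lund University', 'Q218506'), ('Sweden', 'Q34')]
```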
Contents

Preface
Acknowledgements
Popular Science Summary in Swedish
Introduction
  1 Introduction
  2 Natural Language Processing (NLP)
  3 Corpus
  4 Infrastructure
  5 Evaluation
  6 Machine Learning
  7 Data Representation for Machine Learning
  8 Models
  9 Document Database
  10 Named Entity Recognition
  11 Named Entity Linking
  12 Conclusion
Bibliography
Paper I – Named Entity Disambiguation in a Question Answering System
Paper II – WikiParq: A Tabulated Wikipedia Resource Using the Parquet Format
Paper III – Docforia: A Multilayer Document Model
Paper IV – Multilingual Supervision of Semantic Annotation
Paper V – Langforia: Language Pipelines for Annotating Large Collections of Documents
Paper VI – Overview of the Ugglan Entity Discovery and Linking System
Paper VII – Linking, Searching, and Visualizing Entities in Wikipedia
Paper VIII – Comparing LSTM and FOFE-based Architectures for Named Entity Recognition
Paper IX – Docria: Processing and Storing Linguistic Data with Wikipedia
Paper X – Hedwig: A Named Entity Linker

Preface

List of Included Publications

I. Named entity disambiguation in a question answering system.
Marcus Klang and Pierre Nugues.
In Proceedings of The Fifth Swedish Language Technology Conference (SLTC 2014), Uppsala, November 13-14, 2014.

II. WikiParq: A tabulated Wikipedia resource using the Parquet format.
Marcus Klang and Pierre Nugues.
In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pages 4141–4148, Portorož, Slovenia, May 2016.

III. Docforia: A multilayer document model.
Marcus Klang and Pierre Nugues.
In Proceedings of The Sixth Swedish Language Technology Conference (SLTC 2016), Umeå, November 2016.

IV. Multilingual supervision of semantic annotation.
Peter Exner, Marcus Klang, and Pierre Nugues.
In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1007–1017, Osaka, Japan, December 2016. The COLING 2016 Organizing Committee.

V. Langforia: Language pipelines for annotating large collections of documents.
Marcus Klang and Pierre Nugues.
In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, pages 74–78, Osaka, Japan, December 2016. The COLING 2016 Organizing Committee.

VI. Overview of the Ugglan entity discovery and linking system.
Marcus Klang, Firas Dib, and Pierre Nugues.
In Proceedings of the Tenth Text Analysis Conference (TAC 2017), Gaithersburg, Maryland, November 2017.

VII. Linking, searching, and visualizing entities in Wikipedia.
Marcus Klang and Pierre Nugues.
In Nicoletta Calzolari (Conference chair), Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis, and Takenobu Tokunaga, editors, Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), pages 3426–3432, Miyazaki, Japan, May 7-12, 2018. European Language Resources Association (ELRA).

VIII. Comparing LSTM and FOFE-based architectures for named entity recognition.
Marcus Klang and Pierre Nugues.
In Proceedings of The Seventh Swedish Language Technology Conference (SLTC 2018), pages 54–57, Stockholm, October 7-9, 2018.

IX. Docria: Processing and storing linguistic data with Wikipedia.
Marcus Klang and Pierre Nugues.
In Proceedings of the 22nd Nordic Conference on Computational Linguistics, Turku, October 2019.

X. Hedwig: A named entity linker.
Marcus Klang.
To be submitted.

Contribution Statement

Marcus Klang is the main contributor to all the papers included in this doctoral thesis where he is listed as first author. He was the main designer and implementor of the research experiments and responsible for most of the writing.
In the paper Overview of the Ugglan Entity Discovery and Linking System, Firas Dib contributed the FOFE-based named entity recognizer that was used as part of the mention detection system. Peter Exner was the main contributor to the papers A Distant Supervision Approach to Semantic Role Labeling and Multilingual Supervision of Semantic Annotation, with Marcus Klang contributing the named entity linker used to produce part of the input to the system. In the paper Linking, Searching, and Visualizing Entities for the Swedish Wikipedia, Marcus Klang contributed infrastructure tools and resources.

The supervisor, Prof. Pierre Nugues, contributed to the design of the experiments and the writing of the articles, and reviewed the content of the papers.

List of Additional Publications

The following papers are related to, but not included in, this thesis. Specifically, papers XI, XIII, and XIV were succeeded by paper IV.

XI. Using distant supervision to build a proposition bank.
Peter Exner, Marcus Klang, and Pierre Nugues.
In Proceedings of The Fifth Swedish Language Technology Conference (SLTC 2014), Uppsala, November 13-14, 2014.

XII. A platform for named entity disambiguation.
Marcus Klang and Pierre Nugues.
In Proceedings of the Workshop on Semantic Technologies for Research in the Humanities and Social Sciences (STRiX), Gothenburg, November 24-25, 2014.

XIII. A distant supervision approach to semantic role labeling.
Peter Exner, Marcus Klang, and Pierre Nugues.
In Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics, pages 239–248, Denver, Colorado, June 2015. Association for Computational Linguistics.

XIV. Generating a Swedish semantic role labeler.
Peter Exner, Marcus Klang, and Pierre Nugues.
In Proceedings of The Sixth Swedish Language Technology Conference (SLTC 2016), Umeå, November 2016.

XV. Linking, searching, and visualizing entities for the Swedish Wikipedia.
Anton Södergren, Marcus Klang, and Pierre Nugues.
In Proceedings of The Sixth Swedish Language Technology Conference (SLTC 2016), Umeå, November 2016.

XVI. Pairing Wikipedia articles across languages.
Marcus Klang and Pierre Nugues.
In Proceedings of the Open Knowledge Base and Question Answering Workshop (OKBQA 2016), pages 72–76, Osaka, Japan, December 2016. The COLING 2016 Organizing Committee.

Acknowledgements

First, I would like to express my deepest gratitude to my supervisor, Professor Pierre Nugues, for his steadfast support, inspiration, patience, and outstanding ability to bring clarity when you need it the most. I am most grateful that he introduced me to the field of natural language processing.

I also would like to thank my colleagues in the Department of Computer Science for their inspiration, many interesting discussions, and their ability to sustain a wonderful, curious atmosphere that makes it an excellent workplace. Specifically, I want to thank Peter Möller, Lars Nilsson, Anders Bruce, and Tomas Richter