WikiBrain: Democratizing computation on Wikipedia
Shilad Sen (Macalester College, St. Paul, Minnesota) [email protected]
Toby Jia-Jun Li (University of Minnesota, Minneapolis, Minnesota) [email protected]
Brent Hecht (University of Minnesota, Minneapolis, Minnesota) [email protected]
WikiBrain Team: Matt Lesicko, Ari Weiland, Rebecca Gold, Yulun Li, Benjamin Hillmann (Macalester College, St. Paul, Minnesota) [email protected], [email protected], [email protected], [email protected], [email protected]

ABSTRACT

Wikipedia is known for serving humans' informational needs. Over the past decade, the encyclopedic knowledge encoded in Wikipedia has also powerfully served computer systems. Leading algorithms in artificial intelligence, natural language processing, data mining, geographic information science, and many other fields analyze the text and structure of articles to build computational models of the world.

Many software packages extract knowledge from Wikipedia. However, existing tools either (1) provide Wikipedia data, but not well-known Wikipedia-based algorithms, or (2) narrowly focus on one such algorithm.

This paper presents the WikiBrain software framework, an extensible Java-based platform that democratizes access to a range of Wikipedia-based algorithms and technologies. WikiBrain provides simple access to the diverse Wikipedia data needed for semantic algorithms and technologies, ranging from page views to Wikidata. In a few lines of code, a developer can use WikiBrain to access Wikipedia data and state-of-the-art algorithms. WikiBrain also enables researchers to extend Wikipedia-based algorithms and evaluate their extensions. WikiBrain promotes a new vision of the Wikipedia software ecosystem: every researcher and developer should have access to state-of-the-art Wikipedia-based technologies.

1. INTRODUCTION

In 2004, Jimmy Wales articulated his audacious vision for Wikipedia: "Imagine a world in which every single person on the planet is given free access to the sum of all human knowledge" [31]. In the decade since Wales' declaration, Wikipedia has raced towards this goal, growing in articles, revisions, and gigabytes by a factor of 20 (http://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia). The planet has taken notice. Humans turn to Wikipedia's 31 million articles more than any other online reference. The online encyclopedia's 287 languages attract a global audience that makes it the sixth most visited site on the web (http://www.alexa.com/topsites).

However, outside the public eye, computers have also come to rely on Wikipedia. Algorithms across many fields extract meaning from the encyclopedia by analyzing articles' text, links, titles, and category structure. Search engines such as Google and Bing respond to user queries by extracting structured knowledge from Wikipedia (http://googleblog.blogspot.com/2012/05/introducing-knowledge-graph-things-not.html). Leading semantic relatedness (SR) algorithms estimate the strength of association between words by mining Wikipedia's text, link structure [32], and concept structure [5]. Geospatial technologies extract geographic information from Wikipedia to understand the world around us in entirely new ways (e.g. [22, 20]).

Though powerful, these Wikipedia-centric algorithms present serious engineering obstacles. Downloading, parsing, and saving hundreds of gigabytes of database content requires careful software engineering. Extracting structure from Wikipedia requires computers to reconcile the messy, human process of article creation and revision. Finally, determining exactly what information an application needs, whether Wikipedia provides that information, and how that information can be efficiently delivered taxes software designers.

These engineering challenges stand as a barrier to developers of intelligent applications and researchers of Wikipedia-based algorithms.
While large software companies such as Google and Microsoft extensively use Wikipedia-mined knowledge, smaller companies with limited engineering resources have a much harder time leveraging it. Researchers mining Wikipedia face similar engineering challenges. Instead of focusing on innovative contributions, they frequently spend hours reinventing the Wikipedia data processing wheel or wrangling a complex Wikipedia analysis framework to meet their needs. This situation has created a fragmented Wikipedia software ecosystem in which it is extremely difficult to reproduce results and extend Wikipedia-based technologies.

This paper introduces WikiBrain, a software library that democratizes access to state-of-the-art Wikipedia-based algorithms and techniques. In a few minutes, a developer can add the WikiBrain Java library and run a single program that downloads, parses, and saves Wikipedia data on commodity hardware. WikiBrain provides a robust, fast API to access and query article text and structured metadata such as links, categories, pageview analytics, and Wikidata [49]. WikiBrain also makes three state-of-the-art research advances available to every developer and researcher as API "primitives": a multilingual concept network (Section 4), semantic relatedness algorithms (Section 5), and geospatial data integration (Section 6). In doing so, WikiBrain empowers developers to create rich intelligent applications and enables researchers to focus their efforts on creating and evaluating robust, innovative, reproducible software.
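To make this workflow concrete, the listing below sketches the load-then-query pattern just described. It is a minimal illustration, not a definitive reference: the class and method names (EnvBuilder, Configurator, LocalPageDao, getByTitle) and package paths follow WikiBrain's documented environment/DAO pattern, but should be checked against the version of the library in use.

    import org.wikibrain.conf.Configurator;
    import org.wikibrain.core.cmd.Env;
    import org.wikibrain.core.cmd.EnvBuilder;
    import org.wikibrain.core.dao.LocalPageDao;
    import org.wikibrain.core.lang.Language;
    import org.wikibrain.core.model.LocalPage;

    public class WikiBrainHelloWorld {
        public static void main(String[] args) throws Exception {
            // Assumes the single loader program has already been run;
            // it downloads, parses, and saves a Wikipedia language
            // edition (e.g. Simple English) on commodity hardware.

            // Build a WikiBrain environment from the default configuration.
            Env env = new EnvBuilder().build();
            Configurator conf = env.getConfigurator();

            // Obtain a data access object (DAO) for locally stored articles.
            LocalPageDao pageDao = conf.get(LocalPageDao.class);

            // Query structured metadata: resolve an article by title.
            Language simple = Language.getByLangCode("simple");
            LocalPage page = pageDao.getByTitle(simple, "Minnesota");
            System.out.println("Resolved article: " + page);
        }
    }

If this pattern holds, the same Configurator hands back DAOs for links, categories, page views, and Wikidata, which is what keeps typical client programs to a few lines of code.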
This paper describes WikiBrain, compares it to related software libraries, and provides examples of its use. Section 2 begins by reviewing research algorithms that mine Wikipedia and software APIs that support those algorithms. Section 3 discusses the overall design and usage of WikiBrain. We then delve into details on the three "primitives" (Sections 4, 5, and 6). Finally, Section 7 consists of a case study that uses WikiBrain to quantitatively evaluate Tobler's First Law of Geography ("everything is related to everything else, but near things are more related than distant things" [48]), replicating the results of a study by Hecht and Moxley [20] in a few lines of code.

2. RELATED WORK

Wikipedia has attracted the attention of researchers in a wide range of fields. Broadly speaking, Wikipedia research can be split into two categories: work that studies Wikipedia and work that uses Wikipedia as a source of world knowledge. While the former body of literature is large (e.g. [23, 41, 25, 34]), the latter is likely at least an order of magnitude larger.

Researchers have leveraged the unprecedented structured repository of world knowledge in Wikipedia for an enormous variety of tasks. For instance,

The remainder of this section reviews software packages and datasets for processing Wikipedia data. We divide this discussion into two parts: the first is targeted at software and datasets related to basic Wikipedia structures, and the second is focused on software that is specific to a single Wikipedia-based technology. Because the WikiBrain team has a strict policy of not reinventing the wheel whenever possible, WikiBrain leverages some of this existing work in select portions of its functionality. Where this is the case, we describe how WikiBrain was aided by a particular software package or dataset.

2.1 Access to basic data structures

• Wikipedia API: The Wikimedia Foundation provides access to an impressive array of data through a REST API. Accessing this API is substantially slower than accessing well-engineered locally stored versions of the data. However, the Wikipedia API has one important advantage over all other means of accessing Wikipedia data: it is always 100% up to date. To support real-time access to Wikipedia data, WikiBrain has wrappers for the Wikipedia API that allow developers to access the live Wikipedia API just as easily as they can locally stored Wikipedia data; the switch is a single line of code per data structure (see the sketch after this list). They can even easily integrate data from the two sources.

• Java Wikipedia Library (JWPL): JWPL provides access to basic data structures like links and categories. WikiBrain uses some libraries from JWPL to facilitate the parsing of wiki markup.

• DBpedia: DBpedia [1] is a highly influential project that extracts structured information from Wikipedia pages. Its widely-used datasets include basic structures like links and categories, but also an inferred general-purpose ontology, mappings to other data sources (e.g. Freebase, YAGO), and other enriched corpora. WikiBrain connects to DBpedia's dataset of geographic coordinates as described below.

• bliki: Bliki is a "parser library for converting Wikipedia wikitext notation to HTML". Bliki allows for structured access to a variety of wiki markup data structures (e.g.
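The single-line switch mentioned in the Wikipedia API bullet above might look as follows. This is a hedged sketch: it assumes the live wrappers are exposed as an alternate named provider (called "live" here) that the Configurator can select per data structure; the actual provider name and Configurator signature should be verified against WikiBrain's configuration.

    import org.wikibrain.conf.Configurator;
    import org.wikibrain.core.cmd.EnvBuilder;
    import org.wikibrain.core.dao.LocalPageDao;
    import org.wikibrain.core.lang.Language;

    public class LiveApiSwitch {
        public static void main(String[] args) throws Exception {
            Configurator conf = new EnvBuilder().build().getConfigurator();
            Language simple = Language.getByLangCode("simple");

            // Default provider: fast, locally stored data imported by the loader.
            LocalPageDao localDao = conf.get(LocalPageDao.class);

            // The one-line switch: the same DAO interface, now backed by the
            // live Wikipedia API, so results are always 100% up to date.
            // ("live" is an assumed provider name.)
            LocalPageDao liveDao = conf.get(LocalPageDao.class, "live");

            // Client code is identical for both sources, and data from the
            // two can be freely mixed.
            System.out.println(localDao.getByTitle(simple, "Minnesota"));
            System.out.println(liveDao.getByTitle(simple, "Minnesota"));
        }
    }

Keeping the provider choice in configuration rather than in client code is what allows local and live results to be integrated so easily.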