Deliverable 3.2.2 DBpedia-Live Extraction
Collaborative Project
LOD2 - Creating Knowledge out of Interlinked Data

Project Number: 257943
Start Date of Project: 01/09/2010
Duration: 48 months

Deliverable 3.2.2
DBpedia-Live Extraction

Dissemination Level: Public
Due Date of Deliverable: Month 8, 30/4/2011
Actual Submission Date: 7/6/2011
Work Package: WP3, Knowledge Base Creation, Enrichment and Repair
Task: T3.2
Type: Report
Approval Status: Approved
Version: 1.0
Number of Pages: 20
Filename: deliverable-3.2.2.pdf

Abstract: DBpedia is the Semantic Web mirror of Wikipedia. Wikipedia users constantly revise Wikipedia articles, with changes occurring almost every second. Hence, the data stored in the DBpedia triplestore can quickly become outdated, and Wikipedia articles need to be re-extracted. DBpedia-Live, the result of this deliverable, enables such a continuous synchronization between DBpedia and Wikipedia.

The information in this document reflects only the author's views and the European Community is not liable for any use that may be made of the information contained therein. The information in this document is provided "as is" without guarantee or warranty of any kind, express or implied, including but not limited to the fitness of the information for a particular purpose. The user thereof uses the information at his/her sole risk and liability.

Project funded by the European Commission within the Seventh Framework Programme (2007–2013)

History
Version  Date        Reason                            Revised by
0.1      2011-05-16  Initial version                   Jens Lehmann
0.5      2011-05-18  First deliverable version         Mohamed Morsey
1.0      2011-05-23  Deliverable revised and extended  Jens Lehmann

Author list
Organisation  Name            Contact Information
ULEI          Mohamed Morsey  [email protected] leipzig.de
ULEI          Jens Lehmann    [email protected] leipzig.de

Executive summary

This deliverable contains a brief description of DBpedia-Live, the synchronisation framework for aligning Wikipedia and DBpedia. D3.2.2 is a software prototype deliverable. We have made all source code openly available in the DBpedia code repository, provide a public SPARQL endpoint, and give access to all changes performed by DBpedia-Live.

Contents

1 Introduction
2 Overview
3 Live Extraction Framework
  3.1 General System Architecture
  3.2 Extraction Manager
4 Wiki-Based Ontology Engineering
5 New Features
  5.1 Abstract Extraction
  5.2 Mapping-Affected Pages
  5.3 Unmodified Pages
  5.4 Changesets
  5.5 Important Pointers
Bibliography

1 Introduction

DBpedia is the result of a community effort to extract structured information from Wikipedia, which in turn is the largest online encyclopedia and currently the 7th most visited website according to alexa.com. Over the past four years, the DBpedia knowledge base has turned into a crystallization point for the emerging Web of Data. Several tools have been built on top of it, e.g. DBpedia Mobile (http://beckr.org/DBpediaMobile/), Query Builder (http://querybuilder.dbpedia.org), Relation Finder [6], and Navigator (http://navigator.dbpedia.org). It is used in a variety of applications, for instance Muddy Boots, Open Calais, Faviki, Zemanta, LODr, and TopBraid Composer (cf. [5]).

Despite this success, a disadvantage of DBpedia has been its heavy-weight release process. Producing a release requires manual effort and, since dumps of the Wikipedia database are created on a monthly basis, DBpedia has never reflected the current state of Wikipedia. In this deliverable, we present a live extraction framework which allows DBpedia to be up-to-date with a minimal delay of only a few minutes.
The main motivation behind this enhancement is that our approach turns DBpedia into a real-time editable knowledge base, while retaining the tight coupling to Wikipedia. It also opens the door for an increased use of DBpedia in different scenarios. For instance, a user may want to add a list of the highest-grossing movies produced in the current year to her movie website. Due to the live extraction process, this becomes much more appealing, since the contained information will be as up-to-date as Wikipedia instead of being several months behind.

Overall, we make the following contributions:

• migration of the previous, incomplete PHP-based DBpedia-Live framework to the new Java-based framework, which also maintains up-to-date information,
• addition of abstract extraction capability,
• re-extraction of mapping-affected pages,
• flexible low-priority re-extraction of pages which have not been modified for a longer period of time; this allows changes in the underlying extraction framework, which potentially affect all Wikipedia pages, while still processing the most recent Wikipedia changes,
• publication of added and deleted triples as compressed N-Triples files,
• a synchronization tool which downloads those compressed N-Triples files and updates another triplestore with them accordingly, in order to keep it in synchronization with ours,
• deployment of the framework on our server.

2 Overview

The core of DBpedia consists of an infobox extraction process, which was first described in [2]. Infoboxes are templates contained in many Wikipedia articles. They are usually displayed in the top right corner of articles and contain factual information (cf. Figure 2.1). The infobox extractor processes an infobox as follows: the DBpedia URI, which is created from the Wikipedia article URL, is used as the subject. The predicate URI is created by concatenating the namespace fragment http://dbpedia.org/property/ and the name of the infobox attribute. Objects are created from the attribute value. These values are pre-processed and converted to RDF to obtain suitable value representations. For instance, internal MediaWiki links are converted to DBpedia URI references, lists are detected and represented accordingly, units are detected and converted to standard datatypes, etc. Nested templates can also be handled, i.e. the object of an extracted triple can point to another complex structure extracted from a template.
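To make the mapping concrete, the following is a minimal Java sketch of how infobox attribute/value pairs could be turned into RDF triples in the way described above, using a few attributes from the Algarve infobox shown in Figure 2.1 below. The class and method names (InfoboxTripleSketch, toTriples, toObject) are illustrative only and are not part of the actual DBpedia extraction framework; the real value pre-processing (lists, units, datatypes, nested templates) is considerably more involved.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class InfoboxTripleSketch {

    // Namespaces used by the raw infobox extraction as described above.
    static final String RESOURCE_NS = "http://dbpedia.org/resource/";
    static final String PROPERTY_NS = "http://dbpedia.org/property/";

    // Builds N-Triples lines: subject = DBpedia URI of the article,
    // predicate = property namespace + attribute name, object = attribute value.
    static List<String> toTriples(String articleTitle, Map<String, String> infobox) {
        String subject = "<" + RESOURCE_NS + articleTitle.replace(' ', '_') + ">";
        List<String> triples = new ArrayList<>();
        for (Map.Entry<String, String> attribute : infobox.entrySet()) {
            String predicate = "<" + PROPERTY_NS + attribute.getKey() + ">";
            triples.add(subject + " " + predicate + " " + toObject(attribute.getValue()) + " .");
        }
        return triples;
    }

    // Internal MediaWiki links [[Target|Label]] become DBpedia URI references;
    // everything else is emitted as a plain literal (heavily simplified).
    static String toObject(String value) {
        if (value.startsWith("[[") && value.endsWith("]]")) {
            String target = value.substring(2, value.length() - 2).split("\\|")[0];
            return "<" + RESOURCE_NS + target.replace(' ', '_') + ">";
        }
        return "\"" + value.replace("\"", "\\\"") + "\"";
    }

    public static void main(String[] args) {
        Map<String, String> infobox = new LinkedHashMap<>();
        infobox.put("official_name", "Algarve");
        infobox.put("subdivision_name3", "[[Faro, Portugal|Faro]]");
        infobox.put("area_total_km2", "5412");
        toTriples("Algarve", infobox).forEach(System.out::println);
    }
}

Running the main method on these three attributes prints three raw triples in the http://dbpedia.org/property/ namespace, with the internal link mapped to a DBpedia URI reference and the other values emitted as literals.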
{{Infobox settlement
| official_name = Algarve
| settlement_type = Region
| image_map = LocalRegiaoAlgarve.svg
| mapsize = 180px
| map_caption = Map showing Algarve Region in Portugal
| subdivision_type = [[Countries of the world|Country]]
| subdivision_name = {{POR}}
| subdivision_type3 = Capital city
| subdivision_name3 = [[Faro, Portugal|Faro]]
| area_total_km2 = 5412
| population_total = 410000
| timezone = [[Western European Time|WET]]
| utc_offset = +0
| timezone_DST = [[Western European Summer Time|WEST]]
| utc_offset_DST = +1
| blank_name_sec1 = [[NUTS]] code
| blank_info_sec1 = PT15
| blank_name_sec2 = [[GDP]] per capita
| blank_info_sec2 = €19,200 (2006)
}}

Figure 2.1: MediaWiki infobox syntax for Algarve (left) and rendered infobox (right).

Apart from the infobox extraction, the framework currently has 12 extractors which process the following types of Wikipedia content:

• Labels. All Wikipedia articles have a title, which is used as an rdfs:label for the corresponding DBpedia resource.
• Abstracts. We extract a short abstract (first paragraph, represented by using rdfs:comment) and a long abstract (text before a table of contents, using the property dbpedia:abstract) from each article.
• Interlanguage links. We extract links that connect articles about the same topic in different language editions of Wikipedia and use them for assigning labels and abstracts in different languages to DBpedia resources.
• Images. Links pointing at Wikimedia Commons images depicting a resource are extracted and represented by using the foaf:depiction property.
• Redirects. In order to identify synonymous terms, Wikipedia articles can redirect to other articles. We extract these redirects and use them to resolve references between DBpedia resources.
• Disambiguation. Wikipedia disambiguation pages explain the different meanings of homonyms. We extract and represent disambiguation links by using the predicate dbpedia:wikiPageDisambiguates.
• External links. Articles contain references to external Web resources, which we represent by using the DBpedia property dbpedia:wikiPageExternalLink.
• Page links. We extract all links between Wikipedia articles and represent them by using the dbpedia:wikiPageWikiLink property.
• Wiki page. Links a Wikipedia article to its corresponding DBpedia resource, e.g. (<http://en.wikipedia.org/wiki/Germany> <http://xmlns.com/foaf/0.1/primaryTopic> <http://dbpedia.org/resource/Germany> .).
• Homepages. This extractor obtains links to the homepages of entities such as companies and organizations by looking for the terms homepage or website within article links (represented by using foaf:homepage).
• Geo-coordinates. The geo-extractor expresses coordinates by using the Basic Geo (WGS84 lat/long) Vocabulary (http://www.w3.org/2003/01/geo/) and the GeoRSS Simple encoding of the W3C Geospatial Vocabulary. The former expresses latitude and longitude components as separate facts, which allows for simple areal filtering in SPARQL queries.
• Person data. It extracts personal information such as surname and birth date. This information is represented in predicates like foaf:surname and dbpedia:birthDate.
• PND. For each person, there is a record containing his name, birth and occupation connected with a unique identifier, the PND (Personennamendatei) number. PNDs are published by the German national library. A PND is related to its