
The LODeXporter: Flexible Generation of Linked Open Data Triples from NLP Frameworks for Automatic Knowledge Base Construction

René Witte and Bahar Sateli
Semantic Software Lab
Concordia University, Montréal, QC, Canada
http://www.semanticsoftware.info

Abstract We present LODeXporter, a novel approach for exporting Natural Language Processing (NLP) results to a graph-based knowledge base, following Linked Open Data (LOD) principles. The rules for transforming NLP entities into Resource Description Framework (RDF) triples are described in a custom mapping language, which is defined in RDF Schema (RDFS) itself, providing a separation of concerns between NLP pipeline engineering and knowledge base engineering. LODeXporter is available as an open source component for the GATE (General Architecture for Text Engineering) framework.

Keywords: Linked Open Data, Automatic Knowledge Base Construction

1. Introduction

Knowledge bases in standardized graph-based formats have become increasingly important in modern applications, such as personal intelligent assistants. Due to the large amount of formalized knowledge required, automatic knowledge base (KB) construction is a crucial step that typically spans multiple information sources (such as structured databases, sensors, and documents). A significant amount of knowledge resides in natural language text and can be automatically extracted through existing or custom-made Natural Language Processing (NLP) components and pipelines. Yet, so far no generic solution existed for exporting the results of an NLP pipeline into a knowledge base.

Our contribution is a novel technique for converting the results of an NLP pipeline into an LOD-compliant KB, based on the Resource Description Framework (RDF) (Cyganiak et al., 2014) and RDF Schema (RDFS) (Brickley and Guha, 2014) W3C standards. The concrete mapping of NLP results to entities in the knowledge base is defined through a mapping language, which is itself written in RDFS. This provides high flexibility, as the same NLP pipeline can drive the generation of triples in different KBs with different vocabularies. It also relieves the NLP engineer from dealing with the technical details of generating correct Uniform Resource Identifiers (URIs), thereby providing an agile solution for bringing NLP results onto the Linked Open Data (LOD) cloud (Heath and Bizer, 2011).

LODeXporter has been implemented based on the General Architecture for Text Engineering (GATE) framework (Cunningham et al., 2013) and is available as open source (LGPLv3) on GitHub.1

1 LODeXporter on GitHub, https://github.com/SemanticSoftwareLab/TextMining-LODeXporter

2. Foundations

In this section, we provide brief background information on the foundations of our approach.

Linked Open Data. Linked Open Data (LOD) (Heath and Bizer, 2011; Wood et al., 2014) is the practice of using semantic web technologies, like RDF, to publish structured, open datasets that can be interlinked and shared automatically between machines. All data within the LOD cloud have unique URIs. When dereferenced, the URIs provide useful, machine-readable information about that entity, as well as links to other related entities on the web of data.

Automatic Knowledge Base Construction. As semantic knowledge bases, triplestores are especially well suited for connecting previously isolated knowledge sources (so-called "information silos"). Towards this end, various solutions have been developed for converting existing sources to a knowledge base, e.g., RDB2RDF for creating RDF data from a relational database.2 Natural language documents are an abundant source of rich semantic knowledge. One of the best known LOD datasets, DBpedia,3 is a large knowledge base that is automatically generated from Wikipedia (Bizer et al., 2009), with several billions of in-bound links from other open datasets. Similarly, custom solutions exist that convert the results of a specific NLP pipeline into a pre-defined knowledge base.

However, so far there existed no generic solution for populating a knowledge base with the results from an NLP pipeline. While it is always possible to develop custom export strategies, this is not a practicable solution in a modern agile data science workflow, where it often becomes necessary to experiment with different knowledge bases and representation vocabularies; typical examples are shared tasks, where the expected format of the knowledge base is prescribed and existing NLP pipelines need to be adapted to a challenge-specific format.

2 Relational Databases to RDF (RDB2RDF), https://www.w3.org/2001/sw/wiki/RDB2RDF
3 DBpedia, http://wiki.dbpedia.org/

3. Design of LODeXporter

We now describe our design decisions behind LODeXporter in detail.

Subject URI: http://semanticsoftware.info/lodexporter/74a79e9f-0bce-411f-a174-18cc84790bf8/Person/33383#GATEPersonMapping

    protocol:               http://
    configurable base URI:  semanticsoftware.info/lodexporter/
    unique run id:          74a79e9f-0bce-411f-a174-18cc84790bf8/
    annotation type:        Person/
    annotation id:          33383#
    export rule:            GATEPersonMapping

Figure 1: Anatomy of a LODeXporter-generated Subject URI

3.1. Requirements

Our high-level vision is the design of a new NLP component that can be added to any existing NLP pipeline, thereby providing functionality to export the text analysis results in linked data format. We derived a number of detailed requirements from this vision:

Scalability (R1). For the population of large knowledge bases, it must be possible to scale out the generation of triples, i.e., running pipelines on multiple (cloud) instances and storing the resulting triples in a networked triplestore. The time required to export the annotations as triples must scale linearly with the size of the documents and the number of triples to be exported.

Separation of Concerns (R2). Apart from adding a new component to an analysis pipeline, no further changes must be required on the NLP side. Thus, we separate the work of a language engineer, developing the NLP pipeline (e.g., to find domain-specific entities), from the work of a knowledge base engineer, who defines the structure of a concrete KB.

Configurability (R3). The export must be dynamically configurable, so that the same NLP pipeline can drive the generation of different triples in the same, or multiple different, knowledge bases. In particular, the vocabularies used in the mapping process must be easily changeable, to support experiments with different knowledge bases and different application scenarios.

LOD Best Practices (R4). The solution must conform to the relevant W3C standards on Linked (Open) Data. This includes the recommended format for the generated triples, as well as the support of standard protocols for communicating with triplestores in a product-independent fashion.

3.2. Mapping NLP Annotations to Triples

A core idea of our approach is that the concrete mapping, from NLP results to knowledge base, is not part of the NLP pipeline itself: Rather, it is externalized in form of a set of mapping rules that are encoded in the knowledge base itself. In other words, only by dynamically connecting an NLP pipeline to a knowledge base is the concrete mapping effected: The same pipeline can thus be used to export different results, using different mapping rules, to multiple knowledge bases. This is a key feature for enabling agile data science workflows, where experiments with different formats can now be easily and transparently set up.

The mapping rules themselves are defined in form of RDF triples as well, based on our mapping vocabulary. We provide mappings for common NLP result types, in particular annotations, annotation features, and annotation relations. Additionally, a number of meta-features about the NLP pipeline itself can be defined for storage in a KB.

URI Generation. One of the core features of LODeXporter is the generation of URIs for LOD triples from text (R4). In designing the URI generation scheme, we had to strike a balance between (a) conforming to LOD best practices, (b) taking into account the NLP source of the URIs, and (c) their usability, in particular for querying knowledge bases that mix NLP-generated knowledge with other sources. Figure 1 shows an example of a subject URI that was generated for a single Person instance in a document. Conforming to LOD best practices, URIs are HTTP-resolvable. A user-definable run-time parameter specifies the base URI (e.g., for a public SPARQL endpoint), which should always be a domain 'owned' by the user. This is followed by a unique run id, which ensures that each new run of LODeXporter on the same text generates new URIs. Generally, exposing implementation details in URIs is discouraged (Wood et al., 2014). However, re-running NLP pipelines (and therefore re-generating triples) is an extremely common occurrence in language engineering – for example, after improving a component, changing a machine learning model, or experimenting with different parameters. Hence, we decided that the triples from different runs of an NLP pipeline must be able to peacefully co-exist in a KB and be easily distinguishable, which is achieved with the generation of this run id. Next, the semantic (annotation) type, as detected by the NLP pipeline, is encoded (here "Person"), followed by its unique annotation id within a document. Finally, the mapping export rule name is added, as it is possible to export the same annotation (with the same type) multiple times, using different export rules (e.g., with different vocabularies).
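For illustration, the following minimal Java sketch assembles a subject URI from the parts shown in Figure 1; the class and method names are illustrative only and are not part of the actual LODeXporter implementation:

    import java.util.UUID;

    public class SubjectUriFactory {
        private final String baseUri;  // user-definable run-time parameter, a domain 'owned' by the user
        private final String runId = UUID.randomUUID().toString();  // fresh for each pipeline run

        public SubjectUriFactory(String baseUri) {
            this.baseUri = baseUri.endsWith("/") ? baseUri : baseUri + "/";
        }

        // Builds <baseUri><runId>/<annotationType>/<annotationId>#<exportRuleName>
        public String create(String annotationType, long annotationId, String exportRuleName) {
            return baseUri + runId + "/" + annotationType + "/" + annotationId + "#" + exportRuleName;
        }

        public static void main(String[] args) {
            SubjectUriFactory factory = new SubjectUriFactory("http://semanticsoftware.info/lodexporter");
            // Prints, e.g., http://semanticsoftware.info/lodexporter/<uuid>/Person/33383#GATEPersonMapping
            System.out.println(factory.create("Person", 33383, "GATEPersonMapping"));
        }
    }

Because the run id is created once per pipeline run, all URIs from one run share it, while a re-run yields a distinguishable set of URIs, as required above.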

3.3. Mapping Language

The central idea of our approach to automatic KB construction from NLP results is to externalize the knowledge about the export process, so that the NLP pipeline can remain unchanged (except for adding the LODeXporter component). This provides the required separation of concerns (R2) and makes it possible to easily reuse existing NLP pipelines for different knowledge bases. The export rules – how NLP results are mapped to triples – are themselves described in form of RDF triples using our new mapping language. Thus, the configuration of how a specific knowledge base has to be populated with triples can be read from the same knowledge base, which provides for an introspective NLP export (R3).

In detail, our mapping language provides constructs for mapping entities (subjects), features (properties and objects), as well as meta-information about the export (e.g., text offsets of entities or the name of the pipeline that generated an entity). A concrete mapping configuration is a set of rules (see Figure 2c for an example). These rules are read by LODeXporter, typically at the end of a pipeline, and triples are then generated according to these rules. Hence, the exact same pipeline can be used to generate different LOD triples, e.g., when facts need to be expressed with different LOD vocabularies for different applications.

Mapping Entities. For each entity that needs to be exported, a corresponding mapping rule specifies the NLP type and its output RDF type. For example, a Person entity detected in a text can be mapped to a KB using the FOAF vocabulary.4 To export all instances of a GATE annotation of type Person, detected in a document, to foaf:Person in a KB, this rule (shown here in Turtle syntax, cf. Figure 2c):5

    ex:GATEPersonMapping a map:Mapping ;
        map:type foaf:Person ;
        map:GATEtype "Person" .

results in the generation of a triple with (1) a subject URI, as shown in Figure 1; (2) a property defining the rdf:type of that triple; and (3) the object URI with the defined type:

    <http://semanticsoftware.info/lodexporter/74a79e9f-0bce-411f-a174-18cc84790bf8/Person/33383#GATEPersonMapping>
        rdf:type foaf:Person .

This mapping approach makes our solution extremely flexible: For example, to change the generation of triples to use the Person Core Vocabulary,6 instead of FOAF, only a change of the mapping rule is required, which is read at export-time. Thus, the designer of a KB is free to experiment with multiple different knowledge representation approaches, without requiring any reconfiguration on the NLP side.

4 Friend-of-a-Friend (FOAF), http://xmlns.com/foaf/spec/
5 The prefix map: refers to our mapping language.
6 Person Core Vocabulary, https://www.w3.org/ns/person
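As an illustration of such a vocabulary swap, only the map:type statement of the rule above would change; in this hypothetical variant, the prefix person: is assumed to denote the W3C Person Core Vocabulary namespace (http://www.w3.org/ns/person#):

    ex:GATEPersonMapping a map:Mapping ;
        map:type person:Person ;   # previously foaf:Person
        map:GATEtype "Person" .

The NLP pipeline itself, and all other mapping rules, remain untouched.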
Mapping Features. Most entities are further described in an NLP process with additional features, e.g., the detected gender for a name. We can map any existing feature of an entity (one subject URI) to different properties and objects, with the same configuration approach. For example, to map a feature gender of a Person, we would define an additional mapping rule to transform it into foaf:gender:

    ex:GATEGenderFeatureMapping a map:Mapping ;
        map:type foaf:gender ;
        map:GATEfeature "gender" .

This rule is then included in the first rule above with an additional triple:

    ex:GATEPersonMapping map:hasMapping ex:GATEGenderFeatureMapping .

resulting in another output triple with the same subject URI:

    <http://semanticsoftware.info/lodexporter/74a79e9f-0bce-411f-a174-18cc84790bf8/Person/33383#GATEPersonMapping>
        foaf:gender "male" .

Additionally, we provide for the export of several meta-information entries of an annotation, such as the start- and end-offsets of the entity in a document, as well as the underlying string representation (surface form).

Mapping Relations. Additionally, we can export some pre-defined relations between entities into triples. In particular, we can map a contains relationship (based on text offsets) to any LOD vocabulary. An example is shown in Figure 2c, where we define that any RhetoricalEntity containing a DBpediaNE results in an additional triple with the two subject URIs, linked by pubo:containsNE (see Figure 2b).

Finally, we provide for exporting a number of meta-information entries about an export process as relations, including the name of the NLP pipeline that generated the annotations, the creation timestamp, the document name, and the corpus name, among others. These are important when the traceability of triples needs to be ensured.

4. Implementation

LODeXporter has been implemented in Java as an NLP component for the GATE (General Architecture for Text Engineering) framework (Cunningham et al., 2013). GATE is a mature, open source, component-based framework, which integrates with a number of well-known NLP libraries, such as Stanford CoreNLP, LingPipe, OpenNLP, and UIMA. This ensures LODeXporter can be used with a large variety of existing NLP applications.

The RDF(S) triples generated by LODeXporter follow the W3C standards and are not tied to any specific implementation. Internally, LODeXporter relies on the Apache Jena7 libraries for generating the RDF graph. Results can be exported in two different modes (R1):

Direct KB Export Mode: The first option is to directly connect the NLP framework with a triplestore. Running the NLP pipeline with LODeXporter will then directly populate the KB with the triples generated from a text. Here, we currently support the Apache TDB storage layer.8

File-Based Export Mode: The second option is to export the triples of a document into a file, based on the W3C standard N-Quads format.9 These triples can then be imported into any triplestore. A typical application scenario in a distributed (e.g., cluster or cloud-based) environment is to upload the generated triples to a triplestore via the W3C SPARQL 1.1 Graph Store HTTP Protocol.10 The files generated by LODeXporter can be used directly with an HTTP POST operation; we tested this extensively with the Apache Jena Fuseki SPARQL server11 and its included s-post script.

The choice between the two export modes depends on the intended application scenario: In cases where a knowledge base is created in batch-mode, based on an existing corpus, the first solution avoids web protocol overhead, but currently only works with TDB-based triple stores and is limited to parallelization within a single JVM (using the "GATE Cloud Paralleliser"12 tool). The second approach is suitable for continuously updating a knowledge base, e.g., based on a stream of incoming documents, and can be easily distributed in a cloud infrastructure. It is also not limited to a specific knowledge base implementation, but requires an accessible SPARQL over HTTP interface.

7 Apache Jena, https://jena.apache.org/
8 Apache TDB, https://jena.apache.org/documentation/tdb/index.html
9 W3C N-Quads, http://www.w3.org/TR/n-quads/
10 SPARQL 1.1 Graph Store HTTP Protocol, http://www.w3.org/TR/sparql11-http-rdf-update/
11 Apache Jena Fuseki, https://jena.apache.org/documentation/fuseki2/
12 The GATE Cloud Paralleliser (GCP), https://gate.ac.uk/gcp/
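To sketch the file-based mode, the following self-contained Apache Jena example writes one document's triples into a named graph and serializes the dataset as N-Quads; the graph URI and subject URI are taken from the examples above, but the code itself is only an illustration, not LODeXporter's actual implementation:

    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import org.apache.jena.query.Dataset;
    import org.apache.jena.query.DatasetFactory;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.riot.Lang;
    import org.apache.jena.riot.RDFDataMgr;
    import org.apache.jena.vocabulary.RDF;

    public class NQuadsExportSketch {
        public static void main(String[] args) throws Exception {
            Dataset dataset = DatasetFactory.create();
            // One named graph per run keeps triples from different runs distinguishable.
            String run = "http://semanticsoftware.info/lodexporter/74a79e9f-0bce-411f-a174-18cc84790bf8/";
            Model model = dataset.getNamedModel(run);
            String subject = run + "Person/33383#GATEPersonMapping";
            model.add(model.createResource(subject), RDF.type,
                      model.createResource("http://xmlns.com/foaf/0.1/Person"));
            try (OutputStream out = new FileOutputStream("document.nq")) {
                RDFDataMgr.write(out, dataset, Lang.NQUADS);  // W3C N-Quads serialization
            }
        }
    }

The resulting document.nq file can then be uploaded to a triplestore, e.g., with an HTTP POST against a SPARQL 1.1 Graph Store endpoint such as the one provided by Fuseki.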

(a) NLP annotations in GATE Developer (screenshot omitted)

(b) LODeXporter output:

    pubo:Doc#789  pubo:hasAnnotation  pubo:RE#123 , pubo:NE#456 .
    pubo:RE#123   rdf:type            sro:Contribution ;
                  cnt:chars           "In this paper we present the Zeeva system as a first prototype..." ;
                  pubo:containsNE     pubo:NE#456 .
    pubo:NE#456   cnt:chars           "prototype" ;
                  rdfs:isDefinedBy    dbpedia:Software_prototyping .

(c) LODeXporter mapping rules example:

    @prefix map:  <...> .
    @prefix pubo: <...> .
    @prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix cnt:  <http://www.w3.org/2011/content#> .
    @prefix sro:  <...> .
    @prefix ex:   <...> .

    ### Annotation Mapping ###
    ex:GATERhetoricalEntity a map:Mapping ;
        map:type sro:RhetoricalElement ;
        map:GATEtype "RhetoricalEntity" ;
        map:hasMapping ex:GATEContentMapping .

    ex:GATEDBpediaNE a map:Mapping ;
        map:type pubo:LinkedNamedEntity ;
        map:GATEtype "DBpediaNE" ;
        map:hasMapping ex:GATEContentMapping ;
        map:hasMapping ex:GATELODRefFeatureMapping .

    ### Feature Mapping ###
    ex:GATEContentMapping a map:Mapping ;
        map:type cnt:chars ;
        map:GATEattribute "content" .

    ex:GATELODRefFeatureMapping a map:Mapping ;
        map:type rdfs:isDefinedBy ;
        map:GATEfeature "URI" .

    ### Relation Mapping ###
    ex:RE_NE_RelationMapping a map:Mapping ;
        map:type pubo:containsNE ;
        map:domain ex:GATERhetoricalEntity ;
        map:range ex:GATEDBpediaNE ;
        map:GATEattribute "contains" .

Figure 2: Example RDF graph (b) generated by LODeXporter from NLP annotations (a), based on the mapping rules shown in (c), for a paper in a semantic publishing application

5. Application

Figure 2 shows an example from a real-world application scenario in Semantic Publishing (Berners-Lee and Hendler, 2001), which aims at enriching scientific publications with machine-readable information in order to explicitly mark up experiments, data, and rhetorical elements in their raw text. While a number of RDFS and OWL ontologies are now available, like SALT (Groza et al., 2007) and the Semantic Publishing And Referencing (SPAR) ontology family (Peroni et al., 2012), a serious bottleneck to their adoption is the vast amount of existing publications, which would be too time-consuming to manually annotate.

Hence, NLP techniques play a crucial role in enabling Semantic Publishing workflows and applications. In (Sateli and Witte, 2015), we show how scientific literature can be transformed into a queryable knowledge base, through an NLP pipeline that detects rhetorical entities and domain concepts in their full-text (Figure 2). The input to our pipeline was a set of 136 open access computer science articles. We processed the full-text of each article to extract various rhetorical entities (e.g., claim or contribution sentences) and further linked each domain concept in the document to its corresponding resource on the LOD cloud. We used the LODeXporter component in our text mining pipeline to transform the detected entities (added to the document in form of GATE annotations) into RDF triples, based on the mapping rules shown in Figure 2c. The complete knowledge base contains 1,086,051 triples.13

A few example triples from the populated knowledge base are presented in Figure 3a, describing a contribution sentence found in a document, as well as a concept (named entity) mentioned in its text. Here, we can see the triples generated for the document and the PUBO vocabulary used to attach the triples describing the rhetorical and named entities to the document. Based on this schema, the knowledge base can then be queried, e.g., to find all documents with their contribution sentences and mentioned topics, using a SPARQL query like the one shown in Figure 3b. The query retrieves all documents that have an annotation of type sro:Contribution (sentence), as well as the named entities that are contained within the sentence boundary, returning (a) their surface form (as they appeared in the document), and (b) their corresponding resource from the DBpedia ontology. An example entry from the result set is illustrated in Figure 3c. Here, the sentence contains the term "ASR", which was grounded to the entry for 'Automatic Speech Recognition' in the DBpedia ontology, demonstrating the power of connecting the NLP results in a knowledge base with external open datasets.

13 Available in N-Quads format at http://www.semanticsoftware.info/semantic-scientific-literature-peerj-2015-supplements
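Such queries can also be run programmatically; the following Apache Jena sketch executes the query from Figure 3b against a local TDB copy of the knowledge base. The TDB directory and the pubo:/sro: namespace URIs are placeholders (the actual namespaces are those used in the mapping configuration):

    import org.apache.jena.query.Dataset;
    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.ResultSetFormatter;
    import org.apache.jena.tdb.TDBFactory;

    public class QueryKbSketch {
        public static void main(String[] args) {
            Dataset dataset = TDBFactory.createDataset("kb-tdb");  // hypothetical TDB directory
            String query =
                "PREFIX pubo: <http://example.org/pubo#>\n" +  // placeholder namespace
                "PREFIX sro:  <http://example.org/sro#>\n" +   // placeholder namespace
                "PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n" +
                "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n" +
                "PREFIX cnt:  <http://www.w3.org/2011/content#>\n" +
                "SELECT DISTINCT ?document ?sentence ?topic ?uri WHERE {\n" +
                "  ?document pubo:hasAnnotation ?rhetoricalEntity .\n" +
                "  ?rhetoricalEntity cnt:chars ?sentence ;\n" +
                "                    rdf:type sro:Contribution ;\n" +
                "                    pubo:containsNE ?namedEntity .\n" +
                "  ?namedEntity rdfs:isDefinedBy ?uri ;\n" +
                "               cnt:chars ?topic .\n" +
                "} ORDER BY ?document";
            try (QueryExecution qe = QueryExecutionFactory.create(query, dataset)) {
                ResultSetFormatter.out(qe.execSelect());  // prints a result table like Figure 3c
            }
        }
    }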

2426 , .

a , ; ; ”We evaluated the system performance using this ASR implementation.”ˆˆ .

a ; ; ”ASR”ˆˆ .

(a) Examples triples in N-Quads format from the knowledge base

    PREFIX pubo: <...>
    PREFIX sro:  <...>
    PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX cnt:  <http://www.w3.org/2011/content#>

    SELECT DISTINCT ?document ?sentence ?topic ?uri WHERE {
      ?document pubo:hasAnnotation ?rhetoricalEntity .
      ?rhetoricalEntity cnt:chars ?sentence .
      ?rhetoricalEntity rdf:type sro:Contribution .
      ?rhetoricalEntity pubo:containsNE ?namedEntity .
      ?namedEntity rdfs:isDefinedBy ?uri .
      ?namedEntity cnt:chars ?topic .
    } ORDER BY ?document

(b) A SPARQL query to find contribution sentences and their domain concepts

    Document  Sentence                                                              Topic  URI
    cs-8      "We evaluated the system performance using this ASR implementation."  "ASR"  dbpedia:Speech_recognition

(c) One example result from the knowledge base

Figure 3: Example triples (a) generated by LODeXporter in a semantic publishing pipeline, the corresponding SPARQL query (b) on the knowledge base, and one example result (c)

6. Related Work

We previously presented OwlExporter (Witte et al., 2010) at LREC 2010, an NLP component for ontology population from text. In contrast with LODeXporter, OwlExporter is focused on the population of Web Ontology Language (OWL) ontologies,14 with Description Logic (DL) reasoning as their primary application. In some respect, LODeXporter is our completely re-designed approach for knowledge export from NLP pipelines, based on the experience we gained from building numerous knowledge-intensive applications. In particular, with LODeXporter we focus on scenarios that were less common when we first designed OwlExporter, more than 15 years ago: large-scale knowledge bases, generated in distributed cloud-based environments; linked data datasets and applications (based on the LOD initiative started in 2007); shared open vocabularies; as well as the proliferation of AI applications, such as intelligent assistants, where NLP techniques form just one of numerous knowledge sources and algorithms. While OwlExporter relied on special NLP annotations driving the ontology population, with LODeXporter we completely externalized the mapping rules, so that the same NLP pipeline can now generate different triples, based on different vocabularies, when connected to various knowledge bases.

NLP2RDF15 is another approach for converting NLP tool output to RDF (Hellmann et al., 2013). However, the goals of NLP2RDF are largely orthogonal to those of LODeXporter: NLP2RDF focuses on the representation of NLP data (such as words, sentences, and strings), with the primary goal of providing an exchange format between different NLP tools. All knowledge is represented through its own NIF (NLP Interchange Format) vocabulary.16 In contrast, LODeXporter focuses on the representation of domain data (e.g., biological entities, company intelligence, financial data), which is represented in domain-specific vocabularies. The resulting knowledge bases are primarily meant for creating LOD datasets for publication and/or for building intelligent applications on top of them. Hence, although LODeXporter and NLP2RDF both generate RDF from NLP frameworks, they are two quite distinct solutions for different application use cases.

14 Web Ontology Language (OWL), https://www.w3.org/OWL/
15 NLP2RDF, see https://site.nlp2rdf.org/
16 NIF 2.0 Core Ontology, see http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core

7. Conclusion

We presented LODeXporter, a novel approach for the generation of linked open data (LOD) compliant triples from text analysis pipelines. LODeXporter provides for a separation of concerns, as the NLP pipeline (developed by a language engineer) is strictly separated from the export of the NLP results into RDF(S) triples (designed by a knowledge engineer). We achieve this separation through a new mapping language, which is itself defined in RDFS and specifies the rules for converting NLP annotations and features into (subject, property, object) triples. This approach supports agile data science workflows (Sateli and Witte, 2016), where existing NLP pipelines can be easily connected with different web vocabularies, for example, to facilitate submissions to shared tasks within competitions.

LODeXporter enables NLP framework users to easily generate knowledge bases in an LOD-compliant format, which can then be shared on the linked open data (LOD) cloud. It also facilitates the development of complex AI applications, such as intelligent assistants, which typically rely on numerous knowledge sources, including text analysis results.

8. Acknowledgements

This work was partially funded by a Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant (DG).

9. Bibliographical References

Berners-Lee, T. and Hendler, J. (2001). Publishing on the semantic web. Nature, 410(6832):1023–1024.
Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., and Hellmann, S. (2009). DBpedia—A crystallization point for the Web of Data. Web Semantics: Science, Services and Agents on the World Wide Web, 7(3):154–165.
Brickley, D. and Guha, R. (2014). RDF Schema 1.1. W3C Recommendation, https://www.w3.org/TR/rdf-schema/.
Cunningham, H., Tablan, V., Roberts, A., and Bontcheva, K. (2013). Getting more out of biomedical documents with GATE's full lifecycle open source text analytics. PLoS Computational Biology, 9(2):e1002854.
Cyganiak, R., Wood, D., and Lanthaler, M. (2014). RDF 1.1 Concepts and Abstract Syntax. W3C Recommendation, https://www.w3.org/TR/rdf11-concepts/.
Groza, T., Handschuh, S., Möller, K., and Decker, S. (2007). SALT – Semantically Annotated LaTeX for Scientific Publications. In The Semantic Web: Research and Applications, LNCS, pages 518–532. Springer.
Heath, T. and Bizer, C. (2011). Linked Data: Evolving the Web into a Global Data Space. Synthesis Lectures on the Semantic Web: Theory and Technology. Morgan & Claypool Publishers.
Hellmann, S., Lehmann, J., Auer, S., and Brümmer, M. (2013). Integrating NLP Using Linked Data. In Proceedings of the 12th International Semantic Web Conference – Part II, ISWC '13, pages 98–113. Springer-Verlag New York, Inc.
Peroni, S., Shotton, D., and Vitali, F. (2012). Scholarly publishing and linked data: describing roles, statuses, temporal and contextual extents. In Proceedings of the 8th International Conference on Semantic Systems, pages 9–16. ACM.
Sateli, B. and Witte, R. (2015). Semantic representation of scientific literature: bringing claims, contributions and named entities onto the Linked Open Data cloud. PeerJ Computer Science, 1(e37), 12/2015.
Sateli, B. and Witte, R. (2016). From Papers to Triples: An Open Source Workflow for Semantic Publishing Experiments. In Semantics, Analytics, Visualisation: Enhancing Scholarly Data (SAVE-SD 2016), Montréal, QC, Canada, 04/2016. Springer.
Witte, R., Khamis, N., and Rilling, J. (2010). Flexible Ontology Population from Text: The OwlExporter. In International Conference on Language Resources and Evaluation (LREC), pages 3845–3850, Valletta, Malta, May 19–21. ELRA.
Wood, D., Zaidman, M., Ruth, L., and Hausenblas, M. (2014). Linked Data: Structured Data on the Web. Manning Publications Co.
