
The LODeXporter: Flexible Generation of Linked Open Data Triples from NLP Frameworks for Automatic Knowledge Base Construction

René Witte and Bahar Sateli
Semantic Software Lab
Concordia University, Montréal, QC, Canada
http://www.semanticsoftware.info

Abstract We present LODeXporter, a novel approach for exporting Natural Language Processing (NLP) results to a graph-based knowledge base, following Linked Open Data (LOD) principles. The rules for transforming NLP entities into Resource Description Framework (RDF) triples are described in a custom mapping language, which is defined in RDF Schema (RDFS) itself, providing a separation of concerns between NLP pipeline engineering and knowledge base engineering. LODeXporter is available as an open source component for the GATE (General Architecture for Text Engineering) framework.

Keywords: Linked Open Data, Automatic Knowledge Base Construction

1. Introduction

Knowledge bases in standardized graph-based formats have become increasingly important in modern applications, such as personal intelligent assistants. Due to the large amount of formalized knowledge required, automatic knowledge base (KB) construction is a crucial step that typically spans multiple information sources (such as structured databases, sensors, and documents). A significant amount of knowledge resides in natural language text and can be automatically extracted through existing or custom-made Natural Language Processing (NLP) components and pipelines. Yet, so far no generic solution existed for exporting the results of an NLP pipeline into a knowledge base.

Our contribution is a novel technique for converting the results of an NLP pipeline into an LOD-compliant KB, based on the Resource Description Framework (RDF) (Cyganiak et al., 2014) and RDF Schema (RDFS) (Brickley and Guha, 2014) W3C standards. The concrete mapping of NLP results to entities in the knowledge base is defined through a mapping language, which is itself written in RDFS. This provides high flexibility, as the same NLP pipeline can drive the generation of triples in different KBs with different vocabularies. It also relieves the NLP engineer from dealing with the technical details of generating correct Uniform Resource Identifiers (URIs), thereby providing an agile solution for bringing NLP results onto the Linked Open Data (LOD) cloud (Heath and Bizer, 2011).

LODeXporter has been implemented based on the General Architecture for Text Engineering (GATE) framework (Cunningham et al., 2013) and is available as open source (LGPLv3) on GitHub.1

1 LODeXporter on GitHub, https://github.com/SemanticSoftwareLab/TextMining-LODeXporter

2. Foundations

In this section, we provide brief background information on the foundations of our approach.

Linked Open Data. Linked Open Data (LOD) (Heath and Bizer, 2011; Wood et al., 2014) is the practice of using semantic web technologies, like RDF, to publish structured, open datasets that can be interlinked and shared automatically between machines. All data within the LOD cloud have unique URIs. When dereferenced, the URIs provide useful, machine-readable information about that entity, as well as links to other related entities on the web of data.

Automatic Knowledge Base Construction. As semantic knowledge bases, triplestores are especially well suited for connecting previously isolated knowledge sources (so-called "information silos"). Towards this end, various solutions have been developed for converting existing sources to a knowledge base, e.g., RDB2RDF for creating RDF data from a relational database.2 Natural language documents are an abundant source of rich semantic knowledge. One of the best known LOD datasets, DBpedia,3 is a large knowledge base that is automatically generated from Wikipedia (Bizer et al., 2009), with several billions of in-bound links from other open datasets. Similarly, custom solutions exist that convert the results of a specific NLP pipeline into a pre-defined knowledge base.

However, so far there existed no generic solution for populating a knowledge base with the results from an NLP pipeline. While it is always possible to develop custom export strategies, this is not a practicable solution in a modern agile data science workflow, where it often becomes necessary to experiment with different knowledge bases and representation vocabularies; typical examples are shared tasks, where the expected format of the knowledge base is prescribed and existing NLP pipelines need to be adapted to a challenge-specific format.

2 Relational Databases to RDF (RDB2RDF), https://www.w3.org/2001/sw/wiki/RDB2RDF
3 DBpedia, http://wiki.dbpedia.org/

3. Design of LODeXporter

We now describe our design decisions behind LODeXporter in detail.

Subject URI: http://semanticsoftware.info/lodexporter/74a79e9f-0bce-411f-a174-18cc84790bf8/Person/33383#GATEPersonMapping

    protocol:               http://
    configurable base URI:  semanticsoftware.info/lodexporter/
    unique run id:          74a79e9f-0bce-411f-a174-18cc84790bf8/
    annotation type:        Person/
    annotation id:          33383#
    export rule:            GATEPersonMapping

Figure 1: Anatomy of a LODeXporter-generated Subject URI

3.1. Requirements

Our high-level vision is the design of a new NLP component that can be added to any existing NLP pipeline, thereby providing functionality to export the text analysis results in linked data format. We derived a number of detailed requirements from this vision:

Scalability (R1). For the population of large knowledge bases, it must be possible to scale out the generation of triples, i.e., running pipelines on multiple (cloud) instances and storing the resulting triples in a networked triplestore. The time required to export the annotations as triples must scale linearly with the size of the documents and the number of triples to be exported.

Separation of Concerns (R2). Apart from adding a new component to an analysis pipeline, no further changes must be required on the NLP side. Thus, we separate the work of a language engineer, developing the NLP pipeline (e.g., to find domain-specific entities), from the work of a knowledge base engineer, who defines the structure of a concrete KB.

Configurability (R3). The export must be dynamically configurable, so that the same NLP pipeline can drive the generation of different triples in the same, or multiple different, knowledge bases. In particular, the vocabularies used in the mapping process must be easily changeable, to support experiments with different knowledge bases and different application scenarios.

LOD Best Practices (R4). The solution must conform to the relevant W3C standards on Linked (Open) Data. This includes the recommended format for the generated triples, as well as the support of standard protocols for communicating with triplestores in a product-independent fashion.

3.2. Mapping NLP Annotations to Triples

A core idea of our approach is that the concrete mapping, from NLP results to knowledge base, is not part of the NLP pipeline itself: Rather, it is externalized in form of a set of mapping rules that are encoded in the knowledge base itself. In other words, only by dynamically connecting an NLP pipeline to a knowledge base is the concrete mapping effected: The same pipeline can thus be used to export different results, using different mapping rules, to multiple knowledge bases. This is a key feature for enabling agile data science workflows, where experiments with different formats can now be easily and transparently set up.

The mapping rules themselves are defined in form of RDF triples as well, based on our mapping vocabulary. We provide mappings for common NLP result types, in particular annotations, annotation features, and annotation relations. Additionally, a number of meta-features about the NLP pipeline itself can be defined for storage in a KB.

URI Generation. One of the core features of LODeXporter is the generation of URIs for LOD triples from text (R4). In designing the URI generation scheme, we had to strike a balance between (a) conforming to LOD best practices, (b) taking into account the NLP source of the URIs, and (c) their usability, in particular for querying knowledge bases that mix NLP-generated knowledge with other sources. Figure 1 shows an example of a subject URI that was generated for a single Person instance in a document. Conforming to LOD best practices, URIs are HTTP-resolvable. A user-definable run-time parameter specifies the base URI (e.g., for a public SPARQL endpoint), which should always be a domain 'owned' by the user. This is followed by a unique run id, which ensures that each new run of LODeXporter on the same text generates new URIs. Generally, exposing implementation details in URIs is discouraged (Wood et al., 2014). However, re-running NLP pipelines (and therefore re-generating triples) is an extremely common occurrence in language engineering – for example, after improving a component, changing a machine learning model, or experimenting with different parameters. Hence, we decided that the triples from different runs of an NLP pipeline must be able to peacefully co-exist in a KB and be easily distinguishable, which is achieved with the generation of this run id. Next, the semantic (annotation) type, as detected by the NLP pipeline, is encoded (here "Person"), followed by its unique annotation id within a document. Finally, the mapping export rule name is added, as it is possible to export the same annotation (with the same type) multiple times, using different export rules (e.g., with different vocabularies).
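For illustration, the following minimal Java sketch assembles a subject URI from the parts shown in Figure 1; the class and method names are illustrative only and are not part of the actual LODeXporter implementation:

    import java.util.UUID;

    public class SubjectUriFactory {
        private final String baseUri;  // user-definable run-time parameter, a domain 'owned' by the user
        private final String runId = UUID.randomUUID().toString();  // fresh for each pipeline run

        public SubjectUriFactory(String baseUri) {
            this.baseUri = baseUri.endsWith("/") ? baseUri : baseUri + "/";
        }

        // Builds <baseUri><runId>/<annotationType>/<annotationId>#<exportRuleName>
        public String create(String annotationType, long annotationId, String exportRuleName) {
            return baseUri + runId + "/" + annotationType + "/" + annotationId + "#" + exportRuleName;
        }

        public static void main(String[] args) {
            SubjectUriFactory factory = new SubjectUriFactory("http://semanticsoftware.info/lodexporter");
            // Prints, e.g., http://semanticsoftware.info/lodexporter/<uuid>/Person/33383#GATEPersonMapping
            System.out.println(factory.create("Person", 33383, "GATEPersonMapping"));
        }
    }

Because the run id is created once per pipeline run, all URIs from one run share it, while a re-run yields a distinguishable set of URIs, as required above.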

3.3. Mapping Language

The central idea of our approach to automatic KB construction from NLP results is to externalize the knowledge about the export process, so that the NLP pipeline can remain unchanged (except for adding the LODeXporter component). This provides the required separation of concerns (R2) and makes it possible to easily reuse existing NLP pipelines for different knowledge bases. The export rules – how NLP results are mapped to triples – are themselves described in form of RDF triples using our new mapping language. Thus, the configuration of how a specific knowledge base has to be populated with triples can be read from the same knowledge base, which provides for an introspective NLP export (R3).

In detail, our mapping language provides constructs for mapping entities (subjects), features (properties and objects), as well as meta-information about the export (e.g., text offsets of entities or the name of the pipeline that generated an entity). A concrete mapping configuration is a set of rules (see Figure 2c for an example). These rules are read by LODeXporter, typically at the end of a pipeline, and triples are then generated according to these rules. Hence, the exact same pipeline can be used to generate different LOD triples, e.g., when facts need to be expressed with different LOD vocabularies for different applications.

Mapping Entities. For each entity that needs to be exported, a corresponding mapping rule specifies the NLP type and its output RDF type. For example, a Person entity detected in a text can be mapped to a KB using the FOAF vocabulary.4 To export all instances of a GATE annotation of type Person, detected in a document, to foaf:Person in a KB, this rule (shown here in Turtle syntax, cf. Figure 2c):5

    ex:GATEPersonMapping a map:Mapping ;
        map:type foaf:Person ;
        map:GATEtype "Person" .

results in the generation of a triple with (1) a subject URI, as shown in Figure 1; (2) a property defining the rdf:type of that triple; and (3) the object URI with the defined type:

    <http://semanticsoftware.info/lodexporter/74a79e9f-0bce-411f-a174-18cc84790bf8/Person/33383#GATEPersonMapping>
        rdf:type foaf:Person .

This mapping approach makes our solution extremely flexible: For example, to change the generation of triples to use the Person Core Vocabulary,6 instead of FOAF, only a change of the mapping rule is required, which is read at export-time. Thus, the designer of a KB is free to experiment with multiple different knowledge representation approaches, without requiring any reconfiguration on the NLP side.

4 Friend-of-a-Friend (FOAF), http://xmlns.com/foaf/spec/
5 The prefix map: refers to our mapping language.
6 Person Core Vocabulary, https://www.w3.org/ns/person
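As an illustration of such a vocabulary swap, only the map:type statement of the rule above would change; in this hypothetical variant, the prefix person: is assumed to denote the W3C Person Core Vocabulary namespace (http://www.w3.org/ns/person#):

    ex:GATEPersonMapping a map:Mapping ;
        map:type person:Person ;   # previously foaf:Person
        map:GATEtype "Person" .

The NLP pipeline itself, and all other mapping rules, remain untouched.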
Mapping Features. Most entities are further described in an NLP process with additional features, e.g., the detected gender for a name. We can map any existing feature of an entity (one subject URI) to different properties and objects, with the same configuration approach. For example, to map a feature gender of a Person, we would define an additional mapping rule to transform it into foaf:gender:

    ex:GATEGenderFeatureMapping a map:Mapping ;
        map:type foaf:gender ;
        map:GATEfeature "gender" .

This rule is then included in the first rule above with an additional triple:

    ex:GATEPersonMapping map:hasMapping ex:GATEGenderFeatureMapping .

resulting in another output triple with the same subject URI:

    <http://semanticsoftware.info/lodexporter/74a79e9f-0bce-411f-a174-18cc84790bf8/Person/33383#GATEPersonMapping>
        foaf:gender "male" .

Additionally, we provide for the export of several meta-information entries of an annotation, such as the start- and end-offsets of the entity in a document, as well as the underlying string representation (surface form).

Mapping Relations. Additionally, we can export some pre-defined relations between entities into triples. In particular, we can map a contains relationship (based on text offsets) to any LOD vocabulary. An example is shown in Figure 2c, where we define that any RhetoricalEntity containing a DBpediaNE results in an additional triple with the two subject URIs, linked by pubo:containsNE (see Figure 2b).

Finally, we provide for exporting a number of meta-information entries about an export process as relations, including the name of the NLP pipeline that generated the annotations, the creation timestamp, the document name, and the corpus name, among others. These are important when the traceability of triples needs to be ensured.

4. Implementation

LODeXporter has been implemented in Java as an NLP component for the GATE (General Architecture for Text Engineering) framework (Cunningham et al., 2013). GATE is a mature, open source, component-based framework, which integrates with a number of well-known NLP libraries, such as Stanford CoreNLP, LingPipe, OpenNLP, and UIMA. This ensures LODeXporter can be used with a large variety of existing NLP applications.

The RDF(S) triples generated by LODeXporter follow the W3C standards and are not tied to any specific implementation. Internally, LODeXporter relies on the Apache Jena7 libraries for generating the RDF graph. Results can be exported in two different modes (R1):

Direct KB Export Mode: The first option is to directly connect the NLP framework with a triplestore. Running the NLP pipeline with LODeXporter will then directly populate the KB with the triples generated from a text. Here, we currently support the Apache TDB storage layer.8

File-Based Export Mode: The second option is to export the triples of a document into a file, based on the W3C standard N-Quads format.9 These triples can then be imported into any triplestore. A typical application scenario in a distributed (e.g., cluster or cloud-based) environment is to upload the generated triples to a triplestore via the W3C SPARQL 1.1 Graph Store HTTP Protocol.10 The files generated by LODeXporter can be used directly with an HTTP POST operation; we tested this extensively with the Apache Jena Fuseki SPARQL server11 and its included s-post script.

The choice between the two export modes depends on the intended application scenario: In cases where a knowledge base is created in batch-mode, based on an existing corpus, the first solution avoids web protocol overhead, but currently only works with TDB-based triple stores and is limited to parallelization within a single JVM (using the "GATE Cloud Paralleliser"12 tool). The second approach is suitable for continuously updating a knowledge base, e.g., based on a stream of incoming documents, and can be easily distributed in a cloud infrastructure. It is also not limited to a specific knowledge base implementation, but requires an accessible SPARQL over HTTP interface.

7 Apache Jena, https://jena.apache.org/
8 Apache TDB, https://jena.apache.org/documentation/tdb/index.html
9 W3C N-Quads, http://www.w3.org/TR/n-quads/
10 SPARQL 1.1 Graph Store HTTP Protocol, http://www.w3.org/TR/sparql11-http-rdf-update/
11 Apache Jena Fuseki, https://jena.apache.org/documentation/fuseki2/
12 The GATE Cloud Paralleliser (GCP), https://gate.ac.uk/gcp/
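To sketch the file-based mode, the following self-contained Apache Jena example writes one document's triples into a named graph and serializes the dataset as N-Quads; the graph URI and subject URI are taken from the examples above, but the code itself is only an illustration, not LODeXporter's actual implementation:

    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import org.apache.jena.query.Dataset;
    import org.apache.jena.query.DatasetFactory;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.riot.Lang;
    import org.apache.jena.riot.RDFDataMgr;
    import org.apache.jena.vocabulary.RDF;

    public class NQuadsExportSketch {
        public static void main(String[] args) throws Exception {
            Dataset dataset = DatasetFactory.create();
            // One named graph per run keeps triples from different runs distinguishable.
            String run = "http://semanticsoftware.info/lodexporter/74a79e9f-0bce-411f-a174-18cc84790bf8/";
            Model model = dataset.getNamedModel(run);
            String subject = run + "Person/33383#GATEPersonMapping";
            model.add(model.createResource(subject), RDF.type,
                      model.createResource("http://xmlns.com/foaf/0.1/Person"));
            try (OutputStream out = new FileOutputStream("document.nq")) {
                RDFDataMgr.write(out, dataset, Lang.NQUADS);  // W3C N-Quads serialization
            }
        }
    }

The resulting document.nq file can then be uploaded to a triplestore, e.g., with an HTTP POST against a SPARQL 1.1 Graph Store endpoint such as the one provided by Fuseki.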

(a) NLP annotations in GATE Developer (screenshot omitted)

(b) LODeXporter output:

    pubo:Doc#789  pubo:hasAnnotation  pubo:RE#123 , pubo:NE#456 .
    pubo:RE#123   rdf:type            sro:Contribution ;
                  cnt:chars           "In this paper we present the Zeeva system as a first prototype..." ;
                  pubo:containsNE     pubo:NE#456 .
    pubo:NE#456   cnt:chars           "prototype" ;
                  rdfs:isDefinedBy    dbpedia:Software_prototyping .

(c) LODeXporter mapping rules example:

    @prefix map:  <...> .
    @prefix pubo: <...> .
    @prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix cnt:  <http://www.w3.org/2011/content#> .
    @prefix sro:  <...> .
    @prefix ex:   <...> .

    ### Annotation Mapping ###
    ex:GATERhetoricalEntity a map:Mapping ;
        map:type sro:RhetoricalElement ;
        map:GATEtype "RhetoricalEntity" ;
        map:hasMapping ex:GATEContentMapping .

    ex:GATEDBpediaNE a map:Mapping ;
        map:type pubo:LinkedNamedEntity ;
        map:GATEtype "DBpediaNE" ;
        map:hasMapping ex:GATEContentMapping ;
        map:hasMapping ex:GATELODRefFeatureMapping .

    ### Feature Mapping ###
    ex:GATEContentMapping a map:Mapping ;
        map:type cnt:chars ;
        map:GATEattribute "content" .

    ex:GATELODRefFeatureMapping a map:Mapping ;
        map:type rdfs:isDefinedBy ;
        map:GATEfeature "URI" .

    ### Relation Mapping ###
    ex:RE_NE_RelationMapping a map:Mapping ;
        map:type pubo:containsNE ;
        map:domain ex:GATERhetoricalEntity ;
        map:range ex:GATEDBpediaNE ;
        map:GATEattribute "contains" .

Figure 2: Example RDF graph (b) generated by LODeXporter from NLP annotations (a), based on the mapping rules shown in (c), for a paper in a semantic publishing application

5. Application

Figure 2 shows an example from a real-world application scenario in Semantic Publishing (Berners-Lee and Hendler, 2001), which aims at enriching scientific publications with machine-readable information in order to explicitly mark up experiments, data, and rhetorical elements in their raw text. While a number of RDFS and OWL ontologies are now available, like SALT (Groza et al., 2007) and the Semantic Publishing And Referencing (SPAR) ontology family (Peroni et al., 2012), a serious bottleneck to their adoption is the vast amount of existing publications, which would be too time-consuming to manually annotate.

Hence, NLP techniques play a crucial role in enabling Semantic Publishing workflows and applications. In (Sateli and Witte, 2015), we show how scientific literature can be transformed into a queryable knowledge base, through an NLP pipeline that detects rhetorical entities and domain concepts in their full-text (Figure 2). The input to our pipeline was a set of 136 open access computer science articles. We processed the full-text of each article to extract various rhetorical entities (e.g., claim or contribution sentences) and further linked each domain concept in the document to its corresponding resource on the LOD cloud. We used the LODeXporter component in our text mining pipeline to transform the detected entities (added to the document in form of GATE annotations) into RDF triples, based on the mapping rules shown in Figure 2c. The complete knowledge base contains 1,086,051 triples.13

A few example triples from the populated knowledge base are presented in Figure 3a, describing a contribution sentence found in a document, as well as a concept (named entity) mentioned in its text. Here, we can see the triples generated for the document and the PUBO vocabulary used to attach the triples describing the rhetorical and named entities to the document. Based on this schema, the knowledge base can then be queried, e.g., to find all documents with their contribution sentences and mentioned topics, using a SPARQL query like the one shown in Figure 3b. The query retrieves all documents that have an annotation of type sro:Contribution (sentence), as well as the named entities that are contained within the sentence boundary, returning (a) their surface form (as they appeared in the document), and (b) their corresponding resource from the DBpedia ontology. An example entry from the result set is illustrated in Figure 3c. Here, the sentence contains the term "ASR", which was grounded to the entry for 'Automatic Speech Recognition' in the DBpedia ontology, demonstrating the power of connecting the NLP results in a knowledge base with external open datasets.

13 Available in N-Quads format at http://www.semanticsoftware.info/semantic-scientific-literature-peerj-2015-supplements
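Such queries can also be run programmatically; the following Apache Jena sketch executes the query from Figure 3b against a local TDB copy of the knowledge base. The TDB directory and the pubo:/sro: namespace URIs are placeholders (the actual namespaces are those used in the mapping configuration):

    import org.apache.jena.query.Dataset;
    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.ResultSetFormatter;
    import org.apache.jena.tdb.TDBFactory;

    public class QueryKbSketch {
        public static void main(String[] args) {
            Dataset dataset = TDBFactory.createDataset("kb-tdb");  // hypothetical TDB directory
            String query =
                "PREFIX pubo: <http://example.org/pubo#>\n" +  // placeholder namespace
                "PREFIX sro:  <http://example.org/sro#>\n" +   // placeholder namespace
                "PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n" +
                "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n" +
                "PREFIX cnt:  <http://www.w3.org/2011/content#>\n" +
                "SELECT DISTINCT ?document ?sentence ?topic ?uri WHERE {\n" +
                "  ?document pubo:hasAnnotation ?rhetoricalEntity .\n" +
                "  ?rhetoricalEntity cnt:chars ?sentence ;\n" +
                "                    rdf:type sro:Contribution ;\n" +
                "                    pubo:containsNE ?namedEntity .\n" +
                "  ?namedEntity rdfs:isDefinedBy ?uri ;\n" +
                "               cnt:chars ?topic .\n" +
                "} ORDER BY ?document";
            try (QueryExecution qe = QueryExecutionFactory.create(query, dataset)) {
                ResultSetFormatter.out(qe.execSelect());  // prints a result table like Figure 3c
            }
        }
    }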

2426 , .

a , ; ; ”We evaluated the system performance using this ASR implementation.”ˆˆ .

a ; ; ”ASR”ˆˆ .

(a) Examples triples in N-Quads format from the knowledge base

    PREFIX pubo: <...>
    PREFIX sro:  <...>
    PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX cnt:  <http://www.w3.org/2011/content#>

    SELECT DISTINCT ?document ?sentence ?topic ?uri WHERE {
      ?document pubo:hasAnnotation ?rhetoricalEntity .
      ?rhetoricalEntity cnt:chars ?sentence .
      ?rhetoricalEntity rdf:type sro:Contribution .
      ?rhetoricalEntity pubo:containsNE ?namedEntity .
      ?namedEntity rdfs:isDefinedBy ?uri .
      ?namedEntity cnt:chars ?topic .
    } ORDER BY ?document

(b) A SPARQL query to find contribution sentences and their domain concepts

    Document  Sentence                                                              Topic  URI
    cs-8      "We evaluated the system performance using this ASR implementation."  "ASR"  dbpedia:Speech_recognition

(c) One example result from the knowledge base

Figure 3: Example triples (a) generated by LODeXporter in a semantic publishing pipeline, the corresponding SPARQL query (b) on the knowledge base, and one example result (c)

6. Related Work

We previously presented OwlExporter (Witte et al., 2010) at LREC 2010, an NLP component for ontology population from text. In contrast with LODeXporter, OwlExporter is focused on the population of Web Ontology Language (OWL) ontologies,14 with Description Logic (DL) reasoning as their primary application. In some respect, LODeXporter is our completely re-designed approach for knowledge export from NLP pipelines, based on the experience we gained from building numerous knowledge-intensive applications. In particular, with LODeXporter we focus on scenarios that were less common when we first designed OwlExporter, more than 15 years ago: large-scale knowledge bases, generated in distributed cloud-based environments; linked data datasets and applications (based on the LOD initiative started in 2007); shared open vocabularies; as well as the proliferation of AI applications, such as intelligent assistants, where NLP techniques form just one of numerous knowledge sources and algorithms. While OwlExporter relied on special NLP annotations driving the ontology population, with LODeXporter we completely externalized the mapping rules, so that the same NLP pipeline can now generate different triples, based on different vocabularies, when connected to various knowledge bases.

NLP2RDF15 is another approach for converting NLP tool output to RDF (Hellmann et al., 2013). However, the goals of NLP2RDF are largely orthogonal to those of LODeXporter: NLP2RDF focuses on the representation of NLP data (such as words, sentences, and strings), with the primary goal of providing an exchange format between different NLP tools. All knowledge is represented through its own NIF (NLP Interchange Format) vocabulary.16 In contrast, LODeXporter focuses on the representation of domain data (e.g., biological entities, company intelligence, financial data), which is represented in domain-specific vocabularies. The resulting knowledge bases are primarily meant for creating LOD datasets for publication and/or for building intelligent applications on top of them. Hence, although LODeXporter and NLP2RDF both generate RDF from NLP frameworks, they are two quite distinct solutions for different application use cases.

14 Web Ontology Language (OWL), https://www.w3.org/OWL/
15 NLP2RDF, see https://site.nlp2rdf.org/
16 NIF 2.0 Core Ontology, see http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core

7. Conclusion

We presented LODeXporter, a novel approach for the generation of linked open data (LOD) compliant triples from text analysis pipelines. LODeXporter provides for a separation of concerns, as the NLP pipeline (developed by a language engineer) is strictly separated from the export of the NLP results into RDF(S) triples (designed by a knowledge engineer). We achieve this separation through a new mapping language, which is itself defined in RDFS and specifies the rules for converting NLP annotations and features into (subject, property, object) triples. This approach supports agile data science workflows (Sateli and Witte, 2016), where existing NLP pipelines can be easily connected with different web vocabularies, for example, to facilitate submissions to shared tasks within competitions.

LODeXporter enables NLP framework users to easily generate knowledge bases in an LOD-compliant format, which can then be shared on the linked open data (LOD) cloud. It also facilitates the development of complex AI applications, such as intelligent assistants, which typically rely on numerous knowledge sources, including text analysis results.

8. Acknowledgements

This work was partially funded by a Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant (DG).

9. Bibliographical References

Berners-Lee, T. and Hendler, J. (2001). Publishing on the semantic web. Nature, 410(6832):1023–1024.
Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., and Hellmann, S. (2009). DBpedia—A crystallization point for the Web of Data. Web Semantics: Science, Services and Agents on the World Wide Web, 7(3):154–165.
Brickley, D. and Guha, R. (2014). RDF Schema 1.1. W3C Recommendation, https://www.w3.org/TR/rdf-schema/.
Cunningham, H., Tablan, V., Roberts, A., and Bontcheva, K. (2013). Getting more out of biomedical documents with GATE's full lifecycle open source text analytics. PLoS Computational Biology, 9(2):e1002854.
Cyganiak, R., Wood, D., and Lanthaler, M. (2014). RDF 1.1 Concepts and Abstract Syntax. W3C Recommendation, https://www.w3.org/TR/rdf11-concepts/.
Groza, T., Handschuh, S., Möller, K., and Decker, S. (2007). SALT – Semantically Annotated LaTeX for Scientific Publications. In The Semantic Web: Research and Applications, LNCS, pages 518–532. Springer.
Heath, T. and Bizer, C. (2011). Linked Data: Evolving the Web into a Global Data Space. Synthesis Lectures on the Semantic Web: Theory and Technology. Morgan & Claypool Publishers.
Hellmann, S., Lehmann, J., Auer, S., and Brümmer, M. (2013). Integrating NLP Using Linked Data. In Proceedings of the 12th International Semantic Web Conference – Part II, ISWC '13, pages 98–113. Springer-Verlag New York, Inc.
Peroni, S., Shotton, D., and Vitali, F. (2012). Scholarly publishing and linked data: describing roles, statuses, temporal and contextual extents. In Proceedings of the 8th International Conference on Semantic Systems, pages 9–16. ACM.
Sateli, B. and Witte, R. (2015). Semantic representation of scientific literature: bringing claims, contributions and named entities onto the Linked Open Data cloud. PeerJ Computer Science, 1(e37), 12/2015.
Sateli, B. and Witte, R. (2016). From Papers to Triples: An Open Source Workflow for Semantic Publishing Experiments. In Semantics, Analytics, Visualisation: Enhancing Scholarly Data (SAVE-SD 2016), Montréal, QC, Canada, 04/2016. Springer.
Witte, R., Khamis, N., and Rilling, J. (2010). Flexible Ontology Population from Text: The OwlExporter. In International Conference on Language Resources and Evaluation (LREC), pages 3845–3850, Valletta, Malta, May 19–21. ELRA.
Wood, D., Zaidman, M., Ruth, L., and Hausenblas, M. (2014). Linked Data: Structured Data on the Web. Manning Publications Co.
