
A Knowledge and Reasoning Toolkit for Cognitive Applications

Mustafa Canim, Cristina Cornelio, Robert Farrell, Achille Fokoue, Kyle Gao, John Gunnels, Arun Iyengar, Ryan Musa, Mariano Rodriguez-Muro, and Rosario Uceda-Sosa
IBM T.J. Watson Research Center, Yorktown Heights, NY

ABSTRACT
This paper presents a knowledge and reasoning toolkit for developing cognitive applications which have significant requirements for managing structured and semi-structured data. Our system provides enhanced querying and reasoning capabilities along with natural language processing support and the ability to automatically extract data from PDF documents. We also have the capability to manage ontologies in a user-friendly way. Our system is implemented as a set of Web services, and we provide enhanced clients to allow applications to easily access our knowledge and reasoning toolkit.

KEYWORDS
Data extraction, knowledge representation, natural language processing, ontology engineering, reasoning, Resource Description Framework (RDF), Web Ontology Language (OWL)

ACM Reference Format:
Mustafa Canim, Cristina Cornelio, Robert Farrell, Achille Fokoue, Kyle Gao, John Gunnels, Arun Iyengar, Ryan Musa, Mariano Rodriguez-Muro, and Rosario Uceda-Sosa. 2017. A Knowledge and Reasoning Toolkit for Cognitive Applications. In Proceedings of HotWeb'17, San Jose / Silicon Valley, CA, USA, October 14, 2017, 10 pages. https://doi.org/10.1145/3132465.3132478

1 INTRODUCTION
There is an increasing need to provide support for cognitive applications which make use of structured and semi-structured data and require advanced querying and reasoning to be performed on the data. The specific requirements for these types of applications vary. One of the key challenges is to develop tools which support cognitive applications in a general way so that they can be reused for a wide variety of customer scenarios.

We are developing a knowledge and reasoning toolkit (KRT) based on real world scenarios which is designed to be used for a broad range of cognitive applications. It is exposed to the user as a set of Web services, allowing applications to easily make use of it via the http protocol. This paper describes our system and also provides considerable information about past work in the area.

The objective of the Knowledge and Reasoning Toolkit (KRT) is to bring content into a structured knowledge representation, to integrate data from heterogeneous sources, and to enable conversational access in support of humans engaging in conducting relevant tasks. Previously, extracting and representing such information from natural language, structured, and semi-structured (table) content has been highly manual, tedious, and error prone. The KRT creates and leverages knowledge extraction, management, and reasoning techniques to reduce the human effort, ensure a principled representation, and support effective and fluent human-machine conversation.
The KRT leverages knowledge and reasoning at different stages of a project's lifetime. At early stages, the KRT uses reasoning to support data integration tasks, i.e., discovering connections within the datasets, proposing initial alignments, and providing pointers to possible misalignments as well as candidate corrections. The KRT services provide means to assess the quality of the data, offer confidence levels about the integrated data with respect to domain knowledge, and extend the data with new data that might have been missing, or that can be computed if background knowledge is taken into account.

On the query answering and conversational side, the KRT lowers the effort required to obtain useful data by enabling new knowledge management and query answering paradigms that are guided by knowledge and that can interact with the user. Conversational systems can use KRT services to allow a domain expert to interact with the domain knowledge to facilitate bootstrapping, development and maintenance of the system. During query answering, conversational systems can use KRT query services to explore domain knowledge, help the user refine the query, and obtain the desired answers. The KRT provides new forms of query answering that combine the power of structured queries with the flexibility of information retrieval techniques, enabling complex query answering with lower requirements with respect to data consistency.

1.1 Key challenges
Data extraction and data integration projects can require large investments before some results can be exploited. The integration of the structured data, e.g., mapping and alignment, and the training of models for information extraction for the knowledge graphs are lengthy and tedious tasks. Given that the tools and techniques involved are unaware of the meaning of the data and the domain, issues in this data and in the integration or in the models are often invisible until later stages in the project. Debugging and maintenance is a difficult task for which there is little or no automated support.
Similarly, constructing conversational systems that can use the resulting data is often a costly project-specific task. Mapping of user queries to data sets is done in an application specific fashion, limiting re-usability of the systems developed and limiting the queries that the system is able to handle. In general, the conversational systems in these projects lack an abstraction layer that allows them to separate the knowledge that is specific to the domain of the project from the conversational flow of the application; they have no back-end support that allows them to explore, query and reason with the knowledge to drive the flow. The conversational systems also have no way to alter, correct and update this knowledge in order to deal with issues in the data ingestion stage, and no way to use this knowledge to resolve conversation-specific problems, such as disambiguating the relevant data.

The remainder of the paper is structured as follows. Section 2 summarizes past work in handling semi-structured data and describes how the KRT handles it. Section 3 describes reasoning capabilities of our system. Section 4 discusses our natural language support features. Section 5 describes how we extract information from PDF documents. Section 6 describes our ontology management capabilities. Section 7 describes our Web services architecture and user interface. Finally, Section 8 concludes the paper.

2 HANDLING STRUCTURED AND SEMI-STRUCTURED DATA
In this section, we first review various approaches that have been proposed to map legacy structured and semi-structured data in various formats (e.g., XML, CSV, relational) into knowledge graphs. Then, we present the KRT approach to handle structured and semi-structured data. Most state-of-the-art approaches decouple the specification of the mappings or transformations from their actual execution, to either produce a new knowledge graph from the legacy data (materialization approach) or to directly query the data in its original form through query rewrite (virtualization approach). These approaches mainly differ along the following key dimensions:
• Format specific vs. Format agnostic
• Generic automated direct mapping vs. Customizable mapping specification
• Materialization vs. Virtualization
• High level declarative mapping language vs. General purpose transformation/query or programming language
• Existence vs. Absence of visual mapping editors
Some of these dimensions represent a spectrum along which different proposals fall. Others correspond to discrete (but not necessarily mutually exclusive) design choices.

2.1 Format specific vs. Format agnostic
The research on mapping or aligning legacy data to knowledge graphs started more than a decade ago with, as initial focus, the exploration of approaches to map relational data into RDF/OWL. Experiences from early systems such as D2RQ [9, 10], R2O [68], and SQL-RDF [70], which offer both generic automated mapping and customizable mapping approaches, led to the adoption of the W3C standard mapping language for customizing the mapping from relational data to RDF, R2RML [23], along with an approach for automated generic direct mapping from relational data to RDF [4]. [57] provides a good survey of approaches and systems to map relational data to RDF. Likewise, multiple approaches have been proposed to map XML data to RDF through both direct generic automated transformations (e.g., xCurator [75] or XSL-based transformation [12]) or through customized mappings/transformations [7, 8]. Approaches have also been developed to handle mapping from other specific data formats, e.g., CSV and spreadsheet data ([37, 48]). Recently, due to the need to integrate and map information from multiple interrelated sources with heterogeneous formats, new approaches have been proposed to map data in a format agnostic way. A prominent approach, RML [24], extends the R2RML standard [23] in order to support many input formats beside relational data (e.g., JSON, CSV, and XML). Another example, xR2RML [56], extends both R2RML and RML to also handle mapping from NoSQL databases.
2.2 Generic automated direct mapping vs. Customizable mapping specification
Generic direct mapping approaches specify a predefined static transformation of any input of a given format (e.g., CSV or relational data) into a generic RDF graph, which does not "conform" to any specific ontology model. For example, [4] defines a generic direct mapping of relational data to RDF, based on the simple principles that 1) a record is an RDF node, 2) the column name of a relational table is an RDF predicate, and 3) a relational table cell is a literal value. Customizable approaches (e.g., [23, 24]) provide a finer level of user control over the specification of the mappings, which can be driven by the actual semantics of the input data, as opposed to just its syntax, as in generic automated approaches.
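To make the direct-mapping principles concrete, the following minimal sketch applies them with the Python rdflib library. It is only an illustration of the generic approach in [4], not the KRT implementation; the table name, column names, and base IRI are assumptions.

```python
from rdflib import Graph, Literal, Namespace, RDF

# Illustrative base namespace and input data; not taken from the paper.
EX = Namespace("http://example.org/data/")

def direct_map(table_name, rows):
    """Generic direct mapping: each row becomes an RDF node, each column
    name becomes a predicate, and each cell becomes a literal value."""
    g = Graph()
    g.bind("ex", EX)
    for i, row in enumerate(rows):
        node = EX[f"{table_name}/{i}"]              # 1) a record is an RDF node
        g.add((node, RDF.type, EX[table_name]))
        for column, cell in row.items():
            predicate = EX[f"{table_name}#{column}"]  # 2) column name -> predicate
            g.add((node, predicate, Literal(cell)))   # 3) cell -> literal value
    return g

if __name__ == "__main__":
    recipes = [{"name": "Quiche", "servings": 4}, {"name": "Pie Crust", "servings": 8}]
    print(direct_map("Recipe", recipes).serialize(format="turtle"))
```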

2.3 Materialization vs. Virtualization
Current state-of-the-art systems to map legacy data to RDF enable virtual access (mainly through query rewrite) to data as RDF, or materialization of RDF graphs, or both. For example, Ontop [5, 14] uses a combination of ontological constraints (in a global ontology) and mapping rules to rewrite SPARQL queries into SQL queries that can then be evaluated over a traditional relational database - an illustration of the Ontology Based Data Access (OBDA) approach. At the other end of the spectrum, a system like Stardog¹ supports both virtual access and complete materialization of the legacy data into a new knowledge graph.

2.4 High-level declarative mapping language vs. General purpose transformation/query or programming language
Although a majority of state-of-the-art systems specify mappings and transformations using high-level declarative mapping languages (such as R2RML or RML), some approaches rely on general purpose transformation/query or programming languages. For example, [7, 8] use standard XML query or transformation languages such as XQuery or XSLT to encode both mappings and transformations, sometimes in combination with SPARQL, as in [7]. A clear advantage of higher level declarative mapping languages is that they can be much easier to use (even without visual editors) and to analyze, for example, in order to produce sound and complete query rewrites for data access through virtualization. However, their limited expressiveness often prevents the specification of arbitrarily complex mappings and transformations. To address this limitation, language extension points enable the invocation of functions written in general purpose Turing-complete languages.

2.5 Existence vs. Absence of visual mapping editors
Separating mapping specification from mapping execution was an important step in facilitating the customization of the mapping process by end users. However, without powerful visual mapping editors, considerable familiarity with the mapping language and the underlying input format is still required. Recently, to address this issue, visual mapping editors have been developed. However, most of them still require some user familiarity with the mapping language. For example, the FluidOps editor [71] guides the user in an input data driven, step-by-step process that is closely aligned with the main conceptual constructs of R2RML. Thus, although it effectively hides R2RML vocabulary, it still relies on a good user understanding of R2RML concepts. TopBraid Composer² also provides a similar data-driven step-by-step approach, but, in contrast to [71], it supports multiple data sources (relational data, CSV, XML, etc.). [64] describes an ontology-driven improvement over [71], where the user can start from a target ontology (instead of the input data only) and can perform the previous steps in any order, but the steps are still close to R2RML rules. DataOps [65] is another editor supporting heterogeneous sources, but it still exposes to the user some of the syntax of the mapping language. RMLEditor [39] is one of the most advanced visual mapping editors, based on its ease of use and features: it supports both data-driven and ontology-driven approaches, enables users to visualize both the input and the output, and exposes only the syntax of the navigation language of the input data to the user (e.g., XPath for XML, JSONPath for JSON). Unfortunately, RMLEditor, as in the case of other visual editors, is still instance-based: the user starts by loading input data (e.g., XML files) instead of models or schemas describing the structure of all considered data (e.g., XML Schemas). This means that the completeness of the resulting mapping remains difficult to assess based on the few examples provided as input to the mapping process.

2.6 The KRT Approach
In the KRT, we consider ease of mapping and customization by subject matter experts as paramount. In particular, we aim at an approach that relies on a robust, mature, and straightforward visual mapping editor combined with good default automated mappings that can be visually reviewed by users. Unfortunately, none of the visual RDF mapping editors presented in the previous section has a mapping process as simple as connecting elements from the input (schema) to mapped elements in the output ontology by drawing a line between them. Visual mapping editors to map between relational data and XML data have been extensively researched [36, 66, 69], have reached significant maturity and robustness levels, and are supported in major commercial data integration tools (e.g., IBM Rational Application Developer Mapping Tool (IBM RAD)³, Altova MapForce⁴, Stylus Studio⁵). They support model (or schema) based mappings, and mappings can be specified as simply as connecting input elements to output elements (without even specifying an expression in a navigation language such as XPath). Since an RDF graph can be viewed as an XML document (when serialized in the abbreviated RDF/XML syntax), in the KRT we use a commercial XML mapping tool (the IBM RAD Mapping tool) to leverage the extensive research, maturity and ease of use of XML visual mapping tools. However, we had to make two important adaptations to enable easy, consistent and robust RDF mappings using an XML mapping editor.

First, like most XML mapping editors, the RAD XML mapper is schema based. The user starts by specifying the structure of the input and output by providing their DTD or XML Schema. The input structure is then displayed on a left panel while the output structure is displayed on the right panel.
Mappings can then be specified as easily as connecting an element from the input panel to an element of the right panel by drawing a line between them. However, our target is an OWL ontology - not a DTD or XML Schema required by RAD. OWL and XML Schema both define constraints, but these constraints have very different semantics: XML Schema constraints are for checking the validity of XML documents, whereas OWL constraints are for inferring new facts. An XML Schema that constrains the structure of an arbitrary ABox file serialized in the abbreviated RDF/XML syntax would be too permissive; it would essentially specify a single wildcard element (xs:any) as the content of the element, which would have no value in the mapping editor, since the right panel would show only a single uninformative wildcard element. In the KRT, we created a tool to automatically generate a customized XML Schema given a TBox.

Second, we need to make sure that IRIs assigned to mapped RDF resources are used consistently in their definition (as value of the attribute "rdf:about") and in references to them (as value of the attribute "rdf:resource"). RAD, like other XML mapping editors, does not support this directly. Our approach is to define, as a reusable function (called a Submap in RAD), the specification of the mapping that takes a given XML element as input and produces as output the IRI of the corresponding mapped RDF resource. We then consistently invoke this function both to compute the value of the attribute "rdf:about" of the RDF resource and for all references to that resource (as value of the attribute "rdf:resource").

In addition to the ease of mapping editing provided by RAD, since it is schema-based, it is more amenable to assessing the completeness of the specified mappings. Furthermore, by adding semantic constraints like disjointness (e.g., Person and Vehicle are disjoint, or Man and Woman are disjoint) and performing consistency checks with explanations on the generated ABox files, we can automatically detect errors in the generated ABox and debug the mappings. Currently, our tool to generate a customized target XML Schema from a target ontology does not take into account all the semantic constraints defined in the ontology. This means that some mappings that violate semantic constraints can still be specified in the tool - hence the need to do consistency checks on the generated ABox. As we implement more semantic constraints in the tool, more invalid mappings will become infeasible within the visual editor.

¹ http://www.stardog.com/blog/virtual-graphs-relational-data-in-stardog/
² http://www.topquadrant.com/tools/modeling-topbraid-composer-standard-edition/
³ https://www.ibm.com/developerworks/downloads/r/rad/
⁴ https://www.altova.com/mapforce.html
⁵ http://www.stylusstudio.com/xml-mapper.html
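The second adaptation can be illustrated with a small sketch in which a single function plays the role of the reusable Submap, so that the same IRI is produced wherever a resource is defined (rdf:about) or referenced (rdf:resource). The element names and IRI scheme below are illustrative assumptions, not the actual KRT mapping.

```python
from xml.sax.saxutils import escape

BASE = "http://example.org/resource/"   # illustrative base IRI

def resource_iri(element_tag, key):
    """Single point of IRI minting (the role of a reusable Submap):
    called both where a resource is defined and where it is referenced."""
    return f"{BASE}{element_tag}/{escape(str(key))}"

def describe(element_tag, key, properties, references):
    """Emit abbreviated RDF/XML for one mapped resource."""
    lines = [f'<{element_tag} rdf:about="{resource_iri(element_tag, key)}">']
    for prop, value in properties.items():
        lines.append(f"  <{prop}>{escape(str(value))}</{prop}>")
    for prop, (ref_tag, ref_key) in references.items():
        # The same function computes the referenced IRI, so rdf:about and
        # rdf:resource values stay consistent by construction.
        lines.append(f'  <{prop} rdf:resource="{resource_iri(ref_tag, ref_key)}"/>')
    lines.append(f"</{element_tag}>")
    return "\n".join(lines)

print(describe("Recipe", "quiche", {"title": "Quiche"},
               {"hasStep": ("RecipeStep", "quiche-step-1")}))
```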
3 REASONING CAPABILITIES
Logical reasoning is a well-studied and developed area of research that, in recent years, has become fundamental in various knowledge representation paradigms. Logic programming languages such as Horn logic programs, Prolog (and its variants), Datalog, and ASP (Answer Set Programming) all suffer from exponential growth of the search space. This has driven research in two directions: the first one focuses on optimizing computational complexity by developing pruning techniques, heuristics, etc., and the second one consists of using less expressive logics that are more amenable to the large datasets that we find in the majority of real-world use cases. In order to be able to handle these kinds of scenarios in our toolkit, we focus on the OWL and RDFS formalisms. These paradigms both use description logics (DL, decidable subsets of First Order Logic (FOL)) as their underlying logic. In our toolkit we use particular reasoners developed for these models, such as Pellet, Racer, HermiT, Quonto, and ELK. For simple queries we use SPARQL, but for more sophisticated reasoning we use a rule-based reasoner. This enables our toolkit to provide an answer to a user query jointly with a proof, i.e., one or more detailed explanations of the inference required to reach an answer. The explainability of AI systems is an increasingly important factor in convincing human experts of the correctness of the results given by a machine.

Since our model is based on knowledge extraction from natural language (i.e., textual documents and/or dialog with a user), it is necessary to be able to handle noisy data, conflicts and uncertainty. Therefore we are interested in extending the logic of our toolkit with probability theory. While the early logic programming languages, such as Horn logic programs and Prolog, did not focus on expressing and reasoning with uncertainty, in recent years logic programming languages have been developed that can express both logical and quantitative uncertainty, such as SLP (Stochastic Logic Programs), PCC (Probabilistic Concurrent Constraint Programs), ProbLog (Probabilistic Prolog), PSL (Probabilistic Soft Logic), MLN (Markov Logic Networks), and PRISM (PRogramming In Statistical Modeling). We are investigating such programming languages and their restriction/application to the case of description logics (DL), such as the recently developed Pronto reasoner, a version of Pellet that allows probabilistic reasoning. This reasoner assigns an annotation corresponding to the probability value of the considered entity in the knowledge base, which by default is set to 1 if not otherwise specified.

Besides defining a probability distribution over the different conflicting alternatives, another way to handle conflicts in a knowledge representation is using contextual information. There are several approaches to augmenting an ontology with context [6, 11, 17, 35, 46, 60, 61]. One of the first formulations of context in description logic is using named graphs [17]. However, this formulation can cause a huge amount of duplication in the data. In [46] the authors provided an alternative method: they introduced a new namespace prefix (rdfc) to handle context information. The rdfc namespace is characterized by the fundamental relationship is true in (ist), where the formulation "statement ist context" means that the statement is true in that specific context. They then use lifting rules or lifting axioms to deduce the truth of statements in one context from the truth of statements in some other context. Another approach to handle context, proposed in [60, 61], avoids the creation of new namespaces and corresponds to introducing, in a regular ontology, a supporting ontology made of second-order triples defining instances of properties in a specific context. Thus each property can be used in a general way or as an instance of the general property in a specific context.

Context can also improve the computational time of the inference process: in order to compute the answer, context can be used to force the system to consider only the relevant information of a given input context. Other techniques, such as summarization and refinement [25, 32], can be used for this purpose.

The approach to handle context proposed in [60, 61] gives us the opportunity to have a meta-ontology that our system can also use to manage itself. For example, we want to be able to describe rules that deduce which action to perform next, using information about previous operations (or interactions with the user) together with the content of such operations (e.g., when the user wants to add a new class to the ontology, the meta-reasoner should check if the class already exists and, if it does not, should understand automatically which questions to pose to the user in order to create the new class in the right place in the ontology). A second order ontology ([21]) allows us to handle different levels of information. For example, in compliance, we want to represent both the hierarchy between different types of regulations/laws and their characteristics, and the content of such regulations.
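As a concrete illustration of the named-graph formulation of context [17] discussed above (one of the alternatives, not necessarily the KRT's internal representation), the following sketch uses Python's rdflib to assert statements in two contexts and to restrict a query to one of them; the namespace and the compliance example are assumptions.

```python
from rdflib import ConjunctiveGraph, Namespace

EX = Namespace("http://example.org/")        # illustrative namespace

# The same kind of statement asserted in two different contexts (named graphs):
# a property may hold in one regulatory context and not in another.
g = ConjunctiveGraph()
ctx_2016 = g.get_context(EX["context/regulation-2016"])
ctx_2017 = g.get_context(EX["context/regulation-2017"])

ctx_2016.add((EX.CompanyA, EX.compliesWith, EX.RuleX))
ctx_2017.add((EX.CompanyA, EX.compliesWith, EX.RuleY))
ctx_2017.add((EX.RuleY, EX.supersedes, EX.RuleX))

# Consider only the information relevant to a given input context.
for s, p, o in ctx_2017.triples((EX.CompanyA, None, None)):
    print(s, p, o)

# Or inspect every statement together with the context it is true in.
for s, p, o, c in g.quads((None, None, None, None)):
    print(c.identifier, "asserts", s, p, o)
```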

One of the principal differences between Prolog-style logic reasoning and description logic reasoning is the assumption of a closed or open universe. While a closed-world assumption (i.e., "negation as failure") is used in databases, in description logic it is common to assume an open universe: if an entry is not entailed by the knowledge base, it is unknown and so could be either true or false (but not deduced as false pre-emptively). In our scenario it is necessary to understand this difference, since our purpose is to elicit knowledge so as to be able to answer a user question. Thus we want to use this distinction to guide the conversation with the user; for example, consider a scenario in which we receive a query from a user and we try to answer, but the proof is "incomplete" since we miss some information. We then want to ask a set of questions of the user so as to elicit the information that we need in order to provide a positive or negative answer. This procedure can be seen as a form of abduction, but in our case it is guided through interaction with the user. Furthermore, in our tool, we try to minimize the number of interactions with the user, providing a proper minimal list of questions to resolve an incomplete proof.

4 NATURAL LANGUAGE PROCESSING SUPPORT
Analysis and interpretation of text in various human natural languages has been a fundamental aspect of the fields of artificial intelligence ([1]) and computational linguistics ([54]) for decades. Linguistic approaches focus on grammatical formalisms to capture the structural relationships between words, phrases, and clauses within sentences, including phrase structure grammars [18] and dependency grammars [62]. Large lexical resources have been developed [58] to aid disambiguation. State-of-the-art neural network parsers (e.g., [76], [3]) can achieve high accuracy on parsed corpora benchmarks such as the Penn Treebank [55].

Relationships are a fundamental building block of domain-specific knowledge bases used to support question answering, reasoning, explanation, and other tasks. Why is relation extraction so challenging? Relations can be expressed many different ways in natural language text. For example, a "memberOf" relation between John and a Cabinet can be expressed as "John is in the Cabinet", "Cabinet Member John", "John was a member of the Cabinet", and so on. There are many lexical and grammatical variations. Relations can be wide ranging, including social, temporal, spatial, and causal. Relations can be multidimensional (e.g., spatiotemporal). When designing a knowledge representation, one must often consider the cross product of the types of entities being extracted.

Many different methods have been developed for relation extraction. The goal of these methods is to extract relations that reflect the underlying relational structure of the domain of interest in the face of variability, noise, and uncertainty. Most methods evaluate pairs of entities within a sentence or a sliding window and classify the entities into relation types according to the lexical, grammatical, and semantic features of the entities and the surrounding tokens. Rule-based methods are typically accurate, but time consuming to develop and not easily generalized. For example, [38] developed rules to extract hyponym (is-a) relationships from news articles, but these cannot be applied to biomedical texts. Statistical relational learning methods [33] extract relations using learned models. Supervised methods include maximum entropy models [31][43] and conditional random fields [51]. Unsupervised methods include convolutional neural networks (CNNs) with attention models [73], recurrent networks [53], and other methods that use embeddings of characters, words, parts of speech tags, dependency parses, and other information to generate dense feature vectors. For example, Lee, Dernoncourt, and Szolovits [49] extract relations from scientific articles using a CNN with 4 layers: an embedding layer that converts lexico-syntactic features and the argument order into a vector, a convolutional layer that slides filters over the tokens to create feature maps, a max pooling layer that takes the most effective feature in the feature map, and a final layer that uses softmax activation to output the probability of each relation. Rule-based post-processing is used to correct and detect additional relations. [47] provides a recent survey of neural network models for relation extraction.
[67] focuses on mapping natural language to a formal representation of meaning, such as semantic case role labels or higher-level logical forms. New work indicates that neural network methods can be used to map sequences of words to logical expressions by encoding hierarchical representations [26]. Recent results indicate that neural language models that depend only on character-level inputs can also perform well on word and sentence level tasks [45]. Future work on relation extraction will likely focus on longer distance relationships between clauses and sentences in a document, including logical (entailment, presupposition) relationships [72].

Our work focuses on semantic relations that are suitable for applications involving reasoning and inference, such as question answering [30] and dialogue [29]. Most of the knowledge base managed by the KRT originates in data without the formal structure imposed by a schema or ontology and includes text expressed in various natural languages. Structured data sources, such as relational databases and XML documents, also include elements containing text or text mixed with structured data. A key challenge is the interpretation of this semi-structured natural language text into knowledge representations suitable for graph databases.

In the KRT, we prepare semi-structured data for human annotation automatically by extracting text elements and composing them into larger text units while maintaining provenance back to the structured source data. We employ a supervised learning methodology whereby human annotators label sequences of tokens within the text units as mentions of various entities. Mentions of the same entity are annotated as co-references. Mentions of relations are annotated by selecting entities in order to indicate the direction of the relation. Annotation guidelines are developed to improve agreement across the human annotators, and the resulting data is adjudicated in order to arrive at a ground truth. This domain-specific training data is used to train models. Human supervision is necessary given the specialized lexicon, grammar, and semantics of the target domains.

The KRT natural language processing pipeline starts with segmenting the text units into sentences, tokenizing, tagging with part-of-speech, and dependency parsing. Next, machine learning models are used to extract entities and relationships. The entity types used by the machine learning models are a hierarchical projection and filtering of a KRT OWL ontology. Additional entity types are added to enable human annotators to label tokens which modify the semantic interpretation of the extracted relationships, such as negations. Prior to entity detection, we make use of domain-specific dictionaries, generated by KRT microservices, to pre-annotate text with relevant lexical variations of domain entities. One microservice expands tokens using embeddings and matches the resulting terms to noun phrases in the dependency parse. Sequential classification with a sliding window is used to recognize relationships. Following relation extraction, another microservice connects entities and relations extracted from text units with both the nodes and properties generated from analysis of the structured data that contained those text units. Rules are then applied to align and normalize the graphs originating from the unstructured and structured sources.
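A minimal sketch of the early pipeline stages (sentence segmentation, tokenization, part-of-speech tagging, dependency parsing, dictionary-based pre-annotation, and sliding-window candidate generation) is shown below, using the off-the-shelf spaCy library as a stand-in for the KRT components; the model name, the toy dictionary, and the stubbed classifier are assumptions for illustration only.

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # assumed off-the-shelf model, not the KRT's

# Illustrative domain dictionary used to pre-annotate lexical variants.
DOMAIN_TERMS = {"cabinet": "GovernmentBody", "john": "Person"}

def candidate_relation_windows(doc, window=10):
    """Yield pairs of pre-annotated mentions that co-occur within a sliding
    window of tokens, as candidates for relation classification."""
    mentions = [(tok.i, DOMAIN_TERMS[tok.lower_]) for tok in doc
                if tok.lower_ in DOMAIN_TERMS]
    for i, (pos_a, type_a) in enumerate(mentions):
        for pos_b, type_b in mentions[i + 1:]:
            if abs(pos_b - pos_a) <= window:
                yield (doc[pos_a], type_a), (doc[pos_b], type_b)

doc = nlp("John was a member of the Cabinet.")
for sent in doc.sents:                      # sentence segmentation
    for tok in sent:                        # tokens with POS tags and dependency arcs
        print(tok.text, tok.pos_, tok.dep_, "->", tok.head.text)

for (a, type_a), (b, type_b) in candidate_relation_windows(doc):
    # A trained sequential classifier would label this pair (e.g., memberOf);
    # the classification step is stubbed out here.
    print(f"candidate relation between {a.text} ({type_a}) and {b.text} ({type_b})")
```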
In future work, we plan to extract relations from the text contained in structured data, such as tables. We also plan to develop logical and process flow representations of documents that include conditions, actions, and states, including temporal and spatial relationships. Ultimately, a question answering system or dialogue system should be able to query the resulting graph to answer questions involving temporal and spatial reasoning over propositions composed of entities and relations specific to the target domain.

5 INFORMATION EXTRACTION FROM PDF DOCUMENTS
The very first step of building a high quality knowledge graph, and developing cognitive applications on top of it, is to extract information from unstructured data sources. In the last couple of decades, Portable Document Format (PDF) technology has been widely used for information storage and exchange in many domain specific areas. Although this technology provides a very convenient way of exchanging information, it is intended to be used by humans, not computers. One of the main design principles of the KRT is to have advanced capabilities to process digital documents and extract the information conveyed in these documents. Noisy information extraction processes degrade the quality of the knowledge graphs generated. Therefore, the quality of information extraction tools directly affects the quality of cognitive applications. Below we discuss various techniques proposed in the literature in order to extract information from various document sources. We also provide information about similar approaches, as well as other methods we use in the KRT infrastructure for the data ingestion pipeline.

In [74], Williams et al. describe a comprehensive system for information extraction from scientific documents in PDF format. Their system crawls thousands of documents from the CiteSeer digital library. After the crawling step they first extract metadata regarding the papers such as title, authors, abstract, venue, page numbers and publisher information. Extraction is performed using SVMHeaderParse [34], which is an SVM-based header extractor. They refer the reader to a recent comparison of various header extraction tools [52] which shows that more accurate extraction tools than SVMHeaderParse currently exist, though the scalability aspects of these tools are unclear. They also extract citations using the citation string parsing tool ParsCit [22]. Using regular expressions, the section of the text containing citations is first identified, and then each citation is extracted, parsed and tagged. The citation context for each extracted citation is stored, which allows for further citation analysis. Other than metadata and citations, they also mention the capability to extract tables using a specialized module and make them searchable. However, they do not provide further details about how the table extractor works or the quality of the extracted tables. According to [15], figures should be considered as rich information resources in digital documents, yet they have been neglected for a long time in terms of information extraction from digital libraries. Based on this, the authors in [74] target not only the extraction of figures, but also metadata such as the captions associated with the figures. Their extraction method is based on [19], where the positional and font related information extracted from digital documents can be used for accurate extraction of figure metadata. As extraction of such information is heavily dependent on the underlying PDF processing library (such as PDFBox), a machine learning-based system was also developed which uses only textual features. A Lucene based search engine is used to index figure related textual metadata extracted from documents [20]. They also develop a pipeline for extraction of data from 2D line graphs and scatter plots. Extraction of vector images and understanding the semantics of figures in scholarly documents are left as future work. In addition to figure extraction, they also developed modules for extracting algorithms and acknowledgements provided in digital documents, called "AlgSeer" [16] and AckSeer respectively. For extraction of acknowledgements, an ensemble of named entity recognizers is used to extract the entities from the acknowledgments section of a paper, and three types of entities are extracted: person names, organizations, and companies. In order to tackle the entity disambiguation problem, AckSeer utilizes a novel search engine based algorithm for disambiguating the plethora of entities found in the acknowledgment sections of papers [44].
Similar to the methods described above, we have specialized modules in the KRT for information extraction from PDF documents. During the extraction process we use both machine learning and rule-based methods. As the initial step, the KRT ingestion pipeline first converts the PDF documents into HTML format for further processing. Objects within the documents, such as tables and images, are identified using machine learning methods. If there are scanned pages or images within the documents, these are preprocessed with an OCR engine to convert them into text or tables. Once the HTML output is produced, an HTML parser engine is used to parse the document and extract tables, figures, captions and other metadata information. For caption extraction we use regular expression-based methods. Since tables are widely used in digital documents, we perform further processing steps for the extraction of the content of tables. These steps consist of identifying table headers, normalization of spanning row/column headers, and converting them into a more structured format in either XML or JSON, based on the user's preference.
Fang et al. provide a comprehensive list of heuristics that can be used for the extraction of table headers [28], which overlaps with some of the techniques we used for table header extraction. Once the data rows of tables are extracted, the content is converted into a graph structure for post-processing, such as reasoning and query answering. During the query processing step we also perform fuzzy match techniques to link the user queries with the table structure and bring useful information to the users.
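The table-normalization step can be sketched as follows, assuming the PDF has already been converted to HTML. The sketch uses BeautifulSoup and treats the first row as the header row, a simplification of the header-detection heuristics discussed above; it is not the KRT table extractor.

```python
import json
from bs4 import BeautifulSoup

def tables_to_json(html):
    """Extract each HTML <table> into a list of {header: cell} records,
    treating the first row as the header row (a simplifying assumption;
    real table-header detection uses richer heuristics, cf. [28])."""
    soup = BeautifulSoup(html, "html.parser")
    extracted = []
    for table in soup.find_all("table"):
        rows = [[cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
                for tr in table.find_all("tr")]
        if len(rows) < 2:
            continue
        header, data = rows[0], rows[1:]
        extracted.append([dict(zip(header, row)) for row in data])
    return json.dumps(extracted, indent=2)

html = """
<table>
  <tr><th>Drug</th><th>Dose</th></tr>
  <tr><td>Aspirin</td><td>100 mg</td></tr>
</table>
"""
print(tables_to_json(html))
```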

6 ONTOLOGY MANAGEMENT
The Ontology Input Output Service (ONTIOS) manages concurrent access to multiple ontology models (T-Boxes) and instance graphs (A-Boxes) on behalf of one or more users. It sits on top of a SAIL Sesame [13], Blazegraph⁶ or Jena [2] layer and supports frame-based, high level views of complex instances with nested components in domains with structured or semi-structured data (XML, HTML or JSON). These views can be deduced from a user provided ontology or induced from prototypical instances.

[Figure 1: The ONTIOS architecture - model induction, instance ingestion, metadata management and A-Box/T-Box mapping services exposed through data access, query (keyword, frame-based) and repository management REST APIs, layered over storage, write transaction and query/retrieval APIs on a SAIL (Sesame) or Jena layer with an ARQ endpoint and the RIO library.]

The ONTIO Service is domain-agnostic but designed to provide extra services on top of general ontology management and storage tools (like Blazegraph, NeOn [27], Open Anzo⁷ or AllegroGraph⁸). ONTIO leverages the characteristics of a large number of structured data sources in technical and process-oriented fields in order to provide automated ingestion of data according to semi-automatic, user-friendly ontologies that act as type systems to automatically curate the instance data.

There are many scenarios where data instances (A-Box) are already structured or semi-structured, like in the case of Web sites or XML documents describing goods and services, food recipes, or instructions for maintenance or assembly of complex products. These structures often carry valuable semantic information which can be used to automatically deduce a simple OWL-QL ontology with property domain and range definitions. This automatic seed ontology can be enhanced by users to achieve a user-friendly type system for the ingesting, browsing and querying of these data sources.

Given a few sample instances, the ONTIOS parser can deduce classes, data properties and structural object properties (relations), like containment, e.g., Recipe1 – contains –> RecipeStep2, and connect these structural building blocks to domain entities by using simple entity resolution services, e.g., paragraph RecipeStep2 – mentions –> MozzarellaCheese. The result is a seed ontology which is consistent with the underlying structure of the data source.

This seed model can be built upon by domain experts using standard ontology editors. Operations like renaming, adding aliases, extending the inheritance hierarchies and domain or range definitions of properties are allowed. These enhancements are automatically recorded in the user model, or T-Box. No statements of the seed model can be retracted by the user model.

New instances can now be ingested following this user model. If instance data doesn't follow the existing T-Box, e.g., the new data would entail adding a new class to the domain definition of an object property, the mismatched information is either flagged to the domain expert or can be automatically added to the existing T-Box.

Models managed by ONTIOS are required to be within OWL-QL, but the use of disjunction in the definition of property domains and ranges is allowed, since they do not add many statements to the deductive closure and prevent the proliferation of ad-hoc properties. Models should also follow the following set of design principles:
• A single top class, object property, datatype property and annotation property must be defined to help with ontology integration and refactoring.
• Object properties, with the exception of the top property above, should inherit from one of the abstract relations hasAttribute, attributeOf, hasMember, memberOf and associatedTo. This allows users unfamiliar with the ontology to query entire subgraphs associated to an entity, for example, a recipe and all its steps and ancillary sub-recipes, like the pie crust recipe needed to make a quiche. This is similar to other attempts to simplify the myriad of user-created relations in ontologies by subsuming them into well-known UML relations [63].
• The seed ontology and its classes represent the data source and its organization and are separate from the domain sub-ontology. For example, in the Recipe ontology, classes representing the tasks to be performed and their relations are separate from the dishes and ingredient ontologies which constitute the domain knowledge.
tending the inheritance hierarchies and domain or range definitions • Metadata, including user-readable labels and provenance, of properties are allowed. These enhancements are automatically are maintained in each instance. recorded in the user model, or T-Box. No statements of the seed model can be retracted by the user model. ONTIOS relies on (a) the underlying OWL or RDF APIs provided New instances can now be ingested following this user model. by Sesame or Jena, (b) an ARQ endpoint and (c) the RIO library for If instance data doesn’t follow the existing T-Box, e.g., the new data formatting as indicated in Figure 1. The ONTIOS architecture data would entail adding a new class to the domain definition of is similar to [40] in its use of OWL-like vocabulary for classes and an object property, the mismatched information is either flagged instances, even though it only requires an RDF store. The service to the domain expert or can be automatically added to the existing manages multiple access to the ontologies in a server, transactional T-Box. writes and provides its own indexing of classes and properties for ef- Models managed by ONTIOS are required to be within OWL- ficient access of the ontology, according to user-defined vocabulary. QL, but the use of disjunction in the definition of property domain Model induction and instance ingestion services aim to automate and ranges is allowed, since they do not add many statements the transfer of data into a well-formed ontology. Retrieval and query to the deductive closure and prevent the proliferation of ad-hoc services allow users to query the ontology with little knowledge properties. Also, models should also follow the following set of of the underlying structure of relations. Using keywords, the ser- design principles, vice can assemble together simple queries using relation paths of the abstract UML relations. Results are returned in the shape of frame-based nested objects that mimic the underlying ontology structure. 6https://www.blazegraph.com/ 7http://www.openanzo.org/ In summary, the ONTIO service aims to automate the disci- 8https://franz.com/agraph/allegrograph/ plined ingestion of structured data sources using domain experts’ HotWeb’17, October 14, 2017, San Jose / Silicon Valley, CA, USA M. Canim et al. enhanced vocabulary and provide user-friendly query facilities on commercial offerings by several companies including IBM, Amazon, top of an ARQ endpoint, which return a frame-based view of these and Microsoft offer cognitive services which are accessed via http. entities. These services are tailored to domains with many complex, It is becoming increasingly common to provide clients or software structured data sources, as found in many technical and financial development kits (SDKs) which hide details of the underlying http applications. interface and make it easier to use the services. Clients wrap HTTP calls to services in a function, method, or procedure which makes it 7 WEB SERVICES ARCHITECTURE AND considerably easier for an application program to access a service. USER INTERFACE Clients are typically written for specific programming languages such as Java, Python, or Node.js (which is actually a JavaScript As described in the sections above, the KRT is comprised of compo- runtime built on Chrome’s V8 JavaScript engine). Clients are pro- nents with very different characteristics in terms of system load, vided in different languages to accommodate applications written resource consumption, demands of scalability, etc. 
7 WEB SERVICES ARCHITECTURE AND USER INTERFACE
As described in the sections above, the KRT is comprised of components with very different characteristics in terms of system load, resource consumption, demands of scalability, etc. For example, though most components are compute-intensive, the information extraction component can also be I/O-intensive; while the querying and reasoning components require high-performance graph storage, the NLP component might need GPU support to run deep learning models more efficiently. Therefore we adopt a microservice architecture [59] to develop the toolkit as independently deployable services, and hence match the characteristics of each component with appropriate resources.

Adopting a microservice architecture also brings the following major opportunities and benefits [50]:
• Organized around Business Capabilities: Organizing teams around individual business capabilities reduces the context of development and makes clear the boundary of components.
• Decentralized Governance: Since different components are deployed individually and communicate using common protocols such as HTTP, teams have autonomy to choose technology stacks and deployment specifications.
• Independent Scalability: Given different load characteristics, components can be independently scaled and elastically adjusted.
• Infrastructure Automation: Automated tests, continuous deployment and continuous integration are as important for microservice architectures as for traditional monolithic applications. Specifically for microservices, tools like Kubernetes are used to automate microservices management.
We use Node-RED, a programming tool for wiring together hardware devices, APIs and online services, to put together microservices.

7.1 Web Services Interface
Since we offer the KRT as services, it is natural to use Representational State Transfer (REST) Web services as the main form of user interface. We design data and computation resource interfaces according to the semantics of HTTP request methods. For tasks that are short but frequently requested, bulk processing interfaces are provided to reduce the overhead of network transmission. For tasks that have long running times, we design a job management system in the backend to keep the interfaces stateless. Finally, well defined REST Web services can be wrapped into clients and programming language SDKs as described below.

7.2 Enhanced Client Support
It is common to provide interfaces to cognitive services using http as described above. Our KRT exposes services through http, and commercial offerings by several companies including IBM, Amazon, and Microsoft offer cognitive services which are accessed via http. It is becoming increasingly common to provide clients or software development kits (SDKs) which hide details of the underlying http interface and make it easier to use the services. Clients wrap HTTP calls to services in a function, method, or procedure, which makes it considerably easier for an application program to access a service. Clients are typically written for specific programming languages such as Java, Python, or Node.js (which is actually a JavaScript runtime built on Chrome's V8 JavaScript engine). Clients are provided in different languages to accommodate applications written in different languages.

Our client for the KRT provides a number of key features in addition to making it easier to use the http interface. We also provide caching at the client [41]. This is useful for situations when a response to a service call can be reused by subsequent service calls. When a request can be satisfied from a cache instead of by making a remote service call, this can considerably reduce latency.

We also provide the ability to store data at the client and analyze results from service calls at the client [42]. For example, our client can access natural language processing services and aggregate and analyze data across several documents. Our client integrates natural language processing with searching so that results from Web searches can easily be analyzed.

Our client provides both synchronous and asynchronous interfaces for accessing services. The synchronous interface makes requests for services and waits for each response before continuing. The asynchronous interface allows requests to be made to services without blocking. Instead, a request can execute on a separate thread while the main application continues executing in parallel. If the application can function correctly while making asynchronous service calls, the added latency for accessing services can be considerably reduced by using an asynchronous interface.
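A minimal sketch of such a client is shown below, illustrating client-side caching and an asynchronous interface over a plain HTTP API in the spirit of [41, 42]; the endpoint path, parameters, and response format are hypothetical, not the actual KRT interface.

```python
from concurrent.futures import ThreadPoolExecutor

import requests

class KRTClient:
    """Thin client sketch: wraps HTTP calls, caches responses, and offers an
    asynchronous interface. The endpoint and payload shape are hypothetical."""

    def __init__(self, base_url, max_workers=4):
        self.base_url = base_url.rstrip("/")
        self._cache = {}
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def query(self, keywords):
        """Synchronous call with client-side caching of repeated requests."""
        key = ("query", keywords)
        if key not in self._cache:
            resp = requests.get(f"{self.base_url}/query",      # hypothetical endpoint
                                params={"q": keywords}, timeout=30)
            resp.raise_for_status()
            self._cache[key] = resp.json()
        return self._cache[key]

    def query_async(self, keywords):
        """Asynchronous call: returns a Future so the caller does not block."""
        return self._pool.submit(self.query, keywords)

# client = KRTClient("http://localhost:8080/krt")       # hypothetical deployment
# future = client.query_async("quiche ingredients")     # work continues in parallel
# answers = future.result()                              # block only when needed
```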
8 CONCLUSION
This paper has presented our knowledge and reasoning toolkit (KRT) for developing cognitive applications. Our KRT provides advanced management and querying capabilities for both structured and semi-structured data. We discussed the reasoning capabilities and natural language processing support provided by our KRT. The KRT can also automatically extract information from PDF documents and provides ontology management. Our KRT is exposed to users via a Web services interface, and we provide clients for enhancing the functionality of application programs accessing the KRT.

9 ACKNOWLEDGMENTS
The authors would like to thank Francois Lancelot for his contributions to several of the ideas presented in this paper.

REFERENCES
[1] James F Allen. 2003. Natural language processing. (2003).
[2] A. Ameen, K.U. Rahman Kahn, and B.P. Rani. 2014. Reasoning in Semantic Web Using Jena. Computer Engineering and Intelligent Systems 5, 4 (2014).
[3] Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. 2016. Globally Normalized Transition-Based Neural Networks. CoRR abs/1603.06042 (2016). http://arxiv.org/abs/1603.06042

[4] Marcelo Arenas, Alexandre Bertails, Eric Prud'hommeaux, Juan Sequeda, et al. 2012. A direct mapping of relational data to RDF. (2012).
[5] Timea Bagosi, Diego Calvanese, Josef Hardi, Sarah Komla-Ebri, Davide Lanti, Martin Rezk, Mariano Rodríguez-Muro, Mindaugas Slusnys, and Guohui Xiao. 2014. The ontop framework for ontology based data access. In Chinese Semantic Web and Web Science Conference. Springer, 67–77.
[6] Djamal Benslimane, Ahmed Arara, Gilles Falquet, Zakaria Maamar, Philippe Thiran, and Faïez Gargouri. 2006. Contextual Ontologies. In ADVIS (Lecture Notes in Computer Science), Vol. 4243. Springer, 168–176.
[7] Nikos Bikakis, Chrisa Tsinaraki, Ioannis Stavrakantonakis, Nektarios Gioldasis, and Stavros Christodoulakis. 2015. The SPARQL2XQuery interoperability framework. World Wide Web 18, 2 (2015), 403–490.
[8] Stefan Bischof, Stefan Decker, Thomas Krennwallner, Nuno Lopes, and Axel Polleres. 2012. Mapping between RDF and XML with XSPARQL. Journal on Data Semantics (2012), 1–39.
[9] Christian Bizer and Richard Cyganiak. 2009. The D2RQ Plattform. (2009).
[10] Christian Bizer and Andy Seaborne. 2004. D2RQ - treating non-RDF databases as virtual RDF graphs. In Proceedings of the 3rd International Semantic Web Conference (ISWC 2004), Vol. 2004. Springer.
[11] Paolo Bouquet, Luciano Serafini, and Heiko Stoermer. 2005. Introducing Context into RDF Knowledge Bases. In Proceedings of SWAP 2005, the 2nd Italian Semantic Web Workshop. 14–16.
[12] Frank Breitling. 2009. A standard transformation from XML to RDF via XSLT. Astronomische Nachrichten 330, 7 (2009), 755–760.
[13] J. Broekstra, A. Kampman, and F. van Harmelen. 2002. Sesame: A Generic Architecture for Storing and Querying RDF and RDF Schema. Springer Berlin Heidelberg, Berlin, Heidelberg, 54–68.
[14] Diego Calvanese, Benjamin Cogrel, Sarah Komla-Ebri, Roman Kontchakov, Davide Lanti, Martin Rezk, Mariano Rodriguez-Muro, and Guohui Xiao. 2017. Ontop: Answering SPARQL queries over relational databases. Semantic Web 8, 3 (2017), 471–487.
[15] Sandra Carberry, Stephanie Elzer, and Seniz Demir. 2006. Information Graphics: An Untapped Resource for Digital Libraries. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '06). ACM, New York, NY, USA, 581–588. https://doi.org/10.1145/1148170.1148270
[16] Stephen H Carman. 2013. AlgSeer: An Architecture for Extraction, Indexing and Search of Algorithms in Scientific Literature. (2013).
[17] Jeremy J. Carroll, Christian Bizer, Pat Hayes, and Patrick Stickler. 2005. Named Graphs. Web Semant. 3, 4 (Dec. 2005), 247–267. https://doi.org/10.1016/j.websem.2005.09.001
[18] Noam Chomsky. 2002. Syntactic structures. Walter de Gruyter.
[19] Sagnik Ray Choudhury, Prasenjit Mitra, Andi Kirk, Silvia Szep, Donald Pellegrino, Sue Jones, and C. Lee Giles. 2013. Figure Metadata Extraction from Digital Documents. In Proceedings of the 2013 12th International Conference on Document Analysis and Recognition (ICDAR '13). IEEE Computer Society, Washington, DC, USA, 135–139. https://doi.org/10.1109/ICDAR.2013.34
[20] Sagnik Ray Choudhury, Suppawong Tuarob, Prasenjit Mitra, Lior Rokach, Andi Kirk, Silvia Szep, Donald Pellegrino, Sue Jones, and Clyde Lee Giles. 2013. A Figure Search Engine Architecture for a Chemistry Digital Library. In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '13). ACM, New York, NY, USA, 369–370. https://doi.org/10.1145/2467696.2467757
[21] Simona Colucci, Tommaso Di Noia, Eugenio Di Sciascio, Francesco M Donini, and Azzurra Ragone. 2010. Second-order description logics: Semantics, motivation, and a calculus. In 23rd International Workshop on Description Logics DL2010. 67.
[22] Isaac G. Councill, C. Lee Giles, and Min-Yen Kan. 2008. ParsCit: an Open-source CRF Reference String Parsing Package. In LREC.
[23] Souripriya Das, Seema Sundara, and Richard Cyganiak. 2012. R2RML: RDB to RDF mapping language. (2012).
[24] Anastasia Dimou, Miel Vander Sande, Pieter Colpaert, Ruben Verborgh, Erik Mannens, and Rik Van de Walle. 2014. RML: A Generic Language for Integrated RDF Mappings of Heterogeneous Data. In LDOW.
[25] Julian Dolby, Achille Fokoue, Aditya Kalyanpur, Aaron Kershenbaum, Edith Schonberg, Kavitha Srinivas, and Li Ma. 2007. Scalable Semantic Retrieval Through Summarization and Refinement. In Proceedings of the 22nd National Conference on Artificial Intelligence - Volume 1 (AAAI'07). AAAI Press, 299–304. http://dl.acm.org/citation.cfm?id=1619645.1619693
[26] Li Dong and Mirella Lapata. 2016. Language to Logical Form with Neural Attention. CoRR abs/1601.01280 (2016). http://arxiv.org/abs/1601.01280
[27] Michael Erdmann and Walter Waterfeld. 2012. Overview of the NeOn Toolkit. Springer, 281–301.
[28] Jing Fang, Prasenjit Mitra, Zhi Tang, and C Lee Giles. 2012. Table Header Detection and Classification. In AAAI. 599–605.
[29] Robert Farrell, Jonathan Lenchner, Jeffrey Kephart, Alan Webb, Michael Muller, Thomas Erickson, David Melville, Rachel Bellamy, Daniel Gruen, Jonathan Connell, et al. 2016. Symbiotic Cognitive Computing. AI Magazine 37, 3 (2016).
[30] David Ferrucci, Eric Brown, Jennifer Chu-Carroll, James Fan, David Gondek, Aditya A Kalyanpur, Adam Lally, J William Murdock, Eric Nyberg, John Prager, et al. 2010. Building Watson: An overview of the DeepQA project. AI magazine 31, 3 (2010), 59–79.
[31] R. Florian, H. Hassan, A. Ittycheriah, H. Jing, N. Kambhatla, X. Luo, N. Nicolov, and S. Roukos. 2004. A Statistical Model for Multilingual Entity Detection and Tracking. Defense Technical Information Center. https://books.google.com.tr/books?id=LuMwngAACAAJ
[32] Achille Fokoue, Aaron Kershenbaum, Li Ma, Edith Schonberg, and Kavitha Srinivas. 2006. The Summary Abox: Cutting Ontologies Down to Size. In The Semantic Web - ISWC 2006, 5th International Semantic Web Conference, ISWC 2006, Athens, GA, USA, November 5-9, 2006, Proceedings. 343–356. https://doi.org/10.1007/11926078_25
[33] Lise Getoor and Ben Taskar. 2007. Introduction to statistical relational learning. MIT press.
[34] Hui Han, C. Lee Giles, Eren Manavoglu, Hongyuan Zha, Zhenyue Zhang, and Edward A. Fox. 2003. Automatic document metadata extraction using support vector machines. In JCDL '03: Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital Libraries. 37–48. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.147.3718
[35] Ramanathan Guha. 1992. Contexts: A Formalization and Some Applications. Ph.D. Dissertation. Stanford, CA, USA. UMI Order No. GAX92-17827.
[36] Laura M Haas, Mauricio A Hernández, Howard Ho, Lucian Popa, and Mary Roth. 2005. Clio grows up: from research prototype to industrial tool. In Proceedings of the 2005 ACM SIGMOD international conference on Management of data. ACM, 805–810.
[37] Lushan Han, Tim Finin, Cynthia Parr, Joel Sachs, and Anupam Joshi. 2008. RDF123: from Spreadsheets to RDF. The Semantic Web - ISWC 2008 (2008), 451–466.
[38] Marti A. Hearst. 1992. Automatic Acquisition of Hyponyms from Large Text Corpora. In Proceedings of the 14th Conference on Computational Linguistics - Volume 2 (COLING '92). Association for Computational Linguistics, Stroudsburg, PA, USA, 539–545. https://doi.org/10.3115/992133.992154
[39] Pieter Heyvaert, Anastasia Dimou, Aron-Levi Herregodts, Ruben Verborgh, Dimitri Schuurman, Erik Mannens, and Rik Van de Walle. 2016. RMLEditor: a graph-based mapping editor for Linked Data mappings. In International Semantic Web Conference. Springer, 709–723.
[40] M. Horridge and S. Bechhofer. 2011. The OWL API: A Java API for OWL Ontologies. Semantic Web 2, 1 (Jan. 2011), 11–21. http://dl.acm.org/citation.cfm?id=2019470.2019471
[41] Arun Iyengar. 2017. Providing Enhanced Functionality for Data Store Clients. In 33rd IEEE International Conference on Data Engineering, ICDE 2017, San Diego, CA, USA, April 19-22, 2017. 1237–1248. https://doi.org/10.1109/ICDE.2017.168
[42] Arun Iyengar. 2017. Supporting Data Analytics Applications Which Utilize Cognitive Services. In 37th IEEE International Conference on Distributed Computing Systems, ICDCS 2017, Atlanta, GA, USA, June 5-8, 2017. 1856–1864. https://doi.org/10.1109/ICDCS.2017.172
[43] Nanda Kambhatla. 2004. Combining Lexical, Syntactic, and Semantic Features with Maximum Entropy Models for Extracting Relations. In Proceedings of the ACL 2004 on Interactive Poster and Demonstration Sessions (ACLdemo '04). Association for Computational Linguistics, Stroudsburg, PA, USA, Article 22. https://doi.org/10.3115/1219044.1219066
[44] Madian Khabsa, Pucktada Treeratpituk, and C Lee Giles. 2012. Entity resolution using search engine results. In Proceedings of the 21st ACM international conference on Information and knowledge management. ACM, 2363–2366.
[45] Yoon Kim, Yacine Jernite, David Sontag, and Alexander M Rush. 2016. Character-Aware Neural Language Models. In AAAI. 2741–2749.
[46] Graham Klyne. October 2000. Contexts for RDF Information Modelling. Content Technologies Ltd. http://www.ninebynine.org/RDFNotes/RDFContexts.html
[47] Shantanu Kumar. 2017. A Survey of Deep Learning Methods for Relation Extraction. arXiv preprint arXiv:1705.03645 (2017).
[48] Andreas Langegger and Wolfram Wöß. 2009. XLWrap - querying and integrating arbitrary spreadsheets with SPARQL. The Semantic Web - ISWC 2009 (2009), 359–374.
[49] Ji Young Lee, Franck Dernoncourt, and Peter Szolovits. 2017. MIT at SemEval-2017 Task 10: Relation Extraction with Convolutional Neural Networks. CoRR abs/1704.01523 (2017). http://arxiv.org/abs/1704.01523
[50] James Lewis and Martin Fowler. [n. d.]. Microservices. ([n. d.]). https://martinfowler.com/articles/microservices.html
[51] Yaliang Li, Jing Jiang, Hai Leong Chieu, and Kian Ming Adam Chai. 2011. Extracting Relation Descriptors with Conditional Random Fields. In IJCNLP. 392–400.
[52] Mario Lipinski, Kevin Yao, Corinna Breitinger, Joeran Beel, and Bela Gipp. 2013. Evaluation of Header Metadata Extraction Approaches and Tools for Scientific PDF Documents. In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '13). ACM, New York, NY, USA, 385–386. https://doi.org/10.1145/2467696.2467753
[53] Yang Liu, Furu Wei, Sujian Li, Heng Ji, Ming Zhou, and Houfeng Wang. 2015. A Dependency-Based Neural Network for Relation Classification. CoRR abs/1507.04646 (2015). http://arxiv.org/abs/1507.04646

[54] Christopher D Manning, Hinrich Schütze, et al. 1999. Foundations of statistical natural language processing. Vol. 999. MIT Press. [55] Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Build- ing a Large Annotated Corpus of English: The Penn Treebank. Comput. Linguist. 19, 2 (June 1993), 313–330. http://dl.acm.org/citation.cfm?id=972470.972475 [56] Franck Michel, Loïc Djimenou, Catherine Faron-Zucker, and Johan Montagnat. 2015. Translation of relational and non-relational databases into RDF with xR2RML. In 11th International Confenrence on Web Information Systems and Technologies (WEBIST’15). 443–454. [57] Franck Michel, Johan Montagnat, and Catherine Faron-Zucker. 2014. A survey of RDB to RDF translation approaches and tools. Ph.D. Dissertation. I3S. [58] George A Miller. 1995. WordNet: a lexical database for English. Commun. ACM 38, 11 (1995), 39–41. [59] Sam Newman. 2015. Building microservices: designing fine-grained systems." O’Reilly Media, Inc.". [60] Vinh Nguyen, Olivier Bodenreider, and Amit Sheth. 2014. Don’t like RDF reifica- tion?: making statements about statements using singleton property. In Proceed- ings of the 23rd international conference on World wide web. ACM, 759–770. [61] Vinh Nguyen and Amit P. Sheth. 2017. Logical Inferences with Contexts of RDF Triples. CoRR abs/1701.05724 (2017). http://arxiv.org/abs/1701.05724 [62] Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, et al. [n. d.]. Universal Dependencies v1: A Multilingual Treebank Collection. [63] Carlos Pedrinaci, A. Bernaras, T. Smithers, J. Aguado, and M. Cendoya. 2004. A Framework for Ontology Reuse and Persistence Integrating UML and Sesame. Springer Berlin Heidelberg, Berlin, Heidelberg, 37–46. [64] Christoph Pinkel, Carsten Binnig, Peter Haase, Clemens Martin, Kunal Sengupta, and Johannes Trame. 2014. How to best find a partner? An evaluation of editing approaches to construct R2RML mappings. In European Semantic Web Conference. Springer, 675–690. [65] Christoph Pinkel, Andreas Schwarte, Johannes Trame, Andriy Nikolov, Ana Sasa Bastinos, and Tobias Zeuch. 2015. DataOps: seamless end-to-end anything-to-RDF data integration. In International Semantic Web Conference. Springer, 123–127. [66] Alessandro Raffio, Daniele Braga, Stefano Ceri, Paolo Papotti, and Mauricio A Hernandez. 2008. Clip: a visual language for explicit schema mappings. In Data Engineering, 2008. ICDE 2008. IEEE 24th International Conference on. IEEE, 30–39. [67] Siva Reddy, Oscar Täckström, Michael Collins, Tom Kwiatkowski, Dipanjan Das, Mark Steedman, and Mirella Lapata. 2016. Transforming dependency struc- tures to logical forms for semantic parsing. Transactions of the Association for Computational Linguistics 4 (2016), 127–140. [68] Jesús Barrasa Rodriguez and Asunción Gómez-Pérez. 2006. Upgrading relational legacy data to the semantic web. In Proceedings of the 15th international conference on World Wide Web. ACM, 1069–1070. [69] Mary Roth, Mauricio A Hernández, Phil Coulthard, L Yan, Lucian Popa, HC-T Ho, and CC Salter. 2006. XML mapping technology: Making connections in an XML-centric world. IBM Systems Journal 45, 2 (2006), 389–409. [70] Andy Seaborne, Damian Steer, and Stuart Williams. 2007. SQL-RDF. In W3C Workshop on RDF Access to Relational Databases. [71] Kunal Sengupta, Peter Haase, Michael Schmidt, and Pascal Hitzler. 2013. Editing R2RML mappings made easy. In Proceedings of the 2013th International Conference on Posters & Demonstrations Track-Volume 1035. CEUR-WS. 
org, 101–104. [72] Galina Tremper and Anette Frank. 2013. A discriminative analysis of fine-grained semantic relations including presupposition: Annotation and classification. Dia- logue & Discourse 4, 2 (2013), 282–322. [73] Linlin Wang, Zhu Cao, Gerard de Melo, and Zhiyuan Liu. 2016. Relation Classifi- cation via Multi-Level Attention CNNs.. In ACL (1). [74] Kyle Williams, Jian Wu, Sagnik Ray Choudhury, Madian Khabsa, and C Lee Giles. 2014. Scholarly big data information extraction and integration in the citeseer χ digital library. In Data Engineering Workshops (ICDEW), 2014 IEEE 30th International Conference on. IEEE, 68–73. [75] S Hassas Yeganeh, Oktie Hassanzadeh, and Renée J Miller. 2011. Linking semistructured data on the web. Interface (2011). [76] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent Neural Network Regularization. CoRR abs/1409.2329 (2014). http://arxiv.org/abs/1409. 2329