The Excitement Open Platform for Textual Inferences
Total Page:16
File Type:pdf, Size:1020Kb
The Excitement Open Platform for Textual Inferences Bernardo Magnini∗, Roberto Zanoli∗, Ido Dagan◦, Kathrin Eichler†,Gunter¨ Neumann†, Tae-Gil Noh‡, Sebastian Pado‡, Asher Stern◦, Omer Levy◦ ∗FBK (magnini|[email protected]) ‡Heidelberg, Stuttgart Univ. (pado|[email protected]) †DFKI (neumann|[email protected]) ◦Bar Ilan University (dagan|sterna3|[email protected]) Abstract and finally designed and implemented a rich envi- ronment to support open source distribution. This paper presents the Excitement Open The goal of the platform is to provide function- Platform (EOP), a generic architecture and ality for the automatic identification of entailment a comprehensive implementation for tex- relations among texts. The EOP is based on a modu- tual inference in multiple languages. The lar architecture with a particular focus on language- platform includes state-of-art algorithms, independent algorithms. It allows developers and a large number of knowledge resources, users to combine linguistic pipelines, entailment al- and facilities for experimenting and test- gorithms and linguistic resources within and across ing innovative approaches. The EOP is languages with as little effort as possible. For ex- distributed as an open source software. ample, different entailment decision approaches can share the same resources and the same sub- 1 Introduction components in the platform. A classification-based algorithm can use the distance component of an In the last decade textual entailment (Dagan et al., edit-distance based entailment decision approach, 2009) has been a very active topic in Computa- and two different approaches can use the same set tional Linguistics, providing a unifying framework of knowledge resources. Moreover, the platform for textual inference. Several evaluation exercises has various multilingual components for languages have been organized around Recognizing Textual like English, German and Italian. The result is an Entailment (RTE) challenges and many method- ideal software environment for experimenting and ologies, algorithms and knowledge resources have testing innovative approaches for textual inferences. been proposed to address the task. However, re- The EOP is distributed as an open source software2 search in textual entailment is still fragmented and and its use is open both to users interested in using there is no unifying algorithmic framework nor inference in applications and to developers willing software architecture. to extend the current functionalities. In this paper, we present the Excitement Open The paper is structured as follows. Section 2 Platform (EOP), a generic architecture and a com- presents the platform architecture, highlighting prehensive implementation for multilingual textual how the EOP component-based approach favors inference which we make available to the scien- interoperability. Section 3 provides a picture of tific and technological communities. To a large the current population of the EOP in terms of both extent, the idea is to follow the successful experi- entailment algorithms and knowledge resources. ence of the Moses open source platform (Koehn et Section 4 introduces expected use cases of the plat- al., 2007) in Machine Translation, which has made form. Finally, Section 5 presents the main features a substantial impact on research in that field. The of the open source package. EOP is the result of a two-year coordinated work under the international project EXCITEMENT.1 A 2 Architecture consortium of four academic partners has defined The EOP platform takes as input two text portions, the EOP architectural specifications, implemented the first called the Text (abbreviated with T), the the functional interfaces of the EOP components, second called the Hypothesis (abbreviated with H). imported existing entailment engines into the EOP 2http://hltfbk.github.io/ 1http://www.excitement-project.eu Excitement-Open-Platform/ 43 Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 43–48, Baltimore, Maryland USA, June 23-24, 2014. c 2014 Association for Computational Linguistics Common Analysis Structure (CAS) (Gurevych et Linguis'c)Analysis) al., 2007; Noh and Pado,´ 2013). Pipeline)(LAP)) 2.2 Entailment Core (EC) Raw)Data) Linguis'c) Analysis)Components) The Entailment Core performs the actual entail- ment recognition based on the preprocessed text made by the Linguistic Analysis Pipeline. It con- Entailment)Decision)) Decision) Algorithm)(EDA)) sists of one or more Entailment Decision Algo- rithms (EDAs) and zero or more subordinate com- Entailment)Core)(EC)) ponents. An EDA takes an entailment decision Dynamic)and)Sta'c)Components) (i.e., ”entailment” or ”no entailment”) while com- (Algorithms)and)Knowledge)) ponents provide static and dynamic information for the EDA. Entailment Decision Algorithms are at the top Figure 1: EOP architecture level in1) the EC. They compute an entailment deci- sion for a given Text/Hypothesis (T/H) pair, and The output is an entailment judgement, either ”En- can use components that provide standardized al- tailment” if T entails H, or ”NonEntailment” if the gorithms or knowledge resources. The EOP ships relation does not hold. A confidence score for the with several EDAs (cf. Section 3). decision is also returned in both cases. Scoring Components accept a Text/Hypothesis The EOP architecture (Pado´ et al., 2014) is based pair as an input, and return a vector of scores. on the concept of modularization with pluggable Their output can be used directly to build minimal and replaceable components to enable extension classifier-based EDAs forming complete RTE sys- and customization. The overall structure is shown tems. An extended version of these components are in Figure 1 and consists of two main parts. The the Distance Components that can produce normal- Linguistic Analysis Pipeline (LAP) is a series of ized and unnormalized distance/similarity values linguistic annotation components. The Entailment in addition to the score vector. Core (EC) performs the actual entailment recog- nition. This separation ensures that (a) the com- Annotation Components can be used to add dif- ponents in the EC only rely on linguistic analysis ferent annotations to the Text/Hypothesis pairs. An in well-defined ways and (b) the LAP and EC can example of such a type of component is one that be run independently of each other. Configuration produces word or phrase alignments between the files are the principal means of configuring the EOP. Text and the Hypothesis. In the rest of this section we first provide an intro- Lexical Knowledge Components describe se- duction to the LAP, then we move to the EC and mantic relationships between words. In the finally describe the configuration files. EOP, this knowledge is represented as directed rules made up of two word–POS pairs, where 2.1 Linguistic Analysis Pipeline (LAP) the LHS (left-hand side) entails the RHS (right- The Linguistic Analysis Pipeline is a collection of hand side), e.g., (shooting star, Noun) = ⇒ annotation components for Natural Language Pro- (meteorite, Noun). Lexical Knowledge Compo- cessing (NLP) based on the Apache UIMA frame- nents provide an interface that allows for (a) listing work.3 Annotations range from tokenization to all RHS for a given LHS; (b) listing all LHS for part of speech tagging, chunking, Named Entity a given RHS; and (c) checking for an entailment Recognition and parsing. The adoption of UIMA relation for a given LHS–RHS pair. The interface enables interoperability among components (e.g., also wraps all major lexical knowledge sources cur- substitution of one parser by another one) while rently used in RTE research, including manually ensuring language independence. Input and output constructed ontologies like WordNet, and encyclo- of the components are represented in an extended pedic resources like Wikipedia. version of the DKPro type system based on UIMA Syntactic Knowledge Components capture en- 3http://uima.apache.org/ tailment relationships between syntactic and 44 lexical-syntactic expressions. We represent such In the EOP we include a transformation based relationships by entailment rules that link (option- inference system that adopts the knowledge based ally lexicalized) dependency tree fragments that transformations of Bar-Haim et al. (2007), while in- can contain variables as nodes. For example, the corporating a probabilistic model to estimate trans- rule fall of X = X falls, or X sells Y to Z = formation confidences. In addition, it includes a ⇒ ⇒ Z buys Y from X express general paraphrasing pat- search algorithm which finds an optimal sequence terns at the predicate-argument level that cannot be of transformations for any given T/H pair (Stern et captured by purely lexical rules. Formally, each al., 2012). syntactic rule consists of two dependency tree frag- ments plus a mapping from the variables of the Edit distance EDA involves using algorithms LHS tree to the variables of the RHS tree.4 casting textual entailment as the problem of map- ping the whole content of T into the content of H. 2.3 Configuration Files Mappings are performed as sequences of editing The EC components can be combined into actual operations (i.e., insertion, deletion and substitu- inference engines through configuration files which tion) on text portions needed to transform T into H, contain information to build a complete inference where each edit operation has a cost associated with engine. A configuration file completely describes it. The underlying