Porting an Open Information Extraction System from English to German
Total Page:16
File Type:pdf, Size:1020Kb
Porting an Open Information Extraction System from English to German Tobias Falke† Gabriel Stanovsky‡ Iryna Gurevych† Ido Dagan‡ †Research Training Group AIPHES and UKP Lab Computer Science Department, Technische Universitat¨ Darmstadt ‡Natural Language Processing Lab Department of Computer Science, Bar-Ilan University Abstract on English, with only a few recent exceptions (Zhila and Gelbukh, 2013; Gamallo and Garcia, 2015). For Many downstream NLP tasks can benefit from most languages, Open IE systems are still missing. Open Information Extraction (Open IE) as a While one could create them from scratch, as it was semantic representation. While Open IE sys- tems are available for English, many other done for Spanish, this can be a very laborious pro- languages lack such tools. In this paper, we cess, as state-of-the-art systems make use of hand- present a straightforward approach for adapt- crafted, linguistically motivated rules. Instead, an ing PropS, a rule-based predicate-argument alternative approach is to transfer the rule sets of analysis for English, to a new language, Ger- available systems for English to the new language. man. With this approach, we quickly obtain an Open IE system for German covering 89% of In this paper, we study whether an existing set the English rule set. It yields 1.6 n-ary extrac- of rules to extract Open IE tuples from English de- tions per sentence at 60% precision, making it pendency parses can be ported to another language. comparable to systems for English and readily We use German, a relatively close language, and the usable in downstream applications.1 PropS system (Stanovsky et al., 2016) as examples in our analysis. Instead of creating rule sets from scratch, such a transfer approach would simplify the 1 Introduction rule creation, making it possible to build Open IE The goal of Open Information Extraction (Open IE) systems for other languages with relatively low ef- is to extract coherent propositions from a sentence, fort in a short amount of time. However, challenges each represented as a tuple of a relation phrase and we need to address are differences in syntax, dis- one or more argument phrases (e.g., born in (Barack similarities in the corresponding dependency rep- Obama; Hawaii)). Open IE has been shown to be resentations as well as language-specific phenom- useful for a wide range of semantic tasks, including ena. Therefore, the existing rules cannot be directly question answering (Fader et al., 2014), summariza- mapped to the German part-of-speech and depen- tion (Christensen et al., 2013) and text comprehen- dency tags in a fully automatic way, but require a sion (Stanovsky et al., 2015), and has consequently careful analysis as carried out in this work. Similar drawn consistent attention over the last years (Banko manual approaches to transfer rule-based systems to et al., 2007; Wu and Weld, 2010; Fader et al., 2011; new languages were shown to be successful, e.g. Akbik and Loser,¨ 2012; Mausam et al., 2012; Del for temporal tagging (Moriceau and Tannier, 2014), Corro and Gemulla, 2013; Angeli et al., 2015). whereas fully automatic approaches led to less com- Although similar applications of Open IE in other petitive systems (Strotgen¨ and Gertz, 2015). languages are obvious, most previous work focused Our analysis reveals that a large fraction of the 1Source code and online demo available at PropS rule set can be easily ported to German, re- https://github.com/UKPLab/props-de quiring only small adaptations. With roughly 10% 892 Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 892–898, Austin, Texas, November 1-5, 2016. c 2016 Association for Computational Linguistics Sehenswert sind die Orte San Jose und San Andres, die an der nordlichen¨ Kuste¨ des Peten-Itz´ a-Sees´ liegen. subj mod prep an prop of conj und conj und mod poss Sehenswert Orte San Jose und San Andres nordlichen¨ Kuste¨ Peten-Itz´ a-Sees´ liegen Worth seeing towns San Jose and San Andres northern shore Lake Peten-Itz´ a´ located Extraction 1: liegen ( die Orte San Jose und San Andres ; an der nordlichen¨ Kuste¨ des Peten-Itz´ a-Sees´ ) Extraction 2: sehenswert ( die Orte San Jose und San Andres ) Figure 1: PropS representation for Worth seeing are the towns San Jose and San Andres, which are located on the northern shore of Lake Peten-Itz´ a´. Grey boxes indicate predicates. Two Open IE tuples, one unary and one binary, are extracted from this sentence. of the effort that went into the English system, we dependency parser that uses a common tagset for could build a system for German covering 89% of five European languages. However, an evaluation the rule set. As a result, we present PropsDE, the for English and Spanish revealed that this approach first Open IE system for German. In an intrinsic cannot compete with the systems specifically built evaluation, we show that its performance is compa- for those languages. To the best of our knowledge, rable with systems for English, yielding 1.6 extrac- no work on Open IE for German exists. tions per sentence with an overall precision of 60%. Open IE with PropS Stanovsky et al. (2016) 2 Background recently introduced PropS, a rule-based converter turning dependency graphs for English into typed Open Information Extraction Open IE was in- graphs of predicates and arguments. An example is troduced as an open variant of traditional Informa- shown in Figure 1 (in German). Compared to a de- tion Extraction (Banko et al., 2007). Since its in- pendency graph, the representation masks non-core ception, several extractors were developed. The syntactic details, such as tense or determiners, uni- majority of them, namely ReVerb (Fader et al., fies semantically equivalent constructions, such as 2011), KrakeN (Akbik and Loser,¨ 2012), Exem- active/passive, and explicates implicit propositions, plar (Mesquita et al., 2013) and ClausIE (Del Corro such as indicated by possessives or appositions. and Gemulla, 2013), successfully used rule-based The resulting graph can be used to extract Open strategies to extract tuples. Alternative approaches IE tuples in a straightforward way. Every non- are variants of self-supervision, as in TextRunner nested predicate node pred in the graph, together (Banko et al., 2007), WOE (Wu and Weld, 2010) and with its n argument-subgraphs argi, yields a tuple OLLIE (Mausam et al., 2012), and semantically- pred(arg1; ...; argn). With this approach, PropS is oriented approaches utilizing semantic role labeling most similar to KrakeN and ClausIE, applying rules (Open IE-42) or natural logic (Angeli et al., 2015). to a dependency parse. However, due to additional While TextRunner and ReVerb require only POS nodes for implicit predicates, it can also make ex- tagging as preprocessing to allow a high extraction tractions that go beyond the scope of other systems, speed, the other systems rely on dependency parsing such as has ( Michael; bicycle ) from Michael’s bicy- to improve the extraction precision. cle is red. In line with more recent Open IE systems, For non-English Open IE, ExtrHech has been pre- this strategy extracts tuples that are not necessarily sented for Spanish (Zhila and Gelbukh, 2013). Sim- binary, but can be unary or of higher arity. ilar as the English systems, it uses a set of extraction rules, specifically designed for Spanish in this case. 3 Analysis of Portability More recently, ArgOE (Gamallo and Garcia, 2015) was introduced. It manages to extract tuples in sev- Approach For each rule of the converter that eral languages with the same rule set, relying on a transforms a dependency graph to the PropS graph, we assess its applicability for German. A rule is ap- 2https://github.com/knowitall/openie plied to a part of the graph if certain conditions are 893 fulfilled, expressed using dependency types, POS Stanford dependencies, main verbs govern all auxil- tags and lemmas. As we already pointed out in iaries, whereas in TIGER dependencies, an auxiliary the introduction, several differences between the de- heads the main verb. The above example shows this pendency and part-of-speech representations for En- for gone and am. Therefore, all rules identifying and glish and German make a fully automatic translation removing auxiliaries and modals have to be adapted of these rules impossible. We therefore manually to account for this difference. analyzed the portability of each rule and report the With similar changes as discussed for determin- findings in the next section. ers, we can also handle possessive and copular con- While using Universal Dependencies (Nivre et al., structions. The graph for Michael’s bicycle is red, 2016) could potentially simplify porting the rules, for example, features an additional predicate have to we chose not to investigate this option due to the on- explicate the implicit possessive relation, while red going nature of the project and focused on the estab- becomes an adjectival predicate, omitting is: lished representations for now. In line with the En- obj subj prop of glish system, that works on collapsed Stanford de- poss pendencies (de Marneffe and Manning, 2008), we haben Michael Fahrrad rot assume a similar input representation for German have Michael bicycle red that can be obtained with a set of collapsing and Moreover, conditional constructions can be pro- propagation rules provided by Ruppert et al. (2015) cessed with slight changes as well. Missing a coun- for TIGER dependencies (Seeker and Kuhn, 2012). terpart for the type mark, we instead look for sub- ordinating conjunctions by part-of-speech. In fact, Findings Overall, we find that most rules can be we found conditionals to be represented more con- used for German, mainly because syntactic differ- sistently across different conjunctions, making their ences, such as freer word order (Kubler,¨ 2008), are handling in German easier than in English.