GenDR: A Generic Deep Realizer with Complex Lexicalization

François Lareau, Florie Lambrey, Ieva Dubinskaite, Daniel Galarreta-Piquette, Maryam Nejat
OLST, Département de linguistique et de traduction, Université de Montréal
C.P. 6128, succ. Centre-Ville, Montréal QC, H3C 3J7, Canada
francois.lareau, florie.lambrey, ieva.dubinskaite, daniel.galarreta-piquette, [email protected]

Abstract

We present a generic deep realizer called GenDR, which takes as input an abstract semantic representation of predicate-argument relations, and produces corresponding syntactic dependency structures in English, French, Lithuanian and Persian, with the possibility to fairly easily add more languages. It is generic in that it is designed to operate across a wide range of languages and applications, given the appropriate lexical resources. The focus is on the lexicalization of multiword expressions, with built-in rules to handle thousands of different cross-linguistic patterns of collocations (intensifiers, support verbs, causatives, etc.), and on rich paraphrasing, with the ability to produce many syntactically and lexically varied outputs from the same input. The system runs on a graph transducer, MATE (Bohnet et al., 2000; Bohnet and Wanner, 2010), and its grammar design is directly borrowed from MARQUIS (Lareau and Wanner, 2007; Wanner et al., 2009; Wanner et al., 2010), which we have trimmed down to its core and built upon. The grammar and demo dictionaries are distributed under a CC-BY-SA licence (http://bit.ly/2x8xGVO). This paper explains the design of the grammar, how multiword expressions (especially collocations) are dealt with, and how the syntactic structure is derived from the relative communicative salience of the meanings involved.

Keywords: multilingual natural language generation, deep realization, lexicalization, multiword expressions, collocations

1. Introduction
Natural language generation (NLG) is one of the rare tasks in natural language processing (NLP) that is not yet completely dominated by statistical or neuronal methods. We believe the main reason for this is that for real-life applications, the output of an NLG system must often be flawless, so that extremely high precision is required, and rule-based methods still outperform other approaches in that respect. A number of domain-independent text realizers have been developed to provide an easier way of producing text automatically. While some have impressive coverage, many expect as input a syntactic structure, thus offering very little flexibility in terms of lexical choice and structure. Others take a more abstract input, offering more syntactic flexibility, but their lexicalization model is rather rigid. We propose a multilingual generic deep realizer, GenDR, that is a platform for the modeling of the semantics-syntax interface in languages. In this paper, we will show how GenDR deals with two crucial tasks: arborization (building a syntactic structure that reflects the semantic structure) and lexicalization (picking the right words to express the desired meanings).

2. Previous work

Most realizers that we know of expect an input where both lexical choice and syntactic structure have already been computed, leaving the user with two particularly complex tasks. This is the case of FUF/SURGE (Elhadad, 1993; Elhadad and Robin, 1996), RealPro (Lavoie and Rambow, 1997; CoGenTex, 1998), SimpleNLG (Gatt and Reiter, 2009), its bilingual version, SimpleNLG-EnFr (Vaudry and Lapalme, 2013), and its Spanish version, SimpleNLG-ES (Ramos-Soto et al., 2017), JSReal (Daoust and Lapalme, 2015) and its bilingual version, JSRealB (Molins and Lapalme, 2015), as well as ATML3 (Weißgraeber and Madsack, 2017). KPML (Bateman, 1996) and OpenCCG (White, 2008) both start from a more abstract representation of the text's meaning, but they tend to focus on the grammar more than the lexicon, resulting in well-formed sentences that somehow lack lexical flexibility. More recently, statistical approaches have been applied to text generation from logical forms (Basile, 2015) or semantic structures (Mille, 2014), but again, lexical choice is rather rigid. All of these realizers have a hard time producing collocations.

MARQUIS (Lareau and Wanner, 2007; Wanner and Lareau, 2009; Wanner et al., 2010) was a multilingual data-to-text system for the air quality domain that addressed the issues that we are concerned with, and its linguistic realization component has been reused in a couple of projects from different domains, namely patents (Wanner et al., 2009; Wanner et al., 2011) and football (Bouayad-Agha et al., 2011; Bouayad-Agha et al., 2012). Its lexicalization model was designed to produce natural-sounding collocations and multiword expressions, and to be as generic as possible. However, the range of collocations it was able to produce was limited to the most common patterns. The system we present here extends this coverage by a very large margin.

3. GenDR's architecture

GenDR runs on MATE, a graph transducer (Bohnet et al., 2000; Bohnet and Wanner, 2010). It consists of a graph grammar and some dictionaries, which we discuss below.

3.1. Input and output structures

The input to GenDR is a semantic structure à la Meaning-Text Theory (MTT) (Mel'čuk, 2012), which is a graph representation of a first-order logical form where predicates are linked to their arguments by relations labelled with numbers indicating the arguments' position in the predicate. One of the meanings in the structure has to be flagged as the most salient meaning in the sentence. It is the dominant node of the sentence's rheme/focus (Mel'čuk, 2001) and it will be mapped to the syntactic root.
(1) 'owe'( 'Paul', '$500k', 'bank' )

The semantic structure in (1) is a simplified visualization of the actual input structure, which is as follows:

    structure Sem debt {
      S {
        owe {
          tense=PRES
          1-> Paul {class=proper_noun}
          2-> "$500K" {class=amount}
          3-> bank {number=SG definiteness=DEF}}
        main-> owe}}
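As a rough sketch, the shape of such an input can be modelled in plain code. The following is hypothetical Python, not GenDR's actual MATE syntax: predicates point to their arguments via numbered relations, and one node is flagged as the main (most salient) meaning.

```python
from dataclasses import dataclass, field

@dataclass
class SemNode:
    """One meaning in the semantic graph (illustrative, not GenDR's format)."""
    meaning: str                              # e.g. 'owe'
    features: dict = field(default_factory=dict)
    args: dict = field(default_factory=dict)  # argument position -> SemNode

# The structure in (1):
paul = SemNode('Paul', {'class': 'proper_noun'})
amount = SemNode('$500K', {'class': 'amount'})
bank = SemNode('bank', {'number': 'SG', 'definiteness': 'DEF'})
owe = SemNode('owe', {'tense': 'PRES'}, {1: paul, 2: amount, 3: bank})

main = owe  # dominant node of the rheme, mapped to the syntactic root
```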
The output of the system is a set of surface syntactic dependency structures (Mel'čuk, 1988). For example, with the input in (1), GenDR produces the six structures in (2).

(2) a. Paul owes $500k to the bank
    b. Paul owes a debt of $500k to the bank
    c. Paul has a debt of $500k to the bank
    d. Paul's debt to the bank amounts to $500k
    e. Paul's debt to the bank stands at $500k
    f. Paul's debt to the bank totals $500k

For better readability, the sentences shown here are fully inflected and linearized, but the actual outputs are unordered trees with lemmas and grammatical features, with surface dependency relations such as subj, dobj, iobj, prep, det, ncomp and obl. They are to be fed to a surface realizer that computes the inflected forms and word order. GenDR focuses on the deeper tasks of lexicalization and arborization, and as one can see from this example, it offers a high level of lexico-syntactic flexibility.

Between the semantic and surface syntactic levels of representation, there is a third, intermediate level: the deep syntactic structure (Mel'čuk, 1988). At this level, there are only meaningful lexemes and support verbs, linked by two major types of relations: complementation (labelled with Roman numerals) and modification (labelled ATTR). Again, these structures are ordered here only for readability.

3.2. Multilingual grammar

At its core, GenDR is a multilingual grammar. We have directly borrowed the core rules from MARQUIS (Lareau and Wanner, 2007; Wanner et al., 2009; Wanner et al., 2010), which dealt with Catalan, English, French, Polish, Portuguese and Spanish. We kept only the most basic rules, which describe very general phenomena like simple lexicalization, complementation, modification, etc. Almost all of these rules are shared across languages, while a few language-specific rules model grammatical phenomena like auxiliaries, determiners and so on.

The mapping between semantic graphs and surface syntactic structures takes place in two steps, each handled by a different module of the grammar: the semantic module maps the input semantic structures onto deep syntactic structures (Mel'čuk, 2013), while the syntactic module maps deep syntactic structures to surface syntactic ones (Mel'čuk, 1988) – a layered architecture directly borrowed from MTT (Mel'čuk, 1973; Kahane, 2003; Milićević, 2006).

The semantic module contains 21 core rules, most of which were adapted from MARQUIS, and 132 bound lexicalization rules (see §5.4) implementing lexical functions; they are documented in (Lambrey and Lareau, 2015; Lambrey, 2016). The syntactic module is much lighter, with only 20 rules, 12 of which are language-independent. Each rule models a linguistic phenomenon, and the grammar relies heavily on rich dictionaries.
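The two-step mapping can be sketched as a pair of graph-rewriting passes. The rule tables below are toy stand-ins for the real MATE grammar rules, and the relation labels on the surface side are one plausible assignment for the example in (1); this is an illustration of the layered architecture, not GenDR's actual rule set.

```python
# Toy two-module pipeline: semantic graph -> deep syntax -> surface syntax.
ROMAN = {1: 'I', 2: 'II', 3: 'III'}                    # numbered args -> Roman numerals
SURFACE = {'I': 'subj', 'II': 'dobj', 'III': 'iobj'}   # deep -> surface relations

def semantic_module(sem_edges):
    """Map numbered argument relations to deep-syntactic relations."""
    return [(gov, ROMAN[rel], dep) for gov, rel, dep in sem_edges]

def syntactic_module(deep_edges):
    """Map deep-syntactic relations to surface-syntactic ones."""
    return [(gov, SURFACE[rel], dep) for gov, rel, dep in deep_edges]

# The predicate-argument edges of (1):
sem = [('owe', 1, 'Paul'), ('owe', 2, '$500k'), ('owe', 3, 'bank')]
surf = syntactic_module(semantic_module(sem))
# The first edge becomes ('owe', 'subj', 'Paul'), as in (2a).
```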
4. Arborization

As shown above, one node in the input structure has to be marked as the main node, i.e., the most salient. This is because a semantic graph has no inherent hierarchy. Consider for instance the structure in (4).

(4) 'lie'( 'cat', 'sofa' ), 'small'( 'cat' ), 'red'( 'sofa' )

Any of the predicates 'small', 'lie' and 'red' can serve as a root. In a tree, however, there can only be one root, so we have to pick one. Picking 'small' yields The cat lying on the red sofa is small, while picking 'lie' produces The small cat is lying on the red sofa, and 'red', The sofa on which the small cat is lying is red. This choice can be left to GenDR, but typically one wants to control the communicative organization of the sentence by specifying the main node. The deep syntactic tree is then built in a top-down fashion, starting from this main node and using the semantic graph as a blueprint. This algorithm is inspired by (Wanner and Bateman, 1990; Wan-
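The top-down derivation described above can be sketched as a traversal that grows a tree from the chosen main node, walking predicate-argument edges in either direction. This is a hypothetical illustration of the idea, not GenDR's actual algorithm.

```python
def arborize(edges, main):
    """Return a parent -> children map rooted at `main`, built from an
    undirected view of (predicate, position, argument) triples."""
    neighbours = {}
    for pred, _, arg in edges:
        neighbours.setdefault(pred, set()).add(arg)
        neighbours.setdefault(arg, set()).add(pred)
    tree, seen, stack = {}, {main}, [main]
    while stack:
        node = stack.pop()
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:          # each node gets exactly one parent
                seen.add(nxt)
                tree.setdefault(node, []).append(nxt)
                stack.append(nxt)
    return tree

# The graph in (4): 'small'('cat'), 'lie'('cat', 'sofa'), 'red'('sofa').
g = [('small', 1, 'cat'), ('lie', 1, 'cat'),
     ('lie', 2, 'sofa'), ('red', 1, 'sofa')]
# Rooting at 'lie' gives "The small cat is lying on the red sofa";
# rooting at 'small' instead makes the rest of the graph depend on 'small'.
```

Choosing a different main node reshapes the whole tree, which is why the communicative organization of the sentence hinges on this single flag in the input.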