Tulipa: Towards a Multi-Formalism Parsing Environment for Grammar
Total Page:16
File Type:pdf, Size:1020Kb
TuLiPA: Towards a Multi-Formalism Parsing Environment for Grammar Engineering Laura Kallmeyer Timm Lichte Wolfgang Maier SFB 441 SFB 441 SFB 441 Universit¨at T¨ubingen Universit¨at T¨ubingen Universit¨at T¨ubingen D-72074, T¨ubingen, Germany D-72074, T¨ubingen, Germany D-72074, T¨ubingen, Germany [email protected] [email protected] [email protected] Yannick Parmentier Johannes Dellert Kilian Evang CNRS - LORIA SFB 441 - SfS SFB 441 - SfS Nancy Universit´e Universit¨at T¨ubingen Universit¨at T¨ubingen F-54506, Vandœuvre, France D-72074, T¨ubingen, Germany D-72074, T¨ubingen, Germany [email protected] {jdellert,kevang}@sfs.uni-tuebingen.de Abstract sharing of resources (e.g., having a common lex- icon, from which different features would be ex- In this paper, we present an open-source tracted depending on the target formalism). parsing environment (T¨ubingen Linguistic In this context, we present a parsing environ- Parsing Architecture, TuLiPA) which uses ment relying on a general architecture that can Range Concatenation Grammar (RCG) be used for parsing with mildly context-sensitive as a pivot formalism, thus opening the (MCS) formalisms1 (Joshi, 1987). Its underly- way to the parsing of several mildly ing idea is to use Range Concatenation Grammar context-sensitive formalisms. This en- (RCG) as a pivot formalism, for RCG has been vironment currently supports tree-based shown to strictly include MCS languages while be- grammars (namely Tree-Adjoining Gram- ing parsable in polynomial time (Boullier, 2000). mars (TAG) and Multi-Component Tree- Currently, this architecture supports tree-based Adjoining Grammars with Tree Tuples grammars (Tree-Adjoining Grammars and Multi- (TT-MCTAG)) and allows computation Component Tree-Adjoining Grammars with Tree not only of syntactic structures, but also Tuples (Lichte, 2007)). More precisely, tree- of the corresponding semantic representa- based grammars are first converted into equivalent tions. It is used for the development of a RCGs, which are then used for parsing. The result tree-based grammar for German. of RCG parsing is finally interpreted to extract a derivation structure for the input grammar, as well 1 Introduction as to perform additional processings (e.g., seman- tic calculus, extraction of dependency views). Grammars and lexicons represent important lin- guistic resources for many NLP applications, The paper is structured as follows. In section 2, among which one may cite dialog systems, auto- we present the architecture of the TuLiPA pars- matic summarization or machine translation. De- ing environment and show how the use of RCG veloping such resources is known to be a complex as a pivot formalism makes it easier to design a task that needs useful tools such as parsers and modular system that can be extended to support generators (Erbach, 1992). several dimensions (syntax, semantics) and/or for- malisms. In section 3, we give some desiderata for Furthermore, there is a lack of a common frame- grammar engineering and present TuLiPA’s cur- work allowing for multi-formalism grammar engi- neering. Thus, many formalisms have been pro- 1A formalism is said to be mildly context sensitive (MCS) posed to model natural language, each coming iff (i) it generates limited cross-serial dependencies, (ii) it is with specific implementations. Having a com- polynomially parsable, and (iii) the string languages gener- ated by the formalism have the constant growth property (e.g., mon framework would facilitate the comparison n {a2 |n ≥ 0} does not have this property). Examples of MCS between formalisms (e.g., in terms of parsing com- formalisms include Tree-Adjoining Grammars, Combinatory plexity in practice), and would allow for a better Categorial Grammars and Linear Indexed Grammars. rent state with respect to these. In section 4, we tures labelling nodes. As a result of these unifica- compare this system with existing approaches for tions, the arguments of the semantic formulas are parsing and more generally for grammar engineer- unified (see Fig. 1). ing. Finally, in section 5, we conclude by present- ing future work. S NP↓x VP 2 Range Concatenation Grammar as a ↓y pivot formalism NPj VNP NPm John loves Mary The main idea underlying TuLiPA is to use RCG as a pivot formalism for RCG has appealing formal name(j,john) love(x,y) name(m,mary) properties (e.g., a generative capacity lying be- ; love(j,m),name(j,john),name(m,mary) yond Linear Context Free Rewriting Systems and a polynomial parsing complexity) and there ex- Figure 1: Semantic calculus in Feature-Based ist efficient algorithms, for RCG parsing (Boullier, TAG. 2000) and for grammar transformation into RCG (Boullier, 1998; Boullier, 1999). In our system, the semantic support has been in- Parsing with TuLiPA is thus a 3-step process: tegrated by (i) extending the internal tree objects to include semantic formulas (the RCG-conversion is 1. The input tree-based grammar is converted kept unchanged), and (ii) extending the construc- into an RCG (using the algorithm of tion of the derived tree (step 3) so that during the Kallmeyer and Parmentier (2008) when deal- interpretation of the RCG derivation in terms of ing with TT-MCTAG). tree combinations, the semantic formulas are car- ried and updated with respect to the feature unifi- 2. The resulting RCG is used for parsing the in- cations performed. put string using an extension of the parsing Secondly, let us consider lexical disambigua- algorithm of Boullier (2000). tion. Because of the high redundancy lying within 3. The RCG derivation structure is interpreted to lexicalized formalisms such as lexicalized TAG, extract the derivation and derived trees with it is common to consider tree schemata having a respect to the input grammar. frontier node marked for anchoring (i.e., lexical- ization). At parsing time, the tree schemata are The use of RCG as a pivot formalism, and thus anchored according to the input string. This an- of an RCG parser as a core component of the sys- choring selects a subgrammar supposed to cover tem, leads to a modular architecture. In turns, this the input string. Unfortunately, this subgrammar makes TuLiPA more easily extensible, either in may contain many trees that either do not lead to terms of functionalities, or in terms of formalisms. a parse or for which we know apriorithat they cannot be combined within the same derivation 2.1 Adding functionalities to the parsing (so we should not predict a derivation from one environment of these trees to another during parsing). As a re- As an illustration of TuLiPA’s extensibility, one sult, the parser could have poor performance be- may consider two extensions applied to the system cause of the many derivation paths that have to be recently. explored. Bonfante et al. (2004) proposed to polar- First, a semantic calculus using the syn- ize the structures of the grammar, and to apply an tax/semantics interface for TAG proposed by Gar- automaton-based filtering of the compatible struc- dent and Kallmeyer (2003) has been added. This tures. The idea is the following. One compute po- interface associates each tree with flat semantic larities representing the needs/resources brought formulas. The arguments of these formulas are by a given tree (or tree tuple for TT-MCTAG). unification variables, which are co-indexed with A substitution or foot node with category NP re- features labelling the nodes of the syntactic tree. flects a need for an NP (written NP-). In the same During classical TAG derivation, trees are com- way, an NP root node reflects a resource of type bined, triggering unifications of the feature struc- NP (written NP+). Then you build an automaton whose edges correspond to trees, and states to po- To sum up, the idea would be to keep the core larities brought by trees along the path. The au- RCG parser, and to extend TuLiPA with a spe- tomaton is then traversed to extract all paths lead- cific conversion module for each targeted formal- ing to a final state with a neutral polarity for each ism. On top of these conversion modules, one category and +1 for the axiom (see Fig. 2, the state should also provide interpretation modules allow- 7 is the only valid state and {proper., trans., det., ing to decode the RCG derivation forest in terms noun.} the only compatible set of trees). of the input formalism (see Fig. 3). 0 John 11eats 22a 33cake 4 det. noun. 2 4 6 intrans. S+ S+ S+ NP+ proper. 0 1 trans. NP+ det. noun. 3 5 7 S+ NP- S+ NP- S+ Figure 3: Towards a multi-formalism parsing en- vironment. Figure 2: Polarity-based lexical disambiguation. An important point remains to be discussed. It In our context, this polarity filtering has been concerns the role of lexicalization with respect to added before step 1, leaving untouched the core the formalism used. Indeed, the tree-based gram- RCG conversion and parsing steps. The idea is mar formalisms currently supported (TAG and TT- to compute the sets of compatible trees (or tree MCTAG) both share the same lexicalization pro- tuples for TT-MCTAG) and to convert these sets cess (i.e., tree anchoring). Thus the lexicon format separately. Indeed the RCG has to encode only is common to these formalisms. As we will see valid adjunctions/substitutions. Thanks to this below, it corresponds to a 2-layer lexicon made of automaton-based “clustering” of the compatible inflected forms and lemma respectively, the latter tree (or tree tuples), we avoid predicting incompat- selecting specific grammatical structures. When ible derivations. Note that the time saved by using parsing other formalisms, it is still unclear whether a polarity-based filter is not negligible, especially one can use the same lexicon format, and if not 2 when parsing long sentences.