f-align: An Open-Source Alignment Tool for LFG f-Structures

Anton Bryl, Josef van Genabith
CNGL, School of Computing, Dublin City University
[email protected], [email protected]

Abstract

Lexical-Functional Grammar (LFG) f-structures (Kaplan and Bresnan, 1982) have attracted some attention in recent years as an intermediate data representation for statistical machine translation. So far, however, there are no alignment tools capable of aligning f-structures directly, and plain word-alignment tools are used for this purpose. In this way no use is made of the structural information contained in f-structures. We present the first version of a specialized open-source f-structure alignment software tool.

1 Introduction

The use of LFG f-structures in transfer-based statistical machine translation naturally requires alignment techniques as a prerequisite for transfer rule induction. The existing research (Riezler and Maxwell, 2006; Avramidis and Kuhn, 2009; Graham and van Genabith, 2009) uses general-purpose word-alignment tools such as GIZA++ by Och et al. (1999) for aligning the f-structures. The general-purpose alignment tools, however, take no advantage of the dependencies which are made explicit in the f-structures, thus ignoring a lot of useful information readily available in the f-structure annotated data. This paper focuses on a way of making use of precisely this information.

A relevant algorithm was proposed by Meyers et al. (1998), who represent sentences with trees similar to f-structures. Their algorithm aligns a pair of trees in a recursive bottom-up procedure, ensuring that, somewhat simplifying, if a node A is aligned to a node B, the descendant nodes of A are aligned to the descendant nodes of B. The main reason we did not adapt this algorithm for our task is the difficulty of generalizing it. In the general case an f-structure is not a tree but a directed acyclic graph (DAG). While the algorithm of Meyers et al. aligns two trees with n nodes and maximum degree d in O(n²d²) time, we see no straightforward way of adapting it to DAGs without increasing complexity substantially. Another issue not to be ignored is that the output of f-structure parsers (Kaplan et al., 2002; Cahill et al., 2004) is often fragmented. Unlike incorrect parses, fragmented parses do carry useful information, and their exclusion is undesirable.

The extensive existing work on phrase-structure tree alignment, starting with the work by Kaji et al. (1992) and proceeding to a number of more recent approaches (Ambati and Lavie, 2008; Zhechev, 2009), is also not straightforward to reuse, as LFG f-structures represent sentences in a way quite different from phrase-structure trees, in particular having all internal nodes, and not only leaves, (potentially) lexicalized; not to mention again that, in general, f-structures are graphs and not necessarily trees.¹

¹ Phrase-structure trees are used in another layer of LFG, namely in c-structure.

Therefore we decided to design a new algorithm. For the sake of computational speed, we keep the structure-related part of the algorithm as simple as possible. As measures of "structural closeness" between two nodes we propose to use the best lexical match between their children and the best lexical match between their parents. These measures are, on the one hand, simple and efficient to calculate, allowing a greedy alignment of two DAGs, irrespective of whether they are fragmented or not, to be built in O(n²d²) + O(n² ln n) time, which is comparable to the complexity of the Meyers et al. (1998) algorithm for trees. On the other hand, these measures are sufficient for resolving simple ambiguities, such as several occurrences of the same word in the same sentence (a very frequent situation if we take function words into account). A similar idea underlies the work by Watanabe et al. (2000) on resolving the alignment ambiguities which arise when using a translation dictionary; their method checks how well the neighbors of a word match the neighbors of each of its candidate counterparts.

We use a very simple bilexical dictionary unit (a plain co-occurrence counter on the training bitext) in this version of the software, but nothing prevents using more elaborate dictionary units within the same architecture. The software supports the data formats of two different LFG parsers, namely XLE (Kaplan et al., 2002) and DCU (Cahill et al., 2004).

The evaluation of the tool presented here is inevitably limited. We use sentence-aligned English and German Europarl (Koehn, 2005) and SMULTRON 2.0 (Volk et al., 2009) data. To the best of our knowledge, there is no relevant gold standard, so we produced a small gold-standard set ourselves, manually node-aligning 20 f-structure pairs created from SMULTRON data, using the word alignments contained in SMULTRON for reference. It would also be possible to evaluate the method within an SMT system, but the available LFG-based SMT software is currently not accurate enough for the alignment to be reliably evaluated through translation scores. We also manually examined and analyzed some sentences from the output. The evaluation, even though limited, supports the validity of the core idea of the method and suggests ways of further improvement.

The paper is organized as follows. In Section 2 we explain the algorithm; Section 3 is dedicated to the resulting software tool; Section 4 details the evaluation; in Section 5 we present our conclusions and outline directions for further improvement.

2 The Alignment Algorithm

The algorithm requires each sentence to be represented as a graph (possibly disconnected) with a lexical item associated with each node. If an f-structure node is not lexicalized (e.g. it is a SPEC node with a DET as its child), it is removed and its children are linked directly to its parents.

2.1 Composite Alignment Score

The composite alignment score is defined by a simple scoring formula which makes some use of structural information and incurs reasonable computational cost.

Let Lex(A, A′) be a measure of the lexical closeness of the words associated with the nodes A and A′. Such a measure may either come from a dictionary, as in the work by Meyers et al. (1998), or, as in our experiments, be extracted in some way from the data.

In addition to the lexical score, and on the basis of it, we calculate two supplementary scores:

1. The score of the best lexical match between the children of A and A′. If both nodes have children, the score is calculated as follows:

    Sc(A, A′) = max_{B ∈ C(A), B′ ∈ C(A′)} Lex(B, B′),    (1)

where C(·) is the set of children of the node. If only one of the two nodes has children, Sc(A, A′) = 0; if neither of the two nodes has children, Sc(A, A′) = 1.

2. The score of the best lexical match between the parents of A and A′. If both nodes have parents, the score is calculated as follows:

    Sp(A, A′) = max_{B ∈ P(A), B′ ∈ P(A′)} Lex(B, B′),    (2)

where P(·) is the set of parents of the node (all nodes with which the node is connected by incoming edges). If only one of the two nodes has parents, Sp(A, A′) = 0; if neither of the two nodes has parents, Sp(A, A′) = 1.

To allow for some differences in structure, an extended children matching score can be calculated: in this case the best match for each child of A is searched for also among the "grandchildren" (children of children) of A′, and vice versa. Matches between the grandchildren of A and the grandchildren of A′ are not considered. In the same way, an extended parent matching score can be calculated.
The composite score is a weighted sum of the lexical score and the supplementary scores:

    S(A, A′) = wc·Sc(A, A′) + wp·Sp(A, A′) + (1 − wc − wp)·Lex(A, A′)    (3)

The weights wc and wp may vary, with larger values corresponding to greater reliance on structural information.

Let us consider aligning two DAGs with at most n lexicalized nodes in each, and with at most d children and at most d parents for each node. It is easy to see that (3) is calculated in O(d²) time, and therefore the scores for all possible pairs are calculated in O(n²d²) time. Building a greedy alignment using the scores takes O(n² ln n) time, the most costly operation being the sorting procedure; so the overall complexity of a greedy alignment of two f-structures is O(n²d²) + O(n² ln n). Heuristics can be used to further improve the accuracy.
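To make the definitions concrete, the two supplementary scores and the composite score of Eq. (3) can be sketched as follows. This is an illustrative Python sketch, not the authors' C++ implementation; the dict-based node representation and the function names are assumptions made for the example.

```python
# Minimal sketch (assumed representation, not the authors' C++ code) of
# the scores in Eqs. (1)-(3).  A node is a dict holding its own lexical
# item and the lexical items of its children and parents; `lex` is any
# Lex(., .) function returning values in [0, 1].

def best_match(items_a, items_b, lex):
    """Eqs. (1)/(2): best lexical match between two sets of items.
    Returns 1.0 if both sets are empty, 0.0 if exactly one is empty."""
    if not items_a and not items_b:
        return 1.0
    if not items_a or not items_b:
        return 0.0
    return max(lex(a, b) for a in items_a for b in items_b)

def composite_score(node_a, node_b, lex, wc=0.2, wp=0.2):
    """Eq. (3): weighted sum of the structural and lexical scores."""
    sc = best_match(node_a["children"], node_b["children"], lex)  # Eq. (1)
    sp = best_match(node_a["parents"], node_b["parents"], lex)    # Eq. (2)
    return wc * sc + wp * sp + (1 - wc - wp) * lex(node_a["lex"], node_b["lex"])
```

Note that with wc = wp = 0.2, a perfect child or parent match contributes at most 0.2 to the composite score, so the lexical score still dominates.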
2.2 The Alignment Procedure for a Pair of DAGs

Given two lexicalized DAGs with n and n′ nodes, the alignment procedure establishes min(n, n′) one-to-one matches, maximizing the product of the composite alignment scores of the matches. It:

• calculates the composite alignment scores for each possible pair (A, A′);

• finds a candidate one-to-one match with a greedy algorithm;

• runs a 2-opt heuristic: on each step the heuristic checks whether switching the right-hand sides of any two pairs in the alignment, or replacing one side of any pair with a node not yet aligned to anything, improves the result. The heuristic is repeated until it fails to improve the result further.
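The procedure above can be sketched roughly as follows. This is an illustrative Python reconstruction, not the tool's C++ code; for brevity the 2-opt step implements only the pair-swap move and omits replacing one side of a pair with a not-yet-aligned node.

```python
# Illustrative sketch of the alignment procedure for one DAG pair.
# `score[i][j]` holds the composite score S for node i of the first
# DAG and node j of the second DAG.

def greedy_alignment(score):
    """Greedily pick one-to-one matches in order of descending score."""
    n, m = len(score), len(score[0])
    ranked = sorted(((score[i][j], i, j) for i in range(n) for j in range(m)),
                    reverse=True)
    used_i, used_j, pairs = set(), set(), {}
    for s, i, j in ranked:
        if i not in used_i and j not in used_j:
            pairs[i] = j
            used_i.add(i)
            used_j.add(j)
    return pairs

def two_opt(score, pairs):
    """Swap the right-hand sides of two pairs whenever the product of
    the affected scores improves; repeat until no swap helps."""
    improved = True
    while improved:
        improved = False
        keys = list(pairs)
        for a in keys:
            for b in keys:
                if a >= b:
                    continue
                old = score[a][pairs[a]] * score[b][pairs[b]]
                new = score[a][pairs[b]] * score[b][pairs[a]]
                if new > old:
                    pairs[a], pairs[b] = pairs[b], pairs[a]
                    improved = True
    return pairs
```

The 2-opt pass matters when the greedy pass is led astray by a single high-scoring pair that blocks two good medium-scoring pairs.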
2.3 The Resulting Algorithm

The word alignment of a parallel corpus is performed as follows:

1. The lexical score, which reflects the frequency of co-occurrence of two words, is calculated:

    Lex(A, A′) = ( Npair(A, A′) / (N1(A) + 1) + Npair(A, A′) / (N2(A′) + 1) ) / 2,

where Npair(A, A′) is the number of sentence pairs in which the lexical entry from the node A occurs on the source side and the lexical entry from A′ occurs on the target side, and N1(A) and N2(A′) are monolingual occurrence counts. Additionally, if A and A′ are identical strings consisting only of digits, then Lex(A, A′) is set to 1.0.

2. For each pair of DAGs in the corpus the alignment procedure is performed as described in Section 2.2.

3 The Tool: f-align

The algorithm is the basis for version 0.1 of a new open-source tool, f-align, which we present here.² It is written in C++ and can be compiled both under Linux and under Windows. The tool currently understands two representations of LFG f-structures: XLE parser output (Kaplan et al., 2002) and DCU parser output (Cahill et al., 2004).

² The source code can be downloaded from the page http://www.computing.dcu.ie/~abryl/software.

The following parameters are passed through the command line: the names of two input directories and one output directory, the number of structure pairs to align, the input formats (LFG XLE or LFG DCU for each of the input directories), the weights for the composite score calculation, wc and wp (0.2 being the default value for both), and the flag for using/not using the extended scores (the default is no). The files in the input directories should be named 0.txt, 1.txt, etc. For each structure pair the tool creates a separate output file in the output directory. Each line of an output file contains a pair of aligned nodes, together with the composite alignment score of this alignment. For example, in the LFG XLE format:

    var(11) => var(20) (0.186723)

Explicit inclusion of the scores in the output allows the user to apply a quality threshold if required.

4 Evaluation

For this work we prepared a small 20-sentence gold standard and evaluated our method against it; in addition, we manually checked some sentences and visualized a typical success case and a typical problem case in Figures 3 and 4. In future we plan to also evaluate the method as part of an LFG-based SMT system.

During the evaluation we also assessed the speed of the alignment tool. On a desktop PC under Windows Vista with a 3GHz CPU, alignment of 10,000 pairs of f-structures takes about 4 minutes when using the extended matching scores, and about 3 minutes 15 seconds when not using them.

4.1 Experimental Data

For our experiments we prepared a dataset composed of two parts:

• 10,000 sentence pairs from the German-English part of the Europarl corpus (Koehn, 2005), parsed into f-structures with the English and German treebank-based DCU LFG parsers (Cahill et al., 2004; Rehbein and van Genabith, 2009). For this dataset no gold-standard word alignment is available.

• 20 sentence pairs from the SMULTRON 2.0 parallel treebank (Volk et al., 2009): the first 20 sentence pairs with one-to-one sentence alignment for which both the German and English parts were parsed successfully.³ We used the same DCU LFG f-structure annotation software to annotate the trees from the corpus and to transform them into f-structures. Then the resulting f-structure pairs were manually node-aligned (256 alignments in total), using the word alignment supplied with the corpus as reference. The sentences were picked from the "economy" sub-corpus of SMULTRON, as we assumed that it is not too different lexically from the Europarl corpus.

³ "Successfully" does not always mean "correctly"; for several long sentences the parses were quite noisy, complicating the alignment task and probably accounting for part of the noisiness of the resulting graphs.

With a dataset that small, the automatically acquired dictionary (the Lex(·, ·) scores in Section 2.3) is not reliable enough, which allows us to have a close-up look at the effect of considering structural information.

4.2 Comparing Against the Gold Standard

For these experiments we automatically aligned our experimental data (10,020 f-structure pairs) as a whole with f-align, and then compared the output for the SMULTRON part against the manually prepared gold standard. The following measures were used:

    precision = |established alignments ∩ gold-standard alignments| / |established alignments|

    recall = |established alignments ∩ gold-standard alignments| / |gold-standard alignments|

    f-measure = 2 × (precision × recall) / (precision + recall)

By established alignments we mean those which appear in the output of the tool and have scores higher than the threshold.
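Assuming the tool's output is read into a mapping from node pairs to composite scores, the three measures can be computed as in the following sketch (the data structures and the function name are hypothetical, not part of f-align):

```python
# Sketch of the evaluation measures over alignment link sets.
# `output` maps (node, node') links to composite scores; `gold` is the
# set of gold-standard links; links scoring above `threshold` count as
# "established" in the paper's sense.

def evaluate(output, gold, threshold):
    established = {link for link, s in output.items() if s > threshold}
    correct = established & gold
    precision = len(correct) / len(established) if established else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f
```

Raising the threshold shrinks the established set, which typically trades recall for precision, as in the experiments below.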

4.2.1 The Effect of Considering Structure

In this experiment we calculated precision and recall for different confidence thresholds for the following three configurations:

1. wc = wp = 0 (Lex);

2. wc = wp = 0.2 (Structure);

3. wc = wp = 0.2, extended matching scores used (Structure+).

The result is presented in Figure 1.⁴ Consideration of parents and children in matching brings about a clear improvement of the alignment quality, even though the structural information in the present data is imperfect. The effect of using the extended matching score (that is, using grandparents and grandchildren) is rather negative. In comparison to purely lexical alignment, the extended matching score shows better recall only at low precision levels, and it is invariably worse than the variant with children and parents only. It seems that the additional flexibility gained by the extended matching score does not repay the general blurring of the structure it causes.

⁴ Graphs are plotted using the Graph 4.3 software tool, http://www.padowan.dk/graph/

Figure 1: The Effect of Considering Structure.

4.2.2 The Effect of the Parameter Values

In this experiment we assessed the effect of the parameter values on the f-measure. Confidence thresholds of 0.25, 0.5, and 0.75 were used. In all cases wc = wp. The results are shown in Figure 2. The figure confirms that the parameter values wc = wp = 0.2, chosen initially based on general considerations,⁵ are in fact reasonable.

⁵ So as to make the structural information provide a bit less than a half of the composite score.

Figure 2: The Effect of the Parameter Values.

4.3 Manual Examination

For this experiment we automatically aligned the Europarl part of our experimental data (10,000 f-structure pairs) with f-align and manually examined the results to explore the effect of considering the structure.

4.3.1 When Structure Helps

There are two most obvious cases where the structure is helpful:

i. when the same word occurs several times in the sentence;

ii. when one or several rare words (or words used in rare senses) occur in the sentence.

For illustration we picked a Europarl sentence pair which has both these problems (Figure 3): the English side has 'the' used twice, and the word 'highlight' has to be matched with 'sagen', which is not much supported by the co-occurrence counts (Section 2.3). There is also some structural dissimilarity between German and English: 'zu' (in 'zu den Folgen') and 'of' (in 'of the storms') actually have no matches.

Figure 3a shows the performance of dictionary-only alignment; given the weak dictionary, it is no surprise that only a few nodes are aligned correctly. However, with the same dictionary it is possible to perform much better once the structural parent and child matching scores are considered: Figure 3b shows this clearly. The result is easy to explain: the highly scored lexical matches for 'Sturm' and 'Irland' forced 'der' and 'in' to be matched correctly, while 'bitten' offered some help to 'sagen', although in this last case the score of the match is very low. The dissimilarity in structure caused a predictable problem: 'zu' was aligned with 'result', and 'Folge' with 'of'. However, the scores of these incorrect matches are not high; applying a threshold of 0.5 to the scores would provide a reasonable overall alignment with some gaps.

The application of the extended matching scores (Figure 3c) allows the algorithm to make another step in the correct direction, though an insufficient one for a practical improvement. 'Folge' is now aligned with 'result', but this correct match has a lower score than the incorrect match 'zu' → 'of'. This situation, namely low confidence of correct alignments, must be an explanation of why the aligner with the extended matching scores loses recall so fast as precision increases (see Figure 1).

Man hat mich gebeten, etwas zu den Folgen der Stürme in Irland zu sagen.
I was asked to highlight the result of the storms in Ireland.

(a) wc = wp = 0

(b) wc = wp = 0.2

(c) wc = wp = 0.2, extended matching scores used

Figure 3: When Structure Helps: the correctly aligned words help their children and parents to find the correct matches.

Vielen Dank, Herr N.
Thank you, Mr. N.

Figure 4: When Structure Misleads: over-aligning fixed expressions. The word-to-word alignment between viel and you is in fact wrong, but the structural information gives strong support to it, so the score of this alignment becomes quite high. wc = wp = 0.2.

4.3.2 When Structure Misleads

In at least one rather rare case the structural similarity is misleading, namely when two structurally similar fixed expressions are aligned. In this case the method aligns them word-by-word, and the matching scores, due to the structural matches, are not too low. An example is given in Figure 4. The alignment between viel and you is almost inevitable in a one-to-one alignment, but at least it is desirable to decrease the score of the incorrect match, which is boosted by structural similarity. This can probably be achieved if, following Meyers et al. (1998), we construct a dictionary not only for the words but also for the grammatical functions (the labels on the edges of the graph); viel and you have unrelated grammatical functions (quant and obj respectively in the DCU parser output), and this mismatch, if properly considered, could decrease the matching score to some extent.

5 Conclusions and Future Work

In this paper we presented a specialized method for aligning LFG f-structures, and presented a software tool based on this method. Standalone evaluation allows us to draw optimistic conclusions. The results, as well as the manually examined portions of the output, suggest that the method is effective enough to resolve simple ambiguities and that it improves the overall alignment accuracy.

There is much space for improvement. First of all, support for one-to-many and probably many-to-many node alignments is necessary. Another desirable and apparently feasible feature is the capability to detect incorrect parses by measuring the inconsistency between lexical and structural matches; the presence of incorrect parses in the training data does nothing but harm the MT process, and their exclusion (if not correction) could be extremely useful. More elaborate ways of building bilexical dictionaries may also be incorporated, as the current method is obviously very simple. Another direction of work is to perform a more thorough evaluation, in particular by using the method for rule extraction in a transfer-based SMT system.

It is straightforward to extend the applicability of the tool to other kinds of dependency graphs, such as, e.g., those produced by the Malt dependency parser (Nivre et al., 2007).

Acknowledgements

This research is supported by Science Foundation Ireland (Grant 07/CE/I1142) as part of the Centre for Next Generation Localisation (CNGL) at Dublin City University. The authors thank Dr. Özlem Çetinoğlu for assistance with LFG parsers.

References

Vamshi Ambati and Alon Lavie. 2008. Improving syntax driven translation models by re-structuring divergent and non-isomorphic parse tree structures. In Student Research Symposium at the Eighth Conference of the Association for Machine Translation in the Americas (AMTA).

Eleftherios Avramidis and Jonas Kuhn. 2009. Exploiting XLE's finite state interface in LFG-based statistical machine translation. In Proceedings of the 14th International Lexical Functional Grammar Conference (LFG'09).

Aoife Cahill, Michael Burke, Josef van Genabith, and Andy Way. 2004. Long-distance dependency resolution in automatically acquired wide-coverage PCFG-based LFG approximations. In Proceedings of the 42nd Meeting of the ACL, pages 320-327.

Yvette Graham and Josef van Genabith. 2009. An open source rule induction tool for transfer-based SMT. The Prague Bulletin of Mathematical Linguistics, Special Issue: Open Source Tools for Machine Translation, 91:37-46.

Hiroyuki Kaji, Yuuko Kida, and Yasutsugu Morimoto. 1992. Learning translation templates from bilingual text. In Proceedings of the 14th Conference on Computational Linguistics (COLING), pages 672-678, Morristown, NJ, USA. Association for Computational Linguistics.

Ronald Kaplan and Joan Bresnan. 1982. Lexical-Functional Grammar: a formal system for grammatical representation. In The Mental Representation of Grammatical Relations, pages 173-281.

Ronald M. Kaplan, Tracy Holloway King, and John T. Maxwell III. 2002. Adapting existing grammars: the XLE experience. In Proceedings of COLING 2002, Workshop on Grammar Engineering and Evaluation.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the Tenth Machine Translation Summit (MT Summit X), pages 79-86.

Adam Meyers, Roman Yangarber, Ralph Grishman, Catherine Macleod, and Antonio Moreno-Sandoval. 1998. Deriving transfer rules from dominance-preserving alignments. In Proceedings of COLING-ACL 98, pages 843-847.

Joakim Nivre, Johan Hall, Jens Nilsson, Atanas Chanev, Gülşen Eryiğit, Sandra Kübler, Svetoslav Marinov, and Erwin Marsi. 2007. MaltParser: A language-independent system for data-driven dependency parsing. Natural Language Engineering, 13:95-135.

Franz Josef Och, Christoph Tillmann, and Hermann Ney. 1999. Improved alignment models for statistical machine translation. In Proceedings of the 1999 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 20-28.

Ines Rehbein and Josef van Genabith. 2009. Automatic acquisition of LFG resources for German - as good as it gets. In Proceedings of the 14th International Lexical Functional Grammar Conference (LFG'09).

Stefan Riezler and John T. Maxwell, III. 2006. Grammatical machine translation. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), pages 248-255.

Martin Volk, Torsten Marek, and Yvonne Samuelsson. 2009. SMULTRON (version 2.0): The Stockholm MULtilingual parallel TReebank. An English-German-Swedish parallel treebank with sub-sentential alignments. www.cl.uzh.ch/research/paralleltreebanks_en.html.

Hideo Watanabe, Sadao Kurohashi, and Eiji Aramaki. 2000. Finding structural correspondences from bilingual parsed corpus for corpus-based translation. In Proceedings of the 18th Conference on Computational Linguistics (COLING), volume 2, pages 906-912.

Ventsislav Zhechev. 2009. Unsupervised generation of parallel treebanks through sub-tree alignment. The Prague Bulletin of Mathematical Linguistics, 91:89-98.