Context-Free Grammar Rewriting and the Transfer of Packed Linguistic Representations

Context-Free Grammar Rewriting and the Transfer of Packed Linguistic Representations Marc Dymetman Fred´ eric´ Tendeau Xerox Research Centre Europe Lernout & Hauspie 6, chemin de Maupertuis Koning Albert-I laan 64 38240 Meylan, France B-1780 Wemmel, Belgium [email protected] [email protected] Abstract The representations of (Emele and Dorna, 1998) and We propose an algorithm for the transfer of packed linguistic (Kay, 1999) are based on a notion of propositional con- structures, that is, finite collections of labelled graphs which texts (see (Maxwell and Kaplan, 1991)), where each share certain subparts. A labelled graph is seen as a word over possible non-ambiguous reading included in the packed a vocabulary of description elements (nodes, arcs, labels), and source representation is extracted by selecting the value a collection of graphs as a set of such words, that is, as a lan- (true or false) of a certain number of propositional vari- guage over description elements. A packed representation for ables that index elements of the labelled source graph. the collection of graphs is then viewed as a context-free gram- Transfer is then seen as a process of rewriting source mar which generates such a language. We present an algorithm graph elements (e.g, nodes labelled with French lexemes) that uses a conventional set of transfer rules but is capable of into target graph elements (e.g. nodes labelled with En- rewriting the CFG representing the source packed structure into glish lexemes), while preserving the propositional con- a CFG representing the target packed structure that preserves texts in which these graph elements were selected. the compaction properties of the source CFG. In contrast, our approach, following (Dymetman, 1997), views a packed representation as being a gram- 1 Introduction mar (more specifically, a context-free grammar) over the vocabulary of graph elements (labelled nodes and edges), There is currently much interest in translation models where each word (in the sense of formal language theory) that support some amount of ambiguity preservation between source and target texts, so as to minimize disam- generated by the grammar represents one of the possible biguation decisions that the system, or an interactive user, non-ambiguous readings of the packed representation. In has to make during the translation process (Kay et al., other terms, the collection of non-ambiguous graphs be- longing to the packed representation is seen as a lan- 1994). guage over a vocabulary of graph elements, and a packed An important aspect of such models is the ability to handle, during all the stages of the translation process, representation is seen as a grammar which generates such packed linguistic structures, that is, structures which fac- a language. Packing comes from the fact that a context- torize in a compact fashion all the different readings of free grammar is an efficient representation for the language it generates. Another essential feature of such a a sentence and obviate the need to list and treat all these representation is that it is interaction-free, that is, each readings in isolation of each other (as is standard in more nondeterministic top-down traversal of the grammar suc- traditional models for machine translation). In the case of parsing, and more specifically, parsing ceeds without ever backtracking and it results in a certain with unification-based formalisms such as LFG, tech- reading, without the need for checking the consistency of a set of associated propositional constraints: the repre- niques for producing packed structures have been in sentation for the collection of readings is as direct as can existence for some time (Maxwell and Kaplan, 1991; be while permitting a factorization of common parts. Maxwell and Kaplan, 1993; Maxwell and Kaplan, 1996; Dörre, 1997; Dymetman, 1997). More recently, tech- Based on this notion, we present an algorithm for niques have been appearing for the generation from transfer which, starting from a finite set of rewriting packed structures (Shemtov, 1997), the transfer between patterns (the transfer lexicon), associates with a given context-free grammar representing the source packed packed structures (Emele and Dorna, 1998; Rayner and structure a context-free grammar representing the tar- Bouillon, 1995), and the integration of such mechanisms into the whole translation process (Kay, 1999; Frank, get packed structure. Therefore, the target representa- 1999). tion remains interaction-free and transparently encodes This paper focuses on the problem of transfer. The the target structures; furthermore, under certain natural “locality” conditions on the rewriting rules (the graph el- method proposed is related to those of (Emele and Dorna, ements in their left-hand sides tend be be “close” from 1998) and (Kay, 1999). As in these approaches, we view each other in the source grammar derivations), the target packed representations as being descriptions of a finite collection of directed labelled graphs (similar to the func- grammar preserves much of the factorization and com- tional structures of LFG), each representing a different paction properties of the source grammar. non-ambiguous reading, which share certain subparts. The paper is structured in the following way. Sec- tion 2 explains how ambiguous graphs can be seen as way to describe such a graph is by listing a collection of commutative languages over graph description elements, “description elements” for it, where each such element and how context-free grammars provide concise specifi- is either a labelled node such as see 0 or a labelled edge cations for these languages. Section 3 extends the stan- such as mod27 . Using this format, the pragmatically pre- f 01 dard notion of non-ambiguous transfer to that of am- ferred analysis for our sentence is the set see 0 , arg1 , 02 2 27 7 23 3 34 biguous transfer. Section 4 presents the basic language- i1 , arg2 , light , mod , green1 , mod , on , arg2 , g 05 5 56 6 theoretic formalism needed and introduces some opera- hill4 , mod , with , arg2 , telescope . tors on languages. Section 5 presents the detailed rewrit- If we consider the collection of all possible analyses, ing algorithm, which applies these operators not directly we then obtain a collection of sets of description ele- to languages, but to the context-free grammars specify- ments. It is convenient to view such a collection as a ing them. Section 6 gives an example of the algorithm in commutative language over the vocabulary of all possi- operation. ble description elements; each word in such a language corresponds to one analysis and is a list of description 2 Ambiguous structures as languages elements the order of which is considered irrelevant. The main advantage of taking this view of ambiguous 0: see / saw structures is that formal language theory provides stan- arg1 arg2 mod mod dard tools for representing languages compactly. Thus it is well-known in computational lexicography that a 1: i 2: light mod mod large list of word strings can be represented efficiently by mod 3: on means of a finite-state automaton which factorizes com- 7: green1 / green2 arg2 mon substrings. Such a representation is both compact 4: hill mod and “explicit”: accessing and using it is as direct as the 5: with flat list of words would be. arg2 Although one might think of using finite-state mod- 6: telescope els for representing compactly the language associated with a collection of graphs, they do not seem as relevant Figure 1: An informal graphical representation of the 20 as context-free models for our purposes. The reason is possible analyses for “I saw the green light on the hill that the source packed representations are typically ob- with a telescope”. tained as the results of chart-parsing processes. A chart Let’s consider the sentence “I saw the green light on used in the parsing of a context-free grammar can itself the hill with a telescope”. In Fig. 1, we have repre- be viewed as a context-free grammar, which is a spe- sented informally the set of possible analyses for this cialization of the original grammar for the string being sentence. Labels on the nodes correspond to predicate parsed, and which directly generates the derivation trees names (‘on’, ‘hill’, etc). A slash is used to indicate dif- for this string relative to the original grammar (Billot and ferent possible readings for a node; for instance, we as- Lang, 1989).1 The generalization of this approach to uni- sume that the surface form “saw” can correspond to the fication grammars (of the LFG or DCG type) proposed verbs “to see” or “to saw”, and that “green” is ambigu- in (Dymetman, 1997) shows that, in turn, chart-parsing ous between the color adjective “green1” and the noun with these unification grammars conducts naturally to “green2” (grassy lawn). Relations between nodes are in- packed representations for the parse results very close to dicated by labels on the edges joining two nodes: ‘arg1’ the ones we are about to introduce. G and ‘arg2’ for first and second argument, ‘mod’ for mod- Let’s consider the CFG 0 : ifier. The solid edges correspond to relations which are S ! SAW ON WITH D3 satisfied in all the readings for the sentence, dotted edges ! 1 02 SAW D0 arg101 i arg2 LIGHT to relations that are satisfied only for certain readings. ! 2 LIGHT GREEN mod27 light ! j 7 Thus, the preprositional phrase “on the hill” can modify GREEN green17 green2 ! 34 4 either “light” or “see/saw”, the phrase “with a telescope” ON on3 arg2 hill ! 56 6 either “hill”, “light”, or “see/saw”. The informal picture WITH with5 arg2 telescope j ! 0 of Fig. 1 does not make explicit exactly which structures D0 see0 saw ! j 23 are actually possible analyses of the sentence.

Load more