Machine Translation As Lexicalized Parsing with Hooks
Machine Translation as Lexicalized Parsing with Hooks

Liang Huang
Dept. of Computer & Information Science
University of Pennsylvania
Philadelphia, PA 19104

Hao Zhang and Daniel Gildea
Computer Science Department
University of Rochester
Rochester, NY 14627

Abstract

We adapt the "hook" trick for speeding up bilexical parsing to the decoding problem for machine translation models that are based on combining a synchronous context-free grammar as the translation model with an n-gram language model. This dynamic programming technique yields lower complexity algorithms than have previously been described for an important class of translation models.

1 Introduction

In a number of recently proposed synchronous grammar formalisms, machine translation of new sentences can be thought of as a form of parsing on the input sentence. The parsing process, however, is complicated by the interaction of the context-free translation model with an m-gram[1] language model in the output language. While such formalisms admit dynamic programming solutions having polynomial complexity, the degree of the polynomial is prohibitively high.

In this paper we explore parallels between translation and monolingual parsing with lexicalized grammars. Chart items in translation must be augmented with words from the output language in order to capture language model state. This can be thought of as a form of lexicalization with some similarity to that of head-driven lexicalized grammars, despite being unrelated to any notion of syntactic head. We show that techniques for parsing with lexicalized grammars can be adapted to the translation problem, reducing the complexity of decoding with an inversion transduction grammar and a bigram language model from O(n^7) to O(n^6). We present background on this translation model as well as the use of the technique in bilexicalized parsing before describing the new algorithm in detail. We then extend the algorithm to general m-gram language models, and to general synchronous context-free grammars for translation.

[1] We speak of m-gram language models to avoid confusion with n, which here is the length of the input sentence for translation.

2 Machine Translation using Inversion Transduction Grammar

The Inversion Transduction Grammar (ITG) of Wu (1997) is a type of context-free grammar (CFG) for generating two languages synchronously. To model the translational equivalence within a sentence pair, ITG employs a synchronous rewriting mechanism to relate two sentences recursively.
To deal with the syntactic divergence between two languages, ITG allows the inversion of rewriting order going from one language to another at any recursive level. ITG in Chomsky normal form consists of unary production rules that are responsible for generating word pairs:

    X → e/f
    X → e/ε
    X → ε/f

where e is a source language word, f is a foreign language word, and ε means the null token, and binary production rules in two forms that are responsible for generating syntactic subtree pairs:

    X → [Y Z]

and

    X → ⟨Y Z⟩

The rules with square brackets enclosing the right-hand side expand the left-hand side symbol into the two symbols on the right-hand side in the same order in the two languages, whereas the rules with angled brackets expand the left-hand side symbol into the two right-hand side symbols in reverse order in the two languages. The first class of rules is called straight rules; the second class is called inverted rules.
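To make the straight/inverted distinction concrete, the following sketch linearizes a tiny synchronous tree into both languages. The class and function names are illustrative, not from the paper, and we adopt one possible convention: an inverted node keeps the foreign-language order and swaps the English side.

```python
# Sketch: how straight [Y Z] and inverted <Y Z> ITG rules linearize a
# synchronous tree. Node/yield_pair are hypothetical names for illustration.

class Node:
    def __init__(self, left=None, right=None, inverted=False, pair=None):
        self.left, self.right = left, right
        self.inverted = inverted      # True for <Y Z>, False for [Y Z]
        self.pair = pair              # (e, f) at a lexical leaf; None side = epsilon

def yield_pair(node):
    """Return (english_words, foreign_words) for the subtree."""
    if node.pair is not None:         # unary rule X -> e/f (either side may be None)
        e, f = node.pair
        return ([e] if e else []), ([f] if f else [])
    le, lf = yield_pair(node.left)
    re, rf = yield_pair(node.right)
    if node.inverted:                 # <Y Z>: foreign order kept, English swapped
        return re + le, lf + rf
    return le + re, lf + rf           # [Y Z]: same order in both languages

# One inverted rule over two word pairs: "maison blanche" <-> "white house".
tree = Node(left=Node(pair=("house", "maison")),
            right=Node(pair=("white", "blanche")),
            inverted=True)
```

Which language's order is inverted is purely a convention; the grammar itself is symmetric between the two languages.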
One special case of 2-normal ITG is the so-called Bracketing Transduction Grammar (BTG), which has only one nonterminal A and two binary rules

    A → [A A]

and

    A → ⟨A A⟩

By mixing instances of the inverted rule with those of the straight rule hierarchically, BTG can meet the alignment requirements of different language pairs. There exists a more elaborate version of BTG that has 4 nonterminals working together to guarantee the property of one-to-one correspondence between alignments and synchronous parse trees. Table 1 lists the rules of this BTG. In the discussion of this paper, we will consider ITG in 2-normal form.

    Structural Rules                 Lexical Rules
    S → A    A → [AB]   B → ⟨AA⟩    C → e_i/f_j
    S → B    A → [BB]   B → ⟨BA⟩    C → ε/f_j
    S → C    A → [CB]   B → ⟨CA⟩    C → e_i/ε
             A → [AC]   B → ⟨AC⟩
             A → [BC]   B → ⟨BC⟩
             A → [CC]   B → ⟨CC⟩

    Table 1: Unambiguous BTG

By associating probabilities or weights with the bitext production rules, ITG becomes suitable for weighted deduction over bitext.
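The LM-free decoding objective e* = argmax_e max_q P(e, f, q), computed by CKY over the foreign sentence f, can be sketched as follows. The grammar here is a toy BTG with one nonterminal and two binary rules; all names, rule weights, and lexicon probabilities are illustrative, not from the paper.

```python
# Sketch: Viterbi ITG decoding of a foreign sentence f WITHOUT a language
# model, using a toy BTG (A -> [A A] with weight p_straight, A -> <A A>
# with weight p_inverted, plus word-pair rules from a lexicon).

def decode_itg(f, lexicon, p_straight=0.4, p_inverted=0.1):
    """lexicon: foreign word -> list of (english word, probability).
    Returns (Viterbi probability, best English word list) for f."""
    n = len(f)
    chart = {}                                    # (i, j) -> (prob, english list)
    for i, fw in enumerate(f):                    # unary rules A -> e/f; with no LM,
        e, p = max(lexicon[fw], key=lambda x: x[1])   # only each word's best translation matters
        chart[i, i + 1] = (p, [e])
    for span in range(2, n + 1):                  # apply binary rules bottom-up
        for i in range(n - span + 1):
            j = i + span
            best = (0.0, None)
            for k in range(i + 1, j):
                pl, el = chart[i, k]
                pr, er = chart[k, j]
                # straight: English order matches foreign order
                best = max(best, (p_straight * pl * pr, el + er), key=lambda x: x[0])
                # inverted: English order is swapped
                best = max(best, (p_inverted * pl * pr, er + el), key=lambda x: x[0])
            chart[i, j] = best
    return chart[0, n]
```

Note that this sketch also illustrates the degeneracy discussed below: with no language model, the rule weights are independent of the words, so the decoder always prefers whichever binary rule has the higher weight and simply emits each foreign word's single most probable translation.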
Given a sentence pair, searching for the Viterbi synchronous parse tree, of which the alignment is a byproduct, turns out to be a two-dimensional extension of PCFG parsing, having time complexity of O(n^6), where n is the length of both the English string and the foreign language string. A more interesting variant of parsing over bitext space is the asymmetrical case in which only the foreign language string is given, so that Viterbi parsing involves finding the English string "on the fly". The process of finding the source string given its target counterpart is decoding. Using ITG, decoding is a form of parsing.

2.1 ITG Decoding

Wu (1996) presented a polynomial-time algorithm for decoding ITG combined with an m-gram language model. Such language models are commonly used in noisy channel models of translation, which find the best English translation e of a foreign sentence f by finding the sentence e that maximizes the product of the translation model P(f|e) and the language model P(e).

It is worth noting that since we have specified ITG as a joint model generating both e and f, a language model is not theoretically necessary. Given a foreign sentence f, one can find the best translation e*:

    e* = argmax_e P(e, f)
       = argmax_e Σ_q P(e, f, q)

by approximating the sum over parses q with the probability of the Viterbi parse:

    e* = argmax_e max_q P(e, f, q)

This optimal translation can be computed using standard CKY parsing over f by initializing the chart with an item for each possible translation of each foreign word in f, and then applying ITG rules from the bottom up.

However, ITG's independence assumptions are too strong to use the ITG probability alone for machine translation. In particular, the context-free assumption that each foreign word's translation is chosen independently will lead to simply choosing each foreign word's single most probable English translation, with no reordering. In practice it is beneficial to combine the probability given by ITG with a local m-gram language model for English:

    e* = argmax_e max_q P(e, f, q) · P_lm(e)^α

with some constant language model weight α. The language model will lead to more fluent output by influencing both the choice of English words and the reordering, through the choice of straight or inverted rules. While the use of a language model complicates the CKY-based algorithm for finding the best translation, a dynamic programming solution is still possible. We extend the algorithm by storing in each chart item the English boundary words that will affect the m-gram probabilities as the item's English string is concatenated with the string from an adjacent item. Due to the locality of the m-gram language model, only m−1 boundary words need to be stored to compute the new m-grams produced by combining two substrings.
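The boundary-word bookkeeping can be sketched for the bigram case (m = 2): each chart item records its first and last English words (m − 1 = 1 on each side), and combining two items requires scoring only the one new m-gram that crosses the seam. The function names, rule weights, and the lm scoring function below are illustrative placeholders, not the paper's implementation.

```python
# Sketch: combining two LM-augmented chart items under a bigram LM.
# Items for a span are kept in a dict keyed by their English boundary
# words: (first_word, last_word) -> best Viterbi score.

def combine(items_left, items_right, lm, p_straight, p_inverted):
    """Return the item dict for the larger span; lm(prev, w) scores P(w | prev)."""
    out = {}
    def add(key, score):                      # keep only the best score per signature
        if score > out.get(key, 0.0):
            out[key] = score
    for (l1, r1), s1 in items_left.items():
        for (l2, r2), s2 in items_right.items():
            # straight [Y Z]: left item's English precedes the right item's;
            # the only new bigram crosses the seam: P(l2 | r1)
            add((l1, r2), p_straight * s1 * s2 * lm(r1, l2))
            # inverted <Y Z>: English order is swapped; new bigram is P(l1 | r2)
            add((l2, r1), p_inverted * s1 * s2 * lm(r2, l1))
    return out
```

With a toy language model that strongly prefers the bigram "white house", the inverted combination can outscore the straight one even though the straight rule carries the higher weight, which is exactly how the language model drives reordering through the choice of straight or inverted rules.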
Figure 1 illustrates the combination of two substrings into a larger one in straight order and in inverted order.

3 Hook Trick for Bilexical Parsing

A traditional CFG generates words at the bottom of a parse tree and uses nonterminals as abstract representations of substrings to build higher-level tree nodes. Nonterminals can be made more specific to the actual substrings they are covering by associating a representative word from the nonterminal's yield. When the maximum number of lexicalized nonterminals in any rule is two, a CFG is bilexical. A typical bilexical CFG in Chomsky normal form has two types of rule templates:

    A[h] → B[h] C[h']

or

    A[h] → B[h'] C[h]

depending on which child is the head child that agrees with the parent on head word selection. Bilexical CFG is at the heart of most modern statistical parsers (Collins, 1997; Charniak, 1997), because the statistics associated with word-specific rules are more informative for disambiguation purposes. If we use A[i, j, h] to represent a lexicalized constituent and β(·) to represent the Viterbi score function, the parsing problem is to compute

    β(A[i, j, h]) = max( max_{k, h', B, C} [ β(B[i, k, h']) · β(C[k, j, h]) · P(A[h] → B[h'] C[h]) ],
                         max_{k, h', B, C} [ β(B[i, k, h]) · β(C[k, j, h']) · P(A[h] → B[h] C[h']) ] )    (1)

Since five position and head-word indices (i, k, j, h, h') interact, each rule can be instantiated in n^5 possible ways, implying that the complexity of the parsing algorithm is O(n^5).

Eisner and Satta (1999) pointed out that we don't have to enumerate k and h' simultaneously. The trick, shown in mathematical form in Figure 2 (bottom), is very simple. When maximizing over h', j is irrelevant. After getting the intermediate result of maximizing over h', we have one less free variable than before. Throughout the two steps, the maximum number of interacting variables is 4, implying that the algorithmic complexity is O(n^4) after binarizing the factors cleverly. The intermediate result

    max_{h', B} [ β(B[i, k, h']) · P(A[h] → B[h'] C[h]) ]

can be represented pictorially as a partial constituent spanning [i, k] that is still looking for a sibling C[h] on its right. The same trick works for the second max term in Equation 1; the intermediate result coming from binarizing that term can be visualized as a mirror-image partial constituent spanning [k, j] that is still looking for a sibling B[h] on its left. The shape of the intermediate results gave rise to the nickname of "hook". Melamed (2003) discussed the applicability of the hook trick for parsing bilexical multitext grammars. The analysis of the hook trick in this section shows that it is essentially an algebraic manipulation.
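Since the trick is essentially algebraic (max distributes over a product with a factor that does not depend on the maximized variable), it can be checked numerically. The sketch below compares the joint five-index maximization for the first term of Equation 1 against the two-step hook factorization on random scores; the array names (betaB, betaC, rule) are illustrative, and the grammar nonterminal choices are collapsed into the arrays.

```python
# Numeric check of the hook trick: maximizing jointly over k and h'
# (naive, five interacting indices -> O(n^5)) gives the same Viterbi score
# as first maximizing over h' with j held out, then over k (hook, at most
# four interacting indices -> O(n^4)).

import itertools
import random

n = 4                                             # sentence positions / head words
random.seed(0)
rnd = random.random
betaB = [[[rnd() for _ in range(n)] for _ in range(n)] for _ in range(n)]  # betaB[i][k][hp] ~ beta(B[i,k,h'])
betaC = [[[rnd() for _ in range(n)] for _ in range(n)] for _ in range(n)]  # betaC[k][j][h]  ~ beta(C[k,j,h])
rule = [[rnd() for _ in range(n)] for _ in range(n)]                       # rule[h][hp]     ~ P(A[h] -> B[h'] C[h])

def naive(i, j, h):
    # joint max over the split point k and the non-head word h'
    return max(betaB[i][k][hp] * betaC[k][j][h] * rule[h][hp]
               for k in range(i + 1, j) for hp in range(n))

def hooked(i, j, h):
    # step 1: build the hook; j is irrelevant when maximizing over h'
    hook = {k: max(betaB[i][k][hp] * rule[h][hp] for hp in range(n))
            for k in range(i + 1, j)}
    # step 2: attach C[k, j, h] and maximize over k alone
    return max(hook[k] * betaC[k][j][h] for k in range(i + 1, j))

assert all(abs(naive(i, j, h) - hooked(i, j, h)) < 1e-12
           for i, j, h in itertools.product(range(n), repeat=3) if j > i + 1)
```

The equality holds because betaC[k][j][h] does not depend on h', so it can be pulled outside the inner maximization; that factoring, not any property of grammars, is the entire content of the trick.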