Statistical Machine Translation with a Factorized Grammar
Libin Shen, Bing Zhang, Spyros Matsoukas, Jinxi Xu and Ralph Weischedel
Raytheon BBN Technologies
Cambridge, MA 02138, USA
{lshen,bzhang,smatsouk,jxu,weisched}@bbn.com

Abstract

In modern machine translation practice, a statistical phrasal or hierarchical translation system usually relies on a huge set of translation rules extracted from bi-lingual training data. This approach not only results in space and efficiency issues, but also suffers from the sparse data problem. In this paper, we propose to use factorized grammars, an idea widely accepted in the field of linguistic grammar construction, to generalize translation rules and thereby solve these two problems. We designed a method to take advantage of the XTAG English Grammar to facilitate the extraction of factorized rules. We experimented on various setups of low-resource language translation, and showed consistent significant improvement in BLEU over state-of-the-art string-to-dependency baseline systems with 200K words of bi-lingual training data.

However, do we really need such a large rule set to represent information from training data of much smaller size? Linguists in the grammar construction field have already shown us a solution to a similar problem: use a factorized grammar. Linguists decompose lexicalized linguistic structures into two parts, (unlexicalized) templates and lexical items. Templates are further organized into families, and each family is associated with a set of lexical items that can be used to lexicalize all the templates in that family. For example, the XTAG English Grammar (XTAG-Group, 2001), a hand-crafted grammar based on the Tree Adjoining Grammar (TAG) (Joshi and Schabes, 1997) formalism, is a grammar of this kind: it employs factorization with LTAG e-tree templates and lexical items.
1 Introduction

A statistical phrasal (Koehn et al., 2003; Och and Ney, 2004) or hierarchical (Chiang, 2005; Marcu et al., 2006) machine translation system usually relies on a very large set of translation rules extracted from bi-lingual training data with heuristic methods on word alignment results. According to our own experience, we obtain about 200GB of rules from training data of about 50M words on each side. This immediately becomes an engineering challenge on space and search efficiency.

A common practice to circumvent this problem is to filter the rules based on development sets in the step of rule extraction or before the decoding phase, instead of building a real distributed system. However, this strategy only works for research systems, for which the segments for translation are always fixed.

Factorized grammars not only relieve the burden on space and search, but also alleviate the sparse data problem, especially for low-resource language translation with little training data. With a factored model, we do not need to observe exact "template – lexical item" occurrences in training. New rules can be generated from template families and lexical items either offline or on the fly, explicitly or implicitly. In fact, the factorization approach has been successfully applied on the morphological level in a previous study on MT (Koehn and Hoang, 2007). In this work, we go further and investigate the factorization of rule structures by exploiting the rich XTAG English Grammar.

We evaluate the effect of using factorized translation grammars on various setups of low-resource language translation, since low-resource MT suffers greatly from the poor generalization capability of translation rules. With the help of high-level linguistic knowledge for generalization, factorized grammars provide consistent significant improvement in BLEU (Papineni et al., 2001) over string-to-dependency baseline systems with 200K words of bi-lingual training data.

This work also closes the gap between compact hand-crafted translation rules and large-scale unorganized automatic rules. This may lead to a more effective and efficient statistical translation model that could better leverage generic linguistic knowledge in MT.

In the rest of this paper, we will first provide a short description of our baseline system in Section 2. Then, we will introduce factorized translation grammars in Section 3. We will illustrate the use of the XTAG English Grammar to facilitate the extraction of factorized rules in Section 4. Implementation details are provided in Section 5. Experimental results are reported in Section 6.

Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 616–625, MIT, Massachusetts, USA, 9-11 October 2010. © 2010 Association for Computational Linguistics

2 A Baseline String-to-Tree Model

As the baseline of our new algorithm, we use a string-to-dependency system as described in (Shen et al., 2008). There are several reasons why we take this model as our baseline. First, it uses syntactic tree structures on the target side, which makes it easy to exploit linguistic information. Second, dependency structures are relatively easier to implement, as compared to phrase structure grammars. Third, a string-to-dependency system provides state-of-the-art performance on translation accuracy, so that improvement over such a system will be more convincing.

Here, we provide a brief description of the baseline string-to-dependency system for the sake of completeness. Readers can refer to (Shen et al., 2008; Shen et al., 2009) for related information.

In the baseline string-to-dependency model, each translation rule is composed of two parts, source and target. The source side is a string rewriting rule, and the target side is a tree rewriting rule. Both sides can contain non-terminals, and source and target non-terminals are one-to-one aligned. Thus, in the decoding phase, non-terminal replacement on both sides is synchronized.

Decoding is solved with a generic chart parsing algorithm. The source side of a translation rule is used to detect when this rule can be applied. The target side of the rule provides a hypothesis tree structure for the matched span. Mono-lingual parsing can be viewed as a special case of this generic algorithm, in which the source string is a projection of the target tree structure.

Figure 1 shows three examples of string-to-dependency translation rules. For the sake of convenience, we use English for both source and target. Upper-cased words represent the source, while lower-cased words represent the target. X is used for non-terminals on both sides, and non-terminal alignment is represented with subscripts. In Figure 1, the top boxes show the source side, and the bottom boxes show the target side. As for the third rule, FUN_Q stands for a function word in the source language that represents a question.

3 Translation with a Factorized Grammar

We continue with the example rules in Figure 1. Suppose we have "... HATE ... FUN_Q" in a given test segment. There is no rule having both HATE and FUN_Q on its source side. Therefore, we have to translate these two source words separately. For example, we may use the second rule in Figure 1. Thus, HATE will be translated into hates, which is wrong.

Intuitively, we would like to have a translation rule that tells us how to translate X1 HATE X2 FUN_Q, as in Figure 2. Such a rule is not available directly from the training data. However, if we have the three rules in Figure 1, we are able to predict this missing rule. Furthermore, if we know that like and hate are in the same syntactic/semantic class in the source or target language, we will be very confident in the validity of this hypothesis rule.

Now, we propose a factorized grammar to solve this generalization problem. In addition, translation rules represented with the new formalism will be more compact.
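The prediction of such a missing rule can be made concrete with a small sketch. Everything below is our own illustration, not the paper's implementation: the flat string encoding of rules and templates, the `inflect` table, and the function names are simplifying assumptions chosen for clarity.

```python
# Toy encoding (ours, not the paper's): a rule or template is a pair of
# (source token tuple, target dependency string); "V" and "VB" mark the
# slots left behind when the aligned head words are abstracted away.
question_template = (("X1", "V", "X2", "FUN_Q"), "does ( X1 , VB , X2 )")

# Assumed toy morphology table: (stem, POS tag) -> surface form.
inflect = {("like", "VB"): "like", ("hate", "VB"): "hate",
           ("like", "VBZ"): "likes", ("hate", "VBZ"): "hates"}

def lexicalize(template, item):
    """Fill a template's source and target head slots with a lexical item."""
    (src_t, tgt_t), (src_stem, tgt_stem) = template, item
    pos = "VB" if " VB " in f" {tgt_t} " else "VBZ"  # which target slot is open
    src = tuple(src_stem if w == "V" else w for w in src_t)
    return src, tgt_t.replace(pos, inflect[(tgt_stem, pos)], 1)

# The missing rule of Figure 2, predicted from the template of the third
# rule in Figure 1 plus the lexical item (HATE, hate) of the second:
missing = lexicalize(question_template, ("HATE", "hate"))
# missing == (("X1", "HATE", "X2", "FUN_Q"), "does ( X1 , hate , X2 )")
```

In a real system the morphology table and the set of admissible (template, item) pairings would of course come from the grammar's family structure rather than a hand-written dictionary.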
3.1 Factorized Rules

We decompose a translation rule into two parts, a pair of lexical items and an unlexicalized template. This is similar to the solution in the XTAG English Grammar (XTAG-Group, 2001), although here we work on two languages at the same time.

For each rule, we first detect a pair of aligned head words. Then, we extract the stems of this word pair as lexical items, and replace them with their POS tags in the rule. From the three rules in Figure 1, we extract the lexical items (LIKE, like), (HATE, hate) and (LIKE, like) respectively. We obtain the same lexical items from the first and the third rules. The resultant templates are shown in Figure 3. Here, V represents a verb on the source side; VB stands for a verb in the base form, and VBZ for a verb in the third person singular present form, as in the Penn Treebank representation (Marcus et al., 1994).

Figure 1: Three examples of string-to-dependency translation rules. [Source sides: X1 LIKE X2; X1 HATE X2; X1 LIKE X2 FUN_Q. Target sides: likes(X1, X2); hates(X1, X2); does(X1, like, X2).]

Figure 2: An example of a missing rule. [Source side: X1 HATE X2 FUN_Q. Target side: does(X1, hate, X2).]

Figure 3: Templates for the rules in Figure 1. [Source sides: X1 V X2; X1 V X2; X1 V X2 FUN_Q. Target sides: VBZ(X1, X2); VBZ(X1, X2); does(X1, VB, X2).]

In the XTAG English Grammar, tree templates for transitive verbs are grouped into a family, and all transitive verbs are associated with this family. Here, we assume that the rule templates representing structural variations of the same word class can also be organized into a template family. For example, as shown in Figure 4, templates and lexical items are associated with families. It should be noted that a template or a lexical item can be associated with
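The extraction step above (detect the aligned head words, keep their stems as lexical items, replace them by POS tags) can be sketched as follows. The rule encoding, the toy head-word and morphology tables, and the flat family dictionary are our own illustrative assumptions, not the paper's data structures.

```python
# Toy encoding of the Figure 1 rules (ours): (source tokens, target string).
rules = [
    (("X1", "LIKE", "X2"), "likes ( X1 , X2 )"),
    (("X1", "HATE", "X2"), "hates ( X1 , X2 )"),
    (("X1", "LIKE", "X2", "FUN_Q"), "does ( X1 , like , X2 )"),
]

# Assumed analyses: source head words, and target surface form -> (stem, POS).
source_heads = {"LIKE", "HATE"}
target_analysis = {"likes": ("like", "VBZ"), "hates": ("hate", "VBZ"),
                   "like": ("like", "VB"), "hate": ("hate", "VB")}

def factorize(src, tgt):
    """Split a rule into its lexical items and an unlexicalized template."""
    s_head = next(w for w in src if w in source_heads)          # source head
    t_head = next(w for w in tgt.split() if w in target_analysis)  # target head
    t_stem, pos = target_analysis[t_head]
    template = (tuple("V" if w == s_head else w for w in src),
                tgt.replace(t_head, pos, 1))
    return (s_head, t_stem), template

# Grouping the results yields one "transitive verb" family: rules 1 and 2
# share a template, while rules 1 and 3 share a lexical item, so three
# observed rules are stored as two templates plus two lexical items.
family = {"templates": set(), "items": set()}
for src, tgt in rules:
    item, template = factorize(src, tgt)
    family["items"].add(item)
    family["templates"].add(template)
```

Re-lexicalizing every template in the family with every associated item then licenses 2 x 2 = 4 rules, one of which (the Figure 2 rule) was never observed in training.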