Max-Margin Synchronous Grammar Induction for Machine Translation
Xinyan Xiao and Deyi Xiong∗
School of Computer Science and Technology, Soochow University, Suzhou 215006, China
[email protected], [email protected]
∗ Corresponding author

Abstract

Traditional synchronous grammar induction estimates parameters by maximizing likelihood, which only has a loose relation to translation quality. Alternatively, we propose a max-margin estimation approach that discriminatively induces synchronous grammars for machine translation and directly optimizes translation quality as measured by BLEU. In the max-margin estimation of parameters, we only need to calculate Viterbi translations. This further facilitates the incorporation of various non-local features defined on the target side. We test the effectiveness of our max-margin estimation framework on a competitive hierarchical phrase-based system. Experiments show that our max-margin method significantly outperforms the traditional two-step pipeline for synchronous rule extraction by 1.3 BLEU points, and is also better than a previous max-likelihood estimation method.

1 Introduction

Synchronous grammar induction, the process of learning translation rules from a bilingual corpus, remains an open problem in statistical machine translation (SMT). Although state-of-the-art SMT systems model the translation process with synchronous grammars (including bilingual phrases), most of them still learn translation rules via a pipeline with word-based heuristics (Koehn et al., 2003). This pipeline first builds word alignments using heuristic combination strategies, and then heuristically extracts rules that are consistent with those alignments. Such a heuristic pipeline is theoretically inelegant: it introduces an undesirable gap between modeling and learning in an SMT system.

Therefore, researchers have proposed alternative approaches that learn synchronous grammars directly from sentence pairs without word alignments, via generative models (Marcu and Wong, 2002; Cherry and Lin, 2007; Zhang et al., 2008; DeNero et al., 2008; Blunsom et al., 2009; Cohn and Blunsom, 2009; Neubig et al., 2011; Levenberg et al., 2012) or discriminative models (Xiao et al., 2012). These approaches describe in a theoretically elegant way how sentence pairs are generated by applying sequences of synchronous rules. However, they learn synchronous grammars by maximizing likelihood,¹ which only has a loose relation to translation quality (He and Deng, 2012). Moreover, generative models are normally hard to extend with useful features, and the discriminative synchronous grammar induction model proposed by Xiao et al. (2012) only incorporates local features defined on parse trees of the source language. Non-local features, which encode information from parse trees of the target language, have never been exploited before, due to the computational complexity of normalization in max-likelihood estimation.

¹ More precisely, the discriminative model by Xiao et al. (2012) maximizes conditional likelihood.

Consequently, we would like to learn synchronous grammars in a discriminative way that directly maximizes end-to-end translation quality measured by BLEU (Papineni et al., 2002), and that is also able to incorporate non-local features from target parse trees.

We thus propose a max-margin estimation method to discriminatively induce synchronous grammars directly from sentence pairs without word alignments. We maximize the margin between a reference translation and a candidate translation whose translation errors are measured by BLEU: the more serious the translation errors, the larger the required margin. In this way, our max-margin method learns synchronous grammars according to their translation performance. We further incorporate various non-local features defined on target parse trees, and we efficiently calculate the non-local feature values of a translation over its exponential derivation space using the inside-outside algorithm. Because our max-margin estimation optimizes feature weights using only the feature values of Viterbi and reference translations, we can perform optimization efficiently even with non-local features.
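To make this idea concrete, the following is a minimal sketch (our illustration, not the authors' implementation) of a margin-rescaled, perceptron-style update: the reference translation must outscore the current Viterbi candidate by a margin that grows with the candidate's BLEU error. The helpers `decode`, `features`, and `bleu` are assumed, and feature vectors are taken to be dense numpy arrays.

```python
import numpy as np

def max_margin_update(theta, src, ref, decode, features, bleu, lr=0.1):
    """One margin-rescaled update: the reference should outscore the
    Viterbi candidate by a margin proportional to its BLEU error."""
    cand = decode(src, theta)           # Viterbi translation under current weights
    cost = 1.0 - bleu(cand, ref)        # worse translation -> larger required margin
    ref_feats = features(src, ref)
    cand_feats = features(src, cand)
    margin = theta.dot(ref_feats - cand_feats)
    if margin < cost:                   # margin violated: move weights toward the reference
        theta = theta + lr * (ref_feats - cand_feats)
    return theta
```

In practice one would loop such an update over the training sentence pairs; this sketch only conveys the margin-rescaling idea, while the paper's actual optimization procedure is elaborated in Section 3.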
We apply the proposed max-margin estimation method to learn synchronous grammars for a hierarchical phrase-based translation system (Chiang, 2007), which typically produces state-of-the-art performance. With non-local features defined on target parse trees, our max-margin method significantly outperforms a baseline that uses synchronous rules learned from the traditional pipeline by 1.3 BLEU points on large-scale Chinese-English bilingual training data.

The remainder of this paper is organized as follows. Section 2 presents the discriminative synchronous grammar induction model with the non-local features. In Section 3, we elaborate our max-margin estimation method, which directly optimizes BLEU, and discuss how we induce grammar rules. Local and non-local features are described in Section 4. Finally, in Section 5, we verify the effectiveness of our method through experiments that compare it against both the traditional pipeline and a max-likelihood estimation method.

2 Discriminative Model with Non-local Features

[Figure 1: A derivation of a sentence pair represented by a synchronous tree, over the source sentence "bushi yu shalong juxing huitan" (布什 与 沙龙 举行 会谈) and its translation "Bush held a talk with Sharon", built with the rules r1: ⟨yu shalong ⇒ with Sharon⟩, r2: ⟨X juxing huitan ⇒ held a talk X⟩, and r3: ⟨bushi X ⇒ Bush X⟩. The upper and lower parts are the parses on the source-language and target-language sides, respectively. The left subscript of a node X denotes its source span, while the right subscript denotes its target span. A dashed line denotes an alignment from a source span to a target span; the annotation on a dashed line corresponds to the rewriting rule applied in that step of the derivation.]

Let 𝒮 denote the set of all strings in a source language. Given a source sentence s ∈ 𝒮, 𝒯(s) denotes all candidate translations in the target language 𝒯 that can be generated by a synchronous grammar G. A translation t ∈ 𝒯(s) is generated by a sequence of translation steps (r1, ..., rn), where each step applies a synchronous rule r ∈ G. We refer to such a sequence of translation steps as a derivation (see Figure 1) and denote it d ∈ 𝒟(s), where 𝒟(s) represents the derivation space of a source sentence. Given an input source sentence s, an SMT system outputs a pair ⟨t, d⟩; we therefore study the triple ⟨s, t, d⟩. Table 1 summarizes our notation.

In our discriminative model, we calculate the score of a triple ⟨s, t, d⟩ according to the following scoring function:

    f(s, t, d) = θ⊤ Φ(s, t, d)    (1)

where θ ∈ Θ is a feature weight vector and Φ is the feature function.

There are exponentially many possible outputs in SMT. It is therefore necessary to factorize the feature function so that it can be computed efficiently over the SMT output space with dynamic programming. We decompose the feature function of a triple ⟨s, t, d⟩ into a sum of values of the synchronous rules in the derivation d:

    Φ(s, t, d) = Σ_{r ∈ d} φ_local(r, s) + Σ_{r ∈ d} φ_non-local(r, s, t)    (2)

where the first sum collects local features and the second non-local features.
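The following is a minimal sketch of this factorized scoring, following Equations (1) and (2). The dictionary-valued feature extractors `local_feats(r, s)` and `nonlocal_feats(r, s, t)` are hypothetical helpers, not the paper's code:

```python
from collections import defaultdict

def score(s, t, d, theta, local_feats, nonlocal_feats):
    """Score a triple <s, t, d> as in Equations (1) and (2): the feature
    vector decomposes into per-rule terms over the derivation d."""
    phi = defaultdict(float)
    for r in d:                                       # one synchronous rule per step
        for f, v in local_feats(r, s).items():        # phi_local(r, s)
            phi[f] += v
        for f, v in nonlocal_feats(r, s, t).items():  # phi_non-local(r, s, t)
            phi[f] += v
    return sum(theta.get(f, 0.0) * v for f, v in phi.items())  # theta . Phi
```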
Our feature functions include both local and non-local features. A feature is local if and only if it can be factored among the translation steps in a derivation; in other words, the value of a local feature for ⟨s, t, d⟩ can be calculated as a sum of local scores, one per translation step, where each local score depends only on the rule used in the corresponding step and the input sentence. Otherwise, the feature is non-local. Our discriminative model allows us to incorporate non-local features that are defined on target translations. For example, the rule feature in Figure 2(a), which indicates the application of a specific rule in a derivation, is a local feature. Non-local features, by contrast, generally need to encode target boundary words and result in an extremely large number of states during dynamic programming.

[Figure 2: Example features for the derivation in Figure 1. Shaded nodes denote the information encoded in each feature.]

s, S, 𝒮    s is a sentence in a source language; S denotes the source training sentences; 𝒮 denotes all possible source sentences
t, T, 𝒯    the corresponding symbols for the target language, analogous to s, S, 𝒮
d, 𝒟       derivation and derivation space; 𝒟(s) is the space of derivations for a source sentence; 𝒟(s, t) is the space of derivations for a source sentence with its translation
ℋ(s)       hypergraph that represents 𝒟(s)
ℋ(s, t)    hypergraph that represents 𝒟(s, t)

Table 1: Notations in this paper, summarized for clarity.

Fortunately, when integrating out derivations over the derivation space 𝒟(s, t) of a source sentence and its translation, we can calculate the non-local features efficiently: because all derivations in 𝒟(s, t) share the same translation, there is no need to maintain states for target boundary words. We discuss this computational problem in detail in Section 3.3. In the proposed max-margin estimation described in the next section, we only need to integrate out derivations for a Viterbi translation and a reference translation when updating feature weights. Therefore, the defined non-local features allow us not only to exploit useful knowledge from target parse trees, but also to compute those features efficiently over 𝒟(s, t) during max-margin estimation.
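The integration over 𝒟(s, t) can be carried out with the inside-outside algorithm on the hypergraph ℋ(s, t). The sketch below is our illustration under an assumed hypergraph interface (`topological_order`, `incoming`, `tails`, `root`) and an exponentiated per-edge score, not the paper's implementation; it returns expected feature values over all derivations of a fixed sentence pair.

```python
import math
from collections import defaultdict

def prod(xs):
    """Product of an iterable of floats."""
    p = 1.0
    for x in xs:
        p *= x
    return p

def edge_weight(theta, feats):
    """Exponentiated linear score of one rule application."""
    return math.exp(sum(theta.get(f, 0.0) * v for f, v in feats.items()))

def expected_features(hg, theta, feat):
    """Expected feature values over the derivations in D(s, t), encoded as a
    hypergraph H(s, t). Assumed interface: hg.topological_order() yields nodes
    leaves-first, hg.incoming(v) yields the hyperedges deriving node v, e.tails
    lists the (distinct) child nodes of edge e, hg.root is the goal node, and
    feat(e) returns a sparse {name: value} feature dict for the edge's rule."""
    order = list(hg.topological_order())

    # Inside pass: total weight of all subderivations below each node.
    inside = defaultdict(float)
    for v in order:
        edges = list(hg.incoming(v))
        if not edges:
            inside[v] = 1.0          # leaf node
            continue
        for e in edges:
            inside[v] += edge_weight(theta, feat(e)) * prod(inside[u] for u in e.tails)

    # Outside pass: total weight of all derivation contexts above each node.
    outside = defaultdict(float)
    outside[hg.root] = 1.0
    for v in reversed(order):
        for e in hg.incoming(v):
            w = edge_weight(theta, feat(e))
            for u in e.tails:
                rest = prod(inside[x] for x in e.tails if x is not u)
                outside[u] += outside[v] * w * rest

    # Posterior-weighted sum of edge features gives the expectations.
    Z = inside[hg.root]              # partition function over D(s, t)
    expect = defaultdict(float)
    for v in order:
        for e in hg.incoming(v):
            fe = feat(e)
            p = outside[v] * edge_weight(theta, fe) * prod(inside[u] for u in e.tails) / Z
            for f, val in fe.items():
                expect[f] += p * val
    return expect
```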