Merged bilingual trees based on Universal Dependencies in machine translation

David Mareček
Institute of Formal and Applied Linguistics
Faculty of Mathematics and Physics, Charles University in Prague
Malostranské náměstí 25, 118 00 Prague, Czech Republic
[email protected]

Abstract

In this paper, we present our new experimental system that merges the dependency representations of two parallel sentences into one dependency tree. All the inner nodes in the dependency tree represent source-target pairs of words; the extra words appear as leaf nodes. We use the Universal Dependencies annotation style, in which function words, whose usage often differs between languages, are annotated as leaves. The parallel treebank is parsed in a minimally supervised way. Unaligned words are automatically pushed to the leaves. We present a simple translation system trained on such merged trees and evaluate it in the WMT 2016 English-to-Czech and Czech-to-English translation task. Even though the model is still very simple and no language model or word-reordering model was used, the Czech-to-English variant reached a BLEU score similar to that of another established tree-based system.

1 Introduction

Tree-based machine translation systems (Chiang et al., 2005; Dušek et al., 2012; Sennrich and Haddow, 2015) are alternatives to the leading phrase-based MT systems (Koehn et al., 2007) and to the newly very progressive neural MT systems (Bahdanau et al., 2015). Our approach aims to produce bilingual dependency trees in which both the source and the target sentence are encoded together. We adopt the Universal Dependencies annotation style (Nivre et al., 2016), in which function words¹ occupy the leaf nodes, and therefore the grammatical differences between the languages do not much affect the common dependency structure. We were partially inspired by stochastic inversion transduction grammars (Wu, 1997).

Our merged dependency trees are defined in Section 2. The data we use and the necessary preprocessing are described in Section 3. The merging algorithm itself, which merges two parallel sentences into one, is described in Section 4. Section 5 presents the minimally supervised parsing of the merged sentences. The experimental translation system using the merged trees is described in Section 6. Finally, we present our results (Section 7) and conclusions (Section 8).

2 Merged trees

We introduce "merged trees", in which parallel sentences from two languages are represented by a single dependency tree. Each node of the tree consists of two word forms and two POS tags. An example of such a merged dependency tree is in Figure 3. If two words are translations of each other (1-1 alignment), they share one node labeled by both of them. Words that do not have counterparts in the other sentence (1-0 or 0-1 alignment) are also represented by nodes, and the missing counterpart is marked by the label <empty>. All such nodes representing a single word without any counterpart are leaf nodes. This ensures that the merged tree structure can be simply divided into two monolingual trees, not including the empty nodes. The two separated trees are also "internally" isomorphic; the only differences are in the leaves.

The annotation style of Universal Dependencies is suitable for the merged tree structures, since the majority of function words are annotated as leaves there. Function words are the ones that often cannot be translated one-to-one. For example, prepositions in one language can be translated as different noun suffixes in another one, some languages use determiners and some do not, and auxiliary verbs are also used differently across languages.

¹ Function words are determiners, prepositions, conjunctions, auxiliary verbs, particles, etc.
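To make this representation concrete, the following sketch shows one possible Python encoding of a merged node; the class name MergedNode, its field names, and the example are our own illustration and are not taken from the paper's implementation.

# A minimal sketch (not the author's implementation) of the merged-tree
# node from Section 2: each node carries a source and a target word form
# and POS tag, with "<empty>" standing in for a missing counterpart.
from dataclasses import dataclass, field
from typing import List

EMPTY = "<empty>"  # label for the missing side of a 1-0 / 0-1 alignment

@dataclass
class MergedNode:
    src_form: str                    # English form, or EMPTY
    tgt_form: str                    # Czech form, or EMPTY
    src_tag: str                     # universal POS tag, or EMPTY
    tgt_tag: str                     # universal POS tag, or EMPTY
    children: List["MergedNode"] = field(default_factory=list)

# The node "painstakingly_velice" from Figure 3, with its empty-source
# dependent "<empty>_opatrně" attached as a leaf:
adv = MergedNode("painstakingly", "velice", "ADV", "ADV",
                 [MergedNode(EMPTY, "opatrně", EMPTY, "ADV")])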

3 Data

The parallel data we use in the experiments are the training part of the Czech-English parallel corpus CzEng 1.0 (Bojar et al., 2012). It consists of more than 15 million sentences, with 206 million tokens on the Czech side and 232 million tokens on the English side. We extract the parallel sentences with their original tokenization from the CzEng export format, together with the part-of-speech (POS) tags and the word alignment.

The original CzEng POS tags, Prague Dependency Treebank tags (Hajič et al., 2006) for Czech and Penn Treebank tags (Marcus et al., 1993) for English, are mapped to the universal POS tagset developed for Universal Dependencies (Nivre et al., 2016). The simple 1-to-1 mapping was taken from the GitHub repository.² The universal POS tags are listed in Table 1.

ADJ    adjective         PART   particle
ADP    adposition        PRON   pronoun
ADV    adverb            PROPN  proper noun
AUX    auxiliary verb    PUNCT  punctuation
CONJ   coord. conj.      SCONJ  subord. conj.
DET    determiner        SYM    symbol
INTJ   interjection      VERB   verb
NOUN   noun              X      other
NUM    numeral

Table 1: List of part-of-speech tags used in the Universal Dependencies annotation style.

4 Merging algorithm

Parallel sentences tagged with the universal POS tags are then merged together using the algorithm in Figure 1. We describe the algorithm for the English-to-Czech translation direction, even though the procedure is language universal.

Input: enF, enT, csF, csT: arrays of forms and tags of the English and Czech sentence
Input: en2csAlign, cs2enAlign: unidirectional alignment links between English and Czech
Output: mrgF, mrgT: arrays of forms and tags of the merged sentence

k = 0;
foreach i ∈ {1, ..., |enF|} do
    used = 0;
    foreach j ∈ {1, ..., |csF|} do
        if cs2enAlign[j] ≠ i then continue;
        k++;
        if en2csAlign[i] = j then
            mrgF[k] = enF[i] + '_' + csF[j];
            mrgT[k] = enT[i] + '_' + csT[j];
            used = 1;
        else
            mrgF[k] = '<empty>_' + csF[j];
            mrgT[k] = '<empty>_' + csT[j];
        end
    end
    if used = 0 then
        k++;
        mrgF[k] = enF[i] + '_<empty>';
        mrgT[k] = enT[i] + '_<empty>';
    end
end
return mrgF, mrgT;

Figure 1: Merging algorithm pseudocode.

The algorithm uses two unidirectional alignments, which we call en2csAlign and cs2enAlign. For each English word, en2csAlign defines its counterpart in the Czech sentence; conversely, cs2enAlign defines the English counterpart for each Czech word.³ These alignment links are direct outputs of the GIZA++ word-alignment tool (Och and Ney, 2003) before symmetrization.

The algorithm traverses the source sentence and, for each word, collects all its target counterparts using cs2enAlign.⁴ The Czech word on which cs2enAlign and en2csAlign intersect forms a word pair with the English one. The other Czech words stay alone and are completed with the <empty> label. If there is no intersection counterpart for the English word, it is also completed with the <empty> label. Figure 2 shows one example of merging.

² https://github.com/UniversalDependencies
³ In the CzEng corpus export format, these alignments are called ali there and ali back; sometimes they are also called left and right alignments.
⁴ Since we search for all the Czech words that are aligned to the English one, we need cs2enAlign.
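For readers who prefer running code, here is a direct Python transcription of Figure 1; it is a sketch under our reading of the pseudocode, with 0-based indexing and each alignment represented as an array mapping a position to its aligned position on the other side (or None).

# Python transcription of the merging algorithm in Figure 1 (a sketch;
# names follow the pseudocode).
EMPTY = "<empty>"

def merge(en_f, en_t, cs_f, cs_t, en2cs, cs2en):
    """en2cs[i]: Czech index aligned to English word i (or None);
    cs2en[j]: English index aligned to Czech word j (or None)."""
    mrg_f, mrg_t = [], []
    for i in range(len(en_f)):
        used = False
        for j in range(len(cs_f)):
            if cs2en[j] != i:          # Czech word j not aligned to English i
                continue
            if en2cs[i] == j:          # intersection link: a 1-1 word pair
                mrg_f.append(en_f[i] + "_" + cs_f[j])
                mrg_t.append(en_t[i] + "_" + cs_t[j])
                used = True
            else:                      # Czech word without an intersection link
                mrg_f.append(EMPTY + "_" + cs_f[j])
                mrg_t.append(EMPTY + "_" + cs_t[j])
        if not used:                   # English word without any counterpart
            mrg_f.append(en_f[i] + "_" + EMPTY)
            mrg_t.append(en_t[i] + "_" + EMPTY)
    return mrg_f, mrg_t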

Shaw painstakingly removed the case from under the rigid fingers .
PROPN ADV VERB DET NOUN ADP ADP DET ADJ NOUN PUNCT

Shaw velice opatrně vysunul kufřík ze ztuhlých prstů .
PROPN ADV ADV VERB NOUN ADP ADJ NOUN PUNCT

Shaw_Shaw painstakingly_velice <empty>_opatrně removed_vysunul the_<empty> case_kufřík from_ze under_<empty> the_<empty> rigid_ztuhlých fingers_prstů ._.
PROPN_PROPN ADV_ADV <empty>_ADV VERB_VERB DET_<empty> NOUN_NOUN ADP_ADP ADP_<empty> DET_<empty> ADJ_ADJ NOUN_NOUN PUNCT_PUNCT

Figure 2: Example of merging an English and a Czech sentence together with the Universal Dependencies POS tags. Alignment links are depicted by arrows. Bidirectional arrows represent the intersection connections.


removed_vysunul VERB_VERB

Shaw_Shaw painstakingly_velice case_kufřík ._.
PROPN_PROPN ADV_ADV NOUN_NOUN PUNCT_PUNCT

<empty>_opatrně the_<empty> fingers_prstů
<empty>_ADV DET_<empty> NOUN_NOUN

from_ze under_<empty> the_<empty> rigid_ztuhlých
ADP_ADP ADP_<empty> DET_<empty> ADJ_ADJ

Figure 3: Example of an English-Czech merged tree. The same sentence as in Figure 2 is shown.
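Figure 3 also illustrates the property from Section 2 that a merged tree can be separated back into two monolingual trees. The sketch below (our illustration, reusing the hypothetical MergedNode class and EMPTY label from the Section 2 sketch) extracts one side; this is the same separation later used in Section 6 to obtain English training trees.

# Sketch of separating one monolingual tree from a merged tree.
def monolingual_tree(node, side="src"):
    """side is "src" (English) or "tgt" (Czech). Nodes whose chosen side
    is EMPTY are always leaves, so dropping them never disconnects the tree."""
    form = getattr(node, side + "_form")
    children = [monolingual_tree(c, side) for c in node.children
                if getattr(c, side + "_form") != EMPTY]
    return (form, children)   # a simple (form, children) nested-pair tree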

The pairs of words connected by both cs2enAlign and en2csAlign links are paired together into one word, and their POS tags are paired in the same way. The words without intersection counterparts are paired with <empty> words or <empty> POS tags, respectively. The tokens in the merged sentence are ordered primarily according to the English sentence. The Czech words with <empty> counterparts are placed together with the Czech words aligned to the same English word. Globally, the Czech word order cannot be preserved, due to crossing intersection alignment links, which is quite a common phenomenon.

5 Minimally Supervised Parallel Parsing

For parsing the merged sentences, we use the Unsupervised Dependency Parser (UDP) implemented by Mareček and Straka (2013). The source code is freely available,⁵ and it includes a mechanism for importing external probabilities. UDP is based on the Dependency Model with Valence, a generative model which consists of two sub-models:

• Stop model: p_stop(·|t_g, dir) represents the probability of not generating another dependent in direction dir of a node with POS tag t_g. The direction dir can be left or right. If p_stop = 1, the node with the tag t_g cannot have any dependent in direction dir. If it is 1 in both directions, the node is a leaf.

• Attach model: p_attach(t_d|t_g, dir) represents the probability that a dependent of the node with POS tag t_g in direction dir is labeled with POS tag t_d.

In other words, the stop model generates edges, while the attach model generates POS tags for the new nodes.

⁵ http://ufal.mff.cuni.cz/udp
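To make the generative story concrete, here is a toy scoring function for a tree under these two sub-models. This is our illustration, not UDP's code; for brevity it ignores DMV's conditioning on whether a dependent is the first one in its direction.

# A toy sketch of how the two sub-models assign (log-)probability to a
# dependency tree: stop decisions generate the edges, attach decisions
# generate the POS tags of the new dependents.
import math
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Node:
    tag: str                                  # here: the POS-tag pair of a merged node
    left: List["Node"] = field(default_factory=list)
    right: List["Node"] = field(default_factory=list)

def tree_log_prob(node: Node,
                  p_stop: Dict[Tuple[str, str], float],
                  p_attach: Dict[Tuple[str, str, str], float]) -> float:
    lp = 0.0
    for direction, deps in (("left", node.left), ("right", node.right)):
        for dep in deps:
            lp += math.log(1.0 - p_stop[(node.tag, direction)])       # do not stop yet
            lp += math.log(p_attach[(dep.tag, node.tag, direction)])  # choose the dependent's tag
            lp += tree_log_prob(dep, p_stop, p_attach)                # recurse into the subtree
        lp += math.log(p_stop[(node.tag, direction)])                 # finally stop in this direction
    return lp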

The inference is done using blocked Gibbs sampling (Gilks et al., 1996).

During the inference, the attach and stop probabilities can be combined linearly with external prior probabilities p^ext:

p_stop^final = (1 − λ_stop) · p_stop + λ_stop · p_stop^ext,
p_attach^final = (1 − λ_attach) · p_attach + λ_attach · p_attach^ext,

where the parameters λ define their weights. In the original paper (Mareček and Straka, 2013), the external priors p_stop^ext were computed based on the reducibility principle on big raw corpora.

We use the external prior probabilities to define grammatical rules for the POS tags based on the UD annotation style. Our rules simply describe how likely a node with a particular POS tag is to be a leaf. In the case of the merged trees, there is a pair of POS tags in each node. We manually set the p_stop^ext values for the POS tag pairs as listed in Table 2. In case the two POS tags in one node have different p_stop^ext, we take the higher one. For example, for the pair ADP_VERB, we set its prior stop probability to 1.0 (as for the tag ADP), even though the tag VERB alone should get 0.1.

POS tags                           p_stop^ext
ADP, ADV, AUX, CONJ, DET, PART,    1.0
PRON, PUNCT, SCONJ, <empty>
ADJ, INTJ, SYM                     0.7
NOUN, PROPN, NUM, X                0.4
else                               0.1

Table 2: Stop probability priors set for the individual POS tags.

It is possible to define different left and right p_stop^ext priors; however, we decided to set them equally for both directions, since this is linguistically more language independent.

An example of a merged dependency tree is shown in Figure 3.

6 Our Simple Machine Translation System

Our simple translation system based on the merged trees has the following three steps:

• training: We go through the training merged trees and compute so-called tree-n-gram counts. The tree-n-grams are n-grams with the parent and children words added into the context.

• parsing: We parse the input data using a parser trained on the source parts of the merged trees.

• decoding: We use the tree-n-gram counts to predict the most probable translation of each source tree.

In the training phase, we traverse the training merged trees and collect the tree-n-gram counts. Besides looking at the previous and following words, we also look at the parent words. In the training and decoding phases, we work only with the word forms, not with the POS tags. We denote w_i the i-th node in the merged tree and p_i its parent. The previous and following words are w_{i−1} and w_{i+1}, respectively. We collect the tree-n-gram counts only over the source words. Their types are listed in Table 3.

1. count(w_{i−1}, w_i, w_{i+1})
2. count(w_i, p_i, w_{i+1})
3. count(w_i, p_i, w_{i−1})
4. count(w_i, w_{i+1})
5. count(w_i, w_{i−1})
6. count(w_i, p_i)
7. count(w_i)

Table 3: Tree-n-gram types collected.

For each node, we also define its full target translation, which consists of the target (in our case Czech) form of the node together with the target forms of all its child nodes with <empty> source (English) forms. For example, in Figure 3, the full Czech translation of the node "painstakingly_velice" is not only the word "velice", but the two words "velice opatrně".

The parsing phase is necessary to get monolingual trees for the sentences we need to translate. Since the merged trees preserve the word order of the source sentences (English), we can simply separate the single English dependency trees from the merged trees. We train the MST parser (McDonald et al., 2005) on the separated source (English) trees. The parser is then used to parse the input sentences for translation.

In the decoding step, we translate the parsed source (English) tree into the target (Czech) sentence. For each source node, we go through the tree-n-gram list, from the largest n-gram to the single unigram (according to Table 3), and check whether the n-gram appears in the training data more than once.⁶ If it does, we translate the node by the most frequent Czech translation found in the training data. If not, we continue to the next type of n-gram until we find one which appears in the training data more than once. If we end up with the single unigram, it is enough if it appears once in the training data, and we use its translation. In case the English word is out-of-vocabulary, we do not translate it and use the same form in the Czech translation. Note that the Czech translation of a node can be several words or no word at all (in case the English word appears most frequently with the <empty> label).

When the whole dependency tree is translated, we simply project the tree into a linear sentence by a depth-first algorithm.

⁶ We do not use n-grams which appear only once in the training data; we translate the node using a smaller n-gram instead.
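The back-off procedure of the decoding step fits in a few lines; the sketch below is our illustration, where the table `counts` (mapping each of the seven context types of Table 3 to the full target translations observed with it in training) is an assumed data structure, not the paper's actual one.

# A sketch of the decoding back-off from Section 6: for each source node,
# try the context types of Table 3 from the largest down to the unigram
# and output the most frequent full target translation observed with the
# first context seen more than once in training.
from collections import Counter
from typing import Dict, List, Optional, Tuple

def contexts(w: str, prev: Optional[str], nxt: Optional[str],
             parent: Optional[str]) -> List[Tuple]:
    # the seven context types of Table 3, from the largest to the unigram
    return [(prev, w, nxt), (w, parent, nxt), (w, parent, prev),
            (w, nxt), (w, prev), (w, parent), (w,)]

def translate_node(w, prev, nxt, parent,
                   counts: List[Dict[Tuple, Counter]]) -> str:
    ctxs = contexts(w, prev, nxt, parent)
    for k, ctx in enumerate(ctxs[:-1]):
        seen = counts[k].get(ctx, Counter())
        if sum(seen.values()) > 1:            # footnote 6: require count > 1
            return seen.most_common(1)[0][0]
    seen = counts[-1].get(ctxs[-1], Counter())
    if seen:                                  # for the unigram, once is enough
        return seen.most_common(1)[0][0]
    return w                                  # out-of-vocabulary: copy the source form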

7 Results

We tested our translation system in the WMT 2016 News translation task on the English-to-Czech and Czech-to-English language pairs. The BLEU scores (Papineni et al., 2002) are shown in Table 4.

language pair       BLEU   BLEU cased
English-to-Czech     9.5    8.3
Czech-to-English    15.6   13.2

Table 4: Our system's BLEU scores.

Both scores are quite low compared to the best translation systems, which reach more than 25 or 30 BLEU points, respectively. However, for the Czech-to-English direction, the results are comparable with the established tree-based system TectoMT (Mareček et al., 2010; Dušek et al., 2012), which has 14.6 BLEU points, and 13.6 BLEU points in the cased variant.

Our system is still under development; this is the first attempt to employ the merged trees in machine translation. So far, it does not use any language modelling or word reordering. The fact that non-aligned words are treated as function words can cause shorter translations with missing content words. All such shortcomings are planned to be solved in future work.

8 Conclusions

We presented the merged trees, bilingual dependency trees in the Universal Dependencies style, parsed in a minimally supervised way. The main purpose of such trees is to help in machine translation. We showed a very simple translation system and evaluated it in the WMT 2016 News translation task. In future work, we will work on improving the system. We plan to employ machine learning, beam search, and language modelling to approach the better MT systems.

Acknowledgments

This work was supported by the grant 14-06548P of the Czech Science Foundation and by the 7th Framework Programme of the EU, grant QTLeap (No. 610516).

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR.

Ondřej Bojar, Zdeněk Žabokrtský, Ondřej Dušek, Petra Galuščáková, Martin Majliš, David Mareček, Jiří Maršík, Michal Novák, Martin Popel, and Aleš Tamchyna. 2012. The Joy of Parallelism with CzEng 1.0. In Proceedings of LREC 2012, Istanbul, Turkey, May. ELRA, European Language Resources Association.

David Chiang, Adam Lopez, Nitin Madnani, Christof Monz, Philip Resnik, and Michael Subotin. 2005. The Hiero Machine Translation System: Extensions, Evaluation, and Analysis. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, HLT '05, pages 779–786, Stroudsburg, PA, USA. Association for Computational Linguistics.

Ondřej Dušek, Zdeněk Žabokrtský, Martin Popel, Martin Majliš, Michal Novák, and David Mareček. 2012. Formemes in English-Czech Deep Syntactic MT. In Proceedings of the Seventh Workshop on Statistical Machine Translation, WMT '12, pages 267–274, Stroudsburg, PA, USA. Association for Computational Linguistics.

Walter R. Gilks, S. Richardson, and David J. Spiegelhalter. 1996. Markov Chain Monte Carlo in Practice. Interdisciplinary Statistics. Chapman & Hall.
Jan Hajič, Eva Hajičová, Jarmila Panevová, Petr Sgall, Petr Pajas, Jan Štěpánek, Jiří Havelka, and Marie Mikulová. 2006. Prague Dependency Treebank 2.0. CD-ROM, Linguistic Data Consortium, LDC Catalog No.: LDC2006T01, Philadelphia.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In ACL 2007, Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic, June. Association for Computational Linguistics.


Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

David Mareček, Martin Popel, and Zdeněk Žabokrtský. 2010. Maximum entropy translation model in dependency-based MT framework. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 201–206. Association for Computational Linguistics.

David Mareček and Milan Straka. 2013. Stop-probability estimates computed on a large corpus improve Unsupervised Dependency Parsing. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 281–290, Sofia, Bulgaria, August. Association for Computational Linguistics.

Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajič. 2005. Non-Projective Dependency Parsing using Spanning Tree Algorithms. In Proceedings of the Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 523–530, Vancouver, BC, Canada.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016. Universal Dependencies v1: A Multilingual Treebank Collection. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia. European Language Resources Association.

Franz Josef Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1):19–51.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, pages 311–318, Stroudsburg, PA, USA. Association for Computational Linguistics.

Rico Sennrich and Barry Haddow. 2015. A Joint Dependency Model of Morphological and Syntactic Structure for Statistical Machine Translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2081–2087, Lisbon, Portugal, September. Association for Computational Linguistics.

Dekai Wu. 1997. Stochastic Inversion Transduction Grammars and Bilingual Parsing of Parallel Corpora. Computational Linguistics, 23(3):377–403.
