
Machine Translation 2: Statistical MT: Phrase-Based and Neural
Ondřej Bojar, [email protected]
Institute of Formal and Applied Linguistics
Faculty of Mathematics and Physics, Charles University, Prague
December 2017

Outline of Lectures on MT

1. Introduction.
   • Why MT is difficult.
   • MT evaluation.
   • Approaches to MT.
   • First peek into phrase-based MT.
   • Document, sentence and word alignment.
2. Statistical Machine Translation.
   • Phrase-based, Hierarchical and Syntactic MT.
   • Neural MT: Sequence-to-sequence.
3. Advanced Topics.
   • Linguistic Features in SMT and NMT.
   • Multilinguality, Multi-Task, Learned Representations.

Outline of MT Lecture 2

1. What makes MT statistical.
   • Brute-force statistical MT.
   • Noisy channel model.
   • Log-linear model.
2. Phrase-based translation model.
   • Phrase extraction.
   • Decoding (gradual construction of hypotheses).
   • Minimum error-rate training (weight optimization).
3. Neural machine translation (NMT).
   • Sequence-to-sequence, with attention.

Quotes

Warren Weaver (1949): "I have a text in front of me which is written in Russian but I am going to pretend that it is really written in English and that it has been coded in some strange symbols. All I need to do is strip off the code in order to retrieve the information contained in the text."

Noam Chomsky (1969): "... the notion 'probability of a sentence' is an entirely useless one, under any known interpretation of this term."

Frederick Jelinek (1980s; IBM; later JHU and sometimes ÚFAL): "Every time I fire a linguist, the accuracy goes up."

Hermann Ney (RWTH Aachen University): MT = Linguistic Modelling + Statistical Decision Theory

The Statistical Approach

(Statistical = information-theoretic.)
• Specify a probabilistic model = how the probability mass is distributed among possible outputs given the observed inputs.
• Specify the training criterion and procedure = how to learn the free parameters from training data.
Notice:
• Linguistics is helpful when designing the models:
  – How to divide the input into smaller units.
  – Which bits of the observations are more informative.

Statistical MT

Given a source (foreign) language sentence f_1^J = f_1 ... f_j ... f_J, produce a target language (English) sentence e_1^I = e_1 ... e_i ... e_I. Among all possible target language sentences, choose the one with the highest probability:

\hat{e}_1^{\hat{I}} = \operatorname*{argmax}_{I, e_1^I} p(e_1^I \mid f_1^J)    (1)

We stick to the e_1^I, f_1^J notation despite translating from English to Czech.

Brute-Force MT (1/2)

Translate only sentences listed in a "translation memory" (TM):

Good morning. = Dobré ráno.
How are you? = Jak se máš?
How are you? = Jak se máte?

p(e_1^I \mid f_1^J) = \begin{cases} 1 & \text{if } (e_1^I, f_1^J) \text{ seen in the TM} \\ 0 & \text{otherwise} \end{cases}    (2)

Any problems with the definition?

Brute-Force MT (2/2)

p(e_1^I \mid f_1^J) = \begin{cases} 1 & \text{if } (e_1^I, f_1^J) \text{ seen in the TM} \\ 0 & \text{otherwise} \end{cases}    (3)

• Not a probability: there may be f_1^J such that \sum_{e_1^I} p(e_1^I \mid f_1^J) > 1.
  ⇒ Have to normalize: use \frac{count(e_1^I, f_1^J)}{count(f_1^J)} instead of 1.
• Not "smooth", no generalization:
  Good morning. ⇒ Dobré ráno.
  Good evening. ⇒ ∅
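To make the proposed normalization concrete, here is a minimal Python sketch (not part of the original slides) of the relative-frequency estimate count(e_1^I, f_1^J) / count(f_1^J) over the toy translation memory above; the function name, data layout and the unseen example "Dobrý večer." are illustrative only.

```python
from collections import Counter

# Toy translation memory: (source f, target e) sentence pairs from the slide.
tm = [
    ("Good morning.", "Dobré ráno."),
    ("How are you?", "Jak se máš?"),
    ("How are you?", "Jak se máte?"),
]

pair_counts = Counter(tm)                  # count(e, f), stored as (f, e) tuples
source_counts = Counter(f for f, _ in tm)  # count(f)

def p_e_given_f(e: str, f: str) -> float:
    """Relative-frequency estimate p(e | f) = count(e, f) / count(f)."""
    if source_counts[f] == 0:
        return 0.0   # never-seen source sentence: no generalization at all
    return pair_counts[(f, e)] / source_counts[f]

print(p_e_given_f("Jak se máš?", "How are you?"))    # 0.5  (two competing translations)
print(p_e_given_f("Jak se máte?", "How are you?"))   # 0.5
print(p_e_given_f("Dobré ráno.", "Good morning."))   # 1.0
print(p_e_given_f("Dobrý večer.", "Good evening."))  # 0.0  -> the "no smoothing" problem
```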
Bayes' Law

Bayes' law for conditional probabilities: p(a \mid b) = \frac{p(b \mid a)\, p(a)}{p(b)}

So in our case:

\hat{e}_1^{\hat{I}} = \operatorname*{argmax}_{I, e_1^I} p(e_1^I \mid f_1^J)    [apply Bayes' law]
= \operatorname*{argmax}_{I, e_1^I} \frac{p(f_1^J \mid e_1^I)\, p(e_1^I)}{p(f_1^J)}    [p(f_1^J) constant ⇒ irrelevant in maximization]
= \operatorname*{argmax}_{I, e_1^I} p(f_1^J \mid e_1^I)\, p(e_1^I)

Also called the "Noisy Channel" model.

Motivation for Noisy Channel

\hat{e}_1^{\hat{I}} = \operatorname*{argmax}_{I, e_1^I} p(f_1^J \mid e_1^I)\, p(e_1^I)    (4)

Bayes' law divided the model into components:
• p(f_1^J \mid e_1^I) ... translation model ("reversed", e_1^I → f_1^J) ... is it a likely translation?
• p(e_1^I) ... language model (LM) ... is the output a likely sentence of the target language?
• The components can be trained on different sources. There are far more monolingual data ⇒ the language model is more reliable.

Without Equations

[Diagram: parallel texts train the Translation Model, monolingual texts train the Language Model; a Global Search combines both to map the Input to the Output, the sentence with the highest probability.]

Summary of Language Models

• p(e_1^I) should report how "good" a sentence e_1^I is.
• We surely want p(The the the.) < p(Hello.)
• How about p(The cat was black.) < p(Hello.)? ... We don't really care in MT. We hope to compare synonymous sentences.

The LM is usually a 3-gram language model:

p(<s> The cat was black . </s>) = p(The | <s>) · p(cat | <s> The) · p(was | The cat) · p(black | cat was) · p(. | was black) · p(</s> | black .) · p(</s> | . </s>)

Formally, with n = 3:

p_{LM}(e_1^I) = \prod_{i=1}^{I} p(e_i \mid e_{i-n+1}^{i-1})    (5)

Estimating and Smoothing the LM

p(w_1) = \frac{count(w_1)}{\text{total words observed}}    ... unigram probabilities
p(w_2 \mid w_1) = \frac{count(w_1 w_2)}{count(w_1)}    ... bigram probabilities
p(w_3 \mid w_1, w_2) = \frac{count(w_1 w_2 w_3)}{count(w_1 w_2)}    ... trigram probabilities

Unseen n-grams (p(ngram) = 0) are a big problem; a single one invalidates the whole sentence: p_{LM}(e_1^I) = ... · 0 · ... = 0
⇒ Back off to shorter n-grams:

p_{LM}(e_1^I) = \prod_{i=1}^{I} \big( 0.8 \cdot p(e_i \mid e_{i-1}, e_{i-2}) + 0.15 \cdot p(e_i \mid e_{i-1}) + 0.049 \cdot p(e_i) + 0.001 \big) \neq 0    (6)

From Bayes to Log-Linear Model

Och (2002) discusses some problems of the noisy-channel model (Equation 4):
• Models are estimated unreliably ⇒ maybe the LM is more important:

\hat{e}_1^{\hat{I}} = \operatorname*{argmax}_{I, e_1^I} p(f_1^J \mid e_1^I)\, \big(p(e_1^I)\big)^2    (7)

• In practice, the "direct" translation model is equally good:

\hat{e}_1^{\hat{I}} = \operatorname*{argmax}_{I, e_1^I} p(e_1^I \mid f_1^J)\, p(e_1^I)    (8)

• It is complicated to correctly introduce other dependencies.
⇒ Use a log-linear model instead.

Log-Linear Model (1)

• p(e_1^I \mid f_1^J) is modelled as a weighted combination of models, called "feature functions": h_1(\cdot,\cdot), ..., h_M(\cdot,\cdot)

p(e_1^I \mid f_1^J) = \frac{\exp\big(\sum_{m=1}^{M} \lambda_m h_m(e_1^I, f_1^J)\big)}{\sum_{{e'}_1^{I'}} \exp\big(\sum_{m=1}^{M} \lambda_m h_m({e'}_1^{I'}, f_1^J)\big)}    (9)

• Each feature function h_m(e, f) relates the source f to the target e, e.g. the feature for the n-gram language model:

h_{LM}(f_1^J, e_1^I) = \log \prod_{i=1}^{I} p(e_i \mid e_{i-n+1}^{i-1})    (10)

• Model weights \lambda_1^M specify the relative importance of the features.
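Before the denominator is dropped on the next slide, here is a small Python sketch (not from the slides) of how Equation 9 scores competing hypotheses; the feature values, weights and candidate names are made up. Because the denominator is shared by all candidates for the same source sentence, ranking by the unnormalized exp(Σ_m λ_m h_m) already yields the same argmax.

```python
import math

def loglinear_score(features, weights):
    """Unnormalized score exp(sum_m lambda_m * h_m(e, f)) from Eq. 9."""
    return math.exp(sum(weights[name] * h for name, h in features.items()))

# Equal weights for the two noisy-channel features h_TM = log p(f|e), h_LM = log p(e).
weights = {"h_TM": 1.0, "h_LM": 1.0}

# Two hypothetical candidate translations of the same source sentence f,
# with invented model probabilities.
candidates = {
    "e1": {"h_TM": math.log(0.020), "h_LM": math.log(0.001)},
    "e2": {"h_TM": math.log(0.015), "h_LM": math.log(0.004)},
}

scores = {e: loglinear_score(h, weights) for e, h in candidates.items()}
normalizer = sum(scores.values())   # the denominator of Eq. 9 over this candidate set

for e, s in scores.items():
    print(e, s / normalizer)        # proper probabilities, summing to 1

print("best:", max(scores, key=scores.get))  # same winner with or without the normalizer
```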
Log-Linear Model (2)

As before, the constant denominator is not needed in the maximization:

\hat{e}_1^{\hat{I}} = \operatorname*{argmax}_{I, e_1^I} \frac{\exp\big(\sum_{m=1}^{M} \lambda_m h_m(e_1^I, f_1^J)\big)}{\sum_{{e'}_1^{I'}} \exp\big(\sum_{m=1}^{M} \lambda_m h_m({e'}_1^{I'}, f_1^J)\big)}
= \operatorname*{argmax}_{I, e_1^I} \exp\big(\sum_{m=1}^{M} \lambda_m h_m(e_1^I, f_1^J)\big)    (11)

Relation to Noisy Channel

With equal weights and only two features:
• h_{TM}(e_1^I, f_1^J) = \log p(f_1^J \mid e_1^I) for the translation model,
• h_{LM}(e_1^I, f_1^J) = \log p(e_1^I) for the language model,
the log-linear model reduces to the Noisy Channel:

\hat{e}_1^{\hat{I}} = \operatorname*{argmax}_{I, e_1^I} \exp\big(\sum_{m=1}^{M} \lambda_m h_m(e_1^I, f_1^J)\big)
= \operatorname*{argmax}_{I, e_1^I} \exp\big(h_{TM}(e_1^I, f_1^J) + h_{LM}(e_1^I, f_1^J)\big)
= \operatorname*{argmax}_{I, e_1^I} \exp\big(\log p(f_1^J \mid e_1^I) + \log p(e_1^I)\big)
= \operatorname*{argmax}_{I, e_1^I} p(f_1^J \mid e_1^I)\, p(e_1^I)    (12)

Phrase-Based MT Overview

Example phrase pairs (English → Czech):

This time around = Nyní
they 're moving = zareagovaly
even = dokonce ještě
... = ...
This time around , they 're moving = Nyní zareagovaly
, even faster = dokonce ještě rychleji
... = ...

[Figure: phrase alignment grid between "This time around , they 're moving even faster" and "Nyní zareagovaly dokonce ještě rychleji".]

Phrase-based MT: choose such a segmentation of the input string and such phrase "replacements" that the output sequence is "coherent" (its 3-grams are most probable).

Phrase-Based Translation Model

• Captures the basic assumption of phrase-based MT:
  1. Segment the source sentence f_1^J into K phrases \tilde{f}_1 ... \tilde{f}_K.
  2. Translate each phrase independently: \tilde{f}_k → \tilde{e}_k.
  3. Concatenate the translated phrases (with possible reordering R): \tilde{e}_{R(1)} ... \tilde{e}_{R(K)}
• In theory, the segmentation s_1^K is a hidden variable in the maximization, so we should be summing over all segmentations (note the three arguments in h_m(\cdot,\cdot,\cdot) now):

\hat{e}_1^{\hat{I}} = \operatorname*{argmax}_{I, e_1^I} \sum_{s_1^K} \exp\big(\sum_{m=1}^{M} \lambda_m h_m(e_1^I, f_1^J, s_1^K)\big)    (13)

• In practice, the sum is approximated with a max (the biggest element only):

\hat{e}_1^{\hat{I}} = \operatorname*{argmax}_{I, e_1^I} \max_{s_1^K} \exp\big(\sum_{m=1}^{M} \lambda_m h_m(e_1^I, f_1^J, s_1^K)\big)    (14)

Core Feature: Phrase Translation Probability

The most important feature: phrase-to-phrase translation:

h_{Phr}(f_1^J, e_1^I, s_1^K) = \log \prod_{k=1}^{K} p(\tilde{f}_k \mid \tilde{e}_k)    (15)

The conditional probability of the phrase \tilde{f}_k given the phrase \tilde{e}_k is estimated from relative frequencies:

p(\tilde{f}_k \mid \tilde{e}_k) = \frac{count(\tilde{f}, \tilde{e})}{count(\tilde{e})}    (16)

• count(\tilde{f}, \tilde{e}) is the number of co-occurrences of the phrase pair (\tilde{f}, \tilde{e}) that are consistent with the word alignment.
• count(\tilde{e}) is the number of occurrences of the target phrase \tilde{e} in the training corpus.
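As a concrete illustration of Equations 15 and 16 (not part of the slides), the sketch below turns a handful of extracted phrase pairs into relative-frequency phrase translation probabilities; in a real system the pairs come from phrase extraction over word-aligned parallel data, and the pairs and counts here are invented.

```python
import math
from collections import Counter

# Invented stand-in for phrase pairs (f~, e~) extracted from word-aligned data.
extracted_pairs = [
    ("dokonce ještě rychleji", "even faster"),
    ("dokonce ještě rychleji", "even faster"),
    ("ještě rychleji", "even faster"),
    ("Nyní", "This time around"),
]

pair_counts = Counter(extracted_pairs)                  # count(f~, e~)
target_counts = Counter(e for _, e in extracted_pairs)  # count(e~)

def p_phrase(f_phrase: str, e_phrase: str) -> float:
    """Relative-frequency estimate p(f~ | e~) = count(f~, e~) / count(e~)  (Eq. 16)."""
    if target_counts[e_phrase] == 0:
        return 0.0
    return pair_counts[(f_phrase, e_phrase)] / target_counts[e_phrase]

print(p_phrase("dokonce ještě rychleji", "even faster"))  # 2/3
print(p_phrase("ještě rychleji", "even faster"))          # 1/3
print(p_phrase("Nyní", "This time around"))               # 1.0

# The feature value for one segmented hypothesis is the sum of log phrase
# probabilities, as in Eq. 15:
segmentation = [("Nyní", "This time around"), ("dokonce ještě rychleji", "even faster")]
h_phr = sum(math.log(p_phrase(f, e)) for f, e in segmentation)
print("h_Phr =", h_phr)
```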