2: Statistical MT: Phrase-Based and Neural

Ondřej Bojar ([email protected]ff.cuni.cz), Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University

December 2017

Outline of Lectures on MT

1. Introduction.
   • Why is MT difficult.
   • MT evaluation.
   • Approaches to MT.
   • First peek into phrase-based MT.
   • Document, sentence and word alignment.
2. Statistical Machine Translation.
   • Phrase-based, Hierarchical and Syntactic MT.
   • Neural MT: Sequence-to-sequence.
3. Advanced Topics.
   • Linguistic Features in SMT and NMT.
   • Multilinguality, Multi-Task, Learned Representations.

Outline of MT Lecture 2

1. What makes MT statistical.
   • Brute-force statistical MT.
   • Noisy channel model.
   • Log-linear model.
2. Phrase-based translation model.
   • Phrase extraction.
   • Decoding (gradual construction of hypotheses).
   • Minimum error-rate training (weight optimization).
3. Neural machine translation (NMT).
   • Sequence-to-sequence, with attention.

Quotes

Warren Weaver (1949): I have a text in front of me which is written in Russian but I am going to pretend that it is really written in English and that it has been coded in some strange symbols. All I need to do is strip off the code in order to retrieve the information contained in the text.

Noam Chomsky (1969): . . . the notion "probability of a sentence" is an entirely useless one, under any known interpretation of this term.

Frederick Jelinek (80's; IBM; later JHU and sometimes ÚFAL): Every time I fire a linguist, the accuracy goes up.

Hermann Ney (RWTH Aachen University): MT = Linguistic Modelling + Statistical Decision Theory

The Statistical Approach (Statistical = Information-theoretic.)

• Specify a probabilistic model.
  = How is the probability mass distributed among possible outputs given observed inputs.
• Specify the training criterion and procedure.
  = How to learn free parameters from training data.

Notice:
• Linguistics is helpful when designing the models:
  – How to divide the input into smaller units.
  – Which bits of observations are more informative.

Statistical MT

Given a source (foreign) language sentence $f_1^J = f_1 \dots f_j \dots f_J$, produce a target language (English) sentence $e_1^I = e_1 \dots e_i \dots e_I$. Among all possible target language sentences, choose the sentence with the highest probability:

$$\hat{e}_1^{\hat{I}} = \operatorname{argmax}_{I,\, e_1^I} p(e_1^I \mid f_1^J) \qquad (1)$$

We stick to the $e_1^I$, $f_1^J$ notation despite translating from English to Czech.

Brute-Force MT (1/2)

Translate only sentences listed in a "translation memory" (TM):

Good morning. = Dobré ráno.
How are you? = Jak se máš?
How are you? = Jak se máte?

$$p(e_1^I \mid f_1^J) = \begin{cases} 1 & \text{if } (f_1^J, e_1^I) \text{ seen in the TM} \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

Any problems with the definition?

• Not a probability. There may be $f_1^J$ s.t. $\sum_{e_1^I} p(e_1^I \mid f_1^J) > 1$.
  ⇒ Have to normalize: use $\frac{count(e_1^I, f_1^J)}{count(f_1^J)}$ instead of 1.
• Not "smooth", no generalization: Good morning. ⇒ Dobré ráno.

Brute-Force MT (2/2)

The same translation memory and definition as above. The problems:

• Not a probability. There may be $f_1^J$ s.t. $\sum_{e_1^I} p(e_1^I \mid f_1^J) > 1$.
  ⇒ Have to normalize: use $\frac{count(e_1^I, f_1^J)}{count(f_1^J)}$ instead of 1.
• Not "smooth", no generalization:
  Good morning. ⇒ Dobré ráno.
  Good evening. ⇒ ∅
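A minimal sketch of the normalized brute-force model in Python. The toy translation memory mirrors the slide; the lookup simply implements $count(e,f)/count(f)$ and returns 0 for anything unseen, which illustrates the lack of generalization:

```python
from collections import Counter

# Toy translation memory: (source, target) sentence pairs.
tm = [
    ("Good morning .", "Dobré ráno ."),
    ("How are you ?", "Jak se máš ?"),
    ("How are you ?", "Jak se máte ?"),
]

pair_counts = Counter(tm)               # count(f, e)
src_counts = Counter(f for f, _ in tm)  # count(f)

def p_brute_force(e, f):
    """p(e | f) = count(e, f) / count(f); 0 for anything unseen (no smoothing)."""
    if src_counts[f] == 0:
        return 0.0
    return pair_counts[(f, e)] / src_counts[f]

print(p_brute_force("Jak se máš ?", "How are you ?"))    # 0.5
print(p_brute_force("Dobrý večer .", "Good evening .")) # 0.0 -- no generalization
```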

Bayes' Law

Bayes' law for conditional probabilities: $p(a \mid b) = \frac{p(b \mid a)\, p(a)}{p(b)}$

So in our case:

$$\begin{aligned}
\hat{e}_1^{\hat{I}} &= \operatorname{argmax}_{I,\, e_1^I} p(e_1^I \mid f_1^J) && \text{apply Bayes' law} \\
&= \operatorname{argmax}_{I,\, e_1^I} \frac{p(f_1^J \mid e_1^I)\, p(e_1^I)}{p(f_1^J)} && p(f_1^J) \text{ constant} \Rightarrow \text{irrelevant in maximization} \\
&= \operatorname{argmax}_{I,\, e_1^I} p(f_1^J \mid e_1^I)\, p(e_1^I)
\end{aligned}$$

Also called “Noisy Channel” model.

Motivation for Noisy Channel

$$\hat{e}_1^{\hat{I}} = \operatorname{argmax}_{I,\, e_1^I} p(f_1^J \mid e_1^I)\, p(e_1^I) \qquad (4)$$

Bayes' law divided the model into components:

$p(f_1^J \mid e_1^I)$ . . . Translation model ("reversed", $e_1^I \to f_1^J$): is it a likely translation?
$p(e_1^I)$ . . . Language model (LM): is the output a likely sentence of the target language?

• The components can be trained on different sources. There are far more monolingual data ⇒ the language model is more reliable.

Without Equations

[Diagram: Parallel Texts → Translation Model; Monolingual Texts → Language Model; both feed a Global Search that maps the Input to the Output, searching for the sentence with the highest probability.]

Summary of Language Models

• $p(e_1^I)$ should report how "good" the sentence $e_1^I$ is.
• We surely want p(The the the.) ≪ p(The cat was black.):

p(&lt;s&gt; &lt;s&gt; The cat was black . &lt;/s&gt; &lt;/s&gt;) = p(The | &lt;s&gt; &lt;s&gt;) · p(cat | &lt;s&gt; The) · p(was | The cat) · p(black | cat was) · p(. | was black) · p(&lt;/s&gt; | black .) · p(&lt;/s&gt; | . &lt;/s&gt;)

Formally, with n = 3:

$$p_{\mathrm{LM}}(e_1^I) = \prod_{i=1}^{I} p(e_i \mid e_{i-n+1}^{i-1}) \qquad (5)$$

Estimating and Smoothing LM

$p(w_1) = \frac{count(w_1)}{\text{total words observed}}$ . . . unigram probabilities.
$p(w_2 \mid w_1) = \frac{count(w_1 w_2)}{count(w_1)}$ . . . bigram probabilities.
$p(w_3 \mid w_2, w_1) = \frac{count(w_1 w_2 w_3)}{count(w_1 w_2)}$ . . . trigram probabilities.

Unseen n-grams ($p(\text{n-gram}) = 0$) are a big problem; they invalidate the whole sentence: $p_{\mathrm{LM}}(e_1^I) = \dots \cdot 0 \cdot \dots = 0$
⇒ Back off to shorter n-grams:

$$p_{\mathrm{LM}}(e_1^I) = \prod_{i=1}^{I} \big( 0.8 \cdot p(e_i \mid e_{i-1}, e_{i-2}) + 0.15 \cdot p(e_i \mid e_{i-1}) + 0.049 \cdot p(e_i) + 0.001 \big) \neq 0 \qquad (6)$$
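A minimal sketch of Equation 6 in Python, with maximum-likelihood n-gram estimates and the fixed interpolation weights from the slide. The toy corpus and sentence-boundary symbols are illustrative, not the slides' data:

```python
import math
from collections import Counter

corpus = ["<s>", "<s>", "the", "cat", "was", "black", ".", "</s>",
          "<s>", "<s>", "the", "cat", "sat", ".", "</s>"]

unigrams = Counter(corpus)
bigrams  = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
total = len(corpus)

def p_interp(w, u, v):
    """0.8*p(w|u,v) + 0.15*p(w|v) + 0.049*p(w) + 0.001, as in Equation 6."""
    p3 = trigrams[(u, v, w)] / bigrams[(u, v)] if bigrams[(u, v)] else 0.0
    p2 = bigrams[(v, w)] / unigrams[v] if unigrams[v] else 0.0
    p1 = unigrams[w] / total
    return 0.8 * p3 + 0.15 * p2 + 0.049 * p1 + 0.001

def log_p_lm(sentence):
    """Sum of log probabilities; never -inf thanks to the 0.001 floor."""
    padded = ["<s>", "<s>"] + sentence + ["</s>"]
    return sum(math.log(p_interp(padded[i], padded[i - 2], padded[i - 1]))
               for i in range(2, len(padded)))

print(log_p_lm(["the", "cat", "was", "black", "."]))
print(log_p_lm(["the", "dog", "was", "black", "."]))  # unseen n-grams, still non-zero probability
```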

From Bayes to Log-Linear Model

Och (2002) discusses some problems of the noisy-channel model (Equation 4):
• The models are estimated unreliably ⇒ maybe the LM is more important:

$$\hat{e}_1^{\hat{I}} = \operatorname{argmax}_{I,\, e_1^I} p(f_1^J \mid e_1^I)\, \big(p(e_1^I)\big)^2 \qquad (7)$$

• In practice, the "direct" translation model works equally well:

$$\hat{e}_1^{\hat{I}} = \operatorname{argmax}_{I,\, e_1^I} p(e_1^I \mid f_1^J)\, p(e_1^I) \qquad (8)$$

• It is complicated to correctly introduce other dependencies.
⇒ Use a log-linear model instead.

Log-Linear Model (1)

• $p(e_1^I \mid f_1^J)$ is modelled as a weighted combination of models, called "feature functions": $h_1(\cdot,\cdot), \dots, h_M(\cdot,\cdot)$

$$p(e_1^I \mid f_1^J) = \frac{\exp\big(\sum_{m=1}^{M} \lambda_m h_m(e_1^I, f_1^J)\big)}{\sum_{e'^{I'}_1} \exp\big(\sum_{m=1}^{M} \lambda_m h_m(e'^{I'}_1, f_1^J)\big)} \qquad (9)$$

• Each feature function $h_m(e, f)$ relates the source $f$ to the target $e$, e.g. the feature for the n-gram language model:

$$h_{\mathrm{LM}}(f_1^J, e_1^I) = \log \prod_{i=1}^{I} p(e_i \mid e_{i-n+1}^{i-1}) \qquad (10)$$

• Model weights $\lambda_1^M$ specify the relative importance of the features.

December 2017 MT2: PBMT, NMT 14 Log-Linear Model (2)

As before, the constant denominator is not needed in the maximization:

$$\begin{aligned}
\hat{e}_1^{\hat{I}} &= \operatorname{argmax}_{I,\, e_1^I} \frac{\exp\big(\sum_{m=1}^{M} \lambda_m h_m(e_1^I, f_1^J)\big)}{\sum_{e'^{I'}_1} \exp\big(\sum_{m=1}^{M} \lambda_m h_m(e'^{I'}_1, f_1^J)\big)} \\
&= \operatorname{argmax}_{I,\, e_1^I} \exp\Big(\sum_{m=1}^{M} \lambda_m h_m(e_1^I, f_1^J)\Big)
\end{aligned} \qquad (11)$$

December 2017 MT2: PBMT, NMT 15 Relation to Noisy Channel

With equal weights and only two features:
• $h_{\mathrm{TM}}(e_1^I, f_1^J) = \log p(f_1^J \mid e_1^I)$ for the translation model,
• $h_{\mathrm{LM}}(e_1^I, f_1^J) = \log p(e_1^I)$ for the language model,
the log-linear model reduces to the Noisy Channel:

$$\begin{aligned}
\hat{e}_1^{\hat{I}} &= \operatorname{argmax}_{I,\, e_1^I} \exp\Big(\sum_{m=1}^{M} \lambda_m h_m(e_1^I, f_1^J)\Big) \\
&= \operatorname{argmax}_{I,\, e_1^I} \exp\big(h_{\mathrm{TM}}(e_1^I, f_1^J) + h_{\mathrm{LM}}(e_1^I, f_1^J)\big) \\
&= \operatorname{argmax}_{I,\, e_1^I} \exp\big(\log p(f_1^J \mid e_1^I) + \log p(e_1^I)\big) \\
&= \operatorname{argmax}_{I,\, e_1^I} p(f_1^J \mid e_1^I)\, p(e_1^I)
\end{aligned} \qquad (12)$$
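A minimal sketch of the log-linear argmax over a small candidate list. The candidate translations, feature values and weights below are made up purely for illustration; the point is that the normalization constant never matters for the argmax:

```python
# Hypothetical feature values h_m(e, f) for three candidates of one input:
# [h_TM (log p(f|e)), h_LM (log p(e)), word count].
candidates = {
    "dobré ráno .":      [-1.2, -4.0, 3],
    "dobrý den .":       [-2.5, -3.1, 3],
    "dobré ráno ráno .": [-1.4, -9.0, 4],
}
weights = [1.0, 1.0, -0.2]   # lambda_1 .. lambda_M

def score(features, weights):
    """Log-linear score: sum_m lambda_m * h_m (exp and normalization are irrelevant for argmax)."""
    return sum(l * h for l, h in zip(weights, features))

best = max(candidates, key=lambda e: score(candidates[e], weights))
print(best, score(candidates[best], weights))

# With weights [1, 1, 0] and only h_TM, h_LM, this is exactly the noisy-channel
# model of Equation 12: argmax p(f|e) * p(e).
```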

Phrase-Based MT Overview

This time around = Nyní
they 're moving = zareagovaly
even = dokonce ještě
... = ...
This time around , they 're moving = Nyní zareagovaly
, even faster = dokonce ještě rychleji
... = ...

[Figure: the source sentence "This time around , they 're moving even faster ." is segmented into phrases whose translations are arranged into the output "Nyní zareagovaly dokonce ještě rychleji ."]

Phrase-based MT: choose such a segmentation of the input string and such phrase "replacements" that the output sequence is "coherent" (its 3-grams are most probable).

Phrase-Based Translation Model

• Captures the basic assumption of phrase-based MT:
  1. Segment the source sentence $f_1^J$ into $K$ phrases $\tilde{f}_1 \dots \tilde{f}_K$.
  2. Translate each phrase independently: $\tilde{f}_k \to \tilde{e}_k$.
  3. Concatenate the translated phrases (with possible reordering $R$): $\tilde{e}_{R(1)} \dots \tilde{e}_{R(K)}$
• In theory, the segmentation $s_1^K$ is a hidden variable in the maximization and we should sum over all segmentations (note the three arguments in $h_m(\cdot,\cdot,\cdot)$ now):

$$\hat{e}_1^{\hat{I}} = \operatorname{argmax}_{I,\, e_1^I} \sum_{s_1^K} \exp\Big(\sum_{m=1}^{M} \lambda_m h_m(e_1^I, f_1^J, s_1^K)\Big) \qquad (13)$$

• In practice, the sum is approximated with a max (the biggest element only):

$$\hat{e}_1^{\hat{I}} = \operatorname{argmax}_{I,\, e_1^I} \max_{s_1^K} \exp\Big(\sum_{m=1}^{M} \lambda_m h_m(e_1^I, f_1^J, s_1^K)\Big) \qquad (14)$$

Core Feature: Phrase Trans. Prob.

The most important feature: phrase-to-phrase translation:

$$h_{\mathrm{Phr}}(f_1^J, e_1^I, s_1^K) = \log \prod_{k=1}^{K} p(\tilde{f}_k \mid \tilde{e}_k) \qquad (15)$$

The conditional probability of phrase $\tilde{f}_k$ given phrase $\tilde{e}_k$ is estimated from relative frequencies:

$$p(\tilde{f}_k \mid \tilde{e}_k) = \frac{count(\tilde{f}, \tilde{e})}{count(\tilde{e})} \qquad (16)$$

• $count(\tilde{f}, \tilde{e})$ is the number of co-occurrences of the phrase pair $(\tilde{f}, \tilde{e})$ that are consistent with the word alignment.
• $count(\tilde{e})$ is the number of occurrences of the target phrase $\tilde{e}$ in the training corpus.
• $h_{\mathrm{Phr}}$ is usually used twice, in both directions: $p(\tilde{f}_k \mid \tilde{e}_k)$ and $p(\tilde{e}_k \mid \tilde{f}_k)$.
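A minimal sketch of the relative-frequency estimation in Equation 16, assuming phrase pairs have already been extracted from word-aligned data; the small list of extracted pairs below is illustrative:

```python
from collections import Counter

# Extracted phrase pairs (f~, e~), one entry per consistent co-occurrence.
extracted = [
    ("dobré ráno", "good morning"),
    ("dobré ráno", "good morning"),
    ("dobré ráno", "a good morning"),
    ("ráno", "morning"),
]

pair_count = Counter(extracted)                # count(f~, e~)
tgt_count  = Counter(e for _, e in extracted)  # count(e~)
src_count  = Counter(f for f, _ in extracted)  # count(f~)

def p_f_given_e(f, e):
    return pair_count[(f, e)] / tgt_count[e]

def p_e_given_f(f, e):
    # The same feature is usually also used in the other direction.
    return pair_count[(f, e)] / src_count[f]

print(p_f_given_e("dobré ráno", "good morning"))   # 1.0
print(p_e_given_f("dobré ráno", "good morning"))   # 2/3
```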

Phrase-Based Features in Moses

Given a parallel training corpus, phrases are extracted and scored:

in europa ||| in europe ||| 0.829007 0.207955 0.801493 0.492402
europas ||| in europe ||| 0.0251019 0.066211 0.0342506 0.0079563
in der europaeischen union ||| in europe ||| 0.018451 0.00100126 0.0319584 0.0196869

The scores are ($\phi(\cdot) = \log p(\cdot)$):

• phrase translation probabilities: $\phi_{\mathrm{phr}}(f|e)$ and $\phi_{\mathrm{phr}}(e|f)$
• lexical weighting: $\phi_{\mathrm{lex}}(f|e)$ and $\phi_{\mathrm{lex}}(e|f)$ (Koehn, 2003)

$$\phi_{\mathrm{lex}}(f|e) = \log \max_{a \,\in\, \text{alignments of } (f,e)} \prod_{i=1}^{|f|} \frac{1}{|\{j \mid (i,j) \in a\}|} \sum_{\forall (i,j) \in a} p(f_i \mid e_j) \qquad (17)$$

Other Features Used in PBMT

• Word count/penalty: $h_{\mathrm{wp}}(e_1^I, \cdot, \cdot) = I$
  ⇒ Do we prefer longer or shorter output?
• Phrase count/penalty: $h_{\mathrm{pp}}(\cdot, \cdot, s_1^K) = K$
  ⇒ Do we prefer a translation in more or fewer less-dependent bits?
• Reordering model: different basic strategies (Lopez, 2009)
  ⇒ Which source spans can provide the continuation at a given moment?
• n-gram LM:

$$h_{\mathrm{LM}}(\cdot, e_1^I, \cdot) = \log \prod_{i=1}^{I} p(e_i \mid e_{i-n+1}^{i-1}) \qquad (18)$$

⇒ Is output n-gram-wise coherent?

Decoding in Phrase-Based MT

See the slides by Philipp Koehn:
• Creating translation options.
• Expanding hypotheses.
• Recombining hypotheses.
• Stack-based pruning.
A heavily simplified decoder is sketched below.
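This is not the Moses decoder, just a toy, monotone (no reordering) beam search to illustrate translation options, hypothesis expansion and per-stack pruning; the phrase table entries and log probabilities are invented:

```python
import heapq

# Hypothetical translation options: source span (tuple of words) -> [(target, log prob)].
phrase_table = {
    ("nemám",): [("i have", -0.9), ("i do not have", -1.2)],
    ("kočku",): [("a cat", -0.5), ("cat", -0.8)],
    ("nemám", "kočku"): [("i have no cat", -2.0)],
}

def decode(src, beam_size=3):
    # stacks[c] holds hypotheses covering the first c source words: (score, output string).
    stacks = [[] for _ in range(len(src) + 1)]
    stacks[0] = [(0.0, "")]
    for covered in range(len(src)):
        # Stack-based pruning: keep only the best `beam_size` hypotheses per stack.
        stacks[covered] = heapq.nlargest(beam_size, stacks[covered])
        for score, out in stacks[covered]:
            # Expand with every translation option starting at position `covered` (monotone).
            for end in range(covered + 1, len(src) + 1):
                for tgt, logp in phrase_table.get(tuple(src[covered:end]), []):
                    stacks[end].append((score + logp, (out + " " + tgt).strip()))
    return max(stacks[len(src)]) if stacks[len(src)] else None

print(decode(["nemám", "kočku"]))   # best full-coverage hypothesis and its score
```

With only phrase scores and no LM, the toy decoder happily outputs "i have a cat", echoing the reversed-meaning example on the later slides.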

Local and Non-Local Features

[Figure omitted: illustration of local vs. non-local feature scores on a hypothesis under construction.]

• Local features decompose along the hypothesis construction.
  – Phrase- and word-based features.
• Non-local features span the boundaries (e.g. the LM).

Weight Optimization: MERT Loop

[Figure: the MERT loop — decode a development set with the current weights, collect n-best lists, re-optimize the feature weights to minimize the error metric, and repeat.]

Minimum Error Rate Training (Och, 2003)
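This is not Och's (2003) line-search algorithm, only a toy sketch of the idea behind one tuning iteration: given n-best lists with feature vectors and sentence-level error scores (all invented here), search for weights that minimize the corpus error of the model-best hypotheses:

```python
import random

# For each dev sentence: n-best list of (feature vector, error of that hypothesis).
nbest = [
    [([-1.0, -3.0, 3], 0.2), ([-2.0, -2.0, 3], 0.5), ([-0.5, -9.0, 4], 0.9)],
    [([-2.5, -1.0, 2], 0.1), ([-1.0, -4.0, 2], 0.4)],
]

def corpus_error(weights):
    """Pick the model-best hypothesis of each n-best list and sum its error."""
    total = 0.0
    for hyps in nbest:
        best = max(hyps, key=lambda h: sum(l * f for l, f in zip(weights, h[0])))
        total += best[1]
    return total

random.seed(0)
best_w, best_err = None, float("inf")
for _ in range(1000):              # crude random search instead of MERT's exact line search
    w = [random.uniform(-1, 1) for _ in range(3)]
    err = corpus_error(w)
    if err < best_err:
        best_w, best_err = w, err
print(best_w, best_err)
```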

Effects of Weights

• A higher phrase penalty chops the sentence into more segments.
• Too strong an LM weight leads to dropped words.
• A negative LM weight leads to obscure wordings.

Summary of PBMT

Phrase-based MT:
• is a log-linear model,
• assumes phrases are relatively independent of each other,
• decomposes the sentence into contiguous phrases,
• search has two parts:
  – lookup of all relevant translation options,
  – stack-based beam search, gradually expanding hypotheses.

To train a PBMT system:
1. Align words.
2. Extract (and score) phrases consistent with the word alignment (sketched below).
3. Optimize weights (MERT).
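A minimal sketch of step 2, using the standard consistency criterion (no target word inside the candidate span may be aligned outside the source span; extension over unaligned words is omitted). The sentence pair is the toy example from the following slides, with an assumed word alignment:

```python
def extract_phrases(src, tgt, alignment, max_len=4):
    """All (src phrase, tgt phrase) pairs consistent with the word alignment."""
    pairs = set()
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions aligned to the source span [i1, i2].
            tgt_pos = [j for (i, j) in alignment if i1 <= i <= i2]
            if not tgt_pos:
                continue
            j1, j2 = min(tgt_pos), max(tgt_pos)
            # Consistent iff nothing inside the target span aligns outside [i1, i2].
            if all(i1 <= i <= i2 for (i, j) in alignment if j1 <= j <= j2):
                pairs.add((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs

src = ["Nemám", "žádného", "psa", "."]
tgt = ["I", "have", "no", "dog", "."]
# (source index, target index) word alignment, 0-based -- an assumption for this sketch.
alignment = [(0, 0), (0, 1), (1, 2), (2, 3), (3, 4)]
for pair in sorted(extract_phrases(src, tgt, alignment)):
    print(pair)   # includes ("Nemám", "I have") -- the pair behind the later mistranslation
```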

1: Align Training Sentences

Nemám žádného psa. Viděl kočku.
I have no dog. He saw a cat.

2: Align Words

Nemám žádného psa. Viděl kočku.
I have no dog. He saw a cat.

3: Extract Phrase Pairs (MTUs)

Nemám žádného psa. Viděl kočku.
I have no dog. He saw a cat.

4: New Input

Nemám žádného psa. Viděl kočku.
I have no dog. He saw a cat.

New input: Nemám kočku.

4: New Input

Nemám žádného psa. Viděl kočku.
I have no dog. He saw a cat.
...
I don't have cat.
New input: Nemám kočku.

5: Pick Probable Phrase Pairs (TM)

Nemám žádného psa. Viděl kočku.
I have no dog. He saw a cat.
...
I don't have cat.
New input: Nemám kočku.
I have

6: So That n-Grams Probable (LM)

Nemám žádného psa. Viděl kočku.
I have no dog. He saw a cat.
...
I don't have cat.
New input: Nemám kočku.
I have a cat.

Meaning Got Reversed!

Nemám žádného psa. Viděl kočku.
I have no dog. He saw a cat.
...
I don't have cat.
New input: Nemám kočku.
I have a cat.

What Went Wrong?

$$\hat{e}_1^{\hat{I}} = \operatorname{argmax}_{I,\, e_1^I} p(f_1^J \mid e_1^I)\, p(e_1^I) = \operatorname{argmax}_{I,\, e_1^I} \Big( \prod_{(\tilde{f}, \tilde{e}) \,\in\, \text{phrase pairs of } (f_1^J, e_1^I)} p(\tilde{f} \mid \tilde{e}) \Big)\, p(e_1^I) \qquad (19)$$

• Too strong a phrase-independence assumption.
  – Phrases do depend on each other. Here "nemám" and "žádného" jointly express one negation.
  – Word alignments ignored that dependence, and adding it would increase data sparseness.
• The language model is a separate unit.
  – $p(e_1^I)$ models the target sentence independently of $f_1^J$.

Redefining $p(e_1^I \mid f_1^J)$

What if we modelled $p(e_1^I \mid f_1^J)$ directly, word by word:

$$\begin{aligned}
p(e_1^I \mid f_1^J) &= p(e_1, e_2, \dots, e_I \mid f_1^J) \\
&= p(e_1 \mid f_1^J) \cdot p(e_2 \mid e_1, f_1^J) \cdot p(e_3 \mid e_2, e_1, f_1^J) \cdots \\
&= \prod_{i=1}^{I} p(e_i \mid e_1, \dots, e_{i-1}, f_1^J)
\end{aligned} \qquad (20)$$

. . . this is "just a cleverer language model": $p(e_1^I) = \prod_{i=1}^{I} p(e_i \mid e_1, \dots, e_{i-1})$

Main benefit: all dependencies are available. But what technical device can learn this?

NNs: Universal Approximators

• A neural network with a single hidden layer (possibly huge) can approximate any continuous function to any precision.
• (Nothing is claimed about learnability.)

https://www.quora.com/How-can-a-deep-neural-network-with-ReLU-activations-in-its-hidden-layers-approximate-an

playground.tensorflow.org

Perfect Features

Bad Features & Low Depth

Too Complex NN Fails to Learn

Deep NNs for Image Classification

Representation Learning

• Based on training data (sample inputs and expected outputs), the neural network learns by itself what is important in the inputs to predict the outputs best.

A "representation" is a new set of axes:
• instead of 3 dimensions (x, y, color), we get
• e.g. 2000 dimensions (elephantity, number of storks, blueness, . . . ),
• designed automatically to help best predict the output.

One Layer tanh(Wx + b), 2D→2D

Skew: W
Translate: b
Non-lin.: tanh

Animation by http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

Four Layers, Disentangling Spirals

Animation by http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

Processing Text with NNs

• Map each word to a vector of 0s and 1s ("1-hot representation"): cat ↦ (0, 0, ..., 0, 1, 0, ..., 0)
• A sentence is then a matrix, one column per token:

            the  cat  is  on  the  mat
  a          0    0    0   0    0    0
  about      0    0    0   0    0    0
  ...
  cat        0    1    0   0    0    0
  ...
  is         0    0    1   0    0    0
  ...
  on         0    0    0   1    0    0
  ...
  the        1    0    0   0    1    0
  ...
  zebra      0    0    0   0    0    0

  (Vocabulary size: 1.3M English, 2.2M Czech.)

Main drawback: no relations, all words equally close/far.



Word Embeddings

• Map each word to a dense vector.
• In practice 300–2000 dimensions are used, not 1–2M.
  – The dimensions have no clear interpretation.
• Embeddings are trained for each particular task.
  – In NNs: the matrix that maps the 1-hot input to the first layer.
• The famous word2vec (Mikolov et al., 2013):
  – CBOW: predict the word from its four neighbours.
  – Skip-gram: predict likely neighbours given the word.

[Figure: a feed-forward network with input layer x_1 … x_V, hidden layer h_1 … h_N, output layer y_1 … y_V, and weight matrices W_{V×N} = {w_ki} and W'_{N×V} = {w'_ij}.]

Right: CBOW with just a single-word context (http://www-personal.umich.edu/~ronxin/pdf/w2vexp.pdf)

Continuous Space of Words

Word2vec embeddings show interesting properties:

$$v(\mathrm{king}) - v(\mathrm{man}) + v(\mathrm{woman}) \approx v(\mathrm{queen}) \qquad (21)$$

Illustrations from https://www.tensorflow.org/tutorials/word2vec
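A minimal sketch of the analogy arithmetic in Equation 21 with a toy embedding table. The vectors here are random just to make the code runnable; real word2vec vectors would be loaded from a trained model, where the nearest neighbour of the query tends to be "queen":

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["king", "queen", "man", "woman", "cat", "dog"]
emb = {w: rng.normal(size=50) for w in vocab}   # toy 50-dimensional embeddings

def nearest(query, exclude):
    """Word whose embedding has the highest cosine similarity to `query`."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vocab if w not in exclude), key=lambda w: cos(emb[w], query))

query = emb["king"] - emb["man"] + emb["woman"]
print(nearest(query, exclude={"king", "man", "woman"}))
```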

Variable-Length Inputs

Variable-length input can be handled by recurrent NNs:
• Reading one input symbol at a time.
  – The same (trained) transformation is used every time.
• Unroll in time (up to a fixed length limit).

Tricks are needed to train them (to avoid "vanishing gradients"):

• LSTM, Long Short-Term Memory cells (Hochreiter and Schmidhuber, 1997).
• GRU, Gated Recurrent Unit cells (Chung et al., 2014).
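A minimal numpy sketch of a single GRU step, following the standard update/reset-gate formulation; the weights are random and the input sequence is a toy, it only shows the recurrence being applied at every time step:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d_in, d_h = 4, 8
rng = np.random.default_rng(0)
# Random parameters; in practice these are learned.
Wz, Uz = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))
Wr, Ur = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))
Wh, Uh = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))

def gru_step(x, h):
    z = sigmoid(Wz @ x + Uz @ h)          # update gate
    r = sigmoid(Wr @ x + Ur @ h)          # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))
    return z * h + (1.0 - z) * h_tilde    # mix of old state and candidate state

# The same (trained) transformation is applied at every time step ("unrolling in time"):
h = np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):      # a toy input sequence of 5 symbols
    h = gru_step(x, h)
print(h)
```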

NNs as Translation Model in SMT

Cho et al. (2014) proposed:
• the encoder-decoder architecture and
• the GRU unit (the name was given later by Chung et al. (2014))
• to score variable-length phrase pairs in PBMT.

[Figure: RNN encoder-decoder. The encoder reads the input x_1 … x_T into a summary vector c; the decoder generates the output y_1 … y_T' conditioned on c.]

⇒ Embeddings of Phrases

⇒ Syntactic Similarity ("of the")

⇒ Semantic Similarity (Countries)

Sub-Word Units

• SMT struggled with productive morphology (>1M wordforms):
  nejneobhospodařovávatelnějšími, Donaudampfschifffahrtsgesellschaftskapitän
• NMT can handle only 30–80k-word dictionaries.
⇒ Resort to sub-word units.

Orig        český politik svezl migranty
Syllables   čes ký ⊔ po li tik ⊔ sve zl ⊔ mig ran ty
Morphemes   česk ý ⊔ politik ⊔ s vez l ⊔ migrant y
Char Pairs  če sk ý ⊔ po li ti k ⊔ sv ez l ⊔ mi gr an ty
Chars       č e s k ý ⊔ p o l i t i k ⊔ s v e z l ⊔ m i g r a n t y
BPE 30k     český politik s@@ vez@@ l mi@@ granty

BPE (Byte-Pair Encoding) uses the n most common substrings (incl. frequent words).
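A minimal sketch of BPE merge learning: repeatedly merge the most frequent adjacent symbol pair. The word list and the number of merges are toy values; real systems learn tens of thousands of merges on the full training corpus:

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merges: repeatedly merge the most frequent adjacent symbol pair."""
    vocab = Counter(tuple(w) for w in words)   # each word starts as a tuple of characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

words = ["svezl", "svezla", "vezl", "migranty", "migrant"] * 10
merges, vocab = learn_bpe(words, num_merges=8)
print(merges)        # learned merge operations, most frequent first
print(list(vocab))   # words now segmented into subword units
```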

NMT: Sequence to Sequence

Sutskever et al. (2014) use:
• an LSTM RNN encoder-decoder
• to consume and produce variable-length sentences.

First the Encoder:

Then the Decoder

Remember: $p(e_1^I \mid f_1^J) = p(e_1 \mid f_1^J) \cdot p(e_2 \mid e_1, f_1^J) \cdot p(e_3 \mid e_2, e_1, f_1^J) \cdots$

• Again an RNN, producing one word at a time.
• The produced word is fed back into the network.
  – (Word embeddings in the target language are used here.)
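A minimal sketch of the decoder loop only: at each step the RNN state and the embedding of the previously produced word predict the next word, which is then fed back. All parameters are random here, so the "translation" is meaningless; the code only illustrates the feedback structure:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<s>", "</s>", "the", "cat", "sat"]
d_e, d_h = 8, 16
E = rng.normal(size=(len(vocab), d_e))           # target-side word embeddings
Wx, Wh = rng.normal(size=(d_h, d_e)), rng.normal(size=(d_h, d_h))
Wo = rng.normal(size=(len(vocab), d_h))          # output projection to vocabulary scores

def greedy_decode(h, max_len=10):
    """h: decoder state initialized from the encoder (here just random)."""
    prev = vocab.index("<s>")
    output = []
    for _ in range(max_len):
        h = np.tanh(Wx @ E[prev] + Wh @ h)             # consume the previously produced word
        probs = np.exp(Wo @ h); probs /= probs.sum()   # softmax over the target vocabulary
        prev = int(np.argmax(probs))                   # greedy choice; fed back at the next step
        if vocab[prev] == "</s>":
            break
        output.append(vocab[prev])
    return output

print(greedy_decode(rng.normal(size=d_h)))
```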

Encoder-Decoder Architecture

https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2/

Continuous Space of Sentences

[Figure: a 2-D plot in which paraphrases cluster together, e.g. "I was given a card by her in the garden", "In the garden , she gave me a card" and "She gave me a card in the garden" form one cluster, while "She was given a card by me in the garden", "In the garden , I gave her a card" and "I gave her a card in the garden" form another.]

2-D PCA projection of 8000-D space representing sentences (Sutskever et al., 2014).

Attention (1/2)

• Arbitrary-length sentences fit badly into a fixed vector.
• Reading the input backward works better
  . . . because early words will be more salient.

⇒ Use a bi-directional RNN and "attend" to all states $h_i$.

Attention (2/2)

• Add a sub-network predicting the importance of the source states at each step.
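A minimal numpy sketch of one attention step: a small sub-network scores every encoder state against the current decoder state, the scores are softmax-normalized, and the context vector is the weighted sum of the encoder states. Shapes and weights are illustrative assumptions, not a particular system's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_s, d_a, T = 16, 16, 12, 7
H = rng.normal(size=(T, d_h))      # encoder states h_1 .. h_T (bi-directional RNN outputs)
s = rng.normal(size=d_s)           # current decoder state
Wa, Ua = rng.normal(size=(d_a, d_s)), rng.normal(size=(d_a, d_h))
va = rng.normal(size=d_a)

# Sub-network predicting the importance of each source state at this decoder step:
scores = np.array([va @ np.tanh(Wa @ s + Ua @ h) for h in H])
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()               # attention weights, sum to 1 (one column of the alignment plot)
context = alpha @ H                # weighted sum of encoder states, fed to the decoder

print(np.round(alpha, 3))
```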

Attention ≈ Alignment

• We can collect the attention across time.
• Each column corresponds to one decoder time step.
• Source tokens correspond to rows.

Ultimate Goal of SMT vs. NMT

Goal of "classical" SMT: Find minimum translation units (∼ graph partitions):
• such that they are frequent across many sentence pairs,
• without imposing (too hard) constraints on reordering,
• in an unsupervised fashion.

Goal of neural MT: Avoid minimum translation units. Find an NN architecture that:
• reads the input in as original a form as possible,
• produces the output in as final a form as possible,
• can be optimized end-to-end in practice.

Is NMT That Much Better?

The outputs of this year’s best system: http://matrix.statmt.org/

SRC A 28-year-old chef who had recently moved to San Francisco was found dead in the stairwell of a local mall this week.
MT Osmadvacetiletý kuchař, který se nedávno přestěhoval do San Francisca, byl tento týden nalezen mrtvý na schodišti místního obchodního centra.
REF Osmadvacetiletý šéfkuchař, který se nedávno přistěhoval do San Franciska, byl tento týden ∅ schodech místního obchodu.

SRC There were creative differences on the set and a disagreement.
REF Došlo ke vzniku kreativních rozdílů na scéně a k neshodám.
MT Na place byly tvůrčí rozdíly a neshody.


Luckily ;-) Bad Errors Happen

SRC ... said Frank initially stayed in hostels...
MT ... řekl, že Frank původně zůstal v Budějovicích...

SRC Most of the Clintons' income...
MT Většinu příjmů Kliniky...

SRC The 63-year-old has now been made a special representative...
MT 63letý mladík se nyní stal zvláštním zástupcem...

SRC He listened to the moving stories of the women.
MT Naslouchal pohyblivým příběhům žen.

Catastrophic Errors

SRC Criminal Minds star Thomas Gibson sacked after hitting producer
REF Thomas Gibson, hvězda seriálu Myšlenky zločince, byl propuštěn poté, co uhodil režiséra
MT Kriminalisté Minsku hvězdu Thomase Gibsona vyhostili po zásahu producenta

SRC ...add to that its long-standing grudge...
REF ...přidejte k tomu svou dlouholetou nenávist...
MT ...přidejte k tomu svou dlouholetou záštitu... (grudge → zášť → záštita)

German→Czech SMT vs. NMT

• A smaller dataset, very first (but comparable) results.
• NMT performs better on average, but occasionally:

SRC Das Spektakel ähnelt dem Eurovision Song Contest.
REF Je to jako pěvecká soutěž Eurovision.
SMT Podívanou připomíná hudební soutěž Eurovize.
NMT Divadlo se podobá Eurovizi Conview.

SRC Erderwärmung oder Zusammenstoß mit Killerasteroid.
REF Globální oteplení nebo kolize se zabijáckým asteroidem.
SMT Globální oteplování, nebo srážka s Killerasteroid.
NMT Globální oteplování, nebo střet s zabijákem.

SRC Zu viele verletzte Gefühle.
REF Příliš mnoho nepřátelských pocitů.
SMT Příliš mnoho zraněných pocity.
NMT Příliš mnoho zraněných.

Summary

Two crucially different models were covered:
• Phrase-based: contiguous but independent phrases.
• Neural: unit-less, continuous space.

References

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar, October. Association for Computational Linguistics.

Junyoung Chung, Çaglar Gülçehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. CoRR, abs/1412.3555.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, November.

Philipp Koehn. 2003. Noun Phrase Translation. Ph.D. thesis, University of Southern California.

Adam Lopez. 2009. Translation as Weighted Deduction. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 532–540, Athens, Greece, March. Association for Computational Linguistics.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. CoRR, abs/1301.3781.

Franz Josef Och. 2002. Statistical Machine Translation: From Single-Word Models to Alignment Templates. Ph.D. thesis, RWTH Aachen University.

Franz Josef Och. 2003. Minimum Error Rate Training in Statistical Machine Translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL), Sapporo, Japan, July.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems 27 (NIPS 2014), pages 3104–3112.
