Distributional Semantics and Machine Learning for Statistical Machine Translation

Author: Mikel Artetxe
Advisors: Eneko Agirre and Gorka Labaka

HAP/LAP
Hizkuntzaren Azterketa eta Prozesamendua
Language Analysis and Processing

Final Thesis
May 2016
Departments: Computer Systems and Languages, Computational Architectures and Technologies, Computational Science and Artificial Intelligence, Basque Language and Communication, Communications Engineering.
Laburpena

Lan honetan semantika distribuzionalaren eta ikasketa automatikoaren erabilera aztertzen dugu itzulpen automatiko estatistikoa hobetzeko. Bide horretan, erregresio logistikoan oinarritutako ikasketa automatikoko eredu bat proposatzen dugu hitz-segiden itzulpen-probabilitatea modu dinamikoan modelatzeko. Proposatutako eredua itzulpen automatiko estatistikoko ohiko itzulpen-probabilitateen orokortze bat dela frogatzen dugu, eta testuinguruko nahiz semantika distribuzionaleko informazioa barneratzeko baliatu ezaugarri lexiko, hitz-cluster eta hitzen errepresentazio bektorialen bidez. Horretaz gain, semantika distribuzionaleko ezagutza itzulpen automatiko estatistikoan txertatzeko beste hurbilpen bat lantzen dugu: hitzen errepresentazio bektorial elebidunak erabiltzea hitz-segiden itzulpenen antzekotasuna modelatzeko. Gure esperimentuek proposatutako ereduen baliagarritasuna erakusten dute, emaitza itxaropentsuak eskuratuz oinarrizko sistema sendo baten gainean. Era berean, gure lanak ekarpen garrantzitsuak egiten ditu errepresentazio bektorialen mapaketa elebidunei eta hitzen errepresentazio bektorialetan oinarritutako hitz-segiden antzekotasun neurriei dagokienean, itzulpen automatikoaz haratago balio propio bat dutenak semantika distribuzionalaren arloan.
Abstract

In this work, we explore the use of distributional semantics and machine learning to improve statistical machine translation. For that purpose, we propose the use of a logistic regression based machine learning model for dynamic phrase translation probability modeling. We prove that the proposed model can be seen as a generalization of the standard translation probabilities used in statistical machine translation, and use it to incorporate context and distributional semantic information through lexical, word cluster and word embedding features. Apart from that, we explore the use of word embeddings for phrase translation probability scoring as an alternative approach to incorporate distributional semantic knowledge into statistical machine translation. Our experiments show the effectiveness of the proposed models, achieving promising results over a strong baseline. At the same time, our work makes important contributions in relation to bilingual word embedding mappings and word embedding based phrase similarity measures, which go beyond machine translation and have an intrinsic value in the field of distributional semantics.
Contents
1 Introduction
2 Statistical machine translation
2.1 Machine translation paradigms
2.1.1 Rule-based machine translation
2.1.2 Corpus-based machine translation
2.1.3 Hybrid machine translation
2.2 Word alignment
2.3 Phrase-based statistical machine translation
2.3.1 Log-linear models
2.4 Addressing the limitations of statistical machine translation
2.4.1 Factored models
2.4.2 Tree-based models
2.4.3 Deep learning and neural machine translation
2.5 Evaluation in machine translation
2.6 Moses
2.7 Conclusions
3 Large scale machine learning
3.1 Basic concepts in supervised learning
3.2 Linear models and the perceptron
3.3 Logistic regression and gradient descent
3.4 Addressing the limitations of linear models
3.4.1 Feature engineering
3.4.2 Support vector machines
3.4.3 Artificial neural networks and deep learning
3.5 Large scale online learning
3.6 Vowpal Wabbit
3.7 Conclusions
4 Distributional semantics
4.1 Monolingual word embeddings
4.2 Word clustering
4.2.1 Brown clustering
4.2.2 Clark clustering
4.2.3 Word embedding clustering
4.3 Bilingual word embeddings
4.3.1 Bilingual mapping
4.3.2 Monolingual adaptation
4.3.3 Bilingual training
4.4 Distributional semantics for machine translation
4.4.1 Phrase translation similarity scoring
4.4.2 New translation induction
4.5 Conclusions
5 Logistic regression for dynamic phrase translation probability modeling
5.1 Proposed model
5.2 Proof of relative frequency counting equivalence
5.3 Feature design
5.3.1 Source language features
5.3.2 Target language features
5.3.3 Feature combination
5.4 Experiment and results
5.5 Conclusions
6 Distributional semantic features for phrase translation logistic regression
6.1 Word cluster features
6.2 Word embedding features
6.3 Experiment and results
6.4 Conclusions
7 A general framework for bilingual word embedding mappings
7.1 Proposed method
7.1.1 Basic optimization objective
7.1.2 Orthogonality constraint for monolingual invariance
7.1.3 Length normalization for maximum cosine similarity
7.1.4 Mean centering for maximum covariance
7.1.5 Weighted and partial dictionaries
7.2 Relation to existing bilingual mapping methods
7.2.1 Mikolov et al. (2013b)
7.2.2 Xing et al. (2015)
7.2.3 Faruqui and Dyer (2014)
7.3 Experiments on word translation induction
7.4 Conclusions
8 Bilingual word embeddings for phrase translation similarity scoring
8.1 Analysis of centroid cosine similarity
8.2 Proposed phrase similarity measures
8.3 Experiments on phrase translation selection
8.4 Experiments on end-to-end machine translation
8.5 Conclusions
9 Conclusions and future work
List of Figures
1 The Vauquois triangle
2 The Vauquois triangle adapted for EBMT systems
3 Intersection and union for alignment symmetrization
4 Phrase pair extraction from word alignment
5 Decoding of Maria no daba una bofetada a la bruja verde
6 An example factored model
7 Example decision boundary for a linearly separable problem
8 The sigmoid or logistic function
9 Visualization of gradient descent in a 2-dimensional weight space
10 Example decision boundary for a problem that is not linearly separable
11 A feedforward neural network with a single hidden layer
12 The CBOW and skip-gram models
13 An example dendrogram produced by Brown clustering
14 PCA visualization of two linearly related word spaces trained independently
15 Average weights assigned by MERT for the different components
List of Tables
1 Example source language lexical features
2 Example target language lexical features
3 Bilingual English-Spanish corpora
4 Monolingual Spanish corpora used for language modeling
5 Results on English-Spanish translation for the baseline system
6 Results on English-Spanish translation for different target language features
7 Results on English-Spanish translation for different validation sets
8 Results on English-Spanish translation for different integration methods
9 Example source language word cluster features
10 Example source language word embedding features
11 Results on English-Spanish translation for different word clusters
12 Results on English-Spanish translation for different word cluster features
13 Results on English-Spanish translation for different word cluster integration methods
14 Results on English-Spanish translation with embedding features
15 Different optimization objectives within the proposed framework
16 Results on English-Italian word translation induction
17 Corpora for training bilingual word embeddings
18 Results on English-Spanish phrase translation selection
19 Results on English-Spanish machine translation with phrase similarity scoring
1 Introduction
Machine translation has been one of the most prominent applications of natural language processing and artificial intelligence since their early days, driven by the dream of breaking the language barrier in an increasingly global yet diverse world. The classical approach, based on rules, has been progressively replaced by statistical machine translation, which has become the dominant paradigm, bringing about a breakthrough in the field. Nevertheless, in spite of the great progress in recent years, current machine translation engines still suffer from important limitations, which are mostly related to the following factors:
• Sparsity of natural language, causing many terms or linguistic phenomena to be missing or poorly represented in the training corpora with the inherent difficulty to properly model them, a problem that is accentuated by the scarcity of bilingual resources for the vast majority of language pairs.

• Locality of the statistical methods used, with the inherent difficulty to properly model context information, long distance dependencies and morphosyntactically diverging language pairs.

• Inflexibility of the statistical models used, with the inherent difficulty to incorporate external linguistic information from other natural language processing tools and resources.
In this work, we explore the use of distributional semantic and machine learning techniques to address these issues with the aim of improving machine translation. More concretely, we develop a logistic regression based machine learning model for dynamic phrase translation probability scoring, which we prove to be a generalization of the standard translation probabilities used in statistical machine translation that allows additional features to be naturally incorporated to obtain better probability estimates. In this regard, we explore the use of lexical features shared across different phrase pairs that could have a smoothing effect and help to overcome the sparsity problem, source language context features that could help to overcome the locality problem, and additional features for external linguistic information that could help to overcome the inflexibility problem. Taking advantage of this last point, we also incorporate distributional semantic features into the model through word clusters and word embeddings. Based on the distributional hypothesis, which states that the meaning of each word in a language is defined by its usage and can therefore be characterized by the words it is generally surrounded by, distributional semantics attempts to build abstract representations of words based on their usage in a large corpus. In particular, word clustering uses discrete classes to group related words together, whereas word embeddings are dense, low-dimensional representations of words in a continuous vector space. Therefore, these distributional semantic features not only serve to incorporate unsupervised knowledge acquired from large monolingual corpora into the model, but are also useful to mitigate the sparsity problem. In addition to that, we also explore an alternative method to incorporate distributional semantic information into
machine translation through the use of bilingual word embeddings for phrase translation similarity scoring. Some of our contributions in this direction go beyond machine translation and have an intrinsic value in distributional semantics. In particular, we develop a general framework to learn bilingual word embedding mappings and show that several existing methods fit naturally in it while revealing flaws in their theoretical justification. At the same time, we analyze the centroid cosine similarity from a theoretical perspective and propose alternative word embedding based phrase similarity measures that address some of its issues.

This dissertation is structured as follows. Chapters 2 to 4 describe the background that serves as the basis for our work in statistical machine translation, large scale machine learning and distributional semantics, respectively. More concretely, Chapter 2 presents the different machine translation paradigms, describes statistical machine translation in depth, outlines machine translation evaluation, and introduces the related software used in the project. Chapter 3 then discusses supervised machine learning, describes linear models with a focus on logistic regression and gradient descent, explains the limitations of these models and different approaches to address them, discusses the peculiarities of large scale machine learning, and presents the Vowpal Wabbit large scale machine learning system used in the project. After that, Chapter 4 discusses distributional semantics, describing monolingual word embeddings, word clustering and bilingual word embeddings as well as existing methods to integrate them in machine translation.

Chapters 5 through 8 then focus on our own contributions to the field. More concretely, Chapter 5 describes the logistic regression model we propose for phrase translation probability scoring, proves its equivalence with the relative frequency counting probability estimates used in statistical machine translation, presents a set of lexical features for it, and experimentally tests them on English-Spanish machine translation. Chapter 6 then discusses the incorporation of distributional semantic features into the model through word clusters and word embeddings and tests them on the same English-Spanish machine translation task. After that, Chapter 7 presents our general framework for bilingual word embedding mappings, which starts with a basic optimization objective and allows for several variants that we prove to be equivalent to other meaningful optimization goals; we also analyze its relation to other related methods proposed in the literature and test them on English-Italian word translation induction. Chapter 8 then discusses the use of bilingual word embeddings for phrase translation similarity scoring, analyzing the standard centroid cosine similarity from a theoretical perspective, proposing alternative phrase similarity measures that address some of its issues and experimentally testing them on a new phrase translation selection task we propose as well as on English-Spanish end-to-end machine translation. Finally, Chapter 9 outlines the conclusions drawn from our work and suggests future lines of research.
2 Statistical machine translation
The goal of machine translation (MT) is to automatically translate from one natural language to another using computers. It has been one of the most prominent applications of natural language processing since its early days as, in addition to being a very complete problem, it has a great practical interest in an increasingly global world.

Different approaches have been followed to achieve this goal, which have been broadly classified into two main groups: rule-based machine translation systems, which follow a rational approach taking our linguistic knowledge of the languages involved as their basis, and corpus-based machine translation systems, which follow an empirical approach taking previously made translations as their basis. At the same time, the latter are divided into example-based machine translation systems that translate by analogy, and statistical machine translation systems that make use of different statistical models for the translation process. The statistical approach has progressively replaced previous attempts since the 1990s until becoming the main approach to MT nowadays (Hutchins, 2007), and it is also the approach that has been followed in this work.

The remainder of this chapter is organized as follows. Section 2.1 presents a more detailed discussion of the different MT paradigms. Section 2.2 explains how word alignment is done, which is the basis of most statistical machine translation systems. Section 2.3 then introduces the phrase-based statistical machine translation paradigm, which has been the main approach to statistical machine translation and also the one used in this work. Section 2.4 discusses the main limitations of phrase-based systems and presents alternative approaches that have arisen to address them. After that, Section 2.5 outlines MT evaluation. Section 2.6 then presents Moses, the most widely used statistical machine translation toolkit and also the one we use in this project. Finally, Section 2.7 concludes the chapter.

In addition to the specific citations in the text, Jurafsky and Martin (2008) and, to a lesser extent, Labaka (2010) and Manning and Schütze (1999) have been used as a general reference for writing this chapter.
2.1 Machine translation paradigms

As introduced previously, there are two main machine translation paradigms: rule-based machine translation and corpus-based machine translation, which is further divided into example-based machine translation and statistical machine translation. This section gives a brief overview of each of these paradigms, starting with rule-based approaches in Section 2.1.1 and continuing with corpus-based ones in Section 2.1.2. Finally, Section 2.1.3 discusses hybrid approaches that try to combine different paradigms.
2.1.1 Rule-based machine translation

Rule-based machine translation (RBMT) makes use of the linguistic knowledge of the source and the target languages as the basis to tackle machine translation.
Figure 1: The Vauquois triangle

Depending on the nature of this knowledge, the abstraction level, and the intermediate representation used, different approaches have been followed, which have been classified into direct systems, transfer-based systems and interlingual systems. The so-called Vauquois triangle, shown in Figure 1, has been typically used to illustrate these three strategies. As it can be seen, the closer we move to the top of the triangle, the more complex the intermediate representation used becomes, requiring deeper processing. More concretely, each approach works as follows:

• Direct systems translate word by word in a single step, without using any intermediate representation. They mainly rely on bilingual dictionaries for that, but basic coherence and reordering rules have also been used in addition to them.

• Transfer-based systems are based on contrastive knowledge, that is, the knowledge of the differences between the source and the target language, and make use of different rules to overcome them. For that purpose, two intermediate representations are used, one for the source language and the other one for the target language, dividing the translation process into the following three steps:
– In the analysis phase, the text to be translated is analyzed making use of the standard natural language processing pipeline, usually involving morphological analysis, morphological disambiguation or part-of-speech tagging, and syntactic analysis. Depending on the nature of this analysis, these systems are further divided into shallow-transfer systems, if only shallow parsing or chunking is done, or deep-transfer systems, if the full parse tree is built.

– During the transfer phase, the source language representation is transformed into the target language representation. Syntactic-transfer systems do this
at the lexical and structural level by means of bilingual dictionaries and transfer rules, whereas semantic-transfer systems also work at the semantic level making use of additional structures to represent meaning.

– In the generation phase, the final translation is obtained from the target language representation. Most of the time, morphological dictionaries serve for this purpose.
• Interlingual systems make use of a single intermediate representation called interlingua that is independent of any specific language. Thanks to this, the translation process is reduced to two steps: the analysis phase encodes the meaning of the input text into the interlingua, while the generation phase decodes the interlingua into the output text. This way, unlike transfer-based systems, which work at a language pair level, interlingual systems follow a global approach, working at a conceptual level. The main advantage of this is that it makes it much easier to translate among several languages, as it is enough to build the analysis and generation modules for each of them in a completely independent way from each other. However, getting a useful interlingua representation has proved to be extremely challenging in practice, which is the reason why the approach has not attracted too much attention in the last decades.
2.1.2 Corpus-based machine translation

The corpus-based machine translation paradigm makes use of previously made human translations as the basis to make new translations. The popularization of new technology and, in particular, the Internet, has made huge amounts of text data easily available to the research community which, along with the computational power provided by modern hardware, has motivated the rise of these approaches in the last decades. Parallel corpora (i.e. bilingual texts aligned at segment, usually sentence, level) are the main resource used for that. In particular, the research community has compiled several parallel corpora coming from official documents translated into different languages and published through the web by different public institutions.

As said before, there are two main paradigms that follow this approach, the so-called example-based machine translation and statistical machine translation, which are further discussed in the following two subsections.
2.1.2.1 Example-based machine translation

Example-based machine translation (EBMT) systems translate by analogy, using a parallel corpus at runtime as the only source of knowledge (Nagao, 1984; Somers, 2003). Many different approaches have been proposed around this idea, generally involving the following three steps:
• Matching, that is, searching for occurrences of the input text in the parallel corpus.
• Alignment, that is, identifying the translation of each matched chunk in the parallel corpus.

• Recombination, that is, combining the aligned chunks to obtain the best possible translation of the input text.

Figure 2: The Vauquois triangle adapted for EBMT systems
Even though the underlying principle is completely different, these steps resemble the analysis, transfer and generation steps of transfer-based RBMT systems, as they respectively take care of processing the input text, moving it into the target language, and adapting it to obtain the final translation. This way, the Vauquois triangle has been adapted for EBMT systems as shown in Figure 2. As it can be seen, in addition to transfer-based RBMT, EBMT is also analogous to direct RBMT in one trivial case as, when an exact match of the input text is found, its translation in the parallel corpus would be directly used with no further processing.
2.1.2.2 Statistical machine translation

Statistical machine translation (SMT) systems use statistical models learned from parallel corpora to tackle the problem of machine translation. This way, instead of focusing on the translation process itself, this approach starts by looking at the output it should generate first and, given a source language text F, tries to model the probability P(E|F) of any target language text E being its translation. Therefore, the goal of the system will simply be to choose the translation Ê that maximizes this probability:
$$\hat{E} = \arg\max_E P(E \mid F)$$

The initial SMT models worked at the word level, but they were later superseded by phrase-based systems capable of working with larger chunks of text, becoming the most
widely used approach to SMT. However, these word level models are still used today to align parallel corpora as explained later in Section 2.2, a necessary step for building phrase-based systems, which are then discussed in Section 2.3. Several extensions of the phrase-based approach have also been proposed, tree-based models, capable of working with hierarchical structures, and factored models, which provide a natural way to incorporate linguistic features into the SMT system, being the most prominent ones. With the advent of deep learning, there have also been several attempts in the last few years to replace different components of traditional SMT systems with neural networks, and neural machine translation, which tries to use neural networks to model the entire translation process, has also emerged, generating great expectations. All these other approaches are later discussed in Section 2.4.
2.1.3 Hybrid machine translation

Even though SMT has become the main approach to MT, the fact is that each paradigm has its own strengths and weaknesses. For instance, SMT systems tend to produce more natural translations than RBMT systems, but their errors also tend to be more unpredictable from the human perspective. For that reason, there have been several approaches to combine different paradigms through hybridization, which can be broadly classified into the following three groups (Lu and Xue, 2010):
• Hybrid combination takes one main system and uses another one to create or improve resources for it. Following this approach, there have been attempts to adapt RBMT rules for SMT systems, and also to improve RBMT systems from parallel corpora using SMT techniques.

• Multi-engine or parallel combination translates the input text using several independent systems, and uses an additional module to combine their output in the best possible way.

• Multi-pass or serial combination uses the output of one system as the input of another. Automatic post-edition is the most prominent application of this approach, where one system tries to improve the output of another.
2.2 Word alignment

The goal of word alignment is to identify translation relationships among words in a parallel corpus, usually at sentence level. This way, its output is a bipartite graph that links each word in one language with those words in the other language that are considered to be its translation. In addition to being the foundation of most SMT systems, word alignment has also been used in other crosslingual natural language processing tasks.

However, most applications require aligning thousands and thousands of sentences for it to be useful, making manual annotation unfeasible. For that reason, unsupervised learning is used instead, making
use of different statistical models, IBM Models 1-5 (five models of increasing complexity) (Brown et al., 1993) and Hidden Markov Models (HMM) (Vogel et al., 1996) being the most prominent ones.

All these models are based on one important simplification, as they only allow each word in the target language F to be aligned with a single word in the source language E. In order to make it possible to align words in the target language that do not have a direct correspondence in the source language (known as spurious words), the source sentence is assumed to have the null pseudo-token at position 0. This way, an alignment A can be represented as a sequence of integers A = a_1, a_2, ..., a_J, where a_j denotes the position of the source language word that has been aligned with the jth word in the target language, I and J being the lengths of the source and the target sentences and subject to 0 ≤ a_j ≤ I. Within this framework, given a word sequence E in the source language, each model estimates the probability of obtaining the word sequence F in the target language through any alignment A in one way or another, taking the optimal alignment Â that maximizes this probability:

$$\hat{A} = \arg\max_A P(F, A \mid E)$$

For instance, IBM Model 1, the simplest one among all of them, models the probability of the source language word sequence E = e_1, e_2, ..., e_I and alignment A = a_1, a_2, ..., a_J generating the target language word sequence F = f_1, f_2, ..., f_J as follows, where t(f_x | e_y) denotes the probability of e_y being translated as f_x:
$$P(F \mid E, A) = \prod_{j=1}^{J} t(f_j \mid e_{a_j})$$

IBM Model 1 assumes that all the alignments are equally likely. Therefore, since there are (I + 1)^J possible alignments for a given target language length J and assuming that the probability of the actual length being J is a small constant ε, the probability of the source language word sequence E being aligned as A is estimated as follows:

$$P(A \mid E) = \frac{\epsilon}{(I + 1)^J}$$

And, combining all the equations above, the model would choose the optimal alignment Â for every entry in the parallel corpus as follows:

$$\hat{A} = \arg\max_A P(F, A \mid E) = \arg\max_A P(F \mid E, A) \times P(A \mid E) = \arg\max_A \frac{\epsilon}{(I + 1)^J} \prod_{j=1}^{J} t(f_j \mid e_{a_j})$$

Finally, since the alignment of each word is decided independently from the rest, the above equation can be simplified as follows:

$$\hat{a}_j = \arg\max_{a_j} t(f_j \mid e_{a_j}), \quad 1 \le j \le J$$
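To make the Model 1 equations above concrete, here is a minimal illustrative sketch in Python (ours, not part of the thesis toolchain); the translation probability table and its values are made up for the example, and in practice they would be estimated with the iterative procedure described further below in this section.

def ibm1_align(src_words, tgt_words, t, null_token="NULL"):
    """Best IBM Model 1 alignment given a table t[f][e] = t(f | e).

    Since each target position is aligned independently, the j-th target word
    is simply linked to the source position (0 = null token) that maximizes t(f_j | e_i).
    """
    src = [null_token] + list(src_words)  # position 0 is the null pseudo-token
    alignment = []
    for f in tgt_words:
        probs = [t.get(f, {}).get(e, 1e-12) for e in src]
        alignment.append(max(range(len(src)), key=lambda i: probs[i]))
    return alignment

# Toy translation probabilities (invented for illustration).
t = {
    "la":   {"the": 0.6, "house": 0.05, "NULL": 0.2},
    "casa": {"house": 0.8, "the": 0.1, "NULL": 0.01},
}
print(ibm1_align(["the", "house"], ["la", "casa"], t))  # [1, 2]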
The rest of the models try to overcome all those simplifications made by IBM Model 1 and, as a consequence, they turn out to be considerably more complex:
• IBM Model 2 adds an absolute reordering model, addressing the important limitation in IBM Model 1 of not taking the order of the words into account.

• IBM Model 3 introduces the notion of fertility, modeling the number of target language words that each source language word generates.

• IBM Model 4 incorporates a relative reordering model.

• IBM Model 5 fixes the deficiency problem in IBM Models 3 and 4, which assigned a probability mass to impossible events as a consequence of not preventing words being placed in positions that were already taken.

• Hidden Markov Models (HMM) are used as an alternative approach by using the hidden states to represent source words, observations to represent target words and transition probabilities to model alignment probabilities.
In any case, all the models combine the following two components in one way or another:
• The word alignment itself, represented as the sequence of integers A = a_1, a_2, ..., a_J.

• Lexical weights or translation probabilities and other parameters of the model, which corresponded to t(f_x | e_y) in the case of IBM Model 1.

Needless to say, if the model parameters were known beforehand, it would be trivial to find the optimal alignment according to them. At the same time, if the alignments were known beforehand, it would be straightforward to estimate the parameters of a model (e.g. taking relative frequency estimates for the translation probabilities). It seems a paradox that computing alignments requires the model parameters while computing the model parameters requires alignments. However, if we had some estimate of the model parameters, it would be possible to estimate the alignment probabilities, and then use this estimation to re-estimate the model parameters. This is precisely the idea behind the expectation maximization (EM) algorithm, which initializes all the probabilities uniformly and finds better and better estimates for the alignment probabilities and the model parameters by fixing one to re-estimate the other and the other way around in an iterative process. As this process could go on indefinitely, only a fixed number of iterations are usually performed. At the same time, the complexity of the most advanced models makes it computationally unfeasible to make the exact calculations for each EM iteration, so the simplest models, which do not have this limitation, are usually used to provide an initial estimate.

Finally, recall that all these models only allow a target language word to be aligned with a single word in the source language. Even though this makes the alignment problem easier to tackle, it might also pose an important limitation on the usefulness of the results, as it is possible that a target language word is actually the translation of several words in the source language.
Figure 3: Intersection and union for alignment symmetrization (Koehn, 2016)

In order to overcome this issue, alignment symmetrization techniques are used, which compute word alignments in each direction independently and then combine their results. Two straightforward options would be to take the intersection or the union between both alignments. The former would likely discard some correct alignments, but those that it keeps would presumably have a very high accuracy. At the same time, the latter would be likely to contain most correct alignments, but the accuracy of the alignments it keeps would presumably be lower. Figure 3 shows the effect of each of them for one example case. Since both options have their own problems, more sophisticated alignment methods have been proposed. These methods usually start from the intersection between both alignments and incorporate additional alignment points from their union according to different heuristics (Och and Ney, 2003).
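As a small illustration of the symmetrization step (a sketch under our own toy assumptions, not the actual GIZA++/Moses implementation), the following code combines two directional alignments by intersection and union; the growing heuristics of Och and Ney (2003) are omitted.

def symmetrize(src2tgt, tgt2src):
    """Combine two directional word alignments given as sets of (i, j) pairs,
    where i is a source position and j a target position.

    Returns the high-precision intersection and the high-recall union."""
    return src2tgt & tgt2src, src2tgt | tgt2src

# Toy directional alignments for a 3-word source and 3-word target sentence.
forward = {(0, 0), (1, 1), (2, 2)}
backward = {(0, 0), (1, 1), (1, 2)}
intersection, union = symmetrize(forward, backward)
print(sorted(intersection))  # [(0, 0), (1, 1)]
print(sorted(union))         # [(0, 0), (1, 1), (1, 2), (2, 2)]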
2.3 Phrase-based statistical machine translation

Following the discussion in Section 2.1.2.2 on SMT systems, phrase-based models (Koehn et al., 2003) try to find the translation Ê that maximizes the probability P(E|F) of being the translation of the input text F. Applying Bayes' rule, this can be formulated as follows:

$$\hat{E} = \arg\max_E P(E \mid F) = \arg\max_E \frac{P(F \mid E) P(E)}{P(F)} = \arg\max_E P(F \mid E) P(E)$$
As it can be seen, according to the last formulation the selection of the optimal translation Ê depends on two factors: the probability P(F|E) of the input text F being the translation of the output text E, and the probability P(E) of the output text itself occurring in the target language. Original phrase-based models follow this noisy channel approach and use two independent components to model each of them, the so-called translation and language models:
$$\hat{E} = \arg\max_E \underbrace{P(F \mid E)}_{\text{translation model}} \; \underbrace{P(E)}_{\text{language model}}$$

Both the translation and the language models are statistical models in a phrase-based system, whose parameters are estimated from large corpora. Besides them, these systems have an additional component, the decoder, which takes care of searching for the translation that maximizes the aforementioned probability. This is how they work one by one:
• The translation model assigns a probability to a given target language sentence being translated as some source language sentence. For that purpose, phrase-based models segment the source and the target sentences into pairs of aligned phrases (sequences of words without any linguistic motivation) and typically calculate the probability in question as follows:
$$P(F \mid E) = \prod_{i=1}^{I} \phi(\bar{f}_i, \bar{e}_i) \, d(a_i - b_{i-1})$$

where φ(f̄_i, ē_i) corresponds to the translation probability and d(a_i − b_{i−1}) corresponds to the distortion probability. The former models the probability of the phrase ē_i in the target language generating the phrase f̄_i in the source language, whereas the latter models the distance between the phrases in both languages. More concretely, a_i denotes the starting position of the source language phrase generated by the ith phrase in the target language, while b_{i−1} denotes the ending position of the source language phrase generated by the (i−1)th phrase in the target language. There can be different ways to calculate the distance itself, a simple approach being to raise a small constant α to the distortion:
$$d(a_i - b_{i-1}) = \alpha^{|a_i - b_{i-1} - 1|}$$

In order to estimate the translation probabilities φ(f̄_i, ē_i), word alignment is first performed in both directions in a parallel corpus and symmetrization is applied as described in Section 2.2. Having done that, all the phrase pairs that are consistent with the resulting alignment points are extracted as shown in Figure 4. Finally, relative frequency estimates are used for the translation probabilities (a small illustrative sketch of this estimation follows this list):

$$\phi(\bar{f}, \bar{e}) = \frac{\text{count}(\bar{f}, \bar{e})}{\sum_{\bar{f}} \text{count}(\bar{f}, \bar{e})}$$
Figure 4: Phrase pair extraction from word alignment (Koehn, 2016)
Typically, all the extracted phrase pairs are stored together with their translation probability in a big data structure known as the phrase table.
• The language model assigns a probability to a given sequence of words w_1, ..., w_m occurring in the target language. Different models have been proposed for that, but most of them are based on the concept of n-grams. Within this approach, the probability of a sequence of words occurring in the language in question corresponds to the product of the probability of each word occurring after the previous n − 1 words, where n is a parameter of the model:
$$P(w_1, \ldots, w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1})$$
This way, for different values of the parameter n different models like unigrams (n = 1, i.e. the probability of each word is independent from the context), bigrams (n = 2) and trigrams (n = 3) are defined. In order to calculate the conditional probabilities for a given value of n, relative frequency estimates are usually taken from a large monolingual corpus:
$$P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1}) = \frac{\text{count}(w_{i-(n-1)}, \ldots, w_{i-1}, w_i)}{\text{count}(w_{i-(n-1)}, \ldots, w_{i-1})}$$
Nevertheless, this model would assign a zero probability to words or n-grams not occurring in the training corpus and, in order to overcome this issue, different smoothing and backoff techniques are typically used.

• The decoder takes a text in the source language and generates the most probable translation according to the translation and language models. From the algorithmic
point of view, this is essentially a search problem, but the search space is so big (infinite unless a maximum length is defined and, even then, extremely big given all the possible word combinations one could potentially generate) that it is unfeasible to compute the exact solution. In fact, Knight (1999) proved that exact SMT decoding is an NP-complete problem. For that reason, a heuristic search is performed instead, creating the output text phrase by phrase until the input text is fully covered. More concretely, a beam search is usually performed, which explores the state space level by level expanding a predefined number of nodes each time. Figure 5 shows an example of one such decoding.

Figure 5: Decoding of Maria no daba una bofetada a la bruja verde (Koehn, 2016)
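The following sketch ties the above components together in a simplified way (the phrase pairs, counts and bigram model are toy assumptions of ours, not the actual Moses models): it estimates phrase translation probabilities by relative frequency counting and scores a target sentence with an unsmoothed bigram language model.

import math
from collections import Counter

def phrase_translation_probs(phrase_pairs):
    """phi(f_bar, e_bar) = count(f_bar, e_bar) / sum over f_bar of count(f_bar, e_bar)."""
    pair_counts = Counter(phrase_pairs)
    e_counts = Counter(e for _, e in phrase_pairs)
    return {(f, e): count / e_counts[e] for (f, e), count in pair_counts.items()}

def bigram_logprob(sentence, bigram_counts, unigram_counts):
    """Log-probability of a tokenized sentence under an unsmoothed bigram model."""
    logp = 0.0
    for prev, word in zip(["<s>"] + sentence, sentence + ["</s>"]):
        num, den = bigram_counts[(prev, word)], unigram_counts[prev]
        logp += math.log(num / den) if num and den else float("-inf")
    return logp

# Toy phrase pairs extracted from a symmetrized word alignment (made up).
pairs = [("la casa", "the house"), ("la casa", "the house"),
         ("casa", "the house"), ("la casa", "house")]
print(phrase_translation_probs(pairs))
# phi('la casa', 'the house') = 2/3, phi('casa', 'the house') = 1/3, phi('la casa', 'house') = 1

# Toy target language counts for the bigram language model (made up).
bigrams = Counter({("<s>", "the"): 1, ("the", "house"): 1, ("house", "</s>"): 1})
unigrams = Counter({"<s>": 1, "the": 1, "house": 1})
print(bigram_logprob(["the", "house"], bigrams, unigrams))  # 0.0, i.e. probability 1 in this toy corpus

A real decoder would additionally search over all segmentations and reorderings of the input, combining scores like these as described above.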
2.3.1 Log-linear models

Even though phrase-based SMT was initially based on the noisy channel model, this has later been replaced by log-linear models, which combine several independent models known as feature functions and directly search for the translation Ê with the highest probability P(E|F):
$$\hat{E} = \arg\max_E P(E \mid F) = \arg\max_E \prod_i h_i(E, F)^{\lambda_i} = \arg\max_E \sum_i \lambda_i \log h_i(E, F)$$
where h_i(E, F) denotes the ith feature function and λ_i the weight assigned to it. In practice, the translation and language models coming from the original noisy channel approach are still the most relevant factors, but the architecture allows an arbitrary number of feature functions that model different aspects of the translation process to be incorporated. A standard set of feature functions would include the following:
• The language model P(E) seen above.
• The forward and inverse translation probabilities φ(ē|f̄) and φ(f̄|ē) seen above. The original noisy channel model only made use of the inverse translation probability, but the forward probability is also incorporated in log-linear models.

• The forward and inverse lexical weightings p_w(ē|f̄) and p_w(f̄|ē). These come directly from the word translation probabilities of the word alignments, and they can therefore be seen as an inheritance from the original word-based SMT systems, having a smoothing effect over the phrase translation probabilities:
$$p_w(\bar{f} \mid \bar{e}) = \max_a \prod_{j=1}^{J} \frac{1}{|\{i \mid (i, j) \in a\}|} \sum_{\forall (i, j) \in a} p(f_j \mid e_i)$$
• The distortion probability or distance-based reordering model, also seen above.

• A constant word penalty.

• A constant phrase penalty.

• A constant unknown word penalty.

• Lexical reordering features.
Finally, the weights λ_i of log-linear models are usually tuned to optimize the translation quality in a development set according to some automatic metric like BLEU (see Section 2.5 for more details on these). A stochastic method known as minimum error rate training (MERT) is typically used for that (Och, 2003).
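As a brief illustration of the log-linear formulation (the feature values and weights below are invented, not tuned by MERT), the following sketch computes the weighted sum of log feature functions used to rank candidate translations.

import math

def loglinear_score(features, weights):
    """Compute sum_i lambda_i * log h_i(E, F) for one candidate translation.

    `features` maps feature names to positive values h_i(E, F) and `weights`
    maps the same names to their tuned lambda_i."""
    return sum(weights[name] * math.log(value) for name, value in features.items())

# Invented feature values for a candidate translation and invented weights.
h = {"inverse_phrase_prob": 0.02, "language_model": 0.001, "word_penalty": math.exp(-5)}
lambdas = {"inverse_phrase_prob": 0.3, "language_model": 0.5, "word_penalty": -1.0}
print(loglinear_score(h, lambdas))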
2.4 Addressing the limitations of statistical machine translation

Phrase-based systems have long been the state of the art in machine translation. However, these models still suffer from important limitations, which are mostly related to the following factors:
• Sparsity of natural language, causing many terms or linguistic phenomena to be missing or poorly represented in the training corpora with the inherent difficulty to properly model them, a problem that is accentuated by the scarcity of bilingual resources for the vast majority of language pairs.

• Locality of the statistical methods used, with the inherent difficulty to properly model context information, long distance dependencies and morphosyntactically diverging language pairs.

• Inflexibility of the statistical models used, with the inherent difficulty to incorporate external linguistic information from other natural language processing tools and resources.
Recent research on machine translation has mostly focused on addressing these issues, and several extensions of the phrase-based approach have been proposed with different degrees of success. This section discusses the most relevant ones as introduced in Section 2.1.2.2: factored models, tree-based models and deep learning approaches including neural machine translation. Later on, our work proposes two additional approaches to mitigate the limitations of phrase-based systems: the use of logistic regression for dynamic phrase translation probability modeling, incorporating additional lexical, context and distributional semantic features (Chapters 5 and 6), and the use of bilingual word embeddings for phrase translation similarity scoring (Chapter 8).
2.4.1 Factored models

One important limitation of traditional phrase-based systems is that they work with the surface form of words, making it difficult to incorporate linguistic information into the translation process. This becomes obvious for morphologically rich languages like Basque, where a single lemma can have dozens of different surface forms, severely accentuating the sparsity problem of all statistical approaches in natural language processing. For instance, there might be enough evidence in the training data to learn how to translate a given lemma and a given grammatical case independently, but if their corresponding surface form is not found in the parallel corpus, a traditional phrase-based system would not be able to learn its translation.

Factored models (Koehn and Hoang, 2007) try to overcome this issue by incorporating translation and generation steps for different layers of linguistic information known as factors into the log-linear model discussed in Section 2.3.1. Each translation step learns how to map some factors from one language to the other and, in that regard, they are equivalent to the models discussed for phrase-based systems, except that they are not necessarily limited to surface forms but can also work with other factors like lemmas or part-of-speech tags, for instance. At the same time, generation steps model the probability of obtaining one factor given some others. In addition to them, language models can also be built for specific factors. Figure 6 shows an example factored model with one translation step for lemmas, another translation step for the part-of-speech and morphological tags, and one generative model from the lemma, part-of-speech and morphology to the surface form.
2.4.2 Tree-based models

Another limitation of traditional phrase-based systems is that they directly map phrases from one language to the other, without considering any structural correspondence. Tree-based models (Chiang, 2005) overcome this by means of grammar rules, which incorporate variables (known as non-terminals) in addition to words (known as terminals). This way, just as a phrase-based model could have an entry in the phrase table with its associated weights for "I have not seen it → no lo he visto" and "I have not read it → no lo he leído", a tree-based model might learn how to generalize this and have an entry for "I have not X it → no lo he X", where X is a non-terminal.
Figure 6: An example factored model (Koehn and Hoang, 2007)

During decoding, the non-terminal X will be expanded through other grammar rules following a tree structure, which is why these models are said to be tree-based. These grammar rules can be automatically extracted from word aligned corpora just as phrases, making use of syntactic annotations or not. As a consequence, the tree structures might be linguistically motivated, in which case tree-based models are commonly known as syntactic models, or not, in which case tree-based models are commonly known as hierarchical models.

2.4.3 Deep learning and neural machine translation

As discussed later in Section 3.4.3, artificial neural networks are a family of models inspired by biological neural networks (primarily the brain) that approximate unknown functions through units connected with weights, analogous to neurons. In the last few years, these models are revolutionizing artificial intelligence thanks to deep learning, which uses several processing layers to model high level abstractions. They have achieved a dramatic improvement in the state of the art of several fields like speech recognition (Hinton et al., 2012) and image processing (Krizhevsky et al., 2012), and natural language processing is expected to be one of the next areas where deep learning will make a large impact in the near future (LeCun et al., 2015). One of the main strengths of deep learning when compared to existing statistical methods is the use of continuous representations of words in close relation to word embeddings (see Section 4.1), which serves to mitigate the sparsity problem in natural language processing. Based on this, deep learning models have been successfully applied at the different components of phrase-based SMT systems, including the language model (Vaswani et al., 2013), the reordering model (Li et al., 2013) and the translation model (Zhang et al., 2014; Su et al., 2015; Cho et al., 2014) (see Section 4.4).

At the same time, a completely new approach known as neural machine translation has recently emerged, which uses artificial neural networks to model the translation process from end to end (Bahdanau et al., 2015; Jean et al., 2015; Sutskever et al., 2014; Luong et al., 2014). Even if it is still too early to speak about a clear improvement with respect to existing statistical systems, these models have achieved very competitive results in a
surprisingly short period of time, and there is a high expectation that they could achieve a large improvement in the state of the art in the future.
2.5 Evaluation in machine translation

Unlike other tasks where simple metrics like accuracy or the F-score can be naturally used to measure the quality of a system, evaluation is a very difficult task in MT. Broadly speaking, two different dimensions are considered when evaluating an MT system: the fidelity (i.e. how accurately the information in the input text is preserved in the output text) and the fluency (i.e. how natural and correct the output text is in the target language). As it can be seen, these are tightly related to the translation and language models of SMT systems, respectively.

Manual evaluation uses human raters to assess the quality of MT systems. One possibility is to ask the evaluators to score the system according to different aspects of the above dimensions (e.g. the clarity, naturalness, style, adequacy or informativeness of the translations). It is also possible to compare the output of different systems instead of evaluating them independently. An extrinsic evaluation can also be done, where end-users are asked to use the MT system within its intended application and their experience evaluated. One advantage of manual evaluation is that it does not only assign numeric scores, but also serves to get insight into the behavior of each system, helping to identify the type of errors they make so that these can be properly addressed. It is also argued that the best evaluation will necessarily come from humans, as we are the intended users of MT and also the ones with the linguistic knowledge to properly assess it. However, manual evaluation is also inherently subjective and costly to carry out. As a consequence, the results of a manual evaluation will always be difficult to reproduce accurately, and performing a new manual evaluation for every iteration while developing a system will likely be impractical.

Automatic evaluation overcomes these issues by means of different heuristic methods to compare the output of an MT system with a reference translation made by a person. The most widely used evaluation metric in this regard is the bilingual evaluation understudy or BLEU (Papineni et al., 2002), which measures the similarity between the MT output and the reference translation as follows:
$$\text{BLEU} = \text{BP} \times \exp\left(\frac{1}{N} \sum_{n=1}^{N} \log p_n\right)$$

where BP is a brevity penalty defined as follows:

$$\text{BP} = \begin{cases} 1 & c > r \\ e^{(1 - r/c)} & c \le r \end{cases}$$

where c is the length of the MT output, r is the length of the reference translation, and p_n
is the modified n-gram precision score defined as follows:

$$p_n = \frac{\sum_{C \in \text{Candidates}} \sum_{\text{n-gram} \in C} \text{count}_{\text{clip}}(\text{n-gram})}{\sum_{C' \in \text{Candidates}} \sum_{\text{n-gram}' \in C'} \text{count}(\text{n-gram}')}$$
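The sketch below is a simplified single-reference, unsmoothed version of this computation (illustrative only; the experiments in this work rely on standard BLEU tooling):

import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU with a single reference and no smoothing."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        if clipped == 0:
            return 0.0  # any zero modified precision drives unsmoothed BLEU to zero
        log_precisions.append(math.log(clipped / sum(cand.values())))
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)

candidate = "the cat is on the mat".split()
reference = "the cat sat on the mat".split()
print(round(bleu(candidate, reference, max_n=2), 3))  # 0.707 using up to bigrams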
2.6 Moses

Moses (Koehn et al., 2007; Koehn, 2016) is an open source toolkit for statistical machine translation. It is licensed under the GNU Lesser General Public License (LGPL) and its development is mainly supported by the European Union. The toolkit provides tools to train and tune SMT systems as well as a decoder to run them, and in addition to the dominant phrase-based paradigm it also supports tree-based and factored models. Thanks to its active development community, ease of use, wide set of supported features and state-of-the-art performance, Moses has become the de facto benchmark for SMT research, and it is also widely used in commercial systems.

In order to word align the parallel training corpus, Moses is commonly used together with GIZA++ (Och and Ney, 2003), an extension of the GIZA word aligner developed by Franz Josef Och and licensed under the GNU General Public License (GPL). It implements the IBM Models 1-5 and HMM models discussed in Section 2.2 with several extensions, improvements and optimizations. Apart from the official version, Qin Gao implemented a multi-threaded version of GIZA++ called MGIZA, which is also integrated into Moses.

As for the language model, Moses supports the SRI, IRST, RandLM, KenLM, DALM, OxLM and NPLM toolkits. They are all open source projects with the exception of SRI, which is nonetheless freely available for research purposes. Moses includes KenLM by default, but the rest of the toolkits are also well integrated and can be easily used instead if preferred.
2.7 Conclusions

From rule-based to corpus-based systems, there are many different approaches to machine translation, phrase-based SMT being the dominant paradigm in recent times. In spite of its success, however, this approach still suffers from important limitations like the sparsity of natural language, the locality of the statistical methods it uses, and the difficulty of incorporating external linguistic information. In this work, we try to address these issues using two different strategies. First, in Chapter 5 we propose the use of logistic regression for dynamic phrase translation probability modeling, which we prove to be a generalization of the standard translation probabilities used in phrase-based SMT. This provides an alternative to factored models for incorporating linguistic information, as we do with distributional semantic features in Chapter 6, with the advantage of allowing context features to overcome the locality problem as well as additional features shared across different phrase pairs to overcome the sparsity problem. Second, in Chapter 8 we explore the use of bilingual word embeddings, like the ones we propose in Chapter 7, for phrase translation similarity
scoring, incorporating distributional semantic knowledge into the translation model while addressing the sparsity problem thanks to the continuous word representations, in close relation to deep learning models for machine translation. Apart from that, we have seen that, given the cost and subjectivity of manual evaluation, automatic metrics are typically used instead, as we do in this project with BLEU, the most widely used one among them. Finally, we have presented Moses, which has become the de facto toolkit for SMT research and also the one we use for our work.
3 Large scale machine learning
Machine learning is a subfield of computer science that studies algorithms that learn to make predictions or decisions based on data. There are many different tasks that fall under this broad definition, which are commonly classified into supervised learning, where the data contains the characterization of each training instance along with its expected output, unsupervised learning, where the latter is not available so the algorithm has to find the structure of the data, semi-supervised learning, where only a subset of the training instances contain the expected output, and reinforcement learning, where the algorithm receives feedback about its performance dynamically.

This chapter focuses on supervised learning, which is so widely used that machine learning is sometimes treated as a synonym for it, and which is also the one relevant for this work. Section 3.1 first presents the basic concepts in this area. Section 3.2 then introduces a family of supervised learning algorithms called linear models and discusses a very simple model that belongs to this group known as the perceptron. Section 3.3 presents a more sophisticated linear model named logistic regression together with gradient descent, a widely used optimization procedure for this and other models. The limitations of these linear models are then discussed in Section 3.4 together with different approaches to overcome them. Section 3.5 then describes the peculiarities and difficulties of applying these learning algorithms to large amounts of data. After that, Section 3.6 presents Vowpal Wabbit, a widely used open-source library for such large scale learning problems, and also the one we use in this project. Section 3.7 concludes the chapter.

In addition to the specific citations in the text, Goodfellow et al. (2016), Leskovec et al. (2014), Jurafsky and Martin (2008) and the online courses "Machine Learning" (https://www.coursera.org/learn/machine-learning/) by Andrew Ng and "Neural Networks for Machine Learning" (https://www.coursera.org/course/neuralnets) by Geoffrey Hinton have been used as a general reference for writing this chapter.
3.1 Basic concepts in supervised learning

Following the definition above, the goal of supervised learning is to, given a training set of m different (x, y) pairs, where x is the feature vector that characterizes each instance and y is the correct output that corresponds to it, learn a function to predict the output y' of any new, possibly unseen instance x'. Each feature $x_i$ and the output y can be either categorical, if its value is taken from a discrete set, or numerical, if its value is an integer or a real number. For the sake of simplicity, in this work we will follow a standard approach to convert categorical features into numerical ones as follows: for each value that a categorical feature can take, we will create a new numeric feature that will be set to 1 for instances that took its associated value for the original categorical feature, and 0 for the rest. For instance, for a categorical feature corresponding to the set {noun, adjective, verb, adverb} we would create 4 new numerical features (one for each value in the set) so
that the one corresponding to "noun" would be set to 1 for nouns and 0 for adjectives, verbs and adverbs, the one corresponding to "adjective" would be set to 1 for adjectives and 0 for nouns, verbs and adverbs, etc. As for the output y, supervised learning problems whose output is categorical are called classification problems, and those whose output is numerical are called regression problems. The special case of the former with only two possible output values is called binary classification.

Needless to say, a good classifier or regressor will be one that gives the best possible predictions for new instances. For that reason, a test set is usually kept apart from training and used to evaluate the performance of different systems according to task relevant measures (e.g. accuracy for classification or mean squared error for regression). During training, each supervised learning algorithm will learn a model to make such predictions based on the regularities it observes in the training set. It might therefore seem obvious that the performance of the model on the training set should improve as the training goes on. However, it might happen that, as the model learns to perform better on the training set, it starts picking up regularities there that do not hold in reality (e.g. due to sampling noise) and, by doing so, it loses its generalization capacity, yielding worse results on the test set. This problem is known as overfitting. The opposite can also happen, as a model might miss relevant regularities in the training set due to the assumptions it makes, yielding worse results not only on the training set, but also on the test set. This other problem is known as underfitting. Needless to say, a good model should find a middle ground between the two. This is called the bias-variance tradeoff, where bias refers to the error from the assumptions made by the model, and variance refers to the error from the small fluctuations in the data.

By adjusting different parameters of the model or the training algorithm and observing their effect on the test set, or by monitoring the error on the training and test sets to decide when to stop the training, a better and better tradeoff could be found. However, this would not be methodologically acceptable, as there would be a clear risk of overfitting the test set itself, so the error on it would no longer be representative. For that reason, an additional development or validation set is typically used for that. This validation set is kept apart from the training set and used to tune all these parameters, and it is only after the definitive values for them are chosen that the test set is used to report the performance of the model.

Another important distinction in machine learning methods is that of online learning as opposed to batch learning. In offline or batch learning, the entire training set is available from the very beginning and processed at once to create a model. The opposite approach is online learning, where the training data is not processed all together, but becomes available sequentially, so the model keeps adjusting itself to the instances it sees each time on the fly. This makes it possible to have dynamic models that adjust themselves to continuously changing environments. For instance, spammers will try to exploit new attack vectors as the old ones are mitigated, so an online classifier for spam filtering could adapt itself to detect these new practices.
In addition to this, as discussed later in Section 3.5, processing all the data together might not be computationally feasible for very large training sets like the ones we have for SMT, so online learning is often the most suitable approach to large scale machine learning.
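As a concrete illustration of the categorical-to-numerical conversion described at the beginning of this section, the following minimal Python sketch (using a hypothetical part-of-speech feature) creates one binary feature per possible value:

```python
def one_hot(value, categories):
    """Convert a categorical value into a list of 0/1 numerical features,
    one per possible category, as described in Section 3.1."""
    return [1 if value == c else 0 for c in categories]

POS_TAGS = ["noun", "adjective", "verb", "adverb"]
print(one_hot("noun", POS_TAGS))       # [1, 0, 0, 0]
print(one_hot("adjective", POS_TAGS))  # [0, 1, 0, 0]
```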
3.2 Linear models and the perceptron

Linear models are a family of supervised learning models that take a linear combination z of the input features

$$z(x) = \sum_i w_i x_i + b$$

and predict their output as a function of this linear combination:
h(x) = f(z(x))
In the formulation above, h(x) refers to the prediction or hypothesis made by the model for the input x, $w_i$ denotes the weight associated with each input feature $x_i$, and b is the so-called bias term. An alternative way of dealing with this bias term is to take an additional input feature $x_0$ that always has a constant value of 1, and treat the bias term as its corresponding weight $w_0$. In any case, the parameters of a linear model are precisely the weights and the bias term, and training will thus consist in finding the optimal values for them based on the instances in the training set.

Depending on the output function, the optimization objective and the training procedure, different linear models are defined. The perceptron is arguably the simplest one that one could conceive, and we will therefore use it for illustrative purposes throughout this section. The perceptron is used for binary classification, and its output simply depends on the sign of the linear combination z:

$$h(x) = \begin{cases} 1 & z(x) > 0 \\ 0 & z(x) \leq 0 \end{cases}$$

In this notation, 1 and 0 are simply the identifiers for the predicted class, so their exact meaning will depend on the definition of the problem (e.g. 1 might mean true and 0 false).

From the graphical point of view, this is equivalent to drawing a straight line known as the decision boundary in a 2-dimensional feature space, so that everything that lies on one side of the line will be predicted to belong to one class and everything that lies on the other side will be predicted to belong to the other class. For higher dimensional feature spaces, the straight line is simply generalized to a hyperplane. Figure 7 shows an example of this for a toy problem where the sex of a person has to be predicted given their age and the fundamental frequency of their voice (the physical magnitude associated with pitch). As is well known, men have a lower fundamental frequency than women, but age is also a relevant factor, as children, for instance, have a higher fundamental frequency. In this artificial example, both classes are found in different regions of the training set, and a meaningful decision boundary is accordingly defined. Needless to say, this decision boundary (that is, the exact straight line to draw) will be defined by the parameters of the model, which were the weights and the bias term. So, moving from the mathematical formulation to the graphical representation, training the classifier will consist in finding the straight line that best separates the set of examples in the training set.
Figure 7: Example decision boundary for a linearly separable problem
The perceptron learning algorithm provides a very simple training procedure that is guaranteed to find the decision boundary that perfectly separates both classes in the training set as long as this decision boundary exists. When that condition holds, as is the case of the example in Figure 7, the training data is said to be linearly separable. For that purpose, the perceptron learning algorithm follows an iterative process. It treats the bias term as an additional weight as discussed before, and starts with a fixed weight vector, typically initialized to all zeros. It then picks the instances in the training set one by one, and predicts their class using the current weight vector. If the predicted class is 0 and the real class is 1, the feature vector is added to the weight vector; if the predicted class is 1 and the real class is 0, the feature vector is subtracted from the weight vector; and if the prediction is correct, the weight vector is kept unchanged. This surprisingly simple learning algorithm is guaranteed to converge as long as the training set is linearly separable. In other words, the perceptron learning algorithm is guaranteed to eventually find the straight line that separates the classes in the training set
as long as this straight line exists. age
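The following is a minimal Python sketch of this learning rule, using a tiny hypothetical dataset in the spirit of Figure 7; it is meant only to make the update rule concrete, not as a reference implementation:

```python
import numpy as np

def train_perceptron(X, y, epochs=100):
    """Perceptron learning rule as described above: add the feature vector on a
    false negative, subtract it on a false positive, and leave the weights
    unchanged when the prediction is correct. The bias is folded in as an extra
    weight for a constant 1-valued feature."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend x0 = 1 (bias)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            prediction = 1 if xi @ w > 0 else 0
            if prediction == 0 and yi == 1:
                w += xi
                errors += 1
            elif prediction == 1 and yi == 0:
                w -= xi
                errors += 1
        if errors == 0:  # converged: the training set is linearly separable
            break
    return w

# Toy data loosely inspired by Figure 7 (hypothetical values):
# features are (voice F0 in Hz, age); class 1 = female, 0 = male.
X = np.array([[210.0, 30], [220.0, 45], [120.0, 35], [110.0, 50]])
y = np.array([1, 1, 0, 0])
print(train_perceptron(X, y))
```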
3.3 Logistic regression and gradient descent

As discussed in Section 3.2, the perceptron is a very simple model for binary classification, but also a rather limited one. One way to build more powerful models is to use a more sophisticated output function in a linear model. In the case of logistic regression, the so-called sigmoid or logistic function is used for that purpose:

$$h(x) = \frac{1}{1 + e^{-z(x)}}$$
Figure 8: The sigmoid or logistic function
In this case, the hypothesis h(x) is interpreted as the probability of a given instance belonging to the positive class 1, that is, logistic regression will predict $p(1 \mid x) = h(x)$ and $p(0 \mid x) = 1 - h(x)$. This is in contrast to what perceptrons do, which make a hard prediction instead of giving a probability distribution. In order to understand why it is possible to interpret the output of logistic regression as a probability, it should be observed that the sigmoid function is a smooth curve that always takes a value between 0 and 1, as shown in Figure 8.

When training a logistic regression classifier, so-called conditional maximum likelihood estimation is used. In other words, logistic regression tries to find the set of weights $\hat{w}$ that maximizes the probability of observing the classes in the training set given their corresponding features. This can be formulated as follows, where $y^{(i)}$ stands for the observed class of the ith training instance, $x^{(i)}$ for its corresponding feature vector, and $p(y^{(i)} \mid x^{(i)})$ is the probability predicted by the classifier for this particular instance and the weight vector w:

$$\hat{w} = \arg\max_w \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)})$$

Taking advantage of the properties of the logarithm and the fact that, in binary classification, the observed class $y^{(i)}$ will either be 0 or 1, this can be reformulated as follows:
$$\begin{aligned}
\hat{w} &= \arg\max_w \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}) = \arg\max_w \sum_{i=1}^{m} \log p(y^{(i)} \mid x^{(i)}) \\
&= \arg\max_w \sum_{i=1}^{m} y^{(i)} \log h\left(x^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - h\left(x^{(i)}\right)\right) \\
&= \arg\min_w -\sum_{i=1}^{m} y^{(i)} \log h\left(x^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - h\left(x^{(i)}\right)\right)
\end{aligned}$$
Based on this last expression, an error function E can be defined so that $\hat{w} = \arg\min_w E$:
$$E = -\sum_{i=1}^{m} y^{(i)} \log h\left(x^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - h\left(x^{(i)}\right)\right)$$

Intuitively, E measures the error made by the model when predicting the class of the instances in the training set, and the goal of the training algorithm is thus to find the set of weights that minimizes this error. In order to prevent the overfitting problem described in Section 3.1, a regularization term R(w) that depends on the weight vector is often added to the error function. The purpose of regularization is to penalize extreme values for the weights that might make the model fit the training set better but are likely to harm its generalization ability. The so-called L1 and L2 regularization are the most widely used ones, which are defined as follows:

$$R_{L1}(w) = \sum_i |w_i|$$
$$R_{L2}(w) = \sum_i w_i^2$$

Independently of the exact function used, the regularization term is added to the error seen above according to some factor λ, so the error function is redefined as follows:
$$E = -\sum_{i=1}^{m} y^{(i)} \log h\left(x^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - h\left(x^{(i)}\right)\right) + \lambda R(w)$$

Regardless of whether regularization is used and which function is used for it, training a logistic regression classifier is hence reduced to solving the particular optimization problem of minimizing this error function. Since computing the optimal solution with exact batch methods does not scale well, a simple iterative numerical method is typically used instead. The gradient descent algorithm is one of the most commonly used ones. It follows an iterative process where, at each step, each weight is updated depending on the derivative of the error function with respect to it, scaled by a learning rate $\epsilon$. These partial derivatives correspond to the following expression in the case of logistic regression:
$$\frac{\partial E}{\partial w_j} = -\sum_{i=1}^{m} \left(y^{(i)} - h\left(x^{(i)}\right)\right) x_j^{(i)}$$

This way, gradient descent will update the weights at each iteration as follows:
$$\Delta w_j = -\epsilon \frac{\partial E}{\partial w_j} = \epsilon \sum_{i=1}^{m} \left(y^{(i)} - h\left(x^{(i)}\right)\right) x_j^{(i)}$$

Under certain general assumptions (most notably, the function to optimize has to be convex), this algorithm is guaranteed to converge to a local optimum for a small enough
learning rate. In the case of logistic regression, there is a single minimum, so the optimal solution found by gradient descent will also be the global optimum.

Figure 9: Visualization of gradient descent in a 2-dimensional weight space

The behavior of gradient descent is better understood if the error surface is graphically represented in the weight space as shown in Figure 9. Once again, the example corresponds to a 2-dimensional weight space so that it can be properly visualized, but this can be generalized to higher-dimensional spaces. As discussed before, the goal of the training phase is to find the minimum of this error function or, from the graphical point of view, the coordinates defined by the weights for which the error is minimum. The gradient descent algorithm starts from a random point in this error surface and, in each iteration, moves in small steps in the direction of steepest descent. Considering that the derivative is geometrically interpreted as the slope of a curve at a given point, it is easy to establish the relation between the mathematical formulation above and the geometrical interpretation here.
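The following Python sketch puts the pieces of this section together, training a logistic regression classifier with batch gradient descent and optional L2 regularization on hypothetical toy data; it follows the update rule derived above, but the learning rate, iteration count and data are arbitrary choices for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, learning_rate=0.1, epochs=1000, lam=0.0):
    """Batch gradient descent for logistic regression as derived above,
    with optional L2 regularization (lam = 0 disables it)."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])  # bias as weight w0
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        h = sigmoid(X @ w)                 # predicted probabilities
        gradient = -X.T @ (y - h)          # dE/dw from the formula above
        gradient += 2 * lam * w            # derivative of the L2 term
        w -= learning_rate * gradient      # Δw = -ε dE/dw
    return w

def predict_proba(w, X):
    X = np.hstack([np.ones((X.shape[0], 1)), X])
    return sigmoid(X @ w)

# Toy usage (hypothetical data): two features, binary labels.
X = np.array([[0.1, 1.2], [0.4, 0.9], [2.1, 0.2], [1.8, 0.3]])
y = np.array([1, 1, 0, 0])
w = train_logistic_regression(X, y)
print(predict_proba(w, X))
```

In practice, the learning rate, the number of iterations and the regularization factor would be tuned on a validation set, as discussed in Section 3.1.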
3.4 Addressing the limitations of linear models

The linearity assumption that is inherent to the very definition of linear models can pose an important limitation on what these models can learn. Consider, for instance, the example in Figure 10, which shows another possible training set for the toy problem of predicting the sex of a person discussed earlier. It can be clearly seen that, in this configuration, it is not possible to draw a straight line that correctly separates both classes in the training set. In other words, the training set is not linearly separable, so linear classifiers like the perceptron or logistic regression discussed earlier will perform poorly. Nevertheless, it is obvious that the problem here is not in the training set itself, as it would
be perfectly possible to correctly separate both classes with a curve as shown in Figure 10. Therefore, it is the linear models themselves that are not expressive or powerful enough to learn to discriminate between the two classes in this case.

Figure 10: Example decision boundary for a problem that is not linearly separable

In this section, some common approaches to address these limitations are described. Section 3.4.1 first discusses feature engineering, the approach of manually designing additional features to improve the performance of machine learning algorithms, which we use in this project for the phrase translation probability scoring logistic regression model we propose. Support vector machines are then introduced in Section 3.4.2, which are linear models that are nevertheless able to learn non-linear decision boundaries in the original feature space thanks to the kernel trick. Finally, Section 3.4.3 discusses artificial neural networks and deep learning, which can be seen as a non-linear extension of the models seen so far and are typically used to train word embeddings, presented later in Section 4.1 and extensively used throughout the rest of the project.
3.4.1 Feature engineering

Feature engineering is the process of manually building the right set of features so that a given learning algorithm can model the problem as well as possible. For instance, for the example shown in Figure 10 one could create new features to make the problem linearly separable and then apply a linear classifier like logistic regression or the perceptron. This has traditionally been one of the main focuses in applied machine learning, to the extent that the amount of data available and the feature set used have often been considered the key to achieving good results, more than the learning algorithm itself.

A typical approach would be to use domain knowledge to design a set of features that
are believed to be good for modeling the problem. Although this might seem to go against the automated nature of machine learning, it should be noted that machine learning is most useful when we are not able to come up with an algorithm to explicitly solve a particular problem, which does not necessarily mean that we know nothing about how to solve it. For instance, it would be extremely difficult for us to design a step-by-step method that reliably tells whether a given email message is spam or not, but we could easily think of features that could be useful for this task (e.g. the presence of certain words, the email address of the sender, whether the email contains the name of the receiver...).

However, feature engineering is not limited to designing features that add new and relevant information based on domain knowledge. For instance, one could create a new feature by simply scaling an existing one (e.g. taking its logarithm or square, centering it around 0...). In spite of its simplicity, most learning algorithms greatly benefit from this feature normalization.

More interestingly, one could also create new features that combine existing ones (e.g. taking the product of two existing features). These interaction features are particularly relevant for linear models, as they only have a fixed weight for each feature that is independent from the rest. The underlying issue becomes obvious with the paradigmatic example of the XOR problem. The XOR logical operator takes two binary arguments and returns 0 if both are equal (i.e. 0, 0 or 1, 1) and 1 if both are different (i.e. 0, 1 or 1, 0). As proved by Minsky and Papert (1969), this frustratingly simple problem is not linearly separable, and linear models therefore fail to properly model it. This happens because the output of XOR is given by the relationship between the two input features and, as said before, linear models can only weight them independently. However, this can be easily solved by adding an interaction feature for the original input features, as illustrated in the sketch below.

Needless to say, these interaction features do not necessarily have to be manually designed, as one could for instance take every quadratic or cubic combination of all the input features. However, depending on the number of features this can easily lead to a combinatorial explosion, known as feature explosion, increasing the computational complexity of training the model and greatly increasing the risk of overfitting.
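To make the XOR discussion concrete, the following small sketch (with hand-picked weights, chosen purely for illustration) shows how adding the product of the two inputs as an interaction feature makes the four XOR instances separable by a simple linear threshold:

```python
import numpy as np

# The four XOR instances: the label is 1 exactly when the two inputs differ.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# No linear combination of x1 and x2 alone separates these points, but with
# the interaction feature x1*x2 the combination z = x1 + x2 - 2*x1*x2 takes
# the value 0 for the class-0 points and 1 for the class-1 points.
X_interaction = np.hstack([X, (X[:, 0] * X[:, 1]).reshape(-1, 1)])
w = np.array([1.0, 1.0, -2.0])  # hand-picked weights for illustration
print((X_interaction @ w > 0.5).astype(int))  # [0, 1, 1, 0], matching y
```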
3.4.2 Support vector machines

At the most basic level, support vector machines (SVM) are linear models known as large margin classifiers, since they choose the linear decision boundary that maximizes the margin between the instances of both classes in the training set. For that purpose, they use exactly the same output function as perceptrons:

$$h(x) = \begin{cases} 1 & z(x) > 0 \\ 0 & z(x) \leq 0 \end{cases}$$

and train the model to minimize the following error function, which is known as the hinge loss:

$$E = \sum_{i=1}^{m} \max\left(0,\ 1 - y^{(i)} z\left(x^{(i)}\right) + \left(1 - y^{(i)}\right) z\left(x^{(i)}\right)\right) + \lambda R(w)$$
The intuition behind the hinge loss is that it does not only penalize wrong predictions, but also pushes z(x) to large values for correct predictions, as $|z(x^{(i)})| \geq 1$ must hold for the error of the ith instance to be 0 even if its class is correctly predicted.

However, the real power of SVM comes with the so-called kernel trick, a computationally efficient way of creating implicit features based on the distance to the training instances. For that purpose, different kernel functions with different parameters are defined, as is the case with polynomial kernels or the Gaussian radial basis function kernel; a linear kernel corresponds to the basic model discussed above. This way, thanks to the kernel trick the data is implicitly projected to a higher-dimensional feature space and SVM behaves as a linear model there, making it possible to learn non-linear decision boundaries in the original feature space. For this reason, SVM is often said to be the most powerful black box classifier.
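The following is a minimal sketch of this loss for 0/1 labels, following the formula above; the scores and labels are hypothetical:

```python
import numpy as np

def hinge_loss(z, y):
    """Hinge loss for 0/1 labels as written above: a correct prediction only
    incurs zero loss once it clears the margin |z| >= 1."""
    return np.maximum(0.0, 1.0 - y * z + (1.0 - y) * z).sum()

z = np.array([2.0, 0.5, -1.5, 0.2])   # raw scores z(x) for four instances
y = np.array([1, 1, 0, 0])            # their true classes
print(hinge_loss(z, y))               # 0 + 0.5 + 0 + 1.2 = 1.7
```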
3.4.3 Artificial neural networks and deep learning

In spite of their simplicity and limitations, it is interesting to see how similar linear models are to the basic building blocks of our brain: neurons. At the most basic level, a neuron receives some electrical input from the dendritic tree and, when this input is high enough, it activates its output through the axon. Even though the actual biochemical process is far more complex, linear models can therefore be taken as idealized models of neurons. When doing so, the sigmoid function is often used as the activation function, just as seen in logistic regression. But, needless to say, our intelligence does not come from what neurons can do on their own but from what they can do when connected together. This can be taken as inspiration to connect these idealized models of neurons to form more complex networks, and this is precisely how artificial neural networks are defined.

More precisely, an artificial neural network (ANN) is formed by a set of units connected with some weights; each unit takes some input features, calculates a linear combination of them according to the weights, applies an activation function, and outputs this value. Depending on the schema that these connections follow, different ANN architectures are defined. For instance, feedforward neural networks, which are the simplest and also among the most widely used ones, are formed by an input layer, an output layer, and one or several hidden layers that are each connected to units in subsequent layers (i.e. without forming directed cycles). The graph in Figure 11 illustrates this basic schema for a network with a single hidden layer.

As can be seen, the input and the output layers are not new with respect to linear models. In fact, a network without any hidden unit would be formed by a single neuron, which is indeed a linear model. As for the hidden layers, a way of seeing them is that hidden units are linear models that each learn a new feature that will then be useful to make the final prediction. For instance, in our previous example where we try to predict the sex of a person given their age and the fundamental frequency of their voice, a hidden unit might learn to output whether the input corresponds to a child or not, which might then be useful to predict the sex of the person together with other features. However, in contrast with the idea of feature engineering developed in Section 3.4.1, these new features are learned
automatically by the network, and that is precisely why they are called hidden, as we do not really control what they are doing. Thanks to this flexibility, ANNs are able to learn non-linear decision boundaries. For instance, for the training set in Figure 10, for which linear models could not find an appropriate decision boundary, an ANN might be able to learn a decision boundary like the one in the chart.

Figure 11: A feedforward neural network with a single hidden layer

At the same time, by adding more than one hidden layer to the network, it would be able to learn more and more abstract (or, in other words, deeper) representations of the input data that could then be helpful to correctly predict the output, and this is how deep neural networks (DNN) are defined. DNNs belong to the deep learning branch of machine learning, a broader term that encompasses other approaches that make use of multiple processing layers to obtain high-level abstractions of the input data (LeCun et al., 2015).

No matter what architecture is used, though, an ANN is parametrized by its weights just as linear models are (there is also the bias term for each unit but, as seen before, those can be seen as additional weights for a constant one-valued input). Therefore, training will again consist in finding the right values of these weights for the given training set. For that purpose, an error function E is defined on the training set according to the desired output, and the gradient descent algorithm is then typically used to iteratively find the optimal set of weights that minimize this error function, just as seen for logistic regression in Section 3.3. The update value for each weight in each iteration of gradient descent will again be given by the derivative of the error function with respect to it multiplied by the learning rate $\epsilon$:

$$\Delta w_{ij} = -\epsilon \frac{\partial E}{\partial w_{ij}}$$
In order to calculate that value, we will first calculate the derivative of the error function with respect to the total input $z_j$ received by the unit, using the chain rule and assuming
a sigmoid activation function:
$$\frac{\partial E}{\partial z_j} = \frac{\partial y_j}{\partial z_j} \frac{\partial E}{\partial y_j} = y_j (1 - y_j) \frac{\partial E}{\partial y_j}$$
Having done that, the chain rule can again be applied to calculate the derivative of the error function with respect to the weight $w_{ij}$:
$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial z_j}{\partial w_{ij}} \frac{\partial E}{\partial z_j} = y_i \frac{\partial E}{\partial z_j}$$
This way, we obtain the derivative of the error function with respect to the weights in the last layer, which we can use to update them. However, this is not directly applicable to the weights in previous layers since, in the case of a hidden unit, its expected output is not known a priori (that is precisely why they are called hidden). Nevertheless, making use of the chain rule, it is possible to calculate the derivative of the error function with respect to the output of each of the units in the previous layer:
$$\frac{\partial E}{\partial y_i} = \sum_j \frac{\partial z_j}{\partial y_i} \frac{\partial E}{\partial z_j} = \sum_j w_{ij} \frac{\partial E}{\partial z_j}$$
Based on that, the same procedure seen before for the weights in the last layer can be applied to obtain the update value for the ones in the previous layer. All in all, starting with the derivative of the error function with respect to the output of the ANN, the error is propagated backwards to the units in the previous layers. For that reason, this method is known as the backpropagation algorithm.

From the graphical viewpoint, the behavior of this procedure is essentially the same as the one discussed for logistic regression in Section 3.3, with the difference that the error surface has a more complex shape and will generally have more than one local minimum. The training will start at a random point in the weight space, and will advance step by step in the direction of steepest descent until it reaches one of these minima.
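The following Python sketch implements backpropagation for a network with a single hidden layer of sigmoid units, assuming a cross-entropy error function (an assumption on our part, since the text above leaves E unspecified). It is applied to the XOR problem from Section 3.4.1, and the number of hidden units, learning rate and number of epochs are arbitrary illustrative choices that may need tuning:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_mlp(X, y, hidden_units=8, learning_rate=1.0, epochs=20000, seed=0):
    """One-hidden-layer feedforward network trained with backpropagation,
    following the derivatives above (sigmoid units, cross-entropy error)."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(X.shape[1], hidden_units))
    b1 = np.zeros(hidden_units)
    W2 = rng.normal(scale=0.5, size=hidden_units)
    b2 = 0.0
    for _ in range(epochs):
        # Forward pass
        hidden = sigmoid(X @ W1 + b1)          # hidden layer outputs y_i
        output = sigmoid(hidden @ W2 + b2)     # network output y_j
        # Backward pass: dE/dz for the output unit (cross-entropy + sigmoid)
        delta_out = output - y
        # Error propagated back to the hidden units: dE/dz_i
        delta_hidden = np.outer(delta_out, W2) * hidden * (1 - hidden)
        # Gradient descent updates (averaged over the training instances)
        W2 -= learning_rate * hidden.T @ delta_out / len(y)
        b2 -= learning_rate * delta_out.mean()
        W1 -= learning_rate * X.T @ delta_hidden / len(y)
        b1 -= learning_rate * delta_hidden.mean(axis=0)
    return W1, b1, W2, b2

# The XOR problem from Section 3.4.1: not linearly separable, but learnable here.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)
W1, b1, W2, b2 = train_mlp(X, y)
print(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2).round(2))
```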
3.5 Large scale online learning

In principle, every technique discussed throughout the chapter can be applied regardless of how big the training set is. In fact, having a large dataset is always desirable, as it reduces the sampling noise and is therefore helpful to improve the variance of a model without affecting its bias (or, in other words, it reduces the risk of overfitting without affecting the generalization capacity of the model). However, several practical issues arise when trying to apply the techniques discussed so far at a very large scale.

First, the variant of gradient descent presented here, sometimes referred to as batch gradient descent, needs to process the entire dataset to compute the update value for each weight, and this has to be repeated many times until it converges. When the training set is very big, this is likely to be too slow in practice. An
alternative approach, known as stochastic gradient descent, is to compute the update from a single training instance, or from a small fixed number of them, each time. The latter are known as mini-batches, so this variant is also called mini-batch gradient descent. Updates in stochastic gradient descent are more unstable, so the learning rate is usually decreased over time, but, since updates are also performed much more often, the training tends to converge much faster. Moreover, thanks to this variability the algorithm can potentially escape poor local optima at the beginning.
Another important aspect of stochastic gradient descent is that it is an online learning algorithm. This is relevant because, in many real large scale machine learning settings, such as the spam filtering discussed before, new training data becomes available constantly, and it is useful to have a model that keeps adapting to it. What is more, training instances can even be thrown away after they are processed, saving storage space. This becomes relevant in the so-called big data era where, in many real situations, the bottleneck is not the availability of the data but the capacity to process it.
Another issue related to large scale machine learning is that of feature sparsity and feature vectorization. In many scenarios, in particular those where categorical features are converted into numerical features as discussed in Section 3.1, the feature set tends to be very large, and also very sparse (i.e. most of the features are 0 for a given instance). This is very common in natural language processing, where each instance (e.g. a sentence or a phrase) might be characterized by the words or bigrams it contains, which will only be a few out of the entire vocabulary. This poses a challenge first for vectorizing the features (i.e. organizing them into an ordered vector associated with the weight vector) and also for processing them efficiently, as it would be a waste of resources to traverse all the possible features when only a few of them take a non-zero value. In fact, in online learning scenarios the feature set might even be unbounded, as previously unseen words or bigrams are likely to occur in new training instances, for example. In order to overcome these issues, the so-called hashing trick is often used, which converts any feature identifier, such as a word or a bigram, into an integer by means of a hashing function, which can then be used to access the weight that corresponds to it. When doing so, it is important to consider the range of the hashing function, as too large a value would unnecessarily increase the space requirements and too small a value could cause collisions among the features, harming performance.
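The following sketch combines the hashing trick with per-instance stochastic gradient descent updates for logistic regression; the hash function, table size and the tiny labelled stream are arbitrary choices for illustration, not the way any particular toolkit implements it:

```python
import numpy as np
import zlib

NUM_WEIGHTS = 2 ** 18          # range of the hashing function (weight table size)
weights = np.zeros(NUM_WEIGHTS)

def hashed_features(feature_names):
    """Map arbitrary feature identifiers (words, bigrams, ...) to weight
    indices with a hash function, so the feature set can be unbounded."""
    return [zlib.crc32(name.encode("utf-8")) % NUM_WEIGHTS for name in feature_names]

def predict(indices):
    # Sparse dot product: only the features present in the instance are touched.
    return 1.0 / (1.0 + np.exp(-weights[indices].sum()))

def sgd_update(indices, label, learning_rate=0.1):
    """One stochastic gradient descent step of logistic regression on a single
    instance whose binary features are given by their hashed indices."""
    error = label - predict(indices)
    weights[indices] += learning_rate * error

# Online training over a hypothetical stream of labelled token lists.
stream = [(["cheap", "viagra", "now"], 1), (["meeting", "agenda", "monday"], 0)]
for tokens, label in stream:
    sgd_update(hashed_features(tokens), label)
print(predict(hashed_features(["cheap", "meeting"])))
```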
Finally, parallelization is usually another desirable feature in large scale machine learning for scalability over big clusters. However, unlike batch gradient descent, in which training instances can easily be processed in parallel following, for instance, the standard MapReduce model, stochastic gradient descent is not easy to parallelize, as the weights have to be updated after each training instance is processed. An alternative is to run stochastic gradient descent on different nodes independently, distributing the input data among them, and then merge these updates periodically, using a tree-like structure over the nodes for efficient communication.
3.6 Vowpal Wabbit

Vowpal Wabbit is an efficient online learning system. Its development is led by John Langford, currently under Microsoft Research and previously under Yahoo! Research. It is written in C++ and licensed under the 3-clause BSD license, and it can be used through its command line interface or as a C++ library with bindings for Python.

Vowpal Wabbit's main strength is its efficiency along with its flexibility. It is primarily used to work with linear models and, by specifying different loss functions, linear regressors, logistic regression classifiers and SVMs with a linear kernel can be built. In addition to that, Vowpal Wabbit also supports feedforward neural networks with one hidden layer, online kernel SVM (an adaptation of the kernel trick for online learning), Latent Dirichlet Allocation, and more. It uses stochastic gradient descent with a decaying learning rate as its default optimization algorithm, but it also supports other methods not covered here like BFGS and conjugate gradient.

Vowpal Wabbit also stands out for its scalability. It makes use of the hashing trick discussed in Section 3.5 for feature vectorization, and it offers an out-of-core implementation (i.e. it does not need to load all the data into memory at once in order to process it). Thanks to this, Vowpal Wabbit's memory usage is completely independent of the dataset used and its size, and depends primarily on the range selected for the hashing function. Finally, Vowpal Wabbit also offers parallel training following the schema discussed in Section 3.5, with support for Apache Hadoop.
3.7 Conclusions

In this chapter, we have presented the foundations of supervised machine learning with a focus on linear models. These linear models are often used for large scale machine learning, where the massive number of features and training instances typically makes up for their modeling limitations and scalability becomes the key factor. Given the large size of the bilingual data used to train SMT systems, we also approach the dynamic phrase translation probability model we propose as a large scale machine learning problem, and adopt logistic regression with online learning, stochastic gradient descent, parallel training and the hashing trick. For that purpose, we use Vowpal Wabbit, an efficient, feature-rich online learning system that is widely used for large scale machine learning. However, we have seen that linear models have important limitations, and discussed different approaches to overcome them. Among them, we use feature engineering for the dynamic phrase translation probability model we propose, whereas artificial neural networks are used to train word embeddings, presented next in Section 4.1 and extensively used throughout the rest of the project.
4 Distributional semantics
The distributional hypothesis states that the meaning of each word in a language is defined by its usage and can therefore be characterized by the words it is generally surrounded by (Harris, 1954; Firth, 1957). Based on this, distributional semantics attempts to build abstract representations of words based on their usage in a large corpus. These representations have a direct application in several computational semantics tasks such as semantic textual similarity (STS) (Agirre et al., 2015), but they have also been applied to improve the performance of many other natural language processing tools.

One approach to distributional semantics that has attracted a lot of attention in the last decade, discussed in Section 4.1, is to create dense, low-dimensional vector representations of words, which are known as word embeddings. Word clustering, presented in Section 4.2, is another approach that groups words into discrete classes according to their meaning. There have also been attempts to apply distributional semantics in cross-lingual settings. Section 4.3 discusses bilingual word embeddings, which attempt to bring these continuous representations of words in two or more languages into a shared vector space. Finally, Section 4.4 presents different approaches proposed in the literature to apply distributional semantics to improve machine translation performance, and Section 4.5 concludes the chapter.
4.1 Monolingual word embeddings

Word embeddings are dense, low-dimensional representations of words in a continuous vector space, distributed in some meaningful way from the semantic point of view. In particular, the distance (e.g. the Euclidean distance or the cosine distance) between the embeddings of two given words should typically correlate with their semantic similarity. This is regarded as an effective way to address the sparsity problem in natural language processing, as most systems traditionally treat words as atomic units and therefore have trouble properly modeling phenomena that are not explicitly present in the training data.

Many techniques have been proposed to create such word embeddings, which have been broadly classified into two main families: count models and predict models (Baroni et al., 2014). Count models compute statistics of how frequently each word co-occurs with its neighbors over a large corpus, and then reduce these counts to a dense, low-dimensional vector. Predict models follow an alternative approach and use supervised learning techniques to predict a word given its context or the other way around, and take the underlying representation that the model creates for each word.

Early attempts to build vector space word models belong primarily to the first group and date back to the late 1980s. One of the most influential ones was Latent Semantic Analysis (LSA), also referred to as Latent Semantic Indexing (LSI) in the field of information retrieval (Dumais et al., 1988; Deerwester et al., 1990). LSA starts with a sparse matrix, known as the occurrence matrix, where rows correspond to words, columns correspond to documents and each cell contains some weight reflecting the frequency with which the
word in question occurs in that document (e.g. simple counts or its tf-idf value). Once the occurrence matrix is built, dimensionality reduction techniques, typically singular value decomposition (SVD), are used to find a low-rank approximation of the occurrence matrix so that the number of columns is reduced to some predefined value (e.g. 100 or 300) while minimizing the reconstruction error. The rows in the resulting matrix can thus be taken as low-dimensional vector representations of their corresponding words. This basic technique later evolved, giving rise to more sophisticated models like Probabilistic Latent Semantic Analysis (PLSA) (Hofmann, 1999) and Latent Dirichlet Allocation (LDA) (Blei et al., 2003).

As for the family of predict models, most techniques in this group are based on ANNs, previously discussed in Section 3.4.3. The first proposals to use ANNs to build continuous vector representations for words coincide with the development of LSA in the late 1980s (Hinton, 1986; Rumelhart et al., 1986), but did not attract as much attention. With the revived interest in ANNs in the last decade and their application in NLP, new and influential models were developed and successfully applied in a variety of tasks (Bengio et al., 2003; Collobert and Weston, 2008; Turian et al., 2010; Huang et al., 2012). However, it is the work by Mikolov et al. (2013a,c) and the public release of the word2vec software implementing their model that is considered a breakthrough in this regard. Unlike previous models, its efficiency made it possible to learn high quality word embeddings from a billion word corpus in a matter of hours, achieving very good results.

This work does not actually develop a single architecture but two: continuous bag-of-words (CBOW) and skip-gram, which are illustrated in Figure 12. Both models are similar in that they keep two vector representations for each word, one for when it is the central word and another one for when it is a context word, which can be seen as the weights that connect the input layer to the hidden layer and the hidden layer to the output layer in a shallow feedforward neural network (i.e. each word vector would correspond to the weights that connect its corresponding input or output unit to the hidden layer). Based on this, CBOW tries to predict words given their neighbors in a fixed window, that is, the average of the context word vectors is used to predict the central word. Skip-gram follows just the opposite approach and tries to predict surrounding words in a fixed window given the central word, that is, the central word vector is used to predict each context word independently. Once the training is done, the context word vectors are simply discarded and the central word vectors are taken as word embeddings. The authors argue that CBOW offers faster training times, while skip-gram learns better representations for infrequent words.

As for the training itself, computing the cost of each instance is very expensive in the original formulation, as calculating the probability of a word given its context or vice versa takes linear time with respect to the vocabulary size. In order to address this issue, the initial approach in Mikolov et al. (2013a) used hierarchical softmax (Morin and Bengio, 2005), representing the vocabulary as a Huffman tree and thus reducing the complexity of computing the cost of each instance to logarithmic time with respect to the vocabulary size. The subsequent work in Mikolov et al.
(2013c) proposed a simpler alternative called negative sampling, which is itself a simplification of the noise contrastive estimation
technique proposed in Gutmann and Hyvärinen (2012). Negative sampling simply tries to maximize the probability assigned to the correct word to predict while minimizing the probability assigned to a number of randomly selected words each time. These random words are known as negative samples, which gives the name to the method. In any case, both hierarchical softmax and negative sampling make use of stochastic gradient descent and the backpropagation algorithm discussed in Section 3.4.3 for training. In addition to that, Mikolov et al. (2013c) also introduced subsampling of frequent words, reducing the weight of words that occur very frequently in the training corpus (e.g. "the", "of", "in"), which are presumed to provide less information than rare words, by discarding words according to some probability that depends on their frequency and a parameter of the model.

Figure 12: The CBOW and skip-gram models (Mikolov et al., 2013a)

The cosine similarity between embeddings is typically used to get a semantic similarity measure of the words they represent, although the Euclidean distance has also been used. The cosine similarity corresponds to the cosine of the angle between the given vectors v and w, and is typically computed by taking their normalized dot product:

$$\cos(v, w) = \frac{v \cdot w}{\|v\| \|w\|}$$

where $\|\cdot\|$ denotes the Euclidean norm. However, this is not the only interesting property that the word embeddings learned this way have, as performing
simple arithmetic operations over them also gives surprisingly meaningful results. In particular, these embeddings have been shown to be very effective for making analogies, both at the syntactic and the semantic level. For example, vec("biggest") − vec("big") + vec("small") ≈ vec("smallest") (i.e. the resulting vector is very close to the word embedding of "smallest", typically
closer than to any other as measured by cosine similarity), and vec("France") − vec("Paris") + vec("Germany") ≈ vec("Berlin") (Mikolov et al., 2013a). Moreover, simple addition of vectors has also been shown to work reasonably well. For instance, vec("Czech") + vec("currency") ≈ vec("koruna"), and vec("Vietnam") + vec("capital") ≈ vec("Hanoi") (Mikolov et al., 2013c). Even though these results might look quite surprising at first, some authors have given insight into why this word vector arithmetic works, suggesting further improvements for it (Levy and Goldberg, 2014a).

Finally, even though count models and predict models have traditionally been seen as two opposite approaches, it has recently been shown, both theoretically and empirically, that the optimization objective of these neural predict models can be roughly expressed in terms of a traditional count model, so both approaches would mainly differ in the computational method used to build the same underlying model. This way, Levy and Goldberg (2014b) show that the skip-gram model trained with negative sampling is equivalent to performing an implicit factorization of an occurrence matrix whose cells are the positive pointwise mutual information (PMI) of their corresponding word and context pairs, shifted by a global constant.
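As an illustration of these properties, the following sketch computes cosine similarity and answers an analogy query by nearest-neighbor search; the tiny hand-made embedding table only serves to make the code runnable, whereas real experiments would use vectors trained as described above:

```python
import numpy as np

def cosine(v, w):
    """Cosine similarity: the normalized dot product defined above."""
    return v @ w / (np.linalg.norm(v) * np.linalg.norm(w))

def analogy(a, b, c, embeddings, exclude=()):
    """Return the word whose embedding is closest (by cosine) to
    vec(a) - vec(b) + vec(c), excluding the query words themselves."""
    target = embeddings[a] - embeddings[b] + embeddings[c]
    candidates = (w for w in embeddings if w not in exclude)
    return max(candidates, key=lambda w: cosine(embeddings[w], target))

# Tiny hand-made embedding table, only for illustration.
embeddings = {
    "big":      np.array([1.0, 0.0, 0.1]),
    "biggest":  np.array([1.0, 0.9, 0.1]),
    "small":    np.array([-1.0, 0.0, 0.2]),
    "smallest": np.array([-1.0, 0.9, 0.2]),
}
print(analogy("biggest", "big", "small", embeddings,
              exclude={"biggest", "big", "small"}))  # expected: "smallest"
```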
4.2 Word clustering
Word clustering consists in grouping words together in such a way that words that belong to the same group or class (known as cluster) are similar to each other in some syntactic and/or semantic sense. For instance, a good clustering would group words like “Germany”, “Italy”, “Russia” and “Ireland” into one class and words like “pig”, “dog”, “cow” and “horse” into another different class. These word clustering methods are mostly based on the distributional hypothesis, as they try to group words that are likely to occur in similar contexts. This section discusses three of the most relevant word clustering methods, which have also been used in this project. Section 4.2.1 describes a classical clustering approach known as Brown clustering or IBM clustering. Section 4.2.2 presents the clustering method proposed in Clark (2003), which also incorporates morphological information. Finally, Section 4.2.3 discusses a simple approach for obtaining clusters from word embeddings.
4.2.1 Brown clustering
Brown clustering (Brown et al., 1992), also known as IBM clustering, is a hard agglomerative method for word clustering. It was originally proposed as part of a class-based n-gram model, where the probability of words is based on their respective clusters to address the sparsity problem of the classical n-gram models discussed in Section 2.3, although it can also be used for other tasks apart from language modeling. This way, the clustering is represented by a function π that assigns a class $c \in C$ to any given word $w \in V$, with $|C| \ll |V|$,
and a bigram language model is defined according to it:
$$P(w_1, \ldots, w_m; \pi) = \prod_{i=1}^{m} P(w_i \mid \pi(w_i)) P(\pi(w_i) \mid \pi(w_{i-1}))$$

where relative frequency estimates are used for the probabilities:
$$P(w_i \mid c_i) = \frac{\mathrm{count}(w_i)}{\mathrm{count}(c_i)}$$
$$P(c_i \mid c_{i-1}) = \frac{\mathrm{count}(c_{i-1}, c_i)}{\mathrm{count}(c_{i-1})}$$

Based on this, the quality of a clustering π is defined in terms of the probability it assigns to the training corpus. This meets the intuition of maximum likelihood estimation, as the best clustering according to this definition would also be the one that maximum likelihood estimation would choose. More concretely, the logarithm of the above probability is used, normalized by the length of the text (Liang, 2005):

$$\begin{aligned}
\mathrm{Quality}(\pi) &= \frac{1}{m} \log P(w_1, \ldots, w_m; \pi) \\
&= \frac{1}{m} \sum_{i=1}^{m} \log \left[ P(w_i \mid \pi(w_i)) P(\pi(w_i) \mid \pi(w_{i-1})) \right] \\
&= \sum_{w,w'} \frac{\mathrm{count}(w, w')}{m} \log \left( P(w' \mid \pi(w')) P(\pi(w') \mid \pi(w)) \right) \\
&= \sum_{w,w'} \frac{\mathrm{count}(w, w')}{m} \log \frac{\mathrm{count}(w')}{\mathrm{count}(\pi(w'))} \frac{\mathrm{count}(\pi(w), \pi(w'))}{\mathrm{count}(\pi(w))} \\
&= \sum_{w,w'} \frac{\mathrm{count}(w, w')}{m} \log \frac{\mathrm{count}(w')}{m} + \sum_{w,w'} \frac{\mathrm{count}(w, w')}{m} \log \frac{m \, \mathrm{count}(\pi(w), \pi(w'))}{\mathrm{count}(\pi(w)) \, \mathrm{count}(\pi(w'))} \\
&= \sum_{w'} \frac{\mathrm{count}(w')}{m} \log \frac{\mathrm{count}(w')}{m} + \sum_{c,c'} \frac{\mathrm{count}(c, c')}{m} \log \frac{m \, \mathrm{count}(c, c')}{\mathrm{count}(c) \, \mathrm{count}(c')}
\end{aligned}$$
Since, using frequency estimates as stated before, $P(w) = \frac{\mathrm{count}(w)}{m}$, $P(c) = \frac{\mathrm{count}(c)}{m}$ and $P(c, c') = \frac{\mathrm{count}(c, c')}{m}$, this can be rewritten as follows:

$$\mathrm{Quality}(\pi) = \sum_{w} P(w) \log P(w) + \sum_{c,c'} P(c, c') \log \frac{P(c, c')}{P(c) P(c')} = -H + I(\pi)$$

where $I(\pi)$ is the mutual information between adjacent clusters and H is the entropy of the word distribution. This way, given that the entropy of the word distribution is fixed for the training corpus, the optimal clustering will be the one maximizing the mutual information between adjacent
Not balanced: minor consequence onfiltering by prefix • Figure 13: An example dendrogram produced by Brown clustering (Suster,ˇ 2013) clusters. Since it is computationally unfeasible to perform an exhaustive search, Brown et al. (1992) propose a greedy heuristic that starts with V clusters (one per word) and iteratively merges them until obtaining the desired amount| | of clusters. More concretely, each iteration considers every possible cluster pair and merges the one for which the loss in the mutual information is the least. A naive implementation of this has a complexity of O( V 5), which Brown et al. (1992) are able to reduce to O( V 3) using some optimizations. |This| greedy algorithm is a form of hierarchical clustering,| so| its output can also be seen as a binary tree where nodes correspond to the merges, commonly known as dendrogram. Thanks to this, the desired number of clusters can be obtained by cutting the tree at the appropriate level, without the need to rerun the algorithm. Moreover, each word is uniquely identified by its path in the tree, which can be expressed using a straightforward binary notation, and prefixes of these paths can also be used as features in different tasks. An example illustration of all this is given in Figure 13.
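To make this concrete, the following is a minimal sketch of the quality function and the greedy merging procedure described above, written in Python under toy assumptions (a whitespace-tokenized corpus, token counts for the unigram terms and simplified boundary handling). It uses the naive formulation rather than the optimizations of Brown et al. (1992), so it is only meant to illustrate the idea.

import math
from collections import Counter
from itertools import combinations

def quality(tokens, assignment):
    """Approximate Quality(pi) = -H + I(pi): word entropy term plus mutual
    information between adjacent clusters, estimated from raw counts."""
    m = len(tokens)
    wc = Counter(tokens)
    cc = Counter(assignment[w] for w in tokens)
    bc = Counter((assignment[a], assignment[b]) for a, b in zip(tokens, tokens[1:]))
    n_bi = len(tokens) - 1
    q = sum(n / m * math.log(n / m) for n in wc.values())                 # -H term
    q += sum(n / n_bi * math.log((n / n_bi) / ((cc[c1] / m) * (cc[c2] / m)))
             for (c1, c2), n in bc.items())                               # I(pi) term
    return q

def brown_clusters(tokens, k):
    """Naive greedy agglomerative clustering: start with one cluster per word
    and repeatedly merge the cluster pair whose merge loses the least quality."""
    assignment = {w: w for w in set(tokens)}  # initial clusters: singletons
    while len(set(assignment.values())) > k:
        best, best_q = None, -float("inf")
        for c1, c2 in combinations(set(assignment.values()), 2):
            trial = {w: (c1 if c == c2 else c) for w, c in assignment.items()}
            q = quality(tokens, trial)
            if q > best_q:
                best, best_q = trial, q
        assignment = best
    return assignment

# toy usage
corpus = "the dog runs the cat runs the dog sleeps".split()
print(brown_clusters(corpus, 3))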
4.2.2 Clark clustering
Clark clustering (Clark, 2003) incorporates morphological evidence in addition to the distributional evidence into the clustering process. This is aimed at improving the clustering of words with very few occurrences in the training corpus, in particular with morphologically rich languages in mind, and it was originally motivated by part-of-speech induction, although it can also be applied to other tasks. The basic clustering model is very similar to that of Brown clustering in that it is also formalized as a class-based bigram language model and the algorithm tries to find the class assignment that maximizes the probability of the training corpus. However, instead of the greedy optimization method of Brown et al. (1992), Clark (2003) uses the exchange algorithm proposed by Martin et al. (1998), which starts with an initial clustering with the desired number of classes and iteratively improves it by moving each word from its current
cluster to the one that gives the maximum increase in likelihood, if any. Based on this basic model, Clark clustering introduces an additional prior probability for the partition π as follows:
\[
P(\pi) = \prod_{c \in C} \prod_{\pi(w) = c} \alpha_c P_c(w)
\]
In the above formulation, $P_c$ is a character-based model for each cluster $c$, which Clark clustering models through an HMM, and $P_c(w)$ thus denotes the probability of the word $w$ belonging to the class $c$ according to its HMM model. The purpose of this is to introduce a bias that pushes morphologically similar words into the same cluster. For instance, the HMM model of a given class might assign high probabilities to words ending in "ly" or "ing", pushing those words that match these patterns, which are likely to be similar in morphological terms (in this case, adverbs derived from adjectives and present participles, respectively), into that particular cluster. In addition to that, $\alpha_c$ denotes the prior probability of each cluster $c$, which is estimated by dividing the number of different words in that cluster by the size of the vocabulary. This pushes rare words to clusters with a large number of different words, matching the intuition that a word with very few occurrences is more likely to belong to an open class like proper names than to a closed class like prepositions.
4.2.3 Word embedding clustering
Given that, as discussed in Section 4.1, distances between word embeddings correlate with semantic similarity, any clustering algorithm can be applied over them to obtain meaningful word clusters. A straightforward approach, integrated in the word2vec package itself, is to use the well-known k-means algorithm for that purpose. k-means (MacQueen et al., 1967) tries to find the partitioning π that minimizes the sum of the squared distances from each word to the centroid of its cluster, which is given by
\[
\sum_{c \in C} \sum_{\pi(w) = c} \| w - \mu_c \|^2
\]
where $\mu_c$ is the centroid of cluster $c$ (i.e. the mean of the word vectors in that cluster). Finding the optimal partitioning for a given number of clusters is an NP-hard problem, so heuristic methods are commonly used instead. A standard approach is to start from some random initial centroids and then iteratively assign each word to the cluster whose centroid is closest to it, recomputing the centroids of the resulting clusters at each iteration.
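As an illustration, the following sketch implements this standard heuristic (Lloyd's algorithm) in plain numpy over a matrix of word embeddings; the embedding matrix and the number of clusters in the usage example are placeholders.

import numpy as np

def kmeans_clusters(embeddings, k, iters=50, seed=0):
    """Lloyd's algorithm over word embeddings: assign each word to the closest
    centroid, then recompute centroids, starting from k random word vectors."""
    rng = np.random.default_rng(seed)
    centroids = embeddings[rng.choice(len(embeddings), size=k, replace=False)]
    for _ in range(iters):
        # squared Euclidean distance of every word vector to every centroid
        dists = ((embeddings[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        for c in range(k):
            if (assign == c).any():
                centroids[c] = embeddings[assign == c].mean(axis=0)
    return assign

# toy usage: 1,000 hypothetical 100-dimensional word embeddings into 50 clusters
vectors = np.random.randn(1000, 100).astype(np.float32)
print(kmeans_clusters(vectors, k=50)[:10])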
4.3 Bilingual word embeddings Bilingual word embeddings are dense, low-dimensional representations of words in two languages in a common vector space. This way, if distances between monolingual word embeddings correlate with their semantic similarity, bilingual word embeddings attempt
Figure 14: PCA visualization of two linearly related word spaces trained independently (Mikolov et al., 2013b)
to bring this into bilingual settings, so that distances between embeddings of words in different languages also correlate with their semantic similarity. There have been several proposals for this in recent years, which can be broadly classified into three groups: bilingual mapping, which learns the embeddings of both languages independently and then finds a mapping between them; monolingual adaptation, which learns the embeddings of one language independently and then constrains the training of the other language; and bilingual training, which jointly learns the embeddings of both languages in a shared space (Luong et al., 2015). The following subsections analyze each approach one by one.
4.3.1 Bilingual mapping
Mikolov et al. (2013b) observe that independently trained word embeddings have a similar distribution for equivalent terms in two languages. An example of this can be seen in Figure 14, where PCA is used to visualize the same set of concepts in two languages, showing that they both have a similar geometrical arrangement. Based on this, they hypothesize that there exists a linear mapping between both spaces, and try to find this optimal mapping from a small bilingual dictionary. More concretely, given a set of word pairs and their corresponding embeddings $\{x_i, z_i\}$, they try to find the transformation matrix $W$ so that $W x_i$ approximates $z_i$, minimizing the sum of squared distances between them:
\[
\arg\min_W \sum_i \| W x_i - z_i \|^2
\]
For that purpose, they use the gradient descent algorithm presented in Section 3.3. However, Xing et al. (2015) argue that, while this optimization objective minimizes the Euclidean distance, it is the cosine similarity that is commonly used in practice, as Mikolov et al. (2013b) themselves do in their evaluation. Based on this, they propose an alternative approach that first constrains the training of the monolingual word embeddings to normalize them (i.e. they force the Euclidean norm of the word vectors to be 1) and then
tries to find the transformation matrix $W$ that maximizes the sum of the inner products between the embedding pairs in the dictionary. When the word vectors are normalized, this effectively maximizes the average cosine similarity between these embedding pairs. However, even if the training forces the source vector $x_i$ and the target vector $z_i$ to be unit vectors, it would in principle be possible that $W x_i$ is not. In order to overcome this issue, they constrain $W$ to be an orthogonal matrix, which preserves the inner product and therefore the Euclidean norm, giving rise to the following optimization objective:
\[
\arg\max_W \sum_i (W x_i)^T z_i \quad \text{s.t.} \quad W^T W = I
\]
This is a constrained optimization problem and, rather than computing the exact solution, Xing et al. (2015) look for a simple approximation using a modified version of gradient descent. More concretely, after computing the gradient and updating the weights at each iteration, they set $W$ to the orthogonal matrix that is closest to it, which can be obtained by taking the Singular Value Decomposition (SVD) of it and replacing the singular values with ones. Finally, while the above methods learn a single transformation from one language to the other, Faruqui and Dyer (2014) use one linear mapping for each language to project them to a common space. More concretely, they choose the projection that maximizes the correlation between the embedding pairs in the bilingual dictionary using canonical correlation analysis (CCA).
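Both objectives above can be sketched in a few lines of numpy, assuming two matrices X and Z whose rows are the embeddings of the dictionary word pairs: the unconstrained mapping of Mikolov et al. (2013b) is computed here in closed form by least squares (instead of the gradient descent they use), and the projection onto the closest orthogonal matrix through SVD corresponds to the update step that Xing et al. (2015) apply after each gradient iteration (applying it once, as done below, is only an illustration of that step, not their full constrained optimization).

import numpy as np

def linear_mapping(X, Z):
    """Least-squares W minimizing sum ||W x_i - z_i||^2, with x_i, z_i as rows
    of X and Z; closed-form alternative to gradient descent."""
    sol, *_ = np.linalg.lstsq(X, Z, rcond=None)  # solves X W^T ~ Z in the row convention
    return sol.T

def closest_orthogonal(W):
    """Nearest orthogonal matrix to W: take its SVD and replace the singular
    values with ones."""
    U, _, Vt = np.linalg.svd(W)
    return U @ Vt

# toy usage with a hypothetical 5,000-entry seed dictionary of 300-dimensional embeddings
X = np.random.randn(5000, 300)
Z = np.random.randn(5000, 300)
W = linear_mapping(X, Z)          # maps a source embedding x into the target space as W x
W_orth = closest_orthogonal(W)    # orthogonality preserves inner products and norms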
4.3.2 Monolingual adaptation
While the methods in the previous section find a mapping between two independently trained word vector spaces, an alternative approach is to train the embeddings of the language with the most resources on its own and then constrain the training of the other language to keep the bilingual correspondence. The rationale behind this is that not only would the word embeddings learned this way be in the same vector space, but the language with the most resources would also guide the training of the less-resourced one, potentially obtaining better word embeddings for it. Zou et al. (2013) propose one such approach based on a word-aligned bilingual corpus (see Section 2.2). They first train word embeddings for English over a large monolingual corpus, and initialize the Chinese ones by combining them according to the translation probabilities from the word alignment. They then train the Chinese word embeddings on a larger monolingual corpus starting from this initialization and incorporating an additional term in the optimization objective that accounts for the translation equivalences given by the word alignment.
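A minimal sketch of the initialization step just described, assuming a hypothetical English embedding matrix en_emb with a word-to-row index en_index, and word-alignment translation probabilities trans_prob stored as a dictionary from each Chinese word to its aligned English words and probabilities (all names are placeholders):

import numpy as np

def init_target_embeddings(en_emb, en_index, trans_prob, zh_vocab, seed=0):
    """Initialize each Chinese word vector as the probability-weighted average of
    the English embeddings of its aligned translations; words without alignment
    information keep a small random initialization."""
    rng = np.random.default_rng(seed)
    dim = en_emb.shape[1]
    zh_emb = rng.normal(scale=0.01, size=(len(zh_vocab), dim))
    for i, zh in enumerate(zh_vocab):
        pairs = [(p, en_emb[en_index[en]]) for en, p in trans_prob.get(zh, {}).items()
                 if en in en_index]
        if pairs:
            zh_emb[i] = sum(p * v for p, v in pairs) / sum(p for p, _ in pairs)
    return zh_emb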
4.3.3 Bilingual training As seen in the previous sections, there are several methods that learn word embeddings in both or one of the languages independently and then bring them to a common space
either by using a linear transformation or by constraining the training of the other language. However, most approaches to build bilingual word embeddings do not do it in separate steps and instead try to jointly learn the embeddings of both languages, incorporating the appropriate constraints to ensure the bilingual correspondence.
Just as Zou et al. (2013) do for monolingual adaptation, many methods in this group make use of word-aligned corpora. For instance, Klementiev et al. (2012) use the neural language model proposed in Bengio et al. (2003) to learn word embeddings in each language, incorporating an additional regularization term through multitask learning (Cavallanti et al., 2010) to push word pairs with a high translation probability according to the word alignment to be close to each other. Another interesting approach is the BiSkip model proposed in Luong et al. (2015), which extends the skip-gram model discussed in Section 4.1 to learn bilingual word embeddings. For that purpose, in addition to training the model to predict the context of each word in its language, they also predict the context of the word that is aligned with it in the other language. An alternative approach, proposed in Kočiský et al. (2014), is to jointly learn the word embeddings in both languages and the alignment itself. For that purpose, they use an adaptation of the FastAlign alignment model (Dyer et al., 2013), which is itself based on IBM Model 2 (see Section 2.2), where word translation probabilities are directly computed over their embeddings. Just as with the IBM models, training is done through expectation maximization (EM), fixing the alignments to get better translation probabilities (and, consequently, word embeddings) and fixing the translation probabilities to get better word alignments at each iteration.
Some other methods are also based on parallel corpora but do not use any word alignment at all. Lauly et al. (2014) propose an autoencoder-based approach that is trained to reconstruct the bag-of-words representation of a given sentence and its corresponding translation in the parallel corpus. Overcoming one of the main limitations of the approaches discussed so far, the BilBOWA model proposed in Gouws et al. (2014) is able to train on large monolingual data and only needs a smaller parallel corpus. It is based on the skip-gram model trained with negative sampling, to which it adds a sampled bag-of-words bilingual objective as an additional regularization term.
Finally, there are some other methods that do not need any parallel corpora and use a small bilingual dictionary instead. The BARISTA model proposed in Gouws and Søgaard (2015) mixes the source and target monolingual corpora, randomly replaces words in both languages according to the bilingual dictionary, and trains a CBOW model on the result. In addition to dictionaries with actual translation equivalences like "house" - "casa" (house), they also experiment with categorical equivalences like "car" - "casa" (both of which are nouns), learning bilingual word embeddings that are suitable for specific tasks like part-of-speech tagging or supersense tagging depending on the nature of these equivalences. Similarly, Wick et al.
(2015) also use the CBOW model over the concatenation of both monolingual corpora, adding a constraint term that encourages the embeddings of word pairs in the bilingual dictionary to be close to each other, and randomly replacing words in both languages according to the bilingual dictionary through a method they call artificial code-switching (ACS). This way, they argue that the CBOW model moves words in each language closer to
their context, the bilingual constraint moves equivalent words in both languages closer to each other and ACS moves words in each language closer to the context of their equivalent word in the other language.
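A minimal sketch of the corpus mixing idea behind BARISTA and ACS, assuming a hypothetical seed dictionary and tokenized sentences; the resulting mixed corpus would then be fed to a standard CBOW trainer:

import random

def code_switch(sentences, dictionary, p=0.5, seed=0):
    """Randomly replace words that have an entry in the bilingual dictionary
    with their translation, with probability p."""
    rng = random.Random(seed)
    out = []
    for sent in sentences:
        out.append([dictionary[w] if w in dictionary and rng.random() < p else w
                    for w in sent])
    return out

# toy usage with a hypothetical English-Spanish seed dictionary
en_es = {"house": "casa", "dog": "perro"}
es_en = {v: k for k, v in en_es.items()}
mixed = (code_switch([["the", "house", "is", "big"]], en_es)
         + code_switch([["el", "perro", "ladra"]], es_en))
# a CBOW model (e.g. word2vec) would then be trained on the mixed corpus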
4.4 Distributional semantics for machine translation
Some of the distributional semantic techniques discussed so far have a direct application in machine translation. For instance, both Brown clustering and Clark clustering are formulated as class-based n-gram models, so they can be naturally used for language modeling in SMT (see Section 2.3). Similarly, some predictive word embedding models are based on neural language models, so they can also be used for language modeling.
However, distributional semantics has also been applied in more subtle ways to overcome some of the limitations of other SMT components. One approach that has attracted a lot of attention in recent years, discussed in Section 4.4.1, is to add a new weight into the phrase table that measures the semantic similarity of each phrase pair. Going one step further, distributional semantics has also been used to induce new translations of words or phrases that were not explicitly present in the parallel training data, which we discuss in Section 4.4.2. Other less-explored uses of continuous word and phrase representations in SMT include the reordering model (Li et al., 2013) and non-terminals in hierarchical systems (Wang et al., 2015).
4.4.1 Phrase translation similarity scoring
As discussed in Section 2.3.1, the phrase table of a standard log-linear phrase-based SMT system typically includes four feature functions: the forward translation probability, the inverse translation probability, the forward lexical weighting and the inverse lexical weighting. Recently, different methods have been proposed to add an additional score that, in one way or another, measures the semantic similarity of each phrase pair. This new weight would in principle be complementary to the standard translation probabilities and lexical weightings, which are both based on simple co-occurrence statistics, and the SMT system could therefore benefit from it.
Zou et al. (2013) propose a simple yet effective method for this based on bilingual word embeddings. First of all, they learn these bilingual word embeddings using their monolingual adaptation method presented in Section 4.3.2. They then compute the vector representation of phrases by simply taking the centroid of the embeddings of the words they contain, and add the cosine similarity between these centroids as a new weight into the phrase table. Using this technique, they get an improvement of nearly 0.5 BLEU points in the NIST08 Chinese-English translation task.
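A minimal sketch of this centroid-based score, assuming the bilingual word embeddings are available as a plain dictionary from words of either language to numpy vectors in the shared space (names are placeholders):

import numpy as np

def phrase_vector(phrase, embeddings):
    """Represent a phrase as the centroid of the embeddings of its words
    (unknown words are simply skipped in this sketch)."""
    vecs = [embeddings[w] for w in phrase.split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else None

def centroid_cosine(src_phrase, tgt_phrase, embeddings):
    """Cosine similarity between the source and target phrase centroids,
    to be added as an extra weight in the phrase table."""
    s = phrase_vector(src_phrase, embeddings)
    t = phrase_vector(tgt_phrase, embeddings)
    if s is None or t is None:
        return 0.0
    return float(s @ t / (np.linalg.norm(s) * np.linalg.norm(t)))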
A notable limitation of the previous system is that the training of the word embeddings is independent from the task, composition method and similarity measure used (the centroid cosine similarity). Other methods try to overcome this issue by tightly integrating all these aspects into their training procedure. This way, Gao et al. (2014) use a feedforward neural network with two hidden layers that takes the bag-of-words representation of a phrase as input and produces a low-dimensional vector representation of the entire phrase as output. The same ANN is used for both languages so that their phrase embeddings are in the same vector space, and it is trained to maximize the dot product between corresponding phrase pairs, which is then added into the phrase table as an additional weight. Using this technique, they get an improvement of up to 1.3 BLEU points in the WMT 2012 French-English translation task.
However, the previous method has another important limitation since, by taking the bag-of-words representation of phrases, it ignores the order of the words in them. For that reason, other architectures have been proposed to better model the compositionality of phrases. Among them, the Recursive Autoencoder (RAE) developed by Socher et al. (2011) has been successfully applied in SMT for different tasks that go beyond phrase translation similarity scoring, including reordering (Li et al., 2013) and non-terminal representation in hierarchical SMT (Wang et al., 2015). Proposed originally for sentiment analysis, a RAE consists of an encoding layer that combines two individual embeddings into a single one and a decoding layer that reconstructs the original embeddings from the latter. This way, starting with the word embeddings of a phrase, a greedy algorithm is used to iteratively combine contiguous embeddings minimizing their reconstruction error, obtaining a single embedding that represents the entire phrase at the end. In their original work, Socher et al. (2011) add a softmax layer on top of it for predicting the sentiment label distribution of the phrase in question, and the word embeddings, encoder, decoder and prediction layers are jointly trained to minimize both the prediction and reconstruction errors.
Zhang et al. (2014) adapt this model for phrase translation similarity scoring. For that purpose, they make use of two RAEs, one for each language, and adapt their prediction layer to project the phrase embeddings built with them to the other language. For each entry in the phrase table, they then compute the semantic distance in each direction as the squared Euclidean distance between the projected phrase embedding in one language and the phrase embedding in the other, and add these two weights to the phrase table. In order to train the model, they obtain positive examples from the training corpus by applying forced decoding and negative examples by taking random words as translations, and define a max-margin loss function over them. They test their method in Chinese-English translation, obtaining an average improvement of about one BLEU point over different test sets and up to 1.7 points in the best case. In a later work, Su et al. (2015) extend this model to better enforce structural alignment consistency according to word alignment, obtaining an improvement of about 0.7 BLEU points over the basic model in Zhang et al. (2014).
Finally, there have been other proposals that try to exploit the same underlying idea using other ANN architectures. For instance, Cho et al. (2014) develop an RNN encoder-decoder that consists of two recurrent neural networks (RNN). The encoder RNN encodes a variable length source phrase into a fixed length vector representation in a way similar to a RAE, but it does so sequentially instead of following a tree-like structure. However, rather than encoding both phrases and computing the distance between them, Cho et al.
(2014) use a decoder RNN that generates the target language phrase from the encoded source language phrase, and use it to compute the probability of the target phrase given
the source phrase, which they add into the phrase table. At the same time, they use a novel activation function for both RNNs that adaptively remembers and forgets past history, similar to a long short-term memory (LSTM) but much simpler. Using this method, they get an improvement of 0.57 BLEU points in the WMT 2014 English-French translation task, which goes up to 1.34 BLEU points when used together with a neural language model, showing that both techniques are complementary.
4.4.2 New translation induction
As seen in Section 4.3.1, bilingual mapping methods for word embeddings learn some transformation that projects the source language embeddings into the target language space. Since the word embeddings in both languages are trained independently, this transformation can also be applied to words that are missing or have few occurrences in the bilingual training data, as long as they do appear in the monolingual corpus, and it can therefore be used to induce new translations that were not explicitly seen before.
In their original work, Mikolov et al. (2013b) themselves show the potential of this idea with a translation induction experiment. More concretely, they train English, Spanish and Czech word embeddings using the CBOW architecture and monolingual data from WMT 2011. After that, they generate bilingual dictionaries with the 5,000 most frequent words and their translations as given by Google Translate, and use each of these bilingual dictionaries to learn a transformation matrix between different embedding spaces with their method. For the subsequent 1,000 most frequent words, they project their corresponding embeddings into the target language space using the previously learned transformation matrix, and take the closest word embedding there, as measured by cosine similarity, as their translation. They then compare these induced translations with those given by Google Translate, obtaining a precision of over 30% for both English-Spanish directions and over 20% for both English-Czech directions. Moreover, when measuring the top 5 accuracy instead of the top 1, these percentages increase to over 50% for English-Spanish and over 40% for English-Czech.
Subsequent works propose improvements over this basic approach and apply them in end-to-end machine translation. This way, Zhao et al. (2015) use several local linear projections instead of a single global one as Mikolov et al. (2013b) do, and propose the use of redundant bit vectors to speed up the retrieval of the nearest neighbors in the target language space. Moreover, they extend their method from words to phrases by simply taking the element-wise addition of word embeddings. They then apply this method to generate new translation rules for an Arabic-English and an Urdu-English phrase-based SMT system, obtaining an improvement of up to 1.6 and 0.5 BLEU points, respectively.
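A minimal sketch of the induction step, assuming a previously learned mapping W (e.g. as in the sketch of Section 4.3.1), a matrix of source embeddings to translate, and the target embedding matrix with its vocabulary list (all placeholders):

import numpy as np

def induce_translations(src_vecs, W, tgt_vecs, tgt_vocab, topk=5):
    """Project source embeddings into the target space with W and retrieve the
    top-k nearest target words by cosine similarity."""
    proj = src_vecs @ W.T                                   # map x -> W x, one row per source word
    proj = proj / np.linalg.norm(proj, axis=1, keepdims=True)
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    sims = proj @ tgt.T                                     # cosine similarities
    best = np.argsort(-sims, axis=1)[:, :topk]
    return [[tgt_vocab[j] for j in row] for row in best]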
4.5 Conclusions
In this chapter, we have discussed word embeddings and word clusters, two distributional semantic techniques that represent words by dense vectors and discrete classes, respectively. These representations are useful to mitigate the sparsity problem in natural language
processing while providing a means to incorporate distributional knowledge acquired from large monolingual corpora. For that reason, in Chapter 6 we integrate both word embedding and word cluster features in the logistic regression model we propose in Chapter 5 for dynamic phrase translation probability scoring. Apart from that, we have seen different approaches to learn bilingual extensions of word embeddings and their application in SMT through phrase translation similarity scoring and new translation induction. Along these lines, in Chapter 7 we propose a new framework to learn bilingual word embedding mappings and test them on a word translation induction task, and in Chapter 8 we explore the direct use of these and other bilingual word embeddings for phrase translation similarity scoring through different metrics.
5 Logistic regression for dynamic phrase translation probability modeling
In this chapter, we propose the use of logistic regression for dynamic phrase translation probability modeling. Our model allows us to flexibly define a set of features to characterize phrase translation candidates, and the probability that the logistic regression classifier predicts for them is dynamically added into the phrase table. We show that, for the right set of features, this probability is equivalent to the relative frequency counting estimate used in phrase-based SMT and, therefore, our model can be seen as a generalization of the standard translation probabilities that naturally incorporates additional features. This chapter presents the proposed model together with a basic set of lexical features, and tests them experimentally in English-Spanish translation. Later, in Chapter 6, we explore the incorporation of distributional semantic features into the model through word clusters and word embeddings.
This way, we describe our model in Section 5.1, and prove its equivalence with relative frequency counting in Section 5.2. Section 5.3 discusses our basic feature design, which comprises source language lexical features, target language lexical features and their combination. Section 5.4 then presents the experiments on English-Spanish translation and discusses the obtained results. Finally, Section 5.5 concludes the chapter.
The work in this chapter was done in collaboration with researchers at the Institute of Formal and Applied Linguistics (ÚFAL) at Charles University in Prague, especially with Aleš Tamchyna, whom we would like to thank for their support and help. More concretely, our implementation is based on their Moses extension to integrate Vowpal Wabbit, which is now part of the official Moses release. At the same time, both teams shared their experimental settings, results and findings. In any case, the formalization and proof of the relative frequency counting equivalence, the implementation of some variants and features, the experimental framework used and all the experiments presented here are completely original. Likewise, our work to integrate distributional semantic features, presented in Chapter 6, is also completely original.
5.1 Proposed model
Let $f_i$ be a binary feature function that, given a source language phrase $\bar{f}$ and a target language phrase $\bar{e}$, yields 1 if the phrase pair $(\bar{f}, \bar{e})$ has a given property and 0 otherwise. For instance, we could define a feature function that tells whether $\bar{e}$ contains the bigram "la casa", or another one that tells if $\bar{e}$ contains the word "casa" and $\bar{f}$ contains the word "house". For a given set of features, we define a logistic regression model to estimate the probability of a given phrase $\bar{f}$ being translated as $\bar{e}$ as follows (see Section 3.3):
\[
h(\bar{f}, \bar{e}) = \frac{1}{1 + e^{-\sum_i w_i f_i(\bar{f}, \bar{e})}}
\]
This predicted probability is then added into the log-linear feature combination of a standard phrase-based SMT system along with the forward and inverse translation probabilities and lexical weightings (see Section 2.3.1), either statically or, in our case, dynamically, which allows the use of context-dependent features.
However, it might be argued that this integration model is flawed in that it does not guarantee a true probability distribution over the possible translations of a given phrase. For that reason, we also consider the variant where the softmax function is used to add a normalized forward probability into the phrase table:
\[
h_{\mathrm{softmax}}(\bar{f}, \bar{e}) = \frac{e^{\sum_i w_i f_i(\bar{f}, \bar{e})}}{\sum_{\bar{e}'} e^{\sum_i w_i f_i(\bar{f}, \bar{e}')}}
\]
The original implementation by ÚFAL researchers, which is the basis of our work, used this latter integration method. Nevertheless, we think that this model has its own drawbacks, similar to those of the standard translation probabilities. In particular, if a phrase has very few entries in the phrase table, this model will tend to predict higher probabilities for each of them, whereas if a phrase has many different entries, it will tend to predict lower probabilities. In the most extreme case, for a phrase with only one possible translation in the phrase table, the model will necessarily predict a probability of 1, no matter how adequate that translation is. In other words, it assumes that one, and only one, of the entries in the phrase table is the correct translation for a given phrase and context, but it is possible that, in reality, none or several of them are. As a consequence, this integration method will tend to overestimate the probability of long phrases, which will necessarily have fewer occurrences in the training corpus than their corresponding subphrases and are therefore more likely to miss an appropriate translation for a given context. For that reason, we also implement the direct integration of the logistic regression probability, which we prefer for our theoretical discussion here, and test both variants later in the experiments in Section 5.4.
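The contrast between the two integration variants can be illustrated with a small numpy sketch, assuming a hypothetical feature matrix whose rows correspond to the phrase table entries (candidate translations of a given source phrase in its context) and a learned weight vector w:

import numpy as np

def direct_scores(features, w):
    """Direct integration: each candidate gets an independent logistic
    regression probability, not forced to sum to 1 across candidates."""
    s = features @ w
    return 1.0 / (1.0 + np.exp(-s))

def softmax_scores(features, w):
    """Softmax integration: probabilities normalized over the candidate
    translations of the same source phrase, so they always sum to 1."""
    s = features @ w
    s = s - s.max()                  # numerical stability
    e = np.exp(s)
    return e / e.sum()

# toy example: a source phrase with a single phrase table entry
feats = np.array([[1.0, 0.0, 1.0]])
w = np.array([-0.3, 0.8, 0.1])
print(direct_scores(feats, w))       # can be well below 1
print(softmax_scores(feats, w))      # necessarily [1.0], as noted above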
In order to train the model, for every occurrence of a phrase $\bar{f}$ in the training corpus that is aligned with $\bar{e}$, we create a positive example $(\bar{f}, \bar{e})$ and all the possible negative examples $(\bar{f}, \bar{e}')$ where $\bar{e}' \neq \bar{e}$ and there is an entry for $(\bar{f}, \bar{e}')$ in the phrase table. The original implementation by ÚFAL researchers does not consider the case where $\bar{f}$ is unaligned or its translation $\bar{e}$ is missing in the phrase table. However, we consider that, for the direct integration method discussed above, creating negative examples for these cases could help to better model the probability of phrases that are not often translated into the target language (e.g. due to ellipsis), and test both variants in our experiments in Section 5.4. In either case, the model is trained to find the set of weights $\hat{w}$ that minimizes the error
function of logistic regression (see Section 3.3):
\[
\begin{aligned}
\hat{w} &= \arg\min_w \sum_{(\bar{f},\bar{e})} - \left[ y(\bar{f},\bar{e}) \log h(\bar{f},\bar{e}) + \left(1 - y(\bar{f},\bar{e})\right) \log \left(1 - h(\bar{f},\bar{e})\right) \right] \\
&= \arg\min_w \sum_{(\bar{f},\bar{e})} y(\bar{f},\bar{e}) \log \frac{1}{h(\bar{f},\bar{e})} + \left(1 - y(\bar{f},\bar{e})\right) \log \frac{1}{1 - h(\bar{f},\bar{e})} \\
&= \arg\min_w \sum_{(\bar{f},\bar{e})} y(\bar{f},\bar{e}) \log \left(1 + e^{-\sum_i w_i f_i(\bar{f},\bar{e})}\right) + \left(1 - y(\bar{f},\bar{e})\right) \log \frac{1}{1 - \frac{1}{1 + e^{-\sum_i w_i f_i(\bar{f},\bar{e})}}} \\
&= \arg\min_w \sum_{(\bar{f},\bar{e})} y(\bar{f},\bar{e}) \log \left(1 + e^{-\sum_i w_i f_i(\bar{f},\bar{e})}\right) + \left(1 - y(\bar{f},\bar{e})\right) \log \frac{1 + e^{-\sum_i w_i f_i(\bar{f},\bar{e})}}{e^{-\sum_i w_i f_i(\bar{f},\bar{e})}} \\
&= \arg\min_w \sum_{(\bar{f},\bar{e})} y(\bar{f},\bar{e}) \log \left(1 + e^{-\sum_i w_i f_i(\bar{f},\bar{e})}\right) + \left(1 - y(\bar{f},\bar{e})\right) \log \left(1 + e^{\sum_i w_i f_i(\bar{f},\bar{e})}\right)
\end{aligned}
\]
where $y(\bar{f},\bar{e})$ is 0 for negative examples and 1 for positive ones.
Since this corresponds to the maximum likelihood estimation for the given observations, it can be shown that, for a feature set composed solely of unique phrase pair identifiers, the model will learn the translation probabilities $\phi(\bar{e}|\bar{f})$ estimated by relative frequency that are used in standard phrase-based SMT (see Section 2.3). In other words, if we define a feature function $f_i$ for every phrase pair $(\bar{f},\bar{e})$ so that $f_i(\bar{f},\bar{e}) = 1$, $\forall j \neq i$, $f_j(\bar{f},\bar{e}) = 0$, and $\forall (\bar{f}',\bar{e}') \neq (\bar{f},\bar{e})$, $f_i(\bar{f}',\bar{e}') = 0$, then the following will hold³ (see Section 5.2 for a step by step proof):
\[
h(\bar{f},\bar{e}) = \phi(\bar{e}|\bar{f}) = \frac{\mathrm{count}(\bar{f},\bar{e})}{\mathrm{count}(\bar{f})}
\]
³ Note that, in Section 2.3, we defined $\phi(\bar{e}|\bar{f}) = \frac{\mathrm{count}(\bar{f},\bar{e})}{\sum_{\bar{e}} \mathrm{count}(\bar{f},\bar{e})}$ instead. This is because the standard translation probabilities do not consider unaligned phrases, whereas we explore both the variant that does and the variant that does not. Therefore, in our notation here $\mathrm{count}(\bar{f})$ denotes the total number of considered occurrences, including unaligned ones or not depending on the variant used.
Therefore, the proposed model can be seen as a generalization of the standard translation probabilities used in SMT that allows additional features to be naturally incorporated to obtain better probability estimates. We hypothesize that this could be useful in several ways:
1. It allows us to use additional lexical features shared across different phrase pairs, which could have a smoothing effect and help to better estimate the probability of phrase pairs with very few occurrences in the training corpus. For instance, let's say that the phrase pair "Real Sociedad won the match - la Real ganó el partido" has very few occurrences in the training corpus, making it hard to estimate its probability, whereas "Real Sociedad lost the match - la Real perdió el partido" and "Athletic won the match - el Athletic ganó el partido" occur more often. If we had a feature for every combination of bigrams between the source and the target languages, we could potentially get better probability estimates thanks to the additional evidence we have for bigram pairs like "Real Sociedad - la Real", "won the - ganó el" or "the match - el partido". What is more, this would even make it possible to estimate the probability of new phrase pairs with no occurrences in the training corpus. We design such lexical features later in Section 5.3 and test them in the experiments in Section 5.4 (a sketch of this kind of feature is given after this list).
2. It allows us to incorporate features for the context of the phrase in the source language, overcoming one of the most obvious limitations of traditional translation models. Sometimes it is the context that determines the appropriateness of a phrase pair, as is the case with "your arm - tu arma" and "your arm - tu brazo". Traditional translation models would nonetheless assign a fixed score to each pair, and nothing but the language model could choose the correct translation taking the context into account. The proposed model can overcome this issue by replacing the static weights of traditional translation models with dynamic ones that depend on context features. We also design such lexical context features in Section 5.3 and test them in the experiments in Section 5.4. It should be noted that, while it is theoretically possible to have context features for both the source and the target languages, we do not explore the latter option in this work due to the extra complexity that it would pose for decoding, as the context would not be known a priori (see Section 5.3.2).
3. It provides a framework to naturally incorporate linguistic information into the translation model. Factored models are typically used for this purpose, combining independent translation and generation models for different factors such as the surface form, the lemma and the part-of-speech (see Section 2.4.1). By adding the relevant features, the proposed model allows the very same information to be used, with the added advantage that the different factors can be combined more flexibly and tuned according to a joint optimization objective. Based on this idea, we explore the incorporation of distributional semantic features into the model through word clusters and word embeddings later in Chapter 6.
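To make the kind of features discussed above more concrete, the following sketch (purely illustrative; the actual feature templates are the ones defined in Section 5.3) extracts binary indicator features for a phrase pair in context: source bigrams, target bigrams, their combinations and source-side context words, encoded as strings that a sparse learner such as Vowpal Wabbit could consume as binary features.

def bigrams(tokens):
    return list(zip(tokens, tokens[1:]))

def phrase_pair_features(src_phrase, tgt_phrase, src_context):
    """Binary indicator features for a phrase pair in context; each string
    is one feature name with an implicit value of 1."""
    src, tgt = src_phrase.split(), tgt_phrase.split()
    feats = []
    feats += ["s_bi=%s_%s" % b for b in bigrams(src)]                 # source bigrams
    feats += ["t_bi=%s_%s" % b for b in bigrams(tgt)]                 # target bigrams
    feats += ["st_bi=%s_%s|%s_%s" % (sb + tb)                         # bigram pair combinations
              for sb in bigrams(src) for tb in bigrams(tgt)]
    feats += ["ctx=%s" % w for w in src_context]                      # source-side context words
    return feats

# hypothetical example based on the phrase pair discussed in the text
print(phrase_pair_features("Real Sociedad won the match",
                           "la Real ganó el partido",
                           src_context=["yesterday"]))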
5.2 Proof of relative frequency counting equivalence
Following the formulation in Section 5.1, we define a feature set composed of unique phrase pair identifier feature functions $f_i$ so that, for every $f_i$ associated to a phrase pair $(\bar{f},\bar{e})$, $f_i(\bar{f},\bar{e}) = 1$, $\forall j \neq i$, $f_j(\bar{f},\bar{e}) = 0$, and $\forall (\bar{f}',\bar{e}') \neq (\bar{f},\bar{e})$, $f_i(\bar{f}',\bar{e}') = 0$. Since $\forall (\bar{f}',\bar{e}') \neq (\bar{f},\bar{e})$, $f_i(\bar{f}',\bar{e}') = 0$, the weight $w_i$ associated with the feature function $f_i$ will be annulled for every $(\bar{f}',\bar{e}') \neq (\bar{f},\bar{e})$. At the same time, given that $\forall j \neq i$, $f_j(\bar{f},\bar{e}) = 0$, both the error function and the prediction for the phrase pair $(\bar{f},\bar{e})$ will only depend on the weight $w_i$. As a consequence, each weight $w_i$ can be optimized independently, and it is enough to consider the phrase pair $(\bar{f},\bar{e})$, characterized by its corresponding feature function $f_i$, for that purpose.
Since, following the training method described above, we will have $\mathrm{count}(\bar{f},\bar{e})$ positive examples for the phrase pair $(\bar{f},\bar{e})$ (one for each occurrence of $\bar{f}$ in the training corpus that is aligned with $\bar{e}$) and $\mathrm{count}(\bar{f}) - \mathrm{count}(\bar{f},\bar{e})$ negative examples (one for each occurrence of $\bar{f}$ in the training corpus that is not aligned with $\bar{e}$), the optimal weight $\hat{w}_i$ will be given by