Using Comparable Corpora to Augment Low Resource SMT Models

Using Comparable Corpora to Augment Low Resource SMT Models Ann Irvine December 10, 2014 2 Contents 1 Introduction 7 1.1 Contributions of this thesis ........................................ 11 1.2 Structure of this document ........................................ 11 1.3 Related publications ............................................ 13 2 Literature Review 15 2.1 Statistical Machine Translation ...................................... 15 2.1.1 Phrase-Based Machine Translation ................................ 15 2.1.2 Other Models of Translation ................................... 16 2.1.3 Low Resource Machine Translation ............................... 18 2.1.3.1 AVENUE ....................................... 18 2.1.3.2 METIS-II ....................................... 19 2.1.3.3 Other work ....................................... 19 2.1.4 The OOV and rare word problem in SMT ............................ 19 2.2 Expanding bilingual resources ...................................... 20 2.2.1 Bilingual lexical induction .................................... 20 2.2.1.1 Contextual Similarity ................................. 20 2.2.1.2 Other Monolingual Similarity Metrics ........................ 21 2.2.1.3 Integration with SMT ................................. 22 2.2.2 Extracting Parallel Data from Comparable Corpora ....................... 22 2.2.3 Crowdsourcing Translations ................................... 22 2.2.4 Automatically Expanding SMT Model Coverage ........................ 23 2.3 Domain Adaptation for Machine Translation ............................... 23 3 Languages, Data, and Analysis 25 3.1 Languages ................................................. 25 3.2 Parallel Corpora and Dictionaries ..................................... 26 3.2.1 Parallel Corpora ......................................... 26 3.2.2 Bilingual Dictionaries ...................................... 27 3.3 Monolingual and Comparable Corpora .................................. 28 3.3.1 Web crawls ............................................ 28 3.3.2 Wikipedia ............................................. 30 3.4 Analysis .................................................. 30 3.4.1 Approach ............................................. 30 3.4.1.1 Error Taxonomy .................................... 30 3.4.1.2 Word Alignment Driven Evaluation .......................... 30 3.4.1.3 Table Enhancement for Translation Analysis ..................... 31 3.4.2 Experiments ........................................... 32 3.4.2.1 WADE Analyses .................................... 32 3.4.2.2 TETRA Analyses ................................... 34 3.4.3 Word Alignment Errors ...................................... 35 3.4.4 Analysis Conclusion ....................................... 35 4 4 Bilingual Lexicon Induction 37 4.1 Motivating Prior Work .......................................... 37 4.2 Using Monolingual Data to Predict Translations ............................. 38 4.2.1 Monolingual Signals of Translation Equivalence ........................ 38 4.2.1.1 Contextual Similarity ................................. 38 4.2.1.2 Temporal Similarity .................................. 38 4.2.1.3 Orthographic Similarity ................................ 39 4.2.1.4 Topic Similarity .................................... 41 4.2.1.5 Frequency Similarity ................................. 43 4.2.1.6 Burstiness Similarity ................................. 43 4.2.1.7 Additional Signals ................................... 44 4.2.2 Orthogonality of Signals ..................................... 44 4.2.3 Individual Monolingual Signals ................................. 47 4.2.4 Learning to combine orthogonal monolingual signals ...................... 48 4.3 Experiments ................................................ 51 4.3.1 Comparison with Unsupervised Baseline ............................ 51 4.3.2 Analysis by Word Frequency ................................... 52 4.3.3 Analysis by Word Burstiness ................................... 52 4.3.4 Analysis by Amount of Monolingual Data ............................ 54 4.4 Learning Curve Analyses ......................................... 56 4.4.1 Translated Word Pairs ...................................... 56 4.4.2 Monolingual Data ........................................ 56 4.5 Learning Models Across Languages ................................... 59 4.6 Comparison with Prior Work ....................................... 59 4.7 Conclusions ................................................ 61 5 Monolingual Phrase Table Scoring 63 5.1 Phrase Table Scoring ........................................... 63 5.1.1 Phrasal Features ......................................... 63 5.1.2 Lexical Features ......................................... 64 5.2 Experiments with An Existing Phrase Table ............................... 64 5.2.1 Results .............................................. 66 5.2.1.1 Ablation Experiments ................................. 66 5.2.1.2 Combining Bilingually and Monolingually Estimated Features ........... 66 5.3 Phrase Table Scoring Conclusions .................................... 67 6 End-to-End SMT with Zero or Small Parallel Texts 69 6.1 Improving Coverage ........................................... 69 6.2 Improving Accuracy ........................................... 71 6.3 Zero Parallel Data Setting ......................................... 71 6.4 Small Parallel Corpora Setting ...................................... 72 6.4.1 Data ................................................ 72 6.4.2 Experimental setup ........................................ 72 6.4.3 Results .............................................. 75 6.4.3.1 Bilingual Lexicon Induction .............................. 75 6.4.3.2 Improving Coverage and Accuracy in End-to-End SMT ............... 76 6.4.3.3 Translations of Low Frequency Words ........................ 76 6.4.3.4 Appending Top-K Translations ............................ 77 6.4.3.5 Learning Curves over Parallel Data .......................... 77 6.4.3.6 Learning Curves over Comparable Corpora ...................... 78 6.4.4 Post-Augmentation WADE Analysis ............................... 78 6.4.4.1 WADE with Multiple References ........................... 78 6.4.4.2 Analysis ........................................ 82 6.4.4.3 WADE Analysis Conclusions ............................. 84 LOW RESOURCE LANGUAGE AND DOMAIN MACHINE TRANSLATION 5 6.5 End-to-End SMT with Zero or Small Parallel Texts Conclusion ..................... 84 7 Phrase Translation Mining 87 7.1 Option 1: Fast Phrase Pair Filtering ................................... 89 7.2 Option 2: Composing Phrase Translations ................................ 90 7.2.1 Motivating Experiments ..................................... 91 7.2.2 Phrase Composition Algorithm ................................. 95 7.2.3 Pruning Phrase Pairs Using Scores Derived from Comparable Corpora ............. 95 7.3 End-to-end SMT with Induced Phrase Translations ........................... 98 7.3.1 Experimental Setup ........................................ 98 7.3.2 Unigram Translations ....................................... 98 7.3.3 Composing and Pruning Phrase Translations .......................... 98 7.3.4 MT Experimental Setup ..................................... 99 7.3.5 Results .............................................. 99 7.3.5.1 Unigram Translations ................................. 99 7.3.5.2 Composed Phrase Pairs ................................ 99 7.3.5.3 End-to-End Translation ................................ 101 7.3.6 Discussion ............................................ 101 8 From Low Resource MT to Domain Adapted MT 105 8.1 Domain Adaptation Data ......................................... 105 8.2 WADE and TETRA Analyses ...................................... 106 8.3 New-Domain Comparable Corpora .................................... 108 8.4 Using Comparable Corpora to Score Phrase Tables for Domain Adaptation ............... 109 8.5 Using Comparable Corpora to Translate Unseen Words for Domain Adaptation ............ 111 8.5.1 Bilingual Lexicon Induction Model ............................... 111 8.5.2 Evaluation of Induced Translations ............................... 112 8.5.3 Integrating Translations into End-to-End SMT ......................... 113 8.5.4 Conclusion ............................................ 114 9 Conclusion 115 10 Language Set 119 11 Data Resources 123 12 WADE Analysis: Comparison of the use of Automatic and Manual Word Alignments 131 13 Bilingual Lexicon Induction 133 13.1 Contextual Vector Projection Dictionaries ................................ 133 13.2 Comparison of Temporal Signatures ................................... 133 14 Zero-Parallel Data Translations 139 15 Fast Phrase Pair Filtering 143 15.1 Exploratory Experiments in Learning Effective, Efﬁcient Filters .................... 143 15.2 Scaling Up ................................................ 147 6 Chapter 1 Introduction The objective of this thesis is to directly incorporate comparable corpora into the estimation of end-to-end statistical machine translation (SMT) models. Typically, SMT models are estimated from parallel corpora, or pairs of translated sentences. In contrast to parallel corpora, comparable corpora are pairs of monolingual corpora that have some cross- lingual similarities, for example topic or publication date, but that do not necessarily contain any direct translations. Comparable corpora are more readily available in large quantities than parallel corpora, which

Load more