
WARNING. Access to the contents of this doctoral thesis is limited to the acceptance of the use conditions set by the following Creative Commons license: https://creativecommons.org/licenses/?lang=en

AUTONOMOUS UNIVERSITY OF BARCELONA

DOCTORAL THESIS

Reliable Training Scenarios for Dealing with Minimal Parallel-Resource Language Pairs in Statistical Machine Translation

Author: Benyamin AHMADNIAYEBOSARI
Supervisor: Dr. Javier SERRANO GARCIA

PhD Program in Electrical and Telecommunication Engineering

A thesis submitted in fulfilment of the requirements for the degree of Doctor of Philosophy in the Department of Telecommunication and Systems Engineering, School of Engineering

November 2017


Abstract

Over the years, various changes have been made to Machine Translation (MT), one of the main applications of Natural Language Processing (NLP). Statistical Machine Translation (SMT) is one of the preferred approaches to MT, and notable improvements have been achieved with this approach, specifically in output quality for a number of language pairs, since advances in computational power have been accompanied by the exploration of new methods and algorithms. When we consider the development of SMT systems for many language pairs, the major bottleneck that we find is the lack of parallel training data. Because a great deal of time and effort is required to create these corpora, they are available only in limited quantity, genre, and language coverage.

SMT models learn how to translate by examining a bilingual parallel corpus, i.e. a collection of sentences aligned with their human-produced translations. However, the output quality of SMT systems is heavily dependent on the availability of massive amounts of parallel text in the source and target languages. Hence, parallel resources play an important role in improving the quality of SMT systems. We define minimal parallel-resource SMT settings as those that possess only small amounts of parallel data, a situation that holds for many language pairs.

The performance achieved by current state-of-the-art minimal parallel-resource SMT systems is considerable, but they usually rely on monolingual text and do not fundamentally address the shortage of parallel training text. Enlarging the parallel training data without providing any guarantee on the quality of the newly generated bilingual sentence pairs also raises concerns. The limitations that emerge during the training of minimal parallel-resource SMT show that current systems are incapable of producing high-quality translation output.

In this thesis, we propose a "direct-bridge combination" scenario as well as a "round-trip training" scenario for dealing with minimal parallel-resource SMT systems; the former is based on the bridge language technique, while the latter is based on a retraining approach. Our main aim in putting forward the direct-bridge combination scenario is to bring bridge-based translation closer to state-of-the-art performance. This scenario has been proposed to maximize the information gain by choosing the appropriate portions of the bridge-based translation system that do not interfere with the direct-based translation system, which is trusted more. Furthermore, the round-trip training scenario has been proposed to take advantage of readily available generated bilingual sentence pairs to build a high-quality SMT system in an iterative fashion: by selecting a high-quality subset of the generated sentence pairs on the target side, preparing their suitable corresponding source sentences, and using them together with the original sentence pairs to retrain the SMT system.

The proposed methods are intrinsically evaluated and compared against baseline translation systems. We have also conducted experiments for both proposed scenarios with minimal initial bilingual data, and we demonstrate that the proposed methods improve performance over the baseline in each scenario while building high-quality SMT systems.

Acknowledgements

I would like to express my deep gratitude to my advisor, Dr. Javier Serrano, who has had a profound influence on my research and studies. He introduced me to the fascinating field of Natural Language Processing and Statistical Machine Translation, and taught me a great deal of valuable research skills. Besides being a knowledgeable teacher, he has also been a helpful friend with a strong personality, which was always inspiring.

I want to thank Dr. Gholamreza Haffari for his guidance and support throughout this thesis work during my fruitful research visit to the Faculty of Information Technology at Monash University, Australia. This doctoral thesis would probably not have been possible without his help.

At this point, I would like to express my everlasting gratitude to Dr. Mojtaba Sabbagh Jafari (Vali-e-Asr University of Rafsanjan, Iran) and Dr. Nik-Mohammad Balouchzahi (University of Sistan and Baluchestan, Iran). I marvel at their invaluable and insightful technical support, their exceptional generosity, and the encouragement they have shown me.

I am also thankful to the past and current members of the Transmedia-Catalonia research group and the Telecommunication and Systems Engineering Department at the School of Engineering of the Autonomous University of Barcelona, Spain.

Very special thanks to my friend, Shekoofeh Dadgostar (University of Granada, Spain), for all her unconditional moral support in the most stressful and hardest moments.

Of course, I would like to thank my beloved family. My dear parents (Farideh and Khosro) and my lovely sister (Tamara) have endowed me with curiosity about the world, and have supported me throughout my life, even from thousands of miles away. Their continuing encouragement lights my path into higher education. Thank you all.

Finally, thanks to all my friends, families, colleagues, reviewers, and everyone else not mentioned here, in Spain, Australia, and Iran, for all their support in many aspects during the course of this research.


Contents

Abstract iii

Acknowledgements v

List of Figures xi

List of Tables xiii

1 Introduction 1
    1.1 Context and Motivation 1
    1.2 Thesis Objectives 6
    1.3 Thesis Outline 7

2 Background 9
    2.1 Computational Linguistics 9
    2.2 Natural Language Processing 9
    2.3 Minimal-Resource Languages 11
    2.4 Machine Translation 12
        2.4.1 Rule-Based Machine Translation 14
        2.4.2 Corpus-Based Machine Translation 15
        2.4.3 Hybrid Machine Translation 15
    2.5 Statistical Machine Translation 16
    2.6 Parallel Corpus Alignment 18
    2.7 Translation Model Training 19
    2.8 Language Model Training 20
    2.9 Decoding 22
    2.10 Evaluation 22
    2.11 Decoding Software Packages 24
    2.12 State of the art 24
        2.12.1 Learning Frameworks for Minimal Parallel-Resource SMT 25
            2.12.1.1 Semi-Supervised Learning 25
            2.12.1.2 Active Learning 29
            2.12.1.3 Deep Learning 32
        2.12.2 Pivoting Framework for Minimal Parallel-Resource SMT 33
        2.12.3 Other Research Lines 34
            2.12.3.1 Bilingual Lexicon Induction for Minimal Parallel-Resource SMT 35
            2.12.3.2 Monolingual Collocation for Minimal Parallel-Resource SMT 36
            2.12.3.3 Domain Adaptation for Minimal Parallel-Resource SMT 37
    2.13 Summary 39

3 Phrase-Based Translation Models for SMT Systems 41
    3.1 Introduction 42
    3.2 Classical Phrase-Based Translation Model 42
        3.2.1 Noisy-Channel Model 43
        3.2.2 Log-Linear Model 44
        3.2.3 Feature Functions 45
        3.2.4 Phrase Extraction 47
        3.2.5 Phrase-Table Induction 47
        3.2.6 Learning Weights 48
        3.2.7 Solving Search-Problem 48
        3.2.8 Classical Model Decoding 49
    3.3 Hierarchical Phrase-Based Translation Model 51
        3.3.1 Hierarchical Rules 52
        3.3.2 Grammars Definition 52
        3.3.3 Rules Extraction 53
        3.3.4 Rule Parameters Learning 54
        3.3.5 Standard Features 55
        3.3.6 Hierarchical Model Decoding 55
    3.4 Experimental Framework 56
        3.4.1 Experimental Set-Up 57
        3.4.2 Implementation 58
        3.4.3 Results Analysis and Evaluation 59
    3.5 Comparative Performance of Phrase-Based Models 65
        3.5.1 Experiments Setting 65
        3.5.2 Results 66
    3.6 Discussion 72
    3.7 Summary 73

4 Direct-Bridge Combination for Minimal Parallel-Resource SMT 75
    4.1 Introduction 75
    4.2 Bridge Language Theory 77
    4.3 Bridging Approaches 77
        4.3.1 Transfer Approach 78
        4.3.2 Synthetic Corpus Approach 79
        4.3.3 Triangulation Approach 79
            4.3.3.1 Phrase Translation Probabilities 80
            4.3.3.2 Lexical Reordering Weights 81
        4.3.4 Interpolated Model 81
    4.4 Proposed Improvements in Bridge Language Technique 83
        4.4.1 Interpolating Bilingual Texts 83
        4.4.2 Combining Phrase-tables 84
    4.5 Experiments 84
    4.6 Results and Evaluation 85
    4.7 Proposed Method 90
        4.7.1 Optimized Direct-Bridge Combination Method 90
        4.7.2 ODBC Method Experiments 92
            4.7.2.1 Baseline Systems Evaluation 93
            4.7.2.2 Baseline Combination 93
            4.7.2.3 Direct-Bridge Combination 94
    4.8 Discussion 96
    4.9 Summary 97

5 Round-Trip Training Scenario for Minimal Parallel-Resource SMT 99
    5.1 Introduction 100
    5.2 Bootstrapping Analysis 102
        5.2.1 Self-Training Mechanism 105
        5.2.2 Co-Training Mechanism 108
    5.3 Round-Trip Training Theory 111
    5.4 High-Quality Translations Selection 112
    5.5 Round-Trip Training Mechanism 113
    5.6 Round-Trip Training Optimization 114
    5.7 Round-Trip Training Algorithms 115
    5.8 Experimental Framework 118
        5.8.1 Data Preparation 118
        5.8.2 Baseline Phrase-Based SMT Architecture 118
        5.8.3 Implementation 119
    5.9 Results Analysis and Evaluation 120
    5.10 Discussion 123
    5.11 Summary 123

6 Conclusions 125
    6.1 Overall Review 125
    6.2 Future Work Directions 127

Bibliography 129


List of Figures

3.1 Alignment sample of Persian-English parallel sentence 47
3.2 The stack-based beam-search schema 49
3.3 The learning curve of Classical (Moses) and Hierarchical (Cdec) phrase-based translation models for Spanish-English translation and vice versa using 4-gram language models according to BLEU scores 60
3.4 The learning curve of Classical (Moses) and Hierarchical (Cdec) phrase-based translation models for English-Persian translation and vice versa using 4-gram language models according to BLEU scores 61
3.5 The learning curve of Classical (Moses) and Hierarchical (Cdec) phrase-based translation models for Persian-Spanish translation and vice versa using 4-gram language models according to BLEU scores 63
3.6 The learning curve of Classical (Moses) and Hierarchical (Cdec) phrase-based translation models for Persian-Spanish translation and vice versa using 4-gram language models on Tanzil parallel corpus according to BLEU scores 64
3.7 The performance chart of Spanish-English translation task in different kinds of language models according to BLEU 67
3.8 The performance chart of English-Spanish translation task in different kinds of language models according to BLEU 68
3.9 The performance chart of English-Persian translation task in different kinds of language models according to BLEU 69
3.10 The performance chart of Persian-English translation task in different kinds of language models according to BLEU 70
3.11 The performance chart of Persian-Spanish translation task in different kinds of language models according to BLEU 71
3.12 The performance chart of Spanish-Persian translation task in different kinds of language models according to BLEU 72

4.1 The interpolation method schematic between source, pivot, and target models 82
4.2 Performance comparison of the direct and bridge-based translation systems for both Persian-Spanish and Spanish-Persian tasks 87
4.3 Comparison of interpolation bilingual texts methods performance for both Persian-Spanish and Spanish-Persian SMT tasks according to BLEU score 88
4.4 Comparison of combination phrase-tables approaches performance for both Persian-Spanish and Spanish-Persian SMT tasks according to BLEU score 89
4.5 The performance learning curve of the bridging systems 93
4.6 The learning curve of the bridging systems comparing the performance of direct, triangulation, and interpolated systems 94

4.7 The performance of direct, triangulation, and interpolated models versus all types of direct-bridge combination proposed method 95

5.1 Bootstrapping high-level overview 102
5.2 Performance comparison of the translation systems 122
5.3 Learning curve of the translation systems performance 123

List of Tables

3.1 Open-Subtitle corpus statistics 58
3.2 Tanzil corpus statistics 58
3.3 The performance comparison of Classical (Moses) and Hierarchical (Cdec) translation models for Spanish-English translation task according to different 4-gram language models on Open-Subtitles parallel corpus using BLEU scores 59
3.4 The performance comparison of Classical (Moses) and Hierarchical (Cdec) translation models for English-Spanish translation task according to different 4-gram language models on Open-Subtitles parallel corpus using BLEU scores 60
3.5 The performance comparison of Classical (Moses) and Hierarchical (Cdec) translation models for English-Persian translation task according to different 4-gram language models on Open-Subtitles parallel corpus using BLEU scores 61
3.6 The performance comparison of Classical (Moses) and Hierarchical (Cdec) translation models for Persian-English translation task according to different 4-gram language models on Open-Subtitles parallel corpus using BLEU scores 61
3.7 The performance comparison of Classical (Moses) and Hierarchical (Cdec) translation models for Persian-Spanish translation task according to different 4-gram language models on Open-Subtitles parallel corpus using BLEU scores 62
3.8 The performance comparison of Classical (Moses) and Hierarchical (Cdec) translation models for Spanish-Persian translation task according to different 4-gram language models on Open-Subtitles parallel corpus using BLEU scores 62
3.9 The performance comparison of Classical (Moses) and Hierarchical (Cdec) translation models for Persian-Spanish translation task according to different 4-gram language models on Tanzil parallel corpus using BLEU scores 63
3.10 The performance comparison of Classical (Moses) and Hierarchical (Cdec) translation models for Spanish-Persian translation task according to different 4-gram language models on Tanzil parallel corpus using BLEU scores 64
3.11 The performance of applying 3-gram and 5-gram language models through Ken, SRI, and IRST tool-kits on Spanish-English translation task using BLEU metric 66
3.12 The performance of applying 3-gram and 5-gram language models through Ken, SRI, and IRST tool-kits on English-Spanish translation task using BLEU metric 67

3.13 The performance of applying 3-gram and 5-gram language models through Ken, SRI, and IRST tool-kits on English-Persian translation task using BLEU metric 68
3.14 The performance of applying 3-gram and 5-gram language models through Ken, SRI, and IRST tool-kits on Persian-English translation task using BLEU metric 69
3.15 The performance of applying 3-gram and 5-gram language models through Ken, SRI, and IRST tool-kits on Persian-Spanish translation task using BLEU metric 70
3.16 The performance of applying 3-gram and 5-gram language models through Ken, SRI, and IRST tool-kits on Spanish-Persian translation task using BLEU metric 71

4.1 Corpus statistics including the source and target languages information 85
4.2 The BLEU scores comparing the performance of direct translation with bridge-based translation for Persian-Spanish SMT system and back translation through English as bridge language 86
4.3 The BLEU scores comparing the performance of different interpolating bilingual texts improvements for Persian-Spanish and Spanish-Persian SMT systems through English as bridge language 88
4.4 The BLEU scores comparing the performance of different interpolating bilingual texts improvements for Persian-Spanish and Spanish-Persian SMT systems through English as bridge language 89
4.5 Phrase pairs categorization of the portions extracted from the bridge phrase-table 91
4.6 Transfer method versus triangulation with different filtering thresholds (100/1,000/10,000) 93
4.7 Baseline combination experiments between best bridge baseline and best direct model 94
4.8 ODBC experiments results 95
4.9 Percentage of phrase pairs extracted from the original bridge phrase-table for each bridging category 96

5.1 Translation results using BLEU for Spanish-English and back translation tasks 120
5.2 Comparing the baseline-SMT and the enlarged-SMT systems using BLEU for Spanish-English (and back translation) and Persian-Spanish (and back translation) tasks 121

List of Algorithms

1 Semi-supervised learning for statistical machine translation 26
2 Active learning for statistical machine translation 30
3 Building phrase-based translation systems 42
4 Triangulation technique 80
5 General bootstrapping 105
6 Self-training 107
7 Co-training 110
8 Round-trip training 116
9 Round-trip training optimization 117


To my family


Chapter 1

Introduction

This chapter presents the context and the motivations that brought us to develop this thesis, and the proposed objectives to be achieved. Section 1.1 sets the context and the motivations of this thesis, Section 1.2 defines the thesis objectives, and finally, Section 1.3 gives an outline of the rest of this thesis manuscript.

1.1 Context and Motivation

Language is a means of communication. Human beings exchange information between two or more parties using natural languages. The prime objective of communication is to share information and to request or impart knowledge. The information can be conveyed in written form or in vocal (spoken) form. The most important requirement on the content is the validity of the sentences in the given language. Morphemes, phonemes, words, phrases, clauses, sentences, vocabulary, and grammar are the building blocks of any natural language. All valid sentences of a language must follow the rules of that language (its grammar). Invalid sentences are not effective for sharing knowledge and are therefore rejected outright.

Any natural language consists of countably infinite sentences, and these sentences follow a basic structure. Sentence structure is perceived hierarchically at different levels of abstraction, from the surface level (the word level) and the part-of-speech (POS) level up to the abstract level (phrases: subject, object, verb, etc.). Sentence formation strictly depends on the syntactically permissible structures coded in the grammar rules of the language. The basic sentence structures broadly depend on the positions of subject (S), object (O), and verb (V), i.e. their permutations; accordingly SVO, SOV, OSV, OVS, VSO, and VOS are possible, but not all of them (e.g. OVS, OSV) occur in the grammars of the world's natural languages. These patterns are referred to as word order. Depending on the internal structure of phrases, especially verb phrases and certain clauses, sentences are broadly classified as simple, compound, or complex.

Being an effective medium of communication, language obviously speaks for the ideas and expressions of the human mind. There exist several thousand languages in the world, reflecting great linguistic diversity. Undoubtedly, knowing and understanding all the languages of the world would be an absolutely difficult task for an individual. Studying languages with insufficient resources raises interesting and unique linguistic challenges, and by providing solutions for these challenges we can get closer to the goal of a universal translator. While there are many languages spoken around the world, no language sits in isolation: either synchronically or diachronically, languages are often connected with other languages. Accordingly, the methodology of translation was developed in order to communicate messages from one language to another. Translation is considered one of the prerequisites of today's fast-paced world. To meet this requirement at a more rapid pace, human beings thought of automatic translation from one language to another, and several tools, free as well as proprietary, are now available which support translation of text into one or more languages.

In today's era of technology, language engineering focuses on the modelling of human languages within the Computational Linguistics (CL) research domain. CL is an interdisciplinary field of computer science and linguistics that collaborates with the Artificial Intelligence (AI) area and is concerned with the computational aspects of human natural language. Computational linguistics is divided into applied and theoretical components; theoretical computational linguistics deals with the linguistic knowledge needed for the generation and understanding of language.

CL functions analogously to computational biology or any other computational discipline in answering the scientific questions of linguistics. For the core questions in linguistics, it develops computational methods concerning the nature of linguistic representations and linguistic knowledge, and the acquisition and adaptation of linguistic knowledge in language production and comprehension. Answering these questions results in a description of the human language ability and is likely to help explain the distribution of linguistic data and behaviour that we actually observe. In computational linguistics, we usually propose formal answers to these core questions: linguists are really asking what and how humans are computing. So we mathematically define classes of linguistic representations and formal grammars which look sufficient for capturing the range of phenomena in human languages, study their mathematical properties, and devise efficient algorithms for learning, production, and comprehension. Because these algorithms can actually be run, we have the chance to test our models and ascertain whether they make appropriate predictions.

Beyond these core questions, linguistics also encompasses a variety of subfields such as sociolinguistics, historical linguistics, psycholinguistics, and neurolinguistics. Their scientific questions are fair game as well for computational linguists, who are likely to use models and algorithms to make sense of the data. In this case, we do not aim to model the everyday speaker's competence in their native language, but rather to automate the special kind of reasoning that linguists do, potentially providing the opportunity to work on bigger datasets (or even new kinds of data) and draw more solid conclusions. Similarly, computational linguists may design software tools to help document endangered languages.

Natural Language Processing (NLP) is a branch of Artificial Intelligence (AI) and a theory-motivated range of computational techniques for the automatic analysis and representation of human language. In other words, NLP is the art of solving engineering problems which require generating or analysing natural language texts. Here, success is not measured by whether we designed a better scientific theory or proved that languages X and Y were historically related; rather, it is assessed according to whether we provide good solutions to the engineering problem. For instance, we do not judge Google Translate on whether it captures what translation truly is or explains how human translators perform their profession. We judge it on whether it produces reasonably proper and fluent translations for people who need to translate certain things in practice. The automatic translation community has ways of measuring this, and its main focus is on improving those scores. It is generally believed that NLP is mainly used to help people navigate and digest large quantities of information which already exist in text form. It is also used to produce improved user interfaces so that humans can communicate more appropriately with computers and with other humans as well. That being said, NLP is engineering, and may be used for scientific ends within other academic disciplines such as political science, economics, medicine, digital humanities, etc.

Machine Translation (MT), a very vital area of computational linguistics and one of the earliest areas of research in NLP, is the process of automatically translating written text or speech in one natural language (the source) into another natural language (the target). MT is an extremely complicated task with numerous unsettled difficulties. Several factors combine to make high-quality automatic machine translation extremely challenging, such as differences in lexical choice, word order and grammatical structure, the use of idiomatic expressions and non-literal translations, and the presence or absence of particular cultural conventions. On a basic level, MT performs simple substitution of words in one language for words in another, but this alone is usually unable to produce a satisfactory translation of a text, since it is necessary to recognise whole phrases and their closest counterparts in the target language. Solving this problem with statistical techniques is a rapidly growing field which leads to more appropriate translations, handling differences in linguistic typology, translation of idioms, and the isolation of anomalies. There exist different approaches to address the problem of MT. We now give a rough overview of these different methodologies:

1. The rule-based approach: In rule-based systems, the source language text is analysed and transformed into an intermediary representation, from which the target language text is generated. The rules are devised by human experts. Because a large number of rules is needed in order to capture the phenomena of natural language, this is considered a time-consuming process. As the set of rules grows over time, it becomes more and more difficult to extend it and ensure consistency (Ahsan et al., 2010).

2. The data-driven approach: In this type, bilingual and monolingual texts are used as the main knowledge sources. Often, a further division is made between the example-based approach (where the basic idea is to translate by analogy) and the statistical approach, which is the approach followed in this thesis.

Statistical Machine Translation (SMT) treats translation as a machine learning problem: a learning algorithm is applied to a large body of previously translated text, known variously as a parallel corpus, parallel text, bilingual text, or multilingual text, after which the learner is able to translate previously unseen sentences. With an SMT toolkit and enough parallel text, we can build an MT system for a new language pair within a very short period of time. The basic idea in SMT is that we can learn to translate from a corpus of translated text by looking at translation frequencies: if a word or sentence in one language is consistently paired with the same word or sentence in the other language, this indicates that they are appropriate translations of each other.

Since SMT aims at translating a source language sequence into a target language sequence by maximizing the posterior probability of the target sequence given the source sequence, in state-of-the-art translation systems this posterior probability is usually modelled as a consolidation of several different models, such as translation models and lexical models for both translation directions, a target language model, word and phrase penalties, etc. A bilingual parallel text provides the probabilities which express correspondences between the words in the source language and the words in the target language, while a monolingual text in the target language provides the language model probabilities.

Most of the recent research in the area of SMT has focused on modelling translation based on phrases in the source language and matching them with their statistically determined equivalents in the target language (phrase-based translation). This translation approach is said to be used by many modern successful translation systems. Deriving a translation model from a word-aligned parallel corpus is regarded as a critical task in Phrase-Based Machine Translation (PBMT) systems.

Frequently, a translation system's performance is improved by a larger available training corpus. Whereas the task of finding appropriate monolingual text for the language model is not particularly complicated, acquiring a large, high-quality bilingual parallel text for the desired domain and language pair requires a lot of time and effort, and is highly unlikely for some language pairs. In addition, small corpora present certain advantages: the possibility of manual corpus creation, possible manual correction of an automatically collected corpus, low memory and time requirements for training a translation system, etc. Accordingly, strategies for exploiting limited amounts of bilingual data are more and more at the center of attention.
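As a brief recap of this standard formulation (the textbook noisy-channel and log-linear equations of SMT, reproduced here only for reference; the feature functions listed are generic examples rather than this thesis's exact configuration), for a source sentence f and candidate target sentences e:

\hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} P(f \mid e)\, P(e)

and, in the log-linear generalization used by modern phrase-based systems,

\hat{e} = \arg\max_{e} \sum_{m=1}^{M} \lambda_m\, h_m(e, f),

where P(f | e) is the translation model, P(e) the target language model, and the h_m are feature functions (translation and lexical models in both directions, the language model, word and phrase penalties, etc.) with learned weights \lambda_m.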

For several NLP tasks, such as SMT, which rely on the data-driven approach, parallel corpora have proved to be a significant resource. The lack of parallel data is especially problematic for SMT systems, inasmuch as they require a considerable amount of training data to produce reliable models.

The translation performance of SMT systems directly depends not only on the quantity but also on the quality of the available parallel data. Unfortunately, parallel corpora are not readily available in the desired quantities. These corpora are limited in quantity, genre, and language coverage as a result of the special effort required to create them, which is time consuming and costly. Large parallel corpora are only available for a handful of language pairs such as Spanish-English, Chinese-English, Arabic-English, etc. The majority of this data comes from parliamentary proceedings such as the European Parliament (Europarl) or the United Nations (UN), and a limited amount of newswire text is also available. For a vast majority of other language pairs, there is a severe dearth of publicly available parallel corpora. On the other hand, there are a large number of languages that are considered low-density, either because the population speaking the language is not very large, or because, even if millions of people speak the language, insufficient online resources are available in it.

Having a machine learn how to translate from a source language to a target language is thus one of the holy grails of the AI community. However, the lack of bilingual training data is not specific to low-density language pairs. It may also arise because of a change in the domains of the training and test data. For instance, consider a case where we have bilingual training sentences from the news domain, and we want to build an SMT system to translate text from the economics domain. A statistical translation system can be improved and adapted by incorporating new training data in the form of parallel text in cases where bilingual training data is lacking.

This thesis focuses on the bottleneck of data scarcity in training SMT systems when only small bilingual texts are available between the source and target languages. In simple words, we propose methods to overcome the shortage of parallel training corpora that is usually experienced during the training of minimal parallel-resource SMT systems. This proposal is put forward so that we can contribute to improving the performance of the translation systems addressed in this thesis, helping us achieve high-quality translation output in comparison with the baseline systems. We propose two scenarios in this thesis to treat the lack of bilingual training data for the minimal parallel-resource SMT task:

• The first scenario is the Direct-Bridge Combination. This approach is mainly based on the bridge language technique for minimal parallel-resource SMT systems. It has been proposed so that a direct and a bridge model, built from given parallel corpora, can be effectively combined to achieve better coverage and overall translation quality. When applying this technique, we choose the portions of the bridge model that maximize the information gained from them; the portions we choose do not interfere with the trusted direct model in any way. Simply put, the main goal of this scenario is to smartly combine the direct and bridge models so that information gain is maximized. To fulfil this goal, we categorize the bridge phrase pairs into different categories, based on the existence of the source and/or target phrases in the direct model. We have also demonstrated that the optimized direct-bridge combination can result in a large reduction of the bridge model without any negative effect on performance; there are also some cases where it improves performance. We further analyse and answer the question of how quality can be improved by making a smart choice of only the relevant portion of the bridge phrase-table.

• The second scenario is the Round-Trip Training, which is based on the retraining technique for minimal parallel-resource SMT systems. This approach has been proposed to develop a learning framework that is strong enough to compel the minimal-resource SMT system to automatically learn from unlabelled data through a round-trip communication game. In this technique we use two independent translation systems to model both the outbound-trip and inbound-trip translation tasks; they are then asked to learn from each other with the help of a round-trip learning process. The greatest benefit of this strategy is that these two translation tasks can merge into a closed loop and generate informative feedback signals, so that the translation models can be trained without a human labeller involved. The main idea behind this scenario is to take advantage of the readily available generated text in order to build a high-quality SMT model in an iterative manner. In addition, this scenario selects a high-quality subset of the generated sentences on the target side, prepares their high-quality corresponding source sentences, and keeps them paired so that they can be used together with the original small bilingual text to retrain the SMT model (a simplified sketch of this loop is given below). The most prominent point behind this proposed scenario is the goal of obtaining high quality while using a smaller training dataset; this can only happen if the system is allowed to select the data from which it learns.

As a result of the research conducted in this thesis, we obtain fully automatic methods for improving SMT systems in situations where the size of the bilingual training data is small.
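The following toy sketch illustrates only the control flow of this round-trip retraining idea, under simplifying assumptions: the "SMT system" is reduced to a word-for-word lexicon learned from co-occurrence counts, and the selection criterion is a crude coverage score. Both are stand-ins introduced purely to keep the example self-contained and runnable; they are not the phrase-based systems or selection metrics actually used in this thesis.

# Toy, self-contained sketch of a round-trip retraining loop.  The "SMT
# system" is reduced to word-for-word substitution learned from
# co-occurrence counts, and the quality criterion is a crude coverage
# score; both are illustrative stand-ins, not the phrase-based systems
# and selection metrics used in the thesis.
from collections import Counter, defaultdict


def train_toy_model(pairs):
    """Learn a one-word-to-one-word lexicon from aligned sentence pairs."""
    cooc = defaultdict(Counter)
    for src, tgt in pairs:
        for s in src.split():
            for t in tgt.split():
                cooc[s][t] += 1
    return {s: counts.most_common(1)[0][0] for s, counts in cooc.items()}


def translate(model, sentence):
    """Word-by-word substitution; unknown words are copied through."""
    return " ".join(model.get(w, w) for w in sentence.split())


def coverage(model, sentence):
    """Fraction of words the model knows -- a crude quality proxy."""
    words = sentence.split()
    return sum(w in model for w in words) / max(len(words), 1)


def round_trip_training(parallel, mono_source, iterations=3, keep=0.5):
    """Iteratively enlarge a small parallel corpus with self-generated pairs."""
    for _ in range(iterations):
        fwd = train_toy_model(parallel)                        # outbound: source -> target
        bwd = train_toy_model([(t, s) for s, t in parallel])   # inbound: target -> source

        # Outbound trip: translate monolingual source sentences.
        candidates = [(s, translate(fwd, s)) for s in mono_source]

        # Keep only the generated target sentences judged most reliable.
        candidates.sort(key=lambda p: coverage(bwd, p[1]), reverse=True)
        selected = candidates[: max(1, int(keep * len(candidates)))]

        # Inbound trip: regenerate source sides for the selected targets,
        # pair them up, and retrain on original + generated pairs.
        generated = [(translate(bwd, tgt), tgt) for _, tgt in selected]
        parallel = parallel + generated

    return train_toy_model(parallel)

In a realistic setting, train_toy_model and translate would be replaced by full SMT training and decoding (for example, with a phrase-based toolkit such as Moses), and the coverage proxy by a proper translation-quality or confidence measure.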

1.2 Thesis Objectives

After analysing the context and identifying the main issues, we can define the thesis objectives. The main goal of this thesis is to treat the training data scarcity for minimal-resource language pairs through SMT models in order to improve translation quality. From this global contribution, several objectives can be derived as follows:

• Analysing the performance of the Classical and Hierarchical phrase-based translation models for SMT on the Spanish↔English translation task as well as the English↔Persian and Persian↔Spanish translation tasks, and investigating the impact of different statistical language models on both the Classical and Hierarchical translation models by applying different n-gram language models. We first implement Moses and Cdec as baseline translation systems and compare the translation systems' outputs under equal conditions; we then apply the Ken, SRI, and IRST language models to each translation task in order to investigate their effects on the output results; finally, we identify the suitable translation model in each translation direction. The comparative performance between our case-study language pairs based on the Classical and Hierarchical phrase-based translation models will serve as a state-of-the-art reference for further research on phrase-based SMT.

• Providing a direct-bridge combination model by developing a smart approach to combine bridge and direct translation systems based on the bridge language technique. We first investigate the performance of bridge language strategies and, based on the conclusions extracted from this in-depth investigation, we propose the direct-bridge combination scenario to maximize the information gain by choosing the relevant portions of the bridge-based model that do not interfere with the direct-based model, which is indeed trusted more (a schematic sketch of this portion selection is given after this list). Bilingual text interpolation and phrase-table combination, as recent improvements on the bridge language technique to enhance the performance of minimal parallel-resource SMT systems, are presented as well.

• Proposing the round-trip training scenario by developing a learning framework based on the retraining approach. First, we analyse self-training as well as co-training to show how these weakly supervised learning algorithms can improve translation systems' performance, and how they affect our proposed scenario. Then, we explore how the proposed round-trip training scenario can automatically learn from unlabelled data through a training communication game. Finally, we implement the baseline translation system and improve it during each step of our round-trip training game. As a result, we obtain optimised, high-quality translation outputs.
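To make the portion-selection idea in the direct-bridge objective above concrete, the sketch below assumes, purely for illustration, that a phrase table can be reduced to a set of (source phrase, target phrase) pairs; the category names are illustrative and not necessarily the exact labels defined later in the thesis.

# Illustrative sketch: split a bridge phrase-table into categories according
# to whether its source and/or target phrases already occur in the more
# trusted direct phrase-table.  Phrase tables are reduced to plain sets of
# (source phrase, target phrase) pairs; category names are illustrative.

def categorize_bridge_pairs(direct_pairs, bridge_pairs):
    direct_sources = {s for s, _ in direct_pairs}
    direct_targets = {t for _, t in direct_pairs}

    categories = {"both_known": [], "source_only": [],
                  "target_only": [], "both_new": []}
    for s, t in bridge_pairs:
        if s in direct_sources and t in direct_targets:
            categories["both_known"].append((s, t))
        elif s in direct_sources:
            categories["source_only"].append((s, t))
        elif t in direct_targets:
            categories["target_only"].append((s, t))
        else:
            categories["both_new"].append((s, t))
    return categories

A combination policy in this spirit keeps the direct model untouched and adds only those bridge portions that do not compete with it, for instance the pairs whose phrases are absent from the direct model.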

We are interested in obtaining a high-quality SMT system with competitive performance, without requiring a large training dataset, especially for minority languages.

1.3 Thesis Outline

In addition to this introductory chapter, this thesis is structured into five more chapters:

• Chapter 2 introduces relevant theoretical and mathematical background from the machine translation and statistical machine translation literature. From the machine translation perspective, this chapter gives an introduction to machine translation approaches, benefits, and limitations. For statistical machine translation, it provides background on how the models' feature functions and parameters are learned, how the models' output is evaluated, and which tools and software packages are used in the experiments and evaluation process. Furthermore, this chapter presents a review of the state of the art in minimal parallel-resource statistical machine translation. The basic concepts and modules related to minimal-resource statistical machine translation are reviewed, and then more recent research lines are examined, with special attention to the two main topics of this thesis: learning frameworks and the pivoting (bridging) framework.

• Chapter 3 provides a detailed comparison between the performance of the Classical and Hierarchical phrase-based translation models in statistical machine translation systems, using the Moses and Cdec open-source translation systems under the same conditions. In the experimental framework three language pairs are evaluated: Spanish↔English as well as English↔Persian and Persian↔Spanish. For the purpose of a detailed comparison, we apply various n-gram statistical language models to each translation direction, and the comparative results and performances determine the best option for each translation direction.

• Chapter 4, as one of the contributions of this thesis, addresses the general framework of the bridge (pivot) language technique as a common solution to the lack of parallel data. In this chapter we discuss various strategies of the bridge language technique as well as proposed improvements to its performance. We also propose a combination technique of direct and bridge statistical machine translation models to enhance translation quality. Our experimental results show that the proposed combination scenario can lead to a large reduction of the bridge model without affecting the performance, if not enhancing it.

• Chapter 5 presents the main contribution of this thesis. This chapter explores the use of monolingual data for training the translation system in order to improve translation quality by proposing the round-trip training scenario. First, an in-depth analysis of bootstrapping methods is conducted; we investigate the behaviour of the self-training mechanism as well as the co-training mechanism, and link them to our proposed scenario. Then, the proposed optimized round-trip training theory, mechanism, and algorithms are presented, respectively. Finally, detailed experimental evaluations on the Spanish↔English (as a well parallel-resourced language pair) and Persian↔Spanish (as a low parallel-resource language pair) translation tasks are provided.

• Chapter 6 summarises the major contributions and results of this thesis and proposes research lines for future work.

Chapter 2

Background

In this chapter, we introduce relevant technical background from the Statistical Machine Translation (SMT) literature. From the Artificial Intelligence (AI), Computational Linguistics (CL), and Natural Language Processing (NLP) perspective, this chapter mainly introduces the various approaches to Machine Translation (MT), and provides background on how the model's feature functions and parameters are acquired, how the model's output is assessed, and which tools and software packages are used in the experiments of this thesis.

2.1 Computational Linguistics

A new branch of applied linguistics arose, namely computational (or engineering) linguistics, in the latter half of the twentieth century, when the booming availability of machine-readable corpora opened up new methods for studies in areas such as lexical knowledge acquisition, grammar construction, and machine translation. Although statistical and probabilistic methods used to discover and organize data were very common in the speech community, they were relatively new to the field at large. Therefore, various actions were undertaken to locate and collect machine-readable corpora in order to recognize the potential use of this data and also to work toward making these materials accessible to the research community. Intelligent NLP is based on the science called computational linguistics, which is closely connected with applied linguistics and with linguistics in general. CL, a broad field incorporating research and techniques for processing language, can be considered equivalent to the automatic processing of natural language, since the main task of computational linguistics is precisely the construction of computer programs to process words and texts in natural language.

2.2 Natural Language Processing

The field of NLP targets the interactions between human language and computers. It sits at the intersection of computer science, artificial intelligence, and computational linguistics. NLP provides methods that help computers analyse, understand, and derive meaning from human language in a smart and useful way. By exploiting NLP, developers are able to organize and structure knowledge to perform tasks such as automatic summarisation, translation, named entity recognition, relationship extraction, and topic segmentation. In other words, NLP is an area of research and application that seeks to discover in what way the use of computers can help in understanding and manipulating natural language text or speech so that useful things can be performed. The basic aim of NLP researchers is to gather knowledge on how human beings understand and use language so that appropriate tools and techniques can be developed. These techniques and tools are then used to make computer systems understand and manipulate natural languages so that they can perform the desired tasks. The foundations of NLP can easily be found in a number of disciplines such as computer and information sciences, linguistics, mathematics, electrical and electronic engineering, artificial intelligence and robotics, psychology, etc. However, in order to apply NLP we must be acquainted with a number of fields of study including machine translation, natural language summarization, user interfaces, multilingual and cross-language information retrieval, speech recognition, artificial intelligence and expert systems, and so on.

NLP is the natural choice for analysing a text or allowing machines to understand human speech. This type of human-computer interaction permits real-world applications like automatic text summarisation, sentiment analysis, topic extraction, named entity recognition, part-of-speech tagging, relationship extraction, and more. NLP is also widely used for machine translation and other automated language-processing tasks. Besides, NLP is known to be a hard problem in computer science. Human language is rarely precise or plainly spoken. To understand human language is to understand not only the words, but also the concepts and how they are associated together to create meaning. Despite the fact that language is regarded as one of the simplest things for humans to learn, its ambiguity is what makes natural language processing a challenging problem for computers to master.

It is estimated that there currently exist around 7,000 languages spoken in the world (Nettle, 1998), while NLP research focuses on only a small number of them, such as English, Chinese, French, and German. Languages which have received relatively little attention from NLP are usually less popular, due to their lack of available resources, and are often called low-resource. Languages with an abundance of NLP resources and tools usually owe this to political, financial, and social reasons. Although, economically speaking, focusing money and effort on the most widely spoken languages makes sense, it is difficult for researchers to produce significant resources for current low-resource languages without funding. The availability of NLP tools, such as machine translation (MT), may encourage speakers of low-resource languages to continue to use their language rather than abandon it in favor of a majority language. From a commercial, military, and political point of view, MT has widespread applications. For example, the Web is increasingly accessed by non-English speakers reading non-English pages; our ability to find relevant information should clearly not be bound by our language-speaking capabilities. Furthermore, there may not be enough linguists in some language of interest to cope with the sheer volume of documents to be translated. Enter automatic translation. MT poses a number of interesting machine learning challenges, such as the following:

• The associated models are typically very large, as are the data sets;
• The available training material is often noisy and plagued with sparse statistics;
• The search space of possible translations is sufficiently large that exhaustive search is not possible.

Advances in machine learning, such as maximum-margin methods, periodically emerge in translation research.

2.3 Minimal-Resource Languages

The documentation and description of a language are tasks focused on the collection and analysis of language samples. After collection, these samples are analysed so that the properties of the language can be described properly. Generally speaking, these two tasks are right in the middle of a slow transition from older physical storage methods to electronic digital storage. This is usually understood to be the domain of linguists. Even though limited research has been conducted so far on the necessary methods, NLP and CL are in a great position for potential collaboration on these tasks. NLP methods created for minimal-resource languages are likely to encounter issues similar to those faced by documentary and descriptive linguists working on minority languages. It is highly informative to lay out these issues for NLP researchers so that they know what to expect while dealing with these types of languages.

The first and one of the biggest issues with minimal-resource languages is that obtaining the resources is extremely difficult. Most of the available language descriptions exist either on paper or in unpublished form. The little that does exist in electronic form is often in an unusable or unusual format (Bird and Simons, 2003). As a consequence, obtaining and using even the raw text of a minimal-resource language can become impossible.

The second issue is that the orthographies of minimal-resource languages are often not standardized. Several points indicate this weakness: word boundaries might not be standardized, spellings may differ, and even the language usage itself might not be consistent across multiple speakers. Although various languages have a partial dictionary or certain word lists, it must be kept in mind that most of these dictionaries and word lists have been created by foreign linguists, so they do not represent a standardized spelling.

The third issue is that, while the links between dialects and their relationships are well understood for the most prominent languages of the world, they are not clear for less studied languages. Mutual intelligibility is mostly used as the measuring stick for determining the link between two varieties (whether they are separate languages or just separate dialects), and even this step can be considered subjective to a certain extent.

As a result of the above-mentioned issues, substantial differences can easily be detected between two different sources of text belonging to the same language. These differences can confuse NLP researchers and make them question whether their tools are even applicable to the texts at hand. The greatest benefit of text-based methods is the amount of written text available even in minimal-resource languages; still, it must be kept in mind that most of the world's languages exist mainly in unwritten form. Linguists can benefit from those languages that possess a writing system, but at the same time the literacy rates of native speakers often remain low. Thus, any text-based approach will be fundamentally limited in its scope.

All the work done on NLP for minimal-resource languages so far can be divided into two major categories:

1. Approaches that focus on a small set of languages.

2. Approaches that are applied to a large set of languages.

In the first approach, emphasis is put on a single language or a small set of related languages. The application of this approach starts with a data collection phase, at which point text or speech is compiled in the languages of interest, and an NLP tool is produced during the procedure. The benefit of these approaches is that they do give positive outcomes, but the biggest issue is that they require aid from an expert, and they are not immediately applicable to other languages.

The most effective example of an approach focused on a very large set of languages is Kevin Scannell's Crubadan project (Scannell, 2007). After crafting Web search queries designed to return Web pages in specific minimal-resource languages, the project succeeded in building corpora for 1,872 different languages. A number of different tools and resources for minimal-resource languages were developed from these corpora, such as thesauri, diacritic restoration, and automatic translation. However, one weakness of Scannell's approach is that it relies heavily on the manual effort of expert volunteers; without this manual effort, these resources and tools cannot be created.

Some other examples of the many-language approach include:

• The proposed Human Language Project: this project provides a complete description of a common format for annotated text corpora, and issues the challenge of creating a universal corpus covering all of the world's languages (Abney and Bird, 2010).

• The Leipzig Corpora Collection: this project has created corpora for 124 distinct languages. It further offers statistics about each of these languages, for instance word frequencies and contexts, as well as dictionaries, even though the texts cannot be distributed due to copyright (Biemann et al., 2007).

2.4 Machine Translation

Machine translation (MT) may be defined as the use of a computer to translate a text from one natural language (the source language) into another one (the target language). MT is regarded as difficult mainly because natural languages are highly ambiguous and because two languages do not always express the same content in the same way. MT carries plentiful advantages over traditional professional human translation. Ordinarily, it is very easy to use MT systems: since translation is performed quickly and usually on-demand, it is much more convenient to use this method rather than a human translator. On top of this, the cost factor should also be considered; professional human translation is usually costly, not available on-demand, and, when it is, requires much more time to complete.

General drawbacks of MT systems include the fact that translation output is usually lacking in accuracy to some degree, especially when focused on a particular domain. Translations of technical or scientific writings are frequently inaccurate, unless the system has been trained on how to handle data from that particular domain. MT accuracy is not guaranteed: if a poor-quality translation is generated, there is generally no real way of recognizing it, unless someone speaking both the source and target languages compares and evaluates them. This fact can pose some problems, particularly when translating sensitive or private documents.

Spoken language sentences tend to be long and complex, and often contain grammatically unpredictable constructions; they may even hold unwanted noise and grammatical errors. These factors, together with the task of finding suitable ways to deal with names and technical terms across languages with different alphabets and sound inventories, make machine translation of natural language a challenging task. Developing techniques to find the meanings of unknown words in context is also a demanding problem in both text and speech translation. Many words have several meanings and, consequently, different possible translations. In some languages, such as Japanese or Korean, not even the word boundaries are given. Moreover, certain grammatical relations in one language might not exist in another, and sentences involving these relations have to be significantly reformulated. Furthermore, there are non-linguistic factors that may need to be considered in order to perform a translation, such as knowledge of cultural history and cultural etiquette.

To perform MT accurately, various dependencies have to be taken into account. Often, these dependencies are weak and vague, which makes it rarely possible to describe simple and relevant rules that hold without exception in the translation process. Linguistically speaking, several types of dependencies have to be considered: morphological, syntactic, semantic, and pragmatic dependencies (Jurafsky and Martin, 2000). More specifically, there are dependencies relating source and target language words, describing that certain words or phrases can be translated into each other, and dependencies relating only target language words, describing the well-formedness of the produced translation. To develop an MT system, a general framework must be found that is able to deal with these weak and vague dependencies. Having acquired such a framework, methods that efficiently obtain the large number of relevant dependencies must be developed (Och and Ney, 2002).

Large-scale NLP demands the integration of vast amounts of lexical, grammatical, and conceptual knowledge. A robust generator is expected to operate well even when some pieces of knowledge are missing; it must also be resistant to incomplete or inaccurate input. There are two basic issues that machines encounter while dealing with natural languages. The first is related to context and cultural issues: computers are unable to perceive the contextual and pragmatic information that humans can, and they are similarly unaware of cultural differences that often surface in linguistic exchanges. The second issue relates to the function of language. Conveying meaning is just one of the applications of human language; there are many others as well, such as humour, establishing solidarity, sharing emotions and feelings without needing to convey any actual information, as well as plays, poetry, advertising, and song lyrics, which are difficult to translate even for humans. It follows that computers encounter great difficulty in providing quality translations for such pieces. Ambiguity, idioms, differences in vocabulary, collocations, and structural and lexical differences between the source and target languages are further difficulties that an MT system has to deal with (Gross, 1992).

The kinds of linguistic errors that might be expected in the raw output of fully automated machine translation can be classified into two groups: vital errors, which impede accurate translation of the meaning, and errors that merely affect the general fluency and readability of the text without changing or subtracting from the intended meaning. Despite some of the negative aspects of machine translation, MT systems also possess diverse qualities that make them very attractive. Machines are usually consistent both in interpretation and vocabulary; they do not omit words or paragraphs accidentally, and they do not draw the erroneous conclusions that even competent human translators can make. According to Gross (1992), compared to human translators, machines for the most part have the potential to be faster, more economical, and to provide translations with a greater degree of accuracy. This is particularly true if the machines are limited to a specific subject domain, just as human translators are. People's overall mistrust of and uncertainty about computers and technological advances may, he states, be the major reason behind their scepticism towards machine translation, rather than legitimate criticism of MT.

The different ways in which the MT problem has been approached may be classified according to the nature of the knowledge used in the development of the MT system. From this point of view, one can distinguish between rule-based and corpus-based approaches, although hybrid approaches between them are also possible.

2.4.1 Rule-Based Machine Translation

This approach applies knowledge in the form of rules explicitly coded by human experts trying to describe the process of translation. In a Rule-Based Machine Translation (RBMT) system, the original text is first analysed morphologically and syntactically in order to obtain a syntactic representation. This representation can then be refined to a more abstract level, emphasizing the parts relevant to translation and ignoring other types of information. The transfer process then converts this final representation into a representation of the same level of abstraction in the target language. An RBMT system may be regarded as either an Interlingua or a Transfer-Based MT system.

One of the more classic approaches to machine translation is interlingua MT. In this approach, the source language is transformed into an interlingua, that is, an abstract language-independent representation, and the target language is then generated from the interlingua. In the transfer-based approach, the source language is transformed into an abstract, less language-specific representation. Linguistic rules specific to the language pair then transform the source language representation into an abstract target language representation and, from this representation, the target sentence is generated.

During development, rule-based MT systems are easier to diagnose, and the translation errors they produce have a repetitive nature, making them more predictable, easier to post-edit and, consequently, better suited for dissemination purposes. Within the rule-based MT paradigm, the interlingua approach can be an alternative to the Direct1 and transfer approaches (Leavitt et al., 1994).

2.4.2 Corpus-Based Machine Translation

Corpus-Based Machine Translation (CBMT) approaches use large collections of parallel texts (corpora) as the source of knowledge from which the engine learns how to perform translations. Three basic types of corpus-based approaches to the MT problem have been established:

1. Example-Based Machine Translation (EBMT).

2. Statistical Machine Translation (SMT).

3. Neural Machine Translation (NMT).

In EBMT (Nirenburg et al., 1994), translation is performed by analogy: given a source sentence and a parallel corpus, EBMT aims to find the best match for the source sentence in the parallel corpus and retrieves its target part as the translation. In SMT (Lopez, 2008), translations are generated on the basis of statistical translation models whose parameters are learned from parallel corpora. In NMT (Bahdanau et al., 2014), the key role in generating the translation is played by neural networks.

Corpus-based MT systems require large amounts of parallel corpora, in the order of millions of sentences, to reach a reasonable translation quality in open-domain tasks. Such a vast amount of parallel data is simply not available for most minimal parallel-resource language pairs that demand MT services, such as Persian-Spanish.

2.4.3 Hybrid Machine Translation

Approaches that integrate more than one MT paradigm are receiving increasing attention. The METIS-II MT system (Dirix et al., 2005) is an example of hybridization around the EBMT framework; it avoids the usual need for parallel corpora by applying a bilingual dictionary and a monolingual corpus in the target language. An example of hybridization around the RBMT paradigm is given by Oepen et al. (2007), who integrate statistical methods within an RBMT system to choose the best translation from a set of competing translations generated by rule-based methods. In SMT, Koehn et al. (2003) integrate additional word-level annotations into the translation models in order to better learn aspects of the translation that are best explained on a morphological, syntactic or semantic level.

In the rest of this thesis, we focus on statistical machine translation and explain how SMT works.

1In the direct approach, words are translated directly without passing through an additional representation.

2.5 Statistical Machine Translation

In recent years, MT has been dominated by statistical approaches, which aim to learn segmentation or tokenization, translation, and recombination decisions from large collections of previously translated texts.

The basic idea in SMT is that we can learn to translate from a corpus of translated text by looking at translation frequencies. If a word or sentence in one language is consistently paired with the same word or sentence in the other language, this indicates that the two are good translations of each other. We formalize this expectation that frequent translations are good translations through probabilistic models.

SMT holds a number of significant benefits in comparison to traditional paradigms, although these benefits alone are not sufficient to conclude that SMT is a superior system for a certain language pair; systematic evaluations and testing must be carried out to determine this. One benefit of the statistical approach is that SMT systems are not language-pair specific. The linguistic rules in rule-based translation systems require manual development, and a significant amount of work must be done defining vocabularies and grammars. These rules, vocabularies, and grammars are not easily mirrored to other languages, if at all (Brown et al., 1990). Most other MT approaches rely on linguistic rules to analyse the source sentence and map its semantic and syntactic structure into the target language.

The statistical approach employs algorithms to obtain data from existing translation compilations called bilingual corpora. These corpora are effectively huge aligned banks of phrases and words, and algorithms statistically determine the best translation output based on the phrases in the corpora. Hence, since the SMT approach is in reality based on the use of pre-existing aligned language pairs, its output should in theory be more reliable.

SMT is a decision problem: given the source language words and phrases as input, the target language words and phrases must be decided upon. This being the case, it is logical to solve the problem with methods from statistical decision theory, which leads to the statistical approach. The relationships between linguistic objects such as words, phrases or grammatical structures are often weak and vague; to model those dependencies, we need a formalism, such as the one offered by probability distributions, that is able to deal with them. To perform SMT, it is typically necessary to combine many knowledge sources, and SMT provides a mathematically well-founded framework for an optimal combination of these knowledge sources. In SMT, translation knowledge is learned automatically from example data and, as a result, the development of an MT system based on statistical methods is very fast compared to a rule-based system. SMT is also well-suited for embedded applications where MT is part of a larger application. The correct representation of syntactic, semantic and pragmatic relationships is not known; hence, where possible, the formalism should not rely on constraints induced by such hypothetical levels of description. Instead, in the statistical approach, the modelling assumptions are empirically verified on training data.

One aspect of SMT that is an indisputable advantage over rule-based approaches lies in SMT's adaptability to different domains and languages. In general, once a functional system exists, all that is required to implement it for other language pairs or text domains is to train it on new data. While SMT has proved to yield promising results when large amounts of data are available, rule-based systems can function more efficiently where the source sample is smaller.

SMT involves two individual processes known as training and decoding. In the training process, a statistical translation model is extracted from an aligned parallel corpus, and a separate statistical model is extracted from a monolingual corpus in the target language. The decoding process is the one that generates the translation: the decoder receives the input, a phrase, a sentence or several sentences, and searches through all the available translations of the input produced by the translation model. Afterwards, the translation with the highest probability under the language and translation models is nominated as the most likely translation and is output in the target language.

The two important features of SMT are the Translation Model (TM) and the target Language Model (LM), whose scores are multiplied together. An independent modelling of the target language model and the translation model is obtained via the noisy-channel model. The target language model expresses how well-formed the target language sentence is, while the translation model links the source language sentences and the target language sentences to each other. The translation model score, broadly based on lexical correspondences, depicts how well the meaning of the source sentence is captured in the translation. The language model score is based on the frequency of occurrence of sub-strings in a monolingual corpus of the target language, and acts independently of whether the original meaning of the source language sample is captured; the score simply represents the probability of the translation being a valid sentence in the target language.
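As a sketch, the standard noisy-channel formulation underlying this decomposition can be written (with s a source sentence and t a candidate translation):

\hat{t} = \operatorname*{arg\,max}_{t} P(t \mid s) = \operatorname*{arg\,max}_{t} P(s \mid t) \cdot P(t)

Here P(s|t) plays the role of the translation model and P(t) the role of the target language model.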

The two common approaches to language modelling are n-gram models and, more recently, neural network models. There are various approaches to translation modelling; these models are categorized as follows:

• Word-based translation model: the early SMT models are word-based and use words as translation units; they were proposed by Brown et al. (1993) at IBM as the five IBM Models 1-5 (Brown et al., 1990; Brown et al., 1993). Later, Och and Ney (2003) proposed Model 6.

• Classical phrase-based translation model: Och (1999) first proposed to use sequences of words (i.e. phrases) as translation units rather than single words. These phrase-based models have been shown to produce remarkably better translations (Koehn et al., 2003; Marcu and Wong, 2002; Och and Ney, 2004). Although phrase-based models address the problems of local reordering and idiomatic expressions found in word-based models, they are unable to model long-distance reordering.

• Hierarchical phrase-based translation model: these models try to address the problem of complex reordering in phrase-based models by considering the structure of sentences. Chiang (2005) first proposed to use hierarchical phrases in the form of a Synchronous Context-Free Grammar (SCFG) (Aho and Ullman, 1969). This approach has become dominant for the translation of language pairs with complex reordering.

• Syntax-based translation model: these models incorporate the linguistic syntax of sentences in translation. Syntax-based models can be divided into two groups: synchronous-grammar-based models and tree-transducer-based models. Many synchronous grammars have been used in MT: SCFG (Zollmann and Venugopal, 2006), synchronous tree-substitution grammars (Eisner, 2003), synchronous tree-adjoining grammars (Deneefe and Knight, 2009), and generalized multi-text grammars (GMTG) (Melamed et al., 2004). On the other hand, tree transducers have been used to create several syntax-based SMT models (Galley et al., 2006a).

• Tree-based translation model: this model is basically used in the case of synchronous grammars, where a syntactic tree assists in mapping linguistic structures and in context-dependent word translation. However, a tree-bank (Tinsley et al., 2009) is required as a resource for the whole translation process. This is the reason why less informative models were proposed, such as tree-to-string (Liu et al., 2007) and string-to-tree (Neubig and Duh, 2014) models, or models without any linguistic information, including the hierarchical phrase-based model. A tree-based model with full linguistic information can realistically be planned only for rich-resource languages with a suitable tree-bank; otherwise, one has to settle for a less informative model, or a model without linguistic information, when dealing with minimal-resource languages.

Generally speaking, the translation model score influences the final score about twice as much as the language model score, since it is the more significant parameter. From the combination of the translation model score and the language model score we obtain the final score, which depicts the best combination of scores giving the optimum target sentence.

2.6 Parallel Corpus Alignment

Alignment is generally classified according to the level at which it is performed. For instance, a parallel corpus aligned at sentence level refers to the alignment of sentences: the number of phrases and words may vary between the two languages, but the sentences themselves are linguistically equivalent. Alignment at word level refers to equating words, and alignment at phrase level refers to equating phrases. The word alignment process links words, phrases or sentences of equivalence between the two sides of a parallel corpus, and aligning a parallel corpus is required before a training model can be generated. Word alignment is based on a dictionary approach, and since word equivalency alone is the only parameter observed, the meaning of the phrase or sentence as a whole may be changed somewhat, or at best have its fluency greatly impaired.

In SMT, all possible alignments between sentence pairs are examined, and the most likely arrangement is determined. The most important factor in determining the probability of a certain alignment is to what degree the aligned words are linguistically equivalent. A significant amount of this information is contained within the sentence-aligned data.

Dempster et al. (1977) developed the Expectation Maximization (EM) algorithm, an iterative algorithm that enables the systematic identification of word alignments for which there is substantial evidence in the parallel corpus alone. Each iteration of the algorithm involves two steps, the expectation (E-step) and the maximization (M-step). In the E-step, each alternative word alignment of each sentence pair in the corpus is assigned a probability based on the word pair probabilities defined in the model. The M-step uses the probabilities of the corpus-specified word alignments to compute new probabilities for each word pair in the model. The model is then updated using these new probabilities and, in effect, the probabilities of the model are re-evaluated based on the number of occurrences of the word pairs in the set of word alignments. Iterations are repeated until the estimates stop improving.
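A toy Python sketch of this EM procedure in the spirit of IBM Model 1 follows; the corpus, the absence of a NULL word, and the fixed number of iterations are illustrative assumptions, not the training recipe used in this thesis.

from collections import defaultdict

# Toy sentence-aligned corpus: (source tokens, target tokens) pairs (illustrative only).
corpus = [
    (["la", "casa"], ["the", "house"]),
    (["la", "casa", "verde"], ["the", "green", "house"]),
    (["casa"], ["house"]),
]

# Initialise t(target | source) uniformly over co-occurring word pairs.
t = defaultdict(float)
targets_per_source = defaultdict(set)
for src, tgt in corpus:
    for s in src:
        targets_per_source[s].update(tgt)
for s, tgts in targets_per_source.items():
    for w in tgts:
        t[(w, s)] = 1.0 / len(tgts)

for _ in range(10):                      # EM iterations
    count = defaultdict(float)           # expected counts c(w, s)
    total = defaultdict(float)           # expected counts c(s)
    # E-step: collect expected counts over all alignments (Model 1 sums them in closed form).
    for src, tgt in corpus:
        for w in tgt:
            z = sum(t[(w, s)] for s in src)      # normaliser over source words in this sentence
            for s in src:
                p = t[(w, s)] / z
                count[(w, s)] += p
                total[s] += p
    # M-step: re-estimate translation probabilities from the expected counts.
    for (w, s), c in count.items():
        t[(w, s)] = c / total[s]

print(round(t[("house", "casa")], 3))    # probability mass concentrates on the correct translation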

The algorithm used in word alignment gives different results depending on the direction of alignment: an alignment run with English as the source language and Persian as the target language will differ in a number of points from one with Persian as source and English as target. The alignment algorithm is able to produce alignments of a single source word to a single target word, and of a single source word to multiple target words; however, it is unable to align multiple-to-single or multiple-to-multiple. In the training process of an SMT system, word alignment therefore takes place in both directions, so that single-to-multi alignments are extracted in both directions. Multiple-to-multiple alignments are extracted using phrase-alignment heuristics (Och, 2003), which work on the output of the word alignment algorithm. In this operation, word alignment is first carried out on each training sentence in both directions and the output is represented as a bilingual text; the word alignment sets are then refined by removing alignments occurring in only one set.
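A minimal sketch of this refinement step keeps only the links found in both alignment directions (a plain intersection; toolkits such as Moses apply richer grow-diag-final heuristics on top of this, and the link sets below are toy assumptions):

# Word alignments as sets of (source_index, target_index) links, both flipped to the same orientation.
src_to_tgt = {(0, 0), (1, 1), (2, 3), (3, 2)}   # e.g. English -> Persian direction
tgt_to_src = {(0, 0), (1, 1), (2, 3)}           # Persian -> English direction

# Keep only the links supported by both directions.
symmetric = src_to_tgt & tgt_to_src
print(sorted(symmetric))                        # [(0, 0), (1, 1), (2, 3)]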

2.7 Translation Model Training

The translation model, P(t_1^I, s_1^J), represents the probability that the source and target sentences are linguistically equivalent, in other words, that the target sentence accurately conveys the meaning of the source sentence. The translation model consists of the source-target training corpus2 and an algorithm for calculating the equivalence of source and target sentences. Extracting a translation model from a parallel corpus involves word-aligning the data using the GIZA++ (Och and Ney, 2003), MGIZA (Gao and Vogel, 2008), or fast-align (Dyer et al., 2013) tool-kits, and extending those alignments to cover phrases. Phrase pairs are then extracted with phrase lengths of 1 to n3 words. In many cases the number of words in each aligned phrase may vary between the source and target language, depending on how each language represents the meaning of the phrase.
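As an illustration of the phrase-extraction idea, the following simplified Python sketch enumerates phrase pairs that are consistent with a given word alignment (brute force over spans, without the unaligned-word extensions used by real extractors; the example sentence pair and alignment are assumed toy values):

def extract_phrases(src, tgt, alignment, max_len=3):
    """Return phrase pairs consistent with the word alignment (no link crosses the pair's boundary)."""
    pairs = []
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to the source span [i1, i2].
            linked = [j for (i, j) in alignment if i1 <= i <= i2]
            if not linked:
                continue
            j1, j2 = min(linked), max(linked)
            if j2 - j1 + 1 > max_len:
                continue
            # Reject the pair if some target word inside [j1, j2] is aligned outside [i1, i2].
            if any(j1 <= j <= j2 and not (i1 <= i <= i2) for (i, j) in alignment):
                continue
            pairs.append((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs

# Toy example: "la casa verde" / "the green house", aligned la-the, casa-house, verde-green.
src = ["la", "casa", "verde"]
tgt = ["the", "green", "house"]
alignment = {(0, 0), (1, 2), (2, 1)}
for pair in extract_phrases(src, tgt, alignment):
    print(pair)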

SMT relies on two main resources: a parallel corpus and a monolingual corpus. The monolingual corpus in the target language is used to generate the language model, while the parallel corpus is needed to generate the training model,

2This corpus is aligned at sentence level. 3n is chosen as a maximum such that the system is presented with phrases that are actually feasible to work with.

which is searched by the translation model P(t_1^I | s_1^J) for aligned phrases. The parallel and monolingual corpora are collectively referred to as the training data. Once the training process is finished, the corpora themselves are no longer required for any further processing.

2.8 Language Model Training

The language model (LM) is applied by the decoder in order to determine the fluency and validity of candidate target phrases or sentences. In this way, the probability P(t_1^I) of a target phrase or sentence can be checked against the language model. The language model, extracted from a corpus, gives the frequency of sub-strings in that corpus; the probability of an input sentence is determined from the sub-strings of that sentence compared to those in the model. Different language models are defined as follows:

• Uni-gram language model: one basic language model, known as the uni-gram model, may simply be composed of single-word sub-strings, based on the corpus word tokens. The probability of a word type is given by taking the total number of times that word occurs in the corpus and dividing it by the total number of word tokens found in the corpus. However, this model has significant limitations. Since it operates only on single word types, it shows unwanted characteristics, such as the tendency to score shorter sentences higher than others, because short sentences contain fewer probability factors. It also incorrectly assigns high probabilities to grammatically incorrect sentences, such as sentences with repeated words. On the other hand, a zero probability is assigned to a sentence containing a word unknown to the model. One simple way to mitigate the issue of unknown words is to increase the size of the corpus the model is trained on, thus increasing the model's vocabulary. However, since it is highly unlikely that all, or even a high percentage, of the words of a language are included in the corpus used, this method alone is inadequate. For this reason, smoothing techniques (Bahl et al., 1978) are also applied. These techniques assign a small probability score to sentences and phrases with unknown words, can distinguish sentences and phrases with greater numbers of unknown words from others, and can assign appropriate probabilities to them. In this way, each phrase and sentence is guaranteed a non-zero score.

• Bi-gram language model: a bi-gram LM is a model consisting of all bi-grams found in the corpus. Such language models operate on word sequences. Probabilities are defined by determining the likelihood of the second word of a bi-gram occurring given the first word: the probability is calculated by counting the number of occurrences of a particular bi-gram in the corpus and dividing that figure by the number of occurrences of the first word of the bi-gram.

• N-gram language model: larger models, known as n-gram models, are based on the same logic as the bi-gram model, but with sub-strings of length n (n-grams). The n-grams are strings of length n generated from words in texts. In traditional vector space approaches, the words or phrases that occur in a given collection of documents are considered as dimensions of the document space. In the n-gram approach, by contrast, n-grams are the dimensions of the document space, namely strings of n consecutive characters extracted from words. Since the number of possible strings of length n is distinctly smaller than the number of possible single words in a language, n-gram approaches therefore have smaller dimensionality (AleAhmad et al., 2009). The n-gram method is thus a remarkably pure statistical approach, one that measures statistical properties of strings of text in a given collection without regard to the vocabulary, or the lexical or semantic properties, of the natural languages in which the documents are written. The n-gram length n and the method of extracting n-grams from documents differ from one author and application to another (Mustafa, 2005). Both bi-gram and n-gram LMs, operating on any string length, still encounter issues with unknown n-grams. Generally speaking, the larger the n-gram model, the greater the issue becomes, as fewer occurrences are returned. Increasing the training corpus size helps slightly, and using smoothing techniques aids the probability scoring somewhat; however, even with smoothing it is difficult to determine whether the individual words in a previously unseen n-gram have already occurred in the training corpus. It can be seen, therefore, that there is a trade-off between flexibility and obtaining accurate word order. To make the best of this situation, assorted n-gram models are applied, each with different weights, and their scores are combined. In this way, it is more likely that a sensible probability can be obtained for a given sentence.

• Neural network language model: a neural network language model is a language model based on neural networks, exploiting their ability to learn distributed representations in order to reduce the impact of the curse of dimensionality. In the context of learning algorithms, when the number of input variables increases, the number of required examples can grow exponentially. In the context of language models, the problem comes from the huge number of possible sequences of words; e.g., for a sequence of 10 words taken from a vocabulary of 100,000 words there are 10^50 possible sequences.

A statistical language model is simply a probability distribution over sequences of words: given a sequence of length m, a probability P(w_1, ..., w_m) is assigned to the whole sequence. In NLP applications it is very useful to estimate the relative likelihood of different phrases, especially when generating text as output. In an n-gram model, the probability P(w_1, ..., w_m) of observing the sentence (w_1, ..., w_m) is approximated as:

P(w_1, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \ldots, w_{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1})    (2.1)

Here we make the assumption that the probability of observing the i-th word w_i given the context history of the preceding (i − 1) words can be approximated by the probability of observing it given the shortened context history of the preceding (n − 1) words. The n-gram frequency counts are turned into conditional probabilities as follows:

P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1}) = \frac{\mathrm{Count}(w_{i-(n-1)}, \ldots, w_{i-1}, w_i)}{\mathrm{Count}(w_{i-(n-1)}, \ldots, w_{i-1})}    (2.2)

The n-gram language models with n = 2 and n = 3 are referred to as bi-gram and tri-gram language models, respectively.
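As a toy illustration of the count-based estimate in Equation 2.2, the following Python sketch builds a bi-gram model with simple add-one smoothing (the training sentences and boundary symbols are assumptions for illustration):

from collections import Counter

sentences = [["<s>", "the", "house", "is", "green", "</s>"],
             ["<s>", "the", "green", "house", "</s>"]]

bigrams = Counter()
unigrams = Counter()
vocab = set()
for sent in sentences:
    vocab.update(sent)
    for w1, w2 in zip(sent, sent[1:]):
        bigrams[(w1, w2)] += 1
        unigrams[w1] += 1          # count of w1 as a bi-gram history

def p_bigram(w2, w1):
    # Add-one smoothed version of Equation 2.2: Count(w1, w2) / Count(w1),
    # so unseen bi-grams receive a small non-zero probability instead of zero.
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + len(vocab))

print(round(p_bigram("house", "the"), 3))   # a seen bi-gram
print(round(p_bigram("is", "green"), 3))    # an unseen bi-gram still gets a small probability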

2.9 Decoding

In SMT, decoding is a search process: the task of finding the best translation for a given source sentence among all possible translations according to the translation model. In other words, the decoding process, given a source sentence and a set of possible translations, determines the most probable translation. Instead of generating all possible translations for a given input, input sentence sub-strings are matched with translation model sub-strings, each individual translation is retrieved, and those translations are concatenated to produce the full translation. It is vital to maximize the number of probable hypotheses generated, and to avoid producing hypotheses unlikely to be chosen, as only a certain number of hypotheses can be generated in a given amount of time (Al-Onaizan et al., 1999). In summary, it is necessary to find the most probable translations in the given amount of time. Since decoding is an online task, preserving high translation accuracy with low translation time is essential for decoding algorithms. Decoders use language models to ensure that the output translation is grammatically correct, hence computing the language model score is a crucial part of the process, but also the most expensive one.

Currently, the leading decoding methods are implemented by beam-search decoders (Koehn, 2004). In this method, the system run-time is managed by setting a maximum number of hypotheses to be kept, known as the beam stack size, and this number is maintained during the decoding process. As new hypotheses are generated, they are added to the beam stack until the stack has reached the maximum number of hypotheses. At this point, if a new hypothesis has a higher score than the lowest-scoring hypothesis in the stack, it replaces that hypothesis, so that the maximum number in the stack is maintained. The scoring of hypotheses to determine whether they are added to the beam stack is based on several factors, such as the log-linear model score, and also on a cost estimation factor awarded to hypotheses, whose value depends on the difficulty of translating the parts of the sentence the hypothesis covers (Senellart and Koehn, 2010). In this way, sentences that are relatively easy to translate are not incorrectly awarded higher probability than those that are simply more difficult to translate. The final stage of decoding involves searching the beam stack containing the k-best list of candidate translations, where k is the source sentence length. The final sentence with the highest probability is selected and output as the chosen translation.
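The following Python sketch illustrates the stack-and-beam idea in a drastically simplified monotone setting (the tiny phrase table, probabilities, and beam size are assumptions; a real decoder such as Moses also handles reordering, language-model context, and future-cost estimation):

import math

# Toy phrase table: source phrase -> list of (target phrase, log probability); all values assumed.
phrase_table = {
    ("la",): [("the", math.log(0.8))],
    ("casa",): [("house", math.log(0.7)), ("home", math.log(0.3))],
    ("la", "casa"): [("the house", math.log(0.6))],
}

def beam_decode(source, beam_size=2, max_phrase_len=2):
    # One stack per number of covered source words; each hypothesis is (translation so far, log score).
    stacks = [[] for _ in range(len(source) + 1)]
    stacks[0].append(("", 0.0))
    for covered in range(len(source)):
        # Beam pruning: keep only the beam_size highest-scoring hypotheses in this stack.
        stacks[covered] = sorted(stacks[covered], key=lambda h: h[1], reverse=True)[:beam_size]
        for text, score in stacks[covered]:
            for length in range(1, max_phrase_len + 1):
                if covered + length > len(source):
                    break
                phrase = tuple(source[covered:covered + length])
                for target, logp in phrase_table.get(phrase, []):
                    stacks[covered + length].append(((text + " " + target).strip(), score + logp))
    complete = stacks[len(source)]
    return max(complete, key=lambda h: h[1]) if complete else None

# The single phrase <la casa -> the house> (0.6) beats composing "the" and "house" (0.8 * 0.7 = 0.56).
print(beam_decode(["la", "casa"]))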

2.10 Evaluation

It is obvious that human judgement of machine translation output is expensive and subjective, so automatic evaluation measures become vital. However, automatic evaluation of MT output is a challenging task. In most cases there is no single correct translation; furthermore, two correct translations of the same sentence can have completely different words and sentence structure. Exposing the deficiencies of the current automatic evaluation measures and proposing new measures is an area of active research in the SMT community. A quality metric should ideally fulfil several requirements: it should be reliable, objective, give repeatable results and, most importantly, produce results that are meaningful with regard to the quality characteristics being evaluated.

BLEU (bilingual evaluation understudy) (Papineni et al., 2002) is the most commonly used evaluation metric, developed by a team at IBM. The BLEU system awards a score between 0 and 1 depending on how close an MT output is to that produced by a professional human translator. BLEU evaluates MT performance by taking the system's translation of a reference text and comparing that output to the reference translations in terms of total translation length, word choice, and word order. The main score, the n-gram precision p_n, is based on the number of n-word sequences in the MT output compared to the number in the reference translation. The following equation is used to calculate p_n:

p_n = \frac{|C_n \cap r_n|}{|C_n|}    (2.3)

where C_n and r_n are the multisets of n-grams occurring in the candidate and reference translations, respectively. |C_n \cap r_n| represents the number of n-grams present in C_n that are also present in r_n, clipped so that the count in |C_n \cap r_n| is not greater than the number of occurrences in r_n, regardless of how many occur in C_n. This ensures that if a reference sequence occurs more often in the MT output than in the reference translation, the additional occurrences in the MT output do not affect p_n. As n increases, the n-gram precision scores can decrease rapidly, since the likelihood of longer word sequences occurring in both the MT output and the reference translation decreases. This may result in the p_n scores for higher values of n being too small to have any reasonable effect on the final score. This can be offset by combining the scores for all n values into a single score, which is computed by the following equation:

P = \exp\left( \sum_{n=1}^{N} \frac{1}{N} \log(p_n) \right)    (2.4)

where the sum of the logarithms of the individual scores is weighted by 1/N.

Where the reference translation is longer than the output translation, the final precision score is multiplied by a brevity penalty, BP, which is a decaying exponential based on the length of the reference sentence compared to the MT output sentence. In this way, single-word outputs such as "the" are not incorrectly given high scores. The brevity penalty is calculated using the following equation:

BP = e^{\min\left(1 - \frac{\mathrm{length}(R)}{\mathrm{length}(C)},\ 0\right)}    (2.5)

where R is the reference set and C is the candidate set. The final score is given by:

\mathrm{BLEU} = BP \cdot P    (2.6)

or, as suggested by Papineni et al. (2002):

\log(\mathrm{BLEU}) = \min\left(1 - \frac{\mathrm{length}(R)}{\mathrm{length}(C)},\ 0\right) + \sum_{n=1}^{N} \frac{1}{N} \log(p_n)    (2.7)
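A toy Python sketch of the clipped n-gram precision, geometric mean, and brevity penalty described above, for a single candidate and a single reference (the example sentences are assumptions; production BLEU scripts work at corpus level and over multiple references):

import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # Clipped counts: each candidate n-gram is credited at most as often as it appears in the reference.
        overlap = sum(min(count, ref[g]) for g, count in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    # Equation 2.4: geometric mean of the n-gram precisions, each weighted by 1/N.
    combined = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Equation 2.5: brevity penalty for candidates shorter than the reference.
    bp = math.exp(min(1 - len(reference) / len(candidate), 0))
    return bp * combined

candidate = "the house is green".split()
reference = "the house is very green".split()
print(round(bleu(candidate, reference, max_n=3), 3))  # ~0.54; with max_n=4 this toy pair scores 0 (no 4-gram matches)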

2.11 Decoding Software Packages

Several subtasks and algorithm implementations in SMT, and even complete software tools, can be applied to set up a fully-featured state-of-the-art SMT system. Moses is a fully-featured, open-source phrase-based SMT system developed at the University of Edinburgh (Koehn et al., 2007), which allows one to train translation models using GIZA++4 for any language pair for which a parallel corpus exists. Dyer et al. (2010) present a new open-source framework called Cdec, used for decoding, aligning and training with various SMT models, including rule-based, phrase-based and hierarchical phrase-based models. Several features make Cdec advantageous over other open-source decoders: being written in C++, it has the benefit of efficient memory usage and superior run-time performance, and, rather than being limited to the extraction of just k-best translations, it is also capable of extracting alignments to references. Whereas most MT systems use either finite-state transducers (FSTs), as in the phrase-based and lexical models of Moses, or SCFGs, as in the hierarchical phrase-based models of Joshua (Li et al., 2009) or Jane (Vilar et al., 2010), Cdec implements both classes of models and exploits their benefits individually. Dyer et al. (2010) also point out a significant shortcoming of existing phrase-based and hierarchical decoders, namely that they cannot easily be extended with new algorithms and models, since the translation, language model integration, and pruning algorithms are too closely coupled, which makes it difficult or impossible to experiment with different translation models.

2.12 State of the art

This section gives an overview of the research conducted on training statistical machine translation systems in under-resourced settings. Lopez (2008), Wu and Wang (2007), Ueffing et al. (2007), Haffari et al. (2009), and Irvine and Callison-Burch (2015) are excellent papers that provide in-depth reviews of the state of the art in SMT as well as in training minimal parallel-resource SMT. For this reason, the review provided here is not as complete, but is focused on the particular aspects that this thesis aims to cover. In this section we review some of the main approaches to minimal parallel-resource SMT, highlighting in particular semi-supervised learning, active learning, deep learning, and the pivot language technique.

Corpus-based approaches to automatic translation, for instance SMT systems, use huge amounts of parallel data. This parallel data is essentially created by humans for the purpose of training the mathematical models used in automatic translation. Generating parallel data at large scale for new language pairs requires intensive human effort and fluent bilingual translators; it is therefore nearly impossible to provide state-of-the-art SMT systems for rare languages. In SMT, both the hierarchical and the classical phrase-based translation models outperform the word-based translation model, and if these SMT systems are provided with large parallel training corpora they can produce high-quality translations.

4For the purpose of our experiments in this thesis, we apply the fast-align tool-kit instead.

Because using human annotation to create new parallel corpora large enough to build a good translation system is so expensive, such data is available only for a limited number of language pairs. With small quantities of training data, these systems mostly produce inferior translation output. The unfortunate point is that large quantities of parallel data are not available for a considerable number of language pairs, whereas various resources of monolingual text are available for many languages. This has triggered a new research challenge in SMT: training an SMT system when parallel text is not sufficiently available.

2.12.1 Learning Frameworks for Minimal Parallel-Resource SMT

The problem of learning from insufficiently labelled training data has been addressed in the machine learning community with the help of two general frameworks, Semi-Supervised Learning (SSL) and Active Learning (AL). The main idea of semi-supervised learning is to gather cheap and abundant (unlabelled) data, blend it together with the labelled data, and construct a high-quality mapping to labels. Active learning, on the other hand, focuses on reducing the amount of labelled data used for learning the high-quality mapping, by asking the user to label only those examples that are most informative, so that the mapping can be learned from fewer examples.

2.12.1.1 Semi-Supervised Learning

Supervised learning is a machine learning task that infers a function from supervised training data. The training data consists of a set of training examples; in supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). A supervised learning algorithm analyses the training data and produces an inferred function, commonly known as a classifier or a regression function. The inferred function should predict the correct output value for any valid input object; in this way, the learning algorithm is required to generalize from the training data to unseen situations. The parallel task in human psychology is known as concept learning.

Semi-supervised learning focuses on utilizing both labelled and unlabelled data so that learning performance can be improved. Recently, the issue of learning in the presence of both labelled and unlabelled data has attracted considerable attention. The SSL algorithm works as follows:

• First, the translation model is estimated based on the sentence pairs in the bilingual training data (L).

• Second, the set of source language sentences (U) is translated using the current model.

• Then, in each iteration, a subset of good translations together with their sources (T_i) is selected and included in the training data.

Algorithm 1 shows the procedure of the semi-supervised learning technique for SMT systems:

Algorithm 1 Semi-supervised learning for statistical machine translation
Input: Training set of parallel sentence pairs (L), unlabelled set of source text (U), development dataset (C), number of iterations (R), size of k-best list (K), additional bilingual training data (T_i).
1: repeat
2:   Training step: π^(i) := Estimating(L, T_{i−1}).
3:   X_i := {}.  // The set of generated translations for this iteration.
4:   U_i := Filter(U, C, i).  // The i-th chunk of unlabelled sentences.
5:   for each sentence s ∈ U_i do
6:     Labelling step: decode s using π^(i) to obtain the K best sentence pairs with their scores.
7:     X_i := X_i ∪ {(t_k, s, π^(i)(t_k|s))}_{k=1}^{K}
8:   end for
9:   Scoring step: S_i := Scoring(X_i).  // Assign a score to each sentence pair (t, s) in X_i.
10:  Selection step: T_i := T_i ∪ Selecting(X_i, S_i).  // Choose a subset of good sentence pairs (t, s) from X_i.
11:  i := i + 1.
12: until i > R
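A skeletal Python rendering of Algorithm 1 follows, in which the training, decoding, scoring, and selection steps are stand-in placeholder functions so that the control flow runs end to end; everything below is an illustrative assumption, not the concrete system used in this thesis.

import random

def estimating(labelled, extra):
    """Placeholder for translation-model training on L plus the selected pairs (assumption)."""
    return {"data": labelled + extra}

def decode(model, sentence, k=3):
    """Placeholder decoder: returns k pseudo-translations with pseudo-scores (assumption)."""
    return [(sentence.upper() + "_" + str(i), random.random()) for i in range(k)]

def scoring(translations):
    """Use the decoder score directly as the confidence score (one of the options discussed below)."""
    return translations

def selecting(scored, threshold=0.5):
    """Selection through a threshold: keep only (source, target) pairs whose score exceeds the threshold."""
    return [(s, t) for (t, s, score) in scored if score > threshold]

L = [("una casa", "a house")]                          # toy bilingual data (assumed)
U = ["la casa verde", "la casa", "una casa verde"]     # toy monolingual source text (assumed)
T = []                                                 # additional bilingual training data

for iteration in range(3):                             # R iterations
    model = estimating(L, T)                           # training step
    X = []
    for s in U:                                        # labelling step over the current chunk
        for t, score in decode(model, s):
            X.append((t, s, score))
    scored = scoring(X)                                # scoring step
    T = selecting(scored)                              # selection step: T is replaced in each iteration
print(len(T), "selected sentence pairs after the final iteration")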

The selected sentence pairs are replaced in each iteration; the bilingual training data is the only data that is kept fixed at every step of the algorithm. The generation of sentence pairs, the selection of a subset of good sentence pairs, and the updating of the model continue until a certain condition, the stopping condition, is met. We must keep in mind that this algorithm operates in a transductive setting: the set of sentences U is derived either from the test set used to evaluate the SMT system or from the development set. Changing the definitions of Estimating, Scoring, and Selecting in this algorithm yields different kinds of SSL algorithms.

Considering the probability model P(t|s), we must look at the distribution over all possible valid translations t of a particular input sentence s. For each sentence s in the unlabelled data, this probability distribution can be initialized to the uniform distribution, so that the distribution over translations of sentences from U has maximum entropy. The basis of this algorithm lies in minimizing the entropy of the distribution over translations of U; however, this holds only when Estimating, Scoring, and Selecting follow their prescribed definitions. The technique is run for a limited number of iterations, but the main focus is on finding useful definitions for Estimating, Scoring, and Selecting, since these definitions may help improve SMT performance. In the following we investigate these functions in depth:

1. The Estimating function: the following definitions of the Estimating function are considered in Algorithm 1:

• Complete retraining of all translation models: if Estimating(L, T) estimates the model parameters based on (L ∪ T), we obtain an SSL algorithm that retrains the model on the original training data L together with the sentences that have been decoded in the last iteration. If the size of L is an issue, it can be reduced by filtering the training data.

• Additional phrase-table: if a new phrase translation table is learned on T only and then included in the log-linear model as a new component, we obtain an alternative to the full retraining of the model on the labelled and unlabelled data, which can be very expensive if L is large. This additional phrase-table is much smaller and is very specific to the test set it is trained on; even though it overlaps with the original phrase-tables, it contains various new phrase pairs.

• Mixture model: another alternative for Estimating is obtained when the original phrase-table probabilities are interpolated with the new phrase-table probabilities:

P(s|t) = \lambda \cdot P_L(s|t) + (1 - \lambda) \cdot P_T(s|t)    (2.8)

In this case, P_L and P_T stand for the phrase-table probabilities estimated on L and T, respectively. The new phrase pairs learned from T become part of the merged phrase-table.

2. The Scoring function: this function assigns a score to each translation hypothesis t. The following scoring functions are commonly used:

• Length-normalised score: each translated sentence pair (t, s) is scored according to the model probability P(t|s), normalised by the length |t| of the target sentence:

\mathrm{Score}(t|s) = P(t|s)^{\frac{1}{|t|}}    (2.9)

• Confidence estimation: the confidence score of the target sentence t is calculated as a log-linear combination of word posterior probabilities, phrase posterior probabilities, and a target language model score, where the weights of the different scores are optimized with respect to Translation Error Rate (TER) (Snover et al., 2006). The phrase posterior probabilities are determined by summing the sentence probabilities of all translation hypotheses in the k-best list that contain the phrase pair, where the segmentation of the sentence into phrases is given by the decoder; this sum is then normalized by the total probability mass of the k-best list. To obtain the score for the whole target sentence, the posterior probabilities of all target phrases are multiplied. The word posterior probabilities are calculated on the basis of a normal alignment between the considered hypothesis and all other translations in the k-best list; again, the individual values are multiplied to derive the score for the whole sentence.

3. The Selecting function: this function is used to create the additional training data T_i, which is used in the next iteration (i + 1) by the Estimating function to augment the original bilingual training data. Some of the selection functions used in this process are:

• Importance sampling: the labelling step generates a list of translations, known as the k-best list, for each sentence s in the set of unlabelled sentences U, and the subsequent scoring step assigns a score to each translation in the k-best list. The set of generated translations for all sentences in U is regarded as the event space, and a probability distribution over this space is obtained by renormalizing the scores. The main idea of importance sampling is to select N translations from this distribution. Sampling is performed with replacement, meaning that the same translation may be selected more than once. The N sampled translations, together with their corresponding source sentences, form the additional training data T_i (a small sketch of this sampling step follows after this list).

• Selection through a threshold: in this method, the score of every 1-best translation is compared to a threshold. If the score of the translation exceeds the threshold, the translation is considered reliable and is included in the set T_i; otherwise it is discarded and does not become part of the additional training data. The threshold is optimized on the development set beforehand. Since the scores of the translations vary in each iteration, the size of T_i also varies.

• Keep all: this method does not involve any filtering. All translations in the set X_i are assumed to be reliable, so none of them are discarded and the result of the selection step in each iteration is (T_i = X_i). The main purpose of this method is to provide a point of comparison with the other methods.

In general, having more training data improves the quality of the trained models. However, when a particular test set is being translated, one has to consider whether all of the available training data is actually relevant to it, and working with large amounts of training data also requires considerable computational power. The computational complexity can be decreased if a relevant subset of the training data can be identified and used to retrain the models. This amounts to filtering the training data, either monolingual or bilingual text, in order to identify the parts related to the test set. This filtering is usually based on n-gram coverage: for each source sentence s in the training data, its n-gram coverage over the test set is computed, and the average over several n-gram lengths is used as a measure of the relevance of that training sentence. On this basis, the top N source sentences or sentence pairs are selected.
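As announced in the importance-sampling item above, a small Python sketch of that selection step might look as follows (the candidate pairs and scores are toy assumptions):

import random

# Scored candidate pairs (target, source, score) pooled over all k-best lists (toy values).
scored = [
    ("the green house", "la casa verde", 0.9),
    ("the house green", "la casa verde", 0.2),
    ("the house", "la casa", 0.7),
    ("house the", "la casa", 0.1),
]

total = sum(score for _, _, score in scored)
weights = [score / total for _, _, score in scored]     # renormalise scores into a distribution

N = 3
sampled = random.choices(scored, weights=weights, k=N)  # sampling with replacement
T_i = [(t, s) for (t, s, _) in sampled]                 # (target, source) pairs added as extra training data
print(T_i)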

The idea that an SMT system can learn something from its own output and then be improved by semi-supervised learning may raise intuitive doubts. Two main reasons lead to such improvement:

1. The selection step provides important feedback to the system. Confidence estimation, for example, discards translations with low language model scores or posterior probabilities. The selection step not only reinforces high-quality phrases but also discards bad machine translations; as a result, the probabilities of low-quality phrase pairs, such as overly confident singletons or noise in the table, degrade. The research conducted by Ueffing et al. (2007) showed that selection outperforms methods that simply keep all generated translations as additional training data, and the selection methods investigated in that work proved well-suited to boosting the performance of SSL for SMT.

2. The algorithm adapts the SMT system to a new style or domain without requiring any development data or bilingual training data. The phrases in the phrase-tables that are suitable for translating the new data are reinforced, so the probability distribution over the phrase pairs becomes focused on the parts that are relevant to the test data.

Semi-supervised learning has previously been applied to improve word alignments. In the research conducted by Callison-Burch et al. (2004), unsupervised learning on parallel data was used to train a generative model for word alignment, and a second model was trained on a small amount of hand-annotated word alignment data; a mixture model then provides the probability for word alignment. Experiments showed that placing a large weight on the model trained on labelled data performs best. In similar work, Fraser and Marcu (2006) combined a generative model of word alignment with a log-linear discriminative model trained on a small set of hand-aligned sentences. The improved word alignments contribute strongly to increasing translation quality when training a standard phrase-based SMT system.

2.12.1.2 Active Learning

In active learning, a few labelled examples are provided together with a very large set of unlabelled examples. The main idea is to arrange the unlabelled examples in an optimal order in which to present them to an external oracle for labelling, and then to retrain the underlying system to improve its performance. This continues in an iterative fashion until convergence, typically a threshold on the achievable performance, is reached or the unlabelled data set is exhausted. For an SMT model initially trained on bilingual data, the main problem is to minimize the human effort involved in translating new sentences, which are then added to the training data so that the retrained SMT model achieves an improved level of performance. Thus, given a bilingual text L and a monolingual source text U, the goal is to select a subset of highly informative sentences from U to present to a human expert for translation. Highly informative sentences are those whose translations enable the retrained SMT model to reach an improved level of translation quality.

The initial SMT system is not only trained on the bilingual corpus L, but is also used to translate all the monolingual sentences in U. The sentences in U are then paired with their translations, denoted U+. After that, the SMT system is retrained on (L ∪ U+) and the resulting model is used to decode the test set. A subset of highly informative sentences is then selected and removed from U; these sentences, together with their human-provided translations, are added to L.

Two types of phrase-tables can be learned during the retraining of the models:

1. A phrase-table learned from L.

2. A phrase-table learned from U+.

The phrase-table obtained from U+ is included in the log-linear translation model as a new feature. The alternative is to ignore U+, as is done in the conventional AL setting. If a phrase-table is not useful, it receives a weight of zero in Minimum Error Rate Training (MERT), so this method is considered more beneficial. It has also empirically proved to be more effective (Ueffing et al., 2007) than (1) using a weighted combination of the two phrase-tables from L and U+, or (2) combining the two sets of data and training on the bi-text (L ∪ U+). The procedure helps to investigate how one can take maximum advantage of the human effort spent on sentence translation while learning the SMT model from the available data, both monolingual and bilingual. Algorithm 2 shows the procedure of the active learning technique for statistical machine translation systems:

Algorithm 2 Active learning for statistical machine translation
Input: Training set of bilingual sentence pairs (L), training set of monolingual data (U).
1: M_{S→T} = train(L, {}).
2: for t = 1, 2, ... do
3:   U+ = translate(U, M_{S→T})
4:   Select k sentence pairs from U+, and ask a human expert for their true translations.
5:   Remove the k sentences from U, and add the k sentence pairs to L.
6:   M_{S→T} = train(L, U+).
7:   Evaluate the performance on the test set T.
8: end for
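A skeletal Python rendering of Algorithm 2 follows, with the training, translation, oracle, and selection steps stubbed out so the loop runs end to end; all function bodies are placeholder assumptions.

def train(L, U_plus):
    """Placeholder for SMT training on L, optionally with a second phrase table from U+ (assumption)."""
    return {"bitext": list(L), "self_labelled": list(U_plus)}

def translate(U, model):
    """Placeholder decoder producing pseudo-translations (assumption)."""
    return [(s, s[::-1]) for s in U]

def oracle(pairs):
    """Stand-in for the human expert who provides true translations (assumption)."""
    return [(s, "human translation of " + s) for (s, _) in pairs]

def select_informative(U_plus, k):
    """Stand-in selection strategy, e.g. by confidence or n-gram dissimilarity; here simply the first k."""
    return U_plus[:k]

L = [("una casa", "a house")]                           # toy bilingual data
U = ["la casa verde", "la casa", "una casa verde"]      # toy monolingual data
k = 1

model = train(L, [])
for t in range(2):
    U_plus = translate(U, model)
    chosen = select_informative(U_plus, k)
    L += oracle(chosen)                                 # add human translations of the chosen sentences
    U = [s for s in U if s not in {c[0] for c in chosen}]
    model = train(L, U_plus)
print(len(L), "bilingual pairs after active learning")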

Sentence selection strategies can be divided into two categories: 1. Those which look only at the source language and are independent of the target language.

2. Those which also depend on the target language. The AL scenario considered here is mainly based on the second category. • The utility of translation units: Phrases are the units of translation in phrase-based SMT models. The phrases that can potentially be derived from a sentence indicate how informative that sentence is: a sentence that offers more new phrases is more informative. Accurate estimates of the phrase translation probabilities are also needed, because sentences containing rare phrases are quite informative as well. When selecting new sentences for human translation, there is therefore a trade-off between exploration and exploitation: selecting sentences to discover new phrases versus selecting sentences to obtain more accurate estimates of phrase translation probabilities. A similar argument can be made at the level of words instead of SMT model phrases. We must also keep in mind

that smoothing is a means of obtaining accurate estimates of translation probabilities, especially when the events involved are rare.

• Similarity to the bilingual training data: The simplest way to expand the lexicon is to select sentences from U that are as dissimilar as possible to L. In this method, similarity is measured using weighted n-gram coverage (Ueffing et al., 2007).

• Confidence of translations: The decoder produces an output translation t with probability P(t|s). This probability is usually treated as a confidence score for the translation. To make the confidence scores of sentences of different lengths comparable, they can be normalised by sentence length if required (Ueffing et al., 2007).

• Feature combination: The idea is to gather the information from several simpler methods while producing the final ranking of sentences. We can either blend the output rankings of those simpler models, or use the scores they generate as input features for a higher-level ranking model. A linear model is used here:

W(s) = \sum_{n} \lambda_n \gamma_n(s)    (2.10)

The λ_n used in this linear model are the model parameters, and the γ_n are feature functions such as the confidence score and the utility score of the translation units.

• Reverse model: Since a translation system MS→T is built from language S to language T, a translation system in the reverse direction, MT→S, can be built as well. To measure how informative a monolingual sentence s is, we translate it to t with MS→T and then translate it back to S with MT→S, denoting the reconstructed sentence s′. Comparing s and s′ using BLEU or another measure shows how much information was lost by the direct and reverse translation systems. The sentences with the highest information loss are then given to a human to be translated.
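A minimal sketch of this round-trip scoring is given below; forward_translate, backward_translate and sentence_bleu are hypothetical placeholders for MS→T, MT→S and a sentence-level BLEU implementation, and are assumptions made only for illustration.

# Sketch of reverse-model sentence selection: score each monolingual sentence by
# the information lost in a round trip S -> T -> S, then pick the highest-loss ones.
def round_trip_loss(sentence, forward_translate, backward_translate, sentence_bleu):
    t = forward_translate(sentence)        # S -> T
    s_prime = backward_translate(t)        # T -> S, the reconstructed sentence s'
    return 1.0 - sentence_bleu(reference=sentence, hypothesis=s_prime)

def select_for_human_translation(monolingual_sentences, k,
                                 forward_translate, backward_translate, sentence_bleu):
    scored = [(round_trip_loss(s, forward_translate, backward_translate, sentence_bleu), s)
              for s in monolingual_sentences]
    scored.sort(key=lambda pair: pair[0], reverse=True)    # highest loss = most informative
    return [s for _, s in scored[:k]]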

The active learning framework for SMT makes use of both labelled and unlabelled data. Good sentence selection in SMT active learning requires paying attention to the units of translation, i.e. words and candidate phrases. Improving the coverage of the bilingual training data is important, but it is not the only crucial factor. For instance, decoder-confidence-based sentence selection yields low coverage yet performs well in the domain adaptation scenario, while its performance is otherwise less acceptable.

Despite its promise, little published work exists on active learning for SMT for domain adaptation and minimal-resource languages. Mohit and Hwa (2007) introduced a technique for classifying phrases that are difficult to translate and incorporated human translations for these phrases. Their approach differs slightly from AL: they used the human translations of the difficult phrases to improve the translation output of the decoder. Further studies are available on sampling sentence pairs for SMT (Eck et al., 2005); however, their aim was to limit the amount of training data in order to reduce the memory footprint of the SMT decoder. Eck et al. (2005) used n-gram features to compute this score, although those features are very different from the n-gram features proposed in this work.

2.12.1.3 Deep Learning

Deep learning is a recently adopted approach to MT. Neural Machine Translation (NMT), unlike traditional MT, can produce more accurate translations and better overall performance. Deep Neural Networks (DNNs) with more than one hidden layer can be used to improve traditional systems and make them more efficient. These networks are first trained and then applied to solve the problem (Guzmán et al., 2017). Different deep learning techniques and libraries are required to develop a better MT system. Recurrent Neural Networks (RNNs) (Zhang et al., 2016) and Long Short-Term Memories (LSTMs) (Sutskever et al., 2014) are used to train the system that converts a sentence from the source language into the target language. Choosing suitable networks and deep learning strategies tunes the system towards maximizing the accuracy of the resulting translation system.

MT converts a sentence from one natural language into another by means of computerized systems, without requiring human assistance. Different approaches are available for building such systems, but more robust techniques are needed to improve upon existing ones. A well-trained network leads the system towards its goal, namely an efficient translation system that provides suitable accuracy. Deep learning is a relatively new technique that is widely used in the machine learning community. It enables a system to learn from data and to improve its performance with training. Deep learning methods can learn feature representations, using supervised as well as unsupervised learning, across increasingly abstract layers. Deep learning is currently used in big data, image applications, speech recognition, machine translation, and so on.

Deep learning has attracted researchers to apply it to MT. The main idea is to develop a system that works as a translator: using what it has learned from past data, a trained DNN translates sentences without relying on a large database of rules. MT also involves related processes such as word alignment, reordering, and language modelling, and each of these has appropriate DNN solutions. After preprocessing (e.g. sentence segmentation), the translation process starts with word alignment, followed by reordering and language modelling.

2.12.2 Pivoting Framework for Minimal Parallel-Resource SMT

A common solution to the lack of parallel data is the pivot (bridge or intermediary) language technique. This technique is used to build a systematic SMT system when a proper bilingual corpus is lacking or the existing ones are weak (Ahmadnia et al., 2017). The issue is significant for languages that lack the NLP resources needed to build an SMT system on their own but have sufficient resources with respect to some other languages. Although it has been claimed that intermediary languages do not lead to an improvement in the general case, the idea can be employed as a simple method to improve translation performance even for existing systems (Matusov et al., 2008). The idea introduces a third language, called the pivot language, for translating between a source and a target language with limited bilingual text. Large bilingual corpora exist for the source-pivot and pivot-target language pairs, and using only these resources a translation model is built for the source-target pair. The advantage of this technique is that translation between source and target can be performed even if no bilingual corpus is available for that language pair. This point is vital, because the technique provides a means of translation between many pairs of languages for which only little parallel data exist.

A substantial amount of work has been done on pivot strategies for SMT. For instance, De Gispert and Mariño (2006) addressed the translation task between Catalan and English using Spanish as a pivot language. Pivoting is done with two techniques: the concatenation of two SMT systems, and a direct approach in which a Catalan-English corpus is generated and trained upon. Utiyama and Isahara (2007) studied the use of a pivot language through phrase translation (phrase-table creation) and sentence translation. Wu and Wang (2007) discussed three pivot strategies, namely:

1. Multiplication: This method combines the corresponding translation probabilities of the translation models for the source-pivot and the pivot-target languages, thus generating a new model for source-target translation (a minimal sketch of this triangulation is given after this list).

2. Cascade: This method translates the source-language text into the pivot language using a source-pivot translation model, and subsequently translates it into the target language using a pivot-target translation model.

3. Synthetic Corpus: A source-target corpus can be obtained in two ways. First, the pivot-language sentences of the source-pivot corpus can be translated into target-language sentences using the pivot-target system. Second, the pivot sentences of the pivot-target corpus can be translated into source sentences using the pivot-source system.
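The sketch below illustrates the multiplication (triangulation) strategy under the simplifying assumption that the source-target phrase probability is obtained by summing, over shared pivot phrases, the product of source-pivot and pivot-target probabilities; the dictionary-based phrase-tables and the toy Catalan/Spanish/English entries are illustrative assumptions, not data from this thesis.

# Sketch of phrase-table triangulation: p(t|s) ≈ sum over pivot phrases p of p(p|s) * p(t|p).
from collections import defaultdict

def triangulate(src_pivot_table, pivot_tgt_table):
    """Each table maps a phrase to a dict of {translation: probability}."""
    src_tgt_table = defaultdict(lambda: defaultdict(float))
    for src_phrase, pivot_translations in src_pivot_table.items():
        for pivot_phrase, p_sp in pivot_translations.items():
            for tgt_phrase, p_pt in pivot_tgt_table.get(pivot_phrase, {}).items():
                src_tgt_table[src_phrase][tgt_phrase] += p_sp * p_pt
    return src_tgt_table

# Tiny illustrative example with Spanish as the pivot language:
ca_es = {"gat": {"gato": 0.9}}                # hypothetical Catalan -> Spanish entry
es_en = {"gato": {"cat": 0.8, "kitty": 0.1}}  # hypothetical Spanish -> English entry
print(dict(triangulate(ca_es, es_en)["gat"])) # ≈ {'cat': 0.72, 'kitty': 0.09}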

Assuming that a small bilingual text is available for the source-target language pair, a standard phrase-based translation model can be built, and an improved source-target translation model can then be obtained by linearly interpolating the standard model with the pivot-based model. The interpolated model thus exploits both the small source-target text and the large source-pivot and pivot-target corpora to improve translation quality.

Previous research has explored the idea of using pivot languages to overcome data sparseness. Callison-Burch et al. (2006) used paraphrases to deal with unseen source phrases. They acquired the paraphrases by identifying candidate phrases in the source language, translating them into multiple intermediate languages, and translating them back to the source; the new source phrases were then treated as potential paraphrases of the originals. Unknown source phrases are substituted with their paraphrases, and the paraphrases are translated instead. Pivot language approaches have also been used to improve word alignments. Borin (1999) used multilingual corpora to improve alignment coverage, and Wang et al. (2006) improved word alignment quality by inducing alignment models from two additional bilingual corpora. Further research has applied pivot language methods to Cross-Language Information Retrieval (CLIR) (Kishida and Kando, 2003), translation dictionary induction (Schafer and Yarowsky, 2002), word-sense disambiguation (Diab and Resnik, 2002), and so on. A shared task on word alignment was organized at the ACL 2005 Workshop on Building and Using Parallel Texts (Martin et al., 2005), focusing on languages with limited resources. Several researchers (Lopez and Resnik, 2005; Tufis et al., 2005) carried out experiments in the unlimited-resources subtask, using language-dependent resources such as a dictionary, a thesaurus, and a dependency parser to improve word alignment results. Nakov and Ng (2012) investigated exploiting the similarities between resource-poor and resource-rich languages for the translation task. Dabre et al. (2015) used Multiple Decoding Paths (MDP) to overcome the limitation of small-sized corpora. Paul et al. (2013) studied criteria for selecting a good pivot language. Kunchukuttan et al. (2014) demonstrated the use of source-side segmentation as a preprocessing technique. Goldwater and McClosky (2005) investigated several methods for incorporating morphological information in order to achieve better translation from Czech to English.

2.12.3 Other Research Lines

The previous sections gave an overview of the most important classic approaches to scarce-resource SMT used by the community. This section reviews other recent efforts carried out to overcome the training data scarcity limitations of baseline systems. It is a general perception in SMT that the larger the available training corpus, the better the performance of the translation system. Whereas finding appropriate monolingual text for the language model is not considered difficult, acquiring a large high-quality bilingual parallel text for the desired domain and language pair requires a lot of time and effort, and for some language pairs it is not even possible. In addition, small corpora have certain advantages, such as:

• Manual creation of the corpus becomes possible.

• An automatically collected corpus can be manually corrected.

• Low memory and time requirements for training a translation system. For these reasons, strategies for exploiting limited amounts of bilingual data are receiving more and more attention.

2.12.3.1 Bilingual Lexicon Induction for Minimal Parallel-Resource SMT

Bilingual lexicon induction techniques (Irvine and Callison-Burch, 2016) learn translations from monolingual texts in two languages, with the aim of building an end-to-end SMT system without using any bilingual sentence-aligned parallel corpora. This is one possibility for overcoming the limitation of training data scarcity. Bilingual lexicon induction describes the class of algorithms that attempt to learn translations from monolingual corpora.

The most prominent problem when an MT system has access to limited parallel resources is that many words are Out-Of-Vocabulary (OOV) with respect to the training data yet appear in the texts we would like the SMT system to translate. Bilingual lexicon induction can be used to improve the coverage of under-resourced translation models by learning the translations of words that do not occur in the parallel training data. Although past research into bilingual lexicon induction has been motivated by the idea that it could improve SMT systems by translating OOV words, it has rarely been evaluated that way. Notable exceptions that did evaluate bilingual lexicon induction in the context of SMT through better OOV handling include Dou and Knight (2013) and Dou et al. (2014).

Despite the above-mentioned research, the majority of prior work on bilingual lexicon induction has treated it as a standalone task without integrating the induced translations into end-to-end SMT. Instead, evaluation was done by holding out a portion of a bilingual dictionary and analysing how well the algorithm learns the translations of the held-out words. Bilingual lexicon induction uses monolingual or comparable corpora, usually paired with a small seed dictionary, to compute signals of translation equivalence. Consider the two settings where only bilingual dictionaries or only a small amount of parallel training data are available: • In the first case, a baseline system is built that produces a simple dictionary gloss, extended with additional translations learned from monolingual corpora in the source and target languages.

• In the second case, a baseline statistical model is learned over the small amount of parallel training data and is extended with additional translations and features estimated over monolingual corpora. The idea is to make effective use of bilingual lexicon induction, which allows learning translations from independent monolingual texts or comparable corpora written in two languages. SMT typically uses sentence-aligned bilingual parallel texts to learn the translations of individual words (Brown et al., 1990). Another thread of research has examined bilingual lexicon induction, which tries to induce translations from monolingual corpora in two languages; these corpora range from texts on completely unrelated topics to comparable corpora. Bilingual lexicon induction is framed as a binary classification problem: for a pair of source- and target-language words, the task is to predict whether the two are translations of one another. Since binary classification does not inherently provide a list of the best translations, an additional step is required. For a given source-language word, its best translation or its k-best list of translations is found by first applying the classifier to all target-language words and then ranking them by how confident the classifier is that each target-language word is a translation of the source word.
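A minimal sketch of this classify-then-rank step is given below. The classifier is assumed to expose a scikit-learn-style predict_proba interface, and featurize is a hypothetical helper that builds monolingual signals for a (source, target) word pair; both are assumptions made only for illustration.

# Sketch of turning a binary translation classifier into a k-best translation list.
def k_best_translations(source_word, target_vocabulary, classifier, featurize, k=10):
    scored = []
    for target_word in target_vocabulary:
        features = featurize(source_word, target_word)
        confidence = classifier.predict_proba([features])[0][1]   # P(is-translation)
        scored.append((confidence, target_word))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [word for _, word in scored[:k]]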

Additional related work on learning translations from monolingual corpora is as follows: • Carbonell et al. (2006) described an SMT system which produced translation lattices using a bilingual dictionary and scored them with an n-gram language model. Their method had no notion of translation similarity beyond the bilingual dictionary itself.

• Similarly, Sanchez-Cartagena et al. (2011) supplemented an SMT phrase-table with translation pairs extracted from a bilingual dictionary and gave each a frequency of one for computing translation scores.

• Ravi and Knight (2011) treated SMT without parallel training data as a decipherment task and learned a translation model from monolingual text. They translated corpora of Spanish time expressions and subtitles, both of which have a limited vocabulary, into English; their method has not been applied to broader domains of text. Most work on learning translations from monolingual texts has only examined small numbers of frequent words.

• Daume III and Jagarlamudi (2011) were an exception, improving SMT by mining translations for OOV items.

• A variety of past studies have focused on mining parallel or comparable corpora from the web (Munteanu and Marcu, 2006; Smith et al., 2010).

• Others used an existing SMT system to discover parallel sentences within independent monolingual texts, and used them to retrain and enhance the system (Chen et al., 2008; Abdul-Rauf and Schwenk, 2009; Lambert et al., 2011).

2.12.3.2 Monolingual Collocation for Minimal Parallel-Resource SMT

Making effective use of collocation probabilities is another possible way to improve SMT performance (Liu et al., 2010). The collocation probabilities are estimated from monolingual corpora and exploited in two ways: 1. Improving word alignment for various kinds of SMT systems.

2. Improving the phrase-table for phrase-based SMT systems. A collocation is generally defined as a group of words that occur together more often than would be expected by chance (McKeown and Radev, 2000); here it is composed of two words occurring in a sentence either as a consecutive word group or as an interrupted word group. In this approach, the Monolingual Word Alignment (MWA) method (Liu et al., 2010) is used, which adapts the Bilingual Word Alignment (BWA) algorithm to the monolingual scenario in order to extract collocations from monolingual corpora alone.

Statistical bilingual word alignment (Brown et al., 1993) is the basis of most SMT systems. Compared with single-word alignment, multi-word alignment is more difficult to identify. Although many methods have been proposed to improve the quality of word alignments (Marcu and Wong, 2002; Cherry and Lin, 2003; Huang, 2009), the correlation of the words in multi-word alignments was never fully considered.

In phrase-based SMT the phrase boundary is usually determined from the bi-directional word alignments, but few previous studies have exploited the collocation relations of the words in a phrase. Some researchers used soft syntactic constraints to predict whether the words of a source phrase can be translated together (Marton and Resnik, 2008; Xiong et al., 2009); however, the constraints were learned from a parsed corpus, which is not available for many languages. The idea of using monolingual collocations was the first to identify potentially collocated words and estimate collocation probabilities from monolingual corpora using an MWA method that needs no additional resources or linguistic pre-processing, and it outperforms previous methods on the same experimental data. The collocation information is then employed to improve BWA for various kinds of SMT systems and to improve the phrase-table for phrase-based SMT. To improve BWA, the alignment probabilities are re-estimated using the collocation probabilities of words in the same cept5; an alignment between a source multi-word cept and a target word is a many-to-one multi-word alignment. To improve the phrase-table, phrase collocation probabilities are calculated from word collocation probabilities and then used as additional features in phrase-based SMT systems.
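As an illustration of the last step, the sketch below approximates a phrase collocation probability by averaging word-level collocation probabilities over all word pairs spanning the two phrases. This averaging scheme and the toy probabilities are assumptions made for the example and are not necessarily the exact formulation of Liu et al. (2010).

# Hedged sketch: approximate the collocation probability of a phrase pair by
# averaging word-level collocation probabilities over all cross-phrase word pairs.
def phrase_collocation_probability(phrase_a, phrase_b, word_colloc):
    """word_colloc: dict mapping (word_a, word_b) -> collocation probability."""
    pairs = [(a, b) for a in phrase_a for b in phrase_b]
    if not pairs:
        return 0.0
    return sum(word_colloc.get(pair, 0.0) for pair in pairs) / len(pairs)

# Illustrative call with made-up probabilities:
word_colloc = {("strong", "tea"): 0.6, ("strong", "cup"): 0.1}
print(phrase_collocation_probability(["strong"], ["cup", "tea"], word_colloc))  # ≈ 0.35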

2.12.3.3 Domain Adaptation for Minimal Parallel-Resource SMT

Domain adaptation has recently gained interest in SMT as a way to cope with the performance drop observed when testing conditions deviate from training conditions. The basic idea is that in-domain training data can be exploited to adapt all components of an already developed system. The main idea is to exploit large but cheap monolingual in-domain data, in either the source or the target language, and to synthesize a bilingual corpus by translating the monolingual adaptation data into the counterpart language. This approach focuses on adapting an already developed phrase-based translation system so that it works properly on a different domain for which almost no parallel data are available, only monolingual texts. A lexicalized reordering model, also learnable from parallel data, is exploited to control the reordering of target words.

Assuming some large monolingual in-domain texts are available, two basic adaptation approaches can be pursued: 1. Generating synthetic bilingual data with an available SMT system, and using this data to adapt its translation and reordering models. 2. Using synthetic or provided target texts to also, or only, adapt its language model. 5A cept is the set of source words that are connected to the same target word.

Once the monolingual adaptation data is automatically translated, the synthetic parallel corpus can be used to estimate new language, translation, and reordering models. Such models can either replace or be combined with the original models of the SMT system.

In recent years, various publications have dealt with the issue of sparse bilingual corpora.

• In (Niesen et al., 2004) the impact of the training corpus size for SMT from German into English was investigated, and the use of a conventional dictionary and morpho-syntactic information was proposed to improve performance. They used several types of word reordering as well as a hierarchical lexicon based on Part-Of-Speech (POS) tags6 and base forms of the German language. They reported results on the full corpus of about 60,000 sentences, on a very small part of the corpus containing five thousand sentences, and on the conventional dictionary only.

• Morpho-syntactic information yields significant improvements in all cases, and acceptable translation quality is also obtained with the very small corpus. SMT of spontaneous speech with a training corpus containing about 3,000 sentences was dealt with in (Matusov et al., 2004). They proposed the acquisition of additional training data using an n-gram coverage measure, lexicon smoothing and a hierarchical lexicon structure for improving word alignments, as well as several types of word reordering based on POS tags. Spanish-English and Catalan-English SMT with sparse bilingual resources in the tourism and travel domain was investigated in (Popovic and Ney, 2005). The use of a phrasal lexicon as an additional language resource was proposed, as well as expansions of the Spanish and Catalan verbs. With the help of the phrasal lexicon and morphological information, reasonable translation quality is achieved with only 1,000 sentence pairs from the domain.

• Serbian-English SMT was investigated in (Popovic et al., 2005). A small bilingual corpus containing fewer than 3,000 sentences was created and SMT systems were trained on different sizes of the corpus. The translation results obtained are comparable with results for other language pairs, especially considering the small size of the corpus and the rich inflectional morphology of the Serbian language. Morpho-syntactic information is shown to be very helpful for this language pair. Czech-English SMT and the impact of morphological information were investigated in (Goldwater and McClosky, 2005). As with Serbian-English, morphological transformations play an important role in translation quality. The problem of creating word alignments for languages with scarce resources, i.e. Romanian-English and Hindi-English, was addressed in (Lopez and Resnik, 2005; Martin et al., 2005).

• A shared task on word alignment was organized as part of the ACL 20057 Workshop on Building and Using Parallel Texts (Martin et al., 2005), focusing on languages with scarce resources.

6In corpus linguistics, POS tagging is the process of marking up a word in a corpus as corresponding to a particular part of speech, based on both its definition and its context. 7The 43rd annual meeting of the Association for Computational Linguistics, Ann Arbor, USA, June 2005.

• For the subtask of unlimited resources, some researchers (Aswani and Gaizauskas, 2005; Lopez and Resnik, 2005; Tufis et al., 2005) used language-dependent resources such as a dictionary, a thesaurus, and a dependency parser to improve word alignment results.

2.13 Summary

In this chapter, we reviewed the theoretical and mathematical background knowledge needed to follow the rest of the thesis, including machine translation concepts, the translation procedures and requirements of the statistical approach to machine translation, evaluation metrics and decoders, and the state of the art in widely-used approaches for improving the translation performance of minimal-resource language pairs, such as semi-supervised learning, active learning, deep learning, the pivot language technique, and some other recent research lines.

The advantages of machine translation over human translators are becoming more numerous with research advances in natural language processing. Machine translation has seen significant development in recent years, with the most popular approach now tending towards statistical machine translation because of the numerous advantages it holds over others.

Very generally, the task of statistical machine translation is based on the log-linear model form. A baseline translation system consists of a training model generated from a parallel corpus aligned at the phrase level, a language model built from a monolingual corpus in the target language, and a translation model. The translation model determines the probability of a target sentence t being linguistically equivalent to the source (input) sentence s. This probability is computed by searching the training model for the most likely target phrases and sentences, which are then checked against the language model to determine their validity as sentences. The candidate with the highest probability is chosen as the output. Monolingual and bilingual corpora are both used in the process: the monolingual corpus is used to construct the language model, which determines whether a proposed translation is a valid sentence, while the bilingual corpus is used to construct the training model, which determines the most likely translation of each phrase. Several open-source decoders can be used for statistical machine translation, such as Moses and Cdec. Individual differences between these decoders affect the system and its output as a whole, and will be covered in subsequent chapters. Output is evaluated automatically with evaluation metrics, which score the output according to a number of parameters particular to each metric. The most commonly used metric is BLEU, which scores output by comparing translation length, word choice, and word order against a reference text.

Regarding the state of the art, the semi-supervised learning algorithm presented in this chapter is a learning framework which learns from its own output in an iterative manner. This is similar to active learning with one major difference: in active learning the labelled data are provided by a human.

Deep learning is a class of machine learning methods based on learning data representations, as opposed to task-specific algorithms. Learning can be supervised, partially supervised or unsupervised. Deep learning architectures such as deep neural networks, deep belief networks and recurrent neural networks have been applied to fields including computer vision, speech recognition, natural language processing, and machine translation, where they have produced results comparable to, and in some cases superior to, human experts. The pivot language approach, discussed in this chapter as an alternative for overcoming the training data bottleneck, can be seen as a practical, time- and cost-effective option for producing translations for minimal-resource and rare language pairs, even though further quality control or translation strategies may be needed to complete the translation process. The right choice of pivot language for each translation scenario may play an important role in the quality of the generated translations. Further research lines have also been reviewed in this chapter, including bilingual lexicon induction, monolingual collocation, and domain adaptation.

Chapter 3

Phrase-Based Translation Models for SMT Systems

Research in the field of Statistical Machine Translation (SMT) has seen new developments in recent years, with much of it focused on Classical and Hierarchical phrase-based translation models. These translation models incorporate different levels of linguistic annotation while taking into account the recursive nature of language.

This chapter makes a detailed comparison between the Classical phrase-based translation model and the Hierarchical phrase-based translation model within the statistical approach to MT. For the experiments we use three language pairs (Spanish-English, English-Persian, and Persian-Spanish) in order to analyse the performance of the above-mentioned translation models. The three language pairs used in this investigation differ from one another in their sentence structure, word order, and quantity of bilingual data.

Our experimental results show the performance of the Classical and the Hierarchical phrase-based translation models in each translation direction, including back translation, and indicate which translation model is preferable for the language pairs and translation directions we consider. We seek to explain why this is so, and detail a series of experiments with our SMT systems using bilingual corpora with both tool-kits, Moses and Cdec; the former is used as a Classical phrase-based platform and the latter as a Hierarchical phrase-based platform. Furthermore, in order to support our hypothesis, we analyse the performance of the Classical and the Hierarchical phrase-based translation models by investigating the impact of different statistical language models, applying three kinds of n-gram language models to the aforementioned language pairs in each translation direction. The results confirm that, independent of the applied n-gram language model, our hypothesis about the performance of phrase-based translation models on our case-study language pairs holds. However, the performance of the translation models depends on the word order and sentence structure of the considered language pairs. The comparative performance of our case-study language pairs under the Classical and Hierarchical phrase-based translation models is presented as a reference point for further research on phrase-based SMT.

3.1 Introduction

Phrase-based translation models treat translation as the translation of small text pieces, usually with a small amount of reordering (Koehn et al., 2003; Och and Ney, 2004). Recent research on SMT has focused on modelling the translation of phrases found in the source language, which are matched with statistically determined equivalents in the target language. This translation model is regarded as the Classical (standard or conventional) phrase-based model. The most critical point in the Classical phrase-based model is the estimation of the translation model from the parallel corpus. However, the Classical phrase-based model often fails to capture the characteristics of certain language pairs (Birch et al., 2008). One of the main reasons for this failure is that reordering cannot always be reduced to the level of atomic phrase units. The Hierarchical phrase-based translation model goes one step further than the standard phrase-based model by allowing phrases with gaps, which are modelled as a Synchronous Context-Free Grammar (SCFG). In the original hierarchical implementation, the SCFG model is trained in the same way as the Classical phrase-based model. The calculation of translation probabilities involves a sub-sample of occurrences derived from the given source phrase, and the parameters related to phrase translation can be determined at runtime, when the target-language text and word alignment data are available to the translation system. Algorithm 3 describes the general procedure for building phrase-based translation models:

Algorithm 3 Building phrase-based translation systems
Input: Parallel corpus between source and target languages. // Sentence-by-sentence translations of the source language into the target language.
1: Aligning: Learn bi-directional alignments from the parallel corpus.
2: Extraction: Extract phrase pairs from the alignments, and compute probability-based feature values for each translation pair. // This is called the translation model.
3: Tuning: Learn the weights for the features by maximizing the BLEU score on a development set using discriminative Minimum Error Rate Training (MERT) or the Margin Infused Relaxed Algorithm (MIRA).
4: Decoding: Use a language model and the translation model to translate a test set.
Output: TM. // A translation model.

3.2 Classical Phrase-Based Translation Model

Phrase-based translation models improve performance by estimating translation probabilities over phrases, where several word tokens are translated as one atomic unit called a phrase. Phrase pairs that are translations of one another are stored in a phrase-table. Typically, a translation process requires word reordering and word disambiguation; when we work at the phrase level, modelling these in one step becomes easy, and various decisions involved in the process need not be modelled separately, since the phrase-based model translates directly in one step.

3.2.1 Noisy-Channel Model

The noisy-channel model is an effective way to conceptualize many processes in Natural Language Processing (NLP). As mentioned in Section 2.5, SMT is a decision problem, hence the phrases in the source and target languages must be decided upon jointly. In SMT the input is a source language string s_1^J = s_1...s_j...s_J, which is to be translated into a target language string t_1^I = t_1...t_i...t_I. Statistical decision theory tells us that among all possible target language sentences, we should choose the sentence which minimises the expected loss:

\hat{t}_1^{\hat{I}} = \arg\min_{I, t_1^I} \left\{ \sum_{I', t_1'^{I'}} P(t_1'^{I'} \mid s_1^J) \times L(t_1^I, t_1'^{I'}) \right\}    (3.1)

This is the Bayes decision rule for SMT. In the case of minimising the sentence or string error rate, the corresponding (0-1) loss function is:

L_{0-1}(t_1^I, t_1'^{I'}) = \begin{cases} 0 & \text{if } t_1^I = t_1'^{I'} \\ 1 & \text{else} \end{cases} = 1 - \delta(t_1^I, t_1'^{I'})    (3.2)

L(t_1^I, t_1'^{I'}) denotes the loss function under consideration; it measures the loss or error of a candidate translation t_1^I assuming the correct translation is t_1'^{I'}, and \delta denotes the Kronecker delta. The (0-1) loss assigns a loss of zero to the correct solution and a loss of one otherwise. P(t_1'^{I'} \mid s_1^J) denotes the posterior probability distribution over all target language sentences given the specific source sentence s_1^J. Note that the Bayes decision rule depends directly on the chosen loss function. With the (0-1) loss, the Bayes decision rule can be simplified to:

\hat{t}_1^{\hat{I}} = \arg\max_{I, t_1^I} \{ P(t_1^I \mid s_1^J) \}    (3.3)

This decision rule is also called the Maximum A-Posteriori (MAP) decision rule. Thus, we select the hypothesis which maximises the posterior probability P(t_1^I \mid s_1^J). In the original work on SMT (Brown et al., 1990), the posterior probability was decomposed as:

P(t_1^I \mid s_1^J) = \frac{P(t_1^I) \, P(s_1^J \mid t_1^I)}{P(s_1^J)}    (3.4)

Note that the denominator P(s_1^J) depends only on the source sentence s_1^J and, in the case of the MAP decision rule, can be omitted during the search:

\hat{t}_1^{\hat{I}} = \arg\max_{I, t_1^I} \{ P(t_1^I) \times P(s_1^J \mid t_1^I) \}    (3.5)

This model is called the noisy-channel model, and Equation (3.5) is the so-called fundamental equation of SMT (Brown et al., 1993). The decomposition into two knowledge sources is known as the noisy-channel approach to SMT (Brown et al., 1990). The noisy-channel model is the more traditional one, but it has been largely replaced by the log-linear model, which has proved beneficial compared with the noisy-channel model in many different fields. The noisy-channel model consists of two feature scores, P(s_1^J \mid t_1^I) and P(t_1^I). The feature P(s_1^J \mid t_1^I) is generally known as the translation model, embodying the probability that source sentence s and target translation t are linguistically equivalent. The feature P(t_1^I) is known as the language model, embodying the probability of translation t being a valid sentence in the target language. Intuitively, the translation model is responsible for content preservation, while the language model is responsible for the fluency of the generated translation. To generate the target translation t from the source sentence s, one possibility is:

1. Sentence s could be segmented into phrases.

2. The phrases could be translated using the bilingual phrase dictionary.

3. The translated phrases could be reordered.

We must keep in mind that a phrase is a contiguous sequence of words; it is not required to be a semantic or syntactic unit. Phrase pairs are commonly learned from the substring pairs observed in the bilingual training data, but they do capture idiomatic terms and local reordering.

3.2.2 Log-Linear Model

This model is distinct from the noisy-channel model in that it can express scoring based on an unlimited number of features; from this point of view, it is a more general model. Log-probabilities are used: standard probabilities are converted with the log function and added together rather than multiplied, following standard logarithmic rules1. The log-linear model can be derived by directly modelling the posterior probability P(t_1^I \mid s_1^J). The use of a log-linear model was proposed in Papineni et al. (1998).

P(t_1^I \mid s_1^J) = \rho_{\lambda_1^M}(t_1^I \mid s_1^J)    (3.6)

\rho_{\lambda_1^M}(t_1^I \mid s_1^J) = \frac{\exp\left( \sum_{m=1}^{M} \lambda_m h_m(t_1^I, s_1^J) \right)}{\sum_{t_1'^{I'}} \exp\left( \sum_{m=1}^{M} \lambda_m h_m(t_1'^{I'}, s_1^J) \right)}    (3.7)

Here, the h_m are the models and the \lambda_m are the model scaling factors. Again, the denominator is a normalization factor that depends only on the source sentence s_1^J; consequently, it can be omitted during the search in the case of the MAP decision rule. The result is a linear combination of the individual models h_m(t_1^I, s_1^J):

\hat{t}_1^{\hat{I}} = \arg\max_{I, t_1^I} \{ P(t_1^I \mid s_1^J) \}    (3.8)

\hat{t}_1^{\hat{I}} = \arg\max_{I, t_1^I} \left\{ \frac{\exp\left( \sum_{m=1}^{M} \lambda_m h_m(t_1^I, s_1^J) \right)}{\sum_{t_1'^{I'}} \exp\left( \sum_{m=1}^{M} \lambda_m h_m(t_1'^{I'}, s_1^J) \right)} \right\}    (3.9)

1 log(A · B) = log(A) + log(B)

\hat{t}_1^{\hat{I}} = \arg\max_{I, t_1^I} \left\{ \sum_{m=1}^{M} \lambda_m h_m(t_1^I, s_1^J) \right\}    (3.10)

Equation (3.10) is the final form of the log-linear model. In this equation, M denotes the number of features, and individual scoring is undertaken by multiplying \lambda_m and h_m(t_1^I, s_1^J), where \lambda_m is an importance-indicating weight and h_m(t_1^I, s_1^J) is the assigned log-probability that the source sentence and the target translation are linguistically equivalent. Thus, the noisy-channel model can be expressed exactly within the log-linear model by manipulating the features used; in other words, the noisy-channel approach is merely a special case of the general log-linear model shown in Equation (3.10).

The log-linear model is considered superior to the noisy-channel model because the importance of each feature can be adjusted to control its influence on the overall output, for instance by controlling the values of \lambda_m and h_m(t_1^I, s_1^J). The model scaling factors \lambda_1^M are trained according to the maximum class posterior criterion. More features may be added to the model, and the \lambda_m and h_m(t_1^I, s_1^J) values defined to suit a particular feature function within the model, such as modifying the level of operation of either the translation or the language model. Alternatively, the weights can be trained with respect to the final translation quality as measured by an error criterion (Och, F. J., 2003); this is the so-called Minimum Error Rate Training (MERT). Being superior and adaptable to different systems, the log-linear model was applied in the development of this system.
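To make the log-linear scoring of Equation (3.10) concrete, the following minimal sketch ranks candidate translations by the weighted sum of feature scores. The feature functions and weights below are illustrative assumptions, not the actual features of the systems built in this thesis.

import math

# Sketch of log-linear scoring: score(t, s) = sum_m lambda_m * h_m(t, s),
# where each h_m is a log-scaled feature (e.g. log P_TM, log P_LM, word penalty).
def loglinear_score(source, target, feature_functions, weights):
    return sum(weights[m] * h(source, target) for m, h in feature_functions.items())

def best_translation(source, candidates, feature_functions, weights):
    return max(candidates,
               key=lambda t: loglinear_score(source, t, feature_functions, weights))

# Illustrative (made-up) feature functions:
feature_functions = {
    "tm": lambda s, t: math.log(0.5),          # stand-in for log translation-model probability
    "lm": lambda s, t: -0.1 * len(t.split()),  # stand-in for log language-model probability
    "word_penalty": lambda s, t: -len(t.split()),
}
weights = {"tm": 1.0, "lm": 0.8, "word_penalty": 0.2}
print(best_translation("una frase", ["a sentence", "one sentence of words"],
                       feature_functions, weights))  # prints "a sentence"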

3.2.3 Feature Functions

A feature function maps a pair of source and target sentences to a non-negative score value; it can be any function that performs this mapping. Every feature function can be decomposed into local evaluations at two different levels, i.e. the phrase level and the word level. Global features are computed over the decoding process or the entire derivation, including: 1. Distortion: the number of source words that fall between two source phrases whose translations appear in successive target phrases. 2. Word Penalty: the number of generated target words, which helps control the length of the translation. 3. Phrase Penalty: the number of phrase pairs used in the derivation, |D|. 4. Language Model: the logarithm of an n-gram target language model. This feature is known for its restricted use of context around the individual phrase pairs.

\log P(t) = \log \prod_{j=1}^{T} P(t_j \mid t_{j-1}, \ldots, t_{j-n})    (3.11)

Here the basic requirement is to maintain a history of the previous n words for each position in the target sentence. The remaining features are defined

on each individual phrase pair, such as the phrase translation probabilities, lexical weighting, and lexical reordering.

5. Translation Probabilities: the conditional translation probability of the target phrase given the source phrase:

\log \prod_{(t,s) \in d} P(t \mid s)    (3.12)

Given t as the target phrase and s as the source phrase, the equivalent phrase probability is also computed in the opposite direction, P(s|t), for the same phrase pair. This follows the noisy-channel model, which in practice has proved to produce performance comparable to using the direct probability P(t|s) (Al-Onaizan et al., 1999). When building the phrase-table, the individual probabilities must be properly estimated; the estimates vary in certain respects with the phrase alignment model.

P(s \mid t) = \frac{\text{count}(s, t)}{\text{count}(t)}    (3.13)

Here the numerator represents the number of joint occurrences of the aligned phrase pair (s, t), whereas the denominator represents the marginal count of the phrase t2. A small numerical sketch of these probability estimates is given after this feature list.

6. Lexical Weight: Translation probabilities derived from relative-frequency estimation over phrase pairs are extremely rough due to data sparsity. Lexical weighting is used as a smoothing method for infrequent phrase pairs: the poorly estimated phrase probabilities are smoothed with word-to-word translation probabilities, for which more statistics are available (Foster et al., 2006). The target-source lexical weighting is:

\phi(t \mid s, A) = \log \prod_{j=1}^{T} \frac{1}{|\{ i : (i, j) \in A \}|} \sum_{i : (i, j) \in A} P(s_i \mid t_j)    (3.14)

Here A refers to the underlying word alignment; the reverse lexical weighting \phi(s \mid t, A) is defined in a similar manner. The word conditional probabilities P(s_i \mid t_j) are defined in the same way as the phrase conditional probabilities.

7. Lexicalized Reordering: These features are derived from the orientation of a source phrase relative to the previously translated source phrase; the reordering distance between the two source phrases can also be used. To avoid sparsity, the orientations are limited, using heuristics, to a few categories, the most common of which is monotone with respect to the previously translated source phrase.

2P(t|s) is defined similarly.
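The following minimal sketch estimates the relative-frequency phrase translation probability of Equation (3.13) and the lexical weighting of Equation (3.14) from toy counts. The data structures and numbers are simplified assumptions for illustration, and unaligned target words are skipped for brevity (the full model handles them via a NULL link).

import math
from collections import Counter

# Sketch of Eq. (3.13): relative-frequency estimate P(s|t) = count(s, t) / count(t).
def phrase_translation_prob(s_phrase, t_phrase, pair_counts, t_counts):
    return pair_counts[(s_phrase, t_phrase)] / t_counts[t_phrase]

# Sketch of Eq. (3.14): lexical weighting of a phrase pair under word alignment A,
# using word translation probabilities P(s_i | t_j).
def lexical_weight(s_words, t_words, alignment, word_prob):
    """alignment: set of (i, j) pairs linking s_words[i] to t_words[j]."""
    product = 1.0
    for j, t_word in enumerate(t_words):
        links = [i for (i, jj) in alignment if jj == j]
        if links:  # unaligned target words are skipped here for simplicity
            product *= sum(word_prob[(s_words[i], t_word)] for i in links) / len(links)
    return math.log(product)

pair_counts = Counter({("la casa", "the house"): 3})
t_counts = Counter({"the house": 4})
print(phrase_translation_prob("la casa", "the house", pair_counts, t_counts))  # 0.75
word_prob = {("la", "the"): 0.6, ("casa", "house"): 0.9}
print(lexical_weight(["la", "casa"], ["the", "house"], {(0, 0), (1, 1)}, word_prob))  # ≈ -0.62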

3.2.4 Phrase Extraction

Every pair consisting of a subsequence of a target sentence and a subsequence of its parallel source sentence is a phrase pair candidate. If all of these candidates are considered, the space of possible phrase pairs is huge and becomes a computational bottleneck. Moreover, not all of these phrase pairs are suitable for the SMT model. To resolve this issue, the size of this space is decreased using different mechanisms. In phrase extraction, the phrase-based SMT system uses the word alignment between the source and target sentences to prune many of the useless phrase pairs. Intuitively, the word alignment between a sentence pair denotes the correspondence between the words of the two sentences. As shown in Figure 3.1, the word alignment tells us, for example, that the seventh word of the English sentence is the translation of the second word of the Persian sentence.

FIGURE 3.1: Alignment sample of Persian-English parallel sentence.

In simple words, the process of producing a phrase translation model starts with the phrase extraction algorithm. Below is an overview of the phrase extraction procedure. 1. Collect word translation probabilities for source-to-target and vice versa using the IBM models3.

2. Use the forward and backward models from the previous step to align words from source to target and from target to source, respectively. Only the highest-probability alignment is chosen for each word.

3. Intersect the forward and backward word alignment points to get highly accurate alignment points.

4. Fill in additional alignment points using a heuristic growing procedure.

5. Collect consistent phrase pairs from the alignment points of the previous step (a minimal sketch of this consistency-based extraction is given below).
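The sketch below illustrates the consistency criterion of step 5: a source span and a target span form a valid phrase pair when all alignment links touching either span stay inside the pair. This is a simplified version of the standard extraction procedure, without the unaligned-word extensions used in practice; the example words and alignment are made up.

# Sketch of consistent phrase-pair extraction from a word alignment.
# alignment: set of (i, j) links between source position i and target position j.
def extract_phrase_pairs(src_words, tgt_words, alignment, max_len=7):
    pairs = []
    for i1 in range(len(src_words)):
        for i2 in range(i1, min(i1 + max_len, len(src_words))):
            # target positions linked to the source span [i1, i2]
            tgt_positions = [j for (i, j) in alignment if i1 <= i <= i2]
            if not tgt_positions:
                continue
            j1, j2 = min(tgt_positions), max(tgt_positions)
            # consistency: no link inside the target span may leave the source span
            consistent = all(i1 <= i <= i2 for (i, j) in alignment if j1 <= j <= j2)
            if consistent and (j2 - j1) < max_len:
                pairs.append((" ".join(src_words[i1:i2 + 1]),
                              " ".join(tgt_words[j1:j2 + 1])))
    return pairs

src = ["la", "casa", "verde"]
tgt = ["the", "green", "house"]
links = {(0, 0), (1, 2), (2, 1)}
print(extract_phrase_pairs(src, tgt, links))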

3.2.5 Phrase-Table Induction

The translation candidates for an input source sentence are usually built from a pre-generated set of phrase pairs4, commonly known as the bi-lexicon. Normally, a phrase alignment is determined for each sentence pair and the selected phrase pairs are gathered over the entire dataset. Since this method is quite effective, it is used in most state-of-the-art translation systems.

3IBM models are probabilistic models for transforming a sentence s in one language into its translation t in another language. 4This set of phrase pairs is generated from a sentence-aligned parallel text.

The phrase-table is a data structure that contains all the phrase pairs found in the bi-lexicon and is mainly used in phrase-based translation systems. All the features used by the model are precomputed and stored in the phrase-table. In other words, the phrase-table represents each source phrase together with its possible translations and the values of the associated parameters. A set of feature functions is determined for each phrase pair in the phrase-table and then used to score the translation candidates.
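A minimal sketch of such a data structure is given below, representing each source phrase with its candidate translations and precomputed feature values; the feature names and numbers are illustrative assumptions rather than an actual phrase-table format.

# Sketch of a phrase-table: source phrase -> list of (target phrase, feature values).
# Feature names mirror the features discussed above (forward/backward phrase
# probabilities and lexical weights); the numbers are made up for illustration.
phrase_table = {
    "la casa": [
        {"target": "the house", "p_t_given_s": 0.7, "p_s_given_t": 0.6,
         "lex_t_given_s": -0.5, "lex_s_given_t": -0.6},
        {"target": "the home", "p_t_given_s": 0.2, "p_s_given_t": 0.3,
         "lex_t_given_s": -1.1, "lex_s_given_t": -1.3},
    ],
}

def lookup(source_phrase):
    """Return the candidate translations and their feature values for scoring."""
    return phrase_table.get(source_phrase, [])

for candidate in lookup("la casa"):
    print(candidate["target"], candidate["p_t_given_s"])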

3.2.6 Learning Weights

Having defined the feature functions of the log-linear model, we also need to set the corresponding weights (λ). Instead of learning the model parameters by optimizing the likelihood directly, the aim is to learn them by directly optimizing the translation quality, because the likelihood is only loosely related to the final translation quality on unseen text (Och, F. J., 2003). One method is to perform a grid-based search: for this purpose we start from a random point in the weight space.

3.2.7 Solving the Search Problem

Searching in general for the best translation candidate for a given source sentence is a hard problem (Knight, 1999). It can be reformulated as a classic Artificial Intelligence (AI) search problem, and researchers typically use the effective heuristic search algorithms developed in the AI community. The most notable of these is beam-search, which is used to address the decoding problem. The aim of the beam-search algorithm is to generate the target sentence by repeatedly translating phrases from different parts of the source sentence. The following two steps are repeated until all words of the source sentence have been translated: 1. A phrase is selected from the untranslated part of the source sentence. 2. This phrase is translated using the phrase-table. A partially covered source sentence together with its corresponding partial translation is called a hypothesis. Each iteration uses the two steps above to transform one hypothesis into several successor hypotheses, with costs attached via the feature functions and their weights in the log-linear model, such as the reordering cost, language model cost, and translation cost.

The key feature of the beam-search algorithm is that it discards low-quality hypotheses and restricts itself to the promising part of the search graph.

The main idea of the algorithm is to keep a subset of the best hypotheses, expand them using the previously mentioned operations, and prune the inferior hypotheses after comparing their total costs. In pruning, only hypotheses that cover the same number of source-sentence words are compared. Algorithmically, this stack-based beam-search works as follows:

1. First we create the empty hypothesis and one stack for each number of covered source words, up to the length of the source sentence.

2. Then we process the stacks in order, starting from those corresponding to a smaller number of covered source words, as long as unexpanded hypotheses remain.

3. Next we expand each hypothesis and either prune the new hypotheses or enter them into the corresponding stacks.

Figure 3.2 illustrates how a hypothesis is expanded into new hypotheses, which are placed into new stacks according to the number of source-sentence words they cover.

FIGURE 3.2: The stack-based beam-search schema.

This is for the running example in Figure 3.1, in which the source English sentence has seven words (tokens). Either histogram pruning or threshold pruning can be used to confine the number of hypotheses in a single stack. In histogram pruning, the maximum capacity of the stack is set in advance; in threshold pruning, hypotheses whose costs exceed the best cost in the stack by more than a multiplicative factor are discarded.
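The following minimal sketch of stack-based beam search follows the steps above with histogram pruning. The expand callable is a simplified placeholder that would select and translate an uncovered source phrase via the phrase-table and return successor hypotheses; it is an assumption made for illustration, not a full phrase-based decoder.

# Sketch of stack decoding with histogram pruning. A hypothesis is a tuple
# (covered_word_count, score, partial_translation).
def stack_decode(source_words, expand, beam_size=10):
    n = len(source_words)
    stacks = [[] for _ in range(n + 1)]
    stacks[0].append((0, 0.0, []))                 # the empty hypothesis
    for covered in range(n):
        # histogram pruning: keep only the beam_size best hypotheses per stack
        stacks[covered].sort(key=lambda h: h[1], reverse=True)
        for hypothesis in stacks[covered][:beam_size]:
            for new_covered, new_score, new_translation in expand(hypothesis, source_words):
                stacks[new_covered].append((new_covered, new_score, new_translation))
    complete = stacks[n]
    return max(complete, key=lambda h: h[1]) if complete else None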

3.2.8 Classical Model Decoding

As mentioned in Section 2.9, decoding is the process of searching for the most probable translation of the input sentence according to our models. Because the sentence is usually unseen in the training data, it has to be broken into smaller units for which sufficient statistical evidence is available. Each of these units corresponds to a grammar rule, so the main idea behind decoding algorithms is to connect these rules together to obtain the optimal sentence translation.

The biggest issue in the decoding process is the way the units interact with each other. Reordering models in standard phrase-based decoding consider the input positions of neighbouring output phrases, while n-gram language models tie the translated words together even more tightly. Since this coupling spans several rules, the sentence translation cannot be treated as an independent combination of the applied translation rules. In simple words, it is not easy to search for the most suitable rules to apply to a sentence because certain scoring functions have to be taken into account. Conventional phrase-based decoding proceeds sequentially by building the translation from left to right.

Decoding is performed by an algorithm that attempts to find $\arg\max_{y \in Y(x)} f(y)$, assuming $y = p_1, p_2, \ldots, p_L$:

$$f(y) = h(e(y)) + \sum_{k=1}^{L} g(p_k) + \eta \times \sum_{k=1}^{L-1} \big| t(p_k) + 1 - s(p_{k+1}) \big| \tag{3.15}$$

Under this definition of $f(y)$, the problem of finding $\arg\max_{y \in Y(x)} f(y)$ is NP-hard⁵; the algorithm described here is therefore an approximate method that cannot guarantee the optimal solution. The first critical data structure in the algorithm is the state. A state is a tuple $(e_1, e_2, b, r, \alpha)$, where $e_1$ and $e_2$ are English words, $b$ is a bit-string of length $n$⁶, $r$ is an integer specifying the end-point of the last phrase in the state, and $\alpha$ is the score of the state. Any sequence of phrases can be mapped to its corresponding state. For instance, $y =$ (1, 3, we must also), (7, 7, take), (4, 5, this criticism) would be mapped to the state (this, criticism, 1111101, 5, $\alpha$). The state records the last and second-last words of the translation underlying this sequence of phrases, namely this and criticism. The bit-string records which source words have been translated: the $i$-th bit is set to 1 once the corresponding word has been translated, and remains 0 otherwise. In the example above, only the 6th bit is 0, since that word has not been translated. The value $r = 5$ indicates that the final phrase in the sequence, (4, 5, this criticism), ends at position 5. Finally, $\alpha$ is the score of the partial translation, calculated as:

$$\alpha = h(e(y)) + \sum_{k=1}^{L} g(p_k) + \eta \times \sum_{k=1}^{L-1} \big| t(p_k) + 1 - s(p_{k+1}) \big| \tag{3.16}$$

where $L = 3$, $e(y) =$ we must also take this criticism, and $p_1 =$ (1, 3, we must also), $p_2 =$ (7, 7, take), $p_3 =$ (4, 5, this criticism). Note that the state records only the last and second-last words of the derivation. This is because a 3-gram language model is sensitive only to the last two words of the sequence, so recording these two words is sufficient.

We define the initial state as $q_0 = (*, *, 0^n, 0, 0)$, where $0^n$ is a bit-string of length $n$ containing $n$ zeroes, and $*$ denotes the special start symbol of the language model. In the initial state no words have been translated, so all the bits are set to 0, and both the value of $r$ and the score $\alpha$ are 0.

⁵A problem is NP-hard if an algorithm for solving it can be translated into one for solving any NP (non-deterministic polynomial time) problem. NP-hard therefore means at least as hard as any NP problem, although it might indeed be harder.
⁶Recall that n is the length of the source-language sentence.

Next we define the function $ph(q)$, which maps a state $q$ to the set of phrases that can be appended to $q$. For a phrase $p$ to be a member of $ph(q)$, where $q = (e_1, e_2, b, r, \alpha)$, the following conditions must be satisfied:

• $p$ must not clash with the bit-string $b$⁷.

• The distortion limit must not be violated⁸.

In addition, for any state $q$ and any phrase $p \in ph(q)$, $next(q, p)$ is defined to be the state formed by combining state $q$ with phrase $p$. Formally, if $q = (e_1, e_2, b, r, \alpha)$ and $p = (s, t, \varepsilon_1 \ldots \varepsilon_M)$, then $next(q, p)$ is the state $q' = (e_1', e_2', b', r', \alpha')$ defined as follows:

• Define $\varepsilon_{-1} = e_1$ and $\varepsilon_0 = e_2$.
• Define $e_1' = \varepsilon_{M-1}$ and $e_2' = \varepsilon_M$.
• Define $b_i' = 1$ for $i \in \{s \ldots t\}$, and $b_i' = b_i$ for $i \notin \{s \ldots t\}$.
• Define $r' = t$.
• Define $\alpha' = \alpha + g(p) + \sum_{i=1}^{M} \log q(\varepsilon_i \mid \varepsilon_{i-2}, \varepsilon_{i-1}) + \eta \times |r + 1 - s|$.

Hence $e_1'$ and $e_2'$ record the last and second-last words of the translation formed by appending phrase $p$ to state $q$. The bit-string is updated to $b'$ to record the fact that words $s \ldots t$ have now been translated. The value $r'$ is simply set to $t$, i.e., the end point of phrase $p$, and $\alpha'$ is obtained by adding the phrase score $g(p)$, the language model scores of the words $\varepsilon_1 \ldots \varepsilon_M$, and the distortion term $\eta \times |r + 1 - s|$.
The final function required by the decoding algorithm is a simple equality test on states, $eq(q, q')$, which returns true or false. Assuming $q = (e_1, e_2, b, r, \alpha)$ and $q' = (e_1', e_2', b', r', \alpha')$, $eq(q, q')$ is true if and only if $e_1 = e_1'$, $e_2 = e_2'$, $b = b'$, and $r = r'$.
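The state update above can be written out as a short Python sketch. This is an illustration under assumptions: g, lm_logprob, and eta are hypothetical callbacks/constants standing in for the phrase score, the trigram log-probability, and the distortion penalty; indexing follows the 1-based convention of the text.

```python
def next_state(q, p, g, lm_logprob, eta):
    """Combine state q with phrase p, as in the definition of next(q, p)."""
    e1, e2, b, r, alpha = q            # q = (e1, e2, bit-string, r, alpha)
    s, t, eps = p                      # p = (s, t, (eps_1, ..., eps_M))
    words = [e1, e2] + list(eps)       # eps_{-1}, eps_0, eps_1, ..., eps_M
    M = len(eps)
    # language model term: sum of log q(eps_i | eps_{i-2}, eps_{i-1})
    lm = sum(lm_logprob(words[i], (words[i - 2], words[i - 1]))
             for i in range(2, 2 + M))
    b_new = list(b)
    for i in range(s, t + 1):          # words s..t are now translated
        b_new[i - 1] = 1               # bit-string is 1-indexed in the text
    alpha_new = alpha + g(p) + lm + eta * abs(r + 1 - s)
    return (words[-2], words[-1], b_new, t, alpha_new)

# For the running example, applying (1, 3, we must also), (7, 7, take),
# (4, 5, this criticism) in turn to q0 = ("*", "*", [0] * 7, 0, 0.0)
# yields the coverage 1111101 and r = 5, as described above.
```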

3.3 Hierarchical Phrase-Based Translation Model

The hierarchical phrase-based translation model (Chiang, D., 2007) is considered one of the suitable translation models for improving SMT performance. The difference between the Hierarchical model and the Classical model lies in the expressivity of the rules. Hierarchical rules are allowed to carry one or more non-terminals, each of which acts as a variable that can be expanded into other expressions by the grammar, and this expansion takes place recursively. The grammar is organized as a Synchronous Context-Free Grammar (SCFG), which captures long-range dependencies such as syntactic information. This information plays an important role in generating correct translations between the source and target languages.

The intuitive concept behind the hierarchical phrase-based translation model is to allow gaps in the phrases so that the translation units used in the classical phrase-based model can be generalized. The basic unit of translation in the classical model is the phrase, whereas the hierarchical model introduces sub-phrases. These sub-phrases are then used to remove some of the problems associated with the standard phrase-based translation model. The model is meant to capture long-range dependencies, which are crucial for generating a correct translation.

⁷For instance, we must have $b_i = 0$ for $i \in \{s(p) \ldots t(p)\}$.
⁸For instance, we must have $|r + 1 - s(p)| \leq d$, where $d$ is the distortion limit.

3.3.1 Hierarchical Rules
Standard classical phrase-based models perform well for translations of sub-strings that have been observed directly in the training dataset. It must also be kept in mind that learning phrases with more than three words hardly improves performance, because such phrases become infrequent due to data sparsity. Instead of going this way, it is more natural to learn some grammatical rules along with the small phrases and then combine them to create a translation.

Some other phrase-based models introduce reordering as a distortion that is independent of the content of the phrases. However, this is similar to fighting a duel blindfolded: as a general rule, every reordering decision should make use of context. All of these issues are addressed by the hierarchical phrase-based model. It is considered one step beyond the phrase-based translation model, since its sub-phrases allow natural movement of phrase parts and the learning of grammar rules. The translation system learns these rules from the parallel data without any syntactic annotation, thereby adopting the technology of syntax-based translation, although the hierarchical phrases also pose a challenging problem.

3.3.2 Grammars Definition
Hierarchical models are inspired by the SCFG formalism, which generates the source and target sentences by successively rewriting non-terminals. In the hierarchical phrase-based model, the grammar $G$, a special case of an SCFG, is defined as a 4-tuple $G = (T, N, R, R_g)$. Here, $T$ is the set of terminals and $N$ is the set of non-terminals of $G$. Commonly, two types of non-terminals are used by the hierarchical model's grammar, namely $X$ and $S$, where $S$ is the special start symbol. $R$ denotes the set of production rules of the form

$$X \rightarrow \langle \gamma, \alpha, \sim \rangle, \qquad \gamma, \alpha \in (X \cup T)^{+} \tag{3.17}$$

In this formula, $\gamma$ and $\alpha$ denote sequences of terminals and non-terminals on the source and target sides, and $\sim$ stands for the alignment of the non-terminals between the source and target sides, such that co-indexed non-terminal pairs are rewritten synchronously. These production rules are combined in derivations of the top symbol $S$, which is derived through the glue rules $R_g$. The hierarchical model uses two glue rules: $S \rightarrow \langle X_1, X_1 \rangle$ and $S \rightarrow \langle S_1 X_2, S_1 X_2 \rangle$.

Again, the non-terminal indices indicate which source and target non-terminals are rewritten synchronously (they share the same index). The second glue rule shows that longer spans can be translated by concatenating the translations of smaller spans.
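As a concrete illustration, the sketch below shows one way such SCFG rules and the two glue rules might be represented. The representation and the French/English example rule are illustrative assumptions, not the actual Cdec data structures.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Rule:
    lhs: str                      # "X" or the start symbol "S"
    src: Tuple[str, ...]          # gamma: source-side terminals / non-terminals
    tgt: Tuple[str, ...]          # alpha: target-side terminals / non-terminals
    # co-indexed non-terminals share the same index marker, e.g. "X1"

# A lexical hierarchical rule with one gap: X -> < ne X1 pas , does not X1 >
rule = Rule("X", ("ne", "X1", "pas"), ("does", "not", "X1"))

# The two glue rules that assemble partial translations left to right:
glue1 = Rule("S", ("X1",), ("X1",))
glue2 = Rule("S", ("S1", "X2"), ("S1", "X2"))
```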

3.3.3 Rules Extraction
Hierarchical phrase-based models use a heuristic approach to extract rules from phrase pairs. The training of hierarchical models shares the same initial steps as that of Classical phrase-based models: it begins from the word alignments and moves on to the generation of aligned phrase pairs. Given a parallel text, training first obtains source-target and target-source alignments with an aligner such as GIZA++ or fast-align. The bidirectional alignments are then symmetrized with heuristic alignment strategies such as intersection or union (Och F. J. and H. Ney, 2003). Finally, the aligned phrase pairs are extracted following the alignment template approach (Och F. J. and H. Ney, 2004), in such a way that the extracted phrase pairs are consistent with the word alignments. In simple words, the phrase pairs are constrained by the source-target alignments in such a way that all alignment links of the source (target) words connect to target (source) words within the phrase.

Consider a word-aligned sentence pair $(s_1^J, t_1^I, A)$, where $s_1^J$ and $t_1^I$ are the source and target sentences of lengths $J$ and $I$, linked by the word alignment $A$. A source-target sequence pair $(s_i^j, t_{i'}^{j'})$ can be termed a phrase pair if the following alignment constraints are satisfied:

• $(k, k') \in A$ for some $k \in [i, j]$ and $k' \in [i', j']$
• $(k, k') \notin A$ for $k \in [i, j]$ and $k' \notin [i', j']$
• $(k, k') \notin A$ for $k \notin [i, j]$ and $k' \in [i', j']$

Commonly, hierarchical models restrict their extraction to the tighter phrase pairs, so that any phrase pair with an unaligned boundary word is ignored. This is a strategy to control the number of extracted rules, which would otherwise be extremely large. The tighter phrase-pair constraints can be written as follows (a small consistency-check sketch is given after the list below):

• $(k, k') \in A$ for some $k = i$ and $k' \in [i', j']$
• $(k, k') \in A$ for some $k = j$ and $k' \in [i', j']$
• $(k, k') \in A$ for some $k \in [i, j]$ and $k' = i'$
• $(k, k') \in A$ for some $k \in [i, j]$ and $k' = j'$
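The following is a minimal sketch of the alignment-consistency test described by the constraints above. It assumes 0-based indices and a set of (k, k') alignment links; both assumptions are choices of this sketch rather than anything fixed by the text.

```python
def is_phrase_pair(A, i, j, i2, j2, tight=False):
    """A: set of alignment links (k, k2); (i..j, i2..j2): candidate spans."""
    inside = [(k, k2) for (k, k2) in A if i <= k <= j and i2 <= k2 <= j2]
    if not inside:
        return False                       # at least one link must fall inside
    for (k, k2) in A:
        src_in = i <= k <= j
        tgt_in = i2 <= k2 <= j2
        if src_in != tgt_in:               # no link may cross the pair boundary
            return False
    if tight:
        # tighter constraint: every boundary word must be aligned inside
        src_aligned = {k for (k, _) in inside}
        tgt_aligned = {k2 for (_, k2) in inside}
        if not ({i, j} <= src_aligned and {i2, j2} <= tgt_aligned):
            return False
    return True
```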

After extracting the initial phrase pairs, the heuristic rule-extraction algorithm proceeds as follows. Every initial phrase pair $x = \langle s_i^j, t_{i'}^{j'} \rangle$ is first taken as a rule $X \rightarrow \langle s_i^j, t_{i'}^{j'} \rangle$. Now assume that $x' = \langle s', t' \rangle$ is a sub-phrase pair of this rule, for instance $s_i^j = s_p\, s'\, s_s$ and $t_{i'}^{j'} = t_p\, t'\, t_s$. The algorithm then creates a new rule in which the non-terminal $X$ is introduced on both the source and target sides, covering the spans of the sub-phrase $x'$, by applying the following rule:

$$X \rightarrow \langle s_p X_1 s_s,\; t_p X_1 t_s \rangle \tag{3.18}$$

Note that the non-terminals on the right-hand side of the rule are co-indexed, which allows them to be rewritten synchronously.

The hierarchical model reduces decoding complexity and limits the grammar's size by imposing several constraints on the extracted rules. The extracted rules are filtered to remove those that violate any of the constraints below:

• Initial phrase pairs must not have unaligned words at the source or target phrase boundaries.

• Bi-phrases may contain at most ten words on each side (note that extracted rules are limited to five tokens on the source side).

• Rules may contain at most two non-terminals.

• Adjacent non-terminals are not allowed on the source side (this helps to avoid spurious ambiguity during decoding, i.e. multiple derivations with the same translation yield and identical feature-function values).

• Each rule must be lexicalized by at least one aligned source-target word pair (this ensures that lexical evidence backs the translation rule).

In simple words, the rule extraction procedure can be summarized in two main steps:

1. Identify initial phrase pairs using the same criterion as most phrase-based sys- tems, namely, there must be at least one word inside one phrase aligned to a word inside the other, but no word inside one phrase can be aligned to a word outside the other phrase.

2. To obtain rules from the phrases, look for phrases that contain other phrases and replace the sub-phrases with non-terminal symbols (a sketch of this substitution step is given below).
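The simplified sketch below shows step 2, turning an initial phrase pair into a hierarchical rule by replacing an inner phrase pair with a co-indexed non-terminal, as in Equation (3.18). Function and variable names are illustrative assumptions, not the Cdec extractor.

```python
def make_hierarchical_rule(src, tgt, sub_src, sub_tgt, index=1):
    """src/tgt: the initial phrase pair (token tuples); sub_src/sub_tgt: an
    extracted sub-phrase pair occurring contiguously inside src and tgt."""
    def replace(seq, sub):
        for start in range(len(seq) - len(sub) + 1):
            if tuple(seq[start:start + len(sub)]) == tuple(sub):
                return (tuple(seq[:start]) + (f"X{index}",)
                        + tuple(seq[start + len(sub):]))
        raise ValueError("sub-phrase not found")
    return "X", replace(src, sub_src), replace(tgt, sub_tgt)

# Example (hypothetical data), giving X -> < s_p X1 s_s , t_p X1 t_s >:
# make_hierarchical_rule(("el", "gato", "negro"), ("the", "black", "cat"),
#                        ("gato",), ("cat",))
# -> ("X", ("el", "X1", "negro"), ("the", "black", "X1"))
```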

3.3.4 Rule Parameters Learning
In order to use the extracted grammar for decoding, we need to learn the rule parameters, for instance the conditional translation probabilities $P(t|s)$ and $P(s|t)$. Each sentence pair in the corpus can be obtained through several derivations, which are not usually observed. Since maximum-likelihood estimates of rule frequencies cannot be determined, Chiang, D. (2007) uses heuristics to estimate a rule distribution: a unit count is assumed for each phrase pair and distributed equally among the rules derived from that phrase pair. The training corpus is then used to aggregate the rule counts $c(s, t)$ across all phrase pairs, and the conditional translation probabilities $P(t|s)$ and $P(s|t)$ are determined by relative-frequency estimation over these counts. Bod (1998) presented the original proposal of this heuristic estimator for Data-Oriented Parsing (DOP), where it was used to estimate probabilistic tree substitution grammars (PTSG) for parsing. It was later used for the classical phrase-based translation model (Och F. J. and H. Ney, 2004) and successfully applied in several SMT models (Quirk et al., 2005; Galley et al., 2006b), including the hierarchical phrase-based translation model.
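A minimal sketch of the relative-frequency step follows, assuming the fractional rule counts c(s, t) have already been aggregated over the training corpus (the input format is an assumption of this sketch).

```python
from collections import defaultdict

def relative_frequencies(rule_counts):
    """rule_counts: dict mapping (src_side, tgt_side) -> fractional count c(s, t).
    Returns relative-frequency estimates of P(t|s) and P(s|t)."""
    src_totals, tgt_totals = defaultdict(float), defaultdict(float)
    for (s, t), c in rule_counts.items():
        src_totals[s] += c
        tgt_totals[t] += c
    p_t_given_s = {(s, t): c / src_totals[s] for (s, t), c in rule_counts.items()}
    p_s_given_t = {(s, t): c / tgt_totals[t] for (s, t), c in rule_counts.items()}
    return p_t_given_s, p_s_given_t
```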

3.3.5 Standard Features
Following the standard classical phrase-based translation model, the hierarchical method uses a log-linear model (Och and Ney, 2002) for translation. Under this scenario, the probability of a derivation can be written in terms of different feature functions $\phi$ as:

$$P(d) \propto \prod_{i=1}^{k} \phi_i^{w_i} \tag{3.19}$$

where $k$ is the total number of features and $w_i$ are the weights of the feature functions. The hierarchical model uses the following standard feature functions (a small scoring sketch follows the list):
• Conditional translation probabilities $P(t|s)$ and $P(s|t)$
• Conditional lexical weights $P_{lex}(t|s)$ and $P_{lex}(s|t)$
• Phrase penalty
• Word penalty
• Rule weight
• Language model
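The log-linear combination in Equation (3.19) can be sketched as follows. The feature names and values are hypothetical; in practice the log of the product is used, and features are assumed here to be positive probability-like values.

```python
import math

def derivation_score(features, weights):
    """features/weights: dicts keyed by feature name; returns the log of the
    unnormalized probability in Equation (3.19)."""
    return sum(weights[name] * math.log(value)
               for name, value in features.items())

# Example with made-up values:
# derivation_score({"p_t_given_s": 0.2, "p_s_given_t": 0.1, "lm": 1e-4},
#                  {"p_t_given_s": 1.0, "p_s_given_t": 1.0, "lm": 0.5})
```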

3.3.6 Hierarchical Model Decoding
The hierarchical phrase-based model uses a CKY-style algorithm (Cocke, 1969; Kasami, 1966; Younger, 1967) for decoding. Given a source sentence $s$, the decoder finds the target-side yield $t_{best}$ of the best-scoring derivation obtained by applying rules of the SCFG:

$$\hat{t} = t_{best}\Big( \arg\max_{d \in D(s)} P(d) \Big) \tag{3.20}$$

Here, $D(s)$ is the set of derivations obtained from the learned grammar for the source sentence $s$. The decoder parses the source sentence with a modified version of the CKY parser, and during this procedure the target sides of the corresponding derivations simultaneously produce the candidate translations. The rule parameters and the other features, together with the language model score of the target translation, are used to score the derivations as in Equation (3.19). The derivation starts from the leaf cells of the CKY chart, which correspond to the source-side tokens, and proceeds bottom-up. Analogously to monolingual parsing, the decoder identifies the applicable rules for each cell in the CKY chart; for these rules, the non-terminals must have corresponding entries in the respective antecedent cells. The target sides of the production rules produce the translations of the source spans, and the translations in the top-most cell correspond to the entire sentence.

The log-linear model over derivations $P(d)$ can be factorized to separate the language model (LM) feature from the other features. The LM feature scores the target yield as $P_{lm}(t)$, usually with an n-gram model trained separately. The model can be written by factorizing the derivation $d$ into its component rules $R_d$ as below:

$$P(d) \propto \left( \prod_{i=1}^{k-1} \prod_{r \in R_d} \phi_i(r)^{w_i} \right) P_{lm}(t)^{w_{lm}} \tag{3.21}$$

where $w_i$ is the corresponding weight of feature $\phi_i$. The feature weights $w_i$ are optimized by minimizing a loss (Aho and Ullman, 1969) or by comparing pairwise rankings (Hopkins and May, 2011).

3.4 Experimental Framework

In this section, we provide baseline translation systems for three language pairs. Three of the baseline systems are based on the Classical phrase-based translation model, and the other three are based on the Hierarchical phrase-based translation model. Even though there are various differences between the Hierarchical and Classical phrase-based translation models, the pipelines for training and testing are comparatively similar. For this reason, we build on the Moses (Classical model) and Cdec (Hierarchical model) open-source translation engines to implement the aforementioned popular translation models. We use these open-source tool-kits for the following reasons:
• Support for linguistically motivated factors.
• Support for confusion network decoding.
• Efficient data formats for translation and language models.
In addition, these translation engines provide a wide variety of tools that can be used for training, tuning, and applying the system to various other translation tasks (Ahmadnia and Serrano, 2015).

Moses (Koehn et al., 2007) is an open-source phrase-based tool-kit that can be used to train statistical models of text translation from a source language to a target one. It allows new source text to be decoded with these models so that automatic translations in the target language can be produced. The basic requirement for training is a parallel corpus of passages in the two languages, typically manually translated sentence pairs. Translation probabilities can be calculated from a sub-sample of occurrences of the source phrase in question. Since the system has access to the target-language corpus and the word-alignment data, phrase translations and their model parameters can be determined at run-time (Lopez, 2008).

The Moses platform is used mainly because it allows us to train translation models automatically for the considered language pair. Once a trained model is available, an efficient search algorithm finds the highest-probability translation among an exponential number of choices as quickly as possible.

Cdec (Dyer et al., 2010) is commonly used for SMT and similar structured prediction models. It is a decoder, aligner, and learning framework. The main feature of Cdec is that it uses a single unified internal representation for translation forests, and the decoder strictly separates model-specific translation logic from general re-scoring, pruning, and inference algorithms. Thanks to this unified representation, the decoder can extract not only the 1-best or k-best translations, but also alignments to a reference, as well as the quantities necessary to drive discriminative training with gradient-based or gradient-free optimization techniques. The Cdec platform is used because of its mature and advanced nature: it is a mature software platform for research in the development of translation models and algorithms. Its architecture was designed with machine learning and algorithmic research use-cases in mind, and it works efficiently both in limited-resource environments and in large cluster environments.

We conducted experiments with Hierarchical translation models using the Cdec translation engine, over a range of corpus sizes. We then compared the results with Classical phrase-based models using the Moses translation engine on the same corpora. The evaluation metric chosen for these experiments is BLEU (Papineni et al., 2002); higher BLEU scores are interpreted as better translation quality when comparing the translation systems.
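For reference, corpus-level BLEU of the kind reported below could be computed as in the hedged sketch that follows; it assumes the sacrebleu package, since the exact evaluation scripts used in the thesis are not specified here.

```python
import sacrebleu

# System outputs and their references, one sentence per entry (toy data).
hypotheses = ["the cat is on the mat", "there is a cat on the mat"]
references = [["the cat is on the mat", "there is a cat on the mat"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(round(bleu.score, 2))   # corpus-level BLEU on the 0-100 scale
```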

3.4.1 Experimental Set-Up
We evaluated three language pairs in this experiment, i.e. Spanish-English, English-Persian, and Persian-Spanish; in all cases the backward translation direction has also been applied. To obtain the best possible results, a statistical language model requires a very large amount of target-language data and has to be trained to obtain proper probabilities. In this portion of the experiments we apply 4-gram language models. For all sets of experiments we selected (500,000) parallel sentences for the training step, (30,000) parallel sentences for the tuning step, and (50,000) parallel sentences for the testing step. All these parallel sentences were selected from the Open-Subtitles⁹ parallel corpus (Tiedemann, 2012). In addition, for the third language pair we ran the test with another parallel corpus called Tanzil¹⁰ (Tiedemann, 2012). For this experiment we selected (50,000) parallel sentences for the training step, (5,000) parallel sentences for the tuning step, and (10,000) parallel sentences for the testing step. Our Moses and Cdec translation tool-kits are trained under identical conditions. The Open-Subtitles collection of parallel corpora has been compiled from a large database of movie and TV subtitles. It includes a total of (1,689) bilingual texts, spanning (2.6) billion sentences across (60) languages. This corpus also incorporates a number of enhancements in the preprocessing and alignment of the subtitles, such as:

• The automatic correction of OCR¹¹ errors.

• The use of meta-data to estimate the quality of each subtitle and score subtitle pairs.

⁹http://opus.lingfil.uu.se/Open-Subtitles 2013.php
¹⁰http://opus.lingfil.uu.se/Tanzil.php
¹¹Optical Character Recognition, the process of converting scanned images of letters and words into electronic versions.

As can be seen in Table 3.1, we used the same amounts of the Open-Subtitles parallel corpus for training both the translation model and the language model.

TABLE 3.1: Open-Subtitles corpus statistics.

Directions        Spanish/English   English/Persian   Persian/Spanish
Training step     500,000           500,000           500,000
  English words   3,245,701         3,262,827         —
  Spanish words   3,018,988         —                 3,020,347
  Persian words   —                 3,183,056         3,342,977
Tuning step       30,000            30,000            30,000
  English words   185,017           184,002           —
  Spanish words   177,514           —                 176,820
  Persian words   —                 185,533           191,904
Testing step      50,000            50,000            50,000
  English words   309,863           312,174           —
  Spanish words   306,762           —                 304,496
  Persian words   —                 308,033           332,352

Tanzil is a collection of Quran translations compiled by the Tanzil project; it is used as a second parallel corpus for an additional test on the Persian-Spanish language pair. We also used the same amounts of this parallel corpus for training both the translation model and the language model.

TABLE 3.2: Tanzil corpus statistics.

Steps            Training     Tuning    Testing
Sentences        50,000       5,000     10,000
Persian words    1,624,002    110,327   254,816
Spanish words    1,060,829    116,249   218,844

3.4.2 Implementation
For the Moses-based experiments, in both directions of every translation task, we configured the Moses decoder as a suitable Classical phrase-based platform. We set the beam size to (300) and the distortion limit to (6); the number of target phrases is limited to (10) for each source phrase, and the number of MERT iterations is set to (20). In the Cdec-based experiments we use the Cdec implementation of the Hierarchical phrase-based algorithm. The maximum phrase length is set to (10), the number of Margin Infused Relaxed Algorithm (MIRA) iterations is set to (20), and the size of the k-best list is set to (300). The language models used in all Moses-based and Cdec-based experiments are 4-gram models, smoothed with the modified Kneser-Ney algorithm (Pickhardt et al., 2014) as implemented in KenLM (Heafield, 2011), SRILM (Stolcke, 2002), and IRSTLM (Federico et al., 2008).
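As an illustration of how such a 4-gram model is consulted, the hedged sketch below assumes the kenlm Python bindings and an already-trained model file (the file name is hypothetical).

```python
import kenlm

model = kenlm.Model("target.4gram.klm")            # hypothetical model path
sentence = "this is a test sentence"
print(model.score(sentence, bos=True, eos=True))   # total log10 probability

# Per-word breakdown, useful for inspecting back-off orders and OOVs:
for logprob, ngram_length, oov in model.full_scores(sentence):
    print(logprob, ngram_length, oov)
```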

The issue of word alignment in the parallel corpus is critical and therefore needs attention. Sentence-aligned parallel corpora are useful for the application of machine learning to MT; unfortunately, however, parallel corpora do not usually originate in this form. We also found a comparatively great shortage of bilingual text for Persian-Spanish translation, so considerable care was taken to ensure that the text that was available was of the best possible quality. Several different methods were used to perform alignment. Desirable characteristics of an efficient sentence alignment method include speed, accuracy, and no need for prior knowledge of the corpus or of the languages in the pair. In our experiments we used fast-align, because it is a simple and fast alignment tool (Dyer et al., 2013) compared with the other tool-kits.
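A typical fast_align workflow, following its documented command-line usage, is sketched below; the file names are hypothetical, and wrapping the calls in Python is merely a convenience for this illustration.

```python
import subprocess

# corpus.src-tgt contains one sentence pair per line: "source ||| target"
with open("forward.align", "w") as fwd:
    subprocess.run(["fast_align", "-i", "corpus.src-tgt", "-d", "-o", "-v"],
                   stdout=fwd, check=True)
with open("reverse.align", "w") as rev:
    subprocess.run(["fast_align", "-i", "corpus.src-tgt", "-d", "-o", "-v", "-r"],
                   stdout=rev, check=True)
# Symmetrize the two directions with the grow-diag-final-and heuristic.
with open("sym.align", "w") as sym:
    subprocess.run(["atools", "-i", "forward.align", "-j", "reverse.align",
                    "-c", "grow-diag-final-and"], stdout=sym, check=True)
```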

3.4.3 Results Analysis and Evaluation
After implementation, we discuss the results achieved and compare the performance of Moses and Cdec for our translation systems under different 4-gram language models. We trained the translation systems on three language pairs in six directions. For the evaluation, we report the results using the BLEU evaluation metric, starting with a comparison of the best-configuration translations generated by both the Cdec and Moses translation engines.

In the first set of experiments, we applied the Moses and Cdec translation engines to the Spanish-English translation direction and its back translation under three kinds of 4-gram language models. Table 3.3 shows the results.

TABLE 3.3: The performance comparison of Classical (Moses) and Hierarchical (Cdec) translation models for Spanish-English transla- tion task according to different 4-gram language models on Open- Subtitles parallel corpus using BLEU scores.

4-gram language models   IRSTLM   SRILM   KenLM
Moses–Classical          25.65    25.96   26.19
Cdec–Hierarchical        27.12    27.51   27.68

As seen in Table 3.3, with KenLM the BLEU score for Cdec is (27.68) and for Moses (26.19), while with SRILM and IRSTLM Cdec obtains (27.51) and (27.12) BLEU points and Moses reaches (25.96) and (25.65) BLEU points, respectively, in this translation direction. In all cases, the BLEU scores for Cdec are better than those of Moses with all applied language models for the Spanish to English translation task. In the second set of experiments, for the back-translation task (English-Spanish), we again applied the Moses and Cdec translation platforms. Table 3.4 shows that using KenLM, the BLEU score for Moses is (27.91) and for Cdec (26.45). Using SRILM and IRSTLM, Moses obtains (27.73) and (27.44) BLEU points, and Cdec reaches (26.14) and (25.92) BLEU points, respectively. The BLEU scores for Moses show better performance than Cdec with all applied language models for the English to Spanish translation task.

TABLE 3.4: The performance comparison of Classical (Moses) and Hierarchical (Cdec) translation models for English-Spanish transla- tion task according to different 4-gram language models on Open- Subtitles parallel corpus using BLEU scores.

4-gram language models   IRSTLM   SRILM   KenLM
Moses–Classical          27.44    27.73   27.91
Cdec–Hierarchical        25.92    26.14   26.45

Figure 3.3 illustrates the learning curves of the Moses and Cdec translation systems using the 4-gram language models of the Ken, SRI, and IRST tool-kits for both the Spanish-English and English-Spanish translation directions, according to BLEU scores.

FIGURE 3.3: The learning curve of Classical (Moses) and Hierarchical (Cdec) phrase-based translation models for Spanish-English transla- tion and vice versa using 4-gram language models on Open-subtitles parallel corpus according to BLEU scores.

According to this figure, two curves (the blue one and the red one) correspond to the forward translation task (Spanish-English), and two curves (the green one and the purple one) correspond to the backward translation task (English-Spanish). The results shown in Tables 3.3 and 3.4, together with the learning curves in Figure 3.3, show that for the Spanish-English translation direction the Hierarchical phrase-based model (Cdec) outperforms the Classical model, while for the back-translation task from English to Spanish the Classical phrase-based model outperforms the Hierarchical one, under the 4-gram Ken, SRI, and IRST language models.
In the third and fourth sets of experiments, the English-Persian and back-translation tasks are run under the same conditions with Moses and Cdec. Table 3.5 shows that the more suitable translation model for the English-Persian task is the Hierarchical one, i.e. the Cdec system achieves (24.06), (23.78), and (24.52) BLEU points with SRILM, IRSTLM, and KenLM, respectively, while Moses scores (22.31) with KenLM, (21.97) with IRSTLM, and (22.12) with SRILM. Table 3.6 shows that the Classical phrase-based translation model performs better than the Hierarchical one: the BLEU scores of Moses for the IRST, SRI, and Ken language models are (25.18), (25.35), and (25.61), respectively, while the BLEU scores for Cdec are (23.54), (23.75), and (23.97) for the same language models.

TABLE 3.5: The performance comparison of Classical (Moses) and Hierarchical (Cdec) translation models for English-Persian transla- tion task according to different 4-gram language models on Open- Subtitles parallel corpus using BLEU scores.

4-gram language models   IRSTLM   SRILM   KenLM
Moses–Classical          21.97    22.12   22.31
Cdec–Hierarchical        23.78    24.06   24.52

TABLE 3.6: The performance comparison of Classical (Moses) and Hierarchical (Cdec) translation models for Persian-English transla- tion task according to different 4-gram language models on Open- Subtitles parallel corpus using BLEU scores.

4-gram language models   IRSTLM   SRILM   KenLM
Moses–Classical          25.18    25.35   25.61
Cdec–Hierarchical        23.54    23.75   23.97

FIGURE 3.4: The learning curve of Classical (Moses) and Hierarchical (Cdec) phrase-based translation models for English-Persian translation and vice versa using 4-gram language models on Open-subtitles parallel corpus according to BLEU scores.

Figure 3.4 illustrates the learning curves of the Moses and Cdec translation systems using the 4-gram language models of the Ken, SRI, and IRST tool-kits for both the English-Persian and Persian-English translation directions, according to BLEU scores. In Figure 3.4, two curves (the blue one and the red one) correspond to the forward translation task (English-Persian), and two curves (the green one and the purple one) correspond to the backward translation task (Persian-English). The results shown in Tables 3.5 and 3.6, together with the learning curves in Figure 3.4, demonstrate that for the English-Persian translation direction the Hierarchical phrase-based model (Cdec) outperforms the Classical model, while for the back-translation task from Persian to English the Classical phrase-based model outperforms the Hierarchical one, under the 4-gram Ken, SRI, and IRST language models. In the fifth set of experiments, Cdec and Moses are applied to the Persian-Spanish translation task. Table 3.7 shows that using KenLM the BLEU score for Moses is (23.48), while for Cdec it is (22.77). With SRILM, the Moses-based system achieves (23.15) BLEU points and the Cdec-based system (22.50). With IRSTLM, the BLEU score for Moses is (22.89), while for Cdec it is (22.16). These scores show that Moses, as a Classical phrase-based system, performs better than Cdec, as a Hierarchical one, in the Persian-Spanish translation task.

TABLE 3.7: The performance comparison of Classical (Moses) and Hierarchical (Cdec) translation models for Persian-Spanish transla- tion task according to different 4-gram language models on Open- Subtitles parallel corpus using BLEU scores.

4-gram language models   IRSTLM   SRILM   KenLM
Moses–Classical          22.89    23.15   23.48
Cdec–Hierarchical        22.16    22.50   22.77

The sixth set of experiments concerns the Spanish-Persian translation direction. In this experiment we apply the Classical phrase-based translation system (Moses) as well as the Hierarchical phrase-based platform (Cdec). As seen in Table 3.8, the BLEU scores for Cdec are (21.88), (21.65), and (21.39) with KenLM, SRILM, and IRSTLM, respectively, while for the same language models Moses obtains (21.02), (20.78), and (20.56) BLEU points, respectively.

TABLE 3.8: The performance comparison of Classical (Moses) and Hierarchical (Cdec) translation models for Spanish-Persian transla- tion task according to different 4-gram language models on Open- Subtitles parallel corpus using BLEU scores.

4-gram language models   IRSTLM   SRILM   KenLM
Moses–Classical          20.56    20.78   21.02
Cdec–Hierarchical        21.39    21.65   21.88

The scores show that Cdec, as a Hierarchical phrase-based translation engine, performs better than Moses, as a Classical one, with all three kinds of 4-gram language models.

Figure 3.5 shows the learning curves of the Moses and Cdec translation systems using the 4-gram language models of the Ken, SRI, and IRST tool-kits for the Persian-Spanish translation task as well as the Spanish-Persian translation task, according to BLEU scores. Four curves are shown; two of them (the blue one and the red one) correspond to the forward translation direction (Persian-Spanish), and two of them (the green one and the purple one) correspond to the backward translation direction (Spanish-Persian).

FIGURE 3.5: The learning curve of Classical (Moses) and Hierarchical (Cdec) phrase-based translation models for Persian-Spanish transla- tion and vice versa using 4-gram language models on Open-subtitles parallel corpus according to BLEU scores.

In separate sets of experiments, we applied the 4-gram language models of the three tool-kits, using both the Classical and Hierarchical translation systems, on a parallel corpus from a different domain (Tanzil), in order to evaluate the Persian-Spanish translation task as well as its back translation. Tables 3.9 and 3.10 illustrate results similar to those in Tables 3.7 and 3.8.

TABLE 3.9: The performance comparison of Classical (Moses) and Hi- erarchical (Cdec) translation models for Persian-Spanish translation task according to different 4-gram language models on Tanzil paral- lel corpus using BLEU scores.

4-gram language models   IRSTLM   SRILM   KenLM
Moses–Classical          21.02    21.28   21.79
Cdec–Hierarchical        19.96    20.21   20.56

The BLEU scores for the forward translation direction with Moses, the Classical phrase-based translation system, applying the KenLM, SRILM, and IRSTLM tool-kits, are (21.79), (21.28), and (21.02), respectively, while the scores for the same translation engine are (20.34), (20.11), and (19.93) in the backward translation direction. In the forward translation task, applying the KenLM, SRILM, and IRSTLM tool-kits separately, Cdec, the Hierarchical phrase-based translation system, obtains BLEU scores of (20.56), (20.21), and (19.96), while the same platform scores (22.32), (22.08), and (21.77) in the back translation, respectively.

TABLE 3.10: The performance comparison of Classical (Moses) and Hierarchical (Cdec) translation models for Spanish-Persian transla- tion task according to different 4-gram language models on Tanzil parallel corpus using BLEU scores.

4-gram language models   IRSTLM   SRILM   KenLM
Moses–Classical          19.93    20.11   20.34
Cdec–Hierarchical        21.77    22.08   22.32

According to the results shown in Tables 3.7, 3.8, 3.9, and 3.10, for the Persian to Spanish translation task Moses, as a Classical phrase-based system, outperforms the Hierarchical one. However, for the Spanish to Persian translation direction, Cdec, as a Hierarchical phrase-based translation system, still performs better than Moses as the Classical one.

FIGURE 3.6: The learning curve of Classical and Hierarchical phrase-based translation models for Persian-Spanish translation and vice versa using 4-gram language models on Tanzil parallel corpus according to BLEU scores.

Figure 3.6 provides the learning curves of the Moses and Cdec translation systems using the 4-gram language models of the Ken, SRI, and IRST tool-kits for both the Persian-Spanish and Spanish-Persian translation directions, according to BLEU scores. In this figure the blue curve and the red one correspond to the Persian-Spanish translation task, while the green curve and the purple one correspond to the Spanish-Persian translation task.

3.5 Comparative Performance of Phrase-Based Models

In order to set an appropriate target for SMT research on our considered translation tasks, comparing the suitability and potential of the Classical phrase-based translation model and the Hierarchical one becomes an initial problem.

In this section, based on standard settings, we employ language models with different context sizes, i.e. 3-gram and 5-gram language models, built with the Ken, SRI, and IRST tool-kits, on our considered translation tasks. By investigating and comparing the impact of the different statistical language models, we can understand how they affect the translation results.

KenLM estimates unpruned language models with modified Kneser-Ney smoothing. The builder is disk-based: one specifies the amount of RAM to use and it performs a disk-based merge sort when necessary. It is faster than SRILM and IRSTLM and scales to much larger models. IRSTLM can scale but, to do so, it approximates modified Kneser-Ney smoothing. The latest version of BerkeleyLM (Pauls and Klein, 2011) does not implement interpolated modified Kneser-Ney smoothing, but rather implements absolute discounting without support for interpolation or modified discounting. SRILM uses so much memory that building large language models is infeasible.

Our results show that for the Spanish-English translation task the 5-gram Hierarchical model is preferable, while for the English-Spanish translation task the 5-gram Classical model is preferable. In the English-Persian translation direction the 5-gram Hierarchical system works best, while in Persian-English the 5-gram Classical system performs better than the rest. In the Persian to Spanish direction the 5-gram Classical model outperforms the rest, while in the back direction, from Spanish to Persian, the 5-gram Hierarchical model gains a better BLEU score than the others, independently of the kind of selected language model. The results of this performance comparison will be used as the state-of-the-art baseline on which further SMT research can be conducted.

3.5.1 Experiments Setting
For the Spanish↔English experiments a collection from the Europarl corpus (Koehn, P., 2005) has been used, and for the English↔Persian experiments a subset of the TEP corpus (Pilevar et al., 2011). The former collection consists of (500K) parallel sentences for the training step, (50K) parallel sentences for the tuning step, and (100K) parallel sentences for the testing step, while the latter consists of (400K) parallel sentences as the training data, (30K) parallel sentences as the development set, and (70K) parallel sentences as the test set.

To experiment with Persian-Spanish translation and vice versa, we collected a parallel corpus gathered from both the Open-Subtitles and Tanzil corpora (Tiedemann, 2012). The former and the latter consist of (100K) and (50K) Persian-Spanish sentence pairs, respectively, giving a Persian-Spanish parallel corpus of approximately (150K) sentence pairs in total. We manually selected (5K) sentence pairs as a development set and randomly selected (10K) sentence pairs as a test set. The remaining (135K) sentence pairs were used as the training set.

Our objective is to compare the quality of the forward and backward translation performance in each translation task while using both the Classical and Hierarchical phrase-based systems. To generate the 3-gram and 5-gram language models for Spanish, English, and Persian, we use the Ken, SRI, and IRST language model tool-kits.

Moses and Cdec are selected to perform phrase/rule extraction, phrase/rule-table generation, and decoding, while MERT and MIRA are used to tune the feature weights of the two models. The outcome of Moses and Cdec for the Classical and Hierarchical phrase-based models takes the form of a phrase-table and a rule-table, respectively.

The difference between the two tables is that the Hierarchical rule-table includes translations of both terminal and non-terminal nodes, so that the hierarchy is made explicit, whereas the Classical phrase-table provides information about phrase translation pairs, including the word order.

3.5.2 Results
We conducted the evaluation using the BLEU metric (Papineni et al., 2002) for the systems in both of our considered translation directions. The evaluation involves the 3-gram Classical model (3-gram C) and the 5-gram Classical model (5-gram C), as well as the 3-gram Hierarchical model (3-gram H) and the 5-gram Hierarchical model (5-gram H). In our experiments all the results of the Classical models are based on the Moses platform, while the results of the Hierarchical models are based on the Cdec platform.

Table 3.11 shows the Spanish-English translation results according to BLEU scores, comparing the performance of 3-gram and 5-gram language models built with KenLM, SRILM, and IRSTLM through Moses as the Classical-based system and Cdec as the Hierarchical-based system.

TABLE 3.11: The performance of applying 3-gram and 5-gram lan- guage models through Ken, SRI, and IRST tool-kits on Spanish- English translation task using BLEU metric.

Language models   3-gram C   5-gram C   3-gram H   5-gram H
KenLM             26.14      26.24      27.63      27.71
SRILM             25.92      26.06      27.40      27.57
IRSTLM            25.61      25.72      27.08      27.14

According to the results shown in Table 3.11, the Hierarchical phrase-based translation model is preferable for translation from Spanish to English under the SMT platform. In all cases, independent of the kind of n-gram language model, Cdec as a Hierarchical translation system performs better than Moses as a Classical translation system.

Figure 3.7 shows the performance chart of both the Classical (Moses) and Hierarchical (Cdec) phrase-based translation systems in the Spanish-English translation task. The chart shows that the 5-gram and 3-gram Hierarchical phrase-based models (the purple and green bars, respectively) outperform the Classical models with all language modelling tool-kits.

FIGURE 3.7: The performance chart of Spanish-English translation task in different kinds of language models according to BLEU.

The results of the English-Spanish translation are shown in Table 3.12. These results are also based on BLEU scores, with the translation systems using 3-gram and 5-gram language models built with KenLM, SRILM, and IRSTLM.

TABLE 3.12: The performance of applying 3-gram and 5-gram lan- guage models through Ken, SRI, and IRST tool-kits on English- Spanish translation task using BLEU metric.

Language models   3-gram C   5-gram C   3-gram H   5-gram H
KenLM             27.84      27.97      26.42      26.51
SRILM             27.68      27.45      26.07      26.19
IRSTLM            27.39      27.49      25.88      25.94

According to the results shown in Table 3.12, for translation from English to Spanish the Classical phrase-based translation model outperforms the Hierarchical phrase-based translation model under the SMT platform. In all cases, independent of the kind of n-gram language model, Moses as a Classical phrase-based translation system performs better than Cdec as a Hierarchical one.

Figure 3.8 illustrates the performance chart of both the Classical (Moses) and Hierarchical (Cdec) phrase-based translation systems in the English-Spanish translation task. The chart shows that the 5-gram and 3-gram Classical phrase-based models (the red and blue bars, respectively) outperform the Hierarchical models with all language modelling tool-kits.

FIGURE 3.8: The performance chart of English-Spanish translation task in different kinds of language models according to BLEU.

Table 3.13 shows the BLEU results of the English-Persian translation task. In this set of experiments the translation systems use 3-gram and 5-gram language models built with KenLM, SRILM, and IRSTLM, under the Moses and Cdec translation engines.

TABLE 3.13: The performance of applying 3-gram and 5-gram lan- guage models through Ken, SRI, and IRST tool-kits on English- Persian translation task using BLEU metric.

Language models   3-gram C   5-gram C   3-gram H   5-gram H
KenLM             22.29      22.34      24.48      24.59
SRILM             22.10      22.19      24.01      24.11
IRSTLM            21.95      22.02      23.75      23.84

According to these results, in translation from English to Persian the Hierarchical phrase-based translation model outperforms the Classical phrase-based translation model under the SMT platform. In all cases, independent of the kind of n-gram language model, Cdec as a Hierarchical system performs better than Moses as a Classical system.

Figure 3.9 shows the performance chart of both the Classical (Moses) and Hierarchical (Cdec) phrase-based translation systems in the English-Persian translation task. The chart shows that the 5-gram and 3-gram Hierarchical phrase-based models (the purple and green bars, respectively) perform better than the Classical models with all language modelling tool-kits.

FIGURE 3.9: The performance chart of English-Persian translation task in different kinds of language models according to BLEU.

The results of the translation process from Persian to English, based on BLEU scores, are shown in Table 3.14. Our translation systems, Moses and Cdec, use 3-gram and 5-gram language models built with KenLM, SRILM, and IRSTLM.

TABLE 3.14: The performance of applying 3-gram and 5-gram lan- guage models through Ken, SRI, and IRST tool-kits on Persian- English translation task using BLEU metric.

Language models   3-gram C   5-gram C   3-gram H   5-gram H
KenLM             25.54      25.66      23.95      24.02
SRILM             25.29      25.42      23.74      23.78
IRSTLM            25.10      25.25      23.51      23.58

According to the results, in the Persian-English translation the Classical phrase-based translation models work better than the Hierarchical phrase-based translation models under the SMT platform. In all cases, independent of the kind of n-gram language model, the Classical-based system (Moses) performs better than the Hierarchical-based system (Cdec).

The performance chart of both the Classical (Moses) and Hierarchical (Cdec) phrase-based translation systems in the Persian-English translation direction is shown in Figure 3.10. This chart shows that the Classical 5-gram and 3-gram models (the red and blue bars, respectively) perform better than the Hierarchical models with all language modelling tool-kits.

FIGURE 3.10: The performance chart of Persian-English translation task in different kinds of language models according to BLEU.

Now, we focus on the experimental framework of Persian↔Spanish translation tasks, and investigate the impact of different statistical language models on the translation systems’ performance.

The experimental results shown in Table 3.15 report the translation accuracy in terms of BLEU scores for Ken, SRI, and IRST, under 3-gram and 5-gram language models, applying the Moses and Cdec translation engines as the Classical and Hierarchical decoders, respectively, for the translation process from Persian to Spanish.

TABLE 3.15: The performance of applying 3-gram and 5-gram lan- guage models through Ken, SRI, and IRST tool-kits on Persian- Spanish translation task using BLEU metric.

Language models   3-gram C   5-gram C   3-gram H   5-gram H
KenLM             22.58      22.66      21.65      21.69
SRILM             22.19      22.25      21.29      21.41
IRSTLM            21.94      22.03      20.99      21.11

According to the BLEU results for the translation process from Persian to Spanish in Table 3.15, under all conditions and independent of the kind of n-gram language model, the Classical-based system (Moses) performs better than the Hierarchical-based system (Cdec).

Figure 3.11 illustrates the performance chart of both the Classical (Moses) and Hierarchical (Cdec) translation systems in the Persian-Spanish translation task. The chart shows that the 5-gram and 3-gram Classical phrase-based translation models (the red and blue bars) outperform the Hierarchical phrase-based translation models with all language modelling tool-kits.

FIGURE 3.11: The performance chart of Persian-Spanish translation task in different kinds of language models according to BLEU.

On the other hand, the results in Table 3.16 show the translation accuracy for all three kinds of 3-gram and 5-gram statistical language models, using the Moses and Cdec translation engines as the Classical and Hierarchical decoders, respectively, according to BLEU scores.

TABLE 3.16: The performance of applying 3-gram and 5-gram lan- guage models through Ken, SRI, and IRST tool-kits on Spanish- Persian translation task using BLEU metric.

Language models   3-gram C   5-gram C   3-gram H   5-gram H
KenLM             20.62      20.73      22.05      22.16
SRILM             20.38      20.50      21.80      21.93
IRSTLM            20.17      20.28      21.49      21.67

According to the results shown in Table 3.16, for translation from Spanish to Persian the Hierarchical phrase-based translation model outperforms the Classical phrase-based translation model under the SMT platform. In all cases (3-gram and 5-gram variants of the Classical and Hierarchical models), independent of the kind of n-gram language model, Cdec as a Hierarchical phrase-based translation system performs better than Moses as a Classical phrase-based system.

Figure 3.12 shows the performance chart of both the Classical (Moses) and Hierarchical (Cdec) translation systems in the Spanish-Persian task. The chart shows that the 5-gram and 3-gram Hierarchical models (the purple and green bars) through Cdec perform better than the Moses models for all kinds of language models.

FIGURE 3.12: The performance chart of Spanish-Persian translation task in different kinds of language models according to BLEU.

Since the results obtained for the Spanish and Persian language pair (applying different kinds of n-gram language models) differ between the two translation directions, it is best to focus on the 5-gram Classical model for Persian-Spanish and the 5-gram Hierarchical model for Spanish-Persian. With a lower n-gram order, the generated rules are smaller, so a smaller corpus suffices to cover the sparseness of the surrounding words.

3.6 Discussion

Generally, improving the statistical language model is one of the most reliable ways to improve the performance of SMT systems. This applies to both Classical and Hi- erarchical phrase-based systems. While the exact BLEU improvement might not be identical for both systems, it is unlikely that a particular n-gram LM would flip a result between systems if there is a significant gap. If the systems are very close to each other, it is possible that a better LM would help one more than the other and change the order. In machine learning, we can observe trends that hold most of the time, but we may always get a surprising result for an unusual data set.

As in the Classical model, pairs of corresponding source and target language phrases are learned from the training data; Hierarchical phrase-based translation, however, is based on an SCFG. The main difference in the Hierarchical models is that phrases may contain gaps, which are represented by non-terminal symbols of the SCFG. If a source phrase contains a non-terminal, then the target phrase also contains that non-terminal, and the decoder can replace the non-terminal with any source phrase and its translation.

One of the major differences reported between English and Spanish on the one hand and Persian on the other is word order. Persian as the target language possesses some features that negatively affect SMT performance. Persian is much richer in morphology than English and Spanish: characteristically analysed, it comes out as a morphologically rich language. It does not distinguish between upper-case and lower-case letters, and it does not use indefinite articles such as "a" or "an". There are other notable differences in Persian, such as a very distinct sentence structure in which parts of speech appear in unexpected positions. When dealing with Persian, translators tend to invent new words, and certain Persian words admit many spelling variants, which can trigger Out-Of-Vocabulary (OOV) output. Another distinct feature of Persian is that it creates greater noise in the training data, resulting in harder sparse-data problems; this issue arises from a vocabulary that combines words from various sources. Since Persian is rich in morphology on the target side, besides selecting a lexically correct Persian equivalent of an English or Spanish word, the SMT system must also correctly guess grammatical features, and the translation requires significant reordering.

Our experiments also reveal that the Hierarchical model in Cdec achieved quality similar to the phrase-based model in Moses, despite its less mature implementation. Generally speaking, phrase-based methods identify contiguous bilingual phrase pairs based on automatically generated word alignments. Phrase pairs are extracted up to a fixed maximum length, because extremely long phrases rarely have a tangible impact during translation.

Extracted phrase pairs are reordered during decoding so that fluent target output can be generated. The reordered translation output is evaluated under a distortion model and further scored by one or more n-gram language models. These models, however, have no explicit representation of how to reorder phrases. In order to avoid an explosion of the search space, most systems place a limit on the distance over which source segments can be moved within the source sentence. This limit, along with the phrase length limit, determines the scope of reordering represented in a phrase-based system.
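As a small illustration of the distance-based distortion model and the distortion limit described above, the following Python sketch (a hypothetical helper, not the Moses decoder; the limit value is only illustrative) computes the standard distortion cost between consecutively translated source phrases and checks it against a limit:

def distortion_cost(prev_end, curr_start):
    """Distance-based distortion: |start(current) - end(previous) - 1|,
    which is 0 when the current source phrase directly follows the previous one."""
    return abs(curr_start - prev_end - 1)

def within_limit(prev_end, curr_start, distortion_limit=6):
    """Reject jumps longer than the distortion limit (6 is an illustrative value)."""
    return distortion_cost(prev_end, curr_start) <= distortion_limit

# Monotone continuation: a phrase starting at word 3 after covering up to word 2.
print(distortion_cost(2, 3), within_limit(2, 3))   # 0 True
# A long jump back to word 0 after covering up to word 9.
print(distortion_cost(9, 0), within_limit(9, 0))   # 10 False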

3.7 Summary

In this chapter, a detailed comparison between Classical and Hierarchical phrase-based translation models was provided. According to our in-depth comparison, the rules for the Hierarchical model and the phrases for the Classical model are stored in separate tables built from the parallel corpus. After this step, the data in the parallel corpus is also used for generating a language model during training. In summary, three mandatory outputs are produced during training for use in testing: the rule-table for the Hierarchical model, the phrase-table for the Classical model, and the language model for both. Furthermore, an input sentence is required for translation in the testing process. Since the system can only accept one input per sentence, the input is formatted as one sentence per line. To translate with both the Hierarchical and the Classical models, each decoder is executed separately and therefore returns a separate translation result.

In the experiments we showed that the Classical model outperforms the Hierarchical model in the English-Spanish translation task as well as in the Persian-English and Persian-Spanish tasks, while the Hierarchical model performs better than the Classical one in the Spanish-English task as well as in the Spanish-Persian and English-Persian tasks. We also investigated the effect of applying different n-gram language models to our case-study language pairs, and we concluded that if the Classical model is preferable in one case, all n-gram language models of that model are preferable as well. The same holds for the other model: if a Hierarchical model outperforms in one case, all of its related n-gram language models perform better than the other model(s).

Chapter 4

Direct-Bridge Combination for Minimal Parallel-Resource SMT

State-of-the-art Statistical Machine Translation (SMT) has shown that high-quality translation output depends on the availability of massive amounts of parallel text in the source and target languages, and the biggest issue is that a high-quality parallel corpus is not always available. This is one of the reasons that SMT introduces a third language, named the bridge (pivot) language, to resolve the training data scarcity. This third language acts as an intermediary for which high-quality source-bridge and bridge-target bilingual corpora exist.

This chapter investigates the idea of making effective use of the bridge-language technique to respond to the minimal parallel-resource training text bottleneck, and presents our proposed method to improve translation quality for the Persian-Spanish minimal parallel-resource language pair, using a well-resourced language such as English as the bridge. In this chapter, first the sentence-level bridging (transfer method) and the phrase-level bridging (triangulation method) are introduced, followed by a performance comparison between the phrase bridging strategy and the sentence bridging one. Later, an interpolated model combining the direct and bridge-based translation systems is investigated to enhance the translation quality for minimal parallel-resource language pairs. After that, we investigate the proposed improvements in the bridge-language technique. Finally, we propose an optimized direct-bridge combination (ODBC) method to enhance translation performance, and we analyse the effects of this proposed method on our case-study minimal parallel-resource SMT system.

4.1 Introduction

The primary goal of SMT is to translate source-language sequences into a target language. This is achieved by assessing the plausibility of the source together with the target sequences, and only those target sequences that bear a specific relation to the existing bodies of translation between the two languages are analysed (Ahmadnia et al., 2017). Sizeable bodies of aligned parallel corpora have a strong effect on the behaviour and performance of SMT systems. However, gathering parallel data is difficult in practice for two reasons: high cost and limited scope. Both of these factors put intense pressure on the related research and its application. This is why the scarcity of parallel data for many languages is considered one of the main issues in SMT (Babych et al., 2007). Such corpora are not easily found, especially where minimal-resource language pairs are involved. Even in cases involving well-resourced languages, such as Europarl1 (Koehn, P., 2005), SMT performance drops significantly when the system is applied to a slightly different domain; efficiency decreases as the domain changes. In order to tackle the lack of parallel data, the bridge (pivot) language technique is commonly used. This issue is especially significant for SMT systems involving languages with insufficient Natural Language Processing (NLP) resources; the encouraging point, however, is the sufficient availability of resources between these languages and other languages. Even though it has been determined that intermediary languages do not bring improvements in the general case, this idea can still be employed as a simple method to enrich the translation performance of existing systems (Matusov et al., 2008).

Recently, some efforts have been made to enhance the quality and recall of bridge-based SMT. In one of these experiments, Kumar et al. (2007) used the bridge language to create a word alignment system, along with a procedure for combining word alignment systems built from multiple bridge languages. The final translation is obtained through consensus decoding, which combines the hypotheses gained from all the bridge-language word alignments. Later on, the effect of the bridge language on the final translation system was examined by Paul et al. (2009). Their experiments revealed that if the training data is small, the bridge language should be similar to the source one, whereas if the training data is large, the bridge language should be similar to the target one. In either case, it is preferable to use a bridge language whose structure is similar to both the source and target languages. More recently, Zhu et al. (2014) focused on the issue of source-target translations that are not generated because the corresponding source and target phrases connect to different bridge phrases. In order to reduce this problem, they connected the relevant translation phrases between the source and target languages using Markov Random Walks.

One of the basic issues with the bridging idea is the large size of the newly created bridge phrase-table. Recently, some effort has been made to improve the precision of bridging. The studies conducted by Saralegi et al. (2011) confirmed that the transitive property between three languages does not always hold, so many of the translations produced within the final phrase-table may not be correct. For this reason, two phrase-table pruning methods are used to remove wrong and weak phrases.
1 The Europarl corpus is a set of documents that consists of the proceedings of the European Parliament from 1996 to the present. The data that makes up the corpus was extracted from the website of the European Parliament and then prepared for linguistic research.

One of these methods is derived from the structure of source dictionaries, while the other is derived from distributional similarity. More recently, a strategy has been introduced that uses context vectors as a pruning method to remove phrase pairs that are linked to each other either through weak translations or through polysemous bridge phrases (Tofighi Zahabi et al., 2013).

4.2 Bridge Language Theory

A high-quality data set is not always available for training SMT systems. One possible way to solve this impasse is to use a third language as a bridge (pivot), for which high-quality source-bridge and bridge-target bilingual resources exist. A bridge language is an artificial or natural language used as an intermediary for translation between many different languages. The bridge-language technique is a way to build a systematic SMT system when a proper bilingual corpus is lacking or the existing ones are weak. The major drawback and concern of translations generated through bridging is translation quality, as it is possible to produce erroneous translations by transferring errors or ambiguities from one language pair to another through the pivot language. However, when language resources for specific language pairs do not exist or are scarce, the use of pivot languages as data bridges can be a convenient linguistic short-cut for offering language services or building and enhancing language resources.

The first and foremost requirement in SMT is a high-quality parallel corpus. However, this is not attainable for minimal parallel-resource languages. Thus, the most advanced research on multilingual SMT has focused on the use of bridge languages for translating such resource-disadvantaged languages. English, due to its richness in language resources, is the first candidate bridge language. Generally, the choice of bridge language is made according to two criteria:

1. Availability of language resources.

2. Relatedness between source and bridge languages.

However, the preceding criteria might not be reliable enough to help in choosing the best bridge language. According to recent research, using a non-English language as the bridge can also improve system performance.

4.3 Bridging Approaches

There are three methods by which the resources of a bridge language can be utilized, as explained in (Wu and Wang, 2007), namely:

• Sentence translation or transfer approach.

• Synthetic corpus approach.

• Phrase-table construction or triangulation approach.

4.3.1 Transfer Approach

This method is also known as the cascade, or sentence-translation, bridge strategy. The transfer method first converts the source language into the bridge (pivot) one by translating it with the source-bridge translation system. It then converts the bridge language into the target one through the bridge-target translation system.

Given a source sentence s, we can translate it into n bridge-language sentences (b1, b2, b3, ..., bn) using a source-bridge translation system. Each of these n sentences, bi, can then be translated into m target-language sentences (ti1, ti2, ti3, ..., tim) using a bridge-target translation system. Thus, in total we obtain (m × n) target-language sentences. These sentences can then be re-scored with the help of the source-bridge and bridge-target translation system scores. If we denote the source-bridge system features as γsb and the bridge-target system features as γbt, the best scoring translation is calculated using Equation (4.1):

\hat{t} = \arg\max_{t} \sum_{k=1}^{L} \left[ \lambda_{k}^{sb}\, \gamma_{k}^{sb}(s, b) + \lambda_{k}^{bt}\, \gamma_{k}^{bt}(b, t) \right]    (4.1)

where L is the number of features used in the SMT systems, and \lambda_{k}^{sb} and \lambda_{k}^{bt} are the feature weights.

In other words, in the transfer approach the source sentences are first translated into bridge sentences, and these bridge sentences are then translated into target sentences separately. We choose the highest-scoring sentence amongst the target sentences. In this approach, for assigning the best target candidate sentence t to the input source sentence s, we maximize the probability P(t|s) by defining a hidden variable b, which stands for the bridge-language sentences, and obtain:

\arg\max_{t} P(t|s) = \arg\max_{t} \sum_{b} P(t, b|s) = \arg\max_{t} \sum_{b} P(t|b, s)\, P(b|s)    (4.2)

Assuming that s and t are independent given b:

\arg\max_{t} P(t|s) \approx \arg\max_{t} \sum_{b} P(t|b)\, P(b|s)    (4.3)

In Equation (4.3), summation over all b sentences is difficult, so we replace it by maximization; Equation (4.4) is an estimate of Equation (4.3):

\arg\max_{t} P(t|s) \approx \arg\max_{t} \max_{b} P(t|b)\, P(b|s)    (4.4)

Instead of searching the entire space of b sentences, we can search only a subspace of it. For simplicity we limit the search space as in Equation (4.5). A good choice is the b subspace produced by the k-best list output of the first SMT system (source-bridge):

\arg\max_{t} P(t|s) \approx \arg\max_{t} \max_{b \in k\text{-}best(s)} P(t|b)\, P(b|s)    (4.5)

In fact, each sentence s of the source test set is mapped to a subspace of the total b space, and the search for the best candidate sentence t of the second SMT system (bridge-target) is done in this subspace.
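The following toy Python sketch illustrates the transfer approach of Equation (4.5): the source sentence is translated into a k-best list of bridge hypotheses, each is translated onward, and the target hypothesis with the highest combined score is kept. The tiny k-best tables and scores are invented for illustration only and are not the output of any real system:

import math

# Invented toy k-best tables; scores stand in for log-probabilities.
SRC_BRIDGE_KBEST = {            # s -> [(b, log P(b|s)), ...]
    "hola": [("hello", math.log(0.7)), ("hi", math.log(0.3))],
}
BRIDGE_TRG_BEST = {             # b -> (t, log P(t|b))
    "hello": ("salam", math.log(0.8)),
    "hi": ("salam", math.log(0.6)),
}

def transfer_translate(s):
    """Pick the target hypothesis maximizing log P(t|b) + log P(b|s)
    over the k-best bridge hypotheses, as in Equation (4.5)."""
    best_t, best_score = None, float("-inf")
    for b, log_p_bs in SRC_BRIDGE_KBEST[s]:
        t, log_p_tb = BRIDGE_TRG_BEST[b]
        combined = log_p_tb + log_p_bs
        if combined > best_score:
            best_t, best_score = t, combined
    return best_t, best_score

print(transfer_translate("hola"))   # ('salam', log(0.8 * 0.7))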

4.3.2 Synthetic Corpus Approach

This method attempts to develop a synthetic source-target corpus by translating the bridge part of the source-bridge corpus into the target language by means of a bridge-target model, and by translating the bridge part of the target-bridge corpus into the source language with a bridge-source model. It then combines the source sentences with the translated target sentences, or the target sentences with the translated source sentences. The source-target corpora created with these two methods can be blended together to produce a final synthetic corpus. However, it is difficult to build a high-quality translation system with a corpus compiled merely by an SMT system.

4.3.3 Triangulation Approach

This method is also known as phrase-table multiplication, or the phrase-translation pivot strategy. In this approach, a phrase s in the source-bridge phrase-table is connected to a bridge phrase b, and this phrase b is associated with a phrase t in the bridge-target phrase-table. We link the phrases s and t in a new phrase-table for the source-target language pair. For scoring the phrase pairs of the new phrase-table, assuming P(s|b) is the score of the source-bridge phrases and P(b|t) is the score of the bridge-target phrases, the score of the new phrase pair s and t, P(s|t), in the source-target phrase-table is computed. In this method, the source-bridge and bridge-target translation models are trained on the source-bridge and bridge-target corpora, respectively. Using these two models, a source-target model is induced. The two important components to be induced are:

1. Phrase translation probability.

2. Lexical reordering weight.

The phrase translation probability is induced under the assumption that source and target phrases are conditionally independent when conditioned on bridge phrases. It can be given as:

P(s|t) = \sum_{b} P(s|b)\, P(b|t)    (4.6)

where s, b, and t are phrases in the source, bridge, and target languages, respectively. According to the research conducted by Koehn et al. (2003), the lexical reordering weight depends on the word alignment information in a phrase pair (s, t) and the lexical translation probability w(s|t). The word alignment is induced from the source-bridge and bridge-target alignments in order to calculate the lexical reordering weight. Furthermore, the estimated lexical translation probabilities and the induced alignment are used to calculate the lexical reordering weights. Here we must keep in mind that both the lexical reordering weight and the phrase translation probability are language dependent. The triangulation approach introduced here performs phrase-based SMT for a source-target language pair with the help of two bilingual corpora, source-bridge and bridge-target. Two translation models are trained to deal with source-bridge and bridge-target. These models then form the basis for constructing the bridge translation model for the source language, with b as the bridge language. Algorithm 4 shows the triangulation mechanism.

Algorithm 4 Triangulation technique

Input: Phrase-table between source and bridge (Ps−b), phrase-table between bridge and target (Pb−t), number of top phrase pairs to select (N).
Output: Ptriangulation. // Initially empty.
1: for all (source, bridge) phrase pairs in the top-N of Ps−b do // Search all source-bridge phrase pairs in N.
2:   if the bridge phrase occurs in Pb−t then
3:     for all (bridge, target) pairs in Pb−t do
4:       compute feature values for (source, target)
5:     end for
6:     select the top-N (source, target) pairs and add them to Ptriangulation
7:   end if
8: end for
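The following simplified Python sketch illustrates the core of the triangulation step in Algorithm 4: phrase pairs from the source-bridge and bridge-target tables that share a bridge phrase are linked, and the induced score follows Equation (4.8) below. The toy tables are invented for illustration, and the top-N selection and additional features of Algorithm 4 are omitted:

from collections import defaultdict

# Invented toy phrase-tables.
P_SB = {                         # P(s|b): source phrase given bridge phrase
    ("libro", "book"): 0.9,
    ("el libro", "the book"): 0.8,
}
P_BT = {                         # P(b|t): bridge phrase given target phrase
    ("book", "ketab"): 0.7,
    ("the book", "ketab"): 0.2,
}

def triangulate(p_sb, p_bt):
    """Induce a source-target table with P(s|t) = sum_b P(s|b) * P(b|t)."""
    by_bridge = defaultdict(list)
    for (s, b), prob in p_sb.items():
        by_bridge[b].append((s, prob))
    p_st = defaultdict(float)
    for (b, t), prob_bt in p_bt.items():
        for s, prob_sb in by_bridge.get(b, []):
            p_st[(s, t)] += prob_sb * prob_bt
    return dict(p_st)

print(triangulate(P_SB, P_BT))
# {('libro', 'ketab'): 0.63, ('el libro', 'ketab'): 0.16} (up to rounding)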

4.3.3.1 Phrase Translation Probabilities

With the help of the source-bridge and bridge-target bilingual corpora, we can train two phrase translation probabilities P(si|bi) and P(bi|ti), where bi is a phrase in the bridge language b. We derive the phrase translation probability P(si|ti) according to the following model:

P(s_i|t_i) = \sum_{b_i} P(s_i|b_i, t_i)\, P(b_i|t_i)    (4.7)

The phrase translation probability P(si|bi, ti) does not depend on the phrase ti of the target language, since it comes from the source-bridge bilingual corpus. Thus, Equation (4.7) can be rewritten as:

P(s_i|t_i) = \sum_{b_i} P(s_i|b_i)\, P(b_i|t_i)    (4.8)

In order to verify the correctness of the probability calculations, the formulation of the phrase translation probability P(si|ti) is marginalised:

P(s_i|t_i) = \sum_{b_i} P(s_i, b_i|t_i)    (4.9)

Now the chain rule is used:

P(s_i|t_i) = \sum_{b_i} P(s_i|b_i, t_i)\, P(b_i|t_i)    (4.10)

Since we have the source-bridge corpus available, the first term in the above equation, P(si|bi, ti), does not depend on ti and reduces to P(si|bi). Thus, the final equation is:

P(s_i|t_i) = \sum_{b_i} P(s_i|b_i)\, P(b_i|t_i)    (4.11)

4.3.3.2 Lexical Reordering Weights

According to Koehn et al. (2003), the lexical weight can be calculated with the following model:

P_w(s|t, \alpha) = \prod_{i=1}^{n} \frac{1}{|\{ j \mid (i, j) \in \alpha \}|} \sum_{\forall (i, j) \in \alpha} w(s_i|t_j)    (4.12)

For the purpose of calculating the lexical weight, we first need the alignment information α between the two phrases s and t. Once this information is obtained, the lexical translation probability w(s|t) can be calculated from it.

The alignment information for the phrase pair (s, t) can be derived from the two phrase pairs (s, b) and (b, t). Let α1 and α2 be the word alignment information inside the phrase pairs (s, b) and (b, t), respectively.

\alpha = \{ (s, t) \mid \exists b : (s, b) \in \alpha_1 \wedge (b, t) \in \alpha_2 \}    (4.13)

With this induced alignment information, one approach, named the phrase method, estimates the probability directly from the induced phrase pairs. If we use K to denote the number of induced phrase pairs, the co-occurrence frequency of the word pair (s, t) can be estimated according to the following model:

count(s, t) = \sum_{k=1}^{K} P_k(s, t) \sum_{i=1}^{n} \delta(s, s_i)\, \delta(t, t_{\alpha_i})    (4.14)

where, for phrase pair k, Pk(s, t) is the phrase translation probability, and δ(x, y) = 1 if x = y and 0 otherwise.

Thus, the lexical translation probability can be estimated as:

w(s, t) = \frac{count(s, t)}{\sum_{s'} count(s', t)}    (4.15)

Alternatively, w(s|t) can also be calculated using the word method as:

w(s, t) = \sum_{b} w(s, b)\, w(b, t)    (4.16)

where w(s, b) and w(b, t) are the two lexical translation probabilities.
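As a small worked illustration of the word method in Equation (4.16), the following Python sketch (invented toy lexicons, not real estimates) induces a source-target lexical translation probability by summing over bridge words:

# Invented toy lexicons.
W_SB = {("casa", "house"): 0.8, ("casa", "home"): 0.2}    # w(s, b)
W_BT = {("house", "khane"): 0.9, ("home", "khane"): 0.7}  # w(b, t)

def induced_lexical_prob(s, t, w_sb, w_bt):
    """w(s, t) = sum_b w(s, b) * w(b, t), as in Equation (4.16)."""
    total = 0.0
    for (src, b), p_sb in w_sb.items():
        if src == s:
            total += p_sb * w_bt.get((b, t), 0.0)
    return total

print(induced_lexical_prob("casa", "khane", W_SB, W_BT))   # 0.8*0.9 + 0.2*0.7 ≈ 0.86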

4.3.4 Interpolated Model

Assume that a small source-target parallel corpus is available to us. If we train the translation model only on this corpus, the result will be a poorly performing system; the main reason for this poor performance is sparse data. We can resort to additional source-bridge and bridge-target parallel corpora to improve this performance. Apart from that, we can also make use of one or more bridge languages in order to improve the translation performance. Different bridge languages are likely to capture different language phenomena, and can thus improve the translation quality by contributing quality source-target phrase pairs. If we include n bridge languages, then n bridge models can be estimated as described above. We use linear interpolation to combine all these models with the standard model trained on the source-target corpus. The phrase translation probability and the lexical reordering weights are estimated as shown in Equations (4.17) and (4.18), respectively:

P(s|t) = \sum_{i=0}^{n} \alpha_i P_i(s|t)    (4.17)

P_w(s|t, \alpha) = \sum_{i=0}^{n} \beta_i P_{w,i}(s|t, \alpha)    (4.18)

where \sum_{i=0}^{n} \alpha_i = 1 and \sum_{i=0}^{n} \beta_i = 1.

P0(s|t) and Pw,0(s|t, α) denote the phrase translation probability and the lexical weight, respectively, trained on the source-target corpus. Pi(s|t) and Pw,i(s|t, α), (i = 1, 2, ..., n), are the phrase translation probabilities and the lexical reordering weights estimated using the bridge languages. αi and βi are the interpolation coefficients. Figure 4.1 shows a general view of the interpolated model.

FIGURE 4.1: The interpolation method schematic between source, pivot, and target models.
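The following short Python sketch illustrates the linear interpolation of Equation (4.17) for the phrase translation probabilities, mixing a direct table (i = 0) with one bridge-based table; the toy tables and coefficients are invented for illustration, and the lexical weights of Equation (4.18) would be interpolated in the same way:

def interpolate(tables, alphas):
    """P(s|t) = sum_i alpha_i * P_i(s|t); the coefficients are assumed to sum to 1."""
    assert abs(sum(alphas) - 1.0) < 1e-9
    keys = set().union(*tables)
    return {key: sum(a * tab.get(key, 0.0) for a, tab in zip(alphas, tables))
            for key in keys}

direct_table = {("ketab", "libro"): 0.50}                 # P_0(s|t), direct model
bridge_table = {("ketab", "libro"): 0.70,                 # P_1(s|t), via the bridge
                ("ketab", "el libro"): 0.20}

print(interpolate([direct_table, bridge_table], [0.6, 0.4]))
# ('ketab', 'libro'): 0.6*0.5 + 0.4*0.7 = 0.58; ('ketab', 'el libro'): 0.4*0.2 = 0.08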

4.4 Proposed Improvements in Bridge Language Technique

Recent developments in SMT have made it possible to build a prototype system for any language pair within just a few hours. Nakov and Ng (2012) proposed a language-independent approach for improving SMT for scarce-resource languages by exploiting their similarity to well-resourced ones. In simple terms, we have a low-resource language (source) that is directly linked to a well-resourced language (target). Source and target may have similarities in word order, vocabulary, spelling, syntax, etc. We improve translation from the scarce-resource language (source) by routing it through a well-resourced language (bridge), with the help of a bilingual text containing a limited number of parallel sentences for source-bridge and a large bilingual text for bridge-target. In order to use the bilingual text of one language to improve SMT for a related language, two general strategies are used:

1. Bilingual text interpolation, where the original bilingual text may need to be repeated several times for balance.

2. Phrase-table combination, where each bilingual text is used to build a separate phrase-table, and the two phrase-tables are then combined.

4.4.1 Interpolating Bilingual Texts

This approach simply interpolates the bilingual texts for source-bridge and bridge-target into one large bilingual text. It can improve the alignments obtained from the source-bridge bilingual text, because the additional sentences provide context for rare words in that bilingual text. Interpolation can also provide more source-side translation options, which is the main way it increases lexical coverage and reduces the number of Out-Of-Vocabulary (OOV) words. It can introduce new non-compositional phrases on the source side so that fluency is improved, and it offers new target-language phrases. Here we must keep in mind that inappropriate target phrases that do not exist for the source will simply fail to match the test-time input. However, this simple interpolation approach can cause certain problems. Since the bridge-target bilingual text is much larger than the source-bridge bilingual text, the former will dominate during word alignment and phrase extraction. This can negatively affect the lexical and phrase translation probabilities and result in poor performance. This imbalance of bilingual texts can be corrected by repeating the smaller source-bridge bilingual text several times so that the larger one does not dominate. The additional and original training bilingual texts are combined in the following ways:

• 1×int: A simple interpolation of the additional and original bilingual texts to generate a new training bilingual text, which is used to train a new phrase-based SMT system.

• n×int: An interpolation of n copies of the original bilingual text and one copy of the additional bilingual text in order to create a new training bilingual text3.

3 The value of n is selected so that the original bilingual text approximately matches the size of the additional bilingual text.

• n(align)×int: We interpolate n copies of the original bilingual text and one copy of the additional bilingual text to form a new training bilingual text. First the word alignments are generated from this new bilingual text, and then all sentence pairs and word alignments are discarded except for one copy of the original bilingual text4. A brief sketch of the repetition idea behind these strategies is given after this list.
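The following schematic Python sketch (an illustrative assumption about how the corpora are represented, not the exact pipeline used here) shows the repetition idea behind n×int: the small original bilingual text is repeated until it roughly matches the size of the additional bilingual text before the two are concatenated:

def n_times_interpolate(original_pairs, additional_pairs):
    """Repeat the original corpus n times (n chosen from the size ratio) and
    append the additional corpus, producing one training bilingual text."""
    n = max(1, round(len(additional_pairs) / max(1, len(original_pairs))))
    return original_pairs * n + additional_pairs, n

# Illustrative sentence-pair lists standing in for the real corpora.
original = [("fa sentence", "en sentence")] * 60        # small source-bridge text
additional = [("en sentence", "es sentence")] * 1000    # large bridge-target text

combined, n = n_times_interpolate(original, additional)
print(n, len(combined))   # 17, 60*17 + 1000 = 2020 sentence pairs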

4.4.2 Combining Phrase-tables

An alternative way to use the additional training bilingual text is to build separate phrase-tables. These phrase-tables can then be used together by merging or interpolating them. The phrase-table construction method has various advantages:

• The phrase pairs extracted from the source-bridge bilingual text are kept clearly separate from the riskier ones from the bridge-target bilingual text.

• The lexical and phrase translation probabilities are blended together in a proper way.

On the negative side, the word alignments for the sentences in the source-bridge bilingual text are not as strong as with interpolation. The following are the three phrase-table construction strategies:

• Two-tables: Two separate phrase-tables are constructed from the two bilingual texts and used as alternative decoding paths.

• Concatenation: Two separate phrase-tables are built from the original and additional bilingual texts. Linear interpolation is used to combine the corresponding conditional probabilities:

P(t|s) = \alpha P_{original}(t|s) + (1 - \alpha) P_{additional}(t|s)    (4.19)

The value of α is optimized over a development dataset.

• Merge: Two separate phrase-tables are built from the original and additional bilingual texts. We keep all source-target phrase pairs from Toriginal and then include those source-target phrase pairs from Tadditional that are not present in Toriginal. The associated lexical and phrase translation probabilities are retained for each added phrase pair. A small sketch of this merge strategy is given after this list.
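The following small Python sketch (toy tables invented for illustration) captures the Merge strategy: every pair from the original table is kept, and only pairs absent from it are added from the additional table, each retaining its own probabilities:

def merge_phrase_tables(t_original, t_additional):
    """Keep every pair from the original table; add only pairs that are
    absent from it, each keeping its own translation probabilities."""
    merged = dict(t_original)
    for pair, probs in t_additional.items():
        if pair not in merged:
            merged[pair] = probs
    return merged

# Invented toy tables.
T_ORIGINAL = {("ketab", "libro"): {"phrase_prob": 0.5, "lex_weight": 0.4}}
T_ADDITIONAL = {("ketab", "libro"): {"phrase_prob": 0.7, "lex_weight": 0.6},
                ("khane", "casa"): {"phrase_prob": 0.3, "lex_weight": 0.2}}

print(merge_phrase_tables(T_ORIGINAL, T_ADDITIONAL))
# ('ketab', 'libro') keeps the original scores; ('khane', 'casa') is added.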

4.5 Experiments

Our case study is the Persian-Spanish minimal parallel-resource language pair, and we employ the bridge-language technique to improve translation quality in both the forward and backward translation directions. In this case, English is used as the bridge language, and the source-bridge SMT system is combined with the bridge-target one, where the relatively large corpora of each may be used in support of the source-target pairing.

4 It must be kept in mind that only the word alignments of the original bilingual text are induced, using additional statistical information from the additional bilingual text. These alignments are then used to build a phrase-table.

The data is gathered from the in-domain Tanzil parallel corpus5 (Tiedemann, 2012). In this corpus, the Persian-Spanish part encompasses more than (68K) parallel sentences, nearly (2.06M) words on the Persian side, and more than (1.45M) words on the Spanish side. The Persian-English part includes more than (138K) parallel sentences, around (377K) Persian words, and more than (830K) English words. The English-Spanish part contains more than (1M) parallel sentences, approximately (31M) words on the English side, and over (26M) words on the Spanish side. Table 4.1 presents the corpus statistics used in our experiments, including the source and target language information in each direction.

TABLE 4.1: Corpus statistics including the source and target lan- guages information.

Directions      Persian↔English    English↔Spanish    Persian↔Spanish
Sentences       138,822            1,028,996          68,601
Source words    376,933            30,872,937         2,058,231
Target words    832,696            26,143,026         1,454,778

The training part consists of (60K) parallel sentences. For the tuning and testing steps, we gathered parallel texts from the Tanzil corpus: (3K) sentences were extracted for tuning and (5K) sentences for testing. The tokenize.perl script was employed for tokenizing all data sets. The Moses package6 (Koehn et al., 2007) is employed for training our SMT systems. With this decoder, the fast-align approach (Dyer et al., 2013) is applied for word alignment. We employ 5-gram language models for all SMT systems, built by means of the KenLM tool-kit (Heafield, 2011). In addition, for evaluating system performance we use the BLEU metric. We set the beam size to (100) and the distortion limit to (6). We restrict the maximum number of target phrases loaded for each source phrase to (6), and we use the other default features of the Moses translation engine.

4.6 Results and Evaluation

For the translation systems we conduct two sets of experiments in each translation direction. In the first set of experiments, three portions are investigated as follows:

1. In the first portion of the first phase of our experiments, English was utilised by the transfer bridging system as an interface between two separate phrase-based SMT systems, specifically a Persian-English and an English-Spanish direct translation system. While translating Persian to Spanish, the English top-1 output of the Persian-English system was forwarded as input to the English-Spanish system. The English language model used to train the Persian-English system is developed from the counterpart of the Spanish data used to build the Spanish language model in our parallel corpus.

5 http://opus.lingfil.uu.se/Tanzil.php
6 http://www.statmt.org/moses

2. To apply the triangulation method in the second portion of the first experiments phase, we needed to create a phrase-table to train the phrase-based SMT system. Therefore, a Persian-English and an English-Spanish phrase-table were needed. Based on these phrase-tables, we formed a Persian-Spanish phrase-table. Furthermore, a matching algorithm that identifies parallel sentence pairs among the phrase-tables was utilized. After identifying candidate sentence pairs, we finally use a classifier to determine whether the sentences in each pair are a good translation of each other and update our Persian-Spanish phrase-table with the selected pairs.

3. In the last portion of the first phase of the experiments, we examine a combination approach (interpolated method) in order to achieve higher coverage and better translation quality, aiming at efficiently merging both the transfer and the triangulation models with a direct translation model developed from the given parallel corpora. In particular, this approach attempts to combine the direct and bridge-based models in order to increase the amount of gained information. Several combination models are feasible for this purpose. To interpolate the direct and transfer models, after extracting the source-bridge and bridge-target phrase pairs, we train and merge the source-bridge and bridge-target models respectively, and finally interpolate the source-target direct translation model with the generated source-bridge-target translation model. To interpolate the direct model and the multiplication (triangulation) model, we employ a combination model in which translation options are gathered from one table and additional options are collected from the other tables. When the same translation option is reached in multiple tables, we form separate translation options for each occurrence, with different scores.

Table 4.2 illustrates the results of both the Persian-Spanish and Spanish-Persian standard direct and bridge-based (transfer, triangulation, and interpolated) translation systems, with English as the intermediary language. The results indicate that the bridge-based translation method is suitable for the scenario in which large amounts of source-bridge and bridge-target bilingual corpora exist but only a little source-target bilingual data is available. We therefore selected (60K) sentence pairs from the source-target bilingual corpora to simulate the lack of source-target bilingual data.

TABLE 4.2: The BLEU scores comparing the performance of direct translation with bridge-based translation for the Persian-Spanish SMT system and back translation through English as the bridge language.

Translation systems                        Pe-(En)-Es    Es-(En)-Pe
Direct                                     19.39         19.07
Transfer                                   20.78         20.33
Triangulation                              21.55         21.02
Interpolated 1 (Direct+Transfer)           20.18         19.80
Interpolated 2 (Direct+Triangulation)      20.57         20.14

As seen in Table 4.2, with triangulation as the best bridging technique, the performance of the Persian-Spanish and back-translation systems through English shows relative increases over the direct systems of approximately (11.11%) and (11.02%), respectively.

This suggests that we are making better use of the available resources. The differences between the bridge-language method and the direct translation approach are statistically significant.
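As a quick arithmetic check (shown for the forward direction only), the following snippet illustrates how such a relative increase follows from the BLEU scores in Table 4.2:

def relative_increase(new_score, baseline):
    """Relative improvement in percent over the baseline score."""
    return 100.0 * (new_score - baseline) / baseline

# Triangulation (21.55) versus the direct system (19.39) for Pe-(En)-Es.
print(round(relative_increase(21.55, 19.39), 1))   # roughly 11.1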

FIGURE 4.2: Performance comparison of the direct and bridge-based translation systems for both Persian-Spanish and Spanish-Persian tasks.

In the second set of experiments we investigate the effects of applying the proposed improvements to overcome the data scarcity bottleneck and improve the performance of minimal-resource SMT systems through the bridge-language idea. For this set of experiments we use the same language pair (Persian and Spanish) extracted from the Tanzil parallel texts, and the Moses package is again used as our translation engine with the same features as in the previous set of experiments. Two general strategies are implemented when the Persian-Spanish SMT and back translation are carried out for this set of experiments:

1. Sequential strategy: This strategy creates a sequential link between two SMT systems. One link is created between Persian and English, while the other link is created between English and Spanish. Because errors from one system propagate to the input of the other system, this procedure is known as an error-additive approach.

2. Direct strategy: This strategy uses the English-Persian SMT system to translate the entire English side of the Spanish-English corpus into Persian. The English-Persian SMT system used in this procedure belongs to a general domain. This automatically translated Persian text is then used in the training procedure of the Spanish-Persian system. Errors arising in the English-Persian system are likely to receive very low probabilities during the training of the Spanish-Persian translation system, since there is little chance that they correlate with the Spanish text.

In this portion, a number of experiments were carried out to test the similarity between the original languages (Persian and Spanish) and the intermediary language (English).

Persian-Spanish SMT and back translation are improved using English as the bridge language, and various conclusions can be drawn from the results of the experiments. It is clear that related languages can help to improve SMT. The simple interpolation method is helpful, but it can be problematic when the additional sentences greatly outnumber the original ones. Interpolation works well if the original bilingual text is repeated enough times to match the size of the additional bilingual text. Giving additional weight to the original phrases in the merging method is a good strategy. The improvement in the system is due to improved word alignment as well as increased lexical coverage. Table 4.3 provides the results of the Persian-Spanish minimal parallel-resource SMT system and back translation via English as the auxiliary language through the interpolating bilingual texts improvements.

TABLE 4.3: The BLEU scores comparing the performance of different interpolating bilingual texts improvements for Persian-Spanish and Spanish-Persian SMT systems through English as bridge language.

Interpolating bilingual texts    Pe-(En)-Es    Es-(En)-Pe
1×int                            21.62         21.48
n×int                            21.71         21.53
n(align)×int                     21.78         21.65

Figure 4.3 shows the performance chart of the systems based on the interpolating bilingual texts improvements, according to BLEU scores.

FIGURE 4.3: Comparison of interpolation bilingual texts methods performance for both Persian-Spanish and Spanish-Persian SMT tasks according to BLEU score.

Considering the interpolating bilingual texts scenario, the performance of the 1×int, n×int, and n(align)×int systems shows relative increases over the direct Persian-Spanish translation system of approximately (11.15%), (11.19%), and (11.23%), respectively. In the Spanish-Persian translation task, the mentioned methods outperform the direct translation system by relative increases of approximately (11.26%) with 1×int, (11.28%) with n×int, and (11.35%) with n(align)×int.

On the other hand, these approaches also outperform the triangulation technique, which has the best performance among the standard bridging methods in both the forward and backward directions for the Persian and Spanish translation tasks. The relative increase from the forward direct system is approximately (10.03%) by 1×int, (10.07%) by n×int, and (10.10%) by n(align)×int. The performance of these systems shows relative increases from the backward direct system of approximately 10.36%, 10.24%, and 10.29%, respectively.

Table 4.4 provides the combining phrase-tables improvement results for the Persian-Spanish minimal-resource SMT system and vice versa, via English as the pivot language.

TABLE 4.4: The BLEU scores comparing the performance of different combining phrase-tables improvements for the Persian-Spanish and Spanish-Persian SMT systems through English as the bridge language.

Combining phrase-tables    Pe-(En)-Es    Es-(En)-Pe
Two-tables                 21.91         21.54
Concatenation              22.07         21.68
Merge                      22.21         21.81

Figure 4.4 illustrates the performance chart of the systems based on the combining phrase-tables improvements, according to BLEU scores.

FIGURE 4.4: Comparison of combination phrase-tables approaches performance for both Persian-Spanish and Spanish-Persian SMT tasks according to BLEU score.

Considering the combining phrase-tables hypothesis, the performance of the Two-tables, Concatenation, and Merge approaches shows relative increases over the direct Persian-Spanish translation system of approximately (11.29%), (11.38%), and (11.45%), respectively. In the Spanish-Persian translation task, the mentioned methods outperform the direct translation system by relative increases of approximately (11.29%) with Two-tables, (11.36%) with Concatenation, and (11.43%) with Merge.

On the other hand, these approaches also outperform the triangulation technique, which has the best performance among the standard bridging methods in both the Persian-Spanish and Spanish-Persian translation tasks. The relative increase from the forward direct system is approximately (10.16%) by Two-tables, (10.24%) by Concatenation, and (10.30%) by Merge. The performance of these systems shows relative increases from the Spanish-Persian direct system of approximately 10.24%, 10.31%, and 10.37%, respectively. As seen in Tables 4.3 and 4.4 and Figures 4.3 and 4.4, in all cases the combining phrase-tables approach outperforms the interpolating bilingual texts approach, and both of these proposed improvements outperform the triangulation model (the best standard bridging method) in both the forward and backward translation directions between Persian and Spanish through English as the bridge language.

Our experimental results show that it is possible to generate a large-scale SMT system between Persian and any other language as long as a large parallel corpus is available between that language and English (such as Spanish).

4.7 Proposed Method

In this section we propose the Optimized Direct-Bridge Combination (ODBC) method, which builds on the bridge-language technique to enhance the performance of minimal-resource SMT systems.

4.7.1 Optimized Direct-Bridge Combination Method

As previously mentioned in this chapter, to alleviate parallel data scarceness a conventional solution introduces a bridge (pivot) language through which the source and target languages are connected. This strategy is usually applied when large amounts of source-bridge and bridge-target parallel corpora are available. The best-performing bridging approach, triangulation, constructs a new induced phrase-table linking the source and target languages. The biggest issue encountered when applying this approach is the very large size of the bridge phrase-table. In a scenario where a parallel corpus between the source and target languages is available, we must try to improve the overall translation quality and coverage. This quality and coverage can only be improved if the direct model based on this parallel corpus is combined with a bridge model; increasing the information gain is thus the motivation for the direct-bridge combination method. In this section, we propose a combination method of the direct and bridge SMT models. The basic aim is to select portions of the bridge SMT model that do not interfere with the direct SMT model. We show positive results for our case study, Persian-Spanish SMT, on different direct training data sizes. The approach we propose is similar to Domain Adaptation methods, which combine training data from various sources to build a single translation model that is then used for translating sentences in the new domain. Various methods have been used to explore domain adaptation within the field. Some of these methods use Information Retrieval (IR) techniques to retrieve sentence pairs related to the target domain from a training corpus (Eck et al., 2004; Hildebrand et al., 2005). Other domain adaptation methods focus on distinguishing between general-domain and domain-specific examples (Daume III, H. and D. Marcu, 2006). In a similar scenario, Schroeder (2007) used multiple alternative decoding paths to combine several translation models, with the weights of these models set using Minimum Error Rate Training (MERT) (Och, F. J., 2003). In our proposed scenario, in contrast to domain adaptation, we generate a new source-target translation model. Our method contains the phrase bridging (triangulation) technique applied to two models, but we also use the domain adaptation approach to select the relevant portions of the bridge phrase-table. Furthermore, we improve translation quality by combining these portions with the direct translation model. We also explore how to merge a bridge model and a direct model built from a given parallel corpus into an effective combination using the optimized direct-bridge combination method. This combination helps us enhance coverage and improve translation quality. We take the information gained through the relevant portions of the bridge model and try to maximize it.
The information we use does not interfere with the trusted direct model. To achieve this, we consider categorizing the bridge phrase pairs: we divide them into five different categories according to whether their source or target phrases exist in the direct model. The phrase pairs in the first category (cat-1) have both their source and target phrases present in the direct system, combined as a phrase pair. The second category (cat-2) differs from the first in that, although both the source and the target phrases occur in the direct system, they are not combined as a phrase pair there. The third (cat-3), fourth (cat-4), and fifth (cat-5) categories cover the cases where only the source phrase, only the target phrase, or neither of them appears in the direct system, respectively. The categories shown in Table 4.5 correspond to the portions extracted from the bridge phrase-table; their labels are used when reporting our results.

TABLE 4.5: Phrase pairs categorization of the portions extracted from the bridge phrase-table.

Bridge phrase pair category    Src in direct    Trg in direct    Src and Trg in direct
cat-1                          ✓                ✓                ✓
cat-2                          ✓                ✓                ✗
cat-3                          ✓                ✗                ✗
cat-4                          ✗                ✓                ✗
cat-5                          ✗                ✗                ✗
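The following Python sketch (an illustrative assumption about the categorization logic, taking cat-3, cat-4, and cat-5 to mean source-only, target-only, and neither, as described above; not the exact implementation used in the thesis) shows how a bridge phrase pair can be assigned to one of the five categories of Table 4.5:

def categorize(bridge_pair, direct_pairs):
    """Assign a bridge phrase pair to one of the five categories of Table 4.5."""
    s, t = bridge_pair
    direct_sources = {src for src, _ in direct_pairs}
    direct_targets = {trg for _, trg in direct_pairs}
    if bridge_pair in direct_pairs:
        return "cat-1"                     # the pair itself exists in the direct table
    if s in direct_sources and t in direct_targets:
        return "cat-2"                     # both sides exist, but not as a pair
    if s in direct_sources:
        return "cat-3"                     # source side only
    if t in direct_targets:
        return "cat-4"                     # target side only
    return "cat-5"                         # neither side exists in the direct table

direct = {("ketab", "libro"), ("khane", "casa")}     # toy direct phrase-table
print(categorize(("ketab", "libro"), direct))        # cat-1
print(categorize(("ketab", "casa"), direct))         # cat-2
print(categorize(("ketab", "puerta"), direct))       # cat-3
print(categorize(("dar", "casa"), direct))           # cat-4
print(categorize(("dar", "puerta"), direct))         # cat-5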

4.7.2 ODBC Method Experiments

Here we present the results of our experiments on the combination method between the direct and bridge models. In these experiments, we used the Moses phrase-based system (Koehn et al., 2007) to link the direct model with the different bridge portions.

We use an in-domain parallel corpus containing (200K) sentences and (5M) words, derived from the Open-Subtitles parallel corpus (Tiedemann, 2012), to build the direct Persian-Spanish SMT model. We also construct two SMT models for the bridging experiments: one model translates from Persian to English, while the other translates from English to Spanish.

The English-Spanish parallel corpus contains almost (2M) sentences (approximately 50M words) derived from the Europarl parallel corpus (Koehn, P., 2005). We use an in-domain Persian-English parallel corpus containing about (165K) sentences and (4M) words derived from the TEP7 parallel corpus (Pilevar et al., 2011).

We use the fast-align tool-kit for word alignment. For Spanish language modelling, almost (200M) words derived from the Europarl corpus were used, in combination with the Spanish side of our training data. All language models were built as 4-gram models with the KenLM tool-kit. For English language modelling, we used the English side of the Europarl corpus, again with a 4-gram LM built through KenLM.

The Moses phrase-based SMT system was used for all of these experiments, and MERT was used for decoder weight optimization. For both the Persian-English and English-Spanish translation models, we optimize the weights on a set of (5000) sentences randomly drawn from the respective parallel corpus for each model. For all models, we use a maximum phrase length of (6).

Afterwards, we report results on an in-domain Persian-Spanish evaluation set. This set includes almost (500) sentences with two references. We conducted the evaluation using the BLEU metric.

Phrase-based Moses provides the flexibility of using multiple translation tables, which we exploit in the direct-bridge combination experiments. We combine the tables so that translation options are collected from one particular table, while the other tables are used to collect additional options; several other combination techniques are also available.

7 The first free English-Persian parallel corpus, provided by the Natural Language and Text Processing Laboratory, University of Tehran, Iran.

If one translation option (identical source and target phrases) is found in multiple tables, we create separate translation options for each occurrence, each keeping its own score.

4.7.2.1 Baseline Systems Evaluation

We compare the performance of the sentence bridging (transfer) method against the phrase bridging (triangulation) method with different filtering thresholds. Generally, the triangulation method outperforms the transfer one even with a small filtering threshold of (100). Moreover, the higher the threshold the better the performance, but with a diminishing gain. We use the best performing set-up, filtering with a threshold of (10K), for the rest of the experiments. The results are presented in Table 4.6.

TABLE 4.6: Transfer method versus triangulation with different filtering thresholds (100/1,000/10,000).

Bridge scheme                       BLEU score
Transfer                            20.21
Triangulation (filtering 100)       20.64
Triangulation (filtering 1,000)     21.18
Triangulation (filtering 10,000)    21.57

Figure 4.5 presents a learning curve of the bridging systems, showing the increase in performance from the sentence bridging approach to the three different filtering sizes of the phrase bridging approach, in terms of BLEU score.

FIGURE 4.5: The performance learning curve of the bridging systems.

4.7.2.2 Baseline Combination

We start with the basic combination approach and then explore the gain or loss achieved by dividing the bridge phrase-table into five different categories. Table 4.7 illustrates the results of the basic combination in comparison with the best bridge translation model and the best direct translation model.

TABLE 4.7: Baseline combination experiments between best bridge baseline and best direct model.

Translation system                                      BLEU score
Direct                                                  22.45
Triangulation (filtering 10,000)                        21.57
Interpolated (Direct+Triangulation filtering 10,000)    22.81

An interesting observation from the above table is that the direct translation system performs better than triangulation with 10K filtering. The reason is the large size of the parallel corpus used for training the direct system. Compared with the previous set of experiments, we can see that the difference in training data size has a direct effect on the performance of the direct translation systems. The results show that combining both models leads to a gain in performance. The remaining problem is how to improve the quality further by making a smart choice of only the relevant portion of the bridge phrase-table. We address this problem with our proposed direct-bridge combination method. Figure 4.6 shows the performance learning curve of the three translation systems listed in Table 4.7, according to BLEU score.

FIGURE 4.6: The learning curve of the bridging systems comparing the performance of direct, triangulation, and interpolated systems.

4.7.2.3 Direct-Bridge Combination

In this portion, we divide the bridge phrase pairs into five different categories according to the existence of their source or target phrases within the direct system.

We first discuss the results and then examine the trade-off between translation quality and the sizes of the different categories derived from the bridge phrase-table. Table 4.8 presents the results of the direct-bridge combination method experiments, carried out on a learning curve of 100% (200K sentences), 25% (50K sentences), and 6.25% (12.5K sentences) of the Open-Subtitles parallel Persian-Spanish corpus.

TABLE 4.8: ODBC experiments results.

Translation models      12.5K sentences    50K sentences    200K sentences
direct                  15.85              20.01            22.45
triangulation           19.89              20.18            21.57
baseline combination    21.72              22.09            22.81
cat-1                   17.38              20.20            21.96
cat-2                   18.53              20.58            22.06
cat-3                   17.54              20.19            22.76
cat-4                   18.32              20.93            23.14
cat-5                   19.97              21.64            22.45

In this table, the first row shows the result of the direct system. The second row shows the result of the best bridge system (triangulation). The third row shows the result of the baseline combination experiments conducted with the whole bridge phrase-table. The subsequent rows show the results of our direct-bridge combination method experiments based on the different categorizations. All scores are reported in BLEU; bold scores mark statistically significant results against the direct baseline system.

FIGURE 4.7: The performance of direct, triangulation, and interpolated models versus all types of direct-bridge combination proposed method.

Figure 4.7 illustrates the learning curve of the performance changes across all the translation systems considered for our proposed method, compared with the other translation systems.

4.8 Discussion

The results further reveal that bridging is a robust technique when little or no direct parallel data is available. When the direct translation model and the bridge translation model are merged into a baseline combination, they boost translation quality across the learning curve; as expected, the gain from this combination is larger when the direct parallel corpus is smallest. The results also reveal that some of the bridge categories provide more information gain than others, while some categories even damage the overall quality. For instance, (cat-1) and (cat-2) both noticeably damage the translation quality when they are combined with a direct model trained on 100% of the parallel data (200K sentences). Another interesting observation is that a translation system built with only 6.25% of the parallel data (12.5K sentences) can achieve better performance than a model trained on four times that amount of data (50K sentences). A further important point from the learning curve is that the best gains are achieved when the source phrase in the bridge phrase-table does not exist in the direct model. This is expected, since adding unknown source phrases reduces the overall number of Out-Of-Vocabulary (OOV) words.

An additional benefit of the proposed direct-bridge combination method is the reduction of the bridge phrase-table. Table 4.9 shows the percentage of phrase pairs extracted from the original bridge phrase-table for each bridge category across the learning curve.

TABLE 4.9: Percentage of phrase pairs extracted from the original bridge phrase-table for each bridging category.

Model   12.5K sentences   50K sentences   200K sentences
cat-1   0.1%              0.1%            0.2%
cat-2   16%               29%             35.2%
cat-3   64.1%             63.3%           59.9%
cat-4   6.1%              3.4%            2.3%
cat-5   13.7%             4.3%            2.3%

Grouping the phrase pairs into these categories makes it clear that the phrase pairs whose source phrases already exist in the direct model contribute the least, and sometimes even harm the overall combination performance.

With large parallel data (200K sentences), the direct-bridge combination method with the target-only category provides better BLEU results while hugely reducing the size of the bridge phrase-table used (2.3% of the original bridge phrase-table). With smaller parallel data the advantage is reduced, but the method introduces a useful trade-off between translation quality and model size.

The proposed optimized direct-bridge combination of bridge and direct models can thus improve the translation quality of minimal-resource SMT systems. We showed that this method allows a large reduction of the bridge model without hurting performance.

4.9 Summary

In this chapter we investigated the bridge language technique as a response to the minimal parallel-resource bottleneck. First, the transfer and triangulation methods were introduced and their performance was compared. Our experimental results showed that phrase-level bridging (triangulation) is the best bridging technique and outperforms sentence-level bridging (transfer). The interpolated model was then investigated as a further way to enhance translation quality for minimal parallel-resource language pairs.

After that, we investigated the proposed improvement of the bridge language technique, applying interpolation of bilingual text as well as combination of phrase-tables. We observed improved translation performance for a low-resource (source) language by routing it through a high-resource bridge language, using a bilingual text with a limited number of parallel sentences for source-bridge and a large bilingual text for bridge-target.

Finally, we proposed an optimized direct-bridge combination scenario to enhance translation performance and analysed its effects on our case-study minimal parallel-resource SMT system. In this scenario we generated a new source-target model; our method uses the triangulation technique over two models and improves translation quality by combining the selected portions with the direct model. We also explored how to merge bridge and direct models built from given parallel corpora into an effective combination using the direct-bridge combination method. This method helps us enhance coverage and improve translation quality.


Chapter 5

Round-Trip Training Scenario for Minimal Parallel-Resource SMT

Relatively little Statistical Machine Translation (SMT) research has been reported on languages that lack resources such as monolingual text, parallel text, translation dictionaries, and syntactic tools; in the literature these are often referred to as low-density or minimal parallel-resource languages. Even though obtaining monolingual text has become easier thanks to the growth of the internet, obtaining parallel text is still a difficult task.

This chapter contains the most important contribution of the thesis, which deals with the minimal-resource situation in which only a limited amount of bilingual text is available, while large amounts of monolingual text are available in both the source and the target language. In other words, this chapter presents research based on a retraining mechanism for minimal parallel-resource SMT systems, with the aim of improving their translation quality.

It has previously been explained that SMT systems are heavily dependent on parallel data: they do not work well when fewer than a million bilingual sentence pairs are available. When the bilingual text is small, the performance of statistical models becomes very poor, because their parameters are defined over sparse word and phrase counts. Recent progress in SMT has in fact relied on tens of millions of bilingual sentence pairs, yet human labelling remains a very expensive task.

We reviewed the approaches to this training-data bottleneck in detail in Chapter 2: Active Learning (AL), Semi-Supervised Learning (SSL), the Bridge (pivot) Language Technique, Bilingual Lexicon Induction, Monolingual Collocation, and Domain Adaptation with Monolingual Data. In Chapter 4 we studied the Bridge Language Technique, a common approach to overcoming training-data scarcity, in detail, and presented a proposal for making effective use of a third language as a bridge.

In the current chapter we introduce a round-trip training scenario, a novel training mechanism that enables an SMT system to learn automatically from unlabelled data through a two-way game.

5.1 Introduction

State-of-the-art SMT systems rely on aligned parallel training corpora. However, collecting such parallel data is very expensive in practice, and it is usually available only on a limited scale, which constrains both applications and research. Since practically unlimited monolingual data is available on the Web, the performance of SMT systems can be boosted by leveraging that data.

The methods proposed for this purpose can be grouped into two categories:

1. In the first category, a language model is trained on target-side monolingual corpora and then integrated with the SMT model, which is itself trained on parallel bilingual corpora, in order to improve translation quality.

2. In the second category, pseudo bilingual sentence pairs are created from monolingual data using a translation model trained on the aligned parallel corpora; the training data is then enlarged with these pseudo bilingual sentence pairs for subsequent learning.

The methods above can certainly improve SMT performance, but they have limitations: the methods in the first category use the monolingual data only to train language models, while the methods in the second category, although able to increase the size of the parallel training data, have no control over the quality of the pseudo bilingual sentence pairs.

Our proposed round-trip training mechanism is inspired by the following observation. SMT involves two translation tasks: the source-target task (forward direction) and the target-source task (backward direction). The forward direction can be used as an outbound trip and the backward direction as an inbound trip. These outbound and inbound trips have two significant properties: they form a closed loop, and they can generate informative feedback signals for training the translation models. In the round-trip training mechanism, one translation engine represents the model for the outbound translation task and the other translation engine represents the model for the inbound translation task; the two engines then guide each other through a learning process. With the feedback signals generated during this process, the two models can be updated iteratively until convergence.

The proposed round-trip training scenario can leverage monolingual data, in both the source and the target language, in an effective way. The mechanism enables this data to play a role similar to parallel bilingual data, gradually reducing the requirement for parallel bilingual data during training. In simple terms, the round-trip training mechanism for SMT can be described as the following two-engine communication game:

• The first translation engine understands only language X. It sends a message in language X to the second translation engine¹ through a noisy channel, which converts the language X message into language Y by means of a translation model.

• The second translation engine understands only language Y, so it receives the translated message in that language. After checking the message, it notifies the first translation engine whether the message is a natural sentence in language Y. It then sends the received message back to the first translation engine through another noisy channel, which converts the language Y message back into language X by means of another translation model.

• After the first translation engine receives the message back from the second one, it checks it and notifies the second translation engine whether the received message is consistent with the original message. From this feedback, both translation engines learn about the performance of the two communication channels and the two translation models, and adjust them accordingly.

This two-way communication game can also be started from the second translation engine with an original message in language Y; the two translation engines then follow the symmetric process and again adjust their models according to the feedback they receive.

The description above shows that, even though the two translation engines may not have access to aligned bilingual corpora, they can still receive feedback about the quality of their translation models and use it to improve the models collectively. The two-way communication game can be played for an arbitrary number of rounds, so both translation models keep improving as the learning procedure proceeds. This allows us to develop a learning framework for training SMT models with a round-trip training algorithm, in which the translation models are trained from unlabelled data. Our work not only offers a new way to learn the translation models from scratch but also reduces the requirement for aligned bilingual data, and the experimental results indicate that the method is promising.

Since the round-trip training scenario is a novel training approach, its structure contains both bootstrapping methods, self-training and co-training. It contains self-training because the translations produced for the monolingual source sentences are generated by the forward model (outbound-trip translation task) and are later used to retrain that same model. It also contains co-training because the backward model (inbound-trip translation task) delivers a signal that helps select good translations from the k-best list of translation candidates, which are then used to retrain the forward model. To understand this procedure in depth, we start with an analysis of the bootstrapping methods.

¹ The second translation engine may not be able to verify the correctness of the translation, since the original message is invisible to it.

5.2 Bootstrapping Analysis

Statistical approaches to Information Extraction (IE) and Natural Language Processing (NLP) tasks need vast amounts of information in order to produce high-quality results and perform reliably: large corpora are required to train empirical algorithms, and these algorithms typically require annotated training data from which salient textual features can be extracted and learned. Bootstrapping strategies are used because annotated corpora alone are rarely sufficient for training.

Most NLP problems do not naturally decompose into multiple source texts (views); in such cases the views have to be constructed artificially through arbitrary feature splits (Nigam and Ghani, 2000). Translation, by contrast, carries a natural division of views onto the labels. In SMT the labels are the target translations of the source texts, so the source text can be regarded as a view on the translation. Existing translations of the source text in other languages can also serve as views: for instance, Spanish and German translations of a text can be used as two different views, and either of them could be used to produce a target translation into English. When these views are labelled with their translations, they can be used to train learners for statistical translation models.

FIGURE 5.1: Bootstrapping high-level overview.

Historically, Machine Learning has focused on three main learning paradigms: supervised learning, unsupervised learning, and semi-supervised learning. In recent years another paradigm, known as weakly-supervised learning, has started receiving attention (Zhou, 2017).

Weakly-supervised learning is a setting in which the training data differs in kind from the test data. In POS-tagging, for instance, the test data consists of labelled sequences while the training data consists of tagged word types. This is different from domain adaptation, where the test and training data are drawn from different populations but are of the same type.

Various methods have been suggested in recent years to address weakly-supervised learning. Apart from the bootstrapping methods, they can be divided into two categories:

• Methods that bootstrap from a small number of labelled tokens, also known as prototypes.

• Methods that constrain the underlying unsupervised learning problem.

This section focuses on two weakly-supervised algorithms, self-training and co-training. Both are bootstrapping methods that aim to improve a system's performance: they start from a small set of labelled examples and one or two weak classifiers (translation models), and improve by incorporating unlabelled data into the training dataset. The motivation for weakly supervised learning such as self-training and co-training is stronger for complicated tasks than for simple classification tasks. Achieving high performance in SMT requires a large amount of training data, but the necessary labelled training data is available only in limited quantity and assembling it manually is costly.

Since SMT induces bilingual dictionaries and translation rules automatically from parallel corpora, a statistical model of the translation process can be approximated once the relative orderings of the texts have been analysed. To reach an acceptable level of translation quality, SMT systems must be trained on large corpora; however, large bilingual corpora are not easily available. Therefore, for SMT to exist between languages without parallel corpora, one of two options must be chosen:

• Additional parallel corpora must be assembled.

• The current SMT techniques must be adapted to the limited amount of linguistic resources available.

Bootstrapping methods blend these two options: large amounts of parallel corpora are produced starting from small amounts of parallel text.

Most machine learning techniques are fully supervised and therefore rely heavily on labelled training data. A supervised learning algorithm takes class-labelled examples as input and learns to predict the class labels of new, unlabelled data; the result of learning is a model or predictive function. Supervised learning is the most successful approach to automatic text categorization, and SMT belongs to this category².

Because labelled data must be created from unlabelled data, the amount of unlabelled data available is usually much greater than the amount of labelled data. This has created interest in weakly supervised learning, in which unlabelled data is used alongside labelled data. Weakly supervised learning tries to reduce the cost of annotation by letting the learners annotate data automatically.

² This category involves sentences labelled with their translations.

Starting from a set of labelled and unlabelled data, the main aim of a bootstrapping algorithm is to improve classification performance by extracting texts (examples) from the unlabelled data and integrating them into the labelled data set. The class distribution in the labelled data is maintained in each iteration by keeping a constant ratio across classes between the examples that were already labelled and those that have just been added; the idea is to avoid introducing imbalance into the training dataset. For co-training, the algorithm requires two different views (two different classifiers C1 and C2), which interact during the bootstrapping process. If the number of views is limited to one (a single general classifier C1), co-training reduces to self-training, the process in which a single classifier learns from its own output.

Self-training has long been used in NLP research as a single-view semi-supervised learning method; the term covers a variety of schemes for exploiting unlabelled data. Ng and Cardie (2003) implemented self-training with bagging and majority voting: a committee of classifiers is first trained on the labelled texts and then classifies the unlabelled texts independently; the texts on which the classifiers agree are added to the training set, the classifiers are retrained, and the procedure is repeated until a stopping condition is met. In simple terms, self-training methods use the labelled data to train an initial model, use that model to label the unlabelled data, and then retrain a new model.

Co-training, as a multi-view learning method, is another weakly supervised paradigm that increases the amount of labelled data by exploiting large amounts of unlabelled data. The basic idea is to exploit redundancy in the unlabelled data: intuitively, the data can be represented by two or more separate but redundant views, such as disjoint feature subsets. Two classifiers trained on two views of the data can then help each other by adding their most confident texts to the other's training set; as long as these texts are more informative than randomly chosen ones, training benefits from them. The important assumptions for the applicability and effectiveness of this method are as follows:

1. Each of the feature sets is sufficient on its own to classify the data.


2. The two feature sets are conditionally independent given the class label.

When these two assumptions hold, Blum and Mitchell (1998) proved that co-training can start from weak classifiers and learn from both labelled and unlabelled data. Nigam and Ghani (2000) further showed that when the conditional independence and information redundancy assumptions hold, co-training algorithms outperform other algorithms that use unlabelled data.

Both self-training and co-training increase the amount of labelled data by automatically annotating unlabelled data. When the number of classifiers and the number of views is limited to one, co-training turns into self-training, the scenario in which a classifier learns from its own output. The main parameters of these bootstrapping procedures are:

1. The number of iterations (I).

2. The pool size (P)³.

3. The growth speed (K)⁴.

³ The pool size is the number of texts selected from the unlabelled data for annotation in each iteration; one can label all the unlabelled data or only a subset of it every time.
⁴ The growth speed is the number of texts added to the labelled set in each iteration.

In simple terms, co-training differs from self-training in that it relies on multiple learners to perform the annotation; however, both co-training and self-training are bootstrapping methods.

The basic aim of both methods is to improve the performance of a supervised learning algorithm by incorporating large amounts of unlabelled data into the training data set. Algorithm 5 illustrates the general bootstrapping process:

Algorithm 5 General bootstrapping
Input: Training set of labelled text (L), training set of unlabelled text (U), classifiers (Ci).
1: Create a pool U′ by choosing P random texts from U.
2: for iteration = 1, ..., I do
3:    Use L to train each Ci individually, and label the texts in U′.
4:    For each Ci, select the most confidently labelled texts (G) and add them to L, while maintaining the class distribution in L.
5:    Refill U′ with texts from U, to keep U′ at a constant size of P texts.
6: end for
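To make this loop concrete, the following Python sketch mirrors Algorithm 5 for a generic classifier interface. It is only an illustrative sketch under assumed interfaces: the Classifier class, its train and predict_with_confidence methods, and the toy confidence scores are hypothetical stand-ins, not the toolkit used in this thesis.

import random
from typing import List, Tuple

class Classifier:
    """Toy classifier used only to make the sketch self-contained; a real
    system would plug in an actual model here."""
    def train(self, labelled: List[Tuple[str, str]]) -> None:
        labels = [label for _, label in labelled]
        self.majority = max(set(labels), key=labels.count)

    def predict_with_confidence(self, text: str) -> Tuple[str, float]:
        # A real classifier would return a genuine confidence score.
        return self.majority, random.random()

def bootstrap(labelled, unlabelled, classifiers, pool_size=100, iterations=5, growth=10):
    """General bootstrapping loop in the spirit of Algorithm 5."""
    pool = random.sample(unlabelled, min(pool_size, len(unlabelled)))
    remaining = [u for u in unlabelled if u not in pool]
    for _ in range(iterations):
        for clf in classifiers:
            clf.train(labelled)
            # Label the pool and keep the most confident predictions (G).
            scored = [(text, *clf.predict_with_confidence(text)) for text in pool]
            scored.sort(key=lambda item: item[2], reverse=True)
            confident = scored[:growth]
            # A full implementation would also preserve the class distribution of L here.
            labelled.extend((text, label) for text, label, _ in confident)
            chosen = {text for text, _, _ in confident}
            pool = [text for text in pool if text not in chosen]
        # Refill the pool from the remaining unlabelled data (constant size P).
        while remaining and len(pool) < pool_size:
            pool.append(remaining.pop())
    return labelled

With a single classifier in the list, this loop reduces to the self-training described next; with two classifiers it corresponds to the co-training variant.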

5.2.1 Self-Training Mechanism

In this part we investigate self-training as a weakly supervised learning approach. This approach explores the use of monolingual source-language text, namely the documents to be translated, in order to improve the performance of under-resourced SMT systems.


An initial version of the translation system is used to translate the source text. Low-quality target sentences among the generated translations are automatically identified and discarded. The reliable translations, together with their source sentences, are used as a new bilingual corpus to train an additional phrase translation model. In this way the translation system can be adapted to the new source data even if no other bilingual data is available in this domain.

Self-training is supervised by a single learner that retrains on the labels it has itself applied to unlabelled data, which is why it is understood as a weakly supervised method. The approach starts from a translation model trained for a language pair on a source-target parallel corpus, which is then used to produce target translations for a set of source sentences.

In the next step, the machine-translated source-target sentence pairs are added to the initial bilingual corpus and the translation model is retrained.

The basic idea of self-training was first documented by Yarowsky (1995). The technique starts from an initial translation model (classifier) trained on the available labelled data. The classifier first labels the unlabelled data, and a metric is applied to decide which predictions can be trusted. The instances labelled with the highest confidence are added to the existing labelled data to obtain a new training set, a new classifier is trained on the current labelled data, and the process is repeated until a stopping condition is met.

Self-training is a practical method, recommended in situations where the existing supervised model is hard to modify. It has been applied to many NLP tasks, such as word sense disambiguation, spam detection, and machine translation.

A self-training method relies on monolingual source-language data to improve the performance of SMT systems. The procedure includes the following steps:

• Use an existing SMT system in order to translate the new source text.

• Estimate confidence of resulting translations.

• Identify reliable translations based on confidence scores.

• Train a new translation model on the reliable translations. This model can later be used as an additional feature in the existing SMT system.

This procedure helps the translation system adapt to source-language text of a new type⁵ without requiring any parallel training or development data sets in the target language.

⁵ E.g., text discussing new topics not present in the text originally used to train the system, or employing a different style.

Algorithm 6 shows the general procedure of the self-training mechanism:


Algorithm 6 Self-training
Input: Training set of labelled data (L), training set of unlabelled data (U), underlying classifier (C), the iteration counter (t), the number of unlabelled items selected for the next iteration (θ), the selection metric (M), the selection function S(Ut, θ, C, M), and the maximum number of iterations (max).
1: t = 0
2: Lt = L, Ut = U // Lt and Ut are the labelled and unlabelled data sets at the t-th iteration
3: repeat
4:    train C on Lt
5:    St = S(Ut, θ, C, M) // St is the selected unlabelled data
6:    Ut+1 = Ut − St
7:    Lt+1 = Lt + St
8:    t = t + 1
9: until Ut is empty or max is reached
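The sketch below renders Algorithm 6 in Python for the SMT setting. The train_translation_model and translate_with_score callables are hypothetical placeholders for an SMT toolkit (for example a wrapper around a phrase-based decoder that returns a translation together with a confidence score); only the control flow is meant literally, as a sketch of the self-training loop.

def self_train(bitext, mono_source, train_translation_model, translate_with_score,
               theta=1000, max_iters=3):
    """bitext: list of (source, target) pairs (L); mono_source: list of source
    sentences (U); theta: number of sentences selected per iteration."""
    labelled = list(bitext)
    unlabelled = list(mono_source)
    for _ in range(max_iters):                                    # lines 3-9
        if not unlabelled:
            break
        model = train_translation_model(labelled)                 # line 4
        # Translate every unlabelled source sentence and keep a confidence score.
        scored = [(src, *translate_with_score(model, src)) for src in unlabelled]
        scored.sort(key=lambda item: item[2], reverse=True)       # selection metric M
        selected = scored[:theta]                                 # S_t
        labelled += [(src, hyp) for src, hyp, _ in selected]      # L_{t+1} = L_t + S_t
        chosen = {src for src, _, _ in selected}
        unlabelled = [s for s in unlabelled if s not in chosen]   # U_{t+1} = U_t - S_t
    return train_translation_model(labelled)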

Note that the selection function is used to rank the unlabelled data and to select a certain number of items with which to update the training set for the next iteration. The function is influenced not only by the underlying classifier (the current translation model), which should have good ranking performance, but also by the selection metric.

Assume that the existing SMT system is to be used to translate, for instance, newswire text, and that a large collection of such data becomes available on the source-language side. The method described here can then create additional training data, and when adapting to some test data set it is likely to identify the relevant parts of the new source data.

Retraining the SMT system on its own translations of a test data set is precisely what adapts the system to that test data set: the test corpus reinforces those phrases in the existing phrase-tables that are needed to translate the new data. After the poor machine-translated texts have been filtered out, only the high-quality phrases are reinforced, which lowers the probability of low-quality phrase pairs such as overly confident singletons or noise in the table. The probability distribution over the phrase pairs thus concentrates on the parts that are reliable for the given test data set. This makes the method useful for adapting an existing system to a new domain for which no bilingual training or development data sets are available.

Self-training also offers a way for the system to learn new phrase pairs. Assume that the source phrases VW and XYZ occur independently in the parallel text, so that they contribute to the original phrase-table P(s_1^J | t_1^I). If these source phrases repeatedly occur in the monolingual source-language data as the sequence VWXYZ, then new source phrases such as VWX, WXYZ, and VWXYZ can be produced.

If the target-language translations produced for these phrases are even minimally reliable, learning new bilingual phrases and adding them to the phrase-table is encouraged. However, this approach does not allow the system to learn translations of unknown source-language words occurring in the new data: only words that are already present in the phrase-tables can appear in the new bilingual data set. A further limitation is that the approach only encourages the learning of compositional phrases; the system cannot properly translate idioms such as it is raining horses and camels into another language. One might ask: what if proper translations of it is raining, horses, and camels are available in the phrase-table, can the idiom be translated then? The answer is no: even then the idiom cannot be translated correctly by the system.

Generally speaking, during self-training the source-side data set is translated with the MT system, the trustworthy translations are automatically identified, and these sentences, together with their sources, form a new bilingual text that is used to train new translation models. This provides a method for adapting an existing SMT system to a new domain even when no bilingual training or development data sets from that domain are available.

5.2.2 Co-Training Mechanism

In this part we investigate the co-training method, as a form of multi-resource translation, for SMT. Co-training requires independent views on the data, where each view is sufficient for the labelling task; here, source strings in multiple languages are used as views on the translation. Co-training (Abney, 2002) is another weakly supervised learning technique that relies on having distinct views of the items being classified: the features used by the learners to label an item must be divisible into independent views, and each view must be sufficient on its own for labelling the items. The approach has been applied to simple categorization tasks such as web page classification (Blum and Mitchell, 1998), base noun phrase identification (Cardie and Pierce, 1998), and named entity recognition (Collins and Singer, 1999), and also to parsing (Sarkar, 2001). MT is a much more complex task than these previous applications of co-training.

In MT, source strings can be seen as labelled with their corresponding translations. These labels are not drawn from a small finite set of symbols, as in classification or parsing tasks; instead, the labels are built from the vocabulary of the target language. The motivation for weakly supervised learning such as co-training is therefore stronger for complicated tasks like MT than for simple classification tasks: achieving high performance with SMT requires a large amount of training data, but the necessary labelled training text is limited and manually assembling more data is costly.

The desirable option is therefore to use co-training to automatically create more labelled training data for such problems. This is only possible, however, if the labelled training data can be made to fit the framework of different views required by co-training.

Co-training can be described informally as follows:

1. At first, we need to select two or more views of a classification problem.

2. Next, we create a separate model for each view and train each of these models on the small set of labelled data.

3. After that, we sample from the unlabelled data to find examples for each model to label independently.

4. The examples labelled with high confidence are then selected as new training data.

5. Finally, the models are retrained on the updated training data⁶.

⁶ The procedure is repeated until the unlabelled data is exhausted.

Since each model picks labelled data to add to the training set, each model is effectively labelling data for the other. This is in contrast to self-training, where a model is retrained only on the labelled data it produced itself (Nigam and Ghani, 2000).

Co-training thus uses a small amount of human-labelled data to bootstrap larger sets of automatically labelled training data. Multiple learners are used to label new data and to retrain on each other's labelled data. Using multiple learners increases the chance of including useful information: an example that is easily labelled by one learner may be difficult for the other, so adding the confidently labelled data provides new information for the next training round.

Co-training for SMT is more complicated because multiple translation models are used to translate a bilingual or multilingual corpus, rather than a single translation model translating a monolingual text. For instance, translation models could be trained for German-English, French-English, and Spanish-English from appropriate bilingual corpora and then used to translate a German/French/Spanish parallel corpus into English. Since there are three English translation candidates for each sentence alignment, the best of the three can be selected and used to retrain the models.

There are three different views involved in the co-training formulations:

• Vocabulary acquisition: One of the biggest problems caused by a small training data set is incomplete word coverage. Without a word occurring in its training corpus, a translation model is unlikely to generate a reasonable translation of it. Because the initial training corpora can come from different sources, it is more likely that at least one translation model in the collection has encountered a given word before, which results in vocabulary acquisition during co-training.


• Coping with morphology: The problem above becomes more severe because most current SMT formulations treat morphology incompletely; for example, the training data for a Spanish translation model may contain the masculine form of an adjective but not the feminine one. Since languages vary in their use of morphology, one language's translation model may possess the translation of a particular word form while another language's translation model does not. Co-training can therefore increase the inventory of word forms and reduce the problem that morphology poses to simple statistical translation models.

• Improved word order: Word reordering is a significant source of errors in SMT (Och, 1999). Word order between related languages is similar, while word order between distant languages may differ significantly. If translation models for distant languages are to learn better word order mappings to the target language, it helps to include more examples through co-training with related languages.

In all the cases above, the diversity afforded by multiple translation models increases the chance that the machine-translated sentences added to the initial bilingual corpora contain useful information.

The basic requirement of co-training is a set of unlabelled data, which provides two major benefits:

• It can automatically be labelled by the learners.

• It can be used for retraining.

Algorithm 7 gives a general overview of the co-training mechanism:

Algorithm 7 Co-training
Input: Training set of labelled data (L), training set of unlabelled data (U), underlying classifiers (C1, C2), the number of unlabelled texts to be labelled positive (Np), and the number of unlabelled texts to be labelled negative (Nq).
1: Train classifiers C1 and C2 on L.
2: repeat
3:    for each view w = 1, 2 do
4:        Remove the Np elements p with the greatest Cw(p) from U, and add them to L with label +1.
5:    end for
6:    for each view w = 1, 2 do
7:        Remove the Nq elements p with the smallest Cw(p) from U, and add them to L with label −1.
8:    end for
9:    Retrain C1 and C2 using the updated L.
10: until U becomes empty.
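As an illustration, the following Python sketch reproduces the control flow of Algorithm 7 for a generic binary classification task with two feature views. The scikit-learn-style fit/predict_proba interface, the numpy arrays, and the binary labels are assumptions made for the sketch; the thesis experiments themselves use translation models rather than these toy classifiers.

import numpy as np

def co_train(X1, X2, y, U1, U2, clf1, clf2, n_pos=2, n_neg=2, max_rounds=10):
    """X1, X2: labelled items in view 1 and view 2; U1, U2: the same unlabelled
    items in both views; y: binary labels (1 = positive, 0 = negative)."""
    y = np.asarray(y, dtype=float)
    clf1.fit(X1, y)
    clf2.fit(X2, y)                                        # line 1 of Algorithm 7
    for _ in range(max_rounds):                            # lines 2-10
        if len(U1) < 2 * (n_pos + n_neg):
            break
        picked, new_labels = [], []
        for clf, U in ((clf1, U1), (clf2, U2)):            # lines 3-8
            proba = clf.predict_proba(U)[:, 1]
            order = np.argsort(proba)
            picked += list(order[-n_pos:]) + list(order[:n_neg])
            new_labels += [1.0] * n_pos + [0.0] * n_neg
        picked = np.array(picked)
        # Both views of every selected item join the labelled set
        # (duplicate picks across views are possible in this simplified sketch).
        X1 = np.vstack([X1, U1[picked]])
        X2 = np.vstack([X2, U2[picked]])
        y = np.concatenate([y, np.array(new_labels)])
        keep = np.setdiff1d(np.arange(len(U1)), picked)
        U1, U2 = U1[keep], U2[keep]
        clf1.fit(X1, y)
        clf2.fit(X2, y)                                    # line 9
    return clf1, clf2

Any pair of scikit-learn classifiers that implements predict_proba, for example two LogisticRegression models trained on disjoint feature subsets, could be plugged in here.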

According to Algorithm 7, in each iteration a subset of the unlabelled data whose labels are assigned with high confidence by the current classifiers (translation models) is selected and added to the labelled data. The number of selected positive and negative instances is proportional to their ratio in the labelled sample. The classifiers are then retrained on this expanded training data, and the process continues until it converges.

The selection method in Algorithm 7 is left unspecified. Several methods can help select the best items for retraining; they usually include the following (a combined sketch is given after the list):

• Choosing the items containing unknown vocabulary.

• Making length-based selection.

• Choosing the translations with the highest translation probabilities.
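The sketch below combines these three heuristics in Python. The tm_score callable, the known_vocab set, and the length bounds are hypothetical stand-ins chosen for illustration; any of the three criteria can of course be used on its own.

def select_for_retraining(pairs, tm_score, known_vocab, min_len=3, max_len=40, top_n=100):
    """pairs: list of (source, translation) tuples produced by the models."""
    def has_unknown_vocab(source):
        # Heuristic 1: prefer items that bring in unknown vocabulary.
        return any(token not in known_vocab for token in source.split())

    def length_ok(source):
        # Heuristic 2: length-based selection.
        return min_len <= len(source.split()) <= max_len

    candidates = [p for p in pairs if has_unknown_vocab(p[0]) and length_ok(p[0])]
    # Heuristic 3: keep the translations with the highest translation probabilities.
    candidates.sort(key=lambda p: tm_score(p[0], p[1]), reverse=True)
    return candidates[:top_n]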

In other words, co-training is a weakly-supervised learning technique: it starts from a small amount of human-labelled data and uses it to bootstrap larger sets of automatically labelled training data. In co-training implementations, multiple learners label new data and are retrained on each other's labelled data. Using multiple learners significantly increases the chance of including useful information: an example that is easily labelled by one learner may not be as easy for the other, which is why adding the confidently labelled data is informative in the next round of training.

5.3 Round-Trip Training Theory

Assume that we have access to an initial bilingual corpus for a source-target language pair, and also to a large or medium-sized monolingual corpus on the source side only.

If the initial bilingual corpus is small, the translation models generated from it, from source to target and vice versa, will be of low quality. If such a low-quality model is used to translate the large or medium-sized source monolingual corpus into the target language, the machine-translated sentences on the target side will not be high-quality translations of their corresponding source sentences. Hence, if we pair these low-quality machine translations with the source monolingual corpus and add the resulting pseudo bilingual sentence pairs to our small initial bilingual corpus, the added sentence pairs will contribute little value, and their noise will dominate the high-quality data of the enlarged corpus when the translation system is retrained. It is therefore quite likely that the new model will be of lower quality than the previous one.

The question, then, is how to separate the high-quality sentence pairs from all the sentences in the generated pseudo bilingual data. In other words, we need to select the high-quality translations among all the sentences on the target side of the pseudo bilingual sentence pairs, i.e. identify the high-quality translations among all the noisy ones.

In Chapter 2, some approaches for identifying high-quality translations among noisy translations were mentioned, in the form of Semi-Supervised Learning (SSL) and Active Learning (AL).

In this section a new solution, named round-trip training, is proposed to identify the high-quality translations among the noisy ones, together with a suitable way to optimize this round-trip training scenario over the generated pseudo bilingual sentence pairs.

To translate sentences from source to target, the idea of the round-trip scenario is to find the high-quality source and target sentences among all the noisy ones in the generated bilingual sentence pairs. For each source sentence we generate a k-best list of translation candidates on the target side, and we expect that each k-best list contains at least one high-quality translation of the corresponding source sentence.

The problem then is to find this high-quality translation in each k-best list (by applying the outbound-trip and inbound-trip translation tasks). After finding the highest-quality translation for each source sentence on the target side of the pseudo bilingual corpus, we optimize the pseudo bilingual sentence pairs by finding the highest-quality sentences on the source side of the generated corpus (changing the translation path). For this step we generate a k-best list of candidates once again for each source sentence. As a result, we obtain high-quality generated bilingual sentence pairs between the source and target languages, which can be combined with the small initial bilingual source-target corpus, and the new enlarged high-quality corpus can be used to retrain the translation system.

5.4 High-Quality Translations Selection

According to the round-trip training idea, two points are essential for identifying the high-quality translations among the noisy translations on the target side of the generated bilingual sentence pairs. A translated sentence is considered a high-quality translation if it has the following two characteristics:

1. The sentence should be well-formed as an independent sentence in the target language, i.e. it should be an understandable and clear target-language sentence even if it is not a correct translation of its corresponding source. To evaluate this factor, target Language Model (LM) scores are used: if the score of a target-side sentence under a language model is low, the sentence is not a valuable target-language sentence and consequently not a suitable translation either. The first condition for a target-side sentence to be considered a high-quality translation of its corresponding source sentence is therefore that it is well-formed in the target language. For instance, given a list of translation candidates in the target language for a corresponding source sentence, we apply an n-gram LM to all the target-language sentences in the list and rank them against a threshold according to their LM scores; the high-ranked sentences, which have the higher scores under the language model, are selected as high-quality sentences, and the rest are discarded.

2. In addition to being well-formed on the target side, the sentence should be a suitable (high-quality) translation of its corresponding source sentence. This factor is evaluated with Translation Model (TM) scores in the inbound-trip (backward) translation direction (target-source): given a list of translation candidates on the target side for a corresponding source sentence, we can perform the backward translation from each sentence in the list to the fixed corresponding source sentence under a translation model. If the TM score of a sentence in this list with respect to the source sentence is higher than the others in the inbound-trip (backward) translation process, then that target-side sentence is a high-quality translation of the source sentence. The second condition for having high-quality translations on the target side is therefore that the corresponding source sentence can be regenerated through the back translation model from target to source. A translation engine that can translate a sentence from the target side back to the source side with a high TM score shows that the considered target-side sentence is not only well-formed but also a high-quality translation of its corresponding source. A minimal sketch of this two-criterion check is given below.
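The following Python sketch illustrates the two-criterion check. The lm_score and back_tm_score callables and the two thresholds are hypothetical hooks for a target-side n-gram LM and a backward (target-to-source) translation model; only the selection logic itself is meant literally.

def select_high_quality(source_sentence, kbest_targets, lm_score, back_tm_score,
                        lm_threshold, tm_threshold):
    """Keep the candidates that are well-formed in the target language
    (criterion 1) and adequate translations of source_sentence (criterion 2)."""
    selected = []
    for candidate in kbest_targets:
        fluent = lm_score(candidate) >= lm_threshold                          # criterion 1
        adequate = back_tm_score(candidate, source_sentence) >= tm_threshold  # criterion 2
        if fluent and adequate:
            selected.append(candidate)
    return selected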

5.5 Round-Trip Training Mechanism

Now assume that we have already generated a pseudo bilingual corpus between a source language and a target language, using a large monolingual data set in the source language and the source-target translation model extracted from a small initial bilingual corpus for that language pair.

According to the round-trip training idea, we first apply the outbound-trip translation task (forward translation from source to target) and generate, for each source sentence, a k-best list of translation candidates on the target side instead of a single-best translation.

At this stage we measure the target language model scores in order to check the well-formedness of each sentence in each k-best list of translation candidates, by applying an n-gram LM separately. After applying the n-gram LM, each sentence in each k-best list has an LM score. We keep these scores and, in parallel, move to the next step: measuring the translation model score of each sentence in each k-best list through the backward translation direction. To do this, we apply the inbound-trip translation task (back translation from the sentences in each k-best list on the target side to the source). After applying the inbound-trip translation model, each sentence in each k-best translation candidates list has a TM score with respect to the corresponding source sentence.

Each sentence in each k-best list thus has its own target LM score and inbound-trip TM score. In the next step of the round-trip training scenario, these LM and back-TM scores are combined for each sentence in each k-best translation candidates list on the target side to form a total score (R = LM score + TM score). So, for each match, we have:

(x1, y1) ⇒ R1 = LMy1 + TMy1→x1

(x1, y2) ⇒ R2 = LMy2 + TMy2→x1

...

(x1, yk) ⇒ Rk = LMyk + TMyk→x1

For each match we thus have a total score R in each k-best list, combining the target LM score and the inbound-trip TM score of each sentence. We then rank these Ri scores in order to recognize the highest score among R1, ..., Rk in each k-best list, and based on this ranking we select the candidate with the highest R score in each list of translation candidates. By construction, the selected sentences with the highest R scores have both a high inbound-trip TM score and a high target LM score, so they are reasonable and reliable.

Put simply, for a sentence X1 on the source side of the pseudo bilingual sentence pairs, the corresponding k-best translation candidates list is (X1, Y1), (X1, Y2), ..., (X1, Yk). For each pair (X1, Yi), we calculate the total score R as the combination of the backward-TM score of the pair and the target LM score of Yi (that is, we apply an n-gram LM to each sentence Yi, and perform the backward translation process under a phrase-based translation model for each pair of X1 and Yi). We then rank the total scores Ri and select the highest one among all (X1, Yi) pairs, discarding the remaining pairs with lower R scores.

The sentence with the highest total score is paired with its corresponding source sentence, and this procedure is repeated for every source sentence to determine its suitable corresponding target sentence (until convergence).

At the end of this stage we have a high-quality translation for each source sentence, but the generated bilingual data still contains some noise; in other words, the pseudo bilingual data is not yet optimized. If we added this generated part to our initial high-quality bilingual data as it is, the remaining noise could dominate the enlarged corpus and ultimately reduce the quality of the model. (In the experimental framework we will nevertheless test a translation system trained on this non-optimized data, to compare its performance with the initial baseline translation system.) In the next step, therefore, we need to optimize the generated pseudo bilingual corpus as well.

5.6 Round-Trip Training Optimization

There are many possible approaches to this optimization. Here we propose a simple idea: applying an n-gram LM to the source sentences only, in order to find the best well-formed ones.

At this stage of the round-trip training scenario we use the inbound-trip (backward) translation task. First, the translation path is changed from source-to-target to a new path (target to source). Then we generate k-best lists of translation candidates once again, this time for all the new target (former source) sentences. We apply an n-gram LM to each sentence in each k-best list and re-rank the sentences according to their LM scores; the best well-formed sentence is selected as the one with the highest LM score.

Having this high-quality sentence, the rest of the sentences in each k-best transla- tion candidates list will be ignored automatically, i.e. the best sentence in each k-best list will be selected according to its well-formed characteristic. In another word, this selected sentence is grammatically and understandably an ideal sentence in the con- sidered new target (original source) language, so we can re-change the translation path to the initial state.

By doing so, the high-quality sentences in both source and target languages are selected. Now by pairing these high-quality source and target selected sentences, we have an optimized generated pseudo bilingual corpus, and we can add these optimized generated sentence pairs to the original small initial bilingual corpus in order to achieve an enlarged high-quality bilingual corpus for source and target lan- guages. At this stage, our generated large bilingual corpus is ready to be used for the purpose of retraining the baseline translation system. After retraining the system, we can compare the new results with the previous output from the initial baseline translation system.

Applying the round-trip training idea shows that the method contains a bit of both the self-training and the co-training technique in its structure. It involves self-training because the outbound-trip (forward) model generates translations for monolingual source sentences which are then used to retrain the model itself, and it involves co-training because the inbound-trip (backward) model provides a signal that helps the translation system select high-quality translations from the generated k-best translation candidate lists, which are then used to retrain the forward model.

5.7 Round-Trip Training Algorithms

Assume that two monolingual corpora CX and CY, containing sentences from languages X and Y respectively, are available⁷. Imagine also that we have access to two weak translation models that can translate sentences from X to Y and vice versa. The goal of the round-trip training scenario is to enhance the accuracy of these two translation models by using the monolingual corpora instead of a parallel corpus. The basic idea is to leverage the round-trip training of the two translation models.

⁷ These corpora may have no topical relationship with each other at all.

Beginning from a sample sentence in any monolingual data, we first translate it forward (applying outbound-trip translation task) to the other language, and then

7These corpora may have no topical relationship with each other at all. 116 Chapter 5. Round-Trip Training Scenario for Minimal Parallel-Resource SMT further translate backward (applying inbound-trip translation task) to the original language. By evaluating this round-trip training results, we will get a sense about the quality of the two translation models, and be able to improve them accordingly. This process can be iterated for many rounds until both translation models converge. Algorithm 8 shows the round-trip training procedure:

Algorithm 8 Round-trip training

Input: Monolingual datasets in the source and target languages (C_X and C_Y), initial translation models for both the outbound and the inbound trip (TM_{X-Y} and TM_{Y-X}), language models in both the source and the target language (LM_X and LM_Y), the trade-off parameter between 0 and 1 (α), the number of best translations (K), the maximum number of iterations (T).
Output: Pseudo bilingual sentence pairs for the source and target languages. // Not optimized yet.
1: repeat:
2:   t = t + 1.
3:   Sample sentences S_X and S_Y from C_X and C_Y respectively.
4:   Set S = S_X. // Updating the model for the round-trip communication game starting from language X.
5:   Generate K sentences (S_sample,1, ..., S_sample,K). // Generating top translations according to the translation model TM_{X-Y}.
6:   for k = 1, ..., K do
7:     R_{1,k} = LM_Y(S_sample,k). // Set the target language model score for the k-th sampled sentence.
8:     R_{2,k} = TM_{Y-X}(S | S_sample,k). // Set the back-translation model score for the k-th sampled sentence.
9:     R_k = α R_{1,k} + (1 − α) R_{2,k}. // Set the total score of the k-th sampled sentence using the hyper-parameter.
10:  end for
11:  Set S = S_Y. // Updating the model for the round-trip communication game starting from language Y.
12:  Go through line 5 to line 10 symmetrically.
13: until convergence.
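The inner scoring loop of Algorithm 8 (lines 5-10) can be sketched as follows. The callables lm_y and back_tm are placeholders standing in for LM_Y and TM_{Y-X}; they, the function name, and the default value of α are illustrative assumptions rather than thesis code.

```python
# Sketch of the scoring loop in Algorithm 8 (lines 5-10), assuming lm_y(s)
# returns a log LM score and back_tm(s_orig, s_cand) returns a (log)
# back-translation score; both callables are placeholders, not thesis code.
def score_candidates(source, candidates, lm_y, back_tm, alpha=0.5):
    scored = []
    for cand in candidates:                     # the K sampled translations
        r1 = lm_y(cand)                         # target-side LM score (R_1,k)
        r2 = back_tm(source, cand)              # back-translation score (R_2,k)
        r = alpha * r1 + (1.0 - alpha) * r2     # total score (Equation 5.1)
        scored.append((r, cand))
    return max(scored)                          # pair the source with the best candidate

# e.g. with alpha = 0.7, r1 = -12.3 and r2 = -8.1 the total score is
# 0.7 * (-12.3) + 0.3 * (-8.1) = -11.04
```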

Suppose corpus C_X contains N_X sentences and C_Y contains N_Y sentences. Denote by TM_{X-Y} and TM_{Y-X} two statistical translation models as described in Section 3.2.1. Assume that we already have two well-trained n-gram language models for languages X and Y, which are easy to obtain since they only require monolingual data. Each of them takes a sentence as input and outputs a real value indicating how confident the model is that the sentence is well-formed in its own language. The language models can be trained either on other resources or simply on the monolingual data C_X and C_Y.

To start the round-trip communication game beginning with a sentence S in C_X, denote by S_sample a sampled translation output. This step has an immediate score R_1 = LM_Y(S_sample), indicating how well-formed the output sentence is in language Y. Given that sampled translation output S_sample, we use the probability of recovering S from S_sample as the score of the (backward) translation model.

Mathematically, R_2 = TM_{Y-X}(S | S_sample). We simply combine the LM score and the back TM score into a total score:

R_total = α R_1 + (1 − α) R_2    (5.1)

where α is an input hyper-parameter. Since the reward of the round-trip training game can be considered a function of S, S_sample, and the translation models TM_{X-Y} and TM_{Y-X} in both directions, we can optimize the round-trip training scenario by changing the translation path, scoring the new target sentences with the language model, and re-ranking them according to a threshold in order to select the highest-quality sentences. Algorithm 9 shows the optimization procedure of the round-trip training scenario:

Algorithm 9 Round-trip training optimization

Input: Generated pseudo bilingual sentence pairs (the output of Algorithm 8), the back-translation model (TM_{Y-X}), the language model in the source language only (LM_X), the number of best translations (K), the maximum number of iterations (T).
Output: Optimized generated pseudo bilingual sentence pairs for the source and target languages. // Ready to be added to the initial small parallel corpus for retraining the translation system.
1: Change the translation path from source-to-target (X-Y) to target-to-source (Y-X). // The new source language is now the former target one.
2: repeat:
3:   t = t + 1.
4:   Sample sentences S_X and S_Y from the new target side and the source side of the pseudo bilingual data respectively.
5:   Set S = S_X. // Optimizing the round-trip training model starting from language Y.
6:   Generate K sentences (S_sample,1, ..., S_sample,K). // Generating top translations according to the back-translation model TM_{Y-X}.
7:   for k = 1, ..., K do
8:     R_k = LM_X(S_sample,k). // Apply the n-gram LM to the sampled sentence of the new target side (the former source side), and set the new target LM score for the k-th sampled sentence.
9:     Re-rank the K sentences according to the threshold.
10:    Select high-quality sentences according to their LM scores.
11:  end for
12:  Set S = S_Y. // Optimizing the round-trip training model starting from language X.
13:  Go through line 6 to line 11 symmetrically.
14: until convergence.
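A minimal sketch of the selection step of Algorithm 9 is given below, assuming lm_x is a callable returning a source-side LM score and generate_kbest produces back-translation candidates for a target sentence; the threshold value and all names are illustrative, not taken from the thesis.

```python
# Sketch of the selection step in Algorithm 9: after reversing the translation
# path, each k-best list of back-translated (former source) sentences is scored
# with the source-side LM only, and candidates above a threshold are kept.
# lm_x, generate_kbest, and the threshold value are placeholders.
def select_well_formed(kbest, lm_x, threshold=-50.0):
    scored = sorted(((lm_x(s), s) for s in kbest), reverse=True)
    kept = [s for score, s in scored if score >= threshold]
    return kept[0] if kept else None   # best well-formed sentence, or nothing

def optimize_pairs(pseudo_pairs, generate_kbest, lm_x):
    """Keep only pairs whose regenerated source side passes the LM filter."""
    optimized = []
    for src, tgt in pseudo_pairs:                  # generated pseudo bilingual data
        best_src = select_well_formed(generate_kbest(tgt), lm_x)
        if best_src is not None:
            optimized.append((best_src, tgt))
        # pairs with no acceptable back-translation are simply dropped
    return optimized
```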

Since random sampling sometimes produces unreasonable results in SMT, we use beam search (Sutskever et al., 2014) to obtain more meaningful results (more reasonable sample translation outputs) by generating k-best high-quality sample translation outputs. The optimized round-trip training game can be repeated for many rounds. In each round, one sentence is sampled from C_X and one from C_Y, and we update the two models according to the games beginning with these two sentences respectively.

5.8 Experimental Framework

We conduct two sets of experiments to evaluate the quality of our proposed optimized round-trip training scenario for phrase-based SMT systems.

We compare our optimized round-trip training method (optimized-SMT) with two translation systems: the first is the standard phrase-based SMT system (baseline-SMT), and the second is a phrase-based SMT system that generates pseudo bilingual sentence pairs from monolingual corpora to assist the training step (pseudo-SMT).

We evaluate the proposed approach on two language pairs: Spanish-English (and vice versa) as a large-scale language pair, and Persian-Spanish (and vice versa) as a minimal-resource one.

Our experiments show that the optimized round-trip training technique works very well on Spanish ↔ English translation tasks as well as on Persian ↔ Spanish ones. By learning from monolingual data, it achieves accuracy comparable to phrase-based SMT trained on the full bilingual data for all the translation tasks.

5.8.1 Data Preparation

For the large-scale set of experiments, we use the Spanish-English bilingual corpora from WMT13, which contain approximately (10M) sentence pairs extracted from four different datasets: the Europarl corpus, the News Commentary corpus, the UN corpus, and the Common Crawl corpus. We use news-test2011 and news-test2012 as the validation and testing data sets respectively.

On the other hand, for the minimal-resource set of experiments we use the small Persian-Spanish bilingual corpus from the Tanzil collection, which contains about (65K) parallel sentence pairs8. We also use Open-Subtitles2012 and Open-Subtitles2013 as the tuning and testing data sets respectively.

8This is a collection of Quran translations compiled by the Tanzil project.

For both the large-scale and the minimal-resource sets, we use the Open-Subtitles2016 corpus as large monolingual data.

5.8.2 Baseline Phrase-Based SMT Architecture

Very generally, the central idea of the SMT paradigm is that the best translation can be found from the probabilities of the source and target sentences. Frequently used log-linear SMT paradigms are the phrase-based, the hierarchical phrase-based, and the n-gram-based models. In our experiments we use the phrase-based SMT system within the maximum entropy framework (Berger et al., 1996):

\hat{t}_1^I = \arg\max_{t_1^I} P(s|t) \qquad (5.2)

As we mentioned in Chapter 2, the phrase-based SMT model is an example of the noisy-channel approach, where we present the translation hypothesis t as the target sentence (given s as a source sentence) maximizing a log-linear combination of feature functions:

\hat{t}_1^I = \arg\max_{t_1^I} \left\{ \sum_{m=1}^{M} \lambda_m h_m(s_1^J | t_1^I) \right\} \qquad (5.3)

Equation (5.3) is called the log-linear model, where \lambda_m corresponds to the weighting coefficients of the log-linear combination, and the feature functions h_m(s, t) to a logarithmic scaling of the probabilities of each model.
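To make the log-linear scoring of Equation (5.3) concrete, the following small sketch computes the weighted sum of feature values for one hypothesis; the feature names, weights, and values are invented for illustration only.

```python
# Illustrative computation of the log-linear score in Equation (5.3):
# the hypothesis score is the weighted sum of M feature functions h_m,
# each already on a logarithmic scale. Feature names and weights are made up.
import math

def loglinear_score(features, weights):
    return sum(weights[name] * value for name, value in features.items())

weights  = {"tm": 0.30, "lm": 0.50, "word_penalty": -0.20}
features = {"tm": math.log(0.02), "lm": math.log(0.001), "word_penalty": 7.0}

score = loglinear_score(features, weights)
# decoding picks the hypothesis t that maximizes this score over all candidates
```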

The translation process involves segmenting the source sentences into source phrases, translating each source phrase into a target phrase, and reordering these target phrases to yield the target sentence.

The decoder searches for the most likely translation \hat{t} according to the source sentence, the phrase translation model, and the target language model. The search can be performed by beam search. The beam-search algorithm starts from an initial hypothesis; the next hypothesis is expanded from it and does not need to correspond to the next phrase segmentation of the source sentence. Words covered along the path of hypothesis expansion are marked, and the system produces a translation alternative when a path covers all source words. The scores of the alternatives are calculated and the sentence with the highest score is selected. Techniques such as hypothesis recombination and heuristic pruning can be applied to cope with the exponential size of the search space.
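The following heavily simplified, monotone beam-search sketch illustrates the search procedure described above (hypothesis expansion over the source, scoring, and histogram pruning). It ignores reordering, future-cost estimation, and hypothesis recombination, and is an illustration of the idea rather than the Moses decoder.

```python
# Heavily simplified, monotone beam-search decoder sketch (illustration only).
# phrase_table maps source-phrase tuples to [(target_phrase_tuple, log_prob), ...];
# lm_score is a callable returning a log LM score for a list of target words.
def beam_decode(source_words, phrase_table, lm_score, beam_size=5, max_phrase_len=3):
    beams = [(0.0, 0, [])]          # (score, next uncovered source position, target words)
    completed = []
    while beams:
        new_beams = []
        for score, pos, target in beams:
            if pos == len(source_words):              # all source words covered
                completed.append((score + lm_score(target), target))
                continue
            for length in range(1, max_phrase_len + 1):
                if pos + length > len(source_words):
                    break
                src_phrase = tuple(source_words[pos:pos + length])
                for tgt_phrase, tm_logprob in phrase_table.get(src_phrase, []):
                    new_beams.append((score + tm_logprob, pos + length,
                                      target + list(tgt_phrase)))
        # histogram pruning: keep only the beam_size best partial hypotheses
        beams = sorted(new_beams, reverse=True)[:beam_size]
    best_score, best_translation = max(completed)      # assumes full coverage is reachable
    return best_translation
```

A real decoder would additionally handle reordering within the distortion limit and recombine hypotheses that share the same coverage and language-model state.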

5.8.3 Implementation

Generally, for the pseudo-SMT we use the trained phrase-based SMT model to generate pseudo bilingual sentence pairs from monolingual data, then remove the sentences with more than (80) words (to keep sentence lengths within a common bound), optimize and merge the generated data with the original parallel training dataset, and finally retrain the model for testing.
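A small sketch of this filtering-and-merging step is shown below, assuming sentence pairs are held as (source, target) string tuples; the function and constant names are illustrative.

```python
# Sketch of the data-preparation step described above: drop generated pairs
# whose sentences exceed 80 words, then merge the remainder with the original
# parallel corpus before retraining. Names and data layout are illustrative.
MAX_LEN = 80

def filter_and_merge(original_pairs, pseudo_pairs, max_len=MAX_LEN):
    kept = [
        (src, tgt) for src, tgt in pseudo_pairs
        if len(src.split()) <= max_len and len(tgt.split()) <= max_len
    ]
    return original_pairs + kept   # enlarged corpus used to retrain the model
```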

Our proposed method needs a language model for each language. We train a 4-gram language model with KenLM for each language using its corresponding monolingual corpus. The language model is then fixed, and the score it assigns to a received sentence is used in scoring the translation model.
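A possible way to build these 4-gram models with the KenLM toolkit is sketched below, calling its lmplz and build_binary command-line tools from Python; the file names are placeholders and the exact flags should be checked against the KenLM documentation, as this is an assumption rather than the thesis's exact command.

```python
# Hedged sketch: estimate a 4-gram LM per language with KenLM's command-line
# tools and binarize it for fast querying. Assumes lmplz and build_binary are
# installed and on the PATH; file names are placeholders.
import subprocess

for lang in ("es", "fa"):
    subprocess.run(f"lmplz -o 4 < mono.{lang}.txt > lm.{lang}.arpa",
                   shell=True, check=True)       # -o 4: 4-gram order
    subprocess.run(f"build_binary lm.{lang}.arpa lm.{lang}.binary",
                   shell=True, check=True)       # binary format loads faster
```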

While playing the round-trip training game, we initialize the channels using warm-start translation models (e.g., trained from the initial small bilingual corpora), and examine whether the optimized round-trip training system (optimized-SMT) can effectively improve the baseline-SMT accuracy.

In our experiments, in order to transition smoothly from the initial model trained on a small amount of bilingual data to a model trained purely on monolingual data, we adopt the following strategy:

• At the beginning of the round-trip training process, we use half of the sentences from the monolingual data and half from the bilingual data (sampled from the dataset used to train the initial model). The objective is to maximize the sum of the scores based on the monolingual data.

• As the training process goes on, we progressively increase the percentage of monolingual sentences until no bilingual data is used at all (a schematic sketch of this schedule is given after this list). Specifically, we test one setting in each experiment, i.e. for the large-scale language pair we use all the (10M) bilingual sentence pairs, and for the minimal-resource language pair we use all the (65K) parallel sentences; that is, the warm-start model is learnt from the full bilingual data.

In the last step of the experiments, we retrain our baseline-SMT systems by enlarging the initial small bilingual corpus, adding the optimized generated pseudo bilingual sentence pairs to the initial parallel corpus. The new translation system (enlarged-SMT) thus contains both the initial and the optimized pseudo bilingual corpora.
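As referenced in the list above, the gradual shift from bilingual to monolingual training data can be sketched as follows; the linear schedule, batch size, and function name are assumptions for illustration, not the exact settings used in the experiments.

```python
# Schematic sketch of the bilingual-to-monolingual mixing schedule: training
# starts with an even split and the monolingual share grows each round until
# no bilingual sentences are sampled at all. The linear schedule is an assumption.
import random

def sample_training_batch(bilingual, monolingual, round_idx, total_rounds, batch_size=64):
    # assumes both corpora contain more than batch_size sentences
    mono_ratio = min(1.0, 0.5 + 0.5 * round_idx / max(1, total_rounds - 1))
    n_mono = int(batch_size * mono_ratio)
    batch = random.sample(monolingual, n_mono)
    batch += random.sample(bilingual, batch_size - n_mono)
    return batch
```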

For each translation task we train our optimized round-trip training scenario. The Moses package (Koehn et al., 2007) is employed for training our phrase-based SMT systems.

Alongside the Moses decoder, the fast-align approach (Dyer et al., 2013) is applied for word alignment. We employ 4-gram language models for all SMT systems, built with the KenLM toolkit (Heafield, 2011).

In addition, we set the beam-search size to (500) in the translation process and the distortion limit to (6). All the hyper-parameters in the experiments are set by cross-validation. We restrict the maximum number of target phrases loaded for each source phrase to (6), and otherwise use the default features of the Moses translation engine. For evaluating system performance we use BLEU as the evaluation metric (Papineni et al., 2002).
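As a hedged illustration of the BLEU evaluation step (Papineni et al., 2002), the sketch below computes corpus-level BLEU with NLTK's corpus_bleu as a stand-in for the scorer actually used in the experiments; the file names are placeholders.

```python
# Sketch of corpus-level BLEU scoring with NLTK (default 4-gram weights).
# One hypothesis and one reference per line; file names are placeholders.
from nltk.translate.bleu_score import corpus_bleu

with open("test.output.txt") as h, open("test.reference.txt") as r:
    hypotheses = [line.split() for line in h]
    references = [[line.split()] for line in r]   # one reference per sentence

bleu = corpus_bleu(references, hypotheses)
print(f"BLEU = {100 * bleu:.2f}")
```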

5.9 Results Analysis and Evaluation

Four baseline systems for Spanish-English (and back translation) and Persian-Spanish (and back translation) are trained separately, while our optimized-SMT conducts joint training. We summarize the overall performances in Table 5.1.

TABLE 5.1: Translation results using BLEU for the Spanish-English (and back translation) and Persian-Spanish (and back translation) tasks.

Translation Systems   Es-En   En-Es   Pe-Es   Es-Pe
baseline-SMT          34.92   36.27   27.45   26.80
pseudo-SMT            34.28   36.54   28.89   27.85
optimized-SMT         41.95   42.22   38.94   38.84

From Table 5.1 we can see that our optimized-SMT system outperforms the others in all the translation tasks.

For the task of translation from Spanish to English, the optimized-SMT system outperforms the baseline-SMT one by about (7.03) BLEU points (the relative increase is approximately 12%), and outperforms the pseudo-SMT system by about (7.67) BLEU points (the relative increase is approximately 12.2%). The improvement is significant. For the back translation from English to Spanish, our optimized-SMT system also outperforms the baseline-SMT and pseudo-SMT ones by about (5.95) and (5.68) BLEU points respectively (the relative increases are approximately 11.6% and 11.5%).

On the other hand, for the translation task from Persian to Spanish, surprisingly, with only (65K) bilingual sentence pairs the optimized-SMT achieves very competitive translation accuracy. This system outperforms the baseline-SMT one by about (11.49) BLEU points (the relative increase is approximately 14.1%), and also outperforms the pseudo-SMT system by about (11.05) BLEU points (the relative increase is approximately 13.4%). For the back translation direction, from Spanish to Persian, the improvement of the optimized-SMT system compared with the baseline-SMT one is even more significant: the optimized-SMT outperforms the baseline-SMT and pseudo-SMT by about (12.04) and (10.99) BLEU points respectively (the relative increases are approximately 14.4% and 13.9%).

For both the large-scale language pair and the minimal-resource one, the optimized-SMT systems outperform the baseline-SMT ones for all translation tasks. These results demonstrate the effectiveness of our optimized round-trip training scenario. For each sentence in the test set, we translated it forth and back using the models and then checked how close the back-translated sentence is to the original sentence using the BLEU score. We also used beam search to generate all the translation results. Furthermore, we have the following observations:

• The pseudo-SMT systems outperform the baseline-SMT ones in all the experiments except in the Spanish-English translation direction, because of the noise in the generated bilingual sentence pairs mentioned in Section 5.5. The improvements of the pseudo-SMT systems over the baseline-SMT ones are significant in the rest of the experiments. In fact, our hypothesis is that optimizing the pseudo bilingual sentence pairs generated from the monolingual data, by filtering and selecting the high-quality sentence pairs, leads to better performance than the plain pseudo-SMT systems. So, through the optimization process, we overcome the performance limitation of the pseudo-SMT system with the optimized-SMT one.

• When the size of the parallel bilingual data is small, the optimized-SMT achieves a larger improvement. This shows that the round-trip training mechanism makes very good use of monolingual data. We therefore expect the optimized-SMT to be even more helpful for language pairs with smaller amounts of labelled parallel data, and the optimized-SMT opens a new learning view towards translating from scratch.

Table 5.2 shows the performance of two translation systems: the baseline-SMT and the enlarged-SMT, where the latter is trained on the training data of the baseline-SMT together with that of the optimized-SMT.

TABLE 5.2: Comparing the baseline-SMT and the enlarged-SMT systems using BLEU for Spanish-English (and back translation) and Persian-Spanish (and back translation) tasks.

Translation Systems   Es-En   En-Es   Pe-Es   Es-Pe
baseline-SMT          34.92   36.27   27.45   26.80
enlarged-SMT          40.88   41.07   36.33   37.13

It can easily be seen from Table 5.2 that the BLEU scores of our enlarged-SMT systems are much higher than those of the baseline-SMT ones in all cases. In particular, our enlarged-SMT outperforms the baseline-SMT by about (5.96) BLEU points in the Spanish-English translation direction (the relative increase is approximately 11.7%), and by about (4.8) BLEU points in the back translation direction (the relative increase is approximately 11.3%). For the Persian-Spanish translation task, the enlarged-SMT outperforms the baseline-SMT by about (8.88) BLEU points (the relative increase is approximately 13.2%), and by about (10.33) BLEU points in the Spanish-Persian translation direction (the relative increase is approximately 13.8%).

These significant improvements show that our proposed optimized round-trip training scenario is not only a promising technique but also a highly reliable approach for tackling the training-data scarcity limitation, and it helps us to improve translation quality within the SMT paradigm.

According to the experimental results, we plot the BLEU scores of the different translation systems in Figure 5.2.

FIGURE 5.2: Performance comparison of the translation systems.

Figure 5.2 shows that, after applying the retraining process, the enlarged-SMT systems (which involve the baseline-SMT and optimized-SMT training data sets) outperform the baseline-SMT systems in all the translation tasks. The chart also shows that, under all conditions, our optimized-SMT systems outperform all the other SMT systems. So the proposed optimized round-trip training scenario works very well, especially for minimal parallel-resource language pairs.

Figure 5.3 illustrates the learning curves of the translation systems in the forward and backward directions.

FIGURE 5.3: Learning curve of the translation systems performance.

5.10 Discussion

Although we have focused on SMT in our work, the basic idea of the round-trip training scenario is more generally applicable: as long as two tasks are in round-trip form, we can apply this mechanism to learn both tasks simultaneously from unlabelled data using reinforcement learning algorithms. In fact, many Artificial Intelligence (AI) tasks are naturally in round-trip form, for example speech recognition versus text-to-speech, image captioning versus image generation, and question answering versus question generation. It would be interesting to design and test round-trip algorithms for more dual tasks beyond SMT.

On the other hand, our technology is not restricted to two translation tasks only. Our key idea is to form a closed loop so that we can extract feedback signals by comparing the original input data with the final output data. Therefore, if more associated tasks can form a closed loop, we can apply our technology to improve the model of each task from unlabelled data. For example, for an English sentence x, we can first translate it to a Persian sentence y, then translate y to a Spanish sentence z, and finally translate z back to an English sentence x′. The similarity between x and x′ indicates the effectiveness of the three translation models in the loop, and we can once again apply optimization methods to update and improve these models based on the feedback signals collected during the loop.
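A sketch of such a three-model closed loop is given below, using sentence-level BLEU between the original sentence and its reconstruction as the feedback signal; the three translate_* callables and the choice of similarity measure are assumptions for illustration, not part of the thesis experiments.

```python
# Sketch of the longer closed loop described above (English -> Persian ->
# Spanish -> English): the similarity between the original sentence x and its
# reconstruction x' is used as a feedback signal for all models in the loop.
from nltk.translate.bleu_score import sentence_bleu

def loop_feedback(x, translate_en_fa, translate_fa_es, translate_es_en):
    y = translate_en_fa(x)                 # English -> Persian
    z = translate_fa_es(y)                 # Persian -> Spanish
    x_prime = translate_es_en(z)           # Spanish -> English (back to start)
    # sentence-level BLEU between x and x' as the reward for the whole loop
    return sentence_bleu([x.split()], x_prime.split())
```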

5.11 Summary

In this chapter we presented the main contribution of this thesis. First we analysed self-training and co-training, the bootstrapping methods, in order to gain a high-level insight and a better understanding of the retraining concept based on a system's own outputs. We then presented our proposed round-trip training technique, based on the retraining approach, as a novel training method, both theoretically and algorithmically. This proposed scenario is one of the most promising options for overcoming the data-scarcity bottleneck in training SMT systems, especially for those pairs of languages without enough Natural Language Processing (NLP) training resources.

We used two different language pairs: one with abundant training resources and the other with limited training resources; the pairs differ in the amount of training data available. Our hypothesis is that, regardless of the amount of training resources, the proposed technique works well and outperforms the baseline systems under all conditions. All the results, charts, and learning curves show that the optimized round-trip training mechanism is promising and makes better use of the monolingual data. The proposed scenario is competitive with other state-of-the-art approaches to minimal-resource SMT, and in several cases it is in fact preferable.

Chapter 6

Conclusions

In our research, we worked on improving the quality of Statistical Machine Translation (SMT) performance especially for those pairs of languages with minimal parallel-resources. We developed a learning framework based on retraining hypothesis as well as a bridging framework based on bridge language theory. We proposed methods to improve each component, separately. In this chapter the research carried out and the improvements obtained in the context of this thesis are reviewed. Finally, some possible research directions for future work are discussed.

6.1 Overall Review

This thesis addresses high-quality statistical machine translation for minimal parallel-resource language pairs, under the title Reliable Training Scenarios for Dealing with Minimal Parallel-Resource Language Pairs in Statistical Machine Translation.
The main challenge we targeted in our approaches is parallel data scarcity, and this challenge is faced from different angles. For the learning framework, we first generated pseudo bilingual sentence pairs and then optimized these generated pairs to identify their high-quality parts. For the bridging framework, we first selected the relevant portions of the bridge translation model that do not interfere with the more trusted direct translation model, and then optimized our combination model in order to maximize the information gain. The following is a summary of the main contributions of this thesis:

• Phrase-based Translation Models Comparison: We investigated the performance of two phrase-based translation models for SMT systems, the Classical (standard) phrase-based translation model and the Hierarchical phrase-based translation model, on three language pairs separately. For the Spanish-English task and its back translation, we showed that, under all conditions, the Hierarchical translation model outperforms the Classical one in the forward direction, whereas in the backward direction the Classical model is still preferable. For the English and Persian language pair, the Classical model performs better than the Hierarchical model in the Persian-English translation task, while in English-Persian translation the Hierarchical model works well. Although for the Persian-Spanish translation direction the Classical phrase-based system performs better than the Hierarchical phrase-based system, for the back translation task the Hierarchical model outperforms the Classical one. In other words, we have shown that for the Spanish-Persian translation task Cdec, an open-source Hierarchical decoder based on Synchronous Context-Free Grammars (SCFGs), is able to achieve higher-quality translation outputs than Moses, which is based on the Classical phrase-based model, as evidenced by its ability to capture long-distance phenomena and to model phrasal gaps with the non-terminal symbols of SCFGs, cases which are common in the Persian language. The comparative performance of these language pairs under the Classical and Hierarchical translation models can serve as a reference point for further research on phrase-based SMT. These comparisons between the mentioned pairs of languages show that, independently of changes to the statistical language models applied, the performance of the translation models is directly influenced by the structure and word order of the source and target languages.

• Direct-Bridge Models Combination Scenario: We developed a smart technique to combine bridge and direct models. This scenario is based on the bridge language technique and is an extended version of the interpolated model, making effective use of domain adaptation methods. The proposed scenario effectively combines a direct translation system and a bridge translation system built from a given parallel training dataset to achieve high-quality translation output. According to this scenario, we maximized the information gain by choosing the relevant portions of the bridge-based model that do not interfere with the direct-based model, which is indeed trusted more. We showed that the proposed combination technique can lead to a large reduction of the bridge-based model without affecting the performance, if not improving it. We have also shown the effects of recent improvements of the bridge language technique on minimal parallel-resource SMT performance, i.e. bilingual text interpolation with possible repetitions of the original bilingual text for balance, and phrase-table combination where each bilingual text is used to build a separate phrase-table, as two general strategies for using the bilingual text of one natural language to enhance SMT performance for related languages.

• Round-Trip Training Scenario: We developed a learning framework by proposing this scenario. The round-trip training scenario is based on the retraining approach and improves the performance of minimal parallel-resource SMT systems. According to this scenario, we automatically learn from unlabelled data through a round-trip communication game. We used two independent translation systems to represent the models for the forward and backward translation tasks, and asked them to learn from each other through a round-trip training process. These two translation tasks form a closed loop and generate informative feedback signals to train the translation models, even without the involvement of a human labeller. The general idea of this scenario is to take advantage of the readily available generated text in building a high-quality SMT model in an iterative manner. We also analysed bootstrapping methods, showing how self-training and co-training, as two weakly-supervised learning algorithms, can improve the performance of translation systems: they start from a small set of labelled examples, consider one or two weak translation models, and improve through the incorporation of unlabelled data into the training dataset.
In the round-trip training scenario we exploited both bootstrapping methods: the outbound-trip (forward) model benefits from self-training because it produces translations for monolingual source sentences which are then used to retrain itself, while the inbound-trip (backward) model generates signals which are used to retrain the outbound-trip model by helping to select high-quality translations from the k-best translation candidates lists. Our proposed round-trip training scenario is thus a new category in its own right, much like self-training and co-training, which are categories by themselves.

Some of the contributions of this thesis have been submitted and/or published in international conferences and journals. These are listed below:

Conference Papers:

• Benyamin Ahmadnia, Javier Serrano, and Gholamreza Haffari. (2017). "Persian–Spanish Low-resource Statistical Machine Translation System through English as Pivot Language". In Proceedings of the 17th International Conference on Recent Advances in Natural Language Processing (RANLP 2017).

• Benyamin Ahmadnia and Javier Serrano. (2016). "Direct translation vs. Pivot language translation for Persian-Spanish Low-resourced Statistical Machine Translation System". In Proceedings of the 18th International Conference on Artificial Intelligence and Computer Sciences (ICAICS 2016).

• Benyamin Ahmadnia and Javier Serrano. (2015). "Hierarchical Phrase-based Translation Model vs. Classical Phrase-based Translation Model for Spanish-English Statistical Machine Translation System". In Proceedings of the 31st Conference on the Spanish Society for Natural Language Processing (SEPLN 2015).

Journal Articles:

• Benyamin Ahmadnia and Javier Serrano. (2017). "Employing Pivot Language Technique through Statistical and Neural Machine Translation Frameworks: The Case of Under-resourced Persian-Spanish Language Pair". International Journal on Natural Language Computing (IJNLC). Vol.(6), No.(5).

• Benyamin Ahmadnia, Gholamreza Haffari, and Javier Serrano. (2017). "Round-Trip Training Scenario to Enhance the Performance of Minimal-Resource Statistical Machine Translation Systems". (Submitted)

6.2 Future Work Directions

There are a number of possible further improvements to the techniques presented in this thesis for enhancing the performance of minimal parallel-resource SMT systems.

In Chapter 4, we proposed the direct-bridge combination scenario, combining direct and bridge models to enhance translation quality. We showed that this scenario can lead to a large reduction of the bridge model without affecting the performance, if not improving it. In the future, we plan to investigate categorizing the bridge model based on morphological patterns extracted from the direct model instead of just the exact surface form. Regarding the direct-bridge combination scenario for SMT systems with minimal parallel resources, one direction for pruning the bridge phrase-table is to train a binary classifier on any available parallel data between the source and target languages, pruning bridge phrase pairs in a way that is directly related to translation quality and can take advantage of several feature functions that account for different aspects of phrase-pair quality.

As another line of work, we are interested in exploring additional features to assess the quality of the produced phrase pairs between the source and target languages, which has great potential to improve our work on the bridge language technique.

In Chapter 5, we presented our proposed round-trip training scenario to improve the performance of minimal parallel-resource SMT systems. We showed that this scenario enables an SMT system to learn automatically from unlabelled texts through a round-trip training game, and our results prove that this technique works very well on low-resource as well as high-resource translation tasks. In the future, we plan to learn translations directly from the monolingual texts of the two languages (from scratch). Regarding the round-trip training scenario for SMT systems with minimal parallel data, the basic idea can also be applied to Neural Machine Translation (NMT), and we intend to pursue this direction. We will extend our approach to jointly train multiple translation models for more than two languages using monolingual data.

In recent years, NMT has made rapid progress as a promising approach and has improved the state of the art in many MT settings, but it requires large amounts of training data to generate reasonable output. As a suitable alternative to phrase-based SMT, NMT can also be used for minimal parallel-resource languages, by introducing more local dependencies and using word alignments to learn sentence reordering during translation.

Bibliography

Abdul-Rauf, S. and H. Schwenk (2009). “On the use of comparable corpora to im- prove SMT performance”. In: Proceedings of the 12th Conference of the European Chapters of the ACL (EACL), pp. 16–23.

Abney, S. (2002). “Bootstrapping”. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL).

Abney, S. and S. Bird (2010). “The human language project: building a universal corpus of the world’s languages”. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 88–97.

Ahmadnia, B. and J. Serrano (2015). “Hierarchical phrase-based translation model vs. classical phrase-based translation model for Spanish-English statistical ma- chine translation system”. In: Proceedings of the 31st Conference on the Spanish So- ciety for Natural Language Processing (SEPLN).

Ahmadnia, B., J. Serrano, and G. Haffari (2017). “Persian-Spanish low-resource sta- tistical machine translation through English as pivot language”. In: Proceedings of Recent Advances in Natural Language Processing (RANLP), pp. 24–30.

Aho, A. V. and J. D. Ullman (1969). “Syntax directed translations and the pushdown assembler”. In: Journal of Computer and System Sciences 3.1.

Ahsan, A., P. Kolachina, D. Misra Sharma, and R. Sangal (2010). “Coupling statistical machine translation with rule-based transfer and generation”. In: Proceedings of the Annual Meeting and Symposium of the Antenna Measurement Techniques Associa- tion (AMTA).

Al-Onaizan, Y., J. Curin, M. Jahr, K. Knight, J. D. Lafferty, D. Melamed, F. J. Och, D. Purdy, N. A. Smith, and D. Yarowsky (1999). Statistical machine translation.

AleAhmad, A., H. Amiri, M. Rahgozar, and F. Oroumchian (2009). “Hamshahri: A standard Persian text collection”. In: Journal of Knowledge-Based Systems 22.5, pp. 382–387.

Aswani, N. and R. Gaizauskas (2005). “Aligning words in English-Hindi parallel cor- pora”. In: Proceedings of the Association for Computational Linguistics (ACL), Work- shop on Building and Using Parallel Texts: Data-driven Machine Translation and Be- yond, pp. 115–118.

Babych, B., A. Hartley, S. Sharoff, and O. Mudraya (2007). “Assisting translators in indirect lexical transfer”. In: Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (ACL).

Bahdanau, D., K. Cho, and Y. Bengio (2014). “Neural machine translation by jointly learning to align and translate”. In: Journal of arXiv preprint arXiv:1409.0473.

Bahl, L., J. Baker, P. Cohen, F. Jelinek, B. Lewis, and R. Mercer (1978). “Recognition of continuously read natural corpus”. In: Proceedings of Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP, pp. 422–424.

Berger, A., S. A. Della Pietra, and V. J. Della Pietra (1996). “A maximum entropy approach to natural language processing”. In: Journal of Computational Linguistics 22.1, pp. 39–71.

Biemann, C., G. Heyer, U. Quasthoff, and M. Richter (2007). “The Leipzig corpora collection monolingual corpora of standard size”. In: Journal of Corpus Linguistic.

Birch, A., M. Osborne, and P. Koehn (2008). “Predicting success in machine transla- tion”. In: Proceedings of Empirical Methods in Natural Language Processing (EMNLP).

Bird, S. and G. Simons (2003). “Seven Dimensions of Portability for Language Doc- umentation and Description”. In: Journal of LANGUAGE 79, pp. 557–582.

Blum, A. and T. Mitchell (1998). “Combining labeled and unlabeled data with co- training”. In: Proceedings of the 11th Annual Conference on Computational Learning Theory.

Borin, L. (1999). “Pivot alignment”. In: Proceedings of NODALIDA Workshop on Pro- cessing Historical Language, pp. 41–48.

Brown, P. F., J. Cocke, S. A. Della Pietra, V. J. Della Pietra, F. Jelinek, J. D. Lafferty, R. L. Mercer, and P. S. Roossin (1990). “A statistical approach to machine translation”. In: Journal of Computational Linguistics 16.2, pp. 79–85.

Brown, P. F., S. A. Della Pietra, V. J. Della Pietra, and R. L. Mercer (1993). “The math- ematics of statistical machine translation: Parameter estimation”. In: Journal of Computational Linguistics 19.2, pp. 263–311.

Callison-Burch, C., D. Talbot, and M. Osborne (2004). “Statistical machine translation with word and sentence-aligned parallel corpora”. In: Proceedings of the Associa- tion for Computational Linguistics (ACL), pp. 175–182.

Carbonell, J., S. Klein, D. Miller, and M. Steinbaum (2006). “Context-based machine translation”. In: Proceedings of the 7th Conference of the Association for Machine Translation in the Americas, pp. 19–28.

Cardie, C. and D. Pierce (1998). “Error-driven pruning of grammars for base noun phrase identification”. In: Proceedings of the joint Conference of the International Committee on Computational Linguistics and the Association for Computational Linguistics (COLING-ACL), pp. 218–224.

Chen, B., M. Zhang, A. Aw, and H. Li (2008). “Exploiting n-best hypotheses for SMT self-enhancement”. In: Proceedings of Human Language Technologies: The 9th An- nual Conference of the North American Chapters of the Association for Computational Linguistics (HLT-NAACL), pp. 157–160.

Cherry, C. and D. Lin (2003). “A probability model to improve word alignment”. In: Proceedings of the 41st Annual Meeting of the Association for Computational Linguis- tics (ACL).

Chiang, D. (2005). “A hierarchical phrase-based model for statistical machine trans- lation”. In: Proceedings of Association for Computational Linguistics (ACL), pp. 263– 270.

Chiang, D. (2007). “Hierarchical phrase-based translation”. In: Journal of Computa- tional Linguistics 33.2, pp. 201–228.

Cocke, J. (1969). Programming languages and their compilers: Preliminary notes.

Collins, M. and Y. Singer (1999). “Unsupervised models for named entity classifica- tion”. In: Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing (EMNLP), and Very Large Corpora, pp. 100–110.

Dabre, R., F. Comieres, S. Kurohashi, and P. Bhattacharyya (2015). “Leveraging small multilingual corpora for SMT using many pivot languages”. In: Proceedings of the Human Language Technology Conference of the North American Chapters of the Association of Computational Linguistics (HLT-NAACL), pp. 1192–1202.

Daume III, H. and J. Jagarlamudi (2011). “Domain adaptation for machine transla- tion by mining unseen words”. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 407–412.

Daume III, H. and D. Marcu (2006). “Domain adaptation for statistical classifiers”. In: Journal of Artificial Intelligence Research (JAIR) 26, pp. 101–126.

De Gispert, A. and J. B. Mariño (2006). “Catalan-English statistical machine transla- tion without parallel corpus: bridging through Spanish”. In: Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC), pp. 65–68.

Dempster, A., N. Laird, and D. Rubin (1977). “Maximum likelihood from incomplete data via the EM algorithm”. In: Journal of the Royal Statistical Society 39, pp. 1–38.

Deneefe, S. and K. Knight (2009). “Synchronous tree adjoining machine translation”. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 727–736.

Diab, M. and P. Resnik (2002). “An unsupervised method for word sense tagging using parallel corpora”. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 255–262.

Dirix, P., I. Schuurman, and V. Vandeghinste (2005). “METIS-II: Example-based ma- chine translation using monolingual corpora - System description”. In: Proceed- ings of the 2nd Workshop on Example-Based Machine Translation.

Dou, Q. and K. Knight (2013). “Dependency-based decipherment for resource-limited machine translation”. In: Proceedings of Empirical Methods in Natural Language Pro- cessing (EMNLP), pp. 1668–1676.

Dou, Q., A. Vaswani, and K. Knight (2014). “Beyond parallel data: Joint word align- ment and decipherment improves machine translation”. In: Proceedings of Empir- ical Methods in Natural Language Processing (EMNLP), pp. 557–565.

Dyer, C., A. Lopez, J. Ganitkevitch, J. Weese, H. Setiawan, F. Ture, V. Eidelman, P. Blunsom, and P. Resnik (2010). “cdec: A decoder, alignment, and learning framework for finite-state and context-free translation models”. In: Proceedings of the 48th Annual Meeting of the Association for Computer Linguistics (ACL), System Demonstrations, pp. 7–12.

Dyer, C., V. Chahuneau, and N. A. Smith (2013). “A simple, fast, and effective reparameterization of IBM model 2”. In: Proceedings of the North American Chapters on Association for Computational Linguistics (NAACL), pp. 644–648.

Eck, M., S. Vogel, and A. Waibel (2004). “Language model adaptation for statisti- cal machine translation based on information retrieval”. In: Proceedings of the 4th International Conference on language resources and evaluation (LREC), pp. 327–330.

Eck, M. and S. Vogel and A. Waibel (2005). “Low cost portability for statistical ma- chine translation based on n-gram coverage”. In: Proceedings of MTSummit X.

Eisner, J. (2003). “Learning non-isomorphic tree mappings for machine translation”. In: Proceedings of the 41st Annual Meeting on Association for Computational Linguis- tics (ACL), pp. 205–208.

Federico, M., N. Bertoldi, and M. Cettolo (2008). “IRSTLM: an open source toolkit for handling large scale language models”. In: Proceedings of INTERSPEECH, pp. 1618–1621.

Foster, G., R. Kuhn, and H. Johnson (2006). “Phrasetable smoothing for statistical machine translation”. In: Proceedings of Empirical Methods in Natural Language Pro- cessing (EMNLP).

Fraser, A. and D. Marcu (2006). “Semi-supervised training for statistical word align- ment”. In: Proceedings of the 21st International Conference on Computational Linguis- tics and 44th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 769–776.

Galley, M., J. Graehl, K. Knight, D. Marcu, S. DeNeefe, W. Wang, and I. Thayer (2006a). “Scalable inference and training of context-rich syntactic translation models”. In: Proceedings of the 44th Annual Meeting on Association for Computational Linguistics (ACL).

Galley, M., M. Hopkins, and K. Knight (2006b). “What’s in a translation rule?” In: Proceedings of Human Language Technologies: The 7th Annual Conference of the North American Chapters of the Association for Computational Linguistics (HLT-NAACL), pp. 273–280.

Gao, Q. and S. Vogel (2008). “Parallel implementations of word alignment tool”. In: Proceedings of the 46th Annual Meeting on the Association for Computational Linguis- tics (ACL). Software Engineering, Testing, and Quality Assurance for Natural Language Processing Workshop, pp. 49–57.

Goldwater, S. and D. McClosky (2005). “Improving statistical MT through morpho- logical analysis”. In: Proceedings of Empirical Methods in Natural Language Process- ing (EMNLP), pp. 676–683.

Gross, A. (1992). Limitations of computers as translation tools computers in translation: A practical appraisal.

Guzmán, F., S. R. Joty, L. Màrquez, and P. Nakov (2017). “Machine translation eval- uation with neural networks”. In: Journal of CoRR abs/1710.02095.

Haffari, G., M. Roy, and A. Sarkar (2009). “Active learning for statistical phrase- based machine translation”. In: Proceedings of the Annual Conference of the North American Chapters of the Association for Computational Linguistics (NAACL), pp. 415– 423.

Heafield, K. (2011). “KenLM: Faster and smaller language model queries”. In: Proceedings of the 6th Workshop on Statistical Machine Translation.

Hildebrand, A. S., M. Eck, S. Vogel, and A. Waibel (2005). “Adaptation of the Trans- lation Model for Statistical Machine Translation based on Information Retrieval”. In: Proceedings of Empirical Methods in Natural Language Processing (EMNLP).

Hopkins, M. and J. May (2011). “Tuning as ranking”. In: Proceedings of the Empirical Methods in Natural Language Processing (EMNLP), pp. 1352–1362.

Huang, F. (2009). “Confidence measure for word alignment”. In: Proceedings of the 47th Annual Meeting of the Association for Computational Lingustics (ACL), and the 4th IJCNLP of the AFNLP, pp. 932–940.

Irvine, A. and C. Callison-Burch (2015). “End-to-end statistical machine translation with zero or small parallel texts”. In: Journal of Natural Language Engineering 1.1, pp. 1–34.

Irvine, A. and C. Callison-Burch (2016). “A comprehensive analysis of bilingual lex- icon induction”. In: Journal of Computational Linguistics.

Jurafsky, D. and J. H. Martin (2000). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition (Prentice Hall Series in Artificial Intelligence). 1st ed. Prentice Hall.

Kasami, T. (1966). An efficient recognition and syntax-analysis algorithm for context-free languages.

Kishida, K. and N. Kando (2003). Two stages refinement of query translation for pivot language approach to cross lingual information retrieval: A trial at CLEF 2003.

Knight, K. (1999). “Decoding complexity in word-replacement translation models”. In: Journal of Computational Linguistics 25.

Koehn, P. (2004). “Statistical significance tests for machine translation evaluation”. In: Proceedings of Emperical Methods for Natural Language Processing (EMNLP). Vol. 4, pp. 388–395.

Koehn, P., F. J. Och, and D. Marcu (2003). “Statistical phrase-based translation”. In: Proceedings of North American Chapters of Association for Computational Linguistics (NAACL), pp. 127–133.

Koehn, P., H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst (2007). “Moses: open source toolkit for statistical machine translation”. In: Proceedings of the 45th Annual Meeting of the Association for Computer Linguistics (ACL), pp. 177–180.

Koehn, P. (2005). “Europarl: A parallel corpus for statistical machine translation”. In: Proceedings of the 10th Machine Translation Summit, pp. 79–86.

Kumar, S., F. J. Och, and W. Macherey (2007). “Improving word alignment with bridge languages”. In: Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing (EMNLP), and Computational Natural Language Learn- ing, pp. 42–50.

Kunchukuttan, A., R. Pudupully, R. Chatterjee, A. Mishra, and P. Bhattacharyya (2014). “The iit bombay smt system for icon 2014 tools contest”. In: Proceedings of Natural Language Processing Tools Contest at ICON.

Lambert, P., H. Schwenk, C. Servan, and S. Abdul-Rauf (2011). “Investigations on translation model adaptation using monolingual data”. In: Proceedings of the 6th Workshop on Statistical Machine Translation, pp. 284–293.

Leavitt, J., D. W. Lonsdale, and A. M. Franz (1994). “A reasoned interlingua for knowledge-based machine translation”. In: Proceedings of Canadian Artificial In- telligence Conference.

Li, Z., C. Callison-Burch, C. Dyer, J. Ganitkevitch, S. Khudanpur, L. Schwartz, W. N. G. Thornton, J. Weese, and O.F. Zaidan (2009). “Joshua: An open source toolkit for parsing-based machine translation”. In: Proceedings of the 4th Europian Chap- ters of the Association for Computational Linguistics (EACL), Workshop on Statistical Machine Translation, pp. 135–139.

Liu, T., H. Wu, D. Dong, W. He, X. Hu, D. Yu, H. Wu, and H. Wang (2010). “Improving statistical machine translation with monolingual collocation”. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 825–833.

Liu, Y., Y. Huang, Q. Liu, and S. Lin (2007). “Forest-to-string statistical translation rules”. In: Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 704–711.

Lopez, A. (2008). “A survey of statistical machine translation”. In: Journal of Comput- ing Surveys 40.3.

Lopez, A. and P. Resnik (2005). “Improved HMM alignment models for languages with scarce resources”. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL), pp. 83–86.

Marcu, D. and W. Wong (2002). “A phrase-based, joint probability model for statisti- cal machine translation”. In: Proceedings of Emperical Methods for Natural Language Processing (EMNLP), pp. 133–139.

Martin, J., R. Mihalcea, and T. Pedersen (2005). Word alignment for languages with scarce resources.

Marton, Y. and P. Resnik (2008). “Soft syntactic constraints for hierarchical phrase-based translation”. In: Proceedings of the Human Language Technology Conference of the North American Chapters of the Association of Computational Linguistics (HLT-NAACL), pp. 1003–1011.

Matusov, E., M. Popovic, R. Zens, and H. Ney (2004). “Statistical machine translation of spontaneous speech with scarce resources”. In: Proceedings of the International Workshop on Spoken Language Translation (IWSLT), pp. 139–146.

Matusov, E., G. Leusch, R. E. Banchs, N. Bertoldi, D. Dechelotte, M. Federico, M. Kolss, Y. S. Lee, J. B. Mariño, M. Paulik, S. Roukos, H. Schwenk, and H. Ney (2008). “System combination for machine translation of spoken and written lan- guage”. In: Proceedings of IEEE: Transactions on Audio, Speech and Language Process- ing, pp. 1222–1237.

McKeown, K. R. and D. R. Radev (2000). “Collocations”. In: Proceedings of the Robert Dale, Hermann Moisl, and Harold Somers (Ed.), A Handbook of Natural Language Pro- cessing, pp. 507–523.

Melamed, I. D., G. Satta, and B. Wellington (2004). “Generalized multitext gram- mars”. In: Proceedings of the 42Nd Annual Meeting on Association for Computational Linguistics (ACL).

Mohit, B. and R. Hwa (2007). “Localization of difficult-to-translate phrases”. In: Pro- ceedings of the Second Workshop on Statistical Machine Translation, pp. 248–255.

Munteanu, D. S. and D. Marcu (2006). “Extracting parallel sub-sentential fragments from non-parallel corpora”. In: Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 81–88.

Mustafa, S. H. (2005). “Character contiguity in N-gram-based word matching: the case for Arabic text searching”. In: Journal of Information Processing and Manage- ment 41.4, pp. 819–827.

Nakov, P. and H. T. Ng (2012). “Improving statistical machine translation for a resource-poor language using related resource-rich languages”. In: Journal of Artificial Intelligence Research (JAIR), pp. 179–222.

Nettle, D. (1998). “Explaining global patterns of language diversity”. In: Anthropo- logical archaeology 17.4, pp. 354–374.

Neubig, G. and K. Duh (2014). “On the elements of an accurate tree-to-string ma- chine translation system”. In: Proceedings of the 52nd Annual Meeting of the Associ- ation for Computational Linguistics (ACL), pp. 143–149.

Ng, V. and C. Cardie (2003). “Weakly supervised natural language learning without redundant views”. In: Proceedings of the Meeting of the North American chapters of the Association for Computational Linguistics (NAACL), pp. 173–180.

Niesen, S., H. Ney, R. Aachen, and R. Aachen (2004). “Statistical machine translation with scarce resources using morpho-syntactic information”. In: Journal of Compu- tational Linguistics 30.2, pp. 181–204.

Nigam, K. and R. Ghani (2000). “Analyzing the effectiveness and applicability of co- training”. In: Proceedings of International Conference on Information and Knowledge Management (CIKM), pp. 86–93.

Nirenburg, S., S. Beale, and C. Domashnev (1994). “A full-text experiment in example- based machine translation”. In: Proceedings of the International Conference on New Methods in Language Processing, pp. 78–87.

Och, F. J. (1999). An efficient method for determining bilingual word classes.

Och, F. J. and H. Ney (2002). “Discriminative training and maximum entropy models for statistical machine translation”. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (ACL), pp. 295–302.

Och, F. J. (2003). “Minimum error rate training in statistical machine translation”. In: Proceedings of the 41st Annual Meeting on Association for Computational Linguistics (ACL). Vol. 1, pp. 160–167.

Och, F. J. and H. Ney (2003). “A systematic comparison of various statistical alignment models”. In: Journal of Computational Linguistics 29.1, pp. 19–51.

Och, F. J. and H. Ney (2004). “The alignment template approach to statistical machine translation”. In: Journal of Computational Linguistics.

Oepen, S., E. Velldal, J. T. Lonning, P. Meurer, V. Rosen, and D. Flickinger (2007). “Towards hybrid quality-oriented machine translation. On linguistics and probabilities in MT”. In: Proceedings of the 11th Conference on Theoretical and Methodological Issues in Machine Translation, pp. 144–153.

Papineni, K., S. Roukos, and T. Ward (1998). “Maximum likelihood and discrimi- native training of direct translation models”. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing (IEEE). Vol. 181. 1, pp. 189– 192.

Papineni, K., S. Roukos, T. Ward, and W. J. Zhu (2002). “BLEU: A method for automatic evaluation of machine translation”. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (ACL), pp. 311–318.

Paul, M., H. Yamamoto, E. Sumita, and S. Nakamura (2009). “On the importance of pivot language selection for statistical machine translation”. In: Proceedings of Human Language Technologies: The Annual Conference of the North American Chapters of the Association for Computational Linguistics (HLT-NAACL), pp. 221–224.

Paul, M., A. Finch, and E. Sumita (2013). “How to choose the best pivot language for automatic translation of low-resource languages?” In: Proceedings of the ACM Transactions on Asian Language Information.

Pauls, A. and D. Klein (2011). “Faster and smaller n-gram language models”. In: Pro- ceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 258–267.

Pickhardt, R., T. Gottorn, M. Korner, and S. Staab (2014). “A generalized language model as the combination of skipped n-grams and modified kneser-ney smooth- ing”. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL), pp. 1145–1154.

Pilevar, M. T., H. Faili, and A. H. Pilevar (2011). “TEP: Tehran English-Persian par- allel corpus”. In: Proceedings of 12th International Conference on Intelligent Text Pro- cessing and Computational Linguistics (CICLing).

Popovic, M. and H. Ney (2005). “Exploiting phrasal lexica and additional morpho- syntactic language resources for statistical machine translation with scarce train- ing data”. In: Proceedings of the 10th Annual Conference of the European Association for Machine Translation (EAMT), pp. 212–218.

Popovic, M., D. Vilar, H. Ney, S. Jovicic, and Z. Saric (2005). “Augmenting a small parallel text with morpho-syntactic language resources for Serbian–English statistical machine translation”. In: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), pp. 41–48.

Quirk, C., A. Menezes, and C. Cherry (2005). “Dependency treelet translation: Syn- tactically informed phrasal SMT”. In: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), pp. 271–279.

Ravi, S. and K. Knight (2011). “Deciphering foreign language”. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 12– 21.

Sanchez-Cartagena, V. M., F. Sanchez-Martínez, and J. A. Perez-Ortiz (2011). “Enriching a statistical machine translation system trained on small parallel corpora with rule-based bilingual phrases”. In: Proceedings of Recent Advances in Natural Language Processing (RANLP), pp. 90–96.

Saralegi, X., I. Manterola, and I. Vicente (2011). “Analyzing methods for improving precision of pivot based bilingual dictionaries”. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 846–856.

Sarkar, A. (2001). “Applying co-training methods to statistical parsing”. In: Proceedings of the Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL).

Scannell, K. P. (2007). “The Crubadan Project: Corpus building for under-resourced languages”. In: Proceedings of the 3rd Web as Corpus Workshop. Vol. 4, pp. 5–15.

Schafer, C. and D. Yarowsky (2002). “Inducing translation lexicons via diverse similarity measures and bridge languages”. In: Proceedings of the Conference on Natural Language Learning (CoNLL), pp. 874–881.

Schroeder, J. (2007). “Experiments in domain adaptation for statistical machine translation”. In: Proceedings of Association for Computational Linguistics (ACL), pp. 224–227.

Senellart, J. and P. Koehn (2010). “Convergence of translation memory and statistical machine translation”. In: Proceedings of AMTA Workshop on MT Research and the Translation Industry, pp. 21–31.

Smith, J. R., C. Quirk, and K. Toutanova (2010). “Extracting parallel sentences from comparable corpora using document level alignment”. In: Proceedings of Human Language Technologies: The 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), pp. 403–411.

Snover, M., B. Dorr, R. Schwartz, L. Micciulla, and J. Makhoul (2006). “A study of translation edit rate with targeted human annotation”. In: Proceedings of Association for Machine Translation in the Americas, pp. 223–231.

Stolcke, A. (2002). “SRILM - An extensible language modeling toolkit”. In: Proceedings of the International Conference on Spoken Language Processing, pp. 257–286.

Sutskever, I., O. Vinyals, and Q. V. Le (2014). “Sequence to sequence learning with neural networks”. In: Proceedings of Advances in Neural Information Processing Systems, pp. 3104–3112.

Tiedemann, J. (2012). “Parallel data, tools and interfaces in OPUS”. In: Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC).

Tinsley, J., M. Hearne, and A. Way (2009). “Exploiting parallel treebanks to improve phrase-based statistical machine translation”. In: Proceedings of the 10th International Conference on Computational Linguistics and Intelligent Text Processing, pp. 318–331.

Tofighi Zahabi, S., S. Bakhshaei, and S. Khadivi (2013). “Using context vectors in improving a machine translation system with bridge language”. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL), pp. 318–322.

Tufis, D., R. Ion, A. Ceausu, and D. Stefanescu (2005). “Combined word alignment”. In: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), pp. 107–110.

Ueffing, N., G. Haffari, and A. Sarkar (2007). “Transductive learning for statistical machine translation”. In: Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 25–32.

Utiyama, M. and H. Isahara (2007). “A comparison of pivot methods for phrase-based statistical machine translation”. In: Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 484–491.

Vilar, D., D. Stein, M. Huck, and H. Ney (2010). “Jane: Open source hierarchical translation, extended with reordering and lexicon models”. In: Proceedings of the Joint 5th Workshop on Statistical Machine Translation and Metrics (MATR), pp. 268–276.

Wang, H., H. Wu, and Z. Liu (2006). “Word alignment for languages with scarce resources using bilingual corpora of other language pairs”. In: Proceedings of COLING-ACL, Main Conference Poster Sessions, pp. 874–881.

Wu, H. and H. Wang (2007). “Pivot language approach for phrase-based statistical machine translation”. In: Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 856–863.

Xiong, D., M. Zhang, A. Aw, and H. Li (2009). “A syntax-driven bracketing model for phrase-based translation”. In: Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics (ACL) and the 4th IJCNLP of the AFNLP, pp. 315–323.

Yarowsky, D. (1995). “Unsupervised word sense disambiguation rivaling supervised methods”. In: Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL), pp. 189–196.

Younger, D. H. (1967). “Recognition and parsing of context-free languages in time n³”. In: Information and Control 10.2, pp. 189–208.

Zhang, B., D. Xiong, and J. Su (2016). “Recurrent neural machine translation”. In: CoRR abs/1607.08725.

Zhou, Z. H. (2017). “A brief introduction to weakly supervised learning”. In: National Science Review.

Zhu, X., Z. He, H. Wu, and C. Zhu (2014). “Improving pivot-based statistical machine translation by pivoting the co-occurrence count of phrase pairs”. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1665–1675.

Zollmann, A. and A. Venugopal (2006). “Syntax augmented machine translation via chart parsing”. In: Proceedings of the Workshop on Statistical Machine Translation, pp. 138–141.