Leveraging Word and Phrase Alignments for Multilingual Learning
Junjie Hu
CMU-LTI-21-012

Language Technologies Institute
School of Computer Science
Carnegie Mellon University
5000 Forbes Avenue, Pittsburgh, PA 15213
www.lti.cs.cmu.edu

Thesis Committee:
Dr. Graham Neubig (Chair), Carnegie Mellon University
Dr. Yulia Tsvetkov, Carnegie Mellon University
Dr. Zachary Lipton, Carnegie Mellon University
Dr. Kyunghyun Cho, New York University

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Language and Information Technologies.

© 2020 Junjie Hu

Keywords: natural language processing, multilingual learning, cross-lingual transfer learning, machine translation, domain adaptation, deep learning

Dedicated to my family for their wholehearted love and support in all my endeavours.

Abstract

Recent years have witnessed impressive success in natural language processing (NLP) thanks to advances in neural networks and the availability of large amounts of labeled data. However, many NLP systems have focused predominantly on high-resource languages (e.g., English, Chinese) that have large, computationally accessible collections of labeled data for training. While the achievements on high-resource languages are exciting, there are more than 6,900 languages in the world, and the majority of them have far fewer resources for training deep neural networks. In fact, it is often expensive, or sometimes infeasible, to collect labeled data written in all possible languages. As a result, this data scarcity issue limits the generalization of NLP systems in many multilingual scenarios. Moreover, as models may be used to process text from a wide range of domains (e.g., social media or medical articles), the data scarcity issue is further exacerbated by the domain shift between the training and test data.

In this thesis, with the goal of improving the generalization ability of NLP models to alleviate the aforementioned challenges, we exploit word and phrase alignments to train neural NLP models (e.g., neural machine translation models or contextualized language models), and provide evaluation methods for examining the generalization capabilities of such models over diverse application scenarios. This thesis contains two parts. The first part explores cross-lingual generalization for language understanding. In particular, we examine the ability of pre-trained multilingual representations to transfer learned knowledge from a high-resource language to other languages. To this end, we first introduce a multi-task benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks. Second, we leverage word and sentence alignments from parallel data to improve multilingual representations for language understanding tasks such as those included in our benchmark. The second part of the thesis is devoted to leveraging alignment information for machine translation, a popular and useful language generation task. In particular, we focus on learning to translate aligned words and phrases between two languages with fewer parallel sentences. To accomplish this goal, we exploit techniques to obtain aligned words and phrases from monolingual data, knowledge bases, or crowdsourcing, and use them to improve translation systems.
Acknowledgments

First and foremost, I am grateful to my advisors Graham Neubig and Jaime Carbonell, who have provided me with constant support as well as insightful discussions throughout my academic development at Carnegie Mellon University. In addition to being an excellent researcher, Graham is also the kindest and most supportive mentor that I have had during my academic career. He has a great sense of research taste, which motivates me to conduct important research and think deeply about my research agenda. Jaime is one of the most influential researchers that I have met and a role model for my academic career. Both of them will have a long-standing influence on me beyond my PhD.

I would like to thank Kyunghyun Cho, Yulia Tsvetkov, and Zachary Lipton for all the help they have provided during my Ph.D. study, including serving as my committee members, giving advice on my job search, and having insightful discussions about my thesis.

A huge thanks goes to all of my other collaborators and friends throughout my Ph.D. study at CMU: Pengcheng Yin, Wei-cheng Chang, Yuexin Wu, Diyi Yang, Po-Yao Huang, Han Zhao, Yichong Xu, Zhengbao Jiang, Zi-Yi Dou, Mengzhou Xia, Antonios Anastasopoulos, Chunting Zhou, Xuezhe Ma, Xinyi Wang, Austin Matthews, Craig Stewart, Nikolai Vogler, Shruti Rijhwani, Paul Michel, Danish Pruthi, Zaid Sheikh, Wei Wei, Zi Yang, Zhilin Yang, Desai Fan, Shuxin Yao, Hector Liu, Zhiting Hu, Jean Oh, Andrej Risteski, Geoff Gordon, Anatole Gershman, Ruslan Salakhutdinov, and William Cohen.

In addition to my time at CMU, I would also like to thank my MPhil advisors Michael R. Lyu and Irwin King at the Chinese University of Hong Kong, and my collaborators during my internships at Google and Microsoft: Sebastian Ruder, Melvin Johnson, Orhan Firat, Aditya Siddhant, Yu Cheng, Zhe Gan, Jingjing Liu, Jianfeng Gao, Liu Yang, Xiaodong Liu, and Yelong Shen.

Last but most importantly, I would like to thank my dad Saihao Hu, my mom Suhua Li, my elder sister Jiana Hu, and my lovely girlfriend Ming Jiang for all of your unconditional love and support during my Ph.D. journey. Needless to say, this thesis would not have been possible without your encouragement along the way. This thesis is dedicated to all of you.

Contents

1 Introduction
  1.1 Main Contributions
  1.2 Thesis Outline

2 Background
  2.1 Prior Work
  2.2 Multilingual Neural Network Models
  2.3 Pre-training of Neural Network Models

3 Cross-Lingual Generalization Benchmark
  3.1 Overview
  3.2 XTREME
  3.3 Experiments
  3.4 Analyses
  3.5 Related Work
  3.6 Discussion and Future Work

4 Leveraging Word and Sentence Alignment for Language Understanding
  4.1 Overview
  4.2 Cross-lingual Alignment
  4.3 Experiments
  4.4 Related Work
  4.5 Discussion and Future Work

5 Leveraging Word Alignment for Domain Adaptation of Machine Translation
  5.1 Overview
  5.2 Domain Adaptation by Lexicon Induction
  5.3 Experimental Results
  5.4 Related Work
  5.5 Discussion and Future Work

6 Leveraging Aligned Entities for Machine Translation
  6.1 Overview
  6.2 Denoising Entity Pre-training
  6.3 Experimental Setting
  6.4 Discussion
  6.5 Related Work
  6.6 Discussion and Future Work

7 Leveraging Phrase Alignment for Machine Translation
  7.1 Overview
  7.2 Problem Definition
  7.3 Active Selection Strategies
  7.4 Training with Sentences and Phrases
  7.5 Experiments
  7.6 Related Work
  7.7 Discussion and Future Work

8 Conclusion
  8.1 Summary of Contributions
  8.2 Key Ideas and Suggestions
  8.3 Future Work

A Appendix of Chapter 3
  A.1 Languages
  A.2 Translations for QA datasets
  A.3 Performance on translated test sets
  A.4 Generalization to unseen tag combinations
  A.5 Results for each task and language

B Appendix of Chapter 4
  B.1 Training Details for Reproducibility
  B.2 Detailed Results
  B.3 Detailed Results on Performance Difference by Languages

Bibliography

Chapter 1

Introduction

Over the past decade, the success of NLP systems has been mostly driven by deep neural network models and supervised learning approaches on large amounts of labeled data. For example, neural machine translation (NMT) [18] trained on billions of parallel sentences has become the de facto paradigm of many commercial translation systems, such as Google’s multilingual NMT system [3, 89] and Microsoft’s NMT system [77]. Among these exciting NLP research developments, this thesis particularly focuses on multilingual learning approaches and investigates two main categories of multilingual NLP tasks: machine translation (MT) and cross-lingual language understanding. Machine translation is the task of translating text from a source language to another target language. The broad genre of cross-lingual language