
Topics in Sequence-to-Sequence Learning for Natural Language Processing

Roee Aharoni

Ph.D. Thesis

Submitted to the Senate of Bar-Ilan University

Ramat Gan, Israel
May 2020

This work was carried out under the supervision of Prof. Yoav Goldberg, Department of Computer Science, Bar-Ilan University.

To my parents, Shoshana (Shoshi) and Yehoshua (Shuki). Thank you for all the endless love and support, and for raising me to always be curious.

To Sapir, my best friend and partner in life. Thank you for coping with all the busy weekends, long travels and white nights before deadlines, and for all your love and support during this journey. This work is dedicated to you.

Acknowledgements

To the one and only Yoav Goldberg, my advisor in the journey that is concluded in this work. Thank you for all the patience, inspiration, endless knowledge, creativity, and general wizardry and super powers. You made this journey a true pleasure, and those several years one of the most enjoyable and rewarding periods of my life. Beyond NLP, I learned so much from you about how to do proper science, how to ask the right questions, and how to communicate ideas clearly and concisely. I couldn’t ask for a better advisor and truly cherish all the invaluable things I learned from you. Thank you!

To all BIU-NLP members, and particularly Ido Dagan, Eliyahu Kiperwasser, Vered Schwartz, Gabriel Stanovsky, Yanai Elazar and Amit Moryessef – thank you for being amazing colleagues and for creating a supportive and welcoming environment, which makes research a truly amazing experience. I learned so much from you and I’m sure we will collaborate again in the future.

To Orhan Firat, Melvin Johnson and the rest of the Google Translate team – thank you for the opportunity to work with you in one of the most impactful natural language processing teams in the world. Summer 2018 was an unforgettable one and really a dream come true, which certainly shaped how my future will look in a significant way.

Last but not least, I would like to thank all the teachers that introduced me to and helped me deepen my practice of Yoga. This practice made a significant contribution to keeping me strong and flexible, physically but mostly mentally, during this journey. Namaste.

Preface

Publications

Portions of this thesis are joint work and have been published elsewhere.

Chapter 3, “Improving Sequence-to-Sequence Learning for Morphological Inflection Generation”, appeared in the proceedings of the 14th Workshop on Computational Research in Phonetics, Phonology, and Morphology (SIGMORPHON 2016), Berlin, Germany (Aharoni et al., 2016).

Chapter 3, “Morphological Inflection Generation with Hard Monotonic Attention”, appeared in the proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), Vancouver, Canada (Aharoni and Goldberg, 2017a).

Chapter 4, “Towards String-to-Tree Neural Machine Translation”, appeared in the proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), Vancouver, Canada (Aharoni and Goldberg, 2017b).

Chapter 5, “Split and Rephrase: Better Evaluation and a Stronger Baseline”, appeared in the proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018), Melbourne, Australia (Aharoni and Goldberg, 2018).

Chapter 6, “Massively Multilingual Neural Machine Translation”, appeared in the proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), Minneapolis, USA (Aharoni et al., 2019).

Chapter 7, “Emerging Domain Clusters in Pretrained Language Models”, appeared in the proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), Seattle, USA (Aharoni and Goldberg, 2020).

Funding

This work was supported by Intel (via the ICRI-CI grant), the Israeli Science Foundation (grant number 1555/15), the German Research Foundation via the German-Israeli Project Cooperation (DIP, grant DA 1600/1-1), and by Google.

Contents

Abstract

1 Introduction
1.1 Hard Attention Architectures for Morphological Inflection Generation
1.2 Linearizing Syntax for String-to-Tree Neural Machine Translation
1.3 Semantic Sentence Simplification by Splitting and Rephrasing
1.4 Massively Multilingual Neural Machine Translation: Towards Universal Translation
1.5 Emerging Domain Clusters in Pretrained Language Models
1.6 Outline

2 Background
2.1 Neural Networks Basics
2.1.1 Feed-Forward Neural Networks
2.1.2 Recurrent Neural Networks
2.2 Neural Machine Translation
2.2.1 Task Definition
2.2.2 Dense Word Representations
2.2.3 Encoder
2.2.4 Decoder
2.2.5 The Attention Mechanism
2.2.6 Training Objective
2.2.7 Inference and Beam-search

3 Hard Attention Architectures for Morphological Inflection Generation

4 Linearizing Syntax for String-to-Tree Neural Machine Translation

5 Semantic Sentence Simplification by Splitting and Rephrasing Complex Sentences

6 Massively Multilingual Neural Machine Translation: Towards Universal Translation

7 Unsupervised Domain Clusters in Pretrained Language Models

8 Conclusions
8.1 Linguistically Inspired Neural Architectures
8.2 Injecting Linguistic Knowledge in Neural Models using Syntactic Linearization
8.3 Understanding the Weaknesses of Neural Text Generation Models
8.4 The Benefits of Massively Multilingual Modeling
8.5 Domain Data Selection with Massive Language Models
8.6 Going Forward
8.6.1 Modeling Uncertainty
8.6.2 Finding the Right Data for the Task
8.6.3 Distillation, Quantization and Retrieval for Practical Large-Scale Neural NLP

9 Bibliography

Abstract

Making computers successfully process natural language is a long-standing goal in artificial intelligence that spreads across numerous tasks and applications. One prominent example is machine translation, which has preoccupied the minds of scientists for many decades (Bar-Hillel, 1951; Weaver, 1955). While serving as a scientific benchmark for the progress of artificial intelligence, machine translation and many other natural language processing (NLP) applications are also highly useful for millions of people around the globe, enabling better communication and easier access to the world’s information.

Many NLP tasks can be cast as sequence-to-sequence learning problems, i.e. problems that involve sequential input and output. From a machine learning perspective, such problems handle the prediction of a structured output given a structured input, as each element of the output sequence is usually predicted while conditioning on the input sequence and on the previously predicted elements. This setting requires rich feature representations and specialized inference algorithms that take into account the different interactions between and within the input and output elements.

The recent proliferation of neural-network based machine learning methods (also known as “deep learning”) has enabled profound progress on sequence-to-sequence learning tasks; specifically, it allowed representations to be learned implicitly in an end-to-end manner, without requiring the manual feature engineering that was common in previous methods. In addition, neural methods removed the limitation of using a fixed context window when modeling each element (previously imposed due to computational burden), allowing much better modeling of long-range dependencies in such tasks.

In order to unlock the full potential of neural-network based methods for NLP applications, many new research questions arise: can we design neural models while integrating existing knowledge about language? Can we take advantage of such implicitly learned representations in shared multilingual settings? What are the limitations of such models? What can we learn about textual domains from the representations such models learn?

In this thesis, I seek answers to those questions revolving around neural sequence-to-sequence learning for NLP. My works involve different levels of language study: Morphology, the study of words, how they are formed, and their relationship to other words in the same language, where I propose novel neural architectures for inflection generation; Syntax, the set of rules, principles, and processes that govern the structure of sentences in a given language, where I study the integration of syntactic information into neural machine translation; Semantics, the study of meaning in language, usually at the sentence level, where I worked on complex sentence simplification that preserves the input semantics, and on massively multilingual translation that encodes dozens of languages into a shared semantic space; and finally Pragmatics, which looks at linguistic context beyond the sentence level, where I proposed a novel method to select training data using contextualized sentence representations from pretrained neural language models.

In Chapter 3, “Morphological Inflection Generation with Hard Monotonic Attention”, I propose novel neural architectures for sequence-to-sequence learning that explicitly model a monotonic alignment between the source and target sequence elements. The models are inspired by the monotonic alignment between the characters in different morphological inflections of a given word. I evaluate the proposed models across multiple datasets for morphological inflection generation in various languages, where they obtain state-of-the-art results. My proposed approach became a well-known baseline in the literature, and is still considered a state-of-the-art approach for such morphological tasks.

In Chapter 4, “Towards String-to-Tree Neural Machine Translation”, I propose a method to incorporate linguistic knowledge into neural sequence-to-sequence learning models for machine translation.

Inspired by works on syntactic parsing, I suggest representing the target sentence as a lexicalized, linearized constituency parse tree. I show that incorporating such knowledge improves the translation quality as measured in BLEU and by human raters, especially in low-resource scenarios, across multiple language pairs. This work is one of the first attempts to incorporate such linguistic knowledge into end-to-end neural models, and it started an ongoing line of work in the MT and NLP community. Subsequent works explored other types of syntactic annotations, or other ways to inject linguistic knowledge.

In Chapter 5, “Split and Rephrase: Better Evaluation and a Stronger Baseline”, I investigate the ability of neural sequence-to-sequence learning models to perform semantic text simplification, where the input is a long, complex sentence and the output is several shorter sentences that convey the same meaning. I show that while such models seem to provide state-of-the-art results on the proposed benchmark, they are prone to over-fitting and memorization. I then propose a new data split based on the structured semantic relations that describe the sentences, to better test the generalization abilities of such models, unveiling their limitations. This work was later extended to a larger, more realistic dataset by others, who adopted our proposed approaches.

In Chapter 6, “Massively Multilingual Neural Machine Translation”, I investigate scaling neural machine translation (NMT) models to massively multilingual settings, involving up to 103 languages and more than 95 million sentence pairs in a single model. I show that training such models is highly effective for improving the performance on low-resource language pairs, resulting in state-of-the-art results on the publicly available TED talks dataset. I then conduct large-scale experiments where I point at the trade-off between the degradation in supervised translation quality, due to the bottleneck caused by scaling to numerous languages, and the improved generalization abilities in zero-shot translation when increasing the number of languages.

While this work was the first to scale NMT models to such settings, many subsequent works now train massively multilingual language models that enable cross-lingual transfer learning, for better NLP in under-resourced languages.

In Chapter 7, “Emerging Domain Clusters in Pretrained Language Models”, I investigate sentence representations learned by various large-scale neural language models with respect to the domains those sentences were drawn from. I show that using such representations, it is possible to cluster sentences into domains with very high accuracy, in a purely unsupervised manner. I then propose ways to utilize these emerging domain clusters to select data for training domain-specific neural machine translation models. I show that such models outperform strong baselines that are either trained on all the available data or use established data selection methods. This work is the first to explore such use of pretrained language models for sentence clustering or domain data selection, and suggests a novel, pragmatic, data-driven approach to the notion of domains in textual data.

As this work shows, neural sequence-to-sequence learning is a powerful approach for tackling many different problems in various areas of NLP, from low-level morphological tasks to the high-level task of translating between numerous languages. While it is highly effective, there are still many areas left to explore further. In the final part of this thesis, I conclude with the contributions of this work and with a list of directions for future work which I find important to pursue further on the subject.


Chapter 1

Introduction

In this chapter we provide an introduction to the different chapters in this thesis, including our contributions and related work.

1.1 Hard Attention Architectures for Morphological Inflection Generation

Morphological inflection generation involves generating a target word (e.g. “härtestem”, the German word for “hardest”), given a source word (e.g. “hart”, the German word for “hard”) and the morpho-syntactic attributes of the target (POS=adjective, gender=masculine, type=superlative, etc.).

The task is important for many downstream NLP tasks such as machine translation, especially for dealing with data sparsity in morphologically rich languages where a lemma can be inflected into many different word forms. Several studies have shown that translating into lemmas in the target language and then applying inflection generation as a post-processing step is beneficial for phrase-based machine translation (Minkov et al., 2007; Toutanova et al., 2008; Clifton and Sarkar, 2011; Fraser et al., 2012; Chahuneau et al., 2013) and more recently for neural machine translation (García-Martínez et al., 2016).

The task was traditionally tackled with hand-engineered finite state transducers (FSTs) (Koskenniemi, 1984; Kaplan and Kay, 1994), which rely on expert knowledge, or using trainable weighted finite state transducers (Mohri et al., 1997; Eisner, 2002), which combine expert knowledge with data-driven parameter tuning. Many other machine-learning based methods (Yarowsky and Wicentowski, 2000; Dreyer and Eisner, 2011; Durrett and DeNero, 2013; Hulden et al., 2014; Ahlberg et al., 2015; Nicolai et al., 2015) were proposed for the task, although with specific assumptions about the set of possible processes that are needed to create the output sequence.

More recently, the task was modeled as neural sequence-to-sequence learning over character sequences with impressive results (Faruqui et al., 2016). The vanilla encoder-decoder models as used by Faruqui et al. compress the input sequence to a single, fixed-sized continuous representation. Instead, the soft-attention based sequence-to-sequence learning paradigm (Bahdanau et al., 2015) allows directly conditioning on the entire input sequence representation, and was utilized for morphological inflection generation with great success (Kann and Schütze, 2016a,b).

However, the neural sequence-to-sequence models require large training sets in order to perform well: their performance on the relatively small CELEX dataset is inferior to the latent variable WFST model of Dreyer et al. (2008). Interestingly, the neural WFST model by Rastogi et al. (2016) also suffered from the same issue on the CELEX dataset, and surpassed the latent variable model only when given twice as much data to train on.

In Chapter 3, we propose a model which handles the above issues by directly modeling an almost monotonic alignment between the input and output character sequences, which is commonly found in the morphological inflection generation task (e.g. in languages with concatenative morphology). The model consists of an encoder-decoder neural network with a dedicated control mechanism: in each step, the model attends to a single input state and either writes a symbol to the output sequence or advances the attention pointer to the next state of the bi-directionally encoded sequence.

This modeling suits the natural monotonic alignment between the input and output, as the network learns to attend to the relevant inputs before writing the output to which they are aligned. The encoder is a bi-directional RNN, where each character in the input word is represented using a concatenation of forward RNN and backward RNN states over the word’s characters. The combination of the bi-directional encoder and the controllable hard attention mechanism makes it possible to condition the output on the entire input sequence. Moreover, since each character representation is aware of the neighboring characters, non-monotone relations are also captured, which is important in cases where segments in the output word are a result of long-range dependencies in the input word. The recurrent nature of the decoder, together with a dedicated feedback connection that passes the last prediction to the next decoder step explicitly, enables the model to also condition the current output on all the previous outputs at each prediction step.

The hard attention mechanism allows the network to jointly align and transduce while using a focused representation at each step, rather than the weighted sum of representations used in the soft attention model.
This makes our model resolution preserving (Kalchbrenner et al., 2016) while also keeping decoding time linear in the output sequence length, rather than multiplicative in the input and output lengths as in the soft-attention model. In contrast to previous sequence-to-sequence work, we do not require the training procedure to also learn the alignment. Instead, we use a simple training procedure which relies on independently learned character-level alignments, from which we derive gold transduction+control sequences. The network can then be trained using a straightforward cross-entropy loss.

To evaluate our model, we perform extensive experiments on three previously studied morphological inflection generation datasets: the CELEX dataset (Baayen et al., 1993), the Wiktionary dataset (Durrett and DeNero, 2013) and the SIGMORPHON 2016 dataset (Cotterell et al., 2016). We show that while our model is on par with or better than the previous neural and non-neural state-of-the-art approaches, it also performs significantly better with very small training sets, being the first neural model to surpass the performance of the weighted FST model with latent variables which was specifically tailored for the task by Dreyer et al. (2008). Finally, we analyze and compare our model and the soft attention model, showing how they function very similarly with respect to the alignments and representations they learn, in spite of our model being much simpler. This analysis also sheds light on the representations such models learn for the morphological inflection generation task, showing how they encode specific features like a symbol’s type and the symbol’s location in a sequence.

To summarize, our contributions in Chapter 3 are three-fold:

1. We present a hard attention model for nearly-monotonic sequence to se- quence learning, as common in the morphological inflection setting.

2. We evaluate the model on the task of morphological inflection generation, establishing a new state of the art on three previously-studied datasets for the task.

3. We perform an analysis and comparison of our model and the soft-attention model, shedding light on the features such models extract for the inflec- tion generation task.

Our hard-attention models were adopted by the community as a strong tool for morphological inflection generation (Gorman et al., 2019), lemmatization (Şahin and Gurevych, 2019), surface realization (Puzikov et al., 2019), and machine translation (Press and Smith, 2018), among others. They were also extended to non-monotonic scenarios (Wu et al., 2018a) and to exact inference (Wu and Cotterell, 2019). Our models were improved further, resulting in the winning submissions for the 2017 CoNLL shared task on morphological reinflection (Makarov et al., 2017; Cotterell et al., 2017) and the 2018 CoNLL shared task on Universal Morphological Reinflection (Makarov and Clematide, 2018; Cotterell et al., 2018).

1.2 Linearizing Syntax for String-to-Tree Neural Machine Translation

Neural Machine Translation (NMT) (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014a; Bahdanau et al., 2015) has recently become the state-of-the-art approach to machine translation (Bojar et al., 2016), while being much simpler than the previously dominant phrase-based statistical machine translation (SMT) approaches (Koehn, 2010). NMT models usually do not make explicit use of syntactic information about the languages at hand. However, a large body of work was dedicated to syntax-based SMT (Williams et al., 2016). One prominent approach to syntax-based SMT is string-to-tree (S2T) translation (Yamada and Knight, 2001, 2002), in which a source-language string is translated into a target-language tree. S2T approaches to SMT help to ensure the resulting translations have valid syntactic structure, while also mediating flexible reordering between the source and target languages. The main formalism driving current S2T SMT systems is GHKM rules (Galley et al., 2004, 2006), which are synchronous transduction grammar (STSG) fragments extracted from word-aligned sentence pairs with syntactic trees on one side. The GHKM translation rules allow flexible reordering on all levels of the parse tree.

In Chapter 4, we suggest that NMT can also benefit from the incorporation of syntactic knowledge, and propose a simple method of performing string-to-tree neural machine translation. Our method is inspired by recent works in syntactic parsing which model trees as sequences (Vinyals et al., 2015; Choe and Charniak, 2016). Namely, we translate a source sentence into a linearized, lexicalized constituency tree. The linearized trees we predict are different in their structure from those in Vinyals et al. (2015): instead of having part-of-speech tags as terminals, they contain the words of the translated sentence. We intentionally omit the POS information, as including it would result in significantly longer sequences. The S2T model is trained on parallel corpora in which the target sentences are automatically parsed. Since this modeling keeps the form of a sequence-to-sequence learning task, we can employ the conventional attention-based sequence-to-sequence paradigm (Bahdanau et al., 2015) as-is, while enriching the output with syntactic information.

Some recent works did propose to incorporate syntactic or other linguistic knowledge into NMT systems, although mainly on the source side: Eriguchi et al. (2016a,b) replace the encoder in an attention-based model with a Tree-LSTM (Tai et al., 2015) over a constituency parse tree; Bastings et al. (2017) encoded sentences using graph-convolutional networks over dependency trees; Sennrich and Haddow (2016) proposed a factored NMT approach, where each source word embedding is concatenated to embeddings of linguistic features of the word; Luong et al. (2015a) incorporated syntactic knowledge via multi-task sequence-to-sequence learning: their system included a single encoder with multiple decoders, one of which attempts to predict the parse tree of the source sentence; Stahlberg et al. (2016) proposed a hybrid approach in which translations are scored by combining scores from an NMT system with scores from a Hiero (Chiang, 2005, 2007) system. Shi et al. (2016) explored the syntactic knowledge encoded by an NMT encoder, showing the encoded vector can be used to predict syntactic information like constituency trees, voice and tense with high accuracy.
In parallel and highly related to our work, Eriguchi et al. (2017) proposed to model the target syntax in NMT in the form of dependency trees by using an RNNG-based decoder (Dyer et al., 2016), while Nădejde et al. (2017) incorporated target syntax by predicting CCG tags serialized into the target translation. Our work differs from those by modeling syntax using constituency trees, as was previously common in the “traditional” syntax-based machine translation literature.

We show our method improves the performance as measured by BLEU in both high and low-resource settings, across three language pairs. We also perform a human evaluation in the high-resource scenario, where we show human raters prefer the outputs of the string-to-tree system over a similarly trained string-to-string system. To summarize, our contributions in Chapter 4 are as follows:

• We propose a method for performing string-to-tree neural machine trans- lation by linearizing constituency syntax trees.

• We show our method improves the performance as measured by BLEU in both high and low-resource settings, across three language pairs, and also generates better translations according to human raters.

• We perform an extensive analysis of the outputs generated by our method, showing that they produce valid syntactic trees, include more reordering, and generate more relative pronouns.

This work and others initiated a line of work on representing syntactic trees, and trees in general, using neural sequence-to-sequence models. Some examples include dependency-based NMT (Wu et al., 2018b), syntactically supervised transformers for faster NMT (Akoury et al., 2019), and forest-based NMT (Ma et al., 2018). Regarding tasks other than MT, examples include translating between different semantic formalisms (Stanovsky and Dagan, 2018), code generation (Alon et al., 2019) and response generation (Du and Black, 2019).
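To make the target-side representation concrete, the following is a minimal, illustrative sketch of how a lexicalized constituency tree without POS tags could be flattened into the kind of bracketed target string described above; the tree encoding and the toy sentence are assumptions for illustration, not the thesis's actual preprocessing code:

```python
def linearize(tree):
    """Turn a nested (label, children) constituency tree into a bracketed,
    lexicalized string, keeping the target words as terminals and omitting
    POS tags (an illustrative helper, not the thesis's preprocessing script)."""
    if isinstance(tree, str):            # a terminal: just the target word
        return tree
    label, children = tree
    return f"({label} " + " ".join(linearize(c) for c in children) + " )"

# Toy usage: a hypothetical German target sentence as a lexicalized tree.
tree = ("ROOT", [("S", [("NP", ["Jane"]),
                        ("VP", ["hatte", ("NP", ["eine", "Katze"])])])])
print(linearize(tree))
# (ROOT (S (NP Jane ) (VP hatte (NP eine Katze ) ) ) )
```

The S2T model would then be trained to emit strings of this form, so the standard sequence-to-sequence machinery stays unchanged.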

1.3 Semantic Sentence Simplification by Splitting and Rephrasing

Processing long, complex sentences is challenging. This is true for humans in various circumstances (Inui et al., 2003; Watanabe et al., 2009; De Belder and Moens, 2010) as well as for NLP tasks like parsing (Tomita, 1986; McDonald and Nivre, 2011; Jelínek, 2014) and machine translation (Chandrasekar et al., 1996; Pouget-Abadie et al., 2014; Koehn and Knowles, 2017). An automatic system capable of breaking a complex sentence into several simple sentences that convey the same meaning is therefore very appealing.

Narayan et al. (2017) introduced a dataset, evaluation method and baseline systems for the task, naming it “Split and Rephrase”. The dataset includes 1,066,115 instances mapping a single complex sentence to a sequence of sentences that express the same meaning, together with RDF triples that describe their semantics. They considered two system setups: a text-to-text setup that does not use the accompanying RDF information, and a semantics-augmented setup that does. We focus on the text-to-text setup, which we find to be more challenging and more natural.

We begin with vanilla sequence-to-sequence models with attention (Bahdanau et al., 2015) and reach an accuracy of 77.5 BLEU, substantially outperforming the text-to-text baseline of Narayan et al. (2017) and approaching their best RDF-aware method. However, manual inspection reveals many cases of unwanted behaviors in the resulting outputs: (1) many resulting sentences are unsupported by the input: they contain correct facts about relevant entities, but these facts were not mentioned in the input sentence; (2) some facts are repeated: the same fact is mentioned in multiple output sentences; and (3) some facts are missing: mentioned in the input but omitted in the output.

The model learned to memorize entity-fact pairs instead of learning to split and rephrase. Indeed, feeding the model with examples containing entities alone, without any facts about them, causes it to output perfectly phrased but unsupported facts. Digging further, we find that 99% of the simple sentences (more than 89% of the unique ones) in the validation and test sets also appear in the training set, which, coupled with the good memorization capabilities of sequence-to-sequence models and the relatively small number of distinct simple sentences, helps to explain the high BLEU score.

To aid further research on the task, we propose a more challenging split of the data. We also establish a stronger baseline by extending the sequence-to-sequence approach with a copy mechanism, which was shown to be helpful in similar tasks (Gu et al., 2016; Merity et al., 2017; See et al., 2017). On the original split, our models outperform the best baseline of Narayan et al. (2017) by up to 8.68 BLEU, without using the RDF triples. On the new split, the vanilla SEQ2SEQ models break completely, while the copy-augmented models perform better. In parallel to our work, an updated version of the dataset was released (v1.0), which is larger and features a train/test split protocol similar to our proposal. We report results on this dataset as well. To summarize, our contributions in Chapter 5 are as follows:

• We point at issues in the training data for the task, and show how they make it easy for simple neural sequence-to-sequence models to perform well by memorizing the training set.

• We create a new data split for the task by taking into account the underlying semantic relations, which makes it much harder to solve by memorization.

• We establish stronger baselines by using models with a copy mechanism, allowing better generalization.

A subsequent work by Botha et al. (2018) created a larger, more natural dataset for the task. Using our proposed modeling recipes, they greatly improved the performance on the task using their new dataset, encouraging future work on the task.
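As a rough sketch of the idea behind the proposed split (not the exact procedure used in Chapter 5), one can group examples by the RDF facts that describe them and assign whole groups to train/dev/test, so that facts seen in training never reappear at test time; the field names below are hypothetical:

```python
import random
from collections import defaultdict

def split_by_facts(examples, dev_frac=0.1, test_frac=0.1, seed=0):
    """Group examples by the set of RDF facts describing them and assign whole
    groups to train/dev/test, so that facts seen at training time do not
    reappear at test time. A simplified sketch of the idea, not the exact
    procedure from the chapter; "rdf_facts" is a hypothetical field name."""
    groups = defaultdict(list)
    for ex in examples:
        groups[frozenset(ex["rdf_facts"])].append(ex)
    keys = list(groups)
    random.Random(seed).shuffle(keys)
    n_test, n_dev = int(len(keys) * test_frac), int(len(keys) * dev_frac)
    buckets = {"test": keys[:n_test],
               "dev": keys[n_test:n_test + n_dev],
               "train": keys[n_test + n_dev:]}
    return {name: [ex for k in ks for ex in groups[k]] for name, ks in buckets.items()}

# Toy usage with two distinct fact sets.
data = [{"rdf_facts": ["A|capital|B"], "complex": "...", "simple": ["..."]},
        {"rdf_facts": ["C|birthPlace|D"], "complex": "...", "simple": ["..."]}]
print({name: len(exs) for name, exs in
       split_by_facts(data, dev_frac=0.0, test_frac=0.5).items()})
```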

1.4 Massively Multilingual Neural Machine Translation: Towards Universal Translation

Neural machine translation (NMT) (Kalchbrenner and Blunsom, 2013; Bahdanau et al., 2015; Sutskever et al., 2014b) is the current state-of-the-art approach for machine translation in both academia (Bojar et al., 2016, 2017, 2018) and industry (Wu et al., 2016; Hassan et al., 2018). Recent works (Dong et al., 2015; Firat et al., 2016a; Ha et al., 2016; Johnson et al., 2017) extended the approach to support multilingual translation, i.e. training a single model that is capable of translating between multiple language pairs.

Multilingual models are appealing for several reasons. First, they are more efficient in terms of the number of required models and model parameters, enabling simpler deployment. Another benefit is transfer learning; when low-resource language pairs are trained together with high-resource ones, the translation quality may improve (Zoph et al., 2016; Nguyen and Chiang, 2017). An extreme case of such transfer learning is zero-shot translation (Johnson et al., 2017), where multilingual models are able to translate between language pairs that were never seen during training.

While very promising, it is still unclear how far one can scale multilingual NMT in terms of the number of languages involved. Previous works on multilingual NMT typically trained models with up to 7 languages (Dong et al., 2015; Firat et al., 2016b; Ha et al., 2016; Johnson et al., 2017; Gu et al., 2018a) and up to 20 trained directions (Cettolo et al., 2017) simultaneously. One recent exception is Neubig and Hu (2018), who trained many-to-one models from 58 languages into English. While utilizing significantly more languages than previous works, their experiments were restricted to many-to-one models in a low-resource setting with up to 214k examples per language pair, and were evaluated only on four translation directions.

In Chapter 6, we take a step towards practical “universal” NMT: training massively multilingual models which support up to 102 languages with up to one million examples per language pair simultaneously. Specifically, we focus on training “English-centric” many-to-many models, in which the training data is composed of many language pairs that contain English either on the source side or the target side. This is a realistic setting since English parallel data is widely available for many language pairs. We restrict our experiments to Transformer models (Vaswani et al., 2017) as they were shown to be very effective in recent benchmarks (Ott et al., 2018), also in the context of multilingual models (Lakew et al., 2018; Sachan and Neubig, 2018).

We evaluate the performance of such massively multilingual models while varying factors like model capacity, the number of trained directions (tasks) and low-resource vs. high-resource settings. Our experiments on the publicly available TED talks dataset (Qi et al., 2018) show that massively multilingual many-to-many models with up to 58 languages to-and-from English are very effective in low-resource settings, allowing the use of high-capacity models while avoiding overfitting, and achieving superior results to the current state-of-the-art on this dataset (Neubig and Hu, 2018; Wang et al., 2018) when translating into English.

We then turn to experiment with models trained on 103 languages in a high-resource setting. For this purpose we compile an English-centric in-house dataset, including 102 languages aligned to-and-from English with up to one million examples per language pair. We then train a single model on the resulting 204 translation directions and find that such models outperform strong bilingual baselines by more than 2 BLEU averaged across 10 diverse language pairs, both to-and-from English. Finally, we analyze the trade-offs between the number of involved languages and translation accuracy in such settings, showing that massively multilingual models generalize better to zero-shot scenarios. We hope these results will encourage future research on massively multilingual NMT. In summary, our contributions in Chapter 6 are the following:

• We perform extensive experiments on massively multilingual NMT, showing that a single Transformer model can successfully scale to 103 languages and 204 trained directions with one million examples per direction, while outperforming strong single-pair baselines.

• We show that massively multilingual models are effective in low-resource settings, allowing to use high-capacity models while avoiding overfitting due to the data-scarcity in single-pair settings.

• We analyze the trade-offs between model capacity, the number of training examples and the number of languages involved in such settings.

While this work was the first to scale NMT models to such settings, many subsequent works now train massively multilingual language models that enable cross-lingual transfer learning, for better NLP in under-resourced languages (Devlin et al., 2019; Conneau et al., 2019; Siddhant et al., 2019). Other subsequent works investigate scaling such models even further in terms of the number of parameters and training examples (Arivazhagan et al., 2019), and analyzed the emerging language families within their learned representations with respect to linguistic theories on the subject (Kudugunta et al., 2019). Improving such models and making them available to the public is of very high importance and has global impact, as most of the current research on NLP is mainly focused on English (Bender, 2019).
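A common recipe for such English-centric many-to-many models, following Johnson et al. (2017), is to prepend an artificial token to the source sentence indicating the desired target language, so that a single model can serve all directions. A minimal sketch of that preprocessing step (the exact token format used in Chapter 6 may differ) is:

```python
def add_target_token(source_tokens, target_lang):
    """Prepend an artificial target-language token to the source sentence,
    the mechanism of Johnson et al. (2017) for steering a single many-to-many
    model towards the desired output language. A sketch of the general recipe;
    the token format is an assumption for illustration."""
    return [f"<2{target_lang}>"] + source_tokens

# Toy usage: the same English source routed to German and to Hebrew.
src = ["hello", "world"]
print(add_target_token(src, "de"))   # ['<2de>', 'hello', 'world']
print(add_target_token(src, "he"))   # ['<2he>', 'hello', 'world']
```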

1.5 Emerging Domain Clusters in Pretrained Language Models

In Chapter 7, we investigate the use of large pretrained neural language models for the task of domain data selection for machine translation.

It is common knowledge in modern NLP that using large amounts of high-quality training data is a key aspect in building successful machine-learning based systems. For this reason, a major challenge when building such systems is obtaining data in the domain of interest. But what defines a domain? Natural language varies greatly across topics, styles, levels of formality, genres and many other linguistic nuances (van der Wees et al., 2015; van der Wees, 2017; Niu et al., 2017). This overwhelming diversity of language makes it hard to find the right data for the task, as it is nearly impossible to well-define the exact requirements from such data with respect to all the aforementioned aspects. On top of that, domain labels are usually unavailable, e.g. in useful large-scale web-crawled data like Common Crawl (Raffel et al., 2019).¹

Domain data selection is the task of selecting the most appropriate data for a domain from a large corpus, given a smaller set of in-domain data (Moore and Lewis, 2010; Axelrod et al., 2011; Duh et al., 2013; Silva et al., 2018). We propose to use the recent, highly successful self-supervised pretrained language models, e.g. Devlin et al. (2019); Liu et al. (2019), for domain data selection. As pretrained LMs demonstrate state-of-the-art performance across many NLP tasks after being trained on massive amounts of data, we hypothesize that the robust representations they learn may be useful for mapping sentences to domains in an unsupervised, data-driven approach. We show that these models indeed learn to cluster sentence representations into domains without further supervision, and quantify this by fitting Gaussian Mixture Models (GMMs) to the learned representations and measuring the purity of the resulting clustering. We then propose methods to leverage these emergent domain clusters for domain data selection in two ways:

¹ https://commoncrawl.org/

• Via distance-based retrieval in the sentence embedding space induced by the pretrained language model.

• By fine-tuning the pretrained language model for binary classification, where positive examples are from the domain of interest.
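As a sketch of the clustering analysis described above, the following uses scikit-learn's GaussianMixture to fit one component per domain to sentence embeddings and compute a simple purity score. Real embeddings would come from a pretrained LM (e.g. averaged hidden states); here they are faked with random vectors, and the purity computation is a simplified stand-in for the chapter's exact metric:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def cluster_purity(embeddings, domain_labels, n_domains):
    """Fit a GMM with one component per domain to sentence embeddings and
    measure purity: the fraction of sentences whose cluster's majority domain
    matches their own (an illustrative sketch of the analysis)."""
    clusters = GaussianMixture(n_components=n_domains,
                               random_state=0).fit_predict(embeddings)
    correct = 0
    for c in range(n_domains):
        members = [d for d, cl in zip(domain_labels, clusters) if cl == c]
        if members:
            correct += max(members.count(d) for d in set(members))
    return correct / len(domain_labels)

# Toy usage: two well-separated "domains" in a 16-dimensional embedding space.
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0, 1, (50, 16)), rng.normal(5, 1, (50, 16))])
labels = [0] * 50 + [1] * 50
print(cluster_purity(emb, labels, n_domains=2))   # close to 1.0
```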

Our methods enable selecting relevant data for the task while requiring only a small set of monolingual in-domain data. As they are based solely on the representations learned by self-supervised LMs, they do not require additional domain labels, which are usually vague and over-simplify the notion of domain in textual data. We evaluate our methods on data selection for neural machine translation (NMT) using the multi-domain German-English parallel corpus composed by Koehn and Knowles (2017). Our data selection methods enable training NMT models that outperform those trained using the well-established cross-entropy difference method of Moore and Lewis (2010) across five diverse domains, achieving a recall of more than 95% in all cases with respect to an oracle that selects the “true” in-domain data. To summarize, our contributions in Chapter 7 are as follows.

• We show that pre-trained language models are highly capable of clustering textual data to domains with high accuracy in a purely unsupervised manner.

• We propose methods to select in-domain data based on this property, using vector-space retrieval and positive-unlabeled fine-tuning of pretrained language models for binary classification.

• We show the applicability of our proposed data selection methods on a popular benchmark for domain adaptation in machine translation.

• An additional contribution is a new, improved data split we create for this benchmark, as we point out issues with previous splits used in the literature.

We hope this work will encourage more research on understanding the data landscape in NLP, enabling to “find the right data for the task” in the age of massive models and diverse data sources.

1.6 Outline

The next chapter covers more background and related work on neural networks and neural sequence-to-sequence learning. Chapters 3 to 7 are the main body of the work and describe the aforementioned research works in detail. Finally, Chapter 8 describes conclusions and directions for future work on the subject.

Chapter 2

Background

We begin by laying some background on neural sequence-to-sequence learning, and presenting related work.

2.1 Neural Networks Basics

Notation We use bold, lower-case letters for vectors (e.g. v) and bold, upper-case letters for matrices (e.g. W). As our basic building blocks, we begin by describing two types of neural networks: feed-forward neural networks, a.k.a. Multi-Layer Perceptrons (MLPs), and Recurrent Neural Networks (RNNs).

2.1.1 Feed-Forward Neural Networks

A feed-forward, single-layer neural network can be defined as a parameterized, non-linear function $f$ mapping an input vector $\mathbf{x} \in \mathbb{R}^{|x|}$ into an output vector $\mathbf{y} \in \mathbb{R}^{|y|}$ by computing:

$$\mathbf{y} = f_1(\mathbf{x}) = z(\mathbf{W}_1\mathbf{x} + \mathbf{b}_1) \quad (2.1)$$

where $z$ is an element-wise non-linear function (e.g. tanh or the sigmoid function, among others), and $\mathbf{W}_1 \in \mathbb{R}^{|y| \times |x|}$ and $\mathbf{b}_1 \in \mathbb{R}^{|y|}$ are the parameters of $f_1$. Multi-layer neural networks are compositions of functions of this form. For example, a two-layer feed-forward neural network may be of the form:

$$\mathbf{y} = f(\mathbf{x}) = f_2(f_1(\mathbf{x})) = z(\mathbf{W}_2\, z(\mathbf{W}_1\mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2) \quad (2.2)$$
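As a concrete illustration of Eqs. (2.1)-(2.2), the following minimal NumPy sketch composes two feed-forward layers with a tanh nonlinearity; the dimensions are arbitrary choices for the example:

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2, z=np.tanh):
    """Two-layer feed-forward network following Eqs. (2.1)-(2.2):
    y = z(W2 z(W1 x + b1) + b2), with z = tanh here."""
    h = z(W1 @ x + b1)      # first layer, Eq. (2.1)
    return z(W2 @ h + b2)   # second layer composed on top, Eq. (2.2)

# Toy usage: map a 4-dimensional input to a 3-dimensional output via 5 hidden units.
rng = np.random.default_rng(0)
x = rng.normal(size=4)
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)
W2, b2 = rng.normal(size=(3, 5)), np.zeros(3)
print(mlp_forward(x, W1, b1, W2, b2))
```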

2.1.2 Recurrent Neural Networks

A caveat of feed-forward networks is their inability to model an unbounded, variable-sized input, since their input is a fixed-sized vector. This is especially important in NLP, where we often need to model sentences which vary in length.

A recurrent neural network maps a sequence of input vectors $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3, \ldots, \mathbf{x}_t$ into a sequence of output vectors $\mathbf{h}_1, \mathbf{h}_2, \mathbf{h}_3, \ldots, \mathbf{h}_t$, allowing variable-length input. A common architecture for an RNN is the Elman RNN (Elman, 1990), which describes the following computation:

$$\mathbf{h}_t = \sigma(\mathbf{W}\mathbf{x}_t + \mathbf{U}\mathbf{h}_{t-1} + \mathbf{b}) \quad (2.3)$$

where $\mathbf{W} \in \mathbb{R}^{D_{RNN} \times |\mathbf{x}_t|}$, $\mathbf{U} \in \mathbb{R}^{D_{RNN} \times D_{RNN}}$ and $\mathbf{b} \in \mathbb{R}^{D_{RNN}}$ are the RNN's parameters. In practice, the Elman RNN is hard to train due to the “vanishing gradient” issue (Bengio et al., 1994; Hochreiter, 1998; Pascanu et al., 2013). A popular RNN architecture which is less sensitive to this issue is the Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997), which is defined as follows:

$$\begin{aligned}
\mathbf{i}_t &= \sigma(\mathbf{W}_i[\mathbf{x}_t, \mathbf{h}_{t-1}] + \mathbf{b}_i) \\
\mathbf{o}_t &= \sigma(\mathbf{W}_o[\mathbf{x}_t, \mathbf{h}_{t-1}] + \mathbf{b}_o) \\
\mathbf{f}_t &= \sigma(\mathbf{W}_f[\mathbf{x}_t, \mathbf{h}_{t-1}] + \mathbf{b}_f) \\
\tilde{\mathbf{c}}_t &= \tanh(\mathbf{W}_{\tilde{c}}[\mathbf{x}_t, \mathbf{h}_{t-1}] + \mathbf{b}_{\tilde{c}}) \\
\mathbf{c}_t &= \mathbf{i}_t \circ \tilde{\mathbf{c}}_t + \mathbf{f}_t \circ \mathbf{c}_{t-1} \\
\mathbf{h}_t &= \mathbf{o}_t \circ \tanh(\mathbf{c}_t)
\end{aligned} \quad (2.4)$$
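The gate equations in Eq. (2.4) translate almost directly into code. The following NumPy sketch of a single LSTM step is illustrative only (parameter names and shapes are assumptions); each weight matrix acts on the concatenation of the current input and the previous hidden state:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x_t, h_prev, c_prev, p):
    """One LSTM step following Eq. (2.4); p holds one (W, b) pair per gate."""
    z = np.concatenate([x_t, h_prev])                 # [x_t, h_{t-1}]
    i = sigmoid(p["Wi"] @ z + p["bi"])                # input gate
    o = sigmoid(p["Wo"] @ z + p["bo"])                # output gate
    f = sigmoid(p["Wf"] @ z + p["bf"])                # forget gate
    c_tilde = np.tanh(p["Wc"] @ z + p["bc"])          # candidate cell state
    c = i * c_tilde + f * c_prev                      # new cell state
    h = o * np.tanh(c)                                # new hidden state
    return h, c

# Toy usage: a single step with input size 3 and hidden size 4.
rng = np.random.default_rng(0)
D_in, D_h = 3, 4
p = {k: rng.normal(size=(D_h, D_in + D_h)) for k in ("Wi", "Wo", "Wf", "Wc")}
p.update({k: np.zeros(D_h) for k in ("bi", "bo", "bf", "bc")})
h, c = lstm_cell(rng.normal(size=D_in), np.zeros(D_h), np.zeros(D_h), p)
print(h.shape, c.shape)
```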

Another widely adopted RNN architecture is the Gated Recurrent Unit (GRU) (Cho et al., 2014; Chung et al., 2014) which is slightly simpler than the LSTM:

$$\begin{aligned}
\mathbf{z}_t &= \sigma(\mathbf{W}_z[\mathbf{h}_{t-1}, \mathbf{x}_t]) \\
\mathbf{r}_t &= \sigma(\mathbf{W}_r[\mathbf{h}_{t-1}, \mathbf{x}_t]) \\
\tilde{\mathbf{h}}_t &= \tanh(\mathbf{W}[\mathbf{r}_t \circ \mathbf{h}_{t-1}, \mathbf{x}_t]) \\
\mathbf{h}_t &= (1 - \mathbf{z}_t) \circ \mathbf{h}_{t-1} + \mathbf{z}_t \circ \tilde{\mathbf{h}}_t
\end{aligned} \quad (2.5)$$

2.2 Neural Machine Translation

Neural Machine Translation (NMT) is an approach to machine translation which models the task using neural networks as the learning algorithm. In this section we will formally define the task and present some of the basic components of a common NMT system.

2.2.1 Task Definition

We begin by formally defining the machine translation task. Given a source sentence $F = f_1, f_2, f_3, \ldots, f_{|F|}$, we would like to find a translation $E = e_1, e_2, e_3, \ldots, e_{|E|}$. Our task is to find a function $mt$ for which $mt(F) = \hat{E}$, where $\hat{E}$ is a translation of $F$. In Statistical Machine Translation (SMT), we would like to create a probabilistic model which estimates $p(E \mid F, \theta)$, where $\theta$ is a set of parameters learned from example translations (parallel corpora). With such a model in hand, one can find a translation by searching for $mt(F) = \arg\max_E p(E \mid F, \theta)$.

Neural Machine Translation In neural machine translation, we would like to model $p(E \mid F, \theta)$ using a neural network. We will now describe the structure of such a network and how it is trained.


Figure 2.1: An illustration of an NMT model with a bidirectional RNN encoder, an RNN decoder and an attention mechanism.

2.2.2 Dense Word Representations

In order to perform the translation we first need to define a representation for the words, or tokens, in the input and output sentences. These representations will be fed into the neural network. Each source word $w_i \in \Sigma_{input}$, where $\Sigma_{input}$ is the input vocabulary of size $|\Sigma_{input}|$, will be represented using a fixed-sized vector $\mathbf{w}_i$ of size $D_{input}$, and similarly for the output words: for each word $v_j$ in the output vocabulary $\Sigma_{output}$ we will have a vector $\mathbf{v}_j$ of size $D_{output}$. Note that this definition results in two matrices: $\mathbf{W} \in \mathbb{R}^{|\Sigma_{input}| \times D_{input}}$, which we will call the input embedding matrix, and $\mathbf{U} \in \mathbb{R}^{|\Sigma_{output}| \times D_{output}}$, which we will call the output embedding matrix.

2.2.3 Encoder

Once we have defined the dense word representations for the words in the vocabularies, we can define the first part of the network, which we name the Encoder. The Encoder's role is to create a representation of the entire input sentence which will capture the interactions and dependencies between the different words in it.

In general, we can define the Encoder as a parameterized function that receives a sequence of input embeddings $\mathbf{f}_1, \mathbf{f}_2, \mathbf{f}_3, \ldots, \mathbf{f}_N$ and outputs a sequence of annotation vectors $\hat{\mathbf{f}} = \hat{\mathbf{f}}_1, \hat{\mathbf{f}}_2, \hat{\mathbf{f}}_3, \ldots, \hat{\mathbf{f}}_N$, where each annotation vector is of a fixed size $D_{encoder}$. While there are many different encoder variations in the literature, like convolutional encoders (Kalchbrenner and Blunsom, 2013; Kalchbrenner et al., 2016; Gehring et al., 2017) and self-attention based encoders (Vaswani et al., 2017), we will focus here on the more common RNN-based ones:

• Uni-Directional RNN Encoder In the first work which successfully applied NMT in a relatively large-scale scenario (Sutskever et al., 2014b), the encoder was modeled using an LSTM which was fed with the word embeddings of the input sentence. In this case the size of each annotation vector, $D_{encoder}$, is the size of the LSTM output.

• Bi-Directional RNN Encoder Another variation of an RNN-based encoder is the bidirectional RNN encoder (Schuster and Paliwal, 1997), which consists of two uni-directional RNN encoders, one fed with the input embeddings in a left-to-right order (forward RNN) and the other in a right-to-left order (backward RNN). The outputs from each RNN for every input embedding are then concatenated, resulting in a representation which captures both the right and the left context of every token in the sentence. In this case the size of each annotation vector, $D_{encoder}$, is the sum of the sizes of the forward RNN output and the backward RNN output.
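As an illustration of the bidirectional encoder described above, here is a minimal NumPy sketch that uses simple Elman RNNs (Eq. (2.3)) in each direction instead of LSTMs; names and dimensions are arbitrary choices for the example:

```python
import numpy as np

def rnn_states(embeddings, W, U, b):
    """Run a simple (Elman) RNN over a sequence of embeddings, Eq. (2.3),
    and return the hidden state at every position."""
    h = np.zeros(U.shape[0])
    states = []
    for x_t in embeddings:
        h = np.tanh(W @ x_t + U @ h + b)
        states.append(h)
    return states

def bidirectional_encode(embeddings, fwd_params, bwd_params):
    """Concatenate forward and backward RNN states for every input position,
    yielding one annotation vector per token as described above."""
    fwd = rnn_states(embeddings, *fwd_params)
    bwd = rnn_states(embeddings[::-1], *bwd_params)[::-1]
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

# Toy usage: 5 tokens, embedding size 4, hidden size 3 per direction.
rng = np.random.default_rng(0)
sent = [rng.normal(size=4) for _ in range(5)]
make_params = lambda: (rng.normal(size=(3, 4)), rng.normal(size=(3, 3)), np.zeros(3))
annotations = bidirectional_encode(sent, make_params(), make_params())
print(len(annotations), annotations[0].shape)  # 5 annotation vectors of size 6
```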

2.2.4 Decoder

Given the annotation vectors produced by the encoder, one approach (Sutskever et al., 2014b) is to use the last annotation vector $\hat{\mathbf{f}}_{|F|}$ (generated after feeding the embedding of the last token in the input sentence) as the representation of the entire sentence. We can then “decode” the translation of the source sentence in a left-to-right manner using another neural-network component, called the “Decoder”. Generally, we can say that the decoder is a parameterized function $g$, mapping an annotation vector and a sequence of output embeddings to the output distribution at a given time step:

$$p(e_t \mid f_{1:|F|}, e_{1:t-1}) = g(\hat{\mathbf{f}}_{|F|}, \mathbf{e}_{1:t-1}) \quad (2.6)$$

In practice, $g$ is implemented using a decoder RNN, whose output vector at time step $t$ we denote by $\mathbf{s}_t$:

$$\mathbf{s}_t = \mathrm{LSTM}_{dec}(\hat{\mathbf{f}}_{|F|}, \mathbf{e}_1, \ldots, \mathbf{e}_{t-1}) \quad (2.7)$$

We will then project the decoder output $\mathbf{s}_t$ to the size of the output vocabulary $|\Sigma_{output}|$ using a learned projection matrix $\mathbf{W} \in \mathbb{R}^{D_{output} \times |\Sigma_{output}|}$, and apply the softmax function ($\mathbf{b} \in \mathbb{R}^{|\Sigma_{output}|}$ is a bias parameter):

$$g(\hat{\mathbf{f}}_{|F|}, \mathbf{e}_{1:t-1}) = \mathrm{softmax}(\mathbf{W}^{\top}\mathbf{s}_t + \mathbf{b}) \quad (2.8)$$

where the softmax function is defined as:

$$\mathrm{softmax}(\mathbf{x})_i = \frac{e^{x_i}}{\sum_{j=1}^{k} e^{x_j}}, \qquad \mathrm{softmax}: \mathbb{R}^k \rightarrow \left\{ \mathbf{z} \in \mathbb{R}^k \;\middle|\; z_i > 0, \; \sum_{i=1}^{k} z_i = 1 \right\} \quad (2.9)$$

Given the above, we can estimate the probability of a translation given the input sentence and the network's parameters using the chain rule:

$$p(E \mid F, \theta) = p(e_1, e_2, e_3, \ldots, e_{|E|} \mid f_1, f_2, f_3, \ldots, f_{|F|}) = \prod_{t=1}^{|E|} p(e_t \mid f_{1:|F|}, e_{1:t-1}) \quad (2.10)$$
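As a small illustration of Eq. (2.10), the following sketch computes the log-probability of a target sequence by summing per-step log-probabilities, assuming we already have the decoder's softmax output at every step (the inputs here are toy values, not from a real model):

```python
import numpy as np

def sequence_log_prob(step_distributions, target_ids):
    """Log-probability of a target sequence under Eq. (2.10):
    sum over time steps of log p(e_t | f_{1:|F|}, e_{1:t-1}).
    step_distributions[t] is the softmax output at step t (a vector over the
    output vocabulary); target_ids[t] is the index of the gold word e_t."""
    return sum(np.log(dist[idx]) for dist, idx in zip(step_distributions, target_ids))

# Toy usage with a 4-word vocabulary and a 3-word target sentence.
dists = [np.array([0.7, 0.1, 0.1, 0.1]),
         np.array([0.2, 0.5, 0.2, 0.1]),
         np.array([0.1, 0.1, 0.1, 0.7])]
print(sequence_log_prob(dists, [0, 1, 3]))
```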

2.2.5 The Attention Mechanism

While relatively simple, the above approach creates an “information bottleneck”, since both long and short sentences need to be “compressed” into a single fixed-sized encoding vector. To alleviate this, the attention mechanism was introduced (Bahdanau et al., 2015), proposing a new approach for using the annotation vectors produced by the encoder. Instead of using only the last annotation vector $\hat{\mathbf{f}}_{|F|}$ in the decoder, we compute at each time step $t$ a new context vector $\mathbf{c}_t$, which is a weighted sum over all of the input annotation vectors:

$$\mathbf{c}_t = \sum_{i=1}^{|F|} \alpha_{ti} \hat{\mathbf{f}}_i, \qquad \mathbf{c}_t \in \mathbb{R}^{D_{output}} \quad (2.11)$$

where:

$$\sum_{i=1}^{|F|} \alpha_{ti} = 1 \quad (2.12)$$

since each $\alpha_{ti}$ value is the result of a softmax function over $|F|$ $e_{ti}$ values. An important question in the attention mechanism is how to compute those $e_{ti}$ values, or attention weights, which are scalars that can be interpreted as the importance of the annotation vector $\hat{\mathbf{f}}_i$ in the prediction of the output distribution at time step $t$. We will mention two popular methods here:

• Feed-Forward Attention, also known as concat attention or MLP attention. In this case the $e_{ti}$ values are computed using a feed-forward network:¹

$$e_{ti} = a(\mathbf{s}_t, \hat{\mathbf{f}}_i) = \mathbf{v}_a^{\top} \tanh(\mathbf{W}_a[\mathbf{s}_t, \hat{\mathbf{f}}_i]) \quad (2.13)$$

¹ In the original paper (Bahdanau et al., 2015) the attention weights were computed using $\mathbf{s}_{t-1}$. We show the version of Luong et al. (2015b), which uses $\mathbf{s}_t$ instead.

• Dot-Product Attention Here the $e_{ti}$ values are computed using a dot product between the decoder state and the annotation vector:

$$e_{ti} = a(\mathbf{s}_t, \hat{\mathbf{f}}_i) = \mathbf{s}_t^{\top} \hat{\mathbf{f}}_i \quad (2.14)$$

We then use this context vector $\mathbf{c}_t$ when computing the output distribution at each time step. For example, in Luong et al. (2015b) this was done by:

$$p(e_t \mid f_{1:|F|}, e_{1:t-1}) = \mathrm{softmax}(\mathbf{W}_s \tanh(\mathbf{W}_c[\mathbf{s}_t, \mathbf{c}_t])) \quad (2.15)$$

where $\mathbf{W}_s \in \mathbb{R}^{|\Sigma_{output}| \times 2D_{output}}$ and $\mathbf{W}_c \in \mathbb{R}^{2D_{output} \times 2D_{output}}$ are model parameters.
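Putting Eqs. (2.11)-(2.15) together, here is a minimal NumPy sketch of one dot-product attention step with a Luong-style output layer; it assumes the decoder state and the annotation vectors share the same dimensionality, and all sizes and parameter names are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_step(s_t, annotations, W_c, W_s):
    """One attention step: dot-product scores (Eq. 2.14), softmax weights,
    context vector (Eq. 2.11), and the output distribution of Eq. (2.15)."""
    scores = np.array([s_t @ f_i for f_i in annotations])    # e_ti
    alpha = softmax(scores)                                   # attention weights
    c_t = sum(a * f_i for a, f_i in zip(alpha, annotations))  # context vector
    combined = np.tanh(W_c @ np.concatenate([s_t, c_t]))
    return softmax(W_s @ combined), alpha

# Toy usage: decoder state and annotations of size 4, vocabulary of size 6.
rng = np.random.default_rng(0)
D, V = 4, 6
s_t = rng.normal(size=D)
annotations = [rng.normal(size=D) for _ in range(5)]
W_c = rng.normal(size=(2 * D, 2 * D))
W_s = rng.normal(size=(V, 2 * D))
probs, alpha = attention_step(s_t, annotations, W_c, W_s)
print(probs.sum().round(3), alpha.round(2))
```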

2.2.6 Training Objective

In order to train an NMT system, the most popular objective is cross-entropy (a.k.a. negative log-likelihood), where we try to minimize the following term:

$$\mathcal{L}(\theta) = -\sum_{(F,E)} \sum_{t=1}^{|E|} \log p(e_t \mid f_{1:|F|}, e_{1:t-1}) \quad (2.16)$$

Note that this is usually optimized using Stochastic Gradient Descent (SGD), where we take the derivative of the objective w.r.t. the model parameters and update them by taking a step in the direction opposite to the gradient. Another thing to note is that in SGD we do not compute the objective over the entire training set, but over a single example or a mini-batch.
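To make the optimization concrete, the following sketch performs a few SGD updates on the per-step cross-entropy term of Eq. (2.16), for the output projection parameters only and with the decoder state held fixed; it is a simplified illustration, not full back-propagation through an NMT model:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sgd_step(W, b, s_t, gold_idx, lr=0.1):
    """One SGD update on -log softmax(W s_t + b)[gold_idx], the per-step
    cross-entropy term of Eq. (2.16), using the analytic gradient
    (probs - one_hot) with respect to the logits."""
    probs = softmax(W @ s_t + b)
    loss = -np.log(probs[gold_idx])
    d_logits = probs.copy()
    d_logits[gold_idx] -= 1.0          # gradient of the loss w.r.t. the logits
    W -= lr * np.outer(d_logits, s_t)  # step against the gradient
    b -= lr * d_logits
    return loss

# Toy usage: hidden size 4, vocabulary size 5; the loss shrinks over updates.
rng = np.random.default_rng(0)
W, b = rng.normal(size=(5, 4)), np.zeros(5)
s_t, gold = rng.normal(size=4), 2
print([round(sgd_step(W, b, s_t, gold), 3) for _ in range(3)])
```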

2.2.7 Inference and Beam-search

Once we have a trained model as described above, how do we predict translations for a given source-language sentence using this model? As we noted in the beginning of this section, our goal is to find:

$$\hat{E} = \arg\max_{E} p(E \mid F, \theta) \quad (2.17)$$

or, according to our modeling:

$$\hat{E} = \arg\max_{E} \prod_{t=1}^{|E|} p(e_t \mid f_{1:|F|}, e_{1:t-1}) \quad (2.18)$$
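Solving Eq. (2.18) exactly is intractable, which is why beam search, illustrated in Figure 2.2, is commonly used. As a rough sketch of the general procedure (not the exact algorithm or scoring used to produce the figure; step_fn, bos_id and eos_id are hypothetical stand-ins for the trained model and its vocabulary), consider:

```python
import math
import numpy as np

def beam_search(step_fn, bos_id, eos_id, beam_size=3, max_len=10):
    """Generic beam search: keep the beam_size highest-scoring partial
    hypotheses, expand each with the model's next-word distribution, and
    move hypotheses that emit the end-of-sentence symbol to the finished set."""
    beams = [([bos_id], 0.0)]                      # (token ids, log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            probs = step_fn(tokens)                # p(e_t | prefix)
            for idx, p in enumerate(probs):
                candidates.append((tokens + [idx], score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            (finished if tokens[-1] == eos_id else beams).append((tokens, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])

# Toy usage: a fake "model" that prefers token 1, then ends with token 2 (EOS).
def toy_step(tokens):
    return np.array([0.1, 0.6, 0.3]) if len(tokens) < 3 else np.array([0.05, 0.05, 0.9])

print(beam_search(toy_step, bos_id=0, eos_id=2))
```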

² This figure was generated using the Nematus NMT toolkit (Sennrich et al., 2017).

Figure 2.2: Example beam-search graph from a German-to-English NMT system, where the beam size is set to 10. Each node corresponds to a partial hypothesis, and leaves correspond to complete hypotheses. Nodes in light blue participate in the highest-scored complete hypothesis. The number to the right of each word is the probability of the word according to the local softmax computation in this time step. The number beneath each word is the negative log-likelihood of the partial hypothesis that ends in this word.

Chapter 3

Hard Attention Architectures for Morphological Inflection Generation

Morphological Inflection Generation with Hard Monotonic Attention

Roee Aharoni & Yoav Goldberg
Computer Science Department, Bar-Ilan University, Ramat-Gan, Israel
{roee.aharoni,yoav.goldberg}@gmail.com

Abstract

We present a neural model for morphological inflection generation which employs a hard attention mechanism, inspired by the nearly-monotonic alignment commonly found between the characters in a word and the characters in its inflection. We evaluate the model on three previously studied morphological inflection generation datasets and show that it provides state of the art results in various setups compared to previous neural and non-neural approaches. Finally we present an analysis of the continuous representations learned by both the hard and soft attention (Bahdanau et al., 2015) models for the task, shedding some light on the features such models extract.

1 Introduction

Morphological inflection generation involves generating a target word (e.g. “härtestem”, the German word for “hardest”), given a source word (e.g. “hart”, the German word for “hard”) and the morpho-syntactic attributes of the target (POS=adjective, gender=masculine, type=superlative, etc.).

The task is important for many down-stream NLP tasks such as machine translation, especially for dealing with data sparsity in morphologically rich languages where a lemma can be inflected into many different word forms. Several studies have shown that translating into lemmas in the target language and then applying inflection generation as a post-processing step is beneficial for phrase-based machine translation (Minkov et al., 2007; Toutanova et al., 2008; Clifton and Sarkar, 2011; Fraser et al., 2012; Chahuneau et al., 2013) and more recently for neural machine translation (García-Martínez et al., 2016).

The task was traditionally tackled with hand engineered finite state transducers (FST) (Koskenniemi, 1983; Kaplan and Kay, 1994) which rely on expert knowledge, or using trainable weighted finite state transducers (Mohri et al., 1997; Eisner, 2002) which combine expert knowledge with data-driven parameter tuning. Many other machine-learning based methods (Yarowsky and Wicentowski, 2000; Dreyer and Eisner, 2011; Durrett and DeNero, 2013; Hulden et al., 2014; Ahlberg et al., 2015; Nicolai et al., 2015) were proposed for the task, although with specific assumptions about the set of possible processes that are needed to create the output sequence.

More recently, the task was modeled as neural sequence-to-sequence learning over character sequences with impressive results (Faruqui et al., 2016). The vanilla encoder-decoder models as used by Faruqui et al. compress the input sequence to a single, fixed-sized continuous representation. Instead, the soft-attention based sequence to sequence learning paradigm (Bahdanau et al., 2015) allows directly conditioning on the entire input sequence representation, and was utilized for morphological inflection generation with great success (Kann and Schütze, 2016b,a).

However, the neural sequence-to-sequence models require large training sets in order to perform well: their performance on the relatively small CELEX dataset is inferior to the latent variable WFST model of Dreyer et al. (2008). Interestingly, the neural WFST model by Rastogi et al. (2016) also suffered from the same issue on the CELEX dataset, and surpassed the latent variable model only when given twice as much data to train on.


We propose a model which handles the above issues by directly modeling an almost monotonic alignment between the input and output character sequences, which is commonly found in the morphological inflection generation task (e.g. in languages with concatenative morphology). The model consists of an encoder-decoder neural network with a dedicated control mechanism: in each step, the model attends to a single input state and either writes a symbol to the output sequence or advances the attention pointer to the next state from the bi-directionally encoded sequence, as described visually in Figure 1.

This modeling suits the natural monotonic alignment between the input and output, as the network learns to attend to the relevant inputs before writing the output which they are aligned to. The encoder is a bi-directional RNN, where each character in the input word is represented using a concatenation of a forward RNN and a backward RNN states over the word's characters. The combination of the bi-directional encoder and the controllable hard attention mechanism enables to condition the output on the entire input sequence. Moreover, since each character representation is aware of the neighboring characters, non-monotone relations are also captured, which is important in cases where segments in the output word are a result of long range dependencies in the input word. The recurrent nature of the decoder, together with a dedicated feedback connection that passes the last prediction to the next decoder step explicitly, enables the model to also condition the current output on all the previous outputs at each prediction step.

The hard attention mechanism allows the network to jointly align and transduce while using a focused representation at each step, rather than the weighted sum of representations used in the soft attention model. This makes our model Resolution Preserving (Kalchbrenner et al., 2016) while also keeping decoding time linear in the output sequence length rather than multiplicative in the input and output lengths as in the soft-attention model. In contrast to previous sequence-to-sequence work, we do not require the training procedure to also learn the alignment. Instead, we use a simple training procedure which relies on independently learned character-level alignments, from which we derive gold transduction+control sequences. The network can then be trained using straightforward cross-entropy loss.

To evaluate our model, we perform extensive experiments on three previously studied morphological inflection generation datasets: the CELEX dataset (Baayen et al., 1993), the Wiktionary dataset (Durrett and DeNero, 2013) and the SIGMORPHON 2016 dataset (Cotterell et al., 2016). We show that while our model is on par with or better than the previous neural and non-neural state-of-the-art approaches, it also performs significantly better with very small training sets, being the first neural model to surpass the performance of the weighted FST model with latent variables which was specifically tailored for the task by Dreyer et al. (2008). Finally, we analyze and compare our model and the soft attention model, showing how they function very similarly with respect to the alignments and representations they learn, in spite of our model being much simpler. This analysis also sheds light on the representations such models learn for the morphological inflection generation task, showing how they encode specific features like a symbol's type and the symbol's location in a sequence.

To summarize, our contributions in this paper are three-fold:

1. We present a hard attention model for nearly-monotonic sequence to sequence learning, as common in the morphological inflection setting.

2. We evaluate the model on the task of morphological inflection generation, establishing a new state of the art on three previously-studied datasets for the task.

3. We perform an analysis and comparison of our model and the soft-attention model, shedding light on the features such models extract for the inflection generation task.

2 The Hard Attention Model

2.1 Motivation

We would like to transduce an input sequence, $x_{1:n} \in \Sigma_x^*$, into an output sequence, $y_{1:m} \in \Sigma_y^*$, where $\Sigma_x$ and $\Sigma_y$ are the input and output vocabularies, respectively. Imagine a machine with read-only random access to the encoding of the input sequence, and a single pointer that determines the current read location. We can then model sequence transduction as a series of pointer movement and write operations.

the memory can be read in sequential order, where the pointer movement is controlled by a single "move forward" operation (step) which we add to the output vocabulary. We implement this behavior using an encoder-decoder neural network, with a control mechanism which determines in each step of the decoder whether to write an output symbol or promote the attention pointer to the next element of the encoded input.
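The following is a minimal, illustrative sketch of this read-pointer machine (ours, not the released implementation): the decoder either writes a symbol or advances the pointer with a step action. The names STEP, EOS and predict_action are assumptions for illustration; in the real model the inputs are bi-LSTM encodings and predict_action is the trained network described below.

```python
# Minimal sketch of the hard-attention "machine": a single read pointer over the
# encoded input, and a controller that either writes a symbol or emits `step`.
STEP = "<step>"
EOS = "</s>"

def transduce(encoded_inputs, predict_action, max_actions=100):
    """Greedy hard-attention decoding over a monotonically aligned input."""
    pointer = 0          # index of the currently attended input state
    output = []          # written output symbols
    prev_action = "<s>"  # feedback input to the decoder
    for _ in range(max_actions):
        attended = encoded_inputs[min(pointer, len(encoded_inputs) - 1)]
        action = predict_action(attended, prev_action, output)
        if action == EOS:
            break
        if action == STEP:
            pointer = min(pointer + 1, len(encoded_inputs) - 1)  # advance attention
        else:
            output.append(action)  # a write action emits one output symbol
        prev_action = action
    return "".join(output)

# Toy usage with a scripted "model" that replays a fixed action sequence.
scripted = iter(["f", "l", STEP, STEP, "i", "e", "g", STEP, "e", EOS])
print(transduce(list("flog"), lambda *_: next(scripted)))  # -> "fliege"
```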

2.2 Model Definition

In prediction time, we seek the output sequence y_{1:m} ∈ Σ_y*, for which:

    y_{1:m} = argmax_{y' ∈ Σ_y*} p(y' | x_{1:n}, f)    (1)

where x ∈ Σ_x* is the input sequence and f = {f_1, . . . , f_l} is a set of attributes influencing the transduction task (in the inflection generation task these would be the desired morpho-syntactic attributes of the output sequence). Given a nearly-monotonic alignment between the input and the output, we replace the search for a sequence of letters with a search for a sequence of actions s_{1:q} ∈ Σ_s*, where Σ_s = Σ_y ∪ {step}. This sequence is a series of step and write actions required to go from x_{1:n} to y_{1:m} according to the monotonic alignment between them (we will elaborate on the deterministic process of getting s_{1:q} from a monotonic alignment between x_{1:n} and y_{1:m} in Section 2.4). In this case we define:

    s_{1:q} = argmax_{s' ∈ Σ_s*} p(s' | x_{1:n}, f)
            = argmax_{s' ∈ Σ_s*} ∏_{s'_i ∈ s'} p(s'_i | s'_1 . . . s'_{i-1}, x_{1:n}, f)    (2)

which we can estimate using a neural network:

    s_{1:q} = argmax_{s' ∈ Σ_s*} NN(x_{1:n}, f, Θ)    (3)

where the network's parameters Θ are learned using a set of training examples. We will now describe the network architecture.

[Footnote 1: We note that our model (Eq. 2) solves a different objective than (Eq. 1), as it searches for the best derivation and not the best sequence. In order to accurately solve (1) we would need to marginalize over the different derivations leading to the same sequence, which is computationally challenging. However, as we see in the experiments section, the best-derivation approximation is effective in practice.]

Figure 1: The hard attention network architecture. A round tip expresses concatenation of the inputs it receives. The attention is promoted to the next input element once a step action is predicted.

2.3 Network Architecture

Notation  We use bold letters for vectors and matrices. We treat LSTM as a parameterized function LSTM_θ(x_1 . . . x_n) mapping a sequence of input vectors x_1 . . . x_n to an output vector h_n. The equations for the LSTM variant we use are detailed in the supplementary material of this paper.

Encoder  For every element in the input sequence x_{1:n} = x_1 . . . x_n, we take the corresponding embedding e_{x_1} . . . e_{x_n}, where e_{x_i} ∈ R^E. These embeddings are parameters of the model which will be learned during training. We then feed the embeddings into a bi-directional LSTM encoder (Graves and Schmidhuber, 2005), which results in a sequence of vectors x_{1:n} = x_1 . . . x_n, where each vector x_i ∈ R^{2H} is a concatenation of LSTM_forward(e_{x_1}, e_{x_2}, . . . , e_{x_i}) and LSTM_backward(e_{x_n}, e_{x_{n-1}}, . . . , e_{x_i}), the forward LSTM and the backward LSTM outputs when fed with e_{x_i}.

Decoder  Once the input sequence is encoded, we feed the decoder RNN, LSTM_dec, with three inputs at each step:

1. The current attended input, x_a ∈ R^{2H}, initialized with the first element of the encoded sequence, x_1.

2. A set of embeddings for the attributes that influence the generation process, concatenated to a single vector: f = [f_1 . . . f_l] ∈ R^{F·l}.

3. s_{i-1} ∈ R^E, which is an embedding for the predicted output symbol in the previous decoder step.

Those three inputs are concatenated into a single vector z_i = [x_a, f, s_{i-1}] ∈ R^{2H + F·l + E}, which is fed into the decoder, providing the decoder output vector LSTM_dec(z_1 . . . z_i) ∈ R^H. Finally, to model the distribution over the possible actions, we project the decoder output to a vector of |Σ_s| elements, followed by a softmax layer:

    p(s_i = c) = softmax_c(W · LSTM_dec(z_1 . . . z_i) + b)    (4)

Control Mechanism  When the most probable action is step, the attention is promoted so that x_a contains the next encoded input representation to be used in the next step of the decoder. The process is demonstrated visually in Figure 1.
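As a concrete illustration of the decoder computation in Eqs. (1)-(4), here is a shape-level NumPy sketch (ours, not the released DyNet implementation). All dimensions and parameter names are illustrative, and a single tanh layer stands in for LSTM_dec.

```python
# One decoder step: concatenate [x_a, f, s_{i-1}] into z_i, run it through the
# decoder (placeholder layer here), project, and softmax over Sigma_s (Eq. 4).
import numpy as np

H2, F_TOTAL, E, H, N_ACTIONS = 200, 60, 100, 100, 30  # illustrative sizes
rng = np.random.default_rng(0)

W_dec = rng.normal(size=(H, H2 + F_TOTAL + E)) * 0.01   # stand-in for LSTM_dec
W_out = rng.normal(size=(N_ACTIONS, H)) * 0.01          # projection W in Eq. (4)
b_out = np.zeros(N_ACTIONS)

def decoder_step(x_a, f, s_prev):
    z = np.concatenate([x_a, f, s_prev])         # z_i = [x_a, f, s_{i-1}]
    h = np.tanh(W_dec @ z)                       # placeholder for LSTM_dec(z_1..z_i)
    logits = W_out @ h + b_out
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()                   # p(s_i = c), Eq. (4)

p = decoder_step(rng.normal(size=H2), rng.normal(size=F_TOTAL), rng.normal(size=E))
print(p.shape, p.sum())  # (30,) ~1.0
```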

2.4 Training the Model

For every example (x_{1:n}, y_{1:m}, f) in the training data, we should produce a sequence of step and write actions s_{1:q} to be predicted by the decoder. The sequence is dependent on the alignment between the input and the output: ideally, the network will attend to all the input characters aligned to an output character before writing it.

While recent work in sequence transduction advocates jointly training the alignment and the decoding mechanisms (Bahdanau et al., 2015; Yu et al., 2016), we instead show that in our case it is worthwhile to decouple these stages and learn a hard alignment beforehand, using it to guide the training of the encoder-decoder network and enabling the use of correct alignments for the attention mechanism from the beginning of the network training phase. Thus, our training procedure consists of three stages: learning hard alignments, deriving oracle actions from the alignments, and learning a neural transduction model given the oracle actions.

Learning Hard Alignments  We use the character alignment model of Sudoh et al. (2013), based on a Chinese Restaurant Process which weights single alignments (character-to-character) in proportion to how many times such an alignment has been seen elsewhere out of all possible alignments. The aligner implementation we used produces either 0-to-1, 1-to-0 or 1-to-1 alignments.

Deriving Oracle Actions  We infer the sequence of actions s_{1:q} from the alignments by the deterministic procedure described in Algorithm 1. An example of an alignment with the resulting oracle action sequence is shown in Figure 2, where a_4 is a 0-to-1 alignment and the rest are 1-to-1 alignments.

Figure 2: Top: an alignment between a lemma x_{1:n} and an inflection y_{1:m} as predicted by the aligner. Bottom: s_{1:q}, the sequence of actions to be predicted by the network, as produced by Algorithm 1 for the given alignment.

Algorithm 1  Generates the oracle action sequence s_{1:q} from the alignment between x_{1:n} and y_{1:m}
Require: a, the list of either 1-to-1, 1-to-0 or 0-to-1 alignments between x_{1:n} and y_{1:m}
1: Initialize s as an empty sequence
2: for each a_i = (x_{a_i}, y_{a_i}) ∈ a do
3:    if a_i is a 1-to-0 alignment then
4:        s.append(step)
5:    else
6:        s.append(y_{a_i})
7:        if a_{i+1} is not a 0-to-1 alignment then
8:            s.append(step)
9: return s

This procedure makes sure that all the source input elements aligned to an output element are read (using the step action) before writing it.

Learning a Neural Transduction Model  The network is trained to mimic the actions of the oracle, and at inference time it will predict the actions according to the input. We train it using a conventional cross-entropy loss function per example:

    L(x_{1:n}, y_{1:m}, f, Θ) = − Σ_{s_j ∈ s_{1:q}} log softmax_{s_j}(d),    d = W · LSTM_dec(z_1 . . . z_i) + b    (5)

Transition System  An alternative view of our model is that of a transition system with ADVANCE and WRITE(ch) actions, where the oracle is derived from a given hard alignment, the input is encoded using a biRNN, and the next action is determined by an RNN over the previous inputs and actions.
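For readers who prefer code, the oracle derivation of Algorithm 1 can be rendered directly in Python as follows. This is a sketch: the alignment representation (pairs with None marking the empty side) and the names are ours, and the toy alignment is hypothetical.

```python
# Algorithm 1 in Python: each alignment element is an (input_char, output_char)
# pair; (c, None) is a 1-to-0 alignment, (None, c) is a 0-to-1 alignment.
STEP = "<step>"

def oracle_actions(alignments):
    """Derive the oracle step/write action sequence s_1:q from an alignment."""
    actions = []
    for i, (x_char, y_char) in enumerate(alignments):
        if y_char is None:                      # 1-to-0: consume input only
            actions.append(STEP)
        else:
            actions.append(y_char)              # write the aligned output char
            nxt = alignments[i + 1] if i + 1 < len(alignments) else None
            # advance unless the next alignment writes from the same input (0-to-1)
            if not (nxt is not None and nxt[0] is None):
                actions.append(STEP)
    return actions

# Hypothetical alignment in the spirit of Figure 2 (flog -> fliege):
align = [("f", "f"), ("l", "l"), ("o", "i"), (None, "e"), ("g", "g"), (None, "e")]
print(oracle_actions(align))
# ['f', '<step>', 'l', '<step>', 'i', 'e', '<step>', 'g', 'e', '<step>']
```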

2007 3 Experiments Wiktionary To neutralize the negative effect of very small training sets on the performance of We perform extensive experiments with three pre- the different learning approaches, we also evalu- viously studied morphological inflection genera- ate our model on the dataset created by Durrett tion datasets to evaluate our hard attention model and DeNero(2013), which contains up to 360k in various settings. In all experiments we com- training examples per language. It was built by pare our hard attention model to the best per- extracting Finnish, German and Spanish inflection forming neural and non-neural models which were tables from Wiktionary, used in order to evalu- previously published on those datasets, and to ate their system based on string alignments and our implementation of the global (soft) attention a semi-CRF sequence classifier with linguistically model presented by Luong et al.(2015) which we inspired features, which we use a baseline. We train with identical hyper-parameters as our hard- also used the dataset expansion made by Nicolai attention model. The implementation details for et al.(2015) to include French and Dutch inflec- our models are described in the supplementary tions as well. Their system also performs an align- material section of this paper. The source code and-transduce approach, extracting rules from the 2 and data for our models is available on github. aligned training set and applying them in inference time with a proprietary character sequence classi- CELEX Our first evaluation is on a very small dataset, to see if our model indeed avoids the ten- fier. In addition to those systems we also com- dency to overfit with small training sets. We re- pare to the results of the recent neural approach port exact match accuracy on the German inflec- of Faruqui et al.(2016), which did not use an at- tion generation dataset compiled by Dreyer et al. tention mechanism, and Yu et al.(2016), which (2008) from the CELEX database (Baayen et al., coupled the alignment and transduction tasks. 1993). The dataset includes only 500 training SIGMORPHON As different languages show examples for each of the four inflection types: different morphological phenomena, we also ex- 13SIA 13SKE, 2PIE 13PKE, 2PKE z, and → → → periment with how our model copes with these rP pA which we refer to as 13SIA, 2PIE, 2PKE various phenomena using the morphological in- → 3 and rP, respectively. We first compare our model flection dataset from the SIGMORPHON2016 to three competitive models from the literature that shared task (Cotterell et al., 2016). Here the reported results on this dataset: the Morphologi- training data consists of ten languages, with five cal Encoder-Decoder (MED) of Kann and Schutze¨ morphological system types (detailed in Table3): (2016a) which is based on the soft-attention model Russian (RU), German (DE), Spanish (ES), Geor- of Bahdanau et al.(2015), the neural-weighted gian (GE), Finnish (FI), Turkish (TU), Arabic FST of Rastogi et al.(2016) which uses stacked (AR), Navajo (NA), Hungarian (HU) and Maltese bi-directional LSTM’s to weigh its arcs (NWFST), (MA) with roughly 12,800 training and 1600 de- and the model of Dreyer et al.(2008) which uses velopment examples per language. We compare a weighted FST with latent-variables structured our model to two soft attention baselines on this particularly for morphological string transduction dataset: MED (Kann and Schutze¨ , 2016b), which tasks (LAT). 
Following previous reports on this was the best participating system in the shared dataset, we use the same data splits as Dreyer et al. task, and our implementation of the global (soft) (2008), dividing the data for each inflection type attention model presented by Luong et al.(2015). into five folds, each consisting of 500 training, 1000 development and 1000 test examples. We 4 Results train a separate model for each fold and report ex- In all experiments, for both the hard and soft at- act match accuracy, averaged over the five folds. tention models we implemented, we report results using an ensemble of 5 models with different ran- 2 https://github.com/roeeaharoni/ dom initializations by using majority voting on the morphological-reinflection 3The acronyms stand for: 13SIA=1st/3rd person, singular, final sequences the models predicted, as proposed indefinite, past;13SKE=1st/3rd person, subjunctive, present; by Kann and Schutze¨ (2016a). This was done to 2PIE=2nd person, plural, indefinite, present;13PKE=1st/3rd perform fair comparison to the models of Kann person, plural, subjunctive, present; 2PKE=2nd person, plu- ral, subjunctive, present; z=infinitive; rP=imperative, plural; and Schutze¨ (2016a,b); Faruqui et al.(2016) which pA=past participle. we compare to, that also perform a similar ensem-

                                   13SIA   2PIE   2PKE   rP     Avg.
MED (Kann and Schütze, 2016a)      83.9    95     87.6   84     87.62
NWFST (Rastogi et al., 2016)       86.8    94.8   87.9   81.1   87.65
LAT (Dreyer et al., 2008)          87.5    93.4   87.4   84.9   88.3
Soft                               83.1    93.8   88     83.2   87
Hard                               85.8    95.1   89.5   87.2   89.44

Table 1: Results on the CELEX dataset
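The Soft and Hard rows in these tables are ensembles of five models combined by majority voting on the final predicted sequences. A minimal sketch of that combination step (our rendering, assuming a plain vote over whole output strings with ties broken arbitrarily):

```python
# Majority voting over the full output strings of an ensemble of models.
from collections import Counter

def majority_vote(predictions):
    """predictions: list of output strings, one per ensemble member."""
    return Counter(predictions).most_common(1)[0][0]

print(majority_vote(["gelegt", "gelegt", "geleget", "gelegt", "gelegte"]))  # gelegt
```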

                              DE-N    DE-V    ES-V    FI-NA   FI-V    FR-V    NL-V    Avg.
Durrett and DeNero (2013)     88.31   94.76   99.61   92.14   97.23   98.80   90.50   94.47
Nicolai et al. (2015)         88.6    97.50   99.80   93.00   98.10   99.20   96.10   96.04
Faruqui et al. (2016)         88.12   97.72   99.81   95.44   97.81   98.82   96.71   96.34
Yu et al. (2016)              87.5    92.11   99.52   95.48   98.10   98.65   95.90   95.32
Soft                          88.18   95.62   99.73   93.16   97.74   98.79   96.73   95.7
Hard                          88.87   97.35   99.79   95.75   98.07   99.04   97.03   96.55

Table 2: Results on the Wiktionary datasets

        RU      DE      ES      GE      FI      TU      HU      NA      AR      MA      Avg.
MED     91.46   95.8    98.84   98.5    95.47   98.93   96.8    91.48   99.3    88.99   95.56
Soft    92.18   96.51   98.88   98.88   96.99   99.37   97.01   95.41   99.3    88.86   96.34
Hard    92.21   96.58   98.92   98.12   95.91   97.99   96.25   93.01   98.77   88.32   95.61

(Morphological system types per language: RU/DE/ES: suffixing + stem changes; GE: circumfixing; FI/TU/HU: suffixing + agglutinative + vowel harmony; NA: consonant harmony; AR/MA: templatic.)

Table 3: Results on the SIGMORPHON 2016 morphological inflection dataset. The text above each lan- guage lists the morphological phenomena it includes: circ.=circumfixing, agg.=agglutinative, v.h.=vowel harmony, c.h.=consonant harmony bling technique. small dataset, as it is not enforcing the monotonic assumption of the hard attention model. On the low resource setting (CELEX), our hard attention model significantly outperforms both the On the large training set experiments (Wik- recent neural models of Kann and Schutze¨ (2016a) tionary), our model is the best performing model (MED) and Rastogi et al.(2016)( NWFST) and the on German verbs, Finnish nouns/adjectives and morphologically aware latent variable model of Dutch verbs, resulting in the highest reported av- Dreyer et al.(2008)( LAT), as detailed in Table erage accuracy across all inflection types when 1. In addition, it significantly outperforms our compared to the four previous neural and non- implementation of the soft attention model (Soft). neural state of the art baselines, as detailed in Ta- It is also, to our knowledge, the first model that ble2. This shows the robustness of our model surpassed in overall accuracy the latent variable also with large amounts of training examples, model on this dataset. We attribute our advantage and the advantage the hard attention mechanism over the soft attention models to the ability of the provides over the encoder-decoder approach of hard attention control mechanism to harness the Faruqui et al.(2016) which does not employ an monotonic alignments found in the data. The ad- attention mechanism. Our model is also signifi- vantage over the FST models may be explained by cantly more accurate than the model of Yu et al. our conditioning on the entire output history which (2016), which shows the advantage of using in- is not available in those models. Figure3 plots dependently learned alignments to guide the net- the train-set and dev-set accuracies of the soft and work’s attention from the beginning of the training hard attention models as a function of the training process. While our soft-attention implementation epoch. While both models perform similarly on outperformed the models of Yu et al.(2016) and the train-set (with the soft attention model fitting it Durrett and DeNero(2013), it still performed in- slightly faster), the hard attention model performs feriorly to the hard attention model. significantly better on the dev-set. This shows the As can be seen in Table3, on the SIG- soft attention model’s tendency to overfit on the MORPHON 2016 dataset our model performs

[Figure 3 plot: accuracy (y-axis, 0 to 1) vs. training epoch (x-axis, 0 to 40) for the soft-train, hard-train, soft-dev and hard-dev curves.]

Figure 3: Learning curves for the soft and hard Figure 4: A comparison of the alignments as pre- attention models on the first fold of the CELEX dicted by the soft attention (left) and the hard at- dataset tention (right) models on examples from CELEX. better than both soft-attention baselines for the tonic alignment structure found in the data, and suffixing+stem-change languages (Russian, Ger- whether are they more suitable for the task when man and Spanish) and is slightly less accurate than compared to the alignments found by the soft at- our implementation of the soft attention model on tention model, we examined alignment predictions the rest of the languages, which is now the best of the two models on examples from the develop- performing model on this dataset to our knowl- ment portion of the CELEX dataset, as depicted in edge. We explain this by looking at the languages Figure4. First, we notice the alignments found from a linguistic typology point of view, as de- by the soft attention model are also monotonic, tailed in Cotterell et al.(2016). Since Russian, supporting our modeling approach for the task. German and Spanish employ a suffixing morphol- Figure4 (bottom-right) also shows how the hard- ogy with internal stem changes, they are more suit- attention model performs deletion (legte lege) → able for monotonic alignment as the transforma- by predicting a sequence of two step operations. tions they need to model are the addition of suf- Another notable morphological transformation is fixes and changing characters in the stem. The the one-to-many alignment, found in the top exam- rest of the languages in the dataset employ more ple: flog fliege, where the model needs to trans- → context sensitive morphological phenomena like form a character in the input, o, to two characters vowel harmony and consonant harmony, which re- in the output, ie. This is performed by two consec- quire to model long range dependencies in the in- utive write operations after the step operation of put sequence which better suits the soft attention the relevant character to be replaced. Notice that mechanism. While our implementation of the soft in this case, the soft attention model performs a attention model and MED are very similar model- different alignment by aligning the character i to wise, we hypothesize that our soft attention model o and the character g to the sequence eg, which results are better due to the fact that we trained is not the expected alignment in this case from a the model for 100 epochs and picked the best per- linguistic point of view. forming model on the development set, while the MED system was trained for a fixed amount of 20 The Learned Representations How does the epochs (although trained on more data – both train soft-attention model manage to learn nearly- and development sets). perfect monotonic alignments? Perhaps the the network learns to encode the sequential position 5 Analysis as part of its encoding of an input element? More generally, what information is encoded by the soft The Learned Alignments In order to see if the and hard alignment encoders? We selected 500 alignments predicted by our model fit the mono- random encoded characters-in-context from input

(a), (b): Colors indicate which character is encoded. (c), (d): Colors indicate the character's position.
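The 2-D projection behind these panels can be reproduced schematically as follows. This is a sketch with random stand-in vectors; the analysis described below uses the actual 200-dimensional bi-LSTM encoder outputs and real character/position metadata.

```python
# Project encoder state vectors to 2-D with SVD, keeping the metadata used for
# colouring (character identity for panels a/b, position in the word for c/d).
import numpy as np

rng = np.random.default_rng(0)
states = rng.normal(size=(500, 200))              # encoded characters-in-context
chars = rng.choice(list("abcdefghij"), size=500)  # character identity per state
positions = rng.integers(0, 12, size=500)         # position in the source word

centered = states - states.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
proj2d = centered @ Vt[:2].T                      # 2-D coordinates for plotting

# proj2d can now be scattered twice: once coloured by `chars` and once by `positions`.
print(proj2d.shape)  # (500, 2)
```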

Figure 5: SVD dimension reduction to 2D of 500 character representations in context from the encoder, for both the soft attention (top) and hard attention (bottom) models. words in the CELEX development set, where ev- encoder to encode positional information in the ery encoded representation is a vector in R200. input representations, which may help it to pre- Since those vectors are outputs from the bi-LSTM dict better attention scores, and to avoid collisions encoders of the models, every vector of this form when computing the weighted sum of representa- carries information of the specific character with tions for the context vector. In contrast, our hard- its entire context. We project these encodings into attention model has other means of obtaining the 2-D using SVD and plot them twice, each time position information in the decoder using the step using a different coloring scheme. We first color actions, and for that reason it does not encode it each point according to the character it represents as strongly in the representations of the inputs. (Figures 5a, 5b). In the second coloring scheme This behavior may allow it to perform well even (Figures 5c, 5d), each point is colored according with fewer examples, as the location information to the character’s sequential position in the word it is represented more explicitly in the model using came from, blue indicating positions near the be- the step actions. ginning of the word, and red positions near its end. While both models tend to cluster representa- 6 Related Work tions for similar characters together (Figures 5a, 5b), the hard attention model tends to have much Many previous works on inflection generation more isolated character clusters. Figures 5c, 5d used machine learning methods (Yarowsky and show that both models also tend to learn represen- Wicentowski, 2000; Dreyer and Eisner, 2011; tations which are sensitive to the position of the Durrett and DeNero, 2013; Hulden et al., 2014; character, although it seems that here the soft at- Ahlberg et al., 2015; Nicolai et al., 2015) with tention model is more sensitive to this information assumptions about the set of possible processes as its coloring forms a nearly-perfect red-to-blue needed to create the output word. Our work was transition on the X axis. This may be explained mainly inspired by Faruqui et al.(2016) which by the soft-attention mechanism encouraging the trained an independent encoder-decoder neural

2011 network for every inflection type in the training 2008; Jiampojamarn et al., 2010; Novak et al., data, alleviating the need for feature engineering. 2012) and encoder-decoder neural networks for Kann and Schutze¨ (2016b,a) tackled the task with monotone string transduction tasks was done by a single soft attention model (Bahdanau et al., Schnober et al.(2016), showing that in many cases 2015) for all inflection types, which resulted in there is no clear advantage to one approach over the best submission at the SIGMORPHON 2016 the other. shared task (Cotterell et al., 2016). In another Regarding task-specific improvements to the at- closely related work, Rastogi et al. (2016) model tention mechanism, a line of work on attention- the task with a WFST in which the arc weights based speech recognition (Chorowski et al., 2015; are learned by optimizing a global loss function Bahdanau et al., 2016) proposed adding location over all the possible paths in the state graph, while awareness by using the previous attention weights modeling contextual features with bi-directional when computing the next ones, and preventing LSTMS. This is similar to our approach, where the model from attending on too many or too instead of learning to mimic a single greedy align- few inputs using “sharpening” and “smoothing” ment as we do, they sum over all possible align- techniques on the attention weight distributions. ments. While not committing to a single greedy Cohn et al.(2016) offered several changes to the alignment could in theory be beneficial, we see attention score computation to encourage well- in Table1 that—at least for the low resource known modeling biases found in traditional ma- scenario—our greedy approach is more effective chine translation models like word fertility, po- in practice. Another recent work (Kann et al., sition and alignment symmetry. Regarding the 2016) proposed performing neural multi-source utilization of independent alignment models for morphological reinflection, generating an inflec- training attention-based networks, Mi et al.(2016) tion from several source forms of a word. showed that the distance between the attention- infused alignments and the ones learned by an in- Previous works on neural sequence transduc- dependent alignment model can be added to the tion include the RNN Transducer (Graves, 2012) networks’ training objective, resulting in an im- which uses two independent RNN’s over mono- proved translation and alignment quality. tonically aligned sequences to predict a distribu- tion over the possible output symbols in each step, 7 Conclusion including a null symbol to model the alignment. We presented a hard attention model for mor- Yu et al.(2016) improved this by replacing the null phological inflection generation. The model em- symbol with a dedicated learned transition proba- ploys an explicit alignment which is used to train bility. Both models are trained using a forward- a neural network to perform transduction by de- backward approach, marginalizing over all possi- coding with a hard attention mechanism. Our ble alignments. Our model differs from the above model performs better than previous neural and by learning the alignments independently, thus en- non-neural approaches on various morphological abling a dependency between the encoder and de- inflection generation datasets, while staying com- coder. While providing better results than Yu et al. 
petitive with dedicated models even with very few (2016), this also simplifies the model training us- training examples. It is also computationally ap- ing a simple cross-entropy loss. A recent work by pealing as it enables linear time decoding while Raffel et al.(2017) jointly learns the hard mono- staying resolution preserving, i.e. not requiring tonic alignments and transduction while maintain- to compress the input sequence to a single fixed- ing the dependency between the encoder and the sized vector. Future work may include apply- decoder. Jaitly et al.(2015) proposed the Neural ing our model to other nearly-monotonic align- Transducer model, which is also trained on exter- and-transduce tasks like abstractive summariza- nal alignments. They divide the input into blocks tion, transliteration or machine translation. of a constant size and perform soft attention sepa- rately on each block. Lu et al.(2016) used a com- Acknowledgments bination of an RNN encoder with a CRF layer to This work was supported by the Intel Collabora- model the dependencies in the output sequence. tive Research Institute for Computational Intelli- An interesting comparison between ”traditional” gence (ICRI-CI), and The Israeli Science Founda- sequence transduction models (Bisani and Ney, tion (grant number 1555/15).

2012 References Markus Dreyer, Jason R Smith, and Jason Eisner. 2008. Latent-variable modeling of string transduc- Malin Ahlberg, Markus Forsberg, and Mans Hulden. tions with finite-state methods. In Proceedings of 2015. Paradigm classification in supervised learning the conference on empirical methods in natural lan- of morphology. In NAACL HLT 2015. pages 1024– guage processing. pages 1080–1089. 1029. Greg Durrett and John DeNero. 2013. Supervised R Harald Baayen, Richard Piepenbrock, and Rijn van learning of complete morphological paradigms. In H. 1993. The CELEX lexical data base on CD- NAACL HLT 2013. pages 1185–1195. ROM . { } { } Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben- Jason Eisner. 2002. Parameter estimation for proba- gio. 2015. Neural machine translation by jointly bilistic finite-state transducers. In Proceedings of learning to align and translate. Proceedings of the the 40th annual meeting on Association for Compu- International Conference on Learning Representa- tational Linguistics. pages 1–8. tions (ICLR) . Manaal Faruqui, Yulia Tsvetkov, Graham Neubig, and Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Chris Dyer. 2016. Morphological inflection genera- Yoshua Bengio, et al. 2016. End-to-end attention- tion using character sequence to sequence learning. based large vocabulary speech recognition. In In NAACL HLT 2016. 2016 IEEE International Conference on Acous- tics, Speech and Signal Processing (ICASSP). pages Alexander M. Fraser, Marion Weller, Aoife Cahill, and 4945–4949. Fabienne Cap. 2012. Modeling inflection and word- formation in smt. In EACL. pages 664–674. Maximilian Bisani and Hermann Ney. 2008. Joint- sequence models for grapheme-to-phoneme Mercedes Garc´ıa-Mart´ınez, Lo¨ıc Barrault, and Fethi conversion. Speech Commun. 50(5):434–451. Bougares. 2016. Factored neural machine transla- https://doi.org/10.1016/j.specom.2008.01.002. tion. arXiv preprint arXiv:1609.04621 .

Victor Chahuneau, Eva Schlinger, Noah A. Smith, and A. Graves and J. Schmidhuber. 2005. Framewise Chris Dyer. 2013. Translating into morphologically phoneme classification with bidirectional LSTM and rich languages with synthetic phrases. In EMNLP. other neural network architectures. Neural Net- pages 1677–1687. works 18(5-6):602–610.

Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Alex Graves. 2012. Sequence transduction with Serdyuk, Kyunghyun Cho, and Yoshua Bengio. recurrent neural networks. arXiv preprint 2015. Attention-based models for speech recogni- arXiv:1211.3711 . tion. In Advances in Neural Information Processing Systems 28, pages 577–585. Mans Hulden, Markus Forsberg, and Malin Ahlberg. 2014. Semi-supervised learning of morphological Ann Clifton and Anoop Sarkar. 2011. Combin- paradigms and lexicons. In EACL. pages 569–578. ing morpheme-based machine translation with post- processing morpheme prediction. In ACL. pages Navdeep Jaitly, David Sussillo, Quoc V Le, Oriol 32–42. Vinyals, Ilya Sutskever, and Samy Bengio. 2015. A Trevor Cohn, Cong Duy Vu Hoang, Ekaterina Vy- neural transducer. arXiv preprint arXiv:1511.04868 molova, Kaisheng Yao, Chris Dyer, and Gholam- . reza Haffari. 2016. Incorporating structural align- ment biases into an attentional neural translation Sittichai Jiampojamarn, Colin Cherry, and Grzegorz model. In Proceedings of the 2016 Conference Kondrak. 2010. Integrating joint n-gram fea- of the North American Chapter of the Associa- tures into a discriminative training framework. In tion for Computational Linguistics: Human Lan- Human Language Technologies: The 2010 An- guage Technologies. Association for Computational nual Conference of the North American Chap- Linguistics, San Diego, California, pages 876–885. ter of the Association for Computational Lin- http://www.aclweb.org/anthology/N16-1102. guistics. Association for Computational Linguis- tics, Los Angeles, California, pages 697–700. Ryan Cotterell, Christo Kirov, John Sylak-Glassman, http://www.aclweb.org/anthology/N10-1103. David Yarowsky, Jason Eisner, and Mans Hulden. 2016. The SIGMORPHON 2016 shared task— Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, morphological reinflection. In Proceedings of the Aaron van den Oord, Alex Graves, and Koray 2016 Meeting of SIGMORPHON. Kavukcuoglu. 2016. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099 . Markus Dreyer and Jason Eisner. 2011. Discovering morphological paradigms from plain text using a Katharina Kann, Ryan Cotterell, and Hinrich Schutze.¨ dirichlet process mixture model. In EMNLP. pages 2016. Neural multi-source morphological reinflec- 616–627. tion. EACL 2017 .

2013 Katharina Kann and Hinrich Schutze.¨ 2016a. Med: C. Raffel, T. Luong, P. J. Liu, R. J. Weiss, and The lmu system for the sigmorphon 2016 shared task D. Eck. 2017. Online and Linear-Time Attention by on morphological reinflection. Enforcing Monotonic Alignments. arXiv preprint arXiv:1704.00784 . Katharina Kann and Hinrich Schutze.¨ 2016b. Single- model encoder-decoder with explicit morphological Pushpendre Rastogi, Ryan Cotterell, and Jason Eisner. representation for reinflection. In ACL. 2016. Weighting finite-state transductions with neu- ral context. In Proc. of NAACL. Ronald M. Kaplan and Martin Kay. 1994. Regular Carsten Schnober, Steffen Eger, Erik-Lanˆ Do Dinh, and models of phonological rule systems. Computa- Iryna Gurevych. 2016. Still not there? comparing tional Linguistics 20(3):331–378. traditional sequence-to-sequence models to encoder- decoder neural networks on monotone string trans- Kimmo Koskenniemi. 1983. Two-level morphology: lation tasks. In Proceedings of COLING 2016, A general computational model of word-form recog- the 26th International Conference on Computational nition and production. Technical report. Linguistics: Technical Papers. The COLING 2016 Organizing Committee, Osaka, Japan, pages 1703– Liang Lu, Lingpeng Kong, Chris Dyer, Noah A Smith, 1714. http://aclweb.org/anthology/C16-1160. and Steve Renals. 2016. Segmental recurrent neural networks for end-to-end speech recognition. arXiv Katsuhito Sudoh, Shinsuke Mori, and Masaaki Nagata. preprint arXiv:1603.00223 . 2013. Noise-aware character alignment for boot- strapping statistical machine transliteration from Thang Luong, Hieu Pham, and Christopher D. Man- bilingual corpora. In EMNLP 2013. pages 204–209. ning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the Kristina Toutanova, Hisami Suzuki, and Achim Ruopp. 2015 Conference on Empirical Methods in Natu- 2008. Applying morphology generation models to ral Language Processing. Association for Compu- machine translation. In ACL. pages 514–522. tational Linguistics, Lisbon, Portugal, pages 1412– 1421. http://aclweb.org/anthology/D15-1166. David Yarowsky and Richard Wicentowski. 2000. Minimally supervised morphological analysis by Haitao Mi, Zhiguo Wang, and Abe Ittycheriah. multimodal alignment. In ACL. 2016. Supervised attentions for neural machine Lei Yu, Jan Buys, and Phil Blunsom. 2016. Online translation. In Proceedings of the 2016 Con- segment to segment neural transduction. In Pro- ference on Empirical Methods in Natural Lan- ceedings of the 2016 Conference on Empirical Meth- guage Processing. Association for Computational ods in Natural Language Processing. Association Linguistics, Austin, Texas, pages 2283–2288. for Computational Linguistics, Austin, Texas, pages https://aclweb.org/anthology/D16-1249. 1307–1316. https://aclweb.org/anthology/D16- 1138. Einat Minkov, Kristina Toutanova, and Hisami Suzuki. 2007. Generating complex morphology for ma- Matthew D Zeiler. 2012. Adadelta: an adaptive learn- chine translation. In Proceedings of the 45th ing rate method. arXiv preprint arXiv:1212.5701 . Annual Meeting of the Association of Computa- tional Linguistics. Association for Computational Linguistics, Prague, Czech Republic, pages 128– 135. http://www.aclweb.org/anthology/P07-1017.

Mehryar Mohri, Fernando Pereira, and Michael Riley. 1997. A rational design for a weighted finite-state transducer library. In International Workshop on Im- plementing Automata. pages 144–158.

Garrett Nicolai, Colin Cherry, and Grzegorz Kondrak. 2015. Inflection generation as discriminative string transduction. In NAACL HLT 2015. pages 922–931.

Josef R. Novak, Nobuaki Minematsu, and Kei- kichi Hirose. 2012. WFST-based grapheme- to-phoneme conversion: Open source tools for alignment, model-building and decoding. In Proceedings of the 10th International Workshop on Finite State Methods and Natural Language Processing. Association for Computational Lin- guistics, Donostia–San Sebastin, pages 45–49. http://www.aclweb.org/anthology/W12-6208.

Supplementary Material

Training Details, Implementation and Hyper Parameters

To train our models, we used the train portion of the datasets as-is and evaluated, on the test portion, the model which performed best on the development portion of the dataset, without conducting any specific pre-processing steps on the data. We train the models for a maximum of 100 epochs over the training set. To avoid long training time, we trained the model for 20 epochs for datasets larger than 50k examples, and for 5 epochs for datasets larger than 200k examples. The models were implemented using the python bindings of the dynet toolkit (https://github.com/clab/dynet).

We trained the network by optimizing the expected output sequence likelihood using cross-entropy loss as mentioned in equation 5. For optimization we used ADADELTA (Zeiler, 2012) without regularization. We updated the weights after every example (i.e. mini-batches of size 1). We used the dynet toolkit implementation of an LSTM network with two layers for all models, each having 100 entries in both the encoder and decoder. The character embeddings were also vectors with 100 entries for the CELEX experiments, and with 300 entries for the SIGMORPHON and Wiktionary experiments. The morpho-syntactic attribute embeddings were vectors of 20 entries in all experiments. We did not use beam search while decoding for both the hard and soft attention models, as it is significantly slower and did not show a clear improvement in previous experiments we conducted. For the character level alignment process we use the implementation provided by the organizers of the SIGMORPHON2016 shared task (https://github.com/ryancotterell/sigmorphon2016).

LSTM Equations

We used the LSTM variant implemented in the dynet toolkit, which corresponds to the following equations:

    i_t = σ(W_{ix} x_t + W_{ih} h_{t-1} + W_{ic} c_{t-1} + b_i)
    f_t = σ(W_{fx} x_t + W_{fh} h_{t-1} + W_{fc} c_{t-1} + b_f)
    c~_t = tanh(W_{cx} x_t + W_{ch} h_{t-1} + b_c)
    c_t = c_{t-1} ∘ f_t + c~_t ∘ i_t
    o_t = σ(W_{ox} x_t + W_{oh} h_{t-1} + W_{oc} c_t + b_o)
    h_t = tanh(c_t) ∘ o_t                                        (6)
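A NumPy rendering of Eq. (6) is given below as a sketch (ours, not the dynet source). Weight shapes and initialisation are illustrative, and we read the extra gate terms as peephole connections to the cell state.

```python
# One step of the peephole-style LSTM cell in Eq. (6).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(p, x_t, h_prev, c_prev):
    i_t = sigmoid(p["Wix"] @ x_t + p["Wih"] @ h_prev + p["Wic"] @ c_prev + p["bi"])
    f_t = sigmoid(p["Wfx"] @ x_t + p["Wfh"] @ h_prev + p["Wfc"] @ c_prev + p["bf"])
    c_tilde = np.tanh(p["Wcx"] @ x_t + p["Wch"] @ h_prev + p["bc"])
    c_t = c_prev * f_t + c_tilde * i_t
    o_t = sigmoid(p["Wox"] @ x_t + p["Woh"] @ h_prev + p["Woc"] @ c_t + p["bo"])
    h_t = np.tanh(c_t) * o_t
    return h_t, c_t

H, E = 100, 100  # hidden and embedding sizes, as in the training details above
rng = np.random.default_rng(0)
params = {k: rng.normal(size=(H, E if k.endswith("x") else H)) * 0.01
          for k in ["Wix", "Wih", "Wic", "Wfx", "Wfh", "Wfc",
                    "Wcx", "Wch", "Wox", "Woh", "Woc"]}
params.update({b: np.zeros(H) for b in ["bi", "bf", "bc", "bo"]})

h, c = lstm_step(params, rng.normal(size=E), np.zeros(H), np.zeros(H))
print(h.shape, c.shape)  # (100,) (100,)
```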

Improving Sequence to Sequence Learning for Morphological Inflection Generation: The BIU-MIT Systems for the SIGMORPHON 2016 Shared Task for Morphological Reinflection

Roee Aharoni and Yoav Goldberg
Computer Science Department, Bar Ilan University
roee.aharoni,[email protected]

Yonatan Belinkov
CSAIL, MIT
[email protected]

Abstract Durrett and DeNero, 2013; Hulden et al., 2014; Ahlberg et al., 2015; Nicolai et al., 2015). While Morphological reinflection is the task of these studies achieved high accuracies, they also generating a target form given a source make specific assumptions about the set of pos- form and the morpho-syntactic attributes sible morphological processes that create the in- of the target (and, optionally, of the flection, and require feature engineering over the source). This work presents the sub- input. mission of Bar Ilan University and the Massachusetts Institute of Technology for More recently, Faruqui et al. (2016) used the morphological reinflection shared task encoder-decoder neural networks for inflection held at SIGMORPHON 2016. The sub- generation inspired by similar approaches for mission includes two recurrent neural net- sequence-to-sequence learning for machine trans- work architectures for learning morpho- lation (Bahdanau et al., 2014; Sutskever et al., logical reinflection from incomplete in- 2014). The general idea is to use an encoder- flection tables while using several novel decoder network over characters, that encodes the ideas for this task: morpho-syntactic at- input lemma into a vector and decodes it one char- tribute embeddings, modeling the concept acter at a time into the inflected surface word. of templatic morphology, bidirectional in- They factor the data into sets of inflections with put character representations and neural identical morpho-syntactic attributes (we refer to discriminative string transduction. The each such set as a factor) and try two training ap- reported results for the proposed models proaches: in one they train an individual encoder- over the ten languages in the shared task decoder RNN per factor, and in the other they train bring this submission to the second/third a single encoder RNN over all the lemmas in the place (depending on the language) on all dataset and a specific decoder RNN per factor. three sub-tasks out of eight participating An important aspect of previous work on learn- teams, while training only on the Re- ing inflection generation is the reliance on com- stricted category data. plete inflection tables – the training data contains all the possible inflections per lemma. In contrast, 1 Introduction in the shared task setup (Cotterell et al., 2016) the Morphological inflection, or reinflection, involves training is over partial inflection tables that mostly generating a target (surface form) word from a contain only several inflections per lemma, for source word (e.g. a lemma), given the morpho- three different sub-tasks: The first requires mor- syntactic attributes of the target word. Previ- phological inflection generation given a lemma ous approaches to automatic inflection generation and a set of morpho-syntactic attributes, the sec- usually make use of manually constructed Finite ond requires morphological re-inflection of an in- State Transducers (Koskenniemi, 1983; Kaplan flected word given the word, its morpho-syntactic and Kay, 1994), which are theoretically appealing attributes and the target inflection’s attributes, and but require expert knowledge, or machine learn- the third requires re-inflection of an inflected word ing methods for string transduction (Yarowsky given only the target inflection attributes. The and Wicentowski, 2000; Dreyer and Eisner, 2011; datasets for the different tasks are available on the

41 Proceedings of the 14th Annual SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 41–48, Berlin, Germany, August 11, 2016. c 2016 Association for Computational Linguistics shared task’s website.1 Figure 1: The Morphological Sequence to Se- The fact that the data is incomplete makes quence (MS2S) network architecture for predict- it problematic to use factored models like the ing an inflection template given the Arabic lemma ones introduced in (Faruqui et al., 2016), as tarjama and a set of morpho-syntactic attributes. there may be insufficient data for training a high- A round tip expresses concatenation of the inputs quality model per factor of inflections with identi- it receives. cal morpho-syntactic attributes. For example, in the shared task dataset the training data usually contains less than 100 training examples on av- erage per such factor. Moreover, when the data is factored this way, no information is shared be- tween the different factors even though they may have identical inflection rules. We propose two neural network architectures for the task. The first, detailed in Section 2, de- parts from the architecture of (Faruqui et al., 2016) by extending it in three novel ways: represent- ing morpho-syntactic attributes, template-inspired modeling, and bidirectional input character repre- sentations. The second, described in Section 3, is based on an explicit control mechanism we intro- duce while also making use of the three extensions across inflections with different morpho-syntactic mentioned above. Our experimental evaluation attributes, as they are trained jointly, while the at- over all 10 languages represented in the shared tribute embeddings help discriminate between dif- task brings our models to the second or third place, ferent inflection types when needed. This can be depending on the language. seen in Figure 1, where f is the vector containing a concatenation of the morpho-syntactic attribute 2 First Approach: Morphological embeddings. Sequence to Sequence Architecture While this approach should allow us to train a single neural network over the entire dataset to Our first proposed architecture is a Morphological predict all the different inflection types, in prac- Sequence to Sequence (MS2S) architecture, illus- tice we were not able to successfully train such a trated in Figure 1. It incorporates several novel network. Instead, we found a middle ground in components in the sequence-to-sequence learning training a network per part-of-speech (POS) type. paradigm, as discussed below. This resulted in much fewer models than in the factored model, each using much more data, which 2.1 Morpho-Syntactic Attribute Embeddings is essential when training machine learning mod- We seek to train models over larger amounts of ex- els and specifically neural networks. For example, amples, rather than on factors that strictly contain on the Arabic dataset of the first sub task (inflec- examples that share all the morpho-syntactic at- tion generation from lemma to word) this reduced tributes. To do so, instead of factoring the data by the amount of trained models from 223 with an av- the attributes we feed the attributes into the net- erage of 91 training examples per model, to only work by creating a dense embedding vector for 3 models (one per POS type - verb, noun, adjec- every possible attribute/value pair (for example, tive) with an average of 3907 training examples gender=FEM and gender=MASC will each have per model. 
its own embedding vector). The attribute embed- dings for each input are then concatenated and 2.2 Morphological Templates added as parameters to the network while being We bring the idea of morphological templates into updated during training similarly to the character the model: instead of training the network to pre- embeddings. This way, information can be shared dict only a specific inflection character at each step 1http://ryancotterell.github.io/ given a lemma and a set of morpho-syntactic fea- sigmorphon2016/ tures, we train the network to either predict a char-

42 acter from the vocabulary or to copy a character at 2.3 Bidirectional Input Character a given position in the input sequence. This en- Representation ables the network to produce a sequence that re- Instead of feeding the decoder RNN at each step sembles a morphological template which can be with a fixed vector that holds the encoded vec- instantiated with characters from the input to pro- tor for the entire input sequence like Faruqui et. duce the correct inflection. While at train time we al. (2016), we feed the decoder RNN at each step encourage the network to perform copy operations with a Bi-Directional Long-Short Term Memory when possible, at prediction time the network can (BiLSTM) representation (Graves and Schmidhu- decide whether to copy a character from the input ber, 2005) per character in the input along with the by predicting its location in the input or to generate character embedding learned by the network. The a preferred character from the vocabulary. For ex- BiLSTM character representation is a concatena- ample, for the Arabic lemma tarjama and a set of tion of the outputs of two LSTMs that run over morpho-syntactic attributes the network will out- the character sequence up to the current charac- put the sequence ”tu0123i5ani”¯ which can be in- ter, from both sides. This adds more focused con- stantiated with the lemma into the correct inflec- text when the network predicts the next inflection tion, tutarjimani¯ , as depicted in Figure 1. output, while still including information form the Intuitively, this method enables the learning entire sequence due to the bidirectional represen- process to generalize better as many different tation. examples may share similar templates – which is important when working with relatively small 2.4 MS2S Decoder Input datasets. We saw indeed that adding this com- For every step i of the decoder RNN for this setup, ponent to our implementation of a factored model the input vector is a concatenation of the follow- similar to (Faruqui et al., 2016) gave a significant ing: improvement in accuracy over the Arabic dataset: from 24.04 to 78.35, while the average number of 1. BiLST Mi – The bidirectional character em- examples per factor was 91. bedding for the ith input character (if i is To implement this, for every given pair of in- larger than the length of the input sequence, put and output sequences in the training set we the embedding of the last input character is need to produce a parameterized sequence which, used). when instantiated with the input sequence, creates the output sequence. This is achieved by running 2. ci – The character embedding for the ith input a character level alignment process on the train- character. If i is larger than the length of the ing data, which enables to easily infer the desired input sequence, an embedding of a special E sequence from every input-output sequence align- symbol is used, similarly to (Faruqui et al., ment. For example, given the input sequences sab- 2016). baba and output sequence tusabbiba¯ with the in- duced alignment sabbaba-tusabbiba,¯ we produce 3. i – A character embedding for the current the expected output: tu0123i5a,¯ as depicted in the step index in the decoder. In the first step this next figure: will be an embedding matching to ’0’, in the second step it will an embedding matching to ’1’ etc. These are the same index embeddings used to model copy actions from a specific in- dex.

4. oi 1 – The feedback input, containing the − We performed the alignment process using a embedding of the prediction (either a charac- Chinese Restaurant Process character level aligner ter or an integer representing an index in the (Sudoh et al., 2013) as implemented in the shared input) from the previous decoder RNN step. task baseline system.2 5. f – The vector containing the concatena- 2https://github.com/ryancotterell/ tion of the morpho-syntactic attribute embed- sigmorphon2016/tree/master/src/baseline dings.
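To make the template idea above concrete, here is a small illustrative sketch (ours, not the submitted system) of instantiating a predicted template with the input lemma: digit tokens are treated as copy actions over 0-based lemma positions, and all other tokens are emitted literally.

```python
# Instantiate a predicted morphological "template" with the input lemma.
def instantiate(template_tokens, lemma):
    out = []
    for tok in template_tokens:
        if tok.isdigit():
            out.append(lemma[int(tok)])   # copy the lemma character at that index
        else:
            out.append(tok)               # emit a predicted vocabulary character
    return "".join(out)

# The example above: lemma "sabbaba" with template "tu0123i5a" (diacritics omitted).
print(instantiate(list("tu0123i5a"), "sabbaba"))  # tusabbiba
```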

43 3 Second Approach: The Neural Figure 2: The Neural Discriminative String Trans- Discriminative String Transducer ducer (NDST) architecture for predicting an in- Architecture flection template given a lemma and a set of morpho-syntactic attributes. The second approach is based on a Neural Dis- criminative String Transducer (NDST), a novel neural network architecture that models which specific part of the input sequence is relevant for predicting the next output character at a given time. This is done by maintaining a state con- sisting of an input sequence position (the input pointer) and an output sequence position (the out- put pointer), which are controlled by the decoder. This approach can be seen as a more focused re- placement to the general attention mechanism of Bahdanau et. al. (2014), tailored to the usually monotonic behavior of the output sequence with respect to the input sequence in the morphological reinflection task. An example for using this archi- tecture is available in Figure 2.
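As a minimal illustration of this control state (ours; the embedding lookups are omitted and only the positions are tracked), the two pointers can be maintained as follows.

```python
# The NDST decoder state: an input pointer and an output pointer, promoted
# according to the type of the action just predicted.
STEP = "<step>"

class NDSTState:
    def __init__(self):
        self.p_input = 0    # position of the currently attended input element
        self.p_output = 0   # position of the next output element to generate

    def update(self, action):
        if action == STEP:
            self.p_input += 1     # "step": move to the next input position
        else:
            self.p_output += 1    # character or copy action: the output grows

state = NDSTState()
for a in ["t", "u", "0", STEP, "1", STEP]:
    state.update(a)
print(state.p_input, state.p_output)  # 2 4
```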

3.1 NDST Decoder Input For every step i in the NDST decoder RNN, the input vector is a concatenation of the following: 5. f – The vector containing the concatena- tion of the morpho-syntactic attribute embed- 1. pinput – The input pointer, holding the dings. embedding that represents the position of the current pointed input sequence element. When i = 0, this is initialized with the em- To train an NDST network, for every input and bedding that stands for the position of the first output sequence in the training data we should element in the input. Every time the network have a sequence of actions (of three types – either outputs the “step” symbol, pinput is promoted a specific character prediction, an index to copy by setting it with the embedding that repre- from or a “step” instruction) that when performed sents the next input sequence position. on the input sequence, produces the correct output sequence. To get the correct instruction sequences 2. poutput – The output pointer, a character em- in train time we first run a character level align- bedding representing the next position in the ment process on the training data, similarly to the output sequence to be generated. When i = MS2S model. Once we have the character level 0, this is initialized with the embedding that alignment per input-output sequence pair, we de- stands for the position of the first element in terministically infer the sequence of actions that the input. Every time the network outputs a results in the desired output by going through ev- symbol other than the “step” symbol, poutput ery pair of aligned input-output characters in the is promoted by setting it with the embedding alignment. If the input and output characters in the for the next output sequence position. aligned pair are not identical, we produce the new output character. If the input and output characters 3. BiLST Mpinput – The bidirectional character embedding for the input character currently in the aligned pair are identical we produce a copy action from the input character location. After pointed by pinput. that, if the next output character is not the epsilon 4. oi 1 – The feedback input, containing the symbol as seen in the alignment in Figure 2.2 we − embedding of the prediction (either a char- also produce a “step” action. We train the network acter, an integer representing an index in the to produce this sequence of actions when given the input, or the “step” symbol) from the previ- input sequence and the set of morpho-syntactic at- ous decoder RNN step. tributes matching the desired inflection.
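Our reading of the action-derivation procedure just described can be sketched as follows (an assumption-laden illustration, not the submitted code): for each aligned character pair, emit a copy action when the characters are identical, a new output character otherwise, and a "step" action unless the next output character is an epsilon.

```python
# Derive the NDST training action sequence from a character alignment, where
# EPS marks an epsilon (empty) symbol on either side of an aligned pair.
EPS, STEP = "<eps>", "<step>"

def ndst_actions(aligned_pairs):
    """aligned_pairs: list of (input_char, output_char); either side may be EPS."""
    actions = []
    for i, (x_char, y_char) in enumerate(aligned_pairs):
        if y_char != EPS:
            if x_char == y_char:
                actions.append(i)          # copy action: an index into the input
            else:
                actions.append(y_char)     # predict a new output character
        # advance the input pointer unless the next output character is epsilon
        nxt_out = aligned_pairs[i + 1][1] if i + 1 < len(aligned_pairs) else None
        if nxt_out is not None and nxt_out != EPS:
            actions.append(STEP)
    return actions

# Hypothetical toy alignment: lemma "sag" -> inflection "sagt"
pairs = [("s", "s"), ("a", "a"), ("g", "g"), (EPS, "t")]
print(ndst_actions(pairs))  # [0, '<step>', 1, '<step>', 2, '<step>', 't']
```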

4 Experimental Details

4.1 Submissions

The shared task allowed submissions in three different tracks: Standard, which enabled using data from lower numbered tasks in addition to the current task data; Restricted, which enabled using only the current task's data; and Bonus, which enabled using the Standard track datasets and an additional monolingual corpus supplied by the organizers.

We submitted two systems to the shared task, both in the Restricted track. The first, named BIU/MIT-1, used the MS2S architecture as described previously and participated in all three sub-tasks. Notice that for the 3rd task, the input is identical to the first task so it does not require changes in the network architecture. To use the MS2S network for the second task we concatenated the source and target morpho-syntactic attribute embeddings and used that vector as the f vector mentioned previously. The output from this system was 5-best lists, meaning 5 predictions for each input. To produce the 5-best list we perform beam search over the MS2S model, which is trained greedily without such a search procedure.

The second system, named BIU/MIT-2, used the NDST architecture and participated only in the first and second sub-tasks. This system did not use beam search, producing only one guess per input. Again, to use the NDST architecture for the second task we simply concatenated the input and output morpho-syntactic attribute embeddings.

4.2 Training, Implementation and Hyper Parameters

To train our systems, we used the train portion of the dataset as-is and submitted the model which performed best on the development portion of the dataset, without conducting any specific pre-processing steps on the data. We trained our networks for a maximum of 300 epochs over the entire training set or until no improvement on the development set has been observed for more than 100 epochs. The systems were implemented using pyCNN, the python wrapper for the CNN toolkit.3 In both architectures we trained the network by optimizing the expected output sequence likelihood using cross-entropy loss. For optimization we used ADAM (Kingma and Ba, 2014) with no regularization, and the parameters set as α = 10^-4, β1 = 0.9, β2 = 0.999, ε = 10^-8. In all architectures we used the CNN toolkit implementation of an LSTM network with two layers, each having 200 entries. The character embeddings were also vectors with 200 entries, and the morpho-syntactic attribute embeddings were vectors of 20 entries. When using beam search we used a beam width of 5.

3 https://github.com/clab/cnn

5 Results

While developing our systems we measured our performance on the development set with respect to two baselines: the shared task baseline system (ST-Base) inspired by (Nicolai et al., 2015; Durrett and DeNero, 2013), and the factored sequence to sequence baseline (Fact.) similar to the one introduced in (Faruqui et al., 2016). On the test set, our systems ranked second or third out of eight groups in the shared task (depending on the language). The best participating system, LMU-1/2 (Kann and Schütze, 2016), relied on a single encoder-decoder model with attention (Bahdanau et al., 2014) per language, with several improvements like performing prediction using majority voting over an ensemble of five models. In contrast, our first system did not use an explicit attention mechanism and is composed of 3 models per language (one per POS type) without using ensembling. We compare our system to the best system on the test set.

The results for the first task are shown in Table 1, measuring aggregated accuracy across all POS tags. On the development set, our models surpassed both baselines significantly and were competitive with each other, as the MS2S model gained the best aggregated accuracy results on all languages but Russian and Finnish, where the NDST model was better. On the test set, similar results are shown: the MS2S model gives higher accuracies except for Russian, Navajo and Maltese, where the NDST model was superior.

For the second task, we measured performance only with respect to ST-Base, as can be seen in Table 2. On the development set, the NDST model outperformed the baseline and the MS2S model for all languages but Georgian and Spanish, where the MS2S and ST-Base models were better, respectively, although not with a significant difference. On the test set, the MS2S model gave better results only for Georgian and Hungarian.
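As a companion to the training regime described in Section 4.2, the following is a minimal, framework-agnostic sketch of the hyper-parameters and early-stopping loop; the model and evaluation objects are placeholders (assumptions), not the pyCNN implementation that was actually used.

```python
# A framework-agnostic sketch of the training regime in Section 4.2.
# `model.train_step`, `model.get_state`, `model.set_state` and `evaluate`
# are hypothetical placeholders used only to illustrate the control flow.
HYPERPARAMS = {
    "optimizer": "adam", "alpha": 1e-4, "beta1": 0.9, "beta2": 0.999, "epsilon": 1e-8,
    "lstm_layers": 2, "lstm_size": 200,
    "char_embedding_dim": 200, "attribute_embedding_dim": 20,
    "beam_width": 5, "max_epochs": 300, "patience": 100,
}

def train(model, train_set, dev_set, evaluate):
    """Keep the checkpoint that performs best on the development set and
    stop after `patience` epochs without improvement."""
    best_dev, best_state, since_improvement = float("-inf"), None, 0
    for _ in range(HYPERPARAMS["max_epochs"]):
        for example in train_set:
            model.train_step(example)            # cross-entropy loss, ADAM update
        dev_score = evaluate(model, dev_set)     # e.g. exact-match accuracy
        if dev_score > best_dev:
            best_dev, best_state, since_improvement = dev_score, model.get_state(), 0
        else:
            since_improvement += 1
            if since_improvement > HYPERPARAMS["patience"]:
                break
    model.set_state(best_state)                  # submit the best dev checkpoint
    return model
```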

Table 1: Results for inflection generation (first sub-task), measuring accuracy on the development set: our models vs. the shared task (ST-Base) and Factored (Fact.) baselines, and mean reciprocal rank (MRR) on the test set: our models vs. the best performing model (Kann and Schütze, 2016).

Language    Dev ST-Base  Dev Fact.  Dev MS2S  Dev NDST  Test MS2S  Test NDST  Test Best
Russian     90.38        84.22      91.57     93.33     89.73      90.62      91.46
Georgian    89.83        92.37      98.41     97.01     97.55      96.54      98.5
Finnish     68.27        75.78      95.8      94.36     93.81      92.58      96.8
Arabic      70.29        24.04      96.28     92.95     93.34      89.96      95.47
Navajo      71.9         83.47      98.82     98.48     80.13      88.43      91.48
Spanish     96.92        91.79      98.99     99.31     98.41      98.33      98.84
Turkish     59.17        64.68      98.18     97.8      97.74      96.17      98.93
German      89.29        90.35      96.36     95.99     95.11      94.87      95.8
Hungarian   78.62        65.75      99.23     98.76     98.33      97.59      99.3
Maltese     36.94        N/A        87.92     85.2      82.4       84.78      88.99

Table 2: Results for morphological re-inflection with source attributes (second sub-task), measuring accuracy over the development set: our models vs. the shared task (ST-Base) baseline, and mean reciprocal rank (MRR) over the test set: our models vs. the best performing model (Kann and Schütze, 2016).

Language    Dev ST-Base  Dev MS2S  Dev NDST  Test MS2S  Test NDST  Test Best
Russian     85.63        85.06     86.62     83.36      85.81      90.11
Georgian    91.5         94.13     93.81     92.65      92.27      98.5
Finnish     64.56        77.13     84.31     74.44      80.91      96.81
Arabic      58.75        75.25     78.37     70.26      73.95      91.09
Navajo      60.85        63.85     75.04     56.5       67.88      97.81
Spanish     95.63        93.25     95.37     92.21      94.26      98.45
Turkish     54.88        82.56     87.25     81.69      83.88      98.38
German      87.69        93.13     94.12     91.67      92.66      96.22
Hungarian   78.33        94.37     94.87     92.33      91.16      99.42
Maltese     26.2         43.29     49.7      41.92      50.13      86.88

Table 3: Results for morphological re-inflection without source attributes (third sub-task), measuring accuracy over the development set: our models vs. the shared task (ST-Base) baseline, and mean reciprocal rank (MRR) over the test set: our models vs. the best performing model (Kann and Schütze, 2016).

Language    Dev ST-Base  Dev MS2S  Dev NDST  Test MS2S  Test Best
Russian     81.31        84.56     84.25     82.81      87.13
Georgian    90.68        93.62     91.05     92.08      96.21
Finnish     61.94        76.5      66.25     72.99      93.18
Arabic      50           72.56     69.31     69.05      82.8
Navajo      60.26        62.7      54.0      52.85      83.5
Spanish     88.94        92.62     89.68     92.14      96.69
Turkish     52.19        79.87     75.25     79.69      95.0
German      81.56        90.93     89.31     89.58      92.41
Hungarian   78           94.25     83.83     91.91      98.37
Maltese     24.75        44.04     3.58      40.79      84.25

For the third task we also measured performance with respect to ST-Base, as can be seen in Table 3. On the development set, the MS2S model outperformed the others on all languages. Since this was the situation, we did not submit the NDST model for this sub-task, thus not showing test results for the NDST model.

6 Preliminary Analysis

An obvious trend we can see in the results is the MS2S approach giving higher accuracy scores on the first and third tasks, while the NDST approach is significantly better on the second task. While inspecting the data for the second and third tasks we noticed that the datasets only differ in the added morpho-syntactic attributes for the input sequences, and are identical other than that. This is encouraging, as it shows how the NDST control mechanism can exploit the additional data on the input sequence to predict inflections more accurately. We plan to further analyze the results to better understand the cases where the NDST architecture provides added value over the MS2S approach.

7 Discussion and Future Work

Our systems reached the second/third place in the Restricted category in the shared task, depending on the language/sub-task combination. It is also encouraging to see that if we submitted our systems as-is to the Standard and Bonus tracks we would also get similar rankings, even without using the additional training data available there. The winning submission in all tracks, described in (Kann and Schütze, 2016), also used an encoder-decoder approach that incorporated the morpho-syntactic attributes as inputs to the network, but with several differences from our approach, like using an attention mechanism similar to (Bahdanau et al., 2014), training a single model for all inflection types rather than one per POS type, and performing prediction by using an ensemble of five models with majority voting rather than using a single trained model like we did. Future work may include exploring a hybrid approach that combines the ideas proposed in our work and the latter. Other recent works that propose ideas relevant to explore in future work in this direction are (Gu et al., 2016), which describes a different copying mechanism for encoder-decoder architectures, or (Rastogi et al., 2016), which models the reinflection task using a finite state transducer weighted with neural context that also takes special care of the character copying issue.

Acknowledgments

We thank Manaal Faruqui for sharing his code with us. This work was supported by the Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI).

References

Malin Ahlberg, Markus Forsberg, and Mans Hulden. 2015. Paradigm classification in supervised learning of morphology. In Rada Mihalcea, Joyce Yue Chai, and Anoop Sarkar, editors, HLT-NAACL, pages 1024–1029. The Association for Computational Linguistics.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, David Yarowsky, Jason Eisner, and Mans Hulden. 2016. The SIGMORPHON 2016 shared task—morphological reinflection. In Proceedings of the 2016 Meeting of SIGMORPHON, Berlin, Germany, August. The Association for Computational Linguistics.

Markus Dreyer and Jason Eisner. 2011. Discovering morphological paradigms from plain text using a Dirichlet process mixture model. In EMNLP, pages 616–627. The Association for Computational Linguistics.

Greg Durrett and John DeNero. 2013. Supervised learning of complete morphological paradigms. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1185–1195, Atlanta, Georgia, June. The Association for Computational Linguistics.

Manaal Faruqui, Yulia Tsvetkov, Graham Neubig, and Chris Dyer. 2016. Morphological inflection generation using character sequence to sequence learning.
In NAACL HLT 2016, The 2016 Conference tion types rather than one per POS type and per- of the North American Chapter of the Association forming prediction by using an ensemble of five for Computational Linguistics: Human Language models with majority voting rather than using a Technologies, San Diego, California, USA, June 12 - June 17, 2016. single trained model like we did. Future work may include exploring a hybrid approach that com- A. Graves and J. Schmidhuber. 2005. Framewise bines the ideas proposed in our work and the lat- phoneme classification with bidirectional LSTM and other neural network architectures. Neural Net- ter. Other recent works that propose ideas relevant works, 18(5-6):602–610. to explore in future work in this direction are (Gu Jiatao Gu, Zhengdong Lu, Hang Li, and Victor OK et al., 2016), which describe a different copying Li. 2016. Incorporating copying mechanism mechanism for encoder-decoder architectures, or in sequence-to-sequence learning. arXiv preprint (Rastogi et al., 2016), which models the reinflec- arXiv:1603.06393.

47 Mans Hulden, Markus Forsberg, and Malin Ahlberg. 2014. Semi-supervised learning of morphological paradigms and lexicons. In Gosse Bouma and Yan- nick Parmentier 0001, editors, EACL, pages 569– 578. The Association for Computational Linguistics. Katharina Kann and Hinrich Schutze.¨ 2016. Single- model encoder-decoder with explicit morphological representation for reinflection. In ACL, Berlin, Ger- many, August. The Association for Computational Linguistics. Ronald M. Kaplan and Martin Kay. 1994. Regu- lar models of phonological rule systems. Compu- tational Linguistics, 20(3):331–378. Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. Kimmo Koskenniemi. 1983. Two-level morphology: A general computational model of word-form recog- nition and production. Technical Report Publication No. 11, Department of General Linguistics, Univer- sity of Helsinki. Garrett Nicolai, Colin Cherry, and Grzegorz Kondrak. 2015. Inflection generation as discriminative string transduction. In Proceedings of the 2015 Confer- ence of the North American Chapter of the Associ- ation for Computational Linguistics: Human Lan- guage Technologies, pages 922–931, Denver, Col- orado, May–June. The Association for Computa- tional Linguistics. Pushpendre Rastogi, Ryan Cotterell, and Jason Eisner. 2016. Weighting finite-state transductions with neu- ral context. In Proc. of NAACL. Katsuhito Sudoh, Shinsuke Mori, and Masaaki Na- gata. 2013. Noise-aware character alignment for bootstrapping statistical machine transliteration from bilingual corpora. In EMNLP, pages 204–209. The Association for Computational Linguistics. Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural net- works. In Zoubin Ghahramani, Max Welling, Corinna Cortes, Neil D. Lawrence, and Kilian Q. Weinberger, editors, NIPS, pages 3104–3112.

David Yarowsky and Richard Wicentowski. 2000. Minimally supervised morphological analysis by multimodal alignment. In ACL. The Association for Computational Linguistics.

Chapter 4

Linearizing Syntax for String-to-Tree Neural Machine Translation

Towards String-to-Tree Neural Machine Translation

Roee Aharoni & Yoav Goldberg
Computer Science Department
Bar-Ilan University
Ramat-Gan, Israel
{roee.aharoni,yoav.goldberg}@gmail.com

Abstract

We present a simple method to incorporate syntactic information about the target language in a neural machine translation system by translating into linearized, lexicalized constituency trees. Experiments on the WMT16 German-English news translation task showed improved BLEU scores when compared to a syntax-agnostic NMT baseline trained on the same dataset. An analysis of the translations from the syntax-aware system shows that it performs more reordering during translation in comparison to the baseline. A small-scale human evaluation also showed an advantage to the syntax-aware system.

1 Introduction and Model

Neural Machine Translation (NMT) (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al., 2014) has recently become the state-of-the-art approach to machine translation (Bojar et al., 2016), while being much simpler than the previously dominant phrase-based statistical machine translation (SMT) approaches (Koehn, 2010). NMT models usually do not make explicit use of syntactic information about the languages at hand. However, a large body of work was dedicated to syntax-based SMT (Williams et al., 2016). One prominent approach to syntax-based SMT is string-to-tree (S2T) translation (Yamada and Knight, 2001, 2002), in which a source-language string is translated into a target-language tree. S2T approaches to SMT help to ensure the resulting translations have valid syntactic structure, while also mediating flexible reordering between the source and target languages. The main formalism driving current S2T SMT systems is GHKM rules (Galley et al., 2004, 2006), which are synchronous transduction grammar (STSG) fragments, extracted from word-aligned sentence pairs with syntactic trees on one side. The GHKM translation rules allow flexible reordering on all levels of the parse-tree.

We suggest that NMT can also benefit from the incorporation of syntactic knowledge, and propose a simple method of performing string-to-tree neural machine translation. Our method is inspired by recent works in syntactic parsing, which model trees as sequences (Vinyals et al., 2015; Choe and Charniak, 2016). Namely, we translate a source sentence into a linearized, lexicalized constituency tree, as demonstrated in Figure 2. Figure 1 shows a translation from our neural S2T model compared to one from a vanilla NMT model for the same source sentence, as well as the attention-induced word alignments of the two models.

Figure 1: Top – a lexicalized tree translation predicted by the bpe2tree model. Bottom – a translation for the same sentence from the bpe2bpe model. The blue lines are drawn according to the attention weights predicted by each model.

Note that the linearized trees we predict are different in their structure from those in Vinyals et al. (2015) as instead of having part of speech tags as terminals, they contain the words of the translated sentence. We intentionally omit the POS information as including it would result in significantly longer sequences.
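As an illustration of this target-side representation, the following minimal sketch linearizes a lexicalized constituency tree in the spirit of Figure 2; the nested-tuple tree format is an assumption made for the example, not the output format of the parser used in this work.

```python
# A minimal sketch (an illustration, not the exact preprocessing code) of
# producing a linearized, lexicalized constituency tree in the style of
# Figure 2. Assumption: a parse is given as (label, children) tuples with
# POS pre-terminals dropped, so leaves are the target words/sub-words.

def linearize(node):
    """Return the linearized token sequence for a lexicalized tree."""
    if isinstance(node, str):          # a leaf: a word or sub-word
        return [node]
    label, children = node
    tokens = ["(" + label]             # labeled opening bracket
    for child in children:
        tokens.extend(linearize(child))
    tokens.append(")" + label)         # labeled closing bracket
    return tokens

tree = ("ROOT", [("S", [("NP", ["Jane"]),
                        ("VP", ["had", ("NP", ["a", "cat"])]),
                        "."])])
print(" ".join(linearize(tree)))
# (ROOT (S (NP Jane )NP (VP had (NP a cat )NP )VP . )S )ROOT
```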


Jane hatte eine Katze . → (ROOT (S (NP Jane )NP (VP had (NP a cat )NP )VP . )S )ROOT

Figure 2: An example of a translation from a string to a linearized, lexicalized constituency tree.

The S2T model is trained on parallel corpora in which the target sentences are automatically parsed. Since this modeling keeps the form of a sequence-to-sequence learning task, we can employ the conventional attention-based sequence to sequence paradigm (Bahdanau et al., 2014) as-is, while enriching the output with syntactic information.

Related Work  Some recent works did propose to incorporate syntactic or other linguistic knowledge into NMT systems, although mainly on the source side: Eriguchi et al. (2016a,b) replace the encoder in an attention-based model with a Tree-LSTM (Tai et al., 2015) over a constituency parse tree; Bastings et al. (2017) encoded sentences using graph-convolutional networks over dependency trees; Sennrich and Haddow (2016) proposed a factored NMT approach, where each source word embedding is concatenated to embeddings of linguistic features of the word; Luong et al. (2015) incorporated syntactic knowledge via multi-task sequence to sequence learning: their system included a single encoder with multiple decoders, one of which attempts to predict the parse-tree of the source sentence; Stahlberg et al. (2016) proposed a hybrid approach in which translations are scored by combining scores from an NMT system with scores from a Hiero (Chiang, 2005, 2007) system. Shi et al. (2016) explored the syntactic knowledge encoded by an NMT encoder, showing the encoded vector can be used to predict syntactic information like constituency trees, voice and tense with high accuracy.

In parallel and highly related to our work, Eriguchi et al. (2017) proposed to model the target syntax in NMT in the form of dependency trees by using an RNNG-based decoder (Dyer et al., 2016), while Nadejde et al. (2017) incorporated target syntax by predicting CCG tags serialized into the target translation. Our work differs from those by modeling syntax using constituency trees, as was previously common in the "traditional" syntax-based machine translation literature.

2 Experiments & Results

Experimental Setup  We first experiment in a resource-rich setting by using the German-English portion of the WMT16 news translation task (Bojar et al., 2016), with 4.5 million sentence pairs. We then experiment in a low-resource scenario using the German, Russian and Czech to English training data from the News Commentary v8 corpus, following Eriguchi et al. (2017). In all cases we parse the English sentences into constituency trees using the BLLIP parser (Charniak and Johnson, 2005).1 To enable an open vocabulary translation we used sub-word units obtained via BPE (Sennrich et al., 2016b) on both source and target.2

In each experiment we train two models: a baseline model (bpe2bpe), trained to translate from the source language sentences to English sentences without any syntactic annotation, and a string-to-linearized-tree model (bpe2tree), trained to translate into English linearized constituency trees as shown in Figure 2. Words are segmented into sub-word units using the BPE model we learn on the raw parallel data. We use the NEMATUS (Sennrich et al., 2017)3 implementation of an attention-based NMT model.4 We trained the models until there was no improvement on the development set in 10 consecutive checkpoints. Note that the only difference between the baseline and the bpe2tree model is the syntactic information, as they have a nearly-identical amount of model parameters (the only additional parameters to the syntax-aware system are the embeddings for the brackets of the trees).

For all models we report results of the best performing single model on the dev-set (newstest2013+newstest2014 in the resource-rich setting, newstest2015 in the rest, as measured by BLEU) when translating newstest2015 and newstest2016, similarly to Sennrich et al. (2016a); Eriguchi et al. (2017). To evaluate the string-to-tree translations we derive the surface form by removing the symbols that stand for non-terminals in the tree, followed by merging the sub-words. We also report the results of an ensemble of the last 5 checkpoints saved during each model training. We compute BLEU scores using the mteval-v13a.pl script from the Moses toolkit (Koehn et al., 2007).

1 https://github.com/BLLIP/bllip-parser
2 https://github.com/rsennrich/subword-nmt
3 https://github.com/rsennrich/nematus
4 Further technical details of the setup and training are available in the supplementary material.
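The surface-form recovery described above (dropping the non-terminal symbols and merging sub-words) can be sketched as follows; the "@@" continuation marker is an assumption based on the subword-nmt convention, not necessarily the exact post-processing script used here.

```python
# A minimal sketch of recovering the surface translation from a predicted
# linearized tree: drop the bracket/non-terminal symbols, then merge BPE
# sub-words. Assumption: sub-words carry the "@@" continuation marker.

def tree_to_surface(tokens):
    """Strip non-terminal symbols and merge sub-words back into words."""
    words = [t for t in tokens if not (t.startswith("(") or t.startswith(")"))]
    return " ".join(words).replace("@@ ", "")

prediction = "(ROOT (S (NP Jane )NP (VP had (NP a ki@@ tten )NP )VP . )S )ROOT"
print(tree_to_surface(prediction.split()))   # Jane had a kitten .
```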

system         newstest2015  newstest2016
bpe2bpe        27.33         31.19
bpe2tree       27.36         32.13
bpe2bpe ens.   28.62         32.38
bpe2tree ens.  28.7          33.24

Table 1: BLEU results for the WMT16 experiment

        system         newstest2015  newstest2016
DE-EN   bpe2bpe        13.81         14.16
        bpe2tree       14.55         16.13
        bpe2bpe ens.   14.42         15.07
        bpe2tree ens.  15.69         17.21
RU-EN   bpe2bpe        12.58         11.37
        bpe2tree       12.92         11.94
        bpe2bpe ens.   13.36         11.91
        bpe2tree ens.  13.66         12.89
CS-EN   bpe2bpe        10.85         11.23
        bpe2tree       11.54         11.65
        bpe2bpe ens.   11.46         11.77
        bpe2tree ens.  12.43         12.68

Table 2: BLEU results for the low-resource experiments (News Commentary v8)

Results  As shown in Table 1, for the resource-rich setting, the single models (bpe2bpe, bpe2tree) perform similarly in terms of BLEU on newstest2015. On newstest2016 we witness an advantage to the bpe2tree model. A similar trend is found when evaluating the model ensembles: while they improve results for both models, we again see an advantage to the bpe2tree model on newstest2016. Table 2 shows the results in the low-resource setting, where the bpe2tree model is consistently better than the bpe2bpe baseline. We find this interesting as the syntax-aware system performs a much harder task (predicting trees on top of the translations, thus handling much longer output sequences) while having a nearly-identical amount of model parameters. In order to better understand where or how the syntactic information improves translation quality, we perform a closer analysis of the WMT16 experiment.

3 Analysis

The Resulting Trees  Our model produced valid trees for 5970 out of 6003 sentences in the development set. While we did not perform an in-depth error-analysis, the trees seem to follow the syntax of English, and most choices seem reasonable.

Quantifying Reordering  English and German differ in word order, requiring a significant amount of reordering to generate a fluent translation. A major benefit of S2T models in SMT is facilitating reordering. Does this also hold for our neural S2T model? We compare the amount of reordering in the bpe2bpe and bpe2tree models using a distortion score based on the alignments derived from the attention weights of the corresponding systems. We first convert the attention weights to hard alignments by taking for each target word the source word with highest attention weight. For an n-word target sentence t and source sentence s, let a(i) be the position of the source word aligned to the target word in position i. We define:

d(s, t) = (1/n) Σ_{i=2}^{n} |a(i) − a(i−1)|

For example, for the translations in Figure 1, the above score for the bpe2tree model is 2.73, while the score for the bpe2bpe model is 1.27, as the bpe2tree model did more reordering. Note that for the bpe2tree model we compute the score only on tokens which correspond to terminals (words or sub-words) in the tree. We compute this score for each source-target pair on newstest2015 for each model. Figure 3 shows a histogram of the binned score counts. The bpe2tree model has more translations with distortion scores in bins 1 and onward and significantly fewer translations in the least-reordering bin (0) when compared to the bpe2bpe model, indicating that the syntactic information encouraged the model to perform more reordering.5 Figure 4 tracks the distortion scores throughout the learning process, plotting the average dev-set scores for the model checkpoints saved every 30k updates. Interestingly, both models follow the same trend: open with a relatively high distortion score, followed by a steep decrease, and from there ascend gradually. The bpe2tree model usually has a higher distortion score during training, as we would expect after our previous findings from Figure 3.

Tying Reordering and Syntax  The bpe2tree model generates translations with their constituency tree and their attention-derived alignments. We can use this information to extract GHKM rules (Galley et al., 2004).6 We derive hard alignments for that purpose by treating every source/target token-pair with attention score above 0.5 as an alignment.

5 We also note that in bins 4-6 the bpe2bpe model had slightly more translations, but this was not consistent among different runs, unlike the gaps in bins 0-3 which were consistent and contain most of the translations.
6 github.com/joshua-decoder/galley-ghkm
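The distortion measure defined above can be sketched as follows, assuming the attention weights are available as a target-by-source matrix; everything beyond the formula itself (the NumPy representation, the toy example) is illustrative.

```python
# A minimal sketch of the distortion score: convert an attention matrix into
# hard alignments (argmax over source positions per target token) and average
# the absolute jumps between consecutive alignments.
import numpy as np

def distortion_score(attention):
    """d(s, t) = (1/n) * sum_{i=2..n} |a(i) - a(i-1)|"""
    a = attention.argmax(axis=1)   # a(i): aligned source position per target token
    n = len(a)
    if n < 2:
        return 0.0
    return float(np.abs(np.diff(a)).sum()) / n

# Toy example: a monotone alignment gives a low score, a permuted one a higher score.
monotone = np.eye(4)
permuted = np.eye(4)[[0, 2, 3, 1]]
print(distortion_score(monotone), distortion_score(permuted))   # 0.75  1.25
```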

[Figures 3–5: a histogram of binned distortion scores, distortion-score curves during training, and relative-pronoun counts for the bpe2tree and bpe2bpe systems and the reference; plots omitted.]

Figure 3: newstest2015 DE-EN translations binned by distortion amount
Figure 4: Average distortion score on the dev-set during different training stages
Figure 5: Amount of English relative pronouns in newstest2015 translations
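The count behind Figure 5 can be sketched as follows; whitespace tokenization and lowercasing are simplifying assumptions, and occurrences of "that" used as a determiner are not distinguished, as noted later.

```python
# A minimal sketch of the relative-pronoun count reported in Figure 5.
from collections import Counter

RELATIVE_PRONOUNS = {"who", "which", "that", "whom", "whose"}

def count_relative_pronouns(sentences):
    counts = Counter()
    for sentence in sentences:
        for token in sentence.lower().split():   # assumption: whitespace tokens
            if token in RELATIVE_PRONOUNS:
                counts[token] += 1
    return counts

bpe2tree_out = ["several cameras that are all priced under 100 dollars ."]
bpe2bpe_out = ["several cameras , all priced under 100 dollars ."]
print(count_relative_pronouns(bpe2tree_out), count_relative_pronouns(bpe2bpe_out))
```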

LHS Top-5 RHS, sorted according to count. VP(x0:TER x1:NP) (244) x0 x1 (157) x1 x0 (80) x0 x1 ”,/.” (56) x1 x0 ”,/.” (17) x0 ”eine” x1 VP(x0:TER PP(x1:TER x2:NP)) (90) x1 x2 x0 (65) x0 x1 x2 (31) x1 x2 x0 ”,/.” (13) x0 x1 x2 ”,/.” (7) x1 ”der” x2 x0 VP(x0:TER x1:PP) (113) x1 x0 (82) x0 x1 (38) x1 x0 ”,/.” (18) x0 x1 ”,/.” (5) ”,/.” x0 x1 S(x0:NP VP(x1:TER x2:NP)) (69) x0 x1 x2 (51) x0 x2 x1 (35) x0 x1 x2 ”,/.” (20) x0 x2 x1 ”,/.” (6) ”die” x0 x1 x2 VP(x0:TER x1:NP x2:PP) (52) x0 x1 x2 (38) x1 x2 x0 (20) x1 x2 x0 ”,/.” (11) x0 x1 x2 ”,/.” (9) x2 x1 x0 VP(x0:TER x1:NP PP(x2:TER x3:NP)) (40) x0 x1 x2 x3 (32) x1 x2 x3 x0 (18) x1 x2 x3 x0 ”,/.” (8) x0 x1 x2 x3 ”,/.” (5) x2 x3 x1 x0 VP(x0:TER NP(x1:NP x2:PP)) (61) x0 x1 x2 (38) x1 x2 x0 (19) x0 x1 x2 ”,/.” (8) x0 ”eine” x1 x2 (8) x1 x2 x0 ”,/.” NP(x0:NP PP(x1:TER x2:NP)) (728) x0 x1 x2 (110) ”die” x0 x1 x2 (107) x0 x1 x2 ”,/.” (56) x0 x1 ”der” x2 (54) ”der” x0 x1 x2 S(VP(x0:TER x1:NP)) (41) x1 x0 (26) x0 x1 (14) x1 x0 ”,/.” (7) ”die” x1 x0 (5) x0 x1 ”,/.” VP(x0:TER x1:VP) (73) x0 x1 (38) x1 x0 (25) x0 x1 ”,/.” (15) x1 x0 ”,/.” (9) ”,/.” x0 x1

Table 3: Top dev-set GHKM Rules with reordering. Numbers: rule counts. Bolded: reordering rules.

src Dutzende turkischer¨ Polizisten wegen ”Verschworung”¨ gegen die Regierung festgenommen ref Tens of Turkish Policemen Arrested over ’Plotting’ against Gov’t 2tree dozens of Turkish police arrested for ”conspiracy” against the government. 2bpe dozens of turkish policemen on ”conspiracy” against the government arrested src Die Menschen in London weinten, als ich unsere Geschichte erzhlte. Er ging einen Monat nicht zu Arbeit. ref People in London were crying when I told our story. He ended up spending a month off work. 2tree the people of london wept as I told our story. he did not go to work a month. 2bpe the people of London, when I told our story. he went one month to work. src Achenbach habe fur¨ 121 Millionen Euro Wertgegenstande¨ fur¨ Albrecht angekauft. ref Achenbach purchased valuables for Albrecht for 121 million euros. 2tree Achenbach has bought valuables for Albrecht for 121 million euros. 2bpe Achenbach have purchased value of 121 million Euros for Albrecht. src Apollo investierte 2008 1 Milliarde $ in Norwegian Cruise. Konntest¨ du mal mit dem ”ich liebe dich” aufhoren?¨ ref Apollo made a $1 billion investment in Norwegian Cruise in 2008. Could you stop with the ”I love you”? 2tree Apollo invested EUR $1 billion in Norwegian Cruise in 2008. Could you stop saying ”I love you? 2bpe Apollo invested 2008 $1 billion in Norwegian Cruise. Can you say with the ”I love you” stop? src Gerade in dieser schweren Phase hat er gezeigt, dass er fur¨ uns ein sehr wichtiger Spieler ist”, konstatierte Barisic. ref Especially during these difficult times, he showed that he is a very important player for us”, Barisic stated. 2tree Especially at this difficult time he has shown that he is a very important player for us,” said Barisic. 2bpe It is precisely during this difficult period that he has shown us to be a very important player, ”Barisic said. src Hopfen und Malz - auch in China eine beliebte Kombination. ”Ich weiß jetzt, dass ich das kann - prima!” ref Hops and malt - a popular combination even in China. ”I now know that I can do it - brilliant!” 2tree Hops and malt - a popular combination in China. ”I now know that I can do that! 2bpe Hops and malt - even in China, a popular combination. I know now that I can that - prima!” src Die Ukraine hatte gewarnt, Russland konnte¨ auch die Gasversorgung fur¨ Europa unterbrechen. ref Ukraine warned that Russia could also suspend the gas supply to Europe. 2tree Ukraine had warned that Russia could also stop the supply of gas to Europe. 2bpe Ukraine had been warned, and Russia could also cut gas supplies to Europe. src Bis dahin gab es in Kollbach im Schulverband Petershausen-Kollbach drei Klassen und in Petershausen funf.¨ ref Until then, the school district association of Petershausen-Kollbach had three classes in Kollbach and five in Petershausen. 2tree until then, in Kollbach there were three classes and five classes in Petershausen. 2bpe until then there were three classes and in Petershausen five at the school board in Petershausen-Kollbach.

Table 4: Translation examples from newstest2015. The underlines correspond to the source word at- tended by the first opening bracket (these are consistently the main verbs or structural markers) and the target words this source word was most strongly aligned to. See the supplementary material for an attention weight matrix example when predicting a tree (Figure6) and additional output examples.

135 hard alignments for that purpose by treating ev- most strongly aligned to. In general, we find the ery source/target token-pair with attention score alignments from the syntax-based system more above 0.5 as an alignment. Extracting rules from sensible (i.e. in Figure1 – the bpe2bpe alignments the dev-set predictions resulted in 233,657 rules, are off-by-1). where 22,914 of them (9.8%) included reorder- Qualitative Analysis and Human Evaluations ing, i.e. contained variables ordered differently in The bpe2tree translations read better than their the source and the target. We grouped the rules bpe2bpe counterparts, both syntactically and se- by their LHS (corresponding to a target syntac- mantically, and we highlight some examples tic structure), and sorted them by the total num- which demonstrate this. Table4 lists some rep- ber of RHS (corresponding to a source sequential resentative examples, highlighting improvements structure) with reordering. Table3 shows the top that correspond to syntactic phenomena involving 10 extracted LHS, together with the top-5 RHS, reordering or global structure. We also performed for each rule. The most common rule, VP(x0:TER a small-scale human-evaluation using mechanical x :NP) x x , found in 184 sentences in the 1 → 1 0 turk on the first 500 sentences in the dev-set. Fur- dev set (8.4%), is indicating that the sequence x1 ther details are available in the supplementary ma- x0 in German was reordered to form a verb phrase terial. The results are summarized in the following in English, in which x0 is a terminal and x1 is a table: noun phrase. The extracted GHKM rules reveal 2bpe weakly better 100 very sensible German-English reordering patterns. 2bpe strongly better 54 2tree weakly better 122 Relative Constructions Browsing the produced 2tree strongly better 64 trees hints at a tendency of the syntax-aware model both good 26 to favor using relative-clause structures and sub- both bad 3 ordination over other syntactic constructions (i.e., disagree 131 “several cameras that are all priced...” vs. “sev- As can be seen, in 186 cases (37.2%) the human eral cameras, all priced...”). To quantify this, we evaluators preferred the bpe2tree translations, vs. count the English relative pronouns (who, which, 154 cases (30.8%) for bpe2bpe, with the rest of the that7, whom, whose) found in the newstest2015 cases (30%) being neutral. translations of each model and in the reference 4 Conclusions and Future Work translations, as shown in Figure5. The bpe2tree model produces more relative constructions com- We present a simple string-to-tree neural transla- pared to the bpe2bpe model, and both models pro- tion model, and show it produces results which duce more such constructions than found in the are better than those of a neural string-to-string reference. model. While this work shows syntactic infor- mation about the target side can be beneficial for While not discussed until this Main Verbs NMT, this paper only scratches the surface with point, the generated opening and closing brack- what can be done on the subject. First, better mod- ets also have attention weights, providing another els can be proposed to alleviate the long sequence opportunity to to peak into the model’s behavior. problem in the linearized approach or allow a more Figure6 in the supplementary material presents an natural tree decoding scheme (Alvarez-Melis and example of a complete attention matrix, including Jaakkola, 2017). Comparing our approach to other the syntactic brackets. 
While making full sense of syntax aware NMT models like Eriguchi et al. the attention patterns of the syntactic elements re- (2017) and Nadejde et al.(2017) may also be of in- mains a challenge, one clear trend is that opening terest. A Contrastive evaluation (Sennrich, 2016) the very first bracket of the sentence consistently of a syntax-aware system vs. a syntax-agnostic attends to the main verb or to structural mark- system may also shed light on the benefits of in- ers (i.e. question marks, hyphens) in the source corporating syntax into NMT. sentence, suggesting a planning-ahead behavior of the decoder. The underlines in Table4 correspond Acknowledgments to the source word attended by the first opening This work was supported by the Intel Collabora- bracket, and the target word this source word was tive Research Institute for Computational Intelli- 7”that” also functions as a determiner. We do not distin- gence (ICRI-CI), and The Israeli Science Founda- guish the two cases. tion (grant number 1555/15).

136 References tion. In Proceedings of the 3rd Workshop on Asian Translation (WAT2016). pages 175–183. David Alvarez-Melis and Tommi S. Jaakkola. 2017. Tree-structured decoding with doubly recurrent neu- Akiko Eriguchi, Kazuma Hashimoto, and Yoshi- ral networks. International Conference on Learning masa Tsuruoka. 2016b. Tree-to-sequence atten- Representations (ICLR) . tional neural machine translation. In Proceed- ings of the 54th Annual Meeting of the As- Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben- sociation for Computational Linguistics (Volume gio. 2014. Neural machine translation by jointly 1: Long Papers). Association for Computational learning to align and translate. arXiv preprint Linguistics, Berlin, Germany, pages 823–833. arXiv:1409.0473 . http://www.aclweb.org/anthology/P16-1078. Joost Bastings, Ivan Titov, Wilker Aziz, Diego Akiko Eriguchi, Yoshimasa Tsuruoka, and Kyunghyun Marcheggiani, and Khalil Simaan. 2017. Graph Cho. 2017. Learning to parse and translate im- convolutional encoders for syntax-aware neural ma- proves neural machine translation. arXiv preprint chine translation. arXiv preprint arXiv:1704.04675 arXiv:1702.03525 http://arxiv.org/abs/1702.03525. . Michel Galley, Jonathan Graehl, Kevin Knight, Daniel Ondrejˇ Bojar, Rajen Chatterjee, Christian Federmann, Marcu, Steve DeNeefe, Wei Wang, and Ignacio Yvette Graham, Barry Haddow, Matthias Huck, Thayer. 2006. Scalable inference and training Antonio Jimeno Yepes, Philipp Koehn, Varvara of context-rich syntactic translation models. In Logacheva, Christof Monz, Matteo Negri, Aure- Proceedings of the 21st International Conference lie Neveol, Mariana Neves, Martin Popel, Matt on Computational Linguistics and the 44th annual Post, Raphael Rubino, Carolina Scarton, Lucia Spe- meeting of the Association for Computational Lin- cia, Marco Turchi, Karin Verspoor, and Marcos guistics. Association for Computational Linguistics, Zampieri. 2016. Findings of the 2016 conference pages 961–968. on machine translation. In Proceedings of the First Conference on Machine Translation. Association for Michel Galley, Mark Hopkins, Kevin Knight, and Computational Linguistics, Berlin, Germany, pages Daniel Marcu. 2004. What’s in a translation rule? 131–198. In Daniel Marcu Susan Dumais and Salim Roukos, editors, HLT-NAACL 2004: Main Proceedings. As- Eugene Charniak and Mark Johnson. 2005. Coarse- sociation for Computational Linguistics, Boston, to-fine n-best parsing and maxent discriminative Massachusetts, USA, pages 273–280. reranking. In Proceedings of the 43rd Annual Meet- ing on Association for Computational Linguistics. Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent Association for Computational Linguistics, pages continuous translation models. In Proceedings of 173–180. the 2013 Conference on Empirical Methods in Natu- ral Language Processing. Association for Computa- David Chiang. 2005. A hierarchical phrase-based tional Linguistics, Seattle, Washington, USA, pages model for statistical machine translation. In Pro- 1700–1709. http://www.aclweb.org/anthology/D13- ceedings of the 43rd Annual Meeting on Associa- 1176. tion for Computational Linguistics. Association for Computational Linguistics, pages 263–270. Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press, New York, NY, USA, David Chiang. 2007. Hierarchical phrase-based trans- 1st edition. lation. computational linguistics 33(2):201–228. Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Do Kook Choe and Eugene Charniak. 2016. 
Pars- Callison-Burch, Marcello Federico, Nicola Bertoldi, ing as language modeling. In Proceedings of the Brooke Cowan, Wade Shen, Christine Moran, 2016 Conference on Empirical Methods in Natu- Richard Zens, et al. 2007. Moses: Open source ral Language Processing. Association for Computa- toolkit for statistical machine translation. In Pro- tional Linguistics, Austin, Texas, pages 2331–2336. ceedings of the 45th annual meeting of the ACL on https://aclweb.org/anthology/D16-1257. interactive poster and demonstration sessions. As- sociation for Computational Linguistics, pages 177– Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, 180. and Noah A. Smith. 2016. Recurrent neural net- work grammars. In Proceedings of the 2016 Con- Minh-Thang Luong, Quoc V Le, Ilya Sutskever, Oriol ference of the North American Chapter of the Asso- Vinyals, and Lukasz Kaiser. 2015. Multi-task ciation for Computational Linguistics: Human Lan- sequence to sequence learning. arXiv preprint guage Technologies. Association for Computational arXiv:1511.06114 . Linguistics, San Diego, California, pages 199–209. http://www.aclweb.org/anthology/N16-1024. Maria Nadejde, Siva Reddy, Rico Sennrich, Tomasz Dwojak, Marcin Junczys-Dowmunt, Philipp Koehn, Akiko Eriguchi, Kazuma Hashimoto, and Yoshimasa and Alexandra Birch. 2017. Syntax-aware neu- Tsuruoka. 2016a. Character-based decoding in tree- ral machine translation using CCG. arXiv preprint to-sequence attention-based neural machine transla- arXiv:1702.01147 .

137 Rico Sennrich. 2016. How grammatical is character- Conference on Natural Language Processing (Vol- level neural machine translation? assessing mt qual- ume 1: Long Papers). Association for Compu- ity with contrastive translation pairs. arXiv preprint tational Linguistics, Beijing, China, pages 1556– arXiv:1612.04629 . 1566. http://www.aclweb.org/anthology/P15-1150.

Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexan- Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, dra Birch, Barry Haddow, Julian Hitschler, Marcin Ilya Sutskever, and Geoffrey Hinton. 2015. Gram- Junczys-Dowmunt, Samuel L”aubli, Antonio Vale- mar as a foreign language. In Advances in Neural rio Miceli Barone, Jozef Mokry, and Maria Nade- Information Processing Systems. pages 2773–2781. jde. 2017. Nematus: a Toolkit for Neural Machine Translation. In Proceedings of the Demonstrations P. Williams, M. Gertz, and M. Post. 2016. Syntax- at the 15th Conference of the European Chapter of Based Statistical Machine Translation. Morgan & the Association for Computational Linguistics. Va- Claypool publishing. lencia, Spain. Kenji Yamada and Kevin Knight. 2001. A syntax- based statistical translation model. In Proceed- Rico Sennrich and Barry Haddow. 2016. Linguis- ings of the 39th Annual Meeting on Association for tic input features improve neural machine trans- Computational Linguistics. Association for Compu- lation. In Proceedings of the First Conference tational Linguistics, pages 523–530. on Machine Translation. Association for Computa- tional Linguistics, Berlin, Germany, pages 83–91. Kenji Yamada and Kevin Knight. 2002. A decoder http://www.aclweb.org/anthology/W16-2209. for syntax-based statistical mt. In Proceedings of the 40th Annual Meeting on Association for Compu- Rico Sennrich, Barry Haddow, and Alexandra Birch. tational Linguistics. Association for Computational 2016a. Edinburgh neural machine translation sys- Linguistics, pages 303–310. tems for wmt 16. In Proceedings of the First Confer- ence on Machine Translation. Association for Com- Matthew D Zeiler. 2012. Adadelta: an adaptive learn- putational Linguistics, Berlin, Germany, pages 371– ing rate method. arXiv preprint arXiv:1212.5701 . 376. http://www.aclweb.org/anthology/W16-2323.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In Proceed- ings of the 54th Annual Meeting of the As- sociation for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, Germany, pages 1715–1725. http://www.aclweb.org/anthology/P16-1162.

Xing Shi, Inkit Padhi, and Kevin Knight. 2016. Does string-based neural mt learn source syntax? In Pro- ceedings of the 2016 Conference on Empirical Meth- ods in Natural Language Processing. Association for Computational Linguistics, Austin, Texas, pages 1526–1534. https://aclweb.org/anthology/D16- 1159.

Felix Stahlberg, Eva Hasler, Aurelien Waite, and Bill Byrne. 2016. Syntactically guided neural machine translation. In Proceedings of the 54th Annual Meet- ing of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computa- tional Linguistics, Berlin, Germany, pages 299–305. http://anthology.aclweb.org/P16-2049.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural net- works. arXiv preprint arXiv:1409.3215 .

Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representa- tions from tree-structured long short-term mem- ory networks. In Proceedings of the 53rd An- nual Meeting of the Association for Computa- tional Linguistics and the 7th International Joint

138 A Supplementary Material “a little better”. We count as “strongly better” the cases where both annotators indicated the same Data The English side of the corpus was tok- sentence as better, as “weakly better” the cases enized (into Penn treebank format) and truecased were one annotator chose a sentence and the other using the scripts provided in Moses (Koehn et al., indicated they are both good/bad. Other cases are 2007). We ran the BPE process on a concatenation treated as either “both good” / “both bad” or as of the source and target corpus, with 89500 BPE disagreements. operations in the WMT experiment and with 45k operations in the other experiments. This resulted in an input vocabulary of 84924 tokens and an out- put vocabulary of 78499 tokens in the WMT16 experiment. The linearized constituency trees are obtained by simply replacing the POS tags in the parse trees with the corresponding word or sub- words. The output vocabulary in the bpe2tree models includes the target subwords and the tree symbols which correspond to an opening or clos- ing of a specific phrase type.

Hyperparameters The word embedding size was set to 500/256 and the encoder and decoder sizes were set to 1024/256 (WMT16/other ex- periments). For optimization we used Adadelta (Zeiler, 2012) with minibatch size of 40. For de- coding we used beam search with a beam size of 12. We trained the bpe2tree WMT16 model on sequences with a maximum length of 150 to- kens (the average length for a linearized tree in the training set was about 50 tokens). It was trained for two weeks on a single Nvidia TitanX GPU. Figure 6: The attention weights for the string-to- The bpe2bpe WMT16 model was trained on se- tree translation in Figure1 quences with a maximum length of 50 tokens, and Additional Output Examples from both mod- with minibatch size of 80. It was trained for one els, in the format of Figure1. Notice the improved week on a single Nvidia TitanX GPU. Only in the translation and alignment quality in the tree-based low-resource experiments we applied dropout as translations, as well as the overall high structural described in Sennrich et al.(2016a) for Romanian- quality of the resulting trees. The few syntactic English. mistakes in these examples are attachment errors Human Evaluation We performed human- of SBAR and PP phrases, which will also chal- evaluation on the Mechnical Turk platform. Each lenge dedicated parsers. sentence was evaluated using two annotators. For each sentence, we presented the annotators with the English reference sentence, followed by the outputs of the two systems. The German source was not shown, and the two system’s outputs were shown in random order. The annotators were in- structed to answer “Which of the two sentences, in your view, is a better portrayal of the the reference sentence.” They were then given 6 options: “sent 1 is better”, “sent 2 is better”, “sent 1 is a little bet- ter”, “sent 2 is a little better”, “both sentences are equally good”, “both sentences are equally bad”. We then ignore differences between “better” and

139 140 Chapter 5

Semantic Sentence Simplification by Splitting and Rephrasing Complex Sentences

56 Split and Rephrase: Better Evaluation and a Stronger Baseline

Roee Aharoni & Yoav Goldberg Computer Science Department Bar-Ilan University Ramat-Gan, Israel roee.aharoni,yoav.goldberg @gmail.com { }

Abstract ing RDF information, and a semantics-augmented setup that does. They report a BLEU score of 48.9 Splitting and rephrasing a complex sen- for their best text-to-text system, and of 78.7 for tence into several shorter sentences that the best RDF-aware one. We focus on the text-to- convey the same meaning is a chal- text setup, which we find to be more challenging lenging problem in NLP. We show that and more natural. while vanilla seq2seq models can reach We begin with vanilla SEQ2SEQ models with high scores on the proposed benchmark attention (Bahdanau et al., 2015) and reach an ac- (Narayan et al., 2017), they suffer from curacy of 77.5 BLEU, substantially outperforming memorization of the training set which the text-to-text baseline of Narayan et al.(2017) contains more than 89% of the unique and approaching their best RDF-aware method. simple sentences from the validation and However, manual inspection reveal many cases test sets. To aid this, we present a of unwanted behaviors in the resulting outputs: new train-development-test data split and (1) many resulting sentences are unsupported by neural models augmented with a copy- the input: they contain correct facts about rele- mechanism, outperforming the best re- vant entities, but these facts were not mentioned ported baseline by 8.68 BLEU and foster- in the input sentence; (2) some facts are re- ing further progress on the task. peated—the same fact is mentioned in multiple output sentences; and (3) some facts are missing— 1 Introduction mentioned in the input but omitted in the output. Processing long, complex sentences is challeng- The model learned to memorize entity-fact pairs ing. This is true either for humans in various instead of learning to split and rephrase. Indeed, circumstances (Inui et al., 2003; Watanabe et al., feeding the model with examples containing enti- 2009; De Belder and Moens, 2010) or in NLP ties alone without any facts about them causes it tasks like parsing (Tomita, 1986; McDonald and to output perfectly phrased but unsupported facts Nivre, 2011; Jel´ınek, 2014) and machine trans- (Table3). Digging further, we find that 99% lation (Chandrasekar et al., 1996; Pouget-Abadie of the simple sentences (more than 89% of the et al., 2014; Koehn and Knowles, 2017). An auto- unique ones) in the validation and test sets also matic system capable of breaking a complex sen- appear in the training set, which—coupled with tence into several simple sentences that convey the the good memorization capabilities of SEQ2SEQ same meaning is very appealing. models and the relatively small number of dis- A recent work by Narayan et al.(2017) in- tinct simple sentences—helps to explain the high troduced a dataset, evaluation method and base- BLEU score. line systems for the task, naming it “Split-and- To aid further research on the task, we pro- Rephrase”. The dataset includes 1,066,115 in- pose a more challenging split of the data. We stances mapping a single complex sentence to a also establish a stronger baseline by extending sequence of sentences that express the same mean- the SEQ2SEQ approach with a copy mechanism, ing, together with RDF triples that describe their which was shown to be helpful in similar tasks (Gu semantics. They considered two system setups: a et al., 2016; Merity et al., 2017; See et al., 2017). text-to-text setup that does not use the accompany- On the original split, our models outperform the

719 Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Short Papers), pages 719–724 Melbourne, Australia, July 15 - 20, 2018. c 2018 Association for Computational Linguistics count unique Model BLEU #S/C #T/S RDF entities 32,186 925 SOURCE 55.67 1.0 21.11 RDF relations 16,093 172 REFERENCE – 2.52 10.93 complex sentences 1,066,115 5,544 Narayan et al.(2017) simple sentences 5,320,716 9,552 HYBRIDSIMPL 39.97 1.26 17.55 train complex sentences 886,857 4,438 SEQ2SEQ 48.92 2.51 10.32 train simple sentences 4,451,959 8,840 MULTISEQ2SEQ* 42.18 2.53 10.69 dev complex sentences 97,950 554 SPLIT-MULTISEQ2SEQ* 77.27 2.84 11.63 dev simple sentences 475,337 3,765 SPLIT-SEQ2SEQ* 78.77 2.84 9.28 test complex sentences 81,308 554 This work test simple sentences 393,420 4,015 SEQ2SEQ128 76.56 2.53 10.53 % dev simple in train 99.69% 90.9% SEQ2SEQ256 77.48 2.57 10.56 % test simple in train 99.09% 89.8% SEQ2SEQ512 75.92 2.59 10.59 % dev vocab in train 97.24% % test vocab in train 96.35% Table 2: BLEU scores, simple sentences per complex sentence (#S/C) and tokens per simple Table 1: Statistics for the WEBSPLIT dataset. sentence (#T/S), as computed over the test set. SOURCE are the complex sentences and REFER- best baseline of Narayan et al.(2017) by up to 8.68 ENCE are the reference rephrasings from the test BLEU, without using the RDF triples. On the new set. Models marked with * use the semantic RDF split, the vanilla SEQ2SEQ models break com- triples. pletely, while the copy-augmented models per- sentences3 and report the average number of sim- form better. In parallel to our work, an updated ple sentences in each prediction, and the average version of the dataset was released (v1.0), which is number of tokens for each simple sentence. We larger and features a train/test split protocol which train vanilla sequence-to-sequence models with at- is similar to our proposal. We report results on this tention (Bahdanau et al., 2015) as implemented dataset as well. The code and data to reproduce in the OPENNMT-PY toolkit (Klein et al., 2017).4 our results are available on Github.1 We encour- Our models only differ in the LSTM cell size (128, age future work on the split-and-rephrase task to 256 and 512, respectively). See the supplemen- use our new data split or the v1.0 split instead of tary material for training details and hyperparame- the original one. ters. We compare our models to the baselines pro- 2 Preliminary Experiments posed in Narayan et al.(2017). H YBRIDSIMPL and SEQ2SEQ are text-to-text models, while the Task Definition In the split-and-rephrase task other reported baselines additionally use the RDF we are given a complex sentence C, and need to information. produce a sequence of simple sentences T1, ..., Tn, n 2, such that the output sentences convey all Results As shown in Table2, our 3 models ob- ≥ and only the information in C. As additional su- tain higher BLEU scores then the SEQ2SEQ base- pervision, the split-and-rephrase dataset associates line, with up to 28.35 BLEU improvement, de- each sentence with a set of RDF triples that de- spite being single-layer models vs. the 3-layer scribe the information in the sentence. Note that models used in Narayan et al.(2017). A possible the number of simple sentences to generate is not explanation for this discrepancy is the SEQ2SEQ given as part of the input. baseline using a dropout rate of 0.8, while we use 0.3 and only apply it on the LSTM out- Experimental Details We focus on the task of puts. 
Our results are also better than the MUL- splitting a complex sentence into several simple TISEQ2SEQ and SPLIT-MULTISEQ2SEQ mod- ones without access to the corresponding RDF els, which use explicit RDF information. We triples in either train or test time. For evaluation also present the macro-average5 number of sim- we follow Narayan et al.(2017) and compute the set. averaged individual multi-reference BLEU score 3Using NLTK v3.2.5 https://www.nltk.org/ for each prediction.2 We split each prediction to 4https://github.com/OpenNMT/OpenNMT-py commit d4ab35a 1https://github.com/biu-nlp/ 5Since the number of references varies greatly from one sprp-acl2018 complex sentence to another, (min: 1, max: 76,283, median: 2Note that this differs from ”normal” multi-reference 16) we avoid bias towards the complex sentences with many BLEU (as implemented in multi-bleu.pl) since the references by performing macro average, i.e. we first average number of references differs among the instances in the test- the number of simple sentences in each reference among the

720 Input Prediction A Fortress of Grey Ice with ISBM 0-7653- J.V. Jones authored A Fortress of Grey Ice . 0633-6 has 672 pages . A Fortress of Grey Ice has 672 pages . The address , 11 Diagonal Street is located The address , 11 Diagonal Street is located in South Africa . in South Africa where the leader is Cyril The leader of South Africa is called Cyril Ramaphosa . Ramaphosa and some Asian South Africans The leader of South Africa is called Cyril Ramaphosa . live. The leader of South Africa is called Cyril Ramaphosa . Alan Shepard Alan Shepard Alan Shepard Alan Shepard is dead . Alan Shepard was a test pilot . AFC Ajax AFC Ajax AFC Ajax AFC Ajax ’s manager is . AFC Ajax N.V. own Sportpark De Toekomst . Table 3: Predictions from a vanilla SEQ2SEQ model, illustrating unsupported facts, missing facts and repeated facts. The last two rows show inputs we composed to demonstrate that the models memorize entity-fact pairs. ple sentences per complex sentence in the ref- “the ISBN number”. This explains the abundance erence rephrasings (REFERENCE) showing that of “hallucinated” unsupported facts: rather than the SPLIT-MULTISEQ2SEQ and SPLIT-SEQ2SEQ learning to split and rephrase, the model learned baselines may suffer from over-splitting since the to identify entities, and spit out a list of facts it had reference splits include 2.52 simple sentences on memorized about them. To validate this assump- average, while the mentioned models produced tion, we count the number of predicted sentences 2.84 sentences. which appeared as-is in the training data. We find that 1645 out of the 1693 (97.16%) predicted sen- tences appear verbatim in the training set. Table 1 gives more detailed statistics on the WEBSPLIT dataset. To further illustrate the model’s recognize-and- spit strategy, we compose inputs containing an entity string which is duplicated three times, as shown in the bottom two rows of Table3. As expected, the model predicted perfectly phrased and correct facts about the given entities, although these facts are clearly not supported by the input.

3 New Data-split The original data-split is not suitable for measur- ing generalization, as it is susceptible to “cheat- Figure 1: SEQ2SEQ512’s attention weights. Hor- ing” by fact memorization. We construct a izontal: input. Vertical: predictions. new train-development-test split to better reflect our expected behavior from a split-and-rephrase Analysis We begin analyzing the results by model. We split the data into train, development manually inspecting the model’s predictions on and test sets by randomly dividing the 5,554 dis- the validation set. This reveals three common tinct complex sentences across the sets, while us- kinds of mistakes as demonstrated in Table3: un- ing the provided RDF information to ensure that: supported facts, repetitions, and missing facts. All the unsupported facts seem to be related to enti- 1. Every possible RDF relation (e.g., BORNIN, ties mentioned in the source sentence. Inspecting LOCATEDIN) is represented in the training the attention weights (Figure1) reveals a worry- set (and may appear also in the other sets). ing trend: throughout the prediction, the model 2. Every RDF triplet (a complete fact) is repre- focuses heavily on the first word in of the first en- sented only in one of the splits. tity (“A wizard of Mars”) while paying little atten- tion to other cues like “hardcover”, “Diane” and While the set of complex sentences is still di- references of a specific complex sentence, and then average vided roughly to 80%/10%/10% as in the original these numbers. split, now there are nearly no simple sentences in

721 count unique BLEU #S/C #T/S train complex sentences 1,039,392 4,506 SOURCE 55.67 1.0 21.11 train simple sentences 5,239,279 7,865 REFERENCE – 2.52 10.93 dev complex sentences 13,294 535 SPLIT-SEQ2SEQ 78.77 2.84 9.28 dev simple sentences 39,703 812 SEQ2SEQ128 76.56 2.53 10.53 test complex sentences 13,429 503 SEQ2SEQ256 77.48 2.57 10.56 test simple sentences 41,734 879 SEQ2SEQ512 75.92 2.59 10.59 # dev simple in train 35 (0.09%) COPY128 78.55 2.51 10.29

# test simple in train 1 (0%) original data split COPY256 83.73 2.49 10.66 % dev vocab in train 62.99% COPY512 87.45 2.56 10.50 % test vocab in train 61.67% SOURCE 55.66 1.0 20.37 dev entities in train 26/111 (23.42%) REFERENCE – 2.40 10.83 test entities in train 25/120 (20.83%) SEQ2SEQ128 5.55 2.27 11.68 dev relations in train 34/34 (100%) SEQ2SEQ256 5.28 2.27 10.54 test relations in train 37/37 (100%) SEQ2SEQ512 6.68 2.44 10.23 COPY128 16.71 2.0 10.53

Table 4: Statistics for the RDF-based data split new data split COPY256 23.78 2.38 10.55 the development and test sets that appear verba- COPY512 24.97 2.87 10.04 tim in the train-set. Yet, every relation appear- SOURCE 56.1 1.0 20.4 REFERENCE – 2.48 10.69 ing in the development and test sets is supported v1.0 COPY512 25.47 2.29 11.74 by examples in the train set. We believe this split Table 5: Results over the test sets of the original, strikes a good balance between challenge and fea- our proposed split and the v1.0 split sibility: to succeed, a model needs to learn to iden- tify relations in the complex sentence, link them to tions are computed, the final probability for an out- their arguments, and produce a rephrasing of them. put word w is: However, it is not required to generalize to unseen relations. 6 p(w) = p(z = 1)pcopy(w) + p(z = 0)psoftmax(w) The data split and scripts for creating it are In case w is not present in the output vocabulary, available on Github.7 Statistics describing the data we set psoftmax(w) = 0. We refer the reader to split are detailed in Table4. See et al.(2017) for a detailed discussion regard- ing the copy mechanism. 4 Copy-augmented Model To better suit the split-and-rephrase task, we aug- 5 Experiments and Results ment the SEQ2SEQ models with a copy mecha- Models with larger capacities may have greater nism. Such mechanisms have proven to be benefi- representation power, but also a stronger tendency cial in similar tasks like abstractive summarization to memorize the training data. We therefore per- (Gu et al., 2016; See et al., 2017) and language form experiments with copy-enhanced models of modeling (Merity et al., 2017). We hypothesize varying LSTM widths (128, 256 and 512). We that biasing the model towards copying will im- train the models using the negative log likelihood prove performance, as many of the words in the of p(w) as the objective. Other than the copy simple sentences (mostly corresponding to enti- mechanism, we keep the settings identical to those ties) appear in the complex sentence, as evident by in Section2. We train models on the original split, the relatively high BLEU scores for the SOURCE our proposed data split and the v1.0 split. baseline in Table2. Copying is modeled using a “copy switch” Results Table5 presents the results. On the orig- probability p(z) computed by a sigmoid over a inal data-split, the COPY512 model outperforms learned composition of the decoder state, the con- all baselines, improving over the previous best by text vector and the last output embedding. It in- 8.68 BLEU points. On the new data-split, as ex- pected, the performance degrades for all models, terpolates the psoftmax distribution over the target vocabulary and a copy distribution pcopy over the as they are required to generalize to sentences not source sentence tokens. pcopy is simply the com- seen during training. The copy-augmented mod- puted attention weights. Once the above distribu- els perform better than the baselines in this case as well, with a larger relative gap which can be ex- 6 The updated dataset (v1.0, published by Narayan et al. plained by the lower lexical overlap between the after this work was accepted) follows (2) above, but not (1). 7https://github.com/biu-nlp/ train and the test sets in the new split. On the v1.0 sprp-acl2018 split the results are similar to those on our split,

722 Input Prediction Madeleine L’Engle who is influenced by George A Severed Wasp was written by George Macdonald . Macdonald wrote “ A Severed Wasp ”. A Severed Wasp is from the United States . The A.C. Lumezzane has 4150 members and play A.C. Lumezzane ’s ground is in the Lega Pro League . in the Lega Pro League. A.C. Lumezzane has 4150 members . Irish English is the official language of Ireland , Adare Manor is located in Ireland . which is lead by Enda Kenny and home to Adare English is the language of Ireland . Manor . English is the language of Ireland . (repeated x3) Table 6: Predictions from the COPY512 model, trained on the new data split. in spite of it being larger (1,331,515 vs. 886,857 examples), indicating that merely adding data will Model unsup. repeated correct missing original split not solve the task. SEQ2SEQ128 5 4 40/49 (82%) 9 SEQ2SEQ256 2 2 42/46 (91%) 5 SEQ2SEQ512 12 2 36/49 (73%) 5 Analysis We inspect the models’ predictions for COPY128 3 4 42/49 (86%) 4 COPY256 3 2 45/50 (90%) 6 the first 20 complex sentences of the original and COPY512 5 0 46/51 (90%) 3 new validation sets in Table7. We mark each sim- new split ple sentence as being “correct” if it contains all SEQ2SEQ128 37 8 0 54 SEQ2SEQ256 41 7 0 54 and only relevant information, “unsupported” if it SEQ2SEQ512 43 5 0 54 contains facts not present in the source, and “re- COPY128 23 3 2/27 (7%) 52 peated” if it repeats information from a previous COPY256 35 2 3/40 (7%) 49 COPY512 36 13 11/54 (20%) 43 sentence. We also count missing facts. Figure v1.0 split 2 shows the attention weights of the COPY512 COPY512 41 3 3/44 (7%) 51 model for the same sentence in Figure1. Reassur- Table 7: Results of the manual analysis, showing ingly, the attention is now distributed more evenly the number of simple sentences with unsupported over the input symbols. On the new splits, all facts (unsup.), repeated facts, missing facts and models perform catastrophically. Table6 shows correct facts, for 20 complex sentences from the outputs from the COPY512 model when trained original and new validation sets. on the new split. On the original split, while SEQ2SEQ128 mainly suffers from missing infor- 6 Conclusions mation, perhaps due to insufficient memorization We demonstrated that a SEQ2SEQ model can ob- capacity, SEQ2SEQ512 generated the most unsup- tain high scores on the original split-and-rephrase ported sentences, due to overfitting or memoriza- task while not actually learning to split-and- tion. The overall number of issues is clearly re- rephrase. We propose a new and more challenging duced in the copy-augmented models. data-split to remedy this, and demonstrate that the cheating SEQ2SEQ models fail miserably on the new split. Augmenting the SEQ2SEQ models with a copy-mechanism improves performance on both data splits, establishing a new competitive base- line for the task. Yet, the split-and-rephrase task (on the new split) is still far from being solved. We strongly encourage future research to evaluate on our proposed split or on the recently released version 1.0 of the dataset, which is larger and also addresses the overlap issues mentioned here. Acknowledgments We thank Shashi Narayan and Jan Botha for their useful comments. The work was supported by the Intel Collaborative Research Institute for Compu- tational Intelligence (ICRI-CI), the Israeli Science Foundation (grant number 1555/15), and the Ger- Figure 2: Attention weights from the COPY512 model for the same input as in Figure1. 
man Research Foundation via the German-Israeli Project Cooperation (DIP, grant DA 1600/1-1).

723 References Shashi Narayan, Claire Gardent, Shay B. Cohen, and Anastasia Shimorina. 2017. Split and rephrase. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben- In Proceedings of the 2017 Conference on Em- gio. 2015. Neural machine translation by jointly pirical Methods in Natural Language Process- learning to align and translate. In Proceedings of ing. Association for Computational Linguistics. the International Conference on Learning Represen- http://aclweb.org/anthology/D17-1064. tations (ICLR). Jean Pouget-Abadie, Dzmitry Bahdanau, Bart van Raman Chandrasekar, Christine Doran, and Bangalore Merrienboer, Kyunghyun Cho, and Yoshua Ben- Srinivas. 1996. Motivations and methods for text gio. 2014. Overcoming the curse of sen- simplification. In Proceedings of the 16th confer- tence length for neural machine translation us- ence on Computational linguistics. Association for ing automatic segmentation. In Proceedings of Computational Linguistics. SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation. Associa- Jan De Belder and Marie-Francine Moens. 2010. Text tion for Computational Linguistics, Doha, Qatar. simplification for children. In Proceedings of the SI- http://www.aclweb.org/anthology/W14-4009. GIR workshop on accessible search systems. ACM. Abigail See, Peter J. Liu, and Christopher D. Man- Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. ning. 2017. Get to the point: Summarization Li. 2016. Incorporating copying mechanism in with pointer-generator networks. In Proceedings sequence-to-sequence learning. In Proceedings of of the 55th Annual Meeting of the Association for the 54th Annual Meeting of the Association for Com- Computational Linguistics (Volume 1: Long Pa- putational Linguistics (Volume 1: Long Papers). pers). Association for Computational Linguistics. Association for Computational Linguistics, Berlin, http://www.aclweb.org/anthology/P17-1099. Germany. http://www.aclweb.org/anthology/P16- 1154. Masaru Tomita. 1986. Efficient parsing for natural lan- guagea fast algorithm for practical systems. int. se- Kentaro Inui, Atsushi Fujita, Tetsuro Takahashi, Ryu ries in engineering and computer science. Iida, and Tomoya Iwakura. 2003. Text simplifica- Willian Massami Watanabe, Arnaldo Candido Junior, tion for reading assistance: a project note. In Pro- Vin´ıcius Rodriguez Uzeda,ˆ Renata Pontin de Mat- ceedings of the second international workshop on tos Fortes, Thiago Alexandre Salgueiro Pardo, and Paraphrasing-Volume 16. Association for Computa- Sandra Maria Alu´ısio. 2009. Facilita: reading as- tional Linguistics. sistance for low-literacy readers. In Proceedings of the 27th ACM international conference on Design of ´ˇ ´ Tomas Jelınek. 2014. Improvements to dependency communication. ACM. parsing using automatic simplification of data. In Proceedings of the Ninth International Con- ference on Language Resources and Evaluation (LREC’14). European Language Resources Associ- ation (ELRA), Reykjavik, Iceland.

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush. 2017. Open- nmt: Open-source toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations. Association for Com- putational Linguistics, Vancouver, Canada. http://aclweb.org/anthology/P17-4012.

Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine transla- tion. In Proceedings of the First Workshop on Neural Machine Translation. Associa- tion for Computational Linguistics, Vancouver. http://www.aclweb.org/anthology/W17-3204.

Ryan McDonald and Joakim Nivre. 2011. Analyzing and integrating dependency parsers. Computational Linguistics 37.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2017. Pointer sentinel mixture models. In Proceedings of the International Con- ference on Learning Representations (ICLR).

724 Chapter 6

Massively Multilingual Neural Machine Translation: Towards Universal Translation

63 Massively Multilingual Neural Machine Translation

Roee Aharoni∗ Melvin Johnson and Orhan Firat Bar Ilan University Google AI Ramat-Gan Mountain View Israel California [email protected] melvinp,[email protected]

Abstract of the number of required models and model pa- rameters, enabling simpler deployment. Another Multilingual neural machine translation (NMT) enables training a single model that benefit is transfer learning; when low-resource supports translation from multiple source lan- language pairs are trained together with high- guages into multiple target languages. In this resource ones, the translation quality may improve paper, we push the limits of multilingual NMT (Zoph et al., 2016; Nguyen and Chiang, 2017). An in terms of the number of languages being extreme case of such transfer learning is zero-shot used. We perform extensive experiments in translation (Johnson et al., 2017), where multilin- training massively multilingual NMT models, gual models are able to translate between language translating up to 102 languages to and from pairs that were never seen during training. English within a single model. We explore different setups for training such models and While very promising, it is still unclear how far analyze the trade-offs between translation one can scale multilingual NMT in terms of the quality and various modeling decisions. We number of languages involved. Previous works report results on the publicly available TED on multilingual NMT typically trained models talks multilingual corpus where we show with up to 7 languages (Dong et al., 2015; Fi- that massively multilingual many-to-many rat et al., 2016b; Ha et al., 2016; Johnson et al., models are effective in low resource settings, 2017; Gu et al., 2018) and up to 20 trained direc- outperforming the previous state-of-the-art while supporting up to 59 languages. Our tions (Cettolo et al., 2017) simultaneously. One experiments on a large-scale dataset with recent exception is Neubig and Hu(2018) who 102 languages to and from English and up to trained many-to-one models from 58 languages one million examples per direction also show into English. While utilizing significantly more promising results, surpassing strong bilingual languages than previous works, their experiments baselines and encouraging future work on were restricted to many-to-one models in a low- massively multilingual NMT. resource setting with up to 214k examples per 1 Introduction language-pair and were evaluated only on four translation directions. Neural machine translation (NMT) (Kalchbren- In this work, we take a step towards practical ner and Blunsom, 2013; Bahdanau et al., 2014; “universal” NMT – training massively multilin- Sutskever et al., 2014) is the current state-of- gual models which support up to 102 languages the-art approach for machine translation in both and with up to one million examples per language- academia (Bojar et al., 2016, 2017, 2018) and in- pair simultaneously. Specifically, we focus on dustry (Wu et al., 2016; Hassan et al., 2018). Re- training “English-centric” many-to-many models, cent works (Dong et al., 2015; Firat et al., 2016a; in which the training data is composed of many Ha et al., 2016; Johnson et al., 2017) extended the language pairs that contain English either on the approach to support multilingual translation, i.e. source side or the target side. This is a realistic training a single model that is capable of translat- setting since English parallel data is widely avail- ing between multiple language pairs. able for many language pairs. We restrict our ex- Multilingual models are appealing for several periments to Transformer models (Vaswani et al., reasons. 
First, they are more efficient in terms 2017) as they were shown to be very effective ∗Work carried out during an internship at Google AI. in recent benchmarks (Ott et al., 2018), also in

3874 Proceedings of NAACL-HLT 2019, pages 3874–3884 Minneapolis, Minnesota, June 2 - June 7, 2019. c 2019 Association for Computational Linguistics the context of multilingual models (Lakew et al., (Wang et al., 2018) especially if they are not from 2018; Sachan and Neubig, 2018). the same typological language family (Sachan and We evaluate the performance of such massively Neubig, 2018). multilingual models while varying factors like We begin tackling this question by experiment- model capacity, the number of trained directions ing with the TED Talks parallel corpus compiled (tasks) and low-resource vs. high-resource set- by Qi et al.(2018) 1, which is unique in that it in- tings. Our experiments on the publicly available cludes parallel data from 59 languages. For com- TED talks dataset (Qi et al., 2018) show that mas- parison, this is significantly “more multilingual” sively multilingual many-to-many models with up than the data available from all previous WMT to 58 languages to-and-from English are very ef- news translation shared task evaluations through- fective in low resource settings, allowing to use out the years – the latest being Bojar et al.(2016, high-capacity models while avoiding overfitting 2017, 2018), which included 14 languages so far.2 and achieving superior results to the current state- We focus on the setting where we train of-the-art on this dataset (Neubig and Hu, 2018; “English-centric” models, i.e. training on all Wang et al., 2019) when translating into English. language pairs that contain English in either the We then turn to experiment with models trained source or the target, resulting in 116 translation on 103 languages in a high-resource setting. For directions. This dataset is also highly imbal- this purpose we compile an English-centric in- anced, with language pairs including between 3.3k house dataset, including 102 languages aligned to 214k sentence pairs for training. Table9 in to-and-from English with up to one million ex- the supplementary material details the languages amples per language pair. We then train a sin- and training set sizes for this dataset. Since the gle model on the resulting 204 translation direc- dataset is already tokenized we did not apply ad- tions and find that such models outperform strong ditional preprocessing other than applying joint bilingual baselines by more than 2 BLEU aver- subword segmentation (Sennrich et al., 2016) with aged across 10 diverse language pairs, both to- 32k symbols. and-from English. Finally, we analyze the trade- Regarding the languages we evaluate on, we be- offs between the number of involved languages gin with the same four languages as Neubig and and translation accuracy in such settings, showing Hu(2018) – Azerbeijani (Az), Belarusian (Be), that massively multilingual models generalize bet- Galician (Gl) and Slovak (Sk). These languages ter to zero-shot scenarios. We hope these results present an extreme low-resource case, with as few will encourage future research on massively mul- as 4.5k training examples for Belarusian-English. tilingual NMT. In order to better understand the effect of training set size in these settings, we evaluate on four ad- 2 Low-Resource Setting: 59 Languages ditional languages that have more than 167k train- 2.1 Experimental Setup ing examples each – Arabic (Ar), German (De), Hebrew (He) and Italian (It). The main question we wish to answer in this work is how well a single NMT model can scale to 2.2 Model Details support a very large number of language pairs. 
Using the same data, we trained three massively The answer is not trivial: on the one hand, train- multilingual models: a many-to-many model ing multiple language pairs together may result which we train using all 116 translation directions in transfer learning (Zoph et al., 2016; Nguyen with 58 languages to-and-from English, a one-to- and Chiang, 2017). This may improve perfor- many model from English into 58 languages, and mance as we increase the number of language a many-to-one model from 58 languages into En- pairs, since more information can be shared be- glish. We follow the method of Ha et al.(2016); tween the different translation tasks, allowing the Johnson et al.(2017) and add a target-language model to learn which information to share. On the other hand, adding many language pairs may re- 1github.com/neulab/ sult in a bottleneck; the model has a limited ca- word-embeddings-for-nmt 2 pacity while it needs to handle this large number Chinese, Czech, English, Estonian, Finnish, French, German, Hindi, Hungarian, Latvian, Romanian, Russian, of translation tasks, and sharing all parameters be- Spanish, Turkish. According to http://www.statmt. tween the different languages can be sub-optimal org/wmtXX

3875 prefix token to each source sentence to enable Az-En Be-En Gl-En Sk-En Avg. # of examples 5.9k 4.5k 10k 61k 20.3k many-to-many translation. These different setups Neubig & Hu 18 enable us to examine the effect of the number of baselines 2.7 2.8 16.2 24 11.42 translation tasks on the translation quality as mea- many-to-one 11.7 18.3 29.1 28.3 21.85 Wang et al. 18 11.82 18.71 30.3 28.77 22.4 sured in BLEU (Papineni et al., 2002). We also Ours compare our massively multilingual models to many-to-one 11.24 18.28 28.63 26.78 21.23 bilingual baselines and to two recently published many-to-many 12.78 21.73 30.65 29.54 23.67 results on this dataset (Neubig and Hu(2018); Table 1: X En test BLEU on the TED Talks corpus, → Wang et al.(2019)). for the language pairs from Neubig and Hu(2018) Regarding the models, we focused on the Trans- former in the “Base” configuration. We refer the Ar-En De-En He-En It-En Avg. reader to Vaswani et al.(2017) for more details # of examples 213k 167k 211k 203k 198.5k on the model architecture. Specifically, we use 6 baselines 27.84 30.5 34.37 33.64 31.59 many-to-one 25.93 28.87 30.19 32.42 29.35 layers in both the encoder and the decoder, with many-to-many 28.32 32.97 33.18 35.14 32.4 model dimension set at 512, hidden dimension size of 2048 and 8 attention heads. We also ap- Table 2: X En test BLEU on the TED Talks corpus, → plied dropout at a rate of 0.2 in the following com- for language pairs with more than 167k examples ponents: on the sum of the input embeddings and updates ar_many2one ro_many2one nl_many2one de_many2one it_many2one ar de it nl 0 0 49148 0.2839321792 the positional embeddings, on the output of each 51751 0.2832087576 53492 0.2823074162 56111 0.2822623849 sub-layer before added to the previous layer input 57843 0.2835516334 59595 0.2890901268 61344 0.2845613658 (residual connection), on the inner layer output af- 63938 0.2924402654 65676 0.2828268111 ter the ReLU activation in each feed-forward sub- 67401 0.2909407914 70019 0.2906394601 71770 0.283375144 layer, and to the attention weight in each attention 73479 0.2862167656 75201 0.2850508094 77780 0.2772497535 sub-layer. This results in a model with approx- 79488 0.286601305 81201 0.2964281738 82895 0.2811564505 imately 93M trainable parameters. For all mod- 84594 0.2892799973 86301 0.2811136246 88849 0.2873187661 els we used the inverse square root learning rate 90551 0.287561506 93101 0.2806210518 95667 0.2852652967 schedule from Vaswani et al.(2017) with learning- 97386 0.286067158 99069 0.285282582 rate set at 3 and 40k warmup steps. All models are 101647 0.2873030901 103358 0.2812287509 105932 0.2841886282 implemented in Tensorflow-Lingvo (Shen et al., 107656 0.2752591372 109350 0.2852451503 111069 0.2802457809 2019). 113631 0.2839408815 115325 0.2698247731 117038 0.2862476707 In all cases we report test results for the check- 119599 0.2838998437 121302 0.2805997431 point that performed best on the development set 123015 0.2826767564 in terms of BLEU. For the multilingual models we Figure 1: Development BLEU on It,Ro,Nl,De,Ar En vs. training BLEU for the create a development set that includes examples { }→ we uniformly sample from a concatenation of all many-to-one and many-to-many models. Best viewed the individual language pair development sets, re- in color. sulting in 13k development examples per model. the results of our experiments when evaluating Another important detail regarding multilingual on the same language pairs as they did. The re- training is the batching scheme. 
In all of our mul- sults under “Neubig & Hu 18” are their bilin- tilingual models we use heterogeneous batching, gual baselines and their best many-to-one models. where each batch contains examples which are Their many-to-one models use similar-language- uniformly sampled from a concatenation of all the regularization, i.e. fine-tuning a pre-trained many- language pairs the model is trained on. Specifi- to-one model with data from the language pair of cally, we use batches of 64 examples for sequences interest together with data from a language pair shorter than 69 tokens and batches of 16 exam- that has a typologically-similar source language ples for longer sequences. We did not use over- and more training data (i.e. Russian and Belaru- sampling as the dataset is relatively small. sian, Turkish and Azerbaijani). The results under “Ours” are our many-to-one and many-to-many 2.3 Results models we trained identically in terms of model We use tokenized BLEU in order to be compara- architecture and hyper-parameters. ble with Neubig and Hu(2018). Table1 shows We first note that our many-to-many model out-

3876 performs all other models when translating into En-Az En-Be En-Gl En-Sk Avg. # of examples 5.9k 4.5k 10k 61k 20.3k English, with 1.82 BLEU improvement (when av- baselines 2.16 2.47 3.26 5.8 3.42 eraged across the four language pairs) over the one-to-many 5.06 10.72 26.59 24.52 16.72 best fine-tuned many-to-one models of Neubig many-to-many 3.9 7.24 23.78 21.83 14.19 and Hu(2018) and 2.44 BLEU improvement over En-Ar En-De En-He En-It Avg. our many-to-one model when averaged across the # of examples 213k 167k 211k 203k 198.5k four low-resource language pairs (Table1). This baselines 12.95 23.31 23.66 30.33 22.56 one-to-many 16.67 30.54 27.62 35.89 27.68 is surprising as it uses the same X En data, → many-to-many 14.25 27.95 24.16 33.26 24.9 model architecture and capacity as our many-to- Table 3: En X test BLEU on the TED Talks corpus one model, while handling a heavier burden since → it also supports 58 additional translation tasks resource setting the baselines have very few train- (from English into 58 languages). Our models also ing examples to outperform the many-to-one mod- outperform the more complex models of Wang els, while in the higher resource setting they have et al.(2019) which use ”Soft Decoupled Encod- access to more training data. This corroborates the ing” for the input tokens, while our models use a results of Gu et al.(2018) that showed the sensi- simple subword segmentation. tivity of such models to similar low resource con- One possible explanation is that the many-to- ditions and the improvements gained from using one model overfits the English side of the corpus many-to-one models (however with much fewer as it is multi-way-parallel: in such setting the En- language pairs). glish sentences are overlapping across the differ- Table3 shows the results of our massively ent language pairs, making it much easier for the multilingual models and bilingual baselines when model to memorize the training set instead of gen- evaluated out-of-English. In this case we see an eralizing (when enough capacity is available). On opposite trend: the many-to-many model performs the other hand, the many-to-many model is trained worse than the one-to-many model by 2.53 BLEU on additional target languages other than English, on average. While previous works (Wang et al., which can act as regularizers for the X En tasks, → 2018; Sachan and Neubig, 2018) discuss the phe- reducing such overfitting. nomena of quality degradation in English-to-many To further illustrate this, Figure1 tracks the settings, this shows that increasing the number of BLEU scores on the individual development sets source languages also causes additional degrada- during training for Italian (It), Romanian (Ro), tion in a many-to-many model. This degradation Dutch (Nl), German (De) and Arabic (Ar) into En- may be due to the English-centric setting: since glish (left), together with BLEU scores on a sub- most of the translation directions the model is set of the training set for each model. We can trained on are into English, this leaves less capac- see that while the many-to-one model degrades in ity for the other target languages (while still per- performance on the development set, the many- forming better than the bilingual baselines on all 8 to-many model still improves. Note the large language pairs). 
We also note that in this case the gap in the many-to-one model between the train- results are consistent among the higher and lower ing set BLEU and the development set BLEU, resource pairs – the one-to-many model is better which points on the generalization issue that is than the many-to-many model, which outperforms not present in the many-to-many setting. We also the bilingual baselines in all cases. This is unlike note that our many-to-one model is on average the difference we saw in the X En experiments → 0.75 BLEU behind the best many-to-one models since here we do not have the multi-way-parallel in Neubig and Hu(2018). We attribute this to the overfitting issue. fact that their models are fine-tuned using similar- language-regularization while our model is not. 2.4 Discussion We find an additional difference between the From the above experiments we learn that NMT results on the resource-scarce languages (Ta- models can scale to 59 languages in a low- ble1) and the higher-resource languages (Table resource, imbalanced, English-centric setting, 2). Specifically, the bilingual baselines outper- with the following observations: (1) massively form the many-to-one models only in the higher- multilingual many-to-many models outperform resource setting. This makes sense as in the low- many-to-one and bilingual models with similar ca-

3877 pacity and identical training conditions when av- More details about this dataset are available in Ta- eraged over 8 language pairs into English. We ble4, and Table 10 in the supplementary material attribute this improvement over the many-to-one details all the languages in the dataset.4 models to the multiple target language pairs which Similarly to our previous experiments, we com- may act as regularizers, especially in this low- pare the massively multilingual models to bilin- resource multi-way-parallel setting that is prone to gual baselines trained on the same data. We tok- memorization. (2) many-to-many models are in- enize the data using an in-house tokenizer and then ferior in performance when going out-of-English apply joint subword segmentation to achieve an in comparison to a one-to-many model. We at- open-vocabulary. In this setting we used a vocab- tribute this to English being over-represented in ulary of 64k subwords rather than 32k. Since the the English-centric many-to-many setting, where dataset contains 24k unique characters, a 32k sym- it appears as a target language in 58 out of 116 bol vocabulary will consist of mostly characters, trained directions, which may harm the perfor- thereby increasing the average sequence length. mance on the rest of the target languages as the Regarding the model, for these experiments we model capacity is limited.3 use a larger Transformer model with 6 layers in It is important to stress the fact that we com- both the encoder and the decoder, model dimen- pared the different models under identical training sion set to 1024, hidden dimension size of 8192, conditions and did not perform extensive hyper- and 16 attention heads. This results in a model parameter tuning for each setting separately. How- with approximately 473.7M parameters.5 Since ever, we believe that such tuning may improve the model and data are much larger in this case, performance even further, as the diversity in each we used a dropout rate of 0.1 for our multilingual training batch is very different between the dif- models and tuned it to 0.3 for our baseline models ferent settings. For example, while the baseline as it improved the translation quality on the devel- model batches include only one language in the opment set. source and one language in the target, the many- We evaluate our models on 10 languages from to-many model includes 59 languages in each different typological families: Semitic – Arabic side with a strong bias towards English. These (Ar), Hebrew (He), Romance – Galician (Gl), differences may require tailored hyper-parameter Italian (It), Romanian (Ro), Germanic – German choices for each settings (i.e. different batch sizes, (De), Dutch (Nl), Slavic – Belarusian (Be), Slo- learning rate schedules, dropout rates etc.) which vak (Sk) and Turkic – Azerbaijani (Az) and Turk- would be interesting to explore in future work. ish (Tr). We evaluate both to-and-from English, In the following experiments we investigate where each language pair is trained on up to one whether these observations hold using (1) an even million examples. As in the previous experiment, larger set of languages, and (2) a much larger, we report test results from the model that per- balanced training corpus that is not multi-way- formed best in terms of BLEU on the development parallel. set. 4The average number of examples per language pair is 3 High-Resource Setting: 103 Languages 940k, as for 13 out of the 102 pairs we had less than one million examples available. 
3.1 Experimental Setup 5This is larger than the Transformer “Big” configuration, which includes approximately 213M trained parameters. In this setting we scale the number of languages and examples per language pair further when training a single massively multilingual model. # of language pairs 102 Since we are not aware of a publicly available re- examples per pair source for this purpose, we construct an in-house min 63,879 dataset. This dataset includes 102 language pairs max 1,000,000 which we “mirror” to-and-from English, with up average 940,087 to one million examples per language pair. This std. deviation 188,194 results in 103 languages in total, and 204 trans- total # of examples 95,888,938 lation directions which we train simultaneously. Table 4: Training set details for the 103 langauges cor- 3This issue may be alleviated by over-sampling the non- pus, X En data. English-target pairs, but we leave this for future work. → 3878 Ar Az Be De He It Nl Ro Sk Tr Avg. baselines 23.34 16.3 21.93 30.18 31.83 36.47 36.12 34.59 25.39 27.13 28.33 many-to-one 26.04 23.68 25.36 35.05 33.61 35.69 36.28 36.33 28.35 29.75 31.01 many-to-many 22.17 21.45 23.03 37.06 30.71 35.0 36.18 36.57 29.87 27.64 29.97

Table 5: X En test BLEU on the 103-language corpus → Ar Az Be De He It Nl Ro Sk Tr Avg. baselines 10.57 8.07 15.3 23.24 19.47 31.42 28.68 27.92 11.08 15.54 19.13 one-to-many 12.08 9.92 15.6 31.39 20.01 33 31.06 28.43 17.67 17.68 21.68 many-to-many 10.57 9.84 14.3 28.48 17.91 30.39 29.67 26.23 18.15 15.58 20.11

Table 6: En X test BLEU on the 103-language corpus →

3.2 Results 38.07 to 35.05) while other languages continue to improve with more updates. Table5 describes the results when translating into English. First, we can see that both multilingual Table6 describes the results when translating models perform better than the baselines in terms out-of-English. Again, both of the massively mul- of average BLEU. This shows that massively mul- tilingual models perform better than the base- tilingual many-to-many models can work well in lines when averaged across the 10 evaluated lan- realistic settings with millions of training exam- guage pairs, while handling up to 102 languages ples, 102 languages and 204 jointly trained direc- to-and-from English and 204 translation tasks si- tions to-and-from English. Looking more closely, multaneously. In this case the results are simi- we note several different behaviors in comparison lar to those we observed on the TED talks cor- to the low-resource experiments on the TED Talks pus, where the one-to-many model performs better corpus. First, the many-to-one model here per- than the many-to-many model. Again, this advan- forms better than the many-to-many model. This tage may be due to the one-to-many model han- shows that the previous result was indeed due to dling a smaller number of tasks while not being the pathologies of the low-resource dataset; when biased towards English in the target side like the the training data is large enough and not multi- many-to-many model. way-parallel there is no overfitting in the many-to- 4 Analysis one model, and it outperforms the many-to-many model in most cases while they are trained identi- The above results show that massively multilin- cally. gual NMT is indeed possible in large scale settings One particular outlier in this case is German-to- and can improve performance over strong bilin- English, where the many-to-one model is 2 BLEU gual baselines. However, it was shown in a some- points below the many-to-many model. We exam- what extreme case with more than 100 languages ine the BLEU score of this language pair on its trained jointly, where we saw that in some cases dedicated German-English development set dur- the joint training may harm the performance for ing training in the many-to-one model and find some language pairs (i.e. German-English above). that it highly fluctuates. We then measure the In the following analysis we would like to bet- performance on the test set for this language pair ter understand the trade-off between the number by choosing the best checkpoint on the dedicated of languages involved and the translation accu- German-English development set (instead of on racy while keeping the model capacity and train- the mixed multilingual development set) and find ing configuration fixed. it to be 38.07, which is actually higher in 1 BLEU than the best result of the many-to-many model. 4.1 Multilinguality & Supervised This shows that while training many languages to- Performance gether, there is no “silver bullet”: some languages We first study the effect of varying the num- may suffer from severe interference during train- ber of languages on the translation accuracy in ing (i.e. a reduction of 3 BLEU in this case, from a supervised setting, where we focus on many-

3879 Ar-En En-Ar Fr-En En-Fr Ru-En En-Ru Uk-En En-Uk Avg. 5-to-5 23.87 12.42 38.99 37.3 29.07 24.86 26.17 16.48 26.14 25-to-25 23.43 11.77 38.87 36.79 29.36 23.24 25.81 17.17 25.8 50-to-50 23.7 11.65 37.81 35.83 29.22 21.95 26.02 15.32 25.18 75-to-75 22.23 10.69 37.97 34.35 28.55 20.7 25.89 14.59 24.37 103-to-103 21.16 10.25 35.91 34.42 27.25 19.9 24.53 13.89 23.41

Table 7: Supervised performance while varying the number of languages involved

Ar-Fr Fr-Ar Ru-Uk Uk-Ru Avg. tween the 5-to-5 and the 103-to-103 model. 5-to-5 1.66 4.49 3.7 3.02 3.21 25-to-25 1.83 5.52 16.67 4.31 7.08 50-to-50 4.34 4.72 15.14 20.23 11.1 4.2 Multilinguality & Zero-Shot 75-to-75 1.85 4.26 11.2 15.88 8.3 Performance 103-to-103 2.87 3.05 12.3 18.49 9.17 We then study the effect of the number of lan- Table 8: Zero-Shot performance while varying the number of languages involved guages on zero-shot translation accuracy. Since we find zero-shot accuracy as an interesting mea- to-many models. We create four subsets of the sure for model generalization, we hypothesize that in-house dataset by sub-sampling it to a differ- by adding more languages, the model is forced to ent number of languages in each subset. In create a more generalized representation to bet- this way we create four additional English-centric ter utilize its capacity, which may improve zero- datasets, containing 5, 25, 50 and 75 languages shot performance. We choose four language pairs each to-and-from English. We make sure that for this purpose: Arabic French which are dis- ↔ each subset contains all the languages from the tant languages, and Ukrainian Russian which are ↔ next smaller subsets – i.e. the 25 language sub- similar. Table8 shows the results of our models set contains the 5 language subset, the 50 lan- on these language pairs. For Arabic French the ↔ guage subset contains the 25 language subset and BLEU scores are very low in all cases, with the so on. We train a similar-capacity large Trans- 50-to-50 and 25-to-25 models being slightly bet- former model (with 473.7M parameters) on each ter than rest on Ar-Fr and Fr-Ar respectively. On of these subsets and measure the performance for Russian Ukrainian we see clear improvements ↔ each model on the 8 supervised language pairs when increasing the number of languages to more from the smallest subset – Arabic, French, Rus- than five. { sian, Ukrainian English. In this way we can Figure2 further illustrates this, showing the bet- }↔ analyze to what extent adding more languages im- ter generalization performance of the massively proves or harms translation quality while keeping multilingual models under this zero-shot setting. the model capacity fixed, testing the capacity vs. While the zero-shot performance in this case is accuracy “saturation point”. low and unstable for the 5-to-5 and 25-to-25 mod- Table7 shows the results of this experiment,

update 5-to-5 25-to-25 50-to-50 75-to-75 103-to-103 reporting the test results for the models that per- 0 0 0 0 0 0 25000 3.360397369 7.345648855 7.028211653 6.687645614 6.629930437 formed best on the multilingual development set. 50000 2.003555186 10.08476391 11.66040972 11.46485135 11.34905592 100000 2.383616194 9.54657495 15.44517726 14.72926438 13.17522973 150000 2.588021383 7.121089101 14.5408541 15.54533243 15.89359492 We can see that in most cases the best results 200000 2.589718811 10.99432111 15.82096368 17.38970876 16.26121253 250000 2.854427323 14.00393397 16.53215885 15.97282737 15.3215304 are obtained using the 5-to-5 model, showing that 300000 2.823847346 6.934611499 16.78672731 16.52613729 16.16107672 350000 2.950227261 4.779103771 16.73545986 16.12752229 17.03165472 there is indeed a trade off between the number 400000 4.301280901 6.631205231 16.16190225 17.38892049 17.55904406 450000 3.882381693 6.804813445 18.20554733 17.48778224 17.9339543 500000 3.45445089 6.358428299 17.32598543 16.71108902 15.97367823 of languages and translation accuracy when us- 550000 3.18 6.38 18.65 18.28 17.4 600000 2.86 9.5 18.46 14.92 17.12 ing a fixed model capacity and the same train- 650000 2.55 12.2 18.98 15.68 16.19 700000 2.98 8.44 20.16 15.4 18.52 ing setup. One may expect that the gaps between the different models should become smaller and even close with more updates, as the models with more languages see less examples per language in each batch, thus requiring more updates to im- prove in terms of BLEU. However, in our setting these gaps did not close even after the models con- verged, leaving 2.73 average BLEU difference be- Figure 2: Zero-shot BLEU during training for Ukra- nian to Russian 3880 els, it is much better for the 50-to-50, 75-to-75 and tual parameter generator that learns to generate the 103-to-103 models. Given these results we can parameters of the system given the desired source say that the balance between capacity and general- and target languages. Gu et al.(2018) propose ization here favors the mid range 50-to-50 model, a “Universal Language Representation” layer to- even when using models with more than 473M gether with a Mixture-of-Language-Experts com- trained parameters. This may hint at the neces- ponent to improve a many-to-one model from 5 sity of even larger models for such settings, which languages into English. is a challenging avenue for future work. We also While the mentioned studies provide valuable note that our 103 language corpus includes up to contributions to improving multilingual models, one million examples per language pair – while in they apply their models on only up to 7 languages real-world MT deployments, systems are trained (Johnson et al., 2017) and 20 trained directions on much more examples per pair. This again em- (Cettolo et al., 2017) in a single model, whereas phasizes the need for better techniques for training we focus on scaling NMT to much larger num- such massively multilingual models as we may al- bers of languages and trained directions. Regard- ready be hitting the capacity barrier in our setting. ing massively multilingual models, Neubig and Hu(2018) explored methods for rapid adaptation 5 Related Work of NMT to new languages by training multilin- gual models on the 59-language TED Talks cor- Dong et al.(2015) extended the NMT model of pus and fine-tuning them using data from the new Bahdanau et al.(2014) to one-to-many translation languages. 
While modeling significantly more (from English into 4 languages) by adding a ded- languages than previous studies, they only train icated decoder per target language, showing im- many-to-one models, which we show are inferior provements over strong single-pair baselines. Fi- in comparison to our proposed massively multi- rat et al.(2016a,b) proposed many-to-many mod- lingual many-to-many models when evaluated into els (with up to 6 languages) by using separate en- English on this dataset. coders and decoders per language while sharing Tiedemann(2018) trained an English-centric the attention mechanism. They also introduced many-to-many model on translations of the bible the notion of zero-resource translation, where they including 927 languages. While this work pointed use synthetic training data generated through piv- to an interesting phenomena in the latent space oting to train translation directions without avail- learned by the model where it clusters repre- able training data. Ha et al.(2016) and Johnson sentations of typologically-similar languages to- et al.(2017) proposed to use a shared encoder- gether, it did not include any evaluation of the decoder-attention model for many-to-many trans- produced translations. Similarly, Malaviya et al. lation (with up to 7 languages in the latter). In (2017) trained a many-to-English system includ- order to determine the target language in such ing 1017 languages from bible translations, and scenarios they proposed adding dedicated target- used it to infer typological features for the dif- language symbols to the source. This method en- ferent languages (without evaluating the transla- abled zero-shot translation, showing the ability of tion quality). In another relevant work, Artetxe the model to generalize to unseen pairs. and Schwenk(2018) trained an NMT model on Recent works propose different methods for pa- 93 languages and used the learned representations rameter sharing between language pairs in mul- to perform cross-lingual transfer learning. Again, tilingual NMT. Blackwood et al.(2018) propose they did not report the performance of the transla- sharing all parameters but the attention mechanism tion model learned in that massively multilingual and show improvements over sharing all param- setting. eters. Sachan and Neubig(2018) explore shar- ing various components in self-attentional (Trans- 6 Conclusions and Future Work former) models. Lu et al.(2018) add a shared “in- terlingua” layer while using separate encoders and We showed that NMT models can successfully decoders. Zaremoodi et al.(2018) utilize recurrent scale to 102 languages to-and-from English with units with multiple blocks together with a trainable 204 trained directions and up to one million ex- routing network. Platanios et al.(2018) propose amples per direction. Such models improve the to share the entire network, while using a contex- translation quality over similar single-pair base-

3881 lines when evaluated to and from English by more Ondrejˇ Bojar, Rajen Chatterjee, Christian Federmann, than 2 BLEU when averaged over 10 diverse lan- et al. 2016. Findings of the 2016 conference on ma- guage pairs in each case. We show a similar re- chine translation. In ACL 2016 FIRST CONFER- ENCE ON MACHINE TRANSLATION (WMT16), sult on the low-resource TED Talks corpus with 59 pages 131–198. languages and 116 trained directions. We analyze the trade-offs between translation quality and the Ondrejˇ Bojar, Christian Federmann, Mark Fishel, number of languages involved, pointing on capac- Yvette Graham, Barry Haddow, Philipp Koehn, and Christof Monz. 2018. Findings of the 2018 confer- ity bottlenecks even with very large models and ence on machine translation (wmt18). In Proceed- showing that massively multilingual models can ings of the Third Conference on Machine Transla- generalize better to zero-shot settings. tion: Shared Task Papers. We hope this work will encourage future re- search on massively multilingual NMT, enabling Mauro Cettolo, Federico Marcello, Bentivogli Luisa, Niehues Jan, Stuker¨ Sebastian, Sudoh Katsuitho, easier support for systems that can serve more peo- Yoshino Koichiro, and Federmann Christian. 2017. ple around the globe. There are many possible av- Overview of the iwslt 2017 evaluation campaign. In enues for future work, including semi-supervised International Workshop on Spoken Language Trans- learning in such settings, exploring ways to re- lation. duce the performance degradation when increas- Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and ing the number of languages, or using such models Haifeng Wang. 2015. Multi-task learning for mul- for multilingual transfer learning (McCann et al., tiple language translation. In Proceedings of the 2017; Eriguchi et al., 2018; Artetxe and Schwenk, 53rd Annual Meeting of the Association for Compu- 2018). Understanding and improving zero-shot tational Linguistics and the 7th International Joint Conference on Natural Language Processing. performance in such scenarios is also a promising direction for future work. Akiko Eriguchi, Melvin Johnson, Orhan Firat, Hideto Kazawa, and Wolfgang Macherey. 2018. Zero- Acknowledgments shot cross-lingual classification using multilin- gual neural machine translation. arXiv preprint We would like to thank the Google Brain and arXiv:1809.04686. Google Translate teams for their useful inputs and Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. discussions. We would also like to thank the entire 2016a. Multi-way, multilingual neural machine Lingvo development team for their foundational translation with a shared attention mechanism. In contributions to this project. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computa- tional Linguistics: Human Language Technologies.

Orhan Firat, Baskaran Sankaran, Yaser Al-Onaizan, References Fatos T Yarman Vural, and Kyunghyun Cho. 2016b. Mikel Artetxe and Holger Schwenk. 2018. Mas- Zero-resource translation with multi-lingual neural sively multilingual sentence embeddings for zero- machine translation. In Proceedings of the 2016 shot cross-lingual transfer and beyond. arXiv Conference on Empirical Methods in Natural Lan- preprint arXiv:1812.10464. guage Processing.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben- Jiatao Gu, Hany Hassan, Jacob Devlin, and Victor O.K. gio. 2014. Neural machine translation by jointly Li. 2018. Universal neural machine translation for learning to align and translate. arXiv preprint extremely low resource languages. In Proceed- arXiv:1409.0473. ings of the 2018 Conference of the North American Chapter of the Association for Computational Lin- Graeme Blackwood, Miguel Ballesteros, and Todd guistics: Human Language Technologies. Ward. 2018. Multilingual neural machine transla- tion with task-specific attention. In Proceedings of Thanh-Le Ha, Jan Niehues, and Alexander Waibel. the 27th International Conference on Computational 2016. Toward multilingual neural machine trans- Linguistics, Santa Fe, New Mexico, USA. Associa- lation with universal encoder and decoder. arXiv tion for Computational Linguistics. preprint arXiv:1611.04798.

Ondrejˇ Bojar, Rajen Chatterjee, Christian Federmann, Hany Hassan, Anthony Aue, Chang Chen, et al. et al. 2017. Findings of the 2017 conference on ma- 2018. Achieving human parity on automatic chi- chine translation (wmt17). In Proceedings of the nese to english news translation. arXiv preprint Second Conference on Machine Translation. arXiv:1803.05567.

3882 Melvin Johnson, Mike Schuster, Quoc V Le, et al. Ye Qi, Sachan Devendra, Felix Matthieu, Padmanab- 2017. Google’s multilingual neural machine transla- han Sarguna, and Neubig Graham. 2018. When tion system: Enabling zero-shot translation. Trans- and why are pre-trained word embeddings useful for actions of the Association of Computational Linguis- neural machine translation. In HLT-NAACL. tics. Devendra Sachan and Graham Neubig. 2018. Parame- Nal Kalchbrenner and Phil Blunsom. 2013. Recur- ter sharing methods for multilingual self-attentional rent continuous translation models. In Proceedings translation models. In Proceedings of the Third of the 2013 Conference on Empirical Methods in Conference on Machine Translation: Research Pa- Natural Language Processing, Seattle, Washington, pers, Belgium, Brussels. USA. Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with Surafel Melaku Lakew, Mauro Cettolo, and Marcello subword units. In Proceedings of the 54th Annual Federico. 2018. A comparison of transformer and Meeting of the Association for Computational Lin- recurrent neural networks on multilingual neural guistics (Volume 1: Long Papers), volume 1, pages machine translation. In Proceedings of the 27th In- 1715–1725. ternational Conference on Computational Linguis- tics, Santa Fe, New Mexico, USA. Jonathan Shen, Patrick Nguyen, Yonghui Wu, et al. 2019. Lingvo: a modular and scalable framework Yichao Lu, Phillip Keung, Faisal Ladhak, Vikas Bhard- for sequence-to-sequence modeling. arXiv preprint waj, Shaonan Zhang, and Jason Sun. 2018. A neu- arXiv:1902.08295. ral interlingua for multilingual machine translation. In Proceedings of the Third Conference on Machine Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Translation: Research Papers, Belgium, Brussels. Sequence to sequence learning with neural net- works. In Advances in neural information process- ing systems, pages 3104–3112. Chaitanya Malaviya, Graham Neubig, and Patrick Lit- tell. 2017. Learning language representations for ty- Jorg¨ Tiedemann. 2018. Emerging language spaces pology prediction. In Proceedings of the 2017 Con- learned from massively multilingual corpora. arXiv ference on Empirical Methods in Natural Language preprint arXiv:1802.00273. Processing. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Bryan McCann, James Bradbury, Caiming Xiong, and Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Richard Socher. 2017. Learned in translation: Con- Kaiser, and Illia Polosukhin. 2017. Attention is all textualized word vectors. In Advances in Neural In- you need. In Advances in Neural Information Pro- formation Processing Systems. cessing Systems, pages 5998–6008. Xinyi Wang, Hieu Pham, Philip Arthur, and Graham Graham Neubig and Junjie Hu. 2018. Rapid adaptation Neubig. 2019. Multilingual neural machine transla- of neural machine translation to new languages. In tion with soft decoupled encoding. In International Proceedings of the 2018 Conference on Empirical Conference on Learning Representations. Methods in Natural Language Processing. Yining Wang, Jiajun Zhang, Feifei Zhai, Jingfang Xu, Toan Q. Nguyen and David Chiang. 2017. Transfer and Chengqing Zong. 2018. Three strategies to im- learning across low-resource, related languages for prove one-to-many multilingual translation. In Pro- neural machine translation. In Proc. IJCNLP. 
ceedings of the 2018 Conference on Empirical Meth- ods in Natural Language Processing, Brussels, Bel- Myle Ott, Sergey Edunov, David Grangier, and gium. Michael Auli. 2018. Scaling neural machine trans- lation. In Proceedings of the Third Conference on Yonghui Wu, Mike Schuster, Zhifeng Chen, et al. 2016. Machine Translation: Research Papers. Google’s neural machine translation system: Bridg- ing the gap between human and machine translation. Kishore Papineni, Salim Roukos, Todd Ward, and Wei- arXiv preprint arXiv:1609.08144. Jing Zhu. 2002. Bleu: a method for automatic eval- Poorya Zaremoodi, Wray Buntine, and Gholamreza uation of machine translation. In Proceedings of Haffari. 2018. Adaptive knowledge sharing in 40th Annual Meeting of the Association for Com- multi-task learning: Improving low-resource neural putational Linguistics, Philadelphia, Pennsylvania, machine translation. In Proceedings of the 56th An- USA. nual Meeting of the Association for Computational Linguistics, Melbourne, Australia. Emmanouil Antonios Platanios, Mrinmaya Sachan, Graham Neubig, and Tom Mitchell. 2018. Contex- Barret Zoph, Deniz Yuret, Jonathan May, and Kevin tual parameter generation for universal neural ma- Knight. 2016. Transfer learning for low-resource chine translation. In Proceedings of the 2018 Con- neural machine translation. In Proceedings of the ference on Empirical Methods in Natural Language 2016 Conference on Empirical Methods in Natural Processing. Language Processing.

3883 A Supplementary Material

Language Train set size Languages Arabic 214111 Afrikaans Laothian Hebrew 211819 Albanian Latin Russian 208458 Amharic Latvian Korean 205640 Arabic Lithuanian Italian 204503 Armenian Luxembourgish* Japanese 204090 Azerbaijani Macedonian Chinese-Taiwan 202646 Basque Malagasy Chinese-China 199855 Belarusian Malay Spanish 196026 Bengali Malayalam French 192304 Bosnian Maltese Portuguese-Brazil 184755 Bulgarian Maori Dutch 183767 Burmese Marathi Turkish 182470 Catalan Mongolian Romanian 180484 Cebuano Nepali Polish 176169 Chichewa* Norwegian Bulgarian 174444 Chinese Pashto Vietnamese 171995 Corsican* Persian German 167888 Croatian Polish Persian 150965 Czech Portuguese Hungarian 147219 Danish Punjabi Serbian 136898 Dutch Romanian Greek 134327 Esperanto Russian Croatian 122091 Estonian Samoan* Ukrainian 108495 Finnish Scots Gaelic* Czech 103093 French Serbian Thai 98064 Frisian Sesotho Indonesian 87406 Galician Shona* Slovak 61470 Georgian Sindhi* Swedish 56647 German Sinhalese Portuguese 51785 Greek Slovak Danish 44940 Gujarati Slovenian Albanian 44525 Haitian Creole Somali Lithuanian 41919 Hausa* Spanish Macedonian 25335 Hawaiian* Sundanese Finnish 24222 Hebrew Swahili Burmese 21497 Hindi Swedish Armenian 21360 Hmong* Tagalog French-Canadian 19870 Hungarian Tajik* Slovenian 19831 Icelandic Tamil Hindi 18798 Igbo Telugu Norwegian 15825 Indonesian Thai Kannada 13193 Irish Turkish Estonian 10738 Italian Ukrainian Kurdish 10371 Japanese Urdu Galician 10017 Javanese Uzbek Marathi 9840 Kannada Vietnamese Mongolian 7607 Kazakh Welsh Esperanto 6535 Khmer Xhosa Tamil 6224 Korean Yiddish Urdu 5977 Kurdish Yoruba* Azerbaijani 5946 Kyrgyz Zulu Bosnian 5664 Chinese 5534 Table 10: Language pairs in the in-house dataset (102 Malay 5220 languages, paired with English). For languages marked Basque 5182 Bengali 4649 with * we had less than 1M examples, while for the rest Belarusian 4509 we used exactly 1M. Kazakh 3317

Table 9: Language pairs in the TED talks dataset (58 languages, paired with English) with the train-set size for each pair.

3884 Chapter 7

Unsupervised Domain Clusters in Pretrained Language Models

75 Unsupervised Domain Clusters in Pretrained Language Models

Roee Aharoni1 & Yoav Goldberg1,2 1 Computer Science Department, Bar Ilan University 2 Allen Institute for Artificial Intelligence [email protected]

bert-base-uncased Abstract

The notion of “in-domain data” in NLP is of- ten over-simplistic and vague, as textual data varies in many nuanced linguistic aspects such as topic, style or level of formality. In addi- tion, domain labels are many times unavail- able, making it challenging to build domain- specific systems. We show that massive pre- trained language models implicitly learn sen- tence representations that cluster by domains

without supervision – suggesting a simple data- it driven definition of domains in textual data. koran We harness this property and propose domain subtitles medical data selection methods based on such models, law which require only a small set of in-domain monolingual data. We evaluate our data se- Figure 1: A 2D visualization of average-pooled BERT lection methods for neural machine translation hidden-state sentence representations using PCA. The across five diverse domains, where they outper- colors represent the domain for each sentence. form an established approach as measured by both BLEU and by precision and recall of sen- train state-of-the-art pretrained language models tence selection with respect to an oracle. for various tasks (Raffel et al., 2019). 1 Introduction Domain data selection is the task of selecting the most appropriate data for a domain from a large cor- It is common knowledge in modern NLP that us- pus given a smaller set of in-domain data (Moore ing large amounts of high-quality training data is a and Lewis, 2010; Axelrod et al., 2011; Duh et al., key aspect in building successful machine-learning 2013; Silva et al., 2018). In this work, we propose based systems. For this reason, a major challenge to use the recent, highly successful self-supervised when building such systems is obtaining data in pre-trained language models, e.g. Devlin et al. the domain of interest. But what defines a do- (2019); Liu et al.(2019) for domain data selec- main? Natural language varies greatly across top- tion. As pretrained LMs demonstrate state-of-the- ics, styles, levels of formality, genres and many art performance across many NLP tasks after being other linguistic nuances (van der Wees et al., 2015; trained on massive amounts of data, we hypothe- van der Wees, 2017; Niu et al., 2017). This over- size that the robust representations they learn can whelming diversity of language makes it hard to be useful for mapping sentences to domains in an find the right data for the task, as it is nearly im- unsupervised, data-driven approach. We show that possible to well-define the exact requirements from these models indeed learn to cluster sentence repre- such data with respect to all the aforementioned sentations to domains without further supervision aspects. On top of that, domain labels are usually (e.g. Figure1), and quantify this phenomenon by unavailable – e.g. in large-scale web-crawled data fitting Gaussian Mixture Models (GMMs) to the like Common Crawl1 which was recently used to learned representations and measuring the purity 1https://commoncrawl.org/ of the resulting clustering. We then propose meth- ods to leverage these emergent domain clusters for RoBERTa (Liu et al., 2019) has enabled great domain data selection in two ways: progress on many NLP benchmarks (Wang et al., 2018, 2019a). Larger and larger models trained Via distance-based retrieval in the sentence • on billions of tokens of raw text are released in an embedding space induced by the pretrained ever-increasing pace (Raffel et al., 2019), enabling language model. the NLP community to fine-tune them for the task of interest. While many works tried to “probe” By fine-tuning the pretrained language model • those models for the morphological, syntactic and for binary classification, where positive exam- semantic information they capture (Tenney et al., ples are from the domain of interest. 
Our methods enable selecting relevant data for the task while requiring only a small set of monolingual in-domain data. As they are based solely on the representations learned by self-supervised LMs, they do not require additional domain labels which are usually vague and over-simplify the notion of domain in textual data. We evaluate our method on data selection for neural machine translation (NMT) using the multi-domain German-English parallel corpus composed by Koehn and Knowles (2017). Our data selection methods enable training NMT models that outperform those trained using the well-established cross-entropy difference method of Moore and Lewis (2010) across five diverse domains, achieving a recall of more than 95% in all cases with respect to an oracle that selects the "true" in-domain data.

Our contributions in this work are as follows. First, we show that pre-trained language models are highly capable of clustering textual data to domains with high accuracy in a purely unsupervised manner. Second, we propose methods to select in-domain data based on this property using vector-space retrieval and positive-unlabeled fine-tuning of pretrained language models for binary classification. Third, we show the applicability of our proposed data selection methods on a popular benchmark for domain adaptation in machine translation. An additional contribution is a new, improved data split we create for this benchmark, as we point out issues with previous splits used in the literature. We hope this work will encourage more research on understanding the data landscape in NLP, enabling to "find the right data for the task" in the age of massive models and diverse data sources.

2 Emerging Domain Clusters in Pretrained Language Models

Motivation  The proliferation of massive pretrained neural language models such as ELMo (Peters et al., 2018), BERT (Devlin et al., 2019) or RoBERTa (Liu et al., 2019) has enabled great progress on many NLP benchmarks (Wang et al., 2018, 2019a). Larger and larger models trained on billions of tokens of raw text are released at an ever-increasing pace (Raffel et al., 2019), enabling the NLP community to fine-tune them for the task of interest. While many works tried to "probe" those models for the morphological, syntactic and semantic information they capture (Tenney et al., 2019; Goldberg, 2019; Clark et al., 2019), an important aspect of language remained overlooked in this context – the domain the data comes from, often referred to as the "data distribution". The definition of domain is many times vague and over-simplistic (e.g. "medical text" may be used for biomedical research papers and for clinical conversations between doctors and patients, although the two vary greatly in topic, formality etc.). A common definition treats a domain as a data source: "a domain is defined by a corpus from a specific source, and may differ from other domains in topic, genre, style, level of formality, etc." (Koehn and Knowles, 2017). We claim that a more data-driven definition should be adopted, as different data sources may have sentences with similar traits and vice versa – a single massive web-crawled corpus contains texts in numerous styles, topics and registers. Our analysis in Section 2 shows examples for such cases, e.g. a sentence discussing "Viruses and virus-like organisms" in a legal corpus.

Unsupervised Domain Clustering  We hypothesize that massive pretrained LMs can learn representations that cluster to domains, as texts from similar domains will appear in similar contexts. We test this hypothesis across several large, publicly-available pretrained LMs; we explore both masked-language-models (MLMs) and auto-regressive LMs.

Method  We encode multi-domain data at the sentence level into vector representations. We then cluster these vector representations for each model using a Gaussian Mixture Model (GMM) with k pre-defined clusters. In all cases, to create a sentence representation we perform average pooling of the last hidden state (before the softmax layer) for each token in the sentence.² To accelerate the clustering process and enable visualization we also experiment with performing dimensionality reduction with PCA over the sentence vectors before clustering them. We experiment with k in 5, 10 and 15 to test how adding flexibility would improve the domain clustering accuracy.

² Using the penultimate layer or others may also be a valid choice; we leave this for future work.
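To make the clustering recipe above concrete, the following minimal Python sketch (illustrative only, not the exact experimental code; the helper names and the per-sentence processing are placeholders) shows one way to implement it with the HuggingFace Transformers and scikit-learn toolkits used in this chapter, assuming the bert-base-uncased checkpoint:

import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def embed_sentences(sentences, model_name="bert-base-uncased"):
    # Average-pool the last hidden state over the tokens of each sentence.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    vectors = []
    with torch.no_grad():
        for sent in sentences:
            enc = tokenizer(sent, return_tensors="pt", truncation=True)
            hidden = model(**enc).last_hidden_state        # shape: (1, seq_len, dim)
            vectors.append(hidden.mean(dim=1).squeeze(0))  # average pooling over tokens
    return torch.stack(vectors).numpy()

def cluster_sentences(vectors, k=5, pca_dim=50):
    # Optional PCA for speed and visualization, then a GMM with k clusters.
    reduced = PCA(n_components=pca_dim).fit_transform(vectors)
    gmm = GaussianMixture(n_components=k, covariance_type="full")
    return gmm.fit_predict(reduced)  # one unsupervised cluster id per sentence

The cluster ids returned by the GMM are what the purity evaluation below compares against the gold domain labels.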
Models and Baselines  For MLM-based models we use BERT (Devlin et al., 2019), DistilBERT (Sanh et al., 2019) and RoBERTa (Liu et al., 2019) (in both the base and large versions). For autoregressive models we use GPT-2 (Radford et al., 2018) and XLNet (Yang et al., 2019). In all cases we use the implementations from the HuggingFace Transformers toolkit (Wolf et al., 2019). We also evaluated three additional, simpler baselines. The first is using representations from word2vec (Mikolov et al., 2013), where we average-pooled the word vectors for the tokens that were present in the model vocabulary. The second is using Latent Dirichlet Allocation (LDA, Blei et al., 2003), which is a classic approach to unsupervised clustering of text.³ We also report results for a baseline which assigns sentences by sampling randomly from a uniform distribution over the clusters.

³ We used the LDA implementation provided in the Gensim toolkit: https://radimrehurek.com/gensim/

Evaluation  To evaluate the unsupervised domain clustering we used the multi-domain corpus proposed by Koehn and Knowles (2017) which includes textual data in five diverse domains: subtitles⁴, medical text (PDF documents from the European Medicines Agency), legal text (legislative text of the European Union), translations of the Koran, and IT-related text (manuals and localization files of open-source software). See more details on this dataset in Section 3.1. We used 2000 distinct sentences from each domain. To evaluate whether the resulting clusters indeed capture the domains the data was drawn from we measure the purity metric, where we assign each unsupervised cluster with the most common domain in the sentences assigned to that cluster, and then compute the accuracy according to this majority-based assignment. In cases where randomness is involved we run each experiment five times with different initializations and report the mean and variance of the purity metric for each model.

⁴ From http://www.opensubtitles.org/

Table 1: Unsupervised domain clustering as measured by purity for the different models. Best results are marked in bold for each setting.

                 k=5            k=10           k=15
  Random         15.08 (±0.0)   16.77 (±0.0)   17.78 (±0.0)
  LDA            24.31 (±0.99)  26.73 (±2.19)  30.79 (±2.97)

                 with PCA (n=50)                               without PCA
                 k=5            k=10           k=15            k=5    k=10   k=15
  word2vec       53.65 (±0.79)  68.14 (±2.58)  73.44 (±0.68)   45.93  65.80  76.26
  BERT-base      87.66 (±0.24)  88.02 (±1.10)  88.37 (±0.66)   85.74  85.08  86.37
  BERT-large     85.64 (±6.13)  87.61 (±0.26)  89.07 (±0.53)   68.56  86.53  86.99
  DistilBERT     83.68 (±7.14)  86.31 (±0.86)  87.53 (±0.85)   79.00  86.42  88.14
  RoBERTa-base   79.05 (±0.10)  86.39 (±0.90)  86.51 (±0.28)   70.21  80.35  81.49
  RoBERTa-large  80.61 (±0.33)  89.04 (±0.15)  89.94 (±0.23)   69.88  81.07  85.91
  GPT-2          70.30 (±0.05)  84.76 (±0.30)  82.56 (±1.29)   37.82  39.02  41.45
  XLNet          55.72 (±0.69)  68.17 (±3.93)  72.65 (±1.92)   30.36  32.96  48.55
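For reference, the purity computation described in the Evaluation paragraph can be sketched as follows (an assumed helper, not the evaluation script used for Table 1): each cluster is assigned its most common gold domain, and purity is the fraction of sentences whose domain matches their cluster's majority domain.

from collections import Counter

def purity(cluster_ids, domain_labels):
    # Count gold domains inside each unsupervised cluster.
    per_cluster = {}
    for c, d in zip(cluster_ids, domain_labels):
        per_cluster.setdefault(c, Counter())[d] += 1
    # Each cluster "votes" for its majority domain; purity is the accuracy of that assignment.
    correct = sum(counts.most_common(1)[0][1] for counts in per_cluster.values())
    return correct / len(domain_labels)

When randomness is involved (e.g. the GMM initialization), this score is computed once per run and the mean and variance over the five runs are reported, as described above.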

Results and Discussion  As can be seen in Table 1, pre-trained language models are indeed highly capable of generating sentence representations that cluster by domains, resulting in up to 87.66%, 89.04% and 89.94% accuracy when using k=5, k=10 and k=15 clusters, respectively, across 10,000 sentences in 5 domains. We find these scores remarkably high given that our average-pooling strategy is very straight-forward and that no domain supervision was involved in the process of learning the pre-trained representations. Figure 2 also demonstrates the quality of the obtained clusters in 2D using the BERT-base model, where the ellipses describe the mean and variance parameters learned for each cluster by the GMM with k = 5.⁵

⁵ Similar visualizations for additional models are available in the supplementary material.

Figure 2: A 2D visualization of the unsupervised GMM clustering (bert-base-uncased) for the same sentences as in Figure 1.

We note that some classes of models did better than others: while all vector-based models did far better than the random and LDA baselines, the MLM-based models dominated in all cases over word2vec and the auto-regressive models. This may be explained by the fact that the MLM-based models use the entire sentence context when generating the representations for each token, while the auto-regressive models only use the past context, and word2vec uses a limited window context. Using PCA improved performance in most cases and especially for the auto-regressive models, although the results for the MLMs remain high in both cases – suggesting that these models encode the information very differently.

Analysis  As can be seen in Figure 2, in some areas the domains are somewhat overlapping in the embedding space, which may lead to outlier cases where examples from one domain are assigned to a cluster of another domain. We plot a confusion matrix (Figure 3) to analyze this further based on the clustering with BERT-base and k=5. We first note that the outlier sentences are much shorter than the average sentence length in the corpus (11.62 tokens on average for outliers vs. 20.5 tokens on average in general). This makes sense as shorter sentences contain less information, making it harder to assign them to an appropriate cluster. Table 2 shows examples of outlier sentences, assigned to clusters of domains different from their originating domain. We can see that in many cases the assignments are sensible – for example for sentences originating from the subtitles corpus, a sentence that mentions "great priest" is assigned to the Koran cluster, a sentence that mentions "The International Criminal Court in The Hague" is assigned to the Law cluster, a sentence that mentions "the virus" is assigned to the Medical cluster and so on. This strengthens our claim that defining domains based on the corpus they originated from may be over-simplistic, and using a more data-driven approach may enable finding better domain assignments across different corpora.

Figure 3: A confusion matrix for clustering with k=5 using BERT-base (rows: true label, columns: predicted label).

  true \ predicted    it    koran  subtitles  medical  law
  it                1927        0         55       16    2
  koran                4     1767        225        0    4
  subtitles           47       21       1918        9    5
  medical            340        0         82     1413  165
  law                206        0         10       58 1726
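The outlier analysis above can be reproduced with a short helper along the following lines (a sketch under the same majority-domain assignment; the whitespace-based token count is an assumption of this illustration rather than the tokenization behind the reported 11.62 vs. 20.5 figures):

from collections import Counter

def find_outliers(cluster_ids, domain_labels, sentences):
    # Majority domain per cluster, as in the purity computation.
    counts = {}
    for c, d in zip(cluster_ids, domain_labels):
        counts.setdefault(c, Counter())[d] += 1
    majority = {c: cnt.most_common(1)[0][0] for c, cnt in counts.items()}
    # Outliers: sentences whose gold domain differs from their cluster's majority domain.
    outliers = [s for c, d, s in zip(cluster_ids, domain_labels, sentences) if majority[c] != d]
    def avg_len(xs):
        return sum(len(x.split()) for x in xs) / max(len(xs), 1)
    return outliers, avg_len(outliers), avg_len(sentences)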

Table 2: Sentences from one domain which were assigned to a cluster of another domain by the BERT-based clustering, k=5.

Subtitles assigned to Koran:
  I am Spa'am, high priest of the boars.
  Joseph, go in peace, and the Lord be with you.
Subtitles assigned to Medical:
  Oxygen supply at 50%.
  Or it can help her walk again if the virus is kept in check with this.
Subtitles assigned to IT:
  Push it up to the front of the screen.
  Polyalloy requires programming to take permanent form.
Subtitles assigned to Law:
  Statutes, transcripts, redacted immunity agreements.
  The Security Council therefore must press for his immediate referral to the International Criminal Court in The Hague.
Law assigned to Medical:
  - Viruses and virus-like organisms
  where the glucose content is equal to or less than the fructose content.
Law assigned to IT:
  "INFORMATION SOCIETY STATISTICS
  This document must be attached to the certificate and field with it, except where there is a computerised checking system.
Medical assigned to Law:
  This will be introduced by a Regulation adopted by the European Commission.
  The marketing authorisation was renewed on 22 May 2002 and 22 May 2007.
Medical assigned to IT:
  An updated and improved version of the CD-ROM was issued to all subscribers during the first half of the year.
  - All tables will be based on generic and not product-specific data.
IT assigned to Medical:
  R65: Harmful: may cause lung damage if swallowed
  Automatic Red-Eye Removal
IT assigned to Subtitles:
  At the end we say good bye.
  What would you like to do for your next shot?

The domain that attracted the largest number of outliers is the IT domain cluster, with 597 sentences assigned to it from other domains. Looking more closely we find that more than half of these sentences (340 out of 597) included numbers (e.g. "34% 25% 34%" (from medical), "(b) reference number 20 is deleted;" (from law), "(Command of Prostration # 1)" (from Koran) or "The message, R2." (from subtitles)). As numbers appear in many different contexts, they may be harder to assign to a specific domain by the context-aware language models in such short sentences. The second largest attractor of outliers is the Subtitles cluster, with 372 sentences assigned to it from other domains. We find that most of these sentences contain personal pronouns or question marks (228 out of 372, 61.2%) while the ratio of such sentences in the entire corpus is only 40%. Examples include "Why did you choose the name &amarok;?" (from IT), or "What is Avonex?" (from Medical). This may be expected as the subtitles corpus mainly includes transcriptions of spoken, conversational language, and "conversation tends to have more verbs, more personal pronouns, and more questions" (Conrad and Biber, 2005). Another possible reason for the subtitles domain to attract outliers is the fact that this is the least-topical cluster: movies and TV series may discuss diverse topics, unlike medical, religious, legal and technical texts that may have a more cohesive topic.

3 Neural Machine Translation in a Multi-Domain Scenario

As we showed that pre-trained language models are indeed very useful in clustering sentence representations by domains in an unsupervised manner, we now seek to harness this property for a downstream task – domain data selection for machine translation. Domain data selection is the task of selecting examples from a large corpus which are as close as possible to the domain of interest, given a smaller set of in-domain examples. The selected examples can be used to either (1) train a domain-specific model from scratch (Axelrod et al., 2011), (2) fine-tune a pre-trained general-domain model (Silva et al., 2018), or (3) prioritize data for annotation as in an Active-Learning framework, if only monolingual data is available (Haffari et al., 2009). To demonstrate the need for domain data selection and set the stage for our data selection experiments, we perform preliminary experiments with NMT in a multi-domain scenario.

3.1 Multi-Domain Dataset

To simulate a diverse multi-domain setting we use the dataset proposed in Koehn and Knowles (2017), as it was recently adopted for domain adaptation research in NMT (Hu et al., 2019; Müller et al., 2019; Dou et al., 2019a,b). The dataset includes parallel text in German and English from five diverse domains (Medical, Law, Koran, IT, Subtitles; as discussed in Section 2), available via OPUS (Tiedemann, 2012; Aulamo and Tiedemann, 2019).

In a preliminary analysis of the data we found that in both the original train/dev/test split by Koehn and Knowles (2017) and in the more recent split by Müller et al. (2019) there was overlap between the training data and the dev/test data.⁶ Fixing these issues is important, as it may affect the conclusions one draws from experiments with this dataset. For example, as overlapping development sets favor memorization of the training set, one may choose checkpoints and report results on over-fitting models. This is especially relevant with neural sequence-to-sequence models, as they are highly susceptible to memorization (Aharoni and Goldberg, 2018) and hallucination (Lee et al., 2018), as confirmed by Müller et al. (2019).

To create a better experimental setting to test generalization within and across domains, we create a new data split where we ensure that no such overlap between the training, development and test sets occurs. We started from the split of Müller et al. (2019) as it included newer versions of some of the datasets.⁷ Furthermore, we did not allow more than one translation of a given source or target sentence, as such cases were very frequent in the dataset and usually stand for duplicate sentence pairs (see Table 3). For example, applying this filtering reduced the size of the Koran corpus from 533,128 sentences to only 17,982 sentences. Finally, following Müller et al. (2019) we cap the subtitles corpus to 500,000 sentence pairs as it is much larger than the rest. We make the new split publicly available and hope it will enable better future experimentation on this important subject.⁸

⁶ More details are available in the supplementary material.
⁷ Their dataset is available in: https://github.com/ZurichNLP/domain-robustness
⁸ https://github.com/roeeaharoni/unsupervised-domain-clusters

Table 3: Number of training examples for each domain in the original split (Müller et al., 2019) and in our split.

              Original      New Split
  Medical     1,104,752     248,099
  Law         715,372       467,309
  IT          378,477       222,927
  Koran       533,128       17,982
  Subtitles   22,508,639    14,458,058
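The two constraints used to build the new split can be illustrated with the following simplified sketch (the actual split construction involved more bookkeeping; treating English as the target side and checking overlap on the English side follow the description in the appendix, the rest is an assumption of this illustration):

def filter_training_data(train_pairs, dev_test_english):
    # train_pairs: list of (German, English) pairs; dev_test_english: set of English
    # sentences appearing in any development or test set.
    held_out = set(dev_test_english)
    seen_src, seen_tgt, filtered = set(), set(), []
    for src, tgt in train_pairs:
        if tgt in held_out:
            continue  # no overlap between training and dev/test data
        if src in seen_src or tgt in seen_tgt:
            continue  # allow at most one translation per source or target sentence
        seen_src.add(src)
        seen_tgt.add(tgt)
        filtered.append((src, tgt))
    return filtered

Applying this kind of filtering is what reduces, for example, the Koran corpus from 533,128 to 17,982 sentence pairs.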
3.2 Cross-Domain Experiments

Experimental Setup  We follow Hu et al. (2019) and train domain-specific models for all domains. We then evaluate each model across the different domain test sets, enabling us to understand the effect of different domains on the downstream MT performance and to set up strong baselines for data selection experiments. We also train a general-domain model using the available data from all domains, as it is also a common approach in multi-domain scenarios (Müller et al., 2019). In all experiments we use a similar Transformer (Vaswani et al., 2017) model, and only control for the training data. More details on the exact training and hyperparameter settings for the NMT models are available in the supplementary material.

Results  The results for the cross-domain evaluation are available in Table 4. In most cases, the best results for each domain are obtained by training on the in-domain data. Training on all the available data helped mostly for the Koran test set. This is expected as the training data for this domain is considerably smaller than the training data for the rest of the domains (Table 3). We can also see that more data is not necessarily better (Gascó et al., 2012): while the subtitles corpus is the largest of all 5 and includes 500,000 sentence pairs, it is second to last in performance as measured by the average BLEU across all test sets.

Table 4: SacreBLEU (Post, 2018) scores of our baseline systems on the test sets of the new data split. Each row represents the results from one model on each test set. The best result in each column is marked in bold.

             Medical  Law   Koran  IT    Subtitles
  Medical    56.5     18.3  1.9    11.4  4.3
  Law        21.7     59    2.7    13.1  5.4
  Koran      0.1      0.2   15.9   0.2   0.5
  IT         14.9     9.6   2.8    43    8.6
  Subtitles  7.9      5.5   6.4    8.5   27.3
  All        53.3     57.2  20.9   42.1  27.6

Cross-Domain BLEU vs. Cluster Proximity  An interesting observation can be made with respect to the visual analysis of the domain clusters as depicted in Figure 2: as the Medical cluster (in Yellow), Law cluster (in Purple) and IT cluster (in Red) are close to each other in the embedding space, their cross-domain BLEU scores are also higher. For example, note how in the results for the Medical domain-specific model (first row in Table 4), the BLEU scores on the Law and IT test sets are much higher in comparison to those on the Koran and Subtitles test sets, whose clusters are farther away in the visualized embedding space. Similarly, as the Subtitles cluster (Blue) is closer to the Koran cluster (Green), the highest cross-domain BLEU score on the Koran test set is from the Subtitles model. This suggests that such preliminary visual analysis can be a useful tool for understanding the relationship between diverse datasets, and motivates the use of pre-trained language model representations for domain data selection in MT.
4 Domain Data Selection with Pretrained Language Models

As shown in the previous section, using the right data is critical for achieving good performance on an in-domain test set, and more data is not necessarily better. However, in real-world scenarios, the availability of data labeled by domain is limited, e.g. when working with large scale, web-crawled data. In this section we focus on a data-selection scenario where only a very small number of in-domain sentences are used to select data from a larger unlabeled parallel corpus. An established method for data selection was proposed by Moore and Lewis (2010), which was also used in training the winning systems in WMT 2019 (Ng et al., 2019; Barrault et al., 2019). This method compares the cross-entropy, according to domain-specific and non-domain-specific language models, for each candidate sentence for selection. The sentences are then ranked by the cross-entropy difference, and only the top sentences are selected for training.

While the method by Moore and Lewis (2010) is tried-and-true, it is based on simple n-gram language models which cannot generalize beyond the n-grams that are seen in the in-domain set. In addition, it is restricted to the in-domain and general-domain datasets it is trained on, which are usually small. On the contrary, pre-trained language models are trained on massive amounts of text, and, as we showed through unsupervised clustering, learn representations with domain-relevant information. In the following sections, we investigate whether this property of pretrained language models makes them useful for domain data selection.
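The cross-entropy difference criterion can be sketched as follows. This is a schematic version only: the experiments in this chapter use the existing implementation cited in the appendix (which relies on KenLM n-gram models), and the logprob interface below is an assumed placeholder for whatever language model is used:

def cross_entropy_difference(sentence, in_domain_lm, general_lm):
    # Per-token cross-entropy under the in-domain LM minus the general-domain LM.
    # Lower values indicate sentences that look more in-domain.
    n_tokens = max(len(sentence.split()), 1)
    h_in = -in_domain_lm.logprob(sentence) / n_tokens
    h_general = -general_lm.logprob(sentence) / n_tokens
    return h_in - h_general

def moore_lewis_select(candidates, in_domain_lm, general_lm, top_k=500_000):
    ranked = sorted(candidates, key=lambda s: cross_entropy_difference(s, in_domain_lm, general_lm))
    return ranked[:top_k]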
4.1 Methods

We propose two methods for domain data selection with pretrained language models.

Domain-Cosine  In this method we first compute a query vector, which is the element-wise average over the vector representations of the sentences in the small in-domain set. We use the same average-pooling approach as described in Section 2. We then retrieve the most relevant sentences in the training set by computing the cosine similarity of each sentence with this query vector and ranking the sentences accordingly.

Domain-Finetune  It is now common knowledge that pretrained language models are especially useful when fine-tuned for the task of interest in an end-to-end manner. In this method we fine-tune the pretrained LM for binary classification, where we use the in-domain sentences as positive examples, and randomly sampled general-domain sentences as negative examples. We then apply this classifier on the general-domain data and pick the sentences that are classified as positive as in-domain, or choose the top-k sentences as ranked by the classifier output distribution.

Negative Sampling with Pre-ranking  One problem that may arise in this case is that unlabeled in-domain sentences from the general-domain data may be sampled as negative examples and deteriorate the classifier performance. To alleviate this issue, we perform a biased sampling of negative examples. We first rank the general-domain data using the Domain-Cosine method, and then sample negative examples under a certain threshold in the ranking (in our experiments we sampled from the bottom two-thirds). Table 5 shows an ablation for pre-ranking, measuring precision, recall and F1 for binary classification on a held-out set for each domain. When not using pre-ranking, as the training data for the domain is larger, the precision is lower – since more in-domain examples are drawn as negative samples. Given these results we always use pre-ranking in the following experiments.

Table 5: Ablation analysis showing precision (p), recall (r) and F1 for the binary classification accuracy on a held-out set, with and without pre-ranking.

               without pre-ranking         with pre-ranking
               p      r      F1            p      r      F1
  Subtitles    0.722  0.984  0.833         0.964  0.978  0.971
  Law          0.761  0.94   0.841         0.944  0.94   0.942
  Medical      0.821  0.916  0.866         0.929  0.92   0.925
  IT           0.848  0.956  0.898         0.955  0.98   0.967
  Koran        0.966  0.958  0.962         0.994  0.974  0.984
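Both proposed methods start from the same average-pooled sentence vectors used for clustering in Section 2. The Domain-Cosine ranking and the pre-ranked negative sampling can be sketched as follows (illustrative numpy code; the embedding step itself is as sketched earlier in this chapter):

import numpy as np

def domain_cosine_ranking(in_domain_vecs, candidate_vecs):
    # Query vector: element-wise average of the in-domain sentence vectors.
    query = in_domain_vecs.mean(axis=0)
    query = query / np.linalg.norm(query)
    cands = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    scores = cands @ query                 # cosine similarity to the query
    return np.argsort(-scores)             # candidate indices, most similar first

def sample_negatives(ranking, num_negatives, seed=0):
    # Pre-ranking: draw negative examples only from the bottom two-thirds of the ranking,
    # to reduce the chance of sampling unlabeled in-domain sentences as negatives.
    rng = np.random.default_rng(seed)
    bottom = ranking[len(ranking) // 3:]
    return rng.choice(bottom, size=num_negatives, replace=False)

The sampled negatives, together with the in-domain positives, are what the Domain-Finetune classifier is trained on.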

4.2 Experimental Setup

We perform data selection experiments for each domain in the multi-domain dataset. As the small set of monolingual in-domain data we take the 2000 development sentences from each domain. For the general-domain corpus we concatenate the training data from all domains, resulting in 1,456,317 sentences. To enable faster experimentation we used DistilBERT (Sanh et al., 2019) for the Domain-Cosine and Domain-Finetune methods. More technical details are available in the supplementary material. We compare our methods to four approaches: (1) the established method by Moore and Lewis (2010), (2) a random selection baseline, (3) an oracle which is trained on all the available in-domain data, and (4) the model we train on all the domains concatenated. We select the top 500k examples to cover the size of every specific in-domain dataset. We train Transformer NMT models on the selected data with a similar configuration to the ones trained in the cross-domain evaluation.

4.3 Results

The results are available in Table 6. We can see that all selection methods performed much better in terms of BLEU than random selection. With respect to average performance across all domains, Moore-Lewis performed better than the Domain-Cosine method, while Domain-Finetune performed best. Using the positive examples alone (Domain-Finetune-Positive) performed worse than using the top 500k examples but better than Domain-Cosine, while not requiring to determine the number of selected sentences. The average performance in BLEU for all data selection methods is also better than oracle selection and than training on all the available data. We perform an analysis on the selected datasets, where we measure the precision and recall of sentence selection with respect to the oracle selection. The results are available in Table 7. As also reflected in the BLEU scores, the Domain-Finetune method resulted in the highest domain recall with a minimum of 97.5, while Moore-Lewis and Domain-Cosine scored 89.4 and 78.8 respectively. We find the results very appealing given that only 2000 in-domain sentences were used for selection for each domain out of 1.45 million sentences.

Table 6: SacreBLEU scores for the data selection experiments. Highest scores are marked in bold.

                              Medical  Law   Koran  IT    Subtitles  Average
  Random-500k                 49.8     53.3  18.5   37.5  25.5       36.92
  Moore-Lewis-Top-500k        55       58    21.4   42.7  27.3       40.88
  Domain-Cosine-Top-500k      52.7     58    22     42.5  27.1       40.46
  Domain-Finetune-Top-500k    54.8     58.8  21.8   43.5  27.4       41.26
  Domain-Finetune-Positive    55.3     58.7  19.2   42.5  27         40.54
  Oracle                      56.5     59    15.9   43    27.3       40.34
  All                         53.3     57.2  20.9   42.1  27.6       40.22

Table 7: Precision (p) and recall (r) for data selection of 500k sentences with respect to the oracle selection.

              Moore-Lewis       D-Cosine          D-Finetune
              p      r          p      r          p      r
  Medical     0.476  0.955      0.391  0.788      0.485  0.975
  Law         0.836  0.894      0.841  0.899      0.902  0.965
  Koran       0.35   0.985      0.36   0.989      0.36   0.998
  IT          0.441  0.985      0.382  0.857      0.447  0.998
  Subtitles   0.899  0.899      0.916  0.916      0.957  0.957
  Average     0.6    0.944      0.578  0.89       0.63   0.979
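The precision and recall of a selected set with respect to the oracle selection (Table 7) reduce to simple set operations; a small sketch, assuming each candidate sentence is identified by an id:

def selection_precision_recall(selected_ids, oracle_ids):
    selected, oracle = set(selected_ids), set(oracle_ids)
    true_positives = len(selected & oracle)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(oracle) if oracle else 0.0
    return precision, recall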
5 Related Work

Previous works used n-gram LMs for data selection (Moore and Lewis, 2010; Axelrod et al., 2011) or other count-based methods (Axelrod, 2017; Poncelas et al., 2018; Parcheta et al., 2018; Santamaría and Axelrod, 2019). While such methods work well in practice, they cannot generalize beyond the n-grams observed in the in-domain datasets, which are usually small. Duh et al. (2013) proposed to replace n-gram models with RNN-based LMs with notable improvements. However, such methods do not capture the rich sentence-level global context as in the recent self-attention-based MLMs; as we showed in the clustering experiments, autoregressive neural LMs were inferior to masked LMs in clustering the data by domain. In addition, training very large neural LMs may be prohibitive without relying on pre-training. Regarding domain clustering for MT, Hasler et al. (2014) discover topics using LDA instead of using domain labels. Cuong et al. (2016) induce latent subdomains from the training data using a dedicated probabilistic model.

Regarding vector-based data selection, Ruder and Plank (2017) learn to select data using Bayesian optimization, and explored word2vec for that purpose. Duma and Menzel (2016) create paragraph vectors for data selection in the context of SMT. Wang et al. (2017) use internal representations from the NMT model to perform data selection. Bapna and Firat (2019) propose a mechanism for incorporating retrieved sentences for each instance for domain adaptation in NMT, using representations extracted from a pre-trained NMT model. Farajian et al. (2017) explored instance-based data selection in a multi-domain scenario using information retrieval methods. Dou et al. (2019a) adapt multi-domain NMT models with domain-aware feature embeddings, which are learned via an auxiliary language modeling task. Peris et al. (2017) proposed neural-network based classifiers for data selection in SMT. For more related work on data selection and domain adaptation in the context of MT, see the surveys by Eetemadi et al. (2015) and Chu and Wang (2018). Unrelated to MT, Ma et al. (2019) used BERT to select data for tasks from the GLUE benchmark (Wang et al., 2018). However, they assumed supervision for all the different tasks/domains, while we propose an unsupervised method requiring only a small set of in-domain data.

While previous work made important contributions to domain data selection, our work is the first to explore massive pretrained language models for both unsupervised domain clustering and for data selection in NMT.

6 Conclusions and Future Work

We showed that massive pre-trained language models are highly effective in mapping data to domains in a fully-unsupervised manner using average-pooled sentence representations and GMM-based clustering. We suggest that such clusters are a more appropriate, data-driven approach to domains in natural language than simplistic labels (e.g. "medical text"), and that it will improve over time as better and larger pretrained LMs become available. We proposed new methods to harness this property for domain data selection using distance-based ranking in vector space and pretrained LM fine-tuning, requiring only a small set of in-domain data. We demonstrated the effectiveness of our methods on a new, improved data split we created for a previously studied multi-domain machine translation benchmark. Our methods perform similarly or better than an established data selection method and oracle in-domain training across all five domains in the benchmark.

This work just scratches the surface of what can be done on the subject; possible avenues for future work include extending this with multilingual selection and multilingual LMs (Conneau and Lample, 2019; Conneau et al., 2019; Wu et al., 2019), using such selection methods with domain-curriculum training (Zhang et al., 2019; Wang et al., 2019b), applying them on noisy, web-crawled data (Junczys-Dowmunt, 2018) or for additional tasks. We hope this work will encourage more research on finding the right data for the task, towards more efficient and robust NLP.

References

Roee Aharoni and Yoav Goldberg. 2018. Split and rephrase: Better evaluation and stronger baselines. In Proceedings of ACL 2018 (Volume 2: Short Papers).
Mikko Aulamo and Jörg Tiedemann. 2019. The OPUS resource repository: An open package for creating parallel corpora and machine translation services. In Proceedings of the 22nd Nordic Conference on Computational Linguistics.
Amittai Axelrod. 2017. Cynical selection of language model training data. arXiv preprint arXiv:1709.02279.
Amittai Axelrod, Xiaodong He, and Jianfeng Gao. 2011. Domain adaptation via pseudo in-domain data selection. In Proceedings of EMNLP 2011.
Ankur Bapna and Orhan Firat. 2019. Non-parametric adaptation for neural machine translation. In Proceedings of NAACL-HLT 2019.
Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation.
David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.
Chenhui Chu and Rui Wang. 2018. A survey of domain adaptation for neural machine translation. In Proceedings of COLING 2018.
Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT's attention. arXiv preprint arXiv:1906.04341.
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.
Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems.
Susan M. Conrad and Douglas Biber. 2005. The frequency and use of lexical bundles in conversation and academic prose. Lexicographica.
Hoang Cuong, Khalil Sima'an, and Ivan Titov. 2016. Adapting to all domains at once: Rewarding domain invariance in SMT. Transactions of the Association for Computational Linguistics, 4:99–112.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019.
Zi-Yi Dou, Junjie Hu, Antonios Anastasopoulos, and Graham Neubig. 2019a. Unsupervised domain adaptation for neural machine translation with domain-aware feature embeddings. arXiv preprint arXiv:1908.10430.
Zi-Yi Dou, Xinyi Wang, Junjie Hu, and Graham Neubig. 2019b. Domain differential adaptation for neural machine translation. In Proceedings of the 3rd Workshop on Neural Generation and Translation.
Kevin Duh, Graham Neubig, Katsuhito Sudoh, and Hajime Tsukada. 2013. Adaptation data selection using neural language models: Experiments in machine translation. In Proceedings of ACL 2013 (Volume 2: Short Papers).
Mirela-Stefania Duma and Wolfgang Menzel. 2016. Data selection for IT texts using paragraph vector. In Proceedings of the First Conference on Machine Translation.
Sauleh Eetemadi, William Lewis, Kristina Toutanova, and Hayder Radha. 2015. Survey of data-selection methods in statistical machine translation. Machine Translation, 29(3-4):189–223.
M. Amin Farajian, Marco Turchi, Matteo Negri, and Marcello Federico. 2017. Multi-domain neural machine translation through unsupervised adaptation. In Proceedings of the Second Conference on Machine Translation.
Guillem Gascó, Martha-Alicia Rocha, Germán Sanchis-Trilles, Jesús Andrés-Ferrer, and Francisco Casacuberta. 2012. Does more data always yield better translations? In Proceedings of EACL 2012.
Yoav Goldberg. 2019. Assessing BERT's syntactic abilities. arXiv preprint arXiv:1901.05287.
Gholamreza Haffari, Maxim Roy, and Anoop Sarkar. 2009. Active learning for statistical phrase-based machine translation. In Proceedings of NAACL-HLT 2009.
Eva Hasler, Phil Blunsom, Philipp Koehn, and Barry Haddow. 2014. Dynamic topic adaptation for phrase-based MT. In Proceedings of EACL 2014.
Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation.
Junjie Hu, Mengzhou Xia, Graham Neubig, and Jaime Carbonell. 2019. Domain adaptation of neural machine translation by lexicon induction. In Proceedings of ACL 2019.
Marcin Junczys-Dowmunt. 2018. Dual conditional cross-entropy filtering of noisy parallel corpora. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers.
Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of ACL 2007, Demo and Poster Sessions.
Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation.
Katherine Lee, Orhan Firat, Ashish Agarwal, Clara Fannjiang, and David Sussillo. 2018. Hallucinations in neural machine translation.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Xiaofei Ma, Peng Xu, Zhiguo Wang, Ramesh Nallapati, and Bing Xiang. 2019. Domain adaptation with BERT-based domain classification and data selection. In Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019).
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems.
Robert C. Moore and William Lewis. 2010. Intelligent selection of language model training data. In Proceedings of the ACL 2010 Conference Short Papers.
Mathias Müller, Annette Rios, and Rico Sennrich. 2019. Domain robustness in neural machine translation.
Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Michael Auli, and Sergey Edunov. 2019. Facebook FAIR's WMT19 news translation task submission. In Proceedings of the Fourth Conference on Machine Translation.
Xing Niu, Marianna Martindale, and Marine Carpuat. 2017. A study of style in machine translation: Controlling the formality of machine translation output. In Proceedings of EMNLP 2017.
Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019 (Demonstrations).
Zuzanna Parcheta, Germán Sanchis-Trilles, and Francisco Casacuberta. 2018. Data selection for NMT using infrequent n-gram recovery. EAMT 2018.
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.
Álvaro Peris, Mara Chinea-Ríos, and Francisco Casacuberta. 2017. Neural networks classifier for data selection in statistical machine translation. The Prague Bulletin of Mathematical Linguistics, 108(1):283–294.
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL-HLT 2018.
Alberto Poncelas, Gideon Maillette de Buy Wenniger, and Andy Way. 2018. Data selection with feature decay algorithms using an approximated target side. arXiv preprint arXiv:1811.03039.
Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers.
Ofir Press and Lior Wolf. 2017. Using the output embedding to improve language models. In Proceedings of EACL 2017 (Volume 2: Short Papers).
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. OpenAI blog.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer.
Sebastian Ruder and Barbara Plank. 2017. Learning to select data for transfer learning with Bayesian optimization. In Proceedings of EMNLP 2017.
Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
Lucía Santamaría and Amittai Axelrod. 2019. Data selection with cluster-based language difference models and cynical selection. arXiv preprint arXiv:1904.04900.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of ACL 2016 (Volume 1: Long Papers).
Catarina Cruz Silva, Chao-Hong Liu, Alberto Poncelas, and Andy Way. 2018. Extracting in-domain training corpora for neural machine translation using data selection methods. In Proceedings of the Third Conference on Machine Translation: Research Papers.
Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of ACL 2019.
Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of LREC 2012.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30.
Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019a. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. arXiv preprint arXiv:1905.00537.
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP.
Rui Wang, Andrew Finch, Masao Utiyama, and Eiichiro Sumita. 2017. Sentence embedding for neural machine translation domain adaptation. In Proceedings of ACL 2017 (Volume 2: Short Papers).
Wei Wang, Isaac Caswell, and Ciprian Chelba. 2019b. Dynamically composing domain-data selection with clean-data selection by "co-curricular learning" for neural machine translation. In Proceedings of ACL 2019.
Marlies van der Wees. 2017. What's in a Domain? Towards Fine-Grained Adaptation for Machine Translation. Ph.D. thesis, University of Amsterdam.
Marlies van der Wees, Arianna Bisazza, Wouter Weerkamp, and Christof Monz. 2015. What's in a domain? Analyzing genre and topic differences in statistical machine translation. In Proceedings of ACL-IJCNLP 2015 (Volume 2: Short Papers).
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.
Shijie Wu, Alexis Conneau, Haoran Li, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Emerging cross-lingual structure in pretrained language models. arXiv preprint arXiv:1911.01464.
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.
Xuan Zhang, Pamela Shapiro, Gaurav Kumar, Paul McNamee, Marine Carpuat, and Kevin Duh. 2019. Curriculum learning for domain adaptation in neural machine translation. In Proceedings of NAACL-HLT 2019.
A Appendix

A.1 NMT Training

Figure 4 details the hyperparameter configuration we used to train the NMT models. We use Transformer models (Vaswani et al., 2017) in the Base configuration using the implementation provided in Fairseq (Ott et al., 2019). For all models we use a joint BPE vocabulary (Sennrich et al., 2016) learned with 32k merge operations over the concatenated corpus in both languages, enabling to tie all the embedding layers (Press and Wolf, 2017).⁹ We perform early stopping if the BLEU score on the domain-specific development set did not improve in 10 consequent checkpoints. We use the ADAM (Kingma and Ba, 2014) optimizer with an initial learning rate of 5·10⁻⁴ and a maximum of 4096 tokens per batch. We trained all models on a single NVIDIA GPU. We decode using beam search with a beam size of 5. For pre-processing we used the Moses (Koehn et al., 2007) pipeline including tokenization, normalize-punctuation, non-printing character removal, truecasing and cleaning. We removed examples with sequences longer than 100 tokens from the training data (before subword segmentation).

⁹ We used the implementation in https://github.com/rsennrich/subword-nmt

CUDA_VISIBLE_DEVICES=0 \
python $FAIRSEQ_PATH/train.py ${BINARIZED_DATA_DIR} \
  --arch transformer_wmt_en_de \
  --share-all-embeddings \
  --optimizer adam \
  --adam-betas '(0.9, 0.98)' \
  --clip-norm 1.0 \
  --lr 0.0005 \
  --lr-scheduler inverse_sqrt \
  --warmup-updates 4000 \
  --warmup-init-lr 1e-07 \
  --dropout 0.2 \
  --weight-decay 0.0 \
  --criterion label_smoothed_cross_entropy \
  --label-smoothing 0.1 \
  --max-tokens 4096 \
  --update-freq 5 \
  --attention-dropout 0.2 \
  --activation-dropout 0.2 \
  --max-epoch 200 \
  --seed 17 \
  -s $src \
  -t $tgt \
  --save-dir $MODEL_PATH \
  --save-interval-updates 10000 \
  --validate-interval 1

Figure 4: The hyperparameter configuration we used for NMT model training using Fairseq (Ott et al., 2019).
A.2 Data Split

Table 8 shows details about the overlap between the training, development and test sets for the different data splits of the multi-domain dataset. The overlap was computed using the English part of the corpus.

Table 8: Details about the different data splits for the multi-domain corpus.

  % dev in train   Koehn and Knowles (2017)   Müller et al. (2019)   New Split
    Medical        1090/2000 (54.5%)          1204/2000 (60.2%)      0/2000
    Koran          0/2000                     1926/2000 (96.3%)      0/2000
    Subtitles      1183/5000 (23.66%)         638/2000 (31.9%)       0/2000
    Law            595/2000 (29.75%)          1000/2000 (50%)        0/2000
    IT             2496/2526 (98.81%)         783/2000 (39.15%)      0/2000

  % test in train
    Medical        571/2000 (28.55%)          516/1691 (30.51%)      0/2000
    Koran          0/2000                     1949/2000 (97.45%)     0/2000
    Subtitles      451/5000 (9.02%)           478/2000 (23.9%)       0/2000
    Law            649/2000 (32.45%)          966/2000 (48.3%)       0/2000
    IT             945/1856 (50.92%)          1036/2000 (51.8%)      0/2000

A.3 GMM Clustering

We learn GMMs with full covariance matrices, i.e. without constraints on the covariance matrices that determine the shape of each component in the mixture, as implemented in scikit-learn (Pedregosa et al., 2011). We train the models until convergence or for a maximum of 150 EM iterations.
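In scikit-learn terms, the configuration described above corresponds to something like the following (a sketch; n_components and the seed are illustrative, with the seed varied across the five runs whose mean and variance are reported in Section 2):

from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=5, covariance_type="full", max_iter=150, random_state=0)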

A.4 Language Model Finetuning

We fine-tune the binary classification head for 5 epochs. We use the ADAM (Kingma and Ba, 2014) optimizer with an initial learning rate of 2·10⁻⁵. We train the model using 4 NVIDIA GPUs with 256 sentences per batch (64 per GPU).
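A minimal sketch of this fine-tuning setup with the HuggingFace Transformers toolkit is given below; it mirrors the settings stated above (DistilBERT backbone, Adam, learning rate 2e-5) but omits the data loading, multi-GPU handling and epoch loop, which are assumptions of this illustration rather than the actual training script:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)

def training_step(texts, labels):
    # labels: 1 for in-domain (positive) sentences, 0 for sampled general-domain negatives.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    output = model(**batch, labels=torch.tensor(labels))
    output.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return output.loss.item()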

A.5 Moore-Lewis Implementation

We used the implementation of Moore and Lewis (2010) by Pamela Shapiro, as available in: https://github.com/pamelashapiro/moore-lewis. This implementation uses the KenLM N-Gram language model toolkit (Heafield, 2011).

A.6 Additional Visualizations

Figure 5 shows visualizations of the multi-domain dataset from additional pre-trained masked language models (BERT large and RoBERTa), and Figure 6 shows the same visualization for autoregressive models (XLNet and GPT2).

Figure 5: 2D visualizations of the unsupervised GMM-based clustering for different pretrained MLMs (bert-large-cased, roberta-large).

Figure 6: 2D visualizations of the unsupervised GMM-based clustering for different pretrained auto-regressive LMs (xlnet-base-cased, gpt2).

Chapter 8

Conclusions

In this chapter I summarize the contributions described in this thesis, and present open issues and the way forward to addressing them.

8.1 Linguistically Inspired Neural Architectures

In Chapter 3, I presented neural architectures for sequence-to-sequence learning that were motivated by the monotonic relation between the characters in a word and its inflections. These "Hard-Attention" models were adopted by the community as a strong tool for morphological inflection generation (Gorman et al., 2019), lemmatization (Şahin and Gurevych, 2019), surface realization (Puzikov et al., 2019), and machine translation (Press and Smith, 2018), among others. They were also extended to non-monotonic scenarios (Wu et al., 2018a) and to exact inference (Wu and Cotterell, 2019). Our models were improved further, resulting in the winning submissions for the 2017 CoNLL shared task on morphological reinflection (Makarov et al., 2017; Cotterell et al., 2017) and the 2018 CoNLL shared task on Universal Morphological Reinflection (Makarov and Clematide, 2018; Cotterell et al., 2018). The success and proliferation of our linguistically motivated neural models for sequence-to-sequence learning encourages future research in this direction.

Possible avenues for future research may include applying hard attention to self-attention based models (Vaswani et al., 2017), which are the architectures that drive the wave of recent state-of-the-art pre-trained language models (Devlin et al., 2018).

8.2 Injecting Linguistic Knowledge in Neural Models using Syntactic Linearization

In Chapter 4 I suggested to incorporate linguistic information in neural sequence-to-sequence models for machine translation by linearizing constituency trees. This approach enables injecting syntactic information into such models without changing the underlying architecture, which is very convenient given the large and growing number of new methods and implementations for sequence-to-sequence learning. This work and others initiated a line of work on representing syntactic trees, and trees in general, using neural sequence-to-sequence models. Some examples include dependency-based NMT (Wu et al., 2018b), syntactically supervised transformers for faster NMT (Akoury et al., 2019), and forest-based NMT (Ma et al., 2018). Regarding tasks other than MT, examples include translating between different semantic formalisms (Stanovsky and Dagan, 2018), code generation (Alon et al., 2019) and response generation (Du and Black, 2019). While using syntactic information in machine translation and other NLP applications is still an active line of work, state-of-the-art systems in MT do not utilize such supervision (Barrault et al., 2019). Having said that, using syntax may have other appealing properties like controlling the output of text generation systems using linearized trees (Iyyer et al., 2018), generating diverse translations (Yang et al., 2019) and making translation faster (Akoury et al., 2019), which makes this an exciting direction for future work.

8.3 Understanding the Weaknesses of Neural Text Generation Models

In Chapter 5 I propose better modeling and evaluation for the Split and Rephrase task (Narayan et al., 2017). The proposed improvements stemmed from an analysis of how the neural attention-based sequence-to-sequence models generate the output: I noticed that the attention mechanism consistently focused on a single entity while generating multiple output sentences, which raised suspicions regarding how the model learned to generate. We indeed showed that the model memorized entity-sentence pairs by presenting the model with adversarial examples, which raises an important concern when using such models, especially with synthetically generated data. A subsequent work by Botha et al. (2018) created a larger, more natural dataset for the task. While using our proposed modeling recipes, they greatly improved the performance on the task, showing that the original dataset was too synthetic to enable proper modeling. To avoid such modeling deficiencies, it is important to be aware of how the dataset was created, to make sure we are modeling what we intend to model (Geva et al., 2019). These issues also call for better evaluation metrics for text generation or simplification tasks (Sulem et al., 2018), and experimental settings targeting rare words and other cases that require generalization (Shimorina and Gardent, 2018).

8.4 The Benefits of Massively Multilingual Modeling

In Chapter 6, I investigated scaling neural machine translation models to massively multilingual scenarios, involving up to 103 languages and 95.8 million parallel sentences. I showed that training such models is highly effective for improving the performance on low-resource language pairs, resulting in state-of-the-art results on the publicly available TED talks dataset. I then conducted large-scale experiments pointing at the trade-off between the degradation in supervised translation quality, due to the bottleneck caused by scaling to numerous languages, and the improved generalization abilities in zero-shot translation as we increase the number of languages.

While this work was the first to scale NMT models to such settings, many subsequent works now train massively multilingual language models that enable cross-lingual transfer learning, for better NLP in under-resourced languages (Devlin et al., 2019; Conneau et al., 2019; Siddhant et al., 2019). Other subsequent works investigate scaling such models even further in terms of the number of parameters and training examples (Arivazhagan et al., 2019) and analyze the emerging language families within their learned representations with respect to linguistic theories on the subject (Kudugunta et al., 2019). Improving such models and making them available to the public is highly important and has global impact, as most current NLP research focuses mainly on English (Bender, 2019).
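For readers unfamiliar with the setup, the following minimal sketch shows how data from many language pairs can be mixed into a single training stream by prepending an artificial target-language token, following Johnson et al. (2017), on which Chapter 6 builds. The token format and field names are illustrative assumptions rather than the exact preprocessing used in the thesis.

```python
# A minimal sketch of feeding a single many-to-many NMT model with data from
# many language pairs, following the target-language-token approach of Johnson
# et al. (2017). Token format and field names are illustrative assumptions.
def add_target_token(source_sentence, target_language):
    """Prepend an artificial token telling the model which language to produce."""
    return f"<2{target_language}> {source_sentence}"


corpus = [
    ("en", "de", "thank you", "danke"),
    ("en", "fr", "thank you", "merci"),
    ("de", "fr", "danke", "merci"),
]

# All pairs are mixed into one training stream for a single shared model.
training_examples = [
    (add_target_token(src_sent, tgt_lang), tgt_sent)
    for src_lang, tgt_lang, src_sent, tgt_sent in corpus
]
for source, target in training_examples:
    print(source, "->", target)
# <2de> thank you -> danke
# <2fr> thank you -> merci
# <2fr> danke -> merci
```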

8.5 Domain Data Selection with Massive Language Models

In Chapter 7, I showed that massive pretrained language models are effective at clustering textual data into domains in a fully unsupervised manner. I then showed how to harness this property for training domain-specific machine translation models by performing domain data selection. The proposed approach is not specific to machine translation, and can be used for any NLP task that would benefit from in-domain data.

This work is the first to exploit large pretrained language models for domain data selection; I would like to expand this to additional tasks and multilingual scenarios, as using proper in-domain data is crucial for building high-quality NLP systems in the real world. While this work was still under review at the time of writing, I expect the ideas it includes will aid future work on domain adaptation, especially in the current era where pretrained language models are one of the most common tools in the NLP practitioner's toolbox.
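The following minimal sketch illustrates the clustering idea: embed each sentence with a pretrained language model, fit a Gaussian mixture over the embeddings, and treat each mixture component as a domain for data selection. The random vectors below stand in for real LM sentence representations, and scikit-learn is an assumed dependency rather than the exact toolkit used in Chapter 7.

```python
# A minimal sketch of unsupervised domain clustering in the spirit of Chapter 7:
# embed each sentence with a pretrained LM, fit a Gaussian mixture over the
# embeddings, and treat each mixture component as a "domain". The random vectors
# merely stand in for real LM sentence representations; scikit-learn is an
# assumed dependency rather than the exact toolkit used in the thesis.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Pretend embeddings: two loose groups of 768-dimensional sentence vectors.
embeddings = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(50, 768)),   # e.g. "medical" sentences
    rng.normal(loc=3.0, scale=1.0, size=(50, 768)),   # e.g. "legal" sentences
])

gmm = GaussianMixture(n_components=2, covariance_type="diag", random_state=0)
domain_ids = gmm.fit_predict(embeddings)

# For data selection, keep the sentences whose cluster matches a small in-domain
# seed set (here, the first five sentences play the role of the seed).
seed_domain = np.bincount(domain_ids[:5]).argmax()
selected = np.flatnonzero(domain_ids == seed_domain)
print(f"selected {len(selected)} sentences assigned to domain {seed_domain}")
```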

8.6 Going Forward

While we have witnessed great progress in sequence-to-sequence learning for NLP over the last few years, there are still several key subjects which I find important and which should be studied further.

8.6.1 Modeling Uncertainty

While neural sequence-to-sequence learning is very successful as a general tool for NLP tasks, this approach is hard to interpret when compared to the previously dominant phrase-based statistical machine translation methods (Koehn, 2016). In the previous methods, one could inspect the learned phrase tables (which are essentially probabilistic dictionaries mapping words and phrases in the source language to words and phrases in the target) in order to reason about the different choices the model takes during translation. However, NMT models are composed of many millions of parameters, making it much harder to understand their inner workings when generating an output for a given input.

This issue is critical for improving these models further, as it is currently hard to understand or predict where and why current state-of-the-art models break. Common evaluation methods today rely on automatic metrics (e.g. BLEU (Papineni et al., 2002) or METEOR (Denkowski and Lavie, 2011)) over small human-generated test sets which are hard and expensive to create, and may not represent the distribution of the system inputs when deployed in production settings.

It would be useful to propose methods for quantifying uncertainty in NMT models and to investigate whether we can find uncertainty measures which are based solely on the model itself and unlabeled data, enabling cheap, large-scale evaluation. I believe that there is enough “attack surface” to utilize for that purpose, e.g. the model parameters (and specifically the word embedding matrices), the intermediate representations the model computes, the search space the model explores during decoding, and the local output distributions the model computes at each time step.
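As one concrete example of such a model-internal signal, the following minimal sketch computes the entropy of the local output distribution at each decoding step; consistently high entropy may indicate that the model is unsure which token to emit. The toy probability vectors stand in for a real NMT decoder's softmax outputs, and this is an illustration rather than a proposed method.

```python
# A minimal sketch of one model-internal uncertainty signal mentioned above:
# the entropy of the local output distribution at each decoding step. High
# average entropy suggests the model is unsure which token to emit. The
# probabilities below are toy values standing in for a real NMT decoder's
# softmax outputs.
import numpy as np


def token_entropies(step_distributions):
    """Per-step entropy (in nats) of a list of output probability vectors."""
    entropies = []
    for probs in step_distributions:
        probs = np.asarray(probs, dtype=float)
        probs = probs / probs.sum()                 # guard against rounding
        entropies.append(-np.sum(probs * np.log(probs + 1e-12)))
    return np.array(entropies)


confident_step = [0.97, 0.01, 0.01, 0.01]           # peaked: low entropy
uncertain_step = [0.25, 0.25, 0.25, 0.25]           # flat: maximal entropy
scores = token_entropies([confident_step, uncertain_step])
print("per-step entropy:", scores.round(3), "mean:", scores.mean().round(3))
```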

Successfully finding such uncertainty measures will have a great impact on many downstream applications. First of all, it will enable detecting and fixing flaws in current models by getting a better understanding of what they learn, leading to improved NMT systems. It will also enable filtering of noisy corpora by finding problematic examples that cause uncertainty in the model. Another application is active learning for NMT, enabling more rapid parallel corpus creation by pointing to the most relevant examples to be professionally translated.

8.6.2 Finding the Right Data for the Task

Natural language varies greatly across topics, styles, levels of formality, genres and many other linguistic nuances (van der Wees et al., 2015; van der Wees, 2017; Niu et al., 2017). This overwhelming diversity of language makes it hard to find the right data for the task, as it is nearly impossible to precisely define the exact requirements for such data with respect to all the aforementioned aspects. On top of that, domain labels are usually unavailable. Related to Section 8.6.1, a model that was never trained on data from a given domain, or one that has never seen certain linguistic phenomena, cannot be expected to succeed in such scenarios, and should have a high level of uncertainty in its predictions. While this observation may affect any NLP model and application, not much work is dedicated to understanding the data distribution in NLP.

Understanding how modern NLP models map data from different domains to latent neural representations can help us build more robust models, and understand why and when such models will fail given new unobserved data. While the work we proposed in Chapter 7 describes one step in this direction, much more can be done on the subject, like expanding it to multilingual settings, exploring under-represented data with active learning, and proposing uncertainty measures based on the learned representations of the training data. Understanding the data landscape better can also improve unsupervised cross-lingual alignment methods, which were shown to be beneficial for unsupervised neural machine translation (Conneau et al., 2019).

8.6.3 Distillation, Quantization and Retrieval for Practical Large-Scale Neural NLP

One particular reason behind the success of neural models for NLP (and specifically neural sequence-to-sequence learning) is the ability to scale such models to many millions, or even billions, of learned parameters. These large models facilitate training with very large datasets, while alleviating the parameter bottleneck (Aharoni et al., 2019). While such large models are very appealing with respect to their performance on various leader-boards, this scaling comes at the cost of increased energy consumption and latency due to the large computational costs (Schwartz et al., 2019). While larger and larger models are still being proposed (Raffel et al., 2019; Huang et al., 2019), many recent works suggest different methods to make such models smaller and faster with relatively small losses in performance, if any (Hinton et al., 2015; Kim and Rush, 2016; Freitag et al., 2017; Tan et al., 2019; Lan et al., 2019).

I strongly believe that research work aiming to explore the model-size vs. performance Pareto frontier, as proposed in recent shared tasks on NMT efficiency (Birch et al., 2018; Hayashi et al., 2019), is very important for making progress that is both sustainable and practical for real-world use cases. In addition to distillation (Sanh et al., 2019) and quantization (Zafrir et al., 2019) efforts, future improvements may also stem from non-parametric methods (Gu et al., 2018b; Lee et al., 2019; Khandelwal et al., 2019) which relax the parameter bottleneck by adding a retrieval stage for similar examples when processing new inputs.
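As an illustration of the distillation idea, the following minimal sketch implements a word-level knowledge distillation loss in the spirit of Hinton et al. (2015) and Kim and Rush (2016): a small student is trained to match the temperature-softened output distribution of a large teacher in addition to the usual cross-entropy on gold labels. The tensor shapes, temperature and mixing weight are illustrative assumptions, not the exact recipe of any cited system.

```python
# A minimal sketch of word-level knowledge distillation: a small student matches
# the temperature-softened output distribution of a large teacher, in addition
# to the usual cross-entropy on gold labels. Shapes, temperature and the mixing
# weight are illustrative assumptions.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, gold_ids,
                      temperature=2.0, alpha=0.5):
    """Mix soft-target KL (scaled by T^2) with hard-target cross-entropy."""
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean")
    kd = kd * temperature ** 2                      # keep gradient scale stable
    ce = F.cross_entropy(student_logits, gold_ids)  # usual supervised loss
    return alpha * kd + (1.0 - alpha) * ce


# Toy batch: 4 target positions over a 10-word vocabulary.
student = torch.randn(4, 10, requires_grad=True)
teacher = torch.randn(4, 10)
gold = torch.tensor([1, 3, 3, 7])
loss = distillation_loss(student, teacher, gold)
loss.backward()
print(float(loss))
```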

Chapter 9

Bibliography

Roee Aharoni and Yoav Goldberg. Morphological inflection generation with hard monotonic attention. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2004– 2015, Vancouver, Canada, July 2017a. Association for Computational Lin- guistics. doi: 10.18653/v1/P17-1183. URL https://www.aclweb.org/ anthology/P17-1183.

Roee Aharoni and Yoav Goldberg. Towards string-to-tree neural machine trans- lation. In Proceedings of the 55th Annual Meeting of the Association for Computa- tional Linguistics (Volume 2: Short Papers), pages 132–140, Vancouver, Canada, July 2017b. Association for Computational Linguistics. doi: 10.18653/v1/ P17-2021. URL https://www.aclweb.org/anthology/P17-2021.

Roee Aharoni and Yoav Goldberg. Split and rephrase: Better evaluation and stronger baselines. In Proceedings of the 56th Annual Meeting of the As- sociation for Computational Linguistics (Volume 2: Short Papers), pages 719– 724, Melbourne, Australia, July 2018. Association for Computational Lin- guistics. doi: 10.18653/v1/P18-2114. URL https://www.aclweb.org/ anthology/P18-2114.

Roee Aharoni and Yoav Goldberg. Emerging domain clusters in pretrained language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Seattle, Washington, 2020. Association for Computational Linguistics. URL https://arxiv.org/abs/2004.02105.

Roee Aharoni, Yoav Goldberg, and Yonatan Belinkov. Improving sequence to sequence learning for morphological inflection generation: The BIU-MIT systems for the SIGMORPHON 2016 shared task for morphological reinflection. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 41–48, Berlin, Germany, August 2016. Association for Computational Linguistics. doi: 10.18653/v1/W16-2007. URL https://www.aclweb.org/anthology/W16-2007.

Roee Aharoni, Melvin Johnson, and Orhan Firat. Massively multilingual neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3874–3884, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1388. URL https://www.aclweb.org/anthology/N19-1388.

Malin Ahlberg, Markus Forsberg, and Mans Hulden. Paradigm classification in supervised learning of morphology. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1024–1029, Denver, Colorado, May–June 2015. Association for Computational Linguistics. doi: 10.3115/v1/N15-1107. URL https://www.aclweb.org/anthology/N15-1107.

Nader Akoury, Kalpesh Krishna, and Mohit Iyyer. Syntactically supervised transformers for faster neural machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1269–1281, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1122. URL https://www.aclweb.org/anthology/P19-1122.

Uri Alon, Roy Sadaka, Omer Levy, and Eran Yahav. Structural language models for any-code generation. arXiv preprint arXiv:1910.00577, 2019.

Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen, Yuan Cao, George Foster, Colin Cherry, et al. Massively multilingual neural machine translation in the wild: Findings and challenges. arXiv preprint arXiv:1907.05019, 2019.

Amittai Axelrod, Xiaodong He, and Jianfeng Gao. Domain adaptation via pseudo in-domain data selection. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 355–362, Edinburgh, Scotland, UK, July 2011. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/D11-1033.

R. Harald Baayen, Richard Piepenbrock, and H. van Rijn. The CELEX lexical database on CD-ROM. Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA, 1993.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In Proceedings of the In- ternational Conference on Learning Representations (ICLR), 2015.

Yehoshua Bar-Hillel. The present state of research on mechanical translation. Journal of the Association for Information Science and Technology, 2(4):229–237, 1951.

Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1–61, Florence, Italy, August 2019. Association for Computational Linguistics. doi: 10.18653/v1/W19-5301. URL https://www.aclweb.org/anthology/W19-5301.

Joost Bastings, Ivan Titov, Wilker Aziz, Diego Marcheggiani, and Khalil Sima’an. Graph convolutional encoders for syntax-aware neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natu- ral Language Processing, pages 1957–1967, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. doi: 10.18653/v1/D17-1209. URL https://www.aclweb.org/anthology/D17-1209.

Emily Bender. The #benderRule: On naming the languages we study and why it matters, 2019. URL https://thegradient.pub/the-benderrule-on-naming-the-languages-we-study-and-why-it-matters/.

Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term depen- dencies with gradient descent is difficult. IEEE transactions on neural networks, 5(2):157–166, 1994.

Alexandra Birch, Andrew Finch, Minh-Thang Luong, Graham Neubig, and Yusuke Oda. Findings of the second workshop on neural machine transla- tion and generation. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 1–10, Melbourne, Australia, July 2018. As- sociation for Computational Linguistics. doi: 10.18653/v1/W18-2701. URL https://www.aclweb.org/anthology/W18-2701.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Shujian Huang, Matthias Huck, Philipp Koehn, Qun Liu, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Raphael Rubino, Lucia Specia, and Marco Turchi. Findings of the 2017 conference on machine translation (WMT17). In Proceedings of the Second Conference on Machine Translation, pages 169–214, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. doi: 10.18653/v1/W17-4717. URL https://www.aclweb.org/anthology/W17-4717.

Ondřej Bojar, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Philipp Koehn, and Christof Monz. Findings of the 2018 conference on machine translation (WMT18). In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 272–303, Belgium, Brussels, October 2018. Association for Computational Linguistics. doi: 10.18653/v1/W18-6401. URL https://www.aclweb.org/anthology/W18-6401.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Aurelie Neveol, Mariana Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Specia, Marco Turchi, Karin Verspoor, and Marcos Zampieri. Findings of the 2016 conference on machine translation. In Proceedings of the First Conference on Machine Translation, pages 131–198, Berlin, Germany, August 2016. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/W16-2301.

Jan A. Botha, Manaal Faruqui, John Alex, Jason Baldridge, and Dipanjan Das. Learning to split and rephrase from Wikipedia edit history. In Proceed- ings of the 2018 Conference on Empirical Methods in Natural Language Process- ing, pages 732–737, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1080. URL https: //www.aclweb.org/anthology/D18-1080.

Mauro Cettolo, Marcello Federico, Luisa Bentivogli, Jan Niehues, Sebastian Stüker, Katsuhito Sudoh, Koichiro Yoshino, and Christian Federmann. Overview of the IWSLT 2017 evaluation campaign. In International Workshop on Spoken Language Translation, pages 2–14, 2017.

Victor Chahuneau, Eva Schlinger, Noah A. Smith, and Chris Dyer. Translating into morphologically rich languages with synthetic phrases. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1677–1687, Seattle, Washington, USA, October 2013. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/D13-1174.

R. Chandrasekar, Christine Doran, and B. Srinivas. Motivations and methods for text simplification. In COLING 1996 Volume 2: The 16th International Con- ference on Computational Linguistics, 1996. URL https://www.aclweb.org/ anthology/C96-2183.

David Chiang. A hierarchical phrase-based model for statistical machine trans- lation. In Proceedings of the 43rd Annual Meeting of the Association for Computa- tional Linguistics (ACL’05), pages 263–270, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics. doi: 10.3115/1219840.1219873. URL https://www.aclweb.org/anthology/P05-1033.

David Chiang. Hierarchical phrase-based translation. Computational Linguistics, 33(2):201–228, 2007. doi: 10.1162/coli.2007.33.2.201. URL https://www. aclweb.org/anthology/J07-2003.

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.

Do Kook Choe and Eugene Charniak. Parsing as language modeling. In Pro- ceedings of the 2016 Conference on Empirical Methods in Natural Language Process- ing, pages 2331–2336, Austin, Texas, November 2016. Association for Com- putational Linguistics. doi: 10.18653/v1/D16-1257. URL https://www. aclweb.org/anthology/D16-1257.

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Em- pirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.

Ann Clifton and Anoop Sarkar. Combining morpheme-based machine trans- lation with post-processing morpheme prediction. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Lan- guage Technologies, pages 32–42, Portland, Oregon, USA, June 2011. Asso- ciation for Computational Linguistics. URL https://www.aclweb.org/ anthology/P11-1004.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116, 2019.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, David Yarowsky, Jason Eisner, and Mans Hulden. The SIGMORPHON 2016 shared task—Morphological reinflection. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 10–22, Berlin, Germany, August 2016. Association for Computational Linguistics. doi: 10.18653/v1/W16-2002. URL https://www.aclweb.org/anthology/W16-2002.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sandra Kübler, David Yarowsky, Jason Eisner, and Mans Hulden. CoNLL-SIGMORPHON 2017 shared task: Universal morphological reinflection in 52 languages. In Proceedings of the CoNLL SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection, pages 1–30, Vancouver, August 2017. Association for Computational Linguistics. doi: 10.18653/v1/K17-2001. URL https://www.aclweb.org/anthology/K17-2001.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Arya D. McCarthy, Katharina Kann, Sebastian Mielke, Garrett Nicolai, Miikka Silfverberg, David Yarowsky, Jason Eisner, and Mans Hulden. The CoNLL–SIGMORPHON 2018 shared task: Universal morphological reinflection. In Proceedings of the CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection, pages 1–27, Brussels, October 2018. Association for Computational Linguistics. doi: 10.18653/v1/K18-3001. URL https://www.aclweb.org/anthology/K18-3001.

Jan De Belder and Marie-Francine Moens. Text simplification for children. In Proceedings of the SIGIR workshop on accessible search systems. ACM, 2010.

Michael Denkowski and Alon Lavie. Meteor 1.3: Automatic metric for reliable optimization and evaluation of machine translation systems. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 85–91, Edinburgh, Scotland, July 2011. Association for Computational Linguistics. URL http: //www.aclweb.org/anthology/W11-2107.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre- training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805v1, 2018.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://www.aclweb.org/anthology/N19-1423.

Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and Haifeng Wang. Multi-task learning for multiple language translation. In Proceedings of the 53rd An- nual Meeting of the Association for Computational Linguistics and the 7th Inter- national Joint Conference on Natural Language Processing (Volume 1: Long Pa- pers), pages 1723–1732, Beijing, China, July 2015. Association for Computa- tional Linguistics. doi: 10.3115/v1/P15-1166. URL https://www.aclweb. org/anthology/P15-1166.

Markus Dreyer and Jason Eisner. Discovering morphological paradigms from plain text using a Dirichlet process mixture model. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 616–627, Edinburgh, Scotland, UK., July 2011. Association for Computational Linguis- tics. URL https://www.aclweb.org/anthology/D11-1057.

Markus Dreyer, Jason Smith, and Jason Eisner. Latent-variable modeling of string transductions with finite-state methods. In Proceedings of the 2008 Con- ference on Empirical Methods in Natural Language Processing, pages 1080–1089, Honolulu, Hawaii, October 2008. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/D08-1113.

Wenchao Du and Alan W Black. Top-down structurally-constrained neural response generation with lexicalized probabilistic context-free grammar. In Proceedings of the 2019 Conference of the North American Chapter of the Associa- tion for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3762–3771, Minneapolis, Minnesota, June 2019. As- sociation for Computational Linguistics. doi: 10.18653/v1/N19-1377. URL https://www.aclweb.org/anthology/N19-1377.

Kevin Duh, Graham Neubig, Katsuhito Sudoh, and Hajime Tsukada. Adapta- tion data selection using neural language models: Experiments in machine translation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 678–683, Sofia, Bul- garia, August 2013. Association for Computational Linguistics. URL https: //www.aclweb.org/anthology/P13-2119.

Greg Durrett and John DeNero. Supervised learning of complete morphological paradigms. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1185–1195, Atlanta, Georgia, June 2013. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/N13-1138.

Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. Re- current neural network grammars. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 199–209, San Diego, California, June 2016. As- sociation for Computational Linguistics. doi: 10.18653/v1/N16-1024. URL https://www.aclweb.org/anthology/N16-1024.

Jason Eisner. Parameter estimation for probabilistic finite-state transducers. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 1–8, Philadelphia, Pennsylvania, USA, July 2002. Associ- ation for Computational Linguistics. doi: 10.3115/1073083.1073085. URL https://www.aclweb.org/anthology/P02-1001.

Jeffrey L Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990.

Akiko Eriguchi, Kazuma Hashimoto, and Yoshimasa Tsuruoka. Character- based decoding in tree-to-sequence attention-based neural machine transla- tion. In Proceedings of the 3rd Workshop on Asian Translation (WAT2016), pages 175–183, Osaka, Japan, December 2016a. The COLING 2016 Organizing Com- mittee. URL https://www.aclweb.org/anthology/W16-4617.

Akiko Eriguchi, Kazuma Hashimoto, and Yoshimasa Tsuruoka. Tree-to-sequence attentional neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 823–833, Berlin, Germany, August 2016b. Association for Computational Linguistics. doi: 10.18653/v1/P16-1078. URL https://www.aclweb.org/anthology/P16-1078.

Akiko Eriguchi, Yoshimasa Tsuruoka, and Kyunghyun Cho. Learning to parse and translate improves neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 72–78, Vancouver, Canada, July 2017. Association for Com- putational Linguistics. doi: 10.18653/v1/P17-2012. URL https://www. aclweb.org/anthology/P17-2012.

Manaal Faruqui, Yulia Tsvetkov, Graham Neubig, and Chris Dyer. Morphological inflection generation using character sequence to sequence learning. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 634–643, San Diego, California, June 2016. Association for Computational Linguistics. doi: 10.18653/v1/N16-1077. URL https://www.aclweb.org/anthology/N16-1077.

Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. Multi-way, multilingual neural machine translation with a shared attention mechanism. In Proceed- ings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 866–875, San Diego, California, June 2016a. Association for Computational Linguistics. doi: 10.18653/v1/N16-1101. URL https://www.aclweb.org/anthology/ N16-1101.

Orhan Firat, Baskaran Sankaran, Yaser Al-onaizan, Fatos T. Yarman Vural, and Kyunghyun Cho. Zero-resource translation with multi-lingual neural ma- chine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 268–277, Austin, Texas, November 2016b. Association for Computational Linguistics. doi: 10.18653/v1/D16-1026. URL https://www.aclweb.org/anthology/D16-1026.

Alexander Fraser, Marion Weller, Aoife Cahill, and Fabienne Cap. Modeling inflection and word-formation in SMT. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 664– 674, Avignon, France, April 2012. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/E12-1068.

Markus Freitag, Yaser Al-Onaizan, and Baskaran Sankaran. Ensemble distilla- tion for neural machine translation. arXiv preprint arXiv:1702.01802, 2017.

Michel Galley, Mark Hopkins, Kevin Knight, and Daniel Marcu. What’s in a translation rule? In Proceedings of the Human Language Technology Confer- ence of the North American Chapter of the Association for Computational Linguis- tics: HLT-NAACL 2004, pages 273–280, Boston, Massachusetts, USA, May 2 - May 7 2004. Association for Computational Linguistics. URL https: //www.aclweb.org/anthology/N04-1035.

Michel Galley, Jonathan Graehl, Kevin Knight, Daniel Marcu, Steve DeNeefe, Wei Wang, and Ignacio Thayer. Scalable inference and training of context-rich syntactic translation models. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 961–968, Sydney, Australia, July 2006. Association for Computational Linguistics. doi: 10.3115/1220175.1220296. URL https://www.aclweb.org/anthology/P06-1121.

Mercedes García-Martínez, Loïc Barrault, and Fethi Bougares. Factored neural machine translation. arXiv preprint arXiv:1609.04621, 2016.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. In Doina Pre- cup and Yee Whye Teh, editors, Proceedings of the 34th International Confer- ence on Machine Learning, volume 70 of Proceedings of Machine Learning Re- search, pages 1243–1252, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/ gehring17a.html.

Mor Geva, Yoav Goldberg, and Jonathan Berant. Are we modeling the task or the annotator? an investigation of annotator bias in natural language under- standing datasets. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1161–1166, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/ v1/D19-1107. URL https://www.aclweb.org/anthology/D19-1107.

Kyle Gorman, Arya D. McCarthy, Ryan Cotterell, Ekaterina Vylomova, Mi- ikka Silfverberg, and Magdalena Markowska. Weird inflects but OK: Mak- ing sense of morphological generation errors. In Proceedings of the 23rd Con- ference on Computational Natural Language Learning (CoNLL), pages 140–151, Hong Kong, China, November 2019. Association for Computational Lin- guistics. doi: 10.18653/v1/K19-1014. URL https://www.aclweb.org/ anthology/K19-1014.

Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. Incorporating copy- ing mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1631–1640, Berlin, Germany, August 2016. Association for Com- putational Linguistics. doi: 10.18653/v1/P16-1154. URL https://www. aclweb.org/anthology/P16-1154.

Jiatao Gu, Hany Hassan, Jacob Devlin, and Victor O.K. Li. Universal neural machine translation for extremely low resource languages. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 344–354, New Orleans, Louisiana, June 2018a. Association for Computational Linguistics. doi: 10.18653/v1/N18-1032. URL https://www.aclweb.org/anthology/N18-1032.

Jiatao Gu, Yong Wang, Kyunghyun Cho, and Victor OK Li. Search engine guided neural machine translation. In Thirty-Second AAAI Conference on Arti- ficial Intelligence, 2018b.

Thanh-Le Ha, Jan Niehues, and Alexander Waibel. Toward multilingual neu- ral machine translation with universal encoder and decoder. arXiv preprint arXiv:1611.04798, 2016.

Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, et al. Achieving human parity on automatic chinese to english news translation. arXiv preprint arXiv:1803.05567, 2018.

Hiroaki Hayashi, Yusuke Oda, Alexandra Birch, Ioannis Konstas, Andrew Finch, Minh-Thang Luong, Graham Neubig, and Katsuhito Sudoh. Findings of the third workshop on neural generation and translation. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pages 1–14, Hong Kong, November 2019. Association for Computational Linguistics. doi: 10.18653/ v1/D19-5601. URL https://www.aclweb.org/anthology/D19-5601.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzzi- ness and Knowledge-Based Systems, 6(02):107–116, 1998.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.

Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. In Ad- vances in Neural Information Processing Systems, pages 103–112, 2019.

Mans Hulden, Markus Forsberg, and Malin Ahlberg. Semi-supervised learn- ing of morphological paradigms and lexicons. In Proceedings of the 14th Con- ference of the European Chapter of the Association for Computational Linguistics, pages 569–578, Gothenburg, Sweden, April 2014. Association for Computa- tional Linguistics. doi: 10.3115/v1/E14-1060. URL https://www.aclweb. org/anthology/E14-1060.

Kentaro Inui, Atsushi Fujita, Tetsuro Takahashi, Ryu Iida, and Tomoya Iwakura. Text simplification for reading assistance: A project note. In Proceedings of the Second International Workshop on Paraphrasing, pages 9–16, Sapporo, Japan, July 2003. Association for Computational Linguistics. doi: 10.3115/1118984.1118986. URL https://www.aclweb.org/anthology/W03-1602.

Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettlemoyer. Adversar- ial example generation with syntactically controlled paraphrase networks. In Proceedings of the 2018 Conference of the North American Chapter of the As- sociation for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1875–1885, New Orleans, Louisiana, June 2018. As- sociation for Computational Linguistics. doi: 10.18653/v1/N18-1170. URL https://www.aclweb.org/anthology/N18-1170.

Tomáš Jelínek. Improvements to dependency parsing using automatic simplification of data. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, May 2014. European Language Resources Association (ELRA). ISBN 978-2-9517408-8-4.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351, 2017. doi: 10.1162/tacl_a_00065. URL https://www.aclweb.org/anthology/Q17-1024.

Nal Kalchbrenner and Phil Blunsom. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1700–1709, Seattle, Washington, USA, October 2013. Asso- ciation for Computational Linguistics. URL http://www.aclweb.org/ anthology/D13-1176.

Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099, 2016.

Katharina Kann and Hinrich Schütze. MED: The LMU system for the SIGMORPHON 2016 shared task on morphological reinflection. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 62–70, Berlin, Germany, August 2016a. Association for Computational Linguistics. doi: 10.18653/v1/W16-2010. URL https://www.aclweb.org/anthology/W16-2010.

Katharina Kann and Hinrich Schütze. Single-model encoder-decoder with explicit morphological representation for reinflection. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 555–560, Berlin, Germany, August 2016b. Association for Computational Linguistics. doi: 10.18653/v1/P16-2090. URL https://www.aclweb.org/anthology/P16-2090.

Ronald M. Kaplan and Martin Kay. Regular models of phonological rule sys- tems. Computational Linguistics, 20(3):331–378, 1994. URL https://www. aclweb.org/anthology/J94-3001.

Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. Generalization through memorization: Nearest neighbor language models, 2019.

Yoon Kim and Alexander M. Rush. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Lan- guage Processing, pages 1317–1327, Austin, Texas, November 2016. Associ- ation for Computational Linguistics. doi: 10.18653/v1/D16-1139. URL https://www.aclweb.org/anthology/D16-1139.

Philipp Koehn. Statistical Machine Translation. Cambridge University Press, New York, NY, USA, 1st edition, 2010. ISBN 0521874157, 9780521874151.

Philipp Koehn. Computer aided translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, Berlin, Germany, August 2016. Association for Computational Linguistics.

Philipp Koehn and Rebecca Knowles. Six challenges for neural machine trans- lation. In Proceedings of the First Workshop on Neural Machine Translation, pages 28–39, Vancouver, August 2017. Association for Computational Lin- guistics. doi: 10.18653/v1/W17-3204. URL https://www.aclweb.org/ anthology/W17-3204.

Kimmo Koskenniemi. A general computational model for word-form recognition and production. In Proceedings of the 4th Nordic Conference of Computational Linguistics (NODALIDA 1983), pages 145–154, Uppsala, Sweden, May 1984. Centrum för datorlingvistik, Uppsala University, Sweden. URL https://www.aclweb.org/anthology/W83-0114.

Sneha Kudugunta, Ankur Bapna, Isaac Caswell, and Orhan Firat. Investigating multilingual NMT representations at scale. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1565–1575, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1167. URL https://www.aclweb.org/anthology/D19-1167.

Surafel Melaku Lakew, Mauro Cettolo, and Marcello Federico. A comparison of transformer and recurrent neural networks on multilingual neural machine translation. In Proceedings of the 27th International Conference on Computational Linguistics, pages 641–652, Santa Fe, New Mexico, USA, August 2018. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/C18-1054.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942, 2019.

Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. Latent retrieval for weakly supervised open domain question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6086–6096, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1612. URL https://www.aclweb.org/anthology/P19-1612.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.

Minh-Thang Luong, Quoc V Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. Multi-task sequence to sequence learning. arXiv preprint arXiv:1511.06114, 2015a.

Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal, September 2015b. Association for Computational Linguistics. URL http://aclweb.org/anthology/D15-1166.

Chunpeng Ma, Akihiro Tamura, Masao Utiyama, Tiejun Zhao, and Eiichiro Sumita. Forest-based neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1253–1263, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-1116. URL https://www.aclweb.org/anthology/P18-1116.

Peter Makarov and Simon Clematide. UZH at CoNLL–SIGMORPHON 2018 shared task on universal morphological reinflection. In Proceedings of the CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection, pages 69–75, Brussels, October 2018. Association for Computational Lin- guistics. doi: 10.18653/v1/K18-3008. URL https://www.aclweb.org/ anthology/K18-3008.

Peter Makarov, Tatiana Ruzsics, and Simon Clematide. Align and copy: UZH at SIGMORPHON 2017 shared task for morphological reinflection. In Pro- ceedings of the CoNLL SIGMORPHON 2017 Shared Task: Universal Morpholog- ical Reinflection, pages 49–57, Vancouver, August 2017. Association for Com- putational Linguistics. doi: 10.18653/v1/K17-2004. URL https://www. aclweb.org/anthology/K17-2004.

Ryan McDonald and Joakim Nivre. Analyzing and integrating dependency parsers. Computational Linguistics, 37(1):197–230, 2011. doi: 10.1162/coli_a_00039. URL https://www.aclweb.org/anthology/J11-1007.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In Proceedings of the International Conference on Learn- ing Representations (ICLR), 2017.

Einat Minkov, Kristina Toutanova, and Hisami Suzuki. Generating complex morphology for machine translation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 128–135, Prague, Czech Republic, June 2007. Association for Computational Linguistics. URL https: //www.aclweb.org/anthology/P07-1017.

Mehryar Mohri, Fernando Pereira, and Michael Riley. A rational design for a weighted finite-state transducer library. In International Workshop on Imple- menting Automata, pages 144–158, 1997.

Robert C. Moore and William Lewis. Intelligent selection of language model training data. In Proceedings of the ACL 2010 Conference Short Papers, pages 220– 224, Uppsala, Sweden, July 2010. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/P10-2041.

Maria Nădejde, Siva Reddy, Rico Sennrich, Tomasz Dwojak, Marcin Junczys-Dowmunt, Philipp Koehn, and Alexandra Birch. Predicting target language CCG supertags improves neural machine translation. In Proceedings of the Second Conference on Machine Translation, pages 68–79, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. doi: 10.18653/v1/W17-4707. URL https://www.aclweb.org/anthology/W17-4707.

Shashi Narayan, Claire Gardent, Shay B. Cohen, and Anastasia Shimorina. Split and rephrase. In Proceedings of the 2017 Conference on Empirical Methods in Nat- ural Language Processing, pages 606–616, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. doi: 10.18653/v1/D17-1064. URL https://www.aclweb.org/anthology/D17-1064.

Graham Neubig and Junjie Hu. Rapid adaptation of neural machine translation to new languages. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 875–880, Brussels, Belgium, October- November 2018. Association for Computational Linguistics. doi: 10.18653/ v1/D18-1103. URL https://www.aclweb.org/anthology/D18-1103.

Toan Q. Nguyen and David Chiang. Transfer learning across low-resource, related languages for neural machine translation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 296–301, Taipei, Taiwan, November 2017. Asian Feder- ation of Natural Language Processing. URL https://www.aclweb.org/ anthology/I17-2050.

Garrett Nicolai, Colin Cherry, and Grzegorz Kondrak. Inflection generation as discriminative string transduction. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 922–931, Denver, Colorado, May–June 2015. As- sociation for Computational Linguistics. doi: 10.3115/v1/N15-1093. URL https://www.aclweb.org/anthology/N15-1093.

Xing Niu, Marianna Martindale, and Marine Carpuat. A study of style in ma- chine translation: Controlling the formality of machine translation output. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2814–2819, Copenhagen, Denmark, September 2017. Asso- ciation for Computational Linguistics. doi: 10.18653/v1/D17-1299. URL https://www.aclweb.org/anthology/D17-1299.

Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. Scaling neural machine translation. In Proceedings of the Third Conference on Machine Trans- lation: Research Papers, pages 1–9, Belgium, Brussels, October 2018. Asso- ciation for Computational Linguistics. doi: 10.18653/v1/W18-6301. URL https://www.aclweb.org/anthology/W18-6301.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, July 2002. Association for Computational Linguistics. doi: 10.3115/1073083.1073135. URL http://www.aclweb.org/anthology/P02-1040.

Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of train- ing recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318, 2013.

Jean Pouget-Abadie, Dzmitry Bahdanau, Bart van Merriënboer, Kyunghyun Cho, and Yoshua Bengio. Overcoming the curse of sentence length for neural machine translation using automatic segmentation. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 78–85, Doha, Qatar, October 2014. Association for Computational Linguistics. doi: 10.3115/v1/W14-4009. URL https://www.aclweb.org/anthology/W14-4009.

Ofir Press and Noah A Smith. You may not need attention. arXiv preprint arXiv:1810.13409, 2018.

Yevgeniy Puzikov, Claire Gardent, Ido Dagan, and Iryna Gurevych. Revisiting the binary linearization technique for surface realization. In Proceedings of the 12th International Conference on Natural Language Generation, pages 268–278, Tokyo, Japan, October–November 2019. Association for Computational Lin- guistics. doi: 10.18653/v1/W19-8635. URL https://www.aclweb.org/ anthology/W19-8635.

Ye Qi, Devendra Sachan, Matthieu Felix, Sarguna Padmanabhan, and Graham Neubig. When and why are pre-trained word embeddings useful for neural machine translation? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Tech- nologies, Volume 2 (Short Papers), pages 529–535, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-2084. URL https://www.aclweb.org/anthology/N18-2084.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer, 2019.

Pushpendre Rastogi, Ryan Cotterell, and Jason Eisner. Weighting finite-state transductions with neural context. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 623–633, San Diego, California, June 2016. Association for Computational Linguistics. doi: 10.18653/v1/N16-1076. URL https://www.aclweb.org/anthology/N16-1076.

Devendra Sachan and Graham Neubig. Parameter sharing methods for multi- lingual self-attentional translation models. In Proceedings of the Third Confer- ence on Machine Translation: Research Papers, pages 261–271, Belgium, Brussels, October 2018. Association for Computational Linguistics. doi: 10.18653/v1/ W18-6327. URL https://www.aclweb.org/anthology/W18-6327.

Gözde Gül Şahin and Iryna Gurevych. Two birds with one stone: Investigating invertible neural networks for inverse problems in morphology. arXiv preprint arXiv:1912.05274, 2019.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.

Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681, 1997.

Roy Schwartz, Jesse Dodge, Noah A Smith, and Oren Etzioni. Green ai. arXiv preprint arXiv:1907.10597, 2019.

Abigail See, Peter J. Liu, and Christopher D. Manning. Get to the point: Sum- marization with pointer-generator networks. In Proceedings of the 55th An- nual Meeting of the Association for Computational Linguistics (Volume 1: Long Pa- pers), pages 1073–1083, Vancouver, Canada, July 2017. Association for Com- putational Linguistics. doi: 10.18653/v1/P17-1099. URL https://www. aclweb.org/anthology/P17-1099.

Rico Sennrich and Barry Haddow. Linguistic input features improve neural ma- chine translation. In Proceedings of the First Conference on Machine Translation: Volume 1, Research Papers, pages 83–91, Berlin, Germany, August 2016. As- sociation for Computational Linguistics. doi: 10.18653/v1/W16-2209. URL https://www.aclweb.org/anthology/W16-2209.

Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Läubli, Antonio Valerio Miceli Barone, Jozef Mokry, and Maria Nădejde. Nematus: a toolkit for neural machine translation. In Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pages 65–68, Valencia, Spain, April 2017. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/E17-3017.

Xing Shi, Inkit Padhi, and Kevin Knight. Does string-based neural MT learn source syntax? In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1526–1534, Austin, Texas, November 2016. Association for Computational Linguistics. doi: 10.18653/v1/D16-1159. URL https://www.aclweb.org/anthology/D16-1159.

Anastasia Shimorina and Claire Gardent. Handling rare items in data-to- text generation. In Proceedings of the 11th International Conference on Natu- ral Language Generation, pages 360–370, Tilburg University, The Netherlands, November 2018. Association for Computational Linguistics. doi: 10.18653/ v1/W18-6543. URL https://www.aclweb.org/anthology/W18-6543.

Aditya Siddhant, Melvin Johnson, Henry Tsai, Naveen Arivazhagan, Jason Riesa, Ankur Bapna, Orhan Firat, and Karthik Raman. Evaluating the cross- lingual effectiveness of massively multilingual neural machine translation. arXiv preprint arXiv:1909.00437, 2019.

Catarina Cruz Silva, Chao-Hong Liu, Alberto Poncelas, and Andy Way. Ex- tracting in-domain training corpora for neural machine translation using data selection methods. In Proceedings of the Third Conference on Machine Trans- lation: Research Papers, pages 224–231, Belgium, Brussels, October 2018. As- sociation for Computational Linguistics. doi: 10.18653/v1/W18-6323. URL https://www.aclweb.org/anthology/W18-6323.

Felix Stahlberg, Eva Hasler, Aurelien Waite, and Bill Byrne. Syntactically guided neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 299–305, Berlin, Germany, August 2016. Association for Computational Lin- guistics. doi: 10.18653/v1/P16-2049. URL https://www.aclweb.org/ anthology/P16-2049.

Gabriel Stanovsky and Ido Dagan. Semantics as a foreign language. In Pro- ceedings of the 2018 Conference on Empirical Methods in Natural Language Pro- cessing, pages 2412–2421, Brussels, Belgium, October-November 2018. Asso- ciation for Computational Linguistics. doi: 10.18653/v1/D18-1263. URL https://www.aclweb.org/anthology/D18-1263.

Elior Sulem, Omri Abend, and Ari Rappoport. BLEU is not suitable for the evaluation of text simplification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 738–744, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1081. URL https://www.aclweb.org/anthology/D18-1081.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3104–3112. Curran Associates, Inc., 2014a. URL http://papers.nips.cc/paper/ 5346-sequence-to-sequence-learning-with-neural-networks. pdf.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014b.

Kai Sheng Tai, Richard Socher, and Christopher D. Manning. Improved seman- tic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Pro- cessing (Volume 1: Long Papers), pages 1556–1566, Beijing, China, July 2015. Association for Computational Linguistics. doi: 10.3115/v1/P15-1150. URL https://www.aclweb.org/anthology/P15-1150.

Xu Tan, Yi Ren, Di He, Tao Qin, Zhou Zhao, and Tie-Yan Liu. Multilin- gual neural machine translation with knowledge distillation. arXiv preprint arXiv:1902.10461, 2019.

Masaru Tomita. Efficient parsing for natural language—a fast algorithm for practical systems. int. series in engineering and computer science, 1986.

Kristina Toutanova, Hisami Suzuki, and Achim Ruopp. Applying morphol- ogy generation models to machine translation. In Proceedings of ACL-08: HLT, pages 514–522, Columbus, Ohio, June 2008. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/P08-1059.

Marlies van der Wees. What’s in a Domain? Towards Fine-Grained Adap- tation for Machine Translation. PhD thesis, University of Amsterdam, 2017. URL https://staff.fnwi.uva.nl/m.derijke/wp-content/ papercite-data/pdf/van-der-wees-phd-thesis-2017.pdf.

Marlies van der Wees, Arianna Bisazza, Wouter Weerkamp, and Christof Monz. What's in a domain? analyzing genre and topic differences in statistical machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 560–566, Beijing, China, July 2015. Association for Computational Linguistics. doi: 10.3115/v1/P15-2092. URL https://www.aclweb.org/anthology/P15-2092.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010, 2017.

Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geof- frey Hinton. Grammar as a foreign language. In Advances in Neural Information Processing Systems, pages 2773–2781, 2015.

Liang Wang, Sujian Li, Wei Zhao, Kewei Shen, Meng Sun, Ruoyu Jia, and Jingming Liu. Multi-perspective context aggregation for semi-supervised cloze-style reading comprehension. In Proceedings of the 27th International Conference on Computational Linguistics, pages 857–867, Santa Fe, New Mexico, USA, August 2018. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/C18-1073.

Willian Massami Watanabe, Arnaldo Candido Junior, Vinícius Rodriguez Uzêda, Renata Pontin de Mattos Fortes, Thiago Alexandre Salgueiro Pardo, and Sandra Maria Aluísio. Facilita: reading assistance for low-literacy readers. In Proceedings of the 27th ACM International Conference on Design of Communication. ACM, 2009.

Warren Weaver. Translation. Machine translation of languages, 14:15–23, 1955.

P. Williams, M. Gertz, and M. Post. Syntax-Based Statistical Machine Translation. Morgan & Claypool Publishers, 2016. ISBN 9781627059008.

Shijie Wu and Ryan Cotterell. Exact hard monotonic attention for character-level transduction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1530–1537, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1148. URL https://www.aclweb.org/anthology/P19-1148.

Shijie Wu, Pamela Shapiro, and Ryan Cotterell. Hard non-monotonic attention for character-level transduction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4425–4438, Brussels, Belgium, October-November 2018a. Association for Computational Linguistics. doi: 10.18653/v1/D18-1473. URL https://www.aclweb.org/anthology/D18-1473.

Shuangzhi Wu, Dongdong Zhang, Zhirui Zhang, Nan Yang, Mu Li, and Ming Zhou. Dependency-to-dependency neural machine translation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26:2132–2141, 2018b.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.

Kenji Yamada and Kevin Knight. A syntax-based statistical translation model. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pages 523–530, Toulouse, France, July 2001. Association for Computational Linguistics. doi: 10.3115/1073012.1073079. URL https://www.aclweb.org/anthology/P01-1067.

Kenji Yamada and Kevin Knight. A decoder for syntax-based statistical MT. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 303–310, Philadelphia, Pennsylvania, USA, July 2002. Association for Computational Linguistics. doi: 10.3115/1073083.1073134. URL https://www.aclweb.org/anthology/P02-1039.

Xuewen Yang, Yingru Liu, Dongliang Xie, Xin Wang, and Niranjan Balasubramanian. Latent part-of-speech sequences for neural machine translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 780–790, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1072. URL https://www.aclweb.org/anthology/D19-1072.

David Yarowsky and Richard Wicentowski. Minimally supervised morphological analysis by multimodal alignment. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, pages 207–216, Hong Kong, October 2000. Association for Computational Linguistics. doi: 10.3115/1075218.1075245. URL https://www.aclweb.org/anthology/P00-1027.

Ofir Zafrir, Guy Boudoukh, Peter Izsak, and Moshe Wasserblat. Q8BERT: Quantized 8bit BERT. arXiv preprint arXiv:1910.06188, 2019.

Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. Transfer learning for low-resource neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1568–1575, Austin, Texas, November 2016. Association for Computational Linguistics. doi: 10.18653/v1/D16-1163. URL https://www.aclweb.org/anthology/D16-1163.

Abstract

Building computer applications that can successfully process human language has long been a coveted goal in artificial intelligence, spanning many tasks and applications. A prominent example is machine translation between languages, a task that has occupied the minds of scientists for many years (Bar-Hillel, 1951; Weaver, 1955). While this task serves as a scientific benchmark for the progress of artificial intelligence in general, machine translation and other natural language processing (NLP) tasks are already in daily use by millions of people around the globe, enabling better communication and easier access to the world's information.

Machine translation and many other NLP tasks can be framed as sequence-to-sequence learning problems: problems whose input and output are both sequences of symbols (characters, words, and so on). From a machine-learning perspective, such problems require predicting a structured output given a structured input, which calls for rich representations of the data and dedicated inference methods that account for the many dependencies between the elements of the input and output sequences.

The great success of machine-learning methods based on neural networks (also known as "deep learning") has brought significant progress on sequence-to-sequence problems as well. In particular, these methods learn representations end-to-end in an implicit manner, without the manual feature engineering that was common in earlier approaches. In addition, they remove the restriction to a bounded context when encoding each element, allowing better modeling of dependencies between distant elements in sequence-to-sequence tasks.

To get the most out of neural methods for NLP, many research questions arise: How should we design neural models while incorporating known insights about language? How should implicit neural representations be used in multilingual settings? What are the limitations of these methods? What can we learn about linguistic information from the representations learned by neural networks?

In this work, I seek answers to these and related questions concerning sequence-to-sequence learning in natural language processing. The work addresses natural language processing at different levels: morphology, the study of words, how they are formed, and their relationship to other words in the same language; syntax, the set of rules, principles, and processes that govern the structure of sentences in a language; semantics, the study of meaning in language, usually at the sentence level; and finally pragmatics, which concerns linguistic context beyond the sentence level.

In Chapter 3, "Morphological Inflection Generation with Hard Monotonic Attention", I propose new neural sequence-to-sequence architectures that explicitly express a monotonic alignment between the elements of the input and the elements of the output. The proposed models are inspired by the monotonic character-level alignment between the different morphological inflections of a given word. I evaluate the models on several morphological inflection generation benchmarks in various languages, where they outperform previously proposed models. The approach has become a well-known baseline in the literature and is still considered one of the strongest methods for morphological inflection generation.
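As an illustration of the idea, the following is a minimal sketch of the hard monotonic control mechanism only, not the trained model itself: the decoder emits a sequence of actions, each of which either writes an output character or advances a monotonic pointer over the input. The concrete action sequence, the <step> token name, and the toy example are hypothetical.

```python
# A minimal sketch of the control mechanism behind hard monotonic attention:
# each action either writes an output character or advances a monotonic
# pointer over the input string. The action sequence below is hand-written
# for illustration; in the model described in Chapter 3 it is predicted by a
# neural decoder conditioned on the input character under the pointer.

STEP = "<step>"

def apply_actions(source, actions):
    """Apply write/step actions to an input string and return the output."""
    pointer = 0                      # hard attention position over `source`
    output = []
    for action in actions:
        if action == STEP:
            # advance the pointer monotonically; it never moves backwards
            pointer = min(pointer + 1, len(source) - 1)
        else:
            output.append(action)    # write an output character
    return "".join(output)

if __name__ == "__main__":
    # toy (hypothetical) inflection example: "flog" -> "fliege"
    actions = ["f", STEP, "l", STEP, "i", "e", STEP, "g", STEP, "e"]
    print(apply_actions("flog", actions))   # prints: fliege
```

Because the pointer can only move forward, the alignment between input and output characters is monotonic by construction, which matches the structure of most inflection patterns.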

In Chapter 4, "Towards String-to-Tree Neural Machine Translation", I propose a method for incorporating linguistic information into neural machine translation models. In this method, inspired by models for syntactic parsing, the target sentence is represented as a constituency parse tree, encoded as a sequence that contains the words of the sentence together with brackets expressing the tree structure. I show that incorporating such linguistic knowledge improves translation quality as measured by BLEU and by human judges, across several languages and especially when little training data is available. This work is among the first attempts to integrate linguistic information into end-to-end trained neural models, attempts that sparked a line of work that continues to this day. Follow-up studies examined incorporating additional types of linguistic information, as well as other ways of injecting linguistic knowledge.
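To make the encoding concrete, here is a minimal sketch of one possible linearization; the nested (label, children) tuple format and the labeled bracket tokens are illustrative assumptions rather than the exact notation used in the chapter.

```python
# A minimal sketch of linearizing a constituency tree into a single token
# sequence of words and labeled brackets, so that an ordinary
# sequence-to-sequence model can generate the sentence and its parse jointly.

def linearize(node):
    """Flatten a (label, children) constituency tree into a token list."""
    label, children = node
    if isinstance(children, str):        # pre-terminal: children is the word itself
        return ["(" + label, children, ")" + label]
    tokens = ["(" + label]               # opening bracket annotated with the label
    for child in children:
        tokens.extend(linearize(child))
    tokens.append(")" + label)           # closing bracket closes the labeled span
    return tokens

if __name__ == "__main__":
    tree = ("S", [("NP", [("PRP", "she")]),
                  ("VP", [("VBZ", "sings"), ("ADVP", [("RB", "well")])])])
    print(" ".join(linearize(tree)))
    # (S (NP (PRP she )PRP )NP (VP (VBZ sings )VBZ (ADVP (RB well )RB )ADVP )VP )S
```

Because the tree is just another token sequence, no change to the model architecture is needed; the decoder simply learns to emit well-formed brackets alongside the translated words.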

In Chapter 5, "Split and Rephrase: Better Evaluation and a Stronger Baseline", I examine the ability of neural models to perform a text simplification task in which the input is a long, complex sentence and the output is several shorter sentences that express the same meaning. I show that although neural models obtain better results than previously reported on this task, they are prone to overfitting and memorize solutions instead of generalizing. I then propose a new split of the data according to the meaning the sentences describe, which enables a better assessment of generalization and exposes the weaknesses of neural models on such tasks. This work was later extended to a larger and more realistic dataset by other researchers, who also adopted the methods we proposed.

In Chapter 6, "Massively Multilingual Neural Machine Translation", I examine scaling neural machine translation to massively multilingual settings, involving up to 103 languages and over 95 million training sentence pairs in a single neural model. I show that training multilingual models at this scale improves translation quality for languages with little training data, and even surpasses previously published results on automatic translation of TED talks. I then conduct large-scale experiments that expose the trade-off between the degradation in translation quality for language pairs that have training data and the improvement for language pairs that have none, as the number of languages involved grows. While this work was the first to run experiments at such a scale of multilinguality, many later works likewise propose multilingual models that enable cross-lingual transfer learning, benefiting natural language processing for low-resource languages.

In Chapter 7, "Emerging Domain Clusters in Pretrained Language Models", I examine sentence representations learned by various large-scale language models with respect to the domains from which the sentences were drawn. I show that with these representations, sentences can be clustered by domain with high accuracy, in an unsupervised manner that requires no information about the domain each sentence came from. I then propose ways to use the emerging domain clusters to select training data for neural machine translation models in a specific domain, and show that translation models trained this way outperform strong baselines that use the data available across all domains together, as well as baselines that rely on established data-selection methods. This work is the first to use large-scale pretrained language models for domain clustering or for in-domain data selection, and in effect it offers a new, pragmatic, data-driven way to define what domains are in textual data.
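As an illustration of the clustering step, the following is a minimal sketch under stated assumptions: the sentence-transformers library and the all-MiniLM-L6-v2 encoder are stand-ins for the pretrained language models studied in the thesis, with a Gaussian mixture fit over the resulting sentence embeddings. It is not the thesis's exact experimental pipeline.

```python
# A minimal sketch of unsupervised domain clustering over sentence
# representations from a pretrained encoder. The embedding model name and the
# sentence-transformers library are illustrative assumptions standing in for
# the pretrained language models examined in Chapter 7.

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.mixture import GaussianMixture

def cluster_by_domain(sentences, num_domains, model_name="all-MiniLM-L6-v2"):
    """Embed sentences with a pretrained encoder and cluster them into domains."""
    encoder = SentenceTransformer(model_name)
    embeddings = np.asarray(encoder.encode(sentences))        # (n, d) matrix
    gmm = GaussianMixture(n_components=num_domains, random_state=0)
    labels = gmm.fit_predict(embeddings)                      # unsupervised domain labels
    return labels, gmm

if __name__ == "__main__":
    corpus = [
        "The patient was given 20mg of the drug twice daily.",      # medical-style
        "Dosage should be reduced in cases of renal impairment.",
        "The defendant appealed the ruling to the supreme court.",  # legal-style
        "The contract is void if either party breaches clause 4.",
    ]
    labels, _ = cluster_by_domain(corpus, num_domains=2)
    print(labels)   # e.g. [0 0 1 1]: medical and legal sentences grouped separately
```

For data selection, one can then keep only the training sentences whose cluster matches the cluster assigned (via gmm.predict) to a handful of in-domain examples.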

As these chapters show, sequence-to-sequence learning with neural networks is a successful approach for tackling a variety of problems across the different levels of natural language processing, from core morphological tasks to machine translation between many languages. While the approach is highly useful, there is still much room for improvement and follow-up research. In the last part of this work, I conclude with a list of directions for future research that I find especially important.

Table of Contents

Abstract ...... i

1 Introduction ...... 1

1.1 Morphological Inflection Generation ...... 1
1.2 String-to-Tree Neural Machine Translation ...... 4
1.3 Semantic Text Simplification by Splitting and Rephrasing ...... 7
1.4 Massively Multilingual Neural Machine Translation ...... 9
1.5 Emerging Domain Clusters in Pretrained Language Models ...... 12
1.6 Thesis Structure ...... 14

2 Background ...... 15

2.1 Neural Networks ...... 15
2.1.1 Feed-Forward Networks ...... 15
2.1.2 Recurrent Networks ...... 16
2.2 Neural Machine Translation ...... 17
2.2.1 Task Definition ...... 17
2.2.2 Dense Word Representations ...... 18
2.2.3 The Encoder ...... 18
2.2.4 The Decoder ...... 19
2.2.5 The Attention Mechanism ...... 21
2.2.6 The Training Objective ...... 22
2.2.7 Inference and Search ...... 22

3 Morphological Inflection Generation with Hard Monotonic Attention ...... 25

4 Towards String-to-Tree Neural Machine Translation ...... 46

5 Split and Rephrase: Better Evaluation and a Stronger Baseline ...... 56

6 Massively Multilingual Neural Machine Translation ...... 63

7 Emerging Domain Clusters in Pretrained Language Models ...... 75

8 Summary and Conclusions ...... 92

8.1 Linguistically Inspired Neural Architectures ...... 92
8.2 Injecting Linguistic Knowledge into Neural Models ...... 93
8.3 Understanding the Weaknesses of Neural Text Generation Models ...... 94
8.4 Massively Multilingual Modeling at Large Scale ...... 94
8.5 Finding the Right Data for the Task with Massive Language Models ...... 95
8.6 Looking Forward ...... 96
8.6.1 Modeling Uncertainty ...... 96
8.6.2 The Right Data for the Task ...... 97
8.6.3 Distillation, Quantization, and Retrieval for Practical Large-Scale Neural NLP ...... 98

Bibliography ...... 99

Hebrew Abstract ...... א

Acknowledgements

To the one and only Yoav Goldberg, my advisor on the journey that concludes with this work. Thank you for all the patience, the inspiration, the endless knowledge, the creativity, and, in general, your superpowers. Thanks to you this journey was a pleasure, and the past several years were among the most enjoyable and rewarding of my life. Beyond natural language processing, I learned so much from you about how to do science properly, how to ask the right questions, and how to communicate complex ideas clearly and to the point. I could not have asked for a better advisor, and I deeply value everything I learned from you. Thank you!

To all the members of the natural language processing lab at Bar-Ilan, and especially Ido Dagan, Eliyahu Kiperwasser, Vered Schwartz, Gabriel Stanovsky, Yanai Elazar, and Amit Moryessef: thank you for being wonderful colleagues and for creating a supportive and welcoming environment that made research an amazing experience. I learned so much from you, and I am sure we will collaborate again in the future.

To Orhan Firat, Melvin Johnson, and the rest of the Google Translate team: thank you for giving me the opportunity to work with you in one of the most influential natural language processing teams in the world. Summer 2018 was unforgettable, something of a dream come true, and it undoubtedly shaped my professional future in a significant way.

Last but not least, I would like to thank the teachers who helped me learn and deepen my practice of Yoga. The practice contributed greatly to keeping me strong and flexible, physically but above all mentally, throughout this journey. Namaste.

To my parents, Shoshana (Shoshi) and Yehoshua (Shuki). Thank you for all the endless love and support, and for raising me to always be curious.

To Sapir, my best friend and my partner in life. Thank you for the love and support despite the busy weekends, the long trips, and the sleepless nights before deadlines. This work is dedicated to you.

This work was carried out under the supervision of Prof. Yoav Goldberg, Department of Computer Science, Bar-Ilan University.

Topics in Sequence-to-Sequence Learning for Natural Language Processing

A thesis submitted for the degree "Doctor of Philosophy"

By: Roee Aharoni

Submitted to the Senate of Bar-Ilan University

Ramat Gan, Iyar 5780 (May 2020)