Overcoming Limitations of Categorical Language Modeling
Shiran Dudy
Advisor: Steven Bedrick

A thesis presented for the degree of Doctor of Philosophy
Center for Spoken Language Understanding
Oregon Health & Science University
November 2020

“Education means teaching a child to be curious, to wonder, to reflect, to enquire. The child who asks becomes a partner in the learning process, an active recipient. To ask is to grow.”
Jonathan Sacks

Acknowledgements

There are many people I would like to thank who accompanied me throughout my journey. First, I am very grateful to have had Steven Bedrick as my advisor. I learned a great deal from him: from the basics of how to ask a research question, to considering how and in what ways, in the grand scheme of things, our work adds to the general knowledge of our community. He also taught me to attend to details more carefully and to rigorously examine my steps and outcomes in a methodical fashion. He always removed any roadblocks and provided me with whatever assistance or advice my work required. Steven was always there when I asked (and I asked a lot). I am most appreciative of how he let me discover myself, and of his trust and support as I followed my passion. He was everything I could ask for in a mentor.

I would also like to thank Melanie Fried-Oken, who accepted me into her group and exposed me to the world of assistive technology. Her relentless dedication to developing means of finding the voices of people who have lost their basic ability to communicate was inspiring to me. She taught me what it takes to run an interdisciplinary group. Most importantly, she supported me throughout, and I am very fortunate for that.

I also want to thank Peter Heeman, who made it possible for me to graduate on time. Throughout the last year he did everything he could to ensure I was provided with the resources and the knowledge to graduate and make a smooth transition onward. I do not take this for granted, and I am grateful to him for helping me take my first steps following graduation.

I want to thank Pat as well; throughout the program she smoothly took care of every administrative issue. She was the first person I saw when I arrived here, and she was always very friendly and welcoming. I am also grateful to Brian Roark, from whom I learned and with whom I consulted whenever I got stuck. He always had the time and patience to listen and to offer good advice (which he always had). Finally, I would like to thank David Smith, who from time to time helped us brainstorm directions and ideas I had in mind, and who encouraged me to continue asking.

I want to thank my mother, who was (and is) there for me through it all, and who initially thought that pursuing a PhD so far from home was a crazy idea; nonetheless, she supported me from afar throughout. Ronen, my love, without whom I could not imagine going through the last stretch; I am very fortunate to have him in my life. My sister, who taught me that I can shape my reality with my own hands. To my dad, who supported me and is very proud now. To my adoptive mothers here, Dvora M., Karen, and Dvora T., and especially to Dvora M., who was there for me whenever I needed a fresh perspective on life. To Naomi, Dudi, Yael, Ori, and Cleiton, who also became my close family.

I want to thank my committee members Brian Roark, Peter Heeman, Meysam Asgari, and Xubo Song for providing feedback on this work and helping me strengthen my argument.
Abstract

Neural language models typically employ a categorical approach to prediction and training, which leads to several well-known computational and numerical limitations. These limitations are particularly evident in applied settings where language models serve as a means of communication. From speller systems employed as assistive technology to texting applications on smartphones, such systems revolve around category-based prediction. Research shows that neural-categorical approaches to language modeling are unreliable at predicting the low-frequency words that are essential for user personalization. It is also challenging to adapt these architectures to a changing vocabulary, because the initially learned vocabulary constrains which categories (i.e., words) a user can type. Recently, such categorical models were also shown to be relatively complex, with long inference times that may be detrimental to user engagement. In this thesis, I reevaluate neural-categorical approaches and propose an alternative: continuous output prediction.

Continuous output prediction is an underexplored approach to language modeling that performs prediction directly against a continuous word-embedding space. It splits inference into two steps: a vector prediction followed by vector decoding (mapping the predicted vector to a category). Predicting a vector in an embedding space opens the door to a theoretically unlimited number of categories that can be represented and decoded with this technique. I show how, given a trained model, adapting a continuous model to a new vocabulary requires minimal architectural modification compared to categorical alternatives. I also explore another important trait of continuous output prediction models: they reach low-frequency vocabulary words that categorical models often ignore. I discuss the computational costs of continuous output prediction and show its promising results, especially in multiple-user settings and in settings where short inference times are required. Finally, to evaluate the diversity of predicted categories, including low-frequency words, I propose a simple metric based on the unique types predicted.
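To make the two-step inference described above concrete, the following is a minimal sketch, not the thesis's actual implementation: a model is assumed to emit a context vector, and decoding is performed as a nearest-neighbor search over the embedding table under cosine similarity (one common choice for this step). All names here (decode_vector, unique_type_count, E, vocab) are illustrative, and the toy embeddings stand in for trained ones.

```python
import numpy as np

def decode_vector(v, embeddings, vocab):
    """Second step of continuous output prediction: map a predicted
    context vector to the word whose embedding is nearest under
    cosine similarity."""
    # Normalize once up front in practice; done inline here for clarity.
    emb_norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    v_norm = v / np.linalg.norm(v)
    scores = emb_norm @ v_norm          # cosine similarity to every word
    return vocab[int(np.argmax(scores))]

def unique_type_count(predictions):
    """One plausible reading of the proposed diversity metric: the
    number of unique word types among a model's predictions."""
    return len(set(predictions))

# Illustrative usage with a toy 5-word vocabulary and 4-d embeddings.
rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]
E = rng.normal(size=(len(vocab), 4))      # stands in for trained embeddings
v_pred = E[2] + 0.1 * rng.normal(size=4)  # stands in for the model's output
print(decode_vector(v_pred, E, vocab))    # likely "sat"
print(unique_type_count(["the", "cat", "the"]))  # 2
```

Note how, in this sketch, vocabulary growth amounts to appending rows to the embedding table: decoding automatically covers the new words, with no resizing or retraining of a softmax output layer, which is the adaptation property the abstract highlights.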
Contents

1 Introduction
  1.1 Problem Statement
  1.2 Thesis Contributions
    1.2.1 Retrieval-based language model
    1.2.2 Prediction diversity evaluation metric
    1.2.3 Adaptation in retrieval-based approaches
  1.3 Organization of the Thesis

2 Preliminaries and Background
  2.1 On the Roles of Language
  2.2 Augmentative and Alternative Communication (AAC)
    2.2.1 BCI systems
    2.2.2 Icons
  2.3 Language Models
    2.3.1 Language models’ application
    2.3.2 Statistical language models
    2.3.3 Evaluation metrics
    2.3.4 Neural network language models
    2.3.5 Neural models compared to count-based approaches
  2.4 Word-Embedding Spaces
    2.4.1 Static embeddings
    2.4.2 Contextualized embeddings
    2.4.3 Hot representation
  2.5 Limitations of Neural-Categorical-Based Prediction
    2.5.1 Complexity limitations
    2.5.2 Decoding limitations
    2.5.3 Architectural limitations
    2.5.4 Evaluation limitations

3 Towards Continuous-Output Prediction of Language Models
  3.1 Introduction
    3.1.1 Motivation for using a continuous approach
  3.2 Related Work
    3.2.1 Predictive language models
    3.2.2 Adversarial language model training
    3.2.3 Rare words
  3.3 Methods
    3.3.1 Datasets
    3.3.2 Models
    3.3.3 Embeddings
    3.3.4 Decoding
    3.3.5 Process
    3.3.6 Metrics
    3.3.7 Baselines
  3.4 Results
    3.4.1 High-level analysis
    3.4.2 Proposing an adversarial continuous output model (GAN)
    3.4.3 GAN’s model performance
    3.4.4 Long-tail analysis
    3.4.5 An improved categorical model: The unlikelihood loss function
    3.4.6 An improved categorical model: Employing subword unit tokenization
    3.4.7 Overall performance across experiments
    3.4.8 Computational costs
  3.5 Future Directions
  3.6 Conclusion

4 Incremental Domain Adaptation in Language Models
  4.1 Introduction
  4.2 Related Work
    4.2.1 Domain adaptation in NLP
    4.2.2 Continual learning
  4.3 Continual Learning of Language Models
    4.3.1 Problem definition
    4.3.2 Problem formalization
  4.4 Methods
    4.4.1 Datasets
    4.4.2 Models
    4.4.3 Embeddings
    4.4.4 Decoding
    4.4.5 Experiments
    4.4.6 Metrics
  4.5 Results
    4.5.1 Performance following the second training