Blank Language Model: Flexible Sequence Modeling by Any-Order Generation

by

Victor Quach

B.S. in Engineering, École polytechnique (2016)
M.S. in Computer Science and Mathematics, École polytechnique (2017)

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Science in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology, May 2020.

© Massachusetts Institute of Technology 2020. All rights reserved.

Author: Victor Quach, Department of Electrical Engineering and Computer Science, May 15, 2020
Certified by: Regina Barzilay, Professor of Electrical Engineering and Computer Science, Thesis Supervisor
Accepted by: Leslie A. Kolodziejski, Professor of Electrical Engineering and Computer Science, Chair, Department Committee on Graduate Students

Abstract

We propose Blank Language Model (BLM), a model that generates sequences by dynamically creating and filling in blanks. Unlike previous masked language models [7] or the Insertion Transformer [26], BLM uses blanks to control which part of the sequence to expand. This fine-grained control of generation is ideal for a variety of text editing and rewriting tasks. The model can start from a single blank or partially completed text with blanks at specified locations. It iteratively determines which word to place in a blank and whether to insert new blanks, and stops generating when no blanks are left to fill.
BLM can be efficiently trained using a lower bound of the marginal data likelihood, and achieves perplexity comparable to traditional left-to-right language models on the Penn Treebank and WikiText datasets. On the task of filling missing text snippets, BLM significantly outperforms all other baselines in terms of both accuracy and fluency. Experiments on style transfer and damaged ancient text restoration demonstrate the potential of this framework for a wide range of applications.

Thesis Supervisor: Regina Barzilay
Title: Professor of Electrical Engineering and Computer Science

Acknowledgments

First, I want to express my deepest gratitude to my advisor, Regina Barzilay. Over the years, she has never failed to provide me with the academic guidance or life advice I needed. Regina has pushed me to think critically and creatively, and to push the boundaries of what I can achieve. I really appreciate her undying patience and support in this journey.

I would like to thank Tianxiao Shen and Tommi Jaakkola, co-authors of the work on which this thesis is based. I am grateful to have had the opportunity to collaborate with such brilliant minds.

I would also like to thank Yujia Bao, Adam Fisch, Benson Chen, Adam Yala, Tal Schuster, Yujie Qian, Jiang Guo, Darsh Shah, Jiaming Luo, Wengong Jin and all other members of the MIT NLP group. In these times of self-quarantine, I more than ever measure the value of our research discussions (planned or impromptu), which have always left me more curious and intellectually stimulated. I am thankful to be part of a group composed of such smart, talented, and generous people.

I want to thank Yuening Zhang, my girlfriend, dance partner, life partner, and friend. This work would not have been possible without her undiminished support.

Finally, I would like to thank my family, to whom I am deeply indebted. To my sister Hélène, who has been a lifelong confidant and friend.
To my parents, for their sacrificial love, constant encouragement, and support of my studies. I would not be where I am today without all of you. Thank you.

Bibliographic Note

This thesis is based on our previous work available as a preprint [25].

Contents

1 Introduction
2 Related Work
3 Blank Language Models
4 Experiments
    4.1 Language Modeling
    4.2 Text Infilling
    4.3 Ancient Text Restoration
    4.4 Sentiment Transfer
5 Conclusion

List of Figures

1-1 BLM fills in blanks of arbitrary length.
1-2 An example trajectory that generates the sentence “customer service is awesome”. Each action is a tuple (b, w, l, r), indicating the blank location b selected for expansion, the word w to fill in, whether to create a left blank l, and whether to create a right blank r.
3-1 Architecture of the Blank Language Model. In the first stage, an index is chosen among all current blank positions. For that location, a word is selected in the second stage. In the final stage, the blank representation is concatenated with the chosen word’s embedding and fed into a multilayer perceptron (MLP) to determine the creation of the following blanks.
4-1 Examples of inputs and outputs for the three rewriting tasks (text infilling, ancient text restoration and style transfer). We contrast text infilling, where blanks can cover an arbitrary number of words, with ancient text restoration, where the number of characters to recover is indicated by the number of ‘?’ symbols in the input.
4-2 Failure rate, BLEU score and perplexity of generated documents for the text infilling task. The “No infill” line reports the BLEU score of the blanked document. The “Data PPL” dotted line serves as reference for the perplexity of the original documents.
4-3 Example generations for the text infilling task, with mask ratios 0.1 and 0.5. Completions are in italic. Invalid completions are in red.
For the seq2seq-fill baseline, we represent the outputs of the model along with the merged document. In this example, the Insertion Transformer produces invalid completions by failing to generate tokens in the “? the” blank. At mask ratio 0.5, the seq2seq-fill baseline also generates an invalid document by producing too many ‘|’ tokens, i.e. filling too many blanks.
4-4 Example generations for the style transfer task using attention-based masking mechanism. Masked words are in bold.

List of Tables

4.1 Perplexity on the Penn Treebank and WikiText datasets.
4.2 Character error rate for the ancient text restoration task in both single-slot and multi-slot settings.
4.3 Accuracy and BLEU scores for the Yelp sentiment transfer task. Accuracy measures the percentage of sentences labeled as the target sentiment by the classifier. BLEU is evaluated against human reference generations. For reference, we also report accuracy and BLEU scores of the canvas (i.e. the original masked sentence).

Chapter 1

Introduction

Neural language models have been successfully applied to many sequence generation tasks, including machine translation [3], summarization [23], and image captioning [32]. Typically, sequences are modeled autoregressively from left to right, making the log-likelihood tractable and allowing efficient training and inference. While left-to-right models are effective, they are not well-suited for text completion or editing. In these tasks, we are given a partial draft of the text and the goal is to add new text to complete it.

Models such as Masked Language Model (MLM) [7] and Insertion Transformer [26] are able to fill in words to complete partially written text. However, neither of them is tailored to rewriting or editing. MLM assumes that the length of the text to be inserted is known in advance. Insertion Transformer, on the other hand, does not explicitly control where insertions can take place.
In this paper, we introduce Blank Language Model (BLM). The model exploits a special “__” symbol to control where tokens can be placed. In each stage of generation, a blank can be replaced by any word, potentially accompanied by a new blank on the left, right, or both sides of the word to continue writing. As shown in Fig. 1-1, such models can be used to fill in missing words in incomplete sentences, generate a new sentence in between two given sentences, and so on. BLM can start with a single blank or partial text with blanks in specified locations. The model iterates through generation steps, replacing blanks with words and possibly adjoining blanks, until no blanks remain.

    They also have __ which __ .
    They also have ice cream which is really good .

Figure 1-1: BLM fills in blanks of arbitrary length.

    Step  Canvas                        Location b  Word w    (Left l, Right r)
    0.    __#1                          #1          is        Y Y
    1.    __#1 is __#2                  #1          customer  N Y
    2.    customer __#1 is __#2         #2          awesome   N N
    3.    customer __#1 is awesome      #1          service   N N
    4.    customer service is awesome   -End-

Figure 1-2: An example trajectory that generates the sentence “customer service is awesome”. Each action is a tuple (b, w, l, r), indicating the blank location b selected for expansion, the word w to fill in, whether to create a left blank l, and whether to create a right blank r.

Our BLM is based on a Transformer encoder that maps the input text containing blanks into a sequence of vector representations. The representations at blank locations are further processed to select a blank, a word to fill it with, and whether to generate adjoining blanks. Since there are multiple trajectories through the actions in the BLM that all result in the same final text, we train the model by maximizing the marginal likelihood. To make training more efficient, and to introduce an inductive bias towards order independence, we instead maximize a lower bound on the marginal likelihood.
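The exact form of the lower bound is not given in this part of the text. One standard way to obtain such a bound is Jensen's inequality. The sketch below assumes that a sentence x of n words has exactly one trajectory per generation order σ, so that there are n! trajectories in total. The symbols σ, S_n (the set of all orders), and θ (the model parameters) are notation introduced here for illustration only.

```latex
\begin{align*}
\log p(x;\theta)
  &= \log \sum_{\sigma \in S_n} p(x,\sigma;\theta) \\
  &= \log(n!) + \log \Bigl( \frac{1}{n!} \sum_{\sigma \in S_n} p(x,\sigma;\theta) \Bigr) \\
  &\ge \log(n!) + \frac{1}{n!} \sum_{\sigma \in S_n} \log p(x,\sigma;\theta)
\end{align*}
```

Under this assumption, the right-hand side is an average over orders, so it can be estimated by sampling σ uniformly rather than summing over all n! trajectories. Because every order receives the same weight, a bound of this form would also favor models that assign similar probability to all orders.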
At test time, BLM can in principle fill in any amount of text in any of the given blank positions.
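The canvas-and-action mechanics described above can be illustrated with a small sketch. It is not the thesis implementation: in BLM the actions are predicted by the model, whereas here a fixed trajectory (the one from Figure 1-2) is simply replayed. The names `BLANK`, `apply_action`, and `replay` are hypothetical, and blanks are indexed from 0 rather than from #1 as in the figure.

```python
from typing import List, Tuple

BLANK = "__"  # stand-in for the blank symbol (assumed spelling)

# An action is (b, w, l, r): index of the blank to expand (0-based, counting
# blanks from left to right), the word to insert, and whether to create a
# new blank on the left / right of that word.
Action = Tuple[int, str, bool, bool]


def apply_action(canvas: List[str], action: Action) -> List[str]:
    """Replace the b-th blank with [blank?] word [blank?] and return a new canvas."""
    b, word, left, right = action
    blank_positions = [i for i, tok in enumerate(canvas) if tok == BLANK]
    pos = blank_positions[b]
    replacement = ([BLANK] if left else []) + [word] + ([BLANK] if right else [])
    return canvas[:pos] + replacement + canvas[pos + 1:]


def replay(canvas: List[str], actions: List[Action]) -> List[str]:
    """Apply a fixed sequence of actions; a real model would choose them."""
    for action in actions:
        canvas = apply_action(canvas, action)
    return canvas


# Trajectory from Figure 1-2, rewritten with 0-based blank indices.
trajectory: List[Action] = [
    (0, "is", True, True),
    (0, "customer", False, True),
    (1, "awesome", False, False),
    (0, "service", False, False),
]

final = replay([BLANK], trajectory)
print(" ".join(final))
```

The same `apply_action` helper also works on a partially written canvas such as the one in Figure 1-1, which is the text-infilling setting: generation ends once the canvas contains no blank.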
