Deep Latent Variable Models of Natural Language


Citation Kim, Yoon. 2020. Deep Latent Variable Models of Natural Language. Doctoral dissertation, Harvard University, Graduate School of Arts & Sciences.

Citable link https://nrs.harvard.edu/URN-3:HUL.INSTREPOS:37365532

Terms of Use: This article was downloaded from Harvard University’s DASH repository, and is made available under the terms and conditions applicable to Other Posted Material, as set forth at http://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-use#LAA

©2020 – Yoon H. Kim. All rights reserved.

Deep Latent Variable Models of Natural Language

Abstract

Understanding natural language involves complex underlying processes by which meaning is extracted from surface form. One approach to operationalizing such phenomena in computational models of natural language is through probabilistic latent variable models, which can encode structural dependencies among observed and unobserved variables of interest within a probabilistic framework. Deep learning, on the other hand, offers an alternative computational approach to modeling natural language through end-to-end learning of expressive, global models, where any phenomena necessary for the task are captured implicitly within the hidden layers of a neural network. This thesis explores a synthesis of deep learning and latent variable modeling for natural language processing applications. We study a class of models called deep latent variable models, which parameterize components of probabilistic latent variable models with neural networks, thereby retaining the modularity of latent variable models while at the same time exploiting rich parameterizations enabled by recent advances in deep learning. We experiment with different families of deep latent variable models to target a wide range of language phenomena—from word alignment to parse trees—and apply them to core natural language processing tasks including language modeling, machine translation, and unsupervised parsing.

We also investigate key challenges in learning and inference that arise when working with deep latent variable models for language applications. A standard approach

for learning such models is through amortized variational inference, in which a global inference network is trained to perform approximate posterior inference over the latent variables. However, a straightforward application of amortized variational inference is often insufficient for many applications of interest, and we consider several extensions to the standard approach that lead to improved learning and inference. In summary, each chapter presents a deep latent variable model tailored for modeling a particular aspect of language, and develops an extension of amortized variational inference for addressing the particular challenges brought on by the latent variable model being considered. We anticipate that these techniques will be broadly applicable to other domains of interest.

Contents

1 Introduction
  1.1 Thesis Outline
  1.2 Related Publications

2 Background
  2.1 Motivation
  2.2 Notation
  2.3 Neural Networks
  2.4 Latent Variable Models
    2.4.1 Deep Latent Variable Models
  2.5 Learning and Inference
    2.5.1 Maximum Likelihood Estimation
    2.5.2 Tractable Exact Inference
    2.5.3 Variational Inference
    2.5.4 Amortized Variational Inference
    2.5.5 Variational Autoencoders
  2.6 Thesis Roadmap

3 Latent Variable Model of Sentences & Semi-Amortized Variational Inference
  3.1 Introduction
  3.2 Background
    3.2.1 Generative Model
    3.2.2 Posterior Collapse
    3.2.3 Amortization Gap
  3.3 Semi-Amortized Variational Autoencoders
    3.3.1 Implementation Details
  3.4 Empirical Study
    3.4.1 Experimental Setup
    3.4.2 Results
    3.4.3 Analysis of Latent Variables
  3.5 Discussion
    3.5.1 Limitations
    3.5.2 Posterior Collapse: Optimization vs. Underlying Model
  3.6 Related Work
  3.7 Conclusion

4 Latent Variable Model of Attention & Relaxations of Discrete Spaces
  4.1 Introduction
  4.2 Background
    4.2.1 Soft Attention
    4.2.2 Hard Attention
    4.2.3 Pathwise Gradient Estimators for Discrete Distributions
  4.3 Variational Attention for Latent Alignment
    4.3.1 Test Time Inference
  4.4 Empirical Study
    4.4.1 Experimental Setup
    4.4.2 Results
    4.4.3 Analysis
  4.5 Discussion
    4.5.1 Limitations
    4.5.2 Attention as a Latent Variable
  4.6 Related Work
  4.7 Conclusion

5 Latent Variable Model of Trees & Posterior Regularization
  5.1 Introduction
  5.2 Background
    5.2.1 Grammars
    5.2.2 Posterior Regularization
    5.2.3 Conditional Random Fields
  5.3 Unsupervised Recurrent Neural Network Grammars
    5.3.1 Generative Model
    5.3.2 Posterior Regularization with Conditional Random Fields
    5.3.3 Learning and Inference
  5.4 Empirical Study
    5.4.1 Experimental Setup
    5.4.2 Results
    5.4.3 Analysis
  5.5 Discussion
    5.5.1 Limitations
    5.5.2 Rich Generative Models for Learning Latent Structures
  5.6 Related Work
  5.7 Conclusion

6 Latent Variable Model of Grammars & Collapsed Variational Inference
  6.1 Introduction
  6.2 Background
    6.2.1 Probabilistic Context-Free Grammars
    6.2.2 Grammar Induction vs. Unsupervised Parsing
  6.3 Compound Probabilistic Context-Free Grammars
    6.3.1 A Neural Parameterization
    6.3.2 A Compound Extension
    6.3.3 Training and Inference
  6.4 Empirical Study
    6.4.1 Experimental Setup
    6.4.2 Results
    6.4.3 Analysis
    6.4.4 Recurrent Neural Network Grammars on Induced Trees
  6.5 Discussion
    6.5.1 Limitations
    6.5.2 Richer Grammars for Modeling Natural Language
  6.6 Related Work
  6.7 Conclusion

7 Conclusion

References

Listing of figures

2.1 Graphical model for the Naive Bayes model. For simplicity, all sequences are depicted as having T tokens. All distributions are categorical, and the parameters are µ ∈ ∆^{K−1} and π = {π_k ∈ ∆^{V−1}}_{k=1}^{K}.
2.2 Graphical model representation of a categorical latent variable model with tokens generated by an RNN. For simplicity, all sequences are depicted as having T tokens. The z^{(n)}’s are drawn from a Categorical distribution with parameter µ, while x^{(n)} is drawn from an RNN. The parameters of the RNN are given by θ = {W, U, V, T, E, b}. See the text for more details.
2.3 (Top) Traditional variational inference uses variational parameters λ^{(n)} for each data point x^{(n)}. (Bottom) Amortized variational inference employs a global inference network ϕ that is run over the input x^{(n)} to produce the local variational distributions.

3.1 (Left) Perplexity upper bound of various models when trained with 20 steps (except for VAE) and tested with varying number of SVI steps from random initialization. (Right) Same as the left except that SVI is initialized with variational parameters obtained from the inference network.
3.2 (Top) Output saliency visualization of some examples from the test set. Here the saliency values are rescaled to be between 0-100 within each example for easier visualization. Red indicates higher saliency values. (Middle) Input saliency of the first test example from the top (in blue), in addition to two sample outputs generated from the variational posterior (with their saliency values in red). (Bottom) Same as the middle except we use a made-up example.
3.3 Saliency by part-of-speech tag, position, and log frequency for the output (top) and the input (bottom). See text for the definitions of input/output saliency. The dotted gray line in each plot shows the average saliency across all words.

4.1 Sketch of variational attention applied to machine translation. Here the source sentence x_{1:7} has 7 words and the target sentence y_{1:9} has 9 words. Two alignment distributions are shown, for the blue prior p, and the red variational posterior q taking into account future observations. Our aim is to use q, which conditions on the entire target sentence, to improve estimates of p and to support improved inference of z.
4.2 Test perplexity of different approaches while varying K to estimate log p(y_t | x, y_{<t}).

5.1 Overview of our approach. The inference network q(z | x; ϕ) (left) is a CRF parser which produces a distribution over binary trees (shown in dotted box). B_{ij} are random variables for existence of a constituent spanning the i-th and j-th words, whose potentials are the output from a bidirectional LSTM (the global factor ensures that the distribution is only over valid binary trees). The generative model p(x, z; θ) (right) is an RNNG which consists of a stack LSTM (from which actions/words are predicted) and a tree LSTM (to obtain constituent representations upon reduce). Training involves sampling a binary tree from q(z | x; ϕ), converting it to a sequence of shift/reduce actions (z = [shift, shift, shift, reduce, reduce, shift, reduce] in the above example), and optimizing the log joint likelihood log p(x, z; θ).
5.2 Perplexity of the different models grouped by sentence length on PTB.

6.1 A graphical model-like diagram for the neural PCFG (left) and the compound PCFG (right) for an example tree structure. In the above, A_1, A_2 ∈ N are nonterminals, T_1, T_2, T_3 ∈ P are preterminals, w_1, w_2, w_3 ∈ Σ are terminals. In the neural PCFG, the global rule probabilities π = π_S ∪ π_N ∪ π_P are the output from a neural net run over the symbol embeddings E_G, where π_N are the set of rules with a nonterminal on the left hand side (π_S and π_P are similarly defined). In the compound PCFG, we have per-sentence rule probabilities π_z = π_{z,S} ∪ π_{z,N} ∪ π_{z,P} obtained from running a neural net over a random vector z (which varies across sentences) and global symbol embeddings E_G. In this case, the context-free assumptions hold conditioned on z, but they do not hold unconditionally: e.g. when conditioned on z and A_2, the variables A_1 and T_1 are independent; however when conditioned on just A_2, they are not independent due to the dependence path through z. Note that the rule probabilities are random variables in the compound PCFG but deterministic variables in the neural PCFG.

Acknowledgments

First and foremost I would like to thank my advisor, Sasha Rush. I have benefited immeasurably from his guidance throughout the Ph.D., from technical advice on how to best implement semiring algebra on GPUs to discussions on how to write and present papers, and everything in between. I am also grateful to my committee members Finale Doshi-Velez and Stuart Shieber. Finale’s class was the first time I felt that I really “got” variational inference—a foundation for much of the work presented in this thesis—and Stuart’s endless knowledge of grammatical formalisms (and pretty much everything else) was invaluable in intellectually centering my later work on grammar induction.

I would like to give special thanks to: David Sontag, for his mentorship and guidance during my time at NYU and at Harvard; Slav Petrov, for his wonderful class at NYU which introduced me to the world of NLP; and Sam Wiseman, for many stimulating and insightful discussions.

More broadly, I am thankful to the many collaborators and mentors with whom I had the chance to work with and learn from: Allen Schmaltz, Yuntian Deng, Justin Chiu, Andrew Miller, Kelly Zhang, Demi Guo, and Carl Denton at Harvard; Jake Zhao, Yann LeCun, Yi-I Chiu, Kentaro Hanaki, and Darshan Hegde at NYU; Adji Bousso Dieng and David Blei at Columbia; Daniel Andor and Livio Baldini Soares at Google; Marc’Aurelio Ranzato, Laurens van der Maaten, and Maximilian Nickel at Facebook; and Chris Dyer, Gábor Melis, Lei Yu, and Adhiguna Kuncoro at DeepMind. I would also like to thank the other past and present members of the NLP group at Harvard: Sebastian Gehrmann, Zack Ziegler, Jeffrey Ling, Alex Wang, and Rachit Singh. I am especially grateful to Yonatan Belinkov, David Parkes, Mirac Suzgun, and Kyunghyun Cho for their guidance and advice during my final year when I was looking for a job.

Finally, I thank my family for their support throughout this journey.

1 Introduction

Understanding natural language involves complex underlying processes by which meaning is extracted from surface form. One approach to operationalizing such phenomena in computational models of natural language is through probabilistic latent variable models, which provide a modular, probabilistic framework for specifying prior knowledge and structural relationships in complex datasets. Latent variable models have a long and rich history in natural language processing, having contributed

to fundamental advances such as statistical alignment for translation (Brown et al.,

1993), topic modeling (Blei et al., 2003), unsupervised part-of-speech tagging (Brown et al., 1992), and unsupervised parsing (Klein & Manning, 2004), among others. Deep learning, on the other hand, offers an alternative computational approach to modeling natural language through end-to-end learning of expressive, global models, where any phenomena necessary for the task are captured implicitly within the hidden layers of a neural network. Some major successes of deep learning in natural language processing include language modeling (Bengio et al., 2003; Mikolov et al., 2010; Zaremba et al., 2014; Radford et al., 2019), machine translation (Cho et al., 2014b; Sutskever et al., 2014; Bahdanau et al., 2015; Vaswani et al., 2017), question answering (Seo et al., 2017; Xiong et al., 2017; Wang & Jiang, 2017; Chen et al., 2017a), and more recently, representation learning for natural language understanding tasks through deep networks pretrained on self-supervised objectives (Peters et al., 2018; Devlin et al.,

2018).

There has been much recent, exciting work on combining the complementary strengths of latent variable models and deep learning. Latent variable modeling makes it easy to explicitly specify model constraints through conditional independence properties, while deep learning makes it possible to parameterize these conditional likelihoods with powerful function approximators. This thesis investigates these “deep latent variable models” for natural language applications. We explore a variety of deep latent variable models that target different language phenomena, from continuous latent variable models of sentences (chapter 3) to attention in machine translation (chapter

4) to parse trees (chapter 5) and grammars (chapter 6). Core applications considered in this thesis include language modeling, machine translation, and unsupervised parsing.

While these deep latent variable models provide a rich, flexible framework for modeling many natural language phenomena, difficulties exist: deep parameterizations of conditional likelihoods usually make inference intractable, and latent variable objectives often complicate backpropagation by introducing points of stochasticity into a model’s computation graph. This thesis explores these issues in depth through the lens of variational inference (Jordan et al., 1999; Wainwright & Jordan, 2008), a key technique for performing approximate inference. We focus on a particular family of techniques called amortized variational inference, which trains a global inference network to output the parameters of an approximate posterior distribution given a set of variables to be conditioned upon (Kingma & Welling, 2014; Rezende et al., 2014;

Mnih & Gregor, 2014). We show that a straightforward application of amortized variational inference is often insufficient for many applications of interest, and consider several extensions to the standard approach that lead to improved learning and inference.

1.1 Thesis Outline

Each chapter begins by introducing a latent variable approach to modeling a particular aspect of language, which will bring about its own unique challenges in inference and learning. We then develop and study techniques for addressing each of these challenges that we anticipate will be more broadly applicable to other domains of interest.

Concretely, the chapters are organized as follows:

• Chapter 2 gives a brief overview of latent variable models, exact and approximate inference, and the neural network machinery used throughout the thesis.

• Chapter 3 explores a continuous latent variable model of sentences with a fully autoregressive generative model. We study a common failure mode in such models known as posterior collapse, and propose an improved, semi-amortized approach to approximate inference that is able to mitigate it.

• Chapter 4 provides a latent variable formulation of attention in neural machine translation, motivated by alignment in traditional statistical machine translation systems. We experiment with continuous relaxation techniques in addition to more traditional approaches for learning such models.

• Chapter 5 considers the problem of learning syntax-based language models where the latent space corresponds to the set of parse trees for a sentence. We show that posterior regularization through a structured inference network provides the appropriate inductive bias to facilitate the emergence of meaningful tree structures.

• Chapter 6 revisits grammar induction with contemporary parameterization and inference techniques. We combine classical dynamic programming with amortized variational inference and show that this collapsed variational inference approach can train richer grammars that go beyond the traditional context-free assumptions.

• Finally, chapter 7 concludes and discusses future outlook.

1.2 Related Publications

Portions of this thesis appeared in the following publications:

• Chapters 1, 2, 7: Y. Kim, S. Wiseman, A.M. Rush. “A Tutorial on Deep Latent Variable Models of Natural Language,” EMNLP Tutorial, 2018.

• Chapter 3: Y. Kim, S. Wiseman, A.C. Miller, D. Sontag, A.M. Rush. “Semi-Amortized Variational Autoencoders,” In Proceedings of ICML, 2018.

• Chapter 4: Y. Deng, Y. Kim, J. Chiu, D. Guo, A.M. Rush. “Latent Alignment and Variational Attention,” In Proceedings of NeurIPS, 2018.

• Chapter 5: Y. Kim, A.M. Rush, A. Kuncoro, C. Dyer, G. Melis. “Unsupervised Recurrent Neural Network Grammars,” In Proceedings of NAACL, 2019.

• Chapter 6: Y. Kim, C. Dyer, A.M. Rush. “Compound Probabilistic Context-Free Grammars for Grammar Induction,” In Proceedings of ACL, 2019.

2 Background

2.1 Motivation

A probabilistic latent variable model specifies a joint distribution p(x, z) over unobserved, latent variables z and observed variables x. Through a factorization of the joint distribution into modular components, it becomes possible to express rich structural relationships and reason about observed and unobserved factors of variation in

Portions of this chapter appeared in Kim et al. (2018b).

complex datasets within a probabilistic framework. Probabilistic latent variable models have a long and rich history in natural language processing. Influential applications of latent variable models include: part-of-speech induction with hidden Markov models (Merialdo, 1994), word alignment from parallel corpora in statistical machine translation (Brown et al., 1993), unsupervised morpheme discovery with latent segmentation models (Creutz & Lagus, 2002), topic modeling with latent Dirichlet allocation (Blei et al., 2003), unsupervised parsing with the constituent-context model

(Klein & Manning, 2002) and dependency model with valence (Klein & Manning,

2004), and supervised parsing with latent variable probabilistic context-free grammars

(Petrov et al., 2006). A core goal of latent variable modeling is often structure discovery. For example, successful induction of formal linguistic structures such as generative grammars can yield scientific insights as to the underlying processes that govern language, while thematic structures discovered through topic modeling can organize and summarize large amounts of text data.

Deep learning, broadly construed, describes a set of tools and models for end-to-end learning via numerical optimization against a predictive task. Key successes of deep learning in natural language processing include: language modeling (Bengio et al.,

2003; Mikolov et al., 2010; Zaremba et al., 2014; Radford et al., 2019), machine translation (Cho et al., 2014b; Sutskever et al., 2014; Bahdanau et al., 2015; Vaswani et al.,

2017), summarization (Rush et al., 2015; Cheng & Lapata, 2016; Nallapati et al.,

2016; See et al., 2017), question answering (Seo et al., 2017; Xiong et al., 2017; Wang

& Jiang, 2017; Chen et al., 2017a), text classification (Socher et al., 2013b; Kalchbrenner et al., 2014; Kim, 2014; Tai et al., 2015), representation learning through self-supervised objectives (Mikolov et al., 2013; Le & Mikolov, 2014; Peters et al.,

2018; Devlin et al., 2018), and classical NLP tasks such as tagging (Collobert et al.,

2011; Lample et al., 2016; Ma & Hovy, 2016; He et al., 2017; Strubell et al., 2018) and parsing (Chen & Manning, 2014; Vinyals et al., 2015; Dyer et al., 2015; Kitaev

& Klein, 2018). A primary goal of deep learning is generally predictive accuracy; we are interested in learning expressive, global models that are rich enough to model the underlying data and at the same time have the right inductive biases such that they generalize well to unseen examples.

In this thesis we study deep latent variable models, which combine the modularity of probabilistic latent variable models with the flexible modeling capabilities of deep networks. The overarching goal of deep latent variable models is to accurately model the underlying data with deep learning while at the same time exploiting latent variables for model interpretability, transparency, controllability, and structure discovery.

The term “deep” in deep latent variable models refers to both (1) deep parameterizations of conditional distributions within latent variable models and (2) the use of deep neural networks to perform approximate inference over latent variables. As we will shortly see, these two uses of the term are connected: flexible parameterizations of distributions over high dimensional data enabled by deep networks generally lead to models in which exact posterior inference is intractable, which subsequently require the use of a separate inference network to perform approximate posterior inference in the model.

We begin this background chapter by briefly reviewing neural networks and latent variable models. We then discuss learning and inference in these models, both in the case where exact inference over the latent variables is tractable and when it is not.

We conclude with an exposition of amortized variational inference and variational autoencoders, a central framework for learning deep latent variable models.

2.2 Notation

We will generally use x to denote the observed data and z to refer to the unobserved latent variables. In this thesis x is usually a sequence of discrete tokens,

x = [x1, . . . , xT ]

where each xi ∈ V. Here V = {1,...,V } is a finite vocabulary set of size V , and T is the sequence length. The latent variable z can be a continuous vector (chapter 3), a categorical variable (chapter 4), a binary tree (chapter 5), or a combination thereof

(chapter 6). Both x, z are random variables (i.e., measurable functions from sample space Ω to Rn), though x will almost always be observed.

We use p_x(y; θ) to refer to the mass/density function of the random variable x parameterized by θ evaluated at y. When there is no ambiguity with regard to the random variable over which the distribution is induced, we will often drop the subscript on the mass/density function, or use a different letter (e.g. q instead of p for variational distributions). We overload p in two ways. First, we use p to more generally refer to a distribution over a random variable, e.g. when x^{(n)} is sampled from a distribution p(x; θ),

x(n) ∼ p(x; θ).

This distinction is in practice a mere formality since we always characterize distributions with their density/mass functions. The second use of p is to refer to the probability of an event, e.g. if x is discrete,

p(x = y; θ) = p_x(y; θ).

In order to simplify notation, we also overload x to additionally refer to the realization of the random variable x. For example, when we use E_{p(x)}[f(x)] to refer to the expectation of a function f(x),

E_{p(x)}[f(x)] = ∑_x p(x) f(x),

we use x in the random variable sense in “E_{p(x)}[f(x)]” and in the realization sense in “∑_x p(x) f(x)”. The entropy of a distribution p(x) is

H[p(x)] = E_{p(x)}[− log p(x)],

and the Kullback–Leibler divergence between two distributions p(x) and q(x) is

KL[p(x) ∥ q(x)] = E_{p(x)}[ log (p(x) / q(x)) ].

We often use bold letters (e.g. x instead of x and z instead of z) to emphasize the fact that we are working with vectors/sequences. Vector concatenation between two vectors u, v is denoted as [u; v]. We will index vectors/matrices with square brackets or subscripts, e.g. h[i] or h_i for the i-th element of a vector h, and W[i] or W_i for the i-th row of a matrix W. Bracketed superscripts are used to index different data points, e.g. x^{(1:N)} = {x^{(n)}}_{n=1}^{N} = {x^{(1)}, . . . , x^{(N)}} for a corpus of N sentences.

2.3 Neural Networks

We now briefly introduce the neural network machinery to be used throughout the thesis. Neural networks are parameterized nonlinear functions, which transform an input vector e into features h using parameters θ. For example, a multilayer perceptron

(MLP) computes features as follows:

h = MLP(e; θ) = V f(We + b) + a,

where f is an element-wise nonlinearity, such as tanh, ReLU, or sigmoid functions, and the set of neural network parameters is given by θ = {V, W, a, b}. In practice multilayer perceptrons are often augmented with residual layers (He et al., 2015), batch normalization (Ioffe & Szegedy, 2015), layer normalization (Ba et al., 2016), and other modifications. We also make use of sequential neural networks (SeqNN), which operate over a sequence of input vectors e_{1:T} = [e_1, . . . , e_T] to output a sequence of features h_{1:T} = [h_1, . . . , h_T] as follows:

h_{1:T} = SeqNN(e_{1:T}; θ).

For example, the classical Elman RNN (Elman, 1990) computes each ht as

h_t = σ(Ue_t + Vh_{t−1} + b),

where σ is the sigmoid function and θ = {U, V, b}. Commonly-employed sequential neural network architectures include long short-term memory (LSTM) (Hochreiter

& Schmidhuber, 1997), gated recurrent units (GRU) (Cho et al., 2014b), and Trans-

formers (Vaswani et al., 2017). In this thesis we generally remain agnostic with regard to the particular architecture and simply treat MLP(· ; θ) and SeqNN(· ; θ) as

(sub)differentiable functions with respect to θ, though for completeness we will often specify the architecture and hyperparameters when describing the experimental setup in each chapter.
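To make these definitions concrete, here is a minimal NumPy sketch of the MLP and Elman RNN written out above. The layer sizes, the random initialization, and the choice of tanh for the nonlinearity f are illustrative assumptions rather than settings used in the thesis.

```python
import numpy as np

def mlp(e, params):
    """h = V f(W e + b) + a, with tanh as one possible choice of f."""
    W, b, V, a = params["W"], params["b"], params["V"], params["a"]
    return V @ np.tanh(W @ e + b) + a

def elman_rnn(e_seq, params):
    """h_t = sigmoid(U e_t + V h_{t-1} + b); returns the full sequence h_{1:T}."""
    U, V, b = params["U"], params["V"], params["b"]
    h = np.zeros(b.shape[0])             # h_0 = 0 (a common convention)
    hs = []
    for e_t in e_seq:
        h = 1.0 / (1.0 + np.exp(-(U @ e_t + V @ h + b)))
        hs.append(h)
    return np.stack(hs)                  # shape (T, d)

# Illustrative dimensions, not taken from the thesis.
rng = np.random.default_rng(0)
d_in, d_hid, d_out, T = 5, 8, 3, 4
mlp_params = {"W": rng.normal(size=(d_hid, d_in)), "b": np.zeros(d_hid),
              "V": rng.normal(size=(d_out, d_hid)), "a": np.zeros(d_out)}
rnn_params = {"U": rng.normal(size=(d_hid, d_in)),
              "V": rng.normal(size=(d_hid, d_hid)), "b": np.zeros(d_hid)}
print(mlp(rng.normal(size=d_in), mlp_params).shape)             # (3,)
print(elman_rnn(rng.normal(size=(T, d_in)), rnn_params).shape)  # (4, 8)
```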

As neural networks operate over vector representations of data, the first step in neural network-based approaches for natural language processing is to represent each word xt with its vector representation et. This is usually done through an embedding matrix,

et = E[xt],

where E is a |V| × d matrix and E[x_t] is the word embedding corresponding to token x_t (i.e. the x_t-th row of E). It is also common practice to work with subword pieces as the atomic unit, for example characters or subword units obtained from byte pair encoding (Sennrich et al., 2016). Again we remain agnostic with regard to the level of atomicity and simply treat the input as a sequence of discrete tokens from a vocabulary V.

Finally, neural networks make use of the softmax layer, which applies an element-wise exponentiation to a vector s ∈ R^K and renormalizes, i.e. y = softmax(s) means that each element of y is given by

softmax(s)[k] = y[k] = exp(s[k]) / ∑_{v=1}^{K} exp(s[v]).

The softmax layer is often used to parameterize a distribution over a discrete space

over K elements, for example as the final output layer of a deep network or as an intermediate attention layer (chapter 4).

2.4 Latent Variable Models

A probabilistic latent variable model specifies a joint distribution p(x, z; θ) over the observed variable x and the unobserved variable z, where the distribution is parameterized by θ. For example, the Naive Bayes model assumes that a corpus of sentences, {x^{(n)}}_{n=1}^{N}, where each sentence x^{(n)} = [x_1^{(n)}, . . . , x_T^{(n)}] has the same number of tokens T for simplicity, is generated according to the following probabilistic process:

1. For each sentence, sample the latent variable z^{(n)} ∈ {1, . . . , K} from a Categorical prior p(z; µ) with parameter µ ∈ ∆^{K−1}. That is,

   z^{(n)} ∼ p(z; µ),
   p(z^{(n)} = k; µ) = µ[k],

   where

   ∆^{K−1} = { s : ∑_{k=1}^{K} s[k] = 1, s[k] ≥ 0 ∀k }

   is the (K − 1)-simplex.

2. Given z^{(n)}, sample each token x_t^{(n)} ∈ {1, . . . , V} independently from a Categorical distribution with parameter π_{z^{(n)}} ∈ ∆^{V−1}. That is,

   x_t^{(n)} ∼ p(x_t | z^{(n)}; π),
   p(x_t^{(n)} = v | z^{(n)} = k; π) = π_k[v],

   where π_k[v] is the probability of drawing word index v given that the latent variable z^{(n)} takes on the value k.

Then, the probability of a sentence x^{(n)} = [x_1^{(n)}, . . . , x_T^{(n)}] given z^{(n)} is

p(x = x^{(n)} | z = z^{(n)}; π) = ∏_{t=1}^{T} π_{z^{(n)}}[x_t^{(n)}].

This generative process defines a factorization of the joint probability distribution of the entire corpus as

p({x^{(n)}, z^{(n)}}_{n=1}^{N}; π, µ) = ∏_{n=1}^{N} p(z = z^{(n)}; µ) p(x = x^{(n)} | z = z^{(n)}; π)
                                     = ∏_{n=1}^{N} µ[z^{(n)}] ∏_{t=1}^{T} π_{z^{(n)}}[x_t^{(n)}].

The directed graphical model that delineates this generative process is shown in Figure 2.1, where we show random variables in circles (observed variables are shaded) and model parameters without circles.1 It is clear that the graphical model specifies both the stochastic procedure for generating the data as well as the factorization of the joint distribution into conditional distributions. This model assumes that each token in x is generated independently, conditioned on z. This assumption is clearly naive (hence the name) but greatly reduces the number of parameters that need to be estimated. Letting θ = {µ, π}, the total number of parameters in this generative model is K + KV, where we have K parameters for µ and V parameters in π_z for each of the K values of z.2

1Graphical models provide a convenient framework for delineating the set of conditional independence assumptions that hold in a latent variable model. For example, directed graphical models factorize the joint distribution based on a top-down traversal of a graph. However, some latent variable models considered in this thesis (such as probabilistic context-free grammars in chapter 6) cannot conveniently be represented within the framework of graphical models.

Figure 2.1: Graphical model for the Naive Bayes model. For simplicity, all sequences are depicted as having T tokens. All distributions are categorical, and the parameters are µ ∈ ∆^{K−1} and π = {π_k ∈ ∆^{V−1}}_{k=1}^{K}.

The Naive Bayes model is one of the simplest examples of a latent variable model of discrete sequences such as text. Intuitively, since each sentence x^{(n)} comes with a corresponding latent variable z^{(n)} governing its generation, we can see the z^{(n)} values as inducing a clustering over the sentences {x^{(n)}}_{n=1}^{N}; sentences generated by the same value of z^{(n)} belong to the same cluster. Throughout this thesis we will explore how different conditional independence assumptions in a latent variable model lead to different structures being encoded by the latent variables.
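As an illustration of the generative process just described, the following NumPy sketch samples a small synthetic corpus from the Naive Bayes model. The sizes K, V, N, and T are arbitrary toy values, not settings from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: K clusters, vocabulary of size V, N sentences of T tokens each.
K, V, N, T = 3, 10, 5, 6

# Model parameters: mu lies on the (K-1)-simplex, each pi[k] on the (V-1)-simplex.
mu = rng.dirichlet(np.ones(K))
pi = rng.dirichlet(np.ones(V), size=K)

sentences, clusters = [], []
for _ in range(N):
    z = rng.choice(K, p=mu)              # z^{(n)} ~ Categorical(mu)
    x = rng.choice(V, size=T, p=pi[z])   # x_t^{(n)} ~ Categorical(pi_z), i.i.d. given z
    clusters.append(z)
    sentences.append(x)

# Joint probability of one sentence: p(x, z) = mu[z] * prod_t pi[z][x_t].
log_joint = np.log(mu[clusters[0]]) + np.log(pi[clusters[0]][sentences[0]]).sum()
print(sentences[0], clusters[0], log_joint)
```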

2.4.1 Deep Latent Variable Models

Deep latent variable models parameterize the conditional distributions in a latent variable model with neural networks. For example, a “deep” Naive Bayes model might

2This model is overparameterized since we only need V − 1 parameters for a Categorical distribution over a set of size V . This is rarely an issue in practice.

parameterize p(x | z; θ) as follows:

p(x = x^{(n)} | z = z^{(n)}; θ) = ∏_{t=1}^{T} softmax(MLP(z^{(n)}; θ))[x_t^{(n)}],

where z^{(n)} is the one-hot vector representation of z^{(n)}. Note that this modification does not change the underlying conditional independence assumptions; only the parameterization is changed. Concretely, whereas in the previous case we had a scalar

π_z[x_t] for each z ∈ {1, . . . , K} and x_t ∈ {1, . . . , V} (constrained to be valid probabilities), we now parameterize π_z such that it is the output of a neural network. We will see in chapter 6 that probabilistic models with the same conditional independence assumptions but different parameterizations (i.e., scalar vs. neural parameterization) lead to different learning dynamics due to parameter sharing and other inductive biases from neural networks and the associated algorithms for optimizing them.

Another reason we are interested in deep latent variable models is that neural networks make it possible to define flexible distributions over high-dimensional data

(such as text) without using too many parameters. As an example, let us now consider a sentence model similar to the Naive Bayes model, but which avoids the Naive

Bayes assumption above (whereby each token is generated independently given z) using a sequential neural network such as an RNN. This will allow the probability of

x_t^{(n)} to depend on the entire history x_{<t}^{(n)}. In

this deep variant, we might then define the probability of x given latent variable z as

p(x = x^{(n)} | z = z^{(n)}; θ) = ∏_{t=1}^{T} p(x_t = x_t^{(n)} | x_{<t}^{(n)}, z^{(n)}; θ),

where

h_t^{(n)} = σ(Ue_t^{(n)} + Vh_{t−1}^{(n)} + Tz^{(n)} + b),

e_t^{(n)} = E[x_t^{(n)}].

That is, the probability distribution over x_t given z is given by a linear transformation over the hidden states produced by an Elman RNN that conditions on z through an additional matrix T during each hidden state update. Here the parameters of the conditional distribution are given by θ = {W, U, V, T, E, b}, and we show the corresponding graphical model in Figure 2.2. Note that unlike the deep Naive Bayes extension above, this is a more flexible probabilistic model with less stringent conditional independence assumptions. Further, this deep model allows for avoiding the

Naive Bayes assumption while still using only O(d² + V d + Kd) parameters, assuming h_t^{(n)} ∈ R^d. On the other hand, a direct scalar parameterization of this model would require O(KV^T) parameters, clearly highlighting the advantages afforded by neural network parameterizations.
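The following NumPy sketch spells out this latent-variable RNN. The dimensions and initialization are illustrative assumptions, and so is the exact output convention used here (each token x_t is predicted by a softmax over W applied to the previous hidden state, so that the prediction depends only on x_{<t} and z); the text above specifies only that the output distribution is a linear transformation of the hidden states.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, d = 3, 10, 8          # illustrative sizes: K clusters, vocab size V, hidden size d

# theta = {W, U, V, T, E, b}; stored in a dict to avoid clashing with the Python variable V.
theta = {
    "W": rng.normal(scale=0.1, size=(V, d)),   # output projection
    "U": rng.normal(scale=0.1, size=(d, d)),   # input-to-hidden
    "V": rng.normal(scale=0.1, size=(d, d)),   # hidden-to-hidden
    "T": rng.normal(scale=0.1, size=(d, K)),   # conditions each update on z
    "E": rng.normal(scale=0.1, size=(V, d)),   # word embeddings
    "b": np.zeros(d),
}

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def log_prob_given_z(x, z, theta):
    """log p(x | z; theta) for a list of token ids x and a cluster id z."""
    z_onehot = np.eye(K)[z]
    h = np.zeros(d)                                # h_0 = 0
    logp = 0.0
    for x_t in x:
        probs = softmax(theta["W"] @ h)            # prediction uses only x_{<t} (via h) and z
        logp += np.log(probs[x_t])
        e_t = theta["E"][x_t]
        h = 1 / (1 + np.exp(-(theta["U"] @ e_t + theta["V"] @ h
                              + theta["T"] @ z_onehot + theta["b"])))
    return logp

print(log_prob_given_z([1, 4, 2], z=0, theta=theta))
```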


Figure 2.2: Graphical model representation of a categorical latent variable model with tokens generated by an RNN. For simplicity, all sequences are depicted as having T tokens. The z(n)’s are drawn from a Categorical distribution with parameter µ, while x(n) is drawn from an RNN. The parameters of the RNN are given by θ = {W, U, V, T, E, b}. See the text for more details.

2.5 Learning and Inference

After defining a latent variable model, we are interested in two related tasks: (1) we would like to be able to learn the parameters θ of the model based on observed data x^{(1:N)}, and (2) we would like to be able to perform posterior inference over the model.

That is, we’d like to be able to compute the posterior distribution p(z | x; θ) (or approximations thereof) over the latent variables, given some data x. Note that posterior inference requires calculating the marginal distribution p(x; θ), since

p(z | x; θ) = p(x, z; θ) / p(x; θ),

and hence we sometimes use “inference” to also refer to the calculation of the marginal distribution. As we will see, learning and inference are intimately connected because learning often uses inference as a subroutine.

18 2.5.1 Maximum Likelihood Estimation

The dominant approach to learning latent variable models in a probabilistic setting is to maximize the log marginal likelihood of observed data. This is equivalent to minimizing the KL-divergence between the true data distribution p⋆(x) and the model distribution p(x; θ),

KL[p⋆(x) ∥ p(x; θ)] = E_{p⋆(x)}[ log (p⋆(x) / p(x; θ)) ],

where the latent variable z has been marginalized out, i.e.

p(x; θ) = ∑_z p(x, z; θ)

in the case that z is discrete. (The sum should be replaced with an integral if z is continuous.) Assuming that each training example x^{(n)} is sampled independently and identically from the true data distribution,

x^{(n)} ∼_{i.i.d.} p⋆(x),

maximum likelihood learning corresponds to the following optimization problem:

arg min_θ KL[p⋆(x) ∥ p(x; θ)] = arg min_θ E_{p⋆(x)}[ log (p⋆(x) / p(x; θ)) ]
                              = arg max_θ E_{p⋆(x)}[ log p(x; θ) ]
                              ≈ arg max_θ (1/N) ∑_{n=1}^{N} log p(x^{(n)}; θ),

19 where in the last line we approximate the expectation over p⋆(x) with Monte Carlo samples from the data generating distribution (i.e. the training set).

2.5.2 Tractable Exact Inference

We begin with the case where the log marginal likelihood, i.e.

log p(x; θ) = log ∑_z p(x, z; θ),

is tractable to evaluate. As mentioned above, this is equivalent to assuming posterior inference is tractable. In cases where p(x, z; θ) is parameterized by a deep model, the maximum likelihood problem,

arg max_θ ∑_{n=1}^{N} log p(x^{(n)}; θ),

is usually not possible to solve for exactly. We will assume, however, that p(x, z; θ) is differentiable with respect to θ. The main tool for optimizing such models, then, is gradient-based optimization. In particular, define the log marginal likelihood over the training set x^{(1:N)} = {x^{(1)}, . . . , x^{(N)}} as

L(θ) = log p(x^{(1:N)}; θ) = ∑_{n=1}^{N} log p(x^{(n)}; θ) = ∑_{n=1}^{N} log ∑_z p(x^{(n)}, z; θ).

20 The gradient is given by

∇_θ L(θ) = ∑_{n=1}^{N} ( ∑_z ∇_θ p(x^{(n)}, z; θ) ) / p(x^{(n)}; θ)                          (chain rule)
         = ∑_{n=1}^{N} ∑_z ( p(x^{(n)}, z; θ) / p(x^{(n)}; θ) ) ∇_θ log p(x^{(n)}, z; θ)     (since ∇p(x, z) = p(x, z)∇ log p(x, z))
         = ∑_{n=1}^{N} E_{p(z | x^{(n)}; θ)}[ ∇_θ log p(x^{(n)}, z; θ) ].

The above gradient expression involves an expectation over the posterior p(z | x^{(n)}; θ), and therefore is an example of how posterior inference is used as a subroutine in learning. With this expression for the gradient in hand, we may then learn by updating the parameters as

θ^{(i+1)} = θ^{(i)} + η∇_θ L(θ^{(i)}),

where η is the learning rate and θ^{(0)} is initialized randomly. In practice the gradient is calculated over mini-batches (i.e. random subsamples of the training set), and adaptive learning rates such as Adagrad (Duchi et al., 2011), Adadelta (Zeiler, 2012), and

Adam (Kingma & Ba, 2015) are often used.
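For a model with a small discrete latent space, the gradient identity above can be checked directly with automatic differentiation: differentiating through log ∑_z p(x, z; θ), computed as a logsumexp, yields exactly the posterior-weighted gradient of the log joint. The PyTorch sketch below uses a toy deep Naive-Bayes-style model whose sizes and architecture are illustrative assumptions.

```python
import torch

# Toy model with an enumerable latent variable z (illustrative sizes, not from the thesis).
K, V, d = 3, 10, 8
torch.manual_seed(0)
emb = torch.nn.Embedding(K, d)                 # embedding for each value of z
out = torch.nn.Linear(d, V)                    # maps the embedding to logits over the vocabulary
log_mu = torch.nn.Parameter(torch.zeros(K))    # unnormalized prior over z

x = torch.tensor([1, 4, 2, 7])                 # one observed sentence (token ids)

# log p(x, z) for every z: log p(z) + sum_t log p(x_t | z).
log_prior = torch.log_softmax(log_mu, dim=0)                        # (K,)
log_lik = torch.log_softmax(out(emb.weight), dim=-1)[:, x].sum(-1)  # (K,)
log_joint = log_prior + log_lik

# log p(x) = logsumexp_z log p(x, z); autograd through this marginal produces
# E_{p(z|x)}[grad log p(x, z)], the expression derived above.
log_marginal = torch.logsumexp(log_joint, dim=0)
log_marginal.backward()
print(float(log_marginal), log_mu.grad)
```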

More generally, gradient ascent on the log marginal likelihood is an instance of an

Expectation-Maximization (EM) algorithm (Dempster et al., 1977). The EM algorithm is an iterative method for learning latent variable models that admit tractable posterior inference. It maximizes a lower bound on the log marginal likelihood at each iteration. Given randomly-initialized starting parameters θ^{(0)}, the algorithm updates the parameters via the following alternating procedure:

1. E-step: Derive the posterior under current parameters θ^{(i)}, i.e., p(z | x = x^{(n)}; θ^{(i)}) for all data points n = 1, . . . , N.

2. M-step: Define the expected complete data likelihood as

   Q(θ, θ^{(i)}) = ∑_{n=1}^{N} E_{p(z | x^{(n)}; θ^{(i)})}[ log p(x^{(n)}, z; θ) ],

   and then maximize this with respect to θ, holding θ^{(i)} fixed,

   θ^{(i+1)} = arg max_θ Q(θ, θ^{(i)}).

It can be shown that EM improves the log marginal likelihood at each iteration, i.e.

∑_{n=1}^{N} log p(x^{(n)}; θ^{(i+1)}) ≥ ∑_{n=1}^{N} log p(x^{(n)}; θ^{(i)}).

In cases where finding the exact arg max in the M-step is not possible (as will be the case with deep latent variable models), we can perform a gradient-based optimization step on Q. This highlights the connection between gradient ascent on the log marginal likelihood and EM, since

∇_θ Q(θ, θ^{(i)}) = ∑_{n=1}^{N} E_{p(z | x^{(n)}; θ^{(i)})}[ ∇_θ log p(x^{(n)}, z; θ) ]

= ∇_θ L(θ).

This variant of EM is sometimes referred to as generalized expectation maximization

(Dempster et al., 1977; Neal & Hinton, 1998; Murphy, 2012). The EM algorithm can in general be seen as performing coordinate ascent on a lower bound on the log marginal likelihood (Bishop, 2006). This view will become useful in the next section where we consider cases where exact posterior inference is intractable, and will

motivate variational inference, a class of methods which uses approximate but tractable posteriors in place of the true posterior.
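For the Naive Bayes model of Section 2.4, both the E-step and the M-step are available in closed form, so EM can be written in a few lines. The NumPy sketch below uses toy, randomly generated data; the sizes and the small smoothing constant (added to avoid log(0)) are implementation choices, not part of the algorithm as described above.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, N, T = 3, 10, 100, 6
X = rng.integers(V, size=(N, T))             # toy corpus of token ids

mu = rng.dirichlet(np.ones(K))               # random initialization
pi = rng.dirichlet(np.ones(V), size=K)

for it in range(50):
    # E-step: posterior p(z = k | x^{(n)}) proportional to mu[k] * prod_t pi[k][x_t].
    log_joint = np.log(mu)[None, :] + np.log(pi)[:, X].sum(-1).T   # shape (N, K)
    log_post = log_joint - np.logaddexp.reduce(log_joint, axis=1, keepdims=True)
    post = np.exp(log_post)

    # M-step: closed-form updates of mu and pi from expected counts.
    mu = post.mean(0)
    counts = np.full((K, V), 1e-6)           # small smoothing to avoid log(0)
    for v in range(V):
        counts[:, v] += post.T @ (X == v).sum(1)
    pi = counts / counts.sum(1, keepdims=True)

    # The log marginal likelihood is guaranteed not to decrease across iterations.
    log_marginal = np.logaddexp.reduce(log_joint, axis=1).sum()

print(round(float(log_marginal), 2))
```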

2.5.3 Variational Inference

Previously we have considered cases in which posterior inference (or equivalently, calculation of the marginal likelihood) is tractable. Now we consider cases in which posterior inference is intractable. Variational inference (Hinton & van Camp, 1993;

Jordan et al., 1999) is a technique for approximating an intractable posterior distribution p(z | x; θ) with a tractable surrogate. In the context of learning the parameters of a latent variable model, variational inference can be used in optimizing a lower bound on the log marginal likelihood that involves only an approximate posterior over latent variables, rather than the exact posteriors we have been considering until now.

We begin by defining a set of distributions Q, known as the variational family, whose elements are distributions q(z; λ) parameterized by λ (we only consider parametric families in this thesis). That is, Q contains distributions over the latent variables z. We will use Q to denote the entire variational family, and q(z; λ) ∈ Q to refer to a particular variational distribution within the variational family, which is picked out by λ. Working with a continuous latent variable z, we now derive a lower bound on the log marginal likelihood log p(x; θ) = log ∫_z p(x, z; θ) dz that makes use of

the variational distribution q(z; λ):

log p(x; θ) = ∫ q(z; λ) log p(x; θ) dz                                                                   (since log p(x; θ) is non-random)
            = ∫ q(z; λ) log [ p(x, z; θ) / p(z | x; θ) ] dz                                               (rewriting p(x; θ))
            = ∫ q(z; λ) log [ (p(x, z; θ) / q(z; λ)) · (q(z; λ) / p(z | x; θ)) ] dz                       (multiplying by 1)
            = ∫ q(z; λ) log [ p(x, z; θ) / q(z; λ) ] dz + ∫ q(z; λ) log [ q(z; λ) / p(z | x; θ) ] dz      (distribute log)
            = ∫ q(z; λ) log [ p(x, z; θ) / q(z; λ) ] dz + KL[q(z; λ) ∥ p(z | x; θ)]                       (definition of KL divergence)
            = E_{q(z; λ)}[ log (p(x, z; θ) / q(z; λ)) ] + KL[q(z; λ) ∥ p(z | x; θ)]                       (definition of expectation)
            = ELBO(θ, λ; x) + KL[q(z; λ) ∥ p(z | x; θ)]                                                   (definition of ELBO)
            ≥ ELBO(θ, λ; x)                                                                               (KL always non-negative)

The above derivation shows that log p(x; θ) is equal to a quantity called the evidence lower bound, or ELBO, plus the KL divergence between q(z; λ) and the posterior distribution p(z | x; θ). Since the KL divergence is always non-negative, the ELBO is a lower bound on log p(x; θ), and it is this quantity that we attempt to maximize with variational inference.3

The form of the ELBO is worth looking at more closely. First, note that it is a function of θ, λ (the data x is fixed), and lower bounds the log marginal likelihood log p(x; θ) for any λ. The bound is tight if the variational distribution equals the true

3Note that this derivation requires that the support of the variational distribution lie within the support of the true posterior, i.e., p(z | x; θ) = 0 =⇒ q(z; λ) = 0 for all z. Otherwise, the second equality would have a division by zero. In contrast, we can have q(z; λ) = 0 and p(z | x; θ) > 0 for some z, since the integral remains unchanged if we just integrate over the set E = {z : q(z; λ) > 0}.

posterior, i.e.

q(z; λ) = p(z | x; θ) =⇒ log p(x; θ) = ELBO(θ, λ; x).

It is also immediately evident that

ELBO(θ, λ; x) = log p(x; θ) − KL[q(z; λ) ∥ p(z | x; θ)].

In some scenarios the model parameters θ are given (and thus fixed), and the researcher is tasked with finding the best variational approximation to the true posterior. Under this setup, log p(x; θ) is a constant with respect to λ and therefore maximizing the ELBO is equivalent to minimizing KL[q(z; λ) ∥ p(z | x; θ)]. However for our purposes we are also interested in learning the generative model parameters θ.

Letting x^{(1:N)} = {x^{(1)}, . . . , x^{(N)}} be the training set, the ELBO over the entire dataset is given by the sum of individual ELBOs,

ELBO(θ, λ^{(1:N)}; x^{(1:N)}) = ∑_{n=1}^{N} ELBO(θ, λ^{(n)}; x^{(n)}) = ∑_{n=1}^{N} E_{q(z; λ^{(n)})}[ log (p(x^{(n)}, z; θ) / q(z; λ^{(n)})) ],

where the variational parameters are given by λ^{(1:N)} = [λ^{(1)}, . . . , λ^{(N)}] (i.e. we have

a λ^{(n)} for each data point x^{(n)}). Since the x^{(n)} are assumed to be drawn i.i.d., it is clear that the aggregate ELBO lower bounds the log likelihood of the training corpus,

ELBO(θ, λ(1:N); x(1:N)) ≤ log p(x(1:N); θ).

It is this aggregate ELBO that we wish to maximize with respect to θ and λ(1:N) to train our model.

One possible strategy for maximizing the aggregate ELBO is coordinate ascent, where we maximize the objective with respect to the λ^{(n)}’s keeping θ fixed, then maximize with respect to θ keeping the λ^{(n)}’s fixed. In particular, we arrive at the following:

1. Variational E-step: For each n = 1, . . . , N, maximize the ELBO for each x^{(n)} holding θ^{(i)} fixed,

   λ^{(n)} = arg max_λ ELBO(θ^{(i)}, λ; x^{(n)}) = arg min_λ KL[q(z; λ) ∥ p(z | x^{(n)}; θ^{(i)})],

   where the second equality holds since log p(x^{(n)}; θ^{(i)}) is a constant with respect to the λ^{(n)}’s.

2. Variational M-step: Maximize the aggregate ELBO with respect to θ holding the λ^{(n)}’s fixed,

   θ^{(i+1)} = arg max_θ ∑_{n=1}^{N} ELBO(θ, λ^{(n)}; x^{(n)}) = arg max_θ ∑_{n=1}^{N} E_{q(z; λ^{(n)})}[ log p(x^{(n)}, z; θ) ],

   where the second equality holds since the E_{q(z; λ^{(n)})}[− log q(z; λ^{(n)})] portion of the ELBO is constant with respect to θ.

This style of training is also known as variational expectation maximization (Neal &

Hinton, 1998). In variational EM, the E-step, which usually performs exact posterior inference, is instead replaced with variational inference, which finds the best variational

approximation to the true posterior. The M-step maximizes the expected complete data likelihood, where the expectation is taken with respect to the variational posterior distribution.

If we consider the case where the variational family is flexible enough to include the true posterior,4 then it is clear that the above reduces to the EM algorithm, since in the first step KL[q(z; λ(n)) ∥ p(z | x(n); θ(i))] is minimized when q(z; λ(n)) equals the true posterior. Therefore, we can view EM as performing coordinate ascent on the

ELBO where the variational family is arbitrarily flexible. Of course, this case is uninteresting since we have assumed that exact posterior inference is intractable. We are therefore interested in choosing a variational family that is flexible enough and at the same time allows for tractable optimization.

In practice, performing coordinate ascent on the entire dataset is usually too expensive. The variational E-step can instead be performed over mini-batches. As with generalized EM, the M-step can also be modified to perform gradient-based optimization. It is also possible to perform the E-step only approximately, again using gradient-based optimization. This style of approach leads to a class of methods called stochastic variational inference (SVI) (Hoffman et al., 2013). Concretely, for each x^{(n)}

in the mini-batch (of size B) we can randomly initialize λ_0^{(n)} and perform gradient ascent on the ELBO with respect to λ for K steps,

λ_k^{(n)} = λ_{k−1}^{(n)} + η∇_λ ELBO(θ, λ_{k−1}^{(n)}; x^{(n)}),   k = 1, . . . , K.

Then the M-step, which updates θ, proceeds with the variational parameters λ_K^{(1)}, . . . , λ_K^{(B)}

4 That is, ∀x ∃λx such that q(z; λx) = p(z | x; θ) ∀z.

held fixed,

θ^{(i+1)} = θ^{(i)} + η∇_θ ∑_{n=1}^{B} E_{q(z; λ_K^{(n)})}[ log p(x^{(n)}, z; θ^{(i)}) ].

Variational inference is a rich field of active research, and we have only covered a small portion of it in this section. For example, we have not covered coordinate ascent variational inference, which allows for closed-form updates in the E-step for conditionally conjugate models. We refer the interested reader to Wainwright & Jordan

(2008), Blei et al. (2017), and Zhang et al. (2017) for further reading.
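The sketch below illustrates the SVI recipe just described on a toy model with an enumerable categorical latent variable, so that the ELBO can be computed exactly rather than estimated. The model, the inner step size, and the number of inner steps are illustrative assumptions, not settings from the thesis.

```python
import torch

torch.manual_seed(0)
K, V, d = 3, 10, 8          # illustrative sizes

# Generative model p(x, z; theta): uniform prior over z, deep Naive-Bayes-style likelihood.
emb, out = torch.nn.Embedding(K, d), torch.nn.Linear(d, V)
theta = list(emb.parameters()) + list(out.parameters())
theta_opt = torch.optim.SGD(theta, lr=0.1)

def log_joint(x, k):
    """log p(x, z = k; theta) with a uniform prior over z."""
    log_lik = torch.log_softmax(out(emb.weight[k]), dim=-1)[x].sum()
    return log_lik - torch.log(torch.tensor(float(K)))

def elbo(x, lam):
    """Exact ELBO for a categorical q(z; lam), computed by enumerating z."""
    q = torch.softmax(lam, dim=0)
    lj = torch.stack([log_joint(x, k) for k in range(K)])
    return (q * (lj - torch.log(q + 1e-12))).sum()

batch = [torch.tensor([1, 4, 2]), torch.tensor([7, 7, 3])]
inner_steps, eta = 20, 0.5

lams = []
for x in batch:                                # variational E-step: optimize lambda per example
    lam = torch.zeros(K, requires_grad=True)
    for _ in range(inner_steps):
        grad, = torch.autograd.grad(-elbo(x, lam), lam)
        with torch.no_grad():
            lam -= eta * grad
    lams.append(lam.detach())

theta_opt.zero_grad()                          # M-step: gradient step on theta, lambdas held fixed
total = -sum(elbo(x, lam) for x, lam in zip(batch, lams))
total.backward()
theta_opt.step()
print(float(-total))
```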

2.5.4 Amortized Variational Inference

The variational E-step requires that we (approximately) find the best variational parameters λ^{(n)} for each x^{(n)}. Even in mini-batch settings, this optimization procedure can be expensive, especially if a closed-form update is not available, which is typically the case in deep latent variable models. In such cases, one could rely on iterative methods to find approximately optimal variational parameters as in SVI (see previous section), but this may still be prohibitively expensive since each gradient calculation

∇_λ ELBO(θ, λ; x^{(n)}) requires backpropagating gradients through the generative model. As an alternative, we can instead predict the variational parameters by applying a neural network, called an inference network,5 to the input x^{(n)} for which we would like to calculate an approximate posterior:

λ(n) = enc(x(n); ϕ).

5Also referred to as a recognition network or an encoder.

Figure 2.3: (Top) Traditional variational inference uses variational parameters λ^{(n)} for each data point x^{(n)}. (Bottom) Amortized variational inference employs a global inference network ϕ that is run over the input x^{(n)} to produce the local variational distributions.

The inference network is trained to perform variational inference for all the data points, i.e.

max_ϕ ∑_{n=1}^{N} ELBO(θ, enc(x^{(n)}; ϕ); x^{(n)}).

The inference network parameters ϕ are themselves optimized with gradient ascent. Importantly, the same encoder network can be used for all x^{(n)} we are interested in, and it is therefore unnecessary to optimize a separate λ^{(n)} for each x^{(n)} we encounter. This style of inference is also known as amortized variational inference (AVI) (Kingma & Welling, 2014; Rezende et al., 2014; Mnih & Gregor, 2014), as the task of performing approximate posterior inference is amortized across each optimization step and data point through the shared encoder.6 This is illustrated in Figure 2.3. AVI is usually much faster than both SVI and traditional VI, as one can simply run the inference network over x^{(n)} to obtain the variational parameters, which should approximate the true posterior well if the inference network is sufficiently expressive and well-trained.

6Amortized inference is a more general concept and has been applied to a variety of settings, including structured prediction (Srikumar et al., 2012; Tu & Gimpel, 2018), Monte Carlo sampling (Paige & Wood, 2016), and Bethe free energy minimization (Wiseman & Kim, 2019).

Inference (E-step): min_{q(z)} KL[q(z) ∥ p(z | x; θ)]
  Exact posterior inference:                     q(z) = p(z | x; θ)
  Maximum a posteriori (MAP):                    q(z) = 1{z = arg max_z p(z | x; θ)}
  Coord. ascent variational inference (CAVI):    q(z_j) ∝ exp( E_{q(z_{−j})}[log p(x, z; θ)] )
  Stochastic variational inference (SVI):        q(z; λ),  λ = λ + η∇_λ ELBO(θ, λ; x)
  Amortized variational inference (AVI):         q(z; λ),  λ = enc(x; ϕ)

Learning (M-step): max_θ E_{q(z)}[log p(x, z; θ)]
  Analytic update:         θ = arg max_θ E_{q(z)}[log p(x, z; θ)]
  Gradient-based update:   θ = θ + η E_{p(z | x; θ)}[∇_θ log p(x, z; θ)]
  Other approximations:    θ ≈ arg max_θ E_{q(z)}[log p(x, z; θ)]

Table 2.1: A simplified landscape of the different optimization methods for training latent variable models with maximum likelihood learning. The “Expectation” or “Inference” step (E-step) corresponds to performing posterior inference, i.e. minimizing KL[q(z) ∥ p(z | x; θ)]. The “Maximization” or “Learning” step (M-step) corresponds to maximizing the complete data likelihood under the inferred posterior, i.e. E_{q(z)}[log p(x, z; θ)].

Summary Table 2.1 gives a greatly simplified overview of the different methods for training latent variable models with maximum likelihood learning. At a high level, many methods can be viewed as performing iterative optimization which alternates between performing an update on the auxiliary variational parameters (E-step) and an update on the model parameters under the inferred posterior (M-step). Table 2.2 shows how different combinations of E- and M-steps lead to different, commonly-utilized methods for learning latent variable models.

Method                     E-step            M-step
Expectation maximization   Exact posterior   Analytic
Log marginal likelihood    Exact posterior   Gradient
Hard EM                    MAP               Analytic
Variational EM             CAVI/SVI          Analytic/Gradient/Other
Variational autoencoder    AVI               Gradient

Table 2.2: Different combinations of E-step and M-step lead to different methods for learning latent variable models.
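For contrast with the SVI sketch above, the following sketch amortizes inference: a single shared encoder enc(x; ϕ) maps each input to its variational parameters in one forward pass, and θ and ϕ are trained jointly by gradient ascent on the ELBO. The bag-of-words encoder and all sizes are illustrative assumptions.

```python
import torch

torch.manual_seed(0)
K, V, d = 3, 10, 8          # illustrative sizes

# Generative model p(x, z; theta): uniform prior, deep Naive-Bayes-style likelihood.
emb, out = torch.nn.Embedding(K, d), torch.nn.Linear(d, V)
# Inference network enc(x; phi): bag-of-words features -> logits of q(z | x; phi).
enc = torch.nn.Sequential(torch.nn.Linear(V, d), torch.nn.Tanh(), torch.nn.Linear(d, K))

opt = torch.optim.Adam(
    list(emb.parameters()) + list(out.parameters()) + list(enc.parameters()), lr=1e-2)

def elbo(x):
    """Exact ELBO for one sentence, enumerating the K values of z."""
    bow = torch.zeros(V).index_add_(0, x, torch.ones(len(x)))
    q = torch.softmax(enc(bow), dim=0)                                  # lambda = enc(x; phi)
    log_lik = torch.log_softmax(out(emb.weight), dim=-1)[:, x].sum(-1)  # (K,)
    log_joint = log_lik - torch.log(torch.tensor(float(K)))
    return (q * (log_joint - torch.log(q + 1e-12))).sum()

batch = [torch.tensor([1, 4, 2]), torch.tensor([7, 7, 3])]
for step in range(100):
    opt.zero_grad()
    loss = -sum(elbo(x) for x in batch)   # theta and phi are updated jointly, end-to-end
    loss.backward()
    opt.step()
print(float(loss))
```

Unlike the SVI sketch, no per-example λ is stored or optimized; the encoder produces the variational parameters for any input in a single pass.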

2.5.5 Variational Autoencoders

Training deep latent variable models with amortized variational inference leads to a family of models called variational autoencoders (VAE) (Kingma & Welling, 2014).7

For latent variable models that factorize as

p(x, z; θ) = p(x | z; θ)p(z; θ), we can rearrange the ELBO as follows:

ELBO(θ, ϕ; x) = E_{q(z | x; ϕ)}[ log (p(x, z; θ) / q(z | x; ϕ)) ]
              = E_{q(z | x; ϕ)}[ log (p(x | z; θ) p(z; θ) / q(z | x; ϕ)) ]
              = E_{q(z | x; ϕ)}[ log p(x | z; θ) ] − KL[q(z | x; ϕ) ∥ p(z; θ)].

Above, for brevity we have written ELBO(θ, ϕ; x) in place of ELBO(θ, enc(x; ϕ); x), and q(z | x; ϕ) in place of q(z; enc(x; ϕ)), and we will use this notation going forward.

The autoencoder part of the name stems from the fact that the first term in the

7While the term “” is widely-used, it is somewhat less than ideal since it does not explicitly separate the model (i.e. the underlying generative model) from inference (i.e. how one performs inference in the model).

above derivation is the expected reconstruction likelihood of x given the latent variables z, which is roughly equivalent to an autoencoding objective. The second term can be viewed as a regularization term that pushes the variational distribution to be close to the prior.

Gradients In the standard VAE setup, the inference network and the generative model are jointly trained by maximizing the ELBO with gradient ascent:

θ^{(i+1)} = θ^{(i)} + η∇_θ ELBO(θ^{(i)}, ϕ^{(i)}; x^{(n)})
ϕ^{(i+1)} = ϕ^{(i)} + η∇_ϕ ELBO(θ^{(i)}, ϕ^{(i)}; x^{(n)}).

The above update uses a fixed learning rate for a single data point, but in practice adaptive learning rates and mini-batches are used. Note that unlike the coordinate ascent-style training from the previous section, θ and ϕ are trained together end-to-end.

We now derive the gradient expressions for both θ and ϕ. The gradient of the

ELBO with respect to θ is given by

∇_θ ELBO(θ, ϕ; x) = E_{q(z | x; ϕ)}[ ∇_θ log p(x, z; θ) ],

where we can push the gradient inside the expectation since the distribution over which we are taking the expectation does not depend on θ.8 For the gradient with

8This is valid under the assumption that we can differentiate under the integral sign (i.e. swap the gradient/integral signs), which holds under mild conditions, e.g. conditions which satisfy the hypotheses of the dominated convergence theorem.

32 respect to ϕ, we first separate the ELBO into two parts,

∇_ϕ ELBO(θ, ϕ; x) = ∇_ϕ E_{q(z | x; ϕ)}[ log (p(x, z; θ) / q(z | x; ϕ)) ]
                  = ∇_ϕ E_{q(z | x; ϕ)}[ log p(x, z; θ) ] − ∇_ϕ E_{q(z | x; ϕ)}[ log q(z | x; ϕ) ].

Unlike the case with θ, we cannot simply push the gradient sign inside the expectation since the distribution with which we are taking the expectation depends on ϕ.

We derive the gradients of the two terms in the above expression separately. The first term involves the score function gradient estimator (Glynn, 1987; Williams, 1992; Fu,

2006),

∇_ϕ E_{q(z | x; ϕ)}[ log p(x, z; θ) ]
  = ∇_ϕ ∫ log p(x, z; θ) × q(z | x; ϕ) dz
  = ∫ log p(x, z; θ) × ∇_ϕ q(z | x; ϕ) dz                              (differentiate under integral)
  = ∫ log p(x, z; θ) × q(z | x; ϕ) × ∇_ϕ log q(z | x; ϕ) dz             (since ∇q = q∇ log q)
  = E_{q(z | x; ϕ)}[ log p(x, z; θ) × ∇_ϕ log q(z | x; ϕ) ].

33 The second term is given by,

∇_ϕ E_{q(z | x; ϕ)}[ log q(z | x; ϕ) ]
  = ∇_ϕ ∫ log q(z | x; ϕ) × q(z | x; ϕ) dz
  = ∫ ∇_ϕ ( log q(z | x; ϕ) × q(z | x; ϕ) ) dz                                        (differentiate under integral)
  = ∫ q(z | x; ϕ)∇_ϕ log q(z | x; ϕ) + log q(z | x; ϕ)∇_ϕ q(z | x; ϕ) dz              (product rule)
  = ∫ ∇_ϕ q(z | x; ϕ) dz + ∫ log q(z | x; ϕ) × ∇_ϕ q(z | x; ϕ) dz                     (apply ∇q = q∇ log q)
  = 0 + ∫ log q(z | x; ϕ) × ∇_ϕ q(z | x; ϕ) dz                                        (since ∫ ∇q = ∇ ∫ q = ∇1 = 0)
  = ∫ log q(z | x; ϕ) × q(z | x; ϕ) × ∇_ϕ log q(z | x; ϕ) dz                          (apply ∇q = q∇ log q again)
  = E_{q(z | x; ϕ)}[ log q(z | x; ϕ) × ∇_ϕ log q(z | x; ϕ) ].

Putting it all together, we have

$$\nabla_\phi \mathrm{ELBO}(\theta, \phi; x) = \mathbb{E}_{q(z \mid x; \phi)}\left[\log \frac{p(x, z; \theta)}{q(z \mid x; \phi)} \times \nabla_\phi \log q(z \mid x; \phi)\right],$$
which is reminiscent of policy gradient-style reinforcement learning with reward given by
$$\log \frac{p(x, z; \theta)}{q(z \mid x; \phi)}.$$

This expectation can again be estimated with Monte Carlo samples. While the Monte

Carlo gradient estimator is unbiased it will suffer from high variance, and a common strategy to reduce the variance is to use a control variate B. First observe that we

can subtract B from the reward and not affect the gradient, i.e.

$$\nabla_\phi \mathrm{ELBO}(\theta, \phi; x) = \mathbb{E}_{q(z \mid x; \phi)}\left[\left(\log \frac{p(x, z; \theta)}{q(z \mid x; \phi)} - B\right) \nabla_\phi \log q(z \mid x; \phi)\right],$$
since

$$
\begin{aligned}
\mathbb{E}_{q(z \mid x; \phi)}[B \times \nabla_\phi \log q(z \mid x; \phi)] &= \int B \times q(z \mid x; \phi) \times \nabla_\phi \log q(z \mid x; \phi)\, dz \\
&= \int B \times \nabla_\phi q(z \mid x; \phi)\, dz \\
&= \nabla_\phi \int B \times q(z \mid x; \phi)\, dz \\
&= 0
\end{aligned}
$$
as long as B does not depend on the latent variable z or ϕ. We can then estimate the above gradient instead with Monte Carlo samples. Common techniques for B include a running average of previous rewards, a learned function (Mnih & Gregor, 2014), or combinations thereof. In chapters 4 and 5 we experiment with different control variates for training such models.

Pathwise Gradient Estimators The above expression for the gradient with re- gard to ϕ is agnostic with regard to the family of variational distributions q(z | x; ϕ).

Now let us derive an alternative gradient estimator which does assume a particular family. In particular, suppose our variational posterior family is multivariate Gaus- sian with diagonal covariance matrix, i.e.,

q(z | x; ϕ) = N (z; µ, diag(σ2)),

where the parameters are again given by the inference network

µ, σ2 = enc(x; ϕ).

(In general we use N (z; µ, Σ) to refer to the density of a multivariate Gaussian distri- bution with mean vector µ and the covariance matrix Σ). In this case we can exploit this choice of variational family to derive another estimator. In particular, we observe that our variational family of Gaussian distributions is reparameterizable (Kingma

& Welling, 2014; Rezende et al., 2014; Glasserman, 2013) in the sense that we can obtain a sample from the variational posterior by sampling from a base noise distribu- tion and applying a deterministic transformation,

ϵ ∼ N (ϵ; 0, I), z = µ + σϵ.

Observe that z remains distributed according to N (z; µ, diag(σ2)), but we may now express the gradient with respect to ϕ as

$$
\begin{aligned}
\nabla_\phi \mathbb{E}_{q(z \mid x; \phi)}\left[\log \frac{p(x, z; \theta)}{q(z \mid x; \phi)}\right]
&= \nabla_\phi \mathbb{E}_{\mathcal{N}(\epsilon; 0, I)}\left[\log \frac{p(x, z = \mu + \sigma\epsilon; \theta)}{q(z = \mu + \sigma\epsilon \mid x; \phi)}\right] \\
&= \mathbb{E}_{\mathcal{N}(\epsilon; 0, I)}\left[\nabla_\phi \log \frac{p(x, z = \mu + \sigma\epsilon; \theta)}{q(z = \mu + \sigma\epsilon \mid x; \phi)}\right],
\end{aligned}
$$
where the second equality follows since the distribution over which we take the expectation no longer depends on ϕ.
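As an illustration, here is a minimal sketch of a single-sample reparameterized ELBO estimate for a diagonal-Gaussian variational posterior (hypothetical helper names, not the thesis code): `enc` returns (μ, log σ) and `log_joint(x, z)` computes log p(x, z; θ).

```python
import torch

# Pathwise / reparameterized single-sample ELBO estimate for a diagonal Gaussian q.
def reparameterized_elbo(x, enc, log_joint):
    mu, log_sigma = enc(x)
    sigma = log_sigma.exp()
    eps = torch.randn_like(mu)        # eps ~ N(0, I)
    z = mu + sigma * eps              # z = mu + sigma * eps, distributed as q(z | x; phi)
    log_q = torch.distributions.Normal(mu, sigma).log_prob(z).sum(-1)
    # Gradients w.r.t. phi flow through z itself (the "path"), not only through log q.
    return log_joint(x, z) - log_q
```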

The estimator derived from the reparameterization trick just discussed is called a pathwise gradient estimator, and empirically it yields much lower-variance estimates than the score function gradient estimator.9 Intuitively, the pathwise gradient estimator "differentiates through" the generative model and therefore has more information than the score function gradient estimator, which treats the generative model as a black-box reward function. In this thesis, we utilize the pathwise estimator when possible, though we will also study latent variable models in which the reparameterization trick is not straightforwardly applicable, for example when z is discrete.

9However, there are cases where the score function gradient estimator has lower variance. See, for example, Mohamed et al. (2019).

| Ch. | Latent Variable | Generative Model | Learning & Inference |
|-----|-----------------|------------------|----------------------|
| 3 | Continuous vector | Autoregressive neural language model | Semi-amortized VI (amortized + stochastic VI) |
| 4 | Categorical variable | Sequence-to-sequence model with attention | Continuous relaxation with Gumbel-Softmax |
| 5 | Unlabeled binary parse tree | Recurrent neural network grammars | Posterior regularization with CRF inference network |
| 6 | Grammar & continuous vector | Compound probabilistic context-free grammars | Collapsed amortized VI with dynamic programming |

Table 2.3: Summary of the different latent variables, generative models, and learning & inference techniques explored in this thesis.

2.6 Thesis Roadmap

This thesis studies different latent variable models that target various language phe- nomena. Table 2.3 summarizes the different types of latent variables, generative mod- els, and learning/inference techniques explored in this thesis.

3 Latent Variable Model of Sentences & Semi-Amortized Variational Inference

3.1 Introduction

In this chapter we consider a continuous latent variable model of text where we assume that each sentence/paragraph is generated from a continuous vector. The generative model parameterizes the probability distribution over the next word through an autoregressive neural language model that composes the sentence-level latent variable with representations of previously generated tokens (Bowman et al., 2016), similar to the example model we saw in 2.4.1. One motivation for this type of approach is to model global properties of sentences with a latent vector while simultaneously using an autoregressive model to target local properties. However, it is well known that a straightforward application of amortized variational inference to train such models results in a phenomenon known as posterior collapse, whereby the generative model does not make use of the latent variable, ultimately rendering it meaningless (Bowman et al., 2016; Yang et al., 2017). This chapter develops a semi-amortized inference approach that augments amortized variational inference with stochastic variational inference. We find that this technique is able to partially mitigate posterior collapse in variational autoencoders even when utilizing an expressive generative model.

The material in this chapter is adapted from Kim et al. (2018a).

3.2 Background

We begin by reviewing the generative model from Bowman et al. (2016) and the repa- rameterization trick as it applies to this particular model. We then discuss an impor- tant issue called posterior collapse that arises when training latent variable models which utilize a fully-autoregressive generative model (3.2.2), as well as the amortiza- tion gap that results from suboptimal variational parameters obtained from a global inference network (3.2.3).

3.2.1 Generative Model

Let us revisit the variational autoencoder from 2.5.5 with a spherical Gaussian prior and a Gaussian variational family, and see how it may be learned in practice with gradient-based optimization utilizing pathwise gradient estimators. In this chapter we work with the following generative model from Bowman et al. (2016),

• First sample latent vector z ∼ N (z; 0, I), z ∈ Rd.

• Sample each token sequentially as follows:

$$x_t \sim p(x_t \mid x_{<t}, z), \qquad p(x_t \mid x_{<t}, z) = \mathrm{softmax}(W h_t + b)[x_t], \qquad h_t = \mathrm{LSTM}(h_{t-1}, [e_{x_{t-1}}; z]).$$

Concretely, the distribution over the next token is given by a softmax function over an affine transformation of a context vector ht, which is obtained through an LSTM network that conditions on the sentence-level latent variable z by feeding it as addi- tional input alongside the word embedding at each time step.

For learning θ, recall that we want to maximize the ELBO, which lower bounds the log marginal likelihood

$$
\begin{aligned}
\log p(x; \theta) \ge \mathrm{ELBO}(\theta, \phi; x) &= \mathbb{E}_{q(z \mid x; \phi)}\left[\log \frac{p(x, z; \theta)}{q(z \mid x; \phi)}\right] \\
&= \mathbb{E}_{q(z \mid x; \phi)}[\log p(x \mid z; \theta)] - \mathrm{KL}[q(z \mid x; \phi) \,\|\, \mathcal{N}(z; 0, I)].
\end{aligned}
$$

Define the variational family Q to be the set of Gaussian distributions with diagonal covariance matrices, whose parameters are predicted from an inference network. That is,

$$q(z \mid x; \phi) = \mathcal{N}(z; \mu, \mathrm{diag}(\sigma^2)), \quad \text{where} \quad \mu, \sigma^2 = \mathrm{enc}(x; \phi), \quad \mu \in \mathbb{R}^d, \; \sigma^2 \in \mathbb{R}^d_{\ge 0}.$$

A popular parameterization for the inference network enc(x; ϕ) is

s1:T = SeqNN(e1:T ; ϕ),

µ = MLP1(sT ),

log σ = MLP2(sT ).

As we saw in 2.5.5, an estimator for the gradient with respect to θ is simple to ob- tain. For the gradient with respect to ϕ, we first observe that the generative process factorizes the joint distribution as

p(x, z; θ) = p(x | z; θ) × N (z; 0, I), so we can also express the gradient of the ELBO with respect to ϕ as

∇ϕ ELBO(θ, ϕ; x) = ∇ϕEq(z | x;ϕ)[log p(x | z; θ)] − ∇ϕ KL[q(z | x; ϕ) ∥ N (z; 0, I)].

Beginning with the second term, the KL divergence between a diagonal Gaussian and the spherical Gaussian has an analytic solution given by

$$\mathrm{KL}[q(z \mid x; \phi) \,\|\, \mathcal{N}(z; 0, I)] = -\frac{1}{2} \sum_{j=1}^{d} \left(\log \sigma_j^2 - \sigma_j^2 - \mu_j^2 + 1\right),$$

and therefore $\nabla_\phi \mathrm{KL}[q(z \mid x; \phi) \,\|\, \mathcal{N}(z; 0, I)]$ is easy to calculate. For the first term, we use the pathwise gradient estimator,

$$
\begin{aligned}
\nabla_\phi \mathbb{E}_{q(z \mid x; \phi)}[\log p(x \mid z; \theta)] &= \nabla_\phi \mathbb{E}_{\mathcal{N}(\epsilon; 0, I)}[\log p(x \mid \mu + \sigma\epsilon; \theta)] \\
&= \mathbb{E}_{\mathcal{N}(\epsilon; 0, I)}[\nabla_\phi \log p(x \mid \mu + \sigma\epsilon; \theta)],
\end{aligned}
$$
where we approximate the expectation with a single sample. We can then perform end-to-end gradient-based training with respect to both the generative model θ and the inference network ϕ.

3.2.2 Posterior Collapse

We now discuss an important issue which affects training of text variational autoencoders with a fully autoregressive generative model. In the model introduced above, the likelihood model is allowed to fully condition on the entire history through the RNN's state h_t,

$$h_t = \mathrm{LSTM}(h_{t-1}, [e_{x_{t-1}}; z]), \qquad p(x_t = v \mid x_{<t}, z) = \mathrm{softmax}(W h_t + b)[v],$$

with the motivation for this approach being to capture global properties with z. However, Bowman et al. (2016) observe that these types of models experience posterior collapse (or latent variable collapse), whereby the likelihood model p(x | z; θ) ignores the latent variable and simply reduces to a non-latent variable language model. That is, x and z become independent. Indeed, looking at the ELBO in more detail, we observe that if the distribution over x can be modeled without z, the model is incentivized to make the variational posterior approximately equal to the prior,

$$\mathrm{KL}[q(z \mid x; \phi) \,\|\, p(z; \gamma)] \approx 0,$$
regardless of how expressively one parameterizes q(z | x; ϕ) (here γ parameterizes the prior). More formally, Chen et al. (2017b) show that this phenomenon may be justified under the "bits-back" argument: if the likelihood model is rich enough to model the true data distribution p⋆(x) without using any information from z, then the global optimum is obtained by setting

p(x | z; θ) = p⋆(x)

p(z | x; θ) = q(z | x; ϕ) = p(z; γ).

Since any distribution p(x) can be factorized autoregressively as

$$p(x) = p(x_1) \prod_{t=2}^{T} p(x_t \mid x_{<t}),$$

2017) or convolutional networks (Yang et al., 2017; Semeniuta et al., 2017; Shen et al.,

2018a) to parameterize the likelihood model, often at the cost of predictive accuracy

(i.e. perplexity). In this chapter we consider an alternative approach which targets posterior collapse while still utilizing expressive generative models.

3.2.3 Amortization Gap

Another issue in variational autoencoders is the amortization gap, which arises from restricting the variational family to be the class of distributions whose parameters are obtainable by running a parametric inference network over the input. While such a global parameterization allows for fast training/inference, it may be too strict of a restriction compared to methods such as stochastic variational inference (see 2.5.3) which obtain variational parameters for each datum via local optimization. In particular, letting λ⋆ be the best variational parameter for a given data point,

$$\lambda^\star = \operatorname*{arg\,min}_{\lambda}\, \mathrm{KL}[q(z; \lambda) \,\|\, p(z \mid x; \theta)],$$
we can break down the inference gap—the gap between the variational posterior from the inference network and the true posterior—as follows,

$$\underbrace{\mathrm{KL}[q(z \mid x; \phi) \,\|\, p(z \mid x; \theta)]}_{\text{inference gap}} = \underbrace{\mathrm{KL}[q(z; \lambda^\star) \,\|\, p(z \mid x; \theta)]}_{\text{approximation gap}} + \underbrace{\mathrm{KL}[q(z \mid x; \phi) \,\|\, p(z \mid x; \theta)] - \mathrm{KL}[q(z; \lambda^\star) \,\|\, p(z \mid x; \theta)]}_{\text{amortization gap}}.$$

Therefore the inference gap consists of two parts: the approximation gap, which is the gap between the true posterior and the best possible variational posterior within Q,

and the amortization gap, which quantifies the gap between the inference network posterior and the best possible variational posterior. Cremer et al. (2018) observe that this amortization gap in practice can be large even for richly parameterized inference networks.

3.3 Semi-Amortized Variational Autoencoders

In this chapter we consider a method that combines amortized and stochastic varia- tional inference to reduce the inference gap, which results in better training of gen- erative models and partially addresses the posterior collapse phenomenon described above. In particular we propose a semi-amortized approach to training deep latent variable models which combines amortized variational inference (AVI) with stochastic variational inference (SVI). This semi-amortized variational autoencoder (SA-VAE) is trained using a combination of AVI and SVI steps:

1. Sample x ∼ p⋆(x)

2. Set λ0 = enc(x; ϕ)

3. For k = 0,...,K − 1,

$$\lambda_{k+1} = \lambda_k + \alpha \nabla_\lambda \mathrm{ELBO}(\lambda_k, \theta; x)$$

4. Update θ based on $\frac{d\,\mathrm{ELBO}(\lambda_K, \theta; x)}{d\theta}$

5. Update ϕ based on $\frac{d\,\mathrm{ELBO}(\lambda_K, \theta; x)}{d\phi}$

As in AVI, we make use of a global inference network which outputs variational parameters (step 2). Then, as in SVI, we perform local optimization to refine the

variational parameters for K steps (step 3). Note that the above notation distinguishes between the gradient $\nabla_\theta f$ and the total derivative $\frac{df}{d\theta}$. To be more precise,

we use $\nabla_{u_i} f(\hat{u}) \in \mathbb{R}^{\dim(u_i)}$ to refer to the i-th block of the gradient of f evaluated at $\hat{u} = [\hat{u}_1, \dots, \hat{u}_m]$, and further use $\frac{df}{dv}$ to denote the total derivative of f with respect to v, which exists if u is a differentiable function of v. In general $\nabla_{u_i} f(\hat{u}) \ne \frac{df}{du_i}$ since other components of u could be a function of $u_i$. This will indeed be the case in our approach; when we calculate ELBO(λ_K, θ; x), λ_K is a function of the data point x, the generative model θ, and the inference network ϕ.

For training we need to compute the total derivative of the final ELBO with respect to θ, ϕ (i.e., steps 4 and 5 above). Unlike in AVI, in order to update the encoder and generative model parameters, this total derivative requires backpropagating through the SVI updates. Specifically this requires backpropagating through gradient ascent (Domke, 2012; Maclaurin et al., 2015). Following past work, this backpropagation step can be done efficiently with fast Hessian-vector products (LeCun et al., 1993; Pearlmutter, 1994). Consider the case where we perform one step of refinement,

λ1 = λ0 + α∇λ ELBO(λ0, θ; x), and for brevity let

L = ELBO(λ1, θ; x).

To backpropagate through this, we receive the derivative $\frac{dL}{d\lambda_1}$ and use the chain rule,
$$\frac{dL}{d\lambda_0} = \frac{d\lambda_1}{d\lambda_0}\frac{dL}{d\lambda_1} = \left(I + \alpha H_{\lambda,\lambda}\,\mathrm{ELBO}(\lambda_0, \theta; x)\right)\frac{dL}{d\lambda_1} = \frac{dL}{d\lambda_1} + \alpha H_{\lambda,\lambda}\,\mathrm{ELBO}(\lambda_0, \theta; x)\frac{dL}{d\lambda_1},$$
where $H_{u_i,u_j} f(\hat{u}) \in \mathbb{R}^{\dim(u_i) \times \dim(u_j)}$ is the matrix formed by taking the i-th group of rows and the j-th group of columns of the Hessian of f evaluated at $\hat{u}$. We can then backpropagate $\frac{dL}{d\lambda_0}$ through the inference network to calculate the total derivative, i.e.

Algorithm 1 Semi-Amortized Variational Autoencoders

Input: inference network ϕ, generative model θ, inference steps K, learning rate α, momentum γ, loss function f(λ, θ, x) = −ELBO(λ, θ; x)
Sample x ∼ p_D(x)
λ0 ← enc(x; ϕ)
v0 ← 0
for k = 0 to K − 1 do
    vk+1 ← γ vk − ∇_λ f(λk, θ, x)
    λk+1 ← λk + α vk+1
end for
L ← f(λK, θ, x)
λ̄K ← ∇_λ f(λK, θ, x)
θ̄ ← ∇_θ f(λK, θ, x)
v̄K ← 0
for k = K − 1 to 0 do
    v̄k+1 ← v̄k+1 + α λ̄k+1
    λ̄k ← λ̄k+1 − H_{λ,λ} f(λk, θ, x) v̄k+1
    θ̄ ← θ̄ − H_{θ,λ} f(λk, θ, x) v̄k+1
    v̄k ← γ v̄k+1
end for
dL/dθ ← θ̄
dL/dϕ ← (dλ0/dϕ) λ̄0
Update θ, ϕ based on dL/dθ, dL/dϕ

$\frac{dL}{d\phi} = \frac{d\lambda_0}{d\phi}\frac{dL}{d\lambda_0}$. Similar rules can be used to derive $\frac{dL}{d\theta}$.1 The full forward/backward step, which uses gradient descent with momentum on the negative ELBO, is shown in

Algorithm 1.
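For intuition, the following is a minimal sketch of the semi-amortized idea using automatic differentiation (`create_graph=True`) to backpropagate through the SVI refinement steps, rather than the momentum updates and finite-difference Hessian-vector products of Algorithms 1 and 2. `enc` and `neg_elbo(lmbda, x)` (playing the role of f(λ, θ, x)) are hypothetical.

```python
import torch

# Semi-amortized loss: initialize lambda from the inference network, refine it for K
# gradient steps, and return the final negative ELBO so that total derivatives flow
# back to both theta and phi through the refinement.
def sa_vae_loss(x, enc, neg_elbo, K=10, alpha=1.0):
    lmbda = enc(x)                                   # lambda_0 from the inference network
    for _ in range(K):                               # SVI refinement (step 3)
        grad, = torch.autograd.grad(neg_elbo(lmbda, x), lmbda, create_graph=True)
        lmbda = lmbda - alpha * grad                 # keep the graph for d(loss)/d(phi)
    return neg_elbo(lmbda, x)
```

This trades the memory efficiency of the finite-difference approach for simplicity; the thesis uses the finite-difference variant precisely because it scales better to large models.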

3.3.1 Implementation Details

In our implementation we calculate Hessian-vector products with finite differences

(LeCun et al., 1993; Domke, 2012), which was found to be more memory-efficient than automatic differentiation, and therefore crucial for scaling our approach to rich

1We refer the reader to Domke (2012) for the full derivation.

inference networks/generative models. Specifically, we estimate $H_{u_i, u_j} f(\hat{u})\, v$ with

$$H_{u_i, u_j} f(\hat{u})\, v \approx \frac{1}{\epsilon}\left(\nabla_{u_i} f(\hat{u}_1, \dots, \hat{u}_j + \epsilon v, \dots, \hat{u}_m) - \nabla_{u_i} f(\hat{u}_1, \dots, \hat{u}_j, \dots, \hat{u}_m)\right),$$
where ϵ is some small number (we use ϵ = 10^{-5}).2 We further clip the results (i.e. rescale the results if the norm exceeds a threshold) before and after each Hessian-vector product as well as during SVI, which helped mitigate exploding gradients and further gave better training signal to the inference network.
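A minimal sketch of this finite-difference Hessian-vector product, shown here for H_{λ,λ} f(λ, θ, x) v with a hypothetical `neg_elbo(lmbda, x)`:

```python
import torch

def grad_lambda(neg_elbo, lmbda, x):
    # Gradient of f with respect to lambda at a given point.
    lmbda = lmbda.detach().requires_grad_(True)
    return torch.autograd.grad(neg_elbo(lmbda, x), lmbda)[0]

def hvp_lambda(neg_elbo, lmbda, x, v, eps=1e-5):
    # (grad f(lambda + eps v) - grad f(lambda)) / eps  ~=  H_{lambda,lambda} f(lambda) v.
    # As noted in footnote 2, the same sampling/dropout noise should be reused
    # for both gradient evaluations; this sketch does not enforce that.
    g_plus = grad_lambda(neg_elbo, lmbda + eps * v, x)
    g_zero = grad_lambda(neg_elbo, lmbda, x)
    return (g_plus - g_zero) / eps
```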


Concretely, we define the clip(·) function as
$$\mathrm{clip}(u, \eta) = \begin{cases} \dfrac{\eta}{\|u\|_2}\, u, & \text{if } \|u\|_2 > \eta \\ u, & \text{otherwise,} \end{cases}$$
and use this to clip the results at various points. Algorithm 2 shows the modified version of Algorithm 1 which makes use of clipping,3 and we use this to perform end-to-end training of our generative model θ and inference network ϕ.

2Since in our case the ELBO is a non-deterministic function due to sampling (and dropout, if applicable), care must be taken when calculating Hessian-vector products with finite differences to ensure that the source of randomness is the same when calculating the two gradient expressions. 3Without gradient clipping, in addition to numerical issues we empirically observed the model to degenerate to a case whereby it learned to rely too much on iterative inference, and thus the initial parameters from the inference network were poor. Another way to provide better signal to the inference network is to train against a weighted sum $\sum_{k=0}^{K} w_k\, \mathrm{ELBO}(\lambda_k, \theta; x)$ for $w_k \ge 0$.

Algorithm 2 Semi-Amortized Variational Autoencoders with Clipping

Input: inference network ϕ, generative model θ, inference steps K, learning rate α, momentum γ, loss function f(λ, θ, x) = −ELBO(λ, θ; x), gradient clipping parameter η
Sample x ∼ p_D(x)
λ0 ← enc(x; ϕ)
v0 ← 0
for k = 0 to K − 1 do
    vk+1 ← γ vk − clip(∇_λ f(λk, θ, x), η)
    λk+1 ← λk + α vk+1
end for
L ← f(λK, θ, x)
λ̄K ← ∇_λ f(λK, θ, x)
θ̄ ← ∇_θ f(λK, θ, x)
v̄K ← 0
for k = K − 1 to 0 do
    v̄k+1 ← v̄k+1 + α λ̄k+1
    λ̄k ← λ̄k+1 − H_{λ,λ} f(λk, θ, x) v̄k+1
    λ̄k ← clip(λ̄k, η)
    θ̄ ← θ̄ − clip(H_{θ,λ} f(λk, θ, x) v̄k+1, η)
    v̄k ← γ v̄k+1
end for
dL/dθ ← θ̄
dL/dϕ ← (dλ0/dϕ) λ̄0
Update θ, ϕ based on dL/dθ, dL/dϕ
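For completeness, the norm-based clip(·) used above is a one-liner; a sketch in PyTorch:

```python
import torch

# Rescale u to have l2 norm eta if its norm exceeds eta; leave it unchanged otherwise.
def clip(u, eta):
    norm = u.norm(p=2)
    return u * (eta / norm) if norm > eta else u
```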

3.4 Empirical Study

3.4.1 Experimental Setup

Data We apply our approach to train a generative model of text on the commonly- used Yahoo questions corpus from Yang et al. (2017), which has 100K examples for training and 10K examples for validation/test. Each example consists of a question

followed by an answer from Yahoo Answers. The preprocessed dataset from Yang et al. (2017) takes the top 20K words as the vocabulary after lower-casing all tokens.

Hyperparameters The architecture and hyperparameters are identical to the

LSTM-VAE baselines considered in Yang et al. (2017), except that we train with SGD instead of Adam, which was found to perform better for training LSTMs. Specifically, both the inference network and the generative model are one-layer LSTMs with 1024 hidden units and 512-dimensional word embeddings. The last hidden state of the en- coder is used to predict the vector of variational posterior means/log variances, as outlined in 3.2.1. The reparameterized sample from the variational posterior is used to predict the initial hidden state of the generative LSTM and additionally fed as input at each time step. The latent variable is 32-dimensional. Following previous works (Bowman et al., 2016; Sønderby et al., 2016; Yang et al., 2017), we utilize a

KL-cost annealing strategy whereby the multiplier on the KL term is increased lin- early from 0.1 to 1.0 each batch over 10 epochs. All models are trained with stochas- tic gradient descent with batch size 32 and learning rate 1.0, where the learning rate starts decaying by a factor of 2 each epoch after the first epoch at which validation performance does not improve. This learning rate decay is not triggered for the first

15 epochs to ensure adequate training. We train for 30 epochs or until the learning rate has decayed 5 times, which was enough for convergence for all models. The model parameters are initialized over U(−0.1, 0.1) and gradients are clipped at 5. For models trained with iterative inference we perform SVI via stochastic gradient descent with momentum 0.5 and learning rate 1.0. Gradients are clipped during the forward/backward SVI steps, also at 5 (see Algorithm 2).

Baselines In addition to autoregressive/VAE/SVI baselines, we consider two other approaches that also combine amortized inference with iterative refinement. The first approach is from Krishnan et al. (2018), where the generative model takes a gradient step based on the final variational parameters and the inference network takes a gradient step based on the initial variational parameters, i.e. we update θ based on $\nabla_\theta \mathrm{ELBO}(\lambda_K, \theta; x)$ and update ϕ based on $\frac{d\lambda_0}{d\phi} \nabla_\lambda \mathrm{ELBO}(\lambda_0, \theta; x)$. The forward step (steps 1-3) is identical to SA-VAE. We refer to this baseline as VAE + SVI. In the second approach, based on Salakhutdinov & Larochelle (2010) and Hjelm et al. (2016), we train the inference network to minimize the KL-divergence between the initial and the final variational distributions, keeping the latter fixed. Specifically, letting $g(\nu, \omega) = \mathrm{KL}[q(z \mid x; \nu) \,\|\, q(z \mid x; \omega)]$, we update θ based on $\nabla_\theta \mathrm{ELBO}(\lambda_K, \theta; x)$ and update ϕ based on $\frac{d\lambda_0}{d\phi} \nabla_\nu g(\lambda_0, \lambda_K)$. Note that the inference network is not updated based on $\frac{dg}{d\phi}$, which would take into account the fact that both λ_0 and λ_K are functions of ϕ. We found $g(\lambda_0, \lambda_K)$ to perform better than the reverse direction $g(\lambda_K, \lambda_0)$. We refer to this setup as VAE + SVI + KL.

Code Our code is available at https://github.com/harvardnlp/sa-vae.

3.4.2 Results

Results from the various models are shown in Table 3.1. Our baseline models (LM/VAE/SVI in Table 3.1) are already quite strong, however models trained with VAE/SVI make negligible use of the latent variable and practically collapse to a language model, as first observed by Bowman et al. (2016).4 In contrast, models that combine amortized inference with iterative refinement make use of the latent space and the KL term is significantly above zero.5 VAE + SVI and VAE + SVI + KL do not outperform a language model, while SA-VAE modestly outperforms it.

4Models trained with word dropout (+ Word-Drop in Table 3.1) do make use of the latent space but significantly underperform a language model.

| Model | NLL | KL | PPL |
|-------|-----|----|-----|
| LSTM-LM | 334.9 | − | 66.2 |
| LSTM-VAE | ≤ 342.1 | 0.0 | ≤ 72.5 |
| LSTM-VAE + Init | ≤ 339.2 | 0.0 | ≤ 69.9 |
| CNN-LM | 335.4 | − | 66.6 |
| CNN-VAE | ≤ 333.9 | 6.7 | ≤ 65.4 |
| CNN-VAE + Init | ≤ 332.1 | 10.0 | ≤ 63.9 |
| LM | 329.1 | − | 61.6 |
| VAE | ≤ 330.2 | 0.01 | ≤ 62.5 |
| VAE + Init | ≤ 330.5 | 0.37 | ≤ 62.7 |
| VAE + Word-Drop 25% | ≤ 334.2 | 1.44 | ≤ 65.6 |
| VAE + Word-Drop 50% | ≤ 345.0 | 5.29 | ≤ 75.2 |
| SVI (K = 10) | ≤ 331.4 | 0.16 | ≤ 63.4 |
| SVI (K = 20) | ≤ 330.8 | 0.41 | ≤ 62.9 |
| SVI (K = 40) | ≤ 329.8 | 1.01 | ≤ 62.2 |
| VAE + SVI (K = 10) | ≤ 331.2 | 7.85 | ≤ 63.3 |
| VAE + SVI (K = 20) | ≤ 330.5 | 7.80 | ≤ 62.7 |
| VAE + SVI + KL (K = 10) | ≤ 330.3 | 7.95 | ≤ 62.5 |
| VAE + SVI + KL (K = 20) | ≤ 330.1 | 7.81 | ≤ 62.3 |
| SA-VAE (K = 10) | ≤ 327.6 | 5.13 | ≤ 60.5 |
| SA-VAE (K = 20) | ≤ 327.5 | 7.19 | ≤ 60.4 |

Table 3.1: Results on the Yahoo dataset. Top results are from Yang et al. (2017), while the bottom results are from this work. For latent variable models we show the negative ELBO which upper bounds the negative log likelihood (NLL). Models with + Init mean the encoder is initialized with a pretrained language model, while models with + Word-Drop are trained with word dropout. The KL portion of the ELBO indicates latent variable usage, and PPL refers to perplexity. K refers to the number of inference steps used for training/testing.

Figure 3.1: (Left) Perplexity upper bound of various models when trained with 20 steps (except for VAE) and tested with varying number of SVI steps from random initialization. (Right) Same as the left except that SVI is initialized with variational parameters obtained from the inference network.

One might wonder if the improvements are coming from simply having a more flexible inference scheme at test time, rather than from learning a better generative model. To test this, for the various models we discard the inference network at test time and perform SVI for a variable number of steps from random initialization. The results are shown in Figure 3.1 (left). It is clear that the learned generative model

(and the associated ELBO landscape) is quite different—it is not possible to train with VAE and perform SVI at test time to obtain the same performance as SA-VAE

(although the performance of VAE does improve slightly from 62.7 to 62.3 when we run SVI for 40 steps from random initialization). Figure 3.1 (right) has the results for a similar experiment where we refine the variational parameters initialized from

5A high KL term does not necessarily imply that the latent variable is being utilized in a meaningful way (it could simply be due to bad optimization). In later sections we investigate the learned latent space in more detail.

the inference network for a variable number of steps at test time. We find that the inference network provides better initial parameters than random initialization and thus requires fewer iterations of SVI to reach the optimum. We do not observe improvements from running more refinement steps at test time than were used in training.

Interestingly, SA-VAE without any refinement steps at test time has a substantially nonzero KL term (KL = 6.65, PPL = 62.0). This indicates that the posterior collapse phenomenon when training LSTM-based VAEs for text is partially due to optimiza- tion issues. Finally, while Yang et al. (2017) found that initializing the encoder with a pretrained language model improved performance (+ Init in Table 3.1), we did not observe this on our baseline VAE model when we trained with SGD and hence did not pursue this further.

3.4.3 Analysis of Latent Variables

We investigate what the latent variables are learning through saliency analysis with our best model (SA-VAE trained with 20 steps). Specifically, we calculate the output saliency of each token xt with respect to z as

$$\mathbb{E}_{q(z; \lambda)}\left[\left\|\frac{d \log p(x_t \mid x_{<t}, z)}{dz}\right\|_2\right],$$

where ∥ · ∥2 is the l2 norm and the expectation is approximated with 5 samples from the variational posterior. Saliency is therefore a measure of how much the latent variable is being used to predict a particular token. We visualize the output saliency of a few examples from the test set in Figure 3.2 (top).

Figure 3.2: (Top) Output saliency visualization of some examples from the test set. Here the saliency values are rescaled to be between 0-100 within each example for easier visualization. Red indicates higher saliency values. (Middle) Input saliency of the first test example from the top (in blue), in addition to two sample outputs generated from the variational posterior (with their saliency values in red). (Bottom) Same as the middle except we use a made-up example.

The previous definition of saliency measures the influence of z on the output x_t. We can also roughly measure the influence of the input x_t on the latent representation z, which we refer to as input saliency:

$$\mathbb{E}_{q(z; \lambda)}\left[\left\|\frac{d\, \|z\|_2}{d w_t}\right\|_2\right].$$

Here w_t is the encoder word embedding for x_t.6 We visualize the input saliency for a test example (Figure 3.2, middle) and a made-up example (Figure 3.2, bottom).

6As the norm of z is a rather crude measure, a better measure would be obtained by analyzing the spectra of the Jacobian $\frac{dz}{dw_t}$. However this is computationally too expensive to calculate for each token in the corpus.

Under each input example we also visualize two samples from the variational posterior, and find that the generated examples are often meaningfully related to the input example.7 From a qualitative analysis of saliency, several things are apparent: the latent variable seems to encode question type (i.e. if, what, how, why, etc.) and therefore saliency is high for the first word; content words (nouns, adjectives, lexical verbs) have much higher saliency than function words (determiners, prepositions, conjunctions, etc.); saliency of the ⟨eos⟩ token is quite high, indicating that the length information is also encoded in the latent space. In the third example we observe that the left parenthesis has higher saliency than the right parenthesis (0.32 vs. 0.24 on average across the test set), as the latter can be predicted by conditioning on the former rather than on the latent representation z.

We quantitatively analyze output/input saliency across part-of-speech, token position, and word frequency in Figure 3.3: nouns (NN), adjectives (JJ), verbs (VB), numbers (CD), and the ⟨eos⟩ token have higher saliency than conjunctions (CC), determiners (DT), prepositions (IN), and the TO token—the latter are relatively easier to predict by conditioning on previous tokens; similarly, on average, tokens occurring earlier have much higher saliency than those occurring later (Figure 3.3 shows absolute position but the plot is similar with relative position); the latent variable is used much more when predicting rare tokens. These results seem to suggest that the latent variables are encoding interesting and potentially interpretable aspects of language.

While left as future work, it is possible that manipulations in the latent space of a model learned this way could lead to controlled generation/manipulation of output text (Hu et al., 2017; Mueller et al., 2017).

7 We first sample z ∼ q(z | x; λ_K) then x ∼ p(x | z; θ). When sampling x_t ∼ p(x_t | x_{<t}, z; θ),

Figure 3.3: Saliency by part-of-speech tag, position, and log frequency for the output (top) and the input (bottom). See text for the definitions of input/output saliency. The dotted gray line in each plot shows the average saliency across all words.

3.5 Discussion

3.5.1 Limitations

A drawback of our approach (and other non-amortized inference methods) is that each training step requires backpropagating through the generative model multiple times, which can be costly especially if the generative model is expensive to com- pute. This may potentially be mitigated through more sophisticated meta learning approaches (Andrychowicz et al., 2016; Marino et al., 2018), or with more efficient use of the past gradient information during SVI via averaging (Schmidt et al., 2013) or importance sampling (Sakaya & Klami, 2017). One could also consider employ- ing synthetic gradients (Jaderberg et al., 2017) to limit the number of backpropaga- tion steps during training. Krishnan et al. (2018) observe that it is more important to

57 train with iterative refinement during earlier stages (we also observed this in prelimi- nary experiments), and therefore annealing the number of refinement steps as training progresses could also speed up training.

Our approach is mainly applicable to variational families that lend themselves to differentiable optimization (e.g. gradient ascent) with respect to the ELBO. In contrast, VAE + SVI and VAE + SVI + KL are applicable to more general optimization algorithms.

3.5.2 Posterior Collapse: Optimization vs. Underlying Model

From the perspective of learning meaningful, interesting, and useful latent variables, one might question the prudence of using an autoregressive model that fully conditions on its entire history (as opposed to assuming some conditional independence), given that p(x) can always be factorized as $\prod_{t=1}^{T} p(x_t \mid x_{<t})$.

2019), and (3) using a modified objective that does not penalize KL-usage up to a certain threshold (Li et al., 2019b), all allow for the use of fully autoregressive generative models that do not ignore the latent variable.

8Indeed, such an argument has been used to favor generative models that exhibit indepen- dence properties conditioned on z (Chen et al., 2017b; Miao et al., 2017; Yang et al., 2017), and in chapter 6 we experiment further with such conditionally Markov models.

58 3.6 Related Work

The semi-amortized approach introduced in this chapter is closely related to the line of work which uses a separate model to initialize variational parameters and subse- quently updates them through an iterative procedure (Salakhutdinov & Larochelle,

2010; Cho et al., 2013; Salimans et al., 2015; Hjelm et al., 2016; Krishnan et al., 2018;

Pu et al., 2017). Marino et al. (2018) utilize meta-learning to train an inference net- work which learns to perform iterative inference by training a deep model to output the variational parameters for each time step.

While differentiating through inference/optimization was initially explored by var- ious researchers primarily outside the area of deep learning (Stoyanov et al., 2011;

Domke, 2012; Brakel et al., 2013), they have more recently been explored in the con- text of hyperparameter optimization (Maclaurin et al., 2015) and as a differentiable layer of a deep model (Belanger et al., 2017; Kim et al., 2017; Metz et al., 2017; Amos

& Kolter, 2017).

Initial work on VAE-based approaches to image modeling focused on simple gen- erative models that assumed independence among pixels conditioned on the latent variable (Kingma & Welling, 2014; Rezende et al., 2014). More recent works have obtained substantial improvements in log-likelihood and sample quality through utiliz- ing powerful autoregressive models (PixelCNN) as the generative model (Chen et al.,

2017b; Gulrajani et al., 2017). In contrast, modeling text with VAEs has remained challenging. Bowman et al. (2016) found that using an LSTM generative model re- sulted in a degenerate case whereby the variational posterior collapsed to the prior and the generative model ignored the latent code (even with richer variational fam- ilies). Many works on VAEs for text have thus made simplifying conditional inde-

59 pendence assumptions (Miao et al., 2016), used less powerful generative models such as convolutional networks (Yang et al., 2017; Semeniuta et al., 2017), or combined a recurrent generative model with a topic model (Dieng et al., 2017; Wang et al.,

2018a). Note that unlike sequential VAEs that employ different latent variables at each time step (Chung et al., 2015; Fraccaro et al., 2016; Krishnan et al., 2017; Serban et al., 2017; Goyal et al., 2017a), in this work we focus on modeling the entire sequence with a global latent variable.

Finally, since our work only addresses the amortization gap (the gap between the log-likelihood and the ELBO due to amortization) and not the approximation gap

(due to the choice of a particular variational family) (Cremer et al., 2018), it can be combined with existing work on employing richer posterior/prior distributions within the VAE framework (Rezende & Mohamed, 2015a; Kingma et al., 2016; John- son et al., 2016; Tran et al., 2016; Goyal et al., 2017b; Guu et al., 2017; Tomczak &

Welling, 2018).

3.7 Conclusion

In this chapter we have considered a continuous latent variable model of text where each sentence/paragraph is generated from a continuous vector. We introduced an improved, semi-amortized approach for training such models, which combines amortized variational inference with stochastic variational inference. With this approach we found that we were able to train deep latent variable models of text with an expressive autoregressive generative model that does not ignore the latent variable. Training generative models that both model the underlying data well and learn useful latent representations is an important avenue for future work.

4 Latent Variable Model of Attention & Relaxations of Discrete Spaces

4.1 Introduction

In this chapter we consider a discrete latent variable model of attention within the sequence-to-sequence learning framework (Bahdanau et al., 2015). We model latent

The material in this chapter is adapted from Deng et al. (2018).

61 alignment between source/target sentences as a sequence of discrete latent variables in the context of neural machine translation (MT) systems.

Learning latent word alignments from parallel sentences has long been a core problem in statistical MT and represents one of the most successful applications of latent variable modeling in NLP, starting with the seminal IBM models (Brown et al., 1993), HMM-based alignment models (Vogel et al., 1996), and a fast log-linear reparameterization of the IBM 2 model (Dyer et al., 2013). In contrast to statistical MT systems, neural MT systems directly model the full distribution over the target sentence conditioned on the source sentence without explicitly modeling alignments as an intermediate step. They instead make use of soft alignments via the soft attention mechanism (Bahdanau et al., 2015). Alongside components such as residual blocks and batch/layer normalization, soft attention provides a rich neural network building block for controlling gradient flow and encoding inductive biases, and is now part of the standard deep learning toolkit. However, more so than these other components, which are generally treated as black boxes, researchers often use intermediate attention decisions directly as a tool for model interpretability (Lei et al., 2016; Alvarez-

Melis & Jaakkola, 2017a) or as a factor in final predictions (Shin et al., 2017). From this perspective, soft attention plays the role of implicitly approximating latent align- ments as in statistical MT systems (Brown et al., 1993; Koehn et al., 2007).

While soft attention works well empirically, explicitly modeling alignment as a la- tent variable remains appealing for several reasons: (1) latent variables facilitate rea- soning about dependencies in a modular and probabilistically principled way, e.g. al- lowing composition with other models, (2) posterior inference provides a better basis for model analysis and partial predictions than strictly feed-forward models, which have been shown to underperform on alignment in machine translation (Koehn &

62 Knowles, 2017), and finally (3) latent variable modeling may lead to better predic- tive accuracy when properly trained. However, straightforwardly modeling attention at each time step as a categorical latent variable (hard or discrete attention) has been known to underperform (deterministic) soft attention, aside from a few exceptions

(Xu et al., 2015). In this chapter we revisit discrete attention models and show that their underperformance vis-à-vis soft attention is due to a large gap between the log marginal likelihood and the lower bound objective optimized by previous discrete at- tention models. We propose an efficient variational attention approach that closes this gap by training an inference network over the source and target sentence which approximates the posterior attention distribution. Optimizing the evidence lower bound with discrete latent variables is challenging since the reparameterization trick described in 2.5.5 is not directly applicable. We experiment with two approaches to obtaining low-variance estimators: (1) a score function gradient estimator with a variance-reducing baseline from an auxiliary soft attention model and, (2) a path- wise gradient estimator using the Gumbel-Softmax distribution (Jang et al., 2017;

Maddison et al., 2017) which relaxes the discrete space into a continuous space. We find that with these approaches, we are able to learn discrete attention models which outperform their soft counterparts and at the same time allow for greater model in- trospection through posterior inference.

4.2 Background

We briefly review the attention mechanism in neural sequence-to-sequence learning, in particular soft attention (4.2.1) and hard/discrete attention (4.2.2). We also discuss the Gumbel-Softmax distribution for approximately obtaining pathwise gradient estimators for discrete distributions (4.2.3).

4.2.1 Soft Attention

In sequence-to-sequence learning with attention networks (Bahdanau et al., 2015), we model the distribution over the target sentence y = [y1, . . . , yT ] given the source sentence x = [x1, . . . , xS] with an autoregressive model,

$$p(y \mid x; \theta) = \prod_{t=1}^{T} p(y_t \mid x, y_{<t}; \theta), \qquad p(y_t \mid x, y_{<t}; \theta) = \mathrm{softmax}(W c_t + b)[y_t].$$

The vector ct combines the previous decoder hidden state with an attention over en- coder hidden states.1 Concretely, we first run sequential neural networks over the source and target to obtain contextualized vector representations of each word,

s1:S = SeqNN(x1:S; θencoder),

h1:T = SeqNN(y1:T ; θdecoder),

1Note that our use of “encoder” here refers to the sequential neural network over the source sentence, and not the inference network in variational autoencoders.

64 where x1:S and y1:T are the source/target word embeddings. Then, we obtain an at- tention distribution over the source hidden states via a softmax,

$$\alpha_{t,i} = \frac{\exp\big(f(s_i, h_{t-1}; \theta_{\mathrm{attention}})\big)}{\sum_{j=1}^{S} \exp\big(f(s_j, h_{t-1}; \theta_{\mathrm{attention}})\big)},$$

where $f : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ computes an affinity between the target word $y_t$ and the source word $x_i$ (e.g. a dot product $s_i^\top h_{t-1}$ in the simplest case, in which case $\theta_{\mathrm{attention}} = \emptyset$). We subsequently use the attention distribution to obtain a convex combination of source hidden states, which is finally combined with $h_{t-1}$ to produce $c_t$,

$$\hat{s}_t = \sum_{i=1}^{S} \alpha_{t,i}\, s_i, \qquad c_t = \mathrm{MLP}([h_{t-1}; \hat{s}_t]; \theta_{\mathrm{context}}).$$

We refer to the combination of source hidden states ˆst as the context vector since it aggregates the context source sentence at each time step t. In this encoder-decoder model the parameters are given by

θ = {θencoder, θdecoder, θattention, θcontext, W, b, X, Y}, where X/Y are the source/target word embeddings.

In practice many variants exist, including different parameterizations of SeqNN

(e.g. LSTMs (Luong et al., 2015), GRUs (Bahdanau et al., 2015), and Transformers

(Vaswani et al., 2017)), and the attention scoring function f(·) (e.g. MLP attention

(Bahdanau et al., 2015) and bilinear attention (Luong et al., 2015)). Empirically the soft attention mechanism works remarkably well, and alignment distributions learned this way often correspond to intuition (e.g. word alignment in machine translation).

4.2.2 Hard Attention

Note that there are no latent variables in the soft attention mechanism; it is a deter- ministic feed-forward function. In contrast, in hard or discrete attention we posit a categorical latent variable zt ∈ {1,...,S} at each time step whose distribution is given by the attention scores, i.e.

$$p(z_t = i \mid x, y_{<t}; \theta) = \frac{\exp\big(f(s_i, h_{t-1}; \theta_{\mathrm{attention}})\big)}{\sum_{j=1}^{S} \exp\big(f(s_j, h_{t-1}; \theta_{\mathrm{attention}})\big)}.$$

Under this formulation, the distribution over the target token at each time step is given by marginalizing over the unobserved latent variable z_t,
$$\log p(y_t \mid y_{<t}, x; \theta) = \log\left(\sum_{i=1}^{S} p(z_t = i \mid x, y_{<t}; \theta)\, p(y_t \mid y_{<t}, z_t = i, x; \theta)\right),$$
where
$$p(y_t \mid y_{<t}, z_t, x; \theta) = \mathrm{softmax}\big(W\, \mathrm{MLP}([h_t; \bar{s}_t]) + b\big)[y_t], \qquad \bar{s}_t = s_{z_t}.$$

That is, the context vector ¯st in hard attention is obtained by a hard selection of a vector in s1:S, instead of a convex combination as in soft attention. Another way to

illuminate the difference between the two is in terms of an expectation over z_t,

$$\text{Hard Attention:}\quad \log p(y_t \mid y_{<t}, x) = \log \mathbb{E}_{p(z_t \mid x, y_{<t})}\big[\mathrm{softmax}(W\,\mathrm{MLP}([h_t; \bar{s}_t]) + b)[y_t]\big],$$

$$\text{Soft Attention:}\quad \log p(y_t \mid y_{<t}, x) = \log \mathrm{softmax}\big(W\,\mathrm{MLP}([h_t; \mathbb{E}_{p(z_t \mid x, y_{<t})}[\bar{s}_t]]) + b\big)[y_t],$$

where
$$\mathbb{E}_{p(z_t \mid x, y_{<t})}[\bar{s}_t] = \sum_{i=1}^{S} \alpha_{t,i}\, s_i = \hat{s}_t.$$

That is, whereas hard attention has an expectation after the softmax function over the vocabulary, soft attention has an expectation before the softmax. This makes hard attention O(S) times more expensive than soft attention, since the softmax must be

run S times for each possible value of z_t.2 In cases where there are multiple attention steps (e.g. as in multi-hop attention) or the source sentence length S is large, calculating the log marginal likelihood exactly via enumeration over z_t can be prohibitively expensive. In this case, a common strategy is to maximize a lower bound on the log marginal likelihood using Jensen's inequality,

$$\log p(y_t \mid y_{<t}, x; \theta) = \log \mathbb{E}_{p(z_t \mid x, y_{<t}; \theta)}\big[\mathrm{softmax}(W\,\mathrm{MLP}([h_t; \bar{s}_t]) + b)[y_t]\big] \ge \mathbb{E}_{p(z_t \mid x, y_{<t}; \theta)}\big[\log \mathrm{softmax}(W\,\mathrm{MLP}([h_t; \bar{s}_t]) + b)[y_t]\big],$$

2 However, the computation before the attention layer to get s1:S and h1:T is the same in both hard/soft attention. Also note that the both soft and hard attention models have the same set of parameters θ.

67 which holds since log is concave. Then we can perform gradient-based optimization with the score function estimator

$$
\begin{aligned}
\nabla_\theta \mathbb{E}\big[\log \mathrm{softmax}(W\,\mathrm{MLP}([h_t; \bar{s}_t]) + b)[y_t]\big] = \mathbb{E}\Big[&\nabla_\theta \log \mathrm{softmax}(W\,\mathrm{MLP}([h_t; \bar{s}_t]) + b)[y_t] \\
&+ \log \mathrm{softmax}(W\,\mathrm{MLP}([h_t; \bar{s}_t]) + b)[y_t] \times \nabla_\theta \log p(z_t \mid x, y_{<t}; \theta)\Big],
\end{aligned}
$$
where the expectations are with respect to $p(z_t \mid x, y_{<t}; \theta)$.
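When S is small enough, the exact marginal likelihood above can instead be computed by enumeration. A minimal sketch (hypothetical inputs): `log_p_z` holds log p(z_t = i | x, y_{<t}) and `log_p_y_given_z` holds log p(y_t | x, y_{<t}, z_t = i), each of shape (S,).

```python
import torch

# Exact hard-attention log marginal likelihood: log sum_i p(z_t = i) p(y_t | z_t = i, ...),
# computed stably in log space. This is the O(S) enumeration cost discussed above.
def hard_attention_log_marginal(log_p_z, log_p_y_given_z):
    return torch.logsumexp(log_p_z + log_p_y_given_z, dim=-1)
```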

4.2.3 Pathwise Gradient Estimators for Discrete Distributions

We now discuss an alternative to the score function gradient estimator for training discrete latent variable models, which utilizes the Gumbel-Max trick (Papandreou &

Yuille, 2011; Hazan & Jaakkola, 2012; Maddison et al., 2014). Concretely, suppose z is a one-hot representation of a categorical random variable z with K categories and unnormalized scores α, i.e.

$$p(z = k; \alpha) = p(\mathbf{z}[k] = 1; \alpha) = \frac{\alpha[k]}{\sum_{i=1}^{K} \alpha[i]}.$$

68 Then we can draw a sample from p(z; α) by solving the following optimization prob- lem,

$$\hat{\mathbf{z}} = \operatorname*{arg\,max}_{t \in \Delta^{K-1}}\, (\log \alpha + g)^\top t,$$
where $\Delta^{K-1}$ is the $(K-1)$-simplex and $g = [g[1], \dots, g[K]]$ is a vector of i.i.d. samples from a Gumbel distribution,

g[i] i.i.d.∼ Gumbel(0, 1).

In this case we have zˆ ∼ p(z; α). Thus, the Gumbel-Max trick transforms the problem of sampling from a discrete distribution to an optimization problem of finding the argmax of perturbed logits of a discrete distribution. While this shows that we can reparameterize a discrete distribution, we cannot straightforwardly perform gradient- based optimization using this trick because the arg max function has zero gradients almost everywhere.

If we replace the arg max above with a softmax and a temperature term τ > 0,

$$u = \mathrm{softmax}\left(\frac{\log \alpha + g}{\tau}\right), \qquad u[k] = \frac{\exp\big((\log \alpha[k] + g[k])/\tau\big)}{\sum_{i=1}^{K} \exp\big((\log \alpha[i] + g[i])/\tau\big)},$$
we say that u is drawn from a Gumbel-Softmax distribution (Jang et al., 2017) with parameters α and τ,3

u ∼ Gumbel-Softmax(u; α, τ).

The Gumbel-Softmax defines a distribution on the simplex ∆K−1, and the density is

3Contemporaneous work by Maddison et al. (2017) calls this a Concrete distribution.

given by
$$p(u; \alpha, \tau) = (K-1)!\; \tau^{K-1} \prod_{k=1}^{K} \frac{\alpha[k]\, u[k]^{-\tau-1}}{\sum_{j=1}^{K} \alpha[j]\, u[j]^{-\tau}}.$$
(See Jang et al. (2017) and Maddison et al. (2017) for the derivation.) Notably this distribution is reparameterizable by construction since we can draw a sample by (1) drawing Gumbel noise g, (2) transforming the noise as (g[i] + log α[i])/τ, and (3) applying a softmax to the transformed values. Importantly, unlike the arg max function, softmax has nonzero gradients. While u is no longer discrete, we can anneal the temperature τ → 0 as training progresses and hope that this will approximate a sample from a discrete distribution.
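The sampling recipe just described is a few lines of code; a minimal sketch (not tied to any particular model) with `log_alpha` holding unnormalized log-scores:

```python
import torch

# Draw a Gumbel-Softmax sample: perturb log-scores with Gumbel(0, 1) noise and apply
# a temperature-controlled softmax. The result lies on the simplex rather than being one-hot.
def gumbel_softmax_sample(log_alpha, tau=1.0):
    u = torch.rand_like(log_alpha)
    gumbel = -torch.log(-torch.log(u + 1e-20) + 1e-20)        # Gumbel(0, 1) noise
    return torch.softmax((log_alpha + gumbel) / tau, dim=-1)  # u on the simplex
```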

We now discuss how we can use this to train a variational autoencoder with dis- crete latent variables. Recall that the gradient of ELBO with respect to ϕ is

$$
\begin{aligned}
\nabla_\phi \mathrm{ELBO}(\theta, \phi; x) &= \nabla_\phi \mathbb{E}_{q(z \mid x; \phi)}\left[\log \frac{p(x, z; \theta)}{q(z \mid x; \phi)}\right] \\
&= \nabla_\phi \mathbb{E}_{q(z \mid x; \phi)}[\log p(x \mid z; \theta)] - \nabla_\phi \mathrm{KL}[q(z \mid x; \phi) \,\|\, p(z; \theta)].
\end{aligned}
$$

As in the Gaussian case from chapter 3, the KL-divergence between two discrete dis- tributions is easy to calculate and thus ∇ϕ KL[q(z | x; ϕ) ∥ p(z; θ)] is not an issue. For the first term, we use the Gumbel-Softmax approximation to the discrete dis- tribution. Suppose now α = enc(x; ϕ) and let qrelax(u | x; ϕ, τ) be the variational Gumbel-Softmax distribution (note that τ could in theory also be a function of x), where we have used subscripts in qrelax(·) to distinguish it from the original (discrete) variational distribution q(z | x; ϕ). We might then hope that the gradient of the ELBO obtained from the relaxed, Gumbel-Softmax distribution will well-approximate the

70 true gradient,

$$
\begin{aligned}
\nabla_\phi \mathbb{E}_{q(z \mid x; \phi)}[\log p(x \mid z; \theta)] &\approx \nabla_\phi \mathbb{E}_{q_{\mathrm{relax}}(u \mid x; \phi, \tau)}[\log p(x \mid z = u; \theta)] \\
&= \mathbb{E}_{g \sim \mathrm{Gumbel}}\big[\nabla_\phi \log p(x \mid z = \mathrm{softmax}((g + \log \alpha)/\tau); \theta)\big],
\end{aligned}
$$
where we can now push the gradient sign inside the expectation and use the pathwise gradient estimator in the second equality.4 Unlike the score function gradient estimator, the above estimator is biased and its variance will diverge to infinity as τ → 0. Also note the above relaxation is only applicable if the likelihood model p(x | z; θ), which was originally defined over a discrete latent space z ∈ {0, 1}^K (and the observed data), is well-defined for a continuous latent space u ∈ ∆^{K−1}. While this will indeed be the case for hard attention, for many models of interest this kind of relaxation is not possible, for example in models with dynamic computation graphs that depend on different choices of the latent variable.

4.3 Variational Attention for Latent Alignment

Recall that the log likelihood over the target sentence in the hard attention model is given by a summation of individual log marginal likelihoods at each time step,
$$\sum_{t=1}^{T} \log p(y_t \mid y_{<t}, x; \theta) = \sum_{t=1}^{T} \log\left(\sum_{i=1}^{S} p(z_t = i \mid x, y_{<t}; \theta)\, p(y_t \mid y_{<t}, z_t = i, x; \theta)\right).$$

4Other variants of this approach are possible. For example we could use a straight-through estimator, which uses arg max on the forward pass and softmax only in the backward pass.

Let us now consider the evidence lower bound for the log marginal likelihood in the hard attention model for a single time step,

$$\log p(y_t \mid y_{<t}, x; \theta) \ge \mathbb{E}_{q(z_t \mid x, y; \phi)}[\log p(y_t \mid y_{<t}, z_t, x; \theta)] - \mathrm{KL}[q(z_t \mid x, y; \phi) \,\|\, p(z_t \mid x, y_{<t}; \theta)].$$

We first observe that the lower bound objective derived from Jensen's inequality in section 4.2.2,

$$\mathbb{E}_{p(z_t \mid x, y_{<t}; \theta)}[\log p(y_t \mid y_{<t}, z_t, x; \theta)],$$

corresponds to the special case of the above bound in which the variational distribution is set equal to the prior,

$$q(z_t \mid x, y; \phi) = p(z_t \mid x, y_{<t}; \theta).$$

This choice of variational distribution ignores the current and future observations, which is clearly suboptimal when modeling alignment, since knowing y_t can give a strong signal as to which word in the source was translated. This can lead to a large gap between the log marginal likelihood and the lower bound.

In this section, we consider a variational approach to minimize this gap by employ- ing an inference network over the source and target sentences, as shown in Figure 4.1.

One way to parameterize the inference network is

$$q(z_t = i \mid x, y; \phi) = \frac{\exp(f(s_i, h_t; \phi))}{\sum_{j=1}^{S} \exp(f(s_j, h_t; \phi))}.$$

Figure 4.1: Sketch of variational attention applied to machine translation. Here the source sentence x_{1:7} has 7 words and the target sentence y_{1:9} has 9 words. Two alignment distributions are shown, for the blue prior p, and the red variational posterior q taking into account future observations. Our aim is to use q, which conditions on the entire target sentence, to improve estimates of p and to support improved inference of z.

Observe that in contrast to the prior attention distribution which utilizes the previous time’s hidden state ht−1 (because it has not generated yt yet), the inference network instead uses ht, because it has access to all of y, i.e.

$$\text{Prior Attention:}\quad p(z_t = i \mid x, y_{<t}; \theta) = \frac{\exp(f(s_i, h_{t-1}; \theta))}{\sum_{j=1}^{S} \exp(f(s_j, h_{t-1}; \theta))}, \qquad \text{Variational Attention:}\quad q(z_t = i \mid x, y; \phi) = \frac{\exp(f(s_i, h_t; \phi))}{\sum_{j=1}^{S} \exp(f(s_j, h_t; \phi))}.$$

Here we use si, ht to refer to the source/target hidden states in both the prior and the variational posterior to emphasize their similarity. However, in practice the inference network has its own SeqNN parameters, and additionally utilizes a bi-directional de- coder model such that the hidden state at time t can in theory utilize all past/future information. The source/target word embedding matrices (X, Y) are shared between the inference network and the generative model.

An estimator for the gradient with respect to θ in this generative model is simple to obtain with Monte Carlo sampling (as usual). For the gradient with respect to ϕ,

73 we experiment with two different estimators.

Score Function Gradient Estimator The first approach uses the score func- tion gradient estimator from section 2.5.5 with a variance-reducing control variate from a soft attention model. Concretely, the inference network parameters have the following gradient expression,

$$\mathbb{E}_{q(z_t \mid x, y; \phi)}\big[\log p(y_t \mid y_{<t}, z_t, x; \theta) \times \nabla_\phi \log q(z_t \mid x, y; \phi)\big] - \nabla_\phi \mathrm{KL}[q(z_t \mid x, y; \phi) \,\|\, p(z_t \mid x, y_{<t}; \theta)].$$

The second term in the above expression for the gradient can be calculated easily since it is simply the gradient of the KL divergence between two categorical distri- butions, which can be calculated exactly via enumeration,

$$\mathrm{KL}[q(z_t \mid x, y; \phi) \,\|\, p(z_t \mid x, y_{<t}; \theta)] = \sum_{i=1}^{S} q(z_t = i \mid x, y; \phi)\, \log \frac{q(z_t = i \mid x, y; \phi)}{p(z_t = i \mid x, y_{<t}; \theta)}.$$
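Since both distributions are categorical over the S source positions, this KL term (and its gradient) can be computed exactly; a minimal sketch with hypothetical unnormalized scores `q_logits` and `p_logits` of shape (S,):

```python
import torch

# Exact KL between the variational attention distribution q and the prior attention
# distribution p over source positions, by enumeration.
def categorical_kl(q_logits, p_logits):
    log_q = torch.log_softmax(q_logits, dim=-1)
    log_p = torch.log_softmax(p_logits, dim=-1)
    return (log_q.exp() * (log_q - log_p)).sum(-1)   # sum_i q(i) log [q(i) / p(i)]
```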

For the first term (i.e. the score function gradient estimator) we use a control variate

Bt to reduce variance,

$$\mathbb{E}_{q(z_t \mid x, y; \phi)}\big[(\log p(y_t \mid y_{<t}, z_t, x; \theta) - B_t) \times \nabla_\phi \log q(z_t \mid x, y; \phi)\big],$$

where B_t is the output of the soft attention model at each time step,

$$B_t = \log p_{\mathrm{soft}}(y_t \mid x, y_{<t}; \theta_{\mathrm{soft}}) = \log \mathrm{softmax}\big(W\,\mathrm{MLP}([h_t; \mathbb{E}[\bar{s}_t]]) + b\big)[y_t],$$

where θ_soft parameterizes a deterministic soft attention model. The soft attention model for producing the control variate (which has its own set of parameters) is trained alongside the hard attention model with each mini-batch. Intuitively, the variational posterior q(z_t | x, y; ϕ) is adjusted such that its samples give higher reward (i.e. likelihood) than a comparable soft attention model. In practice we estimate the above gradient expression with a single sample.

Pathwise Gradient Estimator The second approach uses a pathwise gradient estimator derived from a Gumbel-Softmax relaxation of the discrete distribution for the conditional likelihood, as described in section 4.2.3. First we reparameterize the conditional likelihood in terms of the one-hot vector representation of z_t. Letting $\mathbf{z}_t$ be the one-hot representation of $z_t$, we then have

$$\log p(y_t \mid x, y_{<t}, z_t; \theta) = \log p(y_t \mid x, y_{<t}, \mathbf{z}_t; \theta) = \log\big(\mathrm{softmax}(W\,\mathrm{MLP}([h_t; S\mathbf{z}_t]) + b)\big)[y_t],$$

where $S \in \mathbb{R}^{d \times S}$ is the matrix obtained by stacking the source hidden states $[s_1, \dots, s_S]$. Therefore the context vector is given by

$$S\mathbf{z}_t = \sum_{i=1}^{S} \mathbf{1}\{z_t = i\}\, s_i = \bar{s}_t.$$

The variational distribution q(z_t | x, y; ϕ) is also redefined in the obvious way. Letting $q_{\mathrm{relax}}(u_t \mid x, y; \phi, \tau)$ be the Gumbel-Softmax distribution over $u_t \in \Delta^{S-1}$ with temperature τ, we are now ready to approximate the gradient as follows,

$$
\begin{aligned}
\nabla_\phi \mathbb{E}_{q(z_t \mid x, y; \phi)}[\log p(y_t \mid x, y_{<t}, \mathbf{z}_t; \theta)] &\approx \nabla_\phi \mathbb{E}_{q_{\mathrm{relax}}(u_t \mid x, y; \phi, \tau)}[\log p(y_t \mid x, y_{<t}, u_t; \theta)] \\
&= \mathbb{E}_{g \sim \mathrm{Gumbel}}\big[\nabla_\phi \log p(y_t \mid x, y_{<t}, \mathrm{softmax}((g + \log \alpha_t)/\tau); \theta)\big],
\end{aligned}
$$

where αt = [αt,1, . . . , αt,S] is the vector of attention probabilities, g = [g1, . . . , gS] is the vector of i.i.d samples from Gumbel(0, 1). The expectation can be approximated with a single sample to yield a low-variance (biased) estimator without the need for any control variates. The temperature term τ is annealed to a small value during training. Note that in deriving this estimator, it was crucial to work with the one-hot vector representation of zt instead of the original integer representation zt in order to “relax” the conditional likelihood which was originally only defined for discrete vari- able zt. This ensured that the following definition of log likelihood over yt,

$$\log p(y_t \mid x, y_{<t}, u_t; \theta) = \log\big(\mathrm{softmax}(W\,\mathrm{MLP}([h_t; S u_t]) + b)\big)[y_t],$$
is well-defined for any $u_t \in \Delta^{S-1}$.

4.3.1 Test Time Inference

The model is trained to maximize a lower bound on the log likelihood of target sen- tences y(1:N) conditioned on source sentences x(1:N). However, at test time we are only given the source sentence and must produce the most likely output under the model,

$$\hat{y} = \operatorname*{arg\,max}_{y \in \mathcal{Y}}\, p(y \mid x; \theta),$$

where $\mathcal{Y}$ is the set of all sentences in the target language (up to some predefined maximum length).5 As the argmax is intractable to obtain exactly, a common strategy is to utilize beam search. For hard attention models we can marginalize over z_t at an additional multiplicative cost of O(S) at each time step, which is feasible but expensive. We therefore propose a K-Max decoding strategy which approximates the prior p(z_t | x, y_{<t}; θ) by truncating it to its K most probable values,

$$p_{\text{K-Max}}(z_t = i \mid x, y_{<t}) \propto p(z_t = i \mid x, y_{<t}; \theta)\, \mathbf{1}\{p(z_t = i \mid x, y_{<t}; \theta) \ge u_K\},$$

where u_K is the K-th largest element of p(z_t | x, y_{<t}; θ). The distribution over the target token is then approximated as

$$p(y_t \mid x, y_{<t}) \approx \sum_{\{z_t \,:\, p_{\text{K-Max}}(z_t \mid x, y_{<t}) > 0\}} p_{\text{K-Max}}(z_t \mid x, y_{<t})\, p(y_t \mid x, y_{<t}, z_t; \theta),$$

and we use this approximated token-level distribution within standard beam search to obtain the approximate argmax for the hard attention model. This heuristic makes the multiplicative cost constant with respect to S, and we found this to work well empirically.
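A minimal sketch of the K-Max approximation for a single decoding step (hypothetical inputs): `p_z` holds the prior attention probabilities and `log_p_y_given_z` holds log p(y_t | x, y_{<t}, z_t = i), each of shape (S,).

```python
import torch

# K-Max decoding: keep only the K largest prior attention probabilities, renormalize,
# and marginalize over just those source positions.
def k_max_token_logprob(p_z, log_p_y_given_z, K=5):
    topk_p, topk_idx = torch.topk(p_z, K)        # K most probable source positions
    p_k = topk_p / topk_p.sum()                  # renormalized truncated prior
    return torch.logsumexp(torch.log(p_k) + log_p_y_given_z[topk_idx], dim=-1)
```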

4.4 Empirical Study

4.4.1 Experimental Setup

Data Our experiments mainly utilize the IWSLT German to English dataset, which contains 153K/7K/7K train/validation/test set of parallel sentences of translated

5The inference network ϕ is only used for training and discarded at test time.

77 TED talks (Cettolo et al., 2014). This dataset is relatively small, but has become a standard benchmark for experimental neural machine translation models. We follow the same preprocessing as in Edunov et al. (2018) with a byte pair encoded vocabu- lary of 14k tokens (Sennrich et al., 2016) and train on sequences of length up to 125.

To show that variational attention scales to large datasets, we also experiment on the

WMT 2017 English to German dataset which has approximately 4M sentences (Bojar et al., 2017), utilizing the preprocessing in Vaswani et al. (2017). We use newstest2017 as the test set in this case. We evaluate using the BLEU score (Papineni et al., 2002), as is standard in machine translation.

Hyperparameters The encoder is a two-layer bi-directional LSTM with 512 units in each direction, and the decoder as a two-layer LSTM with 768 units. For the de- coder, the convex combination of source hidden states at each time step from the at- tention distribution is used as additional input at the next time step (i.e. the input- feeding approach from Luong et al. (2015)). Word embeddings have 512 dimensions.

The inference network consists of two bi-directional LSTMs (also two-layer and 512- dimensional) which is run over the source/target to obtain the hidden states at each time step. For the attention scoring function f, we combine the hidden states using bilinear attention (Luong et al., 2015) to produce the variational parameters. (In con- trast the generative model uses MLP attention from (Bahdanau et al., 2015), though we saw little difference between the two parameterizations). Only the word embed- ding is shared between the inference network and the generative model.

Other training details include: batch size of 6, dropout rate of 0.3, parameter ini- tialization over a uniform distribution U[−0.1, 0.1], gradient norm clipping at 5, and training for 30 epochs with Adam (learning rate = 0.0003, β1 = 0.9, β2 = 0.999) with

78 a learning rate decay schedule which starts halving the learning rate if validation per- plexity does not improve. Most models converged well before 30 epochs. For decoding we use beam search with beam size 10 and length penalty α = 1, from Wu et al.

(2016). The length penalty added about 0.5 BLEU points across all the models.

Baselines Experiments vary three components of the systems: (a) training objec- tive (exact log marginal likelihood, lower bound from the prior, lower bound from the variational posterior), (b) training estimation (sampling vs. exact calculation of the expectation via enumeration, score function gradient estimator vs. pathwise gradient estimator with the relaxed Gumbel-Softmax distribution), and (c) test inference (ex- act enumeration vs K-max decoding). All models have the same architecture and the exact same number of parameters θ. For the score function gradient estimator trained with sampling, both the prior/variational models use the variance reducing control variate from soft attention.

Code Our code is available at https://github.com/harvardnlp/var-attn.

4.4.2 Results

Table 4.1 shows the main results. We first note that hard attention from the prior un- derperforms soft attention, even when the expectation in the gradient expression is calculated exactly via enumeration. This indicates that the previous negative results from hard attention can mostly be attributed to the large gap resulting from Jensen’s inequality, rather than approximation issues (such as high variance gradient estima- tors) from sampling. On the other hand, exact marginal likelihood outperforms soft attention, showing the benefits of explicitly modeling attention as a latent variable.

79 Model Training Objective ∇ Estimation PPL BLEU

Soft Attn log p(y | x, Ep(z | x)[z]) − 7.17 32.77 Marginal Likehood log Ep(z | x)[p(y | x, z)] Enumeration 6.34 33.29 Hard Attn from Prior Ep(z | x)[p(y | x, z)] Enumeration 7.37 31.40 Hard Attn from Prior Ep(z | x)[p(y | x, z)] Score Function 7.38 31.00 Variational Attn Eq(z | x,y)[log p(y | x, z)] − KL Enumeration 6.08 33.68 Variational Attn Eq(z | x,y)[log p(y | x, z)] − KL Score Function 6.17 33.30 E | − Variational Attn qrelax(u | x,y)[log p(y x, u)] KL Pathwise 6.51 33.08

Table 4.1: Evaluation on IWSLT 2014 German-English test set for the various models. ∇ estima- tion indicates whether the gradients are calculated via enumeration or sampling. Sampled estima- tors can either be a score function estimator with control variate from a soft attention model, or a pathwise estimator from a Gumbel-Softmax relaxation to the discrete distribution We evaluate in- trinsically on perplexity (PPL) (lower is better) and extrinsically on BLEU (higher is better), where for BLEU we perform beam search with beam size 10 and length penalty with exact marginaliza- tion over zt at each time step.

Surprisingly, variational attention with enumeration and score function gradients performs better than optimizing the exact log marginal likelihood, despite the fact that it is optimizing a lower bound. We believe that this is due to a posterior reg- ularization effect on the generative model via the parameterization of the inference network, where the global BiLSTM parameterization over source and target for q pro- vides beneficial inductive bias and aids generalization.6,7 We observe that training discrete attention with the pathwise gradient estimator from the relaxed, Gumbel-

Softmax distribution is a viable strategy. This model also outperforms soft attention and slightly underperforms the model trained with enumeration/score function gradi- ent estimator. Note that the pathwise gradient estimator does not require variance- reducing control variates from the soft attention model and is hence more memory

6Note that it is also possible to have q(z) come from a pretrained external model, such as a traditional alignment model (Dyer et al., 2013). 7In chapter 5 we investigate this posterior regularization effect in greater detail where we use a more restrictive variational family to constrain the true posterior in order to encourage desired structures to emerge.

80 Figure 4.2: Test perplexity of different approaches while varying K to estimate log p(yt | x, y

On the larger WMT 2017 English-German task, our baseline soft attention reaches

24.10 BLEU score, while variational attention with the score function gradients reaches

24.98. This only reflects a reasonable setting without exhaustive hyerparameter tun- ing, yet we show that we variational attention can improve upon deterministic soft attention even in large data regimes.

4.4.3 Analysis

For model analysis, we limit our scope and focus on the prior/variational attention models trained with the score function gradient estimator. Figure 4.2 shows the PPL of different models as we increase K. Good performance requires K > 1, but we only observe marginal benefits for K > 5. (We observed similar trends in BLEU). Thus the

K-Max approximation during predictive inference is a viable strategy for efficient in- ference in hard attention models. We also observe that it is possible to train with soft attention and test using K-Max with a small performance drop (Soft KMax in Ta-

81 Model H[p(zt | x, y

Table 4.2: Entropy of the prior attention distributions p(zt | x, y

We next analyze the learned attention distributions in more detail. We show the average entropy of the prior attention distributions in Table 4.2, where we average across target sequence lengths. We observe that a model trained with hard atten- tion from prior has very low entropy and is potentially overconfident. The opposite holds true for a model trained with soft attention, and the variational attention model falls somewhere in between. This is visualized in Figure 4.3 (right column), where we show the attention distribution between the soft/variational attention models over a fixed gold sentence. Besides performance, an advantage of these models is theabil- ity to perform posterior inference, since the q function can be used directly to obtain posterior alignments. This is in contrast with hard attention using the prior, where the posterior is independent of the future information. Figure 4.3 shows the align- ments of p and q for variational attention over a fixed sentence. We see that q is able to use future information to correct alignments. We note that the inability of soft and hard attention to produce good alignments has been noted as a major issue in

NMT (Koehn & Knowles, 2017). While q is not used directly in left-to-right NMT de-

82 coding, it could be employed for other applications such as in an iterative refinement approach to sequence generation (Novak et al., 2016; Lee et al., 2018).

(a) (b)

(c) (d)

Figure 4.3: (Left Column) Examples highlighting the difference between the prior attention p(zt | x, y

83 4.5 Discussion

4.5.1 Limitations

Incorporating latent attention into neural models via the variational formulation is a promising alternative to deterministic soft attention. However, there are some practi- cal limitations. Variational attention with the score function gradients needs a good control variate in the form of soft attention. We found this to be a necessary compo- nent for adequately training the system. This may prevent this technique from work- ing when S is intractably large or when soft attention is not an option. Similarly, the

Gumbel-Softmax approach with pathwise gradients requires that the generative model be well defined for a continuously relaxed discrete latent variable, which may notbe applicable in general.

Contemporary architectures such as the Transformer (Vaswani et al., 2017) utilize many repeated attention models. For instance the current best translation models have the equivalent of 150 different attention queries per word translated. It isun- clear if this approach can be used at that scale as predictive inference becomes combi- natorial.

4.5.2 Attention as a Latent Variable

Soft attention was initially motivated as an approximation to latent alignment in ma- chine translation,8 and is now an integral building block of contemporary deep ar- chitectures such as Transformer networks. However, its utility as a tool for model introspection is the subject of much current debate, as evidenced by a series of re-

8Indeed, the title of the original paper introducing soft attention is “Neural Machine Translation by Jointly Learning to Align and Translate” (Bahdanau et al., 2015).

84 cent papers that investigate soft attention’s ability (or lack thereof) to explain model outputs (Jain & Wallace, 2019; Serrano & Smith, 2019; Wiegreffe & Pinter, 2019). It would therefore be interesting to see if a latent variable formulation of attention can provide a more robust and probabilistically rigorous alternative to soft attention for model interpretability.

While our generative story interprets the categorical latent variable zt as perform- ing selection over the input (i.e., selecting which word to translate) at each time step, in practice we select over contextualized word representations [s1,..., sS]. Each vec- tor si can theoretically encode all information about the entire sentence x since it is the output from a bi-directional model. Thus, just because we have zt = i does not necessarily mean that the model is translating the particular word xi. This is both a feature and a bug: it is a feature since it allows the model to softly select phrases with a single categorical latent variable, which are easier to work with in terms of inference

(compared to, for example, binary vector-valued latent variables); it is a bug since it means that zt cannot rigorously be interpreted as focusing on a single word (although in practice alignments do seem to correspond to human intuition at the word-level).

An interesting future direction would involve working with latent variable versions of structured attention (Kim et al., 2017) which can explicitly attend to contiguous words (e.g., via a semi-Markov model), while at the same time employing encoders with Markov properties such that si does not contain information about the entire source sentence.

85 4.6 Related Work

Neural soft attention was originally introduced as an alternative approach for neu- ral machine translation (Bahdanau et al., 2015), and have subsequently been suc- cessful on a wide range of tasks (see Cho et al. (2015) for a review of applications).

Recent work has combined neural attention with traditional alignment (Cohn et al.,

2016; Tu et al., 2016) and induced structure/sparsity (Martins & Astudillo, 2016; Kim et al., 2017; Liu & Lapata, 2017; Zhu et al., 2017; Niculae & Blondel, 2017; Niculae et al., 2018a; Mensch & Blondel, 2018), which can be combined with the variational approaches outlined in this paper.

In contrast to soft attention models, hard attention (Xu et al., 2015; Ba et al.,

2015a) approaches use a single sample from a prior for training with a lower bound resulting from Jensen’s inequality. These models have proven much more difficult to train, and existing works typically treat hard attention as a black-box reinforcement learning problem with log likelihood as the reward (Xu et al., 2015; Ba et al., 2015a;

Mnih et al., 2015; Gulcehre et al., 2016; Deng et al., 2017). Two notable exceptions are Ba et al. (2015b) and Lawson et al. (2018): both utilize amortized variational in- ference to learn a sampling distribution which is used to obtain importance-sampled estimates of the log marginal likelihood (Burda et al., 2015) via multiple samples.

Our method uses uses different estimators and targets the single sample approach for efficiency.

There has also been significant work in using variational autoencoders for language modeling and machine translation. Of particular interest are those that augment an

RNN with latent variables (typically Gaussian) at each time step (Chung et al., 2015;

Fraccaro et al., 2016; Serban et al., 2017; Goyal et al., 2017a; Krishnan et al., 2017)

86 and those that incorporate latent variables into sequence-to-sequence models (Zhang et al., 2016; Bahuleyan et al., 2017; Su et al., 2018; Schulz et al., 2018). Our work dif- fers by modeling an explicit model component (alignment) as a latent variable instead of auxiliary latent variables (e.g. topics). The term “variational attention” has also been used to refer to a different component of the output from attention (commonly called the context vector) as a latent variable (Bahuleyan et al., 2017), or to model both the memory and the alignment as a latent variable (Bornschein et al., 2017).

Finally, contemporaneous work by Wu et al. (2018) and Shankar et al. (2018) also perform exact/approximate marginalization over latent alignments for sequence-to- sequence learning.

4.7 Conclusion

While attention networks are ubiquitous in natural language processing, they are dif- ficult to use as latent variable models. In this chapter we explored alternative ap- proaches to modeling alignment as a latent variable, and showed variational attention as a promising technique. In particular we saw that both the score function gradient estimator with a control variate derived from a soft attention model, and the path- wise gradient estimator derived from a Gumbel-Softmax relaxation of the discrete attention distribution, were viable methods for efficiently learning performant hard attention models. An important future direction will be to scale the method on more complex models such as multi-hop attention models, and to utilize these latent vari- ables for interpretability and as a way to incorporate prior knowledge.

87 5 Latent Variable Model of Trees & Posterior Regularization

5.1 Introduction

This chapter considers a deep latent variable model of sentences where the latent vari- able corresponds to an unlabeled, binary parse tree of the sentence. Unlike the la-

The material in this chapter is adapted from Kim et al. (2019b).

88 tent variable models that we have seen in previous chapters, the latent variables in this chapter are structured, in that there is interdependence among the different la- tent components. In particular, we study of recurrent neural network grammars (RNNG) (Dyer et al., 2016). RNNGs are generative models of lan- guage which jointly model syntax and surface structure by incrementally generating a syntax tree and sentence in a top-down, left-to-right order. Supervised RNNGs have been shown to outperform standard sequential language models, achieve excellent results on parsing (Dyer et al., 2016; Kuncoro et al., 2017), better encode syntactic properties of language (Kuncoro et al., 2018), and correlate with electrophysiological responses in the human brain (Hale et al., 2018). However, these all require annotated syntactic trees for training. In this chapter, we explore unsupervised learning of recur- rent neural network grammars for language modeling and unsupervised parsing.

Past work on incorporating syntax into statistical models of language has focused on cases where either (i) the underlying structure is itself the goal (Klein & Manning,

2002), or (ii) the structure is useful in obtaining better language models (Emami &

Jelinek, 2005). The goal of this chapter is to develop a model which obtains good lan- guage modeling performance and at the same time learns meaningful latent structure.

We propose to achieve this by regularizing the posterior of a flexible generative model

(which is crucial for good language modeling performance) through a form of poste- rior regularization (Ganchev et al., 2010) within a variational framework. In partic- ular, we utilize a (CRF) constituency parser (Finkel et al.,

2008; Durrett & Klein, 2015) as an inference network, and find that it acts as a guide on the generative model through regularizing the posterior. We experiment with un- supervised RNNGs on English and Chinese and observe that they perform well as language models compared to their supervised counterparts and standard neural lan-

89 guage models. They also learn meaningful tree structures and outperform various baselines on unsupervised parsing.

5.2 Background

We first give a high-level overview of recurrent neural network grammars aswell the problems they pose for unsupervised learning (5.2.1), deferring the exact param- eterization of the RNNG to a later section (5.3.1). We then briefly review posterior regularization (5.2.2) and conditional random fields5.2.3 ( ), both of which are crucial components of our approach.

5.2.1 Recurrent Neural Network Grammars

Recurrent neural network grammars (RNNGs) model sentences by first generating a nested, hierarchical syntactic structure which is used to construct a context represen- tation to be conditioned upon for upcoming words. Like neural language models, RN-

NGs make no independence assumptions. Instead they encode structural bias through operations that compose linguistic constituents through neural network-based com- position functions. The lack of independence assumptions combined with syntactic constraints contribute to the strong language modeling performance. However they make unsupervised learning challenging.

First, marginalization is intractable. In our RNNG, the binary latent variable zt ∈ {0, 1} at each time step represents an action; whether to reduce and compose a constituent via merging the last two elements on the stack, or to generate the next word and shift it onto the stack. Crucially, zt fully depends on the previously gener-

90 1 ated actions z

In this chapter we explore a technique for handling intractable marginalization while also injecting inductive bias for meaningful structure. Specifically we employ amortized variational inference (Kingma & Welling, 2014) with a structured inference network that does make significant independence assumptions, which encourages non- trivial structure through a form of posterior regularization.

5.2.2 Posterior Regularization

Posterior regularization (Ganchev et al., 2010) describes a framework for injecting prior knowledge into probabilistic models via constraints on expectations of a model’s posterior. Letting x ∈ X be the observed data and z ∈ Z be the unobserved latent variable, one formulation of posterior regularization optimizes the following objective

∑N max log p(x(n); θ) − min KL[q(z) ∥ p(z | x(n); θ)]. θ q(z)∈Q n=1 x(n)

1In contrast, in chapter 4 the attention at each time step (i.e. the categorical latent vari- able zt) does not depend on any attention from previous time steps, which allowed us to directly calculate the log marginal likelihood via enumeration at a multiplicative (but not exponential) increase in cost.

91 Here Qx is the family of distributions over z (which depend on x) that satisfy the following

Qx = {q(z): Eq(z)[Φ(x, z)] ≤ b}, where Φ: X × Z → Rd is a feature function which encodes the desired constraints when combined with b. Following the example in Ganchev et al. (2010), if

Φ(x, z) = “negative number of verbs in z”, then

{ E (n) ≤ − } Qx(n) = q(z): q(z)[Φ(x , z)] 1 expresses the constraint that each sentence should have at least one verb, in expecta- tion. The generative model parameters θ are optimized against this posterior-regularized likelihood with EM-style coordinate ascent updates, which encourages the learned model to have posteriors that respect this constraint.2 In many cases, prior knowledge can be more naturally expressed via constraints on data-dependent posteriors, and thus posterior regularization provides an attractive alternative to Bayesian priors for injecting prior knowledge into latent variable models.

2We give a greatly simplified exposition here for brevity. See the original formulation in Ganchev et al. (2010) for full details and extensions, e.g. adding slack variables to allow for constraint violations.

92 5.2.3 Conditional Random Fields

A conditional random field (CRF) (Lafferty et al., 2001) specifies a distribution over discrete structure z given observed variable x via a globally normalized Gibbs distri- bution,

1 ∏ p(z | x; θ) = exp (ψ (x, z ; θ)) , Z(x; θ) c c zc∈C(x,z) where C(x, z) is the set of cliques in a graphical model over x and z, ψc(x, zc; θ) is the score (i.e. log potential) for the substructure (x, zc), and

∑ ∏ Z(x; θ) = exp (ψc(x, zc; θ)) , ′ ′ z ∈Z zc∈C(x,z ) is the partition function which guaranteeds that p(z | x; θ) is a valid probability distri- bution over Z.

The score function ψc(x, zc; θ) can itself be parameterized as a neural network. CRFs and their neural extensions have been widely utilized for structured predic- tion tasks in NLP, from part-of-speech tagging to named entity recognition to pars- ing (Lafferty et al., 2001; Finkel et al., 2008; Collobert et al., 2011; Durrett & Klein,

2015; Lample et al., 2016; Ma & Hovy, 2016). For our purposes we are interested in a

CRF that gives rise to a distribution over unlabeled binary trees,

∏ p(z | x; θ) ∝ 1{z is a valid binary tree} exp (ψ(x, zi, zj; θ)) , i≤j where ψ(x, zi, zj; θ) is a score for span [xi, xi+1, . . . , xj]’s being a constituent. Under this model we can calculate the partition function in O(T 3) time (where T is the sen-

93 tence length) with the standard inside algorithm (Baker, 1979).

5.3 Unsupervised Recurrent Neural Network Grammars

We now describe our generative model (an RNNG) and the structured inference net- work in more detail. We use x = [x1, . . . , xT ] to denote a sentence of length T , and z ∈ ZT to denote an unlabeled binary parse tree over a sequence of length T , repre-

3 sented as a binary vector [z1, . . . , z2T −1] of length 2T − 1. Here 0 and 1 correspond to shift and reduce actions, explained below.

5.3.1 Generative Model

An RNNG defines a joint probability distribution p(x, z; θ) over sentences x and parse trees z. We consider a simplified version of the original RNNGDyer ( et al., 2016) by ignoring constituent labels and only considering binary trees. The RNNG utilizes an

RNN to parameterize a stack data structure (Dyer et al., 2015) of partially-completed constituents to incrementally build the parse tree while generating terminals. Using the current stack representation, the model samples an action (shift or reduce) at each time step: shift generates a terminal symbol (word) and shifts it onto the stack,4 reduce pops the last two elements off the stack, composes them, and shifts the composed representation onto the stack.

Formally, let S = [(0, 0)] be the initial stack. Each item of the stack will be a

3 2T −1 The cardinality of ZT ⊂ {0, 1} is given by the (T − 1)-th Catalan number, i.e.

(2T − 2)! |Z | = . T T !(T − 1)!

4A better name for shift would be generate (as in Dyer et al. (2016)), but we use shift to emphasize similarity with the shift-reduce parsing.

94 pair, where the first element is the hidden state of the stack LSTM, and thesec- ond element is an input vector, described below. We use top(S) to refer to the top pair in the stack. The push and pop operations are defined imperatively in the usual way. At each time step, the next action zt (shift or reduce) is sampled from a Bernoulli distribution parameterized in terms of the current stack representation. Let- ting (hprev, gprev) = top(S), we have

zt ∼ Bernoulli(pt),

⊤ pt = σ(w hprev + b).

Subsequent generation depend on zt:

• If zt = 0 (shift), the model first generates a terminal symbol via sampling from a categorical distribution whose parameters come from an affine transformation

and a softmax,

y ∼ softmax(Whprev + b).

Then the generated terminal is shifted onto the stack using a stack LSTM,

hnext = LSTM(ey, hprev),

push(S, (hnext, ey)),

where ey is the word embedding for the word y.

• If zt = 1 (reduce), we pop the last two elements off the stack,

(hr, gr) = pop(S), (hl, gl) = pop(S),

95 and obtain a new representation that combines the left/right constituent repre-

sentations using a tree LSTM (Tai et al., 2015; Zhu et al., 2015),

gnew = TreeLSTM(gl, gr).

Note that we use gl and gr to obtain the new representation instead of hl and

5 hr. We then update the stack using gnew,

(hprev, gprev) = top(S),

hnew = LSTM(gnew, hprev),

push(S, (hnew, gnew)).

The generation process continues until an end-of-sentence symbol is generated. For a sentence x = [x1, . . . , xT ] of length T , the binary parse tree is given by the binary

6 vector z = [z1, . . . , z2T −1]. In this model, the joint log likelihood decomposes as a sum of terminal/action log likelihoods,

∑T 2∑T −1 log p(x, z; θ) = log p(xj | x

5The update equations for the tree LSTM (and the stack LSTM) also involve cell states in addition to the hidden states. To reduce notational clutter we do not explicitly show the cell states and instead subsume them into g. If one (or both) of the inputs to the tree LSTM is a word embedding, the associated cell state is taken to be zero. See Tai et al. (2015) for the exact parameterization. 6As it stands, the support of z is {0, 1}2T −1, all binary vectors of length 2T − 1. To restrict our distribution to ZT (binary vectors which describe valid trees), we constrain zt to be valid at each time step, which amounts to deterministically choosing zt = 0 (shift) if there are fewer than two elements (not counting the initial zero tuple) on the stack.

96 where z

In the supervised case where ground-truth z is available, we can straightforwardly perform gradient-based optimization to maximize the joint log likelihood log p(x, z; θ).

In the unsupervised case, the standard approach is to maximize the log marginal like- lihood, ∑ log p(x; θ) = log p(x, z′; θ). ′ z ∈ZT

However this summation is intractable because zt fully depends on all previous ac- tions [z1, . . . , zt−1]. Even if this summation were tractable, it is not clear that mean- ingful latent structures would emerge given the lack of explicit independence assump- tions in the RNNG (e.g. it is clearly not context-free). Indeed, in early experiments we tried training the model via marginalizing over the trees via enumeration on short sentences of up to length 10, but did not observe any meaningful structures to emerge.7

We handle these issues with amortized variational inference with a form of posterior regularization from a structured inference network parameterized as a CRF parser.

5.3.2 Posterior Regularization with Conditional Random Fields

Consider an inference network ϕ that parameterizes q(z | x; ϕ), a variational poste- rior distribution over parse trees z given the sentence x. As in previous chapters, this distribution is used to form an evidence lower bound (ELBO) on the log marginal

7We found the model to collapse to fully left or right branching trees depending on hyper- parameters.

97 likelihood,

[ ] p(x, z; θ) ELBO(θ, ϕ; x) = E | log . q(z x;ϕ) q(z | x; ϕ)

We maximize the ELBO with respect to both model parameters θ and inference net- work parameters ϕ. We saw in 2.5.3 that rearranging the ELBO gives the following optimization problem for the entire dataset,

∑N max log p(x(n); θ) − KL[q(z | x(n); ϕ) ∥ p(z | x(n); θ)]. θ,ϕ n=1

From this formulation of the ELBO, we notice its similarity to the posterior regu- larization objective from 5.2.2. In particular, ϕ is trained to match the variational posterior q(z | x; ϕ) to the true posterior p(z | x; θ), but θ is also trained to match the true posterior to the variational posterior. Indeed, there is some evidence to suggest that generative models trained with amortized variational inference (i.e. variational autoencoders) learn posterior distributions that are close to the variational family

(Burda et al., 2015; Cremer et al., 2018).

We can use this to our advantage with an inference network that injects inductive bias through conditional independence properties (conditioned on x). We propose to do this by using a context-free model for the inference network, in particular, a neu- ral CRF parser (Durrett & Klein, 2015; Liu et al., 2018). This choice can seen as a form of posterior regularization that limits posterior flexibility of the overly powerful

RNNG generative model. Concretely, observe that if we optimize the posterior regu-

98 larization objective from 5.2.2

∑N max logp(x(n); θ) − min KL[q(z) ∥ p(z | x(n); θ)], θ q(z)∈Q n=1 x(n)

Qx = {q(z): Eq(z)[Φ(x, z)] ≤ b}, with coordinate ascent style updates (i.e. update q, then update θ), then it is equiv- alent to variational EM where we take the variational family to be the distributions over z that respect the linearity constraints Eq(z)[Φ(x, z)] ≤ b. Conversely, the ELBO objective with a CRF inference network is then roughly equivalent to the posterior regularization objective where we take Qx to be distributions over z such that (1) context-free assumptions hold conditioned on x and (2) log potentials (i.e. the span scores) are from a global neural network. Note, however, that the conditional inde- pendence assumptions in the CRF cannot be characterized through linear constraints over posterior expectations. Hence our use of the term “posterior regularization” is in a slightly different sense than in the original workof Ganchev et al. (2010). In pre- liminary experiments, we also attempted to learn latent trees with a transition-based parser (which does not make explicit conditional independence assumptions) that con- ditions on the entire sentence. However we found that under this setup, the inference network degenerated into a local minimum whereby it always generated trivial trees despite various optimization strategies.8

8Williams et al. (2018) observe a similar phenomenon in the context of learning latent trees for classification tasks. However Li et al. (2019a) find that it is possible use a transition-based parser as the inference network for dependency grammar induction, if the inference network is constrained via posterior regularization based on universal syntactic rules (Naseem et al., 2010).

99 Algorithm 1 Inside algorithm for calculating ZT (x)

1: procedure Inside(s) ▷ scores sij for i ≤ j 2: for i := 1 to T do ▷ length-1 spans 3: β[i, i] = exp(sii) 4: end for 5: for ℓ := 1 to T − 1 do ▷ span length 6: for i := 1 to T − ℓ do ▷ span start 7: j = i + ℓ∑ ▷ span end j−1 · · 8: β[i, j] = k=i exp(sij) β[i, k] β[k + 1, j] 9: end for 10: end for 11: return β[1,T ] ▷ return partition function ZT (x) 12: end procedure

Neural CRF Parameterization The parameterization of span scores in the in- ference network is similar to recent works (Wang & Chang, 2016; Stern et al., 2017;

Kitaev & Klein, 2018): we add position embeddings to word embeddings and run a −→ −→ bidirectional LSTM over the input representations to obtain the forward [ h 1,..., h T ] ←− ←− and backward [ h 1,..., h T ] hidden states. The score sij ∈ R for a constituent span- ning xi to xj is given by,

−→ −→ ←− ←− sij = MLP([ h j+1 − h i; h i−1 − h j]).

Then, letting B be the binary matrix representation of a tree (Bij = 1 means there is a constituent spanning xi and xj), the CRF parser defines a distribution over binary trees via the Gibbs distribution,   1 ∑ q(B | x; ϕ) = exp  B s  , Z (x) ij ij T i≤j

100 where ZT (x) is the partition function,   ∑ ∑  ′  ZT (x) = exp Bijsij , ′ B ∈BT i≤j and ϕ denotes the parameters of the inference network (i.e. the bidirectional LSTM and the MLP). Calculating ZT (x) requires a summation over an exponentially-sized

T ×T set BT ⊂ {0, 1} , the set of all binary matrices that represent valid binary trees over a length T sequence. We can perform the summation in O(T 3) using the inside algorithm (Baker, 1979), shown in Algorithm 1. This computation is itself differen- tiable and amenable to gradient-based optimization. Finally, letting f : BT → ZT be the bijection between the binary tree matrix representation and a sequence of shift/reduce actions, the CRF inference network defines a distribution over ZT via q(z | x; ϕ) ≜ q(f −1(z) | x; ϕ). Figure 5.1 gives an overview of our approach.

5.3.3 Learning and Inference

For training, we use the following variant of the ELBO,

ELBO(θ, ϕ; x) = Eq(z | x;ϕ)[log p(x, z; θ)] + H[q(z | x; ϕ)], where

H[q(z | x; ϕ)] = Eq(z | x;ϕ)[− log q(z | x; ϕ)], is the entropy of the variational posterior. A Monte Carlo estimate for the gradient

101 Figure 5.1: Overview of our approach. The inference network q(z | x; ϕ) (left) is a CRF parser which produces a distribution over binary trees (shown in dotted box). Bij are random variables for existence of a constituent spanning i-th and j-th words, whose potentials are the output from a bidirectional LSTM (the global factor ensures that the distribution is only over valid binary trees). The generative model p(x, z; θ) (right) is an RNNG which consists of a stack LSTM (from which actions/words are predicted) and a tree LSTM (to obtain constituent representations upon re- duce). Training involves sampling a binary tree from q(z | x; ϕ), converting it to a sequence of shift/reduce actions (z = [shift, shift, shift, reduce, reduce, shift, reduce] in the above example), and optimizing the log joint likelihood log p(x, z; θ). with respect to θ is

1 ∑K ∇ ELBO(θ, ϕ; x) ≈ ∇ log p(x, z(k); θ), θ K θ k=1 with samples z(1),..., z(K) from q(z | x; ϕ). Sampling uses the intermediate values calculated during the inside algorithm to sample split points recursively (Goodman,

1998; Finkel et al., 2006), as shown in Algorithm 2. The gradient with respect to ϕ involves two parts. The entropy term H[q(z | x; ϕ)] can be calculated exactly in O(T 3), again using the intermediate values from the inside algorithm (see Algorithm 3),

102 Algorithm 2 Top-down sampling a tree from q(z | x; ϕ) 1: procedure Sample(β) ▷ β from running Inside(s) 2: B = 0 ▷ binary matrix representation of tree 3: Q = [(1,T )] ▷ queue of constituents 4: while Q is not empty do 5: (i, j)∑ = pop(Q) j−1 · 6: τ = k=i β[i, k] β[k + 1, j] 7: for k := i to j − 1 do ▷ get distribution over splits 8: wk = (β[i, k] · β[k + 1, j])/τ 9: end for 10: k ∼ Cat([wi, . . . , wj−1]) ▷ sample a split point 11: Bi,k = 1, Bk+1,j = 1 ▷ update B 12: if k > i then ▷ if left child has width > 1 13: push(Q, (i, k)) ▷ add to queue 14: end if 15: if k + 1 < j then ▷ if right child has width > 1 16: push(Q, (k + 1, j)) ▷ add to queue 17: end if 18: end while 19: z = f(B) ▷ f : BT → ZT maps matrix representation of tree to se- quence of actions. 20: return z 21: end procedure where we adapt the algorithm for calculating tree entropy in PCFGs from Hwa (2000) to the CRF case. Since each step of this dynamic program is differentiable, we can

9 obtain the gradient ∇ϕH[q(z | x; ϕ)] using automatic differentation. An estimator for the gradient with respect to Eq(z | x;ϕ)[log p(x, z; θ)] is obtained via the score function

9 Note that ∇ϕH[q(z | x; ϕ)] can also be computed using the inside-outside algorithm and a second-order expectation semiring (Li & Eisner, 2009), which has the same asymptotic runtime complexity but generally better constants.

103 gradient estimator (Glynn, 1987; Williams, 1992),

∇ϕEq(z | x;ϕ)[log p(x, z; θ)] = Eq(z | x;ϕ)[log p(x, z; θ)∇ϕ log q(z | x; ϕ)]

1 ∑K ≈ log p(x, z(k); θ)∇ log q(z(k) | x; ϕ). K ϕ k=1

As mentioned in 2.5.5, the above estimator is unbiased but typically suffers from high variance. To reduce variance, we use a control variate derived from an average of the other samples’ joint likelihoods (Mnih & Rezende, 2016), yielding the following esti- mator, 1 ∑K (log p(x, z(k); θ) − r(k))∇ log q(z(k) | x; ϕ), K ϕ k=1 where 1 ∑ r(k) = log p(x, z(j); θ). K − 1 j≠ k

This control variate worked better than control variates from an auxiliary network

(Mnih & Gregor, 2014; Deng et al., 2018) or a language model (Yin et al., 2018).

5.4 Empirical Study

5.4.1 Experimental Setup

Data For English we use the Penn Treebank (Marcus et al., 1993, PTB) with splits and preprocessing from Dyer et al. (2016) which retains punctuation and replaces sin- gleton words with Berkeley parser’s mapping rules, resulting in a vocabulary of 23,815 word types. Notably this vocabulary size is much larger than the standard PTB LM setup from Mikolov et al. (2010) which uses 10K types. Also different from the LM setup, we model each sentence separately instead of carrying information across sen-

104 Algorithm 3 Calculating the tree entropy H[q(z | x; ϕ)] 1: procedure Entropy(β) ▷ β from running Inside(s) 2: for i := 1 to T do ▷ initialize entropy table 3: H[i, i] = 0 4: end for 5: for l := 1 to T − 1 do ▷ span length 6: for i := 1 to T − l do ▷ span start 7: j = i∑+ l ▷ span end j−1 · 8: τ = u=i β[i, u] β[u + 1, j] 9: for u := i to j − 1 do 10: wu = (β[i, u] · β[u + 1, j])/τ 11: end for ∑ j−1 − · 12: H[i, j] = u=i(H[i, u] + H[u + 1, j] log wu) wu 13: end for 14: end for 15: return H[1,T ] ▷ return tree entropy H[q(z | x; ϕ)] 16: end procedure tence boundaries, as the RNNG is a generative model of sentences. Hence our per- plexity numbers are not comparable to the standard PTB LM results (Melis et al.,

2018b; Merity et al., 2018; Yang et al., 2018).

Since the PTB is rather small, and since the URNNG does not require annotation, we also test our approach on a subset of the one billion word corpus (Chelba et al.,

2013). We randomly sample 1M sentences for training and 2K sentences for valida- tion/test, and limit the vocabulary to 30K word types. While still a subset of the full corpus (which has 30M sentences), this dataset is two orders of magnitude larger than PTB. Experiments on Chinese utilize version 5.1 of the Chinese Penn Treebank

(CTB) (Xue et al., 2005), with the same splits as in Chen & Manning (2014). Single- ton words are replaced with a single ⟨unk⟩ token, resulting in a vocabulary of 17,489 word types.

105 Hyperparameters The stack LSTM has two layers with input/hidden size equal to 650 and dropout of 0.5. The tree LSTM also has 650 units. The inference network uses a one-layer bidirectional LSTM with 256 hidden units, and the MLP (to produce span scores sij for i ≤ j) has a single hidden layer with a ReLU nonlinearity followed by layer normalization (Ba et al., 2016) and dropout of 0.5. We share word embed- dings between the generative model and the inference network, and also tie weights between the input/output word embeddings (Press & Wolf, 2016).

Optimization of the model itself required standard techniques for avoiding posterior collapse in VAEs, where posterior collapse in our context means that q(z | x; ϕ) always produced trivial (always left or right branching) trees. We warm-up the ELBO objec- tive by linearly annealing (per batch) the weight on the conditional prior log p(z | x

Initial model parameters are sampled uniformly from U(−0.1, 0.1). The learning rate starts decaying by a factor of 2 each epoch after the first epoch at which valida-

106 tion performance does not improve, but this learning rate decay is not triggered for the first eight epochs to ensure adequate training. We use the same hyperparame- ters/training setup for both PTB and CTB. For experiments on (the subset of) the one billion word corpus, we use a smaller dropout rate of 0.1. The corresponding baseline language model also uses the smaller dropout rate.

All models are trained with an end-of-sentence token, but for perplexity calculation these tokens are not counted to be comparable to prior work (Dyer et al., 2016; Kun- coro et al., 2017; Buys & Blunsom, 2018). To be more precise, the inference network does not make use of the end-of-sentence token to produce parse trees, but the gen- erative model is trained to generate the end-of-sentence token after the final reduce operation.

Baselines We compare the unsupervised RNNG (URNNG) against several base- lines: (1) RNNLM, a standard LSTM language model whose size is the same as URNNG’s stack LSTM; (2) Parsing Reading Predict Network (PRPN) (Shen et al., 2018b), a neural language model that uses gated attention layers to embed soft tree-like struc- tures into a neural network; (3) RNNG with trivial trees (left branching, right branch- ing, random); (4) supervised RNNG trained on unlabeled, binarized gold trees.10

Note that the supervised RNNG also trains a discriminative parser q(z | x; ϕ) (along- side the generative model p(x, z; θ)) in order to sample parse forests for perplexity evaluation (i.e. importance sampling). This discriminative parser has the same ar- chitecture as URNNG’s inference network. For all models, we perform early stopping

10We use right branching binarization—Matsuzaki et al. (2005) find that differences be- tween various binarization schemes have marginal impact. Our supervised RNNG therefore differs from the original RNNG, which trains on non-binarized trees and does not ignorecon- stituent labels.

107 PTB CTB Model PPL F1 PPL F1 RNNLM 93.2 – 201.3 – PRPN (default) 126.2 32.9 290.9 32.9 PRPN (tuned) 96.7 41.2 216.0 36.1 Left Branching Trees 100.9 10.3 223.6 12.4 Right Branching Trees 93.3 34.8 203.5 20.6 Random Trees 113.2 17.0 209.1 17.4 URNNG 90.6 40.7 195.7 29.1 RNNG 88.7 68.1 193.1 52.3 RNNG → URNNG 85.9 67.7 181.1 51.9 Oracle Binary Trees – 82.5 – 88.6

Table 5.1: Language modeling perplexity (PPL) and grammar induction F1 scores on English (PTB) and Chinese (CTB) for the different models. We separate results for those that makedo not make use of annotated data (top) versus those that do (mid). Note that our PTB setup from Dyer et al. (2016) differs considerably from the usual language modeling setup(Mikolov et al., 2010) since we model each sentence independently and use a much larger vocabulary. based on validation perplexity.

Code Our code is available at https://github.com/harvardnlp/urnng.

5.4.2 Results

Language Modeling Table 5.1 shows perplexity for the different models on PTB/CTB.

As a language model URNNG outperforms an RNNLM and is competitive with the supervised RNNG, where for RNNG and URNNG we estimate the log marginal likeli- hood (and hence, perplexity) with K = 1000 importance-weighted samples,

( ) 1 ∑K log p(x, z(k); θ) log p(x; θ) ≈ log . K q(z(k) | x; ϕ) k=1

108 Figure 5.2: Perplexity of the different models grouped by sentence length onPTB.

During evaluation only, we also flatten q(z | x; ϕ) by dividing span scores sij by a tem- perature term 2.0 before feeding it to the CRF. The left branching baseline performs poorly, implying that the strong performance of URNNG/RNNG is not simply due to the additional depth afforded by the tree LSTM composition function (a left branch- ing tree, which always performs reduce when possible, is the “deepest” model). Un- der our RNNG parameterization, the right branching baseline is equivalent to the standard RNNLM and hence performs the same (we show the perplexity numbers as a sanity check). We found PRPN with default hyperparameters (which obtains a per- plexity of 62.0 in the PTB setup from Mikolov et al. (2010)) to not perform well, but tuning hyperparameters improves performance.11

The supervised RNNG performs well as a language model, despite being trained on the joint (rather than marginal) likelihood objective; the supervised RNNG is trained to maximize log p(x, z; θ) while the unsupervised RNNG is trained to maxi- mize (a lower bound on) the language modeling objective log p(x; θ). This indicates that explicit modeling of syntax helps generalization even with richly-parameterized

11Using the code from https://github.com/yikangshen/PRPN, we tuned model size, ini- tialization, dropout, learning rate, and use of batch normalization.

109 PTB PPL KN 5-gram (Dyer et al., 2016) 169.3 RNNLM (Dyer et al., 2016) 113.4 Original RNNG (Dyer et al., 2016) 102.4 Gated-Attention RNNG (Kuncoro et al., 2017) 100.9 Generative Dependency Parser (Buys & Blunsom, 2015) 138.6 Supervised Syntactic NLM (Buys & Blunsom, 2018) 107.6 Unsupervised Syntactic NLM (Buys & Blunsom, 2018) 125.2 PRPN† (Shen et al., 2018b) 96.7 Our work: RNNLM 93.2 URNNG 90.6 RNNG 88.7 RNNG → URNNG 85.9 1M Sentences PPL PRPN† (Shen et al., 2018b) 77.7 RNNLM 77.4 URNNG 71.8 RNNG‡ 72.9 RNNG‡ → URNNG 72.0

Table 5.2: (Top) Comparison of our work as a language model against prior works on sentence- level PTB with preprocessing from Dyer et al. (2016). Note that previous versions of RNNG differ from ours in terms of parameterization and model size. (Bottom) Results on a subset (1M sen- tences) of the one billion word corpus. PRPN† is the model from Shen et al. (2018b), whose hy- perparameters were tuned by us. RNNG‡ is trained on predicted parse trees from the self-attentive parser from Kitaev & Klein (2018). neural models. Encouraged by these observations, we also experiment with a hybrid approach where we train a supervised RNNG first and continue fine-tuning the model

(including the inference network) on the URNNG objective (RNNG → URNNG in

Table 5.1). Concretely, after training the supervised RNNG we fine-tune for 10 epochs with the unsupervised objective and use a smaller learning rate of 0.1 for the genera- tive model. This approach results in nontrivial perplexity improvements, and suggests that it is possible to further improve language models with supervision on parsed

110 data, on top of the benefits afforded by the usual supervised RNNG training.

In Figure 5.2 we show perplexity by sentence length. We find that a standard lan- guage model (RNNLM) is better at modeling short sentences, but underperforms models that explicitly take into account structure (RNNG/URNNG) when the sen- tence length is greater than 10. Table 5.2 (top) compares our results against prior work on this version of the PTB, and Table 5.2 (bottom) shows the results on a 1M sentence subset of the one billion word corpus, which is two orders of magnitude larger than PTB. On this larger dataset URNNG still improves upon the RNNLM.

We also trained an RNNG (and RNNG → URNNG) on this dataset by parsing the training set with the self-attentive parser from Kitaev & Klein (2018), which obtains

12 an F1 score of 95.17 on the PTB test set. These models improve upon the RNNLM but not the URNNG, potentially highlighting the limitations of using predicted trees for supervising RNNGs.

Unsupervised Parsing Table 5.1 also shows the F1 scores for unspervised parsing, where we induce latent trees directly from words on the full dataset.13 For RNNG and URNNG we obtain the highest scoring tree from q(z | x; ϕ) through the Viterbi version of the inside algorithm (i.e. CKY algorithm). Note that we could alternatively estimate

arg max p(z | x; θ) z

12We use the benepar_en2 model from https://github.com/nikitakit/ self-attentive-parser. 13Past work on grammar induction generally train/evaluate on short sentences and also assume access to gold POS tags (Klein & Manning, 2002; Smith & Eisner, 2004; Bod, 2006). However more recent works train directly on words (Jin et al., 2018b; Shen et al., 2018b; Drozdov et al., 2019).

111 Tree PTB CTB Label URNNG PRPN Gold 40.7 29.1 SBAR 74.8% 28.9% Left 9.2 8.4 NP 39.5% 63.9% Right 68.3 51.2 VP 76.6% 27.3% Self 92.3 87.3 PP 55.8% 55.1% RNNG 55.4 47.1 ADJP 33.9% 42.5% PRPN 41.0 47.2 ADVP 50.4% 45.1%

Table 5.3: (Left) F1 scores of URNNG against other trees. “Self” refers to another URNNG trained with a different random seed. (Right) Recall of constituents by label for URNNGand PRPN. Recall for a particular label is the fraction of ground truth constituents of that label that were identified by the model (as in Htut et al. (2018)). by sampling parse trees from q(z | x; ϕ) and using p(x, z; θ) to rerank the output, as in Dyer et al. (2016); however we found this variational approximation to work well in practice. We calculate unlabeled F1 using evalb, which ignores punctua- tion and discards trivial spans (width-one and sentence spans).14 Since we compare

F1 against the original, non-binarized trees per convention in unsupervised parsing,

F1 scores of models using oracle binarized trees constitute the upper bounds. We confirm the replication study of Htut et al. (2018) and find that PRPN is a strong model for grammar induction. URNNG performs on par with PRPN on English but

PRPN does better on Chinese; both outperform right branching baselines. Table 5.3 further analyzes the learned trees and shows the F1 score of URNNG trees against other trees (left), and the recall of URNNG/PRPN trees against ground truth con- stituents (right). We find that trees induced by URNNG and PRPN are quite differ- ent; URNNG is more sensitive to SBAR and VP, while PRPN is better at identifying

NP. While left as future work, this naturally suggests a hybrid approach wherein the intersection of constituents from URNNG and PRPN is used to create a corpus of

14Available at https://nlp.cs.nyu.edu/evalb/. We evaluate with COLLINS.prm parame- ter file and LABELED option equal to 0.

112 partially annotated trees, which can be used to guide another model, e.g. via semi- supervision (Hwa, 1999).

5.4.3 Analysis

Distributional Metrics Table 5.4 shows some standard metrics related to the learned generative model/inference network, similar to those explored in chapter 3.

The “reconstruction” perplexity based on

Eq(z | x;ϕ)[log p(x | z; θ)], is much lower than actual perplexity, and further, the Kullback-Leibler divergence between the conditional prior and the variational posterior, given by

[ ] q(z | x; ϕ) Eq(z | x;ϕ) log , p(z | x

Syntactic Evaluation We perform a syntactic evaluation of the different models based on the setup from Marvin & Linzen (2018): the model is given two minimally different sentences, one grammatical and one ungrammatical, and must identify the

113 PTB CTB RNNG URNNG RNNG URNNG PPL 88.7 90.6 193.1 195.7 Recon. PPL 74.6 73.4 183.4 151.9 KL 7.10 6.13 11.11 8.91 Prior Entropy 7.65 9.61 9.48 15.13 Post. Entropy 1.56 2.28 6.23 5.75 Unif. Entropy 26.07 26.07 30.17 30.17

Table 5.4: Metrics related to the generative model/inference network for RNNG/URNNG. For the supervised RNNG we take the “inference network” to be the discriminative parser trained alongside the generative model. Recon. PPL is the reconstruction perplexity based on Eq(z | x;ϕ)[log p(x | z; θ)], and KL is the Kullback-Leibler divergence. Prior entropy is the entropy of the conditional prior p(z | x

the senators near the assistant are old

*the senators near the assistant is old

We modify the publicly available dataset from https://github.com/BeckyMarvin/

LM_syneval to only keep sentence pairs that did not have any unknown words with respect to our vocabulary, resulting in 80K sentence pairs for evaluation.15 Table 5.5 shows the accuracy results. Overall the supervised RNNG significantly outperforms the other models, indicating opportunities for further work in unsupervised modeling.

While the URNNG does slightly outperform an RNNLM, the distribution of errors made from both models are similar, and thus it is not clear whether the outperfor- mance is simply due to better perplexity or learning different structural biases.

15Further, we train on a much smaller corpus, and hence our results are not directly compa- rable to the numbers reported in Marvin & Linzen (2018)

114 RNNLM PRPN RNNG URNNG PPL 93.2 96.7 88.7 90.6 Overall 62.5% 61.9% 69.3% 64.6% Subj. 63.5% 63.7% 89.4% 67.2% Obj. Rel. 62.6% 61.0% 67.6% 65.7% Refl. 60.7% 68.8% 57.3% 60.5% NPI 58.7% 39.5% 46.8% 55.0%

Table 5.5: Syntactic evaluation based on the setup from Marvin & Linzen (2018). Subj. is subject-verb agreement in sentential complement, across prepositional phrase/subjective relative clause, and VP coordination; Obj. Rel. refers to subject-verb agreement in/across an objective rel- ative clause; Refl. refers to reflexive pronoun agreement with antecedent; NPI is negative polarity items.

5.5 Discussion

5.5.1 Limitations

There are several limitations to our approach. For one, the URNNG takes consider- ably more time/memory to train than a standard language model due to the O(T 3) dynamic program in the inference network, multiple samples to obtain control vari- ates for variance reducing gradient estimators, and dynamic computation graphs that make efficient batching nontrivial.16 We also found the inference network to be sen- sitive to the parameterization of span scores (i.e. it was important to use the “sub- traction” parameterization from Wang & Chang (2016)), which indicates that the context-freeness of the variational family is by itself not enough to provide the right inductive bias. The framework is also sensitive to other hyperparameters and required various optimization strategies (e.g. separate optimizers for the inference network and

16The main time bottleneck is the dynamic compution graph, since the dynamic program- ming algorithm can be batched (however the latter is a significant memory bottleneck). We manually batch the shift and reduce operation as much as possible, though recent work on auto-batching (Neubig et al., 2017) could potentially make this easier/faster.

115 the generative model, annealing the multiplier on the KL portion of the objective) to avoid posterior collapse.

In our experiments we adopt the experimental setup and preprocessed dataset from the original RNNG work (Dyer et al., 2016) (which is itself from the standard super- vised parsing setup), which does not discard punctuation. However, in subsequent experiment we were unable to improve upon a right-branching baseline when train- ing the URNNG on a version of PTB where punctuation is removed.17 Syntax and punctuation are related (Nunberg, 1990), and its inclusion in learning unsupervised parsing systems can be justified by the fact that punctuation can be partially seenas an analog to prosody in speech. However, an unsupervised parser should ideally still work well even when trained on corpora without punctuation. In the next chapter we revisit unsupervised parsing and propose an alternative approach which is able to learn nontrivial trees from unpunctuated data.

5.5.2 Rich Generative Models for Learning Latent Structures

Despite the fact that rich generative models—such as autoregressive models that fully condition on their history such as RNNs or RNNGs—can model the underlying data well in terms of perplexity, in chapter 3 we questioned the prudence of utilizing them for learning meaningful latent variables, given the fact that if the generative model is powerful enough it can model p⋆(x) without utilizing the latent variable. We saw in chapter 3 that one way to encourage meaningful use of the latent space is to reduce the amortization gap by better optimizing the variational posterior distribution.

17Many prior works that induce trees directly from words often employ additional heuris- tics based on punctuation (Seginer, 2007; Ponvert et al., 2011; Spitkovsky et al., 2013; Parikh et al., 2014), as punctuation (e.g. comma) is usually a reliable signal for start/end of con- stituent spans. We also reiterate that punctuation is used during training but ignored during evaluation.

116 In this chapter, we explored an alternative method to encourage meaningful latent space via posterior regularization (Ganchev et al., 2010), where we employ a discrim- inative, context-free parser to imbue the generative model with the appropriate in- ductive bias. In particular we exploit the ELBO objective to encourage the learning of generative models whose true posterior is close to the variational family of context- free distributions over trees. We found the use of this structured inference network to be crucial in learning linguistically meaningful parse trees.

Note that our use of posterior regularization via independence assumptions on the variational family is quite a weak form of constraint. An interesting future direction would involve richer inductive biases from posterior regularization by further constraining the inference network, for example based on universal syntactic rules (Naseem et al., 2010). Alternatively, we could add auxiliary objectives such as the contrastive loss proposed by Smith & Eisner (2005) to better train the inference network. Or further still, we could utilize a PCFG as an inference network, which not only encodes context-free assumptions but also allows for tractable training of the inference network itself on the corpus log marginal likelihood. We explore some of these extensions further in the next chapter, where we pretrain the inference network on parse trees obtained from an induced probabilistic grammar.

5.6 Related Work

There has been much work on incorporating tree structures into statistical models for syntax-aware language modeling, both for unconditional (Emami & Jelinek, 2005; Buys & Blunsom, 2015; Dyer et al., 2016) and conditional (Yin & Neubig, 2017; Alvarez-Melis & Jaakkola, 2017b; Rabinovich et al., 2017; Aharoni & Goldberg, 2017; Eriguchi et al., 2017; Wang et al., 2018b; Gu et al., 2018) cases. These approaches generally rely on annotated parse trees during training and maximize the joint likelihood of sentence-tree pairs. Neural approaches to combining language modeling and unsupervised tree learning typically embed soft, tree-like structures as hidden layers of a deep network (Cho et al., 2014a; Chung et al., 2017; Shen et al., 2018b, 2019).

In contrast, Buys & Blunsom (2018) make Markov assumptions and perform exact marginalization over latent dependency trees. Our work is also related to the recent line of work on learning latent trees as part of a deep model through supervision on other tasks, typically via differentiable structured hidden layers (Kim et al., 2017; Bradbury & Socher, 2017; Liu & Lapata, 2017; Tran & Bisk, 2018; Peng et al., 2018; Niculae et al., 2018b; Liu et al., 2018), policy gradient-based approaches (Yogatama et al., 2017; Williams et al., 2018; Havrylov et al., 2019), or differentiable relaxations (Choi et al., 2018; Maillard & Clark, 2018).

The variational approximation uses amortized inference (Kingma & Welling, 2014; Mnih & Gregor, 2014; Rezende et al., 2014), in which an inference network is used to obtain the variational posterior for each observed x. Since our inference network is structured (i.e., a CRF), it is also related to CRF autoencoders (Ammar et al., 2014) and structured VAEs (Johnson et al., 2016; Krishnan et al., 2018), which have been used previously for unsupervised (Cai et al., 2017; Drozdov et al., 2019; Li et al., 2019a) and semi-supervised (Yin et al., 2018; Corro & Titov, 2018) parsing.

5.7 Conclusion

It is an open question as to whether explicit modeling of syntax significantly helps neural models. Strubell et al. (2018) find that supervising intermediate attention layers with syntactic heads improves semantic role labeling, while Shi et al. (2018) observe that for text classification, syntactic trees have only marginal impact. In this chapter, we show that at least for language modeling, incorporating syntax either via explicit supervision or as latent variables does provide useful inductive biases and improves performance. Our results, along with other recent work on joint language modeling/structure learning with deep networks (Shen et al., 2018b, 2019; Wiseman et al., 2018; Kawakami et al., 2018), suggest that it is possible to learn generative models of language that model the underlying data well (i.e. assign high likelihood to held-out data) and at the same time induce meaningful linguistic structures.

6 Latent Variable Model of Grammars & Collapsed Variational Inference

6.1 Introduction

In this chapter we consider the problem of grammar induction with deep latent variable models. Unlike in chapter 5, where we were interested in learning a generative model that has low perplexity and performs well as an unsupervised parser, in this chapter we mainly focus on the case where the underlying structure is itself the goal. This will, as we will shortly see, allow us to work with generative models that have a better inductive bias towards meaningful structure and at the same time admit tractable inference algorithms through partial conditioning.

The material in this chapter is adapted from Kim et al. (2019a).

Learning generative grammars (i.e., recursive rewrite rules that generate language) from observed yields is a classical problem in natural language processing and computational linguistics, with historical roots going all the way back to Chomsky (1957):

One may arrive at a grammar by intuition, guess-work, all sorts of partial methodological hints, reliance on past experience, etc. It is no doubt possible to give an organized account of many useful procedures of analysis, but it is questionable whether these can be formulated rigorously, exhaustively and simply enough to qualify as a practical and mechanical discovery algorithm.

Initial theoretical results on grammar induction were largely negative. Gold's Theorem showed that it is not possible to learn even regular grammars from positive examples alone (Gold, 1967). However, the notion of learnability in the aforementioned work is rather strict and roughly requires that the learner is able to correctly identify the language after a finite number of mistakes (identification in the limit). Employing a different notion of learnability (measure-one learnability), Horning (1969) showed that it is possible to learn probabilistic grammars from positive examples alone, though the algorithm given in Horning (1969) is enumerative and thus not computationally practical.

Probabilistic approaches to grammar induction generally require specifying a probabilistic grammar (e.g. formalism, number and shape of rules), and fitting its parameters through optimization. Early work found that it was difficult to induce probabilistic context-free grammars (PCFG) from natural language data through direct methods, such as optimizing the log likelihood with the EM algorithm (Carroll & Charniak, 1992; Charniak, 1993). While the reasons for the failure are manifold and not completely understood, two major potential causes are the ill-behaved optimization landscape and the overly strict independence assumptions of PCFGs. More successful approaches to grammar induction have thus resorted to carefully-crafted auxiliary objectives (Klein & Manning, 2002), priors or non-parametric models (Kurihara & Sato, 2006; Johnson et al., 2007; Liang et al., 2007; Wang & Blunsom, 2013), and manually-engineered features (Huang et al., 2012; Golland et al., 2012) to encourage the desired structures to emerge.

In this chapter, we revisit these aforementioned issues in light of advances in deep parameterization and inference. Contrary to common wisdom, we find that parameterizing a PCFG's rule probabilities with neural networks over distributed representations makes it possible to induce linguistically meaningful grammars by simple maximum likelihood learning. While the optimization problem remains non-convex, recent work suggests that there are optimization benefits afforded by over-parameterized models (Arora et al., 2018; Xu et al., 2018; Du et al., 2019), and we indeed find that this neural PCFG is significantly easier to optimize than the traditional PCFG. This factored parameterization makes it straightforward to incorporate side information into rule probabilities through a sentence-level continuous latent vector, which effectively allows different contexts in a derivation to coordinate. In this compound PCFG (a continuous mixture of PCFGs), the context-free assumptions hold conditioned on the latent vector but not unconditionally, thereby obtaining longer-range dependencies within a tree-based generative process. While compound PCFGs break efficient exact inference, if the latent vector is known the distribution over trees reduces to a standard PCFG. This property allows us to perform learning and inference with collapsed variational inference, where the latent trees are marginalized out exactly with dynamic programming and standard amortized inference is used to handle the latent vector.

Evaluating generative grammars is itself a difficult problem, and a standard approach is to use the grammar to parse a set of sentences and compare the predicted trees against the linguistically annotated trees. That is, we evaluate the learned grammar as an unsupervised parsing system. On unsupervised parsing benchmarks for English and Chinese, the proposed approach is found to perform favorably against recent neural network-based approaches to unsupervised parsing, including the unsupervised recurrent neural network grammar studied in the previous chapter. Finally, we show that the induced trees can be used to supervise a recurrent neural network grammar that performs well as a language model and at the same time learns meaningful linguistic structure.

6.2 Background

6.2.1 Probabilistic Context-Free Grammars

We consider context-free grammars (CFG) consisting of a 5-tuple G = (S, N, P, Σ, R) where S is the distinguished start symbol, N is a finite set of nonterminals, P is a finite set of preterminals, Σ is a finite set of terminal symbols, and R is a finite set of rules of the form,

S → A,    A ∈ N,

A → B C,    A ∈ N, B, C ∈ N ∪ P,

T → w,    T ∈ P, w ∈ Σ.

A probabilistic context-free grammar (PCFG) consists of a grammar G and rule probabilities π = {π_r}_{r ∈ R} such that π_r is the probability of the rule r. Since we will be inducing a grammar directly from words, P will roughly correspond to part-of-speech tags and N will correspond to constituent labels.

Letting T_G be the set of all parse trees of G, a PCFG defines a probability distribution over trees t ∈ T_G via

$$p(t; \pi) = \prod_{r \in t_{\mathcal{R}}} \pi_r,$$

where $t_{\mathcal{R}}$ is the set of rules used in the derivation of $t$. It also defines a distribution over strings of terminals $x \in \Sigma^*$ via

$$p(x; \pi) = \sum_{t \in \mathcal{T}_G(x)} p(t; \pi),$$

where $\mathcal{T}_G(x) = \{t \mid \mathrm{yield}(t) = x\}$, i.e. the set of trees $t$ such that $t$'s leaves are $x$. We will slightly abuse notation and use

$$p(t \mid x; \pi) \triangleq \frac{\mathbb{1}[\mathrm{yield}(t) = x]\, p(t; \pi)}{p(x; \pi)}$$

to denote the posterior distribution over the unobserved latent trees given the observed sentence $x$, where $\mathbb{1}[\cdot]$ is the indicator function.1 For a given sentence $x$, we can efficiently perform posterior inference and marginalization over the exponentially-sized set $\mathcal{T}_G(x)$ using dynamic programming (Baker, 1979).
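To make the dynamic program concrete, the following is a minimal plain-Python sketch of the inside algorithm for a toy PCFG in Chomsky normal form; the grammar, rule probabilities, and function names are illustrative assumptions and not the implementation used in this thesis.

    from collections import defaultdict

    # Hypothetical toy PCFG in Chomsky normal form.
    # Binary rules: (A, B, C) -> prob of A -> B C; lexical rules: (T, w) -> prob of T -> w.
    binary = {("S", "NP", "VP"): 1.0,
              ("NP", "D", "N"): 1.0,
              ("VP", "V", "NP"): 1.0}
    lexical = {("D", "the"): 1.0, ("N", "dog"): 0.5, ("N", "cat"): 0.5,
               ("V", "saw"): 1.0}

    def inside(sentence, root="S"):
        """Return p(x), the sum over all trees, by dynamic programming over spans."""
        n = len(sentence)
        # chart[i][j][A] = inside probability of symbol A spanning words i..j (inclusive)
        chart = [[defaultdict(float) for _ in range(n)] for _ in range(n)]
        for i, w in enumerate(sentence):
            for (T, word), p in lexical.items():
                if word == w:
                    chart[i][i][T] += p
        for width in range(2, n + 1):
            for i in range(n - width + 1):
                j = i + width - 1
                for k in range(i, j):                 # split point between left/right children
                    for (A, B, C), p in binary.items():
                        chart[i][j][A] += p * chart[i][k][B] * chart[k + 1][j][C]
        return chart[0][n - 1][root]

    print(inside(["the", "dog", "saw", "the", "cat"]))   # marginal likelihood p(x)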

6.2.2 Grammar Induction vs. Unsupervised Parsing

In this chapter we focus on learning a grammar, i.e., a set of recursive rewrite rules and their associated probabilities which give a stochastic account of how sentences in natural language are generated. Once learned, we will evaluate the induced grammar on unsupervised parsing. That is, we will use the induced grammar to parse a set of unseen test sentences and compare the predicted parse trees against linguistically-annotated parse trees. It is worth noting that grammar induction and unsupervised parsing are not synonymous. Indeed, there is much prior work focusing on learning an unsupervised parser without learning an underlying grammar (Klein & Manning, 2002; Smith & Eisner, 2005; Bod, 2006; Seginer, 2007; Shen et al., 2018b, 2019). However, for the purposes of this chapter we will not dwell on the distinction too much since we will still evaluate the induced grammar as an unsupervised parser.

6.3 Compound Probabilistic Context-Free Grammars

In this section we first describe a neural PCFG that makes use of neural networks over symbol embeddings to obtain rule probabilities. We then show that this embedding parameterization admits a natural extension to a richer grammar that makes use of a compound generative process. This compound PCFG models long-range dependencies through a sentence-level latent vector. Finally, we describe a collapsed amortized variational inference approach for performing inference in the compound PCFG.

1Therefore when used in the context of posterior distributions p(t | x), t does not include the leaves x and only refers to the unobserved latent nonterminal/preterminal symbols.

6.3.1 A Neural Parameterization

The standard way to parameterize a PCFG is to simply associate a scalar π_r with each rule r, with the constraint that the scalars form valid probability distributions, i.e. each nonterminal is associated with a fully-parameterized categorical distribution over its possible rules. This direct parameterization is algorithmically convenient since the M-step in the EM algorithm (Dempster et al., 1977) has a closed form. However, there is a long history of work showing that it is difficult to learn meaningful grammars from natural language data with this parameterization (Carroll & Charniak, 1992). Successful approaches to unsupervised parsing have therefore modified the model/learning objective by guiding potentially unrelated rules to behave similarly, through, for example, prior distributions over the categorical distributions.

Recognizing that sharing among rule types is beneficial, in this chapter we consider a neural parameterization where rule probabilities are based on distributed representations of symbols within the grammar. We associate embeddings with each symbol, introducing input embeddings $w_N$ for each symbol $N$ on the left side of a rule, i.e. $N \in \{S\} \cup \mathcal{N} \cup \mathcal{P}$. For each rule type $r$, $\pi_r$ is parameterized as follows,

$$\pi_{S \rightarrow A} = \frac{\exp\!\big(u_A^\top \mathrm{MLP}_1(w_S; \theta) + b_A\big)}{\sum_{A' \in \mathcal{N}} \exp\!\big(u_{A'}^\top \mathrm{MLP}_1(w_S; \theta) + b_{A'}\big)},$$

$$\pi_{A \rightarrow BC} = \frac{\exp\!\big(u_{BC}^\top w_A + b_{BC}\big)}{\sum_{B'C' \in \mathcal{M}} \exp\!\big(u_{B'C'}^\top w_A + b_{B'C'}\big)},$$

$$\pi_{T \rightarrow w} = \frac{\exp\!\big(u_w^\top \mathrm{MLP}_2(w_T; \theta) + b_w\big)}{\sum_{w' \in \Sigma} \exp\!\big(u_{w'}^\top \mathrm{MLP}_2(w_T; \theta) + b_{w'}\big)},$$

where in the second equation the softmax normalizes over the product space $\mathcal{M} = (\mathcal{N} \cup \mathcal{P}) \times (\mathcal{N} \cup \mathcal{P})$.

The MLPs have two residual layers,

MLP(x) = ResidualLayer1(ResidualLayer2(Wx + b)),

ResidualLayer(y) = ReLU(V ReLU(Uy + p) + q) + y.

We have separate MLPs for $\pi_{S \rightarrow A}$ and $\pi_{T \rightarrow w}$, and do not employ an MLP for $\pi_{A \rightarrow BC}$ as we found the model to perform better without an MLP for rules of this form. We will use

$$\mathbf{E}_G = \{w_N \mid N \in \{S\} \cup \mathcal{N} \cup \mathcal{P}\} \cup \{u_M \mid M \in \mathcal{N} \cup \mathcal{M} \cup \Sigma\}$$

to denote the set of input/output symbol embeddings for a grammar $G$, and $\lambda$ to refer to the parameters of the neural network used to obtain the rule probabilities (i.e. the parameters of the MLPs). A graphical model-like illustration of the neural PCFG is shown in Figure 6.1 (left). It is clear that the neural parameterization does not change the underlying context-free assumptions. The difference between the two is analogous to the difference between count-based and feed-forward neural language models, where feed-forward neural language models make the same Markov assumptions as the count-based models but are able to take advantage of shared, distributed representations that promote parameter sharing.
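As a rough illustration of this parameterization (not the released implementation), the PyTorch-style sketch below produces the three families of log rule probabilities from shared symbol embeddings; the module and variable names are assumptions, and the sizes simply mirror the hyperparameters reported later in this chapter.

    import torch
    import torch.nn as nn

    class ResidualLayer(nn.Module):
        # ResidualLayer(y) = ReLU(V ReLU(U y + p) + q) + y, as in the text.
        def __init__(self, dim):
            super().__init__()
            self.lin1, self.lin2 = nn.Linear(dim, dim), nn.Linear(dim, dim)
        def forward(self, y):
            return torch.relu(self.lin2(torch.relu(self.lin1(y)))) + y

    class NeuralPCFG(nn.Module):
        """Minimal sketch: log rule probabilities from symbol embeddings (illustrative only)."""
        def __init__(self, n_nt=30, n_pt=60, vocab=10000, dim=256):
            super().__init__()
            self.w_S = nn.Parameter(torch.randn(dim))          # start-symbol input embedding
            self.w_N = nn.Parameter(torch.randn(n_nt, dim))    # nonterminal input embeddings
            self.w_P = nn.Parameter(torch.randn(n_pt, dim))    # preterminal input embeddings
            # MLP(x) = ResidualLayer(ResidualLayer(Wx + b))
            self.mlp1 = nn.Sequential(nn.Linear(dim, dim), ResidualLayer(dim), ResidualLayer(dim))
            self.mlp2 = nn.Sequential(nn.Linear(dim, dim), ResidualLayer(dim), ResidualLayer(dim))
            self.root_out = nn.Linear(dim, n_nt)               # scores for S -> A
            self.rule_out = nn.Linear(dim, (n_nt + n_pt) ** 2) # scores for A -> B C (no MLP)
            self.term_out = nn.Linear(dim, vocab)              # scores for T -> w

        def rule_log_probs(self):
            log_pi_S = self.root_out(self.mlp1(self.w_S)).log_softmax(-1)   # (n_nt,)
            log_pi_N = self.rule_out(self.w_N).log_softmax(-1)              # (n_nt, (n_nt+n_pt)^2)
            log_pi_P = self.term_out(self.mlp2(self.w_P)).log_softmax(-1)   # (n_pt, vocab)
            return log_pi_S, log_pi_N, log_pi_P

These log probabilities would then be fed to a (differentiable) inside algorithm to compute the sentence marginal likelihood.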

6.3.2 A Compound Extension

A compound probability distribution (Robbins, 1951) is a distribution whose parameters are themselves random variables. These distributions generalize mixture models to the continuous case; for example, factor analysis assumes the following generative process,

$$z \sim \mathcal{N}(z; 0, I), \qquad x \sim \mathcal{N}(x; Wz, \Sigma).$$

Compound distributions provide the ability to model rich generative processes, but marginalizing over the latent parameter can be computationally intractable unless conjugacy can be exploited.
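For instance, a minimal NumPy sketch of the factor-analysis generative process above, with arbitrary illustrative dimensions and noise scale:

    import numpy as np

    d_z, d_x = 4, 10                        # illustrative dimensions
    W = np.random.randn(d_x, d_z)           # factor loadings
    z = np.random.randn(d_z)                # z ~ N(0, I): the "random parameter"
    x = W @ z + 0.1 * np.random.randn(d_x)  # x ~ N(Wz, Sigma), here Sigma = 0.01 I
    # Marginalizing z analytically gives x ~ N(0, W W^T + Sigma), a case where
    # conjugacy makes the compound distribution tractable.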

In this chapter we study compound probabilistic context-free grammars whose distribution over trees arises from the following generative process: we first obtain rule probabilities via

$$z \sim p(z; \gamma) = \mathcal{N}(z; 0, I), \qquad \pi_z = \mathrm{NeuralNet}(z, \mathbf{E}_G; \lambda),$$

where $p(z; \gamma)$ is a prior with parameters $\gamma$ (a spherical Gaussian in this chapter), and $\mathrm{NeuralNet}(z, \mathbf{E}_G; \lambda)$ is a neural network that concatenates the input symbol embeddings with $z$ and outputs the sentence-level rule probabilities $\pi_z$,

$$\pi_{z, S \rightarrow A} = \frac{\exp\!\big(u_A^\top \mathrm{MLP}_1([w_S; z]; \theta) + b_A\big)}{\sum_{A' \in \mathcal{N}} \exp\!\big(u_{A'}^\top \mathrm{MLP}_1([w_S; z]; \theta) + b_{A'}\big)},$$

$$\pi_{z, A \rightarrow BC} = \frac{\exp\!\big(u_{BC}^\top [w_A; z] + b_{BC}\big)}{\sum_{B'C' \in \mathcal{M}} \exp\!\big(u_{B'C'}^\top [w_A; z] + b_{B'C'}\big)},$$

$$\pi_{z, T \rightarrow w} = \frac{\exp\!\big(u_w^\top \mathrm{MLP}_2([w_T; z]; \theta) + b_w\big)}{\sum_{w' \in \Sigma} \exp\!\big(u_{w'}^\top \mathrm{MLP}_2([w_T; z]; \theta) + b_{w'}\big)}.$$

Figure 6.1: A graphical model-like diagram for the neural PCFG (left) and the compound PCFG (right) for an example tree structure. In the above, A1, A2 ∈ N are nonterminals, T1, T2, T3 ∈ P are preterminals, and w1, w2, w3 ∈ Σ are terminals. In the neural PCFG, the global rule probabilities π = πS ∪ πN ∪ πP are the output from a neural net run over the symbol embeddings EG, where πN is the set of rules with a nonterminal on the left hand side (πS and πP are similarly defined). In the compound PCFG, we have per-sentence rule probabilities πz = πz,S ∪ πz,N ∪ πz,P obtained from running a neural net over a random vector z (which varies across sentences) and the global symbol embeddings EG. In this case, the context-free assumptions hold conditioned on z, but they do not hold unconditionally: e.g. when conditioned on z and A2, the variables A1 and T1 are independent; however when conditioned on just A2, they are not independent due to the dependence path through z. Note that the rule probabilities are random variables in the compound PCFG but deterministic variables in the neural PCFG.

The MLPs are as in the neural PCFG where the first layer’s input dimensions are appropriately changed to account for concatenation with z. Then a tree/sentence is sampled from a PCFG with rule probabilities given by πz,

t ∼ PCFG(πz), x = yield(t).
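As a runnable toy illustration of this generative process (not the actual model), the sketch below samples a sentence from a hand-specified grammar in which only the terminal rules are modulated by a shared latent vector z; all symbols, rules, and parameters are hypothetical.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy grammar: each nonterminal has a single binary rule, preterminals emit words.
    PT = ["D", "N", "V"]
    BINARY = {"S": ("NP", "VP"), "NP": ("D", "N"), "VP": ("V", "NP")}
    EMIT = {"D": ["the"], "N": ["dog", "cat"], "V": ["saw", "chased"]}

    d_z = 4
    # Per-(preterminal, word) logit weights on z: illustrative random parameters.
    W = {T: rng.normal(size=(len(EMIT[T]), d_z)) for T in PT}

    def sample_sentence():
        z = rng.normal(size=d_z)                  # z ~ N(0, I), shared across the whole sentence
        def expand(sym):
            if sym in BINARY:                     # expand A -> B C
                left, right = BINARY[sym]
                return expand(left) + expand(right)
            logits = W[sym] @ z                   # pi_{z, T -> w}: softmax of z-dependent logits
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()
            return [rng.choice(EMIT[sym], p=probs)]
        return expand("S")                        # yield(t): the terminal leaves

    print(sample_sentence())   # e.g. ['the', 'dog', 'saw', 'the', 'cat']

Because the same z modulates every emission, words in different parts of the tree become correlated once z is marginalized out, which is exactly the effect the compound PCFG exploits.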

This can be viewed as a continuous mixture of PCFGs, or alternatively, a Bayesian PCFG with a prior on sentence-level rule probabilities parameterized by z, λ, and EG.2

Importantly, under this generative model the context-free assumptions hold conditioned on z, but they do not hold unconditionally. This is shown in Figure 6.1 (right), where there is a dependence path through z if it is not conditioned upon. Compound PCFGs give rise to a marginal distribution over parse trees t via

$$p(t; \theta) = \int p(t \mid z; \theta)\, p(z; \gamma)\, dz,$$

where $p(t \mid z; \theta) = \prod_{r \in t_{\mathcal{R}}} \pi_{z,r}$. The subscript in $\pi_{z,r}$ denotes the fact that the rule probabilities depend on $z$. Compound PCFGs are clearly more expressive than PCFGs, as each sentence has its own set of rule probabilities. However, the model still assumes a tree-based generative process, making it possible to learn latent tree structures.

One motivation for the compound PCFG is the fact that simple, unlexicalized context-free grammars such as the one we have been working with are unlikely to be adequate statistical models of natural language, even though their simplicity facilitates efficient training.3 We can in principle model richer dependencies through vertical/horizontal Markovization (Johnson, 1998; Klein & Manning, 2003) and lexicalization (Collins, 1997). However, such dependencies complicate training due to the rapid increase in the number of rules. Under this view, we can interpret the compound PCFG as a restricted version of some lexicalized, higher-order PCFG in which a child can depend on structural and lexical context through a shared latent vector.4 We hypothesize that this dependence among siblings is especially useful in grammar induction from words, where (for example) if we know that watched is used as a verb then the noun phrase is likely to be a movie.

2Under the Bayesian PCFG view, p(z; γ) is a distribution over z (a subset of the prior), and is thus a hyperprior.

3A piece of evidence for the misspecification of first-order, unlexicalized PCFGs as a statistical model of natural language is that if one pretrains them on supervised data and continues training with the unsupervised objective (i.e. the log marginal likelihood), the resulting grammar deviates significantly from the supervised initial grammar while the log marginal likelihood improves (Johnson et al., 2007). Similar observations have been made for part-of-speech induction with Hidden Markov Models (Merialdo, 1994).

In contrast to the usual Bayesian treatment of PCFGs, which places priors on global rule probabilities (Kurihara & Sato, 2006; Johnson et al., 2007; Wang & Blunsom, 2013), the compound PCFG assumes a prior on local, sentence-level rule probabilities. It is therefore closely related to the Bayesian grammars studied by Cohen et al. (2009) and Cohen & Smith (2009), who also sample local rule probabilities from a logistic normal prior for training the dependency model with valence (DMV) (Klein & Manning, 2004).

6.3.3 Training and Inference

The expressivity of compound PCFGs comes at the cost of a significant challenge in learning and inference. Letting θ be the parameters of the generative model, we would like to maximize the log marginal likelihood of the observed sentence, log p(x; θ). In the neural PCFG the log marginal likelihood

$$\log p(x; \theta) = \log \sum_{t \in \mathcal{T}_G(x)} p(t; \theta)$$

can be obtained by summing out the latent tree structure using the inside algorithm (Baker, 1979), which is differentiable and thus amenable to gradient-based optimization. In the compound PCFG, the log marginal likelihood is given by

$$\log p(x; \theta) = \log \left( \int p(x \mid z; \theta)\, p(z; \gamma)\, dz \right) = \log \left( \int \sum_{t \in \mathcal{T}_G(x)} p(t \mid z; \theta)\, p(z; \gamma)\, dz \right).$$

4Another interpretation of the compound PCFG is to view it as a vectorized version of indexed grammars (Aho, 1968), which extend CFGs by augmenting nonterminals with additional index strings that may be inherited or modified during derivation. Compound PCFGs instead equip nonterminals with a continuous vector that is always inherited.

Notice that while the integral over z makes this quantity intractable, when we condition on z we can tractably perform the inner summation to obtain p(x | z; θ) using the inside algorithm. We therefore resort to collapsed amortized variational inference, where we first obtain a sample z from a variational posterior distribution (given by an amortized inference network with parameters ϕ), then perform the inner marginalization conditioned on this sample. As usual, the evidence lower bound is given by

$$\mathrm{ELBO}(\theta, \phi; x) = \mathbb{E}_{q(z \mid x; \phi)}[\log p(x \mid z; \theta)] - \mathrm{KL}[q(z \mid x; \phi) \,\|\, p(z; \gamma)],$$

and we can calculate p(x | z; θ) given a sample z from the variational posterior q(z | x; ϕ). For the variational family we use a diagonal Gaussian where the mean and log-variance vectors are given by an affine layer over max-pooled hidden states from an LSTM over x. We can then obtain low-variance estimators for $\nabla_{\theta, \phi}\mathrm{ELBO}(\theta, \phi; x)$ by using the reparameterization trick for the expected reconstruction likelihood and the analytical expression for the KL term, as we saw in chapter 3. We remark that under the Bayesian PCFG view, since the parameters of the prior (i.e. θ) are estimated from the data, our approach can be seen as an instance of empirical Bayes (Robbins, 1956).
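A minimal PyTorch sketch of one training step under this scheme is given below, assuming the exact inner marginalization over trees (the inside algorithm applied to πz) is provided as a callable log_px_given_z; the class, argument names, and dimensions are illustrative and do not reproduce the released code.

    import torch
    import torch.nn as nn

    class InferenceNetwork(nn.Module):
        """Amortized posterior q(z | x): BiLSTM + max-pool -> diagonal Gaussian (sketch)."""
        def __init__(self, vocab=10000, emb=256, hidden=512, z_dim=64):
            super().__init__()
            self.emb = nn.Embedding(vocab, emb)
            self.lstm = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
            self.out = nn.Linear(2 * hidden, 2 * z_dim)
        def forward(self, x):                        # x: (batch, T) word ids
            h, _ = self.lstm(self.emb(x))            # (batch, T, 2*hidden)
            pooled = h.max(dim=1).values             # max-pool over time
            mu, logvar = self.out(pooled).chunk(2, dim=-1)
            return mu, logvar

    def elbo_step(x, inf_net, log_px_given_z, optimizer):
        """One collapsed amortized VI step. log_px_given_z(x, z) stands in for the exact
        marginalization over latent trees with the inside algorithm (not shown here)."""
        mu, logvar = inf_net(x)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()         # reparameterization trick
        rec = log_px_given_z(x, z)                                    # log p(x | z; theta), per example
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(-1)    # KL[q(z|x) || N(0, I)], analytic
        loss = -(rec - kl).mean()                                     # negative ELBO
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()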

Maximum A Posteriori Inference  After training, we are interested in comparing the learned trees against an annotated treebank. This requires inferring the most likely tree given a sentence, i.e.

$$\arg\max_t\, p(t \mid x; \theta).$$

For the neural PCFG we can obtain the most likely tree by using the Viterbi version of the inside algorithm (the CKY algorithm). For the compound PCFG, the arg max is intractable to obtain exactly, and hence we estimate it with the following approximation,

$$\arg\max_t \int p(t \mid x, z; \theta)\, p(z \mid x; \theta)\, dz \;\approx\; \arg\max_t\, p\big(t \mid x, \mu_\phi(x); \theta\big),$$

where $\mu_\phi(x)$ is the mean vector from the inference network. The above approximates the true posterior $p(z \mid x; \theta)$ with $\delta(z - \mu_\phi(x))$, the Dirac delta function at the mode of the variational posterior.5 This quantity is tractable to compute, as in the PCFG case. Other approximations are possible: for example we could use q(z | x; ϕ) as an importance sampling distribution to estimate the first integral. However, we found the above approximation to be efficient and effective in practice.

5Since $p(t \mid x, z; \theta)$ is continuous with respect to $z$, we have $\int p(t \mid x, z; \theta)\, \delta(z - \mu_\phi(x))\, dz = p\big(t \mid x, \mu_\phi(x); \theta\big)$.
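For the neural PCFG (or for the compound PCFG after instantiating the rule probabilities at z = μϕ(x)), the arg max can be computed with a Viterbi variant of the chart recursion; below is a toy, self-contained sketch over the same hypothetical grammar used in the inside-algorithm example above, with sums replaced by maxes and backpointers added.

    from collections import defaultdict

    binary = {("S", "NP", "VP"): 1.0, ("NP", "D", "N"): 1.0, ("VP", "V", "NP"): 1.0}
    lexical = {("D", "the"): 1.0, ("N", "dog"): 0.5, ("N", "cat"): 0.5, ("V", "saw"): 1.0}

    def viterbi_cky(sentence, root="S"):
        """Most likely parse: the inside recursion with sums replaced by maxes."""
        n = len(sentence)
        best = [[defaultdict(float) for _ in range(n)] for _ in range(n)]
        back = [[dict() for _ in range(n)] for _ in range(n)]
        for i, w in enumerate(sentence):
            for (T, word), p in lexical.items():
                if word == w and p > best[i][i][T]:
                    best[i][i][T], back[i][i][T] = p, w
        for width in range(2, n + 1):
            for i in range(n - width + 1):
                j = i + width - 1
                for k in range(i, j):
                    for (A, B, C), p in binary.items():
                        score = p * best[i][k][B] * best[k + 1][j][C]
                        if score > best[i][j][A]:
                            best[i][j][A], back[i][j][A] = score, (k, B, C)
        def build(i, j, A):                     # follow backpointers to recover the tree
            bp = back[i][j][A]
            if isinstance(bp, str):
                return (A, bp)
            k, B, C = bp
            return (A, build(i, k, B), build(k + 1, j, C))
        return build(0, n - 1, root)

    print(viterbi_cky(["the", "dog", "saw", "the", "cat"]))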

6.4 Empirical Study

6.4.1 Experimental Setup

Data  We test our approach on the English Penn Treebank (PTB) (Marcus et al., 1993) with the standard splits (2-21 for training, 22 for validation, 23 for test) and the same preprocessing as in Shen et al. (2018b) and Shen et al. (2019), where we discard punctuation, lowercase all tokens, and take the top 10K most frequent words as the vocabulary. This setup differs from the setup in chapter 5 since we discard punctuation and use a smaller vocabulary. We further experiment on Chinese with version 5.1 of the Chinese Penn Treebank (CTB) (Xue et al., 2005), with the same splits as in Chen & Manning (2014). On CTB we also remove punctuation and keep the top 10K word types.

Hyperparameters  Our PCFG uses 30 nonterminals and 60 preterminals, with 256-dimensional symbol embeddings. The compound PCFG uses 64-dimensional latent vectors. The bidirectional LSTM inference network has a single layer with 512 dimensions, and the mean and the log variance vector for q(z | x; ϕ) are given by max-pooling the hidden states of the LSTM and passing it through an affine layer. Model parameters are initialized with Xavier uniform initialization. For training we use Adam with β1 = 0.75, β2 = 0.999 and a learning rate of 0.001, with a maximum gradient norm limit of 3. We train for 10 epochs with batch size equal to 4. We employ a curriculum learning strategy (Bengio et al., 2009) where we train only on sentences of length up to 30 in the first epoch, and increase this length limit by 1 each epoch.

Similar curriculum-based strategies have been used in the past for grammar induction (Spitkovsky et al., 2012). During training we perform early stopping based on validation perplexity.6 Finally, to mitigate against overfitting to PTB, experiments on CTB utilize the same hyperparameters as on PTB.
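The optimization setup described above might be wired up roughly as follows; this is a hedged sketch in PyTorch where model and corpus are assumed to be provided and model(x) returns the per-sentence training loss.

    import torch

    def train(model, corpus, epochs=10, batch_size=4, base_max_len=30):
        """Sketch of the setup described above: Adam with beta1 = 0.75, gradient-norm
        clipping at 3, and a length-based curriculum. corpus is a list of tensors of
        word ids; model(x) is assumed to return the per-sentence loss (e.g. the
        negative log marginal likelihood or negative ELBO)."""
        opt = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.75, 0.999))
        for epoch in range(epochs):
            max_len = base_max_len + epoch            # curriculum: grow the length cap each epoch
            batch = []
            for x in corpus:
                if len(x) > max_len:
                    continue
                batch.append(x)
                if len(batch) == batch_size:
                    loss = torch.stack([model(s) for s in batch]).mean()
                    opt.zero_grad()
                    loss.backward()
                    torch.nn.utils.clip_grad_norm_(model.parameters(), 3.0)
                    opt.step()
                    batch = []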

Baselines  We observe that even on PTB there is enough variation in setups across prior work on grammar induction to render a meaningful comparison difficult. Some important dimensions along which prior works vary include: (1) input data: earlier work on grammar induction generally assumed gold (or induced) part-of-speech tags (Klein & Manning, 2004; Smith & Eisner, 2004; Bod, 2006; Snyder et al., 2009), while more recent works induce grammars directly from words (Spitkovsky et al., 2013; Shen et al., 2018b); (2) use of punctuation: even within papers that induce a grammar directly from words, some papers employ heuristics based on punctuation, as punctuation is usually a strong signal for the start/end of constituents (Seginer, 2007; Ponvert et al., 2011; Spitkovsky et al., 2013), some train with punctuation (Jin et al., 2018b; Drozdov et al., 2019; Kim et al., 2019b), while others discard punctuation altogether for training (Shen et al., 2018b, 2019); (3) train/test data: some works do not explicitly separate out train/test sets (Reichart & Rappoport, 2010; Golland et al., 2012) while some do (Huang et al., 2012; Parikh et al., 2014; Htut et al., 2018). Maintaining train/test splits is less of an issue for unsupervised structure learning; however, in this work we follow the latter and separate train/test data. (4) evaluation: for unlabeled F1, almost all works ignore punctuation (even approaches that use punctuation during training typically ignore it during evaluation), but there is some variance in discarding trivial spans (width-one and sentence-level spans) and in using corpus-level versus sentence-level F1.7 In this chapter we discard trivial spans and evaluate on sentence-level F1 per recent work (Shen et al., 2018b, 2019). Notably, sentence-level F1 differs from the corpus-level F1 calculated by evalb in the previous chapter.

Given the above, we mainly compare our approach against two recent, strong baselines with open source code: the Parsing-Reading-Predict Network (PRPN)8 (Shen et al., 2018b) and Ordered Neurons (ON)9 (Shen et al., 2019). These approaches train a neural language model with gated attention-like mechanisms to induce binary trees, and achieve strong unsupervised parsing performance even when trained on corpora where punctuation is removed. Since the original results were on both language modeling and grammar induction, their hyperparameters were presumably tuned to do well on both and thus may not be optimal for unsupervised parsing alone. We therefore tune the hyperparameters of these baselines for unsupervised parsing only (i.e. on validation F1).

6However, we used F1 against validation trees on PTB to select some hyperparameters (e.g. grammar size), as is sometimes done in grammar induction. Hence our PTB results are arguably not fully unsupervised in the strictest sense of the term. The hyperparameters of the PRPN/ON baselines are also tuned using validation F1 for fair comparison.
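For concreteness, sentence-level unlabeled F1 with trivial spans discarded could be computed along the following lines; the exact tie-breaking conventions for empty span sets vary across papers, so this sketch is illustrative rather than the evaluation script used here.

    def sentence_f1(pred_spans, gold_spans, sent_len):
        """Unlabeled F1 for one sentence. Spans are (start, end) word-index pairs;
        trivial spans (single words and the whole-sentence span) are discarded first."""
        def nontrivial(spans):
            return {(i, j) for (i, j) in spans if j > i and not (i == 0 and j == sent_len - 1)}
        pred, gold = nontrivial(pred_spans), nontrivial(gold_spans)
        if not pred and not gold:
            return 1.0
        overlap = len(pred & gold)
        prec = overlap / len(pred) if pred else 0.0
        rec = overlap / len(gold) if gold else 0.0
        return 2 * prec * rec / (prec + rec) if prec + rec > 0 else 0.0

    def sentence_level_f1(all_pred, all_gold, lengths):
        # Sentence-level F1: compute F1 per sentence, then average across the corpus
        # (as opposed to corpus-level F1, which pools counts before computing F1).
        scores = [sentence_f1(p, g, n) for p, g, n in zip(all_pred, all_gold, lengths)]
        return sum(scores) / len(scores)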

Code Our code is available at https://github.com/harvardnlp/compound-pcfg.

6.4.2 Results

Table 6.1 shows the unlabeled F1 scores for our models and various baselines. All models soundly outperform right branching baselines, and we find that the neural/compound PCFGs are strong models for grammar induction. In particular the compound PCFG outperforms other models by an appreciable margin on both English and Chinese. We again note that we were unable to induce meaningful grammars through a traditional PCFG with the scalar parameterization despite a thorough hyperparameter search.10

7Corpus-level F1 calculates precision/recall at the corpus level to obtain F1, while sentence-level F1 calculates F1 for each sentence and averages across the corpus.

8https://github.com/yikangshen/PRPN

9https://github.com/yikangshen/Ordered-Neurons

                                    PTB             CTB
Model                           Mean    Max     Mean    Max
PRPN (Shen et al., 2018b)       37.4    38.1     −       −
ON (Shen et al., 2019)          47.7    49.4     −       −
URNNG† (Kim et al., 2019b)       −      45.4     −       −
DIORA† (Drozdov et al., 2019)    −      58.9     −       −
Left Branching                      8.7             9.7
Right Branching                    39.5            20.0
Random Trees                    19.2    19.5    15.7    16.0
PRPN (tuned)                    47.3    47.9    30.4    31.5
ON (tuned)                      48.1    50.0    25.4    25.7
Scalar PCFG                       <35.0           <15.0
Neural PCFG                     50.8    52.6    25.7    29.5
Compound PCFG                   55.2    60.1    36.0    39.8
Oracle Trees                       84.3            81.1

Table 6.1: Unlabeled sentence-level F1 scores on English (PTB) and Chinese (CTB) test sets. Top shows results from previous work while the rest of the results are from this paper. Mean/Max scores are obtained from 4 runs of each model with different random seeds. Oracle is the maximum score obtainable with binarized trees, since we compare against the non-binarized gold trees per convention. Results with † are trained on a version of PTB with punctuation, and hence not strictly comparable to the present work. For URNNG/DIORA, we take the parsed test set provided by the authors from their best runs and evaluate F1 with our evaluation setup, which ignores punctuation.

Table 6.2 analyzes the learned tree structures. We compare similarity as measured by F1 against gold, left, right, and "self" trees (top), where the self F1 score is calculated by averaging over all 6 pairs obtained from 4 different runs. We find that PRPN is particularly consistent across multiple runs. We also observe that different models are better at identifying different constituent labels, as measured by label recall (Table 6.2, bottom). While left as future work, this naturally suggests an ensemble approach wherein the empirical probabilities of constituents (obtained by averaging the predicted binary constituent labels from the different models) are used either to supervise another model or directly as potentials in a CRF constituency parser. Finally, all models seemed to have some difficulty in identifying SBAR/VP constituents, which typically span more words than NP constituents, indicating opportunities for further improvements in unsupervised parsing.

10Training perplexity was much higher than in the neural case, indicating significant optimization issues.

          PRPN     ON      Neural PCFG   Compound PCFG
Gold      47.3     48.1       50.8           55.2
Left       1.5     14.1       11.8           13.0
Right     39.9     31.0       27.7           28.4
Self      82.3     71.3       65.2           66.8
SBAR     50.0%    51.2%      52.5%          56.1%
NP       59.2%    64.5%      71.2%          74.7%
VP       46.7%    41.0%      33.8%          41.7%
PP       57.2%    54.4%      58.8%          68.8%
ADJP     44.3%    38.1%      32.5%          40.4%
ADVP     32.8%    31.6%      45.5%          52.5%

Table 6.2: (Top) Mean F1 similarity against Gold, Left, Right, and Self trees on the English PTB. Self F1 score is calculated by averaging over all 6 pairs obtained from 4 different runs. (Bottom) Fraction of ground truth constituents that were predicted as a constituent by the models, broken down by label (i.e. label recall).

6.4.3 Analysis

Nonterminal Alignment  Since we induce a full set of nonterminals in our grammar, we can analyze the learned nonterminals to see if they can be aligned with linguistic constituent labels. Table 6.3 shows the alignment between induced and gold labels in the neural PCFG, where for each nonterminal we show the empirical probability that a predicted constituent of this type will correspond to a particular linguistic constituent in the test set, conditioned on its being a correct constituent (for reference we also show the precision). Table 6.4 has a similar alignment table for the compound PCFG. We observe that some of the induced nonterminals clearly align to linguistic nonterminals. We also observed the preterminals to align to part-of-speech tags (not shown).11

Label    S      SBAR    NP      VP      PP      ADJP    ADVP    OTHER   Freq.   Acc.
NT-01   0.0%    0.0%   81.8%    1.1%    0.0%    5.9%    0.0%   11.2%    2.9%   13.8%
NT-02   2.2%    0.9%   90.8%    1.7%    0.9%    0.0%    1.3%    2.2%    1.1%   44.0%
NT-03   1.0%    0.0%    2.3%   96.8%    0.0%    0.0%    0.0%    0.0%    1.8%   37.1%
NT-04   0.3%    2.2%    0.5%    2.0%   93.9%    0.2%    0.6%    0.3%   11.0%   64.9%
NT-05   0.2%    0.0%   36.4%   56.9%    0.0%    0.0%    0.2%    6.2%    3.1%   57.1%
NT-06   0.0%    0.0%   99.1%    0.0%    0.1%    0.0%    0.2%    0.6%    5.2%   89.0%
NT-07   0.0%    0.0%   99.7%    0.0%    0.3%    0.0%    0.0%    0.0%    1.3%   59.3%
NT-08   0.5%    2.2%   23.3%   35.6%   11.3%   23.6%    1.7%    1.7%    2.0%   44.3%
NT-09   6.3%    5.6%   40.2%    4.3%   32.6%    1.2%    7.0%    2.8%    2.6%   52.1%
NT-10   0.1%    0.1%    1.4%   58.8%   38.6%    0.0%    0.8%    0.1%    3.0%   50.5%
NT-11   0.9%    0.0%   96.5%    0.9%    0.9%    0.0%    0.0%    0.9%    1.1%   42.9%
NT-12   0.5%    0.2%   94.4%    2.4%    0.2%    0.1%    0.2%    2.0%    8.9%   74.9%
NT-13   1.6%    0.1%    0.2%   97.7%    0.2%    0.1%    0.1%    0.1%    6.2%   46.0%
NT-14   0.0%    0.0%    0.0%   98.6%    0.0%    0.0%    0.0%    1.4%    0.9%   54.1%
NT-15   0.0%    0.0%   99.7%    0.0%    0.3%    0.0%    0.0%    0.0%    2.0%   76.9%
NT-16   0.0%    0.0%    0.0%  100.0%    0.0%    0.0%    0.0%    0.0%    0.3%   29.9%
NT-17  96.4%    2.9%    0.0%    0.7%    0.0%    0.0%    0.0%    0.0%    1.2%   24.4%
NT-18   0.3%    0.0%   88.7%    2.8%    0.3%    0.0%    0.0%    7.9%    3.0%   28.3%
NT-19   3.9%    1.0%   86.6%    2.4%    2.6%    0.4%    1.3%    1.8%    4.5%   53.4%
NT-20   0.0%    0.0%   99.0%    0.0%    0.0%    0.3%    0.2%    0.5%    7.4%   17.5%
NT-21  94.4%    1.7%    2.0%    1.4%    0.3%    0.1%    0.0%    0.1%    6.2%   34.7%
NT-22   0.1%    0.0%   98.4%    1.1%    0.1%    0.0%    0.2%    0.2%    3.5%   77.6%
NT-23   0.4%    0.9%   14.0%   53.1%    8.2%   18.5%    4.3%    0.7%    2.4%   49.1%
NT-24   0.0%    0.2%    1.5%   98.3%    0.0%    0.0%    0.0%    0.0%    2.3%   47.3%
NT-25   0.3%    0.0%    1.4%   98.3%    0.0%    0.0%    0.0%    0.0%    2.2%   34.6%
NT-26   0.4%   60.7%   18.4%    3.0%   15.4%    0.4%    0.4%    1.3%    2.1%   23.4%
NT-27   0.0%    0.0%   48.7%    0.5%    0.7%   13.1%    3.2%   33.8%    2.0%   59.7%
NT-28  88.2%    0.3%    3.8%    0.9%    0.1%    0.0%    0.0%    6.9%    6.7%   76.5%
NT-29   0.0%    1.7%   95.8%    1.0%    0.7%    0.0%    0.0%    0.7%    1.0%   62.8%
NT-30   1.6%   94.5%    0.6%    1.2%    1.2%    0.0%    0.4%    0.4%    2.1%   49.4%
Gold   15.0%    4.8%   38.5%   21.7%   14.6%    1.7%    0.8%    2.9%

Table 6.3: Analysis of label alignment for nonterminals in the neural PCFG. Label alignment is the proportion of correctly-predicted constituents that correspond to a particular gold label. We also show the predicted constituent frequency and accuracy (i.e. precision) in the rightmost two columns. The bottom line shows the empirical frequency of the constituent labels in gold trees.

11As a POS induction system, the many-to-one performance of the compound PCFG using the preterminals is 68.0. A similarly-parameterized compound HMM with 60 hidden states (an HMM is a particular type of PCFG) obtains 63.2. This is still quite a bit lower than the state-of-the-art (He et al., 2018; Stratos, 2019), though the comparison is confounded by various factors such as preprocessing (e.g. we drop punctuation). A neural PCFG/HMM obtains 68.2 and 63.4 respectively.

Label    S      SBAR    NP      VP      PP      ADJP    ADVP    OTHER   Freq.   Acc.
NT-01   0.0%    0.0%    0.0%   99.2%    0.0%    0.0%    0.0%    0.8%    2.6%   41.1%
NT-02   0.0%    0.3%    0.3%   99.2%    0.0%    0.0%    0.0%    0.3%    5.3%   15.4%
NT-03  88.2%    0.3%    3.6%    1.0%    0.1%    0.0%    0.0%    6.9%    7.2%   71.4%
NT-04   0.0%    0.0%  100.0%    0.0%    0.0%    0.0%    0.0%    0.0%    0.5%    2.4%
NT-05   0.0%    0.0%    0.0%   96.6%    0.0%    0.0%    0.0%    3.4%    5.0%    1.2%
NT-06   0.0%    0.4%    0.4%   98.8%    0.0%    0.0%    0.0%    0.4%    1.2%   43.7%
NT-07   0.2%    0.0%   95.3%    0.9%    0.0%    1.6%    0.1%    1.9%    2.8%   60.6%
NT-08   1.0%    0.4%   95.3%    2.3%    0.4%    0.2%    0.3%    0.2%    9.4%   63.0%
NT-09   0.6%    0.0%   87.4%    1.9%    0.0%    0.0%    0.0%   10.1%    1.0%   33.8%
NT-10  78.3%   17.9%    3.0%    0.5%    0.0%    0.0%    0.0%    0.3%    1.9%   42.0%
NT-11   0.3%    0.0%   99.0%    0.3%    0.0%    0.3%    0.0%    0.0%    0.9%   70.3%
NT-12   0.0%    8.8%   76.5%    2.9%    5.9%    0.0%    0.0%    5.9%    2.0%    3.6%
NT-13   0.5%    2.0%    1.0%   96.6%    0.0%    0.0%    0.0%    0.0%    1.7%   50.7%
NT-14   0.0%    0.0%   99.1%    0.0%    0.0%    0.6%    0.0%    0.4%    7.7%   14.8%
NT-15   2.9%    0.5%    0.4%   95.5%    0.4%    0.0%    0.0%    0.2%    4.4%   45.2%
NT-16   0.4%    0.4%   17.9%    5.6%   64.1%    0.4%    6.8%    4.4%    1.4%   38.1%
NT-17   0.1%    0.0%   98.2%    0.5%    0.1%    0.1%    0.1%    0.9%    9.6%   85.4%
NT-18   0.1%    0.0%   95.7%    1.6%    0.0%    0.1%    0.2%    2.3%    4.7%   56.2%
NT-19   0.0%    0.0%   98.9%    0.0%    0.4%    0.0%    0.0%    0.7%    1.3%   72.6%
NT-20   2.0%   22.7%    3.0%    4.8%   63.9%    0.6%    2.3%    0.6%    6.8%   59.0%
NT-21   0.0%    0.0%   14.3%   42.9%    0.0%    0.0%   42.9%    0.0%    2.2%    0.7%
NT-22   1.4%    0.0%   11.0%   86.3%    0.0%    0.0%    0.0%    1.4%    1.0%   15.2%
NT-23   0.1%    0.0%   58.3%    0.8%    0.4%    5.0%    1.7%   33.7%    2.8%   62.7%
NT-24   0.0%    0.0%  100.0%    0.0%    0.0%    0.0%    0.0%    0.0%    0.6%   70.2%
NT-25   2.2%    0.0%   76.1%    4.3%    0.0%    2.2%    0.0%   15.2%    0.4%   23.5%
NT-26   0.0%    0.0%    2.3%   94.2%    3.5%    0.0%    0.0%    0.0%    0.8%   24.0%
NT-27  96.6%    0.2%    1.5%    1.1%    0.3%    0.2%    0.0%    0.2%    4.3%   32.2%
NT-28   1.2%    3.7%    1.5%    5.8%   85.7%    0.9%    0.9%    0.3%    7.6%   64.9%
NT-29   3.0%   82.0%    1.5%   13.5%    0.0%    0.0%    0.0%    0.0%    0.6%   45.4%
NT-30   0.0%    0.0%    1.0%   60.2%   19.4%    1.9%    4.9%   12.6%    2.1%   10.4%
Gold   15.0%    4.8%   38.5%   21.7%   14.6%    1.7%    0.8%    2.9%

Table 6.4: Analysis of label alignment for nonterminals in the compound PCFG. Label alignment is the proportion of correctly-predicted constituents that correspond to a particular gold label. We also show the predicted constituent frequency and accuracy (i.e. precision) in the rightmost two columns. The bottom line shows the empirical frequency of the constituent labels in gold trees.

Continuous Latent Space  We next investigate what is being encoded in the continuous latent space of the compound PCFG. Table 6.5 shows the nearest neighbors of some sentences using the mean of the variational posterior as the continuous representation of each sentence. We qualitatively observe that the latent space seems to capture topical information.

he retired as senior vice president finance and administration and chief financial officer of the company oct. N
kenneth j. ⟨unk⟩ who was named president of this thrift holding company in august resigned citing personal reasons
the former president and chief executive eric w. ⟨unk⟩ resigned in june
⟨unk⟩ 's president and chief executive officer john ⟨unk⟩ said the loss stems from several factors
mr. ⟨unk⟩ is executive vice president and chief financial officer of ⟨unk⟩ and will continue in those roles
charles j. lawson jr. N who had been acting chief executive since june N will continue as chairman

⟨unk⟩ corp. received an N million army contract for helicopter engines
boeing co. received a N million air force contract for developing cable systems for the ⟨unk⟩ missile
general dynamics corp. received a N million air force contract for ⟨unk⟩ training sets
grumman corp. received an N million navy contract to upgrade aircraft electronics
thomson missile products with about half british aerospace 's annual revenue include the ⟨unk⟩⟨unk⟩ missile family
already british aerospace and french ⟨unk⟩⟨unk⟩⟨unk⟩ on a british missile contract and on an air-traffic control radar system

meanwhile during the the s&p trading halt s&p futures sell orders began ⟨unk⟩ up while stocks in new york kept falling sharply
but the ⟨unk⟩ of s&p futures sell orders weighed on the market and the link with stocks began to fray again
on friday some market makers were selling again traders said
futures traders say the s&p was ⟨unk⟩ that the dow could fall as much as N points
meanwhile two initial public offerings ⟨unk⟩ the ⟨unk⟩ market in their ⟨unk⟩ day of national over-the-counter trading friday
traders said most of their major institutional investors on the other hand sat tight

Table 6.5: For each query sentence (shown first in each block), we show the 5 nearest neighbors based on cosine similarity, where we take the representation for each sentence to be the mean of the variational posterior.

We are also interested in the variation in the leaves due to z when the variation due to the tree structure is held constant. To investigate this, we use the parsed dataset to obtain pairs of the form $(\mu_\phi(x^{(n)}), t_j^{(n)})$, where $t_j^{(n)}$ is the $j$-th subtree of the (approximate) MAP tree $t^{(n)}$ for the $n$-th sentence. Therefore each mean vector $\mu_\phi(x^{(n)})$ is associated with $|x^{(n)}| - 1$ subtrees, where $|x^{(n)}|$ is the sentence length. Our definition of subtree here ignores terminals, and thus each subtree is associated with many mean vectors. For a frequently occurring subtree, we perform PCA on the set of mean vectors that are associated with the subtree to obtain the top principal component. We then show the constituents that had the 5 most positive/negative values for this top principal component in Table 6.6. For example, a particularly common subtree (associated with 180 unique constituents) is given by

(NT-04 (T-13 w1) (NT-12 (NT-20 (NT-20 (NT-07 (T-05 w2) (T-45 w3)) (T-35 w4)) (T-40 w5)) (T-22 w6))).

The top 5 constituents with the most negative/positive values are shown at the top of Table 6.6. We find that the leaves [w1, . . . , w6], which form a 6-word constituent, vary in a regular manner as z is varied. We also observe that the root of this subtree (NT-04) aligns to prepositional phrases (PP) in Table 6.4, and the leaves in Table 6.6 (top left) are indeed mostly prepositional phrases. However, the model fails to identify ((T-40 w5) (T-22 w6)) as a constituent in this case. It is possible that the model is utilizing the subtrees to capture broad template-like structures and then using z to fill them in, similar to recent works that also train models to separate "what to say" from "how to say it" (Wiseman et al., 2018; Peng et al., 2019; Chen et al., 2019a,b).
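The analysis just described might be implemented along these lines; the data format (subtree signatures paired with posterior means and constituent strings) and function names are assumptions for illustration.

    import numpy as np
    from collections import defaultdict

    def top_pc_extremes(pairs, k=5):
        """pairs is a list of (subtree_signature, mu, constituent_string) triples, where mu
        is the posterior mean vector of the sentence the constituent came from and
        subtree_signature is e.g. the bracketed subtree with terminals removed.
        For each frequent subtree, project its mu vectors onto their top principal
        component and return the constituents at the two extremes."""
        by_subtree = defaultdict(list)
        for sig, mu, constituent in pairs:
            by_subtree[sig].append((np.asarray(mu), constituent))
        results = {}
        for sig, items in by_subtree.items():
            if len(items) < 2 * k:                # only analyze frequently occurring subtrees
                continue
            X = np.stack([mu for mu, _ in items])
            Xc = X - X.mean(axis=0)
            _, _, vt = np.linalg.svd(Xc, full_matrices=False)   # top PC via SVD
            scores = Xc @ vt[0]
            order = np.argsort(scores)
            constituents = [c for _, c in items]
            results[sig] = {"pc_minus": [constituents[i] for i in order[:k]],
                            "pc_plus": [constituents[i] for i in order[-k:][::-1]]}
        return results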

6.4.4 Recurrent Neural Network Grammars on Induced Trees

While the compound PCFG has fewer independence assumptions than the neural PCFG, it is still a more constrained model of language than the recurrent neural network grammars from the previous chapter (or standard neural language models for that matter), and thus not competitive in terms of perplexity: the compound PCFG obtains a perplexity of 196.3 while an LSTM language model (LM) obtains 86.2 (Table 6.7).12 In contrast, both PRPN and ON perform as well as an LSTM LM while maintaining good unsupervised parsing performance.

We thus experiment to see if it is possible to use the induced trees to supervise a more flexible generative model that can make use of tree structures, in particular the recurrent neural network grammars (RNNG) from chapter 5. We take the best run from each model and parse the training set,13 and use the induced trees to supervise an RNNG for each model, using the same parameterization and training setup as in the previous chapter. We test the induced RNNGs on syntactic evaluation, again utilizing the framework and dataset from Marvin & Linzen (2018). As in chapter 5 we also train a version of the model where we fine-tune the RNNG with the unsupervised RNNG objective (RNNG → URNNG). See chapter 5 for the exact details.

12It was possible to almost match the perplexity of an NLM by additionally conditioning the terminal probabilities on previous history,

$$\pi_{z, T \rightarrow w_t} \propto \exp\!\big(u_{w_t}^\top \mathrm{MLP}_2([w_T; z; h_t]; \theta) + b_{w_t}\big),$$

where $h_t$ is the hidden state from an LSTM over $x_{<t}$.

Subtree rooted at NT-04 (the six-word subtree discussed in the text above):
PC −: of the company 's capital structure ; in the company 's divestiture program ; by the company 's new board ; in the company 's core businesses ; on the company 's strategic plan
PC +: above the treasury 's N-year note ; above the treasury 's seven-year note ; above the treasury 's comparable note ; above the treasury 's five-year note ; measured the earth 's ozone layer

Subtree rooted at NT-23:
PC −: purchased through the exercise of stock options ; circulated by a handful of major brokers ; higher as a percentage of total loans ; common with a lot of large companies ; surprised by the storm of sell orders
PC +: brought to the u.s. against her will ; laid for the arrest of opposition activists ; uncertain about the magnitude of structural damage ; held after the assassination of his mother ; hurt as a result of the violations

Subtree rooted at NT-10:
PC −: to terminate their contract with warner ; to support a coup in panama ; to suit the bureaucrats in brussels ; to thwart his bid for amr ; to prevent the pound from rising
PC +: to change our strategy of investing ; to offset the growth of minimills ; to be a lot of art ; to change our way of life ; to increase the impact of advertising

Table 6.6: For each subtree, we perform PCA on the variational posterior mean vectors that are associated with that particular subtree and take the top principal component. We then list the top 5 constituents that had the lowest (PC −) and highest (PC +) principal component values.

                        PPL     Syntactic Eval.    F1
LSTM LM                 86.2        60.9%           −
PRPN                    87.1        62.2%          47.9
  RNNG                  95.3        60.1%          47.8
  RNNG → URNNG          90.1        61.8%          51.6
ON                      87.2        61.6%          50.0
  RNNG                  95.2        61.7%          50.6
  RNNG → URNNG          89.9        61.9%          55.1
Neural PCFG            252.6        49.2%          52.6
  RNNG                  95.8        68.1%          51.4
  RNNG → URNNG          86.0        69.1%          58.7
Compound PCFG          196.3        50.7%          60.1
  RNNG                  89.8        70.0%          58.1
  RNNG → URNNG          83.7        76.1%          66.9
Oracle Trees
  RNNG                  80.6        70.4%          71.9
  RNNG → URNNG          78.3        76.1%          72.8

Table 6.7: Results from training RNNGs on induced trees from various models on the PTB. RNNG → URNNG indicates fine-tuning with the URNNG objective. We show perplexity (PPL), grammaticality judgment performance (Syntactic Eval.), and unlabeled F1. PPL/F1 are calculated on the PTB test set and Syntactic Eval. is from Marvin & Linzen (2018)'s dataset. Results on top do not make any use of annotated trees, while the bottom two results are trained on binarized gold trees. All the RNN-based models above (i.e. LSTM/PRPN/ON/RNNG/URNNG) have roughly the same model capacity. Note that the results here are not comparable to Table 5.1 and Table 5.5 from chapter 5 due to differences in vocabulary and preprocessing.

The results are shown in Table 6.7. For perplexity, RNNGs trained on induced trees (Induced RNNG in Table 6.7) are unable to improve upon an LSTM LM, in contrast to the supervised RNNG, which does outperform the LSTM language model (Table 6.7, bottom).14 For grammaticality judgment, however, the RNNG trained with compound PCFG trees outperforms the LSTM LM despite obtaining worse perplexity,15 and performs on par with the RNNG trained on binarized gold trees.

Fine-tuning with the URNNG results in improvements in perplexity and grammaticality judgment across the board (RNNG → URNNG in Table 6.7). We also obtain large improvements on unsupervised parsing as measured by F1, with the fine-tuned URNNGs outperforming the respective original models.16 This is potentially due to an ensembling effect between the original model and the URNNG's structured inference network, which is parameterized as a neural CRF constituency parser (Durrett & Klein, 2015; Liu et al., 2018). While left as future work, it is possible to use the compound PCFG itself as an inference network. Finally, as noted in the previous chapter, an unsupervised RNNG trained from scratch fails to outperform a right-branching baseline on this version of PTB where punctuation is removed (not shown).

6.5 Discussion

6.5.1 Limitations

We report on some negative results as well as important limitations of our approach in this section. While distributed representations promote parameter sharing, we were unable to obtain improvements through more factorized parameterizations that promote even greater parameter sharing. In particular, for rules of the type A → BC, we tried having the output embeddings be a function of the input embeddings (e.g. u_{BC} = MLP([w_B; w_C]; θ)), but obtained worse results. For rules of the type T → w, we tried using a character-level CNN (dos Santos & Zadrozny, 2014; Kim et al., 2016) to obtain the output word embeddings u_w, but found the performance to be similar to the word-level case.17 We were also unable to obtain improvements through normalizing flows (Rezende & Mohamed, 2015b; Kingma et al., 2016). However, given that we did not exhaustively explore the full space of possible parameterizations, the above modifications could eventually lead to improvements with the right setup.

14We again note that these results are not comparable to the perplexity results from Table 5.1 since the vocabulary size and preprocessing are different.

15Kuncoro et al. (2018, 2019) also observe that models that achieve lower perplexity do not necessarily perform better on syntactic evaluation tasks.

16Li et al. (2019c) similarly obtain improvements by refining a model trained on induced trees on classification tasks.

Relatedly, the models were quite sensitive to parameterization (e.g. it was important to use residual layers for the MLPs), grammar size, and optimization method. We also noticed some variance in results across random seeds, as shown in Table 6.2. Finally, despite vectorized GPU implementations, training was significantly more expensive (both in terms of time and memory) than NLM-based grammar induction systems due to the O(|R| |x|^3) dynamic program, which makes our approach potentially difficult to scale.

6.5.2 Richer Grammars for Modeling Natural Language

We initially motivated the compound PCFG as an approximation to richer PCFG formalisms such as higher-order (Klein & Manning, 2003) and lexicalized (Collins, 1997) PCFGs. While neural parameterization and inference made it computationally straightforward to approximately model richer interactions in a "soft" way with a sentence-level latent vector, in some sense these vector-based representations are less interpretable compared to the discrete nonterminal symbols.18 It is well known that some natural language phenomena cannot be modeled with context-free grammars (Shieber, 1985). A possible future direction would therefore involve directly modeling richer grammars that still admit polynomial-time inference algorithms. For example, it would be interesting to explore neural parameterizations of mildly context-sensitive formalisms such as tree adjoining grammars (Joshi et al., 1975), combinatory categorial grammars (Steedman, 1987), or head grammars (Pollard, 1984).19

17It is also possible to take advantage of pretrained word embeddings by using them to initialize output word embeddings or by directly working with continuous emission distributions (Lin et al., 2015; He et al., 2018).

6.6 Related Work

Grammar induction and unsupervised parsing have a long and rich history in natural language processing. Early work on grammar induction with pure unsupervised learning was mostly negative (Lari & Young, 1990; Carroll & Charniak, 1992; Charniak, 1993), though Pereira & Schabes (1992) reported some success on partially bracketed data. Clark (2001) and Klein & Manning (2002) were some of the first successful statistical approaches to unsupervised parsing. In particular, the constituent-context model (CCM) of Klein & Manning (2002), which explicitly models both constituents and distituents, was the basis for much subsequent work (Klein & Manning, 2004; Huang et al., 2012; Golland et al., 2012). Other works have explored imposing inductive biases through Bayesian priors (Johnson et al., 2007; Liang et al., 2007; Wang & Blunsom, 2013), modified objectives (Smith & Eisner, 2004), and additional constraints on recursion depth (Noji et al., 2016; Jin et al., 2018b,a).

While the framework of specifying the structure of a grammar and learning the parameters is common, other methods exist. Bod (2006) considers a nonparametric-style approach to unsupervised parsing by using random subsets of training subtrees to parse new sentences. Seginer (2007) utilizes an incremental algorithm for unsupervised parsing which makes local decisions to create constituents based on a complex set of heuristics. Ponvert et al. (2011) induce parse trees through cascaded applications of finite state models. More recently, neural network-based approaches to unsupervised parsing have shown promising results on inducing parse trees directly from words. Shen et al. (2018b, 2019) learn tree structures through soft gating layers within neural language models, while Drozdov et al. (2019) combine recursive autoencoders with the inside-outside algorithm. Shi et al. (2019) utilize image captions to identify and ground constituents.

18An interesting theoretical question would be to investigate the formal class of languages generated by compound PCFGs.

19These formalisms are all weakly equivalent in that they generate the same languages (sets of strings) (Vijay-Shanker & Weir, 1994).

The compound PCFG outlined in this chapter is also related to latent variable PCFGs (Matsuzaki et al., 2005; Petrov et al., 2006; Cohen et al., 2012), which extend PCFGs to the latent variable setting by splitting nonterminal symbols into latent subsymbols. In particular, latent vector grammars (Zhao et al., 2018) and compositional vector grammars (Socher et al., 2013a) also employ continuous vectors within their grammars. However, these approaches have been employed for learning supervised parsers on annotated treebanks, unlike the unsupervised setting of the current work.

6.7 Conclusion

This chapter explores grammar induction with compound PCFGs, which modulate rule probabilities with per-sentence continuous latent vectors. The latent vector induces marginal dependencies beyond the traditional first-order context-free assumptions within a tree-based generative process, leading to improved performance. The collapsed amortized variational inference approach is general and can be used for generative models which admit tractable inference through partial conditioning. Learning deep generative models which exhibit such conditional Markov properties is an interesting direction for future work.

7 Conclusion

This thesis has explored deep latent variable models of natural language. We have shown that these models can be used to model a wide range of language phenomena, from attention in neural machine translation to parse trees and even full grammars. Amortized variational inference, in which a global inference network is trained to output the parameters of an approximate variational posterior distribution, was a key tool in enabling efficient learning of deep latent variable models. However, a straightforward application of amortized variational inference often proved insufficient for many language applications of interest, and in each chapter we studied an extension of the standard approach: (1) semi-amortized variational inference in chapter 3, (2) continuous relaxations of discrete distributions in chapter 4, (3) posterior regularization with a structured inference network in chapter 5, and (4) collapsed amortized variational inference in chapter 6. From an applications standpoint, we showed how these models can be applied to a range of core NLP tasks including language modeling, machine translation, and unsupervised parsing.

Deep latent variable models, which combine the composability and interpretability of latent variable models with the flexible modeling capabilities of deep networks, are an exciting area of research. It is nonetheless worth discussing again why we would want to use latent variable models in the first place, especially given that from a pure performance standpoint, models that do not formally make use of latent variables are incredibly effective. Even from the perspective of learning interesting/meaningful structures (e.g. topics, alignments, phrases), one may argue that these structures can be captured implicitly within the hidden layers of a deep network through end-to-end learning. Indeed, the recent, astonishing success of deep, non-latent variable models pretrained with self-supervised objectives (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2018) suggests that deep networks can capture a significant amount of linguistic knowledge without the explicit use of latent variables.

We conclude this thesis by offering a spirited defense of latent variable models in contemporary NLP. For one, latent variable models can surpass the performance of deterministic deep networks if properly optimized. For example, deterministic (soft) attention (Bahdanau et al., 2015) was generally thought to outperform latent variable (hard) attention (Xu et al., 2015), until recent work showed that when properly optimized (albeit with a more expensive training procedure), latent variable attention can outperform deterministic attention (Deng et al., 2018; Shankar et al., 2018; Wu et al., 2018). Second, framing certain problems as approximations to a latent variable objective often yields valuable insights and new avenues for further work. Notably, the interpretation of dropout (Hinton et al., 2012; Srivastava et al., 2014) as optimizing a latent variable objective has led to rich extensions and improvements (Kingma et al., 2015; Gal & Ghahramani, 2016a,b; Ma et al., 2017; Melis et al., 2018a). Finally, while there is much evidence that non-latent variable models pretrained with self-supervised objectives are able to implicitly capture a significant amount of linguistic knowledge within their hidden layers (Liu et al., 2019; Hewitt & Manning, 2019; Tenney et al., 2019), it would be ideal to have more direct, explicit access to such structures through latent variables, especially from the perspective of interpretability, transparency, controllability, and transfer learning.
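To illustrate the distinction drawn above between the two forms of attention, the following toy sketch contrasts soft attention, which feeds the expectation of the memory under the attention weights into the output layer, with latent variable attention, which treats the alignment as a categorical latent variable and marginalizes the output distribution over it exactly. The shapes, scoring function, and output layer are illustrative assumptions and are not the architectures of the cited models.

# Toy contrast between soft attention (expectation inside the network) and
# latent-variable attention (exact marginalization over a categorical alignment).
import numpy as np

rng = np.random.default_rng(0)
T, d, V = 5, 8, 10                     # source length, hidden size, vocabulary size
memory = rng.normal(size=(T, d))       # encoder states
query = rng.normal(size=d)             # decoder state
W_out = rng.normal(size=(d, V)) * 0.1  # toy output layer (random stand-in)

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

attn = softmax(memory @ query)         # alignment distribution p(a = t | query)

# Soft attention: average the memory first, then predict.
context = attn @ memory                # E[memory_a] under the attention weights
p_soft = softmax(context @ W_out)

# Latent-variable attention: p(y) = sum_t p(a = t) p(y | memory_t).
p_latent = np.zeros(V)
for t in range(T):
    p_latent += attn[t] * softmax(memory[t] @ W_out)

print(p_soft.round(3))
print(p_latent.round(3))               # generally differs from the soft-attention prediction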

More generally, latent variable modeling gives us a modular, probabilistic framework with which to explicitly inject both inductive bias and domain-specific constraints into models. Inductive bias, or the inherent “preferences” of a model (or learning algorithm), is crucial for learning and generalization. It can help mitigate model misspecification, allow for data-efficient learning, and, through a carefully crafted generative model, enable interesting structures to emerge. Furthermore, if we have constraints on the representations learned by a model, such as requiring that they represent a valid parse tree, be interpretable, or allow for controlling the model’s predictions, we can enforce these constraints in a principled way through latent variables. The intersection of latent variable models and deep learning remains an exciting avenue for much future work.

References

Aharoni, R. & Goldberg, Y. (2017). Towards String-to-Tree Neural Machine Translation. In Proceedings of ACL.

Aho, A. (1968). Indexed Grammars—An Extension of Context-Free Grammars. Journal of the ACM, 15(4), 647–671.

Alvarez-Melis, D. & Jaakkola, T. S. (2017a). A Causal Framework for Explaining the Predictions of Black-Box Sequence-to-Sequence Models. In Proceedings of EMNLP.

Alvarez-Melis, D. & Jaakkola, T. S. (2017b). Tree-structured Decoding with Doubly-Recurrent Neural Networks. In Proceedings of ICLR.

Ammar, W., Dyer, C., & Smith, N. A. (2014). Conditional Random Field Autoencoders for Unsupervised Structured Prediction. In Proceedings of NIPS.

Amos, B. & Kolter, J. Z. (2017). OptNet: Differentiable Optimization as a Layer in Neural Networks. In Proceedings of ICML.

Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M., Pfau, D., Schaul, T., & de Freitas, N. (2016). Learning to Learn by Gradient Descent by Gradient Descent. In Proceedings of NIPS.

Arora, S., Cohen, N., & Hazan, E. (2018). On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization. In Proceedings of ICML.

Ba, J., Mnih, V., & Kavukcuoglu, K. (2015a). Multiple Object Recognition with Visual Attention. In Proceedings of ICLR.

Ba, J., Salakhutdinov, R. R., Grosse, R. B., & Frey, B. J. (2015b). Learning Wake-Sleep Recurrent Attention Models. In Proceedings of NIPS.

Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer Normalization. arXiv:1607.06450.

Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of ICLR.

Bahuleyan, H., Mou, L., Vechtomova, O., & Poupart, P. (2017). Variational Attention for Sequence-to-Sequence Models. arXiv:1712.08207.

Baker, J. K. (1979). Trainable Grammars for Speech Recognition. In Proceedings of the Spring Conference of the Acoustical Society of America.

Belanger, D., Yang, B., & McCallum, A. (2017). End-to-end Learning for Structured Prediction Energy Networks. In Proceedings of ICML.

Bengio, Y., Ducharme, R., & Vincent, P. (2003). A Neural Probabilistic Language Model. Journal of Machine Learning Research, 3, 1137–1155.

Bengio, Y., Louradour, J., Collobert, R., & Weston, J. (2009). Curriculum Learning. In Proceedings of ICML.

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.

Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). Variational Inference: A Review for Statisticians. Journal of the American Statistical Association, 112(518), 859–877.

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation. Journal of machine Learning research, 3(Jan), 993–1022.

Bod, R. (2006). An All-Subtrees Approach to Unsupervised Parsing. In Proceedings of ACL.

Bojar, O., Buck, C., Chatterjee, R., Federmann, C., Graham, Y., Haddow, B., Huck, M., Yepes, A. J., Koehn, P., & Kreutzer, J. (2017). Proceedings of the Second Conference on Machine Translation. Association for Computational Linguistics.

Bornschein, J., Mnih, A., Zoran, D., & Rezende, D. J. (2017). Variational Memory Addressing in Generative Models. In Proceedings of NIPS.

Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A. M., Jozefowicz, R., & Bengio, S. (2016). Generating Sentences from a Continuous Space. In Proceedings of CoNLL.

Bradbury, J. & Socher, R. (2017). Towards Neural Machine Translation with Latent Tree Attention. In Proceedings of the 2nd Workshop on Structured Prediction for Natural Language Processing.

Brakel, P., Stroobandt, D., & Schrauwen, B. (2013). Training Energy-Based Models for Time-Series Imputation. Journal of Machine Learning Research, 14, 2771–2797.

Brown, P. F., Desouza, P. V., Mercer, R. L., Pietra, V. J. D., & Lai, J. C. (1992). Class-based N-gram Models of Natural Language. Computational Linguistics, 18(4), 467–479.

Brown, P. F., Pietra, S. A. D., Pietra, V. J. D., & Mercer, R. L. (1993). The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2), 263–311.

Burda, Y., Grosse, R., & Salakhutdinov, R. (2015). Importance Weighted Autoencoders. In Proceedings of ICLR.

Buys, J. & Blunsom, P. (2015). Generative Incremental Dependency Parsing with Neural Networks. In Proceedings of ACL.

Buys, J. & Blunsom, P. (2018). Neural Syntactic Generative Models with Exact Marginalization. In Proceedings of NAACL.

Cai, J., Jiang, Y., & Tu, K. (2017). CRF Autoencoder for Unsupervised Dependency Parsing. In Proceedings of EMNLP.

Carroll, G. & Charniak, E. (1992). Two Experiments on Learning Probabilistic Dependency Grammars from Corpora. In AAAI Workshop on Statistically-Based NLP Techniques.

Cettolo, M., Niehues, J., Stuker, S., Bentivogli, L., & Federico, M. (2014). Report on the 11th IWSLT evaluation campaign. In Proceedings of IWSLT.

Charniak, E. (1993). Statistical Language Learning. MIT Press.

Chelba, C., Mikolov, T., Schuster, M., Ge, Q., Brants, T., Koehn, P., & Robinson, T. (2013). One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling. arXiv:1312.3005.

Chen, D., Fisch, A., Weston, J., & Bordes, A. (2017a). Reading Wikipedia to answer open-domain questions. In Proceedings of ACL.

Chen, D. & Manning, C. D. (2014). A Fast and Accurate Dependency Parser using Neural Networks. In Proceedings of EMNLP.

Chen, M., Tang, Q., Wiseman, S., & Gimpel, K. (2019a). Controllable Paraphrase Generation with a Syntactic Exemplar. In Proceedings of ACL.

Chen, M., Tang, Q., Wiseman, S., & Gimpel, K. (2019b). A Multi-task Approach for Disentangling Syntax and Semantics in Sentence Representations. In Proceedings of NAACL.

Chen, X., Kingma, D. P., Salimans, T., Duan, Y., Dhariwal, P., Schulman, J., Sutskever, I., & Abbeel, P. (2017b). Variational Lossy Autoencoder. In Proceedings of ICLR.

Cheng, J. & Lapata, M. (2016). Neural summarization by extracting sentences and words. In Proceedings of ACL.

Cho, K., Raiko, T., Ilin, A., & Karhunen, J. (2013). A Two-Stage Pretraining Algorithm for Deep Boltzmann Machines. In Proceedings of ICANN.

Cho, K., Raiko, T., Ilin, A., & Karhunen, J. (2015). How to Pretrain Deep Boltzmann Machines in Two Stages. Artificial Neural Networks, Methods and Applications in Bio-/Neuroinformatics, 4, 201–219.

Cho, K., van Merrienboer, B., Bahdanau, D., & Bengio, Y. (2014a). On the Properties of Neural Machine Translation: Encoder-Decoder Approaches. In Proceedings of the Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation.

Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014b). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In Proceedings of EMNLP.

Choi, J., Yoo, K. M., & goo Lee, S. (2018). Learning to Compose Task-Specific Tree Structures. In Proceedings of AAAI.

Chomsky, N. (1957). Syntactic Structures. The Hague: Mouton and Co.

Chung, J., Ahn, S., & Bengio, Y. (2017). Hierarchical Multiscale Recurrent Neural Networks. In Proceedings of ICLR.

Chung, J., Kastner, K., Dinh, L., Goel, K., Courville, A., & Bengio, Y. (2015). A Recurrent Latent Variable Model for Sequential Data. In Proceedings of NIPS.

Clark, A. (2001). Unsupervised Induction of Stochastic Context Free Grammars Using Distributional Clustering. In Proceedings of CoNLL.

Cohen, S. B., Gimpel, K., & Smith, N. A. (2009). Logistic Normal Priors for Unsupervised Probabilistic Grammar Induction. In Proceedings of NIPS.

Cohen, S. B. & Smith, N. A. (2009). Shared Logistic Normal Distributions for Soft Parameter Tying in Unsupervised Grammar Induction. In Proceedings of NAACL.

Cohen, S. B., Stratos, K., Collins, M., Foster, D. P., & Ungar, L. (2012). Spectral Learning of Latent-Variable PCFGs. In Proceedings of ACL.

Cohn, T., Hoang, C. D. V., Vymolova, E., Yao, K., Dyer, C., & Haffari, G. (2016). Incorporating Structural Alignment Biases into an Attentional Neural Translation Model. arXiv:1601.01085.

Collins, M. (1997). Three Generative, Lexicalised Models for Statistical Parsing. In Proceedings of ACL.

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural Language Processing (almost) from Scratch. Journal of Machine Learning Research, 12, 2493–2537.

Corro, C. & Titov, I. (2018). Differentiable Perturb-and-Parse: Semi-Supervised Parsing with a Structured Variational Autoencoder. arXiv:1807.09875.

Cremer, C., Li, X., & Duvenaud, D. (2018). Inference Suboptimality in Variational Autoencoders. In Proceedings of ICML.

Creutz, M. & Lagus, K. (2002). Unsupervised Discovery of Morphemes. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B, 39(1), 1–38.

Deng, Y., Kanervisto, A., Ling, J., & Rush, A. M. (2017). Image-to-Markup Generation with Coarse-to-Fine Attention. In Proceedings of ICML.

Deng, Y., Kim, Y., Chiu, J., Guo, D., & Rush, A. M. (2018). Latent Alignment and Variational Attention. In Proceedings of NeurIPS.

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805.

Dieng, A. B., Wang, C., Gao, J., & Paisley, J. (2017). TopicRNN: A Recurrent Neural Network With Long-Range Semantic Dependency. In Proceedings of ICLR.

Domke, J. (2012). Generic Methods for Optimization-based Modeling. In Proceedings of AISTATS.

dos Santos, C. N. & Zadrozny, B. (2014). Learning Character-level Representations for Part-of-Speech Tagging. In Proceedings of ICML.

Drozdov, A., Verga, P., Yadav, M., Iyyer, M., & McCallum, A. (2019). Unsupervised Latent Tree Induction with Deep Inside-Outside Recursive Auto-Encoders. In Proceedings of NAACL.

Du, S. S., Zhai, X., Poczos, B., & Singh, A. (2019). Gradient Descent Provably Optimizes Over-parameterized Neural Networks. In Proceedings of ICLR.

Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12.

Durrett, G. & Klein, D. (2015). Neural CRF Parsing. In Proceedings of ACL.

Dyer, C., Ballesteros, M., Ling, W., Matthews, A., & Smith, N. A. (2015). Transition-Based Dependency Parsing with Stack Long Short-Term Memory. In Proceedings of ACL.

Dyer, C., Chahuneau, V., & Smith, N. A. (2013). A Simple, Fast, and Effective Reparameterization of IBM Model 2. In Proceedings of NAACL.

Dyer, C., Kuncoro, A., Ballesteros, M., & Smith, N. A. (2016). Recurrent Neural Network Grammars. In Proceedings of NAACL.

Edunov, S., Ott, M., Auli, M., Grangier, D., & Ranzato, M. (2018). Classical Structured Prediction Losses for Sequence to Sequence Learning. In Proceedings of NAACL.

Elman, J. L. (1990). Finding Structure in Time. Cognitive Science, 14(2), 179–211.

Emami, A. & Jelinek, F. (2005). A Neural Syntactic Language Model. Machine Learning, 60, 195–227.

Eriguchi, A., Tsuruoka, Y., & Cho, K. (2017). Learning to Parse and Translate Improves Neural Machine Translation. In Proceedings of ACL.

Finkel, J. R., Kleeman, A., & Manning, C. D. (2008). Efficient, Feature-based, Conditional Random Field Parsing. In Proceedings of ACL.

Finkel, J. R., Manning, C. D., & Ng, A. Y. (2006). Solving the Problem of Cascading Errors: Approximate Bayesian Inference for Linguistic Annotation Pipelines. In Proceedings of EMNLP.

Fraccaro, M., Sonderby, S. K., Paquet, U., & Winther, O. (2016). Sequential Neural Models with Stochastic Layers. In Proceedings of NIPS.

Fu, H., Li, C., Liu, X., Gao, J., Celikyilmaz, A., & Carin, L. (2019). Cyclical Annealing Schedule: A Simple Approach to Mitigating KL Vanishing. In Proceedings of NAACL.

Fu, M. C. (2006). Gradient Estimation. Handbooks in Operations Research and Management Science, 13, 575–616.

Gal, Y. & Ghahramani, Z. (2016a). A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. In Proceedings of NIPS.

Gal, Y. & Ghahramani, Z. (2016b). Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In Proceedings of ICML.

Ganchev, K., Graça, J., Gillenwater, J., & Taskar, B. (2010). Posterior Regularization for Structured Latent Variable Models. Journal of Machine Learning Research, 11, 2001–2049.

Glasserman, P. (2013). Monte Carlo Methods in Financial Engineering, volume 53. Springer Science & Business Media.

Glynn, P. (1987). Likelihood Ratio Gradient Estimation: An Overview. In Proceedings of the Winter Simulation Conference.

Gold, E. M. (1967). Language Identification in the Limit. Information and Control, 10, 447–474.

Golland, D., DeNero, J., & Uszkoreit, J. (2012). A Feature-Rich Constituent Context Model for Grammar Induction. In Proceedings of ACL.

Goodman, J. (1998). Parsing Inside-Out. PhD thesis, Harvard University.

Gu, J., Shavarani, H. S., & Sarkar, A. (2018). Top-down Tree Structured Decoding with Syntactic Connections for Neural Machine Translation and Parsing. In Proceedings of EMNLP.

Goyal, A., Sordoni, A., Cote, M.-A., Ke, N. R., & Bengio, Y. (2017a). Z-Forcing: Training Stochastic Recurrent Networks. In Proceedings of NIPS.

Goyal, P., Hu, Z., Liang, X., Wang, C., & Xing, E. (2017b). Nonparametric Variational Auto-encoders for Hierarchical Representation Learning. In Proceedings of ICCV.

Gulcehre, C., Chandar, S., Cho, K., & Bengio, Y. (2016). Dynamic Neural Turing Machine with Soft and Hard Addressing Schemes. arXiv:1607.00036.

Gulrajani, I., Kumar, K., Ahmed, F., Taiga, A. A., Visin, F., Vazquez, D., & Courville, A. (2017). PixelVAE: A Latent Variable Model for Natural Images. In Proceedings of ICLR.

Guu, K., Hashimoto, T. B., Oren, Y., & Liang, P. (2017). Generating Sentences by Editing Prototypes. arXiv:1709.08878.

Hale, J., Dyer, C., Kuncoro, A., & Brennan, J. R. (2018). Finding Syntax in Human Encephalography with Beam Search. In Proceedings of ACL.

Havrylov, S., Kruszewski, G., & Joulin, A. (2019). Cooperative Learning of Disjoint Syntax and Semantics. In Proceedings of NAACL.

Hazan, T. & Jaakkola, T. (2012). On the Partition Function and Random Maximum A-Posteriori Perturbation. In Proceedings of ICML.

He, J., Neubig, G., & Berg-Kirkpatrick, T. (2018). Unsupervised Learning of Syntactic Structure with Invertible Neural Projections. In Proceedings of EMNLP.

He, J., Spokoyny, D., Neubig, G., & Berg-Kirkpatrick, T. (2019). Lagging inference networks and posterior collapse in variational autoencoders. In Proceedings of ICLR.

He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep Residual Learning for Image Recognition. arXiv:1512.03385.

He, L., Lee, K., Lewis, M., & Zettlemoyer, L. (2017). Deep semantic role labeling: What works and what’s next. In Proceedings of ACL.

Hewitt, J. & Manning, C. D. (2019). A Structural Probe for Finding Syntax in Word Representations. In Proceedings of NAACL.

Hinton, G., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2012). Improving Neural Networks by Preventing Co-Adaptation of Feature Detectors. arXiv:1207.0580.

Hinton, G. E. & van Camp, D. (1993). Keeping the Neural Networks Simple by Minimizing the Description Length of the Weights. In Proceedings of COLT.

Hjelm, R. D., Cho, K., Chung, J., Salakhutdinov, R., Calhoun, V., & Jojic, N. (2016). Iterative Refinement of the Approximate Posterior for Directed Belief Networks. In Proceedings of NIPS.

Hochreiter, S. & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735–1780.

Hoffman, M. D., Blei, D. M., Wang, C., & Paisley, J. (2013). Stochastic Variational Inference. Journal of Machine Learning Research, 13, 1303–1347.

Horning, J. J. (1969). A study of grammatical inference. Stanford Computer Science Tech Report, CS 139.

Htut, P. M., Cho, K., & Bowman, S. R. (2018). Grammar Induction with Neural Language Models: An Unusual Replication. In Proceedings of EMNLP.

Hu, Z., Yang, Z., Liang, X., Salakhutdinov, R., & Xing, E. P. (2017). Toward Controlled Generation of Text. In Proceedings of ICML.

Huang, Y., Zhang, M., & Tan, C. L. (2012). Improved Constituent Context Model with Features. In Proceedings of PACLIC.

Hwa, R. (1999). Supervised Grammar Induction Using Training Data with Limited Constituent Information. In Proceedings of ACL.

Hwa, R. (2000). Sample Selection for Statistical Grammar Induction. In Proceedings of EMNLP.

Ioffe, S. & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of ICML.

Jaderberg, M., Czarnecki, W. M., Osindero, S., Vinyals, O., Graves, A., Silver, D., & Kavukcuoglu, K. (2017). Decoupled Neural Interfaces using Synthetic Gradients. In Proceedings of ICML.

Jain, S. & Wallace, B. C. (2019). Attention is not explanation. In Proceedings of NAACL.

Jang, E., Gu, S., & Poole, B. (2017). Categorical Reparameterization with Gumbel-Softmax. In Proceedings of ICLR.

Jin, L., Doshi-Velez, F., Miller, T., Schuler, W., & Schwartz, L. (2018a). Depth-bounding is Effective: Improvements and Evaluation of Unsupervised PCFG Induction. In Proceedings of EMNLP.

Jin, L., Doshi-Velez, F., Miller, T., Schuler, W., & Schwartz, L. (2018b). Unsupervised Grammar Induction with Depth-bounded PCFG. In Proceedings of TACL.

Johnson, M. (1998). PCFG Models of Linguistic Tree Representations. Computational Linguistics, 24, 613–632.

Johnson, M., Duvenaud, D. K., Wiltschko, A., Adams, R. P., & Datta, S. R. (2016). Composing Graphical Models with Neural Networks for Structured Representations and Fast Inference. In Proceedings of NIPS.

Johnson, M., Griffiths, T. L., & Goldwater, S. (2007). Bayesian Inference for PCFGs via Markov chain Monte Carlo. In Proceedings of NAACL.

Jordan, M., Ghahramani, Z., Jaakkola, T., & Saul, L. (1999). Introduction to Variational Methods for Graphical Models. Machine Learning, 37, 183–233.

Joshi, A. K., Levy, L. S., & Takahashi, M. (1975). Tree adjunct grammars. Journal of Computer and System Sciences, 10, 136–163.

Kalchbrenner, N., Grefenstette, E., & Blunsom, P. (2014). A Convolutional Neural Network for Modelling Sentences. In Proceedings of ACL.

Kawakami, K., Dyer, C., & Blunsom, P. (2018). Unsupervised Word Discovery with Segmental Neural Language Models. arXiv:1811.09353.

Kim, Y. (2014). Convolutional Neural Networks for Sentence Classification. In Proceedings of EMNLP.

Kim, Y., Denton, C., Hoang, L., & Rush, A. M. (2017). Structured Attention Networks. In Proceedings of ICLR.

Kim, Y., Dyer, C., & Rush, A. M. (2019a). Compound Probabilistic Context-Free Grammars for Grammar Induction. In Proceedings of ACL.

Kim, Y., Jernite, Y., Sontag, D., & Rush, A. M. (2016). Character-Aware Neural Language Models. In Proceedings of AAAI.

Kim, Y., Rush, A. M., Yu, L., Kuncoro, A., Dyer, C., & Melis, G. (2019b). Unsupervised Recurrent Neural Network Grammars. In Proceedings of NAACL.

Kim, Y., Wiseman, S., Miller, A. C., Sontag, D., & Rush, A. M. (2018a). Semi-Amortized Variational Autoencoders. In Proceedings of ICML.

Kim, Y., Wiseman, S., & Rush, A. M. (2018b). A Tutorial on Deep Latent Variable Models of Natural Language. arXiv:1812.06834.

Kingma, D. P. & Ba, J. (2015). Adam: A Method for Stochastic Optimization. In Proceedings of ICLR.

Kingma, D. P., Salimans, T., & Welling, M. (2015). Variational Dropout and the Local Reparameterization Trick. In Proceedings of NIPS.

Kingma, D. P., Salimans, T., & Welling, M. (2016). Improving Variational Inference with Autoregressive Flow. arXiv:1606.04934.

Kingma, D. P. & Welling, M. (2014). Auto-Encoding Variational Bayes. In Proceedings of ICLR.

Kitaev, N. & Klein, D. (2018). Constituency Parsing with a Self-Attentive Encoder. In Proceedings of ACL.

Klein, D. & Manning, C. (2002). A Generative Constituent-Context Model for Improved Grammar Induction. In Proceedings of ACL.

Klein, D. & Manning, C. D. (2003). Accurate Unlexicalized Parsing. In Proceedings of ACL.

Klein, D. & Manning, C. D. (2004). Corpus-based Induction of Syntactic Structure: Models of Dependency and Constituency. In Proceedings of ACL.

Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., et al. (2007). Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions (pp. 177–180).: Association for Computational Linguistics.

Koehn, P. & Knowles, R. (2017). Six Challenges for Neural Machine Translation. arXiv:1706.03872.

Krishnan, R. G., Liang, D., & Hoffman, M. (2018). On the Challenges of Learning with Inference Networks on Sparse, High-dimensional Data. In Proceedings of AISTATS.

Krishnan, R. G., Shalit, U., & Sontag, D. (2017). Structured Inference Networks for Nonlinear State Space Models. In Proceedings of AAAI.

Kuncoro, A., Ballesteros, M., Kong, L., Dyer, C., Neubig, G., & Smith, N. A. (2017). What Do Recurrent Neural Network Grammars Learn About Syntax? In Proceedings of EACL.

Kuncoro, A., Dyer, C., Hale, J., Yogatama, D., Clark, S., & Blunsom, P. (2018). LSTMs Can Learn Syntax-Sensitive Dependencies Well, But Modeling Structure Makes Them Better. In Proceedings of ACL.

Kuncoro, A., Dyer, C., Rimell, L., Clark, S., & Blunsom, P. (2019). Scalable Syntax-Aware Language Models Using Knowledge Distillation. In Proceedings of ACL.

Kurihara, K. & Sato, T. (2006). Variational Bayesian Grammar Induction for Natural Language. In Proceedings of the International Colloquium on Grammatical Inference.

Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of ICML.

Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., & Dyer, C. (2016). Neural Architectures for Named Entity Recognition. In Proceedings of NAACL.

Lari, K. & Young, S. (1990). The Estimation of Stochastic Context-Free Grammars Using the Inside-Outside Algorithm. Computer Speech and Language, 4, 35–56.

Lawson, D., Chiu, C.-C., Tucker, G., Raffel, C., Swersky, K., & Jaitly, N. (2018). Learning Hard Alignments in Variational Inference. In Proceedings of ICASSP.

Le, Q. & Mikolov, T. (2014). Distributed Representations of Sentences and Documents. In Proceedings of ICML.

LeCun, Y., Simard, P., & Pearlmutter, B. (1993). Automatic Learning Rate Maximization by On-line Estimation of the Hessian’s Eigenvectors. In Proceedings of NIPS.

Lee, J., Mansimov, E., & Cho, K. (2018). Deterministic Non-Autoregressive Neural Sequence Modeling by Iterative Refinement. arXiv:1802.06901.

Lei, T., Barzilay, R., & Jaakkola, T. (2016). Rationalizing Neural Predictions. In Proceedings of EMNLP.

Li, B., Cheng, J., Liu, Y., & Keller, F. (2019a). Dependency Grammar Induction with a Neural Variational Transition-based Parser. In Proceedings of AAAI.

Li, B., He, J., Neubig, G., Berg-Kirkpatrick, T., & Yang, Y. (2019b). A Surprisingly Effective Fix for Deep Latent Variable Modeling of Text. In Proceedings of EMNLP.

Li, B., Mou, L., & Keller, F. (2019c). An Imitation Learning Approach to Unsupervised Parsing. In Proceedings of ACL.

Li, Z. & Eisner, J. (2009). First- and Second-Order Expectation Semirings with Applications to Minimum-Risk Training on Translation Forests. In Proceedings of EMNLP.

Liang, P., Petrov, S., Jordan, M. I., & Klein, D. (2007). The Infinite PCFG using Hierarchical Dirichlet Processes. In Proceedings of EMNLP.

Lin, C.-C., Ammar, W., Dyer, C., & Levin, L. (2015). Unsupervised POS Induction with Word Embeddings. In Proceedings of NAACL.

Ling, J. & Rush, A. (2017). Coarse-to-Fine Attention Models for Document Summarization. In Proceedings of the Workshop on New Frontiers in Summarization.

Liu, N. F., Gardner, M., Belinkov, Y., Peters, M. E., & Smith, N. A. (2019). Linguistic Knowledge and Transferability of Contextual Representations. In Proceedings of NAACL.

Liu, Y., Gardner, M., & Lapata, M. (2018). Structured Alignment Networks for Matching Sentences. In Proceedings of EMNLP.

Liu, Y. & Lapata, M. (2017). Learning Structured Text Representations. In Proceedings of TACL.

Luong, M.-T., Pham, H., & Manning, C. D. (2015). Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of EMNLP.

Ma, X., Gao, Y., Hu, Z., Yu, Y., Deng, Y., & Hovy, E. (2017). Dropout with Expectation-linear Regularization. In Proceedings of ICLR.

Ma, X. & Hovy, E. (2016). End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. In Proceedings of ACL.

Maclaurin, D., Duvenaud, D., & Adams, R. P. (2015). Gradient-based Hyperparameter Optimization through Reversible Learning. In Proceedings of ICML.

Maddison, C. J., Mnih, A., & Teh, Y. W. (2017). The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. In Proceedings of ICLR.

Maddison, C. J., Tarlow, D., & Minka, T. (2014). A* Sampling. In Proceedings of NIPS.

Maillard, J. & Clark, S. (2018). Latent Tree Learning with Differentiable Parsers: Shift-Reduce Parsing and Chart Parsing. In Proceedings of the Workshop on the Relevance of Linguistic Structure in Neural Architectures for NLP.

Marcus, M. P., Marcinkiewicz, M. A., & Santorini, B. (1993). Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19, 313–330.

Marino, J., Yue, Y., & Mandt, S. (2018). Iterative Amortized Inference. In Proceedings of ICML.

Martins, A. F. T. & Astudillo, R. F. (2016). From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification. In Proceedings of ICML.

Marvin, R. & Linzen, T. (2018). Targeted Syntactic Evaluation of Language Models. In Proceedings of EMNLP.

Matsuzaki, T., Miyao, Y., & Tsujii, J. (2005). Probabilistic CFG with Latent Annotations. In Proceedings of ACL.

Melis, G., Blundell, C., Kočiský, T., Hermann, K. M., Dyer, C., & Blunsom, P. (2018a). Pushing the Bounds of Dropout. arXiv:1805.09208.

Melis, G., Dyer, C., & Blunsom, P. (2018b). On the State of the Art of Evaluation in Neural Language Models. In Proceedings of ICLR.

Mensch, A. & Blondel, M. (2018). Differentiable Dynamic Programming for Structured Prediction and Attention. In Proceedings of ICML.

Merialdo, B. (1994). Tagging English Text with a Probabilistic Model. Computational Linguistics, 20(2), 155–171.

Merity, S., Keskar, N. S., & Socher, R. (2018). Regularizing and Optimizing LSTM Language Models. In Proceedings of ICLR.

Metz, L., Poole, B., Pfau, D., & Sohl-Dickstein, J. (2017). Unrolled Generative Adversarial Networks. In Proceedings of ICLR.

Miao, Y., Grefenstette, E., & Blunsom, P. (2017). Discovering Discrete Latent Topics with Neural Variational Inference. In Proceedings of ICML.

Miao, Y., Yu, L., & Blunsom, P. (2016). Neural Variational Inference for Text Processing. In Proceedings of ICML.

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781.

Mikolov, T., Karafiat, M., Burget, L., Cernocky, J., & Khudanpur, S. (2010). Recurrent Neural Network Based Language Model. In Proceedings of INTERSPEECH.

Mnih, A. & Gregor, K. (2014). Neural Variational Inference and Learning in Belief Networks. In Proceedings of ICML.

Mnih, A. & Rezende, D. J. (2016). Variational Inference for Monte Carlo Objectives. In Proceedings of ICML.

Mnih, V., Heess, N., Graves, A., & Kavukcuoglu, K. (2015). Recurrent Models of Visual Attention. In Proceedings of NIPS.

Mohamed, S., Rosca, M., Figurnov, M., & Mnih, A. (2019). Monte Carlo Gradient Estimation in Machine Learning. arXiv:1906.10652.

Mueller, J., Gifford, D., & Jaakkola, T. (2017). Sequence to Better Sequence: Continuous Revision of Combinatorial Structures. In Proceedings of ICML.

Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. The MIT Press.

Nallapati, R., Zhou, B., dos Santos, C., Gülçehre, Ç., & Xiang, B. (2016). Abstractive Text Summarization Using Sequence-to-Sequence RNNs and Beyond. In Proceedings of CoNLL.

Naseem, T., Chen, H., Barzilay, R., & Johnson, M. (2010). Using Universal Linguistic Knowledge to Guide Grammar Induction. In Proceedings of EMNLP.

Neal, R. M. & Hinton, G. E. (1998). A New View of the EM Algorithm that Justifies Incremental, Sparse and Other Variants. Learning in Graphical Models.

Neubig, G., Goldberg, Y., & Dyer, C. (2017). On-the-fly Operation Batching in Dynamic Computation Graphs. arXiv:1705.07860.

Niculae, V. & Blondel, M. (2017). A Regularized Framework for Sparse and Structured Neural Attention. In Proceedings of NIPS.

Niculae, V., Martins, A. F. T., Blondel, M., & Cardie, C. (2018a). SparseMAP: Differentiable Sparse Structured Inference. In Proceedings of ICML.

Niculae, V., Martins, A. F. T., & Cardie, C. (2018b). Towards Dynamic Computation Graphs via Sparse Latent Structure. In Proceedings of EMNLP.

Noji, H., Miyao, Y., & Johnson, M. (2016). Using Left-corner Parsing to Encode Universal Structural Constraints in Grammar Induction. In Proceedings of EMNLP.

Novak, R., Auli, M., & Grangier, D. (2016). Iterative Refinement for Machine Translation. arXiv:1610.06602.

Nunberg, G. (1990). The Linguistics of Punctuation. CSLI Publications.

Paige, B. & Wood, F. (2016). Inference Networks for Sequential Monte Carlo in Graphical Models. In Proceedings of ICML.

Papandreou, G. & Yuille, A. L. (2011). Perturb-and-Map Random Fields: Using Discrete Optimization to Learn and Sample from Energy Models. In Proceedings of ICCV.

Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of ACL.

Parikh, A. P., Cohen, S. B., & Xing, E. P. (2014). Spectral Unsupervised Parsing with Additive Tree Metrics. In Proceedings of ACL.

Pearlmutter, B. A. (1994). Fast Exact Multiplication by the Hessian. Neural Computation, 6(1), 147–160.

Peng, H., Parikh, A. P., Faruqui, M., Dhingra, B., & Das, D. (2019). Text Generation with Exemplar-based Adaptive Decoding. In Proceedings of NAACL.

Peng, H., Thomson, S., & Smith, N. A. (2018). Backpropagating through Structured Argmax using a SPIGOT. In Proceedings of ACL.

Pereira, F. & Schabes, Y. (1992). Inside-Outside Reestimation from Partially Bracketed Corpora. In Proceedings of ACL.

Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep Contextualized Word Representations. In Proceedings of NAACL.

Petrov, S., Barrett, L., Thibaux, R., & Klein, D. (2006). Learning Accurate, Compact, and Interpretable Tree Annotation. In Proceedings of ACL.

Pollard, C. (1984). Generalized phrase structure grammars, head grammars, and natural language. Stanford Ph.D. thesis, (pp. 511–546).

Ponvert, E., Baldridge, J., & Erk, K. (2011). Simple Unsupervised Grammar Induction from Raw Text with Cascaded Finite State Models. In Proceedings of ACL.

Press, O. & Wolf, L. (2016). Using the Output Embedding to Improve Language Models. In Proceedings of EACL.

Pu, Y., Gan, Z., Henao, R., Li, C., Han, S., & Carin, L. (2017). VAE Learning via Stein Variational Gradient Descent. In Proceedings of NIPS.

Rabinovich, M., Stern, M., & Klein, D. (2017). Abstract Syntax Networks for Code Generation and Semantic Parsing. In Proceedings of ACL.

Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving Language Understanding by Generative Pre-Training. OpenAI blog.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language Models are Unsupervised Multitask Learners.

Reichart, R. & Rappoport, A. (2010). Improved Fully Unsupervised Parsing with Zoomed Learning. In Proceedings of EMNLP.

Rezende, D. J. & Mohamed, S. (2015a). Variational Inference with Normalizing Flows. In Proceedings of ICML.

Rezende, D. J. & Mohamed, S. (2015b). Variational Inference with Normalizing Flows. In Proceedings of ICML.

Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In Proceedings of ICML.

Robbins, H. (1951). Asymptotically Subminimax Solutions of Compound Statistical Decision Problems. In Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability (pp. 131–149). Berkeley: University of California Press.

Robbins, H. (1956). An Empirical Bayes Approach to Statistics. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability (pp. 157–163). Berkeley: University of California Press.

Rush, A. M., Chopra, S., & Weston, J. (2015). A Neural Attention Model for Abstractive Sentence Summarization. In Proceedings of EMNLP.

Sakaya, J. & Klami, A. (2017). Importance Sampled Stochastic Optimization for Variational Inference. In Proceedings of UAI.

Salakhutdinov, R. & Larochelle, H. (2010). Efficient Learning of Deep Boltzmann Machines. In Proceedings of AISTATS.

Salimans, T., Kingma, D., & Welling, M. (2015). Markov Chain Monte Carlo and Variational Inference: Bridging the Gap. In Proceedings of ICML.

Schmidt, M., Roux, N. L., & Bach, F. (2013). Minimizing Finite Sums with the Stochastic Average Gradient. arXiv:1309.2388.

Schulz, P., Aziz, W., & Cohn, T. (2018). A Stochastic Decoder for Neural Machine Translation. In Proceedings of ACL.

See, A., Liu, P. J., & Manning, C. D. (2017). Get to the point: Summarization with pointer-generator networks. In Proceedings of ACL.

Seginer, Y. (2007). Fast Unsupervised Incremental Parsing. In Proceedings of ACL.

Semeniuta, S., Severyn, A., & Barth, E. (2017). A Hybrid Convolutional Variational Autoencoder for Text Generation. arXiv:1702.02390.

Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. In Proceedings of ACL.

Seo, M., Kembhavi, A., Farhadi, A., & Hajishirzi, H. (2017). Bi-Directional Attention Flow for Machine Comprehension. In Proceedings of ICLR.

Serban, I. V., Sordoni, A., Lowe, R., Charlin, L., Pineau, J., Courville, A., & Bengio, Y. (2017). A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues. In Proceedings of AAAI.

Serrano, S. & Smith, N. A. (2019). Is attention interpretable? In Proceedings of ACL.

Shankar, S., Garg, S., & Sarawagi, S. (2018). Surprisingly Easy Hard-Attention for Sequence to Sequence Learning. In Proceedings of EMNLP.

Shen, D., Zhang, Y., Henao, R., Su, Q., & Carin, L. (2018a). Deconvolutional Latent-Variable Model for Text Sequence Matching. In Proceedings of AAAI.

Shen, Y., Lin, Z., Huang, C.-W., & Courville, A. (2018b). Neural Language Modeling by Jointly Learning Syntax and Lexicon. In Proceedings of ICLR.

Shen, Y., Tan, S., Sordoni, A., & Courville, A. (2019). Ordered Neurons: Integrating Tree Structures into Recurrent Neural Networks. In Proceedings of ICLR.

Shi, H., Mao, J., Gimpel, K., & Livescu, K. (2019). Visually Grounded Neural Syntax Acquisition. In Proceedings of ACL.

Shi, H., Zhou, H., Chen, J., & Li, L. (2018). On Tree-based Neural Sentence Modeling. In Proceedings of EMNLP.

Shieber, S. (1985). Evidence Against the Context-Freeness of Natural Language. Linguistics and Philosophy, 8, 333–343.

Shin, B., Chokshi, F. H., Lee, T., & Choi, J. D. (2017). Classification of Radiology Reports Using Neural Attention Models. In Proceedings of IJCNN.

Smith, N. A. & Eisner, J. (2004). Annealing Techniques for Unsupervised Statistical Language Learning. In Proceedings of ACL.

Smith, N. A. & Eisner, J. (2005). Contrastive Estimation: Training Log-Linear Models on Unlabeled Data. In Proceedings of ACL.

Snyder, B., Naseem, T., & Barzilay, R. (2009). Unsupervised Multilingual Grammar Induction. In Proceedings of ACL.

Socher, R., Bauer, J., Manning, C. D., & Ng, A. Y. (2013a). Parsing with Compositional Vector Grammars. In Proceedings of ACL.

Socher, R., Perelygin, A., Wu, J. Y., Chuang, J., Manning, C. D., Ng, A. Y., & Potts, C. (2013b). Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Proceedings of EMNLP.

Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby, S. K., & Winther, O. (2016). Ladder Variational Autoencoders. In Proceedings of NIPS.

Spitkovsky, V. I., Alshawi, H., & Jurafsky, D. (2012). Three Dependency-and-Boundary Models for Grammar Induction. In Proceedings of EMNLP-CoNLL.

Spitkovsky, V. I., Alshawi, H., & Jurafsky, D. (2013). Breaking Out of Local Optima with Count Transforms and Model Recombination: A Study in Grammar Induction. In Proceedings of EMNLP.

Srikumar, V., Kundu, G., & Roth, D. (2012). On Amortizing Inference Cost for Structured Prediction. In Proceedings of EMNLP.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15, 1929–1958.

Steedman, M. (1987). Combinatory grammars and parasitic gaps. Natural Language and Linguistic Theory, 5, 403–439.

Stern, M., Andreas, J., & Klein, D. (2017). A Minimal Span-Based Neural Constituency Parser. In Proceedings of ACL.

Stoyanov, V., Ropson, A., & Eisner, J. (2011). Empirical Risk Minimization of Graphical Model Parameters Given Approximate Inference, Decoding, and Model Structure. In Proceedings of AISTATS.

Stratos, K. (2019). Mutual Information Maximization for Simple and Accurate Part-of-Speech Induction. In Proceedings of NAACL.

Strubell, E., Verga, P., Andor, D., Weiss, D., & McCallum, A. (2018). Linguistically-Informed Self-Attention for Semantic Role Labeling. In Proceedings of EMNLP.

Su, J., Wu, S., Xiong, D., Lu, Y., Han, X., & Zhang, B. (2018). Variational Recurrent Neural Machine Translation. In Proceedings of AAAI.

Sutskever, I., Vinyals, O., & Le, Q. (2014). Sequence to Sequence Learning with Neural Networks. In Proceedings of NIPS.

Tai, K. S., Socher, R., & Manning, C. D. (2015). Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks. In Proceedings of ACL.

Tenney, I., Das, D., & Pavlick, E. (2019). BERT Rediscovers the Classical NLP Pipeline. In Proceedings of ACL.

Tomczak, J. M. & Welling, M. (2018). VAE with a VampPrior. In Proceedings of AISTATS.

Tran, K. & Bisk, Y. (2018). Inducing Grammars with and for Neural Machine Translation. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation.

Tran, K., Bisk, Y., Vaswani, A., Marcu, D., & Knight, K. (2016). Unsupervised Neural Hidden Markov Models. In Proceedings of the Workshop on Structured Prediction for NLP.

Tu, L. & Gimpel, K. (2018). Learning Approximate Inference Networks for Structured Prediction. In Proceedings of ICLR.

Tu, Z., Lu, Z., Liu, Y., Liu, X., & Li, H. (2016). Modeling Coverage for Neural Machine Translation. In Proceedings of ACL.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is All You Need. In Proceedings of NIPS.

Vijay-Shanker, K. & Weir, D. J. (1994). The equivalence of four extensions of context-free grammars. Mathematical Systems Theory, 27(6), 511–546.

Vinyals, O., Kaiser, L., Koo, T., Petrov, S., Sutskever, I., & Hinton, G. (2015). Grammar as a Foreign Language. In Proceedings of NIPS.

Vogel, S., Ney, H., & Tillmann, C. (1996). HMM-based Word Alignment in Statistical Translation. In Proceedings of the 16th Conference on Computational Linguistics - Volume 2 (pp. 836–841). Association for Computational Linguistics.

Wainwright, M. J. & Jordan, M. I. (2008). Introduction to Variational Methods for Graphical Models. Foundations and Trends in Machine Learning, 1, 1–103.

Wang, P. & Blunsom, P. (2013). Collapsed Variational Bayesian Inference for PCFGs. In Proceedings of CoNLL.

Wang, S. & Jiang, J. (2017). Machine comprehension using match-lstm and answer pointer. In Proceedings of ICLR.

Wang, W. & Chang, B. (2016). Graph-based Dependency Parsing with Bidirectional LSTM. In Proceedings of ACL.

Wang, W., Gan, Z., Wang, W., Shen, D., Huang, J., Ping, W., Satheesh, S., & Carin, L. (2018a). Topic Compositional Neural Language Model. In Proceedings of AISTATS.

Wang, X., Pham, H., Yin, P., & Neubig, G. (2018b). A Tree-based Decoder for Neural Machine Translation. In Proceedings of EMNLP.

Wiegreffe, S. & Pinter, Y. (2019). Attention is not not explanation. In Proceedings of EMNLP.

Williams, A., Drozdov, A., & Bowman, S. R. (2018). Do Latent Tree Learning Models Identify Meaningful Structure in Sentences? In Proceedings of TACL.

Williams, R. J. (1992). Simple Statistical Gradient-following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8.

Wiseman, S. & Kim, Y. (2019). Amortized Bethe Free Energy Minimization for Learning MRFs. In Proceedings of NeurIPS.

Wiseman, S., Shieber, S. M., & Rush, A. M. (2018). Learning Neural Templates for Text Generation. In Proceedings of EMNLP.

Wu, S., Shapiro, P., & Cotterell, R. (2018). Hard Non-Monotonic Attention for Character-Level Transduction. In Proceedings of EMNLP.

Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Qin Gao, K. M., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, L., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., Stevens, K., George Kurian, N. P., Wang, W., Young, C., Smith, J., Riesa, J., Rudnick, A., Vinyals, O., Corrado, G., Hughes, M., & Dean, J. (2016). Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv:1609.08144.

Xiong, C., Zhong, V., & Socher, R. (2017). Dynamic Coattention Networks for Question Answering. In Proceedings of ICLR.

Xu, J., Hsu, D., & Maleki, A. (2018). Benefits of Over-Parameterization with EM. In Proceedings of NeurIPS.

Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., & Bengio, Y. (2015). Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In Proceedings of ICML.

Xue, N., Xia, F., dong Chiou, F., & Palmer, M. (2005). The Penn Chinese Treebank: Phrase Structure Annotation of a Large Corpus. Natural Language Engineering, 11, 207–238.

Yang, Z., Dai, Z., Salakhutdinov, R., & Cohen, W. W. (2018). Breaking the Softmax Bottleneck: A High-Rank RNN Language Model. In Proceedings of ICLR.

Yang, Z., Hu, Z., Salakhutdinov, R., & Berg-Kirkpatrick, T. (2017). Improved Variational Autoencoders for Text Modeling using Dilated Convolutions. In Proceedings of ICML.

Yin, P. & Neubig, G. (2017). A Syntactic Neural Model for General-Purpose Code Generation. In Proceedings of ACL.

Yin, P., Zhou, C., He, J., & Neubig, G. (2018). StructVAE: Tree-structured Latent Variable Models for Semi-supervised Semantic Parsing. In Proceedings of ACL.

Yogatama, D., Blunsom, P., Dyer, C., Grefenstette, E., & Ling, W. (2017). Learning to Compose Words into Sentences with Reinforcement Learning. In Proceedings of ICLR.

Zaremba, W., Sutskever, I., & Vinyals, O. (2014). Recurrent Neural Network Regularization. arXiv:1409.2329.

Zeiler, M. D. (2012). Adadelta: An Adaptive Learning Rate Method. arXiv:1212.5701.

Zhang, B., Xiong, D., Su, J., Duan, H., & Zhang, M. (2016). Variational Neural Machine Translation. In Proceedings of EMNLP.

Zhang, C., Butepage, J., Kjellstrom, H., & Mandt, S. (2017). Advances in Variational Inference. arXiv:1711.05597.

Zhao, Y., Zhang, L., & Tu, K. (2018). Gaussian Mixture Latent Vector Grammars. In Proceedings of ACL.

Zhu, C., Zhao, Y., Huang, S., Tu, K., & Ma, Y. (2017). Structured Attentions for Visual Question Answering. In Proceedings of ICCV.

Zhu, X., Sobhani, P., & Guo, H. (2015). Long Short-Term Memory Over Tree Structures. In Proceedings of ICML.
