Spectral Methods for Natural Language Processing
Jang Sun Lee (Karl Stratos)
Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Graduate School of Arts and Sciences

COLUMBIA UNIVERSITY
2016

© 2016 Jang Sun Lee (Karl Stratos)
All Rights Reserved

ABSTRACT

Many state-of-the-art results in natural language processing (NLP) are achieved with statistical models involving latent variables. Unfortunately, computational problems associated with such models (for instance, finding the optimal parameter values) are typically intractable, forcing practitioners to rely on heuristic methods without strong guarantees. While heuristics are often sufficient for empirical purposes, their de-emphasis on theoretical aspects has certain negative ramifications. First, it can impede the development of rigorous theoretical understanding which can generate new ideas and algorithms. Second, it can lead to black art solutions that are unreliable and difficult to reproduce.

In this thesis, we argue that spectral methods, that is, methods that use singular value decomposition (SVD) or other similar matrix or tensor factorization, can effectively remedy these negative ramifications. To this end, we develop spectral methods for two unsupervised language processing tasks. The first task is learning lexical representations from unannotated text (e.g., hierarchical clustering of a vocabulary). The second task is estimating parameters of latent-variable models used in NLP applications (e.g., for unsupervised part-of-speech tagging). We show that our spectral algorithms have the following advantages over previous methods:

1. The algorithms provide a new theoretical framework that is amenable to rigorous analysis. In particular, they are shown to be statistically consistent.

2. The algorithms are simple to implement, efficient, and scalable to large amounts of data. They also yield results that are competitive with the state-of-the-art.
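To make the flavor of these methods concrete, here is a minimal Python sketch, not an algorithm from the thesis itself, of the SVD-based recipe that Chapters 5 and 6 build on: factor a word-context co-occurrence count matrix and keep the top left singular vectors as word representations. The function name and toy data are illustrative; the square-root transform of raw counts is one of the data transformations examined in Chapter 6.

```python
import numpy as np

def svd_word_embeddings(counts: np.ndarray, dim: int) -> np.ndarray:
    """counts[i, j] = how often word i occurs with context j."""
    # Square-root transform of raw counts (one of the transformations
    # compared in Chapter 6); dampens the effect of very frequent pairs.
    transformed = np.sqrt(counts)
    # Rank-`dim` truncated SVD: U holds the left singular vectors.
    U, S, Vt = np.linalg.svd(transformed, full_matrices=False)
    # Each row of U[:, :dim] is a `dim`-dimensional word embedding.
    return U[:, :dim]

# Toy usage: 4 words x 3 contexts (hypothetical counts).
counts = np.array([[8., 1., 0.],
                   [7., 2., 1.],
                   [0., 1., 9.],
                   [1., 0., 8.]])
embeddings = svd_word_embeddings(counts, dim=2)
print(embeddings.shape)  # (4, 2)
```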
Table of Contents

List of Figures
List of Tables

1 Introduction
1.1 Motivation
1.2 Learning Lexical Representations
1.2.1 Hierarchical Word Clusters
1.2.2 Word Embeddings
1.3 Estimating Parameters of Latent-Variable Models
1.3.1 Unsupervised POS Tagging
1.3.2 Phoneme Recognition
1.4 Thesis Overview
1.5 Notation

2 Related Work
2.1 Latent-Variable Models in NLP
2.2 Representation Learning in NLP
2.3 Spectral Techniques

I The Spectral Framework

3 A Review of Linear Algebra
3.1 Basic Concepts
3.1.1 Vector Spaces and Euclidean Space
3.1.2 Subspaces and Dimensions
3.1.3 Matrices
3.1.4 Orthogonal Matrices
3.1.5 Orthogonal Projection onto a Subspace
3.1.6 Gram-Schmidt Process and QR Decomposition
3.2 Eigendecomposition
3.2.1 Square Matrices
3.2.2 Symmetric Matrices
3.2.3 Variational Characterization
3.2.4 Semidefinite Matrices
3.2.5 Numerical Computation
3.3 Singular Value Decomposition (SVD)
3.3.1 Derivation from Eigendecomposition
3.3.2 Variational Characterization
3.3.3 Numerical Computation
3.4 Perturbation Theory
3.4.1 Perturbation Bounds on Singular Values
3.4.2 Canonical Angles Between Subspaces
3.4.3 Perturbation Bounds on Singular Vectors

4 Examples of Spectral Techniques
4.1 The Moore-Penrose Pseudoinverse
4.2 Low-Rank Matrix Approximation
4.3 Finding the Best-Fit Subspace
4.4 Principal Component Analysis (PCA)
4.4.1 Best-Fit Subspace Interpretation
4.5 Canonical Correlation Analysis (CCA)
4.5.1 Least Squares Interpretation
4.5.2 New Coordinate Interpretation
4.5.3 Dimensionality Reduction with CCA
4.6 Spectral Clustering
4.7 Subspace Identification
4.8 Alternating Minimization Using SVD
4.9 Non-Negative Matrix Factorization
4.10 Tensor Decomposition

II Inducing Lexical Representations

5 Word Clusters Under Class-Based Language Models
5.1 Introduction
5.2 Background
5.2.1 The Brown Clustering Algorithm
5.2.2 CCA and Agglomerative Clustering
5.3 Brown Model Definition
5.4 Clustering Under the Brown Model
5.4.1 An Overview of the Approach
5.4.2 Spectral Estimation of Observation Parameters
5.4.3 Estimation from Samples
5.4.4 Agglomerative Clustering
5.5 Experiments
5.5.1 Experimental Settings
5.5.2 Comparison to the Brown Algorithm: Quality
5.5.3 Comparison to the Brown Algorithm: Speed
5.5.4 Effect of the Choice of κ and Context
5.6 Conclusion

6 Word Embeddings from Decompositions of Count Matrices
6.1 Introduction
6.2 Background in CCA
6.2.1 CCA Objective
6.2.2 Exact Solution via SVD
6.2.3 Using CCA for Word Representations
6.3 Using CCA for Parameter Estimation
6.3.1 Clustering under a Brown Model
6.3.2 Spectral Estimation
6.3.3 Choice of Data Transformation
6.4 A Template for Spectral Methods
6.5 Related Work
6.6 Experiments
6.6.1 Word Similarity and Analogy
6.6.2 As Features in a Supervised Task
6.7 Conclusion

III Estimating Latent-Variable Models

7 Spectral Learning of Anchor Hidden Markov Models
7.1 Introduction
7.2 The Anchor Hidden Markov Model
7.3 Parameter Estimation for A-HMMs
7.3.1 NMF
7.3.2 Random Variables
7.3.3 Derivation of a Learning Algorithm
7.3.4 Construction of the Convex Hull Ω
7.4 Experiments
7.4.1 Background on Unsupervised POS Tagging
7.4.2 Experimental Setting
7.4.3 Practical Issues with the Anchor Algorithm
7.4.4 Tagging Accuracy
7.4.5 Qualitative Analysis
7.5 Related Work
7.5.1 Latent-Variable Models
7.5.2 Unsupervised POS Tagging
7.6 Conclusion

8 Spectral Learning of Refinement Hidden Markov Models
8.1 Introduction
8.2 Related Work
8.3 The R-HMM Model
8.3.1 Definition of an R-HMM
8.4 The Forward-Backward Algorithm
8.5 Spectral Estimation of R-HMMs
8.5.1 Random Variables
8.5.2 Estimation of the Operators
8.6 The Spectral Estimation Algorithm
8.7 Experiments
8.8 Conclusion