An Introduction to Nonlinear Dimensionality Reduction by Maximum Variance Unfolding

An Introduction to Nonlinear Dimensionality Reduction by Maximum Variance Unfolding Kilian Q. Weinberger and Lawrence K. Saul Department of Computer and Information Science, University of Pennsylvania Levine Hall, 3330 Walnut Street, Philadelphia, PA 19104-6389 kilianw,lsaul @cis.upenn.edu { } Abstract Many problems in AI are simplified by clever representations of sensory or symbolic input. How to discover such representations automatically, from large amounts of unlabeled data, remains a fundamental challenge. The goal of statis- query A B tical methods for dimensionality reduction is to detect and discover low dimensional structure in high dimensional data. In this paper, we review a recently proposed algorithm— Figure 1: Images of teapots: pixel distances versus percep- maximum variance unfolding—for learning faithful low di- tual distances. As measured by the mean-squared-difference mensional representations of high dimensional data. The of pixel intensities, image A is closer to the query image than algorithm relies on modern tools in convex optimization image B, despite the fact that the view in image A involves that are proving increasingly useful in many areas of ma- a full 180 degrees of rotation. chine learning. Introduction judgments of similarity and difference. (Consider the em- barrassment when your robotic butler grabs the teapot by its A fundamental challenge of AI is to develop useful internal spout rather than its handle, not to mention the liability when representations of the external world. The human brain ex- it subsequently attempts to refill your guest’s cup.) A more cels at extracting small numbers of relevant features from useful representation of these images would index them by large amounts of sensory data. Consider, for example, how the teapot’s angle of rotation, thus locating image B closer we perceive a familiar face. A friendly smile or a menac- to the query image than image A. ing glare can be discerned in an instant and described by Objects may be similar or different in many ways. In the a few well chosen words. On the other hand, the digital teapot example of Fig. 1, there is only one degree of free- representations of these images may consist of hundreds or dom: the angle of rotation. More generally, there may be thousands of pixels. Clearly, there are much more compact many criteria that are relevant to judgments of similarity and representations of images, sounds, and text than their native difference, each associated with its own degree of freedom. digital formats. With such representations in mind, we have These degrees of freedom are manifested over time by vari- spent the last few years studying the problem of dimension- abilities in appearance or presentation. ality reduction—how to detect and discover low dimensional The most important modes of variability can often be dis- structure in high dimensional data. tilled by automatic procedures that have access to large num- For higher-level decision-making in AI, the right repre- bers of observations. In essence, this is the goal of statis- sentation makes all the difference. We mean this quite lit- tical methods for dimensionality reduction (Burges 2005; erally, in the sense that proper judgments of similiarity and Saul et al. 2006). The observations, initially represented difference depend crucially on our internal representations as high dimensional vectors, are mapped into a lower dimen- of the external world. Consider, for example, the images of sional space. If this mapping is done faithfully, then the axes teapots in Fig. 1. Each image shows the same teapot from of the lower dimensional space relate to the data’s intrinsic a different angle. Compared on a pixel-by-pixel basis, the degrees of freedom. query image and image A are the most similar pair of images; that is, their pixel intensities have the smallest mean- The linear method of principal components analysis squared-difference. The viewing angle in image B, however, (PCA) performs this mapping by projecting high dimen- is much closer to the viewing angle in the query image— sional data into low dimensional subspaces. The principal evidence that distances in pixel space do not support crucial subspaces of PCA have the property that they maximize the variance of the projected data. PCA works well if the most Copyright c 2006, American Association for Artificial Intelli- important modes of variability are approximately linear. In gence (www.aaai.org). All rights reserved. this case, the high dimensional observations can be very well 1683 reconstructions original nearby outputs. Such locally distance-preserving representations are exactly the kind constructed by maximum variance unfolding. The algorithm for maximum variance unfolding is based on a simple intuition. Imagine the inputs xi as connected 4 8 163264560 to their k nearest neighbors by rigid rods. (The value of k is the algorithm’s one free parameter.) The algorithm attempts to pull the inputs apart, maximizing the sum total of Figure 2: Results of PCA applied to a data set of face im- their pairwise distances without breaking (or stretching) the ages. The figure shows a grayscale face image (right) and its rigid rods that connect nearest neighbors. The outputs are linear reconstructions from different numbers of principal obtained from the final state of this transformation. components. The number of principal components required The effect of this transformation is easy to visualize for for accurate reconstruction greatly exceeds the small num- inputs that lie on low dimensional manifolds, such as curves ber of characteristic poses and expressions in the data set. or surfaces. For example, imagine the inputs as beads on a necklace that is coiled up in three dimensions. By pulling the necklace taut, the beads are arranged in a line, a nonlin- reconstructed from their low dimensional linear projections. ear dimensionality reduction from 3 to 1. Alternatively, PCA works poorly if the most important modes of vari- imagine the inputs as the lattice of sites in a crumpled fish- ability are nonlinear. To illustrate the effects of nonlinearity, ing net. By pulling on the ends of the net, the inputs are we applied PCA to a data set of 28 20 grayscale images. × arranged in a plane, a nonlinear dimensionality reduction Each image in the data set depicted a different pose or ex- from 3 to 2. As we shall see, this intuition for maximum pression of the same person’s face. The variability of faces is variance unfolding also generalizes to higher dimensions. not expressed linearly in the pixel space of grayscale images. The “unfolding” transformation described above can be Fig. 2 shows the linear reconstructions of a particular im- formulated as a quadratic program. Let ηij 0, 1 de- age from different numbers of principal components (that is, ∈{ } note whether inputs xi and xj are k-nearest neighbors. The from principal subspaces of different dimensionality). The outputs yi from maximum variance unfolding, as described reconstructions are not accurate even when the number of above, are those that solve the following optimization: principal components greatly exceeds the small number of 2 characteristic poses and expressions in this data set. Maximize yi yj subject to: ij − In this paper, we review a recently proposed algorithm (1) y y 2 = x x 2 for all (i, j) with η =1. i − j i − j ij for nonlinear dimensionality reduction. The algorithm, (2) i yi =0 known as “maximum variance unfolding” (Sun et al. 2006; Saul et al. 2006), discovers faithful low dimensional repre- Here, the first constraint enforces that distances between sentations of high dimensional data, such as images, sounds, nearby inputs match distances between nearby outputs, and text. It also illustrates many ideas in convex optimiza- while the second constraint yields a unique solution (up to tion that are proving increasingly useful in the broader field rotation) by centering the outputs on the origin. of machine learning. The apparent intractability of this quadratic program can Our work builds on earlier frameworks for analyzing be finessed by a simple change of variables. Note that high dimensional data that lies on or near a low dimen- as written above, the optimization over the outputs yi is sional manifold (Tenenbaum, de Silva, & Langford 2000; not convex, meaning that it potentially suffers from spu- Roweis & Saul 2000). Manifolds are spaces that are locally rious local minima. Defining the inner product matrix K = y y , we can reformulate the optimization as a linear, but unlike Euclidean subspaces, they can be globally ij i · j nonlinear. Curves and surfaces are familiar examples of one semidefinite program (SDP) (Vandenberghe & Boyd 1996) and two dimensional manifolds. Compared to earlier frame- over the matrix K. The resulting optimization is simply works for manifold learning, maximum variance unfolding a linear program over the matrix elements Kij, with the has many interesting properties, which we describe in the additional constraint that the matrix K has only nonnega- following sections. tive eigenvalues, a property that holds for all inner product matrices. In earlier work (Weinberger & Saul 2004; Maximum Variance Unfolding Weinberger, Sha, & Saul 2004), we showed that the SDP over K can be written as: Algorithms for nonlinear dimensionality reduction map high dimensional inputs x n to low dimensional out- Maximize trace(K) subject to: i i=1 (1) 2 for all puts y n , where x { d,}y r, and r d. The re- Kii 2Kij +Kjj = xi xj (i, j) i i=1 i i with− . − duced{ dimensionality} r ∈is chosen∈ to be as small as possible, ηij =1 r (2) ΣijKij =0. yet sufficiently large to guarantee that the outputs yi d ∈ (3) K 0. provide a faithful representation of the inputs xi . What constitutes a “faithful” representation?∈ Suppose The last (additional) constraint K 0 requires the ma- that the high dimensional inputs lie on a low dimensional trix K to be positive semidefinite.

An Introduction to Nonlinear Dimensionality Reduction by Maximum Variance Unfolding

Nonlinear Dimensionality Reduction by Locally Linear Embedding

Comparison of Dimensionality Reduction Techniques on Audio Signals

Litrec Vs. Movielens a Comparative Study

Unsupervised Speech Representation Learning Using Wavenet Autoencoders Jan Chorowski, Ron J

Unsupervised Speech Representation Learning Using Wavenet Autoencoders

Dimensionality Reduction and Principal Component Analysis

Collaborative Filtering: a Machine Learning Perspective by Benjamin Marlin a Thesis Submitted in Conformity with the Requirement

13 Dimensionality Reduction and Microarray Data

Information Retrieval Perspective to Nonlinear Dimensionality Reduction for Data Visualization

Collaborative Filtering: a Machine Learning Perspective Chapter 6: Dimensionality Reduction

Accent Transfer with Discrete Representation Learning and Latent Space Disentanglement

Text Comparison Using Word Vector Representations and Dimensionality Reduction