Recent Developments in Dimensionality Reduction
Liza Levina
Department of Statistics, University of Michigan


Outline

• Introduction to dimension reduction
• Classical methods: PCA and MDS
• Manifold embedding methods: Isomap, LLE, Laplacian and Hessian Eigenmaps and related work
• Estimating intrinsic dimension


Introduction

Dimension Reduction:

• The problem:
  – Have n points $X_1, \dots, X_n$ in $\mathbb{R}^p$;
  – Find the "best" representation $Y_1, \dots, Y_n$ in $\mathbb{R}^m$, $m < p$.
• Why do it?
  – Visualization ($m = 2$ or $3$)
  – Computational speed-up (if $p$ is very large)
  – De-noising and extracting important "features"
  – Ultimately, improving further analysis
• Every successful high-dimensional statistical method explicitly or implicitly reduces the dimension of the data and/or the model.


Questions to ask when doing dimension reduction:

• What is important to preserve here (e.g., variability in the data, interpoint distances, clusters, local vs. global structure, etc.)?
• Is interpretation of the new coordinates (i.e., features) important?
• Is an explicit mapping from $\mathbb{R}^p$ to $\mathbb{R}^m$ necessary (i.e., given a new point, can you easily express it in the reduced space)?
• What effect will this method have on the subsequent analysis/inference (regression, classification, etc.)?


Classical Methods of Dimension Reduction

I. Principal Component Analysis (PCA)

Problem: find the linear combination of coordinates of $X_i$ that has maximum variance:
$$v_1 = \arg\max_{\|v\| = 1} \mathrm{Var}(v^T X), \qquad v_2 = \arg\max_{\|v\| = 1,\; v \perp v_1} \mathrm{Var}(v^T X), \;\dots$$

Solution: $v_1, v_2, \dots$ are the eigenvectors of the sample covariance matrix $\hat{\Sigma}$, corresponding to its eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_k$.

[Figure: Example — population and sample principal components; a 2-d normal sample, n = 200, with the sample PC and population PC directions overlaid.]

Advantages:
• very simple and popular
• mapping new points is easy

Disadvantages:
• interpretation often not possible
• for p larger than n, not consistent (needs a good estimator $\hat{\Sigma}$ of the covariance) (Johnstone & Lu 2004, Paul & Johnstone 2004)
• linear projection: may "collapse" non-linear features
• how many components to take? (usually from eigenvalue plots, a.k.a. scree plots, or by explaining a given fraction of the total variance, e.g. 80%)
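To make the PCA recipe above concrete (eigenvectors of the sample covariance matrix $\hat{\Sigma}$, sorted by eigenvalue, used as projection directions), here is a minimal NumPy sketch; the function name, the toy data, and the choice of $m$ are illustrative assumptions, not part of the talk.

```python
import numpy as np

def pca_project(X, m):
    """Project an n x p data matrix X onto its top m principal components.

    Returns the centered scores (n x m) and the loading vectors (p x m),
    i.e. the leading eigenvectors of the sample covariance matrix.
    """
    Xc = X - X.mean(axis=0)                  # center the data
    Sigma_hat = np.cov(Xc, rowvar=False)     # sample covariance (p x p)
    eigvals, eigvecs = np.linalg.eigh(Sigma_hat)
    order = np.argsort(eigvals)[::-1]        # decreasing eigenvalue order
    V = eigvecs[:, order[:m]]                # top-m eigenvectors (loadings)
    return Xc @ V, V

# Toy usage: 200 points in R^5 whose variance lives mostly in 2 directions.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(200, 5))
Y, V = pca_project(X, m=2)
print(Y.shape)   # (200, 2)
```

A scree plot of the sorted eigenvalues (or their cumulative fraction of the total variance) is the usual way to choose m, as noted in the last bullet above.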
Modern versions of PCA for large data

Sparse (basis) PCA (Johnstone & Lu 2004)
• deals with very high-dimensional vectors
• suitable for signal processing applications
• good for "denoising"
• Algorithm:
  – Transform to a "sparse" basis (e.g. wavelets, or the Fourier transform)
  – Discard coefficients close to 0
  – Do PCA on the rest and transform back to the signal

Sparse (loadings) PCA (Zou, Hastie, Tibshirani 2004)
• uses a penalty so that the PCs have only a few non-zero entries
• makes interpretation easier

Other PCA-related methods

• Factor Analysis (used a lot in the social sciences):
  $x = As + \varepsilon$, with $s$ Gaussian and $\varepsilon \sim$ Gaussian$(0, I)$; estimate $A$.
• Independent Component Analysis (used a lot for "blind source separation" and other signal processing tasks):
  $x = As$, with the components of $s$ independent and non-Gaussian; estimate $A$.


II. Multi-dimensional Scaling (MDS)

• Based on the data distance (or dissimilarity) matrix $\Delta_{ij} = d(X_i, X_j)$
• General problem: find $Y = (Y_1, \dots, Y_n) \in \mathbb{R}^m$ such that $D_{ij}(Y) = \|Y_i - Y_j\|$ minimizes the stress
$$H(Y) = \sum_{i} \sum_{j < i} w_{ij} \left( D_{ij}(Y) - \Delta_{ij} \right)^2$$
• Weights $w_{ij}$ can be used to omit missing data, attach confidence to individual distance measurements, normalize $H(Y)$, etc.
  (A small numerical sketch of this minimization appears after the Isomap slides below.)

Advantages
• Very useful for visualization
• The points themselves are not needed, only the distances
• A non-metric version can be used with very general dissimilarities
• There are versions designed to preserve clusters

Disadvantages
• Global method: requires distance measurements to be accurate no matter how far apart the points are
• Only gives relative locations (though this could be an advantage as well) – extra information is needed to go to geographic coordinates (e.g. localization of sensor networks)
• No interpretation of the new coordinates


Nonlinear Dimensionality Reduction ("Manifold Embedding" methods)

• Recently developed in machine learning (the last 5 years)
• Main motivation: highly non-linear structures in the data (data manifolds)
• Main idea: local geometry is preserved – therefore do everything locally (in neighborhoods), then put the pieces together
• Currently in the process of moving from a collection of diverse algorithms to unified framework(s)


The Isomap [Tenenbaum, de Silva, Langford 2000]

The algorithm:
1. Find a neighborhood for each point $X_i$ (such as its k nearest neighbors).
2. For neighbors, take Euclidean distances; for non-neighbors, the length of the shortest path through the neighborhood graph.
3. Apply classical MDS to the resulting distance matrix.
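Here is the small numerical sketch of the MDS stress minimization promised above. It feeds $H(Y)$ to a generic SciPy optimizer; the unit weights, random initialization, and function names are my own assumptions rather than anything from the talk, and classical (metric) MDS would instead use an eigendecomposition, as in the Isomap sketch that follows.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import pdist, squareform

def mds_stress(Delta, m=2, W=None, seed=0):
    """Least-squares MDS: find Y in R^m minimizing
    H(Y) = sum_{j<i} w_ij * (||Y_i - Y_j|| - Delta_ij)^2."""
    n = Delta.shape[0]
    if W is None:
        W = np.ones((n, n))                  # equal weights by default
    iu = np.triu_indices(n, k=1)             # each pair counted once

    def stress(y_flat):
        Y = y_flat.reshape(n, m)
        D = squareform(pdist(Y))             # pairwise Euclidean distances of Y
        return np.sum(W[iu] * (D[iu] - Delta[iu]) ** 2)

    rng = np.random.default_rng(seed)
    y0 = rng.normal(scale=0.1, size=n * m)   # random initial configuration
    res = minimize(stress, y0, method="L-BFGS-B")
    return res.x.reshape(n, m)

# Toy usage: recover a 2-d configuration from its own distance matrix.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
Y = mds_stress(squareform(pdist(X)), m=2)
```

The recovered Y is determined only up to rotation, reflection, and translation, which is the "only relative locations" point made in the MDS disadvantages.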
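The three Isomap steps just listed map directly onto standard building blocks: a k-NN graph, graph shortest paths, and classical MDS. The sketch below is an illustrative reduction to those pieces, assuming the neighborhood graph is connected; it is not the authors' reference implementation, and the names and defaults are mine.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def isomap_sketch(X, m=2, k=10):
    """1) k-NN graph, 2) graph shortest-path distances, 3) classical MDS."""
    # Steps 1-2: neighborhood graph with Euclidean edge weights, then
    # shortest paths through the graph for all non-neighboring pairs.
    G = kneighbors_graph(X, n_neighbors=k, mode="distance")
    D = shortest_path(G, method="D", directed=False)   # assumes a connected graph

    # Step 3: classical MDS on the geodesic distance matrix.
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n                # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                        # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:m]                # top m eigenvalues
    return eigvecs[:, idx] * np.sqrt(np.maximum(eigvals[idx], 0))

# Usage: Y = isomap_sketch(X, m=2, k=10)
```

A tuned version of the same idea is available as sklearn.manifold.Isomap; the sketch only spells out where the shortest-path and MDS steps enter.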
Isomap results: Faces

[Figure] n = 698, p = 64 × 64; distances are Euclidean distances between the intensity vectors.

Isomap results: Digits

[Figure] n = 1000; a special metric for handwritten digits.


Features of the Isomap

• Possibly the most intuitive manifold embedding method
• Ideal for applications where there is an underlying physical distance space (e.g., localization in wireless sensor networks)
• Global method: requires global isometry of the manifold embedding $x = f(y)$, i.e. $\|x_i - x_j\|_{\mathcal{M}} = \|y_i - y_j\|$
• Can only embed a connected neighborhood graph
• Shortest paths are expensive to compute


Known problems with the Isomap

• Global isometry – hence it cannot deal with local distortions
  – Fix: C-Isomap (de Silva & Tenenbaum 2003) normalizes distances locally; but it is not clear when this is appropriate (local distortion vs. uneven sampling).
• Does not deal well with "holes" in the sample (shortest paths have to go around them)
• A single erroneous link can "short-circuit" the graph and warp the embedding
  – Fix: "Convex flows embedding" ("Seeing through water", Efros et al 2005) uses flow capacity as a measure of distance (too many paths through the same link are penalized)


Locally Linear Embedding (LLE) [Saul & Roweis 2000]

Main idea: locally everything is linear.

• The algorithm (a code sketch appears at the end of these notes):
  1. Find a neighborhood for each point $X_i$ (such as its k nearest neighbors).
  2. Find the weights $W_{ij}$ that give the best linear reconstruction of $X_i$ from its neighbors $X_j$.
  3. Fix the weights $W_{ij}$ and find lower-dimensional points $Y_i$ that are best reconstructed from their neighbors with these weights.
• Solved by an eigenvalue problem


Laplacian Eigenmaps [Belkin & Niyogi 2002]

Main idea: want to keep neighbors close in the embedding; given
$$|f(x_i) - f(x_j)| \approx \|\nabla f\| \, \|x_i - x_j\|,$$
want to find $f$ with minimal $\int \|\nabla f\|^2$.

• Provided theoretical analysis: connections to graph Laplacian operators and heat kernels (spectral graph theory), and to spectral clustering (graph partitioning algorithms)
• Computationally, solves another eigenvalue problem
• From the practical point of view, the embedding is very similar to the LLE (they show that LLE approximates their criterion)


Hessian Eigenmaps [Donoho & Grimes 2003]

• Same framework as Laplacian Eigenmaps, but using a Hessian instead of a Laplacian
• Accounts for local curvature
• The only method with proven optimality properties under ideal conditions
• The catch: Hessian estimates are noisy! (i.e. a larger sample size is needed)

LLE, Laplacian and Hessian Eigenmaps are all local methods; they assume local isometry.


Dealing with holes: comparison on the Swiss roll

[Figure: a Swiss roll with a hole; four panels compare the Original Data with the Regular LLE, Hessian LLE, and ISOMAP embeddings.]


Issues to be resolved

• How do you know what dimension to project to?
  – A fair amount of work has been done (the rest of this talk)
• How do you project new points without recomputing the whole embedding?
  – Usually just interpolate
  – Charting (Brand 2002): an embedding somewhat similar to LLE, but it provides an explicit mapping
  – Out-of-sample extensions (Bengio et al 2003): based on the kernel view
• How do you interpret the new coordinates?
  – Not much beyond 2-d pictures
• How does this help you in further analysis?
  – Classification of partially labeled data (Belkin & Niyogi 2003):
    1. Project all the data onto a manifold
    2. Use the labeled data to train a classifier on the projections
  – Embeddings that simultaneously enhance classification (Vlachos et al 2002, de Ridder & Duin 2002, Costa & Hero 2005(?)): force points from the same class to be projected closer together
  – Dimensionality reduction in regression is a whole area in itself
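Finally, the code sketch promised in the LLE slide above: a bare-bones NumPy version of the three LLE steps. The regularization of the local Gram matrix and all names and defaults are my own illustrative choices, not taken from Saul & Roweis.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def lle_sketch(X, m=2, k=10, reg=1e-3):
    """Locally Linear Embedding: reconstruction weights, then an eigenproblem."""
    n = X.shape[0]
    # Step 1: k nearest neighbors of each point (excluding the point itself).
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
    idx = nbrs.kneighbors(X, return_distance=False)[:, 1:]

    # Step 2: weights W_ij that best reconstruct X_i from its neighbors.
    W = np.zeros((n, n))
    for i in range(n):
        Z = X[idx[i]] - X[i]                      # neighbors relative to X_i
        C = Z @ Z.T                               # local Gram matrix (k x k)
        C += reg * np.trace(C) * np.eye(k)        # regularize for stability
        w = np.linalg.solve(C, np.ones(k))
        W[i, idx[i]] = w / w.sum()                # weights sum to 1

    # Step 3: Y minimizing sum_i ||Y_i - sum_j W_ij Y_j||^2, i.e. the
    # bottom eigenvectors of M = (I - W)^T (I - W).
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    eigvals, eigvecs = np.linalg.eigh(M)
    return eigvecs[:, 1:m + 1]                    # skip the constant eigenvector

# Usage: Y = lle_sketch(X, m=2, k=10)
```

This basic version corresponds roughly to the "Regular LLE" panel in the Swiss roll comparison above; Hessian LLE replaces the reconstruction-weight step with local Hessian estimates.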