Recent Developments in Dimensionality Reduction

Liza Levina
Department of Statistics, University of Michigan

Outline

  • Introduction to dimension reduction
  • Classical methods: PCA and MDS
  • Manifold embedding methods: Isomap, LLE, Laplacian and Hessian Eigenmaps, and related work
  • Estimating intrinsic dimension

Introduction

Dimension reduction, the problem:
  • Have n points X_1, ..., X_n in R^p; find the "best" representation Y_1, ..., Y_n in R^m, with m < p.

Why do it?
  • Visualization (m = 2 or 3)
  • Computational speed-up (if p is very large)
  • De-noising and extracting important "features"
  • Ultimately, improving further analysis

Every successful high-dimensional statistical method explicitly or implicitly reduces the dimension of the data and/or the model.

Questions to ask when doing dimension reduction:
  • What is important to preserve (e.g., variability in the data, interpoint distances, clusters, local vs. global structure)?
  • Is interpretation of the new coordinates (i.e., features) important?
  • Is an explicit mapping from R^p to R^m necessary (i.e., given a new point, can you easily express it in the reduced space)?
  • What effect will the method have on the subsequent analysis/inference (e.g., regression, classification)?

Classical Methods of Dimension Reduction

I. Principal Component Analysis (PCA)

Problem: find the linear combinations of the coordinates of X_i that have maximum variance:

  $v_1 = \arg\max_{\|v\|=1} \mathrm{Var}(v^T X), \quad v_2 = \arg\max_{\|v\|=1,\ v \perp v_1} \mathrm{Var}(v^T X), \quad \dots$

Solution: $v_1, v_2, \dots$ are the eigenvectors of the sample covariance matrix $\hat{\Sigma}$, corresponding to the eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_k$.

Example: Population and Sample Principal Components
[Figure: a 2-d normal sample, n = 200, with the sample and population principal component directions overlaid.]

Advantages:
  • very simple and popular
  • mapping new points is easy

Disadvantages:
  • interpretation is often not possible
  • for p larger than n, not consistent (needs a good estimator $\hat{\Sigma}$ of the covariance) (Johnstone & Lu 2004, Paul & Johnstone 2004)
  • linear projection: may "collapse" non-linear features
  • how many components to take? (usually decided from eigenvalue plots, a.k.a. scree plots, or by explaining a given fraction of the total variance, e.g. 80%)
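To make the eigendecomposition above concrete, here is a minimal sketch in Python/NumPy; the function name, variable names, and the example covariance matrix are illustrative choices, not taken from the talk.

```python
import numpy as np

def pca(X, m):
    """Project the n x p data matrix X onto its first m principal components.

    Returns the n x m scores and the p x m loadings (eigenvectors of the
    sample covariance, ordered by decreasing eigenvalue).
    """
    Xc = X - X.mean(axis=0)                  # center the data
    cov = np.cov(Xc, rowvar=False)           # p x p sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]        # reorder to descending
    V = eigvecs[:, order[:m]]                # top m eigenvectors (loadings)
    return Xc @ V, V

# Example: a 2-d Gaussian sample like the one in the population/sample PC figure
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[4, 2], [2, 2]], size=200)
scores, loadings = pca(X, m=1)
```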
Modern versions of PCA for large data

Sparse (basis) PCA (Johnstone & Lu 2004)
  • deals with very high-dimensional vectors
  • suitable for signal processing applications
  • good for "denoising"
  • Algorithm:
    – Transform to a "sparse" basis (e.g., wavelets or the Fourier transform)
    – Discard coefficients close to 0
    – Do PCA on the remaining coefficients and transform back to the signal

Sparse (loadings) PCA (Zou, Hastie, Tibshirani 2004)
  • uses a penalty so that the principal components have only a few non-zero entries
  • makes interpretation easier

Other PCA-related methods
  • Factor Analysis (used a lot in the social sciences): $x = As + \varepsilon$, with s Gaussian and $\varepsilon \sim \mathrm{Gaussian}(0, I)$; estimate A.
  • Independent Component Analysis (used a lot for "blind source separation" and other signal processing tasks): $x = As$, with the components of s independent and non-Gaussian; estimate A.

II. Multi-dimensional Scaling (MDS)

  • Based on the data distance (or dissimilarity) matrix $\Delta_{ij} = d(X_i, X_j)$
  • General problem: find $Y = (Y_1, \dots, Y_n) \subset R^m$ such that $D_{ij}(Y) = \|Y_i - Y_j\|$ minimizes
      $H(Y) = \sum_i \sum_{j<i} w_{ij} \, (D_{ij}(Y) - \Delta_{ij})^2$
  • The weights $w_{ij}$ can be used to omit missing data, attach confidence to individual distance measurements, normalize H(Y), etc.

Advantages
  • Very useful for visualization
  • The points themselves are not needed, only the distances
  • A non-metric version can be used with very general dissimilarities
  • There are versions designed to preserve clusters

Disadvantages
  • Global method: requires the distance measurements to be accurate no matter how far apart the points are
  • Only gives relative locations (though this can also be an advantage); extra information is needed to recover geographic coordinates (e.g., for localization of sensor networks)
  • No interpretation of the new coordinates

Nonlinear Dimensionality Reduction ("Manifold Embedding" methods)

  • Recently developed in machine learning (over the last 5 years)
  • Main motivation: highly non-linear structures in the data (data manifolds)
  • Main idea: local geometry is preserved, so do everything locally (in neighborhoods), then put the pieces together
  • Currently in the process of moving from a collection of diverse algorithms to unified framework(s)

The Isomap [Tenenbaum, de Silva, Langford 2000]

The algorithm:
  1. Find neighborhoods for each point X_i (such as the k nearest neighbors).
  2. For neighbors, take Euclidean distances; for non-neighbors, the length of the shortest path through the neighborhood graph.
  3. Apply classical MDS to the resulting distance matrix.

Isomap results: Faces
  n = 698, p = 64 × 64; distances are Euclidean distances between intensity vectors.

Isomap results: Digits
  n = 1000, with a special metric for handwritten digits.

Features of the Isomap
  • Possibly the most intuitive manifold embedding method
  • Ideal for applications where there is an underlying physical distance space (e.g., localization in wireless sensor networks)
  • Global method: requires global isometry of the manifold embedding $x = f(y)$, i.e. $\|x_i - x_j\|_M = \|y_i - y_j\|$
  • Can only embed a fully connected graph
  • Shortest paths are expensive to compute

Known problems with the Isomap
  • Assumes global isometry, hence cannot deal with local distortions
    – Fix: C-Isomap (de Silva & Tenenbaum 2003) normalizes distances locally, but it is not clear when this is appropriate (local distortion vs. uneven sampling).
  • Does not deal well with "holes" in the sample (shortest paths have to go around them)
  • A single erroneous link can "short-circuit" the graph and warp the embedding
    – Fix: "Convex flows embedding" ("Seeing through water", Efros et al. 2005) uses flow capacity as a measure of distance (too many paths through the same link are penalized)
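Below is a minimal sketch of the three Isomap steps listed above (k-nearest-neighbor graph, graph shortest paths, classical MDS via double centering). It assumes SciPy is available and that the neighborhood graph is connected; the function name and the default k = 10 are illustrative assumptions, not part of the talk.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def isomap(X, m=2, k=10):
    """Embed the n x p data X into R^m following the Isomap recipe.

    Assumes the k-NN graph is connected (Isomap can only embed a fully
    connected graph); otherwise geodesic distances come out infinite.
    """
    n = X.shape[0]
    # Steps 1-2: k-NN graph with Euclidean edge lengths, then graph shortest paths
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # n x n distances
    G = np.full((n, n), np.inf)                                  # inf = no edge
    nbrs = np.argsort(D, axis=1)[:, 1:k + 1]                     # k nearest neighbors
    for i in range(n):
        G[i, nbrs[i]] = D[i, nbrs[i]]
    G = np.minimum(G, G.T)                                       # symmetrize the graph
    DG = shortest_path(G, method='D', directed=False)            # geodesic estimates
    # Step 3: classical MDS on the geodesic distance matrix (double centering)
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (DG ** 2) @ J
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:m]                          # top m eigenpairs
    return eigvecs[:, idx] * np.sqrt(np.maximum(eigvals[idx], 0))
```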
Locally Linear Embedding (LLE) [Saul & Roweis 2000]

Main idea: locally, everything is linear.

The algorithm:
  1. Find neighborhoods for each point X_i (such as the k nearest neighbors).
  2. Find the weights W_{ij} that give the best linear reconstruction of X_i from its neighbors X_j.
  3. Fix the weights W_{ij} and find lower-dimensional points Y_i that are best reconstructed from their neighbors with these weights.

Solved by an eigenvalue problem.
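A minimal sketch of the LLE steps above, again with illustrative names and parameters (k and the regularization constant are assumptions); regularizing the local Gram matrix is a standard practical device, not something discussed in the talk.

```python
import numpy as np

def lle(X, m=2, k=10, reg=1e-3):
    """Locally Linear Embedding of the n x p data X into R^m."""
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    nbrs = np.argsort(D, axis=1)[:, 1:k + 1]                # k nearest neighbors
    W = np.zeros((n, n))
    # Step 2: best linear reconstruction weights of each point from its neighbors
    for i in range(n):
        Z = X[nbrs[i]] - X[i]                 # neighbors, centered at X[i]
        C = Z @ Z.T                           # local k x k Gram matrix
        C += reg * np.trace(C) * np.eye(k)    # regularize (C may be singular if k > p)
        w = np.linalg.solve(C, np.ones(k))
        W[i, nbrs[i]] = w / w.sum()           # reconstruction weights sum to one
    # Step 3: fix W and find Y minimizing ||Y - WY||^2: bottom eigenvectors of M
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    eigvals, eigvecs = np.linalg.eigh(M)
    return eigvecs[:, 1:m + 1]                # drop the constant eigenvector
```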
Laplacian Eigenmaps [Belkin & Niyogi 2002]

Main idea: keep neighbors close in the embedding; given

  $|f(x_i) - f(x_j)| \approx \|\nabla f\| \, \|x_i - x_j\|,$

we want to find the f with minimal $\int \|\nabla f\|^2$.

  • Provided theoretical analysis and connections to graph Laplacian operators and heat kernels (spectral graph theory) and to spectral clustering (graph partitioning algorithms)
  • Computationally, solves another eigenvalue problem
  • From the practical point of view, the embedding is very similar to LLE (they show that LLE approximates their criterion)

Hessian Eigenmaps [Donoho & Grimes 2003]

  • Same framework as Laplacian Eigenmaps, but using the Hessian instead of the Laplacian
  • Accounts for local curvature
  • The only method with proven optimality properties under ideal conditions
  • The catch: Hessian estimates are noisy (i.e., a larger sample size is needed)

LLE, Laplacian and Hessian Eigenmaps are all local methods; they assume local isometry.

Dealing with holes: comparison on the Swiss roll
[Figure: a Swiss roll with a hole; panels show the original data and the embeddings produced by regular LLE, Hessian LLE, and Isomap.]

Issues to be resolved

  • How do you know what dimension to project to?
    – A fair amount of work has been done (the rest of this talk)
  • How do you project new points without recomputing the whole embedding?
    – Usually just interpolate
    – Charting (Brand 2002): an embedding somewhat similar to LLE, but it provides an explicit mapping
    – Out-of-sample extensions (Bengio et al. 2003): based on the kernel view
  • How do you interpret the new coordinates?
    – Not much beyond 2-d pictures
  • How does this help you in further analysis?
    – Classification of partially labeled data (Belkin & Niyogi 2003): (1) project all data onto a manifold; (2) use the labeled data to train a classifier on the projections
    – Embeddings that simultaneously enhance classification (Vlachos et al. 2002, de Ridder & Duin 2002, Costa & Hero 2005(?)): force points from the same class to be projected closer together
    – Dimensionality reduction in regression is a whole area in itself
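Since LLE, Laplacian and Hessian Eigenmaps all reduce to an eigenvalue problem on a neighborhood graph, a closing sketch of the Laplacian Eigenmaps version may help make that concrete: build heat-kernel weights on the k-NN graph and take the bottom generalized eigenvectors of the graph Laplacian. The bandwidth t, the value of k, and the function name are illustrative assumptions, not details from the talk.

```python
import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmaps(X, m=2, k=10, t=1.0):
    """Laplacian Eigenmaps embedding of the n x p data X into R^m.

    Assumes the symmetrized k-NN graph is connected, so the only zero
    eigenvalue corresponds to the constant eigenvector that we drop.
    """
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    nbrs = np.argsort(D, axis=1)[:, 1:k + 1]
    # Heat-kernel weights on the symmetrized k-NN graph
    W = np.zeros((n, n))
    for i in range(n):
        W[i, nbrs[i]] = np.exp(-D[i, nbrs[i]] ** 2 / t)
    W = np.maximum(W, W.T)
    Deg = np.diag(W.sum(axis=1))              # degree matrix
    L = Deg - W                               # graph Laplacian
    # Generalized eigenproblem L y = lambda Deg y; skip the constant eigenvector
    eigvals, eigvecs = eigh(L, Deg)
    return eigvecs[:, 1:m + 1]
```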