Recent Developments in Dimensionality Reduction
Liza Levina
Department of Statistics, University of Michigan
Outline
Introduction to dimension reduction
Classical methods: PCA and MDS
Manifold embedding methods: Isomap, LLE, Laplacian and Hessian Eigenmaps, and related work
Estimating intrinsic dimension
Introduction
Dimension Reduction:
The problem:
– Have n points X_1, . . . , X_n in R^p;
– Find a “best” representation Y_1, . . . , Y_n in R^m, m < p.
Why do it?
– Visualization (m = 2 or 3)
– Computational speed-up (if p is very large)
– De-noising and extracting important “features”
– Ultimately, improving further analysis
Every successful high-dimensional statistical method explicitly or implicitly reduces the dimension of the data and/or the model.
Questions to ask when doing dimension reduction:
What is important to preserve here (e.g., variability in the data, interpoint distances, clusters, local vs. global structure, etc.)?
Is interpretation of the new coordinates (i.e., features) important?
Is an explicit mapping from R^p to R^m necessary (i.e., given a new point, can you easily express it in the reduced space)?
What effect will this method have on the subsequent analysis/inference (e.g., regression, classification, etc.)?
Classical Methods of Dimension Reduction
I. Principal Component Analysis (PCA)
Problem: find the linear combination of the coordinates of X_i that has maximum variance:

a_1 = argmax_{||a|| = 1} Var(a^T X),   a_2 = argmax_{||a|| = 1, a ⊥ a_1} Var(a^T X),   . . .

Solution: a_1, a_2, . . . are the eigenvectors of the sample covariance matrix Σ̂, corresponding to eigenvalues λ_1 ≥ λ_2 ≥ . . . ≥ λ_k.
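A minimal numpy sketch of this eigendecomposition (the function name and the example data below are illustrative, not from the slides):

```python
import numpy as np

def pca(X, m):
    """Project the rows of X (n x p) onto the top m principal components."""
    Xc = X - X.mean(axis=0)                 # center the data
    S = np.cov(Xc, rowvar=False)            # sample covariance matrix (p x p)
    eigvals, eigvecs = np.linalg.eigh(S)    # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]       # sort descending
    A = eigvecs[:, order[:m]]               # a_1, ..., a_m as columns (loadings)
    return Xc @ A, eigvals[order]           # scores (n x m) and sorted eigenvalues

# example: 200 points in R^10 reduced to m = 2 coordinates
Y, lam = pca(np.random.randn(200, 10), 2)
```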
Example: Population and Sample Principal Components
[Figure: 2-d normal sample, n = 200, with the sample PC and population PC directions overlaid]
Advantages:
very simple and popular
mapping new points is easy

Disadvantages:
interpretation often not possible
for p larger than n, not consistent (needs a good estimator Σ̂ of the covariance) (Johnstone & Lu 2004, Paul & Johnstone 2004)
linear projection: may “collapse” non-linear features
how many components to take? (usually from eigenvalue plots, a.k.a. scree plots, or by explaining a given fraction of the total variance, e.g. 80%)
Modern versions of PCA for large data
Sparse (basis) PCA (Johnstone & Lu 2004)
deals with very high-dimensional vectors
suitable for signal processing applications
good for “denoising”

Algorithm:
– Transform to a “sparse” basis (e.g. wavelets, or the Fourier transform)
– Discard coefficients close to 0
– Do PCA on the rest and transform back to the signal (sketched below)
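A rough sketch of the three steps, assuming a discrete cosine transform as the sparse basis and a simple variance-based cutoff for discarding small coefficients (these choices are illustrative, not the exact Johnstone & Lu procedure):

```python
import numpy as np
from scipy.fft import dct, idct

def sparse_basis_pca(X, m, keep=0.05):
    """X: n x p signals (rows). Returns the top m principal component
    directions, estimated in a sparse basis and mapped back to signal space."""
    C = dct(X, axis=1, norm='ortho')                 # 1. transform to a "sparse" basis
    coord_var = C.var(axis=0)                        # variability of each basis coefficient
    k = max(m, int(keep * C.shape[1]))
    keep_idx = np.argsort(coord_var)[::-1][:k]       # 2. discard coefficients close to 0
    Ck = C[:, keep_idx] - C[:, keep_idx].mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Ck, rowvar=False))   # 3. PCA on the rest
    order = np.argsort(eigvals)[::-1][:m]
    V = np.zeros((C.shape[1], m))                    # pad back to the full coefficient space ...
    V[keep_idx, :] = eigvecs[:, order]
    return idct(V.T, axis=1, norm='ortho')           # ... and transform back to the signal domain
```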
Sparse (loadings) PCA (Zou, Hastie, Tibshirani 2004)
use a penalty so that the PCs have only a few non-zero entries
makes interpretation easier
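As an illustration, scikit-learn’s SparsePCA fits components with an ℓ1 penalty on the loadings, in the same spirit as the Zou-Hastie-Tibshirani formulation (not their exact algorithm; the data and the alpha value below are arbitrary):

```python
import numpy as np
from sklearn.decomposition import SparsePCA

X = np.random.randn(200, 20)                 # illustrative data: n = 200 points in R^20
spca = SparsePCA(n_components=2, alpha=1.0, random_state=0)  # alpha controls sparsity
Y = spca.fit_transform(X)                    # n x 2 scores
print(spca.components_)                      # loadings: most entries are exactly 0,
                                             # so each PC involves only a few variables
```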
Other PCA-related methods

Factor Analysis (used a lot in social sciences):

x = A s + ε
s Gaussian, ε Gaussian(0, I), estimate A
Independent Component Analysis (used a lot for “blind source separation” and other signal processing tasks)
x = As
components of s independent and non-Gaussian; estimate A
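Both models can be fit with standard tools; a brief sketch with scikit-learn on illustrative data (parameter choices are arbitrary):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis, FastICA

X = np.random.randn(500, 8)                      # illustrative data matrix (n x p)

# Factor analysis: x = A s + eps, Gaussian s and noise; estimates A
fa = FactorAnalysis(n_components=3).fit(X)
A_fa = fa.components_.T                          # p x 3 loading matrix (estimate of A)

# ICA: x = A s, independent non-Gaussian components of s; estimates A
ica = FastICA(n_components=3, random_state=0).fit(X)
A_ica = ica.mixing_                              # p x 3 mixing matrix (estimate of A)
s_hat = ica.transform(X)                         # recovered sources (n x 3)
```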
II. Multi-dimensional Scaling (MDS)
Based on the data distance (or dissimilarity) matrix Δ_ij = d(X_i, X_j).

General problem: find Y = (Y_1, . . . , Y_n) in R^m such that the distances D_ij(Y) = ||Y_i − Y_j|| minimize

H(Y) = Σ_{i<j} w_ij (D_ij(Y) − Δ_ij)^2
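A small sketch of metric MDS by direct numerical minimization of H(Y), assuming uniform weights w_ij = 1 (scikit-learn’s sklearn.manifold.MDS offers a SMACOF-based implementation of the same idea):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.optimize import minimize

def mds(X, m=2):
    """Metric MDS: embed the rows of X in R^m by minimizing the stress H(Y)."""
    Delta = squareform(pdist(X))                   # dissimilarity matrix Delta_ij
    n = Delta.shape[0]
    iu = np.triu_indices(n, k=1)                   # pairs i < j

    def stress(y_flat):
        Y = y_flat.reshape(n, m)
        D = squareform(pdist(Y))                   # embedded distances D_ij(Y)
        return np.sum((D[iu] - Delta[iu]) ** 2)    # H(Y) with w_ij = 1

    Y0 = np.random.RandomState(0).randn(n, m)      # random initial configuration
    res = minimize(stress, Y0.ravel(), method='L-BFGS-B')
    return res.x.reshape(n, m)

# example: embed 30 points from R^5 into the plane
Y = mds(np.random.randn(30, 5), m=2)
```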