
Recent Developments in Dimensionality Reduction

Liza Levina, Department of Statistics, University of Michigan

Outline

Introduction to dimension reduction
Classical methods: PCA and MDS
Manifold embedding methods: Isomap, LLE, Laplacian and Hessian Eigenmaps, and related work

Estimating intrinsic dimension

Introduction

Dimension Reduction:

The problem:
– Have n points $X_1, \ldots, X_n$ in $\mathbb{R}^p$;
– Find the "best" representation $Y_1, \ldots, Y_n$ in $\mathbb{R}^m$, $m < p$.

Why do it?
– Visualization (m = 2 or 3)
– Computational speed-up (if p is very large)
– De-noising and extracting important "features"

– Ultimately, improving further analysis

Every successful high-dimensional statistical method explicitly or implicitly reduces the dimension of the data and/or the model

Questions to ask when doing dimension reduction:

What is important to preserve here (e.g., variability in the data, interpoint distances, clusters, local vs. global structure, etc.)?

Is interpretation of the new coordinates (i.e., features) important?

Is an explicit mapping from $\mathbb{R}^p$ to $\mathbb{R}^m$ necessary (i.e., given a new point, can you easily express it in the reduced space)?

What effect will this method have on the subsequent analysis/inference (e.g., regression, classification, etc.)?

Classical Methods of Dimension Reduction

I. Principal Component Analysis (PCA)

Problem: find the linear combination of coordinates of Xi that has maximum variance:

$$\alpha_1 = \arg\max_{\|\alpha\|=1} \mathrm{Var}(\alpha^T X), \qquad \alpha_2 = \arg\max_{\|\alpha\|=1,\ \alpha \perp \alpha_1} \mathrm{Var}(\alpha^T X), \ \ldots$$

Solution: $\alpha_1, \alpha_2, \ldots$ are the eigenvectors of the sample covariance matrix $\hat{\Sigma}$, corresponding to eigenvalues $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_k$.
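For concreteness, here is a minimal numpy sketch of PCA via eigendecomposition of the sample covariance matrix, as described above; the function name `pca` and its interface are illustrative, not part of the talk.

```python
import numpy as np

def pca(X, m):
    """Project an n x p data matrix X onto its first m principal components."""
    Xc = X - X.mean(axis=0)                   # center the data
    cov = np.cov(Xc, rowvar=False)            # sample covariance matrix (p x p)
    evals, evecs = np.linalg.eigh(cov)        # eigendecomposition (ascending order)
    order = np.argsort(evals)[::-1]           # sort eigenvalues in decreasing order
    components = evecs[:, order[:m]]          # top-m eigenvectors alpha_1, ..., alpha_m
    return Xc @ components, evals[order]      # scores (n x m) and sorted eigenvalues
```

The sorted eigenvalues returned here are what one would plot as a scree plot when choosing m.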

Example: Population and Sample Principal Components

[Figure: 2-d normal sample, n = 200, showing the sample and population principal component directions.]

Advantages:

very simple and popular
mapping new points is easy

Disadvantages:

interpretation often not possible
for p larger than n, not consistent (needs a good estimator $\hat{\Sigma}$ of the covariance) (Johnstone & Lu 2004, Paul & Johnstone 2004)

linear projection: may "collapse" non-linear features
how many components to take? (usually from eigenvalue plots, a.k.a. scree plots, or by explaining a given fraction of the total variance, e.g. 80%)

Modern versions of PCA for large data

Sparse (basis) PCA (Johnstone & Lu 2004)

deals with very high-dimensional vectors
suitable for signal processing applications
good for "denoising"

Algorithm:
– Transform to a "sparse" basis (e.g. wavelets, or …)

– Discard coefficients close to 0

– Do PCA on the rest and transform back to the signal
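A loose sketch of the three steps listed above, using a DCT in place of a wavelet basis and a simple quantile threshold; the function name, the `keep` fraction, and the basis choice are all illustrative assumptions, the `pca` helper is the one sketched earlier, and this is not the exact Johnstone & Lu procedure.

```python
import numpy as np
from scipy.fft import dct, idct

def sparse_basis_pca(X, m, keep=0.05):
    """Rough sketch: transform to a sparse basis, threshold, PCA, transform back."""
    C = dct(X, axis=1, norm='ortho')             # 1. transform each row to a sparse basis
    cutoff = np.quantile(np.abs(C), 1 - keep)    #    keep only the largest `keep` fraction
    C[np.abs(C) < cutoff] = 0.0                  # 2. discard coefficients close to 0
    scores, evals = pca(C, m)                    # 3. PCA on the remaining coefficients
    denoised = idct(C, axis=1, norm='ortho')     #    and transform back to the signal domain
    return scores, denoised
```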

Sparse (loadings) PCA (Zou, Hastie, Tibshirani 2004)

use a penalty so that PCs have only a few non-zero entries
makes interpretation easier

Other PCA-related methods

Factor Analysis (used a lot in social sciences)

x = As + ε

s Gaussian, ε Gaussian(0, I), estimate A

Independent Component Analysis (used a lot for “blind source separation” and other tasks)

x = As

components of s independent, non-Gaussian, estimate A
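As an aside, here is a small hypothetical demo of ICA using scikit-learn's FastICA, one common off-the-shelf implementation (not something from the talk); the sources, mixing matrix, and parameters below are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
S = np.c_[np.sign(rng.standard_normal(1000)),   # two independent non-Gaussian sources
          rng.laplace(size=1000)]
A = np.array([[1.0, 0.5], [0.5, 1.0]])          # mixing matrix
X = S @ A.T                                     # observed mixtures x = As

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)                    # estimated sources (up to scale and order)
A_hat = ica.mixing_                             # estimated mixing matrix
```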

II. Multi-dimensional Scaling (MDS)

Based on the data distance (or dissimilarity) matrix $\Delta_{ij} = d(X_i, X_j)$

General problem: find $Y = (Y_1, \ldots, Y_n) \in \mathbb{R}^m$ such that $D_{ij}(Y) = \|Y_i - Y_j\|$ minimizes
$$H(Y) = \sum_{i} \sum_{j} w_{ij} \left( D_{ij}(Y) - \Delta_{ij} \right)^2$$
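A minimal sketch of classical (metric) MDS via double centering, which covers the unweighted case $w_{ij} = 1$ with Euclidean distances; the function name is illustrative.

```python
import numpy as np

def classical_mds(D, m):
    """Classical (Torgerson) MDS: embed an n x n distance matrix D into R^m."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n         # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                 # double-centered squared distances
    evals, evecs = np.linalg.eigh(B)            # eigendecomposition (ascending order)
    idx = np.argsort(evals)[::-1][:m]           # top-m eigenvalues/eigenvectors
    scale = np.sqrt(np.maximum(evals[idx], 0))  # clip small negative eigenvalues
    return evecs[:, idx] * scale                # n x m configuration Y
```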

Advantages

Very useful for visualization
Points themselves are not needed, only distances
Non-metric version can be used with very general dissimilarities

There are versions designed to preserve clusters

Disadvantages

Global method: requires distance measurements to be accurate no matter how far apart the points are

Only gives relative locations (though this could be an advantage as well) – need extra information to go to geographic coordinates (e.g. localization of sensor networks)

No interpretation of the new coordinates

Nonlinear Dimensionality Reduction (“Manifold Embedding” methods)

Recently developed in machine learning (last 5 years)
Main motivation: highly non-linear structures in the data (data manifolds)

Main idea: local geometry is preserved – therefore do everything locally (in neighborhoods), then put together

Currently in the process of moving from a collection of diverse algorithms to unified framework(s)

The Isomap

[Tenenbaum, de Silva, Langford 2000]

The algorithm:

1. Find neighborhoods for each point Xi (such as k nearest neighbors)

2. For neighbors, take Euclidean distances; for non-neighbors, length of the shortest path through the neighborhood graph.

3. Apply classical MDS to the distance matrix.
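A compact sketch of these three steps, reusing the `classical_mds` helper sketched earlier; the choice of k and the use of scikit-learn/scipy utilities are illustrative assumptions, and the neighborhood graph is assumed to be connected (see the caveats below).

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def isomap(X, m, k=10):
    """Sketch of Isomap: k-NN graph -> graph shortest paths -> classical MDS."""
    # 1. k-nearest-neighbor graph with Euclidean edge lengths
    G = kneighbors_graph(X, n_neighbors=k, mode='distance')
    # 2. geodesic (shortest-path) distances through the neighborhood graph
    D = shortest_path(G, method='D', directed=False)
    # 3. classical MDS on the geodesic distance matrix
    return classical_mds(D, m)
```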

Isomap results: Faces

n = 698, p = 64 × 64; distances are Euclidean distances between intensity vectors

Isomap results: Digits

n = 1000, special metric for handwritten digits

Features of the Isomap

Possibly the most intuitive manifold embedding method
Ideal for applications where there is an underlying physical distance space (e.g., localization in wireless sensor networks)

Global method: requires global isometry of the manifold embedding $x = f(y)$, i.e. $|x_i - x_j|_{\mathcal{M}} = |y_i - y_j|$
Can only embed a fully connected graph
Shortest paths are expensive to compute

Known problems with the Isomap

Global isometry – hence cannot deal with local distortions
– Fix: C-Isomap (de Silva & Tenenbaum 2003) normalizes distances locally; but it's not clear when this is appropriate (local distortion vs. uneven sampling).

Does not deal well with “holes” in the sample (shortest paths have to go around)

A single erroneous link can “short-circuit” the graph and warp the embedding

– Fix: "Convex flows embedding" ("Seeing through water", Efros et al 2005) uses flow capacity as a measure of distance (too many paths through the same link are penalized)

Locally Linear Embedding (LLE)

[Saul & Roweis 2000]

Main idea: locally everything is linear

The algorithm:

1. Find neighborhoods for each point Xi (such as k nearest neighbors)

2. Find weights Wij that give the best linear reconstruction of Xi from its neighbors Xj

3. Fix weights Wij and find lower-dimensional points Yi that can be best reconstructed from their neighbors with these weights

Solved by an eigenvalue problem
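In practice, scikit-learn's LocallyLinearEmbedding is one readily available implementation of these steps (not the authors' original code); the neighborhood size and the Swiss-roll toy data below are arbitrary illustrative choices.

```python
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.datasets import make_swiss_roll

X, _ = make_swiss_roll(n_samples=1000, random_state=0)   # toy 3-d manifold data

# Steps 1-3 are handled internally; the final embedding is the solution
# of a sparse eigenvalue problem.
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, method='standard')
Y = lle.fit_transform(X)                                  # n x 2 embedded coordinates
```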

Laplacian Eigenmaps

[Belkin & Niyogi 2002]

Main Idea: want to keep neighbors close in the embedding; given

$$|f(x_i) - f(x_j)| \lesssim \|\nabla f\| \, |x_i - x_j|,$$
want to find f with minimal $\int \|\nabla f\|^2$

Provided theoretical analysis, connections to graph Laplacian operators and heat kernels (spectral graph theory), spectral clustering (graph partitioning algorithms)

Computationally, solves another eigenvalue problem
From the practical point of view, the embedding is very similar to the LLE (they show LLE is approximating their criterion)
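A bare-bones sketch of the graph-Laplacian eigenproblem behind the method, using simple 0/1 neighborhood weights rather than a heat kernel; the function name and parameters are illustrative, and a connected graph with distinct points is assumed.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.linalg import eigh

def laplacian_eigenmaps(X, m, k=10):
    """Sketch of Laplacian eigenmaps with 0/1 adjacency weights."""
    W = kneighbors_graph(X, n_neighbors=k, mode='connectivity')
    W = 0.5 * (W + W.T)                  # symmetrize the k-NN graph
    W = W.toarray()
    D = np.diag(W.sum(axis=1))           # degree matrix
    L = D - W                            # graph Laplacian
    # generalized eigenproblem L v = lambda D v; drop the constant eigenvector
    evals, evecs = eigh(L, D)
    return evecs[:, 1:m + 1]             # embedding coordinates (n x m)
```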

Hessian Eigenmaps

[Donoho & Grimes 2003]

Same framework as Laplacian Eigenmaps, but using a Hessian instead of Laplacian

Accounts for local curvature
The only method with proven optimality properties under ideal conditions

The catch: Hessian estimates are noisy! (i.e. need a larger sample size)

LLE, Laplacian and Hessian Eigenmaps are all local methods; they assume local isometry

Dealing with holes: comparison on the Swiss roll

[Figure: Swiss roll with a hole. Panels: Original Data, Regular LLE, Hessian LLE, ISOMAP.]

Issues to be resolved

How do you know what dimension to project to?
– Fair amount of work done (the rest of this talk)

How do you project new points without recomputing the whole embedding?

– Usually just interpolate

– Charting (Brand 2002): embedding somewhat similar to LLE but provides an explicit mapping

– Out-of-sample extensions (Bengio et al 2003): based on the kernel view

How do you interpret the new coordinates?
– Not much beyond 2-d pictures

How does this help you in further analysis?
– Classification of partially labeled data (Belkin & Niyogi 2003)
  1. Project all data onto a manifold
  2. Use labeled data to train a classifier on the projections

– Embeddings that simultaneously enhance classification (Vlachos et al 2002, de Ridder & Duin 2002, Costa & Hero 2005(?)): force points from the same class to be projected closer together

– Dimensionality reduction in regression is a whole area in itself - not covered in this talk

The area is constantly growing; one good source is http://www.cse.msu.edu/~lawhiu/manifold

Estimating Intrinsic Dimension

Picking the right dimension is important:

m too small ⇒ important data features are "collapsed"
m too large ⇒ the projections become noisy and/or unstable

What do major algorithms do?

LLE, Laplacian and Hessian Eigenmaps: dimension is provided by the user

The Isomap: MDS error can be "eyeballed" to estimate dimension
Charting: heuristic estimate equivalent to the "regression" estimator below.

Dimension Estimation Methods

1. Eigenvalue methods (local or global PCA, dimension = the number of eigenvalues greater than a given threshold).

Used a lot in practice with regular (global) PCA
Global PCA cannot handle nonlinear manifolds
Local PCA is unstable

2. Nearest neighbor (NN) methods

If $X_1, \ldots, X_n$ are an i.i.d. sample from a density $f(x)$ in $\mathbb{R}^m$, then
$$\frac{k}{n} \approx f(x)\, V(m)\, T_k(x)^m$$

– $V(m)$ is the volume of the unit sphere in $\mathbb{R}^m$,
– $T_k(x)$ is the Euclidean distance from x to its k-th nearest neighbor.

Regression estimator (Pettis et al. 1979):
– Take logs: $\log \frac{k}{n} \approx c(m) + m \log T_k(x)$
– Average over points: $\bar{T}_k = \frac{1}{n} \sum_{i=1}^{n} T_k(X_i)$
– Estimate m by regressing $\log \bar{T}_k$ on $\log k$

– Need to choose the "linear part"; this ignores dependence in $T_k$.
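A rough sketch of this regression idea, with a few simplifications that are my assumptions rather than the slide's: the regression is flipped (log k on log T̄_k) so the slope estimates m directly instead of 1/m, the whole range of k is used rather than a chosen "linear part", and distinct data points are assumed.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def regression_dim_estimate(X, k_max=20):
    """Sketch of a Pettis-style regression estimator of intrinsic dimension."""
    nn = NearestNeighbors(n_neighbors=k_max + 1).fit(X)
    dist, _ = nn.kneighbors(X)                 # column 0 is the point itself
    Tk_bar = dist[:, 1:].mean(axis=0)          # average k-NN distance, k = 1..k_max
    ks = np.arange(1, k_max + 1)
    # log(k/n) ~ c(m) + m log T_k, so log k is roughly linear in log T_k with slope m
    slope, _ = np.polyfit(np.log(Tk_bar), np.log(ks), 1)
    return slope
```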

Other functions of k-NN
– The length of the minimal spanning tree of the k-NN graph (Costa & Hero 2003); total length of the k-NN graph (Costa & Hero 2004)
– Both have linear relationships to log n, with slope determined by m

– Both are global methods

– Need to resample points to create a curve over n

3. Fractal methods

Correlation dimension: – defined by

$$C_n(r) = \frac{2}{n(n-1)} \sum_{i=1}^{n} \sum_{j=i+1}^{n} \mathbf{1}\{\|X_i - X_j\| < r\}, \qquad m_{\mathrm{corr}} = \lim_{r \to 0} \frac{\lim_{n \to \infty} \log C_n(r)}{\log r}$$
– estimated by regressing $\log C_n(r)$ on $\log r$ (Grassberger & Procaccia 1983)

– Need to choose the “linear part”; dependence ignored
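A short sketch of the Grassberger-Procaccia idea; the radius grid is left to the user, the fit is over the whole grid (sidestepping the "linear part" choice noted above), and every radius in the grid is assumed to capture at least one pair so the logs are finite.

```python
import numpy as np
from scipy.spatial.distance import pdist

def correlation_dimension(X, r_grid):
    """Sketch of the Grassberger-Procaccia correlation dimension estimate."""
    d = pdist(X)                                     # all pairwise Euclidean distances
    C = np.array([np.mean(d < r) for r in r_grid])   # correlation integral C_n(r)
    # slope of log C_n(r) against log r
    slope, _ = np.polyfit(np.log(r_grid), np.log(C), 1)
    return slope
```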

Capacity dimension and packing numbers (Kégl 2002)

4. Maximum Likelihood Estimator (MLE) (Levina & Bickel 2005)

Idea: fix a point x, assume f(x) ≈ const in a small sphere around x, and treat the observations as a homogeneous Poisson process.

Closed-form solution at each point x:
$$\hat{m}_k(x) = \left[ \frac{1}{k-2} \sum_{j=1}^{k-1} \log \frac{T_k(x)}{T_j(x)} \right]^{-1}$$
where k is the fixed number of NN in each neighborhood

Unless local or cluster estimates are desired, average over points:
$$\hat{m}_k = \frac{1}{n} \sum_{i=1}^{n} \hat{m}_k(X_i)$$
No regression here; the only parameter to choose is k.
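A minimal numpy sketch of this estimator, following the formulas as reconstructed above (in particular the k−2 normalization); the function name and default k are illustrative, and distinct data points are assumed so that all log ratios are finite.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mle_dimension(X, k=10):
    """Sketch of the Levina-Bickel MLE of intrinsic dimension with fixed k."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, _ = nn.kneighbors(X)                          # column 0 is the point itself
    T = dist[:, 1:]                                     # T_1(x), ..., T_k(x) for each point
    log_ratios = np.log(T[:, -1][:, None] / T[:, :-1])  # log(T_k / T_j), j = 1..k-1
    m_local = (k - 2) / log_ratios.sum(axis=1)          # local estimate at each point
    return m_local.mean()                               # average over all points
```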

MLE does not depend on the density of points on the manifold
MLE is consistent and asymptotically normal: as $n \to \infty$,
$$\sqrt{n}\, (\hat{m}_{k,n} - m) \to N(0, \sigma^2)$$

How should dimension estimators be compared?

Behavior as a function of sample size n and dimension m (n has to grow very fast with m, and all methods underestimate dimension for higher m)

Bias and variance
Global vs. local
Computational cost
Number of parameters to set and amount of human intervention

Comparing Methods

Methods:

1. The MLE

2. The regression estimator

3. The correlation dimension

All parameters are fixed throughout the comparisons

1. Simulations

Shown on m-spheres with n = 1000 uniform points (similar pattern on other sets)

MLE has the smallest variance and the best balance of bias and variance

[Figure: estimated dimension (mean ± 2 SD) vs. true dimension on m-spheres, for the MLE, regression, and correlation dimension estimators.]

2. Popular Datasets

Dataset      Data dim.   Sample size   MLE          Regression   Corr. dim.
Swiss roll   3           1000          2.0 (0.03)   1.8 (0.03)   2.0 (0.24)
Hands        480 × 512   481           2.9          2.5          3.9 / 19.7
Faces        64 × 64     698           4.8          4.0          3.5

Hands: video of rotation (front, back, side views)
Faces: illumination, vertical + horizontal orientation

Many questions remain...

What if the manifold is sampled non-uniformly?
What is the effect of noise (data on a manifold vs. near a manifold)?
What if there are multiple manifolds within one dataset?
How can this help in identifying important variables (i.e. interpretation)?

How can this help in further analysis, e.g., classification?

Practitioners: new methods to try on your data

Statisticians: lots of interesting questions to work on!