
Mathematical Advances in Manifold Learning

Nakul Verma University of California, San Diego [email protected]

June 03, 2008

Abstract

Manifold learning has recently gained a lot of interest by machine learning practitioners. Here we provide a mathematically rigorous treatment of some of the techniques in unsupervised learning in the context of manifolds. We will study the problems of dimension reduction and density estimation and present some recent results in terms of fast convergence rates when the data lie on a manifold.

1 Introduction

With the increase in the amount of data, both in terms of the number of observations as well as the number of measurements, traditional learning algorithms are now faced with new challenges. One may expect that more data should lead to more accurate models; however, a large collection of irrelevant and correlated features just adds to the computational burden of the algorithm, without helping much to solve the task at hand. This makes the learning task especially difficult. In an attempt to alleviate such problems, a new model in terms of manifolds for finding relevant features and representing the data by a few parameters is gaining interest in the machine learning and signal processing communities.

Most common examples of superficially high dimensional data are found in the fields of data mining and computer vision. Consider the problem of estimating the face and body pose in humans. Knowing where a person is looking gives a wealth of information to an automated agent regarding where the object of interest is – whether the person wants to interact with the agent or whether she is conversing with another person. The task of deciding where someone is looking seems quite challenging given the fact that the agent is only receiving a large array of pixels. However, knowing that a person's orientation only has one degree of freedom, the relevant information in this data can be expressed by just a single number – the angle of the turn, i.e. the orientation of the body.

In a typical learning scenario the task is slightly more complicated, as the agent only gets to see a few samples from which it somehow needs to interpolate and generalize to various possible scenarios. In our example this translates to the agent only having access to a few of the body poses, from which it needs to predict where the person is looking. Thus the agent is faced with the difficulty of finding an appropriate (possibly non-linear) transformation to represent this data compactly. Manifold learning can be broadly described as the study of algorithms that use and infer the properties of data that is sampled from an underlying manifold.

The goal of this survey is to study different mathematical techniques by which we can estimate some global properties of a manifold from a few samples. We will start by studying random projections as a nonadaptive linear dimensionality reduction procedure, which provides a probabilistic guarantee on preserving the interpoint distances between all points on a manifold. We will then focus on analyzing the spectrum of the Laplace-Beltrami operator on functions on a manifold for finding non-linear mappings and simplifying its structure. Lastly we will look at kernel density estimation to estimate high density regions on a manifold.

It is worth mentioning that our survey is by no means comprehensive and we simply highlight some of the recent theoretical advances in manifold learning. Most notably we do not cover the topics of regularization, regression and clustering of data belonging to manifolds. In the topic of dimensionality reduction, we are skipping the analysis of classic techniques such as LLE (Locally Linear Embedding), Isomap and their variants.

1.1 Preliminaries

We begin by introducing our notation, which we will use throughout the paper.


Definition 1. We say a map f : U → V is a diffeomorphism if it is smooth¹ and invertible with a smooth inverse.

Definition 2. A subset M ⊂ R^D is said to be a smooth n-manifold if M is locally diffeomorphic to R^n, that is, at each p ∈ M we can find an open neighborhood U ⊂ R^D such that there exists a diffeomorphism between U ∩ M and R^n.

It is always helpful to have a picture in mind. See figure 1 for an example of a 1-manifold in R^3. Notice that locally any small segment of the manifold "looks like" an interval in R^1.

Figure 1: A 1-manifold in R^3.

Definition 3. A tangent space at a point p ∈ M, denoted by T_pM, is the affine subspace formed by the collection of all tangent vectors to M at p.

For the purposes of this survey we will restrict ourselves to the discussion of manifolds whose tangent space at each point is equipped with an inner product. Such manifolds are called Riemannian manifolds and allow us to define various notions of length, angles, curvature, etc. on the manifold.

Since we will largely be dealing with samples from a manifold, we need to define

Definition 4. A sequence x_1, ..., x_n ⊂ M ⊂ R^D is called independent and identically distributed (i.i.d.) when each x_i is picked independently from a fixed distribution D over M.

With this mathematical machinery in hand, we can now demonstrate that manifolds incorporate a wide array of important examples – we present two such examples that serve as a motivation to study these objects.

1.2 Some examples of manifolds

Movement of a robotic arm: Consider the problem of modelling the movement of a robotic arm with two joints (see figure 2). For simplicity let's restrict the movement to the 2D-plane. Since there are two degrees of freedom, intuitively one should suspect that the movement should trace out a 2-manifold. We now confirm this in detail.

Figure 2: Movement of a robot's arm traces out a 2-manifold in R^4.

Let's denote the fixed shoulder joint as the origin, the position of the elbow joint as (x_1, y_1), and the position of the wrist as (x_2, y_2). To see that the movement of the robotic arm traces out a 2-manifold, consider the map f : R^4 → R^2 defined as

    (x_1, y_1, x_2, y_2) ↦ (x_1² + y_1², (x_2 − x_1)² + (y_2 − y_1)²).

Clearly M ⊂ R^4, such that M = f^{-1}(b², a²), is the desired manifold. We can verify that locally M is diffeomorphic to R^2 by looking at its derivative map

    Df = 2 [ x_1        y_1        0          0
             x_1 − x_2  y_1 − y_2  x_2 − x_1  y_2 − y_1 ]

and observing that it has maximal rank for non-degenerate values of a and b.

Set of orthogonal n × n matrices: We present this example to demonstrate that manifolds are not only good for representing physical processes with small degrees of freedom but also help us better understand some of the abstract objects which we regularly encounter. Consider the problem of understanding the geometry of the set of orthonormal matrices in the space of real n × n matrices. Note that the set of n × n orthonormal matrices is also called the orthogonal group, and is denoted by O(n). We claim that this set forms an n(n − 1)/2-manifold in R^{n²}.

To see this, consider the map f : R^{n²} → R^{n(n+1)/2} defined by A ↦ A^T A. Now M ⊂ R^{n²} such that M = f^{-1}(I_{n×n}) is exactly O(n). To see that M is in fact a manifold, observe that the derivative map Df_A · B = B^T A + A^T B is regular.

¹ Recall that a function is smooth if all its partial derivatives ∂^n f / ∂x_{i_1} ... ∂x_{i_n} exist and are continuous.
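Returning to the robotic-arm example, the claim about the derivative map can be checked numerically. Below is a minimal sketch in NumPy (the particular arm configuration is an arbitrary illustration, not taken from the text): it evaluates f and Df at a non-degenerate configuration and checks that the Jacobian has rank 2.

```python
import numpy as np

def f(x1, y1, x2, y2):
    # Constraint map from the text: it records the two squared link lengths.
    return np.array([x1**2 + y1**2,
                     (x2 - x1)**2 + (y2 - y1)**2])

def Df(x1, y1, x2, y2):
    # Derivative map of f, as written in the text.
    return 2 * np.array([[x1,      y1,      0.0,     0.0],
                         [x1 - x2, y1 - y2, x2 - x1, y2 - y1]])

# A generic (non-degenerate) configuration: elbow at (1, 0), wrist at (1, 1).
p = (1.0, 0.0, 1.0, 1.0)
print(f(*p))                          # the values (b^2, a^2) at this configuration
print(np.linalg.matrix_rank(Df(*p)))  # 2, so f^{-1}(b^2, a^2) is locally a 2-manifold
```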

Observe that the examples above required us to know the mapping f a priori. However, in the context of machine learning, the task is typically to estimate properties of M without having access to f.

1.3 Outline

The paper is organized as follows. We will discuss some linear and non-linear dimensionality reduction methods on manifolds, with a special focus on random projections, in section 2. We will then study Laplacian Eigenmaps as a process to simplify manifold structure in section 3, followed by nonparametric density estimation techniques on manifolds in section 4. We will finally conclude by discussing the significance of the results and some directions for future work in section 5.

2 Random projections for linear dimension reduction

Dimension reduction is an important preprocessing step in data analysis that has been studied extensively. Here we provide the motivation for why dimension reduction of data is desirable. We briefly discuss different techniques that have been employed for dimension reduction on data coming from an underlying manifold and examine a recently analyzed technique of random projections.

2.1 Dimensionality reduction

We know that learning algorithms scale poorly with the dimension of the data. This makes dimension reduction a popular preprocessing step – first map the data into a lower dimensional space while preserving the relevant information, and then run the regular learning algorithms in the smaller projected space. One reasonable criterion to measure the quality of our low dimensional mapping is to test how well the mapping preserves pairwise distances. The basic intuition is that the distances between points in space relate to the dissimilarity between the corresponding observations. Thus, it is undesirable that two points that are far apart in the original space get mapped close to each other by performing a dimension reduction. Similarly, we would not want points that were close originally to get mapped far apart.

As one might expect, finding a mapping that preserves all distances of an arbitrary dataset can be a difficult task. Luckily in our case, the saving grace comes from observing that the data has a manifold structure. We are only required to preserve distances between points that lie on the manifold and not the whole ambient space.

2.1.1 Dimension reduction of manifold data

In the past decade, numerous methods for manifold dimension reduction have been proposed. The classic techniques such as Locally Linear Embeddings (LLE) and Isomaps, and newer ones such as Laplacian Eigenmaps and Hessian Eigenmaps, all share a common intuition – all these methods try to capture the local manifold geometry by constructing the adjacency graph on the sampled data. They all benefit from the observation that inference done on this neighborhood graph corresponds approximately to inference on the underlying manifold. For a comprehensive survey we refer the readers to [8].

Note that these methods are examples of non-linear dimensionality reduction techniques on manifolds. However, we will present a linear dimension reduction technique that works surprisingly well on manifolds. The goal is to find a linear map Φ : R^D → R^d, preferably with d ≪ D, which when applied to the data preserves all interpoint distances. More formally, we want to give guarantees of the form: for all x, y ∈ M, ‖x − y‖ ≈ ‖Φx − Φy‖.

2.1.2 Issues with principal component analysis

Arguably the most popular linear dimension reduction technique is Principal Component Analysis (PCA). The main idea is to find an affine subspace of a specified dimension that captures the maximum amount of variance in the data. It turns out that this optimization problem can be solved efficiently in closed form, and the desired optimal subspace is given by the span of the top d eigenvectors (corresponding to the top eigenvalues) of the covariance matrix of the data [17].

Unfortunately PCA, like all deterministic linear methods, is not suited for asserting global distance preservation guarantees on all pairwise points. One can easily construct examples where distances among far away points in the original space get collapsed in the projected space (see figure 3). Instead we will look at projecting the data onto a random subspace.
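Before moving on to random projections, here is a minimal sketch of the PCA projection just described (the toy data and the target dimension are arbitrary illustrations): center the data, take the top d eigenvectors of the covariance matrix, and project onto their span.

```python
import numpy as np

def pca_project(X, d):
    """Project the rows of X onto the span of the top-d eigenvectors
    of the sample covariance (the subspace capturing maximum variance)."""
    Xc = X - X.mean(axis=0)                          # center the data
    cov = np.cov(Xc, rowvar=False)                   # D x D covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:d]]  # top-d eigenvectors as columns
    return Xc @ top                                  # m x d projected coordinates

X = np.random.randn(200, 10)    # 200 toy points in D = 10 dimensions
Y = pca_project(X, d=2)
print(Y.shape)                  # (200, 2)
```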


Figure 3: PCA projection can sometimes collapse distances between faraway points, making it an undesirable choice for distance preserving dimension reduction.

Figure 4: Normals of a manifold. Note that the normals (dotted lines) of a particular length incident at each point of the manifold (solid curve) will intersect if the manifold is too curvy.

2.1.3 Random projections of manifolds

As the name suggests, random projection is concerned with projecting the data onto a random subspace of a fixed dimension d. We will be able to conclude that if the data lie on a manifold M, then with high probability, projecting the data down to a sufficiently large random subspace approximately preserves all interpoint distances. At first glance this result appears very counter-intuitive – after all, how can projecting the data onto a random subspace, which doesn't even take the samples into account, have the capability to preserve distances?

The starting point of such a counter-intuitive result is the much celebrated theorem of Johnson and Lindenstrauss, which states that any point-set of size m in R^D can be embedded in R^{O(log m)} with small distortion by using a linear map. Moreover, this linear map is essentially a random subspace of the desired embedding dimension.

We can leverage this result and get the basic proof outline for preserving distances on a manifold [2]:

1. We will show that not just a pointset, but an entire subspace can be preserved by a random projection.

2. We will show that distances between points within a small region of the manifold can be approximated by a subspace, and thus are well preserved.

3. By taking an ε-net of suitable resolution over the manifold, distances between points that are far away are also well preserved.

We can now provide the results in detail². We will start by defining one extra piece of notation which will help our discussion.

Definition 5 ([26]). The condition number of a manifold M is 1/τ, if the normals of length r < τ at any two distinct points p, q ∈ M don't intersect.

Look at figure 4 to see the normals of a manifold. Notice that long non-intersecting normals are possible only if the manifold is relatively flat. Hence the condition number of M gives us a handle on how curvy M can be.

Lemma 6 (Johnson-Lindenstrauss [19], [12]). For any 0 < ε < 1 and any integer m, let d be a positive integer such that d = Ω(ln m / ε²). Then for any set V of m points in R^D, there is a linear map Φ : R^D → R^d such that for all x, y ∈ V,

    (1 − ε) ≤ ‖Φx − Φy‖² / ‖x − y‖² ≤ (1 + ε).

A projection onto a random subspace (of d dimensions) will satisfy this with high probability.

Proof. Let Φ(x) = √(D/d) · R^T x, where R is a D × d Gaussian random matrix with entries γ_ij ~ N(0, 1) i.i.d. Note that R^T x (for a fixed x) is distributed as a Gaussian random vector, and from concentration properties of Gaussians it follows that

1. Pr[ ‖R^T x‖² ≥ (1 + ε) (d/D) ‖x‖² ] ≤ e^{−Ω(dε²)},

2. Pr[ ‖R^T x‖² ≤ (1 − ε) (d/D) ‖x‖² ] ≤ e^{−Ω(dε²)}.

This immediately implies that (with high probability)

    ‖Φx − Φy‖² = ‖Φ(x − y)‖² = (D/d) ‖R^T(x − y)‖² ≤ (1 + ε) (D/d)(d/D) ‖x − y‖² = (1 + ε) ‖x − y‖².

Similarly we can also assert that ‖Φx − Φy‖² ≥ (1 − ε) ‖x − y‖². Now, requiring this property to hold for all pairwise distances between the m points, a simple application of the union bound gives the desired result.

² For clarity of the exposition we only provide a proof sketch here and refer the readers to the original papers for detailed proof arguments.
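The following sketch illustrates Lemma 6 empirically (the manifold, a circle sitting in R^D, and all sizes are illustrative choices; the normalization of the random map is one common convention and may differ from the proof above by a fixed scale factor): it projects sampled points with a Gaussian random matrix and compares pairwise distances before and after.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
D, d, m = 1000, 50, 30

# m points on a 1-manifold (a circle) sitting in the first two coordinates of R^D.
angles = rng.uniform(0, 2 * np.pi, m)
X = np.zeros((m, D))
X[:, 0], X[:, 1] = np.cos(angles), np.sin(angles)

# Random linear map Phi(x) = R^T x / sqrt(d) with i.i.d. Gaussian entries,
# normalized so that E||Phi x||^2 = ||x||^2.
R = rng.standard_normal((D, d))
Y = X @ R / np.sqrt(d)

ratios = [np.linalg.norm(Y[i] - Y[j]) / np.linalg.norm(X[i] - X[j])
          for i, j in combinations(range(m), 2)]
print(min(ratios), max(ratios))   # distance ratios concentrate around 1 (tighter as d grows)
```

Note that the projection never looks at the data: the same random matrix works, with high probability, for any point set of this size.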

Lemma 7 (subspace preservation [1]). Let L be an n-dimensional affine subspace of R^D. Pick ε, δ > 0 and d ≥ Ω( (n/ε²) log(1/ε) + (1/ε²) log(1/δ) ). If Φ is a random subspace of d dimensions, then with probability > 1 − δ, we have that for all x ∈ L,

    (1 − ε) √(d/D) ‖x‖ ≤ ‖Φx‖ ≤ (1 + ε) √(d/D) ‖x‖.

Proof. By positive homogeneity of norms, it suffices to prove the result for vectors of length 1. Let V be an ε/4-cover of a ball B of radius 1. Note that B can be covered by an ε/4-net of size ≤ (12/ε)^n. Applying lemma 6 from above with distortion ε/2 immediately yields, for all v ∈ V (with high probability),

    (1 − ε/2) ‖v‖² ≤ ‖Φv‖² ≤ (1 + ε/2) ‖v‖².

Let A be the smallest number such that ‖Φx‖ ≤ (1 + A) ‖x‖ for all x ∈ L with ‖x‖ ≤ 1 (suppressing the common scale factor of √(d/D)). Note that

    ‖Φx‖ ≤ ‖Φv‖ + ‖Φ(x − v)‖ ≤ 1 + ε/2 + (1 + A) ε/4.

Now since A is the smallest such number, we have that A ≤ ε/2 + (1 + A) ε/4, or equivalently A ≤ (3ε/4)/(1 − ε/4) ≤ ε. Similarly we can obtain a lower bound, yielding the desired result.

Lemma 8 (effects on close-by points [2]). Suppose S = M ∩ B, where the ball B has radius r. Pick δ, ε > 0 and d = Ω( (n/ε²) log(1/ε) + (1/ε²) log(1/δ) ). If r ≤ (ετ/4) √(d/D) and Φ is a random projection to d dimensions, then with probability > 1 − δ, for all x, y ∈ S,

    (1 − ε) √(d/D) ≤ ‖Φx − Φy‖ / ‖x − y‖ ≤ (1 + ε) √(d/D).

Proof. Since we have chosen S small enough, pick any p ∈ S and consider its tangent space T_p. For any x ∈ S, let x̄ be its projection onto T_p and x⊥ = x − x̄. Note that for any x, y ∈ S, we have that ‖x⊥ − y⊥‖ / ‖x − y‖ ≤ r/τ.

Now by applying the subspace preservation lemma to T_p, we have that (with high probability)

    ‖Φx − Φy‖ ≤ ‖Φx̄ − Φȳ‖ + ‖Φx⊥ − Φy⊥‖
              ≤ ‖x̄ − ȳ‖ √(d/D) (1 + ε/2) + ‖x⊥ − y⊥‖
              ≤ ‖x − y‖ √(d/D) (1 + ε/2) + ‖x − y‖ r/τ
              ≤ ‖x − y‖ √(d/D) (1 + ε).

Similarly we can bound ‖Φx − Φy‖ ≥ ‖x − y‖ √(d/D) (1 − ε), giving us the desired result.

Theorem 9 (manifold preservation [2]). Suppose M is a compact n-dimensional manifold in R^D with condition number 1/τ. Suppose that for all ε > 0, M has an ε-cover of size ≤ N_0 (1/ε)^n. Pick any ε, δ > 0 and d = Ω( (n/ε²) log(D/(ετ)) + (1/ε²) log(N_0/δ) ). Let Φ be a random subspace of d dimensions. Then with probability > 1 − δ, for all x, y ∈ M,

    (1 − ε) √(d/D) ≤ ‖Φx − Φy‖ / ‖x − y‖ ≤ (1 + ε) √(d/D).

Proof. For ε_0 = (ε²τ/128) √(d/D), let µ_1, ..., µ_N be an ε_0-cover of M. Note that N < N_0 (1/ε_0)^n.

Let B_i be a ball of radius (ετ/4) √(d/D) centered at µ_i; we can apply lemma 8 to B_1, ..., B_N to have distances within each B_i preserved up to (1 ± ε).

Pick any x, y ∈ M. If ‖x − y‖ ≤ (ετ/8) √(d/D), then x, y ∈ B_i and thus the projected distances are preserved. If ‖x − y‖ > (ετ/8) √(d/D), let µ_i and µ_j be their closest representatives. Then

    ‖Φx − Φy‖ ≤ ‖Φµ_i − Φµ_j‖ + ‖Φx − Φµ_i‖ + ‖Φy − Φµ_j‖
              ≤ ‖µ_i − µ_j‖ √(d/D) (1 + ε/2) + ε_0 √(d/D) (1 + ε) + ε_0 √(d/D) (1 + ε)
              ≤ (‖x − y‖ + 2ε_0) √(d/D) (1 + ε/2) + 2ε_0 √(d/D) (1 + ε)
              ≤ ‖x − y‖ √(d/D) (1 + ε).

Similarly we can find a lower bound, yielding the final result.

2.2 Discussion

Random projection of manifolds was first considered in [2] and the result was later improved in [10]. Note that the methodology of random projections provides a nonadaptive dimensionality reduction approach for manifold learning, where the projection map is independent of the actual data. The result presented is significant because data-independent projections are rarely seen in the manifold learning literature. It is also worth mentioning that the main result on the minimum number of dimensions needed bears a strong resemblance to results seen in the area of Compressed Sensing for encoding sparse vectors (see [15] for more details), and some of the ideas presented in [2] are borrowed from the Compressed Sensing literature.

Note that manifold learning practitioners are more interested in geodesic distances (distances along the manifold) rather than the standard Euclidian distances considered in the analysis above. The result of theorem 9 is easily extendible to geodesic distances by considering limits of sums of Euclidian distances [2].

Observe that the result presented here is a worst case analysis; it gives us an estimate of the minimum number of dimensions needed to preserve all interpoint distances within a factor of (1 ± ε). It would be interesting to see a bound on the number of dimensions needed to preserve distances in an average sense.

3 Laplacian Eigenmaps for simplifying manifold structure

Laplacian Eigenmaps was recently proposed as a simple and intuitive algorithm for providing a low dimensional representation of data lying on a manifold. Like many manifold learning algorithms, it finds a low dimensional representation by performing computations on the adjacency graph of the sampled data. The basic intuition is that the graph constructed from the samples serves as a discrete approximation to the manifold, and inference based on the graph should correspond to the desired inference on the underlying manifold. What sets Laplacian Eigenmaps apart is that the choice of weights used in constructing the graph and the subsequent spectral analysis is formally justified as a process which "simplifies" the manifold structure.

In contrast to random projections, which explicitly attempt to preserve all pairwise distances, the optimization criterion of Laplacian Eigenmaps only incorporates the condition to preserve local distances. It turns out that the solution to this optimization criterion has a remarkable property of smoothing the manifold structure. More precisely, as we will observe in the following sections, this mapping has the property of reducing the curvature of high-curvature regions, transforming the manifold into a smoother, more manageable object.

3.1 Desirability of simple structure

As mentioned before, Laplacian Eigenmaps provide a non-linear mapping that, in essence, smooths out high curvature regions of the manifold. The power and success of such a mapping comes from noting that such regions can be thought of as eccentricities in the collected data. Thus smoothing out these regions should provide good generalization ability on manifolds.

Consider, for instance, a typical machine learning task of discriminating two classes on a manifold. Due to the inherent curvy manifold structure, it is difficult to find a simple classifier that can separate the classes. However, by first mapping the data via Laplacian Eigenmaps, one can find a simple classifier that can separate the classes well. See figure 5.

Figure 5: Laplacian Eigenmaps (the map x ↦ (f_1(x), ..., f_d(x)) from R^D to R^d) take a manifold of complex structure to a relatively simpler structure. This is beneficial for many learning tasks; the task of discriminating two classes on a manifold becomes easier, for instance.

3.2 Geometric derivation of Laplacian Eigenmaps

Suppose we want to map M to a line such that nearby points get mapped close together. Let f : M → R be such a map. Then for any x ∈ M and y in the neighborhood of x, we would like |f(x) − f(y)| to be bounded in terms of the original geodesic distance d_M(x, y). Let l = d_M(x, y); then using the Taylor expansion around x,

    |f(x) − f(y)| ≤ l ‖∇f(x)‖ + o(l).

Thus ‖∇f(x)‖ provides us with an estimate of how far apart f maps nearby points. Hence in order to preserve distances, one should look for a map f that minimizes this quantity over all x ∈ M. One sensible minimizing criterion (in the "sum-squared" sense) is argmin_{‖f‖=1} ∫_M ‖∇f(x)‖². Note that

    ∫_M ‖∇f(x)‖² = ∫_M ⟨∇f, ∇f⟩ = ∫_M ⟨f, ∆f⟩,

where ∆ is defined to be the Laplace-Beltrami operator on M. Thus minimizing this objective function is the same as minimizing ∫_M f ∆f. Notice that this quantity has the same functional form as the Rayleigh quotient (with ‖f‖₂ = 1). Hence the problem reduces to finding the eigenfunctions corresponding to the lowest eigenvalues of ∆ [3].

This argument can be generalized for mappings to R^d. For a compact M, the optimal d-dimensional embedding is given by the map x ↦ (f_1(x), ..., f_d(x)), where f_i is the eigenfunction corresponding to the i-th lowest (non-zero) eigenvalue of ∆ [3].

3.2.1 Laplace operator as a smoothness functional

Note that ∆ also has the desirable property of being a smoothness functional [25]. Smoothness of a function f over, say, the unit circle S¹ can be defined as S(f) := ∫_{S¹} |f′(x)|² dx. Functions for which S(f) is close to zero are considered smooth. Note that constant functions over S¹ are clearly smooth. In general, for any f : M → R,

    S(f) := ∫_M ‖∇f(x)‖² dx = ∫_M f ∆f dx = ⟨∆f, f⟩_{L²(M)}.

Observe that the smoothness of a unit norm eigenfunction e_i of ∆ is controlled by the corresponding eigenvalue λ_i, since S(e_i) = ⟨∆e_i, e_i⟩ = λ_i. Therefore, approximating a function f in terms of its first d eigenfunctions of ∆ is a way of controlling its smoothness.

So far we have established that the spectrum of ∆ of a manifold M provides us with a desirable mapping of M. However, since we just have samples from M, we need a way to approximate ∆.

3.2.2 The graph Laplacian

The graph Laplacian is considered as a discrete approximation to the Laplace-Beltrami operator introduced in the previous section.

Let x_1, ..., x_m be an independent sample from the uniform distribution over M and let t be a free parameter (optimized later). We can then construct a completely connected weighted undirected graph with the samples as vertices and edge weights w_ij = e^{−‖x_i − x_j‖²/4t}. The corresponding graph Laplacian operator is given by the matrix [6]:

    (L^t_m)_ij = −w_ij   if i ≠ j,    (L^t_m)_ii = Σ_k w_ik.

We may think of it as an operator on functions evaluated at points from the manifold. Let p ∈ M and f : M → R; then

    L^t_m f(p) = (1/m) f(p) Σ_j e^{−‖p − x_j‖²/4t} − (1/m) Σ_j f(x_j) e^{−‖p − x_j‖²/4t}.

We can now relate L^t_m to ∆_M for any function f on M.³

³ For conciseness we will denote the ∆ operator on M as ∆_M.
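A minimal sketch of this construction (the sample, the bandwidth t, and the test function below are placeholders): it builds the heat-kernel weight matrix and the graph Laplacian, and applies the resulting empirical operator to a smooth function sampled at the data points.

```python
import numpy as np

def graph_laplacian(X, t):
    """Graph Laplacian on the fully connected graph with heat-kernel weights
    w_ij = exp(-||x_i - x_j||^2 / 4t), i.e. the matrix with -w_ij off the
    diagonal and the row sums of W on the diagonal."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / (4 * t))
    np.fill_diagonal(W, 0.0)
    return np.diag(W.sum(axis=1)) - W

# Toy sample from a circle (a 1-manifold in R^2) and a smooth function on it.
rng = np.random.default_rng(1)
theta = np.sort(rng.uniform(0, 2 * np.pi, 200))
X = np.c_[np.cos(theta), np.sin(theta)]
f = np.sin(theta)

L = graph_laplacian(X, t=0.05)
print((L @ f)[:3] / len(X))   # the empirical operator (1/m) L^t_m f at the first few sample points
```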

3.2.3 Connecting it all together

Let L^t f(p) be the continuous approximation of the graph Laplacian operator, defined by

    L^t f(p) := f(p) ∫_M e^{−‖p − y‖²/4t} dν_y − ∫_M f(y) e^{−‖p − y‖²/4t} dν_y.

We can show that L^t_m is a functional approximation to ∆_M. The proof outline goes as follows ([5], [6]). For a fixed p ∈ M and a smooth map f (note that all statements are pointwise in p and f):

1. We will first deduce that L^t_m converges to L^t. This follows almost immediately from the law of large numbers.

2. We will relate L^t to ∆_M by

   (a) restricting the L^t integral to a small ball in M; this will help us express L^t in a single local coordinate chart;

   (b) applying a change of coordinates, so that we can express L^t as an integral in R^n;

   (c) finally relating the new integral in R^n to ∆_M.

Lemma 10 (continuous approximation of L^t_m [6]). Let L^t_m and L^t be defined as above. Then for any ε > 0,

    Pr[ |L^t_m f(p) − L^t f(p)| > ε ] < 2 e^{−Ω(ε²m)}.

Proof. Note that L^t_m is the empirical average of m independent samples drawn uniformly from M and L^t is its expectation. Since M is compact, we can use Hoeffding's inequality to bound the deviation, giving the result.

Noting that M is compact and any f can be approximated arbitrarily well by a sequence of functions {f_i}, we can get uniform convergence over the entire M for any f (see [6] for details).

Lemma 11 (restricting L^t to local coordinates [6]). Let B ⊂ M be a sufficiently small open ball containing p such that B can be expressed in a single chart. For any a > 0, as t → 0,

    | L^t f(p) − ∫_B e^{−‖p − y‖²/4t} (f(p) − f(y)) dy | = o(t^a).

Proof. For any point x ∈ M − B, let d = inf_{x ∈ M−B} ‖p − x‖². Note that d > 0 (since B is open). Hence the total contribution of such points to the integral is bounded by C e^{−d²/4t} for some constant C. Note that as t tends to zero, this term decreases exponentially, giving the desired result.

Since we have restricted the integral to a small enough ball, we can now use the local coordinate system around an open neighborhood of p. We can apply the canonical change of coordinates by using the exponential map exp_p : T_pM (≅ R^n) → M, which carries radial lines from 0 in T_pM into geodesics starting at p in M. Note that exp_p(0) = p.

To reduce the computations to R^n, any y ∈ M (in a neighborhood of p) can be written as exp_p(x) for some x ∈ T_pM. Let f̄(x) := f(exp_p(x)). Then a key fact about the Laplace-Beltrami operator is that ∆_M f(p) = ∆_{R^n} f̄(0) = −Σ_i ∂²f̄/∂x_i²(0). Hence we can analyze L^t in Euclidian space via the (inverse) exponential map [6]:

    L^t f(p) = (1/Vol(M)) ∫_B̄ e^{−‖x‖²/4t} (f̄(0) − f̄(x)) (1 + O(‖x‖²)) dx.

Using the Taylor approximation about 0, we have that

    f̄(x) − f̄(0) = x∇f̄ + (1/2) x^T H x + O(‖x‖³).

Hence for functions with bounded third order derivatives and letting t → 0, we have that (see [6] for details)

    L^t f(p) = −(1/Vol(M)) ∫_B̄ ( x∇f̄ + (1/2) x^T H x ) e^{−‖x‖²/4t} dx
             = −tr(H)/Vol(M) = −(1/Vol(M)) Σ_i ∂²f̄(0)/∂x_i².

Combining the above lemmas immediately yields the main result.

Theorem 12 (relating L^t to ∆_M [6]). Let L^t and ∆_M be as defined above. Then for any p ∈ M and any smooth function f with bounded third order derivative, if t → 0 sufficiently fast, then

    L^t f(p) = (1/Vol(M)) ∆_M f(p).

3.3 A practical algorithm

As seen in the previous sections, Laplacian Eigenmaps have a sound mathematical basis for simplifying data representation. [3] gives a practical algorithm for embedding the data in lower dimensions using this technique. Let X = x_1, ..., x_m ∈ R^D be an independent sample drawn uniformly at random from M, d be the embedding dimension and t be the bandwidth parameter.

Algorithm 3.1: Laplacian Eigenmaps(X, d, t)

1. Let W_ij = e^{−‖x_i − x_j‖²/t} if x_i and x_j are close, and 0 otherwise.

2. Let L = A − W, where A is the diagonal matrix with A_ii = Σ_j W_ji.

3. Compute eigenvectors and eigenvalues for the generalized eigenvector problem Lf = λAf. Let the solutions be the column vectors of F.

4. Return the columns of F corresponding to the d lowest non-zero eigenvalues.

This algorithm has been applied successfully to real-world datasets in [25], giving promising results.
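A sketch of Algorithm 3.1 in NumPy (the k-nearest-neighbor rule used to decide which points are "close", and the parameter values in the demo call, are illustrative choices not fixed by the algorithm):

```python
import numpy as np

def laplacian_eigenmaps(X, d, t, k=10):
    """Embed the rows of X into R^d following Algorithm 3.1.
    'Close' points are taken to be k-nearest neighbors (one common choice)."""
    m = X.shape[0]
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    # Step 1: heat-kernel weights, kept only between neighboring points.
    W = np.exp(-sq / t)
    order = np.argsort(sq, axis=1)
    mask = np.zeros((m, m), dtype=bool)
    mask[np.arange(m)[:, None], order[:, 1:k + 1]] = True
    mask |= mask.T                       # symmetrize the neighborhood graph
    W = W * mask
    # Step 2: graph Laplacian L = A - W with the diagonal degree matrix A.
    A = np.diag(W.sum(axis=1))
    L = A - W
    # Step 3: generalized eigenproblem L f = lambda A f, solved here via A^{-1} L.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(A, L))
    idx = np.argsort(eigvals.real)
    # Step 4: drop the trivial near-zero eigenvalue, keep the next d eigenvectors.
    return eigvecs[:, idx[1:d + 1]].real

Y = laplacian_eigenmaps(np.random.default_rng(2).normal(size=(300, 5)), d=2, t=1.0)
print(Y.shape)   # (300, 2): the embedded coordinates (f_1(x_i), ..., f_d(x_i))
```

For large samples the dense m × m matrices used here would normally be replaced by sparse neighborhood graphs, but the steps are the same.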

4.1 Density estimation

3.4 Discussion

Laplacian Eigenmaps provide a sound low-dimensional representation of a manifold, which has the benefit of simplifying its structure. The corresponding algorithm presented here is simple and intuitive – it requires a few matrix computations and one eigenvalue problem, making it quite appealing. One major limitation of the result presented is that points are sampled i.i.d. from the uniform measure over the manifold. In general, one would like to relax this condition, and this problem is still open.

[25] exploits the fact that the embedding simplifies the structure of the manifold for semi-supervised learning on data generated from manifolds. They also show an improvement in classification accuracy for certain real-world datasets.

As discussed, the Laplace-Beltrami operator ∆ provides a good measure of smoothness; [7] and [25] have used this fact to develop a framework for regularization of functions on a manifold.

Just like the spectrum of the Laplace-Beltrami operator yields a smoother representation of a manifold, it would be interesting to study what conditions are optimized if we explore a different basis for functions on a manifold. For instance, the benefits of approximating functions on a manifold using the Lagrange basis (for polynomials) or the Fourier basis (for square integrable functions) are largely unexplored.

4 Kernel methods for manifold density estimation

Many manifold learning methods rely heavily on having independent samples from the uniform distribution on M. However, in general, we can't expect such restrictive conditions on the underlying density. Even though the analysis of many procedures in the non-uniform setting largely remains an open problem, we do, however, have a handle on estimating the underlying density from independent samples via the method of kernels.

Since we would like to make the fewest possible assumptions on the underlying density, we will focus on nonparametric density estimation techniques in this section. We refer the readers to [14], [13], and [31] for an excellent treatment of the subject.

4.1 Density estimation

Density estimation is an important problem in statistics and machine learning. Here the goal is to estimate the underlying density from an i.i.d. sample. Let f be the true density and f̂_m be our estimate (from m samples). Note that we will make few assumptions about the structural form of f.

Given f and f̂_m, we can evaluate the quality of our estimate by looking at the associated deviation (called the risk) of f̂_m from f. One popular way to analyze risk is by looking at the expected squared difference between the true density and our estimate. Thus, risk can be defined as

    R = E ∫ (f̂_m(x) − f(x))² dx.

Of course, for any reasonable estimator, as the sample size gets larger, one would expect the risk to go down. Here we are interested in studying how fast the risk goes to zero for different estimators.

One intuitive estimator which works well in low dimensions is the histogram estimate. The idea is that we can grid the space and count the relative frequency of points falling into each bin. Though quite intuitive, histograms have their share of disadvantages which make them quite unappealing [31]. Primarily, due to sharp jumps between adjacent bins, histogram estimates are not smooth. Moreover, the estimator f̂_m is heavily dependent on the placement of the grid; by slightly moving the grid, one can get a wildly different estimator.

This motivates the study of kernel density estimators, which largely alleviate these problems by giving a smooth approximation to the underlying density, and which don't suffer from the choice of grid placement.

4.1.1 Kernel density estimation

As mentioned in the previous section, kernel density estimation provides an attractive alternative to naive histogram estimates that works well in practice. The basic idea behind the kernel estimate is as follows. To remove the dependence on the grid edges, kernel estimators center a "kernel function" at each sampled data point. By placing a smooth kernel function, the resulting estimator will be a smooth density estimate.

More formally, let K be a kernel function, that is, a smooth function with the following properties:

1. Non-negative: K(x) ≥ 0.

2. Integrates to one: ∫ K(x) dx = 1.

3. Zero mean: ∫ x K(x) dx = 0.

4. Finite variance: ∫ x² K(x) dx < ∞.

5. Maximum at zero: sup_x K(x) = K(0).

Then a kernel density estimate on a sample x_1, ..., x_m sampled independently from a fixed underlying distribution on R^D is given by

    f̂_{m,K}(x) = (1/(m h^D)) Σ_{i=1}^m K( ‖x − x_i‖ / h ),

where h is the bandwidth parameter, dependent on the number of samples. It is easy to check that f̂_{m,K} is a well defined density function.

It is known that the quality of the kernel estimate is particularly sensitive to the value of the bandwidth parameter h and less so to the form of K [31]. Hence the choice of bandwidth is important for a good approximation. See figure 6 to see how changes in the bandwidth result in varying approximations to the underlying density. Small values of h lead to spiky estimates (without much smoothing) while larger h values lead to oversmoothing.

For the optimal choice of bandwidth, the risk decreases as O(m^{−4/(4+D)}) (see [31] for details). Note that due to the exponential dependence on D, the quality of the estimate decreases sharply with increase in the dimension; we require an exponential number of points to get the same level of accuracy in high dimensions. This is generally referred to as the curse of dimensionality.

In the context of manifolds, one would hope that since the manifold occupies a small fraction of the entire ambient space, better convergence rates should be possible. We will study this next.

Figure 6: Kernel density estimate of one dimensional data generated from a mixture of two Gaussians. For a fixed independent sample of size 20, and using the Gaussian kernel function (dotted lines), we see that different choices of bandwidth yield significantly different kernel estimators (solid line). The top figure (undersmoothed) shows the effect of small bandwidths, the middle figure (oversmoothed) shows the effect of a large bandwidth, and the bottom figure shows the choice of optimal bandwidth. Note that the optimal bandwidth recovers that the underlying density is in fact a mixture of two Gaussians.
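For concreteness, here is a sketch of the estimator above on one-dimensional data drawn from a mixture of two Gaussians, in the spirit of figure 6 (the mixture, the sample size, the grid and the bandwidth values are illustrative):

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kde(x_grid, samples, h):
    """Kernel density estimate f_hat(x) = (1/(m h)) * sum_i K((x - x_i)/h) for D = 1."""
    m = len(samples)
    return gaussian_kernel((x_grid[:, None] - samples[None, :]) / h).sum(axis=1) / (m * h)

rng = np.random.default_rng(3)
# 20 samples from an equal mixture of N(-3, 1) and N(3, 1).
samples = np.where(rng.random(20) < 0.5, rng.normal(-3, 1, 20), rng.normal(3, 1, 20))
x = np.linspace(-10, 10, 201)

for h in (0.1, 5.0, 1.0):        # undersmoothed, oversmoothed, roughly right
    est = kde(x, samples, h)
    print(f"h = {h}: integral ~ {(est * (x[1] - x[0])).sum():.3f}")   # each estimate integrates to ~1
```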

4.2 Manifold density estimation using kernels

For a curvy object such as a manifold, we need to define a modified version of the kernel density estimator [27]:

    f̂_{m,K}(p) = (1/m) Σ_{i=1}^m (1/(h^n θ_{x_i}(p))) K( d_M(p, x_i) / h ),

where d_M(p, q) is the geodesic distance between p, q ∈ M and θ_p(q) is the volume density function on M, defined as the ratio of the canonical measure of the Riemannian metric on T_pM to the Lebesgue measure of the Euclidian metric on T_pM, evaluated at exp_p^{-1}(q).

Note that this estimator is a well defined probability density. We will be able to relate it to the underlying true density by [27]:

1. Separately bounding the squared bias and the variance of the estimator. To do this, we will apply a change of coordinates to express the integrals in R^n.

2. Decomposing the expected risk into its bias and variance components. We can then apply the calculated bounds, yielding the optimal convergence rates.

Lemma 13 (bounding the squared bias [27]). Let f be a probability density on M and f̂_{m,K} be its estimator. If f is square integrable with bounded second derivative, then there exists a constant C_1 such that

    ∫_M ( E f̂_{m,K}(p) − f(p) )² dp ≤ C_1 h⁴.

Proof. Consider the pointwise bias,

    b(p) = E f̂_{m,K}(p) − f(p)
         = ∫_{q∈M} (1/(θ_q(p) h^n)) K( d_M(p, q)/h ) f(q) dq − f(p)
         = (1/h^n) ∫_{x∈T_pM} K( ‖x‖/h ) O(‖x‖²) dx,

where the last step is by applying a change of coordinates via the canonical exponential map exp_p : T_pM → M, and applying the Taylor approximation around 0, f(exp_p(x)) =: f̄(x) = f̄(0) + x∇f̄(0) + O(‖x‖²). Hence, by applying the change of variables y = x/h,

    ∫_M b²(p) dp ≤ C h⁴ ∫_M ( ∫_M ‖y‖² K(‖y‖) dy )² dp ≤ C′ h⁴ Vol(M).

Lemma 14 (bounding the variance [27]). Let f and f̂_{m,K} be defined as above. Then there exists a constant C_2 such that

    ∫_M Var( f̂_{m,K}(p) ) dp ≤ C_2 / (m h^n).

Proof. Since Var(X) ≤ E X², we have that for any p ∈ M,

    Var( f̂_{m,K}(p) ) ≤ (1/(m h^{2n})) E [ (1/θ_{x_1}²(p)) K²( d_M(p, x_1)/h ) ]
                      = (1/(m h^{2n})) ∫_M (f(q)/θ_q²(p)) K²( d_M(p, q)/h ) dq.

Integrating both sides over the entire M, we have that ∫_M Var( f̂_{m,K}(p) ) dp is

    ≤ (1/(m h^{2n})) ∫_{p∈M} ∫_{q∈M} (f(q)/θ_q²(p)) K²( d_M(p, q)/h ) dq dp
    ≤ (1/(m h^{2n})) K²(0) ∫_{q∈M} f(q) ∫_{p∈M} (1/θ_q²(p)) dp dq
    ≤ (C h^n Vol(S^n)/(m h^{2n})) K²(0) ∫_{q∈M} f(q) dq,

where the last inequality is by letting C = sup_{p,q} θ_q^{-1}(p) and noting ∫ 1/θ_q(p) dp = h^n Vol(S^n). The desired result follows.

Theorem 15 (kernel density estimation on manifolds [27]). Let M be a compact n-dimensional Riemannian manifold in R^D, and let f, f̂_{m,K} be defined as above. Then there exists a constant C such that

    E ‖f̂_{m,K} − f‖² ≤ C ( 1/(m h^n) + h⁴ ).

Proof. By doing the standard bias-variance decomposition, we have that

    E ‖f̂_{m,K} − f‖² = ∫_M ( E f̂_{m,K}(p) − f(p) )² dp + ∫_M Var( f̂_{m,K}(p) ) dp ≤ C_1 h⁴ + C_2 / (m h^n),

where the last inequality is by applying the previous two lemmas, immediately giving the desired result.

Note that, as a consequence, setting the bandwidth h ≈ m^{−1/(n+4)} results in the optimal rate of convergence of O(m^{−4/(n+4)}), which is independent of the ambient space dimension D.
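One way to see where the stated bandwidth and rate come from is to minimize the bound of Theorem 15 over h; sketching the standard calculation:

```latex
\[
  \frac{d}{dh}\Big(\frac{C_2}{m h^{n}} + C_1 h^{4}\Big)
  = -\,\frac{n C_2}{m h^{n+1}} + 4 C_1 h^{3} = 0
  \quad\Longrightarrow\quad
  h^{*} = \Big(\frac{n C_2}{4 C_1 m}\Big)^{1/(n+4)} \asymp m^{-1/(n+4)},
\]
\[
  \text{so that both terms scale as } m^{-4/(n+4)} \text{ and }
  \mathbb{E}\,\|\hat f_{m,K} - f\|^{2} = O\!\big(m^{-4/(n+4)}\big),
  \text{ independently of the ambient dimension } D.
\]
```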

4.3 Discussion

Density estimation is a central topic in statistics and machine learning. In the case of nonparametric density estimation, convergence rates to the true density are known to be exponential in the dimension. In the case of manifolds, since the data is locally diffeomorphic to a smaller subspace, one may expect a weaker dependence on the ambient space. The result presented here is noteworthy as the number of samples needed to gain the desired accuracy is independent of the ambient dimension. Note that the exponential dependence on the intrinsic manifold dimension, although still unacceptable, is generally more manageable in a typical machine learning scenario.

[14] argues that one should look at the ℓ1 risk as it is invariant under monotone transformations. It would be interesting to see if these rates can be sharpened in ℓ1 when the data is known to lie on a manifold. ℓ2 and ℓ∞ risks have been considered in [18] using Fourier analysis, though their estimator is not a proper probability density.

5 Conclusion and future work

In this survey we examined how some of the known mathematical techniques can be applied in a new context, when data is assumed to be sampled from a manifold. We observed that the manifold assumption leads to results that are significantly less dependent on the ambient dimension.

We looked at random projections as a linear dimensionality reduction procedure on manifolds, and concluded that a projection onto a space of dimension just Ω(n log D) can preserve all pairwise distances on a manifold remarkably well. We then focused on analyzing the spectrum of the Laplace-Beltrami operator on functions on a manifold and concluded that the resulting eigenmap has the surprising property of simplifying the manifold structure, making it into a more manageable object. Lastly, we looked at kernel density estimation to estimate high density regions on a manifold and found that the sample size needed to get the desired accuracy can be made completely independent of the ambient dimension. As we can see, significant progress has been made in the area of manifold learning in the last few years, though much still remains largely unexplored.

In terms of low dimensional mappings, [23], [24] proved that any Riemannian manifold can be isometrically embedded in a (2n + 1)-dimensional Euclidian space. However, finding such an embedding by a discrete algorithm still remains a hard open problem.

Note that all techniques mentioned in this survey and elsewhere in the literature crucially depend on knowledge of the intrinsic dimension of the manifold. However, in a typical machine learning problem this quantity is unknown. Note that a poor estimate of n can render manifold learning methods useless; an underestimate will result in low accuracies and an overestimate will require impractically large sample sizes. Researchers have started looking into estimating the intrinsic dimension using likelihood and bin-packing methods ([21], [20]), though further progress is needed for a more unified approach.

Researchers often find the "manifold assumption" (data lying exactly on a smooth manifold) too restrictive. In an attempt to relax this assumption, [11] recently proposed a new viewpoint of analyzing algorithms in terms of local covariance dimension. This framework effectively incorporates data that is not necessarily coming from an underlying manifold, but locally has low dimensional structure in an average sense. Both practical and theoretical analysis of machine learning problems in this promising framework is open.

Acknowledgments

The author would like to thank Sanjoy Dasgupta, Joe Pasquale and Lawrence Saul for insightful comments and suggestions.

References

[1] R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin. A simple proof of the restricted isometry property for random matrices. Constructive Approximation, 2008.

[2] R. Baraniuk and M. Wakin. Random projections of smooth manifolds. Foundations of Computational Mathematics, 2007.

[3] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.

[4] M. Belkin and P. Niyogi. Semi-supervised learning on Riemannian manifolds. Machine Learning Journal, 56:209–239, 2004.

[5] M. Belkin and P. Niyogi. Towards a theoretical foundation for Laplacian based manifold methods. Conference on Computational Learning Theory, 2005.

[6] M. Belkin and P. Niyogi. Towards a theoretical foundation for Laplacian based manifold methods. Journal of Computer and System Sciences, 2007.

[7] M. Belkin, P. Niyogi, and V. Sindhwani. On manifold regularization. International Conference on Artificial Intelligence and Statistics, 2005.

[8] L. Cayton. Algorithms for manifold learning. UCSD Technical Report CS2008-0923, 2008.

[9] Y. Chikuse. Statistics on special manifolds. Springer, 2003.

[10] K. Clarkson. Tighter bounds for random projections of manifolds. Computational Geometry, 2007.

[11] S. Dasgupta and Y. Freund. Random projection trees and low dimensional manifolds. ACM Symposium on Theory of Computing, 2008.

[12] S. Dasgupta and A. Gupta. An elementary proof of the Johnson-Lindenstrauss lemma. UC Berkeley Tech. Report 99-006, March 1999.

[13] L. Devroye. A course in density estimation. Birkhauser Verlag AG, 1987.

[14] L. Devroye and L. Gyorfi. Nonparametric density estimation: the L1 view. Wiley and Sons, 1984.

[15] D. Donoho. Compressed sensing. 2004.

[16] D. Donoho and C. Grimes. Hessian eigenmaps: locally linear embedding techniques for high dimensional data. Proc. of National Academy of Sciences, 100(10):5591–5596, 2003.

[17] R. Duda, P. Hart, and D. Stork. Pattern Classification. Wiley-Interscience, 2nd edition, 2000.

[18] H. Hendriks. Nonparametric estimation of probability density on a Riemannian manifold using Fourier expansions. Annals of Statistics, 18(2):832–849, 1990.

[19] W. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Conf. in Modern Analysis and Probability, pages 189–206, 1984.

[20] B. Kégl. Intrinsic dimension estimation using packing numbers. Neural Information Processing Systems, 14, 2002.

[21] E. Levina and P. Bickel. Maximum likelihood estimation of intrinsic dimension. Neural Information Processing Systems, 17, 2005.

[22] J. Milnor. Topology from the differentiable viewpoint. Univ. of Virginia Press, 1972.

[23] J. Nash. C1 isometric imbeddings. Annals of Mathematics, 60(3):383–396, 1954.

[24] J. Nash. The imbedding problem for Riemannian manifolds. Annals of Mathematics, 63(1):20–63, 1956.

[25] P. Niyogi. Manifold regularization and semi-supervised learning: some theoretical analysis. Technical Report TR-2008-01, 2008.

[26] P. Niyogi, S. Smale, and S. Weinberger. Finding the homology of submanifolds with high confidence from random samples. Disc. Computational Geometry, 2006.

[27] B. Pelletier. Kernel density estimation on Riemannian manifolds. Statistics and Probability Letters, 73:297–304, 2005.

[28] S. Rosenberg. The Laplacian on a Riemannian manifold. Cambridge University Press, 1997.

[29] S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290, 2000.

[30] J. Tenebaum, V. de Silva, and J. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290, 2000.

[31] L. Wasserman. All of nonparametric statistics. Springer, 2005.
