Towards an Algorithmic Realization of Nash’s Theorem

Nakul Verma CSE, UC San Diego [email protected]

Abstract

It is a well-known result of Nash that an n-dimensional Riemannian manifold can be isometrically embedded in a Euclidean space of dimension 2n+1 [Nas54]. Though the proof by Nash is intuitive, it is not clear whether such a construction is achievable by an algorithm that only has access to a finite-size sample from the manifold. In this paper, we study Nash's construction and develop two algorithms for embedding a fairly general class of n-dimensional Riemannian manifolds (initially residing in R^D) into R^k (where k depends only on some key manifold properties, such as its intrinsic dimension, its volume, and its curvature) so that geodesic distances between all pairs of points are approximately preserved. The first algorithm we propose is computationally fast and embeds the given manifold approximately isometrically into about O(2^{cn}) dimensions (where c is an absolute constant). The second algorithm, although computationally more involved, attempts to minimize the dimension of the target space and (approximately isometrically) embeds the manifold in about O(n) dimensions.

1 Introduction

Finding low-dimensional representations of manifolds has proven to be an important task in data analysis and data visualization. Typically, one wants a low-dimensional embedding to reduce computational costs while maintaining relevant information in the data. For many learning tasks, distances between data points serve as an important approximation to gauge similarity between the observations. Thus, it comes as no surprise that distance-preserving or isometric embeddings are popular.

The problem of isometrically embedding a differentiable manifold into a low-dimensional Euclidean space has received considerable attention from the differential geometry community and, more recently, from the manifold learning community. The classic results by Nash [Nas54, Nas56] and Kuiper [Kui55] show that any compact Riemannian manifold of dimension n can be isometrically C¹-embedded¹ in Euclidean space of dimension 2n+1, and C∞-embedded in dimension O(n²) (see [HH06] for an excellent reference). Though these results are theoretically appealing, they rely on delicate handling of metric tensors and on solving a system of PDEs, making their constructions difficult to compute by a discrete algorithm.

On the algorithmic front, researchers in the manifold learning community have devised a number of spectral algorithms for finding low-dimensional representations of manifold data [TdSL00, RS00, BN03, DG03, WS04]. These algorithms are often successful in unravelling non-linear manifold structure from samples, but lack rigorous guarantees that distances will be preserved for unseen data.

Recently, Baraniuk and Wakin [BW07] and Clarkson [Cla07] showed that one can achieve approximate isometry via the technique of random projections. It turns out that projecting an n-dimensional manifold (initially residing in R^D) into a sufficiently high-dimensional random subspace is enough to approximately preserve all pairwise distances. Interestingly, this linear embedding guarantees to preserve both the ambient Euclidean distances and the geodesic distances between all pairs of points on the manifold without even looking at the samples from the manifold. Such a strong result comes at the cost of the dimension of the embedded space. To get (1 ± ǫ)-isometry², for instance, Baraniuk and Wakin [BW07] show that a target dimension of size about O((n/ǫ²) log(VD/τ)) is sufficient, where V is the n-dimensional volume of the manifold and τ is a global bound on the curvature. This result was sharpened by Clarkson [Cla07] by completely removing the dependence on the ambient dimension D and partially substituting τ with more average-case manifold properties.

¹A C^k-embedding of a smooth manifold M is an embedding of M that has k continuous derivatives.
²A (1 ± ǫ)-isometry means that all distances are within a multiplicative factor of (1 ± ǫ).

Figure 1: A simple example demonstrating Nash's embedding technique on a 1-manifold. Left: Original 1-manifold in some high-dimensional space. Middle: A contractive mapping of the original manifold via a linear projection onto the vertical plane. Different parts of the manifold are contracted by different amounts – distances at the tail-ends are contracted more than the distances in the middle. Right: Final embedding after applying a series of spiralling corrections. Small spirals are applied to regions with small distortion (middle), large spirals are applied to regions with large distortions (tail-ends). The resulting embedding is isometric (i.e., geodesic distance preserving) to the original manifold.
In either case, the 1/ǫ² dependence is troublesome: if we want an embedding with all distances within 99% of the original distances (i.e., ǫ = 0.01), the bounds require the dimension of the target space to be at least 10,000! One may wonder whether the dependence on ǫ is really necessary to achieve isometry. Nash's theorem suggests that an ǫ-free bound on the target dimension should be possible.

1.1 Our Contributions

In this work, we elucidate Nash's C¹ construction, and take the first step in making Nash's theorem algorithmic by providing two simple algorithms for approximately isometrically embedding n-manifolds (manifolds with intrinsic dimension n), where the dimension of the target space is independent of the ambient dimension D and of the isometry constant ǫ. The first algorithm we propose is simple and fast in computing the target embedding, but embeds the given n-manifold in about 2^{cn} dimensions (where c is an absolute constant). The second algorithm we propose focuses on minimizing the target dimension. It is computationally more involved but embeds the given n-manifold in about O(n) dimensions.

We would like to highlight that both our proposed algorithms work for a fairly general class of manifolds. There is no requirement that the original n-manifold is connected, or is globally isometric (or even globally diffeomorphic) to some subset of R^n, as is frequently assumed by several manifold embedding algorithms. In addition, unlike spectrum-based embedding algorithms available in the literature, our algorithms yield an explicit C∞-embedding that cleanly embeds out-of-sample data points, and provide isometry guarantees over the entire manifold (not just the input samples).

On the technical side, we emphasize that the techniques used in our proof are different from those Nash uses in his work; unlike traditional differential-geometric settings, we can only access the underlying manifold through a finite-size sample. This makes it difficult to compute quantities (such as the curvature tensor and the local functional form of the input manifold) that are important in Nash's approach for constructing an isometric embedding. Our techniques do, however, use various differential-geometric concepts, and our hope is to make such techniques mainstream in analyzing manifold learning algorithms.

2 Nash's Construction for C¹-Isometric Embedding

Given an n-dimensional manifold M (initially residing in R^D), Nash's embedding can be summarized in two steps (see also [Nas54]): (1) Find a contractive³ mapping of M in the desired-dimensional Euclidean space. (2) Apply an infinite series of corrections to restore the distances to their original lengths.

In order to maintain the differentiable structure, the contraction and the target dimension in step one should be chosen carefully. Nash notes that one can use Whitney's construction [Whi36] to embed M in R^{2n+1} without introducing any kinks, tears, or discontinuities in the embedding. This initial embedding, which does not necessarily preserve any distances, can be made into a contraction by adjusting the scale.

The corrections in step two should also be done with care. Each correction stretches out a small region of the contracted manifold to restore local distances as much as possible. Nash shows that applying a successive sequence of spirals⁴ in directions normal to the embedded M is a simple way to stretch the distances while maintaining differentiability. The aggregate effect of applying these "spiralling perturbations" is a globally-isometric mapping of M in R^{2n+1}.
See Figure 1 for an illustration.

³A contractive mapping or a contraction is a mapping that doesn't increase the distance between points.
⁴A spiral map is a mapping of the form t ↦ (t, sin(t), cos(t)).

Remark 1 Adjusting the lengths by applying spirals is one of many ways to do local corrections. Kuiper [Kui55], for instance, discusses an alternative way to stretch the contracted manifold by applying corrugations, and gets a similar isometry result.

2.1 Algorithm for Embedding n-Manifolds: Intuition

Taking inspiration from Nash's construction, our proposed embedding will also be divided into two stages. The first stage will attempt to find a contraction Φ : R^D → R^d of our given n-manifold M ⊂ R^D in low dimensions. The second will apply a series of local corrections Ψ_1, Ψ_2, ... (collectively referred to as the mapping Ψ : R^d → R^{d+k}) to restore the geodesic distances.

Contraction stage: A pleasantly surprising observation is that a random projection of M into d = O(n) dimensions is a bona fide injective, differential-structure preserving contraction with high probability (details in Section 5.1). Since we don't require isometry in the first stage (only a contraction), we can use a random projection as our contraction mapping Φ without having to pay the 1/ǫ² penalty.

Correction stage: We will apply several corrections to stretch out our contracted manifold Φ(M). To understand a single correction Ψ_i better, we can consider its effect on a small section of Φ(M). Since, locally, the section effectively looks like a contracted n-dimensional affine space, our correction map needs to restore distances over this n-flat. Let U := [u^1, ..., u^n] be a d × n matrix whose columns form an orthonormal basis for this n-flat in R^d, and let s_1, ..., s_n be the corresponding shrinkages along the n directions. Then one can consider applying an n-dimensional analog of the spiral mapping: Ψ_i(t) := (t, Ψ_sin(t), Ψ_cos(t)), where Ψ_sin(t) := (sin((Ct)_1), ..., sin((Ct)_n)) and Ψ_cos(t) := (cos((Ct)_1), ..., cos((Ct)_n)). Here C serves as an n × d "correction" matrix that controls how much of the surface needs to stretch. It turns out that if one sets C to be the matrix SU^T (where S is a diagonal matrix with entry S_{ii} := √((1/s_i)² − 1); recall that s_i was the shrinkage along direction u^i), then the correction Ψ_i precisely restores the shrinkages along the n orthonormal directions on the resultant surface (see our discussion in Section 5.2 for details). Since different parts of the contracted manifold need to be stretched by different amounts, we localize the effect of Ψ_i to a small enough neighborhood by applying a specific kind of kernel function known as a "bump" function in the analysis literature (details in Section 5.2, cf. Figure 5 middle). Applying different Ψ_i's at different parts of the manifold should have the aggregate effect of creating an (approximately) isometric embedding.
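To make this recipe concrete, here is a small numpy sketch (our own illustration with made-up values, not code from the paper): it builds the correction matrix C = SU^T from hypothetical shrinkages s_i and verifies that the spiral's pushforward, whose squared length is ‖u‖² + ‖Cu‖², restores a shrunk direction to unit length.

```python
import numpy as np

# Hypothetical illustration: build the n-dimensional spiral correction
# Psi_i(t) = (t, sin(Ct), cos(Ct)) for a contracted n-flat and check that it
# restores the length of a shrunk tangent vector.
rng = np.random.default_rng(0)
d, n = 5, 2

U, _ = np.linalg.qr(rng.standard_normal((d, n)))  # orthonormal basis of the n-flat (d x n)
s = np.array([0.6, 0.8])                          # shrinkages s_i along the n directions
S = np.diag(np.sqrt(1.0 / s**2 - 1.0))            # S_ii = sqrt((1/s_i)^2 - 1)
C = S @ U.T                                       # n x d correction matrix

def psi(t):
    # the spiral map R^d -> R^(d+2n)
    return np.concatenate([t, np.sin(C @ t), np.cos(C @ t)])

# The pushforward of psi at t sends a tangent vector u to
# (u, cos(Ct)*(Cu), -sin(Ct)*(Cu)) componentwise, so its squared length is
# ||u||^2 + ||Cu||^2, independent of t.
u = s[0] * U[:, 0]                 # a direction that was shrunk by factor s_0
restored_sq = u @ u + (C @ u) @ (C @ u)
print(np.sqrt(restored_sq))        # ~1.0: the original unit length is restored
```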

We now have a basic outline of our algorithm. Let M be an n-dimensional manifold in R^D. We first find a contraction of M in d = O(n) dimensions via a random projection. This preserves the differential structure but distorts the interpoint geodesic distances. We estimate the distortion at different regions of the projected manifold by comparing a sample from M with its projection. We then perform a series of spiral corrections—each applied locally—to adjust the lengths in the local neighborhoods. We will conclude that restoring the lengths in all neighborhoods yields a globally consistent (approximately) isometric embedding of M. Figure 4 shows a quick schematic of our two-stage embedding with the various quantities of interest.

Exactly how these different local Ψ_i's are applied gives rise to our two algorithms. For the first algorithm, we shall apply the Ψ_i maps simultaneously by making use of extra coordinates so that different corrections don't interfere with each other. This yields a simple and computationally fast embedding. We shall require about 2^{cn} additional coordinates to apply the corrections, making the final embedding size about 2^{cn} (here c is an absolute constant). For the second algorithm, we will follow Nash's technique more closely and apply the Ψ_i maps iteratively in the same embedding space without the use of extra coordinates. Since all Ψ_i's will share the same coordinate space, extra care needs to be taken in applying the corrections. This will require additional computational effort in terms of computing normals to the embedded manifold (details later), but will result in an embedding of size O(n).

3 Preliminaries

Let M be a smooth, n-dimensional compact Riemannian submanifold of R^D. Since we will be working with samples from M, we need to ensure a certain amount of regularity. Here we borrow the notation from Niyogi et al. [NSW06] about the condition number of M.

Definition 1 (condition number [NSW06]) Let M ⊂ R^D be a compact Riemannian manifold. The condition number of M is 1/τ, if τ is the largest number such that the normals of length r < τ at any two distinct points p, q ∈ M don't intersect.

The condition number 1/τ is an intuitive notion that captures the "complexity" of M in terms of its curvature. We can, for instance, bound the directional curvature at any p ∈ M by τ. Figure 2 depicts the

Figure 2: Tubular neighborhood of a manifold. Note that the normals (dotted lines) of a particular length incident at each point of the manifold (solid line) will intersect if the manifold is too curvy.

normals of a manifold. Notice that long non-intersecting normals are possible only if the manifold is relatively flat. Hence, the condition number of M gives us a handle on how curvy M can be. As a quick example, let's calculate the condition number of an n-dimensional sphere of radius r (embedded in R^D). Note that in this case one can have non-intersecting normals of length less than r (since otherwise they will start intersecting at the center of the sphere). Thus the condition number of such a sphere is 1/r. Throughout the text we will assume that M has condition number 1/τ.

We will use D_G(p,q) to indicate the geodesic distance between points p and q (where the underlying manifold is understood from the context), and ‖p − q‖ to indicate the Euclidean distance between points p and q (where the ambient space is understood from the context).

To correctly estimate the distortion induced by the initial contraction mapping, we will additionally require a high-resolution covering of our manifold.

Definition 2 (bounded manifold cover) Let M ⊂ R^D be a Riemannian n-manifold. We call X ⊂ M an α-bounded (ρ, δ)-cover of M if for all p ∈ M and the ρ-neighborhood X_p := {x ∈ X : ‖x − p‖ < ρ} around p, we have

• there exist points x_0, ..., x_n ∈ X_p such that |(x_i − x_0)/‖x_i − x_0‖ · (x_j − x_0)/‖x_j − x_0‖| ≤ 1/2n, for i ≠ j. (covering criterion)

• |X_p| ≤ α. (local boundedness criterion)

• there exists a point x ∈ X_p such that ‖x − p‖ ≤ ρ/2. (point representation criterion)

• for any n+1 points in X_p satisfying the covering criterion, let T̂_p denote the n-dimensional affine space passing through them (note that T̂_p does not necessarily pass through p). Then, for any unit vector v̂ in T̂_p, we have v̂ · v/‖v‖ ≥ 1 − δ, where v is the projection of v̂ onto the tangent space of M at p. (tangent space approximation criterion)

The above is an intuitive notion of manifold sampling that can estimate the local tangent spaces. Curiously, we haven’t found such “tangent-space approximating” notions of manifold sampling in the literature. We do note in passing that our sampling criterion is similar in spirit to the (ǫ,δ)-sampling (also known as “tight” ǫ-sampling) criterion popular in the Computational Geometry literature (see e.g. [DGGZ02, GW03]).

Remark 2 Given an n-manifold M with condition number 1/τ, and some 0 < δ ≤ 1, if ρ ≤ τδ/(3√2 n), then one can construct a 2^{10n+1}-bounded (ρ, δ)-cover of M – see Appendix A.2 for details.

We can now state our two algorithms.

4 The Algorithms

Inputs. We assume the following quantities are given: (i) n – the intrinsic dimension of M. (ii) 1/τ – the condition number of M. (iii) X – an α-bounded (ρ, δ)-cover of M. (iv) ρ – the ρ parameter of the cover.

Notation. Let φ be a random orthogonal projection map that maps points from R^D into a random subspace of dimension d (n ≤ d ≤ D). We will have d be about O(n). Set Φ := (2/3)(√(D/d))φ as a scaled version of φ. Since Φ is linear, Φ can also be represented as a d × D matrix. In our discussion below we will use the function notation and the matrix notation interchangeably; that is, for any p ∈ R^D, we will use the notation Φ(p) (applying function Φ to p) and the notation Φp (matrix-vector multiplication) interchangeably.

For any x ∈ X, let x_0, ..., x_n be n + 1 points from the set {x′ ∈ X : ‖x − x′‖ < ρ} such that |(x_i − x_0)/‖x_i − x_0‖ · (x_j − x_0)/‖x_j − x_0‖| ≤ 1/2n, for i ≠ j (cf. Definition 2). Let F_x be the D × n matrix whose column vectors form some orthonormal basis of the n-dimensional subspace spanned by the vectors {x_i − x_0}_{i∈[n]}.

Estimating local contractions. We estimate the contraction caused by Φ at a small enough neighborhood of M containing the point x ∈ X by computing the "thin" Singular Value Decomposition (SVD) U_x Σ_x V_x^T of the d × n matrix ΦF_x and representing the singular values in the conventional descending order. That is, ΦF_x = U_x Σ_x V_x^T, and since ΦF_x is a tall matrix (n ≤ d), we know that the bottom d − n singular values are zero. Thus, we only consider the top n (of d) left singular vectors in the SVD (so U_x is d × n, Σ_x is n × n, and V_x is n × n) and σ_x^1 ≥ σ_x^2 ≥ ... ≥ σ_x^n, where σ_x^i is the i-th largest singular value.

Observe that the singular values σ_x^1, ..., σ_x^n are precisely the distortion amounts in the directions u_x^1, ..., u_x^n ∈ R^d at Φ(x) (where [u_x^1, ..., u_x^n] = U_x) when we apply Φ. To see this, consider the direction w^i := F_x v_x^i in the column-span of F_x (where [v_x^1, ..., v_x^n] = V_x). Then Φw^i = (ΦF_x)v_x^i = σ_x^i u_x^i, which can be interpreted as: Φ maps the vector w^i in the subspace F_x (in R^D) to the vector u_x^i (in R^d) with a scaling of σ_x^i.

Note that if 0 < σ_x^i ≤ 1 (for all x ∈ X and 1 ≤ i ≤ n), we can define an n × d correction matrix (corresponding to each x ∈ X) C^x := S_x U_x^T, where S_x is a diagonal matrix with (S_x)_{ii} := √((1/σ_x^i)² − 1). We can also write S_x as (Σ_x^{−2} − I)^{1/2}. The correction matrix C^x will have the effect of stretching the direction u_x^i by the amount (S_x)_{ii} and killing any direction v that is orthogonal to the column-span of U_x.

Algorithm 1 Compute Corrections C^x's
1: for x ∈ X (in any order) do
2:   Let x_0, ..., x_n ∈ {x′ ∈ X : ‖x − x′‖ < ρ} be such that |(x_i − x_0)/‖x_i − x_0‖ · (x_j − x_0)/‖x_j − x_0‖| ≤ 1/2n (for i ≠ j).
3:   Let F_x be a D × n matrix whose columns form an orthonormal basis of the n-dimensional span of the vectors {x_i − x_0}_{i∈[n]}.
4:   Let U_x Σ_x V_x^T be the "thin" SVD of ΦF_x.
5:   Set C^x := (Σ_x^{−2} − I)^{1/2} U_x^T.
6: end for
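The following Python sketch mirrors Algorithm 1 under simplifying assumptions (the function name and the neighbor handling are ours: F_x is built from all ρ-neighbors of x instead of the n+1 near-orthogonal points guaranteed by Definition 2, and a numerical clip stands in for the requirement 0 < σ_x^i ≤ 1):

```python
import numpy as np

# Sketch of Algorithm 1 (hypothetical, not the paper's reference code).
# Given the scaled projection Phi (d x D) and the cover X (rows are points in
# R^D), compute a correction matrix C^x for every cover point via the thin SVD
# of Phi F_x. Assumes every point has at least n+1 neighbors within rho.
def compute_corrections(X, Phi, rho, n):
    corrections = {}
    for idx, x in enumerate(X):
        nbrs = X[np.linalg.norm(X - x, axis=1) < rho]
        diffs = nbrs - nbrs[0]                      # vectors x_i - x_0
        Fx, _ = np.linalg.qr(diffs.T)               # orthonormal basis of their span
        Fx = Fx[:, :n]                              # F_x is D x n
        U, sigma, _ = np.linalg.svd(Phi @ Fx, full_matrices=False)
        sigma = np.clip(sigma, 1e-8, 1.0)           # guard: need 0 < sigma_i <= 1
        S = np.diag(np.sqrt(1.0 / sigma**2 - 1.0))  # (Sigma^-2 - I)^(1/2)
        corrections[idx] = S @ U.T                  # C^x, an n x d matrix
    return corrections
```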

Algorithm 2 Embedding Technique I

Preprocessing Stage: We will first partition the given covering X into disjoint subsets such that no subset contains points that are too close to each other. Let x_1, ..., x_{|X|} be the points in X in some arbitrary but fixed order. We can do the partition as follows (see also the sketch below):
1: Initialize X^(1), ..., X^(K) as empty sets.
2: for x_i ∈ X (in any fixed order) do
3:   Let j be the smallest positive integer such that x_i is not within distance 2ρ of any element in X^(j). That is, the smallest j such that for all x ∈ X^(j), ‖x − x_i‖ ≥ 2ρ.
4:   X^(j) ← X^(j) ∪ {x_i}.
5: end for
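A short sketch of this greedy partitioning (hypothetical helper name, assuming X is given as an array of points):

```python
import numpy as np

# Greedy partition of the cover X into buckets X^(1), X^(2), ... such that any
# two points within the same bucket are at least 2*rho apart (sketch).
def partition_cover(X, rho):
    buckets = []                                   # each bucket is a list of indices into X
    for i, x in enumerate(X):
        for bucket in buckets:
            if all(np.linalg.norm(X[j] - x) >= 2 * rho for j in bucket):
                bucket.append(i)
                break
        else:                                      # every existing bucket has a close point
            buckets.append([i])
    return buckets                                 # K = len(buckets)
```

Note that a new bucket is opened only when every existing bucket already contains a point within 2ρ of x_i; since only α2^{cn} cover points can crowd into a 2ρ-ball, this is what bounds K in Remark 4 below.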

The Embedding: For any p ∈ M ⊂ R^D, we embed it in R^{d+2nK} as follows:
1: Let t = Φ(p).
2: Define Ψ(t) := (t, Ψ_{1,sin}(t), Ψ_{1,cos}(t), ..., Ψ_{K,sin}(t), Ψ_{K,cos}(t)), where Ψ_{j,sin}(t) := (ψ^1_{j,sin}(t), ..., ψ^n_{j,sin}(t)) and Ψ_{j,cos}(t) := (ψ^1_{j,cos}(t), ..., ψ^n_{j,cos}(t)). The individual terms are given by

  ψ^i_{j,sin}(t) := Σ_{x∈X^(j)} (√(Λ_{Φ(x)}(t))/ω) sin(ω(C^x t)_i)
  ψ^i_{j,cos}(t) := Σ_{x∈X^(j)} (√(Λ_{Φ(x)}(t))/ω) cos(ω(C^x t)_i)      i = 1, ..., n; j = 1, ..., K,

  where Λ_a(b) = 𝟙{‖a−b‖<ρ} · e^{−1/(1−(‖a−b‖/ρ)²)} / Σ_{q∈X} 𝟙{‖Φ(q)−b‖<ρ} · e^{−1/(1−(‖Φ(q)−b‖/ρ)²)}.

3: return Ψ(t) as the embedding of p in R^{d+2nK}.
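Putting the pieces together, here is a sketch of the bump kernel Λ and the embedding map of Technique I (illustrative only; `corrections` and `buckets` are the outputs of the two sketches above, and all helper names are ours):

```python
import numpy as np

def bump(r):
    # classic compactly-supported bump on [0, 1): exp(-1/(1 - r^2))
    return np.exp(-1.0 / (1.0 - r**2)) if r < 1.0 else 0.0

def Lambda(a, t, centers, rho):
    # normalized bump ("smooth partition of unity") centered at a, evaluated
    # at t; `centers` are the projected cover points Phi(q), q in X
    num = bump(np.linalg.norm(a - t) / rho)
    den = sum(bump(np.linalg.norm(c - t) / rho) for c in centers)
    return num / den if den > 0 else 0.0

def embed_point(p, Phi, X, corrections, buckets, rho, omega, n):
    t = Phi @ p
    centers = [Phi @ X[i] for i in range(len(X))]
    out = [t]
    for bucket in buckets:                         # one sin/cos block per subset X^(j)
        psi_sin, psi_cos = np.zeros(n), np.zeros(n)
        for i in bucket:
            a = np.sqrt(Lambda(centers[i], t, centers, rho)) / omega
            Ct = corrections[i] @ t
            psi_sin += a * np.sin(omega * Ct)
            psi_cos += a * np.cos(omega * Ct)
        out += [psi_sin, psi_cos]
    return np.concatenate(out)                     # a point in R^(d + 2nK)
```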

Algorithm 3 Embedding Technique II

The Embedding: Let x_1, ..., x_{|X|} be the points in X in some arbitrary but fixed order. Now, for any point p ∈ M ⊂ R^D, we embed it in R^{2d+3} as follows:
1: Let t = Φ(p).
2: Define Ψ_{0,n}(t) := (t, 0, ..., 0) (d+3 zeros).

3: for i = 1, ..., |X| do
4:   Define Ψ_{i,0} := Ψ_{i−1,n}.
5:   for j = 1, ..., n do
6:     Let η_{i,j}(t) and ν_{i,j}(t) be two mutually orthogonal unit vectors normal to Ψ_{i,j−1}(M) at Ψ_{i,j−1}(t).
7:     Define
       Ψ_{i,j}(t) := Ψ_{i,j−1}(t) + η_{i,j}(t) (√(Λ_{Φ(x_i)}(t))/ω_{i,j}) sin(ω_{i,j}(C^{x_i} t)_j) + ν_{i,j}(t) (√(Λ_{Φ(x_i)}(t))/ω_{i,j}) cos(ω_{i,j}(C^{x_i} t)_j),
       where Λ_a(b) = 𝟙{‖a−b‖<ρ} · e^{−1/(1−(‖a−b‖/ρ)²)} / Σ_{q∈X} 𝟙{‖Φ(q)−b‖<ρ} · e^{−1/(1−(‖Φ(q)−b‖/ρ)²)}.
8:   end for
9: end for
10: return Ψ_{|X|,n}(t) as the embedding of p into R^{2d+3}.
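A heavily simplified sketch of this iteration follows. Here `normal_pair(y)` stands in for the normal-estimation routine that the paper defers to Appendix A.9 (it is not implemented here), `lam(a, t)` computes Λ_a(t) (e.g., the Lambda of the previous sketch with its remaining arguments bound), and a single frequency ω replaces the per-step ω_{i,j}:

```python
import numpy as np

# Sketch of Embedding Technique II (hypothetical helper names).
def embed_point_II(p, Phi, X, corrections, lam, omega, n, normal_pair):
    t = Phi @ p
    d = len(t)
    y = np.concatenate([t, np.zeros(d + 3)])   # Psi_{0,n}(t) in R^(2d+3)
    for i in range(len(X)):                    # one sweep per cover point x_i
        for j in range(n):                     # one correction direction at a time
            eta, nu = normal_pair(y)           # ~unit normals to the current image
            a = np.sqrt(lam(Phi @ X[i], t)) / omega
            phase = omega * (corrections[i] @ t)[j]
            y = y + eta * a * np.sin(phase) + nu * a * np.cos(phase)
    return y
```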

A few remarks are in order.

Remark 3 The function Λ in both embeddings acts as a localizing kernel that localizes the effects of the spiralling corrections (discussed in detail in Section 5.2), and ω > 0 (for Embedding I) or ω_{i,j} > 0 (for Embedding II) are free parameters controlling the frequency of the sinusoidal terms.

Remark 4 If ρ ≤ τ/4, the number of subsets (i.e., K) produced by Embedding I is at most α2^{cn} for an α-bounded (ρ, δ)-cover X of M (where c ≤ 4). See Appendix A.3 for details.

Remark 5 The success of Embedding II crucially depends upon finding a pair of normal unit vectors η and ν in each iteration; we discuss how to approximate these in Appendix A.9.

We shall see that for an appropriate choice of d, ρ, δ and ω (or ω_{i,j}), our algorithm yields an approximately isometric embedding of M.

4.1 Main Result

Theorem 3 Let M ⊂ R^D be a compact n-manifold with volume V and condition number 1/τ (as above). Let d = Ω(n + ln(V/τ^n)) be the target dimension of the initial random projection mapping such that d ≤ D. For any 0 < ǫ ≤ 1, let ρ ≤ (τd/D)(ǫ/350)², δ ≤ (d/D)(ǫ/250)², and let X ⊂ M be an α-bounded (ρ, δ)-cover of M. Now, let

i. N_I ⊂ R^{d+2αn2^{cn}} be the embedding of M returned by Algorithm I (where c ≤ 4),
ii. N_II ⊂ R^{2d+3} be the embedding of M returned by Algorithm II.

Then, with probability at least 1 − 1/poly(n) over the choice of the initial random projection, for all p, q ∈ M and their corresponding mappings p_I, q_I ∈ N_I and p_II, q_II ∈ N_II, we have

i. (1 − ǫ) D_G(p,q) ≤ D_G(p_I, q_I) ≤ (1 + ǫ) D_G(p,q),
ii. (1 − ǫ) D_G(p,q) ≤ D_G(p_II, q_II) ≤ (1 + ǫ) D_G(p,q).

5 Proof

Our goal is to show that the two proposed embeddings approximately preserve the lengths of all geodesic curves. Now, since the length of any given curve γ : [a,b] → M is given by ∫_a^b ‖γ′(s)‖ ds, it is vital to study how our embeddings modify the lengths of tangent vectors at any point p ∈ M.

In order to discuss tangent vectors, we need to introduce the notion of a tangent space T_pM at a particular point p ∈ M. Consider any smooth curve c : (−ǫ, ǫ) → M such that c(0) = p; then we know that c′(0) is the vector tangent to c at p. The collection of all such vectors formed by all such curves is a well-defined vector space (with origin at p), called the tangent space T_pM.


Figure 3: Effects of applying a smooth map F on various quantities of interest. Left: A manifold M containing point p. v is a vector tangent to M at p. Right: Mapping of M under F. Point p maps to F(p), tangent vector v maps to (DF)_p(v).

In what follows, we will fix an arbitrary point p ∈ M and a tangent vector v ∈ T_pM and analyze how the various steps of the algorithm modify the length of v. Let Φ be the initial (scaled) random projection map (from R^D to R^d) that may contract distances on M by various amounts, and let Ψ be the subsequent correction map that attempts to restore these distances (as defined in Step 2 for Embedding I or as a sequence of maps in Step 7 for Embedding II).

To get a firm footing for our analysis, we need to study how Φ and Ψ modify the tangent vector v. It is well known from differential geometry that for any smooth map F : M → N that maps a manifold M ⊂ R^k to a manifold N ⊂ R^{k′}, there exists a linear map (DF)_p : T_pM → T_{F(p)}N, known as the derivative map or the pushforward (at p), that maps tangent vectors incident at p in M to tangent vectors incident at F(p) in N. To see this, consider a vector u tangent to M at some point p. Then, there is some smooth curve c : (−ǫ, ǫ) → M such that c(0) = p and c′(0) = u. By mapping the curve c into N, i.e. F(c(t)), we see that F(c(t)) includes the point F(p) at t = 0. Now, by calculus, we know that the derivative at this point, dF(c(t))/dt |_{t=0}, is the directional derivative (∇F)_p(u), where (∇F)_p is a k′ × k matrix called the gradient (at p). The quantity (∇F)_p is precisely the matrix representation of this linear "pushforward" map that sends tangent vectors of M (at p) to the corresponding tangent vectors of N (at F(p)). Figure 3 depicts how these quantities are affected by applying F. Also note that if F is linear, then DF = F.

Observe that since pushforward maps are linear, without loss of generality we can assume that v has unit length.
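As a sanity check on the pushforward picture, the following snippet (our own illustrative numbers) confirms that (DF)_p(v) is computed by the Jacobian of F, here for the one-dimensional spiral map of Section 2:

```python
import numpy as np

# Numerical check that the pushforward of a smooth map F agrees with its
# Jacobian: (DF)_p(v) ~ (F(p + h v) - F(p)) / h for small h.
def F(t):                            # the 1-d spiral map R -> R^3
    return np.array([t, np.sin(t), np.cos(t)])

p, v, h = 0.3, 1.0, 1e-6
jacobian = np.array([1.0, np.cos(p), -np.sin(p)])         # dF/dt at p
finite_diff = (F(p + h * v) - F(p)) / h
print(np.allclose(jacobian * v, finite_diff, atol=1e-5))  # True
```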

A quick roadmap for the proof. In the next three sections, we take a brief detour to study the effects of applying Φ, applying Ψ for Algorithm I, and applying Ψ for Algorithm II separately. This will give us the necessary tools to analyze the combined effect of applying Ψ ∘ Φ on v (Section 5.4). We will conclude by relating tangent vectors to lengths of curves, showing approximate isometry (Section 5.5). Figure 4 provides a quick sketch of our two-stage mapping with the quantities of interest. We defer the proofs of all the supporting lemmas to the Appendix.

5.1 Effects of Applying Φ

It is well known as an application of Sard's theorem from differential topology (see e.g. [Mil72]) that almost every smooth mapping of an n-dimensional manifold into R^{2n+1} is a differential-structure preserving embedding of M. In particular, a projection onto a random subspace (of dimension 2n+1) constitutes such an embedding with probability 1. This translates to stating that a random projection into R^{2n+1} is enough to guarantee that Φ doesn't collapse the lengths of non-zero tangent vectors. However, due to computational issues, we additionally require that the lengths are bounded away from zero (that is, a statement of the form ‖(DΦ)_p(v)‖ ≥ Ω(1)‖v‖ for all v tangent to M at all points p).

We can thus appeal to the random projections result by Clarkson [Cla07] (with the isometry parameter set to a constant, say 1/4) to ensure this condition. In particular, it follows that

Lemma 4 Let M ⊂ R^D be a smooth n-manifold (as defined above) with volume V and condition number 1/τ. Let R be a random projection matrix that maps points from R^D into a random subspace of dimension d (d ≤ D). Define Φ := (2/3)(√(D/d))R as a scaled projection mapping. If d = Ω(n + ln(V/τ^n)), then with probability at least 1 − 1/poly(n) over the choice of the random projection matrix, we have

(a) For all p ∈ M and all tangent vectors v ∈ T_pM, (1/2)‖v‖ ≤ ‖(DΦ)_p(v)‖ ≤ (5/6)‖v‖.
(b) For all p, q ∈ M, (1/2)‖p − q‖ ≤ ‖Φp − Φq‖ ≤ (5/6)‖p − q‖.


Figure 4: Two-stage mapping of our embedding technique. Left: Underlying manifold M ⊂ R^D with the quantities of interest – a fixed point p and a fixed unit vector v tangent to M at p. Center: A (scaled) linear projection of M into a random subspace of d dimensions. The point p maps to Φp and the tangent vector v maps to u := (DΦ)_p(v) = Φv. The length of v contracts to ‖u‖. Right: Correction of ΦM via a non-linear mapping Ψ into R^{d+k}. We have k = O(α2^{cn}) for correction technique I, and k = d + 3 for correction technique II (see also Section 4). Our goal is to show that Ψ stretches the length of the contracted v (i.e., u) back to approximately its original length.

(c) For all x ∈ R^D, ‖Φx‖ ≤ (2/3)(√(D/d))‖x‖.

In what follows, we assume that Φ is such a scaled random projection map. Then, a bound on the lengths of tangent vectors also gives us a bound on the spectrum of ΦF_x (recall the definition of F_x from Section 4).

Corollary 5 Let Φ, F_x and n be as described above (recall that x ∈ X, which forms a bounded (ρ, δ)-cover of M). Let σ_x^i represent the i-th largest singular value of the matrix ΦF_x. Then, for δ ≤ d/32D, we have 1/4 ≤ σ_x^n ≤ σ_x^1 ≤ 1 (for all x ∈ X).

We will be using these facts in our discussion below in Section 5.4.

5.2 Effects of Applying Ψ (Algorithm I)

As discussed in Section 2.1, the goal of Ψ is to restore the contraction induced by Φ on M. To understand the action of Ψ on a tangent vector better, we will first consider the simple case of flat manifolds (Section 5.2.1), and then develop the general case (Section 5.2.2).

5.2.1 Warm-up: flat M

Let's first consider applying a simple one-dimensional spiral map Ψ̄ : R → R³ given by t ↦ (t, sin(Ct), cos(Ct)), where t ∈ I = (−ǫ, ǫ). Let v̄ be a unit vector tangent to I (at, say, 0). Then note that

  (DΨ̄)_{t=0}(v̄) = dΨ̄/dt |_{t=0} = (1, C cos(Ct), −C sin(Ct)) |_{t=0}.

Thus, applying Ψ̄ stretches the length of v̄ from 1 to ‖(1, C cos(Ct), −C sin(Ct))|_{t=0}‖ = √(1 + C²). Notice the advantage of applying the spiral map in computing the lengths: the sine and cosine terms combine to yield a simple expression for the size of the stretch. In particular, if we want to stretch the length of v̄ from 1 to, say, L ≥ 1, then we simply need C = √(L² − 1) (notice the similarity between this expression and our expression for the diagonal component S_x of the correction matrix C^x in Section 4).
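This one-line calculation is easy to verify numerically (illustrative values of our choosing):

```python
import numpy as np

# The spiral t -> (t, sin(Ct), cos(Ct)) stretches unit tangent vectors by
# sqrt(1 + C^2), so C = sqrt(L^2 - 1) achieves any desired stretch L >= 1.
L = 2.5                                   # desired stretch
C = np.sqrt(L**2 - 1.0)
t = 0.7                                   # any point; the stretch is t-independent
tangent = np.array([1.0, C * np.cos(C * t), -C * np.sin(C * t)])
print(np.linalg.norm(tangent))            # ~2.5 = L
```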

We can generalize this to the case of an n-dimensional flat manifold (a section of an n-flat) by considering a map similar to Ψ̄. For concreteness, let F be a D × n matrix whose column vectors form some orthonormal basis of the n-flat manifold (in the original space R^D). Let UΣV^T be the "thin" SVD of ΦF. Then FV forms an orthonormal basis of the n-flat manifold (in R^D) that maps to an orthogonal basis UΣ of the projected n-flat manifold (in R^d) via the contraction mapping Φ. Define the spiral map Ψ̄ : R^d → R^{d+2n} in this case as follows: Ψ̄(t) := (t, Ψ̄_sin(t), Ψ̄_cos(t)), with Ψ̄_sin(t) := (ψ̄^1_sin(t), ..., ψ̄^n_sin(t)) and Ψ̄_cos(t) := (ψ̄^1_cos(t), ..., ψ̄^n_cos(t)). The individual terms are given as

  ψ̄^i_sin(t) := sin((Ct)_i),  ψ̄^i_cos(t) := cos((Ct)_i),    i = 1, ..., n,

where C is now an n × d correction matrix. It turns out that setting C = (Σ^{−2} − I)^{1/2} U^T precisely restores the contraction caused by Φ to the tangent vectors (notice the similarity between this expression and the

correction matrix in the general case C^x in Section 4 and our motivating intuition in Section 2.1). To see this, let v be a vector tangent to the n-flat at some point p (in R^D). We will represent v in the FV basis (that is, v = Σ_i α_i(Fv^i), where [Fv^1, ..., Fv^n] = FV). Note that ‖Φv‖² = ‖Σ_i α_i ΦFv^i‖² = ‖Σ_i α_i σ^i u^i‖² = Σ_i (α_i σ^i)² (where the σ^i are the individual singular values of Σ and the u^i are the left singular vectors forming the columns of U). Now, let w be the pushforward of v (that is, w = (DΦ)_p(v) = Φv = Σ_i w_i e^i, where {e^i}_i forms the standard basis of R^d). Now, since DΨ̄ is linear, we have ‖(DΨ̄)_{Φ(p)}(w)‖² = ‖Σ_i w_i (DΨ̄)_{Φ(p)}(e^i)‖², where (DΨ̄)_{Φ(p)}(e^i) = dΨ̄/dt^i |_{t=Φ(p)} = (dt/dt^i, dΨ̄_sin(t)/dt^i, dΨ̄_cos(t)/dt^i) |_{t=Φ(p)}. The individual components are given by

  dψ̄^k_sin(t)/dt^i = +cos((Ct)_k) C_{k,i}
  dψ̄^k_cos(t)/dt^i = −sin((Ct)_k) C_{k,i}      k = 1, ..., n; i = 1, ..., d.

By algebra, we see that

  ‖(D(Ψ̄ ∘ Φ))_p(v)‖² = ‖(DΨ̄)_{Φ(p)}((DΦ)_p(v))‖² = ‖(DΨ̄)_{Φ(p)}(w)‖²
    = Σ_{k=1}^d w_k² + Σ_{k=1}^n cos²((CΦ(p))_k)((CΦv)_k)² + Σ_{k=1}^n sin²((CΦ(p))_k)((CΦv)_k)²
    = Σ_{k=1}^d w_k² + Σ_{k=1}^n ((CΦv)_k)² = ‖Φv‖² + ‖CΦv‖² = ‖Φv‖² + (Φv)^T C^T C (Φv)
    = ‖Φv‖² + (Σ_i α_i σ^i u^i)^T U(Σ^{−2} − I)U^T (Σ_i α_i σ^i u^i)
    = ‖Φv‖² + [α_1 σ^1, ..., α_n σ^n](Σ^{−2} − I)[α_1 σ^1, ..., α_n σ^n]^T
    = ‖Φv‖² + (Σ_i α_i² − (α_i σ^i)²) = ‖Φv‖² + ‖v‖² − ‖Φv‖² = ‖v‖².

In other words, our non-linear correction map Ψ̄ can exactly restore the contraction caused by Φ for any vector tangent to an n-flat manifold.
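The following numpy snippet checks this exact-restoration identity end to end on a randomly drawn n-flat (our own construction; the projection is rescaled explicitly so that it is guaranteed to be a contraction, rather than relying on the high-probability guarantee of Lemma 4):

```python
import numpy as np

# With C = (Sigma^-2 - I)^(1/2) U^T, we have ||Phi v||^2 + ||C Phi v||^2 = ||v||^2
# for every v tangent to the n-flat spanned by the columns of F.
rng = np.random.default_rng(1)
D, d, n = 20, 8, 3

F, _ = np.linalg.qr(rng.standard_normal((D, n)))     # orthonormal basis of the n-flat
proj, _ = np.linalg.qr(rng.standard_normal((D, d)))
Phi = proj.T                                         # projection onto a random d-subspace
Phi *= 0.9 / np.linalg.svd(Phi @ F, compute_uv=False).max()  # force a contraction

U, sigma, _ = np.linalg.svd(Phi @ F, full_matrices=False)
C = np.diag(np.sqrt(sigma**-2.0 - 1.0)) @ U.T        # n x d correction matrix

v = F @ rng.standard_normal(n)                       # arbitrary tangent vector to the flat
u = Phi @ v
print(u @ u + (C @ u) @ (C @ u), v @ v)              # the two numbers agree
```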

In the fully general case, the situation gets slightly more complicated, since we need to apply different spiral maps, each corresponding to a different-size correction at a different location on the contracted manifold. Recall that we localize the effect of a correction by applying the so-called "bump" function (details below). These bump functions, although important for localization, have an undesirable effect on the stretched length of the tangent vector. Thus, to ameliorate their effect on the length of the resulting tangent vector, we control their contribution via a free parameter ω.

5.2.2 The General Case

More specifically, Embedding Technique I restores the contraction induced by Φ by applying a non-linear map Ψ(t) := (t, Ψ_{1,sin}(t), Ψ_{1,cos}(t), ..., Ψ_{K,sin}(t), Ψ_{K,cos}(t)) (recall that K is the number of subsets we decompose X into – cf. the description of Embedding I in Section 4), with Ψ_{j,sin}(t) := (ψ^1_{j,sin}(t), ..., ψ^n_{j,sin}(t)) and Ψ_{j,cos}(t) := (ψ^1_{j,cos}(t), ..., ψ^n_{j,cos}(t)). The individual terms are given as

  ψ^i_{j,sin}(t) := Σ_{x∈X^(j)} (√(Λ_{Φ(x)}(t))/ω) sin(ω(C^x t)_i)
  ψ^i_{j,cos}(t) := Σ_{x∈X^(j)} (√(Λ_{Φ(x)}(t))/ω) cos(ω(C^x t)_i)      i = 1, ..., n; j = 1, ..., K,

where the C^x's are the correction amounts for the different locations x on the manifold, ω > 0 controls the frequency (cf. Section 4), and Λ_{Φ(x)}(t) is defined to be λ_{Φ(x)}(t) / Σ_{q∈X} λ_{Φ(q)}(t), with

  λ_{Φ(x)}(t) := exp(−1/(1 − ‖t − Φ(x)‖²/ρ²)) if ‖t − Φ(x)‖ < ρ, and 0 otherwise.

λ is a classic example of a bump function (see Figure 5 middle). It is a smooth function with compact support. Its applicability arises from the fact that it can be made "to specifications": that is, it can be made to vanish outside any interval of our choice. Here we exploit this property to localize the effect of our corrections. The normalization of λ (the function Λ) creates a so-called smooth partition of unity that helps to vary smoothly between the spirals applied at different regions of M.

Since any tangent vector in R^d can be expressed in terms of the basis vectors, it suffices to study how DΨ acts on the standard basis {e^i}. Note that

  (DΨ)_t(e^i) = (dt/dt^i, dΨ_{1,sin}(t)/dt^i, dΨ_{1,cos}(t)/dt^i, ..., dΨ_{K,sin}(t)/dt^i, dΨ_{K,cos}(t)/dt^i) |_t,


Figure 5: Effects of applying a bump function on a spiral mapping. Left: Spiral mapping t ↦ (t, sin(t), cos(t)). Middle: Bump function λ_x: a smooth function with compact support. The parameter x controls the location while ρ controls the width. Right: The combined effect: t ↦ (t, λ_x(t) sin(t), λ_x(t) cos(t)). Note that the effect of the spiral is localized while keeping the mapping smooth.

where the individual components are given by

  dψ^k_{j,sin}(t)/dt^i = Σ_{x∈X^(j)} [ (1/ω) sin(ω(C^x t)_k) (dΛ^{1/2}_{Φ(x)}(t)/dt^i) + √(Λ_{Φ(x)}(t)) cos(ω(C^x t)_k) C^x_{k,i} ]
  dψ^k_{j,cos}(t)/dt^i = Σ_{x∈X^(j)} [ (1/ω) cos(ω(C^x t)_k) (dΛ^{1/2}_{Φ(x)}(t)/dt^i) − √(Λ_{Φ(x)}(t)) sin(ω(C^x t)_k) C^x_{k,i} ]
    for k = 1, ..., n; i = 1, ..., d; j = 1, ..., K.

One can now observe the advantage of having the term ω: by picking ω sufficiently large, we can make the first part of each expression sufficiently small. Now, for any tangent vector u = Σ_i u_i e^i such that ‖u‖ ≤ 1, we have (by algebra)

  ‖(DΨ)_t(u)‖² = ‖Σ_i u_i (DΨ)_t(e^i)‖²

    = Σ_{k=1}^d u_k² + Σ_{j=1}^K Σ_{k=1}^n [ Σ_{x∈X^(j)} ( A^{k,x}_{sin}(t)/ω + √(Λ_{Φ(x)}(t)) cos(ω(C^x t)_k)(C^x u)_k ) ]²
      + [ Σ_{x∈X^(j)} ( A^{k,x}_{cos}(t)/ω − √(Λ_{Φ(x)}(t)) sin(ω(C^x t)_k)(C^x u)_k ) ]²,        (1)

where A^{k,x}_{sin}(t) := Σ_i u_i sin(ω(C^x t)_k)(dΛ^{1/2}_{Φ(x)}(t)/dt^i) and A^{k,x}_{cos}(t) := Σ_i u_i cos(ω(C^x t)_k)(dΛ^{1/2}_{Φ(x)}(t)/dt^i). We can further simplify Eq. (1) and get

Lemma 6 Let t be any point in Φ(M) and u be any vector tangent to Φ(M) at t such that ‖u‖ ≤ 1. Let ǫ be the isometry parameter chosen in Theorem 3. Pick ω ≥ Ω(nα2^{9n}√d/ρǫ). Then

  ‖(DΨ)_t(u)‖² = ‖u‖² + Σ_{x∈X} Λ_{Φ(x)}(t) Σ_{k=1}^n (C^x u)_k² + ζ,        (2)

where |ζ| ≤ ǫ/2.

We will use this derivation of ‖(DΨ)_t(u)‖² to study the combined effect of Ψ ∘ Φ on M in Section 5.4.

5.3 Effects of Applying Ψ (Algorithm II)

The goal of the second algorithm is to apply the spiralling corrections while using the coordinates more economically. We achieve this goal by applying them sequentially in the same embedding space (rather than simultaneously by making use of 2nK extra coordinates as done in the first algorithm); see also [Nas54]. Since all the corrections will be sharing the same coordinate space, one needs to keep track of a pair of normal vectors in order to prevent interference among the different local corrections.

More specifically, Ψ : R^d → R^{2d+3} (in Algorithm II) is defined recursively as Ψ := Ψ_{|X|,n} such that (see also Embedding II in Section 4)

  Ψ_{i,j}(t) := Ψ_{i,j−1}(t) + η_{i,j}(t)(√(Λ_{Φ(x_i)}(t))/ω_{i,j}) sin(ω_{i,j}(C^{x_i} t)_j) + ν_{i,j}(t)(√(Λ_{Φ(x_i)}(t))/ω_{i,j}) cos(ω_{i,j}(C^{x_i} t)_j),

where Ψ_{i,0}(t) := Ψ_{i−1,n}(t), and the base function Ψ_{0,n}(t) is given as t ↦ (t, 0, ..., 0) (d+3 zeros). η_{i,j}(t) and ν_{i,j}(t) are mutually orthogonal unit vectors that are approximately normal to Ψ_{i,j−1}(ΦM) at Ψ_{i,j−1}(t). In this section we assume that the normals η and ν have the following properties:

- |η_{i,j}(t) · v| ≤ ǫ_0 and |ν_{i,j}(t) · v| ≤ ǫ_0 for all unit-length v tangent to Ψ_{i,j−1}(ΦM) at Ψ_{i,j−1}(t). (quality of normal approximation)

l l - For all 1 l d, we have dηi,j (t)/dt Ki,j and dνi,j (t)/dt Ki,j . (bounded directional derivatives)≤ ≤ k k ≤ k k ≤

We refer the reader to Section A.9 for details on how to estimate such normals.

Now, as before, representing a tangent vector u = Σ_l u_l e^l (such that ‖u‖ ≤ 1) in terms of its basis vectors, it suffices to study how DΨ acts on the basis vectors. Observe that (DΨ_{i,j})_t(e^l) = (dΨ_{i,j}(t)/dt^l) |_t ∈ R^{2d+3}, with the k-th component given as

  (dΨ_{i,j−1}(t)/dt^l)_k + (η_{i,j}(t))_k √(Λ_{Φ(x_i)}(t)) C^{x_i}_{j,l} B^{i,j}_{cos}(t) − (ν_{i,j}(t))_k √(Λ_{Φ(x_i)}(t)) C^{x_i}_{j,l} B^{i,j}_{sin}(t)
    + (1/ω_{i,j}) [ (dη_{i,j}(t)/dt^l)_k √(Λ_{Φ(x_i)}(t)) B^{i,j}_{sin}(t) + (dν_{i,j}(t)/dt^l)_k √(Λ_{Φ(x_i)}(t)) B^{i,j}_{cos}(t)
    + (η_{i,j}(t))_k (dΛ^{1/2}_{Φ(x_i)}(t)/dt^l) B^{i,j}_{sin}(t) + (ν_{i,j}(t))_k (dΛ^{1/2}_{Φ(x_i)}(t)/dt^l) B^{i,j}_{cos}(t) ],

where B^{i,j}_{cos}(t) := cos(ω_{i,j}(C^{x_i} t)_j) and B^{i,j}_{sin}(t) := sin(ω_{i,j}(C^{x_i} t)_j). For ease of notation, let R^{k,l}_{i,j} be the terms in the bracket (being multiplied by 1/ω_{i,j}) in the above expression. Then, we have (for any i, j)

  ‖(DΨ_{i,j})_t(u)‖² = ‖Σ_l u_l (DΨ_{i,j})_t(e^l)‖²
    = Σ_{k=1}^{2d+3} [ ζ^{k,1}_{i,j} + ζ^{k,2}_{i,j} + ζ^{k,3}_{i,j} + (1/ω_{i,j}) ζ^{k,4}_{i,j} ]²
    = ‖(DΨ_{i,j−1})_t(u)‖² + Λ_{Φ(x_i)}(t)(C^{x_i} u)_j² + Z_{i,j},        (3)

where

  ζ^{k,1}_{i,j} := Σ_l u_l (dΨ_{i,j−1}(t)/dt^l)_k,
  ζ^{k,2}_{i,j} := (η_{i,j}(t))_k √(Λ_{Φ(x_i)}(t)) cos(ω_{i,j}(C^{x_i} t)_j) Σ_l C^{x_i}_{j,l} u_l,
  ζ^{k,3}_{i,j} := −(ν_{i,j}(t))_k √(Λ_{Φ(x_i)}(t)) sin(ω_{i,j}(C^{x_i} t)_j) Σ_l C^{x_i}_{j,l} u_l,
  ζ^{k,4}_{i,j} := Σ_l u_l R^{k,l}_{i,j},

and Z_{i,j} := Σ_k [ (ζ^{k,4}_{i,j}/ω_{i,j})² + 2(ζ^{k,4}_{i,j}/ω_{i,j})(ζ^{k,1}_{i,j} + ζ^{k,2}_{i,j} + ζ^{k,3}_{i,j}) + 2(ζ^{k,1}_{i,j}ζ^{k,2}_{i,j} + ζ^{k,1}_{i,j}ζ^{k,3}_{i,j}) ],

where the last equality is by expanding the square and by noting that Σ_k ζ^{k,2}_{i,j} ζ^{k,3}_{i,j} = 0 (since η and ν are orthogonal to each other), Σ_k (ζ^{k,1}_{i,j})² = ‖(DΨ_{i,j−1})_t(u)‖², and Σ_k ((ζ^{k,2}_{i,j})² + (ζ^{k,3}_{i,j})²) = Λ_{Φ(x_i)}(t)(C^{x_i} u)_j². The base case ‖(DΨ_{0,n})_t(u)‖² equals ‖u‖².

Again, by picking ω_{i,j} sufficiently large, and by noting that the cross terms Σ_k ζ^{k,1}_{i,j} ζ^{k,2}_{i,j} and Σ_k ζ^{k,1}_{i,j} ζ^{k,3}_{i,j} are very close to zero since η and ν are approximately normal to the tangent vector, we have

Lemma 7 Let t be any point in Φ(M) and u be any vector tangent to Φ(M) at t such that ‖u‖ ≤ 1. Let ǫ be the isometry parameter chosen in Theorem 3. Pick ω_{i,j} ≥ Ω((K_{i,j} + (α9^n/ρ))(nd|X|)²/ǫ) (recall that K_{i,j} is the bound on the directional derivatives of η and ν). If ǫ_0 ≤ O(ǫ/d(n|X|)²) (recall that ǫ_0 is the quality of approximation of the normals η and ν), then we have

  ‖(DΨ)_t(u)‖² = ‖(DΨ_{|X|,n})_t(u)‖² = ‖u‖² + Σ_{i=1}^{|X|} Σ_{j=1}^n Λ_{Φ(x_i)}(t)(C^{x_i} u)_j² + ζ,        (4)

where |ζ| ≤ ǫ/2.

5.4 Combined Effect of Ψ(Φ(M))

We can now analyze the aggregate effect of both our embeddings on the length of an arbitrary unit vector v tangent to M at p. Let u := (DΦ)_p(v) = Φv be the pushforward of v. Then ‖u‖ ≤ 1 (cf. Lemma 4). See also Figure 4.

Now, recalling that D(Ψ ∘ Φ) = DΨ ∘ DΦ, and noting that pushforward maps are linear, we have ‖(D(Ψ ∘ Φ))_p(v)‖² = ‖(DΨ)_{Φ(p)}(u)‖². Thus, representing u as Σ_i u_i e^i in the ambient coordinates of R^d, and using Eq. (2) (for Algorithm I) or Eq. (4) (for Algorithm II), we get

  ‖(D(Ψ ∘ Φ))_p(v)‖² = ‖(DΨ)_{Φ(p)}(u)‖² = ‖u‖² + Σ_{x∈X} Λ_{Φ(x)}(Φ(p)) ‖C^x u‖² + ζ,

where |ζ| ≤ ǫ/2. We can give simple lower and upper bounds for the above expression by noting that Λ is a localization function. Define N_p := {x ∈ X : ‖Φ(x) − Φ(p)‖ < ρ} as the neighborhood around p (ρ as per the theorem statement). Then only the points in N_p contribute to the above equation, since Λ_{Φ(x)}(Φ(p)) = dΛ_{Φ(x)}(Φ(p))/dt^i = 0 for ‖Φ(x) − Φ(p)‖ ≥ ρ. Also note that for all x ∈ N_p, ‖x − p‖ < 2ρ (cf. Lemma 4).

Let x_M := arg max_{x∈N_p} ‖C^x u‖² and x_m := arg min_{x∈N_p} ‖C^x u‖² be the quantities that attain the maximum and the minimum, respectively. Then:

  ‖u‖² + ‖C^{x_m} u‖² − ǫ/2 ≤ ‖(D(Ψ ∘ Φ))_p(v)‖² ≤ ‖u‖² + ‖C^{x_M} u‖² + ǫ/2.        (5)

Notice that ideally we would like to have the correction factor "C^p u" in Eq. (5), since that would give the perfect stretch around the point p. But what about the correction C^x u for nearby x's? The following lemma helps us continue in this situation.

Lemma 8 Let p, v, u be as above. For any x ∈ N_p ⊂ X, let C^x and F_x also be as discussed above (recall that ‖p − x‖ < 2ρ, and X ⊂ M forms a bounded (ρ, δ)-cover of the fixed underlying manifold M with condition number 1/τ). Define ξ := (4ρ/τ) + δ + 4√(ρδ/τ). If ρ ≤ τ/4 and δ ≤ d/32D, then

  1 − ‖u‖² − 40 max{√(ξD/d), ξD/d} ≤ ‖C^x u‖² ≤ 1 − ‖u‖² + 51 max{√(ξD/d), ξD/d}.

Note that we chose ρ ≤ (τd/D)(ǫ/350)² and δ ≤ (d/D)(ǫ/250)² (cf. the theorem statement). Thus, combining Eq. (5) and Lemma 8, we get (recall ‖v‖ = 1)

  (1 − ǫ)‖v‖² ≤ ‖(D(Ψ ∘ Φ))_p(v)‖² ≤ (1 + ǫ)‖v‖².

So far we have shown that our embedding approximately preserves the length of a fixed tangent vector at a fixed point. Since the choice of the vector and the point was arbitrary, it follows that our embedding approximately preserves the tangent vector lengths throughout the embedded manifold, uniformly. We will now show that preserving the tangent vector lengths implies preserving the geodesic curve lengths.

5.5 Preservation of the Geodesics

Pick any two (path-connected) points p and q in M, and let α be the geodesic⁵ path between p and q. Further, let p̄, q̄ and ᾱ be the images of p, q and α under our embedding. Note that ᾱ is not necessarily the geodesic path between p̄ and q̄; thus we need an extra piece of notation: let β̄ be the geodesic path between p̄ and q̄ (in the embedded manifold) and β be its inverse image in M. We need to show (1 − ǫ)L(α) ≤ L(β̄) ≤ (1 + ǫ)L(α), where L(·) denotes the length of a path (end points are understood).

First recall that for any differentiable map F and curve γ, γ̄ = F(γ) ⇒ γ̄′ = (DF)(γ′). By the (1 ± ǫ)-isometry of tangent vectors, this immediately gives us (1 − ǫ)L(γ) ≤ L(γ̄) ≤ (1 + ǫ)L(γ) for any path γ in M and its image γ̄ in the embedding of M. So,

  (1 − ǫ)D_G(p,q) = (1 − ǫ)L(α) ≤ (1 − ǫ)L(β) ≤ L(β̄) = D_G(p̄, q̄).

Similarly,

  D_G(p̄, q̄) = L(β̄) ≤ L(ᾱ) ≤ (1 + ǫ)L(α) = (1 + ǫ)D_G(p,q).

⁵Globally, geodesic paths between points are not necessarily unique; we are interested in a path that yields the shortest distance between the points.

6 Conclusion

This work provides two simple algorithms for approximate isometric embedding of manifolds. Our algorithms are similar in spirit to Nash's C¹ construction [Nas54], and manage to remove the dependence on the isometry constant ǫ from the target dimension. One should observe that this dependence does, however, show up in the sampling density required to make the necessary corrections.

The correction procedure discussed here can also be readily adapted to create isometric embeddings from any manifold embedding procedure (under some mild conditions). Take any off-the-shelf manifold embedding algorithm A (such as LLE, Laplacian Eigenmaps, etc.) that maps an n-manifold in, say, d dimensions, but does not necessarily guarantee an approximately isometric embedding. Then, as long as one can ensure that the embedding produced by A is a one-to-one contraction⁶ (basically ensuring conditions similar to Lemma 4), we can apply corrections similar to those discussed in Algorithms I or II to produce an approximately isometric embedding of the given manifold in slightly higher dimensions. In this sense, the correction procedure presented here serves as a universal procedure for approximate isometric manifold embeddings.

Acknowledgements

The author would like to thank Sanjoy Dasgupta for introducing the subject, and for his guidance throughout the project.

References

[BN03] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.
[BW07] R. Baraniuk and M. Wakin. Random projections of smooth manifolds. Foundations of Computational Mathematics, 2007.
[Cla07] K. Clarkson. Tighter bounds for random projections of manifolds. Comp. Geometry, 2007.
[DF08] S. Dasgupta and Y. Freund. Random projection trees and low dimensional manifolds. ACM Symposium on Theory of Computing, 2008.
[DG03] D. Donoho and C. Grimes. Hessian eigenmaps: locally linear embedding techniques for high dimensional data. Proc. of National Academy of Sciences, 100(10):5591–5596, 2003.
[DGGZ02] T. Dey, J. Giesen, S. Goswami, and W. Zhao. Shape dimension and approximation from samples. Symposium on Discrete Algorithms, 2002.
[GW03] J. Giesen and U. Wagner. Shape dimension and intrinsic metric from samples of manifolds with high co-dimension. Symposium on Computational Geometry, 2003.
[HH06] Q. Han and J. Hong. Isometric embedding of Riemannian manifolds in Euclidean spaces. American Mathematical Society, 2006.
[JL84] W. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Conf. in Modern Analysis and Probability, pages 189–206, 1984.
[Kui55] N. Kuiper. On C¹-isometric embeddings, I, II. Indag. Math., 17:545–556, 683–689, 1955.
[Mil72] J. Milnor. Topology from the differential viewpoint. Univ. of Virginia Press, 1972.
[Nas54] J. Nash. C¹ isometric imbeddings. Annals of Mathematics, 60(3):383–396, 1954.
[Nas56] J. Nash. The imbedding problem for Riemannian manifolds. Annals of Mathematics, 63(1):20–63, 1956.
[NSW06] P. Niyogi, S. Smale, and S. Weinberger. Finding the homology of submanifolds with high confidence from random samples. Disc. Computational Geometry, 2006.
[RS00] S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290, 2000.
[TdSL00] J. Tenebaum, V. de Silva, and J. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290, 2000.
[Whi36] H. Whitney. Differentiable manifolds. Annals of Mathematics, 37:645–680, 1936.
[WS04] K. Weinberger and L. Saul. Unsupervised learning of image manifolds by semidefinite programming. Computer Vision and Pattern Recognition, 2004.

⁶One can modify A to produce a contraction by simple scaling.

A Appendix

A.1 Properties of a Well-conditioned Manifold

Throughout this section we will assume that M is a compact submanifold of R^D of dimension n and condition number 1/τ. The following are some properties of such a manifold that will be useful throughout the text.

Lemma 9 (relating nearby tangent vectors – implicit in the proof of Proposition 6.2 of [NSW06]) Pick any two (path-connected) points p, q ∈ M. Let u ∈ T_pM be a unit-length tangent vector and v ∈ T_qM be its parallel transport along the (shortest) geodesic path to q. Then⁷, i) u · v ≥ 1 − D_G(p,q)/τ, ii) ‖u − v‖ ≤ √(2D_G(p,q)/τ).

Lemma 10 (relating geodesic distances to ambient distances – Proposition 6.3 of [NSW06]) If p, q ∈ M are such that ‖p − q‖ ≤ τ/2, then D_G(p,q) ≤ τ(1 − √(1 − 2‖p − q‖/τ)) ≤ 2‖p − q‖.

Lemma 11 (projection of a section of a manifold onto the tangent space) Pick any p ∈ M and define M_{p,r} := {q ∈ M : ‖q − p‖ ≤ r}. Let f denote the orthogonal linear projection of M_{p,r} onto the tangent space T_pM. Then, for any r ≤ τ/2,

(i) the map f : M_{p,r} → T_pM is 1−1. (see Lemma 5.4 of [NSW06])
(ii) for any x, y ∈ M_{p,r}, ‖f(x) − f(y)‖² ≥ (1 − (r/τ)²)·‖x − y‖². (implicit in the proof of Lemma 5.3 of [NSW06])

Lemma 12 (coverings of a section of a manifold) Pick any p ∈ M and define M_{p,r} := {q ∈ M : ‖q − p‖ ≤ r}. If r ≤ τ/2, then there exists C ⊂ M_{p,r} of size at most 9^n with the property: for any p′ ∈ M_{p,r}, there exists c ∈ C such that ‖p′ − c‖ ≤ r/2.

Proof: The proof closely follows the arguments presented in the proof of Theorem 22 of [DF08]. For r ≤ τ/2, note that M_{p,r} ⊂ R^D is (path-)connected. Let f denote the projection of M_{p,r} onto T_pM ≅ R^n. Quickly note that f is 1−1 (see Lemma 11(i)). Then, f(M_{p,r}) ⊂ R^n is contained in an n-dimensional ball of radius r. By standard volume arguments, f(M_{p,r}) can be covered by at most 9^n balls of radius r/4. WLOG we can assume that the centers of these covering balls are in f(M_{p,r}). Now, noting that the inverse image of each of these covering balls (in R^n) is contained in a D-dimensional ball of radius r/2 (see Lemma 11(ii)) finishes the proof.

Lemma 13 (relating nearby manifold points to tangent vectors) Pick any point p ∈ M and let q ∈ M (distinct from p) be such that D_G(p,q) ≤ τ. Let v ∈ T_pM be the projection of the vector q − p onto T_pM. Then, i) (v/‖v‖) · ((q − p)/‖q − p‖) ≥ 1 − (D_G(p,q)/2τ)², ii) ‖v/‖v‖ − (q − p)/‖q − p‖‖ ≤ D_G(p,q)/(τ√2).

Proof: If the vectors v and q − p are in the same direction, we are done. Otherwise, consider the plane spanned by the vectors v and q − p. Since M has condition number 1/τ, we know that the point q cannot lie within any τ-ball tangent to M at p (see Figure 6). Consider such a τ-ball (with center c) whose center is closest to q, and let q′ be the point on the surface of the ball which subtends the same angle (∠pcq′) as the angle formed by q (∠pcq). Let this angle be called θ. Then, using the cosine rule, we have cos θ = 1 − ‖q′ − p‖²/2τ². Define α as the angle subtended by the vectors v and q − p, and α′ as the angle subtended by the vectors v and q′ − p. WLOG we can assume that the angles α and α′ are less than π. Then, cos α ≥ cos α′ = cos(θ/2). Using the trig identity cos θ = 2cos²(θ/2) − 1, and noting ‖q − p‖² ≥ ‖q′ − p‖², we have

  (v/‖v‖) · ((q − p)/‖q − p‖) = cos α ≥ cos(θ/2) ≥ √(1 − ‖q − p‖²/4τ²) ≥ 1 − (D_G(p,q)/2τ)².

Now, by applying the cosine rule, we have ‖v/‖v‖ − (q − p)/‖q − p‖‖² = 2(1 − cos α). The lemma follows.

⁷Technically, it is not possible to directly compare two vectors that reside in different tangent spaces. However, since we only deal with manifolds that are immersed in some ambient space, we can treat the tangent spaces as n-dimensional affine subspaces. We can thus parallel translate the vectors to the origin of the ambient space, and do the necessary comparison (such as take the dot product, etc.). We will make a similar abuse of notation for any calculation that uses vectors from different affine subspaces, to mean that we first translate the vectors and then perform the necessary calculation.


Figure 6: Plane spanned by vectors q − p and v ∈ T_pM (where v is the projection of q − p onto T_pM), with τ-balls tangent to p. Note that q′ is the point on the ball such that ∠pcq = ∠pcq′ = θ.

Lemma 14 (approximating the tangent space by nearby samples) Let 0 < δ ≤ 1. Pick any point p_0 ∈ M and let p_1, ..., p_n ∈ M be n points distinct from p_0 such that (for all 1 ≤ i ≤ n)

(i) D_G(p_0, p_i) ≤ τδ/√n,
(ii) |(p_i − p_0)/‖p_i − p_0‖ · (p_j − p_0)/‖p_j − p_0‖| ≤ 1/2n (for i ≠ j).

Let T̂ be the n-dimensional subspace spanned by the vectors {p_i − p_0}_{i∈[n]}. For any unit vector û ∈ T̂, let u be the projection of û onto T_{p_0}M. Then, û · u/‖u‖ ≥ 1 − δ.

Proof: Define the vectors v̂_i := (p_i − p_0)/‖p_i − p_0‖ (for 1 ≤ i ≤ n). Observe that {v̂_i}_{i∈[n]} forms a basis of T̂. For 1 ≤ i ≤ n, define v_i as the projection of the vector v̂_i onto T_{p_0}M. Also note that by applying Lemma 13, we have that for all 1 ≤ i ≤ n, ‖v̂_i − v_i‖² ≤ δ²/2n.

Let V = [v̂_1, ..., v̂_n] be the D × n matrix. We represent the unit vector û as Vα = Σ_i α_i v̂_i. Also, since u is the projection of û, we have u = Σ_i α_i v_i. Then, ‖α‖² ≤ 2. To see this, we first identify T̂ with R^n via an isometry S (a linear map that preserves the lengths and angles of all vectors in T̂). Note that S can be represented as an n × D matrix, and since V forms a basis for T̂, SV is an n × n invertible matrix. Then, since Sû = SVα, we have α = (SV)^{−1}Sû. Thus (recall ‖Sû‖ = 1),

  ‖α‖² ≤ max_{x∈S^{n−1}} ‖(SV)^{−1}x‖² = λ_max((SV)^{−T}(SV)^{−1}) = λ_max((SV)^{−1}(SV)^{−T}) = λ_max((V^TV)^{−1}) = 1/λ_min(V^TV) ≤ 1/(1 − (n−1)/2n) ≤ 2,

where i) λ_max(A) and λ_min(A) denote the largest and smallest eigenvalues of a square symmetric matrix A, respectively, and ii) the last inequality is by noting that V^TV is an n × n matrix with 1's on the diagonal and entries of magnitude at most 1/2n off the diagonal, and applying the Gershgorin circle theorem. Now we can bound the quantity of interest. Note that

  û · u/‖u‖ ≥ |û · (û − (û − u))| ≥ 1 − ‖û − u‖ = 1 − ‖Σ_i α_i(v̂_i − v_i)‖ ≥ 1 − Σ_i |α_i|‖v̂_i − v_i‖ ≥ 1 − (δ/√(2n)) Σ_i |α_i| ≥ 1 − δ,

where the last inequality is by noting ‖α‖_1 ≤ √(2n).

A.2 On Constructing a Bounded Manifold Cover

Given a compact n-manifold M ⊂ R^D with condition number 1/τ, and some 0 < δ ≤ 1, we can construct an α-bounded (ρ, δ)-cover X of M (with α ≤ 2^{10n+1} and ρ ≤ τδ/(3√2 n)) as follows.

Set ρ ≤ τδ/(3√2 n) and pick a (ρ/2)-net C of M (that is, C ⊂ M such that: i. for c, c′ ∈ C with c ≠ c′, ‖c − c′‖ ≥ ρ/2; ii. for all p ∈ M, there exists c ∈ C such that ‖c − p‖ < ρ/2). WLOG we shall assume that all points of C are in the interior of M. Then, for each c ∈ C, define M_{c,ρ/2} := {p ∈ M : ‖p − c‖ ≤ ρ/2}, and the orthogonal projection map f_c : M_{c,ρ/2} → T_cM that projects M_{c,ρ/2} onto T_cM (note that, cf. Lemma 11(i), f_c is 1−1).

15 n 11(i), fc is 1 1). Note that TcM can be identified with R with the c as the origin. We will denote the origin (c) − (c) as x0 , that is, x0 = fc(c). (c) Now, let Bc be any n-dimensional closed ball centered at the origin x0 TcM of radius r > 0 that is ∈ (c) (c) completely contained in fc(Mc,ρ/2) (that is, Bc fc(Mc,ρ/2)). Pick a set of n points x1 ,...,xn on the (c) (c) (⊂c) (c) surface of the ball Bc such that (xi x0 ) (xj x0 ) = 0 for i = j. Define the bounded manifold cover− as · − 6 −1 (c) X := fc (xi ). (6) c∈C,i[=0,...,n Lemma 15 Let 0 <δ 1 and ρ τδ/3√2n. Let C be a (ρ/2)-net of M as described above, and X be as in Eq. (6). Then X forms≤ a 210n+1≤-bounded (ρ, δ) cover of M.

Proof: Pick any point p M and define Xp := x X : x p < ρ . Let c C be such that p c < ρ/2. Then X has∈ the following properties.{ ∈ k − k } ∈ k − k p −1 (c) −1 (c) Covering criterion: For 0 i n, since fc (xi ) c ρ/2 (by construction), we have fc (xi ) −1 (c) ≤ ≤ k − k ≤ −1 (c) k −1 (c) − p < ρ. Thus, fc (xi ) Xp (for 0 i n). Now, for 1 i n, noting that DG(fc (xi ),fc (x0 )) k ∈ ≤ ≤ ≤ ≤ −1 (c) −1 (c) ≤ −1 (c) −1 (c) (c) fc (xi )−fc (x0 ) 2 fc (xi ) fc (x0 ) ρ (cf. Lemma 10), we have that for the vector vˆi := −1 (c) −1 (c) and k − k≤ kfc (xi )−fc (x0 )k (c) (c) (c) xi −x0 (c) (c) its (normalized) projection vi := (c) (c) onto TcM, vˆi vi ρ/√2τ (cf. Lemma 13). Thus, kxi −x0 k − ≤ for i = j, we have (recall, by construction, we have v(c) v( c) = 0) 6 i · j vˆ(c) vˆ(c) = (ˆv(c) v(c) + v(c)) (ˆv(c) v(c) + v(c)) | i · j | | i − i i · j − j j | = (ˆv(c) v(c)) (ˆv(c) v(c))+ v(c) (ˆv(c) v(c))+(ˆv(c) v(c)) v(c) | i − i · j − j i · j − j i − i · j | (ˆv(c) v(c)) (ˆv(c) v(c)) + vˆ(c) v(c) + vˆ(c) v(c) ≤ k i − i kk j − j k k i − i k k j − j k 3ρ/√2τ 1/2n. ≤ ≤ Point representation criterion: There exists x X , namely f −1(x(c)) (= c), such that p x ρ/2. ∈ p c 0 k − k≤ Local boundedness criterion: Define M := q M : q p < 3ρ/2 . Note that X f −1(x(c)): p,3ρ/2 { ∈ k − k } p ⊂ { c i c C Mp,3ρ/2, 0 i n . Now, using Lemma 12 we have that exists a cover N Mp,3ρ/2 of size ∈ ∩ 3n ≤ ≤ } ⊂ at most 9 such that for any point q Mp,3ρ/2, there exists n N such that q n < ρ/4. Note that, by construction of C, there cannot be an∈ n N such that it is within∈ distance ρ/k4 of− twok (or more) distinct c,c′ C (since otherwise the distance c ∈c′ will be less than ρ/2, contradicting the packing of C). Thus, C ∈M 93n. It follows that Xk − (kn + 1)93n 210n+1. | ∩ p,3ρ/2|≤ | p|≤ ≤ Tangent space approximation criterion: Let Tˆ be the n-dimensional span of vˆ(c) (note that Tˆ may p { i }i∈[n] p not necessarily pass through p). Then, for any unit vector uˆ Tˆp, we need to show that its projection up up ∈ onto TpM has the property uˆ 1 δ. Let θ be the angle between vectors uˆ and up. Let uc be the | · kupk | ≥ − projection of uˆ onto TcM, and θ1 be the angle between vectors uˆ and uc, and let θ2 be the angle between vectors uc (at c) and its parallel transport along the geodesic path to p. WLOG we can assume that θ1 and θ2 are at most π/2. Then, θ θ + θ π. We get the bound on the individual angles as follows. By applying ≤ 1 2 ≤ Lemma 14, cos(θ1) 1 δ/4, and by applying Lemma 9, cos(θ2) 1 δ/4. Finally, by using Lemma 16, up ≥ − ≥ − we have uˆ = cos(θ) cos(θ1 + θ2) 1 δ. · kupk ≥ ≥ −

Lemma 16 Let $0 \le \epsilon_1, \epsilon_2 \le 1$. If $\cos\alpha \ge 1 - \epsilon_1$ and $\cos\beta \ge 1 - \epsilon_2$, then $\cos(\alpha+\beta) \ge 1 - \epsilon_1 - \epsilon_2 - 2\sqrt{\epsilon_1\epsilon_2}$.

Proof: Applying the identity $\sin\theta = \sqrt{1 - \cos^2\theta}$ immediately yields $\sin\alpha \le \sqrt{2\epsilon_1}$ and $\sin\beta \le \sqrt{2\epsilon_2}$. Now, $\cos(\alpha+\beta) = \cos\alpha\cos\beta - \sin\alpha\sin\beta \ge (1-\epsilon_1)(1-\epsilon_2) - 2\sqrt{\epsilon_1\epsilon_2} \ge 1 - \epsilon_1 - \epsilon_2 - 2\sqrt{\epsilon_1\epsilon_2}$.
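As a quick numerical sanity check (not part of the original argument), the following Python snippet verifies the inequality of Lemma 16 over a grid of angles; the grid resolution is an arbitrary choice of ours.

```python
import numpy as np

# Spot-check Lemma 16: if cos(a) >= 1 - e1 and cos(b) >= 1 - e2, then
# cos(a + b) >= 1 - e1 - e2 - 2*sqrt(e1*e2).
angles = np.linspace(0.0, np.pi / 2, 200)   # grid resolution is arbitrary
for a in angles:
    for b in angles:
        e1, e2 = 1 - np.cos(a), 1 - np.cos(b)   # tightest admissible epsilons
        lower = 1 - e1 - e2 - 2 * np.sqrt(e1 * e2)
        assert np.cos(a + b) >= lower - 1e-12    # tolerance for float round-off
print("Lemma 16 holds on the grid.")
```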

Remark 6 A dense enough sample from $M$ constitutes a bounded cover. One can selectively prune the dense sampling to control the total number of points in each neighborhood while still maintaining the cover properties.
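To make Remark 6 concrete, here is a minimal Python sketch (ours, not the paper's) of the greedy pruning step that extracts a $(\rho/2)$-net from a dense sample; the circle data and the radius value are hypothetical.

```python
import numpy as np

def greedy_net(sample: np.ndarray, radius: float) -> np.ndarray:
    """Greedily prune `sample` to a `radius`-net: kept points are pairwise
    at least `radius` apart, and every sample point lies within `radius`
    of some kept point. Runs in O(|sample| * |net|) time."""
    centers: list[np.ndarray] = []
    for p in sample:
        if all(np.linalg.norm(p - c) >= radius for c in centers):
            centers.append(p)
    return np.array(centers)

# Hypothetical usage: a dense sample from the unit circle (a 1-manifold in R^2).
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, size=5000)
sample = np.c_[np.cos(theta), np.sin(theta)]
C = greedy_net(sample, radius=0.05)      # plays the role of the (rho/2)-net C
print(f"pruned {len(sample)} sample points down to a net of size {len(C)}")
```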

A.3 Bounding the Number of Subsets K in Embedding I

By construction (see the preprocessing stage of Embedding I), $K = \max_{x\in X} |X \cap B(x, 2\rho)|$ (where $B(x,r)$ denotes a Euclidean ball centered at $x$ of radius $r$). That is, $K$ is the largest number of points of $X$ that are within a $2\rho$ ball of some $x \in X$.

Now, pick any $x \in X$ and consider the set $M_x := M \cap B(x, 2\rho)$. Then, if $\rho \le \tau/4$, $M_x$ can be covered by $2^{cn}$ balls of radius $\rho$ (see Lemma 12). Recalling that $X$ forms an $\alpha$-bounded $(\rho,\delta)$-cover, we have $|X \cap B(x, 2\rho)| = |X \cap M_x| \le \alpha 2^{cn}$ (where $c \le 4$).

A.4 Proof of Lemma 4

Since $R$ is a random orthoprojector from $\mathbb{R}^D$ to $\mathbb{R}^d$, it follows that

Lemma 17 (random projection of $n$-manifolds – adapted from Theorem 1.5 of [Cla07]) Let $M$ be a smooth compact $n$-manifold with volume $V$ and condition number $1/\tau$. Let $\bar{R} := \sqrt{D/d}\,R$ be a scaling of $R$. Pick any $0 < \epsilon \le 1$ and $0 < \delta \le 1$. If $d = \Omega\big(\epsilon^{-2}\log(V/\tau^n) + \epsilon^{-2}n\log(1/\epsilon) + \ln(1/\delta)\big)$, then with probability at least $1 - \delta$, for all $p,q \in M$,

$$(1-\epsilon)\,\|p-q\| \;\le\; \|\bar{R}p - \bar{R}q\| \;\le\; (1+\epsilon)\,\|p-q\|.$$

We apply this result with $\epsilon = 1/4$. Then, for $d = \Omega(\log(V/\tau^n) + n)$, with probability at least $1 - 1/\mathrm{poly}(n)$, we have $(3/4)\|p-q\| \le \|\bar{R}p - \bar{R}q\| \le (5/4)\|p-q\|$. Now let $\Phi : \mathbb{R}^D \to \mathbb{R}^d$ be defined as $\Phi x := (2/3)\bar{R}x = (2/3)\sqrt{D/d}\,Rx$ (as per the lemma statement). Then we immediately get $(1/2)\|p-q\| \le \|\Phi p - \Phi q\| \le (5/6)\|p-q\|$. Also note that for any $x \in \mathbb{R}^D$, we have $\|\Phi x\| = (2/3)\sqrt{D/d}\,\|Rx\| \le (2/3)\sqrt{D/d}\,\|x\|$ (since $R$ is an orthoprojector).

Finally, for any point $p \in M$, a unit vector $u$ tangent to $M$ at $p$ can be approximated arbitrarily well by considering a sequence $\{p_i\}_i$ of points (in $M$) converging to $p$ such that $(p_i - p)/\|p_i - p\|$ converges to $u$. Since for all points $p_i$ we have $(1/2) \le \|\Phi p_i - \Phi p\|/\|p_i - p\| \le (5/6)$ (with high probability), it follows that $(1/2) \le \|(D\Phi)_p(u)\| \le (5/6)$.
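For concreteness, the following numpy sketch instantiates such a map $\Phi$: it draws a random orthoprojector $R$ via the QR factorization of a Gaussian matrix (one standard construction; the precise distribution assumed in [Cla07] may differ) and spot-checks the $[1/2, 5/6]$ distortion band on random pairs of points.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 1000, 50                    # illustrative ambient/target dimensions

# Random orthoprojector R: its d rows form an orthonormal frame in R^D.
Q, _ = np.linalg.qr(rng.standard_normal((D, d)))   # D x d, orthonormal columns
R = Q.T

def Phi(x: np.ndarray) -> np.ndarray:
    """Phi x := (2/3) * sqrt(D/d) * R x, as in the proof of Lemma 4."""
    return (2.0 / 3.0) * np.sqrt(D / d) * (R @ x)

# Spot-check (1/2)||p-q|| <= ||Phi p - Phi q|| <= (5/6)||p-q|| on random pairs.
# (The lemma's guarantee is for points of M; random pairs are just a sanity check.)
ratios = []
for _ in range(1000):
    p, q = rng.standard_normal(D), rng.standard_normal(D)
    ratios.append(np.linalg.norm(Phi(p) - Phi(q)) / np.linalg.norm(p - q))
print(f"distortion ratios: min {min(ratios):.3f}, max {max(ratios):.3f}")
```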

A.5 Proof of Corollary 5

Let $v_x^1$ and $v_x^n$ ($\in \mathbb{R}^n$) be the right singular vectors corresponding to the singular values $\sigma_x^1$ and $\sigma_x^n$, respectively, of the matrix $\Phi F_x$. Then quickly note that $\sigma_x^1 = \|\Phi F_x v_x^1\|$ and $\sigma_x^n = \|\Phi F_x v_x^n\|$. Since $F_x$ is orthonormal, we have $\|F_x v_x^1\| = \|F_x v_x^n\| = 1$. Now, since $F_x v_x^n$ is in the span of the column vectors of $F_x$, by the sampling condition (cf. Definition 2) there exists a unit-length vector $\bar{v}_x^n$ tangent to $M$ (at $x$) such that $|F_x v_x^n \cdot \bar{v}_x^n| \ge 1 - \delta$. Thus, decomposing $F_x v_x^n$ into two vectors $a_x^n$ and $b_x^n$ such that $a_x^n \perp b_x^n$ and $a_x^n := (F_x v_x^n \cdot \bar{v}_x^n)\bar{v}_x^n$, we have

$$\sigma_x^n = \|\Phi(F_x v_x^n)\| = \|\Phi((F_x v_x^n \cdot \bar{v}_x^n)\bar{v}_x^n) + \Phi b_x^n\| \ge (1-\delta)\,\|\Phi\bar{v}_x^n\| - \|\Phi b_x^n\| \ge (1-\delta)(1/2) - (2/3)\sqrt{2\delta D/d},$$

since $\|b_x^n\|^2 = \|F_x v_x^n\|^2 - \|a_x^n\|^2 \le 1 - (1-\delta)^2 \le 2\delta$ and $\|\Phi b_x^n\| \le (2/3)\sqrt{D/d}\,\|b_x^n\| \le (2/3)\sqrt{2\delta D/d}$.

Similarly, decomposing $F_x v_x^1$ into two vectors $a_x^1$ and $b_x^1$ such that $a_x^1 \perp b_x^1$ and $a_x^1 := (F_x v_x^1 \cdot \bar{v}_x^1)\bar{v}_x^1$, we have

$$\sigma_x^1 = \|\Phi(F_x v_x^1)\| = \|\Phi((F_x v_x^1 \cdot \bar{v}_x^1)\bar{v}_x^1) + \Phi b_x^1\| \le \|\Phi\bar{v}_x^1\| + \|\Phi b_x^1\| \le (5/6) + (2/3)\sqrt{2\delta D/d},$$

where the last inequality is by noting $\|\Phi b_x^1\| \le (2/3)\sqrt{2\delta D/d}$. Now, by our choice of $\delta$ ($\le d/32D$), and by noting that $d \le D$, the corollary follows.
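The singular values $\sigma_x^1$ and $\sigma_x^n$ that the corollary controls are directly computable; below is a hedged sketch with a random stand-in for the tangent basis $F_x$ (so the printed values are merely illustrative and do not verify the sampling condition itself).

```python
import numpy as np

rng = np.random.default_rng(1)
D, d, n = 1000, 50, 4              # illustrative dimensions only

Q, _ = np.linalg.qr(rng.standard_normal((D, d)))
PhiMat = (2.0 / 3.0) * np.sqrt(D / d) * Q.T        # Phi as an explicit d x D matrix

Fx, _ = np.linalg.qr(rng.standard_normal((D, n)))  # D x n orthonormal columns:
                                                   # a stand-in for the basis F_x
svals = np.linalg.svd(PhiMat @ Fx, compute_uv=False)
print(f"sigma^1 = {svals[0]:.3f}, sigma^n = {svals[-1]:.3f}")
# Corollary 5 asserts (under the sampling condition and the choice of delta)
# roughly sigma^n >= 1/4 and sigma^1 <= 5/6 plus small slack; for this random
# F_x the values simply concentrate near 2/3.
```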

A.6 Proof of Lemma 6

We can simplify Eq. (1) by recalling how the subsets $X^{(j)}$ were constructed (see the preprocessing stage of Embedding I). Note that for any fixed $t$, at most one term in the set $\{\Lambda_{\Phi(x)}(t)\}_{x\in X^{(j)}}$ is non-zero. Thus,

$$\|(D\Psi)_t(u)\|^2 = \sum_{k=1}^{d} u_k^2 + \sum_{k=1}^{n}\sum_{x\in X}\Big[\Lambda_{\Phi(x)}(t)\cos^2\!\big(\omega(C^xt)_k\big)(C^xu)_k^2 + \Lambda_{\Phi(x)}(t)\sin^2\!\big(\omega(C^xt)_k\big)(C^xu)_k^2\Big] + \frac{\zeta_1 + \zeta_2 + \zeta_3}{\omega},$$

with

$$\zeta_1 := \sum_{k,x}\big(A^{k,x}_{\sin}(t)^2 + A^{k,x}_{\cos}(t)^2\big)/\omega, \qquad \zeta_2 := \sum_{k,x} 2A^{k,x}_{\sin}(t)\sqrt{\Lambda_{\Phi(x)}(t)}\cos\!\big(\omega(C^xt)_k\big)(C^xu)_k, \qquad \zeta_3 := -\sum_{k,x} 2A^{k,x}_{\cos}(t)\sqrt{\Lambda_{\Phi(x)}(t)}\sin\!\big(\omega(C^xt)_k\big)(C^xu)_k,$$

so that

$$\|(D\Psi)_t(u)\|^2 = \|u\|^2 + \sum_{x\in X}\Lambda_{\Phi(x)}(t)\sum_{k=1}^{n}(C^xu)_k^2 + \zeta,$$

where $\zeta := (\zeta_1 + \zeta_2 + \zeta_3)/\omega$. Noting that i) the terms $|A^{k,x}_{\sin}(t)|$ and $|A^{k,x}_{\cos}(t)|$ are at most $O(\alpha 9^n\sqrt{d}/\rho)$ (see Lemma 18), ii) $|(C^xu)_k| \le 4$, and iii) $\Lambda_{\Phi(x)}(t) \le 1$, we can pick $\omega$ sufficiently large (say, $\omega \ge \Omega(n\alpha^2 9^n\sqrt{d}/\rho\epsilon)$) such that $|\zeta| \le \epsilon/2$ (where $\epsilon$ is the isometry constant from our main theorem).

Lemma 18 For all $k$, $x$ and $t$, the terms $|A^{k,x}_{\sin}(t)|$ and $|A^{k,x}_{\cos}(t)|$ are at most $O(\alpha 9^n\sqrt{d}/\rho)$.

Proof: We shall focus on bounding $|A^{k,x}_{\sin}(t)|$ (the steps for bounding $|A^{k,x}_{\cos}(t)|$ are similar). Note that

$$|A^{k,x}_{\sin}(t)| = \bigg|\sum_{i=1}^{d} u_i \sin\!\big(\omega(C^xt)_k\big)\frac{d\Lambda^{1/2}_{\Phi(x)}(t)}{dt^i}\bigg| \;\le\; \sum_{i=1}^{d}|u_i|\cdot\bigg|\frac{d\Lambda^{1/2}_{\Phi(x)}(t)}{dt^i}\bigg| \;\le\; \sqrt{\sum_{i=1}^{d}\bigg(\frac{d\Lambda^{1/2}_{\Phi(x)}(t)}{dt^i}\bigg)^2},$$

since $\|u\| \le 1$. Thus, we can bound $|A^{k,x}_{\sin}(t)|$ by $O(\alpha 9^n\sqrt{d}/\rho)$ by noting the following lemma.

Lemma 19 For all $i$, $x$ and $t$, $|d\Lambda^{1/2}_{\Phi(x)}(t)/dt^i| \le O(\alpha 9^n/\rho)$.

Proof: Pick any $t \in \Phi(M)$, and let $p_0 \in M$ be (the unique element) such that $\Phi(p_0) = t$. Define $N_{p_0} := \{x \in X : \|\Phi(x) - \Phi(p_0)\| < \rho\}$ as the neighborhood around $p_0$. Fix an arbitrary $x_0 \in N_{p_0} \subset X$ (since if $x_0 \notin N_{p_0}$ then $d\Lambda^{1/2}_{\Phi(x_0)}(t)/dt^i = 0$), and consider the function

$$\Lambda^{1/2}_{\Phi(x_0)}(t) = \bigg(\frac{\lambda_{\Phi(x_0)}(t)}{\sum_{x\in N_{p_0}}\lambda_{\Phi(x)}(t)}\bigg)^{1/2} = \bigg(\frac{e^{-1/(1-(\|t-\Phi(x_0)\|^2/\rho^2))}}{\sum_{x\in N_{p_0}} e^{-1/(1-(\|t-\Phi(x)\|^2/\rho^2))}}\bigg)^{1/2}.$$

Pick an arbitrary coordinate $i_0 \in \{1,\ldots,d\}$ and consider the (directional) derivative of this function

$$\frac{d\Lambda^{1/2}_{\Phi(x_0)}(t)}{dt^{i_0}} \;=\; \frac{1}{2}\,\Lambda^{-1/2}_{\Phi(x_0)}(t)\,\frac{d\Lambda_{\Phi(x_0)}(t)}{dt^{i_0}}
\;=\; \frac{\Big({-\frac{2(t_{i_0}-\Phi(x_0)_{i_0})}{\rho^2}}\Big)(A_t(x_0))^2\, e^{-A_t(x_0)/2}\sum_{x\in N_{p_0}} e^{-A_t(x)}}{2\Big(\sum_{x\in N_{p_0}} e^{-A_t(x)}\Big)^{3/2}}
\;-\; \frac{e^{-A_t(x_0)/2}\sum_{x\in N_{p_0}}\Big({-\frac{2(t_{i_0}-\Phi(x)_{i_0})}{\rho^2}}\Big)(A_t(x))^2\, e^{-A_t(x)}}{2\Big(\sum_{x\in N_{p_0}} e^{-A_t(x)}\Big)^{3/2}},$$

where $A_t(x) := 1/(1 - (\|t - \Phi(x)\|^2/\rho^2))$. Observe that the domain of $A_t$ is $\{x \in X : \|t - \Phi(x)\| < \rho\}$ and its range is $[1,\infty)$. Recalling that for any $\beta \ge 1$, $|\beta^2 e^{-\beta}| \le 1$ and $|\beta^2 e^{-\beta/2}| \le 3$, we have $|A_t(\cdot)^2 e^{-A_t(\cdot)}| \le 1$ and $|A_t(\cdot)^2 e^{-A_t(\cdot)/2}| \le 3$. Thus,

$$\bigg|\frac{d\Lambda^{1/2}_{\Phi(x_0)}(t)}{dt^{i_0}}\bigg| \;\le\; \frac{3\cdot\Big|\frac{2(t_{i_0}-\Phi(x_0)_{i_0})}{\rho^2}\Big|\cdot\sum_{x\in N_{p_0}} e^{-A_t(x)} \;+\; e^{-A_t(x_0)/2}\sum_{x\in N_{p_0}}\Big|\frac{2(t_{i_0}-\Phi(x)_{i_0})}{\rho^2}\Big|}{2\Big(\sum_{x\in N_{p_0}} e^{-A_t(x)}\Big)^{3/2}} \;\le\; \frac{(3)(2/\rho)\sum_{x\in N_{p_0}} e^{-A_t(x)} \;+\; e^{-A_t(x_0)/2}\sum_{x\in N_{p_0}}(2/\rho)}{2\Big(\sum_{x\in N_{p_0}} e^{-A_t(x)}\Big)^{3/2}} \;\le\; O(\alpha 9^n/\rho),$$

where the last inequality is by noting: i) $|N_{p_0}| \le \alpha 9^n$ (since for all $x \in N_{p_0}$, $\|x - p_0\| \le 2\rho$ – cf. Lemma 4; $X$ is an $\alpha$-bounded cover; and, for $\rho \le \tau/4$, a ball of radius $2\rho$ can be covered by $9^n$ balls of radius $\rho$ on the given $n$-manifold – cf. Lemma 12), ii) $|e^{-A_t(x)}| \le |e^{-A_t(x)/2}| \le 1$ (for all $x$), and iii) $\sum_{x\in N_{p_0}} e^{-A_t(x)} \ge \Omega(1)$ (since our cover $X$ ensures that for any $p_0$ there exists $x \in N_{p_0} \subset X$ such that $\|p_0 - x\| \le \rho/2$ – see also Remark 2 – and hence $e^{-A_t(x)}$ is non-negligible for some $x \in N_{p_0}$).
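To make the objects of this proof concrete, the following hedged numpy sketch evaluates $\Lambda^{1/2}_{\Phi(x_0)}(t)$ for a made-up neighborhood $\{\Phi(x)\}_{x\in N_{p_0}}$ and estimates the derivative of Lemma 19 by finite differences; all numeric values are illustrative.

```python
import numpy as np

rho = 0.5
centers = np.array([[0.0, 0.0], [0.3, 0.1], [-0.2, 0.2]])  # stand-ins for Phi(x), x in N_p0

def A(t: np.ndarray, c: np.ndarray) -> float:
    """A_t(x) = 1 / (1 - ||t - Phi(x)||^2 / rho^2); requires ||t - Phi(x)|| < rho."""
    return 1.0 / (1.0 - np.sum((t - c) ** 2) / rho ** 2)

def Lambda_sqrt(t: np.ndarray, i: int) -> float:
    """Lambda^{1/2}_{Phi(x_i)}(t): square root of the normalized bump at center i."""
    w = np.array([np.exp(-A(t, c)) for c in centers])
    return float(np.sqrt(w[i] / w.sum()))

# Finite-difference estimate of d Lambda^{1/2} / dt^{i0} at a test point.
t, i0, h = np.array([0.05, 0.05]), 0, 1e-6
e = np.zeros(2); e[i0] = h
deriv = (Lambda_sqrt(t + e, 0) - Lambda_sqrt(t, 0)) / h
print(f"|d Lambda^(1/2)/dt^{i0}| ~ {abs(deriv):.3f}  (Lemma 19: O(alpha 9^n / rho))")
```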

A.7 Proof of Lemma 7

Note that by definition, $\|(D\Psi)_t(u)\|^2 = \|(D\Psi_{|X|,n})_t(u)\|^2$. Thus, using Eq. (3) and expanding the recursion, we have

$$\|(D\Psi)_t(u)\|^2 = \|(D\Psi_{|X|,n})_t(u)\|^2 = \|(D\Psi_{|X|,n-1})_t(u)\|^2 + \Lambda_{\Phi(x_{|X|})}(t)\,(C^{x_{|X|}}u)_n^2 + Z_{|X|,n}$$
$$\vdots$$
$$= \|(D\Psi_{0,n})_t(u)\|^2 + \sum_{i=1}^{|X|}\Big[\Lambda_{\Phi(x_i)}(t)\sum_{j=1}^{n}(C^{x_i}u)_j^2\Big] + \sum_{i,j} Z_{i,j}.$$

Note that $(D\Psi_{i,0})_t(u) := (D\Psi_{i-1,n})_t(u)$. Now, recalling that $\|(D\Psi_{0,n})_t(u)\|^2 = \|u\|^2$ (the base case of the recursion), all we need to show is that $|\sum_{i,j} Z_{i,j}| \le \epsilon/2$. This follows directly from the lemma below.

Lemma 20 Let $\epsilon_0 \le O\big(\epsilon/d(n|X|)^2\big)$, and for any $i,j$, let $\omega_{i,j} \ge \Omega\big((K_{i,j} + (\alpha 9^n/\rho))(nd|X|)^2/\epsilon\big)$ (as per the statement of Lemma 7). Then, for any $i,j$, $|Z_{i,j}| \le \epsilon/2n|X|$.

Proof: Recall that (cf. Eq. (3))

$$Z_{i,j} = \underbrace{\frac{1}{\omega_{i,j}^2}\sum_k \big(\zeta^{k,4}_{i,j}\big)^2}_{(a)} + \underbrace{\frac{2}{\omega_{i,j}}\sum_k \zeta^{k,4}_{i,j}\big(\zeta^{k,1}_{i,j} + \zeta^{k,2}_{i,j} + \zeta^{k,3}_{i,j}\big)}_{(b)} + \underbrace{2\sum_k \zeta^{k,1}_{i,j}\zeta^{k,2}_{i,j}}_{(c)} + \underbrace{2\sum_k \zeta^{k,1}_{i,j}\zeta^{k,3}_{i,j}}_{(d)}.$$

Term (a): Note that $|\sum_k (\zeta^{k,4}_{i,j})^2| \le O\big(d^3(K_{i,j} + (\alpha 9^n/\rho))^2\big)$ (cf. Lemma 21 (iv)). By our choice of $\omega_{i,j}$, term (a) is at most $O(\epsilon/n|X|)$.

Term (b): Note that $|\zeta^{k,1}_{i,j} + \zeta^{k,2}_{i,j} + \zeta^{k,3}_{i,j}| \le O(n|X| + (\epsilon/dn|X|))$ (by noting Lemma 21 (i)-(iii), recalling the choice of $\omega_{i,j}$, and summing over all $i', j'$). Thus, $|\sum_k \zeta^{k,4}_{i,j}(\zeta^{k,1}_{i,j} + \zeta^{k,2}_{i,j} + \zeta^{k,3}_{i,j})| \le O\big(d^2(K_{i,j} + (\alpha 9^n/\rho))\,(n|X| + (\epsilon/dn|X|))\big)$. Again, by our choice of $\omega_{i,j}$, term (b) is at most $O(\epsilon/n|X|)$.

Terms (c) and (d): We focus on bounding term (c) (the steps for bounding term (d) are the same). Note that $|\sum_k \zeta^{k,1}_{i,j}\zeta^{k,2}_{i,j}| \le 4\,|\sum_k \zeta^{k,1}_{i,j}(\eta_{i,j}(t))_k|$. Now, observe that $(\zeta^{k,1}_{i,j})_{k=1,\ldots,2d+3}$ is a tangent vector with length at most $O(dn|X| + (\epsilon/dn|X|))$ (cf. Lemma 21 (i)). Thus, by noting that $\eta_{i,j}$ is almost normal (with quality of approximation $\epsilon_0$), we have that term (c) is at most $O(\epsilon/n|X|)$.

By choosing the constants in the order terms appropriately, we get the lemma.

Lemma 21 Let $\zeta^{k,1}_{i,j}$, $\zeta^{k,2}_{i,j}$, $\zeta^{k,3}_{i,j}$, and $\zeta^{k,4}_{i,j}$ be as defined in Eq. (3). Then for all $1 \le i \le |X|$ and $1 \le j \le n$, we have

(i) $|\zeta^{k,1}_{i,j}| \le 1 + 8n|X| + \sum_{i'=1}^{i}\sum_{j'=1}^{j-1} O\big(d(K_{i',j'} + (\alpha 9^n/\rho))/\omega_{i',j'}\big)$,

(ii) $|\zeta^{k,2}_{i,j}| \le 4$,

(iii) $|\zeta^{k,3}_{i,j}| \le 4$,

(iv) $|\zeta^{k,4}_{i,j}| \le O\big(d(K_{i,j} + (\alpha 9^n/\rho))\big)$.

Proof: First note that for any $\|u\| \le 1$ and for any $x_i \in X$, $1 \le j \le n$ and $1 \le l \le d$, we have $|\sum_l C^{x_i}_{j,l}u_l| = |(C^{x_i}u)_j| \le 4$ (cf. Lemma 23 (b) and Corollary 5).

Noting that for all $i$ and $j$, $\|\eta_{i,j}\| = \|\nu_{i,j}\| = 1$, we have $|\zeta^{k,2}_{i,j}| \le 4$ and $|\zeta^{k,3}_{i,j}| \le 4$.

Observe that $\zeta^{k,4}_{i,j} = \sum_l u_l R^{k,l}$. For all $i$, $j$, $k$ and $l$, note that i) $\|d\eta_{i,j}(t)/dt^l\| \le K_{i,j}$ and $\|d\nu_{i,j}(t)/dt^l\| \le K_{i,j}$, and ii) $|d\lambda^{1/2}_{\Phi(x_i)}(t)/dt^l| \le O(\alpha 9^n/\rho)$ (cf. Lemma 19). Thus we have $|\zeta^{k,4}_{i,j}| \le O\big(d(K_{i,j} + (\alpha 9^n/\rho))\big)$.

Now, for any $i,j$, note that $\zeta^{k,1}_{i,j} = \sum_l u_l\,d\Psi^k_{i,j-1}(t)/dt^l$. Thus, by recursively expanding, $|\zeta^{k,1}_{i,j}| \le 1 + 8n|X| + \sum_{i'=1}^{i}\sum_{j'=1}^{j-1} O\big(d(K_{i',j'} + (\alpha 9^n/\rho))/\omega_{i',j'}\big)$.

A.8 Proof of Lemma 8

We start by stating the following useful observations:

Lemma 22 Let $A$ be a linear operator such that $\max_{\|x\|=1}\|Ax\| \le \delta_{\max}$. Let $u$ be a unit-length vector. If $\|Au\| \ge \delta_{\min} > 0$, then for any unit-length vector $v$ such that $|u \cdot v| \ge 1 - \epsilon$, we have

$$1 - \frac{\delta_{\max}\sqrt{2\epsilon}}{\delta_{\min}} \;\le\; \frac{\|Av\|}{\|Au\|} \;\le\; 1 + \frac{\delta_{\max}\sqrt{2\epsilon}}{\delta_{\min}}.$$

Proof: Let $v' = v$ if $u \cdot v > 0$, otherwise let $v' = -v$. Quickly note that $\|u - v'\|^2 = \|u\|^2 + \|v'\|^2 - 2u \cdot v' = 2(1 - u \cdot v') \le 2\epsilon$. Thus, we have

i. $\|Av\| = \|Av'\| \le \|Au\| + \|A(u - v')\| \le \|Au\| + \delta_{\max}\sqrt{2\epsilon}$,

ii. $\|Av\| = \|Av'\| \ge \|Au\| - \|A(u - v')\| \ge \|Au\| - \delta_{\max}\sqrt{2\epsilon}$.

Noting that $\|Au\| \ge \delta_{\min}$ yields the result.

Lemma 23 Let $x_1,\ldots,x_n \in \mathbb{R}^D$ be a set of orthonormal vectors, let $F := [x_1,\ldots,x_n]$ be a $D \times n$ matrix, and let $\Phi$ be a linear map from $\mathbb{R}^D$ to $\mathbb{R}^d$ ($n \le d \le D$) such that for all non-zero $a \in \mathrm{span}(F)$ we have $0 < \|\Phi a\| \le \|a\|$. Let $U\Sigma V^T$ be the thin SVD of $\Phi F$. Define $C := (\Sigma^{-2} - I)^{1/2}U^T$. Then,

(a) $\|C(\Phi a)\|^2 = \|a\|^2 - \|\Phi a\|^2$, for any $a \in \mathrm{span}(F)$,

(b) $\|C\|^2 \le (1/\sigma^n)^2$, where $\|\cdot\|$ denotes the spectral norm of a matrix and $\sigma^n$ is the $n$-th largest singular value of $\Phi F$.

Proof: Note that $FV$ forms an orthonormal basis for the subspace spanned by the columns of $F$ that maps to $U\Sigma$ via the mapping $\Phi$. Thus, since $a \in \mathrm{span}(F)$, let $y$ be such that $a = FVy$. Note that i) $\|a\|^2 = \|y\|^2$, and ii) $\|\Phi a\|^2 = \|U\Sigma y\|^2 = y^T\Sigma^2 y$. Now,

$$\|C\Phi a\|^2 = \big\|\big((\Sigma^{-2} - I)^{1/2}U^T\big)\Phi FVy\big\|^2 = \big\|(\Sigma^{-2} - I)^{1/2}U^TU\Sigma V^TVy\big\|^2 = \big\|(\Sigma^{-2} - I)^{1/2}\Sigma y\big\|^2 = y^Ty - y^T\Sigma^2 y = \|a\|^2 - \|\Phi a\|^2.$$

Now, consider $\|C\|^2$:

$$\|C\|^2 \le \big\|(\Sigma^{-2} - I)^{1/2}\big\|^2\,\|U^T\|^2 \le \max_{\|x\|=1}\big\|(\Sigma^{-2} - I)^{1/2}x\big\|^2 \le \max_{\|x\|=1} x^T\Sigma^{-2}x = \max_{\|x\|=1}\sum_i x_i^2/(\sigma^i)^2 \le (1/\sigma^n)^2,$$

where the $\sigma^i$ are the (top $n$) singular values forming the diagonal matrix $\Sigma$.
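The "correction" matrix $C$ of Lemma 23 is simple to form from the thin SVD; the sketch below (with arbitrary dimensions and a made-up contracting map in place of the paper's $\Phi$) numerically confirms properties (a) and (b).

```python
import numpy as np

rng = np.random.default_rng(2)
D, d, n = 200, 30, 5                       # arbitrary illustrative dimensions

F, _ = np.linalg.qr(rng.standard_normal((D, n)))     # D x n orthonormal columns
Phi = rng.standard_normal((d, D)) / np.sqrt(D)       # generic contracting-ish linear
                                                     # map (stand-in, not the paper's Phi)
U, s, Vt = np.linalg.svd(Phi @ F, full_matrices=False)  # thin SVD: Phi F = U S V^T
assert 0 < s.min() and s.max() < 1.0, "Lemma 23 needs 0 < ||Phi a|| < ||a|| on span(F)"

C = np.diag(np.sqrt(1.0 / s ** 2 - 1.0)) @ U.T       # C = (Sigma^{-2} - I)^{1/2} U^T

a = F @ rng.standard_normal(n)                       # an arbitrary a in span(F)
lhs = np.linalg.norm(C @ (Phi @ a)) ** 2
rhs = np.linalg.norm(a) ** 2 - np.linalg.norm(Phi @ a) ** 2
print(f"||C Phi a||^2 = {lhs:.6f}   vs   ||a||^2 - ||Phi a||^2 = {rhs:.6f}")   # equal
print(f"||C||_2 = {np.linalg.norm(C, 2):.3f}  <=  1/sigma^n = {1 / s[-1]:.3f}")
```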

Lemma 24 Let $M \subset \mathbb{R}^D$ be a compact Riemannian $n$-manifold with condition number $1/\tau$. Pick any $x \in M$ and let $F_x$ be any $n$-dimensional affine space with the property: for any unit vector $v_x$ tangent to $M$ at $x$, and its projection $v_{xF}$ onto $F_x$, $\big|v_x \cdot \frac{v_{xF}}{\|v_{xF}\|}\big| \ge 1 - \delta$. Then for any $p \in M$ such that $\|x - p\| \le \rho \le \tau/2$, and any unit vector $v$ tangent to $M$ at $p$ (with $\xi := (2\rho/\tau) + \delta + 2\sqrt{2\rho\delta/\tau}$):

i. $\big|v \cdot \frac{v_F}{\|v_F\|}\big| \ge 1 - \xi$,

ii. $\|v_F\|^2 \ge 1 - 2\xi$,

iii. $\|v_r\|^2 \le 2\xi$,

where $v_F$ is the projection of $v$ onto $F_x$ and $v_r$ is the residual (i.e., $v = v_F + v_r$ and $v_F \perp v_r$).

Proof: Let $\gamma$ be the angle between $v_F$ and $v$. We will bound this angle. Let $v_x$ (at $x$) be the parallel transport of $v$ (at $p$) via the (shortest) geodesic path with respect to the manifold connection. Let the angle between the vectors $v$ and $v_x$ be $\alpha$. Let $v_{xF}$ be the projection of $v_x$ onto the subspace $F_x$, and let the angle between $v_x$ and $v_{xF}$ be $\beta$. WLOG, we can assume that the angles $\alpha$ and $\beta$ are acute. Then, since $\gamma \le \alpha + \beta \le \pi$, we have that $v \cdot \frac{v_F}{\|v_F\|} = \cos\gamma \ge \cos(\alpha+\beta)$. We bound the individual terms $\cos\alpha$ and $\cos\beta$ as follows.

Now, since $\|p - x\| \le \rho$, using Lemmas 9 and 10 we have $\cos(\alpha) = |v \cdot v_x| \ge 1 - 2\rho/\tau$. We also have $\cos(\beta) = \big|v_x \cdot \frac{v_{xF}}{\|v_{xF}\|}\big| \ge 1 - \delta$. Then, using Lemma 16, we finally get $\big|v \cdot \frac{v_F}{\|v_F\|}\big| = |\cos(\gamma)| \ge 1 - 2\rho/\tau - \delta - 2\sqrt{2\rho\delta/\tau} = 1 - \xi$.

Also note that since $1 = \|v\|^2 = \big(v \cdot \frac{v_F}{\|v_F\|}\big)^2 + \|v_r\|^2$, we have $\|v_r\|^2 = 1 - \big(v \cdot \frac{v_F}{\|v_F\|}\big)^2 \le 2\xi$, and $\|v_F\|^2 = 1 - \|v_r\|^2 \ge 1 - 2\xi$.

Now we are in a position to prove Lemma 8. Let $v_F$ be the projection of the unit vector $v$ (at $p$) onto the subspace spanned by (the columns of) $F_x$, and let $v_r$ be the residual (i.e., $v = v_F + v_r$ and $v_F \perp v_r$). Then, noting that $p$, $x$, $v$ and $F_x$ satisfy the conditions of Lemma 24 (with $\rho$ in Lemma 24 replaced with $2\rho$ from the statement of Lemma 8), we have (with $\xi := (4\rho/\tau) + \delta + 4\sqrt{\rho\delta/\tau}$):

a) $\big|v \cdot \frac{v_F}{\|v_F\|}\big| \ge 1 - \xi$,

b) $\|v_F\|^2 \ge 1 - 2\xi$,

c) $\|v_r\|^2 \le 2\xi$.

We can now bound the required quantity $\|C^xu\|^2$. Note that

$$\|C^xu\|^2 = \|C^x\Phi v\|^2 = \|C^x\Phi(v_F + v_r)\|^2 = \|C^x\Phi v_F\|^2 + \|C^x\Phi v_r\|^2 + 2\,C^x\Phi v_F \cdot C^x\Phi v_r = \underbrace{\|v_F\|^2 - \|\Phi v_F\|^2}_{(a)} + \underbrace{\|C^x\Phi v_r\|^2}_{(b)} + \underbrace{2\,C^x\Phi v_F \cdot C^x\Phi v_r}_{(c)},$$

where the last equality is by observing that $v_F$ is in the span of $F_x$ and applying Lemma 23 (a). We now bound the terms (a), (b), and (c) individually.

Term (a): Note that $1 - 2\xi \le \|v_F\|^2 \le 1$. Observing that $\Phi$ satisfies the conditions of Lemma 22 with $\delta_{\max} = (2/3)\sqrt{D/d}$ and $\delta_{\min} = (1/2)$ (cf. Lemma 4), and that $\big|v \cdot \frac{v_F}{\|v_F\|}\big| \ge 1 - \xi$, we have (recall $\|\Phi v\| = \|u\| \le 1$)

$$\|v_F\|^2 - \|\Phi v_F\|^2 \le 1 - \|v_F\|^2\,\Big\|\Phi\tfrac{v_F}{\|v_F\|}\Big\|^2 \le 1 - (1 - 2\xi)\,\Big\|\Phi\tfrac{v_F}{\|v_F\|}\Big\|^2 \le 1 - \Big\|\Phi\tfrac{v_F}{\|v_F\|}\Big\|^2 + 2\xi \le 1 + 2\xi - \big(1 - (4/3)\sqrt{2\xi D/d}\big)^2\|\Phi v\|^2 \le 1 - \|u\|^2 + 2\xi + (8/3)\sqrt{2\xi D/d}, \tag{7}$$

where the fourth inequality is by using Lemma 22. Similarly, in the other direction,

$$\|v_F\|^2 - \|\Phi v_F\|^2 \ge 1 - 2\xi - \|v_F\|^2\,\Big\|\Phi\tfrac{v_F}{\|v_F\|}\Big\|^2 \ge 1 - 2\xi - \Big\|\Phi\tfrac{v_F}{\|v_F\|}\Big\|^2 \ge 1 - 2\xi - \big(1 + (4/3)\sqrt{2\xi D/d}\big)^2\|\Phi v\|^2 \ge 1 - \|u\|^2 - \big(2\xi + (32/9)\,\xi(D/d) + (8/3)\sqrt{2\xi D/d}\big). \tag{8}$$

Figure 7: Basic setup for computing the normals to the underlying $n$-manifold $\Phi M$ at the point of interest $\Phi p$. Observe that even though it is difficult to find vectors normal to $\Phi M$ at $\Phi p$ within the containing space $\mathbb{R}^d$ (because we only have a finite-size sample from $\Phi M$, viz. $\Phi x_1, \Phi x_2$, etc.), we can treat the point $\Phi p$ as part of the bigger ambient manifold $N$ ($= \mathbb{R}^d$, which contains $\Phi M$) and compute the desired normals in a space that contains $N$ itself (here $\mathbb{R}^{2d+3}$). Now, for each $i,j$ iteration of Algorithm II, $\Psi_{i,j}$ acts on the entire $N$, and since we have complete knowledge of $N$, we can compute the desired normals.

Term (b): Note that for any $x$, $\|\Phi x\| \le (2/3)\sqrt{D/d}\,\|x\|$. Applying Lemma 23 (b) with $\sigma^n_x \ge 1/4$ (cf. Corollary 5) and noting that $\|v_r\|^2 \le 2\xi$, we immediately get

$$0 \le \|C^x\Phi v_r\|^2 \le 4^2\cdot(4/9)(D/d)\,\|v_r\|^2 \le (128/9)(D/d)\,\xi. \tag{9}$$

Term (c): Recall that for any $x$, $\|\Phi x\| \le (2/3)\sqrt{D/d}\,\|x\|$, and using Lemma 23 (b) we have $\|C^x\|^2 \le 16$ (since $\sigma^n_x \ge 1/4$ – cf. Corollary 5).

Now let $a := C^x\Phi v_F$ and $b := C^x\Phi v_r$. Then $\|a\| = \|C^x\Phi v_F\| \le \|C^x\|\,\|\Phi v_F\| \le 4$, and $\|b\| = \|C^x\Phi v_r\| \le (8/3)\sqrt{2\xi D/d}$ (see Eq. (9)).

Thus, $|2a \cdot b| \le 2\|a\|\|b\| \le 2\cdot 4\cdot(8/3)\sqrt{2\xi D/d} = (64/3)\sqrt{2\xi D/d}$. Equivalently,

$$-(64/3)\sqrt{2\xi D/d} \;\le\; 2\,C^x\Phi v_F \cdot C^x\Phi v_r \;\le\; (64/3)\sqrt{2\xi D/d}. \tag{10}$$

Combining (7)-(10), and noting $d \le D$, yields the lemma.

A.9 Computing the Normal Vectors

The success of the second embedding technique crucially depends on finding (at each iteration step) a pair of mutually orthogonal unit vectors that are normal to the embedding of the manifold $M$ (from the previous iteration step) at a given point $p$. At first glance, finding such normal vectors seems infeasible since we only have access to a finite-size sample $X$ from $M$. The saving grace comes from noting that the corrections are applied to the $n$-dimensional manifold $\Phi(M)$, which is a submanifold of the $d$-dimensional space $\mathbb{R}^d$. Let us denote this space $\mathbb{R}^d$ as a flat $d$-manifold $N$ (containing our manifold of interest $\Phi(M)$). Note that even though we only have partial information about $\Phi(M)$ (since we only have samples from it), we have full information about $N$ (since it is the entire space $\mathbb{R}^d$). This means that given some point of interest $\Phi p \in \Phi(M) \subset N$, any vector normal to $N$ (at $\Phi p$) is automatically a vector normal to $\Phi(M)$ (at $\Phi p$). Of course, to find two mutually orthogonal normals to a $d$-manifold $N$, $N$ itself needs to be embedded in a higher-dimensional Euclidean space (although embedding into dimension $d+2$ should suffice, for computational reasons we will embed $N$ into Euclidean space of dimension $2d+3$). This is precisely the first thing we do before applying any corrections (cf. Step 2 of Embedding II in Section 4). See Figure 7 for an illustration of the setup before finding any normals.

Now, for every iteration of the algorithm, note that we have complete knowledge of $N$ and of exactly what function (namely $\Psi_{i,j}$ for iteration $i,j$) is being applied to $N$. Thus, with additional computational effort, one can compute the necessary normal vectors. More specifically, we can estimate a pair of mutually orthogonal unit vectors that are normal to $\Psi_{i,j}(N)$ at $\Phi p$ (for any step $i,j$) as follows.

Algorithm 4 Compute Normal Vectors

Preprocessing Stage:
1: Let $\eta^{\mathrm{rand}}_{i,j}$ and $\nu^{\mathrm{rand}}_{i,j}$ be vectors in $\mathbb{R}^{2d+3}$ drawn independently at random from the surface of the unit sphere (for $1 \le i \le |X|$, $1 \le j \le n$).

Compute Normals: For any point of interest $p \in M$, let $t := \Phi p$ denote its projection into $\mathbb{R}^d$. Now, for any iteration $i,j$ (where $1 \le i \le |X|$ and $1 \le j \le n$), we shall assume that the vectors $\eta$ and $\nu$ up to iteration $i,j-1$ are already given. Then we can compute the (approximated) normals $\eta_{i,j}(t)$ and $\nu_{i,j}(t)$ for iteration $i,j$ as follows.
1: Let $\Delta > 0$ be the quality of approximation.
2: for $k = 1,\ldots,d$ do
3:   Approximate the $k$-th tangent vector as
     $$T^k := \frac{\Psi_{i,j-1}(t + \Delta e^k) - \Psi_{i,j-1}(t)}{\Delta},$$
     where $\Psi_{i,j-1}$ is as defined in Section 5.3, and $e^k$ is the $k$-th standard vector.
4: end for
5: Let $\eta = \eta^{\mathrm{rand}}_{i,j}$ and $\nu = \nu^{\mathrm{rand}}_{i,j}$.
6: Use the Gram-Schmidt orthogonalization process to extract $\hat\eta$ (from $\eta$) that is orthogonal to the vectors $\{T^1,\ldots,T^d\}$.
7: Use the Gram-Schmidt orthogonalization process to extract $\hat\nu$ (from $\nu$) that is orthogonal to the vectors $\{T^1,\ldots,T^d,\hat\eta\}$.
8: return $\hat\eta/\|\hat\eta\|$ and $\hat\nu/\|\hat\nu\|$ as mutually orthogonal unit vectors that are approximately normal to $\Psi_{i,j-1}(\Phi M)$ at $\Psi_{i,j-1}(t)$.
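A hedged Python sketch of Algorithm 4 follows; `Psi_prev` (standing in for $\Psi_{i,j-1}$, a map from $\mathbb{R}^d$ to $\mathbb{R}^{2d+3}$) is assumed to be supplied by the caller, and the Gram-Schmidt extraction of Steps 6-7 is realized by projecting off an orthonormalized tangent basis.

```python
import numpy as np

def compute_normals(Psi_prev, t, eta_rand, nu_rand, Delta=1e-6):
    """Sketch of Algorithm 4 (our rendering, not the paper's reference code).
    Psi_prev : callable R^d -> R^{2d+3}, playing the role of Psi_{i,j-1}
    t        : point in R^d (t = Phi p)
    eta_rand, nu_rand : the fixed random unit vectors in R^{2d+3}
    Delta    : finite-difference step (the quality of approximation).
    Returns two mutually orthogonal unit vectors approximately normal to
    Psi_prev(Phi M) at Psi_prev(t)."""
    d = t.shape[0]
    base = Psi_prev(t)
    # Steps 2-4: approximate the d tangent vectors T^k by finite differences.
    T = np.stack([(Psi_prev(t + Delta * np.eye(d)[k]) - base) / Delta
                  for k in range(d)])                  # shape (d, 2d+3)
    # Steps 6-7: orthonormalize the tangents (QR), then project the fixed
    # random vectors off the tangent space (the Gram-Schmidt extraction).
    Q, _ = np.linalg.qr(T.T)                           # (2d+3) x d orthonormal basis
    eta_hat = eta_rand - Q @ (Q.T @ eta_rand)
    nu_hat = nu_rand - Q @ (Q.T @ nu_rand)
    nu_hat = nu_hat - (nu_hat @ eta_hat) / (eta_hat @ eta_hat) * eta_hat
    # Step 8: normalize and return.
    return eta_hat / np.linalg.norm(eta_hat), nu_hat / np.linalg.norm(nu_hat)
```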

A few remarks are in order.

Remark 7 The choice of target dimension $2d+3$ (instead of $d+2$) ensures that a pair of random unit vectors $\eta$ and $\nu$ are not parallel to any vector in the tangent bundle of $\Psi_{i,j-1}(N)$ with probability 1. This again follows from Sard's theorem, and is the key observation in reducing the embedding size in Whitney's embedding [Whi36]. It also ensures that our orthogonalization process (Steps 6 and 7) will not result in a null vector.

Remark 8 By picking $\Delta$ sufficiently small, we can approximate the normals $\eta$ and $\nu$ arbitrarily well by approximating the tangents $T^1,\ldots,T^d$ well.

Remark 9 For each iteration $i,j$, the vectors $\hat\eta/\|\hat\eta\|$ and $\hat\nu/\|\hat\nu\|$ that are returned (in Step 8) are a smooth modification of the starting vectors $\eta^{\mathrm{rand}}_{i,j}$ and $\nu^{\mathrm{rand}}_{i,j}$, respectively. Now, since we use the same starting vectors $\eta^{\mathrm{rand}}_{i,j}$ and $\nu^{\mathrm{rand}}_{i,j}$ regardless of the point of application ($t = \Phi p$), it follows that the respective directional derivatives of the returned vectors are bounded as well.

By noting Remarks 8 and 9, the approximate normals we return satisfy the conditions needed for Embedding II (see our discussion in Section 5.3).
