<<

arXiv:1901.09057v3 [math.DG] 23 Dec 2020 oryo e aapit eat function penalty a – points data new on poorly omof norm h riigset training the euaiainterm regularization rbe nti ae.I hoy ofidangtv rdetflwln o line flow gradient negative a find to theory, In paper. this in problem fit to manifold oigi h ieto of direction the in moving P rdetat gradient rdetvco ntetnetsaeof space tangent the in vector gradient a from far where i h eaiegain o of flow gradient negative the via nta,tegain o sotndsrtzd emv ntenega the in move we point discretized: initial often an is flow from gradient the Instead, h rdetof gradient the lohv a have also euaiaintr,a xlie eo.(nkeigwt h literat the with keeping (In below. explained k as term, regularization a pc oooycmn rmahg ooe omo the or norm Sobolev high a from coming topology space usto h nnt iesoa etrsaeo l asfrom maps all of space vector dimensional infinite the of subset medd opc manifold compact embedded, iighwt prxmt nt e of set finite a approximate to how mining eu novstesaeo mohembeddings smooth of space the involves setup rdetFlow Gradient n h iemrhs yeof type diffeomorphism the and oaodoefitn rcosn a choosing or – overfitting avoid To hl hr r hoeia hlegswt hsstp efcso a on focus we setup, this with challenges theoretical are there While omnapoc ndt nlssadmcielann smnfl lear manifold is learning machine and analysis data in approach common A = S lsiiain 81,6T9 ewrs aiodLear Manifold Keywords: 68T99. 58D15, Classification: MSC P ICEIE RDETFO O AIODLEARNING MANIFOLD FOR FLOW GRADIENT DISCRETIZED 1 d φ odto nmnfl learning. manifold in condition in ecn for descent dient ecnie h aewe h rdetof gradient the when case the consider we on o hsse eghi em fteReana emtyof geometry Riemannian the of on terms to of is in direction result length the main step – Our this direction embeddings. for fixed smooth a of bound space in the move leaving can before we far how estimating mohembeddings smooth hr rdetdsettkspaei h nnt iesoa space dimensional approa infinite an the present in we place paper, takes this descent In gradient gradient where continuing. of and implementations gradient, Many the functions. of versions, minima discretized find to mization Abstract. + ( ( φ M φ { P Froeveso hssadr prah e 5,[6. Gradient [36].) [5], see approach, standard this of overviews (For . y ( 2 .W rv hscs rsswhen arises case this prove We ). M i C } φ : 1 1 ) −→ E n o titn o uh ofi h aa hsatpclpnlyf penalty typical a Thus data. the fit to much” too “twisting for and hnieaeutltegain ssalrta pcfidaon.S amount. specified a than smaller is gradient the until iterate then , y , { eat function penalty P { y i y steEciendsac from distance Euclidean the is ) i tec on of point each at rdetdset rngtv rdetflw sasadr techniq standard a is flow, gradient negative or descent, Gradient } i } . R . P P 2 φ otistotrs i aafitn term fitting data a (i) terms: two contains 0 eindt rvn vrtig e.g. overfitting, prevent to designed : φ i.e. −→ E o xdse iet e point new a to size step fixed a for −∇ famanifold a of AAGL N TVNROSENBERG STEVEN AND GOLD DARA oigi h rdetdrcinfrastse ie recomputing size, step set a for direction gradient the in moving , P M R eat ucinta crsa embedding an scores that function penalty a , in P P o some for E M : E on . → E a n oa rgoa iiu of minimum global or local a find can , 1. hsi eeal nrcal o optrcalculations. computer for intractable generally is This φ r ie. nter,fidn lblmnmmof minimum global a finding theory, In given.) are M E uhthat such Introduction E P ie notmlebdig roeta bs fits” “best that one or embedding, optimal an gives into R orsod oavco edalong field vector a to corresponds sivratudrdffoopim of diffeomorphisms under invariant is { k P y hc yial otisadt tigtr and term fitting data a contains typically which i R ≪ } spitienra oteebde manifold embedded the to normal pointwise is 1 N i L =1 mlmnigadsrtzdvrino gra- of version discretized a Implementing . N φ nEcienspace Euclidean in E P ( 6 1 2 8 5 3.Temathematical The 33]. 25, 18, 12, 11, [6, M y Emb( = a penalize can i t the fits ) otecoetpitin point closest the to C ig medn pcs Discretized Spaces, Embedding ning, M, ∞ P { φ re htsaetplg.We topology. Fr´echet space y 2 R P 1 ht aiodlearning manifold to ch ( i M } tpadrcmuethe recompute and stop , φ N 1 φ φ iea xlctlower explicit an give E = ) ( eycoeybtperforms but closely very osdrda nopen an as considered ) ( ( φ to M M Emb( = R = ) iegain direction gradient tive rdetse – step gradient e n N .I particular, In ). ohfrbigtoo being for both ) R r,w suethat assume we ure, ecn eyon rely descent k φ E N ya by φ P M E ∈ implementation n ene oknow to need we , k ihteBanach the with P s ei opti- in ue i r M, natural a , ning, =1 , φ requires , ra optimal an or , k the ( R d φ M descent, -dimensional 2 N ( ( M i.e. ,w need we ), of ) φ s ( -Sobolev ;(i a (ii) ); M unction deter- , nea ince ) y , i.e. i P ) , , 2 DARA GOLD AND STEVEN ROSENBERG to estimate a lower bound for how far we can move from a fixed embedding φ in the negative gradient direction −∇Pφ and still remain in the space of embeddings. This practical issue is the main focus of this paper. In the main result, Theorem 3, we provide such a lower bound. We focus on the case where ∇Pφ is pointwise normal to φ(M), and motivate this restriction in §4. An example of manifold learning is the “learning” of black and white photographic images. Here an image with N pixels is stored as a vector in RN with each component taking values in a finite set which codifies the greyscale magnitude of the pixel. We assume that there locally exist k components of these vectors which implicitly determine the remaining N − k components. In the generic situation where the implicit function theorem applies, the set of images is then determined by an unknown k-manifold embedded in RN . Thus the computationally expensive task of storing many vectors in RN is reduced to finding a much lower dimensional manifold. Given a sample set {yi} of images, we fix a k-manifold M and N search for an embedding φ : M −→ R with φ(xi)= yi, for some xi ∈ M. The crucial issue of the topological type of M is discussed in [15]. This setup applies to recovering partially occluded objects in an image. In this case, the training set might consist of L unoccluded pictures of cars. Given new images of a car partially blocked by a stop sign, say only k pixels in each image are unblocked. Under the assumption that there is a unique way to fill in N −k remaining pixels, a “best fit” k-manifold determines the rest of the car’s shape [1, 21, 26]. We emphasize that our approach to manifold learning directly tackles the infinite di- mensional nature of this optimization problem without making any simplifying choices that reduce the problem to finite dimensions. Typical choices in the literature are parametric methods, which fix a finite dimensional parameter space of embeddings, and RKHS meth- ods, which reduce the optimization to a finite dimensional problem via the Representation Theorem, but only after making a choice of kernel function. In contrast, our approach only assumes that M is compact, possibly with boundary, and so must contend with infinite di- mensional analytic issues. Since we are given a finite set of training data, the compactness assumption is reasonable. We briefly discuss the issues with directly working with the smooth gradient flow on E. It may be difficult to prove that P is differentiable for typical data terms which measure the minimum distance from a data point to the embedded manifold. Even if P is differentiable, in this infinite dimensional case it is not clear that a gradient flow line γ(t) stays in E or converges as t −→ ∞ to a critical point of P , as E is an open dense set in the space of smooth maps from M to RN . (In finite dimensional problems, open dense sets do not occur.) Even if we can prove convergence, since neither P nor E is in general convex, calculus methods cannot tell us if a critical point is a local, much less a global, minimum. Perhaps most fundamentally, even the short time existence for the gradient flow may be difficult to establish, particularly if we use the most natural C∞ topology on E. These problems are well known in differential geometry, e.g., in the study of minimal submanifolds. In contrast, discretized gradient flow is both a tool for theoretical results on gradient flow [2, Ch. 11] and for computer implementations based on discretized, usually linearized, versions of gradient flow [13], although these are not guaranteed to converge [10]. In the theoretical direction, we argue that the entire penalty term should be invariant under the diffeomorphism group of M, just like typical data terms which measure the distance from data points to φ(M). In particular, regularization terms built from geometric quantities like the volume or total curvature of M have this invariance. In contrast, more familiar regular- ization terms like a Sobolev norm of the embedding are not diffeomorphism invariant. We prove in Theorem 1 that for a diffeomorphism invariant penalty function P , the gradient vector field ∇Pφ is guaranteed to be pointwise normal to φ(M). This simplifies the discretiza- tion process, and leads us to determine the distance one can move in a pointwise normal DISCRETIZEDGRADIENTFLOWFOR MANIFOLDLEARNING 3 direction to an embedding φ(M) and still remain in E. In the applied direction, Theorem 5.7 an explicit lower bound for this distance in terms of the geometry of φ(M) and knowledge of local coordinate charts for M.

2. Related work In addition to manifold learning, the use of gradient flow for functionals on infinite dimen- sional manifolds of maps has a large literature in machine learning, where this comes under the general heading of nonparametric methods. (In the parametric approach, one restricts attention to a finite dimensional submanifold depending on a finite dimensional family of parameters, usually in some Rn for some n.) Osher and Sethian introduced the Level Set Method [32], in which a decision boundary is treated as the level set of a function. Viewing the decision boundary this way avoids typical problems that arise with cusps and discon- tinuities in a flow whose speed is curvature dependent. This work has been extended in many directions, e.g., [30, 35, 37]. In supervised learning, [4] applied geometric gradient flow techniques to optimize a statistical labeling function using a penalty function that has a data term and a geometric regularization term, giving more refined information than contained in the decision boundary. It should be noted that this paper has to resort to parametric methods to implement the discretized gradient flow algorithm. There are intriguing con- nections between regularization methods and classical physical equations in Lin et al. [23]. We note that our discretization method is somewhat the reverse of the successful manifold approximation approach of Laplacian eigenmaps [5], where a discrete set of data in RN that apparently lies close to a submanifold is parametrized by a subset of Rk through eigenvectors of a graph Laplacian; this parametrization is our φ−1. There is a corresponding large literature in differential geometry for gradient flow in infinite dimensions, particularly as mentioned for minimal submanifolds. Here the penalty function is the purely geometric volume of the embedded manifold, and the gradient flow is the mean curvature flow. Hamilton [19] proved some long time existence results for mean curvature flow. Gerhardt [16] showed that convex, compact surfaces in Euclidean space and curves in a plane contract smoothly to a point under mean curvature flow. These long time existence results are nontrivial. Xiao [39] gave a short time estimate for mean curvature flow if the immersed hypersurface in Euclidean space is star-shaped. Huisken and Sinestrari [20] looked at compact hypersurfaces with positive mean curvature to study singularities than can arise during the flow. They introduced a method to get a series of rescaled flows that approach a smooth flow. Rupflin and Topping [34] studied finding a minimal immersion by doing gradient flow of the harmonic energy map paired with flowing the Riemannian metric on the domain surface. Although the gradient of a functional is typically computed using an inner product on the tangent space of the domain space, Mayer [27] used a discretized approximation to the gradient flow, which more closely mimics implementation processes. In particular, he replaces the time derivative in the equation of the flow with a finite difference term. This leads to short time movement in the direction of a minimizer of a naturally arising penalty function. It is worth noting that historically, pioneering work in the modern study of gradient flow in differential geometry was done by Morse [29] in the 1930s on the infinite dimensional space of paths on a Riemannian manifold, which was then adapted by Milnor [28] to develop Morse theory on finite dimensional manifolds. In turn, Morse theory has undergone widespread development through Floer theory and its many variants in the past 25 years [3]. Finally, the strongest connection to date between manifold learning and differential geometry is in the work of Fefferman et al. [15] on the “manifold hypothesis.” 4 DARA GOLD AND STEVEN ROSENBERG

3. Proof Outline for the Discretized Gradient Flow Estimate Because of the computational detail in §§4,5, we give an overview of the proof structure and the locations of key results.

3.1. General Overview. In Theorem 4.2 in §4, we give a natural condition on the penalty function P : E −→ R under which ∇P is pointwise normal to an embedding φ(M). Through- out the paper, we assume that P satisfies this condition. Given a pointwise normal vector field u along φ(M) with the length of each vector in u at most one, §5 gives a lower bound for t∗ such that

φt(m)= φ(m)+ tum ∗ 1 −1 remains an embedding for all |t| < t . In particular, this applies to u = kφ ·∇Pφ, where kφ = maxx∈M k∇Pφ(x)k. Since M is compact, it suffices to prove that φt(M) is an injective immersion.

3.2. Note on Computation of Key Values. Proposition 5.1 gives a condition under ∗ which φt is an immersion, and Theorem 5.2 defines the bound t in which φt is injective. Finally, Theorem 5.7 concludes the mapping is an embedding. In the proof of Theorem 5.2, ∗ t is initially a function of the quantities ǫ, δH ,δ,K. ǫ is defined in §5.1(1), and K is explicitly defined in §5.1(7) as the maximal principal eigenvalue of φ(M). In Lemma 4, ǫ is computed ∗ ∗ ∗ as a function of δ and K, so t = t (K, δ, δH ). The dependence of t on δH is eliminated after (5.18), so finally t∗ = t∗(K, δ). The computation of δ is significantly more involved. The characterizing property of δ is in §5.1(8). δ is defined in (5.2) as the minimum of a quantity δ(q0, v0), where (q0, v0) is in the normal bundle of φ(M). In turn, δ(q0, v0) is computed in the proof of Proposition 5.6 0 1 in three steps, each of which builds on the prior: δ (q0, v0) is defined in (5.11), δ (q0, v0) is 2 defined by (5.12), and δ (q0, v0) is defined in (5.16). Finally δ(q0, v0) is defined in (5.17) in 0 2 terms of δ (q0, v0), δ (q0, v0). These steps are recapped in Remark 5.2.

4. A Condition for Normal Gradient Vector Fields As in the photographic images example in the introduction, manifold learning involves N searching for an embedding φ : M −→ R with yi ∈ Im(φ) for training data {yi}. Of course, yi ∈ Im(φ) iff yi ∈ Im(φ ◦ g), where g ∈ Diff(M) is a diffeomorphism of M. Thus the penalty term P1 : E −→ R which measures goodness of fit should not distinguish between φ and φ◦g, i.e., this penalty term must be invariant under the action of Diff(M): P1(φ)= P1(φ◦g). The r 2 data penalty term P1(φ)= i=1 d (φ(M),yi) in the introduction is clearly diffeomorphism- invariant. (Since the quotient space E/Diff(M) may have a non-Hausdorff topology, we consider diffeomorphism-invariantP penalty functions on E, rather than penalty functions on the quotient space.) These types of invariant functionals are familiar in gauge theory, where functionals are invariant under gauge group actions, and in Gromov-Witten theory, where maps are defined only up to holomorphic automorphisms. Similarly, we can replace the non-diffeomorphism invariant regularization term kφks, which ′ is computed in a choice of local coordinates, by e.g. P2(φ) = vol(φ(M)), which measures a combination of the first derivatives of φ =(φ1,...,φN ), or by 1/2 ′ N s j j P2(φ)= M j=1((Id + ∆) φ ) · φ dvolM , which is equivalent to the s-Sobolev norm by 2 R3 ′ 2 2 ~ the basicR elliptichP estimate. As a simplei example, for E = Emb(S , ), P1(φ)= d (φ(S ), 0),

1The Euler class of the normal bundle e ∈ HN−dim(M)(M) is the obstruction to the global existence of a unit normal vector field. Since e may be nonzero, we must refer to vector fields whose elements have length at most one. If N > 2dim(M), the obstruction vanishes. DISCRETIZEDGRADIENTFLOWFOR MANIFOLDLEARNING 5

′ 2 2 P2 = vol(φ(S )), and for the standard unit sphere as the initial embedding φ0(S ), gradient ′ ′ ′ flow for P = P1 + P2 shrinks the unit sphere to the origin in infinite time. In this section, we prove that such penalty functions have special gradients, and apply this result to E. We first review a known result about the gradient function on a finite dimensional manifold with a group action. Recall that for a C1 function P : N −→ R on an oriented Riemannian manifold (N, h), the gradient vector field ∇P is characterized by

dPm(v)= h∇P, vih(m), for all m ∈ N, v ∈ TmN. Here dPm : TmN −→ R, the differential of P at m, is independent of the Riemannian metric. Lemma 4.1. Let G be a connected Lie group acting via isometries on a Riemannian manifold N. A function P : N −→ R is G-invariant (P (g · m)= P (m) for all m ∈ N,g ∈ G) iff ∇Pm is perpendicular to the orbit Om = {g · m : g ∈ G} for all m ∈ N.

Strictly speaking, we mean ∇P (m) ⊥h(m) TmOm.

Proof. If P is G-invariant, then Om is contained in a level set of P . The gradient is always perpendicular to a level set: for X ∈ TmO, take a curve γ(t) ∈ Om withγ ˙ (0) = X, and compute

0=(d/dt)|t=0P (γ(t)) = dPm(X)= h∇Pm,Xi.

Conversely, assume that ∇Pm ⊥ T Om for all m. Take a smooth path η(t), t ∈ [0, 1], from e ∈ G to a fixed g ∈ G, and for a fixed m ∈ N define γ(t)= η(t) · m. Then

0= h∇Pγ(t), γ˙ (t)i = dPγ(t)(γ ˙ (t)), so P is constant along γ(t). In particular, P (m)= P (γ(0)) = P (γ(1)) = P (g · m). 

We want to apply this result with N,G given by E, Diff(M), respectively. (Since Diff(M) need not be connected, we have to restrict to Diff0(M), the connected component of the identity diffeomorphism.) The smooth structure on mapping spaces is well known (see e.g., [14]). Rather than go through the technicalities of the Lie group structure on Diff(M) [31], we give a direct proof. The tangent space TφE at an embedding φ is given by the infinitesimal variation of a family N N of embeddings φ(t), which for fixed m ∈ M is given by (d/dt)|t=0φt(m) ∈ Tφ(m)R ≃ R . n Thus elements X of TφE are “R -valued vector fields along φ(M),” i.e., smooth functions X : M −→ RN . For φ ∈ E, M has a Riemannian metric gφ given by the φ-pullback of the standard N metric/dot product on R restricted to φ(M). Specifically, for v,w ∈ TmM, hv,wim = 2 dφ(v) · dφ(w). Denote the associated volume form on M by dvolφ. We take the L inner N product on TφE associated to the standard metric/dot product on R and gφ:

hX,Y iφ = Xm · Ym dvolφ(m). ZM Thus the gradient of P : E −→ R is characterized by

dPφ(X)= h∇Pφ,Xiφ = ∇Pm · Xm dvolφ(m). ZM Diff(M) acts on φ ∈E by g · φ = φ ◦ g−1. It is standard that Diff(M) acts via isometries on E with the L2 metric. In our setting, we can strengthen Lemma 4.1 to the pointwise normal condition ∇Pφ(m) · Qm = 0 for all Qm ∈ Tφ(m)φ(M), m ∈ M, for a Diff(M)-invariant P . 6 DARA GOLD AND STEVEN ROSENBERG

Theorem 4.2. For a C1 function P : E −→ R, the gradient ∇P is pointwise normal to Tφ(m)φ(M) for all m ∈ M and for all φ ∈ E if and only if P is invariant under diffeomor- phisms in Diff0(M), the path connected component of the identity in Diff(M). We note that this pointwise perpendicularity is measured in the usual dot product on RN , even though we have implicitly been using φ-pullback metrics on M. In particular, the theoedretical use of the pullback metric does not affect the practical implementation of discretized gradient flow.

Proof. Assume P is Diff0(M)-invariant. As in the Lemma, we conclude that ∇Pφ ⊥L2 TφOφ. Take a family of diffeomorphisms gt of M with g0 = Id and with tangent vector X = (d/dt)|t=0gt ∈ TIdDiff(M). Then φ ◦ gt ∈ Oφ, and the vector field (d/dt)|t=0φ ◦ gt = dφ(X) tangent to φ(M) is in TφOφ. Conversely, any tangent vector field V to φ(M) integrates to a family of diffeomorphisms in Diff0(M), so we conclude that V ∈ TφOφ and that (up to a choice of topology on Diff(M)) TφOφ is the space of tangent vector fields to φ(M).

Fix m0 ∈ M and a vector Qm0 ∈ Tφ(m0)φ(M). Choose a sequence ǫk −→ 0 and smooth R functions fk : φ(M) −→ such that M fk dvolφ = 1, supp(fk) ⊂ Bk(φ(m0)) ∩ φ(M), with B (φ(m )) the Euclidean ball of radius ǫ centered at φ(m ). Extend Q to a vector field k 0 R k 0 m0 Q = Qm on φ(M), and define the vector fields Yk on φ(M) by:

Yk(φ(m)) = fk(φ(m)) · Qm. Then we have

0 = lim h∇Pφ,Yǫk i = lim h∇Pφ, fk · Qi = lim ∇Pφ(φ(m)) · fk(φ(m))Qm dvolφ ǫk−→0 ǫk−→0 ǫk−→0 ZM = ∇Pφ(φ(m0)) · Qm0

Therefore ∇Pφ(φ(m0)) ⊥ Qm0 , and so ∇Pφ(φ(m0)) ⊥ Tφ(m0)φ(M). For the converse, assume that ∇Pφ(φ(m)) ⊥ Tφ(m)φ(M) for all m ∈ M. Then ∇P ⊥L2 Q for all tangent vector fields Q to φ(M), and so ∇P is perpendicular to the orbit of Diff0(M). As in Lemma 4.1, we conclude that P is Diff0(M)-invariant. 

5. Estimates for Flows in Normal Gradient Directions Under the assumption that our penalty function is diffeomorphism invariant, to implement discretized gradient flow, by Theorem 4.2 we have to know how far φ(M) can move in a fixed normal gradient direction while remaining in the space of embeddings. The next set of results gives an explicit estimate for the lower bound of this flow, with the main result in Theorem 5.7. Throughout the paper, we assume that M is compact. We recall that for compact mani- folds, φ : M −→ RN is an embedding iff it is an injective immersion. Here φ is an immersion if its differential dφ is pointwise injective, which is the infinitesimal condition for the map φ to be a local injection. Thus, there are two types of obstructions to a linearly deformed embedding φt of φ remaining an embedding: (1) a local obstruction, where distinct nearby points in φ(M) deform to the same point in φt(M); (2) a global obstruction, where points far from each other in the induced Riemannian metric on φ(M) deform to the same point N in φt(M) because they are close in R . The local obstruction is controlled by the injectivity of the differential. Specifically, in Theorem 5.7, the local obstruction is controlled by K, a bound on the principal curvatures of φ. The global obstruction, which cannot be treated by infinitesimal means, is controlled in Theorem 5.7 by δ, which is constructed by bounds in the Implicit Function Theorem. DISCRETIZEDGRADIENTFLOWFOR MANIFOLDLEARNING 7

5.1. Notation and Definitions.

(1) ǫ = ǫφ is chosen so that each s in the ǫ-neighborhood Bǫ(φ(M)) of φ(M) has a unique closest point q = q(s) in φ(M). The existence of this neighborhood is guaranteed by the ǫ-Neighborhood Theorem [22, Thm. 6.24]. Bǫ(φ(M)) is diffeomorphic to a neighborhood of the zero section of the normal bundle ν = νφ of φ(M): we have s − q ∈ νq = νφ,q, the fiber of νφ at q, and the map s 7→ s − q is the diffeomorphism. A lower bound for ǫ is given in terms of δ in (8) below in Lemma 5.5; it will become explicit in Remark 5.2. (2) We use two sets of coordinates on RN . Standard (global) coordinates are denoted 1 N (x ,...,x ). We also represent points s ∈ Bǫ(φ(M)) as

s =(q1,...,qk, v1,...,vN−k)=(q, v),

where the qi are local manifold coordinates and vj are local coordinates for the normal space. These are called normal coordinates. Thus q ∈ φ(M) has q = (q1,...,qk, 0,..., 0). Here k = dim(M). Note that normal coordinates are not well defined outside Bǫ(M). (3) A vector in νφ can be expressed either as tvq, where vq is a unit length vector at q, i or as v wi,q, where {wi,q} is an orthonormal basis of νφ,q. There are N − k {wi,q} vectors, each with N Euclidean coordinates. N (4) The endpoint map E : νφ → R is E(q, v)= q + v. It is given explicitly by:

1 k 1 N−k 1 i 1 N i N−k E(q ,...,q , v ,...,v )=(x (q)+ v wi,q,...,x (q)+ v wi,q ),

where the domain is in normal coordinates and the range is in standard coordinates. Points e = qe + ve for which the Jacobian of the E map is not full rank at (qe, ve) are by definition focal points [28, §6]. (5) The inclusion map φ(M) → RN is q = (q1, ··· , qk) 7→ (x1(q), ··· , xN (q)) = x(q) in manifold to Euclidean coordinates, so the first fundamental form is the matrix ∂x ∂x (gij) = ∂qi · ∂qj , where · is the Euclidean dot product. The second fundamental ∂2x form at the normal vector v ∈ νφ is the matrix IIv = v · ∂qi∂qj . (6) At a fixed q ∈ φ(M), we may choose manifold coordinates so that the first fundamen- tal form is the identity matrix. The principal curvatures of v at q are by definition the eigenvalues p1,...,pk of IIv. Here pi = pi(q, v). (7) Let K be the maximal principal eigenvalue of φ(M). Thus we take the maximum of the pi(v) over all unit vectors in νφ. (8) δ is chosen such that normal lines of length ǫ based at different, close points of φ(M) do not intersect: for dRN (φ(m1),φ(m2)) < δ, φ(m1)+ t1v1 =6 φ(m2)+ t2v2 for unit

normal vectors vi ∈ νφ(mi), i = 1, 2, and |t1|, |t2| < ǫ, with ǫ defined in (1) above. δ is precisely defined in (5.2), and estimated in Remark 5.2. (δ is called the reach of φ(M) in e.g. [15].)

Remark 5.1. In the calculations below, estimates for ǫ, δ, K are computed explicitly in terms of φ, local coordinates on M, and local coordinates on νφ. Specifically, a lower bound for ǫ in terms of K and δ is given in Lemma 4. K of course depends on φ, but is in fact independent of coordinates on M, as it is the maximum eigenvalue of any normal component of the trace of the second fundamental form. The estimate of δ uses φ, local coordinates on M, and local coordinates on νφ in e.g., the proof of Proposition 5.6. It is reasonable to assume knowledge of coordinates on M, as a manifold is specified by a cover of charts. In 8 DARA GOLD AND STEVEN ROSENBERG

2 fact, local coordinates on M and φ determine local coordinates on νφ. Thus, in the end our estimates depend only on local coordinates on M and on φ. See Remark 5.2 for more details. 5.2. Calculating the Flow Length to Remain an Embedding. In this section, we ∗ ∗ compute t such that for t < t and u a normal vector field along φ(M) with |uφ(m)| ≤ 1, the deformed manifold φt(M) = {φ(m)+ tuφ(m) : m ∈ M} is an embedding. Since M is assumed compact, it suffices to prove that each φt is an immersion. We start by determining which normal deformations φt(M) of φ(M) are still immersions. Proposition 5.1. Let u be a normal vector field of length at most one along φ(M) ⊂ RN , N and let ǫ be defined in §5.1(1). Then φt(M)= {φ(m)+ tuφ(m) : m ∈ M} is immersed in R for |t| < ǫ.

N Proof. Because φ : M −→ R is an embedding, it suffices to show that the map Ft : φ(M) → φt(M), Ft(q)= q + tuq, is an immersion. In normal coordinates, we have

1 k 1 k 1 N−k Ft(q ,...,q )=(q ,...,q , tuq,...,tuq ).

The differential DFt, written as an N × k matrix, is of the form

Id ×  k k  DFt = ,      ⋆      where ⋆ is some (N − k) × k matrix. This has rank k, so Ft is an immersion. We note ǫ is implicitly used as normal coordinates are only defined in Bǫ(φ(M)). 

∗ Thus φt is an embedding if it is injective. Theorem 5.2 proves injectivity for |t| ≤ t , where t∗ is defined in the Theorem statement. The proof of Theorem 5.2 follows after the proofs of Lemmas 5.3-5.5 and Proposition 5.6. Theorem 5.2. Let u be a normal vector field of length at most one along φ(M) ⊂ RN Let ∗ −1 N t = min{K ,δ/3}. Then φt : M → R given by m 7→ φ(m)+ tuφ(m) is injective for |t| ≤ t∗. Here δ is given by §5.1(8), and will be estimated explicitly after the proof of Proposition 5.6.

Proof. As in the previous proof, it suffices to show that Ft : φ(M) −→ φt(M) is injective. N We extend Ft to a map between open subsets of R by setting

Ht : Bǫ−t(φ(M)) −→ Bǫ(φ(M)), Ht(b)= b + tuq(b), where q(b) is the closest point in φ(M) to b. Note that Ht|φ(M) = Ft and that Ht is defined only for |t| < ǫ. We now proceed with a series of Lemmas.

q Lemma 5.3. For each q ∈ φ(M), there exists a ball Bδq (q) of radius δ around q on which Ht Ht Ht is a diffeomorphism.

2 N Take the standard basis {ei} of R . For I = (i1, ··· ,iN−k) with 1 ≤ i1 < ··· < iN−k ≤ N, lexico- graphically ordered, set eI = (ei1 ,...,eiN−k ) Let UI be the open set of q ∈ Φ(M) such that I is the smallest multi-index such that the projection of eI into νφ,q is a basis of νφ,q. Then νφ is trivial over UI , and we can form a new, fixed cover of M by taking {Vi ∩ UI }. In particular, the local coordinates on νφ in (2) are not extra data, since the embedding φ determines which q are in which UI . DISCRETIZEDGRADIENTFLOWFOR MANIFOLDLEARNING 9

Proof. In normal coordinates, we have 1 k 1 N−k 1 k 1 1 N−k N−k Ht(b)= Ht(q ,...,q , v ,...,v )=(q ,...,q , v + tuq(b),...,v + tuq(b) ).

For q =(q, 0) ∈ φ(M), the differential of the Ht map has the matrix ∂qi Id × 0 Idk×k ∂vj k k

DHt(q)=   =   . i i i i ∂(vi+tui )  ∂(v +tuq ) ∂(v +tuq)  q  j Id − × −   ∂qj ∂uj   ∂q (N k) (N k)      This matrix is invertible, so the Lemma follows from the inverse function theorem.  q Let δHt = minq{δHt }. Set

δH = min{δHt : |t| ≤ .999ǫ}. (5.1)

Note that δH = δH (u) depends on the choice of the normal vector field u.

∗ def Lemma 5.4. Ht|φ(M) is injective for |t| < t = min {ǫ, δH /3}. ∗ Proof. Assume instead that there exist x, y ∈ φ(M) such that x + tux = y + tuy for |t| < t .

By Lemma 5.3, dRN (x, y) > δHt . Then

δHt < dRN (x, y)= |x − y| = |x − (x + tux)+(x + tux) − y| ∗ ≤ |x − (x + tux)| + |(y + tuy) − y| = |tux| + |tuy| ≤ 2|t| < 2t

≤ 2δHt /3, ∗  since t < δHt /3. This is a contradiction. We now compute ǫ in §5.1(1) in terms of K in §5.1(7) and δ in §5.1(8). As mentioned above, K is computed locally on φ(M), while δ is computed globally using the Euclidean distance. Lemma 5.5. Set ǫ = min {K−1,δ/3}, where K is given in §5.1(7) and δ is given in §5.1(8). Then every point in Bǫ(φ(M)) has a unique closest point in φ(M). Proof. By [28, Lem. 6.3], the focal points of φ(M) along the normal line l = q + tv are −1 precisely the points q +pi v, where the pi are the nonzero principal curvatures. The proof of the ǫ-Neighborhood Theorem in [22, Thm. 6.24] uses the invertibility of the endpoint map, so we must have ǫ δ. As in the previous proof, we have ′ δ

We can now define δ in (5.2) below, after which we explicitly estimate it in the proof of Proposition 5.6. The steps of the estimate are recapped in Remark 5.2. We first restrict the N −1 endpoint map E : νφ −→ R to the compact set W = {v ∈ νφ : |v| ≤ .999K }. For fixed q0 ∈ φ(M) and (q0, v0) ∈ νφ,q0 ∩ W = Wq0 , the proof of Lemma 5.3 shows that DE(q0, v0) is invertible. Therefore, there is a ball of radius δ(q0, v0) around (q0, v0) on which E is a diffeomorphism. Set δq0 = δ(q0, 0) and

Aq0 = {q ∈ φ(M): dRN (q, q0) < δq0 /2}. 10 DARA GOLD AND STEVEN ROSENBERG

We claim that E is a diffeomorphism on the the set Bq0 ⊂ νφ given by

Bq0 = {(q, v): |v| < δq0 /2, q ∈ Aq0 }.

Indeed, for (q1, v1) ∈ Bq0 , we have

|(q1, v1) − (q0, 0)|≤|(q1, v1) − (q1, 0)| + |(q1, 0) − (q0, 0)| ≤ δq0 + δq0 /2 ≤ δq0 .

Thus for (q1, v1), (q2, v2) ∈ Bq0 and (q1, v1) =6 (q2, v2), we conclude (q1, 0), (q2, 0) ∈ Aq0 and

E(q1, v1) =6 E(q2, v2). Since E is invertible on Bq0 , it is a diffeomorphism onto its image. We set 1 δ = min{δ(q , v ):(q , v ) ∈ ν , |v| ≤ .999K−1}. (5.2) 2 0 0 0 0 φ

In other words, for q1, q2 ∈ φ(M) with dRN (q1, q2) < δ, we have q1 + v1 =6 q2 + v2 for

|v1|, |v2| < δ and (q1, v1) ∈ νφ,q1 , (q2, v2) ∈ νφ,q2 . For a fixed (q0, v0), it remains to compute δ(q0, v0) explicitly, after which δ in (5.2) is explicit. The computation of δ(q0, v0) uses a quantitative version [24] of the Implicit Function Theorem given in the next Proposition. The proof is in the Appendix. To set the notation, let the matrix norm kAk be the sup norm of the absolute values of 1 m+n m m+n m the entries. For G ∈ C (R , R ), let (s0,y0) ∈ R × R satisfy G(s0,y0) = 0. For Rm+n fixed δ > 0 let Vδ = Vδ(s0,y0) = {(s,y) ∈ : |s − s0| ≤ δ, |y − y0| ≤ δ}. We focus on the case G(s,y)= E(s) − y, the usual method to derive the Inverse Function Theorem from the Implicit Function Theorem.

Proposition 5.6. Assume that the m × m matrix ∂sG(s0,y0) of partial derivatives of G in the s directions is invertible. Choose δ0 > 0 such that

−1 sup kId − [∂sG(s0,y0)] ∂sG(s,y)k ≤ 1/2. (5.3) ∈ (s,y) Vδ0 Set (I) B 0 = sup ∈ k∂yG(s,y)k, δ (s,y) Vδ0 −1 (II) P = k∂sG(s0,y0) k, 1 −1 0 (III) δ = (2P Bδ0 ) δ . 0 Then for the case n = m and G(s,y)= E(s)−y, on the set {(s,y): ks−s0k < δ , ky −y0k < δ1,G(s,y)=0}, E has a C1 inverse: E(s) = y iff s = E−1(y). Equivalently, E is a C1 diffeomorphism on −1 E (Bδ1 (y0)) ∩ Bδ0 (s0). (5.4)

To apply the Proposition, we set n = m = N and G((q, v),y)= E(q, v) − y, where E is the endpoint map. We follow the Proposition’s labeling in a series of steps:

0 0 Criterion I: Independent of the value of δ = δ ((q0, v0),y0), we have

Bδ0 = sup ||∂yG((q, v),y)|| = sup ||∂y(E(q, v) − y)|| ∈ ∈ ((q,v),y) Vδ0 ((q,v),y) Vδ0 = sup k − Idk =1. ∈ ((q,v),y) Vδ0 Criterion II: By §5.1(4),(7),

∂(q,v)G((q0, v0),y0)= DE(q0, v0) DISCRETIZED GRADIENT FLOW FOR MANIFOLD LEARNING 11 is invertible for |v|

DE(q0, v0)= 1 1 1 1 ∂x i ∂wi ∂x i ∂wi 1 1 ∂q1 + v ∂q1 |(q0,v0) ··· ∂qk + v ∂qk |(q0,v0) w1,q0 ··· wN−k,q0 . . . .   .   .  . .  (5.5) N N N N ∂x i ∂wi ∂x i ∂wi N N  1 + v 1 |(q0,v0) ··· k + v k |(q0,v0) w 0 ··· w − 0   ∂q ∂q ∂q ∂q 1,q N k,q  By Cramer’s  rule,     −1 −1 ∗ P = kDE(q0, v0) k = (det(DE(q0, v0))) k(DE(q0, v0) k, (5.6) ∗ th where DE(q0, v0)(i,j) is the usual minor of DE(q0, v0) obtained by deleting the i row and th j column. Since φ and the wi are given, we obtain an estimate for P . 1 1 0 0 Criterion III: We now compute δ = δ (q0, v0), δ = δ (q0, v0) such that (5.3) holds for 0 ((q, v),y). Since (5.3) is independent of y in our case, we need δ (q0, v0) such that 0 −1 |(q, v)| < δ (q0, v0) ⇒kId − [DE(q0, v0)] DE(q, v)k ≤ 1/2. (5.7)

We consider a first order Taylor expansion of DE(q, v) around s0 = (q0, v0). (Note: The summed index j below refers to coordinates in RN , not an exponent). For s = (q, v), we have (1,1) j (1,N) j Rj (q, v)(s − so) ··· Rj (q, v)(s − so) . . DE(s) = DE(s0)+  . .  (5.8) R(N,1)(q, v)(s − s )j ··· R(N,N)(q, v)(s − s )j  j o j o  def  (p,r) j  = DE(s0)+(Rj (q, v)(s − so) ). r r r ∂x (q) i ∂wi (q) r r As in Criterion II, set fp = ∂qp + v ∂qp for all 1 ≤ p ≤ N, 1 ≤ r ≤ k, and fp = wp,q for 1 ≤ p ≤ N, k +1 ≤ r ≤ N. A uniform bound on the error term is given by Taylor’s theorem with integral remainder: 1 (p,r) j r j Rj (q, v)(s − s0) ≤ (1 − t)∂jfp ((1 − t)(q0, v0)+ t(q, v))dt · (s − s0) Z0 r − 1 ≤ max ∂jfp (q, v) :1 ≤ j ≤ N, |v| ≤ .999K , q ∈ φ(M ) |s − s0|

def= G(p,r)| s − s |. j 0 Here ∂j differentiates in the s variable. Set (p,r) (p,r) G = max{Gj } = max kGj k (5.9) j,p,r j Plugging (5.8) into the right hand side of (5.7) and canceling the identity matrix, the matrix norm in (5.7) becomes

−1 (p,r) j −1 p (ℓ,r) j [DE(q0, v0)] (Rj (q, v)(s − s0) ) = max ([DE(q0, v0)] ) (Rj (q, v)(s − s0) ) j,p,r ℓ −1 0 ≤ Nk[DE (q0, v0)] k· G · δ (q0, v0), (5.10) where the N comes from the sum over ℓ =1,...,N. Setting 0 −1 −1 δ (q0, v0)= 2NkDE(q0, v0) k· G , (5.11) we conclude that the estimate (5.7) is satisfied.  In summary, we now have − − 1 0 1 0 1 0 δ (q0, v0)=(2P Bδ (q0,v0)) δ (q0, v0)=(2P ) δ (q0, v0), (5.12) 1 by Criterion I. Thus δ (q0, v0) is estimated by Criterion II and III. 12 DARA GOLD AND STEVEN ROSENBERG

− 1 1 0 By Proposition 5.6, E is a diffeomorphism on E (Bδ (q0,v0)(y0)) ∩ Bδ (q0,v0)(q0, v0). To be explicit, we want to find radius δ(q0, v0) such that − 1 1 0 Bδ(q0,v0)(q0, v0) ⊂ E (Bδ (q0,v0)(y0)) ∩ Bδ (q0,v0)(q0, v0). (5.13)

2 We first find δ (q0, v0) such that

2 1 |(q, v) − (q0, v0)| < δ (q0, v0) ⇒|E(q, v) − E(q0, v0)| = |E(q, v) − y0| < δ (q0, v0). In other words, we want

2 1 |(q, v) − (q0, v0)| < δ (q0, v0) ⇒ E(q, v) ∈ Bδ (q0,v0)(y0). (5.14)

2 As above, we compute δ (q0, v0) by a Taylor series expansion of E around (q0, v0):

1 j N j E(q, v)= E(q0, v0)+ Rj (q, v)((q, v) − (q0, v0)) ,..., Rj (q, v)((q, v) − (q0,s0)) , j j ! X X with

p p i p −1 |Rj (q, v)| ≤ max ∂j(φ + v wi )(q, v) :1 ≤ j ≤ N, |v| ≤ .999K , q ∈ φ(M) def= Gp.  (5.15)

For s0 =(q0, v0),s =(q, v), we have

N 2 N 2 p j p 2 2 |E(s) − E(s0)| = Rj (s)(s − s0) ≤ |Rj (s)| |s − s0| p=1 j ! p=1 j ! X X X X N N p 2 2 p 2 2 ≤ N |G | |s − s0| ≤ |G δ (q0, v0)| . p=1 ! p=1 j X X X Therefore, for − N 1/2 2 1 p 2 δ (q0, v0)= δ (q0, v0) N |G | , (5.16) p=1 ! X estimate (5.14) holds. Finally, setting

2 0 δ(q0, v0) = min{δ (q0, v0), δ (q0, v0)} (5.17) accomplishes (5.13).

By Lemmas 5.4, 5.5, and using (5.2) to define δ, we know that Theorem 5.2 holds, i.e., φt is injective, for ∗ −1 t < min{K , δH /3,δ/3}. (5.18) ∗ −1 If we prove that δH > δ, then we get injectivity of φt for t < min{K ,δ/3}, which is Theorem 5.2. By the definition of δ in §3.1(8), we have x, y ∈ φ(M) and dRN (x, y) < δ implies x+t1vx =6 y + t2vy for |ti| < ǫ and for any unit normal vectors vx, vy at x, y, resp. By Lemma 5.3, for dRN (x, y) < δHt = δHt (u) for a fixed normal vector field u of length at most one, we have x + tux =6 y + tuy. (By the remarks above Lemma 5.3, we also have |t| < ǫ here.) Since δ does not depend on a choice of vector field u, we have δ ≤ δHt (u). This implies δ ≤ δH . Thus ∗ −1 we can conclude that φt is injective for t < min{K ,δ/3}, and the proof of Theorem 5.2 is complete. DISCRETIZED GRADIENT FLOW FOR MANIFOLD LEARNING 13

0 Remark 5.2. We review the explicit lower bound for δ. For G defined by (5.9), δ (q0, v0) is 1 p defined by (5.11). For P defined by (5.6), δ (q0, v0) is defined by (5.12). For G defined in 2 (5.15), δ (q0, v0) is defined in (5.16). Then (5.17) defines δ(q0, v0). Finally, (5.2) defines δ. In particular, lower bounds on P, G, and Gp will give a lower bound on δ. These constants depend on q-derivatives (i.e., M coordinate derivatives) of the RN coordinates of φ and of vectors in νφ (see e.g., (5.5)). Since the normal bundle is determined by M and φ, our estimates are explicit in the sense of Remark 5.1.

5.3. The Main Theorem. Since M is compact and since φt is an injective immersion for ∗ |t| ≤ t by Theorem 5.2, we obtain the main result that φt is an embedding for t less than an explicit t∗. Theorem 5.7. Let u be a normal vector field of length at most one along φ(M) ⊂ RN . Let t∗ = min{K−1,δ/3}, with K defined in §5.1(7) and δ estimated in Remark 2. Then N ∗ φt : M → R given by m 7→ φ(m)+ tuφ(m) is an embedding for |t| ≤ t . 6. Discussion In this paper, we have proposed treating manifold learning by gradient flow techniques that are standard in much of machine learning. By doing gradient flow in the infinite dimensional space of embeddings of a fixed manifold M into RN , we avoid parametric and RKHS methods. These methods typically restrict the class of manifolds considered to a finite dimensional space, which speeds up computation time at the cost of perhaps oversimplifying the problem. In our approach, we give both a theoretical reason to move only in normal directions to the embedded manifold and theoretical lower bounds on the existence for each step of a good discretized version of gradient flow on the space of embeddings. However, this paper does not discuss computational issues, which must be addressed in future work. In particular, one has to recompute the estimates for the maximal time t∗ of travel after each step. This reflects the theoretical issue that the gradient flow may leave the space of embeddings in finite time. It may be possible to add a penalty term to the objective function that forces the gradient flow to stay in the space of embeddings. This new term would involve the bounds we computed on both local quantities like K and global quantities like δ in §5.1. There are two theoretical issues that need further examination. The first is the choice of M: how is this manifold specified? Based on Riemannian geometry estimates dating to the 1980s, it is reasonable to assume that we want to consider manifolds of a fixed dimension with a priori a lower bound on volume, an upper bound on diameter, and two-sided bounds on sectional curvature. Cheeger’s finiteness theorem [7] asserts that there are only a finite number of diffeomorphism classes among all such manifolds. (It would be interesting to determine if the class G(d,V,τ) in [15] has a similar finiteness theorem.) However, while this in theory provides us with a finite list of choices, the proof of the finiteness theorem is nonconstructive. In practice, in many cases we might as well assume that M is the closed unit ball Bk in Rk. For example, in the famous Swiss roll examples, the data set appears to lie on the image of a severely deformed B2. In contrast, if the training data appears to lie on a deformed torus, B2 is a worse choice for M than the standard torus. Perhaps even more importantly, it is unclear how to specify the dimension of M in advance. This has been discussed in the literature: see e.g. [38] and its references for work done before the last decade, and [17] for more recent work. In these works, issues such as the potentially fractal/Hausdorff dimension of the data set have been discussed. From a more geometric mindset, we could speculatively start with a k-manifold, and hope that in the long run, M would collapse in the sense of Cheeger-Gromov [8] to a lower dimensional manifold of “best” dimension. Even more speculatively, since all Riemannian manifolds are via cut locus 14 DARA GOLD AND STEVEN ROSENBERG arguments homeomorphic to a closed ball with gluings on the boundary, we could start with the k-ball Bk, add a regularization term, like the volume of ∂Bk = Sk−1, that penalizes the existence of a boundary, and hope that long time flow provides both dimension collapse and boundary gluing. We have no evidence that this will work, but a low dimensional computation is potentially feasible.

Acknowledgements Our thanks to Carlangelo Liverani for allowing us to use his Quantitative Implicit Function Theorem. We are also grateful to Drew Lohn, Andres Larrain-Hubach and Qinxun Bai for their comments on an earlier draft of this paper.

Appendix A. The Quantitative Implicit Function Theorem This quantitative version of the Implicit Function theorem and its proof are from [24] (see also [9, Appendix A]).

1 m+n m m n For F ∈ C (R , R ), let (x0,λ0) ∈ R × R satisfy F (x0,λ0) = 0. Theorem A.1 (Quantitative Implicit Function Theorem). Assume that the m × m matrix ∂xF (x0,λ0) is invertible and choose δ > 0 such that −1 sup ||Id − [∂xF (x0,λ0)] ∂xF (x, λ)|| ≤ 1/2. (x,λ)∈Vδ −1 1 −1 Let Bδ = sup(x,λ)∈Vδ ||∂λF (x, λ)|| and M = ||∂xF (x0,λ0) ||. Set δ = (2MBδ) δ, and set n 1 m+n 1 Γδ1 = {λ ∈ R : |λ − λ0| < δ }, Vδ,δ1 = {(x, λ) ∈ R : |x − x0| ≤ δ, |λ − λ0| ≤ δ }. 1 1 Rm Then there exists g ∈ C (Γδ , ) such that all solutions of the equation F (x, λ)=0 in the set Vδ,δ1 are given by (g(λ),λ). In addition, −1 ∂λg(λ)= −(∂xF (g(λ),λ)) ∂λF (g(λ),λ).

1 1 Rm Proof. Take λ ∈ Vδ = |λ − λ0| < δ . Consider Uδ = {x ∈ : |x − x0| ≤ δ} and m Ωλ : Uδ → R defined by −1 Ωλ(x)= x − ∂xF (x0,λ0) F (x, λ).

For x ∈ Uδ, F (x, λ) = 0 is equivalent to x = Ωλ(x). We have 1 |Ωλ(x0) − Ωλ0 (x0)| ≤ M|F (x0,λ) − F (x0,λ0)| ≤ MBδδ . −1 1 In addition, |∂xΩλ| = |Id − ∂xF (x0,λ0) ∂xF (x, λ)| ≤ 1/2, so |Ωλ(x) − Ωλ(x0)| ≤ 2 |x − x0|. Thus

|Ωλ(x) − x0|≤|Ωλ(x) − Ωλ(x0)| + Ωλ(x0) − x0| 1 ≤ |x − x | + MB δ1 ≤ δ. 2 0 δ

Thus Ωλ is a contraction on Uδ, and Ωλ(x) = x has a unique solution x = g(λ) by the Contraction Fixed Point Theorem. We have therefore obtained a function g : Vδ1 → Uδ such that F (g(λ),λ) = 0. All solutions in Vδ,δ1 are of this form: if F (x1,λ1) = 0, then 1 |x − g(λ )| = |Ω (x ) − Ω (g(λ ))| ≤ |x − g(λ )|, 1 1 λ1 1 λ1 1 2 1 1 so x1 = g(λ1). ′ For the final statement in the Theorem, let λ,λ ∈ Γδ1 . As above, we have 1 |g(λ) − g(λ′)| ≤ |g(λ) − g(λ′)| + MB |λ − λ′| 2 δ DISCRETIZED GRADIENT FLOW FOR MANIFOLD LEARNING 15

This yields the Lipschitz continuity of g. To obtain differentiability, we note that by the differentiability of F and the Lipschitz continuity of g, for h ∈ Rn small enough,

|F (g(λ + h),λ + h) − F (g(λ),λ)+ ∂xF [g(λ + h) − g(λ), h]+ ∂λF (g(h), h)| = o(|h|). Since F (g(λ + h),λ + h)= F (g(λ),λ) = 0, we obtain −1 −1 lim |h| |g(λ + h) − g(λ) + [∂xF (g(h), h)] ∂λF (g(h), h)| =0. h−→0 

References 1. Agarwal A. and B. Triggs, A local basis representation for estimating human pose from cluttered images., Computer Vision – ACCV 2006. ACCV 2006. Lecture Notes in Computer Science 3851 (2006). 2. Luigi Ambrosia, Nicola Gigli, and Giuseppe Savar´e, Gradient Flows in Metric Spaces and in the Space of Probability Measures, Birkh¨auser, Basil, 2008. 3. Mich`ele Audin and Mihai Damian, Morse theory and Floer homology, Universitext, Springer, London; EDP Sciences, Les Ulis, 2014. 4. Qinxun Bai, Steven Rosenberg, Zheng Wu, and Stan Sclaroff, A differential geometric approach to classification, Proceedings of The 33rd International Conference on Machine Learning 48 (2016). 5. Mihail Belkin and Partha Niyogi, Laplacian eigenmaps for dimensionality reduction and data represen- tation, Neural Computation 15 (2003), 1373–1396. 6. Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani, Manifold regularization: A geometric framework for learning from labeled and unlabeled examples, Journal of Machine Learning Research 7 (2006), 2399– 2434. 7. Jeff Cheeger, Finiteness theorems for Riemannian manifolds, Amer. J. Math. 92 (1970), 61–74. 8. Jeff Cheeger and Mikhael Gromov, Collapsing Riemannian manifolds while keeping their curvature bounded. I, J. Differential Geom. 23 (1986), no. 3, 309–346. 9. Luigi Chierchia, Kolomogorov-Arnold-Moser (KAM) theory, Mathematics of Complexity and Dynamical Systems. Vols. 1–3, Springer, New York (2012), 810–836. 10. Yaim Cooper, Discrete gradient descent differs qualitatively from gradient flow, arXiv:1808.04839 (2018). 11. Antonio Criminisi, Jamie Shotton, and Ender Konukoglu, Decision forests: A unified framework for clas- sification, regression, density estimation, manifold learning and semi-supervised learning, Foundations and Trends in Computer Graphics and Vision 7 (2012), 81–227. 12. David Donoho and Carrie Grimes, Hessian eigenmaps: Locally linear embedding techniques for high- dimensional data, Proceedings of the National Academy of Sciences 100 (2003), no. 10, 5591–5596. 13. Mark Droske and Martin Rumpf, A variational approach to nonrigid morphological image registration, SIAM Journal on Applied Mathematics 2 (2004), 668–687. 14. James Eells, Jr., A setting for global analysis, Bull. Amer. Math. Soc. 72 (1966), 751–807. 15. Charles Fefferman, Sanjoy Mitter, and Hariharan Narayanan, Testing the manifold hypothesis, J. Amer. Math. Soc. 29 (2016), no. 4, 983–1049. 16. Claus Gerhardt, Evolutionary surfaces of prescribed mean curvature, Journal of Differential Equations 36 (1980), 139–172. 17. Daniele Granata and Vincenzo Carnevale, Accurate estimation of the intrinsic dimension us- ing graph distances: Unraveling the geometric complexity of datasets, Sci. Rep. 6 (2016), https://www.nature.com/articles/srep31377. 18. Guodong Guo, Yun Fu, Charles R. Dyer, and Thomas S. Huang, Image-based human age estimation by manifold learning and locally adjusted robust regression, IEEE Transactions on Image Processing 17 (2008), 1178–1188. 19. Richard S. Hamilton, Harnack estimate for the mean curvature flow, Journal of Differential Geometry 41 (1995), 215–226. 20. Gerhard Huisken and Carlo Sinestrari, Mean curvature flow singularities for mean convex surfaces, Calculus of Variations and Partial Differential Equations 8 (1999), 1–14. 21. Huang JB and Yang MH, Estimating human pose from occluded images, Computer Vision – ACCV 2009. Lecture Notes in Computer Science 5994 (2010). 22. John M. Lee, Introduction to Smooth Manifolds, Graduate Texts in Mathematics, vol. 218, Springer, New York, 2013. 23. Tong Lin, Hanlin Xue, Ling Wang, Bo Huang, and Hongbin Zha, Supervised learning via Euler’s elastica models, Journal of Machine Learning Research 16 (2015), 3637–3686. 16 DARA GOLD AND STEVEN ROSENBERG

24. Calangelo Liverani, Implicit function theorem (a quantitative version), https://www.mat.uniroma2.it/~liverani/Calcolo1-2016/implicit.pdf, Accessed: 1/05/2020. 25. Yunqian Ma and Yun Fu (eds.), Manifold Learning and Applications, CRC Press, Boca Raton, 2011. 26. Madadi, Escalera, Carruesco, Andujar, Bar´o, and Gonz`alez, Occlusion aware hand pose recovery from sequences of depth images, 12th IEEE International Conference on Automatic Face and Gesture Recog- nition (2017), 230–237. 27. Uwe F. Mayer, Gradient flows on nonpositively curved metric spaces and harmonic maps, Communica- tions in Analysis and Geometry 6 (1998), no. 2, 199–253. 28. John Milnor, Morse Theory, Press, Princeton, NJ, 1969. 29. Marston Morse, The foundations of the calculus of variations in m-space. Part I, Trans. Amer. Math. Soc. 31 (1929), 379–404. 30. and Jayant Shah, Optimal approximations by piecewise smooth functions and associated variational problems, Communications in Pure and Applied mathematics 42 (1989), no. 5, 577–685. 31. Hideki Omori, Infinite-dimensional Lie groups, Translations of Mathematical Monographs, vol. 158, American Mathematical Society, Providence, RI, 1997. 32. and James. A Sethian, Fronts propogating with curvature dependant speed: Algorithms based on Hamilton-Jacobi formulations, Journal of Computational Physics 79 (1988), 12–49. 33. Sam Roweis and Lawrence Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (2000), no. 5500, 2323–2326. 34. Melanie Rupflin and Peter M. Topping, Flowing maps to minimal surfaces, American Journal of Math- ematics 138 (2016), no. 4, 1095–1115. 35. James A. Sethian, Level Set Methods and Fast Marching Methods: Evolving Interfaces in Computational Geometry, Fluid Mechanics, Computer Vision, and Materials Science, vol. 3, Cambridge University Press, 1999. 36. Alexander Smola, Sebastian Mika, Bernhard Sch olkopf, and Robert Williamson, Regularized principal manifolds, JMLR 1 (2001), 179–209. 37. Kush Varshney and Alan Willsky, Classification using geometric level sets, Journal of Machine Learning Research 11 (2010), 491–516. 38. Xiaohui Wang and J. S. Marron, A scale-based approach to finding effective dimensionality in manifold learning, Electron. J. Stat. 2 (2008), 127–148. 39. Ling Xiao, Gradient estimates and lower bound for the blow-up time of star shaped mean curvature flow, arXiv:1311.3721v1 (2013). Email address: [email protected]

Tufts University, The Metric Geometry and Gerrymandering Group (MGGG), 163 Packard Ave., ML-36, Barnum Hall, Medford, MA 02155

Email address: [email protected]

Department of Mathematics and Statistics, Boston University, Boston, Ma 02215, USA