Arxiv:1901.09057V3 [Math.DG]
Total Page:16
File Type:pdf, Size:1020Kb
MSC Classification: 58D15, 68T99. Keywords: Manifold Learning, Embedding Spaces, Discretized Gradient Flow DISCRETIZED GRADIENT FLOW FOR MANIFOLD LEARNING DARA GOLD AND STEVEN ROSENBERG Abstract. Gradient descent, or negative gradient flow, is a standard technique in opti- mization to find minima of functions. Many implementations of gradient descent rely on discretized versions, i.e., moving in the gradient direction for a set step size, recomputing the gradient, and continuing. In this paper, we present an approach to manifold learning where gradient descent takes place in the infinite dimensional space E = Emb(M, RN ) of smooth embeddings φ of a manifold M into RN . Implementing a discretized version of gra- dient descent for P : E −→ R, a penalty function that scores an embedding φ ∈E, requires estimating how far we can move in a fixed direction – the direction of one gradient step – before leaving the space of smooth embeddings. Our main result is to give an explicit lower bound for this step length in terms of the Riemannian geometry of φ(M). In particular, we consider the case when the gradient of P is pointwise normal to the embedded manifold φ(M). We prove this case arises when P is invariant under diffeomorphisms of M, a natural condition in manifold learning. 1. Introduction A common approach in data analysis and machine learning is manifold learning, i.e., deter- L RN mining how to approximate a finite set of {yi}i=1 in Euclidean space by a k-dimensional embedded, compact manifold M for some k ≪ N [6, 11, 12, 18, 25, 33]. The mathematical setup involves the space of smooth embeddings E = Emb(M, RN ) considered as an open subset of the infinite dimensional vector space of all maps from M to RN with the Banach space topology coming from a high Sobolev norm or the C∞ Fr´echet space topology. We also have a C1 penalty function P : E → R which typically contains a data fitting term and a regularization term, as explained below. (In keeping with the literature, we assume that k and the diffeomorphism type of M are given.) In theory, finding a global minimum of P via the negative gradient flow of P on E gives an optimal embedding, or one that “best fits” the training set {yi}. arXiv:1901.09057v3 [math.DG] 23 Dec 2020 To avoid overfitting – or choosing a φ such that φ(M) fits the {yi} very closely but performs poorly on new data points – a penalty function P can penalize φ(M) both for being too far from {yi} and for “twisting too much” to fit the data. Thus a typical penalty function R r 2 P = P1 +P2 : E −→ contains two terms: (i) a data fitting term P1(φ)= i=1 d (φ(M),yi), where d(φ(M),yi) is the Euclidean distance from yi to the closest point in φ(M); (ii) a P regularization term P2 designed to prevent overfitting, e.g. P2(φ) = kφks, the s-Sobolev norm of φ. (For overviews of this standard approach, see [5], [36].) Gradient descent, i.e., moving in the direction of −∇P in E, can find a local or global minimum of P , or an optimal manifold to fit {yi}. While there are theoretical challenges with this setup, we focus on an implementation problem in this paper. In theory, to find a negative gradient flow line on E, we need to know the gradient of P at each point of E. This is generally intractable for computer calculations. Instead, the gradient flow is often discretized: we move in the negative gradient direction from an initial point φ0 for a fixed step size to a new point φ1, stop and recompute the gradient at φ1, then iterate until the gradient is smaller than a specified amount. Since a gradient vector in the tangent space of E corresponds to a vector field along φ(M), we need 1 2 DARA GOLD AND STEVEN ROSENBERG to estimate a lower bound for how far we can move from a fixed embedding φ in the negative gradient direction −∇Pφ and still remain in the space of embeddings. This practical issue is the main focus of this paper. In the main result, Theorem 3, we provide such a lower bound. We focus on the case where ∇Pφ is pointwise normal to φ(M), and motivate this restriction in §4. An example of manifold learning is the “learning” of black and white photographic images. Here an image with N pixels is stored as a vector in RN with each component taking values in a finite set which codifies the greyscale magnitude of the pixel. We assume that there locally exist k components of these vectors which implicitly determine the remaining N − k components. In the generic situation where the implicit function theorem applies, the set of images is then determined by an unknown k-manifold embedded in RN . Thus the computationally expensive task of storing many vectors in RN is reduced to finding a much lower dimensional manifold. Given a sample set {yi} of images, we fix a k-manifold M and N search for an embedding φ : M −→ R with φ(xi)= yi, for some xi ∈ M. The crucial issue of the topological type of M is discussed in [15]. This setup applies to recovering partially occluded objects in an image. In this case, the training set might consist of L unoccluded pictures of cars. Given new images of a car partially blocked by a stop sign, say only k pixels in each image are unblocked. Under the assumption that there is a unique way to fill in N −k remaining pixels, a “best fit” k-manifold determines the rest of the car’s shape [1, 21, 26]. We emphasize that our approach to manifold learning directly tackles the infinite di- mensional nature of this optimization problem without making any simplifying choices that reduce the problem to finite dimensions. Typical choices in the literature are parametric methods, which fix a finite dimensional parameter space of embeddings, and RKHS meth- ods, which reduce the optimization to a finite dimensional problem via the Representation Theorem, but only after making a choice of kernel function. In contrast, our approach only assumes that M is compact, possibly with boundary, and so must contend with infinite di- mensional analytic issues. Since we are given a finite set of training data, the compactness assumption is reasonable. We briefly discuss the issues with directly working with the smooth gradient flow on E. It may be difficult to prove that P is differentiable for typical data terms which measure the minimum distance from a data point to the embedded manifold. Even if P is differentiable, in this infinite dimensional case it is not clear that a gradient flow line γ(t) stays in E or converges as t −→ ∞ to a critical point of P , as E is an open dense set in the space of smooth maps from M to RN . (In finite dimensional problems, open dense sets do not occur.) Even if we can prove convergence, since neither P nor E is in general convex, calculus methods cannot tell us if a critical point is a local, much less a global, minimum. Perhaps most fundamentally, even the short time existence for the gradient flow may be difficult to establish, particularly if we use the most natural C∞ topology on E. These problems are well known in differential geometry, e.g., in the study of minimal submanifolds. In contrast, discretized gradient flow is both a tool for theoretical results on gradient flow [2, Ch. 11] and for computer implementations based on discretized, usually linearized, versions of gradient flow [13], although these are not guaranteed to converge [10]. In the theoretical direction, we argue that the entire penalty term should be invariant under the diffeomorphism group of M, just like typical data terms which measure the distance from data points to φ(M). In particular, regularization terms built from geometric quantities like the volume or total curvature of M have this invariance. In contrast, more familiar regular- ization terms like a Sobolev norm of the embedding are not diffeomorphism invariant. We prove in Theorem 1 that for a diffeomorphism invariant penalty function P , the gradient vector field ∇Pφ is guaranteed to be pointwise normal to φ(M). This simplifies the discretiza- tion process, and leads us to determine the distance one can move in a pointwise normal DISCRETIZEDGRADIENTFLOWFOR MANIFOLDLEARNING 3 direction to an embedding φ(M) and still remain in E. In the applied direction, Theorem 5.7 an explicit lower bound for this distance in terms of the geometry of φ(M) and knowledge of local coordinate charts for M. 2. Related work In addition to manifold learning, the use of gradient flow for functionals on infinite dimen- sional manifolds of maps has a large literature in machine learning, where this comes under the general heading of nonparametric methods. (In the parametric approach, one restricts attention to a finite dimensional submanifold depending on a finite dimensional family of parameters, usually in some Rn for some n.) Osher and Sethian introduced the Level Set Method [32], in which a decision boundary is treated as the level set of a function. Viewing the decision boundary this way avoids typical problems that arise with cusps and discon- tinuities in a flow whose speed is curvature dependent. This work has been extended in many directions, e.g., [30, 35, 37]. In supervised learning, [4] applied geometric gradient flow techniques to optimize a statistical labeling function using a penalty function that has a data term and a geometric regularization term, giving more refined information than contained in the decision boundary.