On Nonlinear Dimensionality Reduction, Linear Smoothing and Autoencoding

Daniel Ting¹  Michael Jordan²

¹Tableau Software, Seattle, WA, USA. ²University of California, Berkeley, CA, USA. Correspondence to: Daniel Ting, Michael Jordan.

Abstract

We develop theory for nonlinear dimensionality reduction (NLDR). A number of NLDR methods have been developed, but there is limited understanding of how these methods work and the relationships between them. There is limited basis for using existing NLDR theory for deriving new algorithms. We provide a novel framework for analysis of NLDR via a connection to the statistical theory of linear smoothers. This allows us to both understand existing methods and derive new ones. We use this connection to smoothing to show that asymptotically, existing NLDR methods correspond to discrete approximations of the solutions of sets of differential equations given a boundary condition. In particular, we can characterize many existing methods in terms of just three limiting differential operators and boundary conditions. Our theory also provides a way to assert that one method is preferable to another; indeed, we show Local Tangent Space Alignment is superior within a class of methods that assume a global coordinate chart defines an isometric embedding of the manifold.

1. Introduction

One of the major open problems in machine learning involves the development of theoretically sound methods for identifying and exploiting the low-dimensional structures that are often present in high-dimensional data. Such methodology would have applications not only in supervised learning, but also in visualization and nonlinear dimensionality reduction, semi-supervised learning, and manifold regularization.

Some initial steps have been made in this direction over the years under the rubric of "manifold learning methods." These nonlinear dimension reduction (NLDR) methods have permitted interesting theoretical analysis and allowed the field to move beyond linear dimension reduction. But the theoretical results have fallen short of providing characterizations of the overall scope of the problem, including the similarities and differences of existing methods and their respective advantages. Consider, for example, the classical problem of finding nonlinear embeddings of a Swiss roll with hole, shown in Figure 1. Of the methods shown, Local Tangent Space Alignment is clearly the best at recovering the underlying planar structure of the manifold, but there is no existing theoretical explanation of this fact. Nor are there answers to the natural question of whether there are scenarios in which other methods would be better, why other methods perform worse, or whether the deficiencies can be corrected. The current paper aims to tackle some of these problems. Not only do we provide new characterizations of NLDR methods, but by correcting deficiencies of existing methods we are able to propose new methods in Laplace-Beltrami approximation theory.

We analyze the general class of manifold learning methods that we refer to as local, spectral methods. These methods construct a matrix using only information in local neighborhoods and take a spectral decomposition to find a nonlinear embedding. This framework includes the commonly used methods: Laplacian Eigenmaps (LE) (Belkin & Niyogi, 2003), Local Linear Embedding (LLE) (Roweis & Saul, 2000), Hessian LLE (HLLE) (Donoho & Grimes, 2003), Local Tangent Space Alignment (LTSA) (Zhang & Zha, 2004), and Diffusion Maps (DM) (Coifman & Lafon, 2006). We also consider several recent improvements to these classical methods, including Low-Dimensional Representation-LLE (Goldberg & Ritov, 2008), Modified-LLE (Zhang & Wang, 2007), and MALLER (Cheng & Wu, 2013). Outside of our scope are global methods that construct dense matrices that encode global pairwise information; these include multidimensional scaling, principal components analysis, Isomap (Tenenbaum et al., 2000), and maximum variance unfolding (Weinberger & Saul, 2006). We note in passing that these global methods have serious practical limitations, either in terms of a strong linearity assumption or computational intractability.

Our general approach proceeds by showing that the embeddings for convergent local methods are solutions of differential equations, with a set of boundary conditions induced by the method. The treatment of the boundary conditions is the critical novelty of our approach. This approach allows us to categorize the methods considered into three classes of differential operators and accompanying boundary conditions, as shown in Table 1. We are also able to delineate properties that allow comparisons of methods within a class; see Table 2. These theoretical results allow us to, for example, conclude that when the goal is to find an isometry-preserving, global coordinate chart, LTSA is best among the classical methods considered. We can also show that HLLE belongs to the same class and converges to the same limit. Hence, they can be used exchangeably as smoothness penalties. We also improve existing Laplace-Beltrami approximations. In particular, we give a Laplace-Beltrami approximation that is consistent as a smoothing penalty even when boundary conditions are not satisfied.

Table 1: Categorization of NLDR methods by their limit operator and boundary conditions. Within each category, methods with better properties appear closer to the bottom. Newly identified boundary conditions are highlighted in grey. LLE and existing variants have a dependence on the manifold curvature which makes the induced second-order penalty different from E Tr(Hf)².

  Boundary condition      Method
  None                    Coefficient Laplacian
  f = 0                   LLR-Laplacian
  ∂f/∂η = 0               Diffusion Maps, Laplacian Eigenmaps
  ∂²f/∂η² = β∆f           (LDR-, m-)LLE*, LDR-LLE+, LLR-Laplacian, HLLE, LTSA

Table 2: Properties of NLDR methods. Non-convergent or unstable methods may converge if regularization is added. Lower order smoothers generally have less variance. New methods and convergence results are highlighted in grey.

  Method                  Order of smoother
  Diffusion Maps          0th
  Laplacian Eigenmaps     0th
  LLE                     2nd
  LDR-LLE                 2nd
  LDR-LLE+                1st
  LLR-Laplacian           1st
  HLLE                    2nd
  LTSA                    1st
  Coefficient Laplacian   1st

Our analysis is based on the following two insights. First, matrices obtained from local, spectral methods can be seen as an operator that returns the bias from smoothing a function. This allows us to make a connection to the theory of linear smoothing in statistics. Second, as the neighborhood sizes for each local method decrease, we obtain a linear operator that evaluates infinitesimal differences in that neighborhood. In other words, we obtain convergence to a differential operator. Thus, the asymptotic behavior of a procedure can be characterized by the corresponding differential operator in the interior of the manifold and (crucially) by the boundary bias in the linear smoother.

Local, spectral methods share a construction paradigm:

1. Choose a neighborhood for each point x_i;
2. Construct a matrix L with L_ij ≠ 0 only if x_j is a neighbor of x_i;
3. Obtain a nonlinear embedding from an eigen- or singular value decomposition of L.

Step 2 for constructing the matrix L can be seen to be equivalent to constructing a linear smoother. We use this equivalence to compare existing NLDR methods and generate new ones by examining the asymptotic bias of the smoother on the interior and boundary of the manifold. In particular, we show that whenever the neighborhoods shrink in Step 1 as the number of points n → ∞, the matrix L converges to a differential operator on the interior of the manifold. Nonlinear dimensionality reduction methods thus solve an eigenproblem involving this differential operator. The solutions of the eigenproblem on a compact manifold are typically only well-separated and identifiable in the presence of boundary conditions. If L is symmetric and the smoother's bias has a different asymptotic rate on the boundary than the interior, then it imposes a corresponding boundary condition for the eigenproblem. If L is not symmetric, then the boundary conditions can be determined from the boundary bias for L and its adjoint L^T.
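To make the three-step paradigm concrete, the following is a minimal sketch in Python, instantiated with the unnormalized graph Laplacian used by Laplacian Eigenmaps. The Gaussian kernel, the bandwidth convention, and the function name are illustrative choices, not part of any particular method's specification.

```python
import numpy as np

def local_spectral_embedding(X, bandwidth, n_components=2):
    """Steps 1-3 for a local, spectral method (here: a graph-Laplacian instance)."""
    # Step 1: choose h-ball neighborhoods from pairwise distances.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    neighbors = sq_dists <= bandwidth ** 2
    # Step 2: construct L with L_ij != 0 only if x_j is a neighbor of x_i.
    # Here L = D - K, the unnormalized graph Laplacian of Laplacian Eigenmaps.
    K = np.where(neighbors, np.exp(-sq_dists / bandwidth ** 2), 0.0)
    np.fill_diagonal(K, 0.0)
    L = np.diag(K.sum(axis=1)) - K
    # Step 3: embed with the bottom eigenvectors of L (dropping the constant one).
    _, eigvecs = np.linalg.eigh(L)
    return eigvecs[:, 1:n_components + 1]
```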
2. Preliminaries

We begin by introducing some basic mathematical concepts and providing a high-level overview of the relationship between NLDR and linear smoothing.

2.1. Linear smoothers

Consider the standard regression problem for data generated by the model Y_i = f(X_i) + ε_i, where ε_i is a zero-mean error term. A regression model is a linear smoother if

    ŷ = S(x) y                                                   (1)

and the smoothing matrix S(x) does not depend on the outcomes y. Examples of linear smoothers include linear regression on a design matrix X, kernel and local polynomial regression, and ridge regression.
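As a concrete illustration of Eq. (1), the sketch below builds the smoothing matrix for a one-dimensional Nadaraya-Watson kernel smoother; S is computed from the covariates alone, so the same matrix can be applied to any response vector. The kernel, bandwidth, and synthetic data are illustrative assumptions.

```python
import numpy as np

def nadaraya_watson_matrix(x, bandwidth):
    """Row-stochastic smoothing matrix S(x) for kernel regression; independent of y."""
    sq_dists = (x[:, None] - x[None, :]) ** 2
    K = np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    return K / K.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 200))
y = np.sin(2.0 * np.pi * x) + 0.1 * rng.standard_normal(200)

S = nadaraya_watson_matrix(x, bandwidth=0.05)
y_hat = S @ y                    # the fit: y_hat = S(x) y, as in Eq. (1)
bias_op = np.eye(len(x)) - S     # (I - S) f measures the smoothing bias on a noiseless f
```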

Figure 1: The behavior of methods on the Swiss roll with hole. LTSA yields the best reconstruction. The boundary effects of the graph Laplacian distort the manifold. LLE's embedding often contains kinks. The LLR-Laplacian and LDR-LLE produce a 3D embedding which is nearly identical up to rotations; this embedding approximates the functions x_1, x_2, x_1 x_2. The LLR-Laplacian was the better of the two at picking out a 2D embedding.

In the regression problem, one wishes to infer the conditional mean function f given noisy observations Y. For manifold learning problems, the goal is to learn noiseless global coordinates Φ_i = f(X_i), but the outcomes Φ are never observed. In both linear smoothing and manifold learning, the smoothing matrix makes no use of the response and simply maps rough functions into a space of smoother functions. The embedding that is learned by NLDR methods is a space of functions that are best reconstructed by the smoother. In other words, the smoother defines a form of autoencoder, and the learned space is a space of likely outputs from the autoencoder.

2.2. NLDR constructions of linear smoothers

It is a simple but crucial fact that each of the NLDR methods that we consider constructs or approximates a matrix L = G(I − S), where G is a diagonal matrix and S is a linear smoother. Thus, L measures the bias of the prediction Sf weighted by G.

For example, Diffusion Maps, with a Gaussian kernel, constructs the Nadaraya-Watson smoother S = D⁻¹K, where D is the diagonal degree matrix and K is the kernel matrix. The constructed embedding consists of the right singular vectors corresponding to the smallest singular values of L_DM = I − S. Laplacian Eigenmaps, using an unnormalized graph Laplacian, differs only by the reweighting L_LE = D L_DM = D(I − S), which ensures that the matrix L_LE is positive semidefinite.
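A short sketch of these two constructions, each written as a weighted smoothing bias L = G(I − S); the Gaussian kernel bandwidth convention and the helper name are assumptions made here for illustration.

```python
import numpy as np

def bias_operators(X, h):
    """Kernel smoother S = D^{-1} K and the bias operators of Diffusion Maps and
    Laplacian Eigenmaps, L_DM = I - S and L_LE = D(I - S)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / h ** 2)     # kernel matrix
    d = K.sum(axis=1)                  # degrees
    S = K / d[:, None]                 # Nadaraya-Watson smoother (rows sum to one)
    L_dm = np.eye(len(X)) - S          # Diffusion Maps: G = I
    L_le = np.diag(d) @ L_dm           # Laplacian Eigenmaps: G = D, giving D - K (PSD)
    return S, L_dm, L_le
```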

2.3. Assumptions

The usual setting for linear smoothing is Euclidean space. To account for the manifold setting, we must make some regularity assumptions and demonstrate that calculations in Euclidean space approximate calculations on the manifold sufficiently closely.

We consider a smooth, compact m-dimensional Riemannian manifold M with smooth boundary embedded in R^d. From this manifold, points {X_1, ..., X_n} ∈ R^d are drawn from a uniform distribution on the manifold. This uniformity assumption can be weakened to admit a smooth density and obtain density-weighted asymptotic results.

In the continuous case, we consider neighborhoods N_x = B(x, h) for a given bandwidth h, where distance is Euclidean distance in the ambient space. When appropriate, we will take h → 0 as n → ∞. In the discrete case, denote by N_i the indices for neighbors of X_i in the point cloud. This neighborhood construction is only mildly restrictive since kNN neighborhoods are asymptotically equivalent to ε-neighborhoods in the interior of the manifold when points are sampled uniformly and the radius of the neighborhoods is Θ(h) on the boundary.

2.4. Local coordinates

Most existing methods and the ones we construct require access to a local coordinate system with manifold dimension m for each point. This coordinate system is estimated using a local PCA or SVD at each point. Let X_N be the |N| × d matrix of ambient space coordinates for points in a neighborhood N centered on x, and let X̃_N be the corresponding matrix centered on x. The m top right singular vectors of

    X̃_N = U_N Λ V_N^T                                           (2)

give estimated tangent vectors at x. The top rescaled left singular vectors, τ_i = λ_i U_{N,i}, project points in N_i to the tangent space. The normal coordinates u_i, corresponding to the geodesics traced out by the tangent vectors V_{N,i}, agree closely with these tangent space coordinates. Specifically, by Lemma 6 in Coifman & Lafon (2006), u_i = τ_i + q_3(t) + O(h⁴), where q_3 is a homogeneous polynomial of degree three, whenever the coordinates are in a unit ball of radius h. We adopt the convention that u_i(y) refers to the normal coordinates at x ∈ M for y ∈ N_x.
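The local coordinate step can be sketched directly from Eq. (2); the helper below is an illustrative implementation, with the neighborhood index set assumed to be given (for example from the h-ball construction above).

```python
import numpy as np

def tangent_coordinates(X, i, neighbor_idx, m):
    """Estimate tangent directions at X[i] and the tangent-space coordinates of its
    neighbors via a local SVD of the centered neighborhood (cf. Eq. 2)."""
    X_tilde = X[neighbor_idx] - X[i]          # |N| x d matrix centered on x = X[i]
    U, lam, Vt = np.linalg.svd(X_tilde, full_matrices=False)
    tangent_basis = Vt[:m].T                  # top m right singular vectors
    tau = U[:, :m] * lam[:m]                  # rescaled left singular vectors tau_i
    return tau, tangent_basis
```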
Our results rely on integration in normal coordinates for a ball of radius h in the distance metric of the ambient space. To account for the manifold curvature, the volume form and neighborhood sizes must be accounted for in the integral. Lemma 7 in Coifman & Lafon (2006) further provides a Taylor expansion of the volume form for the Riemannian metric, dV_g(u) = 1 + q_2(u) + O(h³), where q_2 is a homogeneous polynomial of degree two. Likewise, distances in the ambient space and normal coordinates differ by ‖y − x‖² = ‖u(y)‖² + q̃_2(u(y)) + O(h³), where u(y) denotes normal coordinates for y about x, q̃_2 is homogeneous of degree 2, and the distance on the left-hand side is with respect to the ambient space. Consequently, integrals of any homogeneous polynomial q́ of degree one or two satisfy

    ∫_{N_x} q́(u) dV_g(u) = ∫_{B(0,h)} q́(u) du + O(h⁴),          (3)

where the integral on the right represents integration in R^m and the O(h⁴) term hides a smooth function that depends on the curvature of the manifold at x.

For the purposes of this paper, we do not account for error from estimation of the tangent space. Thus we assume that we have a sufficiently accurate estimate of the normal coordinates. Accounting for this estimation error is a natural direction for further work.

3. Analysis of existing methods

Each of the existing NLDR methods constructs a matrix L_n from a point cloud of n points with bandwidth h. The resulting nonlinear embeddings are obtained from the bottom eigenvectors of L_n or L_n^T L_n. We consider L_n as a discrete approximation to an operator L_h : F → F on functions F ⊂ C^∞(M). We examine the limit operator L_h constructed by each of the NLDR methods for a fixed bandwidth h. We show that as the bandwidth h → 0, L_h → L_0, where L_0 is a differential operator. The exception is LLE, where there is no well-defined limit. The stochastic convergence of the empirical constructions L_n → L_h as n → ∞ is not considered in this paper.

3.1. Taylor expansions and local polynomial bases

As described in Section 2.2, most existing NLDR methods can be expressed as L_h = D_h(I − S_h), where D_h is a multiplication operator corresponding to a diagonal matrix and S_h is a linear smoother. Denote by S_h(x, ·) the linear functional such that ⟨S_h(x, ·), f⟩ = (S_h f)(x). This functional exists by the Riesz representation theorem. We say S_h is local when for all x ∈ M, the support of S_h(x, ·) is contained in a ball of radius h centered on x. We assume that S_h is a bounded operator (as unbounded operators are necessarily poor smoothers).

Consider S_h applied to a function f ∈ C^∞(M). A Taylor expansion of f in normal coordinates at x gives

    (S_h f)(x) = (S_h 1)(x) f(x) + ∇f(x) (S_h(y − x))(x)
                 + Tr(Hf(x) S_h((y − x)(y − x)^T))(x) + ··· + O(h^r),    (4)

where the error term holds since f has bounded rth derivatives due to the compactness of M. Here y − x = u is a function denoting the normal coordinate map at x. From this it is clear that the asymptotic behavior of S_h in a neighborhood of x is determined by its behavior on a basis of local polynomials. Furthermore, it can be well approximated by examining its behavior on only low-degree polynomials.

3.2. Convergence to a differential operator

Of particular interest is the case where S_n operates locally on shrinking neighborhoods. In this case, the following theorem (proven in the supplementary material) is an immediate consequence of applying the sequence of smoothers to the Taylor expansion.

Theorem 1. Let S_n be a sequence of linear smoothers where the support of S_n(x, ·) is contained in a ball of radius h_n. Further assume that the bias in the residuals satisfies h_n^{-k}(I − S_n)f = o(1) for some integer k and all f ∈ C ⊂ C^∞(M). If h_n → 0 and h_n^{-k}(I − S_n) converges to a bounded linear operator as n → ∞, then it must converge to a differential operator of order at most k on the domain C.

As a result, each of the existing NLDR methods considered can be seen as a discrete approximation for the solutions of differential equations Df = λf given some boundary conditions, where D is some differential operator and λ ∈ R.

3.3. Laplacian and boundary behavior

The convergence of Laplacians constructed from point clouds to a weighted Laplace-Beltrami operator has been well studied by Hein et al. (2007) and Ting et al. (2010). In particular, the spectral convergence of Laplacians constructed using kernels has been studied by Belkin & Niyogi (2007), von Luxburg et al. (2008), and Berry & Sauer (2016). In the presence of a smooth boundary, the eigenproblem for the Diffusion Maps Laplacian that converges to the unweighted Laplace-Beltrami operator has been shown by Coifman & Lafon (2006) and Singer & Wu (2016) to impose Neumann boundary conditions. Specifically, it solves the differential equation

    ∆u = λu   s.t.   ∂u/∂η_x(x) = 0 for x ∈ ∂M,                  (5)

where η_x is the vector normal to the boundary at x. This result is easily extended to other Laplacian constructions.

As the Diffusion Maps operator is the bias operator for a Nadaraya-Watson kernel smoother, and the unnormalized Laplacian is a rescaling of this bias, the Neumann boundary conditions can be derived from existing analyses for kernel smoothers. The boundary bias of the kernel smoother is c·h·∂f/∂η_x(x) + o(h) for some constant c when x ∈ ∂M, while in the interior it is of order O(h²) when points are sampled uniformly. This trivially follows from noting that the first moment of a symmetric kernel is zero when the underlying distribution is uniform and that the boundary introduces asymmetry in the direction η_x. Hence, the operator must be scaled by Θ(h⁻²) to obtain a non-trivial limit in the interior, but this scaling yields divergent behavior at the boundary when the Neumann boundary condition ∂f/∂η_x(x) = 0 is not met. Since eigenfunctions cannot exhibit this divergent behavior, they must satisfy the boundary conditions.

Figure 2: The top row shows spectra for different Laplace-Beltrami approximations (Laplacian, LLR-Laplacian, Coefficient Laplacian) on a line segment. The Coefficient Laplacian spectrum is very different, as the spectrum is not discrete. The bottom row shows the bottom nontrivial eigenvector and the effect of different boundary conditions: a Neumann condition yields a cosine function, a second-order condition yields linear functions, and no condition yields useless eigenfunctions.

The significance of this result is given by the following simple theorem.

Theorem 2. Let L be an operator imposing the Neumann boundary condition ∂f/∂η_x(x) = 0. Then, if there exists an isometry-preserving global coordinate chart, a spectral decomposition of L cannot recover the global coordinates, as they are not in the span of the eigenfunctions of L.

3.4. Local Linear Embedding

Local Linear Embedding (LLE) solves the boundary bias problem by explicitly canceling out the first two terms in the Taylor expansion. For a point cloud X, LLE constructs a weight matrix W satisfying W1 = 1 and WX = X. Goldberg & Ritov (2008) and Ting et al. (2010) showed that this condition is not sufficient to identify a limit operator. The behavior of LLE depends on a regularization term. We give a more refined analysis of the constrained ridge regression procedure used to generate the weights.

Let Ñ_i = N_i \ {i} be the neighborhood that excludes i, and let X̃_{Ñ_i} = X_{Ñ_i} − 1 X_i be the points in Ñ_i centered on X_i. To reconstruct X_i from its neighbors, one solves the regression problem

    X_i = X_{Ñ_i}^T w_i = X_i + X̃_{Ñ_i}^T w_i,                   (6)

under the constraint that the weights sum to one, ‖w_i‖_1 = 1. Adding a ridge regression penalty λ and applying Lagrange multipliers gives w_i ∝ (X̃_{Ñ_i} X̃_{Ñ_i}^T + λI)⁻¹ 1_{Ñ_i}.

Consider the singular value decomposition X̃_{Ñ_i} = U D V^T, and let U_c be the orthogonal complement of U. This allows us to rewrite the weights

    w_i ∝ (U(D² + λI)⁻¹U^T + λ⁻¹ U_c U_c^T) 1_{Ñ_i}              (7)
        ∝ 1_{Ñ_i} − U(I − λ(D² + λI)⁻¹)U^T 1_{Ñ_i}.              (8)

In other words, the weights are the same as the constant weights used by an h-ball Laplacian but with a correction term. This correction term in Eq. 8 depends only on the curvature of the manifold. One can see this by noting that the top m left singular vectors correspond to the points expressed in normal coordinates around X_i. As the neighborhood is symmetric around X_i, it follows that U_{1:m}^T 1 ≈ 0. The remaining left singular vectors correspond to directions of curvature of the manifold. From this, it is easy to see that LLE with h-ball neighborhoods is equivalent to ∆_h − C in the interior of the manifold, where ∆_h is the h-ball Laplacian and C is some correction term that reduces the penalty on the curvature of a function when it matches the curvature of the manifold.
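A sketch of the constrained ridge weights and their SVD form (Eqs. 6-8); the function and argument names are illustrative, and the neighborhood index set is assumed to exclude the point itself.

```python
import numpy as np

def lle_weights(X, i, neighbor_idx, lam):
    """Ridge-regularized, sum-to-one LLE reconstruction weights for X[i]."""
    Xc = X[neighbor_idx] - X[i]                     # centered neighbors, |N~| x d
    k = len(neighbor_idx)
    G = Xc @ Xc.T                                   # local Gram matrix
    w = np.linalg.solve(G + lam * np.eye(k), np.ones(k))
    w = w / w.sum()                                 # enforce the sum-to-one constraint

    # Equivalent form via the SVD Xc = U diag(d) V^T, cf. Eq. (8).
    U, d, _ = np.linalg.svd(Xc, full_matrices=False)
    w_svd = np.ones(k) - U @ ((1.0 - lam / (d ** 2 + lam)) * (U.T @ np.ones(k)))
    assert np.allclose(w, w_svd / w_svd.sum())
    return w
```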

3.5. Hessian LLE

Hessian LLE performs a local quadratic regression in each neighborhood. It then discards the locally linear portion in order to construct a functional that penalizes second derivatives. The Taylor expansion in Eq. 4 gives that the entries of the Hessian are coefficients of a local quadratic polynomial regression.

Given a basis Z_{N_i} of monomials of degree ≤ 2 in local coordinates for each point in neighborhood N_i, where the first m + 1 basis elements span linear and constant functions, the entries in the estimated Hessian H(f)_{X_i} are Π(Z_{N_i}^T Z_{N_i})⁻¹ Z_{N_i}^T f, where Π is the projection onto the m + 2 through 1 + m + m(m − 1)/2 coordinates.

Rather than directly estimating the Hessian, HLLE performs the following approximation. Orthonormalize the basis of monomials using Gram-Schmidt and discard the first m + 1 vectors to obtain Z̃_{N_i}. By orthonormality, the regression coefficients in this basis are simply the inner products Z̃_{N_i}^T f. We note that the approximation only recovers the Hessian in the interior of the manifold and when the underlying sampling distribution is uniform.

Given Hessian estimates β̂_i = Z̃_{N_i}^T f for each point, the sum of their Frobenius norms is easily expressed as

    Σ_i ‖Hf(x_i)‖² ≈ Σ_i f^T Z̃_{N_i} Z̃_{N_i}^T f = f^T Q^{HLLE} f.
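The local HLLE step can be sketched as follows, using the full set of degree-2 monomials in the tangent coordinates and a QR factorization in place of explicit Gram-Schmidt; the tangent coordinates are assumed to come from a local SVD as in Section 2.4, and the names are illustrative.

```python
import numpy as np

def hlle_local_operator(tau):
    """Local HLLE quadratic form from tangent coordinates tau (|N| x m): returns
    Z_tilde Z_tilde^T, whose sum over neighborhoods gives Q^HLLE."""
    k, m = tau.shape
    # Monomial basis of degree <= 2: constant, linear, then all quadratic terms.
    # Assumes |N| >= 1 + m + m(m + 1)/2 so the basis can be orthonormalized.
    cols = [np.ones(k)] + [tau[:, a] for a in range(m)]
    cols += [tau[:, a] * tau[:, b] for a in range(m) for b in range(a, m)]
    Z = np.column_stack(cols)
    # Orthonormalize (QR plays the role of Gram-Schmidt) and drop the first m + 1
    # columns, which span the constant and linear functions.
    Q, _ = np.linalg.qr(Z)
    Z_tilde = Q[:, m + 1:]
    # The local penalty on a function f restricted to the neighborhood is
    # f^T (Z_tilde Z_tilde^T) f, the squared norm of its quadratic coefficients.
    return Z_tilde @ Z_tilde.T
```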

3.6. Local Tangent Space Alignment

Local tangent space alignment (LTSA) computes a set of global coordinates that can be aligned to local normal coordinates via local affine transformations.

Given neighbors of X_i expressed in local coordinates U_{N_i} ∈ R^{|N_i| × m}, LTSA finds global coordinates Y and affine alignment transformations A_i that minimize the difference between the aligned global and local coordinates, J_i(Y, A) = ‖C Y_{N_i} − U_{N_i} A_i‖²_F, where C = I − 11^T/|N_i| is a centering matrix.

Given the global coordinates Y, this is a least-squares regression problem for a fixed set of covariates U_{N_i}. Thus, the best linear predictors for C Y_{N_i} are given by H^{(1)}_{N_i} C Y_{N_i}, where H^{(1)}_{N_i} is the hat matrix for the covariates U_{N_i}. The objective can be expressed using local operators Q_i^{LTSA} as

    min_A Σ_i J_i(Y, A) = Σ_i Y_{N_i}^T C^T (I − H^{(1)}_{N_i}) C Y_{N_i}
                        = Y^T ( Σ_i Q_i^{LTSA} ) Y.              (9)
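A sketch of the local operator in Eq. (9) and its accumulation into the alignment matrix whose bottom eigenvectors (above the constant one) give the LTSA embedding. The neighborhoods and tangent coordinates are assumed to be precomputed, for example with the local SVD sketched earlier; the function names are illustrative.

```python
import numpy as np

def ltsa_local_operator(U_ni):
    """Q_i^LTSA = C^T (I - H^(1)) C for one neighborhood with tangent coordinates U_ni."""
    k = U_ni.shape[0]
    C = np.eye(k) - np.ones((k, k)) / k     # centering matrix
    H = U_ni @ np.linalg.pinv(U_ni)         # hat matrix for the covariates U_ni
    return C.T @ (np.eye(k) - H) @ C

def ltsa_alignment_matrix(n, neighborhoods, tangent_coords):
    """Accumulate sum_i Q_i^LTSA over all neighborhoods into an n x n matrix."""
    Phi = np.zeros((n, n))
    for idx, U_ni in zip(neighborhoods, tangent_coords):
        Phi[np.ix_(idx, idx)] += ltsa_local_operator(U_ni)
    return Phi
```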

4. Equivalence of HLLE and LTSA

Although LTSA is derived from a very different objective than HLLE, we will show that they are asymptotically equivalent. This greatly strengthens a result in Zhang et al. (2018), which showed the equivalence of a modified version of HLLE to LTSA under the restrictive condition that there are exactly m(m + 1)/2 + m − 1 neighbors.

We show the asymptotic equivalence in two steps. First, we show L_h^{HLLE} and L_h^{LTSA} converge to the same differential operator by showing convergence in the weak operator topology. Then, we give an argument that derives the boundary condition for both methods.

4.1. Continuous extension

Both HLLE and LTSA are composed of the sum of local projection operators Q_i on the point cloud. As we wish to study their limit behavior on smooth functions, we consider the continuous extension to the manifold. To form the continuous extension, the sum is replaced by an integral

    L_h^{LTSA} f = (α_h / Vol_h) ∫_M Q_y f dV_g(y),               (10)

where V_g is the natural volume measure on M, α_h is an appropriate normalizing constant, Vol_h is the volume of a ball of radius h in R^m, and the Q_y are local operators. In the case of LTSA, Q_y = I_{N_y} − H_y, where H_y is the projection onto linear functions in the neighborhood N_y and on a local normal coordinate system, rather than linear functions on a discrete set of points.

To identify the order of the differential operator that LTSA converges to, we simply need to find the scaling that yields a non-trivial limit. For the functional f^T L_h^{LTSA} f to non-trivially converge as h → 0, the component terms f^T Q_y f should also non-trivially converge. Since Q_y is a projection onto the space orthogonal to linear functions, this is equivalent to f^T Q_y^T Q_y f = ‖Q_y f‖² = O(α_h h⁴(‖Hf(y)‖² + h)). For this to converge non-trivially, one must have α_h = O(h⁻⁴). The same argument holds for HLLE. Thus, both LTSA and HLLE yield fourth-order differential operators in the interior of M.

4.2. HLLE and LTSA difference is negligible

To show the equivalence of HLLE and LTSA, we show that the remainder term resulting from their difference is the bias term of an even higher order smoother. As such, this bias term goes to zero faster than the HLLE and LTSA bias terms, which yields convergence in the weak operator topology. This gives the following theorem. Proof details are given in the supplementary material.

Theorem 3. L_h^{LTSA} − L_h^{HLLE} → 0 as h → 0 in the weak operator topology of C^∞(M) equipped with the L_2 norm.

4.3. HLLE and LTSA boundary behavior

To establish that the limit operator has the same eigenfunctions, we must show that the boundary conditions that those eigenfunctions satisfy also match. We provide a theorem where a single proof identifies the boundary condition for both methods. This boundary condition applies to the second derivatives of a function and admits linear functions, unlike the Neumann boundary condition imposed by graph-Laplacian-based methods.

Theorem 4. L_h^{LTSA} f(x) → ∞ and L_h^{HLLE} f(x) → ∞ as h → 0 for any x ∈ ∂M and f ∈ C^∞(M) unless

    ∂²f/∂η_x²(x) = ((m + 1)/2) (∆f)(x)   ∀x ∈ ∂M,               (11)

where η_x is the tangent vector orthogonal to the boundary.

We outline the proof, providing the details in the supplementary material. The proof shows that the boundary bias of the LTSA smoother is of order Ω(h² Tr(Hf(x)Σ)) for some matrix Σ. This is of lower order than the required scaling α_h = h⁻⁴ for the functional to converge.

Since locally linear and constant functions are in the null space by construction, only the behavior of the smoother on quadratic polynomials needs to be considered. The difficulty in analyzing the behavior arises from needing to average over multiple local regressions on different neighborhoods and dealing with a non-product measure. By exploiting symmetry, the problem of a non-product measure is removed, and the multivariate regression problem can be reduced to a univariate linear regression with respect to u_1, where u_1 is the coordinate function in the direction η_x orthogonal to the boundary. Due to the shape of the asymmetric neighborhood, the quadratic functions u_i², i > 1, induce a nonzero conditional mean E(u_i² | u_1) = (1 − u_1²)/(m + 1). Since u_1² has a negative sign in this expression, the coefficients for u_1² and each u_i² can be set to cancel out the boundary bias to obtain the result.
4.4. Partial equivalence of HLLE and Laplacian

Although the Laplace-Beltrami operator induces a penalty on the gradient and appears fundamentally different from the Hessian penalty induced by HLLE and LTSA, they can be compared by squaring the Laplace-Beltrami operator to create a fourth-order differential operator that induces a slightly different penalty on the Hessian.

Consider the smoothness penalty obtained by squaring the Laplace-Beltrami operator compared to the HLLE penalty:

    J_{∆²}(f) = ⟨f, ∆²f⟩ = ⟨∆f, ∆f⟩ = ∫ Tr(Hf(z))² dV_g(z),
    J_{HLLE}(f) = ∫ Tr((Hf(z))²) dV_g(z),

where the second equality follows from the self-adjointness of the Laplace-Beltrami operator. These penalties are identical for one-dimensional manifolds. However, a slightly different penalty is induced on multidimensional manifolds. Given global coordinates g_1, ..., g_m, the quadratic polynomial g_1 g_2 is in the null space of the twice-iterated Laplacian but not in the null space of the HLLE penalty.

5. Alternate constructions

We provide a few examples to illustrate how alternative constructions can address undesirable properties of existing methods and the implications of the changes. In particular, we propose a convergent and more stable variation of LLE. We find LLE and its variants behave similarly to a local linear regression smoother, replacing the Neumann boundary condition for Laplacians with a second-order boundary condition that admits linear functions. Furthermore, we generate a new Laplace-Beltrami approximation that generates a smoothing penalty that does not impose boundary conditions. The resulting operator is shown to not have well-separated eigenfunctions.

5.1. Low-dimensional representation LLE+

LLE has the deficiency that the curvature of the manifold affects the resulting operator and smoothness penalty. In the worst case, when the regularization parameter is very small, the embedding Φ : M → R^d of the manifold in the ambient space lies close to the null space of the resulting LLE operator. In other words, a spectral decomposition of the LLE operator may simply recover a linear transformation of the ambient space.

Low-dimensional representation LLE (LDR-LLE) (Goldberg & Ritov, 2008) is a modified version of LLE that removes some of the effect of curvature in the manifold by reconstructing each point based on its tangent space coordinates only. As this still cancels out the first two terms of a Taylor expansion, it is an approximation to the Laplace-Beltrami operator. The weights at X_i are chosen to be in the subspace spanned by points in N_i in the ambient space and orthogonal to the tangent space, and thus the construction explicitly penalizes functions with curvature that matches the manifold. However, as the singular values for directions orthogonal to the tangent space decrease as O(h²) compared to O(h) for the tangent space, the method is subject to numerical instability. As such, we classify LDR-LLE, as well as LLE, as being based on a second-order smoother. To compensate for this instability, it adds a regularization term.

We propose a further modification, LDR-LLE+, which removes the artificial restriction on the weights to the span of the points in the neighborhood and generates a first-order smoother. The weights are given by

    w_i ∝ 1_{N_i} − U_{1:m} U_{1:m}^T 1,                         (12)

where U_{1:m} are the top left singular vectors of X_{N_i} after centering on X_i.

Another variant of LLE is modified LLE (mLLE) (Zhang & Wang, 2007). While LDR-LLE explicitly constructs a local smoother that can be expressed by linear functions in the tangent coordinates and a bias operator in which linear functions are in its null space, mLLE achieves a similar effect by adding vectors orthogonal to the tangent space to the bias operator in order to remove non-tangent directions from the null space.
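The proposed weights in Eq. (12) admit a very short sketch; as elsewhere, the neighborhood indices and intrinsic dimension m are assumed to be given, and the normalization to sum to one is an implementation choice consistent with preserving constant functions.

```python
import numpy as np

def ldr_lle_plus_weights(X, i, neighbor_idx, m):
    """LDR-LLE+ weights (Eq. 12): constant weights minus their tangent-space component."""
    Xc = X[neighbor_idx] - X[i]                    # neighbors centered on X[i]
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    ones = np.ones(len(neighbor_idx))
    w = ones - U[:, :m] @ (U[:, :m].T @ ones)      # remove the top-m left singular directions
    return w / w.sum()
```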
5.2. Local Linear Regression

A simple alternative construction is to directly use a local linear regression (LLR) smoother to approximate the Laplace-Beltrami operator. Since only functions linear in the normal coordinates should be included in the null space of the bias operator, each local regression is performed on the projection of the neighborhood into its tangent space. Cheng & Wu (2013) propose this linear smoother in the context of local linear regression on a manifold and note its connection to the Laplace-Beltrami operator without analyzing the convergence of the associated operator. Our contribution is determining the boundary conditions.

In particular, we study the boundary behavior for the operator L^{LLR*} L^{LLR}, where L^{LLR*} is the adjoint of L^{LLR}. The resulting boundary conditions have the same form as those for LTSA, albeit with different constants, and admit linear functions as eigenfunctions.

Theorem 5. Let S_h be the expected continuous extension of a local linear regression smoother with bandwidth h, and let L_h = h⁻² Vol_h⁻¹ (I − S_h). Then there is some β > 0 such that for any f ∈ C^∞(M) and x ∈ ∂M, L_h^* L_h f(x) converges as h → 0 only if

    β ∂²f/∂η_x²(x) = (∆f)(x),                                    (13)

where η_x is the vector normal to the boundary at x.

This result is obtained by examining the behavior of the adjoint L_h^* by reducing it to computing the average residual over a set of univariate linear regressions. The result is that the adjoint has boundary bias Ω(g(x)) when applied to a function g. Combined with existing analyses for the boundary bias of L^{LLR} (Ruppert & Wand, 1994), this gives (L_h^* L_h f)(x) = Ω(h⁻²)Ω(1) → ∞ if the boundary condition is not met. Interestingly, this yields a Dirichlet boundary condition when swapping the order in which the operators are applied, L_h L_h^*, as h → 0.

We note that this procedure is similar to LDR-LLE. The row weights must sum to one to preserve constant functions, S_n 1 = 1, and each point's tangent coordinates are perfectly reconstructed since the coordinate functions are linear functions of themselves. Empirically we found these perform very similarly, as shown in Figure 1, but Local Linear Regression was more likely to exclude the quadratic interaction polynomial from its bottommost eigenfunctions.

Figure 3: When restricted to cosine functions, which satisfy the Neumann boundary condition, the smoothness penalties induced by different Laplacian approximations (Laplacian, Coefficient Laplacian, LLR-Laplacian) are all similar. For other functions, sign(x)|x|^{(i+1)/3}, the smoothness penalty can significantly differ. The Coefficient Laplacian remains close to the true penalty even in this case. (Panels: penalty on a cosine basis; penalty on a polynomial basis.)

5.3. Laplacian without boundary conditions

For existing Laplacian approximations, the smoothness penalty f^T L_n f → ‖∇f‖² is only guaranteed to converge when f satisfies the boundary conditions imposed by L. Furthermore, the only existing Laplace-Beltrami approximation that yields a positive semidefinite matrix and non-negative smoothness penalty is the graph Laplacian.
We derive a construction, the Coefficient Laplacian, that is both self-adjoint and guarantees convergence of the smoothness penalty for all C^∞(M) functions. This is achieved by explicitly estimating the gradient with a local linear regression. For each neighborhood N_i, estimate normal coordinates U_{N_i} with respect to X_i. Define

    S_{N_i} = (Ũ_{N_i}^T Ũ_{N_i})⁻¹ Ũ_{N_i}^T (I − 1 e_i^T),      (14)
    L = Σ_i A_{N_i}^T S_{N_i}^T S_{N_i} A_{N_i}.                  (15)

Thus, S_{N_i} Y_{N_i} yields the coefficients of a linear regression of the centered Y_{N_i} − 1 Y_i on U_{N_i} without the intercept term. Consequently, f^T L f = Σ_i f^T Q_{N_i} f → Σ_i ‖∇f(X_i)‖² as n → ∞ and h → 0 sufficiently slowly.

The resulting operator has desirable behavior when used as a smoothing penalty. However, its eigenfunctions are useless for nonlinear dimensionality reduction. This occurs for two reasons. First, removal of boundary conditions yields an operator without a pure point spectrum. For example, in one dimension, the resulting operator solves the differential equation y″ = −λy, which has solutions y = cos(√λ x) for all λ > 0. Thus, it is unclear what solutions are picked out by an eigendecomposition of the discrete approximations. Second, one can show that L_h f does not converge uniformly at the boundary even when the functional f^T L_h f converges. As a result, the eigenvectors of the discrete approximations are uninformative. The resulting deficiencies are illustrated in Figure 2.
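A sketch of Eqs. (14)-(15) in matrix form. Here the neighborhood restriction A_{N_i} is realized by indexing, the local tangent coordinates are assumed to be centered on X_i (for example from the local SVD sketched earlier), and the slope-only regression is implemented with a pseudoinverse; all names are illustrative.

```python
import numpy as np

def coefficient_laplacian(n, neighborhoods, tangent_coords):
    """Coefficient Laplacian L = sum_i A_i^T S_i^T S_i A_i (Eqs. 14-15).

    neighborhoods[i]  : indices of N_i, which must contain i itself;
    tangent_coords[i] : |N_i| x m tangent coordinates of those points, centered on X_i.
    """
    L = np.zeros((n, n))
    for i, (idx, U_ni) in enumerate(zip(neighborhoods, tangent_coords)):
        k = len(idx)
        local_i = list(idx).index(i)                                   # position of X_i in N_i
        center = np.eye(k) - np.outer(np.ones(k), np.eye(k)[local_i])  # I - 1 e_i^T
        S_ni = np.linalg.pinv(U_ni) @ center       # (U^T U)^{-1} U^T (I - 1 e_i^T), Eq. (14)
        L[np.ix_(idx, idx)] += S_ni.T @ S_ni       # local squared-gradient penalty, Eq. (15)
    return L
```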

6. Discussion

This paper explores only one aspect of the connection between manifold learning and linear smoothing, namely the analysis of manifold learning methods and ways to use that analysis to improve existing NLDR methods.

However, this connection can be exploited in many additional ways. For instance, the methods developed for manifold learning can be used for linear smoothing. In particular, the linear smoother associated with LTSA can be used as a smoother for non-parametric regression problems. A simple consequence of our analysis is that the smoother can achieve O(h⁴) bias in the interior even though it only uses a first-order local linear regression. By contrast, local linear regression yields a bias of O(h²) in the interior.

The connection can also be exploited for multi-scale analyses. In the case of the Laplacian, the weighted bias operator L is the infinitesimal generator of a diffusion with P_t = exp(−Lt) defining its transition kernels, and can be used to generate diffusion wavelets (Coifman & Maggioni, 2006). Other bias operators can be substituted to generate orthogonal bases with local, compact support. Another possibility is to combine existing linear smoothers with manifold learning penalties to generate new smoothers and NLDR techniques by using the theory of generalized additive models (Buja et al., 1989).

References

Belkin, M. and Niyogi, P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.

Belkin, M. and Niyogi, P. Convergence of Laplacian eigenmaps. In NIPS, 2007.

Berry, T. and Sauer, T. Consistent manifold representation for topological data analysis. arXiv preprint arXiv:1606.02353, 2016.

Buja, A., Hastie, T., and Tibshirani, R. Linear smoothers and additive models. The Annals of Statistics, 17(2):453–510, 1989.

Cheng, M.Y. and Wu, H.T. Local linear regression on manifolds and its geometric interpretation. Journal of the American Statistical Association, 108(504):1421–1434, 2013.

Coifman, R. and Lafon, S. Diffusion maps. Applied and Computational Harmonic Analysis, 21(1):5–30, 2006.

Coifman, R. and Maggioni, M. Diffusion wavelets. Applied and Computational Harmonic Analysis, 21(1):53–94, 2006.

Donoho, D. L. and Grimes, C. Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences, 100(10):5591, 2003.

Goldberg, Y. and Ritov, Y. LDR-LLE: LLE with low-dimensional neighborhood representation. Advances in Visual Computing, pp. 43–54, 2008.

Hein, M., Audibert, J.-Y., and von Luxburg, U. Graph Laplacians and their convergence on random neighborhood graphs. Journal of Machine Learning Research, 8:1325–1370, 2007.

Roweis, S. T. and Saul, L. K. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323, 2000.

Ruppert, D. and Wand, M. Multivariate locally weighted least squares regression. The Annals of Statistics, 22(3):1346–1370, 1994.

Singer, A. and Wu, H.T. Spectral convergence of the connection Laplacian from random samples. Information and Inference: A Journal of the IMA, 6(1):58–123, 2016.

Tenenbaum, J., De Silva, V., and Langford, J. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.

Ting, D., Huang, L., and Jordan, M. I. An analysis of the convergence of graph Laplacians. In ICML, 2010.

von Luxburg, U., Belkin, M., and Bousquet, O. Consistency of spectral clustering. Annals of Statistics, 36(2):555–586, 2008.

Weinberger, K. and Saul, L. Unsupervised learning of image manifolds by semidefinite programming. International Journal of Computer Vision, 70(1):77–90, 2006.

Zhang, S., Ma, Z., and Tan, H. On the equivalence of HLLE and LTSA. IEEE Transactions on Cybernetics, 48(2):742–753, 2018.

Zhang, Z. and Wang, J. MLLE: Modified locally linear embedding using multiple weights. In NIPS, 2007.

Zhang, Z. and Zha, H. Principal manifolds and nonlinear dimensionality reduction via tangent space alignment. SIAM Journal on Scientific Computing, 26(1):313–338, 2004.

Appendices

A. Convergence to differential operator

Theorem 6. Let S_n be a sequence of linear smoothers where the support of S_n(x, ·) is contained in a ball of radius h_n. Further assume that the bias in the residuals satisfies h_n^{-k}(I − S_n)f = o(1) for some integer k and all f ∈ C ⊂ C^∞(M). If h_n → 0 and h_n^{-k}(I − S_n) converges to a bounded linear operator as n → ∞, then it must converge to a differential operator of order at most k on the domain C.

Proof. Let S_∞ be the limit of h_n^{-k}(I − S_n). Let q_ℓ be any monomial with degree ℓ > k, and let I_{x,n} be some smooth function that is 1 on a ball of radius h_n around x. Then the pointwise product satisfies ‖q_ℓ · I_{x,n}‖ = O(h_n^ℓ). Since S_∞ is bounded, (S_∞ q_ℓ)(x) = (S_∞(I_{x,n} · q_ℓ))(x) = O(h_n^{ℓ−k}) → 0. Furthermore, the convergence is uniform over all x. Thus, the behavior is determined on a basis of polynomials of degree at most k. Applying S_∞ to a Taylor expansion with remainder gives the desired result.

B. Equivalence of HLLE and LTSA

Theorem 7. $L_h^{LTSA} - L_h^{HLLE} \to 0$ as $h \to 0$ in the weak operator topology of $C^\infty(M)$ equipped with the $L_2$ norm.

Proof. The local operators $Q_i^{HLLE}$ for HLLE can be described more succinctly in terms of the difference of two linear smoothers. Let $H_x^{(2)}$ be the hat matrix for a local quadratic regression in the neighborhood $N_x$.

\begin{align}
Q_x^{HLLE} &= \alpha_h\bigl(H_x^{(2)} - H_x^{(1)}\bigr) \tag{16}\\
&= \alpha_h\bigl(Q_x^{LTSA} + (H_x^{(2)} - I_{N_x})\bigr). \tag{17}
\end{align}

Let $R_x = H_x^{(2)} - I_{N_x}$. For any $f, g \in C^\infty(M)$,

\begin{equation}
\alpha_h g^T R_x f = \alpha_h \langle R_x g, R_x f\rangle \le O(h^{-4+3+3}) \tag{18}
\end{equation}
since quadratic terms and lower can be removed from the Taylor expansion.
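The exponent in (18) can be unpacked as follows. This is a short sketch assuming the normalization $\alpha_h = \Theta(h^{-4})$ (the scaling referred to in Appendix C, consistent with the $-4$ in (18)) and the usual boundedness of the local hat matrices:
\[
R_x f = \bigl(H_x^{(2)} - I_{N_x}\bigr) f = O(h^3)
\quad\text{(a quadratic fit reproduces the degree-2 Taylor polynomial, leaving the cubic remainder),}
\]
\[
\alpha_h \langle R_x g, R_x f\rangle = \Theta(h^{-4}) \cdot O(h^3) \cdot O(h^3) = O(h^2) \to 0 .
\]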

C. Boundary behavior of HLLE and LTSA

Theorem 8. For any $x \in \partial M$ and function $f \in C^\infty(M)$, $L_h^{LTSA} f(x) \to \infty$ as $h \to 0$ unless
\begin{equation}
\frac{\partial^2 f}{\partial \eta_x^2}(x) = \frac{m+1}{m+2}(\Delta f)(x) \quad \forall x \in \partial M \tag{19}
\end{equation}
where $\eta_x$ is the tangent vector orthogonal to the boundary.

Proof. To prove this, we show that the required scaling by $h^{-4}$ for the functional to converge causes the value at the boundary to go to $\infty$ when this condition is not met. Let $x \in M$. Assume $h$ is sufficiently small so that there exists a normal coordinate chart at $x$ such that $B(x, 2h)$ is contained in the neighborhood for the chart. Furthermore, choose the normal coordinates such that the first coordinate $u_1$ corresponds to the direction $\eta_x$ orthogonal to the boundary and pointing inwards.

Consider the behavior of $L_h$ on the basis of polynomials expressed in this coordinate chart. Denote by $Q_y$ the hat operator for a linear regression on linear and constant functions restricted to a ball of radius $h$ centered on $y$. Note that any polynomial of degree at most 1 is in the null space of any $Q_y$ by construction. Likewise, by symmetry, any polynomial $u_i u_j$ with $j < i$ or of odd degree is in the null space of $Q_y$. This leaves only the polynomials $u_i^2$.

Each $(Q_y f)(z)$ computes the residual at $z$ from regressing $f$ against linear functions in the neighborhood at $y$. The residual $r_y = (Q_y u_i^2)(0) = (Q_y (u_i - y_i)^2)(0)$ for all $y \in B(0, h)$ since linear functions are in the null space. By symmetry of the neighborhoods in the interior, $r_y = (Q_0 u_i^2)(-y)$ whenever $i > 1$. In other words, in the interior of the manifold, averaging over the multiple regressions is equivalent to averaging over the residuals of a single regression. Since the expected residual is always 0 for least squares regression, $(L_h f)(x) = O(h^4)$ for any $x \in \mathrm{Int}(M)$.

More intuitively, one wishes to evaluate the residual at $z$ over all neighborhoods $N_y$ that include $z$. Because linear functions are in the null space, this depends only on the curvature of the response $f$ at $x$. Furthermore, since all neighborhoods have essentially identical shape, the hat matrices for every neighborhood are also identical, so the residual only depends on the offset $z - y$. Averaging over the different $y$ is equivalent to fixing a single regression at $B(z, h)$ and averaging over the residuals at all offsets $y \in B(z, h)$.

However, the boundary behavior is very different since the shape of the neighborhoods can change. First, consider the quadratic polynomial $u_1^2$ in the direction orthogonal to the tangent space on the boundary. Since the residual is always evaluated at the boundary for a linear regression and $u_1^2$ is convex, the residual is necessarily positive. Figure 4 provides an intuitive illustration of this. Furthermore, since evaluating $(Q_y^{(\alpha h)} (\alpha h)^{-2} u_1^2)(x)$ gives $O((\alpha h)^{-4})$, it follows that $(L_h f)(x) = \alpha h^2\, \partial^2 f/\partial u_1^2(x) + O(h^3)$ for some non-zero constant $\alpha$.

For polynomials $u_i^2$ where $i > 1$, we note that any linear regression on linear functions $u_j$ has corresponding coefficients $\beta_j = 0$ for all $j > 1$ by symmetry of the neighborhood. Thus, the prediction can be reduced to a linear regression where the response is the conditional mean $E(u_i^2 \mid u_1)$ given the neighborhood. Integrating over slices of an $m$-sphere gives

\begin{equation}
E(u_i^2 \mid u_1) = (m+1)^{-1}(1 - u_1^2). \tag{20}
\end{equation}

Since constant terms are removed by the regression, and since summing over the residual function for a given value of $u_1$ gives
\[
E(u_i^2 - \hat u_i^2 \mid u_1) = E\bigl(u_i^2 - E(u_i^2 \mid u_1) \mid u_1\bigr) + E\bigl(E(u_i^2 \mid u_1) - \hat u_i^2 \mid u_1\bigr) = E\bigl(E(u_i^2 \mid u_1) - \hat u_i^2 \mid u_1\bigr),
\]
the summed residuals are the same as those obtained for the function $-u_1^2/(m+1)$.

Thus, for any $x \in \partial M$, $(L_h f)(x) \to \infty$ unless
\[
0 = \frac{\partial^2 f}{\partial u_1^2} + \frac{m+1}{m-1}\sum_{i=2}^m \frac{\partial^2 f}{\partial u_i^2},
\]
which is equivalent to equation (19).
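The positivity of the boundary residual for a convex quadratic, used above and illustrated in Figure 4, can be checked directly with a toy one-dimensional least-squares fit. The following is a minimal numerical sketch with uniform grid weights and an arbitrary bandwidth, not the construction used in the proof:

    import numpy as np

    h = 0.1

    # One-sided neighborhood at a boundary point: data only on [0, h] (u1 >= 0).
    u = np.linspace(0.0, h, 201)
    X = np.column_stack([np.ones_like(u), u])     # regression on constant and linear terms
    f = u**2                                      # convex quadratic in the inward normal direction
    beta, *_ = np.linalg.lstsq(X, f, rcond=None)
    print(f[0] - X[0] @ beta)                     # residual at the boundary point: strictly positive, of order h^2

    # Symmetric interior neighborhood: data on [-h, h].
    v = np.linspace(-h, h, 401)
    Xi = np.column_stack([np.ones_like(v), v])
    g = v**2
    bi, *_ = np.linalg.lstsq(Xi, g, rcond=None)
    print(np.mean(g - Xi @ bi))                   # average residual over the neighborhood: numerically zero

The first printed value is the quantity depicted in Figure 4; the second reflects the interior cancellation used in the proof.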

Figure 4: Figure illustrating that the residual is always positive when evaluating a linear regression on the boundary. The dots are the centers defining the neighborhoods on which regression is performed. The vertical line represents the boundary of the manifold that the neighborhoods do not cross. The dashed lines are the regression fits. It is easy to see that the residuals at the boundary are always strictly positive.

We checked this boundary condition using simulation on a manifold isomorphic to a rectangle. We took 10 estimated eigenfunctions and computed their Hessians at a point on the boundary. This generates a $10 \times 3$ matrix consisting of the estimates $\partial^2 f/\partial u_1^2$, $\partial^2 f/\partial u_2^2$, $\partial^2 f/\partial u_1 \partial u_2$. We take the SVD of this matrix. The distribution of the top, middle, and bottom singular values is shown in Figure 5. There is one singular value clearly close to 0 that represents the boundary condition. The average bottom right singular vector is given in Table 3 and shows fairly good correspondence to the theoretical calculations on a modestly fine grid.
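A check along these lines can be sketched as follows. This is a hedged sketch rather than the exact simulation code used here; the array `phi` (estimated eigenfunctions sampled on a regular grid), the spacings `dx` and `dy`, and the boundary index `(i0, j0)` are hypothetical names for inputs the reader must supply.

    import numpy as np

    def hessian_rows(phi, i0, j0, dx, dy):
        # phi: array of shape (nx, ny, K) holding K estimated eigenfunctions on a grid.
        # (i0, j0): a boundary grid point, with the boundary at i = i0 and u1 increasing inward.
        rows = []
        for k in range(phi.shape[2]):
            f = phi[:, :, k]
            # one-sided second difference in the inward normal direction u1
            f11 = (2*f[i0, j0] - 5*f[i0+1, j0] + 4*f[i0+2, j0] - f[i0+3, j0]) / dx**2
            # central second difference along the boundary direction u2
            f22 = (f[i0, j0+1] - 2*f[i0, j0] + f[i0, j0-1]) / dy**2
            # mixed derivative: one-sided in u1, central in u2
            f12 = ((f[i0+1, j0+1] - f[i0+1, j0-1]) - (f[i0, j0+1] - f[i0, j0-1])) / (2*dx*dy)
            rows.append([f11, f22, f12])
        return np.array(rows)                     # K x 3 matrix of Hessian estimates

    # H = hessian_rows(phi, i0, j0, dx, dy)
    # U, s, Vt = np.linalg.svd(H, full_matrices=False)
    # print(s)       # one singular value should be close to 0
    # print(Vt[-1])  # its right singular vector encodes the boundary condition

The near-null right singular vector plays the role of the bottom right singular vector reported in Table 3.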

[Density plot, panel title "distribution of Hessian eigenvalues"; kernel density estimate with N = 51, bandwidth = 0.132.]

Figure 5: Figure illustrating singular values of the Hessian of eigenfunctions at the boundary.

2nd Derivative                              Estimated value
$\partial^2 f/\partial u_1^2$               0.935
$\partial^2 f/\partial u_2^2$               -0.346
$\partial^2 f/\partial u_1 \partial u_2$    0.00
$\Delta f$                                  0.588
Predicted $\partial^2 f/\partial u_1^2$     0.882

Table 3: Table of mean Hessian values and the prediction of the 2nd derivative given the Laplacian, showing fair correspondence of the simulated values to the predicted ones on a modestly fine grid.

D. Boundary behavior of local linear regression

As the boundary bias of local linear regression is already well studied, existing results can help determine the boundary conditions for the resulting operator. However, as the local linear regression smoother is not self-adjoint, the behavior of the adjoint must also be determined. Let $x \in \partial M$ and $u$ be normal coordinates at $x$ in a neighborhood $B$ containing $B(x, 2h) \cap M$. We again choose $u_1$ to correspond to the tangent vector that is normal to the boundary and pointing inwards, so that $u_1 \ge 0$. We wish to evaluate the boundary behavior of $L_h^* L_h f(x) = \langle L_h \delta_x, L_h f\rangle$, where $L_h^*$ is the adjoint of $L_h$ and $\delta_x$ is the Dirac delta. The main part of our proof is to evaluate $L_h \delta_x$.
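The reduction to evaluating $L_h \delta_x$ is simply the definition of the adjoint applied with the evaluation functional; written out, treating $\delta_x$ formally inside the $L_2$ pairing,
\[
(L_h^* L_h f)(x) = \langle \delta_x,\, L_h^* L_h f\rangle = \langle (L_h^*)^* \delta_x,\, L_h f\rangle = \langle L_h \delta_x,\, L_h f\rangle .
\]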

Similar to the proof of Theorem 8, we first show that we can reduce the problem to a univariate linear regression. We do this through an orthogonal basis. Then we use the usual regression equations to actually compute the value. For each $y \in B(x, h) \cap M$, denote $N_y = B(y, h) \cap M$ and $\tilde u_i^{(0)}(\cdot, y) = h^{-1} \mathrm{Vol}(N_y)^{-1}\bigl(u_i(\cdot)\, \mathbb{1}(\cdot \in N_y) - u_i(y)\bigr)$. It is easy to see by symmetry that there are functions $\tilde u_i(\cdot, y) = \tilde u_i^{(0)}(\cdot, y) + \epsilon_i$ which are orthogonal to each other and to the constant function in $L_2(N_y)$ for $i > 1$ and where $\epsilon_i = O(h^2)$. Likewise, let $\tilde u_1(\cdot, y)$ be similarly defined to be orthogonal to $\tilde u_i(\cdot, y)$ for $i > 1$. To generate an orthonormal basis, we must orthogonalize it with respect to the constant function as well. Gram-Schmidt gives $v_1(\cdot, y) = \tilde u_1(\cdot, y) - \mathrm{Vol}(N_y)^{-1}\langle \tilde u_1(\cdot, y), I(\cdot \in N_y)\rangle\, I(\cdot \in N_y)$, which is orthogonal to the constant function. Thus, by evaluating
\begin{align}
h^2 (L_h \delta_x)(y) &= \delta_x(y) - \mathrm{Vol}(N_y)^{-1} I(x \in N_y) \tag{21}\\
&\quad - \sum_{i=2}^m \|\tilde u_i(\cdot, y)\|^{-2}\, \tilde u_i(y, y)\, \langle \tilde u_i(\cdot, y), \delta_x\rangle \tag{22}\\
&\quad - \|v_1(\cdot, y)\|^{-2}\, v_1(y, y)\, \langle v_1(\cdot, y), \delta_x\rangle \tag{23}\\
&= \delta_x(y) - \mathrm{Vol}(N_y)^{-1} I(y \in N_x) \tag{24}\\
&\quad - \mathrm{Vol}(N_y)^{-1} \|v_1(\cdot, y)\|^{-2}\, \langle \tilde u_1(\cdot, y), I(\cdot \in N_y)\rangle\, v_1(x, y) \tag{25}
\end{align}
where the inner product is taken with respect to $L_2(B)$. Integrating over $B$, the first two terms each have magnitude 1 and cancel each other out. The third term may be rewritten
\begin{align}
&- \mathrm{Vol}(N_y)^{-1} \|v_1(\cdot, y)\|^{-2}\, \langle \tilde u_1(\cdot, y), I(\cdot \in N_y)\rangle\, v_1(x, y) \tag{26}\\
&\qquad = -\|v_1(\cdot, y)\|^{-2}\bigl(\tilde u_1(x, y)\,\mu_y - \mu_y^2\bigr) \tag{27}\\
&\qquad = \|v_1(\cdot, y)\|^{-2}\bigl(\mathrm{Vol}(N_y)^{-1} h^{-1} u_1(y)\,\mu_y + \mu_y^2\bigr) \tag{28}
\end{align}
where $\mu_y = \mathrm{Vol}(N_y)^{-1}\langle \tilde u_1(\cdot, y), I(\cdot \in N_y)\rangle > 0$ unless $u_1(y)(1 + O(h^2)) \ge h$. It follows that
\begin{align}
\langle L_h \delta_x, I(\cdot \in N_x)\rangle &= h^{-2}\bigl\langle \Omega(\mathrm{Vol}(N_y)^{-1}), I(\cdot \in N_x)\bigr\rangle \tag{30}\\
&= \Omega(h^{-2}) \tag{31}\\
&\to \infty \text{ as } h \to 0. \tag{32}
\end{align}
By applying this to a Taylor expansion, one concludes that for any continuously differentiable function which is non-zero at $x \in \partial M$, $|(L_h^* f)(x)| \to \infty$ as $h \to 0$.

By Theorem 2.2 in (Ruppert & Wand, 1994), a point $x$ with distance $\mathrm{dist}(x, \partial M) < h$ has boundary bias
\[
(L_h f)(x) = \alpha\, (\mu_2 \ \ \mu_1) \int_{D_{x,h}} \binom{1}{u_1}\, \mathrm{Tr}\bigl(Hf(x)\, u u^T\bigr)\, du + o\bigl(\mathrm{Trace}(Hf(x))\bigr),
\]
\[
\mu_1 = \int_{D_{x,h}} u_1\, du, \qquad \mu_2 = \int_{D_{x,h}} u_1^2\, du,
\]
where $\alpha > 0$ is some constant and $D_{x,h}$ is the unit $m$-sphere cut along the plane orthogonal to the first coordinate at $u_1 = \mathrm{dist}(x, \partial M)/h$. This is a linear function in the Hessian. Furthermore, all odd moments of $u_i$ for $i > 1$ are 0, and their second moments are all equal. It is therefore a linear function of the diagonal of the Hessian, of the form $(L_h f)(x) = -\beta\, \partial^2 f/\partial u_1^2(x) + (\Delta f)(x) + o(1)$ for some $\beta \ne 0$. Thus, the boundary condition that functions must satisfy is
\begin{equation}
\beta\, \frac{\partial^2 f}{\partial u_1^2}(x) = (\Delta f)(x). \tag{33}
\end{equation}
Although it takes the same form as the HLLE boundary condition, the constants are different. However, we do not know of a reason to prefer one boundary condition over the other.
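As a rough numerical companion to this boundary-bias form, one can compare the bias of a plain, unweighted local linear fit at a boundary point for the two pure quadratics $u_1^2$ and $u_2^2$ and observe that the Hessian diagonal is weighted unequally. The half-disk sampling, bandwidth, and sample size below are arbitrary illustrative choices, not the kernel-weighted setting of Theorem 2.2:

    import numpy as np

    rng = np.random.default_rng(0)
    h, n = 0.2, 200000

    # Sample uniformly from the half-disk {|u| < h, u1 > 0}: the neighborhood of a boundary point at 0.
    pts = rng.uniform(-h, h, size=(4 * n, 2))
    pts = pts[(pts[:, 0] > 0) & (np.sum(pts**2, axis=1) < h**2)][:n]
    X = np.column_stack([np.ones(len(pts)), pts])   # local linear design at the boundary point

    def bias_at_origin(f_vals):
        beta, *_ = np.linalg.lstsq(X, f_vals, rcond=None)
        return beta[0]                              # prediction at u = 0; f(0) = 0, so this is the bias

    b11 = bias_at_origin(pts[:, 0]**2)              # bias contributed by d^2 f / du1^2
    b22 = bias_at_origin(pts[:, 1]**2)              # bias contributed by d^2 f / du2^2
    print(b11, b22, b11 / b22)                      # unequal weights on the Hessian diagonal

The two biases differ, which is the content of the statement that the boundary bias takes the form $-\beta\, \partial^2 f/\partial u_1^2 + \Delta f$ with $\beta \ne 0$ rather than a multiple of the Laplacian alone.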