TANGENT SPACE INTRINSIC MANIFOLD REGULARIZATION FOR DATA REPRESENTATION

Shiliang Sun

Department of Computer Science and Technology, East China Normal University, 500 Dongchuan Road, Shanghai 200241, P. R. China

ABSTRACT

A new regularization method called tangent space intrinsic manifold regularization is presented, which is intrinsic to the data manifold and favors linear functions on the manifold. Fundamental elements involved in its formulation are local tangent space representations, which we estimate by local principal component analysis, and the connections which relate adjacent tangent spaces. We exhibit its application to data representation, where a nonlinear embedding in a low-dimensional space is found by solving an eigen-decomposition problem. Experimental results, including comparisons with state-of-the-art techniques, show the effectiveness of the proposed method.

Index Terms— Regularization, tangent space, manifold learning, dimensionality reduction, data representation

Thanks to NSFC Project 61075005 and Shanghai Knowledge Service Platform Project (No. ZF1213) for funding, and to Prof. Tong Zhang from Rutgers University for helpful discussions. Email: [email protected]

1. INTRODUCTION

In this paper, we present a new data-dependent regularization method named tangent space intrinsic manifold regularization for function learning, which is motivated largely by the geometry of the data-generating distribution. In particular, data lying in a high-dimensional space are assumed to be intrinsically of low dimensionality. That is, the data can be well characterized by far fewer parameters or degrees of freedom than the ambient representation suggests. This setting is usually referred to as manifold learning, and the data distribution is regarded as living on or near a low-dimensional manifold. The validity of manifold learning, especially for high-dimensional image and text data, has already been confirmed by recent developments [1, 2, 3]. Although our new regularization method has the potential to be applied to a variety of pattern recognition and machine learning problems, here we only consider unsupervised data representation or dimensionality reduction.

The problem of representing data in a low-dimensional space for the sake of data visualization and organization is essentially a dimensionality reduction problem. Given a data set {x_i}_{i=1}^k with x_i ∈ R^d, the task is to deliver a data set {f_i}_{i=1}^k where f_i ∈ R^m corresponds to x_i and m << d. Classical linear dimensionality reduction methods include principal component analysis (PCA) and metric multidimensional scaling (MDS) [4]. Recent nonlinear dimensionality reduction methods have achieved great success, especially for representing data that obey the manifold assumption, e.g., successfully unfolding the intrinsic degrees of freedom for gradual changes of images and videos. Representative algorithms in this category include locally linear embedding [2], isomap [3], Laplacian eigenmaps [1], Hessian eigenmaps [5], maximum variance unfolding with semidefinite programming [6], and others [7]. Some of the above linear and nonlinear dimensionality reduction methods can be characterized as spectral methods, because computationally they often involve the eigen-decomposition of an appropriately constructed matrix and then exploit its top or bottom eigenvectors [6].

The principle of regularization has its root in mathematics for solving ill-posed problems, and is widely used in pattern recognition and machine learning. Many well-known algorithms, e.g., SVMs [8, 9], ridge regression and the lasso [10, 11], can be interpreted as instantiations of the idea of regularization. The regularization method presented in this paper is intrinsic to the data manifold and prefers linear functions on the manifold; namely, functions with constant manifold derivatives are more appealing. Fundamental elements involved in the regularization formulation are local tangent space representations and the connections which relate adjacent tangent spaces. When applying this regularization principle to nonlinear dimensionality reduction with data modeled as adjacency graphs, the resultant problem can be readily solved by eigen-decomposition. We illustrate that this regularization method can obtain good and reasonable data embedding results.
2. THE PROPOSED REGULARIZATION

We are interested in estimating a function f(x) defined on M ⊂ R^d, where M is a smooth manifold in R^d. We assume that f(x) can be well approximated by a linear function with respect to the manifold M. Let m be the dimensionality of M. At each point z ∈ M, f(x) can be represented locally around z as a linear function f(x) ≈ b_z + w_z^T u_z(x) + o(||x − z||^2), where u_z(x) = T_z(x − z) is an m-dimensional vector representing x in the tangent space around z, and T_z is an m × d matrix that projects x around z to a representation in the tangent space of M at z. Note that in this paper the basis for T_z is computed using local PCA for its simplicity and wide applicability. In particular, the point z and its neighbors are sent to the regular PCA procedure, and the top m eigenvectors are returned as the rows of matrix T_z. The weight vector w_z ∈ R^m is an m-dimensional vector; it is also the manifold derivative of f(x) at z with respect to the u_z(·) representation on the manifold, which we write as ∇_T f(x)|_{x=z} = w_z.

Mathematically, a linear function with respect to the manifold M, which is not necessarily a globally linear function in R^d, is a function that has constant manifold derivative. However, this does not mean that w_z stays constant as z varies, due to the different coordinate systems when the "anchor point" z changes from one point to another. This needs to be compensated using "connections" that map a coordinate representation u_{z'} to u_z for any z' near z.
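As an illustration of the local PCA step just described, the tangent bases T_z could be estimated as follows. This is a minimal sketch rather than the authors' implementation; the function name, the NumPy-based code, the naive distance computation, and the neighborhood size n_neighbors are our own assumptions.

```python
import numpy as np

def local_tangent_bases(X, m, n_neighbors=10):
    """Estimate an m x d tangent basis T_z at every point of X (k x d) by local PCA:
    run PCA on each point together with its nearest neighbors and keep the top m
    principal directions as the rows of T_z. Requires m <= min(n_neighbors + 1, d)."""
    k, d = X.shape
    # Naive O(k^2 d) pairwise squared Euclidean distances (fine for moderate k).
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    T = np.zeros((k, m, d))
    for i in range(k):
        idx = np.argsort(sq_dists[i])[:n_neighbors + 1]   # the point itself plus its neighbors
        local = X[idx] - X[idx].mean(axis=0)               # center the local patch
        # Rows of Vt are the principal directions in decreasing order of variance;
        # the top m of them form T_z, which then has orthonormal rows (T_z T_z^T = I).
        _, _, Vt = np.linalg.svd(local, full_matrices=False)
        T[i] = Vt[:m]
    return T
```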
To see how our approach works, we assume for simplicity that T_z is an orthogonal matrix for all z: T_z T_z^T = I_{m×m}. This means that if x ∈ M is close to z ∈ M, then x − z ≈ T_z^T T_z(x − z) + O(||x − z||^2). Now consider x that is close to both z and z'. We can express f(x) in the tangent space representations at both z and z', which gives

b_z + w_z^T u_z(x) ≈ b_{z'} + w_{z'}^T u_{z'}(x) + O(||x − z'||^2 + ||x − z||^2).

That is, b_z + w_z^T u_z(x) ≈ b_{z'} + w_{z'}^T u_{z'}(x), which means that b_z + w_z^T T_z(x − z) ≈ b_{z'} + w_{z'}^T T_{z'}(x − z'). Setting x = z, we obtain b_z ≈ b_{z'} + w_{z'}^T T_{z'}(z − z'), and therefore b_{z'} + w_{z'}^T T_{z'}(z − z') + w_z^T T_z(x − z) ≈ b_{z'} + w_{z'}^T T_{z'}(x − z'). This implies that

w_z^T T_z(x − z) ≈ w_{z'}^T T_{z'}(x − z) ≈ w_{z'}^T T_{z'} T_z^T T_z(x − z) + O(||x − z'||^2 + ||x − z||^2).   (1)

Since (1) holds for arbitrary x ∈ M close to z ∈ M, it follows that w_z^T ≈ w_{z'}^T T_{z'} T_z^T + O(||z − z'||), or equivalently w_z ≈ T_z T_{z'}^T w_{z'} + O(||z − z'||).

This means that if we expand at points z_1, ..., z_k ∈ Z, and denote the neighbors of z_j by N(z_j), then the resulting regularizer is

R({b_z, w_z}_{z∈Z}) = Σ_{i=1}^k Σ_{j∈N(z_i)} [ (b_{z_i} − b_{z_j} − w_{z_j}^T T_{z_j}(z_i − z_j))^2 + γ ||w_{z_i} − T_{z_i} T_{z_j}^T w_{z_j}||_2^2 ].   (2)

With z(x) = arg min_{z∈Z} ||x − z||_2, the function f(x) is approximated as

f(x) = b_{z(x)} + w_{z(x)}^T T_{z(x)}(x − z(x)),   (3)

which is a very natural formulation for out-of-sample extensions.
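To make the out-of-sample rule (3) concrete, a minimal sketch is given below; the array layout chosen for the anchors Z, the offsets b, the weights w, and the tangent bases T is our own assumption rather than anything specified in the paper.

```python
import numpy as np

def predict(x, Z, b, w, T):
    """Out-of-sample evaluation following Eq. (3): pick the nearest anchor z(x) and
    use its local linear model. Z is (k, d), b is (k,), w is (k, m), T is (k, m, d)."""
    i = np.argmin(((Z - x) ** 2).sum(axis=1))   # z(x) = argmin_{z in Z} ||x - z||_2
    u = T[i] @ (x - Z[i])                       # tangent-space coordinates u_z(x) = T_z (x - z)
    return b[i] + w[i] @ u                      # f(x) = b_z + w_z^T u_z(x)
```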

3. GENERALIZATION AND REFORMULATION

Relating data to a discrete weighted graph is popular, especially in graph-based machine learning methods. It also makes sense for us to generalize the regularizer in (2) using a symmetric weight matrix W constructed from the above data collection Z. Entries in W characterize the closeness of different points, where the points are often called nodes in the terminology of graphs. Usually there are two steps involved in constructing a weighted graph. The first step builds an adjacency graph by putting an edge between two "close" points. One can choose a parameter ε ∈ R or a parameter n ∈ N to determine close points, meaning that two nodes are connected if their Euclidean distance is within ε, or if either node is among the n nearest neighbors of the other as measured by the Euclidean distance. The second step assigns weights to the edges of the graph with a certain similarity measure. For example, the heat kernel computes the weight W_ij for two connected nodes i and j by W_ij = exp(−||x_i − x_j||^2 / t) with parameter t > 0, while for nodes not directly connected the weights are zero [1].

Therefore, the generalization of the tangent space intrinsic manifold regularizer is

R({b_z, w_z}_{z∈Z}) = Σ_{i=1}^k Σ_{j=1}^k W_ij [ (b_{z_i} − b_{z_j} − w_{z_j}^T T_{z_j}(z_i − z_j))^2 + γ ||w_{z_i} − T_{z_i} T_{z_j}^T w_{z_j}||_2^2 ].   (4)

Now we reformulate the regularizer (4) into a canonical matrix quadratic form to facilitate the subsequent formulation of data representation. In particular, we would like to rewrite the regularizer as a quadratic form in terms of a symmetric matrix S as follows,

R({b_z, w_z}_{z∈Z}) = v^T [ S_1  S_2 ; S_2^T  S_3 ] v,   where v = (b_{z_1}, ..., b_{z_k}, w_{z_1}^T, ..., w_{z_k}^T)^T,   (5)

and [ S_1  S_2 ; S_2^T  S_3 ] is a block matrix representation of S with S_1 of size k × k. In this paper, we suppose that each w_{z_i} (i = 1, ..., k) is an m-dimensional vector. Thus, the size of S is (k + mk) × (k + mk). Due to the space limit, we omit the detailed derivation of S_1, S_2 and S_3, which will be made available on the internet.
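The graph construction and the generalized regularizer (4) can be transcribed almost directly. The sketch below assumes the anchor points are stored row-wise in Z with tangent bases T estimated as in Section 2; the symmetrized nearest-neighbor rule, the default parameter values, and the function names are our own choices.

```python
import numpy as np

def heat_kernel_weights(Z, n_neighbors=10, t=1.0):
    """Symmetric weight matrix W as in Section 3: connect i and j if either is among
    the other's n nearest neighbors, and weight connected edges by
    exp(-||z_i - z_j||^2 / t); weights are zero otherwise."""
    k = Z.shape[0]
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    W = np.zeros((k, k))
    for i in range(k):
        nn = np.argsort(sq[i])[1:n_neighbors + 1]      # skip the point itself
        W[i, nn] = np.exp(-sq[i, nn] / t)
    return np.maximum(W, W.T)                           # symmetrize ("either node" rule)

def regularizer(b, w, Z, T, W, gamma=1.0):
    """Directly evaluate the generalized regularizer (4) for given {b_z, w_z}."""
    k = Z.shape[0]
    R = 0.0
    for i in range(k):
        for j in range(k):
            if W[i, j] == 0.0:
                continue
            fit = b[i] - b[j] - w[j] @ (T[j] @ (Z[i] - Z[j]))   # first-order consistency term
            conn = w[i] - T[i] @ T[j].T @ w[j]                  # connection between tangent spaces
            R += W[i, j] * (fit ** 2 + gamma * conn @ conn)
    return R
```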

Table 1. Residual variances on the Swiss roll with a hole.

TSIMR     isomap    LLE       MVU       Laplacian
0.6950    0.9281    0.9247    0.9308    0.7991
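The residual variances above use the metric 1 − R²(D_1, D_2) of [3]. Under the common reading that D_1 and D_2 are matrices of pairwise distances (e.g., input-space or geodesic distances versus distances in the embedding) and R is their linear correlation coefficient, a sketch is the following; this interpretation and the function name are our own assumptions.

```python
import numpy as np

def residual_variance(D1, D2):
    """Residual variance 1 - R^2 between two pairwise-distance matrices, with R taken
    as the Pearson correlation of their entries (our reading of the metric in [3])."""
    r = np.corrcoef(D1.ravel(), D2.ravel())[0, 1]
    return 1.0 - r ** 2
```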

Fig. 1. Embedding results of the TSIMR on the Swiss roll: (a) Swiss-roll data, (b) Result.

Fig. 2. Embedding results on the Swiss roll with a hole: (a) Data with a hole, (b) TSIMR, (c) isomap, (d) LLE, (e) MVU, (f) Laplacian.

Fig. 3. Embedding results for the face images: (a) TSIMR, (b) isomap, (c) LLE, (d) MVU, (e) Laplacian.

4. APPLICATION TO DATA REPRESENTATION

Data representation or dimensionality reduction is intrinsically an ill-posed problem, since it is an unsupervised task and there can be multiple different solutions depending on the preferences or specific objectives defined by users. When the tangent space intrinsic manifold regularization method is used, we actually prefer embedding functions with constant manifold derivatives.

Define w = (w_{z_1}^T, w_{z_2}^T, ..., w_{z_k}^T)^T, and suppose the vector f = (b_{z_1}, b_{z_2}, ..., b_{z_k})^T is an embedding or representation of the points z_1, z_2, ..., z_k on a line. Suppose the whole graph is connected (otherwise we can compute an embedding for each connected component). A reasonable criterion for finding a good embedding under the principle of tangent space intrinsic manifold regularization is to minimize the objective [f; w]^T S [f; w] under appropriate constraints, where [f; w] denotes the column vector obtained by stacking f and w. To remove an arbitrary scaling factor in both f and w, we impose the constraint [f; w]^T [f; w] = 1. Therefore, the optimization problem becomes

min_{f,w} [f; w]^T S [f; w],   s.t. [f; w]^T [f; w] = 1.   (6)

It is easy to show that the solution is given by the eigenvector corresponding to the minimal eigenvalue of the eigen-decomposition S [f; w] = λ [f; w]. Since the matrix S is positive semidefinite, all its eigenvalues {λ_1, λ_2, ..., λ_{k+mk}}, which we suppose are sorted in ascending order, are non-negative. Hence, the embedding f can be found as the first k entries of the eigenvector corresponding to the least eigenvalue. Meanwhile, we also obtain the w that matches the embedding f.

In order to find an embedding in R^m, where m is the dimensionality of the manifold M, we define a matrix F = (f_1, ..., f_m), where f_1, ..., f_m are the first k entries of the eigenvectors corresponding to the m least eigenvalues, respectively. Then the embedding of z_i (i = 1, ..., k) in R^m is the i-th row of matrix F.
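Once S from (5) is available, the optimization (6) reduces to an ordinary symmetric eigenproblem. Below is a minimal sketch of the embedding step described above, assuming S has already been assembled as a dense array; the function name and dense eigensolver are our own choices.

```python
import numpy as np

def embed(S, k, m):
    """Given the (k + m*k) x (k + m*k) symmetric matrix S of (5), solve (6):
    take the eigenvectors with the m smallest eigenvalues and keep their first k
    entries as columns of F, so that row i of F is the embedding of z_i in R^m."""
    eigvals, eigvecs = np.linalg.eigh(S)   # symmetric eigen-decomposition, ascending eigenvalues
    F = eigvecs[:k, :m]                    # first k entries of the m bottom eigenvectors
    return F
```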
5. EXPERIMENTS

We evaluated tangent space intrinsic manifold regularization (TSIMR) for data representation on synthetic and real-world data sets. The parameter γ is set to 1.

5.1. Swiss roll without and with a hole

The "Swiss roll" is a 2-dimensional manifold embedded in a 3-dimensional ambient space. For the first experiment we uniformly sampled 2000 points from the manifold, which are depicted in Fig. 1(a). The construction of the adjacency graph uses 10 nearest neighbors, and the heat kernel is employed to assign weights to the edges of the graph. Specifically, the kernel parameter t is fixed as the average of the squared distances between all points and their nearest neighbors. For data embedding in a 2-dimensional space, the result is shown in Fig. 1(b), which precisely reflects the intrinsic degrees of freedom of the manifold.

Moreover, we show that our method still works well even if there is a hole in the Swiss roll. This case resembles the practical situation in which data are not sufficiently sampled from a certain region. We shoveled a rectangular hole through the manifold, and the corresponding sample is given in Fig. 2(a). The data representation in a 2-dimensional space by the TSIMR is shown in Fig. 2(b), which faithfully reflects the existence and shape of the rectangular hole. The embedding results using isomap, LLE (locally linear embedding), MVU (maximum variance unfolding), and Laplacian regularization are also included in Fig. 2.

Now we use the metric of residual variance, namely 1 − R²(D_1, D_2) [3], to measure the embedding performance of the different methods. A low residual variance reflects a good data representation. From the results reported in Table 1, we see that the TSIMR method gives the best data representation.

5.2. Face images

This data set contains 1965 face images taken from sequential frames of a small video [2]. The size of each image is 20 × 28. Since the face images mainly exhibit varying pose and expression, they are believed to reside on a low-dimensional manifold.

The construction of the adjacency graph uses 12 nearest neighbors, which is identical to the setup in [2]. The heat kernel is adopted to assign weights to the edges of the graph, where the parameter t is set to 5 d_av, with d_av being the average of the squared distances between all points and their nearest neighbors. For data embedding in a 2-dimensional space, the TSIMR method provides a very tidy and compact representation, as shown in Fig. 3(a). Embedding results using the other methods with the same setting of 12 nearest neighbors are given in Fig. 3(b)∼3(e). The outcome of our TSIMR is more concise and orderly than the other results.

For interpretation of the embedding results, in Fig. 4 we randomly select and render half of the original face pictures in the low-dimensional space. We identify three tendencies of variation in pose and expression, which are indicated with curves. The left curve shows a gradual change of pose from left to right, and of expression from calm to grimace. The right curve corresponds to a change of expression, first from happy to happy and grimace, and then to calm and grimace. The bottom straight line shows an interleaving between pose and expression: first, under the sad expression, pose changes from left to right; then, under the happy expression, pose repeats the same "from left to right" pattern. The outcome of the TSIMR illustrates a well-interpretable property of the intrinsic degrees of freedom of the face images.

Fig. 4. Embedding results with the corresponding face images.

To further understand the behavior of the TSIMR, we examine the locally linear representation given in (3). The vector w_{z(x)} and the matrix T_{z(x)} jointly determine a projection vector w_{z(x)}^T T_{z(x)}, whose effect is to perform an inner product with x in the original coordinate system. Consequently, similar to the eigenface representation given by the major eigenvectors derived from PCA, we can visualize w_{z(x)}^T T_{z(x)} as a tangent space intrinsic manifold face, whose visualization is omitted here due to the space limit. Note that w_{z(x)}^T T_{z(x)} actually represents a linear combination of the eigenvectors given by local PCA.

6. CONCLUSION

In this paper, we have proposed a new regularization method called tangent space intrinsic manifold regularization, which favors linear functions on the manifold and can perform direct out-of-sample extensions. We derived the corresponding matrix quadratic form representation and further proposed a new method for data representation whose solution turns out to be an eigen-decomposition problem. Experimental results, including comparisons with other methods, have shown the effectiveness of the proposed method.

7. REFERENCES

[1] M. Belkin and P. Niyogi, “Laplacian eigenmaps for dimensionality reduction and data representation,” Neural Computation, vol. 15, pp. 1373–1396, 2003.

[2] S. Roweis and L. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, vol. 290, pp. 2323–2326, 2000.

[3] J. Tenenbaum, V. de Silva, and J. Langford, “A global geometric framework for nonlinear dimensionality reduction,” Science, vol. 290, pp. 2319–2323, 2000.

[4] R. Duda, P. Hart, and D. Stork, Pattern Classification, Wiley, New York, 2000.

[5] D. Donoho and C. Grimes, “Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data,” in Proceedings of the National Academy of Sciences, 2003, pp. 5591–5596.

[6] K. Weinberger and L. Saul, “Unsupervised learning of image manifolds by semidefinite programming,” International Journal of Computer Vision, vol. 70, pp. 77–90, 2006.

[7] Z. Zhang and H. Zha, “Principal manifolds and nonlinear dimensionality reduction by local tangent space alignment,” SIAM Journal on Scientific Computing, vol. 26, pp. 313–338, 2004.

[8] J. Shawe-Taylor and S. Sun, “A review of optimization methodologies in support vector machines,” Neurocomputing, vol. 74, pp. 3609–3618, 2011.

[9] V.N. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.

[10] S. Sun, R. Huang, and Y. Gao, “Network-scale traffic modeling and forecasting with graphical lasso and neural networks,” Journal of Transportation Engineering, vol. 138, pp. 1358–1367, 2012.

[11] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 58, pp. 267–288, 1996.