Building Deep Networks on Grassmann Manifolds
Zhiwu Huang†, Jiqing Wu†, Luc Van Gool†‡
†Computer Vision Lab, ETH Zurich, Switzerland    ‡VISICS, KU Leuven, Belgium
{zhiwu.huang, jiqing.wu, [email protected]}

Abstract

Learning representations on Grassmann manifolds is popular in quite a few visual recognition tasks. In order to enable deep learning on Grassmann manifolds, this paper proposes a deep network architecture by generalizing the Euclidean network paradigm to Grassmann manifolds. In particular, we design full rank mapping layers to transform input Grassmannian data to more desirable ones, exploit re-orthonormalization layers to normalize the resulting matrices, study projection pooling layers to reduce the model complexity in the Grassmannian context, and devise projection mapping layers to respect the Grassmannian geometry and meanwhile achieve Euclidean forms for regular output layers. To train the Grassmann networks, we exploit a stochastic gradient descent setting on manifolds of the connection weights, and study a matrix generalization of backpropagation to update the structured data. The evaluations on three visual recognition tasks show that our Grassmann networks have clear advantages over existing Grassmann learning methods, and achieve results comparable with state-of-the-art approaches.

Introduction

This paper introduces a deep network architecture on Grassmannians, which are manifolds of linear subspaces. In many computer vision applications, linear subspaces have become a core representation. For example, for face verification (Huang et al. 2015b), emotion estimation (Liu et al. 2014b) and activity recognition (Cherian et al. 2017), the image sets of a single person are often modeled by low-dimensional subspaces that are then compared on Grassmannians. Besides, for video classification, it is also very common to use the autoregressive and moving average (ARMA) model (Vemulapalli, Pillai, and Chellappa 2013), whose parameters are known to be modeled with a high-dimensional linear subspace. For shape analysis, the widely used affine and linear shape spaces for specific configurations can also be identified with points on the Grassmann manifold (Anirudh et al. 2017). Applications involving dynamic environments and autonomous agents often perform online visual learning by using subspace tracking techniques such as incremental principal component analysis (PCA) to dynamically learn a better representational model as the appearance of the moving target changes (Turaga et al. 2011).

The popular applications of Grassmannian data motivate us to build a deep neural network architecture for Grassmannian representation learning. To this end, the new network architecture is designed to accept Grassmannian data directly as input and to learn new, favorable Grassmannian representations that improve the final visual recognition tasks. In other words, the new network aims to deeply learn Grassmannian features on their underlying Riemannian manifolds in an end-to-end learning architecture. In summary, two main contributions are made by this paper:

• We explore a novel deep network architecture in the context of Grassmann manifolds, where it has not been possible to apply deep neural networks. More generally, treating Grassmannian data in deep networks can be very valuable in a variety of machine learning applications.

• We generalize backpropagation to train the proposed network by deriving a connection weight update rule on a specific Riemannian manifold. Furthermore, we incorporate QR decomposition into backpropagation, which might prove very useful in other applications since QR decomposition is a very common linear algebra operator.

Background

Grassmannian Geometry

A Grassmann manifold Gr(q, D) is a q(D − q)-dimensional compact Riemannian manifold, which is the set of q-dimensional linear subspaces of R^D. Thus, each point on Gr(q, D) is a linear subspace that is spanned by a related orthonormal basis matrix X of size D × q such that X^T X = I_q, where I_q is the identity matrix of size q × q.

One of the most popular approaches to represent linear subspaces and approximate the true Grassmannian geodesic distance is the projection mapping framework Φ(X) = XX^T proposed by (Edelman, Arias, and Smith 1998). As the projection Φ(X) is a D × D symmetric matrix, a natural choice of inner product is ⟨X_1, X_2⟩_Φ = tr(Φ(X_1)^T Φ(X_2)). The inner product induces a distance metric named the projection metric:

    d_p(X_1, X_2) = 2^(−1/2) ‖X_1 X_1^T − X_2 X_2^T‖_F,    (1)

where ‖·‖_F indicates the matrix Frobenius norm. As proved in (Harandi et al. 2013), the projection metric can approximate the true geodesic distance up to a scale of √2.
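For concreteness, the following NumPy sketch (ours, not part of the paper) evaluates the projection metric of Eq. (1) for two subspaces given by orthonormal basis matrices; the random bases, obtained via a thin QR decomposition, are purely illustrative.

```python
import numpy as np

def projection_metric(X1, X2):
    """Projection metric d_p of Eq. (1) between the subspaces spanned by
    the D x q orthonormal basis matrices X1 and X2."""
    P1 = X1 @ X1.T                     # projection map Phi(X1) = X1 X1^T
    P2 = X2 @ X2.T
    return np.linalg.norm(P1 - P2, 'fro') / np.sqrt(2.0)

# Two random q-dimensional subspaces of R^D
D, q = 10, 3
rng = np.random.default_rng(0)
X1, _ = np.linalg.qr(rng.standard_normal((D, q)))   # orthonormal basis
X2, _ = np.linalg.qr(rng.standard_normal((D, q)))
print(projection_metric(X1, X2))   # invariant to the choice of basis within each subspace
```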
Grassmann Learning

To perform discriminant learning on Grassmann manifolds, many works (Hamm and Lee 2008; Hamm and Lee 2009; Cetingul and Vidal 2009; Harandi et al. 2011; Harandi et al. 2013; Harandi et al. 2014) either adopt a tangent space approximation of the underlying manifolds, or exploit positive definite kernel functions to embed the manifolds into reproducing kernel Hilbert spaces. In both cases, any existing Euclidean technique can then be applied to the embedded data, since Hilbert spaces respect Euclidean geometry as well. For example, (Hamm and Lee 2008) first embeds the Grassmannian into a high-dimensional Hilbert space, and then applies traditional Fisher analysis methods. Obviously, most of these methods are limited to Mercer kernels, and hence are restricted to kernel-based classifiers. Moreover, their computational complexity increases steeply with the growing number of training samples.

More recently, a new learning scheme was proposed by (Huang et al. 2015b) to perform a geometry-aware dimensionality reduction from the original Grassmann manifold to another lower-dimensional, more discriminative Grassmann manifold. This better preserves the original Riemannian data structure, which commonly leads to more favorable classification performance, as studied in classical manifold learning. While (Huang et al. 2015b) has achieved some success, it merely adopts a shallow learning scheme on Grassmann manifolds, which is still far away from the best solution for the problem of representation learning on non-linear manifolds. Accordingly, this paper attempts to open up the possibility of deep learning on Grassmannians.

Manifold Network

By leveraging the paradigm of traditional neural networks, an increasing number of networks (Masci et al. 2015; Ionescu, Vantzos, and Sminchisescu 2015; Huang and Van Gool 2017) have been built over general manifold domains. For instance, (Masci et al. 2015) proposed a 'geodesic convolution' on local geodesic coordinate systems to extract local patches on the shape manifold for shape analysis. In particular, the method implements convolutions by sliding a window over the shape manifold, and local geodesic coordinates are employed instead of image patches. In (Huang and Van Gool 2017), to deeply learn appropriate features on the manifolds of symmetric positive definite (SPD) matrices, a deep network structure was developed with some spectral layers, which can be trained by a variant of backpropagation. Nevertheless, to the best of our knowledge, this is the first work that studies a deep network architecture in the context of Grassmann manifolds.

GrNet Architecture

Inspired by the geometry-aware learning idea of (Huang et al. 2015b), the FRMap layers are proposed to first perform transformations on the input orthonormal matrices of subspaces, generating new matrices by adopting a full rank mapping scheme. Then, the ReOrth layers are developed to normalize the output matrices of the FRMap layers so that they keep the basic orthogonality. In other words, the normalized data become orthonormal matrices that reside on a Stiefel manifold. As well studied in (Edelman, Arias, and Smith 1998; Hamm and Lee 2008), the projection metric operating on orthonormal matrices can represent linear subspaces and respect the geometry of the Grassmann manifold, which is actually a quotient manifold of the Stiefel manifold. Accordingly, we develop projection mapping (ProjMap) layers to maintain the Grassmannian property of the resulting data. Meanwhile, the ProjMap layers are able to transfer the resulting Grassmannian data into Euclidean data, which enables regular Euclidean layers such as softmax layers for classification. The ProjMap and softmax layers form the Output block of GrNet. Additionally, since traditional pooling layers can reduce the network complexity, we also study projection pooling (ProjPooling) layers on the projection matrix form of the resulting orthonormal matrices. As it is non-trivial to perform pooling on non-Euclidean data directly, we develop a Pooling block that combines ProjMap, ProjPooling and orthonormal mapping (OrthMap) layers, which respectively achieve a Euclidean representation, perform pooling on the resulting Euclidean data, and transform the results back to orthonormal data. The proposed GrNet structure is illustrated in Fig. 1.
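As a rough illustration of the Output block described above, the following NumPy sketch (ours, with an assumed fully connected softmax classifier on the vectorized projection matrix; the paper's exact classifier configuration is not given in this excerpt) shows how a ProjMap layer turns the orthonormal output of the last ReOrth layer into Euclidean data that a regular classification layer can consume.

```python
import numpy as np

def projmap(X):
    """ProjMap layer: map an orthonormal basis X (d x q) to its
    projection matrix Phi(X) = X X^T, a Euclidean representation."""
    return X @ X.T

def softmax(z):
    z = z - z.max()                    # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical output block: ProjMap + vectorization + softmax classifier.
d, q, n_classes = 16, 4, 3
rng = np.random.default_rng(0)
X, _ = np.linalg.qr(rng.standard_normal((d, q)))   # output of the last ReOrth layer
P = projmap(X)                                     # d x d symmetric matrix
W_fc = rng.standard_normal((n_classes, d * d))     # assumed classifier weights
probs = softmax(W_fc @ P.reshape(-1))
print(probs, probs.sum())                          # class probabilities summing to 1
```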
FRMap Layer

To learn compact and discriminative Grassmannian representations for better classification, we design the FRMap layers to first transform the input orthonormal matrices of subspaces into new matrices by a linear mapping function f_fr as

    X_k = f_fr^(k)(X_{k−1}; W_k) = W_k X_{k−1},    (2)

where X_{k−1} ∈ Gr(q, d_{k−1}) is the input of the k-th layer, W_k ∈ R_*^(d_k × d_{k−1}) (with d_k < d_{k−1}) is the transformation matrix (connection weights) that is basically required to be a row full rank matrix, and X_k ∈ R^(d_k × q) is the resulting matrix. Generally, the transformation result W_k X_{k−1} is not an orthonormal basis matrix. To tackle this problem, we exploit a normalization strategy based on QR decomposition in the following ReOrth layer. In addition, as in classical deep networks, multiple projections {W_k^1, ..., W_k^m} per FRMap layer can be applied to each input orthonormal matrix as well, where m is the number of transformation matrices.
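The following NumPy sketch (ours; toy dimensions and random matrices stand in for the learned connection weights) illustrates Eq. (2) with m transformation matrices per FRMap layer, followed by the QR-based normalization that the subsequent ReOrth layer performs to restore orthonormality.

```python
import numpy as np

def frmap(X_prev, W_list):
    """FRMap layer (Eq. 2): apply each of the m row full rank transformations
    W_k^i (d_k x d_{k-1}) to the input orthonormal matrix X_{k-1} (d_{k-1} x q)."""
    return [W @ X_prev for W in W_list]

def reorth(X):
    """ReOrth layer: re-orthonormalize a d_k x q matrix via the thin QR
    decomposition X = Q R, keeping the orthonormal factor Q."""
    Q, _ = np.linalg.qr(X)
    return Q

d_prev, d_k, q, m = 20, 10, 4, 2
rng = np.random.default_rng(0)
X_prev, _ = np.linalg.qr(rng.standard_normal((d_prev, q)))        # input subspace basis
W_list = [rng.standard_normal((d_k, d_prev)) for _ in range(m)]   # random matrices are row full rank almost surely
for X in frmap(X_prev, W_list):
    Y = reorth(X)                            # W X_prev is generally not orthonormal
    print(np.allclose(Y.T @ Y, np.eye(q)))   # True: orthonormality restored
```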