MoNet: Moments Embedding Network

Mengran Gou¹, Fei Xiong², Octavia Camps¹, Mario Sznaier¹
¹Electrical and Computer Engineering, Northeastern University, Boston, MA, US
²Information Sciences Institute, USC, CA, US
{mengran, camps, msznaier}@coe.neu.edu, [email protected]

Abstract

Bilinear pooling has recently been proposed as a feature encoding layer, which can be used after the convolutional layers of a deep network to improve performance in multiple vision tasks. Instead of a conventional global average pooling or fully connected layer, bilinear pooling gathers 2nd order information in a translation-invariant fashion. However, a serious drawback of this family of pooling layers is their dimensionality explosion. Approximate pooling methods with a compact property have been explored towards resolving this weakness. Additionally, recent results have shown that significant performance gains can be achieved by using matrix normalization to regularize unstable higher order information. However, combining compact pooling with matrix normalization has not been explored until now. In this paper, we unify the bilinear pooling layer and the global Gaussian embedding layer through the empirical moment matrix. In addition, with a proposed novel sub-matrix square-root layer, one can normalize the output of the convolution layer directly and mitigate the dimensionality problem with off-the-shelf compact pooling methods. Our experiments on three widely used fine-grained classification datasets illustrate that our proposed architecture, MoNet, can achieve similar or better performance than G2DeNet. When combined with a compact pooling technique, it obtains comparable performance with an encoded feature of 96% fewer dimensions.

Table 1. Comparison of 2nd order statistic information based neural networks. Bilinear CNN (BCNN) only has 2nd order information and does not use matrix normalization. Both improved BCNN (iBCNN) and G2DeNet take advantage of matrix normalization but suffer from large dimensionality because they use the square-root of a large pooled matrix. Our proposed MoNet, with the help of a novel sub-matrix square-root layer, can normalize the local features directly and reduce the final representation dimension significantly by substituting the fully bilinear pooling with compact pooling.

                 1st order moment   Matrix normalization   Compact capacity
  BCNN [22, 9]   ✗                  ✗                      ✓
  iBCNN [21]     ✗                  ✓                      ✗
  G2DeNet [33]   ✓                  ✓                      ✗
  MoNet          ✓                  ✓                      ✓

1. Introduction

Embedding local representations of an image to form a feature that is representative yet invariant to nuisance noise is a key step in many computer vision tasks. Before the phenomenal success of deep convolutional neural networks (CNNs) [18], researchers tackled this problem with hand-crafted consecutive independent steps. Remarkable works include HOG [6], SIFT [24], the covariance descriptor [30], VLAD [14], the Fisher vector [26] and bilinear pooling [3]. Although CNNs are trained end to end, they can also be viewed as two parts, where the convolutional layers are the feature extraction step and the later fully connected (FC) layers are the encoding step. Several works have explored substituting the FC layers with conventional embedding methods, both in a two-stage fashion [4, 11] and in an end-to-end trainable way [22, 13].

Bilinear CNN was first proposed by Lin et al. [22] to pool the second order information across spatial locations. Bilinear pooling has proven to be successful in many tasks, including fine-grained image classification [16, 9], large-scale image recognition [20], segmentation [13], visual question answering [8, 34], face recognition [28] and artistic style reconstruction [10]. Wang et al. [33] proposed to also include the 1st order information by using a Gaussian embedding. It has also been shown that the normalization method is critical to the performance of these CNNs.
Two normalization methods have been proposed for the bilinear pooled matrix $M = \frac{1}{n}X^\top X$, where $X \in \mathbb{R}^{n \times C}$ represents the local features.

[Figure 1: CNN local features (W×H×C) → homogeneous mapping (W×H×(C+1)) → sub-matrix square-root layer (matrix normalization) → either fully bilinear pooling to a (C+1)×(C+1) matrix or compact pooling to a D-dimensional vector → element-wise signed square-root and ℓ2 normalization (element normalization) → k-way classifier.]
Figure 1. Architecture of the proposed moments-based network MoNet. With the proposed sub-matrix square-root layer, it is possible to perform matrix normalization before bilinear pooling, or to further apply compact pooling to reduce the dimensionality dramatically without undermining performance.

On one hand, because M is Symmetric Positive Definite (SPD), Ionescu et al. [13] proposed to apply the matrix logarithm to map SPD matrices from the Riemannian manifold to a Euclidean space following $\log(M) = U_M \log(S_M) U_M^\top$, with $M = U_M S_M U_M^\top$. On the other hand, Wang et al. [33, 21] proposed the matrix power to scale M non-linearly with $M^p = U_M S_M^p U_M^\top$. In both works, the matrix power was shown to have better performance and numerical stability than the matrix logarithm. In addition, Li et al. [20] provided theoretical support for the superior performance of matrix-power normalization in solving a general large-scale image recognition problem. Therefore, we propose to also integrate matrix-power normalization into our MoNet architecture.

A critical weakness of the above feature encodings is the extremely high dimensionality of the encoded features. Due to the tensor product¹, the final feature dimension is C², where C is the number of feature channels of the last convolution layer. Even for a relatively low C = 512, as in VGG-16 [29], the dimensionality of the final feature is already more than 260K. This problem can be alleviated by using random projections [9], tensor sketching [9, 5], and the low-rank property [16]. However, because the matrix-power normalization layer is applied on the pooled matrix M, it is difficult to combine matrix normalization and compact pooling to achieve better performance and reduce the final feature dimensions at the same time.

In this paper, we re-write the formulation of G2DeNet using the tensor product of homogeneously padded local features to align it with the architecture of BCNN, so that the Gaussian embedding operation and the bilinear pooling are decoupled. Instead of working on the bilinear pooled matrix M, we derive a sub-matrix square-root layer to perform the matrix-power normalization directly on the (in-)homogeneous local features. With the help of this novel layer, we can take advantage of compact pooling to approximate the tensor product, but with many fewer dimensions.

The main contributions of this work are three-fold:

• We unify G2DeNet and the bilinear pooling CNN using the empirical moment matrix and decouple the Gaussian embedding from the bilinear pooling.

• We propose a new sub-matrix square-root layer to directly normalize the features before the bilinear pooling layer, which makes it possible to reduce the dimensionality of the representation using compact pooling.

• We derive the gradient of the proposed layer using matrix back-propagation, so that the whole proposed moments-based network "MoNet" can be optimized jointly.

¹We show that the Gaussian embedding can be written as a tensor product in Sec. 3.1.

2. Related work

Lasserre et al. [19] proposed to use the empirical moment matrix formed by an explicit in-homogeneous polynomial kernel basis for outlier detection. In [12], the empirical moment matrix was applied as a feature embedding method for the person re-identification problem, and it was shown that the Gaussian embedding [23] is a special case when the moment matrix order equals 1. However, both of these works focus on a conventional pipeline and did not bring it to modern CNN architectures.

Ionescu et al. [13] introduced the theory and practice of matrix back-propagation for training CNNs, which enables structured matrix operations in deep neural network training.
Both [21] and [33] used it to derive the back-propagation of the matrix square-root and matrix logarithm for a symmetric matrix. Li et al. [20] applied a generalized p-th order matrix power normalization instead of the square-root. In our case, since we want to apply the matrix normalization directly on a non-square local feature matrix, we cannot plug in the equations from previous works directly.

Low-dimensional compact approximations of bilinear pooling have been explored recently. Gao et al. [9] bridged bilinear pooling and a linear classifier with a second order polynomial kernel, and adopted the off-the-shelf kernel approximation methods Random Maclaurin [15] and Tensor Sketch [27] to pool the local features in a compact way. Cui et al. [5] generalized this approach to higher order polynomials with Tensor Sketch. By combining with a bilinear SVM, Kong et al. [16] proposed to impose a low-rank constraint to reduce the number of parameters. However, none of these approaches can be easily integrated with matrix normalization, because of the absence of a bilinear pooled matrix.

3. MoNet

The overall architecture of the proposed MoNet network is shown in Fig. 1. In this section, we detail the design of each block.

For an input image I, the output X of the last convolution layer after the ReLU consists of local features $x_i$ across spatial locations i = 1, 2, ..., n. We first map them to homogeneous coordinates by padding an extra dimension with 1 and dividing all elements by $\sqrt{n}$. After that, the proposed sub-matrix square-root normalization is applied. Finally, a (compact) bilinear pooling layer pools all n features across all spatial locations, followed by element-wise square-root regularization and ℓ2 normalization before the final fully-connected layer.

3.1. Homogeneous mapping layer

Assume $X \in \mathbb{R}^{n \times C}$, corresponding to n features of dimension C with n > C. The homogeneous mapping of X is nothing but padding an extra dimension of 1s. For the simplicity of the following layers, instead of applying the conventional bilinear pooling layer as in [22], we also divide the homogeneous feature by the square root of the number of samples. Therefore, the forward equation of the homogeneous mapping layer is

$\tilde{X} = \frac{1}{\sqrt{n}}[\mathbf{1} \mid X] \in \mathbb{R}^{n \times (C+1)}$   (1)

The tensor product of $\tilde{X}$ can be written as

$M = \tilde{X}^\top \tilde{X} = \begin{bmatrix} 1 & \mu \\ \mu^\top & \frac{1}{n}X^\top X \end{bmatrix}$   (2)

where $\mu = \frac{1}{n}\mathbf{1}^\top X$. Because $\frac{1}{n}X^\top X = \Sigma + \mu^\top \mu$, where $\Sigma$ denotes the sample covariance of the local features, Eq. 2 is the Gaussian embedding method used in G2DeNet [33]. One can also show that the conventional bilinear pooling layer is equal to the tensor product of the in-homogeneous feature matrix.

3.2. Sub-matrix square-root layer

3.2.1 Forward propagation

Since both [33] and [21] showed with extensive experiments that matrix square-root normalization is better than the matrix logarithm in terms of performance and training stability, we also choose the matrix square-root normalization of M:

$\tilde{M} = U_M S_M^{1/2} U_M^\top$   (3)

where $M = U_M S_M U_M^\top$. Since

$M = \tilde{X}^\top \tilde{X} = V S^\top U^\top U S V^\top$   (4)

where $\tilde{X} = U S V^\top$ is the SVD of $\tilde{X}$, and given that $U^\top U = I$ and $S^\top S$ is a square matrix, we can re-write Eq. 3 w.r.t. $\tilde{X}$ as

$\tilde{M} = V (S^\top S)^{1/2} V^\top$   (5)

Since $\tilde{X} \in \mathbb{R}^{n \times (C+1)}$ with n > C + 1, we can decompose S as

$S = A\tilde{S}, \quad A = [I \mid 0]^\top$   (6)

where $\tilde{S} \in \mathbb{R}^{(C+1) \times (C+1)}$ is a square diagonal matrix. Since we want the output Y of this layer to satisfy $\tilde{M} = Y^\top Y$, we can set $Y = A\tilde{S}^{1/2}V^\top$. Because for most modern CNNs n cannot be much greater than C, and the features after the ReLU tend to be sparse, $\tilde{X}$ can easily become rank deficient. Therefore, we only use the non-zero singular values and the corresponding singular vectors. The forward equation of the sub-matrix square-root layer can then be written as

$Y = A_{:,1:e}\tilde{S}_{1:e}^{0.5}V_{:,1:e}^\top$   (7)

where e is the index of the smallest singular value greater than a small threshold ε.²

²We omit the subscript 1:e in the following for concise notation.
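To make Eqs. 1-2 and 7 concrete, the following is a minimal NumPy/SciPy sketch of the homogeneous mapping and the sub-matrix square-root forward pass on random toy features. The sizes and threshold are illustrative assumptions, and this is an illustration of the math only, not the authors' MatConvNet implementation.

```python
# Toy check: the tensor product of the homogeneous features equals the Gaussian
# embedding (Eq. 2), and Y from the truncated SVD satisfies Y^T Y = M^{1/2} (Eq. 3, 7).
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(0)
n, C = 784, 64            # e.g. 28x28 spatial locations; C kept small here for speed
eps = 1e-5                # singular-value threshold, as in Sec. 4.1.3

X = np.maximum(rng.standard_normal((n, C)), 0)           # ReLU-like local features
X_tilde = np.hstack([np.ones((n, 1)), X]) / np.sqrt(n)   # Eq. 1: homogeneous mapping

# Eq. 2: the tensor product of X_tilde is the Gaussian embedding [1, mu; mu^T, X^T X / n]
M = X_tilde.T @ X_tilde
mu = X.mean(axis=0)
assert np.allclose(M[0, 0], 1.0) and np.allclose(M[0, 1:], mu)
assert np.allclose(M[1:, 1:], X.T @ X / n)

# Eq. 7: sub-matrix square-root via the thin SVD of X_tilde
U, s, Vt = np.linalg.svd(X_tilde, full_matrices=False)
e = int(np.sum(s > eps))                    # keep singular values above the threshold
Y = np.diag(np.sqrt(s[:e])) @ Vt[:e, :]     # the all-zero rows appended by A in Eq. 7
                                            # are dropped here; Y^T Y is unchanged

# Y^T Y reproduces the matrix square-root normalization of M (Eq. 3)
print(np.allclose(Y.T @ Y, np.real(sqrtm(M)), atol=1e-6))   # expected: True
```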
3.2.2 Backward propagation

We follow the matrix back-propagation techniques proposed by Ionescu et al. [13] to derive the back-propagation path of the sub-matrix square-root layer. Let $\tilde{X} = U S V^\top$ with $U \in \mathbb{R}^{n \times n}$. We can write U in block form as $U = [U_1 \mid U_2]$, with $U_1 \in \mathbb{R}^{n \times (C+1)}$ and $U_2 \in \mathbb{R}^{n \times (n-C-1)}$. The partial derivative of a given scalar loss L with respect to $\tilde{X}$ is

$\frac{\partial L}{\partial \tilde{X}} = D V^\top + U\left(\frac{\partial L}{\partial S} - U^\top D\right)_{\mathrm{diag}} V^\top + 2 U S \left(K^\top \circ \left(V^\top\left(\frac{\partial L}{\partial V} - V D^\top U S\right)\right)\right)_{\mathrm{sym}} V^\top$   (8)

where $\circ$ denotes the element-wise product, $(Q)_{\mathrm{sym}} = \frac{1}{2}(Q^\top + Q)$,

$D = \frac{\partial L}{\partial U_1}\tilde{S}^{-1} - U_2 \frac{\partial L}{\partial U_2}^\top U_1 \tilde{S}^{-1}$   (9)

and

$K_{ij} = \begin{cases} \frac{1}{s_i^2 - s_j^2} & i \neq j \\ 0 & i = j \end{cases}$   (10)

From Eq. 7, we can compute the variation of Y as

$dY = \frac{1}{2} A \tilde{S}^{-\frac{1}{2}} d\tilde{S} V^\top + A \tilde{S}^{\frac{1}{2}} dV^\top$   (11)

Based on the chain rule, the total variation can be written as

$\frac{\partial L}{\partial Y} : dY = \frac{\partial L}{\partial Y} : \frac{1}{2} A \tilde{S}^{-\frac{1}{2}} d\tilde{S} V^\top + \frac{\partial L}{\partial Y} : A \tilde{S}^{\frac{1}{2}} dV^\top$   (12)

where : denotes the inner product. After re-arrangement using the rotation properties of the inner product, we can re-write the above equation as

$\frac{\partial L}{\partial Y} : dY = \frac{1}{2} \tilde{S}^{-\frac{1}{2}} A^\top \frac{\partial L}{\partial Y} V : d\tilde{S} + \tilde{S}^{\frac{1}{2}} A^\top \frac{\partial L}{\partial Y} : dV^\top$   (13)

Therefore, we have

$\frac{\partial L}{\partial S} = A \frac{\partial L}{\partial \tilde{S}} = \frac{1}{2} A \tilde{S}^{-\frac{1}{2}} A^\top \frac{\partial L}{\partial Y} V$   (14)

$\frac{\partial L}{\partial V} = \left(\frac{\partial L}{\partial V^\top}\right)^\top = \left(\frac{\partial L}{\partial Y}\right)^\top A \tilde{S}^{\frac{1}{2}}$   (15)

Finally, substituting Eq. 14 and Eq. 15 into Eq. 8 and considering $\frac{\partial L}{\partial U} = 0$, we have

$\frac{\partial L}{\partial \tilde{X}} = U\left[\frac{1}{2} A \tilde{S}^{-\frac{1}{2}} A^\top \frac{\partial L}{\partial Y} V + 2 S \left(K^\top \circ \left(V^\top \left(\frac{\partial L}{\partial Y}\right)^\top A \tilde{S}^{\frac{1}{2}}\right)\right)_{\mathrm{sym}}\right] V^\top$   (16)

3.3. Compact pooling

Following the work in [9, 8], we adopt the Tensor Sketch (TS) method to approximate bilinear pooling, followed by a linear classifier, because it has better performance and needs less computational time and space. Building on count sketch and FFT, one can generate a tensor sketch function such that $\langle TS_1(x), TS_2(y) \rangle \approx \langle x, y \rangle^2$, as summarized in Algorithm 1; a NumPy sketch of this approximation is given after Tab. 2. The back-propagation of the TS layer is given in [9].

Algorithm 1 Tensor Sketch approximation pooling
Require: x, projected dimension D
1: Generate two randomly picked but fixed pairs of hash functions $h_t$ and $s_t$, t = 1, 2, where each entry $h_t(i)$ is uniformly drawn from {1, 2, ..., D} and each $s_t(i)$ from {−1, +1}
2: Define $\Psi(x, h_t, s_t) = [\psi_1(x), \psi_2(x), \dots, \psi_D(x)]^\top$, where $\psi_j(x) = \sum_{i: h_t(i)=j} s_t(i)\, x_i$
3: Define $TS(x) = \mathrm{FFT}^{-1}\big(\mathrm{FFT}(\Psi(x, h_1, s_1)) \circ \mathrm{FFT}(\Psi(x, h_2, s_2))\big)$, where $\circ$ denotes element-wise multiplication

As shown in Tab. 2, with the techniques mentioned above, the proposed MoNet is able to solve the problem with much lower computation and memory complexity than the other BCNN based algorithms.

4. Experiments

In line with other bilinear CNN based papers, we evaluate the proposed MoNet on three widely used fine-grained classification datasets. The experimental setup and the algorithm implementation are described in detail in Secs. 4.1.1-4.1.3. The experimental results on fine-grained classification are then presented and analyzed in Sec. 4.1.4.

4.1. Experimental setup

We evaluated MoNet on three widely used fine-grained classification datasets. Different from general object recognition tasks, fine-grained classification usually tries to distinguish objects at the sub-category level, such as different makes of cars or different species of birds. The main challenge of this task is the combination of relatively small inter-class variation and relatively large intra-class variation.

In all experiments, the 13 convolutional layers of VGG-16 [29] are used as the local feature extractor, and their outputs are used as local appearance representations. These 13 convolution layers are pre-trained on ImageNet [7] and fine-tuned in our experiments on the three fine-grained classification datasets.

Figure 2. Sample images from the fine-grained classification datasets. From left to right, each column corresponds to CUB, Aircraft and Cars, respectively.

Table 2. Dimension, computation and memory information for the different network architectures compared in this paper. H, W and C represent the height, width and number of feature channels of the output of the final convolution layer, respectively. k and D denote the number of classes and the projected dimension for Tensor Sketch, respectively. Numbers inside brackets indicate typical values when the corresponding network is evaluated with the VGG-16 model [29] on a classification task with 1,000 classes. In this case, H = W = 13, C = 512, k = 1000, D = 10000 and all data is stored with single precision.

                      BCNN [22]       iBCNN [21]      iBCNN TS              G2DeNet [33]        MoNet               MoNet TS
  Dimension           C^2 [262K]      C^2 [262K]      D [10K]               (C+1)^2 [263K]      (C+1)^2 [263K]      D [10K]
  Parameter memory    0               0               2C                    0                   0                   2C
  Computation         O(HWC^2)        O(HWC^2)        O(HW(C + D log D))    O(HWC^2)            O(HWC^2)            O(HW(C + D log D))
  Classifier memory   kC^2 [1000MB]   kC^2 [1000MB]   kD [40MB]             k(C+1)^2 [1004MB]   k(C+1)^2 [1004MB]   kD [40MB]
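Below is a minimal NumPy sketch of Algorithm 1, intended only to illustrate the count-sketch/FFT construction and the approximation property $\langle TS(x), TS(y) \rangle \approx \langle x, y \rangle^2$. The sizes are toy assumptions, and this is not the TS layer [9] used in the experiments; in the network, the TS features of all local descriptors are summed (pooled) over spatial locations.

```python
# Toy Tensor Sketch: count sketch with two hash pairs, combined in the FFT domain.
import numpy as np

rng = np.random.default_rng(0)
c, D = 513, 10000                       # input dimension (C+1) and projected dimension D

# Step 1: two fixed hash pairs; h_t maps each input index to a bin (0-based here),
# s_t assigns it a random sign.
h = [rng.integers(0, D, size=c) for _ in range(2)]
s = [rng.choice([-1.0, 1.0], size=c) for _ in range(2)]

def count_sketch(x, h_t, s_t):
    """Step 2: Psi(x, h_t, s_t) scatters s_t(i) * x_i into bin h_t(i)."""
    psi = np.zeros(D)
    np.add.at(psi, h_t, s_t * x)
    return psi

def tensor_sketch(x):
    """Step 3: TS(x) = IFFT(FFT(Psi_1) * FFT(Psi_2)), element-wise product."""
    f1 = np.fft.fft(count_sketch(x, h[0], s[0]))
    f2 = np.fft.fft(count_sketch(x, h[1], s[1]))
    return np.fft.ifft(f1 * f2).real

x = rng.standard_normal(c)
y = x + 0.1 * rng.standard_normal(c)    # correlated inputs, so <x, y>^2 is large
# Unbiased estimate of <x, y>^2 with variance O(1/D); typically within a few percent here.
print(np.dot(tensor_sketch(x), tensor_sketch(y)), np.dot(x, y) ** 2)
```

With D = 10,000 as in Sec. 4.1.3, the sketch keeps roughly 4% of the (C+1)² dimensions while preserving the second-order inner product in expectation.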

Table 3. Basic statistics of the datasets used for evaluation

  Datasets        # training   # testing   # classes
  CUB [32]        5,994        5,794       200
  Aircraft [25]   6,667        3,333       100
  Cars [17]       8,144        8,041       196

4.1.1 Datasets

Caltech-UCSD Birds (CUB) [32] contains 200 bird species, mostly North American. To be consistent with other works, we use the 2011 extension with a doubled number of samples.

FGVC-Aircraft Benchmark (Aircraft) [25] is a benchmark fine-grained classification dataset of aircraft of various models and manufacturers.

Stanford Cars (Cars) [17] contains images of different classes of cars at the level of make, model and year.

We use the provided train/test splits for all three datasets. Detailed information is given in Tab. 3, and Fig. 2 shows sample images.

4.1.2 Different pooling methods

Bilinear pooling (BCNN): The VGG-16 based BCNN [22] is used as the baseline pooling method; it applies the tensor product to the output of the conv5_3 layer with ReLU activation. The dimension of the final representation is 512 × 512 ≈ 262K and the number of linear classifier parameters is k × 262K, where k is the number of classes. To be fair, the latest results from the authors' project page [1] are compared.

Improved bilinear pooling (iBCNN): In order to combine iBCNN [21] with the compact pooling technique, we re-implement it with the sub-matrix square-root layer followed by the bilinear pooling layer. As explained in Sec. 3.2.1, this is equivalent to the originally proposed bilinear pooling and matrix normalization layers [21]. Note that the originally proposed iBCNN does not use compact pooling and has the same dimensionality as BCNN. The classification performance from both the original paper [21] and our re-implementation is reported.

Global Gaussian distribution embedding (G2DeNet): Instead of fully bilinear pooling, G2DeNet pools the local features with a global Gaussian distribution embedding method, followed by matrix square-root normalization. Since it includes the first order moment information, the dimension of the final feature is slightly greater than that of BCNN and iBCNN. The experimental results with the "w/o BBox" configuration [33] are compared in this paper.

Proposed moment embedding network (MoNet): We implemented the proposed MoNet algorithm with the structure shown in Fig. 1 and fine-tuned the whole network in an end-to-end fashion. When using bilinear pooling, the feature dimensionality, computation and memory complexity are the same as for G2DeNet.

Tensor Sketch compact pooling (TS): When building the network with compact pooling, the TS layer [9] is applied after the sub-matrix square-root layer. The projection dimension D is selected empirically for both iBCNN and MoNet.

4.1.3 Implementation details

Since a large enough number of samples is important to estimate stable and meaningful statistical moment information, the input images are resized to 448 × 448 in all experiments, which produces a 28 × 28 × 512 local feature matrix after conv5_3 for each image. Following common practice [33, 5], we first resize the image with a fixed aspect ratio so that the shorter edge reaches 448, and then apply a center crop to obtain a 448 × 448 image. During training, random horizontal flipping was applied as data augmentation. Different from [21], which uses VGG-M, no augmentation is applied during testing.

To avoid rank deficiency, the singular value threshold ε was set to 10⁻⁵ for both forward and backward propagation, which corresponds to a threshold of 10⁻¹⁰ on the singular values of the tensor product matrix. The projected dimension in Tensor Sketch was fixed to 10⁴, which satisfies C < D ≪ C².

As suggested by [21, 33], all pooling methods are followed by an element-wise sign-kept square-root, y ← sign(y)√|y|, and ℓ2 normalization, y ← y/‖y‖. For the sake of smooth training, the element-wise square-root [33] is also applied to the local appearance features.
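For concreteness, the following is a minimal NumPy sketch of this element-wise signed square-root and ℓ2 normalization step. It is an illustration only, not the MatConvNet layers used in the experiments.

```python
# Sign-kept square-root followed by l2 normalization of the pooled feature.
import numpy as np

def signed_sqrt_l2(y, eps=1e-12):
    """y <- sign(y) * sqrt(|y|), then y <- y / ||y||_2 (eps guards against division by zero)."""
    y = np.sign(y) * np.sqrt(np.abs(y))
    return y / (np.linalg.norm(y) + eps)

z = np.array([4.0, -9.0, 0.25])
print(signed_sqrt_l2(z))   # sign-preserving, magnitude-compressed, unit l2 norm
```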

The weights of the VGG-16 convolutional layers are pre-trained on the ImageNet classification dataset. We first warm-start by fine-tuning the last linear classifier for 300 epochs. Then, we fine-tune the whole network end-to-end with a learning rate of 0.001 and a batch size of 16. The momentum was set to 0.9 and the weight decay to 0.0005. Most experiments converge to a local optimum after 50 epochs.

The proposed MoNet is implemented with MatConvNet [31] and Matlab 2017a. Because of the numerical instability of SVDs, as suggested by Ionescu et al. [13], the sub-matrix square-root layer is implemented on the CPU with double precision. The whole network is fine-tuned on an Ubuntu PC with 64GB RAM and an Nvidia GTX 1080 Ti.

4.1.4 Experimental results

The classification accuracy of each network is presented in a row of Tab. 4. Bilinear and TS denote fully bilinear pooling and Tensor Sketch compact pooling, respectively.

Table 4. Experimental results on fine-grained classification. Bilinear and TS represent fully bilinear pooling and Tensor Sketch compact pooling, respectively. The best performance in each column is marked in red.

                              CUB               Airplane          Car
                              Bilinear   TS     Bilinear   TS     Bilinear   TS
  BCNN [22, 9]                84.0       84.0   86.9       87.2   90.6       90.2
  iBCNN [21]                  85.8       -      88.5       -      92.0       -
  iBCNN (re-implementation)   86.0       85.7   86.7       86.7   90.5       90.3
  G2DeNet [33]                87.1       -      89.0       -      92.5       -
  MoNet                       86.4       85.7   89.3       88.1   91.8       90.8
  Other higher order methods:
    KP [5]                    -          86.2   -          86.9   -          92.6
    HOHC [2]                  85.3              88.3              91.7
  State-of-the-art:
    MA-CNN [35]               86.5              89.9              92.8

Comparison with different architectures: According to [21], matrix normalization consistently improves performance by 1-2% on all three datasets. Our re-implementation achieves slightly better classification accuracy (+0.2%) on the CUB dataset but performs worse on the Airplane and Car datasets. We believe this is caused by the different approaches used to deal with rank deficiency. In our implementation, the singular values are hard-thresholded as shown in Eq. 7, while iBCNN [21] deals with rank deficiency by adding 1 to all singular values, which is very small compared to the maximum singular value (10⁶). By adding the 1st order moment information, G2DeNet outperforms iBCNN by around 1% on all three datasets. By re-writing the Gaussian embedding as a tensor product of homogeneously padded local features, our proposed MoNet obtains similar or slightly better classification accuracy than G2DeNet. Specifically, the classification accuracy of MoNet is 0.3% higher on the Airplane dataset, but 0.7% lower on both the CUB and Car datasets.

Comparison with fully bilinear pooling and compact pooling: As shown in [9], compact pooling can achieve performance similar to BCNN with only 4% of the dimensionality. We see a similar trend for the re-implemented iBCNN network: the classification accuracy difference between the re-implemented iBCNN and its compact pooling version is less than 0.3% on all three datasets. However, the performance gaps are somewhat larger when we compare the different pooling schemes on MoNet. Bilinear pooling improves the classification accuracy over compact pooling by 0.7%, 1.2% and 1% on the CUB, Airplane and Car datasets, respectively. However, with compact pooling, the dimensionality of the final representation is 96% smaller. Although the final fully bilinear pooled representations of iBCNN and MoNet have roughly the same dimension, MoNet utilizes more different order moments, which requires more count sketch projections to approximate. Thus, when fixing D = 10,000 for both iBCNN and MoNet, the performance of MoNet with compact pooling is degraded.

Comparison with other methods: [5] and [2] are two other recent works that also take higher order statistical information into account. Cui et al. [5] applied Tensor Sketch repetitively to approximate up to a 4th order explicit polynomial kernel space in a compact way. They obtained better results on the CUB and Car datasets than the other compact pooling results, but notably worse (by 1.2%) results on the Airplane dataset. This may result from two factors. First, directly utilizing higher order moments without proper normalization leads to numerical instabilities. Second, approximating higher order moments with a limited number of samples is essentially an ill-posed problem.
Cai et al. [2] proposed to only utilize the higher order self-product terms but not the interaction terms, which leads to worse performance on all three datasets. Finally, the state-of-the-art MA-CNN [35] achieves slightly better results on the Airplane and Car datasets.

5. Conclusion

Bilinear pooling, a recently proposed 2nd order moment pooling method, has shown success in several vision tasks. G2DeNet extends it by adding 1st order moment information and matrix normalization. One key limitation of these approaches is the high dimension of the final representation. To resolve this, compact pooling methods have been proposed to approximate bilinear pooling. However, the Gaussian embedding formulation entangles the bilinear pooling, which makes using compact pooling non-trivial. In this paper, we reformulated the Gaussian embedding using the empirical moment matrix and decoupled the bilinear pooling step. With the help of a novel sub-matrix square-root layer, our proposed network MoNet can take advantage of different order moments, matrix normalization, as well as compact pooling. Experiments on three widely used fine-grained classification datasets demonstrate that MoNet can achieve similar or better performance than G2DeNet, and retains comparable results with only 4% of the dimensions when using compact pooling.

References

[1] Bilinear CNNs project. http://vis-www.cs.umass.edu/bcnn/. Accessed: 2017-11-14.
[2] S. Cai, W. Zuo, and L. Zhang. Higher-order integration of hierarchical convolutional activations for fine-grained visual categorization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 511-520, 2017.
[3] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu. Semantic segmentation with second-order pooling. Computer Vision-ECCV 2012, pages 430-443, 2012.
[4] M. Cimpoi, S. Maji, and A. Vedaldi. Deep filter banks for texture recognition and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3828-3836, 2015.
[5] Y. Cui, F. Zhou, J. Wang, X. Liu, Y. Lin, and S. Belongie. Kernel pooling for convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2017.
[6] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition (CVPR 2005), volume 1, pages 886-893. IEEE, 2005.
[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR 2009), pages 248-255. IEEE, 2009.
[8] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847, 2016.
[9] Y. Gao, O. Beijbom, N. Zhang, and T. Darrell. Compact bilinear pooling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 317-326, 2016.
[10] L. A. Gatys, A. S. Ecker, and M. Bethge. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.
[11] Y. Gong, L. Wang, R. Guo, and S. Lazebnik. Multi-scale orderless pooling of deep convolutional activation features. In European Conference on Computer Vision, pages 392-407. Springer, 2014.
[12] M. Gou, O. Camps, M. Sznaier, et al. MoM: Mean of moments feature for person re-identification.
[13] C. Ionescu, O. Vantzos, and C. Sminchisescu. Matrix backpropagation for deep networks with structured layers. In Proceedings of the IEEE International Conference on Computer Vision, pages 2965-2973, 2015.
[14] H. Jégou, M. Douze, C. Schmid, and P. Pérez. Aggregating local descriptors into a compact image representation. In Computer Vision and Pattern Recognition (CVPR 2010), pages 3304-3311. IEEE, 2010.
[15] P. Kar and H. Karnick. Random feature maps for dot product kernels. In International Conference on Artificial Intelligence and Statistics, pages 583-591, 2012.
[16] S. Kong and C. Fowlkes. Low-rank bilinear pooling for fine-grained classification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[17] J. Krause, M. Stark, J. Deng, and L. Fei-Fei. 3D object representations for fine-grained categorization. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 554-561, 2013.
[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.
[19] J.-B. Lasserre and E. Pauwels. Sorting out typicality with the inverse moment matrix SOS polynomial. In Neural Information Processing Systems (NIPS 2016), 2016.
[20] P. Li, J. Xie, Q. Wang, and W. Zuo. Is second-order information helpful for large-scale visual recognition? In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
[21] T.-Y. Lin and S. Maji. Improved bilinear pooling with CNNs. In BMVC, 2017.
[22] T.-Y. Lin, A. RoyChowdhury, and S. Maji. Bilinear CNN models for fine-grained visual recognition. In The IEEE International Conference on Computer Vision (ICCV), December 2015.
[23] M. Lovrić, M. Min-Oo, and E. A. Ruh. Multivariate normal distributions parametrized as a Riemannian symmetric space. Journal of Multivariate Analysis, 74(1):36-48, 2000.
[24] D. G. Lowe. Object recognition from local scale-invariant features. In Proceedings of the Seventh IEEE International Conference on Computer Vision, volume 2, pages 1150-1157. IEEE, 1999.
[25] S. Maji, J. Kannala, E. Rahtu, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. Technical report, 2013.
[26] F. Perronnin, J. Sánchez, and T. Mensink. Improving the Fisher kernel for large-scale image classification. Computer Vision-ECCV 2010, pages 143-156, 2010.
[27] N. Pham and R. Pagh. Fast and scalable polynomial kernels via explicit feature maps. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 239-247. ACM, 2013.
[28] A. RoyChowdhury, T.-Y. Lin, S. Maji, and E. Learned-Miller. Face identification with bilinear CNNs. arXiv preprint arXiv:1506.01342, 2015.
[29] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[30] O. Tuzel, F. Porikli, and P. Meer. Pedestrian detection via classification on Riemannian manifolds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(10):1713-1727, 2008.
[31] A. Vedaldi and K. Lenc. MatConvNet - convolutional neural networks for Matlab. In Proceedings of the ACM International Conference on Multimedia, 2015.
[32] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. Technical report, 2011.
[33] Q. Wang, P. Li, and L. Zhang. G2DeNet: Global Gaussian distribution embedding network and its application to visual recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[34] Z. Yu, J. Yu, J. Fan, and D. Tao. Multi-modal factorized bilinear pooling with co-attention learning for visual question answering. arXiv preprint arXiv:1708.01471, 2017.
[35] H. Zheng, J. Fu, T. Mei, and J. Luo. Learning multi-attention convolutional neural network for fine-grained image recognition. In International Conference on Computer Vision, 2017.