MoNet: Moments Embedding Network∗

Mengran Gou¹  Fei Xiong²  Octavia Camps¹  Mario Sznaier¹
¹Electrical and Computer Engineering, Northeastern University, Boston, MA, US
²Information Sciences Institute, USC, CA, US
{mengran, camps, msznaier}@coe.neu.edu  [email protected]

∗This work was supported in part by NSF grants IIS-1318145, ECCS-1404163, CMMI-1638234 and CNS-1646121; AFOSR grant FA9550-15-1-0392; and the Alert DHS Center of Excellence under Award Number 2013-ST-061-ED0001.

Abstract

Bilinear pooling has been recently proposed as a feature encoding layer, which can be used after the convolutional layers of a deep network, to improve performance in multiple vision tasks. Different from conventional global average pooling or fully connected layers, bilinear pooling gathers 2nd order information in a translation invariant fashion. However, a serious drawback of this family of pooling layers is their dimensionality explosion. Approximate pooling methods with compact properties have been explored towards resolving this weakness. Additionally, recent results have shown that significant performance gains can be achieved by adding 1st order information and applying matrix normalization to regularize unstable higher order information. However, combining compact pooling with matrix normalization and other order information has not been explored until now. In this paper, we unify bilinear pooling and the global Gaussian embedding layers through the empirical moment matrix. In addition, we propose a novel sub-matrix square-root layer, which can be used to normalize the output of the convolution layer directly and mitigate the dimensionality problem with off-the-shelf compact pooling methods. Our experiments on three widely used fine-grained classification datasets illustrate that our proposed architecture, MoNet, can achieve similar or better performance than the state-of-the-art G2DeNet. Furthermore, when combined with compact pooling techniques, MoNet obtains comparable performance with encoded features with 96% fewer dimensions.

Table 1. Comparison of 2nd order statistical information-based neural networks. Bilinear CNN (BCNN) only has 2nd order information and does not use matrix normalization. Both improved BCNN (iBCNN) and G2DeNet take advantage of matrix normalization but suffer from large dimensionality since they use the square-root of a large pooled matrix. Our proposed MoNet, with the help of a novel sub-matrix square-root layer, can normalize the local features directly and reduce the final representation dimension significantly by substituting bilinear pooling with compact pooling.

|              | 1st order moment | Matrix normalization | Compact capacity |
|--------------|------------------|----------------------|------------------|
| BCNN [22, 9] |                  |                      | ✓                |
| iBCNN [21]   |                  | ✓                    |                  |
| G2DeNet [36] | ✓                | ✓                    |                  |
| MoNet        | ✓                | ✓                    | ✓                |

1. Introduction

Embedding local representations of an image to form a feature that is representative yet invariant to nuisance noise is a key step in many computer vision tasks. Before the phenomenal success of deep convolutional neural networks (CNN) [18], researchers tackled this problem with hand-crafted consecutive independent steps. Remarkable works include HOG [6], SIFT [24], covariance descriptors [33], VLAD [14], Fisher vectors [27] and bilinear pooling [3]. Although CNNs are trained end to end, they can also be viewed as two parts, where the convolutional layers are feature extraction steps and the later fully connected (FC) layers are an encoding step. Several works have explored substituting the FC layers with conventional embedding methods, both in a two-stage fashion [4, 11] and in an end-to-end trainable way [22, 13].

Bilinear CNN (BCNN) was first proposed by Lin et al. [22] to pool the second order information across the spatial locations. Bilinear pooling has been proven to be successful in many tasks, including fine-grained image classification [16, 9], large-scale image recognition [20], segmentation [13], visual question answering [8, 37], face recognition [29] and artistic style reconstruction [10].

[Figure 1: block diagram of MoNet. CNN features (W×H×C) → homogeneous mapping (W×H×(C+1)) → sub-matrix square-root (W×H×(C+1)) → matrix normalization → bilinear pooling ((C+1)×(C+1)) or compact pooling (D) → element normalization (sgnsqrt, ℓ2 norm) → classifier (k classes)]
Figure 1. Architecture of the proposed moments-based network MoNet. With the proposed sub-matrix square-root layer, it is possible to perform matrix normalization before bilinear pooling or further apply compact pooling to reduce the dimensionality dramatically without undermining performance.

Wang et al. [36] proposed to also include the 1st order information by using a Gaussian embedding in G2DeNet. It has been shown that the normalization method is also critical to the performance of these CNNs. Two normalization methods have been proposed for the bilinear pooled matrix M = (1/n) X^T X, where X ∈ R^{n×C} represents the local features. On one hand, because M is Symmetric Positive Definite (SPD), Ionescu et al. [13] proposed to apply the matrix-logarithm to map the SPD matrices from the Riemannian manifold to a Euclidean space, i.e. log(M) = U_M log(S_M) U_M^T with M = U_M S_M U_M^T. On the other hand, [36, 21] proposed the matrix-power to scale M non-linearly with M^p = U_M S_M^p U_M^T. In both works, matrix-power was shown to have better performance and numerical stability than the matrix-logarithm. In addition, Li et al. [20] provided theoretical support for the superior performance of matrix-power normalization in solving a general large-scale image recognition problem.

A critical weakness of the above feature encodings is the extremely high dimensionality of the encoded features. Due to the tensor product¹, the final feature dimension is C², where C is the number of feature channels of the last convolution layer. Even for a relatively low C = 512 as in VGG-16 [30], the dimensionality of the final feature is already more than 262K. This problem can be alleviated by using random projections [9], tensor sketching [9, 5], and the low rank property [16]. However, because the matrix-power normalization layer is applied on the pooled matrix M, it is non-trivial to combine matrix normalization and compact pooling to achieve better performance and reduce the final feature dimensions at the same time.

In this paper, we propose a new architecture, MoNet, that integrates matrix-power normalization with Gaussian embedding. To this effect, we re-write the formulation of G2DeNet using the tensor product of the homogeneous padded local features to align it with the architecture of BCNN, so that the Gaussian embedding operation and bilinear pooling are decoupled. Instead of working on the bilinear pooled matrix M, we derive the sub-matrix square-root layer to perform the matrix-power normalization directly on the (in-)homogeneous local features. With the help of this novel layer, we can take advantage of compact pooling to approximate the tensor product, but with much fewer dimensions.

The main contributions of this work are three-fold:
• We unify G2DeNet and the bilinear pooling CNN using the empirical moment matrix and decouple the Gaussian embedding from bilinear pooling.
• We propose a new sub-matrix square-root layer to directly normalize the features before the bilinear pooling layer, which makes it possible to reduce the dimensionality of the representation using compact pooling.
• We derive the gradient of the proposed layer using matrix back propagation, so that the whole proposed moments embedding network "MoNet" architecture can be optimized jointly.

¹We show that the Gaussian embedding can be written as a tensor product in Sec. 3.1. In the following sections, we will use tensor product and bilinear pooling interchangeably.

2. Related work

Bilinear pooling was proposed by Tenenbaum et al. [32] to model two-factor structure in images to separate style from content. Lin et al. [22] introduced it into a convolutional neural network as a pooling layer and improved it further by adding matrix power normalization in their recent work [21]. Wang et al. [36] proposed G2DeNet with Gaussian embedding, followed by matrix normalization, to incorporate 1st order moment information and achieved state-of-the-art performance.

In a parallel research track, low-dimension compact approximations of bilinear pooling have also been explored. Gao et al. [9] bridged bilinear pooling with a linear classifier with a second order polynomial kernel by adopting the off-the-shelf kernel approximation methods Random MacLaurin [15] and Tensor Sketch [28] to pool the local features in a compact way. Cui et al. [5] generalized this approach to higher order polynomials with Tensor Sketch. By combining with a bilinear SVM, Kong et al. [16] proposed to impose a low-rank constraint to reduce the number of parameters. However, none of these approaches can be easily integrated with matrix normalization because of the absence of a bilinear pooled matrix.

Lasserre et al. [19] proposed to use the empirical moment matrix formed by an explicit in-homogeneous polynomial kernel basis for outlier detection. Sznaier et al. [31] improved the performance for the case of data subspaces by working on the singular values directly. In [12], the empirical moment matrix was applied as a feature embedding method for the person re-identification problem and it was shown that the Gaussian embedding [23] is a special case when the moment matrix order equals 1. However, both of these works focus on a conventional pipeline and did not bring moments to modern CNN architectures.

Ionescu et al. [13] introduced the theory and practice of matrix back-propagation for training CNNs, which enables structured matrix operations in deep neural network training. Both [21] and [36] used it to derive the back-propagation of the matrix square-root and matrix logarithm for a symmetric matrix. Li et al. [20] applied a generalized p-th order matrix power normalization instead of the square-root. However, in our case, since we want to apply the matrix normalization directly on a non-square local feature matrix, we cannot plug in the equations from previous works directly.

3. MoNet Architecture

The overview of the proposed MoNet architecture is shown in Fig. 1. For an input image I, the output of the last convolution layer after the ReLU, X, consists of local features x_i across spatial locations i = 1, 2, ..., n. Then, we introduce a homogeneous mapping (HM) layer to disentangle the tensor product operator. After that, a novel sub-matrix square-root (Ssqrt) layer is applied to directly normalize the feature vector before the tensor product. Finally, a compact bilinear pooling layer pools all n features across all spatial locations, followed by an element-wise square-root regularization and ℓ2 normalization before the final fully-connected layer. Next, we detail the design of each block.

3.1. Homogeneous mapping layer

Since the global Gaussian embedding layer used in G2DeNet entangles the tensor product operator, one cannot directly incorporate compact bilinear pooling. With the help of the proposed HM layer, we can re-write the Gaussian embedding layer as a HM layer followed by a tensor product, as explained next.

Assume X ∈ R^{n×C}, corresponding to n features with dimension C and n > C, mean µ and covariance Σ. The homogeneous mapping of X is obtained by padding X with an extra dimension set to 1. For the simplicity of the following layers, instead of applying the conventional bilinear pooling layer as in [22], we also divide the homogeneous feature by the square-root of the number of samples. Then, the forward equation of the homogeneous mapping layer is:

$\tilde{X} = \frac{1}{\sqrt{n}}[\mathbf{1} \mid X] \in \mathbb{R}^{n \times (C+1)}$    (1)

The tensor product of X̃ can be written as

$M = \tilde{X}^T \tilde{X} = \begin{bmatrix} 1 & \mu \\ \mu^T & \frac{1}{n} X^T X \end{bmatrix}$    (2)

where µ = (1/n) Σ_i x_i. Since (1/n) X^T X = Σ + µ^T µ, Eq. 2 is the Gaussian embedding method used in G2DeNet [36]. One can also show that the conventional bilinear pooling layer is equal to the tensor product of the in-homogeneous feature matrix.
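To make the homogeneous mapping concrete, the following minimal NumPy sketch (ours, not the authors' MatConvNet implementation; the sizes n = 784 and C = 512 are only illustrative) checks that the tensor product of Eq. (1) reproduces the Gaussian embedding block matrix of Eq. (2):

```python
import numpy as np

n, C = 784, 512                                  # e.g. 28x28 spatial locations, 512 channels
X = np.random.rand(n, C)                         # local features after the ReLU, n x C

# Homogeneous mapping layer, Eq. (1): pad with a 1 and scale by 1/sqrt(n)
X_tilde = np.hstack([np.ones((n, 1)), X]) / np.sqrt(n)

# Tensor (bilinear) product of the homogeneous features, Eq. (2)
M = X_tilde.T @ X_tilde                          # (C+1) x (C+1)

# Gaussian embedding used by G2DeNet, built from the mean and (biased) covariance
mu = X.mean(axis=0, keepdims=True)               # 1 x C
Sigma = (X - mu).T @ (X - mu) / n                # C x C
G = np.block([[np.ones((1, 1)), mu],
              [mu.T, Sigma + mu.T @ mu]])

print(np.allclose(M, G))                         # True: the two embeddings coincide
```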

3.2. Sub-matrix square-root layer

Matrix normalization in iBCNN and G2DeNet requires the computation of the singular value decomposition (SVD) of the output of the tensor product, which prevents the direct use of compact bilinear pooling. We address this issue by incorporating a novel layer, named the sub-matrix square-root (Ssqrt) layer, which performs the equivalent matrix normalization before the tensor product. This choice is supported by experimental results in [36, 21] showing that matrix square-root normalization is better than matrix logarithm normalization for performance and training stability.

3.2.1 Forward propagation

Recall that given the SVD of an SPD matrix, Q = U_Q S_Q U_Q^T, the square root of Q is defined as

$Q^{\frac{1}{2}} = U_Q S_Q^{\frac{1}{2}} U_Q^T$    (3)

where $S_Q^{\frac{1}{2}}$ is computed by taking the square root of its diagonal elements.

Consider now the SVD of X̃ = USV^T. Then, we have

$M = \tilde{X}^T \tilde{X} = V S^T U^T U S V^T$    (4)

and since U^T U = I and S^T S is a square matrix:

$M^{\frac{1}{2}} = V (S^T S)^{\frac{1}{2}} V^T$    (5)

Note that S ∈ R^{n×(C+1)}, n > C + 1, and hence its square root is not well defined. We introduce a helper matrix A to keep all non-zero singular values in S as follows:

$S = A\tilde{S}, \quad A = [I_{C+1} \mid \mathbf{0}]^T$    (6)

where S̃ ∈ R^{(C+1)×(C+1)} is a square matrix² and I_{C+1} is the (C+1)×(C+1) identity matrix. Substituting Eq. (6) in Eq. (5), we have

$M^{\frac{1}{2}} = V(\tilde{S} A^T A \tilde{S})^{\frac{1}{2}} V^T = V \tilde{S}^{\frac{1}{2}} \tilde{S}^{\frac{1}{2}} V^T$    (7)

since A^T A = I_{C+1}. To keep the same number of samples for the input and output of this layer, we finally re-write Eq. (5) in the following tensor product format:

$M^{\frac{1}{2}} = Y^T Y$    (8)

where the output Y is defined as $Y = A \tilde{S}^{\frac{1}{2}} V^T$, allowing us to perform matrix normalization directly on the features X̃.

Note that because in most modern CNNs n cannot be much greater than C, and the features after the ReLU tend to be sparse, X̃ is usually rank deficient. Therefore, we only use the non-zero singular values and singular vectors. Then, the forward equation of the sub-matrix square-root layer can be written as

$Y = A_{:,1:e} \tilde{S}_{1:e}^{\frac{1}{2}} V_{:,1:e}^T$    (9)

where e is the index of the smallest singular value greater than σ.

3.2.2 Backward propagation

We follow the matrix back propagation techniques proposed by Ionescu et al. [13] to derive the back propagation path of the sub-matrix square-root layer. For a scalar loss L = f(Y), we assume ∂L/∂Y is available when we derive the back propagation. Let X̃ = USV^T with U ∈ R^{n×n}. We can form U using the block decomposition U = [U_1 | U_2] with U_1 ∈ R^{n×(C+1)} and U_2 ∈ R^{n×(n−C−1)}. The partial derivatives between a given scalar loss L and X̃ are

$\frac{\partial L}{\partial \tilde{X}} = D V^T + U\left(\frac{\partial L}{\partial S} - U^T D\right)_{diag} V^T + 2 U S \left(K^T \circ \left(V^T \left(\frac{\partial L}{\partial V} - V D^T U S\right)\right)\right)_{sym} V^T$    (10)

where ∘ represents the element-wise product, $(Q)_{sym} = \frac{1}{2}(Q^T + Q)$, and

$D = \frac{\partial L}{\partial U_1} \tilde{S}^{-1} - U_2 \left(\frac{\partial L}{\partial U_2}\right)^T U_1 \tilde{S}^{-1}$    (11)

$K_{ij} = \begin{cases} \frac{1}{s_i^2 - s_j^2} & i \neq j \\ 0 & i = j \end{cases}$    (12)

From Eq. 9, we can compute the variation of Y as

$dY = \frac{1}{2} A \tilde{S}^{-\frac{1}{2}} d\tilde{S} V^T + A \tilde{S}^{\frac{1}{2}} dV^T$    (13)

Based on the chain rule, the total variation can be written as

$\frac{\partial L}{\partial Y} : dY = \frac{1}{2} \frac{\partial L}{\partial Y} : A \tilde{S}^{-\frac{1}{2}} d\tilde{S} V^T + \frac{\partial L}{\partial Y} : A \tilde{S}^{\frac{1}{2}} dV^T$    (14)

where : denotes the inner-product. After re-arrangement with the rotation properties of the inner-product, we re-write the above equation as

$\frac{\partial L}{\partial Y} : dY = \frac{1}{2} \tilde{S}^{-\frac{1}{2}} A^T \frac{\partial L}{\partial Y} V : d\tilde{S} + \tilde{S}^{\frac{1}{2}} A^T \frac{\partial L}{\partial Y} : dV^T$    (15)

Therefore, we have

$\frac{\partial L}{\partial S} = A \frac{\partial L}{\partial \tilde{S}} = \frac{1}{2} A \tilde{S}^{-\frac{1}{2}} A^T \frac{\partial L}{\partial Y} V$    (16)

$\frac{\partial L}{\partial V} = \left(\frac{\partial L}{\partial V^T}\right)^T = \left(\frac{\partial L}{\partial Y}\right)^T A \tilde{S}^{\frac{1}{2}}$    (17)

Finally, substituting Eq. 16 and Eq. 17 into Eq. 10 and considering ∂L/∂U = 0, we have

$\frac{\partial L}{\partial \tilde{X}} = U \left[\frac{1}{2} A \tilde{S}^{-\frac{1}{2}} A^T \frac{\partial L}{\partial Y} V + 2 S \left(K^T \circ \left(V^T \left(\frac{\partial L}{\partial Y}\right)^T A \tilde{S}^{\frac{1}{2}}\right)\right)_{sym}\right] V^T$    (18)

²We will omit the subscript in the following for a concise notation.
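As a concrete check of Eqs. (6)-(9), the following NumPy sketch (ours, not the released implementation; the threshold σ = 10⁻⁵ follows Sec. 4.1.3 and the other sizes are illustrative) builds Y from the thin SVD of X̃ and verifies that Y^T Y recovers the matrix square-root of M from Eq. (5):

```python
import numpy as np

def ssqrt_forward(X_tilde, sigma=1e-5):
    """Sub-matrix square-root, Eq. (9): Y = A[:, :e] S~^(1/2)[:e] V[:, :e]^T."""
    U, s, Vt = np.linalg.svd(X_tilde, full_matrices=False)   # thin SVD, s sorted descending
    e = int(np.sum(s > sigma))                                # keep singular values above the threshold
    n, c1 = X_tilde.shape
    A = np.vstack([np.eye(c1), np.zeros((n - c1, c1))])       # helper matrix of Eq. (6), A = [I | 0]^T
    return A[:, :e] @ np.diag(np.sqrt(s[:e])) @ Vt[:e, :]     # n x (C+1), only e non-zero rows

n, C = 784, 64
X_tilde = np.hstack([np.ones((n, 1)), np.random.rand(n, C)]) / np.sqrt(n)

Y = ssqrt_forward(X_tilde)
M = X_tilde.T @ X_tilde

# Reference matrix square-root of the SPD matrix M, as in Eq. (3)/(5)
w, V = np.linalg.eigh(M)
M_sqrt = V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T

print(np.allclose(Y.T @ Y, M_sqrt))                           # True (up to the threshold cut-off)
```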

3.3. Compact pooling

Following the work in [9, 8], we adopt the Tensor Sketch (TS) method to approximate bilinear pooling due to its better performance and lower computational and memory cost. Building on the count sketch function and the FFT, one can generate a tensor sketch function such that ⟨TS₁(x), TS₂(y)⟩ ≈ ⟨x, y⟩², using Algorithm 1. The back-propagation of a TS layer is given in [9].

As shown in Table 2, with the techniques mentioned above, the proposed MoNet is capable of solving the problem with much lower computation and memory complexity than the other BCNN-based algorithms.

4. Experiments

Aligned with other bilinear CNN based papers, we also evaluate the proposed MoNet on three widely used fine-grained classification datasets. The experimental setup and the algorithm implementation are described in detail in Sec. 4.1. Then, in Sec. 4.2, the experimental results on fine-grained classification are presented and analyzed.

Table 2. Dimension, computation and memory information for the networks compared in this paper. H, W and C represent the height, width and number of feature channels of the output of the final convolution layer, respectively. k and D denote the number of classes and the projected dimension for Tensor Sketch, respectively. Numbers inside brackets indicate the typical value when the corresponding network is evaluated with the VGG-16 model [30] on a classification task with 1,000 classes. In this case, H = W = 13, C = 512, k = 1,000, D = 10,000 and all data is stored with single precision.

|                   | BCNN [22]    | iBCNN [21]   | iBCNN TS           | G2DeNet [36]     | MoNet            | MoNet TS           |
|-------------------|--------------|--------------|--------------------|------------------|------------------|--------------------|
| Dimension         | C² [262K]    | C² [262K]    | D [10K]            | (C+1)² [263K]    | (C+1)² [263K]    | D [10K]            |
| Parameter Memory  | 0            | 0            | 2C                 | 0                | 0                | 2C                 |
| Computation       | O(HWC²)      | O(HWC²)      | O(HW(C + D log D)) | O(HWC²)          | O(HWC²)          | O(HW(C + D log D)) |
| Classifier Memory | kC² [1000MB] | kC² [1000MB] | kD [40MB]          | k(C+1)² [1004MB] | k(C+1)² [1004MB] | kD [40MB]          |
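As a quick sanity check of the bracketed values in Table 2, the short Python snippet below reproduces the dimension and classifier-memory figures (our arithmetic, not part of the model; the exact MB figure depends on whether a megabyte is taken as 2²⁰ or 10⁶ bytes):

```python
C, k, D = 512, 1000, 10_000                          # channels, classes, Tensor Sketch dimension
bytes_per_float = 4                                  # single precision

print(C ** 2, (C + 1) ** 2)                          # 262144, 263169 -> "262K" / "263K"
print(k * C ** 2 * bytes_per_float / 2 ** 20)        # 1000.0 MB -> kC^2 classifier memory
print(k * (C + 1) ** 2 * bytes_per_float / 2 ** 20)  # ~1003.9 MB -> "1004MB"
print(k * D * bytes_per_float / 2 ** 20)             # ~38.1 MB (4e7 bytes), quoted as "40MB"
```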

Algorithm 1 Tensor Sketch approximation pooling
Require: x, projected dimension D
1: Generate two pairs of randomly selected (but fixed) hash functions h_t ∈ R^C and s_t ∈ R^C, where t = 1, 2 and h_t(i), s_t(i) are uniformly drawn from {1, 2, ..., D} and {−1, +1}, respectively.
2: Define the count sketch function Ψ(x, h_t, s_t) = [ψ₁(x), ψ₂(x), ..., ψ_D(x)]^T, where ψ_j(x) = Σ_{i: h_t(i)=j} s_t(i) x_i.
3: Define TS(x) = FFT⁻¹(FFT(Ψ(x, h₁, s₁)) ∘ FFT(Ψ(x, h₂, s₂))), where ∘ denotes element-wise multiplication.
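The sketch below is a minimal NumPy rendering of Algorithm 1 (ours, not the implementation used in the paper; indices are 0-based here and the test vectors are non-negative, ReLU-like features), illustrating that ⟨TS(x), TS(y)⟩ approximates the second order polynomial kernel ⟨x, y⟩²:

```python
import numpy as np

rng = np.random.default_rng(0)
C, D = 512, 10_000

# Two fixed hash/sign pairs (h_t, s_t), t = 1, 2
h = rng.integers(0, D, size=(2, C))              # h_t(i) uniform over the D output bins
s = rng.choice([-1.0, 1.0], size=(2, C))         # s_t(i) uniform over {-1, +1}

def count_sketch(x, t):
    """Psi(x, h_t, s_t) with psi_j(x) = sum over {i : h_t(i) = j} of s_t(i) x_i."""
    psi = np.zeros(D)
    np.add.at(psi, h[t], s[t] * x)               # scatter-add into the hashed bins
    return psi

def tensor_sketch(x):
    """TS(x) = FFT^-1( FFT(Psi(x, h_1, s_1)) o FFT(Psi(x, h_2, s_2)) )."""
    return np.real(np.fft.ifft(np.fft.fft(count_sketch(x, 0)) *
                               np.fft.fft(count_sketch(x, 1))))

x, y = rng.random(C), rng.random(C)              # non-negative, ReLU-like test features
print(tensor_sketch(x) @ tensor_sketch(y))       # approximately <x, y>^2 ...
print((x @ y) ** 2)                              # ... typically within a few percent for D = 10,000
```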

4.1. Experimental setup

We evaluated MoNet on three widely used fine-grained classification datasets. Different from general object recognition tasks, fine-grained classification usually tries to distinguish objects at the sub-category level, such as different makes of cars or different species of birds. The main challenge of this task is the relatively small inter-class variation and relatively large intra-class variation.

In all experiments, the 13 convolutional layers of VGG-16 [30] are used as the local feature extractor, and their outputs are used as local appearance representations. These 13 convolution layers are trained on ImageNet [7] and fine-tuned in our experiments with the three fine-grained classification datasets.

4.1.1 Datasets

Caltech-UCSD Birds (CUB) [35] contains 200 species of birds, mostly North-American. To be consistent with other works, we use the 2011 extension with a doubled number of samples.

FGVC-Aircraft Benchmark (Aircraft) [25] is a benchmark fine-grained classification dataset with different aircraft of various models and manufacturers.

Stanford Cars (Cars) [17] contains images of different classes of cars at the level of make, model and year.

We use the provided train/test splits for all three datasets. Detailed information is given in Table 3 and Fig. 2 shows sample images.

Table 3. Basic statistics of the datasets used for evaluation
| Datasets      | # training | # testing | # classes |
|---------------|------------|-----------|-----------|
| CUB [35]      | 5,994      | 5,794     | 200       |
| Aircraft [25] | 6,667      | 3,333     | 100       |
| Cars [17]     | 8,144      | 8,041     | 196       |

[Figure 2: sample images from the three datasets]
Figure 2. Sample images from the fine-grained classification datasets. From left to right, each column corresponds to CUB, Aircraft and Cars, respectively.

4.1.2 Different pooling methods

Bilinear pooling (BCNN): The VGG-16 based BCNN [22] is utilized as the baseline pooling method, which applies the tensor product on the output of the conv5_3 layer with ReLU activation. The dimension of the final representation is 512 × 512 ≈ 262K and the number of linear classifier parameters is k × 262K, where k is the number of classes. To be fair, the latest results from the authors' project page [1] are compared.

Improved bilinear pooling (iBCNN): Lin et al. [21] improved the original BCNN by adding matrix power normalization after the bilinear pooling layer. We compare with the results reported in [21] with VGG-16 as the backbone network.

Global Gaussian distribution embedding (G2DeNet): Instead of fully bilinear pooling, G2DeNet pools the local features with a global Gaussian distribution embedding method, followed by a matrix square-root normalization. Since it includes the first order moment information, the dimension of the final feature is slightly greater than that of BCNN and iBCNN. The experimental results with the "w/o BBox" configuration in [36] are compared in this paper.

Proposed moment embedding network (MoNet) and its variants: We implemented the proposed MoNet architecture with the structure shown in Fig. 1 and fine-tuned the whole network in an end-to-end fashion. When using bilinear pooling, the feature dimensionality, computation and memory complexity are the same as for G2DeNet. To evaluate the effectiveness of the proposed HM and Ssqrt layers, we also tested MoNet variants. Depending on the left-out layer, we can have four different variants in total. Modifiers '2' and 'U' indicate that only 2nd order moments are incorporated during feature embedding, and that no normalization was used, respectively.

Tensor Sketch compact pooling (TS): When building the network with compact pooling, the TS layer [9] was added after the sub-matrix square-root layer. The projection dimension D was selected empirically for MoNet and its variants.

4.1.3 Implementation details

Using a large enough number of samples is important to estimate stable and meaningful statistical moment information. The input images are resized to 448 × 448 in all the experiments, which produces a 28 × 28 × 512 local feature matrix after conv5_3 for each image. Following common practice [36, 5], we first resize the image with a fixed aspect-ratio, such that the shorter edge reaches 448, and then utilize a center crop to resize the image to 448 × 448. During training, random horizontal flipping was applied as data augmentation. Different from [21] with VGG-M, no augmentation is applied during testing.

To avoid rank deficiency, the singular value threshold σ was set to 10⁻⁵ for both forward and backward propagation, which results in a threshold of 10⁻¹⁰ on the singular values of the tensor product matrix. The projected dimension in Tensor Sketch was fixed to D = 10⁴, which satisfies C < D ≪ C². For smooth and stable training, we applied gradient clipping [26] to clip all gradients to the range [−1, 1].

As suggested by [21, 36], all pooling methods are followed by an element-wise signed square-root y_s = sign(y)√|y| and ℓ2 normalization y_n = y_s/||y_s||. For the sake of smooth training, the element-wise square-root is also applied on the local appearance features [36].

The weights of the VGG-16 convolutional layers are pre-trained on the ImageNet classification dataset. We first warm-started by fine-tuning the last linear classifier for 300 epochs. Then, we fine-tuned the whole network end-to-end with a learning rate of 0.001 and a batch size of 16. The momentum was set to 0.9 and the weight decay was set to 0.0005. Most experiments converged to a local optimum after 50 epochs.

The proposed MoNet was implemented with MatConvNet [34] and Matlab 2017a³. Because of the numerical instability of SVDs, as suggested by Ionescu et al. [13], the sub-matrix square-root layer was implemented on the CPU with double precision. The whole network was fine-tuned on an Ubuntu PC with 64GB RAM and an Nvidia GTX 1080 Ti.

³Code is available at https://github.com/NEU-Gou/MoNet
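For completeness, a minimal NumPy sketch (ours, not the released code) of the element-wise post-processing described above; the small eps guard against a zero vector is our addition:

```python
import numpy as np

def signed_sqrt_l2(y, eps=1e-12):
    """y_s = sign(y) * sqrt(|y|), then y_n = y_s / ||y_s||_2 (eps avoids division by zero)."""
    y_s = np.sign(y) * np.sqrt(np.abs(y))
    return y_s / (np.linalg.norm(y_s) + eps)

pooled = np.random.randn(10_000)                 # e.g. a Tensor Sketch feature with D = 10,000
feature = signed_sqrt_l2(pooled)
print(np.isclose(np.linalg.norm(feature), 1.0))  # True
```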
4.2. Experimental results

In Tables 4 and 5, the classification accuracy for each network is presented in a row. Bilinear and TS denote fully bilinear pooling and Tensor Sketch compact pooling, respectively.

Comparison with different variants: The variants MoNet-2U, MoNet-2, and MoNet, when using bilinear pooling, are mathematically equivalent to BCNN, iBCNN, and G2DeNet, respectively. Aligned with the observations in [21, 36], we also see a consistent performance gain for both MoNet and MoNet-2 from adding the normalization sub-matrix square-root (Ssqrt) layer. Specifically, MoNet-2 outperforms MoNet-2U by 0.6% to 1% with bilinear pooling and 0.6% to 0.8% with TS, whereas MoNet outperforms MoNet-U by 3% to 4.9% with bilinear pooling and 0.8% to 0.9% with TS. This layer is more effective on MoNet than on MoNet-2. The reason for this is that mixing different order moments may make the embedded feature numerically unstable, but a good normalization helps overcome this issue. By adding the HM layer to incorporate 1st order moment information, MoNet achieves consistently better results than MoNet-2, on all datasets with both bilinear and compact pooling. Note that MoNet-U performs worse than MoNet-2U, which further illustrates the merit of a proper normalization.

Comparison with different architectures: Consistent with [21], matrix normalization improves the performance by 1-2% on all three datasets. Our equivalent MoNet-2 achieves slightly better classification accuracy (0.2%) on the CUB dataset but performs worse on the Airplane and Car datasets when compared with iBCNN. We believe that this is due to the different approaches used to deal with rank deficiency. In our implementation, the singular values are hard thresholded as shown in Eq. 9, while iBCNN [21] dealt with the rank deficiency by adding 1 to all the singular values, which is a relatively very small number compared to the maximum singular value (10⁶). By adding the 1st order moment information, G2DeNet outperforms iBCNN by around 1% on all three datasets. By re-writing the Gaussian embedding as a tensor product of the homogeneous padded local features, our proposed MoNet obtains similar or slightly better classification accuracy when comparing against G2DeNet. Specifically, the classification accuracy of MoNet is 0.3% higher on the Airplane dataset, but 0.7% lower on both the CUB and Car datasets.

Table 4. Experimental results for MoNet variants. Modifiers '2' and 'U' indicate that only 2nd order moments are incorporated during feature embedding, and that no normalization was used, respectively. The abbreviations for the proposed layers are: Ssqrt: sub-matrix square-root; HM: homogeneous mapping. The best result in each column is marked in red.

| New name | Missing layers | CUB Bilinear | CUB TS | Airplane Bilinear | Airplane TS | Cars Bilinear | Cars TS |
|----------|----------------|--------------|--------|-------------------|-------------|---------------|---------|
| MoNet-2U | HM, Ssqrt      | 85.0         | 85.0   | 86.1              | 86.1        | 89.6          | 89.5    |
| MoNet-2  | HM             | 86.0         | 85.7   | 86.7              | 86.7        | 90.5          | 90.3    |
| MoNet-U  | Ssqrt          | 82.8         | 84.8   | 84.4              | 87.2        | 88.8          | 90.0    |
| MoNet    | -              | 86.4         | 85.7   | 89.3              | 88.1        | 91.8          | 90.8    |

Table 5. Experimental results on fine-grained classification. Bilinear and TS represent fully bilinear pooling and Tensor Sketch compact pooling, respectively. The best performance in each column is marked in red.

|                             |              | CUB Bilinear | CUB TS | Airplane Bilinear | Airplane TS | Car Bilinear | Car TS |
|-----------------------------|--------------|--------------|--------|-------------------|-------------|--------------|--------|
|                             | BCNN [22, 9] | 84.0         | 84.0   | 86.9              | 87.2        | 90.6         | 90.2   |
|                             | MoNet-2U     | 85.0         | 85.0   | 86.1              | 86.1        | 89.6         | 89.5   |
|                             | iBCNN [21]   | 85.8         | -      | 88.5              | -           | 92.1         | -      |
|                             | MoNet-2      | 86.0         | 85.7   | 86.7              | 86.7        | 90.5         | 90.3   |
|                             | G2DeNet [36] | 87.1         | -      | 89.0              | -           | 92.5         | -      |
|                             | MoNet        | 86.4         | 85.7   | 89.3              | 88.1        | 91.8         | 90.8   |
| Other higher order methods  | KP [5]       | -            | 86.2   | -                 | 86.9        | -            | 92.6   |
| Other higher order methods  | HOHC [2]     | 85.3         |        | 88.3              |             | 91.7         |        |
| State-of-the-art            | MA-CNN [38]  | 86.5         |        | 89.9              |             | 92.8         |        |

Comparison with fully bilinear pooling and compact pooling: As shown in [9], compact pooling can achieve similar performance to BCNN, but with only 4% of the dimensionality. We also see a similar trend with MoNet-2U and MoNet-2. The classification accuracy difference between the bilinear pooling and compact pooling versions is less than 0.3% on all three datasets. However, the performance gaps are relatively greater when we compare the different pooling schemes on MoNet. Bilinear pooling improves the classification accuracy by 0.7%, 1.2% and 1% over compact pooling on the CUB, Airplane and Car datasets, respectively. However, with compact pooling, the dimensionality of the final representation is 96% smaller. Although the final fully bilinear pooled representation dimensions of MoNet-2 and MoNet are roughly the same, MoNet utilizes more different order moments, which requires more count sketch projections to approximate. Thus, when fixing D = 10,000 for both MoNet-2 and MoNet, the performance of MoNet with compact pooling is degraded. However, MoNet TS still outperforms MoNet-2 TS by 1.4% and 0.5% on the Airplane and Car datasets, respectively.

Comparison with other methods: [5] and [2] are two other recent works that also take into account higher order statistical information. Cui et al. [5] applied Tensor Sketch repetitively to approximate up to 4th order explicit polynomial kernel space in a compact way. They obtained better results for the CUB and Car datasets compared against other compact pooling results, but notably worse results (1.2%) on the Airplane dataset. This may be due to two factors. First, directly utilizing higher order moments without proper normalization leads to numerical instabilities. Second, approximating higher order moments with a limited number of samples is essentially an ill-posed problem. Cai et al. [2] only utilize higher order self-product terms but not the interaction terms, which leads to worse performance on all three datasets. Finally, the state-of-the-art MA-CNN [38] achieves slightly better results on the Airplane and Car datasets.

5. Conclusion

Bilinear pooling, as a recently proposed 2nd order moment pooling method, has been shown to be effective in several vision tasks. iBCNN [21] improves its performance with matrix square-root normalization and G2DeNet [36] extends it by adding 1st order moment information. One key limitation of these approaches is the high dimension of the final representation. To resolve this, compact pooling methods have been proposed to approximate the bilinear pooling. However, two factors make using compact pooling with iBCNN and G2DeNet non-trivial. Firstly, the Gaussian embedding formulation entangles the bilinear pooling. Secondly, matrix normalization needs to be applied on the bilinear pooled matrix. In this paper, we reformulated the Gaussian embedding using the empirical moment matrix and decoupled the bilinear pooling step out. With the help of a novel sub-matrix square-root layer, our proposed network MoNet can take advantage of different order moments, matrix normalization, as well as compact pooling. Experiments on three widely used fine-grained classification datasets demonstrate that MoNet can achieve similar or better performance when compared with G2DeNet and retain comparable results with only 4% of the feature dimensions.

References

[1] Bilinear CNNs project. http://vis-www.cs.umass.edu/bcnn/. Accessed: 2018-03-27.
[2] S. Cai, W. Zuo, and L. Zhang. Higher-order integration of hierarchical convolutional activations for fine-grained visual categorization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 511–520, 2017.
[3] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu. Semantic segmentation with second-order pooling. Computer Vision–ECCV 2012, pages 430–443, 2012.
[4] M. Cimpoi, S. Maji, and A. Vedaldi. Deep filter banks for texture recognition and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3828–3836, 2015.
[5] Y. Cui, F. Zhou, J. Wang, X. Liu, Y. Lin, and S. Belongie. Kernel pooling for convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2017.
[6] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition (CVPR 2005), volume 1, pages 886–893. IEEE, 2005.
[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR 2009), pages 248–255. IEEE, 2009.
[8] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847, 2016.
[9] Y. Gao, O. Beijbom, N. Zhang, and T. Darrell. Compact bilinear pooling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 317–326, 2016.
[10] L. A. Gatys, A. S. Ecker, and M. Bethge. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.
[11] Y. Gong, L. Wang, R. Guo, and S. Lazebnik. Multi-scale orderless pooling of deep convolutional activation features. In European Conference on Computer Vision, pages 392–407. Springer, 2014.
[12] M. Gou, O. Camps, and M. Sznaier. mom: Mean of moments feature for person re-identification. In The IEEE International Conference on Computer Vision Workshop (ICCVW), 2017.
[13] C. Ionescu, O. Vantzos, and C. Sminchisescu. Matrix backpropagation for deep networks with structured layers. In Proceedings of the IEEE International Conference on Computer Vision, pages 2965–2973, 2015.
[14] H. Jégou, M. Douze, C. Schmid, and P. Pérez. Aggregating local descriptors into a compact image representation. In Computer Vision and Pattern Recognition (CVPR 2010), pages 3304–3311. IEEE, 2010.
[15] P. Kar and H. Karnick. Random feature maps for dot product kernels. In International Conference on Artificial Intelligence and Statistics, pages 583–591, 2012.
[16] S. Kong and C. Fowlkes. Low-rank bilinear pooling for fine-grained classification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[17] J. Krause, M. Stark, J. Deng, and L. Fei-Fei. 3d object representations for fine-grained categorization. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 554–561, 2013.
[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[19] J.-B. Lasserre and E. Pauwels. Sorting out typicality with the inverse moment matrix sos polynomial. In Neural Information Processing Systems (NIPS 2016), 2016.
[20] P. Li, J. Xie, Q. Wang, and W. Zuo. Is second-order information helpful for large-scale visual recognition? In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
[21] T.-Y. Lin and S. Maji. Improved bilinear pooling with cnns. In BMVC, 2017.
[22] T.-Y. Lin, A. RoyChowdhury, and S. Maji. Bilinear cnn models for fine-grained visual recognition. In The IEEE International Conference on Computer Vision (ICCV), December 2015.
[23] M. Lovrić, M. Min-Oo, and E. A. Ruh. Multivariate normal distributions parametrized as a riemannian symmetric space. Journal of Multivariate Analysis, 74(1):36–48, 2000.
[24] D. G. Lowe. Object recognition from local scale-invariant features. In Proceedings of the Seventh IEEE International Conference on Computer Vision, volume 2, pages 1150–1157. IEEE, 1999.
[25] S. Maji, J. Kannala, E. Rahtu, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. Technical report, 2013.
[26] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318, 2013.
[27] F. Perronnin, J. Sánchez, and T. Mensink. Improving the fisher kernel for large-scale image classification. Computer Vision–ECCV 2010, pages 143–156, 2010.
[28] N. Pham and R. Pagh. Fast and scalable polynomial kernels via explicit feature maps. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 239–247. ACM, 2013.
[29] A. RoyChowdhury, T.-Y. Lin, S. Maji, and E. Learned-Miller. Face identification with bilinear cnns. arXiv preprint arXiv:1506.01342, 2015.
[30] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[31] M. Sznaier and O. Camps. Sos-rsc: A sum-of-squares polynomial approach to robustifying subspace clustering algorithms. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[32] J. B. Tenenbaum and W. T. Freeman. Separating style and content. In Advances in Neural Information Processing Systems, pages 662–668, 1997.
[33] O. Tuzel, F. Porikli, and P. Meer. Pedestrian detection via classification on riemannian manifolds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(10):1713–1727, 2008.
[34] A. Vedaldi and K. Lenc. Matconvnet – convolutional neural networks for matlab. In Proceedings of the ACM International Conference on Multimedia, 2015.
[35] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical report, 2011.
[36] Q. Wang, P. Li, and L. Zhang. G2denet: Global gaussian distribution embedding network and its application to visual recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[37] Z. Yu, J. Yu, J. Fan, and D. Tao. Multi-modal factorized bilinear pooling with co-attention learning for visual question answering. arXiv preprint arXiv:1708.01471, 2017.
[38] H. Zheng, J. Fu, T. Mei, and J. Luo. Learning multi-attention convolutional neural network for fine-grained image recognition. In International Conference on Computer Vision, 2017.
