So Far, We Have Introduced…


So far, we have introduced:

• Mallat's invariant scattering transform networks: the deeper the network, the more invariant it is to translation, local deformation, scaling, and rotation.
• Poggio et al.: local (sparse), hierarchical, compositional functions avoid the curse of dimensionality.
• Papyan et al.: sparse cascaded convolutional dictionary learning, with uniqueness and stability guarantees from sparse recovery.

Let's continue with four perspectives:

• Harmonic analysis: What are the optimal (transferrable) representations of functions as input signals (sounds, images, ...)?
• Approximation theory: When and why are deep networks better than shallow networks?
• Optimization: What is the landscape of the empirical risk, and how can it be minimized efficiently?
• Statistics: How can deep learning generalize well without overfitting the noise?

Generalization of Supervised Learning

All $(x, y) \sim P$, where $P$ is unknown. Collect data $\{(x_i, y_i) : i = 1, 2, \ldots, n\}$. A common approach is ERM (empirical risk minimization): learn a model $f: X \to Y$, $f \in \mathcal{H}$, parameterized by $\theta$, by solving

$$\min_\theta \hat{R}_n(\theta) := \frac{1}{n} \sum_{i=1}^{n} \ell(\theta; x_i, y_i),$$

where $\theta$ are the model parameters and $\ell(\theta; x, y)$ is the loss on a single example; new data are predicted via $x \mapsto f(x)$. The population risk is $R(\theta) := \mathbb{E}[\ell(\theta; x, y)]$, and the generalization error is the gap between the two. Is this just a bias-variance tradeoff?

Deep models: why do big models generalize well?

Models with $p > 20n$ are common. On CIFAR10 ($n = 50{,}000$, $d = 3{,}072$, $k = 10$), what happens when the regularizers are turned off?

Model                               Parameters   p/n   Train loss   Test error
CudaConvNet                         145,578      2.9   0            23%
CudaConvNet (with regularization)   145,578      2.9   0.34         18%
MicroInception                      1,649,402    33    0            14%
ResNet                              2,401,440    48    0            13%

(Ben Recht, FoCM 2017)

How to control generalization error?

Motivation: when $\Phi(X; \Theta)$ is a large, deep network, the current best mechanism to control the generalization gap has two key ingredients:

• Stochastic optimization: "During training, it adds the sampling noise that corresponds to the empirical-population mismatch" [Léon Bottou].
• Make the model as large as possible: see e.g. "Understanding Deep Learning Requires Rethinking Generalization" [C. Zhang et al., ICLR'17].

Traditional Learning Theory

A common form of generalization bound (in expectation or with high probability) is

$$R(\theta) \le \hat{R}_n(\theta) + \sqrt{\frac{\text{capacity measure}}{n}}.$$

Classical capacity measures include the VC-dimension, with $\mathrm{VC} \le O(|E| \log |E|)$ where $|E|$ is the number of edges; the $\epsilon$-covering number $\log N(\mathcal{F}, \epsilon, n)$; and the Rademacher average $\mathcal{R}_n(\mathcal{F})$, all of which grow with the number of layers and edges of the network. Under these measures, big models should fail!

Training Algorithms for Deep Learning

Commonly used algorithms fall into two families:

• Non-adaptive (easy to implement): SGD, SGD with momentum, ...
• Adaptive (learn the curvature adaptively and automatically tune parameters): Adam, AdaGrad, AdaDelta, ...

Stochastic Gradient Descent

The objective is the empirical loss

$$\min_\theta \hat{R}_n(\theta) := \frac{1}{n} \sum_{i=1}^{n} \ell(\theta; x_i, y_i),$$

where $(x_i, y_i)$ is the data and $\theta$ is the parameter vector.

• Gradient descent: $\theta_{t+1} = \theta_t - \frac{\eta}{n} \sum_{i=1}^{n} \nabla \ell(\theta_t; x_i, y_i)$, with $O(n)$ time per iteration.
• SGD: $\theta_{t+1} = \theta_t - \eta \nabla \ell(\theta_t; x_{i_t}, y_{i_t})$, where $i_t$ is uniform in $\{1, \ldots, n\}$, with $O(1)$ time per iteration.
• Extensions: mini-batch SGD. A minimal sketch comparing the two updates appears below.
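As a concrete illustration of the cost difference, here is a minimal NumPy sketch (not from the slides) that runs full-batch gradient descent and single-sample SGD on a synthetic least-squares ERM problem; the data sizes, step size $\eta$, and iteration count are illustrative placeholders.

```python
import numpy as np

# Minimal sketch: ERM with squared loss on synthetic data, comparing
# full-batch gradient descent (O(n) per step) with single-sample SGD
# (O(1) per step). All names and constants are illustrative.

rng = np.random.default_rng(0)
n, d = 1000, 20
X = rng.normal(size=(n, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.1 * rng.normal(size=n)

def emp_risk(theta):
    # R_hat_n(theta) = (1/n) * sum_i loss(theta; x_i, y_i)
    return 0.5 * np.mean((X @ theta - y) ** 2)

def grad_single(theta, i):
    # gradient of the squared loss on a single example (x_i, y_i)
    return (X[i] @ theta - y[i]) * X[i]

eta = 0.01
theta_gd = np.zeros(d)
theta_sgd = np.zeros(d)

for t in range(2000):
    # Full-batch GD: averages all n per-example gradients each step.
    theta_gd -= eta * np.mean((X @ theta_gd - y)[:, None] * X, axis=0)
    # SGD: one uniformly sampled example per step.
    i_t = rng.integers(n)
    theta_sgd -= eta * grad_single(theta_sgd, i_t)

print(f"GD  empirical risk: {emp_risk(theta_gd):.4f}")
print(f"SGD empirical risk: {emp_risk(theta_sgd):.4f}")
```

Each SGD update touches one example, so its per-iteration cost is independent of $n$; mini-batch SGD interpolates between the two by averaging the gradient over a small random batch.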
Spectrally-Normalized Margin Bounds for Neural Networks (Bartlett, Foster, and Telgarsky, arXiv:1706.08498, 2017)

Abstract. This paper presents a margin-based multiclass generalization bound for neural networks that scales with their margin-normalized spectral complexity: their Lipschitz constant, meaning the product of the spectral norms of the weight matrices, times a certain correction factor. This bound is empirically investigated for a standard AlexNet network trained with SGD on the mnist and cifar10 datasets, with both original and random labels; the bound, the Lipschitz constants, and the excess risks are all in direct correlation, suggesting both that SGD selects predictors whose complexity scales with the difficulty of the learning task, and that the presented bound is sensitive to this complexity.

Overview. Neural networks owe their astonishing success not only to their ability to fit any data set: they also generalize well, meaning they provide a close fit on unseen data. A classical statistical adage is that models capable of fitting too much will generalize poorly; what's going on here? Let's navigate the many possible explanations provided by statistical theory. A first observation is that any analysis based solely on the number of possible labellings on a finite training set, as is the case with VC dimension, is doomed: if the function class can fit all possible labels (as is the case with neural networks in standard configurations (Zhang et al., 2017)), then this analysis cannot distinguish it from the collection of all possible functions. The slide's takeaway: SGD with early stopping regularizes.

Figure 1 (Bartlett et al. 2017; generalization error of AlexNet on cifar10): AlexNet (Krizhevsky et al., 2012) trained with SGD on cifar10, with both original and random labels. Triangle-marked curves track excess risk across training epochs (on a log scale), with an 'x' marking the earliest epoch with zero training error. Circle-marked curves track Lipschitz constants, normalized so that the two curves for random labels meet. The Lipschitz constants tightly correlate with excess risk, and normalizing them by margins (the square-marked curve) neutralizes growth across epochs.

Figure 2 ((a) margins, (b) normalized margins): margin distributions at the end of training AlexNet on cifar10, with and without random labels. With proper normalization, random labels demonstrably correspond to a harder problem.

Setup. For weights $\mathcal{A} = (A_1, \ldots, A_L)$, let $F_\mathcal{A}$ denote the function computed by the corresponding network:

$$F_\mathcal{A}(x) := \sigma_L(A_L \sigma_{L-1}(A_{L-1} \cdots \sigma_1(A_1 x) \cdots)). \tag{1.1}$$

The network output $F_\mathcal{A}(x) \in \mathbb{R}^{d_L}$ (with $d_0 = d$ and $d_L = k$) is converted to a class label in $\{1, \ldots, k\}$ by taking the arg max over components, with an arbitrary rule for breaking ties. Whenever input data $x_1, \ldots, x_n \in \mathbb{R}^d$ are given, collect them as rows of a matrix $X \in \mathbb{R}^{n \times d}$. Occasionally, notation will be overloaded to discuss $F_\mathcal{A}(X^\top)$, a matrix whose $i$-th column is $F_\mathcal{A}(x_i)$. Let $W$ denote the maximum of $\{d, d_1, \ldots, d_L\}$. The $\ell_2$ norm $\|\cdot\|_2$ is always computed entry-wise; thus, for a matrix, it corresponds to the Frobenius norm. Next, define a collection of reference matrices $(M_1, \ldots, M_L)$ with the same dimensions as $A_1, \ldots, A_L$; for instance, to obtain a good bound for ResNet (He et al., 2016), it is sensible to set $M_i := I$, the identity map, and the bound below will worsen as the network moves farther from the identity map; for AlexNet (Krizhevsky et al., 2012), the simple choice $M_i = 0$ suffices. A small sketch of this network function and the associated margins follows.
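The sketch below is illustrative only, with assumed layer sizes and random weights in place of the paper's trained AlexNet: it evaluates the network function $F_\mathcal{A}$ of (1.1) with ReLU nonlinearities and computes the per-example margins $f(x)_y - \max_{j \neq y} f(x)_j$ that enter the empirical margin loss $\hat{R}_\gamma$.

```python
import numpy as np

# Illustrative sketch (assumed shapes, random weights): evaluate
# F_A(x) = sigma_L(A_L ... sigma_1(A_1 x)) and compute per-example
# margins f(x)_y - max_{j != y} f(x)_j used in the margin loss.

rng = np.random.default_rng(0)
dims = [32, 64, 64, 10]                 # d_0 = input dim, d_L = k classes
A = [rng.normal(size=(dims[i + 1], dims[i])) / np.sqrt(dims[i])
     for i in range(len(dims) - 1)]

def forward(x):
    # ReLU (1-Lipschitz, sigma(0) = 0) on hidden layers, identity on output.
    for A_i in A[:-1]:
        x = np.maximum(A_i @ x, 0.0)
    return A[-1] @ x

def margins(X, y):
    # X: (n, d_0) inputs, y: (n,) integer labels in {0, ..., k-1}
    scores = np.stack([forward(x) for x in X])        # (n, k)
    correct = scores[np.arange(len(y)), y]
    others = scores.copy()
    others[np.arange(len(y)), y] = -np.inf
    return correct - others.max(axis=1)               # f(x)_y - max_{j!=y} f(x)_j

X = rng.normal(size=(128, dims[0]))
y = rng.integers(0, dims[-1], size=128)
gamma = 0.5
m = margins(X, y)
# Empirical margin loss R_hat_gamma: fraction of examples with margin <= gamma.
print("R_hat_gamma:", np.mean(m <= gamma))
```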
Finally, let $\|\cdot\|_\sigma$ denote the spectral norm, and let $\|\cdot\|_{p,q}$ denote the $(p, q)$ matrix norm, defined by $\|A\|_{p,q} = \big\|(\|A_{:,1}\|_p, \ldots, \|A_{:,m}\|_p)\big\|_q$ for $A \in \mathbb{R}^{d \times m}$.

Margin- and network-Lipschitz-based generalization error bound (Bartlett et al. 2017). The spectral complexity $R_{F_\mathcal{A}} = R_\mathcal{A}$ of a network $F_\mathcal{A}$ with weights $\mathcal{A}$ is defined as

$$R_\mathcal{A} := \left( \prod_{i=1}^{L} \rho_i \|A_i\|_\sigma \right) \left( \sum_{i=1}^{L} \frac{\|A_i^\top - M_i^\top\|_{2,1}^{2/3}}{\|A_i\|_\sigma^{2/3}} \right)^{3/2}. \tag{1.2}$$

The following theorem provides a generalization bound for neural networks whose nonlinearities are fixed but whose weight matrices $\mathcal{A}$ have bounded spectral complexity $R_\mathcal{A}$.

Theorem 1.1. Let nonlinearities $(\sigma_1, \ldots, \sigma_L)$ and reference matrices $(M_1, \ldots, M_L)$ be given as above (i.e., $\sigma_i$ is $\rho_i$-Lipschitz and $\sigma_i(0) = 0$). Then for $(x, y), (x_1, y_1), \ldots, (x_n, y_n)$ drawn iid from any probability distribution over $\mathbb{R}^d \times \{1, \ldots, k\}$, with probability at least $1 - \delta$ over $((x_i, y_i))_{i=1}^n$, every margin $\gamma > 0$ and network $F_\mathcal{A}: \mathbb{R}^d \to \mathbb{R}^k$ with weight matrices $\mathcal{A} = (A_1, \ldots, A_L)$ satisfy

$$\Pr\Big[ \arg\max_j F_\mathcal{A}(x)_j \neq y \Big] \le \hat{R}_\gamma(F_\mathcal{A}) + \tilde{O}\!\left( \frac{\|X\|_2 R_\mathcal{A}}{\gamma n} \ln(W) + \sqrt{\frac{\ln(1/\delta)}{n}} \right),$$

where $\hat{R}_\gamma(f) \le n^{-1} \sum_i \mathbf{1}\big[ f(x_i)_{y_i} \le \gamma + \max_{j \neq y_i} f(x_i)_j \big]$ and $\|X\|_2 = \sqrt{\sum_i \|x_i\|_2^2}$.

The full proof and a generalization beyond spectral norms are relegated to the appendix, but a sketch is provided in Section 3, along with a lower bound. Section 3 also gives a discussion of related work: briefly, margin- and Lipschitz-sensitive bounds have a long history in the neural networks literature (Bartlett, 1996; Anthony and Bartlett, 1999; Neyshabur et al., 2015). A hedged sketch of computing the spectral complexity $R_\mathcal{A}$ from a list of weight matrices follows.
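As a concrete reading of equation (1.2), the following hedged sketch computes $R_\mathcal{A}$ for a list of weight matrices; the random weights, the AlexNet-style reference $M_i = 0$, and $\rho_i = 1$ (ReLU) are assumptions made for illustration, not the paper's experimental setup.

```python
import numpy as np

# Hedged sketch of the spectral complexity R_A of equation (1.2) for a
# list of weight matrices A_i, reference matrices M_i, and Lipschitz
# constants rho_i (rho_i = 1 for ReLU). Weights are random placeholders.

def spectral_norm(B):
    return np.linalg.norm(B, ord=2)            # largest singular value

def norm_2_1(B):
    # ||B||_{2,1}: sum of the l2 norms of the columns of B
    return np.linalg.norm(B, axis=0).sum()

def spectral_complexity(A, M=None, rho=None):
    L = len(A)
    M = M if M is not None else [np.zeros_like(A_i) for A_i in A]   # M_i = 0
    rho = rho if rho is not None else [1.0] * L                     # ReLU
    # Lipschitz part: product of rho_i * ||A_i||_sigma over layers.
    lipschitz = np.prod([r * spectral_norm(A_i) for r, A_i in zip(rho, A)])
    # Correction factor: ( sum_i ||A_i^T - M_i^T||_{2,1}^{2/3} / ||A_i||_sigma^{2/3} )^{3/2}
    correction = sum(
        norm_2_1(A_i.T - M_i.T) ** (2 / 3) / spectral_norm(A_i) ** (2 / 3)
        for A_i, M_i in zip(A, M)
    ) ** (3 / 2)
    return lipschitz * correction

rng = np.random.default_rng(0)
dims = [32, 64, 64, 10]
A = [rng.normal(size=(dims[i + 1], dims[i])) / np.sqrt(dims[i])
     for i in range(len(dims) - 1)]
print("R_A =", spectral_complexity(A))
```

Tracking this quantity (or its margin-normalized version) across training epochs is what produces the correlation with excess risk shown in Figure 1.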