A Spectral Nonlocal Block for Deep Neural Networks

A Spectral Nonlocal Block for Deep Neural Networks Lei Zhu* 1 Qi She* 1 Lidan Zhang 1 Ping Guo 1 Intel Research Lab China, Beijing, China [email protected] Abstract Sun, 2015). Secondly, stacking more layers cannot always The nonlocal-based blocks are designed for cap- increase the effective receptive fields (Luo et al., 2016), turing long-range spatial-temporal dependencies which indicates the convolutional layers may still lack the in computer vision tasks. Although having shown mechanism to efficiently model these dependencies. excellent performances, they lack the mechanism To address these issues, inspired by the classical “nonlocal to encode the rich, structured information among mean” method in image denoising field, Wang et al.(2018) elements in an image. In this paper, to theoreti- proposes the nonlocal block to encode the “full-range” de- cally analyze the property of these nonlocal-based pendencies in one module by exploring the relation between blocks, we provide a unified approach to interpret- pairwise positions. For each central position, the nonlocal ing them, where we view them as a graph filter block firstly computes the pairwise relations between the generated on a fully-connected graph. When the central position and all positions to form an attention map, graph filter is approximated by Chebyshev poly- and then aggregates the features of all positions by weighted nomials, a generalized formulation can be derived mean according to the attention map. The aggregated fea- for explaining the existing nonlocal-based blocks tures are filtered by the convolutional operator and finally (e:g:; nonlocal block, nonlocal stage, double at- added to the features of each central position to form the tention block). Furthermore, we propose an effi- output. Due to its simplicity and effectiveness, the nonlo- cient and robust spectral nonlocal block, which cal neural network1 has been widely applied in image and can be flexibly inserted into deep neural networks video classification (Wang et al., 2018; Yue et al., 2018; Tao to catch the long-range dependencies between spa- et al., 2018; Chen et al., 2018), image segmentation (Huang tial pixels or temporal frames. Experimental re- et al., 2019; Yue et al., 2018; Wang et al., 2018) and person sults demonstrate the clear-cut improvements and re-identification (Liao et al., 2018; Zhang et al., 2019) tasks. practical applicabilities of the spectral nonlocal 2 block on image classification (Cifar-10/100, Ima- The nonlocal (mean) operator in the nonlocal block is re- geNet), fine-grained image classification (CUB- lated to the spatial-based graph convolution, which takes the 200), action recognition (UCF-101), and person aggregation of the central position/node and its neighbor po- re-identification (ILID-SVID, Mars, Prid-2011) sitions/nodes in the image/graph to get a new representation. tasks. Battaglia et al.(2018) shows the relation of the nonlocal block and the spatial-based graph convolutions in Message Passing Network (Gilmer et al., 2017). However, the nonlo- arXiv:1911.01059v4 [cs.CV] 10 Feb 2020 1. Introduction cal block focuses on a fully-connected graph, considering all the other positions in the image. Instead, the spatial based Capturing the long-range spatial-temporal dependencies be- graph convolution focuses on a sparse graph, which natu- tween spatial pixels or temporal frames plays a crucial role rally eliminates the redundant information. Specifically, the in the computer vision tasks. Convolutional neural net- affinity matrix (the similarity metric of pairwise positions) works (CNNs) are inherently limited by their convolution computed in the nonlocal block is obtained from all the fea- operators which are devoted to capture local features and re- tures of the upper layer, thus leading to the interference of lations, e:g:; a 7 × 7 region, and are inefficient in modeling feature aggregations. Therefore, the current nonlocal block long-range dependencies. Deep CNNs model these depen- needs an elaborate arrangement for its position, number, and dencies, which commonly refers to enlarge receptive fields, channel number to eliminate this negative effect. via stacking multiple convolution operators. However, two 1 unfavorable issues are raised in practice. Firstly, repeating The nonlocal neural network is the deep CNNs with nonlocal block inserted 2 convolutional operations comes with higher computation The nonlocal block consists a nonlocal operator and a residual connection and memory cost as well as the risk of over-fitting (He & Spectral Nonlocal Block Figure 1. The spatial (A) and spectral (B) view of a nonlocal block. The pink dots indicate each patch in the feature map and the “Aggregation” means calculating the weighted mean as the numerator of Eq. (1). The dotted arrows mean “copy” and full arrows mean “feed forward”. The green bars are the node features and the length means their strength (best view in color). To increase the robustness and applicability of the nonlocal ing nonlocal blocks in multiple vision tasks, including block in real-world applications, from the spectral-based image classification, fine-grained image classification, graph convolution views (Defferrard et al., 2016b;a; Levie action recognition, and person re-identification. et al., 2018), we reformulate the nonlocal block based on the property of graph spectral domain. As shown in Fig.1, the 2. Preliminary input image is fed into the convolutional layers to extract discriminative features such as the wing, the head, the claw In this paper, we use bold uppercase characters to denote and the neck. These features can be seen as the input of the matrix-valued random variable and italic bold uppercase the nonlocal block. Different from the spatial view which to denote the matrix. Vectors are denoted with lowercase. firstly aggregates the input features by weighted mean and The Nonlocal Block (NL) follows the nonlocal operator that then uses convolutional operator to filter as in Fig.1 A, our calculates a weighted mean between the features of each spectral view constructs a fully-connected graph based on position and all possible positions as shown in Fig.1 A. The their similarity and then directly filters the input features in nonlocal operator is defined as: a global view profited by the graph filter shown in Fig.1 B. P h i In practice, the Chebyshev polynomials is utilized to ap- j f(Xi;:; Xj;:)g(Xj;:) proximate the graph filter for reducing the number of pa- F (Xi;:) = ; (1) P f(X ; X ) rameters and computational cost (Phillips, 2003). This ap- j i;: j;: proximated formulation has successfully filled the gap be- where X 2 RN×C1 is the input feature map, i; j are the tween the spectral view and the spatial view of the nonlocal position indexes in the feature map, f(·) is the affinity kernel block (Wu et al., 2019). Thus other extended nonlocal-based which can adopt the “Dot Product”, “Traditional Gaussian”, e:g:; blocks ( nonlocal block, nonlocal stage, double atten- “Embedded Gaussian” or other kernel metrics with a finite tion block) can be further theoretically interpreted in the Frobenius norm. g(·) is a linear embedding that is defined as: spectral view. C1×Cs g(Xj;:) = Xj;:WZ with WZ 2 R . Here N is the Based on the points above, we propose Spectral Nonlocal total positions of each features and C1;Cs are the number of Block (SNL) that concerns the rich, structured information channels for the input and the transferred features. j When in an image via encoding the graph structure. The SNL inserting the NL block into the network structure, a linear guarantees the existence of the graph spectral domain and transformation and a residual connection are added: has more mathematical guarantees to improve its robustness and accuracy. In a nutshell, our contributions are threefold: Yi;: = Xi;: + F (Xi;:)W; (2) where W 2 RCs×C1 is the weight matrix. • We have theoretically bridged the gap between nonlocal block (spatial-based approaches) and the graph spectral filter method (spectral-based approach) 3. Spectral Nonlocal (SNL) Block • We propose a spectral nonlocal block as an efficient, The nonlocal operator can be explained under the graph simple, and generic component of deep neural net- spectral domain. It can be briefly divided into two steps: works, which captures long-range spatial-temporal de- generating a fully-connected graph to model the relation pendencies between spatial pixels or temporal frames. between the position pairs; converting the input features into the graph domain and learning a graph filter. In this section, • The SNL achieves a clear-cut improvement over exist- we firstly give the definition of the spectral view for the Spectral Nonlocal Block nonlocal operator. Then, we interpret other nonlocal-based Property 1. Theorem.1 requires the graph Laplacian L operators from this spectral view. Finally, we propose the has non-singular eigenvalues and eigenvectors. Thus, the Spectral Nonlocal (SNL) Block and highlight its properties. affinity matrix A should be non-negative and symmetric. Remark 1. Based on Theorem.1, new nonlocal operators 3.1. Nonlocal block in the spectral view can be theoretically designed by using different types of graph filter such as the Chebyshev filter (Defferrard et al., The matrix form of the nonlocal operator in Eq. (1) is: 2016a;b), the graph wavelet filter (Hammond et al., 2011) −1 F (X) = (DM M)ZW = AZW; (3) and the Caylet filter (Levie et al., 2018). In this work, we utilize the Chebyshev filter. where M = (M ), M = f(X ; X ) and Z = XW . ij ij i;: j;: Z Different from the spectral-based graph con- The matrix M 2 N×N is composed by pairwise similari- Remark 2. R volutions that focus on the sparse graph with fixed struc- ties between pixels. Z 2 N×Cs is the transferred feature R ture (Defferrard et al., 2016a), Theorem.1 focuses on the map that compresses the channels of X by a linear trans- fully-connected graph structure in which affinity matrix A formation with W 2 C1×Cs .

Load more