
Orthogonal Convolutional Neural Networks

Jiayun Wang Yubei Chen Rudrasis Chakraborty Stella X. Yu UC Berkeley / ICSI {peterwg,yubeic,rudra,stellayu}@berkeley.edu

Abstract

Deep convolutional neural networks are hindered by training instability and feature redundancy towards further performance improvement. A promising solution is to impose orthogonality on convolutional filters.

We develop an efficient approach to impose filter orthogonality on a convolutional layer based on the doubly block-Toeplitz matrix representation of the convolutional kernel, instead of using the common kernel orthogonality approach, which we show is only necessary but not sufficient for ensuring orthogonal convolutions.

Our proposed orthogonal convolution requires no additional parameters and little computational overhead. This method consistently outperforms the kernel orthogonality alternative on a wide range of tasks such as image classification and inpainting under supervised, semi-supervised and unsupervised settings. Further, it learns more diverse and expressive features with better training stability, robustness, and generalization. Our code is publicly available.

Figure 1. Our OCNN can remove correlations among filters and result in consistent performance gains over the standard convolution baseline and the alternative kernel orthogonality baseline (kernel orth) during testing. a) Normalized histograms of pairwise filter similarities of ResNet34 for ImageNet classification show increasing correlation among standard convolutional filters with depth. b) A standard convolutional layer has a long-tailed spectrum. While kernel orthogonality widens the spectrum, our OCNN can produce a more ideal uniform spectrum. c) Filter similarity (for layer 27 in a) is reduced most with our OCNN. d) Classification accuracy on CIFAR100 always increases the most with our OCNN.

Table 1. Summary of experiments and OCNN gains.
Task                                         Metric                        Gain
Image Classification:
  CIFAR100 classification                    accuracy                      3%
  ImageNet classification                    accuracy                      1%
  semi-supervised learning                   classification accuracy      3%
Feature Quality:
  fine-grained image retrieval               kNN classification accuracy   3%
  unsupervised image inpainting              PSNR                          4.3
  image generation                           FID                           1.3
  Cars196                                    NMI                           1.2
Robustness:
  black box attack                           attack time                   7x

1. Introduction

While convolutional neural networks (CNNs) are widely successful [36, 14, 50], several challenges still exist: over-parameterization or under-utilization of model capacity [21, 12], exploding or vanishing gradients [7, 17], growth in saddle points [13], and shifts in feature statistics [31]. Through our analysis of these issues, we observe that convolutional filters learned in deeper layers are not only highly correlated and thus redundant (Fig.1a), but that each layer also has a long-tailed spectrum as a linear operator (Fig.1b), contributing to unstable training performance from exploding or vanishing gradients.

We propose orthogonal CNN (OCNN), where a convolutional layer is regularized with orthogonality constraints during training. When filters are learned to be as orthogonal as possible, they become de-correlated and their filter responses are much less redundant. Therefore, the model capacity is better utilized, which improves the feature expressiveness and consequently the task performance.

Specifically, we show that simply by regularizing convolutions with our orthogonality loss during training, networks produce more uniform spectra (Fig.1b) and more diverse features (Fig.1c), delivering consistent performance gains with various network architectures (Fig.1d) on various tasks, e.g. image classification/retrieval, image inpainting, image generation, and adversarial attacks (Table 1).

Many works have proposed the orthogonality of linear operations as a type of regularization in training deep neural networks. Such a regularization improves the stability and performance of CNNs [5, 57, 3, 4], since it can preserve energy, make spectra uniform [61], stabilize the activation distribution in different network layers [46], and remedy the exploding or vanishing gradient issues [1].

Existing works impose orthogonality constraints as kernel orthogonality, whereas ours directly implements orthogonal convolutions, based on an entirely different formulation of a convolutional layer as a linear operator. Orthogonality for a convolutional layer Y = Conv(K, X) can be introduced in two different forms (Fig.2).

1. Kernel orthogonality methods [57, 3, 4] view convolution as multiplication between the kernel matrix K and the im2col [58, 26] matrix X̃, i.e. Y = K X̃. The orthogonality is enforced by penalizing the disparity between the Gram matrix of kernel K and the identity matrix, i.e. ‖KK^T − I‖. However, the construction of X̃ from input X is also a linear operation X̃ = QX, and Q has a highly nonuniform spectrum.

2. Orthogonal convolution keeps the input X and the output Y intact by connecting them with a doubly block-Toeplitz (DBT) matrix K of the filter K, i.e. Y = KX, and enforces the orthogonality of K directly. We can thus directly analyze the linear transformation properties between the input X and the output Y.

Existing works on CNNs adopt kernel orthogonality, due to its direct filter representation. We prove that kernel orthogonality is in fact only necessary but not sufficient for orthogonal convolutions. Consequently, the spectrum of a convolutional layer is still non-uniform and exhibits a wide variation even when the kernel matrix K itself is orthogonal (Fig.1b).

More recent works propose to improve the kernel orthogonality by normalizing spectral norms [40], regularizing mutual coherence [5], and penalizing off-diagonal elements [8]. Despite the improved stability and performance, the orthogonality of K is insufficient to make a linear convolutional layer orthogonal among its filters. In contrast, we adopt the DBT matrix form and regularize ‖Conv(K, K) − I_r‖ instead. While the kernel K is indirectly represented in the DBT matrix K, the representation of input X and output Y is intact and thus the orthogonality property of their transformation can be directly enforced.

We show that our regularization enforces orthogonal convolutions more effectively than kernel orthogonality methods, and we further develop an efficient approach for our OCNN regularization.

To summarize, we make the following contributions.

1. We provide an equivalence condition for orthogonal convolutions and develop efficient algorithms to implement orthogonal convolutions for CNNs.

2. With no additional parameters and little computational overhead, our OCNN consistently outperforms other orthogonal regularizers on image classification, generation, retrieval, and inpainting under supervised, semi-supervised, and unsupervised settings.

Better feature expressiveness, reduced feature correlation, more uniform spectrum, and enhanced adversarial robustness may underlie our performance gain.

Figure 2. Basic idea of our OCNN. A convolutional layer Y = Conv(K, X) can be formulated as matrix multiplications in two ways: a) im2col methods [58, 26] retain the kernel K and convert input X to the patch-matrix X̃. b) We retain input X and convert K to a doubly block-Toeplitz matrix K. With X and Y intact, we directly analyze the transformation from the input to the output. We further propose an efficient algorithm for regularizing K towards orthogonal convolutions and observe improved feature expressiveness, task performance and uniformity in K's spectrum (Fig.1b).

2. Related Works

Im2col-Based Convolutions. The im2col method [58, 26] has been widely used in deep learning as it enables efficient GPU computation. It transforms the convolution into a general matrix-matrix multiplication (GEMM) problem. Fig.2a illustrates the procedure. a) Given an input X, we first construct a new input-patch-matrix X̃ ∈ R^{Ck^2 × H'W'} by copying patches from the input and unrolling them into columns of this intermediate matrix. b) The kernel-patch-matrix K ∈ R^{M × Ck^2} can then be constructed by reshaping the original kernel tensor; here we use the same notation for simplicity. c) We can calculate the output Y = K X̃, which we reshape back to a tensor of size M × H' × W' -- the desired output of the convolution.

The orthogonal kernel regularization enforces the kernel K ∈ R^{M × Ck^2} to be orthogonal. Specifically, if M ≤ Ck^2, the row orthogonal regularizer is L_korth-row = ‖KK^T − I‖_F, where I is the identity matrix. Otherwise, column orthogonality may be achieved by L_korth-col = ‖K^T K − I‖_F.
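For concreteness, this kernel-orthogonality penalty can be written in a few lines of PyTorch. The following is a minimal sketch under our own naming (it is not the paper's released code): the kernel is flattened to the matrix K and the penalty is ‖KK^T − I‖_F or ‖K^T K − I‖_F depending on the kernel's shape.

```python
import torch

def kernel_orthogonality_loss(weight: torch.Tensor) -> torch.Tensor:
    """Kernel (not convolutional) orthogonality penalty.

    weight: conv kernel of shape (M, C, k, k). The kernel is flattened to a
    matrix K of shape (M, C*k*k); the penalty is ||K K^T - I||_F when
    M <= C*k*k (row case) and ||K^T K - I||_F otherwise (column case).
    """
    m = weight.shape[0]
    K = weight.reshape(m, -1)                      # (M, C*k*k)
    if m <= K.shape[1]:                            # row orthogonality
        gram = K @ K.t()
        eye = torch.eye(m, device=K.device, dtype=K.dtype)
    else:                                          # column orthogonality
        gram = K.t() @ K
        eye = torch.eye(K.shape[1], device=K.device, dtype=K.dtype)
    return torch.norm(gram - eye)                  # Frobenius norm
```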

Kernel Orthogonality in Neural Networks. Orthogonal kernels help alleviate gradient vanishing or exploding problems in recurrent neural networks (RNNs) [15, 56, 10, 1, 54, 45]. The effect of soft versus hard orthogonality constraints on the performance of RNNs is discussed in [54]. A cheap orthogonal constraint based on a parameterization from exponential maps is proposed in [10].

Orthogonal kernels have also been shown to stabilize the training of CNNs [46] and to make optimization more efficient [5]. Orthogonal weight initialization is proposed in [48, 39]; utilizing the norm-preserving property of orthogonal matrices, it is similar in effect to batch normalization [31]. However, the orthogonality may not be sustained as training proceeds [48]. To ensure orthogonality throughout training, Stiefel manifold-based optimization methods are used in [22, 43, 30] and are further extended to convolutional layers in [43].

Recent works relax and extend exact orthogonal weights in CNNs. Xie et al. enforce the Gram matrix of the weight matrix to be close to identity under the Frobenius norm [57]. Bansal et al. further utilize mutual coherence and the restricted isometry property [5]. Orthogonal regularization has also been observed to help improve the performance of image generation in generative adversarial networks (GANs) [8, 9, 40].

All the aforementioned works adopt kernel orthogonality for convolutions. Sedghi et al. utilize the DBT matrix to analyze singular values of convolutional layers but do not consider orthogonality [49].

Feature Redundancy. Optimized CNNs are known to have significant redundancy between different filters and feature channels [32, 29]. Many works use this redundancy to compress or speed up networks [20, 25, 29]. The highly nonuniform spectra may contribute to the redundancy in CNNs. To overcome the redundancy by improving feature diversity, multi-attention [60], diversity loss [38], and orthogonality regularization [11] have been proposed.

Other Ways to Stabilize CNN Training. To address unstable gradient and covariate shift problems, various methods have been proposed: initialize each layer with near-constant variances [17, 23]; use batch normalization to reduce internal covariate shift [31]; reparameterize the weight vectors and decouple their lengths from their directions [47]; use layer normalization with the mean and variance computed from all of the summed inputs to the neurons [2]; use a gradient norm clipping strategy to deal with exploding gradients and a soft constraint for vanishing gradients [45].

Figure 3. Convolution based on the doubly block-Toeplitz (DBT) matrix. We first flatten X to a vector x, and then convert the weight tensor K ∈ R^{M×C×k×k} to the DBT matrix K ∈ R^{(MH'W')×(CHW)}. The output is y = Kx. We can obtain the desired output Y ∈ R^{M×H'×W'} by reshaping y. The example has input size C × 4 × 4, kernel size M × C × 2 × 2 and stride 1.

3. Orthogonal Convolution

As we mentioned earlier, convolution can be viewed as an efficient matrix-vector multiplication, where the matrix K is generated by a kernel K. In order to stabilize the spectrum of K, we add a convolutional orthogonality regularization to CNNs, which is a stronger condition than kernel orthogonality. First, we discuss the view of convolution as a matrix-vector multiplication in detail. Then, fast algorithms for constraining row and column orthogonality in convolutions are proposed. Condition 3 summarizes the orthogonality. In this work, we focus on the 2D convolution case, but the concepts and conditions generalize to other cases.

3.1. Convolution as a Matrix-Vector Multiplication

For a convolutional layer with input tensor X ∈ R^{C×H×W} and kernel K ∈ R^{M×C×k×k}, we denote the convolution's output tensor Y = Conv(K, X), where Y ∈ R^{M×H'×W'}. We can further view K as M different filters {K_i ∈ R^{C×k×k}}. Since convolution is linear, we can rewrite Conv(K, X) in matrix-vector form:

$$Y = \mathrm{Conv}(K, X) \iff y = Kx \qquad (1)$$

where x is X flattened to a vector. Note that we adopt rigorous notation here, while x and X were not distinguished previously.
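As a sanity check on this matrix-vector view, the DBT matrix K can be materialized column by column by pushing unit impulses through conv2d and comparing Kx with the convolution output. The snippet below is our own toy verification, not part of the paper's implementation; the sizes are arbitrary.

```python
import torch
import torch.nn.functional as F

# Build the DBT matrix K explicitly from impulse responses and verify y = K x.
C, H, W, M, k, S, P = 2, 4, 4, 3, 2, 1, 0           # toy sizes, stride 1, no padding
kernel = torch.randn(M, C, k, k)
X = torch.randn(1, C, H, W)

Y = F.conv2d(X, kernel, stride=S, padding=P)          # (1, M, H', W')
Hp, Wp = Y.shape[-2:]

# Column i*H*W + h*W + w of K is the flattened response to the impulse E_{i,h,w}.
cols = []
for i in range(C):
    for h in range(H):
        for w in range(W):
            E = torch.zeros(1, C, H, W)
            E[0, i, h, w] = 1.0
            cols.append(F.conv2d(E, kernel, stride=S, padding=P).reshape(-1))
K_dbt = torch.stack(cols, dim=1)                      # (M*H'*W', C*H*W)

assert K_dbt.shape == (M * Hp * Wp, C * H * W)
assert torch.allclose(Y.reshape(-1), K_dbt @ X.reshape(-1), atol=1e-5)
```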
Each row of K has non-zero entries corresponding to a particular filter K_i at a particular spatial location. As a result, K can be constructed as a doubly block-Toeplitz (DBT) matrix K ∈ R^{(MH'W')×(CHW)} from the kernel tensor K ∈ R^{M×C×k×k}. We can obtain the output tensor Y by reshaping the vector y back to tensor form. Fig.3 depicts an example of a convolution based on the DBT matrix, where we have input size C × 4 × 4, kernel size M × C × 2 × 2 and stride 1.

3.2. Convolutional Orthogonality

Depending on the configuration of each layer, the corresponding matrix K ∈ R^{(MH'W')×(CHW)} may be a fat matrix (MH'W' ≤ CHW) or a tall matrix (MH'W' > CHW). In either case, we want to regularize the spectrum of K to be uniform. In the fat matrix case, the uniform spectrum requires a row orthogonal convolution, while the tall matrix case requires a column orthogonal convolution, where K is a normalized frame [33] and preserves the norm. In theory, we can implement the doubly block-Toeplitz matrix K and enforce the orthogonality condition in a brute-force fashion. However, since K is highly structured and sparse, a much more efficient algorithm exists. In the following, we show the equivalent conditions to the row and column orthogonality, which can be easily computed.

Row Orthogonality. As we mentioned earlier, each row of K corresponds to a filter K_i at a particular spatial location (h', w') flattened to a vector, denoted as K_{ih'w',·} ∈ R^{CHW}. The row orthogonality condition is:

$$\langle K_{ih_1w_1,\cdot},\, K_{jh_2w_2,\cdot}\rangle = \begin{cases} 1, & (i,h_1,w_1) = (j,h_2,w_2) \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$

In practice, we do not need to check pairs whose corresponding filter patches do not overlap. It is clear that ⟨K_{ih1w1,·}, K_{jh2w2,·}⟩ = 0 if either |h1 − h2| ≥ k or |w1 − w2| ≥ k, since the two flattened vectors have no support overlap and thus have a zero inner product. Thus, we only need to check Condition 2 where |h1 − h2|, |w1 − w2| < k. Due to the spatial symmetry, we can choose fixed h1, w1 and only vary i, j, h2, w2, where |h1 − h2|, |w1 − w2| < k.

Fig.4 shows examples of regions of overlapping filter patches. For a convolution with kernel size k and stride S, the region to check orthogonality can be realized by the original convolution with padding P = ⌊(k−1)/S⌋ · S. Now we have an equivalent condition to Condition 2 as the following self-convolution:

$$\mathrm{Conv}(K, K, \mathrm{padding}=P, \mathrm{stride}=S) = I_{r0} \qquad (3)$$

where I_{r0} ∈ R^{M×M×(2P/S+1)×(2P/S+1)} is a tensor which has zero entries except for the center M × M entries, which form an identity matrix. Minimizing the difference between Z = Conv(K, K, padding = P, stride = S) and I_{r0} gives us a near row-orthogonal convolution in terms of the DBT matrix K.

Figure 4. The spatial region to check for row orthogonality. It is only necessary to check overlapping filter patches for the row orthogonality condition. We show two example cases: (a) kernel size k = 3, stride S = 1 and (b) kernel size k = 4, stride S = 2. In both examples, the orange patch is the center patch, and the red border is the region of overlapping patches. For example, pink and purple patches fall into the red region and overlap with the center region; blue patches are not fully inside the red region and do not overlap with the orange one. We can use padding to obtain the overlapping regions.

Column Orthogonality. We use the tensor E_{i,h,w} ∈ R^{C×H×W} to denote an input tensor which has all zeros except a 1 entry at the i-th input channel, spatial location (h, w). Denoting e_{ihw} ∈ R^{CHW} as the flattened vector of E_{i,h,w}, we can obtain a column K_{·,ihw} of K by multiplying K and the vector e_{ihw}:

$$K_{\cdot,ihw} = K e_{ihw} = \mathrm{Conv}(K, E_{i,h,w}) \qquad (4)$$

Here, we slightly abuse the equality notation, as the reshaping is easily understood. The column orthogonality condition is:

$$\langle K_{\cdot,ih_1w_1},\, K_{\cdot,jh_2w_2}\rangle = \begin{cases} 1, & (i,h_1,w_1) = (j,h_2,w_2) \\ 0, & \text{otherwise} \end{cases} \qquad (5)$$

Similar to the row orthogonality, since the spatial size of K is only k, Condition 5 only needs to be checked in a local region where there is spatial overlap between K_{·,ih1w1} and K_{·,jh2w2}. For the stride-1 convolution case, there exists a simpler condition equivalent to Condition 5:

$$\mathrm{Conv}(K^T, K^T, \mathrm{padding}=k-1, \mathrm{stride}=1) = I_{c0} \qquad (6)$$

where K^T is the input-output transposed K, i.e. K^T ∈ R^{C×M×k×k}, and I_{c0} ∈ R^{C×C×(2k−1)×(2k−1)} has all zeros except for the center C × C entries, which form an identity matrix.

Comparison to Kernel Orthogonality. The kernel row- and column-orthogonality conditions can be written in the following convolution forms, respectively:

$$\mathrm{Conv}(K, K, \mathrm{padding}=0) = I_{r0}, \qquad \mathrm{Conv}(K^T, K^T, \mathrm{padding}=0) = I_{c0} \qquad (7)$$

where the tensors I_{r0} ∈ R^{M×M×1×1} and I_{c0} ∈ R^{C×C×1×1} are both equivalent to identity matrices, since there is only one spatial location. Obviously, the kernel orthogonality conditions 7 are necessary but not sufficient conditions for the orthogonal convolution conditions 3 and 6 in general. For the special case when the convolution stride is k, they are equivalent.

Row-Column Orthogonality Equivalence. The lemma below unifies the row orthogonality condition 2 and the column orthogonality condition 5. This lemma [37] gives a uniform convolution orthogonality independent of the actual shape of K and provides a unique regularization, min_K L_orth = ‖Z − I_{r0}‖²_F, which only depends on Condition 3.

Lemma 1. The row orthogonality and column orthogonality are equivalent in the MSE sense, i.e. ‖KK^T − I‖²_F = ‖K^T K − I'‖²_F + U, where U is a constant.

We leave the proof to Section D of the supplementary materials.

Orthogonal Regularization in CNNs. We add an additional soft orthogonal convolution regularization loss to the final loss of CNNs, so that the task objective and orthogonality regularization can be achieved simultaneously. Denoting λ > 0 as the weight of the orthogonal regularization loss, the final loss is:

$$L = L_{task} + \lambda L_{orth} \qquad (8)$$

where L_task is the task loss, e.g. softmax loss for image classification, and L_orth is the orthogonal regularization loss.
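A minimal PyTorch sketch of the row-orthogonality variant of this regularizer (Eq. 3) follows. The function name and the assumption of a square/fat DBT matrix are ours; the column case of Eq. 6 would instead convolve the input-output transposed kernel with padding k − 1.

```python
import torch
import torch.nn.functional as F

def conv_orthogonality_loss(weight: torch.Tensor, stride: int = 1) -> torch.Tensor:
    """Row-orthogonality OCNN penalty: ||Conv(K, K, padding=P, stride=S) - I_r0||_F^2.

    weight: (M, C, k, k) convolution kernel; assumes the square/fat DBT case.
    """
    m, _, k, _ = weight.shape
    p = ((k - 1) // stride) * stride                       # padding P = floor((k-1)/S) * S
    z = F.conv2d(weight, weight, stride=stride, padding=p)  # (M, M, 2P/S+1, 2P/S+1)
    target = torch.zeros_like(z)
    ctr = z.shape[-1] // 2                                  # center spatial position
    target[:, :, ctr, ctr] = torch.eye(m, device=z.device, dtype=z.dtype)
    return ((z - target) ** 2).sum()
```

In a training step, Eq. 8 is then realized as, for example, `loss = task_loss + lam * sum(conv_orthogonality_loss(m.weight) for m in model.modules() if isinstance(m, torch.nn.Conv2d))`.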

4. Experiments

We conduct three sets of experiments to evaluate OCNNs. The first set benchmarks our approach on the image classification datasets CIFAR100 and ImageNet. The second set benchmarks the performance under semi-supervised settings and focuses on the quality of the learned features: for high-level visual feature quality, we experiment on fine-grained bird image retrieval; for low-level visual features, we experiment on unsupervised image inpainting. Additionally, we compare visual feature quality on image generation tasks. The third set of experiments focuses on the robustness of OCNN under adversarial attacks. We also analyze OCNNs in terms of the DBT matrix K's spectrum, feature similarity, hyperparameter tuning, and space/time complexity.

4.1. Classification on CIFAR100

The key novelty of our approach is the orthogonal regularization term on convolutional layers. We compare both conv-orthogonal and kernel-orthogonal regularizers on CIFAR-100 [35] and evaluate the image classification performance using ResNet [24] and WideResNet [59] as backbone networks. The kernel-orthogonality and our conv-orthogonality are added as additional regularization terms, without modifying the network architecture. Hence, the number of parameters of the network does not change.

ResNet and Row Orthogonality. Though we have derived a unified orthogonal convolution regularizer, we benchmark its effectiveness in two different settings. Convolutional layers in ResNet [24] usually preserve or reduce the dimension from input to output, i.e. the DBT matrix K would be a square or fat matrix. In this case, our regularizer leads to the row orthogonality condition. Table 2 shows top-1 classification accuracies on CIFAR100. Our approach achieves 78.1%, 78.7%, and 79.5% image classification accuracy with ResNet18, ResNet34 and ResNet50, respectively. For the three backbone models, OCNNs outperform plain baselines by 3%, 2%, and 1%, and kernel orthogonal regularizers by 2%, 1%, and 1%.

Table 2. Top-1 accuracies on CIFAR100. Our OCNN outperforms baselines and the SOTA orthogonal regularizations.
                             ResNet18   ResNet34   ResNet50
baseline [24]                  75.3       76.7       78.5
kernel orthogonality [57]      76.5       77.5       78.8
OCNN (ours)                    78.1       78.7       79.5

WideResNet and Column Orthogonality. Unlike ResNet, WideResNet [59] has more channels and some tall DBT matrices K. When the corresponding DBT matrix K of a convolutional layer increases dimensionality from the input to the output, our OCNN leads to the column orthogonality condition. Table 3 reports the performance of column orthogonal regularizers with the WideResNet28 backbone on CIFAR100. Our OCNNs achieve 3% and 1% gains over plain baselines and kernel orthogonal regularizers, respectively.

Table 3. WideResNet [59] performance. We observe improved performance of OCNNs.
        WideResNet [59]   Kernel orth [57]   OCNN
Acc.         77.0              79.3          80.1

4.2. Classification on ImageNet

We add conv-orthogonal regularizers to the backbone model ResNet34 on ImageNet [14], and compare OCNNs with state-of-the-art orthogonal regularization methods.

Experimental Settings. We follow the standard training and evaluation protocol of ResNet34. In particular, training lasts 90 epochs in total. We start the learning rate at 0.1, decreasing it by a factor of 0.1 every 30 epochs, with weight decay 1e-4. The weight λ of the regularization loss is 0.01, the model is trained using SGD with momentum 0.9, and the batch size is 256.

Comparisons. Our method is compared with hard orthogonality OMDSM [30], kernel orthogonality [57] and spectral restricted isometry property regularization (SRIP) [5]. Table 4 shows the top-1 and top-5 errors on ImageNet. Without additional modification to the backbone model, OCNN achieves 25.87% top-1 and 7.89% top-5 error. The proposed method outperforms the plain baseline, as well as other orthogonal regularizations, by about 1%.
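The experimental settings above map directly onto standard PyTorch components. The sketch below only fixes the stated hyper-parameters; the model stand-in and variable names are placeholders, not the authors' training script.

```python
import torch

model = torch.nn.Conv2d(3, 64, 3)    # stand-in for the ResNet34 backbone
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
# decay the learning rate by 0.1 every 30 epochs, for 90 epochs total
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
lam = 0.01                            # weight of the orthogonality loss in Eq. 8
batch_size = 256
```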

Table 4. Top-1 and Top-5 errors on ImageNet [14] with ResNet34 [24]. Our conv-orthogonal regularization outperforms baselines and SOTA orthogonal regularizations.
                              Top-1 error   Top-5 error
ResNet34 (baseline) [24]         26.70         8.58
OMDSM [30]                       26.88         8.89
kernel orthogonality [57]        26.68         8.43
SRIP [5]                         26.10         8.32
OCNN (ours)                      25.87         7.89

4.3. Semi-Supervised Learning

A general regularizer should provide benefits for a variety of tasks. A common scenario that benefits from regularization is semi-supervised learning, where we have a large amount of data with limited labels. We randomly sample a subset of CIFAR100 as labeled and treat the rest as unlabeled. The orthogonal regularization is added to the base model ResNet18 without any additional modifications. The classification performance is evaluated on the entire validation set for all different labeled subsets.

We compare OCNN with kernel-orthogonal regularization while varying the proportion of labeled data from 10% to 80% of the entire dataset (Table 5). OCNN consistently outperforms the baseline by 2% - 3% under different fractions of labeled data.

Figure 5. Image retrieval results on the CUB-200 Birds dataset. The model (ResNet34) is trained on ImageNet only. The first row shows our OCNN results, while the second row shows the baseline model results. Ours achieves 2% and 3% top-1 and top-5 k-nearest-neighbor classification gains.

Table 5. Top-1 accuracies on CIFAR100 with different fractions of labeled data. OCNNs are consistently better.

% of training data            10%    20%    40%    60%    80%    100%
ResNet18 [24]                31.2   47.9   60.9   66.6   69.1   75.3
kernel orthogonality [57]    33.7   50.5   63.0   68.8   70.9   76.5
Conv-orthogonality (ours)    34.5   51.0   63.5   69.2   71.5   78.1
Our gain                      3.3    3.1    2.6    2.6    2.4    2.8

Figure 6. Image inpainting results compared with deep image prior [52]. Top: comparison on a text inpainting example. Bottom: comparison on inpainting 50% of missing pixels. In both cases, our approach outperforms previous methods.

4.4. Fine-grained Image Retrieval

We conduct fine-grained image retrieval experiments on the CUB-200 bird dataset [55] to understand the high-level visual feature quality of OCNNs. Specifically, we directly use the ResNet34 model trained on ImageNet (from Section 4.2) to obtain features of images in CUB-200, without further training on the dataset. We observe improved results with OCNNs (Fig.5). With conv-orthogonal regularizers, the top-1 k-nearest-neighbor classification accuracy improves from 25.1% to 27.0%, and the top-5 k-nearest-neighbor classification accuracy improves from 39.4% to 42.3%.

4.5. Unsupervised Image Inpainting

To further assess the generalization capacity of OCNNs, we add the regularization term to the new task of unsupervised inpainting. In image inpainting, one is given an image X0 with missing pixels in correspondence with a binary mask M ∈ {0, 1}^{C×H×W} of the same size as the image. The goal is to reconstruct the original image X by recovering the missing pixels:

$$\min_X E(X; X_0) = \min_X \|(X - X_0) \odot M\|_F^2 \qquad (9)$$

Deep image prior (DIP) [52] proposed to use the prior implicitly captured by the choice of a particular generator network f_θ with parameters θ. Specifically, given a code vector/tensor z, DIP uses a CNN as a parameterization X = f_θ(z). The reconstruction goal in Eqn. 9 can then be written as:

$$\min_\theta \|(f_\theta(z) - X_0) \odot M\|_F^2 \qquad (10)$$

The network can be optimized without training data to recover X. We further add our conv-orthogonal regularization as an additional prior to the reconstruction goal, to validate whether the proposed regularization helps the inpainting:

$$\min_\theta \|(f_\theta(z) - X_0) \odot M\|_F^2 + \lambda L_{orth}(\theta) \qquad (11)$$
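Assuming a DIP-style generator output is available, the combined objective of Eq. 11 can be evaluated as below. This is a hedged sketch with our own function and argument names; the orthogonality term is simply passed in (e.g. accumulated over the generator's convolutional layers with the regularizer sketched in Sec. 3.2).

```python
import torch

def dip_inpainting_objective(output: torch.Tensor,
                             x0: torch.Tensor,
                             mask: torch.Tensor,
                             orth_loss: torch.Tensor,
                             lam: float = 1e-2) -> torch.Tensor:
    """Masked reconstruction objective of Eq. 11 (sketch).

    output = f_theta(z) is the generator prediction, x0 the corrupted image,
    mask the binary {0,1} tensor of observed pixels, and orth_loss the
    orthogonality penalty L_orth(theta) of the generator's conv layers.
    """
    recon = (((output - x0) * mask) ** 2).sum()   # ||(f_theta(z) - X0) (.) M||_F^2
    return recon + lam * orth_loss
```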

In the first example (Fig.6, top), inpainting is used to remove text overlaid on an image. Compared with DIP [52], our orthogonal regularization leads to improved reconstruction of details, especially the smoothed face outline and finer teeth reconstruction.

The second example (Fig.6, bottom) considers inpainting with masks randomly sampled according to a binary Bernoulli distribution. Following the procedure in [44, 52], we sample a mask to randomly drop 50% of the pixels. For a fair comparison, all the methods adopt the same mask. We observe improved background quality, as well as finer reconstruction of the texture of the butterfly wings.

We report quantitative PSNR comparisons on the standard dataset [26] in Table 6. OCNN outperforms the previous state-of-the-art DIP [52] and convolutional sparse coding [44]. We also observe performance gains compared to kernel orthogonal regularizations.

Table 6. Quantitative comparisons on the standard inpainting dataset [27]. Our conv-orthogonality outperforms the SOTA methods.
                                        Barbara  Boat   House  Lena   Peppers  C.man  Couple  Finger  Hill   Man    Montage
Convolutional dictionary learning [44]  28.14    31.44  34.58  35.04  31.11    27.90  31.18   31.34   32.35  31.92  28.05
Deep image prior (DIP) [52]             32.22    33.06  39.16  36.16  33.05    29.80  32.52   32.84   32.77  32.20  34.54
DIP + kernel orthogonality [57]         34.88    34.93  38.53  37.66  34.58    33.18  33.71   34.40   35.98  32.93  36.99
DIP + conv-orthogonality (ours)         38.12    35.15  41.73  39.76  37.75    38.21  35.88   36.87   39.89  33.57  38.48

4.6. Image Generation

Orthogonal regularizers have shown great success in improving the stability and performance of GANs [9, 40, 8]. We analyze the influence of convolutional orthogonal regularizers on GANs with the best architecture reported in [18]. Training takes 320 epochs with the OCNN regularizer applied to both the generator and the discriminator. The regularizer loss weight λ is set to 0.01, while other settings are retained as default. The reported model is evaluated 5 times with 50k images each. We achieve an inception score (IS) of 8.63 ± 0.007 and a Fréchet inception distance (FID) of 11.75 ± 0.04 (Table 7), outperforming the baseline and achieving state-of-the-art performance. Additionally, we observe faster convergence of GANs with our regularizer (Fig.7).

Table 7. Inception Score and Fréchet Inception Distance comparison on CIFAR10. Our OCNN outperforms the baseline [18] by 0.3 IS and 1.3 FID.
                  IS      FID
PixelCNN [53]    4.60    65.93
PixelIQN [42]    5.29    49.46
EBM [16]         6.78    38.20
SNGAN [40]       8.22    21.70
BigGAN [8]       9.22    14.73
AutoGAN [18]     8.32    13.01
OCNN (ours)      8.63    11.75

Figure 7. OCNNs have faster convergence for GANs. For IS (left) and FID (right), OCNNs consistently outperform CNNs [18] at every epoch.
4.7. Robustness under Attack

The uniform spectrum of K makes each convolutional layer approximately a 1-Lipschitz function. Given a small perturbation to the input, Δx, the change of the output, Δy, is bounded to be small. Therefore, the model enjoys robustness under attack. Our experiments demonstrate that it is much harder to search for adversarial examples.

We adopt the simple black-box attack [19] to evaluate the robustness of the baseline and OCNN with a ResNet18 [24] backbone architecture trained on CIFAR100. The attack samples around the input image and finds a "direction" in which to rapidly decrease the classification confidence of the network by manipulating the input. We only evaluate on the correctly classified test images. The maximum number of iterations is 10,000 with pixel attack; all other settings are retained. We report the attack time and the number of necessary attack queries for a specific attack success rate.

It takes approximately 7x the time and 1.7x the attack queries to attack OCNN, compared with the baseline (Table 8 and Fig.8). Additionally, after the same number of attack iterations, our model outperforms the baseline by 2%. To achieve the same attack rate, attacking the OCNN requires more attack queries than attacking the baseline, and searching for such queries is nontrivial and time-consuming. This may account for the longer attack time on the OCNN.

Table 8. Attack time and number of necessary attack queries needed for a 90% successful attack rate.
                Attack time/s   # necessary attack queries
ResNet18 [24]        19.3                 27k
OCNN (ours)         136.7                 46k

Figure 8. Model accuracy vs. attack time and necessary attack queries. With our conv-orthogonal regularizer, it takes 7x the time and 1.7x the necessary attack queries to achieve a 90% successful attack rate. Note that the baseline ends at accuracy 3.5% while ours ends at 5.5% with the same number of iterations.

4.8. Analysis

To understand how the conv-orthogonal regularization helps improve the performance of CNNs, we analyze several aspects of OCNNs. First, we analyze the spectrum of the DBT matrix K to understand how it helps relieve gradient vanishing/exploding. We then analyze the filter similarity of networks over different layers, followed by the influence of the weight λ of the regularization term. Finally, we analyze the time and space complexity of OCNNs.

Spectrum of the DBT Kernel Matrix K. For a convolution Y = KX, we analyze the spectrum of K to understand the properties of the convolution. We analyze the spectrum of the DBT matrix generated from the kernel K ∈ R^{64×128×3×3} of the first convolutional layer of the third convolutional block of a ResNet18 trained on CIFAR100. For fast computation, we use an input of size 64 × 16 × 16 and solve for all the singular values of K. As shown in Fig.1(b), the spectrum of plain models vanishes rapidly and may cause gradient vanishing problems. Kernel orthogonality helps the spectrum decrease more slowly. With our conv-orthogonal regularization, the spectrum almost always stays at 1. The uniform spectrum preserves norm and information between convolutional layers.

Filter Similarity. Orthogonality makes off-diagonal elements become 0. This means that for any two channels of a convolutional layer, correlations should be relatively small. This can reduce the filter similarity and feature redundancy across different channels.

We use guided back-propagation patterns [51] on images from the validation set of the ImageNet [14] dataset to investigate filter similarities. Guided back-propagation patterns visualize the gradient of a particular neuron with respect to the input image X ∈ R^{C×H×W}. Specifically, for a layer of M channels, the combination of flattened guided back-propagation patterns is denoted as G ∈ R^{M×CWH}. The correlation matrix corr(G) over different channels of this layer is (diag(K_GG))^{-1/2} K_GG (diag(K_GG))^{-1/2}, where K_GG = (1/M)[(G − E[G])(G − E[G])^T] is the covariance matrix. We plot the histogram of the off-diagonal elements of corr(G) over all validation images.

Fig.1(a) depicts the normalized histogram of pairwise filter similarities of plain ResNet34. As the number of channels increases with depth from 128 to 512, the curve shifts right and becomes far narrower, i.e., more filters become similar. Fig.1(c) depicts the histogram of filter similarities at layer 27 of ResNet34 with different regularizers. OCNNs make the curve shift left and become wider, indicating that they can enhance filter diversity and decrease feature redundancy.

Hyper-Parameter Analysis. We analyze the influence of the weight λ of the orthogonality loss. As discussed earlier, we achieve "soft orthogonality" by adding an additional loss with weight λ to the network. Fig.9 plots the image classification performance on CIFAR100 with the backbone model ResNet18 under λ ranging from 0.05 to 1.0. Our approach achieves the highest accuracy when λ = 0.1.

Figure 9. CIFAR100 classification accuracy (%) with different weight λ of the regularization loss. With backbone model ResNet18, we achieve the highest performance at λ = 0.1.

Space and Time Complexity. We analyze the space and time complexity in Table 9. The ResNet34 [24] backbone model is tested on ImageNet [14] with a single NVIDIA GeForce GTX 1080 Ti GPU and batch size 256. The number of parameters and the test time of the CNN do not change, since the regularizer is an additional loss term used only during training. With kernel orthogonal regularizers, the training time increases by 3%; with conv-orthogonal regularizers, the training time increases by 9%.

Table 9. Model size and training/test time on ImageNet [14].
                            ResNet34 [24]   kernel-orth [57]   OCNN
# Params.                       21.8M            same          same
Training time (min/epoch)       49.5             51.0          54.1
Test time (min/epoch)            1.5             same          same
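The channel-correlation measure used above for the filter-similarity histograms can be computed in a few lines. The sketch below is our own illustration; it assumes the guided back-propagation patterns G have already been collected, and the normalization constant of the covariance cancels in the correlation.

```python
import torch

def channel_correlation(g: torch.Tensor) -> torch.Tensor:
    """Pairwise channel correlation corr(G) for filter-similarity histograms.

    g: (M, C*H*W) matrix of flattened per-channel guided back-propagation
    patterns. Returns the M x M correlation matrix."""
    g = g - g.mean(dim=1, keepdim=True)             # subtract per-channel mean
    cov = g @ g.t() / g.shape[1]                    # covariance K_GG (up to a constant)
    d = torch.diagonal(cov).clamp_min(1e-12).rsqrt()
    return d[:, None] * cov * d[None, :]            # diag(K)^(-1/2) K diag(K)^(-1/2)
```

The histograms in Fig.1(a)/(c) are then built from the off-diagonal entries of the returned matrix.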
We develop an efficient OCNN approach to impose a fil- We use guided back-propagation patterns [51] on images ter orthogonality condition on a convolutional layer based from the validation set of the ImageNet [14] dataset to in- on the doubly block-Toeplitz matrix representation of the vestigate filter similarities. Guided back-propagation pat- convolutional kernel, as opposed to the commonly adopted terns visualize the gradient of a particular neuron with re- kernel orthogonality approaches. We show that kernel or- spect to the input image X ∈ RC×H×W . Specifically for thogonality [5, 30] is necessary but not sufficient for ensur- a layer of M channels, the combination of flattened guided ing orthogonal convolutions. back-propagation patterns is denoted as G ∈ RM×CWH . Our OCNN requires no additional parameters and lit- The correlation matrix corr(G) over different channels of tle computational overhead, consistently outperforming the − 1 − 1 this layer is (diag(KGG)) 2 KGG(diag(KGG)) 2 , where state-of-the-art alternatives on a wide range of tasks such as 1 T KGG = M [(G − E[G])(G − E[G]) ] is the covariance image classification and inpainting under supervised, semi- matrix. We plot the histogram of off-diagonal elements of supervised and unsupervised settings. It learns more diverse corr(G) of all validation images. and expressive features with better training stability, robust- Fig.1(a) depicts the normalized histogram of pairwise fil- ness, and generalization. ter similarities of plain ResNet34. As the number of chan- Acknowledgements. This research was supported, in part, by nels increases with depth from 128 to 512, the curve shifts Berkeley Deep Drive, DARPA, and NSF-IIS-1718991. The au- right and becomes far narrower, i.e., more filters become thors thank Xudong Wang for discussions on filter similarity, Jesse similar. Fig.1(c) depicts the histogram of filter similarities Livezey for the pointer to a previous proof for row-column or- at layer 27 of ResNet34 with different regularizers. OCNNs thogonality equivalence, Haoran Guo, Ryan Zarcone, and Pratik make the curve shift left and become wider, indicating it can Sachdeva for proofreading, and anonymous reviewers for their in- enhance filter diversity and decrease feature redundancy. sightful comments. References [16] Y. Du and I. Mordatch. Implicit generation and generaliza- tion in energy-based models. Advances in Neural Informa- [1] M. Arjovsky, A. Shah, and Y. Bengio. Unitary evolution re- tion Processing Systems (NeurIPS), 2019.7 current neural networks. In ICML, pages 1120–1128, 2016. [17] X. Glorot and Y. Bengio. Understanding the difficulty of 2,3 training deep feedforward neural networks. In Proceedings [2] J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. of the International Conference on Artificial Intelligence and arXiv preprint arXiv:1607.06450, 2016.3 Statistics (AISTATS), pages 249–256, 2010.1,3 [3] R. Balestriero and R. Baraniuk. Mad max: Affine spline in- [18] X. Gong, S. Chang, Y. Jiang, and Z. Wang. Autogan: Neural sights into deep learning. arXiv preprint arXiv:1805.06576, architecture search for generative adversarial networks. In 2018.2 Proceedings of the IEEE International Conference on Com- [4] R. Balestriero et al. A spline theory of deep networks. In puter Vision (ICCV), pages 3224–3234, 2019.7 ICML, pages 383–392, 2018.2 [19] C. Guo, J. R. Gardner, Y. You, A. G. Wilson, and K. Q. Wein- berger. Simple black-box adversarial attacks. In Proceed- [5] N. Bansal, X. Chen, and Z. Wang. 
Can we gain more from ings of the International Conference on Machine Learning orthogonality regularizations in training deep cnns? In Ad- (ICML), pages 2484–2493, 2019.7 vances in Neural Information Processing Systems (NeurIPS), pages 4266–4276, 2018.2,3,5,6,8 [20] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally. EIE: efficient inference engine on com- [6] D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba. pressed deep neural network. In IEEE Annual International Network dissection: Quantifying interpretability of deep vi- Symposium on Computer Architecture (ISCA), pages 243– sual representations. In Proceedings of the IEEE Conference 254, 2016.3 on Computer Vision and Pattern Recognition (CVPR), pages [21] S. Han, H. Mao, and W. J. Dally. Deep compression: Com- 6541–6549, 2017. 11 pressing deep neural networks with pruning, trained quanti- [7] Y. Bengio, P. Simard, P. Frasconi, et al. Learning long-term zation and huffman coding. ICLR, 2016.1 dependencies with gradient descent is difficult. IEEE Trans- [22] M. Harandi and B. Fernando. Generalized backprop- actions on Neural Networks, 5(2):157–166, 1994.1 agation, etude´ de cas: Orthogonality. arXiv preprint [8] A. Brock, J. Donahue, and K. Simonyan. Large scale GAN arXiv:1611.05927, 2016.3 training for high fidelity natural image synthesis. In ICLR, [23] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into 2019.2,3,7 rectifiers: Surpassing human-level performance on imagenet [9] A. Brock, T. Lim, J. M. Ritchie, and N. Weston. Neural photo classification. In Proceedings of the IEEE International editing with introspective adversarial networks. In ICLR, Conference on Computer Vision (CVPR), pages 1026–1034, 2017.3,7 2015.3 [10] M. L. Casado and D. Mart´ınez-Rubio. Cheap orthogonal [24] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning constraints in neural networks: A simple parametrization of for image recognition. In Proceedings of the IEEE Confer- the orthogonal and unitary group. In ICML, pages 3794– ence on Computer Vision and Pattern Recognition (CVPR), 3803, 2019.3 pages 770–778, 2016.5,6,7,8, 11 [11] Y. Chen, X. Jin, J. Feng, and S. Yan. Training group orthog- [25] Y. He, X. Zhang, and J. Sun. Channel pruning for accelerat- onal neural networks with privileged information. In Pro- ing very deep neural networks. In Proceedings of the IEEE ceedings of the International Joint Conference on Artificial International Conference on Computer Vision (ICCV), pages Intelligence (IJCAI), pages 1532–1538, 2017.3 1389–1397, 2017.3 [26] F. Heide, W. Heidrich, and G. Wetzstein. Fast and flexible [12] B. Cheung, A. Terekhov, Y. Chen, P. Agrawal, and B. A. convolutional sparse coding. In Proceedings of the IEEE Olshausen. Superposition of many models into one. In Ad- Conference on Computer Vision and Pattern Recognition vances in Neural Information Processing Systems (NeurIPS), (CVPR), pages 5135–5143, 2015.2,7 pages 10867–10876, 2019.1 [27] F. Heide, W. Heidrich, and G. Wetzstein. Fast and flexible [13] Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, convolutional sparse coding. In The IEEE Conference on and Y. Bengio. Identifying and attacking the saddle point Computer Vision and Pattern Recognition (CVPR), 2015.7 problem in high-dimensional non-convex optimization. In [28] E. Hoffer and N. Ailon. Deep metric learning using triplet Advances in Neural Information Processing Systems (NIPS), network. 
In International Workshop on Similarity-Based pages 2933–2941, 2014.1 Pattern Recognition, pages 84–92. Springer, 2015. 12 [14] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei- [29] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, Fei. Imagenet: A large-scale hierarchical image database. T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Effi- In Proceedings of the IEEE Conference on Computer Vision cient convolutional neural networks for mobile vision appli- and Pattern Recognition (CVPR), pages 248–255, 2009.1, cations. arXiv preprint arXiv:1704.04861, 2017.3 5,6,8, 11 [30] L. Huang, X. Liu, B. Lang, A. W. Yu, Y. Wang, and B. Li. [15] V. Dorobantu, P. A. Stromhaug, and J. Renteria. Orthogonal weight normalization: Solution to optimization Dizzyrnn: Reparameterizing recurrent neural networks over multiple dependent stiefel manifolds in deep neural net- for norm-preserving backpropagation. arXiv preprint works. In AAAI Conference on Artificial Intelligence (AAAI), arXiv:1612.04035, 2016.3 2018.3,5,6,8 [31] S. Ioffe and C. Szegedy. Batch normalization: Accelerating [48] A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact so- deep network training by reducing internal covariate shift. In lutions to the nonlinear dynamics of learning in deep linear ICML, pages 448–456, 2015.1,3 neural networks. In ICLR, 2014.3 [32] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up [49] H. Sedghi, V. Gupta, and P. M. Long. The singular values of convolutional neural networks with low expansions. convolutional layers. In ICLR, 2019.3 In Proceedings of the British Machine Vision Conference [50] K. Simonyan and A. Zisserman. Very deep convolutional (BMVC), 2014.3 networks for large-scale image recognition. In ICLR, 2015. [33] J. Kovaceviˇ c,´ A. Chebira, et al. An introduction to frames. 1 Foundations and Trends in Signal Processing, 2(1):1–94, [51] J. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. 2008.4 Striving for simplicity: The all convolutional net. In ICLR [34] J. Krause, M. Stark, J. Deng, and L. Fei-Fei. 3d object rep- (workshop track), 2015.8 resentations for fine-grained categorization. In 4th Interna- [52] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Deep image prior. tional IEEE Workshop on 3D Representation and Recogni- In Proceedings of the IEEE Conference on Computer Vision tion (3dRR-13), Sydney, Australia, 2013. 12 and Pattern Recognition (CVPR), 2018.6,7 [35] A. Krizhevsky, G. Hinton, et al. Learning multiple layers of [53] A. Van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, features from tiny images. Technical report, University of A. Graves, et al. Conditional image generation with pixel- Toronto, Canada, 2009.5 cnn decoders. In Advances in Neural Information Processing [36] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet Systems (NIPS), pages 4790–4798, 2016.7 classification with deep convolutional neural networks. In [54] E. Vorontsov, C. Trabelsi, S. Kadoury, and C. Pal. On or- Advances in Neural Information Processing Systems (NIPS), thogonality and learning recurrent networks with long term pages 1097–1105, 2012.1 dependencies. In ICML, pages 3570–3578, 2017.3 [37] Q. V. Le, A. Karpenko, J. Ngiam, and A. Y. Ng. Ica with [55] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Be- reconstruction cost for efficient overcomplete feature learn- longie, and P. Perona. Caltech-UCSD Birds 200. Technical ing. 
In Advances in Neural Information Processing Systems Report CNS-TR-2010-001, California Institute of Technol- (NIPS), pages 1017–1025, 2011.5, 12 ogy, 2010.6 [38] S. Li, S. Bak, P. Carr, and X. Wang. Diversity regu- [56] S. Wisdom, T. Powers, J. Hershey, J. Le Roux, and L. At- larized spatiotemporal attention for video-based person re- las. Full-capacity unitary recurrent neural networks. In Ad- identification. In Proceedings of the IEEE Conference on vances in Neural Information Processing Systems (NIPS), Computer Vision and Pattern Recognition (CVPR), pages pages 4880–4888, 2016.3 369–378, 2018.3 [39] D. Mishkin and J. Matas. All you need is a good init. In [57] D. Xie, J. Xiong, and S. Pu. All you need is beyond a good ICLR, 2016.3 init: Exploring better solution for training extremely deep [40] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral convolutional neural networks with orthonormality and mod- Proceedings of the IEEE Conference on Computer normalization for generative adversarial networks. In ICLR, ulation. In Vision and Pattern Recognition (CVPR) 2018.2,3,7 , pages 6176–6185, 2017.2,3,5,6,7,8 [41] Y. Movshovitz-Attias, A. Toshev, T. K. Leung, S. Ioffe, and S. Singh. No fuss distance metric learning using proxies. In [58] K. Yanai, R. Tanno, and K. Okamoto. Efficient mobile im- Proceedings of the IEEE International Conference on Com- plementation of a cnn-based object recognition system. In puter Vision, pages 360–368, 2017. 12 Proceedings of the International Conference on Multimedia [42] G. Ostrovski, W. Dabney, and R. Munos. Autoregressive (ICM), pages 362–366, 2016.2 quantile networks for generative modeling. In Proceed- [59] S. Zagoruyko and N. Komodakis. Wide residual networks. ings of the International Conference on Machine Learning In Proceedings of the British Machine Vision Conference (ICML), pages 3936–3945, 2018.7 (BMVC), 2016.5 [43] M. Ozay and T. Okatani. Optimization on submani- [60] H. Zheng, J. Fu, T. Mei, and J. Luo. Learning multi-attention folds of convolution kernels in cnns. arXiv preprint convolutional neural network for fine-grained image recog- arXiv:1610.07008, 2016.3 nition. In ICCV, pages 5209–5217, 2017.3 [44] V. Papyan, Y. Romano, J. Sulam, and M. Elad. Convolutional [61] J. Zhou, M. N. Do, and J. Kovacevic. Special paraunitary dictionary learning via local processing. In ICCV, 2017.7 matrices, cayley transform, and multidimensional orthogo- [45] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of nal filter banks. IEEE Transactions on Image Processing, training recurrent neural networks. In ICML, pages 1310– 15(2):511–519, 2006.2 1318, 2013.3 [46] P. Rodr´ıguez, J. Gonzalez,` G. Cucurull, J. M. Gonfaus, and F. X. Roca. Regularizing cnns with locally constrained decorrelations. In ICLR, 2017.2,3 [47] T. Salimans and D. P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neu- ral networks. In Advances in Neural Information Processing Systems (NIPS), pages 901–909, 2016.3 Supplementary Materials 70% Baseline The supplementary materials provide intuitive expla- 60% OCNN nations of our approach (SectionA), network dissec- tion results to understand the change in feature redun- 50% dancy/expressiveness (SectionB), deep metric learning per- 40% formance to understand the generalizability (SectionC), 30% proof of Lemma1 (SectionD) and visualizations of filter similarities (SectionE). 20% 10%

A. Intuitive Explanations of our Approach

We analyze a convolutional layer which transforms input X to output Y with a learnable kernel K: Y = Conv(K, X) in CNNs. Writing this in linear matrix-vector multiplication form, Y = KX (Fig.2b of the paper), we simplify the analysis from the perspective of linear systems. We do not use the im2col form Y = KX̃ (Fig.2a of the paper), as there is an additional structured linear transform from X to X̃ and this additional linear transform makes the analysis indirect. As we mentioned earlier, kernel orthogonality does not lead to a uniform spectrum.

The spectrum of K reflects the scaling property of the corresponding convolutional layer: different inputs X (such as cat, dog, and house images) would be scaled by η = ‖Y‖/‖X‖. The scaling factor η also reflects the scaling of the gradient. A typical CNN has a highly non-uniform convolution spectrum (Fig.1b of the paper): for some inputs it scales up by 2; for others it scales by 0.1. For a deep network, these irregular spectra are multiplied together and can potentially lead to gradient exploding and vanishing issues.

Features learned by CNNs are also more redundant due to the non-uniform spectrum issue (Fig.1a of the paper). This comes from the diverse learning ability for different images, and a uniform spectrum distribution could alleviate the problem. In order to achieve a uniform spectrum, we make convolutions orthogonal by enforcing the DBT kernel matrix K to be orthogonal. The orthogonal convolution regularizer leads to uniform K spectra as expected. It further reduces feature redundancy and improves the performance (Fig.1b,c,d of the paper).

Besides classification performance improvements, we also observe improved visual feature quality, in both high-level (image retrieval) and low-level (image inpainting) tasks. Our OCNNs also generate realistic images (Section 4.6) and are more robust to adversarial attacks (Section 4.7).

Figure 10. Percentage of unique detectors (mIoU ≥ 0.04) over different layers. Our OCNN has more unique detectors compared to the plain baseline ResNet34 [24] at each layer.

Figure 11. Distribution of concepts of unique detectors at different layers. Our OCNN has a more uniform concept distribution compared to the plain baseline ResNet34 [24].

B. Network Dissection

We demonstrate in Section 4 that OCNNs reduce feature redundancy by decorrelating different feature channels and enhance feature expressiveness, with improved performance in image retrieval, inpainting, and generation. Network dissection [6] is utilized to further evaluate the feature expressiveness across different channels.

Network dissection [6] is a framework that quantifies the interpretability of latent representations of CNNs by evaluating the alignment between individual hidden units and a set of semantic concepts. Specifically, we evaluate the baseline and our OCNN with the backbone architecture ResNet34 [24] trained on ImageNet. Models are evaluated on the Broden [6] dataset, where each image is annotated with spatial regions of different concepts, including cat, dog, house, etc. The concepts are further grouped into 6 categories: scene, object, part, material, texture, and color. The network dissection framework compares the mean intersection over union (mIoU) between the channel-wise activations of each layer and the ground-truth annotations. Units/feature channels are considered "effective" when mIoU ≥ 0.04; they are denoted as "unique detectors".

OCNNs have more unique detectors over different layers of the network (Table 10 & Fig.10). Additionally, OCNNs have more uniform distributions over the 6 concept categories (Fig.11). These imply that orthogonal convolutions reduce feature redundancy and enhance feature expressiveness.

Table 10. Number of unique detectors (feature channels with mIoU ≥ 0.04) comparisons on ImageNet [14].
                conv2   conv3   conv4   conv5
ResNet34 [24]     6       13      47     346
OCNN (ours)       6       14      57     365

C. Deep Metric Learning

We evaluate the generalizability and performance of our orthogonal regularizer in deep metric learning tasks. Specifically, following the training/evaluation settings in [41], we perform retrieval and clustering on the Cars196 dataset [34] and summarize the results in Table 11. We observe performance gains when the orthogonal regularizer is added.

Table 11. Retrieval/clustering performance on Cars196 (%).
                      NMI    F1    Recall@1   @2     @4     @8
Triplet loss [28]     61.9   27.1    61.4     73.5   83.1   89.9
ProxyNCA [41]         62.4   29.2    67.9     78.2   85.6   90.6
[41] + Kernel orth    63.1   29.6    67.6     78.4   86.2   91.2
[41] + OCNN           63.6   30.2    68.8     79.0   87.4   92.0

Figure 12. Guided back-propagation patterns of the input image (first column) with a ResNet34 model. The first row depicts patterns of the first 3 channels from layer 7, while the second row depicts patterns of the first 3 channels from layer 33. The filter similarity increases with network depth.

D. Proof of the Orthogonality Equivalence

Here we provide a proof of Lemma 1: the row orthogonality and column orthogonality are equivalent in the MSE sense, i.e. ‖KK^T − I‖²_F = ‖K^T K − I'‖²_F + U, where U is a constant. A simple motivation for this proof is that when K is a square matrix, KK^T = I ⟺ K^T K = I'. We can hope to generalize this result and provide a more convenient algorithm. The following short proof is provided in the supplementary material of [37]; we present it here for the reader's convenience.

Proof. It is sufficient to prove the general result, where we choose K ∈ R^{M×N} to be an arbitrary matrix². We denote ‖KK^T − I_M‖²_F as L_r and ‖K^T K − I_N‖²_F as L_c.

$$
\begin{aligned}
L_r &= \|KK^\top - I_M\|_F^2 \\
    &= \operatorname{tr}\!\big[(KK^\top - I_M)^\top (KK^\top - I_M)\big] \\
    &= \operatorname{tr}(KK^\top KK^\top) - 2\operatorname{tr}(KK^\top) + \operatorname{tr}(I_M) \\
    &= \operatorname{tr}(K^\top K K^\top K) - 2\operatorname{tr}(K^\top K) + \operatorname{tr}(I_N) + M - N \\
    &= \operatorname{tr}\!\big[K^\top K K^\top K - 2K^\top K + I_N\big] + M - N \\
    &= \operatorname{tr}\!\big[(K^\top K - I_N)(K^\top K - I_N)\big] + M - N \\
    &= \|K^\top K - I_N\|_F^2 + M - N
\end{aligned}
$$

= L_c + U, where U = M − N.
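The identity can also be checked numerically. The following small test is our own sanity check, not part of the paper: it verifies ‖KK^T − I_M‖²_F − ‖K^T K − I_N‖²_F = M − N for a random matrix.

```python
import torch

M, N = 5, 8
K = torch.randn(M, N, dtype=torch.float64)
lr = torch.norm(K @ K.t() - torch.eye(M, dtype=torch.float64)) ** 2
lc = torch.norm(K.t() @ K - torch.eye(N, dtype=torch.float64)) ** 2
# the gap between the two losses is exactly the constant U = M - N
assert torch.allclose(lr - lc, torch.tensor(float(M - N), dtype=torch.float64))
```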

E. Filter Similarity Visualizations

As shown in Fig.1, filter similarity increases with the depth of the network. We visualize the guided back-propagation patterns to understand this phenomenon. For the ResNet34 trained on ImageNet, we plot guided back-propagation patterns of an image in Fig.12. The first row depicts patterns of the first 3 channels from layer 7, while the second row depicts patterns of the first 3 channels from layer 33. Patterns of different channels from earlier layers are more diverse, while patterns of different channels from later layers usually focus on certain regions. The filter similarity increases with depth.

²Here M and N are generic constants, different from the ones used in the main text.