
Wide Hidden Expansion Layer for Deep Convolutional Neural Networks

Min Wang, University of Central Florida, [email protected]
Baoyuan Liu, Amazon, [email protected]
Hassan Foroosh, University of Central Florida, [email protected]

Abstract

Non-linearity is an essential factor contributing to the success of deep convolutional neural networks. Increasing the non-linearity in a network enhances its learning capability, leading to better performance. We present a novel Wide Hidden Expansion (WHE) layer that can increase the number of activation functions in the network by an order of magnitude, with very little increase in computational complexity and memory consumption. It can be flexibly embedded into different network architectures to boost the performance of the original networks. The WHE layer is composed of a wide hidden layer in which each channel connects with only two input channels and one output channel. Before connecting to the output channel, each intermediate channel in the WHE layer is followed by one activation function. In this manner, the number of activation functions grows along with the number of channels in the hidden layer. We apply the WHE layer to ResNet, WideResNet, SENet, and MobileNet architectures and evaluate on the ImageNet, CIFAR-100, and Tiny ImageNet datasets. On the ImageNet dataset, models with the WHE layer achieve up to 2.01% higher Top-1 accuracy than the baseline models, with less than 4% more computation and less than 2% more parameters. On CIFAR-100 and Tiny ImageNet, applying the WHE layer to ResNet models demonstrates consistent improvement in accuracy. Applying the WHE layer to the ResNet backbone of the CenterNet object detection model also boosts its performance on the COCO and Pascal VOC datasets.

1. Introduction

In convolutional neural networks, the convolutional layers perform linear transformations of the feature space, while the activation function layers provide non-linear warping of the feature space. Both the linear transformation and the non-linear warping are essential to the network's capability of learning complex distributions.

In this work, we propose to increase the non-linearity of the network by massively increasing the number of activation functions with a novel Wide Hidden Expansion (WHE) layer, which can be flexibly embedded into different network architectures by appending it to convolutional layers. In the WHE layer, while the number of output channels remains the same as that of input channels, there implicitly exists a large number of hidden channels connecting the input and the output. Each hidden channel is followed by one activation function; therefore, the number of activation functions grows along with the number of hidden channels. Each hidden channel passes the weighted sum of two input channels through an activation function and then aggregates the result into one output channel. Such structured sparse connectivity guarantees that the computational complexity of this layer is bounded by a fraction of that of a traditional convolutional layer. On the other hand, hidden channels exist only temporarily in the GPU cache during both the forward and backward phases of the layer. Therefore, the memory consumption of WHE (both training and inference) is aligned with the size of the input tensor rather than with the number of hidden channels.

The merit of the proposed WHE layer is straightforward: we can increase the number of channels, as well as the number of activation functions, by more than an order of magnitude with minimal increase in consumed resources (both computation and memory). Hidden channels here cannot be considered equivalent to the channels of a traditional convolutional layer, since they are very sparsely connected to the much narrower input and output. However, they do introduce substantial activation functions, which provide the non-linearity that is essential for a deep network to learn the complex distribution of visual data.

Compared to methods that utilize a channel-wise "bottleneck" structure, such as ResNet [10] and ResNeXt [31], the fundamental difference of the WHE layer is its structured sparse connections. Because each hidden channel in WHE is connected to only two input channels and one output channel, a substantial increase in hidden channels requires only minor extra computation. In ResNeXt, by contrast, each hidden channel is connected to all the input and output channels of the same group. With the same configuration of input, hidden, and output channels, a module built with the "bottleneck" and grouping structure of ResNeXt consumes more than 10 times the computation of WHE. Additionally, the structured sparse connections in WHE make it possible for the hidden channels to be computed on the fly and never stored in the main GPU memory, which is difficult to achieve with ResNeXt.

We evaluate our WHE layer with various network architectures on the ImageNet classification, CIFAR-100, and Tiny ImageNet datasets. The experiments demonstrate that adding the WHE layer to existing network architectures consistently achieves significant improvement with a marginal increase in computation and parameters. We also show that applying WHE to the backbone of an object detection network boosts its performance on different datasets.

The rest of this paper is structured as follows. Section 2 reviews related work. In Section 3, we introduce the mechanism of the proposed WHE layer, analyze its computational complexity and memory consumption, and discuss the implementation details. In Section 4, we describe the evaluation of our method and discuss the experimental results. Section 5 concludes the paper.

2. Related Work

Deep Architectures

AlexNet [19] is the first deep CNN architecture applied to large-scale image classification problems. ZFNet [35] adjusts the kernel and stride size of the first convolutional layer in AlexNet to achieve better accuracy. VGGNet [24] uses 3 × 3 convolutional layers exclusively; it was very deep and computationally expensive at the time, but showed the power of depth in network design. GoogleNet [26], in contrast, takes a very different approach: while having similar depth to VGGNet, its architecture is much more complex and diverse, achieving similar accuracy with much less theoretical computation. Batch normalization [15] makes it possible to train a deep network without vanishing or exploding gradients. In ResNet [10], the residual connection reduces the length of the back-propagation path and enables effective training of networks with more than one hundred layers. ResNeXt [31] improves the "cardinality" of the network with group convolutions. In [34], extremely wide networks are evaluated against deep architectures; the authors show that width can be as effective as depth for improving network performance. DenseNet [14] inserts residual connections between each pair of convolutional layers to improve feature propagation and reuse. In SENet [13], a squeeze-and-excitation block is introduced to aggregate the spatial information in each channel. In NASNet [37], an optimized convolutional architecture is learned on a small dataset and then applied to large-scale datasets.

Novel layer designs

While most state-of-the-art deep architectures are designed by reconfiguring existing layers, in recent years some research has focused on proposing new layers, to which this work also belongs. Several new types of activation functions [3, 5, 9, 16, 28] have been proposed to replace the original ReLU function. In NIN [20], a micro multi-layer perceptron network is used to scan the input feature map and extract a better representation. In SICNet [27] and MobileNet [12], the expensive convolutional layer is replaced with a combination of a novel 2D intra-channel convolution and a 1 × 1 convolutional layer, achieving similar representative power with much lower complexity. CNNPack [29] performs the convolution in the frequency domain with the Discrete Cosine Transform (DCT) and achieves high compression by discarding low-energy frequency coefficients. In LBCNN [17], pre-defined sparse binary 2D kernels are convolved with the input and then linearly combined to replace the traditional convolutional layers.

Activation functions

The predominant approach to higher non-linearity in the network is to design more powerful activation functions. The most extensively used activation function is the Rectified Linear Unit (ReLU), owing to its fast convergence. In ReLU, the non-linearity comes only from the change of the linear path at the origin. Several ReLU variants have been proposed to increase the non-linearity of the networks. PReLU [9] introduces learnable parameters for the slope of the negative linear part to increase the diversity of the activation function. ELU [5] uses an exponential curve for the negative part and makes the activation function differentiable at every location. LuTU [28] learns the shape of the activation function with a look-up table structure and observes that the peak and valley patterns widely present in the learned activation functions can be replaced by Gaussian mixture models.

Sparse connections

The layer proposed in this work can be considered a hidden layer with structured sparse connections to the input/output layers. Sparsity in convolutional neural networks has been widely studied in recent years. In SCNN [22], the convolutional kernel is decomposed along both the spatial dimensions and the channel dimension to exploit its redundancy, and is then sparsified by training with an l1 penalty. In [8], the weights of the kernel are pruned, quantized, and encoded with Huffman coding to achieve high compression. In [30], structured sparsity of the convolutional kernel is learned with group Lasso regularization over the filter and channel dimensions. [11] introduces sparsity along the channel dimension by imposing Lasso regression on the channel selection weights. In [33], the authors propose an exclusive sparsity regularization, which enforces the output channels of one convolutional layer to connect to different input channels.

Saving memory

We introduce an implementation technique to reduce the memory consumption of the proposed WHE layer. A few related works aim at the same purpose. In [4], sub-linear memory consumption is achieved by selectively storing a subset of the intermediate feature maps and recovering the rest by rerunning the forward operations during the backward pass. In [7], the authors avoid storing the activation function output by utilizing a reversible activation function.

3. The Proposed WHE Layer

Here, we describe the proposed novel Wide Hidden Expansion (WHE) layer, which can be used as a building block embedded in a deep network. The layer's input tensor X ∈ R^{c×h×w} and output tensor Y ∈ R^{c×h×w} have the same dimensions, where c is the number of channels, and h and w are the spatial dimensions. Considering the efficiency of implementing its sparse connections, we first split the input data along the channel dimension into groups of size g, so that the weight parameters of each group can be processed in the GPU cache. The implementation details are discussed in Section 3.2.

Figure 1. An example of the proposed WHE layer with three input channels. Nine hidden neurons are created, each connecting one pair of input channels. The hidden neurons are passed through an activation function layer and then aggregated to the corresponding output channels.

In each group, the input data is processed along the channel dimension at each spatial location. Therefore, for simpler notation, we only need to consider the input vector x ∈ R^g and output vector y ∈ R^g of one group. A wide hidden layer h ∈ R^{g×g} with g^2 neurons is created to connect the input x and the output y. For each i ∈ [1, g] and j ∈ [1, g], there is one hidden channel that connects the inputs x(i) and x(j) through the following linear equation:

h(i, j) = A(i, j) · x(i) + B(i, j) · x(j) + C(i, j)    (1)

where A ∈ R^{g×g}, B ∈ R^{g×g}, and C ∈ R^{g×g} are weight tensors that are learned. Each channel in the hidden layer is then passed through an activation function f that provides the non-linearity. Finally, the hidden channels that correspond to the same input channel are aggregated to the corresponding output channel as follows:

y(i) = Σ_{j=1}^{g} f(h(i, j)) · D(i, j)    (2)

where D ∈ R^{g×g} is another learned weight tensor. Figure 1 shows an example of a WHE layer with three input channels and nine hidden channels.
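To make Eqs. (1) and (2) concrete, the following is a minimal, unoptimized PyTorch re-implementation of the per-group computation. The class name WHEGroup is ours, ReLU is assumed for the activation f, and, unlike the fused CUDA kernel described in Section 3.2, this naive version does materialize the g × g hidden tensor in GPU memory; it is meant only to spell out the arithmetic.

```python
import torch
import torch.nn as nn

class WHEGroup(nn.Module):
    """Reference sketch of Eqs. (1)-(2) for a single group of g channels."""

    def __init__(self, g, activation=nn.ReLU()):
        super().__init__()
        self.g = g
        # One scalar weight per (i, j) hidden channel: A, B, C for Eq. (1), D for Eq. (2).
        self.A = nn.Parameter(torch.empty(g, g).uniform_(-0.1, 0.1))
        self.B = nn.Parameter(torch.empty(g, g).uniform_(-0.1, 0.1))
        self.C = nn.Parameter(torch.empty(g, g).uniform_(-0.1, 0.1))
        self.D = nn.Parameter(torch.empty(g, g).uniform_(-0.1, 0.1))
        self.f = activation

    def forward(self, x):
        # x: (N, g, H, W) -- the g channels of one group.
        xi = x.unsqueeze(2)   # (N, g, 1, H, W), broadcast over j
        xj = x.unsqueeze(1)   # (N, 1, g, H, W), broadcast over i
        # Eq. (1): h(i, j) = A(i, j) * x(i) + B(i, j) * x(j) + C(i, j)
        h = (self.A[None, :, :, None, None] * xi
             + self.B[None, :, :, None, None] * xj
             + self.C[None, :, :, None, None])
        # Eq. (2): y(i) = sum_j f(h(i, j)) * D(i, j)
        y = (self.f(h) * self.D[None, :, :, None, None]).sum(dim=2)
        return y              # (N, g, H, W)
```

A full WHE layer would split the c input channels into c/g such groups, each with its own A, B, C, and D weights, and concatenate the group outputs along the channel dimension.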
In summary, for the input of one group with g channels, the WHE layer creates g^2 hidden channels and yields g^2 corresponding non-linear activation functions. The width of the hidden layer over all groups is therefore g times the width of the input tensor. On the other hand, since each channel in the hidden layer is connected to only two input channels and one output channel, such structured sparse connections keep the complexity of the whole layer to a fraction of that of a standard convolutional layer. The computational complexity and memory consumption of the WHE layer are discussed in detail in Section 3.1.

3.1. Complexity, Memory Consumption and Parameters Analysis

Computational Complexity

We analyze the number of Multiply-Accumulate (MAC) operations of the proposed WHE layer. For each input vector x ∈ R^g at each spatial location in one group, g^2 hidden neurons need to be computed. Each hidden neuron involves 3 MAC operations: 2 MAC operations to compute the neuron from the input, and 1 MAC operation to accumulate the neuron into the output. An input tensor X ∈ R^{c×h×w} is composed of chw/g vectors of length g. Therefore, the total number of MAC operations consumed by the WHE layer is

3chwg    (3)

In comparison, the number of MACs of a standard convolutional layer with a 3 × 3 kernel and the same input dimensions is

9c^2·hw    (4)

The ratio of their complexities is g/(3c). For instance, consider a typical convolutional layer with 256 channels. If we choose g = 16, which means the hidden layer has 4096 channels, the complexity of the WHE layer is only 1/48 of that of the convolutional layer.

Memory Consumption

Given an input tensor X ∈ R^{c×h×w}, the full hidden layer H ∈ R^{c×g×h×w} is g times the size of the input, and the output Y ∈ R^{c×h×w} is the same size as the input. Therefore, the theoretical additional memory consumption of the proposed layer is

(g + 1)·chw    (5)

In a straightforward implementation of the proposed layer, if all the intermediate data (H and Y here) were stored for back-propagation, the layer could consume much more memory than a standard convolutional layer. However, in our implementation, described in Section 3.2, the hidden layer exists only temporarily and does not consume actual GPU memory. Therefore, the memory consumption of our implementation of the WHE layer is the same as that of a convolutional layer.

Parameters

Given a WHE layer with input tensor X ∈ R^{c×h×w} and group size g, each hidden channel is associated with 4 parameters (A, B, C, and D in Equations 1 and 2). There are gc hidden channels in total. Therefore, the total number of parameters in the proposed WHE layer is 4gc, compared to 9c^2 in a standard 3 × 3 convolutional layer with the same input size. With 256 input channels and group size g = 16, the number of parameters of the former is as few as 1/36 of the latter.
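The cost figures above can be sanity-checked with a few lines of arithmetic; the helper function below is ours and simply evaluates Eqs. (3)-(4) and the two parameter counts.

```python
def whe_vs_conv3x3(c, g, h=1, w=1):
    """Compare WHE and 3x3 convolution costs for c channels and group size g."""
    whe_macs = 3 * c * h * w * g      # Eq. (3): chw/g vectors, g^2 neurons each, 3 MACs per neuron
    conv_macs = 9 * c * c * h * w     # Eq. (4): 3x3 kernel, c input and c output channels
    whe_params = 4 * g * c            # A, B, C, D for each of the gc hidden channels
    conv_params = 9 * c * c           # 3x3 kernel weights
    return whe_macs / conv_macs, whe_params / conv_params

print(whe_vs_conv3x3(c=256, g=16))
# -> (0.0208..., 0.0277...), i.e. about 1/48 of the MACs and 1/36 of the parameters
```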

3.2. Implementation Details

The forward and backward operations of the WHE layer are implemented with CUDA kernel functions. Each CUDA block processes the input data of one group. The weights that correspond to this group are preloaded into the shared memory of the block. Each block's shared memory resides in the cache of a Streaming Multiprocessor (SM) of the GPU, which has much higher bandwidth and much lower latency than the main GPU memory. In each iteration, an input vector x ∈ R^g that corresponds to one spatial location is processed. Thanks to the sparse connection structure of the WHE layer, the hidden layer h ∈ R^{g×g} can be computed on the fly and cached in the shared memory. The hidden layer is then aggregated into the output vector y ∈ R^g, which is stored into the output tensor residing in the main GPU memory. Since the hidden layer is computed on the fly and never enters the main GPU memory, the overall memory consumption is not determined by the width of the hidden layer. This explains the memory saving discussed in Section 3.1. The only requirement that must be guaranteed is that the hidden layer fits into the shared memory of the block.

The backward pass is implemented similarly, except that it involves more computation and more shared memory. First, the output of the hidden layer needs to be recomputed so that the gradient over the aggregation weights D can be calculated. Second, the gradients over both the weights and the hidden layer also need to be cached in the shared memory, which almost doubles the shared memory consumption.
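The memory-saving strategy (discard the hidden layer after the forward pass and recompute it in the backward pass) can also be expressed at the autograd level. The sketch below is our own illustration of that bookkeeping in plain PyTorch, assuming ReLU as the activation; it only mirrors what the fused CUDA kernel saves for backward, not its shared-memory residency or its speed.

```python
import torch

class WHEGroupFn(torch.autograd.Function):
    """One WHE group; the (N, g, g, H, W) hidden tensor is never saved for backward."""

    @staticmethod
    def forward(ctx, x, A, B, C, D):
        h = (A[:, :, None, None] * x.unsqueeze(2)
             + B[:, :, None, None] * x.unsqueeze(1)
             + C[:, :, None, None])                        # Eq. (1)
        y = (torch.relu(h) * D[:, :, None, None]).sum(2)   # Eq. (2)
        ctx.save_for_backward(x, A, B, C, D)                # only small tensors are kept
        return y

    @staticmethod
    def backward(ctx, grad_y):
        x, A, B, C, D = ctx.saved_tensors
        # Recompute the hidden layer instead of having stored it.
        h = (A[:, :, None, None] * x.unsqueeze(2)
             + B[:, :, None, None] * x.unsqueeze(1)
             + C[:, :, None, None])
        fh = torch.relu(h)
        mask = (h > 0).to(h.dtype)                          # ReLU derivative
        gy = grad_y.unsqueeze(2)                            # (N, g, 1, H, W), broadcast over j
        grad_D = (gy * fh).sum(dim=(0, 3, 4))
        gh = gy * D[:, :, None, None] * mask                # gradient reaching h(i, j)
        grad_A = (gh * x.unsqueeze(2)).sum(dim=(0, 3, 4))
        grad_B = (gh * x.unsqueeze(1)).sum(dim=(0, 3, 4))
        grad_C = gh.sum(dim=(0, 3, 4))
        grad_x = (gh * A[:, :, None, None]).sum(2) + (gh * B[:, :, None, None]).sum(1)
        return grad_x, grad_A, grad_B, grad_C, grad_D
```

Usage would be y = WHEGroupFn.apply(x, A, B, C, D), with the four weight matrices held as nn.Parameter tensors of shape (g, g).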
4. Experiments

We evaluate the performance of the proposed WHE layer on the ImageNet LSVRC 2012 [23], CIFAR-100 [18], and Tiny ImageNet [32] datasets, on a computer with an Nvidia GTX 1080 Ti GPU, dual E5-2670 CPUs, cuDNN v7.1, and CUDA 10.0.

ImageNet is a large-scale classification dataset that contains 1000 categories, with 1.2M training images, 50K validation images, and 100K test images. The average resolution of the images is 482 × 415. The ImageNet dataset is the closest to real-world data and is the most accurate way of measuring the performance of deep CNN models.

CIFAR-100 is a smaller-scale image classification dataset including 100 categories of 32 × 32 color images. Each category has 500 training images and 100 test images. Due to its small image size, it is suitable for fast prototyping and evaluation.

Tiny ImageNet is a subset of the ImageNet dataset. It has 200 categories; each category has 500 training images, 50 validation images, and 50 test images. The spatial dimensions of the images are 64 × 64.

Experimental Settings We train and evaluate our networks on the PyTorch framework [1] and follow the examples in [2]. We follow the standard procedure of randomly flipping each training image horizontally and randomly cropping it. All the models are trained with the "step" learning rate policy that divides the learning rate by 10 after a certain number of epochs. For experiments on ImageNet, Top-1 and Top-5 accuracies with center cropping are reported. For experiments on CIFAR-100 and Tiny ImageNet, the average Top-1 accuracy over 5 runs is reported.

WHE block To embed our WHE layer into different network architectures with minimal modification of the original network structures, we design a WHE block that contains one WHE layer, one batch normalization layer, and one residual connection.
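A minimal sketch of such a block, reusing the naive WHEGroup module from Section 3, is shown below. The class name is ours, and the exact ordering of the batch normalization and the residual sum is our assumption, since the paper does not spell out the wiring.

```python
import torch
import torch.nn as nn

class WHEBlock(nn.Module):
    """Sketch of a WHE block: WHE layer + batch normalization + residual connection."""

    def __init__(self, channels, g=16):
        super().__init__()
        assert channels % g == 0
        self.g = g
        # One WHEGroup (the per-group sketch from Section 3) per group of g channels.
        self.groups = nn.ModuleList(WHEGroup(g) for _ in range(channels // g))
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        chunks = torch.split(x, self.g, dim=1)   # split channels into groups of size g
        out = torch.cat([m(c) for m, c in zip(self.groups, chunks)], dim=1)
        return x + self.bn(out)                  # residual connection around the WHE output
```

Appending the block to an existing 256-channel convolution would then look like nn.Sequential(conv, WHEBlock(256, g=16)).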

Figure 2. Training curves of (a) ResNet-18 and (b) ResNet-34 on ImageNet with input image size 150 × 150. Top-1 accuracies are plotted: training accuracies with dashed lines and validation accuracies with solid lines. Blue curves represent the original ResNet models and red curves represent the models with our WHE layer. With the WHE layer, both the training and validation accuracies outperform the original model.

Model | FLOPs | Param | Top-1 | Top-5
ResNet18 | 1800M | 11.7M | 69.37% | 88.96%
ResNet18 + WHE | 1872M | 11.9M | 71.08% (+1.71%) | 89.93%
SENet18 | 1800M | 11.7M | 70.03% | 89.44%
SENet18 + WHE | 1872M | 11.9M | 71.15% (+1.12%) | 90.14%
MobileNet | 596M | 4.2M | 70.6% | -
MobileNet + WHE | 640M | 4.3M | 71.60% (+1.0%) | 90.13%

Table 1. Comparing Top-1 & Top-5 accuracies of the ResNet-18, SENet-18, and MobileNet models on ImageNet with input image size 224 × 224. With the WHE layer, the Top-1 accuracies of these models are improved by 1.71%, 1.12%, and 1.0%, respectively, with only a slight increase in complexity and parameters.

4.1. ImageNet

We evaluate the performance of the WHE layer based on ResNet-18, ResNet-34, WideResNet, SENet-18, and MobileNet. One WHE block follows one convolutional layer in the original networks.

The networks are trained with two different input image sizes: 150 × 150, cropped from images with the shorter edge equal to 164, and 224 × 224, cropped from images with the shorter edge equal to 256. We train the networks for 120 epochs. The learning rate is set to 0.1 in the first 60 epochs and divided by 10 every following 30 epochs. The weight decay is set to 1e−4 for the convolutional layers and the batch normalization layers. For the WHE layer, we apply an adjusted weight decay policy that sets the weight decay to 1e−6 in the first 80 epochs, 1e−5 in the subsequent 30 epochs, and 1e−4 in the last 10 epochs. The group size g for the WHE layer is set to 16. The parameters of the WHE layer are initialized with a uniform distribution in [-0.1, 0.1] (in our experiments, adjusting this range from 0.01 to 0.1 resulted in a marginal accuracy improvement of 0.2%).

Table 2 shows the results of our method compared with the original ResNet, WideResNet, and SENet-18 models at the 150 × 150 input image size. On the ResNet-18 and ResNet-34 models, adding the WHE layer significantly increases the Top-1 accuracy by 2.01% and 1.25% respectively, with only a small increase in computational complexity (<4%) and in the number of parameters (<2%). The WideResNet models that we compare with are WideResNet-18 1.25 and WideResNet-18 1.5, which are widened versions of ResNet-18 with 1.25× and 1.5× width, respectively.
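As a concrete illustration of the training recipe above, the sketch below sets up separate optimizer parameter groups so that the WHE weights follow the adjusted weight decay schedule while the other layers keep 1e−4. The helper names, the momentum value, and the assumption that WHE parameters can be identified by name are ours, not the authors' released code.

```python
import torch

def build_optimizer(model, base_lr=0.1):
    # Two parameter groups: ordinary layers keep weight decay 1e-4,
    # WHE parameters start at 1e-6 (raised later by adjust_schedules).
    whe_params = [p for n, p in model.named_parameters() if "whe" in n.lower()]      # naming is an assumption
    other_params = [p for n, p in model.named_parameters() if "whe" not in n.lower()]
    return torch.optim.SGD(
        [{"params": other_params, "weight_decay": 1e-4},
         {"params": whe_params, "weight_decay": 1e-6}],
        lr=base_lr, momentum=0.9)

def adjust_schedules(optimizer, epoch):
    # Step LR policy: 0.1 for the first 60 epochs, then divided by 10 every 30 epochs.
    lr = 0.1 if epoch < 60 else (0.01 if epoch < 90 else 0.001)
    # Adjusted weight decay for WHE: 1e-6, then 1e-5 after epoch 80, 1e-4 in the last 10 epochs.
    wd = 1e-6 if epoch < 80 else (1e-5 if epoch < 110 else 1e-4)
    for group in optimizer.param_groups:
        group["lr"] = lr
    optimizer.param_groups[1]["weight_decay"] = wd   # the WHE parameter group defined above
```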

Model | Depth | FLOPs | Param | Top-1 | Top-5
ResNet-18 Original | 18 | 790M | 11.7M | 64.96% | 86.19%
ResNet-18 + WHE | 18 | 820M | 11.9M | 66.97% (+2.01%) | 87.48% (+1.29%)
ResNet-34 Original | 34 | 1577M | 21.8M | 68.84% | 88.90%
ResNet-34 + WHE | 34 | 1632M | 22.3M | 70.09% (+1.25%) | 89.43% (+0.53%)
WideResNet-18 1.25 Original | 18 | 1200M | 17.4M | 67.27% | 87.79%
WideResNet-18 1.25 + WHE | 18 | 1238M | 17.7M | 68.47% (+1.20%) | 88.35% (+0.56%)
WideResNet-18 1.5 Original | 18 | 1690M | 25.1M | 68.62% | 88.46%
SENet-18 Original | 18 | 790M | 11.7M | 65.64% | 86.51%
SENet-18 + WHE | 18 | 820M | 11.9M | 67.05% (+1.41%) | 87.54% (+1.03%)

Table 2. Top-1 & Top-5 accuracies, computational complexity (measured in FLOPs), and number of parameters of our method compared with the original ResNet, WideResNet, and SENet models at 150 × 150 input image size. On the ResNet-18, ResNet-34, WideResNet-18, and SENet-18 models, adding our WHE layer achieves significant accuracy improvements of 2.01%, 1.25%, 1.20%, and 1.41% respectively, with <4% more computational complexity and <2% more parameters. Also, WideResNet-18 1.25 with the WHE layer achieves an accuracy comparable to the WideResNet-18 1.5 model, which has 40% more computational complexity and more parameters than the original WideResNet-18 1.25.

Model | FLOPs | Param | Top-1 | Top-5
ResNet18 Original | 790M | 11.7M | 64.96% | 86.19%
ResNet18 + DepthWise | 796M | 11.8M | 65.12% (+0.16%) | 86.14%
ResNet18 + WHE | 820M | 11.9M | 66.97% (+2.01%) | 87.48% (+1.29%)

Table 3. Same-depth comparison: unlike our WHE layer, when low-cost DepthWise convolutional layers are deployed as an alternative to the WHE layer, the network performance is barely improved.

With only a slight increase in complexity (3.2%) and parameters (1.7%), WideResNet-18 1.25 with the WHE layer achieves 1.20% higher Top-1 accuracy than the original WideResNet-18 1.25 model, and matches the accuracy of WideResNet-18 1.5, which has 40% more complexity and parameters than the original WideResNet-18 1.25. When compared with the SENet-18 model, the WHE layer boosts the model by 1.41% higher Top-1 accuracy with a slight complexity increase.

Table 1 shows the comparison results based on the ResNet-18, SENet-18, and MobileNet models with 224 × 224 input image size. Adding the WHE layer improves the Top-1 accuracies of ResNet-18, SENet-18, and MobileNet by 1.71%, 1.12%, and 1.0%, respectively, with a slight increase in computational complexity and parameters. Figure 2 shows their training curves on both the training and validation sets.

4.2. Same Depth comparison

To verify that the performance increase from the WHE layer is attributed to the increase in non-linearity rather than the increase in depth (appending WHE after a convolutional layer), we replace the WHE layers with the same number of low-cost DepthWise convolutional layers, which are mainly used in the efficient SICNet [27] and MobileNet [12], making a head-to-head comparison based on the ResNet-18 model with image size 150 × 150. Table 3 shows the comparison result. The improvement from only adding DepthWise convolutional layers is trivial, with only about 0.1% better performance. In contrast, with the WHE layer, we can boost network performance by 2.01% higher accuracy.

4.3. Comparison with other non-linearity increasing methods

To compare our WHE layer with activation-function-based methods such as PReLU, APL, and ELU, we consider their impact on the network's performance based on the ResNet-18 model (image size: 224 × 224) on the ImageNet dataset. Models with different activation functions are trained under the same settings as our WHE layer. Table 4 shows the comparison result. The activation-function-based methods improve the performance of ResNet-18 by less than 0.3%, while our WHE layer boosts the accuracy of ResNet-18 by 1.71%.

Model | Method | Top-1 accuracy increase ↑
ResNet18 | ReLU | 0
ResNet18 | ELU | -2.24%
ResNet18 | PReLU | +0.18%
ResNet18 | APL | +0.27%
ResNet18 | WHE | +1.71%

Table 4. Comparison with other non-linearity increasing methods with the ResNet-18 model (input resolution: 224 × 224) on the ImageNet dataset.

Model | Depth | Param (Original) | Param (Ours) | CIFAR-100 Original | CIFAR-100 Original with dropout | CIFAR-100 Ours | Tiny-ImageNet Original | Tiny-ImageNet Original with dropout | Tiny-ImageNet Ours
ResNet | 20 | 0.27M | 0.28M | 67.63% | 68.42% | 69.17% | 48.15% | 49.63% | 50.29%
ResNet | 32 | 0.46M | 0.47M | 69.49% | 70.29% | 70.74% | 49.86% | 51.66% | 52.61%
ResNet | 44 | 0.66M | 0.68M | 70.23% | 71.02% | 72.05% | 51.24% | 52.73% | 53.23%
ResNet | 56 | 0.85M | 0.88M | 70.71% | 71.83% | 72.41% | 51.35% | 53.65% | 54.17%
ResNet | 110 | 1.7M | 1.76M | 71.73% | 73.10% | 73.57% | 52.60% | 54.37% | 54.90%

Table 5. Top-1 accuracies and number of parameters of our method compared with the original ResNet models on the CIFAR-100 and Tiny ImageNet datasets. Since Dropout regularization is adopted in our method with the WHE layer, we also list the accuracies of the original ResNet models with Dropout for a fair comparison. While adding Dropout boosts the accuracies of the ResNet models, our models consistently demonstrate further improvement on all the models for both datasets with a negligible increase in parameters.

Model | Top-1 accuracy | Forward Time | Training Memory
Original | 69.37% | 8.1 ms | 1061 MB
WHE | 71.08% (+1.71%) | 8.4 ms | 1105 MB

Table 6. Comparing running time and memory consumption based on ResNet-18 for batch size 8.

Model | Group size | Param | Top-1 | Top-5
Original | - | 11.7M | 64.96% | 86.19%
Ours | 4 | 11.7M | 65.38% | 86.35%
Ours | 8 | 11.8M | 66.28% | 87.12%
Ours | 16 | 11.9M | 66.97% | 87.48%

Table 7. Comparing performance with different group sizes in WHE layers based on the ResNet-18 model on ImageNet (150 × 150). Accuracy increases consistently with the group size, which determines the width of the hidden layers.

4.4. Running time and memory consumption

The WHE layer is implemented with an efficient low-level CUDA kernel, as described in Section 3.2. In Table 6, we evaluate the forward running time and training memory consumption based on the ResNet-18 model on one GPU.
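For reference, numbers of this kind can be collected with standard PyTorch utilities; the sketch below (our own measurement harness, not the authors' benchmarking code) times the forward pass with CUDA events and reads the peak allocated memory for one training step at batch size 8.

```python
import torch
import torchvision

# Warm-up iterations are omitted for brevity; in practice the first measurements are discarded.
model = torchvision.models.resnet18(num_classes=1000).cuda().train()
x = torch.randn(8, 3, 224, 224, device="cuda")

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
torch.cuda.reset_peak_memory_stats()

start.record()
out = model(x)                    # forward pass being timed
end.record()
torch.cuda.synchronize()
out.sum().backward()              # include backward so the peak covers a training step

print("forward ms:", start.elapsed_time(end))
print("peak training memory MB:", torch.cuda.max_memory_allocated() / 2**20)
```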

Model | Weight Decay | Top-1 | Top-5
Original | - | 64.96% | 86.19%
Ours | Fixed (1e−4) | 65.87% | 86.81%
Ours | Fixed (1e−6) | 66.61% | 87.32%
Ours | Adjusted | 66.97% | 87.48%

Table 8. Performance of different weight decay policies for WHE layers based on the ResNet-18 model on ImageNet with 150 × 150 image size. The weight decay for the other layers is set to 1e−4 as in the original model. Compared with using the same weight decay as the other layers, reducing the weight decay of the WHE layers to 1e−6 improves Top-1 accuracy by a further 0.74%. Adjusting the weight decay during the training process improves it by another 0.36%.

4.5. CIFAR-100 & Tiny ImageNet

We evaluate five different ResNet models on CIFAR-100 and Tiny ImageNet, with depths equal to 20, 32, 44, 56, and 110. Each model is trained for 150 epochs. The learning rate is set to 0.1 for the first 80 epochs and divided by 10 every following 40 epochs. On smaller-scale datasets like CIFAR-100 and Tiny ImageNet, deep network models are more prone to overfitting. To reduce the impact of overfitting, we add dropout regularization [25] between the two convolutional layers in each basic residual block (a minimal sketch of this modified block is given after the group-size discussion below). Since the ResNet models on CIFAR-100 and Tiny ImageNet have many fewer channels than the ones on ImageNet, we also reduce the group size of the WHE layer to 4.

Table 5 shows the Top-1 accuracies of our method compared with the original ResNet models. To make a fair comparison, we also list the performance of the original ResNet models with dropout added (same dropout rate and locations in the residual block as ours). In all five network architectures, our method shows consistent improvement over the original ResNet models with dropout on both the CIFAR-100 and Tiny ImageNet datasets.

4.6. Ablation study

Group Size

In the proposed WHE layer, the input is first split into groups. Inside each group of size g, the width of the hidden layer is g^2. Therefore, the group size determines the width of the whole hidden layer. In Table 7, we analyze the influence of different group sizes on the performance of the models with our WHE layer. When increasing the group size from 4 to 16, a consistent accuracy improvement (0.42%, 1.32%, and 2.01%) is observed. This result further justifies the benefit of increasing the implicit width with our WHE layer.
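Returning to the CIFAR-100 / Tiny ImageNet setup of Section 4.5, the following is a minimal sketch of the basic residual block with dropout inserted between its two convolutions, as referenced there. The class name and the dropout rate are our own assumptions; the paper does not report the exact rate.

```python
import torch.nn as nn

class BasicBlockWithDropout(nn.Module):
    """CIFAR-style ResNet basic block with dropout between the two 3x3 convolutions."""

    def __init__(self, channels, p_drop=0.3):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.drop = nn.Dropout(p_drop)      # regularization between the two convolutions
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.drop(out)
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)           # identity residual connection
```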

Input | Model | mAP | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | motor | person | plant | sheep | sofa | train | tv
512 | Original | 74.9 | 76.1 | 84.1 | 75.9 | 64.9 | 54.7 | 81.9 | 82.2 | 85.1 | 57.9 | 83.1 | 70.9 | 81.9 | 84.8 | 84.7 | 78.3 | 49.5 | 75.7 | 70.3 | 82.8 | 72.8
512 | WHE | 76.2 | 76.8 | 84.6 | 75.7 | 66.1 | 56.4 | 83.0 | 83.4 | 85.2 | 61.6 | 82.6 | 72.7 | 85.1 | 86.1 | 84.9 | 78.8 | 50.2 | 79.3 | 73.3 | 83.2 | 74.9
384 | Original | 71.6 | 73.0 | 79.6 | 69.7 | 60.8 | 44.3 | 78.5 | 78.2 | 86.1 | 52.3 | 79.7 | 72.6 | 82.2 | 81.6 | 83.1 | 74.3 | 43.7 | 71.4 | 72.5 | 78.3 | 68.9
384 | WHE | 72.8 | 73.4 | 79.0 | 69.6 | 62.8 | 46.7 | 80.2 | 78.7 | 85.2 | 55.4 | 81.1 | 73.2 | 84.5 | 84.5 | 81.5 | 76.6 | 48.6 | 73.9 | 71.5 | 80.1 | 69.5

Table 9. Comparison on Pascal VOC 2007 test set for two different input resolutions.

Input | Backbone | AP | AP50 | AP75
512 | ResNet18 | 27.7 | 44.3 | 29.1
512 | ResNet18 + WHE | 28.9 | 46.1 | 29.8

Table 10. Comparison on the COCO validation dataset for an input size of 512 × 512.

Weight Decay for WHE layers

We studied the impact of different weight decay policies on the WHE layer with group size 16 and found that different weight decays have a noticeable influence on the scale of the parameters in the WHE layer. If the same weight decay (1e−4) is applied to all layers, a large portion of the parameters in the WHE layers turn to zero. Considering this, we make the weight decay for the WHE layers different from that of the other layers. On the other hand, setting the weight decay to a smaller value introduces slight overfitting, and the accuracy increases less than expected when the learning rate is reduced. Therefore, we adopt an adjustable weight decay policy in our experiments: we set the weight decay to a small value (1e−6) in the early phase of the training process, and increase it to 1e−5 later. Table 8 compares the accuracy of using such an adjusted weight decay against the fixed one. It demonstrates that the adjustable weight decay introduces a higher performance increase (+2.01% Top-1) than the fixed one (+1.65% Top-1).
4.7. Object Detection

We evaluate the performance of WHE on the object detection problem by applying it to CenterNet [36], a state-of-the-art one-stage object detection CNN model. We use the ResNet-18 backbone model as the baseline and embed WHE as described in the ImageNet experiments to measure its improvement. We follow the same training strategy as the original CenterNet paper and source code. Mean Average Precision (mAP) is evaluated on the MS COCO [21] and Pascal VOC [6] datasets. We test with 512 × 512 resolution on COCO, and with both 512 × 512 and 384 × 384 on Pascal VOC.

On COCO, we compare with the original model on mean average precision over different IoU thresholds (AP for all thresholds, AP50 for IoU threshold 0.5, and AP75 for IoU threshold 0.75). On the Pascal VOC dataset, we report the average precision for each category and the mean average precision over all categories. Table 9 shows that with our proposed WHE layer, the average precisions of CenterNet on most categories of the Pascal VOC dataset are significantly higher than those of the original model. Table 10 shows that when applying the WHE layer to the ResNet-18 backbone, CenterNet achieves more than 1% better mAP at different IoU thresholds on the COCO dataset.

5. Conclusion

In this work, we introduce a novel Wide Hidden Expansion (WHE) layer for deep convolutional neural networks. The WHE layer is composed of a wide hidden layer, which has a much larger number of channels than the input and output. Each channel in the hidden layer is connected to only two input channels and one output channel. WHE layers significantly increase the implicit width of the network with a negligible increase in computational complexity and memory consumption. With extensive experiments, we show that adding WHE layers to existing network architectures consistently boosts accuracy across different datasets and network architectures, for both image classification and object detection.

References

[1] https://pytorch.org.
[2] https://github.com/pytorch/examples/tree/master/imagenet.
[3] F. Agostinelli, M. Hoffman, P. Sadowski, and P. Baldi. Learning activation functions to improve deep neural networks. arXiv preprint arXiv:1412.6830, 2014.
[4] T. Chen, B. Xu, C. Zhang, and C. Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016.
[5] D.-A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.
[6] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The Pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
[7] A. N. Gomez, M. Ren, R. Urtasun, and R. B. Grosse. The reversible residual network: Backpropagation without storing activations. In Advances in Neural Information Processing Systems 30, pages 2214–2224, 2017.

[8] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.
[9] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.
[10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[11] Y. He, X. Zhang, and J. Sun. Channel pruning for accelerating very deep neural networks. In International Conference on Computer Vision (ICCV), volume 2, 2017.
[12] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[13] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507, 2017.
[14] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[15] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
[16] X. Jin, C. Xu, J. Feng, Y. Wei, J. Xiong, and S. Yan. Deep learning with S-shaped rectified linear activation units. arXiv preprint arXiv:1512.07030, 2015.
[17] F. Juefei-Xu, V. N. Boddeti, and M. Savvides. Local binary convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, volume 1. IEEE, 2017.
[18] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[20] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
[21] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[22] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky. Sparse convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 806–814, 2015.
[23] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[24] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[25] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[26] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
[27] M. Wang, B. Liu, and H. Foroosh. Design of efficient convolutional layers using single intra-channel convolution, topological subdivisioning and spatial bottleneck structure. arXiv preprint arXiv:1608.04337, 2016.
[28] M. Wang, B. Liu, and H. Foroosh. Look-up table unit activation function for deep convolutional neural networks. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1225–1233. IEEE, 2018.
[29] Y. Wang, C. Xu, S. You, D. Tao, and C. Xu. CNNpack: Packing convolutional neural networks in the frequency domain. In Advances in Neural Information Processing Systems 29, pages 253–261, 2016.
[30] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pages 2074–2082, 2016.
[31] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5987–5995. IEEE, 2017.
[32] L. Yao and J. Miller. Tiny ImageNet classification with convolutional neural networks. CS 231N, 2015.
[33] J. Yoon and S. J. Hwang. Combined group and exclusive sparsity for deep neural networks. In International Conference on Machine Learning, pages 3958–3966, 2017.
[34] S. Zagoruyko and N. Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
[35] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833. Springer, 2014.
[36] X. Zhou, D. Wang, and P. Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.
[37] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. arXiv preprint arXiv:1707.07012, 2017.
