
The 2018 5th International Conference on Systems and Informatics (ICSAI 2018)

Improving Convolutional Neural Network Using Pseudo Derivative ReLU

Zheng Hu (1), Yongping Li (2), and Zhiyong Yang (1)
(1) School of Software, Nanchang Hangkong University, Nanchang, China
(2) Shanghai Institute of Applied Physics, Chinese Academy of Sciences, Shanghai, China
e-mail: [email protected], [email protected], [email protected]

Abstract—Rectified linear unit (ReLU) is a widely used activation function in artificial neural networks. It is considered an efficient activation function thanks to its simplicity and non-linearity. However, ReLU's derivative for negative inputs is zero, which can make some ReLUs inactive for essentially all inputs during training. There are several ReLU variations for solving this problem. Compared with ReLU, they are slightly different in form and bring other drawbacks, such as being more expensive to compute. In this study, pseudo derivatives were tried in place of the original derivative of ReLU, while ReLU itself was left unchanged. The pseudo derivative was designed to alleviate the zero-derivative problem while remaining consistent with the original derivative in general. Experiments showed that using the pseudo derivative ReLU (PD-ReLU) could clearly improve AlexNet (a typical convolutional neural network model) in CIFAR-10 and CIFAR-100 tests. Furthermore, some empirical criteria for designing such pseudo derivatives are proposed.

Keywords—pseudo derivative; ReLU; convolutional neural network; AlexNet

I. INTRODUCTION

A. Rectified linear units

Recently, ReLU has been considered the most popular activation function for deep neural networks [1][2], especially in image processing tasks. ReLU's output is the maximum of the input and zero, which is easy to compute. Its output equals the input when the input value is non-negative. This property can alleviate the gradient vanishing and exploding problems that usually occur with sigmoid or tanh activation functions. ReLU's output becomes zero when the input is negative, which can give the neural network more sparsity. However, ReLU's derivative also becomes zero when the input value is negative, and this property may result in some outputs getting stuck at zero for essentially all inputs. This is known as the dying ReLU problem. Several variants of ReLU were proposed to make improvements, such as Leaky ReLUs [3] and exponential linear units [4]. They all avoid the zero derivative, but are more complex than the original ReLU in form.

B. AlexNet

Convolutional neural networks (CNNs) have made big progress in image processing, such as image classification [5] and object detection [6]. One typical model for image classification is AlexNet. It was designed by the SuperVision group and won the ImageNet Large Scale Visual Recognition Challenge 2012 [7]. It mainly contains 5 convolutional layers and 3 fully connected layers [8], and there are ReLUs after each convolutional layer. Compared with deeper CNN models, AlexNet has no residual connections, so each activation function (ReLU) affects the next layer directly. In this study, a novel approach named PD-ReLU is proposed and applied to AlexNet for better performance.

II. RELATED WORKS

A. Leaky ReLUs

Compared with the original ReLU, Leaky ReLUs have a relatively small positive gradient for negative inputs [3], so there is no zero gradient in the activation function. However, it gives the neural network no additional sparsity, because the outputs for negative inputs are no longer zero. It is defined as:

    f(x) = x,       if x ≥ 0
           0.01·x,  if x < 0                                          (1)

The parametric rectified linear unit (PReLU) is a variation of Leaky ReLU; it turns the small positive gradient into a parameter that is learned like the other neural network parameters [9]. It is defined as:

    f(x) = x,       if x ≥ 0
           a·x,     if x < 0                                          (2)

The randomized leaky rectified linear unit (RReLU) is another variation of Leaky ReLU; it makes the small positive gradient a random value drawn from a uniform distribution [10], defined as:

    a ~ U(l, u),  with l < u and l, u ∈ [0, 1)                        (3)

B. Exponential linear units

Exponential linear units (ELUs) also have the identity form for non-negative inputs, like ReLUs. For negative inputs, their outputs smoothly approach -α, which makes the mean activation closer to zero. It has been shown that ELUs have better convergence speed and can obtain higher classification accuracy than ReLUs in some models [4].

    f(x) = x,             if x > 0
           α·(e^x - 1),   if x ≤ 0                                    (4)

α is a non-negative hyper-parameter to be tuned.
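For reference, all of the variants above are available as standard PyTorch modules (PyTorch is also the framework used for the experiments in Section IV). The sketch below is only illustrative; apart from the 0.01 slope of Eq. (1), the hyper-parameter values shown are library defaults or arbitrary choices, not values from this paper.

    import torch
    import torch.nn as nn

    x = torch.linspace(-3.0, 3.0, 7)

    leaky = nn.LeakyReLU(negative_slope=0.01)   # Eq. (1)
    prelu = nn.PReLU(init=0.25)                 # Eq. (2), slope "a" is learned during training
    rrelu = nn.RReLU(lower=1/8, upper=1/3)      # Eq. (3), slope sampled from U(l, u) in training mode
    elu   = nn.ELU(alpha=1.0)                   # Eq. (4)

    for name, act in [("LeakyReLU", leaky), ("PReLU", prelu), ("RReLU", rrelu), ("ELU", elu)]:
        print(name, act(x))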
III. PSEUDO DERIVATIVE RECTIFIED LINEAR UNITS (PD-RELUS)

A. Pseudo Derivative for the Activation Function's Back Propagation

In back propagation, network weights are adjusted based on their derivatives. However, ReLU's derivative has a drawback that may result in the dying ReLU problem. So, manually creating a pseudo derivative is proposed in this study to alleviate this problem, and ReLUs with a pseudo derivative showed the capacity to provide better performance than the original ReLUs in the experiments.
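As a tiny illustration of the zero-gradient issue behind the dying ReLU problem (a generic PyTorch check, not taken from the paper): negative pre-activations receive no gradient at all through a standard ReLU.

    import torch

    # d/dx relu(x) is 0 for negative inputs, so those units get no weight update through this path.
    x = torch.tensor([-2.0, -0.5, 1.0], requires_grad=True)
    torch.relu(x).sum().backward()
    print(x.grad)   # tensor([0., 0., 1.])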

This work was supported by National Natural Science Foundation of China (61501218).

There were some empirical limits in designing the pseudo derivative. Assume the original derivative function is f(x) and the pseudo derivative function is g(x); let p(x) be the probability density function of the input values under the original derivative, and q(x) the probability density function of the input values while using the pseudo derivative; and assume the training loss converges. First of all, the pseudo derivative should be consistent with the original derivative in general, because adjusting the network weights must follow their variation trends. So g(x) and f(x) should satisfy the following constraint:

    ∫_{-∞}^{+∞} |g(x)·q(x) - f(x)·p(x)| dx ≤ C1                       (5)

C1 is an empirical positive value depending on the model and data, which restricts the difference between the two derivatives.

Considering that p(x) and q(x) were similar (since outputs with g(x) were only slightly better than with f(x) in this study, q(x) should also differ only slightly from p(x)), (5) can be simplified to:

    ∫_{-∞}^{+∞} |g(x) - f(x)| dx ≤ C1                                 (6)

Secondly, the weights' total amount of update should be similar during training. So g(x) and f(x) should satisfy the following constraint:

    | ∫_{-∞}^{+∞} (g(x)·q(x) - f(x)·p(x)) dx | ≤ C2                   (7)

Analogously, C2 is an empirical positive value (a small number), and (7) can be simplified to:

    | ∫_{-∞}^{+∞} (g(x) - f(x)) dx | ≤ C2                             (8)

If the expression in (8) exceeds C2, which means the pseudo derivative updates the weights much less or much more than the normal derivative does, the training will not be optimal, or may not even complete.
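As a rough numerical illustration of (6) and (8) (not from the paper), the sketch below evaluates both left-hand sides on a finite grid, taking g(x) = sigmoid(α·x) (the pseudo derivative introduced in Section III.B below) and f(x) as ReLU's ordinary derivative, the step function.

    import numpy as np

    alpha = 4.0
    x = np.linspace(-20.0, 20.0, 400001)      # the difference g - f decays quickly in the tails
    dx = x[1] - x[0]

    f = (x >= 0).astype(float)                # ReLU's derivative: step function
    g = 1.0 / (1.0 + np.exp(-alpha * x))      # pseudo derivative g(x) = sigmoid(alpha * x)

    lhs6 = np.sum(np.abs(g - f)) * dx         # left-hand side of (6); analytically 2*ln(2)/alpha
    lhs8 = abs(np.sum(g - f) * dx)            # left-hand side of (8); analytically 0 by symmetry
    print(lhs6, lhs8)                         # roughly 0.35 and 0 for alpha = 4

Both quantities stay small, which is consistent with this g(x) being a usable pseudo derivative under the criteria above.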
B. Empirical PD-ReLUs

Two types of empirical PD-ReLUs were introduced in this study. The first pseudo derivative is a sigmoid function:

    g(x) = sigmoid(α·x)                                               (9)

ReLU's original derivative is the step function. The sigmoid function is consistent with it in general, and it easily satisfies (6) and (8).

The second is a sloped step function:

    g(x) = 1,            if x ≥ α/2
           x/α + 1/2,    if -α/2 < x < α/2                            (10)
           0,            if x ≤ -α/2

This sloped step function is just like a stretched step function, and there is a range of negative inputs with a positive derivative, which gives these inputs more chances to get out of the inactive situation.

Each pseudo derivative has a positive hyper-parameter α. They avoid a constant zero derivative for all negative inputs.

A comparison of these two pseudo derivatives and the original ReLU derivative is shown in Fig. 1. Although the pseudo derivatives are not the real slopes of the activation function, they can be considered a reasonable anticipation of increase for negative inputs, especially for inputs near zero.

Figure 1. The derivatives of ReLU, PD-ReLU-sigmoid (α=4) and PD-ReLU-sloped step (α=2.718).
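The paper does not publish code, so the following is only a minimal sketch of how such a PD-ReLU could be realized in PyTorch: the forward pass stays the ordinary ReLU, and only the backward pass substitutes the pseudo derivative of Eq. (9) (or Eq. (10)).

    import torch
    import torch.nn as nn

    class _PDReLUFunction(torch.autograd.Function):
        # ReLU forward pass; sigmoid pseudo derivative (Eq. 9) in the backward pass.

        @staticmethod
        def forward(ctx, x, alpha):
            ctx.save_for_backward(x)
            ctx.alpha = alpha
            return x.clamp(min=0)                  # output is the unchanged ReLU

        @staticmethod
        def backward(ctx, grad_output):
            (x,) = ctx.saved_tensors
            g = torch.sigmoid(ctx.alpha * x)       # pseudo derivative g(x) = sigmoid(alpha * x)
            # For the sloped-step variant of Eq. (10) one could use instead:
            # g = (x / ctx.alpha + 0.5).clamp(0.0, 1.0)
            return grad_output * g, None           # no gradient for the hyper-parameter alpha

    class PDReLU(nn.Module):
        # Drop-in replacement for nn.ReLU inside a network such as AlexNet.
        def __init__(self, alpha=4.0):             # alpha = 4 follows the sigmoid setting in Section IV
            super().__init__()
            self.alpha = alpha

        def forward(self, x):
            return _PDReLUFunction.apply(x, self.alpha)

In the experiments below, a PDReLU() module of this kind would simply take the place of nn.ReLU() in the network of Table I.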

IV. EXPERIMENTS

A. Data Sets and Training Parameters

Classification performance was evaluated on the CIFAR-10 and CIFAR-100 datasets. CIFAR-10 contains 10 different classes and CIFAR-100 contains 100 [11]. Each of them has 50,000 training images and 10,000 test images.

The neural network structure was AlexNet using original ReLUs, Leaky ReLUs (negative slope 0.01) and PD-ReLUs separately. Because the image size of CIFAR-10/CIFAR-100 is relatively small (32×32), the AlexNet in this study contains only one fully connected layer; its structure is shown in Table I.

TABLE I. CONVOLUTIONAL NEURAL NETWORK STRUCTURE

    Convolutional Layer Index | AlexNet
    --------------------------|------------------------------------
    1                         | Convolutional layer, 11×11 kernel
                              | ReLU / Leaky ReLU / PD-ReLU
                              | Max-pooling, 2×2 kernel
    2                         | Convolutional layer, 5×5 kernel
                              | ReLU / Leaky ReLU / PD-ReLU
                              | Max-pooling, 2×2 kernel
    3                         | Convolutional layer, 3×3 kernel
                              | ReLU / Leaky ReLU / PD-ReLU
    4                         | Convolutional layer, 3×3 kernel
                              | ReLU / Leaky ReLU / PD-ReLU
    5                         | Convolutional layer, 3×3 kernel
                              | ReLU / Leaky ReLU / PD-ReLU
                              | Max-pooling, 2×2 kernel
    -                         | Fully connected layer
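A sketch of the Table I network in PyTorch is given below. The kernel sizes and the activation/pooling pattern follow the table; the channel counts, strides and padding are not specified in the paper and are chosen here purely for illustration.

    import torch.nn as nn

    def make_cifar_alexnet(activation=nn.ReLU, num_classes=10):
        # Table I layout: five conv blocks (11x11, 5x5, 3x3, 3x3, 3x3), max-pooling after
        # blocks 1, 2 and 5, and a single fully connected layer for 32x32 inputs.
        return nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=1, padding=5), activation(), nn.MaxPool2d(2),
            nn.Conv2d(64, 192, kernel_size=5, padding=2), activation(), nn.MaxPool2d(2),
            nn.Conv2d(192, 384, kernel_size=3, padding=1), activation(),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), activation(),
            nn.Conv2d(256, 256, kernel_size=3, padding=1), activation(), nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(256 * 4 * 4, num_classes),   # 32x32 input -> 4x4 after three 2x2 poolings
        )

    # model = make_cifar_alexnet(activation=PDReLU)   # PDReLU as sketched in Section III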

All the experiments were carried out on the basis of PyTorch (a framework for fast and flexible experimentation) [12]. PyTorch's random crop and random horizontal flip modules were used for data augmentation. Each training process ran for 164 loops (epochs); the initial learning rate was 0.1, which was reduced to 0.01 after 81 loops and to 0.001 after 123 loops. Each reported result is the third best result out of 5 repeated runs.
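A sketch of that training setup follows. The data augmentation and learning-rate schedule follow the description above; the optimizer, momentum, weight decay, batch size and crop padding are not stated in the paper and are assumptions made here for illustration.

    import torch
    import torchvision
    import torchvision.transforms as transforms

    train_tf = transforms.Compose([
        transforms.RandomCrop(32, padding=4),        # padding value is an assumption
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ])
    train_set = torchvision.datasets.CIFAR10("./data", train=True, download=True, transform=train_tf)
    train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

    model = make_cifar_alexnet(activation=PDReLU)    # helpers sketched in Sections III and IV.A
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)  # assumed
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[81, 123], gamma=0.1)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(164):                         # 164 epochs, as in the paper
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss_fn(model(images), labels).backward()
            optimizer.step()
        scheduler.step()                             # 0.1 -> 0.01 after 81 epochs -> 0.001 after 123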
B. Results and Discussion

Table II shows the results of the CIFAR-10/CIFAR-100 tests. PD-ReLU-sigmoid was tested with one parameter setting (α=4), and PD-ReLU-sloped step with two (α=1 and α=2.718). The best results for CIFAR-10 and CIFAR-100 were obtained by PD-ReLU-sloped step (α=1) and PD-ReLU-sloped step (α=2.718), respectively.

TABLE II. ERROR RATE OF CIFAR-10/CIFAR-100 TESTS

    Active Function                | CIFAR-10 Test Error % | CIFAR-100 Test Error %
    -------------------------------|-----------------------|-----------------------
    ReLU                           | 22.31                 | 56.11
    Leaky ReLU                     | 22.06                 | 55.66
    PD-ReLU-sigmoid (α=4)          | 21.82                 | 54.42
    PD-ReLU-sloped step (α=1)      | 21.58                 | 54.6
    PD-ReLU-sloped step (α=2.718)  | 22.13                 | 53.69

Comparisons of the convergence curves with the different activation functions are shown in Fig. 2 and Fig. 3. Based on the experiments, we found that convergence speed and error rate are better with PD-ReLUs. This trend is more obvious in CIFAR-100 than in CIFAR-10.

Criterion (8) was also tested by experiment. The experiments replaced a proper g(x) with shifted versions g(x + δ) and g(x − δ). PD-ReLU-sigmoid (α=4) was chosen as the proper g(x). Tests were done with g(x ± 0.25) and g(x ± 0.5) on CIFAR-100; results are shown in Table III. (Criterion (6) was considered self-evident, since an arbitrary pseudo derivative would not adjust the weights properly, so no experiment was needed for it.)

TABLE III. TEST RESULTS OF g(x), g(x ± 0.25) AND g(x ± 0.5)

    Active Function        | Converges | CIFAR-100 Test Error %
    -----------------------|-----------|-----------------------
    g(x) = sigmoid(4·x)    | Yes       | 54.42
    g(x + 0.25)            | Yes       | 55.31
    g(x − 0.25)            | Yes       | 54.08
    g(x + 0.5)             | No        | -
    g(x − 0.5)             | No        | -

Both g(x + 0.5) and g(x − 0.5) failed to converge. While g(x + 0.5) updated the weights too much, so that the training loss usually became divergent after several training loops, g(x − 0.5) was the opposite: the training loss could not even start reducing. These test results were consistent with criterion (8).

V. CONCLUSION

Back propagation is essentially a mechanism that adjusts network weights according to derivatives. If the derivative has drawbacks in this adjustment, applying a manual amendment (a pseudo derivative) is possible. In this work, pseudo derivatives were applied to ReLU in a typical convolutional neural network (AlexNet) and obtained clearly better performance, because the pseudo derivatives were designed to alleviate the dying ReLU problem by giving some anticipation of increase to negative inputs. Although the training process is a little longer, the resulting model has all the benefits of the original ReLUs with better accuracy.

REFERENCES

[1] LeCun Y, Bengio Y, Hinton G. Deep learning. Nature, 2015, 521(7553): 436-444.
[2] Ramachandran P, Zoph B, Le Q V. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017.
[3] Maas A L, Hannun A Y, Ng A Y. Rectifier nonlinearities improve neural network acoustic models. Proc. ICML, 2013, 30(1): 3.
[4] Clevert D A, Unterthiner T, Hochreiter S. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.
[5] Russakovsky O, Deng J, Su H, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 2015, 115(3): 211-252.
[6] Girshick R. Fast R-CNN. Proceedings of the IEEE International Conference on Computer Vision, 2015: 1440-1448.
[7] http://www.image-net.org/challenges/LSVRC/2012/results.html
[8] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks. International Conference on Neural Information Processing Systems, Curran Associates Inc., 2012: 1097-1105.
[9] He K, Zhang X, Ren S, et al. Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. Proceedings of the IEEE International Conference on Computer Vision, 2015: 1026-1034.
[10] Xu B, Wang N, Chen T, et al. Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853, 2015.
[11] Nair V, Hinton G E. Rectified linear units improve restricted Boltzmann machines. Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010: 807-814.
[12] Ketkar N. Introduction to PyTorch. In: Deep Learning with Python. Apress, Berkeley, CA, 2017: 195-208.


Figure 2. Convergence curves of AlexNet with different activations on CIFAR-10 test sets.


Figure 3. Convergence curves of AlexNet with different activations on CIFAR-100 test sets.
