A REVIEW AND PERFORMANCE ANALYSIS OF NON-LINEAR ACTIVATION FUNCTIONS IN DEEP NEURAL NETWORKS

Sushma Priya Anthadupula *1, Dr. Manasi Gyanchandani *2
*1 M.Tech Scholar, Department of Computer Science Engineering, Maulana Azad National Institute of Technology, Bhopal, M.P., India.
*2 Professor, Department of Computer Science Engineering, Maulana Azad National Institute of Technology, Bhopal, M.P., India.

International Research Journal of Modernization in Engineering Technology and Science, Volume 03, Issue 03, March 2021, e-ISSN: 2582-5208.

ABSTRACT
Activation functions are mathematical equations that apply a nonlinear transformation to the input of a neuron so that the network can learn and perform complex tasks efficiently. A weighted sum of the inputs is computed and a bias is added to it to produce an output; the decision to fire a neuron is made by applying the activation function to this output. This paper presents an analysis of the activation functions used in artificial neural networks (ANNs) and explores their efficiency on the MNIST and Fashion-MNIST datasets. The convergence speed during backpropagation under each activation function is also reported.

Keywords: Activation Function, Neural Network, Hidden Unit, Backpropagation, Deep Neural Networks.

I. INTRODUCTION
The brain is an intelligent system, and Artificial Neural Networks (ANNs) try to mimic its behaviour. The output of a neuron is produced by summing the products of the weights and inputs and adding a bias; this value can range anywhere from -∞ to +∞. An activation function is any function that bounds the output of a neuron, typically to the range (0, 1) or (-1, 1), in order to decide whether the neuron fires. With nonlinear activation functions such as sigmoid, hyperbolic tangent (tanh), ReLU, ELU, Softplus, or Leaky ReLU, neural networks can approximate arbitrary continuous functions. The sigmoid function suffers from the vanishing gradient problem. As a result, the Rectified Linear Unit (ReLU) was proposed and quickly gained popularity, but it introduced the dying ReLU problem; later, Leaky ReLU, Parametric ReLU, the Exponential Linear Unit (ELU), and other variants were introduced. Automated search can also help in finding suitable activation functions [1].

Without an activation function, a neural network does the same job as linear regression: no matter how many hidden layers are added, the output remains a linear function of the input. Real-world data such as video, images, and music are nonlinear in nature, so a nonlinear activation function, which should be continuous and differentiable, is needed to produce a nonlinear decision boundary. The network architecture is governed by variables called hyperparameters, which help determine the accuracy and the training time of the neural network; batch size, momentum, learning rate, and number of epochs are some of the tunable hyperparameters. Increasing the number of hidden neurons can enhance prediction accuracy, training speed, and stability [2][3][4], and adaptive batch sizes have been seen to produce good results on non-convex problems [5].

In this paper, we compare different activation functions on the MNIST Handwritten Digits dataset and the Fashion-MNIST dataset. MNIST contains 50,000 training images, 10,000 validation images, and 10,000 evaluation images, each a 28x28 greyscale image of one of the ten digits.
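As a concrete illustration of the neuron computation described above (a weighted sum of the inputs plus a bias, passed through a bounding activation), the following minimal NumPy sketch is provided. It is not taken from the paper; the input, weight, and bias values are purely illustrative.

```python
import numpy as np

def sigmoid(z):
    """Logistic (squashing) function: bounds any real input to the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative values only (not from the paper): three inputs feeding one neuron.
x = np.array([0.5, -1.2, 3.0])   # inputs
w = np.array([0.4, 0.7, -0.2])   # weights
b = 0.1                          # bias

z = np.dot(w, x) + b             # weighted sum plus bias; can lie anywhere in (-inf, +inf)
a = sigmoid(z)                   # the activation bounds the neuron's output to (0, 1)
print(f"pre-activation z = {z:.3f}, activation a = {a:.3f}")
```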
On the MNIST dataset, the performance of the activation functions is investigated using two hidden layers, each with 100 units.

II. REVIEW OF ACTIVATION FUNCTIONS
The advantages and implementations of the activation functions commonly used in deep learning algorithms are discussed in this paper. Table 1 discusses the uses and limitations of various activation functions, and the equations of all the activation functions are given in Table 2.

1) Sigmoid Activation Function
The sigmoid function has been widely used for many years; it is also known as the squashing function or logistic function [6]. It has a smooth S-shaped curve, is differentiable, and is continuously bounded with positive derivatives everywhere [7]. It has applications in modeling logistic regression tasks and binary classification problems [8]. During backpropagation, however, sharp damped gradients are propagated from the deep hidden layers back to the input layer. Figure 1 shows the shape of the sigmoid function. The following are the variants of the sigmoid activation function.

a) Hard Sigmoid Function
This variant of the sigmoid function is given by
f(x) = max(0, min(1, (x + 1) / 2)).
The hard sigmoid has a lower computation cost and shows positive results on deep-learning-based binary classification tasks [9].

b) Sigmoid-Weighted Linear Unit (SiLU)
The SiLU activation is given by
a_k(s) = z_k · σ(z_k),
where s is the input vector and z_k = Σ_i w_ik s_i + b_k is the weighted sum of the inputs to unit k (with weights w_ik and bias b_k).

c) Sigmoid-Weighted Linear Unit Derivative (dSiLU)
The dSiLU activation is given by
a_k(s) = σ(z_k) (1 + z_k (1 − σ(z_k))).
The dSiLU, which looks like an overshooting sigmoid function, significantly outperformed the standard sigmoid function [10].

Figure 1: Sigmoid Activation Function

2) Hyperbolic Tangent Activation Function
The tanh function attains a gradient of 1 only when the input is 0; for inputs far from 0 the gradient saturates towards zero, which gives rise to the dead neurons problem. tanh is used extensively in speech recognition tasks and in recurrent neural networks for natural language processing [11]. Hard Hyperbolic Tangent is a variant of tanh. Figure 2 shows the shape of the tanh function.

a) Hard Hyperbolic Tangent
The hard tanh is a variant of tanh that is computationally cheaper. It covers the range -1 to 1 and is defined as
f(x) = -1 if x < -1, x if -1 ≤ x ≤ 1, 1 if x > 1.
Applications of hard tanh include natural language processing, where it improves speed and accuracy [12].

Figure 2: Hyperbolic Tangent Activation Function

3) Rectified Linear Unit
ReLU is the most commonly used activation function for deep neural networks. It is simple to compute, and its piecewise-linear form preserves the benefits of linear models while still allowing non-linear behaviour, which helps achieve state-of-the-art results. It triggers neurons using a threshold operation, and its performance is better than that of the sigmoid and tanh functions with respect to generalization and training. By clamping negative values to zero while passing positive values through unchanged, ReLU introduces sparsity in the hidden units.
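A minimal NumPy sketch of the thresholding and sparsity behaviour described above follows. It is not part of the paper; the pre-activation values are made up for illustration.

```python
import numpy as np

def relu(z):
    """Rectified Linear Unit: thresholds at zero, passing positive values through unchanged."""
    return np.maximum(0.0, z)

# Illustrative hidden-layer pre-activations (made-up values).
z = np.array([-2.1, 0.3, -0.7, 1.8, -0.1, 2.4])
a = relu(z)

print("activations:", a)                        # negative inputs are clamped to zero
print("inactive fraction:", np.mean(a == 0.0))  # the zeros are what makes the hidden layer sparse
# Note: for the clamped units the gradient of ReLU is also zero.
```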
However, during backpropagation the gradient of ReLU is zero for negative inputs, so the corresponding weights and biases are not updated. This dying neurons problem means the affected units can no longer be activated by weight updates, which hinders learning; Leaky ReLU has been proposed to solve this issue [16]. ReLU and its variants are used effectively in Restricted Boltzmann Machines (RBMs) and in CNNs [13]. Figure 3 shows the shape of the ReLU function.

Figure 3: ReLU Activation Function

a) Leaky ReLU Function (LReLU)
To keep the weights changing during backpropagation, the LReLU function adds a slight negative slope to ReLU: a small slope parameter is introduced so that the gradient is never exactly zero during training, which rectifies the dead neurons problem. Compared to the hyperbolic tangent and the regular ReLU, LReLU ensures improved sparsity and dispersion [14][15].

b) Randomized Leaky ReLU (RReLU)
RReLU is a randomized form of leaky ReLU in which the negative slope a_ji used during training is drawn at random from a uniform distribution U(l, u). If the average of all a_ji is taken, without any dropout, the test output is given by
y_ji = x_ji for x_ji ≥ 0, and y_ji = ((l + u) / 2) x_ji for x_ji < 0.
RReLU has been found to outperform ReLU on classification tasks [16][17].

c) S-shaped ReLU (SReLU)
The S-shaped ReLU can learn both convex and non-convex functions. Needing fewer parameters helps SReLU train deeper networks easily, and it has shown performance improvements on the most popular datasets both with and without data augmentation for regularization.

d) Parametric ReLU (PReLU)
In Leaky ReLU the slope is fixed for all negative inputs, whereas in PReLU the slope is a parameter α that the neural network learns through gradient descent.

4) Softplus Function
The softplus function enhances the performance and stabilization of deep neural networks, as it has a nonzero gradient everywhere. When compared to the ReLU and sigmoid functions, softplus performed better, requiring fewer epochs to converge during testing.

5) Exponential Linear Units (ELUs)
ELUs are used to accelerate deep neural network training. Since ELU uses the identity for positive values, it avoids the vanishing gradient problem, and compared to ReLU it effectively eliminates bias shift.

6) Swish Function
The Swish function is a hybrid of the input and the sigmoid function, computed as f(x) = x · σ(x).
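For reference, the sketch below implements the ReLU variants and smooth alternatives discussed in this section. It is not taken from the paper; the slope of 0.01 for Leaky ReLU, the fixed α values shown for PReLU and ELU, and the use of β = 1 in Swish are illustrative assumptions.

```python
import numpy as np

def leaky_relu(z, slope=0.01):
    """Leaky ReLU: a small fixed negative slope keeps the gradient nonzero for z < 0."""
    return np.where(z >= 0, z, slope * z)

def prelu(z, alpha):
    """Parametric ReLU: like Leaky ReLU, but the negative slope alpha is learned in training."""
    return np.where(z >= 0, z, alpha * z)

def elu(z, alpha=1.0):
    """ELU: identity for positive inputs, smooth exponential saturation for negative inputs."""
    return np.where(z >= 0, z, alpha * (np.exp(z) - 1.0))

def softplus(z):
    """Softplus: a smooth approximation of ReLU with a nonzero gradient everywhere."""
    return np.log1p(np.exp(z))

def swish(z):
    """Swish (with beta = 1): the input multiplied by its sigmoid, x * sigma(x)."""
    return z / (1.0 + np.exp(-z))

# Compare the variants on a few illustrative pre-activation values.
z = np.linspace(-3.0, 3.0, 7)
for name, f in [("leaky_relu", leaky_relu),
                ("prelu(alpha=0.25)", lambda v: prelu(v, 0.25)),
                ("elu", elu),
                ("softplus", softplus),
                ("swish", swish)]:
    print(f"{name:>18}: {np.round(f(z), 3)}")
```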