e-ISSN: 2582-5208
International Research Journal of Modernization in Engineering Technology and Science
Volume:03/Issue:03/March-2021    Impact Factor: 5.354    www.irjmets.com

A REVIEW AND PERFORMANCE ANALYSIS OF NON-LINEAR ACTIVATION FUNCTIONS IN DEEP NEURAL NETWORKS

Sushma Priya Anthadupula*1, Dr. Manasi Gyanchandani*2
*1 M.Tech Scholar, Department of Computer Science Engineering, Maulana Azad National Institute of Technology, Bhopal, M.P., India.
*2 Professor, Department of Computer Science Engineering, Maulana Azad National Institute of Technology, Bhopal, M.P., India.

ABSTRACT
Activation functions are mathematical equations that perform a nonlinear transformation on the input so that a neuron can learn and perform complex tasks efficiently. A weighted sum of the inputs is calculated and a bias is added to it to produce an output. The decision to fire a neuron is made after applying the activation function to this output. An analysis of the activation functions used in artificial neural networks (ANN) is presented. Our paper explores the efficiency of activation functions on the MNIST and Fashion MNIST datasets. The convergence speed during backpropagation after applying the activation functions is also shown in this work.

Keywords: Activation Function, Neural Network, Hidden Unit, Backpropagation, Deep Neural Networks.

I. INTRODUCTION
The brain is an intelligent system, and Artificial Neural Networks (ANN) try to mimic its behavior. The output of a neuron is produced by summing the products of weights and inputs and adding a bias. This value can range anywhere between -∞ and +∞. An activation function is any function that bounds the output of a neuron, typically to the range (0, 1) or (-1, 1), and decides whether the neuron fires.

Neural networks equipped with nonlinear activation functions such as sigmoid, hyperbolic tangent, ReLU, ELU, Softplus, or Leaky ReLU can approximate arbitrary continuous functions. The sigmoid function, however, suffers from the vanishing gradient problem. As a result, the Rectified Linear Unit (ReLU) activation function was suggested and quickly gained popularity, but it brought the dying ReLU problem. Later, Leaky ReLU, Parametric ReLU, the Exponential Linear Unit (ELU), and other variants were introduced. Automated search techniques can also help find suitable activation functions [1].

Without an activation function, a neural network does the same job as linear regression: the number of hidden layers has no effect on the output if no nonlinearity is applied. Real-world data such as videos, images, and music is nonlinear in nature, so a nonlinear activation function is needed; it should be continuous and differentiable so that gradients can be propagated, and it produces a nonlinear decision boundary. The network architecture is decided by variables called hyperparameters, which help determine the accuracy and the training time of the neural network. Batch size, momentum, learning rate, and number of epochs are some of the tunable hyperparameters. Increasing the number of hidden neurons can enhance prediction accuracy, training speed, and stability [2] [3] [4]. Adaptive batch sizes for non-convex problems have been seen to produce good results [5]. In this paper, we compare different activation functions on the MNIST handwritten digits dataset and the Fashion MNIST dataset. The MNIST dataset contains 50,000 training images, 10,000 validation images, and 10,000 evaluation images, each a 28x28 greyscale representation of one of the ten digits. On MNIST, the performance of the activation functions is investigated using two hidden layers, each with 100 units.

II. REVIEW OF ACTIVATION FUNCTIONS

The advantages and implementations of activation functions, which are commonly used in deep learning algorithms, are discussed in this paper. Table 1 discusses the uses and limitations of various activation functions. The equations of all the activation functions can be seen in Table 2.

1) Sigmoid Activation Function

The sigmoid function has been widely used for several years. It is also known as the squashing function or logistic function [6]. It has a smooth S-shaped curve, is continuously differentiable, and is bounded, with positive derivatives everywhere [7]. It has applications in modeling logistic regression tasks and binary classification problems [8]. During backpropagation, gradients are propagated from the deep hidden layers back to the input layer; because the sigmoid saturates, these gradients can become very small. Figure 1 shows the shape of the sigmoid function.

The following are the variants of the Sigmoid activation function:

a) Hard Sigmoid Function

The equation of this variant of the sigmoid function is given by

f(x) = max(0, min(1, (x + 1) / 2))

Hard sigmoid has a lower computation cost and shows positive results on binary classification tasks based on deep learning [9].

b) Sigmoid-Weighted Linear Units (SiLU)

The equation for this activation function is given by

a_k(s) = z_k σ(z_k)

where s is the input vector and z_k = w_k · s + b_k is the input to hidden unit k (w_k is the weight vector and b_k the bias of unit k).

c) Sigmoid-Weighted Linear Units Derivative (dSiLU)

The equation for dSiLU is given by

a_k(s) = σ(z_k)(1 + z_k(1 − σ(z_k)))

dSiLU, which looks like an overshooting sigmoid function, significantly outperformed the standard sigmoid function [10].
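For illustration, a minimal NumPy sketch of the sigmoid variants above is given below; the clipping constants used for the hard sigmoid follow the common max(0, min(1, (x + 1)/2)) form and are an assumption, since the exact variant is not fixed by the text.

```python
import numpy as np

def sigmoid(x):
    # Logistic (squashing) function: maps any real input into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def hard_sigmoid(x):
    # Piecewise-linear approximation of the sigmoid; cheaper to compute.
    # Assumed form: max(0, min(1, (x + 1) / 2)).
    return np.clip((x + 1.0) / 2.0, 0.0, 1.0)

def silu(z):
    # Sigmoid-weighted linear unit: z * sigmoid(z).
    return z * sigmoid(z)

def dsilu(z):
    # Derivative of the SiLU: sigmoid(z) * (1 + z * (1 - sigmoid(z))).
    s = sigmoid(z)
    return s * (1.0 + z * (1.0 - s))

x = np.linspace(-6.0, 6.0, 5)
print(sigmoid(x), hard_sigmoid(x), silu(x), dsilu(x), sep="\n")
```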

Figure 1 : Sigmoid Activation Function

2) Hyperbolic Tangent Activation Function

The tanh function attains a gradient of 1 only at x = 0; for inputs far from zero the gradient approaches zero, so it can still produce saturated (dead) neurons during training. tanh is extensively used in speech recognition tasks and in recurrent neural networks for natural language processing [11]. Hard Hyperbolic Tangent is a variant of tanh. Figure 2 shows the shape of the tanh function.

a) Hard Hyperbolic Tangent

The Hard tanh is a computationally cheaper variant of tanh. It covers the range -1 to 1. The equation is as follows:

f(x) = -1 if x < -1
f(x) = x if -1 ≤ x ≤ 1
f(x) = 1 if x > 1

Hard tanh applications include usage in natural language processing to improve speed and accuracy [12].
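A minimal NumPy sketch of tanh and its hard variant, following the piecewise definition above:

```python
import numpy as np

def tanh(x):
    # Hyperbolic tangent: smooth, zero-centered output in (-1, 1).
    return np.tanh(x)

def hard_tanh(x):
    # Piecewise-linear variant: -1 for x < -1, x for -1 <= x <= 1, 1 for x > 1.
    return np.clip(x, -1.0, 1.0)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(tanh(x), hard_tanh(x), sep="\n")
```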

Figure 2 : Hyperbolic Tangent Activation Function

3) Rectified Linear Unit

For deep neural networks, ReLU is the most commonly used activation function. Its simplicity and near-linear behavior, which preserves many of the properties that make linear models easy to optimize, help achieve state-of-the-art results. It uses a threshold operation to activate neurons. Its performance is better than that of the sigmoid and tanh functions with respect to generalization and training. By mapping all negative inputs to zero, ReLU introduces sparsity in the hidden units. However, the gradient is zero for negative inputs, so during backpropagation the corresponding weights and biases are not updated. This dying neurons problem means that weight updates cannot reactivate such neurons, which hinders learning. To solve this issue, Leaky ReLU has been proposed [16]. ReLU and its variants are used effectively in Restricted Boltzmann Machines (RBM) and in CNNs [13]. Figure 3 shows the shape of the ReLU function.
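The threshold behavior and the zero gradient for negative inputs can be seen directly in a small NumPy sketch (illustrative only):

```python
import numpy as np

def relu(x):
    # Threshold at zero: negative inputs are mapped to 0, which introduces
    # sparsity in the hidden units.
    return np.maximum(0.0, x)

def relu_grad(x):
    # The gradient is 0 for every negative input, so a unit whose inputs stay
    # negative receives no weight updates (the dying ReLU problem).
    return (x > 0).astype(float)

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(relu(x), relu_grad(x), sep="\n")
```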

Figure 3 : ReLU Activation Function

a) Leaky ReLU Function (LReLU)

To allow weight updates during backpropagation for negative inputs, the LReLU function adds a slight negative slope to ReLU. A small slope parameter α ensures that the gradient is never exactly zero during training, which rectifies the dead neurons problem. Compared to the hyperbolic tangent and the regular ReLU, LReLU ensures improved sparsity and dispersion [14] [15].

b) Randomized Leaky ReLU (RReLU)

RReLU is a randomized form of Leaky ReLU in which the negative slope is drawn at random from a uniform distribution U(l, u) during training. At test time the slope is fixed to the average of the sampled a_ji, so the output for a negative input is given by

y_ji = x_ji · (l + u) / 2, for x_ji < 0

It has been discovered that RReLU outperforms ReLU in classification tasks [16] [17].

c) S-shaped ReLU (SReLU)

The S-shaped ReLU can learn both convex and non-convex functions, and the small number of additional parameters it requires makes it easy to train in deeper networks. SReLU has shown performance improvements on most popular datasets, with and without data augmentation for regularization. A sketch of the function is given below.
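The sketch assumes the standard four-parameter formulation of SReLU (thresholds t_l, t_r and slopes a_l, a_r); the default values below are illustrative assumptions, not learned values.

```python
import numpy as np

def srelu(x, t_l=-1.0, a_l=0.1, t_r=1.0, a_r=0.1):
    # S-shaped ReLU: identity in the middle segment, linear pieces with
    # learnable slopes a_l, a_r beyond the learnable thresholds t_l, t_r.
    return np.where(x >= t_r, t_r + a_r * (x - t_r),
                    np.where(x <= t_l, t_l + a_l * (x - t_l), x))

print(srelu(np.array([-3.0, -0.5, 0.5, 3.0])))
```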

d) Parametric ReLU (PReLU)

In Leaky ReLU the slope for negative inputs is fixed. In PReLU, the slope is a parameter α that the neural network learns through gradient descent.
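The three parameterised variants above differ only in how the negative slope is chosen; a minimal NumPy sketch is given below (the slope values and the U(1/8, 1/3) sampling range are illustrative assumptions):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Fixed small negative slope alpha keeps the gradient non-zero for x < 0.
    return np.where(x > 0, x, alpha * x)

def prelu(x, alpha):
    # Same form as Leaky ReLU, but alpha is a parameter learned by
    # gradient descent along with the network weights.
    return np.where(x > 0, x, alpha * x)

def rrelu(x, low=1.0 / 8.0, high=1.0 / 3.0, training=True, seed=None):
    # Randomized Leaky ReLU: the negative slope is drawn from U(low, high)
    # during training and fixed to its mean (low + high) / 2 at test time.
    if training:
        rng = np.random.default_rng(seed)
        a = rng.uniform(low, high, size=np.shape(x))
    else:
        a = (low + high) / 2.0
    return np.where(x > 0, x, a * x)

x = np.array([-2.0, -1.0, 0.5, 2.0])
print(leaky_relu(x), prelu(x, alpha=0.25), rrelu(x, training=False), sep="\n")
```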

4) Softplus Function

The softplus function enhances the performance and stability of deep neural networks. Unlike ReLU, its gradient is non-zero everywhere. When compared to the ReLU and sigmoid functions, softplus performed better and required fewer epochs to converge during testing.
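A numerically stable NumPy sketch of softplus:

```python
import numpy as np

def softplus(x):
    # Smooth approximation of ReLU, f(x) = log(1 + e^x), written in a
    # numerically stable form; its gradient (the sigmoid) is non-zero everywhere.
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

print(softplus(np.array([-30.0, -1.0, 0.0, 1.0, 30.0])))
```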

5) Exponential Linear Units (ELUs)

ELUs are used to accelerate deep neural network training. Since ELU uses identity for positive values, it avoids the vanishing gradient problem. When compared to ReLU, ELU effectively eliminates bias shifts.
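A minimal NumPy sketch of ELU (α = 1 is an illustrative default):

```python
import numpy as np

def elu(x, alpha=1.0):
    # Identity for x > 0, so positive gradients do not vanish;
    # alpha * (e^x - 1) saturates smoothly for x <= 0, pushing mean
    # activations toward zero and reducing bias shift relative to ReLU.
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

print(elu(np.array([-5.0, -1.0, 0.0, 1.0, 5.0]), alpha=0.5))
```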

6) Swish Function

The Swish function is a hybrid function that combines the sigmoid function with the input itself. It was discovered using a reinforcement-learning-based automated search technique. Smoothness, boundedness below, unboundedness above, and non-monotonicity are all properties of the Swish function [1].
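A minimal NumPy sketch of Swish; the scaling parameter β is an assumption about the general form, with β = 1 recovering the SiLU:

```python
import numpy as np

def swish(x, beta=1.0):
    # f(x) = x * sigmoid(beta * x): smooth, non-monotonic, bounded below
    # and unbounded above. beta = 1 recovers the SiLU.
    return x / (1.0 + np.exp(-beta * x))

print(swish(np.array([-4.0, -1.0, 0.0, 1.0, 4.0])))
```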

7) Exponential linear Squashing Function (ELiSH)

ELiSH has two parts namely linear and sigmoid. The linear part eliminates the vanishing gradient problem and the sigmoid part is useful in improving the information flow. It has been successfully applied and tested on ImageNet Dataset [18].

The function is represented by the equation

f(x) = x / (1 + e^(-x)), for x ≥ 0
f(x) = (e^x − 1) / (1 + e^(-x)), for x < 0
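A NumPy sketch of ELiSH following the piecewise definition above:

```python
import numpy as np

def elish(x):
    # Linear part x * sigmoid(x) for x >= 0; exponential part
    # (e^x - 1) * sigmoid(x) for x < 0, which improves information flow.
    s = 1.0 / (1.0 + np.exp(-x))
    return np.where(x >= 0, x * s, (np.exp(np.minimum(x, 0.0)) - 1.0) * s)

print(elish(np.array([-4.0, -1.0, 0.0, 1.0, 4.0])))
```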

Table-1: Uses and Limitations of Activation Functions

S.No | Activation Function | Uses | Limitations
1 | Sigmoid | Easy to understand; used in shallow networks [8] | Gradient saturation, non-zero-centered output, slow convergence
2 | Hyperbolic Tangent | Multi-layer neural networks show better training efficiency with tanh; the output is zero-centered | Vanishing gradient problem
3 | ReLU | Faster learning, better performance, eliminates vanishing gradients, reliable, simple | Overfits easily, dying neurons problem
4 | Leaky ReLU | No dying ReLU problem; training time is less | Coefficient of x (negative slope) is predefined
5 | ELU | Merges the good features of ReLU and Leaky ReLU; converges to zero faster | Saturates for large negative values

Table-2: Activation Function and Corresponding Equations

S.No | Activation Function | Equation of the function
1 | Sigmoid | f(x) = 1 / (1 + e^(-x))
2 | Hyperbolic Tangent | f(x) = (e^x − e^(-x)) / (e^x + e^(-x))
3 | ReLU | f(x) = max(0, x)
4 | Leaky ReLU | f(x) = x if x > 0; αx if x ≤ 0 (α is a small fixed constant)
5 | ReLU6 | f(x) = min(max(0, x), 6)
6 | ELU | f(x) = x if x > 0; α(e^x − 1) if x ≤ 0
7 | PReLU | f(x) = x if x > 0; αx if x ≤ 0 (α is learnable)
8 | RReLU | f(x_ji) = x_ji if x_ji ≥ 0; a_ji·x_ji if x_ji < 0, with a_ji ~ U(l, u)
9 | SELU | f(x) = γ·x if x > 0; γ·α(e^x − 1) if x ≤ 0
10 | Swish | f(x) = x / (1 + e^(-x))
11 | Mish | f(x) = x·tanh(ln(1 + e^x))

Table-3: Properties of Activation Functions

S.No | Activation Function | Range | Order of Continuity | Monotonic
1 | Sigmoid | (0, 1) | C∞ | Yes
2 | Hyperbolic Tangent | (-1, +1) | C∞ | Yes
3 | ReLU | [0, +∞) | C0 | Yes
4 | Leaky ReLU | (-∞, +∞) | C0 | Yes
5 | ReLU6 | [0, 6] | C0 | Yes
6 | ELU | (-α, +∞) | C1 if α = 1, C0 otherwise | Yes (for α ≥ 0)
7 | PReLU | (-∞, +∞) | C0 | Yes
8 | SELU | (-γα, +∞) | C0 | Yes
9 | Swish | (≈ -0.278, +∞) | C∞ | No
10 | Mish | (≈ -0.31, +∞) | C∞ | No

III. RESULTS AND DISCUSSION

The behavior of activation functions has been studied by performing the following experiments.

a) Performance Analysis on MNIST Dataset Accuracy

Using a batch size of 64 and 10 epochs, the behavior of various common activation functions on the MNIST dataset is investigated. The learning rate is set to 0.01. The output layer uses the softmax function. The accuracy of the activation functions after implementation on the MNIST dataset is shown in Table 4.
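A minimal tf.keras sketch of this setup is given below. Only the batch size of 64, 10 epochs, learning rate of 0.01, two hidden layers of 100 units, and the softmax output layer are taken from the text; the optimizer, preprocessing, and the choice of built-in activations are assumptions for illustration.

```python
import tensorflow as tf

def build_model(activation):
    # Two hidden layers of 100 units each and a softmax output layer,
    # as described in the text.
    return tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(100, activation=activation),
        tf.keras.layers.Dense(100, activation=activation),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0   # scale pixels to [0, 1]

for act in ["sigmoid", "tanh", "relu", "elu"]:       # built-in activations only
    model = build_model(act)
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_train, y_train, batch_size=64, epochs=10, verbose=0)
    _, acc = model.evaluate(x_test, y_test, verbose=0)
    print(f"{act}: {acc:.4f}")
```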

Table-4: The Accuracy Analysis of Various Activation Functions on MNIST Dataset (In %)

S.No | Activation Function | Accuracy (in %)
1 | Sigmoid | 83.37
2 | Hyperbolic Tangent | 95.84
3 | ReLU | 97.29
4 | Leaky ReLU (α=0.5) | 94.82
5 | ReLU6 | 97.45
6 | ELU (α=0.5) | 96.56

7 | SELU (α=0.5, γ=0.5) | 93.51

b) Convergence Speed during Backpropagation

The learning rate is 0.65, the number of layers is 3, and the maximum number of iterations is 1000. Stochastic gradient descent is used to train the neural network, and the model weights are updated using the backpropagation algorithm. A sketch of this setup is given after Table 5.

Table-5: The Convergence Speed Analysis of Activation Functions during Backpropagation

S.No | Activation Function | Total Error | Convergence Step
1 | Sigmoid | 0.000183 | 118
2 | Hyperbolic Tangent | 0.010139 | 15
3 | ReLU | 0.101700 | 6
4 | ReLU6 | 0.001167 | 532
5 | ELU (α=0.5) | 0.002508 | 7
6 | SELU (α=0.5, γ=0.5) | 0.430113 | 1
7 | Mish | 0.036197 | 10
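The convergence experiment can be approximated with a small hand-written backpropagation loop. The sketch below is a toy reconstruction: only the learning rate of 0.65, the 3-layer architecture, and the 1000-iteration cap come from the text, while the task, layer widths, convergence threshold, and the use of full-batch gradient descent in place of SGD are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(64, 2))
y = (X[:, :1] * X[:, 1:] > 0).astype(float)        # toy nonlinear target

def act(z):  return np.maximum(0.0, z)             # swap in any activation here
def dact(z): return (z > 0).astype(float)          # ...and its derivative

W1, b1 = rng.normal(0.0, 0.5, (2, 8)), np.zeros(8)
W2, b2 = rng.normal(0.0, 0.5, (8, 1)), np.zeros(1)
lr = 0.65                                          # learning rate from the text

for step in range(1, 1001):                        # at most 1000 iterations
    z1 = X @ W1 + b1
    h = act(z1)                                    # hidden layer
    out = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))     # sigmoid output layer
    error = 0.5 * np.mean((out - y) ** 2)          # total error
    if error < 1e-3:                               # assumed convergence threshold
        print(f"converged at step {step}, error {error:.6f}")
        break
    # Backpropagation: propagate the error gradient and update the weights.
    d_out = (out - y) * out * (1.0 - out) / len(X)
    dW2, db2 = h.T @ d_out, d_out.sum(axis=0)
    d_h = (d_out @ W2.T) * dact(z1)
    dW1, db1 = X.T @ d_h, d_h.sum(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
else:
    print(f"did not converge within 1000 steps, error {error:.6f}")
```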


Figure 4: Sigmoid Function Convergence
Figure 5: tanh Function Convergence

Figure 6: ReLU Function Convergence
Figure 7: ReLU6 Function Convergence

Figure 8: ELU Function Convergence
Figure 9: SELU Function Convergence


Figure 10: Mish Function Convergence

c) Performance Analysis on Fashion MNIST Dataset Accuracy (in %)

Fashion-MNIST is a dataset of Zalando article images that includes a training set of 60,000 images and a test set of 10,000 images. Each image is a 28x28 grayscale image (784 pixels in total) with a label from one of ten categories. The accuracy of various activation functions after implementation on the Fashion MNIST dataset is shown in Table 6.

Table-6: The Accuracy Analysis of Various Activation Functions on Fashion MNIST Dataset (In %)

S.No | Activation Function | Accuracy (%)
1 | Sigmoid | 86.30
2 | Hyperbolic Tangent | 84.86
3 | ReLU | 86.30
4 | ReLU6 | 85.71
5 | ELU (α=0.5) | 86.25

IV. CONCLUSION

This paper offers a detailed overview of the activation functions used in deep learning as well as a comparison of emerging developments with state-of-the-art research findings. We began with a brief overview of deep learning and activation functions, followed by a description of the various types of activation functions, along with some basic implementations, benefits, drawbacks, and properties. Future work will involve introducing a new activation function that addresses the existing drawbacks and ensures better performance and optimization. Its performance can be tested on popular datasets to verify the improvements.

V. REFERENCES

[1] P. Ramachandran, B. Zoph, and Q. V. Le, "Searching for Activation Functions", arXiv:1710.05941 [cs], Oct. 2017.
[2] X. Li and X. Yu, "Influence of Learning Rate and Neuron Number on Prediction of Animal Phenotype Value Using Back-Propagation Artificial Neural Network", 2009 Second International Symposium on Computational Intelligence and Design, Changsha, 2009, pp. 270-274, doi: 10.1109/ISCID.2009.214.
[3] M. A. B. Siddique, M. M. R. Khan, R. B. Arif and Z. Ashrafi, "Study and Observation of the Variations of Accuracies for Handwritten Digits Recognition with Various Hidden Layers and Epochs using Neural Network Algorithm", 2018 4th International Conference on Electrical Engineering and Information & Communication Technology (iCEEiCT), Dhaka, Bangladesh, 2018, pp. 118-123, doi: 10.1109/CEEICT.2018.8628144.
[4] R. Saeed, R. Ghnemat, G. Benbrahim and A. Elhassan, "Learning with Dynamic Architectures for Artificial Neural Networks - Adaptive Batch Size Approach", 2019 2nd International Conference on new Trends in Computing Sciences (ICTCS), Amman, Jordan, 2019, pp. 1-4, doi: 10.1109/ICTCS.2019.8923070.
[5] J. Turian, J. Bergstra, and Y. Bengio, "Quadratic features and deep architectures for chunking", Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2009, pp. 245-248.
[6] J. Han and C. Moraga, "The influence of the sigmoid function parameters on the speed of backpropagation learning", Natural to Artificial Neural Computation (IWANN 1995), Lecture Notes in Computer Science, Berlin, Heidelberg: Springer, 1995, pp. 195-201.
[7] R. M. Neal, "Connectionist learning of belief networks", Artificial Intelligence, vol. 56, no. 1, pp. 71-113, 1992.
[8] M. Courbariaux, Y. Bengio, and J. P. David, "BinaryConnect: Training Deep Neural Networks with binary weights during propagations", NIPS'15: Proceedings of the 28th International Conference on Neural Information Processing Systems, Volume 2, 2015, pp. 3123-3131.
[9] S. Elfwing, E. Uchibe, and K. Doya, "Sigmoid-weighted linear units for neural network function approximation in reinforcement learning", Neural Networks, vol. 107, 2018, pp. 3-11, ISSN 0893-6080, https://doi.org/10.1016/j.neunet.2017.12.012.
[10] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, "Language Modeling with Gated Convolutional Networks", arXiv, 2017.
[11] A. Maas, A. Hannun, and A. Ng, "Rectifier Nonlinearities Improve Neural Network Acoustic Models", International Conference on Machine Learning (ICML), 2013.
[12] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, "Natural Language Processing (Almost) from Scratch", Journal of Machine Learning Research, vol. 12, pp. 2493-2537, 2011.
[13] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines", Proceedings of the 27th International Conference on Machine Learning (ICML), Haifa, 2010, pp. 807-814.
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks", 2012.
[15] I. Goodfellow, Y. Bengio, and A. Courville, "Deep Learning", MIT Press, 2016.
[16] B. Xu, N. Wang, H. Kong, T. Chen, and M. Li, "Empirical Evaluation of Rectified Activations in Convolution Network", arXiv, 2015.
[17] M. Basirat and P. M. Roth, "The Quest for the Golden Activation Function", arXiv, 2018.
