arXiv:2108.09598v1 [cs.LG] 21 Aug 2021

SERF: Towards better training of deep neural networks using log-Softplus ERror activation Function

Sayan Nag*,1 Mayukh Bhattacharyya*,2

1University of Toronto 2Stony Brook University

* denotes equal contribution

arXiv:2108.09598v3 [cs.LG] 25 Aug 2021

Abstract

Activation functions play a pivotal role in determining the training dynamics and neural network performance. The widely adopted activation function ReLU, despite being simple and effective, has a few disadvantages including the Dying ReLU problem. In order to tackle such problems, we propose a novel activation function called Serf which is self-regularized and non-monotonic in nature. Like Mish, Serf also belongs to the Swish family of functions. Based on several experiments on computer vision (image classification and object detection) and natural language processing (machine translation, sentiment classification and multi-modal entailment) tasks with different state-of-the-art architectures, it is observed that Serf vastly outperforms ReLU (baseline) and other activation functions including both Swish and Mish, with a markedly bigger margin on deeper architectures. Ablation studies further demonstrate that Serf based architectures perform better than those of Swish and Mish in varying scenarios, validating the effectiveness and compatibility of Serf with varying depth, complexity, optimizers, learning rates, batch sizes, initializers and dropout rates. Finally, we investigate the mathematical relation between Swish and Serf, thereby showing the impact of the preconditioner function ingrained in the first derivative of Serf, which provides a regularization effect making gradients smoother and optimization faster.

Introduction

Activation functions are point-wise functions which play a crucial role in introducing non-linearity in neural networks. In a neural network, linearly transformed inputs are passed through an activation function giving rise to non-linear counterparts. These non-linear point-wise activation functions are hugely responsible for the performance of neural networks. Thus, choosing a suitable activation function for better training and improved efficiency has always been an interesting area of research. Activation functions like tanh and sigmoid were widely used in previous works (1989; 1998; 1995; 2009). However, they had some disadvantages including upper-boundedness. This paved the way for the development of an activation function widely known as Rectified Linear Unit (ReLU) (2010). Being simple yet effective, ReLU was not only easier to optimize compared to its contemporaries (sigmoid and tanh), but also showed better generalization and improved convergence properties, which led to its wide adoption.

ReLU however has a few disadvantages including the infamous Dying ReLU phenomenon (2019; 2013). This problem stems from the absence of any negative portion: collapsing the negative inputs to zero causes a loss of gradient information. In addition, ReLU is not differentiable at zero, which can result in inconsistencies during gradient-based optimization. Keeping those in mind, researchers have propounded various activation functions including leaky ReLU (2013), PReLU (2015), ELU (2015), GELU (2016), SELU (2017), Swish (2017), Mish (2019), etc. Out of the aforementioned activation functions, Mish mostly outperforms its contemporaries including Swish. Mish has a continuous profile which renders better information propagation as compared to ReLU. It was inspired by the self-gating property of Swish. As opposed to Swish, Mish possesses a preconditioner which results in smoother gradients and better optimization.

In this work, we have proposed a novel activation function called Serf which is non-monotonic and is also inspired by the self-gating property of Swish. We define Serf as $f(x) = x\,\mathrm{erf}(\ln(1 + e^{x}))$ where erf is the error function (1998). Swish, Mish and Serf belong to the same family of activation functions possessing the self-gating property. Like Mish, Serf also possesses a preconditioner which results in better optimization and thus enhanced performance. Our experiments demonstrate that our proposed activation function Serf outperforms ReLU, Swish and even Mish for different standard architectures in a variety of tasks including image classification, object detection, graph node classification, machine translation, sentiment classification and multi-modal entailment, involving varied datasets. We have also conducted ablation studies on MNIST (1998) and CIFAR-10 datasets (Krizhevsky, Nair, and Hinton) to demonstrate the efficiency of Serf over Swish and Mish.

Related Work

One of the most widely used activation functions is the Rectified Linear Unit (ReLU) (2010). Originally proposed for Restricted Boltzmann Machines, this activation function gained prominence because of its simplicity and effectiveness and eventually replaced the sigmoid and tanh units. Despite being computationally efficient, it is not entirely devoid of shortcomings. In order to address those, Leaky ReLU (LReLU) was introduced, which replaced the constant zero portion of the ReLU function with a linear function, thereby 'leaking' some information (2013). LReLU showed superior performance compared to ReLU, and the performance was further enhanced when the slope of the negative part was learnt as an extra parameter using Parametric ReLU (PReLU) (2015). However, lower boundedness, which is important in order to render strong regularization effects, was absent in both LReLU and PReLU. Furthermore, similar to ReLU, they are also not differentiable at zero.

Keeping these aspects in mind, researchers proposed activation functions like Exponential Linear Units (ELU) (2015) and Scaled Exponential Linear Units (SELU) (2017). ELU and SELU possess better convergence characteristics along with a saturation plateau in their negative region. However, these activation functions have been found to be incompatible with Batch Normalization (BN) (2015).

Finally, using the self-gating property, Swish was proposed, which addressed the aforementioned drawbacks to a greater extent while demonstrating superior results compared to the previously established activation functions (2017). Belonging to the same class as Swish, another activation function called Mish was proposed which performed equally well or better than Swish in most of the computer vision tasks (2019). Our proposed activation function, Serf, is also inspired by the self-gating mechanism and thus belongs to the Swish-like class of functions. It has been shown experimentally that our proposed Serf outperforms other activation functions in a variety of computer vision and natural language processing tasks.

Serf

Motivation

Activation functions introduce non-linearity in neural networks and they play a very important role in the overall performance of a network. ReLU has been the most widely used activation function in neural networks. However, it suffers from several disadvantages, the most noticeable one being the dying ReLU phenomenon. This problem ensues from the missing negative part in the ReLU activation function, which restrains the negative values to zero. At the same time, ReLU is not continuously differentiable. Furthermore, ReLU is a non-negative function. This creates a non-zero mean problem where the mean activation is larger than zero. Such an issue is not desirable for network convergence (2015).

In order to address these problems to some extent, several new activation functions have emerged in the recent past, including leaky ReLU, ELU, Swish, etc. Swish is seemingly an ideal candidate for an activation function, with properties including non-monotonicity and the ability to preserve small negative weights while maintaining a smooth profile. Similar to Swish, activation functions like GELU (2016) have gained popularity, especially in the transformer based architectures used both in the fields of Computer Vision (ViT (2020) and MLP Mixer (2021)) and Natural Language Processing (GPT-2 (2019) and GPT-3 (2020)). Another activation function which rose to prominence due to its performance in state-of-the-art classification and object detection tasks is Mish. Mish has its roots in Swish and was developed by methodical analysis of the attributes that led to the efficacy of Swish.

Taking inspiration from the development of Mish, we propose an activation function called Serf. Serf is defined as:

$$f(x) = x \, \mathrm{erf}\left(\ln(1 + e^{x})\right) \tag{1}$$

Properties

Serf is bounded below and unbounded above. Serf is smooth, non-monotonic and differentiable. It also preserves a small portion of negative weights. Serf is inspired by Swish and Mish, where the self-gating property has been used to multiply the output of a non-linear function of an input with the same non-modulated input. Self-gating is advantageous because it requires only a single scalar input, whereas normal gating requires multiple scalar inputs (2017).

• Upper unboundedness: Activation functions like tanh and sigmoid have upper bounds. So, initialization should happen in the linear regime of these activation functions. Such a property is not desirable since it leads to saturation while training due to near-zero gradients (2010). ReLU, being unbounded above, attempted to avoid the saturation problem. This is a crucial attribute which can be noticed in all the successors of the ReLU function like leaky ReLU, GELU, Swish, Mish, etc. Serf also possesses this feature, with its positive side being an approximately linear function of the input (see Figure 1). This makes Serf a good candidate for an activation function.

• Lower boundedness: Activation functions must possess lower bounds in order to provide strong regularization effects. However, in the ReLU activation function, a neuron that receives negative input always outputs zero and eventually becomes dead or inactive, and hence useless. This is referred to as the dying ReLU phenomenon (2019; 2013). It usually happens when the learning rate is high or if there is a large negative bias. By preserving a small portion of negative information, Serf mitigates the aforementioned problem, furthermore resulting in better expressivity and improved gradient flow. The negative bound for Serf is approximately -0.3484 (see Figure 1).

• Differentiability: Unlike ReLU, Serf is continuously differentiable. This is beneficial owing to the fact that it avoids singularities and any concomitant ill-effects during gradient-based optimization.

• Preconditioner: Serf is closely related to Swish, which can be noticed in its first derivative. The first derivative of Serf is given as:

$$f'(x) = \frac{2}{\sqrt{\pi}} \, e^{-\left(\ln(1 + e^{x})\right)^{2}} \, x\,\sigma(x) + \frac{f(x)}{x} = p(x)\,\mathrm{swish}(x) + \frac{f(x)}{x} \tag{2}$$

Figure 1: Activation functions (Left), first derivatives (Middle) and second derivatives (Right) for Swish, Mish and Serf.

Here, σ is the sigmoid function and p(x) is a preconditioner function. Preconditioners make the gradients smoother and have previously been used extensively in optimization problems. The inverse of a symmetric positive definite matrix has been used as a preconditioner in the case of gradient descent. Application of such preconditioners makes the objective function smoother, thereby increasing the rate of convergence (1986). Therefore, the strong regularization effect contributed by such a preconditioner in the case of Serf makes the gradients smoother and optimization faster, thereby outperforming Swish as can be noticed in the experiments. Mish also has a preconditioner which makes it perform better than Swish. The difference between Mish and Serf is that in Serf we used the error function (erf) whereas in Mish the tanh function is used. Serf, however, outperforms Mish in most experiments (see Experiments and Ablations). We speculate that Serf's preconditioner function renders better regularization effects than that of Mish.

• Smoothness: Smooth loss landscapes indicate easier optimization with fewer local optima and hence better generalization, minimizing the influence of initializations and learning rates. The output landscapes of a randomly initialized 6-layered neural network with ReLU and Serf activation functions are shown in Figure 2. It is to be noted that the output landscape is indicative of the loss landscape. We randomly initialize a 6-layered neural network, pass the x and y coordinates of each point in a grid as input, and plot the scalar network output for each grid point. For the ReLU activation function, the output landscape of the neural network has sharp transitions in contrast to that of Serf. This is consistent with the enhanced performance of Serf as compared to ReLU.

Figure 2: Output landscapes of a randomly initialized 6-layered neural network with ReLU (Left) and Serf (Right) activations.

Experiments

In this section we demonstrate the performance of our proposed activation function, Serf, when used in different state-of-the-art architectures on image, sequence and graph datasets for disparate tasks. We also present results from ablation studies done on MNIST and CIFAR-10 datasets. Overall, our proposed activation function, Serf, outperformed its contemporaries in most of the tasks.

Ablations

Model hyper-parameters play an important role in the training and optimization processes of a neural network, thus having direct consequences on the generalizability of a network. Such hyper-parameters include network depths, network widths, type of weight initializations, dropout rates, batch sizes, learning rates, and optimizers. Here we analyze and compare the impacts of different hyper-parameters on our chosen networks with three different activation functions, namely Swish, Mish, and Serf. We have used MNIST and CIFAR-10 datasets for this purpose.

MNIST

• Dense Units: The number of dense units refers to the number of neurons present in a dense layer. In this case we have used a 4-layered architecture with one dense layer followed by a Batch Normalization layer, and SGD (1951) as the optimizer. We observe that as the number of dense units increases, the model complexity increases and Serf outperforms Swish and Mish (Fig 3). This suggests that Serf works well with complex models. This has also been noticed in other experiments.

Figure 3: Ablations for MNIST dataset. Top: Testing Accuracies vs Dense Units (Left), Dropout Rates (Middle) and Initializers (Right) for Swish, Mish and Serf. Bottom: Testing Accuracies vs Learning Rates (Left), Optimizers (Middle) and Number of Layers (Right) for Swish, Mish and Serf.

• Dropout Rates: As the dropout rate increases, the overall performance for all three activation functions drops. However, the performance degradation for Serf is smaller than for Swish and Mish (Fig 3).

• Initializers: The performance of Serf is better than both Swish and Mish for all initializers except random uniform initialization (Fig 3). This suggests that Serf is a better candidate compared to its contemporaries.

• Learning Rates: With varying learning rates, Serf performs better than both Swish and Mish (Fig 3). Particularly, with higher learning rates, the degradation is quite pronounced in Swish and much less so in Mish and Serf. We have used SGD (1951) as the optimizer in this case.

• Optimizers: In this case, with varying optimizers, the overall performance of Serf is equal to or marginally better than Swish and Mish (Fig 3). A performance drop can be noticed for all three activation functions in the case of the Adagrad optimizer (2011).

• Number of layers: In this case, each dense layer was followed by a Batch Normalization layer. As the number of dense layers increases, i.e., the depth of the network increases, models become complex and optimization becomes difficult. The degradation in the performances for all three activation functions confirms this. However, Serf maintained a significantly higher accuracy as compared to Swish and Mish (Fig 3). This makes Serf a suitable candidate for large and complex networks.

CIFAR-10

We have used a ResNet-18 model with a dense layer and a classification head in tandem. The results are obtained by training the model without pre-trained weights over multiple runs for 20 epochs, which gives a decent convergence point.

• Batch Size: We observed that with decreasing training batch sizes, the performance for all the competing activation functions drops (Fig 4). However, Serf holds the best position out of all three over all the batch sizes. Adam (2014) is used as the optimizer here.

• Optimizers: In this case, with varying optimizers, the overall performance of Serf is equal to or marginally better than Swish and Mish (Fig 4). A performance drop can be noticed for all three activation functions in the case of the Adagrad (2011) and SGD (1951) optimizers.

• Learning Rates: Serf is observed to perform better than or equal to both Mish and Swish on all of the learning rates evaluated, barring 0.1 where a steeper drop is observed in Serf compared to the other two activation functions (Fig 4). Adam (2014) is used as the optimizer here.

Figure 4: Ablations for CIFAR-10 dataset. Testing Accuracies vs Batch Sizes (Left), Optimizers (Middle) and Learning Rates (Right) for Swish, Mish and Serf.
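As a concrete reference for the function compared throughout these ablations, the definition in Eq. (1) and the derivative in Eq. (2) can be sketched in plain Python. This is a minimal scalar sketch under stated assumptions, not the authors' implementation: the helper names (softplus, preconditioner, serf_grad, central_difference, grid_minimum), the finite-difference check and the brute-force minimum search are all introduced here for illustration only.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable ln(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid(x: float) -> float:
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def serf(x: float) -> float:
    """Eq. (1): f(x) = x * erf(ln(1 + e^x))."""
    return x * math.erf(softplus(x))


def swish(x: float) -> float:
    """swish(x) = x * sigmoid(x)."""
    return x * sigmoid(x)


def preconditioner(x: float) -> float:
    """p(x) = (2 / sqrt(pi)) * exp(-(ln(1 + e^x))^2), the factor in Eq. (2)."""
    return 2.0 / math.sqrt(math.pi) * math.exp(-(softplus(x) ** 2))


def serf_grad(x: float) -> float:
    """Eq. (2): f'(x) = p(x) * swish(x) + f(x) / x.

    f(x) / x equals erf(ln(1 + e^x)); that form is used directly so that
    x = 0 needs no special case.
    """
    return preconditioner(x) * swish(x) + math.erf(softplus(x))


def central_difference(f, x: float, h: float = 1e-5) -> float:
    """Illustrative numerical derivative used to cross-check Eq. (2)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


def grid_minimum(f, lo: float = -6.0, hi: float = 0.0, steps: int = 60000):
    """Brute-force search for the smallest value of f on [lo, hi]."""
    best_x, best_y = lo, f(lo)
    for i in range(1, steps + 1):
        x = lo + (hi - lo) * i / steps
        y = f(x)
        if y < best_y:
            best_x, best_y = x, y
    return best_x, best_y


if __name__ == "__main__":
    # Compare the analytic derivative with the numerical one at a few points.
    for x in (-3.0, -1.0, 0.0, 0.5, 2.0):
        print(x, serf(x), serf_grad(x), central_difference(serf, x))
    # Location and value of the lower bound found by the grid search.
    print(grid_minimum(serf))
```

The analytic and numerical derivatives should agree closely at every test point, and the grid search should land near the lower bound of approximately -0.3484 quoted in the Properties section. A framework implementation would apply the same formula element-wise to tensors; that step is left out here to keep the sketch dependency-free.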

Image Classification

For image classification we have considered different standard architectures applied on two datasets, namely CIFAR-10 and CIFAR-100.

CIFAR-10/100: We have considered different deep learning architectures (for CIFAR-10: SqueezeNet (2016), Resnet-50 (2016a), WideResnet-50-2 (2016), ShuffleNet-v2 (2018), ResNeXt-50 (2017), Inception-v3 (2016), DenseNet-121 (2017), MobileNet-v2 (2018) and EfficientNet-B0 (2019); for CIFAR-100: Resnet-164 (2016b), WideResnet-28-10 (2016), DenseNet-40-12 (2017), Inception-v3 (2016)) with three disparate activation functions, namely, ReLU (baseline), Mish and Serf (proposed). For each network we have only changed the activation functions and have kept every other parameter constant for fair comparisons. Tables 1 and 2 show that Serf consistently outperformed both ReLU and Mish activation functions across all the architectures used in the experiment for both CIFAR-10 and CIFAR-100 datasets, with gains over the baseline ranging from a few tenths of a percentage point to about 2 percentage points, thereby suggesting Serf to be the best candidate activation function for CIFAR-10/100 image classification tasks.

Methods                    ReLU    Mish    Serf (Ours)
SqueezeNet                 84.14   85.98   86.32
Resnet-50                  86.54   87.03   88.07
WideResnet-50-2            86.39   86.57   86.73
ShuffleNet-v2              83.93   84.07   84.55
ResNeXt-50 (32 × 4d)       87.25   87.97   88.49
Inception-v3               90.93   91.55   92.89
DenseNet-121               88.59   89.05   89.07
MobileNet-v2               85.74   86.39   86.61
EfficientNet-B0 (Swish)    78.26   78.02   78.41

Table 1: Top-1 % Accuracy values of different state-of-the-art methods for different activation functions on CIFAR-10.

Methods             ReLU    Mish    Serf (Ours)
Resnet-164          74.55   75.02   75.13
WideResnet-28-10    76.32   77.03   77.54
DenseNet-40-12      73.68   73.91   74.16
Inception-v3        71.54   72.38   72.95

Table 2: Top-1 % Accuracy values of different state-of-the-art methods for different activation functions on CIFAR-100.

We have also used two recent architectures, namely MLP Mixer (2021) and Compact Convolutional Transformers (CCT) (2021). We have trained the MLP Mixer for only 10 epochs on the CIFAR-10 training dataset, separately with two different activation functions (GELU and Serf), and tested the models on the CIFAR-10 test dataset. The Top-1 % and Top-5 % Accuracy values (see Table 3) suggest that Serf's performance is comparable to GELU's (baseline) performance for MLP Mixer. We have also trained CCT for 50 epochs on the CIFAR-10 training dataset, separately with three different activation functions (ReLU, Mish and Serf), and tested the models on the CIFAR-10 test dataset. Serf outperforms ReLU (baseline) and Mish in Top-1 accuracy in this case, while Mish is marginally ahead in Top-5 accuracy (see Table 3). The results indicate that Serf is a better activation function for transformer based architectures.

Activations         Top-1 % Acc   Top-5 % Acc
MLP-Mixer (GELU)    64.14         96.71
MLP-Mixer (Serf)    64.36         96.59
CCT (ReLU)          79.05         97.72
CCT (Mish)          80.02         98.70
CCT (Serf)          80.23         98.65

Table 3: Top-1 and Top-5 % Accuracy values (classification) after 10 epochs of training MLP-Mixer for GELU (SOTA) and Serf activation functions on CIFAR-10 test dataset and 50 epochs of training Compact Convolutional Transformer (CCT) for ReLU, Mish and Serf functions on CIFAR-10 test dataset.

Object Detection

Object detection is considered one of the important visual scene understanding tasks. In our case, we have considered the Pascal VOC dataset for the object detection task using the YOLOv3 (2018) and tiny YOLOv3 frameworks. Leaky ReLU is intrinsic to the YOLOv3 framework. For fair comparisons we have only changed the activation function, keeping other hyperparameters fixed as outlined in (2018). Mean Average Precision (mAP) scores in Table 4 indicate that our proposed Serf outperforms the baseline leaky ReLU based architectures in the object detection task for the Pascal VOC dataset.

Model          Activations    mAP@.5   mAP@.5:.95
YOLOv3         LeakyReLU      0.740    0.473
YOLOv3         Serf (Ours)    0.766    0.501
YOLOv3 Tiny    LeakyReLU      0.503    0.219
YOLOv3 Tiny    Serf (Ours)    0.514    0.227

Table 4: Mean Average Precision scores for different object detection models on the Pascal VOC dataset. LeakyReLU is intrinsic to the YOLO framework.

Semi-supervised Node Classification

Following the implementations outlined in (2016), we have considered 3 different datasets, namely CITESEER, CORA and PUBMED, for semi-supervised node classification using three disparate activation functions, namely, ReLU (baseline), Mish and Serf (proposed). All training parameters and hyper-parameters have been kept the same as mentioned in (2016) for fair comparisons. Table 5 shows that Serf performed equally well or better than both ReLU and Mish activation functions across the three different datasets, thereby indicating versatility of the proposed activation function.

Dataset     ReLU   Mish   Serf (Ours)
CORA        81.5   81.7   81.7
CITESEER    70.3   71.3   71.7
PUBMED      79.0   79.3   79.4

Table 5: Top-1 % Accuracy values of different state-of-the-art GNN semi-supervised node classification methods for different activation functions on CORA, CITESEER and PUBMED.

Machine Translation, Sentiment Classification & Multi-modal Entailment

In this section, we demonstrate the effectiveness of our proposed Serf activation function in Machine Translation, Sentiment Classification and Multi-modal Entailment tasks. We have considered 3 different architectures and datasets.

For Machine Translation we have used a sequence-to-sequence Transformer (2017) Encoder-Decoder based model trained (for 20 epochs) on the Multi30k dataset for German-English translation (2016). For comparison purposes, we have considered ReLU, GELU, Mish and Serf (proposed) and observed that Serf outperformed the remaining three activation functions, as suggested by the BLEU scores shown in Table 6.

Scores   ReLU    GELU    Mish    Serf (Ours)
BLEU     35.55   35.62   35.36   36.06

Table 6: BLEU scores of seq2seq Transformer model (after training for 20 epochs) for different activation functions on Multi30k test dataset.

For sentiment classification, we have considered two datasets, namely the imdb movie review sentiment and Pol Emo 2.0 sentiment datasets. For the imdb movie review sentiment dataset (2011), we have considered: (i) a simple architecture consisting of a 1D conv net with a text embedding layer, which we have trained using three different activation functions and noticed that Serf outperformed the other two activation functions (ReLU and Mish), suggesting that Serf also works well with simple architectures (see Table 7), and (ii) a 4-layer Transformer model which we have also trained for 20 epochs for each of the three activation functions, eventually obtaining the best results for the proposed Serf function (see Table 7). For the Pol Emo 2.0 sentiment database (2019), we have trained a BERT based model (2018) for two different activation functions, Mish and Serf. Serf is marginally ahead of Mish in Precision and Recall and marginally behind in F1-score, so the two perform on par for this task (see Table 8).

Model                           ReLU    Mish    Serf (Ours)
1D conv with text-embedding     85.36   85.99   86.18
4-layer Transformer model       88.82   88.99   89.03

Table 7: Top-1 % Accuracy values of 1D conv net with a text embedding layer and 4-layer Transformer model for ReLU, Mish and Serf on imdb movie review sentiment dataset.

Activation     Precision   Recall   F1-score
Mish           0.8374      0.8329   0.8346
Serf (Ours)    0.8377      0.8330   0.8342

Table 8: Precision, Recall and F1-scores for different activation functions for sentiment classification using BERT on Pol Emo 2.0 sentiment database.

For the multi-modal entailment task, we have used the multi-modal entailment database, recently introduced by Google Research [1]. We have used a smaller variant of the original BERT model. The code used for this purpose is available at [2]. We have used two activation functions for comparison purposes: GELU and Serf. Table 9 shows the accuracies on the test dataset averaged over 5 runs (each trained for 10 epochs). The accuracy values suggest that Serf performs marginally better than GELU in this case.

Metric           GELU    Serf (Ours)
Mean Accuracy    85.28   85.42

Table 9: Mean Accuracy values of GELU and Serf based architectures for multi-modal entailment task.

[1] https://github.com/google-research-datasets/recognizing-multimodal-entailment
[2] https://github.com/sayakpaul/Multimodal-Entailment-Baseline

Conclusions and Future Works

In this paper, we have proposed a novel activation function called Serf. Serf has properties such as upper unboundedness, lower boundedness, non-monotonicity and smoothness, which are the desired properties for an activation function. Serf has been shown to be effective across different ablations compared to its contemporaries like Swish and Mish. Experimental results with different state-of-the-art architectures on varied datasets for disparate tasks including image classification, object detection, sentiment classification, machine translation and multi-modal entailment demonstrate that the proposed Serf outperforms the baseline ReLU as well as other activation functions like Swish, Mish and GELU. The results can be improved further with hyper-parameters tuned for Serf, which can be obtained with a hyperparameter search.

Future works include: (1) understanding the importance and contribution of the preconditioner as a regularizer and how modifying it can have an impact on the final results, which can lead to the development of more effective activation functions, (2) the development of Hard-Serf, like Hard-Swish and Hard-Mish, and a comparison of its performance on different benchmark datasets and tasks, (3) the development and exploration of a probabilistic version of Serf as shown in (2019), (4) the development of a parameterized Serf like PReLU, and finally (5) a comparison of the performance of Serf with other contemporary activation functions for tasks such as image super-resolution, image reconstruction, etc. Overall, Serf is a simple, effective and versatile activation function which can be incorporated in any neural network for better training and performance gains.

References

[1998] Andrews, L. C. 1998. Special functions of mathematics for engineers, volume 49. Spie Press.
[1986] Axelsson, O., and Lindskog, G. 1986. On the rate of convergence of the preconditioned conjugate gradient method. Numerische Mathematik 48(5):499–523.
[2020] Brown, T. B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
[2015] Clevert, D.-A.; Unterthiner, T.; and Hochreiter, S. 2015. Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289.
[2018] Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[2020] Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
[2011] Duchi, J.; Hazan, E.; and Singer, Y. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research 12(7).
[2016] Elliott, D.; Frank, S.; Sima'an, K.; and Specia, L. 2016. Multi30k: Multilingual english-german image descriptions. In Proceedings of the 5th Workshop on Vision and Language, 70–74. Association for Computational Linguistics.
[2010] Glorot, X., and Bengio, Y. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, 249–256. JMLR Workshop and Conference Proceedings.
[2021] Hassani, A.; Walton, S.; Shah, N.; Abuduweili, A.; Li, J.; and Shi, H. 2021. Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704.
[2015] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, 1026–1034.
[2016a] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016a. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778.
[2016b] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016b. Identity mappings in deep residual networks. In European conference on computer vision, 630–645. Springer.
[2016] Hendrycks, D., and Gimpel, K. 2016. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415.
[2017] Huang, G.; Liu, Z.; Van Der Maaten, L.; and Weinberger, K. Q. 2017. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 4700–4708.
[2016] Iandola, F. N.; Han, S.; Moskewicz, M. W.; Ashraf, K.; Dally, W. J.; and Keutzer, K. 2016. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 mb model size. arXiv preprint arXiv:1602.07360.
[2015] Ioffe, S., and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, 448–456. PMLR.
[2009] Jarrett, K.; Kavukcuoglu, K.; Ranzato, M.; and LeCun, Y. 2009. What is the best multi-stage architecture for object recognition? In 2009 IEEE 12th international conference on computer vision, 2146–2153. IEEE.
[2014] Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[2016] Kipf, T. N., and Welling, M. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
[2017] Klambauer, G.; Unterthiner, T.; Mayr, A.; and Hochreiter, S. 2017. Self-normalizing neural networks. In Proceedings of the 31st international conference on neural information processing systems, 972–981.
[2019] Kocoń, J.; Zaśko-Zielińska, M.; and Miłkowski, P. 2019. Polemo 2.0 sentiment analysis dataset for conll.
[] Krizhevsky, A.; Nair, V.; and Hinton, G. Cifar-10 (canadian institute for advanced research).
[1989] LeCun, Y.; Boser, B.; Denker, J. S.; Henderson, D.; Howard, R. E.; Hubbard, W.; and Jackel, L. D. 1989. Backpropagation applied to handwritten zip code recognition. Neural computation 1(4):541–551.
[1998] LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.
[1998] LeCun, Y. 1998. The mnist database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
[2019] Lee, J.; Shridhar, K.; Hayashi, H.; Iwana, B. K.; Kang, S.; and Uchida, S. 2019. Probact: A probabilistic activation function for deep neural networks. arXiv preprint arXiv:1905.10761 5:13.
[2019] Lu, L.; Shin, Y.; Su, Y.; and Karniadakis, G. E. 2019. Dying relu and initialization: Theory and numerical examples. arXiv preprint arXiv:1903.06733.
[2018] Ma, N.; Zhang, X.; Zheng, H.-T.; and Sun, J. 2018. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European conference on computer vision (ECCV), 116–131.
[2011] Maas, A. L.; Daly, R. E.; Pham, P. T.; Huang, D.; Ng, A. Y.; and Potts, C. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 142–150. Portland, Oregon, USA: Association for Computational Linguistics.
[2013] Maas, A. L.; Hannun, A. Y.; Ng, A. Y.; et al. 2013. Rectifier nonlinearities improve neural network acoustic models. In Proc. icml, volume 30, 3. Citeseer.
[1995] Mira, J., and Sandoval, F. 1995. From Natural to Artificial Neural Computation: International Workshop on Artificial Neural Networks, Malaga-Torremolinos, Spain, June 7-9, 1995: Proceedings, volume 930. Springer Science & Business Media.
[2019] Misra, D. 2019. Mish: A self regularized non-monotonic neural activation function. arXiv preprint arXiv:1908.08681 4:2.
[2010] Nair, V., and Hinton, G. E. 2010. Rectified linear units improve restricted boltzmann machines. In Icml.
[2019] Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I.; et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1(8):9.
[2017] Ramachandran, P.; Zoph, B.; and Le, Q. V. 2017. Searching for activation functions. arXiv preprint arXiv:1710.05941.
[2018] Redmon, J., and Farhadi, A. 2018. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767.
[1951] Robbins, H., and Monro, S. 1951. A stochastic approximation method. The annals of mathematical statistics 400–407.
[2018] Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; and Chen, L.-C. 2018. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 4510–4520.
[2016] Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; and Wojna, Z. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2818–2826.
[2019] Tan, M., and Le, Q. 2019. Efficientnet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, 6105–6114. PMLR.
[2021] Tolstikhin, I.; Houlsby, N.; Kolesnikov, A.; Beyer, L.; Zhai, X.; Unterthiner, T.; Yung, J.; Keysers, D.; Uszkoreit, J.; Lucic, M.; et al. 2021. Mlp-mixer: An all-mlp architecture for vision. arXiv preprint arXiv:2105.01601.
[2017] Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in neural information processing systems, 5998–6008.
[2017] Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; and He, K. 2017. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 1492–1500.
[2016] Zagoruyko, S., and Komodakis, N. 2016. Wide residual networks. arXiv preprint arXiv:1605.07146.