DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2020

Performance Analysis of Various Activation Functions Using LSTM Neural Network For Movie Recommendation Systems

ANDRÉ BROGÄRD

PHILIP SONG

KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Performance Analysis of Various Activation Functions Using LSTM Neural Network For Movie Recommendation Systems

ANDRÉ BROGÄRD, PHILIP SONG

Degree Project in Computer Science, DD142X
Date: June 8, 2020
Supervisor: Erik Fransén
Examiner: Pawel Herman
School of Electrical Engineering and Computer Science
Swedish title: Prestandaanalys av olika aktiveringsfunktioner i LSTM neurala nätverk applicerat på rekommendationssystem för filmer


Abstract

Recommendation systems have grown in importance and popularity in many different areas. This thesis focuses on recommendation systems for movies. Recurrent neural networks using LSTM blocks have shown some success for movie recommendation systems. Research has indicated that changing the activation functions in LSTM blocks can improve performance, measured as prediction accuracy. In this study we compare four activation functions (the hyperbolic tangent, sigmoid, ELU and SELU functions) used in LSTM blocks and how they impact the prediction accuracy of the neural network. Specifically, they are applied to the block input and the block output of the LSTM blocks. Our results indicate that the hyperbolic tangent, which is the default, and the sigmoid perform about the same, whereas the ELU and SELU functions perform worse. Further research is needed to identify other activation functions that could improve the prediction accuracy and to improve certain aspects of our methodology.

Sammanfattning

Recommendation systems have grown in importance and popularity in many different areas. This thesis focuses on recommendation systems for movies. Recurrent neural networks with LSTM blocks have shown some success for movie recommendation systems. Previous research has indicated that changing the activation functions has resulted in improved predictions. In this study we compare four different activation functions (hyperbolic tangent, sigmoid, ELU and SELU) applied in LSTM blocks and how they affect the predictions of the neural network. They are applied specifically to the block input and block output of the LSTM blocks. Our results indicate that the hyperbolic tangent function, which is the default choice, and the sigmoid function perform equally well, while ELU and SELU both perform worse. Further research is needed to identify other activation functions and to improve several parts of the methodology.

Contents

1 Introduction
  1.1 Problem Statement
  1.2 Scope

2 Background
  2.1 Artificial Neural Networks
  2.2 Multilayer Perceptron ANN
  2.3 Recurrent Neural Network
  2.4 Long Short-Term Memory
    2.4.1 LSTM Architecture
    2.4.2 Activation Functions
  2.5 Metrics
  2.6 Related work

3 Methods
  3.1 Dataset
  3.2 Implementation
  3.3 Evaluation

4 Results

5 Discussion
  5.1 Result
  5.2 Improvements

6 Conclusions

Bibliography


Chapter 1

Introduction

With more online movie platforms becoming available, people have a lot of movie content to choose from. According to a study from Ericsson, people spend up to one hour per day searching for movie content [1]. Seeking to minimize this time, movie recommendation systems have been developed using artificial intelligence [2].

Recommendation systems aim to solve the problem of information overload, which denies access to interesting items, by filtering information [3]. One such way is through collaborative filtering (CF), where similar users' interests are considered [3]. Popular approaches to CF include the use of neural networks, and in [4] it is demonstrated that CF can be converted to a sequence prediction problem with the use of recurrent neural networks (RNNs).

Long Short-Term Memory (LSTM), an RNN with LSTM blocks, was designed to solve a problem with RNNs and has shown an improvement in performance [5]. LSTM has been applied in several recommendation systems [6] targeted at both entertainment (movies, music, videos) and e-commerce settings and has outperformed state-of-the-art models in many cases.

In [4] an LSTM neural network was applied to the top-N recommendation problem, using the default choice of activation functions, recommending 10 movies the user would be interested in seeing next. The rating of a movie was ignored; only the sequence of watched movies was considered. It was observed that extra features such as age, rating or sex did not lead to an increase in accuracy. Both the Movielens and Netflix datasets were used, and LSTM outperformed all baseline models in nearly all metrics.

This study will use the same framework as in [4]. Since there has been success in switching activation functions [7], the study will compare different choices of activation functions in LSTM blocks and their impact on prediction


accuracy in the context of movie recommendations.

1.1 Problem Statement

The most important functionality of a movie recommendation system is the ability to predict a user's movie preferences. Therefore, in this project we investigate the performance, measured as accuracy in movie predictions, of LSTM networks using various activation functions, applied to the top-N recommendation problem in movie recommendation. To this end we pose the question: How does applying different activation functions to LSTM blocks affect the accuracy of predicting movies for users?

1.2 Scope

The implementation of LSTM is the same as in [4], with small modifications. This study therefore only considers this type of LSTM applied to the top-N recommendation problem. In [4] the features are limited to three (user id, movie id and timestamp), and it is further concluded that additional features such as sex or age do not improve the accuracy of the models unless they are all put together. We limit the features identically. Only the Movielens 1M dataset is used in this study because of limited computational resources. Additionally, only the hyperbolic tangent, sigmoid, ELU and SELU activation functions are tested, as they have shown promising results in previous work.

Chapter 2

Background

2.1 Artificial Neural Networks

Artificial Neural Networks (ANNs) are a type of computing system inspired by the biological neural networks in human brains [8]. There are many different types of networks, all characterized by the following components: a set of nodes, in our case artificial neurons, and connections between these nodes called weights. Like the synapses in a biological brain, each connection between nodes can transmit a signal to other nodes. A neuron receives inputs, performs some processing and computation, and produces an output which can be signaled to other neurons connected to it. The weight of each connection determines the strength of one node's influence on another [9]. Figure 2.1 shows how an artificial neuron receives inputs, which are multiplied by weights; a mathematical function, the activation function, then determines the activation of the neuron. Activation functions are discussed more thoroughly in section 2.4.2.

Figure 2.1: An artificial neuron
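To make figure 2.1 concrete, the following is a minimal sketch of a single artificial neuron in Python/NumPy. It is illustrative only; the weights, bias and the choice of sigmoid here are arbitrary assumptions, not taken from the thesis code.

```python
import numpy as np

def sigmoid(x):
    """Sigmoid activation, squashing any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def neuron(inputs, weights, bias, activation=sigmoid):
    """A single artificial neuron: weighted sum of inputs plus bias,
    passed through an activation function."""
    return activation(np.dot(weights, inputs) + bias)

# Example: three inputs with arbitrary weights and bias.
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.4, 0.1, -0.6])
print(neuron(x, w, bias=0.2))
```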


2.2 Multilayer Perceptron ANN

Multilayer perceptrons (MLPs) are comprised of one or more layers of neurons. The number of neurons in the input and output layers depends on the problem, whereas the number of neurons in the hidden layers is arbitrary. The goal of an MLP is to approximate a function f∗. For example, a classifier y = f∗(x) maps an input x to a category y. MLPs are also called feedforward neural networks because information flows through the function being evaluated from x, through the intermediate computations used to define f, and finally to the output y. There are no feedback connections in which outputs of the model are fed back into itself. If an MLP were extended with feedback connections, it would be a recurrent neural network (RNN) [10].
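As an illustration of the feedforward idea (not code from the thesis), here is a minimal sketch of a forward pass through a two-layer MLP; the layer sizes and random weights are arbitrary.

```python
import numpy as np

def mlp_forward(x, params):
    """Feedforward pass: each layer computes tanh(W @ h + b).
    `params` is a list of (W, b) pairs, one per layer; no feedback connections."""
    h = x
    for W, b in params:
        h = np.tanh(W @ h + b)
    return h

rng = np.random.default_rng(0)
layer_sizes = [4, 8, 3]  # input, hidden, output (arbitrary choices)
params = [(rng.normal(size=(m, n)), np.zeros(m))
          for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]
print(mlp_forward(rng.normal(size=4), params))
```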

2.3 Recurrent Neural Network

A weakness of MLPs is that they lack the ability to learn and efficiently store temporal dependencies [10]. A recurrent neural network is specialized for processing a sequence of values, and it can scale to much longer sequences than networks without sequence-based specialization. Another advantage of RNNs over MLPs is the ability to share parameters across different parts of a model. For example, consider the two sentences "I went to Nepal in 2009" and "In 2009, I went to Nepal". When extracting the year the narrator went to Nepal, an MLP, which processes sentences of fixed length, would have separate parameters for each input feature, which means it would need to learn the rules of the language separately at each position in the sentence. An RNN, in contrast, shares the same weights across several time steps.

However, RNNs have a problem with long-term memory, meaning they lack the ability to connect present information to old information in order to establish the correct context [10]. For example, consider trying to predict the last word in the sentence "I grew up in France... I speak fluent French". The most recent information suggests that the word is a language, but to tell which specific language it is, context from further back in the text, about France, is needed. The gap between the recent information and the information further back can become very large, and as this gap grows RNNs become unable to use the past information as context for the recent information. Fortunately, the Long Short-Term Memory neural network is explicitly designed to solve the long-term dependency problem [11].
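For comparison with the MLP sketch above, here is a minimal sketch of a vanilla RNN (illustrative only): the same weights are reused at every time step, which is the parameter sharing described above.

```python
import numpy as np

def rnn_forward(xs, W_x, W_h, b, h0=None):
    """Vanilla RNN: the same weights (W_x, W_h, b) are shared across all
    time steps; each step mixes the current input with the previous state."""
    h = np.zeros(W_h.shape[0]) if h0 is None else h0
    states = []
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h + b)
        states.append(h)
    return states

rng = np.random.default_rng(1)
seq = [rng.normal(size=3) for _ in range(5)]   # a toy sequence of 5 inputs
W_x, W_h, b = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), np.zeros(4)
print(rnn_forward(seq, W_x, W_h, b)[-1])       # final hidden state
```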

2.4 Long Short-Term Memory

As discussed in the previous section, RNNs have a problem with long-term memory. Long Short-Term Memory (LSTM), a special kind of RNN, is capable of learning long-term dependencies using LSTM blocks [11]. The network was designed to solve this problem and has shown an improvement in performance [5]. Each LSTM block consists of one or more self-connected memory cells along with input, forget, and output gates. The memory cells are able to store and access information over longer periods of time, improving performance.

2.4.1 LSTM Architecture

The main concept of the LSTM is the cell state, the round circle "Cell" in figure 2.2. The cell state holds information which flows in and out between LSTM blocks. The output of a cell is called the hidden state; in figure 2.2 the hidden state is the output of the cell combined with the pointwise operation from the output gate [11]. With regulating structures called gates, the LSTM has the ability to remove or add information to the cell state and hidden state. A gate consists of a sigmoid neural net layer and a pointwise multiplication operation. The sigmoid layer, the round circle with σ in the figure, outputs numbers between zero and one. The number represents how much information will flow through the gate: if zero is returned, nothing flows through, whereas one means all information flows through. The function determining the output value between zero and one is called the activation function and can be switched in the network [11]. The three gates (input, forget and output gates) and the block input and block output activation functions are displayed in the figure. The ⊙ sign denotes pointwise multiplication of two vectors. The activation functions are σ and tanh [7].

f_t = σ(W_f x_t + U_f h_{t−1} + b_f)    (2.1)
i_t = σ(W_i x_t + U_i h_{t−1} + b_i)    (2.2)
o_t = σ(W_o x_t + U_o h_{t−1} + b_o)    (2.3)
C̃_t = tanh(W_C x_t + U_C h_{t−1} + b_C)    (2.4)
C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t    (2.5)
h_t = o_t ⊙ tanh(C_t)    (2.6)

Figure 2.2: Architecture of a single LSTM block, where σ denotes the sigmoidal gates. From [12].

The forget, input and output gates of each LSTM block are defined by equations 2.1–2.3 respectively. C̃_t, defined in equation 2.4, is the block input at time t, which consists of a tanh layer together with the input gate. Together they decide what information will be stored in the cell state, C_t. The cell state is updated from the old cell state at time t (equation 2.5). W and U are weight matrices and b is a bias vector. Finally, the hidden state h_t is the block output at time t (equation 2.6).
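To tie the equations together, the following is a minimal NumPy sketch of a single LSTM step following equations 2.1–2.6. It is illustrative only; the experiments in this thesis use the framework of [4], not this code. The dictionary keys mirror the symbols above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step following equations 2.1-2.6.
    `p` holds the weight matrices W_*, U_* and bias vectors b_*."""
    f_t = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])      # (2.1) forget gate
    i_t = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["b_i"])      # (2.2) input gate
    o_t = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["b_o"])      # (2.3) output gate
    c_tilde = np.tanh(p["W_c"] @ x_t + p["U_c"] @ h_prev + p["b_c"])  # (2.4) block input
    c_t = f_t * c_prev + i_t * c_tilde                                # (2.5) new cell state
    h_t = o_t * np.tanh(c_t)                                          # (2.6) block output
    return h_t, c_t

# Toy usage with random parameters (3 inputs, 4 hidden units).
rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
p = {}
for g in ["f", "i", "o", "c"]:
    p[f"W_{g}"] = rng.normal(size=(n_hid, n_in))
    p[f"U_{g}"] = rng.normal(size=(n_hid, n_hid))
    p[f"b_{g}"] = np.zeros(n_hid)
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid), p)
```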

2.4.2 Activation Functions

A node in a neural network takes N inputs, which are combined and passed through a nonlinearity to produce an output. These nonlinearities are called activation functions, as illustrated in figure 2.1. A bad choice of activation function can lead to loss of input data or to vanishing or exploding gradients in the neural network [13].

Sigmoid function

The sigmoid function has a range of [0, 1] and is illustrated in figure 2.3. Its formula is given by:

σ(x) = 1 / (1 + e^(−x))

Figure 2.3: Sigmoid activation function

Hyperbolic tangent function

The hyperbolic tangent function, further referred to as the hyperbolic function, is defined by:

tanh(x) = sinh(x) / cosh(x)

It has a range of [−1, 1] and is illustrated in figure 2.4.

Exponential linear unit

The ELU was introduced in [14] and made the deep neural network of that study learn faster and more accurately. Its formula is given by:

ELU(x) = x for x > 0, and ELU(x) = α(e^x − 1) for x ≤ 0

In figure 2.5 the α parameter is set to 1, giving a range of (−1, ∞).

Self-normalizing exponential linear unit

The SELU was introduced in [15]. It is similar to the ELU but with additional, specific parameters. It has properties that should eliminate the possibility of vanishing or exploding gradients. The function is illustrated in figure 2.6 and is defined by:

SELU(x) = λx for x > 0, and SELU(x) = λα(e^x − 1) for x ≤ 0
λ = 1.0507009873554804934193349852946
α = 1.6732632423543772848170429916717

Figure 2.4: Hyperbolic activation function

Figure 2.5: ELU activation function

Figure 2.6: SELU activation function
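The four activation functions compared in this study can be written in a few lines. The following NumPy sketch is illustrative (not the framework's own code) and uses the λ and α constants given above.

```python
import numpy as np

def sigmoid(x):
    # Range (0, 1); the default gate activation in LSTM blocks.
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Range (-1, 1); the default block input/output activation.
    return np.tanh(x)

def elu(x, alpha=1.0):
    # Identity for positive inputs, saturating towards -alpha for negative inputs.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def selu(x):
    # Scaled ELU with the fixed constants from [15].
    lam = 1.0507009873554804934193349852946
    alpha = 1.6732632423543772848170429916717
    return lam * np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

xs = np.linspace(-3, 3, 7)
for f in (sigmoid, tanh, elu, selu):
    print(f.__name__, np.round(f(xs), 3))
```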

2.5 Metrics

These are the same metrics as used in [4] and are thus identically defined. They are used to evaluate different qualities of recommendation systems.

• Sps. The Short-term Prediction Success captures the ability of the method to predict the next item. It is 1 if the next item is present in the recommendations, and 0 otherwise.

• Recall. The usual metric for top-N recommendation; it captures the ability of the method to make long-term predictions.

• User coverage. The fraction of users who received at least one correct recommendation. Average recall (and precision) hides the distribution of success among users; a high recall could still mean that many users do not receive any good recommendation. This metric captures the generality of the method.

• Item coverage. The number of distinct items that were correctly recommended. It captures the capacity of the method to make diverse, successful recommendations.

Observe that these metrics are all computed for a recommendation system that always produces ten recommendations for each user.
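As an illustration of how these metrics could be computed, here is a sketch that assumes each user receives a list of ten recommended item ids and has a held-out, chronologically ordered list of future items; this reflects our reading of the metric definitions, not the evaluation code of [4].

```python
def evaluate(recommendations, future_items):
    """`recommendations`: dict user -> list of 10 recommended item ids.
    `future_items`: dict user -> chronologically ordered held-out items
    (first element is the next item the user actually consumed)."""
    sps_hits, recall_sum, covered_users, covered_items = 0, 0.0, 0, set()
    for user, recs in recommendations.items():
        future = future_items[user]
        hits = [item for item in future if item in recs]
        sps_hits += 1 if future and future[0] in recs else 0   # short-term success
        recall_sum += len(hits) / max(len(future), 1)          # long-term recall
        covered_users += 1 if hits else 0                      # at least one correct rec
        covered_items.update(hits)                             # distinct correct items
    n = len(recommendations)
    return {"sps": sps_hits / n,
            "recall": recall_sum / n,
            "user_coverage": covered_users / n,
            "item_coverage": len(covered_items)}

# Toy example with two users and fabricated item ids.
recs = {"u1": list(range(1, 11)), "u2": list(range(11, 21))}
future = {"u1": [3, 42, 7], "u2": [99, 100]}
print(evaluate(recs, future))
```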

2.6 Related work

Applying different activation functions

In previous work, [7] and [12] conducted comparative studies analysing the performance of an LSTM network when switching between different activation functions. Both papers concluded that switching activation functions impacts the performance of the network. Although the standard activation function in the sigmoidal gates, the sigmoid function, gives high performance, some other, less-recognized activation functions that were tested could result in more accurate performance. Furthermore, [7] compared exactly 23 different activation functions, where the three gates (the input, output and forget gate) change activation functions while the block input and block output activation functions are held constant as the hyperbolic tangent (tanh). Additionally, the authors encourage further research on other parts of an LSTM network, such as the effect of changing the hyperbolic tangent function on the block input and block output instead of changing the activation functions in the three gates.

Different activation functions have also been applied to more complex LSTM-based neural networks in areas other than recommendation systems [16]. Several activation functions have been tested in LSTM blocks [16], in the context of the spatiotemporal convolutional LSTM (convLSTM) network introduced by [17], applied to the MNIST dataset. That study showed strong performance for the ELU and SELU activation functions, which outperformed traditional and popular choices such as the hyperbolic and sigmoid activation functions.

Applying LSTM in movie recommender systems

The authors' experiments in [4], where they tested LSTM in movie recommendation systems, showed that "...the LSTM produces very good results on the Movielens and Netflix datasets, and is especially good in terms of short term prediction and item coverage". Furthermore, the authors mention that it is possible to achieve better performance by adjusting the RNNs to specifically handle collaborative filtering problems.

Chapter 3

Methods

3.1 Dataset

The dataset used is Movielens 1M. The dataset contains many possible features that are not considered in the model; only the user id, movie id and timestamp are treated as features. Preprocessing is included in the LSTM implementation by [4].
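For illustration, here is a sketch of the kind of preprocessing involved: each user's ratings are reduced to a chronologically ordered sequence of movie ids, keeping only user id, movie id and timestamp. The file name and the `::`-separated format follow the standard Movielens 1M layout; the actual preprocessing is part of the framework of [4], so treat this as an assumption rather than its exact code.

```python
from collections import defaultdict

def load_sequences(path="ratings.dat"):
    """Build one chronologically ordered movie-id sequence per user from
    Movielens 1M ratings (format: UserID::MovieID::Rating::Timestamp).
    Ratings themselves are ignored, matching the feature choice in [4]."""
    events = defaultdict(list)
    with open(path, encoding="latin-1") as f:
        for line in f:
            user_id, movie_id, _rating, timestamp = line.strip().split("::")
            events[user_id].append((int(timestamp), int(movie_id)))
    return {user: [movie for _, movie in sorted(items)]
            for user, items in events.items()}

# sequences = load_sequences()
# print(next(iter(sequences.items())))
```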

3.2 Implementation

The modifications to the original code by [4] can be found in the authors' fork of the original repository on GitHub: github.com/andrebrogard/sequence-based-recommendations. The only modification made is the option to specify which activation functions to apply to the individual gates of the LSTM blocks when training and testing the model.

Neural network parameters

The authors of [4] observed comparable performance across layer sizes, and the fastest training, with a layer size of 20 neurons. Common to all layer sizes tested was that performance did not seem to improve beyond 100 epochs. Therefore all our tests use a layer size of 20 neurons and run for just over 100 epochs. One epoch is a unit of measurement indicating that the model has been trained on the entire dataset once.

Switching activation functions

The hyperbolic activation function is the default for the cell and hidden state, which are referred to as the block input and block output. The sigmoid function is the default


for the input, output and forget gates. In our tests, we compare four different activation functions applied identically to the block input and block output, namely the hyperbolic, sigmoid, ELU and SELU functions.
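Conceptually, the modification makes the block input (equation 2.4) and block output (equation 2.6) nonlinearities selectable, while the three gates keep the sigmoid. The following is a minimal sketch of that idea, not the fork's actual code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def selu(x):
    lam, alpha = 1.0507009873554805, 1.6732632423543772
    return lam * np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def lstm_step(x_t, h_prev, c_prev, p, act=np.tanh):
    """LSTM step where `act` replaces tanh in both the block input (eq. 2.4)
    and the block output (eq. 2.6); the three gates keep the sigmoid."""
    f_t = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])
    i_t = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["b_i"])
    o_t = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["b_o"])
    c_t = f_t * c_prev + i_t * act(p["W_c"] @ x_t + p["U_c"] @ h_prev + p["b_c"])
    return o_t * act(c_t), c_t

# The choice can then be exposed as a named option, e.g. on the command line.
ACTIVATIONS = {"tanh": np.tanh, "sigmoid": sigmoid, "elu": elu, "selu": selu}
block_activation = ACTIVATIONS["elu"]
```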

3.3 Evaluation

Metrics

The metrics used are identical to those of [4] and capture the same properties in order to make the results comparable. They are all calculated in the setting where the recommendation system makes ten recommendations. See section 2.5 for their definitions.

Test Data Set and Validation Data Set

The validation set is used during training to assess the accuracy of each model produced. The test set is used afterwards and has never been seen by the model before. All results reported in this study are obtained on the test data set. The test data size and validation data size have both been chosen as 500 to maintain comparability with [4].
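A sketch of how such a hold-out split could be made, under the assumption that whole users are held out for validation and testing; the figure of 6040 users corresponds to Movielens 1M.

```python
import random

def split_users(user_ids, val_size=500, test_size=500, seed=0):
    """Randomly hold out `val_size` users for validation and `test_size`
    users for testing; the rest are used for training."""
    users = list(user_ids)
    random.Random(seed).shuffle(users)
    val = set(users[:val_size])
    test = set(users[val_size:val_size + test_size])
    train = set(users[val_size + test_size:])
    return train, val, test

# Toy usage: Movielens 1M has 6040 users.
train, val, test = split_users(range(6040), val_size=500, test_size=500)
print(len(train), len(val), len(test))
```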

Number of tests

Training is conducted 15 times for each activation function on the dataset in order to capture variance and obtain a fair result. The models are then evaluated according to the metrics above.

Chapter 4

Results

Figures 4.1–4.4 show the mean sps, recall, user coverage and item coverage, respectively, across intermediate epochs from 1 to 102. All results are evaluated on the test data using models saved at each intermediate epoch. Each activation function was, as described, used to train a model 15 times, and the mean of each metric was computed over these runs. Table 4.1 shows the mean and the standard deviation of the results over the 15 models.

Both ELU and SELU perform worse than the sigmoid and hyperbolic functions across all metrics. Additionally, ELU always performs worse than SELU. The hyperbolic and sigmoid functions are similar in performance, with a slight advantage for the hyperbolic only in the recall metric. An observation shared among most activation functions and metrics is that the models do not seem to improve significantly beyond around 20 epochs. In the recall and sps metrics, all activation functions instead decrease. The SELU function decreases in all metrics after around 50 epochs, whereas the ELU function decreases after around 20 epochs.

Activation function    SPS (%)       Recall (%)     User Coverage (%)   Item Coverage
Hyperbolic             26.0 ± 1.4    7.05 ± 0.16    85.2 ± 1.0          595 ± 11
Sigmoid                26.6 ± 1.1    6.91 ± 0.17    84.7 ± 1.5          610 ± 14
SELU                   22.9 ± 1.5    6.08 ± 0.19    74.4 ± 1.5          507 ± 15
ELU                    16.1 ± 2.5    4.94 ± 0.43    78.9 ± 3.2          413 ± 27

Table 4.1: Comparison of activation functions and their metrics.


Figure 4.1: The mean sps across intermediate epochs. Evaluated on the test data.

Figure 4.2: The mean recall across intermediate epochs. Evaluated on the test data.

Figure 4.3: The mean user coverage across intermediate epochs. Evaluated on the test data.

Figure 4.4: The mean item coverage across intermediate epochs. Evaluated on the test data.

Chapter 5

Discussion

5.1 Result

The ELU and SELU seem to have had a negative impact on the models, as they did not achieve the same accuracy as the hyperbolic and sigmoid functions. Both functions were less accurate for short-term and long-term recommendations, fewer users received a correct recommendation, and fewer items were ever recommended. Interestingly, the sigmoid and hyperbolic functions displayed no significant difference in any metric, and the SELU function achieved, at around 50 epochs, the highest mean sps value of all the activation functions, before it started decreasing.

The ELU displayed the lowest mean and the highest standard deviation in almost all metrics. This further indicates that ELU was not a good choice of activation function. Moreover, SELU had a lower mean but a standard deviation similar to the sigmoid and hyperbolic functions. We believe this is a promising property of the SELU function, as it appears to be as stable as the sigmoid and hyperbolic functions.

The sigmoid function yields better results than the hyperbolic in sps and item coverage. Additionally, its standard deviation is slightly lower in those two metrics. Thus, according to our results, the sigmoid function could be a substitute for the default function.

The metrics associated with the hyperbolic function should be comparable with the results of [4], because the same framework is used and similar tests were performed. For a layer size of 20 neurons, as used here, they presented better results: their mean sps for the hyperbolic function, on the same dataset, was well over 30% at around 100 epochs. Furthermore, their model did not stop improving until around 100 epochs. Our results show that most activation functions had already attained their maximum sps at around 20 epochs. Had we observed smoother learning, we would have had more convincing results for the SELU and ELU functions.

5.2 Improvements

The choice of neural network parameters may explain the difference in results compared to [4]; in particular, the learning rate could affect the models. It could contribute to the fact that our models reach their maximum value more quickly, and it may hinder them from achieving similar results. We use the framework's default learning-rate parameters for the RNN, which uses Adam; this might explain the difference compared to [4]. Furthermore, the layer size, which was 20 neurons in this study, should have been varied as in [4] to better observe possible differences in learning. What neural network parameters to use should be considered more carefully in future work.

In each LSTM block, the block input and block output activation functions were the only ones changed, while the three gates (input, forget and output gates) kept the sigmoid function, which is the default. In [7], by contrast, 23 activation functions were applied to the three gates. The activation functions that showed the best performance in that study were not tested here because of time constraints. We did not observe a significant advantage for any activation function compared to the default. For future work, more comprehensive experiments evaluating more activation functions should be performed.

The study in [4] uses two datasets: Movielens 1M and Netflix. In this study, only Movielens 1M is used because of time constraints. Our results could therefore be strongly tied to the structure of this specific dataset. In future work, more datasets need to be considered.

The performance of each activation function is evaluated strictly on accuracy using each metric; the temporal aspect was overlooked. Because our tests did not record the duration for which the network was trained, whether and how an activation function achieves better accuracy in a shorter time was not evaluated. To better evaluate an activation function, future work should not overlook the temporal aspect.

Chapter 6

Conclusions

In this study, we have demonstrated that changing the activation functions in LSTM neural networks alters the prediction accuracy of movie recommendation systems. We have compared the performance of four different activation functions in LSTM neural networks (the hyperbolic tangent, sigmoid, ELU and SELU activation functions). Our results show that the hyperbolic tangent and sigmoid functions yielded higher prediction accuracy for movie recommendation systems than the ELU and SELU.

We have only compared four activation functions and trained the neural network on a single dataset. More research is needed to search for other activation functions that might perform better than the default hyperbolic tangent function. Furthermore, only one dataset was used and the temporal aspect was not considered. More and larger datasets should be employed in the search for higher-performing activation functions.

Bibliography

[1] Ericsson Consumer Lab. TV and Media – a consumer driven future of media. 2017.
[2] Song Tang, Zhiyong Wu, and Kang Chen. "Movie Recommendation via BLSTM". In: MultiMedia Modeling. Ed. by Laurent Amsaleg et al. Cham: Springer International Publishing, 2017, pp. 269–279. isbn: 978-3-319-51814-5.
[3] F.O. Isinkaye, Y.O. Folajimi, and B.A. Ojokoh. "Recommendation systems: Principles, methods and evaluation". In: Egyptian Informatics Journal 16.3 (2015), pp. 261–273. issn: 1110-8665. doi: 10.1016/j.eij.2015.06.005. url: http://www.sciencedirect.com/science/article/pii/S1110866515000341.
[4] Robin Devooght and Hugues Bersini. Collaborative Filtering with Recurrent Neural Networks. 2016. arXiv: 1608.07400 [cs.IR].
[5] Sepp Hochreiter and Jürgen Schmidhuber. "Long short-term memory". In: Neural Computation 9.8 (1997), pp. 1735–1780.
[6] Ayush Singhal, Pradeep Sinha, and Rakesh Pant. "Use of Deep Learning in Modern Recommendation System: A Summary of Recent Works". In: International Journal of Computer Applications 180.7 (Dec. 2017), pp. 17–22. issn: 0975-8887. doi: 10.5120/ijca2017916055. url: http://dx.doi.org/10.5120/ijca2017916055.
[7] Amir Farzad, Hoda Mashayekhi, and Hamid Hassanpour. "A comparative performance analysis of different activation functions in LSTM networks for classification". In: Neural Computing and Applications 31.7 (2019), pp. 2507–2521. issn: 1433-3058. doi: 10.1007/s00521-017-3210-6. url: https://doi.org/10.1007/s00521-017-3210-6.


[8] Yung-Yao Chen et al. "Design and Implementation of Cloud Analytics-Assisted Smart Power Meters Considering Advanced Artificial Intelligence as Edge Analytics in Demand-Side Management for Smart Homes". In: Sensors (Basel, Switzerland) 19.9 (May 2019), p. 2047. issn: 1424-8220. doi: 10.3390/s19092047. url: https://pubmed.ncbi.nlm.nih.gov/31052502.
[9] Patrick Henry Winston. Artificial Intelligence (3rd Ed.). USA: Addison-Wesley Longman Publishing Co., Inc., 1992. isbn: 0201533774.
[10] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. http://www.deeplearningbook.org. MIT Press, 2016.
[11] Christopher Olah. Understanding LSTM Networks. Aug. 2017. url: http://colah.github.io/posts/2015-08-Understanding-LSTMs/#fn1.
[12] Gecynalda S. da S. Gomes, Teresa B. Ludermir, and Leyla M. M. R. Lima. "Comparison of new activation functions in neural network for forecasting financial time series". In: Neural Computing and Applications 20.3 (2011), pp. 417–439. issn: 1433-3058. doi: 10.1007/s00521-010-0407-3. url: https://doi.org/10.1007/s00521-010-0407-3.
[13] Soufiane Hayou, Arnaud Doucet, and Judith Rousseau. "On the Impact of the Activation function on Deep Neural Networks Training". In: Proceedings of the 36th International Conference on Machine Learning. Ed. by Kamalika Chaudhuri and Ruslan Salakhutdinov. Vol. 97. Proceedings of Machine Learning Research. Long Beach, California, USA: PMLR, 2019, pp. 2672–2680. url: http://proceedings.mlr.press/v97/hayou19a.html.
[14] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). 2015. arXiv: 1511.07289 [cs.LG].
[15] Günter Klambauer et al. "Self-Normalizing Neural Networks". In: CoRR abs/1706.02515 (2017). arXiv: 1706.02515. url: http://arxiv.org/abs/1706.02515.
[16] Nelly Elsayed, Anthony Maida, and Magdy Bayoumi. "Effects of Different Activation Functions for Unsupervised Convolutional LSTM Spatiotemporal Learning". In: Advances in Science, Technology and Engineering Systems Journal 4 (Apr. 2019). doi: 10.25046/aj040234.

[17] Xingjian SHI et al. "Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting". In: Advances in Neural Information Processing Systems 28. Ed. by C. Cortes et al. Curran Associates, Inc., 2015, pp. 802–810. url: http://papers.nips.cc/paper/5955-convolutional-lstm-network-a-machine-learning-approach-for-precipitation-nowcasting.pdf.

TRITA-EECS-EX-2020:414

www.kth.se