On the Difficulty of Training Recurrent Neural Networks


Razvan Pascanu ([email protected])
Université de Montréal, 2920, chemin de la Tour, Montréal, Québec, Canada, H3T 1J8

Tomas Mikolov ([email protected])
Speech@FIT, Brno University of Technology, Brno, Czech Republic

Yoshua Bengio ([email protected])
Université de Montréal, 2920, chemin de la Tour, Montréal, Québec, Canada, H3T 1J8

Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the author(s).

Abstract

There are two widely known issues with properly training recurrent neural networks, the vanishing and the exploding gradient problems detailed in Bengio et al. (1994). In this paper we attempt to improve the understanding of the underlying issues by exploring these problems from an analytical, a geometric and a dynamical systems perspective. Our analysis is used to justify a simple yet effective solution. We propose a gradient norm clipping strategy to deal with exploding gradients and a soft constraint for the vanishing gradients problem. We validate empirically our hypothesis and proposed solutions in the experimental section.

1. Introduction

A recurrent neural network (RNN), e.g. Fig. 1, is a neural network model proposed in the 80's (Rumelhart et al., 1986; Elman, 1990; Werbos, 1988) for modeling time series. The structure of the network is similar to that of a standard multilayer perceptron, with the distinction that we allow connections among hidden units associated with a time delay. Through these connections the model can retain information about the past, enabling it to discover temporal correlations between events that are far away from each other in the data.

[Figure 1. Schematic of a recurrent neural network. The recurrent connections in the hidden layer allow information to persist from one input to another.]

While in principle the recurrent network is a simple and powerful model, in practice it is hard to train properly. Among the main reasons why this model is so unwieldy are the vanishing gradient and exploding gradient problems described in Bengio et al. (1994).

1.1. Training recurrent networks

A generic recurrent neural network, with input $u_t$ and state $x_t$ for time step $t$, is given by:

    $x_t = F(x_{t-1}, u_t, \theta)$    (1)

In the theoretical section of this paper we will sometimes make use of the specific parametrization (see footnote 1):

    $x_t = W_{rec}\,\sigma(x_{t-1}) + W_{in}\,u_t + b$    (2)

In this case, the parameters of the model are given by the recurrent weight matrix $W_{rec}$, the bias $b$ and the input weight matrix $W_{in}$, collected in $\theta$ for the general case. $x_0$ is provided by the user, set to zero or learned, and $\sigma$ is an element-wise function. A cost $\mathcal{E} = \sum_{1 \le t \le T} \mathcal{E}_t$ measures the performance of the network on some given task, where $\mathcal{E}_t = \mathcal{L}(x_t)$.

Footnote 1. For any model respecting eq. (2) we can construct another model following the more widely known equation $x_t = \sigma(W_{rec} x_{t-1} + W_{in} u_t + b)$ that behaves the same (e.g. by redefining $\mathcal{E}_t := \mathcal{E}_t(\sigma(x_t))$), and vice versa. Therefore the two formulations are equivalent; we chose eq. (2) for convenience.
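As a concrete illustration (ours, not the paper's), the recurrence of eq. (2) can be sketched in a few lines of NumPy; all names and dimensions below are hypothetical.

```python
import numpy as np

def rnn_forward(u, x0, W_rec, W_in, b, sigma=np.tanh):
    """Iterate eq. (2): x_t = W_rec sigma(x_{t-1}) + W_in u_t + b.

    u  : (T, n_in) input sequence u_1, ..., u_T
    x0 : (n_h,) initial state (user-provided, zero, or learned)
    Returns the states x_1, ..., x_T as a (T, n_h) array.
    """
    xs, x = [], x0
    for u_t in u:
        x = W_rec @ sigma(x) + W_in @ u_t + b
        xs.append(x)
    return np.stack(xs)

# Usage with arbitrary small dimensions
rng = np.random.default_rng(0)
T, n_in, n_h = 10, 3, 5
states = rnn_forward(rng.normal(size=(T, n_in)), np.zeros(n_h),
                     0.1 * rng.normal(size=(n_h, n_h)),
                     rng.normal(size=(n_h, n_in)), np.zeros(n_h))
```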
One approach for computing the necessary gradients is backpropagation through time (BPTT), where the recurrent model is represented as a multi-layer one (with an unbounded number of layers) and backpropagation is applied on the unrolled model (see Fig. 2).

[Figure 2. Unrolling recurrent neural networks in time by creating a copy of the model for each time step. We denote by $x_t$ the hidden state of the network at time $t$, by $u_t$ the input of the network at time $t$, and by $\mathcal{E}_t$ the error obtained from the output at time $t$.]

We will diverge from the classical BPTT equations at this point and re-write the gradients in order to better highlight the exploding gradients problem:

    $\frac{\partial \mathcal{E}}{\partial \theta} = \sum_{1 \le t \le T} \frac{\partial \mathcal{E}_t}{\partial \theta}$    (3)

    $\frac{\partial \mathcal{E}_t}{\partial \theta} = \sum_{1 \le k \le t} \frac{\partial \mathcal{E}_t}{\partial x_t} \frac{\partial x_t}{\partial x_k} \frac{\partial^{+} x_k}{\partial \theta}$    (4)

    $\frac{\partial x_t}{\partial x_k} = \prod_{t \ge i > k} \frac{\partial x_i}{\partial x_{i-1}} = \prod_{t \ge i > k} W_{rec}^{T}\,\mathrm{diag}(\sigma'(x_{i-1}))$    (5)

These equations were obtained by writing the gradients in a sum-of-products form. $\frac{\partial^{+} x_k}{\partial \theta}$ refers to the "immediate" partial derivative of the state $x_k$ with respect to $\theta$, where $x_{k-1}$ is taken as a constant with respect to $\theta$ (see footnote 2). Specifically, considering eq. (2), the value of any row $i$ of the matrix $\frac{\partial^{+} x_k}{\partial W_{rec}}$ is just $\sigma(x_{k-1})$. Eq. (5) also provides the form of the Jacobian matrix $\frac{\partial x_i}{\partial x_{i-1}}$ for the specific parametrization given in eq. (2), where diag converts a vector into a diagonal matrix, and $\sigma'$ computes element-wise the derivative of $\sigma$.

Footnote 2. We use "immediate" partial derivatives in order to avoid confusion, though one can use the concept of total derivative and the proper meaning of partial derivative to express the same property.

Any gradient component $\frac{\partial \mathcal{E}_t}{\partial \theta}$ is also a sum (see eq. (4)), whose terms we refer to as temporal contributions or temporal components. One can see that each such temporal contribution $\frac{\partial \mathcal{E}_t}{\partial x_t} \frac{\partial x_t}{\partial x_k} \frac{\partial^{+} x_k}{\partial \theta}$ measures how $\theta$ at step $k$ affects the cost at step $t > k$. The factors $\frac{\partial x_t}{\partial x_k}$ (eq. (5)) transport the error "in time" from step $t$ back to step $k$. We would further loosely distinguish between long term and short term contributions, where long term refers to components for which $k \ll t$ and short term to everything else.

2. Exploding and vanishing gradients

Introduced in Bengio et al. (1994), the exploding gradients problem refers to the large increase in the norm of the gradient during training. Such events are due to the explosion of the long term components, which can grow exponentially more than short term ones. The vanishing gradients problem refers to the opposite behaviour, when long term components go exponentially fast to norm 0, making it impossible for the model to learn correlations between temporally distant events.

2.1. The mechanics

To understand this phenomenon we need to look at the form of each temporal component, and in particular at the matrix factors $\frac{\partial x_t}{\partial x_k}$ (see eq. (5)) that take the form of a product of $t - k$ Jacobian matrices. In the same way a product of $t - k$ real numbers can shrink to zero or explode to infinity, so does this product of matrices (along some direction $v$).
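To watch these mechanics numerically, the sketch below (our illustration, with hypothetical names) accumulates the product of Jacobians from eq. (5) for $\sigma = \tanh$ and tracks its 2-norm. With the stand-in states used here, a small recurrent weight scale typically drives the norm toward zero while a large one makes it grow.

```python
import numpy as np

def jacobian_product_norms(states, W_rec, sigma_prime):
    """Accumulate the product in eq. (5),
    dx_t/dx_k = prod_{t >= i > k} W_rec^T diag(sigma'(x_{i-1})),
    over given states x_k, ..., x_{t-1}, recording the 2-norm
    (largest singular value) of each partial product."""
    J = np.eye(W_rec.shape[0])
    norms = []
    for x in states:                      # x plays the role of x_{i-1}
        J = W_rec.T @ np.diag(sigma_prime(x)) @ J
        norms.append(np.linalg.norm(J, 2))
    return norms

def tanh_prime(x):
    return 1.0 - np.tanh(x) ** 2          # sigma' for sigma = tanh

rng = np.random.default_rng(0)
n, steps = 50, 40
for scale in (0.5, 4.0):                  # contractive vs. expansive W_rec
    W_rec = scale * rng.normal(size=(n, n)) / np.sqrt(n)
    xs = rng.normal(size=(steps, n))      # stand-in states, for illustration only
    norms = jacobian_product_norms(xs, W_rec, tanh_prime)
    print(f"scale={scale}: ||J|| after 1 step {norms[0]:.3f}, "
          f"after {steps} steps {norms[-1]:.3e}")
```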
In what follows we will try to formalize these intuitions (extending similar derivations done in Bengio et al. (1994), where a single hidden unit case was considered).

If we consider a linear version of the model (i.e. set $\sigma$ to the identity function in eq. (2)) we can use the power iteration method to formally analyze this product of Jacobian matrices and obtain tight conditions for when the gradients explode or vanish (see the supplementary materials). It is sufficient for $\rho < 1$, where $\rho$ is the spectral radius of the recurrent weight matrix $W_{rec}$, for long term components to vanish (as $t \to \infty$), and necessary for $\rho > 1$ for them to explode.

We generalize this result for nonlinear functions $\sigma$ where $|\sigma'(x)|$ is bounded, $\|\mathrm{diag}(\sigma'(x_k))\| \le \gamma \in \mathbb{R}$, by relying on singular values.

We first prove that it is sufficient for $\lambda_1 < \frac{1}{\gamma}$, where $\lambda_1$ is the largest singular value of $W_{rec}$, for the vanishing gradient problem to occur. Note that we assume the parametrization given by eq. (2). The Jacobian matrix $\frac{\partial x_{k+1}}{\partial x_k}$ is given by $W_{rec}^{T}\,\mathrm{diag}(\sigma'(x_k))$. The 2-norm of this Jacobian is bounded by the product of the norms of the two matrices (see eq. (6)). Due to our assumption, this implies that it is smaller than 1:

    $\forall k,\; \left\|\frac{\partial x_{k+1}}{\partial x_k}\right\| \le \|W_{rec}^{T}\|\,\|\mathrm{diag}(\sigma'(x_k))\| < \frac{1}{\gamma}\,\gamma < 1$    (6)

Let $\eta \in \mathbb{R}$ be such that $\forall k,\; \left\|\frac{\partial x_{k+1}}{\partial x_k}\right\| \le \eta < 1$. The existence of $\eta$ is given by eq. (6). By induction over $i$, we can show that

    $\left\|\frac{\partial \mathcal{E}_t}{\partial x_t}\left(\prod_{i=k}^{t-1} \frac{\partial x_{i+1}}{\partial x_i}\right)\right\| \le \eta^{t-k}\left\|\frac{\partial \mathcal{E}_t}{\partial x_t}\right\|$    (7)

As $\eta < 1$, it follows that, according to eq. (7), long term contributions (for which $t - k$ is large) go to 0 exponentially fast with $t - k$.

By inverting this proof we get the necessary condition for exploding gradients, namely that the largest singular value $\lambda_1$ is larger than $\frac{1}{\gamma}$ (otherwise the long term components would vanish instead of exploding). For tanh we have $\gamma = 1$, while for sigmoid we have $\gamma = \frac{1}{4}$.

2.2. Drawing similarities with dynamical systems

...ing attractor or from a disappearing one qualifies as crossing a boundary between attractors, this term encapsulates also the observations from Doya (1993). We will re-use the one-hidden-unit model (and plot) from Doya (1993) (see Fig. 3) to depict our extension, though, as the original hypothesis, our extension does not depend on the dimensionality of the model.
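As a closing illustration of the gradient norm clipping strategy proposed in the abstract, here is a minimal sketch of the usual norm-clipping rule (rescale the gradient whenever its norm exceeds a threshold). The names are our own, and this is a hedged reading of the idea rather than the paper's verbatim pseudo-code, which appears in sections this excerpt does not include.

```python
import numpy as np

def clip_gradient_norm(grad, threshold):
    """Norm clipping: if ||g|| exceeds the threshold, rescale g to
    threshold * g / ||g||; otherwise leave it untouched."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = (threshold / norm) * grad
    return grad

# Usage inside a hypothetical SGD step
theta = np.zeros(10)
raw_grad = np.full(10, 100.0)             # an exploded gradient
theta -= 0.01 * clip_gradient_norm(raw_grad, threshold=1.0)
```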