
Incorporating Nesterov Momentum into Adam

Timothy Dozat

1 Introduction

When attempting to improve the performance of a system, there are more or less three approaches one can take. The first is to improve the structure of the model, perhaps adding another layer, switching from simple recurrent units to LSTM cells [4], or, in the realm of NLP, taking advantage of syntactic parses (e.g. as in [13, et seq.]). Another approach is to improve the initialization of the model, guaranteeing that the early-stage gradients have certain beneficial properties [3], building in large amounts of sparsity [6], or taking advantage of principles of linear algebra [15]. The final approach is to try a more powerful learning algorithm, such as including a decaying sum over the previous gradients in the update [12], dividing each parameter update by the L2 norm of the previous updates for that parameter [2], or even foregoing first-order algorithms for more powerful but more computationally costly second-order algorithms [9]. This paper pursues the third option: improving the quality of the final solution by using a faster, more powerful learning algorithm.

2 Related Work

2.1 Momentum-based algorithms

Gradient descent is a simple, well-known, and generally very robust optimization algorithm in which the gradient of the function to be minimized with respect to the parameters (∇f(θ_{t−1})) is computed, and a portion η of that gradient is subtracted off of the parameters:

Algorithm 1 Gradient Descent
  g_t ← ∇_{θ_{t−1}} f(θ_{t−1})
  θ_t ← θ_{t−1} − η g_t

Classical momentum [12] accumulates a decaying sum (with decay constant µ) of the previous gradients into a momentum vector m and uses that in place of the true gradient. This has the advantage of accelerating learning along dimensions where the gradient remains relatively consistent across training steps and slowing it along turbulent dimensions where the gradient oscillates significantly.

Algorithm 2 Classical Momentum
  g_t ← ∇_{θ_{t−1}} f(θ_{t−1})
  m_t ← µ m_{t−1} + g_t
  θ_t ← θ_{t−1} − η m_t

[14] show that Nesterov's accelerated gradient (NAG) [11], which has a provably better convergence bound than gradient descent, can be rewritten as a kind of improved momentum. If we substitute the definition of m_t in place of the symbol m_t in the parameter update, as in (2),

  θ_t ← θ_{t−1} − η m_t                         (1)
  θ_t ← θ_{t−1} − η µ m_{t−1} − η g_t           (2)

we can see that the term µ m_{t−1} does not depend on the current gradient g_t; so, in principle, we can get a superior step direction by applying the momentum vector to the parameters before computing the gradient. The authors provide empirical evidence that this algorithm is superior to gradient descent, classical momentum, and Hessian-Free [9] algorithms on conventionally difficult optimization objectives.

Algorithm 3 Nesterov's accelerated gradient
  g_t ← ∇_{θ_{t−1}} f(θ_{t−1} − η µ m_{t−1})
  m_t ← µ m_{t−1} + g_t
  θ_t ← θ_{t−1} − η m_t
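
To make the difference concrete, here is a minimal NumPy sketch (not from the original paper) of one update step of classical momentum and of NAG as written in Algorithms 2 and 3. The function names, default hyperparameters, and the toy quadratic objective are illustrative assumptions.

    import numpy as np

    def classical_momentum_step(theta, m, grad_fn, eta=0.01, mu=0.9):
        """One step of classical momentum (Algorithm 2)."""
        g = grad_fn(theta)           # g_t = gradient of f at theta_{t-1}
        m = mu * m + g               # m_t = mu * m_{t-1} + g_t
        theta = theta - eta * m      # theta_t = theta_{t-1} - eta * m_t
        return theta, m

    def nag_step(theta, m, grad_fn, eta=0.01, mu=0.9):
        """One step of NAG (Algorithm 3): the gradient is taken at the
        parameters after the momentum step has already been applied."""
        g = grad_fn(theta - eta * mu * m)   # look ahead by the momentum step
        m = mu * m + g
        theta = theta - eta * m
        return theta, m

    # Toy usage on f(theta) = 0.5 * ||theta||^2, whose gradient is theta.
    theta, m = np.ones(3), np.zeros(3)
    for _ in range(100):
        theta, m = nag_step(theta, m, grad_fn=lambda th: th)
    print(theta)  # close to the minimum at 0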

2.2 L2 norm-based algorithms

[2] present adaptive subgradient descent (AdaGrad), which divides each parameter's step size η by the L2 norm of all previous gradients for that parameter; this slows down learning along dimensions that have already changed significantly and speeds up learning along dimensions that have only changed slightly, stabilizing the model's representation of common features and allowing it to rapidly "catch up" its representation of rare features.¹

¹ Most implementations of this kind of algorithm include an ε parameter to keep the denominator from being too small and resulting in an irrecoverably large step.

Algorithm 4 AdaGrad
  g_t ← ∇_{θ_{t−1}} f(θ_{t−1})
  n_t ← n_{t−1} + g_t²
  θ_t ← θ_{t−1} − η g_t / (√n_t + ε)

One notable problem with AdaGrad is that the norm vector n eventually becomes so large that training slows to a halt, preventing the model from reaching the local minimum; [16] go on to motivate RMSProp, an alternative to AdaGrad that replaces the sum in n_t with a decaying mean, parameterized here by ν. This allows the model to continue to learn indefinitely.

Algorithm 5 RMSProp
  g_t ← ∇_{θ_{t−1}} f(θ_{t−1})
  n_t ← ν n_{t−1} + (1 − ν) g_t²
  θ_t ← θ_{t−1} − η g_t / (√n_t + ε)
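
As an illustration of this per-parameter scaling, the following is a minimal NumPy sketch (not the paper's code) of one AdaGrad step and one RMSProp step; the function names and default hyperparameter values are assumptions made for the example.

    import numpy as np

    def adagrad_step(theta, n, grad_fn, eta=0.01, eps=1e-8):
        """One step of AdaGrad (Algorithm 4): n is a running sum of squared
        gradients, so the effective step size only ever shrinks."""
        g = grad_fn(theta)
        n = n + g ** 2
        theta = theta - eta * g / (np.sqrt(n) + eps)
        return theta, n

    def rmsprop_step(theta, n, grad_fn, eta=0.001, nu=0.999, eps=1e-8):
        """One step of RMSProp (Algorithm 5): n is a decaying mean of squared
        gradients, so the denominator does not grow without bound."""
        g = grad_fn(theta)
        n = nu * n + (1.0 - nu) * g ** 2
        theta = theta - eta * g / (np.sqrt(n) + eps)
        return theta, n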

2.3 Combination

One might ask whether combining the momentum-based and norm-based methods might provide the advantages of both. In fact, [5] successfully do so with adaptive moment estimation (Adam), combining classical momentum (using a decaying mean instead of a decaying sum) with RMSProp to improve performance on a number of benchmarks. In their algorithm, they include initialization bias correction terms, which offset some of the instability that initializing m and n to 0 can create.

Algorithm 6 Adam
  g_t ← ∇_{θ_{t−1}} f(θ_{t−1})
  m_t ← µ m_{t−1} + (1 − µ) g_t
  m̂_t ← m_t / (1 − µ^t)
  n_t ← ν n_{t−1} + (1 − ν) g_t²
  n̂_t ← n_t / (1 − ν^t)
  θ_t ← θ_{t−1} − η m̂_t / (√n̂_t + ε)

[5] also include an algorithm, AdaMax, that replaces the L2 norm with the L∞ norm, removing the need for n̂_t and replacing the updates for n_t and θ_t with the following:

  n_t ← max(ν n_{t−1}, |g_t|)
  θ_t ← θ_{t−1} − η m̂_t / (n_t + ε)

We can generalize this to RMSProp as well, using the L∞ norm in the denominator instead of the L2 norm, giving what might be called the MaxaProp algorithm.
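
For concreteness, here is a hedged NumPy sketch of one Adam step (Algorithm 6) and one AdaMax step; the function names and default hyperparameters are illustrative choices rather than the paper's implementation.

    import numpy as np

    def adam_step(theta, m, n, t, grad_fn, eta=0.001, mu=0.9, nu=0.999, eps=1e-8):
        """One step of Adam (Algorithm 6), with initialization bias correction
        (t is the 1-based timestep)."""
        g = grad_fn(theta)
        m = mu * m + (1.0 - mu) * g          # decaying mean of gradients
        n = nu * n + (1.0 - nu) * g ** 2     # decaying mean of squared gradients
        m_hat = m / (1.0 - mu ** t)          # bias-corrected momentum
        n_hat = n / (1.0 - nu ** t)          # bias-corrected norm
        theta = theta - eta * m_hat / (np.sqrt(n_hat) + eps)
        return theta, m, n

    def adamax_step(theta, m, n, t, grad_fn, eta=0.001, mu=0.9, nu=0.999, eps=1e-8):
        """One step of AdaMax: the L-infinity variant, where n is a running
        maximum of gradient magnitudes and needs no bias correction."""
        g = grad_fn(theta)
        m = mu * m + (1.0 - mu) * g
        m_hat = m / (1.0 - mu ** t)
        n = np.maximum(nu * n, np.abs(g))
        theta = theta - eta * m_hat / (n + eps)
        return theta, m, n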

3 Methods

3.1 NAG revisited

Adam combines RMSProp with classical momentum. But as [14] show, NAG is in general superior to classical momentum, so how would we go about modifying Adam to use NAG instead? First, we rewrite the NAG algorithm to be more straightforward and efficient to implement, at the cost of some intuitive readability. Momentum is most effective with a warming schedule, so for completeness we parameterize µ by t as well.

Algorithm 7 NAG rewritten
  g_t ← ∇_{θ_{t−1}} f(θ_{t−1})
  m_t ← µ_t m_{t−1} + g_t
  m̄_t ← g_t + µ_{t+1} m_t
  θ_t ← θ_{t−1} − η m̄_t

Here, the vector m̄_t contains the gradient update for the current timestep, g_t, in addition to the momentum vector update for the next timestep, µ_{t+1} m_t, which needs to be applied before taking the gradient at the next timestep. We no longer need to apply the momentum vector for the current timestep because we already applied it in the previous update of the parameters, at timestep t − 1.

3.2 Applying NAG to Adam

Ignoring the initialization bias correction terms for the moment, Adam's update rule can be written in terms of the previous momentum/norm vectors and the current gradient as in (3):

  θ_t ← θ_{t−1} − η µ m_{t−1} / (√(ν n_{t−1} + (1 − ν) g_t²) + ε)
               − η (1 − µ) g_t / (√(ν n_{t−1} + (1 − ν) g_t²) + ε)        (3)

In rewritten NAG, we would take the first part of the step and apply it before taking the gradient of the cost function f. However, the denominator here depends on g_t, so we cannot take advantage of the NAG trick for this equation as written. However, ν is generally chosen to be very large (normally > .9), so the difference between n_{t−1} and n_t will in general be very small. We can therefore replace n_t with n_{t−1} in the first term without losing too much accuracy:

  θ_t ← θ_{t−1} − η µ m_{t−1} / (√n_{t−1} + ε)
               − η (1 − µ) g_t / (√(ν n_{t−1} + (1 − ν) g_t²) + ε)        (4)

The first term in (4) no longer depends on g_t, so here we can use the Nesterov trick; this gives us the following expressions for m̄_t and θ_t:

  m̄_t ← (1 − µ_t) g_t + µ_{t+1} m_t
  θ_t ← θ_{t−1} − η m̄_t / (√n_t + ε)

All that's left is to determine how to include the initialization bias correction terms, taking into consideration that g_t comes from the current timestep but m_t comes from the subsequent timestep. This gives us the final form of the Nesterov-accelerated adaptive moment estimation (Nadam) algorithm, given in Algorithm 8. AdaMax can make use of the Nesterov acceleration trick identically (NadaMax).

Algorithm 8 Nesterov-accelerated adaptive moment estimation
  g_t ← ∇_{θ_{t−1}} f(θ_{t−1})
  ĝ_t ← g_t / (1 − ∏_{i=1}^{t} µ_i)
  m_t ← µ_t m_{t−1} + (1 − µ_t) g_t
  m̂_t ← m_t / (1 − ∏_{i=1}^{t+1} µ_i)
  n_t ← ν n_{t−1} + (1 − ν) g_t²
  n̂_t ← n_t / (1 − ν^t)
  m̄_t ← (1 − µ_t) ĝ_t + µ_{t+1} m̂_t
  θ_t ← θ_{t−1} − η m̄_t / (√n̂_t + ε)
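
As a concrete illustration of Algorithm 8, the following NumPy sketch implements one Nadam step together with a momentum warm-up schedule of the form used in the experiments below, µ_t = µ(1 − .5 × .96^(t/250)). The function names, the running-product bookkeeping, the default hyperparameters, and the toy objective are assumptions made for this example, not the paper's implementation.

    import numpy as np

    def mu_schedule(t, mu=0.99):
        """Momentum warm-up schedule: mu_t = mu * (1 - 0.5 * 0.96**(t / 250))."""
        return mu * (1.0 - 0.5 * 0.96 ** (t / 250.0))

    def nadam_step(theta, m, n, t, mu_prod, grad_fn, eta=0.001, nu=0.999, eps=1e-8):
        """One step of Nadam (Algorithm 8). mu_prod carries the running product
        mu_1 * ... * mu_{t-1} between calls; t is the 1-based timestep."""
        mu_t, mu_next = mu_schedule(t), mu_schedule(t + 1)
        g = grad_fn(theta)
        mu_prod_t = mu_prod * mu_t                   # prod_{i<=t} mu_i
        g_hat = g / (1.0 - mu_prod_t)                # bias-corrected gradient
        m = mu_t * m + (1.0 - mu_t) * g
        m_hat = m / (1.0 - mu_prod_t * mu_next)      # corrected with prod_{i<=t+1}
        n = nu * n + (1.0 - nu) * g ** 2
        n_hat = n / (1.0 - nu ** t)
        m_bar = (1.0 - mu_t) * g_hat + mu_next * m_hat
        theta = theta - eta * m_bar / (np.sqrt(n_hat) + eps)
        return theta, m, n, mu_prod_t

    # Toy usage on f(theta) = 0.5 * ||theta||^2, whose gradient is theta.
    theta, m, n, mu_prod = np.ones(3), np.zeros(3), np.zeros(3), 1.0
    for t in range(1, 1001):
        theta, m, n, mu_prod = nadam_step(theta, m, n, t, mu_prod,
                                          grad_fn=lambda th: th)
    print(theta)  # ends up near the minimum at 0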

4 Experiments

To test this algorithm, we compared the performance of nine algorithms (GD, Momentum, NAG, RMSProp, Adam, Nadam, MaxaProp, AdaMax, and NadaMax) on three benchmarks: word2vec word embeddings [10], MNIST image classification [7], and LSTM language models [17]. All algorithms used ν = .999 and ε = 1e−8 as suggested in [5], with a momentum schedule given by µ_t = µ(1 − .5 × .96^(t/250)) with µ = .99, similar to the recommendation in [14]. Only the learning rate η differed across algorithms and experiments. The algorithms were all coded using Google's TensorFlow [1] API and the experiments were done using the built-in TensorFlow models, making only small edits to the default settings. All algorithms used initialization bias correction.

4.1 Word2Vec

Word2vec [10] word embeddings were trained using each of the nine algorithms. Approximately 100MB of cleaned text² from Wikipedia was used as the source text, and any word not in the top 50000 words was replaced with UNK. 128-dimensional vectors with a left and right context size of 1 were trained using noise-contrastive estimation with 64 negative samples. Validation was done using the word analogy task; we report the average cosine difference ((1 − cossim(x, y)) / 2) between the analogy vector and the embedding of the correct answer to the analogy. The best results were achieved when each column of the embedding vector had a 50% chance of being dropped during training. The results are in Figure 1.

² http://mattmahoney.net/dc/text8.zip

Figure 1: Training of word2vec word embeddings. Final test loss (average cosine difference):
  GD .368    Mom .361     NAG .358
  RMS .316   Adam .325    Nadam .284
  Maxa .346  A-max .356   N-max .355

The methods with RMSProp produced word vectors that represented relationships between words significantly better than the other methods, and RMSProp with Nesterov momentum (Nadam) clearly outperformed RMSProp with no momentum and RMSProp with classical momentum (Adam). Notice also that the algorithms with NAG consistently outperform the algorithms with classical momentum.
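
For clarity, here is a small NumPy sketch of the validation metric reported above, the average cosine difference (1 − cossim(x, y)) / 2; the function name and the assumption that the analogy vectors and correct-answer embeddings arrive as row-aligned matrices are illustrative.

    import numpy as np

    def avg_cosine_difference(analogy_vecs, answer_vecs):
        """Mean of (1 - cosine_similarity) / 2 over row-aligned vector pairs."""
        x = analogy_vecs / np.linalg.norm(analogy_vecs, axis=1, keepdims=True)
        y = answer_vecs / np.linalg.norm(answer_vecs, axis=1, keepdims=True)
        cos_sim = np.sum(x * y, axis=1)
        return float(np.mean((1.0 - cos_sim) / 2.0))
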
4.2 Image Recognition

MNIST [7] is a classic benchmark for testing algorithms. We train a CNN with two convolutional layers of 5 × 5 filters of depth 32 and depth 64, alternating with 2 × 2 max pooling layers with a stride of 2, followed by a fully connected layer of 512 nodes and a softmax layer on top of that; this is the default model in TensorFlow's basic MNIST CNN example. The annealing schedule was modified to decay faster and the model was trained for more epochs in order to smooth out the error graphs. The results are given in Figure 2.

Figure 2: Training of handwritten digit recognition. Final test loss:
  GD .0202    Mom .0263     NAG .0283
  RMS .0172   Adam .0175    Nadam .0183
  Maxa .0195  A-max .0231   N-max .0204

In this benchmark, Nadam achieves the lowest development error, as predicted, but on the test set all momentum-based methods underperform their simple counterparts without momentum. However, [5] use a larger CNN and find that Adam achieves the lowest error. Furthermore, the hyperparameters here have not been tuned for this task (except the global learning rate), and in other training runs Nadam performed noticeably better with a higher value of ε.

4.3 LSTM Language Model

The final benchmark we present is the task of predicting the next word of a sentence given the previous words in the sentence. The dataset used here comes from the Penn TreeBank [8], and the model uses a two-layer stacked LSTM with 200 cells at each layer, reproducing [17] on a smaller scale. All default settings in the small TensorFlow implementation were retained, with the exception that the model included L2 regularization in order to offset the lack of dropout. The results of training are presented in Figure 3.

Figure 3: Training of a Penn TreeBank language model. Final test perplexity:
  GD 100.8    Mom 99.3      NAG 99.8
  RMS 106.7   Adam 111.0    Nadam 105.5
  Maxa 106.3  A-max 108.5   N-max 107.0

In this benchmark, algorithms that incorporated RMSProp or MaxaProp did noticeably worse than gradient descent and those that only use momentum, and tuning hyperparameters makes little difference. It's unclear why this is; it's conceivable that the assumptions RMSProp and MaxaProp make about how features are represented don't hold in the case of language models, LSTMs, or recurrent NNs more broadly. It's also curious that classical momentum outperformed Nesterov momentum, although that may be due to the relatively small size of the models. Crucially, even though Nadam wasn't the optimal algorithm in this case, it significantly outperformed Adam, supporting our hypothesis that using Nesterov momentum will improve Adam.

5 Discussion

While the results are a little messy, we can infer a number of things. The first is that the hitherto untested MaxaProp-based methods, including AdaMax and NadaMax, generally do worse than at least some of the other algorithms. In the word2vec task, the algorithms with Nesterov momentum consistently achieve better results than the ones with only classical momentum, supporting the intuition behind the hypothesis that Nadam should be better than Adam. We also see that Nadam produces by far the best word vectors for this task when dropout is included (which is not normally done, but in our experiments categorically improves results). With the MNIST task, we find that including RMSProp makes a significant difference in the quality of the final classifier, likely owing to RMSProp's ability to adapt to different depths. Nadam performed best on the development set, but contra [5], RMSProp surpassed it on the test set. This may be due to the smaller size of our model, the relative easiness of the task (CIFAR-10 might make a better benchmark in a less time-sensitive setting), our choice to only use default settings, random fluctuations that arise from dropout, or possibly even slight overfitting (which can be clearly seen in Nadam, but not in RMSProp, when hyperparameters are tuned). Finally, in the LSTM language model, we find that using RMSProp hinders performance even when hyperparameters are extensively tuned; but when it is included anyway, adding Nesterov momentum to it as in Nadam improves over not using any momentum or only using classical momentum.

In two of the three benchmarks, we found a rather significant difference between the performance of models trained with Adam and that of those trained with Nadam. Why might this be? When the denominator of RMSProp is small, the effective step size of a parameter update becomes very large, and when the denominator is large, the model requires a lot of work to change what it has learned, so ensuring the quality of the step direction and reducing as much noise as possible becomes critical. We can understand classical momentum as a version of Nesterov momentum that applies an old, outdated momentum vector, one computed only from past gradients, to the parameter updates, rather than the most recent, up-to-date momentum vector computed using the current gradient as well. It logically follows that Nesterov momentum should produce a step direction superior to classical momentum, which RMSProp can then take full advantage of.

6 Conclusion

Adam essentially combines two algorithms known to work well for different reasons: momentum, which points the model in a better direction, and RMSProp, which adapts how far the model goes in that direction on a per-parameter basis. However, Nesterov momentum is theoretically and often empirically superior to classical momentum, so it makes sense to see how it affects the result of combining the two approaches. We compared Adam and Nadam (Adam with Nesterov momentum), in addition to a slew of related algorithms, and find that in most cases the improvement of Nadam over Adam is fairly dramatic. While Nadam wasn't always the best algorithm to choose, it seems that if one is going to use momentum with RMSProp to train a model, one may as well use Nesterov momentum.

Distributed Michael Isard, Yangqing Jia, Rafal Jozefow- representations of words and phrases and their icz, Lukasz Kaiser, Manjunath Kudlur, Josh compositionality. In Advances in neural infor- Levenberg, Dan Mane,´ Rajat Monga, Sherry mation processing systems, pages 3111–3119, Moore, Derek Murray, Chris Olah, Mike 2013. Schuster, Jonathon Shlens, Benoit Steiner, Ilya [11] Yurii Nesterov. A method of solving a convex Sutskever, Kunal Talwar, Paul Tucker, Vin- programming problem with convergence rate cent Vanhoucke, Vijay Vasudevan, Fernanda O(1/k2). In Soviet Mathematics Doklady, vol- Viegas,´ Oriol Vinyals, Pete Warden, Martin ume 27, pages 372–376, 1983. Wattenberg, Martin Wicke, Yuan Yu, and Xi- [12] Boris Teodorovich Polyak. Some methods of aoqiang Zheng. TensorFlow: Large-scale speeding up the convergence of iteration meth- machine learning on heterogeneous systems, ods. USSR Computational Mathematics and 2015. Software available from tensorflow.org. Mathematical Physics, 4(5):1–17, 1964. [2] John Duchi, Elad Hazan, and Yoram Singer. [13] Richard Socher, Cliff C Lin, Andrew Y Ng, Adaptive subgradient methods for online learn- and Chris Manning. Parsing natural scenes ing and stochastic optimization. The Journal and natural language with recursive neural net- of Machine Learning Research, 12:2121–2159, works. In Proceedings of the 28th interna- 2011. tional conference on machine learning (ICML- [3] Xavier Glorot, Antoine Bordes, and Yoshua 11), pages 129–136, 2011. Bengio. Deep sparse rectifier neural networks. [14] Ilya Sutskever, James Martens, George Dahl, In International Conference on Artificial Intel- and . On the importance of ini- ligence and Statistics, pages 315–323, 2011. tialization and momentum in deep learning. In [4] Sepp Hochreiter and Jurgen¨ Schmidhuber. Proceedings of the 30th International Confer- Long short-term memory. Neural computation, ence on Machine Learning (ICML-13), pages 9(8):1735–1780, 1997. 1139–1147, 2013. [5] Diederik Kingma and Jimmy Ba. Adam: A [15] Sachin S Talathi and Aniket Vartak. Im- method for stochastic optimization. arXiv proving performance of recurrent neural net- preprint arXiv:1412.6980, 2014. work with relu nonlinearity. arXiv preprint arXiv:1511.03771, 2015. [6] Quoc V. Le, Navdeep Jaitly, and Geoffrey E. Hinton. A simple way to initialize recur- [16] Tijmen Tieleman and Geoffrey Hinton. Lec- rent networks of rectified linear units. CoRR, ture 6.5-rmsprop: Divide the gradient by a run- abs/1504.00941, 2015. ning average of its recent magnitude. COURS- ERA: Neural Networks for Machine Learning, [7] Yann LeCun, Corinna Cortes, and Christo- 4, 2012. pher JC Burges. The mnist database of hand- written digits, 1998. [17] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regulariza- [8] Mitchell P Marcus, Mary Ann Marcinkiewicz, tion. CoRR, abs/1409.2329, 2014. and Beatrice Santorini. Building a large an- notated corpus of english: The penn tree- bank. Computational linguistics, 19(2):313– 330, 1993.