Training Deep Networks with Stochastic Gradient Normalized by Layerwise Adaptive Second Moments
Training Deep Networks with Stochastic Gradient Normalized by Layerwise Adaptive Second Moments

Boris Ginsburg, Patrice Castonguay, Oleksii Hrinchuk, Oleksii Kuchaiev, Ryan Leary, Vitaly Lavrukhin, Jason Li, Huyen Nguyen, Yang Zhang, Jonathan M. Cohen
NVIDIA, Santa Clara, USA. Correspondence to: Boris Ginsburg <[email protected]>.
Preprint. Under review.

Abstract

We propose NovoGrad, an adaptive stochastic gradient descent method with layer-wise gradient normalization and decoupled weight decay. In our experiments on neural networks for image classification, speech recognition, machine translation, and language modeling, it performs on par with or better than well-tuned SGD with momentum, Adam, and AdamW. Additionally, NovoGrad (1) is robust to the choice of learning rate and weight initialization, (2) works well in a large-batch setting, and (3) has half the memory footprint of Adam.

1. Introduction

The most popular algorithms for training Neural Networks (NNs) are Stochastic Gradient Descent (SGD) with momentum (Polyak, 1964; Sutskever et al., 2013) and Adam (Kingma & Ba, 2015). SGD with momentum is the preferred algorithm for computer vision, while Adam is more commonly used for natural language processing (NLP) and speech problems. Compared to SGD, Adam is perceived as safer and more robust to weight initialization and learning rate. However, Adam has certain drawbacks. First, as noted in the original paper (Kingma & Ba, 2015), the second moment can vanish or explode, especially during the initial phase of training. Also, Adam often leads to solutions that generalize worse than SGD (Wilson et al., 2017); for example, models for image classification trained with Adam have significantly lower accuracy than when they are trained with SGD with momentum (Loshchilov & Hutter, 2019).

Our motivation for this work was to build an algorithm which: (1) performs equally well for image classification, speech recognition, machine translation, and language modeling, (2) is robust to learning rate (LR) and weight initialization, and (3) has strong regularization properties.

We started with Adam, and then (1) replaced the element-wise second moment with the layer-wise moment, (2) computed the first moment using gradients normalized by the layer-wise second moment, and (3) decoupled weight decay (WD) from the normalized gradients (similar to AdamW).

The resulting algorithm, NovoGrad, combines the strengths of SGD and Adam. We applied NovoGrad to a variety of large-scale problems (image classification, neural machine translation, language modeling, and speech recognition) and found that in all cases it performs as well as or better than Adam/AdamW and SGD with momentum.

2. Related Work

NovoGrad belongs to the family of Stochastic Normalized Gradient Descent (SNGD) optimizers (Nesterov, 1984; Hazan et al., 2015). SNGD uses only the direction of the stochastic gradient g_t to update the weights w_t:

    w_{t+1} = w_t - \lambda_t \cdot \frac{g_t}{\|g_t\|}

The step size does not depend on the magnitude of the gradient. Hazan et al. (2015) proved that the direction of the gradient is sufficient for convergence. Ignoring the gradient magnitude makes SNGD robust to vanishing and exploding gradients, but SNGD remains sensitive to "noisy" gradients, especially during the initial training phase.
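To make the SNGD step concrete, here is a minimal NumPy sketch of one normalized-gradient update (the function name and the epsilon term are our illustrative choices, not part of the cited formulations):

    import numpy as np

    def sngd_step(w, grad, lr, eps=1e-8):
        # Use only the direction of the stochastic gradient: divide by its norm,
        # so the step size is independent of the gradient magnitude.
        norm = np.linalg.norm(grad)           # ||g_t||
        return w - lr * grad / (norm + eps)   # eps added for numerical stability (our assumption)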
One can improve SNGD stability by gradient averaging. Adagrad (Duchi et al., 2011), RmsProp (Tieleman & Hinton, 2012), and Adam (Kingma & Ba, 2015) all use some kind of gradient averaging. Adam, the most popular one, uses the moving averages m_t, v_t:

    m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t        (1)
    v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2      (2)

The weight update is computed with the first moment m_t normalized by the second moment v_t:

    w_{t+1} = w_t - \lambda_t \cdot \frac{m_t}{\sqrt{v_t} + \epsilon}    (3)

where ε is added for numerical stability. To strengthen Adam's robustness to "noisy" gradients, the coefficients β_1 and β_2 are usually chosen close to 1 (e.g., β_2 > 0.99).

2.1. Layer-wise Gradient Normalization

Another way to improve SNGD robustness was proposed by Yu et al. (2018), who suggested combining Adam with layer-wise gradient normalization. Both moments are computed with normalized gradients \hat{g}_t^l = g_t^l / \|g_t^l\|, where g_t^l is the gradient for layer l at step t:

    m_t^l = \beta_1 \cdot m_{t-1}^l + (1 - \beta_1) \cdot \hat{g}_t^l
    v_t^l = \beta_2 \cdot v_{t-1}^l + (1 - \beta_2) \cdot \|\hat{g}_t^l\|^2

A similar idea was used in (Singh et al., 2015) to scale up the gradients of small layers while keeping large gradients unchanged:

    \hat{g}_t^l = g_t^l \cdot \left(1 + \log\left(1 + \frac{1}{\|g_t^l\|}\right)\right)

2.2. Improving Adam Generalization

Adaptive methods like Adam generalize worse than SGD with momentum (Wilson et al., 2017). For example, Keskar & Socher (2017) proposed to use Adam only during the initial stage of training and then switch to SGD. Luo et al. (2019) suggested improving Adam generalization by limiting the factor 1/\sqrt{v_t} to a certain range: limiting it from above helps to decrease the training loss, while limiting it from below helps the model generalize better.

To improve Adam regularization, Loshchilov & Hutter (2019) proposed AdamW, which decouples the weight decay d \cdot w_t from the gradient and uses it directly in the weight update:

    w_{t+1} = w_t - \lambda_t \cdot \left(\frac{m_t}{\sqrt{v_t} + \epsilon} + d \cdot w_t\right)

2.3. Reduction of Adam Memory Footprint

Adam needs to store the second moment, which doubles the optimizer memory compared to SGD with momentum. This affects large models like GPT-2 with 1.5 billion parameters (Radford et al., 2019), Meena with 2.6 billion parameters (Adiwardana et al., 2020), or Megatron-LM with 8.3 billion parameters (Shoeybi et al., 2019). Shazeer & Stern (2018) proposed the AdaFactor algorithm, which replaces the full second moment with moving averages of the row and column sums of the squared gradients. For a layer defined by an n × m matrix, this reduces the memory from O(n × m) to O(n + m).

3. Algorithm

NovoGrad combines three ideas: (1) use layer-wise second moments, (2) compute the first moment with gradients normalized by the layer-wise second moments, and (3) decouple weight decay.

Algorithm 1 NovoGrad
    Parameters: initial learning rate \lambda_0, moments \beta_1, \beta_2, weight decay d, number of steps T
    t = 0 (weight initialization):
        w_0 <- Init()
    t = 1 (moment initialization): for each layer l do
        v_1^l <- \|g_1^l\|^2
        m_1^l <- g_1^l / \sqrt{v_1^l} + d \cdot w_1^l
    end for
    while t <= T do
        compute global learning rate \lambda_t <- LR(\lambda_0, t, T)
        for each layer l do
            g_t^l <- \nabla_l L(w_t)
            v_t^l <- \beta_2 \cdot v_{t-1}^l + (1 - \beta_2) \cdot \|g_t^l\|^2
            m_t^l <- \beta_1 \cdot m_{t-1}^l + (g_t^l / (\sqrt{v_t^l} + \epsilon) + d \cdot w_t^l)
            w_{t+1}^l <- w_t^l - \lambda_t \cdot m_t^l
        end for
    end while
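As a companion to Algorithm 1, here is a minimal NumPy sketch of one NovoGrad update over a dict of layer tensors (the function signature, the state layout, and the epsilon placement in the initialization step are our assumptions, not the authors' reference implementation):

    import numpy as np

    def novograd_step(weights, grads, state, lr, beta1=0.95, beta2=0.25,
                      wd=0.0, eps=1e-8):
        """One NovoGrad step. weights, grads: dicts layer_name -> np.ndarray.
        state: dict layer_name -> {'v': float, 'm': np.ndarray}; pass {} at step 1."""
        for name, g in grads.items():
            g_norm_sq = float(np.sum(g * g))      # ||g_t^l||^2
            if name not in state:                 # t = 1: moment initialization
                v = g_norm_sq
                m = g / (np.sqrt(v) + eps) + wd * weights[name]
            else:
                # layer-wise second moment, then first moment built from the
                # normalized gradient with decoupled weight decay added
                v = beta2 * state[name]['v'] + (1 - beta2) * g_norm_sq
                m = beta1 * state[name]['m'] + (g / (np.sqrt(v) + eps) + wd * weights[name])
            state[name] = {'v': v, 'm': m}
            weights[name] = weights[name] - lr * m   # SGD-style weight update
        return weights, state

Note that only a scalar second moment v and one momentum tensor m are stored per layer, which is where the roughly halved optimizer memory relative to Adam comes from.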
Let g_t^l be the stochastic gradient for layer l at step t. First, we compute the layer-wise second moment v_t^l using \|g_t^l\|:

    v_t^l = \beta_2 \cdot v_{t-1}^l + (1 - \beta_2) \cdot \|g_t^l\|^2    (4)

where 0 ≤ β_2 ≤ 1. We use a much smaller β_2 than in Adam (the default β_2 = 0.25 was used in all experiments except ImageNet classification, where β_2 = 0.98 was used). The moment v_t^l is used to normalize the gradient g_t^l before calculating the first moment m_t^l:

    m_t^l = \beta_1 \cdot m_{t-1}^l + \frac{g_t^l}{\sqrt{v_t^l} + \epsilon}    (5)

Last, we decouple the weight decay d \cdot w_t from the stochastic gradient similarly to AdamW, but we add it to the normalized gradient before computing the moment m_t^l (one can also use decoupled weight decay in the weight update, as in AdamW; we did not find a significant difference between these two options):

    m_t^l = \beta_1 \cdot m_{t-1}^l + \left(\frac{g_t^l}{\sqrt{v_t^l} + \epsilon} + d \cdot w_t^l\right)    (6)

where 0 < β_1 < 1 is the momentum, typically in the same range as in SGD or Adam [0.9, 0.95]. The first moment can also be computed with an exponential moving average in Adam-like style:

    m_t^l = \beta_1 \cdot m_{t-1}^l + (1 - \beta_1) \cdot \left(\frac{g_t^l}{\sqrt{v_t^l} + \epsilon} + d \cdot w_t^l\right)

We use the following moment initialization to remove bias:

    v_1^l = \|g_1^l\|^2,    m_1^l = \frac{g_1^l}{\|g_1^l\|} + d \cdot w_1^l

Finally, weights are updated the same way as in SGD with momentum:

    w_{t+1} = w_t - \lambda_t \cdot m_t

Similar to Adam, one can construct a counter-example for NovoGrad in the stochastic convex optimization setting (Wilson et al., 2017). However, the "AMS-Grad" fix (Reddi et al., 2018) for Adam can also be applied in this case to make sure that \lambda_t / \sqrt{v_t^l} is monotonically decreasing.

4. Experiments

We trained deep models for four tasks: ResNet-50 (He et al., 2016) for ImageNet image classification, Transformer-big (Vaswani et al., 2017) for WMT 2014 translation, Jasper (Li et al., 2019) for LibriSpeech speech recognition, and Transformer-XL (Dai et al., 2019) for WikiText-103 word-level language modeling, with NovoGrad, SGD with momentum, and Adam/AdamW. Each model was trained on a single DGX-1 with 8 NVIDIA V100 GPUs, with gradient accumulation used for large-batch training. In all the experiments, NovoGrad performed on par with or better than the other algorithms.

4.1. Image Classification

We used the ResNet-50v2 model (He et al., 2016) for the ImageNet classification task (Russakovsky et al., 2015). We trained the model with three optimizers: SGD with momentum (SGD), AdamW, and NovoGrad, using a batch size of 1024 for 100 epochs. We used quadratic LR decay for SGD with momentum and cosine decay (Loshchilov & Hutter, 2016) for AdamW and NovoGrad. We could not find any training recipe for ResNet-50 with AdamW, so we report the best accuracy we achieved after extensive hyper-parameter search.
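For reference, a small sketch of the two learning-rate schedules mentioned above, in a generic parameterization (the exact warmup and initial learning rates used in the experiments are not reproduced here):

    import math

    def quadratic_decay(lr0, step, total_steps):
        # Polynomial decay of power 2: the LR falls from lr0 to 0 over training.
        progress = min(step / total_steps, 1.0)
        return lr0 * (1.0 - progress) ** 2

    def cosine_decay(lr0, step, total_steps):
        # Cosine annealing (Loshchilov & Hutter, 2016): half a cosine from lr0 to 0.
        progress = min(step / total_steps, 1.0)
        return 0.5 * lr0 * (1.0 + math.cos(math.pi * progress))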