KaFiStO: A Kalman Filtering Framework for Stochastic Optimization

Aram Davtyan, Sepehr Sameni, Llukman Cerkezi, Givi Meishvilli, Adam Bielski, Paolo Favaro
Computer Vision Group, University of Bern
{aram.davtyan, sepehr.sameni, llukman.cerkezi, givi.meishvilli, adam.bielski, paolo.favaro}@inf.unibe.ch

Abstract

Optimization is often cast as a deterministic problem, where the solution is found through some iterative procedure such as gradient descent. However, when training neural networks the loss function changes over (iteration) time due to the randomized selection of a subset of the samples. This randomization turns the optimization problem into a stochastic one. We propose to consider the loss as a noisy observation with respect to some reference optimum. This interpretation of the loss allows us to adopt Kalman filtering as an optimizer, as its recursive formulation is designed to estimate unknown parameters from noisy measurements. Moreover, we show that the Kalman Filter dynamical model for the evolution of the unknown parameters can be used to capture the gradient dynamics of advanced methods such as Momentum and Adam. We call this stochastic optimization method KaFiStO. KaFiStO is an easy to implement, scalable, and efficient method to train neural networks. We show that it also yields parameter estimates that are on par with or better than existing optimization algorithms across several neural network architectures and tasks, such as computer vision and language modeling.

1. Introduction

Optimization of functions involving large datasets and high dimensional models finds today large applicability in several data-driven fields in science and industry. Given the growing role of deep learning, in this paper we look at optimization problems arising in the training of neural networks. The training of these models can be cast as the minimization or maximization of a certain objective function with respect to the model parameters. Because of the complexity and computational requirements of the objective function, the data and the models, the common practice is to resort to iterative training procedures, such as gradient descent. Among the iterative methods that emerged as the most effective and computationally efficient is stochastic gradient descent (SGD) [61]. SGD owes its performance gains to the adoption of an approximate version of the objective function at each iteration step, which, in turn, yields an approximate or noisy gradient.

While SGD seems to benefit greatly (e.g., in terms of rate of convergence) from such approximation, it has also been shown that too much noise hurts the performance [83, 7]. This suggests that, to further improve over SGD, one could attempt to model the noise of the objective function. We consider the iteration-time varying loss function used in SGD as a stochastic process obtained by adding the expected risk to zero-mean Gaussian noise. A powerful approach designed to handle estimation with such processes is Kalman filtering [36]. The idea of using Kalman filtering to train neural networks is not new [25]. However, the way to apply it to address this task can vary vastly. Indeed, in our approach, which we call KaFiStO, we introduce a number of novel ideas that result in a practical and effective training algorithm. Firstly, we introduce drastic approximations of the estimated covariance of Kalman's dynamical state so that the corresponding matrix depends on only up to a 2 × 2 matrix of parameters. Secondly, we approximate intermediate Kalman filtering calculations so that more accuracy can be achieved. Thirdly, because of the way we model the objective function, we can also define a schedule for the optimization that behaves similarly to learning rate schedules used in SGD and other iterative methods [37].

We highlight the following contributions: 1) KaFiStO is designed to handle high-dimensional data and models, and large datasets; 2) The tuning of the algorithm is automated, but it is also possible to introduce a learning rate schedule similar to those in existing methods, albeit with a very different interpretation; 3) KaFiStO adapts automatically to the noise in the loss, which might vary depending on the settings of the training (e.g., the minibatch size), and to the variation in the estimated weights over iteration time; 4) It can incorporate iteration-time dynamics of the model parameters, which are analogous to momentum [74]; 5) It is a framework that can be easily extended (we show a few variations of KaFiStO); 6) As shown in our experiments, KaFiStO is on par with state of the art optimizers and can yield better minima in a number of problems ranging from image classification to generative adversarial networks (GAN) and natural language processing (NLP).

2. Prior Work

In this section, we review optimization methods that found application in machine learning, and, in particular, for large scale problems. Most of the progress in the last decades aimed at improving the efficiency and accuracy of the optimization algorithms.

First-Order Methods. First-order methods exploit only the gradient of the objective function. The main advantage of these methods lies in their speed and simplicity. Robbins and Monro [61] introduced the very first stochastic optimization method (SGD) in early 1951. Since then, the SGD method has been thoroughly analyzed and extended [42, 67, 33, 73]. Some considered restarting techniques for optimization purposes [46, 82]. However, a limitation of SGD is that the learning rate must be manually defined and the approximations in the computation of the gradient hurt the performance.

Second-Order Methods. To address the manual tuning of the learning rates in first-order methods and to improve the convergence rate, second-order methods rely on the Hessian matrix. However, this matrix becomes very quickly unmanageable as it grows quadratically with the number of model parameters. Thus, most work reduces the computational complexity by approximating the Hessian with a block-diagonal matrix [21, 6, 39]. A number of methods looked at combining the second-order information in different ways. For example, Roux and Fitzgibbon [63] combined Newton's method and natural gradient. Sohl-Dickstein et al. [70] combined SGD with the second-order curvature information leveraged by quasi-Newton methods. Yao et al. [89] dynamically incorporated the curvature of the loss via adaptive estimates of the Hessian. Henriques et al. [28] proposed a method that does not even require to store the Hessian at all. In contrast with these methods, KaFiStO does not compute second-order derivatives, but focuses instead on modeling noise in the objective function.

Adaptive. An alternative to using second-order derivatives is to design methods that automatically adjust the step-size during the optimization process. The adaptive selection of the update step-size has been based on several principles, including: the local sharpness of the loss function [91], incorporating a line search approach [80, 53, 49], the gradient change speed [16], the Barzilai-Borwein method [76], a "belief" in the current gradient direction [100], the linearization of the loss [62], the per-component unweighted mean of all historical gradients [11], handling noise by preconditioning based on a covariance matrix [35], the adaptive and momental bounds [14], decorrelating the second moment and gradient terms [99], the importance weights [40], the layer-wise adaptation strategy [90], the gradient scale invariance [55], multiple learning rates [66], controlling the increase in effective learning [93], learning the update-step size [87], and looking ahead at the sequence of fast weights generated by another optimizer [98]. Among the widely adopted methods is the work of Duchi et al. [17], who presented a new family of sub-gradient methods called AdaGrad. AdaGrad dynamically incorporates knowledge of the geometry of the data observed in earlier iterations. Tieleman and Hinton [77] introduced RMSProp, further extended by Mukkamala and Hein [52] with logarithmic regret bounds for strongly convex functions. Zeiler [94] proposed a per-dimension learning rate method for gradient descent called AdaDelta. Kingma and Ba [37] introduced Adam, based on adaptive estimates of lower-order moments. A wide range of variations and extensions of the original Adam optimizer has also been proposed [44, 60, 29, 47, 78, 86, 15, 72, 9, 34, 43, 48, 84, 85, 41]. Recent work proposed to decouple the weight decay [23, 18]. Chen et al. [8] introduced a partially adaptive momentum estimation method. Some recent work also focused on the role of gradient clipping [95, 96]. Another line of research focused on reducing the memory overhead for adaptive algorithms [2, 69, 56]. In most prior work, adaptivity comes from the introduction of extra hyper-parameters that also require task-specific tuning. In our case, this property is a direct byproduct of the Kalman filtering framework.

Kalman filtering. The use of Kalman filtering theory and methods for the training of neural networks is not new. Haykin [1] edited a book collecting a wide range of techniques on this topic. More recently, Shashua et al. [68] incorporated Kalman filtering for value approximation in reinforcement learning. Ollivier [54] recovers the exact extended Kalman filter equations from first principles in statistical learning: the extended Kalman filter is equal to Amari's online natural gradient, applied in the space of trajectories of the system. de Vilmarest and Wintenberger [12] apply the extended Kalman filter to linear and logistic regressions. Takenga et al. [75] compared gradient descent to methods based on either Kalman filtering or the decoupled Kalman filter. To summarize, all of these prior Kalman filtering approaches either focus on a specific non-general formulation or face difficulties when scaling to the high-dimensional parameter spaces of large-scale neural models.

3. Modeling Noise in Stochastic Optimization

In machine learning, we are interested in minimizing the expected risk

    min_{x ∈ R^n} L(x),  where  L(x) ≐ E_{ξ∼p(ξ)}[ℓ(ξ; x)],    (1)

with respect to some loss ℓ that is a function of both the data ξ ∈ R^d, with d the data dimensionality, and the model parameters x ∈ R^n (e.g., the weights of a neural network), where n is the number of parameters in the model. We consider the case, which is of common interest today, where both d and n are very large (∼ 10^6). For notational simplicity, we do not distinguish the supervised and unsupervised learning cases by concatenating all data into a single vector ξ (e.g., in the case of image classification we stack in ξ both the input image and the output label). In practice, we have access to only a finite set of samples and thus resolve to optimize the empirical risk

    min_{x ∈ R^n} (1/m) Σ_{i=1}^{m} ℓ(ξ_i; x),    (2)

where ξ_i ∼ p(ξ), for i = 1, ..., m, are our training dataset samples. Because of the nonconvex nature of the loss function with respect to the model parameters, this risk is then optimized via an iterative method such as gradient descent. Since in current datasets m can be very large, the computation of the gradient of the empirical risk at each iteration is too demanding. To address this issue, the stochastic gradient descent (SGD) method [61] minimizes instead the following risk approximation

    E_{ξ∼p(ξ)}[ℓ(ξ; x)] ≈ L_{S_k}(x) ≐ (1/|S_k|) Σ_{i ∈ S_k} ℓ(ξ_i; x),    (3)

where S_k ⊂ [1, ..., m] is a sample set of the dataset indices that changes over iteration time k. SGD then iteratively builds a sequence of parameters {x_k}_{k=0,...,K} by recursively updating the parameters x_k with a step in the opposite direction of the gradient of L_{S_k}, i.e., with some random initialization for x_0 and for k = 1, ..., K

    x_k = x_{k-1} − η ∇L_{S_k}(x_{k-1}),    (4)

where ∇L_{S_k}(x_{k-1}) denotes the gradient of L_{S_k} with respect to x computed at x_{k-1}, and η > 0 is commonly referred to as the learning rate and regulates the speed of convergence.

While this approach is highly efficient, it is also affected by the training set sampling at each iteration. The gradients are computed on the time-varying objectives L_{S_k} and can be seen as noisy versions of the gradient of the expected risk L. Due to the aleatoric nature of this optimization, it is necessary to apply a learning rate decay [5] to achieve convergence. There are also several methods to reduce noise in the gradients, which work on a dynamic sample size or a gradient aggregation strategy [7].

Rather than directly modeling the noise of the gradient, we adopt a different perspective on the minimization of the expected risk. Let us denote with x* the optimal set of model parameters and let us denote with L_min ≐ L(x*) the expected risk at x*. Then, we model

    L_min = L_{S_k}(x) + v_k,    (5)

where x ∼ N(x*, Σ) is a Gaussian random variable with the optimal parameters x* as the mean and covariance Σ, and we associate both the stochasticity of the sampling and of x to the scalar noise variable v_k ∼ N(0, R_k), which we assume to be zero-mean Gaussian with variance R_k. With eq. (5) we implicitly assume that v_k and L_{S_k} are statistically dependent, since L_min is constant. Often, we know the value of L_min up to some approximation. In the next sections we additionally show that it is possible to obtain an online estimate of R_k.

The task now can be posed as that of identifying the parameters x* such that the observations (5) are satisfied. A natural way to tackle the identification of parameters given their noisy observations is to use Kalman filtering. As discussed in the prior work, there is an extensive literature on the application of Kalman filtering as a stochastic gradient descent algorithm. However, these methods differ from our approach in several ways. For instance, Vuckovic [81] uses the gradients as measurements. Thus, this method requires large matrix inversions, which are not scalable to the settings we consider in this paper and that are commonly used in deep learning. As we describe in the next section, we work instead directly with the scalar risks L_{S_k} and introduce a number of computational approximations that make the training with large datasets and high dimensional data feasible with our method.
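For illustration, the following minimal NumPy sketch fixes the notation used above: the minibatch risk of eq. (3) acts as a noisy measurement of the expected risk, and eq. (4) is the plain SGD baseline we later compare against. The quadratic per-sample loss, the synthetic data and all function names are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
x_true = rng.normal(size=5)                        # stands in for the optimum x*
data = x_true + 0.5 * rng.normal(size=(1000, 5))   # samples xi_i ~ p(xi)

def minibatch_risk_and_grad(x, batch):
    # Eq. (3): L_{S_k}(x) = (1/|S_k|) sum_i l(xi_i; x), here with the toy
    # per-sample loss l(xi; x) = 0.5 * |x - xi|^2.
    residual = x - batch                           # shape (|S_k|, n)
    loss = 0.5 * np.mean(np.sum(residual**2, axis=1))
    grad = np.mean(residual, axis=0)               # gradient of L_{S_k} at x
    return loss, grad

def sgd_step(x, batch, lr=0.1):
    # Eq. (4): x_k = x_{k-1} - eta * grad L_{S_k}(x_{k-1}).
    _, grad = minibatch_risk_and_grad(x, batch)
    return x - lr * grad

x = np.zeros(5)
for k in range(200):
    batch = data[rng.choice(len(data), size=32, replace=False)]
    x = sgd_step(x, batch)
print("SGD estimate error:", np.linalg.norm(x - x_true))
```

Note that the minibatch loss values returned above scatter around the (unknown) expected risk exactly as in eq. (5), which is the observation model exploited in the next section.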

3.1. Kalman Filtering for Stochastic Optimization

We assumed that x is a random variable capturing the optimum x* up to some zero-mean Gaussian error (which represents our uncertainty about the parameters). Then, the values of the time-varying loss L_{S_k} at samples of x will be scattered close to L_min (see eq. (5)). Thus, a possible system of equations for a sequence x_k of samples of x is

    x_k = x_{k-1} + w_{k-1}    (6)
    L_min = L_{S_k}(x_k) + v_k.    (7)

Here, w_k ∼ N(0, Q_k) is modeled as a zero-mean Gaussian variable with covariance Q_k. The dynamical model implies that (at x*) the state x_k does not change on average.

The equations (6) and (7) fit very well the equations used in Kalman filtering [36]. For completeness, we briefly recall here the general equations for an Extended Kalman filter

    x_k = f_k(x_{k-1}) + w_{k-1}    (8)
    z_k = h_k(x_k) + v_k,    (9)

where x_k is also called the hidden state, z_k ∈ R^s are the observations, and f_k and h_k are functions that describe the state transition and the measurement dynamics respectively. The Extended Kalman filter infers optimal estimates of the state variables from the previous estimates of x_k and the last observation z_k. Moreover, it also estimates the a posteriori covariance matrix P_k of the state. This is done in two steps: Predict and Update, which we recall in Table 1.

Table 1: Extended Kalman filter recursive equations for a posteriori state and covariance estimation.

    Predict:  x̂_k = f_k(x_{k-1})
              P̂_k = F_k P_{k-1} F_k^T + Q_k
    Update:   S_k = H_k P̂_k H_k^T + R_k
              K = P̂_k H_k^T S_k^{-1}
              x_k = x̂_k + K (z_k − h_k(x̂_k))
              P_k = (I − K H_k) P̂_k
    with:     H_k ≐ ∇h_k(x̂_k),  F_k ≐ ∇f_k(x_{k-1})

If we directly apply the equations in Table 1 to our equations (6) and (7), we would immediately find that the posterior covariance P is an n × n matrix, which would be too large to store and update for the values of n used in practice. Hence, we approximate P as a scaled identity matrix. Since the update equation for the posterior covariance requires the computation of K H_k = P̂_k H_k^T H_k S^{-1}, we need to approximate H_k^T H_k also with a scaled identity matrix. We do this by using its largest eigenvalue, i.e.,

    H_k^T H_k ≈ |H_k|^2 I_{n×n} = |∇_x L_{S_k}(x̂_k)|^2 I_{n×n},    (10)

where I_{n×n} denotes the n × n identity matrix. Because we work with a scalar loss L_{S_k}, the innovation covariance S is a scalar and thus it can be easily inverted. We call this first parameter estimation method the Vanilla Kalman algorithm, and summarize it in Algorithm 1.

Algorithm 1 KaFiStO: Vanilla Kalman algorithm

    Initialize x_0, P_0, Q and R
    for k in range(1, K) do
        Predict:
            x̂_k = x_{k-1}    (11)
            P̂_k = P_{k-1} + Q    (12)
        Sample a minibatch of data S_k
        Update:
            x_k = x̂_k − [P̂_k (L_{S_k}(x̂_k) − L_min) / (P̂_k |∇L_{S_k}(x̂_k)|^2 + R)] ∇L_{S_k}(x̂_k)    (13)
            P_k = P̂_k R / (P̂_k |∇L_{S_k}(x̂_k)|^2 + R)    (14)
    end for
    return x_K

The update in eq. (13) is quite similar to the SGD update (4), where the learning rate η depends on P̂_k, L_{S_k}(x̂_k) − L_min, ∇L_{S_k}(x̂_k) and R. Thus, the learning rate in Vanilla Kalman automatically adapts over time to the current loss value, its gradient and the estimation of the parameters, while in SGD it must be manually tuned.
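To make the scalar bookkeeping of Algorithm 1 concrete, below is a minimal NumPy sketch of one Vanilla Kalman step, eqs. (11)–(14). The posterior covariance is kept as a single scalar P (a scaled identity), Q and R are held constant, L_min = 0 is assumed reachable, and the toy quadratic objective stands in for a network's minibatch risk; none of this is the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
x_true = rng.normal(size=5)
data = x_true + 0.5 * rng.normal(size=(1000, 5))

def minibatch_risk_and_grad(x, batch):
    residual = x - batch
    return 0.5 * np.mean(np.sum(residual**2, axis=1)), np.mean(residual, axis=0)

def kafisto_vanilla_step(x, P, batch, L_min=0.0, Q=1e-4, R=0.1):
    # Predict (eqs. (11)-(12)): the state is assumed constant on average.
    x_hat, P_hat = x, P + Q
    # Update (eqs. (13)-(14)) with the scalar innovation covariance
    # S = P_hat * |grad L|^2 + R.
    loss, grad = minibatch_risk_and_grad(x_hat, batch)
    S = P_hat * float(grad @ grad) + R
    x_new = x_hat - (P_hat * (loss - L_min) / S) * grad
    P_new = P_hat * R / S
    return x_new, P_new

x, P = np.zeros(5), 0.1
for k in range(200):
    batch = data[rng.choice(len(data), size=32, replace=False)]
    x, P = kafisto_vanilla_step(x, P, batch)
print("KaFiStO estimate error:", np.linalg.norm(x - x_true))
```

The factor multiplying the gradient plays the role of SGD's learning rate, but, as the text above points out, it is determined by P̂_k, the loss gap to L_min, the gradient magnitude and R rather than being hand-tuned.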

3.2. Incorporating Momentum Dynamics

The framework introduced so far, which we call KaFiStO (as a shorthand notation for Kalman Filtering for Stochastic Optimization), is very flexible and allows several extensions. A first important change we introduce is the incorporation of Momentum [74]. Within our notation, this method could be written as

    p_k = α p_{k-1} − η ∇L(x_{k-1})    (15)
    x_k = x_{k-1} + p_k,    (16)

where p_k are the so-called momentums or velocities, which accumulate the gradients from the past. The parameter α ∈ (0, 1), commonly referred to as the momentum rate, controls the trade-off between current and past gradients. Such updates claim to stabilize the training and prevent the parameters from getting stuck at local minima.

To incorporate the idea of Momentum within the KaFiStO framework, one can simply introduce the state velocities and define the following dynamics

    x_k = x_{k-1} + p_{k-1} + w_{k-1}    (17)
    p_k = α p_{k-1} + u_{k-1}    (18)
    L_min = L_{S_k}(x_k) + v_k,    (19)

where p_k ∈ R^n and u_{k-1} is a zero-centered Gaussian random variable. One can rewrite these equations again as Kalman filter equations by combining the parameters x_k and the velocities p_k into one state vector x̄_k = [x_k, p_k] and similarly for the state noise ζ_{k-1} = [w_{k-1}, u_{k-1}]. This results in the following dynamical system

    x̄_k = F x̄_{k-1} + ζ_{k-1}    (20)
    L_min = L_{S_k}(T x̄_k) + v_k,    (21)

where F = [[I_{n×n}, I_{n×n}], [0, α I_{n×n}]] and T = [I_{n×n}, 0]. Similarly to the Vanilla Kalman algorithm, we also aim to drastically reduce the dimensionality of the posterior covariance, which now is a 2n × 2n matrix. We approximate P with the form [[σ_x^2 I_{n×n}, σ_c^2 I_{n×n}], [σ_c^2 I_{n×n}, σ_p^2 I_{n×n}]], where σ_x^2, σ_c^2, σ_p^2 are scalars. In this formulation we have that H_k = [∇L_{S_k}^T, 0^T] and thus our approximation for the Kalman update of the posterior covariance will use

    H_k^T H_k ≈ [[|∇L_{S_k}|^2 I_{n×n}, 0], [0, 0]].    (22)

The remaining equations follow directly from the application of Table 1. We call this method the Kalman Dynamics algorithm.

3.3. Estimation of the Measurement and State Noise

In the KaFiStO framework we model the noise in the observations and the state transitions with zero-mean Gaussian variables with covariances R_k and Q_k respectively. So far, we assumed that these covariances were constant. However, they can also be estimated online, which leads to more accurate state and posterior covariance estimates. For R_k we use the following running average

    R_k = β_R R_{k-1} + (1 − β_R) (1/|S_k|) Σ_{i ∈ S_k} ℓ(ξ_i; x_k)^2,    (23)

where we set β_R = 0.9. Similarly, for the covariance Q_k ≐ [[q_x^2 I_{n×n}, 0], [0, q_p^2 I_{n×n}]], the online update for the component q_x^2 is

    x_avg = β_x x_avg + (1 − β_x) x_{k-1}    (24)
    q_x^2 = (1/n) |x_{k-1} − x_avg|^2,    (25)

where we set β_x = 0.9. This adaptivity of the noise helps both to reduce the number of hyper-parameters and to stabilize the training and convergence.

3.4. Learning Rate Scheduling

In both the Vanilla Kalman and the Kalman Dynamics algorithms, the update equation for the state estimate needs L_min (see e.g., eq. (13)). This term can in many cases be set to 0, when we believe that this value can be achieved (e.g., in some image classification problems). Also, we have the option to change L_min progressively with the iteration time k. For instance, we could set L_min = (1 − ε) L_{S_k}(x_k). By substituting this term in eq. (13) we obtain a learning rate that is ε times the learning rate with L_min = 0. By varying ε over the iteration time, we thus can define a learning rate schedule as in current SGD implementations [20, 46]. Notice, however, the very different interpretation of the schedule in the case of KaFiStO, where we are gradually decreasing the target expected risk.

3.5. Layer-wise Approximations

Let us consider the optimization problem specifically for large neural networks. Let us denote with B the number of layers in a network. Next, we consider the observations to be B-dimensional vectors. The j-th entry in this observation vector is obtained by considering that only the parameters of the j-th layer are varying. Under these assumptions, the update equation (13) for both the Vanilla Kalman and the Kalman Dynamics algorithms will split into B layer-wise equations, where each separate equation incorporates only the gradients with respect to the parameters of a specific layer. In addition, the matrix H_k^T H_k S^{-1} now also yields B separate blocks (one per observation), each of which gets approximated by the corresponding largest block eigenvalue. Finally, the maximum of these approximations gives us the approximation of the whole matrix H_k^T H_k S^{-1}, that is

    H_k^T H_k S^{-1} ≈ max_{1≤i≤B} |∇_{b_i} L_{S_k}|^2 / S^(i),    (26)

where b_i is the subset of parameters corresponding to the i-th layer and S^(i) is the innovation covariance corresponding to only the i-th measurement. We observe that this procedure induces additional stability in training.
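The online noise estimates of eqs. (23)–(25) are simple exponential running averages. The sketch below shows one possible bookkeeping for them; the class name, the initial value of R and the way the caller supplies per-sample losses and the previous iterate are our own conventions, not the paper's code.

```python
import numpy as np

class NoiseEstimator:
    """Running-average estimates of R_k (eq. (23)) and q_x^2 (eqs. (24)-(25))."""

    def __init__(self, n, beta_R=0.9, beta_x=0.9):
        self.beta_R, self.beta_x = beta_R, beta_x
        self.R = 1.0                 # measurement noise variance R_k (init is arbitrary)
        self.x_avg = np.zeros(n)     # running average of the iterates
        self.q_x2 = 0.0              # state-noise component q_x^2

    def update_R(self, per_sample_losses):
        # Eq. (23): R_k = beta_R * R_{k-1} + (1 - beta_R) * mean_i l(xi_i; x_k)^2.
        self.R = self.beta_R * self.R + (1 - self.beta_R) * np.mean(
            np.square(per_sample_losses))
        return self.R

    def update_Q(self, x_prev):
        # Eqs. (24)-(25): track a running mean of the iterates and use the
        # average squared deviation per coordinate as q_x^2.
        self.x_avg = self.beta_x * self.x_avg + (1 - self.beta_x) * x_prev
        self.q_x2 = float(np.mean(np.square(x_prev - self.x_avg)))
        return self.q_x2
```

Both updates cost only one pass over the minibatch losses and the parameter vector, so the adaptivity adds negligible overhead to a training step.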

4. Ablations

In this section we ablate the following features and parameters of both the Vanilla Kalman and Kalman Dynamics algorithms: the dynamics of the weights and velocities, the initialization of the posterior covariance matrix and the adaptivity of the measurement and state noise estimators. In some ablations we also separately test the Kalman Dynamics algorithm with adaptive Q_k, since it usually gives a large boost to performance. Furthermore, we show that our algorithm is relatively insensitive to different batch sizes and weight initialization techniques.

We evaluate our optimization methods by computing the test performance achieved by the model obtained with the estimated parameters. Although such performance may not uniquely correlate to the performance of our method, as it might be affected also by the data, model and regularization, it is a useful indicator. In all the ablations, we choose the classification task on CIFAR-100 [38] with ResNet18 [27]. We train all the models for 100 epochs and decrease the learning rate by a factor of 0.1 every 30 epochs. For the last two ablations and in the Experiments section, we use the Kalman Dynamics algorithm with α = 0.9, adaptive R_k and Q_k, and initial posterior covariance parameters σ_{x,0}^2 = σ_{p,0}^2 = 0.1 and σ_{c,0}^2 = 0. We refer to this configuration as KaFiStO and have no need to tune it further.

Impact of the state dynamics. We compare the Vanilla Kalman algorithm (i.e., constant dynamics) to the Kalman Dynamics (i.e., with velocities). Additionally, we ablate α, i.e., the decay rate of the velocities. The results are shown in Table 2. We observe that the use of velocities with a calibrated momentum rate has a positive impact on the estimated parameters, and that the adaptive state noise estimation provides a further substantial gain.

Table 2: Ablation of the α decay factor for the velocities in the underlying state dynamics.

    KaFiStO Variant       | α    | Top-1 Error | Top-5 Error
    Vanilla               | -    | 34.16       | 12.02
    Dynamics              | 0.50 | 33.52       | 11.34
    Dynamics              | 0.90 | 33.63       | 12.13
    Dynamics              | 0.99 | diverge     | diverge
    Dynamics (adapt. Q_k) | 0.50 | 28.25       | 8.91
    Dynamics (adapt. Q_k) | 0.90 | 23.39       | 6.50
    Dynamics (adapt. Q_k) | 0.99 | 33.63       | 9.82

Posterior covariance initialization. The KaFiStO framework requires to initialize the matrix P_0. In the case of the Vanilla Kalman algorithm, we approximate the posterior covariance with a scaled identity matrix, i.e., P_k = σ_k^2 I_{n×n}, where σ_k ∈ R. In the case of Kalman Dynamics, we approximate P_k with a 2 × 2 block diagonal matrix, and we initialize it with

    P_0 = [[σ_{x,0}^2 I_{n×n}, 0], [0, σ_{p,0}^2 I_{n×n}]],    (27)

where σ_{x,0}, σ_{p,0} ∈ R. In this section we ablate σ_{x,0}, σ_{p,0} and σ_0 to show that the method quickly adapts to the observations and the initialization of P_k does not have a significant impact on the final accuracy achieved with the estimated parameters. The results are given in Table 3 and in Figure 1.

Table 3: Ablation of the initialization of the posterior covariance matrix.

    Parameter | Value | KaFiStO Variant       | Top-1 Error | Top-5 Error
    σ_0       | 0.01  | Vanilla               | 33.97       | 12.83
              | 0.10  | Vanilla               | 34.16       | 12.02
              | 1.00  | Vanilla               | 33.52       | 12.03
    σ_x^2     | 0.01  | Dynamics              | 33.42       | 12.28
              | 0.10  | Dynamics              | 33.63       | 12.13
              | 1.00  | Dynamics              | 33.16       | 12.16
              | 0.01  | Dynamics (adapt. Q_k) | 23.67       | 6.81
              | 0.10  | Dynamics (adapt. Q_k) | 23.39       | 6.50
              | 1.00  | Dynamics (adapt. Q_k) | 23.82       | 6.53
    σ_p^2     | 0.01  | Dynamics              | 33.28       | 11.88
              | 0.10  | Dynamics              | 33.63       | 12.13
              | 1.00  | Dynamics              | 34.73       | 13.36
              | 0.01  | Dynamics (adapt. Q_k) | 23.37       | 7.13
              | 0.10  | Dynamics (adapt. Q_k) | 23.39       | 6.50
              | 1.00  | Dynamics (adapt. Q_k) | 24.24       | 7.40

Figure 1: Behavior of σ_x^2 over time depending on the values we initialize it with. It can be seen that the parameter adapts quite quickly and after some time the curves with different initializations do not differ much.

Noise adaptivity. We compare the performance obtained with a fixed measurement variance R_k to the one with an online estimate based on the k-th minibatch. Similarly, we ablate the adaptivity of the process noise Q_k. The results are shown in Table 4. We observe that the adaptivity of R_k is essential for the model to converge and that the adaptivity of Q_k helps to further improve the performance of the trained model. Moreover, with adaptive noises there is no need to set some initial values for them, which reduces the number of hyper-parameters to tune.

Table 4: Ablation of the measurement R_k and process Q_k noises. Classification error on CIFAR-100 with ResNet18.

    KaFiStO Variant | R_k      | Q_k      | Top-1 Error | Top-5 Error
    Vanilla         | adaptive | -        | 34.16       | 12.02
    Vanilla         | constant | -        | diverge     | diverge
    Dynamics        | adaptive | constant | 33.63       | 12.13
    Dynamics        | constant | constant | diverge     | diverge
    Dynamics        | adaptive | adaptive | 23.39       | 6.50
    Dynamics        | constant | adaptive | diverge     | diverge

Batch size. Usually one needs to adapt the learning rate to the chosen minibatch size. In this experiment, we change the batch size in the range [32, 64, 128, 256] and show that KaFiStO adapts to it naturally. Table 5 shows that the accuracy of the model does not vary significantly with a varying batch size, which is a sign of stability.

Table 5: Ablation of the batch size used for training. Classification error on CIFAR-100 with ResNet18.

    Batch Size | Top-1 Error | Top-5 Error
    32         | 24.59       | 7.13
    64         | 23.11       | 6.93
    128        | 23.39       | 6.50
    256        | 24.34       | 7.59
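For reference, the sketch below collects the default configuration that the ablations above settle on (α = 0.9, adaptive R_k and Q_k, σ_{x,0}^2 = σ_{p,0}^2 = 0.1, σ_{c,0}^2 = 0) and illustrates the bookkeeping behind eq. (27): only three scalars are ever stored, never the 2n × 2n matrix. The dict layout and function names are our own convention, and the dense matrix is built only for a toy n.

```python
import numpy as np

DEFAULT_KAFISTO_CONFIG = {
    "alpha": 0.9,         # velocity decay rate (Sec. 3.2)
    "adaptive_R": True,   # eq. (23)
    "adaptive_Q": True,   # eqs. (24)-(25)
    "sigma_x2_0": 0.1,    # eq. (27), top-left block scale
    "sigma_p2_0": 0.1,    # eq. (27), bottom-right block scale
    "sigma_c2_0": 0.0,    # cross-covariance scale
}

def dense_P0(cfg, n):
    # Eq. (27) written out densely, for illustration only: with n ~ 10^6 this
    # matrix would be far too large, which is why only the scalars are kept.
    I = np.eye(n)
    top = np.hstack([cfg["sigma_x2_0"] * I, cfg["sigma_c2_0"] * I])
    bottom = np.hstack([cfg["sigma_c2_0"] * I, cfg["sigma_p2_0"] * I])
    return np.vstack([top, bottom])

print(dense_P0(DEFAULT_KAFISTO_CONFIG, n=3).shape)  # (6, 6) for a toy n = 3
```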

Weight initialization. Similarly to the batch size, here we use different initialization techniques to show that the algorithm is robust to them. We apply the same initializations to SGD for comparison. We test Kaiming Uniform [26], Orthogonal [65], Xavier Normal [19] and Xavier Uniform [19]. The results are shown in Table 6.

Table 6: Ablation of different weight initializations. Classification error on CIFAR-100 with ResNet18.

    Initialization  | Optimizer | Top-1 Error | Top-5 Error
    Xavier-Normal   | SGD       | 26.71       | 7.59
                    | KaFiStO   | 23.34       | 6.78
    Xavier-Uniform  | SGD       | 26.90       | 7.97
                    | KaFiStO   | 23.40       | 6.85
    Kaiming-Uniform | SGD       | 27.82       | 7.95
                    | KaFiStO   | 23.35       | 6.76
    Orthogonal      | SGD       | 26.83       | 7.59
                    | KaFiStO   | 23.27       | 6.63

Figure 2: Training curves for Wide ResNet-50-2 trained on CIFAR-100 for 200 epochs. Top left: Training loss. Top right: Validation loss during the training. Bottom left: Top-1 error on validation during training. Bottom right: The evolution of KaFiStO's "learning rate", i.e., the value scaling the gradient in the update step (see eq. (13)).

5. Experiments

In order to assess the efficiency of KaFiStO, we evaluate it on different tasks, including image classification (on CIFAR-10, CIFAR-100 and ImageNet [64]), generative learning and language modeling. For all these tasks, we report the quality metrics on the validation sets to compare KaFiStO to the optimizers commonly used in the training of existing models. We find that KaFiStO outperforms or is on par with the existing methods, while requiring fewer hyper-parameters to tune.

5.1. Classification

CIFAR-10/100. We first evaluate KaFiStO on CIFAR-10 and CIFAR-100 using the popular ResNets [27] and WideResNets [92] for training. We compare our results with the ones obtained with commonly used existing optimization algorithms, such as SGD with Momentum and Adam. For SGD we set the momentum rate to 0.9, which is the default for many popular networks, and for Adam we use the default parameters β_1 = 0.9, β_2 = 0.999, ε = 10^{-8}. In all experiments on CIFARs, we use a batch size of 128 and basic data augmentation (random horizontal flipping and random cropping with padding by 4 pixels). For each configuration we have two runs, for 100 and 200 epochs respectively. For SGD we start with a learning rate equal to 0.1, for Adam 0.0003 and for KaFiStO 1.0. For the 100-epoch runs we decrease the learning rate by a factor of 0.1 every 30 epochs. For 200 epochs on CIFAR-10 we decrease the learning rate only once, at epoch 150, by the same factor. For the 200-epoch training on CIFAR-100 the learning rate is decreased by a factor of 0.2 at epochs 60, 120 and 160. For all the algorithms, we additionally use a weight decay of 0.0005.
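The step-decay schedules just described can be summarized in a small helper; the function name, its use as a multiplicative factor on the base learning rates, and the run labels are our own convention rather than code from the paper. For KaFiStO the same factor is interpreted through the ε schedule on L_min of Section 3.4, not as a conventional learning rate.

```python
# Illustrative reproduction of the step-decay schedules described above.
def decay_factor(epoch, run="cifar100-200ep"):
    if run == "100ep":                 # all 100-epoch runs: x0.1 every 30 epochs
        return 0.1 ** (epoch // 30)
    if run == "cifar10-200ep":         # single drop by 0.1 at epoch 150
        return 0.1 if epoch >= 150 else 1.0
    if run == "cifar100-200ep":        # x0.2 at epochs 60, 120 and 160
        return 0.2 ** sum(epoch >= m for m in (60, 120, 160))
    raise ValueError(run)

# Example: base values of 0.1 (SGD), 3e-4 (Adam) and 1.0 (KaFiStO)
# scaled at epoch 130 of the 200-epoch CIFAR-100 run.
print([base * decay_factor(130) for base in (0.1, 3e-4, 1.0)])
```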

To show the benefit of using KaFiStO for training on classification tasks, we report the Top-1 and Top-5 errors on the validation set. For both the 100-epoch and 200-epoch configurations, we report the mean error among 3 runs with 3 different random seeds. The results are reported in Table 7. Figure 2 shows the behavior of the training loss, the validation loss and the Top-1 error on the validation set, as well as the adaptive evolution of KaFiStO's "learning rate", i.e., the step size that scales the gradient in the update eq. (13).

Table 7: Results on CIFAR-10, CIFAR-100 and ImageNet-32 datasets for 100- and 200-epoch runs (error on the validation set).

    Dataset     | Architecture  | Method  | Top-1 (100 ep) | Top-5 (100 ep) | Top-1 (200 ep) | Top-5 (200 ep)
    CIFAR-10    | ResNet-18     | SGD     | 5.60  | 0.16  | 7.53  | 0.29
                |               | Adam    | 6.58  | 0.28  | 6.46  | 0.28
                |               | KaFiStO | 5.69  | 0.21  | 5.46  | 0.25
                | ResNet-50     | SGD     | 6.37  | 0.19  | 8.10  | 0.27
                |               | Adam    | 6.28  | 0.24  | 5.97  | 0.28
                |               | KaFiStO | 7.29  | 0.24  | 6.31  | 0.13
                | W-ResNet-50-2 | SGD     | 6.08  | 0.15  | 7.60  | 0.24
                |               | Adam    | 6.02  | 0.19  | 5.90  | 0.26
                |               | KaFiStO | 6.83  | 0.19  | 5.36  | 0.12
    CIFAR-100   | ResNet-18     | SGD     | 23.50 | 6.48  | 22.44 | 5.99
                |               | Adam    | 26.30 | 7.85  | 25.61 | 7.74
                |               | KaFiStO | 23.38 | 6.70  | 22.22 | 6.13
                | ResNet-50     | SGD     | 25.05 | 6.74  | 22.06 | 5.71
                |               | Adam    | 24.95 | 6.96  | 24.44 | 6.81
                |               | KaFiStO | 22.34 | 5.96  | 21.03 | 5.33
                | W-ResNet-50-2 | SGD     | 23.83 | 6.35  | 22.47 | 5.96
                |               | Adam    | 23.73 | 6.64  | 24.04 | 7.06
                |               | KaFiStO | 21.25 | 5.35  | 20.73 | 5.08
    ImageNet-32 | ResNet-50     | SGD     | 34.07 | 13.38 | -     | -
                |               | KaFiStO | 34.99 | 14.06 | -     | -

ImageNet. Following [47], we train a ResNet50 [27] on 32 × 32 downscaled images with the most common settings: 100 epochs of training with a learning rate decrease of 0.1 after every 30 epochs and a weight decay of 0.0001. We use random cropping and random horizontal flipping during training and we report the validation accuracy on single center crop images. As shown in Table 7, our model achieves a comparable accuracy to SGD, but without any task-specific hyper-parameter tuning.

5.2. Generative Adversarial Networks Training

Generative Adversarial Networks (GAN) [22] are generative models trained to generate new samples from a given data distribution. A GAN consists of two networks, a generator and a discriminator, which are trained in an adversarial manner. The training alternates between these networks in a minimax game, which tends to be difficult to train. Algorithms like SGD struggle to find a good solution, and a common practice for training GANs is to use adaptive methods like Adam or RMSProp. Thus, a good performance on the training of GANs is a good indicator of stability and of the ability to handle complex loss functions.

Following [100], we test our method with one of the most popular models, Wasserstein GAN [3] with gradient penalty (WGAN-GP) [24]. The objectives for both the generator and the discriminator in WGAN-GP are unbounded from below, which makes it difficult to apply our model directly. Indeed, our algorithm works under the assumption that the expected risk at the optimum is some given finite value. However, we can control the measurement equations in KaFiStO by adjusting L_min, as was done to obtain learning rate schedules. The simplest way to deal with unbounded losses is to set L_min below the current estimate of the loss L_{S_k}(x_k). That is, for a given minibatch S_k the target should be equal to L_min = L_{S_k}(x_k) − u_k, for some constant u_k > 0, which we set before the training. In our experiments, we fix u_k = 4. We also set α = 0.5, similarly to the common choice of β_1 = 0.5 for Adam in GAN training.

We use a DCGAN [59] architecture with the WGAN-GP loss. For each optimizer, we train the model for 100 epochs on CIFAR-10 and evaluate the FID score [30], which captures the quality and diversity of the generated samples. We report the mean and standard deviation among 4 runs with different random seeds. Usually GANs are trained in an alternating way, that is, the generator is updated every n_critic iterations. This should make the generator compete with a stronger discriminator and achieve convergence. We test KaFiStO on two settings: n_critic = 1, 5. On both settings KaFiStO outperforms Adam in terms of FID score and is more stable. The results are shown in Table 8 and images sampled from the trained generator are reported in the Supplementary material.

Table 8: FID (mean ± std, lower is better) for different optimizers in GAN training on CIFAR-10.

    Optimizer | FID ↓ (n_critic = 1) | FID ↓ (n_critic = 5)
    Adam      | 78.21 ± 15.91        | 79.05 ± 7.89
    KaFiStO   | 67.17 ± 14.93        | 75.65 ± 5.38
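The moving target used for unbounded GAN objectives amounts to one extra line before the usual update. The sketch below shows this with the vanilla update of eq. (13) for simplicity; the function and variable names, the constant Q and R values, and the use of NumPy arrays are our own illustrative assumptions.

```python
import numpy as np

def kafisto_gan_target(current_loss, u_k=4.0):
    # L_min = L_{S_k}(x_k) - u_k, recomputed at every iteration.
    return current_loss - u_k

def kafisto_step_unbounded(x, P, loss, grad, u_k=4.0, Q=1e-4, R=0.1):
    # x, grad: NumPy parameter and gradient vectors; loss: scalar minibatch loss.
    P_hat = P + Q
    L_min = kafisto_gan_target(loss, u_k)             # moving target below the loss
    S = P_hat * float(grad @ grad) + R
    x_new = x - (P_hat * (loss - L_min) / S) * grad   # note: loss - L_min == u_k
    P_new = P_hat * R / S
    return x_new, P_new

# Toy usage with an arbitrary loss value and gradient.
x, P = np.zeros(4), 0.1
x, P = kafisto_step_unbounded(x, P, loss=7.3, grad=np.ones(4) * 0.2)
```

Because loss − L_min always equals u_k, the target never demands more improvement than the fixed margin, which keeps the update bounded even when the WGAN-GP objectives themselves are not.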
5.3. Language modeling

Given the recent success of transfer learning in NLP with pretrained language models [57, 88, 13, 45, 10, 32], we trained both an LSTM [31] and a Transformer [79] for language modeling on the Penn TreeBank dataset [50] and WikiText-2 [51]. We use the default data splits for training and validation and report the perplexity (lower is better) on the test set in Table 9. We used a two-layer LSTM with 200 hidden neurons and an input embedding size of 200 for our tiny-LSTM (the same settings as Bernstein et al. [4]) and increased the sizes to 1500 for the larger-LSTM experiment (the same as Zhang et al. [97]). The learning rates for Adam and SGD were picked based on a grid search and we used the default learning rate of 1.0 for our optimizer. In order to prevent overfitting we used an aggressive dropout [71] rate of 0.65 and tied the input and output embeddings [58], which is a common practice in NLP. Since we are using small datasets, we use only a two-layer masked multi-head self-attention transformer with two heads, which performs worse than the LSTM. We find that even in these settings, KaFiStO is on par with the other optimizers.

Table 9: Language modeling perplexity (lower is better) on different datasets, architectures and optimizers.

    Model            | Optimizer             | PTB ppl | WikiText-2 ppl
    tiny-LSTM        | SGD                   | 207.7   | 216.62
                     | Adam                  | 123.6   | 166.49
                     | Dynamics (adapt. Q_k) | 110.1   | 124.69
    larger-LSTM      | SGD                   | 112.19  | 145.7
                     | Adam                  | 81.27   | 94.39
                     | Dynamics (adapt. Q_k) | 81.57   | 89.64
    tiny-Transformer | SGD                   | 134.41  | 169.04
                     | Adam                  | 140.13  | 189.66
                     | Dynamics (adapt. Q_k) | 129.45  | 179.81

6. Conclusions

We have introduced KaFiStO, a novel Kalman filtering-based approach to stochastic optimization. KaFiStO is suitable to train modern neural network models on current large scale datasets with high-dimensional data. The method can self-tune and is quite robust to a wide range of training settings. Moreover, we design KaFiStO so that it can incorporate optimization dynamics such as those in Momentum and Adam, as well as learning rate schedules. The efficacy of this method is demonstrated on several experiments in image classification, image generation and language processing.

References

[1] Kalman filtering and neural networks. Adaptive and learning systems for signal processing, communications, and control. Wiley, New York, 2001.
[2] Rohan Anil, Vineet Gupta, Tomer Koren, and Yoram Singer. Memory efficient adaptive optimization. In Advances in Neural Information Processing Systems, 2019.
[3] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning. PMLR, 2017.
[4] Jeremy Bernstein, Arash Vahdat, Yisong Yue, and Ming-Yu Liu. On the distance between two neural networks and the stability of learning. In Advances in Neural Information Processing Systems, 2020.
[5] Dimitri P. Bertsekas and John N. Tsitsiklis. Gradient convergence in gradient methods with errors. SIAM Journal on Optimization, (3), 2000.
[6] Aleksandar Botev, Hippolyt Ritter, and David Barber. Practical Gauss-Newton optimisation for deep learning. In ICML, 2017.
[7] Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, (2), 2018.
[8] Jinghui Chen, Dongruo Zhou, Yiqi Tang, Ziyan Yang, Yuan Cao, and Quanquan Gu. Closing the generalization gap of adaptive gradient methods in training deep neural networks. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, July 2020. Main track.
[9] Xiangyi Chen, Sijia Liu, Ruoyu Sun, and Mingyi Hong. On the convergence of a class of Adam-type algorithms for non-convex optimization. In International Conference on Learning Representations, 2019.
[10] Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. ELECTRA: Pre-training text encoders as discriminators rather than generators. ArXiv, abs/2003.10555, 2020.
[11] Brett Daley and Christopher Amato. Expectigrad: Fast stochastic optimization with robust convergence properties, 2021.
[12] Joseph de Vilmarest and Olivier Wintenberger. Stochastic online optimization using Kalman recursion, 2020.
[13] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, June 2019.
[14] Jianbang Ding, Xuancheng Ren, Ruixuan Luo, and Xu Sun. An adaptive and momental bound method for stochastic learning. arXiv preprint arXiv:1910.12249, 2019.
[15] Timothy Dozat. Incorporating Nesterov momentum into Adam. 2016.
[16] S. R. Dubey, S. Chakraborty, S. K. Roy, S. Mukherjee, S. K. Singh, and B. B. Chaudhuri. diffGrad: An optimization method for convolutional neural networks. IEEE Transactions on Neural Networks and Learning Systems, (11), 2020.
[17] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, (61), 2011.
[18] Boris Ginsburg, Patrice Castonguay, Oleksii Hrinchuk, Oleksii Kuchaiev, Vitaly Lavrukhin, Ryan Leary, Jason Li, Huyen Nguyen, Yang Zhang, and Jonathan M. Cohen. Stochastic gradient methods with layer-wise adaptive moments for training of deep networks, 2020.
[19] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS'10). Society for Artificial Intelligence and Statistics, 2010.
[20] J. L. Goffin. On convergence rates of subgradient optimization methods. Mathematical Programming, 13:329–347, 1977.
[21] Donald Goldfarb, Yi Ren, and Achraf Bahamou. Practical quasi-Newton methods for training deep neural networks. In Advances in Neural Information Processing Systems, 2020.
[22] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 2014.
[23] Diego Granziol, Xingchen Wan, Samuel Albanie, and Stephen Roberts. Beyond SGD: Iterate averaged adaptive gradient method, 2021.
[24] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, 2017.
[25] Simon Haykin. Kalman filtering and neural networks. John Wiley & Sons, 2004.
[26] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), ICCV '15, 2015.
[27] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[28] Joao F. Henriques, Sebastien Ehrhardt, Samuel Albanie, and Andrea Vedaldi. Small steps and giant leaps: Minimal Newton solvers for deep learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
[29] Byeongho Heo, Sanghyuk Chun, Seong Joon Oh, Dongyoon Han, Sangdoo Yun, Gyuwan Kim, Youngjung Uh, and Jung-Woo Ha. AdamP: Slowing down the slowdown for momentum optimizers on scale-invariant weights. In International Conference on Learning Representations (ICLR), 2021.

[30] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, 2017.
[31] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 1997.
[32] J. Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. In ACL, 2018.
[33] Yifan Hu, Siqi Zhang, Xin Chen, and Niao He. Biased stochastic first-order methods for conditional stochastic optimization and applications in meta learning. In Advances in Neural Information Processing Systems, 2020.
[34] Haiwen Huang, Chang Wang, and Bin Dong. Nostalgic Adam: Weighting more of the past gradients when designing the adaptive learning rate. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, July 2019.
[35] Yasutoshi Ida, Yasuhiro Fujiwara, and Sotetsu Iwamura. Adaptive learning rate via covariance matrix based preconditioning for deep neural networks. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, 2017.
[36] Rudolph Emil Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME – Journal of Basic Engineering (Series D), 1960.
[37] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR (Poster), 2015.
[38] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.
[39] Yann A. LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. Efficient BackProp. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012.
[40] Kfir Y. Levy, Alp Yurtsever, and Volkan Cevher. Online adaptive methods, universality and acceleration. In Advances in Neural Information Processing Systems, 2018.
[41] Wenjie Li, Zhaoyang Zhang, Xinjiang Wang, and Ping Luo. AdaX: Adaptive gradient descent with exponential long term memory, 2020.
[42] Zhize Li, Hongyan Bao, Xiangliang Zhang, and Peter Richtárik. PAGE: A simple and optimal probabilistic gradient estimator for nonconvex optimization, 2020.
[43] Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. On the variance of the adaptive learning rate and beyond. In International Conference on Learning Representations, 2020.
[44] Rui Liu, Tianyi Wu, and Barzan Mozafari. Adam with bandit sampling for deep learning. In Advances in Neural Information Processing Systems, 2020.
[45] Y. Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, M. Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. ArXiv, abs/1907.11692, 2019.
[46] I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations (ICLR) 2017 Conference Track, April 2017.
[47] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
[48] Liangchen Luo, Yuanhao Xiong, and Yan Liu. Adaptive gradient methods with dynamic bound of learning rate. In International Conference on Learning Representations, 2019.
[49] Maren Mahsereci and Philipp Hennig. Probabilistic line searches for stochastic optimization. In Advances in Neural Information Processing Systems, 2015.
[50] Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 1993.
[51] Stephen Merity, Caiming Xiong, James Bradbury, and R. Socher. Pointer sentinel mixture models. ArXiv, abs/1609.07843, 2017.
[52] Mahesh Chandra Mukkamala and Matthias Hein. Variants of RMSProp and Adagrad with logarithmic regret bounds. In Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research, International Convention Centre, Sydney, Australia, 06–11 Aug 2017.
[53] Maximus Mutschler and Andreas Zell. Parabolic approximation line search for DNNs. In Advances in Neural Information Processing Systems, 2020.
[54] Yann Ollivier. The extended Kalman filter is a natural gradient descent in trajectory space, 2019.
[55] Francesco Orabona and Dávid Pál. Scale-free algorithms for online linear optimization. In Algorithmic Learning Theory, Cham, 2015.
[56] Antonio Orvieto, Jonas Kohler, and Aurelien Lucchi. The role of memory in stochastic optimization. In Proceedings of The 35th Uncertainty in Artificial Intelligence Conference, Proceedings of Machine Learning Research, Tel Aviv, Israel, 22–25 Jul 2020.
[57] Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, June 2018.
[58] Ofir Press and Lior Wolf. Using the output embedding to improve language models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 157–163, Valencia, Spain, April 2017.
[59] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[60] Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018.
[61] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, (3), 1951.
[62] Michal Rolinek and Georg Martius. L4: Practical loss-based stepsize adaptation for deep learning. In Advances in Neural Information Processing Systems, 2018.
[63] Nicolas Le Roux and Andrew W. Fitzgibbon. A fast natural Newton method. In ICML, 2010.
[64] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. Int. J. Comput. Vision, 115(3), 2015.
[65] Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In 2nd International Conference on Learning Representations, ICLR 2014, 2014.
[66] Tom Schaul, Sixin Zhang, and Yann LeCun. No more pesky learning rates. In ICML (3), 2013.
[67] Fanhua Shang, Kaiwen Zhou, Hongying Liu, James Cheng, Ivor W. Tsang, Lijun Zhang, Dacheng Tao, and Licheng Jiao. VR-SGD: A simple stochastic variance reduction method for machine learning, 2018.
[68] Shirli Di-Castro Shashua and Shie Mannor. Trust region value optimization using Kalman filtering, 2019.
[69] Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. In ICML, 2018.
[70] Jascha Sohl-Dickstein, Ben Poole, and Surya Ganguli. Fast large-scale optimization by unifying stochastic gradient and quasi-Newton methods. In Proceedings of the 31st International Conference on Machine Learning, Proceedings of Machine Learning Research, Bejing, China, 22–24 Jun 2014.
[71] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., pages 1929–1958, Jan. 2014.
[72] H. Sun, L. Gu, and B. Sun. AdaTHM: Adaptive gradient method based on estimates of third-order moments. In 2019 IEEE Fourth International Conference on Data Science in Cyberspace (DSC), Los Alamitos, CA, USA, June 2019.
[73] Wonyong Sung, Iksoo Choi, Jinhwan Park, Seokhyun Choi, and Sungho Shin. S-SGD: Symmetrical stochastic gradient descent with weight noise injection for reaching flat minima, 2020.
[74] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning, Proceedings of Machine Learning Research, Atlanta, Georgia, USA, 17–19 Jun 2013.
[75] C. M. Takenga, K. R. Anne, K. Kyamakya, and J. C. Chedjou. Comparison of gradient descent method, Kalman filtering and decoupled Kalman in training neural networks used for fingerprint-based positioning. In IEEE 60th Vehicular Technology Conference, 2004. VTC2004-Fall, 2004.
[76] Conghui Tan, Shiqian Ma, Yu-Hong Dai, and Yuqiu Qian. Barzilai-Borwein step size for stochastic gradient descent. In Advances in Neural Information Processing Systems, 2016.
[77] T. Tieleman and G. Hinton. Lecture 6.5-RMSProp: Divide the gradient by a running average of its recent magnitude. 2012.
[78] P. T. Tran and L. T. Phong. On the convergence proof of AMSGrad and a new version. IEEE Access, 2019.
[79] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, L. Kaiser, and Illia Polosukhin. Attention is all you need. ArXiv, abs/1706.03762, 2017.
[80] Sharan Vaswani, Aaron Mishkin, Issam Laradji, Mark Schmidt, Gauthier Gidel, and Simon Lacoste-Julien. Painless stochastic gradient: Interpolation, line-search, and convergence rates. In Advances in Neural Information Processing Systems, 2019.
[81] James Vuckovic. Kalman gradient descent: Adaptive variance reduction in stochastic optimization, 2018.
[82] Bao Wang, Tan M. Nguyen, Andrea L. Bertozzi, Richard G. Baraniuk, and Stanley J. Osher. Scheduled restart momentum for accelerated stochastic gradient descent. arXiv preprint arXiv:2002.10583, 2020.
[83] Chong Wang, Xi Chen, Alexander Smola, and Eric P. Xing. Variance reduction for stochastic gradient optimization. Advances in Neural Information Processing Systems, 2013.
[84] D. Wang, Y. Liu, W. Tang, F. Shang, H. Liu, Q. Sun, and L. Jiao. signADAM++: Learning confidences for deep neural networks. In 2019 International Conference on Data Mining Workshops (ICDMW), 2019.
[85] Guanghui Wang, Shiyin Lu, Quan Cheng, Wei-wei Tu, and Lijun Zhang. SAdam: A variant of Adam for strongly convex functions. In International Conference on Learning Representations, 2020.
[86] Shipeng Wang, Jian Sun, and Zongben Xu. HyperAdam: A learnable task-adaptive Adam for network training. Jul. 2019.
[87] Xiaoxia Wu, Rachel Ward, and Léon Bottou. WNGrad: Learn the learning rate in gradient descent, 2020.
[88] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, 2019.
[89] Zhewei Yao, Amir Gholami, Sheng Shen, Kurt Keutzer, and Michael W. Mahoney. AdaHessian: An adaptive second order optimizer for machine learning. AAAI, 2021.
[90] Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. Large batch optimization for deep learning: Training BERT in 76 minutes. In International Conference on Learning Representations, 2020.

[91] Xubo Yue, Maher Nouiehed, and Raed Al Kontar. SALR: Sharpness-aware learning rates for improved generalization, 2020.
[92] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. May 2016.
[93] Manzil Zaheer, Sashank Reddi, Devendra Sachan, Satyen Kale, and Sanjiv Kumar. Adaptive methods for nonconvex optimization. In Advances in Neural Information Processing Systems, 2018.
[94] Matthew D. Zeiler. AdaDelta: An adaptive learning rate method. CoRR, 2012.
[95] Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank Reddi, Sanjiv Kumar, and Suvrit Sra. Why are adaptive methods good for attention models? In Advances in Neural Information Processing Systems, 2020.
[96] Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank Reddi, Sanjiv Kumar, and Suvrit Sra. Why are adaptive methods good for attention models? In Advances in Neural Information Processing Systems, 2020.
[97] Jian Zhang, Ioannis Mitliagkas, and Christopher Ré. YellowFin and the art of momentum tuning. arXiv preprint arXiv:1706.03471, 2017.
[98] Michael Zhang, James Lucas, Jimmy Ba, and Geoffrey E. Hinton. Lookahead optimizer: k steps forward, 1 step back. In Advances in Neural Information Processing Systems, 2019.
[99] Zhiming Zhou, Qingru Zhang, Guansong Lu, Hongwei Wang, Weinan Zhang, and Yong Yu. AdaShift: Decorrelation and convergence of adaptive learning rate methods. In International Conference on Learning Representations, 2019.
[100] Juntang Zhuang, Tommy Tang, Yifan Ding, Sekhar C. Tatikonda, Nicha Dvornek, Xenophon Papademetris, and James Duncan. AdaBelief optimizer: Adapting stepsizes by the belief in observed gradients. In Advances in Neural Information Processing Systems, 2020.
