KaFiStO: A Kalman Filtering Framework for Stochastic Optimization

Aram Davtyan, Sepehr Sameni, Llukman Cerkezi, Givi Meishvilli, Adam Bielski, Paolo Favaro
Computer Vision Group, University of Bern
{aram.davtyan, sepehr.sameni, llukman.cerkezi, givi.meishvilli, adam.bielski, [email protected]
arXiv:2107.03331v1 [cs.LG] 7 Jul 2021

Abstract

Optimization is often cast as a deterministic problem, where the solution is found through some iterative procedure such as gradient descent. However, when training neural networks the loss function changes over (iteration) time due to the randomized selection of a subset of the samples. This randomization turns the optimization problem into a stochastic one. We propose to consider the loss as a noisy observation with respect to some reference optimum. This interpretation of the loss allows us to adopt Kalman filtering as an optimizer, as its recursive formulation is designed to estimate unknown parameters from noisy measurements. Moreover, we show that the Kalman Filter dynamical model for the evolution of the unknown parameters can be used to capture the gradient dynamics of advanced methods such as Momentum and Adam. We call this stochastic optimization method KaFiStO. KaFiStO is an easy-to-implement, scalable, and efficient method to train neural networks. We show that it also yields parameter estimates that are on par with or better than existing optimization algorithms across several neural network architectures and machine learning tasks, such as computer vision and language modeling.

1. Introduction

Optimization of functions involving large datasets and high-dimensional models finds today large applicability in several data-driven fields in science and industry. Given the growing role of deep learning, in this paper we look at optimization problems arising in the training of neural networks. The training of these models can be cast as the minimization or maximization of a certain objective function with respect to the model parameters. Because of the complexity and computational requirements of the objective function, the data and the models, the common practice is to resort to iterative training procedures, such as gradient descent. Among the iterative methods that emerged as the most effective and computationally efficient is stochastic gradient descent (SGD) [61]. SGD owes its performance gains to the adoption of an approximate version of the objective function at each iteration step, which, in turn, yields an approximate or noisy gradient.

While SGD seems to benefit greatly (e.g., in terms of rate of convergence) from such approximation, it has also been shown that too much noise hurts the performance [83, 7]. This suggests that, to further improve over SGD, one could attempt to model the noise of the objective function. We consider the iteration-time-varying loss function used in SGD as a stochastic process obtained by adding zero-mean Gaussian noise to the expected risk. A powerful approach designed to handle estimation with such processes is Kalman filtering [36]. The idea of using Kalman filtering to train neural networks is not new [25]. However, the way to apply it to this task can vary vastly. Indeed, in our approach, which we call KaFiStO, we introduce a number of novel ideas that result in a practical and effective training algorithm. Firstly, we introduce drastic approximations of the estimated covariance of Kalman's dynamical state so that the corresponding matrix depends on only up to a 2 × 2 matrix of parameters. Secondly, we approximate intermediate Kalman filtering calculations so that more accuracy can be achieved. Thirdly, because of the way we model the objective function, we can also define a schedule for the optimization that behaves similarly to the learning rate schedules used in SGD and other iterative methods [37].

We highlight the following contributions: 1) KaFiStO is designed to handle high-dimensional data and models, and large datasets; 2) The tuning of the algorithm is automated, but it is also possible to introduce a learning rate schedule similar to those in existing methods, albeit with a very different interpretation; 3) KaFiStO adapts automatically to the noise in the loss, which might vary depending on the settings of the training (e.g., the minibatch size), and to the variation in the estimated weights over iteration time; 4) It can incorporate iteration-time dynamics of the model parameters, which are analogous to momentum [74]; 5) It is a framework that can be easily extended (we show a few variations of KaFiStO); 6) As shown in our experiments, KaFiStO is on par with state-of-the-art optimizers and can yield better minima in a number of problems ranging from image classification to generative adversarial networks (GAN) and natural language processing (NLP).
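As a point of reference for the Kalman filtering view introduced above, the toy sketch below (not part of the original paper) shows the standard scalar Kalman filter recursion estimating an unknown constant from noisy measurements; the random-walk state model, the noise levels, and the priors are illustrative assumptions and do not correspond to the KaFiStO update.

```python
import numpy as np

# Scalar Kalman filter estimating an unknown constant from noisy
# measurements y_k = x_true + v_k, v_k ~ N(0, r). The constant x_true,
# the noise level r, and the priors are illustrative assumptions.
rng = np.random.default_rng(0)
x_true, r = 2.0, 0.5 ** 2

x_hat, p = 0.0, 1.0   # prior mean and variance of the estimate
q = 1e-4              # small process noise for the random-walk state model
for _ in range(200):
    y = x_true + rng.normal(scale=0.5)   # noisy observation
    p = p + q                            # predict: variance grows with process noise
    k = p / (p + r)                      # Kalman gain
    x_hat = x_hat + k * (y - x_hat)      # update with the innovation
    p = (1.0 - k) * p                    # posterior variance shrinks
print(x_hat)  # close to x_true = 2.0
```

In the formulation above, the randomized minibatch loss plays the role of the noisy measurement and the network weights play the role of the state being estimated.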
2. Prior Work

In this section, we review optimization methods that found application in machine learning and, in particular, in large-scale problems. Most of the progress in the last decades aimed at improving the efficiency and accuracy of the optimization algorithms.

First-Order Methods. First-order methods exploit only the gradient of the objective function. The main advantage of these methods lies in their speed and simplicity. Robbins and Monro [61] introduced the very first stochastic optimization method (SGD) in 1951. Since then, the SGD method has been thoroughly analyzed and extended [42, 67, 33, 73]. Some considered restarting techniques for optimization purposes [46, 82]. However, a limitation of SGD is that the learning rate must be manually defined, and the approximations in the computation of the gradient hurt the performance.
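For concreteness, the snippet below is a minimal sketch of the plain SGD update discussed above, run on a toy quadratic objective with artificially noised gradients; the objective, the noise level, and the fixed learning rate are illustrative assumptions rather than settings from any of the cited works.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_grad(x, noise_std=0.1):
    """Gradient of a toy quadratic loss 0.5*||x||^2, corrupted by
    minibatch-style zero-mean Gaussian noise (illustrative only)."""
    return x + noise_std * rng.normal(size=x.shape)

def sgd(x0, lr=0.1, steps=100):
    """Plain SGD: x_{k+1} = x_k - lr * g_k with a manually chosen,
    fixed learning rate, as discussed above."""
    x = x0.copy()
    for _ in range(steps):
        x -= lr * noisy_grad(x)
    return x

x_star = sgd(np.ones(5))
print(x_star)  # close to the minimizer 0, up to gradient noise
```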
Second-Order Methods. To address the manual tuning of the learning rates in first-order methods and to improve the convergence rate, second-order methods rely on the Hessian matrix. However, this matrix becomes very quickly unmanageable, as it grows quadratically with the number of model parameters. Thus, most work reduces the computational complexity by approximating the Hessian with a block-diagonal matrix [21, 6, 39]. A number of methods looked at combining the second-order information in different ways. For example, Roux and Fitzgibbon [63] combined Newton's method and natural gradient. Sohl-Dickstein et al. [70] combined SGD with the second-order curvature information leveraged by quasi-Newton methods. Yao et al. [89] dynamically incorporated the curvature of the loss via adaptive estimates of the Hessian. Henriques et al. [28] proposed a method that does not require storing the Hessian at all. In contrast with these methods, KaFiStO does not compute second-order derivatives, but focuses instead on modeling noise in the objective function.

Adaptive. An alternative to using second-order derivatives is to design methods that automatically adjust the step size during the optimization process. The adaptive selection of the update step size has been based on several principles, including: the local sharpness of the loss function [91], incorporating a line search approach [80, 53, 49], the gradient change speed [16], the Barzilai-Borwein method [76], a "belief" in the current gradient direction [100], the linearization of the loss [62], the per-component unweighted mean of all historical gradients [11], handling noise by preconditioning based on a covariance matrix [35], the adaptive and momental bounds [14], decorrelating the second moment and gradient terms [99], the importance weights [40], the layer-wise adaptation strategy [90], the gradient scale invariance [55], multiple learning rates [66], controlling the increase in the effective learning rate [93], learning the update-step size [87], and looking ahead at the sequence of fast weights generated by another optimizer [98]. Among the widely adopted methods is the work of Duchi et al. [17], who presented a new family of sub-gradient methods called AdaGrad. AdaGrad dynamically incorporates knowledge of the geometry of the data observed in earlier iterations. Tieleman et al. [77] introduced RMSProp, further extended by Mukkamala et al. [52] with logarithmic regret bounds for strongly convex functions. Zeiler [94] proposed a per-dimension learning rate method for gradient descent called AdaDelta. Kingma and Ba [37] introduced Adam, based on adaptive estimates of lower-order moments. A wide range of variations and extensions of the original Adam optimizer has also been proposed [44, 60, 29, 47, 78, 86, 15, 72, 9, 34, 43, 48, 84, 85, 41]. Recent work proposed to decouple the weight decay [23, 18]. Chen et al. [8] introduced a partially adaptive momentum estimation method. Some recent work also focused on the role of gradient clipping [95, 96]. Another line of research focused on reducing the memory overhead of adaptive algorithms [2, 69, 56]. In most prior work, adaptivity comes from the introduction of extra hyper-parameters that also require task-specific tuning. In our case, this property is a direct byproduct of the Kalman filtering framework.

Kalman filtering. The use of Kalman filtering theory and methods for the training of neural networks is not new. Haykin [1] edited a book collecting a wide range of techniques on this topic. More recently, Shashua et al. [68] incorporated Kalman filtering for value approximation in reinforcement learning. Ollivier [54] recovers the exact extended Kalman filter equations from first principles in statistical learning: the extended Kalman filter is equal to Amari's online natural gradient applied in the space of trajectories of the system. Vilmarest et al. [12] apply the extended Kalman filter to linear and logistic regression. Takenga et al. [75] compared gradient descent (GD) to methods based on either Kalman filtering or the decoupled Kalman filter. To summarize, all of these prior Kalman filtering approaches either focus on a specific non-general formulation or face difficulties when scaling to the high-dimensional parameter spaces of large-scale neural models.

3. Modeling Noise in Stochastic Optimization

In machine learning, we are interested in minimizing the expected risk

\min_{x \in \mathbb{R}^n} L(x), \qquad \text{where } L(x) = \mathbb{E}_{\xi \sim p(\xi)}\left[\ell(\xi, x)\right], \tag{1}
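To make Eq. (1) concrete, the following sketch (illustrative only, not code from the paper) compares a Monte Carlo estimate of the expected risk L(x) with the loss computed on a random minibatch; the latter is exactly the kind of noisy observation of L(x) that this section models. The quadratic per-sample loss and the Gaussian data distribution are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def per_sample_loss(xi, x):
    # l(xi, x): squared error between a sample xi and the parameter x
    return 0.5 * (xi - x) ** 2

x = 1.0
population = rng.normal(loc=0.0, scale=1.0, size=100_000)  # samples xi ~ p(xi)
L_full = per_sample_loss(population, x).mean()             # Monte Carlo estimate of L(x)

minibatch = rng.choice(population, size=64)                # randomized subset of the data
L_batch = per_sample_loss(minibatch, x).mean()             # noisy observation of L(x)

print(L_full, L_batch)  # the minibatch loss fluctuates around the expected risk
```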