
SA-GD: Improved Learning Strategy with Simulated Annealing

Zhicheng Cai
School of Electronic Science and Engineering
Nanjing University, Nanjing, China 210023
Email: [email protected]

Abstract—Gradient descent is the most utilized method when optimizing machine learning issues. However, there exist many local minimums and saddle points in the loss function, especially for high-dimensional non-convex problems like deep learning. Gradient descent may leave the loss function trapped in these local intervals, which impedes further optimization and results in poor generalization ability. This paper proposes the SA-GD algorithm, which introduces the thought of the simulated annealing algorithm to gradient descent. The SA-GD method offers the model the ability of mounting hills in probability, tending to enable the model to jump out of these local areas and converge to an optimal state finally. We took CNN models as an example and tested the basic CNN models on various benchmark datasets. Compared to the baseline models with the traditional gradient descent algorithm, models with SA-GD possess better generalization ability without sacrificing the efficiency and stability of model convergence. In addition, SA-GD can be utilized as an effective ensemble learning approach which improves the final performance significantly.

I. INTRODUCTION

It has been decades since the concept of machine learning was raised. From simple linear models to more complex issues processing high-dimensional data, such as reinforcement learning [32] and deep learning models [6], [16], the machine learning community prospers continuously because of unremitting research. What is more, machine learning has been successfully applied in many significant realms and tasks such as natural language processing [33], computer vision [31], and speech recognition [2].

Most machine learning models can be regarded as a kind of unconstrained optimization problem, requiring the optimal parameters to obtain better representation ability. Gradient descent [22], which is a kind of iterative method utilized to solve nonlinear equations, is generally applied to optimize machine learning issues such as linear regression [7], multi-layer perceptrons [28], convolutional neural networks [8], and so on [30]. Specifically, when facing a machine learning issue, we utilize gradient descent iteratively to minimize the well-designed target function and update the learned parameters by back-propagation [17] to make the model converge. For tasks with tremendous datasets, the stochastic gradient descent (SGD) algorithm [12], which splits the dataset into mini-batches, is utilized to balance between the speed and stability of model convergence.

However, in the case of high-dimensional non-convex optimization problems, there exist many intractable circumstances like local minimums, saddle points, and plateaus in the loss function [6]. It is easy for optimization with gradient descent to drive the model into these disappointing areas, from which it inevitably fails to jump out. These situations hinder further optimization and learning. As a consequence, the trained model fails to obtain sufficient robustness and good generalization ability. Therefore, many improved gradient descent algorithms have been raised, such as NAG [18], Adam [14], and so on [25]. However, these improved optimization methods still keep the loss function performing gradient descent without possessing the ability of mounting hills [6], and fail to solve certain issues mentioned above. As a matter of fact, researchers prefer utilizing the plain gradient descent algorithm in many tasks [8], [15]. In addition, these improved optimization algorithms fail to achieve better performance compared to simple SGD under some circumstances.

It is believed that equipping machine learning models with the ability of mounting hills is one approach to make the loss function jump out of these disappointing local areas. In addition, we believe that the ability of mounting hills should be exercised according to a certain probability function. Moreover, we argue that this probability function should take all present model states into consideration, including the loss, the epoch, and even the learning rate. To take effect while ensuring model convergence, this probability should be low in the beginning and then ascend, but stay under a certain ceiling all the time. This paper sets the upper bound to 33 percent.

Inspired by the simulated annealing algorithm [29], whose probability function takes energy and temperature into consideration, we propose the SA-GD optimization algorithm, which stands for gradient descent improved by simulated annealing. The SA-GD algorithm possesses a gradient ascent probability function in a similar format to that of simulated annealing. This function enables the model to perform gradient ascent in probability. With the ability of mounting hills, the model is supposed to jump out of these intractable local areas and converge to an optimal state. Moreover, the probability of performing gradient ascent can be regarded as a sort of noisy disturbance, which enhances the model generalization ability as well. In Section III we exhibit the derivation process of the final gradient ascent probability function step by step.

In addition, due to the fluctuation of the loss curve caused by gradient ascent, the models of various epochs during the training procedure are supposed to differ significantly. On the precondition that all selected models perform well individually, the group of selected models possesses diversity. As a result, the thought of ensemble learning with SA-GD is proposed naturally.
II. RELATED WORK

A. Gradient descent algorithm

For most deep learning and neural network algorithms, we utilize the gradient descent algorithm to minimize the cost function. Suppose the current cost function is J(θ_i) and J'(θ_i) is the derivative of the cost function at θ_i, which is also the direction of the gradient that reduces the cost function the fastest. We can gradually reduce the value of the cost function by moving the parameter θ_i one small step in the derivative direction, so that θ_{i+1} = θ_i − ε · J'(θ_i) yields J(θ_{i+1}) < J(θ_i), where ε is the hyper-parameter learning rate. Iterating over this process in the hope of eventually obtaining the global optimal solution is the gradient descent algorithm.
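As a minimal illustration of the update rule above, the following sketch applies gradient descent to a toy quadratic cost; the cost function and learning rate are chosen only for demonstration and are not part of the paper.

```python
import numpy as np

def J(theta):
    # Toy cost function used only for illustration
    return (theta - 3.0) ** 2

def J_prime(theta):
    # Analytic derivative of the toy cost
    return 2.0 * (theta - 3.0)

epsilon = 0.1   # learning rate (hyper-parameter)
theta = 0.0     # initial parameter

for step in range(100):
    # theta_{i+1} = theta_i - epsilon * J'(theta_i)
    theta = theta - epsilon * J_prime(theta)

print(theta, J(theta))  # theta approaches 3.0, the minimum of the toy cost
```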
However, for a high-dimensional non-convex optimization problem like deep learning, there are often many saddle points, local minimums, or plateaus [6]. Gradient descent may make the cost function fall into such local intervals and fail to jump out and converge to the global optimal solution. Fig. 1 shows the case where the cost function falls into a local minimum. If the learned weights of the network are not initialized properly, it is very easy for the network to eventually converge to a poor state [6].

Fig. 1. Schematic diagram of the case where the cost function falls into a local minimum.

Stochastic gradient descent [12] takes one random sample at a time to calculate the gradient. Since the optimal direction of a single sample point may not be the global optimal direction, the possibility that the gradient is not zero when it falls into a local minimum is preserved, giving the loss function a chance to jump out of the local minimum. However, the stochastic approach can be seen as adding a random factor to the direction of the gradient and may provide the wrong direction of optimization, resulting in the network eventually failing to converge to a better state. For deep learning models, the probability of encountering a saddle point is much higher than that of encountering a local minimum [6]. Some improved gradient descent algorithms, such as gradient descent with a standard momentum term [21] or Nesterov momentum [18], as well as adaptive learning rate optimization algorithms such as AdaGrad [19] and AdaDelta [34], try to make the model escape from the saddle point and reach a better convergence result. However, our experimental results validate that these improved optimization algorithms fail to make a significant improvement to the model generalization ability.

B. Simulated annealing algorithm

Research in statistical mechanics has demonstrated that if a system is slowly cooled down (i.e., an annealing process) starting from a high temperature, the energy of the system is steadily reduced and the state of the system particles is continuously changed. After sufficient transitions, thermal equilibrium can be reached at each temperature. When the system is completely cooled, it will enter a state of minimal energy with a high probability [4]. The Metropolis algorithm [1] constructs a mathematical model to simply describe the annealing process of the system. Assuming that the energy of the system in state n is E(n), the probability that the system transitions from state n to state n + 1 at temperature T can be formulated as Eq. 1:

$$P(n \rightarrow n+1) = \begin{cases} 1, & E(n+1) \le E(n) \\ e^{\frac{E(n)-E(n+1)}{KT}}, & E(n+1) > E(n) \end{cases} \quad (1)$$

where K is the Boltzmann constant and T is the system temperature. Under a certain temperature, the system will reach thermal equilibrium after sufficient transitions have taken place. The probability of the system being in state i at this point satisfies the Boltzmann distribution [4], as illustrated in Eq. 2:

$$P_T(X = i) = \frac{e^{-\frac{E(i)}{KT}}}{\sum_{j \in S} e^{-\frac{E(j)}{KT}}} \quad (2)$$

where S is the set of state spaces. When the temperature drops very slowly toward 0, we have:

$$\begin{aligned} \lim_{T \to 0} \frac{e^{-\frac{E(i)-E_{min}}{KT}}}{\sum_{j \in S} e^{-\frac{E(j)-E_{min}}{KT}}} &= \lim_{T \to 0} \frac{e^{-\frac{E(i)-E_{min}}{KT}}}{\sum_{j \in S_{min}} e^{-\frac{E(j)-E_{min}}{KT}} + \sum_{j \notin S_{min}} e^{-\frac{E(j)-E_{min}}{KT}}} \\ &= \lim_{T \to 0} \frac{e^{-\frac{E(i)-E_{min}}{KT}}}{\sum_{j \in S_{min}} e^{-\frac{E(j)-E_{min}}{KT}}} \\ &= \begin{cases} \frac{1}{|S_{min}|}, & i \in S_{min} \\ 0, & \text{otherwise} \end{cases} \end{aligned} \quad (3)$$

where E_min represents the lowest energy in the state space and S_min = {i | E(i) = E_min} stands for the set of states with the lowest energy. Eq. 3 illustrates that when the temperature drops to a very low level, the system will enter the minimal-energy state with a large probability.
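The limit in Eq. 3 can be checked numerically: as T decreases, the Boltzmann distribution of Eq. 2 concentrates its mass uniformly on the minimum-energy states. The energy values below are arbitrary illustrative numbers, and the constant K is absorbed into T.

```python
import numpy as np

def boltzmann(energies, T):
    # Eq. 2 with the Boltzmann constant K absorbed into the temperature T
    w = np.exp(-(energies - energies.min()) / T)  # subtract E_min for numerical stability
    return w / w.sum()

E = np.array([1.0, 1.0, 2.0, 5.0])  # two states share the lowest energy

for T in [10.0, 1.0, 0.1, 0.01]:
    print(T, boltzmann(E, T))
# As T -> 0 the probability mass tends to 1/|S_min| = 0.5 on each lowest-energy state.
```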

Inspired by statistical mechanics, the idea of simulated annealing is utilized for the optimization problem of searching for the global minimum; consequently we obtain the simulated annealing algorithm [4]. Suppose the evaluation function of the problem is f(x), the temperature at a certain moment is T_i, one feasible solution of the problem is x_t, and the solution that can be generated is x'. Then the probability that the new solution x' is accepted as the next solution x_{t+1} is:

$$P(x_t \rightarrow x') = \begin{cases} 1, & f(x') \le f(x_t) \\ e^{-\frac{f(x')-f(x_t)}{T_i}}, & f(x') > f(x_t) \end{cases} \quad (4)$$

At the temperature T_i, after several state transitions, the temperature is reduced to T_{i+1} < T_i. The process is repeated again at temperature T_{i+1}. Since each new state completely depends only on the previous one, this is a classic Markov process. As a consequence, the probability of generating the solution x' from x_t is uniform. After finitely many transitions, the distribution of the equilibrium state x_t at temperature T_i is:

$$P_t(T_i) = \frac{e^{-\frac{f(x_t)}{T_i}}}{\sum_{j \in S} e^{-\frac{f(x_j)}{T_i}}} \quad (5)$$

When the temperature drops to 0, there is:

$$P_{i}^{*}(T = 0) = \begin{cases} \frac{1}{|S_{min}|}, & x_i^{*} \in S_{min} \\ 0, & \text{otherwise} \end{cases} \quad (6)$$

If the temperature decreases very slowly and, in addition, a sufficient number of state transitions are made at each temperature, the system can reach thermal equilibrium at each temperature. As a result, the global optimal solution will be found with probability 1.

The Boltzmann machine (BM) [9], which is a kind of stochastic neural network, borrows the idea of simulated annealing. Instead of adjusting the weight parameters based on deterministic algorithms like other networks, the BM network modifies the weights with some probability distribution during the training stage. In the inference stage, the BM network also determines the transitions of its states according to a certain probability distribution. As the BM network states evolve, the energy of the network always changes in the direction of decreasing in the probabilistic sense. However, some neuron states may temporarily increase the network energy due to small-probability value taking. This gives the BM network the ability to "mount the hill" by jumping out of the local minimum.

Another method, which applies the simulated annealing algorithm to artificial neural networks [24], obtains the weights of candidate network neurons by Monte Carlo sampling and a generating function. The simulated annealing algorithm is then utilized to judge whether to select the candidate weights as the new weights. The process is repeated continuously to make the neural network converge to a better state. However, it is usually very time consuming to search for the optimal parameters utilizing this method. In the SA-GD algorithm, the probability function of simulated annealing is instead used as the probability of gradient ascent for the model cost function.
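As an illustration of the acceptance rule in Eq. 4, the following sketch applies simulated annealing to a one-dimensional toy objective; the objective, proposal distribution, and cooling schedule are illustrative assumptions rather than part of the paper.

```python
import math
import random

def f(x):
    # Toy evaluation function with several local minima
    return x * x + 10.0 * math.sin(x)

def simulated_annealing(x0, T0=10.0, alpha=0.95, steps_per_temp=50, T_min=1e-3):
    x, T = x0, T0
    while T > T_min:
        for _ in range(steps_per_temp):
            x_new = x + random.uniform(-1.0, 1.0)      # generate a candidate solution
            delta = f(x_new) - f(x)
            # Eq. 4: always accept an improvement; otherwise accept with prob e^{-delta/T}
            if delta <= 0 or random.random() < math.exp(-delta / T):
                x = x_new
        T *= alpha                                      # cool the system down
    return x

print(simulated_annealing(8.0))
```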

III. SIMULATED ANNEALING GRADIENT DESCENT

A. Utilizing the thought of simulated annealing to aid training

First we make initial improvements to the gradient descent algorithm by drawing on the idea of simulated annealing. As an example, this paper utilizes the universal mini-batch stochastic gradient descent algorithm [13] to train a convolutional neural network for the image classification task. The objective function of this task is the cross-entropy loss function J(θ), which represents the energy of the current system. For the i-th (i ∈ {1, 2, 3, ...}) mini-batch stochastic gradient descent step, we first calculate the loss value J(θ_i) caused by the current weight parameters θ_i. Currently, the energy value of the system is E_i and the gradient is ∇_{θ_i} J(θ_i). Then we compare it with the energy value E_{i−1} of the previous state (the initial energy of the system is set to E_0 = 0) and calculate the difference ΔE_i between them. Afterwards, the transition probability P_i is set as:

$$P_i = e^{-\frac{|\Delta E_i|}{T_0 \cdot \ln(n+1)}} \quad (7)$$

where n ∈ {1, 2, 3, ...} is the index of the current training epoch and the current temperature is T = T_0 · ln(n+1). We update the parameters by accepting the i-th gradient descent with probability 1 − P_i and perform the reverse gradient ascent with probability P_i. Suppose the current learning rate is ε; this can be stated in mathematical language as follows. Considering the probability P_i as the threshold for performing the transition, and obtaining a by random sampling from a uniform distribution on the interval [0, 1], the expression of the weight parameter at the next moment is:

$$\theta_{i+1} = \begin{cases} \theta_i - \epsilon \cdot \nabla_{\theta_i} J(\theta_i), & a \ge P_i \\ \theta_i + \epsilon \cdot \nabla_{\theta_i} J(\theta_i), & a < P_i \end{cases} \quad (8)$$

In order to adjust the size of the fluctuations, namely the capacity of mounting the hill, we can add a corresponding hyper-parameter expansion factor σ to Eq. 8. At this point, the weight update expression becomes:

$$\theta_{i+1} = \begin{cases} \theta_i - \epsilon \cdot \nabla_{\theta_i} J(\theta_i), & a \ge P_i \\ \theta_i + \sigma \cdot \epsilon \cdot \nabla_{\theta_i} J(\theta_i), & a < P_i \end{cases} \quad (9)$$
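As a sketch of how Eqs. 7-9 could be realized in code, the following Python function performs one SA-GD update on a parameter vector, taking the gradient ascent branch with probability P_i as described above. The helper names grad_fn and loss_fn and the default hyper-parameter values are illustrative assumptions.

```python
import math
import numpy as np

def sa_gd_step(theta, grad_fn, loss_fn, E_prev, n, T0=1e-4, eps=1e-3, sigma=1.0):
    """One SA-GD update following Eqs. 7-9 (illustrative sketch).

    theta   : current parameter vector
    grad_fn : returns the gradient of the loss at theta (hypothetical helper)
    loss_fn : returns the loss, i.e. the system energy, at theta (hypothetical helper)
    E_prev  : energy of the previous state
    n       : current epoch index (n >= 1)
    """
    E_i = loss_fn(theta)
    g = grad_fn(theta)
    delta_E = abs(E_i - E_prev)
    # Eq. 7: gradient ascent probability with "temperature" T0 * ln(n + 1)
    P_i = math.exp(-delta_E / (T0 * math.log(n + 1)))
    a = np.random.uniform(0.0, 1.0)
    if a < P_i:
        # gradient ascent ("mounting the hill"), scaled by the expansion factor sigma (Eq. 9)
        theta_next = theta + sigma * eps * g
    else:
        # ordinary gradient descent step
        theta_next = theta - eps * g
    return theta_next, E_i
```

In practice, theta would correspond to the flattened network weights and loss_fn to the mini-batch cross-entropy loss.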
In this way, we want to endow the model with a probabilistic hill-mounting capability, which enables small fluctuations while making the loss function show a downward trend in general, so that the loss function can jump out of the local minimum or saddle point region and finally converge to a better state. At the same time, such gradient ascent is equivalent to adding noise to the training process, which plays a slight regularization role and improves the generalization ability of the model. Subsequent experimental results validate that such a preliminarily improved, simulated-annealing-assisted gradient descent algorithm can bring a significant test accuracy improvement on the CIFAR-10 benchmark dataset.

B. Algorithm description

This section provides corresponding additional description of the above algorithm.

On the one hand, in the above algorithm, we do not perform the state probability transition in the same way as the simulated annealing algorithm. If the idea of simulated annealing were followed exactly, for the i-th mini-batch gradient descent step, the current parameter θ_i causes a loss value of J(θ_i); that is, the energy value of the system is E_i and the gradient is ∇_{θ_i} J(θ_i). First, perform gradient descent to obtain the candidate weight value θ_i' = θ_i − ε∇_{θ_i} J(θ_i) and calculate the loss value J(θ_i'). Consequently, the energy of the system after performing gradient descent is E_i'. At this point, the probability of accepting θ_i' as the new solution θ_{i+1} = θ_i − ε · ∇_{θ_i} J(θ_i) is:

$$P(\theta_i \rightarrow \theta_i') = \begin{cases} 1, & E(\theta_i') \le E(\theta_i) \\ e^{-\frac{E(\theta_i')-E(\theta_i)}{T_0 \cdot \ln(n+1)}}, & E(\theta_i') > E(\theta_i) \end{cases} \quad (10)$$

If E(θ_i') > E(θ_i) and θ_i' is not accepted, then, since the candidate parameter values are deterministic, the parameter values remain unchanged at this point and fail to be updated.
Moreover, if E(θ_i') ≤ E(θ_i), or if E(θ_i') > E(θ_i) and θ_i' is accepted probabilistically, then the algorithm is indistinguishable from an ordinary gradient descent algorithm, except that the convergence rate is much slower. One possible improvement is to make use of a random gradient, which makes the candidate weights random rather than deterministic. However, such an approach may lead to optimization in the wrong direction or even a failure to converge to a better state.

On the other hand, instead of using the energy difference directly, we use the absolute value of the difference. This design enables a probability of gradient ascent regardless of whether the weights obtained from the next gradient update make the loss increase or decrease, giving the model the ability to mount hills. Since the current direction of gradient descent is not necessarily the direction of the global minimum, if we accept all the weight parameters that reduce the loss, we make no improvement on whether we can escape the local optimum or saddle point. As a result, we should instead give the model some hill-mounting ability even when the loss is reduced. In addition, there may also be structures with large gradients, like cliffs, in deep learning optimization. When the parameters approach such a cliff region, gradient descent updates can make the parameters change drastically, ejecting the loss function very far away and possibly rendering a lot of previous optimization work meaningless [6]. However, if the gradient ascends at this point, the loss function has a better chance to enter the bottom of the cliff at the next gradient descent, avoiding the huge span while still obtaining a good optimization.

Moreover, in order to ensure model convergence, the probability of performing gradient ascent needs to be strictly less than 50% (the transition probability is set to be less than 33% in this paper). In addition, this probability should increase with the number of iteration epochs, which is referred to as a prerequisite for using simulated-annealing-assisted training in this paper. That is, in the first few epochs of training, the cost function descends with a large probability (90%). In the following epochs, the cost function ascends with a gradually increasing probability that stays below 33%. In each epoch, the annealing temperature is kept constant. As a consequence, for each mini-batch of gradient descent the candidate values are calculated and the states can be fully transited at that epoch temperature. When an epoch is trained completely, annealing is performed by the formula T = T_0 · ln(n + 1). Through extensive experiments, it was shown that the training loss of CIFAR-10 eventually converged to an order of magnitude of 0.0001. Therefore, in the subsequent related experiments we set the initial temperature T_0 = 0.0001 and utilized a constant learning rate ε = 0.001. The transition probability change curve obtained at this stage is shown in Fig. 2.

Fig. 2. Schematic diagram of the transition probability change curve without considering the learning rate.

C. Considering the learning rate

In the above preliminary improvement, we did not take into account the role of the learning rate in the expression of the transition probability. Now we add the hyper-parameter learning rate to the transition probability expression, at which point the transition probability P_i is:

$$P_i = e^{-\frac{|\Delta E|}{T_0 \cdot \epsilon \cdot \ln(n+1)}} \quad (11)$$

We first consider the simple fixed learning rate strategy. Set the learning rate to be constant at ε = 0.001. In order to meet the precondition, the initial temperature T_0 needs to have an order of magnitude of 0.1. We obtained the better choice T_0 = 1/9 by exhaustive search, at which point the transition probability change curve is as shown in Fig. 3.

Fig. 3. Schematic diagram of the transition probability change curve considering the fixed learning rate.
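To make the behaviour of Eq. 11 concrete, the following sketch tabulates the gradient ascent probability over training epochs for the fixed-learning-rate setting; the loss difference ΔE = 0.001 is only a representative value chosen for illustration.

```python
import math

T0 = 1.0 / 9.0       # initial temperature found by exhaustive search
eps = 0.001          # fixed learning rate
delta_E = 0.001      # representative loss difference (illustrative value)

for n in [1, 10, 100, 500, 1000, 2000]:
    # Eq. 11: ascent probability grows slowly with the epoch index n
    P = math.exp(-abs(delta_E) / (T0 * eps * math.log(n + 1)))
    print(f"epoch {n:5d}  ascent probability {P:.3f}")
# For this delta_E the probability rises from near zero to roughly 0.3 over 2000 epochs.
```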

D. Double simulated annealing

During the actual training of the model, the learning rate usually decays gradually. On the one hand, a larger learning rate in the initial phase of learning speeds up learning, while a smaller learning rate afterwards allows the model to converge better and helps to find a more optimal local solution [6]. On the other hand, as the number of training epochs increases, the gradient pathology problem occurs, causing learning to become very slow despite the large gradient. As a consequence, the learning rate must be contracted to compensate for the stronger curvature [6].

The learning rate strategy utilized in the subsequent experiments of this paper is an exponentially decaying learning rate strategy with an initial learning rate of 0.1 and a hyper-parameter γ = 0.9956. If the transition probability is still obtained according to Eq. 11, it is difficult to obtain a suitable transition probability, even though the initial temperature T_0 can be selected to satisfy the precondition. If the selected initial temperature T_0 is large, the transition probability will be very small in the first half of the training period and then suddenly increase rapidly in the subsequent training epochs. Although such a changing transition probability can accelerate model convergence, it is too small in the early stage and the model has almost no hill-mounting ability. However, the model using the exponentially decaying learning rate strategy has almost converged by that time and cannot make better progress in generalization performance. If the selected initial temperature T_0 is small, the moment of rapid increase of the transition probability moves forward, resulting in a very high probability of performing gradient ascent when the model is approaching convergence, making it difficult to converge. This dilemma illustrates that a better transition probability cannot be obtained by simply adjusting the initial temperature. Fig. 4 shows the training and test loss curves obtained on the CIFAR-10 dataset. The transition probability curves obtained by picking a larger initial temperature T_0 = 10 and a smaller initial temperature T_0 = 0.1 (default ΔE = 0.001) are shown in Fig. 5.

Fig. 4. The training and test loss curves obtained on CIFAR-10 dataset.

Fig. 5. Schematic diagram of transition probability curves with different initial temperatures: (a) initial temperature T_0 = 10; (b) initial temperature T_0 = 0.1.

From the training loss curve, the model converges quickly due to the exponentially decaying learning rate strategy using a larger learning rate at the beginning, resulting in a rapid change in the loss difference. As shown by a large amount of experimental data, the loss difference ΔE decays by a factor of 10,000 to 100,000 within 200 epochs. In addition, the value of ΔE fluctuates significantly. Moreover, the exponentially decaying learning rate decreases very fast, and these two factors make the transition probability difficult to control. Therefore, we first add a fractional power constraint to ΔE in the hope of attenuating the effect of the unstable fluctuations of ΔE. Thus the transition probability P_i becomes:

$$P_i = e^{-\frac{|\Delta E|^{\frac{1}{2}}}{T_0 \cdot \epsilon \cdot \ln(n+1)}} \quad (12)$$

However, ΔE shows a huge decreasing trend, and the effect of the huge exponential decay of ΔE cannot be well bounded using a fixed exponential constraint. Therefore, we consider a fractional power constraint that can decrease as the number of training epochs increases.
After performing (a) 4E = 0.001 (b) 4E = 0.0001 (c) 4E = 0.00001

Fig. 7. The curves of the transition probability with different ΔE: (a) ΔE = 0.001; (b) ΔE = 0.0001; (c) ΔE = 0.00001.


Fig. 8. The curves of the transition probability constraining the fluctuation of the temperature, with different ΔE: (a) ΔE = 0.001; (b) ΔE = 0.0001; (c) ΔE = 0.00001.

After performing sufficient experiments, we finally choose the fractional power as ln(n+2)^{-1/α} and choose the hyper-parameter α = e. Fig. 6 shows the variation curves of the fractional powers and of the temperature T = T_0 · ε_0 · γ^n · ln(n+1).

Fig. 6. The variation curves of the fractional powers and the temperature: (a) curve of the fractional powers; (b) curve of the temperature.

At this point the transition probability P_i becomes:

$$P_i = e^{-\frac{|\Delta E|^{\ln(n+2)^{-\frac{1}{\alpha}}}}{T_0 \cdot \epsilon \cdot \ln(n+1)}} \quad (13)$$

The obtained transition probability curves with different ΔE (T_0 = 15; ΔE = 0.001, 0.0001, 0.00001) are shown in Fig. 7.

However, when the initial temperature is large, the transition probability will be very small, and thus the ability of the model to climb the hill before convergence remains weak. Besides, when the initial temperature is small, the transition probability is also large because the difference is large at the beginning of training, which is not conducive to model training and convergence. At this point, the fluctuating variation of the temperature T = T_0 · ε_0 · γ^n · ln(n+1) becomes the primary reason for the difficulty in obtaining an optimal simulated annealing transition probability. Therefore, we choose to continue using fractional powers to reduce the impact of such fluctuations. At this point the transition probability P_i becomes:

$$P_i = e^{-\left(\frac{|\Delta E|^{\ln(n+2)^{-\frac{1}{\alpha}}}}{T_0 \cdot \epsilon_0 \cdot \gamma^{n} \cdot \ln(n+1)}\right)^{\beta}} \quad (14)$$

where the hyper-parameter β is set to be the Euler-Mascheroni constant 0.5772. The obtained transition probability curves with different ΔE (T_0 = 15; ΔE = 0.001, 0.0001, 0.00001) are shown in Fig. 8. As shown in Fig. 8, the transition probabilities so obtained are good at different magnitudes of ΔE, endowing the model with a favorable and suitable hill-mounting ability. Since the learning rate is decaying, it can be regarded as an annealing process as well. Besides, the decaying learning rate is employed in the annealed transition probability, which is equivalent to performing a double annealing. Such an improved gradient descent training algorithm can bring a 0.7% improvement to the model on the CIFAR-10 benchmark dataset.
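For reference, the following function transcribes the double-annealed schedule of Eq. 14 as written above, using the hyper-parameter values stated in the text (T_0 = 15, ε_0 = 0.1, γ = 0.9956, α = e, β ≈ 0.5772); treat it as an illustrative sketch of the schedule rather than a definitive implementation.

```python
import math

T0, eps0, gamma = 15.0, 0.1, 0.9956   # initial temperature, initial LR, LR decay factor
alpha, beta = math.e, 0.5772          # fractional-power hyper-parameters

def transition_probability(delta_E, n):
    # Eq. 14 as written above (double simulated annealing)
    frac = math.log(n + 2) ** (-1.0 / alpha)                # decaying fractional power on |dE|
    temperature = T0 * eps0 * (gamma ** n) * math.log(n + 1)
    return math.exp(-((abs(delta_E) ** frac) / temperature) ** beta)

print(transition_probability(1e-4, 500))  # example query of the schedule
```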
IV. EXPERIMENTS AND ANALYSIS

To validate the effectiveness of the SA-GD algorithm, we take the convolutional neural network as a convincing instance. We compared our SA-GD algorithm with traditional gradient descent on various challenging benchmark datasets, including CIFAR-10 and CIFAR-100.

A. Experimental configurations

In this paper, the classic convolutional neural network model VGG-16 [26] is selected as the basic model. VGG-16 is composed of 13 3 × 3 convolutional layers and 3 fully connected layers. Instead of employing pooling, convolutional layers with step size are utilized to conduct the down-sampling operation [3]. As a regularization method, we add dropout [27] to the first two of the three fully connected layers and set the activation rate to 50%. In addition, we add batch normalization layers [11] after all convolutional layers. The Xavier initialization strategy [5] is employed to initialize all the learned parameters. All the models utilize the cross-entropy loss as the loss function. The SA-GD training algorithm adopts mini-batch gradient descent with a momentum term of 0.95 and a weight decay coefficient of 0.0005; the batch size is set to 4096. As stated above, the learning rate strategies utilized include a fixed learning rate of 0.001 and an exponentially decaying learning rate strategy with an initial learning rate of 0.1 and a hyper-parameter γ = 0.9956. The training epochs are set to 2000 and 1000 respectively according to the above two different learning strategies. All experimental models have been trained with an NVIDIA GeForce GTX 2080Ti GPU.
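The training setup above can be written compactly in PyTorch; the torchvision VGG-16 constructor below is only a stand-in for the modified architecture described in the text (strided convolutions for down-sampling, dropout of 0.5 on the first two fully connected layers, batch normalization), and the epoch loop body is elided.

```python
import torch
import torch.nn as nn
from torchvision import models

# Placeholder backbone; the paper's modified VGG-16 variant would be defined here instead.
model = models.vgg16_bn(num_classes=10)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.95, weight_decay=5e-4)
# Exponentially decaying learning rate with gamma = 0.9956, stepped once per epoch
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9956)

for epoch in range(1000):
    # ... iterate over mini-batches (batch size 4096) and apply the SA-GD update ...
    scheduler.step()
```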

B. Dataset

CIFAR-10: The CIFAR-10 dataset consists of 10 classes of natural images, with 50,000 training images and 10,000 testing images in total. The 10 classes include: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. Each image is an RGB image of size 32 × 32.

CIFAR-100: The CIFAR-100 dataset consists of 100 different classifications, and each classification includes 600 different color images, of which 500 are training images and 100 are test images. As a matter of fact, these 100 classes are composed of 20 super classes, and each super class possesses 5 child classes. The images in the CIFAR-100 dataset have a size of 32 × 32, like CIFAR-10.

SVHN: The SVHN (Street View House Numbers) dataset is composed of 630,420 RGB digital images with a size of 32 × 32, including a training set with 73,257 images and a test set with 26,032 images.

Fashion-MNIST: Fashion-MNIST (FMNIST for short) is a dataset of 10 classes of fashion items, including T-shirt/top, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag, and ankle boot. Each sample is a gray image with a size of 28 × 28. This dataset contains 60,000 training images and 10,000 test images in total.

Data augmentation: For the training data of the CIFAR-10 and CIFAR-100 datasets, we adopt the following data augmentation: four circles of zero pixels are padded around the original image, and the padded image is then randomly cropped to the size of the original image. Then we flip the image horizontally with a probability of 0.5.

C. Compared approaches

Six groups of comparative experiments are carried out on all four selected datasets. They are:

• GD-Fixed stands for the method employing the traditional gradient descent algorithm with a fixed learning rate of 0.001 as the training strategy.
• SA-Fixed represents the SA-GD training algorithm without considering the learning rate, as depicted above; the initial temperature is set to T_0 = 0.0001 and the learning rate 0.001 is fixed.
• SA-Fixed-LR stands for the SA-GD algorithm considering the fixed learning rate 0.001; the temperature is initially set to T_0 = 1/9.
• GD-Exp utilizes the traditional gradient descent algorithm with the exponentially decaying learning rate strategy.
• SA-Exp is the SA-GD algorithm without considering the learning rate; the initial temperature is set to T_0 = 0.0001.
• DSA-Exp stands for the SA-GD algorithm considering the exponentially decaying learning rate as stated above; the initial temperature is set to T_0 = 15.

To be seen more clearly, the configurations of these six compared methods are shown in Table I. Here, the hyper-parameter expansion factor σ stated in Eq. 9 is set to be constant one. Moreover, in spite of both utilizing the fixed learning rate, the difference between methods SA-Fixed and SA-Fixed-LR actually lies in the different choices of the initial temperature T_0.

TABLE I
CONFIGURATIONS OF SIX COMPARED METHODS

Method       SA     LR strategy           Consider LR   T0
GD-Fixed     False  Fixed                 -             -
SA-Fixed     True   Fixed                 False         0.0001
SA-Fixed-LR  True   Fixed                 True          1/9
GD-Exp       False  Exponentially decay   -             -
SA-Exp       True   Exponentially decay   False         0.0001
DSA-Exp      True   Exponentially decay   True          15

D. Results

We tested all the methods five times each on the four challenging benchmark datasets; Table II exhibits the mean value of the five repeated experiments.

TABLE II
EXPERIMENTAL RESULTS OF SIX COMPARED METHODS

Method       CIFAR-10   CIFAR-100   SVHN     F-MNIST
GD-Fixed     88.83%     65.22%      92.73%   92.45%
SA-Fixed     89.56%     66.02%      93.24%   92.77%
SA-Fixed-LR  89.71%     65.98%      93.29%   92.73%
GD-Exp       89.54%     67.21%      93.46%   92.71%
SA-Exp       90.17%     68.05%      93.94%   92.94%
DSA-Exp      90.21%     68.07%      94.03%   93.01%

From Table II we can discover that all the models trained with the SA-GD algorithm achieved better performance than those utilizing the traditional gradient descent algorithm, regardless of the learning rate strategy and of whether the learning rate is considered in the transition probability function. These experimental results validate that the gradient descent algorithm improved with the thought of simulated annealing can indeed enhance the generalization ability of machine learning models. To be specific, the methods SA-Fixed and SA-Fixed-LR, which endow the models with the ability of mounting hills, surpass the traditional gradient descent by 0.73% and 0.88% respectively on the benchmark dataset CIFAR-10.

Fig. 9. The experimental results of testing methods GD-Exp, SA-Exp, and DSA-Exp for one hundred times


Fig. 10. The training and test loss curves of methods GD-Exp, SA-Exp, and DSA-Exp.

Similarly, the methods SA-Exp and DSA-Exp, which utilize the exponentially decaying learning rate strategy, bring enhancements of 0.63% and 0.71% respectively over the traditional method. Moreover, on the more challenging dataset CIFAR-100, the enhancements in test accuracy that the SA-GD methods bring over the traditional gradient descent all surpass 0.8%, which can be regarded as significant.

Besides, the method DSA-Exp, which takes the exponentially decaying learning rate into consideration and employs the elaborated transition probability function, achieves the highest mean test accuracy on every dataset. However, when comparing it to the SA-Exp method without considering the learning rate, the improvement that the DSA-Exp method brings is only slight. In addition, the extent of enhancement is slightly lower compared to that of the SA methods employing the fixed learning rate on the dataset CIFAR-10. As a result, how to design more proper transition probability functions, with or without taking the learning rate into consideration, will be future work.

E. Algorithm Analysis

Analysis of stability: To explore the stability of the SA-GD and traditional gradient descent algorithms, we tested the methods GD-Exp, SA-Exp, and DSA-Exp, which performed better than the methods utilizing the fixed learning rate, one hundred times each. The test accuracy of each experiment is exhibited in Fig. 9. The mean values of the GD-Exp, SA-Exp, and DSA-Exp methods are 89.54%, 90.17%, and 90.21% respectively, as exhibited in Table II. The variance of method GD-Exp is 0.00308%, which is much higher than the 0.00072% obtained by method SA-Exp and the 0.00176% obtained by method DSA-Exp. This convincingly illustrates that the stability of the SA-GD algorithm is better than that of the traditional gradient descent training algorithm.

Analysis of convergence: To explore the convergence speeds of the SA-GD and traditional gradient descent algorithms, we analyze the training and test loss curves of methods GD-Exp, SA-Exp, and DSA-Exp. Fig. 10 shows the training and test loss curves of these selected methods. It is obvious that the convergence speeds of the different methods are almost the same. Though the SA-GD algorithm brings some fluctuations in the first half of the training process, utilizing SA-GD does not make much difference to the speed of convergence compared to the traditional gradient descent algorithm.

V. ENSEMBLE STRATEGY WITH SA-GD

Ensemble learning is a widely utilized machine learning strategy which aggregates multiple different models to obtain better generalization ability than a single model [23]. It is required that the aggregated models not only possess relatively good performance individually, but are also diverse from each other [20]. Snapshot ensembling [10] is one ensemble strategy based on the cyclic cosine annealing learning rate strategy, which aggregates models converging to different local minimums during the training process. Since SA-GD equips the model with the ability of mounting hills, the models obtained in different training phases are supposed to converge to different local minimums as well. As a consequence, the models in different phases are diverse but all possess fine performance.
Therefore, similar to snapshot ensembling, we can aggregate the models from different training epochs during the training process of SA-GD to further enhance the generalization ability of the machine learning models.

We tested SA-GD ensembling on two challenging benchmark datasets, namely CIFAR-10 and CIFAR-100. The total number of training epochs is 2000. The ensembling models are selected at equal intervals of epochs starting from the 800-th epoch. To further strengthen the hill-mounting ability of the cost function and to enhance the probability of the ensembling models converging to various local minimums, we also tested larger expansion factors σ, as stated in Eq. 9. The experimental configurations, the results obtained, and the differences to the baseline are exhibited in Table III. In Table III, the column Ens. represents whether a certain method utilizes ensembling, the column Num. stands for the number of models aggregated, and σ is the expansion factor chosen.

TABLE III
EXPERIMENTAL RESULTS OF METHODS WITH OR WITHOUT SA-GD ENSEMBLING

Method  Ens.   Num.  σ  CIFAR-10          CIFAR-100
GD      False  -     -  88.83% (+0.00%)   65.22% (+0.00%)
GD      True   12    -  88.90% (+0.07%)   65.26% (+0.04%)
SA-GD   True   12    1  89.66% (+0.83%)   66.87% (+1.65%)
SA-GD   True   12    2  89.85% (+1.02%)   67.16% (+1.94%)
SA-GD   True   12    3  90.78% (+1.95%)   68.06% (+2.84%)
SA-GD   True   12    4  91.28% (+2.45%)   69.17% (+3.95%)
GD      True   24    -  88.80% (-0.03%)   65.30% (+0.08%)
SA-GD   True   24    1  90.11% (+1.28%)   67.53% (+2.31%)
SA-GD   True   24    2  90.97% (+2.14%)   68.99% (+3.77%)
SA-GD   True   24    3  91.53% (+2.70%)   69.53% (+4.31%)
SA-GD   True   24    4  92.31% (+3.47%)   71.14% (+5.92%)

As we can see in Table III, aggregating models during the training process of traditional gradient descent fails to bring improvement to the final prediction performance on both benchmark datasets; moreover, increasing the number of ensembling models does not ameliorate this phenomenon. On the contrary, all methods with the SA-GD ensembling strategy bring an astonishing improvement to the final prediction accuracy. More specifically, SA-GD ensembling with 24 models and an expansion factor of 4 surpasses the baseline by 3.47% on the CIFAR-10 dataset and 5.92% on the CIFAR-100 dataset. In addition, aggregating 24 models outperforms aggregating 12 models, indicating that significant diversity among the ensembling models results from the SA-GD ensembling strategy. Moreover, utilizing larger expansion factors brings higher enhancement than utilizing smaller expansion factors. This phenomenon validates the statement that a larger expansion factor can further strengthen the hill-mounting ability of the cost function and make the ensembling models more likely to converge to various local minimums.
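A minimal sketch of how the selected checkpoints could be aggregated at inference time by averaging their softmax outputs follows; the checkpoint paths, the model object, and the snapshot interval are hypothetical.

```python
import torch
import torch.nn.functional as F

def ensemble_predict(model, checkpoint_paths, x):
    """Average the softmax predictions of several saved SA-GD checkpoints (sketch)."""
    probs = None
    for path in checkpoint_paths:          # e.g. snapshots taken at fixed intervals after epoch 800
        model.load_state_dict(torch.load(path))
        model.eval()
        with torch.no_grad():
            p = F.softmax(model(x), dim=1)
        probs = p if probs is None else probs + p
    # Return the class with the highest averaged probability for each input
    return (probs / len(checkpoint_paths)).argmax(dim=1)
```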
VI. CONCLUSION

In this paper, we introduced a novel training algorithm named SA-GD, which adopts the thought of the simulated annealing algorithm to improve the performance of the traditional gradient descent algorithm. In addition, we exhibit the evolution and reasoning procedure of the SA-GD algorithm step by step.

The SA-GD algorithm is supposed to make the loss function leave the areas of local minimums or saddle points, and to enhance the generalization ability and robustness of models as a result. Experimental results validate that SA-GD can actually improve the model test accuracy significantly without sacrificing the speed or stability of model convergence. Moreover, the SA-GD algorithm can be utilized as an ensemble learning method and enhance prediction accuracy further.

We think that the formulation of the gradient ascent probability function utilized in SA-GD deserves further exploration.

REFERENCES

[1] Isabel Beichl and Francis Sullivan. The Metropolis algorithm. Computing in Science & Engineering, 2(1):65-69, 2000.
[2] Shobha Bhatt, Anurag Jain, and Amita Dev. Continuous speech recognition technologies—a review. Recent Developments in Acoustics, pages 85-94, 2021.
[3] Zhicheng Cai and Chenglei Peng. A study on training fine-tuning of convolutional neural networks. In 2021 13th International Conference on Knowledge and Smart Technology (KST), pages 84-89. IEEE, 2021.
[4] Ke-Lin Du and M. N. S. Swamy. Simulated annealing. In Search and Optimization by Metaheuristics, pages 29-36. Springer, 2016.
[5] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. Journal of Machine Learning Research, 9:249-256, 2010.
[6] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
[7] Jürgen Groß. Linear Regression, volume 175. Springer Science & Business Media, 2012.
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[9] Geoffrey E. Hinton. Boltzmann machine. Scholarpedia, 2(5):1668, 2007.
[10] Gao Huang, Yixuan Li, Geoff Pleiss, Zhuang Liu, John E. Hopcroft, and Kilian Q. Weinberger. Snapshot ensembles: Train 1, get M for free. arXiv preprint arXiv:1704.00109, 2017.
[11] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448-456. PMLR, 2015.
[12] Nikhil Ketkar. Stochastic gradient descent. In Deep Learning with Python, pages 113-132. Springer, 2017.
[13] Sarit Khirirat, Hamid Reza Feyzmahdavian, and Mikael Johansson. Mini-batch gradient descent: Faster convergence under data sparsity. In 2017 IEEE 56th Annual Conference on Decision and Control (CDC), pages 2880-2887. IEEE, 2017.
[14] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[15] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.
[16] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436-444, 2015.
[17] Jing Li, Ji-hang Cheng, Jing-yuan Shi, and Fei Huang. Brief introduction of back propagation (BP) neural network algorithm and its improvement. In Advances in Computer Science and Information Engineering, pages 553-558. Springer, 2012.
[18] Jiadong Lin, Chuanbiao Song, Kun He, Liwei Wang, and John E. Hopcroft. Nesterov accelerated gradient and scale invariance for adversarial attacks. arXiv preprint arXiv:1908.06281, 2019.
[19] Agnes Lydia and Sagayaraj Francis. AdaGrad—an optimizer for stochastic gradient descent. Int. J. Inf. Comput. Sci., 6(5), 2019.
[20] Robi Polikar. Ensemble learning. In Ensemble Machine Learning, pages 1-34. Springer, 2012.
[21] Ning Qian. On the momentum term in gradient descent learning algorithms. Neural Networks, 12(1):145-151, 1999.
[22] Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.
[23] Omer Sagi and Lior Rokach. Ensemble learning: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(4):e1249, 2018.
[24] Randall S. Sexton, Robert E. Dorsey, and John D. Johnson. Beyond backpropagation: Using simulated annealing for training neural networks. Journal of Organizational and End User Computing (JOEUC), 11(3):3-10, 1999.
[25] Naichen Shi, Dawei Li, Mingyi Hong, and Ruoyu Sun. RMSProp converges with proper hyper-parameter. In International Conference on Learning Representations, 2020.
[26] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[27] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929-1958, 2014.
[28] Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, et al. MLP-Mixer: An all-MLP architecture for vision. arXiv preprint arXiv:2105.01601, 2021.
[29] Peter J. M. van Laarhoven and Emile H. L. Aarts. Simulated annealing. In Simulated Annealing: Theory and Applications, pages 7-15. Springer, 1987.
[30] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998-6008, 2017.
[31] Athanasios Voulodimos, Nikolaos Doulamis, Anastasios Doulamis, and Eftychios Protopapadakis. Deep learning for computer vision: A brief review. Computational Intelligence and Neuroscience, 2018, 2018.
[32] Marco A. Wiering and Martijn van Otterlo. Reinforcement learning. Adaptation, Learning, and Optimization, 12(3), 2012.
[33] Thomas Wolf, Julien Chaumond, Lysandre Debut, Victor Sanh, Clement Delangue, Anthony Moi, Pierric Cistac, Morgan Funtowicz, Joe Davison, Sam Shleifer, et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38-45, 2020.
[34] Matthew D. Zeiler. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.