Chapter 8 Stochastic Gradient / Subgradient Methods

Contents (class version)

8.0 Introduction
8.1 The subgradient method
    Subgradients and subdifferentials
    Properties of subdifferentials
    Convergence of the subgradient method
8.2 Example: Hinge loss with 1-norm regularizer for binary classifier design
8.3 Incremental (sub)gradient method
    Incremental (sub)gradient method
8.4 Stochastic gradient (SG) method
    SG update
    Stochastic gradient algorithm: convergence analysis
    Variance reduction: overview
    Momentum
    Adaptive step-sizes
8.5 Example: X-ray CT reconstruction
8.6 Summary

© J. Fessler, April 12, 2020, 17:55 (class version)

8.0 Introduction

This chapter describes two families of algorithms:
• subgradient methods
• stochastic gradient methods, also known as stochastic gradient descent methods

Often we turn to these methods as a “last resort,” for applications where none of the methods discussed previously are suitable. Many machine learning applications, such as training artificial neural networks, use such methods. As stated in [1], “large-scale machine learning represents a distinctive setting in which traditional nonlinear optimization techniques typically falter.” For recent surveys, especially about stochastic gradient methods, see [1,2].

Acknowledgment

This chapter was based in part on slides made by Naveen Murthy in April 2019.

Rate of convergence review

Suppose the sequence $\{x_k\}$ converges to $\hat{x}$. Consider the limiting ratio
$$\mu \triangleq \lim_{k \to \infty} \frac{\|x_{k+1} - \hat{x}\|_2}{\|x_k - \hat{x}\|_2}.$$
We define the rate of convergence of the sequence $\{x_k\}$ as follows:
• Converges linearly with rate $\mu$ if $\mu \in (0, 1)$
• Converges sublinearly if $\mu = 1$
• Converges super-linearly if $\mu = 0$

Example: For $x_k = \rho^k$, with $0 < |\rho| < 1$,
$$\frac{|x_{k+1} - 0|}{|x_k - 0|} = \frac{|\rho|^{k+1}}{|\rho|^k} = |\rho| = \mu,$$
so the sequence converges linearly.

Example: For $x_k = 1/k^c$, with $c > 0$,
$$\frac{|x_{k+1} - 0|}{|x_k - 0|} = \frac{1/(k+1)^c}{1/k^c} = \left(\frac{k}{k+1}\right)^c \to 1 = \mu,$$
so the convergence is sublinear.

For the sequence $x_k = 1 - \frac{1}{3^k} - \frac{5}{k^2}$, for $k = 1, 2, \ldots$, the value of $\mu$ is
A: 0   B: 1/5   C: 1/3   D: 1/2   E: 1   ??

Gradient descent (GD) review

• GD update: $x_{k+1} = x_k - \alpha \nabla f(x_k)$.
• Converges to a minimizer if $f$ is convex and differentiable, $\nabla f$ is $L$-Lipschitz continuous, and $0 < \alpha < 2/L$; simple to analyze
• Worst-case sublinear rate of convergence of $O(1/k)$ if $\alpha \le 1/L$
• Can be improved to a linear rate $O(\rho^k)$, where $\rho < 1$, under strong convexity assumptions on $f$
• In the usual case where $f(x) = \sum_{m=1}^{M} f_m(x)$, gradient computation is linear in $M$, i.e., takes $O(M)$ time. $\Longrightarrow$ Doubling the number of examples in the training set doubles the gradient computation cost.

GD is a “batch” method: $\nabla f$ uses all available data at once.

The methods in this chapter relax the differentiability requirement, and scale better for large-scale problems.

Example. ImageNet [3] contains ∼14 million images with more than 20,000 categories.

8.1 The subgradient method

The $O(1/k)$ convergence rate of ISTA in Ch. 4 (aka PGM) may seem quite slow, and it is by modern standards, but convergence rates can be even worse!

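To get a feel for how much "worse" a rate can be, here is a small Python sketch (an illustration added here, not part of the original notes). It evaluates the ratio $|x_{k+1} - \hat{x}| / |x_k - \hat{x}|$ from the rate-of-convergence review for three model sequences with limit $\hat{x} = 0$, and counts the iterations needed to reach a given tolerance. The helper names `ratio` and `iterations_to_reach`, and the choices $\rho = 1/2$, $c = 1$ and $c = 1/2$, are assumptions made for this example; the $1/\sqrt{k}$ sequence is only a stand-in for an error that decays like the $O(1/\sqrt{k})$ bound quoted later for the subgradient method.

```python
import math


def ratio(seq, k, limit=0.0):
    """Return |x_{k+1} - limit| / |x_k - limit| for the sequence seq(k)."""
    return abs(seq(k + 1) - limit) / abs(seq(k) - limit)


def iterations_to_reach(seq, tol, limit=0.0):
    """Smallest k >= 1 with |x_k - limit| <= tol, found by direct search."""
    k = 1
    while abs(seq(k) - limit) > tol:
        k += 1
    return k


def geometric(k):
    """x_k = rho^k with rho = 1/2 (linear convergence)."""
    return 0.5 ** k


def inverse_k(k):
    """x_k = 1/k^c with c = 1 (sublinear convergence)."""
    return 1.0 / k


def inverse_sqrt(k):
    """x_k = 1/k^c with c = 1/2 (sublinear, and slower still)."""
    return 1.0 / math.sqrt(k)


for name, seq in [("rho^k", geometric), ("1/k", inverse_k), ("1/sqrt(k)", inverse_sqrt)]:
    print(name, ratio(seq, 50), iterations_to_reach(seq, 1e-2))
```

The ratio for $\rho^k$ equals $|\rho| = 0.5$ at every $k$, whereas both power-law sequences have ratios approaching 1, so the ratio $\mu$ alone does not distinguish them. The iteration counts do: a tolerance of $10^{-2}$ needs on the order of $10^2$ iterations at a $1/k$ rate but on the order of $10^4$ at a $1/\sqrt{k}$ rate.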
A classic way to seek a minimizer of a non-differentiable cost function $\Psi$ is the subgradient method, defined as [4,5]:
$$x_{k+1} = x_k - \alpha_k g_k, \quad g_k \in \partial\Psi(x_k), \tag{8.1}$$
where $g_k \in \partial\Psi(x_k)$ denotes a subgradient of the (nondifferentiable) function $\Psi$ at the current iterate $x_k$. This method was published (in Russian) by Naum Shor in 1962 for solving transportation problems [6, p. 4].

Subgradients and subdifferentials

Define. If $f : D \mapsto \mathbb{R}$ is a real-valued convex function defined on a convex open set $D \subset \mathbb{R}^N$, a vector $g \in \mathbb{R}^N$ is called a subgradient at a point $x_0 \in D$ iff for all $z \in D$ we have
$$f(z) - f(x_0) \ge \langle g, z - x_0 \rangle.$$

Define. The set of all subgradients at $x_0$ is called the subdifferential at $x_0$ and is denoted $\partial f(x_0)$ [6, p. 8].

Example. A rectified linear unit (ReLU) in an ANN uses the rectifier function that has the following definition and subdifferential:
$$r(x) = \max(x, 0), \qquad \partial r(x) = \begin{cases} \{0\}, & x < 0 \\ [0, 1], & x = 0 \\ \{1\}, & 0 < x. \end{cases}$$
[Figure: plots of ReLU(x) and ∂ReLU(x) versus x.] (Draw tangent lines on ReLU.)

For this example, the derivative is defined almost everywhere, i.e., everywhere but a set of Lebesgue measure equal to zero, also known as a null set. Specifically, here the derivative is defined on the entire real line except at the point $x = 0$. In most SIPML applications, the derivatives are defined except for a finite set of points.

Unfortunately, even for a convex function the direction negative to that of an arbitrary subgradient is not always a direction of descent [6, p. 4].

Example. For $f(x) = |x|$, the subdifferential is
$$\partial f(x) = \begin{cases} \{-1\}, & x < 0 \\ [-1, 1], & x = 0 \\ \{1\}, & 0 < x. \end{cases}$$
At $x = 0$ all elements of the subdifferential have negatives that are ascent directions except for 0.

Properties of subdifferentials

Define. The subdifferential of a convex function $f : \mathbb{R}^N \mapsto \mathbb{R}$ is this set of subgradient vectors:
$$\partial f(x) \triangleq \left\{ g \in \mathbb{R}^N : f(z) - f(x) \ge \langle g, z - x \rangle, \ \forall z \in \mathbb{R}^N \right\}.$$

Properties [7].
• A convex function $f : D \mapsto \mathbb{R}$ is differentiable at $x \in D$ iff $\partial f(x) = \{\nabla f(x)\}$. So a subgradient is a generalization of a gradient (for convex functions).
• For any $x \in D$, the subdifferential is a nonempty convex and compact set (closed and bounded) [6, p. 9].
• If convex functions $f, g : \mathbb{R}^N \mapsto \mathbb{R}$ have subdifferentials $\partial f$ and $\partial g$, respectively, and $h(x) = f(x) + g(x)$, then, for all $x \in \mathbb{R}^N$, (HW)
$$\partial h(x) = \partial (f + g)(x) = \partial f(x) + \partial g(x) = \{u + v : u \in \partial f(x), \ v \in \partial g(x)\},$$
where the “+” here denotes the Minkowski sum of two sets. The subdifferential of a sum of convex functions is the (set) sum of their subdifferentials [6, p. 13].
• If convex function $f : \mathbb{R}^N \mapsto \mathbb{R}$ has subdifferential $\partial f$ and $h(x) \triangleq \alpha f(x)$ for $\alpha \in \mathbb{R}$, then $\partial h(x) = \alpha \, \partial f(x)$ for all $x \in \mathbb{R}^N$. (?)
A: True   B: False   ??
True when $\alpha \ge 0$ (HW)
• $x$ is a global minimizer of a convex function $f$ iff $0 \in \partial f(x)$ [6, p. 12].
Example. $f(x) = \max(|x|, 1)$.
[Figure: plots of f(x) and ∂f(x) versus x.]
• One can define convex functions and subdifferentials using the extended reals $\mathbb{R} \cup \{\infty\}$ [7].
• There are also generalizations for non-convex functions [6, p. 17] [7].
• A chain rule for affine arguments [8]. If $g(x) = f(Ax + b)$ for convex $f : \mathbb{R}^M \mapsto \mathbb{R}$, for $x \in \mathbb{R}^N$ and $A \in \mathbb{R}^{M \times N}$, then we saw in earlier HW that $g : \mathbb{R}^N \mapsto \mathbb{R}$ is convex, and furthermore [7,8]:
$$\partial g(x) = A' \, \partial f(Ax + b). \tag{8.2}$$

Proof that $v \in \partial f(Ax + b) \Longrightarrow A'v \in \partial g(x)$. (Read)
Given that $v \in \partial f(Ax + b)$, we know that
$$\forall y \in \mathbb{R}^M : \quad f(y) - f(Ax + b) \ge v' \left( y - (Ax + b) \right).$$
In particular, choosing $y = Az + b$ for any $z \in \mathbb{R}^N$ we have
$$f(Az + b) - f(Ax + b) \ge v' \left( (Az + b) - (Ax + b) \right) = v'A (z - x) = (A'v)' (z - x).$$
So by construction, for any $z \in \mathbb{R}^N$ we have
$$g(z) - g(x) \ge (A'v)' (z - x),$$
implying that $A'v \in \partial g(x)$. Thus we have shown $A' \, \partial f(Ax + b) \subseteq \partial g(x)$. $\Box$

Showing that $\partial g(x) \subseteq A' \, \partial f(Ax + b)$, to complete the proof of (8.2), seems to be more complicated. For details: https://maunamn.wordpress.com/9-the-subdifferential-chain-rule

Example. Consider compressed sensing with analysis sparsity regularizer and $\beta \ge 0$:
$$\Psi(x) = \frac{1}{2} \|Ax - y\|_2^2 + \beta \|Tx\|_1 \ \Longrightarrow \ A'(Ax - y) + \beta \, T' \operatorname{sign}.(Tx) \in \partial\Psi(x),$$
because $\operatorname{sign}(t)$ is a subgradient of $|t|$ (the notation $\operatorname{sign}.(\cdot)$ means the sign function is applied elementwise), where we have applied several of the above properties: derivative, sum, scaling, affine.

Convergence of the subgradient method

For suitable choice of $\{\alpha_k\}$, and suitable assumptions on $\Psi$ such as convexity, the convergence rate of the subgradient method (SGM) (8.1) is bounded by $O(1/\sqrt{k})$ [4–6]. (See result on subsequent pages.)

Diminishing step sizes

Convergence theorems for SGM often assume that the step sizes diminish, but not too quickly. Specifically, often we assume they satisfy:
$$\alpha_k > 0, \qquad \alpha_k \to 0, \qquad \sum_{k=1}^{\infty} \alpha_k = \infty. \tag{8.3}$$

SGM convergence for diminishing step sizes

A classic convergence theorem for SGM is the following [6, p. 26]. If $\Psi$ is convex and has a bounded set of minimizers $\mathcal{X}_*$, and $\{\alpha_k\}$ satisfies (8.3), then the sequence $\{x_k\}$ generated by (8.4) for any $x_0$ has the property that either
• the sequence $\{g(x_k)\}$ is bounded and the sequence $\{x_k\}$ converges in the sense that $\{d(x_k, \mathcal{X}_*)\} \to 0$ and $\{\Psi(x_k)\} \to \Psi_*$, or
• the sequence $\{g(x_k)\}$ is unbounded and there is no convergence.
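
As a rough illustration of the subgradient update (8.1) with step sizes that satisfy the conditions (8.3), the following Python sketch minimizes a hypothetical one-dimensional cost $\Psi(x) = |x - 2| + \frac{1}{2}|x|$, whose unique minimizer is $x = 2$ with $\Psi_* = 1$. The cost function, the step-size choice $\alpha_k = 1/k$, the starting point, and all function names are assumptions made for this example; they do not come from the notes. The sketch also tracks the best iterate seen so far and counts how often the cost increases, because a subgradient step need not be a descent step.

```python
def sign(t):
    """Sign of t, with sign(0) = 0, which is one valid subgradient of |t| at 0."""
    return (t > 0) - (t < 0)


def psi(x):
    """Hypothetical nonsmooth convex cost: |x - 2| + 0.5 |x|."""
    return abs(x - 2.0) + 0.5 * abs(x)


def subgrad_psi(x):
    """One element of the subdifferential of psi at x (sum rule)."""
    return sign(x - 2.0) + 0.5 * sign(x)


def subgradient_method(x0, n_iter):
    """Run update (8.1) with alpha_k = 1/k; return best iterate, best cost,
    and the number of iterations at which the cost went up."""
    x = x0
    x_best, psi_best = x, psi(x)
    n_increases = 0
    for k in range(1, n_iter + 1):
        alpha = 1.0 / k  # positive, tends to 0, and the sum over k diverges
        x_new = x - alpha * subgrad_psi(x)
        if psi(x_new) > psi(x):
            n_increases += 1  # this step was not a descent step
        x = x_new
        if psi(x) < psi_best:
            x_best, psi_best = x, psi(x)
    return x_best, psi_best, n_increases


x_best, psi_best, n_increases = subgradient_method(x0=0.0, n_iter=2000)
print(x_best, psi_best, n_increases)
```

In this example the iterates overshoot the kink at $x = 2$ repeatedly, so the cost is not monotone, yet the overshoot shrinks with the step size and the best cost approaches $\Psi_* = 1$. Returning the best iterate rather than the last one is a common practical choice with such non-monotone methods.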