Part 2. Gradient and Subgradient Methods for Unconstrained Convex Optimization


Math 126, Winter 18. Date of current version: January 29, 2018.
Donghwan Kim, Dartmouth College. E-mail: [email protected]

Abstract This note studies (sub)gradient methods for unconstrained convex optimization. Many parts of this note are based on the chapters [1, Chapter 4], [2, Chapters 3, 5, 8, 10], [5, Chapter 9], [14, Chapters 2, 3] and the corresponding lecture notes available online by their authors. Please email me if you find any typos or errors.

1 Unconstrained Convex Optimization (see [5, Chapter 9])

We consider the following unconstrained problem:

    \min_{x \in \mathbb{R}^d} f(x),                                          (1.1)

where we assume that
- $f : \mathbb{R}^d \to \mathbb{R}$ is convex and continuously differentiable,
- the problem is solvable, i.e., there exists an optimal point $x_*$.

Note that solving (1.1) is equivalent to solving the optimality condition:

    find $x \in \mathbb{R}^d$ s.t. $\nabla f(x) = 0$.                        (1.2)

Iterative algorithms generate a sequence of points $x_0, x_1, \ldots \in \operatorname{dom} f$ with $f(x_k) \to p_*$, where $p_*$ is the optimal value. The iterative algorithm is terminated, for example, when $\|\nabla f(x_k)\| \le \epsilon$, where $\epsilon > 0$ is some specified tolerance.

2 Descent Methods (see [1, Chapter 4] [5, Chapter 9])

2.1 General Descent Methods

Definition 2.1 Let $f : \mathbb{R}^d \to \mathbb{R}$ be a continuously differentiable function over $\mathbb{R}^d$. A vector $0 \neq d \in \mathbb{R}^d$ is called a descent direction of $f$ at $x_k$ if it satisfies

    \nabla f(x_k)^\top d < 0,                                                (2.1)

i.e., it makes an acute angle with the negative gradient $-\nabla f(x_k)$.

Lemma 2.1 Let $f$ be a continuously differentiable function over $\mathbb{R}^d$, and let $x \in \mathbb{R}^d$. Suppose that $d$ is a descent direction of $f$ at $x$. Then there exists $\epsilon > 0$ such that

    f(x + sd) < f(x)                                                         (2.2)

for any $s \in (0, \epsilon]$.

Proof Since $\nabla f(x)^\top d < 0$, it follows from the definition of the directional derivative that

    \lim_{s \to 0^+} \frac{f(x + sd) - f(x)}{s} = \nabla f(x)^\top d < 0.

Therefore, there exists $\epsilon > 0$ such that

    \frac{f(x + sd) - f(x)}{s} < 0

for any $s \in (0, \epsilon]$.

The outline of general descent methods is as follows.

Algorithm 1 General descent methods
1: Input: $x_0 \in \operatorname{dom} f$.
2: for $k \ge 0$ do
3:   Determine a descent direction $d_k$.
4:   Choose a step size $s_k > 0$ satisfying $f(x_k + s_k d_k) < f(x_k)$.
5:   $x_{k+1} = x_k + s_k d_k$.
6:   If a stopping criterion is satisfied, then stop.

One can choose a step size $s_k$ at each iteration by one of the following approaches (a code sketch of the backtracking rule is given after Example 2.1):
- Exact line search: $s_k = \operatorname{argmin}_{s \ge 0} f(x_k + s d_k)$.
- Backtracking line search: starting from an initial $s > 0$, repeat $s \leftarrow \beta s$ until the following sufficient decrease condition is satisfied:

      f(x + sd) < f(x) + \alpha s \nabla f(x)^\top d

  with parameters $\alpha \in (0, 1)$ and $\beta \in (0, 1)$.
- Constant step size: $s_k = s$ (to be further studied later in this note).

Example 2.1 Exact line search for quadratic functions. Let $f(x) = \frac{1}{2} x^\top Q x + p^\top x$, where $Q \in \mathbb{S}_{++}^d$ and $p \in \mathbb{R}^d$. Let $d \in \mathbb{R}^d$ be a descent direction of $f$ at $x \in \mathbb{R}^d$. We will derive an explicit formula for the step size generated by exact line search:

    \hat{s}(x, d) = \operatorname{argmin}_{s \ge 0} f(x + sd),               (2.3)

where we have

    f(x + sd) = \frac{1}{2}(x + sd)^\top Q (x + sd) + p^\top (x + sd)
              = \frac{1}{2}(d^\top Q d) s^2 + (d^\top Q x + d^\top p) s + \frac{1}{2} x^\top Q x + p^\top x.

The optimality condition of (2.3) is

    \frac{d}{ds} f(x + sd) = (d^\top Q d) s + (d^\top Q x + d^\top p) = 0,

and using $\nabla f(x) = Qx + p$ we have

    \hat{s}(x, d) = -\frac{d^\top \nabla f(x)}{d^\top Q d}.

Note that the exact line search is not easy to compute in general.
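To make the backtracking rule concrete, here is a minimal Python sketch (not part of the original note; the function name `backtracking` and the default parameter values are illustrative assumptions):

```python
import numpy as np

def backtracking(f, grad_f, x, d, s0=1.0, alpha=0.3, beta=0.5):
    """Shrink s by beta until the sufficient decrease condition
    f(x + s d) < f(x) + alpha * s * grad_f(x)^T d holds."""
    s = s0
    fx, slope = f(x), grad_f(x) @ d  # slope < 0 for a descent direction
    while f(x + s * d) >= fx + alpha * s * slope:
        s *= beta
    return s
```

By the same directional-derivative argument as in Lemma 2.1 (with the slope scaled by $\alpha < 1$), the loop terminates after finitely many reductions whenever $d$ is a descent direction.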
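The closed-form step $\hat{s}(x, d)$ of Example 2.1 is also easy to test numerically; the following sketch (with randomly generated problem data of my own choosing) compares it against a grid search over $s$:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
Q = A @ A.T + 5 * np.eye(5)              # Q in S^d_{++}
p = rng.standard_normal(5)

f = lambda x: 0.5 * x @ Q @ x + p @ x
grad_f = lambda x: Q @ x + p             # since grad f(x) = Qx + p

x = rng.standard_normal(5)
d = -grad_f(x)                           # the negative gradient is a descent direction
s_hat = -(d @ grad_f(x)) / (d @ Q @ d)   # exact line search step from Example 2.1

# the grid minimizer of s -> f(x + s d) should agree with s_hat
grid = np.linspace(0.0, 2.0 * s_hat, 1001)
s_grid = grid[np.argmin([f(x + s * d) for s in grid])]
print(s_hat, s_grid)                     # the two values should nearly coincide
```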
2.2 Gradient Descent Methods

Gradient descent methods take the negative gradient as a descent direction:

    d_k = -\nabla f(x_k),                                                    (2.4)

since $\nabla f(x_k)^\top d_k = -\|\nabla f(x_k)\|^2 < 0$ as long as $\nabla f(x_k) \neq 0$. The outline of gradient descent methods is as follows (a code sketch is given after Remark 2.1).

Algorithm 2 Gradient descent methods
1: Input: $x_0 \in \operatorname{dom} f$.
2: for $k \ge 0$ do
3:   $d_k = -\nabla f(x_k)$.
4:   Choose a step size $s_k > 0$ satisfying $f(x_k + s_k d_k) < f(x_k)$.
5:   $x_{k+1} = x_k + s_k d_k$.
6:   If a stopping criterion is satisfied, then stop.

2.3 Steepest Descent Methods

Definition 2.2 Let $\|\cdot\|$ be any norm on $\mathbb{R}^d$. We define a normalized steepest descent direction (with respect to the norm $\|\cdot\|$) as

    d_{\mathrm{nsd}} = \operatorname{argmin}_v \{\nabla f(x)^\top v : \|v\| = 1\}.    (2.5)

This is a unit-norm step with the most negative directional derivative $\nabla f(x)^\top v$. (Recall that $v$ is a descent direction if $\nabla f(x)^\top v < 0$.) In other words, a normalized steepest descent direction is the direction in the unit ball of $\|\cdot\|$ that extends farthest in the direction $-\nabla f(x)$.

Definition 2.3 An (unnormalized) steepest descent step is defined as

    d_{\mathrm{sd}} = \|\nabla f(x)\|_* \, d_{\mathrm{nsd}},                 (2.6)

where $\|z\|_* = \max \{\langle z, y \rangle : \|y\| \le 1\}$ denotes the dual norm (e.g., $\|\cdot\|_2$ is the dual norm of $\|\cdot\|_2$, and $\|\cdot\|_1$ is the dual norm of $\|\cdot\|_\infty$). This satisfies $\nabla f(x)^\top d_{\mathrm{sd}} = -\|\nabla f(x)\|_*^2$.

Example 2.2 Examples of steepest descent methods (see the code sketches after Remark 2.1).
- Euclidean norm ($\ell_2$-norm): $d_{\mathrm{sd}} = -\nabla f(x)$.
  - The resulting algorithm is a gradient descent method.
- Quadratic $P$-norm ($\|z\|_P = \|P^{1/2} z\|_2$, where $P \in \mathbb{S}_{++}^d$): $d_{\mathrm{sd}} = -P^{-1} \nabla f(x)$.
  - The resulting algorithm is a preconditioned gradient descent method with a preconditioner $P$ (or $P^{-1}$). This is equivalent to a gradient descent method with the change of coordinates $\bar{x} = P^{1/2} x$.
  - A good choice of $P$ (e.g., $P \approx \nabla^2 f(x_*)$) makes the condition number of the problem after the change of coordinates $\bar{x} = P^{1/2} x$ small, which likely makes the problem easier to solve.
- $\ell_1$-norm: $d_{\mathrm{sd}} = -\frac{\partial f(x)}{\partial x_i} e_i$ for the index $i$ satisfying $\|\nabla f(x)\|_\infty = |(\nabla f(x))_i|$, where $e_i$ is the $i$th standard basis vector.
  - The resulting algorithm is an instance of coordinate-descent methods that update only one component of $x$ at each iteration.

Remark 2.1 At the end of this term (in Part 6), we will study Newton's descent direction:

    d_{\mathrm{nt}} = -[\nabla^2 f(x)]^{-1} \nabla f(x),                     (2.7)

which is a steepest descent direction at $x$ in the local Hessian norm $\|\cdot\|_{\nabla^2 f(x)}$ when $\nabla^2 f(x) \in \mathbb{S}_{++}^d$.
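As promised above, here is a minimal Python sketch of Algorithm 2, combining the negative-gradient direction (2.4) with a backtracking step and the stopping criterion $\|\nabla f(x_k)\|_2 \le \epsilon$ from Section 1 (the function name and default values are my own illustrative choices, not the note's reference implementation):

```python
import numpy as np

def gradient_descent(f, grad_f, x0, eps=1e-6, max_iter=10000,
                     alpha=0.3, beta=0.5):
    """Sketch of Algorithm 2: d_k = -grad f(x_k), backtracking line search,
    terminate when ||grad f(x_k)||_2 <= eps."""
    x = x0
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= eps:       # stopping criterion
            break
        d = -g                             # descent direction (2.4)
        s, fx, slope = 1.0, f(x), g @ d    # slope = -||g||_2^2 < 0
        while f(x + s * d) >= fx + alpha * s * slope:
            s *= beta                      # backtracking line search
        x = x + s * d
    return x
```

On the quadratic of Example 2.1, `gradient_descent(f, grad_f, np.zeros(5))` should converge to the unique minimizer $-Q^{-1} p$.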
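And a sketch of the three steepest descent steps from Example 2.2 (again illustrative; the helper name `steepest_descent_step` is mine, and the $P$-norm case solves a linear system rather than forming $P^{-1}$ explicitly):

```python
import numpy as np

def steepest_descent_step(g, norm="l2", P=None):
    """Unnormalized steepest descent step d_sd for g = grad f(x), per Example 2.2."""
    if norm == "l2":                      # Euclidean norm: gradient descent
        return -g
    if norm == "P":                       # quadratic P-norm: preconditioned step
        return -np.linalg.solve(P, g)     # d_sd = -P^{-1} g
    if norm == "l1":                      # l1-norm: coordinate descent step
        i = np.argmax(np.abs(g))          # index attaining ||g||_inf
        d = np.zeros_like(g)
        d[i] = -g[i]                      # update one component of x only
        return d
    raise ValueError(f"unknown norm: {norm}")
```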
3 Convergence Analysis of the Gradient Method (see [1, Chapter 4] [2, Chapter 5] [14, Chapter 2])

3.1 Lipschitz Continuity of the Gradient

Definition 3.1 A function $f$ is said to be $L$-smooth if it is continuously differentiable and its gradient $\nabla f$ is Lipschitz continuous over $\mathbb{R}^d$, meaning that there exists $L > 0$ for which

    \|\nabla f(x) - \nabla f(y)\|_2 \le L \|x - y\|_2 \quad \text{for any } x, y \in \mathbb{R}^d.    (3.1)

We denote the class of functions with Lipschitz continuous gradient with constant $L$ by $\mathcal{C}_L^{1,1}(\mathbb{R}^d)$ or $\mathcal{C}_L^{1,1}$.¹ One can generalize the choice of the norm as

    \|\nabla f(x) - \nabla f(y)\|_* \le L \|x - y\|,                         (3.2)

where $\|\cdot\|_*$ is the dual norm of $\|\cdot\|$, but we will not consider this generalization in this class.

Lemma 3.1 If $f$ is twice continuously differentiable, then $f$ is $L$-smooth if and only if

    -L I \preceq \nabla^2 f(x) \preceq L I \quad (\text{or equivalently } \|\nabla^2 f(x)\|_2 \le L)    (3.3)

for all $x \in \mathbb{R}^d$.

Example 3.1 Smooth functions (both constants are checked numerically in the sketch at the end of this section).
- Quadratic functions: Let $Q \in \mathbb{S}^d$ and $p \in \mathbb{R}^d$. Then the function $f(x) = \frac{1}{2} x^\top Q x + p^\top x$ is a $\mathcal{C}_L^{1,1}$ function with $L = \|Q\|_2$, since

      \|\nabla f(x) - \nabla f(y)\|_2 = \|Qx + p - (Qy + p)\|_2 = \|Q(x - y)\|_2 \le \|Q\|_2 \cdot \|x - y\|_2.

- A convex function $f(x) = \sqrt{1 + \|x\|_2^2}$: We have $\nabla f(x) = \frac{1}{\sqrt{\|x\|_2^2 + 1}} x$ and

      \nabla^2 f(x) = \frac{1}{\sqrt{\|x\|_2^2 + 1}} I - \frac{1}{(\|x\|_2^2 + 1)^{3/2}} x x^\top \preceq \frac{1}{\sqrt{\|x\|_2^2 + 1}} I \preceq I.

  Therefore, $f$ is a $\mathcal{C}_L^{1,1}$ function with $L = 1$.

An important result for $\mathcal{C}_L^{1,1}$ functions is that they can be bounded above by a certain quadratic function, which is fundamental in convergence proofs of gradient-based methods.

Lemma 3.2 (Descent lemma) Let $f \in \mathcal{C}_L^{1,1}(\mathcal{D})$ for some $L > 0$ and a given convex set $\mathcal{D} \subseteq \mathbb{R}^d$. Then for any $x, y \in \mathcal{D}$,

    f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2} \|x - y\|_2^2.    (3.4)

Proof By the fundamental theorem of calculus, we have

    f(y) - f(x) = \int_0^1 \langle \nabla f(x + t(y - x)), y - x \rangle \, dt.

Therefore,

    f(y) - f(x) = \langle \nabla f(x), y - x \rangle + \int_0^1 \langle \nabla f(x + t(y - x)) - \nabla f(x), y - x \rangle \, dt.

Thus,

    |f(y) - f(x) - \langle \nabla f(x), y - x \rangle|
        = \left| \int_0^1 \langle \nabla f(x + t(y - x)) - \nabla f(x), y - x \rangle \, dt \right|
        \le \int_0^1 |\langle \nabla f(x + t(y - x)) - \nabla f(x), y - x \rangle| \, dt
        \le \int_0^1 \|\nabla f(x + t(y - x)) - \nabla f(x)\|_2 \cdot \|y - x\|_2 \, dt
        \le \int_0^1 t L \|y - x\|_2^2 \, dt
        = \frac{L}{2} \|y - x\|_2^2,

where the second inequality uses the generalized Cauchy-Schwarz inequality and the third inequality uses the $L$-smoothness.

¹ We denote by $\mathcal{C}_L^{i,j}(\mathcal{D})$ for $\mathcal{D} \subseteq \mathbb{R}^d$ the class of functions that are $i$-times continuously differentiable on $\mathcal{D}$ and whose $j$th derivative is Lipschitz continuous on $\mathcal{D}$.
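As referenced in Example 3.1, here is a quick numerical sanity check of the two smoothness constants (a sketch; the random test points, seed, and dimension are arbitrary choices of mine, not from the note):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((4, 4))
Q = B @ B.T                          # symmetric Q, so L = ||Q||_2
L_quad = np.linalg.norm(Q, 2)        # spectral norm of Q

grad_quad = lambda x: Q @ x                    # gradient of (1/2) x^T Q x (taking p = 0)
grad_sqrt = lambda x: x / np.sqrt(x @ x + 1)   # gradient of sqrt(1 + ||x||^2); L = 1

for _ in range(1000):
    x, y = rng.standard_normal(4), rng.standard_normal(4)
    dist = np.linalg.norm(x - y)
    assert np.linalg.norm(grad_quad(x) - grad_quad(y)) <= L_quad * dist + 1e-9
    assert np.linalg.norm(grad_sqrt(x) - grad_sqrt(y)) <= 1.0 * dist + 1e-9
```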
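The descent lemma (3.4) can be checked numerically the same way for the quadratic case, where $L = \|Q\|_2$ by Example 3.1 (again a sketch with arbitrary data of my choosing):

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((4, 4))
Q = B @ B.T + np.eye(4)
p = rng.standard_normal(4)
L = np.linalg.norm(Q, 2)             # smoothness constant of f

f = lambda x: 0.5 * x @ Q @ x + p @ x
grad_f = lambda x: Q @ x + p

for _ in range(1000):
    x, y = rng.standard_normal(4), rng.standard_normal(4)
    upper = f(x) + grad_f(x) @ (y - x) + 0.5 * L * np.linalg.norm(y - x) ** 2
    assert f(y) <= upper + 1e-9      # quadratic upper bound (3.4)
```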