12. Coordinate Descent Methods


EE 546, Univ of Washington, Spring 2014

Outline:
• theoretical justifications
• randomized coordinate descent method
• minimizing composite objectives
• accelerated coordinate descent method

Coordinate descent methods 12–1

Notations

consider the smooth unconstrained minimization problem

    minimize f(x)   over x ∈ R^N

• n coordinate blocks: x = (x_1, ..., x_n) with x_i ∈ R^{N_i} and N = Σ_{i=1}^n N_i
• more generally, partition with a permutation matrix U = [U_1 ⋯ U_n]:

    x_i = U_i^T x,   x = Σ_{i=1}^n U_i x_i

• blocks of the gradient:

    ∇_i f(x) = U_i^T ∇f(x)

• coordinate update:

    x⁺ = x − t U_i ∇_i f(x)

Coordinate descent methods 12–2

(Block) coordinate descent

choose x^{(0)} ∈ R^N, and iterate for k = 0, 1, 2, ...

1. choose a coordinate i(k)
2. update x^{(k+1)} = x^{(k)} − t_k U_{i(k)} ∇_{i(k)} f(x^{(k)})

• among the first schemes for solving smooth unconstrained problems
• cyclic or round-robin order: convergence is difficult to analyze
• mostly local convergence results for particular classes of problems
• does it really work (better than the full gradient method)?

Coordinate descent methods 12–3

Steepest coordinate descent

choose x^{(0)} ∈ R^N, and iterate for k = 0, 1, 2, ...

1. choose i(k) = argmax_{i ∈ {1,...,n}} ‖∇_i f(x^{(k)})‖_2
2.
update x^{(k+1)} = x^{(k)} − t_k U_{i(k)} ∇_{i(k)} f(x^{(k)})

assumptions:
• ∇f(x) is block-wise Lipschitz continuous:

    ‖∇_i f(x + U_i v) − ∇_i f(x)‖_2 ≤ L_i ‖v‖_2,   i = 1, ..., n

• f has a bounded sublevel set; in particular, define

    R(x) = max_y max_{x⋆ ∈ X⋆} { ‖y − x⋆‖_2 : f(y) ≤ f(x) }

Coordinate descent methods 12–4

Analysis for constant step size

quadratic upper bound due to the block coordinate-wise Lipschitz assumption:

    f(x + U_i v) ≤ f(x) + ⟨∇_i f(x), v⟩ + (L_i/2) ‖v‖_2²,   i = 1, ..., n

assume a constant step size 0 < t ≤ 1/M, with M := max_{i ∈ {1,...,n}} L_i; then

    f(x⁺) ≤ f(x) − (t/2) ‖∇_i f(x)‖_2² ≤ f(x) − (t/2n) ‖∇f(x)‖_2²

(the second inequality uses the steepest choice of i)

by convexity and the Cauchy–Schwarz inequality,

    f(x) − f⋆ ≤ ⟨∇f(x), x − x⋆⟩ ≤ ‖∇f(x)‖_2 ‖x − x⋆‖_2 ≤ ‖∇f(x)‖_2 R(x^{(0)})

therefore

    f(x) − f(x⁺) ≥ (t / (2nR²)) ( f(x) − f⋆ )²

Coordinate descent methods 12–5

let Δ_k = f(x^{(k)}) − f⋆; then

    Δ_k − Δ_{k+1} ≥ (t / (2nR²)) Δ_k²

consider the multiplicative inverses:

    1/Δ_{k+1} − 1/Δ_k = (Δ_k − Δ_{k+1}) / (Δ_{k+1} Δ_k) ≥ (Δ_k − Δ_{k+1}) / Δ_k² ≥ t / (2nR²)

therefore

    1/Δ_k ≥ 1/Δ_0 + kt/(2nR²) ≥ 2t/(nR²) + kt/(2nR²) = (k+4) t / (2nR²)

(using 1/Δ_0 ≥ 2t/(nR²)); finally

    f(x^{(k)}) − f⋆ = Δ_k ≤ 2nR² / ((k+4) t)

Coordinate descent methods 12–6

Bounds on the full gradient Lipschitz constant

lemma: suppose A ∈ R^{N×N} is positive semidefinite and has the partition A = [A_ij]_{n×n}, where A_ij ∈ R^{N_i×N_j} for i, j = 1, ..., n, and

    A_ii ⪯ L_i I_{N_i},   i = 1, ..., n

then

    A ⪯ ( Σ_{i=1}^n L_i ) I_N

proof:

    x^T A x = Σ_{i=1}^n x_i^T ( Σ_{j=1}^n A_ij x_j )
            ≤ ( Σ_{i=1}^n √(x_i^T A_ii x_i) )²
            ≤ ( Σ_{i=1}^n L_i^{1/2} ‖x_i‖_2 )²
            ≤ ( Σ_{i=1}^n L_i ) ( Σ_{i=1}^n ‖x_i‖_2² )
conclusion: the full gradient Lipschitz constant satisfies L_f ≤ Σ_{i=1}^n L_i

Coordinate descent methods 12–7

Computational complexity and justifications

(steepest) coordinate descent:  O(nMR²/k)
full gradient method:           O(L_f R²/k)

in general, coordinate descent has the worse complexity bound:
• it can happen that M ≥ O(L_f)
• choosing i(k) may rely on computing the full gradient
• too expensive to do a line search based on function values

nevertheless, there are justifications for huge-scale problems:
• even the computation of a single function value can require substantial effort
• limits imposed by computer memory, distributed storage, and human patience

Coordinate descent methods 12–8

Example

    minimize_{x ∈ R^n} f(x) := Σ_{i=1}^n f_i(x_i) + (1/2) ‖Ax − b‖_2²

• the f_i are convex differentiable univariate functions
• A = [a_1 ⋯ a_n] ∈ R^{m×n}, and assume a_i has p_i nonzero elements

computing either the function value or the full gradient costs O(Σ_{i=1}^n p_i) operations; computing a coordinate directional derivative costs only O(p_i) operations:

    ∇_i f(x) = ∇f_i(x_i) + a_i^T r(x),   i = 1, ..., n,   where r(x) = Ax − b

• given r(x), computing ∇_i f(x) requires O(p_i) operations
• the coordinate update x⁺ = x + α e_i yields an efficient update of the residual, r(x⁺) = r(x) + α a_i, which also costs O(p_i) operations

Coordinate descent methods 12–9

Outline:
• theoretical justifications
• randomized coordinate descent method
• minimizing composite objectives
• accelerated coordinate descent method

Randomized coordinate descent

choose x^{(0)} ∈ R^N and α ∈ R, and iterate for k = 0, 1, 2, ...

1. choose i(k) with probability p_i^{(α)} = L_i^α / Σ_{j=1}^n L_j^α,   i = 1, ..., n
2.
update x^{(k+1)} = x^{(k)} − (1/L_{i(k)}) U_{i(k)} ∇_{i(k)} f(x^{(k)})

special case: α = 0 gives the uniform distribution p_i^{(0)} = 1/n for i = 1, ..., n

assumptions:
• ∇f(x) is block-wise Lipschitz continuous:

    ‖∇_i f(x + U_i v_i) − ∇_i f(x)‖_2 ≤ L_i ‖v_i‖_2,   i = 1, ..., n    (1)

• f has a bounded sublevel set, and f⋆ is attained at some x⋆

Coordinate descent methods 12–10

Solution guarantees

convergence in expectation:

    E[f(x^{(k)})] − f⋆ ≤ ε

high-probability iteration complexity: the number of iterations needed to reach

    prob( f(x^{(k)}) − f⋆ ≤ ε ) ≥ 1 − ρ

• confidence level 0 < ρ < 1
• error tolerance ε > 0

Coordinate descent methods 12–11

Convergence analysis

block coordinate-wise Lipschitz continuity of ∇f(x) implies, for i = 1, ..., n,

    f(x + U_i v_i) ≤ f(x) + ⟨∇_i f(x), v_i⟩ + (L_i/2) ‖v_i‖_2²,   ∀ x ∈ R^N, v_i ∈ R^{N_i}

the coordinate update is obtained by minimizing this quadratic upper bound:

    x⁺ = x + U_i v̂_i,   v̂_i = argmin_{v_i} { f(x) + ⟨∇_i f(x), v_i⟩ + (L_i/2) ‖v_i‖_2² }

the objective function is non-increasing:

    f(x) − f(x⁺) ≥ (1/(2L_i)) ‖∇_i f(x)‖_2²

Coordinate descent methods 12–12

A pair of conjugate norms

for any α ∈ R, define

    ‖x‖_α = ( Σ_{i=1}^n L_i^α ‖x_i‖_2² )^{1/2},   ‖y‖*_α = ( Σ_{i=1}^n L_i^{−α} ‖y_i‖_2² )^{1/2}

let S_α = Σ_{i=1}^n L_i^α (note that S_0 = n)

lemma (Nesterov): let f satisfy (1); then for any α ∈ R,

    ‖∇f(x) − ∇f(y)‖*_{1−α} ≤ S_α ‖x − y‖_{1−α},   ∀ x, y ∈ R^N

therefore

    f(x) ≤ f(y) + ⟨∇f(y), x − y⟩ + (S_α/2) ‖x − y‖²_{1−α},   ∀ x, y ∈ R^N

Coordinate descent methods 12–13

Convergence in expectation

theorem (Nesterov): for any k ≥ 0,

    E f(x^{(k)}) − f⋆ ≤ (2/(k+4)) S_α R²_{1−α}(x^{(0)})

where R_{1−α}(x^{(0)}) = max_y max_{x⋆ ∈ X⋆} { ‖y − x⋆‖_{1−α} : f(y) ≤ f(x^{(0)}) }

proof: define the random variables ξ_k = {i(0), ..., i(k)};

    f(x^{(k)}) − E_{i(k)} f(x^{(k+1)}) = Σ_{i=1}^n p_i^{(α)} ( f(x^{(k)}) − f(x^{(k)} + U_i v̂_i) )
                                      ≥ Σ_{i=1}^n ( p_i^{(α)} / (2L_i) ) ‖∇_i f(x^{(k)})‖_2²
                                      = (1/(2S_α)) ( ‖∇f(x^{(k)})‖*_{1−α} )²

Coordinate descent methods 12–14

    f(x^{(k)}) − f⋆ ≤ min_{x⋆ ∈ X⋆} ⟨∇f(x^{(k)}), x^{(k)} − x⋆⟩ ≤ ‖∇f(x^{(k)})‖*_{1−α} R_{1−α}(x^{(0)})

therefore, with C = 2 S_α R²_{1−α}(x^{(0)}),

    f(x^{(k)}) − E_{i(k)} f(x^{(k+1)}) ≥ (1/C) ( f(x^{(k)}) − f⋆ )²

taking expectation of
both sides with respect to ξ_{k−1} = {i(0), ..., i(k−1)},

    E f(x^{(k)}) − E f(x^{(k+1)}) ≥ (1/C) E_{ξ_{k−1}}[ ( f(x^{(k)}) − f⋆ )² ] ≥ (1/C) ( E f(x^{(k)}) − f⋆ )²

finally, follow the steps on page 12–6 to obtain the desired result

Coordinate descent methods 12–15

Discussions

• α = 0: S_0 = n and

    E f(x^{(k)}) − f⋆ ≤ (2n/(k+4)) R_1²(x^{(0)}) ≤ (2n/(k+4)) Σ_{i=1}^n L_i ‖x_i^{(0)} − x_i⋆‖_2²

  corresponding rate of the full gradient method: f(x^{(k)}) − f⋆ ≤ (γ/k) R_1²(x^{(0)}), where γ is big enough to ensure ∇²f(x) ⪯ γ diag{L_i I_{N_i}}_{i=1}^n

  conclusion: proportional to the worst-case rate of the full gradient method

• α = 1: S_1 = Σ_{i=1}^n L_i and

    E f(x^{(k)}) − f⋆ ≤ (2/(k+4)) ( Σ_{i=1}^n L_i ) R_0²(x^{(0)})

  corresponding rate of the full gradient method: f(x^{(k)}) − f⋆ ≤ (L_f/k) R_0²(x^{(0)})

  conclusion: same as the worst-case rate of the full gradient method, but each iteration of randomized coordinate descent can be much cheaper

Coordinate descent methods 12–16

An interesting case

consider α = 1/2, let N_i = 1 for i = 1, ..., n, and let

    D_∞(x^{(0)}) = max_x max_{y ∈ X⋆} max_{1 ≤ i ≤ n} { |x_i − y_i| : f(x) ≤ f(x^{(0)}) }

then R²_{1/2}(x^{(0)}) ≤ S_{1/2} D²_∞(x^{(0)}), and hence

    E f(x^{(k)}) − f⋆ ≤ (2/(k+4)) ( Σ_{i=1}^n L_i^{1/2} )² D²_∞(x^{(0)})
• the worst-case dimension-independent complexity of minimizing convex functions over an n-dimensional box is infinite (Nemirovski & Yudin 1983)
• S_{1/2} can be bounded for very big or even infinite-dimensional problems

conclusion: RCD can work in situations where full gradient methods have no theoretical justification

Coordinate descent methods 12–17

Convergence for strongly convex functions

theorem (Nesterov): if f is strongly convex with respect to the norm ‖·‖_{1−α} with convexity parameter σ_{1−α} > 0, then

    E f(x^{(k)}) − f⋆ ≤ ( 1 − σ_{1−α}/S_α )^k ( f(x^{(0)}) − f⋆ )

proof: combine the consequence of strong convexity,

    f(x^{(k)}) − f⋆ ≤ (1/(2σ_{1−α})) ( ‖∇f(x^{(k)})‖*_{1−α} )²

with the inequality on page 12–14 to obtain

    f(x^{(k)}) − E_{i(k)} f(x^{(k+1)}) ≥ (1/(2S_α)) ( ‖∇f(x^{(k)})‖*_{1−α} )² ≥ (σ_{1−α}/S_α) ( f(x^{(k)}) − f⋆ )

it remains to take expectations over ξ_{k−1} = {i(0), ..., i(k−1)}

Coordinate descent methods 12–18

High probability bounds

number of iterations to guarantee

    prob( f(x^{(k)}) − f⋆ ≤ ε ) ≥ 1 − ρ

where 0 < ρ < 1 is the confidence level and ε > 0 is the error tolerance:

• for smooth convex functions:

    O( (n/ε) (1 + log(1/ρ)) )

• for smooth strongly convex functions (µ the strong convexity parameter):

    O( (n/µ) log(1/(ερ)) )

Coordinate descent methods 12–19

Outline:
• theoretical justifications
• randomized coordinate descent method
• minimizing composite objectives
• accelerated coordinate descent method

Minimizing composite objectives

    minimize_{x ∈ R^N} F(x) := f(x) + Ψ(x)

assumptions:
• f is differentiable and ∇f(x) is block coordinate-wise Lipschitz continuous:

    ‖∇_i f(x + U_i v_i) − ∇_i f(x)‖_2 ≤ L_i ‖v_i‖_2,   i = 1, ..., n

• Ψ is block separable:

    Ψ(x) = Σ_{i=1}^n Ψ_i(x_i)

  where each Ψ_i is convex and closed, and also simple

Coordinate descent methods 12–20

Coordinate update

use the quadratic upper bound on the smooth part:

    F(x + U_i v_i) = f(x + U_i v_i) + Ψ(x + U_i v_i)
                   ≤ f(x) + ⟨∇_i f(x), v_i⟩ + (L_i/2) ‖v_i‖_2² + Ψ_i(x_i + v_i) + Σ_{j≠i} Ψ_j(x_j)

define

    V(x, v_i) = f(x) + ⟨∇_i f(x), v_i⟩ + (L_i/2) ‖v_i‖_2² + Ψ_i(x_i + v_i)

coordinate descent takes the form

    x^{(k+1)} = x^{(k)} + U_i Δx_i,   where Δx_i = argmin_{v_i} V(x, v_i)

Coordinate
descent methods 12–21

Randomized coordinate descent for composite functions

choose x^{(0)} ∈ R^N and α ∈ R, and iterate for k = 0, 1, 2, ...
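As a concrete instance of the composite framework above (an illustrative sketch, not lecture code; all names are made up): for Ψ_i(x_i) = λ|x_i| the coordinate step Δx_i = argmin_{v_i} V(x, v_i) has the closed form of a soft-threshold, and caching the residual r(x) = Ax − b, as on page 12–9, keeps each step cheap.

```python
import random

def soft_threshold(z, t):
    # prox of t*|.| : argmin_u 0.5*(u - z)**2 + t*|u|
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def rcd_lasso(A_cols, b, lam, num_iters=5000, seed=0):
    """Randomized (uniform, alpha = 0) coordinate descent for
    F(x) = 0.5*||A x - b||^2 + lam*||x||_1, i.e. Psi_i(x_i) = lam*|x_i|.

    Each step minimizes V(x, v_i) exactly: a gradient step on the smooth
    part with step 1/L_i, followed by soft-thresholding. The residual
    r = A x - b is cached; with a sparse column format the per-step cost
    would be O(p_i) (dense lists are used here for brevity)."""
    n = len(A_cols)
    rng = random.Random(seed)
    x = [0.0] * n
    r = [-bi for bi in b]                              # r = A x - b at x = 0
    L = [sum(v * v for v in col) for col in A_cols]    # L_i = ||a_i||_2^2
    for _ in range(num_iters):
        i = rng.randrange(n)
        g = sum(v * rj for v, rj in zip(A_cols[i], r))  # grad_i = a_i^T r
        x_new = soft_threshold(x[i] - g / L[i], lam / L[i])
        delta = x_new - x[i]
        if delta != 0.0:
            x[i] = x_new
            for j, v in enumerate(A_cols[i]):          # residual update
                r[j] += delta * v
    return x

# tiny check with A = I (2 x 2): the minimizer is soft_threshold(b_i, lam)
x = rcd_lasso([[1.0, 0.0], [0.0, 1.0]], [3.0, 0.5], lam=1.0)
```

Because this toy A is orthogonal, each coordinate reaches its optimum the first time it is sampled; in general the iterates only converge in expectation at the rates discussed above.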
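One implementation note on the randomized method of page 12–10 (again an illustrative sketch with made-up names, not lecture code): drawing i(k) with probability p_i^{(α)} = L_i^α / Σ_j L_j^α can be done in O(log n) per draw by precomputing cumulative weights and inverting the empirical CDF with binary search.

```python
import bisect
import itertools
import random

def make_block_sampler(L, alpha, seed=0):
    """Return a sampler that draws block index i with probability
    p_i = L[i]**alpha / sum_j L[j]**alpha (alpha = 0 gives uniform)."""
    weights = [li ** alpha for li in L]
    cum = list(itertools.accumulate(weights))   # cumulative sums (the CDF)
    total = cum[-1]
    rng = random.Random(seed)
    def sample():
        # invert the CDF with binary search: O(log n) per draw
        return bisect.bisect_left(cum, rng.random() * total)
    return sample

# with L = (1, 4) and alpha = 1 the probabilities are (0.2, 0.8)
sample = make_block_sampler([1.0, 4.0], alpha=1.0)
counts = [0, 0]
for _ in range(10000):
    counts[sample()] += 1
```

Building the table costs O(n) once; it must be rebuilt only if the L_i estimates change.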