Accelerating Greedy Coordinate Descent Methods

Haihao Lu (1), Robert M. Freund (2), Vahab Mirrokni (3)

(1) Department of Mathematics and Operations Research Center, MIT. (2) Sloan School of Management, MIT. (3) Google Research. Correspondence to: Haihao Lu <[email protected]>.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract

We introduce and study two algorithms to accelerate greedy coordinate descent in theory and in practice: Accelerated Semi-Greedy Coordinate Descent (ASCD) and Accelerated Greedy Coordinate Descent (AGCD). On the theory side, our main results are for ASCD: we show that ASCD achieves $O(1/k^2)$ convergence, and it also achieves accelerated linear convergence for strongly convex functions. On the empirical side, while both AGCD and ASCD outperform Accelerated Randomized Coordinate Descent on most instances in our numerical experiments, we note that AGCD significantly outperforms the other two methods in our experiments, in spite of a lack of theoretical guarantees for this method. To complement this empirical finding for AGCD, we present an explanation why standard proof techniques for acceleration cannot work for AGCD, and we introduce a technical condition under which AGCD is guaranteed to have accelerated convergence. Finally, we confirm that this technical condition holds in our numerical experiments.

1. Introduction

Coordinate descent methods have received much-deserved attention recently due to their capability for solving large-scale optimization problems (with sparsity) that arise in machine learning applications and elsewhere. With inexpensive updates at each iteration, coordinate descent algorithms obtain faster running times than similar full gradient descent algorithms in order to reach the same near-optimality tolerance; indeed some of these algorithms now comprise the state-of-the-art in machine learning algorithms for loss minimization.

Most recent research on coordinate descent has focused on versions of randomized coordinate descent, which can essentially recover the same results (in expectation) as full gradient descent, including obtaining "accelerated" (i.e., $O(1/k^2)$) convergence guarantees. On the other hand, in some important machine learning applications, greedy coordinate methods demonstrate superior numerical performance while also delivering much sparser solutions. For example, greedy coordinate descent is one of the fastest algorithms for the graphical LASSO implemented in DP-GLASSO (Mazumder & Hastie, 2012). Likewise, sequential minimal optimization (SMO), a variant of greedy coordinate descent, is widely known as the best solver for kernel SVM (Joachims, 1999; Platt, 1999) and is implemented in LIBSVM and SVMLight.

In general, for smooth convex optimization the standard first-order methods (including greedy coordinate descent) converge at a rate of $O(1/k)$. In 1983, Nesterov (Nesterov, 1983) proposed an algorithm that achieved a rate of $O(1/k^2)$, which can be shown to be the optimal rate achievable by any first-order method (Nemirovsky & Yudin, 1983). This method (and other similar methods) is now referred to as Accelerated Gradient Descent (AGD).

However, there has not been much work on accelerating the standard Greedy Coordinate Descent (GCD) due to the inherent difficulty in demonstrating $O(1/k^2)$ computational guarantees (we discuss this difficulty further in Section 4.1). The only work that might be close, as far as the authors are aware, is (Song et al., 2017), which updates the z-sequence using the full gradient and thus should not be considered a coordinate descent method in the standard sense.

In this paper, we study ways to accelerate greedy coordinate descent in theory and in practice. We introduce and study two algorithms: Accelerated Semi-Greedy Coordinate Descent (ASCD) and Accelerated Greedy Coordinate Descent (AGCD). While ASCD takes greedy steps in the x-updates and randomized steps in the z-updates, AGCD is a straightforward extension of GCD that only takes greedy steps. On the theory side, our main results are for ASCD: we show that ASCD achieves $O(1/k^2)$ convergence, and it also achieves accelerated linear convergence when the objective function is furthermore strongly convex. However, a direct extension of convergence proofs for Accelerated Randomized Coordinate Descent (ARCD) does not work for ASCD, because different coordinates are used to update the x-sequence and the z-sequence. Thus, we present a new proof technique, which shows that a greedy coordinate step yields a better objective function value than a full gradient step with a modified smoothness condition.

On the empirical side, we first note that in most of our experiments ASCD outperforms ARCD in terms of running time. On the other hand, we note that AGCD significantly outperforms the other accelerated coordinate descent methods in all instances, in spite of a lack of theoretical guarantees for this method. To complement the empirical study of AGCD, we present a Lyapunov energy function argument that points to an explanation for why a direct extension of the proof does not work for AGCD. This argument inspires us to introduce a technical condition under which AGCD is guaranteed to converge at an accelerated rate. Interestingly, we confirm that the technical condition holds in a variety of instances in our empirical study, which in turn justifies our empirical observation that AGCD works so well in practice.

1.1. Related Work

Coordinate Descent. Coordinate descent methods have a long history in optimization, and convergence of these methods was extensively studied in the optimization community in the 1980s-90s, see (Bertsekas & Tsitsiklis, 1989), (Luo & Tseng, 1992), and (Luo & Tseng, 1993). There are roughly three types of coordinate descent methods depending on how the coordinate is chosen: randomized coordinate descent (RCD), cyclic coordinate descent (CCD), and greedy coordinate descent (GCD). RCD has received much attention since the seminal paper of Nesterov (Nesterov, 2012). In RCD, the coordinate is chosen randomly from a certain fixed distribution. (Richtarik & Takac, 2014) provides an excellent review of theoretical results for RCD. CCD chooses the coordinate in a cyclic order, see (Beck & Tetruashvili, 2013) for basic convergence results. More recent results show that CCD is inferior to RCD in the worst case (Sun & Ye, 2016), while it is better than RCD in certain situations (Gurbuzbalaban et al., 2017). In GCD, we select the coordinate yielding the largest reduction in the objective function value. GCD usually delivers better function values at each iteration in practice, though this comes at the expense of having to compute the full gradient in order to select the gradient coordinate with largest magnitude. The recent work (Nutini et al., 2015) shows that GCD has faster convergence than RCD in theory, and also provides several applications in machine learning where the full gradient can be computed cheaply. A parallel GCD method is proposed in (You et al., 2016) and numerical results show its advantage in practice.

Accelerated Randomized Coordinate Descent. Since Nesterov's paper on RCD (Nesterov, 2012) there has been significant focus on accelerated versions of RCD. (Nesterov, 2012) developed the first accelerated randomized coordinate gradient method. (Lu & Xiao, 2015) present a sharper convergence analysis of Nesterov's method using a randomized estimate sequence framework. (Fercoq & Richtarik, 2015) proposed the APPROX (Accelerated, Parallel and PROXimal) coordinate descent method and obtained an accelerated sublinear convergence rate, and (Lee & Sidford, 2013) developed an efficient implementation of ARCD.

1.2. Accelerated Coordinate Descent Framework

Our optimization problem of interest is:

$$P:\quad f^* := \min_{x} f(x), \tag{1}$$

where $f(\cdot): \mathbb{R}^n \to \mathbb{R}$ is a differentiable convex function.

Definition 1.1. $f(\cdot)$ is coordinate-wise $L$-smooth for the vector of parameters $L := (L_1, L_2, \ldots, L_n)$ if $\nabla f(\cdot)$ is coordinate-wise Lipschitz continuous for the corresponding coefficients of $L$, i.e., for all $x \in \mathbb{R}^n$ and $h \in \mathbb{R}$ it holds that:

$$|\nabla_i f(x + h e_i) - \nabla_i f(x)| \le L_i |h|, \quad i = 1, \ldots, n, \tag{2}$$

where $\nabla_i f(\cdot)$ denotes the $i$-th coordinate of $\nabla f(\cdot)$ and $e_i$ is the $i$-th unit coordinate vector, for $i = 1, \ldots, n$.

We presume throughout that $L_i > 0$ for $i = 1, \ldots, n$. Let $\mathbf{L}$ denote the diagonal matrix whose diagonal coefficients correspond to the respective coefficients of $L$. Let $\langle \cdot, \cdot \rangle$ denote the standard coordinate inner product in $\mathbb{R}^n$, namely $\langle x, y \rangle = \sum_{i=1}^n x_i y_i$, and let $\|\cdot\|_p$ denote the $\ell_p$ norm for $1 \le p \le \infty$. Let $\langle x, y \rangle_L := \sum_{i=1}^n L_i x_i y_i = \langle x, \mathbf{L} y \rangle = \langle \mathbf{L} x, y \rangle$ denote the $L$-inner product. Define the norm $\|x\|_L := \sqrt{\langle x, \mathbf{L} x \rangle}$. Letting $\mathbf{L}^{-1}$ denote the inverse of $\mathbf{L}$, we will also use the norm $\|\cdot\|_{L^{-1}}$ defined by $\|v\|_{L^{-1}} := \sqrt{\langle v, \mathbf{L}^{-1} v \rangle} = \sqrt{\sum_{i=1}^n L_i^{-1} v_i^2}$.

Algorithm 1 presents a generic framework for accelerated coordinate descent methods that is flexible enough to encompass deterministic as well as randomized methods. One specific case is the standard Accelerated Randomized Coordinate Descent (ARCD). In this paper we propose and study two other cases. The first is Accelerated Greedy Coordinate Descent (AGCD), which is a straightforward extension of greedy coordinate descent to the acceleration framework and which, surprisingly, has not been previously studied (that we are aware of). The second is a new algorithm which we call Accelerated Semi-Greedy Coordinate Descent (ASCD) that takes greedy steps in the x-updates and randomized steps in the z-updates.

Algorithm 1 Accelerated Coordinate Descent Framework without Strong Convexity
Input: Objective function f with coordinate-wise

coordinate is used to update both the x-sequence and the z-sequence. As far as we know AGCD has not appeared in the first-order method literature.
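To make the coordinate selection rules of Section 1.1 and the quantities of Definition 1.1 concrete, here is a minimal sketch of plain (non-accelerated) coordinate descent with either a greedy or a randomized rule. It is an illustration under stated assumptions, not the authors' implementation and not Algorithm 1. The objective is assumed to be a small quadratic $f(x) = \tfrac{1}{2} x^T A x - b^T x$ with $A$ symmetric positive definite, for which $\nabla_i f(x + h e_i) - \nabla_i f(x) = A_{ii} h$, so $L_i = A_{ii}$ satisfies (2). The greedy rule used here picks the coordinate maximizing $(\nabla_i f(x))^2 / L_i$, which for this quadratic is the coordinate giving the largest decrease with step $1/L_i$; when all $L_i$ are equal it reduces to picking the gradient coordinate of largest magnitude. All function names and the test matrix are hypothetical.

```python
import random


def gradient(A, b, x):
    """Gradient A x - b of the assumed quadratic f(x) = 0.5 * x^T A x - b^T x."""
    n = len(x)
    return [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(n)]


def objective(A, b, x):
    """Value of the assumed quadratic f(x) = 0.5 * x^T A x - b^T x."""
    n = len(x)
    Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
    return 0.5 * sum(x[i] * Ax[i] for i in range(n)) - sum(b[i] * x[i] for i in range(n))


def norm_L_inv(v, L):
    """The norm ||v||_{L^{-1}} = sqrt(sum_i v_i^2 / L_i) from Section 1.2."""
    return sum(vi * vi / Li for vi, Li in zip(v, L)) ** 0.5


def coordinate_descent(A, b, x0, iters, rule="greedy", seed=0):
    """Run `iters` coordinate steps x_i <- x_i - grad_i / L_i (illustrative only)."""
    rng = random.Random(seed)
    n = len(x0)
    L = [A[i][i] for i in range(n)]  # coordinate-wise smoothness of a quadratic
    x = list(x0)
    for _ in range(iters):
        g = gradient(A, b, x)  # the greedy rule needs the full gradient
        if rule == "greedy":
            # Coordinate with the largest guaranteed decrease g_i^2 / (2 L_i).
            i = max(range(n), key=lambda j: g[j] * g[j] / L[j])
        else:
            # Uniformly random coordinate (one possible fixed distribution).
            i = rng.randrange(n)
        x[i] -= g[i] / L[i]
    return x


# Hypothetical test problem: A is strictly diagonally dominant, hence positive definite.
A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
L = [A[i][i] for i in range(3)]
x0 = [0.0, 0.0, 0.0]

for rule in ("greedy", "random"):
    x = coordinate_descent(A, b, x0, iters=30, rule=rule)
    print(rule, objective(A, b, x), norm_L_inv(gradient(A, b, x), L))
```

The sketch recomputes the full gradient at every iteration for both rules to keep the code short; in practice the randomized rule only needs one partial derivative per step, which is the cost trade-off between GCD and RCD noted in Section 1.1.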