Asynchronous Algorithms for Large-Scale Optimization
Asynchronous Algorithms for Large-Scale Optimization: Analysis and Implementation

ARDA AYTEKIN

Licentiate Thesis
Stockholm, Sweden 2017

KTH Royal Institute of Technology
School of Electrical Engineering
Department of Automatic Control
SE-100 44 Stockholm, Sweden

TRITA-EE 2017:021
ISSN 1653-5146
ISBN 978-91-7729-328-6

Academic dissertation which, with the permission of KTH Royal Institute of Technology, is presented for public examination for the degree of Licentiate of Engineering in Electrical Engineering on Friday, 7 April 2017, at 10:00 in Q2, Q-huset, KTH Royal Institute of Technology, Osquldas väg 10, Stockholm.

© Arda Aytekin, April 2017

Printed by: Universitetsservice US AB

Abstract

This thesis proposes and analyzes several first-order methods for convex optimization, designed for parallel implementation in shared and distributed memory architectures. The theoretical focus is on designing algorithms that can run asynchronously, allowing computing nodes to execute their tasks with stale information without jeopardizing convergence to the optimal solution.

The first part of the thesis focuses on shared memory architectures. We propose and analyze a family of algorithms to solve an unconstrained, smooth optimization problem consisting of a large number of component functions. Specifically, we investigate the effect of information delay, inherent in asynchronous implementations, on the convergence properties of the incremental prox-gradient descent method. Contrary to related proposals in the literature, we establish delay-insensitive convergence results: the proposed algorithms converge under any bounded information delay, and their constant step-size can be selected independently of the delay bound.

Then, we shift focus to solving constrained, possibly non-smooth, optimization problems in a distributed memory architecture. This time, we propose and analyze two important families of gradient descent algorithms: asynchronous mini-batching and incremental aggregated gradient descent. In particular, for asynchronous mini-batching, we show that, by suitably choosing the algorithm parameters, one can recover the best-known convergence rates established for delay-free implementations and expect a near-linear speedup with the number of computing nodes. Similarly, for incremental aggregated gradient descent, we establish global linear convergence rates for any bounded information delay.

Extensive simulations and actual implementations of the algorithms on different platforms, applied to representative real-world problems, validate our theoretical results.

ACKNOWLEDGMENTS

First of all, I would like to express my gratitude to my main advisor, Mikael Johansson, and my co-advisors, Alexandre Proutiere and Dimos Dimarogonas, for accepting me as their Ph.D. student and giving me the opportunity to be part of such a great family at KTH Royal Institute of Technology. I would like to especially thank Mikael for his never-ending patience, his professional-yet-friendly attitude, his constant efforts in not only promoting my strengths but also improving my weaknesses, and his excellent guidance in research. Thanks to you, Mikael, I have learned a lot while working with you: from formulating problems and systematically analyzing them using the correct tools, to presenting the results of my research in both written and oral form.

I am also indebted to the colleagues with whom I have collaborated. I would like to thank Hamid for all the fruitful discussions and his help in convex optimization and algorithm analysis.
I am grateful to Burak for the interesting control problems we have worked on together: even though I have not covered them in this thesis, the time spent on them has added to my knowledge and skills. Last, but not least, I feel lucky to have such great support from Cristian Rojas and my “partner in crime” Niklas in developing software tools to be (hopefully) presented at a workshop in the near future. In addition, I would also like to acknowledge Burak, Demia, Hamid, Martin Biel, Sadegh, Sarit, and Vien for proofreading my thesis and providing me with constructive comments.

Automatic Control at KTH is a great family in terms of both quantity and quality. I am very fortunate to have spent my time among you all! Apologies in advance, should I forget to explicitly mention your names... I would like to start by thanking both the current members of our group (Demia “piccolo” Della Penda, Martin Biel, Max, Sarit, and Vien) and the former ones (António “W.M.” Gonga, Burak, Euhanna “the old chap” Ghadimi, Hamid, Jeff, Sadegh, and Themis “yet even older chap” Charalambous) for all the inspiring discussions we have had at the meetings and all the fun extracurricular activities we have done together. I thank you, my office mates, Jezdimir, Martin Andreasson, Martin Biel, Miguel, Mohamed, Niklas, and Valerio “Valerione” Turri, for creating a warm and relaxing working environment. Among the great people I have met at the department, I would like to thank, in particular, Burak, Demia, Hamid, Jeff, Kaveh, Martin Andreasson, Mohamed, Niclas, Niklas, Riccardo, Sadegh, Themis, and Valerio for not only being my colleagues but also being a part of my life as true friends!

Our administrators... Thank you, Anneli, Gerd, Hanna, Karin, Kristina, and Silvia for being so helpful, positive, and kind at all times. I am grateful to you all for fixing all the administrative issues, helping me with the paperwork, and spoiling us all with the waffles and “semlor”!

Finally, the closest ones in Turkey... I thank you, my parents, Mine and Süreyya, for always believing in me and for your unconditional support in my efforts to achieve my goals! Similarly, special thanks go to our extended family members, Berrin and Rıdvan Tuğsuz, and Göksan and İhsan Hakyemez, for always being “there” together with my parents. Equally important are my friends Burak and Serdar Demirel, Mehmet Ayyıldız, Utku Boz, and Begüm Yıldırım. I thank you all for all your support and for putting up with me whenever I was stressed out.

Arda Aytekin
Stockholm, March 2017

CONTENTS

Acknowledgments
1 Introduction
   1.1 Motivation
   1.2 Contributions and Outline
2 Preliminaries
   2.1 Notation
   2.2 Preliminaries
3 Shared Memory Algorithms
   3.1 Problem Formulation
   3.2 Main Result
   3.3 Numerical Example
   3.4 Proofs
4 Distributed Memory Algorithms
   4.1 Problem Formulation
   4.2 Main Result
   4.3 Numerical Example
   4.4 Proofs
5 Conclusion
Bibliography

CHAPTER 1

INTRODUCTION

In this thesis, we will investigate the effect of information delay when designing and running asynchronous algorithms to solve optimization problems on a relatively large scale. Specifically, we will propose a family of parallel algorithms, analyze their convergence properties under stale information, and verify the theoretical results by implementing the algorithms to solve some representative examples of optimization problems.
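As a brief illustration of what running with stale information means (this prototypical iteration is given for exposition only; it is a simplification of ours, not one of the specific algorithms analyzed in later chapters), an asynchronous gradient method can be written as

\[
x_{k+1} = x_k - \gamma \, \nabla f\!\left(x_{k-\tau_k}\right),
\]

where $\gamma > 0$ is a constant step-size and $\tau_k \geq 0$ is the information delay at iteration $k$: each update may be based on a decision variable that is $\tau_k$ iterations old. Setting $\tau_k = 0$ recovers the standard synchronous gradient method, and the analyses in later chapters quantify how bounded delays of this kind affect convergence.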
1.1 Motivation

An optimization problem is the problem of choosing the best element (with respect to some criterion) from a given set of elements. The standard way of writing optimization problems is

\[
\begin{aligned}
\underset{x \in \mathcal{X}}{\text{minimize}} \quad & f(x) \\
\text{subject to} \quad & \tilde{h}_i(x) \leq 0, \quad i = 1, \dots, I, \\
& \bar{h}_j(x) = 0, \quad j = 1, \dots, J,
\end{aligned}
\]

where $x$ denotes the decision variable defined in some given set $\mathcal{X}$, $f(x)$ is the objective, or cost, to be minimized, and $\tilde{h}_i(x)$ and $\bar{h}_j(x)$ denote the inequality and equality constraints of the problem, respectively. The problem is said to be feasible if there exists a decision variable in the given set which satisfies all the constraints. If there are no constraints in the problem, the problem is said to be unconstrained. (A concrete numerical instance of this form is sketched at the end of this section.)

Optimization problems are important in engineering applications. Engineers often find themselves in the loop of collecting data about processes, building representative mathematical models based on the collected data, formulating optimization problems to minimize a cost while meeting some design criteria, and solving the problems. In these problems, the cost usually relates to some penalty on the resources used or on the deviation from a desired behavior. The task, then, is to come up with the best decision that minimizes this cost while fulfilling the design criteria dictated by the constraints of the problem.

[Figure 1.1: A simplified block diagram of an MPC employed in the velocity control of vehicles. Given the linearized model $(A_k, B_k)$ and the cost $(Q_k, R_k, Q_f)$, the MPC samples the current state $x_{k_0}$ and solves an optimization problem to find the best input values that minimize the total cost while satisfying the constraints $(\underline{x}_k, \bar{x}_k, \underline{u}_k, \bar{u}_k)$, up to a horizon of $K$ sampling instances. Then, it sends the best input $u_{k_0}$ to the vehicle and repeats the procedure in the next sampling interval.]

Below are two illustrative, real-world examples of optimization problems encountered in engineering.

Example 1.1 (Model Predictive Control). Model predictive control (MPC) is an advanced, multivariable control algorithm that uses an internal dynamical model to predict the future behavior of a given process and solves, at each sampling instance, an optimization problem to minimize a given cost while satisfying a set of constraints. For instance, an MPC algorithm employed in the velocity control of vehicles (cf. Figure 1.1) can be written in the form

\[
\begin{aligned}
\underset{u_k}{\text{minimize}} \quad & \sum_{k=k_0}^{k_0+K-1} \left( x_k^{\top} Q_k x_k + u_k^{\top} R_k u_k \right) + x_{k_0+K}^{\top} Q_f \, x_{k_0+K} \\
\text{subject to} \quad & x_{k+1} = A_k x_k + B_k u_k, \\
& \underline{u}_k \leq u_k \leq \bar{u}_k, \\
& \underline{x}_k \leq x_k \leq \bar{x}_k, \qquad k = k_0, \dots, k_0 + K - 1,
\end{aligned}
\]

where $u_k$ is the input (e.g., the fuel injection) to the vehicle, and $x_k$ is the state (e.g., the deviation from a set-point velocity). The MPC samples the state of the vehicle periodically, as dictated by the sampling interval.
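To make the standard form introduced at the beginning of this section concrete, the following is a minimal sketch, in Python, of a toy instance with one inequality and one equality constraint. The problem data are hypothetical and chosen purely for illustration; SciPy's general-purpose solver stands in here for the specialized first-order methods developed in this thesis.

```python
# Toy instance of the standard form (hypothetical data, illustration only):
#   minimize   f(x) = (x1 - 1)^2 + (x2 - 2)^2
#   subject to x1 + x2 - 1 <= 0    (inequality constraint)
#              x1 - x2      = 0    (equality constraint)
import numpy as np
from scipy.optimize import minimize

f = lambda x: (x[0] - 1.0) ** 2 + (x[1] - 2.0) ** 2

# SciPy's convention: "ineq" constraints are expressed as fun(x) >= 0.
constraints = [
    {"type": "ineq", "fun": lambda x: 1.0 - x[0] - x[1]},  # x1 + x2 <= 1
    {"type": "eq",   "fun": lambda x: x[0] - x[1]},        # x1 = x2
]

result = minimize(f, x0=np.zeros(2), constraints=constraints)
print(result.x)  # the best feasible decision variable, here (0.5, 0.5)
```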
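The MPC problem in Example 1.1 is a quadratic program, and one full receding-horizon step can be prototyped in a few lines. Below is a minimal sketch using CVXPY, which is our modeling choice for illustration rather than anything prescribed by the thesis; all model, cost, and bound data are hypothetical placeholders, and the time-varying matrices $A_k, B_k, Q_k, R_k$ are taken to be constant for brevity.

```python
# One MPC step: build and solve the finite-horizon QP, then apply u_{k0}.
# All numerical data below are hypothetical placeholders.
import numpy as np
import cvxpy as cp

n, m, K = 2, 1, 10                       # state dim., input dim., horizon
A = np.array([[1.0, 0.1], [0.0, 1.0]])   # A_k, assumed constant here
B = np.array([[0.0], [0.1]])             # B_k, assumed constant here
Q, R, Qf = np.eye(n), np.eye(m), np.eye(n)
u_lo, u_hi = -1.0, 1.0                   # input bounds
x_lo, x_hi = -5.0, 5.0                   # state bounds
x_k0 = np.array([1.0, 0.0])              # sampled current state x_{k0}

x = cp.Variable((n, K + 1))              # predicted states over the horizon
u = cp.Variable((m, K))                  # inputs over the horizon

cost = cp.quad_form(x[:, K], Qf)         # terminal cost
constraints = [x[:, 0] == x_k0]
for k in range(K):
    cost += cp.quad_form(x[:, k], Q) + cp.quad_form(u[:, k], R)
    constraints += [
        x[:, k + 1] == A @ x[:, k] + B @ u[:, k],  # dynamics
        u_lo <= u[:, k], u[:, k] <= u_hi,          # input bounds
        x_lo <= x[:, k], x[:, k] <= x_hi,          # state bounds
    ]

cp.Problem(cp.Minimize(cost), constraints).solve()
print(u.value[:, 0])  # best first input u_{k0}, sent to the vehicle
```

Repeating this at every sampling instant, with the newly sampled state as $x_{k_0}$, gives the receding-horizon behavior described in the example.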