Greed is Good: Greedy Optimization Methods for Large-Scale Structured Problems
Greed is Good: Greedy Optimization Methods for Large-Scale Structured Problems

by Julie Nutini

B.Sc., The University of British Columbia (Okanagan), 2010
M.Sc., The University of British Columbia (Okanagan), 2012

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in The Faculty of Graduate and Postdoctoral Studies (Computer Science)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

May 2018

© Julie Nutini 2018

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the dissertation entitled:

Greed is Good: Greedy Optimization Methods for Large-Scale Structured Problems

submitted by Julie Nutini in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science.

Examining Committee:

Mark Schmidt, Computer Science (Supervisor)
Chen Greif, Computer Science (Supervisory Committee Member)
Will Evans, Computer Science (Supervisory Committee Member)
Bruce Shepherd, Computer Science (University Examiner)
Özgür Yılmaz, Mathematics (University Examiner)

Abstract

This work looks at large-scale machine learning, with a particular focus on greedy methods. A recent trend caused by big datasets is to use optimization methods that have a cheap iteration cost. In this category are (block) coordinate descent and Kaczmarz methods, as the updates of these methods only rely on a reduced subspace of the problem at each iteration. Prior to our work, the literature cast greedy variations of these methods as computationally expensive with convergence rates comparable to those of randomized versions. In this dissertation, we show that greed is good. Specifically, we show that greedy coordinate descent and Kaczmarz methods have efficient implementations and can be faster than their randomized counterparts for certain common problem structures in machine learning. We show linear convergence for greedy (block) coordinate descent methods under a revived relaxation of strong convexity from 1963, which we call the Polyak-Łojasiewicz (PL) inequality. Of the relaxations of strong convexity proposed in the recent literature, we show that the PL inequality is the weakest condition that still ensures a global minimum. Further, we highlight the exploitable flexibility in block coordinate descent methods, not only in the different types of selection rules possible, but also in the types of updates we can use. We show that using second-order or exact updates with greedy block coordinate descent methods can lead to superlinear or finite convergence (respectively) for popular machine learning problems. Finally, we introduce the notion of "active-set complexity", which we define as the number of iterations required before an algorithm is guaranteed to reach the optimal active manifold, and show explicit bounds for two common problem instances when using the proximal gradient or the proximal coordinate descent method.
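For reference, the Polyak-Łojasiewicz (PL) inequality mentioned above is commonly stated as follows; this is the standard form, given here only as an illustration, and the precise statement used in this dissertation appears in Chapter 4. A differentiable function f with minimum value f^* satisfies the PL inequality with constant \mu > 0 if

\[
\tfrac{1}{2}\,\lVert \nabla f(x) \rVert^2 \;\ge\; \mu \bigl( f(x) - f^{*} \bigr) \qquad \text{for all } x.
\]

In particular, any point with \nabla f(x) = 0 must have f(x) = f^*, so every stationary point is a global minimizer; this is the sense in which the condition can replace strong convexity while still ensuring a global minimum.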
Lay Summary

A recent trend caused by big datasets is to use methods that are computationally inexpensive to solve large-scale problems in machine learning. This work looks at several of these methods, with a particular focus on greedy variations, that is, methods that try to make the most possible progress at each step. Prior to our work, these greedy methods were regarded as computationally expensive with similar performance to cheaper versions that make a random amount of progress at each step. In this dissertation, we show that greed is good. Specifically, we show that these greedy methods can be very efficient and can be faster than their randomized counterparts for solving machine learning problems with certain structure. We exploit the flexibility of these methods and show various ways (both theoretically and empirically) to speed them up.
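To make the greedy-versus-random contrast above concrete, below is a minimal, self-contained sketch (not code from this dissertation; the function and variable names are invented for illustration) of coordinate descent on a small quadratic, comparing uniform random coordinate selection with the greedy Gauss-Southwell rule, which updates the coordinate whose partial derivative is largest in magnitude.

    import numpy as np

    # Illustrative only: coordinate descent on f(x) = 0.5*x'Ax - b'x,
    # comparing random selection with the greedy Gauss-Southwell rule.
    def coordinate_descent(A, b, rule="greedy", iters=200, seed=0):
        rng = np.random.default_rng(seed)
        n = len(b)
        x = np.zeros(n)
        L = np.diag(A)                  # coordinate-wise curvatures (Lipschitz constants)
        for _ in range(iters):
            grad = A @ x - b            # full gradient (fine for a small dense example)
            if rule == "greedy":
                i = int(np.argmax(np.abs(grad)))  # Gauss-Southwell: largest |partial derivative|
            else:
                i = int(rng.integers(n))          # uniform random coordinate
            x[i] -= grad[i] / L[i]      # exact minimization along coordinate i
        return x

    rng = np.random.default_rng(1)
    M = rng.standard_normal((50, 20))
    A = M.T @ M + 0.1 * np.eye(20)      # random positive-definite quadratic
    b = rng.standard_normal(20)
    x_star = np.linalg.solve(A, b)
    for rule in ("greedy", "random"):
        x = coordinate_descent(A, b, rule=rule)
        print(rule, "distance to solution:", np.linalg.norm(x - x_star))

On examples like this the greedy rule typically makes more progress per iteration; the point developed in Chapter 2 is that, for suitably structured problems, it can also be implemented with a per-iteration cost comparable to random selection.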
Preface

The body of research in this dissertation (Chapters 2 to 6) is based on several collaborative papers that have either been previously published or are currently under review.

• The work in Chapter 2 was published in the Proceedings of the 32nd International Conference on Machine Learning (ICML) [Nutini et al., 2015]:

    J. Nutini, Mark Schmidt, Issam H. Laradji, Michael Friedlander and Hoyt Koepke. Coordinate Descent Converges Faster with the Gauss-Southwell Rule Than Random Selection, ICML 2015 [arXiv].

  The majority of this chapter's manuscript was written by J. Nutini and Mark Schmidt. The new convergence rate analysis for the greedy coordinate descent method (Section 2.2.3) was done by Michael Friedlander and Mark Schmidt. The analysis of strong-convexity constants (Section 2.3) was contributed to by Michael Friedlander. The Gauss-Southwell-Lipschitz rule and nearest neighbour analysis in Sections 2.5.2 and 2.5.3 were a joint effort by J. Nutini, Issam Laradji and Mark Schmidt. The majority of the work in Section 2.8 (numerical experiments) was done by Issam Laradji, with help from J. Nutini. Appendix A.1, showing how to calculate the greedy Gauss-Southwell rule efficiently for sparse problems, was primarily researched by Issam Laradji and Mark Schmidt. All other sections were primarily researched by J. Nutini and Mark Schmidt. The final co-author on this paper, Hoyt Koepke, was the primary researcher on extending the greedy coordinate descent analysis to the case of using exact coordinate optimization (excluded from this dissertation).

• A version of Chapter 3 was published in the Proceedings of the 32nd Conference on Uncertainty in Artificial Intelligence (UAI) [Nutini et al., 2016]:

    J. Nutini, Behrooz Sepehry, Issam H. Laradji, Mark Schmidt, Hoyt Koepke and Alim Virani. Convergence Rates for Greedy Kaczmarz Algorithms, and Faster Randomized Kaczmarz Rules Using the Orthogonality Graph, UAI 2016 [arXiv].

  The majority of this chapter's manuscript was researched and written by J. Nutini and Mark Schmidt. The majority of the work in Section 3.9 and Appendix B.12 (numerical experiments) was done by Issam Laradji, with help from J. Nutini and Alim Virani. A section of the corresponding paper, on extending the convergence rate analysis of Kaczmarz methods to consider more than one step (i.e., a sequence of steps), was written by my co-authors Behrooz Sepehry, Hoyt Koepke and Mark Schmidt, and is therefore excluded from this dissertation.

• The material in Chapter 4 was published in the Proceedings of the 27th European Conference on Machine Learning (ECML) [Karimi et al., 2016]:

    Hamed Karimi, J. Nutini and Mark Schmidt. Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-Łojasiewicz Condition, ECML 2016 [arXiv].

  Hamed Karimi, Mark Schmidt and I all made contributions to Theorem 2 and Section C.5, while the work in Section C.4 was done by Hamed Karimi. Several results presented in the corresponding paper are excluded from this dissertation as they were written by my co-authors. These include using the Polyak-Łojasiewicz (PL) inequality to give new convergence rate analyses for stochastic gradient descent methods and stochastic variance-reduced gradient methods, as well as a result proving the equivalence of our proposed proximal-PL condition to two other previously proposed conditions.

• A version of Chapter 5 has been submitted for publication [Nutini et al., 2017a]:

    J. Nutini, Issam H. Laradji and Mark Schmidt. Let's Make Block Coordinate Descent Go Fast: Faster Greedy Rules, Message-Passing, Active-Set Complexity, and Superlinear Convergence (2017) [arXiv].

  This chapter's manuscript was written by J. Nutini and Mark Schmidt. The majority of the research was joint work between J. Nutini, Issam Laradji and Mark Schmidt. Section 5.5 and Appendix D.5 (numerical experiments) were primarily done by Issam Laradji, with help from J. Nutini and Mark Schmidt.

• The material in Chapter 6 is a compilation of material from the reference for Chapter 5 [Nutini et al., 2017a] and a manuscript that has been submitted for publication [Nutini et al., 2017b]:

    J. Nutini, Mark Schmidt and Warren Hare. "Active-set complexity" of proximal gradient: How long does it take to find the sparsity pattern? (2017) [arXiv].

  The majority of this chapter's manuscript was written by J. Nutini and Mark Schmidt. The proof of Lemmas 2 and 4 was joint work by Warren Hare, Mark Schmidt and J. Nutini. The majority of the work in Section 6.5 (numerical experiments) was done by Issam Laradji, with help from J. Nutini and Mark Schmidt.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgements
Dedication

1 Introduction
  1.1 Big-Data: A Barrier to Learning?
  1.2 The Learning Problem/Algorithm
    1.2.1 Loss Functions
  1.3 First-Order Methods
  1.4 Gradient Descent
  1.5 Stochastic Gradient Descent
  1.6 Coordinate Descent Methods
  1.7 Linear Systems and Kaczmarz Methods
  1.8 Relaxing Strong Convexity
  1.9 Proximal First-Order Methods
  1.10 Summary of Contributions

2 Greedy Coordinate Descent
  2.1 Problems of Interest
  2.2 Analysis of Convergence Rates
    2.2.1 Randomized Coordinate Descent
    2.2.2 Gauss-Southwell
    2.2.3 Refined Gauss-Southwell Analysis
  2.3 Comparison for Separable Quadratic
    2.3.1 'Working Together' Interpretation
    2.3.2 Fast Convergence with Bias Term
  2.4 Rates with Different Lipschitz Constants
  2.5 Rules Depending on Lipschitz Constants
    2.5.1 Lipschitz Sampling
    2.5.2 Gauss-Southwell-Lipschitz Rule
    2.5.3 Connection between GSL Rule and Normalized Nearest Neighbour Search
  2.6 Approximate Gauss-Southwell
    2.6.1 Multiplicative Errors
    2.6.2 Additive Errors
  2.7 Proximal Gradient Gauss-Southwell
  2.8 Experiments
  2.9 Discussion

3 Greedy Kaczmarz
  3.1 Problems of Interest
  3.2 Kaczmarz Algorithm and Greedy Selection Rules
    3.2.1 Efficient Calculations for Sparse A