New Quasi-Newton Optimization Methods for Machine Learning
Jin Yu

A thesis submitted for the degree of Doctor of Philosophy of The Australian National University

September 2009

To my parents, Zuohuang Yu and Zenyu Chen, and my husband, Dan Liu, for their unconditional love and support.

Declaration

Except where otherwise indicated, this thesis is my own original work. This thesis is in part based on the following publications completed during my candidature: Chapters 3 and 4 are based on (Yu et al., 2008a,b); Chapters 5 and 6 are based on (Schraudolph et al., 2007). I was the first or second author of these published works.

30 September 2009

Acknowledgements

First and foremost, I would like to thank my primary supervisor A. Prof. S. V. N. Vishwanathan for his continuous guidance, support, and encouragement throughout my graduate study. His perpetual energy and enthusiasm for research were always a source of inspiration to me, and I am especially grateful to him for keeping up his supervision after leaving for Purdue University in the United States. I am grateful to my former supervisor, Dr. Nicol N. Schraudolph, who introduced me to the field of optimization and offered a great deal of help in sharpening my research skills. My sincere appreciation also goes to my panel chair Dr. Jochen Trumpf and my advisor Dr. Knut Hüper for their invaluable advice and support.

I would like to thank Dr. Simon Günter for many constructive discussions and assistance with implementation issues, and Dr. Peter Sunehag for mathematical assistance and for proofreading many chapters of this thesis. My appreciation is extended to A. Prof. Jian Zhang for fruitful discussions and collaborative work during my stay at Purdue University.

I was fortunate to work with a group of great researchers at the SML lab, who were so incredibly helpful and supportive. Thanks to each and every one of them. In particular, I am grateful to Dr. Wray Buntine for his help with travel funds, Dr. Marconi Barbosa for keeping our cluster up and running, and Dr. Christfried Webers for his help with the Elefant code and for proofreading some chapters of this thesis.

Without doubt, life would have been considerably less pleasant if my fellow PhD students had not been so fun and great to be with. Their friendship and moral support are deeply appreciated. I would especially like to thank Xinhua Zhang and Choonhui Teo for proofreading my papers and for their assistance with implementation issues. Special thanks go to my friend Bernard Larkin, who called regularly to make sure I was doing well.

I am grateful to NICTA and ANU for the scholarship and travel funds that were generously awarded to me. NICTA is funded by the Australian Government's Backing Australia's Ability and the ICT Center of Excellence program. Thanks also go to the CECS administrative staff Michelle Moravec, Debbie Pioch, Marie Katselas, and Sue Van Haeften for their indispensable support over these years.

Last but not least, I am deeply indebted to all of my family for always supporting me in whatever I choose to do. I am particularly grateful to my husband, Dan, for his unconditional love and support, and for just being part of my life.

Abstract

This thesis develops new quasi-Newton optimization methods that exploit the well-structured functional form of objective functions often encountered in machine learning, while still maintaining the solid foundation of the standard BFGS quasi-Newton method.
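For context, the standard BFGS machinery that the thesis generalizes can be summarized by three ingredients. The notation below is a generic sketch (writing $J$ for the objective, $\boldsymbol{w}_t$ for the current iterate, and $B_t$ for the inverse Hessian estimate); Chapter 2 develops the precise form used throughout the thesis.

\[
\begin{aligned}
Q_t(\boldsymbol{p}) &= J(\boldsymbol{w}_t) + \nabla J(\boldsymbol{w}_t)^{\top}\boldsymbol{p} + \tfrac{1}{2}\,\boldsymbol{p}^{\top} B_t^{-1}\boldsymbol{p}
  && \text{(local quadratic model)} \\
\boldsymbol{p}_t &= -\,B_t\,\nabla J(\boldsymbol{w}_t), \qquad \boldsymbol{w}_{t+1} = \boldsymbol{w}_t + \eta_t\,\boldsymbol{p}_t
  && \text{(descent direction; step size } \eta_t \text{ from a Wolfe line search)} \\
B_{t+1} &= \bigl(I - \rho_t\,\boldsymbol{s}_t\boldsymbol{y}_t^{\top}\bigr)\,B_t\,\bigl(I - \rho_t\,\boldsymbol{y}_t\boldsymbol{s}_t^{\top}\bigr) + \rho_t\,\boldsymbol{s}_t\boldsymbol{s}_t^{\top}
  && \text{(BFGS update of the inverse Hessian estimate)}
\end{aligned}
\]

where $\boldsymbol{s}_t := \boldsymbol{w}_{t+1} - \boldsymbol{w}_t$, $\boldsymbol{y}_t := \nabla J(\boldsymbol{w}_{t+1}) - \nabla J(\boldsymbol{w}_t)$, and $\rho_t := 1/(\boldsymbol{s}_t^{\top}\boldsymbol{y}_t)$; the Wolfe conditions ensure $\boldsymbol{s}_t^{\top}\boldsymbol{y}_t > 0$, so the update keeps $B_{t+1}$ positive definite. The two problem classes treated in this thesis are precisely those where this recipe cannot be applied as-is: nonsmooth objectives have no gradient at kinks, and stochastic problems provide only noisy estimates of the objective and its gradient.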
In particular, our algorithms are tailored for two categories of machine learning problems: (1) regularized risk minimization problems with convex but nonsmooth objective functions, and (2) stochastic convex optimization problems that involve learning from small subsamples (mini-batches) of a potentially very large set of data.

We first extend the classical BFGS quasi-Newton method and its limited-memory variant (LBFGS) to the optimization of nonsmooth convex problems. This is done in a rigorous fashion by generalizing three components of BFGS to subdifferentials: the local quadratic model, the identification of a descent direction, and the Wolfe line search conditions. We prove that, under some technical conditions, the resulting subBFGS algorithm is globally convergent in objective function value. We apply the limited-memory variant of subBFGS (subLBFGS) to L2-regularized risk minimization with the binary hinge loss. To extend our algorithms to the multiclass and multilabel settings, we develop a new, efficient, exact line search algorithm. We prove its worst-case time complexity bounds, and show that it can also extend a recently developed bundle method to the multiclass and multilabel settings. Moreover, we apply the direction-finding component of our algorithms to L1-regularized risk minimization with the logistic loss. In all these contexts our methods perform comparably to or better than specialized state-of-the-art solvers on a number of publicly available datasets.

This thesis also provides stochastic variants of the BFGS method, in both full and memory-limited forms, for large-scale optimization of convex problems where the objective and its gradient must be estimated from subsamples of training data. The limited-memory variant of the resulting online BFGS algorithm performs comparably to a well-tuned natural gradient descent, but is scalable to very high-dimensional problems. On standard benchmarks in natural language processing it asymptotically outperforms previous stochastic gradient methods for parameter estimation in Conditional Random Fields.

Contents

Acknowledgements
Abstract

1 Introduction
  1.1 Motivation
  1.2 BFGS Quasi-Newton Methods
  1.3 Quasi-Newton Methods for Nonsmooth Optimization
    1.3.1 Existing Approaches
    1.3.2 Our Approach
  1.4 Stochastic Quasi-Newton Methods
  1.5 Thesis Contributions
  1.6 Outline
  1.7 Notation

2 Classical Quasi-Newton Methods
  2.1 The BFGS Quasi-Newton Method
    2.1.1 Local Quadratic Model
    2.1.2 Line Search
    2.1.3 Inverse Hessian Approximation
  2.2 The Limited-Memory BFGS Method
  2.3 Summary

3 A Quasi-Newton Approach to Nonsmooth Convex Optimization
  3.1 Motivation
    3.1.1 Problems of (L)BFGS on Nonsmooth Objectives
    3.1.2 Advantage of Incorporating BFGS' Curvature Estimate
  3.2 Subgradient BFGS Method
    3.2.1 Generalizing the Local Quadratic Model
    3.2.2 Finding a Descent Direction
    3.2.3 Subgradient Line Search
    3.2.4 Bounded Spectrum of BFGS' Inverse Hessian Estimate
    3.2.5 Limited-Memory Subgradient BFGS
    3.2.6 Convergence of Subgradient (L)BFGS
  3.3 SubBFGS for L2-Regularized Binary Hinge Loss
    3.3.1 Efficient Oracle for the Direction-Finding Method
    3.3.2 Implementing the Line Search
  3.4 Segmenting the Pointwise Maximum of 1-D Linear Functions
  3.5 SubBFGS for Multiclass and Multilabel Hinge Losses
    3.5.1 Multiclass Hinge Loss
    3.5.2 Efficient Multiclass Direction-Finding Oracle
    3.5.3 Implementing the Multiclass Line Search
    3.5.4 Multilabel Hinge Loss
  3.6 Related Work
    3.6.1 Nonsmooth Convex Optimization
    3.6.2 Segmenting the Pointwise Maximum of 1-D Linear Functions
  3.7 Discussion

4 SubLBFGS for Nonsmooth Convex Optimization
  4.1 Experiments
    4.1.1 Convergence Tolerance of the Direction-Finding Procedure
    4.1.2 Size of SubLBFGS Buffer
    4.1.3 L2-Regularized Binary Hinge Loss
    4.1.4 L1-Regularized Logistic Loss
    4.1.5 L2-Regularized Multiclass and Multilabel Hinge Loss
  4.2 Discussion

5 A Stochastic Quasi-Newton Method for Online Convex Optimization
  5.1 Stochastic Gradient-Based Learning
  5.2 Existing Stochastic Gradient Methods
    5.2.1 Stochastic Gradient Descent
    5.2.2 Stochastic Meta-Descent
    5.2.3 Natural Gradient Descent
  5.3 Online BFGS Method
    5.3.1 Convergence without Line Search
    5.3.2 Modified BFGS Curvature Update
    5.3.3 Consistent Gradient Measurements
  5.4 Limited-Memory Online BFGS
  5.5 Stochastic Quadratic Problems
    5.5.1 Deterministic Quadratic
    5.5.2 A Simple Stochastic Quadratic
    5.5.3 A Non-Realizable Quadratic
    5.5.4 Choice of Jacobian
  5.6 Experiments
    5.6.1 Results on the Realizable Quadratic
    5.6.2 Results on the Non-Realizable Quadratic
  5.7 Discussion

6 Online LBFGS for the Training of Conditional Random Fields