SIAG/OPT Views and News
A Forum for the SIAM Activity Group on Optimization
Volume 24, Number 1, October 2016

Contents

Article
  Evolution of randomness in optimization methods for supervised machine learning
    Katya Scheinberg ................................... 1
In Memoriam
  Jonathan Borwein (Adrian Lewis) ...................... 8
  Roger Fletcher (Sven Leyffer) ....................... 10
  Christodoulos Achilleus Floudas (Ruth Misener) ...... 12
Bulletin
  Event Announcements ................................. 16
  Book Announcements .................................. 17
  Other Announcements
    Howard Rosenbrock Prize 2015 ...................... 18
    Combinatorial Scientific Computing Best Paper Prize 19
    2016 SIAM Fellows Announced ....................... 19
Chair's Column (Juan Meza) ............................ 20
Comments from the Editors (Jennifer Erway & Stefan M. Wild) 20

[Are you receiving this by postal mail? Do you prefer electronic delivery? Email [email protected] today to opt out of paper V&N!]

Evolution of randomness in optimization methods for supervised machine learning

Katya Scheinberg
Industrial and Systems Engineering Department
Lehigh University
Bethlehem, PA, USA
[email protected]
http://coral.ise.lehigh.edu/katyas

1 Introduction

Machine learning (ML) as a field is relatively new. Its beginning is usually attributed to the seminal work of Vapnik [1] from 1995, but it has grown at exponential speed in the past 20 years. One can say that ML is a combination of learning theory, statistics, and optimization, and as such it has influenced all three fields tremendously. ML models, while virtually unknown in the optimization community in 1995, are one of the key topics of continuous optimization today.

This situation is ironic, in fact, because after Vapnik's work the core of ML models shifted to support vector machines (SVMs), a well-defined and nice problem: a convex QP with a simple constraint set. Today, on the other hand, the optimization problem at the core of many ML models is no longer as clearly defined and is not usually as nice.
The main focus of our article is the evolution of the central optimization problem in ML and how different fundamental assumptions result in different problem settings and algorithmic choices, from deterministic to stochastic. We will consider the early SVM setting, which lasted for about a decade, where the problem was considered to be deterministic. In the following decade the perspective on ML problems shifted to stochastic optimization. And, over the past couple of years, the perspective has begun changing once more: the problem is again considered to be deterministic, but stochastic (or perhaps we should say randomized) algorithms are often deemed the methods of choice. We will outline this evolution and summarize some of the key algorithmic developments as well as outstanding questions.

This article is based on one optimizer's perspective on the development of the field in the past 15 years and is inevitably somewhat biased. Moreover, the field of optimization in machine learning is vast and quickly growing. Hence, this article is by no means comprehensive. Most references are omitted, results are oversimplified, and the technical content is meant to be simple and introductory. For a much more comprehensive and technically sound recent review, we refer readers to, for example, [2].

2 Optimization models in machine learning

Optimization problems arise in machine learning in the determination of prediction functions for some (unknown) set of data-label pairs {𝒳, 𝒴}, where 𝒳 ⊂ ℝ^d and 𝒴 may contain binary numbers, real numbers, or vectors. Specifically, given an input vector x ∈ ℝ^d, one aims to determine a function such that, when the function is evaluated at x, one obtains the best prediction of its corresponding label y with (x, y) ∈ {𝒳, 𝒴}. In supervised learning, one determines such a function by using the information contained in a set {X, Y} of n known and labeled samples, that is, pairs (x_i, y_i) ∈ {𝒳, 𝒴} for i ∈ {1, ..., n}. Restricting attention to a family of functions {p(w, x) : w ∈ W ⊆ ℝ^p} parameterized by the vector w, one aims to find w* such that the value p(w*, x) that results from any given input x ∈ 𝒳 is best at predicting the appropriate label y corresponding to x. For simplicity we will focus our attention on binary classification, where labels y can take values 1 and −1. Note that the predictors are often called classifiers.

Our aim is to define an optimization problem to find the best classifier/predictor; however, we first need to define the measure of quality by which classifiers are compared. The quality (or lack thereof) of a classifier is typically measured by its loss, or prediction error. Given a classifier w, a function p(w, x) ∈ ℝ, and a particular data instance (x, y), the classical 0-1 loss is defined as

    ℓ_01(p(w,x), y) = { 0  if y p(w,x) > 0,
                      { 1  if y p(w,x) ≤ 0.

Thus, this loss outputs a 0 if the classifier made a correct prediction and a 1 if it made a mistake. Note that ℓ_01(p(w,x), y) is not a convex function of w, regardless of the form of p(w,x).

Since the data for which the prediction is needed is not known ahead of time, we assume that (x, y) is randomly selected from {𝒳, 𝒴} according to some unknown distribution.¹ The quality of a classifier w can be measured as

    f_01(w) = E_{(x,y)∼{𝒳,𝒴}} [ ℓ_01(p(w,x), y) ]                      (1)
            = P_{(x,y)∼{𝒳,𝒴}} { y p(w,x) ≤ 0 },

which is the probability that p(w,x) gives an incorrect prediction on data randomly selected from the distribution. This quantity is known as the expected risk and has a clear interpretation as a measure of the quality of a classifier. All that remains is to find a classifier that minimizes (1)!

¹ For notational simplicity, {𝒳, 𝒴} will hence denote both the distribution of (x, y) and the set of all pairs (x, y), according to the context.

Minimizing (1) is a stochastic optimization problem with an unknown distribution. Such problems have been the focus of simulation optimization and sample average approximation techniques for a few decades [3-6]. Since the expectation in (1) cannot be computed, it is replaced by a sample average. Here, there is an important assumption to be made that will dictate which optimization approaches are viable: can we sample from {𝒳, 𝒴} indefinitely during the optimization process, or are we limited to a fixed set of sample data that is available a priori? The second option is the more practical assumption, is usually the preferred choice in machine learning, and dictates the difference between optimization in ML and the traditional field of stochastic optimization. As we will see later, many optimization methods for ML are theoretically justified under the assumption of indefinite sampling while being applied in the fixed-sample setting.

Assume that we are given a data set {X, Y}: X = {x_1, x_2, ..., x_n} ⊂ ℝ^d and Y = {y_1, y_2, ..., y_n} ⊂ {+1, −1}, with each data point (x_i, y_i) selected independently at random from {𝒳, 𝒴}. We consider a sample average approximation of (1):

    f̂_01(w) = (1/n) Σ_{i=1}^{n} ℓ_01(p(w, x_i), y_i).                  (2)

This is often referred to as the empirical risk of w.

We now have a pair of optimization problems: the expected risk minimization (the problem we actually want to solve),

    w* = argmin_{w∈W} f_01(w) = E_{(x,y)∼{𝒳,𝒴}} [ ℓ_01(p(w,x), y) ],   (3)

and the empirical risk minimization (the problem we can solve),

    ŵ = argmin_{w∈W} f̂_01(w) = (1/n) Σ_{i=1}^{n} ℓ_01(p(w, x_i), y_i), (4)

where W ⊂ ℝ^p. An important question of ML, considered by learning theory, is how the empirical risk minimizer ŵ compares with the expected risk minimizer w*. A vast amount of work has been done in learning theory on bounding these quantities; see, for example, [7-9]. Here we present simplified bounds, which help us convey the key dependencies.

Given a sample set {X, Y} drawn randomly and independently from the distribution {𝒳, 𝒴}, with probability at least 1 − δ the following generalization bound holds for any w:

    | f̂_01(w) − f_01(w) | ≤ O( √( (𝒲 + log(1/δ)) / n ) ).             (5)

Moreover, for ŵ and w*, the following estimation bound holds:

    | f_01(ŵ) − f_01(w*) | ≤ O( √( (𝒲 + log(1/δ)) / n ) ),            (6)

where 𝒲 ∈ ℝ measures the complexity of the set of all possible classifiers produced by p(w, ·) for w ∈ W. The first bound (5) ensures that at the optimal solution ŵ the expected risk is well represented by the empirical risk whenever the right-hand side of (5) is small. The second bound (6) ensures that the quality of the solution of (4) is almost as good as that of the best classifier from W. Thus, to learn "effectively," we need to balance the complexity 𝒲 against the size of the data set so as to keep the right-hand sides of (5) and (6) small. For instance, in a "big data" setting, when n is very large, very complex models (with large 𝒲) can be used while keeping these right-hand sides small.

Replacing the 0-1 loss with a convex loss function ℓ yields the analogous pair of problems: the expected loss minimization (the problem we want to solve),

    w* = argmin_{w∈W} f(w) = E_{(x,y)∼{𝒳,𝒴}} [ ℓ(p(w,x), y) ],         (7)

and the empirical loss minimization (the problem that can be solved),

    ŵ = argmin_{w∈W} f̂(w) = (1/n) Σ_{i=1}^{n} ℓ(p(w, x_i), y_i).       (8)

Fortunately, one can establish bounds similar to (5) and (6) for f, w*, and ŵ defined above using convex loss functions. The difference between the case of the 0-1 loss we considered previously and that of a Lipschitz-continuous loss is the complexity measure, which is used in place of 𝒲.
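As a concrete illustration, the empirical risk (2) can be computed in a few lines for a linear predictor p(w, x) = wᵀx. This is only a sketch under made-up assumptions: the Gaussian data, the labeling rule, and the two trial classifiers are all illustrative choices, not anything prescribed by the article.

```python
import numpy as np

def zero_one_loss(margin):
    # 0-1 loss: 0 when the prediction is correct (y * p(w,x) > 0), 1 otherwise
    return np.where(margin > 0, 0.0, 1.0)

def empirical_risk(w, X, y):
    # Sample average approximation (2): mean 0-1 loss over the n samples
    margins = y * (X @ w)
    return zero_one_loss(margins).mean()

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))      # n = 100 data points in R^2
y = np.where(X[:, 0] > 0, 1.0, -1.0)   # labels from a simple made-up rule
w_good = np.array([1.0, 0.0])          # aligned with the labeling rule
w_bad = np.array([-1.0, 0.0])          # anti-aligned with it

print(empirical_risk(w_good, X, y))    # low empirical risk
print(empirical_risk(w_bad, X, y))     # high empirical risk
```

Note that this only evaluates (2) at a given w; actually minimizing it over w is hard precisely because, as noted above, the 0-1 loss is not convex in w.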
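The 1/√n dependence on the right-hand side of the generalization bound (5) can also be observed numerically. The sketch below assumes a made-up distribution (Gaussian features, labels flipped with probability 0.1) and one fixed linear classifier; it compares the empirical risk (2) on samples of growing size n against a Monte Carlo estimate of the expected risk (1).

```python
import numpy as np

rng = np.random.default_rng(1)
w = np.array([1.0, 1.0])  # a fixed classifier; p(w, x) = w^T x

def draw_sample(n):
    # Assumed distribution: x ~ N(0, I), label y = sign(x_0), flipped w.p. 0.1
    X = rng.standard_normal((n, 2))
    y = np.where(X[:, 0] > 0, 1.0, -1.0)
    y[rng.random(n) < 0.1] *= -1
    return X, y

def risk(w, n):
    # Mean 0-1 loss over an i.i.d. sample of size n: the empirical risk (2),
    # which for very large n also serves as a Monte Carlo proxy for (1)
    X, y = draw_sample(n)
    return np.mean(y * (X @ w) <= 0)

f = risk(w, 2_000_000)  # proxy for the expected risk f_01(w) in (1)
for n in (100, 10_000, 1_000_000):
    gap = abs(risk(w, n) - f)
    print(f"n = {n:>9,}: |empirical - expected| ~ {gap:.4f}")
```

The observed gaps shrink roughly like 1/√n, consistent with (5); individual runs fluctuate since both quantities are random.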
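A standard convex surrogate for the 0-1 loss is the hinge loss max(0, 1 − y p(w,x)) associated with SVMs; minimizing its sample average is one instance of the empirical loss minimization (8). The sketch below uses a plain subgradient method with a diminishing step size on made-up separable data; the data, step-size rule, and iteration count are arbitrary illustrative choices, not the article's method.

```python
import numpy as np

def hinge_empirical_loss(w, X, y):
    # Empirical loss (8) with the convex hinge loss max(0, 1 - y w^T x)
    return np.maximum(0.0, 1.0 - y * (X @ w)).mean()

def subgradient(w, X, y):
    # A subgradient of the hinge empirical loss at w:
    # only points with margin below 1 contribute
    active = (y * (X @ w)) < 1.0
    return -(y[active][:, None] * X[active]).sum(axis=0) / len(y)

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)  # linearly separable labels

w = np.zeros(2)
for t in range(1, 501):
    w -= (1.0 / np.sqrt(t)) * subgradient(w, X, y)  # diminishing steps

print(hinge_empirical_loss(w, X, y))  # far below the value 1.0 at w = 0
print(np.mean(y * (X @ w) <= 0))      # the 0-1 empirical risk (2) of w
```

Because the hinge loss upper-bounds the 0-1 loss, driving (8) down also controls the empirical risk (2), which is the usual justification for the convex surrogate.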