
Warm Start for Parameter Selection of Linear Classifiers

Bo-Yu Chu, Chia-Hua Ho, Cheng-Hao Tsai, Chieh-Yen Lin, Chih-Jen Lin
Dept. of Computer Science, National Taiwan Univ., Taiwan
[email protected]  [email protected]  [email protected]  [email protected]  [email protected]

ABSTRACT
In linear classification, a regularization term effectively remedies the overfitting problem, but selecting a good regularization parameter is usually time consuming. We consider cross validation for the selection process, so several optimization problems under different parameters must be solved. Our aim is to devise effective warm-start strategies to efficiently solve this sequence of optimization problems. We investigate in detail the relationship between optimal solutions of logistic regression/linear SVM and regularization parameters. Based on the analysis, we develop an efficient tool to automatically find a suitable parameter for users with no related background knowledge.

Keywords
warm start; regularization parameter; linear classification

1. INTRODUCTION
Linear classifiers such as logistic regression and linear SVM are commonly used in machine learning and data mining. Because directly minimizing the training loss may overfit the training data, the concept of regularization is usually applied. A linear classifier thus solves an optimization problem that involves a parameter (often referred to as C) to balance the training loss and the regularization term. Selecting the regularization parameter is an important but difficult practical issue. An inappropriate setting may cause not only overfitting or underfitting, but also a lengthy training time.

Several reasons make parameter selection a time-consuming procedure. First, usually the search involves sweeping the following sequence of parameters

    C_min, ∆C_min, ∆²C_min, . . . , C_max,    (1)

where ∆ is a given factor. At each parameter the performance must be estimated by, for example, cross validation (CV). Thus a sequence of training tasks is conducted, and we may need to solve many optimization problems. Secondly, if we do not know the reasonable range of the parameters, we may need a long time to solve optimization problems under extreme parameter values.

In this paper, we consider using warm start to efficiently solve a sequence of optimization problems with different regularization parameters. Warm start is a technique to reduce the running time of iterative methods by using the solution of a slightly different optimization problem as an initial point for the current problem. If the initial point is close to the optimum, warm start is very useful. Recently, for incremental and decremental learning, where a few data instances are added or removed, we have successfully applied the warm start technique for fast training [20]. Now for parameter selection, in contrast to the change of data, the optimization problem is slightly modified because of the parameter change. Many considerations become different from those in [20]. As we will show in this paper, the relationship between the optimization problem and the regularization parameter must be fully understood.

Many past works have applied the warm-start technique for solving optimization problems in machine learning methods. For kernel SVM, [5, 15] have considered various initial points from the solution information of the previous problem. While they showed effective time reduction for some data sets, warm start for kernel SVM has not been widely deployed because of the following reasons.
- An initial point obtained using information from a related optimization problem may not cause fewer iterations than a naive initial point (zero in the case of kernel SVM).
- When kernels are used, SVM involves both regularization and kernel parameters. The implementation of a warm-start technique can be complicated. For example, we often cache frequently used kernel elements to avoid their repeated computation, but with the change of kernel parameters, maintaining the kernel cache is very difficult.
In fact, [15] was our previous attempt to develop warm-start techniques for kernel SVM, but the results are not mature enough to be included in our popular SVM software LIBSVM [2]. In contrast, the situation for linear classification is simpler because
- the regularization parameter is the only parameter to be chosen, and
- as pointed out in [20] and other works, it is more flexible to choose optimization methods for linear rather than kernel classifiers.
We will show that for different optimization methods, the effectiveness of warm-start techniques varies. In this work we focus on linear classification and make warm start an efficient tool for parameter selection.

Besides warm-start techniques, other methods have been proposed for the parameter selection of kernel methods. For example, Hastie et al. [9] show that for L1-loss SVM, solutions are a piece-wise linear function of regularization parameters. They obtain the regularization path by updating solutions according to the optimality condition. This approach basically needs to maintain and manipulate the kernel matrix, a situation not applicable for large data sets. Although this difficulty may be alleviated for linear classification, we still see two potential problems. First, the procedure to obtain the path depends on optimization problems (e.g., primal and dual) and loss functions. Second, although an approximate solution path can be considered [8], the regularization path may still contain too many pieces of linear functions for large data sets. Therefore, in the current study, we do not pursue this direction for the parameter selection of linear classification. Another approach for parameter selection is to minimize a function of parameters that approximates the error (e.g., [18, 3, 4]). However, minimizing the estimation may not lead to the best parameters, and the implementation of a two-level optimization procedure is complicated. Because linear classifiers involve a single parameter C, simpler approaches might be appropriate.

While traditional linear classification considers L2 regularization (see the formulations in Section 1.1), recently L1 regularization has been popular because of its sparse model. Warm start may be very useful for training L1-regularized problems because some variables may remain zero after the change of parameters. In [13, 7], warm start is applied to speed up two optimization methods, an interior point method and GLMNET, respectively. However, these works consider warm start as a trick without giving a full study on parameter selection. They investigate neither the range of parameters nor the relation between optimization problems and parameters. Another work [19] proposes a screening approach that pre-identifies some zero elements in the final solution; then a smaller optimization problem is solved. They apply warm-start techniques to keep track of nonzero elements of the solutions under different parameters. However, the screening approach is not applicable to L2-regularized classifiers because of the lack of sparsity. While warm start for L1-regularized problems is definitely worth investigating, to be more focused we leave this topic for future studies and consider only L2-regularized classification in this work.

To achieve automatic parameter selection, we must identify a possible range of parameter values and decide when to stop the selection procedure. None of the works mentioned above except [19] has discussed the range of the regularization parameter. The work [21] finds a lower bound of C values for kernel SVM by solving a linear program, but for linear SVM the cost may be too high. Another study [14] proposes using discrete optimization for the automatic parameter selection of any classification method. Their procedure stops if better parameters cannot be found after a few trials. By targeting L2-regularized linear classification, we will derive useful properties to detect the possible parameter range.

In this work we develop a complete and automatic parameter selection procedure for L2-regularized linear classifiers with the following two major contributions. First, by carefully studying the relationship between optimization problems and regularization parameters, we obtain and justify an effective warm-start setting for fast cross validation across a sequence of regularization parameters. Second, we provide an automatic parameter selection tool for users with no background knowledge. In particular, users do not need to specify a sequence of parameter candidates.

This paper is organized as follows. In the rest of this section, we introduce the primal and dual formulations of linear classifiers. In Section 2, we discuss the relation between optimization problems and regularization parameters. Based on the results, we propose an effective warm-start setting to reduce the training time in Section 3. To verify our analysis, Section 4 provides detailed experiments. The conclusions are in Section 5. This research work has led to a useful parameter-selection tool¹ extended from the popular package LIBLINEAR [6] for linear classification. Because of the space limitation, we give proofs and more experiments in the supplementary materials.¹

¹http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/warm-start/

1.1 Primal and Dual Formulations
Although linear classification such as logistic regression (LR) and linear SVM has been well studied in the literature (e.g., a survey in [22]), for easy discussion we briefly list the primal and dual formulations, mainly following the description in [20, Section 2]. Consider (label, feature-vector) pairs of training data (y_i, x_i) ∈ {−1, 1} × R^n, i = 1, . . . , l. A linear classifier obtains its model by solving the following optimization problem:

    min_w  f(w) ≡ (1/2)‖w‖² + C L(w),    (2)

where

    L(w) ≡ Σ_{i=1}^{l} ξ(w; x_i, y_i)

is the sum of training losses, ξ(w; x, y) is the loss function, and C is a user-specified regularization parameter to balance the regularization term ‖w‖²/2 and the loss term L(w). LR and linear SVM consider the following loss functions:

    ξ(w; x, y) ≡  log(1 + e^{−y w^T x})    logistic (LR) loss,
                  max(0, 1 − y w^T x)      L1 loss,              (3)
                  max(0, 1 − y w^T x)²     L2 loss.

As indicated in [20], these loss functions have different differentiability, so applicable optimization methods may vary.
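To make the formulation concrete, the following short Python sketch (ours, not from the paper; the function name and the dense numpy implementation are only for illustration) evaluates the primal objective (2) with the three losses in (3).

```python
import numpy as np

def primal_objective(w, X, y, C, loss="lr"):
    """Evaluate f(w) = 0.5*||w||^2 + C*L(w) from (2)-(3) for a dense data matrix X."""
    margins = y * (X @ w)                      # y_i * w^T x_i for every instance
    if loss == "lr":                           # logistic (LR) loss
        xi = np.logaddexp(0.0, -margins)       # log(1 + exp(-y w^T x)), numerically stable
    elif loss == "l1":                         # L1 (hinge) loss
        xi = np.maximum(0.0, 1.0 - margins)
    elif loss == "l2":                         # L2 (squared hinge) loss
        xi = np.maximum(0.0, 1.0 - margins) ** 2
    else:
        raise ValueError("unknown loss")
    return 0.5 * np.dot(w, w) + C * xi.sum()
```

For sparse data one would use a scipy.sparse product instead of X @ w, but the structure of (2) is unchanged.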
Instead of solving problem (2) of variable w, it is well known that the optimal w can be represented as a linear combination of training data with coefficients α ∈ R^l:

    w = Σ_{i=1}^{l} y_i α_i x_i.    (4)

Then one can solve an optimization problem over α. An example is the following dual problem of (2):

    max_α  f^D(α) ≡ Σ_{i=1}^{l} h(α_i, C) − (1/2) α^T Q̄ α
    subject to  0 ≤ α_i ≤ U, ∀i = 1, . . . , l,    (5)

where Q̄ = Q + D ∈ R^{l×l}, Q_ij = y_i y_j x_i^T x_j, D is diagonal with D_ii = d, ∀i, and

    U = C,  d = 0         for L1-loss SVM and LR,
    U = ∞,  d = 1/(2C)    for L2-loss SVM.

Following [20], we have

    h(α_i, C) =  α_i                                                for L1-loss and L2-loss SVM,
                 C log C − α_i log α_i − (C − α_i) log(C − α_i)     for LR.
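For illustration only (again not the paper's implementation), the dual objective in (5) can be evaluated as below. Forming the full l×l matrix Q is sensible only for small toy data; a solver such as LIBLINEAR never builds it explicitly, and the clipping of α away from the box boundary is just to keep the logarithms in h finite.

```python
import numpy as np

def dual_objective(alpha, X, y, C, loss="lr"):
    """Evaluate f^D(alpha) from (5) with Q_ij = y_i y_j x_i^T x_j and D_ii = d."""
    Z = y[:, None] * X                                     # row i is y_i * x_i^T
    Qbar = Z @ Z.T
    if loss == "l2":                                       # d = 1/(2C) only for the L2 loss
        Qbar = Qbar + (1.0 / (2.0 * C)) * np.eye(len(y))
    if loss == "lr":
        a = np.clip(alpha, 1e-12, C - 1e-12)               # keep 0 < alpha_i < C for the logs
        h = C * np.log(C) - a * np.log(a) - (C - a) * np.log(C - a)
    else:                                                  # L1- and L2-loss SVM: h(alpha_i, C) = alpha_i
        h = alpha
    return h.sum() - 0.5 * alpha @ Qbar @ alpha
```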

We refer to (2) as the primal problem. We define some notations for later use:

    Y ≡ diag(y_1, . . . , y_l)  and  X ≡ [x_1^T ; . . . ; x_l^T] ∈ R^{l×n}.    (6)

Further, e ≡ [1, 1, . . . , 1]^T ∈ R^l and 0 ≡ [0, 0, . . . , 0]^T.
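In the notation of (6), the representation (4) reads w = (YX)^T α. A two-line sketch (ours, purely illustrative):

```python
import numpy as np

def w_from_alpha(alpha, X, y):
    """Recover the primal w from a dual alpha via (4): w = sum_i y_i alpha_i x_i = (Y X)^T alpha."""
    return (np.diag(y) @ X).T @ alpha      # equivalently X.T @ (y * alpha), without forming diag(y)
```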

2. OPTIMIZATION PROBLEMS AND REGULARIZATION PARAMETERS
This section studies the relationship between the optimal solution and the regularization parameter C. We focus on the case of large C because the optimization problems become more difficult.² We separately discuss primal and dual solutions in Sections 2.1 and 2.2, respectively. All proofs are in the supplementary materials.

²This has been mentioned in, for example, [10].

2.1 Primal Solutions
Because f(w) is strongly convex in w, an optimal solution of (2) exists and is unique [1, Lemma 2.33]. We define w_C as the unique solution under parameter C, and W∞ as the set of points that attain the minimum of L(w),

    W∞ ≡ {w | L(w) = inf_{w'} L(w')}.

We are interested in the asymptotic behavior of the solution w_C when C → ∞. The following theorem shows that {w_C} converges to a point in W∞.

Theorem 1  Consider any nonnegative and convex loss function ξ(w; x, y). If W∞ ≠ φ, then

    lim_{C→∞} w_C = w∞,  where  w∞ = arg min_{w ∈ W∞} ‖w‖².    (7)

Next we check if Theorem 1 is applicable to the three loss functions in (3). It is sufficient to prove that W∞ ≠ φ.

2.1.1 L1-loss and L2-loss SVM
For the L1 loss, the asymptotic behavior of {w_C} was studied in [12]. Theorem 3 of [12] proves that there exist C* and w* such that w_C = w*, ∀C ≥ C*. Later in Theorem 6, we prove the same result by Theorem 1 and properties in [11]. To see if W∞ ≠ φ needed by Theorem 1 holds, note that inf_w L(w) can be written as the following linear program:

    min_{w, ξ}  Σ_{i=1}^{l} ξ_i
    subject to  ξ_i ≥ 1 − y_i w^T x_i,  ξ_i ≥ 0,  ∀i.

It has a feasible solution (w, ξ) = (0, e), and the objective value is non-negative, so from [17, Theorem 4.2.3 (i)] a minimum is attained and W∞ ≠ φ.

For L2-loss SVM, the situation is similar. The following theorem shows that W∞ ≠ φ.

Theorem 2  If the L2 loss is used, then W∞ ≠ φ and {w_C} converges to w∞.

2.1.2 Logistic Regression
For LR, the situation is slightly different from that of linear SVM because W∞ may not exist. We explain below that if the data set is separable, then it is possible that

    inf_w L(w) = 0    (8)

and no minimum is attained because L(w) > 0, ∀w. For separable data, generally there exists a vector w such that

    y_i w^T x_i > 0, ∀i.

Then (8) holds because

    L(∆w) = Σ_{i=1}^{l} log(1 + e^{−y_i ∆ w^T x_i}) → 0  as  ∆ → ∞.

The above discussion indicates that only if data are not separable may we have a non-empty W∞. The following definition formally defines non-separable data.

Definition 1  A data set is not linearly separable if for any w ≠ 0, there is an instance x_i such that

    y_i w^T x_i < 0.    (9)

We have the following theorem on the convergence of {w_C}.

Theorem 3  If the LR loss is used and the non-separable condition (9) holds, then w∞ exists and {w_C} converges to w∞.

2.2 Dual Solutions
Let α_C be any optimal solution of the dual problem (5). Subsequently we will investigate the relationship between α_C and C. An important difference from the analysis for primal solutions is that α_C may not be unique. This situation occurs if the dual objective function is only convex rather than strictly convex (e.g., L1 loss). Therefore, we analyze the L2 and LR losses first because their α_C is unique.

2.2.1 L2-loss SVM and Logistic Regression
The following theorem shows the asymptotic relationship between α_C and C.

Theorem 4  If the L2 loss is used, then

    lim_{C→∞} (α_C)_i / C = 2 max(0, 1 − y_i w∞^T x_i),  i = 1, . . . , l.    (10)

If the LR loss is used and the non-separable condition (9) is satisfied, then

    lim_{C→∞} (α_C)_i / C = e^{−y_i w∞^T x_i} / (1 + e^{−y_i w∞^T x_i}),  i = 1, . . . , l.    (11)

From Theorem 4, ‖α_C‖ is unbounded for non-separable data because the right-hand side in (10) and (11) is non-zero. Therefore, we immediately have the following theorem.

Theorem 5  If the problem is not linearly separable, then

    ‖α_C‖ → ∞  as  C → ∞.

Theorem 4 indicates that for non-separable data, α_C is asymptotically a linear function of C. In Section 3.1, we will use this property to choose the initial solution after C is increased. For separable data, the right-hand side in (10) and (11) may be zero, so the asymptotic linear relationship between α_C and C may not hold. For simplicity, we omit a detailed analysis on separable data. Further, such data are often easier for training.

2.2.2 L1-loss SVM
Although for the L1 loss the dual optimal α_C may not be unique, there exists a solution path from the earlier work in [11].³ We extend their result to have the following theorem.

³In [11], the loss function has an additional bias term: max(0, 1 − y(w^T x + b)). A careful check shows that their results hold without b.

Theorem 6  If the L1 loss is used, then there are vectors v₁ and v₂, and a threshold C* such that for all C ≥ C*,

    α_C ≡ v₁ C + v₂    (12)

is a dual optimal solution. The primal solution

    w_C = (YX)^T α_C

is a constant vector, the same as w∞, where X and Y are defined in (6). Further,

    (YX)^T v₁ = 0  and  (YX)^T v₂ = w∞.

The following extension of Theorem 6 shows that, similar to Theorem 4, most elements of α_C become a multiple of C.

Theorem 7  Assume the L1 loss is used. There exists a C* such that for all C ≥ C*, any dual optimal solution α_C satisfies

    (α_C)_i = C  if y_i w∞^T x_i < 1,
    (α_C)_i = 0  if y_i w∞^T x_i > 1.

This theorem can be easily obtained from the fact that w_C = w∞, ∀C ≥ C* from Theorem 6, and the optimality condition. Like Theorem 5, we immediately get the unboundedness of {α_C} from Theorem 7.

3. WARM START FOR PARAMETER SELECTION
In this section, we investigate issues in applying the warm-start strategy for solving a sequence of optimization problems under different C values. The purpose is to select the parameter C that achieves the best CV accuracy.

3.1 Selection of Initial Solutions
We consider the situation when

    C is increased to ∆C,

where ∆ > 1. At the current C, let w_C be the unique primal optimal solution, and let α_C be any dual optimal solution. We then discuss suitable initial solutions w̄ and ᾱ for the new problem with the parameter ∆C.

From Theorems 1 and 6, w_C is close to (or exactly the same as) w∞ after C is large enough, so naturally

    w̄ = w_C    (13)

can be an initial solution. If a dual problem is solved, from Theorem 4,

    ᾱ = ∆ α_C    (14)

is suitable for the L2 or LR loss. We explain that (14) is also useful for the L1 loss although in Theorem 6, α_C is not a multiple of C, and α_C is only one of the optimal solutions. Instead, we consider Theorem 7, where α_C is any optimal solution rather than the specific one in (12) of Theorem 6. Because in general y_i w∞^T x_i ≠ 1, most of α_C's elements are either zero or C, and hence they are multiples of C. Thus, (14) is a reasonable initial solution. Note that if α_C is feasible for (5) under C, then ᾱ also satisfies the constraints under ∆C. Approaches similar to (14) for setting the initial point of warm start can be found in some previous works such as [5].

3.2 A Comparison on the Effectiveness of Primal and Dual Initial Solutions
In this subsection, we show that in some respects warm start may be more useful for a primal-based training method. First, from the primal-dual relationship (4), we have

    w_C = Σ_{i=1}^{l} (α_C)_i y_i x_i.    (15)

However, by Theorems 1 and 5, interestingly

    w_C → w∞  but  ‖α_C‖ → ∞.

Therefore, on the right-hand side of (15), apparently some large values in α_C are cancelled out after taking y_i x_i into consideration. The divergence of α_C tends to cause more difficulties in finding a good initial dual solution.

Next, we check primal and dual initial objective values after applying warm start. The initial ᾱ based on (14) gives the following dual objective value. For simplification, we use w and α to denote w_C and α_C, respectively. Then,

    Σ_i h(∆α_i, ∆C) − (1/2) ∆² α^T (Q + D/∆) α
      = ∆ Σ_i h(α_i, C) − (1/2) ∆² α^T (Q + D/∆) α    (16)
      = ∆ [ (1/2) α^T (Q + D) α + (1/2) w^T w + C Σ_{i=1}^{l} ξ(w; x_i, y_i) ] − (1/2) ∆² α^T (Q + D/∆) α    (17)
      = C∆ Σ_{i=1}^{l} ξ(w; x_i, y_i) + (∆ − ∆²/2) w^T w,    (18)

where (16) is from h(∆α, ∆C) = ∆ h(α, C) in Eq. (31) of [20], (17) is from the fact that primal and dual optimal objective values are equal, and (18) is from the fact that at optimum, α^T Q α = w^T w. Note that ∆ − ∆²/2 is a decreasing function for ∆ ≥ 1. If ∆ = 2, we have

    (18) = 2C Σ_{i=1}^{l} ξ(w; x_i, y_i)
         ≤ primal or dual optimal objective value at 2C
         ≤ (1/2) w̄^T w̄ + 2C Σ_{i=1}^{l} ξ(w̄; x_i, y_i).    (19)

Note that w̄ = w from (13). If w̄ is close to w∞, then the primal initial objective value should be close to the optimum. In contrast, from (19) we can see that the dual initial objective value lacks a w^T w / 2 term. Therefore, it should be less close to the optimal objective value.
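The initial points (13) and (14) of Section 3.1 are trivial to form once the solutions at the previous C are stored; a minimal sketch (our own naming, numpy arrays assumed) is given below.

```python
def warm_start_initial_points(w_C, alpha_C, Delta):
    """Initial points when C is increased to Delta*C (Delta > 1).

    (13): keep the previous primal optimum.
    (14): scale the previous dual solution; since 0 <= alpha_C <= C, the scaled
          point satisfies 0 <= Delta*alpha_C <= Delta*C and is feasible for (5).
    """
    w_bar = w_C.copy()
    alpha_bar = Delta * alpha_C
    return w_bar, alpha_bar
```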
Another difference between primal- and dual-based approaches is the use of high-order (e.g., Newton) or low-order (e.g., gradient) optimization methods. It has been pointed out in [20] that a high-order method takes more advantage of applying warm start. The reason is that if an initial solution is close to the optimum, a high-order optimization method leads to fast convergence. Because the primal problem is unconstrained, it is easier to solve it by a high-order optimization method. In Section 4, we will experimentally confirm the results discussed in this section.

3.3 Range of Regularization Parameters
In conducting parameter selection in practice, we must specify a range of C values that covers the best choice. Because small and large C values cause underfitting and overfitting, respectively, and our procedure gradually increases C, all we need is a lower and an upper bound of C.

3.3.1 Lower Bound of C
We aim at finding a C value so that for all values smaller than it, underfitting occurs. If the optimal w satisfies

    y_i w^T x_i < 1, ∀i,    (20)

then we consider that underfitting occurs. The reason is that every instance imposes a non-zero L1 or L2 loss. We then check for what C values (20) holds. For the L1 and LR losses, if

    C < 1 / (l max_i ‖x_i‖²),    (21)

then from (4) and the property α_i ≤ C,

    y_i w^T x_i ≤ |w^T x_i| ≤ Σ_{j=1}^{l} |α_j| |x_j^T x_i| < 1.

For L2-loss SVM, if

    C < 1 / (2 l max_i ‖x_i‖²),    (22)

then

    y_i w^T x_i ≤ ‖w‖ ‖x_i‖ ≤ sqrt(‖w‖² + 2C L(w)) max_j ‖x_j‖    (23)
               ≤ sqrt(2 f(0)) max_j ‖x_j‖ ≤ sqrt(2Cl) max_j ‖x_j‖ < 1,

where (23) follows because w minimizes f(·). The value derived in (21) and (22) can be considered as the initial C_min for the parameter search. One concern is that it may be too small, so that many unneeded optimization problems are solved. In Section 4.4, we show that optimization methods are very fast when C is small, so the cost of considering some small and useless C values is negligible.

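The bounds (21) and (22) are cheap to compute; a sketch (ours) that also rounds down to a power of two, as done later in Section 4, could look as follows.

```python
import numpy as np

def initial_c_min(X, loss="lr"):
    """Smallest C considered, from (21) for L1/LR losses and (22) for the L2 loss."""
    max_norm_sq = (X ** 2).sum(axis=1).max()       # max_i ||x_i||^2
    l = X.shape[0]
    bound = 1.0 / (l * max_norm_sq)
    if loss == "l2":
        bound = 1.0 / (2.0 * l * max_norm_sq)
    return 2.0 ** np.floor(np.log2(bound))         # largest power of two not exceeding the bound
```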
3.3.2 Upper Bound of C
Clearly, an upper bound should be larger than the best C. However, unlike the situation in finding a lower bound, where optimization problems with small C values are easy to solve, wrongly considering a too-large C value can dramatically increase the running time.

In an earlier study [14] on parameter selection for general classification problems, the search procedure is terminated if the CV accuracy is not enhanced after a few trials. While the setting is reasonable, the following issues occur.
- For small C values, the CV accuracy is stable, so the selection procedure may stop prematurely. See more discussions in Section II of the supplementary materials.
- The CV accuracy may not be a monotonically increasing function before the best C. If the CV accuracy is locally stable or even becomes lower in an interval, then the procedure may stop before the best C.
For linear classification, based on the results in Section 2, we describe a setting that terminates the search process by checking the change of optimal solutions.

In Section 2, we showed that {w_C} converges to w∞ and, for LR and L2-loss SVM, {α_C / C} converges. Therefore, if for several consecutive C values {w_C} or {α_C / C} does not change much, then w_C should be close to w∞ and there is no need to further increase the C value; see more theoretical support later in Theorem 8. We thus propose a setting of increasing C until either
- C reaches a pre-specified constant C̄, or
- the optimal w_C or α_C / C is about the same as the previous solutions.
Deriving an upper bound C̄ is more difficult than deriving the lower bound in Section 3.3.1. However, with the second condition, a tight upper bound may not be needed. Thus in our experiments we simply select a large value.

For the second condition of checking the change of optimal solutions, our idea is as follows. For the optimal w_{C/∆} or α_{C/∆} at C/∆, we see if it is a good approximate optimal solution at the current C. For easy description, we rewrite f(w) as

    f(w; C)

to reflect the regularization parameter. We stop the procedure for parameter selection if

    ‖∇f(w_{∆^{t−1}C}; ∆^t C)‖ ≤ ε ‖∇f(0; ∆^t C)‖  for t = −2, −1, 0,    (24)

where ε is a pre-specified small positive value. The condition (24) implies that w_{∆^{t−1}C}, the optimal solution for the previous ∆^{t−1}C, is an approximate solution for minimizing the function f(w; ∆^t C). We prove the following theorem to support our use of (24). In particular, we check the relationship between w_{C/∆} and ∇f(w; C).

Theorem 8  Assume L(w) is continuously differentiable.
1. If the non-separable condition (9) holds and ‖w∞‖ > 0, then

    w_{C₁} ≠ w_{C₂}, ∀C₁ ≠ C₂.    (25)

2. We have

    lim_{C→0} ‖∇f(w_{C/∆}; C)‖ / ‖∇f(0; C)‖ = (∆ − 1)/∆,  and    (26)
    lim_{C→∞} ‖∇f(w_{C/∆}; C)‖ / ‖∇f(0; C)‖ = 0.    (27)

The result (25) indicates that in general w_{C/∆} ≠ w_C, so (24) does not hold. One exception is when C is large: w_{C/∆} ≈ w_C ≈ w∞ from Theorem 1. Then (24) will eventually hold, and this property is indicated by (27). On the contrary, when C is small, from (26) and the property that

    ∆ ≥ 1/(1 − ε)  implies  (∆ − 1)/∆ ≥ ε,

if ∆ is not close to one, (24) does not hold. Therefore, our procedure does not stop prematurely. Note that (24) can be applied regardless of whether a primal-based or a dual-based optimization method is used. If a dual-based method is considered, an optimal w_C is returned by (4) and we can still check (24).

3.4 The Overall Procedure
With all results ready, we propose in Algorithm 1 a practically useful procedure for selecting the parameter of a linear classifier. We evaluate the CV accuracy from the smallest parameter C_min, and gradually increase it by a factor ∆; see the sequence of C values shown in (1). In the CV procedure, if the kth fold is used for validation, then all the remaining folds are used for training. Therefore, several sequences of optimization problems are solved. Although it is possible to separately handle each sequence, here we consider them together. That is, at each C, the training/prediction tasks on all folds are conducted to obtain the CV accuracy. Then either the procedure is terminated or we go to the next ∆C. Regarding the storage, all we need is to maintain K vectors w, where K is the number of CV folds.

In Algorithm 1, the change of primal solutions is checked for terminating the search process. Although both primal- and dual-based optimization methods can be used (from α_C, the primal w_C can be easily generated), we expect that a primal-based method is more suitable because of the following properties.
- The primal solution w_C is unique regardless of the loss function, but the dual solution α_C may not be unique for the L1 loss.
- Primal solutions converge as C increases, but dual solutions diverge (Sections 2 and 3.2).
- After applying warm start to obtain initial solutions, the primal objective value tends to be closer to the optimum than the dual (Section 3.2).
- The warm start strategy is more effective for a high-order optimization approach (e.g., a Newton method). Because the primal problem (2) has no constraints, it is easier to design a high-order solver ([20] and Section 3.2).
Further, [20, Section 5] has pointed out that implementation issues such as maintaining α make dual solvers more challenging to support warm start. We compare implementations using primal- and dual-based methods in detail in Section 4 and the supplementary materials.

Algorithm 1  A complete procedure for parameter selection.
1. Given K as the number of CV folds.
2. Initialize C_min by (21) or (22), and C_max by a constant.
3. Initialize C_best ← C_min, best CV accuracy A ← −∞.
4. For each CV fold k, give an initial w̄^k.
5. For C = C_min, ∆C_min, ∆²C_min, . . . , C_max:
   5.1. For each CV fold k = 1, . . . , K:
        5.1.1. Use all data except fold k for training. Apply warm start with the initial point w̄^k. Obtain the solution w_C^k.
        5.1.2. Predict fold k by w_C^k.
   5.2. Obtain the CV accuracy using the results of 5.1. If the new CV accuracy > A:
        A ← the new CV accuracy; C_best ← C.
   5.3. If (24) is satisfied: break.
   5.4. For each CV fold k, w̄^k ← w_C^k.
6. Return C_best.
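A compact Python sketch of this procedure is given below. It is our own simplified rendering, not the released LIBLINEAR extension: train(X, y, C, w_init) is a hypothetical solver returning an approximate primal optimum, the gradient is written for the LR loss only, and the stopping test keeps only the spirit of (24) (checked for a single t and on the first fold's training problem, before solving at the new C).

```python
import numpy as np

def lr_gradient(w, X, y, C):
    """Gradient of f(w; C) = 0.5*||w||^2 + C * sum_i log(1 + exp(-y_i w^T x_i))."""
    margins = y * (X @ w)
    return w - C * (X.T @ (y / (1.0 + np.exp(margins))))

def select_parameter(X, y, folds, C_min, C_max, train, Delta=2.0, eps=1e-2):
    """Warm-start search over C = C_min, Delta*C_min, ... (simplified Algorithm 1)."""
    l, n = X.shape
    w = [np.zeros(n) for _ in folds]                      # one vector maintained per CV fold
    train_idx = [np.setdiff1d(np.arange(l), val) for val in folds]
    best_C, best_acc = C_min, -np.inf
    C, first = C_min, True
    while C <= C_max:
        # Step 5.3 analogue: if the solution kept from the previous C is already an
        # approximate optimum at the current C, further increasing C is useless.
        if not first:
            g_prev = lr_gradient(w[0], X[train_idx[0]], y[train_idx[0]], C)
            g_zero = lr_gradient(np.zeros(n), X[train_idx[0]], y[train_idx[0]], C)
            if np.linalg.norm(g_prev) <= eps * np.linalg.norm(g_zero):
                break
        correct = 0
        for k, val in enumerate(folds):
            tr = train_idx[k]
            w[k] = train(X[tr], y[tr], C, w_init=w[k])    # warm start (step 5.1.1)
            pred = np.where(X[val] @ w[k] > 0, 1, -1)     # step 5.1.2
            correct += int((pred == y[val]).sum())
        acc = correct / float(l)                          # step 5.2
        if acc > best_acc:
            best_acc, best_C = acc, C
        first = False
        C *= Delta
    return best_C, best_acc
```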
4. EXPERIMENTS
We conduct experiments to verify the results discussed in Sections 2 and 3, and to check the performance of Algorithm 1. Because of the space limitation, we present only results of LR and leave L2-loss SVM to the supplementary materials.⁴ We consider six data sets madelon, ijcnn, webspam (trigram version), news20, rcv1, and yahoo-japan with statistics in Table 1, where the last column is the best C with the highest CV accuracy.⁵ We tried some large C values to empirically confirm that all these sets are not separable.

Table 1: Data statistics. Density is the average ratio of non-zero features per instance.
Data set      l: #instances   n: #features   density    best C
madelon       2,000           500            100.00%    2^-25
ijcnn         49,990          22             59.09%     2^5
webspam       350,000         16,609,143     0.02%      2^2
rcv1          677,399         47,236         0.15%      > 2^10
yahoo-japan   176,203         832,026        0.02%      2^3
news20        19,996          1,355,191      0.03%      > 2^10

⁴L1-loss SVM is not considered because the primal-based optimization method considered here cannot handle non-differentiable losses.
⁵All data sets except yahoo-japan are available at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets.

We consider C_min to be the largest power of two that satisfies (21). Following the discussion in (19), we use ∆ = 2, so the sequence of C values in (1) contains only powers of two. For C_max, we set it to a large number, 2^10, because as indicated in Section 3.3, the other condition (24) is mainly used to stop our procedure.

Our implementations are extended from LIBLINEAR [6], and we use five-fold CV. We consider two optimization methods in our experiments. One is a dual coordinate descent method [10] for solving the dual problem, while the other is a Newton method [16] for solving the primal problem. At C_min, because of no prior information for warm start, the default initial point in LIBLINEAR is used; see Step 4 in Algorithm 1. The two optimization methods have their respective stopping conditions implemented in LIBLINEAR. To fairly compare them, we modify the condition of the dual-based method to be the same as the primal one:⁶

    ‖∇f(w)‖ ≤ ε (min(l₊, l₋) / l) ‖∇f(0)‖,    (28)

where l₊ and l₋ are the numbers of instances labelled +1 and −1, respectively. This condition is related to (24), used for terminating the parameter search. If not specified, the default ε = 10^-2 in LIBLINEAR is used in our experiments. We also change some settings of the solvers; details are in Section III of the supplementary materials. Experiments are conducted on two four-core computers with 2.0GHz/32GB RAM and 2.5GHz/16GB RAM for webspam and the other data sets, respectively.

⁶For any dual-based optimization method, we can easily obtain an approximate primal solution by (4). On the contrary, we may not be able to modify a primal-based method to use the stopping condition of a dual-based method because from a primal w it is difficult to generate a dual α.

4.1 CV Accuracy and Training Time
We investigate in Figure 1 the relation between log₂ C (x-axis) and CV accuracy (y-axis on the left). The purpose is to check the convergence of {w_C} proved in Section 2.
Figure 1: CV accuracy and training time using LR with warm start, for (a) madelon, (b) ijcnn, (c) webspam, (d) rcv1, (e) yahoo-japan, and (f) news20. The two CV curves and the left y-axis are the CV accuracy in percentage (%). The dashed lines and the right y-axis are the cumulative training time in the CV procedure in seconds. The vertical line indicates the last C value checked by Algorithm 1.

We show CV rates of using (28) with ε = 10^-2 and 10^-6 as the stopping condition for finding w_C; see "CV Rate" and "CV Tight" in Figure 1, respectively. We have the following observations. First, for most problems, the CV accuracy is stabilized when C is large. This result confirms the theoretical convergence of {w_C}. However, for the data set yahoo-japan, the CV accuracy keeps changing even when C is large. This situation may be caused by the following reasons.
- w_C is not close enough to w∞ yet, even though a large C = 2^10 has been used.
- Training a problem under a large C is so time-consuming that the obtained w_C is still far away from the optimum; see the significant difference of CV rates between using loose and strict stopping tolerances for madelon, rcv1, and yahoo-japan.⁷ Later in this section we discuss more about the relation between training time and C.
The above observation clearly illustrates where the difficulty of parameter selection for linear classification lies. The area of large C values is like a danger zone. Usually the best C is not there, but if we wrongly get into that region, not only may lengthy training occur, but the obtained CV rates may also be erroneous.

The second observation is that when C is small, the CV accuracy is almost flat; see the explanation in Section II of the supplementary materials. Therefore, if we use the method in [14] to check the change of CV accuracy, the search procedure may stop too early. In contrast, we explained in Section 3.3.2 that our method will not stop until w_C is close to w∞.

⁷It is surprising to see that for rcv1, a loose condition gives flat CV rates, a situation closer to the convergence of {w_C}, but a strict condition does not. The reason is that, because of using warm start, the initial w̄ from w_C immediately satisfies the loose stopping condition at ∆C, so both the optimal solution and the CV rate remain the same.

In Figure 1, we also show the cumulative CV training and validation time of Algorithm 1 from C_min to the current C; see the dashed lines and the y-axis on the right. The curve of cumulative time bends upward because both solvers become slower as C increases. Except for news20, this situation is more serious for the dual solver.⁸ A reason is that our dual solver is a low-order optimization method. When C is large, the problem becomes harder to solve, and a high-order method such as the Newton method for the primal problem tends to perform better. See more discussion in Section 4.3.

⁸Note that warm start has been applied. If not, the time increase at large C values is even more dramatic.

To remedy the problem of lengthy training when C is large, recall that in Section 3.3.2 we proposed a method to stop the procedure according to the stopping condition of the solver. The ending point of our procedure is indicated by a vertical line in Figure 1. Clearly, we successfully obtain CV accuracy close to the highest in the entire range. Figure 1 also confirms that we rightly choose C_min smaller than the best C. Although C_min tends to be too small, in Figure 1 the training time for small C values is insignificant.

4.2 Initial Objective Values
We verify our discussion on primal and dual initial objective values in (19). If w_C ≈ w_{∆C} ≈ w∞, then the initial objective value of primal solvers should be close to the optimum, while the dual initial objective value is asymptotically smaller than the primal by ‖w∞‖²/2. In Table 2, we show the difference between the initial and optimal objective values,
    f(w̄) − f(w_C)  and  f^D(ᾱ) − f^D(α_C).

Note that f(w_C) and f^D(α_C) are equal. Because the optimal w_C and α_C are difficult to compute, we obtain their approximations by running a huge number of iterations. We show results of C = 2^-4, 2^0, 2^4, 2^8, where w̄ and ᾱ are obtained by solving problems of C = 2^-5, 2^-1, 2^3, 2^7 and applying (13) and (14). We use ε = 10^-6 in (28) for this experiment to ensure that the solution of the previous problem is accurate enough. In Table 2, we also show ‖w̄‖²/2 to see if, as indicated in (19), f^D(ᾱ) − f^D(α_C) is close to −‖w̄‖²/2 when C is large.

Table 2: Difference between the initial and optimal function values. Logistic regression is used. The approach that is closer to the optimum is boldfaced.
            madelon                             ijcnn                               webspam
log2 C   primal     dual       ‖w̄‖²/2      primal     dual       ‖w̄‖²/2      primal     dual       ‖w̄‖²/2
−4       2.51e−03   −1.09e−01  4.57e−02    1.30e+01   −4.55e+01  5.85e+01    1.63e+02   −3.86e+02  5.49e+02
0        2.62e−04   −1.45e+00  5.72e−02    1.31e+01   −1.90e+02  2.03e+02    1.12e+03   −3.26e+03  4.38e+03
4        0.00e+00   −1.29e+01  5.82e−02    1.50e+00   −2.64e+02  2.66e+02    8.28e+03   −2.33e+04  3.16e+04
8        0.00e+00   −1.17e+02  5.83e−02    9.86e−02   −2.71e+02  2.71e+02    5.32e+04   −1.76e+05  2.29e+05
            rcv1                                yahoo-japan                         news20
−4       3.23e+02   −1.04e+03  1.36e+03    6.19e+01   −1.27e+02  1.89e+02    2.19e+01   −1.43e+01  3.62e+01
0        1.60e+03   −6.02e+03  7.62e+03    9.00e+02   −1.40e+03  2.30e+03    3.78e+02   −6.32e+02  1.01e+03
4        1.20e+04   −3.33e+04  4.53e+04    1.85e+04   −2.66e+04  4.51e+04    2.26e+03   −7.91e+03  1.02e+04
8        9.53e+04   −2.70e+05  3.66e+05    1.21e+05   −4.07e+05  5.28e+05    5.33e+03   −3.64e+04  4.17e+04

From Table 2, except in some rare situations with small C values, primal solvers have function values closer to the optimal value than the dual solvers. Further, as C increases, f^D(α_C) − f^D(ᾱ) becomes close to ‖w̄‖²/2 for most problems. Note that from (19),

    f(w̄) − f(w_C) = (1/2)(‖w̄‖² − ‖w_C‖²) + C Σ_{i=1}^{l} (ξ(w̄; x_i, y_i) − ξ(w_C; x_i, y_i)),  and    (29)
    f^D(ᾱ) − f^D(α_C) = −(1/2)‖w_C‖² + C Σ_{i=1}^{l} (ξ(w̄; x_i, y_i) − ξ(w_C; x_i, y_i)).    (30)

For their first term, ‖w̄‖² − ‖w_C‖² ≈ 0 in (29) as C → ∞, but in (30) it converges to −‖w∞‖²/2. The situation of the second term is unclear because ξ(w̄; x_i, y_i) − ξ(w_C; x_i, y_i) → 0 as C → ∞. However, C (ξ(w̄; x_i, y_i) − ξ(w_C; x_i, y_i)) in Table 2 is relatively smaller than −‖w_C‖²/2, and therefore the primal initial objective value is closer to the optimal objective value than the dual. Finally, the dual solver fails on the data set madelon because of slow convergence.

4.3 Effectiveness of Warm-start Strategies
In Figure 2, we compare the running time with/without implementing warm start. Each subfigure presents training time versus the following relative difference to the optimal objective value:

    (f(w) − f(w_C)) / f(w_C)  and  (f^D(α) − f^D(α_C)) / f^D(α_C),

where w_C or α_C is an optimal solution and w or α is any iterate in the optimization process. We use the best C value before the vertical line in Figure 1. The initial point of warm start is given by (13) and (14), which use the solutions at C/2.

If warm start is not applied, the results in Figure 2 are consistent with past works such as [10]:
- If the number of features is much smaller than the number of instances and C is not large (madelon and ijcnn), a primal-based method may be suitable because of a smaller number of variables. For the opposite case of more features, a dual-based method may be used (webspam and yahoo-japan).
- If C is large, a high-order optimization approach such as Newton methods is more robust (news20).
In practice, a stopping condition is imposed to terminate the optimization procedure (e.g., running up to the horizontal line in Figure 2, which indicates that the condition (28) with LIBLINEAR's default ε = 10^-2 has been established).

After applying warm start, both primal and dual solvers become faster. Therefore, warm start effectively reduces the training time. However, Figure 2 focuses on the convergence behavior under a given C. What we are more interested in is the total training time of the parameter-selection procedure. This will be discussed in Section 4.4.

4.4 Performance of the Parameter-selection Procedure
We compare the running time with/without applying warm-start techniques. Figure 3 presents the cumulative CV training and validation time from C_min to the current C. We can make the following observations.
- The total running time (log-scaled in Figure 3) is significantly reduced after applying warm start for both primal- and dual-based optimization methods.
- While the dual coordinate descent method is faster when C is small, its training time dramatically increases for large C values. This result corresponds to our earlier discussion that a low-order optimization method is not suitable for hard situations such as when C is large. Even though warm start significantly reduces the training time, for running up to a large C it is generally less competitive with the primal Newton method with warm start. Therefore, using a high-order optimization method is a safer option in parameter selection for avoiding lengthy running time.

Figure 2: Objective values versus training time using LR and the best C found by Algorithm 1: (a) madelon (C = 2^-25), (b) ijcnn (C = 2^3), (c) webspam (C = 2^2), (d) rcv1 (C = 2^4), (e) yahoo-japan (C = 2^3), (f) news20 (C = 2^9). The solid lines correspond to settings without applying warm start, where default initial points in LIBLINEAR are used. Primal-ws and Dual-ws are primal and dual solvers with warm-start settings, respectively, and the initial point is obtained by (13) and (14). The horizontal line indicates that the condition (28) with LIBLINEAR's default ε = 10^-2 has been established.

4.5 CV Folds and Models
In Algorithm 1, the data are split into K folds for the CV procedure. For each fold k, we maintain a vector w^k that traces initial and optimal solutions all the way from C_min to C_max. To reduce the storage, an idea is to maintain only one vector across all folds. If fold 1 serves as the validation set, the training set includes

    fold 2, fold 3, . . . , fold K.    (31)

Next, for validating fold 2, the training set becomes

    fold 1, fold 3, . . . , fold K.    (32)

The two training sets differ in only two folds: fold 1 and fold 2. We have a scenario of incremental and decremental learning [20], where fold 2 is removed but fold 1 is added. Then warm start can be applied. Specifically, the optimal solution after training (31) can be used as an initial solution for (32). Although this technique may reduce the storage as well as the training time, we show that in practice some difficulties may occur.

In Table 3, we compare the two approaches of using K vectors and one vector for storing the solutions. The approach of maintaining one vector gives much higher CV accuracy and a larger best C value. An investigation shows that two reasons together cause an over-estimation.
- In training folds 1, 3, . . . , K to validate fold 2, our initial point from training folds 2, 3, . . . , K contains information from the validation set.
- In training folds 2, 3, . . . , K, we obtain only an approximate solution rather than the optimum.
That is, the initial point is biased toward fitting fold 2; this issue would be fixed if we obtained the optimum of training folds 1, 3, . . . , K, but in practice we do not. This experiment shows that in applying the warm start technique, we often conveniently assume that optimal solutions are exactly obtained. This assumption is of course incorrect because of numerical computation. While solving optimization problems more accurately may address the issue, the training time also goes up, a situation that contradicts the goal of applying the warm start technique. Our experience indicates that while warm start is very useful in machine learning, its practical implementation must be carefully designed.

Table 3: CV accuracy using a dedicated w for each fold and a shared w for all folds. See details in Section 4.5. The set yahoo-japan is used. The highest CV rate is boldfaced.
C      K models   one model
2^1    92.60      92.66
2^3    92.69      92.80
2^5    92.59      92.56
2^7    92.34      94.27
2^9    92.21      98.01

5. CONCLUSIONS
Although we have studied many issues on the parameter selection for linear classifiers, there are some future works.
- Training tasks in the CV procedure under a given C are independent. It is interesting to make a parallel implementation and investigate the scalability.
- Our method can be easily extended to L1-regularized problems. However, the relationship between optimization problems and regularization parameters must be studied because the primal optimal solution w may not be unique.
In conclusion, based on this research work, we have released an extension of LIBLINEAR for parameter selection. It is an automatic and convenient procedure for users without background knowledge on linear classification.

6. ACKNOWLEDGMENTS
This work was supported in part by the National Science Council of Taiwan via the grant 101-2221-E-002-199-MY3.

Figure 3: Training time (in seconds) using LR with/without warm-start techniques, for (a) madelon, (b) ijcnn, (c) webspam, (d) rcv1, (e) yahoo-japan, and (f) news20. The vertical line indicates the last C value checked by Algorithm 1. Because the training time quickly increases when C becomes large, the y-axis is log-scaled.

7. REFERENCES
[1] J. F. Bonnans and A. Shapiro. Perturbation analysis of optimization problems. Springer-Verlag, 2000.
[2] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM TIST, 2(3):27:1–27:27, 2011.
[3] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple parameters for support vector machines. MLJ, 46:131–159, 2002.
[4] K.-M. Chung, W.-C. Kao, C.-L. Sun, L.-L. Wang, and C.-J. Lin. Radius margin bounds for support vector machines with the RBF kernel. Neural Comput., 15:2643–2681, 2003.
[5] D. DeCoste and K. Wagstaff. Alpha seeding for support vector machines. In KDD, 2000.
[6] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: a library for large linear classification. JMLR, 9:1871–1874, 2008.
[7] J. H. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. JSS, 33:1–22, 2010.
[8] J. Giesen, M. Jaggi, and S. Laue. Approximating parameterized convex optimization problems. TALG, 9:10:1–10:17, 2012.
[9] T. Hastie, S. Rosset, R. Tibshirani, and J. Zhu. The entire regularization path for the support vector machine. JMLR, 5:1391–1415, 2004.
[10] C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In ICML, 2008.
[11] W.-C. Kao, K.-M. Chung, C.-L. Sun, and C.-J. Lin. Decomposition methods for linear support vector machines. Neural Comput., 16(8):1689–1704, 2004.
[12] S. S. Keerthi and C.-J. Lin. Asymptotic behaviors of support vector machines with Gaussian kernel. Neural Comput., 15(7):1667–1689, 2003.
[13] S.-J. Kim, K. Koh, M. Lustig, S. Boyd, and D. Gorinevsky. An interior point method for large-scale l1-regularized least squares. IEEE J-STSP, 1:606–617, 2007.
[14] R. Kohavi and G. H. John. Automatic parameter selection by minimizing estimated error. In ICML, 1995.
[15] J.-H. Lee and C.-J. Lin. Automatic model selection for support vector machines. Technical report, 2000.
[16] C.-J. Lin, R. C. Weng, and S. S. Keerthi. Trust region Newton method for large-scale logistic regression. JMLR, 9:627–650, 2008.
[17] J. Matoušek and B. Gärtner. Understanding and Using Linear Programming. Springer, 2007.
[18] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In NIPS, 2012.
[19] R. Tibshirani, J. Bien, J. Friedman, T. Hastie, N. Simon, J. Taylor, and R. J. Tibshirani. Strong rules for discarding predictors in lasso-type problems. JRSSB, 74:245–266, 2012.
[20] C.-H. Tsai, C.-Y. Lin, and C.-J. Lin. Incremental and decremental training for linear classification. In KDD, 2014.
[21] Z. Wu, A. Zhang, C. Li, and A. Sudjianto. Trace solution paths for SVMs via parametric quadratic programming. In DMMT, 2008.
[22] G.-X. Yuan, C.-H. Ho, and C.-J. Lin. Recent advances of large-scale linear classification. PIEEE, 100:2584–2603, 2012.