Early Stopping for Kernel Boosting Algorithms: a General Analysis with Localized Complexities
IEEE Transactions on Information Theory, vol. 65, no. 10, October 2019

Yuting Wei, Fanny Yang, and Martin J. Wainwright, Senior Member, IEEE

Abstract— Early stopping of iterative algorithms is a widely used form of regularization in statistics, commonly used in conjunction with boosting and related gradient-type algorithms. Although consistency results have been established in some settings, such estimators are less well understood than their analogues based on penalized regularization. In this paper, for a relatively broad class of loss functions and boosting algorithms (including L2-boost, LogitBoost, and AdaBoost, among others), we exhibit a direct connection between the performance of a stopped iterate and the localized Gaussian complexity of the associated function class. This connection allows us to show that the local fixed point analysis of Gaussian or Rademacher complexities, now standard in the analysis of penalized estimators, can be used to derive optimal stopping rules. We derive such stopping rules in detail for various kernel classes and illustrate the correspondence of our theory with practice for Sobolev kernel classes.

Index Terms— Boosting, kernel, early stopping, regularization, localized complexities.

I. INTRODUCTION

While non-parametric models offer great flexibility, they can also lead to overfitting, and poor generalization as a consequence. For this reason, it is well understood that procedures for fitting non-parametric models must involve some form of regularization. When models are fit via a form of empirical risk minimization, the most classical form of regularization is based on adding some type of penalty to the objective function. An alternative form of regularization is based on the principle of early stopping, in which an iterative algorithm is run for a pre-specified number of steps and terminated prior to convergence.

While the basic idea of early stopping is fairly old (e.g., [1], [33], [37]), recent years have witnessed renewed interest in its properties, especially in the context of boosting algorithms and neural network training (e.g., [13], [27]). Over the past decade, a line of work has yielded some theoretical insight into early stopping, including works on classification error for boosting algorithms [4], [14], [19], [25], [41], [42], L2-boosting algorithms for regression [8], [9], and similar gradient algorithms in reproducing kernel Hilbert spaces (e.g., [11], [12], [28], [36], [41]). A number of these papers establish consistency results for particular forms of early stopping, guaranteeing that the procedure outputs a function with statistical error that converges to zero as the sample size increases. On the other hand, there are relatively few results that actually establish rate optimality of an early stopping procedure, meaning that the achieved error matches known statistical minimax lower bounds. To the best of our knowledge, Bühlmann and Yu [9] were the first to prove optimality for early stopping of L2-boosting as applied to spline classes, albeit with a rule that was not computable from the data. Subsequent work by Raskutti et al. [28] refined this analysis of L2-boosting for kernel classes and first established an important connection to the localized Rademacher complexity; see also the related work [10], [29], [41] with rates for particular kernel classes.

More broadly, relative to our rich and detailed understanding of regularization via penalization (e.g., see the books [18], [34], [35], [39] and the papers [3], [21] for details), our understanding of early stopping regularization is not as well developed. Intuitively, early stopping should depend on the same bias-variance tradeoffs that control estimators based on penalization. In particular, for penalized estimators, it is now well understood that complexity measures such as the localized Gaussian width, or its Rademacher analogue, can be used to characterize their achievable rates [3], [21], [34], [39]. Is such a general and sharp characterization also possible in the context of early stopping?

The main contribution of this paper is to answer this question in the affirmative for the early stopping of boosting algorithms as applied to various regression and classification problems involving functions in reproducing kernel Hilbert spaces (RKHS). A standard way to obtain a good estimator or classifier is through minimizing some penalized form of the loss function, of which the method of kernel ridge regression [38] is a popular choice. Instead, we consider an iterative update involving the kernel that is derived from a greedy update. Borrowing tools from empirical process theory, we are able to characterize the "size" of the effective function space explored by taking T steps, and then to connect the resulting estimation error naturally to the notion of localized Gaussian width defined with respect to this effective function space. This leads to a principled analysis for a broad class of loss functions used in practice, including the loss functions that underlie the L2-boost, LogitBoost, and AdaBoost algorithms, among other procedures.
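The precise kernel update and the data-dependent stopping rule are developed in Sections II and III; the sketch below is only a minimal Python illustration, not the paper's implementation. It contrasts an early-stopped kernel boosting iteration for the least-squares loss with its penalized counterpart, kernel ridge regression. The Gaussian kernel, step size, stopping time, and toy data are all assumptions made purely for illustration.

```python
import numpy as np

# Illustrative sketch only: a least-squares kernel boosting (functional
# gradient) iteration with early stopping, contrasted with penalized kernel
# ridge regression.  The kernel, step size, and stopping time are assumptions
# for this example, not the choices analyzed in the paper.

def gaussian_kernel(X, bandwidth=1.0):
    """Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 * bandwidth^2))."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def boosted_fits(K, y, step_size=0.5, num_steps=50):
    """Early-stopped kernel boosting for the least-squares loss.

    Tracks the fitted values theta^t = (f^t(x_1), ..., f^t(x_n)), updated by a
    functional-gradient step theta^{t+1} = theta^t - step_size * K @ grad,
    where grad = (theta^t - y) / n is the gradient of the empirical
    least-squares loss.  The number of steps acts as the regularizer.
    """
    n = len(y)
    theta = np.zeros(n)                      # start from the zero function
    for _ in range(num_steps):
        grad = (theta - y) / n
        theta = theta - step_size * K @ grad
    return theta

def kernel_ridge_fits(K, y, penalty=1e-2):
    """Penalized alternative: fitted values of kernel ridge regression."""
    n = len(y)
    return K @ np.linalg.solve(K + penalty * n * np.eye(n), y)

# Toy fixed-design data (invented for illustration).
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)[:, None]
truth = np.sin(4 * np.pi * x[:, 0])
y = truth + 0.5 * rng.standard_normal(100)

K = gaussian_kernel(x, bandwidth=0.1)
print("boosting MSE:", np.mean((boosted_fits(K, y) - truth) ** 2))
print("ridge    MSE:", np.mean((kernel_ridge_fits(K, y) - truth) ** 2))
```

In this picture, the number of boosting steps plays the same regularizing role that the penalty weight plays for kernel ridge regression; this is the bias-variance trade-off that the paper's stopping rules are designed to balance.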
The remainder of this paper is organized as follows. In Section II, we provide background on boosting methods and reproducing kernel Hilbert spaces, and then introduce the updates studied in this paper. Section III is devoted to statements of our main results, followed by a discussion of their consequences for particular function classes in Section IV. We provide simulations that confirm the practical effectiveness of our stopping rules, and show close agreement with our theoretical predictions. In Section V, we provide the proofs of our main results, with certain more technical aspects deferred to the appendices.

II. BACKGROUND AND PROBLEM FORMULATION

The goal of prediction is to estimate a function that maps covariates x ∈ X to responses y ∈ Y. In a regression problem, the responses are typically real-valued, whereas in a classification problem, the responses take values in a finite set. In this paper, we study both regression (Y = R) and classification problems (e.g., Y = {−1, +1} in the binary case). Our primary focus is on the case of fixed design, in which we observe a collection of n pairs of the form {(x_i, Y_i)}_{i=1}^n, where each x_i ∈ X is a fixed covariate, whereas Y_i ∈ Y is a random response drawn independently from a distribution P_{Y|x_i} which depends on x_i. Later in the paper, we also discuss the consequences of our results for the case of random design, where the (X_i, Y_i) pairs are drawn in an i.i.d. fashion from the joint distribution P = P_X P_{Y|X} for some distribution P_X on the covariates.

Given some loss function φ, we define the population cost functional f → L(f) via

    L(f) := \mathbb{E}_{Y_1^n}\Big[\frac{1}{n}\sum_{i=1}^{n} \phi\big(Y_i, f(x_i)\big)\Big].    (1)

Note that with the covariates {x_i}_{i=1}^n fixed, the functional L is a non-random object. Given some function space F, the optimal function minimizes the population cost functional—that is,

    f^* := \arg\min_{f \in F} L(f).    (2)

As a standard example, when we adopt the least-squares loss φ(y, θ) = (1/2)(y − θ)^2, the population minimizer f^* corresponds to the conditional expectation x → E[Y | x].

Since we do not have access to the population distribution of the responses, however, the computation of f^* is impossible. Given our samples {Y_i}_{i=1}^n, we consider instead some procedure applied to the empirical loss

    L_n(f) := \frac{1}{n}\sum_{i=1}^{n} \phi\big(Y_i, f(x_i)\big),    (3)

where the population expectation has been replaced by an empirical expectation. For example, when L_n corresponds to the log likelihood of the samples with φ(Y_i, f(x_i)) = log[P(Y_i; f(x_i))], direct unconstrained minimization of L_n would yield the maximum likelihood estimator.
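As a concrete illustration of definition (3), the following short sketch (with invented sample values) evaluates the empirical loss for two choices of φ: the least-squares loss above and a logistic loss of the form log(1 + exp(−yθ)), the type of loss underlying LogitBoost; the exact loss functions treated by the theory are specified in Section II.

```python
import numpy as np

# Illustrative evaluation of the empirical loss L_n(f) in equation (3) for two
# common choices of phi.  The sample values below are invented for the example.

def least_squares_loss(y, theta):
    """phi(y, theta) = (1/2) * (y - theta)^2, as used by L2-boosting."""
    return 0.5 * (y - theta) ** 2

def logistic_loss(y, theta):
    """phi(y, theta) = log(1 + exp(-y * theta)) for labels y in {-1, +1},
    a negative log-likelihood of the kind underlying LogitBoost."""
    return np.log1p(np.exp(-y * theta))

def empirical_loss(phi, y, fitted_values):
    """L_n(f) = (1/n) * sum_i phi(Y_i, f(x_i)), as in equation (3)."""
    return np.mean(phi(y, fitted_values))

# Regression example: noisy responses and candidate fitted values f(x_i).
y_reg = np.array([0.9, 1.8, 3.2, 3.9])
f_reg = np.array([1.0, 2.0, 3.0, 4.0])
print("least-squares L_n:", empirical_loss(least_squares_loss, y_reg, f_reg))

# Binary classification example with labels in {-1, +1}.
y_cls = np.array([1.0, -1.0, 1.0, -1.0])
f_cls = np.array([0.8, -1.2, 0.3, 0.5])   # f(x_i) is a real-valued score
print("logistic      L_n:", empirical_loss(logistic_loss, y_cls, f_cls))
```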
It is well known that direct minimization of L_n over a sufficiently rich function class F may lead to overfitting. There are various ways to mitigate this phenomenon, among which the most classical method is to minimize the sum of the empirical loss with a penalty regularization term. Adjusting the …