Early Stopping for Kernel Boosting Algorithms: A General Analysis With Localized Complexities

Yuting Wei, Fanny Yang, and Martin J. Wainwright, Senior Member, IEEE

Abstract—Early stopping of iterative algorithms is a widely used form of regularization in statistics, commonly used in conjunction with boosting and related gradient-type algorithms. Although consistency results have been established in some settings, such estimators are less well-understood than their analogues based on penalized regularization. In this paper, for a relatively broad class of loss functions and boosting algorithms (including L2-boost, LogitBoost, and AdaBoost, among others), we exhibit a direct connection between the performance of a stopped iterate and the localized Gaussian complexity of the associated function class. This connection allows us to show that the local fixed point analysis of Gaussian or Rademacher complexities, now standard in the analysis of penalized estimators, can be used to derive optimal stopping rules. We derive such stopping rules in detail for various kernel classes and illustrate the correspondence of our theory with practice for Sobolev kernel classes.

Index Terms—Boosting, kernel, early stopping, regularization, localized complexities.

I. INTRODUCTION

While non-parametric models offer great flexibility, they can also lead to overfitting, and poor generalization as a consequence. For this reason, it is well understood that procedures for fitting non-parametric models must involve some form of regularization. When models are fit via a form of empirical risk minimization, the most classical form of regularization is based on adding some type of penalty to the objective function. An alternative form of regularization is based on the principle of early stopping, in which an iterative algorithm is run for a pre-specified number of steps, and terminated prior to convergence.

While the basic idea of early stopping is fairly old (e.g., [1], [33], [37]), recent years have witnessed renewed interest in its properties, especially in the context of boosting algorithms and neural network training (e.g., [13], [27]). Over the past decade, a line of work has yielded some theoretical insight into early stopping, including works on classification error for boosting algorithms [4], [14], [19], [25], [41], [42], L2-boosting algorithms for regression [8], [9], and similar gradient algorithms in reproducing kernel Hilbert spaces (e.g., [11], [12], [28], [36], [41]). A number of these papers establish consistency results for particular forms of early stopping, guaranteeing that the procedure outputs a function with statistical error that converges to zero as the sample size increases. On the other hand, there are relatively few results that actually establish rate optimality of an early stopping procedure, meaning that the achieved error matches known statistical minimax lower bounds. To the best of our knowledge, Bühlmann and Yu [9] were the first to prove optimality for early stopping of L2-boosting as applied to spline classes, albeit with a rule that was not computable from the data. Subsequent work by Raskutti et al. [28] refined this analysis of L2-boosting for kernel classes and first established an important connection to the localized Rademacher complexity; see also the related work [10], [29], [41] with rates for particular kernel classes. More broadly, relative to our rich and detailed understanding of regularization via penalization (e.g., see the books [18], [34], [35], [39] and papers [3], [21] for details), our understanding of early stopping regularization is not as well developed.
Intuitively, early stopping should depend on the same bias-variance tradeoffs that control estimators based on penalization. In particular, for penalized estimators, it is now well-understood that complexity measures such as the localized Gaussian width, or its Rademacher analogue, can be used to characterize their achievable rates [3], [21], [34], [39]. Is such a general and sharp characterization also possible in the context of early stopping?

The main contribution of this paper is to answer this question in the affirmative for the early stopping of boosting algorithms as applied to various regression and classification problems involving functions in reproducing kernel Hilbert spaces (RKHS). A standard way to obtain a good estimator or classifier is through minimizing some penalized form of loss function, of which the method of kernel ridge regression [38] is a popular choice. Instead, we consider an iterative update involving the kernel that is derived from a greedy update. Borrowing tools from empirical process theory, we are able to characterize the "size" of the effective function space explored by taking $T$ steps, and then to connect the resulting estimation error naturally to the notion of localized Gaussian width defined with respect to this effective function space. This leads to a principled analysis for a broad class of loss functions used in practice, including the loss functions that underlie the L2-boost, LogitBoost and AdaBoost algorithms, among other procedures.

The remainder of this paper is organized as follows. In Section II, we provide background on boosting methods and reproducing kernel Hilbert spaces, and then introduce the updates studied in this paper. Section III is devoted to statements of our main results, followed by a discussion of their consequences for particular function classes in Section IV. We provide simulations that confirm the practical effectiveness of our stopping rules, and show close agreement with our theoretical predictions. In Section V, we provide the proofs of our main results, with certain more technical aspects deferred to the appendices.

II. BACKGROUND AND PROBLEM FORMULATION

The goal of prediction is to estimate a function that maps covariates $x \in \mathcal{X}$ to responses $y \in \mathcal{Y}$. In a regression problem, the responses are typically real-valued, whereas in a classification problem, the responses take values in a finite set. In this paper, we study both regression ($\mathcal{Y} = \mathbb{R}$) and classification problems (e.g., $\mathcal{Y} = \{-1, +1\}$ in the binary case). Our primary focus is on the case of fixed design, in which we observe a collection of $n$ pairs of the form $\{(x_i, Y_i)\}_{i=1}^n$, where each $x_i \in \mathcal{X}$ is a fixed covariate, whereas $Y_i \in \mathcal{Y}$ is a random response drawn independently from a distribution $\mathbb{P}_{Y\mid x_i}$ which depends on $x_i$. Later in the paper, we also discuss the consequences of our results for the case of random design, where the $(X_i, Y_i)$ pairs are drawn in an i.i.d. fashion from the joint distribution $\mathbb{P} = \mathbb{P}_X \mathbb{P}_{Y\mid X}$ for some distribution $\mathbb{P}_X$ on the covariates.

In this section, we provide some necessary background on a gradient-type algorithm which is often referred to as a boosting algorithm. We also briefly discuss reproducing kernel Hilbert spaces before turning to a precise formulation of the problem that is studied in this paper.

A. Boosting and Early Stopping

Consider a cost function $\phi : \mathbb{R} \times \mathbb{R} \to [0, \infty)$, where the non-negative scalar $\phi(y, \theta)$ denotes the cost associated with predicting $\theta$ when the true response is $y$. Some common examples of loss functions $\phi$ that we consider in later sections include:
• the least-squares loss $\phi(y, \theta) := \frac{1}{2}(y - \theta)^2$ that underlies L2-boosting [9],
• the logistic regression loss $\phi(y, \theta) = \ln(1 + e^{-y\theta})$ that underlies the LogitBoost algorithm [15], [16], and
• the exponential loss $\phi(y, \theta) = \exp(-y\theta)$ that underlies the AdaBoost algorithm [14].
The least-squares loss is typically used for regression problems (e.g., [9], [11], [12], [28], [36], [41]), whereas the latter two losses are frequently used in the setting of binary classification (e.g., [14], [16], [25]).

Given some loss function $\phi$, we define the population cost functional $f \mapsto \mathcal{L}(f)$ via
$$\mathcal{L}(f) := \mathbb{E}_{Y_1^n}\Bigl[\frac{1}{n}\sum_{i=1}^n \phi\bigl(Y_i, f(x_i)\bigr)\Bigr]. \quad (1)$$
Note that with the covariates $\{x_i\}_{i=1}^n$ fixed, the functional $\mathcal{L}$ is a non-random object. Given some function space $\mathcal{F}$, the optimal function minimizes the population cost functional—that is
$$f^* := \arg\min_{f \in \mathcal{F}} \mathcal{L}(f). \quad (2)$$
(Our assumptions guarantee uniqueness of $f^*$ with regard to the design points $\{x_i\}_{i=1}^n$.) As a standard example, when we adopt the least-squares loss $\phi(y, \theta) = \frac{1}{2}(y - \theta)^2$, the population minimizer $f^*$ corresponds to the conditional expectation $x \mapsto \mathbb{E}[Y \mid x]$.

Since we do not have access to the population distribution of the responses, however, the computation of $f^*$ is impossible. Given our samples $\{Y_i\}_{i=1}^n$, we consider instead some procedure applied to the empirical loss
$$\mathcal{L}_n(f) := \frac{1}{n}\sum_{i=1}^n \phi\bigl(Y_i, f(x_i)\bigr), \quad (3)$$
where the population expectation has been replaced by an empirical expectation. For example, when $\mathcal{L}_n$ corresponds to the negative log likelihood of the samples with $\phi(Y_i, f(x_i)) = -\log[\mathbb{P}(Y_i; f(x_i))]$, direct unconstrained minimization of $\mathcal{L}_n$ would yield the maximum likelihood estimator.

It is well-known that direct minimization of $\mathcal{L}_n$ over a sufficiently rich function class $\mathcal{F}$ may lead to overfitting. There are various ways to mitigate this phenomenon, among which the most classical method is to minimize the sum of the empirical loss and a penalty regularization term. Adjusting the weight on the regularization term allows for a trade-off between fit to the data and some form of regularity or smoothness in the fit. The behavior of such penalized or regularized estimation methods is now quite well understood (for instance, see the books [18], [34], [35], [39] and papers [3], [21] for more details).

In this paper, we study a form of algorithmic regularization, based on applying a gradient-type algorithm to $\mathcal{L}_n$ but then stopping it "early"—that is, after some fixed number of steps. Such methods are often referred to as boosting algorithms, since they are based on improving the fit of a function via a sequence of additive updates (e.g., see the papers [6], [7], [14], [30], [31]). Many boosting algorithms, among them AdaBoost [14], L2-boosting [9] and LogitBoost [15], [16], can be understood as forms of functional gradient methods [16], [25]; see the survey paper [8] for further background on boosting. The way in which the number of steps is chosen is referred to as a stopping rule, and the overall procedure is referred to as early stopping of a boosting algorithm.
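To make these objects concrete, the following minimal Python sketch (ours, not code from the paper) implements the three scalar losses above together with their derivatives in $\theta$, as well as the empirical loss $\mathcal{L}_n$ of equation (3); the derivatives are what the gradient-type updates introduced below act on.

```python
import numpy as np

# Sketch of the three scalar losses phi(y, theta) from Section II-A and their
# derivatives in theta.  Function names are ours, not from the paper.

def least_squares(y, theta):        # underlies L2-boosting
    return 0.5 * (y - theta) ** 2

def least_squares_grad(y, theta):
    return theta - y

def logistic(y, theta):             # underlies LogitBoost, y in {-1, +1}
    return np.log1p(np.exp(-y * theta))

def logistic_grad(y, theta):
    return -y / (1.0 + np.exp(y * theta))

def exponential(y, theta):          # underlies AdaBoost, y in {-1, +1}
    return np.exp(-y * theta)

def exponential_grad(y, theta):
    return -y * np.exp(-y * theta)

def empirical_loss(phi, y, f_vals):
    """Empirical loss L_n(f) = (1/n) sum_i phi(Y_i, f(x_i)) from equation (3)."""
    return np.mean(phi(y, f_vals))
```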

Fig. 1. Plots of the squared error $\|f^t - f^*\|_n^2 = \frac{1}{n}\sum_{i=1}^n (f^t(x_i) - f^*(x_i))^2$ versus the iteration number $t$ for (a) LogitBoost using a first-order Sobolev kernel, and (b) AdaBoost using the same first-order Sobolev kernel $\mathcal{K}(x, x') = 1 + \min(x, x')$, which generates a class of Lipschitz functions (splines of order one). Both plots correspond to a sample size $n = 100$.

In more detail, a broad class of boosting algorithms [25] generate a sequence $\{f^t\}_{t=0}^{\infty}$ via updates of the form
$$f^{t+1} = f^t - \alpha^t g^t, \quad \text{with } g^t \propto \arg\max_{\|d\|_{\mathcal{F}} \le 1} \langle \nabla\mathcal{L}_n(f^t), d(x_1^n)\rangle, \quad (4)$$
where the constraint $\|d\|_{\mathcal{F}} \le 1$ defines the unit ball in a given function class $\mathcal{F}$, $d(x_1^n) := (d(x_1), d(x_2), \ldots, d(x_n)) \in \mathbb{R}^n$, $\nabla\mathcal{L}_n(f) \in \mathbb{R}^n$ denotes the gradient taken with respect to the vector $(f(x_1), \ldots, f(x_n))$, and $\langle h, g\rangle$ is the usual inner product between vectors $h, g \in \mathbb{R}^n$. Here the scalars $\{\alpha^t\}_{t=0}^{\infty}$ form a sequence of step sizes chosen by the user. For non-decaying step sizes and a convex objective $\mathcal{L}_n$, running this procedure for an infinite number of iterations will lead to a minimizer of the empirical loss, thus causing overfitting. In order to illustrate this phenomenon, Figure 1 provides plots of the squared error $\|f^t - f^*\|_n^2 := \frac{1}{n}\sum_{i=1}^n (f^t(x_i) - f^*(x_i))^2$ versus the iteration number, for LogitBoost in panel (a) and AdaBoost in panel (b). See Section IV-B for more details on exactly how these experiments were conducted.

In the plots in Figure 1, the dotted line indicates the minimum mean-squared error $\rho_n^2$ over all iterates of that particular run of the algorithm. Both plots are qualitatively similar, illustrating the existence of a "good" number of iterations to take, after which the MSE greatly increases. Hence a natural problem is to decide at what iteration $T$ to stop such that the iterate $f^T$ satisfies bounds of the form
$$\mathcal{L}(f^T) - \mathcal{L}(f^*) \precsim \rho_n^2 \quad \text{and} \quad \|f^T - f^*\|_n^2 \precsim \rho_n^2 \quad (5)$$
with high probability. Here $f(n) \precsim g(n)$ indicates that $f(n) \le c\, g(n)$ for some universal constant $c \in (0, \infty)$. The main results of this paper provide a stopping rule $T$ for which bounds of the form (5) do in fact hold with high probability over the randomness in the observed responses.

Moreover, as shown by our later results, under suitable regularity conditions, the expectation of the minimum squared error $\rho_n^2$ is proportional to the statistical minimax risk $\inf_{\hat f}\sup_{f \in \mathcal{F}} \mathbb{E}[\mathcal{L}(\hat f) - \mathcal{L}(f)]$, where the infimum is taken over all possible estimators $\hat f$. Note that the minimax risk provides a fundamental lower bound on the performance of any estimator uniformly over the function space $\mathcal{F}$. Coupled with our stopping time guarantee (5), we are guaranteed that our estimate achieves the minimax risk up to constant factors. As a result, our bounds are unimprovable in general (see Corollary 2).

B. Reproducing Kernel Hilbert Spaces

The analysis of this paper focuses on algorithms with the update (4) when the function class $\mathcal{F}$ is a reproducing kernel Hilbert space, or RKHS for short. Here we provide a brief introduction, directing the reader to various books for more background [5], [17], [32], [38], [39]. An RKHS $\mathcal{H}$ is a function space consisting of functions mapping a domain $\mathcal{X}$ to the real line $\mathbb{R}$. Any RKHS is defined by a bivariate symmetric kernel function $\mathcal{K} : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ which is required to be positive semidefinite, i.e. for any integer $N \ge 1$ and a collection of points $\{x_j\}_{j=1}^N$ in $\mathcal{X}$, the matrix $[\mathcal{K}(x_i, x_j)]_{ij} \in \mathbb{R}^{N \times N}$ is positive semidefinite.

The associated RKHS is the completion of all linear combinations $\sum_{i=1}^n \alpha_i \mathcal{K}(\cdot, x_i)$, where $n \in \mathbb{N}$, $\{x_i\}_{i=1}^n$ is some collection of points in $\mathcal{X}$, and $\{\alpha_i\}_{i=1}^n$ is a real-valued sequence, with respect to the scalar product
$$\Bigl\langle \sum_{i=1}^n \alpha_i \mathcal{K}(\cdot, x_i),\; \sum_{j=1}^m \alpha_j' \mathcal{K}(\cdot, x_j') \Bigr\rangle_{\mathcal{H}} = \sum_{i,j} \alpha_i \alpha_j' \mathcal{K}(x_i, x_j').$$
For each $x \in \mathcal{X}$, the function $\mathcal{K}(\cdot, x)$ belongs to $\mathcal{H}$, and satisfies the reproducing relation
$$\langle f, \mathcal{K}(\cdot, x)\rangle_{\mathcal{H}} = f(x) \quad \text{for all } f \in \mathcal{H}. \quad (6)$$

Moreover, when the covariates $X_i$ are drawn i.i.d. from a distribution $\mathbb{P}_X$ with domain $\mathcal{X}$, suppose that $\mathcal{X}$ is compact, and that the kernel function $\mathcal{K}$ is continuous, positive semidefinite, and satisfies the bound
$$\int_{\mathcal{X}\times\mathcal{X}} \mathcal{K}(x, x')\, d\mathbb{P}_X(x)\, d\mathbb{P}_X(x') < \infty.$$
Then we can invoke Mercer's theorem, which guarantees that the kernel function can be represented as
$$\mathcal{K}(x, x') = \sum_{k=1}^{\infty} \mu_k \phi_k(x)\phi_k(x'), \quad (7)$$
where $\mu_1 \ge \mu_2 \ge \cdots \ge 0$ are the ordered eigenvalues of the kernel function $\mathcal{K}$ and $\{\phi_k\}_{k=1}^{\infty}$ are eigenfunctions of $\mathcal{K}$ which form an orthonormal basis of $L^2(\mathcal{X}, \mathbb{P}_X)$ with the inner product $\langle f, g\rangle := \int_{\mathcal{X}} f(x)g(x)\, d\mathbb{P}_X(x)$.

Throughout this paper, we assume that the kernel function is uniformly bounded, meaning that there is a constant $L$ such that $\sup_{x\in\mathcal{X}} \mathcal{K}(x, x) \le L$. Such a boundedness condition holds for many kernels used in practice, including the Gaussian, Laplacian, Sobolev, other types of spline kernels, as well as any trace class kernel with trigonometric eigenfunctions. By rescaling the kernel as necessary, we may assume without loss of generality that $L = 1$. As a consequence, for any function $f$ such that $\|f\|_{\mathcal{H}} \le r$, we have by the reproducing relation that
$$\|f\|_{\infty} = \sup_x \langle f, \mathcal{K}(\cdot, x)\rangle_{\mathcal{H}} \le \|f\|_{\mathcal{H}} \sup_x \|\mathcal{K}(\cdot, x)\|_{\mathcal{H}} \le r.$$
Given samples $\{(x_i, y_i)\}_{i=1}^n$, by the representer theorem [20], it is sufficient to restrict ourselves to the linear subspace $\mathcal{H}_n = \mathrm{span}\{\mathcal{K}(\cdot, x_i)\}_{i=1}^n$, for which all $f \in \mathcal{H}_n$ can be expressed as
$$f = \frac{1}{\sqrt{n}}\sum_{i=1}^n \omega_i \mathcal{K}(\cdot, x_i) \quad (8)$$
for some coefficient vector $\omega \in \mathbb{R}^n$. Among those functions which achieve the infimum in expression (1), let us define $f^*$ as the one with the minimum Hilbert norm. This definition is equivalent to restricting $f^*$ to be in the linear subspace $\mathcal{H}_n$.

C. Boosting in Kernel Spaces

Given a collection of $n$ covariates $\{x_i\}_{i=1}^n$, let us define the normalized kernel matrix $K \in \mathbb{R}^{n\times n}$ with entries $K_{ij} = \mathcal{K}(x_i, x_j)/n$. Recall that we can restrict the minimization of $\mathcal{L}_n$ and $\mathcal{L}$ from $\mathcal{H}$ to the subspace $\mathcal{H}_n$ without loss of generality. Consequently, using the expression (8), the function value vectors $f(x_1^n) := (f(x_1), \ldots, f(x_n))$ can be written in the form $f(x_1^n) = \sqrt{n}\, K\omega$. Moreover, if the null space of $K$ is empty, then there is a one-to-one correspondence between $n$-dimensional vectors $f(x_1^n) \in \mathbb{R}^n$ and the corresponding function $f \in \mathcal{H}_n$. In that case, minimization of an empirical loss in the subspace $\mathcal{H}_n$ essentially becomes the $n$-dimensional problem of fitting a response vector $y$ over the set $\mathrm{range}(K)$. In the sequel, all updates will thus be performed on the function value vectors $f(x_1^n)$.

With a change of variable $d(x_1^n) = \sqrt{n}\,\sqrt{K}\, z$, we then have
$$d^t(x_1^n) := \arg\max_{\substack{\|d\|_{\mathcal{H}} \le 1 \\ d \in \mathrm{range}(K)}} \langle \nabla\mathcal{L}_n(f^t), d(x_1^n)\rangle = \frac{\sqrt{n}\, K\nabla\mathcal{L}_n(f^t)}{\|\sqrt{K}\,\nabla\mathcal{L}_n(f^t)\|_2}.$$
In this paper, we study the choice $g^t = \langle \nabla\mathcal{L}_n(f^t), d^t(x_1^n)\rangle\, d^t$ in the boosting update (4), so that the function value iterates take the form
$$f^{t+1}(x_1^n) = f^t(x_1^n) - \alpha n K \nabla\mathcal{L}_n(f^t), \quad (9)$$
where $\alpha > 0$ is a constant stepsize choice. With the initialization $f^0(x_1^n) = 0$, all iterates $f^t(x_1^n)$ remain in the range space of $K$.

In this paper, we consider the following three error measures for an estimator $\hat f$:
$$L^2(\mathbb{P}_n) \text{ norm:}\quad \|\hat f - f^*\|_n^2 := \frac{1}{n}\sum_{i=1}^n \bigl(\hat f(x_i) - f^*(x_i)\bigr)^2,$$
$$L^2(\mathbb{P}_X) \text{ norm:}\quad \|\hat f - f^*\|_2^2 := \mathbb{E}\bigl[\bigl(\hat f(X) - f^*(X)\bigr)^2\bigr],$$
$$\text{Excess risk:}\quad \mathcal{L}(\hat f) - \mathcal{L}(f^*).$$
Here the expectation defining the $L^2(\mathbb{P}_X)$-norm is taken over random covariates $X$ that are independent of the samples $(X_i, Y_i)$ used to form the estimate $\hat f$. Our goal is to propose a stopping time $T$ such that the averaged function $\bar f = \frac{1}{T}\sum_{t=1}^T f^t$ satisfies bounds of the type (5). We begin our analysis by focusing on the empirical $L^2(\mathbb{P}_n)$ error, but as we will see in Corollary 1, bounds on the empirical error are easily transformed to bounds on the population $L^2(\mathbb{P}_X)$ error. Importantly, we exhibit such bounds with a statistical error term $\delta_n$ that is specified by the localized Gaussian complexity of the kernel class.
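The following sketch (ours) runs the recursion (9) for the logistic loss on the first-order Sobolev kernel and records both the empirical error $\|f^t - f^*\|_n^2$ and the running average $\bar f$ analyzed by our theory. The target function, the simplified binary response model, the step size, and the iteration budget are illustrative choices loosely modeled on the experiments of Section IV-B, not prescriptions of the theory.

```python
import numpy as np

# Sketch of the kernel boosting recursion (9) for the logistic loss (LogitBoost)
# on the first-order Sobolev kernel.  The target f*, response model, step size
# and iteration budget are illustrative choices of ours.

rng = np.random.default_rng(0)
n, alpha, T = 100, 0.75, 300

x = np.arange(1, n + 1) / n
K = (1.0 + np.minimum(x[:, None], x[None, :])) / n   # normalized kernel matrix K_ij = K(x_i, x_j)/n
f_star = np.abs(x - 0.5) - 0.25                      # target function f*(x) = |x - 1/2| - 1/4
p = 1.0 / (1.0 + np.exp(-f_star))                    # P(Y = +1 | x)
y = np.where(rng.random(n) < p, 1.0, -1.0)

def grad_Ln(theta):
    """Gradient of L_n with respect to the vector (f(x_1), ..., f(x_n))."""
    return (-y / (1.0 + np.exp(y * theta))) / n      # (1/n) d/dtheta_i log(1 + exp(-y_i theta_i))

theta = np.zeros(n)                                  # initialization f^0 = 0
theta_bar, errors = np.zeros(n), []
for t in range(1, T + 1):
    theta = theta - alpha * n * K @ grad_Ln(theta)   # update (9)
    theta_bar += (theta - theta_bar) / t             # running average (1/t) sum_{s<=t} f^s
    errors.append(np.mean((theta - f_star) ** 2))    # ||f^t - f*||_n^2

print("error of averaged iterate:", np.mean((theta_bar - f_star) ** 2))
```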

III. MAIN RESULTS

We now turn to the statement of our main results, beginning with the introduction of some regularity assumptions.

A. Assumptions

Recall from our earlier set-up that we differentiate between the empirical loss function $\mathcal{L}_n$ in expression (3), and the population loss $\mathcal{L}$ in expression (1). Apart from assuming differentiability of both functions, all of our remaining conditions are imposed on the population loss. Such conditions at the population level are weaker than their analogues at the empirical level.

For a given radius $r > 0$, let us define the Hilbert ball around the optimal function $f^*$ as
$$\mathcal{B}_{\mathcal{H}}(f^*, r) := \{f \in \mathcal{H} \mid \|f - f^*\|_{\mathcal{H}} \le r\}. \quad (10)$$
Our analysis makes particular use of this ball defined for the radius $C_{\mathcal{H}}^2 := 2\max\{\|f^*\|_{\mathcal{H}}^2, 32, \sigma^2\}$, where the effective noise level $\sigma$ is defined in the sequel.

We assume that the population loss is $m$-strongly convex and $M$-smooth over $\mathcal{B}_{\mathcal{H}}(f^*, 2C_{\mathcal{H}})$, meaning that the $m$-$M$-condition
$$\frac{m}{2}\|f - g\|_n^2 \;\le\; \mathcal{L}(f) - \mathcal{L}(g) - \langle \nabla\mathcal{L}(g),\, f(x_1^n) - g(x_1^n)\rangle \;\le\; \frac{M}{2}\|f - g\|_n^2$$
holds for all $f, g \in \mathcal{B}_{\mathcal{H}}(f^*, 2C_{\mathcal{H}})$ and all design points $\{x_i\}_{i=1}^n$. In addition, we assume that the function $\phi$ is $M$-Lipschitz in its second argument over the interval
$$\theta \in \Bigl[\min_{i\in[n]} f^*(x_i) - 2C_{\mathcal{H}},\; \max_{i\in[n]} f^*(x_i) + 2C_{\mathcal{H}}\Bigr].$$
To be clear, here $\nabla\mathcal{L}(g)$ denotes the vector in $\mathbb{R}^n$ obtained by taking the gradient of $\mathcal{L}$ with respect to the vector $g(x_1^n)$. It can be verified by a straightforward computation that when $\mathcal{L}$ is induced by the least-squares cost $\phi(y, \theta) = \frac{1}{2}(y - \theta)^2$, the $m$-$M$-condition holds for $m = M = 1$. The logistic and exponential losses satisfy this condition (see the supplementary material), where it is key that we have imposed the condition only locally on the ball $\mathcal{B}_{\mathcal{H}}(f^*, 2C_{\mathcal{H}})$.

In addition to the least-squares cost, our theory also applies to losses $\mathcal{L}$ induced by scalar functions $\phi$ that satisfy the $\phi$-boundedness condition:
$$\max_{i=1,\ldots,n}\Bigl|\frac{\partial\phi(y,\theta)}{\partial\theta}\Bigr|_{\theta = f(x_i)} \le B \quad \text{for all } f \in \mathcal{B}_{\mathcal{H}}(f^*, 2C_{\mathcal{H}}) \text{ and } y \in \mathcal{Y}.$$
This condition holds with $B = 1$ for the logistic loss for all $\mathcal{Y}$, and $B = \exp(2.5\, C_{\mathcal{H}})$ for the exponential loss for binary classification with $\mathcal{Y} = \{-1, 1\}$, using our kernel boundedness condition. Note that whenever this condition holds with some finite $B$, we can always rescale the scalar loss $\phi$ by $1/B$ so that it holds with $B = 1$, and we do so in order to simplify the statement of our results.

B. Upper Bound in Terms of Localized Gaussian Width

Our upper bounds involve a complexity measure known as the localized Gaussian width. In general, Gaussian widths are widely used to obtain risk bounds for least-squares and other types of M-estimators. In our case, we consider Gaussian complexities for "localized" sets of the form
$$\mathcal{E}_n(\delta, 1) := \bigl\{f - g \mid f, g \in \mathcal{H} \text{ such that } \|f - g\|_{\mathcal{H}} \le 1 \text{ and } \|f - g\|_n \le \delta\bigr\}. \quad (11)$$
The Gaussian complexity localized at scale $\delta$ is given by
$$\mathcal{G}_n\bigl(\mathcal{E}_n(\delta, 1)\bigr) := \mathbb{E}_w\Bigl[\sup_{g\in\mathcal{E}_n(\delta,1)} \Bigl|\frac{1}{n}\sum_{i=1}^n w_i\, g(x_i)\Bigr|\Bigr], \quad (12)$$
where $(w_1, \ldots, w_n)$ denotes an i.i.d. sequence of standard Gaussian variables.

An essential quantity in our theory is specified by a certain fixed point equation that is now standard in empirical process theory [3], [21], [28], [34], [39]. Let us define the effective noise level
$$\sigma := \begin{cases} \min\bigl\{t \mid \max_{i=1,\ldots,n} \mathbb{E}\bigl[e^{(Y_i - f^*(x_i))^2/t^2}\bigr] < \infty\bigr\} & \text{for the least-squares loss,}\\[2pt] 4(2M+1)(1+2C_{\mathcal{H}}) & \text{for } \phi\text{-bounded losses.}\end{cases} \quad (13)$$
As a remark, for the quadratic loss function, the above assumption is equivalent to assuming that each $Y_i - f^*(x_i)$ is a sub-Gaussian random variable.

The critical radius $\delta_n$ is the smallest positive scalar such that
$$\frac{\mathcal{G}_n(\mathcal{E}_n(\delta, 1))}{\delta} \le \frac{\delta}{\sigma}. \quad (14)$$
We note that past work on localized Rademacher and Gaussian complexity [3], [26] guarantees that there exists a unique $\delta_n > 0$ that satisfies this condition, so that our definition is sensible. In particular, one can show (for more details, see e.g. Wainwright [39, Lemma 13.6]) that the left-hand side of this inequality is a non-increasing function of $\delta$, while the right-hand side is an increasing function of $\delta$. Therefore there exists a smallest positive solution to the above inequality.

1) Upper Bounds on Excess Risk and Empirical $L^2(\mathbb{P}_n)$-Error: With this set-up, we are now equipped to state our main theorem. It provides high-probability bounds on the excess risk and $L^2(\mathbb{P}_n)$-error of the estimator $\bar f^T := \frac{1}{T}\sum_{t=1}^T f^t$ defined by averaging the $T$ iterates of the algorithm. It applies to both the least-squares cost function, and more generally, to any loss function satisfying the $m$-$M$-condition and the $\phi$-boundedness condition.

Theorem 1. Given a sample size $n$ sufficiently large to ensure that $\delta_n \le \frac{M}{m}$, suppose that we compute the sequence $\{f^t\}_{t=0}^{\infty}$ using the update (9) with initialization $f^0 = 0$ and any step size $\alpha \in \bigl(0, \min\{\frac{1}{M}, M\}\bigr]$. Then for any iteration $T \in \bigl\{0, 1, \ldots, \bigl\lfloor\frac{m}{8M\delta_n^2}\bigr\rfloor\bigr\}$, the averaged function estimate $\bar f^T$ satisfies the bounds
$$\mathcal{L}(\bar f^T) - \mathcal{L}(f^*) \le CM\Bigl(\frac{1}{\alpha m T} + \frac{\delta_n^2}{m^2}\Bigr), \quad \text{(15a)}$$
$$\|\bar f^T - f^*\|_n^2 \le C\Bigl(\frac{1}{\alpha m T} + \frac{\delta_n^2}{m^2}\Bigr), \quad \text{(15b)}$$
where both inequalities hold with probability at least $1 - c_1\exp\bigl(-C_2\,\frac{m^2 n\delta_n^2}{\sigma^2}\bigr)$.

We prove Theorem 1 in Section V-A.

A few comments about the constants in our statement: in all cases, constants of the form $c_j$ are universal, whereas the capital $C_j$ may depend on parameters of the joint distribution and population loss $\mathcal{L}$. In Theorem 1, we have the explicit value $C_2 = \max\{\frac{m^2}{\sigma^2}, 1\}$, and $C$ is proportional to the quantity $2\max\{\|f^*\|_{\mathcal{H}}^2, 32, \sigma^2\}$. While inequalities (15a) and (15b) are stated as high probability results, similar bounds for the expected loss (over the responses $y_i$, with the design fixed) can be obtained by a simple integration argument. As another remark, since the function class $\mathcal{E}_n(\delta, 1)$ does not depend on the minimizer $f^*$, the quantity $\delta_n$ in expression (14) is also independent of $f^*$, and therefore our bound holds for all $f^*$ in the unit ball of the given RKHS.

In order to gain intuition for the claims in the theorem, note that apart from factors depending on $(m, M)$, the first term $\frac{1}{\alpha m T}$ dominates the second term $\frac{\delta_n^2}{m^2}$ whenever $T \precsim 1/\delta_n^2$. Consequently, up to this point, taking further iterations reduces the upper bound on the error. This reduction continues until

/δ2 E(δ, ) we have taken of the order 1 n many steps, at which point It is analogous to the previously defined set 1 , except that δ2 · the upper bound is of the order n. the empirical norm n has been replaced by the population More precisely, suppose that we perform the updates with version. The population Gaussian complexity localized at scale α = m τ := 1 δ step size ; then, after a total number of δ2 { , } is given by M n max 8 M many iterations, the extension of Theorem 1 to expectations    n  guarantees that the mean squared error is bounded as   1 Gn E(δ, 1) :=Ew,X sup wi g(Xi ) , (19)  n δ2 g∈E(δ,1) i=1 E f¯τ − f ∗2 ≤ C n , n 2 (16) m {w }n where i i=1 are an i.i.d. sequence of standard normal  n where C is another constant depending on CH .Herewe variates, and {Xi } = is a second i.i.d. sequence, indepen- ≥ i 1 have used the fact that M m in simplifying the expression. dent of the normal variates, drawn according to PX . Finally,  It is worth noting that guarantee (16) matches the best known the population critical radius δn is defined by equation (19),  upper bounds for kernel ridge regression (KRR)—indeed, in which Gn is replaced by Gn. this must be the case, since a sharp analysis of KRR is based on the same notion of localized Gaussian complexity Corollary 1. Under the conditions of Theorem 1, suppose { }n (e.g. [2], [3]). Thus, our results establish a strong parallel in addition that the sequence Xi i=1 are drawn i.i.d. from P { }n between the algorithmic regularization of early stopping, distribution X and the responses Yi i=1 are generated from and the penalized regularization of kernel ridge regression. regression model (17). If we compute the boosting updates with α ∈ ( , { 1 , }] 0 = Moreover, as will be clarified in Section III-C, under suitable step size 0 min M M and initialization f 0,then ¯T := 1  the averaged function estimate f at time T δ¯2 { , } regularity conditions on the RKHS, the critical squared radius n max 8 M δ2 n also acts as a lower bound for the expected risk, meaning satisfies the bound that our upper bounds are not improvable in general.   ∗ ∗ δ2 E f¯T (X) − f (X) 2 =f¯T − f 2 ≤˜c δ2 Note that the critical radius n only depends on our obser- X 2 n n vations {(xi , yi )} = through the solution of inequality (14). In i 1 2 δ¯2 − (− m n n ) many cases, it is possible to compute and/or upper bound this with probability at least 1 c1 exp C2 σ 2 over the random critical radius, so that a concrete and valid stopping rule can samples. indeed by calculated in advance. In Section IV, we provide a The proof of Corollary 1 consists of two main steps: number of settings in which this can be done in terms of the n • eigenvalues {μ j } = of the normalized kernel matrix. From standard empirical process theory bounds [3], [28], j 1  ¯T − ∗2 2) Consequences for Random Design Regression: Thus far, the difference between empirical risk f f n and  ¯T − ∗2 our analysis has focused purely on the case of fixed design, in population risk f f 2 can be controlled for uni- { }n formly bounded f¯T − f ∗. Note that the latter holds which the sequence of covariates xi i=1 is viewed as fixed. when T ≤ 1  by uniform boundedness of the If we instead view the covariates as being sampled in an δ¯2 max{8,M} P X n i.i.d. manner from some distribution  X over , then the kernel and Lemma 4 used for the proof of Theorem 1. 
In  − ∗2 = 1 n ( ) − ∗( ) 2 empirical error f f n n i=1 f xi f xi of a particular, it can be shown that given estimate f is a random quantity, and it is interesting to ∗ ¯T ∗ 2 ¯T ∗ 2  2(P )  − 2 = | f − f  −f − f  |≤cδn, relate it to the squared population L X -norm f f 2 n 2 E ( f (X) − f ∗(X))2 . . In order not to unnecessarily complicate the discussion, for a fixed positive constant c in this section we assume that the data is drawn from the • Furthermore, one can show (e.g. Wainwright [39, Propo- δ generative model sition 14.25]) that the empirical critical quantity n is δ ∗ bounded by the population n up to multiplicative con- yi = f (xi ) + i , i = 1, 2,...,n, (17) stant. ∗ with f ∈ H ,andwhereweassumethatthenoise i := By combining both arguments the corollary follows. We refer ∗ yi − f (xi ) variables are uncorrelated with xi . Under these the reader to the papers [3], [28] for further details on such conditions, the function f ∗ is both the minimizer of the equivalences. population risk when the expectation is taken over the pair It is worth comparing this guarantee with the past work of (X, Y ) jointly, and the minimizer of the empirical risk when Raskutti et al. [28], who analyzed the kernel boosting iterates the expectation is only taken with respect to the randomness of the form (9), but with attention restricted to the special in Y . In this way, our results from the case of non-random case of the least-squares loss. Their analysis was based on first design can be generalized directly. decomposing the squared error into bias and variance terms, In order to state an upper bound on this error, we introduce a then carefully relating the combination of these terms to a population analogue of the critical radius δn, which we denote particular bound on the localized Gaussian complexity (see  by δn. Consider the set equation (23) below). In contrast, our theory more directly  analyzes the effective function class that is explored by taking  E(δ, 1) := f − g | f, g ∈ H ,  f − gH ≤ 1, (18) T steps, so that the localized Gaussian width (19) appears  more naturally. In addition, our analysis applies to a broader  −  ≤ δ . f g 2 class of loss functions. WEI et al.: EARLY STOPPING FOR KERNEL BOOSTING ALGORITHMS 6691
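As noted above, the critical radius $\delta_n$ depends on the data only through inequality (14), and it can often be computed, or upper bounded, from the eigenvalues of the normalized kernel matrix. The sketch below (ours) does exactly that: it replaces the localized Gaussian complexity by its kernel-matrix eigenvalue bound (the empirical analogue of Lemma 1 in Section IV; see also [39, Lemma 13.22]), solves the resulting fixed-point inequality by a grid search, and stops after a number of steps of order $1/\delta_n^2$, in the spirit of Theorem 1. The noise level $\sigma$ and the search grid are illustrative choices.

```python
import numpy as np

# Data-dependent stopping rule sketch (ours): bound the localized Gaussian
# complexity via the eigenvalues of the normalized kernel matrix and solve
# the fixed-point inequality (14) for the critical radius by grid search.

def critical_radius(K_norm, sigma, grid=None):
    """Smallest delta with sqrt((2/n) sum_j min(delta^2, mu_j)) / delta <= delta / sigma."""
    mu = np.maximum(np.linalg.eigvalsh(K_norm), 0.0)   # eigenvalues of K = [K(x_i, x_j)/n]
    n = K_norm.shape[0]
    if grid is None:
        grid = np.logspace(-3, 1, 2000)
    for delta in grid:
        lhs = np.sqrt(2.0 / n * np.sum(np.minimum(delta ** 2, mu))) / delta
        if lhs <= delta / sigma:
            return delta
    return grid[-1]

n, sigma = 100, 0.5
x = np.arange(1, n + 1) / n
K_norm = (1.0 + np.minimum(x[:, None], x[None, :])) / n   # first-order Sobolev kernel

delta_n = critical_radius(K_norm, sigma)
T_stop = int(1.0 / delta_n ** 2)                          # stopping rule of order 1/delta_n^2
print(f"critical radius ~ {delta_n:.3f}, stop after ~ {T_stop} iterations")
```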

C. Achieving Minimax Lower Bounds yields an estimate f¯T such that ∗ ∗ In this section, we show that the upper bound (16) for fixed E f¯T − f 2  inf sup E f − f 2. (22) n  ∗ n design matches known minimax lower bounds on the error, f  f H ≤1 so that our results are unimprovable in general. We establish Here f (n)  g(n) means f (n) = cg(n) up to a universal this result for the class of regular kernels, as previously defined constant c ∈ (0, ∞). As always, in the minimax claim (22), by Yang et al. [40], which includes the Gaussian and Sobolev the infimum is taken over all measurable functions of the kernels as special cases. input data and the expectation is taken over the random- The class of regular kernels is defined as follows. Let { }n ness of the response variables Yi i=1. Since we know that μ1 ≥ μ2 ≥ ··· ≥ μn ≥ 0 denote the ordered eigenvalues E ¯T − ∗2  δ2 f f n n, the way to prove bound (22) is by estab- K  ∗ 2 2 of the normalized kernel matrix , and define the quantity lishing inf sup ∗ ≤ E f − f   δ . See Section V-B := {μ ≤ δ2} f f H 1 n n dn argmin j=1,...,n j n . A kernel is called regular for the proof of this result. whenever there is a universal constant c such that the tail At a high level, the statement in Corollary 2 shows that sum satisfies n μ ≤ cd δ2. In words, the tail sum j=dn+1 j n n early stopping prevents us from overfitting to the data; in of the eigenvalues for regular kernels is roughly on the particular, using the stopping time T yields an estimate that same or smaller scale as the sum of the eigenvalues bigger attains the optimal balance between bias and variance. Note δ2 than n. that for the special case of the squared loss, related works For such kernels and under the Gaussian observation model such as [11], [24] have proved bounds which adapt to the ∼ ( ∗( ), σ 2) ∗ (Yi N f xi ), Yang et al. [40] prove a minimax lower smoothness and regularity of f for an early stopping time δ bound involving n. In particular, they show that the minimax (which also depends on the regularity parameter). The novelty risk over the unit ball of the Hilbert space is lower bounded in our work is that our general unified framework which as connects penalized and algorithmic regularization, can be used ∗ to compute bounds for a large variety of loss functions. inf sup E f − f 2 ≥ cδ2, (20)  ∗ n n f  f H ≤1 IV. CONSEQUENCES FOR VARIOUS KERNEL CLASSES for some fixed positive constant c. Comparing the lower In this section, we apply Theorem 1 to derive some concrete bound (20) with upper bound (16) for our estimator f¯T rates for different kernel spaces for the random design case stopped after O(1/δ2) many steps, it follows that the bounds n and then illustrate them with some numerical experiments. It proven in Theorem 1 are unimprovable apart from constant is known that the complexity of an RKHS in association with factors. a distribution over the covariates P can be characterized by We now state a generalization of this minimax lower X the decay rate (7) of the eigenvalues of the kernel operator bound, one which applies to a sub-class of generalized linear (also called eigen-decay). The representation power of a kernel models, or GLM for short. In these models, the conditional class is directly correlated with the eigen-decay: the faster the distribution of the observed vector Y = (Y1,...,Yn) given ∗ ∗ decay, the smaller the function class. f (x1),..., f (xn) takes the form n   ∗ ∗  yi f (xi ) − ( f (xi )) A. 
Theoretical Predictions as a Function of Decay Pθ (y) = h(y ) exp , (21) i s(σ) In this section, let us consider two forms of decay of kernel i=1 eigenvalues μi defined in equation (7). where s(σ) is a known scale factor and : R → R is the • γ -exponential decay:Forsomeγ>0, the cumulant function of the generalized linear model. As some eigenvalues satisfy a decay condition of the form γ concrete examples: μ j ≤ c1 exp(−c2 j ),wherec1, c2 are universal con- • The linear Gaussian model is recovered by setting s(σ) = stants. Examples of kernels in this class include the σ 2 and (t) = t2/2. Gaussian kernel, which for the Lebesgue measure satisfies • The logistic model for binary responses such a bound with γ = 2 (over the real line) or γ = 1 y ∈{−1, 1} is recovered by setting s(σ) = 1and (on a compact domain). (t) = log(1 + exp(t)). • β-polynomial decay:Forsomeβ>1/2, the eigenvalues μ ≤ −2β Our minimax lower bound applies to the class of GLMs satisfy a decay condition of the form j c1 j , where c1 is a universal constant. Examples of kernels for which the cumulant function is differentiable and th has uniformly bounded second derivative | |≤L.This in this class include the k -order Sobolev spaces for ≥ class includes the linear, logistic, multinomial families, among some fixed integer k 1 with Lebesgue measure on a bounded domain. We consider Sobolev spaces that consist others, but excludes (for instance) the Poisson family. Under th (k) this condition, we have the following: of functions that have k -order weak derivatives f being Lebesgue integrable and f (0) = f (1)(0) = ··· = n ( − ) Corollary 2. Suppose that we are given i.i.d. samples {yi } = f k 1 (0) = 0. For such classes, the β-polynomial decay ∗ i 1 from a GLM (21) for some function f in a regular kernel condition holds with β = k.  ∗ ≤ := 1  class with f H 1. Then running T δ2 { , } n max 8 M Given eigendecay conditions of these types, it is possible α ∈ ( , { 1 , }] 0 = δ iterations with step size 0 min M M and f 0 to compute an upper bound on the critical radius n by upper 6692 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 65, NO. 10, OCTOBER 2019 bounding the localized Gaussian complexity by a function of in terms of the eigenvalues of the kernel matrix. The proof of the eigenvalues. In particular, we have the following lemma: this claim is similar; see Lemma 13.22 in the book [39] for details. Since the empirical and population critical quantities Lemma 1. Letting μ ≥ μ ≥ ··· ≥ 0 denote the ordered 1 2 are close up to constants ( [39, Lemma 13.6,]), this connection eigenvalues of the kernel operator, it holds that provides an explicit way to compute the optimal stopping    ∞ rule T up to a constant factor. It does entail computing (a 2  G (E(δ, 1)) ≤ min{δ2,μ }. (23) subset of) the eigenvalues of the n-dimensional kernel matrix. n n j j=1 In applications where there are multiple estimation problems involving the same set of design points, this cost could be The proof can be found in Section E. As a direct conse- amortized. In other settings, this cost could be significant, and quence, the smallest δ which satisfies the following inequality it would be interesting to devise algorithms that approximate    ∞ the kernel eigenvalues to sufficient accuracy. We leave this as 2  δ2 min{δ2,μ }≤ (24) a direction for future work. n j σ j= 1 B. Numerical Experiments  is an upper bound on the critical radius δn. 
We can now see that We now describe some numerical experiments that pro-  δn is directly related to the representation power of a kernel vide illustrative confirmations of our theoretical predictions. class by means of the eigenvalues of the corresponding kernel While we have applied our methods to various kernel classes, operator. in this section, we present numerical results for the first-order Consequently, we can show that for γ -exponentially decay- /γ Sobolev kernel as two typical examples for exponential and δ2  (logn)1 β ing kernels, we have n n , whereas for -polynomial polynomial eigen-decay kernel classes. − 2β Let us start with the first-order Sobolev space of Lip- kernels, we have δ2  n 2β+1 ,where denotes an inequality n [ , ] holding with a universal constant pre-factor. Combining with schitz functions on the unit interval 0 1 , defined by the K(x, x) = + (x, x) our Theorem 1, we obtain the following result: kernel 1 min , and with the design points { }n [ , ] xi i=1 set equidistantly over 0 1 . Note that the equidistant Corollary 3 (Bounds based on eigendecay). Under the con- design yields β-polynomial decay of the eigenvalues of K { }n P β = ditions of Corollary 1 and for xi i=1 drawn i.i.d. from X , with 1 as in the case when xi are drawn i.i.d. from the following bounds on the expected empirical errors over the uniform measure on [0, 1]. Consequently we have that − δ − n n δ2  −2/3 the outputs hold with probability at least 1 e n n . Accordingly, our theory predicts that the stopping / ¯T (a) For kernels with γ -exponential eigen-decay, we have time T = (cn)2 3 should lead to an estimate f such that  ¯T − ∗2  −2/3 1/γ f f n n . ∗ log n 2 E ¯T − 2 ≤ In our experiments for L -Boost, we sampled Yi according f f n c ∗ n to Yi = f (xi ) + wi with wi ∼ N (0, 0.5), which corresponds n ∗ at T  steps. to the probability distribution P(Y | xi ) = N ( f (xi ); 0.5), log1/γ n where f ∗(x) =|x − 1 |− 1 is defined on the unit interval (b) For kernels with β-polynomial eigen-decay, we have 2 4 [0, 1]. By construction, the function f ∗ belongs to the first- E ¯T − ∗2 ≤ −2β/(2β+1)  ∗ = f f n cn order Sobolev space with f H 1. For LogitBoost, we sampled Y according to Bin(p(x ), 5) where p(x) = 2β/(2β+1) ∗ i i at T  n steps. exp( f (x)) 0 = 1+exp( f ∗(x)). In all cases, we fixed the initialization f 0, In particular, these bounds hold for LogitBoost and AdaBoost. and ran the updates (9) for L2-Boost and LogitBoost with See Section V-C for the proof of Corollary 3. These bounds the constant step size α = 0.75. We compared various directly carry over to bounds on the excess risk L( f T )−L( f ∗) stopping rules to the oracle rule G, meaning the procedure that as outlined in the proof of Theorem 1 and also hold for the examines all iterates { f t }, and chooses the stopping time G =  t − ∗2 population errors by invoking Corollary 1. arg mint≥1 f f n that yields the minimum prediction To the best of our knowledge, our results are the first error. Of course, this procedure is unimplementable in practice, to show non-asymptotic and minimax optimal rates for the but it serves as a convenient lower bound with which to ·2 n-error when applying early stopping to either the Log- compare. itBoost or AdaBoost updates, along with an explicit depen- Each panel in Figure 2 shows plots of the mean-squared  ¯T − ∗2 dence of the stopping rule on n. 
Our results yield analogous error f f n versus the sample size n, with each point guarantees for L2-boosting, as have been established in past corresponding to an average over 40 trials, for four differ- work [28]. Note that we observe a similar trade-off between ent stopping rules. Error bars correspond to the standard computational efficiency and statistical accuracy as in the case errors computed from our simulations. The top two panels of kernel least-squares regression [28], [41]: although larger ((a) and (b)) show results for L2-Boost, whereas the bottom kernel classes (e.g. Sobolev classes) yield higher estimation two panels ((c) and (d)) show results for LogitBoost. The errors, boosting updates reach the optimum faster than for a left two panels ((a) and (c)) gives plots on a linear-linear smaller kernel class (e.g. Gaussian kernels). scale, whereas the corresponding right two panels ((b) and It is worth noting that whiel Lemma 1 is stated in terms of (d)) replots the same data on a log-log scale. The blue-solid the kernel operator eigenvalues, an analogous bound also holds curve corresponds to the performance of the oracle stopping WEI et al.: EARLY STOPPING FOR KERNEL BOOSTING ALGORITHMS 6693

Fig. 2. Plots of the mean-squared error versus the sample size n for various types of stopping rules. Panel (a) shows plots on a linear-linear scale, whereas panel (b) shows plots on a log-log scale. The blue-solid curve corresponds the oracle rule of stopping the algorithm at the time T with the minimum error; it is an oracle procedure, since we cannot actually determine the point of minimum error in practice. The remaining three curves show the behavior of three κ different practical stopping rules, all based on stopping at the time T = (7n) for different choices of exponent κ. The performance of the rules given by κ ∈{1/3, 2/3, 1} are plotted in black-dotted, red-dashed, and green-dot-dashed curves, respectively. The choice κ = 0.67 is the theoretically optimal choice. On the log-log plot, we see linear fits with slopes −0.80 for the oracle rule versus −0.85 for the rule with κ = 2/3 in the case of L2-boost, and −0.71 for the oracle rule versus −0.77 for the rule with κ = 2/3 in the case of LogitBoost. time T = G described above; the other three curves show was to illustrate the correspondence between our theoretical the performance of rules of the form T = (7n)κ for different predictions and behavior in simulation. choices of κ. Note that the behavior of the stopping rules 2 for L -Boost and LogitBoost are qualitatively similar. In both V. P ROOF OF MAIN RESULTS cases, the theoretically derived stopping rule T = (7n)κ We now turn to the proofs of our main results, with the bulk with κ∗ = 2/3 ≈ 0.67, while slightly worse than the of the technical details deferred to the appendices. We begin oracle, tracks its performance closely. We also performed by recalling some notation from Section II-C. We denote the simulations for some “bad” stopping rules, in particular for vector of function values of a function f ∈ H evaluated at an exponent κ not equal to the theoretically optimal choice (x , x ,...,x ) as 2/3; the performance of these bad rules is indicated in the 1 2 n θ := ( n) = ( ( ), ( ),... ( )) ∈ Rn, green-dot-dashed and black-dotted curves. The log scale plots f f x1 f x1 f x2 f xn in Figure 2, namely panels (b) and (d), show that with rules defined by κ ∈{1/3, 1}, the resulting performance is where we omit the subscript f when it is clear from the indeed substantially worse, with the difference in slope even context. As mentioned in the main text, updates on the function θ t ∈ Rn suggesting a different scaling of the error with the number value vectors correspond uniquely to updates of the t ∈ H of observations n. Recalling our discussion for Figure 1, this functions f (see a complete discussion in Section A). phenomenon likely occurs due to underfitting and overfitting In the following we repeatedly abuse notation by defining the  ∈ ( ) effects. These qualitative shifts are consistent with our theory. Hilbert norm and empirical norm on vectors in range K As regards to the choice of pre-factor 7 in our stopping rule, as this was made on a heuristic basis, mainly to ensure that we 2 1 T † 2 1 2 H =  K  and  =  , had reasonable rules for all choices of κ. Our main goal here n n n 2 6694 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 65, NO. 10, OCTOBER 2019 where K † is the pseudoinverse of K .WealsouseBH (θ, r) Taking these lemmas as given, we now complete the proof to denote the ball with respect to the ·H -norm in of the theorem. We first condition on the event E from range(K ). Lemma 3, so that we may apply the bound (27). 
We then < m − fix some iterate t such that t δ2 1, and condition on 8M n the event that the bound (28) in Lemma 4 holds, so that we are A. Proof of Theorem 1 + guaranteed that t 1H ≤ CH . We then split the analysis On a high-level, the key to the proof is the fact that the into two cases: Hilbert norm is bounded for all iterations t such that t ≤ t+1 ≤ δ m a) Case 1: First, suppose that n nCH .Inthis δ2 . Doing so involves combining arguments from convex 8M n case, inequality (15b) holds directly. optimization with arguments from empirical process theory, t+1 b) Case 2: Otherwise, we may assume that  n > with the latter being required to relate the derivatives of the t+1 ! δn H . Applying bound (27) with the choice (, ) = sample and population-based objective functions. (t ,t+1) yields The proof of our main theorem is based on a sequence ∗ t ∗ t t+1 t+1 of lemmas, all of which are stated with the assumptions of ∇L(θ +  ) −∇Ln(θ +  ),  ≤4δn n m + Theorem 1 in force. The first lemma establishes a bound on the + t 12. (29) t+1 t+1 ∗ n empirical norm ·n of the error  :=θ − θ , provided c3 that its Hilbert norm is suitably controlled. Substituting inequality (29) back into equation (26) yields   1 Lemma 2. For any stepsize α ∈ (0, ] and any iteration t m t+1 2 1 t 2 t+1 2 M   ≤  H − H we have 2 n 2α   t+1 m t+1 2 m + 1 + + 4δ   +   . t 12 ≤ t 2 −t 12 n n c n 2 n 2α H H 3 ∗ t ∗ t t+1 Re-arranging terms yields the bound + ∇L(θ +  ) −∇Ln(θ +  ),  . (26) + + γ mt 12 ≤ Dt + 4δ t 1 , (30) See Appendix B for the proof of this claim. n n n The second term on the right-hand side of the bound (26) where we have introduced the shorthand notation involves the difference between the population and empirical t := 1 t 2 −t+12 γ = 1 − 1 D 2α H H ,aswellas 2 c gradient operators. Since this difference is being evaluated 3 t t+1 Equation (30) defines a quadratic inequality with respect to at the random points  and  , the following lemma + t 1 ; solving it and making use of the inequality (a + establishes a form of uniform control on this term. n b)2 ≤ 2a2 + 2 b2 yields the bound Let us define the set δ2 t t+1 2 c n 2D ! n   ≤ + , (31) S := , ∈ R |H ≥ 1, n γ 2m2 γ m " for some universal constant c. By telescoping inequality (31), ! and ,  ∈ BH (0, 2CH ) , we find that T T 1  cδ2 1  2Dt and consider the uniform bound t 2 ≤ n + (32) T n γ 2m2 T γ m ∗ ∗ t= t= ∇L(θ + )! −∇L (θ + ),!  ≤2δ  1 1 n n n δ2 m c n 1 0 2 T 2 + δ2 + 2 ,! ∈ S ≤ + [ H − H ]. (33) 2 n H n for all . (27) γ 2m2 αγmT c3 Lemma 3. Let E be the event that bound (27) By Jensen’s inequality, we have holds. There are universal constants (c1, c2) such that T T 2 δ2 ¯T ∗ 2 1 t 2 1 t 2 P[E]≥ − (− m n n )  f − f  =   ≤   , 1 c1 exp c2 σ 2 . n T n T n t=1 t=1 See Appendix C for the proof of this claim. so that inequality (15b) follows from the bound (32). Note that Lemma 2 applies only to error iterates with a On the other hand, by the smoothness assumption, we have bounded Hilbert norm. Our last lemma provides this control M for some number of iterations: L( f¯T ) − L( f ∗) ≤  f¯T − f ∗2, 2 n Lemma 4. There are constants (C1, C2) independent of n α ∈ , { , 1 } from which inequality (15a) follows. such that for any step size 0 min M M , we have t  ≤ ≤ m H CH for all iterations t δ2 (28) 8M n B. Proof of Corollary 2 − (− δ2) = We first provide a proof outline before proceeding with with probability at least 1 C1 exp C2n n ,whereC2 2 the details. Our proof is based on a standard application of max{ m , 1}. 
σ 2 Fano’s inequality, together with a random packing argument. See Appendix D for the proof of this lemma; note that this Fano’s inequality helps reduce proving lower bounds to finding proof also uses Lemma 3. an upper bound on the Kullback-Leibler divergence between WEI et al.: EARLY STOPPING FOR KERNEL BOOSTING ALGORITHMS 6695 distributions associated to the packing set. Then we control the Proof of inequality (34): Recall that we define the trans- T Kullback-Leibler divergence between the GLM (21) through formed parameter√ θ = DUα with K = U DU,andany  T the properties of their cumulant functions. See Chapter 15 in estimator f = nU θ. Let us write U =[u1, u2,...,un]. the book [39] for more background on arguments of this type. Direct calculation of the KL divergence yields

Our proof builds on and generalizes Theorem 1 in Yang et Pθ (y) al. [40]. By definition of the transformed vector θ = DUα Pθ (y), Pθ (y)KL = log( )Pθ (y)dy √ P  ( ) = T  = T θ θ y with K U DU, we have for any estimator f nU n  − ∗2 =θ − θ ∗2 1 √  √ that f f n 2. Therefore our goal is to lower = ( n u ,θ ) − ( n u ,θ ) θ − θ ∗ θ ∗. (σ) i i bound the Euclidean error 2 of any estimator of s = δ/ √i 1 Borrowing Lemma 4 in Yang et al. [40], there exists 2- n   ={θ ∈ Rn | −1/2θ ≤ } n  packing of the set B D 2 1 of + yi ui ,θ− θ Pθ dy. (35) = dn/64 := {μ ≤ δ2} s(σ) cardinality M e with dn arg min j n . i=1 j=1,...,n The packing actually belongs to the set To further control the right hand  side of expression (35), n we concentrate on expressing = yi ui Pθ dy differently.  n θ 2  i 1 E(δ) := θ ∈ Rn | j ≤ , Leibniz’s rule allow us to inter-change the order of integral 2 1 min{δ , μ j } and derivative, so that j=1 dPθ d dy = Pθ dy = 0. (36) which is a subset of B. Let us denote the packing set by dθ dθ {θ 1,...,θM }. θ ∈ E(δ) Since , by simple calculation, we have Observe that θ i  ≤ δ. √ 2 n  √  By considering the random ensemble of regression problems dPθ n   dy = Pθ · ui yi − ( n ui ,θ ) dy in which we first draw an index Z at random from the index set dθ s(σ) i=1 [M] and then condition on Z = z, we observe n i.i.d samples n := { ,..., } P so that equality (36) yields y1 y1 yn from θ z , Fano’s inequality implies that n n  √ 2 n P = ( ,θ ). ∗ δ I (y ; Z) + log 2 yi ui θ dy ui n ui P(θ − θ  ≥ ) ≥ 1 − 1 . 2 4 log M i=1 i=1 Combining the above inequality with expression (35), the KL ( n; ) where I y1 Z is the mutual information between the samples divergence between two generalized linear models Pθ , Pθ can Y and the random index Z. thus be written as So it is only left for us to control the mutual information n √ ( n; ) P¯ = 1 M P 1  I y1 Z . Using the mixture representation M i=1 θi Pθ (y), Pθ (y) = ( n u ,θ ) KL s(σ) i and the convexity of the Kullback–Leibler divergence, we have i=1 √ √   √ − ( n u ,θ ) − n u ,θ − θ ( n u ,θ ). (37) M  i i i n 1 ¯ 1 I (y ; Z) = Pθ j , P ≤ Pθi , Pθ j  . 1 M KL M2 KL Together with the fact that j=1 i, j √ √ | ( n u ,θ ) − ( n u ,θ ) i√ i √ In order to proceed, we claim that − ,θ − θ ( ,θ )|≤ θ − θ 2. n ui n ui nL 2 θ − θ 2 nL 2 which follows by assumption on having a uniformly Pθ (y), Pθ (y) ≤ . (34) KL s(σ) bounded second derivative. Combining the above inequality with inequality (37) establishes our claim (34). i Taking this claim as given for the moment, since each θ 2 ≤ δ θ i − θ j  ≤ δ = , triangle inequality yields 2 2 for all i j.It C. Proof of Corollary 3 is therefore guaranteed that Throughout the proof, we condition on the event 4nLδ2 δ  ( n; ) ≤ . n  I y1 Z E = ≤ δn ≤ 3δn s(σ) 4 −c nδ Therefore, similar to Yang et al. [40], following the fact that which holds with probability at least 1 − c1e 2 n for (σ) ≥ δ2 , the kernel is regular and hence s dn cn n, any estimator some universal constants c1 c2 (see Proposition 14.1 in the f has prediction error lower bounded as book [39]). The expectations are with respect to the outputs Y1,...,Yn. When the conditions in Theorem 1 are satis-  ∗ 2 2 sup E f − f  ≥ cl δ . fied, it then follows from the extension (16) of the theorem ∗ n n  f H ≤1 that δ2 δ2 This lower bound, in conjunction with the upper bound from E ¯T − ∗2 ≤  n ≤  n  1 f f n C C at T δ2 steps. (38) Theorem 1, yields the statement of Corollary 2. m2 m2 n 6696 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 65, NO. 10, OCTOBER 2019

δ2  −2β/(1+2β) In order to invoke Theorem 1 for the particular cases of Log- we find that the critical radius scales as n n . itBoost and AdaBoost, we need to verify the conditions, i.e. Consequently, if we take T  n−2β/(1+2β) many steps, then that the m-M-condition and φ-boundedness conditions hold Theorem 1 guarantees that the averaged estimator θ¯T satisfies ∗ for the respective loss function over the ball BH (θ , 2CH ). the bound # $# $ β/( β+ ) The following lemma provides such a guarantee: σ 2 2 2 1 ¯T ∗ 2 1 1 ∗ θ − θ  ≤ + , Lemma 5. With D :=CH +θ H , the logistic regression n αm m2 n cost function satisfies the m-M-condition with parameters with probability at least 1 − c exp(−c m2( n )1/(2β+1)). 1 1 1 2 σ 2 m = , M = , and B = 1. −D + D + e e 2 4 VI. DISCUSSION The AdaBoost cost function satisfies the m-M-condition with In this paper, we have proven non-asymptotic bounds parameters for early stopping of kernel boosting for a relatively broad − m = e D, M = eD, and B = eD. class of loss functions. These bounds allowed us to propose { }n simple stopping rules which, for the class of regular kernel In particular, the conditions hold for any sequence xi i=1 in the domain X . functions [40], yield minimax optimal rates of estimation. Although the connection between early stopping and regular- See Appendix F for the proof of Lemma 5. ization has long been studied and explored in the theoretical It follows from Lemma 1 that is suffices to compute the literature and applications alike, to the best of our knowledge, function  this paper is the first one to establish a general relationship   ∞ 2  between the statistical optimality of stopped iterates and the R(δ) := {δ2,μ } min j (39) localized Gaussian complexity. This connection is important, n = j 1 because this localized Gaussian complexity measure, as well  to obtain an upper bound for the critical radius δn. as its Rademacher analogue, are now well-understood to play c) γ -exponential decay: Suppose that the kernel operator a central role in controlling the behavior of estimators based eigenvalues satisfy a decay condition of the form μ j ≤ on regularization [3], [21], [34], [39]. γ c1 exp(−c2 j ),wherec1, c2 are universal constants. Then There are various open questions suggested by our results. the function R from equation (39) can be upper bounded as The stopping rules in this paper depend on the eigenvalues   of the empirical kernel matrix; for this reason, they are data-  ∞ 1 2 dependent and computable given the data. However, in prac- R(δ) = min{δ ,μj } n = tice, it would be desirable to avoid the cost of computing all i 1   the empirical eigenvalues. Can fast approximation techniques  ∞ 1 − 2 for kernels be used to approximately compute our optimal ≤ kδ2 + c e c2 j , n 1 stopping rules? Second, our current theoretical results apply j=k+1 to the averaged estimator f¯T . We strongly suspect that the √ T and R(δ) ≥ kδ2 where k is the smallest integer such that same results apply to the stopped estimator f , but some new γ 2 ingredients are required to extend our proofs. c1 exp(−c2 k )<δ . Some algebra then shows that the critical 2 n radius scales as δ  /γ . n log(n)1 σ 2 ( )1/γ σ 2 APPENDIX Consequently, if we take T  log n steps, then Theo- n A. 
VI. DISCUSSION

In this paper, we have proven non-asymptotic bounds for early stopping of kernel boosting for a relatively broad class of loss functions. These bounds allowed us to propose simple stopping rules which, for the class of regular kernel functions [40], yield minimax-optimal rates of estimation. Although the connection between early stopping and regularization has long been studied and explored in the theoretical literature and in applications alike, to the best of our knowledge, this paper is the first to establish a general relationship between the statistical optimality of stopped iterates and the localized Gaussian complexity. This connection is important, because this localized Gaussian complexity measure, as well as its Rademacher analogue, is now well understood to play a central role in controlling the behavior of estimators based on regularization [3], [21], [34], [39].

There are various open questions suggested by our results. The stopping rules in this paper depend on the eigenvalues of the empirical kernel matrix; for this reason, they are data-dependent and computable given the data. However, in practice, it would be desirable to avoid the cost of computing all the empirical eigenvalues. Can fast approximation techniques for kernels be used to approximately compute our optimal stopping rules? Second, our current theoretical results apply to the averaged estimator f̄^T. We strongly suspect that the same results apply to the stopped estimator f^T, but some new ingredients are required to extend our proofs.

APPENDIX

A. Notation and Three Equivalent Iteration Forms

Before diving into the proofs, let us first introduce some shorthand notation and provide more details on the RKHS quantities that are relevant for developing our main results.

Recalling the discussion in Section II-C, we denote the vector of function values of a function f ∈ H evaluated at (x_1, x_2, ..., x_n) as

   θ_f := f(x_1^n) = ( f(x_1), f(x_2), ..., f(x_n) ) ∈ R^n,

where we omit the subscript f when it is clear from the context. As mentioned in the main text, updates on the function value vectors θ^t ∈ R^n correspond uniquely to updates of the functions f^t ∈ H_n. Therefore, updates on f^t, which are written as

   f^{t+1} = f^t − α ∇L_n(f^t),   with f^0 = 0,   (40a)

can be written as updates on the function value vectors,

   θ^{t+1} = θ^t − nα K ∇L̃_n(θ^t),   with θ^0 = 0,   (40b)

where

   L_n(f) = L̃_n(θ) = (1/n) Σ_{i=1}^n φ(Y_i, θ_i).

Denoting by K† the pseudoinverse of K, in our proof we also use the linear transformation

   z := n^{−1/2} (K†)^{1/2} θ   ⟺   θ = √n K^{1/2} z,

as well as the new function J_n(z) := L̃_n(√n √K z) and its population equivalent J(z) := E J_n(z). Ordinary gradient descent on J_n with stepsize α takes the form

   z^{t+1} = z^t − α ∇J_n(z^t) = z^t − α √n √K ∇L̃_n(√n √K z^t),   (40c)

with z^0 = 0. If we transform this update on z back to an equivalent one on θ by multiplying both sides by √n √K, we see that ordinary gradient descent on J_n is equivalent to the kernel boosting update θ^{t+1} = θ^t − αnK ∇L̃_n(θ^t).

Therefore, in our context, the three forms (40a)–(40c) are equivalent to each other, and we interchange between them from time to time for convenience of the proof. Also note that we abuse notation and write L_n for both L_n and L̃_n, as it is clear from the argument which loss we refer to.
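To make the equivalence of the three updates concrete, here is a small numerical sketch for the least-squares loss (L2-boost). The kernel choice, the data, the step size, and the normalization are arbitrary illustrative assumptions; the only point being checked is that iterating (40b) on θ and (40c) on z, then mapping back through θ = √n √K z, produce the same fitted values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha, T = 50, 0.5, 100

# Toy normalized kernel matrix (largest eigenvalue <= 1) and noisy responses.
X = np.sort(rng.uniform(0, 1, n))
K = np.minimum.outer(X, X)            # first-order Sobolev kernel K(x, x') = min(x, x')
K /= np.linalg.eigvalsh(K).max()      # normalize so the top eigenvalue is one
y = np.sin(2 * np.pi * X) + 0.5 * rng.standard_normal(n)

grad = lambda theta: (theta - y) / n  # gradient of L_n for the least-squares loss

# Update (40b) on the fitted-value vector theta.
theta = np.zeros(n)
for _ in range(T):
    theta = theta - n * alpha * K @ grad(theta)

# Update (40c) on z, mapped back through theta = sqrt(n) * sqrt(K) z.
w, V = np.linalg.eigh(K)
sqrtK = V @ np.diag(np.sqrt(np.clip(w, 0, None))) @ V.T
z = np.zeros(n)
for _ in range(T):
    z = z - alpha * np.sqrt(n) * sqrtK @ grad(np.sqrt(n) * sqrtK @ z)

print(np.max(np.abs(theta - np.sqrt(n) * sqrtK @ z)))  # agreement up to round-off
```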
B. Proof of Lemma 2

Recalling that K† denotes the pseudoinverse of K (see Section A), our proof is based on the linear transformation

   z := n^{−1/2} (K†)^{1/2} θ   ⟺   θ = √n K^{1/2} z,

and the iterates (40c), namely z^{t+1} = z^t − α ∇J_n(z^t) = z^t − α √n √K ∇L_n(√n √K z^t).

Our goal is to analyze the behavior of the update (40c) in terms of the population cost J(z^t). Thus, our problem is one of analyzing a noisy form of gradient descent on the function J, where the noise is induced by the difference between the empirical gradient operator ∇J_n and the population gradient operator ∇J.

We show in the first part of the proof of Lemma 4 (without having to use the statement of this Lemma 2) that for all t ≤ m/(8Mδ_n²), we have ‖Δ^t‖_H ≤ 2C_H. Therefore we can readily assume that the loss L is M-smooth. Since the kernel matrix K has been normalized to have largest eigenvalue at most one, the function J is also M-smooth, whence

   J(z^{t+1}) ≤ J(z^t) + ⟨∇J(z^t), d^t⟩ + (M/2) ‖d^t‖²_2,

where d^t := z^{t+1} − z^t = −α ∇J_n(z^t). Moreover, since the function J is convex, we have J(z*) ≥ J(z^t) + ⟨∇J(z^t), z* − z^t⟩, whence

   J(z^{t+1}) − J(z*) ≤ ⟨∇J(z^t), d^t + z^t − z*⟩ + (M/2) ‖d^t‖²_2 = ⟨∇J(z^t), z^{t+1} − z*⟩ + (M/2) ‖d^t‖²_2.   (41)

Now define the difference of the squared errors

   V^t := (1/2) ( ‖z^t − z*‖²_2 − ‖z^{t+1} − z*‖²_2 ).

By some simple algebra, we have

   V^t = (1/2) ( ‖z^t − z*‖²_2 − ‖d^t + z^t − z*‖²_2 )
       = −⟨d^t, z^t − z*⟩ − (1/2) ‖d^t‖²_2
       = −⟨d^t, −d^t + z^{t+1} − z*⟩ − (1/2) ‖d^t‖²_2
       = −⟨d^t, z^{t+1} − z*⟩ + (1/2) ‖d^t‖²_2.

Substituting back into equation (41) yields

   J(z^{t+1}) − J(z*) ≤ (1/α) V^t + ⟨∇J(z^t) + d^t/α, z^{t+1} − z*⟩
                     = (1/α) V^t + ⟨∇J(z^t) − ∇J_n(z^t), z^{t+1} − z*⟩,

where the inequality uses the fact that 1/α ≥ M by our choice of stepsize α, which allows us to drop the term (M/2 − 1/(2α)) ‖d^t‖²_2 ≤ 0.

Finally, we transform back to the original variables θ = √n √K z, using the relation ∇J(z) = √n √K ∇L(θ), so as to obtain the bound

   L(θ^{t+1}) − L(θ*) ≤ (1/(2α)) ( ‖Δ^t‖²_H − ‖Δ^{t+1}‖²_H ) + ⟨∇L(θ^t) − ∇L_n(θ^t), θ^{t+1} − θ*⟩.

Note that the optimality of θ* implies that ∇L(θ*) = 0. Combined with m-strong convexity, we are guaranteed that (m/2) ‖Δ^{t+1}‖²_n ≤ L(θ^{t+1}) − L(θ*), and hence

   (m/2) ‖Δ^{t+1}‖²_n ≤ (1/(2α)) ( ‖Δ^t‖²_H − ‖Δ^{t+1}‖²_H ) + ⟨∇L(θ* + Δ^t) − ∇L_n(θ* + Δ^t), Δ^{t+1}⟩,

as claimed.
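The two identities invoked in the final transformation step—∇J(z) = √n √K ∇L(θ) and ‖z − z*‖_2 = ‖θ − θ*‖_H—follow directly from the change of variables; the short derivation below is added for readability, and it identifies ‖θ‖²_H with (1/n) θ^⊤ K† θ, which is how the Hilbert norm of a fitted-value vector is used throughout this appendix. By the chain rule applied to J_n(z) = L̃_n(√n √K z) and its expectation,

   ∇J(z) = √n √K ∇L(√n √K z) = √n √K ∇L(θ),

and therefore

   ⟨∇J(z^t) − ∇J_n(z^t), z^{t+1} − z*⟩ = ⟨∇L(θ^t) − ∇L_n(θ^t), √n √K (z^{t+1} − z*)⟩ = ⟨∇L(θ^t) − ∇L_n(θ^t), θ^{t+1} − θ*⟩,

using the symmetry of √K. Moreover, for θ = √n K^{1/2} z in the range of K,

   ‖z‖²_2 = (1/n) θ^⊤ K† θ = ‖θ‖²_H,

so that ‖z^t − z*‖_2 = ‖Δ^t‖_H, which is exactly how V^t/α becomes (1/(2α)) ( ‖Δ^t‖²_H − ‖Δ^{t+1}‖²_H ).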

C. Proof of Lemma 3

We split our proof into two cases, depending on whether we are dealing with the least-squares loss φ(y,θ) = (1/2)(y − θ)², or a classification loss with uniformly bounded gradient (‖φ'‖_∞ ≤ 1).

1) Least-Squares Case: The least-squares loss is m-strongly convex and M-smooth with m = M = 1. Moreover, the difference between the population and empirical gradients can be written as

   ∇L(θ* + Δ̃) − ∇L_n(θ* + Δ̃) = (σ/n) ( w_1, ..., w_n ),

where the random variables {w_i}_{i=1}^n are i.i.d. and sub-Gaussian with parameter 1. Consequently, we have

   | ⟨∇L(θ* + Δ̃) − ∇L_n(θ* + Δ̃), Δ⟩ | = (σ/n) | Σ_{i=1}^n w_i Δ(x_i) |.

Under these conditions, one can show (see Lemma 13.4 in the book [39]) that

   (σ/n) | Σ_{i=1}^n w_i Δ(x_i) | ≤ 2δ_n ‖Δ‖_n + 2δ_n² ‖Δ‖_H + (1/16) ‖Δ‖²_n,   (42)

which implies that Lemma 3 holds with c_3 = 16.
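The representation of the gradient difference as a scaled noise vector, used above, can be derived in one line; the derivation below assumes the observation model y_i = f*(x_i) + σ w_i with E[y_i] = f*(x_i) = θ*_i, which is how the noise variables w_i enter the statement. For the least-squares loss,

   L_n(θ) = (1/(2n)) Σ_{i=1}^n (y_i − θ_i)²   ⟹   ∇L_n(θ) = (θ − y)/n,
   L(θ) = E_y L_n(θ)                          ⟹   ∇L(θ) = (θ − θ*)/n,

so that, for every θ (and in particular for θ = θ* + Δ̃),

   ∇L(θ) − ∇L_n(θ) = (y − θ*)/n = (σ/n)(w_1, ..., w_n),   and   ⟨∇L(θ* + Δ̃) − ∇L_n(θ* + Δ̃), Δ⟩ = (σ/n) Σ_{i=1}^n w_i Δ(x_i).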

2) Gradient-Bounded φ-Functions: We now turn to the proof of Lemma 3 for gradient-bounded φ-functions. First, we claim that it suffices to prove the bound (27) for functions g ∈ ∂H with ‖g‖_H = 1, where ∂H := { f − g | f, g ∈ H }. Indeed, suppose that it holds for all such functions, and that we are given a function Δ with ‖Δ‖_H > 1. By assumption, we can apply the inequality (27) to the new function g := Δ/‖Δ‖_H, which belongs to ∂H by nature of the subspace H = span{ K(·, x_i) }_{i=1}^n. Applying the bound (27) to g and then multiplying both sides by ‖Δ‖_H, we obtain

   ⟨∇L(θ* + Δ̃) − ∇L_n(θ* + Δ̃), Δ⟩ ≤ 2δ_n ‖Δ‖_n + 2δ_n² ‖Δ‖_H + (m/c_3) ‖Δ‖²_n / ‖Δ‖_H
                                     ≤ 2δ_n ‖Δ‖_n + 2δ_n² ‖Δ‖_H + (m/c_3) ‖Δ‖²_n,

where the second inequality uses the fact that ‖Δ‖_H > 1 by assumption.

In order to establish the bound (27) for functions with ‖g‖_H = 1, we first prove it uniformly over the set { g | ‖g‖_H = 1, ‖g‖_n ≤ t }, where t ≥ δ_n is a fixed radius (of course, we restrict our attention to those radii t for which this set is non-empty). We then extend the argument to one that is also uniform over the choice of t by a "peeling" argument. Define the random variable

   Z_n(t) := sup_{Δ̃, Δ ∈ E(t,1)} ⟨∇L(θ* + Δ̃) − ∇L_n(θ* + Δ̃), Δ⟩.   (43)

The following two lemmas, respectively, bound the mean of this random variable and its deviations above the mean:

Lemma 6. For any t > 0, the mean is upper bounded as

   E Z_n(t) ≤ σ G_n(E(t,1)),   (44)

where σ := 2M + 4C_H.

Lemma 7. There are universal constants (c_1, c_2) such that

   P( Z_n(t) ≥ E Z_n(t) + α ) ≤ c_1 exp( −c_2 n α² / t² ).   (45)

See Appendices C.3 and C.4 for the proofs of these two claims.

Equipped with Lemmas 6 and 7, we now prove inequality (27). We divide our argument into two cases.

a) Case t = δ_n: We first prove inequality (27) for t = δ_n. From Lemma 6, we have

   E Z_n(δ_n) ≤ σ G_n(E(δ_n, 1)) ≤(i) δ_n²,   (46)

where inequality (i) follows from the definition of δ_n in inequality (14). Setting α = δ_n² in expression (45) yields

   P( Z_n(δ_n) ≥ 2δ_n² ) ≤ c_1 exp( −c_2 n δ_n² ),   (47)

which establishes the claim for t = δ_n.

b) Case t > δ_n: On the other hand, for any t > δ_n, we have

   E Z_n(t) ≤(i) σ G_n(E(t,1)) = t · σ G_n(E(t,1))/t ≤(ii) t δ_n,

where step (i) follows from Lemma 6, and step (ii) follows because the function u ↦ G_n(E(u,1))/u is non-increasing on the positive real line. (This non-increasing property is a direct consequence of the star-shaped nature of ∂H.) Finally, using this upper bound on E Z_n(t) and setting α = t²m/(4c_3) in the tail bound (45) yields

   P( Z_n(t) ≥ t δ_n + t²m/(4c_3) ) ≤ c_1 exp( −c_2 n m² t² ).   (48)

Note that the precise values of the universal constants c_2 may change from line to line throughout this section.
exp c2nm 2 n (i) EZ (δ ) ≤ σG (E(δ , )) ≤ δ2, n n n n 1 n (46) Substituting this inequality and our earlier bound (47) into equation (49) yields where inequality (i) follows from the definition of δn in inequality (14). Setting α = δ2 in expression (45) yields P(E) ≤ (− 2δ2), n c1 exp c2nm n     P Z (δ ) ≥ δ2 ≤ − δ2 , n n 2 n c1 exp c2n n (47) where the reader should recall that the precise values of uni- versal constants may change from line-to-line. This concludes which establishes the claim for t = δn. the proof of Lemma 3. WEI et al.: EARLY STOPPING FOR KERNEL BOOSTING ALGORITHMS 6699

3) Proof of Lemma 6: Recalling the definitions (1) and (3) of L and L_n, we can write

   Z_n(t) = sup_{Δ̃, Δ ∈ E(t,1)} (1/n) Σ_{i=1}^n ( φ'(y_i, θ*_i + Δ̃_i) − E φ'(y_i, θ*_i + Δ̃_i) ) Δ_i.

Note that the vectors Δ and Δ̃ contain function values of the form f(x_i) − f*(x_i) for functions f ∈ B_H(f*, 2C_H). Recall that the kernel function is bounded uniformly by one. Consequently, for any function f ∈ B_H(f*, 2C_H), we have

   | f(x) − f*(x) | = | ⟨ f − f*, K(·, x) ⟩_H | ≤ ‖f − f*‖_H ‖K(·, x)‖_H ≤ 2C_H.

Thus, we can restrict our attention to vectors Δ, Δ̃ with ‖Δ‖_∞, ‖Δ̃‖_∞ ≤ 2C_H from here onwards.

Letting {ε_i}_{i=1}^n denote an i.i.d. sequence of Rademacher variables, define the symmetrized variable

   Z̃_n(t) := sup_{Δ̃, Δ ∈ E(t,1)} (1/n) Σ_{i=1}^n ε_i φ'(y_i, θ*_i + Δ̃_i) Δ_i.   (50)

By a standard symmetrization argument [35], we have E_y[Z_n(t)] ≤ 2 E_{y,ε}[Z̃_n(t)]. Moreover, since

   φ'(y_i, θ*_i + Δ̃_i) Δ_i ≤ (1/2) [ φ'(y_i, θ*_i + Δ̃_i) ]² + (1/2) Δ_i²,

we have

   E Z̃_n(t) ≤ E sup_{Δ̃ ∈ E(t,1)} (1/(2n)) Σ_{i=1}^n ε_i [ φ'(y_i, θ*_i + Δ̃_i) ]² + E sup_{Δ ∈ E(t,1)} (1/(2n)) Σ_{i=1}^n ε_i Δ_i²
            ≤ E sup_{Δ̃ ∈ E(t,1)} (1/n) Σ_{i=1}^n ε_i φ'(y_i, θ*_i + Δ̃_i)   [ =: T_1 ]
              + 4C_H E sup_{Δ ∈ E(t,1)} (1/n) Σ_{i=1}^n ε_i Δ_i   [ =: T_2 ],

where the second inequality follows by applying the Rademacher contraction inequality [23], using the fact that ‖φ'‖_∞ ≤ 1 for the first term and ‖Δ‖_∞ ≤ 2C_H for the second term.

Focusing first on the term T_1, since E[ε_i φ'(y_i, θ*_i)] = 0, we have

   T_1 = E sup_{Δ̃ ∈ E(t,1)} (1/n) Σ_{i=1}^n ε_i ( φ'(y_i, θ*_i + Δ̃_i) − φ'(y_i, θ*_i) )
       ≤(i) M E sup_{Δ̃ ∈ E(t,1)} (1/n) Σ_{i=1}^n ε_i Δ̃_i
       ≤(ii) sqrt(π/2) M G_n(E(t,1)),

where step (i) follows since each function φ'(y_i, ·) is M-Lipschitz by assumption, and step (ii) follows since the Gaussian complexity upper bounds the Rademacher complexity up to a factor of sqrt(π/2). Similarly, we have

   T_2 ≤ sqrt(π/2) G_n(E(t,1)),

and putting together the pieces yields the claim.
4) Proof of Lemma 7: Recall the definition (50) of the symmetrized variable Z̃_n. By a standard symmetrization argument [35], there are universal constants c_1, c_2 such that

   P( Z_n(t) ≥ E Z_n(t) + c_1 α ) ≤ c_2 P( Z̃_n(t) ≥ E Z̃_n(t) + α ).

Since {ε_i}_{i=1}^n and {y_i}_{i=1}^n are independent, we can study Z̃_n(t) conditionally on {y_i}_{i=1}^n. Viewed as a function of {ε_i}_{i=1}^n, the function Z̃_n(t) is convex and Lipschitz with respect to the Euclidean norm with parameter

   L² := sup_{Δ̃, Δ ∈ E(t,1)} (1/n²) Σ_{i=1}^n [ φ'(y_i, θ*_i + Δ̃_i) ]² Δ_i² ≤ t²/n,

where we have used the facts that ‖φ'‖_∞ ≤ 1 and ‖Δ‖_n ≤ t. By Ledoux's concentration theorem for convex and Lipschitz functions [22], we have

   P( Z̃_n(t) ≥ E Z̃_n(t) + α | {y_i}_{i=1}^n ) ≤ c_3 exp( −c_4 n α² / t² ).

Since the right-hand side does not involve {y_i}_{i=1}^n, the same bound holds unconditionally over the randomness in both the Rademacher variables and the sequence {y_i}_{i=1}^n. Consequently, the claimed bound (45) follows, with suitable redefinitions of the universal constants.

D. Proof of Lemma 4

We first require an auxiliary lemma, which we state and prove in the following section. We then prove Lemma 4 in Section D.2.

1) An Auxiliary Lemma: The following result relates the Hilbert norm of the error to the difference between the empirical and population gradients:

Lemma 8. For any convex and differentiable loss function L, the kernel boosting error Δ^{t+1} := θ^{t+1} − θ* satisfies the bound

   ‖Δ^{t+1}‖²_H ≤ ‖Δ^t‖_H ‖Δ^{t+1}‖_H + α ⟨∇L(θ* + Δ^t) − ∇L_n(θ* + Δ^t), Δ^{t+1}⟩.   (51)

Proof: Recall that ‖Δ^t‖²_H = ‖θ^t − θ*‖²_H = ‖z^t − z*‖²_2 by definition of the Hilbert norm. Let us define the population update operator G on the population function J and the empirical update operator G_n on J_n as

   G(z^t) := z^t − α ∇J(z^t),   and   z^{t+1} := G_n(z^t) = z^t − α ∇J_n(z^t).   (52)

Since J is convex and smooth, it follows from standard arguments in convex optimization that G is a non-expansive operator—viz.

   ‖G(x) − G(y)‖_2 ≤ ‖x − y‖_2   for all x, y ∈ C.   (53)

In addition, we note that the vector z* is a fixed point of G—that is, G(z*) = z*. From these ingredients, we have

   ‖Δ^{t+1}‖²_H = ⟨ z^{t+1} − z*, G_n(z^t) − G(z^t) + G(z^t) − z* ⟩
              ≤(i) ‖z^{t+1} − z*‖_2 ‖G(z^t) − G(z*)‖_2 + α ⟨ ∇L(θ* + Δ^t) − ∇L_n(θ* + Δ^t), √n √K (z^{t+1} − z*) ⟩
              ≤(ii) ‖Δ^{t+1}‖_H ‖Δ^t‖_H + α ⟨ ∇L(θ* + Δ^t) − ∇L_n(θ* + Δ^t), Δ^{t+1} ⟩,

where step (i) follows by applying the Cauchy–Schwarz inequality to control the inner product, and step (ii) follows from the non-expansiveness (53), the fact that Δ^{t+1} = √n √K (z^{t+1} − z*), and the symmetry of the square-root kernel matrix √K.

2) Proof of Lemma 4: We now prove Lemma 4. The argument makes use of Lemmas 2 and 3 combined with Lemma 8.

In order to prove inequality (28), we follow an inductive argument. Instead of proving (28) directly, we prove a slightly stronger relation which implies it, namely

   max{ 1, ‖Δ^t‖²_H } ≤ max{ 1, ‖Δ^0‖²_H } + t (4M/(γ̂ m)) δ_n².   (54)

Here γ̂ and c_3 are constants linked by the relation

   γ̂ := 1/32 − 1/(4c_3) = 1/C_H².   (55)

We claim that it suffices to prove that the error iterates satisfy the inequality (54). Indeed, if we take inequality (54) as given, then for any t ≤ m/(8Mδ_n²) we have

   ‖Δ^t‖²_H ≤ max{ 1, ‖Δ^0‖²_H } + 1/(2γ̂) ≤ C_H²,

where we used the definition C_H² = 2 max{ ‖θ*‖²_H, 32 }. Thus, it suffices to focus our attention on proving inequality (54).

For t = 0, it is trivially true. Now let us assume that inequality (54) holds for some t ≤ m/(8Mδ_n²), and then prove that it also holds for step t + 1. If ‖Δ^{t+1}‖_H < 1, then inequality (54) follows directly. Therefore, we can assume without loss of generality that ‖Δ^{t+1}‖_H ≥ 1.

We break down the proof of this induction into two steps:
• First, we show that ‖Δ^{t+1}‖_H ≤ 2C_H, so that Lemma 3 is applicable.
• Second, we show that the bound (54) holds at step t + 1, and thus in fact ‖Δ^{t+1}‖_H ≤ C_H.

Throughout the proof, we condition on the event E on which the bound (27) of Lemma 3 holds, and on the event E_0 := { (1/√n) ‖y − E[y|x]‖_2 ≤ √2 σ }. Lemma 3 guarantees that P(E^c) ≤ c_1 exp( −c_2 m² n δ_n² / σ² ), whereas P(E_0) ≥ 1 − e^{−n} follows from the fact that ‖y − E[y|x]‖²_2 is sub-exponential with parameter σ²n together with an application of Hoeffding's inequality. Putting things together yields an upper bound on the probability of the complementary event, namely

   P( E^c ∪ E_0^c ) ≤ 2c_1 exp( −C_2 n δ_n² ),   where C_2 := min{ c_2 m²/σ², 1 }.
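The reduction from inequality (54) to the bound ‖Δ^t‖_H ≤ C_H glosses one line of arithmetic; the display below records it, using only the definitions γ̂ = 1/C_H² and C_H² = 2 max{‖θ*‖²_H, 32} and the initialization θ^0 = 0 (so that Δ^0 = −θ*). For any t ≤ m/(8Mδ_n²),

   t (4M/(γ̂ m)) δ_n² ≤ ( m/(8Mδ_n²) ) (4M/(γ̂ m)) δ_n² = 1/(2γ̂) = C_H²/2,

so that (54) gives

   ‖Δ^t‖²_H ≤ max{ 1, ‖θ*‖²_H } + C_H²/2 ≤ C_H²/2 + C_H²/2 = C_H².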

a) Showing that ‖Δ^{t+1}‖_H ≤ 2C_H: In this step, we assume that inequality (54) holds at step t, and show that ‖Δ^{t+1}‖_H ≤ 2C_H. Recalling that z := (K†)^{1/2} θ / √n, our update can be written as

   z^{t+1} − z* = z^t − α √n √K ∇L(θ^t) − z* + α √n √K ( ∇L_n(θ^t) − ∇L(θ^t) ).

Applying the triangle inequality yields the bound

   ‖z^{t+1} − z*‖_2 ≤ ‖ z^t − α √n √K ∇L(θ^t) − z* ‖_2 + α ‖ √n √K ( ∇L_n(θ^t) − ∇L(θ^t) ) ‖_2,

where the first term on the right-hand side is ‖G(z^t) − z*‖_2 for the population update operator G previously defined in equation (52) and observed to be non-expansive (53). From this non-expansiveness, we find that

   ‖z^{t+1} − z*‖_2 ≤ ‖z^t − z*‖_2 + α ‖ √n √K ( ∇L_n(θ^t) − ∇L(θ^t) ) ‖_2.

Note that the ℓ_2 norm of z corresponds to the Hilbert norm of θ. This implies

   ‖Δ^{t+1}‖_H ≤ ‖Δ^t‖_H + α ‖ √n √K ( ∇L_n(θ^t) − ∇L(θ^t) ) ‖_2 =: ‖Δ^t‖_H + T.

Observe that, because of the uniform boundedness of the kernel by one, the quantity T can be bounded as

   T ≤ α √n ‖ ∇L_n(θ^t) − ∇L(θ^t) ‖_2 = α (1/√n) ‖ v − E v ‖_2,

where we have defined the vector v ∈ R^n with coordinates v_i := φ'(y_i, θ^t_i). For functions φ satisfying the gradient boundedness and the m-M-condition, since θ^t ∈ B_H(θ*, C_H), each coordinate of the vectors v and E v is bounded by 1 in absolute value. We consequently have

   T ≤ 2α ≤ C_H,

where we have used the facts that α ≤ m/M < 1 and 2 ≤ C_H. For the least-squares loss φ, we instead have

   T ≤ (α/√n) ‖ y − E[y|x] ‖_2 ≤ √2 α σ ≤ C_H,

conditioned on the event E_0 defined above, whose probability P(E_0) ≥ 1 − e^{−n} was controlled via the sub-exponential tail of ‖y − E[y|x]‖²_2 and Hoeffding's inequality. Putting together the pieces yields that ‖Δ^{t+1}‖_H ≤ ‖Δ^t‖_H + C_H ≤ 2C_H, as claimed.

b) Completing the induction step: We are now ready to complete the induction step for proving inequality (54), using Lemma 2 and Lemma 3, since ‖Δ^{t+1}‖_H ≥ 1. We split the argument into two cases, depending on whether or not ‖Δ^{t+1}‖_H δ_n ≥ ‖Δ^{t+1}‖_n. In general we can assume that ‖Δ^{t+1}‖_H > ‖Δ^t‖_H, since otherwise the induction inequality (54) is satisfied trivially.

c) Case 1: When ‖Δ^{t+1}‖_H δ_n ≥ ‖Δ^{t+1}‖_n, inequality (27) implies that

   ⟨∇L(θ* + Δ^t) − ∇L_n(θ* + Δ^t), Δ^{t+1}⟩ ≤ 4δ_n² ‖Δ^{t+1}‖_H + (m/c_3) ‖Δ^{t+1}‖²_n.   (56)

Combining Lemma 8 and inequality (56), we obtain

   ‖Δ^{t+1}‖²_H ≤ ‖Δ^t‖_H ‖Δ^{t+1}‖_H + 4αδ_n² ‖Δ^{t+1}‖_H + α(m/c_3) ‖Δ^{t+1}‖²_n
   ⟹ ‖Δ^{t+1}‖_H ≤ ( 1/(1 − αδ_n² m/c_3) ) ( ‖Δ^t‖_H + 4αδ_n² ),   (57)

where the implication uses the fact that ‖Δ^{t+1}‖_n ≤ δ_n ‖Δ^{t+1}‖_H.

d) Case 2: When ‖Δ^{t+1}‖_H δ_n < ‖Δ^{t+1}‖_n, we use our assumption ‖Δ^{t+1}‖_H ≥ ‖Δ^t‖_H together with Lemma 8 and inequality (27). Collectively, they guarantee that

   ‖Δ^{t+1}‖²_H ≤ ‖Δ^t‖²_H + 2α ⟨∇L(θ* + Δ^t) − ∇L_n(θ* + Δ^t), Δ^{t+1}⟩
              ≤ ‖Δ^t‖²_H + 8αδ_n ‖Δ^{t+1}‖_n + 2α(m/c_3) ‖Δ^{t+1}‖²_n.

Using the elementary inequality 2ab ≤ a² + b², we find that

   ‖Δ^{t+1}‖²_H ≤ ‖Δ^t‖²_H + 8αγ̂ m ‖Δ^{t+1}‖²_n + (2α/(γ̂ m)) δ_n² + 2α(m/c_3) ‖Δ^{t+1}‖²_n
              ≤ ‖Δ^t‖²_H + (αm/4) ‖Δ^{t+1}‖²_n + (2α/(γ̂ m)) δ_n²,   (58)

where in the final step we plug in the constants γ̂, c_3, which satisfy equation (55).

Now Lemma 2 implies that

   (m/2) ‖Δ^{t+1}‖²_n ≤ D^t + 4δ_n ‖Δ^{t+1}‖_n + (m/c_3) ‖Δ^{t+1}‖²_n
                      ≤(i) D^t + γ̂ m ‖Δ^{t+1}‖²_n + (4/(γ̂ m)) δ_n² + (m/c_3) ‖Δ^{t+1}‖²_n,

where D^t := (1/(2α)) ( ‖Δ^t‖²_H − ‖Δ^{t+1}‖²_H ) and step (i) again uses 2ab ≤ a² + b². Thus, we have

   (m/4) ‖Δ^{t+1}‖²_n ≤ D^t + (4/(γ̂ m)) δ_n².

Together with expression (58), we find that

   ‖Δ^{t+1}‖²_H ≤ ‖Δ^t‖²_H + (1/2) ( ‖Δ^t‖²_H − ‖Δ^{t+1}‖²_H ) + (6α/(γ̂ m)) δ_n²
   ⟹ ‖Δ^{t+1}‖²_H ≤ ‖Δ^t‖²_H + (4α/(γ̂ m)) δ_n².

e) Combining the pieces: By combining the two previous cases, we arrive at the bound

   max{ 1, ‖Δ^{t+1}‖²_H } ≤ max{ 1, κ² ( ‖Δ^t‖_H + 4αδ_n² )², ‖Δ^t‖²_H + (4M/(γ̂ m)) δ_n² },   (59)

where κ := 1/(1 − αδ_n² m/c_3) and we used the inequality α ≤ min{ 1/M, M }.

Now it only remains to show that, with the constant c_3 chosen such that γ̂ = 1/32 − 1/(4c_3) = 1/C_H², we have

   κ² ( ‖Δ^t‖_H + 4αδ_n² )² ≤ ‖Δ^t‖²_H + (4M/(γ̂ m)) δ_n².

Define the function f : (0, C_H] → R via

   f(ξ) := κ² ( ξ + 4αδ_n² )² − ξ² − (4M/(γ̂ m)) δ_n².

Since κ ≥ 1, in order to conclude that f(ξ) < 0 for all ξ ∈ (0, C_H], it suffices to show that argmin_{x ∈ R} f(x) < 0 and f(C_H) < 0. The former is obtained by basic algebra and follows directly from κ ≥ 1. For the latter, since γ̂ = 1/32 − 1/(4c_3) = 1/C_H², α < 1/M and δ_n² ≤ M/m², it suffices to check that (4x + 1)(1 − x/8)² ≥ 1 for all x ≤ 1; since this holds for all x ≤ 1 and m/M ≤ 1, we conclude that f(C_H) < 0.

Now that we have established max{ 1, ‖Δ^{t+1}‖²_H } ≤ max{ 1, ‖Δ^t‖²_H } + (4M/(γ̂ m)) δ_n², the induction step (54) follows, which completes the proof of Lemma 4.

E. Proof of Lemma 1

Recall that the eigenfunctions {φ_k}_{k=1}^∞ associated with a kernel operator form an orthonormal basis of L²(X, P_X) with the inner product ⟨f, g⟩ := ∫_X f(x) g(x) dP_X(x). Every function f ∈ H induced by the kernel can then be written as

   f(x) = Σ_{j=1}^∞ β_j φ_j(x),

with β ∈ ℓ²(N) (i.e. Σ_{j=1}^∞ β_j² < ∞) and Σ_{j=1}^∞ β_j²/μ_j < ∞. It can be shown (see e.g. [39]) that the corresponding inner product of the function space H reads

   ⟨f, g⟩_H = Σ_{j=1}^∞ ⟨f, φ_j⟩ ⟨g, φ_j⟩ / μ_j,

and thus for f as above we have ‖f‖²_H = Σ_{j=1}^∞ β_j²/μ_j as well as ‖f‖²_2 = Σ_{j=1}^∞ β_j². The localized population Gaussian complexity is defined as

   G_n(E(δ, 1)) = E_X E_w [ sup_{‖f‖_2 ≤ δ, ‖f‖_H ≤ 1} (1/n) | Σ_{i=1}^n w_i f(x_i) | ].

 w˜ = n w φ ( ) ∈{− , } Rewriting the objective and defining j i=1 i j xi , inequality with the fact that y 1 1 , it suffices to lower the inner constrained maximization can be upper bounded as bound the quantity   follows  2  ∂ φ(y,θ) ∞ ∞ min min   y∈{− ,+ } |θ|≤D (∂θ)2 sup w˜ β ≤ sup w˜ β 1 1  j j  j j 2 ∞ β2≤δ2 ∞ η β2≤ y 1 j=1 j j=1 j=1 j j 2 j=1 = ≥ , min min − θ/ θ/ −  β2 |y|≤1 |θ|≤D ( y 2 + y 2)2 D + D + ∞ j ≤ e e &e '(e 2) j=1 μ 1 j + m ∞ = 2 w˜ γ , which completes the proof for the logistic loss.  sup j j ∞ 2 η j 2) m-M-condition for AdaBoost: The AdaBoost algorithm = γ ≤1 j= j 1 j 1 is based on the cost function φ(y,θ)= e−yθ , which has first η = {δ−2,μ−1} and second derivatives (with respect to its second argument) where j max j . Now using Hölders inequality we obtain given by   2 ∞ ∞ 2 ∂φ(y,θ) − θ ∂ φ(y,θ) − θ  2w˜ =−ye y , and = e y . w˜ β ≤ j . ∂θ (∂θ)2 sup j j η   β2 = = j ∞ 2 2 ∞ j j 1 j 1 = β ≤δ , = μ ≤1 As in the preceding argument for logistic loss, we have the j 1 j j 1 j bound |y|≤1and|θ|≤D. By inspection, the absolute value Furthermore, using Jensen’s inequality we have of the first derivative is uniformly bounded B :=eD, whereas    [ , ]  ∞ 2  ∞ 2 the second derivative always lies in the interval m M with  2w˜  E ,ww˜ D −D 1 E E j ≤ 2 X j M :=e and m :=e , as claimed. X w η η n = j n = j j 1  j 1   ACKNOWLEDGEMENT ∞ ≤ 2 1 , Yuting Wei and Fanny Yang contributed equally to this n η j work. j=1 which concludes the proof. REFERENCES [1] R. S. Anderssen and P. M. Prenter, “A formal comparison of methods F. Proof of Lemma 5 proposed for the numerical solution of first kind integral equations,” ANZIAM J., vol. 22, no. 4, pp. 488–500, 1981. Recall that the LogitBoost algorithm is based on logistic [2] P. Bartlett and S. Mendelson, “Rademacher and Gaussian complexities: loss φ(y,θ)= ln(1 + e−yθ ), whereas the AdaBoost algorithm Risk bounds and structural results,” J. Mach. Learn. Res.,vol.3, φ( ,θ)= (− θ) pp. 463–482, Nov. 2002. is based on the exponential loss y exp y .Wenow [3] P. L. Bartlett, O. Bousquet, and S. Mendelson, “Local rademacher verify the m-M-condition for these two losses with the corre- complexities,” Ann. Statist., vol. 33, no. 4, pp. 1497–1537, 2005. sponding parameters specified in Lemma 5. [4] P. L. Bartlett and M. Traskin, “Adaboost is consistent,” J. Mach. Learn. Res., vol. 8, pp. 2347–2368, Oct. 2007. 1) m-M-condition for Logistic Loss: The first and second [5] A. Berlinet and C. Thomas-Agnan, Reproducing Kernel Hilbert Spaces derivatives are given by in Probability and Statistics. Norwell, MA, USA: Kluwer Academic, 2004. − θ ∂φ(y,θ) −ye y [6] L. Breiman, “Prediction games and arcing algorithms,” Neural Comput., = , and ∂θ + −yθ vol. 11, no. 7, pp. 1493–1517, Oct. 1999. 1 e [7] L. Breiman, “Arcing classifier (with discussion and a rejoinder by the ∂2φ( ,θ) 2 y = y . author),” Ann. Statist., vol. 26, no. 3, pp. 801–849, 1998. (∂θ)2 ( −yθ/2 + yθ/2)2 [8] P. Bühlmann and T. Hothorn, “Boosting algorithms: Regularization, e e prediction and model fitting,” Stat. Sci., vol. 22, no. 4, pp. 477–505, | ∂φ(y,θ) | 2007. It is easy to check that ∂θ is uniformly bounded [9] P. Bühlmann and B. Yu, “Boosting with L2 loss: Regression and by B = 1. classification,” J. Amer. Stat. Assoc., vol. 98, no. 462, pp. 324–339, 2003. Turning to the second derivative, recalling that y ∈ [10] R. Camoriano, T. Angles, A. Rudi, and L. Rosasco, “NYTRO: When subsampling meets early stopping,” in Proc. 19th Int. Conf., Artif. Intell. 
ACKNOWLEDGEMENT

Yuting Wei and Fanny Yang contributed equally to this work.

REFERENCES

[1] R. S. Anderssen and P. M. Prenter, "A formal comparison of methods proposed for the numerical solution of first kind integral equations," ANZIAM J., vol. 22, no. 4, pp. 488–500, 1981.
[2] P. Bartlett and S. Mendelson, "Rademacher and Gaussian complexities: Risk bounds and structural results," J. Mach. Learn. Res., vol. 3, pp. 463–482, Nov. 2002.
[3] P. L. Bartlett, O. Bousquet, and S. Mendelson, "Local Rademacher complexities," Ann. Statist., vol. 33, no. 4, pp. 1497–1537, 2005.
[4] P. L. Bartlett and M. Traskin, "AdaBoost is consistent," J. Mach. Learn. Res., vol. 8, pp. 2347–2368, Oct. 2007.
[5] A. Berlinet and C. Thomas-Agnan, Reproducing Kernel Hilbert Spaces in Probability and Statistics. Norwell, MA, USA: Kluwer Academic, 2004.
[6] L. Breiman, "Prediction games and arcing algorithms," Neural Comput., vol. 11, no. 7, pp. 1493–1517, Oct. 1999.
[7] L. Breiman, "Arcing classifier (with discussion and a rejoinder by the author)," Ann. Statist., vol. 26, no. 3, pp. 801–849, 1998.
[8] P. Bühlmann and T. Hothorn, "Boosting algorithms: Regularization, prediction and model fitting," Stat. Sci., vol. 22, no. 4, pp. 477–505, 2007.
[9] P. Bühlmann and B. Yu, "Boosting with the L2 loss: Regression and classification," J. Amer. Stat. Assoc., vol. 98, no. 462, pp. 324–339, 2003.
[10] R. Camoriano, T. Angles, A. Rudi, and L. Rosasco, "NYTRO: When subsampling meets early stopping," in Proc. 19th Int. Conf. Artif. Intell. Statist., 2016, pp. 1403–1411.
[11] A. Caponnetto, "Optimal rates for regularization operators in learning theory," Massachusetts Inst. Technol., Cambridge, MA, USA, Tech. Rep. CBCL Paper 264/AI Tech. Rep. 062, Sep. 2006.
[12] A. Caponnetto and Y. Yao, "Adaptation for regularization operators in learning theory," Massachusetts Inst. Technol., Cambridge, MA, USA, Tech. Rep. CBCL Paper 265/AI Tech. Rep. 063, Sep. 2006.
[13] R. Caruana, S. Lawrence, and C. L. Giles, "Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping," in Proc. Adv. Neural Inf. Process. Syst., 2001, pp. 402–408.
[14] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," J. Comput. Syst. Sci., vol. 55, no. 1, pp. 119–139, Aug. 1997.
[15] J. Friedman, T. Hastie, and R. Tibshirani, "Additive logistic regression: A statistical view of boosting (with discussion and a rejoinder by the authors)," Ann. Statist., vol. 28, no. 2, pp. 337–407, 2000.

[16] J. H. Friedman, "Greedy function approximation: A gradient boosting machine," Ann. Statist., vol. 29, no. 5, pp. 1189–1232, Oct. 2001.
[17] C. Gu, Smoothing Spline ANOVA Models (Springer Series in Statistics). New York, NY, USA: Springer, 2002.
[18] L. Györfi, M. Kohler, A. Krzyżak, and H. Walk, A Distribution-Free Theory of Nonparametric Regression (Springer Series in Statistics). New York, NY, USA: Springer-Verlag, 2002.
[19] W. Jiang, "Process consistency for AdaBoost," Ann. Statist., vol. 21, no. 1, pp. 13–29, 2004.
[20] G. Kimeldorf and G. Wahba, "Some results on Tchebycheffian spline functions," J. Math. Anal. Appl., vol. 33, pp. 82–95, Sep. 1971.
[21] V. Koltchinskii, "Local Rademacher complexities and oracle inequalities in risk minimization," Ann. Statist., vol. 34, no. 6, pp. 2593–2656, 2006.
[22] M. Ledoux, The Concentration of Measure Phenomenon (Mathematical Surveys and Monographs). Providence, RI, USA: American Mathematical Society, 2001.
[23] M. Ledoux and M. Talagrand, Probability in Banach Spaces: Isoperimetry and Processes. New York, NY, USA: Springer-Verlag, 1991.
[24] J. Lin, A. Rudi, L. Rosasco, and V. Cevher, "Optimal rates for spectral algorithms with least-squares regression over Hilbert spaces," Appl. Comput. Harmon. Anal., to be published. doi: 10.1016/j.acha.2018.09.009.
[25] L. Mason, J. Baxter, P. L. Bartlett, and M. R. Frean, "Boosting algorithms as gradient descent," in Adv. Neural Inf. Process. Syst., vol. 12, 1999, pp. 512–518.
[26] S. Mendelson, "Geometric parameters of kernel machines," in Proc. Conf. Learn. Theory (COLT), 2002, pp. 29–43.
[27] L. Prechelt, "Early stopping—but when?" in Neural Networks: Tricks of the Trade. Berlin, Germany: Springer, 1998, pp. 55–69.
[28] G. Raskutti, M. J. Wainwright, and B. Yu, "Early stopping and non-parametric regression: An optimal data-dependent stopping rule," J. Mach. Learn. Res., vol. 15, pp. 335–366, Jan. 2014.
[29] L. Rosasco and S. Villa, "Learning with incremental iterative regularization," in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 1630–1638.
[30] R. E. Schapire, "The strength of weak learnability," Mach. Learn., vol. 5, no. 2, pp. 197–227, 1990.
[31] R. E. Schapire, "The boosting approach to machine learning: An overview," in Nonlinear Estimation and Classification. New York, NY, USA: Springer, 2003, pp. 149–171.
[32] B. Schölkopf and A. Smola, Learning With Kernels. Cambridge, MA, USA: MIT Press, 2002.
[33] O. N. Strand, "Theory and methods related to the singular-function expansion and Landweber's iteration for integral equations of the first kind," SIAM J. Numer. Anal., vol. 11, no. 4, pp. 798–825, 1974.
[34] S. van de Geer, Empirical Processes in M-Estimation. Cambridge, U.K.: Cambridge Univ. Press, 2000.
[35] A. W. van der Vaart and J. A. Wellner, Weak Convergence and Empirical Processes. New York, NY, USA: Springer-Verlag, 1996.
[36] E. De Vito, S. Pereverzyev, and L. Rosasco, "Adaptive kernel methods using the balancing principle," Found. Comput. Math., vol. 10, no. 4, pp. 455–479, 2010.
[37] G. Wahba, "Three topics in ill-posed problems," in Inverse and Ill-Posed Problems, H. W. Engl and G. W. Groetsch, Eds. New York, NY, USA: Academic, 1987, pp. 37–50.
[38] G. Wahba, "Spline models for observational data," in Proc. CBMS-NSF Regional Conf. Ser. Appl. Math. Philadelphia, PA, USA: SIAM, 1990, pp. 1–169.
[39] M. J. Wainwright, High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge, U.K.: Cambridge Univ. Press, 2019.
[40] Y. Yang, M. Pilanci, and M. J. Wainwright, "Randomized sketches for kernels: Fast and optimal nonparametric regression," Ann. Statist., vol. 45, no. 3, pp. 991–1023, 2017.
[41] Y. Yao, L. Rosasco, and A. Caponnetto, "On early stopping in gradient descent learning," Construct. Approx., vol. 26, no. 2, pp. 289–315, 2007.
[42] T. Zhang and B. Yu, "Boosting with early stopping: Convergence and consistency," Ann. Statist., vol. 33, no. 4, pp. 1538–1579, Aug. 2005.
Yuting Wei is currently a Stein Fellow in the Statistics Department at Stanford University. She received a Ph.D. degree in Statistics from the University of California, Berkeley, in 2018, and a B.S. in Statistics and a B.A. in Economics, both from Peking University, China. She was the recipient of the 2018 Erich L. Lehmann Citation from the Berkeley statistics department for an outstanding Ph.D. dissertation in theoretical statistics. Her research interests include non-parametric statistics, high-dimensional statistical inference, optimization, and statistical information theory.

Fanny Yang is a Junior Fellow at the Institute of Theoretical Studies at ETH Zurich and holds a joint appointment as a postdoctoral scholar in the Department of Statistics and the Department of Computer Science at Stanford University. She received a Bachelor's degree in Electrical Engineering from the Karlsruhe Institute of Technology, Germany, an M.Sc. degree from the Technical University of Munich, and a Ph.D. degree in EECS from the University of California, Berkeley. She was the recipient of the Best Student Paper Award at the International Conference on Sampling Theory and Applications 2013, the Berkeley Departmental Fellowship 2013/14, and the Samuel Silver Memorial Scholarship Award 2015. Her research interests include non-parametric statistics, theory for neural networks, statistical and computational trade-offs in machine learning, and high-dimensional statistics.

Martin J. Wainwright is Chancellor's Professor at the University of California at Berkeley, with a joint appointment between the Department of Statistics and the Department of Electrical Engineering and Computer Sciences (EECS). He received a Bachelor's degree in Mathematics from the University of Waterloo, Canada, and a Ph.D. degree in EECS from the Massachusetts Institute of Technology (MIT). His research interests include high-dimensional statistics, information theory, statistical machine learning, and optimization theory. He has been awarded an Alfred P. Sloan Foundation Fellowship (2005); Best Paper Awards from the IEEE Signal Processing Society (2008) and the IEEE Communications Society (2010); the Joint Paper Prize (2012) from the IEEE Information Theory and Communications Societies; a Medallion Lectureship (2013) from the Institute of Mathematical Statistics; a Section Lecturer invitation at the International Congress of Mathematicians (2014); the COPSS Presidents' Award (2014) from the Joint Statistical Societies; and the Blackwell Lectureship (2016) from the Institute of Mathematical Statistics.