NYTRO: When Subsampling Meets Early Stopping
Raffaello Camoriano1 (iCub Facility - IIT2; LCSL - IIT & MIT3)
Tomás Angles1 (École Polytechnique; LCSL - IIT & MIT3)
Alessandro Rudi (DIBRIS - UniGe4; LCSL - IIT & MIT3)
Lorenzo Rosasco (DIBRIS - UniGe4; LCSL - IIT & MIT3)

1 The authors contributed equally.
2 Istituto Italiano di Tecnologia, Genoa, Italy.
3 Laboratory for Computational and Statistical Learning, Istituto Italiano di Tecnologia and Massachusetts Institute of Technology.
4 Università degli Studi di Genova, Italy.

Appearing in Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS) 2016, Cadiz, Spain. JMLR: W&CP volume 51. Copyright 2016 by the authors.

Abstract

Early stopping is a well known approach to reduce the time complexity for performing training and model selection of large scale learning machines. On the other hand, memory/space (rather than time) complexity is the main constraint in many applications, and randomized subsampling techniques have been proposed to tackle this issue. In this paper we ask whether early stopping and subsampling ideas can be combined in a fruitful way. We consider the question in a least squares regression setting and propose a form of randomized iterative regularization based on early stopping and subsampling. In this context, we analyze the statistical and computational properties of the proposed method. Theoretical results are complemented and validated by a thorough experimental analysis.

1 INTRODUCTION

The availability of large scale datasets requires the development of ever more efficient machine learning procedures. A key feature towards scalability is the ability to tailor computational requirements to the generalization properties/statistical accuracy allowed by the data. In other words, the precision with which computations need to be performed should be determined not only by the amount, but also by the quality of the available data.

Early stopping, known as iterative regularization in inverse problem theory (Engl et al., 1996; Zhang and Yu, 2005; Bauer et al., 2007; Yao et al., 2007; Caponnetto and Yao, 2010), provides a simple and sound implementation of this intuition. An empirical objective function is optimized in an iterative way with no explicit constraint or penalization, and regularization is achieved by suitably stopping the iteration. Too many iterations might lead to overfitting, while stopping too early might result in oversmoothing (Zhang and Yu, 2005; Bauer et al., 2007; Yao et al., 2007; Caponnetto and Yao, 2010). The best stopping rule then arises from a form of bias-variance trade-off (Hastie et al., 2001). For the discussion in this paper, the key observation is that the number of iterations controls at the same time the computational complexity and the statistical properties of the obtained learning algorithm (Yao et al., 2007). Training and model selection can hence be performed with an often considerable gain in time complexity.
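The mechanism described above can be made concrete with a short, self-contained sketch. The snippet below runs plain gradient descent on an unpenalized least squares objective and uses a held-out set to choose the stopping iteration; the linear model, the step size, and the validation-based stopping rule are illustrative assumptions, not the specific iteration or stopping rule analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression problem (illustrative): y = X w_true + noise.
n, d, sigma = 500, 50, 1.0
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d) / np.sqrt(d)
y = X @ w_true + sigma * rng.standard_normal(n)

# Hold out part of the data to select the stopping iteration.
n_tr = 350
X_tr, y_tr, X_val, y_val = X[:n_tr], y[:n_tr], X[n_tr:], y[n_tr:]

# Gradient descent on the unpenalized empirical least squares objective.
# The iteration counter t plays the role of the regularization parameter:
# too few iterations oversmooth, too many overfit.
step = 1.0 / np.linalg.norm(X_tr, 2) ** 2     # 1 / largest eigenvalue of X_tr^T X_tr
w = np.zeros(d)
best_err, best_t, best_w = np.inf, 0, w.copy()
for t in range(1, 501):
    w -= step * X_tr.T @ (X_tr @ w - y_tr)    # gradient step on ||X w - y||^2 / 2
    val_err = np.mean((X_val @ w - y_val) ** 2)
    if val_err < best_err:
        best_err, best_t, best_w = val_err, t, w.copy()

print(f"selected stopping iteration t = {best_t}, validation MSE = {best_err:.3f}")
```

Note that the only free parameter selected here is the iteration number itself, which is exactly why training and model selection can share the same computation.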
Despite these nice properties, early stopping procedures often share the same space complexity requirements, hence the same bottlenecks, as other methods, such as those based on variational regularization à la Tikhonov (see Tikhonov, 1963; Hoerl and Kennard, 1970). A natural way to tackle these issues is to consider randomized subsampling/sketching approaches. Roughly speaking, these methods achieve memory and time savings by reducing the size of the problem in a stochastic way (Smola and Schölkopf, 2000; Williams and Seeger, 2000). Subsampling methods are typically used successfully together with penalized regularization. In particular, they are popular in the context of kernel methods, where they are often referred to as Nyström approaches and provide one of the main routes towards large scale extensions (Zhang et al., 2008; Kumar et al., 2009; Li et al., 2010; Dai et al., 2014; Huang et al., 2014; Si et al., 2014; Rudi et al., 2015).

In this paper, we ask whether early stopping and subsampling methods can be fruitfully combined. With the context of kernel methods in mind, we propose and study NYTRO (NYström iTerative RegularizatiOn), a simple algorithm combining these two ideas. After recalling the properties and advantages of different regularization approaches in Section 2, in Section 3 we present NYTRO in detail together with our main result, the characterization of its generalization properties. In particular, we analyze the conditions under which it attains the same statistical properties as subsampling and early stopping. Indeed, our study shows that while both techniques share similar, optimal, statistical properties, they are computationally advantageous in different regimes, and NYTRO outperforms early stopping in the appropriate regime, as discussed in Section 3.3. The theoretical results are validated empirically in Section 4, where NYTRO is shown to provide competitive results at a fraction of the computational time on a variety of benchmark datasets.

2 Learning and Regularization

In this section we introduce the problem of learning in the fixed design setting and discuss different regularized learning approaches, comparing their statistical and computational properties. This section is a survey that might be interesting in its own right; it reviews several results providing the context for the study in the paper.

2.1 The Learning Problem

We introduce the learning setting considered in the paper. Let $\mathcal{X} = \mathbb{R}^d$ be the input space and $\mathcal{Y} \subseteq \mathbb{R}$ the output space. Consider a fixed design setting (Bach, 2013) where the input points $x_1, \dots, x_n \in \mathcal{X}$ are fixed, while the outputs $y_1, \dots, y_n \in \mathcal{Y}$ are given by
$$ y_i = f_*(x_i) + \epsilon_i, \qquad \forall i \in \{1, \dots, n\}, $$
where $f_* : \mathcal{X} \to \mathcal{Y}$ is a fixed function and $\epsilon_1, \dots, \epsilon_n$ are random variables. The latter can be seen as noise and are assumed to be independently and identically distributed according to a probability distribution $\rho$ with zero mean and variance $\sigma^2$. In this context, the goal is to minimize the expected risk, that is
$$ \min_{f \in \mathcal{H}} \mathcal{E}(f), \qquad \mathcal{E}(f) = \mathbb{E}\,\frac{1}{n}\sum_{i=1}^n \big(f(x_i) - y_i\big)^2, \quad \forall f \in \mathcal{H}, \qquad (1) $$
where $\mathcal{H}$ is a space of functions, called the hypothesis space. In real applications, $\rho$ and $f_*$ are unknown and accessible only by means of a single realization $(x_1, y_1), \dots, (x_n, y_n)$, called the training set, and an approximate solution needs to be found. The quality of a solution $f$ is measured by the excess risk, defined as
$$ R(f) = \mathcal{E}(f) - \inf_{v \in \mathcal{H}} \mathcal{E}(v), \qquad \forall f \in \mathcal{H}. $$
We next discuss estimation schemes to find a solution and compare their computational and statistical properties.

2.2 From (Kernel) Ordinary Least Squares to Tikhonov Regularization

A classical approach to derive an empirical solution to Problem (1) is the so-called empirical risk minimization
$$ f_{\mathrm{ols}} = \operatorname*{argmin}_{f \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n \big(f(x_i) - y_i\big)^2. \qquad (2) $$
In this paper, we are interested in the case where $\mathcal{H}$ is the reproducing kernel Hilbert space
$$ \mathcal{H} = \operatorname{span}\{ k(x, \cdot) \mid x \in \mathcal{X} \} $$
induced by a positive definite kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ (see Schölkopf and Smola, 2002). In this case Problem (2) corresponds to the Kernel Ordinary Least Squares (KOLS) problem and has the closed form solution
$$ f_{\mathrm{ols}}(x) = \sum_{i=1}^n \alpha_{\mathrm{ols},i}\, k(x, x_i), \qquad \alpha_{\mathrm{ols}} = K^\dagger y, \qquad (3) $$
for all $x \in \mathcal{X}$, where $K^\dagger$ denotes the pseudo-inverse of the empirical kernel matrix $K \in \mathbb{R}^{n \times n}$, $K_{ij} = k(x_i, x_j)$, and $y = (y_1, \dots, y_n)$. The cost of computing the coefficients $\alpha_{\mathrm{ols}}$ is $O(n^2)$ in memory and $O(n^3 + q(\mathcal{X})\, n^2)$ in time, where $q(\mathcal{X})\, n^2$ is the cost of computing $K$ and $n^3$ the cost of obtaining its pseudo-inverse. Here $q(\mathcal{X})$ is the cost of evaluating the kernel function. In the following, we are concerned with the dependence on $n$ and hence view $q(\mathcal{X})$ as a constant.
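As a concrete reference, the following minimal sketch computes the KOLS solution of eq. (3) with NumPy. The Gaussian kernel, its bandwidth, and the toy data are assumptions made only for illustration (the text above only requires a positive definite kernel); the pseudo-inverse call is where the $O(n^3)$ time and $O(n^2)$ memory costs mentioned above arise.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_kernel(A, B, bandwidth=1.0):
    """k(x, x') = exp(-||x - x'||^2 / (2 * bandwidth^2)); an illustrative kernel choice."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

# Toy fixed-design data.
n, d = 300, 3
X = rng.standard_normal((n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

# KOLS, eq. (3): alpha_ols = K^dagger y.
K = gaussian_kernel(X, X)           # O(n^2) memory, O(q(X) n^2) time
alpha_ols = np.linalg.pinv(K) @ y   # O(n^3) time for the pseudo-inverse

# f_ols(x) = sum_i alpha_ols[i] * k(x, x_i), evaluated here at a few new points.
X_new = rng.standard_normal((5, d))
f_ols_new = gaussian_kernel(X_new, X) @ alpha_ols
print(f_ols_new)
```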
The statistical properties of KOLS, and of related methods, can be characterized by suitable notions of dimension that we recall next. The simplest is the full dimension, that is
$$ d^* = \operatorname{rank} K, $$
which measures the degrees of freedom of the kernel matrix. This quantity might not be stable when $K$ is ill-conditioned. A more robust notion is provided by the effective dimension
$$ d_{\mathrm{eff}}(\lambda) = \operatorname{Tr}\big(K(K + \lambda n I)^{-1}\big), \qquad \lambda > 0. $$
Indeed, this quantity can be shown to be related to the eigenvalue decay of $K$ (Bach, 2013; Alaoui and Mahoney, 2014; Rudi et al., 2015) and can be considerably smaller than $d^*$, as discussed in the following. Finally, consider
$$ \tilde{d}(\lambda) = n \max_i \big(K(K + \lambda n I)^{-1}\big)_{ii}, \qquad \lambda > 0. \qquad (4) $$
It is easy to see that the following inequalities hold,
$$ d_{\mathrm{eff}}(\lambda) \le \tilde{d}(\lambda) \le 1/\lambda, \qquad d_{\mathrm{eff}}(\lambda) \le d^* \le n, \qquad \forall \lambda > 0. $$

Aside from the above notions of dimensionality, the statistical accuracy of empirical least squares solutions depends on a natural form of signal-to-noise ratio, defined next. Note that the function that minimizes the excess risk in $\mathcal{H}$ is given by
$$ f_{\mathrm{opt}}(x) = \sum_{i=1}^n \alpha_{\mathrm{opt},i}\, k(x, x_i), \qquad \alpha_{\mathrm{opt}} = K^\dagger \mu, \quad \mu = \mathbb{E} y, \qquad (5) $$
for all $x \in \mathcal{X}$. The intuition that regularization can be beneficial is made precise by the following result comparing KOLS and KRLS.

Theorem 2. Let $\lambda^* = \frac{1}{\mathrm{SNR}}$; then the following inequalities hold,
$$ \mathbb{E}\, R(\bar{f}_{\lambda^*}) \;\le\; \frac{\sigma^2 d_{\mathrm{eff}}(\lambda^*)}{n} \;\le\; \frac{\sigma^2 d^*}{n} \;=\; \mathbb{E}\, R(f_{\mathrm{ols}}). $$

We add a few comments. First, as announced, the above result quantifies the benefits of regularization. Indeed, it shows that there exists a $\lambda^*$ for which the expected excess risk of KRLS is smaller than that of KOLS. As discussed in Table 1 of Bach (2013), if $d^* = n$ and the kernel is sufficiently "rich", namely universal (Micchelli et al., 2006), then $d_{\mathrm{eff}}$ can be less than a fractional power of $d^*$, so that $d_{\mathrm{eff}} \ll d^*$.
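To make these quantities concrete, the sketch below computes $d^*$, $d_{\mathrm{eff}}(\lambda)$ and $\tilde{d}(\lambda)$ for a toy Gaussian kernel matrix, checks the stated inequalities numerically, and estimates $\mathbb{E}\, R(f_{\mathrm{ols}})$ by Monte Carlo to compare it with the quantity $\sigma^2 d^*/n$ on the right-hand side of Theorem 2. The kernel, bandwidth, $\lambda$, noise level, and the choice $\mu = K\beta$ (which guarantees that $\mu$ lies in the range of $K$) are illustrative assumptions; the KRLS estimator $\bar{f}_{\lambda}$ is not simulated here since its definition is not reproduced above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy fixed design with a Gaussian kernel (all concrete choices are illustrative).
n, d, sigma, lam = 200, 5, 0.5, 1e-2
X = rng.standard_normal((n, d))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / 2.0)

# Dimension quantities from Section 2.2.
eigvals, U = np.linalg.eigh(K)
keep = eigvals > 1e-10 * eigvals.max()
d_star = int(keep.sum())                               # d* = rank K (up to a tolerance)
M = K @ np.linalg.inv(K + lam * n * np.eye(n))         # K (K + lambda n I)^{-1}
d_eff = float(np.trace(M))                             # effective dimension
d_tilde = float(n * np.max(np.diag(M)))                # eq. (4)
print(f"d_eff = {d_eff:.1f} <= d_tilde = {d_tilde:.1f} <= 1/lambda = {1.0 / lam:.1f}")
print(f"d_eff = {d_eff:.1f} <= d* = {d_star} <= n = {n}")

# Monte Carlo check of E R(f_ols) = sigma^2 d* / n.
# With mu = K beta, mu = E[y] lies in range(K); P is the orthogonal projection onto
# range(K), so the KOLS predictions at the training points equal P y and the excess
# risk reduces to (1/n) ||P y - mu||^2.
mu = K @ rng.standard_normal(n)
P = U[:, keep] @ U[:, keep].T
risks = []
for _ in range(2000):
    y = mu + sigma * rng.standard_normal(n)
    risks.append(np.sum((P @ y - mu) ** 2) / n)
print(f"Monte Carlo E R(f_ols) = {np.mean(risks):.4f} vs sigma^2 d*/n = {sigma**2 * d_star / n:.4f}")
```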