Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate
Mikhail Belkin (The Ohio State University), Daniel Hsu (Columbia University), Partha P. Mitra (Cold Spring Harbor Laboratory)
Abstract
Many modern machine learning models are trained to achieve zero or near-zero training error in order to obtain near-optimal (but non-zero) test error. This phenomenon of strong generalization performance for "overfitted" / interpolated classifiers appears to be ubiquitous in high-dimensional data, having been observed in deep networks, kernel machines, boosting and random forests. Their performance is consistently robust even when the data contain large amounts of label noise. Very little theory is available to explain these observations. The vast majority of theoretical analyses of generalization allows for interpolation only when there is little or no label noise. This paper takes a step toward a theoretical foundation for interpolated classifiers by analyzing local interpolating schemes, including a geometric simplicial interpolation algorithm and singularly weighted k-nearest neighbor schemes. Consistency or near-consistency is proved for these schemes in classification and regression problems. Moreover, the nearest neighbor schemes exhibit optimal rates under some standard statistical assumptions. Finally, this paper suggests a way to explain the phenomenon of adversarial examples, which are seemingly ubiquitous in modern machine learning, and also discusses some connections to kernel machines and random forests in the interpolated regime.
1 Introduction
The central problem of supervised inference is to predict labels of unseen data points from a set of labeled training data. The literature on this subject is vast, ranging from classical parametric and non-parametric statistics [48, 49] to more recent machine learning methods, such as kernel machines [39], boosting [36], random forests [15], and deep neural networks [25]. There is a wealth of theoretical analyses for these methods based on a spectrum of techniques including non-parametric estimation [46], capacity control such as VC-dimension or Rademacher complexity [40], and regularization theory [42]. Nearly all of these results assume a "what you see is what you get" setup, where prediction performance on unseen test data is close to the performance on the training data, achieved by carefully managing the bias-variance trade-off. Furthermore, it is widely accepted in the literature that interpolation has poor statistical properties and should be dismissed out of hand. For example, in their book on non-parametric statistics, Györfi et al. [26, page 21] say that a certain procedure "may lead to a function which interpolates the data and hence is not a reasonable estimate". Yet, this is not how many modern machine learning methods are used in practice. For instance, the best practice for training deep neural networks is to first perfectly fit the training data [35]. The resulting (zero training loss) neural networks after this first step can already have good performance on test data [53]. Similar observations about models that perfectly fit training data have been made for other machine learning methods, including boosting [37], random forests [19], and kernel machines [12]. These methods return good classifiers even when the training data have high levels of label noise [12, 51, 53]. An important effort to show that fitting the training data exactly can, under certain conditions, be theoretically justified is the margins theory for boosting [37] and other margin-based methods [6, 24, 28, 29, 34]. However, this theory lacks explanatory power for the performance of classifiers that perfectly fit noisy labels, when it is known that no margin is present in the data [12, 51]. Moreover, margins theory does not apply to regression or to functions (for regression or classification) that interpolate the data in the classical sense [12]. In this paper, we identify the challenge of providing a rigorous understanding of generalization in machine learning models that interpolate training data. We take first steps towards such a theory by proposing and analyzing interpolating methods for classification and regression with non-trivial risk and consistency guarantees.

E-mail: [email protected], [email protected], [email protected]

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
Related work. Many existing forms of generalization analysis face significant analytical and conceptual barriers to explaining the success of interpolating methods.
Capacity control. Existing capacity-based bounds (e.g., VC dimension, fat-shattering dimension, Rademacher complexity) for empirical risk minimization [3, 4, 7, 28, 37] do not give useful risk bounds for functions with zero empirical risk whenever there is non-negligible label noise. This is because function classes rich enough to perfectly fit noisy training labels generally have capacity measures that grow quickly with the number of training data, at least with the existing notions of capacity [12]. Note that since the training risk is zero for the functions of interest, a generalization bound must bound their true risk, as it equals the generalization gap (the difference between the true and empirical risk). Whether such capacity-based generalization bounds exist is open for debate.

Stability. Generalization analyses based on algorithmic stability [8, 14] control the difference between the true risk and the training risk, assuming bounded sensitivity of an algorithm's output to small changes in training data. Like standard uses of capacity-based bounds, these approaches are not well-suited to settings where the training risk is identically zero but the true risk is non-zero.

Regularization. Many analyses are available for regularization approaches to statistical inverse problems, ranging from Tikhonov regularization to early stopping [9, 16, 42, 52]. To obtain a risk bound, these analyses require the regularization parameter $\lambda$ (or some analogous quantity) to approach zero as the number of data $n$ tends to infinity. However, to get (the minimum norm) interpolation, we need $\lambda \to 0$ while $n$ is fixed, causing the bounds to diverge.

Smoothing. There is an extensive literature on local prediction rules in non-parametric statistics [46, 49]. Nearly all of these analyses require local smoothing (to explicitly balance bias and variance) and thus do not apply to interpolation. (Two exceptions are discussed below.)
Recently, Wyner et al. [51] proposed a thought-provoking explanation for the performance of AdaBoost and random forests in the interpolation regime, based on ideas related to "self-averaging" and localization. However, a theoretical basis for these ideas is not developed in their work.

There are two important exceptions to the aforementioned discussion of non-parametric methods. First, the nearest neighbor rule (also called 1-nearest neighbor, in the context of the general family of k-nearest neighbor rules) is a well-known interpolating classification method, though it is not generally consistent for classification (and is not useful for regression when there is a significant amount of label noise). Nevertheless, its asymptotic risk can be shown to be bounded above by twice the Bayes risk [18].¹ A second important (though perhaps less well-known) exception is the non-parametric smoothing method of Devroye et al. [21] based on a singular kernel called the Hilbert kernel (which is related to Shepard's method [41]). The resulting estimate of the regression function interpolates the training data, yet is proved to be consistent for classification and regression.

The analyses of the nearest neighbor rule and the Hilbert kernel regression estimate are not based on bounding the generalization gap, the difference between the true risk and the empirical risk. Rather, the true risk is analyzed directly by exploiting locality properties of the prediction rules. In particular, the
¹More precisely, the expected risk of the nearest neighbor rule converges to $\mathbb{E}[2\eta(X)(1-\eta(X))]$, where $\eta$ is the regression function; this quantity can be bounded above by $2R^*(1-R^*)$, where $R^*$ is the Bayes risk.
prediction at a point depends primarily or entirely on the values of the function at nearby points. This inductive bias favors functions where local information in a neighborhood can be aggregated to give an accurate representation of the underlying regression function.
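For concreteness, here is a minimal sketch of a Hilbert-kernel-style (Shepard) interpolating regression estimate. The toy data, function name, and use of the weight exponent $d$ are illustrative assumptions; see Devroye et al. [21] for the actual estimator and its analysis.

```python
import math

def shepard_interpolate(x, data, d):
    """Shepard / Hilbert-kernel regression sketch: a weighted average of
    the labels with singular weights w_i = ||x - x_i||^(-d).  As x
    approaches a training point x_i, the i-th weight dominates, so the
    estimate interpolates the training data."""
    num = den = 0.0
    for xi, yi in data:
        r = math.dist(x, xi)
        if r == 0.0:          # exact interpolation at a training point
            return yi
        w = r ** (-d)
        num += w * yi
        den += w
    return num / den

# toy training set in R^2 with labels in [0, 1]
data = [((0.0, 0.0), 1.0), ((1.0, 0.0), 0.0), ((0.0, 1.0), 0.0)]
print(shepard_interpolate((0.0, 0.0), data, d=2))  # 1.0: fits the data
print(shepard_interpolate((0.5, 0.5), data, d=2))  # equidistant: mean 1/3
```

The singularity of the weights is what distinguishes this from ordinary kernel smoothing: no bandwidth tends to zero, yet the estimate still fits every training point exactly.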
What we do. Our approach to understanding the generalization properties of interpolation methods is to understand and isolate the key properties of local classification, particularly the nearest neighbor rule.

First, we construct and analyze an interpolating function based on multivariate triangulation and linear interpolation on each simplex (Section 3), which results in a geometrically intuitive and theoretically tractable prediction rule. Like nearest neighbor, this method is not statistically consistent, but, unlike nearest neighbor, its asymptotic risk approaches the Bayes risk as the dimension becomes large, even when the Bayes risk is far from zero, a kind of "blessing of dimensionality".² Moreover, under an additional margin condition, the difference between the Bayes risk and the risk of our classifier is exponentially small in the dimension. A similar finding holds for regression, as the method is nearly consistent when the dimension is high.

Next, we propose a weighted & interpolated nearest neighbor (wiNN) scheme based on singular weight functions (Section 4). The resulting function is somewhat less natural than that obtained by simplicial interpolation, but, like the Hilbert kernel regression estimate, the prediction rule is statistically consistent in any dimension. Interestingly, the conditions on the weights that ensure consistency become less restrictive in higher dimension, another "blessing of dimensionality". Our analysis provides the first known non-asymptotic rates of convergence to the Bayes risk for an interpolated predictor, as well as tighter bounds under margin conditions for classification. In fact, the rate achieved by wiNN regression is statistically optimal under a standard minimax setting.³

Our results also suggest an explanation for the phenomenon of adversarial examples [44], which are seemingly ubiquitous in modern machine learning.
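The simplicial scheme can be illustrated on a single triangle. The sketch below (toy data and hypothetical function names, not the paper's full construction, which triangulates the whole training set) linearly interpolates vertex labels via barycentric coordinates:

```python
def barycentric(p, a, b, c):
    """Barycentric coordinates of p in the triangle (a, b, c), found by
    Cramer's rule on p = la*a + lb*b + lc*c with la + lb + lc = 1."""
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    det = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    la = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / det
    lb = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / det
    return la, lb, 1.0 - la - lb

def simplicial_interpolate(p, vertices, labels):
    """Linear interpolation on one simplex: a convex combination of the
    vertex labels, so the rule fits each labeled vertex exactly."""
    coords = barycentric(p, *vertices)
    return sum(l * y for l, y in zip(coords, labels))

tri = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
labels = [1.0, 0.0, 0.0]
print(simplicial_interpolate((0.0, 0.0), tri, labels))    # 1.0 at a vertex
print(simplicial_interpolate((0.25, 0.25), tri, labels))  # 0.5 inside
```

Thresholding the interpolated value at 1/2 then gives the plug-in classifier analyzed in Section 3.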
In Section 5, we argue that interpolation inevitably results in adversarial examples in the presence of any amount of label noise. When these schemes are consistent or nearly consistent, the set of adversarial examples (where the interpolating classifier disagrees with the Bayes optimal classifier) has small measure but is asymptotically dense. Our analysis is consistent with the empirical observations that such examples are difficult to find by random sampling [22], but are easily discovered using targeted optimization procedures, such as Projected Gradient Descent [30]. Finally, we discuss the difference between direct and inverse interpolation schemes, and make some connections to kernel machines and random forests (Section 6). Proofs of the main results, along with an informal discussion of some connections to graph-based semi-supervised learning, are given in the full version of the paper [11] on arXiv.⁴
2 Preliminaries
The goal of regression and classification is to construct a predictor $\hat f$ given labeled training data $(x_1, y_1), \dots, (x_n, y_n) \in \mathbb{R}^d \times \mathbb{R}$ that performs well on unseen test data, which are typically assumed to be sampled from the same distribution as the training data. In this work, we focus on interpolating methods that construct predictors $\hat f$ satisfying $\hat f(x_i) = y_i$ for all $i = 1, \dots, n$.

Algorithms that perfectly fit training data are not common in the statistical and machine learning literature. The prominent exception is the nearest neighbor rule, which is among the oldest and best-understood classification methods. Given a training set of labeled examples, the nearest neighbor rule predicts the label of a new point $x$ to be the same as that of the nearest point to $x$ within the training set. Mathematically, the predicted label of $x \in \mathbb{R}^d$ is $y_{i'}$, where $i' \in \arg\min_{i=1,\dots,n} \|x - x_i\|$. (Here, $\|\cdot\|$ always denotes the Euclidean norm.) As discussed above, the classification risk of the nearest neighbor rule is asymptotically bounded by twice the Bayes (optimal) risk [18]. The nearest neighbor
²This does not remove the usual curse of dimensionality, which is similar to that in the standard analyses of k-NN and other non-parametric methods.

³An earlier version of this paper contained a bound with a worse rate of convergence based on a loose analysis. The subsequent work [13] found that a different Nadaraya-Watson kernel regression estimate (with a singular kernel) could achieve the optimal convergence rate; this inspired us to seek a tighter analysis of our wiNN scheme.

⁴https://arxiv.org/abs/1806.05161
rule provides an important intuition that such classifiers can (and perhaps should) be constructed using local information in the feature space. In this paper, we analyze two interpolating schemes: one based on triangulating the data and constructing the simplicial interpolant, and another based on weighted nearest neighbors with a singular weight function.
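A minimal sketch of the wiNN idea follows; the toy data, function name, and parameter choices are illustrative assumptions, and the paper's precise scheme and its consistency conditions appear in Section 4.

```python
import math

def winn_predict(x, data, k, delta):
    """Weighted & interpolated nearest neighbor (wiNN) sketch: average
    the labels of the k nearest training points under singular weights
    w(x, x_i) = ||x - x_i||^(-delta).  The singularity at distance zero
    forces the rule to interpolate the training data."""
    neighbors = sorted(data, key=lambda p: math.dist(x, p[0]))[:k]
    num = den = 0.0
    for xi, yi in neighbors:
        r = math.dist(x, xi)
        if r == 0.0:              # exact interpolation at a training point
            return yi
        w = r ** (-delta)
        num += w * yi
        den += w
    return num / den

# toy 1-D training set with alternating labels
data = [((0.0,), 1.0), ((1.0,), 0.0), ((2.0,), 1.0), ((3.0,), 0.0)]
print(winn_predict((1.0,), data, k=2, delta=1.0))  # 0.0: fits the data
print(winn_predict((0.9,), data, k=2, delta=1.0))  # pulled toward label 0
```

Truncating to $k$ neighbors is what distinguishes wiNN from the Hilbert kernel estimate, which averages over all $n$ points.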
2.1 Statistical model and notations
We assume $(X_1, Y_1), \dots, (X_n, Y_n), (X, Y)$ are iid labeled examples from $\mathbb{R}^d \times [0, 1]$. Here, $((X_i, Y_i))_{i=1}^n$ are the iid training data, and $(X, Y)$ is an independent test example from the same distribution. Let $\mu$ denote the marginal distribution of $X$, with support denoted by $\mathrm{supp}(\mu)$; and let $\eta \colon \mathbb{R}^d \to \mathbb{R}$ denote the conditional mean of $Y$ given $X$, i.e., the function given by $\eta(x) := \mathbb{E}(Y \mid X = x)$. For (binary) classification, we assume the range of $Y$ is $\{0, 1\}$ (so $\eta(x) = \mathbb{P}(Y = 1 \mid X = x)$), and we let $f^* \colon \mathbb{R}^d \to \{0, 1\}$ denote the Bayes optimal classifier, which is defined by $f^*(x) := \mathbf{1}\{\eta(x) > 1/2\}$. This classifier minimizes the risk $\mathcal{R}_{0/1}(f) := \mathbb{E}[\mathbf{1}\{f(X) \neq Y\}] = \mathbb{P}(f(X) \neq Y)$ under zero-one loss, while the conditional mean function $\eta$ minimizes the risk $\mathcal{R}_{\mathrm{sq}}(g) := \mathbb{E}[(g(X) - Y)^2]$ under squared loss.

The goal of our analyses will be to establish excess risk bounds for empirical predictors ($\hat f$ and $\hat\eta$, based on training data) in terms of their agreement with $f^*$ for classification and with $\eta$ for regression. For classification, the expected risk can be bounded as $\mathbb{E}[\mathcal{R}_{0/1}(\hat f)] \leq \mathcal{R}_{0/1}(f^*) + \mathbb{P}(\hat f(X) \neq f^*(X))$, while for regression, the expected mean squared error is precisely $\mathbb{E}[\mathcal{R}_{\mathrm{sq}}(\hat\eta)] = \mathcal{R}_{\mathrm{sq}}(\eta) + \mathbb{E}[(\hat\eta(X) - \eta(X))^2]$. Our analyses thus mostly focus on $\mathbb{P}(\hat f(X) \neq f^*(X))$ and $\mathbb{E}[(\hat\eta(X) - \eta(X))^2]$ (where the probability and expectations are with respect to both the training data and the test example).
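The role of the Bayes classifier in these bounds can be checked numerically. The sketch below uses a hypothetical $\eta$ and a uniform $X$ on $[0, 1]$ (both assumptions for illustration) and estimates the zero-one risk of $f^*$, which should approach $\mathbb{E}[\min(\eta(X), 1 - \eta(X))]$:

```python
import random

random.seed(0)

def eta(x):
    # hypothetical conditional mean P(Y = 1 | X = x), X uniform on [0, 1]
    return 0.9 if x < 0.5 else 0.2

def bayes(x):
    # Bayes optimal classifier f*(x) = 1{eta(x) > 1/2}
    return 1 if eta(x) > 0.5 else 0

# Monte Carlo estimate of the zero-one risk of f*, which should approach
# E[min(eta(X), 1 - eta(X))] = 0.5 * 0.1 + 0.5 * 0.2 = 0.15
n, errs = 200_000, 0
for _ in range(n):
    x = random.random()
    y = 1 if random.random() < eta(x) else 0
    errs += bayes(x) != y
print(errs / n)   # close to 0.15
```

No classifier can do better than this baseline, which is why the excess risk of an interpolating predictor is measured against it.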
2.2 Smoothness, margin, and regularity conditions

Below we list some standard conditions needed for further development.
$(A, \alpha)$-smoothness (Hölder). For all $x, x'$ in the support of $\mu$, $|\eta(x) - \eta(x')| \leq A \cdot \|x - x'\|^\alpha$.

$(B, \beta)$-margin condition [31, 45]. For all $t \geq 0$, $\mu(\{x \in \mathbb{R}^d : |\eta(x) - 1/2| \leq t\}) \leq B \cdot t^\beta$.

$h$-hard margin condition [32]. For all $x$ in the support of $\mu$, $|\eta(x) - 1/2| \geq h > 0$.

$(c_0, r_0)$-regularity [5]. For all $0 < r \leq r_0$ and $x \in \mathrm{supp}(\mu)$, $\lambda(\mathrm{supp}(\mu) \cap B(x, r)) \geq c_0 \, \lambda(B(x, r))$, where $\lambda$ is the Lebesgue measure and $B(x, r)$ is the Euclidean ball of radius $r$ centered at $x$.
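As a toy illustration of the margin condition (the choice $\eta(x) = x$ with $\mu$ uniform on $[0, 1]$ is an assumption for this sketch, not from the paper), the $(B, \beta)$-margin condition holds with $B = 2$ and $\beta = 1$, since $\mu(\{x : |x - 1/2| \leq t\}) = 2t$; a quick Monte Carlo check:

```python
import random

random.seed(1)

def eta(x):
    # toy regression function eta(x) = x, with X uniform on [0, 1]
    return x

# (B, beta)-margin condition: mu({x : |eta(x) - 1/2| <= t}) <= B * t^beta.
# Here the mass near the decision boundary is exactly 2t, so B = 2, beta = 1.
n = 100_000
for t in (0.05, 0.1, 0.2):
    mass = sum(abs(eta(random.random()) - 0.5) <= t for _ in range(n)) / n
    print(t, mass)   # mass is approximately 2 * t
```

Larger $\beta$ (or the hard margin condition, which is the $t < h$ limit of zero mass) means less probability mass near the decision boundary, which is what yields the faster rates in later sections.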