Model Selection for Optimal Prediction in Statistical Machine Learning

Ernest Fokoué

Introduction

At the core of all our modern-day advances in artificial intelligence is the emerging field of statistical machine learning (SML). From a very general perspective, SML can be thought of as a field of mathematical sciences that combines mathematics, probability, statistics, and computer science with several ideas from cognitive neuroscience and psychology to inspire the creation, invention, and discovery of abstract models that attempt to learn and extract patterns from the data. One could think of SML as a field of science dedicated to building models endowed with the ability to learn from the data in ways similar to the ways humans learn, with the ultimate goal of understanding and then mastering our complex world well enough to predict its unfolding as accurately as possible. One of the earliest applications of statistical machine learning centered around the now ubiquitous MNIST benchmark task, which consists of building statistical models (also

Ernest Fokoué is a professor of statistics at Rochester Institute of Technology. His email address is [email protected].
Communicated by Notices Associate Editor Emilie Purvine.
For permission to reprint this article, please contact: [email protected].
DOI: https://doi.org/10.1090/noti2014

FEBRUARY 2020 NOTICES OF THE AMERICAN MATHEMATICAL SOCIETY 155

known as learning machines) that automatically learn and accurately recognize handwritten digits from the United States Postal Service (USPS). A typical deployment of an artificial intelligence solution to a real-life problem would have several components touching several aspects of the taxonomy of statistical machine learning. For instance, when artificial intelligence is used for the task of automated sorting of USPS letters, at least one component of the whole system deals with recognizing the recipient of a given letter as accurately as (or even better than) a human operator. This would mean that the statistical machine learning model can ideally recognize handwritten digits regardless of the various ways in which those digits are written. How does one go about formulating, defining, designing, building, refining, and ultimately deploying such statistical machine learning models for the intended use? In the case of the MNIST data, for instance, the digits to be potentially recognized are captured as a matrix, which is then transformed and represented as a high-dimensional vector fed as the input to a statistical learning model, along with the true label of the digit at hand. Conceptually, the task of building the statistical learning machine is mathematically formulated as the construction of a function mapping the elements of the input space (the space in which the digits are represented) to the output space (the space of the true labels for the digits). Over the years, different methods have been created and developed by statisticians and computer scientists from all around the world to help build statistical learning machines for a wide variety of problems like those mentioned earlier. F. Rosenblatt's [13] groundbreaking and thought-provoking publication of the seminal paper featuring the so-called Perceptron ushered in the era of brain-inspired statistical and computational learning, and can rightly be thought of as the catalyst of the field of artificial neural networks, and even arguably the ancestor of our modern-day hot topic of deep neural networks. A couple of decades after Rosenblatt's seminal paper, the Multilayer Perceptron (MLP) was introduced as one of the solutions to the limitations of the Perceptron. MLPs extended, strengthened, and empowered artificial neural networks by allowing potentially many hidden layers, a tremendous improvement over the Perceptron that brought a much needed new spring to artificial intelligence. MLPs turned out to be extremely successful, namely, on tasks like the MNIST USPS digit recognition task mentioned earlier, but also on several other tasks including credit scoring, stock price prediction, automatic translation, and medical diagnosis, just to name a few. MLPs triggered a veritable scientific revolution, inspiring the flourishing of creativity among researchers, many of whom invented or discovered entirely new learning methods and paradigms, or revived or adapted existing ones.

Theoretical Foundations

It is typical in statistical machine learning that a given problem will be solved in a wide variety of different ways. As a result, it is a central element in SML, both within each paradigm and among paradigms, to come up with good criteria for deciding and determining which learning machine or model is the best for the given task at hand. To better explain this quintessential task of model selection, we consider a typical statistical machine learning setting, with two sets X and Y, along with their Cartesian product Z ≡ X × Y. We further define Z^n ≡ Z × Z × ⋯ × Z to be the n-fold Cartesian product of Z. We assume that Z is equipped with a probability measure ψ, albeit assumed unknown throughout this paper. Let Z ∈ Z^n, with Z = ((x_1, y_1), (x_2, y_2), …, (x_n, y_n)), denote a realization of a random sample of n examples, where each example z_i = (x_i, y_i) is independently drawn according to the above probability measure ψ on the product space Z ≡ X × Y. For practical reasons, and in keeping with the data science and artificial intelligence lexicon, we shall quite often refer to the random sample Z as the data set, and will use the compact and comprehensive notation

  D_n = {(x_i, y_i) ∼_iid p_{xy}(x, y), i = 1, …, n},   (1)

where all pairs (x_i, y_i) ∈ X × Y, and p_{xy}(x, y) is the probability density function associated with the probability measure ψ on Z. Given a random sample Z = ((x_1, y_1), (x_2, y_2), …, (x_n, y_n)), one of the most pervading goals in both theoretical and applied statistical machine learning is to find the function f⋆ : X → Y that best captures the dependencies between the x_i's and the y_i's in such a way that, given a new random (unseen) observation z_new = (x_new, y_new) ∼ p_{xy}(x, y) with z_new ∉ D_n, the image f⋆(x_new) of x_new ∼ p_x(x) provides a prediction of y_new that is as accurate and precise as possible, in the sense of yielding the smallest possible discrepancy between y_new and f⋆(x_new).

This setting, where one seeks to build functions of the type f : X → Y, is the foundational setting of machine learning in general and statistical machine learning in particular. Throughout this paper, we shall refer to X as the input space and to Y as the output space. For simplicity, we shall assume that X ⊆ ℝ^p for our methodological and theoretical derivations and explanations, but will allow X to be more general in practical demonstrations and examples. We will consider both regression learning, corresponding to Y = ℝ, and multicategory classification learning (pattern recognition), corresponding to output spaces of the form Y = {1, 2, …, G}, where G is the number of categories.

Definition 1. A loss function ℒ(⋅, ⋅) is a nonnegative bivariate function ℒ : Y × Y → ℝ₊, such that given a, b ∈ Y, the value of ℒ(a, b) measures the discrepancy between a

and b, or the deviation of a from b, or the loss incurred from using b in place of a. For instance, ℒ(y, f(x)) = ℒ(f(x), y) will be used throughout this paper to quantify the discrepancy between y and f(x) with the finality of choosing the best f, the optimal f, the f that minimizes expected discrepancy over the entire Z. The loss function plays a central role in statistical learning theory as it allows an unambiguous measure and quantification of optimality.

Definition 2. The theoretical risk or generalization error or true error of any function f ∈ Y^X is given by

  R(f) = E[ℒ(Y, f(X))] = ∫_{X×Y} ℒ(y, f(x)) p_{xy}(x, y) dx dy   (2)

and can be interpreted as the expected discrepancy between f(X) and Y, and indeed as a measure of the predictive strength of f. Ideally, one seeks to find the minimizer f⋆ of R(f) over all measurable functions f ∈ Y^X, specifically,

  f⋆ = arg inf_{f ∈ Y^X} {R(f)} = arg inf_{f ∈ Y^X} {E[ℒ(Y, f(X))]},   (3)

whose corresponding theoretical risk R⋆ serves as the gold standard and is given by

  R⋆ = R(f⋆) = inf_{f ∈ Y^X} {R(f)}.   (4)

If we reconsider our overarching goal stated earlier, then the smallest risk (expected loss) in the prediction of Y_new given X_new is achieved with the f⋆ of (3), and that theoretical optimal risk is the R⋆ of (4), namely, E[ℒ(Y_new, f⋆(X_new))] = R⋆. The theoretical optimal predictive model is therefore f⋆, although we must recognize that it is of no practical use as it cannot be computed. For instance, when we consider both classification and regression, the theoretical optimal predictive model f⋆ can be elicited and derived for some well-known foundational loss functions. For classification, an intuitive and indeed widely studied loss function is the so-called zero-one (0/1) loss function, defined simply with the indicator function as follows:

  ℒ(y, f(x)) = 𝟙(y ≠ f(x)) = { 0 if y = f(x);  1 if y ≠ f(x). }   (5)

When the zero-one loss function is used in classification, it can be shown quite easily that R(f), the corresponding true risk (also known as theoretical risk or generalization error or true error), coincides with the misclassification probability Prob_{(X,Y)∼ψ}[Y ≠ f(X)], namely,

  R(f) = ∫_{X×Y} ℒ(y, f(x)) p_{xy}(x, y) dx dy = E[𝟙(Y ≠ f(X))] = Prob_{(X,Y)∼ψ}[Y ≠ f(X)].   (6)

This intuitive result is of paramount importance for practical aspects of statistical machine learning, because it provides an understandable frame of reference for the interpretation of the predictive performance of learning machines. Indeed, the true error R(f) of a classifier f therefore defines the probability that f misclassifies any arbitrary observation randomly drawn from the population of interest according to the distribution ψ. R(f) can also be interpreted as the expected disagreement between the classifier f(X) and the true label Y.

Definition 3 (The Bayes classifier). Consider a pattern x from the input space and a class label y. Let p(x|y) denote the class conditional density of x in class y, and let Prob[Y = y] denote the prior probability of class membership. The posterior probability of class membership is

  Prob[Y = y | x] = Prob[Y = y] p(x|y) / p(x).   (7)

Given x ∈ X to be classified, the Bayes classification strategy consists of assigning x to the class with maximum posterior probability. With h : X → Y denoting the Bayes classifier, we have, ∀x ∈ X,

  h(x) = argmax_{c ∈ Y} {Prob(Y = c | x)}.   (8)

Theorem 1. The minimizer of the zero-one risk over all possible classifiers is the Bayes classifier h defined in (8):

  f⋆ = arg inf_f {R(f)} = arg inf_f {E[𝟙(Y ≠ f(X))]} = arg inf_f {Prob_{(X,Y)∼ψ}[Y ≠ f(X)]} = h.   (9)

Therefore, the Bayes classifier h defined in (8) is the universal best classifier, such that ∀x ∈ X,

  f⋆(x) = h(x) = argmax_{c ∈ Y} {Prob(Y = c | x)} = argmax_{c ∈ Y} { Prob[Y = c] p(x|c) / p(x) }.   (10)

The risk R⋆ corresponding to f⋆ is the smallest possible error that any classifier can achieve, i.e.,

  R⋆ = R(f⋆) = R(h) = inf_f {R(f)}.

The fact that the Bayes classifier achieves the universal infimum error over all measurable classifiers is a fundamental result in pattern recognition and statistical learning. The probability theory for pattern recognition is made up of multiple results featuring learning machines whose performance is compared to the performance of the Bayes classifier [7], [19]. Although this result is of more theoretical than practical importance, it turns out to provide a framework of reference for building more practical classification learning machines. Although we do not know the true density p_{xy}(⋅, ⋅), we can assume a wide variety of possible densities in special cases, and then attempt the
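To make Definition 3 and Theorem 1 concrete, here is a minimal sketch (not from the article) that computes the Bayes classifier of (10) and numerically approximates its risk R⋆ for a made-up two-class problem with one-dimensional Gaussian class conditional densities; the priors, means, and common variance are invented for illustration only.

```python
import math

# Hypothetical two-class problem: equal priors, unit-variance Gaussian
# class-conditional densities centered at -1 and +1.
PRIOR = {0: 0.5, 1: 0.5}
MEAN = {0: -1.0, 1: +1.0}
SIGMA = 1.0

def density(x, c):
    """Class-conditional density p(x | y = c)."""
    z = (x - MEAN[c]) / SIGMA
    return math.exp(-0.5 * z * z) / (SIGMA * math.sqrt(2.0 * math.pi))

def bayes_classifier(x):
    """h(x) = argmax_c Prob[Y = c] p(x | c), as in equation (10)."""
    return max(PRIOR, key=lambda c: PRIOR[c] * density(x, c))

def bayes_risk(steps=16000, lo=-8.0, hi=8.0):
    """R* = Prob[Y != h(X)], approximated by midpoint-rule integration."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        label = bayes_classifier(x)
        # probability mass of the classes NOT chosen at this x
        total += sum(PRIOR[c] * density(x, c) for c in PRIOR if c != label) * h
    return total

print(bayes_classifier(-0.3))  # prints 0: negative x favors the class at -1
print(bayes_risk())
```

With equal priors and equal variances, the decision boundary sits at x = 0 and the numerical risk agrees with the closed-form value Φ(−1) ≈ 0.159, illustrating that no classifier can beat h on this population.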

construction of the Bayes classifier under those distributional assumptions. It is found in practice that when the assumptions are met (or almost met), the ensuing learning machine tends to exhibit superior predictive performance. For instance, under the assumption of multivariate Gaussian class conditional densities with equal covariance matrices in classification, one can derive the population Bayes Gaussian linear discriminant analysis classifier, whose estimation from the corresponding data yields the best predictive performance over all other learning machines. It bears repeating that this superior performance presupposes that the assumed multivariate Gaussianity is plausible. Every single aspect of optimal predictive model selection we have mentioned so far is strongly tied to the distributional characteristics of the space under consideration. In the case of superior predictive performance inherited from the correct assumption of the generator of the data, it must be said that practical data sets often arise from rather complex distributions that are often far too difficult to estimate. One could even consider estimating the density and then estimating the corresponding classifier. Unfortunately, the task of probability density estimation in complex high-dimensional spaces turns out to be a treacherous task, often more complex (statistically and computationally) than the classification task one would be intending to use density estimation for. Some researchers have resorted to semiparametric solutions like the use of mixtures of Gaussians (or mixtures of other parametric densities) to model their class conditional densities, and have done so with great success, although the analysis of mixtures is fraught with challenges, to the point that having to deal with those along with the main task of classification may render their use unattractive and not viable in this context. For this reason, practitioners and methodological and theoretical researchers tend to focus on more realizable goals than the hunt for the universal best learning machine. The approach consists of assuming that the function underlying the data (the decision boundary in the context of classification) is a member of a class of functions with some specific (sometimes desirable) properties. Of course, the very fact of choosing a specific function space automatically comes at the potential price of incurring an approximation error. In the example given earlier, assuming Gaussian class conditional probability densities with equal covariance matrices led to the derivation of a classifier belonging to the space of linear learning machines. In this case, the ensuing function space was implicit in the distributional choice. We will see later that the choice of the function space is often quite explicit and typically motivated by experience or pure convenience. Before we delve into the search for optimal predictive models in specific function spaces, it is useful to point out that fundamental statistical learning results exist in regression that are similar to the ones presented earlier in the context of classification learning.

Theorem 2. Consider functions f : ℝ^p → ℝ and the squared theoretical risk functional

  R(f) = E[(Y − f(X))²] = ∫_{X×Y} (y − f(x))² p_{xy}(x, y) dx dy.   (11)

Then the best function f⋆ = arg inf_f {R(f)} is given by the conditional expectation of Y given X; i.e., ∀x ∈ X,

  f⋆(x) = E[Y | X = x] = ∫_Y y p(y|x) dy.   (12)

Theorem 2 provides the basic foundation of all regression analysis under the squared error loss. Clearly, the conditional expectation of Y given X = x given in equation (12) is the theoretical optimal predictive function in regression, with a corresponding theoretical risk that is the baseline.

Theorem 3. For every f : X → Y,

  R(f) = ∫_X (f(x) − f⋆(x))² dψ(x) + σ⋆²,   where
  σ⋆² = R⋆ = R(f⋆) = ∫_{X×Y} (y − E[Y|x])² p_{xy}(x, y) dx dy.   (13)

Since the conditional density p(y|x) of Y given x, which is the main ingredient of f⋆, is not known in practice, the optimum remains a theoretical one and serves as a gold standard and reference when the squared error loss is used, as is often the case. In an effort to realize an estimator of the optimum with the data, one can consider the traditional nonparametric regression machinery. In one dimension, nonparametric regression works very well, but it unfortunately suffers from the curse of dimensionality. Just as with classification learning, one could relax the generality of p(y|x) by assuming, for instance, a specific distribution. An example of this is the ubiquitous assumption of Gaussianity, by which p(y|x) = φ(y; h(x), σ²), where h ∈ H is a function with certain properties, taken from a function space H. The function space H could be anything from the space of linear functions in the p-dimensional Euclidean space ℝ^p to a space of certain nonlinear functions to reproducing kernel Hilbert spaces (RKHS) anchored by a suitably chosen kernel (similarity measure). We will seek to solve the more reasonable problem of choosing from a function space H ⊂ Y^X the function f⋄ ∈ H that best estimates the dependencies between x and y. As stated earlier, trying to find f⋆ is hopeless. One needs to select a function space H ⊂ Y^X, then choose the best function f⋄_H from H, i.e.,

  f⋄_H = arg inf_{f ∈ H} {E[ℒ(Y, f(X))]},   (14)

so that

  R(f⋄_H) = R⋄_H = inf_{f ∈ H} R(f).

For notational simplicity, we will simply use f⋄ and R⋄ in place of f⋄_H and R⋄_H, respectively. For the regression learning task, for instance, one could assume that the input space X is a closed and bounded interval of ℝ, i.e., X = [a, b], and then consider estimating the dependencies between x and y from within the space H of all bounded functions on X = [a, b], i.e.,

  H = {f : X → ℝ | ∃B ≥ 0 such that |f(x)| ≤ B}.

One could further make the functions of the above H continuous, so that the space to search becomes

  H = {f : [a, b] → ℝ | f is continuous} = C([a, b]),

which is the well-known space of all continuous functions on a closed and bounded interval [a, b]. This is indeed a very important function space. In fact, polynomial regression consists of searching our learning machine from a function space that is a subspace of C([a, b]). In other words, in polynomial regression learning, we are searching the space

  𝒫([a, b]) = {f ∈ C([a, b]) | f is a polynomial in ℝ}.

Interestingly, Weierstrass did prove that 𝒫([a, b]) is dense in C([a, b]). One considers the space of all polynomials of some degree p, i.e.,

  H = 𝒫_p([a, b]) = {f ∈ C([a, b]) | ∃θ ∈ ℝ^{p+1} : f(x) = Σ_{j=0}^{p} θ_j xʲ, ∀x ∈ [a, b]}.

Similarly, for the classification learning task of binary pattern recognition with Y = {−1, +1}, one may consider finding the best linear separating hyperplane, so that the corresponding function space is

  H = {f : X → Y | ∃w₀ ∈ ℝ, w ∈ ℝ^p : ∀x ∈ X, f(x) = sign(w⊤x + w₀)},   (15)

or even a more complex function space capable of modelling and representing nonlinear decision boundaries like

  H(Φ) = {f : X → Y | ∃w₀ ∈ ℝ, w ∈ F : ∀x ∈ X, f(x) = sign(⟨w, Φ(x)⟩ + w₀)},   (16)

where Φ : X → F is a mapping that projects each input x up to a high-dimensional feature space F, thereby allowing the corresponding machine the capacity to capture nonlinear decision boundaries.

Empirical Foundations

Throughout the previous section, we explored some basic aspects of the theoretical foundations of optimal prediction model selection. It turns out that f⋄ ∈ H, just like f⋆, cannot be computed because p_{xy}(x, y) is never known in practice. What does happen in practice is that, given the data set D_n along with the chosen loss function ℒ(⋅, ⋅), the empirical risk R̂(f) is defined as an estimator of the theoretical risk R(f). From a practical perspective, given a data set D_n, empirical risk minimization is used in place of theoretical risk minimization to construct estimators of f⋆, namely,

  f̂ = f̂_{H,n} = f̂_n = argmin_{f ∈ H} {R̂_n(f)} = argmin_{f ∈ H} { (1/n) Σ_{i=1}^{n} ℒ(y_i, f(x_i)) }.   (17)

Although the zero-one loss function allows us to theoretically define what constitutes the universal best optimal classifier, it cannot be used in any given function space H to construct an estimated learning machine, because its use inherently implies an untenable combinatorial exploration. Fortunately, many other loss functions have been typically used in the search for optimal predictive models in statistical machine learning. With f : X → {−1, +1}, and h ∈ H such that f(x) = sign(h(x)), some frequently used loss functions for binary classification include: (a) Zero-one (0/1) loss: ℒ(y, f(x)) = 𝟙(yh(x) < 0); (b) Hinge loss: ℒ(y, f(x)) = max(1 − yh(x), 0); (c) Logistic loss: ℒ(y, f(x)) = log(1 + exp(−yh(x))); and (d) Exponential loss: ℒ(y, f(x)) = exp(−yh(x)). With f : X → ℝ and f ∈ H, some loss functions for regression include: (a) ℒ₁ loss: ℒ(y, f(x)) = |y − f(x)|; (b) ℒ₂ loss: ℒ(y, f(x)) = |y − f(x)|²; (c) ε-insensitive ℒ₁ loss: ℒ(y, f(x)) = max(|y − f(x)| − ε, 0); and (d) ε-insensitive ℒ₂ loss: ℒ(y, f(x)) = max(|y − f(x)|² − ε, 0). Other loss functions exist.

Although the empirical risk minimization principle provides an effective practical framework for learning patterns underlying the data, the estimator f̂_{H,n} derived from it must be handled with great care and caution for a wide variety of reasons, which we now make clear. With the definitions of f⋆, f⋄, and now f̂_{H,n} in hand, a natural and almost quintessential yet somewhat audacious question would be to assess the difference between f̂_{H,n} and f⋆, maybe via some suitably defined norm, say ‖f̂_{H,n} − f⋆‖, maybe using probabilistic measures like Pr[‖f̂_{H,n} − f⋆‖] or even E[‖f̂_{H,n} − f⋆‖], though it might not be trivial at all how to properly define such a norm, let alone the corresponding quantities. A difference like
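The empirical risk (17) is straightforward to compute once a loss is fixed. A small sketch, using a toy sample and a fixed (not fitted) scorer h, both invented here purely for illustration, evaluates the four binary classification losses listed above:

```python
import math

# Toy binary sample (x, y) with labels in {-1, +1}, invented for the sketch.
data = [(-2.0, -1), (-1.0, -1), (-0.5, +1), (1.0, +1), (2.0, +1)]

def h(x):
    """Hypothetical fixed scorer; the classifier is f(x) = sign(h(x))."""
    return x

losses = {
    "zero_one":    lambda y, s: 1.0 if y * s < 0 else 0.0,
    "hinge":       lambda y, s: max(1.0 - y * s, 0.0),
    "logistic":    lambda y, s: math.log(1.0 + math.exp(-y * s)),
    "exponential": lambda y, s: math.exp(-y * s),
}

def empirical_risk(loss):
    """Equation (17): average loss over the sample for the fixed scorer h."""
    return sum(loss(y, h(x)) for x, y in data) / len(data)

for name, loss in losses.items():
    print(name, round(empirical_risk(loss), 4))
```

Only the point (−0.5, +1) is misclassified, so the zero-one empirical risk is 1/5 = 0.2; the surrogate losses penalize the same point more smoothly, which is precisely what makes them tractable for optimization.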

‖f̂_{H,n} − f⋄_H‖_H might be easier, although itself neither easy nor even practically realizable. The typical approach is to deal with the utility of the function, like R(f), rather than the function itself. Now, the relationship between R(f̂_n) and the other theoretical risks is captured by the following cascade of inequalities, namely,

  R(f⋆) ≤ R(f⋄) ≤ R(f̂_{H,n}).   (18)

The true risk R(f̂_{H,n}) of the realized estimator f̂_{H,n} is clearly and unsurprisingly the largest of the three. Since R⋆ is unrealizable in practice, the natural goal should at least be: Out of all the functions in H generated using the data D_n, choose the one that best imitates f⋆, which means choose f̂_{H,n} ∈ H such that E[R(f̂_{H,n})] − R(f⋆) is smallest. If one could directly (or even indirectly) construct f̂⁽ᵒᵖᵗ⁾_{H,n} ∈ H such that

  f̂⁽ᵒᵖᵗ⁾_{H,n} = argmin_{f̂_{H,n} ∈ H} {E[R(f̂_{H,n})] − R(f⋆)},

then f̂⁽ᵒᵖᵗ⁾_{H,n} would be the optimal predictive model. Unfortunately, such a function cannot be directly constructed in practice because its objective function is purely theoretical. The so-called excess risk, E[R(f̂_{H,n}) − R⋆], defined as the expected value of the difference between the true risk R(f̂_n) associated with f̂_n and the overall minimum risk R⋆, can be decomposed to explore in greater detail the source of error in the function estimation process:

  E[R(f̂_n) − R⋆] = E[R(f̂_n) − R(f⋄)] + E[R(f⋄) − R⋆],   (19)

where the first term is the estimation error and the second term is the approximation error. Making the excess risk small is tricky because of the following dilemma: If the approximation error is made small, typically by making the function space H larger and more complex so that the members of H approximate f⋆ very well, then the corresponding estimation error tends to get undesirably larger. Many authors have written extensively on methods for achieving desirable trade-offs with favorable predictive benefits. The empirical risk R̂_n(f̂_n) of f̂_n can be made arbitrarily small by making H very complex, leading to a phenomenon known as overfitting. It must be emphasized that such a function has very little to do with being optimally predictive, because the theoretical (true) risk R(f̂_n) of such an f̂_n is undesirably large. Indeed, when it comes to optimal prediction, it is crucial for the estimator f̂_{H,n} to have an empirical risk R̂_n(f̂_n) that is as close as possible to the true risk R(f̂_n). Now, it is well known among practitioners that almost all statistical machine learning problems are inherently inverse problems, in the sense that learning methods seek to optimally estimate an unknown generating function using empirical observations assumed to be generated by it. As a result, statistical machine learning problems are inherently ill-posed, in the sense that they typically violate at least one of Hadamard's three well-posedness conditions. For clarity, according to Hadamard a problem is well-posed if it fulfills the following three conditions: (a) a solution exists; (b) the solution is unique; and (c) the solution is stable, i.e., does not change drastically under small perturbations. For many machine learning problems, the first condition of well-posedness, namely, existence, is fulfilled. However, the solution is either not unique or not stable. With large p and small n for instance, not only is there a multiplicity of solutions but also the instability thereof, due to the singularities resulting from the fact that n ⋘ p. Typically, the regularization framework is used to isolate a feasible and optimal (in some sense) solution. Tikhonov's regularization is the one most commonly resorted to and typically amounts to a Lagrangian formulation of a constrained version of the initial problem, the constraints being the objects used to isolate a unique and stable solution.

Effect of Model Complexity

To gain deeper insights into the properties and challenges inherent in optimal predictive model selection, we now consider a practical exploration of univariate regression learning using the polynomial function space, namely,

  H = {f ∈ C([a, b]) | ∃θ₀, θ₁, …, θ_p ∈ ℝ : f(x) = Σ_{j=0}^{p} θ_j xʲ, ∀x ∈ [a, b]}.

Having chosen our function space H along with the squared error loss, our statistical learning task consists of finding the minimizer of the empirical counterpart of the average squared errors (ASE), i.e.,

  f̂_{H,n} = f̂_n = f̂ = argmin_{f ∈ H} {ASE(f)} = argmin_{f ∈ H} { (1/n) Σ_{i=1}^{n} (y_i − f(x_i; θ))² }.   (20)

We are seeking the best member of the function space H based on the given data set D_n. Since we specifically chose the function space of all univariate real-valued polynomials of degree at most p in some interval [a, b], finding f̂ comes down to estimating the coefficients of the polynomial using the data. Using the n × (p + 1) Vandermonde matrix X = (x_iʲ), i = 1, …, n, j = 0, …, p, and Y ∈ ℝⁿ, the

solution to problem (20) is given by

  θ̂ = argmin_{θ ∈ ℝ^{p+1}} { (1/n) Σ_{i=1}^{n} (Y_i − Σ_{j=0}^{p} θ_j x_iʲ)² }
     = argmin_{θ ∈ Θ} {(Y − Xθ)⊤(Y − Xθ)}
     = (X⊤X)⁻¹X⊤Y.   (21)

The estimator in equation (21) has many quintessential layers that are crucial to the understanding of optimal predictive model selection. It is therefore important to dissect and unpack those key aspects of statistical learning.

(a) Stochastic nature of the estimator. First and foremost, the estimate θ̂ = (θ̂₀, θ̂₁, …, θ̂_p)⊤ of θ = (θ₀, θ₁, …, θ_p)⊤ is a random variable, and as a result the estimate f̂(x) = f̂(x; θ̂) = θ̂₀ + θ̂₁x + θ̂₂x² + ⋯ + θ̂_p x^p of f⋆(x) is also a random variable. We therefore have to be mindful, whenever f̂ is used, that it is inherently a random entity whose handling is best done with the powerful machineries of probability and statistics.

(b) Bias and variance. Since f̂(x) is a random variable, we must compute important aspects like its bias B[f̂(x)] = E[f̂(x)] − f⋆(x), which measures how far our chosen class of models is from the true generator of the data, and its variance V[f̂(x)] = E[(f̂(x) − E[f̂(x)])²], which, as the name says, tells us relatively how stable the constructed estimator is.

(c) Model complexity and temptation to overfit. Since our goal expressed through the objective function is to find the member of the class H that minimizes the empirical risk, it is very tempting at first to use the data at hand to build the f̂ that makes ASE(f̂) the smallest. For instance, the higher the value of p, the smaller ASE(f̂(⋅)) will get. In fact, in the most extreme of scenarios, one could simply make ASE(f̂(⋅)) = 0 by specifying f̂(x_i) = y_i, ∀i = 1, …, n. In a sense, we have a dilemma: If we make f̂ complex (large p), we make the bias small, but the variance is increased. If we make f̂ simple (small p), we make the bias large, but the variance is decreased. In this case, the degree p of the polynomial represents the complexity of the corresponding model. In the end, we will have to come up with various criteria for estimating the optimal complexity, in the sense of the one that leads to low prediction error.

To help gain deeper insights into this fundamental statistical machine learning phenomenon, let's consider the synthetic (artificial) task of learning a univariate polynomial regression from the data. We simulate the data using the function f⋆(x) = −x + √2 sin(π^{3/2} x²) for x ∈ [−1, +1], with a noise variance σ² = 0.32.

Figure 1. Effect of complexity on estimated function.

Figure 1 helps us gain insights into the basics of the bias-variance trade-off. The polynomial of degree 1, which happens to be the model with lowest nonzero complexity, performs poorly, as does the perfect memorizer, whose complexity is virtually infinite since it simply connects all the points. The solid line model does a great job learning the underlying function. The low complexity models attempt to avoid a large estimation variance but then pay a price in the form of an increased bias, resulting in a large prediction error. The high complexity models attempt to fit too well, literally memorizing the data in the extreme case, and thereby learning both the noise and the signal, resulting in a large variance as the price paid for low bias, ultimately yielding another high prediction error. The optimal fit depicted by the solid line model is achieved by settling for a trade-off between bias and variance. The task dedicated to determining that optimal complexity, which results in the optimal predictive performance, occupies a central place in statistical machine learning and will be further discussed throughout this paper.

The phenomenon of bias-variance trade-off is of fundamental importance and can be further explained in the context of regression learning by the so-called bias-variance decomposition of the theoretical risk of f̂(⋅) under the squared error loss. Let's consider the data set D_n. Let's also assume that Y_i = f⋆(x_i) + ε_i, where the ε_i's are i.i.d. from some distribution with mean(ε) = 0 and variance(ε) = σ². Let f̂ be our estimator of f⋆ built using the random sample provided. Let x ∈ X. The pointwise bias-variance decomposition of the expected squared error is given by

  R(f̂) = E[(Y − f̂(x))²] = σ² + Bias²(f̂(x)) + Variance(f̂(x)),

where σ² = Variance(ε) is the variance of the noise term but essentially represents the irreducible learning error, that is, the error inherent in the structure of the population, one that cannot be changed by any learning machine. It

FEBRUARY 2020 NOTICES OF THE AMERICAN MATHEMATICAL SOCIETY 161 is easy to verify that this is the smallest possible error, i.e., High Bias & Low Variance (Underfitting) Low Bias & High Variance (Overfitting) 푅⋆ = 푅(푓⋆) = 피[(푌 − 푓⋆(퐱))2] = 횟횊횛횒횊횗회횎(휀) = 휎2. In- terestingly, the bias-variance phenomenon depicted in Fig- ure 3 and Figure 1 in the context of regression learning is also present in classification learning. A detailed account of the same type of decomposition for the 0/1 loss used

in classification can be found in [12] and [8]. The opti- Prediction Error(f) Test Error(f) mal decision boundary seen in Figure 2 is obtained using cross validation on the 푘-Nearest Neighbors learning ma- chine (22) for various values of 푘. Clearly, 푘 does indeed Optimal complexity (Min Pred Error) control the complexity of the underlying model, namely, Training Error(f) Optimism of training error the decision boundary. Although the decision boundary Complexity(f) in this case cannot be explicitly written or learned as the optimum of some explicit objective function, one can still Figure 3. Bias-variance trade-off and model complexity. use cross validation to determine the optimal value of 푘 (optimal neighborhood size). This tremendous flexibility Elements of Model Identification of the cross validation principle is certainly one of its great- Once a specific function space is chosen for our learning est strengths, which makes it very appealing and widely ap- task, like we did earlier with our choice of the space of uni- plicable in statistical machine learning. For classification, variate real-valued polynomials, it is not enough to know Y = {1, … , 퐺}, the highest polynomial degree for our particular regression learning task. Indeed, we also need to know which of the 푛 1 푓ˆ(횔홽홽)(퐱) = 횊횛횐횖횊횡{ ∑ ퟙ(푦 = 푔)ퟙ(퐱 ∈ 풱 (퐱))}. (22) coefficients are nonzero. In other words, we need aclear 푘 푖 푖 푘 푔∈Y 푖=1 and unambiguous way, like an index, to distinguish the members of H so that we can identify and then select spe- For regression, Y = ℝ, cific ones. To help clarify that, we can think of the function H 푛 space in this case as a vector space with the monomi- 1 2 푗 푝 푓ˆ(횔홽홽)(퐱) = ∑ 푦 ퟙ(퐱 ∈ 풱 (퐱)). (23) als {퐱, 퐱 , … , 퐱 , … , 퐱 } as the basis vectors or atoms of the 푘 푖 푖 푘 푖=1 expansion that help span the space. In general, one con- siders a basis set {홱1(퐱), 홱2(퐱), … , 홱푗(퐱), … , 홱푝(퐱)}, so that 푗 for polynomial regression, 홱푗(퐱) = 퐱 . 
Using the basis set, a member $f \in \mathcal{H}$ can then be specified by simply indicating which of the monomials are combined together to form its representation. For our space of univariate real-valued polynomials of degree at most $p$, we could use one of the key building blocks of the parametric model selection machinery, namely, a vector of indicator variables. With the $p$ original atoms, there are $2^p - 1$ nonempty models, each corresponding to a subset of the provided atoms. We shall use a vector $\boldsymbol{\gamma} = (\gamma_1, \gamma_2, \ldots, \gamma_p)^\top$ to denote the index of a given model, with each $\gamma_j$ being an indicator of the atom's presence in the model under consideration, namely,
$$\gamma_j = \mathbb{1}(\text{atom } B_j(\mathbf{x}) \text{ appears in model } M_{\boldsymbol{\gamma}}).$$
For simplicity we shall assume no intercept; i.e., $\theta_0 = 0$. Here, $\boldsymbol{\gamma} = (1, 1, \ldots, 1)^\top$ corresponds to the full model $M_f$, while $\boldsymbol{\gamma} = (0, 0, \ldots, 0)^\top$ corresponds to the empty model, also referred to as the null model, and given by $M_0: \mathbf{Y} = \boldsymbol{\varepsilon}$ (pure zero-mean noise). Equipped with this index, $|M_{\boldsymbol{\gamma}}| = |f_{\boldsymbol{\gamma}}| = p_{\boldsymbol{\gamma}} = \sum_{j=1}^{p} \gamma_j$ is the number of atoms in model $M_{\boldsymbol{\gamma}}$, and $\boldsymbol{\theta}_{\boldsymbol{\gamma}} \in \mathbb{R}^{p_{\boldsymbol{\gamma}}}$ is the subset of $\boldsymbol{\theta} \in \mathbb{R}^p$ made up of only the $\theta_j$'s picked up by $\boldsymbol{\gamma}$, that is, $\theta_{\gamma_j} = \gamma_j \theta_j$. Finally, $\mathbf{X}_{\boldsymbol{\gamma}}$ is the submatrix of $\mathbf{X}$ whose columns are only those $p_{\boldsymbol{\gamma}}$ columns of $\mathbf{X}$ picked up by $\boldsymbol{\gamma}$, so that $\mathbf{X}_{\boldsymbol{\gamma}}$ is really an $n \times p_{\boldsymbol{\gamma}}$ matrix, and the corresponding model $M_{\boldsymbol{\gamma}}$ is given by
$$M_{\boldsymbol{\gamma}}: \mathbf{Y} = \mathbf{X}_{\boldsymbol{\gamma}} \boldsymbol{\theta}_{\boldsymbol{\gamma}} + \boldsymbol{\varepsilon}. \tag{24}$$

Putting everything together, we define a function space $\mathcal{H}$ as the hypothesis space containing the pattern underlying our data, but in a sense, using the language of models, we are somewhat dealing with a model space $\mathcal{M}$. Having now defined the useful concept of index (indicator vector) of a given model, we can unambiguously specify members $f_{\boldsymbol{\gamma}} \in \mathcal{H}$ or $M_{\boldsymbol{\gamma}} \in \mathcal{M}$, using the vector $\boldsymbol{\gamma} \in \boldsymbol{\Gamma} = \{0, 1\}^p$, which represents the indexing of that specific member of the model space $\mathcal{M}$ or, equivalently, the function space $\mathcal{H}$. Clearly, $\boldsymbol{\Gamma}$ is made up of the $2^p$ models. For our polynomial regression task, we are in the presence of the so-called parametric family of models, in the sense that the choice of a member $M_{\boldsymbol{\gamma}}$ of the model space $\mathcal{M}$ through its index vector $\boldsymbol{\gamma}$ maps to the corresponding collection of parameters contained in $\boldsymbol{\theta}_{\boldsymbol{\gamma}}$. In such a parametric context, the unambiguous specification of a model or the corresponding function thereof typically indicates both the model $M_{\boldsymbol{\gamma}}$ and the corresponding parameter vector $\boldsymbol{\theta}_{\boldsymbol{\gamma}}$. Let $\tilde{\mathbf{x}} = (B_1(\mathbf{x}), B_2(\mathbf{x}), \ldots, B_p(\mathbf{x}))^\top$ and $\mathbf{V}_{\boldsymbol{\gamma}} \in \{0, 1\}^{p \times p_{\boldsymbol{\gamma}}}$, such that $\mathbf{V}_{\boldsymbol{\gamma}}[j, k] = \gamma_j$, $j = 1, \ldots, p$, $k = 1, \ldots, p_{\boldsymbol{\gamma}}$. Any member $f_{\boldsymbol{\gamma}} = f_{\boldsymbol{\gamma}}(\mathbf{x}|\boldsymbol{\theta}_{\boldsymbol{\gamma}}, M_{\boldsymbol{\gamma}}) \in \mathcal{H}$ can be fully specified as
$$f_{\boldsymbol{\gamma}}(\mathbf{x}|\boldsymbol{\theta}_{\boldsymbol{\gamma}}, M_{\boldsymbol{\gamma}}) = \tilde{\mathbf{x}}^\top \mathbf{V}_{\boldsymbol{\gamma}} \boldsymbol{\theta}_{\boldsymbol{\gamma}} = \sum_{j=1}^{p} \gamma_j \theta_{\gamma_j} B_j(\mathbf{x}). \tag{25}$$
For any $M_{\boldsymbol{\gamma}} \in \mathcal{M}$, the ordinary least squares (OLS) estimate encountered earlier in equation (21) is now given by
$$\hat{\boldsymbol{\theta}}_{\boldsymbol{\gamma}}^{(\mathrm{OLS})} = \hat{\boldsymbol{\theta}}_{\boldsymbol{\gamma}} = (\mathbf{X}_{\boldsymbol{\gamma}}^\top \mathbf{X}_{\boldsymbol{\gamma}})^{-1} \mathbf{X}_{\boldsymbol{\gamma}}^\top \mathbf{Y}. \tag{26}$$
It is easy to see that the prediction of the average response at $\mathbf{x}$ is given by
$$\hat{f}_{\boldsymbol{\gamma}}^{(\mathrm{OLS})}(\mathbf{x}) = \hat{f}_{\boldsymbol{\gamma}}^{(\mathrm{OLS})}(\mathbf{x}|\hat{\boldsymbol{\theta}}_{\boldsymbol{\gamma}}^{(\mathrm{OLS})}, M_{\boldsymbol{\gamma}}) = \tilde{\mathbf{x}}^\top \mathbf{V}_{\boldsymbol{\gamma}} \hat{\boldsymbol{\theta}}_{\boldsymbol{\gamma}} = \sum_{j=1}^{p} \gamma_j \hat{\theta}_{\gamma_j} B_j(\mathbf{x}). \tag{27}$$
It is important to note that the identifier of functions need not be a vector as in the above parametric modelling scenario. In nonparametric univariate regression learning, for instance, the identifier of a member of the Nadaraya–Watson space of estimators is simply a real scalar, namely, the bandwidth of the kernel used in the estimation:
$$\hat{f}_{\gamma}^{(\mathrm{NW})}(\mathbf{x}) = \sum_{i=1}^{n} y_i K\!\left(\frac{\mathbf{x} - \mathbf{x}_i}{\gamma}\right) \bigg/ \sum_{\ell=1}^{n} K\!\left(\frac{\mathbf{x} - \mathbf{x}_\ell}{\gamma}\right). \tag{28}$$
For this nonparametric scenario, the model index $\gamma$ suffices to fully specify the model, as there are no parameters in the traditional sense of a finite collection of model coefficients. Here $\gamma \in \boldsymbol{\Gamma}^\star \subseteq \mathbb{R}_+$, which means that our model space search is done on an infinite subset of the right-hand side of the real number line. For the $k$-Nearest Neighbors learning machine, the complexity of the implicit underlying model is measured by $k$, the size of the neighborhood, which is a discrete number from 1 to $n$. Therefore, for $k$NN, $\gamma \in \boldsymbol{\Gamma} = \{1, 2, \ldots, n\}$. In practice, this is truncated to a reasonable maximum number of neighbors.

Figure 2. Optimal kNN decision boundary.

Model Selection Criteria
When it comes to model selection for optimal prediction, both Bayesian statistics and non-Bayesian statistics have contributed richly. Essentially, one can identify four main ways to address the quest for optimal prediction: namely, (a) Selection, (b) Compression, (c) Regularization, and (d) Aggregation. The first three approaches operate under the strong assumption that a single member of the function space $\mathcal{H}$ exists with optimal predictive properties, and all the methods and techniques seek to find that unique member. All the existing criteria are carefully created, designed, and developed to help yield that member of $\mathcal{H}$. On the other hand, aggregation, also known as ensemble learning or model averaging or model combination, takes the view that a single optimum might not exist. Aggregation operates on the assumption that many decent candidate models exist, and instead of needlessly wasting time to seek a unique optimum that one may never find, it is better to combine the good candidates in some fashion to yield an overall lower prediction (generalization) error. Over the years, aggregation techniques like Bayesian Model Averaging (BMA) [2, 11], Bootstrap Aggregating (Bagging) [3], Random Forest [4], Random Subspace Learning, Stacking, and certainly Adaptive Boosting [14] and Gradient Boosting have emerged and continue to be developed. Interestingly, these so-called ensemble learning methods tend to yield the best predictive performances in practical applications.

Likelihood based selection. In the presence of a multiplicity of potential models competing to fit the data, and considering that the estimators of those models are based on random samples with inherently built-in uncertainty, it makes sense to assume that any choice of a model consequently has built-in uncertainty. Before the data is collected and the model built, $p(M_{\boldsymbol{\gamma}})$ represents its prior probability. Once the data is collected, the posterior probability $p(M_{\boldsymbol{\gamma}}|\mathcal{D}_n)$ of model $M_{\boldsymbol{\gamma}}$ provides a reasonable mechanism for assessing and measuring the uncertainty attached to its selection. Now, using $m_{\boldsymbol{\gamma}}(\mathcal{D}_n) = p(\mathcal{D}_n|M_{\boldsymbol{\gamma}}) = \int_{\boldsymbol{\Theta}} p(\mathcal{D}_n|\boldsymbol{\theta}_{\boldsymbol{\gamma}}, M_{\boldsymbol{\gamma}}) p(\boldsymbol{\theta}_{\boldsymbol{\gamma}}) d\boldsymbol{\theta}_{\boldsymbol{\gamma}}$, we can write
$$p(M_{\boldsymbol{\gamma}}|\mathcal{D}_n) = \frac{p(M_{\boldsymbol{\gamma}}) m_{\boldsymbol{\gamma}}(\mathcal{D}_n)}{\sum_{\boldsymbol{\gamma}'} p(M_{\boldsymbol{\gamma}'}) m_{\boldsymbol{\gamma}'}(\mathcal{D}_n)} = \frac{p(\mathcal{D}_n|M_{\boldsymbol{\gamma}}) p(M_{\boldsymbol{\gamma}})}{p(\mathcal{D}_n)} = \frac{p(\mathcal{D}_n|M_{\boldsymbol{\gamma}}) p(M_{\boldsymbol{\gamma}})}{\sum_{\ell=1}^{2^p} p(\mathcal{D}_n|M_\ell) p(M_\ell)}. \tag{29}$$
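To make the indexing concrete, here is a small sketch (toy data and helper names of my own choosing) that enumerates all $2^p - 1$ nonempty index vectors $\boldsymbol{\gamma}$, fits each model $M_{\boldsymbol{\gamma}}$ by OLS as in (26), and scores each candidate with the BIC score defined later in equation (37), using the Gaussian log-likelihood:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Toy data from a sparse polynomial: only the atoms x^1 and x^3 are active.
n, p = 100, 5
x = rng.uniform(-1, 1, n)
X = np.column_stack([x**j for j in range(1, p + 1)])   # atoms B_j(x) = x^j
y = 2.0 * x - 1.5 * x**3 + rng.normal(0, 0.1, n)

def ols(Xg, y):
    """OLS estimate of equation (26) for the design matrix X_gamma."""
    return np.linalg.solve(Xg.T @ Xg, Xg.T @ y)

def bic(Xg, y):
    """Gaussian BIC score of a model, in the spirit of equation (37)."""
    theta = ols(Xg, y)
    resid = y - Xg @ theta
    sigma2 = np.mean(resid**2)
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return -2 * loglik + Xg.shape[1] * np.log(n)

# Enumerate the 2^p - 1 nonempty index vectors gamma and keep the best model.
best_gamma, best_score = None, np.inf
for gamma in itertools.product([0, 1], repeat=p):
    if sum(gamma) == 0:
        continue                       # skip the null model M_0
    Xg = X[:, [j for j in range(p) if gamma[j] == 1]]
    score = bic(Xg, y)
    if score < best_score:
        best_gamma, best_score = gamma, score
# best_gamma now indexes the BIC-optimal model M_gamma; with this noise
# level it often recovers the true support (1, 0, 1, 0, 0).
```

Exhaustive enumeration is only feasible for small $p$, which is one reason the criteria and search strategies discussed below matter in practice.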

In a parametric context like the one introduced in "Elements of Model Identification," the Bayesian estimator of the parameter vector $\boldsymbol{\theta}_{\boldsymbol{\gamma}}$ for model $M_{\boldsymbol{\gamma}} \in \mathcal{M}$ is given by
$$\tilde{\boldsymbol{\theta}}_{\boldsymbol{\gamma}}^{(\mathrm{Bayes})} = \tilde{\boldsymbol{\theta}}_{\boldsymbol{\gamma}} = \mathbb{E}[\boldsymbol{\theta}_{\boldsymbol{\gamma}}|M_{\boldsymbol{\gamma}}, \mathcal{D}_n] = \int \boldsymbol{\theta}_{\boldsymbol{\gamma}}\, p(\boldsymbol{\theta}_{\boldsymbol{\gamma}}|M_{\boldsymbol{\gamma}}, \mathcal{D}_n)\, d\boldsymbol{\theta}_{\boldsymbol{\gamma}}. \tag{30}$$
From a Bayesian perspective, if model $M_{\boldsymbol{\gamma}}$ is selected, then the predictor of the response $Y$ given $\mathbf{x}$ is given by
$$\hat{f}_{\boldsymbol{\gamma}}^{(\mathrm{Bayes})}(\mathbf{x}) = \tilde{\mathbf{x}}^\top \mathbf{V}_{\boldsymbol{\gamma}} \mathbb{E}[\boldsymbol{\theta}_{\boldsymbol{\gamma}}|M_{\boldsymbol{\gamma}}, \mathcal{D}_n] = \tilde{\mathbf{x}}^\top \mathbf{V}_{\boldsymbol{\gamma}} \tilde{\boldsymbol{\theta}}_{\boldsymbol{\gamma}} = \sum_{j=1}^{p} \gamma_j B_j(\mathbf{x}) \tilde{\theta}_{\gamma_j}. \tag{31}$$
Under the squared error loss, the Bayesian Model Averaging (BMA) predictor provides the optimal predictor [2, 11], whose corresponding prediction function is given by
$$\hat{f}^{(\mathrm{BMA})}(\mathbf{x}) = \sum_{\boldsymbol{\gamma} \in \boldsymbol{\Gamma}} \sum_{j=1}^{p} \gamma_j\, p(M_{\boldsymbol{\gamma}}|\mathcal{D}_n)\, B_j(\mathbf{x})\, \hat{\theta}_{\gamma_j}. \tag{32}$$
The median probability model introduced and developed in [2] seeks to achieve both optimal prediction and consistent model selection. The quintessential element in the construction of the median probability model is the posterior inclusion probability $\mathrm{PIP}_j$ of atom $B_j(\mathbf{x})$, with
$$\mathrm{PIP}_j = \Pr[\gamma_j = 1|\mathcal{D}_n] = \sum_{\boldsymbol{\gamma} \in \boldsymbol{\Gamma}} \gamma_j\, p(M_{\boldsymbol{\gamma}}|\mathcal{D}_n). \tag{33}$$
The median probability model index vector is given by $\boldsymbol{\gamma}^{(\mathrm{med})} \in \boldsymbol{\Gamma} = \{0, 1\}^p$, where $\gamma_j^{(\mathrm{med})} = \mathbb{1}(\mathrm{PIP}_j \geq \frac{1}{2})$. The median probability model is thus the model made up of the atoms whose posterior inclusion probability is at least one half. The main limitation of the median probability model lies in the fact that the model does not always exist, mainly due to the rigidity of the threshold. In [9] I remedied this limitation by suggesting a flexible and adaptive approach for optimal predictive atom selection in the general basis function expansion framework. An alternative to the median probability model is the highest posterior model, whose model index vector is given by
$$\boldsymbol{\gamma}^{(\mathrm{HPM})} = \underset{\boldsymbol{\gamma} \in \boldsymbol{\Gamma}}{\mathrm{argmax}}\left\{ p(M_{\boldsymbol{\gamma}}|\mathcal{D}_n) \right\}.$$
Recall also that given a model $M_{\boldsymbol{\gamma}} \in \mathcal{M}$, along with the corresponding $\boldsymbol{\theta}_{\boldsymbol{\gamma}} \in \mathbb{R}^{p_{\boldsymbol{\gamma}}}$, the likelihood of $\boldsymbol{\theta}_{\boldsymbol{\gamma}}$ is
$$L(\boldsymbol{\theta}_{\boldsymbol{\gamma}}|M_{\boldsymbol{\gamma}}, \mathcal{D}_n) = p(\mathcal{D}_n|f_{\boldsymbol{\gamma}}(\mathbf{X}|\boldsymbol{\theta}_{\boldsymbol{\gamma}}, M_{\boldsymbol{\gamma}})) = \prod_{i=1}^{n} p(y_i|f_{\boldsymbol{\gamma}}(\mathbf{x}_i|\boldsymbol{\theta}_{\boldsymbol{\gamma}}, M_{\boldsymbol{\gamma}})), \tag{34}$$
and the maximum likelihood estimator of $\boldsymbol{\theta}_{\boldsymbol{\gamma}}$ is
$$\hat{\boldsymbol{\theta}}_{\boldsymbol{\gamma}}^{(\mathrm{MLE})} = \underset{\boldsymbol{\theta}_{\boldsymbol{\gamma}} \in \mathbb{R}^{p_{\boldsymbol{\gamma}}}}{\mathrm{argmax}}\left\{ \log L(\boldsymbol{\theta}_{\boldsymbol{\gamma}}|M_{\boldsymbol{\gamma}}, \mathcal{D}_n) \right\}. \tag{35}$$
The Schwarz Bayesian Information Criterion (BIC) [15], although very prevalent in non-Bayesian settings, just happens, as its name suggests, to have a Bayesian origin. The model index $\boldsymbol{\gamma}^{(\mathrm{BIC})}$ of a model $M_{\boldsymbol{\gamma}} \in \mathcal{M}$ is given by
$$\boldsymbol{\gamma}^{(\mathrm{BIC})} = \underset{\boldsymbol{\gamma} \in \boldsymbol{\Gamma}}{\mathrm{argmin}}\left\{ \mathrm{BIC}_n(M_{\boldsymbol{\gamma}}) \right\}, \tag{36}$$
where the score $\mathrm{BIC}_n(M_{\boldsymbol{\gamma}})$ of model $M_{\boldsymbol{\gamma}} \in \mathcal{M}$ is
$$\mathrm{BIC}_n(M_{\boldsymbol{\gamma}}) = -2 \log L(\hat{\boldsymbol{\theta}}_{\boldsymbol{\gamma}}|M_{\boldsymbol{\gamma}}; \mathcal{D}_n) + |M_{\boldsymbol{\gamma}}| \log n. \tag{37}$$
The Akaike Information Criterion (AIC) [1], where the score $\mathrm{AIC}_n(M_{\boldsymbol{\gamma}})$ of model $M_{\boldsymbol{\gamma}} \in \mathcal{M}$ is defined as
$$\mathrm{AIC}_n(M_{\boldsymbol{\gamma}}) = -2 \log L(\hat{\boldsymbol{\theta}}_{\boldsymbol{\gamma}}|M_{\boldsymbol{\gamma}}; \mathcal{D}_n) + 2|M_{\boldsymbol{\gamma}}|, \tag{38}$$
predates BIC, and while BIC is regarded as the chief selection criterion, AIC has enjoyed the distinct property of yielding typically better predictive performances.

Elements of cross validation. A more universally applicable model selection score is the ubiquitous cross validation score. In its most general formulation, the $V$-fold cross validation score proceeds by deterministically dividing the data set $\mathcal{D}_n$ into $V$ chunks (folds) of almost equal sizes, such that $\mathcal{D}_n = \bigcup_{v=1}^{V} \mathcal{D}_v$ and $n = \sum_{v=1}^{V} |\mathcal{D}_v|$. The cross validation score is given by
$$\mathrm{CV}(\hat{f}) = \frac{1}{V} \sum_{v=1}^{V} \hat{\varepsilon}_v, \tag{39}$$
where
$$\hat{\varepsilon}_v = \frac{1}{|\mathcal{D}_v|} \sum_{i=1}^{n} \mathbb{1}(\mathbf{z}_i \in \mathcal{D}_v)\, \mathcal{L}(y_i, \hat{f}^{(-\mathcal{D}_v)}(\mathbf{x}_i)),$$
and $\hat{f}^{(-\mathcal{D}_v)}(\cdot)$ is the estimator of $f$ constructed without the $v$th chunk $\mathcal{D}_v$ of $\mathcal{D}_n$. An algorithmic (pseudo-code) description is given below in Algorithm 1 to help build an intuitive understanding of this most general of model selection scores. In practice, the data is often randomly shuffled prior to the deterministic splitting into chunks. The oldest incarnation of the cross validation principle is leave one out cross validation, which corresponds to $V = n$. It is important to mention here that cross validation is one of the most used approaches to model selection for optimal prediction in statistical machine learning. From its earliest days with M. Stone's [17] seminal paper, along with its wide variety of extensions and adaptations, like [16], the cross validation principle has continually played a central role in the selection of various types of model hyperparameters. In virtually all the model spaces considered in this paper, cross validation is the default approach for empirical intraspace model comparison and model selection. When classification and regression trees are used as the function space, their pruning is done via cross validation. Cross validation is also used as one way to estimate the number of base learners in ensemble learning methods like Bagging [3] or Random Forest [4] or even adaptive boosting [14]. Cross validation also plays a central role in support vector machine classification and support vector regression learning, as well as in ridge regression [10] and the famous lasso [18] and its extensions. In short, cross validation is central to non-Bayesian regularization. One of the greatest appeals of the cross validation principle lies in its generality, its flexibility, and its wide applicability. Cross validation is typically used for determining the optimal complexity in both parametric and nonparametric function spaces, but also crucially for selecting the specific member of the function space that achieves the lowest prediction error, provided such a unique member exists. It is important to know that there are learning machines, and very good ones at that, that are constructed purely algorithmically. While it is difficult or even at times impossible to use some of the other optimal predictive model selection criteria on purely algorithmic machines like the $k$-Nearest Neighbors learning machines of equations (22) and (23), it is straightforward to use cross validation on them, as long as the error is well defined. Cross validation applies nicely to the most interpretable learning machines, namely, classification and regression trees, which are built purely algorithmically but still benefit from the predictive power and flexibility of the cross validation principle.

Algorithm 1: $V$-fold Cross Validation
Input: Training data $\mathcal{D}_n = \{\mathbf{z}_i = (\mathbf{x}_i^\top, y_i)^\top,\ i = 1, \ldots, n\}$, where $\mathbf{x}_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$, and the function of interest is denoted by $f$; sample size $n$; number of folds $V$ (with chunk size $m \approx n/V$).
Output: Cross validation score $\mathrm{CV}(\hat{f})$.
for $v = 1$ to $V$ do
  Extract the validation set $\mathcal{D}_v = \{\mathbf{z}_i \in \mathcal{D}_n : i \in [1 + (v-1) \times m,\ v \times m]\}$
  Extract the training set $\mathcal{D}_v^c := \mathcal{D}_n \setminus \mathcal{D}_v$
  Build the estimator $\hat{f}^{(-\mathcal{D}_v)}(\cdot)$ using $\mathcal{D}_v^c$
  Compute predictions $\hat{f}^{(-\mathcal{D}_v)}(\mathbf{x}_i)$ for $\mathbf{z}_i \in \mathcal{D}_v$
  Compute the validation error for the $v$th chunk:
    $\hat{\varepsilon}_v = \frac{1}{|\mathcal{D}_v|} \sum_{i=1}^{n} \mathbb{1}(\mathbf{z}_i \in \mathcal{D}_v)\, \mathcal{L}(y_i, \hat{f}^{(-\mathcal{D}_v)}(\mathbf{x}_i))$
Compute the CV score $\mathrm{CV}(\hat{f}) = \frac{1}{V} \sum_{v=1}^{V} \hat{\varepsilon}_v$

Regularized risk minimization. One of the fundamental results in statistical learning theory has to do with the fact that the minimizer of the empirical risk could turn out to be overly optimistic and lead to poor generalization performance. It is indeed the case that by making our estimated classifier very complex, it can adapt too well to the data at hand, meaning a very low in-sample error rate, but yield very high out-of-sample error rates due to overfitting, the estimated classifier having learned both the signal and the noise. In technical terms, this is referred to as the bias-variance dilemma, in the sense that by increasing the complexity of the estimated learning machine, the bias is reduced (good fit all the way to the point of overfitting) (see Figure 2), but the variance of that estimator is increased. On the other hand, considering much simpler estimators leads to less variance but higher bias (due to underfitting, the model not being rich enough to fit the data well). This phenomenon of the bias-variance dilemma is particularly potent with massive data when the number of predictor variables $p$ is much larger than the sample size $n$. One of the main tools in the modern machine learning arsenal for dealing with this is the so-called regularization framework, whereby instead of using the empirical risk alone, a constrained version of it, also known as the regularized or penalized version, is used. Indeed, within a selected space $\mathcal{H}$ of potential learning machines, one typically chooses some loss function $\mathcal{L}(\cdot, \cdot)$ with some desirable properties like smoothness or convexity (this is because one needs at least to be able to build the desired classifier), and then finds the minimizer of its regularized version, i.e.,
$$\hat{f}_{\mathcal{H}, \lambda, n} = \underset{f \in \mathcal{H}}{\mathrm{argmin}}\left\{ \hat{R}_{\mathcal{H}, n}(f) + \lambda\, \Omega_{\mathcal{H}}(f) \right\}, \tag{40}$$
where $\lambda$ controls the bias-variance trade-off. Typically, $\lambda > 0$ and is determined by cross validation. Cross validation for determining $\lambda$ proceeds by defining a grid $\Lambda \subset \mathbb{R}_+^\star = (0, +\infty)$ of possible values of $\lambda$. Sometimes, based on intuition or experience, it could just be $\Lambda = [\lambda_{\mathrm{min}}, \lambda_{\mathrm{max}}]$; then
$$\hat{\lambda}^{(\mathrm{opt})} = \underset{\lambda \in \Lambda}{\mathrm{argmin}}\left\{ \mathrm{CV}(\hat{f}_\lambda) \right\}. \tag{41}$$
$\hat{f}_{\mathcal{H}, \hat{\lambda}^{(\mathrm{opt})}, n}$ is clearly far better than $\hat{f}_{\mathcal{H}, n}$ from equation (17). By inherent design, the cross validation mechanism endows $\hat{f}_{\mathcal{H}, \hat{\lambda}^{(\mathrm{opt})}, n}$ with some predictive power, making it an estimator with the potential for predictive optimality. As long as the loss function $\mathcal{L}(\cdot, \cdot)$ and the penalty function $\Omega_{\mathcal{H}}(\cdot)$ have desirable mathematical and statistical properties like convexity and differentiability and boundedness to allow the search of the function space $\mathcal{H}$ to be performed by optimization, $\hat{f}_{\mathcal{H}, \hat{\lambda}^{(\mathrm{opt})}, n}$, thanks to the cross validation mechanism, provides a practical framework for potentially selecting the optimal predictive member of $\mathcal{H}$. It is important to note that finding $\hat{f}_{\mathcal{H}, \hat{\lambda}^{(\mathrm{opt})}, n} \in \mathcal{H}$ does not in any way guarantee that the true risk $R(\hat{f}_{\mathcal{H}, \hat{\lambda}^{(\mathrm{opt})}, n})$ is close to $R^\star = R(f^\star)$. In other words, $\hat{f}_{\mathcal{H}, \hat{\lambda}^{(\mathrm{opt})}, n}$
is the best in $\mathcal{H}$, but there is no guarantee that it is anywhere near $f^\star$. $\hat{f}_{\mathcal{H}, \hat{\lambda}^{(\mathrm{opt})}, n}$ is what we refer to here as the intraspace optimal predictive model, since it is the cross validated best estimator within the function space $\mathcal{H}$.

Logistic regression is arguably one of the most widely used statistical learning machines, even enjoying a direct and strong relationship with artificial neural networks. Using the traditional $\{0, 1\}$ labelling on the response variable $Y$, we have $\mathbb{P}[Y_i = 1|\mathbf{x}_i, \boldsymbol{\theta}_{\boldsymbol{\gamma}}, M_{\boldsymbol{\gamma}}] = \pi(\mathbf{x}_i; \boldsymbol{\theta}_{\boldsymbol{\gamma}}, M_{\boldsymbol{\gamma}})$ and $\pi(\mathbf{x}_i; \boldsymbol{\theta}_{\boldsymbol{\gamma}}) = \pi_i(\boldsymbol{\theta}_{\boldsymbol{\gamma}}) = \frac{1}{1 + e^{-\mathbf{x}_i^\top \boldsymbol{\theta}_{\boldsymbol{\gamma}}}}$. The likelihood is
$$L(\boldsymbol{\theta}_{\boldsymbol{\gamma}}; M_{\boldsymbol{\gamma}}, \mathcal{D}_n) = \prod_{i=1}^{n} \left\{ [\pi_i(\boldsymbol{\theta}_{\boldsymbol{\gamma}})]^{y_i} [1 - \pi_i(\boldsymbol{\theta}_{\boldsymbol{\gamma}})]^{1 - y_i} \right\}. \tag{42}$$
The corresponding regularized empirical risk for the binary multiple linear logistic regression model is given by
$$\hat{R}_\lambda(\boldsymbol{\theta}_{\boldsymbol{\gamma}}, M_{\boldsymbol{\gamma}}) = -\log L(\boldsymbol{\theta}_{\boldsymbol{\gamma}}; M_{\boldsymbol{\gamma}}, \mathcal{D}_n) + \lambda \|\boldsymbol{\theta}_{\boldsymbol{\gamma}}\|_{\mathcal{H}}. \tag{43}$$
Now, the celebrated support vector machine [19] for binary classification with response variable taking values in $\{-1, +1\}$ is a solution to the regularized empirical hinge risk functional, namely,
$$\hat{\mathbf{w}} = \underset{\mathbf{w} \in \mathcal{F}}{\mathrm{argmin}}\left\{ \frac{1}{n} \sum_{i=1}^{n} \left(1 - y_i \langle \mathbf{w}, \Phi(\mathbf{x}_i) \rangle\right)_+ + \frac{\lambda}{2} \|\mathbf{w}\|_{\mathcal{H}}^2 \right\}.$$
Using quadratic programming on the dual formulation of this problem with $\alpha_i$ as the Lagrangian multipliers, we get $\hat{\mathbf{w}} = \sum_{i=1}^{n} \hat{\alpha}_i y_i \Phi(\mathbf{x}_i)$, and the corresponding estimated prediction function is
$$\hat{f}_{\mathrm{svm}}(\mathbf{x}) = \mathrm{sign}\left( \sum_{i=1}^{n} y_i \hat{\alpha}_i \mathcal{K}(\mathbf{x}, \mathbf{x}_i) \right),$$
where the nonzero $\hat{\alpha}_i$'s correspond to the so-called support vectors, and $\mathcal{K}(\mathbf{x}, \mathbf{x}_i) = \langle \Phi(\mathbf{x}), \Phi(\mathbf{x}_i) \rangle$ is an incarnation of the so-called kernel trick that makes SVM immensely practical. Here, $\mathcal{K}(\cdot, \cdot)$ is a bivariate function called a kernel, defined on $\mathcal{X} \times \mathcal{X}$ and used to measure the similarity between two points in an observation space. One of the most commonly used kernels in statistical machine learning is the Gaussian radial basis function kernel given by
$$\mathcal{K}(\mathbf{x}, \mathbf{x}_i) = \exp\left( -\frac{1}{2} \frac{\|\mathbf{x} - \mathbf{x}_i\|_2^2}{\tau^2} \right).$$
There are many other kernels and kernel methods, like Gaussian processes [5, 6].

Computational Model Selection
Before $\hat{f}_{\mathcal{H}, n}$ can be deemed good from a predictive perspective, its complexity must be controlled in order to endow it with good generalization properties, i.e., small prediction error on out-of-sample observations. This focus on the "generalizability" of $\hat{f}_{\mathcal{H}, n}$ is incredibly central to statistical learning when optimal prediction is the primary goal. Let $\mathcal{D}_n = \{Z_1, Z_2, \ldots, Z_n \overset{\mathrm{iid}}{\sim} p_Z(\mathbf{z})\}$, where $Z_i = (X_i, Y_i) \in \mathcal{X} \times \mathcal{Y}$. Consider random splits of $\mathcal{D}_n$ into a training and a test set such that $\mathcal{D}_n = \mathcal{D}_{\mathrm{tr}} \cup \mathcal{D}_{\mathrm{te}}$ with $n = |\mathcal{D}_{\mathrm{tr}}| + |\mathcal{D}_{\mathrm{te}}|$. Consider mappings $f: \mathcal{X} \longrightarrow \mathcal{Y}$ and a loss function $\mathcal{L}(\cdot, \cdot)$. Then the training and test errors are given by
$$\hat{R}_{\mathrm{tr}}(f) = \frac{1}{|\mathcal{D}_{\mathrm{tr}}|} \sum_{i=1}^{n} \mathcal{L}(Y_i, f(X_i))\, \mathbb{1}(Z_i \in \mathcal{D}_{\mathrm{tr}}) \tag{44}$$
and
$$\hat{R}_{\mathrm{te}}(f) = \frac{1}{|\mathcal{D}_{\mathrm{te}}|} \sum_{j=1}^{n} \mathcal{L}(Y_j, f(X_j))\, \mathbb{1}(Z_j \in \mathcal{D}_{\mathrm{te}}). \tag{45}$$
If $\hat{f} = \underset{f \in \mathcal{H}}{\mathrm{arginf}}\{\hat{R}_{\mathrm{tr}}(f)\}$, then $\mathbb{E}(\hat{R}_{\mathrm{tr}}(\hat{f})) \leq \mathbb{E}(\hat{R}_{\mathrm{te}}(\hat{f}))$. The so-called optimism of the training error is given by
$$\mathrm{Optimism}(\hat{R}_{\mathrm{te}}(\hat{f})) = \mathbb{E}(\hat{R}_{\mathrm{te}}(\hat{f})) - \mathbb{E}(\hat{R}_{\mathrm{tr}}(\hat{f}))$$
and represents the amount by which the training error (empirical risk) underestimates (hence the term optimism) the test error (generalization error). Indeed, when the function is made more and more complex, the empirical risk gets lower and lower and farther from the true error, as seen in Figure 3. This is an instance of the bias-variance dilemma that happens to be at the heart of methodological, theoretical, practical, computational, and epistemological aspects of statistical machine learning. The result of (3) highlights the reason why (17), the minimizer of the empirical risk, does not possess the predictive power needed, in the sense that it does not generalize well. In our quest for optimal predictive models, we will therefore not rely on the empirical risk alone, but instead will resort to score functions with inherent built-in mechanisms for selecting models that generalize well, i.e., produce lower prediction errors. Practically speaking, if the data $\mathcal{D}_n$ is randomly split $S$ times, so that for each $s$, the randomly shuffled (permuted) version $\mathcal{D}_n^{(s)}$ admits the decomposition $\mathcal{D}_n^{(s)} = \mathcal{D}_{\mathrm{tr}}^{(s)} \cup \mathcal{D}_{\mathrm{te}}^{(s)}$, then the $s$th replication of the test error is given by
$$e_{\mathrm{te}}^{(s)} = \hat{R}_{\mathrm{te}}(\hat{f}^{(\mathcal{D}_{\mathrm{tr}}^{(s)})}) = \frac{1}{|\mathcal{D}_{\mathrm{te}}^{(s)}|} \sum_{i=1}^{n} \mathbb{1}(\mathbf{z}_i^{(s)} \in \mathcal{D}_{\mathrm{te}}^{(s)})\, \mathcal{L}(y_i^{(s)}, \hat{f}^{(\mathcal{D}_{\mathrm{tr}}^{(s)})}(\mathbf{x}_i^{(s)})), \tag{46}$$
where $\hat{f}^{(\mathcal{D}_{\mathrm{tr}}^{(s)})}(\cdot)$ is the instance of $\hat{f}$ obtained using the $s$th random replication of the training set. Clearly, one has $S$ realizations of the test error, and $\{e_{\mathrm{te}}^{(1)}, \ldots, e_{\mathrm{te}}^{(s)}, \ldots, e_{\mathrm{te}}^{(S)}\}$ can be regarded as a sample of size $S$ from the distribution of the true test error. One of the quantities often computed from the $S$ realizations of the test error is the corresponding average test error
$$\mathrm{AVTE}(\hat{f}) = \frac{1}{S} \sum_{s=1}^{S} \hat{R}_{\mathrm{te}}(\hat{f}^{(\mathcal{D}_{\mathrm{tr}}^{(s)})}). \tag{47}$$
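The optimism of the training error in (44)-(47) is easy to observe numerically. The sketch below (toy data and helper names are my own, not the paper's) replicates $S$ random splits, fits a deliberately complex polynomial by least squares on each training set, and compares the average training and test errors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: a cubic signal plus noise.
n = 200
x = rng.uniform(-1, 1, n)
y = x - 0.8 * x**3 + rng.normal(0, 0.3, n)

def fit_poly(x, y, degree):
    """Least squares polynomial fit, playing the role of f-hat."""
    return np.polyfit(x, y, degree)

def mse(coef, x, y):
    """Squared error loss averaged over the given points."""
    return float(np.mean((np.polyval(coef, x) - y) ** 2))

S, tau = 50, 0.7                 # S random splits, 70% of points for training
train_errs, test_errs = [], []
for s in range(S):
    idx = rng.permutation(n)
    tr, te = idx[: int(tau * n)], idx[int(tau * n):]
    coef = fit_poly(x[tr], y[tr], degree=12)     # deliberately complex model
    train_errs.append(mse(coef, x[tr], y[tr]))   # training error, eq. (44)
    test_errs.append(mse(coef, x[te], y[te]))    # replicated test error, eq. (46)

avte = float(np.mean(test_errs))                 # average test error, eq. (47)
optimism = avte - float(np.mean(train_errs))     # empirical optimism estimate
```

With an overfitted model, the average training error sits systematically below the average test error, which is exactly the gap labeled "optimism of training error" in Figure 3.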

It is important to note that the median can also be used in place of the mean. Besides, the replications allow various statistical analyses on the predictive performances of each function space. A typical way to explore empirical interspace model comparison is to generate comparative boxplots of the replicated test errors, which can be done using the stochastic hold-out scheme described in Algorithm 2. Figure 4 depicts the results for the famous Leptograpsus crabs benchmark data set, and Figure 5 does the same for the ionosphere data set, which is another benchmark data set. Both data sets can be obtained from R.

It is important to note that $\hat{f}^{(\mathcal{D}_{\mathrm{tr}}^{(s)})}(\cdot)$ should be internally optimized using its own internal intraspace optimality search criterion (like cross validation). This assumption is made with the finality of making sure that the interspace model comparison operates on the best of each considered model space. Let $\hat{\mathcal{C}}$ be a collection of models, ideally with each from a different function space or a different method of estimation (learning). For instance,
$$\hat{\mathcal{C}} = \{\hat{f}_{\mathrm{LDA}}, \hat{f}_{\mathrm{SVM}}, \hat{f}_{\mathrm{CART}}, \hat{f}_{\mathrm{RF}}, \hat{f}_{\mathrm{GPR}}, \hat{f}_{k\mathrm{NN}}, \hat{f}_{\mathrm{Boost}}, \hat{f}_{\mathrm{Logit}}, \hat{f}_{\mathrm{RDA}}\}.$$

Algorithm 2: Stochastic Hold-Out for Generalization
Input: Training data $\mathcal{D}_n = \{\mathbf{z}_i = (\mathbf{x}_i, y_i),\ i = 1, \ldots, n\}$, where $\mathbf{x}_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$; list of learning machines to be evaluated; sample size $n$; number of random splits $S$; number of learning machines $M$; proportion $\tau \in (1/2, 1)$ of observations in the training set.
Output: Matrix $E = (E_{sm}) = \hat{R}_{\mathrm{te}}(\hat{f}_m^{(s)})$ of test error values for the several learning machines.
for $s = 1$ to $S$ do
  Generate the $s$th random split of the data set $\mathcal{D}_n$ into training set $\mathcal{D}_{\mathrm{tr}}^{(s)}$ and test set $\mathcal{D}_{\mathrm{te}}^{(s)}$, such that $\mathcal{D}_n = \mathcal{D}_{\mathrm{tr}}^{(s)} \cup \mathcal{D}_{\mathrm{te}}^{(s)}$, with $|\mathcal{D}_{\mathrm{tr}}^{(s)}| = \tau n$ and $|\mathcal{D}_{\mathrm{te}}^{(s)}| = (1 - \tau) n$
  for $m = 1$ to $M$ do
    Build and refine the $m$th learning machine $\hat{f}_m^{(\mathcal{D}_{\mathrm{tr}}^{(s)})}(\cdot)$ using $\mathcal{D}_{\mathrm{tr}}^{(s)}$
    Compute predictions $\hat{f}_m^{(\mathcal{D}_{\mathrm{tr}}^{(s)})}(\mathbf{x}_i)$ for $\mathbf{z}_i \in \mathcal{D}_{\mathrm{te}}^{(s)}$
    Compute the test error for the $m$th learning machine:
      $\hat{\varepsilon}_{sm} = \hat{R}_{\mathrm{te}}(\hat{f}_m^{(s)}) = \frac{1}{|\mathcal{D}_{\mathrm{te}}^{(s)}|} \sum_{i=1}^{n} \mathbb{1}(\mathbf{z}_i \in \mathcal{D}_{\mathrm{te}}^{(s)})\, \mathcal{L}(y_i, \hat{f}_m^{(\mathcal{D}_{\mathrm{tr}}^{(s)})}(\mathbf{x}_i))$

Given a data set $\mathcal{D}_n$ and a collection of potential function spaces like $\hat{\mathcal{C}}$, one defines
$$E_{sm} = \hat{R}_{\mathrm{te}}(\hat{f}_m^{(s)}) = \hat{R}_{\mathrm{te}}(\hat{f}_m^{(\mathcal{D}_{\mathrm{tr}}^{(s)})}) = \text{Error of } \hat{f}_m^{(\mathcal{D}_{\mathrm{tr}}^{(s)})}(\cdot) \text{ on } \mathcal{D}_{\mathrm{te}}^{(s)}.$$
Then one proceeds to generate the matrix $E_{\mathrm{te}}$ containing $S$ realized values of the test error for each hypothesis space. For classification, $E_{\mathrm{te}} \in [0, 1]^{S \times M}$, and for regression, $E_{\mathrm{te}} \in \mathbb{R}_+^{S \times M}$. Once $E_{\mathrm{te}}$ is generated, an interspace predictive model comparison is performed.

Figure 4. Predictive performances on the Crabs data. (Boxplots of the replicated test errors $\hat{R}_{\mathrm{te}}(\hat{f}_m)$ for the methods LDA, SVM, CART, RF, GPR, kNN, Boost, Logit, and RDA.)

As a matter of fact, each optimal classifier from a given space $\mathcal{H}$ will typically perform well if the data at hand and the generator from which it came somewhat accord with the properties of the space $\mathcal{H}$. This remark is probably what prompted the famous so-called no free lunch theorem, herein stated informally. (No Free Lunch.) There is no learning method that is universally superior to all other methods on all data sets. In other words, if a learning method is presented with a data set whose inherent patterns violate its assumptions, then that learning method will underperform. Indeed, it is very humbling to see that some of the methods deemed somewhat simple sometimes hugely outperform the most sophisticated ones when compared on the basis of average out-of-sample (test) error.
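Algorithm 2 can be sketched in a few lines. In this illustration (my own toy data and two deliberately simple stand-in learning machines, not the nine machines of Figure 4), each row of the matrix $E$ holds one replication of the test errors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary classification data.
n = 150
X = np.vstack([rng.normal(0, 1, (75, 2)), rng.normal(2, 1, (75, 2))])
y = np.array([0] * 75 + [1] * 75)

def nearest_mean(Xtr, ytr, Xte):
    """Nearest-centroid classifier (a simple stand-in learning machine)."""
    m0, m1 = Xtr[ytr == 0].mean(axis=0), Xtr[ytr == 1].mean(axis=0)
    d0 = np.linalg.norm(Xte - m0, axis=1)
    d1 = np.linalg.norm(Xte - m1, axis=1)
    return (d1 < d0).astype(int)

def one_nn(Xtr, ytr, Xte):
    """1-nearest-neighbor classifier (a second stand-in machine)."""
    preds = [ytr[np.argmin(np.linalg.norm(Xtr - x, axis=1))] for x in Xte]
    return np.array(preds)

machines = [nearest_mean, one_nn]
S, tau = 30, 0.7                       # S random splits, proportion tau in training
E = np.zeros((S, len(machines)))       # matrix E of replicated test errors
for s in range(S):
    idx = rng.permutation(n)
    tr, te = idx[: int(tau * n)], idx[int(tau * n):]
    for m, machine in enumerate(machines):
        preds = machine(X[tr], y[tr], X[te])
        E[s, m] = np.mean(preds != y[te])   # 0/1 test error of machine m on split s

avte = E.mean(axis=0)   # column means: the AVTE of each machine, as in (47)
```

Boxplots of the columns of `E` give exactly the kind of interspace comparison shown in Figures 4 and 5.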
The practical empirical optimal predictive model is given by
$$\hat{f}^{(\mathrm{opt})} = \underset{\hat{f} \in \hat{\mathcal{C}}}{\mathrm{argmin}}\left\{ \mathrm{AVTE}(\hat{f}) \right\}.$$

Figure 5. Predictive performances on the ionosphere data. (Boxplots of the replicated test errors $\hat{R}_{\mathrm{te}}(\hat{f}_m)$ for the same nine methods as in Figure 4.)

Discussion and Conclusion
Modern data science and artificial intelligence greatly value the creation and construction of statistical learning machines endowed with an inherent capability to predict accurately and precisely. In this paper, we have explored the niceties and subtleties of such a goal and have demonstrated that it requires a hefty dose of care and caution and definitely calls upon a solid theoretical understanding of learnability along with a lot of artlike practical common sense. Anyone who has done practical data science knows beyond a shadow of a doubt that data has a mind of its own, and tends to resist the temptation to seek a holy grail or a unified field, or any paradigm that works perfectly all the time. Practical data science almost always forces the practitioner to solve the problem at hand as thoroughly and as idiosyncratically as possible rather than seek a one-size-fits-all method that works well everywhere. At the heart of what we suggested throughout this paper is the theoretical result known as the no free lunch theorem, which reveals, both implicitly and explicitly, that the theoretical bounds studied extensively by experts do not really help much when it comes to practically selecting the optimal predictive model. Optimal predictive modelling is, and may always be, both a science and an art, requiring both mathematical and statistical rigor along with practical computational common sense.

References
[1] Akaike H. Information theory and an extension of the maximum likelihood principle. In: Selected Papers of Hirotugu Akaike. Springer, New York, NY; 1973:199–213. MR0483125
[2] Barbieri M and Berger JO. Optimal predictive model selection, Ann. Statist., 32:870–897, 2004. MR2065192
[3] Breiman L. Bagging predictors, Machine Learning, 24:123–140, 1996.
[4] Breiman L. Random forests, Machine Learning, 45:5–32, 2001. MR3874153
[5] Clarke B, Fokoué E, Zhang H. Principles and Theory for Data Mining and Machine Learning, first edition, Springer Texts in Statistics, Springer-Verlag, 2009. MR2839778
[6] Csató L, Fokoué E, Opper M, Schottky B, Winther O. Efficient approaches to Gaussian process classification. In: Leen TK, Solla SA, Müller K-R, eds. Advances in Neural Information Processing Systems, number 12. MIT Press; 2000.
[7] Devroye L, Györfi L, Lugosi G. A Probabilistic Theory of Pattern Recognition, Stochastic Modelling and Applied Probability, Springer, New York, 1997.
[8] Domingos P. A unified bias-variance decomposition for zero-one and squared loss, AAAI/IAAI, AAAI Press, 564–569, 2000.
[9] Fokoué E. Estimation of atom prevalence for optimal prediction. In: Prediction and Discovery, Contemporary Mathematics, vol. 443. American Mathematical Society; 2007:103–129. MR2433288
[10] Hoerl A and Kennard R. Ridge regression: biased estimation for non-orthogonal problems, Technometrics, 12:55–67, 1970.
[11] Hoeting JA, Madigan D, Raftery AE, Volinsky CT. Bayesian model averaging: A tutorial, Statist. Sci., 14(4):382–417, 1999. MR1765176
[12] Kohavi R and Wolpert DH. Bias plus variance decomposition for zero-one loss functions. In: Machine Learning, Proceedings of the Thirteenth International Conference (ICML '96), Bari, Italy, July 3–6, 1996. 1996:275–283.
[13] Rosenblatt F. The perceptron: A probabilistic model for information storage and organization in the brain, Psychological Review, 65:386–408, 1958. MR0122606
[14] Schapire RE and Freund Y. Boosting: Foundations and Algorithms, The MIT Press, 2012. MR2920188
[15] Schwarz G. Estimating the dimension of a model, Ann. Statist., 6:461–464, 1978. MR0468014
[16] Stone M. An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion, J. Roy. Statist. Soc. Ser. B (Methodological), 39(1):44–47, 1977. MR501454
[17] Stone M. Cross-validatory choice and assessment of statistical predictions, J. Roy. Statist. Soc. Ser. B (Methodological), 36:111–147, 1974. MR356377
[18] Tibshirani R. Regression shrinkage and selection via the lasso, J. Roy. Statist. Soc. Ser. B (Methodological), 58(1):267–288, 1996. MR1379242
[19] Vapnik VN. The Nature of Statistical Learning Theory, Springer, 2000. MR1719582

Credits
All figures are courtesy of the author.
Author photo is courtesy of Rick Scoggins.
