Model Selection for Optimal Prediction in Statistical Machine Learning

Ernest Fokoué

Introduction

At the core of all our modern-day advances in artificial intelligence is the emerging field of statistical machine learning (SML). From a very general perspective, SML can be thought of as a field of mathematical sciences that combines mathematics, probability, statistics, and computer science with several ideas from cognitive neuroscience and psychology to inspire the creation, invention, and discovery of abstract models that attempt to learn and extract patterns from the data. One could think of SML as a field of science dedicated to building models endowed with the ability to learn from the data in ways similar to the ways humans learn, with the ultimate goal of understanding and then mastering our complex world well enough to predict its unfolding as accurately as possible. One of the earliest applications of statistical machine learning centered around the now ubiquitous MNIST benchmark task, which consists of building statistical models (also

Ernest Fokoué is a professor of statistics at Rochester Institute of Technology. His email address is [email protected].
Communicated by Notices Associate Editor Emilie Purvine.
For permission to reprint this article, please contact: [email protected].
DOI: https://doi.org/10.1090/noti2014

FEBRUARY 2020 NOTICES OF THE AMERICAN MATHEMATICAL SOCIETY 155

known as learning machines) that automatically learn and accurately recognize handwritten digits from the United States Postal Service (USPS). A typical deployment of an artificial intelligence solution to a real-life problem would have several components touching several aspects of the taxonomy of statistical machine learning. For instance, when artificial intelligence is used for the task of automated sorting of USPS letters, at least one component of the whole system deals with recognizing the recipient of a given letter as accurately as (or even better than) a human operator. This would mean that the statistical machine learning model can ideally recognize handwritten digits regardless of the various ways in which those digits are written. How does one go about formulating, defining, designing, building, refining, and ultimately deploying such statistical machine learning models for the intended use? In the case of the MNIST data, for instance, the digits to be potentially recognized are captured as a matrix, which is then transformed and represented as a high-dimensional vector fed as the input to a statistical learning model, along with the true label of the digit at hand. Conceptually, the task of building the statistical learning machine is mathematically formulated as the construction of a function mapping the elements of the input space (the space in which the digits are represented) to the output space (the space of the true labels for the digits). Over the years, different methods have been created and developed by statisticians and computer scientists from all around the world to help build statistical learning machines for a wide variety of problems like those mentioned earlier. F. Rosenblatt's [13] groundbreaking and thought-provoking publication of the seminal paper featuring the so-called Perceptron ushered in the era of brain-inspired statistical and computational learning, and can rightly be thought of as the catalyst of the field of artificial neural networks, and even arguably the ancestor of our modern-day hot topic of deep neural networks. A couple of decades after Rosenblatt's seminal paper, the Multilayer Perceptron (MLP) was introduced as one of the solutions to the limitations of the Perceptron. MLPs extended, strengthened, and empowered artificial neural networks by allowing potentially many hidden layers, a tremendous improvement over the Perceptron that brought a much needed new spring to artificial intelligence. MLPs turned out to be extremely successful, namely, on tasks like the MNIST USPS digit recognition task mentioned earlier, but also on several other tasks including credit scoring, stock price prediction, automatic translation, and medical diagnosis, just to name a few. MLPs triggered a veritable scientific revolution, inspiring the flourishing of creativity among researchers, many of whom invented or discovered entirely new learning methods and paradigms, or revived or adapted existing ones.

Theoretical Foundations

It is typical in statistical machine learning that a given problem will be solved in a wide variety of different ways. As a result, it is a central element in SML, both within each paradigm and among paradigms, to come up with good criteria for deciding and determining which learning machine or model is the best for the given task at hand. To better explain this quintessential task of model selection, we consider a typical statistical machine learning setting, with two sets X and Y, along with their Cartesian product Z ≡ X × Y. We further define Z^n ≡ Z × Z × ⋯ × Z to be the n-fold Cartesian product of Z. We assume that Z is equipped with a probability measure ψ, albeit assumed unknown throughout this paper. Let Z ∈ Z^n, with Z = ((x_1, y_1), (x_2, y_2), …, (x_n, y_n)), denote a realization of a random sample of n examples, where each example z_i = (x_i, y_i) is independently drawn according to the above probability measure ψ on the product space Z ≡ X × Y. For practical reasons, and in keeping with the data science and artificial intelligence lexicon, we shall quite often refer to the random sample Z as the data set, and will use the compact and comprehensive notation

  D_n = {(x_i, y_i) ∼_iid p_{xy}(x, y), i = 1, …, n},   (1)

where all pairs (x_i, y_i) ∈ X × Y, and p_{xy}(x, y) is the probability density function associated with the probability measure ψ on Z. Given a random sample Z = ((x_1, y_1), (x_2, y_2), …, (x_n, y_n)), one of the most pervading goals in both theoretical and applied statistical machine learning is to find the function f⋆ : X → Y that best captures the dependencies between the x_i's and the y_i's in such a way that, given a new random (unseen) observation z_new = (x_new, y_new) ∼ p_{xy}(x, y) with z_new ∉ D_n, the image f⋆(x_new) of x_new ∼ p_x(x) provides a prediction of y_new that is as accurate and precise as possible, in the sense of yielding the smallest possible discrepancy between y_new and f⋆(x_new).

This setting, where one seeks to build functions of the type f : X → Y, is the foundational setting of machine learning in general and statistical machine learning in particular. Throughout this paper, we shall refer to X as the input space and to Y as the output space. For simplicity, we shall assume that X ⊆ ℝ^p for our methodological and theoretical derivations and explanations, but will allow X to be more general in practical demonstrations and examples. We will consider both regression learning, corresponding to Y = ℝ, and multicategory classification learning (pattern recognition), corresponding to output spaces of the form Y = {1, 2, …, G}, where G is the number of categories.

Definition 1. A loss function ℒ(⋅, ⋅) is a nonnegative bivariate function ℒ : Y × Y → ℝ₊, such that given a, b ∈ Y, the value of ℒ(a, b) measures the discrepancy between a

and b, or the deviation of a from b, or the loss incurred from using b in place of a. For instance, ℒ(y, f(x)) = ℒ(f(x), y) will be used throughout this paper to quantify the discrepancy between y and f(x) with the finality of choosing the best f, the optimal f, the f that minimizes expected discrepancy over the entire Z. The loss function plays a central role in statistical learning theory as it allows an unambiguous measure and quantification of optimality.

Definition 2. The theoretical risk or generalization error or true error of any function f ∈ Y^X is given by

  R(f) = E[ℒ(Y, f(X))] = ∫_{X×Y} ℒ(y, f(x)) p_{xy}(x, y) dx dy   (2)

and can be interpreted as the expected discrepancy between f(X) and Y, and indeed as a measure of the predictive strength of f. Ideally, one seeks to find the minimizer f⋆ of R(f) over all measurable functions f ∈ Y^X, specifically,

  f⋆ = arg inf_{f ∈ Y^X} {R(f)} = arg inf_{f ∈ Y^X} {E[ℒ(Y, f(X))]},   (3)

whose corresponding theoretical risk R⋆ serves as the gold standard and is given by

  R⋆ = R(f⋆) = inf_{f ∈ Y^X} {R(f)}.   (4)

If we reconsider our overarching goal stated earlier, then the smallest risk (expected loss) in the prediction of Y_new given X_new is achieved with the f⋆ of (3), and that theoretical optimal risk is the R⋆ of (4), namely, E[ℒ(Y_new, f⋆(X_new))] = R⋆. The theoretical optimal predictive model is therefore f⋆, although we must recognize that it is of no practical use as it cannot be computed. For instance, when we consider both classification and regression, the theoretical optimal predictive model f⋆ can be elicited and derived for some well-known foundational loss functions. For classification, an intuitive and indeed widely studied loss function is the so-called zero-one (0/1) loss function, defined simply with the indicator function as follows:

  ℒ(y, f(x)) = 𝟙(y ≠ f(x)) = { 0 if y = f(x);  1 if y ≠ f(x). }   (5)

When the zero-one loss function is used in classification, it can be shown quite easily that R(f), the corresponding true risk (also known as theoretical risk or generalization error or true error), coincides with the misclassification probability Prob_{(X,Y)∼ψ}[Y ≠ f(X)], namely,

  R(f) = ∫_{X×Y} ℒ(y, f(x)) p_{xy}(x, y) dx dy = E[𝟙(Y ≠ f(X))] = Prob_{(X,Y)∼ψ}[Y ≠ f(X)].   (6)

This intuitive result is of paramount importance for practical aspects of statistical machine learning, because it provides an understandable frame of reference for the interpretation of the predictive performance of learning machines. Indeed, the true error R(f) of a classifier f therefore defines the probability that f misclassifies any arbitrary observation randomly drawn from the population of interest according to the distribution ψ. R(f) can also be interpreted as the expected disagreement between the classifier f(X) and the true label Y.

Definition 3 (The Bayes classifier). Consider a pattern x from the input space and a class label y. Let p(x|y) denote the class conditional density of x in class y, and let Prob[Y = y] denote the prior probability of class membership. The posterior probability of class membership is

  Prob[Y = y | x] = Prob[Y = y] p(x|y) / p(x).   (7)

Given x ∈ X to be classified, the Bayes classification strategy consists of assigning x to the class with maximum posterior probability. With h : X → Y denoting the Bayes classifier, we have, ∀x ∈ X,

  h(x) = argmax_{c ∈ Y} {Prob(Y = c | x)}.   (8)

Theorem 1. The minimizer of the zero-one risk over all possible classifiers is the Bayes classifier h defined in (8):

  f⋆ = arg inf_f {R(f)} = arg inf_f {E[𝟙(Y ≠ f(X))]} = arg inf_f {Prob_{(X,Y)∼ψ}[Y ≠ f(X)]} = h.   (9)

Therefore, the Bayes classifier h defined in (8) is the universal best classifier, such that ∀x ∈ X,

  f⋆(x) = h(x) = argmax_{c ∈ Y} {Prob(Y = c | x)} = argmax_{c ∈ Y} { Prob[Y = c] p(x|c) / p(x) }.   (10)

The risk R⋆ corresponding to f⋆ is the smallest possible error that any classifier can achieve, i.e.,

  R⋆ = R(f⋆) = R(h) = inf_f {R(f)}.

The fact that the Bayes classifier achieves the universal infimum error over all measurable classifiers is a fundamental result in pattern recognition and statistical learning. The probability theory for pattern recognition is made up of multiple results featuring learning machines whose performance is compared to the performance of the Bayes classifier [7], [19]. Although this result is of more theoretical than practical importance, it turns out to provide a framework of reference for building more practical classification learning machines. Although we do not know the true density p_{xy}(⋅, ⋅), we can assume a wide variety of possible densities in special cases, and then attempt the
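To make Definition 3 and Theorem 1 concrete, here is a minimal sketch (not from the article) that computes the Bayes classifier of (10) and numerically approximates its risk R⋆ for a made-up two-class problem with one-dimensional Gaussian class conditional densities; the priors, means, and common variance are invented for illustration only.

```python
import math

# Hypothetical two-class problem: equal priors, unit-variance Gaussian
# class-conditional densities centered at -1 and +1.
PRIOR = {0: 0.5, 1: 0.5}
MEAN = {0: -1.0, 1: +1.0}
SIGMA = 1.0

def density(x, c):
    """Class-conditional density p(x | y = c)."""
    z = (x - MEAN[c]) / SIGMA
    return math.exp(-0.5 * z * z) / (SIGMA * math.sqrt(2.0 * math.pi))

def bayes_classifier(x):
    """h(x) = argmax_c Prob[Y = c] p(x | c), as in equation (10)."""
    return max(PRIOR, key=lambda c: PRIOR[c] * density(x, c))

def bayes_risk(steps=16000, lo=-8.0, hi=8.0):
    """R* = Prob[Y != h(X)], approximated by midpoint-rule integration."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        label = bayes_classifier(x)
        # probability mass of the classes NOT chosen at this x
        total += sum(PRIOR[c] * density(x, c) for c in PRIOR if c != label) * h
    return total

print(bayes_classifier(-0.3))  # prints 0: negative x favors the class at -1
print(bayes_risk())
```

With equal priors and equal variances, the decision boundary sits at x = 0 and the numerical risk agrees with the closed-form value Φ(−1) ≈ 0.159, illustrating that no classifier can beat h on this population.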

construction of the Bayes classifier under those distributional assumptions. It is found in practice that when the assumptions are met (or almost met), the ensuing learning machine tends to exhibit superior predictive performance. For instance, under the assumption of multivariate Gaussian class conditional densities with equal covariance matrices in classification, one can derive the population Bayes Gaussian linear discriminant analysis classifier, whose estimation from the corresponding data yields the best predictive performance over all other learning machines. It bears repeating that this superior performance presupposes that the assumed multivariate Gaussianity is plausible. Every single aspect of optimal predictive model selection we have mentioned so far is strongly tied to the distributional characteristics of the space under consideration. In the case of superior predictive performance inherited from the correct assumption of the generator of the data, it must be said that practical data sets often arise from rather complex distributions that are often far too difficult to estimate. One could even consider estimating the density and then estimating the corresponding classifier. Unfortunately, the task of probability density estimation in complex high-dimensional spaces turns out to be a treacherous task, often more complex (statistically and computationally) than the classification task one would be intending to use density estimation for. Some researchers have resorted to semiparametric solutions like the use of mixtures of Gaussians (or mixtures of other parametric densities) to model their class conditional densities, and have done so with great success, although the analysis of mixtures is fraught with challenges, to the point that having to deal with those along with the main task of classification may render their use unattractive and not viable in this context. For this reason, practitioners and methodological and theoretical researchers tend to focus on more realizable goals than the hunt for the universal best learning machine. The approach consists of assuming that the function underlying the data (the decision boundary in the context of classification) is a member of a class of functions with some specific (sometimes desirable) properties. Of course, the very fact of choosing a specific function space automatically comes at the potential price of incurring an approximation error. In the example given earlier, assuming Gaussian class conditional probability densities with equal covariance matrices led to the derivation of a classifier belonging to the space of linear learning machines. In this case, the ensuing function space was implicit in the distributional choice. We will see later that the choice of the function space is often quite explicit and typically motivated by experience or pure convenience. Before we delve into the search for optimal predictive models in specific function spaces, it is useful to point out that fundamental statistical learning results exist in regression that are similar to the ones presented earlier in the context of classification learning.

Theorem 2. Consider functions f : ℝ^p → ℝ and the squared theoretical risk functional

  R(f) = E[(Y − f(X))²] = ∫_{X×Y} (y − f(x))² p_{xy}(x, y) dx dy.   (11)

Then the best function f⋆ = arg inf_f {R(f)} is given by the conditional expectation of Y given X; i.e., ∀x ∈ X,

  f⋆(x) = E[Y | X = x] = ∫_Y y p(y|x) dy.   (12)

Theorem 2 provides the basic foundation of all regression analysis under the squared error loss. Clearly, the conditional expectation of Y given X = x given in equation (12) is the theoretical optimal predictive function in regression, with a corresponding theoretical risk that is the baseline.

Theorem 3. For every f : X → Y,

  R(f) = ∫_X (f(x) − f⋆(x))² dψ(x) + σ⋆²,   where
  σ⋆² = R⋆ = R(f⋆) = ∫_{X×Y} (y − E[Y|x])² p_{xy}(x, y) dx dy.   (13)

Since the conditional density p(y|x) of Y given x, which is the main ingredient of f⋆, is not known in practice, the optimum remains a theoretical one and serves as a gold standard and reference when the squared error loss is used, as is often the case. In an effort to realize an estimator of the optimum with the data, one can consider the traditional nonparametric regression machinery. In one dimension, nonparametric regression works very well, but it unfortunately suffers from the curse of dimensionality. Just as with classification learning, one could relax the generality of p(y|x) by assuming, for instance, a specific distribution. An example of this is the ubiquitous assumption of Gaussianity, by which p(y|x) = φ(y; h(x), σ²), where h ∈ H is a function with certain properties, taken from a function space H. The function space H could be anything from the space of linear functions in the p-dimensional Euclidean space ℝ^p to a space of certain nonlinear functions to reproducing kernel Hilbert spaces (RKHS) anchored by a suitably chosen kernel (similarity measure). We will seek to solve the more reasonable problem of choosing from a function space H ⊂ Y^X the function f⋄ ∈ H that best estimates the dependencies between x and y. As stated earlier, trying to find f⋆ is hopeless. One needs to select a function space H ⊂ Y^X, then choose the best function f⋄_H from H, i.e.,

  f⋄_H = arg inf_{f ∈ H} {E[ℒ(Y, f(X))]},   (14)

so that

  R(f⋄_H) = R⋄_H = inf_{f ∈ H} R(f).

For notational simplicity, we will simply use f⋄ and R⋄ in place of f⋄_H and R⋄_H, respectively. For the regression learning task, for instance, one could assume that the input space X is a closed and bounded interval of ℝ, i.e., X = [a, b], and then consider estimating the dependencies between x and y from within the space H of all bounded functions on X = [a, b], i.e.,

  H = {f : X → ℝ | ∃B ≥ 0 such that |f(x)| ≤ B}.

One could further make the functions of the above H continuous, so that the space to search becomes

  H = {f : [a, b] → ℝ | f is continuous} = C([a, b]),

which is the well-known space of all continuous functions on a closed and bounded interval [a, b]. This is indeed a very important function space. In fact, polynomial regression consists of searching our learning machine from a function space that is a subspace of C([a, b]). In other words, in polynomial regression learning, we are searching the space

  𝒫([a, b]) = {f ∈ C([a, b]) | f is a polynomial in ℝ}.

Interestingly, Weierstrass did prove that 𝒫([a, b]) is dense in C([a, b]). One considers the space of all polynomials of some degree p, i.e.,

  H = 𝒫_p([a, b]) = {f ∈ C([a, b]) | ∃θ ∈ ℝ^{p+1} : f(x) = Σ_{j=0}^{p} θ_j xʲ, ∀x ∈ [a, b]}.

Similarly, for the classification learning task of binary pattern recognition with Y = {−1, +1}, one may consider finding the best linear separating hyperplane, so that the corresponding function space is

  H = {f : X → Y | ∃w₀ ∈ ℝ, w ∈ ℝ^p : ∀x ∈ X, f(x) = sign(w⊤x + w₀)},   (15)

or even a more complex function space capable of modelling and representing nonlinear decision boundaries like

  H(Φ) = {f : X → Y | ∃w₀ ∈ ℝ, w ∈ F : ∀x ∈ X, f(x) = sign(⟨w, Φ(x)⟩ + w₀)},   (16)

where Φ : X → F is a mapping that projects each input x up to a high-dimensional feature space F, thereby allowing the corresponding machine the capacity to capture nonlinear decision boundaries.

Empirical Foundations

Throughout the previous section, we explored some basic aspects of the theoretical foundations of optimal prediction model selection. It turns out that f⋄ ∈ H, just like f⋆, cannot be computed because p_{xy}(x, y) is never known in practice. What does happen in practice is that, given the data set D_n along with the chosen loss function ℒ(⋅, ⋅), the empirical risk R̂(f) is defined as an estimator of the theoretical risk R(f). From a practical perspective, given a data set D_n, empirical risk minimization is used in place of theoretical risk minimization to construct estimators of f⋆, namely,

  f̂ = f̂_{H,n} = f̂_n = argmin_{f ∈ H} {R̂_n(f)} = argmin_{f ∈ H} { (1/n) Σ_{i=1}^{n} ℒ(y_i, f(x_i)) }.   (17)

Although the zero-one loss function allows us to theoretically define what constitutes the universal best optimal classifier, it cannot be used in any given function space H to construct an estimated learning machine, because its use inherently implies an untenable combinatorial exploration. Fortunately, many other loss functions have been typically used in the search for optimal predictive models in statistical machine learning. With f : X → {−1, +1}, and h ∈ H such that f(x) = sign(h(x)), some frequently used loss functions for binary classification include: (a) Zero-one (0/1) loss: ℒ(y, f(x)) = 𝟙(yh(x) < 0); (b) Hinge loss: ℒ(y, f(x)) = max(1 − yh(x), 0); (c) Logistic loss: ℒ(y, f(x)) = log(1 + exp(−yh(x))); and (d) Exponential loss: ℒ(y, f(x)) = exp(−yh(x)). With f : X → ℝ and f ∈ H, some loss functions for regression include: (a) ℒ₁ loss: ℒ(y, f(x)) = |y − f(x)|; (b) ℒ₂ loss: ℒ(y, f(x)) = |y − f(x)|²; (c) ε-insensitive ℒ₁ loss: ℒ(y, f(x)) = max(|y − f(x)| − ε, 0); and (d) ε-insensitive ℒ₂ loss: ℒ(y, f(x)) = max(|y − f(x)|² − ε, 0). Other loss functions exist.

Although the empirical risk minimization principle provides an effective practical framework for learning patterns underlying the data, the estimator f̂_{H,n} derived from it must be handled with great care and caution for a wide variety of reasons, which we now make clear. With the definitions of f⋆, f⋄, and now f̂_{H,n} in hand, a natural and almost quintessential yet somewhat audacious question would be to assess the difference between f̂_{H,n} and f⋆, maybe via some suitably defined norm, say ‖f̂_{H,n} − f⋆‖, maybe using probabilistic measures like Pr[‖f̂_{H,n} − f⋆‖] or even E[‖f̂_{H,n} − f⋆‖], though it might not be trivial at all how to properly define such a norm, let alone the corresponding quantities. A difference like
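The empirical risk (17) is straightforward to compute once a loss is fixed. A small sketch, using a toy sample and a fixed (not fitted) scorer h, both invented here purely for illustration, evaluates the four binary classification losses listed above:

```python
import math

# Toy binary sample (x, y) with labels in {-1, +1}, invented for the sketch.
data = [(-2.0, -1), (-1.0, -1), (-0.5, +1), (1.0, +1), (2.0, +1)]

def h(x):
    """Hypothetical fixed scorer; the classifier is f(x) = sign(h(x))."""
    return x

losses = {
    "zero_one":    lambda y, s: 1.0 if y * s < 0 else 0.0,
    "hinge":       lambda y, s: max(1.0 - y * s, 0.0),
    "logistic":    lambda y, s: math.log(1.0 + math.exp(-y * s)),
    "exponential": lambda y, s: math.exp(-y * s),
}

def empirical_risk(loss):
    """Equation (17): average loss over the sample for the fixed scorer h."""
    return sum(loss(y, h(x)) for x, y in data) / len(data)

for name, loss in losses.items():
    print(name, round(empirical_risk(loss), 4))
```

Only the point (−0.5, +1) is misclassified, so the zero-one empirical risk is 1/5 = 0.2; the surrogate losses penalize the same point more smoothly, which is precisely what makes them tractable for optimization.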

‖f̂_{H,n} − f⋄_H‖_H might be easier, although itself neither easy nor even practically realizable. The typical approach is to deal with the utility of the function, like R(f), rather than the function itself. Now, the relationship between R(f̂_n) and the other theoretical risks is captured by the following cascade of inequalities, namely,

  R(f⋆) ≤ R(f⋄) ≤ R(f̂_{H,n}).   (18)

The true risk R(f̂_{H,n}) of the realized estimator f̂_{H,n} is clearly and unsurprisingly the largest of the three. Since R⋆ is unrealizable in practice, the natural goal should at least be: Out of all the functions in H generated using the data D_n, choose the one that best imitates f⋆, which means choose f̂_{H,n} ∈ H such that E[R(f̂_{H,n})] − R(f⋆) is smallest. If one could directly (or even indirectly) construct f̂⁽ᵒᵖᵗ⁾_{H,n} ∈ H such that

  f̂⁽ᵒᵖᵗ⁾_{H,n} = argmin_{f̂_{H,n} ∈ H} {E[R(f̂_{H,n})] − R(f⋆)},

then f̂⁽ᵒᵖᵗ⁾_{H,n} would be the optimal predictive model. Unfortunately, such a function cannot be directly constructed in practice because its objective function is purely theoretical. The so-called excess risk, E[R(f̂_{H,n}) − R⋆], defined as the expected value of the difference between the true risk R(f̂_n) associated with f̂_n and the overall minimum risk R⋆, can be decomposed to explore in greater detail the source of error in the function estimation process:

  E[R(f̂_n) − R⋆] = E[R(f̂_n) − R(f⋄)] + E[R(f⋄) − R⋆],   (19)

where the first term is the estimation error and the second term is the approximation error. Making the excess risk small is tricky because of the following dilemma: If the approximation error is made small, typically by making the function space H larger and more complex so that the members of H approximate f⋆ very well, then the corresponding estimation error tends to get undesirably larger. Many authors have written extensively on methods for achieving desirable trade-offs with favorable predictive benefits. The empirical risk R̂_n(f̂_n) of f̂_n can be made arbitrarily small by making H very complex, leading to a phenomenon known as overfitting. It must be emphasized that such a function has very little to do with being optimally predictive, because the theoretical (true) risk R(f̂_n) of such an f̂_n is undesirably large. Indeed, when it comes to optimal prediction, it is crucial for the estimator f̂_{H,n} to have an empirical risk R̂_n(f̂_n) that is as close as possible to the true risk R(f̂_n). Now, it is well known among practitioners that almost all statistical machine learning problems are inherently inverse problems, in the sense that learning methods seek to optimally estimate an unknown generating function using empirical observations assumed to be generated by it. As a result, statistical machine learning problems are inherently ill-posed, in the sense that they typically violate at least one of Hadamard's three well-posedness conditions. For clarity, according to Hadamard a problem is well-posed if it fulfills the following three conditions: (a) a solution exists; (b) the solution is unique; and (c) the solution is stable, i.e., does not change drastically under small perturbations. For many machine learning problems, the first condition of well-posedness, namely, existence, is fulfilled. However, the solution is either not unique or not stable. With large p and small n for instance, not only is there a multiplicity of solutions but also the instability thereof, due to the singularities resulting from the fact that n ⋘ p. Typically, the regularization framework is used to isolate a feasible and optimal (in some sense) solution. Tikhonov's regularization is the one most commonly resorted to and typically amounts to a Lagrangian formulation of a constrained version of the initial problem, the constraints being the objects used to isolate a unique and stable solution.

Effect of Model Complexity

To gain deeper insights into the properties and challenges inherent in optimal predictive model selection, we now consider a practical exploration of univariate regression learning using the polynomial function space, namely,

  H = {f ∈ C([a, b]) | ∃θ₀, θ₁, …, θ_p ∈ ℝ : f(x) = Σ_{j=0}^{p} θ_j xʲ, ∀x ∈ [a, b]}.

Having chosen our function space H along with the squared error loss, our statistical learning task consists of finding the minimizer of the empirical counterpart of the average squared errors (ASE), i.e.,

  f̂_{H,n} = f̂_n = f̂ = argmin_{f ∈ H} {ASE(f)} = argmin_{f ∈ H} { (1/n) Σ_{i=1}^{n} (y_i − f(x_i; θ))² }.   (20)

We are seeking the best member of the function space H based on the given data set D_n. Since we specifically chose the function space of all univariate real-valued polynomials of degree at most p in some interval [a, b], finding f̂ comes down to estimating the coefficients of the polynomial using the data. Using the n × (p + 1) Vandermonde matrix X = (x_iʲ), i = 1, …, n, j = 0, …, p, and Y ∈ ℝⁿ, the

solution to problem (20) is given by

  θ̂ = argmin_{θ ∈ ℝ^{p+1}} { (1/n) Σ_{i=1}^{n} (Y_i − Σ_{j=0}^{p} θ_j x_iʲ)² }
     = argmin_{θ ∈ Θ} {(Y − Xθ)⊤(Y − Xθ)}
     = (X⊤X)⁻¹X⊤Y.   (21)

The estimator in equation (21) has many quintessential layers that are crucial to the understanding of optimal predictive model selection. It is therefore important to dissect and unpack those key aspects of statistical learning.

(a) Stochastic nature of the estimator. First and foremost, the estimate θ̂ = (θ̂₀, θ̂₁, …, θ̂_p)⊤ of θ = (θ₀, θ₁, …, θ_p)⊤ is a random variable, and as a result the estimate f̂(x) = f̂(x; θ̂) = θ̂₀ + θ̂₁x + θ̂₂x² + ⋯ + θ̂_p x^p of f⋆(x) is also a random variable. We therefore have to be mindful, whenever f̂ is used, that it is inherently a random entity whose handling is best done with the powerful machineries of probability and statistics.

(b) Bias and variance. Since f̂(x) is a random variable, we must compute important aspects like its bias B[f̂(x)] = E[f̂(x)] − f⋆(x), which measures how far our chosen class of models is from the true generator of the data, and its variance V[f̂(x)] = E[(f̂(x) − E[f̂(x)])²], which, as the name says, tells us relatively how stable the constructed estimator is.

(c) Model complexity and temptation to overfit. Since our goal expressed through the objective function is to find the member of the class H that minimizes the empirical risk, it is very tempting at first to use the data at hand to build the f̂ that makes ASE(f̂) the smallest. For instance, the higher the value of p, the smaller ASE(f̂(⋅)) will get. In fact, in the most extreme of scenarios, one could simply make ASE(f̂(⋅)) = 0 by specifying f̂(x_i) = y_i, ∀i = 1, …, n. In a sense, we have a dilemma: If we make f̂ complex (large p), we make the bias small, but the variance is increased. If we make f̂ simple (small p), we make the bias large, but the variance is decreased. In this case, the degree p of the polynomial represents the complexity of the corresponding model. In the end, we will have to come up with various criteria for estimating the optimal complexity, in the sense of the one that leads to low prediction error.

To help gain deeper insights into this fundamental statistical machine learning phenomenon, let's consider the synthetic (artificial) task of learning a univariate polynomial regression from the data. We simulate the data using the function f⋆(x) = −x + √2 sin(π^{3/2} x²) for x ∈ [−1, +1], with a noise variance σ² = 0.32.

Figure 1. Effect of complexity on estimated function.

Figure 1 helps us gain insights into the basics of the bias-variance trade-off. The polynomial of degree 1, which happens to be the model with lowest nonzero complexity, performs poorly, as does the perfect memorizer, whose complexity is virtually infinite since it simply connects all the points. The solid line model does a great job learning the underlying function. The low complexity models attempt to avoid a large estimation variance but then pay a price in the form of an increased bias, resulting in a large prediction error. The high complexity models attempt to fit too well, literally memorizing the data in the extreme case, and thereby learning both the noise and the signal, resulting in a large variance as the price paid for low bias, ultimately yielding another high prediction error. The optimal fit depicted by the solid line model is achieved by settling for a trade-off between bias and variance. The task dedicated to determining that optimal complexity, which results in the optimal predictive performance, occupies a central place in statistical machine learning and will be further discussed throughout this paper.

The phenomenon of bias-variance trade-off is of fundamental importance and can be further explained in the context of regression learning by the so-called bias-variance decomposition of the theoretical risk of f̂(⋅) under the squared error loss. Let's consider the data set D_n. Let's also assume that Y_i = f⋆(x_i) + ε_i, where the ε_i's are i.i.d. from some distribution with mean(ε) = 0 and variance(ε) = σ². Let f̂ be our estimator of f⋆ built using the random sample provided. Let x ∈ X. The pointwise bias-variance decomposition of the expected squared error is given by

  R(f̂) = E[(Y − f̂(x))²] = σ² + Bias²(f̂(x)) + Variance(f̂(x)),

where σ² = Variance(ε) is the variance of the noise term but essentially represents the irreducible learning error, that is, the error inherent in the structure of the population, one that cannot be changed by any learning machine. It

FEBRUARY 2020 NOTICES OF THE AMERICAN MATHEMATICAL SOCIETY 161 is easy to verify that this is the smallest possible error, i.e., High Bias & Low Variance (Underfitting) Low Bias & High Variance (Overfitting) 푅⋆ = 푅(푓⋆) = 피[(푌 − 푓⋆(퐱))2] = 횟횊횛횒횊횗회횎(휀) = 휎2. In- terestingly, the bias-variance phenomenon depicted in Fig- ure 3 and Figure 1 in the context of regression learning is also present in classification learning. A detailed account of the same type of decomposition for the 0/1 loss used

in classification can be found in [12] and [8]. The opti- Prediction Error(f) Test Error(f) mal decision boundary seen in Figure 2 is obtained using cross validation on the 푘-Nearest Neighbors learning ma- chine (22) for various values of 푘. Clearly, 푘 does indeed Optimal complexity (Min Pred Error) control the complexity of the underlying model, namely, Training Error(f) Optimism of training error the decision boundary. Although the decision boundary Complexity(f) in this case cannot be explicitly written or learned as the optimum of some explicit objective function, one can still Figure 3. Bias-variance trade-off and model complexity. use cross validation to determine the optimal value of 푘 (optimal neighborhood size). This tremendous flexibility Elements of Model Identification of the cross validation principle is certainly one of its great- Once a specific function space is chosen for our learning est strengths, which makes it very appealing and widely ap- task, like we did earlier with our choice of the space of uni- plicable in statistical machine learning. For classification, variate real-valued polynomials, it is not enough to know Y = {1, … , 퐺}, the highest polynomial degree for our particular regression learning task. Indeed, we also need to know which of the 푛 1 푓ˆ(횔홽홽)(퐱) = 횊횛횐횖횊횡{ ∑ ퟙ(푦 = 푔)ퟙ(퐱 ∈ 풱 (퐱))}. (22) coefficients are nonzero. In other words, we need aclear 푘 푖 푖 푘 푔∈Y 푖=1 and unambiguous way, like an index, to distinguish the members of H so that we can identify and then select spe- For regression, Y = ℝ, cific ones. To help clarify that, we can think of the function H 푛 space in this case as a vector space with the monomi- 1 2 푗 푝 푓ˆ(횔홽홽)(퐱) = ∑ 푦 ퟙ(퐱 ∈ 풱 (퐱)). (23) als {퐱, 퐱 , … , 퐱 , … , 퐱 } as the basis vectors or atoms of the 푘 푖 푖 푘 푖=1 expansion that help span the space. In general, one con- siders a basis set {홱1(퐱), 홱2(퐱), … , 홱푗(퐱), … , 홱푝(퐱)}, so that 푗 for polynomial regression, 홱푗(퐱) = 퐱 . 
Using the basis set, a member $f \in \mathcal{H}$ can then be specified by simply indicating which of the monomials are combined together to form its representation. For our space of univariate real-valued polynomials of degree at most $p$, we could use one of the key building blocks of the parametric model selection machinery, namely, a vector of indicator variables. With the $p$ original atoms, there are $2^p - 1$ nonempty models, each corresponding to a subset of the provided atoms. We shall use a vector $\boldsymbol{\gamma} = (\gamma_1, \gamma_2, \ldots, \gamma_p)^\top$ to denote the index of a given model, with each $\gamma_j$ being an indicator of the atom's presence in the model under consideration, namely,
$$\gamma_j = \mathbb{1}(\text{atom } B_j(\mathbf{x}) \text{ appears in model } M_{\boldsymbol{\gamma}}).$$
For simplicity we shall assume no intercept; i.e., $\theta_0 = 0$. Here, $\boldsymbol{\gamma} = (1, 1, \ldots, 1)^\top$ corresponds to the full model $M_f$, while $\boldsymbol{\gamma} = (0, 0, \ldots, 0)^\top$ corresponds to the empty model, also referred to as the null model, and given by $M_0: \mathbf{Y} = \boldsymbol{\varepsilon}$ (pure zero-mean noise). Equipped with this index, $|M_{\boldsymbol{\gamma}}| = |f_{\boldsymbol{\gamma}}| = p_{\boldsymbol{\gamma}} = \sum_{j=1}^{p} \gamma_j$ is the number of atoms in model $M_{\boldsymbol{\gamma}}$, and $\boldsymbol{\theta}_{\boldsymbol{\gamma}} \in \mathbb{R}^{p_{\boldsymbol{\gamma}}}$ is the subset of $\boldsymbol{\theta} \in \mathbb{R}^p$ made up of only the $\theta_j$'s picked up by $\boldsymbol{\gamma}$, that is, $\theta_{\gamma_j} = \gamma_j \theta_j$. Finally, $\mathbf{X}_{\boldsymbol{\gamma}}$ is the submatrix of $\mathbf{X}$ whose columns are only those $p_{\boldsymbol{\gamma}}$ columns of $\mathbf{X}$ picked up by $\boldsymbol{\gamma}$, so that $\mathbf{X}_{\boldsymbol{\gamma}}$ is really an $n \times p_{\boldsymbol{\gamma}}$ matrix, and the corresponding model $M_{\boldsymbol{\gamma}}$ is given by
$$M_{\boldsymbol{\gamma}}: \mathbf{Y} = \mathbf{X}_{\boldsymbol{\gamma}} \boldsymbol{\theta}_{\boldsymbol{\gamma}} + \boldsymbol{\varepsilon}. \tag{24}$$

Putting everything together, we define a function space $\mathcal{H}$ as the hypothesis space containing the pattern underlying our data, but in a sense, using the language of models, we are somewhat dealing with a model space $\mathcal{M}$. Having now defined the useful concept of index (indicator vector) of a given model, we can unambiguously specify members $f_{\boldsymbol{\gamma}} \in \mathcal{H}$ or $M_{\boldsymbol{\gamma}} \in \mathcal{M}$, using the vector $\boldsymbol{\gamma} \in \boldsymbol{\Gamma} = \{0, 1\}^p$, which represents the indexing of that specific member of the model space $\mathcal{M}$ or, equivalently, the function space $\mathcal{H}$. Clearly, $\boldsymbol{\Gamma}$ is made up of the $2^p$ models. For our polynomial regression task, we are in the presence of the so-called parametric family of models, in the sense that the choice of a member $M_{\boldsymbol{\gamma}}$ of the model space $\mathcal{M}$ through its index vector $\boldsymbol{\gamma}$ maps to the corresponding collection of parameters contained in $\boldsymbol{\theta}_{\boldsymbol{\gamma}}$. In such a parametric context, the unambiguous specification of a model or the corresponding function thereof typically indicates both the model $M_{\boldsymbol{\gamma}}$ and the corresponding parameter vector $\boldsymbol{\theta}_{\boldsymbol{\gamma}}$. Let $\tilde{\mathbf{x}} = (B_1(\mathbf{x}), B_2(\mathbf{x}), \ldots, B_p(\mathbf{x}))^\top$ and $\mathbf{V}_{\boldsymbol{\gamma}} \in \{0, 1\}^{p \times p_{\boldsymbol{\gamma}}}$, such that $\mathbf{V}_{\boldsymbol{\gamma}}[j, k] = \gamma_j$, $j = 1, \ldots, p$, $k = 1, \ldots, p_{\boldsymbol{\gamma}}$. Any member $f_{\boldsymbol{\gamma}} = f_{\boldsymbol{\gamma}}(\mathbf{x}|\boldsymbol{\theta}_{\boldsymbol{\gamma}}, M_{\boldsymbol{\gamma}}) \in \mathcal{H}$ can be fully specified as
$$f_{\boldsymbol{\gamma}}(\mathbf{x}|\boldsymbol{\theta}_{\boldsymbol{\gamma}}, M_{\boldsymbol{\gamma}}) = \tilde{\mathbf{x}}^\top \mathbf{V}_{\boldsymbol{\gamma}} \boldsymbol{\theta}_{\boldsymbol{\gamma}} = \sum_{j=1}^{p} \gamma_j \theta_{\gamma_j} B_j(\mathbf{x}). \tag{25}$$
For any $M_{\boldsymbol{\gamma}} \in \mathcal{M}$, the ordinary least squares (OLS) estimate encountered earlier in equation (21) is now given by
$$\hat{\boldsymbol{\theta}}_{\boldsymbol{\gamma}}^{(\mathrm{OLS})} = \hat{\boldsymbol{\theta}}_{\boldsymbol{\gamma}} = (\mathbf{X}_{\boldsymbol{\gamma}}^\top \mathbf{X}_{\boldsymbol{\gamma}})^{-1} \mathbf{X}_{\boldsymbol{\gamma}}^\top \mathbf{Y}. \tag{26}$$
It is easy to see that the prediction of the average response at $\mathbf{x}$ is given by
$$\hat{f}_{\boldsymbol{\gamma}}^{(\mathrm{OLS})}(\mathbf{x}) = \hat{f}_{\boldsymbol{\gamma}}^{(\mathrm{OLS})}(\mathbf{x}|\hat{\boldsymbol{\theta}}_{\boldsymbol{\gamma}}^{(\mathrm{OLS})}, M_{\boldsymbol{\gamma}}) = \tilde{\mathbf{x}}^\top \mathbf{V}_{\boldsymbol{\gamma}} \hat{\boldsymbol{\theta}}_{\boldsymbol{\gamma}} = \sum_{j=1}^{p} \gamma_j \hat{\theta}_{\gamma_j} B_j(\mathbf{x}). \tag{27}$$
It is important to note that the identifier of functions need not be a vector as in the above parametric modelling scenario. In nonparametric univariate regression learning, for instance, the identifier of a member of the Nadaraya–Watson space of estimators is simply a real scalar, namely, the bandwidth of the kernel used in the estimation:
$$\hat{f}_{\gamma}^{(\mathrm{NW})}(\mathbf{x}) = \sum_{i=1}^{n} y_i K\!\left(\frac{\mathbf{x} - \mathbf{x}_i}{\gamma}\right) \bigg/ \sum_{\ell=1}^{n} K\!\left(\frac{\mathbf{x} - \mathbf{x}_\ell}{\gamma}\right). \tag{28}$$
For this nonparametric scenario, the model index $\gamma$ suffices to fully specify the model, as there are no parameters in the traditional sense of a finite collection of model coefficients. Here $\gamma \in \boldsymbol{\Gamma}^\star \subseteq \mathbb{R}_+$, which means that our model space search is done on an infinite subset of the right-hand side of the real number line. For the $k$-Nearest Neighbors learning machine, the complexity of the implicit underlying model is measured by $k$, the size of the neighborhood, which is a discrete number from 1 to $n$. Therefore, for $k$NN, $\gamma \in \boldsymbol{\Gamma} = \{1, 2, \ldots, n\}$. In practice, this is truncated to a reasonable maximum number of neighbors.

Figure 2. Optimal kNN decision boundary.

Model Selection Criteria
When it comes to model selection for optimal prediction, both Bayesian statistics and non-Bayesian statistics have contributed richly. Essentially, one can identify four main ways to address the quest for optimal prediction: namely, (a) Selection, (b) Compression, (c) Regularization, and (d) Aggregation. The first three approaches operate under the strong assumption that a single member of the function space $\mathcal{H}$ exists with optimal predictive properties, and all the methods and techniques seek to find that unique member. All the existing criteria are carefully created, designed, and developed to help yield that member of $\mathcal{H}$. On the other hand, aggregation, also known as ensemble learning or model averaging or model combination, takes the view that a single optimum might not exist. Aggregation operates on the assumption that many decent candidate models exist, and instead of needlessly wasting time to seek a unique optimum that one may never find, it is better to combine the good candidates in some fashion to yield an overall lower prediction (generalization) error. Over the years, aggregation techniques like Bayesian Model Averaging (BMA) [2, 11], Bootstrap Aggregating (Bagging) [3], Random Forest [4], Random Subspace Learning, Stacking, and certainly Adaptive Boosting [14] and Gradient Boosting have emerged and continue to be developed. Interestingly, these so-called ensemble learning methods tend to yield the best predictive performances in practical applications.

Likelihood based selection. In the presence of a multiplicity of potential models competing to fit the data, and considering that the estimators of those models are based on random samples with inherently built-in uncertainty, it makes sense to assume that any choice of a model consequently has built-in uncertainty. Before the data is collected and the model built, $p(M_{\boldsymbol{\gamma}})$ represents its prior probability. Once the data is collected, the posterior probability $p(M_{\boldsymbol{\gamma}}|\mathcal{D}_n)$ of model $M_{\boldsymbol{\gamma}}$ provides a reasonable mechanism for assessing and measuring the uncertainty attached to its selection. Now, using $m_{\boldsymbol{\gamma}}(\mathcal{D}_n) = p(\mathcal{D}_n|M_{\boldsymbol{\gamma}}) = \int_{\boldsymbol{\Theta}} p(\mathcal{D}_n|\boldsymbol{\theta}_{\boldsymbol{\gamma}}, M_{\boldsymbol{\gamma}}) p(\boldsymbol{\theta}_{\boldsymbol{\gamma}}) d\boldsymbol{\theta}_{\boldsymbol{\gamma}}$, we can write
$$p(M_{\boldsymbol{\gamma}}|\mathcal{D}_n) = \frac{p(M_{\boldsymbol{\gamma}}) m_{\boldsymbol{\gamma}}(\mathcal{D}_n)}{\sum_{\boldsymbol{\gamma}'} p(M_{\boldsymbol{\gamma}'}) m_{\boldsymbol{\gamma}'}(\mathcal{D}_n)} = \frac{p(\mathcal{D}_n|M_{\boldsymbol{\gamma}}) p(M_{\boldsymbol{\gamma}})}{p(\mathcal{D}_n)} = \frac{p(\mathcal{D}_n|M_{\boldsymbol{\gamma}}) p(M_{\boldsymbol{\gamma}})}{\sum_{\ell=1}^{2^p} p(\mathcal{D}_n|M_\ell) p(M_\ell)}. \tag{29}$$
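To make the indexing concrete, here is a small sketch (toy data and helper names of my own choosing) that enumerates all $2^p - 1$ nonempty index vectors $\boldsymbol{\gamma}$, fits each model $M_{\boldsymbol{\gamma}}$ by OLS as in (26), and scores each candidate with the BIC score defined later in equation (37), using the Gaussian log-likelihood:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Toy data from a sparse polynomial: only the atoms x^1 and x^3 are active.
n, p = 100, 5
x = rng.uniform(-1, 1, n)
X = np.column_stack([x**j for j in range(1, p + 1)])   # atoms B_j(x) = x^j
y = 2.0 * x - 1.5 * x**3 + rng.normal(0, 0.1, n)

def ols(Xg, y):
    """OLS estimate of equation (26) for the design matrix X_gamma."""
    return np.linalg.solve(Xg.T @ Xg, Xg.T @ y)

def bic(Xg, y):
    """Gaussian BIC score of a model, in the spirit of equation (37)."""
    theta = ols(Xg, y)
    resid = y - Xg @ theta
    sigma2 = np.mean(resid**2)
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return -2 * loglik + Xg.shape[1] * np.log(n)

# Enumerate the 2^p - 1 nonempty index vectors gamma and keep the best model.
best_gamma, best_score = None, np.inf
for gamma in itertools.product([0, 1], repeat=p):
    if sum(gamma) == 0:
        continue                       # skip the null model M_0
    Xg = X[:, [j for j in range(p) if gamma[j] == 1]]
    score = bic(Xg, y)
    if score < best_score:
        best_gamma, best_score = gamma, score
# best_gamma now indexes the BIC-optimal model M_gamma; with this noise
# level it often recovers the true support (1, 0, 1, 0, 0).
```

Exhaustive enumeration is only feasible for small $p$, which is one reason the criteria and search strategies discussed below matter in practice.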

In a parametric context like the one introduced in "Elements of Model Identification," the Bayesian estimator of the parameter vector $\boldsymbol{\theta}_{\boldsymbol{\gamma}}$ for model $M_{\boldsymbol{\gamma}} \in \mathcal{M}$ is given by
$$\tilde{\boldsymbol{\theta}}_{\boldsymbol{\gamma}}^{(\mathrm{Bayes})} = \tilde{\boldsymbol{\theta}}_{\boldsymbol{\gamma}} = \mathbb{E}[\boldsymbol{\theta}_{\boldsymbol{\gamma}}|M_{\boldsymbol{\gamma}}, \mathcal{D}_n] = \int \boldsymbol{\theta}_{\boldsymbol{\gamma}}\, p(\boldsymbol{\theta}_{\boldsymbol{\gamma}}|M_{\boldsymbol{\gamma}}, \mathcal{D}_n)\, d\boldsymbol{\theta}_{\boldsymbol{\gamma}}. \tag{30}$$
From a Bayesian perspective, if model $M_{\boldsymbol{\gamma}}$ is selected, then the predictor of the response $Y$ given $\mathbf{x}$ is given by
$$\hat{f}_{\boldsymbol{\gamma}}^{(\mathrm{Bayes})}(\mathbf{x}) = \tilde{\mathbf{x}}^\top \mathbf{V}_{\boldsymbol{\gamma}} \mathbb{E}[\boldsymbol{\theta}_{\boldsymbol{\gamma}}|M_{\boldsymbol{\gamma}}, \mathcal{D}_n] = \tilde{\mathbf{x}}^\top \mathbf{V}_{\boldsymbol{\gamma}} \tilde{\boldsymbol{\theta}}_{\boldsymbol{\gamma}} = \sum_{j=1}^{p} \gamma_j B_j(\mathbf{x}) \tilde{\theta}_{\gamma_j}. \tag{31}$$
Under the squared error loss, the Bayesian Model Averaging (BMA) predictor provides the optimal predictor [2, 11], whose corresponding prediction function is given by
$$\hat{f}^{(\mathrm{BMA})}(\mathbf{x}) = \sum_{\boldsymbol{\gamma} \in \boldsymbol{\Gamma}} \sum_{j=1}^{p} \gamma_j\, p(M_{\boldsymbol{\gamma}}|\mathcal{D}_n)\, B_j(\mathbf{x})\, \hat{\theta}_{\gamma_j}. \tag{32}$$
The median probability model introduced and developed in [2] seeks to achieve both optimal prediction and consistent model selection. The quintessential element in the construction of the median probability model is the posterior inclusion probability $\mathrm{PIP}_j$ of atom $B_j(\mathbf{x})$, with
$$\mathrm{PIP}_j = \Pr[\gamma_j = 1|\mathcal{D}_n] = \sum_{\boldsymbol{\gamma} \in \boldsymbol{\Gamma}} \gamma_j\, p(M_{\boldsymbol{\gamma}}|\mathcal{D}_n). \tag{33}$$
The median probability model index vector is given by $\boldsymbol{\gamma}^{(\mathrm{med})} \in \boldsymbol{\Gamma} = \{0, 1\}^p$, where $\gamma_j^{(\mathrm{med})} = \mathbb{1}(\mathrm{PIP}_j \geq \frac{1}{2})$. The median probability model is thus the model made up of the atoms whose posterior inclusion probability is at least one half. The main limitation of the median probability model lies in the fact that the model does not always exist, mainly due to the rigidity of the threshold. In [9] I remedied this limitation by suggesting a flexible and adaptive approach for optimal predictive atom selection in the general basis function expansion framework. An alternative to the median probability model is the highest posterior model, whose model index vector is given by
$$\boldsymbol{\gamma}^{(\mathrm{HPM})} = \underset{\boldsymbol{\gamma} \in \boldsymbol{\Gamma}}{\mathrm{argmax}}\left\{ p(M_{\boldsymbol{\gamma}}|\mathcal{D}_n) \right\}.$$
Recall also that given a model $M_{\boldsymbol{\gamma}} \in \mathcal{M}$, along with the corresponding $\boldsymbol{\theta}_{\boldsymbol{\gamma}} \in \mathbb{R}^{p_{\boldsymbol{\gamma}}}$, the likelihood of $\boldsymbol{\theta}_{\boldsymbol{\gamma}}$ is
$$L(\boldsymbol{\theta}_{\boldsymbol{\gamma}}|M_{\boldsymbol{\gamma}}, \mathcal{D}_n) = p(\mathcal{D}_n|f_{\boldsymbol{\gamma}}(\mathbf{X}|\boldsymbol{\theta}_{\boldsymbol{\gamma}}, M_{\boldsymbol{\gamma}})) = \prod_{i=1}^{n} p(y_i|f_{\boldsymbol{\gamma}}(\mathbf{x}_i|\boldsymbol{\theta}_{\boldsymbol{\gamma}}, M_{\boldsymbol{\gamma}})), \tag{34}$$
and the maximum likelihood estimator of $\boldsymbol{\theta}_{\boldsymbol{\gamma}}$ is
$$\hat{\boldsymbol{\theta}}_{\boldsymbol{\gamma}}^{(\mathrm{MLE})} = \underset{\boldsymbol{\theta}_{\boldsymbol{\gamma}} \in \mathbb{R}^{p_{\boldsymbol{\gamma}}}}{\mathrm{argmax}}\left\{ \log L(\boldsymbol{\theta}_{\boldsymbol{\gamma}}|M_{\boldsymbol{\gamma}}, \mathcal{D}_n) \right\}. \tag{35}$$
The Schwarz Bayesian Information Criterion (BIC) [15], although very prevalent in non-Bayesian settings, just happens, as its name suggests, to have a Bayesian origin. The model index $\boldsymbol{\gamma}^{(\mathrm{BIC})}$ of a model $M_{\boldsymbol{\gamma}} \in \mathcal{M}$ is given by
$$\boldsymbol{\gamma}^{(\mathrm{BIC})} = \underset{\boldsymbol{\gamma} \in \boldsymbol{\Gamma}}{\mathrm{argmin}}\left\{ \mathrm{BIC}_n(M_{\boldsymbol{\gamma}}) \right\}, \tag{36}$$
where the score $\mathrm{BIC}_n(M_{\boldsymbol{\gamma}})$ of model $M_{\boldsymbol{\gamma}} \in \mathcal{M}$ is
$$\mathrm{BIC}_n(M_{\boldsymbol{\gamma}}) = -2 \log L(\hat{\boldsymbol{\theta}}_{\boldsymbol{\gamma}}|M_{\boldsymbol{\gamma}}; \mathcal{D}_n) + |M_{\boldsymbol{\gamma}}| \log n. \tag{37}$$
The Akaike Information Criterion (AIC) [1], where the score $\mathrm{AIC}_n(M_{\boldsymbol{\gamma}})$ of model $M_{\boldsymbol{\gamma}} \in \mathcal{M}$ is defined as
$$\mathrm{AIC}_n(M_{\boldsymbol{\gamma}}) = -2 \log L(\hat{\boldsymbol{\theta}}_{\boldsymbol{\gamma}}|M_{\boldsymbol{\gamma}}; \mathcal{D}_n) + 2|M_{\boldsymbol{\gamma}}|, \tag{38}$$
predates BIC, and while BIC is regarded as the chief selection criterion, AIC has enjoyed the distinct property of yielding typically better predictive performances.

Elements of cross validation. A more universally applicable model selection score is the ubiquitous cross validation score. In its most general formulation, the $V$-fold cross validation score proceeds by deterministically dividing the data set $\mathcal{D}_n$ into $V$ chunks (folds) of almost equal sizes, such that $\mathcal{D}_n = \bigcup_{v=1}^{V} \mathcal{D}_v$ and $n = \sum_{v=1}^{V} |\mathcal{D}_v|$. The cross validation score is given by
$$\mathrm{CV}(\hat{f}) = \frac{1}{V} \sum_{v=1}^{V} \hat{\varepsilon}_v, \tag{39}$$
where
$$\hat{\varepsilon}_v = \frac{1}{|\mathcal{D}_v|} \sum_{i=1}^{n} \mathbb{1}(\mathbf{z}_i \in \mathcal{D}_v)\, \mathcal{L}(y_i, \hat{f}^{(-\mathcal{D}_v)}(\mathbf{x}_i)),$$
and $\hat{f}^{(-\mathcal{D}_v)}(\cdot)$ is the estimator of $f$ constructed without the $v$th chunk $\mathcal{D}_v$ of $\mathcal{D}_n$. An algorithmic (pseudo-code) description is given below in Algorithm 1 to help build an intuitive understanding of this most general of model selection scores. In practice, the data is often randomly shuffled prior to the deterministic splitting into chunks. The oldest incarnation of the cross validation principle is leave one out cross validation, which corresponds to $V = n$. It is important to mention here that cross validation is one of the most used approaches to model selection for optimal prediction in statistical machine learning. From its earliest days with M. Stone's [17] seminal paper, along with its wide variety of extensions and adaptations, like [16], the cross validation principle has continually played a central role in the selection of various types of model hyperparameters. In virtually all the model spaces considered in this paper, cross validation is the default approach for empirical intraspace model comparison and model selection. When classification and regression trees are used as the function space, their pruning is done via cross validation. Cross validation is also used as one way to estimate the number of base learners in ensemble learning methods like Bagging [3] or Random Forest [4] or even adaptive boosting [14]. Cross validation also plays a central role in support vector machine classification and support vector regression learning, as well as in ridge regression [10] and the famous lasso [18] and its extensions. In short, cross validation is central to non-Bayesian regularization. One of the greatest appeals of the cross validation principle lies in its generality, its flexibility, and its wide applicability. Cross validation is typically used for determining the optimal complexity in both parametric and nonparametric function spaces, but also crucially for selecting the specific member of the function space that achieves the lowest prediction error, provided such a unique member exists. It is important to know that there are learning machines, and very good ones at that, that are constructed purely algorithmically. While it is difficult or even at times impossible to use some of the other optimal predictive model selection criteria on purely algorithmic machines like the $k$-Nearest Neighbors learning machines of equations (22) and (23), it is straightforward to use cross validation on them, as long as the error is well defined. Cross validation applies nicely to the most interpretable learning machines, namely, classification and regression trees, which are built purely algorithmically but still benefit from the predictive power and flexibility of the cross validation principle.

Algorithm 1: $V$-fold Cross Validation
Input: Training data $\mathcal{D}_n = \{\mathbf{z}_i = (\mathbf{x}_i^\top, y_i)^\top,\ i = 1, \ldots, n\}$, where $\mathbf{x}_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$, and the function of interest is denoted by $f$; sample size $n$; number of folds $V$ (with chunk size $m \approx n/V$).
Output: Cross validation score $\mathrm{CV}(\hat{f})$.
for $v = 1$ to $V$ do
  Extract the validation set $\mathcal{D}_v = \{\mathbf{z}_i \in \mathcal{D}_n : i \in [1 + (v-1) \times m,\ v \times m]\}$
  Extract the training set $\mathcal{D}_v^c := \mathcal{D}_n \setminus \mathcal{D}_v$
  Build the estimator $\hat{f}^{(-\mathcal{D}_v)}(\cdot)$ using $\mathcal{D}_v^c$
  Compute predictions $\hat{f}^{(-\mathcal{D}_v)}(\mathbf{x}_i)$ for $\mathbf{z}_i \in \mathcal{D}_v$
  Compute the validation error for the $v$th chunk:
    $\hat{\varepsilon}_v = \frac{1}{|\mathcal{D}_v|} \sum_{i=1}^{n} \mathbb{1}(\mathbf{z}_i \in \mathcal{D}_v)\, \mathcal{L}(y_i, \hat{f}^{(-\mathcal{D}_v)}(\mathbf{x}_i))$
Compute the CV score $\mathrm{CV}(\hat{f}) = \frac{1}{V} \sum_{v=1}^{V} \hat{\varepsilon}_v$

Regularized risk minimization. One of the fundamental results in statistical learning theory has to do with the fact that the minimizer of the empirical risk could turn out to be overly optimistic and lead to poor generalization performance. It is indeed the case that by making our estimated classifier very complex, it can adapt too well to the data at hand, meaning a very low in-sample error rate, but yield very high out-of-sample error rates due to overfitting, the estimated classifier having learned both the signal and the noise. In technical terms, this is referred to as the bias-variance dilemma, in the sense that by increasing the complexity of the estimated learning machine, the bias is reduced (good fit all the way to the point of overfitting) (see Figure 2), but the variance of that estimator is increased. On the other hand, considering much simpler estimators leads to less variance but higher bias (due to underfitting, the model not being rich enough to fit the data well). This phenomenon of the bias-variance dilemma is particularly potent with massive data when the number of predictor variables $p$ is much larger than the sample size $n$. One of the main tools in the modern machine learning arsenal for dealing with this is the so-called regularization framework, whereby instead of using the empirical risk alone, a constrained version of it, also known as the regularized or penalized version, is used. Indeed, within a selected space $\mathcal{H}$ of potential learning machines, one typically chooses some loss function $\mathcal{L}(\cdot, \cdot)$ with some desirable properties like smoothness or convexity (this is because one needs at least to be able to build the desired classifier), and then finds the minimizer of its regularized version, i.e.,
$$\hat{f}_{\mathcal{H}, \lambda, n} = \underset{f \in \mathcal{H}}{\mathrm{argmin}}\left\{ \hat{R}_{\mathcal{H}, n}(f) + \lambda\, \Omega_{\mathcal{H}}(f) \right\}, \tag{40}$$
where $\lambda$ controls the bias-variance trade-off. Typically, $\lambda > 0$ and is determined by cross validation. Cross validation for determining $\lambda$ proceeds by defining a grid $\Lambda \subset \mathbb{R}_+^\star = (0, +\infty)$ of possible values of $\lambda$. Sometimes, based on intuition or experience, it could just be $\Lambda = [\lambda_{\mathrm{min}}, \lambda_{\mathrm{max}}]$; then
$$\hat{\lambda}^{(\mathrm{opt})} = \underset{\lambda \in \Lambda}{\mathrm{argmin}}\left\{ \mathrm{CV}(\hat{f}_\lambda) \right\}. \tag{41}$$
$\hat{f}_{\mathcal{H}, \hat{\lambda}^{(\mathrm{opt})}, n}$ is clearly far better than $\hat{f}_{\mathcal{H}, n}$ from equation (17). By inherent design, the cross validation mechanism endows $\hat{f}_{\mathcal{H}, \hat{\lambda}^{(\mathrm{opt})}, n}$ with some predictive power, making it an estimator with the potential for predictive optimality. As long as the loss function $\mathcal{L}(\cdot, \cdot)$ and the penalty function $\Omega_{\mathcal{H}}(\cdot)$ have desirable mathematical and statistical properties like convexity and differentiability and boundedness to allow the search of the function space $\mathcal{H}$ to be performed by optimization, $\hat{f}_{\mathcal{H}, \hat{\lambda}^{(\mathrm{opt})}, n}$, thanks to the cross validation mechanism, provides a practical framework for potentially selecting the optimal predictive member of $\mathcal{H}$. It is important to note that finding $\hat{f}_{\mathcal{H}, \hat{\lambda}^{(\mathrm{opt})}, n} \in \mathcal{H}$ does not in any way guarantee that the true risk $R(\hat{f}_{\mathcal{H}, \hat{\lambda}^{(\mathrm{opt})}, n})$ is close to $R^\star = R(f^\star)$. In other words, $\hat{f}_{\mathcal{H}, \hat{\lambda}^{(\mathrm{opt})}, n}$
is the best in $\mathcal{H}$, but there is no guarantee that it is anywhere near $f^\star$. $\hat{f}_{\mathcal{H}, \hat{\lambda}^{(\mathrm{opt})}, n}$ is what we refer to here as the intraspace optimal predictive model, since it is the cross validated best estimator within the function space $\mathcal{H}$.

Logistic regression is arguably one of the most widely used statistical learning machines, even enjoying a direct and strong relationship with artificial neural networks. Using the traditional $\{0, 1\}$ labelling on the response variable $Y$, we have $\mathbb{P}[Y_i = 1|\mathbf{x}_i, \boldsymbol{\theta}_{\boldsymbol{\gamma}}, M_{\boldsymbol{\gamma}}] = \pi(\mathbf{x}_i; \boldsymbol{\theta}_{\boldsymbol{\gamma}}, M_{\boldsymbol{\gamma}})$ and $\pi(\mathbf{x}_i; \boldsymbol{\theta}_{\boldsymbol{\gamma}}) = \pi_i(\boldsymbol{\theta}_{\boldsymbol{\gamma}}) = \frac{1}{1 + e^{-\mathbf{x}_i^\top \boldsymbol{\theta}_{\boldsymbol{\gamma}}}}$. The likelihood is
$$L(\boldsymbol{\theta}_{\boldsymbol{\gamma}}; M_{\boldsymbol{\gamma}}, \mathcal{D}_n) = \prod_{i=1}^{n} \left\{ [\pi_i(\boldsymbol{\theta}_{\boldsymbol{\gamma}})]^{y_i} [1 - \pi_i(\boldsymbol{\theta}_{\boldsymbol{\gamma}})]^{1 - y_i} \right\}. \tag{42}$$
The corresponding regularized empirical risk for the binary multiple linear logistic regression model is given by
$$\hat{R}_\lambda(\boldsymbol{\theta}_{\boldsymbol{\gamma}}, M_{\boldsymbol{\gamma}}) = -\log L(\boldsymbol{\theta}_{\boldsymbol{\gamma}}; M_{\boldsymbol{\gamma}}, \mathcal{D}_n) + \lambda \|\boldsymbol{\theta}_{\boldsymbol{\gamma}}\|_{\mathcal{H}}. \tag{43}$$
Now, the celebrated support vector machine [19] for binary classification with response variable taking values in $\{-1, +1\}$ is a solution to the regularized empirical hinge risk functional, namely,
$$\hat{\mathbf{w}} = \underset{\mathbf{w} \in \mathcal{F}}{\mathrm{argmin}}\left\{ \frac{1}{n} \sum_{i=1}^{n} \left(1 - y_i \langle \mathbf{w}, \Phi(\mathbf{x}_i) \rangle\right)_+ + \frac{\lambda}{2} \|\mathbf{w}\|_{\mathcal{H}}^2 \right\}.$$
Using quadratic programming on the dual formulation of this problem with $\alpha_i$ as the Lagrangian multipliers, we get $\hat{\mathbf{w}} = \sum_{i=1}^{n} \hat{\alpha}_i y_i \Phi(\mathbf{x}_i)$, and the corresponding estimated prediction function is
$$\hat{f}_{\mathrm{svm}}(\mathbf{x}) = \mathrm{sign}\left( \sum_{i=1}^{n} y_i \hat{\alpha}_i \mathcal{K}(\mathbf{x}, \mathbf{x}_i) \right),$$
where the nonzero $\hat{\alpha}_i$'s correspond to the so-called support vectors, and $\mathcal{K}(\mathbf{x}, \mathbf{x}_i) = \langle \Phi(\mathbf{x}), \Phi(\mathbf{x}_i) \rangle$ is an incarnation of the so-called kernel trick that makes SVM immensely practical. Here, $\mathcal{K}(\cdot, \cdot)$ is a bivariate function called a kernel, defined on $\mathcal{X} \times \mathcal{X}$ and used to measure the similarity between two points in an observation space. One of the most commonly used kernels in statistical machine learning is the Gaussian radial basis function kernel given by
$$\mathcal{K}(\mathbf{x}, \mathbf{x}_i) = \exp\left( -\frac{1}{2} \frac{\|\mathbf{x} - \mathbf{x}_i\|_2^2}{\tau^2} \right).$$
There are many other kernels and kernel methods, like Gaussian processes [5, 6].

Computational Model Selection
Before $\hat{f}_{\mathcal{H}, n}$ can be deemed good from a predictive perspective, its complexity must be controlled in order to endow it with good generalization properties, i.e., small prediction error on out-of-sample observations. This focus on the "generalizability" of $\hat{f}_{\mathcal{H}, n}$ is incredibly central to statistical learning when optimal prediction is the primary goal. Let $\mathcal{D}_n = \{Z_1, Z_2, \ldots, Z_n \overset{\mathrm{iid}}{\sim} p_Z(\mathbf{z})\}$, where $Z_i = (X_i, Y_i) \in \mathcal{X} \times \mathcal{Y}$. Consider random splits of $\mathcal{D}_n$ into a training and a test set such that $\mathcal{D}_n = \mathcal{D}_{\mathrm{tr}} \cup \mathcal{D}_{\mathrm{te}}$ with $n = |\mathcal{D}_{\mathrm{tr}}| + |\mathcal{D}_{\mathrm{te}}|$. Consider mappings $f: \mathcal{X} \longrightarrow \mathcal{Y}$ and a loss function $\mathcal{L}(\cdot, \cdot)$. Then the training and test errors are given by
$$\hat{R}_{\mathrm{tr}}(f) = \frac{1}{|\mathcal{D}_{\mathrm{tr}}|} \sum_{i=1}^{n} \mathcal{L}(Y_i, f(X_i))\, \mathbb{1}(Z_i \in \mathcal{D}_{\mathrm{tr}}) \tag{44}$$
and
$$\hat{R}_{\mathrm{te}}(f) = \frac{1}{|\mathcal{D}_{\mathrm{te}}|} \sum_{j=1}^{n} \mathcal{L}(Y_j, f(X_j))\, \mathbb{1}(Z_j \in \mathcal{D}_{\mathrm{te}}). \tag{45}$$
If $\hat{f} = \underset{f \in \mathcal{H}}{\mathrm{arginf}}\{\hat{R}_{\mathrm{tr}}(f)\}$, then $\mathbb{E}(\hat{R}_{\mathrm{tr}}(\hat{f})) \leq \mathbb{E}(\hat{R}_{\mathrm{te}}(\hat{f}))$. The so-called optimism of the training error is given by
$$\mathrm{Optimism}(\hat{R}_{\mathrm{te}}(\hat{f})) = \mathbb{E}(\hat{R}_{\mathrm{te}}(\hat{f})) - \mathbb{E}(\hat{R}_{\mathrm{tr}}(\hat{f}))$$
and represents the amount by which the training error (empirical risk) underestimates (hence the term optimism) the test error (generalization error). Indeed, when the function is made more and more complex, the empirical risk gets lower and lower and farther from the true error, as seen in Figure 3. This is an instance of the bias-variance dilemma that happens to be at the heart of methodological, theoretical, practical, computational, and epistemological aspects of statistical machine learning. The result of (3) highlights the reason why (17), the minimizer of the empirical risk, does not possess the predictive power needed, in the sense that it does not generalize well. In our quest for optimal predictive models, we will therefore not rely on the empirical risk alone, but instead will resort to score functions with inherent built-in mechanisms for selecting models that generalize well, i.e., produce lower prediction errors. Practically speaking, if the data $\mathcal{D}_n$ is randomly split $S$ times, so that for each $s$, the randomly shuffled (permuted) version $\mathcal{D}_n^{(s)}$ admits the decomposition $\mathcal{D}_n^{(s)} = \mathcal{D}_{\mathrm{tr}}^{(s)} \cup \mathcal{D}_{\mathrm{te}}^{(s)}$, then the $s$th replication of the test error is given by
$$e_{\mathrm{te}}^{(s)} = \hat{R}_{\mathrm{te}}(\hat{f}^{(\mathcal{D}_{\mathrm{tr}}^{(s)})}) = \frac{1}{|\mathcal{D}_{\mathrm{te}}^{(s)}|} \sum_{i=1}^{n} \mathbb{1}(\mathbf{z}_i^{(s)} \in \mathcal{D}_{\mathrm{te}}^{(s)})\, \mathcal{L}(y_i^{(s)}, \hat{f}^{(\mathcal{D}_{\mathrm{tr}}^{(s)})}(\mathbf{x}_i^{(s)})), \tag{46}$$
where $\hat{f}^{(\mathcal{D}_{\mathrm{tr}}^{(s)})}(\cdot)$ is the instance of $\hat{f}$ obtained using the $s$th random replication of the training set. Clearly, one has $S$ realizations of the test error, and $\{e_{\mathrm{te}}^{(1)}, \ldots, e_{\mathrm{te}}^{(s)}, \ldots, e_{\mathrm{te}}^{(S)}\}$ can be regarded as a sample of size $S$ from the distribution of the true test error. One of the quantities often computed from the $S$ realizations of the test error is the corresponding average test error
$$\mathrm{AVTE}(\hat{f}) = \frac{1}{S} \sum_{s=1}^{S} \hat{R}_{\mathrm{te}}(\hat{f}^{(\mathcal{D}_{\mathrm{tr}}^{(s)})}). \tag{47}$$
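The optimism of the training error in (44)-(47) is easy to observe numerically. The sketch below (toy data and helper names are my own, not the paper's) replicates $S$ random splits, fits a deliberately complex polynomial by least squares on each training set, and compares the average training and test errors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: a cubic signal plus noise.
n = 200
x = rng.uniform(-1, 1, n)
y = x - 0.8 * x**3 + rng.normal(0, 0.3, n)

def fit_poly(x, y, degree):
    """Least squares polynomial fit, playing the role of f-hat."""
    return np.polyfit(x, y, degree)

def mse(coef, x, y):
    """Squared error loss averaged over the given points."""
    return float(np.mean((np.polyval(coef, x) - y) ** 2))

S, tau = 50, 0.7                 # S random splits, 70% of points for training
train_errs, test_errs = [], []
for s in range(S):
    idx = rng.permutation(n)
    tr, te = idx[: int(tau * n)], idx[int(tau * n):]
    coef = fit_poly(x[tr], y[tr], degree=12)     # deliberately complex model
    train_errs.append(mse(coef, x[tr], y[tr]))   # training error, eq. (44)
    test_errs.append(mse(coef, x[te], y[te]))    # replicated test error, eq. (46)

avte = float(np.mean(test_errs))                 # average test error, eq. (47)
optimism = avte - float(np.mean(train_errs))     # empirical optimism estimate
```

With an overfitted model, the average training error sits systematically below the average test error, which is exactly the gap labeled "optimism of training error" in Figure 3.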

It is important to note that the median can also be used in place of the mean. Besides, the replications allow various statistical analyses on the predictive performances of each function space. A typical way to explore empirical interspace model comparison is to generate comparative boxplots of the replicated test errors, which can be done using the stochastic hold-out scheme described in Algorithm 2. Figure 4 depicts the results for the famous Leptograpsus crabs benchmark data set, and Figure 5 does the same for the ionosphere data set, which is another benchmark data set. Both data sets can be obtained from R.

It is important to note that $\hat{f}^{(\mathcal{D}_{\mathrm{tr}}^{(s)})}(\cdot)$ should be internally optimized using its own internal intraspace optimality search criterion (like cross validation). This assumption is made with the finality of making sure that the interspace model comparison operates on the best of each considered model space. Let $\hat{\mathcal{C}}$ be a collection of models, ideally with each from a different function space or a different method of estimation (learning). For instance,
$$\hat{\mathcal{C}} = \{\hat{f}_{\mathrm{LDA}}, \hat{f}_{\mathrm{SVM}}, \hat{f}_{\mathrm{CART}}, \hat{f}_{\mathrm{RF}}, \hat{f}_{\mathrm{GPR}}, \hat{f}_{k\mathrm{NN}}, \hat{f}_{\mathrm{Boost}}, \hat{f}_{\mathrm{Logit}}, \hat{f}_{\mathrm{RDA}}\}.$$

Algorithm 2: Stochastic Hold-Out for Generalization
Input: Training data $\mathcal{D}_n = \{\mathbf{z}_i = (\mathbf{x}_i, y_i),\ i = 1, \ldots, n\}$, where $\mathbf{x}_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$; list of learning machines to be evaluated; sample size $n$; number of random splits $S$; number of learning machines $M$; proportion $\tau \in (1/2, 1)$ of observations in the training set.
Output: Matrix $E = (E_{sm}) = \hat{R}_{\mathrm{te}}(\hat{f}_m^{(s)})$ of test error values for the several learning machines.
for $s = 1$ to $S$ do
  Generate the $s$th random split of the data set $\mathcal{D}_n$ into training set $\mathcal{D}_{\mathrm{tr}}^{(s)}$ and test set $\mathcal{D}_{\mathrm{te}}^{(s)}$, such that $\mathcal{D}_n = \mathcal{D}_{\mathrm{tr}}^{(s)} \cup \mathcal{D}_{\mathrm{te}}^{(s)}$, with $|\mathcal{D}_{\mathrm{tr}}^{(s)}| = \tau n$ and $|\mathcal{D}_{\mathrm{te}}^{(s)}| = (1 - \tau) n$
  for $m = 1$ to $M$ do
    Build and refine the $m$th learning machine $\hat{f}_m^{(\mathcal{D}_{\mathrm{tr}}^{(s)})}(\cdot)$ using $\mathcal{D}_{\mathrm{tr}}^{(s)}$
    Compute predictions $\hat{f}_m^{(\mathcal{D}_{\mathrm{tr}}^{(s)})}(\mathbf{x}_i)$ for $\mathbf{z}_i \in \mathcal{D}_{\mathrm{te}}^{(s)}$
    Compute the test error for the $m$th learning machine:
      $\hat{\varepsilon}_{sm} = \hat{R}_{\mathrm{te}}(\hat{f}_m^{(s)}) = \frac{1}{|\mathcal{D}_{\mathrm{te}}^{(s)}|} \sum_{i=1}^{n} \mathbb{1}(\mathbf{z}_i \in \mathcal{D}_{\mathrm{te}}^{(s)})\, \mathcal{L}(y_i, \hat{f}_m^{(\mathcal{D}_{\mathrm{tr}}^{(s)})}(\mathbf{x}_i))$

Given a data set $\mathcal{D}_n$ and a collection of potential function spaces like $\hat{\mathcal{C}}$, one defines
$$E_{sm} = \hat{R}_{\mathrm{te}}(\hat{f}_m^{(s)}) = \hat{R}_{\mathrm{te}}(\hat{f}_m^{(\mathcal{D}_{\mathrm{tr}}^{(s)})}) = \text{Error of } \hat{f}_m^{(\mathcal{D}_{\mathrm{tr}}^{(s)})}(\cdot) \text{ on } \mathcal{D}_{\mathrm{te}}^{(s)}.$$
Then one proceeds to generate the matrix $E_{\mathrm{te}}$ containing $S$ realized values of the test error for each hypothesis space. For classification, $E_{\mathrm{te}} \in [0, 1]^{S \times M}$, and for regression, $E_{\mathrm{te}} \in \mathbb{R}_+^{S \times M}$. Once $E_{\mathrm{te}}$ is generated, an interspace predictive model comparison is performed.

Figure 4. Predictive performances on the Crabs data. (Boxplots of the replicated test errors $\hat{R}_{\mathrm{te}}(\hat{f}_m)$ for the methods LDA, SVM, CART, RF, GPR, kNN, Boost, Logit, and RDA.)

As a matter of fact, each optimal classifier from a given space $\mathcal{H}$ will typically perform well if the data at hand and the generator from which it came somewhat accord with the properties of the space $\mathcal{H}$. This remark is probably what prompted the famous so-called no free lunch theorem, herein stated informally. (No Free Lunch.) There is no learning method that is universally superior to all other methods on all data sets. In other words, if a learning method is presented with a data set whose inherent patterns violate its assumptions, then that learning method will underperform. Indeed, it is very humbling to see that some of the methods deemed somewhat simple sometimes hugely outperform the most sophisticated ones when compared on the basis of average out-of-sample (test) error.
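Algorithm 2 can be sketched in a few lines. In this illustration (my own toy data and two deliberately simple stand-in learning machines, not the nine machines of Figure 4), each row of the matrix $E$ holds one replication of the test errors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary classification data.
n = 150
X = np.vstack([rng.normal(0, 1, (75, 2)), rng.normal(2, 1, (75, 2))])
y = np.array([0] * 75 + [1] * 75)

def nearest_mean(Xtr, ytr, Xte):
    """Nearest-centroid classifier (a simple stand-in learning machine)."""
    m0, m1 = Xtr[ytr == 0].mean(axis=0), Xtr[ytr == 1].mean(axis=0)
    d0 = np.linalg.norm(Xte - m0, axis=1)
    d1 = np.linalg.norm(Xte - m1, axis=1)
    return (d1 < d0).astype(int)

def one_nn(Xtr, ytr, Xte):
    """1-nearest-neighbor classifier (a second stand-in machine)."""
    preds = [ytr[np.argmin(np.linalg.norm(Xtr - x, axis=1))] for x in Xte]
    return np.array(preds)

machines = [nearest_mean, one_nn]
S, tau = 30, 0.7                       # S random splits, proportion tau in training
E = np.zeros((S, len(machines)))       # matrix E of replicated test errors
for s in range(S):
    idx = rng.permutation(n)
    tr, te = idx[: int(tau * n)], idx[int(tau * n):]
    for m, machine in enumerate(machines):
        preds = machine(X[tr], y[tr], X[te])
        E[s, m] = np.mean(preds != y[te])   # 0/1 test error of machine m on split s

avte = E.mean(axis=0)   # column means: the AVTE of each machine, as in (47)
```

Boxplots of the columns of `E` give exactly the kind of interspace comparison shown in Figures 4 and 5.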
The practical empirical optimal predictive model is given by
$$\hat{f}^{(\mathrm{opt})} = \underset{\hat{f} \in \hat{\mathcal{C}}}{\mathrm{argmin}}\left\{ \mathrm{AVTE}(\hat{f}) \right\}.$$

Figure 5. Predictive performances on the ionosphere data. (Boxplots of the replicated test errors $\hat{R}_{\mathrm{te}}(\hat{f}_m)$ for the same nine methods as in Figure 4.)

Discussion and Conclusion
Modern data science and artificial intelligence greatly value the creation and construction of statistical learning machines endowed with an inherent capability to predict accurately and precisely. In this paper, we have explored the niceties and subtleties of such a goal and have demonstrated that it requires a hefty dose of care and caution and definitely calls upon a solid theoretical understanding of learnability along with a lot of artlike practical common sense. Anyone who has done practical data science knows beyond a shadow of a doubt that data has a mind of its own, and tends to resist the temptation to seek a holy grail or a unified field, or any paradigm that works perfectly all the time. Practical data science almost always forces the practitioner to solve the problem at hand as thoroughly and as idiosyncratically as possible rather than seek a one-size-fits-all method that works well everywhere. At the heart of what we suggested throughout this paper is the theoretical result known as the no free lunch theorem, which reveals, both implicitly and explicitly, that the theoretical bounds studied extensively by experts do not really help much when it comes to practically selecting the optimal predictive model. Optimal predictive modelling is, and may always be, both a science and an art, requiring both mathematical and statistical rigor along with practical computational common sense.

References
[1] Akaike H. Information theory and an extension of the maximum likelihood principle. In: Selected Papers of Hirotugu Akaike. Springer, New York, NY; 1973:199–213. MR0483125
[2] Barbieri M and Berger JO. Optimal predictive model selection, Ann. Statist., 32:870–897, 2004. MR2065192
[3] Breiman L. Bagging predictors, Machine Learning, 24:123–140, 1996.
[4] Breiman L. Random forests, Machine Learning, 45:5–32, 2001. MR3874153
[5] Clarke B, Fokoué E, Zhang H. Principles and Theory for Data Mining and Machine Learning, first edition, Springer Texts in Statistics, Springer-Verlag, 2009. MR2839778
[6] Csató L, Fokoué E, Opper M, Schottky B, Winther O. Efficient approaches to Gaussian process classification. In: Leen TK, Solla SA, Müller K-R, eds. Advances in Neural Information Processing Systems, number 12. MIT Press; 2000.
[7] Devroye L, Györfi L, Lugosi G. A Probabilistic Theory of Pattern Recognition, Stochastic Modelling and Applied Probability, Springer, New York, 1997.
[8] Domingos P. A unified bias-variance decomposition for zero-one and squared loss, AAAI/IAAI, AAAI Press, 564–569, 2000.
[9] Fokoué E. Estimation of atom prevalence for optimal prediction. In: Prediction and Discovery, Contemporary Mathematics, vol. 443. American Mathematical Society; 2007:103–129. MR2433288
[10] Hoerl A and Kennard R. Ridge regression: biased estimation for non-orthogonal problems, Technometrics, 12:55–67, 1970.
[11] Hoeting JA, Madigan D, Raftery AE, Volinsky CT. Bayesian model averaging: A tutorial, Statist. Sci., 14(4):382–417, 1999. MR1765176
[12] Kohavi R and Wolpert DH. Bias plus variance decomposition for zero-one loss functions. In: Machine Learning, Proceedings of the Thirteenth International Conference (ICML '96), Bari, Italy, July 3–6, 1996. 1996:275–283.
[13] Rosenblatt F. The perceptron: A probabilistic model for information storage and organization in the brain, Psychological Review, 65:386–408, 1958. MR0122606
[14] Schapire RE and Freund Y. Boosting: Foundations and Algorithms, The MIT Press, 2012. MR2920188
[15] Schwarz G. Estimating the dimension of a model, Ann. Statist., 6:461–464, 1978. MR0468014
[16] Stone M. An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion, J. Roy. Statist. Soc. Ser. B (Methodological), 39(1):44–47, 1977. MR501454
[17] Stone M. Cross-validatory choice and assessment of statistical predictions, J. Roy. Statist. Soc. Ser. B (Methodological), 36:111–147, 1974. MR356377
[18] Tibshirani R. Regression shrinkage and selection via the lasso, J. Roy. Statist. Soc. Ser. B (Methodological), 58(1):267–288, 1996. MR1379242
[19] Vapnik VN. The Nature of Statistical Learning Theory, Springer, 2000. MR1719582

Credits
All figures are courtesy of the author.
Author photo is courtesy of Rick Scoggins.
