Model Selection Techniques: An Overview
Jie Ding, Vahid Tarokh, and Yuhong Yang

In the era of big data, analysts usually explore various statistical models or machine-learning methods for observed data to facilitate scientific discoveries or gain predictive power. Whatever data and fitting procedures are employed, a crucial step is to select the most appropriate model or method from a set of candidates. Model selection is a key ingredient in data analysis for reliable and reproducible statistical inference or prediction, and thus it is central to scientific studies in such fields as ecology, economics, engineering, finance, political science, biology, and epidemiology. There has been a long history of model selection techniques arising from research in statistics, information theory, and signal processing. A considerable number of methods have been proposed, following different philosophies and exhibiting varying performances. The purpose of this article is to provide a comprehensive overview of them in terms of their motivation, large-sample performance, and applicability. We provide integrated and practically relevant discussions on the theoretical properties of state-of-the-art model selection approaches. We also share our thoughts on some controversial views on the practice of model selection.

Why model selection

Vast developments in hardware storage, precision instrument manufacturing, economic globalization, and so forth have generated huge volumes of data that can be analyzed to extract useful information. Typical statistical inference or machine-learning procedures learn from and make predictions on data by fitting parametric or nonparametric models (in a broad sense). However, there exists no model that is universally suitable for any data and goal. An improper choice of model or method can lead to purely noisy discoveries, severely misleading conclusions, or disappointing predictive performance. Therefore, a crucial step in a typical data analysis is to consider a set of candidate models (referred to as the model class) and then select the most appropriate one. In other words, model selection is the task of selecting a statistical model from a model class, given a set of data. We may be interested, e.g., in the selection of the following (a concrete sketch of one such candidate model class is given right after the list):
■ variables for linear regression (the word variable is often referred to as the covariate in a regression setting)
■ basis terms, such as polynomials, splines, or wavelets, in function estimation
■ order of an autoregressive (AR) process
■ number of components in a mixture model
■ most appropriate parametric family among a number of alternatives
■ number of change points in time series models
■ number of neurons and layers in neural networks
■ best choice among logistic regression, support vector machine, and neural networks
■ best machine-learning techniques for solving real-world data challenges on an online competition platform.
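As a concrete illustration of the model-class idea, the following is a minimal sketch (our own, not taken from the article): for the function-estimation problem above, candidate model M_m uses the polynomial basis terms 1, x, ..., x^m, and each candidate is fitted by least squares. The data, degrees, and function names are hypothetical.

```python
# Minimal illustrative sketch (hypothetical data and names): a model class
# {M_1, ..., M_8}, where M_m fits a polynomial with basis terms 1, x, ..., x^m.
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(-1.0, 1.0, size=n)
y = np.sin(2.0 * x) + 0.3 * rng.standard_normal(n)  # simulated observations

def fit_polynomial(x, y, degree):
    """Least-squares fit of candidate model M_degree (basis 1, x, ..., x^degree)."""
    X = np.vander(x, degree + 1, increasing=True)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# One fitted candidate per degree; choosing among them is a model selection task.
model_class = {m: fit_polynomial(x, y, m) for m in range(1, 9)}
```

Choosing the degree m from these fitted candidates is exactly the kind of model selection task discussed in the rest of the article.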
There have been many overview papers on model selection scattered across the communities of signal processing [1], statistics [2], machine learning [3], epidemiology [4], chemometrics [5], and ecology and evolution [6]. Despite the abundant literature on model selection, existing overviews usually focus on derivations, descriptions, or applications of particular model selection principles. In this article, we aim to provide an integrated understanding of the properties and practical performances of various approaches by reviewing their theoretical and practical advantages, disadvantages, and relations.

Some basic concepts

Notation

We use $\mathcal{M}_m = \{p_{\theta_m} : \theta_m \in \mathcal{H}_m\}$ to denote a model (in the formal probabilistic sense), which is a set of probability density functions describing the data $z_1, \ldots, z_n$. Here, $\mathcal{H}_m$ is the parameter space associated with $\mathcal{M}_m$. A model class, $\{\mathcal{M}_m\}_{m \in \mathbb{M}}$, is a collection of models indexed by $m \in \mathbb{M}$. The number of models (or the cardinality of $\mathbb{M}$) can be fixed or depend on the sample size $n$. For each model $\mathcal{M}_m$, we denote by $d_m$ the dimension of the parameter in model $\mathcal{M}_m$. Its log-likelihood function is written as $\theta_m \mapsto \ell_{n,m}(\theta_m) = \log p_{\theta_m}(z_1, \ldots, z_n)$, and the maximized log-likelihood value is $\ell_{n,m}(\hat{\theta}_m)$, with

$$\hat{\theta}_m = \operatorname*{arg\,max}_{\theta_m \in \mathcal{H}_m} p_{\theta_m}(z_1, \ldots, z_n), \qquad (1)$$

the maximum likelihood estimator (MLE) under model $\mathcal{M}_m$. We will write $\ell_{n,m}(\hat{\theta}_m)$ as $\hat{\ell}_{n,m}$ for simplicity. We use $p_*$ and $E_*$ to denote the true data-generating distribution and expectation with respect to the true data-generating distribution, respectively. In the parametric framework, there exist some $m \in \mathbb{M}$ and some $\theta_* \in \mathcal{H}_m$ such that $p_*$ is exactly $p_{\theta_*}$. In the nonparametric framework, $p_*$ is excluded from the model class. We sometimes call a model class $\{\mathcal{M}_m\}_{m \in \mathbb{M}}$ well specified (respectively, misspecified) if the data-generating process is in a parametric (respectively, nonparametric) framework. We use $\rightarrow_p$ and $\rightarrow_d$ to denote convergence in probability and in distribution (under $p_*$), respectively. We use $N(\mu, V)$ to denote a Gaussian distribution with mean $\mu$ and covariance $V$, $\chi^2_d$ to denote a chi-squared distribution with $d$ degrees of freedom, and $\|\cdot\|_2$ to denote the Euclidean norm.

A typical data analysis can be thought of as consisting of two steps.
■ Step 1: For each candidate model $\mathcal{M}_m = \{p_{\theta_m}, \theta_m \in \mathcal{H}_m\}$, fit all of the observed data to that model by estimating its parameter $\theta_m \in \mathcal{H}_m$.
■ Step 2: Once we have a set of estimated candidate models $p_{\hat{\theta}_m}$ ($m \in \mathbb{M}$), select the most appropriate one for either interpretation or prediction.
We note that not every data analysis and its associated model selection procedure formally relies on probability distributions. Examples of model-free methods are nearest-neighbor learning, certain reinforcement learning, and expert learning. Before we proceed, it is helpful to first introduce the following two concepts: the model fitting and the best model.

The model fitting

The fitting procedure (also called parameter estimation) for a given candidate model $\mathcal{M}_m$ is usually achieved by minimizing the following (cumulative) loss:

$$\hat{\theta}_m = \operatorname*{arg\,min}_{\theta_m \in \mathcal{H}_m} \sum_{t=1}^{n} s(p_{\theta_m}, z_t). \qquad (2)$$

In (2), each $p_{\theta_m}$ represents a distribution for the data, and $s(\cdot,\cdot)$, referred to as the loss function (or scoring function), is used to evaluate the goodness of fit between a distribution and the observation. A commonly used loss function is the logarithmic loss

$$s(p, z_t) = -\log p(z_t), \qquad (3)$$

the negative logarithm of the distribution of $z_t$. Then, (2) produces the MLE for a parametric model. For time series data, (3) is written as $-\log p(z_t \mid z_1, \ldots, z_{t-1})$, and the quadratic loss $s(p, z_t) = \{z_t - E_p(z_t \mid z_1, \ldots, z_{t-1})\}^2$ is often used, where the expectation is taken over the joint distribution $p$ of $z_1, \ldots, z_t$.
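To make the fitting step concrete, the following minimal sketch (our own illustration, not from the article) fits a single Gaussian candidate model to i.i.d. data by numerically minimizing the cumulative logarithmic loss in (2) and (3); the parameterization and the use of SciPy are assumptions of this sketch.

```python
# Minimal sketch of (2) with the logarithmic loss (3): fit a Gaussian candidate
# model by minimizing the cumulative negative log-likelihood (hypothetical data).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
z = rng.normal(loc=2.0, scale=1.5, size=200)  # observations z_1, ..., z_n

def cumulative_log_loss(theta, z):
    """Sum over t of s(p_theta, z_t) = -log p_theta(z_t), with theta = (mu, log sigma)."""
    mu, log_sigma = theta
    return -np.sum(norm.logpdf(z, loc=mu, scale=np.exp(log_sigma)))

# Minimizing the cumulative logarithmic loss yields the MLE under this model.
result = minimize(cumulative_log_loss, x0=np.array([0.0, 0.0]), args=(z,))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])

# Closed-form Gaussian MLE for comparison: sample mean and (biased) standard deviation.
print(mu_hat, sigma_hat, z.mean(), z.std())
```

Parameterizing the scale through its logarithm keeps the optimization unconstrained; with the logarithmic loss, the minimizer agrees with the closed-form Gaussian MLE.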
The best model

Let $\hat{p}_m = p_{\hat{\theta}_m}$ denote the estimated distribution under model $\mathcal{M}_m$. The predictive performance can be assessed via the out-sample prediction loss, defined as

$$E_*\{s(\hat{p}_m, Z)\} = \int s(\hat{p}_m, z)\, p_*(z)\, dz, \qquad (4)$$

where $Z$ is independent of, and identically distributed as, the data used to obtain $\hat{p}_m$. Here, $Z$ does not have the subscript $t$ because it is the out-sample data used to evaluate the predictive performance. There can be a number of variations to this in terms of the prediction loss function [8] and time dependency. In view of this definition, the best model can be naturally defined as the candidate model with the smallest out-sample prediction loss, i.e.,

$$\hat{m}_0 = \operatorname*{arg\,min}_{m \in \mathbb{M}} E_*\{s(\hat{p}_m, Z)\}.$$

In other words, $\mathcal{M}_{\hat{m}_0}$ is the model whose predictive power is the best offered by the candidate models. We note that the best is in the scope of the available data, the class of models, and the loss function.

In a parametric framework, typically the true data-generating model, if not too complicated, is the best model. In this vein, if the true density function $p_*$ belongs to some model $\mathcal{M}_m$ or, equivalently, $p_* = p_{\theta_*}$ for some $\theta_* \in \mathcal{H}_m$ and $m \in \mathbb{M}$, then we seek to select such an $\mathcal{M}_m$ (from $\{\mathcal{M}_m\}_{m \in \mathbb{M}}$) with probability going to one as the sample size increases, which is called consistency in model selection.

Data analysis can serve two different objectives: one is to understand the data-generating process for scientific insight, and the other is prediction, in which case the data scientist does not necessarily care about obtaining an accurate probabilistic description of the data. Of course, one may also be interested in both directions.

In tune with the two different objectives, model selection can also have two directions: model selection for inference and model selection for prediction. The first is intended to identify the best model for the data, which hopefully provides a reliable characterization of the sources of uncertainty for scientific insight and interpretation. The second is to choose a model as a vehicle to arrive at a model or method that offers top performance. For the former goal, it is crucially important that the selected model is not too sensitive to the sample size. For the latter, however, the selected model may simply be the lucky winner among a few close competitors, yet the predictive performance can still be (nearly) the best possible. If so, the model selection is perfectly fine for the second goal (prediction), but the use of the selected model for insight and interpretation may be severely unreliable and misleading. Associated with the first goal of model selection for inference, or identifying the best candidate, is the following concept of selection consistency.

Definition 1
A model selection procedure is consistent if the best model is selected with probability going to one as the sample size increases.
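The following minimal simulation sketch (our own illustration; the distributions and names are hypothetical) approximates the out-sample prediction loss (4) under the logarithmic loss by Monte Carlo, which is possible here only because the true distribution $p_*$ is known, and then selects the candidate playing the role of $\hat{m}_0$.

```python
# Minimal simulation sketch: approximate the out-sample prediction loss (4)
# under the logarithmic loss and select the candidate with the smallest loss.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
z_train = rng.normal(loc=0.0, scale=1.0, size=100)    # observed data from p_* = N(0, 1)
z_out = rng.normal(loc=0.0, scale=1.0, size=100_000)  # fresh draws used to approximate E_*

# Two candidate models, each fitted to the training data by maximum likelihood.
candidates = {
    "gaussian": stats.norm(loc=z_train.mean(), scale=z_train.std()),
    "laplace": stats.laplace(loc=np.median(z_train),
                             scale=np.mean(np.abs(z_train - np.median(z_train)))),
}

# Approximate E_*[s(p_hat_m, Z)] = E_*[-log p_hat_m(Z)] for each candidate.
out_sample_loss = {m: -np.mean(dist.logpdf(z_out)) for m, dist in candidates.items()}
best_model = min(out_sample_loss, key=out_sample_loss.get)  # plays the role of m_hat_0
print(out_sample_loss, best_model)
```

In practice, $p_*$ is unknown, so the out-sample loss in (4) has to be estimated from the observed data rather than computed exactly as in this simulation.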