Model Selection Techniques —An Overview Jie Ding, Vahid Tarokh, and Yuhong Yang
IEEE SIGNAL PROCESSING MAGAZINE

arXiv:1810.09583v1 [stat.ML] 22 Oct 2018

Abstract—In the era of "big data", analysts usually explore various statistical models or machine learning methods for observed data in order to facilitate scientific discoveries or gain predictive power. Whatever data and fitting procedures are employed, a crucial step is to select the most appropriate model or method from a set of candidates. Model selection is a key ingredient in data analysis for reliable and reproducible statistical inference or prediction, and thus central to scientific studies in fields such as ecology, economics, engineering, finance, political science, biology, and epidemiology. There has been a long history of model selection techniques arising from research in statistics, information theory, and signal processing. A considerable number of methods have been proposed, following different philosophies and exhibiting varying performances. The purpose of this article is to give a comprehensive overview of them, in terms of their motivation, large-sample performance, and applicability. We provide integrated and practically relevant discussions on the theoretical properties of state-of-the-art model selection approaches. We also share our thoughts on some controversial views on the practice of model selection.

Index Terms—Asymptotic efficiency; Consistency; Cross-validation; High dimension; Information criteria; Model selection; Optimal prediction; Variable selection.

I. WHY MODEL SELECTION

Vast developments in hardware storage, precision instrument manufacture, economic globalization, etc. have generated huge volumes of data that can be analyzed to extract useful information. Typical statistical inference or machine learning procedures learn from and make predictions on data by fitting parametric or nonparametric models (in a broad sense). However, no model is universally suitable for every data set and goal. An improper choice of model or method can lead to purely noisy "discoveries", severely misleading conclusions, or disappointing predictive performance. Therefore, a crucial step in a typical data analysis is to consider a set of candidate models (referred to as the model class), and then select the most appropriate one. In other words, model selection is the task of selecting a statistical model from a model class, given a set of data. For example, we may be interested in the selection of

• variables for linear regression,
• basis terms such as polynomials, splines, or wavelets in function estimation,
• the order of an autoregressive process,
• the number of components in a mixture model,
• the most appropriate parametric family among a number of alternatives,
• the number of change points in time series models,
• the number of neurons and layers in neural networks,
• the best choice among logistic regression, support vector machines, and neural networks,
• the best machine learning techniques for solving real-data challenges on an online competition platform.

There have been many overview papers on model selection scattered across the communities of signal processing [1], statistics [2], machine learning [3], epidemiology [4], chemometrics [5], and ecology and evolution [6]. Despite the abundant literature on model selection, existing overviews usually focus on derivations, descriptions, or applications of particular model selection principles. In this paper, we aim to provide an integrated understanding of the properties and practical performances of various approaches, by reviewing their theoretical and practical advantages, disadvantages, and relations. Our overview is characterized by the following highlighted perspectives, which we believe will bring deeper insights for the research community.
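To make the idea of choosing from a model class concrete, the following minimal Python sketch (our own illustration, not from the paper) selects a polynomial degree for regression — the second example in the list above. The data-generating process (a quadratic with Gaussian noise) and the use of a held-out sample to compare candidates are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data-generating process: y = 1 + 2x - 1.5x^2 + noise.
def sample(n):
    x = rng.uniform(-2.0, 2.0, n)
    y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(0.0, 1.0, n)
    return x, y

x_train, y_train = sample(100)
x_val, y_val = sample(1000)  # held-out data standing in for out-sample loss

# Model class: polynomials of degree m = 0, ..., 6.
# Step 1: fit each candidate by least squares (quadratic loss).
# Step 2: select the candidate with the smallest held-out loss.
losses = {}
for m in range(7):
    coef = np.polyfit(x_train, y_train, deg=m)
    pred = np.polyval(coef, x_val)
    losses[m] = float(np.mean((y_val - pred) ** 2))

best_m = min(losses, key=losses.get)
print(best_m, losses[best_m])
```

Under-parameterized candidates (degrees 0 and 1) cannot represent the quadratic trend and incur a large bias, so the selected degree is at least 2; over-parameterized candidates pay only a small variance penalty, which is why a principled selection rule is needed at all.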
This research was funded in part by the Defense Advanced Research Projects Agency (DARPA) under grant number W911NF-18-1-0134. J. Ding and Y. Yang are with the School of Statistics, University of Minnesota, Minneapolis, Minnesota 55455, United States. V. Tarokh is with the Department of Electrical and Computer Engineering, Duke University, Durham, North Carolina 27708, United States. Digital Object Identifier 10.1109/MSP.2018.2867638. © 2018 IEEE.

First of all, we provide a technical overview of foundational model selection approaches and their relations, by centering around two statistical goals, namely prediction and inference (see Section II). We then introduce rigorous developments in the understanding of two fundamentally representative model selection criteria (see Section III). We also review research developments in the problem of high-dimensional variable selection, including penalized regression and step-wise variable selection, which have been widely used in practice (see Section IV). Moreover, we review recent developments in modeling procedure selection which, differently from model selection in the narrow sense of choosing among parametric models, aims to select the better statistical or machine learning procedure (see Section V). Finally, we address some common misconceptions and controversies about model selection techniques (see Section VI), and provide a few general recommendations on the application of model selection (see Section VII). For derivations and implementations of some popular information criteria, we refer the readers to monographs and review papers such as [1], [7].

A. Some Basic Concepts

Notation: We use $M_m = \{p_{\theta_m} : \theta_m \in H_m\}$ to denote a model (in the formal probabilistic sense), which is a set of probability density functions describing the data $z_1, \ldots, z_n$. Here, $H_m$ is the parameter space associated with $M_m$. A model class, $\{M_m\}_{m \in \mathbb{M}}$, is a collection of models indexed by $m \in \mathbb{M}$. The number of models (or the cardinality of $\mathbb{M}$) can be fixed, or depend on the sample size $n$. For each model $M_m$, we denote by $d_m$ the dimension of the parameter in model $M_m$. Its log-likelihood function is written as $\theta_m \mapsto \ell_{n,m}(\theta_m) = \log p_{\theta_m}(z_1, \ldots, z_n)$, and the maximized log-likelihood value is

$$\ell_{n,m}(\hat{\theta}_m) \quad \text{with} \quad \hat{\theta}_m = \arg\max_{\theta_m \in H_m} p_{\theta_m}(z_1, \ldots, z_n), \tag{1}$$

$\hat{\theta}_m$ being the maximum likelihood estimator (MLE) under model $M_m$. We will write $\ell_{n,m}(\hat{\theta}_m)$ as $\hat{\ell}_{n,m}$ for simplicity. We use $p_*$ and $E_*$ to denote the true data-generating distribution and the expectation with respect to it, respectively. In the parametric framework, there exist some $m \in \mathbb{M}$ and some $\theta_* \in H_m$ such that $p_*$ is exactly $p_{\theta_*}$. In the nonparametric framework, $p_*$ is excluded from the model class. We sometimes call a model class $\{M_m\}_{m \in \mathbb{M}}$ well-specified (resp. mis-specified) if the data-generating process is in a parametric (resp. nonparametric) framework. We use $\to_p$ and $\to_d$ to denote convergence in probability and in distribution (under $p_*$), respectively. We use $\mathcal{N}(\mu, V)$ to denote a Gaussian distribution with mean $\mu$ and covariance $V$, $\chi^2_d$ to denote a chi-squared distribution with $d$ degrees of freedom, and $\|\cdot\|_2$ to denote the Euclidean norm. The word "variable" is often referred to as the "covariate" in a regression setting.

A typical data analysis can be thought of as consisting of two steps:

Step 1: For each candidate model $M_m = \{p_{\theta_m} : \theta_m \in H_m\}$, fit all the observed data to that model by estimating its parameter $\theta_m \in H_m$;

Step 2: Once we have a set of estimated candidate models $p_{\tilde{\theta}_m}$ ($m \in \mathbb{M}$), select the most appropriate one for either interpretation or prediction.

We note that not every data analysis and its associated model selection procedure formally relies on probability distributions. Examples of model-free methods are nearest neighbor learning, reinforcement learning, and expert learning. Before we proceed, it is helpful to first introduce the following two concepts.

The "model fitting": The fitting procedure (also called parameter estimation) for a given candidate model $M_m$ is usually achieved by minimizing the following (cumulative) loss:

$$\tilde{\theta}_m = \arg\min_{\theta_m \in H_m} \sum_{t=1}^{n} s(p_{\theta_m}, z_t). \tag{2}$$

In the objective (2), each $p_{\theta_m}$ represents a distribution for the data, and $s(\cdot, \cdot)$, referred to as the loss function (or scoring function), is used to evaluate the "goodness of fit" between a distribution and an observation. A commonly used loss function is the logarithmic loss

$$s(p, z_t) = -\log p(z_t), \tag{3}$$

the negative logarithm of the distribution evaluated at $z_t$; with this choice, the objective (2) produces the MLE for a parametric model. For time series data, (3) is written as $-\log p(z_t \mid z_1, \ldots, z_{t-1})$, and the quadratic loss $s(p, z_t) = \{z_t - E_p(z_t \mid z_1, \ldots, z_{t-1})\}^2$ is often used, where the expectation is taken over the joint distribution $p$ of $z_1, \ldots, z_t$.

The "best model": Let $\hat{p}_m = p_{\tilde{\theta}_m}$ denote the estimated distribution under model $M_m$. The predictive performance can be assessed via the out-sample prediction loss, defined as

$$E_*(s(\hat{p}_m, Z)) = \int s(\hat{p}_m(z), z)\, p_*(z)\, dz, \tag{4}$$

where $Z$ is independent of, and identically distributed as, the data used to obtain $\hat{p}_m$. Here, $Z$ does not carry the subscript $t$ because it is the out-sample data used to evaluate the predictive performance. There can be a number of variations to this in terms of the prediction loss function [8] and time dependency. In view of this definition, the best model can be naturally defined as the candidate model with the smallest out-sample prediction loss, i.e.,

$$\hat{m}_0 = \arg\min_{m \in \mathbb{M}} E_*(s(\hat{p}_m, Z)).$$

In other words, $M_{\hat{m}_0}$ is the model whose predictive power is the best offered by the candidate models. We note that the "best" is within the scope of the available data, the class of models, and the loss function.

In a parametric framework, typically the true data-generating model, if not too complicated, is the best model.

sample size. For the latter, however, the selected model may be simply the lucky winner among a few close competitors, yet the predictive performance can still be (nearly) the best possible. If so, the model selection is
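To make the definitions in (2)-(4) concrete, the following Python sketch (our own illustration; the Gaussian candidates and the true distribution $\mathcal{N}(0, 2^2)$ are illustrative assumptions) fits two candidate models by minimizing the cumulative log loss, i.e. by maximum likelihood, and approximates the out-sample prediction loss (4) with a Monte Carlo average over fresh draws from $p_*$.

```python
import math
import random

random.seed(1)

# Assumed true data-generating distribution p_*: N(0, 2^2).
def draw(n):
    return [random.gauss(0.0, 2.0) for _ in range(n)]

data = draw(200)

# Two candidate models, each fit by minimizing the log loss (2)-(3), i.e. MLE:
#   M1: N(mu, 1)        (misspecified: variance wrongly fixed at 1)
#   M2: N(mu, sigma^2)  (well-specified)
mu = sum(data) / len(data)                          # MLE of the mean (both models)
var = sum((z - mu) ** 2 for z in data) / len(data)  # MLE of the variance (M2 only)

def gauss_logpdf(z, m, v):
    return -0.5 * math.log(2 * math.pi * v) - (z - m) ** 2 / (2 * v)

# Monte Carlo estimate of the out-sample prediction loss (4) under the
# log loss, E_*[ -log p_hat(Z) ], using fresh draws from p_*.
fresh = draw(100_000)
loss = {
    "M1": -sum(gauss_logpdf(z, mu, 1.0) for z in fresh) / len(fresh),
    "M2": -sum(gauss_logpdf(z, mu, var) for z in fresh) / len(fresh),
}

# The "best model" is the candidate with the smallest out-sample loss.
best = min(loss, key=loss.get)
print(best, loss)
```

The well-specified candidate M2 attains the smaller out-sample log loss by a wide margin, matching the definition of the best model above; in practice $p_*$ is unknown and the integral in (4) must instead be estimated from data, which is precisely what the selection criteria reviewed later are designed to do.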