Model Selection: General Techniques


Statistics 203: Introduction to Regression and Analysis of Variance
Jonathan Taylor

Today

■ Outlier detection / simultaneous inference.
■ Goals of model selection.
■ Criteria to compare models.
■ (Some) model selection strategies.

Crude outlier detection test

■ If the studentized residual of an observation is large, that observation may be an outlier.
■ Problem: if n is large and we threshold at t_{1-\alpha/2,\, n-p-1}, we will flag many outliers by chance even if the model is correct.
■ Solution: Bonferroni correction, i.e. threshold at t_{1-\alpha/(2n),\, n-p-1}.

Bonferroni correction

■ If we are doing many t (or other) tests, say m > 1 of them, we can control the overall false positive rate at \alpha by testing each one at level \alpha/m.
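The per-test level \alpha/m can be applied directly on the p-value scale. A minimal sketch in Python (the slides' own examples are in R; the function name and the example p-values here are illustrative, not from the slides):

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Flag tests as significant at family-wise level alpha.

    Each of the m tests is carried out at level alpha / m, so by the
    union bound the overall false-positive rate is at most alpha.
    """
    m = len(p_values)
    threshold = alpha / m
    return [p < threshold for p in p_values]

# Five tests at overall level 0.05 -> per-test threshold 0.01.
flags = bonferroni_reject([0.004, 0.02, 0.8, 0.009, 0.11], alpha=0.05)
print(flags)  # [True, False, False, True, False]
```

The same idea gives the outlier rule above: with n studentized residuals, compare each to the t quantile at level 1 - \alpha/(2n) instead of 1 - \alpha/2.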
■ Proof:
\[
P(\text{at least one false positive})
= P\Big(\bigcup_{i=1}^m \big\{|T_i| \ge t_{1-\alpha/(2m),\, n-p-1}\big\}\Big)
\le \sum_{i=1}^m P\big(|T_i| \ge t_{1-\alpha/(2m),\, n-p-1}\big)
= \sum_{i=1}^m \frac{\alpha}{m} = \alpha.
\]
■ Known as "simultaneous inference": controlling the overall false positive rate at \alpha while performing many tests.

Simultaneous inference for \beta

■ Another common situation in which simultaneous inference occurs is simultaneous inference for \beta.
■ Using the facts that
\[
\hat\beta \sim N\big(\beta,\, \sigma^2 (X^t X)^{-1}\big), \qquad
\hat\sigma^2 \sim \sigma^2 \cdot \frac{\chi^2_{n-p}}{n-p},
\]
along with \hat\beta \perp \hat\sigma^2, leads to
\[
\frac{(\hat\beta - \beta)^t (X^t X)(\hat\beta - \beta)/p}{\hat\sigma^2}
\sim \frac{\chi^2_p / p}{\chi^2_{n-p}/(n-p)} \sim F_{p,\, n-p}.
\]
■ A (1 - \alpha) \cdot 100\% simultaneous confidence region:
\[
\big\{\beta :\ (\hat\beta - \beta)^t (X^t X)(\hat\beta - \beta) \le p\, \hat\sigma^2\, F_{p,\, n-p,\, 1-\alpha}\big\}.
\]

Model selection: goals

■ When we have many predictors (with many possible interactions), it can be difficult to find a good model.
■ Which main effects do we include?
■ Which interactions do we include?
■ Model selection tries to "simplify" this task.
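The union-bound argument in the Bonferroni proof can be checked numerically: with m true nulls, each tested at level \alpha/m, the chance of any false positive stays near or below \alpha. A small Monte Carlo sketch in Python (all names and constants are illustrative):

```python
import random

def fwer_estimate(m, alpha, n_trials=20000, seed=0):
    """Monte Carlo estimate of the family-wise error rate when each of
    m true-null tests is run at the Bonferroni level alpha / m.
    Under the null, each p-value is Uniform(0, 1)."""
    rng = random.Random(seed)
    false_positive_families = 0
    for _ in range(n_trials):
        if any(rng.random() < alpha / m for _ in range(m)):
            false_positive_families += 1
    return false_positive_families / n_trials

# Exact value for independent tests: 1 - (1 - alpha/m)^m, about 0.0488
# for m = 20 and alpha = 0.05, confirming control at level alpha.
print(fwer_estimate(m=20, alpha=0.05))
```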
Model selection: general

■ This is an "unsolved" problem in statistics: there are no magic procedures that get you the "best model."
■ In some sense, model selection is "data mining."
■ Data miners / machine learners often work with very many predictors.

Model selection: strategies

■ To "implement" model selection, we need:
  ◆ a criterion or benchmark to compare two models;
  ◆ a search strategy.
■ With a limited number of predictors, it is possible to search all possible models.

Possible criteria

■ R^2: not a good criterion. It always increases with model size, so the "optimum" is to take the biggest model.
■ Adjusted R^2: better. It penalizes bigger models.
■ Mallow's C_p.
■ Akaike's Information Criterion (AIC), Schwarz's BIC.
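The contrast between R^2 and adjusted R^2 is easy to see numerically: R^2 compares sums of squares, while adjusted R^2 compares mean squares and so charges for each extra predictor. A Python sketch with made-up sums of squares:

```python
def r2(sse, sst):
    """Plain R^2: fraction of total variation explained."""
    return 1 - sse / sst

def adjusted_r2(sse, sst, n, p):
    """Adjusted R^2: replace sums of squares by mean squares.
    p counts the predictors, excluding the intercept."""
    return 1 - (sse / (n - p - 1)) / (sst / (n - 1))

n, sst = 30, 100.0
small = adjusted_r2(sse=40.0, sst=sst, n=n, p=2)   # 2 predictors
big   = adjusted_r2(sse=39.0, sst=sst, n=n, p=10)  # 8 extra predictors
print(r2(40.0, sst) < r2(39.0, sst))  # True: plain R^2 always prefers the big model
print(small > big)                    # True: adjusted R^2 need not
```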
Mallow's C_p

■ Definition:
\[
C_p(M) = \frac{SSE(M)}{\hat\sigma^2} - n + 2\, p(M).
\]
■ \hat\sigma^2 = SSE(F)/df_F is the "best" estimate of \sigma^2 we have (use the fullest model F).
■ SSE(M) = \|Y - \hat Y_M\|^2 is the SSE of the model M.
■ p(M) is the number of predictors in M, or the degrees of freedom used up by the model.
■ Based on an estimate of the scaled prediction error
\[
\frac{1}{\sigma^2} \sum_{i=1}^n E\big(\hat Y_i - E(Y_i)\big)^2
= \frac{1}{\sigma^2} \sum_{i=1}^n \Big[\big(E(\hat Y_i) - E(Y_i)\big)^2 + \operatorname{Var}(\hat Y_i)\Big],
\]
a squared bias plus a variance term.

AIC & BIC

■ Mallow's C_p is (almost) a special case of the Akaike Information Criterion (AIC):
\[
AIC(M) = -2 \log L(M) + 2\, p(M).
\]
■ L(M) is the likelihood function of the parameters in model M, evaluated at the MLE (maximum likelihood estimators).
■ Schwarz's Bayesian Information Criterion (BIC):
\[
BIC(M) = -2 \log L(M) + p(M) \cdot \log n.
\]

Maximum likelihood estimation

■ If the model is correct, then the log-likelihood of (\beta, \sigma) is
\[
\log L(\beta, \sigma \mid X, Y)
= -\frac{n}{2}\big(\log(2\pi) + \log \sigma^2\big) - \frac{1}{2\sigma^2}\|Y - X\beta\|^2,
\]
where Y is the vector of observed responses.
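Computing C_p is immediate once \hat\sigma^2 from the fullest model is in hand. A Python sketch with illustrative numbers (the SSE values and \hat\sigma^2 below are made up):

```python
def mallows_cp(sse_m, sigma2_hat, n, p_m):
    """Cp(M) = SSE(M) / sigma2_hat - n + 2 * p(M),
    where sigma2_hat = SSE(F) / df_F comes from the fullest model."""
    return sse_m / sigma2_hat - n + 2 * p_m

n = 50
sigma2_hat = 2.0  # SSE(full) / df_full, assumed given
print(mallows_cp(sse_m=110.0, sigma2_hat=sigma2_hat, n=n, p_m=4))  # 13.0
print(mallows_cp(sse_m=96.0,  sigma2_hat=sigma2_hat, n=n, p_m=6))  # 10.0
```

For a model with negligible bias, C_p(M) is close to p(M), which is why small C_p values near p(M) are attractive.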
■ The MLE for \beta in this case is the same as the least squares estimate, because the first term does not depend on \beta.
■ MLE for \sigma^2:
\[
\frac{\partial}{\partial \sigma^2} \log L(\beta, \sigma)\Big|_{\hat\beta,\, \hat\sigma^2}
= -\frac{n}{2\hat\sigma^2} + \frac{1}{2\hat\sigma^4}\|Y - X\hat\beta\|^2 = 0.
\]
■ Solving for \hat\sigma^2:
\[
\hat\sigma^2_{MLE} = \frac{1}{n}\|Y - X\hat\beta\|^2 = \frac{1}{n} SSE(M).
\]
Note that this MLE is biased.

AIC for a linear model

■ Using \hat\beta_{MLE} = \hat\beta and
\[
\hat\sigma^2_{MLE} = \frac{1}{n} SSE(M),
\]
we see that the AIC of a multiple linear regression model is
\[
AIC(M) = n\big(\log(2\pi) + \log(SSE(M)) - \log(n)\big) + n + 2\,(p(M) + 1).
\]
■ If \sigma^2 is known, then
\[
AIC(M) = n\big(\log(2\pi) + \log(\sigma^2)\big) + \frac{SSE(M)}{\sigma^2} + 2\, p(M),
\]
which is almost C_p(M) + K_n for a constant K_n.

Search strategies

■ "Best subset": search all possible models and take the one with the highest adjusted R^2 or the lowest C_p.
■ Stepwise (forward, backward or both): useful when the number of predictors is large. Choose an initial model and be "greedy."
■ "Greedy" means always take the biggest jump (up or down) in your selected criterion.

Implementations in R

■ "Best subset": use the function leaps. Works only for multiple linear regression models.
■ Stepwise: use the function step.
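With \sigma^2 profiled out, the AIC of a Gaussian linear model is a direct function of SSE(M), n, and p(M): shrinking SSE lowers AIC, but each extra parameter costs 2. A Python sketch with made-up SSE values:

```python
import math

def aic_linear(sse, n, p):
    """AIC of a Gaussian linear model with sigma^2 profiled out:
    n (log(2 pi) + log(SSE) - log(n)) + n + 2 (p + 1),
    counting p + 1 free parameters (coefficients plus sigma^2)."""
    return (n * (math.log(2 * math.pi) + math.log(sse) - math.log(n))
            + n + 2 * (p + 1))

# Two extra predictors drop SSE from 110 to 96: here the fit gain
# outweighs the 2-per-parameter penalty, so AIC decreases.
print(aic_linear(sse=110.0, n=50, p=4) > aic_linear(sse=96.0, n=50, p=6))  # True
```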
It works for any model with an Akaike Information Criterion (AIC). In multiple linear regression, AIC is (almost) a linear function of C_p.

Caveats

■ Many other "criteria" have been proposed.
■ Some work well for some types of data, others for different data.
■ These criteria are not "direct measures" of predictive power.
■ Later we will see cross-validation, which is an estimate of predictive power.
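The greedy search behind a stepwise procedure like R's step can be sketched abstractly: given any function scoring a set of predictors (lower is better, e.g. AIC), repeatedly add the predictor giving the biggest improvement and stop when nothing helps. A Python sketch; the toy score table standing in for AIC values is entirely made up:

```python
def forward_stepwise(candidates, criterion):
    """Greedy forward search: start from the empty model and, at each
    step, add the single predictor that most improves `criterion`
    (lower is better); stop when no addition improves it."""
    current, best = frozenset(), criterion(frozenset())
    while True:
        moves = [(criterion(current | {c}), current | {c})
                 for c in candidates - current]
        if not moves:
            return current, best
        score, model = min(moves, key=lambda t: t[0])
        if score >= best:
            return current, best
        current, best = model, score

# Toy criterion table standing in for the AIC of each submodel.
scores = {frozenset(): 100.0,
          frozenset({"x1"}): 80.0, frozenset({"x2"}): 90.0,
          frozenset({"x3"}): 95.0,
          frozenset({"x1", "x2"}): 70.0, frozenset({"x1", "x3"}): 85.0,
          frozenset({"x2", "x3"}): 88.0,
          frozenset({"x1", "x2", "x3"}): 72.0}
model, score = forward_stepwise({"x1", "x2", "x3"}, scores.__getitem__)
print(sorted(model), score)  # ['x1', 'x2'] 70.0
```

Note the greedy stop: the full model (72.0) is never reached because adding x3 to {x1, x2} would worsen the score, exactly the behavior the slides describe.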