Minimum Description Length Model Selection


Minimum Description Length Model Selection: Problems and Extensions
Steven de Rooij

ILLC Dissertation Series DS-2008-07. For further information about ILLC publications, please contact: Institute for Logic, Language and Computation, Universiteit van Amsterdam, Plantage Muidergracht 24, 1018 TV Amsterdam; phone: +31-20-525 6051; fax: +31-20-525 5206; e-mail: [email protected]; homepage: http://www.illc.uva.nl/

Academic dissertation for the degree of doctor at the Universiteit van Amsterdam, under the authority of the Rector Magnificus prof. dr. D. C. van den Boom, to be defended in public before a committee appointed by the Doctorate Board, in the Agnietenkapel on Wednesday 10 September 2008 at 12:00, by Steven de Rooij, born in Amersfoort.

Doctorate committee:
Promotor: prof. dr. ir. P.M.B. Vitányi
Co-promotor: dr. P.D. Grünwald
Other members: prof. dr. P.W. Adriaans, prof. dr. A.P. Dawid, prof. dr. C.A.J. Klaassen, prof. dr. ir. R.J.H. Scha, dr. M.W. van Someren, prof. dr. V. Vovk, dr. ir. F.M.J. Willems
Faculteit der Natuurwetenschappen, Wiskunde en Informatica

The investigations were funded by the Netherlands Organization for Scientific Research (NWO), project 612.052.004 on Universal Learning, and were supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778.

Copyright © 2008 by Steven de Rooij. Cover inspired by Randall Munroe's xkcd webcomic (www.xkcd.org). Printed and bound by PrintPartners Ipskamp (www.ppi.nl). ISBN: 978-90-5776-181-2

Parts of this thesis are based on material contained in the following papers:

  • An Empirical Study of Minimum Description Length Model Selection with Infinite Parametric Complexity. Steven de Rooij and Peter Grünwald. In: Journal of Mathematical Psychology, special issue on model selection, Vol. 50, pp. 180-190, 2006. (Chapter 2)
  • Asymptotic Log-Loss of Prequential Maximum Likelihood Codes. Peter D. Grünwald and Steven de Rooij. In: Proceedings of the 18th Annual Conference on Computational Learning Theory (COLT), pp. 652-667, June 2005. (Chapter 3)
  • MDL Model Selection using the ML Plug-in Code. Steven de Rooij and Peter Grünwald. In: Proceedings of the International Symposium on Information Theory (ISIT), 2005. (Chapters 2 and 3)
  • Approximating Rate-Distortion Graphs of Individual Data: Experiments in Lossy Compression and Denoising. Steven de Rooij and Paul Vitányi. Submitted to: IEEE Transactions on Computers. (Chapter 6)
  • Catching up Faster in Bayesian Model Selection and Averaging. Tim van Erven, Peter D. Grünwald and Steven de Rooij. In: Advances in Neural Information Processing Systems 20 (NIPS 2007). (Chapter 5)
  • Expert Automata for Efficient Tracking. W. Koolen and S. de Rooij. In: Proceedings of the 21st Annual Conference on Computational Learning Theory (COLT) 2008. (Chapter 4)

Contents

Acknowledgements

1 The MDL Principle for Model Selection
  1.1 Encoding Hypotheses
    1.1.1 Luckiness
    1.1.2 Infinitely Many Hypotheses
    1.1.3 Ideal MDL
  1.2 Using Models to Encode the Data
    1.2.1 Codes and Probability Distributions
    1.2.2 Sequential Observations
    1.2.3 Model Selection and Universal Coding
  1.3 Applications of MDL
    1.3.1 Truth Finding
    1.3.2 Prediction
    1.3.3 Separating Structure from Noise
  1.4 Organisation of this Thesis

2 Dealing with Infinite Parametric Complexity
  2.1 Universal Codes and Parametric Complexity
  2.2 The Poisson and Geometric Models
  2.3 Four Ways to Deal with Infinite Parametric Complexity
    2.3.1 BIC/ML
    2.3.2 Restricted ANML
    2.3.3 Two-part ANML
    2.3.4 Renormalised Maximum Likelihood
    2.3.5 Prequential Maximum Likelihood
    2.3.6 Objective Bayesian Approach
  2.4 Experiments
    2.4.1 Error Probability
    2.4.2 Bias
    2.4.3 Calibration
  2.5 Discussion of Results
    2.5.1 Poor Performance of the PML Criterion
    2.5.2 ML/BIC
    2.5.3 Basic Restricted ANML
    2.5.4 Objective Bayes and Two-part Restricted ANML
  2.6 Summary and Conclusion

3 Behaviour of Prequential Codes under Misspecification
  3.1 Main Result, Informally
  3.2 Main Result, Formally
  3.3 Redundancy vs. Regret
  3.4 Discussion
    3.4.1 A Justification of Our Modification of the ML Estimator
    3.4.2 Prequential Models with Other Estimators
    3.4.3 Practical Significance for Model Selection
    3.4.4 Theoretical Significance
  3.5 Building Blocks of the Proof
    3.5.1 Results about Deviations between Average and Mean
    3.5.2 Lemma 3.5.6: Redundancy for Exponential Families
    3.5.3 Lemma 3.5.8: Convergence of the Sum of the Remainder Terms
  3.6 Proof of Theorem 3.3.1
  3.7 Conclusion and Future Work

4 Interlude: Prediction with Expert Advice
  4.1 Expert Sequence Priors
    4.1.1 Examples
  4.2 Expert Tracking using HMMs
    4.2.1 Hidden Markov Models Overview
    4.2.2 HMMs as ES-Priors
    4.2.3 Examples
    4.2.4 The HMM for Data
    4.2.5 The Forward Algorithm and Sequential Prediction
  4.3 Zoology
    4.3.1 Universal Elementwise Mixtures
    4.3.2 Fixed Share
    4.3.3 Universal Share
    4.3.4 Overconfident Experts
  4.4 New Models to Switch between Experts
    4.4.1 Switch Distribution
    4.4.2 Run-length Model
    4.4.3 Comparison
  4.5 Extensions
    4.5.1 Fast Approximations
    4.5.2 Data-Dependent Priors
    4.5.3 An Alternative to MAP Data Analysis
  4.6 Conclusion

5 Slow Convergence: the Catch-up Phenomenon
  5.1 The Switch-Distribution for Model Selection and Prediction
    5.1.1 Preliminaries
    5.1.2 Model Selection and Prediction
    5.1.3 The Switch-Distribution
  5.2 Consistency
  5.3 Optimal Risk Convergence Rates
    5.3.1 The Switch-Distribution vs Bayesian Model Averaging
    5.3.2 Improved Convergence Rate: Preliminaries
    5.3.3 The Parametric Case
    5.3.4 The Nonparametric Case
    5.3.5 Example: Histogram Density Estimation
  5.4 Further Discussion of Convergence in the Nonparametric Case
    5.4.1 Convergence Rates in Sum
    5.4.2 Beyond Histograms
  5.5 Efficient Computation
  5.6 Relevance and Earlier Work
    5.6.1 AIC-BIC; Yang's Result
    5.6.2 Prediction with Expert Advice
  5.7 The Catch-Up Phenomenon, Bayes and Cross-Validation
    5.7.1 The Catch-Up Phenomenon is Unbelievable! (according to pbma)
    5.7.2 Nonparametric Bayes
    5.7.3 Leave-One-Out Cross-Validation
  5.8 Conclusion and Future Work
  5.9 Proofs
    5.9.1 Proof of Theorem 5.2.1
    5.9.2 Proof of Theorem 5.2.2
    5.9.3 Mutual Singularity as Used in the Proof of Theorem 5.2.1
    5.9.4 Proof of Theorem 5.5.1

6 Individual Sequence Rate Distortion
  6.1 Algorithmic Rate-Distortion
    6.1.1 Distortion Spheres, the Minimal Sufficient Statistic
    6.1.2 Applications: Denoising and Lossy Compression
  6.2 Computing Individual Object Rate-Distortion
    6.2.1 Compressor (Rate Function)
    6.2.2 Code Length Function
    6.2.3 Distortion Functions
    6.2.4 Searching for the Rate-Distortion Function
    6.2.5 Genetic Algorithm
  6.3 Experiments
    6.3.1 Names of Objects
  6.4 Results and Discussion
    6.4.1 Lossy Compression
    6.4.2 Denoising
  6.5 Quality of the Approximation
  6.6 An MDL Perspective
  6.7 Conclusion

Bibliography
Notation
Abstract
Samenvatting

Acknowledgements

When I took prof. Vitányi's course on Kolmogorov complexity, I was delighted by his seemingly endless supply of anecdotes, opinions and wisdom that ranged the whole spectrum from politically incorrect jokes to profound philosophical insights. At the time I was working on my master's thesis on data compression with Breanndán Ó Nualláin, which may have primed me for the course; at any rate, I did well enough and Paul Vitányi and Peter Grünwald (who taught a couple of guest lectures) singled me out as a suitable PhD candidate, which is how I got the job. This was a very lucky turn of events for me, because I was then, and am still, very much attracted to that strange mix of computer science, artificial intelligence, information theory and statistics that makes up machine learning and learning theory. While Paul's commonsense suggestions and remarks kept me steady, Peter has performed the job of PhD supervision to perfection: besides pointing me towards interesting research topics, teaching me to write better papers, and introducing me to several conferences, we also enjoyed many interesting discussions and conversations, as often about work as not, and he said the right things at the right times to keep my spirits up. He even helped me think about my future career and dropped my name here and there. In short, I wish to thank Paul and Peter for their excellent support. Some time into my research I was joined by fellow PhD students Tim van Erven and Wouter Koolen. I know both of them from college, when I did some work as a lab assistant; Wouter and I in fact worked together on the Kolmogorov complexity course.
Recommended publications
  • Overall Objective Priors
    Bayesian Analysis (2015) 10, Number 1, pp. 189–221. Overall Objective Priors. James O. Berger, José M. Bernardo, and Dongchu Sun. Abstract: In multi-parameter models, reference priors typically depend on the parameter or quantity of interest, and it is well known that this is necessary to produce objective posterior distributions with optimal properties. There are, however, many situations where one is simultaneously interested in all the parameters of the model or, more realistically, in functions of them that include aspects such as prediction, and it would then be useful to have a single objective prior that could safely be used to produce reasonable posterior inferences for all the quantities of interest. In this paper, we consider three methods for selecting a single objective prior and study, in a variety of problems including the multinomial problem, whether or not the resulting prior is a reasonable overall prior. Keywords: Joint Reference Prior, Logarithmic Divergence, Multinomial Model, Objective Priors, Reference Analysis. 1 Introduction. 1.1 The problem. Objective Bayesian methods, where the formal prior distribution is derived from the assumed model rather than assessed from expert opinions, have a long history (see e.g., Bernardo and Smith, 1994; Kass and Wasserman, 1996, and references therein). Reference priors (Bernardo, 1979, 2005; Berger and Bernardo, 1989, 1992a,b; Berger, Bernardo and Sun, 2009, 2012) are a popular choice of objective prior. Other interesting developments involving objective priors include Clarke and Barron (1994), Clarke and Yuan (2004), Consonni, Veronese and Gutiérrez-Peña (2004), De Santis et al. (2001), De Santis (2006), Datta and Ghosh (1995a; 1995b), Datta and Ghosh (1996), Datta et al.
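As background for the entry above: for a one-parameter model the best-known objective prior is the Jeffreys (reference) prior. The display below is a standard textbook formula added only for orientation, not taken from the paper; in multi-parameter models the reference prior generally departs from it and depends on which quantity is of interest, which is precisely the difficulty the paper addresses.

```latex
% Jeffreys prior: the standard default objective prior for a model p(x | theta),
% proportional to the square root of the determinant of the Fisher information.
\pi_J(\theta) \;\propto\; \sqrt{\det I(\theta)},
\qquad
I(\theta)_{ij} \;=\; -\,\mathbb{E}_{x \mid \theta}\!\left[
  \frac{\partial^2}{\partial\theta_i \,\partial\theta_j} \log p(x \mid \theta)\right].
```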
  • Model Selection for Optimal Prediction in Statistical Machine Learning
    Model Selection for Optimal Prediction in Statistical Machine Learning. Ernest Fokoué. Introduction: At the core of all our modern-day advances in artificial intelligence is the emerging field of statistical machine learning (SML). From a very general perspective, SML can be thought of as a field of mathematical sciences that combines mathematics, probability, statistics, and computer science with several ideas from cognitive neuroscience and psychology to inspire the creation, invention, and discovery of abstract models that attempt to learn and extract patterns from the data. One could think of SML as a field of science dedicated to building models endowed with the ability to learn from the data in ways similar to the ways humans learn, with the ultimate goal of understanding and then mastering our complex world well enough to predict its unfolding as accurately as possible. One of the earliest applications of statistical machine learning centered around the now ubiquitous MNIST benchmark task, which consists of building statistical models (also known as learning machines) that automatically learn and accurately recognize handwritten digits from the United States Postal Service (USPS). A typical deployment of an ... Theoretical Foundations: It is typical in statistical machine learning that a given problem will be solved in a wide variety of different ways. (Ernest Fokoué is a professor of statistics at Rochester Institute of Technology; his email address is [email protected]. Communicated by Notices Associate Editor Emilie Purvine. DOI: https://doi.org/10.1090/noti2014. Notices of the American Mathematical Society, February 2020.)
  • IR-IITBHU at TREC 2016 Open Search Track: Retrieving Documents Using Divergence from Randomness Model in Terrier
    IR-IITBHU at TREC 2016 Open Search Track: Retrieving documents using Divergence From Randomness model in Terrier. Mitodru Niyogi (Department of Information Technology, Government College of Engineering & Ceramic Technology, Kolkata) and Sukomal Pal (Department of Computer Science & Engineering, Indian Institute of Technology (BHU), Varanasi). Abstract: In our participation at the TREC 2016 Open Search Track, which focuses on ad-hoc scientific literature search, we used Terrier, a modular and scalable Information Retrieval framework, as a tool to rank documents. The organizers provided live data as documents, queries and user interactions from a real search engine that were available through the Living Lab API framework. The data was then converted into TREC format to be used in Terrier. We used the Divergence from Randomness (DFR) model, specifically the inverse expected document frequency model for randomness, the ratio of two Bernoulli processes for first normalisation, and normalisation 2 for term frequency normalisation with natural logarithm, i.e., the In_expC2 model, to rank the available documents in response to a set of queries. Altogether we submitted 391 runs for the sites CiteSeerX and SSOAR to the Open Search Track via the Living Lab API framework. We received an 'outcome' of 0.72 for test queries and 0.62 for train queries of site CiteSeerX at the end of the Round 3 Test Period, where the outcome is computed as #wins / (#wins + #losses). A 'win' occurs when the participant achieves more clicks on his documents than those of the site, and a 'loss' otherwise. Complete relevance judgments are awaited at the moment. We look forward to getting the users' feedback and working further with them.
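The 'outcome' statistic quoted in this entry is straightforward to reproduce. The following minimal Python sketch shows the computation; the win/loss counts are hypothetical and chosen only to illustrate the formula, not the actual TREC 2016 tallies.

```python
def outcome(wins: int, losses: int) -> float:
    """Living Lab outcome metric: #wins / (#wins + #losses); ties do not enter the formula."""
    if wins + losses == 0:
        raise ValueError("no queries with a win or a loss")
    return wins / (wins + losses)

# Hypothetical counts chosen only so that the ratio matches the reported 0.72.
print(round(outcome(wins=72, losses=28), 2))  # 0.72
```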
  • Linear Regression: Goodness of Fit and Model Selection
    Linear Regression: Goodness of Fit and Model Selection. Goodness of fit: Goodness of fit measures for linear regression are attempts to understand how well a model fits a given set of data. Models almost never describe the process that generated a dataset exactly; models approximate reality. However, even models that approximate reality can be used to draw useful inferences or to predict future observations. 'All models are wrong, but some are useful' - George Box. We have seen how to check the modelling assumptions of linear regression: checking the linearity assumption, checking for outliers, checking the normality assumption, and checking that the distribution of the residuals does not depend on the predictors. These are essential qualitative checks of goodness of fit. Sample size: When making visual checks of data for goodness of fit it is important to consider sample size. For a multiple regression model with 2 predictors, one can plot a histogram of the residuals and a residual vs predictor plot for each of the two predictors. The histogram may not look normal, but with only 20 data points we should not expect a better visual fit, and inferences from the linear model should still be valid. Outliers: Often (particularly when a dataset is large) the majority of the residuals will satisfy the model checking assumptions, while a small number of residuals violate the normality assumption: they will be very big or very small. Outliers are often generated by a process distinct from those which we are primarily interested in.
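The residual checks described in these lecture notes are easy to carry out in code. The sketch below is a minimal Python illustration using numpy, statsmodels and matplotlib, with simulated data standing in for the notes' two-predictor example; it draws the residual histogram and the residual-versus-predictor plots mentioned above.

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 20                                     # small sample, as in the notes' example
X = rng.normal(size=(n, 2))                # two predictors
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=1.0, size=n)

fit = sm.OLS(y, sm.add_constant(X)).fit()  # ordinary least squares with an intercept
resid = fit.resid

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(resid, bins=8)                # normality check: histogram of residuals
axes[0].set_title("Residual histogram")
for j in range(2):                         # residuals vs each predictor: look for patterns
    axes[j + 1].scatter(X[:, j], resid)
    axes[j + 1].set_title(f"Residuals vs predictor {j + 1}")
plt.tight_layout()
plt.show()
```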
  • Scalable Model Selection for Spatial Additive Mixed Modeling: Application to Crime Analysis
    Scalable model selection for spatial additive mixed modeling: application to crime analysis. Daisuke Murakami (Singular Perturbations Co. Ltd., 1-5-6 Risona Kudan Building, Kudanshita, Chiyoda, Tokyo, 102-0074, Japan; Department of Statistical Data Science, Institute of Statistical Mathematics, 10-3 Midori-cho, Tachikawa, Tokyo, 190-8562, Japan; corresponding author, e-mail: [email protected]), Mami Kajita and Seiji Kajita (Singular Perturbations Co. Ltd.). Abstract: A rapid growth in spatial open datasets has led to a huge demand for regression approaches accommodating spatial and non-spatial effects in big data. Regression model selection is particularly important to stably estimate flexible regression models. However, conventional methods can be slow for large samples. Hence, we develop a fast and practical model-selection approach for spatial regression models, focusing on the selection of coefficient types that include constant, spatially varying, and non-spatially varying coefficients. A pre-processing approach, which replaces data matrices with small inner products through dimension reduction, dramatically accelerates the computation speed of model selection. Numerical experiments show that our approach selects the model accurately and computationally efficiently, highlighting the importance of model selection in the spatial regression context. Then, the present approach is applied to open data to investigate local factors affecting crime in Japan. The results suggest that our approach is useful not only for selecting factors influencing crime risk but also for predicting crime events. This scalable model selection will be key to appropriately specifying flexible and large-scale spatial regression models in the era of big data. The developed model selection approach was implemented in the R package spmoran. Keywords: model selection; spatial regression; crime; fast computation; spatially varying coefficient modeling
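The computational idea highlighted in this abstract, replacing data matrices with small inner products so that many candidate models can be scored cheaply, can be illustrated with a deliberately simple, non-spatial toy example. The Python sketch below is not the spmoran implementation and ignores the spatial eigenfunction machinery; it only shows why precomputed inner products make exhaustive model comparison fast.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
n, p = 100_000, 8
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.5 * X[:, 3] + rng.normal(size=n)

# Precompute the small inner products once: after this, the raw data are never touched again.
G = X.T @ X          # p x p Gram matrix
c = X.T @ y          # p-vector of cross products
yy = y @ y           # scalar

def bic_of_subset(S):
    """BIC of the Gaussian linear model using columns S, computed from G, c, yy only."""
    S = list(S)
    beta = np.linalg.solve(G[np.ix_(S, S)], c[S])
    rss = yy - c[S] @ beta            # residual sum of squares via inner products
    k = len(S) + 1                    # regression coefficients plus noise variance
    return n * np.log(rss / n) + k * np.log(n)

# Score every non-empty subset of predictors using only the precomputed quantities.
best = min((tuple(S) for r in range(1, p + 1) for S in combinations(range(p), r)),
           key=bic_of_subset)
print(best)                           # typically recovers the true support (0, 3)
```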
  • Subnational Inequality Divergence
    Subnational Inequality Divergence. Tom VanHeuvelen (University of Minnesota, Department of Sociology). Abstract: How have inequality levels across local labor markets in the subnational United States changed over the past eight decades? In this study, I examine inequality divergence, or the inequality of inequalities. While divergence trends of central tendencies such as per capita income have been well documented, less is known about the descriptive trends or contributing mechanisms for inequality. In this study, I construct wage inequality measures in 722 local labor markets covering the entire contiguous United States across 22 waves of Census and American Community Survey data from 1940-2019 to assess the historical trends of inequality divergence. I apply variance decomposition and counterfactual techniques to develop main conclusions. Inequality divergence follows a u-shaped pattern, declining through 1990 but with contemporary divergence at as high a level as any time in the past 80 years. Early era convergence occurred broadly and primarily worked to reduce interregional differences, whereas modern inequality divergence operates through a combination of novel mechanisms, most notably through highly unequal urban areas separating from other labor markets. Overall, results show geographical fragmentation of inequality underneath overall inequality growth in recent years, highlighting the fundamental importance of spatial trends for broader stratification outcomes. (Correspondence: [email protected]. A previous version of this manuscript was presented at the 2021 Population Association of America annual conference. Thank you to Jane VanHeuvelen and Peter Catron for their helpful comments.) Recent changes in the United States have situated geographical residence as a key pillar of the unequal distribution of economic resources (Austin et al.
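The variance-decomposition step mentioned in this abstract can be illustrated in a few lines: total inequality (measured here simply as the variance of log wages) splits exactly into a within-labor-market and a between-labor-market component when groups are of equal size. The data below are simulated and the measure is deliberately crude; this is a sketch of the decomposition, not the study's actual inequality measures.

```python
import numpy as np

rng = np.random.default_rng(2)
n_markets, workers_per_market = 722, 200
market_means = rng.normal(loc=3.0, scale=0.2, size=n_markets)   # between-market differences
log_wages = market_means[:, None] + rng.normal(scale=0.5, size=(n_markets, workers_per_market))

overall = log_wages.var()                  # total variance over all workers
within = log_wages.var(axis=1).mean()      # average within-market variance
between = log_wages.mean(axis=1).var()     # variance of market means
print(f"total {overall:.4f} = within {within:.4f} + between {between:.4f}")
assert np.isclose(overall, within + between)   # exact for equal-sized groups
```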
  • Model Selection Techniques: an Overview
    Model Selection Techniques: An Overview. Jie Ding, Vahid Tarokh, and Yuhong Yang. In the era of big data, analysts usually explore various statistical models or machine-learning methods for observed data to facilitate scientific discoveries or gain predictive power. Whatever data and fitting procedures are employed, a crucial step is to select the most appropriate model or method from a set of candidates. Model selection is a key ingredient in data analysis for reliable and reproducible statistical inference or prediction, and thus it is central to scientific studies in such fields as ecology, economics, engineering, finance, political science, biology, and epidemiology. There has been a long history of model selection techniques that arise from research in statistics, information theory, and signal processing. A considerable number of methods have been proposed, following different philosophies and exhibiting varying performances. The purpose of this article is to provide a comprehensive overview of them, in terms of their motivation, large sample performance, and applicability. We provide integrated and practically relevant discussions on theoretical properties of state-of-the-art model selection approaches. We also share our thoughts on some controversial views on the practice of model selection. Why model selection: Vast developments in hardware storage, precision instrument manufacturing, economic globalization, and so forth have generated huge volumes of data that can be analyzed to extract useful information. Typical statistical inference or machine-learning procedures learn from and make predictions on data by fitting parametric or nonparametric models (in a broad sense). (IEEE Signal Processing Magazine, November 2018. Digital Object Identifier 10.1109/MSP.2018.2867638.)
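As a concrete, deliberately simple illustration of the kind of criteria such overviews survey, the Python sketch below compares polynomial regression models with AIC and BIC under a Gaussian likelihood; the data-generating process and the candidate set are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.uniform(-2, 2, size=n)
y = 1.0 - x + 0.5 * x**2 + rng.normal(scale=0.3, size=n)   # true model is quadratic

for degree in range(1, 6):
    X = np.vander(x, degree + 1)                    # polynomial design matrix
    beta, rss, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(rss[0])
    k = degree + 2                                  # coefficients plus noise variance
    loglik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)   # Gaussian log-likelihood at the MLE
    aic = 2 * k - 2 * loglik
    bic = k * np.log(n) - 2 * loglik
    print(f"degree {degree}: AIC {aic:8.1f}  BIC {bic:8.1f}")
```

With enough data the quadratic model usually attains the smallest AIC and BIC; the two criteria embody the different philosophies (prediction versus consistent selection) that the article contrasts.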
  • Least Squares After Model Selection in High-Dimensional Sparse Models (supplementary DOI: 10.3150/11-BEJ410SUPP)
    Bernoulli 19(2), 2013, 521–547, DOI: 10.3150/11-BEJ410. Least squares after model selection in high-dimensional sparse models. Alexandre Belloni (100 Fuqua Drive, Durham, North Carolina 27708, USA; e-mail: [email protected]) and Victor Chernozhukov (50 Memorial Drive, Cambridge, Massachusetts 02142, USA; e-mail: [email protected]). In this article we study post-model selection estimators that apply ordinary least squares (OLS) to the model selected by first-step penalized estimators, typically Lasso. It is well known that Lasso can estimate the nonparametric regression function at nearly the oracle rate, and is thus hard to improve upon. We show that the OLS post-Lasso estimator performs at least as well as Lasso in terms of the rate of convergence, and has the advantage of a smaller bias. Remarkably, this performance occurs even if the Lasso-based model selection “fails” in the sense of missing some components of the “true” regression model. By the “true” model, we mean the best s-dimensional approximation to the nonparametric regression function chosen by the oracle. Furthermore, the OLS post-Lasso estimator can perform strictly better than Lasso, in the sense of a strictly faster rate of convergence, if the Lasso-based model selection correctly includes all components of the “true” model as a subset and also achieves sufficient sparsity. In the extreme case, when Lasso perfectly selects the “true” model, the OLS post-Lasso estimator becomes the oracle estimator. An important ingredient in our analysis is a new sparsity bound on the dimension of the model selected by Lasso, which guarantees that this dimension is at most of the same order as the dimension of the “true” model.
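A minimal scikit-learn sketch of the two-step estimator described in this abstract, Lasso for model selection followed by ordinary least squares refitted on the selected columns, might look as follows. The penalty level and the simulated data are illustrative choices, not the paper's.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(4)
n, p, s = 200, 50, 5
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:s] = 2.0                                # sparse "true" model
y = X @ beta_true + rng.normal(size=n)

# Step 1: Lasso selects a (hopefully sparse) support.
lasso = Lasso(alpha=0.1).fit(X, y)
support = np.flatnonzero(lasso.coef_ != 0)

# Step 2: OLS refitted on the selected columns removes the Lasso shrinkage bias.
ols = LinearRegression().fit(X[:, support], y)

print("selected columns:", support)
print("post-Lasso OLS coefficients:", np.round(ols.coef_, 2))
```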
  • Model Selection and Estimation in Regression with Grouped Variables
    J. R. Statist. Soc. B (2006) 68, Part 1, pp. 49–67. Model selection and estimation in regression with grouped variables. Ming Yuan (Georgia Institute of Technology, Atlanta, USA) and Yi Lin (University of Wisconsin—Madison, USA). [Received November 2004. Revised August 2005] Summary: We consider the problem of selecting grouped variables (factors) for accurate prediction in regression. Such a problem arises naturally in many practical situations, with the multifactor analysis-of-variance problem as the most important and well-known example. Instead of selecting factors by stepwise backward elimination, we focus on the accuracy of estimation and consider extensions of the lasso, the LARS algorithm and the non-negative garrotte for factor selection. The lasso, the LARS algorithm and the non-negative garrotte are recently proposed regression methods that can be used to select individual variables. We study and propose efficient algorithms for the extensions of these methods for factor selection and show that these extensions give superior performance to the traditional stepwise backward elimination method in factor selection problems. We study the similarities and the differences between these methods. Simulations and real examples are used to illustrate the methods. Keywords: Analysis of variance; Lasso; Least angle regression; Non-negative garrotte; Piecewise linear solution path. 1. Introduction: In many regression problems we are interested in finding important explanatory factors in predicting the response variable, where each explanatory factor may be represented by a group of derived input variables. The most common example is the multifactor analysis-of-variance (ANOVA) problem, in which each factor may have several levels and can be expressed through a group of dummy variables.
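For reference, the grouped-variable criterion studied in this line of work, written here in a standard form with groups weighted by the square root of their dimension (a paraphrase from common usage, not a quotation from the paper), is:

```latex
% Group lasso: least squares plus a sum of (unsquared) Euclidean norms,
% one per factor/group of coefficients, which zeroes out whole groups at once.
\hat{\beta} \;=\; \arg\min_{\beta}\;
\Bigl\| \, y - \sum_{j=1}^{J} X_j \beta_j \Bigr\|_2^2
\;+\; \lambda \sum_{j=1}^{J} \sqrt{p_j}\,\|\beta_j\|_2 ,
```

where X_j collects the p_j columns (for example, the dummy variables) belonging to factor j. Because the penalty is a sum of unsquared Euclidean norms over groups, entire groups of coefficients are set to zero together, which is what turns variable selection into factor selection.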
  • Model Selection for Production System Via Automated Online Experiments
    Model Selection for Production System via Automated Online Experiments. Zhenwen Dai, Praveen Ravichandran, Ghazal Fazelnia, Ben Carterette and Mounia Lalmas-Roelleke (Spotify; e-mail: [email protected]). Abstract: A challenge that machine learning practitioners in the industry face is the task of selecting the best model to deploy in production. As a model is often an intermediate component of a production system, online controlled experiments such as A/B tests yield the most reliable estimation of the effectiveness of the whole system, but can only compare two or a few models due to budget constraints. We propose an automated online experimentation mechanism that can efficiently perform model selection from a large pool of models with a small number of online experiments. We derive the probability distribution of the metric of interest that contains the model uncertainty from our Bayesian surrogate model trained using historical logs. Our method efficiently identifies the best model by sequentially selecting and deploying a list of models from the candidate set that balance exploration-exploitation. Using simulations based on real data, we demonstrate the effectiveness of our method on two different tasks. 1 Introduction: Evaluating the effect of individual changes to machine learning (ML) systems such as choice of algorithms, features, etc., is the key to growth in many internet services and industrial applications. Practitioners are faced with the decision of choosing one model from several candidates to deploy in production. This can be viewed as a model selection problem. Classical model selection paradigms such as cross-validation consider ML models in isolation and focus on selecting the model with the best predictive power on unseen data.
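The paper's own approach builds a Bayesian surrogate of the online metric from historical logs; as a much simpler illustration of the sequential exploration-exploitation idea it describes, the sketch below uses Thompson sampling with a Beta-Bernoulli model to decide which candidate to deploy in each experiment round. All numbers and the choice of surrogate are invented for the example and are not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(5)
true_rates = [0.10, 0.12, 0.11, 0.16, 0.09]        # unknown online metric of each candidate model
alpha = np.ones(len(true_rates))                   # Beta posterior parameters per candidate
beta = np.ones(len(true_rates))

for _ in range(40):                                # budgeted sequence of online experiments
    samples = rng.beta(alpha, beta)                # Thompson sampling: one draw per posterior
    m = int(np.argmax(samples))                    # deploy the candidate that looks best this round
    successes = rng.binomial(n=1000, p=true_rates[m])   # observed metric from the experiment
    alpha[m] += successes
    beta[m] += 1000 - successes

print("posterior mean metric per model:", np.round(alpha / (alpha + beta), 3))
print("selected model:", int(np.argmax(alpha / (alpha + beta))))
```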
  • A Generalized Divergence for Statistical Inference
    A Generalized Divergence for Statistical Inference. Abhik Ghosh (Indian Statistical Institute, Kolkata, India), Ian R. Harris (Southern Methodist University, Dallas, USA), Avijit Maji (Indian Statistical Institute, Kolkata, India), Ayanendranath Basu (Indian Statistical Institute, Kolkata, India) and Leandro Pardo (Complutense University, Madrid, Spain). Technical Report No. BIRU/2013/3, 2013, Bayesian and Interdisciplinary Research Unit, Indian Statistical Institute, 203 Barrackpore Trunk Road, Kolkata 700 108, India. Summary: The power divergence (PD) and the density power divergence (DPD) families have proved to be useful tools in the area of robust inference. The families have striking similarities, but also have fundamental differences; yet both families are extremely useful in their own ways. In this paper we provide a comprehensive description of the two families and tie in their role in statistical theory and practice. At the end, the families are seen to be a part of a superfamily which contains both of these families as special cases. In the process, the limitation of the influence function as an effective descriptor of the robustness of the estimator is also demonstrated. Keywords: Robust Estimation, Divergence, Influence Function. 1. Introduction: The density-based minimum divergence approach is a useful technique in parametric inference. Here the closeness of the data and the model is quantified by a suitable measure of density-based divergence between the data density and the model density. Many of these methods have been particularly useful because of the strong robustness properties that they inherently possess.
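For orientation, the density power divergence family mentioned in this summary is usually written (in the standard parameterization due to Basu et al., quoted here from general usage rather than from this report) as:

```latex
% Density power divergence (DPD) between a data density g and a model density f.
d_\alpha(g, f) \;=\; \int \Bigl\{ f^{1+\alpha}(x)
  \;-\; \Bigl(1 + \tfrac{1}{\alpha}\Bigr) g(x)\, f^{\alpha}(x)
  \;+\; \tfrac{1}{\alpha}\, g^{1+\alpha}(x) \Bigr\}\, dx ,
  \qquad \alpha > 0 .
```

The Kullback-Leibler divergence is recovered in the limit alpha → 0, and alpha = 1 gives the squared L2 distance between the two densities; the tuning parameter alpha trades efficiency against robustness.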
  • Divergence Measures for Statistical Data Processing—An Annotated Bibliography
    Signal Processing 93 (2013) 621–633. Review: Divergence measures for statistical data processing—An annotated bibliography. Michèle Basseville (IRISA, Campus de Beaulieu, 35042 Rennes Cedex, France). Abstract: This paper provides an annotated bibliography for investigations based on or related to divergence measures for statistical data processing and inference problems. (Received 19 April 2012; received in revised form 17 July 2012; accepted 5 September 2012; available online 14 September 2012.) Keywords: Divergence; Distance; Information; f-Divergence; Bregman divergence; Learning; Estimation; Detection; Classification; Recognition; Compression; Indexing. Contents: 1. Introduction; 2. f-Divergences; 3. Bregman divergences; 4. α-Divergences; 5. Handling more than two distributions; 6. Inference based on entropy and divergence criteria; 7. Spectral divergence measures ...
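Two of the largest families catalogued in this bibliography have compact standard definitions, reproduced below from common usage rather than from the paper's own notation:

```latex
% f-divergence between densities p and q, for convex f with f(1) = 0:
D_f(P \,\|\, Q) \;=\; \int q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) dx .

% Bregman divergence generated by a differentiable, strictly convex function phi:
D_\phi(x, y) \;=\; \phi(x) - \phi(y) - \langle \nabla \phi(y),\, x - y \rangle .
```

The Kullback-Leibler divergence belongs to both families (take f(t) = t log t for the first, or let phi be the negative Shannon entropy for the second), which is one reason these two classes organize so much of the literature surveyed above.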