
Minimum Description Length Model Selection

Problems and Extensions

Steven de Rooij

ILLC Dissertation Series DS-2008-07

For further information about ILLC publications, please contact:

Institute for Logic, Language and Computation
Universiteit van Amsterdam
Plantage Muidergracht 24
1018 TV Amsterdam
phone: +31-20-525 6051
fax: +31-20-525 5206
e-mail: [email protected]
homepage: http://www.illc.uva.nl/

Minimum Description Length Model Selection

Problems and Extensions

Academic Dissertation

for the degree of doctor at the Universiteit van Amsterdam, by authority of the Rector Magnificus, prof. dr. D.C. van den Boom, to be defended in public before a committee appointed by the doctorate board, in the Agnietenkapel on Wednesday, 10 September 2008, at 12:00

by

Steven de Rooij

born in Amersfoort.

Doctoral committee:

Promotor: prof. dr. ir. P.M.B. Vitányi
Co-promotor: dr. P.D. Grünwald

Other members:
prof. dr. P.W. Adriaans
prof. dr. A.P. Dawid
prof. dr. C.A.J. Klaassen
prof. dr. ir. R.J.H. Scha
dr. M.W. van Someren
prof. dr. V. Vovk
dr. ir. F.M.J. Willems

Faculteit der Natuurwetenschappen, Wiskunde en Informatica

The investigations were funded by the Netherlands Organization for Scientific Research (NWO), project 612.052.004 on Universal Learning, and were supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778.

Copyright © 2008 by Steven de Rooij

Cover inspired by Randall Munroe's xkcd webcomic (www.xkcd.org)
Printed and bound by PrintPartners Ipskamp (www.ppi.nl)
ISBN: 978-90-5776-181-2

Parts of this thesis are based on material contained in the following papers:

• An Empirical Study of Minimum Description Length Model Selection with Infinite Parametric Complexity
  Steven de Rooij and Peter Grünwald
  In: Journal of Mathematical Psychology, special issue on model selection, vol. 50, pp. 180–190, 2006 (Chapter 2)

• Asymptotic Log-Loss of Prequential Maximum Likelihood Codes
  Peter D. Grünwald and Steven de Rooij
  In: Proceedings of the 18th Annual Conference on Computational Learning Theory (COLT), pp. 652–667, June 2005 (Chapter 3)

• MDL Model Selection using the ML Plug-in Code
  Steven de Rooij and Peter Grünwald
  In: Proceedings of the International Symposium on Information Theory (ISIT), 2005 (Chapters 2 and 3)

• Approximating Rate-Distortion Graphs of Individual Data: Experiments in Lossy Compression and Denoising
  Steven de Rooij and Paul Vitányi
  Submitted to: IEEE Transactions on Computers (Chapter 6)

• Catching up Faster in Bayesian Model Selection and Averaging
  Tim van Erven, Peter D. Grünwald and Steven de Rooij
  In: Advances in Neural Information Processing Systems 20 (NIPS 2007) (Chapter 5)

• Expert Automata for Efficient Tracking
  W. Koolen and S. de Rooij
  In: Proceedings of the 21st Annual Conference on Computational Learning Theory (COLT), 2008 (Chapter 4)


Contents

Acknowledgements

1 The MDL Principle for Model Selection
  1.1 Encoding Hypotheses
    1.1.1 Luckiness
    1.1.2 Infinitely Many Hypotheses
    1.1.3 Ideal MDL
  1.2 Using Models to Encode the Data
    1.2.1 Codes and Probability Distributions
    1.2.2 Sequential Observations
    1.2.3 Model Selection and Universal Coding
  1.3 Applications of MDL
    1.3.1 Truth Finding
    1.3.2 Prediction
    1.3.3 Separating Structure from Noise
  1.4 Organisation of this Thesis

2 Dealing with Infinite Parametric Complexity
  2.1 Universal Codes and Parametric Complexity
  2.2 The Poisson and Geometric Models
  2.3 Four Ways to deal with Infinite Parametric Complexity
    2.3.1 BIC/ML
    2.3.2 Restricted ANML
    2.3.3 Two-part ANML
    2.3.4 Renormalised Maximum Likelihood
    2.3.5 Prequential Maximum Likelihood
    2.3.6 Objective Bayesian Approach
  2.4 Experiments
    2.4.1 Error Probability
    2.4.2 Bias
    2.4.3 Calibration
  2.5 Discussion of Results
    2.5.1 Poor Performance of the PML Criterion
    2.5.2 ML/BIC
    2.5.3 Basic Restricted ANML
    2.5.4 Objective Bayes and Two-part Restricted ANML
  2.6 Summary and Conclusion

3 Behaviour of Prequential Codes under Misspecification
  3.1 Main Result, Informally
  3.2 Main Result, Formally
  3.3 Redundancy vs. Regret
  3.4 Discussion
    3.4.1 A Justification of Our Modification of the ML Estimator
    3.4.2 Prequential Models with Other Estimators
    3.4.3 Practical Significance for Model Selection
    3.4.4 Theoretical Significance
  3.5 Building Blocks of the Proof
    3.5.1 Results about Deviations between Average and Mean
    3.5.2 Lemma 3.5.6: Redundancy for Exponential Families
    3.5.3 Lemma 3.5.8: Convergence of the Sum of the Remainder Terms
  3.6 Proof of Theorem 3.3.1
  3.7 Conclusion and Future Work

4 Interlude: Prediction with Expert Advice
  4.1 Expert Sequence Priors
    4.1.1 Examples
  4.2 Expert Tracking using HMMs
    4.2.1 Hidden Markov Models Overview
    4.2.2 HMMs as ES-Priors
    4.2.3 Examples
    4.2.4 The HMM for Data
    4.2.5 The Forward Algorithm and Sequential Prediction
  4.3 Zoology
    4.3.1 Universal Elementwise Mixtures
    4.3.2 Fixed Share
    4.3.3 Universal Share
    4.3.4 Overconfident Experts
  4.4 New Models to Switch between Experts
    4.4.1 Switch Distribution
    4.4.2 Run-length Model
    4.4.3 Comparison
  4.5 Extensions
    4.5.1 Fast Approximations
    4.5.2 Data-Dependent Priors
    4.5.3 An Alternative to MAP Data Analysis
  4.6 Conclusion

5 Slow Convergence: the Catch-up Phenomenon
  5.1 The Switch-Distribution for Model Selection and Prediction
    5.1.1 Preliminaries
    5.1.2 Model Selection and Prediction
    5.1.3 The Switch-Distribution
  5.2 Consistency
  5.3 Optimal Risk Convergence Rates
    5.3.1 The Switch-distribution vs Bayesian Model Averaging
    5.3.2 Improved Convergence Rate: Preliminaries
    5.3.3 The Parametric Case
    5.3.4 The Nonparametric Case
    5.3.5 Example: Histogram Density Estimation
  5.4 Further Discussion of Convergence in the Nonparametric Case
    5.4.1 Convergence Rates in Sum
    5.4.2 Beyond Histograms
  5.5 Efficient Computation
  5.6 Relevance and Earlier Work
    5.6.1 AIC-BIC; Yang's Result
    5.6.2 Prediction with Expert Advice
  5.7 The Catch-Up Phenomenon, Bayes and Cross-Validation
    5.7.1 The Catch-Up Phenomenon is Unbelievable! (according to pbma)
    5.7.2 Nonparametric Bayes
    5.7.3 Leave-One-Out Cross-Validation
  5.8 Conclusion and Future Work
  5.9 Proofs
    5.9.1 Proof of Theorem 5.2.1
    5.9.2 Proof of Theorem 5.2.2
    5.9.3 Mutual Singularity as Used in the Proof of Theorem 5.2.1
    5.9.4 Proof of Theorem 5.5.1

6 Individual Sequence Rate Distortion
  6.1 Algorithmic Rate-Distortion
    6.1.1 Distortion Spheres, the Minimal Sufficient Statistic
    6.1.2 Applications: Denoising and Lossy Compression
  6.2 Computing Individual Object Rate-Distortion
    6.2.1 Compressor (rate function)
    6.2.2 Code Length Function
    6.2.3 Distortion Functions
    6.2.4 Searching for the Rate-Distortion Function
    6.2.5 Genetic Algorithm
  6.3 Experiments
    6.3.1 Names of Objects
  6.4 Results and Discussion
    6.4.1 Lossy Compression
    6.4.2 Denoising
  6.5 Quality of the Approximation
  6.6 An MDL Perspective
  6.7 Conclusion

Bibliography

Notation

Abstract

Samenvatting

Acknowledgements

When I took prof. Vitányi's course on Kolmogorov complexity, I was delighted by his seemingly endless supply of anecdotes, opinions and wisdoms that ranged the whole spectrum from politically incorrect jokes to profound philosophical insights. At the time I was working on my master's thesis on data compression with Breanndán Ó Nualláin, which may have primed me for the course; at any rate, I did well enough and Paul Vitányi and Peter Grünwald (who taught a couple of guest lectures) singled me out as a suitable PhD candidate, which is how I got the job. This was a very lucky turn of events for me, because I was then, and am still, very much attracted to that strange mix of computer science, artificial intelligence, information theory and statistics that makes up machine learning and learning theory. While Paul's commonsense suggestions and remarks kept me steady, Peter has performed the job of PhD supervision to perfection: besides pointing me towards interesting research topics, teaching me to write better papers, and introducing me to several conferences, we also enjoyed many interesting discussions and conversations, as often about work as not, and he said the right things at the right times to keep my spirits up. He even helped me think about my future career and dropped my name here and there... In short, I wish to thank Paul and Peter for their excellent support.

Some time into my research I was joined by fellow PhD students Tim van Erven and Wouter Koolen. I know both of them from college, when I did some work as a lab assistant; Wouter and I in fact worked together on the Kolmogorov complexity course. The three of us are now good friends and we have had a lot of fun working together on the material that now makes up a large part of this thesis. Thanks to Wouter and Tim for all their help; hopefully we will keep working together at least occasionally in the future. Thanks also go to Daniel Navarro, who made my stay in Adelaide very enjoyable and with whom I still hope to work out the human psyche one day, and to my group members Rudi Cilibrasi, Łukasz Dębowski, Peter Harremoës and Alfonso Martinez. Regards also to Harry Buhrman and the people of his excellent quantum computing group.

In addition to my supervisors, I wish to thank the following people for making technical contributions to this thesis. Chapter 2 is inspired by the work of Aaron D. Lanterman. Thanks to Mark Herbster for a lively discussion and useful comments with respect to Chapter 4, and to Tim van Erven for his excellent suggestions. The ideas in Chapter 5 were inspired by a discussion with Yishay Mansour. We received technical help from Peter Harremoës and Wouter Koolen, and very helpful insights from Andrew Barron.

Outside of work, I have been loved and supported by my wonderful girlfriend Nele Beyens. Without her and our son Pepijn, I might still have written a thesis, but I certainly would not have been as happy doing it! Also many thanks to my parents Anca and Piet and my brother Maarten, who probably do not have much of a clue what I have been doing all this time, but who have been appropriately impressed by it all. Finally I wish to thank my friends Magiel Bruntink, Mark Thompson, Joeri van Ruth and Nienke Valkhoff, Wouter Vanderstede and Lieselot Decalf, Jan de Vos, Daniel Wagenaar and his family, and Thijs Weststeijn and Marieke van den Doel.

Chapter 1

The MDL Principle for Model Selection

Suppose we consider a number of different hypotheses for some phenomenon. We have gathered some data that we want to use somehow to evaluate these hypotheses and decide which is the “best” one. If one of the hypotheses exactly describes the true mechanism that underlies the phenomenon, then that is the one we hope to find. While this may already be a hard problem, in practice the available hypotheses are often merely approximations. In that case the goal is to select a hypothesis that is useful, in the sense that it provides insight into previous observations and matches new observations well. Of course, we can immediately reject hypotheses that are inconsistent with new experimental data, but hypotheses often allow for some margin of error; as such they are never truly inconsistent, but they can vary in the degree of success with which they predict new observations. A quantitative criterion is required to decide between competing hypotheses.

The Minimum Description Length (MDL) principle is such a criterion [67, 71]. It is based on the intuition that, on the basis of a useful theory, it should be possible to compress the observations, i.e. to describe the data in full using fewer symbols than we would need for a literal description. According to the MDL principle, the more we can compress a given set of data, the more we have learned about it. The MDL approach to inference requires that all hypotheses are formally specified in the form of codes. A code is a function that maps possible outcomes to binary sequences; thus the length of the encoded representation of the data can be expressed in bits. We can encode the data by first specifying the hypothesis to be used, and then specifying the data with the help of that hypothesis. Suppose that $L(H)$ is a function that specifies how many bits we need to identify a hypothesis $H$ in a countable set of considered hypotheses $\mathcal{H}$. Furthermore, let $L_H(D)$ denote how many bits we need to specify the data $D$ using the code associated with hypothesis $H$. The MDL principle now tells us to select the hypothesis $H$ for which the total description length of the data, i.e. the length of the description of the hypothesis $L(H)$ plus the length of the description of the data using that hypothesis $L_H(D)$, is shortest. The minimum total description length as a function of the data is denoted $L_{\mathrm{mdl}}$:

$$H_{\mathrm{mdl}} := \mathop{\mathrm{arg\,min}}_{H \in \mathcal{H}} \bigl( L(H) + L_H(D) \bigr), \tag{1.1}$$

$$L_{\mathrm{mdl}}(D) := \min_{H \in \mathcal{H}} \bigl( L(H) + L_H(D) \bigr). \tag{1.2}$$

(In case several $H$ minimise (1.1), we take any one among those with smallest $L(H)$.) Intuitively, the term $L(H)$ represents the complexity of the hypothesis, while $L_H(D)$ represents how well the hypothesis is able to describe the data, often referred to as the goodness of fit. By minimising the sum of these two components, MDL implements a tradeoff between complexity and goodness of fit. Also note that the selected hypothesis depends only on the lengths of the code words used, and not on the binary sequences that make up the code words themselves. This will make things a lot easier later on.

By its preference for short descriptions, MDL implements a heuristic that is widely used in science as well as in learning in general: the principle of parsimony, also often referred to as Occam's razor. A parsimonious learning strategy is sometimes adopted on the basis of a belief that the true state of nature is more likely to be simple than complex. We prefer to argue in the opposite direction: only if the truth is simple, or at least has a simple approximation, do we stand a chance of learning its mechanics from a limited data set [39]. In general, we try to avoid assumptions about the truth as much as possible, and focus on effective strategies for learning instead. That said, a number of issues still need to be addressed if we want to arrive at a practical theory of learning:

1. What code should be used to identify the hypotheses?

2. What codes for the data are suitable as formal representations of the hypotheses?

3. After answering the previous two questions, to what extent can the MDL criterion actually be shown to identify a “useful” hypothesis, and what do we actually mean by “useful”?

These questions, which lie at the heart of MDL research, are illustrated by the following example.

Example 1 (Language Learning). Ray Solomonoff, one of the founding fathers of the concept of Kolmogorov complexity (to be discussed in Section 1.1.3), introduced the problem of language inference from only positive examples as an example application of his new theory [84]; it also serves as a good example for MDL inference.

Suppose we want to automatically infer a language based on a list of valid example sentences $D$. We assume that the vocabulary $\Sigma$, which contains all words that occur in $D$, is background knowledge: we already know the words of the language, but we want to learn from valid sentences how the words may be ordered. For the set of hypotheses $\mathcal{H}$ we take all context-free grammars (introduced as “phrase structure grammars” by Chomsky in [19]) that use only words from $\Sigma$ and an additional set $N$ of nonterminal symbols. $N$ contains a special symbol called the starting symbol. A grammar is a set of production rules, each of which maps a nonterminal symbol to a (possibly empty) sequence consisting of both other nonterminals and words from $\Sigma$. A sentence is grammatical if it can be produced from the starting symbol by iteratively applying a production rule to one of the nonterminal symbols.

We need to define the relevant code length functions: $L(H)$ for the specification of the grammar $H \in \mathcal{H}$, and $L_H(D)$ for the specification of the example sentences with the help of that grammar. In this example we use very simple code length functions; later in this introduction, after describing in more detail what properties good codes should have, we return to language inference with a more in-depth discussion.

We use uniform codes in the definitions of $L(H)$ and $L_H(D)$. A uniform code on a finite set $A$ assigns binary code words of equal length to all elements of the set. Since there are $2^l$ binary sequences of length $l$, a uniform code on $A$ must have code words of length at least $\lceil \log |A| \rceil$. (Throughout this thesis, $\lceil \cdot \rceil$ denotes rounding up to the nearest integer, and $\log$ denotes the binary logarithm. Such notation is listed in the Notation section.)

To establish a baseline, we first use uniform codes to calculate how many bits we need to encode the data literally, without the help of any grammar. Namely, we can specify every word in $D$ with a uniform code on $\Sigma \cup \{\diamond\}$, where $\diamond$ is a special symbol used to mark the end of a sentence. This way we need $|D| \, \lceil \log(|\Sigma| + 1) \rceil$ bits to encode the data. We are looking for grammars $H$ which allow us to compress the data beyond this baseline value.

We first specify $L(H)$ as follows. For each production rule of $H$, we use $\log(|N \cup \{\diamond\}|)$ bits to uniformly encode the initial nonterminal symbol, and $\log(|N \cup \{\diamond\} \cup \Sigma|)$ bits for each of the other symbols. The symbol $\diamond$ signals the end of each rule; two consecutive $\diamond$s signal the end of the entire grammar. If the grammar $H$ has $r$ rules, and the summed length of the replacement sequences is $s$, then we can calculate that the number of bits we need to encode the entire grammar is at most

$$\bigl\lceil (s + r) \log(|N| + |\Sigma| + 1) \bigr\rceil + \bigl\lceil (r + 1) \log(|N| + 1) \bigr\rceil. \tag{1.3}$$

If a context-free grammar $H$ is correct, i.e. all sentences in $D$ are grammatical according to $H$, then its corresponding code $L_H$ can help to compress the data, because it does not need to reserve code words for any sentences that are ungrammatical according to $H$. In this example we simply encode all words in $D$ in sequence, each time using a uniform code on the set of words that could occur next according to the grammar. Again, the set of possible words is augmented with a symbol $\diamond$ to mark the end of each sentence and to mark the end of the data.

Now we consider two very simple correct grammars, both of which need only one nonterminal symbol $S$. The “promiscuous” grammar $H_1$ (terminology due to Solomonoff [84]) has rules $S \to S\,S$ and $S \to \sigma$ for each word $\sigma \in \Sigma$. This grammar generates any sequence of words as a valid sentence. It is very short: we have $r = 1 + |\Sigma|$ and $s = 2 + |\Sigma|$, so the number of bits $L(H_1)$ required to encode the grammar essentially depends only on the size of the dictionary and not on the amount of available data $D$. On the other hand, according to $H_1$ all words in the dictionary are allowed in all positions, so $L_{H_1}(D)$ requires as much as $\lceil \log(|\Sigma| + 1) \rceil$ bits for every word in $D$, which is equal to the baseline. Thus this grammar does not enable us to compress the data.

Second, we consider the “ad-hoc” grammar $H_2$. This grammar consists of a production rule $S \to d$ for each sentence $d \in D$. Thus according to $H_2$, a sentence is grammatical only if it matches one of the examples in $D$ exactly. Since this severely restricts the number of possible words that can appear at any given position in a sentence given the previous words, this grammar allows for a very efficient representation of the data: $L_{H_2}(D)$ is small. However, in this case $L(H_2)$ is at least as large as the baseline, since the data $D$ appear literally in $H_2$!

Both grammars are clearly useless: the first does not describe any structure in the data at all and is said to underfit the data. In the second grammar, random features of the data (in this case, the selection of valid sentences that happen to be in $D$) are treated as structural information; this grammar is said to overfit the data. Consequently, for both grammars $H \in \{H_1, H_2\}$ we do not compress the data at all, since in both cases the total code length $L(H) + L_H(D)$ exceeds the baseline. In contrast, by selecting a grammar $H_{\mathrm{mdl}}$ that allows for the greatest total compression as per (1.1), we avoid either extreme, thus implementing a natural tradeoff between underfitting and overfitting. Note that finding this grammar $H_{\mathrm{mdl}}$ may be a quite difficult search problem, but the algorithmic aspects of finding the best hypothesis in an enormous hypothesis space are mostly outside the scope of this thesis: only in Chapter 6 do we address this search problem explicitly.

The Road Ahead

In the remainder of this chapter we describe the basics of MDL inference, with special attention to model selection. Model selection is an instance of MDL inference where the hypotheses are sets of probability distributions. For example, suppose that Alice claims that the number of hairs on people's heads is Poisson distributed, while Bob claims that it is in fact geometrically distributed. After gathering data (which would involve counting lots of hairs on many heads in this case), we may use MDL model selection to determine whose model better describes the truth. In the course of the following introduction we will come across a number of puzzling issues in the details of MDL model selection that warrant further investigation; these topics, which are summarised in Section 1.4, constitute the subjects of the later chapters of this thesis.

1.1 Encoding Hypotheses

We return to the three questions of the previous section. The intuition behind the MDL principle was that “useful” hypotheses should help compress the data. For now, we will consider hypotheses that are “useful to compress the data” to be “useful” in general; we postpone further discussion of the third question, what “useful” means exactly, to Section 1.3. What remains is the task of finding out which hypotheses are useful, by using them to compress the data.

Given many hypotheses, we could just test them one at a time on the available data until we find one that happens to allow for substantial compression. However, if we were to adopt such a methodology in practice, results would vary from reasonable to extremely bad. The reason is that among so many hypotheses, there need only be one that, by sheer force of luck, allows for compression of the data. This is the phenomenon of overfitting again, which we mentioned in Example 1. To avoid such pitfalls, we require that a single code $L_{\mathrm{mdl}}$ is proposed on the basis of all available hypotheses. The code has two parts: the first part, with length function $L(H)$, identifies a hypothesis to be used to encode the data, while the second part, with length function $L_H(D)$, describes the data using the code associated with that hypothesis. The next section is concerned with the definition of $L_H(D)$; for the time being we will assume that we have already represented our hypotheses in the form of codes, and we will discuss some of the properties a good choice for $L(H)$ should have. Technically, throughout this thesis we use only length functions that correspond to prefix codes; we will explain what this means in Section 1.2.1.

Consider the case where the best candidate hypothesis, i.e. the hypothesis $\hat H = \mathop{\mathrm{arg\,min}}_{H \in \mathcal{H}} L_H(D)$, achieves substantial compression. It would be a pity if we did not discover the usefulness of $\hat H$ because we chose a code word with unnecessarily long length $L(\hat H)$. The regret we incur on the data quantifies how bad this “detection overhead” can be, by comparing the total code length $L_{\mathrm{mdl}}(D)$ to the code length achieved by the best hypothesis, $L_{\hat H}(D)$. A general definition, which also applies if $\hat H$ is undefined, is the following: the regret of a code $L$ on data $D$ with respect to a set of alternative codes $\mathcal{M}$ is

$$\mathcal{R}(L, \mathcal{M}, D) := L(D) - \inf_{L' \in \mathcal{M}} L'(D). \tag{1.4}$$

The reasoning is now that, since we do not want to make a priori assumptions about the process that generates the data, the code for the hypotheses $L(H)$ must be chosen such that the regret $\mathcal{R}(L_{\mathrm{mdl}}, \{L_H : H \in \mathcal{H}\}, D)$ is small, whatever data we observe. This ensures that whenever $\mathcal{H}$ contains a useful hypothesis that allows for significant compression of the data, we are able to detect this because $L_{\mathrm{mdl}}$ compresses the data as well.

Example 2. Suppose that $\mathcal{H}$ is finite. Let $L$ be a uniform code that maps every hypothesis to a binary sequence of length $l = \lceil \log_2 |\mathcal{H}| \rceil$. The regret incurred by this uniform code is always exactly $l$, whatever data we observe. This is the best possible guarantee: all other length functions $L'$ on $\mathcal{H}$ incur a strictly larger regret for at least one possible outcome (unless $\mathcal{H}$ contains useless hypotheses $H$ which have $L(H) + L_H(D) > L_{\mathrm{mdl}}(D)$ for all possible $D$). In other words, the uniform code minimises the worst-case regret $\max_D \mathcal{R}(L, \mathcal{M}, D)$ among all code length functions $L$. (We discuss the exact conditions we impose on code length functions in Section 1.2.1.)

Thus, if we consider a finite number of hypotheses we can use MDL with a uniform code $L(H)$. Since the $L(H)$ term is then the same for all hypotheses, it cancels, and we find $H_{\mathrm{mdl}} = \hat H$ in this case. What we have gained from this possibly anticlimactic analysis is the following sanity check: since we equated learning with compression, we should not trust $\hat H$ to exhibit good performance on future data unless we were able to compress the data using $L_{\mathrm{mdl}}$.
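To make the selection rule of Example 2 concrete, here is a minimal Python sketch, our own illustration with a hypothetical interface, of two-part MDL selection over a finite hypothesis set with a uniform first part:

```python
import math

def mdl_select(codes, data):
    """Two-part MDL selection as in (1.1) over a finite hypothesis set.

    `codes` maps each hypothesis name to a function returning L_H(data)
    in bits. The uniform first part L(H) = ceil(log2 |H|) of Example 2 is
    the same for every H, so the winner is simply the best compressor."""
    l_h = math.ceil(math.log2(len(codes)))
    totals = {name: l_h + code(data) for name, code in codes.items()}
    best = min(totals, key=totals.get)
    return best, totals[best]
```

As the text notes, a useful sanity check is to compare the returned total code length against the baseline length of a literal encoding of the data before trusting the selected hypothesis.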

1.1.1 Luckiness

There are many codes $L$ on $\mathcal{H}$ that guarantee small regret, so the next task is to decide which we should pick. As it turns out, given any particular code $L$, it is possible to select a special subset of hypotheses $\mathcal{H}' \subset \mathcal{H}$ and modify the code such that the code lengths for these hypotheses are especially small, at the small cost of increasing the code lengths for the other hypotheses by a negligible amount. This can be desirable, because if a hypothesis in $\mathcal{H}'$ turns out to be useful, then we are lucky, and we can achieve superior compression. On the other hand, if all hypotheses in the special subset are poor, then we have not lost much. Thus, while a suitable code must always have small regret, there is quite a lot of freedom to favour such small subsets of special hypotheses.

Examples of this so-called luckiness principle are found throughout the MDL literature, although they usually remain implicit, possibly because it makes MDL inference appear subjective. Only recently has the luckiness principle been identified as an important part of code design. The concept of luckiness is introduced to the MDL literature in [39]; [6] uses a similar concept, but not under the same name. We take the stance that the luckiness principle introduces only a mild form of subjectivity, because it cannot substantially harm inference performance. Paradoxically, it can only really be harmful not to apply the luckiness principle, because that could cause us to miss out on some good opportunities for learning!

Example 3. To illustrate the luckiness principle, we return to the grammar learning of Example 1. To keep things simple we will not modify the code for the data $L_H(D)$ defined there; we will only reconsider $L(H)$, which intuitively seemed a reasonable code for the specification of grammars. Note that $L(H)$ is in fact a luckiness code: it assigns significantly shorter code lengths to shorter grammars. What would happen if we tried to avoid this “subjective” property by optimising the worst-case regret without considering luckiness?

To keep things simple, we reduce the hypothesis space to finite size by considering only context-free grammars with at most $|N| = 20$ nonterminals, $|\Sigma| = 500$ terminals, $r = 100$ rules and replacement sequences summing to a total length of $s = 2000$. Using (1.3) we can calculate the luckiness code length for such a grammar as at most $\lceil 2100 \log(521) \rceil + \lceil 101 \log(21) \rceil = 19397$ bits.

Now we will investigate what happens if we instead use a uniform code on the set $\mathcal{H}'$ of all possible grammars of up to that size. One may or may not want to verify that

$$|\mathcal{H}'| = \sum_{r=1}^{100} \sum_{s=0}^{2000} |N|^r \, (|N| + |\Sigma|)^s \binom{s + r - 1}{s}.$$

Calculation on the computer reveals that $\lceil \log(|\mathcal{H}'|) \rceil = 19048$ bits. Thus, we compress the data 349 bits better than with the code that we used before. While this shows that there is room for improvement of the luckiness code, the difference is actually not very large compared to the total code length. On the other hand, with the uniform code we always need 19048 bits to encode the grammar, even when the grammar is very short! Suppose that the best grammar uses only $r = 10$ and $s = 100$; then the luckiness code requires only $\lceil 110 \log(521) \rceil + \lceil 11 \log(21) \rceil = 1042$ bits to describe that grammar, and therefore outcompresses the uniform code by 18006 bits: in that case we learn a lot more with the luckiness code. Now that we have computed the minimal worst-case regret, we have a target for improvement of the luckiness code. A simple way to combine the advantages of both codes is to define a third code that uses one additional bit to specify whether to use the luckiness code or the uniform code.
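The “calculation on the computer” can be reproduced with a short script. The following sketch (ours, not the author's original program, and with a hypothetical function name) evaluates the double sum exactly using Python's arbitrary-precision integers; it takes a little while to run because the intermediate numbers have thousands of bits:

```python
import math

def log2_grammar_count(n_nonterm=20, n_term=500, max_r=100, max_s=2000):
    """log2 of |H'|: the number of context-free grammars with at most
    max_r rules and replacement sequences of total length at most max_s,
    counted by the double sum in Example 3."""
    base = n_nonterm + n_term  # symbols allowed in replacement sequences
    total = 0
    pow_r = 1
    for r in range(1, max_r + 1):
        pow_r *= n_nonterm          # |N|^r
        pow_s = 1
        for s in range(max_s + 1):
            # stars and bars: ways to distribute s symbols over r rules
            total += pow_r * pow_s * math.comb(s + r - 1, s)
            pow_s *= base           # (|N| + |Sigma|)^s
    return math.log2(total)

print(math.ceil(log2_grammar_count()))  # the text reports 19048 bits
```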

Note that we do not mean to imply that the hypotheses which get special luckiness treatment are necessarily more likely to be true than any other hypothesis. Rather, luckiness codes can be interpreted as saying that this or that special subset might be important, in which case we should like to know about it!

1.1.2 Infinitely Many Hypotheses

When $\mathcal{H}$ is countably infinite, there can be no upper bound on the lengths of the code words used to identify the hypotheses. Since any hypothesis might turn out to be the best one in the end, the worst-case regret is typically infinite in this case. In order to retain MDL as a useful inference procedure, we are forced to embrace the luckiness principle. A good way to do this is to order the hypotheses such that $H_1$ is luckier than $H_2$, $H_2$ is luckier than $H_3$, and so on. We then need to consider only codes $L$ for which the code lengths increase monotonically with the index of the hypothesis. This immediately gives a lower bound on the code lengths $L(H_n)$ for $n = 1, \ldots$, because the nondecreasing code that has the shortest code word for hypothesis $n$ is uniform on the set $\{H_1, \ldots, H_n\}$ (and unable to express hypotheses with higher indices). Thus $L(H_n) \ge \lceil \log |\{H_1, \ldots, H_n\}| \rceil = \lceil \log n \rceil$. It is possible to define codes $L$ with $L(H_n) = \log n + O(\log \log n)$, i.e. not much larger than this ideal. Rissanen describes one such code, called the “universal code for the integers”, in [68] (where the restriction to monotonically increasing code word lengths is not interpreted as an application of the luckiness principle, as we do here); in Example 4 we describe some codes for the natural numbers that are convenient in practical applications.

1.1.3 Ideal MDL

In MDL hypothesis selection as described above, it is perfectly conceivable that the data generating process has a very simple structure, which nevertheless remains undetected because it is not represented by any of the considered hypotheses. For example, we may use hypothesis selection to determine the best Markov chain order for data which reads “110010010000111111...”, never suspecting that this is really just the beginning of the binary expansion of the number $\pi$. In “ideal MDL” such blind spots are avoided, by interpreting any code length function $L_H$ that can be implemented in the form of a computer program as the formal representation of a hypothesis $H$. Fix a universal prefix Turing machine $U$, which can be interpreted as a language in which computer programs are expressed. The result of running program $T$ on $U$ with input $D$ is denoted $U(T, D)$. The Kolmogorov complexity of a hypothesis $H$, denoted $K(H)$, is the length of the shortest program $T_H$ that implements $L_H$, i.e. $U(T_H, D) = L_H(D)$ for all binary sequences $D$. Now the hypothesis $H$ can be encoded by literally listing the program $T_H$, so that the code length of the hypothesis becomes its Kolmogorov complexity. For a thorough introduction to Kolmogorov complexity, see [60]. In the literature the term “ideal MDL” is used for a number of approaches to model selection based on Kolmogorov complexity; for more information on the version described here, refer to [8, 1]. To summarise, our version of ideal MDL tells us to pick

$$\min_{H \in \mathcal{H}} K(H) + L_H(D), \tag{1.5}$$

which is (1.1), except that now $\mathcal{H}$ is the set of all hypotheses represented by computable length functions, and $L(H) = K(H)$. (Note that $L_H(D) \approx K(D \mid H)$ iff $D$ is $P(\cdot \mid H)$-random.)

In order for this code to be in agreement with the MDL philosophy as described above, we have to check whether or not it has small regret. It is also natural to wonder whether or not it somehow applies the luckiness principle. The following property of Kolmogorov complexity is relevant for the answer to both questions. Let $\mathcal{H}$ be a countable set of hypotheses with computable corresponding length functions. Then for all computable length functions $L$ on $\mathcal{H}$, we have

$$\exists c > 0 : \forall H \in \mathcal{H} : K(H) \le L(H) + c. \tag{1.6}$$

Roughly speaking, this means that ideal MDL is ideal in two respects: first, the set of considered hypotheses is expanded to include all computable hypotheses, so that any computable concept can be learned given enough data. Second, it matches all other length functions up to a constant, including all length functions with small regret as well as length functions with any clever application of the luckiness principle.

On the other hand, performance guarantees such as (1.6) are not very specific, as the constant overhead may be so large that it completely dwarfs the length of the data. To avoid this, we would need to specify a particular universal Turing machine $U$, and give specific upper bounds on the values that $c$ can take for important choices of $\mathcal{H}$ and $L$. While there is some work on such a concrete definition of Kolmogorov complexity for individual objects [91], there are as yet no concrete performance guarantees for ideal MDL or other forms of algorithmic inference.

The more fundamental reason why ideal MDL is not practical is that Kolmogorov complexity is uncomputable. Thus it should be appreciated as a theoretical ideal that can serve as an inspiration for the development of methods that can be applied in practice.

1.2 Using Models to Encode the Data

At the beginning of this chapter we asked how the codes that formally represent the hypotheses should be constructed. Often many different interpretations are possible, and it is a matter of judgement how exactly a hypothesis should be made precise. There is one important special case, however, where the hypothesis is formulated in the form of a set of probability distributions. Statisticians call such a set a model. Possible models include the set of all normal distributions with any mean and variance, or the set of all third order Markov chains, and so on. The problem of model selection is central to this thesis, but before we can discuss how the codes to represent models are chosen, we have to discuss the close relationship between coding and probability theory.

1.2.1 Codes and Probability Distributions

We have introduced the MDL principle in terms of coding; here we will make precise what we actually mean by a code and what properties we require our codes to have. We also make the connection to statistics by describing the correspondence between code length functions and probability distributions.

A code $C : \mathcal{X} \to \{0,1\}^*$ is an injective mapping from a countable source alphabet $\mathcal{X}$ to finite binary sequences called code words. We consider only prefix codes, that is, codes with the property that no code word is the prefix of another code word. This restriction ensures that the code is uniquely decodable, i.e. any concatenation of code words can be decoded into only one concatenation of source symbols. Furthermore, a prefix code has the practical advantage that no look-ahead is required for decoding: given any concatenation $S$ of code words, the code word boundaries in any prefix of $S$ are determined by that prefix and do not depend on the remainder of $S$. Prefix codes are as efficient as other uniquely decodable codes; that is, for any uniquely decodable code $C$ with length function $L_C$ there is a prefix code $C'$ with $L_{C'}(x) \le L_C(x)$ for all $x \in \mathcal{X}$, see [25, Chapter 5]. Since we never consider non-prefix codes, from now on, whenever we say “code”, this should be taken to mean “prefix code”. Associated with a code $C$ is a length function $L : \mathcal{X} \to \mathbb{N}$, which maps each source symbol $x \in \mathcal{X}$ to the length of its code word $C(x)$.

Of course we want to use efficient codes, but there is a limit to how short code words can be made. For example, there is only one binary sequence of length zero, two binary sequences of length one, four of length two, and so on. The precise limit is expressed by the Kraft inequality:

Lemma 1.2.1 (Kraft inequality). Let $\mathcal{X}$ be a countable source alphabet. A function $L : \mathcal{X} \to \mathbb{N}$ is the length function of a prefix code on $\mathcal{X}$ if and only if

$$\sum_{x \in \mathcal{X}} 2^{-L(x)} \le 1.$$

Proof. See for instance [25, page 82].

If the inequality is strict, then the code is called defective; otherwise it is called complete. (The term “defective” is usually reserved for probability distributions, but we apply it to code length functions as well.) Let $C$ be any prefix code on $\mathcal{X}$ with length function $L$, and define

$$\forall x \in \mathcal{X} : P(x) := 2^{-L(x)}. \tag{1.7}$$

Since $P(x)$ is always positive and sums to at most one, it can be interpreted as a probability mass function that defines a distribution corresponding to $C$. This mass function and distribution are called complete or defective if and only if $C$ is. Vice versa, given a distribution $P$, according to the Kraft inequality there must be a prefix code $L$ satisfying

$$\forall x \in \mathcal{X} : L(x) := \lceil -\log P(x) \rceil. \tag{1.8}$$
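The correspondence (1.7)–(1.8) is easy to operationalise. The following minimal Python sketch (function names are ours, not the thesis's) converts between the two representations and checks the Kraft inequality along the way:

```python
import math

def lengths_to_mass_function(lengths):
    """(1.7): map code lengths L(x) to the corresponding (possibly
    defective) probability mass function P(x) = 2^(-L(x))."""
    mass = {x: 2.0 ** -l for x, l in lengths.items()}
    # By the Kraft inequality, prefix-code lengths sum to at most one.
    assert sum(mass.values()) <= 1.0 + 1e-12, "not a prefix code length function"
    return mass

def mass_function_to_lengths(mass):
    """(1.8): map a probability mass function to integer prefix-code
    lengths L(x) = ceil(-log2 P(x)); these again satisfy Kraft."""
    return {x: math.ceil(-math.log2(p)) for x, p in mass.items()}

# A complete code on three symbols and its associated distribution:
print(lengths_to_mass_function({"a": 1, "b": 2, "c": 2}))
```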

To further clarify the relationship between $P$ and its corresponding $L$, define the entropy of a distribution $P$ on a countable outcome space $\mathcal{X}$ by

$$H(P) := -\sum_{x \in \mathcal{X}} P(x) \log P(x). \tag{1.9}$$

(This $H$ should not be confused with the $H$ used for hypotheses.) According to Shannon's noiseless coding theorem [79], the mean number of bits used to encode outcomes from $P$ using the most efficient code is at least equal to the entropy, i.e. for all length functions $L'$ of prefix codes, we have $E_P[L'(X)] \ge H(P)$. The expected code length using the $L$ from (1.8) stays within one bit of the entropy. This bit is a consequence of the requirement that code lengths be integers. Note that, apart from rounding, (1.7) and (1.8) describe a one-to-one correspondence between probability distributions and the code length functions that are most efficient for those distributions.

Technically, probability distributions are usually more convenient to work with than code length functions, because their usage does not involve rounding. But conceptually, code length functions are often more intuitive objects than probability distributions. A practical reason for this is that the probability of the observations typically decreases exponentially as the number of observations increases, and such small numbers are hard to handle psychologically, or even to plot in a graph. Code lengths typically grow linearly with the number of observations, and have an analogy in the real world, namely the effort required to remember all the obtained information, or the money spent on storage equipment.

A second disadvantage of probability theory is the philosophical controversy regarding the interpretation of probability [76, 12]: for example, the frequentist school of thought holds that probability only carries any meaning in the context of a repeatable experiment. The frequency of a particular observation converges as more observations are gathered; this limiting value is then called the probability. According to the Bayesian school, on the other hand, a probability can also express a degree of belief in a certain proposition, even outside the context of a repeatable experiment. In both cases, specifying the probability of an event usually means making a statement about (one's beliefs about) the true state of nature. This is problematic, because people of necessity often work with very crude probabilistic models, which everybody agrees have no real truth to them. In such cases, probabilities are used to represent beliefs/knowledge which one a priori knows to be false! Using code lengths allows us to avoid this philosophical can of worms, because code lengths do not carry any such associations. Rather, codes are usually judged on the basis of their performance in practice: a good code achieves a short code length on data that we observe in practice, whether it is based on valid assumptions about the truth or not. We embrace this “engineering criterion” because we feel that inference procedures should be motivated solely on the basis of guarantees that we can give with respect to their performance in practice.

In order to get the best of both worlds: the technical elegance of probability theory combined with the conceptual clarity of coding theory, we generalise the concept of coding such that code lengths are no longer necessarily integers. While the length functions associated with such “ideal codes” are really just alternative representations of probability mass functions, and probability theory is used under the hood, we will nonetheless call negative logarithms of probabilities “code lengths”, to aid intuition and avoid confusion. Since the difference between an ideal code length and the length using a real code is at most one bit, this generalisation should not require too large a stretch of the imagination. In applications where it is important to actually encode the data, rather than just compute its code length, there is a practical technique called arithmetic coding [REF] which can usually be applied to achieve the ideal code length to within a few bits; this is outside the scope of this thesis, because for MDL inference we only have to compute code lengths, and not actual codes.

Example 4. In Section 1.1.2 we remarked that a good code for the natural numbers always achieves code length close to $\log n$. Consider the distribution $W(n) = f(n) - f(n+1)$. If $f : \mathbb{N} \to \mathbb{R}$ is a decreasing function with $f(1) = 1$ and $f(n) \to 0$, then $W$ is an easy to use probability mass function that can be used as a prior distribution on the natural numbers. For $f(n) = 1/n$ we get code lengths $-\log W(n) = \log\bigl(n(n+1)\bigr) \approx 2 \log n$, which, depending on the application, may be small enough. Even more efficient for high $n$ are $f(n) = n^{-\alpha}$ for some $0 < \alpha < 1$, or $f(n) = 1/\log(n+1)$.
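As a quick illustration of Example 4, here is a minimal sketch (ours, not the thesis's) that computes these ideal code lengths for the three choices of $f$; log denotes the binary logarithm, as elsewhere in the thesis:

```python
import math

def integer_code_length(n, f):
    """Ideal code length -log2 W(n) in bits, where W(n) = f(n) - f(n+1)."""
    return -math.log2(f(n) - f(n + 1))

choices = {
    "f(n) = 1/n":           lambda n: 1.0 / n,
    "f(n) = n^(-1/2)":      lambda n: n ** -0.5,
    "f(n) = 1/log2(n+1)":   lambda n: 1.0 / math.log2(n + 1),
}
for name, f in choices.items():
    lengths = [round(integer_code_length(n, f), 1) for n in (1, 100, 10000)]
    print(name, lengths)  # lengths grow roughly like a multiple of log n
```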

1.2.2 Sequential Observations

In practice, model selection is often used in a sequential setting, where rather than observing one outcome from some space $\mathcal{X}$, we keep making more and more observations $x_1, x_2, \ldots$. Here we discuss how probability theory can be applied to such a scenario. There are in fact three different ways to do it; while the first is perhaps the easiest to understand, the other two will eventually be used as well, so we will outline all of them here. This section can be skipped by readers who find themselves unconfused by the technicalities of sequential probability. We consider only countable outcome spaces for simplicity; the required definitions for uncountable outcome spaces are given later as they are needed.

A first approach is to define a sequence of distributions $P^1, P^2, \ldots$ Each $P^n$ is then a distribution on the $n$-fold product space $\mathcal{X}^n$ which is used in reasoning about outcome sequences of length $n$. Such a sequence of distributions defines a random process if it is consistent, which means that for all $n \in \mathbb{N}$ and all sequences $x^n \in \mathcal{X}^n$, the mass function satisfies $P^n(x^n) = \sum_{x \in \mathcal{X}} P^{n+1}(x^n, x)$. With some abuse of notation, we often use the same symbol for the mass function and the distribution; we will sometimes explicitly mention which we mean and sometimes consider it clear from context.

A mathematically more elegant solution is to define a distribution on infinite sequences of outcomes $\mathcal{X}^\infty$, called a probability measure. A sequence of outcomes $x^n \in \mathcal{X}^n$ can then be interpreted as an event, namely the set of all infinite sequences of which $x^n$ is a prefix; the marginal distributions on prefixes of length $0, 1, \ldots$ are automatically consistent. With proper definitions, a probability measure on $\mathcal{X}^\infty$ uniquely defines a random process and vice versa. Throughout this thesis, wherever we talk about “distributions” on sequences without further qualification, we actually mean random processes, or equivalently, suitably defined probability measures.

A random process or probability measure defines a predictive distribution $P(X_{n+1} \mid x^n) = P(x^n; X_{n+1}) / P(x^n)$ on the next outcome given a sequence of previous observations. A third solution, introduced by Dawid [26, 29, 28], is to turn this around and start by specifying the predictive distribution, which in turn defines a random process. Dawid defines a prequential forecasting system (PFS) as an algorithm which, when input any initial sequence of outcomes $x^n$, issues a probability distribution on the next outcome. The total “prequential” probability of a sequence is then given by the chain rule:

$$P(x^n) = P(x^1) \cdot \frac{P(x^2)}{P(x^1)} \cdots \frac{P(x^n)}{P(x^{n-1})} = P(x_1) \cdot P(x_2 \mid x^1) \cdots P(x_n \mid x^{n-1}). \tag{1.10}$$

This framework is simple to understand and actually somewhat more powerful than that of probability measures. Namely, if a random process assigns probability zero to a sequence of observations $x^n$, then the predictive distribution $P(X_{n+1} \mid x^n) = P(x^n; X_{n+1}) / P(x^n) = 0/0$ is undefined. This is not the case for a PFS, which starts by defining the predictive distributions. This property of PFSs becomes important in Chapters 4 and 5, because there we consider predictions made by experts, who may well assign probability zero to an event that turns out to occur!
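A PFS and the chain rule (1.10) translate directly into code. The sketch below is our own illustration, with a hypothetical `predict` interface; it accumulates the prequential probability of a sequence one outcome at a time, here using the Laplace rule of succession as an example PFS:

```python
def prequential_prob(predict, outcomes):
    """Chain rule (1.10): multiply the predictive probabilities of the
    observed outcomes. `predict(prefix)` must return a dict mapping each
    possible next outcome to its probability given the prefix."""
    prob, prefix = 1.0, []
    for x in outcomes:
        prob *= predict(prefix)[x]
        prefix.append(x)
    return prob

def laplace_predict(prefix):
    """A simple PFS for binary outcomes: the Laplace rule of succession,
    P(1 | prefix) = (#ones + 1) / (n + 2). It never assigns probability
    zero, so the predictive distribution is always well defined."""
    p_one = (sum(prefix) + 1) / (len(prefix) + 2)
    return {0: 1.0 - p_one, 1: p_one}

print(prequential_prob(laplace_predict, [1, 1, 0, 1]))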

1.2.3 Model Selection and Universal Coding

Now that we have identified the link between codes and probability distributions, it is time to address the question which codes we should choose to represent hypotheses. In case the hypothesis is given in terms of a probability distribution or code from the outset, no more work needs to be done, and we can proceed straight to doing MDL model selection as described at the start of this chapter. Here we consider instead the case where the hypothesis is given in the form of a model $\mathcal{M} = \{P_\theta : \theta \in \Theta\}$, which is a set of random processes or probability measures parameterised by a vector $\theta$ from some set $\Theta$ of allowed parameter vectors, called the parameter space. For example, a model could be the set of all fourth order Markov chains, or the set of all normal distributions. Now it is no longer immediately clear which single code should represent the hypothesis.

To motivate the code $L_{\mathcal{M}}$ that represents such a model, we apply the same reasoning as we used in Section 1.1. Namely, the code should guarantee a small regret, but is allowed to favour some small subsets of the model on the basis of the luckiness principle. To make this idea more precise, we first introduce a function $\hat\theta : \mathcal{X}^* \to \Theta$ called the maximum likelihood estimator, which is defined by

$$\hat\theta(x^n) := \mathop{\mathrm{arg\,max}}_{\theta \in \Theta} P_\theta(x^n). \tag{1.11}$$

(The following can be extended to the case where the arg max is undefined, but we omit the details here.) Obviously, the maximum likelihood element of the model also minimises the code length for the data. We abbreviate $\hat\theta = \hat\theta(x^n)$ if the sequence of outcomes $x^n$ is clear from context.
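For concreteness, here is a minimal sketch (ours) of the ML estimator and the corresponding code length for the Bernoulli model that serves as the running example below; for that model the arg max in (1.11) has the closed form $\hat\theta = h/n$:

```python
import math

def bernoulli_ml(xs):
    """Maximum likelihood estimate (1.11) for the Bernoulli model:
    the observed frequency of ones (heads)."""
    return sum(xs) / len(xs)

def bernoulli_code_length(xs, theta):
    """-log2 P_theta(xs) in bits: the (ideal) code length of the data
    under the Bernoulli distribution with bias theta."""
    h = sum(xs)
    t = len(xs) - h
    if (theta == 0.0 and h > 0) or (theta == 1.0 and t > 0):
        return math.inf  # the data are impossible under this parameter
    bits = 0.0
    if h:
        bits -= h * math.log2(theta)
    if t:
        bits -= t * math.log2(1.0 - theta)
    return bits
```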

Definition 1.2.2. Let $\mathcal{M} := \{L_\theta : \theta \in \Theta\}$ be a (countable or uncountably infinite) model with parameter space $\Theta$. Let $f : \Theta \times \mathbb{N} \to [0, \infty)$ be some function. A code $L$ is called $f$-universal for the model $\mathcal{M}$ if, for all $n \in \mathbb{N}$ and all $x^n \in \mathcal{X}^n$, we have

$$\mathcal{R}(L, \mathcal{M}, x^n) \le f(\hat\theta, n).$$

This is quite different from the standard definition of universality [25], because it is formulated in terms of individual sequences rather than in expectation. It is also a very general formulation, with a function $f$ that needs further specification. Generality is needed because, as it turns out, different degrees of universality are possible in different circumstances. This definition allows us to express easily what we expect a code to live up to in all these cases.

For finite $\mathcal{M}$, a uniform code similar to the one described in Example 2 achieves $f$-universality for $f(\hat\theta, n) = \log |\mathcal{M}|$. Of course we incur a small overhead on top of this if we decide to use luckiness codes.

For countably infinite $\mathcal{M}$, the regret cannot be bounded by a single constant, but we can avoid dependence on the sample size. Namely, if we introduce a distribution $W$ on the parameter set, we can achieve $f$-universality for $f(\hat\theta, n) = -\log W(\hat\theta)$ by using a Bayesian or two-part code (these are explained in subsequent sections).

Finally, for uncountably infinite $\mathcal{M}$ it is often impossible to obtain a regret bound that does not depend on $n$. For parametric models, however, it is often possible to achieve $f$-universality for $f(\hat\theta, n) = \frac{k}{2} \log \frac{n}{2\pi} + g(\hat\theta)$, where $k$ is the number of parameters of the model and $g : \Theta \to [0, \infty)$ is some continuous function of the maximum likelihood parameter. Examples include the Poisson model, which can be parameterised by the mean of the distribution, so $k = 1$, and the normal model, which can be parameterised by mean and variance, so $k = 2$. Thus, for parametric uncountable models a logarithmic dependence on the sample size is the norm.

We have now seen that in MDL model selection, universal codes are used on two levels: on one level, each model is represented by a universal code. Then

In Section 1.1 we defined a code Lmdl that we now understand to be universal for the model LH : H . This kind of universal code is called a two-part code, because it{ consists∈ first H} of a specification of an element of the model, and second of a specification of the data using that element. Two-part codes may be defective: this occurs if multiple code words represent the same source symbol. In that case one must ensure that the encoding function is well-defined by specifying exactly which representation is associated with each source word D. Since we are only concerned with code lengths however, it suffices to adopt the convention that we always use one of the shortest representations.

Example 5. We define a two-part code for the Bernoulli model. In the first part of the code we specify a parameter value, which requires some discretisation since the parameter space is uncountable. However, as the maximum likelihood parameter for the Bernoulli model is just the observed frequency of heads, at a sample size of n we know that the ML parameter is in the set 0/n, 1/n, . . . , n/n . { } We discretise by restricting the parameter space to this set. A uniform code uses L(θ) = log(n + 1) bits to identify an element of this set. For the data we can use the code corresponding to θ. The total code length is minimised by ML distribution, so that we know that the regret is always exactly log(n+1); by using slightly cleverer discretisation we can bring this regret down even more such that 1 it grows as 2 log n, which, as we said, is usually achievable for uncountable single parameter models.
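A minimal sketch (ours) of this two-part code length for the Bernoulli model: $\lceil \log(n+1) \rceil$ bits name the discretised ML parameter, and the data are then encoded under that parameter.

```python
import math

def two_part_length(xs):
    """Two-part code length of Example 5: ceil(log2(n+1)) bits for the
    ML parameter h/n, plus -log2 P_{h/n}(xs) bits for the data."""
    n, h = len(xs), sum(xs)
    t = n - h
    param_bits = math.ceil(math.log2(n + 1))
    data_bits = 0.0
    if h:
        data_bits -= h * math.log2(h / n)
    if t:
        data_bits -= t * math.log2(t / n)
    return param_bits + data_bits
```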

The Bayesian Universal Distribution

Let $\mathcal{M} = \{P_\theta : \theta \in \Theta\}$ be a countable model; it is convenient to use mass functions rather than codes as elements of the model here. Now define a distribution with mass function $W$ on the parameter space $\Theta$. This distribution is called the prior distribution in the literature, as it is often interpreted as a representation of a priori beliefs as to which of the hypotheses in $\mathcal{M}$ represents the “true state of the world”. More in line with the philosophy outlined above would be the interpretation that

$W$ is a code which should be chosen for practical reasons to optimise inference performance. At any rate, the next step is to define a joint distribution $P$ on $\mathcal{X}^n \times \Theta$ by $P(x^n, \theta) = P_\theta(x^n) W(\theta)$. In this joint space, each outcome comprises a particular state of the world and an observed outcome.

In the field of Bayesian statistics, inference is always based on this joint distribution. We may, for example, calculate $P(x^n)$, where $x^n = \{(x^n, \theta) : \theta \in \Theta\}$ denotes the event in the joint space that a particular sequence of outcomes $x^n$ is observed. Second, we can calculate how we should update our beliefs about the state of the world after observing outcome $x^n$. Let $\theta = \{(x^n, \theta) : x^n \in \mathcal{X}^n\}$ denote the event in the joint space that $\theta$ is true. Then we have:

$$P(\theta \mid x^n) = \frac{P(x^n \cap \theta)}{P(x^n)} = \frac{P(x^n \mid \theta)\,P(\theta)}{P(x^n)}. \tag{1.12}$$

This result is called Bayes' rule; its importance to inference stems from the idea that it can be used to update beliefs about the world on the basis of new observations $x^n$. The conditional distribution on the hypotheses is called the posterior distribution; with considerable abuse of notation it is often denoted $W(\theta \mid x^n)$.

Note that the marginal distribution satisfies:

$$-\log P(x^n) = -\log \sum_{\theta \in \Theta} P_\theta(x^n) W(\theta) \le -\log P_{\hat\theta}(x^n) - \log W(\hat\theta),$$

where $\hat\theta$ is the maximum likelihood estimator, the element of $\mathcal{M}$ that minimises the code length for the data. Thus we find that if we use a Bayesian universal code, we obtain a code length less than or equal to the code length we would have obtained with the two-part code with $L(\theta) = -\log W(\theta)$. Since we already found that two-part codes are universal, we can now conclude that Bayesian codes are at least as universal. On the flip side, the sum involved in calculating the Bayesian marginal distribution can be hard to evaluate in practice.

Example 5 (continued). Our definitions readily generalise to uncountable models with $\Theta \subseteq \mathbb{R}^k$, with the prior distribution given by a density $w$ on $\Theta$. Rather than giving explicit definitions, we revisit our running example. We construct a Bayesian universal code for the Bernoulli model. For simplicity we use a uniform prior density, $w(\theta) = 1$. Let $h$ and $t = n - h$ denote the number of heads and tails in $x^n$, respectively. Now we can calculate the Bayes marginal likelihood of the data:

$$P_{\mathrm{bayes}}(x^n) = \int_0^1 P_\theta(x^n) \cdot 1 \, d\theta = \int_0^1 \theta^h (1 - \theta)^t \, d\theta = \frac{h!\,t!}{(n + 1)!}.$$

Using Stirling's approximation of the factorial function, we find that the corresponding code length $-\log P_{\mathrm{bayes}}(x^n)$ equals $-\log P_{\hat\theta}(x^n) + \frac{1}{2} \log n + O(1)$. Thus we find roughly the same regret as for a well-designed two-part code.
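The Bayes marginal code length for the Bernoulli example can be computed exactly from the closed form $h!\,t!/(n+1)!$. A minimal sketch (ours), using log-gamma to stay numerically stable for large samples:

```python
import math

def bayes_length(xs):
    """-log2 P_bayes(xs) for the Bernoulli model with a uniform prior,
    where P_bayes(x^n) = h! t! / (n+1)! (Example 5, continued)."""
    n, h = len(xs), sum(xs)
    t = n - h
    # lgamma(k + 1) = ln k!, so this is ln[h! t! / (n+1)!] in base 2.
    log2_marginal = (math.lgamma(h + 1) + math.lgamma(t + 1)
                     - math.lgamma(n + 2)) / math.log(2)
    return -log2_marginal
```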

Prequential Universal Distributions

An algorithm that, given a sequence of previous observations $x^n$, issues a probability distribution on the next outcome $P(X_{n+1} \mid x^n)$, is called a prequential forecasting system (PFS) [26, 29, 28]. A PFS defines a random process by means of the chain rule (1.10). Vice versa, for any random process $P$ on $\mathcal{X}^\infty$, we can calculate the conditional distribution on the next outcome $P(X_{n+1} \mid X^n = x^n) = P(x^n; X_{n+1}) / P(x^n)$, provided that $P(x^n)$ is positive. This so-called predictive distribution of a random process defines a PFS. We give two important prequential universal distributions here.

First, we may take a Bayesian approach and define a joint probability measure on $\mathcal{X}^\infty \times \Theta$ based on some prior distribution $W$. As before, this induces a marginal probability measure on $\mathcal{X}^\infty$, which in turn defines a PFS. In this way, the Bayesian universal distribution can be reinterpreted as a prequential forecasting system.

Second, since the Bayesian predictive distribution can be hard to compute, it may be useful in practice to define a forecasting system that uses a simpler algorithm. Perhaps the simplest option is to predict an outcome $X_{n+1}$ using the maximum likelihood estimator for the previous outcomes, $\hat\theta(x^n) = \mathop{\mathrm{arg\,max}}_\theta P_\theta(x^n)$. We will use this approach in our running example.

Example 5 (continued). The ML estimator for the Bernoulli model, parameterised by the probability of observing heads, equals the frequency of heads in the sample: θ̂ = h/n, where h denotes the number of heads in x^n as before. We define a PFS through P(X_{n+1} | x^n) := P_θ̂. This PFS is ill-defined for the first outcome. Another impractical feature is that it assigns probability 0 to the event that the second outcome is different from the first. To address these problems, we slightly tweak the estimator: rather than θ̂ we use θ̃ = (h+1)/(n+2). Perhaps surprisingly, in this case the resulting PFS is equivalent to the Bayesian universal distribution approach we defined in the previous section: P_θ̃ turns out to be the Bayesian predictive distribution for the Bernoulli model if a uniform prior density w(θ) = 1 is used. In general, the distribution indexed by such a "tweaked" ML estimator may be quite different from the Bayesian predictive distribution.
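To make this equivalence concrete, here is a small Python sketch (an illustration, not from the thesis) that computes the total code length induced by the tweaked estimator θ̃ = (h+1)/(n+2) via the chain rule, and checks it against the Bayes marginal likelihood h! t!/(n+1)! under the uniform prior; the two agree up to floating-point rounding.

    import math

    def laplace_pfs_code_length(xs):
        # Predict each outcome with theta~ = (h+1)/(i+2), where h counts the
        # heads among the first i outcomes; sum -log2 of the probabilities.
        total, h = 0.0, 0
        for i, x in enumerate(xs):
            p_heads = (h + 1) / (i + 2)
            total -= math.log2(p_heads if x == 1 else 1 - p_heads)
            h += x
        return total

    def bayes_code_length(xs):
        # -log2 of the Bayes marginal likelihood h! t! / (n+1)!
        n, h = len(xs), sum(xs)
        return -(math.lgamma(h + 1) + math.lgamma(n - h + 1)
                 - math.lgamma(n + 2)) / math.log(2)

    xs = [1, 0, 0, 1, 1, 0, 1, 1]
    print(laplace_pfs_code_length(xs), bayes_code_length(xs))  # equal values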

Although the prequential ML code has been used successfully in practical inference problems, the model selection experiments in Chapter 2 show that the use of PFSs based on estimators such as the "tweaked" ML estimator from the example leads to significantly worse results than those obtained for the other three universal codes. This is the subject of Chapter 3, where we show that the prequential ML code does in fact achieve the same asymptotic regret as the other universal codes, provided that the distribution that generates the data is an element of the model. Under misspecification, however, the prequential ML code exhibits behaviour that had not been observed before.

Normalised Maximum Likelihood

The last universal code we discuss is the one preferred in the MDL literature, because if we fix some sample size n in advance, it provably minimises the worst-case regret. It turns out that the code minimising the worst-case regret must achieve equal regret for all possible outcomes x^n. In other words, the total code length must always be some constant longer than the code length achieved on the basis of the maximum likelihood estimator. This is precisely what the Normalised Maximum Likelihood (NML) distribution achieves:

P_{\mathrm{nml}}(x^n) := \frac{P_{\hat\theta(x^n)}(x^n)}{\sum_{y^n \in \mathcal{X}^n} P_{\hat\theta(y^n)}(y^n)}.  (1.13)

For all sequences x^n, the regret on the basis of this distribution is exactly equal to the logarithm of the denominator, called the parametric complexity of the model:

\inf_L \sup_{x^n} \mathcal{R}(L, \mathcal{M}, x^n) = \log \sum_{y^n \in \mathcal{X}^n} P_{\hat\theta(y^n)}(y^n).  (1.14)

Under some regularity conditions on the model it can be shown [72, 39] that there is a particular continuous function g such that the parametric complexity is less than (k/2) log(n/2π) + g(θ̂), as we required in Section 1.2.3. We return to our Bernoulli example.

Example 5 (continued). The parametric complexity (1.14) has exponentially many terms, but for the Bernoulli model the expression can be significantly simplified. Namely, we can group together all terms which have the same maximum likelihood estimator. Thus the minimal worst-case regret can be rewritten as follows:

\log \sum_{y^n \in \mathcal{X}^n} P_{\hat\theta(y^n)}(y^n) = \log \sum_{h=0}^{n} \binom{n}{h} \left(\frac{h}{n}\right)^{h} \left(\frac{n-h}{n}\right)^{n-h}.  (1.15)

This sum has only linearly many terms and can usually be evaluated in practice. Approximation by Stirling's formula confirms that the asymptotic regret is ½ log n + O(1), the same as for the other universal distributions.

The NML distribution has a number of significant practical problems. First, it is often undefined, because for many models the denominator in (1.13) is infinite, even for such simple models as the model of all Poisson or geometric distributions. Second, |X^n| may well be extremely large, in which case it may be impossible to actually calculate the regret. In a sequential setting, the number of terms grows exponentially with the sample size, so calculating the NML probability is hopeless except in special cases such as the Bernoulli model above, where mathematical trickery can be applied to get exact results (for the multinomial model, see [50]). Chapter 2 addresses the question of what can be done when NML is undefined or too difficult to compute.
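The sum in (1.15) is indeed straightforward to evaluate. The following Python sketch (an illustration, not part of the original text; complexities are reported in bits, i.e. base-2 logarithms) computes the Bernoulli parametric complexity and shows that it grows like ½ log n.

    import math

    def bernoulli_complexity(n):
        # log2 of sum_{h=0}^{n} C(n,h) (h/n)^h ((n-h)/n)^(n-h), cf. (1.15)
        total = 0.0
        for h in range(n + 1):
            log_term = (math.lgamma(n + 1) - math.lgamma(h + 1)
                        - math.lgamma(n - h + 1))
            if 0 < h < n:   # the h = 0 and h = n terms are exactly 1
                log_term += h * math.log(h / n) + (n - h) * math.log(1 - h / n)
            total += math.exp(log_term)
        return math.log2(total)

    for n in (10, 100, 1000):
        print(n, bernoulli_complexity(n), 0.5 * math.log2(n))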

1.3 Applications of MDL

While we have equated "learning" with "compressing the data" so far, in reality we often have a more specific goal in mind when we apply machine learning algorithms. This touches on the third question we asked on page 2: what do we mean by a "useful" hypothesis? In this section we argue that MDL inference, although designed to achieve compression of the data, is also suitable, at least to some extent, in settings with a different objective, so that MDL provides a reliable one-size-fits-all solution.

1.3.1 Truth Finding

We have described models as formal representations of hypotheses; truth finding is the process of determining which of the hypotheses is "true", in the sense that the corresponding model contains the data generating process. The goal is then to identify this true model on the basis of as little data as possible. Since the universal code for the true model achieves a code length not much larger than the code length of the best code in the model (it was designed to achieve small regret), and the best code in the model achieves a code length at most as large as that of the data generating distribution, it seems reasonable to expect that, as more data are gathered, the true model will eventually be selected by MDL. This intuition is confirmed in the form of the following consistency result, which is one of the pillars of MDL model selection. The result applies to Bayesian model selection as well.

Theorem 1.3.1 (Model Selection Consistency). Let \mathcal{H} = \{\mathcal{M}_1, \mathcal{M}_2, \ldots\} be a countably infinite set of parametric models. For all n ∈ Z⁺, let \mathcal{M}_n be 1-to-1 parameterised by Θ_n ⊆ R^k for some k ∈ N, and define a prior density w_n on Θ_n. Let W(i) define a prior distribution on the model indices. We require for all integers j > i > 0 that, with w_j-probability 1, a distribution drawn from Θ_j is mutually singular with all distributions in Θ_i. Define the MDL model selection criterion based on Bayesian universal codes to represent the models:

\delta(x^n) = \operatorname*{arg\,min}_i \left( -\log W(i) - \log \int_{\theta \in \Theta_i} P_\theta(x^n)\, w_i(\theta)\, d\theta \right).

Then for all δ* ∈ Z⁺, for all θ* ∈ Θ_{δ*}, except for a subset of Θ_{δ*} of Lebesgue measure 0, it holds for X₁, X₂, … ∼ P_{θ*} that

\exists n_0 : \forall n \ge n_0 : P_{\theta^*}\left( \delta(X^n) = \delta^* \right) = 1.

Proof. Proofs of various versions of this theorem can be found in [5, 28, 4, 39] and others.

The theorem uses Bayesian codes, but, as conjectured in [39] and partially confirmed in Chapter 5, it can probably be extended to other universal codes such as NML. Essentially it expresses that if we are in the ideal situation where one of the models contains the true distribution, we are guaranteed that this model will be selected once we have accumulated sufficient data. This property of model selection criteria is called consistency. Unfortunately, this theorem says nothing about the more realistic scenario where the models are merely approximations of the true data generating process. If the true data generating process P* is not an element of any of the models, it is not known under what circumstances the model selection criterion δ selects the model that contains the distribution P_θ̃ minimising the Kullback-Leibler divergence D(P* ‖ P_θ̃). On the other hand, some model selection criteria that are often used in practice (such as maximum likelihood model selection, or AIC [2]) are known not to be consistent, even in the restricted sense of Theorem 1.3.1. Therefore the consistency of MDL and Bayesian model selection is reassuring.

1.3.2 Prediction

The second application of model selection is prediction. Here the models are used to construct a prediction strategy, which is essentially the same as a prequential forecasting system: an algorithm that issues a probability distribution on the next outcome given the previous outcomes. There are several ways to define a prediction strategy based on a set of models; we mostly consider what is perhaps the most straightforward method, namely to apply a model selection criterion such as MDL to the available data, and then predict according to some estimator that issues a distribution on the next outcome based on the selected model and the sequence of previous outcomes.¹

The performance of a prediction strategy based on a set of models is usually analysed in terms of the rate of convergence, which expresses how quickly the distribution issued by the prediction strategy starts to behave like the data generating process P*. From the information inequality (D(P* ‖ Q) ≥ 0, with equality for Q = P*) we know that P* itself is optimal for prediction. The expected discrepancy between the number of bits needed by the prediction strategy to encode the next outcome and the number of bits needed by the true distribution P* is called the risk; a prediction strategy has a good rate of convergence if the risk goes down quickly as a function of the sample size. A number of results that give explicit guarantees about the risk convergence rate for prediction strategies based on the MDL model selection criterion are given in [39]. Interestingly, the risk may converge to zero even if P* is not an element of any of the models under consideration. This is useful in practical applications: for example, if the models correspond to histogram densities with increasing numbers of bins, then the risk converges to zero if P* is described by any bounded and continuous density; see Chapter 5 for more details.

¹This strategy does not take into account the predictions of any of the other models, while at the same time we can never be sure that the selected model is actually the best one. It is better to take a weighted average of the predictions of the models in question; the most common practice is to weight the models by their Bayesian posterior probability, a procedure called Bayesian model averaging; see Chapter 5.

Roughly speaking, two groups of model selection criteria can be distinguished in the literature. Criteria of the first group have often been developed with prediction in mind; criteria such as Akaike's Information Criterion (AIC) and Leave One Out Cross-Validation (LOO) exhibit a very good risk convergence rate, but they can be inconsistent, i.e. they keep selecting the wrong model regardless of the available amount of data. The second group contains criteria such as the Bayesian Information Criterion (BIC), as well as regular Bayesian/MDL model selection, which can be proven consistent but which are often found to yield a somewhat worse rate of convergence. It is a long-standing question whether these two approaches can be reconciled [103].

An assumption that is implicitly made when MDL or Bayesian model selection is used for prediction is that there is one model that exhibits the best predictive performance from the start, and our only task is to identify that model. However, in Chapter 5 we argue that this approach is often too simplistic: in reality, which model can provide the best predictive performance may well depend on the sample size. In the histogram density estimation example from above, we typically do not expect any particular model to contain the true distribution; instead we expect to use models of higher and higher order as we gain enough data to estimate the model parameters with sufficient accuracy to make good predictions.

In Chapter 5 we attempt to combine the strong points of the two different approaches to model selection by dropping the assumption that one model has the best performance at all sample sizes. This results in a modification of MDL and Bayesian model selection/averaging, called the switch-distribution, that achieves better predictive performance without sacrificing consistency. Note that the switch-distribution can be used to improve MDL model selection as well as Bayesian model selection, although the idea may not agree very well with a Bayesian mindset, as we will discuss.

The switch-distribution ties in with the existing universal prediction literature on tracking the best expert, where the assumption that one predictor is best at all sample sizes has already been lifted [94, 46, 95]. In the past, however, that research had not been connected with model selection. In Chapter 4 we describe how the switch-code lengths can be computed efficiently, and introduce a formalism that helps to define other practical methods of combining predictors. These ideas help to understand the connection between many seemingly very different prediction schemes that are described in the literature.

1.3.3 Separating Structure from Noise

Algorithmic rate-distortion theory is a generalisation of ideal MDL. Recall from Section 1.1.3 that in ideal MDL, Kolmogorov complexity is used as the code length function for the hypothesis, L(H) = K(H); model selection is then with respect to the set \mathcal{H} of all computable hypotheses. This approach is generalised by introducing a parameter α, called the rate, that is used to impose a maximum on the complexity of the hypothesis, K(H). The selected hypothesis H, and the code length it achieves on the data, L_H(D), now become functions of α. The latter function is called the structure function:²

h_D(\alpha) = \min_{H \in \mathcal{H},\, K(H) \le \alpha} L_H(D).  (1.16)

We write H(α) for the hypothesis that achieves the minimum at rate α. We use dotted symbols ≐, <̇ and ≤̇ for (in)equality up to an independent additive constant; the dot may be pronounced "roughly". A hypothesis H is called a sufficient statistic if K(H) + L_H(D) ≐ K(D). Intuitively, such hypotheses capture all simple structural properties of the data. Namely, if there are any easily describable properties of D that are not captured by H, then we would have K(D | H) <̇ L_H(D), so that K(H) + L_H(D) would exceed K(H) + K(D | H) ≥̇ K(D) by more than a constant, and H would not be a sufficient statistic after all.

²In Kolmogorov's original formulation, the codes L_H for H ∈ \mathcal{H} were uniform on finite sets of possible outcomes.

In Chapter 6 we put this framework into practice by applying it to denoising and lossy compression. The chapter is written from the point of view of algorithmic rate-distortion theory, but includes a discussion of how the approach relates to MDL as we describe it in this introduction.

1.4 Organisation of this Thesis

In this introduction we have outlined MDL model selection. An important ingredient of MDL model selection is universal coding (Section 1.2.3). We have described several approaches; the universal code preferred in the MDL literature is the code that corresponds to the Normalised Maximum Likelihood (NML) distribution (1.13), which achieves minimal regret in the worst case. However, it turns out that the NML distribution is often undefined. In those cases, the way to proceed is not agreed upon. In Chapter 2, we evaluate various alternatives experimentally by applying MDL to select between the Poisson and geometric models. The results provide insight into the strengths and weaknesses of these various methods.

One important result of the experiments described in Chapter 2 is that the prequential forecasting system (one of the universal models introduced in Section 1.2.3), defined to issue predictions on the basis of the maximum likelihood distribution, achieves a regret very different from that achieved by the other universal models. This leads to inferior model selection performance. This was unexpected, because in the literature, e.g. [71], the prequential ML universal model is described as a practical approximation of other universal codes, certainly suitable for model selection. In Chapter 3 we analyse the regret of the prequential ML universal distribution under misspecification, that is, when the data are drawn from a distribution outside the model. The behaviour under misspecification is important to model selection, since there the data generating distribution is usually contained in at most one of the considered models (if that many). Together, Chapters 2 and 3 form the first leg of this thesis.

The second leg involves the relationship between model selection and prediction (see Section 1.3). It is based on the observation that the model that allows for the best predictions may vary with the sample size. For example, simple models with only few parameters tend to predict reasonably well at small sample sizes, while for large models often a lot of data is required before the parameters can be estimated accurately enough to result in good predictions. While this may seem obvious, we show that the codes that are typically used in Bayesian and MDL model selection do not take this effect into account, and thus the achieved code lengths are longer than they need to be. In Chapter 4 we describe a number of alternative ways to combine the predictions of multiple codes. It is also shown how the resulting code lengths can still be computed efficiently. The discussion expands on results in the source coding and universal prediction literature. One of the described codes, called the switch-code, is specifically designed to improve model selection results. In Chapter 5 we show how model selection based on the switch-code, while remaining consistent, achieves the best possible rate of convergence. This resolves a perceived distinction between consistent model selection criteria on the one hand, and criteria that are suitable for prediction, achieving a fast rate of convergence, on the other hand.

In Chapter 6, the third leg of this thesis, we put the new algorithmic rate-distortion theory to the test. Algorithmic rate-distortion theory is introduced in Section 1.3.3, more thoroughly in Chapter 6 itself, and even more thoroughly in [92].
It is a generalisation of ideal MDL (Section 1.1.3) and a variation on classical rate-distortion theory [79]; it allows for analysis of the rate-distortion properties of individual objects rather than of objects drawn from some source distribution. Algorithmic rate-distortion theory uses Kolmogorov complexity, which is uncomputable. To obtain a practical method, we approximate Kolmogorov complexity by the compressed size under a general purpose data compression algorithm. We then compute the rate-distortion characteristics of four objects from very different domains and apply the results to denoising and lossy compression.

Chapter 2

Dealing with Infinite Parametric Complexity

Model selection is the task of choosing one out of a set of hypotheses for some phenomenon, based on the available data, where each hypothesis is represented by a model, or set of probability distributions. As explained in the introductory chapter, MDL model selection proceeds by associating a so-called universal distribution with each model. The number of bits needed to encode the data D using the universal code associated with model M is denoted L_M(D). We then pick the model that minimises this expression and thus achieves the best compression of the data. Section 1.2.3 lists the most important such universal codes, all of which result in slightly different code lengths and thus in different model selection criteria.

While any choice of universal code will lead to an asymptotically "consistent" model selection method (eventually the right model will be selected), to get good results for small sample sizes it is crucial that we select an efficient code. The MDL literature emphasises that the universal code used should be efficient whatever data are observed: whenever the model contains a code or distribution that fits the data well, in the sense that it achieves a small code length, the universal code length should not be much larger. More precisely put, the universal code should achieve small regret in the worst case over all possible observations, where the regret is the difference between the universal code length and the shortest code length achieved by an element of the model M. The normalised maximum likelihood (NML) code (Section 1.2.3) minimises this worst-case regret. Moreover, the NML regret is a function of only the sample size n, and does not depend on the data. As such it can be viewed as an objective measure of model complexity: the parametric complexity of a model is the regret achieved by the NML code, as a function of the sample size n. For this reason NML is the preferred universal code in modern MDL.

However, it turns out that the parametric complexity is infinite for many models, so that NML-based model selection is not effective. Other universal codes

that do achieve finite code lengths can usually still be defined for such models, but those codes have a worst-case regret that depends not only on the sample size n, but also on the parameter θ̂ of the best-fitting distribution in the model; such codes are luckiness codes in the sense of Section 1.1.1. Model selection criteria based on such luckiness codes thus acquire an element of subjectivity.

Several remedies have been proposed for this situation: variations on NML-based model selection that can obviously not be worst-case optimal, but that do achieve finite code length without explicitly introducing regret that depends on the element of the model that best fits the data. In this chapter we evaluate such alternative procedures empirically for model selection between the simple Poisson and geometric models. We chose these two models since they are just about the simplest and easiest-to-analyse models for which the NML code is undefined. We find that some alternative solutions, such as the use of BIC (or, equivalently in our context, maximum likelihood testing), lead to relatively poor results. Our most surprising finding is the fact that the prequential maximum likelihood code – which was found to perform quite well in some contexts [61, 52] – exhibits poor behaviour in our experiments. We briefly discuss the reasons for this behaviour in Section 2.5; a full analysis is the subject of Chapter 3.

Since the codes used in these approaches no longer minimise the worst-case regret, they are harder to justify theoretically. In fact, as explained in more detail in Section 2.3.6, the only method that may have an MDL-type justification closely related to that of the original NML code is the Bayesian code with the improper Jeffreys' prior. Perhaps not coincidentally, this also seems the most dependable selection criterion among the ones we tried.

In Section 2.1 we describe the code that achieves worst-case minimal regret. This code does not exist for the Poisson and geometric distributions. We analyse these models in more detail in Section 2.2. In Section 2.3 we describe four different approaches to MDL model selection under such circumstances. We test these criteria by measuring error probability, bias and calibration, as explained in Section 2.4. The results are evaluated in Section 2.5. Our conclusions are summarised in Section 2.6.

2.1 Universal codes and Parametric Complexity

We first briefly review some material from the introductory chapter that is especially important to this chapter. The regret of a code L with respect to a parametric model \mathcal{M} = \{P_\theta : \theta \in \Theta\} with parameter space Θ, on a sequence of outcomes x^n = x_1, …, x_n ∈ \mathcal{X}^n, is

\mathcal{R}(L, \mathcal{M}, x^n) := L(x^n) - \inf_{\theta \in \Theta} \left( -\log P_\theta(x^n) \right),

which was first defined in (1.4), page 5. A function θ̂ : \mathcal{X}^n → Θ that maps outcomes x^n to a parameter value that achieves this infimum is called a maximum likelihood estimator. We sometimes abbreviate θ̂ = θ̂(x^n).

A code L is f-universal with respect to model M if \mathcal{R}(L, \mathcal{M}, x^n) ≤ f(θ̂, n) for all n and all x^n (Definition 1.2.2, page 14). The MDL philosophy [72, 38] has it that the best universal code minimises the regret in the worst case over all possible data sequences. This "minimax optimal" solution is called the "Normalised Maximum Likelihood" (NML) code, which was first described by Shtarkov, who also observed its minimax optimality properties. The NML probability of a data sequence for a parametric model is defined as in (1.13), page 18:

P_{\mathrm{nml}}(x^n) := \frac{P(x^n \mid \hat\theta(x^n))}{\sum_{y^n} P(y^n \mid \hat\theta(y^n))},

with corresponding code length

L_{\mathrm{nml}}(x^n) := -\log P(x^n \mid \hat\theta(x^n)) + \log \sum_{y^n} P(y^n \mid \hat\theta(y^n)).

This code length is called the stochastic complexity of the data. The first term in the equation is the code length of the data using the maximum likelihood estimator; the second term, the log of the normalisation factor of P_nml, is called the parametric complexity of the model M at sample size n. It is usually impossible to compute the parametric complexity analytically, but there exists a good approximation L_anml, due to Rissanen, Takeuchi and Barron [72, 88, 89, 87]:

L_{\mathrm{anml}}(x^n) := L(x^n \mid \hat\theta) + \frac{k}{2} \log \frac{n}{2\pi} + \log \int_\Theta \sqrt{\det I(\theta)}\, d\theta.  (2.1)

Here, n, k and I(θ) denote the number of outcomes, the number of parameters and the Fisher information matrix respectively. If M is an exponential family model such as the Poisson and Geometric models considered in this chapter, parameterised by Θ ⊆ R^k for some k ∈ Z⁺, and if both the parametric complexity and the integral in (2.1) are finite, then we have the following. For any Θ′ with nonempty interior, and whose closure is a bounded subset of the interior of Θ, we have

\lim_{n\to\infty} \sup_{x^n : \hat\theta(x^n) \in \Theta'} \left| L_{\mathrm{nml}}(x^n) - L_{\mathrm{anml}}(x^n) \right| = 0.

Thus, the approximation converges uniformly to the exact code length for all sequences of outcomes whose maximum likelihood estimator does not converge to the boundaries of the parameter space. For more information, see [39].

Since the last term in (2.1) does not depend on the sample size, it has often been disregarded, and many people came to associate MDL only with the first two terms. But the third term can be quite large or even infinite, and it can substantially influence the inference results for small sample sizes. Interestingly, (2.1) also describes the asymptotic behaviour of the Bayesian universal code where Jeffreys' prior is used: here MDL and an objective Bayesian method coincide even though their motivation is quite different.

As stated in the introduction, the problem is that for many models the parametric complexity is infinite. Many strategies have been proposed to deal with this, but most are somewhat ad hoc. When Rissanen defines stochastic complexity as L_nml(x^n) in [72], he writes that he does so "thereby concluding a decade long search", but as Lanterman observes in [53], "in the light of these problems we may have to postpone concluding the search just a while longer".
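When the parametric complexity is finite, (2.1) is easy to instantiate. As a sketch (not part of the original text; complexities in bits, i.e. base-2 logarithms), consider the Bernoulli model: k = 1, the Fisher information is I(θ) = 1/(θ(1−θ)), and ∫₀¹ √I(θ) dθ = π, so the approximated complexity is (1/2) log₂(n/2π) + log₂ π. This can be checked against the exact complexity computed from (1.15) in Chapter 1.

    import math

    def anml_complexity_bernoulli(n):
        # The last two terms of (2.1) for the Bernoulli model, in bits:
        # (1/2) log2(n / (2 pi)) + log2(integral of sqrt(I)), integral = pi
        return 0.5 * math.log2(n / (2 * math.pi)) + math.log2(math.pi)

    for n in (10, 100, 1000):
        print(n, anml_complexity_bernoulli(n))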

2.2 The Poisson and Geometric Models

We investigate MDL model selection between the Poisson and Geometric models. Figure 2.1 may help form an intuition about the probability mass functions of the two distributions. One reason for our choice of models is that they are both single parameter models, so that the dominant (k/2) log(n/2π) term of (2.1) cancels. This means that, at least for large sample sizes, simply picking the model that best fits the data should always work. We nevertheless observe that for small sample sizes, data generated by the geometric distribution are misclassified as Poisson much more frequently than the other way around (see Section 2.5). So in an informal sense, even though the number of parameters is the same, the Poisson model is more prone to "overfitting".

To counteract the bias in favour of Poisson that is introduced if we just select the best fitting model, we would like to compute the third term of (2.1), which now characterises the parametric complexity. But as it turns out, both models have an infinite parametric complexity! The integral in the third term of the approximation also diverges. So in this case it is not immediately clear how the bias should be removed. This is the second reason why we chose to study the Poisson and Geometric models. In Section 2.3 we describe a number of methods that have been proposed in the literature as ways to deal with infinite parametric complexity; in Section 2.5 they are evaluated empirically. Reassuringly, all methods we investigate tend to compensate for this overfitting phenomenon by "punishing" the Poisson model. However, to what extent the bias is compensated depends on the method used, so that different methods give different results.

We parameterise both the Poisson and the Geometric family of distributions by the mean µ ∈ (0, ∞), to allow for easy comparison. This is possible because for both models, the empirical mean (average) of the observed data is a sufficient statistic. For Poisson, parameterisation by the mean is standard. For geometric, the reparameterisation can be arrived at by noting that in the standard parameterisation, P(x | θ) = (1−θ)^x θ, the mean is given by µ = (1−θ)/θ. As a notational reminder the parameter is called µ henceforth. Conveniently, the ML estimator

Figure 2.1 The mean 5 Poisson and geometric distributions. (Probability mass P(n) plotted against n = 0, …, 20 for both distributions.)

µ̂ for both distributions is the average of the data. We will add a subscript P or G to indicate that code lengths are computed with respect to the Poisson model or the Geometric model, respectively. Furthermore, to simplify the equations in the remainder of this chapter somewhat, we will express code lengths in nats (−ln probabilities) rather than bits (−log₂ probabilities).

L_P(x^n \mid \mu) = -\ln \prod_{i=1}^{n} \frac{e^{-\mu}\mu^{x_i}}{x_i!} = \sum_{i=1}^{n} \ln(x_i!) + n\mu - \ln(\mu) \sum_{i=1}^{n} x_i,  (2.2)

L_G(x^n \mid \mu) = -\ln \prod_{i=1}^{n} \frac{\mu^{x_i}}{(\mu+1)^{x_i+1}} = n \ln(\mu+1) - \ln\left(\frac{\mu}{\mu+1}\right) \sum_{i=1}^{n} x_i.  (2.3)
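In Python, (2.2) and (2.3) translate directly into the following helper functions (an illustrative sketch, not from the thesis; the sample xs is arbitrary), which are also convenient for experimenting with the criteria defined in the next section.

    import math

    def L_poisson(xs, mu):
        # (2.2): sum_i ln(x_i!) + n*mu - ln(mu) * sum_i x_i   (in nats)
        n, s = len(xs), sum(xs)
        return sum(math.lgamma(x + 1) for x in xs) + n * mu - s * math.log(mu)

    def L_geometric(xs, mu):
        # (2.3): n*ln(mu+1) - ln(mu/(mu+1)) * sum_i x_i       (in nats)
        n, s = len(xs), sum(xs)
        return n * math.log(mu + 1) - s * math.log(mu / (mu + 1))

    xs = [3, 7, 4, 6, 5, 5, 8, 2]
    mu_hat = sum(xs) / len(xs)        # the ML estimator: the sample average
    print(L_poisson(xs, mu_hat), L_geometric(xs, mu_hat))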

2.3 Four Ways to deal with Infinite Parametric Complexity

In this section we discuss four general ways to deal with the infinite parametric complexity of the Poisson and Geometric models when the goal is to do model selection. Each of these four methods leads to one, or sometimes more, concrete model selection criteria, which we evaluate in Section 2.5.

2.3.1 BIC/ML

One way to deal with the diverging integral in the approximation is to just ignore it. The model selection criterion that results corresponds to only a very rough approximation of any real universal code, but it has been used and studied extensively. It was first derived by Jeffreys as an approximation to the Bayesian marginal likelihood [47], but it became well-known only when it was proposed by Rissanen [67] and Schwarz [78]. While Schwarz gave the same type of derivation as Jeffreys, Rissanen arrived at it in a quite different manner, as an approximation to a two-part code length. We note that Rissanen already abandoned the idea in the mid 1980s in favour of more sophisticated code length approximations. Because of its connection to the Bayesian marginal likelihood, it is best known as the BIC (Bayesian Information Criterion):

L_{\mathrm{BIC}}(x^n) = L(x^n \mid \hat\mu) + \frac{k}{2} \ln n.

Comparing BIC to the approximated NML code length, we find that in addition to the diverging term, a \frac{k}{2} \ln \frac{1}{2\pi} term has also been dropped. This curious difference can be safely ignored in our setup, where k is equal to one for both models, so the whole term cancels anyway. According to BIC, we must select the Geometric model if

0 < L_P(x^n \mid \hat\mu) - L_G(x^n \mid \hat\mu).

Since k = 1 for both models, the penalty terms \frac{k}{2} \ln n are identical and cancel; in our context BIC model selection is therefore equivalent to plain maximum likelihood (ML) model selection, which simply selects the model containing the distribution that assigns the highest likelihood to the data.

2.3.2 Restricted ANML

One often-used method of rescuing the NML approach to MDL model selection is to restrict the range of values that the parameters can take, to ensure that the third term of (2.1) stays finite. Our approach is to impose a maximum on the allowed mean by setting Θ = (0, µ*).

To compute the approximated parametric complexity of the restricted models we need to establish the Fisher information first. We use I(µ) = −E_µ[ (d²/dµ²) ln P(x | µ) ] to obtain

I_P(\mu) = -E\left[ -\frac{x}{\mu^2} \right] = \frac{1}{\mu},  (2.4)

I_G(\mu) = -E\left[ -\frac{x}{\mu^2} + \frac{x+1}{(\mu+1)^2} \right] = \frac{1}{\mu(\mu+1)}.  (2.5)

Now we can compute the last term in the parametric complexity approximation (2.1):

\ln \int_0^{\mu^*} \sqrt{I_P(\mu)}\, d\mu = \ln \int_0^{\mu^*} \mu^{-\frac{1}{2}}\, d\mu = \ln\left( 2\sqrt{\mu^*} \right);  (2.6)

\ln \int_0^{\mu^*} \sqrt{I_G(\mu)}\, d\mu = \ln \int_0^{\mu^*} \frac{d\mu}{\sqrt{\mu(\mu+1)}} = \ln\left( 2 \ln\left( \sqrt{\mu^*} + \sqrt{\mu^*+1} \right) \right).  (2.7)

Thus, the parametric complexities of the restricted models are both monotonically increasing functions of µ*. Let the function δ(µ*) := ln(2√µ*) − ln(2 ln(√µ* + √(µ*+1))) measure the difference between the parametric complexities. We obtain a model selection criterion that selects the Geometric model if

0 < L_P(x^n \mid \hat\mu(x^n)) - L_G(x^n \mid \hat\mu(x^n)) + \delta(\mu^*).

Note that this is equivalent to a generalised likelihood ratio test (GLRT) with selection threshold δ(µ*). Since δ(µ*) is an unbounded, monotonically increasing function of µ*, any desired threshold can be obtained through an appropriate choice of µ*.
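The restricted ANML criterion then takes only a few lines of code. The sketch below (an illustration, not from the thesis; the code lengths are inlined from (2.2) and (2.3), and the sample xs is arbitrary) makes the dependence on µ* explicit.

    import math

    def delta(mu_star):
        # delta(mu*) = ln(2 sqrt(mu*)) - ln(2 ln(sqrt(mu*) + sqrt(mu*+1)))
        return (math.log(2 * math.sqrt(mu_star))
                - math.log(2 * math.log(math.sqrt(mu_star)
                                        + math.sqrt(mu_star + 1))))

    def select_restricted_anml(xs, mu_star):
        n, s = len(xs), sum(xs)
        mu = s / n                                 # ML estimate
        L_p = sum(math.lgamma(x + 1) for x in xs) + n * mu - s * math.log(mu)
        L_g = n * math.log(mu + 1) - s * math.log(mu / (mu + 1))
        # Select Geometric iff 0 < L_P - L_G + delta(mu*)
        return "Geometric" if L_p - L_g + delta(mu_star) > 0 else "Poisson"

    xs = [3, 7, 4, 6, 5, 5, 8, 2]
    for m in (10, 100, 1000):
        # the correction in favour of the Geometric model grows with mu*
        print(m, select_restricted_anml(xs, m))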

2.3.3 Two-part ANML

Perhaps the easiest way to get rid of the µ* parameter that determines the parameter range, and thereby the restricted ANML code length, is to use a two-part code. The first part contains a specification of µ*, which is followed by an encoding of the rest of the data with an approximate code length given by restricted ANML for that µ*. To do this we need to choose some discretisation, such that whatever µ̂ is, it does not cost many bits to specify an interval that contains it. For a sequence with ML parameter µ̂, we choose to encode the integer b = ⌈log₂ µ̂⌉. A decoder, upon reception of such a number b, now knows that the ML parameter value must lie in the range (2^{b−1}, 2^b] (for otherwise another value of b would have been transmitted). By taking the logarithm we ensure that the number of bits used in coding the parameter range grows at a negligible rate compared to the code length of the data itself, but we admit that the code for the parameter range allows much more sophistication. We do not really have reason to assume that the best discretisation should be the same for the Poisson and Geometric models, for example.

The two-part code is slightly redundant, since code words are assigned to data sequences of which the ML estimator lies outside the range that was encoded in the first part – such data sequences cannot occur, since for such a sequence we would have encoded a different range. Furthermore, the two-part code is no longer minimax optimal, so it is no longer clear why it should be better than other universal codes which are not minimax optimal. However, as argued in [38], whenever the minimax optimal code is not defined, we should aim for a code L which is "close" to minimax optimal in the sense that, for any compact subset M′ of the parameter space, the additional regret of L on top of the NML code for M′ should be small, e.g. O(log log n). The two-part ANML code is one of many universal codes satisfying this "almost minimax optimality". While it may not be better than another almost minimax optimal universal code, it certainly is better than universal codes which do not have the almost minimax optimality property.
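As a sketch of the first part of this two-part code (illustrative only; the particular integer code for b is an assumption, since the text does not fix one):

    import math

    def encode_parameter_range(mu_hat):
        # Encode b = ceil(log2(mu_hat)); the decoder then knows that the
        # ML parameter lies in the interval (2^(b-1), 2^b].
        b = math.ceil(math.log2(mu_hat))
        # A simple (hypothetical) code for the integer b: about
        # 2*log2(|b|+1) + 1 bits; any universal code for integers would do.
        bits = 2 * math.log2(abs(b) + 1) + 1
        return b, bits

    print(encode_parameter_range(5.25))   # b = 3, i.e. the range (4, 8]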

2.3.4 Renormalised Maximum Likelihood

Related to the two-part restricted ANML, but more elegant, is Rissanen's renormalised maximum likelihood (RNML) code [73, 38]. This is perhaps the most widely known approach to dealing with infinite parametric complexity. The idea here is that the NML distribution is well-defined if the parameter range is restricted to, say, the range (0, µ*). Letting P_{NML,µ*} be the NML distribution relative to this restricted model, we can now define a new parametric model, with µ* as the parameter and the corresponding restricted NML distributions P_{NML,µ*} as elements. For this new model we can again compute the NML distribution! To do this, we need to compute the ML value for µ*, which in this case can be seen to be as small as possible such that µ̂ still falls within the range; in other words, µ* = µ̂. If this still leads to infinite parametric complexity, we define a hyper-hyper-parameter. We repeat the procedure until the resulting complexity is finite. Unfortunately, in our case, after the first renormalisation, both parametric complexities are still infinite; we have not performed a second renormalisation. Therefore, the RNML code is not included in our experiments.

2.3.5 Prequential Maximum Likelihood

The prequential maximum likelihood code, which we will abbreviate as PML code, is an attractive universal code because it is usually a lot easier to implement than either NML or a Bayesian code. Moreover, its implementation hardly requires any arbitrary decisions. Here the outcomes are coded sequentially using the probability distribution indexed by the ML estimator for the previous outcomes [26, 69]; for a general introduction see [96] or [38].

L_{\mathrm{PML}}(x^n) = \sum_{i=1}^{n} L\left( x_i \mid \hat\mu(x^{i-1}) \right),

where L(x_i | µ̂(x^{i−1})) = −ln P(x_i | µ̂(x^{i−1})) is the number of nats needed to encode outcome x_i using the code based on the ML estimator on x^{i−1}. We further discuss the motivation for this code in Section 2.5.1.

For both the Poisson model and the Geometric model, the maximum likelihood estimator is not well-defined until after a nonzero outcome has been observed (since 0 is not inside the allowed parameter range). This means that we need to use another code for the first few outcomes. It is not really clear how we can use the model assumption (Poisson or geometric) here, so we pick a simple code for the nonnegative integers that does not depend on the model. This will result in the same code length for both models; therefore it does not influence which model is selected. Since there are usually only very few initial zero outcomes, we may reasonably hope that the results are not too distorted by our way of handling this start-up problem. We note that this start-up problem is an inherent feature of the PML approach [26, 71], and our way of handling it is in line with the suggestions in [26].
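A Python sketch of the PML code length (an illustration, not from the thesis; outcomes observed before the first nonzero one are simply skipped here, standing in for the model-independent start-up code, which contributes the same length to both models and so does not affect the comparison):

    import math

    def pml_code_length(xs, model):
        # L_PML(x^n) = sum_i L(x_i | mu_hat(x^(i-1))), in nats
        total, s, n = 0.0, 0, 0
        for x in xs:
            if s > 0:                         # ML estimator is well-defined
                mu = s / n
                if model == "poisson":        # -ln of e^-mu mu^x / x!
                    total += mu - x * math.log(mu) + math.lgamma(x + 1)
                else:                         # -ln of mu^x / (mu+1)^(x+1)
                    total += (x + 1) * math.log(mu + 1) - x * math.log(mu)
            s, n = s + x, n + 1
        return total

    xs = [0, 3, 7, 4, 6, 5, 5, 8, 2]
    print(pml_code_length(xs, "poisson"), pml_code_length(xs, "geometric"))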

2.3.6 Objective Bayesian Approach

In the Bayesian framework we select a prior w(θ) on the unknown parameter and compute the marginal likelihood

P_{\mathrm{BAYES}}(x^n) = \int_\Theta P(x^n \mid \theta)\, w(\theta)\, d\theta,  (2.8)

with universal code length L_{\mathrm{BAYES}}(x^n) = -\ln P_{\mathrm{BAYES}}(x^n). Like NML, this can be approximated with an asymptotic formula. Under conditions similar to those for the NML approximation (2.1), we have [3]:

L_{\mathrm{ABAYES}}(x^n) := L(x^n \mid \hat\theta) + \frac{k}{2} \ln \frac{n}{2\pi} + \ln \frac{\sqrt{\det I(\hat\theta)}}{w(\hat\theta)},  (2.9)

where the asymptotic behaviour is the same as for the approximation of the NML code length: roughly, L_ABAYES(x^n) − L_BAYES(x^n) → 0 as n → ∞ (see below Eq. (2.1) for details). Objective Bayesian reasoning suggests we use Jeffreys' prior, for several reasons; one reason is that it is uniform over all "distinguishable" elements of the model [3], which implies that the obtained results are independent of the parameterisation of the model [47]. It is defined as follows:

w(\theta) = \frac{\sqrt{\det I(\theta)}}{\int_\Theta \sqrt{\det I(\theta)}\, d\theta}.  (2.10)

Unfortunately, the normalisation factor in Jeffreys' prior diverges for both the Poisson model and the Geometric model. But if one is willing to accept a so-called improper prior, which is not normalised, then it is possible to compute a perfectly proper Bayesian posterior after observing the first outcome, and use that as a prior to compute the marginal likelihood of the rest of the data. Refer to [14] for more information on objective Bayesian theory. The resulting universal codes, with lengths L_BAYES(x_2, …, x_n | x_1), are, in fact, conditional on the first outcome. Recent work by [58] suggests that, at least asymptotically and for one-parameter models, the universal code achieving the minimal expected redundancy conditioned on the first outcome is given by the Bayesian universal code with the improper Jeffreys' prior. Li and Barron only prove this for scale and location models, but their result does suggest that the same would still hold for general exponential families such as Poisson and geometric. It is possible to define MDL inference in terms of either the expected redundancy or the worst-case regret. In fact, the resulting procedures are very similar; see [4]. Thus, we have a tentative justification for using Jeffreys' prior also from an MDL point of view, on top of its justification in terms of objective Bayes.

It can be argued that using the first outcome for conditioning, rather than some other outcome, is arbitrary, while it does influence the results. On the other hand, the posterior after having observed all data will be the same whatever outcome is elected to be the special one that we refrain from encoding. It also seems preferable to let results depend on arbitrary properties of the data than to let them depend on arbitrary decisions of the scientist, such as the choice of a maximum value for µ* in the case of the restricted ANML criterion. As advocated for instance in [13], arbitrariness can be reduced by conditioning on every outcome in turn and then using the mean of the code lengths one so obtains. We have not gone to such lengths in this study.

We compute Jeffreys' posterior after observing one outcome, and use it to find the Bayesian marginal likelihoods. We write x_i^j to denote x_i, …, x_j and µ̂(x_i^j) to indicate which outcomes determine the ML estimator; finally we abbreviate s_n = x_1 + … + x_n. The goal is to compute P_BAYES(x_2^n | x_1) for the Poisson and Geometric models. As before, the difference between the corresponding code lengths defines a model selection criterion. We also compute P_ABAYES(x_2^n | x_1) for both models, the approximated version of the same quantity, based on approximation formula (2.9). Equations for the Poisson and Geometric models are presented below.

Bayesian code for the Poisson model We compute Jeffreys’ improper prior and the posterior after observing one outcome:

w_P(\mu) \propto \sqrt{I_P(\mu)} = \mu^{-\frac{1}{2}};  (2.11)

w_P(\mu \mid x_1) = \frac{P_P(x_1 \mid \mu)\, w_P(\mu)}{\int_0^\infty P_P(x_1 \mid \theta)\, w_P(\theta)\, d\theta} = \frac{e^{-\mu} \mu^{x_1 - \frac{1}{2}}}{\Gamma(x_1 + \frac{1}{2})}.  (2.12)

From this we can derive the marginal likelihood of the rest of the data. The details of the computation are omitted for brevity.

P_{P,\mathrm{BAYES}}(x_2^n \mid x_1) = \int_0^\infty P_P(x_2^n \mid \mu)\, w_P(\mu \mid x_1)\, d\mu = \frac{\Gamma(s_n + \frac{1}{2})}{\Gamma(x_1 + \frac{1}{2})} \Big/ \left( n^{s_n + \frac{1}{2}} \prod_{i=2}^{n} x_i! \right).  (2.13)

For the approximation (2.9) we obtain:

L_{P,\mathrm{ABAYES}}(x_2^n \mid x_1) = L_P(x_2^n \mid \hat\mu(x_2^n)) + \frac{1}{2} \ln \frac{n}{2\pi} + \hat\mu(x_2^n) - x_1 \ln \hat\mu(x_2^n) + \ln \Gamma(x_1 + \tfrac{1}{2}).  (2.14)
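Computed with log-gamma functions to avoid overflow, (2.13) becomes (an illustrative sketch, not from the thesis):

    import math

    def L_poisson_bayes(xs):
        # -ln P_{P,BAYES}(x_2^n | x_1) according to (2.13)
        x1, n, s = xs[0], len(xs), sum(xs)
        log_p = (math.lgamma(s + 0.5) - math.lgamma(x1 + 0.5)
                 - (s + 0.5) * math.log(n)
                 - sum(math.lgamma(x + 1) for x in xs[1:]))
        return -log_p

    print(L_poisson_bayes([3, 7, 4, 6, 5, 5, 8, 2]))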

Bayesian code for the Geometric model We perform the same computations for the Geometric model. This time we get:

w_G(\mu) \propto \mu^{-\frac{1}{2}} (\mu+1)^{-\frac{1}{2}};  (2.15)

w_G(\mu \mid x_1) = (x_1 + \tfrac{1}{2})\, \mu^{x_1 - \frac{1}{2}} (\mu+1)^{-x_1 - \frac{3}{2}};  (2.16)

P_{G,\mathrm{BAYES}}(x_2^n \mid x_1) = (x_1 + \tfrac{1}{2})\, \frac{\Gamma(s_n + \frac{1}{2})\, \Gamma(n)}{\Gamma(n + s_n + \frac{1}{2})}.  (2.17)

For the approximation we obtain:

L_{G,\mathrm{ABAYES}}(x_2^n \mid x_1) = L_G(x_2^n \mid \hat\mu(x_2^n)) + \frac{1}{2} \ln \frac{n}{2\pi} + (x_1 + 1) \ln\left( 1 + \frac{1}{\hat\mu(x_2^n)} \right) + \ln \hat\mu(x_2^n) - \ln(x_1 + \tfrac{1}{2}).  (2.18)
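Equation (2.17) is equally direct; comparing the two conditional code lengths then yields the objective Bayesian selection criterion (again an illustrative sketch, not from the thesis):

    import math

    def L_geometric_bayes(xs):
        # -ln P_{G,BAYES}(x_2^n | x_1) according to (2.17)
        x1, n, s = xs[0], len(xs), sum(xs)
        log_p = (math.log(x1 + 0.5) + math.lgamma(s + 0.5)
                 + math.lgamma(n) - math.lgamma(n + s + 0.5))
        return -log_p

    xs = [3, 7, 4, 6, 5, 5, 8, 2]
    print(L_geometric_bayes(xs))   # select the model with the smaller length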

2.4 Experiments

The previous section describes four methods to compute or approximate the lengths of a number of different universal codes, which can be used in an MDL model selection framework. The MDL principle tells us to select the model using which we can achieve the shortest code length for the data. This coincides with the Bayesian maximum a posteriori (MAP) model with a uniform prior on the models. In this way each method for computing or approximating universal code lengths defines a model selection criterion, which we want to compare empirically.

Known µ criterion In addition to the criteria that are based on universal codes, as developed in Section 2.3, we define one additional, "ideal" criterion to serve as a reference by which the others can be evaluated. The known µ criterion cheats a little bit: it computes the code length for the data with knowledge of the mean of the generating distribution. If the mean is µ, then the known µ criterion selects the Poisson model if L_P(x^n | µ) < L_G(x^n | µ). Since this criterion uses extra knowledge about the data, it should be expected to perform better than the other criteria.

The theoretical analysis of the known µ criterion is helped by the circumstance that (1) one of the two hypotheses equals the generating distribution and (2) the sample consists of outcomes which are i.i.d. according to this distribution. In [25], Sanov's Theorem is used to show that in such a situation, the probability that the criterion prefers the wrong model (the "error probability") decreases exponentially in the sample size. If the Bayesian MAP model selection criterion is used then the following happens: if the data are generated using Poisson[µ] then the error probability decreases exponentially in the sample size, with some error exponent; if the data are generated with Geom[µ] then the overall error probability is exponentially decreasing with the same exponent [25, Theorem 12.9.1 on page 312 and text thereafter]. Thus, we expect that the line for the "known µ" criterion is straight on a logarithmic scale, with a slope that is equal whether the generating distribution is Poisson or geometric. This proves to be the case, as can be seen from Figure 2.2.

Tests We perform three types of test on the selection criteria, which are described in detail in the following subsections:

1. Error probability measurements.
2. Bias measurements.
3. Calibration testing.

2.4.1 Error Probability

The error probability for a criterion is the probability that it will select a model that does not contain the distribution from which the data were sampled. In our experiments, samples are always drawn from a Poisson[µ] distribution with probability p, or from a Geom[µ] distribution with probability 1 − p. We measure the error probability through repeated sampling; strictly speaking we thus obtain error frequencies which approximate the error probability.
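The experimental setup can be summarised in a few lines of Python (a sketch under our own simplifying choices, not the thesis code: the samplers are naive, and criterion stands for any of the selection criteria above, assumed to return "poisson" or "geometric"):

    import math, random

    def sample_poisson(mu):            # Knuth's method; fine for small mu
        L, k, p = math.exp(-mu), 0, 1.0
        while True:
            p *= random.random()
            if p < L:
                return k
            k += 1

    def sample_geometric(mu):          # mean mu, P(X >= k) = (mu/(mu+1))^k
        return int(math.log(random.random()) / math.log(mu / (mu + 1)))

    def error_frequency(criterion, mu, n, p=0.5, trials=10000):
        errors = 0
        for _ in range(trials):
            true = "poisson" if random.random() < p else "geometric"
            sample = [sample_poisson(mu) if true == "poisson"
                      else sample_geometric(mu) for _ in range(n)]
            if criterion(sample) != true:
                errors += 1
        return errors / trials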

Figures 2.2, 2.4, 2.5 and 2.6 plot the sample size against the error frequency, using different means µ and different priors p on the generating distribution. We use a log scale, which allows for easier comparison of the different criteria; as we pointed out earlier, for the known µ criterion we should expect to obtain a straight line. In Figures 2.4, 2.5 and 2.6 the log of the error frequency of the known µ criterion is subtracted from the logs of the error frequencies of the other criteria. This brings out the differences in performance in even more detail. The known µ criterion, which has no bias, is perfectly calibrated (as we will observe later) and also has a low error probability under all circumstances (although biased criteria can sometimes do better if the bias happens to work in the right direction), is thus treated as a baseline of sorts.

2.4.2 Bias

We define the level of evidence in favour of the Poisson model as:

\Delta(x^n) := L_G(x^n \mid \mu) - L_P(x^n \mid \mu),  (2.19)

which is the difference in code lengths according to the known µ criterion. The other criteria define estimators for this quantity: the estimator for a criterion C is defined as:

\Delta_C(x^n) := L_{G,C}(x^n) - L_{P,C}(x^n).  (2.20)

(Some people are more familiar with Bayes factors, of which this is the logarithm.) In our context, the bias of a particular criterion is the expected difference between the level of evidence according to that criterion and the true level of evidence,

E\left[ \Delta_C(X^n) - \Delta(X^n) \right].  (2.21)

The value of this expectation depends on the generating distribution, which is assumed to be some mixture of the Poisson and geometric distributions of the same mean. We measure the bias through sampling, with generating distributions Poisson[8] and Geom[8]; as before we vary the sample size. The results are in Figure 2.3.

2.4.3 Calibration

The classical interpretation of probability is frequentist: an event has probability p if, in a repeated experiment, the frequency of the event converges to p. This interpretation is no longer really possible in a Bayesian framework, since prior assumptions often cannot be tested in a repeated experiment. For this reason, calibration testing is avoided by some Bayesians, who may put forward that it is a meaningless procedure from a Bayesian perspective. On the other hand, we take the position that even with a Bayesian hat on, one would like one's inference procedure to be calibrated: in the idealised case in which identical experiments are performed repeatedly, probabilities should converge to frequencies. If they do not behave as we would expect even in this idealised situation, then how can we trust inferences based on such probabilities in the real world with all its imperfections?

In the introduction we have indicated the correspondence between code lengths and probability. If the universal code lengths for the different criteria correspond to probabilities that make sense in a frequentist way, then the Bayesian a posteriori probabilities of the two models should too. To test this, we generate samples with a fixed mean and a fixed sample size; half of the samples are drawn from a Poisson distribution and half from a geometric distribution. We then compute, for each of the selection criteria, the a posteriori probability that each sample was generated by the Poisson model. The samples are distributed over 40 bins by discretising their a posteriori probability. For each bin we count the number of sequences that actually were generated by Poisson. If the a posteriori probability that the model is Poisson makes any sense in a frequentist way, then the result should be a more or less straight diagonal.
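Given the two code lengths in nats that any of the criteria assigns, the a posteriori probability of Poisson under a uniform prior on the two models is a logistic function of their difference; this is the quantity that gets binned (an illustrative sketch, not from the thesis):

    import math

    def posterior_poisson(L_p, L_g):
        # P(Poisson | data) with a uniform prior on the two models:
        # e^{-L_p} / (e^{-L_p} + e^{-L_g}) = 1 / (1 + e^{L_p - L_g})
        d = min(L_p - L_g, 700.0)        # guard against overflow in exp
        return 1.0 / (1.0 + math.exp(d))

    # Bin many samples by this value; for a calibrated criterion, the
    # fraction of sequences in each bin that really are Poisson should
    # match the bin's probability, giving a roughly diagonal plot.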

The results are in Figure 2.7. We used mean 8 and sample size 8 because, on the one hand, we want a large enough sample size that the posterior has converged to something reasonable; on the other hand, if we choose the sample size even larger, it becomes exceedingly unlikely that a sequence is generated of which the probability that it is Poisson is estimated near 0.5, so we would need to generate an infeasibly large number of samples to get accurate results. Note that the "known µ" criterion is perfectly calibrated, because its implicit prior distribution on the mean of the generating distribution puts all probability on the actual mean, so the prior perfectly reflects the truth in this case. Under such circumstances Bayesian and frequentist probability become the same, and we get a perfect answer.

We feel that calibration testing is too often ignored, while it can safeguard against inferences or predictions that bear little relationship to the real world. Moreover, in the objective Bayesian branch of Bayesian statistics, one does emphasise procedures with good frequentist behaviour [11]. At least in restricted contexts [23, 22], Jeffreys' prior has the property that the Kullback-Leibler divergence between the true distribution and the posterior converges to zero quickly, no matter what the true distribution is. Consequently, after observing only a limited number of outcomes, it should already be possible to interpret the posterior as an almost "classical" distribution in the sense that it can be verified by frequentist experiments [23].

2.5 Discussion of Results

Roughly, the results of our tests can be summarised as follows:

• In this toy problem, as one might expect, all criteria perform extremely well even when the sample size is small. But there are also small but distinct differences that illustrate relative strengths and weaknesses of the different methods. When extrapolated to a more complicated model selection problem, our results should help to decide which criteria are appropriate for the job.

• As was to be expected, the known µ criterion performs excellently on all tests.

• The PML and BIC/ML criteria exhibit the worst performance.

• The basic restricted ANML criterion yields results that range from good to very bad, depending on the chosen parameter range. Since the range must be chosen without any additional knowledge of the properties of the data, this criterion is rather arbitrary.

• The results for the two-part restricted ANML and Objective Bayesian criteria are reasonable in all tests we performed; these criteria thus display robustness.

In the following subsections we evaluate the results for each model selection criterion in turn.

2.5.1 Poor Performance of the PML Criterion

One feature of Figure 2.2 that immediately attracts attention is the unusual slope of the error rate line of the PML criterion, which clearly favours the geometric distribution. This is even clearer in Figure 2.3, where the PML criterion can be observed to become more and more favourable to the Geometric model as the sample size increases, regardless of whether the data were sampled from a Poisson or a geometric distribution. This is also corroborated by the results on the calibration test, where the PML criterion most severely underestimates the probability that the data were sampled from a Poisson distribution: of those sequences that were classified as geometric with 80% confidence, in fact about 60% turned out to be sampled from a Poisson distribution.

While this behaviour may seem appropriate if the data really are more likely to come from a geometric distribution, there is actually a strong argument that even under those circumstances it is not the most desirable behaviour, for the following reason. Suppose that we put a fixed prior p on the generating distribution, with nonzero probability for both distributions Poisson[µ] and Geom[µ].

Figure 2.2 The log10 of the error frequency. (Two panels: "Error frequency when data are sampled from Poisson[4]" and "Error frequency when data are sampled from Geometric[4]"; both plot the log error frequency against the sample size for the Plug-in, ML, ANML max=10, ANML max=100, ANML max=1000, ANML 2part, Bayesian, Approx. Bayes and known mu criteria.)

Figure 2.3 The classification bias in favour of the Poisson model in bits. (Two panels: "Bias when data are sampled from Poisson[8]" and "Bias when data are sampled from Geometric[8]"; both plot the bias in favour of the Poisson model (bits) against the sample size, for the same criteria as in Figure 2.2.)

Figure 2.4 The difference in the log10 of the frequency of error between each criterion and the known µ criterion. The mean is 4. In these graphs, data are sampled from one of the two models with unequal probabilities. (Two panels: "Distance to baseline; sampled from Poisson[4] (90%) and Geom[4] (10%)" and "Distance to baseline; sampled from Poisson[4] (10%) and Geom[4] (90%)"; both plot the log error frequency difference with the baseline against the sample size.)

Figure 2.5 The difference in the log10 of the frequency of error between each criterion and the known µ criterion. The mean is 4. (Single panel: "Distance to baseline; sampled from Poisson[4] (50%) and Geom[4] (50%)", plotting the log error frequency difference with the baseline against the sample size.)

Figure 2.6 The difference in the log10 of the frequency of error between each criterion and the known µ criterion. The mean is 8 in the top graph and 16 in the bottom graph. (Panels: "Distance to baseline; sampled from Poisson[8] (50%) and Geom[8] (50%)" and "Distance to baseline; sampled from Poisson[16] (50%) and Geom[16] (50%)"; both plot the log error frequency difference with the baseline against the sample size.)

The marginal error probability is a linear combination of the probabilities of error for the two generating distributions; as such it is dominated by the probability of error with the worst exponent. So if minimising the error probability is our goal, then we must conclude that the behaviour of the PML criterion is suboptimal. (As an aside, minimising the error probability with respect to a fixed prior is not the goal of classical hypothesis testing, since in that setting the two hypotheses do not play a symmetrical role.) To illustrate, the bottom graph in Figure 2.4 shows that, even if there is a 90% chance that the data are geometric, the PML criterion still has a worse (marginal) probability of error than "known µ" as soon as the sample size reaches 25. Figure 2.5 shows what happens if the prior on the generating distribution is uniform: using the PML criterion immediately yields the largest error probability of all the criteria under consideration. This effect only becomes stronger if the mean is higher.

This strangely poor behaviour of the PML criterion initially came as a complete surprise to us. Theoretical literature certainly had not suggested it. Rather the contrary: in [71] we find that "it is only because of a certain inherent singularity in the process [of PML coding], as well as the somewhat restrictive requirement that the data must be ordered, that we do not consider the resulting predictive code length to provide another competing definition for the stochastic complexity, but rather regard it as an approximation". There are also a number of results to the effect that the regret for the PML code grows as (k/2) ln n, the same as the regret for the NML code, for a variety of models. Examples are [70, 35, 97]. Finally, publications such as [61, 52] show excellent behaviour of the PML criterion for model selection in regression and classification based on Bayesian networks, respectively. So, we were extremely puzzled by these results at first.

To gain intuition as to why the PML code should behave so strangely, note that the variance of a geometric distribution is much larger than the variance of the Poisson distribution with the same mean. This suggests that the penalty for using an estimate µ̂(x^{n−1}), rather than the optimal µ, to encode each outcome x_n is higher for the Poisson model. The accumulated difference accounts for the difference in regret. This intuition is made precise in the next chapter, where we prove that for single parameter exponential families, the regret for the PML code grows with ½ ln(n) · var_{P*}(X)/var_{P_{θ*}}(X), where P* is the generating distribution, while P_{θ*} is the element of the model that minimises D(P* ‖ P_{θ*}). The PML model has the same regret (up to O(1)) as the NML model if and only if the variance of the generating distribution is the same as the variance of the best element of the model. The existing literature studies the case where P* = P_{θ*}, so that the variances cancel.

Nevertheless, for some model selection problems, the PML criterion may be the best choice. For example, the Bayesian integral (2.8) cannot be evaluated analytically for many model families. It is then often approximated by, for example, Markov Chain Monte Carlo methods, and it is not at all clear whether the resulting procedure will show better or worse performance than the PML criterion.
Theoretical arguments [4] show that there are quite strong limits to how badly the PML criterion can behave in model selection. For example, whenever a finite or countably infinite set of parametric models (each containing an uncountable number of distributions) are being compared, and data are i.i.d. according to an element of one of the models, then the error probability of the PML criterion must go to 0. If the number of models is finite and they are non-nested, it must even go to 0 as $\exp(-cn)$ for some constant $c > 0$. The same holds for other criteria including BIC, but not necessarily for ML. The PML criterion may have slightly lower c than other model selection procedures, but the ML criterion is guaranteed to fail (always select the most complex model) in cases such as regression with polynomials of varying degree, where the models being compared are nested and countably infinite in number. Thus, whereas in our setting the PML criterion performs somewhat worse (in the sense that more data are needed before the same quality of results is achieved) than the ML criterion, it is guaranteed to display reasonable behaviour over a wide variety of settings, in many of which the ML criterion fails utterly. All in all, our results indicate that the PML criterion should be used with caution, as it may exhibit worse performance than other selection criteria under misspecification.

2.5.2 ML/BIC

Beside known µ and PML, all criteria seem to share more or less the same error exponent. Nevertheless, they still show differences in bias. While we have to be careful to avoid over-interpreting our results, we find that the ML/BIC criterion consistently displays the largest bias in favour of the Poisson model. Figure 2.3 shows how the Poisson model is always at least $10^{0.7} \approx 5$ times more likely according to ML/BIC than according to known µ, regardless of whether data were sampled from a geometric or a Poisson distribution. Figure 2.7 contains further evidence of bias in favour of the Poisson model: together with the PML criterion, the ML/BIC criterion exhibited the worst calibration performance: when the probability that the data is Poisson distributed is assessed by the ML criterion to be around 0.5, the real frequency of the data being Poisson distributed is only about 0.2. This illustrates how the Poisson model appears to have a greater descriptive power, even though the two models have the same number of parameters, an observation which we hinted at in Section 2.2. Intuitively, the Poisson model allows more information about the data to be stored in the parameter estimate. All the other selection criteria compensate for this effect by giving a higher probability to the Geometric model.

2.5.3 Basic Restricted ANML

We have seen that the ML/BIC criterion shows the largest bias for the Poisson model. Figure 2.3 shows that the second largest bias is achieved by ANML with $\mu^* = 10$. Apparently the correction term that is applied by the ANML criterion is not sufficient if we choose $\mu^* = 10$. However, we can obtain any correction term we like, since we observed in Section 2.3.2 that ANML is equivalent to a GLRT with a selection threshold that is an unbounded, monotonically increasing function of $\mu^*$. Essentially, by choosing an appropriate $\mu^*$ we can get any correction in favour of the Geometric model, even one that would lead to a very large bias in the direction of the Geometric model. We conclude that it does not really make sense to use a fixed restricted parameter domain to repair the NML model when it does not exist, unless prior knowledge is available.

2.5.4 Objective Bayes and Two-part Restricted ANML

We will not try to interpret the differences in error probability for the (approximated) Bayesian and ANML 2-part criteria. Since we are using different selection criteria, we should expect at least some differences in the results. These differences are exaggerated by our setup with its low mean and small sample size. The Bayesian criterion, as well as its approximation, appear to be somewhat better calibrated than the two-part ANML, but the evidence is too thin to draw any strong conclusions. Figures 2.4–2.6 show that the error probability for these criteria tends to decrease at a slightly lower rate than for known µ (except when the prior on the generating distribution is heavily in favour of Poisson). While we do not understand this phenomenon well enough to prove it mathematically, it is of course consistent with the general rule that with more prior uncertainty, more data are needed to make the right decision. It may be that all the information contained within a sample can be used to improve the resolution of the known µ criterion, while for the other criteria some of that information has to be sacrificed in order to estimate the parameter value.

2.6 Summary and Conclusion

We have experimented with a number of model selection criteria which are based on the MDL philosophy and involve computing the code length of the data with the help of the model. There are several ways to define such codes, but the preferred method, the Normalised Maximum Likelihood (NML) code, cannot be applied since it does not exist for the Poisson and Geometric models that we consider. We have experimented with the following alternative ways of working around this problem: (1) using BIC, which is a simplification of approximated NML

Figure 2.7 Calibration: the probability that the criterion assigns to the data being Poisson, against the frequency with which it actually was Poisson.

[Plot: calibration test for sample size 8, mean 8. Horizontal axis: criterion confidence that the sample is Poisson (0–1); vertical axis: frequency with which the data was sampled from Poisson (0–1). Criteria shown: Plug-in, ANML max=10, ANML max=100, ANML max=1000, ANML 2-part, Approx. Bayesian, Bayesian, known mu, ML.]

(ANML), (2) ANML with a restricted parameter range, this range being either fixed or encoded separately, (3) a Bayesian model using Jeffreys' prior, which is improper for the case at hand but which can be made proper by conditioning on the first outcome of the sample, (4) its approximation and (5) a PML code which always codes the new outcome using the distribution indexed by the maximum likelihood estimator for the preceding outcomes.

Only the NML code incurs the same regret for all possible data sequences $x^n$; the regret incurred by the alternatives we studied necessarily depends on the data. Arguably this dependence is quite weak for the objective Bayesian approach (see [39] for more details); how the regret depends on the data is at present rather unclear for the other approaches. As such they cannot really be viewed as "objective" alternatives to the NML code. However, new developments in the field make this distinction between "objective" and "subjective" codes seem a bit misleading. It is probably better to think in terms of "luckiness" (also see Section 1.1.1): while the NML code allows equally good compression (and thus learning speed) for all data sequences, other methods are able to learn faster from some data sequences than from others. This does not mean that conclusions drawn from inference with such luckiness codes are necessarily subjective.

We have performed error probability tests, bias tests and calibration tests to study the properties of the model selection criteria listed above and made the following observations. Both BIC and ANML with a fixed restricted parameter range define a GLRT and can be interpreted as methods to choose an appropriate threshold. BIC chooses a neutral threshold, so the criterion is biased in favour of the model which is most susceptible to overfitting. We found that even though both models under consideration have only one parameter, a GLRT with a neutral threshold tends to be biased in favour of Poisson. ANML implies a threshold that counteracts this bias, but for every such threshold value there exists a corresponding parameter range, so it does not provide any more specific guidance in selecting that threshold. If the parameter range is separately encoded, this problem is avoided and the resulting criterion behaves competitively.

The Bayesian criterion displays reasonable performance both on the error rate experiments and the calibration test. The Bayesian universal codes for the models are not redundant and admit an MDL interpretation as minimising worst-case code length in an expected sense (Section 2.3.6).

The most surprising result is the behaviour of the PML criterion. It has a bias in favour of the Geometric model that depends strongly on the sample size. As a consequence, compared to the other model selection criteria its error rate decreases more slowly in the sample size if the data are sampled from each of the models with positive probability. This observation has led to a theoretical analysis of the code length of the PML code in the next chapter. It turns out that the regret of the PML code does not necessarily grow with $\frac{k}{2}\ln n$ like the NML and Bayesian codes do, if the sample is not distributed according to any element of the model.

Chapter 3

Behaviour of Prequential Codes under Misspecification

Universal coding lies at the basis of Rissanen's theory of MDL (minimum description length) learning [4, 39] and Dawid's theory of prequential model assessment [26]. It also underlies on-line prediction algorithms for data compression and gambling purposes. In the introductory chapter (Section 1.2.3), we defined universality of a code in terms of its worst-case regret. Roughly, a code is universal with respect to a model $\mathcal{M}$ if it achieves small worst-case regret: it allows one to encode data using not many more bits than the optimal code in $\mathcal{M}$. We also described the four main universal codes: the Shtarkov or NML code, the Bayesian mixture code, the 2-part MDL code and the prequential maximum likelihood code (PML), also known as the "ML plug-in code" or the "predictive MDL code" [4, 38]. This code was introduced independently by Rissanen [69] and by Dawid [26], who proposed it as a forecasting strategy rather than as a code.

In this chapter we study the behaviour of the PML code if the considered model $\mathcal{M}$ does not contain the data-generating distribution $P^*$. We require that the model is a 1-parameter exponential family, but our results can possibly be extended to models with more parameters.

Instead of the worst-case regret, we analyse the redundancy, a closely related concept. We find that the redundancy of PML can be quite different from that of the other main universal codes. For all these codes, the redundancy is $\frac{1}{2}c\ln n + O(1)$ for some c, but while c = 1 for Bayes, NML and 2-part codes (under regularity conditions on $P^*$ and $\mathcal{M}$), we show here that for PML, any $c > 0$ is possible, depending on $P^*$ and $\mathcal{M}$.

There are a plethora of results concerning the redundancy and/or the regret for PML, for a large variety of models including multivariate exponential families, ARMA processes, regression models and so on. Examples are [70, 44, 97, 57]. In all these papers it is shown that either the regret or the redundancy grows as $\frac{k}{2}\ln n + o(\ln n)$, either in expectation or almost surely. Thus, these results already indicate that c = 1 for those models. The reason that these results do

not contradict ours is that they invariably concern the case where the generating $P^*$ is in $\mathcal{M}$, so that automatically $\mathrm{var}_{P^*}(X) = \mathrm{var}_{P_{\theta^*}}(X)$.

As discussed in Section 3.4, the result has interesting consequences for parameter estimation and practical data compression, but the most important and surprising consequence is for MDL learning and model selection, where our result implies that PML may behave suboptimally even if one of the models under consideration is correct!

In Section 3.1 we informally state and explain our result. Section 3.2 contains the formal statement of our main result (Theorem 3.2.3), as well as a proof. In Section 3.3 we show that our results remain valid to some extent if "redundancy" is replaced by "expected regret" (Theorem 3.3.1). We discuss further issues, including relevance of the result, in Section 3.4. Section 3.5 states and proves various lemmas needed in the proofs of Theorems 3.2.3 and 3.3.1.

3.1 Main Result, Informally

Suppose $\mathcal{M} = \{P_\theta : \theta \in \Theta\}$ is a k-dimensional parametric family of distributions, and $Z_1, Z_2, \ldots$ are i.i.d. according to some distribution $P^* \in \mathcal{M}$. A code is universal for $\mathcal{M}$ if it is almost as efficient at coding outcomes from $P^*$ as the best element of $\mathcal{M}$. (As in previous chapters, we sometimes use codes and distributions interchangeably.) In the introductory chapter we measured the overhead incurred on the first n outcomes $z^n = z_1,\ldots,z_n$ in terms of the worst-case regret

$$\max_{z^n} \mathcal{R}(L, \mathcal{M}, z^n) = \max_{z^n}\left( L(z^n) - \inf_{L' \in \mathcal{M}} L'(z^n) \right),$$

but in this chapter we consider the redundancy instead. We define the redundancy of a distribution Q with respect to a model $\mathcal{M}$ as

$$R(P^*, Q, \mathcal{M}, n) := E_{Z^n \sim P^*}\left[-\ln Q(Z^n)\right] - \inf_{\theta\in\Theta} E_{Z^n\sim P^*}\left[-\ln P_\theta(Z^n)\right], \qquad (3.1)$$

where we use nats rather than bits as units of information to simplify equations. We omit the first three arguments of the redundancy if they are clear from context. These and other notational conventions are detailed in Section 3.2. The redundancy is a close lower bound on the expected regret, see Section 3.3. We do not know the exact relationship to worst-case regret. The four major types of universal codes, Bayes, NML, 2-part and PML, all achieve redundancies that are (in an appropriate sense) close to optimal. Specifically, under regularity conditions on $\mathcal{M}$ and its parameterisation, these four types of universal codes all satisfy

$$R(P^*, Q, \mathcal{M}, n) = \frac{k}{2} \ln n + O(1), \qquad (3.2)$$

where the O(1) may depend on $P^*$, $\mathcal{M}$ and the universal code Q that is used. This is the famous "k over 2 log n formula", refinements of which lie at the basis of most practical approximations to MDL learning, see [38]. Often, the source distribution $P^*$ is assumed to be an element of the model. If such is the case, then by the information inequality [25] the second term of (3.1) is minimised for $P_\theta = P^*$, so that

$$R(P^*, Q, \mathcal{M}, n) = E_{P^*}\left[-\ln Q(Z^n)\right] - E_{P^*}\left[-\ln P^*(Z^n)\right]. \qquad (3.3)$$

Thus, (3.3) can be interpreted as the expected number of additional nats one needs to encode n outcomes if one uses the code corresponding to Q instead of the optimal (Shannon-Fano) code with lengths $-\ln P^*(Z^n)$. For a good universal code this quantity is small for all or most $P^* \in \mathcal{M}$. In this chapter we consider the case where the data are i.i.d. according to an arbitrary $P^*$ not necessarily in the model $\mathcal{M}$. It is now appropriate to rename the redundancy to relative redundancy, since we measure the number of nats we lose compared to the best element in the model, rather than compared to the generating distribution $P^*$. The definition (3.1) remains unchanged. It can no longer be rewritten as (3.3) however: assuming it exists and is unique, let $P_{\theta^*}$ be the element of $\mathcal{M}$ that minimises KL divergence to $P^*$:

$$\theta^* := \arg\min_{\theta\in\Theta} D(P^* \| P_\theta) = \arg\min_{\theta\in\Theta} E_{P^*}\left[-\ln P_\theta(Z)\right],$$

where the equality follows from the definition of the KL divergence [25]. Then the relative redundancy satisfies

$$R(P^*, Q, \mathcal{M}, n) = E_{P^*}\left[-\ln Q(Z^n)\right] - E_{P^*}\left[-\ln P_{\theta^*}(Z^n)\right]. \qquad (3.4)$$

It turns out that for the NML, 2-part MDL and Bayes codes, the relative redundancy (3.4), with $P^* \notin \mathcal{M}$, still satisfies (3.2) under some conditions on $\mathcal{M}$ and $P^*$ (Section 3.3). In this chapter we show that (3.2) does not hold for PML. The PML code with length function L works by sequentially predicting $Z_{i+1}$ using a (slightly modified) ML or Bayesian MAP estimator $\hat\theta_i = \hat\theta(z^i)$ based on the past data, that is, the first i outcomes $z^i$. The total code length $L(z^n)$ on a sequence $z^n$ is given by the sum of the individual "predictive" code lengths (log losses): $L(z^n) = \sum_{i=0}^{n-1}\left[-\ln P_{\hat\theta_i}(z_{i+1})\right]$. In our main theorem, we show that if $\mathcal{M}$ is a regular one-parameter exponential family (k = 1), then

$$R(P^*, Q, \mathcal{M}, n) = \frac{1}{2}\,\frac{\mathrm{var}_{P^*} X}{\mathrm{var}_{P_{\theta^*}} X}\, \ln n + O(1), \qquad (3.5)$$

where X is the sufficient statistic of the family. Example 6 below illustrates the phenomenon. Note that if $P^* \in \mathcal{M}$, then $P_{\theta^*} = P^*$ and (3.5) becomes the familiar expression. The result holds as long as $\mathcal{M}$ and $P^*$ satisfy a mild condition that is stated and discussed in the next section. Section 3.4 discusses the consequences of this result for compression, estimation and model selection, as well as its relation to the large body of earlier results on PML coding.

Example 6. Let $\mathcal{M}$ be the family of Poisson distributions, parameterised by their mean µ. Since neither the NML universal code nor Jeffreys' prior are defined for this model, it is attractive to use the PML code as a universal code for this model. The ML estimator $\hat\mu_i$ is the empirical mean of $z_1,\ldots,z_i$.

Suppose $Z, Z_1, Z_2, \ldots$ are i.i.d. according to a degenerate $P^*$ with $P^*(Z = 4) = 1$. Since the sample average is a sufficient statistic for the Poisson family, $\hat\mu_i$ will be equal to 4 for all $i \ge 1$. On the other hand, $\mu^*$, the parameter (mean) of the distribution in $\mathcal{M}$ closest to $P^*$ in KL divergence, will be equal to 4 as well. Thus the relative redundancy (3.4) of the PML code is given by

$$R(P^*, Q, \mathcal{M}, n) = -\ln P_{\hat\mu_0}(4) + \ln P_4(4) + \sum_{i=1}^{n-1}\left[-\ln P_4(4) + \ln P_4(4)\right] = O(1),$$

assuming an appropriate definition of $\hat\mu_0$. In the case of the Poisson family, we have Z = X in (3.5). Thus, since $\mathrm{var}_{P^*} Z = 0$, this example agrees with (3.5).

Now suppose data are i.i.d. according to some $P_\tau$, with $P_\tau(Z = z) \propto (z+1)^{-3}$ for all z smaller than τ, and $P_\tau(Z = z) = 0$ for $z \ge \tau$. It is easy to check that, for $\tau \to \infty$, the entropy of $P_\tau$ converges to a finite constant, but the variance of $P_\tau$ tends to infinity. Thus, by choosing τ large enough, the redundancy obtained by the Poisson PML code can be made to grow as $c \log n$ for arbitrarily large c.
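A rough simulation sketch can illustrate this behaviour empirically (this is our own illustration, not from the thesis; it assumes NumPy and SciPy, and it anticipates the smoothed estimator with fake initial outcome $x_0$, $n_0$ from Definition 3.2.2 below). It codes geometric data with the Poisson PML code and compares the result against the best Poisson code; by (3.5) the coefficient of $\frac12 \ln n$ should be close to $\mathrm{var}_{P^*}(Z)/\mu^* = 9$ for a geometric source with mean 8. A single run gives only a noisy estimate; averaging over many runs tightens it.

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(0)

def pml_regret_poisson(z, x0=1.0, n0=1.0):
    """Regret of the Poisson PML plug-in code against the best Poisson code,
    using the smoothed estimator mu_hat_i = (x0*n0 + z_1+...+z_i)/(i + n0)."""
    n = len(z)
    past_sums = np.concatenate(([0.0], np.cumsum(z)[:-1]))  # sum of first i outcomes
    mu_hat = (x0 * n0 + past_sums) / (np.arange(n) + n0)
    L_pml = -poisson.logpmf(z, mu_hat).sum()
    L_best = -poisson.logpmf(z, z.mean()).sum()             # ML Poisson code length
    return L_pml - L_best

n, mu_star = 100_000, 8.0
z = rng.geometric(1.0 / (mu_star + 1.0), size=n) - 1  # geometric: mean 8, variance 72
print(pml_regret_poisson(z) / (0.5 * np.log(n)))      # roughly 72/8 = 9, per (3.5)
```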

Example 7. The Hardy-Weinberg model deals with genotypes, of which the alleles are assumed independently Bernoulli distributed according to some parameter p. There are four combinations of alleles, usually denoted "aa", "AA", "aA", "Aa"; but since "aA" and "Aa" result in the same genotype, the Hardy-Weinberg model is defined as a probability distribution on three outcomes. We model this by letting X be a random variable on the underlying space that maps "aA" and "Aa" to the same value: $X(aa) = 0$, $X(aA) = X(Aa) = \frac12$ and $X(AA) = 1$. Then $P(X = 0) = (1-p)^2$, $P(X = \frac12) = 2p(1-p)$ and $P(X = 1) = p^2$. The Hardy-Weinberg model is an exponential family with sufficient statistic X. To see this, note that for any parameter $p \in [0, 1]$, we have $EX = \mu = P(A) = p$, so we can parameterise the model by the mean of X. The variance of the distribution with parameter µ is $\frac12\mu(1-\mu)$. Now suppose that we code data in a situation where the Hardy-Weinberg model is wrong and the genotypes are in fact distributed according to $P(X = \frac12) = P(X = 1) = \frac12$ and $P(X = 0) = 0$, such that mean and variance of X are $\frac34$ and $\frac{2}{32}$ respectively. The closest distribution in the model has the same mean (since the mean is a sufficient statistic), and variance $\frac{3}{32}$. Thus PML will achieve a redundancy of $\frac13 \ln n$ rather than $\frac12 \ln n$ (up to O(1)).
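The numbers in Example 7 are easy to verify mechanically; the following sketch (our own, using exact rational arithmetic) recomputes the mean, the two variances, and the resulting redundancy coefficient.

```python
from fractions import Fraction as F

# True distribution: P(X = 1/2) = P(X = 1) = 1/2, P(X = 0) = 0.
outcomes = [F(1, 2), F(1)]
mean = sum(outcomes) / 2                                  # 3/4
var_true = sum((x - mean) ** 2 for x in outcomes) / 2     # 1/16 = 2/32

# The Hardy-Weinberg element with mean mu has variance mu(1 - mu)/2.
var_model = mean * (1 - mean) / 2                         # 3/32

coeff = F(1, 2) * var_true / var_model                    # 1/3, the ln n coefficient
print(mean, var_true, var_model, coeff)
```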

3.2 Main Result, Formally

In this section, we define our quantities of interest and we state and prove our main result. Throughout this text we use nats rather than bits as units of information. Outcomes are capitalised if they are to be interpreted as random variables instead of instantiated values. Let $P^*$ be a distribution on some set $\mathcal{Z}$, which can be either finite or countably infinite, or a subset of k-dimensional Euclidean space for some $k \ge 1$. A sequence of outcomes $z_1,\ldots,z_n$ is abbreviated to $z^n$. Let $X : \mathcal{Z} \to \mathbb{R}^k$ be a random vector. We write $E_{P^*}[X]$ as a shorthand for $E_{X \sim P^*}[X]$. When we consider a sequence of n outcomes independently distributed $\sim P^*$, we even use $E_{P^*}$ as a shorthand for the expectation of $(X_1,\ldots,X_n)$ under the n-fold product distribution of $P^*$. Finally, $P^*(X)$ denotes the probability mass function of $P^*$ in case X is discrete-valued, and the density of $P^*$ in case X takes values in a continuum. When we write "density function of X", then, if X is discrete-valued, this should be read as "probability mass function of X". Note however that in our main result, Theorem 3.2.3 below, we do not assume that the data-generating distribution $P^*$ admits a density.

We define the particular random vector $Z(z) := z$. Let $X : \mathcal{Z} \to \mathbb{R}$ be a random variable on $\mathcal{Z}$, and let $\mathcal{X} = \{x \in \mathbb{R} : \exists z \in \mathcal{Z} : X(z) = x\}$ be the range of X. Exponential family models are families of distributions on $\mathcal{Z}$ defined relative to a random variable X (called "sufficient statistic") as defined above, and a function $h : \mathcal{Z} \to (0, \infty)$. Let $Z(\eta) := \int_{z \in \mathcal{Z}} e^{-\eta X(z)} h(z)\, dz$ (the integral to be replaced by a sum for countable $\mathcal{Z}$), and $\Theta_\eta := \{\eta \in \mathbb{R} : Z(\eta) < \infty\}$.

Definition 3.2.1 (Exponential family). The single parameter exponential family [48] with sufficient statistic X and carrier h is the family of distributions with densities $P_\eta(z) := \frac{1}{Z(\eta)} e^{-\eta X(z)} h(z)$, where $\eta \in \Theta_\eta$. $\Theta_\eta$ is called the natural parameter space. The family is called regular if $\Theta_\eta$ is an open interval of $\mathbb{R}$.

In the remainder of this text we only consider single parameter, regular exponential families with a 1-to-1 parameterisation; this qualification will henceforth be omitted. Examples include the Poisson, geometric and multinomial families, and the model of all Gaussian distributions with a fixed variance or mean. In the first four cases, we can take X to be the identity, so that $X = Z$ and $\mathcal{X} = \mathcal{Z}$. In the case of the normal family with fixed mean, $\sigma^2$ becomes the sufficient statistic and we have $\mathcal{Z} = \mathbb{R}$, $\mathcal{X} = [0, \infty)$ and $X = Z^2$.

The statistic X(z) is sufficient for η [48]. This suggests reparameterising the distribution by the expected value of X, which is called the mean value parameterisation. The function $\mu(\eta) = E_{P_\eta}[X]$ maps parameters in the natural parameterisation to the mean value parameterisation. It is a diffeomorphism (it is one-to-one, onto, infinitely often differentiable and has an infinitely often differentiable inverse) [48]. Therefore the mean value parameter space $\Theta_\mu$ is also an open interval of $\mathbb{R}$. We note that for some models (such as Bernoulli and Poisson), the parameter space is usually given in terms of a non-open set of mean-values (e.g., [0, 1] in the Bernoulli case). To make the model a regular exponential family, we have to restrict the set of parameters to its own interior. Henceforth, whenever we refer to a standard model such as Bernoulli or Poisson, we assume that the parameter set has been restricted in this sense. We are now ready to define the PML distribution.
This is a distribution on infinite sequences $z_1, z_2, \ldots \in \mathcal{Z}^\infty$, recursively defined in terms of the distributions of $Z_{n+1}$ conditioned on $Z^n = z^n$, for all $n = 1, 2, \ldots$, all $z^n = (z_1,\ldots,z_n) \in \mathcal{Z}^n$. In the definition, we use the notation $x_i := X(z_i)$.

Definition 3.2.2 (PML distribution). Let $\Theta_\mu$ be the mean value parameter domain of an exponential family $\mathcal{M} = \{P_\mu \mid \mu \in \Theta_\mu\}$. Given $\mathcal{M}$ and constants $x_0 \in \Theta_\mu$ and $n_0 > 0$, we define the PML distribution U by setting, for all n, all $z^{n+1} \in \mathcal{Z}^{n+1}$:

$$U(z_{n+1} \mid z^n) = P_{\hat\mu(z^n)}(z_{n+1}),$$

where $U(z_{n+1} \mid z^n)$ is the density/mass function of $z_{n+1}$ conditional on $Z^n = z^n$,

$$\hat\mu(z^n) := \frac{x_0 \cdot n_0 + \sum_{i=1}^n x_i}{n + n_0},$$

and $P_{\hat\mu(z^n)}(\cdot)$ is the density of the distribution in $\mathcal{M}$ with mean $\hat\mu(z^n)$.

We henceforth abbreviate $\hat\mu(z^n)$ to $\hat\mu_n$. We usually refer to the PML distribution in terms of the corresponding code length function

$$L_U(z^n) = \sum_{i=0}^{n-1} L_U(z_{i+1} \mid z^i) = \sum_{i=0}^{n-1} -\ln P_{\hat\mu_i}(z_{i+1}).$$

To understand this definition, note that for exponential families, for any sequence of data, the ordinary maximum likelihood parameter is given by the average $n^{-1}\sum x_i$ of the observed values of X [48]. Here we define our PML distribution in terms of a slightly modified maximum likelihood estimator that introduces a "fake initial outcome" $x_0$ with multiplicity $n_0$ in order to avoid infinite code lengths for the first few outcomes (a well-known problem called by Rissanen the "inherent singularity" of predictive coding [71, 36]) and to ensure that the probability of the first outcome is well-defined for the PML distribution. In practice we can take $n_0 = 1$ but our result holds for any $n_0 > 0$. The justification of our modification to the ML estimator is discussed further in Section 3.4.
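As an illustration of this definition, here is a minimal sketch (our own, not from the thesis) of the PML code length computation; the function names are hypothetical, and the Poisson family in its mean value parameterisation is supplied as an example.

```python
import math

def pml_code_length(z, neg_log_density, x0, n0):
    """Code length in nats of the PML distribution of Definition 3.2.2:
    outcome z[i] is coded with the family member whose mean equals the
    smoothed ML estimate computed from z[0..i-1] and the fake outcome x0."""
    total, weighted_sum, count = 0.0, x0 * n0, n0
    for zi in z:
        mu_hat = weighted_sum / count   # (x0*n0 + sum of past x_i) / (i + n0)
        total += neg_log_density(zi, mu_hat)
        weighted_sum += zi              # here the sufficient statistic is X(z) = z
        count += 1
    return total

def neg_log_poisson(z, mu):
    # -ln P_mu(z) for the Poisson family with mean mu
    return mu - z * math.log(mu) + math.lgamma(z + 1)

print(pml_code_length([4, 4, 4, 4], neg_log_poisson, x0=1.0, n0=1.0))
```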

Theorem 3.2.3 (Main result). Let $X, X_1, X_2, \ldots$ be i.i.d. $\sim P^*$, with $E_{P^*}[X] = \mu^*$. Let $\mathcal{M}$ be a single parameter exponential family with sufficient statistic X and $\mu^*$ an element of the mean value parameter space. Let U denote the PML distribution with respect to $\mathcal{M}$. If $\mathcal{M}$ and $P^*$ satisfy Condition 3.2.4 below, then

$$R(P^*, U, \mathcal{M}, n) = \frac{1}{2}\,\frac{\mathrm{var}_{P^*} X}{\mathrm{var}_{P_{\mu^*}} X}\, \ln n + O(1). \qquad (3.6)$$

Comparing this to (3.5), note that $P_{\mu^*}$ is the element of $\mathcal{M}$ achieving smallest expected code length, i.e. it achieves $\inf_{\mu \in \Theta_\mu} D(P^* \| P_\mu)$ [48].

Condition 3.2.4. We require that the following holds both for $T := X$ and $T := -X$:

• If T is unbounded from above, then there is a $k \in \{4, 6, \ldots\}$ such that the first k moments of T exist under $P^*$ and that $\frac{d^4}{d\mu^4} D(P^* \| P_\mu) = O(\mu^{k-6})$.

• If T is bounded from above by a constant g, then $\frac{d^4}{d\mu^4} D(P^* \| P_\mu)$ is polynomial in $1/(g - \mu)$.

Roughly, this condition expresses a trade-off between the data generating distribution $P^*$ and the model. If the model is well-behaved, in the sense that the fourth order derivative of the KL divergence does not grow too fast with the parameter, then we do not require many moments of $P^*$ to exist. Vice versa, if the model is not well-behaved, then the theorem only holds for very specific $P^*$, of which many moments exist.

The condition holds for most single-parameter exponential families that are relevant in practice. To illustrate, in Figure 3.1 we give the fourth derivative of the divergence for a number of common exponential families explicitly. All parameters beside the mean are treated as fixed values. Note that to interpret the mean-0 normal distributions as a 1-parameter exponential family, we had to set $X(z) = z^2$, so that the mean E[X] is actually the variance $E[Z^2]$ of the normal distribution. As can be seen from the figure, for these exponential families, our condition applies whenever at least the first four moments of $P^*$ exist: a quite weak condition on the data generating distribution.

Proof of Theorem 3.2.3. For exponential families, we have

$$E_{P^*}\left[-\ln P_\mu(Z)\right] - E_{P^*}\left[-\ln P_{\mu'}(Z)\right]$$
$$= \eta(\mu) E_{P^*}[X(Z)] + \ln Z(\eta(\mu)) + E_{P^*}[-\ln h(Z)] - \eta(\mu') E_{P^*}[X(Z)] - \ln Z(\eta(\mu')) - E_{P^*}[-\ln h(Z)]$$
$$= E_{P^*}\left[-\ln P_\mu(X)\right] - E_{P^*}\left[-\ln P_{\mu'}(X)\right],$$

so that $R(P^*, U, \mathcal{M}, n) = E_{P^*}[L_U(X^n)] - \inf_\mu E_{P^*}[-\ln P_\mu(X^n)]$. This means that the relative redundancy, which is the sole quantity of interest in the proof, depends only on the sufficient statistic X, not on any other aspect of the outcome Z. Thus, in the proof of Theorem 3.2.3 as well as all the Lemmas and Propositions it makes use of, we will never mention Z again. Whenever we refer to a "distribution" we mean a distribution of the random variable X, and we also think of the data generating distribution $P^*$ in terms of the distribution it induces on X rather than Z. Whenever we say "the mean" without further qualification, we refer to the mean of the random variable X.

Figure 3.1 $\frac{d^4}{d\mu^4} D(P_{\mu^*} \| P_\mu)$ for a number of exponential families.

Bernoulli: $P_{\mu^*}(x) = (\mu^*)^x (1-\mu^*)^{1-x}$; fourth derivative $\frac{6\mu^*}{\mu^4} + \frac{6(1-\mu^*)}{(1-\mu)^4}$.
Poisson: $P_{\mu^*}(x) = \frac{e^{-\mu^*} (\mu^*)^x}{x!}$; fourth derivative $\frac{6\mu^*}{\mu^4}$.
Geometric: $P_{\mu^*}(x) = \theta^x(1-\theta) = \frac{(\mu^*)^x}{(\mu^*+1)^{x+1}}$; fourth derivative $\frac{6\mu^*}{\mu^4} - \frac{6(\mu^*+1)}{(\mu+1)^4}$.
Exponential: $P_{\mu^*}(x) = \frac{1}{\mu^*} e^{-x/\mu^*}$; fourth derivative $-\frac{6}{\mu^4} + \frac{24\mu^*}{\mu^5}$.
Normal (fixed mean = 0): $P_{\mu^*}(x) = \frac{1}{\sqrt{2\pi \mu^* x}} e^{-\frac{x}{2\mu^*}}$; fourth derivative $-\frac{3}{\mu^4} + \frac{12\mu^*}{\mu^5}$.
Normal (fixed variance = 1): $P_{\mu^*}(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac12(x-\mu^*)^2}$; fourth derivative 0.
Pareto: $P_{\mu^*}(x) = \frac{a b^a}{x^{a+1}}$ for $b = \frac{a-1}{a}\mu^*$; fourth derivative $\frac{6a}{\mu^4}$.
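Entries of Figure 3.1 can be checked symbolically. The sketch below (our own, not from the thesis; it assumes SymPy is available) verifies the Poisson row, starting from the closed form $D(P_{\mu^*}\|P_\mu) = \mu - \mu^* + \mu^*\ln(\mu^*/\mu)$ of the KL divergence between Poisson distributions.

```python
import sympy as sp

mu, mu_star = sp.symbols('mu mu_star', positive=True)

# KL divergence between Poisson(mu_star) and Poisson(mu).
D = mu - mu_star + mu_star * sp.log(mu_star / mu)

print(sp.simplify(sp.diff(D, mu, 4)))  # 6*mu_star/mu**4, matching Figure 3.1
```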

Whenever we refer to the Kullback-Leibler (KL) divergence between P and Q, we refer to the KL divergence between the distributions they induce for X. The reader who is confused by this may simply restrict attention to exponential family models for which $Z = X$, and consider X and Z identical.

The proof refers to a number of theorems and lemmas which will be developed in Section 3.5. In the statement of all these results, we assume, as in the statement of Theorem 3.2.3, that $X, X_1, X_2, \ldots$ are i.i.d. $\sim P^*$ and that $\mu^* = E_{P^*}[X]$. If X takes its values in a countable set, then all integrals in the proof should be read as the corresponding sums.

The redundancy can be rewritten further as the sum of the expected risk for each outcome (Lemma 3.5.6). We obtain

$$R(P^*, U, \mathcal{M}, n) = \sum_{i=0}^{n-1} E_{\hat\mu_i \sim P^*}\left[ D(P_{\mu^*} \| P_{\hat\mu_i}) \right]. \qquad (3.7)$$

Here, the estimator $\hat\mu_i$ is a random variable that takes on values according to $P^*$, while the optimal parameter value $\mu^*$ is fixed (and determined by $P^*$). We will write $D(\mu^* \| \hat\mu_i)$ as shorthand notation for $D(P_{\mu^*} \| P_{\hat\mu_i})$.

We now first rewrite the divergence. We abbreviate $\delta_i := \hat\mu_i - \mu^*$ and $D^{(k)}(\mu) := \frac{d^k}{d\mu^k} D(P_{\mu^*} \| P_\mu)$. That is, $D^{(k)}(\mu)$ is the k-th derivative of the function $f(\mu) := D(P_{\mu^*} \| P_\mu)$. Taylor-expansion of the divergence around $\mu^*$ yields

$$D(P_{\mu^*} \| P_{\hat\mu_i}) = 0 + \delta_i D^{(1)}(\mu^*) + \frac{\delta_i^2}{2} D^{(2)}(\mu^*) + \frac{\delta_i^3}{6} D^{(3)}(\mu^*) + \frac{\delta_i^4}{24} D^{(4)}(\ddot\mu_i).$$

The last term is the remainder term of the Taylor expansion, in which $\ddot\mu_i \in [\mu^*, \hat\mu_i]$. The second term $\delta_i D^{(1)}(\mu^*)$ is zero, since $D(\mu^* \| \mu)$ reaches its minimum at $\mu = \mu^*$. For the third term we observe that

$$D^{(2)}(\mu) = \frac{d^2}{d\mu^2} E\left[\ln P_{\mu^*}(X) - \ln P_\mu(X)\right] = -\frac{d^2}{d\mu^2} E\left[\ln P_\mu(X)\right],$$

which is equal to the Fisher information. Fisher information is defined as $I(\theta) := E\left[\left(\frac{d}{d\theta} \ln f(X \mid \theta)\right)^2\right]$, but as is well known [48], for exponential families this is equal to $-\frac{d^2}{d\theta^2} E\left[\ln f(X \mid \theta)\right]$, which matches $D^{(2)}(\cdot)$ exactly. Furthermore, for the mean value parameterisation $I(\mu) = 1/\mathrm{var}_{P_\mu}(X)$. We obtain

$$D(P_{\mu^*} \| P_{\hat\mu_i}) = \tfrac12 \delta_i^2/\mathrm{var}_{P_{\mu^*}}(X) + \tfrac16 \delta_i^3 D^{(3)}(\mu^*) + \tfrac{1}{24}\delta_i^4 D^{(4)}(\ddot\mu_i). \qquad (3.8)$$

We plug this expression back into (3.7), giving

$$R(P^*, U, \mathcal{M}, n) = \frac{1}{2\,\mathrm{var}_{P_{\mu^*}}(X)} \sum_{i=0}^{n-1} E_{P^*}\left[\delta_i^2\right] + R(n), \qquad (3.9)$$

where the remainder term R(n) is given by

$$R(n) = \sum_{i=0}^{n-1} E_{P^*}\left[\tfrac16 \delta_i^3 D^{(3)}(\mu^*) + \tfrac{1}{24}\delta_i^4 D^{(4)}(\ddot\mu_i)\right], \qquad (3.10)$$

and where $\ddot\mu_i$ and $\delta_i$ are random variables (note that $\ddot\mu_i$ depends on the index i through $\hat\mu_i$). In Lemma 3.5.8 we show that $R(n) = O(1)$, giving:

$$R(P^*, U, \mathcal{M}, n) = O(1) + \frac{1}{2\,\mathrm{var}_{P_{\mu^*}}(X)} \sum_{i=0}^{n-1} E_{P^*}\left[(\hat\mu_i - \mu^*)^2\right]. \qquad (3.11)$$

Note that $\hat\mu_i$ is almost the ML estimator. This suggests that each term in the sum of (3.11) should be almost equal to the variance of the ML estimator, which is $\mathrm{var}(X)/i$. Because of the slight modification that we made to the estimator, we get a correction term of $O(i^{-2})$, as established in Theorem 3.5.2:

$$\sum_{i=0}^{n-1} E_{P^*}\left[(\hat\mu_i - \mu^*)^2\right] = O\left(\sum_{i=0}^{n-1}(i+1)^{-2}\right) + \mathrm{var}_{P^*}(X)\sum_{i=0}^{n-1}(i+1)^{-1} = O(1) + \mathrm{var}_{P^*}(X)\ln n. \qquad (3.12)$$

The combination of (3.11) and (3.12) completes the proof.

3.3 Redundancy vs. Regret

The "goodness" of a universal code relative to a model $\mathcal{M}$ can be measured in several ways: rather than using redundancy (as we did here), one can also choose to measure code length differences in terms of regret, where one has a further choice between expected regret and worst-case regret [4]. Here we only discuss the implications of our result for the expected regret measure.

Let $\mathcal{M} = \{P_\theta \mid \theta \in \Theta\}$ be a family of distributions parameterised by Θ. Given a sequence $z^n = z_1,\ldots,z_n$ and a universal code U for $\mathcal{M}$ with lengths $L_U$, the regret of U on sequence $z^n$ is defined as

$$L_U(z^n) - \inf_{\theta\in\Theta}\left[-\ln P_\theta(z^n)\right]. \qquad (3.13)$$

Note that if the (unmodified) ML estimator $\hat\theta(z^n)$ exists, then this is equal to $L_U(z^n) + \ln P_{\hat\theta(z^n)}(z^n)$. Thus, one compares the code length achieved on $z^n$ by U to the best possible that could have been achieved on that particular $z^n$, using any of the codes/distributions in $\mathcal{M}$. Assuming $Z_1, Z_2, \ldots$ are i.i.d. according to some (arbitrary) $P^*$, one may now consider the expected regret

$$E_{Z^n \sim P^*}\left[\mathcal{R}(P^*, U, \mathcal{M}, n)\right] = E_{P^*}\left[L_U(Z^n) - \inf_{\theta\in\Theta}\left[-\ln P_\theta(Z^n)\right]\right].$$

To quantify the difference with redundancy, consider the function

$$d(n) := \inf_{\theta\in\Theta} E_{P^*}\left[-\ln P_\theta(Z^n)\right] - E_{P^*}\left[\inf_{\theta\in\Theta}\left[-\ln P_\theta(Z^n)\right]\right],$$

and note that for any universal code, the expected regret exceeds the redundancy by exactly d(n): $E[\mathcal{R}] - R = d(n)$. In case $P^* \in \mathcal{M}$, then under regularity conditions on $\mathcal{M}$ and its parameterisation, it can be shown [23] that

$$\lim_{n\to\infty} d(n) = \frac{k}{2}, \qquad (3.14)$$

where k is the dimension of $\mathcal{M}$. In our case, where $P^*$ is not necessarily in $\mathcal{M}$, we have the following:

Theorem 3.3.1. Let $\mathcal{X}$ be finite. Let $P^*$, $P_{\mu^*}$ and $\mu^*$ be as in Theorem 3.2.3. Then

$$\lim_{n\to\infty} d(n) = \frac{1}{2}\,\frac{\mathrm{var}_{P^*} X}{\mathrm{var}_{P_{\mu^*}} X}. \qquad (3.15)$$

Since we are dealing with 1-parameter families, in the special case that $P^* \in \mathcal{M}$ this result reduces to (3.14). We suspect that, under a condition similar to Condition 3.2.4, the same result still holds for general, not necessarily finite or countable or bounded $\mathcal{X}$, but we do not know the details. In any case, our result is sufficient to show that in some cases (namely, if $\mathcal{X}$ is finite), we have

$$\mathcal{R}(P^*, U, \mathcal{M}, n) = \frac{1}{2}\,\frac{\mathrm{var}_{P^*} X}{\mathrm{var}_{P_{\mu^*}} X}\,\ln n + O(1),$$

so that, up to O(1)-terms, the redundancy and the regret of the prequential ML code behave in the same way.

Incidentally, we can use Theorem 3.3.1 to substantiate the claim in Section 3.1, which stated that the Bayes (equipped with a strictly positive differentiable prior), NML and 2-part codes still achieve relative redundancy of $\frac12 \ln n$ if $P^* \notin \mathcal{M}$, at least if $\mathcal{X}$ is finite. Let us informally explain why this is the case. It is easy to show that Bayes, NML and (suitably defined) 2-part codes achieve regret $\frac12 \ln n + O(1)$ for all sequences $z_1, z_2, \ldots$ such that $\hat\theta(z^n)$ is bounded away from the boundary of the parameter space $\Theta_\mu$, for all large n [4, 38]. It then follows using, for example, the Chernoff bound that these codes must also achieve expected regret $\frac12 \ln n + O(1)$ for all distributions $P^*$ on $\mathcal{X}$ that satisfy $E_{P^*}[X] = \mu^* \in \Theta_\mu$. Theorem 3.3.1 then shows that they also achieve relative redundancy $\frac12 \ln n + O(1)$ for all distributions $P^*$ on $\mathcal{X}$ that satisfy $E_{P^*}[X] = \mu^* \in \Theta_\mu$. We omit further details.
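The quantity d(n) can itself be estimated by simulation. The sketch below is our own illustration (not from the thesis); the Bernoulli model is chosen because then $P^* \in \mathcal{M}$ automatically, so the limit (3.14) with k = 1 applies. It uses that, for Bernoulli, $d(n) = n H(\mu^*) - E[n H(\hat\mu_n)]$, where H is the entropy in nats.

```python
import numpy as np

rng = np.random.default_rng(2)

def H(p):
    """Bernoulli entropy in nats; clipping implements the convention 0*ln 0 = 0."""
    p = np.clip(p, 1e-300, 1.0)
    q = np.clip(1.0 - p, 1e-300, 1.0)
    return -(p * np.log(p) + q * np.log(q))

p_star, reps = 0.3, 200_000
for n in (100, 1000, 10_000):
    p_hat = rng.binomial(n, p_star, size=reps) / n   # ML estimates over many samples
    d_n = n * H(p_star) - np.mean(n * H(p_hat))
    print(n, d_n)   # tends to 1/2, i.e. k/2 with k = 1, cf. (3.14)
```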

3.4 Discussion

3.4.1 A Justification of Our Modification of the ML Estimator

A prequential code cannot be defined in terms of the ordinary ML estimator ($n_0 = 0$ in Definition 3.2.2) for two reasons. First, the ML estimator is undefined until the first outcome has been observed. Second, it may achieve infinite code lengths on the observed data. A simple example is the Bernoulli model. If we first observe $z_1 = 0$ and then $z_2 = 1$, the code length of $z_2$ according to the ordinary

ML estimator of $z_2$ given $z_1$ would be $-\ln P_{\hat\mu(z_1)}(z_2) = -\ln 0 = \infty$. There are several ways to resolve this problem. We choose to add a "fake initial outcome". Another possibility that has been suggested (e.g., [26]) is to use the ordinary ML estimator, but to postpone using it until after m outcomes have been observed, where m is the smallest number such that $-\ln P_{\hat\mu(z^m)}(Z_{m+1})$ is guaranteed to be finite, no matter what value $Z_{m+1}$ is realised. The first m outcomes may then be encoded by repeatedly using some code $L_0$ on outcomes of $\mathcal{Z}$, so that for $i \le m$, the code length of $z_i$ does not depend on the outcomes $z^{i-1}$. In the Bernoulli example, one could for example use the code corresponding to $P(Z_i = 1) = 1/2$, until and including the first i such that $z^i$ includes both a 0 and a 1. It then takes i bits to encode the first $z^i$ outcomes, no matter what they are. After that, one uses the prequential code with the standard ML estimator. It is easy to see (by slight modification of the proof) that our theorem still holds for this variation of prequential coding. Thus, our particular choice for resolving the startup problem is not crucial to obtaining our result. The advantage of our solution is that, as we now show, it allows us to interpret our modified ML estimator as a Bayesian MAP and Bayesian mean estimator as well, thereby showing that the same behaviour can be expected for such estimators.

3.4.2 Prequential Models with Other Estimators

An attractive property of our generalisation of the ML estimator is that it actually also generalises two other commonly used estimators, namely the Bayesian maximum a-posteriori and Bayesian mean estimators. The Bayesian maximum a-posteriori estimator can always be interpreted as an ML estimator based on the sample and some additional "fake data", provided that a conjugate prior is used ([12]; see also the notion of ESS (Equivalent Sample Size) priors discussed in, for example, [51]). Therefore, the prequential ML model as defined above can also be interpreted as a prequential MAP model for that class of priors, and the whole analysis carries over to that setting.

For the Bayesian mean estimator, the relationship is slightly less direct. However, it follows from the work of Hartigan [42, Chapter 7] on the so-called "maximum likelihood prior" that, by slightly modifying conjugate priors, we can construct priors such that the Bayesian mean also becomes of the form of our modified ML estimator.

Our whole analysis thus carries over to prequential codes based on these estimators. In fact, we believe that our result holds quite generally:

Conjecture 3.4.1. Let $\mathcal{M}$ be a regular exponential family with sufficient statistic X and let $\mathcal{P}$ be the set of distributions on $\mathcal{Z}$ such that $E_{P^*}[X^4]$ exists. There exists no "in-model" estimator such that the corresponding prequential code achieves redundancy $\frac12 \ln n + O(1)$ for all $P^* \in \mathcal{P}$.

Here, by an in-model estimator we mean an algorithm that takes as input any sample of arbitrary length and outputs a $P_\mu \in \mathcal{M}$. Let us contrast this with "out-model estimators": fix some prior on the parameter set $\Theta_\mu$ and let $P(\mu \mid z_1,\ldots,z_{n-1})$ be the Bayesian posterior with respect to this prior and data $z_1,\ldots,z_{n-1}$. One can think of the Bayesian predictive distribution $P(z_n \mid z_1,\ldots,z_{n-1}) := \int_{\mu\in\Theta_\mu} P_\mu(z_n) P(\mu \mid z_1,\ldots,z_{n-1})\, d\mu$ as an estimate of the distribution of Z, based on data $z_1,\ldots,z_{n-1}$. But unlike estimators as defined in the conjecture above, the resulting Bayesian predictive estimator will in general not be a member of $\mathcal{M}$, but rather of its convex closure: we call it an out-model estimator.

The redundancy of the Bayesian universal model is equal to the accumulated Kullback-Leibler (KL) risk of the Bayesian predictive estimator [36]. Thus, the accumulated KL risk of the Bayesian predictive estimator is $\frac12 \ln n + O(1)$ even under misspecification. Thus, if our conjecture above holds true, then in-model estimators behave in a fundamentally different way from out-model estimators in terms of their asymptotic risk.

Example 8. The well-known Laplace and Krichevsky-Trofimov estimators for the Bernoulli model [38] define PML distributions according to Definition 3.2.2:

they correspond to $x_0 = 1/2, n_0 = 2$ and $x_0 = 1/2, n_0 = 1$ respectively. Yet, they also correspond to Bayesian predictive distributions with the uniform prior or Jeffreys' prior respectively. This implies that the code lengths achieved by the Bayesian universal model with Jeffreys' prior and the PML distribution with $x_0 = 1/2, n_0 = 1$ must coincide. We claimed before that the expected regret for a Bayesian universal model is $\frac12 \log n + O(1)$ if data are i.i.d. $\sim P^*$, for essentially all distributions $P^*$. This may seem to contradict our result, which says that the expected regret of the PML distribution can be $\frac12 c \log n + O(1)$ with $c \neq 1$ if $P^* \notin \mathcal{M}$. But there really is no contradiction: since the Bernoulli model happens to contain all distributions $P^*$ on $\{0, 1\}$, we cannot have $P^* \notin \mathcal{M}$, so Theorem 3.2.3 indeed says that c = 1 no matter what $P^*$ we choose. But with more complicated models such as the Poisson or Hardy-Weinberg model, it is quite possible that $P^* \notin \mathcal{M}$. Then the Bayesian predictive distribution will not coincide with any PML distribution and we can have $c \neq 1$.

3.4.3 Practical Significance for Model Selection

As mentioned in the introduction, there are many results showing that in various contexts, if $P^* \in \mathcal{M}$, then the prequential ML code achieves optimal redundancy. These results strongly suggest that it is a very good alternative for (or at least approximation to) the NML or Bayesian codes in MDL model selection. Indeed, quoting Rissanen [71]:

"If the encoder does not look ahead but instead computes the best parameter values from the past string, only, using an algorithm which the decoder knows, then no preamble is needed. The result is a predictive coding process, one which is quite different from the sum or integral formula in the stochastic complexity.1 And it is only because of a certain inherent singularity in the process, as well as the somewhat restrictive requirement that the data must be ordered, that we do not consider the resulting predictive code length to provide another competing definition for the stochastic complexity, but rather regard it as an approximation."

Our result however shows that the prequential ML code may behave quite differently from the NML and Bayes codes, thereby strengthening the conclusion that it should not be taken as a definition of stochastic complexity. Although there is only a significant difference if data are distributed according to some $P^* \notin \mathcal{M}$, the difference is nevertheless very relevant in an MDL model selection context with disjoint models, even if one of the models under consideration does contain the "true" $P^*$. To see this, suppose we are comparing two models $\mathcal{M}_1$ and $\mathcal{M}_2$ for the same data, and in fact, $P^* \in \mathcal{M}_1 \cup \mathcal{M}_2$. For concreteness, assume $\mathcal{M}_1$ is the Poisson family and $\mathcal{M}_2$ is the geometric family. We want to decide which of these two models best explains the data. According to the MDL Principle, we should associate with each model a universal code (preferably the NML code). We should then pick the model such that the corresponding universal code length of the data is minimised. Now suppose we use the prequential ML code lengths rather than the NML code lengths. Without loss of generality suppose that $P^* \in \mathcal{M}_1$. Then $P^* \notin \mathcal{M}_2$. This means that the code length relative to $\mathcal{M}_1$ behaves essentially like the NML code length, but the code length relative to $\mathcal{M}_2$ behaves differently – at least as long as the variances do not match (which, for example, is forcibly the case if $\mathcal{M}_1$ is Poisson and $\mathcal{M}_2$ is geometric). This introduces a bias in the model selection scheme. In the previous chapter we found experimentally that the error rate for model selection based on the prequential ML code decreases more slowly than when other universal codes are used. Even though in some cases the redundancy grows more slowly than $\frac12 \ln n$, so that the prequential ML code is more efficient than the NML code, we explained that model selection based on the prequential ML codes must nevertheless always behave worse than Bayesian and NML-based model selection. The practical relevance of this phenomenon stems from the fact that the prequential ML code lengths are often a lot easier to compute than the Bayes or NML codes. They are often used in applications [61, 52], so it is important to determine when this can be done safely.

1 The stochastic complexity is the code length of the data $z_1,\ldots,z_n$ that can be achieved using the NML code.

3.4.4 Theoretical Significance

The result is also of theoretical-statistical interest: our theorem can be re-interpreted as establishing bounds on the asymptotic Kullback-Leibler risk of density estimation using ML and Bayes estimators under misspecification ($P^* \notin \mathcal{M}$). Our result implies that, under misspecification, the KL risk of estimators such as ML, which are required to lie in the model $\mathcal{M}$, behaves in a fundamentally different way from the KL risk of estimators such as the Bayes predictive distribution, which are not restricted to lie in $\mathcal{M}$. Namely, we can think of every universal model U defined as a random process on infinite sequences as an estimator in the following way: define, for all n,

$$\breve P_n := \Pr_U(Z_{n+1} = \cdot \mid Z_1 = z_1, \ldots, Z_n = z_n),$$

a function of the sample $z_1,\ldots,z_n$. $\breve P_n$ can be thought of as the "estimate of the true data generating distribution upon observing $z_1,\ldots,z_n$". In case U is the prequential ML model, $\breve P_n = P_{\hat\mu_n}$ is simply our modified ML estimator. However, for universal models other than PML, $\breve P_n$ does not always lie in $\mathcal{M}$. An example is the Bayesian universal code defined relative to some prior w. This code has lengths $L'(z^n) := -\ln \int P_\mu(z^n)\, w(\mu)\, d\mu$ [38]. The corresponding estimator is the Bayesian posterior predictive distribution $P_{\mathrm{Bayes}}(z_{i+1} \mid z^i) := \int P_\mu(z_{i+1})\, w(\mu \mid z^i)\, d\mu$ [38].

The Bayesian predictive distribution is a mixture of elements of $\mathcal{M}$. We will call standard estimators like the ML estimator, which are required to lie in $\mathcal{M}$, in-model estimators. Estimators like the Bayesian predictive distribution will be called out-model.

Let now $\breve P_n$ be any estimator, in-model or out-model. Let $\breve P_{z^n}$ be the distribution estimated for a particular realised sample $z^n$. We can measure the closeness of $\breve P_{z^n}$ to $P_{\mu^*}$, the distribution in $\mathcal{M}$ closest to $P^*$ in KL divergence, by considering the extended KL divergence

$$D^*(P_{\mu^*} \| \breve P_{z^n}) = E_{Z \sim P^*}\left[-\ln \breve P_{z^n}(Z) - \left(-\ln P_{\mu^*}(Z)\right)\right].$$

We can now consider the expected KL divergence between $P_{\mu^*}$ and $\breve P_n$ after observing a sample of length n:

$$E_{Z_1,\ldots,Z_n \sim P^*}\left[D^*(P_{\mu^*} \| \breve P_n)\right]. \qquad (3.16)$$

In analogy to the definition of "ordinary" KL risk [4], we call (3.16) the extended KL risk. We recognise the redundancy of the PML distribution as the accumulated expected KL risk of our modified ML estimator (see Proposition 3.5.7 and Lemma 3.5.6). In exactly the same way as for PML, the redundancy of the Bayesian code can be re-interpreted as the accumulated KL risk of the Bayesian predictive distribution. With this interpretation, our Theorem 3.2.3 expresses that under misspecification, the cumulative KL risk of the ML estimator differs from the cumulative KL risk of the Bayesian predictive distribution by a term of $\Theta(\ln n)$. If our conjecture that no in-model estimator can achieve redundancy $\frac12 \ln n + O(1)$ for all $\mu^*$ and all $P^*$ with finite variance is true (Section 3.4.2), then it follows that the KL risk for in-model estimators behaves in a fundamentally different way from the KL risk for out-model estimators, and that out-model estimators are needed to achieve the optimal constant c = 1 in the redundancy $\frac12 c \ln n + O(1)$.

3.5 Building Blocks of the Proof

The proof of Theorem 3.2.3 is based on Lemma 3.5.6 and Lemma 3.5.8. These Lemmas are stated and proved in Sections 3.5.2 and 3.5.3, respectively. The proofs of Theorem 3.2.3 and Theorem 3.3.1, as well as the proofs of both Lemmas, are based on a number of generally useful results about probabilities and expectations of deviations between the average and the mean of a random variable. We first list these deviation-related results.

3.5.1 Results about Deviations between Average and Mean

Lemma 3.5.1. Let $X, X_1, X_2, \ldots$ be i.i.d. with mean 0. Then $E\left[\left(\sum_{i=1}^n X_i\right)^2\right] = n\,\mathrm{var}(X)$.

Proof. The lemma is clearly true for n = 0. Suppose it is true for some n. Abbreviate $S_n := \sum_{i=1}^n X_i$. We have $E[S_n] = \sum E[X] = 0$. Now we find $E\left[S_{n+1}^2\right] = E\left[(S_n + X)^2\right] = E\left[S_n^2\right] + 2E[S_n]\,E[X] + E[X^2] = (n+1)\,\mathrm{var}(X)$. The proof follows by induction.

Theorem 3.5.2. Let $X, X_1, \ldots$ be i.i.d. random variables, define $\hat\mu_n := (n_0 \cdot x_0 + \sum_{i=1}^n X_i)/(n + n_0)$ and $\mu^* = E[X]$. If $\mathrm{var}\,X < \infty$, then $E\left[(\hat\mu_n - \mu^*)^2\right] = O\left((n+1)^{-2}\right) + \mathrm{var}(X)/(n+1)$.

Proof. We define $Y_i := X_i - \mu^*$; this can be seen as a new sequence of i.i.d. random variables with mean 0 and $\mathrm{var}\,Y = \mathrm{var}\,X$. We also set $y_0 := x_0 - \mu^*$. Now we have:

$$E\left[(\hat\mu_n - \mu^*)^2\right] = E\left[\left(\left(n_0 y_0 + \sum_{i=1}^n Y_i\right)(n+n_0)^{-1}\right)^2\right]$$
$$= E\left[\left((n_0 y_0)^2 + 2 n_0 y_0 \sum_{i=1}^n Y_i + \left(\sum_{i=1}^n Y_i\right)^2\right)(n+n_0)^{-2}\right]$$
$$= O\left((n+1)^{-2}\right) + E\left[\left(\sum_{i=1}^n Y_i\right)^2\right](n+n_0)^{-2} \stackrel{(*)}{=} O\left((n+1)^{-2}\right) + n\,\mathrm{var}(Y)(n+n_0)^{-2} = O\left((n+1)^{-2}\right) + \mathrm{var}(X)/(n+1),$$

where (*) follows by Lemma 3.5.1.
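A quick Monte Carlo check of Theorem 3.5.2 (our own sketch, with an arbitrarily chosen geometric source as the assumed $P^*$): the mean squared error of the smoothed estimator should match $\mathrm{var}(X)/(n+1)$ up to an $O((n+1)^{-2})$ term.

```python
import numpy as np

rng = np.random.default_rng(1)
x0, n0 = 0.0, 1.0
mu_star, var_x = 8.0, 72.0   # geometric on {0,1,...} with mean 8 has variance 72

def mse(n, reps=10_000):
    z = rng.geometric(1.0 / (mu_star + 1.0), size=(reps, n)) - 1
    mu_hat = (x0 * n0 + z.sum(axis=1)) / (n + n0)   # smoothed estimator of Thm 3.5.2
    return np.mean((mu_hat - mu_star) ** 2)

for n in (10, 100, 1000):
    print(n, mse(n), var_x / (n + 1))   # the two columns agree closely
```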

The following theorem is of some independent interest.

Theorem 3.5.3. Suppose $X, X_1, X_2, \ldots$ are i.i.d. with mean 0. If the first $k \in \mathbb{N}$ moments of X exist, then $E\left[\left(\sum_{i=1}^n X_i\right)^k\right] = O\left(n^{\lfloor k/2 \rfloor}\right)$.

Remark. It follows as a special case of Theorem 2 of [98] that $E\left[\left|\sum_{i=1}^n X_i\right|^k\right] = O(n^{k/2})$, which almost proves this lemma and which would in fact be sufficient for our purposes. We use this lemma instead, which has an elementary proof.

Proof. We have:

$$E\left[\left(\sum_{i=1}^n X_i\right)^k\right] = E\left[\sum_{i_1=1}^n \cdots \sum_{i_k=1}^n X_{i_1}\cdots X_{i_k}\right] = \sum_{i_1=1}^n \cdots \sum_{i_k=1}^n E\left[X_{i_1}\cdots X_{i_k}\right].$$

We define the frequency sequence of a term to be the sequence of exponents of the different random variables in the term, in decreasing order. For a frequency sequence $f_1,\ldots,f_m$, we have $\sum_{i=1}^m f_i = k$. Furthermore, using independence of the different random variables, we can rewrite $E[X_{i_1}\cdots X_{i_k}] = \prod_{i=1}^m E[X^{f_i}]$, so the value of each term is determined by its frequency sequence. By computing the number of terms that share a particular frequency sequence, we obtain:

$$E\left[\left(\sum_{i=1}^n X_i\right)^k\right] = \sum_{f_1+\ldots+f_m = k} \binom{n}{m}\binom{k}{f_1,\ldots,f_m} \prod_{i=1}^m E[X^{f_i}].$$

To determine the asymptotic behaviour, first observe that the frequency sequence $f_1,\ldots,f_m$ of which the contribution grows fastest in n is the longest sequence, since for that sequence the value of $\binom{n}{m}$ is maximised as $n \to \infty$. However, since the mean is zero, we can discard all sequences with an element 1, because for those sequences we have $\prod_{i=1}^m E[X^{f_i}] = 0$, so they contribute nothing to the expectation. Under this constraint, we obtain the longest sequence for even k by setting $f_i = 2$ for all $1 \le i \le m$; for odd k by setting $f_1 = 3$ and $f_i = 2$ for all $2 \le i \le m$; in both cases we have $m = \lfloor k/2 \rfloor$. The number of terms grows as $\binom{n}{m} \le n^m/m! = O(n^m)$; for $m = \lfloor k/2 \rfloor$ we obtain the upper bound $O(n^{\lfloor k/2 \rfloor})$. The number of frequency sequences is finite and does not depend on n; since the contribution of each one is $O(n^{\lfloor k/2 \rfloor})$, so must be the sum.

Theorem 3.5.4. Let $X, X_1, \ldots$ be i.i.d. random variables, define $\hat\mu_n := (n_0 \cdot x_0 + \sum_{i=1}^n X_i)/(n + n_0)$ and $\mu^* = E[X]$. If the first k moments of X exist, then $E\left[(\hat\mu_n - \mu^*)^k\right] = O\left(n^{-\lceil k/2 \rceil}\right)$.

Proof. The proof is similar to the proof of Theorem 3.5.2. We define $Y_i := X_i - \mu^*$; this can be seen as a new sequence of i.i.d. random variables with mean 0, and $y_0 := x_0 - \mu^*$. Now we have:

$$E\left[(\hat\mu_n - \mu^*)^k\right] = E\left[\left(\left(n_0 y_0 + \sum_{i=1}^n Y_i\right)(n+n_0)^{-1}\right)^k\right] = O(n^{-k}) \sum_{p=0}^k \binom{k}{p}(n_0 y_0)^p\, E\left[\left(\sum_{i=1}^n Y_i\right)^{k-p}\right] = O(n^{-k}) \sum_{p=0}^k \binom{k}{p}(n_0 y_0)^p \cdot O\left(n^{\lfloor (k-p)/2 \rfloor}\right).$$

In the last step we used Theorem 3.5.3 to bound the expectation. We sum k + 1 terms of which the term for p = 0 grows fastest in n, so the expression is $O\left(n^{-\lceil k/2 \rceil}\right)$ as required.

Theorem 3.5.4 concerns the expectation of the deviation of $\hat\mu_n$. We also need a bound on the probability of large deviations, for which we have the following separate theorem:

Theorem 3.5.5. Let $X, X_1, \ldots$ be i.i.d. random variables, define $\hat\mu_n := (n_0 \cdot x_0 + \sum_{i=1}^n X_i)/(n + n_0)$ and $\mu^* = E[X]$. Let $k \in \{0, 2, 4, \ldots\}$. If the first k moments exist, then $P^*\left(|\hat\mu_n - \mu^*| \ge \delta\right) = O\left(n^{-k/2}\delta^{-k}\right)$.

$$P^*\left(|\hat\mu_n - \mu^*| \ge \delta\right) = P^*\left((\hat\mu_n - \mu^*)^k \ge \delta^k\right) \le E\left[(\hat\mu_n - \mu^*)^k\right]\delta^{-k} \quad \text{(by Markov's inequality)}$$
$$= O\left(n^{-k/2}\delta^{-k}\right) \quad \text{(by Theorem 3.5.4)}$$

3.5.2 Lemma 3.5.6: Redundancy for Exponential Families

Lemma 3.5.6. Let U be a PML distribution model and $\mathcal{M}$ an exponential family as in Theorem 3.2.3. We have

$$R(P^*, U, \mathcal{M}, n) = \sum_{i=0}^{n-1} E_{\hat\mu_i \sim P^*}\left[D(P_{\mu^*} \| P_{\hat\mu_i})\right].$$

(Here, the notation $\hat\mu_i \sim P^*$ means that we take the expectation with respect to $P^*$ over data sequences of length i, of which $\hat\mu_i$ is a function.)

Proof. We have:

$$\arg\min_\mu E_{P^*}\left[-\ln P_\mu(X^n)\right] = \arg\min_\mu E_{P^*}\left[\ln \frac{P_{\mu^*}(X^n)}{P_\mu(X^n)}\right] = \arg\min_\mu D(P_{\mu^*} \| P_\mu).$$

In the last step we used Proposition 3.5.7 below. The divergence is minimised when $\mu = \mu^*$ [48], so we find that:

$$R(P^*, U, \mathcal{M}, n) = E_{P^*}\left[-\ln U(X^n)\right] - E_{P^*}\left[-\ln P_{\mu^*}(X^n)\right] = E_{P^*}\left[\ln \frac{P_{\mu^*}(X^n)}{U(X^n)}\right]$$
$$= E_{P^*}\left[\sum_{i=0}^{n-1} \ln \frac{P_{\mu^*}(X_{i+1})}{P_{\hat\mu_i}(X_{i+1})}\right] = \sum_{i=0}^{n-1} E_{P^*}\left[\ln \frac{P_{\mu^*}(X_{i+1})}{P_{\hat\mu_i}(X_{i+1})}\right] = \sum_{i=0}^{n-1} E_{\hat\mu_i \sim P^*}\left[D(P_{\mu^*} \| P_{\hat\mu_i})\right]. \qquad (3.17)$$

Here, the last step again follows from Proposition 3.5.7.

Proposition 3.5.7. Let $X \sim P^*$ with mean $\mu^*$, and let $P_\mu$ index an exponential family with sufficient statistic X, so that $P_{\mu^*}$ exists. We have:

$$E_{P^*}\left[\ln \frac{P_{\mu^*}(X)}{P_\theta(X)}\right] = D(P_{\mu^*} \| P_\theta).$$

Proof. Let $\eta(\cdot)$ denote the function mapping parameters in the mean value parameterisation to the natural parameterisation. (It is the inverse of the function $\mu(\cdot)$ which was introduced in the discussion of exponential families.) By working out both sides of the equation we find that they both reduce to:

$$-\eta(\mu^*)\mu^* - \ln Z(\eta(\mu^*)) + \eta(\theta)\mu^* + \ln Z(\eta(\theta)).$$

3.5.3 Lemma 3.5.8: Convergence of the sum of the remainder terms

Lemma 3.5.8. Let R(n) be defined as in (3.10). Then

R(n)= O(1).

Proof. We omit irrelevant constants. We abbreviate $\frac{d^k}{d\mu^k} D(P_{\mu^*} \| P_\mu) = D^{(k)}(\mu)$ as in the proof of Theorem 3.2.3. First we consider the third order term. We write

$E_{\delta_i \sim P^*}$ to indicate that we take the expectation over data which are distributed according to $P^*$, of which $\delta_i$ is a function. We use Theorem 3.5.4 to bound the expectation of $\delta_i^3$; under the condition that the first three moments exist, which is assumed to be the case, we obtain:

$$\sum_{i=0}^{n-1} E_{\delta_i \sim P^*}\left[\delta_i^3 D^{(3)}(\mu^*)\right] = D^{(3)}(\mu^*) \sum_{i=0}^{n-1} E\left[\delta_i^3\right] = D^{(3)}(\mu^*) \sum_{i=0}^{n-1} O\left((i+1)^{-2}\right) = O(1).$$

(The constants implicit in the big-ohs are the same across terms.) The fourth order term is more involved, because $D^{(4)}(\mu)$ is not necessarily constant across terms. To compute it we first distinguish a number of regions in the value space of $\delta_i$: let $\Delta_- = (-\infty, 0)$ and let $\Delta_0 = [0, a)$ for some constant value $a > 0$. If the individual outcomes X are bounded on the right hand side by a value g, then we require that $a < g - \mu^*$; otherwise a may be chosen arbitrarily. For $j = 1, 2, \ldots$ we further define $\Delta_j := [a + j - 1, a + j)$. First we consider the deviations $\delta_i \in \Delta_- \cup \Delta_0$. In

this case, the basic idea is that since the remainder $D^{(4)}(\ddot\mu_i)$ is well-defined over the interval $\mu^* \le \mu < \mu^* + a$, we can bound it by its extremum on that interval, namely $m := \sup_{\mu \in [\mu^*, \mu^*+a)} \left|D^{(4)}(\mu)\right|$. Now we get:

$$\sum_{i=0}^{n-1} P^*(\delta_i \in \Delta_0)\, E\left[\delta_i^4 D^{(4)}(\ddot\mu_i) \,\middle|\, \delta_i \in \Delta_0\right] \le \sum_i E\left[\delta_i^4 \left|D^{(4)}(\ddot\mu_i)\right|\right] \le m \sum_i E\left[\delta_i^4\right]. \qquad (3.18)$$

Using Theorem 3.5.4 we find that $E[\delta_i^4]$ is $O((i+1)^{-2})$, of which the sum converges. Theorem 3.5.4 requires that the first four moments of $P^*$ exist, but this is guaranteed to be the case: either the outcomes are bounded from both sides, in which case all moments necessarily exist, or the existence of the required moments is part of the condition on the main theorem.

Now we have to distinguish between the unbounded and bounded cases. First we assume that the X are unbounded from above. In this case, we must show convergence of:

$$\sum_{i=0}^{n-1}\sum_{j=1}^{\infty} P^*(\delta_i \in \Delta_j)\, E\left[\delta_i^4 D^{(4)}(\ddot\mu_i) \,\middle|\, \delta_i \in \Delta_j\right].$$

To show convergence, we bound the absolute value of this expression from above. The $\delta_i$ in the expectation is at most $a + j$. Furthermore $D^{(4)}(\ddot\mu_i) = O(\mu^{k-6})$ by the assumption on the main theorem, where $\mu \in [a + j - 1, a + j)$. Depending on k, both boundaries could maximise this function, but it is easy to check that in both cases the resulting function is $O(j^{k-6})$. So we get:

$$\ldots \le \sum_{i=0}^{n-1}\sum_{j=1}^{\infty} P^*\left(|\delta_i| \ge a + j - 1\right)(a+j)^4\, O(j^{k-6}).$$

Since we know from the condition on the main theorem that the first $k \ge 4$ moments exist, we can apply Theorem 3.5.5 to find that $P(|\delta_i| \ge a+j-1) = O\left(i^{-\lceil k/2\rceil}(a+j-1)^{-k}\right) = O(i^{-k/2})\,O(j^{-k})$ (since k has to be even); plugging this into the equation and simplifying, we obtain $\sum_i O(i^{-k/2}) \sum_j O(j^{-2})$. For $k \ge 4$ this expression converges.

Now we consider the case where the outcomes are bounded from above by g. This case is more complicated, since now we have made no extra assumptions as to the existence of the moments of $P^*$. Of course, if the outcomes are bounded from both sides, then all moments necessarily exist, but if the outcomes are unbounded from below this may not be true. We use a trick to remedy this: we map all outcomes into a new domain in such a way that all moments of the transformed variables are guaranteed to exist. Any constant $x_-$ defines a mapping $g(x) := \max\{x_-, x\}$. Furthermore we define the random variables $Y_i := g(X_i)$, the initial outcome $y_0 := g(x_0)$, and the mapped analogues of $\mu^*$ and $\hat\mu_i$, respectively: $\mu^\dagger$ is defined as the mean of Y under $P^*$, and $\tilde\mu_i := (y_0 \cdot n_0 + \sum_{j=1}^i Y_j)/(i + n_0)$. Since $\tilde\mu_i \ge \hat\mu_i$, we can bound:

P (δ ∆ )E δ 4D(4)(¨µ ) δ ∆ i ∈ 1 i i | i ∈ 1 i X h i ∗ 4 (4) P (ˆµi µ a) sup δi D (¨µi) ≤ − ≥ δ ∈∆ i i 1 X † ∗ † 4 (4) P ( µ˜i µ a + µ µ )g sup D (¨µi) . ≤ | − | ≥ − δi∈∆1 i X

By choosing x− small enough, we can bring µ† and µ∗ arbitrarily close together; in particular we can choose x− such that a + µ∗ µ† > 0 so that application of − − k Theorem 3.5.5 is safe. It reveals that the summed probability is O(i 2 ). Now we bound D(4)(¨µ ) which is O((g µ)−m) for some m N by the condition on the i − ∈ main theorem. Here we use thatµ ¨i µˆi; the latter is maximised if all outcomes equal the bound g, in which case the≤ estimator equals g n (g x )/(i + n )= − 0 − 0 0 g O(i−1). Putting all of this together, we get sup D(4)(¨µ ) = O((g µ)−m)= − i − O(im); if we plug this into the equation we obtain:

− k 4 m 4 m− k . . . O(i 2 )g O(i )= g O(i 2 ) ≤ i i X X This converges if we choose k m/2. As the construction of the mapping g( ) ensures that all moments exist,≥ the first m/2 moments certainly must exist. This· completes the proof.

3.6 Proof of Theorem 3.3.1

We use the same conventions as in the proof of Theorem 3.2.3. Specifically, we concentrate on the random variables $X_1, X_2, \ldots$ rather than $Z_1, Z_2, \ldots$, which is justified by Equation (3.7). Let

$$f(x^n) = -\ln P_{\mu^*}(x^n) - \inf_{\mu \in \Theta_\mu}\left[-\ln P_\mu(x^n)\right].$$

Within this section, $\hat\mu(x^n)$ is defined as the ordinary ML estimator. Note that, if $x^n$ is such that its ML estimate is defined, then $f(x^n) = -\ln P_{\mu^*}(x^n) + \ln P_{\hat\mu(x^n)}(x^n)$.

Note that $d(n) = E_{P^*}[f(X^n)]$. Let $h(x)$ be the carrier of the exponential family under consideration (see Definition 3.2.1). Without loss of generality, we assume $h(x) > 0$ for all $x$ in the finite set $\mathcal{X}$. Let $a_n^2 = n^{-1/2}$, and let $\pi_n$ denote the probability that $(\mu^* - \hat\mu_n)^2 \ge a_n^2$. We can write

$$d(n) = E_{P^*}[f(X^n)] = \pi_n\, E_{P^*}\!\left[f(X^n) \,\middle|\, (\mu^* - \hat\mu_n)^2 \ge a_n^2\right] + (1 - \pi_n)\, E_{P^*}\!\left[f(X^n) \,\middle|\, (\mu^* - \hat\mu_n)^2 < a_n^2\right]. \tag{3.19}$$

Since $h(x) > 0$ for all $x \in \mathcal{X}$, it follows by the definition of exponential families that $\sup_{x \in \mathcal{X}} -\ln P_{\mu^*}(x) < \infty$. Together, (3.20) and (3.21) show that the expression on the first line of (3.19) converges to 0, so that (3.19) reduces to

$$d(n) = (1 - \pi_n)\, E_{P^*}\!\left[f(X^n) \,\middle|\, (\mu^* - \hat\mu_n)^2 < a_n^2\right] + o(1). \tag{3.22}$$

In the following we abbreviate $g(n) := E_{P^*}\!\left[(\mu^* - \hat\mu_n)^2 \,\middle|\, (\mu^* - \hat\mu_n)^2 < a_n^2\right]$. Since $I(\mu)$ is smooth and positive, within the conditioning event we can Taylor-approximate it as $I(\mu^*) + O(n^{-1/4})$, so we obtain the bound:

$$E_{P^*}\!\left[f(X^n) \,\middle|\, (\mu^* - \hat\mu_n)^2 < a_n^2\right] \;\le\; \tfrac{1}{2}\, n\, g(n)\left(I(\mu^*) + O(n^{-1/4})\right). \tag{3.24}$$

To evaluate g(n), note that we have

$$E_{P^*}\!\left[(\mu^* - \hat\mu_n)^2\right] = \pi_n\, E_{P^*}\!\left[(\mu^* - \hat\mu_n)^2 \,\middle|\, (\mu^* - \hat\mu_n)^2 \ge a_n^2\right] + (1 - \pi_n)\, g(n). \tag{3.25}$$

Using Theorem 3.5.2 with $n_0 = 0$ we rewrite the expectation on the left hand side as $\operatorname{var}_{P^*}\!X / n$. Subsequently reordering terms we obtain:

$$g(n) = \frac{(\operatorname{var}_{P^*}\!X)/n - \pi_n\, E_{P^*}\!\left[(\mu^* - \hat\mu_n)^2 \,\middle|\, (\mu^* - \hat\mu_n)^2 \ge a_n^2\right]}{1 - \pi_n}. \tag{3.26}$$

Plugging this into bound (3.24), and multiplying both sides by $1 - \pi_n$, we get:

$$(1 - \pi_n)\, E_{P^*}\!\left[f(X^n) \,\middle|\, (\mu^* - \hat\mu_n)^2 < a_n^2\right] \;\le\; \frac{\operatorname{var}_{P^*}\!X}{2}\left(I(\mu^*) + O(n^{-\frac14})\right) = \frac{1}{2}\, \frac{\operatorname{var}_{P^*}\!X}{\operatorname{var}_{P_{\mu^*}}\!X} + O(n^{-\frac14}),$$

where the last equality uses that for exponential families the Fisher information satisfies $I(\mu^*) = 1/\operatorname{var}_{P_{\mu^*}}\!X$.

The result follows if we combine this with (3.22).

3.7 Conclusion and Future Work

In this paper we established two theorems about the relative redundancy, defined in Section 3.1:

1. A particular type of universal code, the prequential ML code or ML plug-in code, exhibits behaviour that we found unexpected. While other important universal codes, such as the NML/Shtarkov and Bayesian codes, achieve a regret of $\frac12 \ln n$, where $n$ is the sample size, the prequential ML code achieves a relative redundancy of $\frac{1}{2}\, \frac{\operatorname{var}_{P^*}\!X}{\operatorname{var}_{P_{\mu^*}}\!X} \ln n$. (Sections 3.1 and 3.2.)

2. At least for finite sample spaces, the relative redundancy is very close to the expected regret, the difference going to $\frac{1}{2}\, \frac{\operatorname{var}_{P^*}\!X}{\operatorname{var}_{P_{\mu^*}}\!X}$ as the sample size increases (Section 3.3, Theorem 3.3.1). In future work, we hope to extend this theorem to general 1-parameter exponential families with arbitrary sample spaces.

There is a substantial amount of literature in which the regret for the prequential ML code is proven to grow with $\frac12 \ln n$. While this may seem to contradict our results, in fact it does not: in those articles, settings are considered where $P^* \in \mathcal{M}$, and under such circumstances our own findings predict precisely that behaviour.

The first result is robust with respect to slight variations in the definition of the prequential ML code: in our framework the so-called "start-up problem" (the unavailability of an ML estimate for the first few outcomes) is resolved by introducing fake initial outcomes. Our framework thus also covers prequential codes that use other point estimators, such as the Bayesian MAP and mean estimators defined relative to a large class of reasonable priors. In Section 3.4.2 we conjecture that, no matter what in-model estimator is used, the prequential model cannot yield a relative redundancy of $\frac12 \ln n$ independently of the variance of the data generating distribution.

Chapter 4

Interlude: Prediction with Expert Advice

We cannot predict exactly how complicated processes such as the weather, the stock market, social interactions and so on will develop into the future. Nevertheless, people do make weather forecasts and buy shares all the time. Such predictions can be based on formal models, or on human expertise or intuition. An investment company may even want to choose between portfolios on the basis of a combination of these kinds of predictors. In such scenarios, predictors typically cannot be considered "true". Thus, we may well end up in a position where we have a whole collection of prediction strategies, or experts, each of whom has some insight into some aspects of the process of interest. We address the question how a given set of experts can be combined into a single predictive strategy that is as good as, or if possible even better than, the best individual expert.

The setup is as follows. Let $\Xi$ be a finite set of experts. Each expert $\xi \in \Xi$ issues a distribution $P_\xi(x_{n+1} \mid x^n)$ on the next outcome $x_{n+1}$ given the previous observations $x^n := x_1, \ldots, x_n$. Here, each outcome $x_i$ is an element of some countable space $\mathcal{X}$, and random variables are written in bold face. The probability that an expert assigns to a sequence of outcomes is given by the chain rule: $P_\xi(x^n) = P_\xi(x_1) \cdot P_\xi(x_2 \mid x_1) \cdots P_\xi(x_n \mid x^{n-1})$.

A standard Bayesian approach to combine the expert predictions is to define a prior $w$ on the experts $\Xi$, which induces a joint distribution with mass function $P(x^n, \xi) = w(\xi) P_\xi(x^n)$. Inference is then based on this joint distribution. We can compute, for example: (a) the marginal probability of the data $P(x^n) = \sum_{\xi \in \Xi} w(\xi) P_\xi(x^n)$, (b) the predictive distribution on the next outcome $P(x_{n+1} \mid x^n) = P(x^n, x_{n+1}) / P(x^n)$, which defines a prediction strategy that combines those of the individual experts, or (c) the posterior distribution on the experts $P(\xi \mid x^n) = P_\xi(x^n) w(\xi) / P(x^n)$, which tells us how the experts' predictions should be weighted. This simple probabilistic approach has the advantage that it is computationally easy: predicting $n$ outcomes using $|\Xi|$ experts requires only $O(n \cdot |\Xi|)$ time. Additionally, this Bayesian strategy guarantees that the overall probability of the data is only a factor $w(\hat\xi)$ smaller than the probability of the data according to the best available expert $\hat\xi$. On the flip side, with this strategy we never do any better than $\hat\xi$ either: we have $P_{\hat\xi}(x^n) \ge P(x^n) \ge P_{\hat\xi}(x^n) w(\hat\xi)$, which means that potentially valuable insights from the other experts are not used to our advantage!

More sophisticated combinations of prediction strategies can be found in the literature under various headings, including (Bayesian) statistics, source coding and universal prediction. In the latter setting the experts' predictions are not necessarily probabilistic, and are scored using an arbitrary loss function. Here we consider only logarithmic loss, although our results can undoubtedly be generalised to the framework described in, e.g., [95].

The three main contributions of this paper are the following. First, we introduce prior distributions on sequences of experts, which allows a unified description of many existing models. Second, we show how HMMs can be used as an intuitive graphical language to describe such priors and obtain computationally efficient prediction strategies.
Third, we use this new approach to describe and analyse several important existing models, as well as one recent and one completely new model for expert tracking.
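To make the standard Bayesian combination concrete, the following minimal Python sketch computes the marginal (a) and the posterior (c). The expert representation (functions from the observed past to a mass function on the next outcome) and the toy experts are illustrative assumptions, not notation from the text.

    # Minimal sketch of the Bayesian combination of experts under log-loss.
    # An expert is a function mapping the observed past (a tuple) to a
    # probability mass function on the next outcome (a dict); the expert
    # names and toy forecasting systems below are illustrative only.

    def bayes_mixture(experts, prior, xs):
        """Return the marginal P(x^n) and the posterior w(xi | x^n)."""
        weights = dict(prior)                        # holds w(xi) * P_xi(x^i)
        for i, x in enumerate(xs):
            for name, expert in experts.items():
                weights[name] *= expert(xs[:i])[x]   # chain rule per expert
        marginal = sum(weights.values())
        posterior = {name: wt / marginal for name, wt in weights.items()}
        return marginal, posterior

    # Two toy experts on outcomes {0, 1}.
    experts = {
        "optimist": lambda past: {0: 0.1, 1: 0.9},
        "uniform":  lambda past: {0: 0.5, 1: 0.5},
    }
    marginal, posterior = bayes_mixture(
        experts, {"optimist": 0.5, "uniform": 0.5}, (1, 1, 0, 1))

Since only one weight per expert is maintained and updated per outcome, the running time is the $O(n \cdot |\Xi|)$ stated above.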

Overview In Section 4.1 we develop a new, more general framework for combining expert predictions, where we consider the possibility that the optimal weights used to mix the expert predictions may vary over time, i.e. as the sample size increases. We stick to Bayesian methodology, but we define the prior distribution as a probability measure on sequences of experts rather than on experts. The probability of a sequence $\xi_1, \xi_2, \ldots$ is the probability that we rely on expert $\xi_1$'s prediction of the first outcome and expert $\xi_2$'s prediction of the second outcome, etc. This allows for the expression of more sophisticated models for the combination of expert predictions. For example, the nature of the data generating process may evolve over time; consequently different experts may be better during different periods of time. It is also possible that not the data generating process, but the experts themselves change as more and more outcomes are being observed: they may learn from past mistakes, possibly at different rates, or they may have occasional bad days, etc. In both situations we may hope to benefit from more sophisticated modelling. Of course, not all models for combining expert predictions are computationally feasible. Section 4.2 describes a methodology for the specification of models that allow efficient evaluation. We achieve this by using hidden Markov models (HMMs) on two levels. On the first level, we use an HMM as a formal specification of a distribution on sequences of experts as defined in Section 4.1. We introduce a graphical language to conveniently represent its structure. These graphs help to understand and compare existing models and to design new ones.

We then modify this first HMM to construct a second HMM that specifies the distribution on sequences of outcomes. Subsequently, we can use the standard dynamic programming algorithms for HMMs (forward, backward and Viterbi) on both levels to efficiently calculate most relevant quantities, most importantly the marginal probability of the observed outcomes $P(x^n)$ and the posterior weights on the next expert given the previous observations $P(\xi_{n+1} \mid x^n)$.

It turns out that many existing models for prediction with expert advice can be specified as HMMs. We provide an overview in Section 4.3 by giving the graphical representations of the HMMs corresponding to the following models. First, universal elementwise mixtures (sometimes called mixture models) that learn the optimal mixture parameter from data. Second, Herbster and Warmuth’s fixed share algorithm for tracking the best expert [45, 46]. Third, universal share, which was introduced by Volf and Willems as the “switching method” [94] and later independently proposed by Bousquet [15]. Here the goal is to learn the optimal fixed-share parameter from data. The last considered model safeguards against overconfident experts, a case first considered by Vovk in [95]. We render each model as a prior on sequences of experts by giving its HMM. The size of the HMM immediately determines the required running time for the forward algorithm. The generalisation relationships between these models as well as their running times are displayed in Figure 4.1. In each case this running time coincides with that of the best known algorithm. We also give a loss bound for each model, relating the loss of the model to the loss of the best competitor among a set of alternatives in the worst case. Such loss bounds can help select between different models for specific prediction tasks.

Besides the models found in the literature, Figure 4.1 also includes two new generalisations of fixed share: the switch distribution and the run-length model. These models are the subject of Section 4.4. In Chapter 5 the switch distribution is used to improve Bayes/Minimum Description Length prediction to achieve the optimal rate of convergence in nonparametric settings. Here we give the concrete HMM that allows for its linear time computation, and we prove that it matches its parametric definition. The run-length model is based on a distribution on the number of successive outcomes that are typically well-predicted by the same expert. Run-length codes are typically applied directly to the data, but in our novel application they define the prior on expert sequences instead. Again, we provide the graphical representation of their defining HMMs as well as loss bounds.

Then in Section 4.5 we discuss a number of extensions of the above approach, such as approximation methods to speed up calculations for large HMMs.

Figure 4.1 Expert sequence priors: generalisation relationships and running times. [Diagram: models grouped by running time — fixed expert: $O(n)$; Bayesian mixture and fixed elementwise mixture; fixed share, fixed overconfident experts and the switch distribution: $O(n|\Xi|)$; universal share, run-length model and universal overconfident experts: $O(n^2|\Xi|)$; universal elementwise mixture: $O(n^{|\Xi|})$. Arrows indicate which models generalise which.]

4.1 Expert Sequence Priors

In this section we explain how expert tracking can be described in probability theory using expert sequence priors (ES-priors). These ES-priors are distributions on the space of infinite sequences of experts that are used to express regularities in the development of the relative quality of the experts’ predictions. As illustrations we render Bayesian mixtures and elementwise mixtures as ES-priors. In the next section we show how ES-priors can be implemented efficiently by hidden Markov models.

Notation For $n \in \mathbb{N}$, we abbreviate $\{1, 2, \ldots, n\}$ by $[n]$, with the understanding that $[0] = \emptyset$. We also define $[\infty] = \mathbb{Z}^+$. For any natural number $n$, we let the variable $q^n$ range over the $n$-fold Cartesian product $Q^n$, and we write $q^n = \langle q_1, \ldots, q_n \rangle$. We also let $q^\infty$ range over $Q^\infty$ — the set of infinite sequences over $Q$ — and write $q^\infty = \langle q_1, \ldots \rangle$. We read the statement $q^\lambda \in Q^{\le\infty}$ to first bind $\lambda \le \infty$ and subsequently $q^\lambda \in Q^\lambda$. If $q^\lambda$ is a sequence, and $\kappa \le \lambda$, then we denote by $q^\kappa$ the prefix of $q^\lambda$ of length $\kappa$.

Forecasting Systems Let $\mathcal{X}$ be a countable outcome space. We use the notation $\mathcal{X}^*$ for the set of all finite sequences over $\mathcal{X}$ and let $\triangle(\mathcal{X})$ denote the set of all probability mass functions on $\mathcal{X}$. A (prequential) $\mathcal{X}$-forecasting system (PFS) is a function $P : \mathcal{X}^* \to \triangle(\mathcal{X})$ that maps sequences of previous observations to a predictive distribution on the next outcome. Prequential forecasting systems were introduced by Dawid in [27].

Distributions We also require probability measures on spaces of infinite sequences. In such a space, a basic event is the set of all continuations of a given prefix. We identify such events with their prefix. Thus a distribution on $\mathcal{X}^\infty$ is defined by a function $P : \mathcal{X}^* \to [0, 1]$ that satisfies $P(\epsilon) = 1$, where $\epsilon$ is the empty sequence, and for all $n \ge 0$, all $x^n \in \mathcal{X}^n$, we have $\sum_{x \in \mathcal{X}} P(x_1, \ldots, x_n, x) = P(x^n)$. We identify $P$ with the distribution it defines. We write $P(x^n \mid x^m)$ for $P(x^n)/P(x^m)$ if $0 \le m \le n$.

Note that forecasting systems continue to make predictions even after they have assigned probability 0 to a previous outcome, while distributions' predictions become undefined. Nonetheless we use the same notation: we write $P(x_{n+1} \mid x^n)$ for the probability that a forecasting system $P$ assigns to the $(n+1)$st outcome given the first $n$ outcomes, as if $P$ were a distribution.
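For concreteness, a PFS can be rendered directly in code as a function from a finite past to a mass function on the next outcome, and the chain rule turns it into a distribution on sequences. In the following minimal Python sketch the Laplace rule is an illustrative choice of PFS, not something prescribed by the text.

    # A PFS as code: finite past -> mass function on the next outcome.
    def laplace_pfs(past):
        """Illustrative PFS for binary outcomes: smoothed relative frequency."""
        ones = sum(past)
        n = len(past)
        p1 = (ones + 1) / (n + 2)          # Laplace rule of succession
        return {0: 1 - p1, 1: p1}

    def sequence_probability(pfs, xs):
        """P(x^n) via the chain rule, as in the definitions above."""
        p = 1.0
        for i, x in enumerate(xs):
            p *= pfs(xs[:i])[x]
        return p

    # P(1) * P(1 | 1) = 1/2 * 2/3 = 1/3
    assert abs(sequence_probability(laplace_pfs, (1, 1)) - 1/3) < 1e-12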

ES-Priors The slogan of this chapter is: we do not understand the data. Instead of modelling the data, we work with experts. We assume that there is a fixed set of experts $\Xi$, and that each expert $\xi \in \Xi$ predicts using a forecasting system $P_\xi$. We are interested in switching between different forecasting systems at different sample sizes. For a sequence of experts with prefix $\xi^n$, the combined forecast, where expert $\xi_i$ predicts the $i$th outcome, is denoted

$$P_{\xi^n}(x^n) := \prod_{i=1}^n P_{\xi_i}(x_i \mid x^{i-1}).$$

Adopting Bayesian methodology, we impose a prior $\pi$ on infinite sequences of experts; this prior is called an expert sequence prior (ES-prior). Inference is then based on the distribution on the joint space $(\mathcal{X} \times \Xi)^\infty$, called the ES-joint, which is defined as follows:

$$P\left(\langle \xi_1, x_1 \rangle, \ldots, \langle \xi_n, x_n \rangle\right) := \pi(\xi^n) P_{\xi^n}(x^n). \tag{4.1}$$

We adopt shorthand notation for events: when we write $P(S)$, where $S$ is a subsequence of $\xi^n$ and/or of $x^n$, this means the probability under $P$ of the set of sequences of pairs which match $S$ exactly. For example, the marginal probability of a sequence of outcomes is:

$$P(x^n) = \sum_{\xi^n \in \Xi^n} P(\xi^n, x^n) = \sum_{\xi^n} P\left(\langle \xi_1, x_1 \rangle, \ldots, \langle \xi_n, x_n \rangle\right). \tag{4.2}$$

Compare this to the usual Bayesian statistics, where a model $\{P_\theta \mid \theta \in \Theta\}$ is also endowed with a prior distribution $w$ on $\Theta$. Then, after observing outcomes $x^n$, inference is based on the posterior $P(\theta \mid x^n)$ on the parameter, which is never actually observed. Our approach is exactly the same, but we always consider $\Theta = \Xi^\infty$. Thus as usual our predictions are based on the posterior $P(\xi^\infty \mid x^n)$. However, since the predictive distribution of $x_{n+1}$ only depends on $\xi_{n+1}$ (and $x^n$), we always marginalise as follows:

$$P(\xi_{n+1} \mid x^n) = \frac{P(\xi_{n+1}, x^n)}{P(x^n)} = \frac{\sum_{\xi^n} P(\xi^n, x^n) \cdot \pi(\xi_{n+1} \mid \xi^n)}{\sum_{\xi^n} P(\xi^n, x^n)}. \tag{4.3}$$

At each point in time we predict the data using the posterior, which is a mixture over our experts' predictions. Ideally, the ES-prior $\pi$ should be chosen such that the posterior coincides with the optimal mixture weights of the experts at each sample size. The traditional interpretation of our ES-prior as a representation of belief about an unknown "true" expert sequence is tenuous, as normally the experts do not generate the data, they only predict it. Moreover, by mixing over different expert sequences, it is often possible to predict significantly better than by using any single sequence of experts, a feature that is crucial to the performance of many of the models that will be described below and in Section 4.3. In the remainder of this chapter we motivate ES-priors by giving performance guarantees in the form of bounds on running time and loss.

4.1.1 Examples

We now show how two ubiquitous models can be rendered as ES-priors.

Example 9 (Bayesian Mixtures). Let $\Xi$ be a set of experts, and let $P_\xi$ be a PFS for each $\xi \in \Xi$. Suppose that we do not know which expert will make the best predictions. Following the usual Bayesian methodology, we combine their predictions by conceiving a prior $w$ on $\Xi$, which (depending on the adhered philosophy) may or may not be interpreted as an expression of one's beliefs in this respect. Then the standard Bayesian mixture $P_\text{bayes}$ is given by

$$P_\text{bayes}(x^n) = \sum_{\xi \in \Xi} P_\xi(x^n) w(\xi), \quad \text{where } P_\xi(x^n) = \prod_{i=1}^n P_\xi(x_i \mid x^{i-1}). \tag{4.4}$$

The Bayesian mixture is not an ES-joint, but it can easily be transformed into one by using the ES-prior that assigns probability $w(\xi)$ to the identically-$\xi$ sequence for each $\xi \in \Xi$:

$$\pi_\text{bayes}(\xi^n) = \begin{cases} w(k) & \text{if } \xi_i = k \text{ for all } i = 1, \ldots, n, \\ 0 & \text{otherwise.} \end{cases}$$

We will use the adjective "Bayesian" generously throughout this paper, but when we write the standard Bayesian ES-prior this always refers to $\pi_\text{bayes}$. ♦

Example 10 (Elementwise Mixtures). The elementwise mixture¹ is formed from some mixture weights $\alpha \in \triangle(\Xi)$ by

$$P_{\text{mix},\alpha}(x^n) := \prod_{i=1}^n P_\alpha(x_i \mid x^{i-1}), \quad \text{where } P_\alpha(x_i \mid x^{i-1}) = \sum_{\xi \in \Xi} P_\xi(x_i \mid x^{i-1}) \alpha(\xi).$$

In the preceding definition, it may seem that elementwise mixtures do not fit in the framework of ES-priors. But we can rewrite this definition in the required form as follows:

$$P_{\text{mix},\alpha}(x^n) = \prod_{i=1}^n \sum_{\xi \in \Xi} P_\xi(x_i \mid x^{i-1}) \alpha(\xi) = \sum_{\xi^n \in \Xi^n} \prod_{i=1}^n P_{\xi_i}(x_i \mid x^{i-1}) \alpha(\xi_i) \tag{4.5a}$$

$$= \sum_{\xi^n} P_{\xi^n}(x^n)\, \pi_{\text{mix},\alpha}(\xi^n),$$

which is the ES-joint based on the prior

$$\pi_{\text{mix},\alpha}(\xi^n) := \prod_{i=1}^n \alpha(\xi_i). \tag{4.5b}$$

Thus, the ES-prior for elementwise mixtures is just the multinomial distribution with mixture weights $\alpha$. ♦

We mentioned above that ES-priors cannot be interpreted as expressions of belief about individual expert sequences; this is a prime example where the ES-prior is crafted such that its posterior $\pi_{\text{mix},\alpha}(\xi_{n+1} \mid \xi^n)$ exactly coincides with the desired mixture of experts.
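As an illustration, the fixed elementwise mixture can be evaluated directly from its definition, mixing the experts' one-step predictions with the same weights at every sample size. A minimal Python sketch, with the same illustrative expert representation as in the earlier sketch (experts as functions from the past to a mass function):

    def elementwise_mixture(experts, alpha, xs):
        """P_mix,alpha(x^n) = prod_i sum_xi alpha(xi) P_xi(x_i | x^{i-1})."""
        p = 1.0
        for i, x in enumerate(xs):
            p *= sum(alpha[name] * expert(xs[:i])[x]
                     for name, expert in experts.items())
        return p

In contrast with the Bayesian mixture, whose mixing weights are the posterior and hence change with the data, the weights alpha here are the same at every position.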

4.2 Expert Tracking using HMMs

We explained in the previous section how expert tracking can be implemented using expert sequence priors. In this section we specify ES-priors using hidden Markov models (HMMs). The advantage of using HMMs is that the complexity of the resulting expert tracking procedure can be read off directly from the structure of the HMM. We first give a short overview of the particular kind of HMMs that we use throughout this chapter. We then show how HMMs can be used to specify ES-priors. As illustrations we render the ES-priors that we obtained for Bayesian mixtures and elementwise mixtures in the previous sections as HMMs. We conclude by giving the forward algorithm for our particular kind of HMMs. In Section 4.3 we provide an overview of ES-priors and their defining HMMs that are found in the literature.

¹These mixtures are sometimes just called mixtures, or predictive mixtures. We use the term elementwise mixtures both for descriptive clarity and to avoid confusion with Bayesian mixtures.

4.2.1 Hidden Markov Models

Overview Hidden Markov models (HMMs) are a well-known tool for specifying probability distributions on sequences with temporal structure. Furthermore, these distributions are very appealing algorithmically: many important probabilities can be computed efficiently for HMMs. These properties make HMMs ideal models of expert sequences: ES-priors. For an introduction to HMMs, see [66]. We require a slightly more general notion that incorporates silent states and forecasting systems, as explained below. We define our HMMs on a generic set of outcomes $\mathcal{O}$ to avoid confusion in later sections, where we use HMMs in two different contexts. First, in Section 4.2.2, we use HMMs to define ES-priors, and instantiate $\mathcal{O}$ with the set of experts $\Xi$. Then in Section 4.2.4 we modify the HMM that defines the ES-prior to incorporate the experts' predictions, whereupon $\mathcal{O}$ is instantiated with the set of observable outcomes $\mathcal{X}$.

Definition 4.2.1. Let $\mathcal{O}$ be a finite set of outcomes. We call a quintuple

$$A = \left\langle Q,\, Q_p,\, P_\circ,\, P,\, \langle P_q \rangle_{q \in Q_p} \right\rangle$$

a hidden Markov model on $\mathcal{O}$ if $Q$ is a countable set, $Q_p \subseteq Q$, $P_\circ \in \triangle(Q)$, $P : Q \to \triangle(Q)$, and $P_q$ is an $\mathcal{O}$-forecasting system for each $q \in Q_p$.

Terminology and Notation We call the elements of $Q$ states. We call the states in $Q_p$ productive and the other states silent. We call $P_\circ$ the initial distribution, let $I$ denote its support (i.e. $I := \{q \in Q \mid P_\circ(q) > 0\}$) and call $I$ the set of initial states. We call $P$ the stochastic transition function. We let $S_q$ denote the support of $P(q)$, and call each $q' \in S_q$ a direct successor of $q$. We abbreviate $P(q)(q')$ to $P(q \to q')$. A finite or infinite sequence of states $q^\lambda \in Q^{\le\infty}$ is called a branch through $A$. A branch $q^\lambda$ is called a run if either $\lambda = 0$ (so $q^\lambda = \epsilon$), or $q_1 \in I$ and $q_{i+1} \in S_{q_i}$ for all $1 \le i < \lambda$. A finite run $q^n \ne \epsilon$ is called a run to $q_n$. For each branch $q^\lambda$, we denote by $q_p^\lambda$ its subsequence of productive states. We denote the elements of $q_p^\lambda$ by $q_1^p$, $q_2^p$, etc. We call an HMM continuous if $q_p^\infty$ is infinite for each infinite run $q^\infty$.

Restriction In this chapter we only work with continuous HMMs. This restric- tion is necessary for the following to be well-defined.

Definition 4.2.2. An HMM $A$ defines the following distribution on sequences of states: $\pi^A(\epsilon) := 1$, and for $\lambda \ge 1$,

$$\pi^A(q^\lambda) := P_\circ(q_1) \prod_{i=1}^{\lambda-1} P(q_i \to q_{i+1}).$$

Then via the PFSs, $A$ induces the joint distribution $P^A$ on runs and sequences of outcomes. Let $o^n \in \mathcal{O}^n$ be a sequence of outcomes and let $q^\lambda \ne \epsilon$ be a run with at least $n$ productive states; then

$$P^A(o^n, q^\lambda) := \pi^A(q^\lambda) \prod_{i=1}^n P_{q_i^p}(o_i \mid o^{i-1}).$$

The value of $P^A$ at arguments $o^n, q^\lambda$ that do not fulfil the condition above is determined by the additivity axiom of probability.

Generative Perspective The corresponding generative viewpoint is the following. Begin by sampling an initial state $q_1$ from the initial distribution $P_\circ$. Then iteratively sample a direct successor $q_{i+1}$ from $P(q_i)$. Whenever a productive state $q_i$ is sampled, say the $n$th, also sample an outcome $o_n$ from the forecasting system $P_{q_i}$ given all previously sampled outcomes $o^{n-1}$.
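The generative perspective translates directly into a sampling procedure. The following Python sketch uses an assumed dictionary encoding of the quintuple of Definition 4.2.1; the field names are our own, not the text's notation.

    import random

    def sample(hmm, n):
        """Sample n outcomes from an HMM given as a dict with keys 'init'
        (state -> prob), 'trans' (state -> {state: prob}), 'productive'
        (set of states) and 'pfs' (productive state -> PFS)."""
        def draw(dist):
            r, acc = random.random(), 0.0
            for item, p in dist.items():
                acc += p
                if r < acc:
                    return item
            return item                    # guard against rounding error

        q = draw(hmm['init'])
        outcomes = []
        while len(outcomes) < n:
            if q in hmm['productive']:     # only productive states emit
                outcomes.append(draw(hmm['pfs'][q](tuple(outcomes))))
            q = draw(hmm['trans'][q])      # silent states just move on
        return outcomes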

The Importance of Silent States Silent states can always be eliminated. Let $q'$ be a silent state and let $R_{q'} := \{q \mid q' \in S_q\}$ be the set of states that have $q'$ as their direct successor. Now by connecting each state $q \in R_{q'}$ to each state $q'' \in S_{q'}$ with transition probability $P(q \to q') P(q' \to q'')$, and removing $q'$, we preserve the induced distribution on $Q^\infty$. Now if $|R_{q'}| = 1$ or $|S_{q'}| = 1$ then $q'$ deserves this treatment. Otherwise, the number of successors has increased, since $|R_{q'}| \cdot |S_{q'}| \ge |R_{q'}| + |S_{q'}|$, and the increase is quadratic in the worst case. Thus, silent states are important to keep our HMMs small: they can be viewed as shared common subexpressions. It is important to keep HMMs small, since the size of an HMM is directly related to the running time of standard algorithms that operate on it. These algorithms are described in the next section.
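The elimination step itself is mechanical. In the small Python sketch below (transitions encoded as a dict of dicts, an assumption of this sketch), the new transitions number $|R_{q'}| \cdot |S_{q'}|$, which makes the trade-off described above visible: elimination only pays when one of these sets is a singleton.

    def eliminate_silent_state(trans, s):
        """Remove silent state s, preserving the distribution on state
        sequences; assumes s has no self-loop (no silent cycles)."""
        preds = [q for q, succ in trans.items() if s in succ and q != s]
        succs = trans[s]
        for q in preds:
            p_in = trans[q].pop(s)
            for q2, p_out in succs.items():
                # accumulate in case the transition q -> q2 already exists
                trans[q][q2] = trans[q].get(q2, 0.0) + p_in * p_out
        del trans[s]
        return trans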

Algorithms There are three classical tasks associated with hidden Markov models [66]. To give the complexity of algorithms for these tasks we need to specify the input size. Here we consider the case where $Q$ is finite; the infinite case will be covered in Section 4.2.5. Let $m := |Q|$ be the number of states and $e := \sum_{q \in Q} |S_q|$ be the number of transitions with nonzero probability. The three tasks are:

1. Computing the marginal probability $P(o^n)$ of the data $o^n$. This task is performed by the forward algorithm. This is a dynamic programming algorithm with time complexity $O(ne)$ and space requirement $O(m)$.

2. MAP estimation: computing a sequence of states $q^\lambda$ with maximal posterior weight $P(q^\lambda \mid o^n)$. Note that $\lambda \ge n$. This task is solved using the Viterbi algorithm, again a dynamic programming algorithm, with time complexity $O(\lambda e)$ and space complexity $O(\lambda m)$.

3. Parameter estimation. Instead of a single probabilistic transition function $P$, one often considers a collection of transition functions $\langle P_\theta \mid \theta \in \Theta \rangle$ indexed by a set of parameters $\Theta$. In this case one often wants to find the parameter $\theta$ for which the HMM using transition function $P_\theta$ achieves highest likelihood $P(o^n \mid \theta)$ of the data $o^n$. This task is solved using the Baum-Welch algorithm. This is an iterative improvement algorithm (in fact an instance of Expectation Maximisation (EM)) built atop the forward algorithm (and a related dynamic programming algorithm called the backward algorithm).

Since we apply HMMs to sequential prediction, we are mainly concerned with Task 1 and occasionally with Task 2. Task 3 is outside the scope of this study.

We note that the forward and backward algorithms actually compute more information than just the marginal probability $P(o^n)$. They compute $P(q_i^p, o^i)$ (forward) and $P(o^n \mid q_i^p, o^i)$ (backward) for each $i = 1, \ldots, n$. The forward algorithm can be computed incrementally, and can thus be used for on-line prediction. Forward and backward can be used together to compute $P(q_i^p \mid o^n)$ for $i = 1, \ldots, n$, a useful tool in data analysis.

Finally, we note that these algorithms are defined e.g. in [66] for HMMs without silent states and with simple distributions on outcomes instead of forecasting systems. All these algorithms can be adapted straightforwardly to our general case. We formulate the forward algorithm for general HMMs in Section 4.2.5 as an example.

4.2.2 HMMs as ES-Priors

In applications HMMs are often used to model data. This is a good idea whenever there are local temporal correlations between outcomes. A graphical depiction of this approach is displayed in Figure 4.2(a).

Here we take a different approach; we use HMMs as ES-priors, that is, to specify temporal correlations between the performance of our experts. Thus instead of concrete observations our HMMs will produce sequences of experts, which are never actually observed. Figure 4.2(b) illustrates this approach. Using HMMs as priors allows us to use the standard algorithms of Section 4.2.1 to answer questions about the prior. For example, we can use the forward algorithm to compute the prior probability of the sequence of one hundred experts that issues the first expert at all odd time-points and the second expert at all even moments. Of course, we are often interested in questions about the data rather than about the prior. In Section 4.2.4 we show how joints based on HMM priors (Figure 4.2(c)) can be transformed into ordinary HMMs (Figure 4.2(a)) with at most a $|\Xi|$-fold increase in size, allowing us to use the standard algorithms of Section 4.2.1 not only for the experts, but for the data as well, with the same increase in complexity. This is the best we can generally hope for, as we now need to integrate over all possible expert sequences instead of considering only a single one. Here we first consider properties of HMMs that represent ES-priors.

Figure 4.2 HMMs. $q_i^p$, $\xi_i$ and $x_i$ are the $i$th productive state, expert and observation. [Diagram: (a) standard use of HMM; (b) HMM ES-prior; (c) application to data.]

Restriction HMM priors "generate", or define the distribution on, sequences of experts. But contrary to the data, which are observed, no concrete sequence of experts is realised. This means that we cannot condition the distribution on experts in a productive state $q_n^p$ on the sequence of previously produced experts $\xi^{n-1}$. In other words, we can only use an HMM on $\Xi$ as an ES-prior if the forecasting systems in its states are simply distributions, so that all dependencies between consecutive experts are carried by the state. This is necessary to avoid having to sum over all (exponentially many) possible expert sequences.

Deterministic Under the restriction above, but in the presence of silent states, we can make any HMM deterministic in the sense that each forecasting system assigns probability one to a single outcome. We just replace each productive state $q \in Q_p$ by the following gadget: [Diagram: on the left, the single productive state $q$; on the right, a silent state with an arrow to one productive state per outcome $a, \ldots, e$, each deterministically emitting that outcome.]

In the left diagram, the state $q$ has distribution $P_q$ on outcomes $\mathcal{O} = \{a, \ldots, e\}$. In the right diagram, the leftmost silent state has transition probability $P_q(o)$ to a state that deterministically outputs outcome $o$. We often make the functional relationship explicit and call $\langle Q, Q_p, P_\circ, P, \Lambda \rangle$ a deterministic HMM on $\mathcal{O}$ if $\Lambda : Q_p \to \mathcal{O}$. Here we slightly abuse notation; the last component of a (general) HMM assigns a PFS to each productive state, while the last component of a deterministic HMM assigns an outcome to each productive state.

Figure 4.3 Combination of four experts using a standard Bayesian mixture. [Diagram: four parallel chains of productive states $\langle a, 1 \rangle, \langle a, 2 \rangle, \ldots$, one chain per expert $a, b, c, d$, each started from the initial distribution.]

Sequential prediction using a general HMM or its deterministic counterpart costs the same amount of work: the $|\mathcal{O}|$-fold increase in the number of states is compensated by the $|\mathcal{O}|$-fold reduction in the number of outcomes that need to be considered per state.

Diagrams Deterministic HMMs can be graphically represented by pictures. In general, we draw a node $N_q$ for each state $q$. We draw a small black dot for a silent state, and an ellipse labelled $\Lambda(q)$ for a productive state. We draw an arrow from $N_q$ to $N_{q'}$ if $q'$ is a direct successor of $q$. We often reify the initial distribution $P_\circ$ by including a virtual node, drawn as an open circle, with an outgoing arrow to $N_q$ for each initial state $q \in I$. The transition probability $P(q \to q')$ is not displayed in the graph.

4.2.3 Examples

We are now ready to give the deterministic HMMs that correspond to the ES-priors of our earlier examples from Section 4.1.1: Bayesian mixtures and elementwise mixtures with fixed parameters.

Example 11 (HMM for Bayesian Mixtures). The Bayesian mixture ES-prior $\pi_\text{bayes}$ as introduced in Example 9 represents the hypothesis that a single expert predicts best for all sample sizes. A simple deterministic HMM that generates the prior $\pi_\text{bayes}$ is given by $A_\text{bayes} = \langle Q, Q_p, P_\circ, P, \Lambda \rangle$ on $\Xi$, where

$$Q = Q_p = \Xi \times \mathbb{Z}^+ \qquad P\left(\langle \xi, n \rangle \to \langle \xi, n+1 \rangle\right) = 1 \tag{4.6a}$$
$$\Lambda(\xi, n) = \xi \qquad P_\circ(\xi, 1) = w(\xi) \tag{4.6b}$$

The diagram of (4.6) is displayed in Figure 4.3. From the picture of the HMM it is clear that it computes the Bayesian mixture. Hence, using (4.4), the loss of the HMM with prior $w$ is bounded for all $x^n$ by

$$-\log P_{A_\text{bayes}}(x^n) + \log P_\xi(x^n) \;\le\; -\log w(\xi) \quad \text{for all experts } \xi. \tag{4.7}$$

Figure 4.4 Combination of four experts using a fixed elementwise mixture. [Diagram: silent states $\langle p, 0 \rangle, \langle p, 1 \rangle, \ldots$ alternate with productive states $\langle \xi, n \rangle$ for the experts $\xi \in \{a, b, c, d\}$; after each outcome the process returns to a silent state and selects the next expert with probability $\alpha(\xi)$.]

In particular this bound holds for $\hat\xi = \arg\max_\xi P_\xi(x^n)$, so we predict as well as the single best expert with constant overhead. Also $P_{A_\text{bayes}}(x^n)$ can obviously be computed in $O(n|\Xi|)$ using its definition (4.4). We show in Section 4.2.5 that computing it using the HMM prior above gives the same running time $O(n|\Xi|)$, a perfect match. ♦

Example 12 (HMM for Elementwise Mixtures). We now present the deterministic HMM $A_{\text{mix},\alpha}$ that implements the ES-prior $\pi_{\text{mix},\alpha}$ of Example 10. Its diagram is displayed in Figure 4.4. The HMM has a single silent state per outcome, and its transition probabilities are the mixture weights $\alpha$. Formally, $A_{\text{mix},\alpha}$ is given using $Q = Q_s \cup Q_p$ by

$$Q_s = \{p\} \times \mathbb{N} \qquad Q_p = \Xi \times \mathbb{Z}^+ \qquad P_\circ(p, 0) = 1 \qquad \Lambda(\xi, n) = \xi \tag{4.8a}$$
$$P\!\begin{pmatrix} \langle p, n \rangle \to \langle \xi, n+1 \rangle \\ \langle \xi, n \rangle \to \langle p, n \rangle \end{pmatrix} = \begin{pmatrix} \alpha(\xi) \\ 1 \end{pmatrix} \tag{4.8b}$$

The vector-style definition of $P$ is shorthand for one $P$ per line. We show in Section 4.2.5 that this HMM allows us to compute $P_{A_{\text{mix},\alpha}}(x^n)$ in time $O(n|\Xi|)$. ♦

4.2.4 The HMM for Data

We obtain our model for the data (Figure 4.2(c)) by composing an HMM prior on $\Xi^\infty$ with a PFS $P_\xi$ for each expert $\xi \in \Xi$. We now show that the resulting marginal distribution on data can be implemented by a single HMM on $\mathcal{X}$ (Figure 4.2(a)) with the same number of states as the HMM prior. Let $P_\xi$ be an $\mathcal{X}$-forecasting system for each $\xi \in \Xi$, and let the ES-prior $\pi_A$ be given by the deterministic HMM $A = \langle Q, Q_p, P_\circ, P, \Lambda \rangle$ on $\Xi$. Then the marginal distribution of the data (see (4.1)) is given by

$$P_A(x^n) = \sum_{\xi^n} \pi_A(\xi^n) \prod_{i=1}^n P_{\xi_i}(x_i \mid x^{i-1}).$$

The HMM $X := \langle Q, Q_p, P_\circ, P, \langle P_{\Lambda(q)} \rangle_{q \in Q_p} \rangle$ on $\mathcal{X}$ induces the same marginal distribution (see Definition 4.2.2). That is, $P_X(x^n) = P_A(x^n)$. Moreover, $X$ contains only the forecasting systems that also exist in $A$, and it retains the structure of $A$. In particular this means that the HMM algorithms of Section 4.2.1 have the same running time on the prior $A$ as on the marginal $X$.

4.2.5 The Forward Algorithm and Sequential Prediction

We claimed in Section 4.2.1 that the standard HMM algorithms could easily be extended to our HMMs with silent states and forecasting systems. In this section we give the main example: the forward algorithm. We will also show how it can be applied to sequential prediction. Recall that the forward algorithm computes the marginal probability $P(x^n)$ for fixed $x^n$. On the other hand, sequential prediction means predicting the next observation $x_{n+1}$ for given data $x^n$, i.e. computing its distribution. For this it suffices to predict the next expert $\xi_{n+1}$; we then simply predict $x_{n+1}$ by averaging the experts' predictions accordingly: $P(x_{n+1} \mid x^n) = E\left[P_{\xi_{n+1}}(x_{n+1} \mid x^n)\right]$.

We first describe the preprocessing step called unfolding and introduce notation for nodes. We then give the forward algorithm, prove its correctness and analyse its running time and space requirement. The forward algorithm can be used for prediction with expert advice. We conclude by outlining the difficulty of adapting the Viterbi algorithm for MAP estimation to the expert setting.

Unfolding Every HMM can be transformed into an equivalent HMM in which each productive state is involved in the production of a unique outcome. The single node in Figure 4.5(a) is involved in the production of $x_1, x_2, \ldots$; in its unfolding, Figure 4.5(b), the $i$th node is only involved in producing $x_i$. Figures 4.5(c) and 4.5(d) show HMMs that unfold to the Bayesian mixture shown in Figure 4.3 and the elementwise mixture shown in Figure 4.4. In full generality, fix an HMM $A$. The unfolding of $A$ is the HMM

$$A^u := \left\langle Q^u,\, Q_p^u,\, P_\circ^u,\, P^u,\, \langle P_q^u \rangle_{q \in Q_p^u} \right\rangle,$$

where the states and productive states are given by:

$$Q^u := \left\{ \langle q_\lambda, n \rangle \mid q^\lambda \text{ is a run through } A \right\}, \text{ where } n = |q_p^\lambda| \tag{4.9a}$$
$$Q_p^u := Q^u \cap (Q_p \times \mathbb{N}) \tag{4.9b}$$

and the initial probability, transition function and forecasting systems are:

$$P_\circ^u(\langle q, 0 \rangle) := P_\circ(q) \tag{4.9c}$$
$$P^u\!\begin{pmatrix} \langle q, n \rangle \to \langle q', n+1 \rangle \\ \langle q, n \rangle \to \langle q', n \rangle \end{pmatrix} := \begin{pmatrix} P(q \to q') \\ P(q \to q') \end{pmatrix} \tag{4.9d}$$
$$P^u_{\langle q, n \rangle} := P_q \tag{4.9e}$$

Figure 4.5 Unfolding example. [Diagram: (a) prior to unfolding; (b) after unfolding; (c) Bayesian mixture; (d) elementwise mixture.]

First observe that unfolding preserves the marginal: $P_A(o^n) = P_{A^u}(o^n)$. Second, unfolding is an idempotent operation: $(A^u)^u$ is isomorphic to $A^u$. Third, unfolding renders the set of states infinite, but for each $n$ it preserves the number of states reachable in exactly $n$ steps.

Order The states in an unfolded HMM have an earlier-later structure. Fix $q, q' \in Q^u$. We write $q < q'$ iff there is a run in which $q$ occurs before $q'$.

Interval Notation We introduce interval notation to address subsets of states of unfolded HMMs, as illustrated by Figure 4.6. Our notation associates each productive state with the sample size at which it produces its outcome, while the silent states fall in between. We use intervals with borders in $\mathbb{N}$. The interval contains the border $i \in \mathbb{N}$ iff the addressed set of states includes the states where the $i$th observation is produced.

$$Q^u_{[n,m)} := Q^u \cap (Q \times [n, m)) \qquad Q^u_{[n,m]} := Q^u_{[n,m)} \cup Q^u_{\{m\}} \tag{4.10a}$$
$$Q^u_{\{n\}} := Q^u \cap (Q_p \times \{n\}) \qquad Q^u_{(n,m)} := Q^u_{[n,m)} \setminus Q^u_{\{n\}} \tag{4.10b}$$
$$Q^u_{(n,m]} := Q^u_{[n,m]} \setminus Q^u_{\{n\}} \tag{4.10c}$$

Fix $n > 0$; then $Q^u_{\{n\}}$ is a non-empty $<$-anti-chain (i.e. its states are pairwise $<$-incomparable). Furthermore $Q^u_{(n,n+1)}$ is empty iff $Q^u_{\{n+1\}} = \bigcup_{q \in Q^u_{\{n\}}} S_q$, in other words, if there are no silent states between sample sizes $n$ and $n+1$.

The Forward Algorithm Fix an unfolded deterministic HMM prior $A = \langle Q, Q_p, P_\circ, P, \Lambda \rangle$ on $\Xi$, and an $\mathcal{X}$-PFS $P_\xi$ for each expert $\xi \in \Xi$. The input consists of a sequence $x^\infty$ that arrives sequentially. Then the forward algorithm for sequential prediction on models with silent states can be rendered as follows.
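As one concrete rendering, the Python sketch below implements the two steps analysed further on, Forward Propagation and Loss Update, under an assumed dictionary encoding of the unfolded HMM (finite, explicitly tabulated up to the required sample size). For simplicity it resolves silent states by repeated scanning; since the silent states of an unfolded HMM form a directed acyclic graph this terminates, and an efficient implementation would process them in topological order instead.

    def forward(hmm, experts, xs):
        """Yield P(x^1), P(x^2), ... online. hmm is a dict with keys
        'init' (state -> prob), 'trans' (state -> {state: prob}),
        'productive' (set of states), 'label' (productive state -> expert)."""
        w = dict(hmm['init'])                   # weights of pending states
        for n, x in enumerate(xs):
            # Forward Propagation: pop silent states, pushing their weight
            # onto their direct successors, until only productive remain.
            while any(q not in hmm['productive'] for q in w):
                q = next(q for q in w if q not in hmm['productive'])
                wq = w.pop(q)
                for q2, p in hmm['trans'][q].items():
                    w[q2] = w.get(q2, 0.0) + wq * p
            # Loss Update: multiply in the experts' predictions of x.
            for q in w:
                w[q] *= experts[hmm['label'][q]](xs[:n])[x]
            yield sum(w.values())               # equals P(x^{n+1})
            # move the productive weights across their transitions
            old, w = w, {}
            for q, wq in old.items():
                for q2, p in hmm['trans'][q].items():
                    w[q2] = w.get(q2, 0.0) + wq * p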

Figure 4.6 Interval notation. [Diagram of an unfolded two-expert HMM with the state sets (e) $Q_{\{1\}}$, (f) $Q_{(1,2]}$ and (g) $Q_{(0,2)}$ highlighted.]

Analysis Consider a state $q \in Q$, say $q \in Q_{[n,n+1)}$. Initially, $q \notin \operatorname{dom}(w)$. Then at some point $w(q) \leftarrow P_\circ(q)$. This happens either in the second line because $q \in I$, or in Forward Propagation because $q \in S_u$ for some $u$ (in this case $P_\circ(q) = 0$). Then $w(q)$ accumulates weight as its direct predecessors are processed in Forward Propagation. At some point all its predecessors have been processed. If $q$ is productive we call its weight at this point (that is, just before Loss Update) $\operatorname{Alg}(A, x^{n-1}, q)$. Finally, Forward Propagation removes $q$ from the domain of $w$, never to be considered again. We call the weight of $q$ (silent or productive) just before removal $\operatorname{Alg}(A, x^n, q)$. Note that we associate two weights with each productive state $q \in Q_{\{n\}}$: the weight $\operatorname{Alg}(A, x^{n-1}, q)$ is calculated before outcome $n$ is observed, while on the other hand $\operatorname{Alg}(A, x^n, q)$ denotes the weight after the loss update incorporates outcome $n$.

Theorem 4.2.3. Fix an HMM prior $A$, $n \in \mathbb{N}$ and $q \in Q_{[n,n+1]}$; then

$$\operatorname{Alg}(A, x^n, q) = P_A(x^n, q).$$

Note that the theorem applies twice to productive states.

Proof. By $<$-induction on states. Let $q \in Q_{(n,n+1]}$, and suppose that the theorem holds for all $q' < q$. Let $B_q := \{q' \in Q \mid q \in S_{q'},\, P(q' \to q) > 0\}$ be the set of direct predecessors of $q$. Observe that $B_q \subseteq Q_{[n,n+1)}$. The weight that is accumulated by Forward Propagation($n$) onto $q$ is:

$$\operatorname{Alg}(A, x^n, q) = P_\circ(q) + \sum_{q' \in B_q} P(q' \to q)\, \operatorname{Alg}(A, x^n, q') = P_\circ(q) + \sum_{q' \in B_q} P(q' \to q)\, P_A(x^n, q') = P_A(x^n, q).$$

The second equality follows from the induction hypothesis. Additionally, if $q \in Q_{\{n\}}$ is productive, say $\Lambda(q) = \xi$, then after Loss Update($n$) its weight is:

$$\operatorname{Alg}(A, x^n, q) = P_\xi(x_n \mid x^{n-1})\, \operatorname{Alg}(A, x^{n-1}, q) = P_\xi(x_n \mid x^{n-1})\, P_A(x^{n-1}, q) = P_A(x^n, q).$$

The second equality holds by induction on $n$, and the third by Definition 4.2.2.

Complexity We are now able to sharpen the complexity results as listed in Section 4.2.1, and extend them to infinite HMMs. Fix $A$ and $n \in \mathbb{N}$. The forward algorithm processes each state in $Q_{[0,n)}$ once, and at that point this state's weight is distributed over its successors. Thus, the running time is proportional to $\sum_{q \in Q_{[0,n)}} |S_q|$. The forward algorithm keeps $|\operatorname{dom}(w)|$ many weights, but at each sample size $n$ we have $\operatorname{dom}(w) \subseteq Q_{[n,n+1]}$. Therefore the space needed is at most proportional to $\max_{m < n} |Q_{[m,m+1]}|$.

MAP Estimation The forward algorithm described above computes the probability of the data, that is

$$P(x^n) = \sum_{q^\lambda : q_\lambda \in Q_{\{n\}}} P(x^n, q^\lambda).$$

Instead of the entire sum, we are sometimes interested in the sequence of states $q^\lambda$ that contributes most to it:

$$\arg\max_{q^\lambda} P(x^n, q^\lambda) = \arg\max_{q^\lambda} P(x^n \mid q^\lambda)\, \pi(q^\lambda).$$

The Viterbi algorithm [66] is used to compute the most likely sequence of states for HMMs. It can be easily adapted to handle silent states. However, we may also write

$$P(x^n) = \sum_{\xi^n} P(x^n, \xi^n),$$

and wonder about the sequence of experts $\xi^n$ that contributes most. This problem is harder because in general, a single sequence of experts can be generated by many different sequences of states. This is unrelated to the presence of the silent states, but due to different states producing the same expert simultaneously (i.e. in the same $Q_{\{n\}}$). So we cannot use the Viterbi algorithm as it is. The Viterbi algorithm can be extended to compute the MAP expert sequence for general HMMs, but the resulting running time explodes. Still, the MAP $\xi^n$ can sometimes be obtained efficiently by exploiting the structure of the HMM at hand. The first example is the unambiguous HMMs. A deterministic HMM is ambiguous if it has two runs that agree on the sequence of experts produced, but not on the sequence of productive states. The straightforward extension of the Viterbi algorithm works for unambiguous HMMs. The second important example is the (ambiguous) switch HMM that we introduce in Section 4.4.1. We show how to compute its MAP expert sequence in Section 4.4.1.

4.3 Zoology

Perhaps the simplest way to predict using a number of experts is to pick one of them and mirror her predictions exactly. Beyond this "fixed expert model", we have considered two methods of combining experts so far, namely taking Bayesian mixtures, and taking elementwise mixtures as described in Section 4.2.3. Figure 4.1 shows these and a number of other, more sophisticated methods that fit in our framework. The arrows indicate which methods are generalised by which other methods. They have been partitioned into groups that can be computed in the same amount of time using HMMs. We have presented two examples so far, the Bayesian mixture and the elementwise mixture with fixed coefficients (Examples 11 and 12). The latter model is parameterised. Choosing a fixed value for the parameter beforehand is often difficult. The first model we discuss learns the optimal parameter value on-line, at the cost of only a small additional loss. We then proceed to discuss a number of important existing expert models.

4.3.1 Universal Elementwise Mixtures

A distribution is "universal" for a family of distributions if it incurs small additional loss compared to the best member of the family. A standard Bayesian mixture constitutes the simplest example. It is universal for the fixed expert model, where the unknown parameter is the used expert. In (4.7) we showed that the additional loss is at most $\log |\Xi|$ for the uniform prior.

In Example 12 we described elementwise mixtures with fixed coefficients as ES-priors. Prior knowledge about the mixture coefficients is often unavailable. We now expand this model to learn the optimal mixture coefficients from the data. To this end we place a prior distribution $w$ on the space of mixture weights $\triangle(\Xi)$. Using (4.5) we obtain the following marginal distribution:

$$P_\text{umix}(x^n) = \int_{\triangle(\Xi)} P_{\text{mix},\alpha}(x^n)\, w(\alpha)\, d\alpha = \int_{\triangle(\Xi)} \sum_{\xi^n} P_{\xi^n}(x^n)\, \pi_{\text{mix},\alpha}(\xi^n)\, w(\alpha)\, d\alpha$$
$$= \sum_{\xi^n} P_{\xi^n}(x^n)\, \pi_\text{umix}(\xi^n), \quad \text{where } \pi_\text{umix}(\xi^n) = \int_{\triangle(\Xi)} \pi_{\text{mix},\alpha}(\xi^n)\, w(\alpha)\, d\alpha. \tag{4.11}$$

Thus $P_\text{umix}$ is the ES-joint with ES-prior $\pi_\text{umix}$. This applies more generally: parameters $\alpha$ can be integrated out of an ES-prior regardless of which experts are used, since the expert predictions $P_{\xi^n}(x^n)$ do not depend on $\alpha$. We will proceed to calculate a loss bound for the universal elementwise mixture model, showing that it really is universal. After that we will describe how it can be implemented as an HMM.

A Loss Bound In this section we relate the loss of a universal elementwise mixture to the loss obtained by the maximum likelihood elementwise mixture. While mixture models occur regularly in the statistical literature, we are not aware of any appearance in universal prediction. Therefore, to the best of our knowledge, the following simple loss bound is new. Our goal is to obtain a bound in terms of properties of the prior. A difficulty here is that there are many expert sequences exhibiting mixture frequencies close to the maximum likelihood mixture weights, so that each individual expert sequence contributes relatively little to the total probability (4.11). The following theorem is a general tool to deal with such situations.

Theorem 4.3.1. Let $\pi$, $\rho$ be ES-priors such that $\rho$ is zero whenever $\pi$ is. Then for all $x^n$, reading $0/0 = 0$,

$$-\log \frac{P_\pi(x^n)}{P_\rho(x^n)} \;\le\; E_{P_\rho}\!\left[ -\log \frac{\pi(\xi^n)}{\rho(\xi^n)} \;\middle|\; x^n \right] \;\le\; \max_{\xi^n} \, -\log \frac{\pi(\xi^n)}{\rho(\xi^n)}.$$

Proof. For non-negative $a_1, \ldots, a_m$ and $b_1, \ldots, b_m$:

$$\left( \sum_{i=1}^m a_i \right) \log \frac{\sum_{i=1}^m a_i}{\sum_{i=1}^m b_i} \;\le\; \sum_{i=1}^m a_i \log \frac{a_i}{b_i} \;\le\; \left( \sum_{i=1}^m a_i \right) \max_i \log \frac{a_i}{b_i}. \tag{4.12}$$

The first inequality is the log sum inequality [25, Theorem 2.7.1]. The second inequality is a simple overestimation. We now apply (4.12), substituting $m \mapsto |\Xi^n|$, $a_{\xi^n} \mapsto P_\rho(x^n, \xi^n)$ and $b_{\xi^n} \mapsto P_\pi(x^n, \xi^n)$, and divide by $\sum_{i=1}^m a_i$ to complete the proof.

Corollary 4.3.2. Let Pumix be the universal elementwise mixture model defined 1 1 using the ( 2 ,..., 2 )-Dirichlet prior (that is, Jeffreys’ prior) as the prior w(α) in n n n (4.11). Let αˆ(x ) maximise the likelihood Pmix,α(x ) w.r.t. α. Then for all x the additional loss incurred by the universal elementwise mixture is bounded thus

n n Ξ 1 n log P (x ) + log P n (x ) | | − log + c − umix mix,αˆ(x ) ≤ 2 π for a fixed constant c. Proof. By Theorem 4.3.1

n n log P (x ) + log P n (x ) − umix mix,αˆ(x ) ≤ n n max log πumix(ξ ) + log πmix,αˆ(xn)(ξ ) . (4.13) ξn −  94 Chapter 4. Interlude: Prediction with Expert Advice

n n We now bound the right hand side. Letα ˆ(ξ ) maximise πmix,α(ξ ) w.r.t. α. Then for all xn and ξn n n π n (ξ ) π n (ξ ). (4.14) mix,αˆ(x ) ≤ mix,αˆ(ξ ) 1 1 n For the 2 ,..., 2 -Dirichlet prior, for all ξ

 n n Ξ 1 n log π (ξ ) + log π n (ξ ) | | − log + c − umix mix,αˆ(ξ ) ≤ 2 π for some fixed constant c (see e.g. [100]) Combination with (4.14) and (4.13) completes the proof. Since the overhead incurred as a penalty for not knowing the optimal parameter αˆ in advance is only logarithmic, we find that Pumix is strongly universal for the fixed elementwise mixtures.

HMM

While universal elementwise mixtures can be described using the ES-prior $\pi_\text{umix}$ defined in (4.11), unfortunately any HMM that computes it needs a state for each possible count vector, and is therefore huge if the number of experts is large. The HMM $A_\text{umix}$ for an arbitrary number of experts using the $(\frac12, \ldots, \frac12)$-Dirichlet prior is given using $Q = Q_s \cup Q_p$ by

$$Q_s = \mathbb{N}^\Xi \qquad Q_p = \mathbb{N}^\Xi \times \Xi \qquad P_\circ(\vec 0) = 1 \qquad \Lambda(\vec n, \xi) = \xi \tag{4.15}$$

$$P\!\begin{pmatrix} \vec n \to \langle \vec n, \xi \rangle \\ \langle \vec n, \xi \rangle \to \vec n + 1_\xi \end{pmatrix} = \begin{pmatrix} \dfrac{1/2 + n_\xi}{|\Xi|/2 + \sum_\xi n_\xi} \\ 1 \end{pmatrix} \tag{4.16}$$

We write $\mathbb{N}^\Xi$ for the set of assignments of counts to experts; $\vec 0$ for the all-zero assignment, and $1_\xi$ marks one count for expert $\xi$. We show the diagram of $A_\text{umix}$ for the practical limit of two experts in Figure 4.7. In this case, the forward algorithm has running time $O(n^2)$. Each productive state in Figure 4.7 corresponds to a vector of two counts $(n_1, n_2)$ that sum to the sample size $n$, with the interpretation that of the $n$ experts, the first was used $n_1$ times while the second was used $n_2$ times. These counts are a sufficient statistic for the multinomial model: per (4.5b) and (4.11) the probability of the next expert only depends on the counts, and these probabilities are exactly the successor probabilities of the silent states (4.16). Other priors on $\alpha$ are possible. In particular, when all mass is placed on a single value of $\alpha$, we retrieve the elementwise mixture with fixed coefficients.
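For two experts the construction reduces to a dynamic program over count vectors. The following Python sketch (same illustrative expert encoding as in the earlier sketches) computes $P_\text{umix}(x^n)$ with the Krichevsky-Trofimov successor probabilities that appear as the silent-state transition probabilities in (4.16); the quadratic number of count vectors yields the $O(n^2)$ running time.

    def umix_two(e1, e2, xs):
        """P_umix(x^n) for two experts under the (1/2, 1/2)-Dirichlet prior."""
        w = {(0, 0): 1.0}                           # weight per count vector
        for i, x in enumerate(xs):
            nxt = {}
            for (n1, n2), wt in w.items():
                p1 = (n1 + 0.5) / (n1 + n2 + 1.0)   # KT successor probability
                past = xs[:i]
                nxt[(n1 + 1, n2)] = nxt.get((n1 + 1, n2), 0.0) \
                    + wt * p1 * e1(past)[x]
                nxt[(n1, n2 + 1)] = nxt.get((n1, n2 + 1), 0.0) \
                    + wt * (1 - p1) * e2(past)[x]
            w = nxt
        return sum(w.values())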

4.3.2 Fixed Share

The first publication that considers a scenario where the best predicting expert may change with the sample size is Herbster and Warmuth's paper on tracking the best expert [45, 46].

Figure 4.7 Combination of two experts using a universal elementwise mixture. [Diagram: silent states are the count vectors $\langle n_1, n_2 \rangle$, arranged in a triangular grid; each silent state has a transition to a productive state per expert, which in turn increments the corresponding count.]

They partition the data of size $n$ into $m$ segments, where each segment is associated with an expert, and give algorithms to predict almost as well as the best partition where the best expert is selected per segment. They give two algorithms called fixed share and dynamic share. The second algorithm does not fit in our framework; furthermore its motivation applies only to loss functions other than log-loss. We focus on fixed share, which is in fact identical to our algorithm applied to the HMM depicted in Figure 4.8, where all arcs into the silent states have fixed probability $\alpha \in [0, 1]$ and all arcs from the silent states have some fixed distribution $w$ on $\Xi$.² The same algorithm is also described as an instance of the Aggregating Algorithm in [95]. Fixed share reduces to fixed elementwise mixtures by setting $\alpha = 1$ and to Bayesian mixtures by setting $\alpha = 0$. Formally:

$$Q = \Xi \times \mathbb{Z}^+ \,\cup\, \{p\} \times \mathbb{N} \qquad P_\circ(p, 0) = 1 \qquad Q_p = \Xi \times \mathbb{Z}^+ \qquad \Lambda(\xi, n) = \xi \tag{4.17a}$$
$$P\!\begin{pmatrix} \langle p, n \rangle \to \langle \xi, n+1 \rangle \\ \langle \xi, n \rangle \to \langle p, n \rangle \\ \langle \xi, n \rangle \to \langle \xi, n+1 \rangle \end{pmatrix} = \begin{pmatrix} w(\xi) \\ \alpha \\ 1 - \alpha \end{pmatrix} \tag{4.17b}$$

²This is actually a slight generalisation: the original algorithm uses a uniform $w(\xi) = 1/|\Xi|$.

Figure 4.8 Combination of four experts using the fixed share algorithm. [Diagram: expert chains $\langle \xi, n \rangle \to \langle \xi, n+1 \rangle$ with probability $1-\alpha$; each $\langle \xi, n \rangle$ leads to the silent state $\langle p, n \rangle$ with probability $\alpha$, from which the next expert is drawn according to $w$.]

d d d d sample size. Once a transition to a silent state is made, all history is forgotten and a new expert is chosen according to w.3 Let Lˆ denote the loss achieved by the best partition, with switching rate ∗ α := m/(n 1). Let Lfs,α denote the loss of fixed share with uniform w and parameter α.− Herbster and Warmuth prove4 L Lˆ (n 1)H(α∗, α)+(m 1) log( Ξ 1) + log Ξ , fs,α − ≤ − − | | − | | which we for brevity loosen slightly to L Lˆ nH(α∗, α)+ m log Ξ . (4.18) fs,α − ≤ | | Here H(α∗, α)= α∗ log α (1 α∗) log(1 α) is the cross entropy. The best loss guarantee is obtained− for α−= α−∗, in which− case the cross entropy reduces to the binary entropy H(α). A drawback of the method is that the optimal value of α has to be known in advance in order to minimise the loss. In Sections Section 4.3.3 and Section 4.4 we describe a number of generalisations of fixed share that avoid this problem.

4.3.3 Universal Share

Volf and Willems describe universal share (they call it the "switching method") [94], which is very similar to a probabilistic version of Herbster and Warmuth's fixed share algorithm, except that they put a prior on the unknown parameter, with the result that their algorithm adaptively learns the optimal value during prediction. In [15], Bousquet shows that the overhead for not knowing the optimal parameter value is equal to the overhead of a Bernoulli universal distribution. Let $L_{\text{fs},\alpha} = -\log P_{\text{fs},\alpha}(x^n)$ denote the loss achieved by the fixed share algorithm with parameter $\alpha$ on data $x^n$, and let $L_\text{us} = -\log P_\text{us}(x^n)$ denote the

loss of universal share, where $P_\text{us}(x^n) = \int P_{\text{fs},\alpha}(x^n)\, w(\alpha)\, d\alpha$ with Jeffreys' prior $w(\alpha) = \alpha^{-1/2}(1-\alpha)^{-1/2}/\pi$ on $[0, 1]$. Then

Thus Pus is universal for the model Pfs,α α [0, 1] that consists of all ES-joints where the ES-priors are distributions with| a∈ fixed switching rate.  Universal share requires quadratic running time O(n2 Ξ ), restricting its use to moderately small data sets. | | In [63], Monteleoni and Jaakkola place a discrete prior on the parameter that divides its mass over √n well-chosen points, in a setting where the ultimate sample size n is known beforehand. This way they still manage to achieve (4.19) up to a constant, while reducing the running time to O(n√n Ξ ). In [16], Bousquet and Warmuth describe yet another generalisation of expert| | tracking; they derive good loss bounds in the situation where the best experts for each section in the partition are drawn from a small pool. 1 1 The HMM for universal share with the 2 , 2 -Dirichlet prior on the switching rate α is displayed in Figure 4.9. It is formally specified (using Q = Q Q ) by:  s ∪ p Q = p, q m, n N2 m n s { } × h i∈ | ≤ (4.20a) Q = Ξ m, n N2 m < n p × h i∈ | Λ(ξ,m,n)= ξ P◦(p, 0, 0) = 1 (4.20b) p,m,n ξ,m,n + 1 w(ξ) h i→h i  q,m,n p,m + 1, n   1  P h i→h i = (4.20c) ξ,m,n q,m,n (m + 1 ) n  h i→h i   2     1   ξ,m,n ξ,m,n + 1   (n m ) n  h i→h i  − − 2      Each productive state ξ,n,m represents the fact that at sample size n expert ξ is used, while there haveh beeni m switches in the past. Note that the last two lines of (4.20c) are subtly different from the corresponding topmost line of (4.16). In a sample of size n there are n possible positions to use a given expert, while there are only n 1 possible switch positions. The presence− of the switch count in the state is the new ingredient compared to fixed share. It allows us to adapt the switching probability to the data, but it also renders the number of states quadratic. We discuss reducing the number of states without sacrificing much performance in Section 4.5.1.

4.3.4 Overconfident Experts

In [95], Vovk considers overconfident experts. In this scenario, there is a single unknown best expert, except that this expert sometimes makes wild (overly categorical) predictions. We assume that the rate at which this happens is a known constant $\alpha$.

Figure 4.9 Combination of four experts using universal share. [Diagram: productive states $\langle \xi, m, n \rangle$ arranged by sample size $n$ and switch count $m$, with silent states $\langle q, m, n \rangle$ and $\langle p, m+1, n \rangle$ implementing a switch to the next layer.]

Figure 4.10 Combination of four overconfident experts. [Diagram: for each expert $\xi \in \{a, b, c, d\}$, a chain of silent states $\langle \xi, n \rangle$, each followed by either a normal productive state emitting $\xi$ or a wild productive state emitting the uniform expert $u$.]

The overconfident expert model is an attempt to mitigate the wild predictions using an additional "safe" expert $u \in \Xi$, who always issues the uniform distribution on $\mathcal{X}$ (which we assume to be finite for simplicity here). Using $Q = Q_s \cup Q_p$, it is formally specified by:

$$Q_s = \Xi \times \mathbb{N} \qquad Q_p = \{\mathrm{n}, \mathrm{w}\} \times \Xi \times \mathbb{Z}^+ \qquad P_\circ(\xi, 0) = w(\xi)$$
$$\Lambda(\mathrm{n}, \xi, n) = \xi \qquad \Lambda(\mathrm{w}, \xi, n) = u \tag{4.21a}$$
$$P\!\begin{pmatrix} \langle \xi, n \rangle \to \langle \mathrm{n}, \xi, n+1 \rangle \\ \langle \xi, n \rangle \to \langle \mathrm{w}, \xi, n+1 \rangle \\ \langle \mathrm{n}, \xi, n \rangle \to \langle \xi, n \rangle \\ \langle \mathrm{w}, \xi, n \rangle \to \langle \xi, n \rangle \end{pmatrix} = \begin{pmatrix} 1 - \alpha \\ \alpha \\ 1 \\ 1 \end{pmatrix} \tag{4.21b}$$

Each productive state corresponds to the idea that a certain expert is best, and additionally whether the current outcome is normal or wild. Fix data $x^n$. Let $\hat\xi^n$ be the expert sequence that maximises the likelihood $P_{\xi^n}(x^n)$ among all expert sequences $\xi^n$ that switch between a single expert and $u$. To derive our loss bound, we underestimate the marginal probability $P_{\text{oce},\alpha}(x^n)$ for the HMM defined above, by dropping all terms except the one for $\hat\xi^n$:

$$P_{\text{oce},\alpha}(x^n) = \sum_{\xi^n \in \Xi^n} \pi_{\text{oce},\alpha}(\xi^n)\, P_{\xi^n}(x^n) \;\ge\; \pi_{\text{oce},\alpha}(\hat\xi^n)\, P_{\hat\xi^n}(x^n). \tag{4.22}$$

(This first step is also used in the bounds for the two new models in Section 4.4.) Let $\alpha^*$ denote the frequency of occurrence of $u$ in $\hat\xi^n$, let $\xi_\text{best}$ be the other expert

that occurs in ξ̂^n, and let L̂ = −log P_{ξ̂^n}(x^n). We can now bound our worst-case additional loss:

    −log P_oce,α(x^n) − L̂ ≤ −log π_oce,α(ξ̂^n) = −log w(ξ_best) + n H(α*, α).

Again H denotes the cross entropy. From a coding perspective, after first specifying the best expert ξ_best and a binary sequence representing ξ̂^n, we can then use ξ̂^n to encode the actual observations with optimal efficiency.

The optimal misprediction rate α is usually not known in advance, so we can again learn it from data by placing a prior on it and integrating over this prior. This comes at the cost of an additional loss of 1/2 log n + c bits for some constant c (which is ≤ 1 for two experts), and, as will be shown in the next subsection, can be implemented using a quadratic time algorithm.
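Because the normal and wild branches of (4.21) rejoin after every outcome, the HMM collapses, per time step, to mixing each expert's prediction with the uniform distribution, after which the smoothed experts are combined by Bayes. The following sketch (our own illustration, with hypothetical names; it assumes a finite outcome space and a known α) computes -log2 P_oce,α(x^n) this way.

import math

def overconfident_experts(expert_probs, alpha, outcome_space_size):
    # expert_probs[n][xi] = P_xi(x_{n+1} | x^n); u(x) = 1/outcome_space_size.
    u = 1.0 / outcome_space_size
    K = len(expert_probs[0])
    logw = [math.log2(1.0 / K)] * K              # uniform prior w(xi), in bits
    for probs in expert_probs:
        for xi in range(K):
            smoothed = (1 - alpha) * probs[xi] + alpha * u
            logw[xi] += math.log2(smoothed)
    # log-sum-exp (base 2) over experts gives the Bayes mixture
    mx = max(logw)
    return -(mx + math.log2(sum(2 ** (lw - mx) for lw in logw)))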

Recursive Combination

In Figure 4.10 one may recognise two simpler HMMs: it is in fact just a Bayesian combination of a set of fixed elementwise mixtures with some parameter α, one for each expert. Thus two models for combining expert predictions, the Bayesian model and fixed elementwise mixtures, have been recursively combined into a single new model. This view is illustrated in Figure 4.11.

More generally, any method to combine the predictions of multiple experts into a single new prediction strategy can itself be considered an expert. We can apply our method recursively to this new "meta-expert"; the running time of the recursive combination is only the sum of the running times of all the component predictors. For example, if all used individual expert models can be evaluated in quadratic time, then the full recursive combination also has quadratic running time, even though it may be impossible to specify using an HMM of quadratic size.

Although a recursive combination to implement overconfident experts may save some work, the same running time may be achieved by implementing the HMM depicted in Figure 4.10 directly. However, we can also obtain efficient generalisations of the overconfident expert model by replacing any combinator by a more sophisticated one. For example, rather than a fixed elementwise mixture, we could use a universal elementwise mixture for each expert, so that the error frequency is learned from data. Or, if we suspect that an expert may not only make incidental slip-ups, but actually become completely untrustworthy for longer stretches of time, we may even use a fixed or universal share model.

One may also consider that the fundamental idea behind the overconfident expert model is to combine each expert with a uniform predictor using a misprediction model. In the example in Figure 4.11, this idea is used to "smooth" the expert predictions, which are then used at the top level in a Bayesian combination. However, the model that is used at the top level is completely orthogonal to the model used to smooth expert predictions; we can safeguard against overconfident experts not only in Bayesian combinations but also in other models, such as the switch distribution or the run-length model, which are described in the next section.

Figure 4.11 Implementing overconfident experts with recursive combinations: a Bayesian combination of fixed elementwise mixtures, each mixing one of the experts a, b, c, d with the safe expert u.



4.4 New Models to Switch between Experts

So far we have considered two models for switching: fixed share and its generalisation, universal share. While fixed share is an extremely efficient algorithm, it requires that the frequency of switching between experts is estimated a priori, which can be hard in practice. Moreover, we may have prior knowledge about how the switching probability will change over time, but unless we know the ultimate sample size in advance, we may be forced to accept a linear overhead compared to the best parameter value. Universal share overcomes this problem by marginalising over the unknown parameter, but has quadratic running time.

The first model considered in this section, called the switch distribution, avoids both problems. It is parameterless and has essentially the same running time as fixed share. It also achieves a loss bound competitive with that of universal share. Moreover, for a bounded number of switches the bound has even better asymptotics.

The second model is called the run-length model because it uses a run-length code (cf. [62]) as an ES-prior. This may be useful because, while both fixed and universal share model the distance between switches with a geometric distribution, the real distribution on these distances may be different. This is the case if, for example, the switches are highly clustered. This additional expressive power comes at the cost of quadratic running time, but we discuss a special case where this may be reduced to linear. We compare the advantages and drawbacks of the run-length model relative to the switch distribution.

4.4.1 Switch Distribution

The switch distribution is a new model for combining expert predictions. Like fixed share, it is intended for settings where the best predicting expert is expected to change as a function of the sample size, but it has two major innovations. First, we let the probability of switching to a different expert decrease with the sample size. This allows us to derive a loss bound close to that of the fixed share algorithm, without the need to tune any parameters.⁵ Second, the switch distribution has a special provision to ensure that in the case where the number of switches remains bounded, the incurred loss overhead is O(1).

⁵The idea of decreasing the switch probability as 1/(n + 1), which has not previously been published, was independently conceived by Mark Herbster and the authors.

The switch distribution is the subject of the next chapter, which addresses a long-standing open problem in statistical model selection known as the "AIC vs BIC dilemma". Some criteria for model selection, such as AIC, are efficient when applied to sequential prediction of future outcomes, while other criteria, such as BIC, are "consistent": with probability one, the model that contains the data generating distribution is selected given enough data. Using the switch distribution, these two goals (truth finding vs prediction) can be reconciled. Refer to the next chapter for more information.

Here we disregard such applications and treat the switch distribution like the other models for combining expert predictions. We describe an HMM that corresponds to the switch distribution; this illuminates the relationship between the switch distribution and the fixed share algorithm, which it in fact generalises.

The equivalence between the original definition of the switch distribution and the HMM is not trivial, so we give a formal proof. The size of the HMM is such that calculation of P(x^n) requires only O(n|Ξ|) steps.

We provide a loss bound for the switch distribution in Section 4.4.1. Then in Section 4.4.1 we show how the sequence of experts that has maximum a posteriori probability can be computed. This problem is difficult for general HMMs, but the structure of the HMM for the switch distribution allows for an efficient algorithm in this case.

Switch HMM

Let σ^∞ and τ^∞ be sequences of distributions on {0,1}, which we call the switch probabilities and the stabilisation probabilities. The switch HMM A_sw, displayed in Figure 4.12, is defined below using Q = Q_s ∪ Q_p:

    Q_s = {p, p_s, p_u} × N        P°(⟨p,0⟩) = 1      Λ(⟨s,ξ,n⟩) = ξ      (4.23a)
    Q_p = {s, u} × Ξ × Z+                              Λ(⟨u,ξ,n⟩) = ξ

    P(⟨p,n⟩ → ⟨p_u,n⟩)     = τ_n(0)
    P(⟨p,n⟩ → ⟨p_s,n⟩)     = τ_n(1)
    P(⟨p_u,n⟩ → ⟨u,ξ,n+1⟩) = w(ξ)
    P(⟨p_s,n⟩ → ⟨s,ξ,n+1⟩) = w(ξ)                                         (4.23b)
    P(⟨s,ξ,n⟩ → ⟨s,ξ,n+1⟩) = 1
    P(⟨u,ξ,n⟩ → ⟨u,ξ,n+1⟩) = σ_n(0)
    P(⟨u,ξ,n⟩ → ⟨p,n⟩)     = σ_n(1)

This HMM contains two "expert bands". Consider a productive state ⟨u,ξ,n⟩ in the bottom band, which we call the unstable band, from a generative viewpoint. Two things can happen. With probability σ_n(0) the process continues horizontally to ⟨u,ξ,n+1⟩ and the story repeats. We say that no switch occurs. With probability σ_n(1) the process continues to the silent state ⟨p,n⟩ directly to the right. We say that a switch occurs. Then a new choice has to be made. With probability τ_n(0) the process continues rightward to ⟨p_u,n⟩ and then branches out to some productive state ⟨u,ξ′,n+1⟩ (possibly ξ = ξ′), and the story repeats. With probability τ_n(1) the process continues to ⟨p_s,n⟩ in the top band, called the stable band. Also here it branches out to some productive state ⟨s,ξ′,n+1⟩. But from this point onward there are no choices anymore; expert ξ′ is produced forever. We say that the process has stabilised.

By choosing τ_n(1) = 0 and σ_n(1) = θ for all n we essentially remove the stable band and arrive at fixed share with parameter θ. The presence of the stable band enables us to improve the loss bound of fixed share in the particular case that the number of switches is bounded; in that case, the stable band allows us to remove the dependency of the loss bound on n altogether. We will use the particular choice τ_n(0) = θ for all n, and σ_n(1) = π_t(Z = n | Z ≥ n) for some fixed value θ and an arbitrary distribution π_t on N. This allows us to relate the switch HMM to the parametric representation that we present next.
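With these concrete choices, and with π_t(n) = 1/(n(n+1)) so that σ_n(1) = π_t(Z = n | Z ≥ n) = 1/(n+1) (cf. the footnote above), the forward algorithm on this HMM only needs two weight vectors, one per band. The following Python sketch is our own rendering under those assumptions (uniform w(ξ); illustrative names and input format); it runs in O(n|Ξ|) time.

import math

def switch_forward(expert_probs, theta=0.5):
    # expert_probs[n][xi] = P_xi(x_{n+1} | x^n); returns -log2 P_sw(x^n).
    K = len(expert_probs[0])
    w = 1.0 / K
    unstable = [theta * w] * K        # mass in <u, xi, n>
    stable = [(1 - theta) * w] * K    # mass in <s, xi, n>
    bits = 0.0
    for n, probs in enumerate(expert_probs, start=1):
        unstable = [m * p for m, p in zip(unstable, probs)]   # loss update
        stable = [m * p for m, p in zip(stable, probs)]
        total = sum(unstable) + sum(stable)
        bits -= math.log2(total)
        unstable = [m / total for m in unstable]              # renormalise
        stable = [m / total for m in stable]
        # switch with probability sigma_n(1) = 1/(n+1), then (re)stabilise
        sw = sum(unstable) / (n + 1)
        unstable = [m * n / (n + 1) + theta * w * sw for m in unstable]
        stable = [m + (1 - theta) * w * sw for m in stable]
    return bits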

Switch Distribution

In Chapter 5, we give a parametric definition of the switch distribution, and provide an algorithm that computes it efficiently, i.e. in time O(n|Ξ|), where n is the sample size and |Ξ| is the number of considered experts. Here we show that this algorithm can really be interpreted as the forward algorithm applied to the switch HMM of Section 4.4.1.

Definition 4.4.1. We first define the countable set of switch parameters

    Θ_sw := {⟨t^m, k^m⟩ | m ≥ 1, k^m ∈ Ξ^m, t^m ∈ N^m and 0 = t_1 < t_2 < ... < t_m}.

Figure 4.12 Combination of four experts using the switch distribution


The switch prior is the discrete distribution on switch parameters given by

    π_sw(⟨t^m, k^m⟩) := π_m(m) π_k(k_1) ∏_{i=2}^m π_t(t_i | t_i > t_{i−1}) π_k(k_i),

where π_m is geometric with rate θ, and π_t and π_k are arbitrary distributions on N and Ξ. We define the mapping ξ : Θ_sw → Ξ^∞ that interprets switch parameters as sequences of experts by

    ξ(⟨t^m, k^m⟩) := k_1^[t_2−t_1] ⌢ k_2^[t_3−t_2] ⌢ ... ⌢ k_{m−1}^[t_m−t_{m−1}] ⌢ k_m^[∞],

where k^[λ] is the sequence consisting of λ repetitions of k and ⌢ denotes concatenation. This mapping is not 1-1: infinitely many switch parameters map to the same infinite sequence, since k_i and k_{i+1} may coincide. The switch distribution P_sw is the ES-joint based on the ES-prior that is obtained by composing π_sw with ξ.
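For intuition, one can sample from the switch prior and expand the result through ξ. The sketch below is illustrative, not from the thesis; it assumes π_m geometric with rate θ, π_k uniform, and π_t(n) = 1/(n(n+1)), for which P(Z ≥ n) = 1/n admits simple inverse-CDF sampling.

import random

def sample_switch_parameters(experts, theta=0.5, horizon=20):
    # draw <t^m, k^m> from pi_sw, then expand via the mapping xi;
    # 'horizon' truncates the final infinite block k_m^[inf]
    m = 1
    while random.random() < theta:        # pi_m(m) = theta^(m-1)(1 - theta)
        m += 1
    ts, ks = [0], [random.choice(experts)]
    for _ in range(m - 1):
        v = 1.0 - random.random()         # uniform on (0, 1]
        # P(Z >= n | Z > t_prev) = (t_prev + 1)/n, so invert the CDF:
        ts.append(int((ts[-1] + 1) / v))
        ks.append(random.choice(experts))
    seq = []                              # expert k_i used for t_i+1 .. t_{i+1}
    for i, k in enumerate(ks):
        end = ts[i + 1] if i + 1 < m else horizon
        seq += [k] * max(0, end - ts[i])
    return list(zip(ts, ks)), seq[:horizon]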

Equivalence

In this section we show that the HMM prior π_A and the switch prior π_sw define the same ES-prior. Throughout this section, it is convenient to regard π_A as a distribution on sequences of states, allowing us to differentiate between distinct sequences of states that map to the same sequence of experts. The function Λ : Q_p^∞ → Ξ^∞, which we call trace, explicitly performs this mapping: Λ(q^∞)(i) := Λ(q_i^p). We cannot

Figure 4.13 Commutativity diagram

[Diagram: the random variable f maps ⟨Θ_sw, π_sw⟩ to ⟨Q^∞, π_A⟩; the mappings ξ and Λ take both down to ⟨Ξ^∞, π⟩.]

relate πsw to πA directly as they are carried by different sets (switch parameters vs state sequences), but need to consider the distribution that both induce on sequences of experts via ξ and Λ. Formally:

Definition 4.4.2. If f : Θ → Γ is a random variable and P is a distribution on Θ, then we write f(P) to denote the distribution on Γ that is induced by f.

Below we will show that Λ(π_A) = ξ(π_sw), i.e. that π_sw and π_A induce the same distribution on the expert sequences Ξ^∞ via the trace Λ and the expert-sequence mapping ξ. Our argument will have the structure outlined in Figure 4.13. Instead of proving the claim directly, we create a random variable f : Θ_sw → Q^∞ mapping switch parameters into runs. Via f, we can view Θ_sw as a reparameterisation of Q^∞. We then show that the diagram commutes, that is, π_A = f(π_sw) and Λ ∘ f = ξ. This shows that Λ(π_A) = Λ(f(π_sw)) = ξ(π_sw) as required.

Proposition 4.4.3. Let A be the HMM as defined in Section 4.4.1, and πsw, ξ and Λ as above. If w = πk then

ξ(πsw)= Λ(πA).

Proof. Recall (4.23) that

    Q = {s, u} × Ξ × Z+ ∪ {p, p_s, p_u} × N.

We define the random variable f : Θ_sw → Q^∞ by

    f(⟨t^m, k^m⟩) := ⟨p,0⟩ ⌢ u_1 ⌢ u_2 ⌢ ... ⌢ u_{m−1} ⌢ s,   where
    u_i := ⟨p_u,t_i⟩, ⟨u,k_i,t_i+1⟩, ⟨u,k_i,t_i+2⟩, ..., ⟨u,k_i,t_{i+1}⟩, ⟨p,t_{i+1}⟩
    s := ⟨p_s,t_m⟩, ⟨s,k_m,t_m+1⟩, ⟨s,k_m,t_m+2⟩, ...

We now show that Λ ∘ f = ξ and f(π_sw) = π_A, from which the theorem follows directly. Fix p = ⟨t^m, k^m⟩ ∈ Θ_sw. Since the trace of a concatenation equals the concatenation of the traces,

    Λ ∘ f(p) = Λ(u_1) ⌢ Λ(u_2) ⌢ ... ⌢ Λ(u_{m−1}) ⌢ Λ(s)
             = k_1^[t_2−t_1] ⌢ k_2^[t_3−t_2] ⌢ ... ⌢ k_{m−1}^[t_m−t_{m−1}] ⌢ k_m^[∞] = ξ(p),

which establishes the first part. Second, we need to show that π_A and f(π_sw) assign the same probability to all events. Since π_sw has countable support, so has f(π_sw). By construction f is injective, so the preimage of f(p) equals {p}, and hence f(π_sw)({f(p)}) = π_sw(p). Therefore it suffices to show that π_A({f(p)}) = π_sw(p) for all p ∈ Θ_sw. Let q^∞ = f(p), and define u_i and s for this p as above. Then

    π_A(q^∞) = π_A(⟨p,0⟩) [ ∏_{i=1}^{m−1} π_A(u_i | u^{i−1}) ] π_A(s | u^{m−1}).

Note that

    π_A(s | u^{m−1}) = (1 − θ) π_k(k_m)
    π_A(u_i | u^{i−1}) = θ π_k(k_i) [ ∏_{j=t_i+1}^{t_{i+1}−1} π_t(Z > j | Z ≥ j) ] π_t(Z = t_{i+1} | Z ≥ t_{i+1}).

The product above telescopes, so that

    π_A(u_i | u^{i−1}) = θ π_k(k_i) π_t(Z = t_{i+1} | Z > t_i).

We obtain

    π_A(q^∞) = 1 · [ ∏_{i=1}^{m−1} θ π_k(k_i) π_t(t_{i+1} | t_{i+1} > t_i) ] · (1 − θ) π_k(k_m)
             = θ^{m−1} (1 − θ) π_k(k_1) ∏_{i=2}^m π_k(k_i) π_t(t_i | t_i > t_{i−1})
             = π_sw(p),

under the assumption that πm is geometric with parameter θ.

A Loss Bound

We derive a loss bound of the same type as the bound for the fixed share algorithm (see Section 4.3.2).

Theorem 4.4.4. Fix data x^n. Let θ̂ = ⟨t^m, k^m⟩ maximise the likelihood P_{ξ(θ̂)}(x^n) among all switch parameters of length m. Let π_m(n) = 2^{−n}, π_t(n) = 1/(n(n+1))

and π_k be uniform. Then the loss overhead −log P_sw(x^n) + log P_{ξ(θ̂)}(x^n) of the switch distribution is bounded by

    m + m log|Ξ| + log C(t_m+1, m) + log(m!),

where C(a, b) denotes the binomial coefficient.

Proof. We have

    −log P_sw(x^n) + log P_{ξ(θ̂)}(x^n) ≤ −log π_sw(θ̂)
      = −log [ π_m(m) π_k(k_1) ∏_{i=2}^m π_t(t_i | t_i > t_{i−1}) π_k(k_i) ]
      = −log π_m(m) + ∑_{i=1}^m −log π_k(k_i) + ∑_{i=2}^m −log π_t(t_i | t_i > t_{i−1}).      (4.24)

The considered prior πt(n) = 1/(n(n + 1)) satisfies

    π_t(t_i | t_i > t_{i−1}) = π_t(t_i) / ∑_{j=t_{i−1}+1}^∞ π_t(j)
                             = [1/(t_i(t_i+1))] / ∑_{j=t_{i−1}+1}^∞ (1/j − 1/(j+1))
                             = (t_{i−1} + 1) / (t_i (t_i + 1)).

If we substitute this in the last term of (4.24), the sum telescopes and we are left with

    −log(t_1 + 1) + log(t_m + 1) + ∑_{i=2}^m log t_i,      (4.25)

where the first term −log(t_1 + 1) = 0 because t_1 = 0. If we fix t_m, this expression is maximised if t_2, ..., t_{m−1} take on the values t_m − m + 2, ..., t_m − 1, so that (4.25) becomes

    ∑_{i=t_m−m+2}^{t_m+1} log i = log [ (t_m + 1)! / (t_m − m + 1)! ] = log C(t_m+1, m) + log(m!).

The theorem follows if we also instantiate πm and πk in (4.24).
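The closed form for π_t(t_i | t_i > t_{i−1}) used in this proof is easy to check numerically; the snippet below is a sanity check under our own truncation of the tail sum.

# check pi_t(t_i | t_i > t_prev) = (t_prev + 1)/(t_i (t_i + 1))
# for pi_t(n) = 1/(n(n+1)), whose tail telescopes to 1/(t + 1)
def pi_t(n):
    return 1.0 / (n * (n + 1))

def tail(t, terms=10**6):
    return sum(pi_t(j) for j in range(t + 1, t + terms))  # truncated

for t_prev in range(5):
    z = tail(t_prev)
    for t_i in range(t_prev + 1, t_prev + 6):
        assert abs(pi_t(t_i) / z - (t_prev + 1) / (t_i * (t_i + 1))) < 1e-4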

Note that this loss bound is a function of the index of the last switch t_m rather than of the sample size n; this means that in the important scenario where the number of switches remains bounded in n, the loss compared to the best partition is O(1).

The bound can be tightened slightly by using the fact that we allow for switching to the same expert, as also remarked in Footnote 3 on page 96: the m log|Ξ| term can then be reduced to m log(|Ξ| − 1). If we take this into account, the bound compares quite favourably with the loss bound for the fixed share algorithm (see Section 4.3.2). We now investigate how much worse the above guarantees are compared to those of fixed share. The overhead of fixed share (4.18) is bounded from above by nH(α) + m log(|Ξ| − 1). We first underestimate this worst-case loss by substituting the optimal value α = m/n, and rewrite

    nH(α) ≥ nH(m/n) ≥ log C(n, m).

Second we overestimate the loss of the switch distribution by substituting the worst case t_m = n − 1. We then find the maximal difference between the two bounds to be

    ( m + m log(|Ξ| − 1) + log C(n, m) + log(m!) ) − ( log C(n, m) + m log(|Ξ| − 1) )
      = m + log(m!) ≤ m + m log m.      (4.26)

Thus using the switch distribution instead of fixed share lowers the guarantee by at most m + m log m bits, which is significant only if the number of switches is relatively large. On the flip side, using the switch distribution does not require any prior knowledge about any parameters. This is a big advantage in a setting where we desire to maintain the bound sequentially. This is impossible with the fixed share algorithm in case the optimal value of α varies with n.

MAP Estimation

The particular nature of the switch distribution allows us to perform MAP estimation efficiently. The MAP sequence of experts is:

    arg max_{ξ^n} P(x^n, ξ^n).

We observed in Section 4.2.5 that Viterbi can be used on unambiguous HMMs. However, the switch HMM is ambiguous, since a single sequence of experts is produced by multiple sequences of states. Still, it turns out that for the switch HMM we can jointly consider all these sequences of states efficiently. Consider for example the expert sequence abaabbbb. The sequences of states that produce this expert sequence are exactly the runs through the pruned HMM shown in Figure 4.14. Runs through this HMM can be decomposed in two parts, as indicated in the bottom of the figure. In the right part a single expert is repeated, in our case expert b. The left part is contained in the unstable (lower) band. To compute the MAP sequence we proceed as follows. We iterate over the possible places of the transition from left to right, and then optimise the left and right segments independently. In the remainder we first compute the probability of the MAP expert sequence instead of the sequence itself. We then show how to compute the MAP sequence from the fallout of the probability computation.

Figure 4.14 MAP estimation for the switch distribution. The sequences of states that can be obtained by following the arrows are exactly those that produce expert sequence abaabbbb.


To optimise both parts, we define two functions L and R.

    L_i := max_{ξ^i} P(x^i, ξ^i, ⟨p,i⟩)                                   (4.27)
    R_i(ξ) := P(x^n, ξ_i = ... = ξ_n = ξ | x^{i−1}, ⟨p,i−1⟩)              (4.28)

Thus L_i is the probability of the MAP expert sequence of length i. The requirement ⟨p,i⟩ forces all sequences of states that realise it to remain in the unstable band. R_i(ξ) is the probability of the tail x_i, ..., x_n when expert ξ is used for all outcomes, starting in state ⟨p,i−1⟩. Combining L and R, we have

    max_{ξ^n} P(x^n, ξ^n) = max_{i∈[n], ξ} L_{i−1} R_i(ξ).

Recurrence

L_i and R_i can efficiently be computed using the following recurrence relations. First we define auxiliary quantities

    L′_i(ξ) := max_{ξ^i} P(x^i, ξ^i, ⟨u,ξ,i⟩)                             (4.29)
    R′_i(ξ) := P(x^n, ξ_i = ... = ξ_n = ξ | x^{i−1}, ⟨u,ξ,i⟩)             (4.30)

Observe that the requirement ⟨u,ξ,i⟩ forces ξ_i = ξ. First, L′_i(ξ) is the MAP probability for length i under the constraint that the last expert used is ξ. Second, R′_i(ξ) is the MAP probability of the tail x_i, ..., x_n under the constraint that the same expert is used all the time. Using these quantities, we have (using the γ transition probabilities shown in (4.34))

    L_i = max_ξ L′_i(ξ) γ_1,        R_i(ξ) = γ_2 R′_i(ξ) + γ_3 P_ξ(x^n | x^{i−1}).      (4.31)

For L′_i(ξ) and R′_i(ξ) we have the following recurrences:

    L′_{i+1}(ξ) = P_ξ(x_{i+1} | x^i) max{ L′_i(ξ) (γ_4 + γ_1 γ_5), L_i γ_5 }      (4.32)
    R′_i(ξ) = P_ξ(x_i | x^{i−1}) ( γ_1 R_{i+1}(ξ) + γ_4 R′_{i+1}(ξ) ).            (4.33)

The recurrence for L has border case L_0 = 1. The recurrence for R has border case R_{n+1} = 1. Here

    γ_1 = P(⟨u,ξ,i⟩ → ⟨p,i⟩)
    γ_2 = P(⟨p,i−1⟩ → ⟨p_u,i−1⟩ → ⟨u,ξ,i⟩)
    γ_3 = P(⟨p,i−1⟩ → ⟨p_s,i−1⟩ → ⟨s,ξ,i⟩)                                        (4.34)
    γ_4 = P(⟨u,ξ,i⟩ → ⟨u,ξ,i+1⟩)
    γ_5 = P(⟨p,i⟩ → ⟨p_u,i⟩ → ⟨u,ξ,i+1⟩)

Complexity

A single recurrence step of L_i costs O(|Ξ|) due to the maximisation. All other recurrence steps take O(1). Hence both L_i and L′_i(ξ) can be computed recursively for all i = 1, ..., n and ξ ∈ Ξ in time O(n|Ξ|), while each of R_i, R′_i(ξ) and P_ξ(x^n | x^{i−1}) can be computed recursively for all i = n, ..., 1 and ξ ∈ Ξ in time O(n|Ξ|) as well. Thus the MAP probability can be computed in time O(n|Ξ|). Storing all intermediate values costs O(n|Ξ|) space as well.

The MAP Expert Sequence

As usual in dynamic programming, we can retrieve the final solution, the MAP expert sequence, from these intermediate values. We redo the computation, and each time that a maximum is computed we record the expert that achieves it. The experts thus computed form the MAP sequence.
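The following sketch implements the recurrences (4.31) to (4.33) with the γ's of (4.34), for the concrete parameter choices used earlier (uniform w, τ(0) = θ, σ_i(1) = 1/(i+1)). It is our own illustrative rendering; it returns only the MAP probability, and it omits renormalisation, so numerically it is suited to short sequences only.

def switch_map_probability(expert_probs, theta=0.5):
    # expert_probs[i][xi] = P_xi(x_{i+1} | x^i)
    n, K = len(expert_probs), len(expert_probs[0])
    w = 1.0 / K
    # forward pass for L_i and L'_i(xi); borders L_0 = 1, L'_0 = 0
    L = [1.0] + [0.0] * n
    Lp = [[0.0] * K for _ in range(n + 1)]
    for i in range(n):
        g1, g4, g5 = 1.0 / (i + 1), i / (i + 1.0), theta * w
        for xi in range(K):
            Lp[i + 1][xi] = expert_probs[i][xi] * max(
                Lp[i][xi] * (g4 + g1 * g5),  # stay in the unstable band
                L[i] * g5)                   # enter via the silent state <p, i>
        L[i + 1] = max(Lp[i + 1]) / (i + 2)  # gamma_1 at index i + 1
    # backward pass for R_i(xi), R'_i(xi); borders R_{n+1} = R'_{n+1} = 1
    R_nxt, Rp_nxt = [1.0] * K, [1.0] * K
    tailp = [1.0] * K                        # becomes P_xi(x_i^n | x^{i-1})
    best = 0.0
    for i in range(n, 0, -1):
        g1, g4 = 1.0 / (i + 1), i / (i + 1.0)
        g2, g3 = theta * w, (1.0 - theta) * w
        R_i, Rp_i = [0.0] * K, [0.0] * K
        for xi in range(K):
            p = expert_probs[i - 1][xi]
            tailp[xi] *= p
            Rp_i[xi] = p * (g1 * R_nxt[xi] + g4 * Rp_nxt[xi])
            R_i[xi] = g2 * Rp_i[xi] + g3 * tailp[xi]
            best = max(best, L[i - 1] * R_i[xi])
        R_nxt, Rp_nxt = R_i, Rp_i
    return best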

4.4.2 Run-length Model

Run-length codes have been used extensively in the context of data compression, see e.g. [62]. Rather than applying run-length codes directly to the observations, we reinterpret the corresponding probability distributions as ES-priors, because they may constitute good models for the distances between consecutive switches.

The run-length model is especially useful if the switches are clustered, in the sense that some blocks in the expert sequence contain relatively few switches, while other blocks contain many. The fixed share algorithm remains oblivious to such properties, as its predictions of the expert sequence are based on a Bernoulli model: the probability of switching remains the same, regardless of the index of the previous switch. Essentially the same limitation also applies to the universal share algorithm, whose switching probability normally converges as the sample size increases. The switch distribution is efficient when the switches are clustered toward the beginning of the sample: its switching probability decreases in the sample size. However, this may be unrealistic and may introduce a new unnecessary loss overhead.

The run-length model is based on the assumption that the intervals between successive switches are independently distributed according to some distribution π_t. After the universal share model and the switch distribution, this is a third generalisation of the fixed share algorithm, which is recovered by taking a geometric distribution for π_t. As may be deduced from the defining HMM, which is given below, we require quadratic running time O(n²|Ξ|) to evaluate the run-length model in general.

Run-length HMM

Let S := {⟨m,n⟩ ∈ N² | m < n}, and let π_t be a distribution on Z+. The specification of the run-length HMM is given using Q = Q_s ∪ Q_p by:

    Q_s = {q} × S ∪ {p} × N        Λ(⟨ξ,m,n⟩) = ξ
    Q_p = Ξ × S                    P°(⟨p,0⟩) = 1                          (4.35a)

    P(⟨p,n⟩ → ⟨ξ,n,n+1⟩)   = w(ξ)
    P(⟨ξ,m,n⟩ → ⟨ξ,m,n+1⟩) = π_t(Z > n−m | Z ≥ n−m)                       (4.35b)
    P(⟨ξ,m,n⟩ → ⟨q,m,n⟩)   = π_t(Z = n−m | Z ≥ n−m)
    P(⟨q,m,n⟩ → ⟨p,n⟩)     = 1

A Loss Bound

Fix an expert sequence ξ^n with m blocks. For i = 1, ..., m, let δ_i and k_i denote the length and expert of block i. From the definition of the HMM above, we obtain that −log π_rl(ξ^n) equals

    ∑_{i=1}^m −log w(k_i) + ∑_{i=1}^{m−1} −log π_t(Z = δ_i) − log π_t(Z ≥ δ_m).
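This code length is straightforward to evaluate for a concrete expert sequence. The sketch below (with illustrative names, a uniform w, and a numerically truncated tail sum standing in for π_t(Z ≥ δ_m)) follows the display above term by term.

import math
from itertools import groupby

def neg_log_pi_rl(expert_seq, n_experts, pi_t, tail_terms=10**6):
    # -log2 pi_rl(xi^n): -log w(k_i) per block, -log pi_t(Z = delta_i) per
    # finished block, and -log pi_t(Z >= delta_m) for the final block
    blocks = [len(list(g)) for _, g in groupby(expert_seq)]
    bits = len(blocks) * math.log2(n_experts)
    for delta in blocks[:-1]:
        bits -= math.log2(pi_t(delta))
    tail = sum(pi_t(j) for j in range(blocks[-1], blocks[-1] + tail_terms))
    return bits - math.log2(tail)

# e.g. with pi_t(n) = 1/(n(n+1)), for which pi_t(Z >= delta) = 1/delta:
print(neg_log_pi_rl("aababbbb", 2, lambda n: 1.0 / (n * (n + 1))))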

Theorem 4.4.5. Fix data x^n. Let ξ^n maximise the likelihood P_{ξ^n}(x^n) among all expert sequences with m blocks. Let w be the uniform distribution on experts, and let π_t be log-convex. Then the loss overhead is bounded as follows:

    −log P_rl(x^n) + log P_{ξ^n}(x^n) ≤ m log|Ξ| − m log π_t(n/m).

Figure 4.15 HMM for the run-length model


Proof. Let δi denote the length of block i. We overestimate

    −log P_rl(x^n) + log P_{ξ^n}(x^n) ≤ −log π_rl(ξ^n)
      = m log|Ξ| + ∑_{i=1}^{m−1} −log π_t(Z = δ_i) − log π_t(Z ≥ δ_m)
      ≤ m log|Ξ| + ∑_{i=1}^m −log π_t(δ_i),      (4.36)

where the last inequality uses π_t(Z ≥ δ_m) ≥ π_t(Z = δ_m). Since −log π_t is concave (as π_t is log-convex), by Jensen's inequality we have

    ∑_{i=1}^m (−log π_t(δ_i)) / m ≤ −log π_t( ∑_{i=1}^m δ_i / m ) = −log π_t(n/m).

In other words, the block lengths δ_i are all equal in the worst case. Plugging this into (4.36) we obtain the theorem.

Finite Support

We have seen that the run-length model reduces to fixed share if the prior on switch distances π_t is geometric, so that it can be evaluated in linear time in that case. We also obtain a linear time algorithm when π_t has finite support, because then only a constant number of states can receive positive weight at any sample size. For this reason it can be advantageous to choose a π_t with finite support, even if one expects that arbitrarily long distances between consecutive switches may occur. Expert sequences with such longer distances between switches can still be represented with a truncated π_t using a sequence of switches from and to the same expert. This way, long runs of the same expert receive exponentially small, but positive, probability.

4.4.3 Comparison

We have discussed two models for switching: the recent switch distribution and the new run-length model. It is natural to wonder which model to apply. One possibility is to compare asymptotic loss bounds. To compare the bounds given by Theorems 4.4.4 and 4.4.5, we substitute t_m + 1 = n in the bound for the switch distribution, and use a prior π_t for the run-length model that satisfies

    −log π_t(n) ≤ log n + 2 log log(n + 1) + 3

(for instance an Elias code [32]). The next step is to determine which bound is better depending on how fast m grows as a function of n. It only makes sense to consider m non-decreasing in n.

Theorem 4.4.6. The loss bound of the switch distribution (with t_m + 1 = n) is asymptotically lower than that of the run-length model (with π_t as above) if m = o((log n)²), and asymptotically higher if m = Ω((log n)²).⁶

⁶Let f, g : N → N. We say f = o(g) if lim_{n→∞} f(n)/g(n) = 0. We say f = Ω(g) if ∃c > 0 ∃n_0 ∀n ≥ n_0 : f(n) ≥ c·g(n).

Proof sketch. After eliminating terms common to both loss bounds, it remains to compare

    m + m log m    to    2m log log(n/m + 1) + 3.

If m is bounded, the left hand side is clearly lower for sufficiently large n. Otherwise we may divide by m, exponentiate, simplify, and compare

    m    to    (log n − log m)²,

from which the theorem follows directly.

For finite samples, the switch distribution can be used in case the switches are expected to occur early on average, or if the running time is paramount. Otherwise the run-length model is preferable.
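For a feel of the crossover, one can tabulate the two bounds numerically. The snippet below is our own instantiation, with |Ξ| = 4, t_m + 1 = n, and the Elias-style bound on −log π_t in place of an exact prior; all logarithms are base 2.

import math

def switch_bound(n, m, K):
    # Theorem 4.4.4 with t_m + 1 = n: m + m log|Xi| + log C(n, m) + log(m!)
    return (m + m * math.log2(K)
            + math.log2(math.comb(n, m)) + math.log2(math.factorial(m)))

def run_length_bound(n, m, K):
    # Theorem 4.4.5 with -log pi_t(d) <= log d + 2 log log(d + 1) + 3, d = n/m
    d = n / m
    return m * (math.log2(K) + math.log2(d)
                + 2 * math.log2(math.log2(d + 1)) + 3)

n = 10**6
for m in (2, 8, 32, 128, 512):
    print(m, round(switch_bound(n, m, 4)), round(run_length_bound(n, m, 4)))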

4.5 Extensions

The approach described in Sections 4.1 and 4.2 allows efficient evaluation of expert models that can be defined using small HMMs. It is natural to look for additional efficient models for combining experts that cannot be expressed as small HMMs in this way. In this section we describe a number of such extensions to the model as described above. In Section 4.5.1 we outline different methods for approximate, but faster, evaluation of large HMMs. The idea behind Section 4.3.4 is to treat a combination of experts as a single expert, and subject it to "meta" expert combination. Then in Section 4.5.2 we outline a possible generalisation of the considered class of HMMs, allowing the ES-prior to depend on observed data. Finally we propose an alternative to MAP expert sequence estimation that is efficiently computable for general HMMs.

4.5.1 Fast Approximations

For some applications, suitable ES-priors do not admit a description in the form of a small HMM. Under such circumstances we might require an exponential amount of time to compute quantities such as the predictive distribution on the next expert (4.3). For example, although the size of the HMM required to describe the elementwise mixtures of Section 4.3.1 grows only polynomially in n, this is still not feasible in practice. Consider that the transition probabilities at sample size n must depend on the number of times that each expert has occurred previously.


The number of states required to represent this information must therefore be at least C(n+k−1, k−1), where k is the number of experts. For five experts and n = 100, we already require more than four million states! In the special case of mixtures, various methods exist to efficiently find good parameter values, such as expectation maximisation, see e.g. [59] and Li and Barron's approach [55]. Here we describe a few general methods to speed up expert sequence calculations.

Discretisation

The simplest way to reduce the running time of Algorithm 4.1 is to reduce the number of states of the input HMM, either by simply omitting states or by identifying states with similar futures. This is especially useful for HMMs where the number of states grows in n, e.g. the HMMs where the parameter of a Bernoulli source is learned: the HMM for universal elementwise mixtures of Figure 4.7 and the HMM for universal share of Figure 4.9. At each sample size n, these HMMs contain states for count vectors (0,n), (1,n−1), ..., (n,0). In [63] Monteleoni and Jaakkola manage to reduce the number of states to √n when the sample size n is known in advance. We conjecture that it is possible to achieve the same loss bound by joining ranges of well-chosen states into roughly √n super-states, and adapting the transition probabilities accordingly.

Trimming

Another straightforward way to reduce the running time of Algorithm 4.1 is by run-time modification of the HMM. We call this trimming. The idea is to drop low probability transitions from one sample size to the next. For example, consider the HMM for elementwise mixtures of two experts, Figure 4.7. The number of transitions grows linearly in n, but depending on the details of the application, the probability mass may concentrate on a subset that represents mixture coefficients close to the optimal value. A speedup can then be achieved by always retaining only the smallest set of transitions that are reached with probability p, for some value of p which is reasonably close to one. The lost probability mass can be recovered by renormalisation.

The ML Conditioning Trick

A more drastic approach to reducing the running time can be applied whenever the ES-prior assigns positive probability to all expert sequences. Consider the desired marginal probability (4.2), which is equal to:

    P(x^n) = ∑_{ξ^n∈Ξ^n} π(ξ^n) P(x^n | ξ^n).      (4.37)

In this expression, the sequence of experts ξ^n can be interpreted as a parameter. While we would ideally compute the Bayes marginal distribution, which means integrating out the parameter under the ES-prior, it may be easier to compute a point estimator for ξ^n instead. Such an estimator ξ(x^n) can then be used to find a lower bound on the marginal probability:

    π(ξ(x^n)) P(x^n | ξ(x^n)) ≤ P(x^n).      (4.38)

The first estimator that suggests itself is the Bayesian maximum a posteriori:

    ξ_map(x^n) := arg max_{ξ^n∈Ξ^n} π(ξ^n) P(x^n | ξ^n).

In Section 4.2.5 we explain that this estimator is generally hard to compute for ambiguous HMMs, and for unambiguous HMMs it is as hard as evaluating the marginal (4.37). One estimator that is much easier to compute is the maximum likelihood (ML) estimator, which disregards the ES-prior π altogether:

    ξ_ml(x^n) := arg max_{ξ^n∈Ξ^n} P(x^n | ξ^n).

The ML estimator may correspond to a much smaller term in (4.37) than the MAP estimator, but it has the advantage that it is extremely easy to compute. In fact, letting ξ̂^n := ξ_ml(x^n), each expert ξ̂_i is a function of only the corresponding outcome x_i. Thus, calculation of the ML estimator is cheap. Furthermore, if the goal is not to find a lower bound, but to predict the outcomes x^n with as much confidence as possible, we can make an even better use of the estimator if we use it sequentially. Provided that P(x^n) > 0, we can approximate:

    P(x^n) = ∏_{i=1}^n P(x_i | x^{i−1}) = ∏_{i=1}^n ∑_{ξ_i∈Ξ} P(ξ_i | x^{i−1}) P_{ξ_i}(x_i | x^{i−1})
           ≈ ∏_{i=1}^n ∑_{ξ_i∈Ξ} π(ξ_i | ξ̂^{i−1}) P_{ξ_i}(x_i | x^{i−1}) =: P̃(x^n).      (4.39)

This approximation improves the running time if the conditional distribution π(ξ_n | ξ̂^{n−1}) can be computed more efficiently than P(ξ_n | x^{n−1}), as is often the case.

Example 13. As can be seen in Figure 4.1, the running time of the universal elementwise mixture model (cf. Section 4.3.1) is O(n^|Ξ|), which is prohibitive in practice, even for small |Ξ|. We apply the above approximation. For simplicity we impose the uniform prior density w(α) = 1 on the mixture coefficients. We use the generalisation of Laplace's Rule of Succession to multiple experts, which states:

    π_ue(ξ_{n+1} | ξ^n) = ∫_{△(Ξ)} α(ξ_{n+1}) w(α | ξ^n) dα = ( |{j ≤ n | ξ_j = ξ_{n+1}}| + 1 ) / ( n + |Ξ| ).      (4.40)

Substitution in (4.39) yields the following predictive distribution:

    P̃(x_{n+1} | x^n) = ∑_{ξ_{n+1}∈Ξ} π(ξ_{n+1} | ξ̂^n) P_{ξ_{n+1}}(x_{n+1} | x^n)
                     = ∑_{ξ_{n+1}∈Ξ} [ ( |{j ≤ n | ξ̂_j(x^n) = ξ_{n+1}}| + 1 ) / ( n + |Ξ| ) ] P_{ξ_{n+1}}(x_{n+1} | x^n).      (4.41)

By keeping track of the number of occurrences of each expert in the ML sequence, this expression can easily be evaluated in time proportional to the number of experts, so that P̃(x^n) can be computed in the ideal time O(n|Ξ|) (which is a lower bound because one has to consider all experts at all sample sizes). ♦
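The resulting procedure is only a few lines. The sketch below (with illustrative names) predicts each outcome by (4.41) and then extends the ML expert sequence, whose experts depend only on the current outcome.

import math

def ml_conditioning(expert_probs):
    # expert_probs[i][xi] = P_xi(x_{i+1} | x^i); returns -log2 P-tilde(x^n)
    K = len(expert_probs[0])
    counts = [0] * K
    bits = 0.0
    for i, probs in enumerate(expert_probs):
        # Laplace / rule-of-succession weights pi_ue(xi | ML sequence so far)
        weights = [(counts[xi] + 1) / (i + K) for xi in range(K)]
        bits -= math.log2(sum(wt * p for wt, p in zip(weights, probs)))
        # extend the ML sequence: xi-hat_i maximises P_xi(x_i | x^{i-1})
        counts[max(range(K), key=lambda xi: probs[xi])] += 1
    return bits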

The difference between P(x^n) and P̃(x^n) is difficult to analyse in general, but the approximation does have two encouraging properties. First, the lower bound (4.38) on the marginal probability, instantiated for the ML estimator, also provides a lower bound on P̃. We have

    P̃(x^n) ≥ ∏_{i=1}^n π(ξ̂_i | ξ̂^{i−1}) P_{ξ̂_i}(x_i | x^{i−1}) = π(ξ̂^n) P(x^n | ξ̂^n).

To see why the approximation gives higher probability than the bound, consider that the bound corresponds to a defective distribution, unlike P̃. Second, the following information processing argument shows that even in circumstances where the approximation of the posterior P̃(ξ_i | x^{i−1}) is poor, the approximation of the predictive distribution P̃(x_i | x^{i−1}) might be acceptable.

Lemma 4.5.1. Let π, ρ be ES-priors. Then for all n ∈ N,

    D( P_ρ(x^n) ∥ P_π(x^n) ) ≤ D( ρ(ξ^n) ∥ π(ξ^n) ).

Proof. The claim follows from taking an expectation of Theorem 4.3.1 under P_ρ:

    E_{P_ρ}[ −log ( P_π(x^n) / P_ρ(x^n) ) ] ≤ E_{P_ρ}[ E_{P_ρ}[ −log ( π(ξ^n) / ρ(ξ^n) ) | x^n ] ] = E_{P_ρ}[ −log ( π(ξ^n) / ρ(ξ^n) ) ].

We apply this lemma to the predictive distribution on the single next outcome after observing a sequence x^n. Setting π to P_π(ξ_{n+1} | ξ(x^n)) and ρ to P_π(ξ_{n+1} | x^n), we find that the divergence between the predictive distribution on the next outcome and its approximation is at most equal to the divergence between the posterior distribution on the next expert and its approximation. In other words, approximation errors in the posterior tend to cancel each other out during prediction.

Figure 4.16 Conditioning ES-prior on past observations for free


4.5.2 Data-Dependent Priors

To motivate ES-priors we used the slogan we do not understand the data. When we discussed using HMMs as ES-priors, we imposed the restriction that for each state the associated Ξ-PFS was independent of the previously produced experts. Indeed, conditioning on the expert history increases the running time dramatically, as all possible histories must be considered. However, conditioning on the past observations can be done at no additional cost, as the data are observed. The resulting HMM is shown in Figure 4.16. We consider this technical possibility a curiosity, as it clearly violates our slogan. Of course it is equally feasible to condition on some function of the data. An interesting case is obtained by conditioning on the vector of losses (cumulative or incremental) incurred by the experts. This way we maintain ignorance about the data, while extending expressive power: the resulting ES-joints are generally not decomposable into an ES-prior and expert PFSs. An example is the Variable Share algorithm introduced in [46].

4.5.3 An Alternative to MAP Data Analysis

Sometimes we have data x^n that we want to analyse. One way to do this is by computing the MAP sequence of experts. Unfortunately, we do not know how to compute the MAP sequence for general HMMs. We propose the following alternative way to gain insight into the data. The forward and backward algorithms compute P(x^i, q^p_i) and P(x^n | q^p_i, x^i). Recall that q^p_i is the productive state that is used at time i. From these we can compute the a posteriori probability P(q^p_i | x^n) of each productive state q^p_i, that is, the posterior probability taking the entire future into account. This is a standard way to analyse data in the HMM literature [66]. To arrive at a conclusion about experts, we simply project the posterior on states down to obtain the posterior probability P(ξ_i | x^n) of each expert ξ ∈ Ξ at each time i = 1, ..., n. This gives us a sequence of mixture weights over the experts that we can, for example, plot as a |Ξ| × n grid of gray shades. On the one hand this gives us mixtures, a richer representation than just single experts. On the other hand we lose temporal correlations, as we treat each time instance separately.
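For a concrete instance, the sketch below computes these projected posteriors by forward-backward for the fixed share ES-prior, where the hidden state is simply the current expert. This is our own illustration; a general HMM would substitute its own transition structure.

import numpy as np

def expert_posteriors(expert_probs, alpha=0.05):
    # expert_probs: (n, K) array with P_xi(x_i | x^{i-1});
    # returns an (n, K) array with rows P(xi_i | x^n)
    P = np.asarray(expert_probs, dtype=float)
    n, K = P.shape
    T = (1 - alpha) * np.eye(K) + (alpha / K) * np.ones((K, K))
    fwd = np.zeros((n, K))
    f = np.full(K, 1.0 / K)
    for i in range(n):
        f = f * P[i]
        f /= f.sum()              # normalise to avoid underflow
        fwd[i] = f                # proportional to P(xi_i, x^i)
        f = T.T @ f
    bwd = np.zeros((n, K))
    b = np.ones(K)
    for i in range(n - 1, -1, -1):
        bwd[i] = b                # proportional to P(x_{i+1..n} | xi_i)
        b = T @ (P[i] * b)
        b /= b.sum()
    post = fwd * bwd
    return post / post.sum(axis=1, keepdims=True)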

4.6 Conclusion

In prediction with expert advice, the goal is to formulate prediction strategies that perform as well as the best possible expert (combination). Expert predictions can be combined by taking a weighted mixture at every sample size. The best combination generally evolves over time. In this chapter we introduced expert sequence priors (ES-priors), which are probability distributions over infinite sequences of experts, to model the trajectory followed by the best expert combination. Prediction with expert advice then amounts to marginalising the joint distribution constructed from the chosen ES-prior and the experts' predictions.

We employed hidden Markov models (HMMs) to specify ES-priors. HMMs' explicit notion of current state and state-to-state evolution naturally fits the temporal correlations we seek to model. For reasons of efficiency we use HMMs with silent states. The standard algorithms for HMMs (Forward, Backward, Viterbi and Baum-Welch) can be used to answer questions about the ES-prior as well as the induced distribution on data. The running time of the forward algorithm can be read off directly from the graphical representation of the HMM.

Our approach allows unification of many existing expert models, including mixture models and fixed share. We gave their defining HMMs and recovered the best known running times. We also introduced two new parameterless generalisations of fixed share. The first, called the switch distribution, was recently introduced to improve model selection performance. We rendered its parametric definition as a small HMM, which shows how it can be evaluated in linear time. The second, called the run-length model, uses a run-length code in a novel way, namely as an ES-prior. This model has quadratic running time. We compared the loss bounds of the two models asymptotically, and showed that the run-length model is preferred if the number of switches grows like (log n)² or faster, while the switch distribution is preferred if it grows slower. We provided graphical representations and loss bounds for all considered models.

Finally we described a number of extensions of the ES-prior/HMM approach, including approximating methods for large HMMs, as well as recursive combinations of expert models.

Algorithm 4.1 Forward(A). Fix an unfolded deterministic HMM prior A =

⟨Q, Q_p, P°, P, Λ⟩ on Ξ, and an X-PFS P_ξ for each expert ξ ∈ Ξ. The input consists of a sequence x^ω that arrives sequentially.

Declare the weight map (partial function) w : Q → [0, 1].
w(v) ← P°(v) for all v such that P°(v) > 0.        ⊲ dom(w) = I
for n = 1, 2, ... do
    Forward Propagation(n)
    Predict next expert: P(ξ_n = ξ | x^{n−1}) = ∑_{v∈Q_{n}: Λ(v)=ξ} w(v) / ∑_{v∈Q_{n}} w(v).
    Loss Update(n)
    Report probability of data: P(x^n) = ∑_{v∈Q_{n}} w(v).
end for

Procedure Forward Propagation(n):
while dom(w) ≠ Q_{n} do        ⊲ dom(w) ⊆ Q_[n−1,n]
    Pick a <-minimal state u in dom(w) \ Q_{n}.        ⊲ u ∈ Q_[n−1,n)
    for v ∈ S_u do        ⊲ v ∈ Q_(n−1,n]
        w(v) ← 0 if v ∉ dom(w).
        w(v) ← w(v) + w(u) P(u → v).
    end for
    Remove u from the domain of w.
end while

Procedure Loss Update(n):
for v ∈ Q_{n} do        ⊲ v ∈ Q_p
    w(v) ← w(v) P_{Λ(v)}(x_n | x^{n−1}).
end for
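A direct Python rendering of Algorithm 4.1 may clarify the bookkeeping. The encoding of the HMM below (per-step transition lists in topological order, plus an emission map for the productive states) is our own assumption, and the prediction step of the algorithm is omitted for brevity.

from collections import defaultdict

def forward(init, layers, emit, expert_probs):
    # init: {state: P_o}; layers[n]: transitions (u, v, p) from time n to
    # n+1, ordered so every silent state appears as a source only after all
    # of its incoming transitions; emit[v]: expert of productive state v;
    # expert_probs[n][xi] = P_xi(x_{n+1} | x^n).  Yields P(x^n) for n >= 1.
    w = defaultdict(float, init)
    for n, layer in enumerate(layers):
        nxt = defaultdict(float)
        for u, v, p in layer:               # Forward Propagation
            mass = w[u] if u in w else nxt[u]   # silent states live in nxt
            nxt[v] += mass * p
        w = defaultdict(float)              # Loss Update
        for v, mass in nxt.items():
            if v in emit:
                w[v] = mass * expert_probs[n][emit[v]]
        yield sum(w.values())               # P(x^{n+1})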

Chapter 5

Slow Convergence: the Catch-up Phenomenon

We consider inference based on a countable set of models (sets of probability distributions), focusing on two tasks: model selection and model averaging. In model selection tasks, the goal is to select the model that best explains the given data. In model averaging, the goal is to find the weighted combination of models that leads to the best prediction of future data from the same source.

An attractive property of some criteria for model selection is that they are consistent under weak conditions, i.e. if the true distribution P* is in one of the models, then the P*-probability that this model is selected goes to one as the sample size increases. BIC [78], Bayes factor model selection [49], Minimum Description Length (MDL) model selection [4] and prequential model validation [27] are examples of widely used model selection criteria that are usually consistent. However, other model selection criteria such as AIC [2] and leave-one-out cross-validation (LOO) [86], while often inconsistent, do typically yield better predictions. This is especially the case in nonparametric settings of the following type: P* can be arbitrarily well-approximated by a sequence of distributions in the (parametric) models under consideration, but is not itself contained in any of these. In many such cases, the predictive distribution converges to the true distribution at the optimal rate for AIC and LOO [80, 56], whereas in general BIC, the Bayes factor method and prequential validation only achieve the optimal rate to within an O(log n) factor [74, 34, 101, 39]. In this paper we reconcile these seemingly conflicting approaches [103] by improving the rate of convergence achieved in Bayesian model selection without losing its consistency properties. First we provide an example to show why Bayes sometimes converges too slowly.

Given priors on models M_1, M_2, ... and parameters therein, Bayesian inference associates each model M_k with the marginal distribution p_k, given in (5.1), obtained by averaging over the parameters according to the prior. In Bayes factor model selection the preferred model is the one with maximum a posteriori probability. By Bayes' rule this is arg max_k p_k(x^n) w(k), where w(k)


denotes the prior probability of M_k. We can further average over model indices, a process called Bayesian Model Averaging (BMA). The resulting distribution p_bma(x^n) = ∑_k p_k(x^n) w(k) can be used for prediction. In a sequential setting, the probability of a data sequence x^n := x_1, ..., x_n under a distribution p typically decreases exponentially fast in n. It is therefore common to consider −log p(x^n), which we call the code length of x^n achieved by p. We take all logarithms to base 2, allowing us to measure code length in bits. The name code length refers to the correspondence between code length functions and probability distributions based on the Kraft inequality, but one may also think of the code length as the accumulated log loss that is incurred if we sequentially predict the x_i by conditioning on the past, i.e. using p(· | x^{i−1}) [4, 39, 27, 69]. For BMA, we have −log p_bma(x^n) = ∑_{i=1}^n −log p_bma(x_i | x^{i−1}). Here the ith term represents the loss incurred when predicting x_i given x^{i−1} using p_bma(· | x^{i−1}), which turns out to be equal to the posterior average: p_bma(x_i | x^{i−1}) = ∑_k p_k(x_i | x^{i−1}) w(k | x^{i−1}).

Prediction using p_bma has the advantage that the code length it achieves on x^n is close to the code length of p_k̂, where k̂ is the best of the marginals p_1, p_2, ..., i.e. k̂ achieves min_k −log p_k(x^n). More precisely, given a prior w on model indices, the difference between −log p_bma(x^n) = −log( ∑_k p_k(x^n) w(k) ) and −log p_k̂(x^n) must be in the range [0, −log w(k̂)], whatever data x^n are observed. Thus, using BMA for prediction is sensible if we are satisfied with doing essentially as well as the best model under consideration. However, it is often possible to combine p_1, p_2, ... into a distribution that achieves smaller code length than p_k̂! This is possible if the index k̂ of the best distribution changes with the sample size in a predictable way. This is common in model selection, for example with nested models, say M_1 ⊂ M_2. In this case p_1 typically predicts better at small sample sizes (roughly, because M_2 has more parameters that need to be learned than M_1), while p_2 predicts better eventually. Figure 5.1 illustrates this phenomenon. It shows the accumulated code length difference −log p_2(x^n) − (−log p_1(x^n)) on "The Picture of Dorian Gray" by Oscar Wilde, where p_1 and p_2 are the Bayesian marginal distributions for the first-order and second-order Markov chains, respectively, and each character in the book is an outcome. Note that the example models M_1 and M_2 are very crude; for this particular application much better models are available. However, in more complicated, more realistic model selection scenarios, the models may still be wrong, but it may not be known how to improve them. Thus M_1 and M_2 serve as a simple illustration only (see the discussion in Section 5.7.1). We used uniform priors on the model parameters, but for other common priors similar behaviour can be expected. Clearly p_1 is better for about the first 100 000 outcomes, gaining a head start of approximately 40 000 bits. Ideally we should predict the initial 100 000 outcomes using p_1 and the rest using p_2. However, p_bma only starts to behave like p_2 when it catches up with p_1 at a sample size of about 310 000, when the code length of p_2 drops below that of p_1. Thus, in the shaded area p_bma behaves like p_1 while p_2 is making better predictions of those outcomes:

Figure 5.1 The Catch-up Phenomenon. [Plot of the code length difference with Markov order 1 (bits) against sample size, for three strategies: Markov order 2, Bayesian Model Averaging, and the Switch-Distribution.]

since at n = 100 000, p_2 is 40 000 bits behind, and at n = 310 000, it has caught up, in between it must have outperformed p_1 by 40 000 bits! The general pattern that first one model is better and then another occurs widely, both on real-world data and in theoretical settings. We argue that failure to take this effect into account leads to the suboptimal rate of convergence achieved by Bayes factor model selection and related methods. We have developed an alternative method to combine distributions p_1 and p_2 into a single distribution p_sw, which we call the switch-distribution, defined in Section 5.1. Figure 5.1 shows that p_sw behaves like p_1 initially, but in contrast to p_bma it starts to mimic p_2 almost immediately after p_2 starts making better predictions; it essentially does this no matter what sequence x^n is actually observed. p_sw differs from p_bma in that it is based on a prior distribution on sequences of models rather than simply a prior distribution on models. This allows us to avoid the implicit assumption that there is one model which is best at all sample sizes. After conditioning on past observations, the posterior we obtain gives a better indication of which model performs best at the current sample size, thereby achieving a faster rate of convergence. Indeed, the switch-distribution is very closely related to earlier algorithms for tracking the best expert developed in the universal prediction literature; see also Section 5.6 [46, 95, 94, 63]; however, the applications we have in mind and the theorems we prove are completely different.

The remainder of the paper is organised as follows. In Section 5.1 we introduce our basic concepts and notation, and we then define the switch-distribution. While in the example above we switched just between two models, the general definition allows switching between elements of any finite or countably infinite set of models. In Section 5.2 we show that model selection based on the switch-distribution is consistent (Theorem 5.2.1). Then in Section 5.3 we show that the switch-distribution achieves a rate of convergence that is never significantly worse than that of Bayesian model averaging, and we develop a number of tools that can be used to bound the rate of convergence "in sum" compared to other model selection criteria. Using these results we show that, in particular, the switch-distribution achieves the worst-case optimal rate of convergence when it is applied to histogram density estimation. In Section 5.4 we provide additional discussion, where we compare our "in sum" convergence rates to the standard definition of convergence rates, and where we motivate our conjecture that the switch-distribution in fact achieves the worst-case optimal rate of convergence in a very wide class of problems, including regression using mean squared error. In Section 5.5 we give a practical algorithm that computes the switch-distribution; Theorem 5.5.1 shows that the run-time for k predictors is Θ(n·k). In Sections 5.6 and 5.7 we put our work in a broader context and explain how our results fit into the existing literature. Specifically, Section 5.7.1 describes a strange implication of the catch-up phenomenon for Bayes factor model selection. The proofs of all theorems are in Section 5.9.

5.1 The Switch-Distribution for Model Selection and Prediction

5.1.1 Preliminaries

Suppose X^∞ = (X_1, X_2, ...) is a sequence of random variables that take values in sample space X ⊆ R^d for some d ∈ Z+ = {1, 2, ...}. For n ∈ N = {0, 1, 2, ...}, let x^n = (x_1, ..., x_n) denote the first n outcomes of X^∞, such that x^n takes values in the product space X^n = X_1 × ··· × X_n. (We let x^0 denote the empty sequence.) Let X^* = ∪_{n=0}^∞ X^n. For m > n, we write X_{n+1}^m for (X_{n+1}, ..., X_m), where m = ∞ is allowed. We sometimes omit the subscript when n = 0 and write X^m rather than X_0^m.

Any distribution P(X^∞) may be defined in terms of a sequential prediction strategy p that predicts the next outcome at any time n ∈ N. To be precise: Given the previous outcomes x^n at time n, this prediction strategy should issue a conditional density p(X_{n+1} | x^n) with corresponding distribution P(X_{n+1} | x^n) for the next outcome X_{n+1}. Such sequential prediction strategies are sometimes called prequential forecasting systems [27]. An instance is given in Example 14 below. We assume that the density p(X_{n+1} | x^n) is taken relative to either the usual Lebesgue measure (if X is continuous) or the counting measure (if X is countable). In the latter case p(X_{n+1} | x^n) is a probability mass function. It is natural to define the joint density p(x_{n+1}^m | x^n) = p(x_{n+1} | x^n) ··· p(x_m | x^{m−1}) and let P(X_{n+1}^∞ | x^n) be the unique distribution on X^∞ such that, for all m > n, p(X_{n+1}^m | x^n) is the density of

its marginal distribution for X_{n+1}^m. To ensure that P(X_{n+1}^∞ | x^n) is well-defined even if X is continuous, we will only allow prediction strategies satisfying the natural requirement that for any k ∈ Z+ and any fixed measurable event A_{k+1} ⊆ X^{k+1} the probability P(A_{k+1} | x^k) is a measurable function of x^k. This requirement holds automatically if X is countable.

5.1.2 Model Selection and Prediction

In model selection the goal is to choose an explanation for observed data x^n from a potentially infinite list of candidate models M_1, M_2, ... We consider parametric models, which we define as sets {p_θ : θ ∈ Θ} of prediction strategies p_θ that are indexed by elements of Θ ⊆ R^d, for some smallest possible d ∈ N, the number of degrees of freedom. A model is more commonly viewed as a set of distributions, but since distributions can be viewed as prediction strategies as explained above, we may think of a model as a set of prediction strategies as well. Examples of model selection are regression based on a set of basis functions such as polynomials (d is the number of coefficients of the polynomial), the variable selection problem in regression [80, 56, 101] (d is the number of variables), and histogram density estimation [74] (d is the number of bins minus 1). A model selection criterion is a function δ : X^* → Z+ that, given any data sequence x^n ∈ X^*, selects the model M_k with index k = δ(x^n).

With each model M_k we associate a single prediction strategy p̄_k. The bar emphasises that p̄_k is a meta-strategy based on the prediction strategies in M_k. In many approaches to model selection, for example AIC and LOO, p̄_k is defined using some estimator θ̂_k, which maps a sequence x^n of previous observations to an estimated parameter value that represents a "best guess" of the true/best distribution in the model. Prediction is then based on this estimator: p̄_k(X_{n+1} | x^n) = p_{θ̂_k(x^n)}(X_{n+1} | x^n), which also defines a joint density p̄_k(x^n) = p̄_k(x_1) ··· p̄_k(x_n | x^{n−1}). The Bayesian approach to model selection or model averaging goes the other way around. It starts out with a prior w on Θ_k, and then defines the Bayesian marginal density

    p̄_k(x^n) = ∫_{θ∈Θ_k} p_θ(x^n) w(θ) dθ.      (5.1)

When p̄_k(x^n) is non-zero this joint density induces a unique conditional density p̄_k(X_{n+1} | x^n) = p̄_k(X_{n+1}, x^n) / p̄_k(x^n), which is equal to the mixture of p_θ ∈ M_k according to the posterior, w(θ | x^n) = p_θ(x^n) w(θ) / ∫ p_θ(x^n) w(θ) dθ, based on x^n. Thus the Bayesian approach also defines a prediction strategy p̄_k(X_{n+1} | x^n), whose corresponding distribution may be thought of as an estimator. From now on we sometimes call the distributions induced by p̄_1, p̄_2, ... "estimators", even if they are Bayesian. We may usually think of the estimators p̄_k as universal codes relative to M_k [39]. This unified view is known as the prequential approach to statistics or predictive MDL [27, 69].

Example 14. Suppose X = {0, 1}. Then a prediction strategy p̄ may be based on the Bernoulli model M = {p_θ | θ ∈ [0, 1]} that regards X_1, X_2, ... as a sequence of independent, identically distributed Bernoulli random variables with P_θ(X_{n+1} = 1) = θ. We may predict X_{n+1} using the maximum likelihood (ML) estimator based on the past, i.e. using θ̂(x^n) = n^{−1} ∑_{i=1}^n x_i. The prediction for x_1 is then undefined. If we use a smoothed ML estimator such as the Laplace estimator, θ̂′(x^n) = (n + 2)^{−1}( ∑_{i=1}^n x_i + 1 ), then all predictions are well-defined. It is well-known that the predictor p̄′ defined by p̄′(X_{n+1} | x^n) = p_{θ̂′(x^n)}(X_{n+1}) equals the Bayesian predictive distribution based on a uniform prior. Thus in this case a Bayesian predictor and an estimation-based predictor coincide!

In general, for a k-dimensional parametric model M_k, we can define p̄_k(X_{n+1} | x^n) = p_{θ̂′_k(x^n)}(X_{n+1}) for some smoothed ML estimator θ̂′_k. The joint distribution with density p̄_k(x^n) will then resemble, but in general not be precisely equal to, the Bayes marginal distribution with density p̄_k(x^n) under some prior on M_k [39].
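Example 14 is easy to reproduce. The sketch below (illustrative) emits the Laplace predictions sequentially and accumulates the corresponding code length in bits.

import math

def laplace_predictions(xs):
    # before each outcome, emit P(X_{n+1} = 1 | x^n) = (#ones + 1)/(n + 2),
    # the Bayes predictive distribution under a uniform prior on theta
    ones = 0
    for n, x in enumerate(xs):
        yield (ones + 1) / (n + 2)
        ones += x

# accumulate -log2 p-bar(x^n) for a toy sequence:
xs = [1, 0, 1, 1, 1, 0, 1]
bits = -sum(math.log2(p if x else 1 - p)
            for p, x in zip(laplace_predictions(xs), xs))
print(round(bits, 2))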

5.1.3 The Switch-Distribution

Suppose p_1, p_2, ... is a list of prediction strategies for X^∞. (Although here the list is infinitely long, the developments below can with little modification be adjusted to the case where the list is finite.) We first define a family Q = {q_s : s ∈ S} of combinator prediction strategies that switch between the original prediction strategies. Here the parameter space S is defined as

    S = { ((t_1, k_1), ..., (t_m, k_m)) ∈ (N × Z+)^m | m ∈ Z+, 0 = t_1 < ... < t_m },      (5.2)

where m is called the number of switch-points, t_1, ..., t_m are the switch-points themselves and for each i = 1, ..., m, k_i is the index of the prediction strategy associated with switch-point t_i. For each s ∈ S we define a corresponding prediction strategy q_s : X^* → [0, 1] as follows:

    q_s(X_{n+1} | x^n) = p_{k_1}(X_{n+1} | x^n)   if n < t_2,
                         p_{k_2}(X_{n+1} | x^n)   if t_2 ≤ n < t_3,
                         ...
                         p_{k_m}(X_{n+1} | x^n)   if t_m ≤ n.      (5.3)

Definition 5.1.1 (Switch-Distribution). Let π be a probability mass function on S. Then the switch-distribution P_sw with prior π is the distribution for X^∞ such that, for any n ∈ Z+, the density of its marginal distribution for X^n is given by

    p_sw(x^n) = ∑_{s∈S} q_s(x^n) · π(s).      (5.4)

Although the switch-distribution provides a general way to combine prediction strategies (see Section 5.6), in this paper it will only be applied to combine prediction strategies p̄_1, p̄_2, ... that correspond to parametric models. In this case we may define a corresponding model selection criterion δ_sw. To this end, let K_{n+1} : S → Z+ be a random variable that denotes the strategy/model that is used to predict X_{n+1} given past observations x^n. Formally, K_{n+1}(s) = k_i(s) iff t_i(s) ≤ n and either i = m(s) or n < t_{i+1}(s). The posterior probability that K_{n+1} = k is then given by

$$\pi(K_{n+1} = k \mid x^n) = \frac{\sum_{\{s : K_{n+1}(s) = k\}} \pi(s)\, q_s(x^n)}{p_{\mathrm{sw}}(x^n)}, \qquad (5.5)$$

which is defined whenever $p_{\mathrm{sw}}(x^n)$ is non-zero. We turn this into a model selection criterion
$$\delta_{\mathrm{sw}}(x^n) = \arg\max_k\, \pi(K_{n+1} = k \mid x^n)$$
that selects the model with maximum posterior probability.
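For a small finite parameter set, (5.4) and (5.5) can be evaluated by brute force. The following sketch does exactly that; the experts, the four-element set $\mathcal{S}$ and the uniform prior are all illustrative:

```python
# Brute-force evaluation of the switch-distribution (5.4) and the
# posterior (5.5) over a small hand-picked parameter set S.

experts = {
    1: lambda x, xs: 0.5,
    2: lambda x, xs: ((sum(xs) + 1) / (len(xs) + 2) if x == 1
                      else 1 - (sum(xs) + 1) / (len(xs) + 2)),
}

def active_expert(s, n):
    # The expert k_i whose segment contains n (last switch-point t_i <= n).
    return next(k_i for t_i, k_i in reversed(s) if t_i <= n)

def q_s_joint(s, xs):
    """q_s(x^n): multiply the active expert's prediction for each outcome."""
    p = 1.0
    for n, x in enumerate(xs):
        p *= experts[active_expert(s, n)](x, xs[:n])
    return p

S = [[(0, 1)], [(0, 2)], [(0, 1), (2, 2)], [(0, 2), (2, 1)]]
pi = {i: 1 / len(S) for i in range(len(S))}

xs = [1, 1, 0, 1, 1]
p_sw = sum(pi[i] * q_s_joint(s, xs) for i, s in enumerate(S))
post = {k: sum(pi[i] * q_s_joint(s, xs)
               for i, s in enumerate(S) if active_expert(s, len(xs)) == k) / p_sw
        for k in experts}
print(p_sw, post)  # delta_sw selects the k maximising post[k]
```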

5.2 Consistency

If one of the models, say with index $k^*$, is actually true, then it is natural to ask whether $\delta_{\mathrm{sw}}$ is consistent, in the sense that it asymptotically selects $k^*$ with probability 1. Theorem 5.2.1 states that, if the prediction strategies $\bar{p}_k$ associated with the models are Bayesian predictive distributions, then $\delta_{\mathrm{sw}}$ is consistent under certain conditions which are only slightly stronger than those required for standard Bayes factor model selection consistency. Theorem 5.2.2 extends the result to the situation where the $\bar{p}_k$ are not necessarily Bayesian.

Bayes factor model selection is consistent if for all $k' \ne k$, $\bar{P}_k(X^\infty)$ and $\bar{P}_{k'}(X^\infty)$ are mutually singular, that is, if there exists a measurable set $A \subseteq \mathcal{X}^\infty$ such that $\bar{P}_k(A) = 1$ and $\bar{P}_{k'}(A) = 0$ [4]. For example, this can usually be shown to hold if (a) the models are nested and (b) for each $k$, $\Theta_k$ is a subset of $\Theta_{k+1}$ of $w_{k+1}$-measure 0. In most interesting applications in which (a) holds, (b) also holds [39]. For consistency of $\delta_{\mathrm{sw}}$, we need to strengthen this mutual singularity condition to a "conditional" mutual singularity condition: we require that, for all

$k \ne k'$ and all $x^n \in \mathcal{X}^n$, the distributions $\bar{P}_k(X^\infty_{n+1} \mid x^n)$ and $\bar{P}_{k'}(X^\infty_{n+1} \mid x^n)$ are mutually singular. For example, if $X_1, X_2, \ldots$ are independent and identically distributed (i.i.d.) according to each $P_\theta$ in all models, but also if $\mathcal{X}$ is countable and $\bar{p}_k(x_{n+1} \mid x^n) > 0$ for all $k$ and all $x^{n+1} \in \mathcal{X}^{n+1}$, then this conditional mutual singularity is automatically implied by ordinary mutual singularity of $\bar{P}_k(X^\infty)$ and $\bar{P}_{k'}(X^\infty)$.

Let $E_s = \{s' \in \mathcal{S} \mid m(s') > m(s),\ (t_i(s'), k_i(s')) = (t_i(s), k_i(s)) \text{ for } i = 1, \ldots, m(s)\}$ denote the set of all possible extensions of $s$ to more switch-points. Let $\bar{p}_1, \bar{p}_2, \ldots$ be Bayesian prediction strategies with respective parameter spaces $\Theta_1, \Theta_2, \ldots$ and priors $w_1, w_2, \ldots$, and let $\pi$ be the prior of the corresponding switch-distribution.

Theorem 5.2.1 (Consistency of the Switch-Distribution). Suppose $\pi$ is positive everywhere on $\{s \in \mathcal{S} \mid m(s) = 1\}$ and such that, for some positive constant $c$, for every $s \in \mathcal{S}$, $c \cdot \pi(s) \ge \pi(E_s)$. Suppose further that $\bar{P}_k(X^\infty_{n+1} \mid x^n)$ and $\bar{P}_{k'}(X^\infty_{n+1} \mid x^n)$ are mutually singular for all $k, k' \in \mathbb{Z}^+$, $k \ne k'$, $x^n \in \mathcal{X}^n$. Then, for all $k^* \in \mathbb{Z}^+$, for all $\theta^* \in \Theta_{k^*}$ except for a subset of $\Theta_{k^*}$ of $w_{k^*}$-measure 0, the posterior distribution on $K_{n+1}$ satisfies

$$\pi(K_{n+1} = k^* \mid X^n) \xrightarrow{\ n \to \infty\ } 1 \quad \text{with } P_{\theta^*}\text{-probability 1.} \qquad (5.6)$$

The requirement that $c \cdot \pi(s) \ge \pi(E_s)$ is automatically satisfied if $\pi$ is of the form
$$\pi(s) = \pi_m(m)\, \pi_k(k_1) \prod_{i=2}^m \pi_t(t_i \mid t_i > t_{i-1})\, \pi_k(k_i), \qquad (5.7)$$
where $\pi_m$, $\pi_k$ and $\pi_t$ are priors on $\mathbb{Z}^+$ with full support, and $\pi_m$ is geometric: $\pi_m(m) = \theta^{m-1}(1-\theta)$ for some $0 \le \theta < 1$. In this case $c = \theta/(1-\theta)$.

We now extend the theorem to the case where the universal distributions $\bar{p}_1, \bar{p}_2, \ldots$ are not necessarily Bayesian, i.e. they are not necessarily of the form (5.1). It turns out that the "meta-Bayesian" universal distribution $P_{\mathrm{sw}}$ is still consistent, as long as the following condition holds. The condition essentially expresses that, for each $k$, $\bar{p}_k$ must not be too different from a Bayesian predictive distribution based on (5.1). This can be verified if all models $\mathcal{M}_k$ are exponential families (as in, for example, density estimation problems), and the $\bar{p}_k$ represent ML or smoothed ML estimators (see Theorems 2.1 and 2.2 of [57]). We suspect that it holds as well for more general parametric models and universal codes, but we do not know of any proof.

Condition. There exist Bayesian prediction strategies $\bar{p}^B_1, \bar{p}^B_2, \ldots$ of form (5.1), with continuous and strictly positive priors $w_1, w_2, \ldots$, such that

1. The conditions of Theorem 5.2.1 hold for $\bar{p}^B_1, \bar{p}^B_2, \ldots$ and the chosen switch-distribution prior $\pi$.

2. For all $k \in \mathbb{N}$, for each compact subset $\Theta'$ of the interior of $\Theta_k$, there exists a $K$ such that for all $\theta \in \Theta'$, with $\theta$-probability 1, for all $n$: $-\log \bar{p}_k(X^n) + \log \bar{p}^B_k(X^n) \le K$.

3. For all $k, k' \in \mathbb{N}$ with $k \ne k'$, $\bar{p}_k$ and $\bar{p}^B_{k'}$ are mutually singular.

Theorem 5.2.2 (Consistency of the Switch-Distribution, Part 2). Let $\bar{p}_1, \bar{p}_2, \ldots$ be prediction strategies and let $\pi$ be the prior of the corresponding switch-distribution. Suppose that the condition above holds relative to $\bar{p}_1, \bar{p}_2, \ldots$ and $\pi$. Then, for all $k^* \in \mathbb{N}$, for all $\theta^* \in \Theta_{k^*}$ except for a subset of $\Theta_{k^*}$ of Lebesgue-measure 0, the posterior distribution on $K_{n+1}$ satisfies

$$\pi(K_{n+1} = k^* \mid X^n) \xrightarrow{\ n \to \infty\ } 1 \quad \text{with } P_{\theta^*}\text{-probability 1.} \qquad (5.8)$$

5.3 Optimal Risk Convergence Rates

In this section we investigate how well the switch-distribution is able to predict future data in terms of its accumulated KL-risk, which will be formally defined shortly. We first compare the predictive performance of the switch-distribution to that achieved by Bayesian model averaging in Section 5.3.1, showing that, reassuringly, the summed risk achieved by $P_{\mathrm{sw}}$ is never more than a small constant higher than that achieved by $P_{\mathrm{bma}}$. Then in Section 5.3.2 we describe the general setup and establish a lemma that is used as a general tool in the analysis. Section 5.3.3 treats the case where the data are sampled from a density $p^*$ which is an element of one of the considered models $\mathcal{M}_{k^*}$ for some $k^* \in \mathbb{Z}^+$. In this case we already know from the previous section that the switch-distribution is typically consistent; here we show that it will also avoid the catch-up phenomenon as described in the introduction. Then in Section 5.3.4, we look at the situation where $p^*$ is not in any of the considered models. For this harder, nonparametric case we compare the loss of the switch-distribution to that of any other model selection criterion, showing that under some conditions that depend on this reference criterion, the two losses are of the same order of magnitude. Finally, in Section 5.3.5 we apply our results to the problem of histogram density estimation, showing that for the class of densities that are (uniformly) bounded away from zero and infinity and have bounded first derivatives, the switch-distribution (based on histogram models with Bayesian estimators that have uniform priors) predicts essentially as well as any other procedure whatsoever.

The setup is as follows. Suppose $X_1, X_2, \ldots$ are distributed according to a distribution $P^*$ with density $p^* \in \mathcal{M}^*$, where $\mathcal{M}^*$ is an arbitrary set of densities on $\mathcal{X}^\infty$. Specifically, $X_1, X_2, \ldots$ do not have to be i.i.d. We abbreviate "$P^*$ described by density $p^* \in \mathcal{M}^*$" to "$P^* \in \mathcal{M}^*$".

For prediction we use a sequence of parametric models $\mathcal{M}_1, \mathcal{M}_2, \ldots$ with associated estimators $\bar{P}_1, \bar{P}_2, \ldots$ as before. We write $\mathcal{M} = \bigcup_{i=1}^\infty \mathcal{M}_i$. In Section 5.3.3 we assume that $P^* \in \mathcal{M}$, while in Section 5.3.4 we assume that this is not the case, i.e. $P^* \in \mathcal{M}^* \setminus \mathcal{M}$.

Given $X^{n-1} = x^{n-1}$, we will measure how well any estimator $\bar{P}$ predicts $X_n$ in terms of the Kullback-Leibler (KL) divergence $D(P^*(X_n = \cdot \mid x^{n-1}) \,\|\, \bar{P}(X_n = \cdot \mid x^{n-1}))$ [6]. Suppose that $P$ and $Q$ are distributions for some random variable $Y$, with densities $p$ and $q$ respectively. Then the KL divergence from $P$ to $Q$ is

$$D(P \,\|\, Q) = E_P\!\left[\log \frac{p(Y)}{q(Y)}\right]. \qquad (5.9)$$

KL divergence is never negative, and reaches zero if and only if $P$ equals $Q$. Taking an expectation over $X^{n-1}$ leads to the following (standard) definition of the risk of estimator $\bar{P}$ at sample size $n$ relative to KL divergence:

$$R_n(P^*, \bar{P}) = \mathop{E}_{X^{n-1} \sim P^*}\Big[ D\big(P^*(X_n = \cdot \mid X^{n-1}) \,\big\|\, \bar{P}(X_n = \cdot \mid X^{n-1})\big) \Big]. \qquad (5.10)$$

The following identity connects accumulated statistical KL-risk to information-theoretic redundancy (see e.g. [6] or [39, Chapter 15]): for all $n$ we have

$$\sum_{i=1}^n R_i(P^*, \bar{P}) = \sum_{i=1}^n E\left[\log \frac{p^*(X_i \mid X^{i-1})}{\bar{p}(X_i \mid X^{i-1})}\right] = E\left[\log \frac{p^*(X^n)}{\bar{p}(X^n)}\right] = D\big(P^{*(n)} \,\big\|\, \bar{P}^{(n)}\big), \qquad (5.11)$$

where the superscript $(n)$ denotes taking the marginal of the distribution on the first $n$ outcomes.
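Identity (5.11) can be verified exactly on a small example. The following sketch (with an i.i.d. Bernoulli(0.7) source and the Laplace estimator of Example 14, both illustrative choices) computes both sides by enumerating all $x^n$:

```python
# Exact numerical check of (5.11) for an i.i.d. Bernoulli(0.7) source p*
# and the Laplace estimator as p-bar, with n = 5.
from itertools import product
from math import log

theta_star, n = 0.7, 5

def p_star(xs):  # p*(x^n) for i.i.d. Bernoulli(theta_star)
    s = sum(xs)
    return theta_star ** s * (1 - theta_star) ** (len(xs) - s)

def p_bar_cond(x, xs):  # Laplace predictor p-bar(x | past)
    q = (sum(xs) + 1) / (len(xs) + 2)
    return q if x == 1 else 1 - q

# Right-hand side: redundancy E[log p*(X^n) / p-bar(X^n)].
rhs = 0.0
for xs in product([0, 1], repeat=n):
    pb = 1.0
    for i, x in enumerate(xs):
        pb *= p_bar_cond(x, xs[:i])
    rhs += p_star(xs) * log(p_star(xs) / pb)

# Left-hand side: accumulated KL-risk, summing over sample sizes.
lhs = 0.0
for i in range(n):
    for xs in product([0, 1], repeat=i):  # all pasts x^{i-1}
        kl = sum(p * log(p / p_bar_cond(x, xs))
                 for x, p in ((1, theta_star), (0, 1 - theta_star)))
        lhs += p_star(xs) * kl

print(lhs, rhs)  # agree up to floating-point error
```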

5.3.1 The Switch-Distribution vs Bayesian Model Averaging

Here we show that the summed risk achieved by the switch-distribution is never much higher than that of Bayesian model averaging, which is itself never much higher than that of any of the estimators $\bar{P}_k$ under consideration.

Lemma 5.3.1. Let $P_{\mathrm{sw}}$ be the switch-distribution for $\bar{P}_1, \bar{P}_2, \ldots$ with prior $\pi$ of the form (5.7). Let $P_{\mathrm{bma}}$ be the Bayesian model averaging distribution for the same estimators, defined with respect to the same prior on the estimators, $\pi_k$. Then, for all $n \in \mathbb{Z}^+$, all $x^n \in \mathcal{X}^n$, and all $k \in \mathbb{Z}^+$,

$$p_{\mathrm{sw}}(x^n) \ \ge\ \pi_m(1)\, p_{\mathrm{bma}}(x^n) \ \ge\ \pi_m(1)\, \pi_k(k)\, \bar{p}_k(x^n).$$

Consequently, for all $P^* \in \mathcal{M}^*$ we have
$$\sum_{i=1}^n R_i(P^*, P_{\mathrm{sw}}) \ \le\ \sum_{i=1}^n R_i(P^*, P_{\mathrm{bma}}) - \log \pi_m(1) \ \le\ \sum_{i=1}^n R_i(P^*, \bar{P}_k) - \log \pi_m(1) - \log \pi_k(k).$$

Proof. For the first part we underestimate sums:

$$p_{\mathrm{sw}}(x^n) = \sum_{s \in \mathcal{S}} q_s(x^n)\,\pi(s) \ \ge\ \pi_m(1) \cdot \sum_{k' \in \mathbb{Z}^+} \pi_k(k')\, \bar{p}_{k'}(x^n) \ =\ \pi_m(1) \cdot p_{\mathrm{bma}}(x^n),$$
$$p_{\mathrm{bma}}(x^n) = \sum_{k' \in \mathbb{Z}^+} \bar{p}_{k'}(x^n)\, \pi_k(k') \ \ge\ \pi_k(k)\, \bar{p}_k(x^n).$$

We apply (5.11) to obtain the difference in summed risk:
$$\sum_{i=1}^n R_i(P^*, P_{\mathrm{sw}}) = E\left[\log \frac{p^*(X^n)}{p_{\mathrm{sw}}(X^n)}\right] \le E\left[\log \frac{p^*(X^n)}{\pi_m(1)\, p_{\mathrm{bma}}(X^n)}\right] = \sum_{i=1}^n R_i(P^*, P_{\mathrm{bma}}) - \log \pi_m(1),$$
$$\sum_{i=1}^n R_i(P^*, P_{\mathrm{bma}}) = E\left[\log \frac{p^*(X^n)}{p_{\mathrm{bma}}(X^n)}\right] \le E\left[\log \frac{p^*(X^n)}{\pi_k(k)\, \bar{p}_k(X^n)}\right] = \sum_{i=1}^n R_i(P^*, \bar{P}_k) - \log \pi_k(k).$$

As mentioned in the introduction, one advantage of model averaging using $p_{\mathrm{bma}}$ is that it always predicts almost as well as the estimator $\bar{p}_k$ for any $k$, including the $\bar{p}_k$ that yields the best predictions overall. Lemma 5.3.1 shows that this property is shared by $p_{\mathrm{sw}}$, which multiplicatively dominates $p_{\mathrm{bma}}$. In the following sections, we will investigate under which circumstances the switch-distribution may achieve a lower summed risk than Bayesian model averaging.

5.3.2 Improved Convergence Rate: Preliminaries

Throughout our analysis of the achieved rate of convergence we will require that the prior of the switch-distribution, $\pi$, can be factored as in (5.7), and is chosen to satisfy
$$-\log \pi_m(m) = O(m), \quad -\log \pi_k(k) = O(\log k), \quad -\log \pi_t(t) = O(\log t). \qquad (5.12)$$

Thus $\pi_m$, the prior on the total number of distinct predictors, is allowed to decrease either exponentially (as required for Theorem 5.2.1) or polynomially, but $\pi_t$ and $\pi_k$ cannot decrease faster than polynomially. For example, we could set $\pi_t(t) = 1/(t(t+1))$ and $\pi_k(k) = 1/(k(k+1))$, or we could take the universal prior on the integers [68].

As competitors to the switch-distribution we introduce a slight generalisation of model selection criteria:

Definition 5.3.2 (Oracle). An oracle is a function $\sigma : \mathcal{M}^* \times \mathcal{X}^* \to \mathbb{Z}^+$ that is given not only the observed data $x^n \in \mathcal{X}^*$, but also the generating distribution $P^* \in \mathcal{M}^*$, which it may use to choose a model index $\sigma(P^*, x^n) \in \mathbb{Z}^+$ for all $n \in \mathbb{Z}^+$.

Given an oracle $\sigma$, for any $P^*$ and $n$, $x^{n-1}$, we abbreviate $\sigma_i = \sigma(P^*, x^{i-1})$ for $1 \le i \le n$. We define $P_\sigma$ as the distribution on $X^\infty$ with marginal densities $p_\sigma(x^n) = \prod_{i=1}^n p_{\sigma_i}(x_i \mid x^{i-1})$ for all $n$, $x^n$. Furthermore, we may split the sequence $\sigma_1, \ldots, \sigma_n$ into segments where the same model is chosen. Now let $m_n(\sigma)$ be the maximum number of such distinct segments over all $P^*$ and all $x^{n-1} \in \mathcal{X}^{n-1}$. That is, let

$$m_n(\sigma) = \max_{P^*} \max_{x^{n-1} \in \mathcal{X}^{n-1}} \big|\{1 \le i \le n-1 : \sigma_i \ne \sigma_{i+1}\}\big| + 1. \qquad (5.13)$$

(The maximum always exists, because for any $P^*$ and $x^{n-1}$ the number of segments is at most $n$.) The following lemma expresses that any oracle $\sigma$ that does not select overly complex models can be approximated by the switch-distribution with a maximum overhead that depends on $m_n(\sigma)$, its maximum number of segments. We will typically be interested in oracles $\sigma$ such that this maximum is small in comparison to the sample size $n$. The lemma is a tool in establishing the convergence rate of $P_{\mathrm{sw}}$, both in the parametric and the nonparametric contexts considered below.
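The example priors satisfying (5.12) and the segment count in (5.13) are both simple to compute. The following sketch counts segments for one realised sequence $\sigma_1, \ldots, \sigma_n$ (note that (5.13) additionally maximises over $P^*$ and $x^{n-1}$):

```python
# Illustrative helpers for (5.12) and (5.13).

def pi_poly(j):
    # 1/(j(j+1)): a proper prior on j = 1, 2, ... with -log pi(j) = O(log j).
    return 1.0 / (j * (j + 1))

def num_segments(sigma):
    """Number of maximal constant runs in a realised sigma_1, ..., sigma_n."""
    return 1 + sum(1 for a, b in zip(sigma, sigma[1:]) if a != b)

sigma = [1, 1, 1, 2, 2, 3, 3, 3, 3]  # models chosen in order of complexity
print(num_segments(sigma))            # 3
print(pi_poly(1), pi_poly(10))        # 0.5, ~0.009
```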

Lemma 5.3.3. Let $P_{\mathrm{sw}}$ be the switch-distribution, defined with respect to a sequence of estimators $\bar{P}_1, \bar{P}_2, \ldots$ as introduced above, with any prior $\pi$ that satisfies the conditions in (5.12), and let $P^* \in \mathcal{M}^*$. Suppose $\tau$ is a positive real number and $\sigma$ is an oracle such that

$$\sigma(P^*, x^{i-1}) \le i^\tau \qquad (5.14)$$

for all $i \in \mathbb{Z}^+$ and all $x^{i-1} \in \mathcal{X}^{i-1}$. Then
$$\sum_{i=1}^n R_i(P^*, P_{\mathrm{sw}}) = \sum_{i=1}^n R_i(P^*, P_\sigma) + m_n(\sigma) \cdot O(\log n), \qquad (5.15)$$
where the multiplicative constant in the big-$O$ notation depends only on $\tau$ and the constants implicit in (5.12).

Proof. Using (5.11) we can rewrite (5.15) into the equivalent claim
$$E\left[\log \frac{p_\sigma(X^n)}{p_{\mathrm{sw}}(X^n)}\right] = m_n(\sigma) \cdot O(\log n), \qquad (5.16)$$
which we proceed to prove. For all $n$, $x^n \in \mathcal{X}^n$, there exists an $s \in \mathcal{S}$ with $m(s) \le m_n(\sigma)$ that represents the same sequence of models as $\sigma$, so that $q_s(x_i \mid x^{i-1}) = p_{\sigma_i}(x_i \mid x^{i-1})$ for $1 \le i \le n$. Consequently, we can bound
$$p_{\mathrm{sw}}(x^n) = \sum_{s' \in \mathcal{S}} q_{s'}(x^n) \cdot \pi(s') \ \ge\ q_s(x^n)\,\pi(s) \ =\ p_\sigma(x^n)\,\pi(s). \qquad (5.17)$$
By assumption (5.14), $\sigma$, and therefore $s$, never selects a model $\mathcal{M}_k$ with index $k$ larger than $i^\tau$ to predict the $i$th outcome. Together with (5.12) this implies that
$$-\log \pi(s) = -\log \pi_m(m(s)) - \log \pi_k(k_1(s)) + \sum_{j=2}^{m(s)} \Big( -\log \pi_t\big(t_j(s) \mid t_j(s) > t_{j-1}(s)\big) - \log \pi_k(k_j(s)) \Big)$$

$$= O(m(s)) + \sum_{j=1}^{m(s)} \Big( O(\log t_j(s)) + O(\log k_j(s)) \Big) = O(m(s)) + \sum_{j=1}^{m(s)} \Big( O(\log t_j(s)) + O\big(\log (t_j(s)+1)^\tau\big) \Big) = m_n(\sigma) \cdot O(\log n), \qquad (5.18)$$

where the multiplicative constant in the big-$O$ in the final expression depends only on $\tau$ and the multiplicative constants in (5.12). Together, (5.17) and (5.18) imply (5.16), which was to be shown.

In the following subsections, we compare the accumulated risk of $P_{\mathrm{sw}}$ to that of $P_\sigma$ for various oracles $\sigma$; in all these cases our results are independent of the data generating distribution $P^* \in \mathcal{M}^*$. For that reason it will be convenient to define the worst-case summed risk of the switch-distribution and of oracle $\sigma$:
$$G_{\mathrm{sw}}(n) := \sup_{P^* \in \mathcal{M}^*} \sum_{i=1}^n R_i(P^*, P_{\mathrm{sw}}), \quad \text{and} \qquad (5.19)$$
$$G_\sigma(n) := \sup_{P^* \in \mathcal{M}^*} \sum_{i=1}^n R_i(P^*, P_\sigma). \qquad (5.20)$$
We will also compare the accumulated risk of $P_{\mathrm{sw}}$ to the minimax risk in sum, defined as
$$G_{\text{mm-fix}}(n) := \inf_{\bar{P}} \sup_{P^* \in \mathcal{M}^*} \sum_{i=1}^n R_i(P^*, \bar{P}). \qquad (5.21)$$

Here the infimum is over all prequential forecasting systems $\bar{P}$ for which, for each $n$ and $x^{n-1} \in \mathcal{X}^{n-1}$, $\bar{P}(X_n = \cdot \mid X^{n-1} = x^{n-1})$ admits a density. Equivalently, the infimum is over all sequences of $n$ estimators $\bar{P}(X_1), \bar{P}(X_2 \mid X^1), \ldots, \bar{P}(X_n \mid X^{n-1})$. Note that there is no requirement that $\bar{P}$ maps $x^n$ to a distribution in $\mathcal{M}^*$ or $\mathcal{M}$; we are looking at the worst case over all possible estimators, irrespective of the model $\mathcal{M}$ used to approximate $\mathcal{M}^*$. Thus, we may call $\bar{P}$ an "out-model estimator" [39]. The notation $G_{\text{mm-fix}}$ will be clarified in Section 5.4, where we compare convergence in sum with more standard notions of convergence.

5.3.3 The Parametric Case

Here we assume that $P^* \in \mathcal{M}_{k^*}$ for some $k^* \in \mathbb{Z}^+$, but we also consider that if $\mathcal{M}_1, \mathcal{M}_2, \ldots$ are of increasing complexity, then the catch-up phenomenon may occur, meaning that at small sample sizes, some estimator $\bar{P}_k$ with $k < k^*$ may achieve lower risk than $\bar{P}_{k^*}$. The following lemma shows that the summed risk of the switch-distribution is never much higher than that of the best oracle that iterates through the models in order of increasing complexity.

Lemma 5.3.4. Let $P_{\mathrm{sw}}$ be the switch-distribution, defined with respect to a sequence of estimators $\bar{P}_1, \bar{P}_2, \ldots$ as above, with prior $\pi$ satisfying (5.12). Let $k^* \in \mathbb{Z}^+$, and let $\sigma$ be any oracle such that, for all $P^*$ and all $x^\infty$, $\sigma(P^*, x^n)$ is monotonically nondecreasing in $n$ and, for sufficiently large $n$, $\sigma(P^*, x^n) = k^*$. Then

$$G_{\mathrm{sw}}(n) - G_\sigma(n) \ \le\ \sup_{P^* \in \mathcal{M}^*} \left[ \sum_{i=1}^n R_i(P^*, P_{\mathrm{sw}}) - \sum_{i=1}^n R_i(P^*, P_\sigma) \right] \ =\ k^* \cdot O(\log n).$$

In particular, if for some $c' > 0$, for all sufficiently large $n$, $G_\sigma(n) \ge c' \log n$ (i.e. $G_\sigma = \Omega(\log n)$), then there is a $c$ such that

$$\limsup_{n \to \infty} \frac{G_{\mathrm{sw}}(n)}{G_\sigma(n)} \le c.$$

The additional risk compared to $P_\sigma$ is of order $\log n$. In the parametric case, we often have $G_{\text{mm-fix}}(n)$ proportional to $\log n$ (Section 5.4.1). If that is the case, and if, as seems reasonable, there is an oracle $\sigma$ that satisfies the given restrictions and achieves summed risk proportional to $G_{\text{mm-fix}}(n)$, then the switch-distribution also achieves summed risk proportional to $G_{\text{mm-fix}}(n)$.

Proof. The first inequality is a consequence of the general rule that for two functions $f$ and $g$ we have $\sup_x f(x) - \sup_x g(x) \le \sup_x (f(x) - g(x))$. We proceed to show the asymptotics of the second term, which has the supremum on the outside. Let $s = (0, k^*)$. We have, uniformly for all $n$, $x^n \in \mathcal{X}^n$,
$$-\log p_{\mathrm{sw}}(x^n) = -\log \sum_{s' \in \mathcal{S}} q_{s'}(x^n)\,\pi(s') \ \le\ -\log q_s(x^n) - \log \pi(s). \qquad (5.22)$$
Since $\sigma$ satisfies (5.14) for suitably chosen $\tau$, and the properties of $\sigma$ ensure $m_n(\sigma) \le k^*$, we can apply Lemma 5.3.3. The first part of the lemma is then obtained by taking the supremum over $P^*$. To establish the second part, we can choose a $c$ such that

$$\limsup_{n\to\infty} \frac{G_{\mathrm{sw}}(n)}{G_\sigma(n)} \ \le\ \limsup_{n\to\infty} \frac{G_\sigma(n) + k^* \cdot O(\log n)}{G_\sigma(n)} \ \le\ 1 + \limsup_{n\to\infty} \frac{k^* \cdot O(\log n)}{c' \log n} \ =\ c. \qquad (5.23)$$

5.3.4 The Nonparametric Case

In this section we develop an analogue of Lemma 5.3.4 for the nonparametric case, where there is no $k$ such that $P^* \in \mathcal{M}_k$. It is then applied to the problem of histogram density estimation.

Lemma 5.3.5. Let $P_{\mathrm{sw}}$ be the switch-distribution, defined with respect to a sequence of estimators $\bar{P}_1, \bar{P}_2, \ldots$ as above, with any prior $\pi$ that satisfies the conditions in (5.12). Let $f : \mathbb{Z}^+ \to [0, \infty)$ and let $\mathcal{M}^*$ be a set of distributions on $\mathcal{X}^\infty$. Suppose there exist an oracle $\sigma$ and positive constants $\tau$ and $c$ such that

(i) $\sigma(P^*, x^{i-1}) \le i^\tau$ for all $i \in \mathbb{Z}^+$ and all $x^{i-1} \in \mathcal{X}^{i-1}$;

(ii) $m_n(\sigma) \log n = o(f(n))$; and

(iii) $\displaystyle\limsup_{n\to\infty} \frac{G_\sigma(n)}{f(n)} \le c$.

Then
$$\limsup_{n\to\infty} \frac{G_{\mathrm{sw}}(n)}{f(n)} \le c. \qquad (5.24)$$

Proof. By (i) we can apply Lemma 5.3.3 to $\sigma$. Using conditions (ii) and (iii), a derivation similar to (5.23) completes the proof:

$$\limsup_{n\to\infty} \frac{G_{\mathrm{sw}}(n)}{f(n)} \ \le\ \limsup_{n\to\infty} \frac{G_\sigma(n) + m_n(\sigma)\, O(\log n)}{f(n)} \ \le\ c + \limsup_{n\to\infty} \frac{m_n(\sigma)\, O(\log n)}{f(n)} \ =\ c.$$

Lemma 5.3.5 can be used to show minimax rates of convergence relative to specific nonparametric model classes $\mathcal{M}^*$. The general idea is to apply the lemma with $f(n)$ equal to the minimax risk in sum $G_{\text{mm-fix}}(n)$ (see (5.21)). It will be seen that in many standard nonparametric settings, one can exhibit an oracle $\sigma$ that switches only sporadically (condition (ii) of the lemma) and that achieves $G_{\text{mm-fix}}(n)$ (condition (iii)). The lemma then implies that $P_{\mathrm{sw}}$ achieves the minimax risk as well. As a proof of concept, we now show this in detail for histogram density estimation. In Section 5.4.2, we discuss possibilities for extending the reasoning to more general settings.

5.3.5 Example: Histogram Density Estimation

Rissanen, Speed and Yu [74] consider density estimation based on histogram models with equal-width bins relative to a restricted set $\mathcal{M}^*$ of "true" distributions, identified in this section by their densities on the unit interval $\mathcal{X} = [0,1]$. The restriction on $\mathcal{M}^*$ is that there should exist constants $0 < c_1 \le c_2 < \infty$ and $c$ such that every $p^* \in \mathcal{M}^*$ satisfies $c_1 \le p^*(x) \le c_2$ for all $x \in [0,1]$ and has a first derivative bounded in absolute value by $c$; the data are assumed to be i.i.d. according to $p^*$. The model $\mathcal{M}_k$ contains the densities on $[0,1]$ that are constant within each of $k$ equal-width bins. As estimator $\bar{p}_k$ we use the Bayesian marginal distribution based on the uniform prior on the parameters of $\mathcal{M}_k$, for which the predictive distribution takes the simple form
$$\bar{p}_k(x_{n+1} \mid x^n) = \frac{n_j(x^n) + 1}{n + k} \cdot k \quad \text{for } x_{n+1} \text{ in bin } j, \qquad (5.25)$$
where $n_j(x^n)$ denotes the number of outcomes in $x^n$ that fall into bin $j$. Rissanen, Speed and Yu show that the minimax risk in sum relative to $\mathcal{M}^*$ is of order $n^{1/3}$, and that it is achieved, up to a constant factor, by the prediction strategy that predicts outcome $i$ using a histogram model with approximately $i^{1/3}$ bins:

Theorem 5.3.6. For all $p^* \in \mathcal{M}^*$,
$$\limsup_{n\to\infty} \frac{\sum_{i=1}^n R_i\big(P^*, \bar{P}_{\lceil i^{1/3} \rceil}\big)}{n^{1/3}} \le A_{p^*}, \qquad (5.26)$$
where the constant $A_{p^*}$ depends only on $p^*$.

We will now show that the switch-distribution also achieves the minimax risk in sum up to the same constant factor. The idea is to view $\bar{P}_{\lceil n^{1/3} \rceil}$ as an oracle. Even though it makes no use of $P^*$ in its selection, $\bar{P}_{\lceil n^{1/3} \rceil}$ clearly satisfies Definition 5.3.2, the definition of an oracle. We would like to apply Lemma 5.3.5 to the oracle $\bar{P}_{\lceil n^{1/3} \rceil}$, but we cannot do so, since this oracle switches prediction strategies polynomially often in $n$. To prove that $P_{\mathrm{sw}}$ achieves a rate of $n^{1/3}$, we first need to consider a cruder version of $\bar{P}_{\lceil n^{1/3} \rceil}$ that still achieves the minimax rate, but only switches logarithmically often. This is the key to the proof of Theorem 5.3.7.
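The histogram predictive density of the form (5.25) is straightforward to implement. The following sketch does so (details such as the bin-boundary convention are our own choices):

```python
# Sketch of the k-bin Bayesian histogram predictive density on [0, 1]
# with a uniform (Dirichlet(1, ..., 1)) prior, as in (5.25).

def hist_predictive(x_next, xs, k):
    """Predictive density p-bar_k(x_next | xs) with k equal-width bins."""
    def bin_of(x):
        return min(int(x * k), k - 1)  # clamp x = 1.0 into the last bin
    n = len(xs)
    n_j = sum(1 for x in xs if bin_of(x) == bin_of(x_next))
    # Posterior predictive bin probability (n_j + 1) / (n + k),
    # converted to a density by dividing by the bin width 1/k.
    return (n_j + 1) / (n + k) * k

xs = [0.1, 0.12, 0.3, 0.8]
print(hist_predictive(0.11, xs, 4))  # dense region: density 1.5
print(hist_predictive(0.6, xs, 4))   # empty bin: density 0.5
```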

Theorem 5.3.7. Let $p_{\mathrm{sw}}$ denote the switch-distribution with prior $\pi$ that satisfies the conditions in (5.12), relative to histogram prediction strategies $\bar{p}_1, \bar{p}_2, \ldots$ For all $p^* \in \mathcal{M}^*$ this switch-distribution satisfies
$$\limsup_{n\to\infty} \frac{\sum_{i=1}^n R_i(P^*, P_{\mathrm{sw}})}{n^{1/3}} \le A_{p^*}, \qquad (5.27)$$
where $A_{p^*}$ is the same constant as in (5.26).

We first note that in [74], Theorem 5.3.6 is proved from the following more general theorem, which gives an upper bound on the risk of any prediction strategy that uses a histogram model with approximately $n^{1/3}$ bins to predict outcome $n$:

Theorem 5.3.8. For any $\alpha \ge 1$, any $k \in \big[\lceil (n/\alpha)^{1/3} \rceil,\ \lceil n^{1/3} \rceil\big]$, and any $p^* \in \mathcal{M}^*$,
$$R_n(P^*, \bar{P}_k) \le \alpha^{2/3}\, C_{p^*}\, n^{-2/3}, \qquad (5.28)$$
where $C_{p^*}$ depends only on the upper bound $c$ on the first derivative of $p^*$.

In [74] the theorem is only proven for $\alpha = 1$, but the proof remains valid unchanged for any $\alpha > 1$. From this, Theorem 5.3.6 follows by summing (5.28) over $n$, and approximating $\sum_{i=1}^n i^{-2/3}$ by an integral. We will now apply it to prove Theorem 5.3.7 as well.

Proof of Theorem 5.3.7. Choose any $p^* \in \mathcal{M}^*$, and let $\alpha > 1$ be arbitrary. Let $t_j = \alpha^{j-1} - 1$ for $j \in \mathbb{Z}^+$ be a sequence of switch-points, and define an oracle $\sigma_\alpha(P^*, x^{n-1}) := \lceil (t_j + 1)^{1/3} \rceil$ for any $x^{n-1} \in \mathcal{X}^{n-1}$, where $j$ is chosen such that $n \in [t_j + 1, t_{j+1}]$. By applying Lemma 5.3.5 to $\sigma_\alpha$ with $f(n) = n^{1/3}$, $\tau = 1$ and $c = \alpha^{2/3} A_{p^*}$, we immediately obtain

$$\limsup_{n\to\infty} \frac{\sum_{i=1}^n R_i(P^*, P_{\mathrm{sw}})}{n^{1/3}} \le \alpha^{2/3} A_{p^*} \qquad (5.29)$$

for any $\alpha > 1$. Theorem 5.3.7 then follows, because the left-hand side of this expression does not depend on $\alpha$. We still need to show that conditions (i)–(iii) of Lemma 5.3.5 are satisfied. Note that $\sigma_\alpha$ uses approximately $\lceil n^{1/3} \rceil$ bins to predict the $n$th outcome, but has relatively few switch-points: it satisfies $m_n(\sigma_\alpha) \le \lceil \log_\alpha n \rceil + 2$. Thus, conditions (i) and (ii) are comfortably satisfied. To verify (iii), note that the selected number of bins is close to $n^{1/3}$ in the sense of Theorem 5.3.8: for $n \in [t_j + 1, t_{j+1}]$ we have
$$\lceil (t_j + 1)^{1/3} \rceil = \left\lceil \left( \frac{n}{n/(t_j + 1)} \right)^{1/3} \right\rceil \in \Big[ \lceil (n/\alpha)^{1/3} \rceil,\ \lceil n^{1/3} \rceil \Big], \qquad (5.30)$$
using $n \le t_{j+1}$ and $t_{j+1}/(t_j + 1) \le \alpha$. We can now apply Theorem 5.3.8 to obtain

$$\limsup_{n\to\infty} \frac{\sum_{i=1}^n R_i(P^*, P_{\sigma_\alpha})}{n^{1/3}} \ \le\ \limsup_{n\to\infty} \frac{\sum_{i=1}^n \alpha^{2/3}\, C_{p^*}\, i^{-2/3}}{n^{1/3}} \ \le\ \alpha^{2/3} A_{p^*}, \qquad (5.31)$$

showing that (iii) is satisfied and Lemma 5.3.5 can be applied to prove the theorem.
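The two properties of $\sigma_\alpha$ used in the proof, logarithmically many segments and bin counts within the range required by Theorem 5.3.8, can be checked directly. A small sketch (our own rendering of the construction, with $\alpha = 2$):

```python
# The oracle sigma_alpha: switch-points t_j = alpha^(j-1) - 1, and
# ceil((t_j + 1)^(1/3)) bins on the segment [t_j + 1, t_{j+1}].
from math import ceil, log

alpha, N = 2.0, 1000

def sigma_alpha(n):
    j = 1
    while alpha ** j - 1 < n:  # find j with n in [t_j + 1, t_{j+1}]
        j += 1
    t_j = alpha ** (j - 1) - 1
    return ceil((t_j + 1) ** (1 / 3))

bins = [sigma_alpha(n) for n in range(1, N + 1)]
m_N = 1 + sum(1 for a, b in zip(bins, bins[1:]) if a != b)
print(m_N, ceil(log(N, alpha)) + 2)  # segments vs. the bound in the proof
# Each selected bin count lies in [ceil((n/alpha)^(1/3)), ceil(n^(1/3))]:
print(all(ceil((n / alpha) ** (1 / 3)) <= b <= ceil(n ** (1 / 3))
          for n, b in zip(range(1, N + 1), bins)))
```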

Theorem 5.3.7 shows that the switch-distribution obtains the optimal convergence rate in sum relative to $\mathcal{M}^*$. In [74] it is also shown that standard two-part MDL does not achieve this rate; it is slower by an $O(\log n)$ factor. The analysis leading to this result also strongly suggests that Bayesian model averaging based on a discrete prior on the Bayesian marginal distributions $\bar{p}_1, \bar{p}_2, \ldots$ given by (5.25) is also a factor $O(\log n)$ slower than the minimax rate [39]. By Theorem 5.3.6, $\delta(x^n) := \lceil n^{1/3} \rceil$ defines a very simple model selection criterion which does achieve the optimal rate in sum relative to $\mathcal{M}^*$, but, in contrast to the switch-distribution, it is inconsistent. Moreover, if we are lucky enough to be in a scenario where $p^*$ actually allows a faster than minimax optimal in-sum convergence by letting the number of bins grow as $n^\gamma$ for some $\gamma \ne \frac{1}{3}$, the switch-distribution would be able to take advantage of this, whereas $\delta$ cannot.

5.4 Further Discussion of Convergence in the Nonparametric Case

In Section 5.4.1 we analyse the relation between our convergence rates in sum and standard convergence rates. In Section 5.4.2, we explore possible future applications of Lemma 5.3.5 to establish minimax convergence rates for model classes $\mathcal{M}_1, \mathcal{M}_2, \ldots$ beyond histograms.

5.4.1 Convergence Rates in Sum

Let $g : \mathbb{N} \to \mathbb{R}^+$ and $h : \mathbb{N} \to \mathbb{R}^+$ be two functions that converge to 0 with increasing $n$. We say that $g$ converges to 0 at rate $h$ if $\limsup_{n\to\infty} \frac{g(n)}{h(n)} \le 1$. We say that $g$ converges to 0 in sum at rate $h$ if $\limsup_{n\to\infty} \frac{\sum_{i=1}^n g(i)}{\sum_{i=1}^n h(i)} \le 1$. This notion of convergence has been considered by, among others, Rissanen, Speed and Yu [74], Barron [6], Poland and Hutter [65], and was investigated in detail by Grünwald [39]. Note that, if $g$ converges to 0 at rate $h$, and $\lim_{n\to\infty} \sum_{i=1}^n h(i) = \infty$, then $g$ also converges in sum at rate $h$. Conversely, suppose that $g$ converges in sum at rate $h$. Does this also imply that $g$ converges to 0 at rate $h$ in the ordinary sense? The answer is "almost": as shown in [39], $g(n)$ may be strictly greater than $h(n)$ for some $n$, but the gap between any two $n$ and $n' > n$ at which $g$ is larger than $h$ must become infinitely large with increasing $n$.

We will now informally compare the "convergence in sum" results of the previous section with more standard results about individual risk convergence. We will write $h(n) \asymp g(n)$ if for some $0 < c_1 \le c_2$, for all sufficiently large $n$, $c_1 g(n) \le h(n) \le c_2 g(n)$. Consider the minimax risk at sample size $n$,
$$h_{\mathrm{mm}}(n) := \inf_{\bar{P}} \sup_{P^* \in \mathcal{M}^*} R_n(P^*, \bar{P}), \qquad (5.32)$$
and the corresponding minimax risk in sum with varying $P^*$,

$$G_{\text{mm-var}}(n) = \sum_{i=1}^n h_{\mathrm{mm}}(i) = \sum_{i=1}^n \inf_{\bar{P}} \sup_{P^* \in \mathcal{M}^*} R_i(P^*, \bar{P}), \qquad (5.33)$$

where $\bar{P}$ now ranges over all prequential forecasting systems (i.e. sequences of estimators). In many nonparametric density estimation and regression problems, the minimax risk $h_{\mathrm{mm}}(n)$ is of order $n^{-\gamma}$ for some $1/2 < \gamma < 1$ (see, for example, [105, 106, 9]), i.e. $h_{\mathrm{mm}}(n) \asymp n^{-\gamma}$, where $\gamma$ depends on the smoothness assumptions on the densities in $\mathcal{M}^*$. We call the situation with $\mathcal{M}^*$ such that $h_{\mathrm{mm}}(n) \asymp n^{-\gamma}$ the "standard nonparametric case". In this standard case, we have

$$G_{\text{mm-var}}(n) \asymp \sum_{i=1}^n i^{-\gamma} \asymp \int_1^n x^{-\gamma}\, dx \asymp n^{1-\gamma}. \qquad (5.34)$$

Similarly, in standard parametric problems, the minimax risk $h_{\mathrm{mm}}(n) \asymp 1/n$. In that case, analogously to (5.34), we see that the minimax risk in sum $G_{\text{mm-var}}$ is of order $\log n$. Note, however, that our result for histograms (and, more generally, any rate-of-convergence result that may be based on Lemmata 5.3.3, 5.3.4 and 5.3.5) is based on a scenario where $P^*$, while allowed to depend on $n$, is kept fixed over the terms in the sum from 1 to $n$. Indeed, in Theorem 5.3.7 we showed that $P_{\mathrm{sw}}$ achieves the minimax rate in sum $G_{\text{mm-fix}}(n)$ as defined in (5.21). Comparing to (5.33), we see that the supremum is moved outside of the sum. Fortunately, $G_{\text{mm-fix}}$ and $G_{\text{mm-var}}$ are usually of the same order: in the parametric case, e.g. $\mathcal{M}^* = \bigcup_{k \le k^*} \mathcal{M}_k$, both $G_{\text{mm-fix}}$ and $G_{\text{mm-var}}$ are of order $\log n$. For $G_{\text{mm-var}}$, we have already seen this. For $G_{\text{mm-fix}}$, this is a standard information-theoretic result, see for example [23]. In the standard nonparametric case, when the standard minimax rate is of order $n^{-\gamma}$ and therefore $G_{\text{mm-var}} \asymp n^{1-\gamma}$, it also holds that $G_{\text{mm-fix}}(n) \asymp n^{1-\gamma}$ [106, page 1582]. To see this, let $P_{\text{mm-fix}}$ be any prequential forecasting system that achieves $G_{\text{mm-fix}}$ as defined in (5.21) (if such a $P_{\text{mm-fix}}$ does not exist, take any $P$ that, for each $n$, achieves the infimum to within $\varepsilon_n$ for some sequence $\varepsilon_1, \varepsilon_2, \ldots$ tending to 0). Now define the prequential forecasting system

$$P_{\text{Cesàro}}(x_n \mid x^{n-1}) := \frac{1}{n} \sum_{i=1}^n P_{\text{mm-fix}}(x_n \mid x^{i-1}).$$

Thus, $P_{\text{Cesàro}}$ is obtained as a time ("Cesàro") average of $P_{\text{mm-fix}}$. It now follows by applying Jensen's inequality as in Proposition 15.2 of [39] (or the corresponding results in [102] or [106]) that

$$\sup_{P^*} R_n(P^*, P_{\text{Cesàro}}) \ \le\ \sup_{P^*} \frac{1}{n} \sum_{i=1}^n R_i(P^*, P_{\text{mm-fix}}) = n^{-1} \cdot O(n^{1-\gamma}) = O(n^{-\gamma}), \qquad (5.35)$$

so that $\sum_{j=1}^n \sup_{P^*} R_j(P^*, P_{\text{Cesàro}}) = O\big(\sum_{j=1}^n j^{-\gamma}\big) = O(n^{1-\gamma})$, and it follows that

$$G_{\text{mm-var}}(n) = O(n^{1-\gamma}) = O(G_{\text{mm-fix}}(n)). \qquad (5.36)$$

Since, trivially, $G_{\text{mm-fix}}(n) \le G_{\text{mm-var}}(n)$, it follows that $G_{\text{mm-var}}(n) \asymp n^{1-\gamma}$. The underlying idea of basing predictive distributions on Cesàro-averages is not new; see for example [43, Section 9] and [102]. It is described in detail in [39].

Summarising, both in standard parametric and nonparametric cases, $G_{\text{mm-fix}}$ and $G_{\text{mm-var}}$ are of comparable size. Therefore, Lemmata 5.3.4 and 5.3.5 do suggest

that, both in standard parametric and nonparametric cases, $P_{\mathrm{sw}}$ achieves the minimax convergence rate $G_{\text{mm-fix}}$ (and Theorem 5.3.7 shows that it actually does so in a specific nonparametric case). One caveat is in order, though: the fact that $G_{\text{mm-fix}}$ and $G_{\text{mm-var}}$ are of comparable size does not imply that $P_{\mathrm{sw}}$ also achieves the varying-$P^*$ minimax rate $G_{\text{mm-var}}$. We cannot prove the analogues of Lemma 5.3.4 and Lemma 5.3.5 with the supremum over $P^*$ inside the sum in $G_{\mathrm{sw}}$ and $G_\sigma$. Therefore, we cannot prove, even for histogram density estimation, that $P_{\mathrm{sw}}$ achieves $G_{\text{mm-var}}$. Nevertheless, we do suspect that in the standard nonparametric case, with $G_{\text{mm-fix}}(n) \asymp G_{\text{mm-var}}(n) \asymp n^{1-\gamma}$, whenever $P_{\mathrm{sw}}$ achieves the fixed-$P^*$ minimax rate $G_{\text{mm-fix}}$, it also achieves the varying-$P^*$ minimax rate $G_{\text{mm-var}}$. The reason for this conjecture is that, if we can assume that the data are i.i.d. under all $P^* \in \mathcal{M}^*$, then whenever $P_{\mathrm{sw}}$ achieves $G_{\text{mm-fix}}$, a small modification of $P_{\mathrm{sw}}$ will achieve $G_{\text{mm-var}}$. Indeed, define the Cesàro-switch-distribution as

$$P_{\text{Cesàro-sw}}(x_n \mid x^{n-1}) := \frac{1}{n} \sum_{i=1}^n P_{\mathrm{sw}}(x_n \mid x^{i-1}).$$

Applying (5.35) to $P_{\text{Cesàro-sw}}$ rather than $P_{\text{Cesàro}}$, we see that $P_{\text{Cesàro-sw}}$ achieves the varying-$P^*$ minimax rate whenever $P_{\mathrm{sw}}$ achieves the fixed-$P^*$ minimax rate. Since, intuitively, $P_{\text{Cesàro-sw}}$ learns "slower" than $P_{\mathrm{sw}}$, we suspect that $P_{\mathrm{sw}}$ itself then achieves the varying-$P^*$ minimax rate as well. However, in the parametric case, $h_{\mathrm{mm}}(n) \asymp 1/n$ whereas $G_{\text{mm-fix}}(n)/n \asymp (\log n)/n$. Then the reasoning of (5.35) does not apply any more, and $P_{\text{Cesàro-sw}}$ may not achieve the minimax rate for varying $P^*$. We suspect that this is not a coincidence: a recent result by Yang [103] suggests that, in the parametric case, no model selection/averaging criterion can achieve both consistency and the minimax optimal varying-$P^*$ rate $G_{\text{mm-var}}$ (Section 5.6.1).
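The Cesàro-averaging construction itself is a one-line transformation of any sequential predictor. A minimal sketch (the Laplace predictor below is an illustrative stand-in for $P_{\mathrm{sw}}$):

```python
# Cesàro averaging of a sequential predictor: average the predictions
# made from every prefix of the past, as in the displayed definition.

def cesaro(p_cond):
    def averaged(x, xs):
        n = len(xs) + 1  # we are predicting outcome number n
        return sum(p_cond(x, xs[:i]) for i in range(n)) / n
    return averaged

laplace = lambda x, xs: ((sum(xs) + 1) / (len(xs) + 2) if x == 1
                         else 1 - (sum(xs) + 1) / (len(xs) + 2))
print(laplace(1, [1, 1, 1]))          # 0.8
print(cesaro(laplace)(1, [1, 1, 1]))  # ~0.679: the average "learns slower"
```

Since each averaged value is a convex combination of probabilities, the result is again a valid prediction strategy.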

5.4.2 Beyond Histograms

The key to proving Theorem 5.3.7, the minimax convergence result for histogram density estimation, is the existence of an oracle $\sigma_\alpha$ which achieves the minimax convergence rate, but which, at the same time, switches only a logarithmic number of times. The theorem followed by applying Lemma 5.3.5 with this oracle. It appears that the same technique can be applied in many other standard nonparametric settings (with $h_{\mathrm{mm}}(n) \asymp n^{-\gamma}$) as well. Important examples include linear regression based on full approximation sets of functions such as polynomials [106, 101] or splines, with random i.i.d. design and i.i.d. normally distributed noise with known variance $\sigma^2$. The development in Section 6.2 of [101] indicates that an analogue of Theorem 5.3.7 can be proven for such cases. Here the models $\mathcal{M}_k$ are families of conditional distributions $P_\theta$ for $Y \in \mathbb{R}$ given $X \in [0,1]^d$ for some $d > 0$, where $\theta = (\theta_1, \ldots, \theta_k)$ and $P_\theta$ expresses that $Y_i = \sum_{j=1}^k \theta_j \phi_j(X_i) + U$, with $\phi_j$ the $j$th basis function in the approximation set and $U$, the noise,

a zero-mean Gaussian random variable. The forecasting systems $\bar{p}_1, \bar{p}_2, \ldots$ are now based on maximum likelihood estimators rather than Bayes predictive distributions. Another candidate is density estimation based on sequences of exponential families as introduced by Barron and Sheu [9], when the estimators $\bar{p}_1, \bar{p}_2, \ldots$ are based on Bayesian MAP estimators defined with respect to $k$-dimensional exponential families, and $\mathcal{M}^*$ is taken to be the set of densities $p$ such that $\log p$ is contained in some Sobolev space with smoothness parameter $r$ [7]. Preliminary investigations suggest that $P_{\mathrm{sw}}$ achieves the minimax convergence rates in both the linear regression and the density estimation setting, but, at the time of writing, we have not yet proven any formal statements.
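For concreteness, the following toy sketch (entirely our own construction, with an illustrative true curve that lies outside every $\mathcal{M}_k$) sets up the polynomial-basis regression models just described, with ML estimation by least squares:

```python
# Toy setup of the regression models M_k: Y_i = sum_j theta_j phi_j(X_i) + U
# with polynomial basis phi_j(x) = x^(j-1), i.i.d. uniform design on [0, 1],
# and Gaussian noise of known standard deviation sigma.
import numpy as np

rng = np.random.default_rng(0)
sigma, n = 0.1, 50
x = rng.uniform(0, 1, n)
y = np.sin(2 * np.pi * x) + rng.normal(0, sigma, n)  # truth outside each M_k

def ml_predict(x_train, y_train, k, x_new):
    """Mean prediction of p-bar_k: degree-(k-1) polynomial least squares."""
    coefs = np.polyfit(x_train, y_train, deg=k - 1)
    return np.polyval(coefs, x_new)

for k in (2, 4, 8):  # higher k tracks the truth better at this sample size
    print(k, ml_predict(x, y, k, 0.25), np.sin(2 * np.pi * 0.25))
```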

5.5 Efficient Computation

For priors $\pi$ as in (5.7), the posterior probability on predictors $p_1, p_2, \ldots$ can be efficiently computed sequentially, provided that $\pi_t(Z = n \mid Z \ge n)$ and $\pi_k$ can be calculated quickly and that $\pi_m(m) = \theta^{m-1}(1-\theta)$ is geometric with parameter $\theta$, as is also required for Theorem 5.2.1 and permitted in the theorems and lemmas of Section 5.3 that require (5.12). The algorithm resembles Fixed-Share [46], but whereas Fixed-Share implicitly imposes a geometric distribution for $\pi_t$, we allow general priors by varying the shared weight with $n$; and through the addition of the $\pi_m$ component of the prior, we ensure that the additional loss compared to the best prediction strategy does not grow with the sample size, a crucial property for consistency.

To ensure finite running time, rather than $P_{\mathrm{sw}}$ the algorithm uses a potentially defective distribution $P$ that assigns smaller or equal probability to all events. It is obtained by restricting $P_{\mathrm{sw}}$ to using only a finite nonempty set $\mathcal{K}_n$ of prediction strategies at sample size $n$. Then, analogously to (5.4), for any $n$ the density of the marginal distribution of $P$ on $X^n$ is given by $p(x^n) = \sum_{s \in \mathcal{S}'} q_s(x^n) \cdot \pi(s)$, where $\mathcal{S}' := \{s \in \mathcal{S} \mid \forall n \in \mathbb{Z}^+ : K_n(s) \in \mathcal{K}_n\}$ denotes the parameter space restricted to those prediction strategies that are considered at each sample size. We use the indicator function: $\mathbb{1}_A(x) = 1$ if $x \in A$ and 0 otherwise. The algorithm is given as Algorithm 5.1 below.

This algorithm can be used to obtain fast convergence in the sense of Section 5.3 and, as long as $\pi$ does not vary with $n$, consistency in the sense of Theorem 5.2.1. Note that the running time $\Theta(\sum_{n=1}^N |\mathcal{K}_n|)$ is typically of the same order as that of fast model selection criteria like AIC and BIC: for example, if the number of considered prediction strategies is fixed at $K$, then the running time is $\Theta(K \cdot N)$. In the interest of clarity and simplicity we only prove the theorem below, which assumes $P = P_{\mathrm{sw}}$, but the reader may verify that the algorithm remains valid for defective $P$.

Theorem 5.5.1. If $P = P_{\mathrm{sw}}$, then Algorithm 5.1 correctly reports $P(K_{n+1}, x^n)$.

Algorithm 5.1 Switch($x^N$)

1  for $k \in \mathcal{K}_1$ do initialise $w^a_k \leftarrow \theta \cdot \pi_k(k)$; $w^b_k \leftarrow (1-\theta) \cdot \pi_k(k)$ end for
2  for $n = 1, \ldots, N$ do
3      Report $P(K_n, x^{n-1}) = w^a_{K_n} + w^b_{K_n}$ (a $|\mathcal{K}_n|$-sized array)
4      for $k \in \mathcal{K}_n$ do $w^a_k \leftarrow w^a_k \cdot p_k(x_n \mid x^{n-1})$; $w^b_k \leftarrow w^b_k \cdot p_k(x_n \mid x^{n-1})$ end for
5      pool $\leftarrow \pi_t(Z = n \mid Z \ge n) \cdot \sum_{k \in \mathcal{K}_n} w^a_k$
6      for $k \in \mathcal{K}_{n+1}$ do
7          $w^a_k \leftarrow w^a_k \cdot \mathbb{1}_{\mathcal{K}_n}(k) \cdot \pi_t(Z \ne n \mid Z \ge n) + \text{pool} \cdot \pi_k(k) \cdot \theta$
8          $w^b_k \leftarrow w^b_k \cdot \mathbb{1}_{\mathcal{K}_n}(k) + \text{pool} \cdot \pi_k(k) \cdot (1-\theta)$
9      end for
10 end for
11 Report $P(K_{N+1}, x^N) = w^a_{K_{N+1}} + w^b_{K_{N+1}}$

The proof is given in Section 5.9.4.
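A direct Python transcription of Algorithm 5.1 follows. The experts, the value of $\theta$, and the hazard rate are illustrative choices (the hazard $1/(n+1)$ corresponds to the prior $\pi_t(t) = 1/(t(t+1))$ suggested in Section 5.3.2); we also keep $\mathcal{K}_n$ fixed and report only the final, normalised posterior:

```python
# Transcription of Algorithm 5.1 with a fixed strategy set.
# pi_t_hazard(n) plays the role of pi_t(Z = n | Z >= n).

def switch_posterior(xs, experts, theta, pi_k, pi_t_hazard):
    K = experts.keys()
    wa = {k: theta * pi_k[k] for k in K}        # runs that may still switch
    wb = {k: (1 - theta) * pi_k[k] for k in K}  # runs that never switch again
    for n, x in enumerate(xs, start=1):
        for k in K:  # multiply in expert k's probability of outcome n
            lik = experts[k](x, xs[:n - 1])
            wa[k] *= lik
            wb[k] *= lik
        pool = pi_t_hazard(n) * sum(wa.values())
        for k in K:  # redistribute the switching mass (lines 6-9)
            wa[k] = wa[k] * (1 - pi_t_hazard(n)) + pool * pi_k[k] * theta
            wb[k] = wb[k] + pool * pi_k[k] * (1 - theta)
    total = sum(wa.values()) + sum(wb.values())
    return {k: (wa[k] + wb[k]) / total for k in K}  # P(K_{N+1} = k | x^N)

experts = {
    1: lambda x, xs: 0.5,
    2: lambda x, xs: ((sum(xs) + 1) / (len(xs) + 2) if x == 1
                      else 1 - (sum(xs) + 1) / (len(xs) + 2)),
}
post = switch_posterior([1, 1, 0, 1, 1, 1], experts, theta=0.5,
                        pi_k={1: 0.5, 2: 0.5},
                        pi_t_hazard=lambda n: 1 / (n + 1))
print(post)
```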

5.6 Relevance and Earlier Work

5.6.1 AIC-BIC; Yang's Result

Over the last 25 years or so, the question whether to base model selection on AIC- or BIC-type methods has received a lot of attention in the theoretical and applied statistics literature, as well as in fields such as psychology and biology where model selection plays an important role (googling "AIC and BIC" gives 355 000 hits) [85, 41, 40, 10, 33, 30, 81]. It has even been suggested that, since these two types of methods have been designed with different goals in mind (optimal prediction vs. "truth hunting"), one should not expect procedures that combine the best of both types of approaches to exist [81]. Our Theorem 5.2.1 and our results in Section 5.3 show that, at least in some cases, one can get the best of both worlds after all, and model averaging based on $P_{\mathrm{sw}}$ achieves the minimax optimal convergence rate. In typical parametric settings ($P^* \in \mathcal{M}$), model selection based on $P_{\mathrm{sw}}$ is consistent, and Lemma 5.3.4 suggests that model averaging based on $P_{\mathrm{sw}}$ is within a constant factor of the minimax optimal rate in parametric settings.

Superficially, our results may seem to contradict the central conclusion of Yang [103]. Yang shows that there are scenarios in linear regression where no model selection or model combination criterion can be both consistent and achieve the minimax rate of convergence.

Yang's result is proved for a variation of linear regression in which the estimation error is measured on the previously observed design points. This setup cannot be directly embedded in our framework. Also, Yang's notion of model combination is somewhat different from the model averaging that is used to compute $P_{\mathrm{sw}}$. Thus, formally, there is no contradiction between Yang's results and ours. Still, the setups are so similar that one can easily imagine a variation of Yang's result to hold in our setting as well. Thus, it is useful to analyse how these "almost contradictory" results may coexist. We suspect (but have no proof) that the underlying reason is the definition of our minimax convergence rate in sum (5.21), in which $P^*$ is allowed to depend on $n$, but the risk with respect to that same $P^*$ is then summed over all $i = 1, \ldots, n$. Yang's result holds in a parametric scenario, where there are two nested parametric models, and data are sampled from a distribution in one of them. Then both $G_{\text{mm-fix}}$ and $G_{\text{mm-var}}$ are of the same order $\log n$, but it may of course be possible that there does exist a minimax optimal procedure that is also consistent relative to the $G_{\text{mm-fix}}$-game, in which $P^*$ is kept fixed once $n$ has been determined, while there does not exist a minimax optimal procedure that is also consistent relative to the $G_{\text{mm-var}}$-game, in which $P^*$ is allowed to vary. Indeed, while in Section 5.4.1 we have established that $P_{\text{Cesàro-sw}}$, a slight variation of $P_{\mathrm{sw}}$, achieves the minimax optimal convergence rates $G_{\text{mm-var}}$ and $h_{\mathrm{mm}}$ for some nonparametric $\mathcal{M}^*$, which suggests that $P_{\mathrm{sw}}$ achieves these rates as well, we do not have such a result for parametric $\mathcal{M}^*$. Yang's results indicate that such an analogue may not exist.
Several other authors have provided procedures which have been designed to behave like AIC whenever AIC is better, and like BIC whenever BIC is better, and which empirically seem to do so; these include model meta-selection [30, 21] and Hansen and Yu's gMDL version of MDL regression [41]; also, the "mongrel" procedure of [99] has been designed to improve on Bayesian model averaging for small samples. Compared to these other methods, ours seems to be the first that provably is both consistent and minimax optimal in terms of risk, for some classes $\mathcal{M}^*$. The only other procedure that we know of for which somewhat related results have been shown is a version of cross-validation proposed by Yang [104] to select between AIC and BIC in regression problems. Yang shows that a particular form of cross-validation will asymptotically select AIC in case the use of AIC leads to better predictions, and BIC in the case that BIC leads to better predictions. In contrast to Yang, we use a single paradigm rather than a mix of several ones (such as AIC, BIC and cross-validation); essentially, our paradigm is just that of universal individual-sequence prediction, or equivalently, the individual-sequence version of predictive MDL, or equivalently, Dawid's prequential analysis applied to the log scoring rule. Indeed, our work has been heavily inspired by prequential ideas; in Dawid [29] it is already suggested that model selection should be based on the transient behaviour, in terms of sequential prediction, of the estimators within the models: one should select the model which is optimal at the given sample size, and this will change over time. Although Dawid uses standard Bayesian mixtures of parametric models as his running examples, he implicitly suggests that other ways (the details of which are left unspecified) of combining predictive distributions relative to parametric models may be preferable, especially in the nonparametric case where the true distribution is outside any of the parametric models under consideration.

5.6.2 Prediction with Expert Advice

Since the switch-distribution has been designed to perform well in a setting where the optimal predictor $\bar{p}_k$ changes over time, our work is also closely related to the algorithms for tracking the best expert in the universal prediction literature [46, 95, 94, 63]. However, those algorithms are usually intended for data that are sequentially generated by a mechanism whose behaviour changes over time. In sharp contrast, our switch-distribution is especially suitable for situations where data are sampled from a fixed (though perhaps non-i.i.d.) source after all; the fact that one model temporarily leads to better predictions than another is caused by the fact that each "expert" $\bar{p}_k$ has itself already been designed as a universal predictor/estimator relative to some large set of distributions $\mathcal{M}_k$. The elements of $\mathcal{M}_k$ may be viewed as "base" predictors/experts, and the $\bar{p}_k$ may be thought of as meta-experts/predictors. Because of this two-stage structure, which meta-predictor $\bar{p}_k$ is best changes over time, even though the optimal base-predictor $\arg\min_{p \in \mathcal{M}} R_n(p^*, p)$ does not change over time.

If one of the considered prediction strategies $\bar{p}_k$ makes the best predictions eventually, our goal is to achieve consistent model selection: the total number of switches should also remain bounded. To this end we have defined the switch-distribution such that positive prior probability is associated with switching finitely often and thereafter using $\bar{p}_k$ for all further outcomes. We need this property to prove that our method is consistent. Other dynamic expert tracking algorithms, such as the Fixed-Share algorithm [46], have been designed with different goals in mind, and as such they do not have this property. Not surprisingly then, our results do not resemble any of the existing results in the "tracking" literature.

5.7 The Catch-Up Phenomenon, Bayes and Cross-Validation

5.7.1 The Catch-Up Phenomenon is Unbelievable! (according to $p_{\mathrm{bma}}$)

On page 122 we introduced the marginal Bayesian distribution $p_{\mathrm{bma}}(x^n) := \sum_k w(k)\, \bar{p}_k(x^n)$. If the distributions $\bar{p}_k$ are themselves Bayesian marginal distributions as in (5.1), then $p_{\mathrm{bma}}$ may be interpreted as (the density corresponding to) a distribution on the data that reflects some prior beliefs about the domain that is being modelled, as represented by the priors $w(k)$ and $w_k(\theta)$. If $w(k)$ and $w_k(\theta)$ truly reflected some decision-maker's a priori beliefs, then it is clear that the decision-maker would like to make sequential predictions of $X_{n+1}$ given $X^n = x^n$ based on $p_{\mathrm{bma}}$ rather than on $p_{\mathrm{sw}}$. Indeed, as we now show, the catch-up phenomenon as depicted in Figure 5.1 is exceedingly unlikely to take place under $p_{\mathrm{bma}}$, and a priori a subjective Bayesian should be prepared to bet a lot of money that it does not occur. To see this, consider the no-hypercompression inequality [39], versions of which are also known as "Barron's inequality" [5] and "competitive optimality of the Shannon-Fano code" [25]. It states that for any two distributions $P$ and $Q$ for $X^\infty$, the $P$-probability that $Q$ outperforms $P$ by $k$ bits or more when sequentially predicting $X_1, X_2, \ldots$ is exponentially small in $k$: for each $n$,
$$P\big(-\log q(X^n) \le -\log p(X^n) - k\big) \le 2^{-k}.$$
Plugging in $p_{\mathrm{bma}}$ for $p$ and $p_{\mathrm{sw}}$ for $q$, we see that what happened in Figure 5.1 ($p_{\mathrm{sw}}$ outperforming $p_{\mathrm{bma}}$ by about 40 000 bits) is an event with probability no more than $2^{-40000}$ according to $p_{\mathrm{bma}}$. Yet, in many practical situations, the catch-up phenomenon does occur and $p_{\mathrm{sw}}$ gains significantly compared to $p_{\mathrm{bma}}$. This can only be possible because either the models are wrong (clearly, The Picture of Dorian Gray has not been drawn randomly from a finite-order Markov chain), or because the priors are "wrong" in the sense that they somehow don't match the situation one is trying to model. For this reason, some subjective Bayesians, when we confronted them with the catch-up phenomenon, have argued that it is just a case of "garbage in, garbage out" (GIGO): when the phenomenon occurs, then, rather than using the switch-distribution, one should reconsider the model(s) and prior(s) one wants to use, and, once one has found a superior model $\mathcal{M}'$ and prior $w'$, one should use $p_{\mathrm{bma}}$ relative to $\mathcal{M}'$ and $w'$. Of course we agree that if one can come up with better models, one should use them. Nevertheless, we strongly disagree with the GIGO point of view: we are convinced that in practice, "correct" priors may be impossible to obtain; similarly, people are forced to work with "wrong" models all the time. In such cases, rather than embarking on a potentially never-ending quest for better models, the hurried practitioner may often prefer to use the imperfect, yet still useful, models that he has available, in the best possible manner. And then it makes sense to use $p_{\mathrm{sw}}$ rather than the Bayesian $p_{\mathrm{bma}}$: the best one can hope for in general is to regard the distributions in one's models as prediction strategies and to predict as well as the best strategy contained in any of the models, and $p_{\mathrm{sw}}$ is better at this than $p_{\mathrm{bma}}$.
Indeed, the catch-up phenomenon raises some interesting questions for Bayes factor model selection: no matter what the prior is, by the no-hypercompression inequality above with $p = p_{\mathrm{bma}}$ and $q = p_{\mathrm{sw}}$, when comparing two models $\mathcal{M}_1$ and $\mathcal{M}_2$, before seeing any data, a Bayesian always believes that the switch-distribution will not substantially outperform $p_{\mathrm{bma}}$. This implies that a Bayesian cannot believe that, with non-negligible probability, a complex model $\bar{p}_2$ can at first predict substantially worse than a simple model $\bar{p}_1$ and then, for large samples, predict substantially better. Yet in practice, this happens all the time!

5.7.2 Nonparametric Bayes

A more interesting subjective Bayesian argument against the switch-distribution would be that, in the nonparametric setting, the data are sampled from some

$P^* \in \mathcal{M}^* \setminus \mathcal{M}$, which is not contained in any of the parametric models $\mathcal{M}_1, \mathcal{M}_2, \ldots$ Yet, under the standard hierarchical prior used in $p_{\mathrm{bma}}$ (first a discrete prior on the model index, then a density on the model parameters), we have that with prior-probability 1, $P^*$ is "parametric", i.e. $P^* \in \mathcal{M}_k$ for some $k$. Thus, our prior distribution is not really suitable for the situation that we are trying to model in the nonparametric setting, and we should use a nonparametric prior instead. While we completely agree with this reasoning, we would immediately like to add that the question then becomes: what nonparametric prior should one use? Nonparametric Bayes has become very popular in recent years, and it often works surprisingly well. Still, its practical and theoretical performance strongly depends on the type of priors that are used, and it is often far from clear what prior to use in what situation. In some situations, some nonparametric priors achieve optimal rates of convergence, but others can even make Bayes inconsistent [31, 39]. The advantage of the switch-distribution is that it does not require any difficult modelling decisions, but nevertheless under reasonable conditions it achieves the optimal rate of convergence in nonparametric settings, and, in the special case where one of the models on the list in fact approximates the true source extremely well, this model will in fact be identified (Theorem 5.2.1). In fact, one may think of $p_{\mathrm{sw}}$ as specifying a very special kind of nonparametric prior, and under this interpretation, our results are in complete agreement with the nonparametric Bayesian view.

5.7.3 Leave-One-Out Cross-Validation

From the other side of the spectrum, it has sometimes been argued that consistency is irrelevant, since in practical situations, the true distribution is never in any of the models under consideration. Thus, it is argued, one should use AIC-type methods such as leave-one-out cross-validation, because of their predictive optimality. We strongly disagree with this argument, for several reasons. First, in practical model selection problems, one is often interested in questions such as "does $Y$ depend on feature $X_k$ or not?" For example, $\mathcal{M}_{k-1}$ is a set of conditional distributions in which $Y$ is independent of $X_k$, and $\mathcal{M}_k$ is a superset thereof in which $Y$ can be dependent on $X_k$. There are certainly real-life situations where some variable $X_j$ is truly completely irrelevant for predicting $Y$, and it may be the primary goal of the scientist to find out whether or not this is the case. In such cases, we would hope our model selection criterion to select, for large $n$, $\mathcal{M}_{k-1}$ rather than $\mathcal{M}_k$, and the problem with the AIC-type methods is that, because of their inconsistency, they sometimes do not do this. In other words, we think that consistency does matter, and we regard it as a clear advantage of the switch-distribution that it is consistent.

A second advantage over leave-one-out cross-validation is that the switch-distribution, like Bayesian methods, satisfies Dawid's weak prequential principle [29, 39]: the switch-distribution assesses the quality of a predictor $\bar{p}_k$ only in terms of the quality of predictions that were actually made. To apply LOO on a sample $x_1, \ldots, x_n$, one needs to know the prediction for $x_i$ given $x_1, \ldots, x_{i-1}$, but also $x_{i+1}, \ldots, x_n$. In practice, these may be hard to compute, unknown or even unknowable. An example of the first are non-i.i.d. settings such as time series models. An example of the second is the case where the $\bar{p}_k$ represent, for example, weather forecasters, or other predictors which have been designed to predict the future given the past. Actual weather forecasters use computer programs to predict the probability that it will rain the next day, given a plethora of data about air pressure, humidity, temperature etc. and the pattern of rain in the past days. It may simply be impossible to apply those programs in a way that they predict the probability of rain today, given data about tomorrow.

5.8 Conclusion and Future Work

We have identified the catch-up phenomenon as the underlying reason for the slow convergence of Bayesian model selection and averaging. Based on this, we have defined the switch-distribution $P_{\mathrm{sw}}$, a modification of the Bayesian marginal distribution which is consistent, but also, under some conditions, achieves a minimax optimal convergence rate, a significant step forward in resolving the AIC/BIC dilemma. Different strands of future work suggest themselves:

1. Lemma 5.3.5 provides a tool to prove minimax optimal in-sum convergence of the switch-distribution for particular nonparametric model classes $\mathcal{M}^*$. However, because of time constraints we have currently only applied this to histogram density estimation. We hope to eventually show that the switch-distribution actually achieves the minimax optimal convergence rate for a wide class of nonparametric problems.

2. Since $p_{\mathrm{sw}}$ can be computed in practice, the approach can readily be tested with real and simulated data in both density estimation and regression problems. Initial results on simulated data, on which we will report elsewhere, give empirical evidence that $p_{\mathrm{sw}}$ behaves remarkably well in practice. Model selection based on $p_{\mathrm{sw}}$, like that based on $p_{\mathrm{bma}}$, typically identifies the true distribution at moderate sample sizes. Prediction and estimation based on $P_{\mathrm{sw}}$ are of comparable quality to leave-one-out cross-validation (LOO), and in no experiment did we find that it behaved substantially worse than either LOO or AIC.

3. It is an interesting open question whether analogues of Lemma 5.3.5 and Theorem 5.3.7 exist for model selection rather than averaging. In other words, in settings such as histogram density estimation where model averaging based on the switch-distribution achieves the minimax convergence rate, does model selection based on the switch-distribution achieve it as

well? For example, in Figure 5.1, sequentially predicting by the $\bar{p}_{K_{n+1}}$ that has maximum a posteriori probability (MAP) under the switch-distribution given data $x^n$ is only a few bits worse than predicting by model averaging based on the switch-distribution, and still outperforms standard Bayesian model averaging by about 40 000 bits. In the experiments mentioned above,

we invariably found that predicting by the MAP $\bar{p}_{K_{n+1}}$ empirically converges at the same rate as using model averaging, i.e. predicting by $P_{\mathrm{sw}}$. However, we have no proof that this really must always be the case. Analogous results in the MDL literature suggest that a theorem bounding the risk of switch-based model selection, if it can be proven at all, would bound the squared Hellinger rather than the KL risk ([39], Chapter 15).

4. The way we defined $P_{\mathrm{sw}}$, it does not seem suitable for situations in which the number of considered models or model combinations is exponential in the sample size. Because of condition (i) in Lemma 5.3.5, our theoretical results do not cover this case either. Yet this case is highly important in practice, for example in the subset selection problem [101]. It seems clear that the catch-up phenomenon can and will also occur in model selection problems of that type. Can our methods be adapted to this situation, while still keeping the computational complexity manageable? And what is the relation with the popular and computationally efficient $L_1$-approaches to model selection [90]?

5.9 Proofs

5.9.1 Proof of Theorem 5.2.1

Let $U_n = \{s \in \mathcal{S} \mid K_{n+1}(s) \ne k^*\}$ denote the set of "bad" parameters $s$ that select an incorrect model. It is sufficient to show that
$$\lim_{n\to\infty} \frac{\sum_{s \in U_n} \pi(s)\, q_s(X^n)}{\sum_{s \in \mathcal{S}} \pi(s)\, q_s(X^n)} = 0 \quad \text{with } \bar{P}_{k^*}\text{-probability 1.} \qquad (5.37)$$
To see this, suppose the theorem is false. Then there exists a $\Phi \subseteq \Theta_{k^*}$ with $w_{k^*}(\Phi) := \int_\Phi w_{k^*}(\theta)\, d\theta > 0$ such that (5.6) does not hold for any $\theta^* \in \Phi$. But then by definition of $\bar{P}_{k^*}$ we have a contradiction with (5.37).

Now let $A = \{s \in \mathcal{S} : k_{m(s)} \ne k^*\}$ denote the set of parameters that are bad for sufficiently large $n$. We observe that for each $s' \in U_n$ there exists at least one element $s \in A$ that uses the same sequence of switch-points and predictors on the first $n+1$ outcomes (this implies that $K_i(s) = K_i(s')$ for $i = 1, \ldots, n+1$) and has no switch-points beyond $n$ (i.e. $t_{m(s)} \le n$). Consequently, either $s' = s$ or $s' \in E_s$. Therefore
$$\sum_{s' \in U_n} \pi(s')\, q_{s'}(x^n) \ \le\ \sum_{s \in A} \big(\pi(s) + \pi(E_s)\big)\, q_s(x^n) \ \le\ (1 + c) \sum_{s \in A} \pi(s)\, q_s(x^n). \qquad (5.38)$$

Defining the mixture $r(x^n) = \sum_{s \in A} \pi(s)\, q_s(x^n)$, we will show that
$$\lim_{n\to\infty} \frac{r(X^n)}{\pi(s = (0, k^*)) \cdot \bar{p}_{k^*}(X^n)} = 0 \quad \text{with } \bar{P}_{k^*}\text{-probability 1.} \qquad (5.39)$$
Using (5.38) and the fact that $\sum_{s \in \mathcal{S}} \pi(s)\, q_s(x^n) \ge \pi(s = (0, k^*)) \cdot \bar{p}_{k^*}(x^n)$, this implies (5.37).

For all $s \in A$ and $x^{t_m(s)} \in \mathcal{X}^{t_m(s)}$, by definition $Q_s(X^\infty_{t_m+1} \mid x^{t_m})$ is equal to $\bar{P}_{k_m}(X^\infty_{t_m+1} \mid x^{t_m})$, which is mutually singular with $\bar{P}_{k^*}(X^\infty_{t_m+1} \mid x^{t_m})$ by assumption. If $\mathcal{X}$ is a separable metric space, which holds because $\mathcal{X} \subseteq \mathbb{R}^d$ for some $d \in \mathbb{Z}^+$, it can be shown that this conditional mutual singularity implies mutual singularity of $Q_s(X^\infty)$ and $\bar{P}_{k^*}(X^\infty)$. To see this for countable $\mathcal{X}$, let $B_{x^{t_m}}$ be any event such that $Q_s(B_{x^{t_m}} \mid x^{t_m}) = 1$ and $\bar{P}_{k^*}(B_{x^{t_m}} \mid x^{t_m}) = 0$. Then, for $B = \{y^\infty \in \mathcal{X}^\infty \mid y^\infty_{t_m+1} \in B_{y^{t_m}}\}$, we have that $Q_s(B) = 1$ and $\bar{P}_{k^*}(B) = 0$. In the uncountable case, however, $B$ may not be measurable. In that case, the proof follows by Corollary 5.9.2, proved in Section 5.9.3.

Any countable mixture of distributions that are mutually singular with $\bar{P}_{k^*}$, in particular $R$, is mutually singular with $\bar{P}_{k^*}$. This implies (5.39) by Lemma 3.1 of [5], which says that for any two mutually singular distributions $R$ and $P$, the density ratio $r(X^n)/p(X^n)$ goes to zero as $n \to \infty$ with $P$-probability 1.

5.9.2 Proof of Theorem 5.2.2

The proof is almost identical to the proof of Theorem 5.2.1. Let $U_n = \{s \in \mathcal{S} \mid K_{n+1}(s) \ne k^*\}$ denote the set of "bad" parameters $s$ that select an incorrect model. It is sufficient to show that
$$\lim_{n\to\infty} \frac{\sum_{s \in U_n} \pi(s)\, q_s(X^n)}{\sum_{s \in \mathcal{S}} \pi(s)\, q_s(X^n)} = 0 \quad \text{with } \bar{P}^B_{k^*}\text{-probability 1.} \qquad (5.40)$$
Note that the $q_s$ in (5.40) are defined relative to the non-Bayesian estimators $\bar{p}_1, \bar{p}_2, \ldots$, whereas the $\bar{P}^B_{k^*}$ on the right of the equation is the probability according to a Bayesian marginal distribution $\bar{P}^B_{k^*}$, which has been chosen so that the theorem's condition holds. To see that (5.40) is sufficient to prove the theorem, suppose the theorem is false. Then, because the prior $w_{k^*}$ is mutually absolutely continuous with Lebesgue measure, there exists a $\Phi \subseteq \Theta_{k^*}$ with nonzero prior measure under $w_{k^*}$, such that (5.8) does not hold for any $\theta^* \in \Phi$. But then by definition of $\bar{P}^B_{k^*}$ we have a contradiction with (5.40).

Using exactly the same reasoning as in the proof of Theorem 5.2.1, it follows that, analogously to (5.39), we have
$$\lim_{n\to\infty} \frac{r(X^n)}{\pi(s = (0, k^*)) \cdot \bar{p}^B_{k^*}(X^n)} = 0 \quad \text{with } \bar{P}^B_{k^*}\text{-probability 1.} \qquad (5.41)$$
This is just (5.39) with $r$ now referring to a mixture of switch-distributions defined relative to the non-Bayesian estimators $\bar{p}_1, \bar{p}_2, \ldots$, and the $\bar{p}^B_{k^*}$ in the denominator

and on the right referring to the Bayesian marginal distribution $\bar{P}^B_{k^*}$. Using (5.38) and the fact that $\sum_{s \in \mathcal{S}} \pi(s)\, q_s(x^n) \ge \pi(s = (0, k^*)) \cdot \bar{p}_{k^*}(x^n)$, and the fact that, by assumption, for some $K$, for all large $n$, $\bar{p}_{k^*}(X^n) \ge \bar{p}^B_{k^*}(X^n)\, 2^{-K}$ with $\bar{P}^B_{k^*}$-probability 1, (5.41) implies (5.40).

5.9.3 Mutual Singularity as Used in the Proof of Theorem 5.2.1

Let Y^2 = (Y_1, Y_2) be random variables that take values in separable metric spaces Ω_1 and Ω_2, respectively. We will assume all spaces to be equipped with Borel σ-algebras generated by the open sets. Let p be a prediction strategy for Y^2 with corresponding distributions P(Y_1) and, for any y^1 ∈ Ω_1, P(Y_2 | y^1). To ensure that P(Y^2) is well-defined, we impose the requirement that for any fixed measurable event A_2 ⊆ Ω_2 the probability P(A_2 | y^1) is a measurable function of y^1.

Lemma 5.9.1. Suppose p and q are prediction strategies for Y^2 = (Y_1, Y_2), which take values in separable metric spaces Ω_1 and Ω_2, respectively. Then if P(Y_2 | y^1) and Q(Y_2 | y^1) are mutually singular for all y^1 ∈ Ω_1, then P(Y^2) and Q(Y^2) are mutually singular.

The proof, due to Peter Harremoës, is given below the following corollary, which is what we are really interested in. Let X^∞ = X_1, X_2, ... be random variables that take values in the separable metric space X. Then what we need in the proof of Theorem 5.2.1 is the following corollary of Lemma 5.9.1:

Corollary 5.9.2. Suppose p and q are prediction strategies for the sequence of random variables X^∞ = X_1, X_2, ... that take values in respective separable metric spaces X_1, X_2, ... Let m be any positive integer. Then if P(X^∞_{m+1} | x^m) and Q(X^∞_{m+1} | x^m) are mutually singular for all x^m ∈ X^m, then P(X^∞) and Q(X^∞) are mutually singular.

Proof. The product spaces X_1 × ··· × X_m and X_{m+1} × X_{m+2} × ··· are separable metric spaces [64, pp. 5,6]. Now apply Lemma 5.9.1 with Ω_1 = X_1 × ··· × X_m and Ω_2 = X_{m+1} × X_{m+2} × ···.

Proof of Lemma 5.9.1. For each ω_1 ∈ Ω_1, by mutual singularity of P(Y_2 | ω_1) and Q(Y_2 | ω_1) there exists a measurable set C_{ω_1} ⊆ Ω_2 such that P(C_{ω_1} | ω_1) = 1 and Q(C_{ω_1} | ω_1) = 0. As Ω_2 is a metric space, it follows from [64, Theorems 1.1 and 1.2 in Chapter II] that for any ε > 0 there exists an open set U^ε_{ω_1} ⊇ C_{ω_1} such that

    P(U^ε_{ω_1} | ω_1) = 1   and   Q(U^ε_{ω_1} | ω_1) < ε.        (5.42)

As Ω_2 is a separable metric space, there also exists a countable sequence {B_i}_{i≥1} of open sets such that every open subset of Ω_2 (U^ε_{ω_1} in particular) can be expressed as the union of sets from {B_i} [64, Theorem 1.8 in Chapter I].

Let {B′_i}_{i≥1} denote a subsequence of {B_i} such that U^ε_{ω_1} = ∪_i B′_i. Suppose {B′_i} is a finite sequence. Then let V^ε_{ω_1} = U^ε_{ω_1}. Suppose it is not. Then 1 = P(U^ε_{ω_1} | ω_1) = P(∪_{i=1}^∞ B′_i | ω_1) = lim_{n→∞} P(∪_{i=1}^n B′_i | ω_1), because ∪_{i=1}^n B′_i as a function of n is an increasing sequence of sets. Consequently, there exists an N ∈ N such that P(∪_{i=1}^N B′_i | ω_1) > 1 − ε, and we let V^ε_{ω_1} = ∪_{i=1}^N B′_i. Thus in any case there exists a set V^ε_{ω_1} ⊆ U^ε_{ω_1} that is a union of a finite number of elements of {B_i} such that

    P(V^ε_{ω_1} | ω_1) > 1 − ε   and   Q(V^ε_{ω_1} | ω_1) < ε.        (5.43)

Let {D_i}_{i≥1} denote an enumeration of all possible unions of a finite number of elements of {B_i}, and define the disjoint sequence of sets {A^ε_i}_{i≥1} by

    A^ε_i = {ω_1 ∈ Ω_1 : P(D_i | ω_1) > 1 − ε, Q(D_i | ω_1) < ε} \ ∪_{j=1}^{i−1} A^ε_j        (5.44)

for i = 1, 2, ... Note that, by the reasoning above, for each ω_1 ∈ Ω_1 there exists an i such that ω_1 ∈ A^ε_i, which implies that {A^ε_i} forms a partition of Ω_1. Now, as all elements of {A^ε_i} and {D_i} are measurable, so is the set F^ε = ∪_{i=1}^∞ A^ε_i × D_i ⊆ Ω_1 × Ω_2, for which we have that P(F^ε) = Σ_{i=1}^∞ P(A^ε_i × D_i) > (1 − ε) Σ_{i=1}^∞ P(A^ε_i) = 1 − ε, and likewise Q(F^ε) < ε.

Finally, let G = ∩_{n=1}^∞ ∪_{k=n}^∞ F^{2^{−k}}. Then P(G) = lim_{n→∞} P(∪_{k=n}^∞ F^{2^{−k}}) ≥ lim_{n→∞} (1 − 2^{−n}) = 1 and Q(G) = lim_{n→∞} Q(∪_{k=n}^∞ F^{2^{−k}}) ≤ lim_{n→∞} Σ_{k=n}^∞ 2^{−k} = lim_{n→∞} 2^{−n+1} = 0, which proves the lemma.

5.9.4 Proof of Theorem 5.5.1

Before we prove Theorem 5.5.1, we first need to establish some additional properties of the prior π as defined in (5.7). Define, for all n ∈ N and s = ((t_1, k_1), ..., (t_m, k_m)) ∈ S:

    S_n(s) := 1_{{t_1,...,t_m}}(n − 1);        (5.45)
    M_n(s) := 1_{{t_m, t_m + 1, ...}}(n − 1);        (5.46)
    K_n(s) := k_i for the unique i such that t_i < n and (i = m ∨ t_{i+1} ≥ n).        (5.47)

These functions denote, respectively, whether or not a switch occurs just before outcome n, whether or not the last switch occurs somewhere before outcome n, and which prediction strategy is used for outcome n. The prior π determines the distributions of these random variables. We also define E_n(s) := (S_n(s), M_n(s), K_n(s)) as a convenient abbreviation. Every parameter value s ∈ S induces an infinite sequence of values E_1, E_2, ... The advantage of these new variables is that they allow us to reformulate the prior as a strategy for prediction of the value of the next random variable E_{n+1} (which in turn determines the distribution on X_{n+1} given x^n), given all previous random variables E^n. Therefore, we first calculate

the conditional probability π(E_{n+1} | E^n) before proceeding to prove the theorem. As it turns out, our prior has the nice property that this conditional probability has a very simple functional form: depending on E_n, it is either zero, or a function of only E_{n+1} itself. This will greatly facilitate the analysis.

Lemma 5.9.3. Let π(s) = θ^{m−1}(1 − θ) π_k(k_1) Π_{i=2}^m π_t(t_i | t_i > t_{i−1}) π_k(k_i) as in (5.7). For π(s) > 0 we have

    π(E_1) := π_k(K_1) · { θ        if M_1 = 0
                         { 1 − θ    if M_1 = 1        (5.48)

    π(E_{n+1} | E^n) := { π_t(Z > n | Z ≥ n)                          if S_{n+1} = 0 and M_{n+1} = 0
                        { 1                                           if S_{n+1} = 0 and M_{n+1} = 1
                        { π_t(Z = n | Z ≥ n) π_k(K_{n+1}) θ           if S_{n+1} = 1 and M_{n+1} = 0
                        { π_t(Z = n | Z ≥ n) π_k(K_{n+1}) (1 − θ)     if S_{n+1} = 1 and M_{n+1} = 1.        (5.49)

Proof. To check (5.48), note that we must have either E_1 = (1, 1, k), which corresponds to s = (0, k) and has probability π_k(k)(1 − θ) as required, or E_1 = (1, 0, k). The latter corresponds to the event that m > 1 and k_1 = k, which has probability π_k(k)θ.

We proceed to calculate the conditional distribution π(E_{n+1} | E^n). We distinguish the case that M_n(s) = 1 (the last switch defined by s occurs before sample size n), and M_n(s) = 0 (there will be more switches). First suppose M_n(s) = 0, and let a_n = max{i | t_i < n} = Σ_{i=1}^n S_i. Then

    π(E^n) = Σ_{m=a_n+1}^∞ π_m(m) Σ_{t_{a_n+1}=n}^∞ Σ_{t_{a_n+2},...,t_m} Σ_{k_{a_n+1},...,k_m}  Π_{i=1}^m π_t(t_i | t_i > t_{i−1}) π_k(k_i)

           = Σ_{m=a_n+1}^∞ π_m(m) [ Π_{i=1}^{a_n} π_t(t_i | t_i > t_{i−1}) π_k(k_i) ] Σ_{t_{a_n+1}=n}^∞ π_t(t_{a_n+1} | t_{a_n+1} > t_{a_n})
             · [ Σ_{t_{a_n+2},...,t_m} Π_{i=a_n+2}^m π_t(t_i | t_i > t_{i−1}) ] · [ Σ_{k_{a_n+1},...,k_m} Π_{i=a_n+1}^m π_k(k_i) ]

           = π_m(Z > a_n) [ Π_{i=1}^{a_n} π_t(t_i | t_i > t_{i−1}) π_k(k_i) ] · π_t(t_{a_n+1} ≥ n) · 1 · 1.        (5.50)

If M_n(s) = 1, then there is only one s that matches E^n, which has probability

    π(E^n) = π_m(Z = a_n) Π_{i=1}^{a_n} π_t(t_i | t_i > t_{i−1}) π_k(k_i).        (5.51)

From (5.50) and (5.51) we can compute the conditional probability π(E_{n+1} | E^n). We distinguish further on the basis of the possible values of S_{n+1} and M_{n+1}, which together determine M_n (namely, if M_{n+1} = 0 then M_n = 0, and if M_{n+1} = 1 then M_n = 1 − S_{n+1}). Also note that S_{n+1} = 0 implies a_{n+1} = a_n, and S_{n+1} = 1 implies a_{n+1} = a_n + 1 and t_{a_{n+1}} = n. Conveniently, most factors cancel out, and we obtain

    π(E_{n+1} | E^n) = { π_t(t_{a_n+1} ≥ n+1) / π_t(t_{a_n+1} ≥ n)                         if S_{n+1} = 0, M_{n+1} = 0
                       { 1                                                                 if S_{n+1} = 0, M_{n+1} = 1
                       { [π_m(Z > a_n+1) / π_m(Z > a_n)] · π_t(t_{a_n+2} ≥ n+1)
                         · [π_t(t_{a_n+1} | t_{a_n+1} > t_{a_n}) / π_t(t_{a_n+1} ≥ n)] · π_k(k_{a_n+1})    if S_{n+1} = 1, M_{n+1} = 0
                       { [π_m(Z = a_n+1) / π_m(Z > a_n)]
                         · [π_t(t_{a_n+1} | t_{a_n+1} > t_{a_n}) / π_t(t_{a_n+1} ≥ n)] · π_k(k_{a_n+1})    if S_{n+1} = 1, M_{n+1} = 1,

which reduces to (5.49).

Proof of Theorem 5.5.1. We will use a number of independence properties of P in this proof. First, we have that the distribution on E_{n+1} is independent of X^n conditional on E^n, because, using Bayes' rule,

    P(E_{n+1} | E^n, X^n) = P(X^n | E^{n+1}) P(E_{n+1} | E^n) / P(X^n | E^n)
                          = P(X^n | E^n) P(E_{n+1} | E^n) / P(X^n | E^n) = π(E_{n+1} | E^n),        (5.52)

provided that P(E^{n+1}, X^n) > 0. In turn, whether or not E_{n+1} can occur depends only on M_n and K_n. For all n ≥ 1, define the function N(M_n, K_n) as the set of values of E_{n+1} that have positive probability conditional on M_n and K_n, i.e.

    N(M_n, K_n) := {(s, m, k) | either s = 1 ∧ M_n = 0, or s = 0 ∧ m = M_n ∧ k = K_n}.        (5.53)

Thus, the conditional distribution on E_{n+1} given all previous values E^n and all observations X^n is a function of only E_{n+1} itself, n, M_n and K_n. It remains the same whether or not any of the other variables are included in the conditional. If S_{n+1} = 1 then it is not even a function of K_n. This interesting property is used three times in the following. Namely,

1. Since (0, 0, k) ∈ N(0, k), we have π_t(Z > n | Z ≥ n) = P(E_{n+1} = (0, 0, k) | x^n, M_n = 0, K_n = k).

2. Since (1, 0, k) ∈ N(0, k′) for all k′, we have π_t(Z = n | Z ≥ n) π_k(k) θ = P(E_{n+1} = (1, 0, k) | x^n, M_n = 0).

3. If k ∈ K_1, then π_k(k) θ = P(x^0, M_1 = 0, K_1 = k).

We first show that the invariants w^a_k = P(x^{n−1}, M_n = 0, K_n = k) and w^b_k = P(x^{n−1}, M_n = 1, K_n = k) hold at the start of each iteration (before line 3). The invariants ensure that w^a_k + w^b_k = P(x^{n−1}, K_n = k), so that the correct probabilities are reported.

Line 1 initialises w^a_k to θ π_k(k) for k ∈ K_1. By item 3 this equals P(x^0, M_1 = 0, K_1 = k) as required. We omit calculations for w^b_k, which run along the same lines as for w^a_k. Thus the loop invariant holds at the start of the first iteration. We proceed to go through the algorithm step by step to show that the invariant holds in subsequent iterations as well. In the loss update in line 4 we update the weights for k ∈ K_n to

    w^a_k = P(x^{n−1}, M_n = 0, K_n = k) · p_k(x_n | x^{n−1})
          = [ Σ_{s : M_n = 0, K_n = k} π(s) Π_{i=1}^{n−1} p_{K_i}(x_i | x^{i−1}) ] p_{K_n}(x_n | x^{n−1}) = P(x^n, M_n = 0, K_n = k).

Similarly w^b_k = P(x^n, M_n = 1, K_n = k). Then in line 5, we compute

    pool = π_t(Z = n | Z ≥ n) Σ_{k∈K_n} P(x^n, M_n = 0, K_n = k) = π_t(Z = n | Z ≥ n) P(x^n, M_n = 0).

Finally, after the loop that starts at line 6 and ends at line 9, we obtain for all k ∈ K_{n+1}:

    w^a_k = P(x^n, M_n = 0, K_n = k) π_t(Z > n | Z ≥ n) + π_t(Z = n | Z ≥ n) P(x^n, M_n = 0) π_k(k) θ
          = P(x^n, M_n = 0, K_n = k) P(S_{n+1} = 0, M_{n+1} = 0 | x^n, M_n = 0, K_n = k)
            + P(x^n, M_n = 0) P(S_{n+1} = 1, M_{n+1} = 0, K_{n+1} = k | x^n, M_n = 0)
          = P(x^n, M_n = 0, K_n = k, S_{n+1} = 0, M_{n+1} = 0) + P(x^n, M_n = 0, S_{n+1} = 1, M_{n+1} = 0, K_{n+1} = k)
          = P(x^n, S_{n+1} = 0, M_{n+1} = 0, K_{n+1} = k) + P(x^n, S_{n+1} = 1, M_{n+1} = 0, K_{n+1} = k)
          = P(x^n, M_{n+1} = 0, K_{n+1} = k).

Here we used items 1 and 2 in the second equality. Again, a similar derivation shows that w^b_k = P(x^n, K_{n+1} = k, M_{n+1} = 1). These weights satisfy the invariant at the beginning of the next iteration; after the last iteration the final posterior is also correctly reported based on these weights.
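For concreteness, the following is a minimal Python sketch (ours, not the implementation verified above) of the forward algorithm whose loop invariant the proof establishes. It assumes a geometric prior on switch times, so that π_t(Z = n | Z ≥ n) is a constant, and a uniform π_k; all names and parameters are illustrative, and the chapter itself defines the actual algorithm and priors.

    def switch_posterior(experts, data, theta=0.5, switch_prob=0.01):
        # experts[k](past, x) returns p_k(x | past); returns weights
        # proportional to P(x^n, K_{n+1} = k).
        K = len(experts)
        pi_k = 1.0 / K                                   # uniform pi_k
        # Line 1: w^a_k = P(x^0, M_1=0, K_1=k), w^b_k = P(x^0, M_1=1, K_1=k).
        wa = [theta * pi_k] * K
        wb = [(1 - theta) * pi_k] * K
        for n, x in enumerate(data):
            preds = [experts[k](data[:n], x) for k in range(K)]
            # Line 4 (loss update): now w^a_k = P(x^n, M_n=0, K_n=k), etc.
            wa = [wa[k] * preds[k] for k in range(K)]
            wb = [wb[k] * preds[k] for k in range(K)]
            # Line 5: probability mass that switches away at time n.
            pool = switch_prob * sum(wa)
            # Lines 6-9: the weights for the next iteration; for a geometric
            # switch-time prior, pi_t(Z > n | Z >= n) = 1 - switch_prob.
            wa = [(1 - switch_prob) * wa[k] + pool * pi_k * theta
                  for k in range(K)]
            wb = [wb[k] + pool * pi_k * (1 - theta) for k in range(K)]
        return [wa[k] + wb[k] for k in range(K)]         # P(x^n, K_{n+1}=k)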

Chapter 6

Individual Sequence Rate Distortion

Rate-distortion theory analyses communication over a channel under a constraint on the number of transmitted bits, the "rate". It currently serves as the theoretical underpinning for many important applications such as lossy compression and denoising, or more generally, applications that require a separation of structure and noise in the input data.

Classical rate-distortion theory evolved from Shannon's theory of communication [79]. It studies the trade-off between the rate and the achievable fidelity of the transmitted representation under some distortion function, where the analysis is carried out in expectation under some source distribution. Therefore the theory can only be meaningfully applied if we have some reasonable idea as to the distribution on objects that we want to compress lossily. While lossy compression is ubiquitous, propositions with regard to the underlying distribution tend to be ad-hoc, and necessarily so, because (1) it is a questionable assumption that the objects that we submit to lossy compression are all drawn from the same probability distribution, or indeed that they are drawn from a distribution at all, and (2) even if a true source distribution is known to exist, in most applications the sample space is so large that it is extremely hard to determine what it is like: objects that occur in practice very often exhibit more structure than predicted by the used source model.

For large outcome spaces then, it becomes important to consider structural properties of individual objects. For example, if the rate is low, then we may still be able to transmit objects that have a very regular structure without introducing any distortion, but this becomes impossible for objects with high information density. This point of view underlies some recent research in the lossy compression community [77]. At about the same time, a rate-distortion theory which allows analysis of individual objects has been developed within the framework of Kolmogorov complexity [60]. It defines a rate-distortion function not with respect to a source distribution, but with respect to an individual source word. Every source word thus obtains its own associated rate-distortion function.


We will first give a brief introduction to algorithmic rate-distortion theory in Section 6.1. We also describe a novel generalisation of the theory to settings with side information, and we describe two distinct applications of the theory, namely lossy compression and denoising.

Algorithmic rate-distortion theory is based on Kolmogorov complexity, which is not computable. We nevertheless cross the bridge between theory and practice in Section 6.2, by approximating the Kolmogorov complexity of an object by its compressed size under a general purpose data compression algorithm. Even so, approximating the rate-distortion function is a difficult search problem. We motivate and outline the genetic algorithm we used to approximate the rate-distortion function. In Section 6.3 we describe four experiments in lossy compression and denoising. The results are presented and discussed in Section 6.4. Then, in Section 6.5 we take a step back and discuss to what extent our practical approach yields a faithful approximation of the theoretical algorithmic rate-distortion function, followed by a discussion of how such a practical approach fits within the framework of MDL model selection (Section 6.6). We end with a conclusion in Section 6.7.

6.1 Algorithmic Rate-Distortion

Suppose we want to communicate objects x from a set of source words X using at most r bits per object. We call r the rate. We locate a good representation of x within a finite set Y, which may be different from X in general (but we usually have X = Y in this text). The lack of fidelity of a representation y is quantified by a distortion function d : X × Y → R.

Figure 6.1 Rate-distortion profile and distortion-rate function. [Plot: distortion against rate (bits), from 0 to K(x), showing some arbitrary representations, the rate-distortion profile, the distortion-rate function, and the representations that witness the rate-distortion function.]

The Kolmogorov complexity of y, denoted K(y), is the length of the shortest program that constructs y. More precisely, it is the length of the shortest input to a fixed universal binary prefix machine that will output y and then halt; also see the textbook [60]. We can transmit any representation y that has K(y) ≤ r; the receiver can then run the program to obtain y and is thus able to reconstruct x up to distortion d(x, y). Define the rate-distortion profile P_x of the source word x as the set of pairs ⟨r, a⟩ such that there is a representation y ∈ Y with d(x, y) ≤ a and K(y) ≤ r. The possible combinations of r and a can also be characterised by the rate-distortion function of the source word x, which is defined as r_x(a) = min{r : ⟨r, a⟩ ∈ P_x}, or by the distortion-rate function of the source word x, which is defined as d_x(r) = min{a : ⟨r, a⟩ ∈ P_x}. These two functions are somewhat like inverses of each other; although strictly speaking they are not, since they are monotonic but not strictly monotonic. A representation y is said to witness the rate-distortion function of x if r_x(d(x, y)) = K(y). These definitions are illustrated in Figure 6.1.

Algorithmic rate-distortion theory is developed and treated in much more detail in [92]. It is a generalisation of Kolmogorov's structure function theory, see [93].

We generalise the algorithmic rate-distortion framework, so that it can accommodate side information. Suppose that we want to transmit a source word x ∈ X and we have chosen a representation y ∈ Y as before. The encoder and decoder often share a lot of information: both might know that grass is green and the sky is blue, they might share a common language, and so on. They would not need to transmit such information. If encoder and decoder share some information z, then the programs they transmit to compute the representation y may use this side information z. Such programs can be much shorter, and are never much longer, than their counterparts that do not use side information. This can be formalised by switching to the conditional Kolmogorov complexity K(y|z), which is the length of the shortest Turing machine program that constructs y on input z. We redefine K(y) = K(y|ε), where ε is the empty sequence, so that K(y|z) ≤ K(y) + O(1): the length of the shortest program for y can never significantly increase when side information is provided, but it might certainly decrease when y and z share a lot of information [60]. We change the definitions as follows: the rate-distortion profile of the source word x with side information z is the set of pairs ⟨r, a⟩ such that there is a representation y ∈ Y with d(x, y) ≤ a and K(y|z) ≤ r. The definitions of the rate-distortion function and the distortion-rate function are similarly changed. Henceforth we will omit mention of the side information z unless it is relevant to the discussion.

While this generalisation seems very natural, the authors are not aware of earlier proposals along these lines. In Section 6.4 we will demonstrate one use for this generalised rate-distortion theory: removal of spelling errors in written text, an example where denoising is not practical without use of side information.

6.1.1 Distortion Spheres, the Minimal Sufficient Statistic

A representation y that witnesses the rate-distortion function is the best possible rendering of the source object x at the given rate because it minimises the distortion, but if the rate is lower than K(x), then some information is necessarily lost. Since one of our goals is to find the best possible separation between structure and noise in the data, it is important to determine to what extent the discarded information is noise.

Given a representation y and the distortion a = d(x, y), we can find the source object x somewhere on the list of all x′ ∈ X that satisfy d(x′, y) = a. The information conveyed about x by y and a is precisely that x can be found on this list. We call such a list a distortion sphere. A distortion sphere of radius a, centred around y, is defined as follows:

    S_y(a) := {x′ ∈ X : d(x′, y) = a}.        (6.1)

If x is a completely random element of this list, then the discarded information is pure "white" noise. Moreover, all random elements in the list share all "simply described" (in the sense of having low Kolmogorov complexity) properties that x satisfies. Hence, with respect to the "simply described" properties, every such random element is as good as x; see [92] for more details. In such cases a literal specification of the index of any object x′ in the list (in particular the original object x) is the most efficient code for that x′, given only that it is in S_y(a). A fixed-length, literal code requires log |S_y(a)| bits. (Here and in the following, all logarithms are taken to base 2 unless otherwise indicated.) On the other hand, if the discarded information is structured, then the Kolmogorov complexity of the index of x in S_y(a) will be significantly lower than the logarithm of the size of the sphere. The difference between these two code lengths can be used as an indicator of the amount of structural information that is discarded by the representation y. Vereshchagin and Vitányi [92] call this quantity the randomness deficiency of the source object x in the set S_y(a), and they show that if y witnesses the rate-distortion function of x, then it minimises the randomness deficiency at rate K(y); thus the rate-distortion function identifies those representations that account for as much structure as possible at the given rate.

To assess how much structure is being discarded at a given rate, consider a code for the source object x in which we first transmit the shortest possible program that constructs both a representation y and the distortion d(x, y), followed by a literal, fixed-length index of x in the distortion sphere S_y(a). Such a code has length function

    K(y, d(x, y)) + L_y(x),   where L_y(x) := log |S_y(d(x, y))|.        (6.2)

If the rate is very low then the representation y models only very basic structure and the randomness deficiency in the distortion sphere around y is high. Borrowing terminology from statistics, we may say that y is a representation that "underfits" the data. In such cases we should find that K(y, d(x, y)) + L_y(x) > K(x), because the fixed-length code for the index of x within the distortion sphere is suboptimal in this case. But suppose that y is complex enough that it satisfies K(y, d(x, y)) + L_y(x) ≈ K(x). In [92], such representations are called (algorithmic) sufficient statistics for the data x. A sufficient statistic has close to zero randomness deficiency, which means that it represents all structure that can be detected in the data.
However, sufficient statistics might contain not only structure, but noise as well. Such a representation would be overly complex, an example of overfitting. A minimal sufficient statistic balances between underfitting and overfitting. It is defined as the lowest complexity sufficient statistic, in other words the lowest complexity representation y that minimises the total code length. As such it can also be regarded as the "model" that should be selected on the basis of the Minimum Description Length (MDL) principle [4]. For a further discussion of this relationship see Section 6.6. To be able to relate the distortion-rate function to this code length we define the code length function λ_x(r) = K(y, d(x, y)) + L_y(x), where y is the representation that minimises the distortion at rate r.¹

¹ This is superficially similar to the MDL function defined in [93], but it is not exactly the same, since it involves optimisation of the distortion at a given rate rather than direct optimisation of the code length.

6.1.2 Applications: Denoising and Lossy Compression

Representations that witness the rate-distortion function provide optimal separation between structure that can be expressed at the given rate and residual information that is perceived as noise. Therefore, these representations can be interpreted as denoised versions of the original. In denoising, the goal is of course to discard as much noise as possible, without losing any structure. Therefore the minimal sufficient statistic, which was described in the previous section, is the best candidate for applications of denoising.

While the minimal sufficient statistic is a denoised representation of the original signal, it is not necessarily given in a directly usable form. For instance, Y could consist of subsets of X, but a set of source words is not always acceptable as a denoising result. So in general one may need to apply some function f : Y → X to the sufficient statistic to construct a usable object. But if X = Y and the distortion function is a metric, as in our case, then the representations are already in an acceptable format, so here we use the identity function for the transformation f.

In applications of lossy compression, one may be willing to accept a rate which is lower than the minimal sufficient statistic complexity, thereby losing some structural information. However, for a minimal sufficient statistic y, theory does tell us that it is not worthwhile to set the rate to a higher value than the complexity of y. The original object x is a random element of S_y(d(x, y)), and it cannot be distinguished from any other random z ∈ S_y(d(x, y)) using only

“simply described” properties. So we have no “simply described” test to discredit the hypothesis that x (or any such z) is the original object, given y and d(x, y). If we increase the rate and find a model y′ with d(x, y′) < d(x, y), then [...]

There are two natural ways in which a representation may be turned into a final compression result:

1. If a representation y witnesses the rate-distortion function for a source word x ∈ X, then this means that x cannot be distinguished from any other object x′ ∈ S_y(d(x, y)) at rate K(y). Therefore we should not use a deterministic transformation, but rather report the uniform distribution on S_y(d(x, y)) as the lossily compressed version of x. This method has the advantage that it is applicable whether or not X = Y.

2. On the other hand, if X = Y and the distortion function is a metric, then it makes sense to use the identity transformation again, although here the motivation is different. Suppose we select some x′ ∈ S_y(d(x, y)) instead of y. Then the best upper bound we can give on the distortion is d(x, x′) ≤ d(x, y) + d(y, x′) = 2d(x, y) (by the triangle inequality and symmetry). On the other hand if we select y, then the distortion is exactly d(x, y), which is only half of the upper bound we obtained for x′. Therefore it is more suitable if one adopts a worst-case approach. This method has as an additional advantage that the decoder does not need to know the distortion d(x, y), which often cannot be computed from y without knowledge of x.

To illustrate the difference one may expect from these approaches, consider the situation where the rate is lower than the rate that would be required to specify a sufficient statistic. Then, intuitively, all the noise in the source word x as well as some of the structure are lost by compressing it to a representation y. The second method immediately reports y, which contains a lot less noise than the source object x; thus x and y are qualitatively different, which may be undesirable. On the other hand, the compression result will be qualitatively different from x anyway, because the rate simply is too low to retain all structure. If one would apply the first approach, then a result x′ would likely contain more noise than the original, because it contains less structure at the same level of distortion (meaning that K(x′) > K(x) while d(x′, y) = d(x, y)).

If the rate is high enough to transmit a sufficient statistic, then the first approach seems preferable. We have nevertheless chosen to always report y directly in our analysis, which has the advantage that all reported results are of the same type.

6.2 Computing Individual Object Rate-Distortion

The rate-distortion function for an object x with side information z and a distortion function d is found by simultaneously minimising two objective functions

    g_1(y) = K(y | z),
    g_2(y) = d(x, y),        (6.3)
    g(y) = (g_1(y), g_2(y)).

We call the tuple g(y) the trade-off of y. We impose a partial order ≼ on representations:

    y ≼ y′ if and only if g_1(y) ≤ g_1(y′) and g_2(y) ≤ g_2(y′).        (6.4)

Our goal is to find the set of representations that are minimal under ≼. Such an optimisation problem cannot be implemented because of the uncomputability of K(·). To make the idea practical, we need to approximate the conditional Kolmogorov complexity. As observed in [20], it follows directly from symmetry of information for Kolmogorov complexity (see [60, p. 233]) that

    K(y | z) = K(zy) − K(z) + O(log n),        (6.5)

where n is the length of zy. Ignoring the logarithmic term, this quantity can be approximated by replacing K(·) by K̃(·), the length of the compressed representation under a general purpose compression algorithm. The approximate conditional complexity then becomes

    K̃(y | z) := K̃(zy) − K̃(z) ≈ K(zy) − K(z) = K(y | z) + O(log n).        (6.6)

This may be a poor approximation: it is an upper bound that may be quite high even for objects that have conditional Kolmogorov complexity close to zero. Our results show evidence that some of the theoretical properties of the distortion-rate function nevertheless carry over to the practical setting; we also explain how some observations that are not predicted by theory are in fact related to the (unavoidable) inefficiencies of the used compressor.
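For illustration, (6.6) is straightforward to compute with any off-the-shelf compressor. The sketch below uses Python's bz2 module purely as a stand-in for K̃; the experiments in this chapter use our own compressor, described in Section 6.2.1.

    import bz2

    def approx_complexity(data: bytes) -> int:
        # K~(data): compressed size in bits under a general purpose compressor.
        return 8 * len(bz2.compress(data))

    def approx_conditional_complexity(y: bytes, z: bytes) -> int:
        # K~(y|z) := K~(zy) - K~(z), as in (6.6).
        return approx_complexity(z + y) - approx_complexity(z)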

6.2.1 Compressor (rate function)

We could have used any general-purpose compressor in (6.6), but we chose to implement our own, for the following reasons:

• It should be both fast and efficient. We can gain some advantage over other available compressors because there is no need to actually construct a code. It suffices to compute code lengths, which is much easier. As a secondary advantage, the code lengths we compute are not necessarily multiples of eight bits: we allow rational idealised code lengths, which may improve precision.

• It should not have any arbitrary restrictions or optimisations. Most general purpose compressors have limited window sizes or optimisations to improve compression of common file types; such features could make the results harder to interpret.

In our experiments we used a block sorting compression algorithm with a move-to-front scheme as described in [17]. In the encoding stage we employ a simple statistical model and omit the actual encoding, as it suffices to accumulate code lengths. The source code of our implementation (in C) is available from the authors upon request. The resulting algorithm is very similar to a number of common general purpose compressors, such as the freely available bzip2 and zzip (see [82]), but it is simpler and faster for small inputs.

Of course, domain specific compressors might yield better compression for some object types (such as sound wave files), and therefore a better approximation of the Kolmogorov complexity. However, the compressor that we implemented is quite efficient for objects of many of the types that occur in practice; in particular it compressed the objects of our experiments (text and small images) quite well. We have tried to improve compression performance by applying standard image preprocessing algorithms to the images, but this turned out not to improve compression at all. Table 6.1 lists the compressed size of an image of a mouse under various compression and filtering regimes. Compared to other compressors, ours is quite efficient; this is probably because other compressors are optimised for larger files and because we avoid all overhead inherent in the encoding process. Most compressors have optimisations for text files, which might explain why our compressor compares less favourably on the Oscar Wilde fragment.

6.2.2 Code Length Function

In Section 6.1.1 we introduced the code length function λ_x(r). Its definition makes use of (6.2), for which we have not yet provided a computable alternative. We use the following approximation:

    K(y, d(x, y)) ≈ K̃(y) + L_D(d(x, y) | y),        (6.7)

Table 6.1 Compressed sizes of three objects that we experiment upon. See Figure 6.4(h) for the mouse, Figure 6.6 for the cross with added noise and Figure 6.10 for the corrupted Oscar Wilde fragment (the middle version). In the latter we give the compressed size conditional on a training text, like in the experiments. "A" is our own algorithm, described in Section 6.2.1. For a description of the filters see [75].

    Compression    mouse      cross     Wilde     description
    A              7995.11    3178.63   3234.45   Our compressor, described in Section 6.2.1
    zzip           8128.00    3344.00   3184.00   An efficient block sorting compressor
    PPMd           8232.00    2896.00   2744.00   High end statistical compressor
    RLE → A        8341.68    3409.22   –         A with run length encoding filter
    bzip2          9296.00    3912.00   3488.00   Widespread block sorting compressor
    gzip           9944.00    4008.00   3016.00   LZ77 compressor
    sub → A        10796.29   4024.26   –         A with Sub filter
    paeth → A      13289.34   5672.70   –         A with Paeth filter
    None           20480.00   4096.00   5864.00   Literal description

where L_D is yet another code, which is necessary to specify the radius of the distortion sphere around y in which x can be found. It is possible that this distortion is uniquely determined by y, for example if Y is the set of all finite subsets of X and list decoding distortion is used, as described in [93]. If d(x, y) is a function of y, then L_D(d(x, y) | y) = 0. In other cases, the representations do not hold sufficient information to determine the distortion. This is typically the case when X = Y, as in the examples in this text. In that case we actually need to encode d(x, y) separately. It turns out that the number of bits required to specify the distortion is negligible in proportion to the total three part code length. In the remainder of the chapter we use for L_D a universal code on the integers similar to the one described in [60]; it has code length L_D(d) = log(d + 1) + O(log log d). We also need to calculate the size of the distortion sphere, which we therefore calculate for each of the distortion functions that we describe below.
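The text fixes only the leading term of L_D; a simple concrete choice (our own, for illustration) with the required form is:

    from math import log2

    def L_D(d: int) -> float:
        # A universal code length for integers d >= 0 of the form
        # log(d+1) + O(log log d); the second term is our own choice.
        return log2(d + 1) + 2 * log2(log2(d + 2) + 1)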

6.2.3 Distortion Functions

We use three common distortion functions. All distortion functions used in this text are metrics, so they can only be used when X = Y.

Hamming distortion  Hamming distortion is perhaps the simplest possible distortion function. It can be defined when X = Y = Σ^n, the set of sequences of n symbols from a finite alphabet Σ. Let x and y be two objects of equal length n. The Hamming distortion d(x, y) is equal to the number of symbols in x that do not match those in the corresponding positions in y. A Hamming distortion sphere S_y(a) contains all objects of length n that can be constructed by replacing a symbols in y with different symbols from the alphabet Σ; thus the size of the sphere is (n choose a) · (|Σ| − 1)^a.
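This count enters the three-part code length (6.7) through the term L_y(x) = log |S_y(a)| of (6.2); a minimal sketch of the computation (assuming |Σ| ≥ 2):

    from math import comb, log2

    def log_hamming_sphere_size(n: int, a: int, alphabet_size: int) -> float:
        # log2 of (n choose a) * (|Sigma| - 1)^a, the number of strings at
        # Hamming distortion exactly a from a given string of length n.
        return log2(comb(n, a)) + a * log2(alphabet_size - 1)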

Euclidean distortion  As before, let x = x_1 ... x_n and y = y_1 ... y_n be two objects of equal length, but the symbols now have a numerical interpretation. Euclidean distortion is d(x, y) = √(Σ_{i=1}^n (x_i − y_i)²): the distance between x and y when they are interpreted as vectors in an n-dimensional Euclidean space. Note that this definition of Euclidean distortion differs from the one in [92]. Our variety of Euclidean distortion requires that X = Y = Z^n, the set of n-dimensional vectors of integers. The size of a Euclidean distortion sphere around some y ∈ Y of length n is hard to compute analytically. We use an upper bound that is reasonably tight and can be computed efficiently. One may want to skip this calculation on first reading.

First we define d(v) := d(v, 0) = √(Σ_{i=1}^n v_i²) and S(n, a) as the set {v : |v| = n, d(v) = a}. We have x ∈ S_y(a) ⇔ x − y ∈ S(n, a), so it suffices to bound the size of S(n, a). We define:

    p(δ | n, a) := c e^{−δ²n/2a²},   where c = 1 / Σ_{δ∈Z} e^{−δ²n/2a²};

    P(v) := Π_{i=1}^n p(v_i | n, d(v)).

Here p(· | n, d(v)) can be interpreted as a probability mass function on the individual entries of v (which in our application always lie between −255 and 255, so in practice we used a reduced range in the definition of the normalising constant c). Therefore P(v) defines a valid probability mass function on outcomes v in S(n, a). Thus,

    1 > Σ_{v∈S(n,a)} P(v) = Σ_{v∈S(n,a)} c^n e^{−(Σ_i v_i²) n / 2a²} = Σ_{v∈S(n,a)} c^n e^{−n/2} = c^n e^{−n/2} |S(n, a)|.

This yields a bound on the size of S(n, a), which is reasonably tight unless the distortion a is very low. In that case, we can improve the bound by observing that v must have at least z = n − d(v)² zero entries. Let v′ be a vector of length n − z that is obtained by removing z zero entries from v. Every v in S(n, a) can be constructed by inserting z zeroes into v′, so we have |S_y(a)| = |S(n, a)| ≤ (n choose z) · |S(n − z, a)|. The size of S(n − z, a) can be bounded by using the method described before recursively.
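A sketch of the first, non-recursive bound, i.e. log₂ |S(n, a)| ≤ n log₂(1/c) + (n/2) log₂ e, with the normaliser summed over the pixel range used in our application; the recursive refinement for low distortion is omitted:

    from math import exp, log, log2

    def log_euclidean_sphere_bound(n: int, a: float) -> float:
        # log2 of the bound |S(n, a)| < (1/c)^n * e^(n/2).
        if a == 0:
            return 0.0  # S(n, 0) contains only the zero vector
        inv_c = sum(exp(-d * d * n / (2 * a * a)) for d in range(-255, 256))
        return n * log2(inv_c) + n / (2 * log(2))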

Edit distortion  The edit distortion of two strings x and y, of possibly different lengths, is the minimum number of symbols that have to be deleted from, inserted into, or changed in x in order to obtain y (or vice versa) [54]. It is also known as Levenshtein distortion. It is a well-known measure that is often used in applications that require approximate string matching. Edit distortion can be defined for spaces X = Y = Σ* for a finite alphabet Σ. We develop an upper bound on the size of the edit distortion sphere; again, the calculation may be skipped on first reading.

We can identify any object in S_y(a) by a program p that operates on y, and which is defined by a list of instructions to copy, replace or delete the next symbol from y, or to insert a new symbol. We interpret a deletion as a replacement with an empty symbol, so the replacement operations henceforth include deletions. Let d(p) denote the number of insertions and replacements in p; in other words, d(p) is the length of the program minus the number of copies. Clearly for all x ∈ S_y(a) there must be a p such that p(y) = x and d(p) = d(x, y) = a. Therefore the size of S_y(a) can be upper bounded by counting the number of programs with d(p) = a. Let n be the length of y. Any program that contains i insertions and that processes y completely must be of length n + i. The i insertions, a − i replacements and n − a + i copies can be distributed over the program in (n + i choose i, n − a + i, a − i) different ways (a multinomial coefficient). For each insertion and replacement, the number of possibilities is equal to the alphabet size. Therefore,

    |S_y(a)| ≤ |{p : d(p) = a}| ≤ |Σ|^a Σ_{i=max{0, a−n}}^{a} (n + i choose i, n − a + i, a − i).

The sphere can be extremely large, so to facilitate calculation of the logarithm of the sphere size, as is required in our application, it is convenient to relax the bound some more and replace every term in the sum by the largest one. Calculation reveals that the largest term has

    i = ⌊ (2(a − n) + 1 + √(4(n² + n + a² + a) + 1)) / 4 ⌋.

6.2.4 Searching for the Rate-Distortion Function

The search problem that we propose to address has two properties that make it very hard. Firstly, the search space is enormous: at rate r there are 2^r candidate representations to consider, and for the kinds of objects that are typically subjected to lossy compression, useful representations are often millions or billions of bits long. Secondly, we want to avoid making too many assumptions about the two objective functions, so that we can later freely change the compression algorithm and the distortion function. Under such circumstances the two most obvious search methods are not practical:

• An exhaustive search is infeasible for search spaces of such large size, unless more specific properties of the objective functions are used in the design of the algorithm. To investigate how far we could take such an approach, we have implemented an exhaustive algorithm under the requirement that, given a prefix of a representation y, we can compute reasonable lower bounds on the values of both objective functions g_1 and g_2. This allows for relatively efficient enumeration of all representations for which the objective functions do not exceed specific maxima: it is never necessary to consider objects which have a prefix for which the lower bounds exceed the constraints, which allows for significant pruning. In this fashion we were able to find the rate-distortion function under Hamming distortion for objects whose compressed size is about 25 bits or less, within a few hours on a desktop computer.

• A greedy search starts with a poor solution and iteratively makes modifications that constitute strict improvements. We found that this procedure tends to terminate quickly in some local optimum that is very bad globally.

Since the structure of the search landscape is at present poorly understood and we do not want to make any unjustifiable assumptions, we use a genetic search algorithm which performs well enough that interesting results can be obtained.

6.2.5 Genetic Algorithm

The algorithm we use is an almost completely generic procedure to simultaneously optimise two separate objective functions for objects that are represented as byte sequences. To emphasise this, we will consider the abstract objective function g wherever possible, rather than the more concrete rate and distortion functions.

A finite subset P of Y is called a pool. The search algorithm initialises a pool P_0 with the representation y that has d(x, y) = 0, which means y = x in our setup, and possibly some "blank" representation that has minimal compressed code length but high distortion. The pool is then subjected to a process of selection through survival of the fittest. The weakness w_P(y) of an object y ∈ P is the number of elements of the pool that are smaller according to ≼. The (transitive) reduction trd(P) of a pool P is the subset of all elements with zero weakness. The elements of the reduction of a pool P are called models.

The pool is iteratively updated by replacing elements with high weakness (the fitness function is specified below) by new ones, which are created through either mutation (random modifications of elements) or crossover ("genetic" recombination of pairs of other candidates). We write P_i to denote the pool after i iterations. When the algorithm terminates after n iterations, it outputs the reduction of P_n. In the next sections we describe our choices for the important components of the algorithm: the mechanics of crossover and mutation, the fitness function, and the selection function which specifies the probability that a candidate is removed from the pool. In the interest of reproducibility we faithfully describe all our important design choices, even though some of them are somewhat arbitrary. A casual reader may want to skip such details and move on to Section 6.3.

Crossover

Crossover (also called recombination) is effected by the following algorithm. Given two objects x and y, we first split them both in three parts: x = x_1x_2x_3 and y = y_1y_2y_3, such that the length of x_1 is chosen uniformly at random between 0 and the length of x, and the length of x_2 is chosen from a geometric distribution with mean 5; the lengths of the y_i are proportional to the lengths of the x_i. We then construct a new object by concatenating x_1y_2x_3.
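A minimal sketch of this operator; the geometric helper and the rounding of the proportional split are our own choices:

    import random

    def geometric(mean: float) -> int:
        # Sample from a geometric distribution on {0, 1, ...} with the given mean.
        k = 0
        while random.random() >= 1.0 / (mean + 1):
            k += 1
        return k

    def crossover(x: bytes, y: bytes) -> bytes:
        i = random.randint(0, len(x))          # start of the middle part x2
        j = min(len(x), i + geometric(5))      # end of x2
        scale = len(y) / len(x) if x else 0    # split y proportionally
        yi, yj = round(i * scale), round(j * scale)
        return x[:i] + y[yi:yj] + x[j:]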

Mutation

The introduction of a mutation operation is necessary to ensure that the search space is connected, since the closure of the gene pool under crossover alone might not cover the entire search space. While we could have used any generic mutation function that meets this requirement, for reasons of efficiency we have decided to design a different mutation function for every objective function that we implemented. This is helpful because some distortion functions (here, the edit distortion) can compare objects of different sizes while others cannot: mutation is the means by which objects of different size can be introduced into the pool when this is desirable, or avoided when it is undesirable.

The mutation algorithm we use can make two kinds of change. With probability 1/4 we make a small random modification using an algorithm that depends on the distortion function. Below is a table of the distortion functions and a short description of the associated mutation algorithms:

    Distortion    Mutation algorithm
    Hamming       Sets a random byte to a uniformly random value
    Euclidean     Adds an N(0, σ = 10) value to a random byte
    Edit          A random byte is changed, inserted or deleted

With probability 3/4 we use the following mutation algorithm instead. It splits the object x into three parts x = x_1x_2x_3, where the length of x_1 is chosen uniformly at random between 0 and the length of x, and the length of x_2 is chosen from a geometric distribution with mean 5. The mutation is effected by training a (simplified version of a) third order PPM model [24] on x_1 and then replacing x_2 with an equally long sequence that is sampled from the model. The advantage of this scheme is that every replacement for x_2 gets positive probability, but replacements which have low code length and distortion tend to be much more likely than under a uniform distribution.
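The overall structure, as a sketch: point_mutation stands for the distortion-specific change from the table above and sample_from_ppm for the simplified third order PPM model, both left as stubs here; geometric is the helper from the crossover sketch.

    import random

    def mutate(x: bytes, point_mutation, sample_from_ppm) -> bytes:
        if random.random() < 0.25:
            return point_mutation(x)       # small distortion-specific change
        i = random.randint(0, len(x))      # length of x1: uniform
        j = min(len(x), i + geometric(5))  # length of x2: geometric, mean 5
        # Train a model on the prefix x1 and resample the middle part x2.
        replacement = sample_from_ppm(train=x[:i], length=j - i)
        return x[:i] + replacement + x[j:]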

Fitness Function

In theory, the set of representations that witness the rate-distortion function does not change under monotonic transformation of either objective function g_1 or g_2. We have tried to maintain this property throughout the search algorithm, including the fitness function, by never using either objective function directly but only the ordering relation ≼. Under such a regime, a very natural definition of fitness is minus the weakness of the objects with respect to the pool P. It has been an interesting mini-puzzle to come up with an algorithm that computes the weakness of all objects in the pool efficiently. Our solution is the following very simple algorithm, which has an O(n log n) average case running time. It first sorts all elements of the pool by their value under g_1 and then inserts them in order into a binary search tree in which the elements are ordered by their value under g_2. As an object is inserted into the tree, we can efficiently count how many elements with lower values for g_2 the tree already contains. These elements are precisely the objects that have both lower values on g_1 (otherwise they would not appear in the tree yet) and on g_2; as such, their number is the desired weakness.
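A sketch of this computation; we use a sorted list with binary search in place of a balanced tree, which keeps the example short at the cost of O(n) list insertions, and we gloss over the handling of ties:

    from bisect import bisect_left, insort

    def weaknesses(pool):
        # pool: list of (g1, g2) trade-offs.  Returns, for each element,
        # the number of pool elements with lower values in both objectives.
        order = sorted(range(len(pool)), key=lambda i: pool[i])  # by g1
        seen_g2 = []          # g2 values of the elements inserted so far
        w = [0] * len(pool)
        for i in order:
            g2 = pool[i][1]
            w[i] = bisect_left(seen_g2, g2)  # earlier insertions, lower g2
            insort(seen_g2, g2)
        return w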

Selection Function

Figure 6.2 Drop probability

[Plot: drop probability against (i − 1/2)/n, for alpha = 0, 1, 2, 3, 4.]

A pool P induces a trade-off profile p(P) := {g(y) : y′ ≼ y for some y′ in P}. It is not hard to see that we have p(P) = p(trd(P)) and p(P) ⊆ p(P ∪ P′) for all P′ ⊆ Y. Therefore monotonic improvement of the pool under modification is ensured as long as candidates with weakness 0 are never dropped. We drop other candidates with positive probability as follows. Let y_1, ..., y_n be the elements of P with nonzero weakness, ordered such that for 1 ≤ i < j ≤ n

we have w_P(y_i) ≤ w_P(y_j). The probability that candidate y_i is dropped is then a function of (i − 1/2)/n that depends on a parameter α; it is plotted for α = 0, ..., 4 in Figure 6.2.

6.3 Experiments

We have subjected four objects to our program. The following considerations have influenced our choice of objects:

• Objects should not be too complex, allowing our program to find a good approximation of the distortion-rate curve. We found that the running time of the program seems to depend mostly on the complexity of the input object; a compressed size of 20,000 bits seemed to be about the maximum our program could handle within a reasonable amount of time, requiring a running time of the order of weeks on a desktop computer.

• To check that our method really is general, objects should be quite different from each other: they should come from different object domains, for which different distortion functions are appropriate, and they should contain structure at different levels of complexity.

• Objects should contain primary structure and regularities that are distinguishable and compressible by a block sorting compressor such as the one we use. Otherwise, we may no longer hope that the compressor implements a reasonable approximation of the Kolmogorov complexity. For instance, we would not expect our program to do well on a sequence of digits from the binary expansion of the number π.

With this in mind, we have selected the objects listed in Figure 6.3. In each experiment, as time progressed the program found fewer and fewer improvements per iteration, but the pool never stabilised completely. Therefore we interrupted each experiment when (a) after at least one night of computation, the pool did not improve much, and (b) for all intuitively good models y ∈ Y that we could conceive of a priori, the algorithm had found a y′ in the pool with y′ ≼ y according to (6.4). For example, in each denoising experiment, this test included the original, noiseless object. In the experiment on the mouse without added noise, we also included the images that can be obtained by reducing the number of grey levels in the original with an image manipulation program. Finally, for the greyscale images we included a number of objects that can be obtained by subjecting the original object to JPEG2000 compression at various quality levels.

Figure 6.3 The four objects that are subjected to rate-distortion analysis.

• A picture of a mouse of 64 × 40 pixels. The picture is analysed with respect to Euclidean distortion.

• A noisy monochrome image of 64 × 64 pixels that depicts a cross. 377 pixels have been inverted. Hamming distortion is used.

• The same picture of a mouse, but now zero mean Gaussian noise with σ = 8 has been added to each pixel. Euclidean distortion is used; the distortion to the original mouse is 391.1.

• A corrupted quotation from Chapter 1 of The Picture of Dorian Gray, by Oscar Wilde (see Figure 6.10):

    Beauty, real beauty, ends2wheresan intellectual expressoon begins. IntellHct isg in itself a mMde ofSexggeration, an\ destroys theLharmony of n face. [...]

  The 733 byte long fragment was created by performing 68 random insertions, deletions and replacements of characters in the original text. Edit distortion is used. The rest of chapters one and two of the novel are given to the program as side information.

The first experiment illustrates how algorithmic rate-distortion theory may be applied to lossy compression problems, and it illustrates how, for a given rate, some features of the image are preserved while others can no longer be retained. We compare the performance of our method to the performance of JPEG and JPEG2000 at various quality levels. Standard JPEG images were encoded using ImageMagick version 6.2.2; profile information was stripped. JPEG2000 images were encoded to jpc format with three quality levels using NetPBM version 10.33.0; all other options are default. For more information about these software packages refer to [83].

The other three experiments are concerned with denoising. Any model that is output by the program can be interpreted as a denoised version of the input object. We measure the denoising success of a model y as d(x′, y), where x′ is the original version of the input object x, before noise was added. We also compare the denoising results to those of other denoising algorithms:

1. BayesShrink denoising [18]. BayesShrink is a popular wavelet-based denoising method that is considered to work well for images.

2. Blurring (convolution with a Gaussian kernel). Blurring works like a low-pass filter, eliminating high frequency information such as noise. Unfortunately other high frequency features of the image, such as sharp contours, are also discarded.

3. Naive denoising. We applied a naive denoising algorithm to the noisy cross, in which each pixel was inverted if five or more out of the eight neighbouring pixels were of different colour (a sketch follows below this list).

4. Denoising based on JPEG2000. Here we subjected the noisy input image to JPEG2000 compression at different quality levels. We then selected the result for which the distortion to the original image was lowest.
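A minimal sketch of the naive rule of item 3 (ours), for a monochrome image stored as a list of rows of 0/1 pixels; pixels on the border simply have fewer than eight neighbours:

    def naive_denoise(img):
        h, w = len(img), len(img[0])
        out = [row[:] for row in img]
        for r in range(h):
            for c in range(w):
                # Count the 8-neighbours that differ from the centre pixel.
                diff = sum(1 for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                           if (dr or dc) and 0 <= r + dr < h
                           and 0 <= c + dc < w
                           and img[r + dr][c + dc] != img[r][c])
                if diff >= 5:
                    out[r][c] = 1 - img[r][c]   # invert the pixel
        return out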

6.3.1 Names of Objects

To facilitate description and discussion of the experiments we will adopt the following naming convention. Objects related to the experiments with the mouse, the noisy cross, the noisy mouse and the Wilde fragment are denoted by the symbols M, C, N and W respectively. A number of important objects in each experiment are identified by a subscript as follows. For O ∈ {M, C, N, W}, the input object, for which the rate-distortion function is approximated by the program, is called O_in. In the denoising experiments, the input object is always constructed by adding noise to an original object. The original objects and the noise are called O_orig and O_noise respectively. If Hamming distortion is used, addition is carried out modulo 2, so that the input object is in effect a pixelwise exclusive OR of the original and the noise. In particular, C_in equals C_orig XOR C_noise. The program outputs the reduction of the gene pool, which is the set of considered models. Two important models are also given special names: the model within the gene pool that minimises the distortion to O_orig constitutes the best denoising of the input object and is therefore called O_best, and the minimal sufficient statistic as described in Section 6.1.1 is called O_mss. Finally, in the denoising experiments we also give names to the results of the alternative denoising algorithms. Namely, C_naive is the result of the naive denoising algorithm applied to the noisy cross, N_blur is the convolution of N with a Gaussian kernel with σ = 0.458, N_bs is the denoising result of the BayesShrink algorithm, and N_jpeg2000 is the image produced by subjecting N to JPEG2000 compression at the quality level for which the distortion to N_orig is minimised.

6.4 Results and Discussion

After running for some time on each input object, our program outputs the reduction of a pool P, which is interpreted as a set of models. For each experiment, we report a number of different properties of these sets. Since we are interested in the rate-distortion properties of the input object x = O_in, we plot the approximation of the distortion-rate function of each input object:

    d_x(r) = min{ d(x, y) : y ∈ Y, K(y) ≤ r } ≈ min{ d(x, y) : y ∈ trd(P), K̃(y) ≤ r }.

Such approximations of the distortion-rate function are provided for all four experiments. For the greyscale images we also plot the distortion-rate approximation that is achieved by JPEG2000 (and in Figure 6.5 also ordinary JPEG) at different quality levels. Here, the rate is the code length achieved by JPEG(2000), and the distortion is the Euclidean distortion to O_in. We also plot the code length function as described in Section 6.1.1. Minimal sufficient statistics can be identified by locating the minimum of this graph.
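In other words, each point of the reported curve is obtained by minimising the distortion over all models that are cheap enough for the given rate; a sketch:

    def distortion_rate(models, r):
        # models: list of (rate, distortion) pairs (K~(y), d(x, y)) for the
        # models y in trd(P); returns the approximation of d_x(r).
        feasible = [d for (k, d) in models if k <= r]
        return min(feasible) if feasible else float("inf")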

6.4.1 Lossy Compression

Experiment 1: Mouse (Euclidean distortion)

Our first experiment involved the lossy compression of M, a greyscale image of a mouse. A number of elements of the gene pool are shown in Figure 6.4. The pictures show how at low rates, the models capture the most important global structure of the image; at higher rates, more subtle properties of the image can be represented. Image (a) shows a rough rendering of the distribution of bright and dark areas in M_in. These shapes are rectangular, which is probably an artifact of the compression algorithm we used: it is better able to compress images with rectangular structure than with oval structure. There is no real reason why an oval should be in any way more complex than a rectangle, but most general purpose data compression software is similarly biased. In (b), the rate is high enough that the oval shape of the mouse can be accommodated, and two areas of different overall brightness are identified. After the number of grey shades has been increased a little further in (c), the first hint of the mouse's eyes becomes visible. The eyes are improved and the mouse is given paws in (d). At higher rates, the image becomes more and more refined, but the improvements are subtle and seem of a less qualitative nature.

The code length function (Figure 6.5) shows that the only sufficient statistic in the set of models is M_in itself, indicating that the image hardly contains any noise. It also shows the rates that correspond to the models that are shown in Figure 6.4. By comparing these figures it can be clearly seen that the image quality only deteriorates significantly if more than half of the information in M_in is discarded. Note that this is not a statement about the compression ratio, where the lossily compressed size is related to the size of the uncompressed object rather than its complexity. For example, M_in has an uncompressed size of 64 · 40 · 8 = 20480 bits, and the representation in Figure 6.4(g) has a compressed size of 3190.6 bits. This representation therefore constitutes compression by a factor of 20480/3190.6 = 6.42, which is substantial for an image of such small size. At the same time the amount of information is reduced by a factor of 7995.1/3190.6 = 2.51.

Figure 6.4 Lossy image compression results for the mouse (h). The numbers below each image denote its compressed size K̃(·), total code length K̃(·) + L_(·)(M_in) and Euclidean distortion d(·, M_in), respectively.

(a) 163.0 / 19920.9 / 2210.0   (b) 437.8 / 17013.6 / 1080

(c) 976.6 / 15779.7 / 668.9 (d) 1242.9 / 15297.9 / 546.4

(e) 1676.6 / 14641.9 / 406.9 (f) 2324.5 / 14150.1 / 298.9

(g) 3190.6 / 13601.4 / 203.9   (h) 7995.1 / 7996.1 / 0

Figure 6.5 Approximate distortion-rate and code length functions for the mouse.

[Two plots. Top: Euclidean distortion against rate (compressed size in bits) for our method, JPEG 2000 and JPEG, with the example models of Figure 6.4 marked. Bottom: code length (bits) against rate for our method, with the same examples marked.]

6.4.2 Denoising

For each denoising experiment, we report a number of important objects, a graph that shows the approximate distortion-rate function and a graph that shows the approximate code length function. In the distortion-rate graph we plot not only the distortion to Oin but also the distortion to Oorig, to visualise the denoising success at each rate. In interpreting these results, it is important to realise that only the reported minimal sufficient statistic and the results of the BayesShrink and naive denoising methods can be obtained without knowledge of the original object; the other objects Obest, Ojpeg2000 and Oblur require selecting between a number of alternatives in order to optimise the distortion to Oorig, which can only be done in a controlled experiment. Their performance may be better than what can be achieved in practical situations where Oorig is not known.

Experiment 2: Noisy Cross (Hamming distortion)

In the first denoising experiment we approximated the distortion-rate function of a monochrome cross Corig of very low complexity, to which artificial noise was added to obtain Cin (the rightmost image in Figure 6.6); the distortion to the noiseless cross is displayed in the same graph. The best denoising Cbest (leftmost image) has a distortion of only 3 to the original Corig, which shows that the distortion-rate function indeed separates structure and noise extremely well in this example. The bottom graph shows the code length function for the noisy cross; the minimum on this graph is the minimal sufficient statistic Cmss. In this low complexity example we have Cmss = Cbest, so the best denoising is not only very good, but it can also be identified.

We did not subject Cin to BayesShrink or blurring because those methods are not suitable for monochrome images. Instead we applied the extremely simple, "naive" denoising method described in Section 6.3 to this specific image. The middle image shows the result Cnaive; while it does remove most of the noise, 40 errors remain, many more than the number of errors incurred by the minimal sufficient statistic. All errors except one are close to the contours of the cross. This illustrates how the naive algorithm is limited by the fact that it takes only the local neighbourhood of each pixel into account: it cannot represent larger structures such as straight lines.
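The limitation just mentioned is easy to see in a sketch. The following hypothetical 3×3 majority filter is a purely local denoiser of this kind; the majority rule is our assumption for illustration, and the exact rule of the naive method of Section 6.3 may differ:

    # Hypothetical sketch of a purely local denoiser for monochrome images:
    # each output pixel becomes the majority value of its 3x3 neighbourhood
    # (clipped at the borders). Local rules of this kind cannot represent
    # larger structures such as straight lines.
    def majority_filter(img):
        h, w = len(img), len(img[0])
        out = [[0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                block = [img[a][b]
                         for a in range(max(0, i - 1), min(h, i + 2))
                         for b in range(max(0, j - 1), min(w, j + 2))]
                out[i][j] = 1 if 2 * sum(block) > len(block) else 0
        return out

    # A lone noise pixel in a flat area is removed...
    print(majority_filter([[0, 0, 0], [0, 1, 0], [0, 0, 0]])[1][1])  # 0
    # ...but so is a pixel on a one-pixel-wide line.
    print(majority_filter([[0, 1, 0], [0, 1, 0], [0, 1, 0]])[1][1])  # 0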

Figure 6.6 Denoising a noisy cross. Highlighted objects, from left to right: Cbest, Cnaive and Cin. Exact values are in the bottom table.

[Two plots. Top: Hamming distortion to the noisy cross and to the original against rate (compressed size in bits). Bottom: two part code length (bits) against rate.]

                           Cbest = Cmss    Cnaive       Cin
    K̃(·)                        260.4      669.2     3178.6
    K̃(·) + L_(·)(Cin)          2081.9     2533.3     3179.6
    d(·, Cin)                     376        389          0
    d(·, Corig)                     3         40        377

Experiment 3: Noisy mouse (Euclidean distortion)

The noisy mouse poses a significantly harder denoising problem: the total complexity of the input Nin is more than five times that of the noisy cross. Figure 6.7 shows the results for various denoising methods, and Figure 6.8 shows the rate-distortion curve and, as for the noisy cross, the distortion to the original object Norig. Figure 6.7(a) shows the input object Nin; it was constructed by adding noise (centre image) to the original noiseless image Norig (top-right). We display three different denoising results. Image (h) shows Nbest, the best denoised object from the gene pool. Visually it appears to resemble Norig quite well, but there might be structure in Norig that was lost in the denoising process. Because human perception is perhaps the most sensitive detector of structure in image data, we show the difference between Nbest and Norig in (i). We would expect any significant structure in the original image that is lost in the denoising process, as well as structure that is not present in the original image but is somehow introduced as an artifact of the denoising procedure, to become visible in this residual. In the case of Nbest we cannot make out any particular features.

The minimal sufficient statistic, image (d), also appears to be a reasonably successful denoising, albeit clearly of lower complexity than the best one. In the residual, darker and lighter patches are definitely discernible. Apparently Nin does contain some structure beyond what is captured by Nmss, but this structure cannot be exploited by the compression algorithm. We think that the fact that the minimal sufficient statistic is of lower complexity than the best possible denoising result should therefore again be attributed to inefficiencies of the compressor.

For comparison, we have also denoised Nin using the alternative denoising method BayesShrink and the methods based on blurring and JPEG2000, as described in Section 6.3. We found that BayesShrink does not work well for images of such small size: the distortion between Nbs and Nin is only 72.9, which means that the input image is hardly affected at all. Also, Nbs has a distortion of 383.8 to Norig, which is hardly less than the distortion of 392.1 achieved by Nin itself. Blurring-based denoising yields much better results: Nblur (j) is the result after optimisation of the size of the Gaussian kernel (a sketch of this procedure follows below). Its distortion to Norig lies in between the distortions achieved by Nmss and Nbest, but it differs from those objects in two important respects. Firstly, Nblur remains much closer to Nin, at a distortion of 260.4 instead of more than 470, and secondly, Nblur is much less compressible by K̃. (To obtain the reported size of 14117 bits we had to switch on the averaging filter, as described in Section 6.2.1.) These observations are at present not well understood. Image (k) shows that the contours of the mouse are somewhat distorted in Nblur; this can be explained by the fact that contours contain high frequency information, which is discarded by the blurring operation, as we remarked in Section 6.3.
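To make the blurring-based method concrete, the sketch below selects the kernel width σ that minimises the Euclidean distortion to the original; as noted above, this optimisation requires access to Norig and is therefore only possible in a controlled experiment. The use of scipy's gaussian_filter is our illustrative choice; Section 6.3 does not prescribe an implementation.

    # Sketch: blurring-based denoising with the kernel width tuned against
    # the original image (only possible when the original is known).
    import numpy as np
    from scipy.ndimage import gaussian_filter

    def euclidean_distortion(a, b):
        diff = np.asarray(a, float) - np.asarray(b, float)
        return float(np.sqrt(np.sum(diff ** 2)))

    def blur_denoise(noisy, original, sigmas):
        noisy = np.asarray(noisy, float)
        # Pick the sigma whose blurred image is closest to the original.
        _, best_sigma = min((euclidean_distortion(gaussian_filter(noisy, s), original), s)
                            for s in sigmas)
        return gaussian_filter(noisy, best_sigma), best_sigma

    # Hypothetical usage; the thesis reports sigma = 0.458 for the noisy mouse:
    # n_blur, sigma = blur_denoise(n_in, n_orig, np.linspace(0.1, 2.0, 39))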

The last denoising method we compared our results to is the one based on the JPEG2000 algorithm. Its performance is clearly inferior to our method, visually as well as in terms of rate and distortion. The result seems to have undergone a smoothing process similar to blurring, which introduces similar artifacts in the background noise, as is clearly visible in the residual image. As before, the comparison may be somewhat unfair, because JPEG2000 was not designed for the purpose of denoising, might optimise a different distortion measure, and is much faster.

Figure 6.7 Denoising results for the noisy mouse (a). The numbers below each image denote its compressed size K̃(·), total code length K̃(·) + L_(·)(Nin), distortion to Nin and distortion to Norig, respectively.

(a) Nin: 16699.7 / 16700.7 / 0 / 392.1   (b) Norig (= Min): 7995.1 / 26247.8 / 392.1 / 0   (c) Nnoise = Nin − Norig

(d) Nmss: 1969.8 / 15504.9 / 474.7 / 337.0   (e) Nmss − Norig

(f) Njpeg2000: 3104 / 16395.4 / 444.4 / 379.8   (g) Njpeg2000 − Norig

(h) Nbest: 3354.5 / 15952.8 / 368.4 / 272.2   (i) Nbest − Norig

(j) Nblur (σ = 0.458): 14117.0 / 25732.4 / 260.4 / 291.2   (k) Nblur − Norig

Figure 6.8 Approximate distortion-rate and code length functions for the noisy mouse.

[Two plots. Top: Euclidean distortion against rate (compressed size in bits) for our method and JPEG 2000, showing for each both the distortion to Nin and the distortion to the original. Bottom: code length (bits) against rate.]

Experiment 4: Oscar Wilde fragment (edit distortion)

The fourth experiment, in which we analyse Win, a corrupted quotation from Oscar Wilde, shows that our method is a general approach to denoising that does not require many domain specific assumptions. Worig, Win and Wmss are depicted in Figure 6.10; the distortion-rate approximation, the distortion to Worig and the three part code length function are shown in Figure 6.11. We have trained the compression algorithm by supplying it with the rest of Chapters 1 and 2 of the same novel as side information, to make it more efficient at compressing fragments of English text. We make the following observations regarding the minimal sufficient statistic:

• In this experiment, Wmss = Wbest, so the minimal sufficient statistic separates structure from noise extremely well here.

• The distortion is reduced from 68 errors to only 46 errors: 26 errors are corrected (▲), 4 are introduced (▼), 20 are unchanged (•) and 22 are changed incorrectly (⋆); indeed, 4 + 20 + 22 = 46.

• The errors that are newly introduced (▼) and the incorrect changes (⋆) typically simplify the fragment a lot, so that the compressed size may be expected to drop significantly. Not surprisingly, therefore, many of the symbols marked ▼ or ⋆ are deletions, or modifications that create a word which is different from the original but still correct English. The following table lists examples of the last category:

    Line   Worig     Win       Wmss
    3      or        Nor       of
    4      the       Ghe       he
    4      any       anL       an
    4      learned   JeaFned   yearned
    5      course    corze     core
    5      then      ehen      when
    8      he        fhe       the

Since it would be hard for any general-purpose mechanical method (one that does not incorporate a sophisticated English language model) to determine that these changes are incorrect, we should not be surprised to find a number of errors of this kind.
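For reference, the edit distortion used in this experiment is the Levenshtein distance [54]: the minimal number of insertions, deletions and substitutions needed to transform one string into the other. A minimal dynamic-programming sketch (our own, not the thesis implementation):

    # Sketch: edit (Levenshtein) distortion between two strings [54].
    def edit_distortion(a, b):
        prev = list(range(len(b) + 1))        # row for the empty prefix of a
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # delete ca
                               cur[j - 1] + 1,              # insert cb
                               prev[j - 1] + (ca != cb)))   # (mis)match
            prev = cur
        return prev[-1]

    print(edit_distortion("course", "corze"))  # 2, using words from the table above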

Side Information

Figure 6.9 shows that the compression performance is significantly improved if we provide side information to the compression algorithm, and the improvement is typically larger if (1) the amount of side information is larger, or (2) the compressed object is more similar to the side information. Thus, by giving side information, correct English prose is recognised as "structure" sooner, and a better separation between structure and noise is to be expected. The table also shows that if the compressed object is in some way different from the side information, then adding more side information will at some point become counter-productive, presumably because the compression algorithm will then use the side information to build up false expectations about the object to be compressed, which can be costly.

While denoising performance would probably improve if the amount of side information were increased further, it was infeasible to do so in this implementation. Recall from Section 6.2 that the conditional Kolmogorov complexity K(y|z) is approximated by K̃(y|z) = K̃(zy) − K̃(z). The time required to compute this is dominated by the length of z if the amount of side information is much larger than the size of the object to be compressed. This can be remedied by using a compression algorithm that processes its input sequentially, because the state of such an algorithm can be cached after processing the side information z; computing K̃(zy) would then be a simple matter of recalling the state that was reached after processing z and then processing y starting from that state. Many compression algorithms, among which Lempel-Ziv compressors and most statistical compressors, have this property; our approach could thus be made to work with large quantities of side information by switching to a sequential compressor, but we have not done this.
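To make the caching idea concrete, the sketch below approximates K̃(y|z) with zlib, a sequential compressor whose streaming state can be copied after the side information has been processed; zlib is our illustrative stand-in for the compressor actually used.

    # Sketch: approximating K~(y|z) = K~(zy) - K~(z) with a sequential
    # compressor. The stream state is cached after feeding z once, so each
    # candidate y afterwards costs time proportional to |y| only.
    import zlib

    def prime_with_side_information(z):
        c = zlib.compressobj(9)
        c.compress(z)            # feed the side information once; output ignored
        return c                 # cached state after processing z

    def conditional_size_bits(state, y):
        c = state.copy()         # recall the state reached after z
        return 8 * len(c.compress(y) + c.flush())   # approx. K~(zy) - K~(z)

    # Hypothetical usage with some side information and a text fragment:
    state = prime_with_side_information(b"...chapters of the novel as side information...")
    print(conditional_size_bits(state, b"Beauty, real beauty, ends where ..."))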

Figure 6.9 Compressed size of models for different amounts of side information. Worig is never included in the side information. We do not let Wmss vary with the side information but keep it fixed at the object reported in Figure 6.10(c).

    Side information z        K̃(Worig|z)   K̃(Wmss|z)   K̃(Win|z)
    None                         3344.1       3333.7      3834.8
    Chapters 1,2 (57 kB)         1745.7       1901.9      3234.5
    Whole novel (421 kB)         1513.6       1876.5      3365.9

Figure 6.10 A fragment of The Picture of Dorian Gray, by Oscar Wilde.

Beauty, real beauty, ends where an intellectual expression begins. Intellect is in itself a mode of exaggeration, and destroys the harmony of any face. The moment one sits down to think, one becomes all nose, or all forehead, or something horrid. Look at the successful men in any of the learned professions. How perfectly hideous they are! Except, of course, in the Church. But then in the Church they don't think. A bishop keeps on saying at the age of eighty what he was told to say when he was a boy of eighteen, and as a natural consequence he always looks absolutely delightful. Your mysterious young friend, whose name you have never told me, but whose picture really fascinates me, never thinks. I feel quite sure of that.

(a) Worig, the original text

Beauty, real beauty, ends2wheresan intellectual expressoon begins. IntellHct isg in itself a mMde ofSexggeration, an destroys theLharmony of n face. :The m1ment one sits down to ahink@ one\ becomes jll noeˆ Nor all forehbead, or something hNrrid. Look a Ghe successf l men in anL of te JeaFned professions. How per ectly tideous 4they re6 Except,\ of corze, in7 the Ch4rch. BuP ehen in the Church they} dol’t bthink. =A bishop keeps on saying at the age of eighty what he was told to say wh”n he was aJb4y of eighten, and sja natural cnsequence fhe a(ways looks ab8olstely de[ightfu). Your mysterious youngL friend, wPose name you h vo never tld me, mut whose picture really fa?scinates Lme,Pnever thinCs. I feel\ quite surS of that9

(b) Win, the corrupted version of the fragment. At 68 randomly selected positions characters have been inserted, deleted or modified. New and replacement characters are drawn uniformly from ASCII symbols 32–126.

Beauty, real beauty, ends-where an intellectual expressoon begins. Intellect is⊡ in itself a mode of ex⊡ggeration, and destroys the harmony of ⊡n⊡ face. ⊡The moment one sits down to think⊡ one becomes ⊡ll no⊡e⊡ ⊡of all forebe⊡d, or something hirrid. Look a⊡ ⊡he successf⊡l men in an⊡ of t⊡e yearned provessions. How perfectly tideous ⊡they ⊡re6 Except, of co⊡r⊡e, in ⊡ the Charch. But when in the Church they dol't ⊡think. ⊡A⊡bishop keeps on saying at the age of eighty what he was told to say wh⊡n he was a⊡bsy of eight⊡en, and ⊡sja natural c⊡nsequence the a⊡ways looks absolutely de⊡ightful. Your mysterious young⊡ friend, whose name you have never t⊡ld me, mut whose picture really fa⊡scinates ⊡me, never thinCs. I feel quite sur⊡ of that9

(c) Wbest = Wmss; it has edit distortion 46 to the original fragment. In the printed figure, marks above the affected characters indicate the error type: ▲ = correction; ▼ = new error; • = old error; ⋆ = changed but still wrong. Deletions are represented as ⊡.

Figure 6.11 Results for a fragment of The Picture of Dorian Gray by Oscar Wilde (also see Figure 6.10).

[Two plots. Top: edit distortion to the corrupted fragment and to the original against rate (compressed size in bits). Bottom: code length (bits) against rate.]

6.5 Quality of the Approximation

It is easy to see from its definition that the distortion-rate function must be a non-increasing function of the rate. The implementation guarantees that our approximation is non-increasing as well. In [92] it is assumed that for every x ∈ X there exists a representation y ∈ Y such that d(x, y) = 0; in the context of this chapter this is certainly true, because we have X = Y and a distortion function which is a metric. The gene pool is initialised with Oin, which always has zero weakness and must therefore remain in the pool. Therefore, at a rate that is high enough to specify x, the distortion-rate function reaches zero.

The shape of the code length function for an object x is more complicated. Let y be the representation for which d(x, y) = 0. In theory, the code length can never become less than the complexity of y, and the minimal sufficient statistic witnesses the code length function at the lowest rate at which the code length is equal to the complexity of y. In practice, we found in all denoising experiments that the total code length using the minimal sufficient statistic, K̃(Omss) + L_Omss(Oin), is lower than the code length K̃(Oin) that is obtained by compressing the input object directly. This can be observed in Figures 6.6, 6.8 and 6.11. The effect is most pronounced in Figure 6.6, where the separation between structure and noise is also most pronounced. Our hypothesis is that this departure from the theoretical shape of the code length function must be explained by inefficiency of the compression algorithm in dealing with noise. This is evidenced by the fact that it needs 2735.7 bits to encode the noise, while only log2 (4096 choose 377) ≈ 1810 bits would suffice if the noise were specified with a uniform code on the set of indices of all binary sequences with exactly 377 ones out of 64 · 64 (see Section 6.2.3); a quick check of this bound follows below. Similarly, K̃(Nnoise) = 14093, whereas a literal encoding requires at most 12829 bits (using the bound from Section 6.2.3).

Another strange effect occurs in Figure 6.8, where the code length function displays a strange "bump": as the rate is increased beyond the level required to specify the minimal sufficient statistic, the code length goes up as before, but at very high rates it starts dropping again. It is theoretically possible that the code length function exhibits such behaviour to a limited extent. It can be seen in [92] that a temporary increase in the code length function can occur, up to a number of bits that depends on the so-called covering coefficient. Loosely speaking, this is the density of small distortion balls that is required in order to completely cover a larger distortion ball. The covering coefficient in turn depends on the distortion function used and the number of dimensions. It is quite hard to analyse in the case of Euclidean distortion, so we cannot at present say whether theory admits such a large increase in the code length function. However, we believe that the explanation is more mundane in this case. Since the noisy mouse is the most complex object of the four we experimented on, we fear that this bump may simply indicate that we interrupted our search procedure too soon. Quite possibly, after a few years of processing on more expensive hardware, the code length function would have run straight in between Nmss and Nin.
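The combinatorial bound quoted above is easily verified; for instance, in Python:

    # Check: a uniform code on the binary sequences of length 64*64 that
    # contain exactly 377 ones costs log2 of a binomial coefficient.
    from math import comb, log2

    print(log2(comb(64 * 64, 377)))   # approximately 1810 bits, as quoted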
Figure 6.5 shows that our approximation of the distortion-rate function is somewhat better than the approximation provided by either JPEG or JPEG2000, although the difference is not extremely large at higher rates. The probable reason is twofold. On the one hand, we do not know for which distortion function JPEG(2000) is optimised, but it is probably not Euclidean distortion; our comparison is therefore somewhat unfair, since our method might well perform worse on JPEG(2000)'s own distortion measure. On the other hand, JPEG(2000) is very time-efficient: it took only a matter of seconds to compute models at various different quality levels, while it took our own algorithm days or weeks to compute its distortion-rate approximation.

Two conclusions can be drawn from our result. First, if the performance of existing image compression software had been better than the performance of our own method in our experiments, this would have been evidence to suggest that our algorithm does not compute a good approximation to the rate-distortion function. The fact that this is not the case is thus reassuring. Vice versa, if we assume that we have computed a good approximation to the algorithmic rate-distortion function, then our results give a measure of how close JPEG(2000) comes to the theoretical optimum; our program can thus be used to provide a basis for the evaluation of the performance of lossy compressors.

6.6 An MDL Perspective

So far we have described algorithmic rate-distortion theory in terms of the uncomputable notion of Kolmogorov complexity, and we developed a practical version of the method using a data compression algorithm. In the context of this thesis, it is natural to ask how the practical method we described above fits within established MDL theory. After all, Minimum Description Length was originally developed to obtain a practical version of universal learning based on Kolmogorov complexity, which can even be interpreted as a special case ("ideal MDL", Section 1.1.3). For example, a major theme in the MDL literature is the motivation of the codes used; some codes are considered acceptable, others are not. To what category does the data compression algorithm we used belong?

First we compare the code length function used in MDL to the one we used for algorithmic rate-distortion in this chapter. In the considered practical version of algorithmic rate-distortion, by (6.2) and (6.7), the total code length of the source object x ∈ X using a representation y ∈ Y equals

    K̃(y) + L_D(d(x, y) | y) + log |S_y(d(x, y))|.     (6.8)

In this three part code, the first term counts the number of bits required to specify the representation, or hypothesis, for the data, and the last term counts the number of bits required to specify the noise. Whether the distortion level of the source object given a representation should be interpreted as part of the hypothesis or as noise is debatable; in this chapter we found it convenient to treat the distortion level as part of the hypothesis, as in (6.2). But to match algorithmic rate-distortion to MDL it is better to think of the distortion level as part of the noise, and to identify Y with the available hypotheses. In our description of MDL in the introductory chapter, the total code length of the data x ∈ X with the help of a hypothesis y ∈ Y is

    L(y) + L_y(x).     (6.9)

Algorithmic rate-distortion theory can be understood as a generalisation of MDL as described in the introduction by making (6.8) and (6.9) match. Suppose that we start with an instance of MDL model selection, specified by the code length functions L(y) and L_y(x), and let P_y(x) = 2^{−L_y(x)} be the mass function corresponding to L_y(x). We will show that the model selected by MDL is equal to a sufficient statistic in the rate-distortion problem that is obtained by setting K̃(y) = L(y), d(x, y) = L_y(x) and L_D(d | y) = −log P_y(L_y(X) = d). We have

    P_y(x) = P_y(X = x, L_y(X) = L_y(x))
           = P_y(X = x | L_y(X) = L_y(x)) · P_y(L_y(X) = L_y(x))
           = P_y(L_y(X) = L_y(x)) / |{x′ : L_y(x′) = L_y(x)}|
           = 2^{−L_D(d(x,y) | y)} / |S_y(d(x, y))|,

so that the code length function that is minimised by sufficient statistics in this rate-distortion problem is exactly the same code length function that is minimised in the original MDL problem. Note that to make this work we only really need the distortion function to have the property that for all y ∈ Y and all x, x′ ∈ X we have d(x, y) = d(x′, y) iff L_y(x) = L_y(x′). We chose d(x, y) = L_y(x) because it is a simple function with that property, and because it allows an interpretation of the distortion as the incurred logarithmic loss. This correspondence actually works both ways: starting with a rate-distortion problem specified by some K̃, code L_D and distortion function d, we can also construct an MDL problem that identifies a sufficient statistic, by defining L(y) = K̃(y) and L_y(x) = L_D(d(x, y) | y) + log |S_y(d(x, y))|.

Note that in the introductory chapter we were somewhat unspecific as to which hypothesis should be selected if more than one of them minimises the code length; the issue becomes much clearer in a rate-distortion analysis, because it allows us to easily express a preference for the minimal sufficient statistic as the best hypothesis for the data. If a thorough analysis of the data is required, it also seems sensible to look at the whole rate-distortion function rather than just at the minimal sufficient statistic, since it provides additional information about the structure that is present in the data at each level of complexity.
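As a sanity check on this correspondence, the following small sketch (our own illustration, with a toy mass function P_y) verifies numerically that the three part code length (6.8) constructed from an MDL problem coincides with L(y) + L_y(x):

    # Sketch: verify that K~(y) + L_D(d(x,y)|y) + log|S_y(d(x,y))| equals
    # L(y) + L_y(x) under K~(y) = L(y), d(x,y) = L_y(x),
    # L_D(d|y) = -log P_y(L_y(X) = d). Toy model, our own illustration.
    from math import log2

    P_y = {'00': 0.5, '01': 0.25, '10': 0.125, '11': 0.125}
    L = 1.0                                        # toy value of L(y)
    L_y = lambda x: -log2(P_y[x])

    for x in P_y:
        d = L_y(x)                                 # distortion = log-loss
        S = [z for z in P_y if L_y(z) == d]        # the distortion "sphere"
        L_D = -log2(sum(P_y[z] for z in S))        # code for the level d
        assert abs((L + L_D + log2(len(S))) - (L + L_y(x))) < 1e-9
    print("three part code length matches L(y) + L_y(x) for every x")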

When we introduced MDL, we stressed that a suitable code L for the specification of hypotheses must have two properties: (a) it has small regret in the worst case over all considered input objects, and (b) it may be a luckiness code that does especially well on some inputs (see Section 1.1). In the experiments in this chapter we used L = K̃, where K̃ is a general purpose data compression algorithm. We need to check that it achieves small regret in the worst case. We will do this for the experiment on the noisy mouse Nin. In that experiment we only considered finitely many hypotheses, so, as in Example 2, the best possible worst-case regret is achieved by the uniform code. This code requires 64 · 40 · 8 = 20480 bits for all representations in Y. The encoding produced by the data compression algorithm should never be much longer. We did not prove any bounds for our compressor, but we did try to compress ten files of 64 · 40 random bytes each, to see whether or not the compressed representation would be much larger; the worst result was a compressed size of 20812 bits. This blowup of 332 bits does seem larger than necessary, but similar or worse overheads are incurred by other compression software such as Lempel-Ziv based compressors and PPM compressors, unless such software checks, as a built-in feature, whether or not it actually manages to compress the input object, reverting to a literal encoding when this fails. All in all, the worst-case performance of the compressor appears to lie within reasonable bounds, although admittedly we did not check this very thoroughly (a sketch of such a check follows below). The fact that we subject the source object to a rate-distortion analysis in the first place suggests that we believe that reasonable hypotheses can be formulated at different levels of complexity, which means that we should certainly use a luckiness code to differentiate between simple and more complex representations. This is what data compression software is designed to do really well, and it is what allows for an interesting analysis in the case of our example.

Another way to construct a suitable luckiness code L, which is more in the spirit of the standard MDL approach, is the following. We start out with a countable set of models of different complexity M1, M2, . . ., and take its union Y = ∪n Mn as the set of representations. We can then encode y by first specifying its model index i using some worst-case efficient code, and then specifying y within model Mi using a second worst-case efficient code. The details of such an approach to rate-distortion have not yet been investigated.
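The random-file check just mentioned is easy to reproduce in spirit; the following sketch (with bz2 as an illustrative stand-in for K̃) measures the overhead on incompressible input relative to the 20480-bit uniform code:

    # Sketch: worst-case overhead of a compressor on incompressible data,
    # relative to the 64*40*8 = 20480-bit uniform code. bz2 stands in for
    # the compressor K~ used in the thesis.
    import bz2, os

    uniform_bits = 64 * 40 * 8
    worst = max(8 * len(bz2.compress(os.urandom(64 * 40))) for _ in range(10))
    print("worst case:", worst, "bits; regret:", worst - uniform_bits, "bits")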

6.7 Conclusion

Algorithmic rate-distortion provides a good framework for the analysis of large and structured objects. It is based on Kolmogorov complexity, which is not computable. We nevertheless attempted to put this theory into practice by approximating the Kolmogorov complexity of an object by its compressed size. We also generalised the theory in order to enable it to cope with side information, which is interpreted as being available to both the sender and the receiver in a transmission over a rate restricted channel. We also described how algorithmic rate-distortion theory may be applied to lossy compression and denoising problems.

Finding the approximate rate-distortion function of an individual object is a difficult search problem. We described a genetic algorithm that is very slow, but has the important advantage that it requires only a few assumptions about the problem at hand. Judging from our experimental results, our algorithm provides a good approximation, as long as its input object is of reasonably low complexity and is compressible by the data compressor used. The shape of the approximate rate-distortion function, and especially that of the associated three part code length function, is reasonably similar to the shape that we would expect on the basis of theory, but there is a striking difference as well: at rates higher than the complexity of the minimal sufficient statistic, the three part code length tends to increase with the rate, where theory suggests it should remain constant. We expect that this effect can be attributed to inefficiencies in the compressor.

We found that the algorithm performs quite well in lossy compression, with apparently somewhat better image quality than that achieved by JPEG2000, although the comparison may not be altogether fair. When applied to denoising, the minimal sufficient statistic tends to be a slight underestimate of the complexity of the best possible denoising (an example of underfitting). This is presumably again due to inefficiencies in the compression algorithm used.

Besides being an approximation of algorithmic rate-distortion theory, the computable version of the theory can also be interpreted as a generalisation of MDL model selection. We concluded this chapter with a discussion of this relationship.

Bibliography

[1] P. Adriaans and J. van Benthem, editors. Handbook of the Philosophy of Information. Elsevier, 2008.

[2] H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716–723, 1974.

[3] V. Balasubramanian. Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Computation, 9:349–368, 1997.

[4] A. Barron, J. Rissanen, and B. Yu. The Minimum Description Length principle in coding and modeling. IEEE Transactions on Information Theory, 44(6):2743–2760, 1998.

[5] A.R. Barron. Logically Smooth Density Estimation. PhD thesis, Dept. of Electrical Engineering, Stanford University, Stanford, CA, 1985.

[6] A.R. Barron. Information-theoretic characterization of Bayes performance and the choice of priors in parametric and nonparametric problems. In Bayesian Statistics 6, pages 27–52. Oxford University Press, 1998.

[7] A.R. Barron. Personal communication, 2008.

[8] A.R. Barron and T.M. Cover. Minimum complexity density estimation. IEEE Transactions on Information Theory, 37(4):1034–1054, 1991.

[9] A.R. Barron and C. Sheu. Approximation of density functions by sequences of exponential families. Annals of Statistics, 19(3):1347–1369, 1991.

[10] A.R. Barron, Y. Yang, and B. Yu. Asymptotically optimal function estimation by minimum complexity criteria. In Proceedings of the 1994 International Symposium on Information Theory, page 38, Trondheim, Norway, 1994.


[11] J. Berger. Personal communication, 2004.

[12] J.O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer Series in Statistics. Springer-Verlag, New York, revised and expanded second edition, 1985.

[13] J.O. Berger and L.R. Pericchi. Objective Bayesian methods for model selection: introduction and comparison. Institute of Mathematical Statistics Lecture Notes – Monograph Series, 38:135–207, 1997.

[14] J. Bernardo and A.F.M. Smith. Bayesian Theory. Wiley, 1994.

[15] O. Bousquet. A note on parameter tuning for on-line shifting algorithms. Technical report, Max Planck Institute for Biological Cybernetics, 2003.

[16] O. Bousquet and M.K. Warmuth. Tracking a small set of experts by mixing past posteriors. Journal of Machine Learning Research, 3:363–396, 2002.

[17] M. Burrows and D.J. Wheeler. A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, Systems Research Center, May 1994.

[18] G. Chang, B. Yu, and M. Vetterli. Adaptive wavelet thresholding for image denoising and compression. IEEE Transactions on Image Processing, 9:1532–1546, 2000.

[19] N. Chomsky. Three models for the description of language. IRE Transactions on Information Theory, 2(3), September 1956.

[20] R. Cilibrasi and P. Vitányi. Algorithmic clustering of music based on string compression. Computer Music Journal, 28:49–67, 2004.

[21] B. Clarke. Online forecasting proposal. Technical report, University of Dortmund, 1997. Sonderforschungsbereich 475.

[22] B. Clarke and A. Barron. Jeffreys’ prior is asymptotically least favourable under entropy risk. The Journal of Statistical Planning and Inference, 41:37–60, 1994.

[23] B.S. Clarke and A.R. Barron. Information-theoretic asymptotics of Bayes methods. IEEE Transactions on Information Theory, IT-36(3):453–471, 1990.

[24] J.G. Cleary and I.H. Witten. Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications, COM-32(4):396–402, April 1984.

[25] T.M. Cover and J.A. Thomas. Elements of Information Theory. Series in telecommunications. John Wiley, 1991.

[26] A.P. Dawid. Present position and potential developments: Some personal views, statistical theory, the prequential approach. Journal of the Royal Statistical Society, Series A, 147(2):278–292, 1984.

[27] A.P. Dawid. Statistical theory: The prequential approach. Journal of the Royal Statistical Society B, 147, Part 2:278–292, 1984.

[28] A.P. Dawid. Prequential analysis, stochastic complexity and Bayesian inference. In J.M. Bernardo, J.O. Berger, A.P. Dawid, and A.F.M. Smith, editors, Bayesian Statistics 4, pages 109–125. Oxford University Press, 1992.

[29] A.P. Dawid. Prequential data analysis. In M. Ghosh and P.K. Pathak, editors, Current Issues in Statistical Inference: Essays in Honor of D. Basu, Lecture Notes-Monograph Series, pages 113–126. Institute of Mathematical Statistics, 1992.

[30] X. De Luna and K. Skouras. Choosing a model selection strategy. Scandinavian Journal of Statistics, 30:113–128, 2003.

[31] P. Diaconis and D. Freedman. On the consistency of Bayes estimates. The Annals of Statistics, 14(1):1–26, 1986.

[32] P. Elias. Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, 21(2):194–203, 1975.

[33] M.R. Forster. The new science of simplicity. In A. Zellner, H. Keuzenkamp, and M. McAleer, editors, Simplicity, Inference and Modelling, pages 83–117. Cambridge University Press, Cambridge, 2001.

[34] D. Foster and E. George. The risk inflation criterion for multiple regression. Annals of Statistics, 22:1947–1975, 1994.

[35] L. Gerencsér. Order estimation of stationary Gaussian ARMA processes using Rissanen's complexity. Technical report, Computer and Automation Institute of the Hungarian Academy of Sciences, 1987.

[36] P. Grünwald and S. de Rooij. Asymptotic log-loss of prequential maximum likelihood codes. In Proceedings of the Eighteenth Annual Conference on Learning Theory (COLT 2005), pages 652–667. Springer, 2005.

[37] P. Grünwald and P. Vitányi. Shannon information and Kolmogorov complexity. Submitted to IEEE Transactions on Information Theory; available at www.cwi.nl/~paulv/publications.html, 2005.

[38] P.D. Grünwald. MDL tutorial. In P.D. Grünwald, I.J. Myung, and M.A. Pitt, editors, Advances in Minimum Description Length: Theory and Applications. MIT Press, 2005.

[39] P.D. Grünwald. The Minimum Description Length Principle. MIT Press, June 2007.

[40] M. Hansen and B. Yu. Minimum Description Length model selection criteria for generalized linear models. In Science and Statistics: Festschrift for Terry Speed, volume 40 of IMS Lecture Notes – Monograph Series. Institute for Mathematical Statistics, Hayward, CA, 2002.

[41] M.H. Hansen and B. Yu. Model selection and the principle of Minimum Description Length. Journal of the American Statistical Association, 96(454):746–774, 2001.

[42] J.A. Hartigan. Bayes Theory. Springer-Verlag, New York, 1983.

[43] D. Helmbold and M. Warmuth. On weak learning. Journal of Computer and System Sciences, 50:551–573, 1995.

[44] E.M. Hemerly and M.H.A. Davis. Strong consistency of the PLS criterion for order determination of autoregressive processes. Annals of Statistics, 17(2):941–946, 1989.

[45] M. Herbster and M.K. Warmuth. Tracking the best expert. In Learning Theory: 12th Annual Conference on Learning Theory (COLT 1995), pages 286–294, 1995.

[46] M. Herbster and M.K. Warmuth. Tracking the best expert. Machine Learning, 32:151–178, 1998.

[47] H. Jeffreys. Theory of Probability. Oxford University Press, London, third edition, 1961.

[48] R. Kass and P. Vos. Geometric Foundations of Asymptotic Inference. Wiley, 1997.

[49] R.E. Kass and A.E. Raftery. Bayes factors. Journal of the American Statistical Association, 90(430):773–795, 1995.

[50] P. Kontkanen and P. Myllymäki. A linear-time algorithm for computing the multinomial stochastic complexity. Information Processing Letters, 103:227–233, September 2007.

[51] P. Kontkanen, P. Myllymäki, T. Silander, H. Tirri, and P.D. Grünwald. On predictive distributions and Bayesian networks. Journal of Statistics and Computing, 10:39–54, 2000.

[52] P. Kontkanen, P. Myllymäki, and H. Tirri. Comparing prequential model selection criteria in supervised learning of mixture models. In T. Jaakkola and T. Richardson, editors, Proceedings of the Eighth International Conference on Artificial Intelligence and Statistics, pages 233–238. Morgan Kaufman, 2001.

[53] A.D. Lanterman. Hypothesis testing for Poisson versus geometric distributions using stochastic complexity. In Peter D. Grünwald, In Jae Myung, and Mark A. Pitt, editors, Advances in Minimum Description Length: Theory and Applications. MIT Press, 2005.

[54] V.I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8):707–710, 1966.

[55] J.Q. Li and A.R. Barron. Mixture density estimation. In NIPS, pages 279–285, 1999.

[56] K.C. Li. Asymptotic optimality of C_p, C_L, cross-validation and generalized cross-validation: Discrete index set. Annals of Statistics, 15:958–975, 1987.

[57] L. Li and B. Yu. Iterated logarithmic expansions of the pathwise code lengths for exponential families. IEEE Transactions on Information Theory, 46(7):2683–2689, 2000.

[58] F. Liang and A. Barron. Exact minimax predictive density estimation and MDL. In Peter D. Grünwald, In Jae Myung, and Mark A. Pitt, editors, Advances in Minimum Description Length: Theory and Applications. MIT Press, 2005.

[59] G. McLachlan and D. Peel. Finite Mixture Models. Wiley Series in Probability and Statistics, 2000.

[60] M. Li and P. Vitányi. An Introduction to Kolmogorov Complexity and its Applications. Springer-Verlag, New York, 2nd edition, 1997.

[61] D.S. Modha and E. Masry. Prequential and cross-validated regression estimation. Machine Learning, 33(1), 1998.

[62] A. Moffat. Compression and Coding Algorithms. Kluwer Academic Publishers, 2002.

[63] C. Monteleoni and T. Jaakkola. Online learning of non-stationary sequences. In Advances in Neural Information Processing Systems, volume 16, Cambridge, MA, 2004. MIT Press.

[64] K.R. Parthasarathy. Probability Measures on Metric Spaces. Probability and Mathematical Statistics. Academic Press, 1967. 198 Bibliography

[65] J. Poland and M. Hutter. Asymptotics of discrete MDL for online prediction. IEEE Transactions on Information Theory, 51(11):3780–3795, 2005.

[66] L.R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. In Proceedings of the IEEE, volume 77-2, pages 257–285, 1989.

[67] J. Rissanen. Modeling by the shortest data description. Automatica, 14:465–471, 1978.

[68] J. Rissanen. A universal prior for integers and estimation by Minimum Description Length. Annals of Statistics, 11:416–431, 1983.

[69] J. Rissanen. Universal coding, information, prediction, and estimation. IEEE Transactions on Information Theory, IT-30(4):629–636, 1984.

[70] J. Rissanen. A predictive least squares principle. IMA Journal of Mathematical Control and Information, 3:211–222, 1986.

[71] J. Rissanen. Stochastic Complexity in Statistical Inquiry, volume 15 of Series in Computer Science. World Scientific, 1989.

[72] J. Rissanen. Fisher information and stochastic complexity. IEEE Transactions on Information Theory, 42(1):40–47, 1996.

[73] J. Rissanen. MDL denoising. IEEE Transactions on Information Theory, 46(7):2537–2543, 2000.

[74] J. Rissanen, T.P. Speed, and B. Yu. Density estimation by stochastic complexity. IEEE Transactions on Information Theory, 38(2):315–323, 1992.

[75] G. Roelofs. PNG – The Definitive Guide. O’Reilly, 1999. Also available at www.faqs.org/docs/png.

[76] L.J. Savage. The Foundations of Statistics. Dover Publications, 1954.

[77] E.D. Scheirer. Structured audio, Kolmogorov complexity, and generalized audio coding. IEEE Transactions on Speech and Audio Processing, 9(8), November 2001.

[78] G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6(2):461–464, 1978.

[79] C.E. Shannon. A mathematical theory of communication. Bell Systems Technical Journal, 27:379–423, 623–656, 1948. Bibliography 199

[80] R. Shibata. Asymptotic mean efficiency of a selection of regression variables. Annals of the Institute of Statistical , 35:415–423, 1983.

[81] E. Sober. The contest between parsimony and likelihood. Systematic Biology, 4:644–653, 2004.

[82] bzip2 and zzip. www.bzip2.org, debin.net/zzip. Lossless compression software.

[83] ImageMagick and NetPBM. www.imagemagick.org, www.netpbm.sourceforge.net. Open source packages of graphics software.

[84] R.J. Solomonoff. A formal theory of inductive inference, part 1 and part 2. Information and Control, 7:1–22, 224–254, 1964.

[85] T. Speed and B. Yu. Model selection and prediction: Normal regression. Annals of the Institute of Statistical Mathematics, 45(1):35–54, 1993.

[86] M. Stone. An asymptotic equivalence of choice of model by cross-validation and Akaike’s criterion. Journal of the Royal Statistical Society B, 39:44–47, 1977.

[87] J. Takeuchi. On minimax regret with respect to families of stationary stochastic processes (in Japanese). In Proceedings of IBIS 2000, pages 63–68, 2000.

[88] J. Takeuchi and A. Barron. Asymptotically minimax regret for exponential families. In Proceedings of SITA 1997, pages 665–668, 1997.

[89] J. Takeuchi and A.R. Barron. Asymptotically minimax regret by Bayes mixtures. In Proceedings of the 1998 International Symposium on Information Theory (ISIT 98), 1998.

[90] R. Tibshirani. Regression shrinkage and selection via the . Journal of the Royal Statistical Society, Series B, 58:267–288, 1996.

[91] J. Tromp. Binary lambda calculus and combinatory logic. www.cwi.nl/~tromp/cl/cl.html, 2007.

[92] N. Vereshchagin and P. Vitányi. Algorithmic rate-distortion theory. arxiv.org/abs/cs.IT/0411014, 2005.

[93] N.K. Vereshchagin and P.M.B. Vitányi. Kolmogorov's structure functions and model selection. IEEE Transactions on Information Theory, 50(12):3265–3290, 2004.

[94] P.A.J. Volf and F.M.J. Willems. Switching between two universal source coding algorithms. In Proceedings of the Data Compression Conference, Snowbird, Utah, pages 491–500, 1998.

[95] V. Vovk. Derandomizing stochastic prediction strategies. Machine Learning, 35:247–282, 1999.

[96] E.J. Wagenmakers, P.D. Grünwald, and M. Steyvers. Accumulative prediction error and the selection of time series models. Journal of Mathematical Psychology, 2006 (this issue).

[97] C.Z. Wei. On predictive least squares principles. Annals of Statistics, 20(1):1–42, 1990.

[98] P. Whittle. Bounds for the moments of linear and quadratic forms in independent variables. Theory of Probability and its Applications, V(3), 1960.

[99] H. Wong and B. Clarke. Improvement over Bayes prediction in small samples in the presence of model uncertainty. The Canadian Journal of Statistics, 32(3):269–283, 2004.

[100] Q. Xie and A. Barron. Asymptotic minimax regret for data compression, gambling and prediction. IEEE Transactions on Information Theory, 46(2):431–445, 2000.

[101] Y. Yang. Model selection for nonparametric regression. Statistica Sinica, 9:475–499, 1999.

[102] Y. Yang. Mixing strategies for density estimation. Annals of Statistics, 28(1):75–87, 2000.

[103] Y. Yang. Can the strengths of AIC and BIC be shared? Biometrika, 92(4):937–950, 2005.

[104] Y. Yang. Consistency of cross-validation for comparing regression procedures. Submitted, 2005.

[105] Y. Yang and A.R. Barron. An asymptotic property of model selection criteria. IEEE Transactions on Information Theory, 44:117–133, 1998.

[106] Y. Yang and A.R. Barron. Information-theoretic determination of minimax rates of convergence. Annals of Statistics, 27:1564–1599, 1999.

[107] B. Yu. Lower bounds on expected redundancy for nonparametric classes. IEEE Transactions on Information Theory, 42(1):272–275, 1996.

Notation

Numbers
  ⌊x⌋       The value of x, rounded down to the nearest integer
  ⌈x⌉       The value of x, rounded up to the nearest integer
  log x     The binary logarithm of x
  ln x      The natural logarithm of x
  1_A(x)    The indicator function, which equals 1 if x ∈ A and 0 otherwise

Sets (note: in Chapter 6, the blackboard symbols are used for something else)
  B         The binary numbers 0 and 1
  N         The natural numbers 0, 1, 2, ...
  Z         The integers ..., −2, −1, 0, 1, 2, ...
  Z+        The positive integers 1, 2, ...
  |X|       The size of a finite set X
  [n]       Abbreviation for {1, 2, ..., n}, with [0] = ∅ and [∞] = Z+

Sequences. Let n ∈ N and let X be a countable set.
  X^n       The set of all length-n sequences of elements from X
  X*        The set of all finite sequences of elements from X: X* := ∪_{n∈N} X^n
  X^∞       The set of all infinite sequences of elements from X
  x^n       EITHER exponentiation, OR the sequence x_1, x_2, ..., x_n for n ∈ N
  x^0 / ε   The empty sequence

Asymptotics. Let a, c, x_0, x ∈ R.
  f → a      lim_{x→∞} f(x) = a
  f = O(g)   ∃ x_0, c > 0 : ∀ x ≥ x_0 : |f(x)| ≤ c|g(x)|
  f = Ω(g)   ∃ x_0, c > 0 : ∀ x ≥ x_0 : |f(x)| ≥ c|g(x)|   (not the complement of o!)
  f = Θ(g)   f = O(g) and f = Ω(g)
  f ≍ g      The same as f = Θ(g)
  f = o(g)   f(x)/g(x) → 0


Abstract

Model selection is a strange and wonderful topic in learning theory and statistics. At first glance the question seems very clear-cut: how should we decide which set of probability distributions matches the observations at hand best? This question comes up time and again in many different contexts, ranging from testing scientific hypotheses in general (which among these psychological models describes best how people behave?) to more concrete applications (what order polynomial should we use to fit the data in this regression problem? What lossy representation of this image best captures the structural properties of the original?). Thus, model selection is ubiquitous, and the one-size-fits-all criteria based on the Minimum Description Length (MDL) principle and the closely related Bayesian statistics are appreciated by many.

Upon closer inspection, many applications of model selection are not as similar as they may first appear. They can be distinguished by technical properties (are the models nested? Parametric? Countable?), but also by a priori assumptions (is the process generating the data believed to be an element of any of the considered models?), as well as the motivation for performing model selection in the first place (do we want to identify which model contains the data generating process, or do we want to identify which model we may expect to predict future data best?). The best choice of methodology in any situation often depends on such particulars, and is further determined by practical considerations such as whether or not the relevant quantities can be evaluated analytically, and whether efficient algorithms exist for their calculation. MDL/Bayesian model selection has been shown to perform quite well in many different contexts and applications; in this thesis we treat some of the puzzling problems and limitations that have also become apparent over time. We also extend the idea by linking it to other topics in machine learning and statistical inference.

To apply MDL, universal codes or distributions have to be associated with each of the considered models. The preferred code is the Normalised Maximum Likelihood (NML) or Shtarkov code. However, this code yields infinite code word

203 204 Abstract lengths for many models. This first issue with MDL model selection is investigated in Chapter 2, in which we perform computer experiments to test the performance of some of the available alternatives. One result is that the model selection crite- rion based on the so-called prequential plug-in code displays inferior performance. This observation seems important because the prequential plug-in code is often thought of as a convenient alternative to other universal codes such as the NML code, as it is much easier to calculate. It was thought to result in code lengths similar to those obtained for other universal codes (such as NML, 2-part codes or Bayesian mixtures), but we discovered that this is only the case if the data generating process is in the model. We show in Chapter 3 that the redundancy of the prequential plug-in code is fundamentally different from the standard set by other universal codes if the data generating process is not an element of the model, so that caution should be exercised when it is applied to model selection. The third problem treated in this thesis is that MDL/Bayesian model selection normally does not take into account that, even in the ideal case where one of the considered models is “true” (contains the data generating process), and even if the data generating process is stationary ergodic, then still the index of the model whose associated universal code issues the best predictions of future data often changes with the sample size. Roughly put, at small sample sizes simple models often issue better predictions of future data than the more complex “true” model, i.e. the smallest model that contains the data generating distribution. When from a certain sample size onward the true model predicts best, the simpler model has already built up a lot of evidence in its favour, and a lot of additional data have to be gathered before the true model “catches up” and is finally identified by Bayesian/MDL model selection. This phenomenon is described in Chapter 5, in which we also introduce a novel model selection procedure that selects the true model almost as soon as enough data have been gathered for it to be able to issue the best predictions. The criterion is consistent: under mild conditions, the true model is selected with probability one for sufficiently large sample sizes. We also show that a prediction strategy based on this model selection criterion achieves an optimal rate of convergence: its cumulative KL-risk is as low as that of any other model selection criterion. The method is based on the switch distribution, which can be evaluated using an efficient expert tracking algorithm. More properties of this switch distribution are treated in Chapter 4, which also contains a survey of this and other expert tracking algorithms and shows how such algorithms can be formulated in terms of Hidden Markov Models. Finally, in Chapter 6 we evaluate the new theory of algorithmic rate-distortion experimentally. This theory was recently proposed by Vit´anyi and Vereshchagin as an alternative to classical rate-distortion theory. It allows analysis of the structural properties of individual objects and does not require the specification of a probability distribution on source objects. Instead it is defined in terms of Kolmogorov complexity, which is uncomputable. To be able to test this theory in practice we have approximated the Kolmogorov complexity by the compressed Abstract 205 size of a general purpose data compression algorithm. 
This practical framework is in fact a generalisation of MDL model selection. The perspectives offered in this thesis on many aspects of MDL/Bayesian model selection contribute to a better understanding of the relationships between model selection and such diverse topics as universal learning, prediction with expert advice, rate distortion theory and Kolmogorov complexity.

Samenvatting

Model selection is an elusive topic in learning theory and statistics. At first glance the problem seems clear: how should we decide which collection of probability distributions best matches the available observations? This question comes up time and again in all kinds of contexts, from hypothesis testing in general (which of these psychological models best describes how people behave?) to more concrete applications (a polynomial of which degree should we choose to describe the trend in these data? Which lossy representation of this image best describes the structural properties of the original?). In short, model selection is a key problem in many different applications. The one-size-fits-all solutions based on the Minimum Description Length (MDL) principle and the closely related Bayesian statistics are therefore widely used.

On closer inspection it turns out that the many applications of model selection differ in essential respects. They can be distinguished on the basis of technical properties (are the models nested? Parametric? Countable?), but also on the basis of a priori assumptions (do we assume that the process generating the data is an element of one of our models or not?), as well as the original motivation for performing model selection (do we want to identify the model that contains the process generating the data, or do we want to select a model that we may hope will predict future outcomes well?). The most desirable methodology in practice often depends on such issues, quite apart from practical considerations such as whether or not the relevant quantities can be computed efficiently.

It has been shown in many ways that the use of MDL/Bayesian model selection leads to good performance in many contexts; this thesis investigates a number of the puzzling problems and limitations of the methodology that have nevertheless come to light over time. The domain of application of MDL/Bayesian model selection is also extended by linking it to other topics in machine learning and statistics.

To be able to apply MDL, so-called universal codes or

universal probability distributions must be assigned to all models under consideration. The code that, according to some of the literature, is preferred is the Normalised Maximum Likelihood (NML) or Shtarkov code. It turns out, however, that for many models this code leads to infinite code lengths, so that the performance of the different models can no longer be compared. This first problem with MDL model selection is investigated in Chapter 2, in which we perform computer experiments to measure the performance of some of the available alternatives to the NML code. One of the most interesting results is that the so-called prequential plug-in code leads to inferior model selection performance. The prequential plug-in code is often regarded as a convenient alternative to other codes such as the NML code, because it is often much easier to compute. It was quite generally assumed that the resulting code lengths were comparable to those of other universal codes such as NML or 2-part codes, but our experiments show that this is not the case under all circumstances. In Chapter 3 it is shown that the redundancy of the prequential plug-in code differs fundamentally from that of other universal codes in the case where the process producing the data is not an element of the model. This means that prequential plug-in codes must be applied with care in model selection.

The third problem treated in this thesis is that MDL and Bayesian model selection normally take no account of the following: even in the ideal case where one of the available models is "true" (contains the process producing the data), and even if the data producing process is stationary and ergodic, it still depends on the amount of available data which of the available models yields the best predictions of future outcomes. Roughly speaking, when the amount of available data is small, simple models often yield better predictions than the more complex "true" model (i.e., the smallest model that contains the data producing process). When subsequently, at some point, the amount of available data has become so large that the true model starts to predict best, the simpler model has already performed so much better for so long that it can sometimes take a very long time before it is rejected by the MDL/Bayesian model selection procedure. Chapter 5 describes this phenomenon, as well as a new model selection procedure that prefers the true model almost as soon as enough data are available for that model to start yielding the best predictions. This new procedure is consistent, which means that (under mild conditions) the true model is selected with probability 1 provided sufficient data are available. We also show that prediction on the basis of this model selection procedure leads to optimally fast convergence: the cumulative KL-risk is provably as low as that of any other model selection procedure. The method is based on the switch distribution, which can be computed using an efficient algorithm for expert tracking.
The perspectives offered in this thesis on many aspects of MDL/Bayesian model selection contribute to a deeper understanding of the connections between model selection and such diverse topics as universal learning, prediction with expert advice, rate-distortion theory, and Kolmogorov complexity.

Titles in the ILLC Dissertation Series:

ILLC DS-2001-01: Maria Aloni
Quantification under Conceptual Covers

ILLC DS-2001-02: Alexander van den Bosch
Rationality in Discovery - a study of Logic, Cognition, Computation and Neuropharmacology

ILLC DS-2001-03: Erik de Haas
Logics For OO Information Systems: a Semantic Study of Object Orientation from a Categorial Substructural Perspective

ILLC DS-2001-04: Rosalie Iemhoff
Provability Logic and Admissible Rules

ILLC DS-2001-05: Eva Hoogland
Definability and Interpolation: Model-theoretic investigations

ILLC DS-2001-06: Ronald de Wolf
Quantum Computing and Communication Complexity

ILLC DS-2001-07: Katsumi Sasaki
Logics and Provability

ILLC DS-2001-08: Allard Tamminga
Belief Dynamics. (Epistemo)logical Investigations

ILLC DS-2001-09: Gwen Kerdiles
Saying It with Pictures: a Logical Landscape of Conceptual Graphs

ILLC DS-2001-10: Marc Pauly
Logic for Social Software

ILLC DS-2002-01: Nikos Massios
Decision-Theoretic Robotic Surveillance

ILLC DS-2002-02: Marco Aiello
Spatial Reasoning: Theory and Practice

ILLC DS-2002-03: Yuri Engelhardt
The Language of Graphics

ILLC DS-2002-04: Willem Klaas van Dam
On Quantum Computation Theory

ILLC DS-2002-05: Rosella Gennari
Mapping Inferences: Constraint Propagation and Diamond Satisfaction

ILLC DS-2002-06: Ivar Vermeulen
A Logical Approach to Competition in Industries

ILLC DS-2003-01: Barteld Kooi
Knowledge, chance, and change

ILLC DS-2003-02: Elisabeth Catherine Brouwer
Imagining Metaphors: Cognitive Representation in Interpretation and Understanding

ILLC DS-2003-03: Juan Heguiabehere
Building Logic Toolboxes

ILLC DS-2003-04: Christof Monz
From Document Retrieval to Question Answering

ILLC DS-2004-01: Hein Philipp Röhrig
Quantum Query Complexity and Distributed Computing

ILLC DS-2004-02: Sebastian Brand
Rule-based Constraint Propagation: Theory and Applications

ILLC DS-2004-03: Boudewijn de Bruin
Explaining Games. On the Logic of Game Theoretic Explanations

ILLC DS-2005-01: Balder David ten Cate
Model theory for extended modal languages

ILLC DS-2005-02: Willem-Jan van Hoeve
Operations Research Techniques in Constraint Programming

ILLC DS-2005-03: Rosja Mastop
What can you do? Imperative mood in Semantic Theory

ILLC DS-2005-04: Anna Pilatova
A User’s Guide to Proper Names: Their Pragmatics and Semantics

ILLC DS-2005-05: Sieuwert van Otterloo
A Strategic Analysis of Multi-agent Protocols

ILLC DS-2006-01: Troy Lee
Kolmogorov complexity and formula size lower bounds

ILLC DS-2006-02: Nick Bezhanishvili
Lattices of intermediate and cylindric modal logics

ILLC DS-2006-03: Clemens Kupke
Finitary coalgebraic logics

ILLC DS-2006-04: Robert Špalek
Quantum Algorithms, Lower Bounds, and Time-Space Tradeoffs

ILLC DS-2006-05: Aline Honingh
The Origin and Well-Formedness of Tonal Pitch Structures

ILLC DS-2006-06: Merlijn Sevenster
Branches of imperfect information: logic, games, and computation

ILLC DS-2006-07: Marie Nilsenova
Rises and Falls. Studies in the Semantics and Pragmatics of Intonation

ILLC DS-2006-08: Darko Sarenac
Products of Topological Modal Logics

ILLC DS-2007-01: Rudi Cilibrasi
Statistical Inference Through Data Compression

ILLC DS-2007-02: Neta Spiro
What contributes to the perception of musical phrases in western classical music?

ILLC DS-2007-03: Darrin Hindsill
It’s a Process and an Event: Perspectives in Event Semantics

ILLC DS-2007-04: Katrin Schulz
Minimal Models in Semantics and Pragmatics: Free Choice, Exhaustivity, and Conditionals

ILLC DS-2007-05: Yoav Seginer
Learning Syntactic Structure

ILLC DS-2008-01: Stephanie Wehner
Cryptography in a Quantum World

ILLC DS-2008-02: Fenrong Liu
Changing for the Better: Preference Dynamics and Agent Diversity

ILLC DS-2008-03: Olivier Roy
Thinking before Acting: Intentions, Logic, Rational Choice

ILLC DS-2008-04: Patrick Girard
Modal Logic for Belief and Preference Change

ILLC DS-2008-05: Erik Rietveld
Unreflective Action: A Philosophical Contribution to Integrative Neuroscience

ILLC DS-2008-06: Falk Unger
Noise in Quantum and Classical Computation and Non-locality

ILLC DS-2008-07: Steven de Rooij
Minimum Description Length Model Selection: Problems and Extensions