
Minimum Description Length Principle

Carl E. Rasmussen, Niki Kilbertus

25 January 2018

Introduction

Some Theory

Crude Two-Part MDL

Universal Codes and Refined MDL

Conclusion and Discussion

What is this about?

The minimum description length (MDL) principle is a method for inductive inference that provides a generic solution to the model selection problem and, more generally, to the overfitting problem. – Peter Grünwald

Main idea: Any regularity in the data allows us to compress them better.

Disclaimer

MDL is a principle, not a method. It allows for several methods of inductive inference, and while there are sometimes clearly “better” or “worse” versions of MDL, there is no “optimal MDL”.

I found that the really interesting questions are hard to answer without getting lost in technicalities.

This talk is entirely based on The Minimum Description Length Principle by Peter Grünwald and contains numerous verbatim quotes.

The advantages

MDL is a general theory of inductive inference with

- Occam’s razor “built in”
- no overfitting automatically, without ad hoc principles
- Bayesian interpretation
- no need for “underlying truth” for clear interpretation
- predictive interpretation (data compression ≡ probabilistic prediction)

Inductive Inference

The task of inductive inference is to find laws or regularities underlying some given set of data. These laws are then used to gain insight into the data or to classify or predict future data.

Goals of Inductive Inference

- Orthodox: The goal of probability theory is to compute probabilities of events for some given probability measure. The goal of (orthodox) statistics is, given that some events have taken place, to infer what distribution may have generated these events.
- Bayesian: Start with a family of distributions (the model) and our beliefs over them – the prior. Update the prior in light of data to get the posterior, which is used for inference about the distributions in the model and/or prediction of new data.
- Machine Learning: Find a decision/prediction rule that generalizes well (whether train and test set come from the same distribution or not).
- MDL: Try to compress the data as much as possible. Hopefully such a model also generalizes well (and is consistent).

Compression as the main focus

compress data ⇒ find its regularities and try to model them ⇒ “remember” the residuals ⇒ account for the model complexity ⇒ minimize the description length of the model + residuals

MDL views learning as data compression.

What is a description? – Examples

Data
1. 00010001000100010001...000100010001000100010001
2. 01110100110100100110...111010111011000101100010
3. 00011000001010100000...001000010000001000110001

Descriptions
1. Repeating 0001
2. Tosses of a fair coin
3. Four times as many 0s as 1s (at random locations)
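To make the third example concrete, here is a minimal Python sketch (my own illustration, not from the slides; the particular sequence and the Bernoulli(0.2) model are assumptions made for the example). Storing the sequence literally costs 1 bit per symbol, while a code that exploits the 4:1 regularity needs only about 0.72 bits per symbol:

import math

def shannon_fano_bits(seq, p1):
    """Codelength -log2 P(seq) under an i.i.d. Bernoulli(p1) model, in bits."""
    return sum(-math.log2(p1 if s == 1 else 1.0 - p1) for s in seq)

# a sequence with four times as many 0s as 1s (locations here are regular only for brevity)
seq = [0, 0, 0, 0, 1] * 200              # length 1000, 20% ones
literal = len(seq)                        # 1 bit per symbol, no model
modelled = shannon_fano_bits(seq, 0.2)    # roughly 0.72 bits per symbol
print(f"literal: {literal} bits, Bernoulli(0.2) code: {modelled:.1f} bits")

A two-part code would additionally have to pay for describing the model itself, which is exactly what the rest of the talk is about.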

No-Hypercompression-Inequality

Intuition: If we want to encode binary sequences of length n, at most a fraction 2^(−K) of all sequences can be compressed by more than K bits.

Theorem: Suppose X_1, X_2, ..., X_n are distributed according to some distribution P* on X^n (not necessarily i.i.d.). Let Q be an arbitrary distribution on X^n. Then for all K > 0

P*( −log(Q(X^n)) ≤ −log(P*(X^n)) − K ) ≤ 2^(−K) .

A short description is equivalent to identifying the data as belonging to a special, tiny subset of all possible sequences.
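The bound is easy to check empirically. The following Monte Carlo sketch is my own illustration (the choices P* = fair-coin sequences and Q = Bernoulli(0.3) are assumptions for the example); it estimates the probability that the Q-code saves at least K bits over the P*-code, which stays below 2^(−K):

import math, random

random.seed(0)
n, K, trials = 100, 5, 20000

def neg_log2_bernoulli(seq, p1):
    return sum(-math.log2(p1 if s == 1 else 1.0 - p1) for s in seq)

hits = 0
for _ in range(trials):
    x = [random.randint(0, 1) for _ in range(n)]   # x drawn from P* (fair coin)
    lp_star = float(n)                             # -log2 P*(x) = n bits
    lq = neg_log2_bernoulli(x, 0.3)                # codelength under Q
    if lq <= lp_star - K:                          # Q saves at least K bits
        hits += 1

print(f"empirical P*(saving >= {K} bits) = {hits / trials:.4f}  <=  2^-{K} = {2 ** -K:.4f}")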

Ideal MDL

Definition: The Kolmogorov complexity (KC) of a binary string is the length of the shortest program in a universal language that produces the string and then halts.

Problems:

- KC is uncomputable
- KC is only defined up to a (potentially) large constant

The MDL principle with KC as its foundation is often called ideal(ized) MDL or algorithmic statistics.

Ming Li, Paul Vitányi, An Introduction to Kolmogorov Complexity and Its Applications. Springer, 2008.
Practical MDL

Practical MDL uses a less expressive description method that is

- restrictive enough to allow us to compute the length of the shortest description of any data sequence
- general enough to allow us to compress many of the intuitively “regular” data sequences

Remark: In practice, a suitable choice of the description method will be guided by a priori knowledge of the problem domain.

Downside: There will be “regular” sequences that we cannot compress.

Terminology (intentionally imprecise)

point hypothesis: a single probability distribution or function
hypothesis: an arbitrary set of probability distributions or functions
model: a set of probability distributions or functions of the same functional form, e.g. first-order Markov models or kth-degree polynomials
model class: a family (set) of models, e.g. all Markov chains or polynomials of each order

[Figure: a model class containing model 1 and model 2; a hypothesis is an arbitrary set; a point hypothesis is a single element of a model]

Notation

H: general point hypothesis
P: probabilistic point hypothesis, e.g. N(µ, σ)
h: deterministic function, e.g. 4x^2 − 2x + 6
H (calligraphic): a set of general point hypotheses
M, M (calligraphic): probabilistic models and model classes

Terminology & Notation – Example

Problem setting: We fit a polynomial of degree k to data and perform cross-validation over “all” k for model selection.

Outcome: The best fit is obtained for the point hypothesis h = x^4 − 3x^2 + 2. We conclude that H_4 = {4th-order polynomials} is the best model in the model class of all polynomials.
Informal primer on “crude two-part MDL”

For a list of candidate models H_1, H_2, ..., the best point hypothesis to explain D in H := ∪_i H_i is

H* := argmin_{H ∈ H} [ L(H) + L(D | H) ] ,

where
- L(H) is the length, in bits, of the description of the hypothesis
- L(D | H) is the length, in bits, of the description of the data when encoded “with the help of the hypothesis”

The best model to explain D is the “smallest” H_i containing H*.

We need to operationalize L(·).

Operationalizing L(D | H) – the Shannon-Fano code

We can identify codelength functions with probability distributions such that L(D | H) = −log(P(D | H)):

short codelength ←→ high probability
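A minimal sketch of this identification (Python; the two per-symbol hypotheses are assumptions made up for the example): the codelength of the data is just −log2 of its probability under the hypothesis, so whichever hypothesis assigns the data higher probability yields the shorter description.

import math

# Hypothetical per-symbol distributions P(. | H) for two i.i.d. hypotheses.
H_fair   = {"0": 0.5, "1": 0.5}
H_biased = {"0": 0.8, "1": 0.2}

def L(data, P):
    """L(D | H) = -log2 P(D | H) for an i.i.d. hypothesis, in bits."""
    return sum(-math.log2(P[s]) for s in data)

D = "0001" * 25   # 100 symbols, mostly zeros
for name, P in [("fair", H_fair), ("biased 0.8/0.2", H_biased)]:
    print(f"L(D | {name}) = {L(D, P):.1f} bits")
# The hypothesis that assigns D higher probability gives the shorter code.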

Operationalizing L(H)

People have used “intuitively reasonable” codes for H ∈ H, which are always in danger of being arbitrary. We call these crude MDL methods.

For refined MDL methods, we need an additional principle for designing a code for H, which leads to minimax codes.

Operationalizing L(H) – Example

Models for the psychological perception (brightness) of physical dimensions (light intensity):

Stevens’s model: y = ax^b + Z
Fechner’s model: y = a ln(x + b) + Z

with noise Z ∼ N.

Although both have two free parameters, according to refined MDL, Stevens’s model is “more complex” than Fechner’s. The notion of complexity used corresponds to counting “essentially distinguishable” point hypotheses.

Introduction

Some Theory

Crude Two-Part MDL

Universal Codes and Refined MDL

Conclusion and Discussion

Probabilistic source

Definition: A probabilistic source with outcomes in a sample space X is a function P : X* → [0, ∞) such that for all n ≥ 0 and all x^n = (x_1, ..., x_n) ∈ X^n:
1. Σ_{z ∈ X} P(x^n, z) = P(x^n)
2. P(x^0) = 1, where x^0 is the empty sample,
where X* = ∪_{n ≥ 1} X^n ∪ X^0 and X^0 = {x^0}.

A probabilistic source is just a probability distribution over infinite sequences.

Coding systems and description methods

Definition: A coding system C is a relation between an alphabet A and binary strings {0, 1}*.
1. For (a, b) ∈ C, b is a code word for source symbol a.
2. C is called singular if two different source symbols have the same code word.
3. C is called partial if some source symbol does not have a code word.

Definition: A description method is a nonsingular coding system.

Definition: A code is a description method such that each source symbol is associated with at most one code word.

(De)coding functions and code length

- A description method can be identified with a decoding function C^(−1) : {0, 1}* → A ∪ {↑}, where ↑ means “undefined”.
- A code can be identified with a (possibly partial) encoding function C : A → {0, 1}* ∪ {↑}.

We denote the length (in number of bits) of the shortest description of x ∈ A under description method C by L_C(x). If C(x) is not well defined, L_C(x) := ∞.

We usually only care about codelength functions when we talk about codes.

Prefix coding systems

Definition: Prefix coding systems are coding systems in which no extension of a code word can itself be a code word.

Prefix coding systems are exactly the description methods that remain nonsingular under concatenation.

We entirely restrict ourselves to prefix codes.

Conditional description methods

Definition: Let x^n = (x_1, ..., x_n) be a sequence from the alphabet A_1 × A_2 × ... × A_n. We denote the conditional description method, i.e. the description method used to encode x_i given that the previous symbols were x_1, ..., x_{i−1}, by C(· | x^{i−1}).

- C(· | x^{i−1}) is still prefix.
- If C(· | x^i) is a code for every i, then

L_C(x^n) = L_C(x_1, ..., x_n) = L_C(x_1) + L_C(x_2 | x^1) + ... + L_C(x_n | x^{n−1}) .
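As an illustration of how conditional codelengths add up symbol by symbol (and a preview of the prequential codes mentioned later in the talk), here is a small Python sketch of my own; the Laplace (add-one) estimator used as the conditional distribution is a choice made for the example, not something fixed by the slides.

import math

def prequential_codelength(seq):
    """L(x^n) = sum_i -log2 P(x_i | x^{i-1}), with a Laplace (add-one) conditional estimator."""
    total, counts = 0.0, [1, 1]            # start with one virtual 0 and one virtual 1
    for s in seq:
        p = counts[s] / sum(counts)        # conditional probability of the next symbol
        total += -math.log2(p)             # conditional codelength L(x_i | x^{i-1})
        counts[s] += 1
    return total

seq = [0, 0, 0, 1] * 50                    # 200 symbols, 25% ones
print(f"total codelength: {prequential_codelength(seq):.1f} bits for {len(seq)} symbols")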

Uniform codes minimize maximum description length

Definition: Enumerating the elements of a finite alphabet A results in “the” uniform code U, for which

⌈log |A|⌉ = L_U(x) = max_{x ∈ A} L_U(x) = min_{L ∈ L_A} max_{x ∈ A} L(x) ,

where L_A is the set of all length functions for prefix description methods over A.

The uniform code results in the worst-case optimal codelengths over all x ∈ A.

Efficiency and complete codes

Definition: For two description methods C_1, C_2 for a set A, we call C_1 more efficient than C_2 if for all x ∈ A, L_1(x) ≤ L_2(x), and there exists x ∈ A such that L_1(x) < L_2(x).

Definition: A code C for a set A is complete if there does not exist a code C′ that is more efficient than C.

The Kraft Inequality

Theorem: For any description method C and a finite alphabet A, the code word lengths L_C(x) must satisfy

Σ_{x ∈ A} 2^(−L_C(x)) ≤ 1 .

Given a set of code word lengths that satisfy this inequality, there exists a prefix code with these code word lengths.

For a probability distribution P over the finite space X, there exists a code C for X such that for all x ∈ X

L_C(x) = ⌈−log(P(x))⌉ .

C is called the Shannon-Fano code.
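A quick numeric sanity check (my own sketch, with an arbitrary example distribution): Shannon-Fano codelengths ⌈−log2 P(x)⌉ always satisfy the Kraft inequality, so a prefix code with these lengths exists.

import math

P = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.1, "e": 0.025}   # any distribution over a finite alphabet

lengths = {x: math.ceil(-math.log2(p)) for x, p in P.items()}  # Shannon-Fano codelengths
kraft = sum(2 ** -l for l in lengths.values())

print("codelengths:", lengths)
print(f"Kraft sum = {kraft:.3f} <= 1")   # <= 1, so a prefix code with these lengths exists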

Code length functions

- Non-integer codelengths are fine.
- The set of all codelength functions for a (finite, countable, or continuous) sample space X is

L_X = { L : X → [0, ∞] : Σ_{x ∈ X} 2^(−L(x)) ≤ 1 } .

- A codelength function L in L_X is complete if Σ_{x ∈ X} 2^(−L(x)) = 1.
- L_U is the unique minimax optimal code in L_A.

Note that we (ab)use “code” synonymously for “code length function”.

Revisiting the Shannon-Fano code

Theorem:
- Let X be a countable sample space, and P a probability distribution over X. Then there exists a prefix code C for X^n such that for all x^n ∈ X^n

L_C(x^n) = −log P(x^n) .

C is called the code corresponding to P.
- Let C′ be a prefix description method for X^n. Then there exists a defective probability distribution P′ such that for all x^n ∈ X^n

−log P′(x^n) = L_{C′}(x^n) .

P′ is called the probability distribution corresponding to C′.
- C′ is a complete prefix code if and only if P′ is a proper distribution.

The code corresponding to P is P-optimal

Theorem: Let X be a countable set and P a probability distribution over X. Then

L* := argmin_{L ∈ L_X} E_{x∼P}[L(x)]

exists, is unique, and satisfies L*(x) = −log(P(x)) for all x ∈ X.

Rewritten as the information inequality: for all distributions P, Q with P ≠ Q,

E_P[−log(Q(X))] > E_P[−log(P(X))] ,

equivalently D(P||Q) ≥ 0, with “=” iff P = Q.

Worst-case optimality of the uniform code

For every distribution P ≠ P_U on a finite sample space X of size |X| = M there exists a distribution Q such that

E_Q[−log(P(X))] > log(M) = E_Q[−log(P_U(X))] .

The uniform code has optimal worst-case performance over both distributions and data sequences.

Entropy and KL divergence

Interpretation of entropy: The entropy of P,

H(P) := E_P[−log(P(X))] ,

is the expected number of bits of the P-optimal encoding of outcomes generated by P.

Interpretation of KL divergence: The KL divergence between P and Q,

D(P||Q) := E_P[log(P(X)/Q(X))] ,

is the expected number of additional bits for the Q-optimal encoding of outcomes generated by P.
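Both interpretations can be checked numerically. In the following sketch (P and Q are arbitrary assumed distributions, not from the slides), the expected codelength of the Q-optimal code under P exceeds the entropy H(P) by exactly D(P||Q):

import math

P = {"a": 0.5, "b": 0.3, "c": 0.2}
Q = {"a": 0.2, "b": 0.3, "c": 0.5}

H_P  = sum(-p * math.log2(p) for p in P.values())        # entropy H(P)
E_PQ = sum(-P[x] * math.log2(Q[x]) for x in P)           # expected Q-optimal codelength under P
D_PQ = sum(P[x] * math.log2(P[x] / Q[x]) for x in P)     # KL divergence D(P||Q)

print(f"H(P)           = {H_P:.4f} bits  (expected P-optimal codelength)")
print(f"E_P[-log Q(X)] = {E_PQ:.4f} bits  (expected Q-optimal codelength)")
print(f"difference     = {E_PQ - H_P:.4f} = D(P||Q) = {D_PQ:.4f} bits")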

Introduction

Some Theory

Crude Two-Part MDL

Universal Codes and Refined MDL

Conclusion and Discussion
Point hypothesis selection with crude two-part MDL

Let M be a candidate model or model class.
1. For data D, we set

P* := argmin_{P ∈ M} [ L_1(P) + L_2(D | P) ] =: argmin_{P ∈ M} L_{1,2}(P, D) .

2. If P* is not unique, we choose argmin_{P ∈ P*} L_1(P), i.e. the one with the lowest complexity.
3. After that we have no preference.

To formalize these ideas, we need to choose codes C_1 and C_2.

Why “crude”?

To keep it simple, we consider “crude” MDL, meaning that we simply make “intuitive” choices for C_1 and C_2.

- For C_2, the Shannon-Fano code L(D | P) := −log(P(D)) is essentially the definitive choice: it is the only code that would be optimal if P were true.
- C_1 is harder: given a specific M, we need to make ad hoc choices.

Example for crude MDL – Markov Chains

- data D = (x_1, ..., x_n) ∈ X^n for X = {0, 1}
- model class B = {all Markov chains}
- P ∈ B is parameterized by k = 2^γ (γ the order of the chain) and θ ∈ [0, 1]^k (the transition probabilities)
- a (cumbersome) discretization Θ_d^(k) of θ on a k-dimensional grid with a finite precision of d bits leads to a somewhat intuitive L_1

Eventually, two-part MDL suggests the point hypothesis

P* = argmin_{γ ∈ N, d ∈ N, θ ∈ Θ_d^(k)} [ −log(P_{k,θ}(D)) + kd + L_N(k) + L_N(d) ] ,

where −log(P_{k,θ}(D)) is the error term and kd + L_N(k) + L_N(d) is the complexity term.
Example for crude MDL – Markov Chains

- For increasing sample size we will first underfit (at small sample sizes) and then settle at the correct order.
- This procedure can be extended to model selection and parameter estimation in a straightforward way.
- With the right discretization, this is consistent for model selection of Markov chains and close to refined MDL.
- If there is something to learn, but it is not a Markov chain, the hope is that the two-part criterion still converges to the Markov chain that is in some sense “closest” to the process generating the data. Such a closest Markov chain will in many cases still lead to reasonable predictions.
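A minimal Python sketch of the two-part criterion above (my own simplification, not the slides' construction: maximum-likelihood transition probabilities stand in for the discretized θ, the precision d is fixed, and L_N is taken to be the Elias gamma codelength), applied to a noisy first-order sequence:

import math
from collections import defaultdict

def elias_gamma_bits(j):
    """Length of a simple prefix code for positive integers (Elias gamma)."""
    return 2 * int(math.floor(math.log2(j))) + 1

def neg_log_likelihood(data, order):
    """-log2 of the ML Markov-chain likelihood; the first `order` symbols are encoded literally."""
    counts = defaultdict(lambda: [0, 0])
    for i in range(order, len(data)):
        counts[tuple(data[i - order:i])][data[i]] += 1
    bits = float(order)                              # literal cost of the initial context
    for i in range(order, len(data)):
        c0, c1 = counts[tuple(data[i - order:i])]
        p = (c1 if data[i] == 1 else c0) / (c0 + c1)
        bits += -math.log2(p)                        # ML plug-in instead of a discretized theta
    return bits

def two_part_score(data, order, d=8):
    k = 2 ** order                                   # number of transition probabilities
    complexity = k * d + elias_gamma_bits(k) + elias_gamma_bits(d)
    return neg_log_likelihood(data, order) + complexity

data = [0, 1] * 400 + [0] * 5                        # essentially first-order data with a bit of noise
for order in range(4):
    print(f"order {order}: error + complexity = {two_part_score(data, order):.1f} bits")
# The total codelength is minimised at order 1, the structure actually present in the sequence.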

Introduction

Some Theory

Crude Two-Part MDL

Universal Codes and Refined MDL

Conclusion and Discussion

Encoding models, not point hypotheses

In refined MDL we associate a code for D not with a single P ∈ M, but with the full model (class) M.

Refined MDL is a general theory of inductive inference based on universal codes that are designed to be (close to) minimax optimal. It has mostly been developed for model selection, estimation and prediction.
The basic idea behind refined MDL I

To find structure in observed data D based on a model class M, we first design a “top-level” universal model P̄_top relative to M.

For different tasks we choose a different P̄_top:

- point hypothesis selection: choose a full two-part code
- prediction: choose a one-part prequential code, e.g. a meta-Bayesian code
- model selection: choose a meta-two-part code

The basic idea behind refined MDL II

1. Choose a good P̄_top; it will be partially subjective and depend on a “luckiness function”, except in special cases where minimax optimal codes exist.
2. Patterns that we want to identify explicitly must be encoded explicitly with a two-part code, e.g. the models in model selection and the point hypotheses in point hypothesis selection.
3. For patterns we need not identify explicitly, we use one-part codes, because two-part codes are incomplete.
Universal codes – Overview

Intuition: A universal code for a model M is a code that compresses the data almost as well as the best element of M. The distribution corresponding to a universal code via L(x) = −log(P(x)) is called a universal model.

Examples

- The two-part code we have already seen
- Bayesian universal codes
- Prequential plug-in universal codes
- Variants of the above and more
Universal codes – Motivation

For simplicity, we only look at countable models over countable X (instead of the more realistic parametric and non-parametric ones).

Given candidate codes L for X^n, there is no L̄ ∈ L such that for all x^n ∈ X^n

L̄(x^n) ≤ min_{L ∈ L} L(x^n) .

In words: no code has short code words for many (all) different symbols.

Are there codes that are “not much larger” than the best one for all x^n ∈ X^n?

Redundancy

Let P̄ and P be two distributions on X^n.

Definition:
- The redundancy of P̄ over P on x^n is

RED(P̄, P, x^n) := ( −log(P̄(x^n)) ) − ( −log(P(x^n)) ) .

- The worst-case redundancy of P̄ over P is

RED_max(P̄, P) := max_{x^n ∈ X^n} RED(P̄, P, x^n) .
Universal models

Let M be a family of probabilistic sources.

Definition: A universal model relative to M is a sequence of distributions P̄^(1), P̄^(2), ... on X^1, X^2, ... respectively, such that for all P ∈ M and all ε > 0 there exists n_0 ∈ N such that for all n > n_0

(1/n) RED_max(P̄^(n), P) ≤ ε .

Remarks: This can be extended to f(n)-uniformly universal models. For example, in the finite case the codelength difference is often bounded by a constant, not just increasing sublinearly, which we call O(1)-uniformly universal.
Regret

Let M be a family of probabilistic sources and P̄ a probability distribution on X^n, not necessarily in M.

Definition: For given x^n the regret of P̄ relative to M is

REG(P̄, M, x^n) := −log(P̄(x^n)) − inf_{P ∈ M} { −log(P(x^n)) } .

The regret is the additional number of bits needed to encode x^n using the code P̄ as compared to the best P in M.

Minimax optimal universal model

We measure the quality of P̄ as a universal model by its regret relative to M, taking a worst-case approach over all x^n ∈ X^n:

REG_max^(n)(P̄, M) := max_{x^n ∈ X^n} REG(P̄, M, x^n) .

Definition: The minimax optimal universal model is given by

argmin_{P̄ ∈ Δ(X^n)} REG_max^(n)(P̄, M) .

Minimize the additional bits needed compared to the best code in M in the worst case over X^n.
Complexity

Definition: The complexity of a model M is defined by

COMP^(n)(M) := log( Σ_{x^n ∈ X^n} sup_{P ∈ M} P(x^n) ) .

The more sequences x^n have large likelihood under some P ∈ M, i.e. can be fit well, the larger COMP^(n)(M) is.
Normalized maximum likelihood (NML) model

Definition: If COMP^(n)(M) is finite, the minimax regret is uniquely achieved by the normalized maximum likelihood distribution

P̄_nml^(n)(x^n) := sup_{P ∈ M} P(x^n) / Σ_{y^n ∈ X^n} sup_{P ∈ M} P(y^n) .

P̄_nml^(n) is a uniformly universal model.

P̄_nml^(n) is called an equalizer strategy, as it achieves the same regret COMP^(n)(M) on every x^n ∈ X^n. Therefore it achieves the minimax regret, which is equal to the model complexity.
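For small n this can be computed by brute force. The following sketch is my own illustration (the Bernoulli model on binary sequences of length 8 is an assumed example): it builds P̄_nml explicitly and checks the equalizer property, i.e. that the regret equals COMP^(n)(M) for every sequence.

import math
from itertools import product

n = 8  # Bernoulli model: P_theta(x^n) = theta^k (1 - theta)^(n - k), with k the number of ones

def max_likelihood(xs):
    k = sum(xs)
    theta = k / len(xs)                               # ML estimate
    return theta ** k * (1 - theta) ** (len(xs) - k)  # sup_P P(x^n); note 0 ** 0 == 1 in Python

seqs = list(product([0, 1], repeat=n))
ml = {x: max_likelihood(x) for x in seqs}
Z = sum(ml.values())                                  # normalizer
COMP = math.log2(Z)                                   # model complexity COMP^(n)(M) in bits
P_nml = {x: ml[x] / Z for x in seqs}

# The regret of P_nml is the same on every sequence (equalizer property).
regrets = {(-math.log2(P_nml[x])) - (-math.log2(ml[x])) for x in seqs}
print(f"COMP^({n})(Bernoulli) = {COMP:.4f} bits; regrets observed: "
      f"{min(regrets):.4f} .. {max(regrets):.4f}")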

Simple universal two-part codes

Let L = {L_1, L_2, ..., L_M} be a finite set of codelength functions.

Definition: First, encode with log(M) bits (using a uniform code) the index j such that L_j(x^n) = min_{L ∈ L} L(x^n). Then encode x^n using the code indexed by j. Thus there exists a universal two-part code such that for all x^n ∈ X^n

L̄_2-p(x^n) = min_{L ∈ L} L(x^n) + log(M) .

Usually min_{L ∈ L} L(x^n) grows like n, so the difference between our two-part universal code and the minimum becomes negligible in relative terms. The code is log(M)-uniformly universal.

Bayesian universal model

Let M be a countable set of probabilistic sources, parameterized by a set Θ, and let W ∈ Δ(Θ) (the prior).

Definition: Then

P̄_Bayes(x^n) := Σ_{θ ∈ Θ} P_θ(x^n) W(θ)

is a universal model; it is also called the Bayesian marginal likelihood or Bayesian mixture.

Universality here means that for all θ_0 ∈ Θ

−log(P̄_Bayes(x^n)) = −log( Σ_{θ ∈ Θ} P_θ(x^n) W(θ) ) ≤ −log(P_{θ_0}(x^n)) + c_{θ_0} ,

where c_{θ_0} = −log(W(θ_0)) does not depend on n. Thus P̄_Bayes is an O(1)-uniformly universal model.
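A small sketch of this bound (my own example; the model of three Bernoulli sources with a uniform prior W is an assumed setup): the mixture codelength never exceeds the best single source's codelength by more than −log2 W(θ_0), no matter how large n gets.

import math
import random

random.seed(1)
thetas = [0.2, 0.5, 0.8]                  # the model M: three Bernoulli sources
W = {t: 1 / len(thetas) for t in thetas}  # uniform prior

def neg_log2_bernoulli(xs, theta):
    k = sum(xs)
    return -(k * math.log2(theta) + (len(xs) - k) * math.log2(1 - theta))

def neg_log2_bayes(xs):
    # -log2 of the Bayesian mixture sum_theta P_theta(x^n) W(theta)
    mix = sum(W[t] * 2 ** (-neg_log2_bernoulli(xs, t)) for t in thetas)
    return -math.log2(mix)

for n in [10, 100, 1000]:
    xs = [1 if random.random() < 0.2 else 0 for _ in range(n)]   # data from theta = 0.2
    best = min(neg_log2_bernoulli(xs, t) for t in thetas)
    bayes = neg_log2_bayes(xs)
    print(f"n={n:5d}: mixture - best = {bayes - best:.3f} bits "
          f"(bound: -log2 W = {-math.log2(W[0.2]):.3f})")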

Individual vs. stochastic universality

Individual-sequence-based: So far we have not made the assumption that sequences are generated by some P. Hence we took a worst-case view, requiring small redundancy for all sequences. This follows the MDL philosophy that data are never really generated by a distribution.

Expectation-based: Information theorists are often willing to assume that data are actually distributed according to some P ∈ M, leading to an alternative definition of universal models.

Universality

Let M be a family of probabilistic sources, and let P̄^(1), P̄^(2), ... be a sequence of distributions on X^1, X^2, ... respectively. P̄^(1), P̄^(2), ... is universal in the

1. individual-sequence sense
2. almost-sure sense
3. expected sense

if for all P ∈ M and all ε > 0 there exists n_0 ∈ N such that for all n > n_0 we have

(1/n) [ −log P̄^(n)(x^n) − ( −log P(x^n) ) ] ≤ ε

1. for all x^n ∈ X^n,
2. with P-probability 1,
3. in P-expectation.

Expected Universality

The criterion for expected universality is equivalent to: for all P ∈ M, all ε > 0, and all sufficiently large n,

(1/n) D(P || P̄^(n)) ≤ ε .

Since the expectation is over P, one could argue that we thereby explicitly assume that the data were generated by P.

Introduction

Some Theory

Crude Two-Part MDL

Universal Codes and Refined MDL

Conclusion and Discussion
Summary of the MDL philosophy

1. Squeeze out all regularity for compression.
2. Models are identified with (universal) code lengths of descriptions of the data using the model.
3. Encoded regularities are meaningful regardless of whether a point hypothesis in the model is “true”.
4. We have only the data.
5. Consistency and convergence rates are a sanity check, not a design principle.

Are these properties, assumptions, or guiding principles of MDL?
Peter’s answer to common criticism

Criticism: Occam’s razor and MDL are arbitrary. By changing the coding method, we can make “complex” things look “simple” and vice versa.
Answer: Refined MDL is very restricted in terms of codes (in idealized MDL the code is unique).

Criticism: Occam’s razor is false. What good are simple explanations in a complex world?
Answer: MDL is a strategy (prefer simple models for small sample sizes) and not a statement about the world (simple models are more likely to be true). A strategy is not true or false, but clever or stupid. Occam’s razor is clever.

Thanks!

Luckiness function vs. prior distribution

From a Bayesian point of view, adopting a prior implies that we a priori expect certain things to happen; and strictly speaking, we should be willing to accept bets which have positive expected pay-off given these expectations. For example, we always believe that, for large n, with high probability, the ML estimator will lie in a region with high mass. If this does not happen, a low-probability event will have occurred and we will be surprised.

From an MDL point of view, adopting a luckiness function implies that we a priori hope certain things will happen. For example, we hope that the ML estimator will lie in a region with small luckiness function (i.e., high luckiness), but we are not willing to place bets on this event. If it does not happen, then we do not compress the data well, and therefore we do not have much confidence in the quality of our current predictive distribution when used for predicting future data; but we will not necessarily be surprised.

Bayes vs. MDL

1. MDL is not restricted to Bayesian universal codes.
2. Luckiness is a different and weaker type of subjectiveness than Bayesian subjectiveness. Many types of priors are meaningless from an MDL perspective.
3. Priors must compress. If a Bayesian marginal likelihood is used in an MDL context, it has to be interpretable as a universal code. This rules out the Diaconis-Freedman type priors, which make Bayesian inference inconsistent.
Bayes vs. MDL

“MDL is really just a special case of Bayes.” The NML distribution does not appear in any Bayesian textbook.

“NML is just an approximation of the log marginal likelihood.” This is not true for the “localized” NML distribution.

Luckiness function vs. prior distribution

For a parametric model {P_θ | θ ∈ Θ} and luckiness function a(θ), a prior π(θ) corresponding to a(θ) is in general not proportional to e^(−a(θ)), but rather to the luckiness-tilted Jeffreys prior π(θ) ∝ √(det I(θ)) e^(−a(θ)).

Typical statement that becomes hard to parse

For models with fixed variance, the Bayesian universal code based on a Gaussian prior coincides exactly with a particular Luckiness NML-2 code, for all sample sizes and not just asymptotically. This implies that the LNML-2 universal model for the linear regression models is prequential, i.e. it constitutes a probabilistic source.
