Minimum Description Length Principle

Teemu Roos
Department of Computer Science, Helsinki Institute for Information Technology, University of Helsinki, Helsinki, Finland

Abstract

The minimum description length (MDL) principle states that one should prefer the model that yields the shortest description of the data when the complexity of the model itself is also accounted for. MDL provides a versatile approach to statistical modeling. It is applicable to model selection and regularization. Modern versions of MDL lead to robust methods that are well suited for choosing an appropriate model complexity based on the data, thus extracting the maximum amount of information from the data without over-fitting. The modern versions of MDL go well beyond the familiar (k/2) log n formula.

Philosophy

The MDL principle is a formal version of Occam's razor. While Occam's razor only suggests that between hypotheses that are compatible with the evidence one should choose the simplest one, the MDL principle also quantifies the compatibility of the hypotheses with the evidence. This leads to a trade-off between the complexity of the hypothesis and its compatibility with the evidence ("goodness of fit").

The philosophy of the MDL principle emphasizes that the evaluation of the merits of a model should not be based on its closeness to a "true" model, whose existence is often impossible to verify, but instead on the data. Inspired by Solomonoff's theory of universal induction, Rissanen postulated that a yardstick of the performance of a model is the probability it assigns to the data. Since probability is intimately related to code length (see below), the code length provides an equivalent way to measure performance. The key idea made possible by the coding interpretation is that the length of the description of the model itself can be quantified in the same units as the code length of the data, namely, bits. Earlier, Wallace and Boulton had made a similar proposal under the title minimum message length (MML) (Wallace and Boulton 1968). A fundamental difference between the two principles is that MML is a Bayesian approach while MDL is not.

The central tenet in MDL is that the better one is able to discover the regular features in the data, the shorter the code length. Showing that this is indeed the case often requires that we assume, for the sake of argument, that the data are generated by a true distribution and verify the statistical behavior of MDL-based methods under this assumption. Hence, the emphasis on freedom from the assumption of a true model is more pertinent in the philosophy of MDL than in the technical analysis carried out in its theory.

Theory

The theory of MDL addresses two kinds of questions: (i) the first kind asks what is the shortest description achievable using a given model class, i.e., universal data compression; (ii) the second kind asks what can be said about the behavior of MDL methods when applied to model selection and other machine learning and data mining tasks. The latter kind of questions are closely related to the theory of statistical estimation and statistical learning theory. We review the theory related to these two kinds of questions separately.

Universal Data Compression

As is well known in information theory, the shortest expected code length achievable by a uniquely decodable code under a known data source, p*, is given by the entropy of the source, H(p*). The lower bound is achieved by using a code word of length ℓ*(x) = −log p*(x) bits for each source symbol x. (Here and in the following, log denotes the base-2 logarithm.) Correspondingly, a code-length function ℓ is optimal under a source distribution defined by q(x) = 2^{−ℓ(x)}. (For the sake of notational simplicity, we omit a normalizing factor C = Σ_x 2^{−ℓ(x)}, which is necessary in case the code is not complete. Likewise, as is customary in MDL, we ignore the requirement that code lengths be integers.) These results can be extended to data sequences, whereupon we write x^n = x_1 ... x_n to denote a sequence of length n.

While the case where the source distribution p* is known can be considered solved, in the sense that the average-case optimal code-length function ℓ* is easily established as described above, the case where p* is unknown is more intricate. Universal data compression studies similar lower bounds when the source distribution is not known or when the goal is not to minimize the expected code length. For example, when the source distribution is only known to be in a given model class (a set of distributions), M, the goal may be to find a code that minimizes the worst-case expected code length under any source distribution p* ∈ M. A uniquely decodable code that achieves near-optimal code lengths with respect to a given model class is said to be universal.

Rissanen's groundbreaking 1978 paper (Rissanen 1978) gives a general construction for universal codes based on two-part codes. A two-part code first includes a code for encoding a distribution, q, over source sequences. The second part encodes the data using a code based on q. The length of the second part is thus −log q(x^n) bits. The length of the first part, ℓ(q), depends on the complexity of the distribution q, which leads to a trade-off between complexity measured by ℓ(q) and goodness of fit measured by −log q(x^n):

    \min_q \left[ \ell(q) - \log q(x^n) \right]. \qquad (1)

For parametric models that are defined by a continuous parameter vector θ, a two-part coding approach requires that the parameters be quantized so that their code length is finite. Rissanen showed that given a k-dimensional parametric model class, M = {p_θ : θ ∈ Θ ⊆ R^k}, the optimal quantization of the parameter space Θ is achieved by using an accuracy of order 1/√n for each coordinate, where n is the sample size. The resulting total code length behaves as −log p̂(x^n) + (k/2) log n + O(1), where p̂(x^n) = max{p_θ(x^n) : θ ∈ Θ} is the maximum probability under model class M. Note that the leading terms of the formula are equivalent to the Bayesian information criterion (BIC) of Schwarz (Schwarz 1978). Later, Rissanen also showed that this is a lower bound on the code length of any universal code that holds for all but a measure-zero subset of sources in the given model class (Rissanen 1986).
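The two-part selection rule is easy to try on a toy problem. The following sketch is a minimal illustration and not part of the original entry: the model classes (an i.i.d. Bernoulli model with k = 1 and a first-order Markov chain with k = 2), the synthetic data, and all function names are assumptions made for the example. Each class is scored by the code length −log₂ p̂(x^n) + (k/2) log₂ n, and the class giving the shorter description is preferred.

```python
import math
import random

def bern_nll(bits):
    """-log2 of the maximized Bernoulli likelihood (k = 1 parameter)."""
    n, m = len(bits), sum(bits)
    if m in (0, n):
        return 0.0
    p = m / n
    return -(m * math.log2(p) + (n - m) * math.log2(1 - p))

def markov1_nll(bits):
    """-log2 of the maximized first-order Markov likelihood (k = 2 parameters),
    conditioning on the first symbol."""
    nll = 0.0
    for state in (0, 1):
        successors = [b for a, b in zip(bits, bits[1:]) if a == state]
        if successors:
            nll += bern_nll(successors)
    return nll

def two_part_length(nll, k, n):
    """Total two-part code length: -log2 p_hat(x^n) + (k/2) log2 n."""
    return nll + 0.5 * k * math.log2(n)

# Synthetic binary data with strong first-order dependence:
# each bit repeats the previous one with probability 0.9.
random.seed(0)
bits = [0]
for _ in range(499):
    bits.append(bits[-1] if random.random() < 0.9 else 1 - bits[-1])

n = len(bits)
for name, nll, k in [("Bernoulli, k=1", bern_nll(bits), 1),
                     ("Markov(1), k=2", markov1_nll(bits), 2)]:
    print(f"{name}: {two_part_length(nll, k, n):.1f} bits")
# MDL prefers the Markov chain here: the extra (1/2) log2 n bits for its
# second parameter are more than repaid by the shorter encoding of the data.
```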
The above results have subsequently been refined by studying the asymptotic and finite-sample values of the O(1) residual term for specific model classes. The resulting formulas lead to a more accurate characterization of the model complexity, often involving the Fisher information (Rissanen 1996).

Subsequently, Rissanen and others have proposed other kinds of universal codes that are superior to two-part codes. These include Bayes-type mixture codes that involve a prior distribution for the unknown parameters (Rissanen 1986), predictive forms of MDL (Rissanen 1984; Wei 1992), and, most importantly, normalized maximum likelihood (NML) codes (Shtarkov 1987; Rissanen 1996). The latter have the important pointwise minimax property that they achieve the minimum worst-case pointwise redundancy:

    \min_q \max_{x^n} \left[ -\log q(x^n) + \log \hat{p}(x^n) \right],

where the maximum is over all possible data sequences of length n and the minimum is over all distributions.

Behavior of MDL-Based Learning Methods

The philosophy of MDL suggests that data compression is a measure of the success in discovering regularities in the data, and hence, better compression implies better modeling. Showing that this is indeed the case is the second kind of theory related to MDL.

Barron and Cover proposed the index of resolvability as a measure of the hardness of estimating a probabilistic source in a two-part coding setting (see above) (Barron and Cover 1991). It is defined as

    R_n(p^*) = \min_q \left\{ \frac{\ell(q)}{n} + D(p^* \,\|\, q) \right\},

where p* is the source distribution and D(p* || q) denotes the Kullback-Leibler divergence between p* and q. Intuitively, a source is easily estimable if there exists a simple distribution that is close to the source. The result by Barron and Cover bounds the Hellinger distance between the true source distribution and the distribution q̂ minimizing the two-part code length, Eq. (1), as

    d_H^2(p^*, \hat{q}) \le O\!\left(R_n(p^*)\right) \quad \text{in } p^*\text{-probability}.

For model selection problems, consistency is often defined in relation to a fixed set of alternative model classes and a criterion that selects one of them given the data. If the criterion leads to the simplest model class that contains the true source distribution, the criterion is said to be consistent. (Note that the additional requirement that the selected model class is the simplest one is needed in order to circumvent a trivial solution in nested model classes, where simpler models are subsets of more complex model classes.) There are a large number of results showing that various MDL-based model selection criteria are consistent; for examples, see the next section.

Applications

MDL has been applied in a wide range of applications. It is well suited for model selection problems where one needs not only to estimate continuous parameters but also their number and, more generally, the model structure, based on statistical data. Other approaches applicable in many such scenarios include Bayesian methods (including minimum message length), cross validation, and structural risk minimization (see Cross-References below).

Some example applications include the following:

1. Autoregressive models, Markov chains, and their generalizations such as tree machines were among the first model classes studied in the MDL literature; see Rissanen (1978, 1984) and Weinberger et al. (1995).
2. Linear regression. Selecting a subset of relevant covariates is a classical example of a situation involving models of variable complexity; see Speed and Yu (1993), Wei (1992), and Rissanen (2000).
3. Discretization of continuous covariates enables the use of learning methods that require discrete data. The granularity of the discretization can be determined by applying MDL; see Fayyad and Irani (1993).
4. The structure of probabilistic graphical models encodes conditional independencies and determines the complexity of the model. Their structure can be learned by MDL; see, e.g., Lam and Bacchus (1994) and Silander et al. (2010).
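As a concrete illustration of NML-based model selection, the following sketch is a toy example added here and not part of the original entry: the two model classes (a single Bernoulli parameter versus separate Bernoulli parameters on the two halves of the sequence), the data, and all function names are assumptions made for the illustration. The Shtarkov normalizer C_n for the Bernoulli class is computed exactly from its binomial closed form, and each class is scored by its NML code length −log₂ p̂(x^n) + log₂ C_n.

```python
import math

def bern_complexity(n):
    """Shtarkov sum C_n = sum over all binary y^n of max_p p(y^n),
    via the closed form sum_m C(n, m) (m/n)^m ((n-m)/n)^(n-m)."""
    total = 0.0
    for m in range(n + 1):
        term = math.comb(n, m)
        if 0 < m < n:
            p = m / n
            term *= p**m * (1 - p)**(n - m)
        total += term
    return total

def bern_nll(bits):
    """-log2 of the maximized Bernoulli likelihood."""
    n, m = len(bits), sum(bits)
    if m in (0, n):
        return 0.0
    p = m / n
    return -(m * math.log2(p) + (n - m) * math.log2(1 - p))

def nml_one_segment(bits):
    """NML code length for a single Bernoulli parameter."""
    return bern_nll(bits) + math.log2(bern_complexity(len(bits)))

def nml_two_segments(bits):
    """NML code length for separate Bernoulli parameters on the two halves
    (fixed split); the maximized likelihood and the Shtarkov sum factorize."""
    h = len(bits) // 2
    a, b = bits[:h], bits[h:]
    return (bern_nll(a) + bern_nll(b)
            + math.log2(bern_complexity(len(a)) * bern_complexity(len(b))))

# A sequence whose statistics change halfway through.
x = [0] * 45 + [1] * 5 + [1] * 40 + [0] * 10
print(f"one Bernoulli parameter : {nml_one_segment(x):.1f} bits")
print(f"two-segment Bernoulli   : {nml_two_segments(x):.1f} bits")
# The two-segment class wins: its extra parametric complexity,
# log2(C_50^2 / C_100), is only a few bits, far less than the gain in fit.
```

In line with the minimax property above, log₂ C_n grows roughly like (1/2) log₂ n for this one-parameter class, so the NML score reproduces the (k/2) log n penalty of the two-part code without any explicit quantization of the parameter.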

Future Directions

The development of efficient and computationally tractable codes for practically relevant model classes is required in order to apply MDL more commonly in modern statistical applications. The following are among the most important future directions:

– While the original (k/2) log n formula is still regularly referred to as "the MDL principle," future work should focus on modern formulations involving more advanced codes such as the NML and its variations.
– There is strong empirical evidence suggesting that coding strategies with strong minimax properties lead to robust model selection methods, see, e.g., Silander et al. (2010). Tools akin to the index of resolvability are needed to gain better theoretical understanding of the properties of modern MDL methods.
– Scaling up to modern big data applications, where model complexity regularization is crucial, requires approximate versions of MDL with sublinear computational and storage requirements. Predictive MDL is a promising approach for handling high-throughput streaming data scenarios.

Cross-References

Complete Minimum Description Length
Cross Validation
Inductive Inference
Learning Graphical Models
Minimum Message Length
Model Evaluation
Occam's Razor
Overfitting
Regularization
Structural Risk Minimization
Universal Learning Theory

Recommended Reading

Good review articles on MDL include Barron et al. (1998) and Hansen and Yu (2001). The textbook by Grünwald (2007) is a comprehensive and detailed reference covering developments until 2007.

Barron A, Cover T (1991) Minimum complexity density estimation. IEEE Trans Inf Theory 37(4):1034–1054
Barron A, Rissanen J, Yu B (1998) The minimum description length principle in coding and modeling. IEEE Trans Inf Theory 44:2734–2760
Fayyad U, Irani K (1993) Multi-interval discretization of continuous-valued attributes for classification learning. In: Bajcsy R (ed) Proceedings of the 13th International Joint Conference on Artificial Intelligence, Chambéry. Morgan Kaufmann
Grünwald P (2007) The Minimum Description Length Principle. MIT Press, Cambridge
Hansen M, Yu B (2001) Model selection and the principle of minimum description length. J Am Stat Assoc 96(454):746–774
Lam W, Bacchus F (1994) Learning Bayesian belief networks: an approach based on the MDL principle. Comput Intell 10:269–293
Rissanen J (1978) Modeling by shortest data description. Automatica 14(5):465–471
Rissanen J (1984) Universal coding, information, prediction, and estimation. IEEE Trans Inf Theory 30:629–636
Rissanen J (1986) Stochastic complexity and modeling. Ann Stat 14(3):1080–1100
Rissanen J (1996) Fisher information and stochastic complexity. IEEE Trans Inf Theory 42(1):40–47
Rissanen J (2000) MDL denoising. IEEE Trans Inf Theory 46(7):2537–2543
Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464
Shtarkov YM (1987) Universal sequential coding of single messages. Probl Inf Transm 23(3):3–17
Silander T, Roos T, Myllymäki P (2010) Learning locally minimax optimal Bayesian networks. Int J Approx Reason 51(5):544–557
Speed T, Yu B (1993) Model selection and prediction: normal regression. Ann Inst Stat Math 45(1):35–54
Wallace C, Boulton D (1968) An information measure for classification. Comput J 11(2):185–194
Wei C (1992) On predictive least squares principles. Ann Stat 20(1):1–42
Weinberger M, Rissanen J, Feder M (1995) A universal finite memory source. IEEE Trans Inf Theory 41(3):643–652