Minimum Description Length Principle

Teemu Roos
Department of Computer Science, Helsinki Institute for Information Technology, University of Helsinki, Helsinki, Finland

Abstract

The minimum description length (MDL) principle states that one should prefer the model that yields the shortest description of the data when the complexity of the model itself is also accounted for. MDL provides a versatile approach to statistical modeling. It is applicable to model selection and regularization. Modern versions of MDL lead to robust methods that are well suited for choosing an appropriate model complexity based on the data, thus extracting the maximum amount of information from the data without over-fitting. The modern versions of MDL go well beyond the familiar (k/2) log n formula.

Philosophy

The MDL principle is a formal version of Occam's razor. While Occam's razor only suggests that between hypotheses that are compatible with the evidence one should choose the simplest one, the MDL principle also quantifies the compatibility of the hypotheses with the evidence. This leads to a trade-off between the complexity of the hypothesis and its compatibility with the evidence ("goodness of fit").

The philosophy of the MDL principle emphasizes that the evaluation of the merits of a model should not be based on its closeness to a "true" model, whose existence is often impossible to verify, but instead on the data. Inspired by Solomonoff's theory of universal induction, Rissanen postulated that a yardstick of the performance of a model is the probability it assigns to the data. Since probability is intimately related to code length (see below), the code length provides an equivalent way to measure performance. The key idea made possible by the coding interpretation is that the length of the description of the model itself can be quantified in the same units as the code length of the data, namely, bits. Earlier, Wallace and Boulton had made a similar proposal under the title minimum message length (MML) (Wallace and Boulton 1968). A fundamental difference between the two principles is that MML is a Bayesian approach while MDL is not.

The central tenet in MDL is that the better one is able to discover the regular features in the data, the shorter the code length. Showing that this is indeed the case often requires that we assume, for the sake of argument, that the data are generated by a true distribution and verify the statistical behavior of MDL-based methods under this assumption. Hence, the emphasis on freedom from the assumption of a true model is more pertinent in the philosophy of MDL than in the technical analysis carried out in its theory.

Theory

The theory of MDL addresses two kinds of questions: (i) the first kind asks what is the shortest description achievable using a given model class, i.e., universal data compression; (ii) the second kind asks what can be said about the behavior of MDL methods when applied to model selection and other machine learning and data mining tasks. The latter kind of questions are closely related to the theory of statistical estimation and statistical learning theory. We review the theory related to these two kinds of questions separately.

Universal Data Compression

As is well known in information theory, the shortest expected code length achievable by a uniquely decodable code under a known data source, p*, is given by the entropy of the source, H(p*). The lower bound is achieved by using a code word of length ℓ*(x) = −log p*(x) bits for each source symbol x. (Here and in the following, log denotes the base-2 logarithm.) Correspondingly, a code-length function ℓ is optimal under a source distribution defined by q(x) = 2^{−ℓ(x)}. (For the sake of notational simplicity, we omit a normalizing factor C = Σ_x 2^{−ℓ(x)}, which is necessary in case the code is not complete. Likewise, as is customary in MDL, we ignore the requirement that code lengths be integers.) These results can be extended to data sequences, whereupon we write x^n = x_1 ... x_n to denote a sequence of length n.

While the case where the source distribution p* is known can be considered solved, in the sense that the average-case optimal code-length function ℓ* is easily established as described above, the case where p* is unknown is more intricate. Universal data compression studies similar lower bounds when the source distribution is not known or when the goal is not to minimize the expected code length. For example, when the source distribution is only known to be in a given model class (a set of distributions), M, the goal may be to find a code that minimizes the worst-case expected code length under any source distribution p* ∈ M. A uniquely decodable code that achieves near-optimal code lengths with respect to a given model class is said to be universal.

Rissanen's groundbreaking 1978 paper (Rissanen 1978) gives a general construction for universal codes based on two-part codes. A two-part code first includes a code for encoding a distribution, q, over source sequences. The second part encodes the data using a code based on q. The length of the second part is thus −log q(x^n) bits. The length of the first part, ℓ(q), depends on the complexity of the distribution q, which leads to a trade-off between complexity measured by ℓ(q) and goodness of fit measured by −log q(x^n):

    \min_q \left[ \ell(q) - \log q(x^n) \right]. \qquad (1)

For parametric models that are defined by a continuous parameter vector θ, a two-part coding approach requires that the parameters be quantized so that their code length is finite. Rissanen showed that given a k-dimensional parametric model class, M = {p_θ : θ ∈ Θ ⊆ R^k}, the optimal quantization of the parameter space Θ is achieved by using an accuracy of order 1/√n for each coordinate, where n is the sample size. The resulting total code length behaves as −log p̂(x^n) + (k/2) log n + O(1), where p̂(x^n) = max{p_θ(x^n) : θ ∈ Θ} is the maximum probability under model class M. Note that the leading terms of the formula are equivalent to the Bayesian information criterion (BIC) of Schwarz (Schwarz 1978). Later, Rissanen also showed that this is a lower bound on the code length of any universal code that holds for all but a measure-zero subset of sources in the given model class (Rissanen 1986).
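The two-part selection rule is easy to try on a toy problem. The following sketch is a minimal illustration and not part of the original entry: the model classes (an i.i.d. Bernoulli model with k = 1 and a first-order Markov chain with k = 2), the synthetic data, and all function names are assumptions made for the example. Each class is scored by the code length −log₂ p̂(x^n) + (k/2) log₂ n, and the class giving the shorter description is preferred.

```python
import math
import random

def bern_nll(bits):
    """-log2 of the maximized Bernoulli likelihood (k = 1 parameter)."""
    n, m = len(bits), sum(bits)
    if m in (0, n):
        return 0.0
    p = m / n
    return -(m * math.log2(p) + (n - m) * math.log2(1 - p))

def markov1_nll(bits):
    """-log2 of the maximized first-order Markov likelihood (k = 2 parameters),
    conditioning on the first symbol."""
    nll = 0.0
    for state in (0, 1):
        successors = [b for a, b in zip(bits, bits[1:]) if a == state]
        if successors:
            nll += bern_nll(successors)
    return nll

def two_part_length(nll, k, n):
    """Total two-part code length: -log2 p_hat(x^n) + (k/2) log2 n."""
    return nll + 0.5 * k * math.log2(n)

# Synthetic binary data with strong first-order dependence:
# each bit repeats the previous one with probability 0.9.
random.seed(0)
bits = [0]
for _ in range(499):
    bits.append(bits[-1] if random.random() < 0.9 else 1 - bits[-1])

n = len(bits)
for name, nll, k in [("Bernoulli, k=1", bern_nll(bits), 1),
                     ("Markov(1), k=2", markov1_nll(bits), 2)]:
    print(f"{name}: {two_part_length(nll, k, n):.1f} bits")
# MDL prefers the Markov chain here: the extra (1/2) log2 n bits for its
# second parameter are more than repaid by the shorter encoding of the data.
```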
The above results have subsequently been refined by studying the asymptotic and finite-sample values of the O(1) residual term for specific model classes. The resulting formulas lead to a more accurate characterization of the model complexity, often involving the Fisher information (Rissanen 1996).

Subsequently, Rissanen and others have proposed other kinds of universal codes that are superior to two-part codes. These include Bayes-type mixture codes that involve a prior distribution for the unknown parameters (Rissanen 1986), predictive forms of MDL (Rissanen 1984; Wei 1992), and, most importantly, normalized maximum likelihood (NML) codes (Shtarkov 1987; Rissanen 1996). The latter have the important pointwise minimax property that they achieve the minimum worst-case pointwise redundancy:

    \min_q \max_{x^n} \left[ -\log q(x^n) + \log \hat{p}(x^n) \right],

where the maximum is over all possible data sequences of length n and the minimum is over all distributions.

Behavior of MDL-Based Learning Methods

The philosophy of MDL suggests that data compression is a measure of the success in discovering regularities in the data, and hence, better compression implies better modeling. Showing that this is indeed the case is the second kind of theory related to MDL.

Barron and Cover proposed the index of resolvability as a measure of the hardness of estimating a probabilistic source in a two-part coding setting (see above) (Barron and Cover 1991). It is defined as

    R_n(p^*) = \min_q \left\{ \frac{\ell(q)}{n} + D(p^* \,\|\, q) \right\},

where p* is the source distribution and D(p* || q) denotes the Kullback-Leibler divergence between p* and q. Intuitively, a source is easily estimable if there exists a simple distribution that is close to the source. The result by Barron and Cover bounds the Hellinger distance between the true source distribution and the distribution q̂ minimizing the two-part code length, Eq. (1), as

    d_H^2(p^*, \hat{q}) \le O\!\left(R_n(p^*)\right) \quad \text{in } p^*\text{-probability}.

For model selection problems, consistency is often defined in relation to a fixed set of alternative model classes and a criterion that selects one of them given the data. If the criterion leads to the simplest model class that contains the true source distribution, the criterion is said to be consistent. (Note that the additional requirement that the selected model class is the simplest one is needed in order to circumvent a trivial solution in nested model classes, where simpler models are subsets of more complex model classes.) There are a large number of results showing that various MDL-based model selection criteria are consistent; for examples, see the next section.

Applications

MDL has been applied in a wide range of applications. It is well suited for model selection problems where one needs not only to estimate continuous parameters but also their number and, more generally, the model structure, based on statistical data. Other approaches applicable in many such scenarios include Bayesian methods (including minimum message length), cross validation, and structural risk minimization (see Cross-References below).

Some example applications include the following:

1. Autoregressive models, Markov chains, and their generalizations such as tree machines were among the first model classes studied in the MDL literature; see Rissanen (1978, 1984) and Weinberger et al. (1995).
2. Linear regression. Selecting a subset of relevant covariates is a classical example of a situation involving models of variable complexity; see Speed and Yu (1993), Wei (1992), and Rissanen (2000).
3. Discretization of continuous covariates enables the use of learning methods that require discrete data. The granularity of the discretization can be determined by applying MDL; see Fayyad and Irani (1993).
4. The structure of probabilistic graphical models encodes conditional independencies and determines the complexity of the model. Their structure can be learned by MDL; see, e.g., Lam and Bacchus (1994) and Silander et al. (2010).
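As a concrete illustration of NML-based model selection, the following sketch is a toy example added here and not part of the original entry: the two model classes (a single Bernoulli parameter versus separate Bernoulli parameters on the two halves of the sequence), the data, and all function names are assumptions made for the illustration. The Shtarkov normalizer C_n for the Bernoulli class is computed exactly from its binomial closed form, and each class is scored by its NML code length −log₂ p̂(x^n) + log₂ C_n.

```python
import math

def bern_complexity(n):
    """Shtarkov sum C_n = sum over all binary y^n of max_p p(y^n),
    via the closed form sum_m C(n, m) (m/n)^m ((n-m)/n)^(n-m)."""
    total = 0.0
    for m in range(n + 1):
        term = math.comb(n, m)
        if 0 < m < n:
            p = m / n
            term *= p**m * (1 - p)**(n - m)
        total += term
    return total

def bern_nll(bits):
    """-log2 of the maximized Bernoulli likelihood."""
    n, m = len(bits), sum(bits)
    if m in (0, n):
        return 0.0
    p = m / n
    return -(m * math.log2(p) + (n - m) * math.log2(1 - p))

def nml_one_segment(bits):
    """NML code length for a single Bernoulli parameter."""
    return bern_nll(bits) + math.log2(bern_complexity(len(bits)))

def nml_two_segments(bits):
    """NML code length for separate Bernoulli parameters on the two halves
    (fixed split); the maximized likelihood and the Shtarkov sum factorize."""
    h = len(bits) // 2
    a, b = bits[:h], bits[h:]
    return (bern_nll(a) + bern_nll(b)
            + math.log2(bern_complexity(len(a)) * bern_complexity(len(b))))

# A sequence whose statistics change halfway through.
x = [0] * 45 + [1] * 5 + [1] * 40 + [0] * 10
print(f"one Bernoulli parameter : {nml_one_segment(x):.1f} bits")
print(f"two-segment Bernoulli   : {nml_two_segments(x):.1f} bits")
# The two-segment class wins: its extra parametric complexity,
# log2(C_50^2 / C_100), is only a few bits, far less than the gain in fit.
```

In line with the minimax property above, log₂ C_n grows roughly like (1/2) log₂ n for this one-parameter class, so the NML score reproduces the (k/2) log n penalty of the two-part code without any explicit quantization of the parameter.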

Future Directions

The development of efficient and computationally tractable codes for practically relevant model classes is required in order to apply MDL more commonly in modern statistical applications. The following are among the most important future directions:

– While the original (k/2) log n formula is still regularly referred to as "the MDL principle," future work should focus on modern formulations involving more advanced codes such as the NML and its variations.
– There is strong empirical evidence suggesting that coding strategies with strong minimax properties lead to robust model selection methods, see, e.g., Silander et al. (2010). Tools akin to the index of resolvability are needed to gain better theoretical understanding of the properties of modern MDL methods.
– Scaling up to modern big data applications, where model complexity regularization is crucial, requires approximate versions of MDL with sublinear computational and storage requirements. Predictive MDL is a promising approach for handling high-throughput streaming data scenarios.

Cross-References

Complete Minimum Description Length
Cross Validation
Inductive Inference
Learning Graphical Models
Minimum Message Length
Model Evaluation
Occam's Razor
Overfitting
Regularization
Structural Risk Minimization
Universal Learning Theory

Recommended Reading

Good review articles on MDL include Barron et al. (1998) and Hansen and Yu (2001). The textbook by Grünwald (2007) is a comprehensive and detailed reference covering developments until 2007.

Barron A, Cover T (1991) Minimum complexity density estimation. IEEE Trans Inf Theory 37(4):1034–1054
Barron A, Rissanen J, Yu B (1998) The minimum description length principle in coding and modeling. IEEE Trans Inf Theory 44:2734–2760
Fayyad U, Irani K (1993) Multi-interval discretization of continuous-valued attributes for classification learning. In: Bajcsy R (ed) Proceedings of the 13th International Joint Conference on Artificial Intelligence, Chambéry. Morgan Kaufmann
Grünwald P (2007) The Minimum Description Length Principle. MIT Press, Cambridge
Hansen M, Yu B (2001) Model selection and the principle of minimum description length. J Am Stat Assoc 96(454):746–774
Lam W, Bacchus F (1994) Learning Bayesian belief networks: an approach based on the MDL principle. Comput Intell 10:269–293
Rissanen J (1978) Modeling by shortest data description. Automatica 14(5):465–471
Rissanen J (1984) Universal coding, information, prediction, and estimation. IEEE Trans Inf Theory 30:629–636
Rissanen J (1986) Stochastic complexity and modeling. Ann Stat 14(3):1080–1100
Rissanen J (1996) Fisher information and stochastic complexity. IEEE Trans Inf Theory 42(1):40–47
Rissanen J (2000) MDL denoising. IEEE Trans Inf Theory 46(7):2537–2543
Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464
Shtarkov YM (1987) Universal sequential coding of single messages. Probl Inf Transm 23(3):3–17
Silander T, Roos T, Myllymäki P (2010) Learning locally minimax optimal Bayesian networks. Int J Approx Reason 51(5):544–557
Speed T, Yu B (1993) Model selection and prediction: normal regression. Ann Inst Stat Math 45(1):35–54
Wallace C, Boulton D (1968) An information measure for classification. Comput J 11(2):185–194
Wei C (1992) On predictive least squares principles. Ann Stat 20(1):1–42
Weinberger M, Rissanen J, Feder M (1995) A universal finite memory source. IEEE Trans Inf Theory 41(3):643–652