
MDL Histogram Density Estimation

Petri Kontkanen, Petri Myllymäki
Complex Systems Computation Group (CoSCo)
Helsinki Institute for Information Technology (HIIT)
University of Helsinki and Helsinki University of Technology
P.O. Box 68 (Department of Computer Science)
FIN-00014 University of Helsinki, Finland
{Firstname}.{Lastname}@hiit.fi

Abstract

We regard histogram density estimation as a model selection problem. Our approach is based on the information-theoretic minimum description length (MDL) principle, which can be applied for tasks such as data clustering, density estimation, image denoising and model selection in general. MDL-based model selection is formalized via the normalized maximum likelihood (NML) distribution, which has several desirable optimality properties. We show how this framework can be applied for learning generic, irregular (variable-width bin) histograms, and how to compute the NML model selection criterion efficiently. We also derive a dynamic programming algorithm for finding both the MDL-optimal bin count and the cut point locations in polynomial time. Finally, we demonstrate our approach via simulation tests.

1 INTRODUCTION

Density estimation is one of the central problems in statistics and machine learning. Given a sample of observations, the goal of histogram density estimation is to find a piecewise constant density that describes the data best according to some predetermined criterion. Although histograms are conceptually simple densities, they are very flexible and can model complex properties like multi-modality with a relatively small number of parameters. Furthermore, one does not need to assume any specific form for the underlying density function: given enough bins, a histogram adapts to any kind of density.

Most existing methods for learning histogram densities assume that the bin widths are equal and concentrate only on finding the optimal bin count. These regular histograms are, however, often problematic. It has been argued (Rissanen, Speed, & Yu, 1992) that regular histograms are only good for describing roughly uniform data. If the data distribution is strongly non-uniform, the bin count must necessarily be high if one wants to capture the details of the high density portion of the data. This in turn means that an unnecessarily large number of bins is wasted in the low density region.

To avoid the problems of regular histograms one must allow the bins to be of variable width. For these irregular histograms, it is necessary to find the optimal set of cut points in addition to the number of bins, which naturally makes the learning problem essentially more difficult. For solving this problem, we regard histogram density estimation as a model selection task, where the cut point sets are considered as models. In this framework, one must first choose a set of candidate cut points, from which the optimal model is searched for. The quality of each of the cut point sets is then measured by some model selection criterion.

Our approach is based on information theory, more specifically on the minimum description length (MDL) principle developed in the series of papers (Rissanen, 1978, 1987, 1996). MDL is a well-founded, general framework for performing model selection and other types of statistical inference. The fundamental idea behind the MDL principle is that any regularity in data can be used to compress the data, i.e., to find a description or code of it such that this description uses fewer symbols than other codes and fewer than it takes to describe the data literally. The more regularities there are, the more the data can be compressed. According to the MDL principle, learning can be equated with finding regularities in data. Consequently, we can say that the more we are able to compress the data, the more we have learned about it.
Model selection with MDL is done by minimizing a quantity called the stochastic complexity, which is the shortest description length of the given data relative to a given model class. The definition of the stochastic complexity is based on the normalized maximum likelihood (NML) distribution introduced in (Shtarkov, 1987; Rissanen, 1996). The NML distribution has several theoretical optimality properties, which make it a very attractive candidate for performing model selection. It was originally (Rissanen, 1996) formulated as a unique solution to the minimax problem presented in (Shtarkov, 1987), which implied that NML is the minimax optimal universal model. Later (Rissanen, 2001), it was shown that NML is also the solution to a related problem involving expected regret. See Section 2 and (Rissanen, 2001; Grünwald, 2006; Rissanen, 2005) for more discussion on the theoretical properties of the NML.

On the practical side, NML has been successfully applied to several problems. We mention here two examples. In (Kontkanen, Myllymäki, Buntine, Rissanen, & Tirri, 2006), NML was used for data clustering, and its performance was compared to alternative approaches like Bayesian clustering. The results showed that NML was especially impressive with small sample sizes. In (Roos, Myllymäki, & Tirri, 2005), NML was applied to the denoising of computer images. Since the MDL principle in general can be interpreted as separating information from noise, this approach is very natural.

Unfortunately, in most practical applications of NML one must face severe computational problems, since the definition of the NML involves a normalizing integral or sum, called the parametric complexity, which usually is difficult to compute. One of the contributions of this paper is to show how the parametric complexity can be computed efficiently in the histogram case, which makes it possible to use NML as a model selection criterion in practice.

There is obviously an exponential number of different cut point sets. Therefore, a brute-force search is not feasible. Another contribution of this paper is to show how the NML-optimal cut point locations can be found via dynamic programming in polynomial (quadratic) time with respect to the size of the set containing the cut points considered in the optimization process.

Histogram density estimation is naturally a well-studied problem, but unfortunately almost all of the previous studies, e.g. (Birge & Rozenholc, 2002; Hall & Hannan, 1988; Yu & Speed, 1992), consider regular histograms only. Most similar to our work is (Rissanen et al., 1992), in which irregular histograms are learned with the Bayesian mixture criterion using a uniform prior. The same criterion is also used in (Hall & Hannan, 1988), but the histograms are equal-width only. Another similarity between our work and (Rissanen et al., 1992) is the dynamic programming optimization process, but since the optimality criterion is not the same, the process itself is quite different. It should be noted that these differences are significant, as the Bayesian mixture criterion does not possess the optimality properties of NML mentioned above.

This paper is structured as follows. In Section 2 we discuss the basic properties of the MDL framework in general, and also shortly review the optimality properties of the NML distribution. Section 3 introduces the NML histogram density and also provides a solution to the related computational problem. The cut point optimization process based on dynamic programming is the topic of Section 4. Finally, in Section 5 our approach is demonstrated via simulation tests.

2 PROPERTIES OF MDL AND NML

The MDL principle has several desirable properties. Firstly, it automatically protects against overfitting when learning both the parameters and the structure (number of parameters) of the model. Secondly, there is no need to assume the existence of some underlying "true" model, which is not the case with several other statistical methods. The model is only used as a technical device for constructing an efficient code. MDL is also closely related to the Bayesian framework, but there are some fundamental differences, the most important being that MDL is not dependent on any prior distribution; it only uses the data at hand.

MDL model selection is based on minimization of the stochastic complexity.
In the following, we give the definition of the stochastic complexity and then proceed by discussing its theoretical properties.

Let x^n = (x_1, ..., x_n) be a data sample of n outcomes, where each outcome x_j is an element of some space of observations X. The n-fold cartesian product X × ··· × X is denoted by X^n, so that x^n ∈ X^n. Consider a set Θ ⊆ R^d, where d is a positive integer. A class of parametric distributions indexed by the elements of Θ is called a model class. That is, a model class M is defined as M = {f(· | θ) : θ ∈ Θ}. Denote the maximum likelihood estimate of data x^n by θ̂(x^n), i.e.,

$$\hat{\theta}(x^n) = \arg\max_{\theta \in \Theta} \{f(x^n \mid \theta)\}. \tag{1}$$

The normalized maximum likelihood (NML) density (Shtarkov, 1987) is now defined as

$$f_{\mathrm{NML}}(x^n \mid \mathcal{M}) = \frac{f(x^n \mid \hat{\theta}(x^n), \mathcal{M})}{\mathcal{R}^n_{\mathcal{M}}}, \tag{2}$$

where the normalizing constant R^n_M is given by

$$\mathcal{R}^n_{\mathcal{M}} = \int_{x^n \in \mathcal{X}^n} f(x^n \mid \hat{\theta}(x^n), \mathcal{M})\, \mathrm{d}x^n, \tag{3}$$

and the range of integration goes over the space of data samples of size n. If the data is discrete, the integral is replaced by the corresponding sum.

The stochastic complexity of the data x^n given a model class M is defined via the NML density as

$$\mathrm{SC}(x^n \mid \mathcal{M}) = -\log f_{\mathrm{NML}}(x^n \mid \mathcal{M}) \tag{4}$$
$$= -\log f(x^n \mid \hat{\theta}(x^n), \mathcal{M}) + \log \mathcal{R}^n_{\mathcal{M}}, \tag{5}$$

and the term log R^n_M is called the parametric complexity or minimax regret.

The parametric complexity can be interpreted as measuring the logarithm of the number of essentially different (distinguishable) distributions in the model class. Intuitively, if two distributions assign high likelihood to the same data samples, they do not contribute much to the overall complexity of the model class, and the distributions should not be counted as different for the purposes of statistical inference. See (Balasubramanian, 2006) for more discussion on this topic.
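As a concrete illustration of definitions (2)–(5) — our own example, not part of the original text — consider the Bernoulli model class over binary sequences x^n containing k ones. The maximized likelihood and the parametric complexity are then

$$f(x^n \mid \hat{\theta}(x^n)) = \Big(\frac{k}{n}\Big)^{k} \Big(\frac{n-k}{n}\Big)^{n-k},
\qquad
\mathcal{R}^n_{\mathcal{M}} = \sum_{k=0}^{n} \binom{n}{k} \Big(\frac{k}{n}\Big)^{k} \Big(\frac{n-k}{n}\Big)^{n-k},$$

with the convention 0^0 = 1. For n = 2 this gives R^2 = 1 + 1/2 + 1 = 5/2, so the parametric complexity is log(5/2). The same sum reappears as the base case (19) of the histogram computation in Section 3.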

The NML density (2) has several important theoretical optimality properties. The first one is that NML provides a unique solution to the minimax problem posed in (Shtarkov, 1987),

$$\min_{\hat{f}} \max_{x^n} \log \frac{f(x^n \mid \hat{\theta}(x^n), \mathcal{M})}{\hat{f}(x^n \mid \mathcal{M})} = \log \mathcal{R}^n_{\mathcal{M}}. \tag{6}$$

This means that the NML density is the minimax optimal universal model. A related property of NML involving expected regret was proven in (Rissanen, 2001). This property states that NML also minimizes

$$\min_{\hat{f}} \max_{g} E_g \log \frac{f(x^n \mid \hat{\theta}(x^n), \mathcal{M})}{\hat{f}(x^n \mid \mathcal{M})} = \log \mathcal{R}^n_{\mathcal{M}}, \tag{7}$$

where the expectation is taken over x^n and g is the worst-case data generating density.

Having now discussed the MDL principle and the NML density in general, we return to the main topic of the paper. In the next section, we instantiate the NML density for histograms and show how the parametric complexity can be computed efficiently in this case.

3 NML HISTOGRAM DENSITY

Consider a sample of n outcomes x^n = (x_1, ..., x_n) on the interval [x_min, x_max]. Typically, x_min and x_max are defined as the minimum and maximum value in x^n, respectively. Without any loss of generality, we assume that the data is sorted into increasing order. Furthermore, we assume that the data is recorded at a finite accuracy ǫ, which means that each x_j ∈ x^n belongs to the set X defined by

$$\mathcal{X} = \Big\{x_{\min} + t\epsilon : t = 0, \ldots, \frac{x_{\max} - x_{\min}}{\epsilon}\Big\}. \tag{8}$$

This assumption is made to simplify the mathematical formulation, and as can be seen later, the effect of the accuracy parameter ǫ on the stochastic complexity is a constant that can be ignored in the model selection process.

Let C = (c_1, ..., c_{K-1}) be an increasing sequence of points partitioning the range [x_min − ǫ/2, x_max + ǫ/2] into the following K intervals (bins):

$$\big([x_{\min} - \epsilon/2,\, c_1],\ ]c_1, c_2],\ \ldots,\ ]c_{K-1},\, x_{\max} + \epsilon/2]\big). \tag{9}$$

The points c_k are called the cut points of the histogram. Note that the original data range [x_min, x_max] is extended by ǫ/2 from both ends for technical reasons. It is natural to assume that there is only one cut point between two consecutive elements of X, since placing two or more cut points there would always produce unnecessary empty bins. For simplicity, we assume that the cut points belong to the set C defined by

$$\mathcal{C} = \Big\{x_{\min} + \epsilon/2 + t\epsilon : t = 0, \ldots, \frac{x_{\max} - x_{\min}}{\epsilon} - 1\Big\}, \tag{10}$$

i.e., each c_k ∈ C is a midpoint of two consecutive values of X.

Define c_0 = x_min − ǫ/2, c_K = x_max + ǫ/2 and let L_k = c_k − c_{k−1}, k = 1, ..., K, be the bin lengths. Given a parameter vector θ ∈ Θ,

$$\Theta = \{(\theta_1, \ldots, \theta_K) : \theta_k \geq 0,\ \theta_1 + \cdots + \theta_K = 1\}, \tag{11}$$

and a set (sequence) of cut points C, we now define the histogram density f_h by

$$f_h(x \mid \theta, C) = \frac{\epsilon \cdot \theta_k}{L_k}, \tag{12}$$

where x ∈ ]c_{k−1}, c_k]. Note that (12) does not define a density in the purest sense, since f_h(x | θ, C) is actually the probability that x falls into the interval ]x − ǫ/2, x + ǫ/2]. Given (12), the likelihood of the whole data sample x^n is easy to write. We have

$$f_h(x^n \mid \theta, C) = \prod_{k=1}^{K} \Big(\frac{\epsilon \cdot \theta_k}{L_k}\Big)^{h_k}, \tag{13}$$

where h_k is the number of data points falling into bin k.

To instantiate the NML distribution (2) for the histogram density f_h, we need to find the maximum likelihood parameters θ̂(x^n) = (θ̂_1, ..., θ̂_K) and an efficient way to compute the parametric complexity (3). It is well-known that the ML parameters are given by the relative frequencies θ̂_k = h_k/n, so that we have

$$f_h(x^n \mid \hat{\theta}(x^n), C) = \prod_{k=1}^{K} \Big(\frac{\epsilon \cdot h_k}{L_k \cdot n}\Big)^{h_k}. \tag{14}$$
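The following minimal sketch (our own illustration; the function and variable names are not from the paper) shows how the maximum likelihood histogram density (12)–(14) can be evaluated for a given cut point sequence and accuracy ǫ:

```python
import numpy as np

def ml_histogram_density(data, cut_points, eps):
    """Evaluate the ML histogram density (12) with theta_k = h_k / n
    for the bins defined by the given cut points (illustrative sketch)."""
    x = np.sort(np.asarray(data, dtype=float))
    n = len(x)
    # Bin borders: the data range extended by eps/2 from both ends (Section 3).
    edges = np.concatenate(([x[0] - eps / 2],
                            np.asarray(cut_points, dtype=float),
                            [x[-1] + eps / 2]))
    # Bin counts h_k with the half-open convention ]c_{k-1}, c_k].
    idx = np.searchsorted(edges, x, side="left") - 1
    counts = np.bincount(np.clip(idx, 0, len(edges) - 2), minlength=len(edges) - 1)
    lengths = np.diff(edges)                 # bin lengths L_k

    def density(points):
        k = np.clip(np.searchsorted(edges, points, side="left") - 1,
                    0, len(counts) - 1)
        # f_h = eps * h_k / (L_k * n); dividing by eps gives a per-unit density.
        return eps * counts[k] / (lengths[k] * n)

    return density
```

For example, `ml_histogram_density([1.0, 1.2, 3.4], cut_points=[2.3], eps=0.1)` returns a callable that can be evaluated on a grid for plotting.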
Denote now the parametric complexity of a K-bin histogram by log R^n_{h_K}. The first thing to notice is that since the data is pre-discretized, the integral in (3) is replaced by a sum over the space X^n. We have

$$\mathcal{R}^n_{h_K} = \sum_{x^n \in \mathcal{X}^n} \prod_{k=1}^{K} \Big(\frac{\epsilon \cdot h_k}{L_k \cdot n}\Big)^{h_k} \tag{15}$$
$$= \sum_{h_1 + \cdots + h_K = n} \frac{n!}{h_1! \cdots h_K!} \prod_{k=1}^{K} \Big(\frac{L_k}{\epsilon}\Big)^{h_k} \prod_{k=1}^{K} \Big(\frac{\epsilon \cdot h_k}{L_k \cdot n}\Big)^{h_k} \tag{16}$$
$$= \sum_{h_1 + \cdots + h_K = n} \frac{n!}{h_1! \cdots h_K!} \prod_{k=1}^{K} \Big(\frac{h_k}{n}\Big)^{h_k}, \tag{17}$$

where the term (L_k/ǫ)^{h_k} in (16) follows from the fact that an interval of length L_k contains exactly L_k/ǫ members of the set X, and the multinomial coefficient n!/(h_1! ··· h_K!) counts the number of arrangements of n objects into K boxes containing h_1, ..., h_K objects, respectively.

Although the final form (17) of the parametric complexity is still an exponential sum, we can compute it efficiently. It turns out that (17) is exactly the same as the parametric complexity of a K-valued multinomial, which we studied in (Kontkanen & Myllymäki, 2005). In that work, we derived the recursion

$$\mathcal{R}^n_{h_K} = \mathcal{R}^n_{h_{K-1}} + \frac{n}{K-2} \cdot \mathcal{R}^n_{h_{K-2}}, \tag{18}$$

which holds for K > 2. It is now straightforward to write a linear-time algorithm based on (18). The computation starts with the trivial case R^n_{h_1} ≡ 1. The case K = 2 is a simple sum

$$\mathcal{R}^n_{h_2} = \sum_{h_1 + h_2 = n} \frac{n!}{h_1!\, h_2!} \Big(\frac{h_1}{n}\Big)^{h_1} \Big(\frac{h_2}{n}\Big)^{h_2}, \tag{19}$$

which clearly can be computed in time O(n). Finally, recursion (18) is applied K − 2 times to end up with R^n_{h_K}. The time complexity of the whole computation is O(n + K).

Having now derived both the maximum likelihood parameters and the parametric complexity, we are ready to write down the stochastic complexity (5) for the histogram model. We have

$$\mathrm{SC}(x^n \mid C) = -\log \frac{\prod_{k=1}^{K} \big(\frac{\epsilon \cdot h_k}{L_k \cdot n}\big)^{h_k}}{\mathcal{R}^n_{h_K}} \tag{20}$$
$$= -\sum_{k=1}^{K} h_k \big(\log(\epsilon \cdot h_k) - \log(L_k \cdot n)\big) + \log \mathcal{R}^n_{h_K}. \tag{21}$$

Equation (21) is the basis for measuring the quality of NML histograms, i.e., comparing different cut point sets. It should be noted that since the term Σ_{k=1}^K −h_k log ǫ = −n log ǫ is a constant with respect to C, the value of ǫ does not affect the comparison. In the next section we will discuss how NML-optimal histograms can be found in practice.
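As a rough sketch of these two computations — our own code, with the function names and the use of natural logarithms being our assumptions — the recursion (18) with base cases R^n_{h_1} = 1 and (19), together with the stochastic complexity (21), could be implemented as follows:

```python
import math

def multinomial_complexity(n, K):
    """Parametric complexity R^n_{h_K} of a K-bin histogram, i.e. of a
    K-valued multinomial, via the sum (19) and the recursion (18)."""
    if K == 1:
        return 1.0
    # Base case (19): an O(n) sum over h_1 + h_2 = n, evaluated in log-space
    # term by term for numerical robustness at large n.
    R_prev = 1.0
    R = sum(math.exp(math.lgamma(n + 1) - math.lgamma(h + 1) - math.lgamma(n - h + 1)
                     + (h * math.log(h / n) if h else 0.0)
                     + ((n - h) * math.log((n - h) / n) if h != n else 0.0))
            for h in range(n + 1))
    # Recursion (18): R^n_{h_K} = R^n_{h_{K-1}} + n/(K-2) * R^n_{h_{K-2}}, K > 2.
    for k in range(3, K + 1):
        R_prev, R = R, R + n / (k - 2) * R_prev
    return R

def stochastic_complexity(counts, lengths, eps):
    """Stochastic complexity (21) for given bin counts h_k and lengths L_k
    (natural logarithms; empty bins contribute zero to the data term)."""
    n, K = sum(counts), len(counts)
    data_term = -sum(h * (math.log(eps * h) - math.log(L * n))
                     for h, L in zip(counts, lengths) if h > 0)
    return data_term + math.log(multinomial_complexity(n, K))
```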

4 LEARNING MDL-OPTIMAL HISTOGRAMS

In this section we describe a dynamic programming algorithm which can be used to efficiently find both the optimal bin count and the cut point locations. We start by giving the exact definition of the problem. Let C̃ ⊆ C denote the candidate cut point set, which is the set of cut points we consider in the optimization process. How C̃ is chosen in practice depends on the problem at hand. The simplest choice is naturally C̃ = C, which means that all the possible cut points are candidates. However, if the value of the accuracy parameter ǫ is small or the data range contains large gaps, this choice might not be practical. Another idea would be to define C̃ to be the set of midpoints of all the consecutive value pairs in the data x^n. This choice, however, does not allow empty bins, and thus the potential large gaps are still problematic.

A much more sensible choice is to place two candidate cut points between each pair of consecutive values in the data. It is straightforward to prove, and also intuitively clear, that these two candidate points should be placed as close as possible to the respective data points. In this way, the resulting bin lengths are as small as possible, which will produce the greatest likelihood for the data. These considerations suggest that C̃ should be chosen as

$$\tilde{\mathcal{C}} = \big(\{x_j - \epsilon/2 : x_j \in x^n\} \cup \{x_j + \epsilon/2 : x_j \in x^n\}\big) \setminus \{x_{\min} - \epsilon/2,\, x_{\max} + \epsilon/2\}. \tag{22}$$

Note that the end points x_min − ǫ/2 and x_max + ǫ/2 are excluded from C̃, since they are always implicitly included in all the cut point sets.

After choosing the candidate cut point set, the histogram density estimation problem is straightforward to define: find the cut point set C ⊆ C̃ which optimizes the given goodness criterion. In our case the criterion is based on the stochastic complexity (21), and the cut point sets are considered as models. In practical model selection tasks, however, the stochastic complexity criterion itself may not be sufficient. The reason is that it is also necessary to encode the model index in some way, as argued in (Grünwald, 2006). In some tasks, an encoding based on the uniform distribution is appropriate. Typically, if the set of models is finite and the models are of the same complexity, this choice is suitable. In the histogram case, however, cut point sets of different size produce densities which are dramatically different complexity-wise. Therefore, it is natural to assume that the model index is encoded with a uniform distribution over all the cut point sets of the same size. For a K-bin histogram with the size of the candidate cut point set fixed to E, there are clearly $\binom{E}{K-1}$ ways to choose the cut points. Thus, the codelength for encoding them is $\log \binom{E}{K-1}$.

After these considerations, we define the final criterion (or score) used for comparing different cut point sets as

$$B(x^n \mid E, K, C) = \mathrm{SC}(x^n \mid C) + \log \binom{E}{K-1} \tag{23}$$
$$= -\sum_{k=1}^{K} h_k \big(\log(\epsilon \cdot h_k) - \log(L_k \cdot n)\big) + \log \mathcal{R}^n_{h_K} + \log \binom{E}{K-1}. \tag{24}$$

It is clear that there is an exponential number of possible cut point sets, and thus an exhaustive search to minimize (24) is not feasible. However, the optimal cut point set can be found via dynamic programming, which works by tabulating partial solutions to the problem. The final solution is then found recursively.

Let us first assume that the elements of C̃ are indexed in such a way that

$$\tilde{\mathcal{C}} = \{\tilde{c}_1, \ldots, \tilde{c}_E\}, \qquad \tilde{c}_1 < \tilde{c}_2 < \cdots < \tilde{c}_E. \tag{25}$$

We also define c̃_{E+1} = x_max + ǫ/2. Denote

$$\hat{B}_{K,e} = \min_{C \subseteq \tilde{\mathcal{C}}} B(x^{n_e} \mid E, K, C), \tag{26}$$

where x^{n_e} = (x_1, ..., x_{n_e}) is the portion of the data falling into the interval [x_min, c̃_e], for e = 1, ..., E + 1. This means that B̂_{K,e} is the optimizing value of (24) when the data is restricted to x^{n_e}. For a fixed K, B̂_{K,E+1} is clearly the final solution we are looking for, since the interval [x_min, c̃_{E+1}] contains all the data.

Consider now a K-bin histogram with cut points C = (c̃_{e_1}, ..., c̃_{e_{K-1}}). Assuming that the data range is restricted to [x_min, c̃_{e_K}] for some c̃_{e_K} > c̃_{e_{K-1}}, we can straightforwardly write the score function B(x^{n_{e_K}} | E, K, C) by using the score function of a (K − 1)-bin histogram with cut points C′ = (c̃_{e_1}, ..., c̃_{e_{K-2}}) as

$$B(x^{n_{e_K}} \mid E, K, C) = B(x^{n_{e_{K-1}}} \mid E, K-1, C')
- (n_{e_K} - n_{e_{K-1}}) \big(\log(\epsilon \cdot (n_{e_K} - n_{e_{K-1}})) - \log((\tilde{c}_{e_K} - \tilde{c}_{e_{K-1}}) \cdot n)\big)
+ \log \frac{\mathcal{R}^{n_{e_K}}_{h_K}}{\mathcal{R}^{n_{e_{K-1}}}_{h_{K-1}}} + \log \frac{E - K + 2}{K - 1}, \tag{27}$$

since (n_{e_K} − n_{e_{K-1}}) is the number of data points falling into the K-th bin, (c̃_{e_K} − c̃_{e_{K-1}}) is the length of that bin, and

$$\log \frac{\binom{E}{K-1}}{\binom{E}{K-2}} = \log \frac{E - K + 2}{K - 1}. \tag{28}$$

We can now write the dynamic programming recursion as

$$\hat{B}_{K,e} = \min_{e'} \Big\{ \hat{B}_{K-1,e'}
- (n_e - n_{e'}) \big(\log(\epsilon \cdot (n_e - n_{e'})) - \log((\tilde{c}_e - \tilde{c}_{e'}) \cdot n)\big)
+ \log \frac{\mathcal{R}^{n_e}_{h_K}}{\mathcal{R}^{n_{e'}}_{h_{K-1}}} + \log \frac{E - K + 2}{K - 1} \Big\}, \tag{29}$$

where e′ = K − 1, ..., e − 1. The recursion is initialized with

$$\hat{B}_{1,e} = -n_e \cdot \big(\log(\epsilon \cdot n_e) - \log((\tilde{c}_e - (x_{\min} - \epsilon/2)) \cdot n)\big), \tag{30}$$

for e = 1, ..., E + 1. After that, the bin count is always increased by one, and (29) is applied for e = K, ..., E + 1 until a pre-determined maximum bin count K_max is reached. The minimum B̂_{K,E+1} over K is then chosen to be the final solution. By constantly keeping track of which e′ minimizes (29) during the process, the optimal cut point sequence can also be recovered. The time complexity of the whole algorithm is O(E² · K_max).
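To make the procedure concrete, here is a rough dynamic programming sketch in the spirit of (29)–(30). It is our own illustration, not the authors' code: the function and variable names are ours, it reuses `multinomial_complexity()` from the previous sketch, and it uses natural logarithms throughout.

```python
import math
import numpy as np

def nml_histogram(data, eps, K_max=10):
    """Search for the MDL-optimal bin count and cut points by minimizing
    the criterion B of (24) with recursion (29), initialized by (30).
    Illustrative sketch only."""
    x = np.sort(np.asarray(data, dtype=float))
    n = len(x)
    lo, hi = x[0] - eps / 2, x[-1] + eps / 2
    # Candidate cut points (22): two candidates around every data point;
    # rounding guards against floating-point near-duplicates.
    cand = sorted({round(v, 12) for xj in x for v in (xj - eps / 2, xj + eps / 2)}
                  - {round(lo, 12), round(hi, 12)})
    E = len(cand)
    K_max = min(K_max, E + 1)
    tilde = cand + [hi]                                        # c~_1 ... c~_{E+1}
    n_e = [int(np.searchsorted(x, c, side="right")) for c in tilde]  # points in [x_min, c~_e]

    # Precompute log R^{m}_{h_K} for every prefix size m that occurs.
    logR = {m: [0.0, 0.0] + [math.log(multinomial_complexity(m, K))
                             for K in range(2, K_max + 1)]
            for m in set(n_e)}

    INF = float("inf")
    B = [[INF] * (E + 2) for _ in range(K_max + 1)]            # B[K][e]
    back = [[0] * (E + 2) for _ in range(K_max + 1)]
    for e in range(1, E + 2):                                  # initialization (30)
        B[1][e] = -n_e[e - 1] * (math.log(eps * n_e[e - 1])
                                 - math.log((tilde[e - 1] - lo) * n))
    for K in range(2, K_max + 1):                              # recursion (29)
        for e in range(K, E + 2):
            for ep in range(K - 1, e):
                dn = n_e[e - 1] - n_e[ep - 1]
                data_term = 0.0 if dn == 0 else -dn * (
                    math.log(eps * dn) - math.log((tilde[e - 1] - tilde[ep - 1]) * n))
                score = (B[K - 1][ep] + data_term
                         + logR[n_e[e - 1]][K] - logR[n_e[ep - 1]][K - 1]
                         + math.log((E - K + 2) / (K - 1)))
                if score < B[K][e]:
                    B[K][e], back[K][e] = score, ep
    # Best bin count, then backtrack the optimal cut point sequence.
    K_best = min(range(1, K_max + 1), key=lambda K: B[K][E + 1])
    cuts, e = [], E + 1
    for K in range(K_best, 1, -1):
        e = back[K][e]
        cuts.append(tilde[e - 1])
    return K_best, sorted(cuts)
```

Given data recorded on an ǫ-grid, a call such as `nml_histogram(data, eps=0.1, K_max=20)` would return the selected bin count and the corresponding cut point locations.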
5 EMPIRICAL RESULTS

The quality of a density estimator is usually measured by a suitable distance metric between the data generating density and the estimated one. This is often problematic, since we typically do not know the data generating density, which means that some heavy assumptions must be made. The MDL principle, however, states that the stochastic complexity (plus the codelength for encoding the model index) itself can be used as a goodness measure. Therefore, it is not necessary to use any additional way of assessing the quality of an MDL density estimator. The optimality properties of the NML criterion, and the fact that we are able to find the global optimum in the histogram case, make sure that the final result is theoretically valid.

Nevertheless, to demonstrate the behaviour of the NML histogram method in practice we implemented the dynamic programming algorithm of the previous section and ran some simulation tests. We generated data samples of various sizes from four densities of different shapes (see below) and then used the dynamic programming method to find the NML-optimal histograms. In all the tests, the accuracy parameter ǫ was fixed to 0.1. We decided to use Gaussian finite mixtures as generating densities, since they are very flexible and easy to sample from. The four generating densities we chose and the corresponding NML-optimal histograms using a sample of 10000 data points are shown in Figures 1 and 2. The densities are labeled gm2, gm5, gm6 and gm8, and they are mixtures of 2, 5, 6 and 8 Gaussian components, respectively, with various amounts of overlap between the components. From the plots we can see that the NML histogram method is able to capture properties such as multi-modality (all densities) and long tails (gm6). Another nice feature is that the algorithm automatically places more bins in the areas where more detail is needed, like the high, narrow peaks of gm5 and gm6.

Figure 1: The Gaussian finite mixture densities gm2 and gm5 and the NML-optimal histograms with sample size 10000.

To see the behaviour of the NML histogram density algorithm with varying amounts of data, we generated data samples of various sizes between 100 and 10000 from the four generating densities. For each case, we measured the distance between the generating density and the NML-optimal histogram. As the distance measure we used the (squared) Hellinger distance

$$h^2(f, g) = \int \big(\sqrt{f(x)} - \sqrt{g(x)}\big)^2 \, \mathrm{d}x, \tag{31}$$

which has often been used in the histogram context before (see, e.g., (Birge & Rozenholc, 2002; Kanazawa, 1993)). The actual values of the Hellinger distance were calculated via numerical integration. The results can be found in Figure 3. The curves are averaged over 10 different samples of each size. The figure shows that the NML histogram density converges to the generating one quite rapidly when the sample size is increased. The shapes of the convergence curves with the four generating densities are also very similar, which is further evidence of the flexibility of variable-width histograms.

Figure 3: The Hellinger distance between the four generating densities and the corresponding NML-optimal histograms as a function of the sample size.
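As an aside — our own sketch, not code from the paper — the squared Hellinger distance (31) between an estimated density and a known generating density can be approximated with a simple midpoint rule:

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """Gaussian density, used here only to build an illustrative mixture."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def squared_hellinger(f, g, lo, hi, num=100000):
    """Squared Hellinger distance (31) between densities f and g on [lo, hi],
    approximated by a midpoint rule on a uniform grid."""
    step = (hi - lo) / num
    xs = lo + step * (np.arange(num) + 0.5)
    diff = np.sqrt(np.asarray(f(xs), dtype=float)) - np.sqrt(np.asarray(g(xs), dtype=float))
    return float(np.sum(diff ** 2) * step)

# Example with a two-component mixture; the parameters are ours, not the paper's gm2.
mix = lambda x: 0.6 * gauss_pdf(x, 0.0, 1.0) + 0.4 * gauss_pdf(x, 5.0, 0.5)
```

When comparing against an NML histogram, recall that (12) is an ǫ-interval probability, so it should be divided by ǫ to obtain a density on the same scale as the generating one.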

To visually see the effect of the sample size, we plotted the NML-optimal histograms against the generating density gm6 with sample sizes 100, 1000 and 10000. These plots can be found in Figure 4. As a reference, we also plotted the empirical distributions of the data samples as (mirrored) equal-width histograms (the negative y-values). Each bar of the empirical plot has width 0.1 (the value of the accuracy parameter ǫ). When the sample size is 100, the NML histogram algorithm has chosen only 3 bins, and the resulting histogram density is rather crude. However, the small sample size does not justify placing any more bins, as can be seen from the empirical distribution. Therefore, we claim that the NML-optimal solution is actually a very sensible one. When the sample size is increased, the bin count is increased and more and more details are captured. Notice that with all the sample sizes, the bin widths of the NML-optimal histograms are strongly variable. It is clear that it would be impossible for any equal-width histogram density estimator to produce such detailed results using the same amount of data.

Figure 4: The generating density gm6, the NML-optimal histograms and the empirical distributions with sample sizes 100, 1000 and 10000.

Figure 2: The Gaussian finite mixture densities gm6 and gm8 and the NML-optimal histograms with sample size 10000.


6 CONCLUSION

In this paper we have presented an information-theoretic framework for histogram density estimation. The selected approach based on the MDL principle has several advantages. Firstly, the MDL criterion for model selection (stochastic complexity) has nice theoretical optimality properties. Secondly, by regarding histogram estimation as a model selection problem, it is possible to learn generic, variable-width bin histograms and also to estimate the optimal bin count automatically. Furthermore, the MDL criterion itself can be used as a measure of the quality of a density estimator, which means that there is no need to assume anything about the underlying generating density. Since the model selection criterion is based on the NML distribution, there is also no need to specify any prior distribution for the parameters.

To make our approach practical, we presented an efficient way to compute the value of the stochastic complexity in the histogram case. We also derived a dynamic programming algorithm for efficiently optimizing the NML-based criterion. Consequently, we were able to find the globally optimal bin count and cut point locations in quadratic time with respect to the size of the candidate cut point set.

In addition to the theoretical part, we demonstrated the validity of our approach by simulation tests. In these tests, data samples of various sizes were generated from Gaussian finite mixture densities with highly complex shapes. The results showed that the NML histograms automatically adapt to various kinds of densities.

In the future, our plan is to perform an extensive set of empirical tests using both simulated and real data. In these tests, we will compare our approach to other histogram estimators. It is anticipated that the various equal-width estimators will not perform well in the tests due to the severe limitations of regular histograms. More interesting will be the comparative performance of the density estimator in (Rissanen et al., 1992), which is similar to ours but based on the Bayesian mixture criterion. Theoretically, our version has an advantage at least with small sample sizes.

Another interesting application of NML histograms would be to use them for modeling the class-specific distributions of classifiers such as the Naive Bayes. These distributions are usually modeled with a Gaussian density or a multinomial distribution with equal-width discretization, which typically cannot capture all the relevant properties of the distributions. Although the NML histogram is not specifically tailored for classification tasks, it seems evident that if the class-specific distributions are modeled with high accuracy, the resulting classifier also performs well.

Acknowledgements

This work was supported in part by the Academy of Finland under the project Civi and by the Finnish Funding Agency for Technology and Innovation under the projects Kukot and PMMA. In addition, this work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. This publication only reflects the authors' views.

References

Balasubramanian, V. (2006). MDL, Bayesian inference, and the geometry of the space of probability distributions. In P. Grünwald, I. Myung, & M. Pitt (Eds.), Advances in minimum description length: Theory and applications (pp. 81–98). The MIT Press.

Birge, L., & Rozenholc, Y. (2002, April). How many bins should be put in a regular histogram. (Prepublication no 721, Laboratoire de Probabilites et Modeles Aleatoires, CNRS-UMR 7599, Universite Paris VI & VII)

Grünwald, P. (2006). Minimum description length tutorial. In P. Grünwald, I. Myung, & M. Pitt (Eds.), Advances in minimum description length: Theory and applications (pp. 23–79). The MIT Press.

Hall, P., & Hannan, E. (1988). On stochastic complexity and nonparametric density estimation. Biometrika, 75(4), 705–714.

Kanazawa, Y. (1993). Hellinger distance and Akaike's information criterion for the histogram. Statist. Probab. Letters(17), 293–298.

Kontkanen, P., & Myllymäki, P. (2005). Analyzing the stochastic complexity via tree polynomials (Tech. Rep. No. 2005-4). Helsinki Institute for Information Technology (HIIT).

Kontkanen, P., Myllymäki, P., Buntine, W., Rissanen, J., & Tirri, H. (2006). An MDL framework for data clustering. In P. Grünwald, I. Myung, & M. Pitt (Eds.), Advances in minimum description length: Theory and applications. The MIT Press.

Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 445–471.

Rissanen, J. (1987). Stochastic complexity. Journal of the Royal Statistical Society, 49(3), 223–239 and 252–265.

Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Transactions on Information Theory, 42(1), 40–47.

Rissanen, J. (2001). Strong optimality of the normalized ML models as universal codes and information in data. IEEE Transactions on Information Theory, 47(5), 1712–1717.

Rissanen, J. (2005, August). Lectures on statistical modeling theory. (Available online at www.mdl-research.org)

Rissanen, J., Speed, T., & Yu, B. (1992). Density estimation by stochastic complexity. IEEE Transactions on Information Theory, 38(2), 315–323.

Roos, T., Myllymäki, P., & Tirri, H. (2005). On the behavior of MDL denoising. In R. G. Cowell & Z. Ghahramani (Eds.), Proceedings of the 10th international workshop on artificial intelligence and statistics (AISTATS) (pp. 309–316). Barbados: Society for Artificial Intelligence and Statistics.

Shtarkov, Y. M. (1987). Universal sequential coding of single messages. Problems of Information Transmission, 23, 3–17.

Yu, B., & Speed, T. (1992). Data compression and histograms. Probab. Theory Relat. Fields, 92, 195–229.