MDL Histogram Density Estimation
Petri Kontkanen, Petri Myllymäki
Complex Systems Computation Group (CoSCo)
Helsinki Institute for Information Technology (HIIT)
University of Helsinki and Helsinki University of Technology
P.O. Box 68 (Department of Computer Science)
FIN-00014 University of Helsinki, Finland
{Firstname}.{Lastname}@hiit.fi

Abstract

We regard histogram density estimation as a model selection problem. Our approach is based on the information-theoretic minimum description length (MDL) principle, which can be applied for tasks such as data clustering, density estimation, image denoising and model selection in general. MDL-based model selection is formalized via the normalized maximum likelihood (NML) distribution, which has several desirable optimality properties. We show how this framework can be applied for learning generic, irregular (variable-width bin) histograms, and how to compute the NML model selection criterion efficiently. We also derive a dynamic programming algorithm for finding both the MDL-optimal bin count and the cut point locations in polynomial time. Finally, we demonstrate our approach via simulation tests.

1 INTRODUCTION

Density estimation is one of the central problems in statistical inference and machine learning. Given a sample of observations, the goal of histogram density estimation is to find a piecewise constant density that describes the data best according to some predetermined criterion. Although histograms are conceptually simple densities, they are very flexible and can model complex properties like multi-modality with a relatively small number of parameters. Furthermore, one does not need to assume any specific form for the underlying density function: given enough bins, a histogram estimator adapts to any kind of density.

Most existing methods for learning histogram densities assume that the bin widths are equal and concentrate only on finding the optimal bin count. These regular histograms are, however, often problematic. It has been argued (Rissanen, Speed, & Yu, 1992) that regular histograms are only good for describing roughly uniform data. If the data distribution is strongly non-uniform, the bin count must necessarily be high if one wants to capture the details of the high-density portion of the data. This in turn means that an unnecessarily large number of bins is wasted in the low-density region.

To avoid the problems of regular histograms one must allow the bins to be of variable width. For these irregular histograms, it is necessary to find the optimal set of cut points in addition to the number of bins, which naturally makes the learning problem essentially more difficult. To solve this problem, we regard histogram density estimation as a model selection task, where the cut point sets are considered as models. In this framework, one must first choose a set of candidate cut points, from which the optimal model is searched for. The quality of each cut point set is then measured by some model selection criterion.
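To make the irregular histogram model concrete, the following sketch (our own illustration; the function name irregular_histogram_density is not from the paper) evaluates the maximum likelihood density for a fixed cut point set: each bin k receives the constant value n_k / (n w_k), where n_k is the number of points falling into the bin and w_k is its width.

```python
import numpy as np

def irregular_histogram_density(data, cut_points, x_min=None, x_max=None):
    """Maximum likelihood irregular histogram: density n_k / (n * w_k) per bin.

    `cut_points` are the interior bin borders; the outer borders default to the
    sample minimum and maximum. Illustrative sketch only, not the paper's NML
    selection criterion.
    """
    data = np.sort(np.asarray(data, dtype=float))
    x_min = data[0] if x_min is None else x_min
    x_max = data[-1] if x_max is None else x_max
    edges = np.concatenate(([x_min], np.sort(cut_points), [x_max]))
    counts, _ = np.histogram(data, bins=edges)
    widths = np.diff(edges)
    heights = counts / (len(data) * widths)  # piecewise constant density values

    def density(x):
        # Locate the bin containing x and return its constant height.
        k = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, len(heights) - 1)
        return heights[k]

    return density

# Variable-width bins put resolution where the data are dense.
rng = np.random.default_rng(0)
sample = rng.normal(0.0, 1.0, size=500)
f = irregular_histogram_density(sample, cut_points=[-1.0, -0.3, 0.3, 1.0])
print(f(0.0), f(2.5))
```

The example shows why cut point placement matters: narrow bins in dense regions and wide bins in sparse regions can coexist without inflating the total bin count, which is exactly what regular histograms cannot do.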
Our approach is based on information theory, more specifically on the minimum description length (MDL) principle developed in the series of papers (Rissanen, 1978, 1987, 1996). MDL is a well-founded, general framework for performing model selection and other types of statistical inference. The fundamental idea behind the MDL principle is that any regularity in data can be used to compress the data, i.e., to find a description or code of it that uses fewer symbols than other codes and fewer than it takes to describe the data literally. The more regularities there are, the more the data can be compressed. According to the MDL principle, learning can thus be equated with finding regularities in data. Consequently, we can say that the more we are able to compress the data, the more we have learned about it.

Model selection with MDL is done by minimizing a quantity called the stochastic complexity, which is the shortest description length of a given data set relative to a given model class. The definition of the stochastic complexity is based on the normalized maximum likelihood (NML) distribution introduced in (Shtarkov, 1987; Rissanen, 1996). The NML distribution has several theoretical optimality properties, which make it a very attractive candidate for performing model selection. It was originally (Rissanen, 1996) formulated as the unique solution to the minimax problem presented in (Shtarkov, 1987), which implied that NML is the minimax optimal universal model. Later (Rissanen, 2001), it was shown that NML is also the solution to a related problem involving expected regret. See Section 2 and (Rissanen, 2001; Grünwald, 2006; Rissanen, 2005) for more discussion on the theoretical properties of NML.

On the practical side, NML has been successfully applied to several problems. We mention here two examples. In (Kontkanen, Myllymäki, Buntine, Rissanen, & Tirri, 2006), NML was used for data clustering, and its performance was compared to alternative approaches like Bayesian statistics. The results showed that NML was especially impressive with small sample sizes. In (Roos, Myllymäki, & Tirri, 2005), NML was applied to wavelet denoising of computer images. Since the MDL principle in general can be interpreted as separating information from noise, this approach is very natural.

Unfortunately, in most practical applications of NML one must face severe computational problems, since the definition of NML involves a normalizing integral or sum, called the parametric complexity, which is usually difficult to compute. One contribution of this paper is to show how the parametric complexity can be computed efficiently in the histogram case, which makes it possible to use NML as a model selection criterion in practice.

There is obviously an exponential number of different cut point sets, so a brute-force search is not feasible. Another contribution of this paper is to show how the NML-optimal cut point locations can be found via dynamic programming in polynomial (quadratic) time with respect to the size of the set containing the cut points considered in the optimization process.

Histogram density estimation is naturally a well-studied problem, but unfortunately almost all of the previous studies, e.g. (Birge & Rozenholc, 2002; Hall & Hannan, 1988; Yu & Speed, 1992), consider regular histograms only. Most similar to our work is (Rissanen et al., 1992), in which irregular histograms are learned with the Bayesian mixture criterion using a uniform prior. The same criterion is also used in (Hall & Hannan, 1988), but the histograms are equal-width only. Another similarity between our work and (Rissanen et al., 1992) is the dynamic programming optimization process, but since the optimality criterion is not the same, the process itself is quite different. It should be noted that these differences are significant, as the Bayesian mixture criterion does not possess the optimality properties of NML mentioned above.

This paper is structured as follows. In Section 2 we discuss the basic properties of the MDL framework in general, and also briefly review the optimality properties of the NML distribution. Section 3 introduces the NML histogram density and provides a solution to the related computational problem. The cut point optimization process based on dynamic programming is the topic of Section 4. Finally, in Section 5 our approach is demonstrated via simulation tests.

2 PROPERTIES OF MDL AND NML

The MDL principle has several desirable properties. Firstly, it automatically protects against overfitting when learning both the parameters and the structure (number of parameters) of the model. Secondly, there is no need to assume the existence of some underlying "true" model, which is not the case with several other statistical methods. The model is only used as a technical device for constructing an efficient code. MDL is also closely related to Bayesian inference, but there are some fundamental differences, the most important being that MDL does not depend on any prior distribution; it only uses the data at hand.

MDL model selection is based on minimization of the stochastic complexity. In the following, we give the definition of the stochastic complexity and then proceed by discussing its theoretical properties.

Let $x^n = (x_1, \ldots, x_n)$ be a data sample of $n$ outcomes, where each outcome $x_j$ is an element of some space of observations $\mathcal{X}$. The $n$-fold Cartesian product $\mathcal{X} \times \cdots \times \mathcal{X}$ is denoted by $\mathcal{X}^n$, so that $x^n \in \mathcal{X}^n$. Consider a set $\Theta \subseteq \mathbb{R}^d$, where $d$ is a positive integer. A class of parametric distributions indexed by the elements of $\Theta$ is called a model class. That is, a model class $\mathcal{M}$ is defined as $\mathcal{M} = \{f(\cdot \mid \theta) : \theta \in \Theta\}$. Denote the maximum likelihood estimate of data $x^n$ by $\hat{\theta}(x^n)$, i.e.,

$$\hat{\theta}(x^n) = \arg\max_{\theta \in \Theta} \{f(x^n \mid \theta)\}. \qquad (1)$$

The normalized maximum likelihood (NML) density (Shtarkov, 1987) is now defined as

$$f_{\mathrm{NML}}(x^n \mid \mathcal{M}) = \frac{f(x^n \mid \hat{\theta}(x^n), \mathcal{M})}{\mathcal{R}^n_{\mathcal{M}}}, \qquad (2)$$

where the normalizing constant $\mathcal{R}^n_{\mathcal{M}}$ is given by

$$\mathcal{R}^n_{\mathcal{M}} = \int_{x^n \in \mathcal{X}^n} f(x^n \mid \hat{\theta}(x^n), \mathcal{M}) \, dx^n, \qquad (3)$$

and the range of integration goes over the space of data samples of size $n$. If the data is discrete, the integral is replaced by the corresponding sum.

The stochastic complexity of the data $x^n$ given a model class $\mathcal{M}$ is defined via the NML density as

$$SC(x^n \mid \mathcal{M}) = -\log f_{\mathrm{NML}}(x^n \mid \mathcal{M}) \qquad (4)$$
$$= -\log f(x^n \mid \hat{\theta}(x^n), \mathcal{M}) + \log \mathcal{R}^n_{\mathcal{M}}. \qquad (5)$$
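To make these definitions concrete, the following sketch (our illustration; the function names are ours, not the paper's) computes the stochastic complexity of equations (4)-(5) for a binary sample under the Bernoulli model class, evaluating the normalizing sum of equation (3) by brute force. For this model class the sum over all 2^n sequences reduces to a sum over the n+1 possible maximum likelihood parameter values k/n.

```python
from math import comb, log

def bernoulli_parametric_complexity(n):
    """Parametric complexity R^n (eq. 3, discrete case) for the Bernoulli
    model class: sum over all sequences of their maximized likelihood,
    grouped by the number of ones k."""
    total = 0.0
    for k in range(n + 1):
        p = k / n
        # comb(n, k) sequences share the ML estimate p = k/n.
        total += comb(n, k) * (p ** k) * ((1 - p) ** (n - k))
    return total

def bernoulli_stochastic_complexity(data):
    """Stochastic complexity (eqs. 4-5) of a binary sample, in nats."""
    n, k = len(data), sum(data)
    p_hat = k / n
    max_log_lik = (k * log(p_hat) if k else 0.0) + \
                  ((n - k) * log(1 - p_hat) if n - k else 0.0)
    return -max_log_lik + log(bernoulli_parametric_complexity(n))

print(bernoulli_stochastic_complexity([0, 1, 1, 0, 1, 1, 1, 0, 1, 1]))
```

For model classes with many parameters, such as histograms with several variable-width bins, this kind of direct enumeration is infeasible, which is precisely the computational problem addressed later in the paper.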
3 NML HISTOGRAM DENSITY

Consider a sample of $n$ points $x^n$ on the interval $[x_{\min}, x_{\max}]$, where $x_{\min}$ and $x_{\max}$ are defined as the minimum and maximum value in $x^n$, respectively. Without any loss of generality, we assume that the data is sorted into increasing order. Furthermore, we assume that the data is recorded at a finite accuracy $\epsilon$, which means that each $x_j \in x^n$ belongs to the set $\mathcal{X}$ defined by

$$\mathcal{X} = \left\{ x_{\min} + t\epsilon : t = 0, \ldots, \frac{x_{\max} - x_{\min}}{\epsilon} \right\}. \qquad (8)$$

This assumption is made to simplify the mathematical formulation, and as can be seen later, the effect of the accuracy parameter $\epsilon$ on the stochastic complexity is a constant that can be ignored in the model selection process.
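As a small illustration of the finite-accuracy assumption, the sketch below (our own; the helper name to_precision_grid is hypothetical) sorts a sample and snaps each value to the $\epsilon$-grid of equation (8), which is the form of input assumed in the derivation.

```python
import numpy as np

def to_precision_grid(data, eps):
    """Sort the sample and round each value to the finite-accuracy grid
    X = {x_min + t*eps : t = 0, ..., (x_max - x_min)/eps} of eq. (8).

    Illustrative preprocessing only; the grid is anchored at the sample
    minimum, as in the formulation above.
    """
    data = np.sort(np.asarray(data, dtype=float))
    x_min = data[0]
    t = np.rint((data - x_min) / eps)  # integer grid index of each point
    return x_min + t * eps

sample = [0.314, 0.159, 0.265, 0.358, 0.979]
print(to_precision_grid(sample, eps=0.01))  # values on the 0.01 grid, sorted
```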