Minimum Description Length Principle

Teemu Roos
Department of Computer Science, Helsinki Institute for Information Technology, University of Helsinki, Helsinki, Finland

© Springer Science+Business Media New York 2016. In: C. Sammut, G.I. Webb (eds.), Encyclopedia of Machine Learning and Data Mining, DOI 10.1007/978-1-4899-7502-7_894-1

Abstract

The minimum description length (MDL) principle states that one should prefer the model that yields the shortest description of the data when the complexity of the model itself is also accounted for. MDL provides a versatile approach to statistical modeling. It is applicable to model selection and regularization. Modern versions of MDL lead to robust methods that are well suited for choosing an appropriate model complexity based on the data, thus extracting the maximum amount of information from the data without over-fitting. The modern versions of MDL go well beyond the familiar $\frac{k}{2}\log n$ formula.

Philosophy

The MDL principle is a formal version of Occam's razor. While Occam's razor only suggests that between hypotheses that are compatible with the evidence, one should choose the simplest one, the MDL principle also quantifies the compatibility of the hypotheses with the evidence. This leads to a trade-off between the complexity of the hypothesis and its compatibility with the evidence ("goodness of fit").

The philosophy of the MDL principle emphasizes that the evaluation of the merits of a model should not be based on its closeness to a "true" model, whose existence is often impossible to verify, but instead on the data. Inspired by Solomonoff's theory of universal induction, Rissanen postulated that a yardstick of the performance of a statistical model is the probability it assigns to the data. Since the probability is intimately related to code length (see below), the code length provides an equivalent way to measure performance. The key idea made possible by the coding interpretation is that the length of the description of the model itself can be quantified in the same units as the code length of the data, namely, bits. Earlier, Wallace and Boulton had made a similar proposal under the title minimum message length (MML) (Wallace and Boulton 1968). A fundamental difference between the two principles is that MML is a Bayesian approach while MDL is not.

The central tenet in MDL is that the better one is able to discover the regular features in the data, the shorter the code length. Showing that this is indeed the case often requires that we assume, for the sake of argument, that the data are generated by a true distribution and verify the statistical behavior of MDL-based methods under this assumption. Hence, the emphasis on the freedom from the assumption of a true model is more pertinent in the philosophy of MDL than in the technical analysis carried out in its theory.

Theory

The theory of MDL addresses two kinds of questions: (i) the first kind asks what is the shortest description achievable using a given model class, i.e., universal data compression; (ii) the second kind asks what can be said about the behavior of MDL methods when applied to model selection and other machine learning and data mining tasks. The latter kind of questions are closely related to the theory of statistical estimation and statistical learning theory. We review the theory related to these two kinds of questions separately.

Universal Data Compression

As is well known in information theory, the shortest expected code length achievable by a uniquely decodable code under a known data source, $p^*$, is given by the entropy of the source, $H(p^*)$. The lower bound is achieved by using a code word of length $\ell^*(x) = -\log p^*(x)$ bits for each source symbol x. (Here and in the following, log denotes the base-2 logarithm.) Correspondingly, a code-length function $\ell$ is optimal under a source distribution defined by $q(x) = 2^{-\ell(x)}$. (For the sake of notational simplicity, we omit a normalizing factor $C = \sum_x 2^{-\ell(x)}$ which is necessary in case the code is not complete. Likewise, as is customary in MDL, we ignore the requirement that code lengths be integers.) These results can be extended to data sequences, whereupon we write $x^n = x_1 \ldots x_n$ to denote a sequence of length n.
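As a small illustration of the correspondence between probabilities and code lengths just described, the following Python sketch computes the idealized code lengths $-\log p(x)$ for a hand-picked four-symbol distribution (an invented example, not taken from the article) and checks that the Kraft sum $\sum_x 2^{-\ell(x)}$ equals 1, so that the omitted normalizing factor $C$ plays no role here.

```python
import math

# A made-up source distribution over a four-symbol alphabet (illustrative only).
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

# Idealized (non-integer) code lengths: l(x) = -log2 p(x), in bits.
code_lengths = {x: -math.log2(px) for x, px in p.items()}

# With this choice of code, the expected code length equals the entropy H(p).
entropy = -sum(px * math.log2(px) for px in p.values())
expected_length = sum(p[x] * code_lengths[x] for x in p)

# Kraft sum: sum_x 2^{-l(x)}; it equals 1 here, so the code is complete.
kraft_sum = sum(2.0 ** -l for l in code_lengths.values())

print("code lengths (bits):", code_lengths)
print("entropy:", entropy, "expected code length:", expected_length)
print("Kraft sum:", kraft_sum)
```

The printed expected code length coincides with the entropy, which is the lower bound mentioned above for a known source.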
While the case where the source distribution $p^*$ is known can be considered solved, in the sense that the average-case optimal code-length function $\ell^*$ is easily established as described above, the case where $p^*$ is unknown is more intricate. Universal data compression studies similar lower bounds when the source distribution is not known or when the goal is not to minimize the expected code length. For example, when the source distribution is only known to be in a given model class (a set of distributions), $\mathcal{M}$, the goal may be to find a code that minimizes the worst-case expected code length under any source distribution $p \in \mathcal{M}$. A uniquely decodable code that achieves near-optimal code lengths with respect to a given model class is said to be universal.

Rissanen's groundbreaking 1978 paper (Rissanen 1978) gives a general construction for universal codes based on two-part codes. A two-part code first includes a code for encoding a distribution, q, over source sequences. The second part encodes the data using a code based on q. The length of the second part is thus $-\log q(x^n)$ bits. The length of the first part, $\ell(q)$, depends on the complexity of the distribution q, which leads to a trade-off between complexity measured by $\ell(q)$ and goodness of fit measured by $-\log q(x^n)$:

$$\min_q \bigl( \ell(q) - \log q(x^n) \bigr) \qquad (1)$$

For parametric models that are defined by a continuous parameter vector $\theta$, a two-part coding approach requires that the parameters be quantized so that their code length is finite. Rissanen showed that given a k-dimensional parametric model class, $\mathcal{M} = \{ p_\theta : \theta \in \Theta \subseteq \mathbb{R}^k \}$, the optimal quantization of the parameter space $\Theta$ is achieved by using accuracy of order $1/\sqrt{n}$ for each coordinate, where n is the sample size. The resulting total code length behaves as $-\log \hat{p}(x^n) + \frac{k}{2}\log n + O(1)$, where $\hat{p}(x^n) = \max\{ p_\theta(x^n) : \theta \in \Theta \}$ is the maximum probability under model class $\mathcal{M}$. Note that the leading terms of the formula are equivalent to the Bayesian information criterion (BIC) by Schwarz (Schwarz 1978). Later, Rissanen also showed that this is a lower bound on the code length of any universal code that holds for all but a measure-zero subset of sources in the given model class (Rissanen 1986).

The above results have subsequently been refined by studying the asymptotic and finite-sample values of the O(1) residual term for specific model classes. The resulting formulas lead to a more accurate characterization of the model complexity, often involving the Fisher information (Rissanen 1996).
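The trade-off in the two-part code length can be made concrete with a toy model selection task. The sketch below is a minimal illustration, not the article's construction: it scores two candidate model classes for a binary sequence with the criterion $-\log \hat{p}(x^n) + \frac{k}{2}\log n$, namely an i.i.d. Bernoulli model ($k = 1$) and a first-order Markov chain ($k = 2$). The data and the restriction to these two classes are assumptions made up for the example.

```python
import math
from collections import Counter

def bernoulli_neg_log2_ml(x):
    """-log2 of the maximized likelihood under an i.i.d. Bernoulli model (k = 1)."""
    n, ones = len(x), sum(x)
    nll = 0.0
    for count in (ones, n - ones):
        if count > 0:
            nll -= count * math.log2(count / n)
    return nll

def markov1_neg_log2_ml(x):
    """-log2 of the maximized likelihood under a first-order Markov chain (k = 2).
    For simplicity the initial symbol is charged with the i.i.d. cost of a length-1 sequence."""
    trans = Counter(zip(x[:-1], x[1:]))
    nll = bernoulli_neg_log2_ml(x[:1])
    for prev in (0, 1):
        total = trans[(prev, 0)] + trans[(prev, 1)]
        if total == 0:
            continue
        for nxt in (0, 1):
            c = trans[(prev, nxt)]
            if c > 0:
                nll -= c * math.log2(c / total)
    return nll

def two_part_code_length(x, neg_log2_ml, k):
    """Two-part code length in bits: -log2 p_hat(x^n) + (k/2) log2 n."""
    return neg_log2_ml(x) + 0.5 * k * math.log2(len(x))

# Invented data with strong first-order structure: alternating 0s and 1s, n = 100.
x = [0, 1] * 50

scores = {
    "Bernoulli (k=1)": two_part_code_length(x, bernoulli_neg_log2_ml, 1),
    "Markov order 1 (k=2)": two_part_code_length(x, markov1_neg_log2_ml, 2),
}
for name, bits in scores.items():
    print(f"{name}: {bits:.2f} bits")
print("selected model class:", min(scores, key=scores.get))
```

On this alternating sequence the Markov chain compresses the data to a few bits despite its larger penalty term, so it is selected; on an i.i.d. sequence the extra parameter would not pay for itself.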
Subsequently, Rissanen and others have proposed other kinds of universal codes that are superior to two-part codes. These include Bayes-type mixture codes that involve a prior distribution for the unknown parameters (Rissanen 1986), predictive forms of MDL (Rissanen 1984; Wei 1992), and, most importantly, normalized maximum likelihood (NML) codes (Shtarkov 1987; Rissanen 1996). The latter have the important point-wise minimax property that they achieve the minimum worst-case point-wise redundancy:

$$\min_q \max_{x^n} \bigl( -\log q(x^n) + \log \hat{p}(x^n) \bigr),$$

where the maximum is over all possible data sequences of length n and the minimum is over all distributions.

Behavior of MDL-Based Learning Methods

The philosophy of MDL suggests that data compression is a measure of the success in discovering regularities in the data, and hence, better compression implies better modeling. Showing that this is indeed the case is the second kind of theory related to MDL.

Barron and Cover proposed the index of resolvability as a measure of the hardness of estimating a probabilistic source in a two-part coding setting (see above) (Barron and Cover 1991). It is defined as

$$R_n(p^*) = \min_q \left\{ \frac{\ell(q)}{n} + D(p^* \,\|\, q) \right\},$$

where $D(p^* \| q)$ denotes the Kullback-Leibler divergence between the source and the distribution q.

For model selection problems, consistency is often defined in relation to a fixed set of alternative model classes and a criterion that selects one of them given the data. If the criterion leads to the simplest model class that contains the true source distribution, the criterion is said to be consistent. (Note that the additional requirement that the selected model class is the simplest one is needed in order to circumvent a trivial solution in nested model classes where simpler models are subsets of more complex model classes.) There are a large number of results showing that various MDL-based model selection criteria are consistent; for examples, see the next section.

Applications

MDL has been applied in a wide range of applications. It is well suited for model selection problems where one needs not only to estimate continuous parameters but also their number and, more generally, the model structure, based on statistical data. Other approaches applicable in many such scenarios include Bayesian methods (including minimum message length), cross validation, and structural risk minimization (see Cross-References below).

Some example applications include the following:

1. Autoregressive models, Markov chains, and their generalizations such as tree machines were among the first model classes studied in the MDL literature; see Rissanen (1978, 1984) and Weinberger et al.
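For the simplest such model class, the Bernoulli family, the NML code length discussed above can be computed exactly by summing the maximized likelihoods over all sequences of a given length; grouping sequences by their number of ones keeps the sum tractable. The sketch below is a minimal illustration under these assumptions (the data are invented), not a general NML implementation.

```python
import math

def max_log2_lik(n, ones):
    """log2 of the maximized Bernoulli likelihood of any sequence with `ones` ones out of n."""
    ll = 0.0
    for count in (ones, n - ones):
        if count > 0:
            ll += count * math.log2(count / n)
    return ll

def bernoulli_nml_code_length(x):
    """NML code length for the Bernoulli model class:
    -log2 p_hat(x^n) + log2 of the sum of p_hat(y^n) over all binary y^n,
    with the normalizer computed by grouping sequences by their number of ones."""
    n, ones = len(x), sum(x)
    normalizer = sum(math.comb(n, j) * 2.0 ** max_log2_lik(n, j) for j in range(n + 1))
    return -max_log2_lik(n, ones) + math.log2(normalizer)

# Invented examples: a maximally regular sequence versus a balanced one (n = 20).
for name, x in [("all ones", [1] * 20), ("alternating", [0, 1] * 10)]:
    print(f"{name}: NML code length = {bernoulli_nml_code_length(x):.2f} bits")
```

The regular all-ones sequence receives a much shorter NML code length than the balanced one, illustrating how the normalized maximum likelihood code rewards regularity that the model class can capture while the normalizer charges for the richness of the class.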