Learning Bayesian Network Classifiers by Maximizing Conditional Likelihood


Daniel Grossman [email protected]
Pedro Domingos [email protected]
Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195-2350, USA

Appearing in Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004. Copyright 2004 by the authors.

Abstract

Bayesian networks are a powerful probabilistic representation, and their use for classification has received considerable attention. However, they tend to perform poorly when learned in the standard way. This is attributable to a mismatch between the objective function used (likelihood or a function thereof) and the goal of classification (maximizing accuracy or conditional likelihood). Unfortunately, the computational cost of optimizing structure and parameters for conditional likelihood is prohibitive. In this paper we show that a simple approximation, choosing structures by maximizing conditional likelihood while setting parameters by maximum likelihood, yields good results. On a large suite of benchmark datasets, this approach produces better class probability estimates than naive Bayes, TAN, and generatively-trained Bayesian networks.

1. Introduction

The simplicity and surprisingly high accuracy of the naive Bayes classifier have led to its wide use, and to many attempts to extend it (Domingos & Pazzani, 1997). In particular, naive Bayes is a special case of a Bayesian network, and learning the structure and parameters of an unrestricted Bayesian network would appear to be a logical means of improvement. However, Friedman et al. (1997) found that naive Bayes easily outperforms such unrestricted Bayesian network classifiers on a large sample of benchmark datasets. Their explanation was that the scoring functions used in standard Bayesian network learning attempt to optimize the likelihood of the entire data, rather than just the conditional likelihood of the class given the attributes. Such scoring results in suboptimal choices during the search process whenever the two functions favor differing changes to the network. The natural solution would then be to use conditional likelihood as the objective function. Unfortunately, Friedman et al. observed that, while maximum likelihood parameters can be efficiently computed in closed form, this is not true of conditional likelihood. The latter must be optimized using numerical methods, and doing so at each search step would be prohibitively expensive. Friedman et al. thus abandoned this avenue, leaving the investigation of possible heuristic alternatives to it as an important direction for future research. In this paper, we show that the simple heuristic of setting the parameters by maximum likelihood while choosing the structure by conditional likelihood is accurate and efficient.

Friedman et al. chose instead to extend naive Bayes by allowing a slightly less restricted structure (one parent per variable in addition to the class) while still optimizing likelihood. They showed that TAN, the resulting algorithm, was indeed more accurate than naive Bayes on benchmark datasets. We compare our algorithm to TAN and naive Bayes on the same datasets, and show that it outperforms both in the accuracy of class probability estimates, while outperforming naive Bayes and tying TAN in classification error.

If the structure is fixed in advance, computing the maximum conditional likelihood parameters by gradient descent should be computationally feasible, and Greiner and Zhou (2002) have shown with their ELR algorithm that it is indeed beneficial. They leave optimization of the structure as an important direction for future work, and that is what we accomplish in this paper.

Perhaps the most important reason to seek an improved Bayesian network classifier is that, for many applications, high accuracy in class predictions is not enough; accurate estimates of class probabilities are also desirable. For example, we may wish to rank cases by probability of class membership (Provost & Domingos, 2003), or the costs associated with incorrect predictions may be variable and not known precisely at learning time (Provost & Fawcett, 2001). In this case, knowing the class probabilities allows the learner to make optimal decisions at classification time, whatever the misclassification costs (Duda & Hart, 1973). More generally, a classifier is often only one part of a larger decision process, and outputting accurate class probabilities increases its utility to the process.

We begin by reviewing the essentials of learning Bayesian networks. We then present our algorithm, followed by experimental results and their interpretation. The paper concludes with a discussion of related and future work.

2. Bayesian Networks

A Bayesian network (Pearl, 1988) encodes the joint probability distribution of a set of v variables, {x_1, ..., x_v}, as a directed acyclic graph and a set of conditional probability tables (CPTs). (In this paper we assume all variables are discrete, or have been pre-discretized.) Each node corresponds to a variable, and the CPT associated with it contains the probability of each state of the variable given every possible combination of states of its parents. The set of parents of x_i, denoted π_i, is the set of nodes with an arc to x_i in the graph. The structure of the network encodes the assertion that each node is conditionally independent of its non-descendants given its parents. Thus the probability of an arbitrary event X = (x_1, ..., x_v) can be computed as P(X) = ∏_{i=1}^{v} P(x_i | π_i). In general, encoding the joint distribution of a set of v discrete variables requires space exponential in v; Bayesian networks reduce this to space exponential in max_{i∈{1,...,v}} |π_i|.
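As a concrete illustration of this factorization, the sketch below builds a small network over hypothetical weather variables (the names and probabilities are invented for illustration; they do not come from the paper) and evaluates P(X) as a product of CPT lookups:

    # Minimal sketch: joint probability of one event under a Bayesian
    # network, P(X) = prod_i P(x_i | pi_i), with dicts as CPTs.
    parents = {"cloudy": [], "sprinkler": ["cloudy"], "rain": ["cloudy"],
               "wet": ["sprinkler", "rain"]}

    # CPTs: parent-state tuple -> {child state: probability}
    cpt = {
        "cloudy":    {(): {True: 0.5, False: 0.5}},
        "sprinkler": {(True,): {True: 0.1, False: 0.9},
                      (False,): {True: 0.5, False: 0.5}},
        "rain":      {(True,): {True: 0.8, False: 0.2},
                      (False,): {True: 0.2, False: 0.8}},
        "wet":       {(True, True):  {True: 0.99, False: 0.01},
                      (True, False): {True: 0.90, False: 0.10},
                      (False, True): {True: 0.90, False: 0.10},
                      (False, False): {True: 0.0, False: 1.0}},
    }

    def joint_probability(event):
        # P(X) = product over variables of P(x_i | parents(x_i))
        p = 1.0
        for var, pa in parents.items():
            pa_state = tuple(event[q] for q in pa)
            p *= cpt[var][pa_state][event[var]]
        return p

    print(joint_probability({"cloudy": True, "sprinkler": False,
                             "rain": True, "wet": True}))
    # 0.5 * 0.9 * 0.8 * 0.9 = 0.324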
2.1. Learning Bayesian Networks

Given an i.i.d. training set D = {X_1, ..., X_d, ..., X_n}, where X_d = (x_{d,1}, ..., x_{d,v}), the goal of learning is to find the Bayesian network that best represents the joint distribution P(x_{d,1}, ..., x_{d,v}). One approach is to find the network B that maximizes the likelihood of the data or (more conveniently) its logarithm:

    LL(B|D) = Σ_{d=1}^{n} log P_B(X_d) = Σ_{d=1}^{n} Σ_{i=1}^{v} log P_B(x_{d,i} | π_{d,i})   (1)

When the structure of the network is known, this reduces to estimating p_{ijk}, the probability that variable i is in state k given that its parents are in state j, for all i, j, k. When there are no examples with missing values in the training set and we assume parameter independence, the maximum likelihood estimates are simply the observed frequency estimates p̂_{ijk} = n_{ijk}/n_{ij}, where n_{ijk} is the number of occurrences in the training set of the kth state of x_i with the jth state of its parents, and n_{ij} is the sum of n_{ijk} over all k.
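These estimates take a single counting pass over the data; a minimal sketch (the list-of-dicts dataset layout is an assumption of this sketch, not the paper's):

    from collections import Counter

    def ml_cpt(data, var, parent_vars):
        # Maximum-likelihood CPT entries p_ijk = n_ijk / n_ij for one
        # variable, from a complete dataset (no missing values).
        n_ijk, n_ij = Counter(), Counter()
        for row in data:                            # row: dict var -> state
            j = tuple(row[p] for p in parent_vars)  # parent configuration j
            n_ijk[(j, row[var])] += 1               # child state k
            n_ij[j] += 1
        return {jk: c / n_ij[jk[0]] for jk, c in n_ijk.items()}

    rows = [{"sprinkler": 1, "rain": 0, "wet": 1},
            {"sprinkler": 1, "rain": 0, "wet": 0},
            {"sprinkler": 0, "rain": 1, "wet": 1}]
    print(ml_cpt(rows, "wet", ["sprinkler", "rain"]))
    # {((1, 0), 1): 0.5, ((1, 0), 0): 0.5, ((0, 1), 1): 1.0}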
In this paper we assume no missing data throughout, and focus on the problem of learning network structure. Chow and Liu (1968) provide an efficient algorithm for the special case where each variable has only one parent. Solution methods for the general (intractable) case fall into two main classes: independence tests (Spirtes et al., 1993) and search-based methods (Cooper & Herskovits, 1992; Heckerman et al., 1995). The latter are probably the most widely used, and we focus on them in this paper. We assume throughout that hill-climbing search is used; this was found by Heckerman et al. to yield the best combination of accuracy and efficiency. Hill-climbing starts with an initial network, which can be empty, random, or constructed from expert knowledge. At each search step, it creates all legal variations of the current network obtainable by adding, deleting, or reversing any single arc, and scores these variations. The best variation becomes the new current network, and the process repeats until no variation improves the score.

Since on average adding an arc never decreases likelihood on the training data, using the log likelihood as the scoring function can lead to severe overfitting. This problem can be overcome in a number of ways. The simplest one, which is often surprisingly effective, is to limit the number of parents a variable can have. Another alternative is to add a complexity penalty to the log-likelihood. For example, the MDL method (Lam & Bacchus, 1994) minimizes MDL(B|D) = (m/2) log n − LL(B|D), where m is the number of parameters in the network. In both these approaches, the parameters of each candidate network are set by maximum likelihood, as in the known-structure case. Finally, the full Bayesian approach (Cooper & Herskovits, 1992; Heckerman et al., 1995) maximizes the Bayesian Dirichlet (BD) score

    P(B_S, D) = P(B_S) P(D|B_S) = P(B_S) ∏_{i=1}^{v} ∏_{j=1}^{q_i} [Γ(n′_{ij}) / Γ(n_{ij} + n′_{ij})] ∏_{k=1}^{r_i} [Γ(n_{ijk} + n′_{ijk}) / Γ(n′_{ijk})]   (2)

where B_S is the structure of network B, Γ() is the gamma function, q_i is the number of states of the Cartesian product of x_i's parents, and r_i is the number of states of x_i. P(B_S) is the prior probability of the structure, which Heckerman et al. set to an exponentially decreasing function of the number of differing arcs between B_S and the initial (prior) network. Each multinomial distribution for x_i given a state of its parents has an associated Dirichlet prior distribution with parameters n′_{ijk}, with n′_{ij} = Σ_{k=1}^{r_i} n′_{ijk}.

An alternative is to use CLL(B|D) by itself as the objective function. This would be a form of discriminative learning, because it would focus on correctly discriminating between classes. The problem with this approach is that, unlike LL(B|D) (Equation 1), CLL(B|D) = Σ_{d=1}^{n} log [P_B(x_{d,1}, ..., x_{d,v−1}, y_d) / P_B(x_{d,1}, ..., x_{d,v−1})] does not decompose into a separate term for each variable.
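As a sketch of how this quantity can be evaluated in practice (the model_logjoint callable is a hypothetical stand-in for any fitted network's log joint probability; it is not an interface from the paper):

    # Sketch: conditional log-likelihood CLL(B|D) = sum_d log P_B(y_d | x_d),
    # computed as log P_B(x_d, y_d) minus log sum_y' P_B(x_d, y').
    import math

    def cll(model_logjoint, data, class_values):
        total = 0.0
        for x, y in data:                      # x: attribute tuple, y: class
            log_joint = {c: model_logjoint(x, c) for c in class_values}
            m = max(log_joint.values())        # log-sum-exp for stability
            log_evidence = m + math.log(sum(math.exp(v - m)
                                            for v in log_joint.values()))
            total += log_joint[y] - log_evidence
        return total

Under the heuristic the paper proposes, a structure search such as the hill-climbing loop of Section 2.1 would score each candidate structure with this quantity after setting the candidate's parameters by the closed-form maximum-likelihood counts.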
Recommended publications
  • Common Quandaries and Their Practical Solutions in Bayesian Network Modeling
Ecological Modelling 358 (2017) 1–9. Contents lists available at ScienceDirect. Journal homepage: www.elsevier.com/locate/ecolmodel

Common quandaries and their practical solutions in Bayesian network modeling
Bruce G. Marcot, U.S. Forest Service, Pacific Northwest Research Station, 620 S.W. Main Street, Suite 400, Portland, OR, USA

Article history: Received 21 December 2016; Received in revised form 12 May 2017; Accepted 13 May 2017; Available online 24 May 2017.

Abstract: Use and popularity of Bayesian network (BN) modeling has greatly expanded in recent years, but many common problems remain. Here, I summarize key problems in BN model construction and interpretation, along with suggested practical solutions. Problems in BN model construction include parameterizing probability values, variable definition, complex network structures, latent and confounding variables, outlier expert judgments, variable correlation, model peer review, tests of calibration and validation, model overfitting, and modeling wicked problems. Problems in BN model interpretation include objective creep, misconstruing variable influence, conflating correlation with causation, conflating proportion and expectation with probability, and using expert opinion. Solutions are offered for each problem and researchers are urged to innovate and share further solutions. Published by Elsevier B.V.

Keywords: Bayesian networks; Modeling problems; Modeling solutions; Bias; Machine learning; Expert knowledge

1. Introduction: Bayesian network (BN) models are essentially graphs of variables depicted and linked by probabilities (Koski and Noble, 2011). […] other fields. Although the number of generally accessible journal articles on BN modeling has continued to increase in recent years, achieving an exponential growth at least during the period from
  • On Discriminative Bayesian Network Classifiers and Logistic Regression
On Discriminative Bayesian Network Classifiers and Logistic Regression
Teemu Roos and Hannes Wettig, Complex Systems Computation Group, Helsinki Institute for Information Technology, P.O. Box 9800, FI-02015 HUT, Finland
Peter Grünwald, Centrum voor Wiskunde en Informatica, P.O. Box 94079, NL-1090 GB Amsterdam, The Netherlands
Petri Myllymäki and Henry Tirri, Complex Systems Computation Group, Helsinki Institute for Information Technology, P.O. Box 9800, FI-02015 HUT, Finland

Abstract. Discriminative learning of the parameters in the naive Bayes model is known to be equivalent to a logistic regression problem. Here we show that the same fact holds for much more general Bayesian network models, as long as the corresponding network structure satisfies a certain graph-theoretic property. The property holds for naive Bayes but also for more complex structures such as tree-augmented naive Bayes (TAN) as well as for mixed diagnostic-discriminative structures. Our results imply that for networks satisfying our property, the conditional likelihood cannot have local maxima so that the global maximum can be found by simple local optimization methods. We also show that if this property does not hold, then in general the conditional likelihood can have local, non-global maxima. We illustrate our theoretical results by empirical experiments with local optimization in a conditional naive Bayes model. Furthermore, we provide a heuristic strategy for pruning the number of parameters and relevant features in such models. For many data sets, we obtain good results with heavily pruned submodels containing many fewer parameters than the original naive Bayes model.

Keywords: Bayesian classifiers, Bayesian networks, discriminative learning, logistic regression
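The local-optimization claim is easy to exercise: for structures with the stated property, maximizing conditional likelihood reduces to logistic regression, whose log-likelihood is concave, so plain full-batch gradient ascent reaches the global maximum. A minimal sketch on synthetic data (all names and numbers invented for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    w_true = np.array([1.5, -2.0, 0.5])
    y = (rng.random(200) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

    w = np.zeros(3)
    for _ in range(2000):                    # gradient ascent on the CLL
        p = 1 / (1 + np.exp(-X @ w))         # P(y = 1 | x)
        w += 1.0 / len(y) * (X.T @ (y - p))  # gradient of sum_d log P(y_d | x_d)
    print(w)                                 # close to w_true (the global MLE)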
  • Lecture 5: Naive Bayes Classifier 1 Introduction
CSE517A Machine Learning Spring 2020
Lecture 5: Naive Bayes Classifier
Instructor: Marion Neumann. Scribe: Jingyu Xin.
Reading: fcml 5.2.1 (Bayes Classifier, Naive Bayes, Classifying Text, and Smoothing)

Application: Sentiment analysis aims at discovering people's opinions, emotions, feelings about a subject matter, product, or service from text. Take some time to answer the following warm-up questions: (1) Is this a regression or classification problem? (2) What are the features and how would you represent them? (3) What is the prediction task? (4) How well do you think a linear model will perform?

1 Introduction

Thought: Can we model p(y | x) without model assumptions, e.g. Gaussian/Bernoulli distribution etc.? Idea: Estimate p(y | x) from the data directly, then use the Bayes classifier. Let y be discrete,

    p(y | x) = [Σ_{i=1}^{n} I(x_i = x ∧ y_i = y)] / [Σ_{i=1}^{n} I(x_i = x)]   (1)

where the numerator counts all examples with input x that have label y and the denominator counts all examples with input x.

Figure 1: Illustration of estimating p̂(y | x)

From Figure 1, it is clear that, using the MLE method, we can estimate p̂(y | x) as

    p̂(y | x) = |C| / |B|   (2)

But there is a big problem with this method: the MLE estimate is only good if there are many training vectors with the same identical features as x. This never happens for high-dimensional or continuous feature spaces. Solution: Bayes rule.

    p(y | x) = p(x | y) p(y) / p(x)   (3)

Let's estimate p(x | y) and p(y) instead!

2 Naive Bayes

We have a discrete label space C that can either be binary {+1, −1} or multi-class {1, ..., K}.
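The estimate-p(x|y)-and-p(y) recipe, combined with the per-feature independence assumption the lecture introduces next, fits in a few lines; a sketch with invented toy data (the add-one smoothing is an assumption of this sketch, anticipating the smoothing part of the reading):

    from collections import Counter, defaultdict
    import math

    def train_nb(X, y):
        # Count-based estimates of p(y) and p(x_j | y) for discrete features.
        class_counts = Counter(y)
        feat_counts = defaultdict(Counter)   # (j, class) -> value counts
        for xi, c in zip(X, y):
            for j, v in enumerate(xi):
                feat_counts[(j, c)][v] += 1
        return class_counts, feat_counts

    def predict_nb(x, class_counts, feat_counts, alpha=1.0):
        # argmax_c log p(c) + sum_j log p(x_j | c), add-alpha smoothed.
        n = sum(class_counts.values())
        def log_post(c):
            lp = math.log(class_counts[c] / n)
            for j, v in enumerate(x):
                cnt = feat_counts[(j, c)]
                lp += math.log((cnt[v] + alpha) /
                               (class_counts[c] + alpha * (len(cnt) + 1)))
            return lp
        return max(class_counts, key=log_post)

    X = [("good", "fun"), ("bad", "dull"), ("good", "dull"), ("bad", "fun")]
    y = [+1, -1, +1, -1]
    cc, fc = train_nb(X, y)
    print(predict_nb(("good", "fun"), cc, fc))   # +1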
  • Bayesian Network Modelling with Examples
Bayesian Network Modelling with Examples
Marco Scutari [email protected]
Department of Statistics, University of Oxford
November 28, 2016

What Are Bayesian Networks? A Graph and a Probability Distribution. Bayesian networks (BNs) are defined by: a network structure, a directed acyclic graph G = (V, A), in which each node v_i ∈ V corresponds to a random variable X_i; and a global probability distribution X with parameters Θ, which can be factorised into smaller local probability distributions according to the arcs a_ij ∈ A present in the graph. The main role of the network structure is to express the conditional independence relationships among the variables in the model through graphical separation, thus specifying the factorisation of the global distribution:

    P(X) = ∏_{i=1}^{p} P(X_i | Π_{X_i}; Θ_{X_i})   where Π_{X_i} = {parents of X_i}

How the DAG Maps to the Probability Distribution. Formally, the DAG is an independence map of the probability distribution of X, with graphical separation (⫫_G) implying probabilistic independence (⫫_P). [Slide figure: a DAG on nodes A-F relating graphical separation to probabilistic independence.]

Graphical Separation in DAGs (Fundamental Connections). [Slide figure: separation in undirected graphs and the three fundamental A-B-C connections for d-separation in directed acyclic graphs.]

Graphical Separation in DAGs (General Case). Now, in the general case we can extend the patterns from the fundamental connections and apply them to every possible path between A and B for a given C; this is how d-separation is defined.
  • Hybrid of Naive Bayes and Gaussian Naive Bayes for Classification: a Map Reduce Approach
International Journal of Innovative Technology and Exploring Engineering (IJITEE), ISSN: 2278-3075, Volume-8, Issue-6S3, April 2019

Hybrid of Naive Bayes and Gaussian Naive Bayes for Classification: A Map Reduce Approach
Shikha Agarwal, Balmukumd Jha, Tisu Kumar, Manish Kumar, Prabhat Ranjan

Abstract: Naive Bayes classifier is a well known machine learning algorithm which has shown virtues in many fields. In this work big data analysis platforms like Hadoop distributed computing and map reduce programming are used with Naive Bayes and Gaussian Naive Bayes for classification. Naive Bayes is mainly popular for classification of discrete data sets while Gaussian is used to classify data that has continuous attributes. Experimental results show that the hybrid Naive Bayes and Gaussian Naive Bayes MapReduce model shows better performance in terms of classification accuracy on the adult data set, which has many continuous attributes.

Index Terms: Naive Bayes, Gaussian Naive Bayes, Map Reduce, Classification.

I. INTRODUCTION

… model could be used without accepting Bayesian probability or using any Bayesian methods. Despite their naive design and simple assumptions, Naive Bayes classifiers have worked quite well in many complex situations. An analysis of the Bayesian classification problem showed that there are sound theoretical reasons behind the apparently implausible efficacy of these types of classifiers [5]. A comprehensive comparison with other classification algorithms in 2006 showed that Bayes classification is outperformed by other approaches, such as boosted trees or random forests [6]. An advantage of Naive Bayes is that it only requires a small number of training data to estimate the parameters necessary for classification. The fundamental property of Naive Bayes is that it works on discrete values.
  • Locally Differentially Private Naive Bayes Classification
Locally Differentially Private Naive Bayes Classification
EMRE YILMAZ, Case Western Reserve University; MOHAMMAD AL-RUBAIE, University of South Florida; J. MORRIS CHANG, University of South Florida

In machine learning, classification models need to be trained in order to predict class labels. When the training data contains personal information about individuals, collecting training data becomes difficult due to privacy concerns. Local differential privacy is a definition to measure the individual privacy when there is no trusted data curator. Individuals interact with an untrusted data aggregator who obtains statistical information about the population without learning personal data. In order to train a Naive Bayes classifier in an untrusted setting, we propose to use methods satisfying local differential privacy. Individuals send their perturbed inputs that keep the relationship between the feature values and class labels. The data aggregator estimates all probabilities needed by the Naive Bayes classifier. Then, new instances can be classified based on the estimated probabilities. We propose solutions for both discrete and continuous data. In order to eliminate the high amount of noise and decrease communication cost in multi-dimensional data, we propose utilizing dimensionality reduction techniques which can be applied by individuals before perturbing their inputs. Our experimental results show that the accuracy of the Naive Bayes classifier is maintained even when the individual privacy is guaranteed under local differential privacy, and that using dimensionality reduction enhances the accuracy.

CCS Concepts: • Security and privacy → Privacy-preserving protocols; • Computing methodologies → Supervised learning by classification; Dimensionality reduction and manifold learning.

Additional Key Words and Phrases: Local Differential Privacy, Naive Bayes, Classification, Dimensionality Reduction

1 INTRODUCTION

Predictive analytics is the process of making prediction about future events by analyzing the current data using statistical techniques.
  • Improving Feature Selection Algorithms Using Normalised Feature Histograms
Improving feature selection algorithms using normalised feature histograms
A. P. James and A. K. Maan

The proposed feature selection method builds a histogram of the most stable features from random subsets of the training set and ranks the features based on a classifier-based cross validation. This approach reduces the instability of features obtained by conventional feature selection methods that occurs with variation in training data and selection criteria. Classification results on 4 microarray and 3 image datasets using 3 major feature selection criteria and a naive Bayes classifier show considerable improvement over benchmark results.

Introduction: Relevance of features selected by conventional feature selection methods is tightly coupled with the notion of inter-feature redundancy within the patterns [1-3]. This essentially results in design of data-driven methods that are designed for meeting criteria of redundancy using various measures of inter-feature correlations [4-5]. These measures behave differently to different types of data, and to different levels of natural variability. This introduces instability in selection of features with change in the number of samples in the training set and quality of features from one database to another. The use of a large number of statistically distinct training samples can increase the quality of the selected features, although it can be practically unrealistic due to the increase in computational complexity and unavailability of sufficient training samples. As an example, in realistic high dimensional databases such as gene expression databases of a rare cancer, there is a practical limitation on the availability of a large number of training samples. In such cases, the instability that occurs with selecting features from insufficient training data would lead to a wrong conclusion that genes responsible for a cancer are strongly subject dependent and not general for a cancer.
  • Empirical Evaluation of Scoring Functions for Bayesian Network Model Selection
Liu et al. BMC Bioinformatics 2012, 13(Suppl 15):S14. http://www.biomedcentral.com/1471-2105/13/S15/S14
PROCEEDINGS. Open Access.
Empirical evaluation of scoring functions for Bayesian network model selection
Zhifa Liu, Brandon Malone, Changhe Yuan
From Proceedings of the Ninth Annual MCBIOS Conference: Dealing with the Omics Data Deluge. Oxford, MS, USA. 17-18 February 2012

Abstract: In this work, we empirically evaluate the capability of various scoring functions of Bayesian networks for recovering true underlying structures. Similar investigations have been carried out before, but they typically relied on approximate learning algorithms to learn the network structures. The suboptimal structures found by the approximation methods have unknown quality and may affect the reliability of their conclusions. Our study uses an optimal algorithm to learn Bayesian network structures from datasets generated from a set of gold standard Bayesian networks. Because all optimal algorithms always learn equivalent networks, this ensures that only the choice of scoring function affects the learned networks. Another shortcoming of the previous studies stems from their use of random synthetic networks as test cases. There is no guarantee that these networks reflect real-world data. We use real-world data to generate our gold-standard structures, so our experimental design more closely approximates real-world situations. A major finding of our study suggests that, in contrast to results reported by several prior works, the Minimum Description Length (MDL) (or equivalently, Bayesian information criterion (BIC)) consistently outperforms other scoring functions such as Akaike's information criterion (AIC), Bayesian Dirichlet equivalence score (BDeu), and factorized normalized maximum likelihood (fNML) in recovering the underlying Bayesian network structures.
  • Bayesian Networks
Bayesian Networks
Read R&N Ch. 14.1-14.2. Next lecture: Read R&N 18.1-18.4.

You will be expected to know:
• Basic concepts and vocabulary of Bayesian networks.
  – Nodes represent random variables.
  – Directed arcs represent (informally) direct influences.
  – Conditional probability tables, P( Xi | Parents(Xi) ).
• Given a Bayesian network: write down the full joint distribution it represents.
• Given a full joint distribution in factored form: draw the Bayesian network that represents it.
• Given a variable ordering and some background assertions of conditional independence among the variables: write down the factored form of the full joint distribution, as simplified by the conditional independence assertions.

Computing with Probabilities: Law of Total Probability. The law of total probability (aka "summing out" or marginalization): P(a) = Σ_b P(a, b) = Σ_b P(a | b) P(b), where B is any random variable. Why is this useful? Given a joint distribution (e.g., P(a,b,c,d)) we can obtain any "marginal" probability (e.g., P(b)) by summing out the other variables, e.g., P(b) = Σ_a Σ_c Σ_d P(a, b, c, d). Less obvious: we can also compute any conditional probability of interest given a joint distribution, e.g., P(c | b) = Σ_a Σ_d P(a, c, d | b) = (1 / P(b)) Σ_a Σ_d P(a, c, d, b), where (1 / P(b)) is just a normalization constant. Thus, the joint distribution contains the information we need to compute any probability of interest.

Computing with Probabilities: The Chain Rule or Factoring. We can always write P(a, b, c, …, z) = P(a | b, c, ….
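Both identities are mechanical to check on a small joint table; a minimal numpy sketch (random toy numbers, not from the notes):

    # Recover P(b) and P(c | b) from a full joint P(a, b, c), exactly as
    # in the law of total probability above. Toy 2x2x2 table.
    import numpy as np

    rng = np.random.default_rng(1)
    joint = rng.random((2, 2, 2))
    joint /= joint.sum()                             # P(a, b, c), axes (a, b, c)

    p_b = joint.sum(axis=(0, 2))                     # P(b) = sum_a sum_c P(a, b, c)
    p_c_given_b = joint.sum(axis=0) / p_b[:, None]   # P(c | b), normalized by P(b)
    print(p_b.sum(), p_c_given_b.sum(axis=1))        # 1.0 and [1. 1.]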
  • Deep Regression Bayesian Network and Its Applications Siqi Nie, Meng Zheng, Qiang Ji Rensselaer Polytechnic Institute
    1 Deep Regression Bayesian Network and Its Applications Siqi Nie, Meng Zheng, Qiang Ji Rensselaer Polytechnic Institute Abstract Deep directed generative models have attracted much attention recently due to their generative modeling nature and powerful data representation ability. In this paper, we review different structures of deep directed generative models and the learning and inference algorithms associated with the structures. We focus on a specific structure that consists of layers of Bayesian Networks due to the property of capturing inherent and rich dependencies among latent variables. The major difficulty of learning and inference with deep directed models with many latent variables is the intractable inference due to the dependencies among the latent variables and the exponential number of latent variable configurations. Current solutions use variational methods often through an auxiliary network to approximate the posterior probability inference. In contrast, inference can also be performed directly without using any auxiliary network to maximally preserve the dependencies among the latent variables. Specifically, by exploiting the sparse representation with the latent space, max-max instead of max-sum operation can be used to overcome the exponential number of latent configurations. Furthermore, the max-max operation and augmented coordinate ascent are applied to both supervised and unsupervised learning as well as to various inference. Quantitative evaluations on benchmark datasets of different models are given for both data representation and feature learning tasks. I. INTRODUCTION Deep learning has become a major enabling technology for computer vision. By exploiting its multi-level representation and the availability of the big data, deep learning has led to dramatic performance improvements for certain tasks.
  • Naïve Bayes Lecture 17
Naïve Bayes, Lecture 17
David Sontag, New York University
Slides adapted from Luke Zettlemoyer, Carlos Guestrin, Dan Klein, and Mehryar Mohri

Maximum likelihood estimation for a Bernoulli parameter: θ̂ = argmax_θ ln P(D | θ), where ln P(D | θ) = ln[θ^{α_H} (1 − θ)^{α_T}] = α_H ln θ + α_T ln(1 − θ). Setting the derivative to zero,

    d/dθ [α_H ln θ + α_T ln(1 − θ)] = α_H/θ − α_T/(1 − θ) = 0.

Sample complexity: δ ≥ 2e^{−2Nε²} ≥ P(mistake), so ln δ ≥ ln 2 − 2Nε², i.e.

    N ≥ ln(2/δ) / (2ε²),   e.g. N ≥ ln(2/0.05) / (2 × 0.1²) ≈ 3.8/0.02 = 190.

Bayesian learning: use Bayes' rule. Posterior ∝ data likelihood × prior (the denominator is just a normalization constant): P(θ | D) ∝ P(D | θ) P(θ). For uniform priors, P(θ) ∝ 1, and this reduces to maximum likelihood estimation!
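Both calculations are easy to verify numerically; a small sketch (the counts and grid are invented for illustration):

    # Numeric check of the slides' two results: the Bernoulli MLE
    # theta_hat = a_H / (a_H + a_T), and the bound N >= ln(2/delta)/(2 eps^2).
    import numpy as np

    a_H, a_T = 7, 3
    grid = np.linspace(1e-6, 1 - 1e-6, 100001)
    loglik = a_H * np.log(grid) + a_T * np.log(1 - grid)
    print(grid[loglik.argmax()], a_H / (a_H + a_T))   # both ~ 0.7

    delta, eps = 0.05, 0.1
    print(np.log(2 / delta) / (2 * eps ** 2))
    # ~ 184.4; the slides round ln(40) up to 3.8, giving N = 190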
  • A Widely Applicable Bayesian Information Criterion
Journal of Machine Learning Research 14 (2013) 867-897. Submitted 8/12; Revised 2/13; Published 3/13

A Widely Applicable Bayesian Information Criterion
Sumio Watanabe [email protected]
Department of Computational Intelligence and Systems Science, Tokyo Institute of Technology, Mailbox G5-19, 4259 Nagatsuta, Midori-ku, Yokohama, Japan 226-8502
Editor: Manfred Opper

Abstract: A statistical model or a learning machine is called regular if the map taking a parameter to a probability distribution is one-to-one and if its Fisher information matrix is always positive definite. If otherwise, it is called singular. In regular statistical models, the Bayes free energy, which is defined by the minus logarithm of Bayes marginal likelihood, can be asymptotically approximated by the Schwarz Bayes information criterion (BIC), whereas in singular models such approximation does not hold. Recently, it was proved that the Bayes free energy of a singular model is asymptotically given by a generalized formula using a birational invariant, the real log canonical threshold (RLCT), instead of half the number of parameters in BIC. Theoretical values of RLCTs in several statistical models are now being discovered based on algebraic geometrical methodology. However, it has been difficult to estimate the Bayes free energy using only training samples, because an RLCT depends on an unknown true distribution. In the present paper, we define a widely applicable Bayesian information criterion (WBIC) by the average log likelihood function over the posterior distribution with the inverse temperature 1/log n, where n is the number of training samples. We mathematically prove that WBIC has the same asymptotic expansion as the Bayes free energy, even if a statistical model is singular for or unrealizable by a statistical model.
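A toy illustration of the definition (a sketch under my own sign conventions, not code from the paper): for a model simple enough to integrate on a grid, WBIC is the tempered-posterior average of the negative log-likelihood at inverse temperature β = 1/log n, and it tracks the exact Bayes free energy.

    import numpy as np
    from math import lgamma

    def wbic_bernoulli(k, n, grid_size=100001):
        # Tempered-posterior average of the negative log-likelihood,
        # beta = 1/log n, uniform prior on (0, 1); grid approximation.
        theta = np.linspace(1e-6, 1 - 1e-6, grid_size)
        loglik = k * np.log(theta) + (n - k) * np.log(1 - theta)  # log p(D|theta)
        beta = 1.0 / np.log(n)
        w = np.exp(beta * (loglik - loglik.max()))   # posterior^beta weights
        w /= w.sum()
        return float(np.sum(w * -loglik))

    # Exact Bayes free energy for this conjugate model:
    # -log integral theta^k (1-theta)^(n-k) dtheta = -log Beta(k+1, n-k+1).
    exact_F = -(lgamma(31) + lgamma(71) - lgamma(102))
    print(wbic_bernoulli(k=30, n=100), exact_F)      # both ~ 60+, close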