Or Gate Bayesian Networks for Text Classification: A Discriminative Alternative Approach to Multinomial Naive Bayes

ESTYLF08, Cuencas Mineras (Mieres - Langreo), 17-19 September 2008

Luis M. de Campos, Juan M. Fernández-Luna, Juan F. Huete, Alfonso E. Romero
Departamento de Ciencias de la Computación e Inteligencia Artificial
E.T.S.I. Informática y de Telecomunicación, Universidad de Granada, 18071 Granada, Spain
{lci,jmfluna,jhg,aeromero}@decsai.ugr.es

Abstract

We propose a simple Bayesian network-based text classifier, which may be considered as a discriminative counterpart of the generative multinomial naive Bayes classifier. The method relies on the use of a fixed network topology with the arcs going from term nodes to class nodes, and also on a network parametrization based on noisy OR gates. Comparative experiments of the proposed method with the naive Bayes and Rocchio algorithms are carried out using three standard document collections.

Keywords: Bayesian network, noisy OR gate, multinomial naive Bayes, text classification

1 Introduction: Probabilistic Methods for Text Classification

The classical approach to probabilistic text classification may be stated as follows: we have a class variable C taking values in the set {c1, c2, ..., cn} and, given a document dj to be classified (described by a set of attribute variables, which usually are the terms appearing in the document), the posterior probability of each class, p(ci|dj), is computed in some way, and the document is assigned to the class having the greatest posterior probability. Learning methods for probabilistic classifiers are often characterized as being generative or discriminative. Generative methods estimate the joint probabilities of all the variables, p(ci, dj), and p(ci|dj) is computed according to the Bayes formula:

    p(c_i|d_j) = \frac{p(c_i, d_j)}{p(d_j)} = \frac{p(c_i)\, p(d_j|c_i)}{p(d_j)} \propto p(c_i)\, p(d_j|c_i).    (1)

The problem in this case is how to estimate the probabilities p(ci) and p(dj|ci). In contrast, discriminative probabilistic classifiers model the posterior probabilities p(ci|dj) directly.

The naive Bayes classifier is the simplest generative probabilistic classification model that, despite its strong and often unrealistic assumptions, frequently performs surprisingly well. It assumes that all the attribute variables are conditionally independent of each other given the class variable. In fact, the naive Bayes classifier can be considered as a Bayesian network-based classifier [1], where the network structure is fixed and contains only arcs from the class variable to the attribute variables, as shown in Figure 1.

[Figure 1: Network structure of the naive Bayes classifier; arcs go from the class node Ci to the term nodes T1, ..., T12.]

In this paper we propose another simple Bayesian network-based classifier, which may be considered as a discriminative counterpart of naive Bayes, in the following senses: (1) it is based on a type of Bayesian network similar to that of naive Bayes, but with the arcs in the network going in the opposite direction; (2) it requires the same set of simple sufficient statistics as naive Bayes, so that the complexity of the training step in both methods is the same, namely linear in the number of attribute variables; the complexity of the classification step is also identical.

The rest of the paper is organized in the following way: in Sections 2 and 3 we describe the two probabilistic text classifiers we are considering, naive Bayes and the proposed new model, respectively. Section 4 focuses on the experimental results. Finally, Section 5 contains the concluding remarks and some proposals for future work.

2 The Multinomial Naive Bayes Classifier

In the context of text classification there exist two different models called naive Bayes: the multivariate Bernoulli naive Bayes model [2, 4, 8] and the multinomial naive Bayes model [5, 6]. In this paper we shall only consider the multinomial model. In this model a document is an ordered sequence of words or terms drawn from the same vocabulary, and the naive Bayes assumption here means that the occurrences of the terms in a document are conditionally independent given the class, and the positions of these terms in the document are also independent given the class (the length of the documents is also assumed to be independent of the class). Thus, each document dj is drawn from a multinomial distribution of words with as many independent trials as the length of dj. Then,

    p(d_j|c_i) = p(|d_j|)\, \frac{|d_j|!}{\prod_{t_k \in d_j} n_{jk}!} \prod_{t_k \in d_j} p(t_k|c_i)^{n_{jk}},    (2)

where the tk are the distinct words in dj, njk is the number of times the word tk appears in the document dj, and |dj| = \sum_{t_k \in d_j} n_{jk} is the number of words in dj. As the factor p(|d_j|)\, |d_j|! / \prod_{t_k \in d_j} n_{jk}! does not depend on the class, we can omit it from the computations, so that we only need to calculate

    p(d_j|c_i) \propto \prod_{t_k \in d_j} p(t_k|c_i)^{n_{jk}}.    (3)

The estimation of the term probabilities given the class, p̂(tk|ci), is usually carried out by means of the Laplace estimation:

    \hat{p}(t_k|c_i) = \frac{N_{ik} + 1}{N_{i\bullet} + M},    (4)

where Nik is the number of times the term tk appears in documents of class ci, Ni• is the total number of words in documents of class ci, i.e. N_{i\bullet} = \sum_k N_{ik}, and M is the size of the vocabulary (the number of distinct words in the documents of the training set). The estimation of the prior probabilities of the classes, p̂(ci), is usually done by maximum likelihood, i.e.:

    \hat{p}(c_i) = \frac{N_{i,doc}}{N_{doc}},    (5)

where Ndoc is the number of documents in the training set and Ni,doc is the number of documents in the training set which are assigned to class ci.
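To make eqs. (3)-(5) concrete, the following Python sketch trains a multinomial naive Bayes classifier from term-count dictionaries and scores a document in log space (the logarithm of eq. (1) combined with eq. (3), which avoids numerical underflow). This is only an illustration, not code from the paper: the data layout, the function names and the Laplace fallback for out-of-vocabulary terms are assumptions made here.

```python
from collections import defaultdict
from math import log

def train_multinomial_nb(docs):
    """docs: list of (term_counts, class_label), term_counts being a dict term -> n_jk.
    Returns class priors (eq. 5) and the counts needed for Laplace estimates (eq. 4)."""
    N_ik = defaultdict(lambda: defaultdict(int))  # N_ik: occurrences of term t_k in class c_i
    N_i = defaultdict(int)                        # N_i.: total number of words in class c_i
    N_i_doc = defaultdict(int)                    # number of training documents of class c_i
    vocab = set()
    for counts, c in docs:
        N_i_doc[c] += 1
        for t, n in counts.items():
            N_ik[c][t] += n
            N_i[c] += n
            vocab.add(t)
    N_doc = len(docs)
    M = len(vocab)
    priors = {c: N_i_doc[c] / N_doc for c in N_i_doc}                      # eq. (5)
    term_probs = {c: {t: (N_ik[c][t] + 1) / (N_i[c] + M) for t in vocab}   # eq. (4)
                  for c in N_i_doc}
    return priors, term_probs, M, N_i

def classify(counts, priors, term_probs, M, N_i):
    """Return the class maximizing log p(c_i) + sum_k n_jk * log p(t_k|c_i) (eqs. 1 and 3)."""
    scores = {}
    for c, prior in priors.items():
        s = log(prior)
        for t, n_jk in counts.items():
            # terms unseen in training still get the Laplace mass 1 / (N_i. + M)
            p_tk = term_probs[c].get(t, 1.0 / (N_i[c] + M))
            s += n_jk * log(p_tk)
        scores[c] = s
    return max(scores, key=scores.get)

# Hypothetical usage:
# docs = [({"ball": 2, "goal": 1}, "sports"), ({"vote": 3, "law": 1}, "politics")]
# priors, term_probs, M, N_i = train_multinomial_nb(docs)
# classify({"goal": 1, "vote": 1}, priors, term_probs, M, N_i)
```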
The multinomial naive Bayes model can also be used in another way: instead of considering only one class variable C having n values, we can decompose the problem using n binary class variables Ci taking their values in the sets {ci, c̄i}. This is a quite common transformation in text classification [9], especially for multilabel problems, where a document may be associated with several classes. In this case n naive Bayes classifiers are built, each one giving a posterior probability pi(ci|dj) for each document. In the case that each document may be assigned to only one class (single-label problems), the class c*(dj) such that c*(dj) = arg max_{ci} pi(ci|dj) is selected. Notice that in this case, as the term pi(dj) in the expression pi(ci|dj) = pi(dj|ci) pi(ci) / pi(dj) is not necessarily the same for all the class values, we need to compute it explicitly through

    p_i(d_j) = p_i(d_j|c_i)\, p_i(c_i) + p_i(d_j|\bar{c}_i)\, (1 - p_i(c_i)).

This means that we also have to compute pi(dj|c̄i). This value is estimated using the corresponding counterparts of eqs. (3) and (4), where

    \hat{p}(t_k|\bar{c}_i) = \frac{N_{\bullet k} - N_{ik} + 1}{N - N_{i\bullet} + M},    (6)

N•k is the number of times that the term tk appears in the training documents, i.e. N_{\bullet k} = \sum_i N_{ik}, and N is the total number of words in the training documents.
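The one-vs-rest decomposition just described can be sketched as follows; it assumes the per-class counts Nik, Ni•, the global counts N•k, N and the vocabulary size M have already been collected as in the previous snippet, and it is again an illustration rather than the authors' implementation. The multinomial coefficient dropped in eq. (3) is identical in the two terms of pi(dj), so it cancels in the normalized posterior.

```python
from math import log, exp

def posterior_binary(counts, N_ik, N_i, N_k, N, M, prior_i):
    """One-vs-rest posterior p_i(c_i|d_j) for a single class c_i (0 < prior_i < 1 assumed).
    counts: dict term -> n_jk for document d_j.
    N_ik:   dict term -> N_ik, term counts inside class c_i; N_i: total words in class c_i.
    N_k:    dict term -> N_.k, term counts over all training docs; N: total words overall."""
    log_pos = log(prior_i)        # accumulates log p_i(c_i) + log p_i(d_j | c_i)
    log_neg = log(1 - prior_i)    # accumulates log(1 - p_i(c_i)) + log p_i(d_j | not c_i)
    for t, n_jk in counts.items():
        p_pos = (N_ik.get(t, 0) + 1) / (N_i + M)                       # eq. (4)
        p_neg = (N_k.get(t, 0) - N_ik.get(t, 0) + 1) / (N - N_i + M)   # eq. (6)
        log_pos += n_jk * log(p_pos)
        log_neg += n_jk * log(p_neg)
    # p_i(c_i|d_j) = e^log_pos / (e^log_pos + e^log_neg), computed in a numerically stable way
    m = max(log_pos, log_neg)
    return exp(log_pos - m) / (exp(log_pos - m) + exp(log_neg - m))
```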
3 The OR Gate Bayesian Network Classifier

The document classification method that we are going to propose is based on another restricted type of Bayesian network with the following topology: each term tk appearing in the training documents (or a subset of these terms in the case of using some method for feature selection) is associated with a binary variable Tk taking its values in the set {tk, t̄k}, which in turn is represented in the network by the corresponding node. There are also n binary variables Ci taking their values in the sets {ci, c̄i} (as in the previous binary version of the naive Bayes model), represented by the corresponding class nodes. The network structure is fixed, having an arc going from each term node Tk to the class node Ci if the term tk appears in training documents of class ci. Let nti be the number of different terms appearing in documents of class ci. In this way we have a network topology with two layers, where the term nodes are the "causes" and the class nodes are the "effects", having a total of \sum_{i=1}^{n} nti arcs. An example of this network topology is displayed in Figure 2. It should be noticed that the proposed topology, with arcs going from attribute nodes to class nodes, is the opposite of the one associated with the naive Bayes model.

[Figure 2: The OR gate classifier; arcs go from the term nodes T1, ..., T12 to the class nodes C1, ..., C5.]

It should also be noticed that this network topology explicitly requires modeling the "discriminative" conditional probabilities p(ci|pa(Ci)), where Pa(Ci) is the set of parents of node Ci in the network (the set of terms appearing in documents of class ci) and pa(Ci) is any configuration of that parent set. These conditional probabilities are parametrized by means of noisy OR gates with weights w(Tk, Ci), so that the posterior probability of each class given a document dj is

    p(c_i|d_j) = 1 - \prod_{T_k \in Pa(C_i)} \bigl(1 - w(T_k, C_i)\, p(t_k|d_j)\bigr) = 1 - \prod_{T_k \in Pa(C_i) \cap d_j} \bigl(1 - w(T_k, C_i)\bigr),    (9)

where the second equality follows because p(tk|dj) is 1 if tk appears in dj and 0 otherwise. In order to take into account the number of times a word tk occurs in a document dj, njk, we can replicate each node Tk njk times, so that the posterior probabilities then become

    p(c_i|d_j) = 1 - \prod_{T_k \in Pa(C_i) \cap d_j} \bigl(1 - w(T_k, C_i)\bigr)^{n_{jk}}.
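A minimal sketch of this classification rule: given the noisy OR weights w(Tk, Ci) (their estimation is not covered in the fragment above, so here they are simply assumed to be available as one dictionary per class), the posterior of class ci is one minus the product of the complements of the weights of the parent terms occurring in the document, each raised to the term frequency njk. Function and variable names are illustrative.

```python
def or_gate_posterior(counts, weights_i):
    """Posterior p(c_i|d_j) for the OR gate classifier, with term-frequency replication.
    counts:    dict term -> n_jk for document d_j.
    weights_i: dict term -> w(T_k, C_i) for the parents of C_i (assumed given).
    Only terms that are both in the document and parents of C_i contribute."""
    prob_no_activation = 1.0
    for t, n_jk in counts.items():
        if t in weights_i:
            prob_no_activation *= (1.0 - weights_i[t]) ** n_jk
    return 1.0 - prob_no_activation

def classify_or_gate(counts, weights):
    """weights: dict class -> (dict term -> w(T_k, C_i)). Returns arg max_i p(c_i|d_j)."""
    return max(weights, key=lambda c: or_gate_posterior(counts, weights[c]))
```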