Building Classifiers Using Bayesian Networks

Total Page:16

File Type:pdf, Size:1020Kb

Building Classifiers Using Bayesian Networks From: AAAI-96 Proceedings. Copyright © 1996, AAAI (www.aaai.org). All rights reserved. Building Classifiers using ayesian Networks Nir Friedman Moises Goldszmidt* Stanford University Rockwell Science Center Dept. of Computer Science 444 High St., Suite 400 Gates Building 1A Palo Alto, CA 94301 Stanford, CA 943059010 [email protected] [email protected] Abstract instantiation of Al, . , A,. This computation is rendered feasible by making a strong independence assumption: all Recent work in supervised learning has shown that a surpris- the attributes Ai are conditionally independent given the ingly simple Bayesian classifier with strong assumptions of value of the label C.’ The performance of naive Bayes is independence among features, called naive Bayes, is com- somewhat surprising given that this is clearly an unrealistic petitive with state of the art classifiers such as C4.5. This fact raises the question of whether a classifier with less re- assumption. Consider for example a classifier for assessing strictive assumptions can perform even better. In this paper the risk in loan applications. It would be erroneous to ignore we examine and evaluate approaches for inducing classifiers the correlations between age, education level, and income. from data, based on recent results in the theory of learn- This fact naturally begs the question of whether we can ing Bayesian networks. Bayesian networks are factored improve the performance of Bayesian classifiers by avoid- representations of probability distributions that generalize ing unrealistic assumptions about independence. In order the naive Bayes classifier and explicitly represent statements to effectively tackle this problem we need an appropriate about independence. Among these approaches we single out language and effective machinery to represent and manip- a method we call Tree AugmentedNaive Bayes (TAN), which ulate independences. Bayesian networks (Pearl 1988) pro- outperforms naive Bayes, yet at the same time maintains the computational simplicity (no search involved) and robustness vide both. Bayesian networks are directed acyclic graphs which are characteristic of naive Bayes. We experimentally that allow for efficient and effective representation of the tested these approaches using benchmark problems from the joint probability distributions over a set of random vari- U. C. Irvine repository, and compared them against C4.5, ables. Each vertex in the graph represents a random vari- naive Bayes, and wrapper-based feature selection methods. able, and edges represent direct correlations between the variables. More precisely, the network encodes the follow- ing statements about each random variable: each variable 1 Introduction is independent of its non-descendants in the graph given A somewhat simplified statement of the problem of super- the state of its parents. Other independences follow from vised learning is as follows. Given a training set of labeled these ones. These can be efficiently read from the net- instances of the form (al, . , a,), c, construct a classi- work structure by means of a simple graph-theoretic crite- 7; f capable of predicting the value of c, given instances ria. Independences are then exploited to reduce the number al,-*-, a,) as Input. The variables Ai, . , A, are called of parameters needed to characterize a probability distribu- features or attributes, and the variable C is usually referred tion, and to efficiently compute posterior probabilities given to as the class variable or label. This is a basic problem evidence. Probabilistic parameters are encoded in a set of in many applications of machine learning, and there are nu- tables, one for each variable, in the form of local conditional merous approaches to solve it based on various functional distributions of a variable given its parents. Given the in- representations such as decision trees, decision lists, neural dependences encoded in the network, the joint distribution networks, decision-graphs, rules, and many others. One of can be reconstructed by simply multiplying these tables. the most effective classifiers, in the sense that its predic- When represented as a Bayesian network, a naive tive performance is competitive with state of the art clas- Bayesian classifier has the simple structure depicted in Fig- sifiers such as C4.5 (Quinlan 1993), is the so-called naive ure 1. This network captures the main assumption behind Bayesian classifier (or simply naive Bayes) (Langley, Iba, the naive Bayes classifier, namely that every attribute (ev- & Thompson 1992). This classifier learns the conditional ery leaf in the network) is independent from the rest of the probability of each attribute Ai given the label C in the attributes given the state of the class variable (the root in training data. Classification is then done by applying Bayes the network). Now that we have the means to represent and rule to compute the probability of C given the particular ‘By independence we mean probabilistic independence, that is *Current address: SRI International, 333 Ravenswood Way, A is independentof B given C wheneverPr(AIB, C) = Pr(AIC) Menlo Park, CA 94025, [email protected] for all possible values of A, B and C. Bayesian Networks 1277 Geiger (1992), is a consequence of a well-known result by CC-7 Chow and Liu (1968) (see also (Pearl 1988)). This ap- proach, which we call Tree Augmented Naive Bayes (TAN), / approximates the interactions between attributes using a tree-structure imposed on the naive Bayes structure. We J”\, note that while this method has been proposed in the liter- 0 0 00 0 ature, it has never been rigorously tested in practice. We d show that TAN maintains the robustness and computational Al A2 A, complexity (the search process is bounded) of naive Bayes, and at the same time displays better performance in our experiments. We compare this approach with C4.5, naive Figure 1: The structure of the naive Bayes network. Bayes, and selective naive Bayes, a wrapper-based feature subset selection method combined with naive Bayes, on a set of benchmark problems from the U. C. Irvine repository manipulate independences, the obvious question follows: (see Section 4). This experiments show that TAN leads to can we learn an unrestricted Bayesian network from the significant improvement over all of these three approaches. data that when used as a classifier maximizes the prediction rate? 2 Learning Bayesian Networks Learning Bayesian networks from data is a rapidly grow- Consider a finite set U = {Xi, . , X,} of discrete random ing field of research that has seen a great deal of activ- variables where each variable Xi may take on values from ity in recent years, see for example (Heckerman 1995; a finite domain. We use capital letters, such as X, Y, 2, Heckerman, Geiger, & Chickering 1995; Lam & Bacchus for variable names and lowercase letters 2, y, z to denote 1994). This is a form unsupervised learning in the sense that specific values taken by those variables. The set of val- the learner is not guided by a set of informative examples. ues X can attain is denoted as KzZ(X), the cardinality of The objective is to induce a network (or a set of networks) this set is denoted as 11X 1I = I kZ(X) I. Sets of variables that “best describes” the probability distribution over the are denoted by boldface capital letters X, U, Z, and assign- training data. This optimization process is implemented in ments of values to the variables in these sets will be denoted practice using heuristic search techniques to find the best by boldface lowercase letters x, y, z (we use VaZ(X) and candidate over the space of possible networks. The search 11x1I in the obvious way). Finally, let P be a joint proba- process relies on a scoring metric that asses the merits of bility distribution over the variables in U, and let X, ‘!I?,Z each candidate network. be subsets of U. X and U are conditionally independent We start by examining a straightforward application of given Z if for all x E l&Z(X), y E KzZ(‘lr’),2 E KzZ(Z), current Bayesian networks techniques. We learn networks P(x I z, y) = P(x I z) whenever P(y, z) > 0. using the MDL metric (Lam & Bacchus 1994) and use them A Bayesian network is an annotated directed acyclic for classification. The results, which are analyzed in Sec- graph that encodes a joint probability distribution of a do- tion 3, are mixed: although the learned networks perform main composed of a set of random variables. Formally, significantly better than naive Bayes on some datasets, they a Bayesian network for U is the pair B = (G, 0). G perform worse on others. We trace the reasons for these re- is a directed acyclic graph whose nodes correspond to the sults to the definition of the MDL scoring metric. Roughly random variables X1, . , X,, and whose edges represent speaking, the problem is that the MDL scoring metric mea- direct dependencies between the variables. The graph struc- sures the “error” of the learned Bayesian network over all ture G encodes the following set of independence assump- the variables in the domain. Minimizing this error, however, tions: each node Xi is independent of its non-descendants does not necessarily minimize the local “error” in predict- given its parents in G (Pearl 1988).2 The second compo- ing the class variable given the attributes. We argue that nent of the pair, namely 0, represents the set of param- similar problems will occur with other scoring metrics in eters that quantifies the network. It contains a parameter the literature. e z* I&* = P(QI&*) for each possible value zi of Xi, and In light of these results we limit our attention to a class of III,, of IIx, (the set of parents of Xi in G). B defines a network structures that are based on the structure of naive unique joint probability distribution over U given by: Bayes, requiring that the class variable be a parent of ev- n n ery attribute.
Recommended publications
  • Linear Models for Classification
    CS480 Introduction to Machine Learning Linear Models for Classification Edith Law Based on slides from Joelle Pineau Probabilistic Framework for Classification As we saw in the last lecture, learning can be posed a problem of statistical inference. The premise is that for any learning problem, if we have access to the underlying probability distribution D, then we could form a Bayes optimal classifier as: f (BO)(x)̂ = arg max D(x,̂ y)̂ y∈̂ Y But since we don't have access to D, we can instead estimate this distribution from data. The probabilistic framework allows you to choose a model to represent your data, and develop algorithms that have inductive biases that are closer to what you, as a designer, believe. 2 Classification by Density Estimation The most direct way to construct such a probability distribution is to select a family of parametric distributions. For example, we can select a Gaussian (or Normal) distribution. Gaussian is parametric: its parameters are its mean and variance. The job of learning is to infer which parameters are “best” in terms of describing the observed training data. In density estimation, we assume that the sample data is drawn independently from the same distribution D. That is, your nth sample (xn, yn) from D does not depend on the previous n-1 samples. This is called the i.i.d. (independently and identically distributed) assumption. 3 Consider the Problem of Flipping a Biased Coin You are flipping a coin, and you want to find out if it is biased. This coin has some unknown probability of β of coming up Head (H), and some probability ( 1 − β ) of coming up Tail (T).
    [Show full text]
  • On Discriminative Bayesian Network Classifiers and Logistic Regression
    On Discriminative Bayesian Network Classifiers and Logistic Regression Teemu Roos and Hannes Wettig Complex Systems Computation Group, Helsinki Institute for Information Technology, P.O. Box 9800, FI-02015 HUT, Finland Peter Gr¨unwald Centrum voor Wiskunde en Informatica, P.O. Box 94079, NL-1090 GB Amsterdam, The Netherlands Petri Myllym¨aki and Henry Tirri Complex Systems Computation Group, Helsinki Institute for Information Technology, P.O. Box 9800, FI-02015 HUT, Finland Abstract. Discriminative learning of the parameters in the naive Bayes model is known to be equivalent to a logistic regression problem. Here we show that the same fact holds for much more general Bayesian network models, as long as the corresponding network structure satisfies a certain graph-theoretic property. The property holds for naive Bayes but also for more complex structures such as tree-augmented naive Bayes (TAN) as well as for mixed diagnostic-discriminative structures. Our results imply that for networks satisfying our property, the condi- tional likelihood cannot have local maxima so that the global maximum can be found by simple local optimization methods. We also show that if this property does not hold, then in general the conditional likelihood can have local, non-global maxima. We illustrate our theoretical results by empirical experiments with local optimization in a conditional naive Bayes model. Furthermore, we provide a heuristic strategy for pruning the number of parameters and relevant features in such models. For many data sets, we obtain good results with heavily pruned submodels containing many fewer parameters than the original naive Bayes model. Keywords: Bayesian classifiers, Bayesian networks, discriminative learning, logistic regression 1.
    [Show full text]
  • Lecture 5: Naive Bayes Classifier 1 Introduction
    CSE517A Machine Learning Spring 2020 Lecture 5: Naive Bayes Classifier Instructor: Marion Neumann Scribe: Jingyu Xin Reading: fcml 5.2.1 (Bayes Classifier, Naive Bayes, Classifying Text, and Smoothing) Application Sentiment analysis aims at discovering people's opinions, emo- tions, feelings about a subject matter, product, or service from text. Take some time to answer the following warm-up questions: (1) Is this a regression or classification problem? (2) What are the features and how would you represent them? (3) What is the prediction task? (4) How well do you think a linear model will perform? 1 Introduction Thought: Can we model p(y j x) without model assumptions, e.g. Gaussian/Bernoulli distribution etc.? Idea: Estimate p(y j x) from the data directly, then use the Bayes classifier. Let y be discrete, Pn i=1 I(xi = x \ yi = y) p(y j x) = Pn (1) i=1 I(xi = x) where the numerator counts all examples with input x that have label y and the denominator counts all examples with input x. Figure 1: Illustration of estimatingp ^(y j x) From Figure ??, it is clear that, using the MLE method, we can estimatep ^(y j x) as jCj p^(y j x) = (2) jBj But there is a big problem with this method: The MLE estimate is only good if there are many training vectors with the same identical features as x. 1 2 This never happens for high-dimensional or continuous feature spaces. Solution: Bayes Rule. p(x j y) p(y) p(y j x) = (3) p(x) Let's estimate p(x j y) and p(y) instead! 2 Naive Bayes We have a discrete label space C that can either be binary f+1; −1g or multi-class f1; :::; Kg.
    [Show full text]
  • Hybrid of Naive Bayes and Gaussian Naive Bayes for Classification: a Map Reduce Approach
    International Journal of Innovative Technology and Exploring Engineering (IJITEE) ISSN: 2278-3075, Volume-8, Issue-6S3, April 2019 Hybrid of Naive Bayes and Gaussian Naive Bayes for Classification: A Map Reduce Approach Shikha Agarwal, Balmukumd Jha, Tisu Kumar, Manish Kumar, Prabhat Ranjan Abstract: Naive Bayes classifier is well known machine model could be used without accepting Bayesian probability learning algorithm which has shown virtues in many fields. In or using any Bayesian methods. Despite their naive design this work big data analysis platforms like Hadoop distributed and simple assumptions, Naive Bayes classifiers have computing and map reduce programming is used with Naive worked quite well in many complex situations. An analysis Bayes and Gaussian Naive Bayes for classification. Naive Bayes is manily popular for classification of discrete data sets while of the Bayesian classification problem showed that there are Gaussian is used to classify data that has continuous attributes. sound theoretical reasons behind the apparently implausible Experimental results show that Hybrid of Naive Bayes and efficacy of types of classifiers[5]. A comprehensive Gaussian Naive Bayes MapReduce model shows the better comparison with other classification algorithms in 2006 performance in terms of classification accuracy on adult data set showed that Bayes classification is outperformed by other which has many continuous attributes. approaches, such as boosted trees or random forests [6]. An Index Terms: Naive Bayes, Gaussian Naive Bayes, Map Reduce, Classification. advantage of Naive Bayes is that it only requires a small number of training data to estimate the parameters necessary I. INTRODUCTION for classification. The fundamental property of Naive Bayes is that it works on discrete value.
    [Show full text]
  • Locally Differentially Private Naive Bayes Classification
    Locally Differentially Private Naive Bayes Classification EMRE YILMAZ, Case Western Reserve University MOHAMMAD AL-RUBAIE, University of South Florida J. MORRIS CHANG, University of South Florida In machine learning, classification models need to be trained in order to predict class labels. When the training data contains personal information about individuals, collecting training data becomes difficult due to privacy concerns. Local differential privacy isa definition to measure the individual privacy when there is no trusted data curator. Individuals interact with an untrusteddata aggregator who obtains statistical information about the population without learning personal data. In order to train a Naive Bayes classifier in an untrusted setting, we propose to use methods satisfying local differential privacy. Individuals send theirperturbed inputs that keep the relationship between the feature values and class labels. The data aggregator estimates all probabilities needed by the Naive Bayes classifier. Then, new instances can be classified based on the estimated probabilities. We propose solutions forboth discrete and continuous data. In order to eliminate high amount of noise and decrease communication cost in multi-dimensional data, we propose utilizing dimensionality reduction techniques which can be applied by individuals before perturbing their inputs. Our experimental results show that the accuracy of the Naive Bayes classifier is maintained even when the individual privacy is guaranteed under local differential privacy, and that using dimensionality reduction enhances the accuracy. CCS Concepts: • Security and privacy → Privacy-preserving protocols; • Computing methodologies → Supervised learning by classification; Dimensionality reduction and manifold learning. Additional Key Words and Phrases: Local Differential Privacy, Naive Bayes, Classification, Dimensionality Reduction 1 INTRODUCTION Predictive analytics is the process of making prediction about future events by analyzing the current data using statistical techniques.
    [Show full text]
  • Improving Feature Selection Algorithms Using Normalised Feature Histograms
    Page 1 of 10 Improving feature selection algorithms using normalised feature histograms A. P. James and A. K. Maan The proposed feature selection method builds a histogram of the most stable features from random subsets of training set and ranks the features based on a classifier based cross validation. This approach reduces the instability of features obtained by conventional feature selection methods that occur with variation in training data and selection criteria. Classification results on 4 microarray and 3 image datasets using 3 major feature selection criteria and a naive bayes classifier show considerable improvement over benchmark results. Introduction : Relevance of features selected by conventional feature selection method is tightly coupled with the notion of inter-feature redundancy within the patterns [1-3]. This essentially results in design of data-driven methods that are designed for meeting criteria of redundancy using various measures of inter-feature correlations [4-5]. These measures behave differently to different types of data, and to different levels of natural variability. This introduces instability in selection of features with change in number of samples in the training set and quality of features from one database to another. The use of large number of statistically distinct training samples can increase the quality of the selected features, although it can be practically unrealistic due to increase in computational complexity and unavailability of sufficient training samples. As an example, in realistic high dimensional databases such as gene expression databases of a rare cancer, there is a practical limitation on availability of large number of training samples. In such cases, instability that occurs with selecting features from insufficient training data would lead to a wrong conclusion that genes responsible for a cancer be strongly subject dependent and not general for a cancer.
    [Show full text]
  • Naïve Bayes Lecture 17
    Naïve Bayes Lecture 17 David Sontag New York University Slides adapted from Luke Zettlemoyer, Carlos Guestrin, Dan Klein, and Mehryar Mohri Brief ArticleBrief Article The Author The Author January 11, 2012January 11, 2012 θˆ =argmax ln P ( θ) θˆ =argmax ln P ( θ) θ D| θ D| ln θαH ln θαH d d d ln P ( θ)= [ln θαH (1 θ)αT ] = [α ln θ + α ln(1 θ)] d d dθ D| dθ d − dθ H T − ln P ( θ)= [ln θαH (1 θ)αT ] = [α ln θ + α ln(1 θ)] dθ D| dθ − dθ H T − d d αH αT = αH ln θ + αT ln(1 θ)= =0 d d dθ αH dθ αT− θ − 1 θ = αH ln θ + αT ln(1 θ)= =0 − dθ dθ − θ −2N12 θ δ 2e− − P (mistake) ≥ ≥ 2N2 δ 2e− P (mistake) ≥ ≥ ln δ ln 2 2N2 Bayesian Learning ≥ − Prior • Use Bayes’ rule! 2 Data Likelihood ln δ ln 2 2N ln(2/δ) ≥ − N ≥ 22 Posterior ln(2/δ) N ln(2/0.05) 3.8 ≥ 22N Normalization= 190 ≥ 2 0.12 ≈ 0.02 × ln(2• /Or0. equivalently:05) 3.8 N • For uniform priors, this= 190 reducesP (θ) to 1 ≥ 2 0.12 ≈ 0.02 ∝ ×maximum likelihood estimation! P (θ) 1 P (θ ) P ( θ) ∝ |D ∝ D| 1 1 Brief ArticleBrief Article Brief Article Brief Article The Author The Author The Author January 11, 2012 JanuaryThe Author 11, 2012 January 11, 2012 January 11, 2012 θˆ =argmax ln Pθˆ( =argmaxθ) ln P ( θ) θ D| θ D| ˆ =argmax ln (θˆ =argmax) ln P ( θ) θ P θ θ θ D| αH D| ln θαH ln θ αH ln θαH ln θ d d α α d d d lnαP ( θ)=α [lndθ H (1 θ) T ] = [α ln θ + α ln(1 θ)] ln P ( θ)= [lndθ H (1 θ) T ]d= [α ln θ + α ln(1d Hθ)] T d d dθ D| dθ d αH H − αT T dθ − dθ D| dθ αlnH P ( − θα)=T [lndθθ (1 θ) ] = [α−H ln θ + αT ln(1 θ)] ln P ( θ)= [lndθθ (1 D|θ) ] =dθ [αH ln θ−+ αT ln(1dθ θ)] − dθ D| dθ − d dθ d −α
    [Show full text]
  • Last Time Today Bayesian Learning Naïve Bayes the Distributions We
    Last Time CSE 446 • Learning Gaussians Gaussian Naïve Bayes & • Naïve Bayes Logistic Regression Winter 2012 Today • Gaussians Naïve Bayes Dan Weld • Logistic Regression Some slides from Carlos Guestrin, Luke Zettlemoyer 2 Text Classification Bayesian Learning Bag of Words Representation Prior aardvark 0 Use Bayes rule: Data Likelihood about 2 all 2 Africa 1 Posterior P(Y | X) = P(X |Y) P(Y) apple 0 P(X) anxious 0 ... Normalization gas 1 ... oil 1 Or equivalently: P(Y | X) P(X | Y) P(Y) … Zaire 0 Naïve Bayes The Distributions We Love Discrete Continuous • Naïve Bayes assumption: – Features are independent given class: Binary {0, 1} k Values Single Bernouilli Event – More generally: Sequence Binomial Multinomial (N trials) N= H+T • How many parameters now? • Suppose X is composed of n binary features Conjugate Beta Dirichlet Prior 6 1 NB with Bag of Words for Text Classification Easy to Implement • Learning phase: – Prior P(Ym) • But… • Count how many documents from topic m / total # docs – P(Xi|Ym) • Let Bm be a bag of words formed from all the docs in topic m • Let #(i, B) be the number of times word iis in bag B • If you do… it probably won’t work… • P(Xi | Ym) = (#(i, Bm)+1) / (W+j#(j, Bm)) where W=#unique words • Test phase: – For each document • Use naïve Bayes decision rule 8 Probabilities: Important Detail! Naïve Bayes Posterior Probabilities • Classification results of naïve Bayes • P(spam | X … X ) = P(spam | X ) 1 n i i – I.e. the class with maximum posterior probability… Any more potential problems here? – Usually fairly accurate (?!?!?) • However, due to the inadequacy of the .
    [Show full text]
  • Confidence Intervals for Probabilistic Network Classifiers
    Computational Statistics & Data Analysis 49 (2005) 998–1019 www.elsevier.com/locate/csda Confidence intervals for probabilistic network classifiersଁ M. Egmont-Petersena,∗, A. Feeldersa, B. Baesensb aUtrecht University, Institute of Information and Computing Sciences, P. O. Box 80.089, 3508, TB Utrecht, The Netherlands bUniversity of Southampton, School of Management, UK Received 12 July 2003; received in revised form 28 June 2004 Available online 25 July 2004 Abstract Probabilistic networks (Bayesian networks) are suited as statistical pattern classifiers when the feature variables are discrete. It is argued that their white-box character makes them transparent, a requirement in various applications such as, e.g., credit scoring. In addition, the exact error rate of a probabilistic network classifier can be computed without a dataset. First, the exact error rate for probabilistic network classifiers is specified. Secondly, the exact sampling distribution for the conditional probability estimates in a probabilistic network classifier is derived. Each conditional probability is distributed according to the bivariate binomial distribution. Subsequently, an approach for computing the sampling distribution and hence confidence intervals for the posterior probability in a probabilistic network classifier is derived. Our approach results in parametric bootstrap confidence intervals. Experiments with general probabilistic network classifiers, the Naive Bayes classifier and tree augmented Naive Bayes classifiers (TANs) show that our approximation performs well. Also simulations performed with the Alarm network show good results for large training sets. The amount of computation required is exponential in the number of feature variables. For medium and large-scale classification problems, our approach is well suited for quick simulations.A running example from the domain of credit scoring illustrates how to actually compute the sampling distribution of the posterior probability.
    [Show full text]
  • Generalized Naive Bayes Classifiers
    Generalized Naive Bayes Classifiers Kim Larsen Concord, California [email protected] ABSTRACT This paper presents a generalization of the Naive Bayes Clas- sifier. The method is called the Generalized Naive Bayes This paper presents a generalization of the Naive Bayes Clas- Classifier (GNBC) and extends the NBC by relaxing the sifier. The method is specifically designed for binary clas- assumption of conditional independence between the pre- sification problems commonly found in credit scoring and dictors. This reduces the bias in the estimated class prob- marketing applications. The Generalized Naive Bayes Clas- abilities and improves the fit, while greatly enhancing the sifier turns out to be a powerful tool for both exploratory descriptive value of the model. The generalization is done in and predictive analysis. It can generate accurate predictions a fully non-parametric fashion without putting any restric- through a flexible, non-parametric fitting procedure, while tions on the relationship between the independent variables being able to uncover hidden patterns in the data. In this and the dependent variable. This makes the GNBC much paper, the Generalized Naive Bayes Classifier and the orig- more flexible than traditional linear and non-linear regres- inal Bayes Classifier will be demonstrated. Also, important sion, and allows it to uncover hidden patterns in the data. ties to logistic regression, the Generalized Additive Model Moreover, the GNBC retains the additive structure of the (GAM), and Weight Of Evidence will be discussed. NBC which means that it does not suffer from the prob- lems of dimensionality that are common with other non- 1. INTRODUCTION parametric techniques.
    [Show full text]
  • Bayesian Learning
    SC4/SM8 Advanced Topics in Statistical Machine Learning Bayesian Learning Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/atsml/ Department of Statistics, Oxford SC4/SM8 ATSML, HT2018 1 / 14 Bayesian Learning Review of Bayesian Inference The Bayesian Learning Framework Bayesian learning: treat parameter vector θ as a random variable: process of learning is then computation of the posterior distribution p(θjD). In addition to the likelihood p(Djθ) need to specify a prior distribution p(θ). Posterior distribution is then given by the Bayes Theorem: p(Djθ)p(θ) p(θjD) = p(D) Likelihood: p(Djθ) Posterior: p(θjD) Prior: p(θ) Marginal likelihood: p(D) = Θ p(Djθ)p(θ)dθ ´ Summarizing the posterior: MAP Posterior mode: θb = argmaxθ2Θ p(θjD) (maximum a posteriori). mean Posterior mean: θb = E [θjD]. Posterior variance: Var[θjD]. Department of Statistics, Oxford SC4/SM8 ATSML, HT2018 2 / 14 Bayesian Learning Review of Bayesian Inference Bayesian Inference on the Categorical Distribution Suppose we observe the with yi 2 f1;:::; Kg, and model them as i.i.d. with pmf π = (π1; : : : ; πK): n K Y Y nk p(Djπ) = πyi = πk i=1 k=1 Pn PK with nk = i=1 1(yi = k) and πk > 0, k=1 πk = 1. The conjugate prior on π is the Dirichlet distribution Dir(α1; : : : ; αK) with parameters αk > 0, and density PK K Γ( αk) Y p(π) = k=1 παk−1 QK k k=1 Γ(αk) k=1 PK on the probability simplex fπ : πk > 0; k=1 πk = 1g.
    [Show full text]
  • Bayesian Machine Learning
    Advanced Topics in Machine Learning: Bayesian Machine Learning Tom Rainforth Department of Computer Science Hilary 2020 Contents 1 Introduction 1 1.1 A Note on Advanced Sections . .3 2 A Brief Introduction to Probability 4 2.1 Random Variables, Outcomes, and Events . .4 2.2 Probabilities . .4 2.3 Conditioning and Independence . .5 2.4 The Laws of Probability . .6 2.5 Probability Densities . .7 2.6 Expectations and Variances . .8 2.7 Measures [Advanced Topic] . 10 2.8 Change of Variables . 12 3 Machine Learning Paradigms 13 3.1 Learning From Data . 13 3.2 Discriminative vs Generative Machine Learning . 16 3.3 The Bayesian Paradigm . 20 3.4 Bayesianism vs Frequentism [Advanced Topic] . 23 3.5 Further Reading . 31 4 Bayesian Modeling 32 4.1 A Fundamental Assumption . 32 4.2 The Bernstein-Von Mises Theorem . 34 4.3 Graphical Models . 35 4.4 Example Bayesian Models . 38 4.5 Nonparametric Bayesian Models . 42 4.6 Gaussian Processes . 43 4.7 Further Reading . 52 5 Probabilistic Programming 53 5.1 Inverting Simulators . 54 5.2 Differing Approaches . 58 5.3 Bayesian Models as Program Code [Advanced Topic] . 62 5.4 Further Reading . 73 Contents iii 6 Foundations of Bayesian Inference and Monte Carlo Methods 74 6.1 The Challenge of Bayesian Inference . 74 6.2 Deterministic Approximations . 77 6.3 Monte Carlo . 79 6.4 Foundational Monte Carlo Inference Methods . 83 6.5 Further Reading . 93 7 Advanced Inference Methods 94 7.1 The Curse of Dimensionality . 94 7.2 Markov Chain Monte Carlo . 97 7.3 Variational Inference .
    [Show full text]