Learning Bayesian Networks from Data

Nir Friedman (Hebrew U.) and Daphne Koller (Stanford)

Overview: Introduction; Parameter Estimation; Model Selection; Structure Discovery; Incomplete Data; Learning from Structured Data.

Bayesian Networks
A compact representation of probability distributions via conditional independence.
- Qualitative part: a directed acyclic graph (DAG), where nodes are random variables and edges denote direct influence.
- Quantitative part: a set of conditional probability distributions, one per node.
Together they define a unique distribution in factored form. For the family "Alarm" example (Earthquake, Burglary, Radio, Alarm, Call), the CPT for the Alarm node is

  E    B    P(a | E,B)   P(¬a | E,B)
  e    b    0.9          0.1
  e    ¬b   0.2          0.8
  ¬e   b    0.9          0.1
  ¬e   ¬b   0.01         0.99

and the joint distribution factors as

  P(B, E, A, C, R) = P(B) P(E) P(A | B, E) P(R | E) P(C | A)

Example: the "ICU Alarm" network
Domain: monitoring intensive-care patients. The network has 37 variables and 509 parameters, instead of the 2^54 entries a full joint table would require. [Figure: the ICU Alarm network, including nodes such as MINVOLSET, VENTMACH, INTUBATION, VENTLUNG, PVSAT, ARTCO2, SAO2, CATECHOL, HR, CO, and BP.]

Inference
- Posterior probabilities: the probability of any event given any evidence.
- Most likely explanation: the scenario that best explains the evidence.
- Rational decision making: maximize expected utility; value of information.
- Effect of intervention.

Why learning?
Knowledge acquisition is the bottleneck: it is an expensive process, and often we don't have an expert. Data, in contrast, is cheap, and the amount of available information is growing rapidly. Learning allows us to construct models from raw data.

Why learn Bayesian networks?
- Conditional independencies and the graphical language capture the structure of many real-world distributions.
- The graph structure provides much insight into the domain and allows "knowledge discovery".
- The learned model can be used for many tasks.
- Learning supports all the features of probabilistic learning: model selection criteria, and dealing with missing data and hidden variables.

Learning Bayesian networks
Data together with prior information are given to a learner, which outputs a network structure over E, B, R, A, C and its CPTs, e.g. P(A | E, B).
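To make the factored form concrete, here is a minimal sketch (not from the slides) that evaluates P(B,E,A,C,R) = P(B) P(E) P(A|B,E) P(R|E) P(C|A) for the Alarm family network. Only the CPT P(A | E, B) comes from the slide; the other CPDs below are assumed illustrative numbers.

```python
# Minimal sketch of the factored Alarm-network joint distribution.
# P(A | E, B) uses the CPT from the slide; the other CPDs are assumed
# illustrative numbers (the slide does not give them).

p_b = {True: 0.01, False: 0.99}           # P(B)            -- assumed
p_e = {True: 0.02, False: 0.98}           # P(E)            -- assumed
p_a_given_eb = {                          # P(A=True | E, B) from the slide
    (True, True): 0.9,
    (True, False): 0.2,
    (False, True): 0.9,
    (False, False): 0.01,
}
p_r_given_e = {True: 0.95, False: 0.001}  # P(R=True | E)   -- assumed
p_c_given_a = {True: 0.7, False: 0.05}    # P(C=True | A)   -- assumed


def bernoulli(p_true, value):
    """Probability of a binary value given P(value = True)."""
    return p_true if value else 1.0 - p_true


def joint(b, e, a, c, r):
    """P(B,E,A,C,R) = P(B) P(E) P(A|B,E) P(R|E) P(C|A)."""
    return (bernoulli(p_b[True], b)
            * bernoulli(p_e[True], e)
            * bernoulli(p_a_given_eb[(e, b)], a)
            * bernoulli(p_r_given_e[e], r)
            * bernoulli(p_c_given_a[a], c))


# Probability of: burglary, no earthquake, alarm sounds, call received, no radio report.
print(joint(b=True, e=False, a=True, c=True, r=False))
```

Any inference query of the kind listed above can then be answered by summing this factored product over the unobserved variables.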
The learning problem comes in four flavors:
- Known structure, complete data: the structure is specified and the data (records of (E, B, A) such as <Y,N,N>, <Y,N,Y>, <N,N,Y>, <N,Y,Y>, ...) contain no missing values; the inducer only needs to estimate parameters.
- Unknown structure, complete data: the structure is not specified; the inducer needs to select arcs as well as estimate parameters; the data contain no missing values.
- Known structure, incomplete data: the structure is specified, but the data contain missing values (e.g. <Y,?,Y>, <N,Y,?>, <?,Y,Y>); we need to consider assignments to the missing values.
- Unknown structure, incomplete data: the structure is not specified and the data contain missing values; we must both select arcs and consider assignments to the missing values.

Overview (continued): Introduction; Parameter Estimation (likelihood function, Bayesian estimation); Model Selection; Structure Discovery; Incomplete Data; Learning from Structured Data.

Learning Parameters
The training data have the form

  D = ( E[1]  B[1]  A[1]  C[1]
        ...
        E[M]  B[M]  A[M]  C[M] )

Likelihood Function
Assuming i.i.d. samples, the likelihood function is

  L(Θ : D) = ∏_m P(E[m], B[m], A[m], C[m] : Θ)

By definition of the network, we get

  L(Θ : D) = ∏_m P(E[m] : Θ) P(B[m] : Θ) P(A[m] | B[m], E[m] : Θ) P(C[m] | A[m] : Θ)

Rewriting terms, we can group the factors by variable:

  L(Θ : D) = [∏_m P(E[m] : Θ)] [∏_m P(B[m] : Θ)] [∏_m P(A[m] | B[m], E[m] : Θ)] [∏_m P(C[m] | A[m] : Θ)]

General Bayesian Networks
Generalizing to any Bayesian network:

  L(Θ : D) = ∏_m P(x_1[m], ..., x_n[m] : Θ)
           = ∏_i ∏_m P(x_i[m] | Pa_i[m] : Θ_i)
           = ∏_i L_i(Θ_i : D)

Decomposition ⇒ independent estimation problems, one per conditional distribution.

Likelihood Function: Multinomials

  L(θ : D) = P(D | θ) = ∏_m P(x[m] | θ)

For example, the likelihood of the sequence H, T, T, H, H is

  L(θ : D) = θ (1−θ) (1−θ) θ θ

In the general case, with N_k the count of the kth outcome in D and θ_k the probability of the kth outcome,

  L(Θ : D) = ∏_{k=1}^K θ_k^{N_k}

Bayesian Inference
Represent uncertainty about the parameters using a probability distribution over parameters and data, and learn using Bayes' rule:

  P(θ | x[1], ..., x[M]) = P(x[1], ..., x[M] | θ) P(θ) / P(x[1], ..., x[M])

that is, posterior ∝ likelihood × prior, normalized by the probability of the data. We can represent this setup as a Bayesian network in which θ is a parent of X[1], X[2], ..., X[M]: the values of X are independent given θ, with P(x[m] | θ) = θ, and Bayesian prediction is simply inference in this network.

Example: Binomial Data
With a uniform prior for θ on [0,1], P(θ | D) ∝ the likelihood L(θ : D). For observed data (N_H, N_T) = (4, 1):
- the MLE for P(X = H) is 4/5 = 0.8;
- the Bayesian prediction is P(x[M+1] = H | D) = ∫ θ P(θ | D) dθ = 5/7 ≈ 0.7142.

Dirichlet Priors
Recall that the likelihood function is

  L(Θ : D) = ∏_{k=1}^K θ_k^{N_k}

A Dirichlet prior with hyperparameters α_1, ..., α_K has the form

  P(Θ) ∝ ∏_{k=1}^K θ_k^{α_k − 1}

so the posterior has the same form, with hyperparameters α_1 + N_1, ..., α_K + N_K:

  P(Θ | D) ∝ P(Θ) P(D | Θ) ∝ ∏_k θ_k^{α_k − 1} ∏_k θ_k^{N_k} = ∏_k θ_k^{α_k + N_k − 1}

Dirichlet Priors: Example
[Figure: densities of Dirichlet(α_heads, α_tails) priors over θ_heads, for Dirichlet(0.5,0.5), Dirichlet(1,1), Dirichlet(2,2), and Dirichlet(5,5).]
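As a quick numeric check of the binomial example, the following minimal sketch (assuming the uniform prior on the slide, i.e. Beta(1,1)/Dirichlet(1,1)) reproduces the MLE of 0.8 and the Bayesian prediction of 5/7.

```python
from fractions import Fraction

# Binomial example from the slides: (N_H, N_T) = (4, 1), uniform prior on theta.
n_heads, n_tails = 4, 1
alpha_heads, alpha_tails = 1, 1   # uniform (Dirichlet(1,1)) hyperparameters

# Maximum-likelihood estimate of P(X = H).
mle = Fraction(n_heads, n_heads + n_tails)

# Bayesian predictive probability:
# P(x[M+1] = H | D) = (alpha_H + N_H) / (alpha_H + N_H + alpha_T + N_T).
bayes = Fraction(alpha_heads + n_heads,
                 alpha_heads + n_heads + alpha_tails + n_tails)

print(float(mle))           # 0.8
print(bayes, float(bayes))  # 5/7 ≈ 0.7142...
```

The same computation, done per outcome with counts and pseudo-counts, is exactly the Dirichlet prediction rule (α_k + N_k) / Σ_l (α_l + N_l) given next.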
Dirichlet Priors (cont.)
If P(Θ) is Dirichlet with hyperparameters α_1, ..., α_K, then

  P(X[1] = k) = ∫ θ_k P(Θ) dΘ = α_k / Σ_l α_l

Since the posterior is also Dirichlet, we get

  P(X[M+1] = k | D) = ∫ θ_k P(Θ | D) dΘ = (α_k + N_k) / Σ_l (α_l + N_l)

Bayesian Nets & Bayesian Prediction
[Figure: plate notation for a two-node network X → Y with parameter nodes θ_X and θ_{Y|X}; X[1..M], Y[1..M] are the observed data and X[M+1], Y[M+1] the query.]
- Priors for each parameter group are independent.
- Data instances are independent given the unknown parameters.
We can also "read" from this network: with complete data, the posteriors on the parameters are independent, so we can compute the posterior over each parameter group separately.

Learning Parameters: Summary
Estimation relies on sufficient statistics; for multinomials these are the counts N(x_i, pa_i). The two estimators are:

  MLE:                  θ(x_i | pa_i) = N(x_i, pa_i) / N(pa_i)
  Bayesian (Dirichlet): θ(x_i | pa_i) = (α(x_i, pa_i) + N(x_i, pa_i)) / (α(pa_i) + N(pa_i))

Both are asymptotically equivalent and consistent, and both can be implemented in an on-line manner by accumulating sufficient statistics.

Learning Parameters: Case Study
Instances were sampled from the ICU Alarm network; M' is the strength of the prior. [Figure: KL divergence to the true distribution vs. number of instances (0 to 5000), comparing MLE with Bayesian estimation for M' = 5, 20, and 50.]

Overview (continued): Introduction; Parameter Learning; Model Selection (scoring function, structure search); Structure Discovery; Incomplete Data; Learning from Structured Data.

Why Struggle for Accurate Structure?
[Figure: the Earthquake, Alarm Set, Burglary, Sound network, shown alongside a version that is missing an arc and a version with an added arc.]

Score-based Learning
Define a scoring function that evaluates how well a structure matches the data; candidate structures over E, B, A are compared on records such as <Y,N,N>, <Y,Y,Y>, <N,N,Y>, <N,Y,Y>.
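The section above does not define a concrete score, so the sketch below is an assumption rather than the slides' own choice: it uses the standard decomposable log-likelihood score, plugging the MLE parameters from the parameter-estimation summary into the likelihood, and compares two candidate structures over E, B, A on the four records shown above.

```python
import math
from collections import Counter

# Records of (E, B, A) from the slide: <Y,N,N>, <Y,Y,Y>, <N,N,Y>, <N,Y,Y>.
data = [("Y", "N", "N"), ("Y", "Y", "Y"), ("N", "N", "Y"), ("N", "Y", "Y")]


def loglik(parent_idx_of_a):
    """Decomposable log-likelihood score with MLE parameters.

    parent_idx_of_a: indices of A's parents among (E, B); E and B are roots.
    """
    score = 0.0
    # Root variables E (index 0) and B (index 1): sum_k N_k * log(N_k / M).
    for i in (0, 1):
        counts = Counter(rec[i] for rec in data)
        score += sum(n * math.log(n / len(data)) for n in counts.values())
    # A (index 2) given its parent configuration: sum N(pa, a) * log(N(pa, a) / N(pa)).
    joint = Counter((tuple(rec[j] for j in parent_idx_of_a), rec[2]) for rec in data)
    parent = Counter(tuple(rec[j] for j in parent_idx_of_a) for rec in data)
    score += sum(n * math.log(n / parent[pa]) for (pa, _), n in joint.items())
    return score


print("score(E -> A <- B):", loglik((0, 1)))  # A depends on both E and B
print("score(no arcs):    ", loglik(()))      # A independent of E and B
```

Because adding an arc can never decrease the maximized likelihood, the denser structure always scores at least as well under this score; practical scoring functions therefore combine the likelihood with a penalty for model complexity.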