Graphical Models and Message Passing

Yunshu Liu, ASPITRG Research Group, 2013-07-16

References:
[1] Steffen Lauritzen, Graphical Models, Oxford University Press, 1996.
[2] Michael I. Jordan, Graphical Models, Statistical Science, Vol. 19, No. 1, Feb. 2004.
[3] Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer-Verlag New York, Inc., 2006.
[4] Daphne Koller and Nir Friedman, Probabilistic Graphical Models: Principles and Techniques, The MIT Press, 2009.
[5] Kevin P. Murphy, Machine Learning: A Probabilistic Perspective, The MIT Press, 2012.

Outline

- Preliminaries on graphical models
  - Directed graphical models
  - Undirected graphical models
  - Directed graphs and undirected graphs
- Message passing and the sum-product algorithm
  - Factor graphs
  - Sum-product algorithm
  - Junction tree algorithm

Preliminaries on Graphical Models

Motivation: the curse of dimensionality, which arises in
- matroid enumeration,
- polyhedron computation,
- characterizing the entropic region,
- machine learning: computing likelihoods, marginal probabilities, ...

Graphical models help capture complex dependencies among random variables, build large-scale statistical models, and design efficient algorithms for inference.

Definition: a graphical model is a probabilistic model for which a graph encodes the conditional dependence structure between random variables.

Example: suppose MIT and Stanford accept undergraduate students based only on GPA. Given Alice's GPA, GPA_Alice,

    P(MIT | Stanford, GPA_Alice) = P(MIT | GPA_Alice).

We say that MIT is conditionally independent of Stanford given GPA_Alice, sometimes written (MIT ⊥ Stanford | GPA_Alice).

Bayesian networks: directed graphical models

A Bayesian network consists of a collection of probability distributions p over x = {x_1, ..., x_K} that factorize over a directed acyclic graph (DAG) in the following way:

    p(x) = p(x_1, ..., x_K) = ∏_k p(x_k | pa_k),

where pa_k denotes the set of direct parent nodes of x_k.

Aliases of Bayesian networks:
- probabilistic directed graphical models (via directed acyclic graphs)
- belief networks
- causal networks (directed arrows represent causal relations)

Examples of Bayesian networks

Consider an arbitrary joint distribution p(x) = p(x_1, x_2, x_3) over three variables. We can write

    p(x_1, x_2, x_3) = p(x_3 | x_1, x_2) p(x_1, x_2)
                     = p(x_3 | x_1, x_2) p(x_2 | x_1) p(x_1),

which is expressed by the fully connected directed graph with edges x_1 → x_2, x_1 → x_3 and x_2 → x_3. Similarly, by changing the order of x_1, x_2 and x_3 (i.e., considering all permutations of them), we can express p(x_1, x_2, x_3) in five other ways, for example

    p(x_1, x_2, x_3) = p(x_1 | x_2, x_3) p(x_2, x_3)
                     = p(x_1 | x_2, x_3) p(x_2 | x_3) p(x_3),

which corresponds to the directed graph with edges x_3 → x_2, x_3 → x_1 and x_2 → x_1.

Recall the earlier example of how MIT and Stanford accept undergraduate students. Assign x_1 to "GPA", x_2 to "accepted by MIT" and x_3 to "accepted by Stanford". Since p(x_3 | x_1, x_2) = p(x_3 | x_1), we have

    p(x_1, x_2, x_3) = p(x_3 | x_1, x_2) p(x_2 | x_1) p(x_1)
                     = p(x_3 | x_1) p(x_2 | x_1) p(x_1),

which corresponds to the directed graph with edges x_1 (GPA) → x_2 (MIT) and x_1 (GPA) → x_3 (Stanford).
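To make the factorization concrete, here is a minimal Python sketch of the GPA example (not from the original slides); the conditional probability tables are invented purely for illustration:

```python
# A minimal sketch of the DAG factorization for the GPA example:
#     p(x1, x2, x3) = p(x3 | x1) p(x2 | x1) p(x1).
# The probability values below are invented purely for illustration.

p_gpa = {"high": 0.3, "low": 0.7}          # p(x1): GPA
p_mit = {"high": 0.6, "low": 0.05}         # p(x2 = accepted | x1): MIT
p_stanford = {"high": 0.5, "low": 0.04}    # p(x3 = accepted | x1): Stanford

def joint(gpa, mit, stanford):
    """Joint probability via the Bayesian-network factorization."""
    p1 = p_gpa[gpa]
    p2 = p_mit[gpa] if mit else 1.0 - p_mit[gpa]
    p3 = p_stanford[gpa] if stanford else 1.0 - p_stanford[gpa]
    return p1 * p2 * p3

# Numerical check that p(MIT | Stanford, GPA) = p(MIT | GPA) for GPA = "high":
num = joint("high", True, True)
den = sum(joint("high", m, True) for m in (False, True))
print(num / den)        # 0.6 = p(x2 = accepted | x1 = high)
print(p_mit["high"])    # 0.6: conditioning on Stanford changes nothing
```

Because the joint is built from the factorization, conditioning on Stanford's decision cannot change the probability of MIT's decision once GPA is given, which is exactly the conditional independence asserted by the graph.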
Markov random fields: undirected graphical models

In the undirected case, the probability distribution factorizes according to functions defined on the cliques of the graph. The following discussion is adapted from Bishop [3], Section 8.3.

For an undirected graph, the Markov blanket of a node x_i consists of its set of neighbouring nodes: the conditional distribution of x_i, conditioned on all the remaining variables in the graph, depends only on the variables in the Markov blanket (Bishop [3], Figure 8.28).

If we consider two nodes x_i and x_j that are not connected by a link, then these variables must be conditionally independent given all other nodes in the graph. This follows from the fact that there is no direct path between the two nodes, and all other paths pass through nodes that are observed, and hence those paths are blocked. This conditional independence property can be expressed as

    p(x_i, x_j | x_{−{i,j}}) = p(x_i | x_{−{i,j}}) p(x_j | x_{−{i,j}}),        (8.38)

where x_{−{i,j}} denotes the set x of all variables with x_i and x_j removed. The factorization of the joint distribution must therefore be such that x_i and x_j do not appear in the same factor, in order for the conditional independence property to hold for all possible distributions belonging to the graph.

This leads us to a graphical concept called a clique: a subset of the nodes in a graph such that there exists a link between all pairs of nodes in the subset. In other words, the set of nodes in a clique is fully connected. A maximal clique is a clique such that it is not possible to include any other node from the graph in the set without it ceasing to be a clique.

These concepts are illustrated by the four-node undirected graph of Bishop [3], Figure 8.29, with edges x_1–x_2, x_2–x_3, x_1–x_3, x_2–x_4 and x_3–x_4. This graph has five cliques of two nodes,

    {x_1, x_2}, {x_2, x_3}, {x_3, x_4}, {x_2, x_4}, {x_1, x_3},

and two maximal cliques,

    {x_1, x_2, x_3}, {x_2, x_3, x_4}.

The set {x_1, x_2, x_3, x_4} is not a clique because of the missing link from x_1 to x_4.

We can therefore define the factors in the decomposition of the joint distribution to be functions of the variables in the cliques. In fact, we can consider functions of the maximal cliques without loss of generality, because all other cliques are subsets of maximal cliques: if {x_1, x_2, x_3} is a maximal clique and we define an arbitrary function over this clique, then including another factor defined over a subset of these variables would be redundant. Let us denote a clique by C and the set of variables in that clique by x_C.
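As a quick illustration (a brute-force sketch using only the Python standard library; not part of the original slides), the cliques and maximal cliques of this four-node graph can be enumerated directly:

```python
# Enumerate cliques and maximal cliques of the four-node graph from
# Bishop's Figure 8.29 by brute force over all node subsets.
from itertools import combinations

nodes = [1, 2, 3, 4]
edges = {frozenset(e) for e in [(1, 2), (2, 3), (1, 3), (2, 4), (3, 4)]}

def is_clique(subset):
    """Every pair of nodes in the subset must be joined by an edge."""
    return all(frozenset(pair) in edges for pair in combinations(subset, 2))

# All cliques with at least two nodes.
cliques = [set(s) for r in range(2, len(nodes) + 1)
           for s in combinations(nodes, r) if is_clique(s)]

# A clique is maximal if no outside node can be added while staying a clique.
maximal = [c for c in cliques
           if not any(is_clique(c | {v}) for v in nodes if v not in c)]

print(cliques)   # [{1,2}, {1,3}, {2,3}, {2,4}, {3,4}, {1,2,3}, {2,3,4}]
print(maximal)   # [{1,2,3}, {2,3,4}]
```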
Markov random fields: definition

Denote by C a clique, by x_C the set of variables in clique C, and by ψ_C(x_C) a nonnegative potential function associated with clique C. Then a Markov random field is a collection of distributions that factorize as a product of potential functions ψ_C(x_C) over the maximal cliques of the graph:

    p(x) = (1/Z) ∏_C ψ_C(x_C),

where the normalization constant

    Z = ∑_x ∏_C ψ_C(x_C)

is sometimes called the partition function.

Factorization of undirected graphs

Question: how do we write the joint distribution for the undirected graph with edges x_1–x_2 and x_1–x_3, in which (2 ⊥ 3 | 1) holds?

Answer:

    p(x) = (1/Z) ψ_12(x_1, x_2) ψ_13(x_1, x_3),

where ψ_12(x_1, x_2) and ψ_13(x_1, x_3) are the potential functions and Z is the partition function that ensures p(x) satisfies the conditions for a probability distribution.

Markov properties

Given an undirected graph G = (V, E) and a set of random variables X = (X_a), a ∈ V, indexed by V, we have the following Markov properties:

- Pairwise Markov property: any two non-adjacent variables are conditionally independent given all other variables: X_a ⊥ X_b | X_{V∖{a,b}} if {a, b} ∉ E.
- Local Markov property: a variable is conditionally independent of all other variables given its neighbors: X_a ⊥ X_{V∖(nb(a)∪{a})} | X_{nb(a)}, where nb(a) denotes the neighbors of node a.
- Global Markov property: any two subsets of variables A and B are conditionally independent given a separating subset S: X_A ⊥ X_B | X_S, where every path from a node in A to a node in B passes through S (meaning that when we remove all the nodes in S, there is no path connecting any node in A to any node in B).

[Figure from Murphy [5], Figure 19.1: a 4 × 5 lattice on nodes X_1, ..., X_20, drawn (a) as a DAG and (b) as an undirected graph. The node X_8 is independent of all other nodes given its Markov blanket; in the DAG case the blanket includes its parents, children and co-parents.]

Examples of the pairwise Markov property: (1 ⊥ 7 | 2, 3, 4, 5, 6) and (3 ⊥ 4 | 1, 2, 5, 6, 7).
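The factorization and the partition function can be made concrete with a minimal Python sketch of the three-node example above (binary variables, brute-force Z); the potential values are invented for illustration, not taken from the slides:

```python
# A minimal sketch of p(x) = (1/Z) ψ12(x1,x2) ψ13(x1,x3) for the
# three-node graph with edges x1–x2 and x1–x3, binary variables, and a
# brute-force partition function. Potential values are invented.
from itertools import product

def psi12(x1, x2):
    return 2.0 if x1 == x2 else 1.0    # favors agreement of x1 and x2

def psi13(x1, x3):
    return 3.0 if x1 == x3 else 1.0    # favors agreement of x1 and x3

states = (0, 1)

# Partition function: sum of the unnormalized product over all joint states.
Z = sum(psi12(x1, x2) * psi13(x1, x3)
        for x1, x2, x3 in product(states, repeat=3))

def p(x1, x2, x3):
    return psi12(x1, x2) * psi13(x1, x3) / Z

# Sanity check: p sums to one.
assert abs(sum(p(*s) for s in product(states, repeat=3)) - 1.0) < 1e-12

# Numerical check of (2 ⊥ 3 | 1): p(x2, x3 | x1) = p(x2 | x1) p(x3 | x1).
p_x1 = sum(p(0, a, b) for a, b in product(states, repeat=2))
lhs = p(0, 1, 1) / p_x1
rhs = (sum(p(0, 1, b) for b in states) / p_x1) * \
      (sum(p(0, a, 1) for a in states) / p_x1)
assert abs(lhs - rhs) < 1e-12
```

Because no potential couples x_2 and x_3 directly, conditioning on x_1 factorizes the remaining distribution, which is the conditional independence the graph asserts. Note that the brute-force computation of Z enumerates all joint states, which is exactly the exponential cost that the sum-product and junction tree algorithms are designed to avoid.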