Machine Learning

Bert Kappen
SNN Radboud University, Nijmegen
Gatsby Unit, UCL London
October 9, 2020

Course setup

• 3 ec course, 7 weeks.
• Lectures are prerecorded and available through Brightspace.
• Examination is based on the weekly exercises only:
  – You form a group of at most 3 persons.
  – Each week, you make the exercises with your group.
  – You can ask questions in the tutorial class.
  – Your group hands the exercises in before the next tutorial class. This is a hard rule, because answers may be discussed in the tutorial class.
• All course materials (slides, exercises) and the schedule are available via http://www.snn.ru.nl/~bertk/machinelearning/

Content

1. Probabilities, Bayes rule, information theory
2. Model selection
3. Classification, perceptron, gradient descent learning rules
4. Multi-layered perceptrons
5. Deep learning
6. Graphical models, latent variable models, EM
7. Variational autoencoders

Lecture 1a

Based on MacKay ch. 2: Probability, Entropy and Inference
• Probabilities
• Bayesian inference, inverse probabilities, Bayes rule

Forward probabilities

Forward probabilities are the usual way to compute the probabilities of possible outcomes given a probability model:

    p(x|f) \rightarrow p(\text{outcome}|f)

Exercise 2.4. An urn contains B black balls and W white balls, K = W + B in total. Draw a ball from the urn N times with replacement. What is the probability to draw N_B black balls?

Define f = B/K, the fraction of black balls in the urn. Then

    p(N_B = 0) = (1-f)^N
    p(N_B = 1) = N f (1-f)^{N-1}
    p(N_B) = \binom{N}{N_B} f^{N_B} (1-f)^{N-N_B}

Expected value and variance:

    E[N_B] = \sum_{N_B=0}^{N} p(N_B)\, N_B = \ldots = Nf
    V[N_B] = \sum_{N_B=0}^{N} p(N_B)\, (N_B - Nf)^2 = \ldots = Nf(1-f)

Suppose K = 10, B = 2, f = 0.2. When N = 5, N_B \approx 1 \pm 0.9. When N = 400, N_B \approx 80 \pm 8.

Inverse probabilities

Inverse probabilities do the converse: given a specific outcome, what is the probability that it has been generated by a model with parameters f?

    \text{outcome} \rightarrow p(f|\text{outcome})

Exercise 2.6. There are 11 urns u = 0, ..., 10, each containing 10 balls. Urn u has u black balls and 10 - u white balls. Select one urn at random and draw N times with replacement from that urn. The outcome is that after N = 10 draws there are N_B = 3 black balls. What is the probability that urn u was selected?

We treat u and N_B as random variables. N is given and provides the context or condition in which the probabilities are calculated.

    p(u, N_B|N) = p(u|N)\, p(N_B|u, N)
    p(u|N_B, N) = \frac{p(u|N)\, p(N_B|u, N)}{p(N_B|N)}
    p(N_B|u, N) = \binom{N}{N_B} f_u^{N_B} (1-f_u)^{N-N_B}, \qquad f_u = \frac{u}{10}
    p(u|N) = p(u) = \frac{1}{11}
    p(N_B|N) = \sum_{u=0}^{10} p(u)\, p(N_B|u, N) = \frac{1}{11} \sum_{u=0}^{10} \binom{N}{N_B} f_u^{N_B} (1-f_u)^{N-N_B}
    p(u|N_B, N) = \frac{f_u^{N_B} (1-f_u)^{N-N_B}}{\sum_{u'=0}^{10} f_{u'}^{N_B} (1-f_{u'})^{N-N_B}}

[Figure: left, the joint probability p(u, N_B|N); right, the conditional probability p(u|N_B = 3, N = 10):]

    u     p(u|N_B = 3, N = 10)
    0     0
    1     0.063
    2     0.22
    3     0.29
    4     0.24
    5     0.13
    6     0.047
    7     0.0099
    8     0.00086
    9     0.0000096
    10    0

Bayesian inference

Note that our 'inference' has resulted in a distribution over u. We do not know which urn was selected, only probabilities. This is the best we can do given the data. This is called Bayesian inference.

In general, when the models are parametrized by \theta, the procedure is:

1. A given data set.
2. Specify the prior over models, p(\theta).
3. Compute the likelihood of the observed data under the model with parameters \theta: p(\text{data}|\theta).
4. Compute the posterior using Bayes' rule:

    p(\theta|\text{data}) = \frac{p(\text{data}|\theta)\, p(\theta)}{p(\text{data})}, \qquad p(\text{data}) = \int d\theta\, p(\text{data}|\theta)\, p(\theta)
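To make these four steps concrete, the following short numerical sketch (illustrative, not part of the original slides; the variable names are chosen here) applies them to Exercise 2.6 and reproduces the posterior values in the table above.

```python
# Posterior over urns for Exercise 2.6: N = 10 draws, N_B = 3 black balls.
import numpy as np
from math import comb

N, NB = 10, 3
u = np.arange(11)                  # urn labels 0, ..., 10
f_u = u / 10.0                     # fraction of black balls in urn u

prior = np.full(11, 1.0 / 11.0)                              # step 2: p(u) = 1/11
likelihood = comb(N, NB) * f_u**NB * (1.0 - f_u)**(N - NB)   # step 3: p(N_B|u, N)
evidence = np.sum(prior * likelihood)                        # p(N_B|N)
posterior = prior * likelihood / evidence                    # step 4: Bayes' rule

for ui, pi in zip(u, posterior):
    print(f"p(u = {ui:2d} | N_B = 3, N = 10) = {pi:.4f}")
# The posterior peaks at u = 3 with probability about 0.29, as in the table above.
```

The predictive probability of drawing another black ball, used in the continuation of Exercise 2.6 below, is then np.sum(f_u * posterior), which comes out at about 0.333.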
Exercise 2.6 continued

We draw another ball from the same urn. Given the observations so far, what is the probability that it is black? This is computed from the posterior:

    p(\text{black}|N_B = 3, N = 10) = \sum_{u=0}^{10} p(\text{black}|u, N_B, N)\, p(u|N_B, N)
                                    = \sum_{u=0}^{10} f_u\, p(u|N_B, N)
                                    = \sum_{u=0}^{10} f_u \frac{f_u^{N_B} (1-f_u)^{N-N_B}}{\sum_{u'=0}^{10} f_{u'}^{N_B} (1-f_{u'})^{N-N_B}} = 0.333

Compare this with the prediction from the most likely urn (u = 3), which would give p(black|u = 3) = f_3 = 3/10 = 0.3.

Exercise 2.7. The bent coin

A bent coin has probability f to come up heads. We do not know f. We toss it N times and get heads N_H times. What is f?

f is fixed but unknown. The key conceptual step is to treat f as a random variable and assume a prior over f: p(f). Here p(f) is our (subjective) prior belief in the value of f. Given f and N, we know the likelihood of the observation:

    p(N_H|f, N) = \binom{N}{N_H} f^{N_H} (1-f)^{N-N_H}

The posterior is

    p(f|N_H, N) = \frac{p(N_H|f, N)\, p(f)}{p(N_H|N)}

Intermezzo: the Beta distribution

The Beta distribution is a probability density over a continuous random variable x \in [0, 1], defined by two shape parameters a, b > 0:

    \text{Beta}(x|a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} x^{a-1} (1-x)^{b-1}, \qquad 0 \le x \le 1

The prefactor ensures normalisation:

    \int_0^1 dx\, x^{a-1} (1-x)^{b-1} = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}

[Figure: Beta(x|a, b) for (a, b) = (0.1, 0.1), (1, 1), (2, 3) and (8, 4).]

Mean and variance:

    E[x] = \frac{a}{a+b}, \qquad V[x] = \frac{ab}{(a+b)^2 (a+b+1)}

Note that with a = fN and b = (1-f)N we get

    E[x] = f, \qquad V[x] = \frac{f(1-f)}{N+1}

so the distribution becomes more peaked for large N.

Exercise 2.7 continued

Assume that we have no prior knowledge of f. Therefore p(f) = 1. The posterior becomes

    p(f|N_H, N) = \frac{p(N_H|f, N)\, p(f)}{p(N_H|N)} = \frac{\binom{N}{N_H}}{p(N_H|N)} f^{N_H} (1-f)^{N-N_H} = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} f^{a-1} (1-f)^{b-1}

with N_H = a - 1 and N - N_H = b - 1. From this we can infer that

    p(N_H|N) = \binom{N}{N_H} \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)} = \frac{N!}{N_H!\,(N-N_H)!} \cdot \frac{N_H!\,(N-N_H)!}{(N+1)!} = \frac{1}{N+1}

where we used \Gamma(x) = (x-1)! when x is an integer. Since p(N_H|N) = \int df\, p(N_H|f, N)\, p(f), we conclude that when integrating over all models, all outcomes are equally likely.

So, given our experiment where we have thrown the coin N times and observed heads N_H times, what is the probability to observe another head?

    p(H|N_H, N) = \int df\, p(H|f)\, p(f|N_H, N) = \int df\, f\, \text{Beta}(f|a = N_H + 1, b = N - N_H + 1) = \frac{a}{a+b} = \frac{N_H + 1}{N + 2}

Suppose N = N_H = 1. The naive answer is f = 1 and thus p(H|f) = 1. The Bayesian answer is p(f|N_H, N) = 2f and p(H|N_H, N) = 2/3.
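The predictive rule p(H|N_H, N) = (N_H + 1)/(N + 2) is easy to verify numerically. The sketch below (illustrative, not part of the original slides) approximates the integral over f on a grid, using the flat prior p(f) = 1.

```python
# Bent coin: predictive probability of heads under a flat prior p(f) = 1.
import numpy as np

def predictive_heads(NH, N, grid=100001):
    """Approximate p(H|N_H, N) = int df f p(f|N_H, N) on a grid over [0, 1]."""
    f = np.linspace(0.0, 1.0, grid)
    post = f**NH * (1.0 - f)**(N - NH)       # unnormalised posterior, flat prior
    return np.sum(f * post) / np.sum(post)   # posterior mean of f (grid spacing cancels)

print(predictive_heads(1, 1))    # ~0.6667: the Bayesian answer 2/3 for N = N_H = 1
print(predictive_heads(3, 10))   # ~0.3333: equal to (N_H + 1)/(N + 2) = 4/12
```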
Lecture 1b

Based on MacKay ch. 2: Probability, Entropy and Inference
• Entropy as information
• Maximum entropy, and its relation to exponential models
• The KL divergence
  – use in variational approximation
  – relation to maximum likelihood

Entropy as information

Information is a measure of the 'degree of surprise' that a certain value gives us, given that we know the distribution. Unlikely events are informative, likely events less so. Thus, information decreases with the probability of the event. Given p(x) = \delta(x) (no uncertainty), observing x gives us no additional information.

Let us denote by h(x) the information of x. If x, y are independent, we want information to be additive: h(x, y) = h(x) + h(y). Since p(x, y) = p(x) p(y), we see that

    h(x) = -\log_2 p(x)

is a good candidate to quantify the information in x. The expected information

    H[x] := E[-\log_2 p(x)] = -\sum_x p(x) \log_2 p(x)

is the entropy of the distribution p. In this section on entropy as information, log denotes \log_2.

[Figure: two example probability distributions over outcomes, with entropies H = 1.77 (left) and H = 3.09 (right).]

When p is sharply peaked (p(x_1) = 1, p(x_2) = \ldots = p(x_M) = 0), the entropy is

    H[x] = -1 \log 1 - (M-1) \cdot 0 \log 0 = 0

When p is flat (p(x_i) = 1/M), the entropy is maximal:

    H[x] = -M \cdot \frac{1}{M} \log \frac{1}{M} = \log M
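The two limiting cases can be checked directly; the short sketch below (illustrative, not part of the original slides) computes H[x] for a sharply peaked and a flat distribution over M = 8 outcomes.

```python
# Entropy H[x] = -sum_x p(x) log2 p(x), using the convention 0 log 0 = 0.
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                    # drop zero-probability outcomes (0 log 0 = 0)
    return -np.sum(p * np.log2(p))

M = 8
peaked = np.zeros(M); peaked[0] = 1.0   # p(x_1) = 1, all other outcomes 0
flat = np.full(M, 1.0 / M)              # p(x_i) = 1/M

print(entropy(peaked))   # 0.0: a sharply peaked distribution carries no surprise
print(entropy(flat))     # 3.0: the maximum, log2 M = log2 8 bits
```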