Graphical Models and Conditional Random Fields

Presenter: Shih-Hsiang Lin

Bishop, C. M., “Pattern Recognition and Machine Learning,” Springer, 2006
Sutton, C., McCallum, A., “An Introduction to Conditional Random Fields for Relational Learning,” Introduction to Statistical Relational Learning, MIT Press, 2007
Rahul Gupta, “Conditional Random Fields,” Dept. of Computer Science and Engg., IIT Bombay, India

Overview

• Introduction to graphical models
• Applications of graphical models
• More detail on conditional random fields
• Conclusions

Introduction to Graphical Models

Power of Probabilistic Graphical Models

• Why do we need graphical models?
 – Graphs are an intuitive way of representing and visualizing the relationships between many variables
  • Used to design and motivate new models
 – A graph allows us to abstract out the conditional independence relationships between the variables from the details of their parametric forms
  • Provides new insights into existing models
 – Graphical models allow us to define general message-passing algorithms that implement probabilistic inference efficiently
  • Graph-based algorithms for calculation and computation

[Diagram: Probability Theory + Graph Theory → Probabilistic Graphical Models]

Probability Theory

• What do we need to know in advance?
 – Probability Theory
  • Sum Rule (Law of Total Probability / Marginalization): p(x) = \sum_y p(x, y)
  • Product Rule (Chain Rule): p(x, y) = p(x \mid y)\, p(y) = p(y \mid x)\, p(x)

• From the above we can derive Bayes’ theorem

p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)} = \frac{p(x \mid y)\, p(y)}{\sum_{y'} p(x \mid y')\, p(y')}
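As a quick sanity check of the two rules and Bayes’ theorem, here is a small Python/NumPy sketch on a made-up 2×2 joint table (the numbers are purely illustrative):

```python
import numpy as np

# A toy joint distribution p(x, y) over two binary variables,
# stored as a table with rows indexed by x and columns by y.
p_xy = np.array([[0.3, 0.1],
                 [0.2, 0.4]])

p_x = p_xy.sum(axis=1)            # sum rule: p(x) = sum_y p(x, y)
p_y = p_xy.sum(axis=0)            # sum rule: p(y) = sum_x p(x, y)
p_x_given_y = p_xy / p_y          # product rule: p(x | y) = p(x, y) / p(y)

# Bayes' theorem: p(y | x) = p(x | y) p(y) / sum_y' p(x | y') p(y')
numer = p_x_given_y * p_y         # equals p(x, y) again
p_y_given_x = numer / numer.sum(axis=1, keepdims=True)

print(np.allclose(p_y_given_x, p_xy / p_x[:, None]))  # True
```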

Conditional Independence and Marginal Independence

• Conditional Independence

x \perp y \mid z \;\Leftrightarrow\; p(x \mid y, z) = p(x \mid z)

which is equivalent to saying

p(x, y \mid z) = p(x \mid y, z)\, p(y \mid z) = p(x \mid z)\, p(y \mid z)

– Conditional independence is crucial in practical applications, since we can rarely work with a general joint distribution
• Marginal Independence
  x \perp y \;\Leftrightarrow\; x \perp y \mid \emptyset \;\Leftrightarrow\; p(x, y) = p(x)\, p(y)
• Examples
 – Amount of Speeding Fine ⊥ Type of Car | Speed
 – Lung Cancer ⊥ Yellow Teeth | Smoking
 – Child’s Genes ⊥ Grandparents’ Genes | Parents’ Genes
 – Ability of Team A ⊥ Ability of Team B | ∅ (conditioned on the empty set, i.e. marginal independence)
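The following toy Python sketch illustrates the distinction: a made-up distribution p(x, y, z) is built so that x ⊥ y | z holds exactly, while x and y remain marginally dependent (all numbers are illustrative):

```python
import numpy as np

# Toy joint p(x, y, z) over three binary variables, indexed as p[x, y, z].
p_z = np.array([0.4, 0.6])
p_x_given_z = np.array([[0.9, 0.2],   # p(x | z), columns indexed by z
                        [0.1, 0.8]])
p_y_given_z = np.array([[0.7, 0.3],   # p(y | z)
                        [0.3, 0.7]])
p = np.einsum('k,ik,jk->ijk', p_z, p_x_given_z, p_y_given_z)  # p(x, y, z)

# Conditional independence: p(x, y | z) == p(x | z) p(y | z) for every z
p_xy_given_z = p / p.sum(axis=(0, 1), keepdims=True)
print(np.allclose(p_xy_given_z,
                  np.einsum('ik,jk->ijk', p_x_given_z, p_y_given_z)))  # True

# Marginal independence fails: p(x, y) != p(x) p(y)
p_xy = p.sum(axis=2)
print(np.allclose(p_xy, np.outer(p_xy.sum(axis=1), p_xy.sum(axis=0))))  # False
```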

Graphical Models

[Diagram: two graphs over nodes a, b, c — a directed graph and an undirected graph]

• A graph comprises nodes connected by links
 – Nodes (vertices) correspond to random variables
 – Links (edges or arcs) represent the relationships between the variables
• Directed graphs are useful for expressing causal relationships between random variables
• Undirected graphs are better suited to expressing soft constraints between random variables

Directed Graphs

• Consider an arbitrary distribution p(a, b, c); by successive application of the product rule we can write the joint distribution in the form

p(a, b, c) = p(c \mid a, b)\, p(a, b) = p(c \mid a, b)\, p(b \mid a)\, p(a)

* Note that this decomposition holds for any choice of joint distribution

• We can then represent the above equation in terms of a simple graphical model
 – First, we introduce a node for each of the random variables
 – Second, for each conditional distribution we add directed links to the graph

[Diagram: a fully connected directed graph over a, b, c]

Directed Graphs (cont.)

• Let us consider another case: p(x_1, x_2, \ldots, x_7)

p(x_1, x_2, \ldots, x_7) = p(x_7 \mid x_1, \ldots, x_6) \cdots p(x_2 \mid x_1)\, p(x_1)

[Diagram: a fully connected directed graph over x_1, \ldots, x_7 — again, it is a fully connected graph]

• What would happen if some links were dropped? (considering the relationships between nodes)

Directed Graphs (cont.)

• The joint distribution p(x_1, x_2, \ldots, x_7) is therefore given by

p(x_1, x_2, \ldots, x_7) = p(x_1)\, p(x_2)\, p(x_3)\, p(x_4 \mid x_1, x_2, x_3)\, p(x_5 \mid x_1, x_3)\, p(x_6 \mid x_4)\, p(x_7 \mid x_4, x_5)

[Diagram: the corresponding directed graph over x_1, \ldots, x_7]

* The joint distribution is defined by the product of a conditional distribution for each node, conditioned on the variables corresponding to the parents of that node in the graph
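A minimal Python sketch of this factorization, using made-up conditional probability tables for a hypothetical three-node graph x1 → x3 ← x2 (not the seven-node example above), shows that the product of per-node conditionals is itself a valid joint distribution:

```python
import numpy as np

p_x1 = np.array([0.6, 0.4])
p_x2 = np.array([0.7, 0.3])
p_x3_given_x1x2 = np.array([[[0.9, 0.1],   # p(x3 | x1, x2), indexed [x1, x2, x3]
                             [0.5, 0.5]],
                            [[0.4, 0.6],
                             [0.2, 0.8]]])

def joint(x1, x2, x3):
    # product of one conditional per node, conditioned on its parents
    return p_x1[x1] * p_x2[x2] * p_x3_given_x1x2[x1, x2, x3]

# The factors define a proper joint distribution: it sums to one.
total = sum(joint(a, b, c) for a in range(2) for b in range(2) for c in range(2))
print(round(total, 10))  # 1.0
```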

• Thus, for a graph with K nodes, the joint distribution is given by

p(x_1, \ldots, x_K) = \prod_{k=1}^{K} p(x_k \mid \mathrm{parent}(x_k))

 where parent(x_k) denotes the set of parents of x_k
• We always restrict the directed graph to have no directed cycles
 – Such graphs are also called directed acyclic graphs (DAGs) or Bayesian networks

Directed Graph: Conditional Independence

• Joint distribution over 3 variables specified by the graph

p(a, b, c) = p(a \mid c)\, p(b \mid c)\, p(c)

[Diagram: tail-to-tail graph, a ← c → b]

If node c is not observed:

p(a, b) = \sum_c p(a \mid c)\, p(b \mid c)\, p(c) \neq p(a)\, p(b) \;\Rightarrow\; a \not\perp b \mid \emptyset

If node c is observed:

p(a, b \mid c) = \frac{p(a, b, c)}{p(c)} = p(a \mid c)\, p(b \mid c) \;\Rightarrow\; a \perp b \mid c

The node c is said to be tail-to-tail w.r.t. this path from node a to node b ⇒ this observation ‘blocks’ the path from a to b and causes a and b to become conditionally independent

Directed Graph: Conditional Independence (cont.)

• The second example: p(a, b, c) = p(a)\, p(c \mid a)\, p(b \mid c)

[Diagram: head-to-tail graph, a → c → b]

If node c is not observed:

p(a, b) = p(a) \sum_c p(c \mid a)\, p(b \mid c) = p(a)\, p(b \mid a) \neq p(a)\, p(b) \;\Rightarrow\; a \not\perp b \mid \emptyset

If node c is observed:

p(a, b \mid c) = \frac{p(a, b, c)}{p(c)} = \frac{p(a)\, p(c \mid a)\, p(b \mid c)}{p(c)} = p(a \mid c)\, p(b \mid c) \;\Rightarrow\; a \perp b \mid c

The node c is said to be head-to-tail w.r.t. this path from node a to node b ⇒ this observation ‘blocks’ the path from a to b and causes a and b to become conditionally independent

Directed Graph: Conditional Independence (cont.)

• The third example: p(a, b, c) = p(a)\, p(b)\, p(c \mid a, b)

[Diagram: head-to-head graph, a → c ← b]

If node c is not observed:

p(a, b) = \sum_c p(a, b, c) = p(a)\, p(b) \sum_c p(c \mid a, b) = p(a)\, p(b) \;\Rightarrow\; a \perp b \mid \emptyset

If node c is observed:

p(a, b \mid c) = \frac{p(a, b, c)}{p(c)} = \frac{p(a)\, p(b)\, p(c \mid a, b)}{p(c)} \neq p(a \mid c)\, p(b \mid c) \;\Rightarrow\; a \not\perp b \mid c

The node c is said to be head-to-head w.r.t. this path from node a to node b ⇒ the conditioned node c ‘unblocks’ the path and renders a and b dependent

D-separation

• A ⊥ B | C if C d-separates A from B
 – We need to consider all possible paths from any node in A to any node in B
 – Any such path is said to be blocked if it includes a node such that either
  (a) the arrows on the path meet either head-to-tail or tail-to-tail at the node, and the node is in the set C, or
  (b) the arrows meet head-to-head at the node, and neither the node nor any of its descendants is in the set C
 – If all paths are blocked, then A is said to be d-separated from B by C

[Diagram: two copies of an example graph over nodes a, b, c, e, f illustrating d-separation]

Left graph (c observed): a ⊥̸ b | c    Right graph (f observed): a ⊥ b | f
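The d-separation test can be implemented directly from the two blocking rules above. The sketch below is a brute-force illustration (enumerate all paths in the skeleton and apply the rules); the edge set used for the example graphs is an assumption read off the figure (a → e, f → e, f → b, e → c):

```python
from itertools import chain

def d_separated(parents, a, b, observed):
    """True if a and b are d-separated given `observed` in the DAG
    described by `parents`, a dict mapping each node to its set of parents."""
    nodes = set(parents) | set(chain.from_iterable(parents.values()))
    children = {n: {m for m in nodes if n in parents.get(m, set())} for n in nodes}

    def descendants(n):
        out, stack = set(), [n]
        while stack:
            for c in children[stack.pop()]:
                if c not in out:
                    out.add(c)
                    stack.append(c)
        return out

    def paths(node, visited):
        # all simple paths from `node` to `b` in the skeleton (ignoring direction)
        if node == b:
            yield [node]
            return
        for nxt in parents.get(node, set()) | children[node]:
            if nxt not in visited:
                for rest in paths(nxt, visited | {nxt}):
                    yield [node] + rest

    def blocked(path):
        for i in range(1, len(path) - 1):
            prev, node, nxt = path[i - 1], path[i], path[i + 1]
            collider = prev in parents.get(node, set()) and nxt in parents.get(node, set())
            if collider:
                # head-to-head: blocks unless the node or one of its descendants is observed
                if node not in observed and not (descendants(node) & observed):
                    return True
            elif node in observed:
                # head-to-tail or tail-to-tail: blocks if the node is observed
                return True
        return False

    return all(blocked(p) for p in paths(a, {a}))

# Hypothetical edge set for the example graphs: a->e, f->e, f->b, e->c
parents = {'a': set(), 'f': set(), 'e': {'a', 'f'}, 'b': {'f'}, 'c': {'e'}}
print(d_separated(parents, 'a', 'b', {'c'}))   # False: observing c unblocks the collider e
print(d_separated(parents, 'a', 'b', {'f'}))   # True:  observing f blocks the path
```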

Markov Blankets

• The Markov blanket (or Markov boundary) of a node A is the minimal set of nodes that isolates A from the rest of the graph
 – Every set of nodes B in the network is conditionally independent of A when conditioned on the Markov blanket of A:
  p(A | MB(A), B) = p(A | MB(A))
 – MB(A) = {parents(A)} ∪ {children(A)} ∪ {parents-of-children(A)}

[Diagram: the Markov blanket of a node A — its parents, children, and co-parents]
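A minimal helper that reads the Markov blanket off a DAG represented as a parents dictionary (the example graph is hypothetical):

```python
def markov_blanket(parents, node):
    """Markov blanket of `node` in a DAG given as {node: set_of_parents}:
    its parents, its children, and the other parents of its children."""
    children = {n for n, ps in parents.items() if node in ps}
    co_parents = set().union(*(parents[c] for c in children)) - {node} if children else set()
    return parents.get(node, set()) | children | co_parents

# Hypothetical DAG: x1 -> x3 <- x2, x3 -> x4
parents = {'x1': set(), 'x2': set(), 'x3': {'x1', 'x2'}, 'x4': {'x3'}}
print(markov_blanket(parents, 'x1'))   # child x3 and co-parent x2
```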

Examples of Directed Graphs

• Hidden Markov models
• Kalman filters
• Probabilistic principal component analysis
• Independent component analysis
• Mixtures of Gaussians
• Transformed component analysis
• Probabilistic expert systems
• Sigmoid belief networks
• Hierarchical mixtures of experts
• etc.

Example: State Space Models (SSM)

• Hidden Markov models
• Kalman filters

[Diagram: state-space model with hidden states y_{t-1}, y_t, y_{t+1} and observations x_{t-1}, x_t, x_{t+1}]

P(X, Y) = \cdots\, p(y_t \mid y_{t-1})\, p(x_t \mid y_t)\, p(y_{t+1} \mid y_t) \cdots

Example: Factorial SSM

• Multiple hidden sequences

[Diagram: factorial state-space model with multiple hidden chains and shared observations]

Markov Random Fields

• Random field
 – Let F = \{F_1, F_2, \ldots, F_M\} be a family of random variables defined on the set S, in which each F_i takes a value f_i in a label set L. The family F is called a random field
• Markov random field
 – F is said to be a Markov random field on S with respect to a neighborhood system N if and only if the following two conditions are satisfied
  Positivity: P(f) > 0, \forall f \in \mathcal{F}

Markovianity: P(f_i \mid \text{all other } f) = P(f_i \mid \text{neighbors}(f_i))

[Diagram: undirected graph over nodes a, b, c, d, e]

P(b \mid \text{all other nodes}) = P(b \mid c, d)

Undirected Graphs

• An undirected graphical model is also called a Markov random field, or a Markov network
 – It has a set of nodes, each of which corresponds to a variable or group of variables, as well as a set of links, each of which connects a pair of nodes
• In an undirected graphical model, the joint distribution is a product of non-negative functions over the cliques of the graph

p(x) = \frac{1}{Z} \prod_{C} \psi_C(x_C)

where the \psi_C(x_C) are the clique potentials, and Z is a normalization constant (sometimes called the partition function)

[Diagram: undirected graph over a, b, c, d, e with cliques A = \{a, c\}, B = \{b, c, d\}, C = \{c, d, e\}]

p(x) = \frac{1}{Z}\, \psi_A(a, c)\, \psi_B(b, c, d)\, \psi_C(c, d, e)
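A brute-force Python sketch of this factorization for the example graph, with made-up (random) potential tables: the partition function Z is obtained by summing the product of clique potentials over all joint configurations, after which the normalized distribution sums to one:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical non-negative potentials for the cliques {a,c}, {b,c,d}, {c,d,e},
# with binary variables.
psi_A = rng.random((2, 2))          # psi_A(a, c)
psi_B = rng.random((2, 2, 2))       # psi_B(b, c, d)
psi_C = rng.random((2, 2, 2))       # psi_C(c, d, e)

def unnormalized(a, b, c, d, e):
    return psi_A[a, c] * psi_B[b, c, d] * psi_C[c, d, e]

# Partition function Z: brute-force sum over all joint configurations.
Z = sum(unnormalized(*x) for x in itertools.product(range(2), repeat=5))

def p(a, b, c, d, e):
    return unnormalized(a, b, c, d, e) / Z

print(round(sum(p(*x) for x in itertools.product(range(2), repeat=5)), 10))  # 1.0
```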

Clique Potentials

• A clique is a fully connected subgraph
 – By clique we usually mean maximal clique (i.e. not contained within another clique)
 – A clique potential \psi_C(x_C) measures the “compatibility” between settings of the variables

[Diagram: the same five-node undirected graph with its maximal cliques highlighted]

Undirected Graphs: Conditional Independence

• A ⊥ B | C — simple graph separation can tell us about conditional independencies: A ⊥ B | C holds whenever every path from a node in A to a node in B passes through a node in C

[Diagram: node sets A and B separated by the set C]

• The Markov blanket of a node A is defined as – MB(A)={Neighbors(A)}

Examples of Undirected Graphs

• Markov Random Fields
• Conditional Random Fields
• Maximum Entropy Markov Models
• Maximum Entropy
• Boltzmann Machines
• etc.

Example: Markov Random Field

[Diagram: Markov random field with hidden variables y_i and observed variables x_i]

P(X, Y) = \frac{1}{Z} \prod_i \psi_i(x_i, y_i) \prod_{j,k} \psi_{jk}(y_j, y_k)

Example: Conditional Random Field

[Diagram: conditional random field with hidden variables y_i, all conditioned on the observation X]

P(Y \mid X) = \frac{1}{Z} \prod_i \psi_i(y_i \mid X) \prod_{j,k} \psi_{jk}(y_j, y_k \mid X)

Summary of Factorization Properties

• Directed graphs
  p(x_1, \ldots, x_K) = \prod_{k=1}^{K} p(x_k \mid \mathrm{parent}(x_k))
 – Conditional independence from the d-separation test
 – Directed graphs are better at expressing causal generative models
• Undirected graphs
  p(x) = \frac{1}{Z} \prod_C \psi_C(x_C)
 – Conditional independence from graph separation
 – Undirected graphs are better at representing soft constraints between variables

Applications of Graphical Models

Classification

• Classification is predicting a single class variable y given a vector of features x = (x_1, x_2, \ldots, x_K)
• Naïve Bayes classifier
 – Assume that once the class label is known, all the features are independent
 – Based directly on the joint probability distribution p(y, x)
 – In generative models, the set of parameters must represent both the input distribution and the conditional well
  p(y, x) = p(y) \prod_{i=1}^{K} p(x_i \mid y)
• Logistic regression (maximum entropy classifier)
 – Based directly on the conditional probability p(y \mid x) ⇒ no model of p(x) is needed
 – Discriminative models are not as strongly tied to their input distribution
  p(y \mid x) = \frac{1}{Z(x)} \exp\Big\{ \lambda_y + \sum_{j=1}^{K} \lambda_{y,j}\, x_j \Big\}
  where Z(x) = \sum_{y} \exp\Big\{ \lambda_y + \sum_{j=1}^{K} \lambda_{y,j}\, x_j \Big\} and \lambda_y is a per-class bias weight
* It can be shown that a Gaussian Naïve Bayes (GNB) classifier implies the parametric form of p(y \mid x) of its discriminative pair, logistic regression
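The contrast can be made concrete with a small Python sketch: a naïve Bayes classifier obtains p(y | x) by modelling the joint p(y, x) and applying Bayes’ rule, while the maximum entropy / logistic regression classifier parameterizes p(y | x) directly and never models p(x). All parameter values below are made up for illustration:

```python
import numpy as np

# Generative: naive Bayes, p(y, x) = p(y) * prod_i p(x_i | y), binary features.
p_y = np.array([0.6, 0.4])
p_xi_given_y = np.array([[0.2, 0.7],     # p(x_i = 1 | y), rows = y, cols = feature i
                         [0.8, 0.3]])

def nb_posterior(x):
    joint = p_y * np.prod(np.where(x, p_xi_given_y, 1 - p_xi_given_y), axis=1)
    return joint / joint.sum()           # p(y | x) via Bayes' rule

# Discriminative: logistic regression / maximum entropy,
# p(y | x) = exp(lam[y] + lam_w[y] . x) / Z(x); no model of p(x) is needed.
lam = np.array([0.0, -0.5])
lam_w = np.array([[-1.0, 1.5],
                  [ 1.0, -1.5]])

def maxent_posterior(x):
    scores = np.exp(lam + lam_w @ x)
    return scores / scores.sum()

x = np.array([1.0, 0.0])
print(nb_posterior(x), maxent_posterior(x))
```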

Classification (cont.)

• Consider a GNB classifier based on the following modeling assumptions

– P(x_i \mid y = y_k) is a Gaussian distribution of the form N(\mu_{ik}, \sigma_i)
– y is Boolean, governed by a Bernoulli distribution with parameter \theta = P(y = 1)

p(y = 1 \mid x) = \frac{p(y = 1)\, p(x \mid y = 1)}{p(y = 1)\, p(x \mid y = 1) + p(y = 0)\, p(x \mid y = 0)}
 = \frac{1}{1 + \exp\Big( \log \frac{p(y = 0)\, p(x \mid y = 0)}{p(y = 1)\, p(x \mid y = 1)} \Big)}
 = \frac{1}{1 + \exp\Big( \log \frac{1 - \theta}{\theta} + \sum_i \log \frac{p(x_i \mid y = 0)}{p(x_i \mid y = 1)} \Big)}

With the Gaussian class-conditionals,

\sum_i \log \frac{p(x_i \mid y = 0)}{p(x_i \mid y = 1)} = \sum_i \frac{(x_i - \mu_{i1})^2 - (x_i - \mu_{i0})^2}{2\sigma_i^2} = \sum_i \Big( \frac{\mu_{i0} - \mu_{i1}}{\sigma_i^2}\, x_i + \frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2} \Big)

⇒ p(y = 1 \mid x) = \frac{1}{1 + \exp\Big( \log\frac{1-\theta}{\theta} + \sum_i \big( \frac{\mu_{i0} - \mu_{i1}}{\sigma_i^2}\, x_i + \frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2} \big) \Big)} = \frac{1}{1 + \exp\big( \lambda_0 + \sum_i \lambda_i x_i \big)}

Sequence Models

• Classifiers predict only a single class variable, but the true power of graphical models lies in their ability to model many variables that are interdependent
 – e.g. named-entity recognition (NER), part-of-speech (POS) tagging
• Hidden Markov models
 – Relax the independence assumption by arranging the output variables in a linear chain
 – To model the joint distribution p(Y, X), an HMM makes two assumptions
  • Each state depends only on its immediate predecessor (first-order assumption)
  • Each observation variable depends only on the current state (output-independence assumption)

p(Y, X) = p(y_0) \prod_{t=1}^{T} p(y_t \mid y_{t-1})\, p(x_t \mid y_t)

[Diagram: HMM with states y_{t-1}, y_t, y_{t+1} and observations x_{t-1}, x_t, x_{t+1}]

Sequence Models (cont.)

• Maximum Entropy Markov Models (MEMMs)
 – A conditional model that represents the probability of reaching a state given an observation and the previous state

p(Y \mid X) = p(y_1 \mid x_1) \prod_{t=2}^{T} p(y_t \mid y_{t-1}, x_t)

p(y_t \mid y_{t-1}, x_t) = \frac{1}{Z(y_{t-1}, x_t)} \exp\Big( \sum_k \lambda_k f_k(y_{t-1}, y_t, x_t) \Big), \quad Z(y_{t-1}, x_t) = \sum_{y'_t} \exp\Big( \sum_k \lambda_k f_k(y_{t-1}, y'_t, x_t) \Big)

* per-state normalization

[Diagram: MEMM with states y_{t-1}, y_t, y_{t+1} and observations x_{t-1}, x_t, x_{t+1}]

– Per-state normalization means that all the mass that arrives at a state must be distributed among the possible successor states ⇒ the label bias problem! Potential victims: discriminative models with per-state normalization

Sequence Models (cont.)

• Label Bias Problem – Consider this MEMM

– P(1 and 2 | ro) = P(2 | 1 and ro) P(1 | ro) = P(2 | 1 and o) P(1 | r)
– P(1 and 2 | ri) = P(2 | 1 and ri) P(1 | ri) = P(2 | 1 and i) P(1 | r)

– Since P(2 | 1 and x) = 1 for all x, P(1 and 2 | ro) = P(1 and 2 | ri)
  • In the training data, label 2 is the only label value observed after label 1; therefore P(2 | 1) = 1, so P(2 | 1 and x) = 1 for all x

– However, we expect P(1 and 2 | ri) to be greater than P(1 and 2 | ro)

Sequence Models (cont.)

[Diagram: the relationship between generative and conditional models]

                     SEQUENCE               GENERAL
GENERATIVE:     Naïve Bayes  →  HMMs  →  Generative Directed Models
                 ↓ CONDITIONAL   ↓ CONDITIONAL   ↓ CONDITIONAL
CONDITIONAL:    Logistic Regression  →  Linear-chain CRFs  →  General CRFs

From HMM to CRFs

HMMs: p(y, x) = p(y_0) \prod_{t=1}^{T} p(y_t \mid y_{t-1})\, p(x_t \mid y_t)

• We can rewrite the HMM joint distribution p(y, x) as follows

p(Y, X) = \frac{1}{Z} \exp\Big( \sum_t \sum_{i,j \in S} \lambda_{ij}\, 1_{\{y_t = i\}} 1_{\{y_{t-1} = j\}} + \sum_t \sum_{i \in S} \sum_{o \in O} \mu_{oi}\, 1_{\{y_t = i\}} 1_{\{x_t = o\}} \Big)

 – Because we do not require the parameters to be log probabilities, we are no longer guaranteed that the distribution sums to 1
  • So we explicitly enforce this by using a normalization constant Z
• We can write the above equation more compactly by introducing the concept of a feature function

Feature functions for HMMs:
 f_k(y_t, y_{t-1}, x_t) = 1_{\{y_t = i\}} 1_{\{y_{t-1} = j\}}   (state transition)
 f_k(y_t, y_{t-1}, x_t) = 1_{\{y_t = i\}} 1_{\{x_t = o\}}   (state observation)

p(Y, X) = \frac{1}{Z} \exp\Big( \sum_t \sum_{k=1}^{K} \lambda_k f_k(y_t, y_{t-1}, x_t) \Big)

• The last step is to write the conditional distribution p(Y | X )

p(Y \mid X) = \frac{p(Y, X)}{\sum_{Y'} p(Y', X)} = \frac{\exp\big\{ \sum_t \sum_{k=1}^{K} \lambda_k f_k(y_t, y_{t-1}, x_t) \big\}}{\sum_{y'} \exp\big\{ \sum_t \sum_{k=1}^{K} \lambda_k f_k(y'_t, y'_{t-1}, x_t) \big\}}

This is the linear-chain CRF
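For a short sequence the definition can be evaluated literally: the sketch below (with hypothetical labels, observations, feature functions and weights) scores every label sequence, normalizes by the brute-force Z(X), and picks the most probable labeling. Real implementations replace the brute-force sums with forward-backward and Viterbi, as described in the following slides:

```python
import itertools
import math

LABELS = ['N', 'V']          # hypothetical label set
x = ['they', 'can', 'fish']  # hypothetical observation sequence

# Hypothetical feature functions f_k(y_t, y_prev, x_t) and weights lambda_k.
features = [
    lambda y, yp, xt: 1.0 if y == 'V' and xt == 'can' else 0.0,
    lambda y, yp, xt: 1.0 if yp == 'N' and y == 'V' else 0.0,
    lambda y, yp, xt: 1.0 if y == 'N' and xt in ('they', 'fish') else 0.0,
]
lam = [1.2, 0.8, 2.0]

def score(y_seq):
    """Unnormalized log score: sum_t sum_k lambda_k f_k(y_t, y_{t-1}, x_t)."""
    s, y_prev = 0.0, 'START'
    for t, y in enumerate(y_seq):
        s += sum(l * f(y, y_prev, x[t]) for l, f in zip(lam, features))
        y_prev = y
    return s

# Z(X): sum over all label sequences (exponential, but fine for a toy example).
Z = sum(math.exp(score(y)) for y in itertools.product(LABELS, repeat=len(x)))

def p(y_seq):
    return math.exp(score(y_seq)) / Z

print(max(itertools.product(LABELS, repeat=len(x)), key=p))  # ('N', 'V', 'N')
```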

More Detail on Conditional Random Fields

Conditional Random Fields

• CRFs have all the advantages of MEMMs without the label bias problem
 – An MEMM uses a per-state exponential model for the conditional probabilities of next states given the current state
 – A CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation sequence
• Let G = (V, E) be a graph such that Y = (Y_v)_{v \in V}, so that Y is indexed by the vertices of G. Then (X, Y) is a conditional random field when, conditioned on X, the random variables Y_v obey the Markov property

p(Y_v \mid X, Y_w, w \neq v) = p(Y_v \mid X, \text{neighbors}(Y_v))

[Diagram: linear-chain CRF over Y_1, \ldots, Y_5 conditioned on X]

p(Y_3 \mid X, \text{all other } Y) = p(Y_3 \mid X, Y_2, Y_4)

Linear-Chain Conditional Random Fields

• Definition: Let Y, X be random vectors, \Lambda = \{\lambda_k\} \in \mathbb{R}^K be a parameter vector, and \{f_k(y, y', x_t)\}_{k=1}^{K} be a set of real-valued feature functions. Then a linear-chain conditional random field is a distribution p(Y \mid X) that takes the form

p(Y \mid X) = \frac{1}{Z(X)} \exp\Big( \sum_t \sum_{k=1}^{K} \lambda_k f_k(y_t, y_{t-1}, x_t) \Big) \quad\text{or}\quad p(Y \mid X) = \frac{1}{Z(X)} \exp\big( \Lambda^T F(Y, X) \big)

where Z(X) is an instance-specific normalization function

Z(X) = \sum_{Y} \exp\Big( \sum_t \sum_{k=1}^{K} \lambda_k f_k(y_t, y_{t-1}, x_t) \Big)

* a sum over all possible state sequences — an exponentially large number of terms

Fortunately, the forward-backward algorithm helps us calculate this term efficiently

Forward and Backward Algorithms

• Suppose that we are interested in tagging a sequence only partially, say up to position i
 – Denote the un-normalized probability of a partial labeling ending at position i with fixed label y by α(y, i)
 – Denote the un-normalized probability of a partial segmentation starting at position i+1, assuming a label y at position i, by β(y, i)
 – α and β can be computed via the following recurrences

\alpha(y, i) = \sum_{y'} \alpha(y', i-1) \times \exp\big( \Lambda^T f(y, y', x_i) \big)
\beta(y, i) = \sum_{y'} \beta(y', i+1) \times \exp\big( \Lambda^T f(y', y, x_{i+1}) \big)

– We can now write the marginals and the partition function in terms of these

P(Y_i = y \mid X) = \alpha(y, i)\, \beta(y, i) / Z(X)
P(Y_i = y, Y_{i+1} = y' \mid X) = \alpha(y, i)\, \exp\big( \Lambda^T f(y', y, x_{i+1}) \big)\, \beta(y', i+1) / Z(X)
Z(X) = \sum_y \alpha(y, n) = \sum_y \alpha(y, 1)\, \beta(y, 1)
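A compact NumPy sketch of these recurrences, assuming the per-position scores exp(Λᵀf(·)) have already been collected into a start vector and transition matrices (all values here are random placeholders):

```python
import numpy as np

# Hypothetical scores for an n-position, L-label chain:
#   start[y]        = exp(Lambda^T f(y, START, x_1))
#   M[i][y_prev, y] = exp(Lambda^T f(y, y_prev, x_{i+1}))   for i = 1..n-1
rng = np.random.default_rng(1)
n, L = 4, 3
start = np.exp(rng.normal(size=L))
M = np.exp(rng.normal(size=(n - 1, L, L)))

alpha = np.zeros((n, L))
alpha[0] = start
for i in range(1, n):
    alpha[i] = alpha[i - 1] @ M[i - 1]        # alpha(y,i) = sum_y' alpha(y',i-1) exp(...)

beta = np.ones((n, L))                        # beta(y, n) = 1
for i in range(n - 2, -1, -1):
    beta[i] = M[i] @ beta[i + 1]              # beta(y,i) = sum_y' exp(...) beta(y',i+1)

Z = alpha[-1].sum()                           # Z(X) = sum_y alpha(y, n)
marginals = alpha * beta / Z                  # P(Y_i = y | X)
print(np.allclose(marginals.sum(axis=1), 1.0))  # each position's marginal sums to 1
```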

Inference in Linear CRFs Using the Viterbi Algorithm

• Given the parameter vector Λ, the best labeling for a sequence can be found exactly using the Viterbi algorithm
 – For each tuple of the form (i, y), the Viterbi algorithm maintains the un-normalized probability of the best labeling ending at position i with the label y
 – The recurrence is

V(i, y) = \begin{cases} \max_{y'} \big( V(i-1, y') \times \exp( \Lambda^T f(y, y', x_i) ) \big) & (i > 0) \\ [[\, y = \text{start} \,]] & (i = 0) \end{cases}

• The normalized probability of the best labeling is given by

\max_y V(n, y) \,/\, Z(X)
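A matching NumPy sketch of the Viterbi recurrence and back-tracing, using the same hypothetical score representation as the forward-backward sketch:

```python
import numpy as np

# Hypothetical scores: start[y] for position 1, M[i][y_prev, y] for later positions.
rng = np.random.default_rng(1)
n, L = 4, 3
start = np.exp(rng.normal(size=L))
M = np.exp(rng.normal(size=(n - 1, L, L)))

V = np.zeros((n, L))
back = np.zeros((n, L), dtype=int)
V[0] = start
for i in range(1, n):
    scores = V[i - 1][:, None] * M[i - 1]      # V(i-1, y') * exp(Lambda^T f(y, y', x_i))
    back[i] = scores.argmax(axis=0)            # best predecessor for each label y
    V[i] = scores.max(axis=0)

# Trace back the best labeling; max_y V(n, y) is its unnormalized score
# (divide by Z(X) from forward-backward for the normalized probability).
y = [int(V[-1].argmax())]
for i in range(n - 1, 0, -1):
    y.append(int(back[i][y[-1]]))
print(list(reversed(y)), V[-1].max())
```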

Training (Parameter Estimation)

• The various methods used to train CRFs differ mainly in the objective function they try to optimize
 – Penalized log-likelihood criteria
 – Voted perceptron
 – Pseudo log-likelihood
 – Margin maximization
 – Gradient tree boosting
 – Logarithmic pooling
 – and so on …

Penalized Log-likelihood Criteria

• The conditional log-likelihood of a set of training instances (X^k, Y^k) using parameters Λ is given by

L_\Lambda = \sum_k \big( \Lambda^T F(Y^k, X^k) - \log Z_\Lambda(X^k) \big)

• The gradient of the log-likelihood is given by

\nabla L_\Lambda = \sum_k \Big( F(Y^k, X^k) - \frac{\sum_{Y'} F(Y', X^k)\, \exp( \Lambda^T F(Y', X^k) )}{Z_\Lambda(X^k)} \Big)
 = \sum_k \Big( F(Y^k, X^k) - \sum_{Y'} F(Y', X^k)\, P(Y' \mid X^k) \Big)
 = \sum_k \Big( F(Y^k, X^k) - E_{P(Y \mid X^k)}[F(Y, X^k)] \Big)

• In order to avoid overfitting, we impose a penalty (the squared Euclidean norm of Λ):

L_\Lambda = \sum_k \big( \Lambda^T F(Y^k, X^k) - \log Z_\Lambda(X^k) \big) - \frac{\|\Lambda\|^2}{2\sigma^2}

and the gradient is given by

\nabla L_\Lambda = \sum_k \big( F(Y^k, X^k) - E_{P(Y \mid X^k)}[F(Y, X^k)] \big) - \frac{\Lambda}{\sigma^2}
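For a tiny instance the gradient can be checked by brute force, replacing the forward-backward computation of the expectation with explicit enumeration of all label sequences (features, parameters and data below are made up):

```python
import itertools
import numpy as np

# Penalized log-likelihood gradient for one training pair (X, Y),
# with E_{P(Y'|X)}[F(Y', X)] computed by brute force.
LABELS = [0, 1]
X = [0, 1, 1]                     # hypothetical observation sequence
Y = [0, 1, 1]                     # its gold labeling
lam = np.array([0.5, -0.3])       # current parameters
sigma2 = 10.0                     # Gaussian prior variance

def F(y_seq):
    """Global feature vector F(Y, X) = sum_t f(y_t, y_{t-1}, x_t)."""
    f, y_prev = np.zeros(2), -1   # -1 plays the role of START
    for t, y in enumerate(y_seq):
        f[0] += 1.0 if y == X[t] else 0.0        # toy observation feature
        f[1] += 1.0 if y == y_prev else 0.0      # toy transition feature
        y_prev = y
    return f

seqs = list(itertools.product(LABELS, repeat=len(X)))
weights = np.array([np.exp(lam @ F(y)) for y in seqs])
probs = weights / weights.sum()                   # P(Y' | X)
expected_F = sum(p * F(y) for p, y in zip(probs, seqs))

grad = F(Y) - expected_F - lam / sigma2           # gradient for this instance
print(grad)
```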

Penalized Log-likelihood Criteria (cont.)

The tricky term in the gradient is the expectation E_{P(Y \mid X^k)}[F(Y, X^k)], whose computation requires the enumeration of all possible Y sequences

Let us look at the j-th entry of this vector, viz. F_j(Y, X^k), which is equal to \sum_i f_j(y_i, y_{i-1}, X^k_i). Therefore, we can rewrite E_{P(Y \mid X^k)}[F_j(Y, X^k)] as

E_{P(Y \mid X^k)}[F_j(Y, X^k)] = E_{P(Y \mid X^k)}\Big[ \sum_i f_j(y_i, y_{i-1}, X^k_i) \Big]
 = \sum_i E_{P(Y \mid X^k)}\big[ f_j(y_i, y_{i-1}, X^k_i) \big]
 = \sum_i \sum_{y', y} \alpha(y', i-1)\, f_j(y, y', X^k_i)\, \exp\big( \Lambda^T f(y, y', X^k_i) \big)\, \beta(y, i) \,/\, Z(X^k)
 = \sum_i \alpha_{i-1}^T Q_i\, \beta_i \,/\, Z(X^k)

• After obtaining the gradient, various iterative methods can be used to maximize the log-likelihood
 – Improved Iterative Scaling (IIS), Generalized Iterative Scaling (GIS), Limited-Memory Quasi-Newton methods (L-BFGS)

Voted Perceptron Method

• The perceptron uses an approximation of the gradient of the unregularized log-likelihood
  \nabla L_\Lambda = \sum_k \big( F(Y^k, X^k) - E_{P(Y \mid X^k)}[F(Y, X^k)] \big)
 – It considers one misclassified instance at a time, along with its contribution \big( F(Y^k, X^k) - E_{P(Y \mid X^k)}[F(Y, X^k)] \big) to the gradient
 – The feature expectation is further approximated by a point estimate of the feature vector at the best possible labeling

\nabla L_\Lambda \approx F(Y^k, X^k) - F(Y^*, X^k), \quad\text{where } Y^* = \arg\max_Y \Lambda^T F(Y, X^k) \quad\text{(a MAP-hypothesis based classifier)}

Using this approximate gradient, the following first-order update rule can be used for maximization

\Lambda_{t+1} = \Lambda_t + F(Y^k, X^k) - F(Y^*, X^k)

This update step is applied once for each misclassified instance in the training set. Alternatively, we can collect all the updates in each pass and take their unweighted average to update the parameters (the “voted”/averaged perceptron)
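A minimal sketch of this update rule on a single made-up training pair; for brevity the arg-max labeling Y* is found by enumeration rather than the Viterbi decoder that a real implementation would use:

```python
import itertools
import numpy as np

LABELS = [0, 1]
X = [0, 1, 1]           # hypothetical observation sequence
Y = [0, 1, 1]           # its gold labeling
lam = np.zeros(2)

def F(y_seq):
    """Global feature vector F(Y, X), same toy features as the gradient sketch."""
    f, y_prev = np.zeros(2), -1
    for t, y in enumerate(y_seq):
        f[0] += 1.0 if y == X[t] else 0.0
        f[1] += 1.0 if y == y_prev else 0.0
        y_prev = y
    return f

for _ in range(5):                                   # a few training passes
    y_star = max(itertools.product(LABELS, repeat=len(X)),
                 key=lambda y: lam @ F(y))           # Y* = argmax_Y Lambda^T F(Y, X)
    if list(y_star) != Y:                            # update only on mistakes
        lam = lam + F(Y) - F(y_star)
print(lam)
```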

Pseudo Log-likelihood

• In many scenarios, we are willing to assign different error values to different labelings
 – It makes sense to maximize the marginal distributions P(y^k_t \mid X^k) instead of P(Y^k \mid X^k)
 – This objective is called the pseudo-likelihood, and for the case of linear CRFs it is given by

L_\Lambda = \sum_k \sum_{t=1}^{T} \log P(y^k_t \mid X^k, \Lambda) = \sum_k \sum_{t=1}^{T} \log \frac{\sum_{y:\, y_t = y^k_t} \exp\big( \Lambda^T F(y, X^k) \big)}{Z_\Lambda(X^k)}

Other Types of CRFs

• Semi-Markov CRFs
 – Still in the realm of first-order Markovian dependence, but the difference is that a label depends only on segment features and the label of the previous segment
 – Instead of assigning labels to each position, labels are assigned to segments

Semi-Markov CRFs

• Skip-Chain CRFs – A conditional model that collectively segments a document into mentions and classifies the mentions by entity type

[Diagram: skip-chain CRF with entity labels A, B, C]

Other Types of CRFs (cont.)

• Factorial CRFs – Several synchronized inter-dependent tasks • Cascading propagates errors

[Diagram: factorial CRF with a named-entity (NE) layer and a part-of-speech (POS) layer]

• Tree CRFs – The dependencies are organized as a tree structure

Tree CRFs

Conclusions

• Conditional Random Fields offer a unique combination of properties
 – discriminatively trained models for sequence segmentation and labeling
 – combination of arbitrary and overlapping observation features from both the past and the future
 – efficient training and decoding based on dynamic programming for a simple chain graph
 – parameter estimation guaranteed to find the global optimum
• Possible future work?
 – More efficient training approaches?
 – Efficient feature induction?
 – Constrained inference?
 – Different topologies?

References

• Lafferty, J., McCallum, A., Pereira, F., “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” In: Proc. 18th International Conf. on Machine Learning, Morgan Kaufmann, San Francisco, CA (2001) 282–289.
• Rahul Gupta, “Conditional Random Fields,” Dept. of Computer Science and Engg., IIT Bombay, India. Document available from http://www.it.iitb.ac.in/~grahul/main.pdf
• Sutton, C., McCallum, A., “An Introduction to Conditional Random Fields for Relational Learning,” Introduction to Statistical Relational Learning, MIT Press, 2007. Document available from http://www.cs.berkeley.edu/~casutton/publications/crf-tutorial.pdf
• Mitchell, T. M., “Machine Learning,” McGraw Hill, 1997. Document available from http://www.cs.cmu.edu/~tom/mlbook.html
• Bishop, C. M., “Pattern Recognition and Machine Learning,” Springer, 2006.
• Bishop, C. M., “Graphical Models and Variational Methods,” video lecture, Machine Learning Summer School 2004. Available from http://videolectures.net/mlss04_bishop_gmvm/
• Ghahramani, Z., “Graphical Models,” video lecture, EPSRC Winter School in Mathematics for Data Modelling, 2008. Available from http://videolectures.net/epsrcws08_ghahramani_gm/
• Roweis, S., “Machine Learning, Probability and Graphical Models,” video lecture, Machine Learning Summer School 2006. Available from http://videolectures.net/mlss06tw_roweis_mlpgm/
