Probabilistic Graphical Models

Brown University CSCI 2950-P, Spring 2013
Prof. Erik Sudderth
Lecture 24: Conditional Random Fields, MAP Estimation & Max-Product BP

Some figures and examples courtesy C. Sutton and A. McCallum, An Introduction to Conditional Random Fields, 2012.

Generative ML or MAP Learning: Naïve Bayes

\max_{\pi, \theta} \; \log p(\pi) + \log p(\theta) + \sum_{i=1}^{N} \left[ \log p(y_i \mid \pi) + \log p(x_i \mid y_i, \theta) \right]

[Figure: directed graphical models for naive Bayes training (observed y_i, x_i, i = 1, ..., N) and testing (observed x_t only), with parameters \pi and \theta.]

Discriminative ML or MAP Learning

\max_{\theta} \; \log p(\theta) + \sum_{i=1}^{N} \log p(y_i \mid x_i, \theta)

Binary Logistic Regression

p(y_i \mid x_i, \theta) = \mathrm{Ber}(y_i \mid \mathrm{sigm}(\theta^T \phi(x_i)))

[Figure: logistic sigmoid function plotted over the interval [-10, 10].]

• Linear discriminant analysis:

\phi(x_i) = [1, x_{i1}, x_{i2}, \ldots, x_{id}]

• Quadratic discriminant analysis:
\phi(x_i) = [1, x_{i1}, \ldots, x_{id}, x_{i1}^2, x_{i1} x_{i2}, x_{i2}^2, \ldots]

• Can derive weights from a Gaussian model if that happens to be known, but more generally:
• Choose any convenient feature set \phi(x)
• Do discriminative Bayesian learning:
p(\theta \mid x, y) \propto p(\theta) \prod_{i=1}^{N} \mathrm{Ber}(y_i \mid \mathrm{sigm}(\theta^T \phi(x_i)))
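As a concrete illustration of the discriminative ML/MAP objective above, here is a minimal sketch of MAP logistic regression with a quadratic feature map, trained by batch gradient ascent. The feature construction, learning rate, prior variance, and toy data are illustrative choices, not part of the lecture.

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def quadratic_features(X):
    """phi(x) = [1, x_1, ..., x_d, all pairwise products x_j x_k]."""
    N, d = X.shape
    quad = np.einsum('ij,ik->ijk', X, X).reshape(N, d * d)
    return np.hstack([np.ones((N, 1)), X, quad])

def map_logistic_regression(X, y, prior_var=10.0, lr=0.5, iters=2000):
    """Maximize log p(theta) + sum_i log Ber(y_i | sigm(theta^T phi(x_i)))
    under a zero-mean Gaussian prior on theta, by gradient ascent."""
    Phi = quadratic_features(X)
    theta = np.zeros(Phi.shape[1])
    for _ in range(iters):
        grad = Phi.T @ (y - sigm(Phi @ theta)) - theta / prior_var
        theta += lr * grad / len(y)
    return theta

# Toy usage: labels depend on the squared norm of x, so quadratic features help.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (np.sum(X**2, axis=1) > 2.0).astype(float)
theta = map_logistic_regression(X, y)
```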

One simple way to accomplish this is to assume that once the class label is known, all the features are independent. The resulting classifier is called the naive Bayes classifier. It is based on a joint model of the form:

p(y, x) = p(y) \prod_{k=1}^{K} p(x_k \mid y). \qquad (2.7)

This model can be described by the directed model shown in Figure 2.3 (left). We can also write this model as a factor graph, by defining a factor \Psi(y) = p(y), and a factor \Psi_k(y, x_k) = p(x_k \mid y) for each feature x_k. This factor graph is shown in Figure 2.3 (right).

Another well-known classifier that is naturally represented as a graphical model is logistic regression (sometimes known as the maximum entropy classifier in the NLP community). This classifier can be motivated by the assumption that the log probability, log p(y | x), of each class is a linear function of x, plus a normalization constant.¹ This leads to the conditional distribution:

p(y \mid x) = \frac{1}{Z(x)} \exp\left\{ \theta_y + \sum_{j=1}^{K} \theta_{y,j} x_j \right\}, \qquad (2.8)

where Z(x) = \sum_y \exp\{ \theta_y + \sum_{j=1}^{K} \theta_{y,j} x_j \} is a normalizing constant, and \theta_y is a bias weight that acts like log p(y) in naive Bayes. Rather than using one weight vector per class, as in (2.8), we can use a different notation in which a single set of weights is shared across all the classes. The trick is to define a set of feature functions that are nonzero only for a single class.

Fig. 2.3 The naive Bayes classifier, as a directed model (left), and as a factor graph (right).

¹ By log in this survey we will always mean the natural logarithm.

Generative ML or MAP Learning: Naïve Bayes

p(y, x) = p(y) \prod_{m=1}^{M} p(x_m \mid y)

• Class-specific distributions for each of M features

Discriminative ML or MAP Learning: Logistic Regression

p(y = k \mid x, \theta) = \frac{1}{Z(x, \theta)} \exp\left\{ \sum_{m=1}^{M} \theta_k^T \phi(x_m) \right\}

Z(x, \theta) = \sum_{k=1}^{K} \exp\left\{ \sum_{m=1}^{M} \theta_k^T \phi(x_m) \right\}

• Conditional distribution over classes (maximum entropy classifier)
• Different distribution, and normalization constant, for each x
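A small sketch of the per-class scores and the input-dependent normalizer Z(x, θ) above; the array shapes and random inputs are assumptions made purely for illustration.

```python
import numpy as np

def class_posterior(Phi_x, theta):
    """p(y = k | x, theta) for the multiclass model above.
    Phi_x: (M, D) array of features phi(x_m); theta: (K, D) per-class weights."""
    scores = theta @ Phi_x.sum(axis=0)      # sum_m theta_k^T phi(x_m), one score per class
    scores -= scores.max()                  # subtract the max before exponentiating (stability)
    p = np.exp(scores)
    return p / p.sum()                      # divide by the input-specific Z(x, theta)

# Usage with random features for a single input x (M = 4 feature blocks, D = 5, K = 3).
rng = np.random.default_rng(0)
print(class_posterior(rng.normal(size=(4, 5)), rng.normal(size=(3, 5))))
```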

Mixture Models versus HMMs

Mixture Model: y_i \in \{1, \ldots, K\}

p(y_i \mid \pi, \theta) = \mathrm{Cat}(y_i \mid \pi)
p(x_i \mid y_i = k, \pi, \theta) = \exp\{ \theta_k^T \phi(x_i) - A(\theta_k) \}

Hidden Markov Model:

p(y_t \mid \pi, \theta, y_{t-1}, y_{t-2}, \ldots) = \mathrm{Cat}(y_t \mid \pi_{y_{t-1}})
p(x_t \mid y_t = k, \pi, \theta) = \exp\{ \theta_k^T \phi(x_t) - A(\theta_k) \}

• Recover the mixture model when all rows of the state transition matrix are equal
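A minimal sketch of sampling from this HMM, assuming (purely for illustration) unit-variance Gaussian emissions as the exponential family; with identical transition rows the states become i.i.d. and the model reduces to a mixture.

```python
import numpy as np

def sample_hmm(pi0, Pi, means, T, rng):
    """Sample (y_1..y_T, x_1..x_T): categorical transitions Cat(y_t | pi_{y_{t-1}})
    and, as an illustrative exponential family choice, N(x_t | mean_{y_t}, 1)."""
    y = np.empty(T, dtype=int)
    y[0] = rng.choice(len(pi0), p=pi0)
    for t in range(1, T):
        y[t] = rng.choice(Pi.shape[1], p=Pi[y[t - 1]])
    x = rng.normal(loc=means[y], scale=1.0)
    return y, x

rng = np.random.default_rng(0)
Pi = np.array([[0.7, 0.3],
               [0.7, 0.3]])        # all rows equal: y_t are i.i.d. Cat([0.7, 0.3]),
                                   # so the HMM collapses to a two-component mixture
y, x = sample_hmm(np.array([0.5, 0.5]), Pi, means=np.array([-2.0, 2.0]), T=10, rng=rng)
```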

Modeling Sequential Data

Hidden Markov Model:

p(y_t \mid \pi, \theta, y_{t-1}, y_{t-2}, \ldots) = \mathrm{Cat}(y_t \mid \pi_{y_{t-1}})
p(x_t \mid y_t = k, \pi, \theta) = \exp\{ \theta_k^T \phi(x_t) - A(\theta_k) \}

“Maximum Entropy Markov Model”:

p(y_t = k \mid y_{t-1}, x_t) = \exp\{ \theta_k^T \phi(y_{t-1}, x_t) - A(y_{t-1}, x_t, \theta) \}

Most graphical models have equal claim to max-entropy terminology…

The “Label Bias” Problem

• Directed MEMM structure gives a simple local learning problem: a “classifier” for the next state, given the current state & observations
• But the MEMM structure has a major modeling weakness: future observations provide no information about the current state

p(y_1 \mid x_1, x_2, x_3) = \sum_{y_2} \sum_{y_3} p(y_1 \mid x_1) \, p(y_2 \mid y_1, x_2) \, p(y_3 \mid y_2, x_3)
 = p(y_1 \mid x_1) \left[ \sum_{y_2} p(y_2 \mid y_1, x_2) \left[ \sum_{y_3} p(y_3 \mid y_2, x_3) \right] \right]

Each bracketed sum equals one, so p(y_1 \mid x_1, x_2, x_3) = p(y_1 \mid x_1): the sum-product algorithm has uninformative backward messages.
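A quick numerical check of the computation above, using hypothetical random MEMM conditionals: the marginal of y_1 given (x_1, x_2, x_3) never changes with x_2 or x_3.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3  # number of states

# Hypothetical MEMM conditionals for one fixed input (x1, x2, x3):
# p1[k] = p(y1 = k | x1), p2[i, k] = p(y2 = k | y1 = i, x2), p3 likewise for y3.
p1 = rng.dirichlet(np.ones(K))
p2 = rng.dirichlet(np.ones(K), size=K)
p3 = rng.dirichlet(np.ones(K), size=K)

# Build p(y1, y2, y3 | x1, x2, x3), then marginalize out y2 and y3.
joint = p1[:, None, None] * p2[:, :, None] * p3[None, :, :]
marg_y1 = joint.sum(axis=(1, 2))

# The bracketed sums are each 1, so the marginal ignores x2 and x3 entirely.
assert np.allclose(marg_y1, p1)
```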

(Generative) Markov Random Fields

p(x \mid \theta) = \frac{1}{Z(\theta)} \prod_{f \in \mathcal{F}} \psi_f(x_f \mid \theta_f), \qquad Z(\theta) = \sum_x \prod_{f \in \mathcal{F}} \psi_f(x_f \mid \theta_f)

\mathcal{F}: set of hyperedges linking subsets of nodes, f \subseteq \mathcal{V}
\mathcal{V}: set of N nodes or vertices, \{1, 2, \ldots, N\}

• Assume an exponential family representation of each factor:
p(x \mid \theta) = \exp\left\{ \sum_{f \in \mathcal{F}} \theta_f^T \phi_f(x_f) - A(\theta) \right\}, \qquad \psi_f(x_f \mid \theta_f) = \exp\{ \theta_f^T \phi_f(x_f) \}, \qquad A(\theta) = \log Z(\theta)
• Partition function globally couples the local factor parameters

(Discriminative) Conditional Random Fields

p(y \mid x, \theta) = \frac{1}{Z(\theta, x)} \prod_{f \in \mathcal{F}} \psi_f(y_f \mid x, \theta_f), \qquad Z(\theta, x) = \sum_y \prod_{f \in \mathcal{F}} \psi_f(y_f \mid x, \theta_f)

• Assume an exponential family representation of each factor:
p(y \mid x, \theta) = \exp\left\{ \sum_{f \in \mathcal{F}} \theta_f^T \phi_f(y_f, x) - A(\theta, x) \right\}, \qquad \psi_f(y_f \mid x, \theta_f) = \exp\{ \theta_f^T \phi_f(y_f, x) \}, \qquad A(\theta, x) = \log Z(\theta, x)
• Log-probability is a linear function of fixed, possibly very complex features of input x and output y
• Partition function globally couples the local factor parameters, and every training example has a different normalizer
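To make the last point concrete, a tiny sketch with a single pairwise factor on binary (y_1, y_2) and a hypothetical x-dependent feature vector: the log-normalizer A(θ, x) = log Z(θ, x) must be recomputed for every input x.

```python
import numpy as np
from itertools import product

# One pairwise factor on binary (y_1, y_2), with hypothetical x-dependent features
# phi_f(y_f, x) = [y_1 * x, y_2 * x, 1{y_1 = y_2}].
theta_f = np.array([1.0, -0.5, 0.3])
phi_f = lambda y, x: np.array([y[0] * x, y[1] * x, float(y[0] == y[1])])

def log_Z(x):
    """A(theta, x) = log Z(theta, x): a different normalizer for every input x."""
    return np.logaddexp.reduce([theta_f @ phi_f(y, x)
                                for y in product([0, 1], repeat=2)])

for x in [-1.0, 0.5, 2.0]:
    p = {y: np.exp(theta_f @ phi_f(y, x) - log_Z(x)) for y in product([0, 1], repeat=2)}
    print(x, log_Z(x), sum(p.values()))   # normalizer varies with x; p sums to 1
```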

Notice that a linear chain CRF can be described as a factor graph over x and y, i.e.,

p(y \mid x) = \frac{1}{Z(x)} \prod_{t=1}^{T} \Psi_t(y_t, y_{t-1}, x_t), \qquad (2.20)

where each local function \Psi_t has the special log-linear form:

\Psi_t(y_t, y_{t-1}, x_t) = \exp\left\{ \sum_{k=1}^{K} \theta_k f_k(y_t, y_{t-1}, x_t) \right\}. \qquad (2.21)

This will be useful when we move to general CRFs in the next section. Typically we will learn the parameter vector θ from data, as described in Section 5.

Previously we have seen that if the joint p(y, x) factorizes as an HMM, then the associated conditional distribution p(y | x) is a linear-chain CRF. This HMM-like CRF is pictured in Figure 2.5. Other types of linear-chain CRFs are also useful, however. For example, in an HMM, a transition from state i to state j receives the same score, log p(y_t = j | y_{t-1} = i), regardless of the input. In a CRF, we can allow the score of the transition (i, j) to depend on the current observation vector,

simply by adding a feature 1_{\{y_t = j\}} 1_{\{y_{t-1} = i\}} 1_{\{x_t = o\}}. A CRF with this kind of transition feature, which is commonly used in text applications, is pictured in Figure 2.6.

In fact, since CRFs do not represent dependencies among the variables x_1, \ldots, x_T, we can allow the factors \Psi_t to depend on the entire observation vector x without breaking the linear graphical structure — allowing us to treat x as a single monolithic variable. As a result, the feature functions can be written f_k(y_t, y_{t-1}, x) and have the freedom to examine all the input variables x together. This fact applies generally to CRFs and is not specific to linear chains.

CRF Models for Sequential Data

• Direct Extension of HMM:

p(y \mid x) \propto \prod_t \psi_t(y_t, x_t) \, \psi_{t,t+1}(y_t, y_{t+1})
Fig. 2.5 Graphical model of the HMM-like linear-chain CRF from equation (2.17).

• State Transitions depend on Observations:
p(y \mid x) \propto \prod_t \psi_t(y_t, x_t) \, \psi_{t,t+1}(y_t, y_{t+1}, x_t)
Fig. 2.6 Graphical model of a linear-chain CRF in which the transition factors depend on the current observation.

• Arbitrary Non-Local Features:
p(y \mid x) \propto \prod_t \psi_t(y_t, x) \, \psi_{t,t+1}(y_t, y_{t+1}, x)
Fig. 2.7 Graphical model of a linear-chain CRF in which the transition factors depend on all of the observations.

A linear-chain CRF with this structure is shown graphically in Figure 2.7. In this figure we show x = (x_1, \ldots, x_T) as a single large observed node on which all of the factors depend, rather than showing each of the x_1, \ldots, x_T as individual nodes.

To indicate in the definition of linear-chain CRF that each feature function can depend on observations from any time step, we have written the observation argument to f_k as a vector x_t, which should be understood as containing all the components of the global observations x that are needed for computing features at time t. For example, if the CRF uses the next word x_{t+1} as a feature, then the feature vector x_t is assumed to include the identity of word x_{t+1}.

Finally, note that the normalization constant Z(x) sums over all possible state sequences, an exponentially large number of terms. Nevertheless, it can be computed efficiently by the forward–backward algorithm, as we explain in Section 4.1.
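A minimal sketch of that forward recursion for a linear-chain CRF, computing log Z(x) in log space and checking it against brute-force enumeration on a tiny chain; the potentials here are random placeholders rather than learned factors.

```python
import numpy as np
from itertools import product

def crf_log_partition(log_psi_1, log_psi):
    """log Z(x) for a linear-chain CRF via the forward recursion (in log space).
    log_psi_1[j]     = log Psi_1(y_1 = j, x_1)
    log_psi[t, i, j] = log Psi_t(y_t = j, y_{t-1} = i, x_t), for t = 2..T."""
    log_alpha = log_psi_1.copy()
    for t in range(log_psi.shape[0]):
        log_alpha = np.logaddexp.reduce(log_alpha[:, None] + log_psi[t], axis=0)
    return np.logaddexp.reduce(log_alpha)

# Brute-force check on a tiny chain: T = 3 steps, K = 2 states.
rng = np.random.default_rng(0)
K = 2
lp1, lp = rng.normal(size=K), rng.normal(size=(2, K, K))
brute = np.logaddexp.reduce([lp1[a] + lp[0, a, b] + lp[1, b, c]
                             for a, b, c in product(range(K), repeat=3)])
assert np.isclose(crf_log_partition(lp1, lp), brute)
```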

2.4 General CRFs

Now we generalize the previous discussion from a linear-chain to a general graph, following the definition of a CRF as it was originally given.

Because a generative model takes the form p(y, x) = p(y) p(x | y), it is often natural to represent a generative model by a directed graph in which the outputs y topologically precede the inputs. Similarly, we will see that it is often natural to represent a discriminative model by an undirected graph. However, this need not always be the case, and both undirected generative models, such as the Markov random field (2.32), and directed discriminative models, such as the MEMM (6.2), are sometimes used. It can also be useful to depict discriminative models by directed graphs in which the x precede the y.

The relationship between naive Bayes and logistic regression mirrors the relationship between HMMs and linear-chain CRFs. Just as naive Bayes and logistic regression are a generative–discriminative pair, there is a discriminative analogue to the HMM, and this analogue is a particular special case of CRF, as we explain in the next section. This analogy between naive Bayes, logistic regression, generative models, and CRFs is depicted in Figure 2.4.

Fig. 2.4 Diagram of the relationship between naive Bayes, logistic regression, HMMs, linear-chain CRFs, generative models, and general CRFs.

Families of Graphical Models

p(y \mid \theta) = \exp\left\{ \sum_{f \in \mathcal{F}} \theta_f^T \phi_f(y_f) - A(\theta) \right\}
p(y \mid x, \theta) = \exp\left\{ \sum_{f \in \mathcal{F}} \theta_f^T \phi_f(y_f, x) - A(\theta, x) \right\}

• With informative observations (good features), the posterior may have a simpler graph (Markov) structure than the prior

2.3 Linear-chain CRFs

To motivate our introduction of linear-chain CRFs, we begin by considering the conditional distribution p(y | x) that follows from the joint distribution p(y, x) of an HMM. The key point is that this conditional distribution is in fact a CRF with a particular choice of feature functions.

Generative Learning for MRFs

• Undirected graph encodes dependencies within a single training example:
p(\mathcal{D} \mid \theta) = \prod_{n=1}^{N} \frac{1}{Z(\theta)} \prod_{f \in \mathcal{F}} \psi_f(x_{f,n} \mid \theta_f), \qquad \mathcal{D} = \{ x_{\mathcal{V},1}, \ldots, x_{\mathcal{V},N} \}
• Given N independent, identically distributed, completely observed samples:
\log p(\mathcal{D} \mid \theta) = \sum_{n=1}^{N} \left[ \sum_{f \in \mathcal{F}} \theta_f^T \phi_f(x_{f,n}) \right] - N A(\theta)
• Take gradient with respect to the parameters for a single factor:

\nabla_{\theta_f} \log p(\mathcal{D} \mid \theta) = \left[ \sum_{n=1}^{N} \phi_f(x_{f,n}) \right] - N \, \mathbb{E}_\theta[\phi_f(x_f)]

• Must be able to compute prior marginal distributions for factors
• At each gradient step, must solve a single inference problem to find the marginals of the prior distribution
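A brute-force sketch of this gradient for a hypothetical 3-node binary MRF with indicator features on two edges; the exhaustive enumeration stands in for whatever marginal inference the model actually requires.

```python
import numpy as np
from itertools import product

# Hypothetical 3-node binary MRF with pairwise factors on edges (0,1) and (1,2);
# features phi_f(x_f) are indicators of the 4 joint states of each edge, so the
# gradient is (empirical feature counts) - N * (model edge marginals).
edges = [(0, 1), (1, 2)]

def edge_feature(xa, xb):
    phi = np.zeros(4)
    phi[2 * xa + xb] = 1.0
    return phi

def model_edge_marginals(theta):
    """E_theta[phi_f(x_f)] for each factor, by brute-force enumeration."""
    configs = list(product([0, 1], repeat=3))
    logp = np.array([sum(theta[f] @ edge_feature(x[a], x[b])
                         for f, (a, b) in enumerate(edges)) for x in configs])
    p = np.exp(logp - np.logaddexp.reduce(logp))          # normalize by Z(theta)
    return [sum(pi * edge_feature(x[a], x[b]) for pi, x in zip(p, configs))
            for (a, b) in edges]

def log_likelihood_gradient(theta, data):
    """d/d theta_f of log p(D | theta) for every factor f."""
    marg = model_edge_marginals(theta)
    return [sum(edge_feature(x[a], x[b]) for x in data) - len(data) * marg[f]
            for f, (a, b) in enumerate(edges)]

theta = [np.zeros(4), np.zeros(4)]
data = [(0, 0, 1), (1, 1, 1), (0, 1, 1)]
print(log_likelihood_gradient(theta, data))
```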

Discriminative Learning for CRFs

• Undirected graph encodes dependencies within a single training example:
p(y \mid x, \theta) = \prod_{n=1}^{N} \frac{1}{Z(\theta, x_{\mathcal{V},n})} \prod_{f \in \mathcal{F}} \psi_f(y_{f,n} \mid x_{\mathcal{V},n}, \theta_f)
• Given N independent, identically distributed, completely observed samples:
\log p(y \mid x, \theta) = \sum_{n=1}^{N} \left[ \sum_{f \in \mathcal{F}} \theta_f^T \phi_f(y_{f,n}, x_{\mathcal{V},n}) - A(\theta, x_{\mathcal{V},n}) \right]
• Take gradient with respect to the parameters for a single factor:

\nabla_{\theta_f} \log p(y \mid x, \theta) = \sum_{n=1}^{N} \left[ \phi_f(y_{f,n}, x_{\mathcal{V},n}) - \mathbb{E}_\theta[\phi_f(y_f, x_{\mathcal{V},n}) \mid x_{\mathcal{V},n}] \right]

• Must be able to compute conditional marginal distributions for factors
• At each gradient step, must solve N inference problems to find the conditional marginal statistics of the posterior for every training example
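A compact sketch of this conditional gradient for a single-factor CRF: empirical features minus conditional expectations, with one (here brute-force) inference problem per training example. The feature function and toy data are hypothetical.

```python
import numpy as np
from itertools import product

def crf_factor_gradient(theta_f, phi_f, data, num_states=2):
    """Gradient of sum_n [theta_f^T phi_f(y_n, x_n) - A(theta, x_n)] for a CRF
    with a single factor: empirical features minus conditional expectations."""
    grad = np.zeros_like(theta_f)
    for y_n, x_n in data:
        configs = list(product(range(num_states), repeat=len(y_n)))
        scores = np.array([theta_f @ phi_f(y, x_n) for y in configs])
        p = np.exp(scores - np.logaddexp.reduce(scores))    # p(y | x_n, theta)
        expected = sum(pi * phi_f(y, x_n) for pi, y in zip(p, configs))
        grad += phi_f(y_n, x_n) - expected                   # one inference per example
    return grad

# Hypothetical toy setup: y = (y_1, y_2) binary, x a scalar, simple joint features.
phi = lambda y, x: np.array([y[0] * x, y[1] * x, float(y[0] == y[1])])
data = [((1, 1), 0.5), ((0, 1), -1.0)]
print(crf_factor_gradient(np.zeros(3), phi, data))
```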

Convex Conditional Likelihood Surrogates

• To train CRF models on graphs with cycles:
\log p(y \mid x, \theta) = \sum_{n=1}^{N} \left[ \sum_{f \in \mathcal{F}} \theta_f^T \phi_f(y_{f,n}, x_{\mathcal{V},n}) - A(\theta, x_{\mathcal{V},n}) \right]
\log p(y \mid x, \theta) \geq \sum_{n=1}^{N} \left[ \sum_{f \in \mathcal{F}} \theta_f^T \phi_f(y_{f,n}, x_{\mathcal{V},n}) - B(\theta, x_{\mathcal{V},n}) \right]
for a convex bound satisfying A(\theta, x) \leq B(\theta, x)
• Apply the tree-reweighted Bethe variational bound:
\nabla_{\theta_f} \log p(y \mid x, \theta) \approx \sum_{n=1}^{N} \left[ \phi_f(y_{f,n}, x_{\mathcal{V},n}) - \mathbb{E}_\tau[\phi_f(y_f, x_{\mathcal{V},n}) \mid x_{\mathcal{V},n}] \right]
• Gradients depend on expectations of pseudo-marginals produced by applying tree-reweighted BP with the current model parameters, for features of every training example

Inference in Graphical Models

x_E: observed evidence variables (subset of nodes)

x_F: unobserved query nodes we'd like to infer
x_R: remaining variables, extraneous to this query but part of the given graphical representation

p(x_E, x_F) = \sum_{x_R} p(x_E, x_F, x_R), \qquad R = \mathcal{V} \setminus \{E, F\}

Maximum a Posteriori (MAP) Estimates:
\hat{x}_F = \arg\max_{x_F} p(x_F \mid x_E) = \arg\max_{x_F} p(x_E, x_F)

Posterior Marginal Densities:
p(x_F \mid x_E) = \frac{p(x_E, x_F)}{p(x_E)}, \qquad p(x_E) = \sum_{x_F} p(x_E, x_F)

Provides Bayesian estimators, confidence measures, and sufficient statistics for iterative parameter estimation

Global versus Local MAP Estimation

Maximum a Posteriori (MAP) Estimates:
\hat{x}_F = \arg\max_{x_F} p(x_F \mid x_E) = \arg\max_{x_F} p(x_E, x_F)

Maximizer of Posterior Marginals (MPM) Estimates:
p(x_s \mid x_E) = \sum_{x_{F \setminus s}} p(x_F \mid x_E), \qquad \hat{x}_s = \arg\max_{x_s} p(x_s \mid x_E)

MAP and MPM estimators are not equivalent

Example: X \in \{1, 2, 3\}, Y \in \{1, 2, 3\}

  x   p(x)   p(y=1 | x)   p(y=2 | x)   p(y=3 | x)
  1   0.6    0            0.6          0.4
  2   0.2    1             0            0
  3   0.2    1             0            0

  p(y) = (0.4, 0.36, 0.24)

MPM estimate: (x, y) = (1, 1)        MAP estimate: (x, y) = (1, 2)
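A quick numerical check of the table: the MPM configuration maximizes each marginal separately but has zero joint probability, while the MAP configuration maximizes the joint.

```python
import numpy as np

px = np.array([0.6, 0.2, 0.2])
py_given_x = np.array([[0.0, 0.6, 0.4],
                       [1.0, 0.0, 0.0],
                       [1.0, 0.0, 0.0]])
pxy = px[:, None] * py_given_x                 # joint p(x, y) from the table

x_mpm = pxy.sum(axis=1).argmax() + 1           # argmax p(x)                 -> 1
y_mpm = pxy.sum(axis=0).argmax() + 1           # argmax p(y) = (.4,.36,.24)  -> 1
x_map, y_map = np.unravel_index(pxy.argmax(), pxy.shape)

print((x_mpm, y_mpm), (x_map + 1, y_map + 1))  # (1, 1) vs (1, 2)
print(pxy[x_mpm - 1, y_mpm - 1])               # 0.0: the MPM pair is impossible jointly
```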

MAP in Directed Graphs

[Figure: directed graphical model on nodes X_1, ..., X_6.]

A MAP Elimination Algorithm

Algebraic Maximization Operations
• Determine the maximal setting of the variable being eliminated, for every possible configuration of its neighbors
• Compute a new potential table involving all other variables which depend on the just-eliminated variable

Graph Manipulation Operations
• Remove, or eliminate, a single node from the graph
• Add edges (if they don't already exist) between all pairs of nodes that were neighbors of the just-removed node

A Graph Elimination Algorithm
• Choose an elimination ordering
• Eliminate a node, remove its incoming edges, add edges between all pairs of its neighbors
• Iterate until all non-observed nodes are eliminated
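A small sketch of the graph-manipulation half of the algorithm: eliminate nodes in order, add fill-in edges among each node's remaining neighbors, and record the elimination cliques whose sizes drive the cost. The adjacency below is an illustrative 6-node graph, not necessarily the one in the lecture's figures.

```python
def eliminate(adjacency, order):
    """Graph elimination: remove each node in turn, connect all pairs of its
    remaining neighbors (fill-in edges), and record the elimination cliques."""
    adj = {v: set(nbrs) for v, nbrs in adjacency.items()}
    cliques = []
    for v in order:
        nbrs = adj.pop(v)
        cliques.append({v} | nbrs)                     # elimination clique for v
        for u in nbrs:
            adj[u] = (adj[u] | nbrs) - {u, v}          # make v's neighbors a clique
    return cliques

# Illustrative undirected graph on nodes 1..6, eliminated in order (6, 5, 4, 3, 2, 1).
graph = {1: {2, 3}, 2: {1, 4, 6}, 3: {1, 5}, 4: {2}, 5: {3, 6}, 6: {2, 5}}
cliques = eliminate(graph, order=[6, 5, 4, 3, 2, 1])
print(cliques, max(len(c) for c in cliques) - 1)       # largest clique size minus one
```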

Graph Elimination Example
Elimination Order: (6, 5, 4, 3, 2, 1)

[Figure: panels (a)-(g) showing the graph on nodes X_1, ..., X_6 after each elimination step.]

(f) (g) (f) (g) Elimination Algorithm Complexity X 4

X 2

X 6 X1

X 3 X 5

• Elimination cliques: sets of neighbors of eliminated nodes
• Maximization cost: exponential in the number of variables in each elimination clique (dominated by the largest clique)
• Treewidth of a graph: over all possible elimination orderings, the smallest possible max-elimination-clique size, minus one
• NP-hard: finding the best elimination ordering for an arbitrary input graph (but heuristic algorithms are often effective)

The Generalized Distributive Law

• A commutative semiring is a pair of generalized “multiplication” and “addition” operations which satisfy:
  Commutative:  a + b = b + a,  a \cdot b = b \cdot a
  Associative:  a + (b + c) = (a + b) + c,  a \cdot (b \cdot c) = (a \cdot b) \cdot c
  Distributive: a \cdot (b + c) = a \cdot b + a \cdot c
  (Why not a ring? There may be no additive/multiplicative inverses.)
• Examples:
  “Addition”    “Multiplication”
  sum           product
  max           product
  max           sum
  min           sum
• For each of these cases, our factorization-based dynamic programming derivation of belief propagation is still valid
• Leads to max-product and min-sum belief propagation algorithms for exact MAP estimation in trees
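A small sketch of the same chain-elimination recursion run under two different semirings: with ordinary addition it yields the partition function, with max it yields the value of the MAP configuration. The random potentials are placeholders.

```python
import numpy as np
from itertools import product

def chain_eliminate(unary, pairwise, add):
    """Eliminate x_T, ..., x_2 from a chain of positive factors, using a generic
    semiring 'addition' (np.sum -> sum-product, np.max -> max-product)."""
    msg = np.ones(unary.shape[1])
    for t in range(len(unary) - 1, 0, -1):
        msg = add(pairwise[t - 1] * unary[t] * msg, axis=1)   # combine, then "sum out" x_t
    return unary[0] * msg

rng = np.random.default_rng(0)
T, K = 4, 3
unary = rng.random((T, K)) + 0.1
pairwise = rng.random((T - 1, K, K)) + 0.1

def joint(x):
    return np.prod([unary[t, x[t]] for t in range(T)]) * \
           np.prod([pairwise[t, x[t], x[t + 1]] for t in range(T - 1)])

all_x = list(product(range(K), repeat=T))
assert np.isclose(chain_eliminate(unary, pairwise, np.sum).sum(),
                  sum(joint(x) for x in all_x))                 # partition function
assert np.isclose(chain_eliminate(unary, pairwise, np.max).max(),
                  max(joint(x) for x in all_x))                 # MAP value
```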

Belief Propagation (Max-Product)

Max-Marginals:
\hat{p}_t(x_t) \propto \psi_t(x_t) \prod_{u \in \Gamma(t)} m_{ut}(x_t)
\hat{p}_t(x_t) \propto \max_{x} p(x) \, \mathbb{I}(X_t = x_t)

Messages:
m_{ts}(x_s) \propto \max_{x_t} \psi_{st}(x_s, x_t) \, \psi_t(x_t) \prod_{u \in \Gamma(t) \setminus s} m_{ut}(x_t)

• If the MAP estimate is unique, find it via the arg-max of each max-marginal independently
• Otherwise, must backtrack from some chosen root node
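A minimal sketch of max-product BP on a chain (a special case of a tree): backward and forward messages give max-marginals at every node, and a backtracking pass from a chosen root recovers one MAP configuration. Potentials are random placeholders.

```python
import numpy as np

def max_product_chain(unary, pairwise):
    """Max-product BP on a chain with pairwise[t] linking (x_t, x_{t+1}).
    Returns max-marginals for every node and one MAP configuration, found by
    backtracking from node 0 (needed when the maximizer is not unique)."""
    T, K = unary.shape
    bwd = np.ones((T, K))                        # message from t+1 into t
    back = np.zeros((T, K), dtype=int)
    for t in range(T - 2, -1, -1):
        scores = pairwise[t] * (unary[t + 1] * bwd[t + 1])
        bwd[t], back[t] = scores.max(axis=1), scores.argmax(axis=1)

    fwd = np.ones((T, K))                        # message from t-1 into t
    for t in range(1, T):
        fwd[t] = (pairwise[t - 1] * (unary[t - 1] * fwd[t - 1])[:, None]).max(axis=0)

    max_marginals = unary * fwd * bwd            # proportional to max_x p(x) I(X_t = x_t)

    x = np.zeros(T, dtype=int)
    x[0] = max_marginals[0].argmax()             # choose a root, then backtrack
    for t in range(1, T):
        x[t] = back[t - 1, x[t - 1]]
    return max_marginals, x

rng = np.random.default_rng(1)
mm, x_map = max_product_chain(rng.random((5, 3)) + 0.1, rng.random((4, 3, 3)) + 0.1)
print(x_map, mm.argmax(axis=1))                  # with a unique MAP these agree
```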