Generative and Discriminative Approaches to Graphical Models CMSC 35900 Topics in AI Lecture 2

Yasemin Altun

January 22, 2007

Graphical Models

A framework for multivariate statistical models. Dependency between variables is encoded in the graph structure. A graph is associated with a family of probability distributions, characterized by
  conditional independencies
  numerical representations
It captures a joint probability distribution from a set of local pieces.
2 types:
  Bayesian Networks: graphical models with DAGs
  Markov Random Fields: graphical models with undirected graphs

Directed Graphical Models: Bayes(ian) Net(work)s

G(V, E): V a set of nodes, E a set of directed edges; G is acyclic. Each node is associated with a random variable X_i with assignments x_i. π_i: parents of node i; ν_i: non-descendants of node i.

Consider directed acyclic graphs over n variables. Each node has a (possibly empty) set of parents π_i and maintains a function f_i(x_i; x_{π_i}) such that f_i > 0 and Σ_{x_i} f_i(x_i; x_{π_i}) = 1 for all π_i. Define the joint probability to be

  P(x_1, x_2, ..., x_n) = ∏_i f_i(x_i; x_{π_i})

Even with no further restriction on the f_i, it is always true that f_i(x_i; x_{π_i}) = P(x_i | x_{π_i}), so we will just write

  P(x_1, x_2, ..., x_n) = ∏_i P(x_i | x_{π_i})

This is a factorization of the joint in terms of local conditional probabilities: exponential in the "fan-in" of each node instead of in the total number of variables n.

Example DAG (six node network):

  P(x_1, x_2, x_3, x_4, x_5, x_6) = P(x_1) P(x_2|x_1) P(x_3|x_1) P(x_4|x_2) P(x_5|x_3) P(x_6|x_2, x_5)

Local Markov property: a node is conditionally independent of its non-descendants given its parents,

  I_l(G) = { X_i ⊥ X_{ν_i} | X_{π_i} }
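The factorization can be evaluated directly. Below is a minimal sketch (not from the lecture) of the six-node example with binary variables; the CPT numbers are made up purely for illustration.

import itertools

P1 = {0: 0.6, 1: 0.4}                                    # P(x1)
P2 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}          # P(x2 | x1)
P3 = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.9, 1: 0.1}}          # P(x3 | x1)
P4 = {0: {0: 0.4, 1: 0.6}, 1: {0: 0.8, 1: 0.2}}          # P(x4 | x2)
P5 = {0: {0: 0.3, 1: 0.7}, 1: {0: 0.6, 1: 0.4}}          # P(x5 | x3)
P6 = {(a, b): {0: 0.5 + 0.1 * a - 0.2 * b, 1: 0.5 - 0.1 * a + 0.2 * b}
      for a in (0, 1) for b in (0, 1)}                   # P(x6 | x2, x5)

def joint(x1, x2, x3, x4, x5, x6):
    """Joint probability as the product of the local conditionals."""
    return (P1[x1] * P2[x1][x2] * P3[x1][x3] *
            P4[x2][x4] * P5[x3][x5] * P6[(x2, x5)][x6])

# Sanity check: the local conditionals define a normalized joint distribution.
assert abs(sum(joint(*xs) for xs in itertools.product((0, 1), repeat=6)) - 1.0) < 1e-9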


Conditional Independence in DAGs Missing Edges

If we order the nodes in a directed graphical model so that parents always come before their children in the ordering, then the graphical model implies the following about the distribution:

  { x_i ⊥ x_{π̃_i} | x_{π_i} }  for all i,

where x_{π̃_i} are the nodes coming before x_i that are not its parents. In other words, the DAG is telling us that each variable is conditionally independent of its non-descendants given its parents. Such an ordering is called a "topological" ordering.

Key point about directed graphical models: missing edges imply conditional independence. Remember that by the chain rule we can always write the full joint as a product of conditionals, given an ordering:

  P(x_1, x_2, x_3, x_4, ...) = P(x_1) P(x_2|x_1) P(x_3|x_1, x_2) P(x_4|x_1, x_2, x_3) ...

If the joint is represented by a DAGM, then some of the conditioned variables on the right hand sides are missing. This is equivalent to enforcing conditional independence. Start with the "idiot's graph": each node has all previous nodes in the ordering as its parents; now remove edges to get your DAG. Removing an edge into node i eliminates an argument from the conditional probability factor p(x_i | x_1, x_2, ..., x_{i-1}).

Bayes Nets

Defn: P factorizes wrt G if the joint probability

  P(X_{1:n}) = ∏_{i=1}^n P(X_i | X_{π_i})

Thm: If Il (G) ⊆ I(P), then P factorizes according to G. Proof:

  P(X_{1:n}) = P(X_1) P(X_2|X_1) P(X_3|X_1, X_2) ...
             = ∏_{i=1}^N P(X_i | X_{1:i-1}) = ∏_{i=1}^N P(X_i | X_{π_i}, X_{{1:i-1} \ π_i}) = ∏_{i=1}^N P(X_i | X_{π_i})

where the last step uses the local Markov property: the predecessors that are not parents are non-descendants, so they can be dropped from the conditioning set.

Infer global independencies I(G) using the Bayes Ball algorithm.
Thm: If P factorizes wrt G, then I(G) ⊆ I(P).
Thm: If (X ⊥ Y | Z) ∉ I(G), there is some distribution P that factorizes wrt G in which X is not independent of Y given Z.

Markov Random Fields (MRFs)

G(V, E): V a set of nodes, E a set of undirected edges.
Global Markov property: a node i is conditionally independent of its non-neighbors given its neighbors N_i.
Separation: X_A ⊥ X_C | X_B if every path from a node in X_A to a node in X_C includes at least one node in X_B.
Clique c ∈ C: a fully connected subset of nodes.

  P(X_{1:N}) = (1/Z) ∏_{c∈C} ψ_c(x_c),    Z = Σ_X ∏_c ψ_c(x_c)
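A small sketch (not from the lecture) that evaluates this factorization by brute force on a 3-node chain A - B - C with binary variables; the potentials are arbitrary made-up positive numbers, and the check at the end illustrates the global Markov property (A ⊥ C | B).

import itertools

psi_AB = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}
psi_BC = {(0, 0): 1.5, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0}

def unnorm(a, b, c):
    return psi_AB[(a, b)] * psi_BC[(b, c)]

Z = sum(unnorm(a, b, c) for a, b, c in itertools.product((0, 1), repeat=3))
P = {(a, b, c): unnorm(a, b, c) / Z for a, b, c in itertools.product((0, 1), repeat=3)}

# Global Markov property on the chain: A and C are independent given B.
for b in (0, 1):
    pb = sum(P[(a, b, c)] for a in (0, 1) for c in (0, 1))
    for a in (0, 1):
        for c in (0, 1):
            joint_ac = P[(a, b, c)] / pb
            marg = (sum(P[(a, b, cc)] for cc in (0, 1)) / pb) * \
                   (sum(P[(aa, b, c)] for aa in (0, 1)) / pb)
            assert abs(joint_ac - marg) < 1e-9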

Even more structure

C is the set of maximal cliques; the ψ_c are potential (compatibility) functions (positive, not probabilistic); Z is the partition function, which ensures Σ_x P(x) = 1.

Undirected models: graphs with one node per random variable and edges that connect pairs of nodes, but now the edges are undirected. Semantics: every node is conditionally independent of its non-neighbours given its neighbours, i.e. x_A ⊥ x_C | x_B if every path between x_A and x_C goes through x_B. Such models can represent symmetric interactions that directed models cannot. Also known as Markov Random Fields, Markov Networks, Boltzmann Machines, Spin Glasses, Ising Models.

Surprisingly, once you have specified the basic conditional independencies, there are other ones that follow from those. In general it is a hard problem to say which extra CI statements follow from a basic set; however, in the case of DAGMs we have an efficient way of generating all CI statements that must be true given the connectivity of the graph. This involves the idea of d-separation in a graph. Notice that for specific (numerical) choices of factors at the nodes there may be even more conditional independencies, but we are only concerned with statements that are always true of every member of the family of distributions, no matter what specific factors live at the nodes. Remember: the graph alone represents a family of joint distributions consistent with its CI assumptions, not any specific distribution.

Explaining Away

  P(x, y, z) = P(x) P(z) P(y | x, z)

Q: When we condition on y, are x and z independent? x and z are marginally independent, but given y they are conditionally dependent. This important effect is called explaining away (Berkson's paradox). For example, flip two coins independently; let x = coin1, z = coin2, and let y = 1 if the coins come up the same and y = 0 if different. x and z are independent, but if I tell you y, they become coupled!

Simple Graph Separation

In undirected models, simple graph separation (as opposed to d-separation) tells us about conditional independencies: x_A ⊥ x_C | x_B if every path between x_A and x_C is blocked by some node in x_B. "Markov Ball": remove x_B and see if there is any path from x_A to x_C.

Hammersley Clifford Theorem

For positive distributions, the following two characterizations are equivalent:

Markov: X_A ⊥ X_B | X_S whenever S separates A from B
Factorization (Gibbs):

  p(x) = (1/Z) ∏_{C∈C} ψ_C(x_C)

Proof: Factorization ⇒ Markov. Show p(x_A | x_B, x_S) = p(x_A | x_S).

  p(x) = (1/Z) ∏_{C ⊆ A∪S} ψ_C(x_C) ∏_{C ⊆ B∪S, C ⊄ S} ψ_C(x_C)


The negative log potential H_C(x_C) = -log ψ_C(x_C) is referred to as the energy function, from its connections to statistical mechanics in physics; we then have the nice interpretation that the most likely configurations of our system have lower energies.

As in the directed case, we have two ways of characterizing the set of probability distributions represented by the graph:

1. Markov: X_A ⊥ X_B | X_S whenever S separates A from B
2. Factorization: p(x) = (1/Z) ∏_{C∈C} ψ_C(x_C), where Z = Σ_{x_1,...,x_n} ∏_{C∈C} ψ_C(x_C)

Theorem (Hammersley-Clifford, 1973). For strictly positive distributions, the two characterizations are equivalent.

Proof: For Markov ⇒ Factorization, see §4.5.3 of chapter 16 for a proof using the Möbius Inversion Formula.

Let us show Factorization ⇒ Markov. Let S separate A and B. We want to show that p(X_A | X_B, X_S) = p(X_A | X_S).

Claim: We can write

  p(x) = (1/Z) ∏_{C ⊆ A∪S} ψ_C(x_C) ∏_{C ⊆ B∪S, C ⊄ S} ψ_C(x_C),

because S separates A and B and so all the cliques are included in this product. The condition C ⊄ S was added in order not to double count the cliques which are wholly included in S.

We thus have:

  p(x_A | x_B, x_S) = p(x_A, x_B, x_S) / p(x_B, x_S)
    = [ ∏_{C⊆A∪S} ψ_C(x_C) ∏_{C⊆B∪S, C⊄S} ψ_C(x_C) ] / [ Σ_{x_A} ∏_{C⊆A∪S} ψ_C(x_C) ∏_{C⊆B∪S, C⊄S} ψ_C(x_C) ]
    = ∏_{C⊆A∪S} ψ_C(x_C) / Σ_{x_A} ∏_{C⊆A∪S} ψ_C(x_C)

(the factors over C ⊆ B∪S, C ⊄ S do not depend on x_A and cancel), and

  p(x_A | x_S) = p(x_A, x_S) / p(x_S)
    = [ Σ_{x_B} ∏_{C⊆B∪S, C⊄S} ψ_C(x_C) ∏_{C⊆A∪S} ψ_C(x_C) ] / [ Σ_{x_B} ∏_{C⊆B∪S, C⊄S} ψ_C(x_C) Σ_{x_A} ∏_{C⊆A∪S} ψ_C(x_C) ]
    = ∏_{C⊆A∪S} ψ_C(x_C) / Σ_{x_A} ∏_{C⊆A∪S} ψ_C(x_C)
    = p(x_A | x_B, x_S)

Conversion of MRFs and Bayes Nets

Cannot always convert MRFs to Bayes nets and vice versa:
(a) No directed model can represent x ⊥ y | {w, z} and w ⊥ z | {x, y} and only those independencies (the four-cycle x - w - y - z - x).
(b) No undirected model can represent x ⊥ y and only that independency (the v-structure x → z ← y).

Probability Tables and CPTs: for discrete (categorical) variables, the most basic parametrization is the probability table which lists p(x = k-th value). Since probability tables must be nonnegative and sum to 1, for k-ary nodes there are k - 1 free parameters. If a discrete node has discrete parent(s), we make one table for each setting of the parents: this is a conditional probability table or CPT.

What's Inside the Nodes/Cliques?

We've focused a lot on the structure of the graphs in directed and undirected models. Now we'll look at specific functions that can live inside the nodes (directed) or on the cliques (undirected). For directed models we need prior functions p(x_i) for root nodes and parent-conditionals p(x_i | x_{π_i}) for interior nodes. For undirected models we need clique potentials ψ_C(x_C) on the maximal cliques (or log potentials / energies H_C(x_C)). We'll consider various types of nodes: binary/discrete (categorical), continuous, interval, and integer counts. We'll see some basic probability models (parametrized families of distributions); these live inside nodes of directed models. We'll also see a variety of potential/energy functions which take multiple node values as arguments and return a scalar compatibility; these live on the cliques of undirected models.

Exponential Family

For a numeric x,

  p(x | η) = h(x) exp{ η'T(x) - A(η) } = (1/Z(η)) h(x) exp{ η'T(x) }

is an exponential family distribution with natural parameter η. T(x) is a sufficient statistic and A(η) = log Z(η) is the log normalizer. Key idea: all you need to know about the data in order to estimate parameters is captured in the summarizing function T(x). Examples: Bernoulli, binomial/geometric/negative-binomial, Poisson, gamma, multinomial, Gaussian, ...

Elimination, Moralization and Sum-Product Algorithm

  p(x_1) ∝ Σ_{x_2, x_3, x_4} ψ(x_1, x_2) ψ(x_1, x_3) ψ(x_2, x_4) ψ(x_3, x_4)

         = Σ_{x_2, x_3} ψ(x_1, x_2) ψ(x_1, x_3) m_4(x_2, x_3)

         = Σ_{x_2} ψ(x_1, x_2) Σ_{x_3} ψ(x_1, x_3) m_4(x_2, x_3)

         = Σ_{x_2} ψ(x_1, x_2) m_3(x_1, x_2)

         = m_2(x_1)   ←  last message

  so  p(x_1) = m_2(x_1) / Σ_{x_1} m_2(x_1)

During the elimination process, a new edge between x_2 and x_3 is created (i.e., a message containing both x_2 and x_3), which is shown in figure 1. Note that in undirected graphical models, we do not need to compute the normalization term explicitly; instead we can use the final message to quickly compute the term Z for all marginal probabilities.
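A sketch of the elimination above for binary x_1..x_4, with one made-up pairwise potential reused on every edge; it reproduces the messages m_4, m_3, m_2 and checks the result against brute-force marginalization.

import itertools

def psi(a, b):            # hypothetical symmetric pairwise potential, same on every edge
    return 2.0 if a == b else 1.0

vals = (0, 1)

# m4(x2, x3) = sum_x4 psi(x2, x4) psi(x3, x4)
m4 = {(x2, x3): sum(psi(x2, x4) * psi(x3, x4) for x4 in vals)
      for x2 in vals for x3 in vals}
# m3(x1, x2) = sum_x3 psi(x1, x3) m4(x2, x3)
m3 = {(x1, x2): sum(psi(x1, x3) * m4[(x2, x3)] for x3 in vals)
      for x1 in vals for x2 in vals}
# m2(x1) = sum_x2 psi(x1, x2) m3(x1, x2)
m2 = {x1: sum(psi(x1, x2) * m3[(x1, x2)] for x2 in vals) for x1 in vals}

Z = sum(m2.values())                      # the final message also yields Z
p_x1 = {x1: m2[x1] / Z for x1 in vals}

# Check against brute-force marginalization of the unnormalized joint.
brute = {x1: sum(psi(x1, x2) * psi(x1, x3) * psi(x2, x4) * psi(x3, x4)
                 for x2, x3, x4 in itertools.product(vals, repeat=3))
         for x1 in vals}
assert all(abs(p_x1[x1] - brute[x1] / sum(brute.values())) < 1e-12 for x1 in vals)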

Moralization: Transferring a Bayes net to an MRF

Marry the parents of the nodes in the directed graph and convert the edges from directed to undirected edges. This yields fewer independence relations, hence a larger class of distributions.

Moralization is a transformation from a directed graphical model to an undirected graphical model. In brief, for each node in a directed graph, add an undirected edge between all pairs of parents of that node that are not already connected. Then change all edges in the graph from directed to undirected, as in figure 2.

(Figure 2)

Moralization captures the concept that we need to pay the expense of computing the term involving the parents of a child node while doing elimination.

From Bayes net to Markov net: assign each conditional probability distribution to one clique potential that contains it, e.g.

  P(U, X, Y, Z) = (1/Z) ψ(U, X) × ψ(X, Y, Z)
                = (1/Z) P(U) P(X|U) × P(Y) P(Z|X, Y)
                = P(X, U) × P(Z|X, Y) P(Y)    (so Z = 1)

Defn: the moral graph H(G) of a DAG is constructed by adding undirected edges between any pair of disconnected ("unmarried") nodes X, Y that are parents of a child Z, and then dropping all remaining arrows.

Thm 5.7.5: The moral graph H(G) is the minimal I-map for the Bayes net G.


From Markov nets to Bayes nets / Graph triangulation

Defn: A Bayes net G is an I-map for a Markov net H if I(G) ⊆ I(H).

We can construct a directed I-map by choosing a node ordering and then picking the parents of node X_i as the subset U that renders X_i independent of its other predecessors X_1, ..., X_{i-1}. E.g., when we add C, the ancestors are A, B; since C is not independent of B given A, we need to add an edge from B to C.

The example above showed how we added extra edges to the DAG so that the largest loop was a 3-cycle (triangle).

Defn: An undirected graph is called chordal or triangulated if every loop X_1 - X_2 - ... - X_k - X_1 for k ≥ 4 has a chord, i.e., an edge connecting X_i and X_j for i, j non-adjacent.

Defn: a directed graph is chordal if its underlying undirected graph is chordal.
Different orderings may induce different edges.
Thm 5.7.15: If G is a minimal I-map for a Markov net H, then G is chordal.

Clique potentials

We decompose the joint distribution into potential functions, which brings advantages:
  fewer parameters (less data needed)
  computational feasibility
Note that a Bayes net is a special case of an MRF with the CPDs as potential functions and Z = 1.
What do the potential functions look like?

Define a feature function f_i(x_{C_i}), for C_i ⊆ C, and a scalar parameter θ_i for each feature. Potential function:

  ψ_C(x_C) = exp( Σ_{i ∈ I_C} θ_i f_i(x_{C_i}) )

e.g., NER: f_1(x_1, x_2) = δ(x_1 = Mr., x_2 = Brown)
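A sketch of such a feature-based clique potential ψ_C(x_C) = exp(Σ_i θ_i f_i(x_C)), using the NER-style indicator above plus a second hypothetical feature; the weights are made up.

import math

def f1(x1, x2):                       # indicator feature delta(x1 = "Mr.", x2 = "Brown")
    return 1.0 if (x1, x2) == ("Mr.", "Brown") else 0.0

def f2(x1, x2):                       # hypothetical extra feature: x2 is capitalized
    return 1.0 if x2[:1].isupper() else 0.0

theta = {f1: 1.5, f2: 0.3}            # one scalar parameter per feature

def psi(x1, x2):
    """Clique potential: exponential of the weighted feature sum."""
    return math.exp(sum(w * f(x1, x2) for f, w in theta.items()))

print(psi("Mr.", "Brown"), psi("the", "brown"))   # the matched pair gets a larger potential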

Log-linear Models

We can parameterize each clique potential (factor) ψ_c(x_c) as follows: define a feature function f_i(x_{c_i}), where C_i ⊆ C is a subset of the variables in C, associate a scalar weight θ_i with each such feature, and define

  ψ_c(x_c) = exp( Σ_{i ∈ I_C} θ_i f_i(x_{c_i}) )

E.g., for the spelling model, f_1(x_1, x_2, x_3) = δ(x_{1:3} = ing), f_2(x_1, x_2, x_3) = δ(x_{1:2} = qu), etc. We can infer the graph structure from the features by connecting all the variables that are mentioned in the same function.

The overall distribution is just a log-linear model (exponential family):

  P(x | θ) = (1/Z(θ)) ∏_{c∈C} ψ_c(x_c)
           = (1/Z(θ)) exp( Σ_{c∈C} Σ_{i∈I_C} θ_i f_i(x_{c_i}) )
           = (1/Z(θ)) exp( Σ_{i∈I} θ_i f_i(x_{c_i}) )


Log-linear models Structured CPDs

  P(x | θ) = (1/Z(θ)) exp( Σ_{i∈I} θ_i f_i(x_{c_i}) )

This form is completely general: by defining one indicator feature for every possible value of x_{c_i}, we can associate a separate parameter with each cell in the multi-dimensional array representing ψ_{c_i}. For Gaussians, we can use features f_{ij}(x_i, x_j) = x_i x_j for every pair of connected nodes, f_i(x_i) = x_i for every single node, and f_0 = 1 as a constant term:

  P(x_{1:n}) = (1/Z) e^{-H(x)},    H(x) = Σ_{ij} V_{ij} x_i x_j + Σ_i α_i x_i + C

Structured CPDs: so far we have discussed how to represent potentials/factors using a number of parameters that is less than exponential in the number of nodes in the clique. Analogous techniques give compact representations of conditional probability distributions, starting with compact representations for unconditional distributions, i.e., nodes with no parents.

Log-linear Models

This characterization of clique potentials leads to log-linear models, i.e. exponential families:

  p(x | θ) = h(x) exp( ⟨θ, T(x)⟩ - A(θ) )

T(x): sufficient statistics; θ: natural parameters; A(θ) = log Z(θ): the log-partition function.
Examples: Bernoulli (x ∈ {0, 1}), Poisson (counts), Multinomial (discrete with K values), Gaussian (continuous).

Multinomial Distribution

For a discrete random variable with K possible values, let π_k be the probability of the k-th value. Define x = (x_1, ..., x_K) with x_k = 1 if the variable has value k, and 0 otherwise.

  p(x | π) = ∏_{i=1}^K π_i^{x_i} = exp( Σ_i x_i log π_i )

For multiple observations of this variable, X = {x^1, ..., x^N}, the joint probability can be written in terms of the observed counts of each value:

  p(X | π) = ∏_{n=1}^N p(x^n | π) = exp( Σ_i (Σ_n x_i^n) log π_i )

where C_i = Σ_n x_i^n is the observed count of value i.

Multinomial Distribution

Enforce the constraint Σ_k π_k = 1 by setting π_K = 1 - Σ_{i=1}^{K-1} π_i:

  p(x | π) = exp( Σ_{i=1}^{K-1} x_i log(π_i / π_K) + log π_K )

  h(x) = 1,   θ_i = log(π_i / π_K),   T(x_i) = x_i,   A(θ) = log( Σ_{i=1}^K e^{θ_i} )

Softmax function: π_i = e^{θ_i} / Σ_j e^{θ_j}
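A small sketch of this reparametrization: from probabilities π to natural parameters θ_i = log(π_i / π_K) and back via the softmax; the π values are made up.

import numpy as np

pi = np.array([0.2, 0.5, 0.3])                   # K = 3, sums to 1
theta = np.log(pi / pi[-1])                      # theta_K = 0 by construction

A = np.log(np.sum(np.exp(theta)))                # log-normalizer A(theta) = -log(pi_K)
pi_back = np.exp(theta) / np.sum(np.exp(theta))  # softmax recovers pi

assert np.allclose(pi, pi_back)
assert np.isclose(A, -np.log(pi[-1]))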

Why exponential families?

Shannon's entropy: H(p) = -Σ_x p(x) log p(x). It is maximized by the uniform distribution, i.e. the least informative one. There is a duality between maximizing Shannon's entropy subject to data constraints and maximizing the log-likelihood.

Choose p̂ = arg max_p H(p) subject to E_{x∼p}[φ_α(x)] = μ̂_α. The maximum is unique. Write down the Lagrangian (converting the constrained problem to an unconstrained one), take the derivative and equate it to 0:

  L(p, η, γ) = H(p) - Σ_{α∈I} η_α [ μ̂_α - Σ_x p(x) φ_α(x) ] + γ [ 1 - Σ_x p(x) ]

We have a unique optimum when ∂L/∂p = 0, ∂L/∂η_α = 0 and ∂L/∂γ = 0. With some algebra:

  ∂L/∂p(x) = -1 - log p(x) + Σ_{α∈I} η_α φ_α(x) - γ = 0

so that

  p(x; η) ∝ exp( Σ_{α∈I} η_α φ_α(x) )
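A small sketch (not from the notes) illustrating the duality: fitting the exponential family p(x; η) ∝ exp(Σ_α η_α φ_α(x)) by matching the empirical moments μ̂ solves the maximum-entropy problem above. Toy setup with support {0, 1, 2}, a single hypothetical feature φ(x) = x, and a made-up target moment.

import numpy as np

xs = np.array([0.0, 1.0, 2.0])      # support
phi = xs                            # single feature phi(x) = x
mu_hat = 1.3                        # target (empirical) moment E[phi]

eta = 0.0
for _ in range(500):                # gradient ascent on the concave dual
    p = np.exp(eta * phi)
    p /= p.sum()
    grad = mu_hat - np.dot(p, phi)  # gradient = empirical moment - model moment
    eta += 0.5 * grad

p = np.exp(eta * phi); p /= p.sum()
assert abs(np.dot(p, phi) - mu_hat) < 1e-6   # moments matched at the optimum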

Inference

How can we use these models to efficiently answer probabilistic queries, i.e. estimate the values of hidden variables from observed ones?
From causes to effects: given the parent node, how likely are we to observe the child node? Read off the conditional probability.
From effects to causes: given the child node, how do we infer an ancestor? Use Bayes rule:

  P(x_Q | x_E) = Σ_{x_M} p(x_E, x_Q, x_M) / Σ_{x_Q, x_M} p(x_E, x_Q, x_M)

Can we compute these sums efficiently?
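For reference, a sketch of this query computed by brute-force enumeration (the exponential baseline that the elimination algorithms below improve on); the joint used here is an arbitrary made-up unnormalized function of a query, an evidence and a marginal variable.

vals = (0, 1)

def joint(q, e, m):                         # hypothetical unnormalized joint p(x_Q, x_E, x_M)
    return 1.0 + q + 2 * e * m + 0.5 * q * m

def query(e_obs):
    """P(x_Q | x_E = e_obs) = sum_M p / sum_{Q,M} p."""
    num = {q: sum(joint(q, e_obs, m) for m in vals) for q in vals}
    den = sum(num.values())
    return {q: num[q] / den for q in vals}

print(query(1))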

Computational complexity

Exact inference in discrete Bayes nets is NP-hard.
Proof sketch: encode a 3SAT problem as a polynomial-sized Bayes net by defining X_1, ..., X_n independent Bernoulli variables with P(X_i = 1) = 1/2. Let Σ be the set of clauses, each with 3 literals. C_j is a random variable with value 1 if the j-th clause is satisfied; it is computed from x_1, ..., x_n using a conditional distribution table with 3 parents. A_j is a random variable with value 1 if C_1, ..., C_j are all satisfied; a_j is computed from a_{j-1} and c_j, and a_1 from c_1. Then P(A_k = 1) is proportional to the number of truth assignments satisfying all clauses.
We will see that exact inference can be done in O(N |Σ|^K) with N the number of variables, Σ the set of values, and K the fan-in.

Variable Elimination Algorithm

Calculating the conditional probability of a single node given an arbitrary set of nodes; limited to one query node! Shaded nodes are conditioned on (evidence nodes): clamp their values.

  P(A | B = 1) = P(A, B = 1) / P(B = 1) ∝ P(A, B = 1)

We write x̄_i to denote that X_i's value is clamped to the evidence. The algorithm requires an ordering which tells it in which order to do the summations; the query node must appear last. Push sums inside products and store shared expressions.

Simple Case: Bayes Rule

For simple models, we can derive the inference formulas by hand using Bayes rule (e.g. responsibility in mixture models):

  (a) p(x, y) = p(x) p(y|x)    (b) p(y|x)    (c) p(x|y) = p(x) p(y|x) / Σ_x p(x) p(y|x)

This is called "reversing the arrow". In general, the calculation we want to do is

  p(x_F | x_E) = Σ_{x_R} p(x_E, x_F, x_R) / Σ_{x_F, x_R} p(x_E, x_F, x_R)

Q: Can we do these sums efficiently? Can we avoid repeating unnecessary work each time we do inference?
A: Yes, if we exploit the factorization of the joint distribution.

Probabilistic Inference: Example

Partition the random variables in a domain X into three disjoint subsets x_E, x_F, x_R. The general probabilistic inference problem is to compute the posterior p(x_F | x_E) over query nodes x_F. This involves conditioning on evidence nodes x_E and integrating (summing) out marginal nodes x_R. If the joint distribution is represented as a huge table, this is trivial: just select the appropriate indices in the columns corresponding to x_E based on the values, sum over the columns corresponding to x_R, and renormalize the resulting table over x_F. If the joint is a known continuous function this can sometimes be done analytically (e.g. Gaussian: eliminate rows/cols corresponding to x_R; apply conditioning formulas for p(x_F | x_E)). But what if the joint distribution over X is represented by a directed or undirected graphical model?

The key is to factor and then apply the distributive law. For the six node example:

  p(x_1 | x̄_6) = p(x_1, x̄_6) / p(x̄_6) = p(x_1, x̄_6) / Σ_{x_1'} p(x_1', x̄_6)

  p(x_1, x̄_6) = Σ_{x_2} Σ_{x_3} Σ_{x_4} Σ_{x_5} p(x_1) p(x_2|x_1) p(x_3|x_1) p(x_4|x_2) p(x_5|x_3) p(x̄_6|x_2, x_5)
             = p(x_1) Σ_{x_2} p(x_2|x_1) Σ_{x_3} p(x_3|x_1) Σ_{x_4} p(x_4|x_2) Σ_{x_5} p(x_5|x_3) p(x̄_6|x_2, x_5)
             = p(x_1) Σ_{x_2} p(x_2|x_1) Σ_{x_3} p(x_3|x_1) Φ_5(x_2, x_3) Σ_{x_4} p(x_4|x_2)
             = p(x_1) Σ_{x_2} p(x_2|x_1) Φ_4(x_2) Σ_{x_3} p(x_3|x_1) Φ_5(x_2, x_3)
             = p(x_1) Σ_{x_2} p(x_2|x_1) Φ_4(x_2) Φ_3(x_1, x_2)
             = p(x_1) Φ_2(x_1)


Computational Complexity

W_{G,I}, the width of a graph G wrt an ordering I, is the maximum induced clique size minus 1.

W_G = min_I W_{G,I} is the minimum induced width of G. The complexity of the Variable Elimination Algorithm is O(N |Σ|^{W_G + 1}).

Finding an ordering I that minimizes W_{G,I} is NP-hard. In practice, greedy heuristics are used.
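A sketch of one such greedy heuristic (min-fill) together with the induced width it yields; the graph format and the example graph (the 4-cycle) are my own illustration, not the lecture's code.

def min_fill_order(adj):
    """adj: dict node -> set of neighbours. Returns (ordering, induced width)."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}
    order, width = [], 0
    while adj:
        def fill(v):                  # number of fill-in edges elimination of v would add
            nbrs = list(adj[v])
            return sum(1 for i in range(len(nbrs)) for j in range(i + 1, len(nbrs))
                       if nbrs[j] not in adj[nbrs[i]])
        v = min(adj, key=fill)        # greedily pick the cheapest node to eliminate
        nbrs = adj[v]
        width = max(width, len(nbrs))            # induced clique size minus 1
        for a in nbrs:                           # connect the neighbours (fill-in edges)
            adj[a] |= nbrs - {a}
        for a in nbrs:
            adj[a].discard(v)
        del adj[v]
        order.append(v)
    return order, width

g = {1: {2, 3}, 2: {1, 4}, 3: {1, 4}, 4: {2, 3}}   # the 4-cycle
print(min_fill_order(g))                            # width 2, after one fill-in edge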

Single Node Posteriors / Elimination Algorithm

For a single node posterior (i.e. x_F is a single node), there is a simple, efficient algorithm based on eliminating nodes. Notation: x̄_i is the value of evidence node x_i. The algorithm, called elimination, requires a node ordering to be given, which tells it which order to do the summations in. In this ordering, the query node must appear last. Graphically, we'll remove a node from the graph once we sum it out.

For undirected graphs the same algorithm applies, using clique potentials instead of conditional probabilities. Summing out a variable corresponds to removing the node from the graph and vice versa: for each node x_i in ordering I, connect the neighbors of x_i, then remove x_i from the graph.

(Figure, panels (a)-(g): the six-node graph and the sequence of graphs obtained as the nodes are eliminated one by one.)

Summing out x_i results in a function of its neighbors (which are now dependent/connected). Above, T_i denotes i plus all other nodes referenced by potentials on i.

Evidence Potentials / Algorithm Details

Elimination also uses a bookkeeping trick, called evidential functions:

  g(x̄_i) = Σ_{x_i} g(x_i) δ(x_i, x̄_i)

where δ(x_i, x̄_i) is 1 if x_i = x̄_i and 0 otherwise. This trick allows us to treat conditioning in the same way as we treat marginalization, so everything boils down to doing sums:

  p(x_F | x̄_E) = p(x_F, x̄_E) / p(x̄_E)
  p(x_F, x̄_E) = Σ_{x_R} Σ_{x_E} p(x_F, x_E, x_R) δ(x_E, x̄_E)
  p(x̄_E) = Σ_{x_R} Σ_{x_E} Σ_{x_F} p(x_F, x_E, x_R) δ(x_E, x̄_E)

At each step we remove the current variable in the elimination ordering from the distribution: for marginal nodes this sums them out, for evidence nodes this conditions on their observed values using the evidential functions. Each step performs a sum over a product of potential functions; the potentials can be original functions p(x_i | x_{π_i}), evidential functions δ(x_i, x̄_i), or intermediate potentials m_i. The algorithm terminates when it reaches the query node, which always appears last in the ordering, and we renormalize what is left to get the final result p(x_F | x_E). For undirected models everything is the same except that the initialization phase uses the clique potentials instead of the parent-conditionals. We just pick an ordering and go for it...

Triangulation

Adding all these edges to the original graph gives a triangulated graph. Triangulated (chordal) graph: a graph in which every loop of length ≥ 4 contains a chord, an edge between non-adjacent nodes. The elimination algorithm yields a triangulated graph.
Proof by induction. The base case, a single node, holds trivially. Assume the claim holds for graphs with n nodes. Eliminating a node s from G_{n+1} connects the neighbors of s and leaves a graph G̃_n, which is triangulated by the inductive hypothesis. Adding s back to G̃_n cannot create a chordless cycle, because all of s's neighbors are connected.
For directed graphs, parents may not be explicitly connected: moralize first!
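For completeness, a sketch of a standard chordality test (maximum cardinality search followed by a perfect-elimination-order check); this particular algorithm is not from the lecture, just a common way to verify triangulation in practice.

def is_chordal(adj):
    """adj: dict node -> set of neighbours. True iff the graph is triangulated."""
    adj = {v: set(n) for v, n in adj.items()}
    weight = {v: 0 for v in adj}
    order = []
    while weight:
        v = max(weight, key=weight.get)          # MCS: most already-ordered neighbours
        order.append(v)
        del weight[v]
        for u in adj[v]:
            if u in weight:
                weight[u] += 1
    pos = {v: i for i, v in enumerate(order)}
    for v in order:
        earlier = [u for u in adj[v] if pos[u] < pos[v]]   # neighbours visited before v
        if earlier:
            parent = max(earlier, key=lambda u: pos[u])
            # earlier neighbours must all be adjacent to the latest one (the "parent")
            if any(u != parent and u not in adj[parent] for u in earlier):
                return False
    return True

print(is_chordal({1: {2, 3}, 2: {1, 4}, 3: {1, 4}, 4: {2, 3}}))          # 4-cycle: False
print(is_chordal({1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3}}))    # with chord: True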

Multiple queries

The Variable Elimination algorithm is limited to a single node query. If there are multiple queries, we should reuse the calculations across the different queries. The Sum-Product / Belief Propagation algorithm can compute all single-node marginals, but it works only on trees (or tree-like structures). We will come back to this as a special case of the junction tree algorithm for Hidden Markov Models.

Junction Tree Algorithm

The Junction Tree algorithm works for general graphs and makes use of locality: given a graph, the computations can be organized over a clustering of nodes forming a tree.
Def: A clique tree is a tree-structured graph whose vertices are the maximal cliques of the original graph.

(Figure 19.3: a graph on nodes 1, 2, 3, 4 with maximal cliques {1,2,3} and {2,3,4}, and its clique tree 123 — [23] — 234.)

A separator (square node) on an edge is the set of nodes that is the intersection of the two cliques.


(Figure 19.4: the four-cycle on nodes 1, 2, 3, 4 and a clique tree over its edge cliques 12, 24, 34, 13.)

For this third example, conditioned on {1,3} and/or {1,2} we would expect that {3,4} is independent of {2,4}, which is not true! Hence if we ran the sum-product algorithm on this "tree", we would not converge to the correct solution. The problem is that we want to run a local message passing algorithm, so we need to refine our notion of clique tree to ensure that local computation outputs globally correct answers.

Definition (Junction Tree): A junction tree is a special kind of clique tree such that for all cliques C_1 and C_2 of the original graph, C_1 ∩ C_2 belongs to every separator set on the unique path from C_1 to C_2. This property of the separator sets is known as the running intersection property. The first two clique trees above are junction trees, while the third one is not. Running the sum-product algorithm on a junction tree is a local algorithm that computes exact marginals; we will also look at which graphs have junction trees, and at connections with the elimination algorithm.


Junction Tree

(Figure 20.1: a graph on nodes 1-5 with maximal cliques 123, 234, 345.)

We need a notion that allows us to preserve global consistency via local consistency.

Def: A junction tree is a clique tree such that for all cliques C_1 and C_2 in the original graph, C_1 ∩ C_2 belongs to every separator on the unique path from C_1 to C_2.

(Figure 20.2: ovals are cliques, squares are separator sets. The clique tree 123 — [23] — 234 — [34] — 345 is a junction tree because {1,2,3} ∩ {3,4,5} = {3} is present in every separator set on the path 123 → 345.)

(Figure 20.3: a clique tree over the same cliques that is not a junction tree, because 2 should be in both separator sets and it is not.)

There are 3 maximal cliques in this graph, {123, 234, 345}, and many ways to form a clique tree: Figure 20.2 is a junction tree, but Figure 20.3 is not (because of node 2). There exists a junction tree of a graph iff the graph is triangulated. For the next example (Figures 20.4 and 20.5) the clique tree shown is not a junction tree; in fact, there is no way to make a junction tree for that graph.
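A sketch of checking the running intersection property directly (the data layout — a dict of cliques and a list of tree edges — is my own, not the lecture's); the two calls reproduce the junction-tree and non-junction-tree cases above.

import itertools

def is_junction_tree(cliques, edges):
    """cliques: dict id -> frozenset of variables; edges: list of (id, id) tree edges."""
    adj = {c: [] for c in cliques}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)

    def path(a, b, prev=None):                 # unique path in a tree, by DFS
        if a == b:
            return [a]
        for n in adj[a]:
            if n != prev:
                p = path(n, b, a)
                if p:
                    return [a] + p
        return None

    for a, b in itertools.combinations(cliques, 2):
        inter = cliques[a] & cliques[b]
        p = path(a, b)
        for c1, c2 in zip(p, p[1:]):           # separators = intersections along the path
            if not inter <= (cliques[c1] & cliques[c2]):
                return False
    return True

cl = {1: frozenset('123'), 2: frozenset('234'), 3: frozenset('345')}
print(is_junction_tree(cl, [(1, 2), (2, 3)]))   # True: 123 - 234 - 345
print(is_junction_tree(cl, [(1, 3), (3, 2)]))   # False: 2 missing from separator {3}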


Junction Tree Algorithm

Moralize if the graph is directed. Triangulate the graph G (e.g. using the variable elimination algorithm). Construct a junction tree J from the triangulated graph. Initialize the potential of each clique C of J as the product of all factors from the model assigned to it. Propagate local messages on the junction tree.
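A high-level sketch (not the lecture's code) of this pipeline on the tiniest possible example, the chain Bayes net A → B → C: moralization adds nothing, the moral graph is already triangulated, the cliques are {A,B} and {B,C} with separator {B}, and a single upward message calibrates the tree. All CPD numbers are made up.

vals = (0, 1)
pA = {0: 0.3, 1: 0.7}
pB = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.4, 1: 0.6}}      # P(B | A)
pC = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}      # P(C | B)

# Initialize clique potentials with their assigned factors.
psi_AB = {(a, b): pA[a] * pB[a][b] for a in vals for b in vals}
psi_BC = {(b, c): pC[b][c] for b in vals for c in vals}

# Propagate one message over the separator {B}: from clique {A,B} to clique {B,C}.
delta = {b: sum(psi_AB[(a, b)] for a in vals) for b in vals}
belief_BC = {(b, c): psi_BC[(b, c)] * delta[b] for b in vals for c in vals}

# The calibrated clique now holds P(B, C); e.g. the marginal of C:
pC_marg = {c: sum(belief_BC[(b, c)] for b in vals) for c in vals}
print(pC_marg)     # sums to 1 because the original Bayes net is normalized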

From Bayes net to jtree / Jtree iff MWST

(Figure: BN → Moralize → Triangulate → Junction graph → Junction tree, for the student network Coherence, Difficulty, Intelligence, Grade, SAT, Letter, Job, Happy, with cliques C,D; G,I,D; G,S,I; G,S,L; L,S,J; G,H.)

The weight of a clique tree T with separators S_j is

  w(T) = Σ_{j=1}^{M-1} |S_j|
       = Σ_{j=1}^{M-1} Σ_{k=1}^{N} 1(X_k ∈ S_j)
       = Σ_{k=1}^{N} Σ_{j=1}^{M-1} 1(X_k ∈ S_j)
       ≤ Σ_{k=1}^{N} ( Σ_{i=1}^{M} 1(X_k ∈ C_i) - 1 )
       = Σ_{i=1}^{M} |C_i| - N

This is an equality iff T is a jtree. To make a jtree from the set of cliques of a chordal graph:
  – Build a junction graph, where the weight on edge C_i − C_j is |S_ij|.
  – Find a maximum weight spanning tree using Prim's or Kruskal's algorithm.

Initializing clique trees Message passing in clique trees

(Figure: the student Bayes net and its clique tree 1: C,D — 2: G,I,D — 3: G,S,I — 5: G,J,S,L — 4: H,G,J, with separators D; G,I; G,S; G,J.)

Initializing clique trees: the potential for clique j is initialized to the product of all assigned factors from the model,

  π_j(C_j) = ∏_{φ: α(φ)=j} φ

Message passing in clique trees: to compute P(J) (more generally p(X_F)), we find some clique that contains J (e.g. C_5), call it the root, and send messages from the leaves up to the root. Message-passing protocol: a node C_i can send to C_j (closer to the root) once it has received messages from all its other neighbors C_k. The order in which the messages are sent is called a schedule.

(Message propagation slides are from the lecture notes of K. Murphy.)

Propagating messages

Collect to C_5 (computing p(J) through the tree rooted at C_5):

  δ_{1→2}(D)      = Σ_C π⁰_1(C, D)
  π_2(G, I, D)    = π⁰_2(G, I, D) δ_{1→2}(D)
  δ_{2→3}(G, I)   = Σ_D π_2(G, I, D)
  π_3(G, S, I)    = π⁰_3(G, S, I) δ_{2→3}(G, I)
  δ_{3→5}(G, S)   = Σ_I π_3(G, S, I)
  δ_{4→5}(G, J)   = Σ_H π⁰_4(H, G, J)
  π_5(G, J, S, L) = π⁰_5(G, J, S, L) δ_{3→5}(G, S) δ_{4→5}(G, J)

Collect to C_3 (computing p(G)):

  δ_{1→2}(D)      = Σ_C π⁰_1(C, D)
  π_2(G, I, D)    = π⁰_2(G, I, D) δ_{1→2}(D)
  δ_{2→3}(G, I)   = Σ_D π_2(G, I, D)
  δ_{4→5}(G, J)   = Σ_H π⁰_4(H, G, J)
  π_5(G, J, S, L) = π⁰_5(G, J, S, L) δ_{4→5}(G, J)
  δ_{5→3}(G, S)   = Σ_{J,L} π_5(G, J, S, L)
  π_3(G, S, I)    = π⁰_3(G, S, I) δ_{2→3}(G, I) δ_{5→3}(G, S)


Message Passing Algorithm

General procedure for the upwards pass:

  ψ¹_r = function Ctree-VE-up({φ}, T, α, r)
    DT := mkRootedTree(T, r)
    {ψ⁰_i} := initializeCliques(φ, α)
    for i ∈ postorder(DT)
      j := pa(DT, i)
      δ_{i→j} := VE-msg({δ_{k→i} : k ∈ ch(DT, i)}, ψ⁰_i)
    end
    ψ¹_r := ψ⁰_r ∏_{k ∈ ch(DT, r)} δ_{k→r}

Sub-functions:

  {ψ⁰_i} = function initializeCliques(φ, α)
    for i := 1 : C
      ψ⁰_i(C_i) = ∏_{φ: α(φ)=i} φ

  δ_{i→j} = function VE-msg({δ_{k→i}}, ψ⁰_i)
    ψ¹_i(C_i) := ψ⁰_i(C_i) ∏_k δ_{k→i}
    δ_{i→j}(S_{i,j}) := Σ_{C_i \ S_{ij}} ψ¹_i(C_i)


Yasemin Altun Lecture 2 Uses of dfs CCorrectnessorrect ofn Upwardess passof upwards pass Ck Cj Cr • For message passing on an undirected tree: Ci Ck’

– We can root a tree at R and make all arcs point away from R by Let FI,i→j be potentials on the Ci side and VI,i→j be the • Consider edge Ci −variablesCj i ofn thesethe potentialscliqu thate aretr notee in.Sij Let F≺(i→j) be all factors starting the DFS at R and connecting π(i)→i. Message from i to j summarizes everything to the left of on the C side, andtheV edge (sinde Sij separatesbe a thell leftva fromria theb right)les on the C side that are i ≺(i→j) X Y i – preorder (parents then children) = nodes sorted by discovery time σi→j (Sij ) = ψl not in Sij. VI,i→j ψl ∈FI,i→j The root node message is proportional to the marginal – postorder (children then parents) = nodes sorted by finish time probability (upto evidence probability) • Thm 8.2.3: the message from i tXo j summarizes everything to the πr (Cr ) = P(X) • For visiting nodes in a DAG in a topological order (parents before left of the edge (since Sij separatXCesr the left from the right): children) Yasemin Altun Lecture 2 δ (S ) = φ – Topological order = nodes sorted by reverse finish time i→j ij ! " V≺(i→j) φ∈F≺(i→j) • For checking if a DAG has cycles • Corollary 8.2.4: for the root clique, – Run DFS, see if you ever encounter a back-edge to a gray node π (C ) = P #(X) • For finding strongly connected components r r ! X\Cr


Upward Messages

Meaning of the messages: e.g., for edge C_3 − C_5 in the student clique tree,

  F_{≺(3→5)} = {P(D|C), P(C), P(G|I,D), P(I), P(S|I)},   V_{≺(3→5)} = {C, D, I}
  δ_{3→5}(G, S) = Σ_{C,D,I} P(D|C) P(C) P(G|I,D) P(I) P(S|I)

For an HMM chain X_1 − X_2 − X_3 − X_4 with observations Y_1, ..., Y_4 and cliques C_1: X_1,X_2; C_2: X_2,X_3; C_3: X_3,X_4, partial messages may not be probability distributions unless the ordering is topologically consistent with a Bayes net.

Causal order:
  δ_{1→2}(X_2) = Σ_{X_1} P(X_1) p(y_1|X_1) P(X_2|X_1) p(y_2|X_2) ∝ P(X_2 | y_{1:2})
  δ_{2→3}(X_3) = Σ_{X_2} δ_{1→2}(X_2) P(X_3|X_2) p(y_3|X_3) ∝ P(X_3 | y_{1:3})

Anti-causal order:
  δ_{3→2}(X_3) = Σ_{X_4} P(X_4|X_3) p(y_4|X_4) = p(y_4 | X_3)
  δ_{2→1}(X_2) = Σ_{X_3} δ_{3→2}(X_3) P(X_3|X_2) p(y_3|X_3) = p(y_{3:4} | X_2)

Reuse of Messages

Consider computing multiple marginal probabilities, e.g. P(J) and P(G). If we collect to C_5 (to compute P(J)) we send δ_{1→2}(D), δ_{2→3}(G,I), δ_{3→5}(G,S), δ_{4→5}(G,J); if we collect to C_3 (to compute P(G)) we send δ_{1→2}(D), δ_{2→3}(G,I), δ_{5→3}(G,S), δ_{4→5}(G,J). The messages δ_{1→2}, δ_{2→3}, δ_{4→5} are the same in both cases. In general, if the root R is on the C_j side, the message from C_i → C_j is independent of R; if the root is on the C_i side, the message from C_j → C_i is independent of R. Hence we can send a message along each edge in both directions and thereby compute all marginals in O(C) time: collect the messages from the leaves to the root and distribute back to the leaves, leading to O(N) complexity.

Shafer-Shenoy Algorithm

  {ψ¹_i} = function Ctree-VE-calibrate({φ}, T, α)
    R := pickRoot(T)
    DT := mkRootedTree(T, R)
    {ψ⁰_i} := initializeCliques(φ, α)
    (* Upwards pass *)
    for i ∈ postorder(DT)
      j := pa(DT, i)
      δ_{i→j} := VE-msg({δ_{k→i} : k ∈ ch(DT, i)}, ψ⁰_i)

    (* Downwards pass *)
    for i ∈ preorder(DT)
      for j ∈ ch(DT, i)
        δ_{i→j} := VE-msg({δ_{k→i} : k ∈ N_i \ j}, ψ⁰_i)
    (* Combine *)
    for i := 1 : C
      ψ¹_i := ψ⁰_i ∏_{k ∈ N_i} δ_{k→i}

Correctness of Shafer-Shenoy

Thm 8.2.7: After running the algorithm,

  ψ¹_i(C_i) = Σ_{X \ C_i} P(X, e)

Pf: the incoming messages δ_{k→i} are exactly the same as those computed by making C_i be the root, so correctness follows from the correctness of collect-to-root (the upwards pass). The posterior of any set of nodes contained in a clique can be computed using

  P(C_i | e) = ψ¹_i(C_i) / p(e),

where the likelihood of the evidence can be computed from any clique: p(e) = Σ_{c_i} ψ¹_i(c_i).


Downward pass from root C_5

(Figure: distributing from root C_5, e.g. δ_{5→3}(G, S) = Σ_{J,L} π⁰_5(G, J, S, L) δ_{4→5}(G, J).)

Correctness: ψ¹_i(C_i) = Σ_{X \ C_i} P(X, e). The proof follows from the fact that the δ_{k→i} are the same as those computed by making C_i the root node, together with the correctness of the upward pass.

Posterior: P(C_i | e) = ψ¹_i(C_i) / Σ_{c_i} ψ¹_i(c_i)

Construct the Junction Tree

A decomposable graph is either complete or can be divided into disjoint non-empty sets A, B, S such that S separates A and B, S is a clique, and A ∪ S and B ∪ S are decomposable.
Theorem: The following properties are equivalent. (D) G is decomposable. (T) G is triangulated. (J) G has a junction tree.
Proof of T ⇒ J (a sufficient condition to have a junction tree):

Construct the Junction Tree

A simplicial node has all its neighbors connected.
Lemma: Every triangulated graph G that contains at least two nodes has at least two simplicial nodes. If G is not complete, these nodes can be chosen to be nonadjacent. (Proof via T ⇒ D.)
Proof by induction. The base case, a single node, holds trivially. Consider G_{n+1} with n + 1 nodes. By the Lemma, there is at least one simplicial node α. Removing it yields a triangulated graph G_n, which by the inductive hypothesis has a junction tree T, and all neighbors of α are connected. Let C be the clique formed by α and its neighbors. If C \ {α} is a clique node of T, add α to that clique. Otherwise C \ {α} must be a subset of some clique D ∈ T; add C as a new leaf node of T and link it to D. The running intersection property is preserved because α is contained only in C and all the other nodes of C are in D.

Construct the Junction Tree

Weight of a clique tree: W(T) = Σ_{j=1}^{M-1} |S_j|, where M is the number of cliques and the S_j are the separators.
Theorem: A clique tree is a junction tree iff it is a maximal weight spanning tree (of the junction graph).
Proof: For any tree, Σ_{j=1}^{M-1} 1(X_k ∈ S_j) ≤ Σ_{i=1}^{M} 1(X_k ∈ C_i) - 1, with equality iff the subgraph induced by X_k is a tree (which holds for every X_k iff T is a junction tree).


Equality holds iff T is a junction tree. Use maximal weight spanning tree algorithms (Prim's or Kruskal's); complexity O(N²).
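A sketch of this construction: build the junction graph over the maximal cliques with edge weights |C_i ∩ C_j| and take a maximum weight spanning tree with Kruskal's algorithm (using a small union-find). The cliques below follow the student-network figure earlier; the code itself is illustrative, not the lecture's.

import itertools

cliques = {1: frozenset("CD"), 2: frozenset("GID"), 3: frozenset("GSI"),
           4: frozenset("GSL"), 5: frozenset("LSJ"), 6: frozenset("GH")}

def max_weight_spanning_tree(cliques):
    parent = {c: c for c in cliques}
    def find(x):                               # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    edges = sorted(((len(cliques[a] & cliques[b]), a, b)
                    for a, b in itertools.combinations(cliques, 2)), reverse=True)
    tree = []
    for w, a, b in edges:                      # Kruskal: heaviest separators first
        ra, rb = find(a), find(b)
        if ra != rb and w > 0:
            parent[ra] = rb
            tree.append((a, b, w))
    return tree

print(max_weight_spanning_tree(cliques))
# yields a spanning tree whose separators are {L,S}, {G,S}, {G,I}, {G}, {D}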

Shafer-Shenoy Algorithm for HMMs

For an HMM with states X_1, X_2, X_3, X_4, ... and observations Y_1, Y_2, Y_3, Y_4, ..., the clique tree is a chain with cliques C_1: X_1,X_2; C_2: X_2,X_3; C_3: X_3,X_4; ... The Shafer-Shenoy messages are (a code sketch of these recursions follows the example below):

ψ_t^0(X_t, X_{t+1}) = P(X_{t+1} | X_t) p(y_{t+1} | X_{t+1})
δ_{t→t+1}(X_{t+1}) = \sum_{X_t} δ_{t−1→t}(X_t) ψ_t^0(X_t, X_{t+1})
δ_{t→t−1}(X_t) = \sum_{X_{t+1}} δ_{t+1→t}(X_{t+1}) ψ_t^0(X_t, X_{t+1})
ψ_t^1(X_t, X_{t+1}) = δ_{t−1→t}(X_t) δ_{t+1→t}(X_{t+1}) ψ_t^0(X_t, X_{t+1})

Example of distributing from the root C_5 (continuing the Student clique tree): after the upward collect pass, messages are sent back down from the root, e.g.
δ_{5→3}(G,S) = \sum_{J,L} π_5(C_5) δ_{4→5},  δ_{3→2}(G,I) = \sum_S π_3(C_3) δ_{5→3},
so that every clique ends up holding messages from all of its neighbors.
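The two δ recursions above translate directly into matrix products. The sketch below (hypothetical function name, assumed array shapes) collects forward and backward messages along the HMM clique chain and forms the clique beliefs ψ_t^1; the initial message `init` is assumed to carry P(X_1) p(y_1 | X_1).

```python
import numpy as np

def shafer_shenoy(psi0, init):
    """psi0: array of shape (T, K, K) with psi0[t][i, j] = P(X_{t+1}=j | X_t=i) p(y_{t+1} | X_{t+1}=j).
    init: length-K vector carrying P(X_1) p(y_1 | X_1)."""
    T, K, _ = psi0.shape
    fwd = [init] + [None] * T            # fwd[t] = delta_{t-1 -> t}, a table over X_t
    bwd = [None] * T + [np.ones(K)]      # bwd[t] = delta_{t+1 -> t}, a table over X_{t+1}
    for t in range(T):                   # leaves-to-root (forward) pass
        fwd[t + 1] = fwd[t] @ psi0[t]    # sum out X_t
    for t in reversed(range(T)):         # root-to-leaves (backward) pass
        bwd[t] = psi0[t] @ bwd[t + 1]    # sum out X_{t+1}
    # Clique beliefs psi1[t][i, j] = fwd * bwd * psi0, proportional to P(X_t=i, X_{t+1}=j, y_{1:T}).
    return [np.outer(fwd[t], bwd[t + 1]) * psi0[t] for t in range(T)]
```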



Forward-Backward Algorithm for HMMs

The forwards-backwards algorithm for an HMM X_1, X_2, X_3, X_4, ... with observations Y_1, Y_2, Y_3, Y_4, ... is the Shafer-Shenoy recursion in HMM notation, as the definitions indicate:

α_t(i) := δ_{t−1→t}(i) = P(X_t = i, y_{1:t})
β_t(i) := δ_{t→t−1}(i) = p(y_{t+1:T} | X_t = i)
ξ_t(i,j) := ψ_t^1(X_t = i, X_{t+1} = j) = P(X_t = i, X_{t+1} = j, y_{1:T})
A(i,j) := P(X_{t+1} = j | X_t = i),  B_t(i) := p(y_t | X_t = i)

Recursions:
α_t(j) = \sum_i α_{t−1}(i) A(i,j) B_t(j)
β_t(i) = \sum_j β_{t+1}(j) A(i,j) B_{t+1}(j)
ξ_t(i,j) = α_t(i) β_{t+1}(j) A(i,j) B_{t+1}(j)
γ_t(i) := P(X_t = i | y_{1:T}) ∝ α_t(i) β_t(i) ∝ \sum_j ξ_t(i,j)

Matrix-vector form:
α_t = (A^T α_{t−1}) .* B_t
β_t = A (β_{t+1} .* B_{t+1})
ξ_t = [α_t (β_{t+1} .* B_{t+1})^T] .* A
γ_t ∝ α_t .* β_t
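A minimal numpy sketch of the matrix-vector recursions above, under the assumption that A, B, and the initial distribution pi are given as arrays; the normalization of γ and ξ is added for convenience.

```python
import numpy as np

def forwards_backwards(A, B, pi):
    """A[i, j] = P(X_{t+1}=j | X_t=i); B[t, i] = p(y_t | X_t=i); pi = P(X_1)."""
    T, K = B.shape
    alpha = np.zeros((T, K))
    beta = np.ones((T, K))
    alpha[0] = pi * B[0]
    for t in range(1, T):
        alpha[t] = (A.T @ alpha[t - 1]) * B[t]     # alpha_t = (A^T alpha_{t-1}) .* B_t
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (beta[t + 1] * B[t + 1])     # beta_t = A (beta_{t+1} .* B_{t+1})
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)      # gamma_t proportional to alpha_t .* beta_t
    # xi_t = [alpha_t (beta_{t+1} .* B_{t+1})^T] .* A, then normalize each slice
    xi = np.array([np.outer(alpha[t], beta[t + 1] * B[t + 1]) * A for t in range(T - 1)])
    xi /= xi.sum(axis=(1, 2), keepdims=True)
    return gamma, xi
```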

Computing Joint Pairwise Posteriors; Sum-Product, Max-Product and Semirings

Computing joint pairwise posteriors:
- We can also easily compute the joint pairwise posterior distribution for any pair of connected nodes x_i, x_j.
- To do this, we simply take the product of all messages coming into node i (except the message from node j), all the messages coming into node j (except the message from node i), and the potentials ψ^E_i(x_i), ψ^E_j(x_j), ψ_ij(x_i, x_j).
- The posterior is proportional to this product:
  p(x_i, x_j | x̄_E) ∝ ψ^E_i(x_i) ψ^E_j(x_j) ψ_ij(x_i, x_j) \prod_{k ∈ c(i), k ≠ j} m_{ki}(x_i) \prod_{ℓ ∈ c(j), ℓ ≠ i} m_{ℓj}(x_j)
- These joint pairwise posteriors cover all the maximal cliques in the tree, and so they are all we need to do learning.
- Inference of other pairwise or higher-order joint posteriors is possible, but more difficult.

Sum-product, max-product and semirings:
- Why can we use the same trick for MAP as for marginals? Because multiplication distributes over max as well as sum: max(ab, ac) = a max(b, c).
- Formally, both the "sum-product" and the "max-product" pairs are commutative semirings.
- It turns out that the "max-sum" pair is also a semiring: max(a + b, a + c) = a + max(b, c), which means we can do MAP computations in the log domain:
  max_x p(x) = max_x \prod_i p(x_i | x_{π_i})  corresponds to  max_x log p(x) = max_x \sum_i log p(x_i | x_{π_i}),
  and both are maximized by the same configuration.
A generic message-passing step parameterized by the semiring's elimination operator is sketched below.
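The semiring observation means the same message-passing code can produce either marginal or MAP messages; in this tiny sketch the only thing that changes is the operator used to eliminate the sent-from variable. The potential and message values are made up for illustration.

```python
import numpy as np

def pass_message(psi, incoming, marginalize=np.sum):
    """psi[x_j, x_i]: pairwise potential; incoming: product of messages into node j."""
    return marginalize(psi * incoming[:, None], axis=0)   # eliminate x_j

psi = np.random.rand(3, 3)
m_in = np.random.rand(3)
sum_msg = pass_message(psi, m_in, np.sum)   # sum-product: marginal message
max_msg = pass_message(psi, m_in, np.max)   # max-product: MAP message
```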

Maximizing instead of Summing (finding maximum-probability configurations)
- Elimination and belief propagation both summed over all possible values of the marginal (non-query, non-evidence) nodes to get a marginal probability.
- What if we wanted to maximize over the non-query, non-evidence nodes to find the probability of the single best setting consistent with any query and evidence?
- For the six-node example network:
  max_x p(x) = max_{x_1} max_{x_2} max_{x_3} max_{x_4} max_{x_5} p(x_1) p(x_2|x_1) p(x_3|x_1) p(x_4|x_2) p(x_5|x_3) p(x_6|x_2,x_5)
             = max_{x_1} p(x_1) max_{x_2} p(x_2|x_1) max_{x_3} p(x_3|x_1) max_{x_4} p(x_4|x_2) max_{x_5} p(x_5|x_3) p(x_6|x_2,x_5)
- This is known as the maximum a-posteriori or MAP configuration.
- It turns out that (on trees) we can use an algorithm exactly like belief propagation to solve this problem: the same formulation finds the MAP, replacing sums with max.

Max-Product Algorithm
- Choose a root node arbitrarily.
- If j is an evidence node, ψ^E(x_j) = δ(x_j, x̄_j); else ψ^E(x_j) = 1.
- Pass messages from the leaves up to the root using:
  m^max_{ji}(x_i) = max_{x_j} [ ψ^E(x_j) ψ(x_i, x_j) \prod_{k ∈ c(j)} m^max_{kj}(x_j) ]
- Remember which choice x_j = x*_j yielded the maximum.
- Given the messages, compute the max value using any node i:
  max_x p^E(x) = max_{x_i} [ ψ^E(x_i) \prod_{k ∈ c(i)} m^max_{ki}(x_i) ]
- Retrace the steps from the root back to the leaves, recalling the best x*_j at each step, to get the maximizing argument (configuration) x*.
A chain-structured instance of this recipe is sketched below.
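On a chain, the max-product recipe combined with the log-domain (max-sum) trick is the Viterbi recursion. The sketch below is my own chain-structured example rather than the lecture's general tree formulation, with A, B, pi as in the forwards-backwards sketch and all entries assumed strictly positive.

```python
import numpy as np

def map_configuration(A, B, pi):
    """Max-product on an HMM chain in the log domain, with backtracking."""
    T, K = B.shape
    delta = np.zeros((T, K))
    back = np.zeros((T, K), dtype=int)
    delta[0] = np.log(pi) + np.log(B[0])
    for t in range(1, T):                          # leaves-to-root pass
        scores = delta[t - 1][:, None] + np.log(A) + np.log(B[t])[None, :]
        back[t] = scores.argmax(axis=0)            # remember which x_{t-1} was best
        delta[t] = scores.max(axis=0)
    x_star = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):                  # retrace from the root back to the leaves
        x_star.append(int(back[t, x_star[-1]]))
    return x_star[::-1]                            # MAP configuration x*
```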

Learning

Parameter Learning: Given the graph structure, how do we get the conditional probability distributions P(X_i | X_{Π_i})?
- The parameters θ are treated as unknown constants.
- Given a fully observed training sample S (denoted D below), find their estimates by maximizing the (penalized) log-likelihood:
  θ̂ = argmax_θ log P(D | θ) (− λ R(θ)),
  where the −λR(θ) term is the optional penalty.
- This is the frequentist approach (versus the Bayesian approach).

Structure Learning: Given a fully observed training sample S, how do we get the graph G and its parameters θ?
- Find the estimators of the unknown constants G and θ by maximizing the log-likelihood, iterating between G-steps and θ-steps.
- Harder settings to consider: some variables are hidden, or no polynomial-time exact algorithms exist.
A minimal sketch of maximum-likelihood parameter estimation for a fully observed discrete network follows.
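For a fully observed discrete Bayes net, maximizing the log-likelihood decomposes into one count-and-normalize problem per conditional probability table. The sketch below is a hypothetical illustration: the data layout and the Dirichlet-style pseudocount standing in for the −λR(θ) penalty are my assumptions, not the lecture's.

```python
import numpy as np

def mle_cpt(samples, child, parents, card, alpha=1.0):
    """samples: list of dicts mapping variable name -> integer value.
    Returns the (smoothed) MLE of P(child | parents) as an ndarray."""
    counts = np.full([card[p] for p in parents] + [card[child]], alpha)  # pseudocounts
    for s in samples:
        idx = tuple(s[p] for p in parents) + (s[child],)
        counts[idx] += 1
    return counts / counts.sum(axis=-1, keepdims=True)   # normalize over the child axis

# Example: estimate P(G | I, D) from hypothetical Student-network samples.
# samples = [{"I": 1, "D": 0, "G": 2}, ...]
# cpt = mle_cpt(samples, "G", ["I", "D"], {"I": 2, "D": 2, "G": 3})
```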
