
AN INTRODUCTION TO GRAPHICAL

MODELS

Michael I. Jordan

Center for Biological and Computational Learning

Massachusetts Institute of Technology

http://www.ai.mit.edu/projects/jordan.html

Acknowledgments:

Zoubin Ghahramani, Tommi Jaakkola, Marina Meila

Lawrence Saul

December, 1997

GRAPHICAL MODELS

- Graphical models are a marriage between graph theory and probability theory

- They clarify the relationship between neural networks and related network-based models such as HMMs, MRFs, and Kalman filters

- Indeed, they can be used to give a fully probabilistic interpretation to many neural network architectures

- Some advantages of the graphical model point of view:
  - inference and learning are treated together
  - supervised and unsupervised learning are merged seamlessly
  - missing data are handled nicely
  - a focus on conditional independence and computational issues
  - interpretability (if desired)

Graphical models (cont.)

- There are two kinds of graphical models: those based on undirected graphs and those based on directed graphs. Our main focus will be directed graphs.

- Alternative names for graphical models: belief networks, Bayesian networks, probabilistic independence networks, Markov random fields, loglinear models, influence diagrams

- A few myths about graphical models:
  - they require a localist semantics for the nodes
  - they require a causal semantics for the edges
  - they are necessarily Bayesian
  - they are intractable

Learning and inference

- A key insight from the graphical model point of view:

  It is not necessary to learn that which can be inferred

- The weights in a network make local assertions about the relationships between neighboring nodes

- Inference algorithms turn these local assertions into global assertions about the relationships between nodes
  - e.g., correlations between hidden units conditional on an input-output pair
  - e.g., the probability of an input vector given an output vector

- This is achieved by associating a joint probability distribution with the network

Directed graphical models: basics

- Consider an arbitrary directed acyclic graph, where each node in the graph corresponds to a random variable (scalar or vector):

  [Figure: a directed acyclic graph on the nodes A, B, C, D, E, F]

- There is no a priori need to designate units as "inputs," "outputs," or "hidden"

- We want to associate a joint probability distribution P(A, B, C, D, E, F) with this graph, and we want all of our calculations to respect this distribution, e.g.,

  P(F \mid A, B) = \frac{\sum_{C,D,E} P(A,B,C,D,E,F)}{\sum_{C,D,E,F} P(A,B,C,D,E,F)}
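To make the marginalization concrete, here is a minimal brute-force sketch (with toy numbers, not part of the tutorial) that computes such a conditional directly from a tabulated joint distribution over six binary variables.

```python
# A minimal sketch: P(F | A, B) by brute-force marginalization of a toy joint.
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint table P(A,B,C,D,E,F) over binary variables, indexed by a 6-tuple.
joint = rng.random((2,) * 6)
joint /= joint.sum()

def conditional_F_given_AB(a, b):
    """P(F | A=a, B=b) = sum_{C,D,E} P(a,b,C,D,E,F) / sum_{C,D,E,F} P(a,b,C,D,E,F)."""
    numer = np.zeros(2)
    for c, d, e, f in itertools.product(range(2), repeat=4):
        numer[f] += joint[a, b, c, d, e, f]
    return numer / numer.sum()

print(conditional_F_given_AB(1, 0))  # a length-2 vector summing to 1
```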

Some ways to use a graphical model

Prediction:

Diagnosis, control, optimization:

Supervised learning:

- we want to marginalize over the unshaded nodes in each case (i.e., integrate them out from the joint probability)

- "unsupervised learning" is the general case

Specification of a graphical model

- There are two components to any graphical model:
  - the qualitative specification
  - the quantitative specification

- Where does the qualitative specification come from?
  - prior knowledge of causal relationships
  - prior knowledge of modular relationships
  - assessment from experts
  - learning from data
  - we simply like a certain architecture (e.g., a layered graph)

Qualitative specification of graphical models

[Figure: two three-node graphs that link A and B through C]

- A and B are marginally dependent
- A and B are conditionally independent (given C)

[Figure: two graphs in which A and B share the common neighbor C]

- A and B are marginally dependent
- A and B are conditionally independent (given C)

Semantics of graphical models (cont.)

[Figure: two graphs in which A and B are both parents of the common child C]

- A and B are marginally independent
- A and B are conditionally dependent (given C)

- This is the interesting case...

"Explaining away"

[Figure: Burglar -> Alarm <- Earthquake -> Radio]

- All connections (in both directions) are "excitatory"

- But an increase in "activation" of Earthquake leads to a decrease in "activation" of Burglar

- Where does the "inhibition" come from?

Quantitative specification of directed models

- Question: how do we specify a joint distribution over the nodes in the graph?

- Answer: associate a conditional probability with each node:

  [Figure: the example graph annotated with P(A), P(B), P(C|A,B), P(D|C), P(E|C), P(F|D,E)]

  and take the product of the local conditional probabilities to yield the global joint probability

Justification

- In general, let {S_i} represent the set of random variables corresponding to the N nodes of the graph

- For any node S_i, let pa(S_i) represent the set of parents of node S_i

- Then

  P(S) = P(S_1) P(S_2 \mid S_1) \cdots P(S_N \mid S_{N-1}, \ldots, S_1)
       = \prod_i P(S_i \mid S_{i-1}, \ldots, S_1)
       = \prod_i P(S_i \mid pa(S_i))

  where the last line is by assumption

- It is possible to prove a theorem that states that if arbitrary probability distributions are utilized for P(S_i | pa(S_i)) in the formula above, then the family of probability distributions obtained is exactly that set which respects the qualitative specification (the conditional independence relations) described earlier
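As an illustration of this factorization, here is a minimal sketch (with hypothetical conditional probability tables, not taken from the tutorial) for the six-node example used above; the locally normalized CPTs multiply out to a properly normalized joint.

```python
# A minimal sketch of the directed factorization P(S) = prod_i P(S_i | pa(S_i))
# for the example A,B -> C -> D,E -> F, with made-up binary CPTs.
def joint(a, b, c, d, e, f):
    pA = [0.6, 0.4][a]
    pB = [0.7, 0.3][b]
    pC = [[0.9, 0.1], [0.5, 0.5], [0.4, 0.6], [0.2, 0.8]][2 * a + b][c]     # P(C|A,B)
    pD = [[0.8, 0.2], [0.3, 0.7]][c][d]                                     # P(D|C)
    pE = [[0.6, 0.4], [0.1, 0.9]][c][e]                                     # P(E|C)
    pF = [[0.95, 0.05], [0.5, 0.5], [0.4, 0.6], [0.1, 0.9]][2 * d + e][f]   # P(F|D,E)
    return pA * pB * pC * pD * pE * pF

# Sanity check: the product of locally normalized CPTs sums to one over all configurations.
total = sum(joint(a, b, c, d, e, f)
            for a in (0, 1) for b in (0, 1) for c in (0, 1)
            for d in (0, 1) for e in (0, 1) for f in (0, 1))
print(round(total, 10))  # 1.0
```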

Semantics of undirected graphs

[Figure: an undirected graph linking A and B through C]

- A and B are marginally dependent
- A and B are conditionally independent (given C)

Comparative semantics

[Figure: left, an undirected graph on A, B, C, D; right, a directed graph in which A and B are parents of C]

- The graph on the left yields conditional independencies that a directed graph can't represent

- The graph on the right yields marginal independencies that an undirected graph can't represent

Quantitative specification of undirected models

[Figure: the undirected graph on A, B, C, D, E, F]

- identify the cliques in the graph:

  {A, C}, {B, C}, {C, D, E}, {D, E, F}

- define a configuration of a clique as a specification of values for each node in the clique

- define a potential of a clique as a function that associates a real number with each configuration of the clique:

  \Phi_{A,C}, \Phi_{B,C}, \Phi_{C,D,E}, \Phi_{D,E,F}

Quantitative specification (cont.)

- Consider the example of a graph with binary nodes

- A "potential" is a table with entries for each combination of nodes in a clique:

            B = 0   B = 1
    A = 0    1.5     .4
    A = 1     .7    1.2

- "Marginalizing" over a potential table simply means collapsing (summing) the table along one or more dimensions:

    marginalizing over B:   A = 0: 1.9,  A = 1: 1.9
    marginalizing over A:   B = 0: 2.2,  B = 1: 1.6

Quantitative specification (cont.)

- finally, define the probability of a global configuration of the nodes as the product of the local potentials on the cliques:

  P(A,B,C,D,E,F) = \Phi_{A,C} \, \Phi_{B,C} \, \Phi_{C,D,E} \, \Phi_{D,E,F}

  where, without loss of generality, we assume that the normalization constant (if any) has been absorbed into one of the potentials

- It is then possible to prove a theorem that states that if arbitrary potentials are utilized in the product formula for probabilities, then the family of probability distributions obtained is exactly that set which respects the qualitative specification (the conditional independence relations) described earlier

  This theorem is known as the Hammersley-Clifford theorem
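A minimal sketch (with arbitrary toy potentials, not from the tutorial) of this product-of-potentials specification on the same cliques; here the normalization constant Z is computed explicitly rather than absorbed into one of the potentials.

```python
# A minimal sketch: an undirected model as a normalized product of clique potentials
# on the cliques {A,C}, {B,C}, {C,D,E}, {D,E,F}.
import itertools
import numpy as np

rng = np.random.default_rng(1)
phi_AC  = rng.random((2, 2))      # indexed [a, c]
phi_BC  = rng.random((2, 2))      # indexed [b, c]
phi_CDE = rng.random((2, 2, 2))   # indexed [c, d, e]
phi_DEF = rng.random((2, 2, 2))   # indexed [d, e, f]

def unnormalized(a, b, c, d, e, f):
    return phi_AC[a, c] * phi_BC[b, c] * phi_CDE[c, d, e] * phi_DEF[d, e, f]

# The normalization constant Z
Z = sum(unnormalized(*s) for s in itertools.product(range(2), repeat=6))

def prob(a, b, c, d, e, f):
    return unnormalized(a, b, c, d, e, f) / Z

print(round(sum(prob(*s) for s in itertools.product(range(2), repeat=6)), 10))  # 1.0
```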

Boltzmann machine

- The Boltzmann machine is a special case of an undirected graphical model

- For a Boltzmann machine all of the potentials are formed by taking products of factors of the form exp{J_{ij} S_i S_j}

  [Figure: nodes S_i and S_j joined by an edge with weight J_{ij}]

- Setting J_{ij} equal to zero for non-neighboring nodes guarantees that we respect the clique boundaries

- But we don't get the full conditional probability semantics with the Boltzmann machine parameterization
  - i.e., the family of distributions parameterized by a Boltzmann machine on a graph is a proper subset of the family characterized by the conditional independencies

Evidence and Inference

- "Absorbing evidence" means observing the values of certain of the nodes

- Absorbing evidence divides the units of the network into two groups:

  visible units {V}: those for which we have instantiated values (the "evidence nodes")

  hidden units {H}: those for which we do not have instantiated values

- "Inference" means calculating the conditional distribution

  P(H \mid V) = \frac{P(H, V)}{\sum_{\{H\}} P(H, V)}

  - prediction and diagnosis are special cases

Inference algorithms for directed graphs

- There are several inference algorithms; some of them operate directly on the directed graph

- The most popular inference algorithm, known as the junction tree algorithm (which we'll discuss here), operates on an undirected graph

- It also has the advantage of clarifying some of the relationships between the various algorithms

To understand the junction tree algorithm, we need to understand how to "compile" a directed graph into an undirected graph

Moral graphs

- Note that for both directed graphs and undirected graphs, the joint probability is in a product form

- So let's convert the local conditional probabilities into potentials; then the products of potentials will give the right answer

- Indeed we can think of a conditional probability, e.g., P(C | A, B), as a function of the three variables A, B, and C (we get a real number for each configuration):

  [Figure: the nodes A, B, C labeled with P(C|A,B)]

- Problem: A node and its parents are not generally in the same clique

- Solution: Marry the parents to obtain the "moral graph"

  [Figure: A and B linked, so that {A, B, C} forms a clique with \Phi_{A,B,C} = P(C|A,B)]

Moral graphs (cont.)

- Define the potential on a clique as the product over all conditional probabilities contained within the clique

- Now the product of potentials gives the right answer:

  P(A,B,C,D,E,F) = P(A) P(B) P(C|A,B) P(D|C) P(E|C) P(F|D,E)
                 = \Phi_{A,B,C} \, \Phi_{C,D,E} \, \Phi_{D,E,F}

  where

  \Phi_{A,B,C} = P(A) P(B) P(C|A,B)
  \Phi_{C,D,E} = P(D|C) P(E|C)
  \Phi_{D,E,F} = P(F|D,E)

  [Figure: the directed graph on A, ..., F and its moral graph]

Propagation of probabilities

- Now suppose that some evidence has been absorbed. How do we propagate this effect to the rest of the graph?

Clique trees

- A clique tree is an undirected tree of cliques:

  [Figure: a clique tree with cliques {A,C}, {B,C}, {C,D,E}, {D,E,F}]

- Consider cases in which two neighboring cliques V and W have an overlap S (e.g., {A,C} overlaps with {C,D,E}):

  [Figure: cliques V and W with separator S, carrying potentials \Phi_V, \Phi_S, \Phi_W]

  - the cliques need to "agree" on the probability of nodes in the overlap; this is achieved by marginalizing and rescaling:

    \Phi_S^* = \sum_{V \setminus S} \Phi_V

    \Phi_W^* = \frac{\Phi_S^*}{\Phi_S} \, \Phi_W

  - this occurs in parallel, distributed fashion throughout the clique tree
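Here is a minimal numerical sketch (toy potentials, not from the tutorial) of one such marginalize-and-rescale message between the cliques {A,C} and {C,D,E}; the check at the end confirms that the update leaves the represented joint distribution unchanged.

```python
# A minimal sketch: one message between cliques V = {A,C} and W = {C,D,E}
# with separator S = {C}.
import numpy as np

rng = np.random.default_rng(2)
phi_V = rng.random((2, 2))      # potential on {A, C}, indexed [a, c]
phi_W = rng.random((2, 2, 2))   # potential on {C, D, E}, indexed [c, d, e]
phi_S = np.ones(2)              # separator potential on {C}

def joint():
    # Clique-tree representation of the joint: product of cliques / product of separators.
    return phi_V[:, :, None, None] * phi_W[None, :, :, :] / phi_S[None, :, None, None]

before = joint()

# phi_S* = sum over V \ S of phi_V : marginalize A out of the {A,C} potential
phi_S_new = phi_V.sum(axis=0)
# phi_W* = (phi_S* / phi_S) phi_W : rescale the neighboring clique
phi_W = phi_W * (phi_S_new / phi_S)[:, None, None]
phi_S = phi_S_new

# The represented joint is unchanged; after messages in both directions the
# two cliques also agree on the marginal of C.
print(np.allclose(before, joint()))  # True
```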

Clique trees (cont.)

- This simple local message-passing algorithm on a clique tree defines the general probability propagation algorithm for directed graphs!

- Many interesting algorithms are special cases:
  - calculation of posterior probabilities in mixture models
  - the Baum-Welch algorithm for hidden Markov models
  - posterior propagation for probabilistic decision trees
  - Kalman filter updates

- The algorithm seems reasonable. Is it correct?

A problem

- Consider the following graph and a corresponding clique tree:

  [Figure: a four-node cycle with edges A-B, A-C, B-D, C-D, and a clique tree with cliques {A,B}, {B,D}, {A,C}, {C,D}]

- Note that C appears in two non-neighboring cliques.

- Question: What guarantee do we have that the probability associated with C in these two cliques will be the same?

- Answer: Nothing. In fact this is a problem with the algorithm as described so far. It is not true in general that local consistency implies global consistency.

- What else do we need to get such a guarantee?

Triangulation (last idea, hang in there)

- A triangulated graph is one in which no cycles with four or more nodes exist in which there is no chord

- We triangulate a graph by adding chords:

  [Figure: the four-node cycle with a chord added]

- Now we no longer have our problem:

  [Figure: the triangulated graph and its clique tree with cliques {A,B,C} and {B,C,D}]

- A clique tree for a triangulated graph has the running intersection property: if a node appears in two cliques, it appears everywhere on the path between the cliques

- Thus local consistency implies global consistency for such clique trees

Junction trees

- A clique tree for a triangulated graph is referred to as a junction tree

- In junction trees, local consistency implies global consistency. Thus the local message-passing algorithm is provably correct.

- It's also possible to show that only triangulated graphs have the property that their clique trees are junction trees. Thus, if we want local algorithms, we must triangulate.

Summary of the junction tree algorithm

1. Moralize the graph

2. Triangulate the graph

3. Propagate by local message-passing in the junction tree

- Note that the first two steps are "off-line"

- Note also that these steps provide a bound on the complexity of the propagation step

Example: Gaussian mixture models

- A Gaussian mixture model is a popular clustering model

  [Figure: data points in the (x1, x2) plane falling into clusters, each cluster associated with a setting of the hidden state q]

- hidden state q is a multinomial RV

- output x is a Gaussian RV

- prior probabilities on hidden states:

  \pi_i = P(q_i = 1)

- class-conditional probabilities:

  P(x \mid q_i = 1) = \frac{1}{(2\pi)^{d/2} |\Sigma_i|^{1/2}}
                      \exp\{-\tfrac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\}

Gaussian mixture models as a graphical model

  [Figure: a two-node graphical model q -> x, with x shaded (observed)]

- The inference problem is to calculate the posterior probabilities:

  P(q_i = 1 \mid x) = \frac{P(x \mid q_i = 1) P(q_i = 1)}{P(x)}
                    = \frac{\pi_i \, |\Sigma_i|^{-1/2} \exp\{-\tfrac{1}{2}(x-\mu_i)^T \Sigma_i^{-1}(x-\mu_i)\}}
                           {\sum_j \pi_j \, |\Sigma_j|^{-1/2} \exp\{-\tfrac{1}{2}(x-\mu_j)^T \Sigma_j^{-1}(x-\mu_j)\}}

- This is a trivial example of the rescaling operation on a clique potential
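A minimal sketch (toy parameters, not from the tutorial) of this posterior computation, done in log space for numerical stability.

```python
# A minimal sketch: P(q_i = 1 | x) for a Gaussian mixture (the E-step responsibilities).
import numpy as np

def log_gauss(x, mu, Sigma):
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(Sigma, diff))

def responsibilities(x, pis, mus, Sigmas):
    logs = np.array([np.log(pi) + log_gauss(x, mu, S)
                     for pi, mu, S in zip(pis, mus, Sigmas)])
    logs -= logs.max()                # subtract the max before exponentiating
    w = np.exp(logs)
    return w / w.sum()

# Hypothetical two-component mixture in 2-D.
pis = [0.3, 0.7]
mus = [np.zeros(2), np.array([3.0, 0.0])]
Sigmas = [np.eye(2), np.eye(2)]
print(responsibilities(np.array([2.0, 0.0]), pis, mus, Sigmas))
```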

Example: Hidden Markov models

- A hidden Markov model is a popular time series model

- It is a "mixture model with dynamics"

  [Figure: HMM graphical model with states q_1, q_2, ..., q_T, outputs y_1, y_2, ..., y_T, transition matrix A, emission matrix B, and initial distribution \pi]

- T time steps

- M states (q_t is a multinomial RV)

- N outputs (y_t is a multinomial RV)

- state transition probability matrix A:

  A = P(q_{t+1} \mid q_t)

- emission matrix B:

  B = P(y_t \mid q_t)

- initial state probabilities \pi:

  \pi = P(q_1)

HMM as a graphical model

- Each node has a probability distribution associated with it.

- The graph of the HMM makes conditional independence statements.

- For example,

  P(q_{t+1} \mid q_t, q_{t-1}) = P(q_{t+1} \mid q_t)

  can be read off the graph as a separation property.

HMM probability calculations

  [Figure: the HMM graphical model, with the outputs y_t shaded]

- The time series of y_t values is the evidence

- The inference calculation involves calculating the probabilities of the hidden states q_t given the evidence

- The classical algorithm for doing this calculation is the forward-backward algorithm

- The forward-backward algorithm involves:
  - multiplication (conditioning)
  - summation (marginalization) in the lattice

- It is a special case of the junction tree algorithm (cf. Smyth, et al., 1997)

Is the algorithm efficient?

- To answer this question, let's consider the junction tree

- Note that the moralization and triangulation steps are trivial, and we obtain the following junction tree:

  [Figure: a chain of cliques {q_1, q_2}, {q_2, q_3}, ..., with cliques {q_1, y_1}, {q_2, y_2}, ... attached]

- The cliques are no bigger than N^2, thus the marginalization and rescaling required by the junction tree algorithm runs in time O(N^2) per clique

- There are T such cliques, thus the algorithm is O(N^2 T) overall
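For concreteness, here is a minimal sketch (not from the tutorial) of the forward (alpha) pass of the forward-backward algorithm; the N x N matrix-vector product at each time step is the source of the O(N^2 T) cost.

```python
# A minimal sketch: the scaled forward recursion for a discrete-output HMM.
import numpy as np

def forward_loglik(pi, A, B, y):
    """pi: (N,) initial distribution; A: (N,N) with A[i,j] = P(q_{t+1}=j | q_t=i);
    B: (N,K) with B[i,k] = P(y_t=k | q_t=i); y: observed symbol sequence."""
    alpha = pi * B[:, y[0]]
    loglik = np.log(alpha.sum()); alpha /= alpha.sum()
    for t in range(1, len(y)):
        alpha = (alpha @ A) * B[:, y[t]]        # O(N^2) work per time step
        loglik += np.log(alpha.sum()); alpha /= alpha.sum()
    return loglik

# Hypothetical two-state, two-symbol HMM.
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.7, 0.3], [0.1, 0.9]])
print(forward_loglik(pi, A, B, [0, 1, 1, 0]))
```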

Hierarchical mixture of experts

- The HME is a "soft" decision tree

- The input vector x drops through a tree of "gating networks," each of which computes a probability for the branches below it

  [Figure: a tree of gating networks with parameters \eta, \upsilon_i, \zeta_{ij}, and expert networks with parameters \theta_{ijk}, producing outputs y]

- At the leaves, the "experts" compute a probability for the output y

- The overall probability model is a conditional mixture model

Representation as a graphical model

  [Figure: a chain x -> \omega_i -> \omega_{ij} -> \omega_{ijk} -> y]

- For learning and inference, the nodes for x and y are observed (shaded)

  [Figure: the same chain with x and y shaded]

- We need to calculate probabilities of the unshaded nodes (the E step of EM)

Learning (parameter estimation)

- The EM algorithm is natural for graphical models

- The E step of the EM algorithm involves calculating the probabilities of hidden variables given visible variables
  - this is exactly the inference problem

  [Figure: the example graph with the evidence nodes shaded]

- The M step involves parameter estimation for a fully observed graph
  - this is generally straightforward

  [Figure: the example graph with all nodes shaded]

Example: Hidden Markov models

  [Figure: the HMM graphical model with the outputs y_1, ..., y_T shaded]

Problem: Given a sequence of outputs

  y_{1,T} = \{y_1, y_2, \ldots, y_T\}

infer the parameters A, B and \pi.

- To see how to solve this problem, let's consider a simpler problem

Fully observed Markov models

- Suppose that at each point in time we know what state the system is in:

  [Figure: the HMM graphical model with both the states q_t and the outputs y_t shaded]

- Parameter estimation is easy in this case:
  - to estimate the state transition matrix elements, simply keep a running count of the number n_{ij} of times the chain jumps from state i to state j. Then estimate a_{ij} as:

    \hat{a}_{ij} = \frac{n_{ij}}{\sum_j n_{ij}}

  - to estimate the Gaussian output probabilities, simply record which data points occurred in which states and compute sample means and covariances
  - as for the initial state probabilities, we need multiple output sequences (which we usually have in practice)
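A minimal sketch (not from the tutorial) of the transition-count estimate above, \hat{a}_{ij} = n_{ij} / \sum_j n_{ij}, for a fully observed state sequence.

```python
# A minimal sketch: maximum-likelihood transition matrix from an observed state sequence.
import numpy as np

def estimate_transitions(states, M):
    counts = np.zeros((M, M))
    for i, j in zip(states[:-1], states[1:]):
        counts[i, j] += 1                            # running count n_ij
    return counts / counts.sum(axis=1, keepdims=True)

print(estimate_transitions([0, 0, 1, 2, 1, 0, 1, 1, 2, 0], M=3))
```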

HMM parameter estimation

- When the hidden states are not in fact observed (the case we're interested in), we first estimate the probabilities of the hidden states
  - this is a straightforward application of the junction tree algorithm (i.e., the forward-backward algorithm)

- We then use the probability estimates instead of the counts in the parameter estimation formulas to get update formulas

- This gives us a better model, so we run the junction tree algorithm again to get better estimates of the hidden state probabilities

- And we iterate this procedure

CONCLUSIONS (PART I)

- Graphical models provide a general formalism for putting together graphs and probabilities
  - most so-called "unsupervised neural networks" are special cases
  - Boltzmann machines are special cases
  - mixtures of experts and related mixture-based models are special cases
  - some supervised neural networks can be treated as special cases

- The graphical model framework allows us to treat inference and learning as two sides of the same coin

INTRACTABLE GRAPHICAL MODELS

- There are a number of examples of graphical models in which exact inference is efficient:
  - chain-like graphs
  - tree-like graphs

- However, there are also a number of examples of graphical models in which exact inference can be hopelessly inefficient:
  - dense graphs
  - layered graphs
  - coupled graphs

- A variety of methods are available for approximate inference in such situations:
  - Markov chain Monte Carlo (stochastic)
  - variational methods (deterministic)

Computational complexity of exact computation

- passing messages requires marginalizing and scaling the clique potentials

- thus the time required is exponential in the number of variables in the largest clique

- good triangulations yield small cliques
  - but the problem of finding an optimal triangulation is hard (NP-hard, in fact)

- in any case, the triangulation is "off-line"; our concern is generally with the "on-line" problem of message propagation

Quick Medical Reference (QMR)

(University of Pittsburgh)

- 600 diseases, 4000 symptoms

- arranged as a bipartite graph:

  [Figure: a layer of disease nodes above a layer of symptom nodes]

- Node probabilities P(symptom_i | diseases) were obtained from an expert, under a noisy-OR model

- Want to do diagnostic calculations:

  P(diseases | findings)

- Current methods (exact and Monte Carlo) are infeasible

QMR (cont.)

  [Figure: the bipartite graph of diseases and symptoms]

- "noisy-OR" parameterization:

  P(f_i = 0 \mid d) = (1 - q_{i0}) \prod_{j \in pa(i)} (1 - q_{ij})^{d_j}

- rewrite in an exponential form:

  P(f_i = 0 \mid d) = e^{-\theta_{i0} - \sum_{j \in pa(i)} \theta_{ij} d_j}

  where \theta_{ij} \equiv -\log(1 - q_{ij})

- probability of a positive finding:

  P(f_i = 1 \mid d) = 1 - e^{-\theta_{i0} - \sum_{j \in pa(i)} \theta_{ij} d_j}
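A minimal sketch (hypothetical weights, not the QMR-DT parameters) of this noisy-OR probability of a positive finding.

```python
# A minimal sketch: P(f_i = 1 | d) = 1 - exp(-theta_i0 - sum_j theta_ij d_j).
import numpy as np

theta_i0 = 0.05                       # leak term
theta_ij = np.array([0.8, 0.3, 1.2])  # one weight per parent disease (made up)
d = np.array([1, 0, 1])               # which parent diseases are present

p_positive = 1.0 - np.exp(-theta_i0 - theta_ij @ d)
print(p_positive)
```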

QMR (cont.)

  [Figure: the bipartite graph of diseases and symptoms]

- Joint probability:

  P(f, d) = P(f \mid d) P(d) = \left[\prod_i P(f_i \mid d)\right] \left[\prod_j P(d_j)\right]

          = \left(1 - e^{-\theta_{10} - \sum_{j \in pa(1)} \theta_{1j} d_j}\right)
            \left(1 - e^{-\theta_{20} - \sum_{j \in pa(2)} \theta_{2j} d_j}\right) \cdots
            \left[\prod_k \left(1 - e^{-\theta_{k0} - \sum_{j \in pa(k)} \theta_{kj} d_j}\right)\right]
            \prod_j P(d_j)

  where the product over k collects the remaining positive findings

- Positive findings couple the disease nodes

- size of the maximal clique is 151 nodes

Multilayer neural networks as graphical models

(cf. Neal, 1992; Saul, Jaakkola, & Jordan, 1996)

- Associate with node i a latent binary variable whose conditional probability is given by:

  P(S_i = 1 \mid S_{pa(i)}) = \frac{1}{1 + e^{-\sum_{j \in pa(i)} \theta_{ij} S_j - \theta_{i0}}}

  where pa(i) indexes the parents of node i

- A multilayer neural network with logistic hidden units:

  [Figure: a layered network with input, hidden, and output layers]

  has a joint distribution that is a product of logistic functions:

  P(S) = \prod_i \frac{e^{(\sum_{j \in pa(i)} \theta_{ij} S_j + \theta_{i0}) S_i}}
                      {1 + e^{\sum_{j \in pa(i)} \theta_{ij} S_j + \theta_{i0}}}
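A minimal sketch (toy weights, not from the tutorial) that evaluates this product-of-logistic-functions joint, in log space, for a small sigmoid belief network.

```python
# A minimal sketch: log P(S) = sum_i [ z_i S_i - log(1 + e^{z_i}) ],
# with z_i = sum_{j in pa(i)} theta_ij S_j + theta_i0.
import numpy as np

def log_joint(S, theta, bias, parents):
    """S: 0/1 list; theta[i][j]: weight from parent j to node i;
    bias[i]: theta_i0; parents[i]: parent indices (a DAG ordering)."""
    total = 0.0
    for i in range(len(S)):
        z = bias[i] + sum(theta[i][j] * S[j] for j in parents[i])
        total += z * S[i] - np.log1p(np.exp(z))   # log of e^{z S_i} / (1 + e^{z})
    return total

# Hypothetical 3-node chain 0 -> 1 -> 2.
parents = {0: [], 1: [0], 2: [1]}
theta = {0: {}, 1: {0: 1.5}, 2: {1: -0.7}}
bias = [0.2, -0.5, 0.1]
print(log_joint([1, 0, 1], theta, bias, parents))
```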

Complexity of neural network inference

- When an output node is known (as it is during learning), moralization links the hidden units:

  [Figure: a hidden layer whose units become linked when the output node is observed]

- So inference scales at least as badly as O(2^N)

- And triangulation adds even more links

Hidden Markov models

- Recall the hidden Markov model:

  [Figure: HMM graphical model with states q_1, ..., q_T, outputs y_1, ..., y_T, transition matrix A, emission matrix B, and initial distribution \pi]

- T time steps

- M states (q_t is a multinomial RV)

- N outputs (y_t is a multinomial RV)

- state transition probability matrix A:

  A = P(q_{t+1} \mid q_t)

- emission matrix B:

  B = P(y_t \mid q_t)

- initial state probabilities \pi:

  \pi = P(q_1)

Factorial hidden Markov models

(cf. Williams & Hinton, 1991; Ghahramani & Jordan, 1997)

- Imagine that a time series is created from a set of M loosely-coupled underlying mechanisms

- Each of these mechanisms may have its own particular dynamic laws, and we want to avoid merging them into a single "meta-state" with a single transition matrix
  - avoid choosing a single time scale
  - avoid over-parameterization

- Here is the graphical model that we would like to use:

  [Figure: three parallel state chains X_t^{(1)}, X_t^{(2)}, X_t^{(3)}, all feeding into the common outputs Y_1, Y_2, Y_3, ...]

- When we triangulate, do we get an efficient structure?

Triangulation?

- Unfortunately, the following graph is not triangulated:

  [Figure: the moralized factorial HMM]

- Here is a triangulation:

  [Figure: the factorial HMM with chords added across the chains at each time step]

- We have created cliques of size N^4. The junction tree algorithm is not efficient for factorial HMMs.

Hidden Markov decision trees

(Jordan, Ghahramani, & Saul, 1997)

- We can combine decision trees with factorial HMMs

- This gives a "command structure" to the factorial representation

  [Figure: coupled state chains arranged in a decision-tree hierarchy, with inputs U_1, U_2, U_3 and outputs Y_1, Y_2, Y_3]

- Appropriate for multiresolution time series

- Again, the exact calculation is intractable and we must use variational methods

Markov chain Monte Carlo (MCMC)

- Consider a set of variables S = \{S_1, S_2, \ldots, S_N\}

- Consider a joint probability density P(S)

- We would like to calculate various quantities associated with P(S):
  - marginal probabilities, e.g., P(S_i) or P(S_i, S_j)
  - conditional probabilities, e.g., P(H | V)
  - likelihoods, i.e., P(V)

- One way to do this is to generate samples from P(S) and compute empirical statistics
  - but generally it is hard to see how to sample from P(S)

- We set up a simple Markov chain whose equilibrium distribution is P(S)

Gibbs sampling

- Gibbs sampling is a widely-used MCMC method

- Recall that we have a set of variables S = \{S_1, S_2, \ldots, S_N\}

- We set up a Markov chain as follows:
  - initialize the S_i to arbitrary values
  - choose i randomly
  - sample from P(S_i | S \setminus S_i)
  - iterate

- It is easy to prove that this scheme has P(S) as its equilibrium distribution

- How to do Gibbs sampling in graphical models?

Markov blankets

- The Markov blanket of node S_i is the minimal set of nodes that renders S_i conditionally independent of all other nodes

- For undirected graphs, the Markov blanket is just the set of neighbors

- For directed graphs, the Markov blanket is the set of parents, children and co-parents:

  [Figure: node S_i with its parents, children, and co-parents highlighted]

- The conditional P(S_i | S \setminus S_i) needed for Gibbs sampling is formed from the product of the conditional probabilities associated with S_i and each of its children

- This implies that the conditioning set needed to form P(S_i | S \setminus S_i) is the Markov blanket of S_i, which is usually much smaller than S \setminus S_i
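A minimal sketch (not from the tutorial) of a Gibbs sweep built on exactly this observation: to resample node i we only multiply together the conditionals that mention S_i (its own and its children's), so only the Markov blanket enters. The callables cond_prob and children are model-specific placeholders.

```python
# A minimal sketch: one Gibbs sweep over the binary nodes of a directed model.
import numpy as np

rng = np.random.default_rng(3)

def gibbs_sweep(S, nodes, cond_prob, children):
    """S: dict node -> 0/1 state; cond_prob(i, S): P(S_i = S[i] | parents of i in S);
    children[i]: nodes whose conditional mentions i. Both are supplied by the model."""
    for i in rng.permutation(nodes):
        weights = []
        for v in (0, 1):
            S[i] = v
            # Unnormalized P(S_i = v | rest): only the factors touching S_i matter.
            w = cond_prob(i, S)
            for c in children[i]:
                w *= cond_prob(c, S)
            weights.append(w)
        p_one = weights[1] / (weights[0] + weights[1])
        S[i] = int(rng.random() < p_one)
    return S
```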

Issues in Gibbs sampling

- Gibbs sampling is generally applicable

- It can be very slow
  - i.e., the time to converge can be long

- Moreover it can be hard to diagnose convergence
  - but reasonable heuristics are available

- It is often possible to combine Gibbs sampling with exact methods

Variational methods

- Variational methods are deterministic approximation methods
  - perhaps unique among deterministic methods in that they tend to work best for dense graphs

- They have some advantages compared to MCMC methods:
  - they can be much faster
  - they yield upper and lower bounds on probabilities

- And they have several disadvantages:
  - they are not as simple and widely applicable as MCMC methods
  - they require more art (thought) on the part of the user than MCMC methods

- But both variational approaches and MCMC approaches are evolving rapidly

- Note also that they can be combined with each other, and can be combined with exact methods

Introduction to variational methods

- Intuition: in a dense graph, each node is subject to many stochastic influences from nodes in its Markov blanket
  - laws of large numbers
  - coupled, nonlinear interactions between averages

  [Figure: node S_i receiving many incoming influences]

- We want to exploit such averaging where applicable and where needed, while using exact inference algorithms on tractable "backbones"

Example of a variational transformation

  [Figure: log x together with its linear upper bounds \lambda x - \log\lambda - 1 for several values of \lambda]

  \log x = \min_{\lambda} \{\lambda x - \log \lambda - 1\}
         \le \lambda x - \log \lambda - 1

  - \lambda is a variational parameter
  - transforms a nonlinearity into a linearity
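The minimization can be checked in one line (this derivation is mine, not on the slide): setting the derivative with respect to the variational parameter to zero recovers the logarithm exactly, so the bound is tight at \lambda = 1/x.

```latex
% Verifying the variational transformation of log x.
\frac{\partial}{\partial\lambda}\bigl(\lambda x - \log\lambda - 1\bigr)
  = x - \frac{1}{\lambda} = 0
  \;\Rightarrow\; \lambda^{*} = \frac{1}{x},
\qquad
\lambda^{*} x - \log\lambda^{*} - 1 = 1 + \log x - 1 = \log x .
```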

Example of a variational transformation

  [Figure: the logistic function g(x) together with its exponential upper bounds e^{\xi x - H(\xi)}]

  g(x) = \frac{1}{1 + e^{-x}} = \min_{\xi} \{e^{\xi x - H(\xi)}\}
       \le e^{\xi x - H(\xi)}

  - where H(\xi) is the binary entropy
  - a nonlinear function is now a simple exponential
  - cf. "tilted distributions"

Convex duality approach

(Jaakkola & Jordan, NIPS '97)

- for concave f(x):

  f(x) = \min_{\lambda} \{\lambda^T x - f^*(\lambda)\}

  f^*(\lambda) = \min_{x} \{\lambda^T x - f(x)\}

- yields bounds:

  f(x) \le \lambda^T x - f^*(\lambda)

  f^*(\lambda) \le \lambda^T x - f(x)

  [Figure: a concave function together with one of its linear upper bounds]

- lower bounds are obtained from convex f(x)

Variational transformations and inference

- Two basic approaches: sequential and block
  - we discuss the sequential approach first and return to the block approach later

- In the sequential approach, we introduce variational transformations sequentially, node by node
  - this yields a sequence of increasingly simple graphical models
  - every transformation introduces a new variational parameter, which we lazily put off evaluating until the end
  - eventually we obtain a model that can be handled by exact techniques, at which point we stop introducing transformations

Variational QMR

(Jaakkola & Jordan, 1997)

- Recall the noisy-OR probability of a positive finding:

  P(f_i = 1 \mid d) = 1 - e^{-\theta_{i0} - \sum_{j \in pa(i)} \theta_{ij} d_j}

- The logarithm of this function is concave, thus we can utilize convex duality
  - evaluating the conjugate function, we obtain:

    f^*(\lambda) = -\lambda \ln \lambda + (\lambda + 1) \ln(\lambda + 1)

- Thus we upper bound the probability of a positive finding:

  P(f_i = 1 \mid d) \le e^{\lambda_i (\theta_{i0} + \sum_{j \in pa(i)} \theta_{ij} d_j) - f^*(\lambda_i)}
                     = \left[\prod_{j \in pa(i)} \bigl(e^{\lambda_i \theta_{ij}}\bigr)^{d_j}\right]
                       e^{\lambda_i \theta_{i0} - f^*(\lambda_i)}

- This is a factorized form; it effectively changes the "priors" P(d_j) by multiplying them by e^{\lambda_i \theta_{ij}} and delinking the i-th node from the graph

Variational QMR (cont.)

- Recall the joint distribution:

  P(f, d) = \left(1 - e^{-\theta_{10} - \sum_{j \in pa(1)} \theta_{1j} d_j}\right)
            \left(1 - e^{-\theta_{20} - \sum_{j \in pa(2)} \theta_{2j} d_j}\right) \cdots
            \left[\prod_k \left(1 - e^{-\theta_{k0} - \sum_{j \in pa(k)} \theta_{kj} d_j}\right)\right]
            \prod_j P(d_j)

- After a variational transformation (of the findings indexed by k):

  P(f, d) \le \left(1 - e^{-\theta_{10} - \sum_{j \in pa(1)} \theta_{1j} d_j}\right)
              \left(1 - e^{-\theta_{20} - \sum_{j \in pa(2)} \theta_{2j} d_j}\right) \cdots
              \left[\prod_j \left(\prod_k e^{\lambda_k \theta_{kj}}\right)^{d_j} P(d_j)\right]
              \left(\prod_k e^{\lambda_k \theta_{k0} - f^*(\lambda_k)}\right)

- Use a greedy method to introduce the variational transformations
  - choose the node that is estimated to yield the most accurate transformed posterior

- Optimize across the variational parameters \lambda_i
  - this turns out to be a convex optimization problem

QMR results

- We studied 48 clinicopathologic conference (CPC) cases

- 4 of these CPC cases had fewer than 20 positive findings; for these we calculated the disease posteriors exactly

- Scatterplots of exact vs. variational posteriors for (a) 4 positive findings treated exactly, (b) 8 positive findings treated exactly:

  [Figure: two scatterplots of variational vs. exact posteriors on the unit interval]

- Scatterplots of exact vs. variational posteriors for (a) 12 positive findings treated exactly, (b) 16 positive findings treated exactly:

  [Figure: two scatterplots of variational vs. exact posteriors on the unit interval]

QMR results (cont.)

- For the remaining 44 CPC cases, we had no way to calculate the exact posteriors

- Thus we assessed the variational accuracy via a sensitivity estimate
  - we calculated the squared difference in posteriors when a particular node was treated exactly or variationally transformed, and averaged across nodes
  - a small value of this average suggests that we have the correct posteriors
  - we validated this surrogate on the 4 cases for which we could do the exact calculation

- Sensitivity estimates vs. number of positive findings when (a) 8 positive findings treated exactly, (b) 12 positive findings treated exactly:

  [Figure: two plots of the sensitivity estimate against the number of positive findings (10 to 70)]

QMR results (cont.)

- Timing results in seconds (Sparc 10) as a function of the number of positive findings treated exactly (solid line: average across CPC cases; dashed line: maximum across CPC cases):

  [Figure: run time in seconds (up to about 140) vs. number of positive findings treated exactly (0 to 12)]

A cautionary note

- these disease marginals are based on the upper variational distribution, which appears tight

- this distribution is guaranteed to upper bound the likelihood

- to obtain direct upper and lower bounds on the disease marginals, which are conditional probabilities, we need upper and lower bounds on the likelihood

- the lower bounds we obtain in our current implementation, however, are not tight enough

- thus, although the marginals we report appear to yield good approximations empirically, we cannot guarantee that they bound the true marginals

Variational transformations and inference

- Two basic approaches: sequential and block

- The block approach treats the approximation problem as a global optimization problem

- Consider a dense graph characterized by a joint distribution P(H, V | \theta)

- We remove links to obtain a simpler graph characterized by a conditional distribution Q(H | V, \lambda, \theta)

- Examples will be provided below...

Variational inference (cont.)

(Dayan, et al., 1995; Hinton, et al., 1995; Saul & Jordan, 1996)

- Q(H | V, \lambda, \theta) has extra degrees of freedom, given by variational parameters \lambda_i
  - we can think of these as being obtained by a sequence of variational transformations applied to the nodes
  - but we now want to take a more global view

- Choose \lambda_i so as to minimize

  KL(Q \| P) = \sum_H Q(H \mid V, \lambda, \theta) \log \frac{Q(H \mid V, \lambda, \theta)}{P(H \mid V, \theta)}

- We will show that this yields a lower bound on the probability of the evidence (the likelihood)

- Minimizing the KL divergence requires us to compute averages under the Q distribution; we must choose Q so that this is possible
  - i.e., we choose our simplified graph so that it is amenable to exact methods

Variational inference (cont.)

- The fact that this is a lower bound follows from Jensen's inequality:

  \log P(V) = \log \sum_H P(H, V)
            = \log \sum_H Q(H \mid V) \frac{P(H, V)}{Q(H \mid V)}
            \ge \sum_H Q(H \mid V) \log \frac{P(H, V)}{Q(H \mid V)}

- The difference between the left and right hand sides is the KL divergence:

  KL(Q \| P) = \sum_H Q(H \mid V) \log \frac{Q(H \mid V)}{P(H \mid V)}

  which is positive; thus we have a lower bound
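The identity behind this argument, log P(V) = (lower bound) + KL(Q || P(H|V)), can be checked numerically; here is a minimal sketch with an arbitrary small discrete joint (toy numbers, not from the tutorial).

```python
# A minimal sketch: verify log P(V) = sum_H Q log(P(H,V)/Q) + KL(Q || P(H|V)).
import numpy as np

rng = np.random.default_rng(4)
pHV = rng.random((4, 3)); pHV /= pHV.sum()       # joint P(H, V): H has 4 values, V has 3
v = 1                                             # observed evidence

pV = pHV[:, v].sum()                              # likelihood P(V = v)
post = pHV[:, v] / pV                             # exact posterior P(H | V = v)

q = rng.random(4); q /= q.sum()                   # an arbitrary approximating Q(H | V)

lower_bound = np.sum(q * np.log(pHV[:, v] / q))   # Jensen lower bound on log P(V)
kl = np.sum(q * np.log(q / post))                 # KL(Q || P(H|V)) >= 0

print(np.isclose(np.log(pV), lower_bound + kl))   # True
```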

Linking the two approaches

(Jaakkola, 1997)

- The block approach can be derived within the convex duality framework

  f(x) \ge \lambda^T x - f^*(\lambda)

  - treat the distribution Q(H | V, \lambda) as \lambda, a vector-valued variational parameter (one value for each configuration H)
  - the argument x becomes \log P(H, V | \theta), also a vector-valued variable (one value for each configuration H)
  - the function f(x) becomes \log P(V | \theta)
  - it turns out that the conjugate function f^*(\lambda) is the negative entropy function

- Thus convex duality yields:

  \log P(V) \ge \sum_H Q(H \mid V) \log P(H, V) - \sum_H Q(H \mid V) \log Q(H \mid V)

  which is the bound derived earlier from Jensen's inequality

Learning via variational methods

(Neal & Hinton, in press)

- MAP parameter estimation for graphical models:
  - the EM algorithm is a general method for MAP estimation
  - "inference" is the E step of EM for graphical models (calculate P(H | V) to "fill in" the hidden values)
  - variational methods provide an approximate E step
  - more specifically, we increase the lower bound on the likelihood at each iteration

Neural networks and variational approximations

(Saul, Jaakkola, & Jordan, 1996)

- A multilayer neural network with logistic hidden units:

  [Figure: a layered network with input, hidden, and output layers]

  has a joint distribution that is a product of logistic functions:

  P(H, V \mid \theta) = \prod_i \frac{e^{(\sum_{j \in pa(i)} \theta_{ij} S_j + \theta_{i0}) S_i}}
                                     {1 + e^{\sum_{j \in pa(i)} \theta_{ij} S_j + \theta_{i0}}}

- The simplest variational approximation, which we will refer to as a mean field approximation, considers the factorized approximation:

  Q(H \mid V, \mu) = \prod_{i \in H} \mu_i^{S_i} (1 - \mu_i)^{1 - S_i}.

Division of labor

- The KL bound has two basic components:

  (variational entropy) = -\sum_H Q(H \mid V) \log Q(H \mid V)

  (variational energy) = -\sum_H Q(H \mid V) \log P(H, V)

- What we need is the difference:

  \log P(V \mid \theta) \ge (variational entropy) - (variational energy)

Maximizing the lower bound

- We find the best approximation by varying \{\mu_i\} to minimize KL(Q \| P).

- This amounts to maximizing the lower bound:

  \log P(V \mid \theta) \ge (variational entropy) - (variational energy).

- Abuse of notation: we define clamped values (0 or 1) for the instantiated nodes,

  \mu_i = S_i  for  i \in V.

Variational entropy

- The variational entropy is:

  -\sum_H Q(H \mid V) \log Q(H \mid V)

- Our factorized approximation is:

  Q(H \mid V) = \prod_i \mu_i^{S_i} (1 - \mu_i)^{1 - S_i}

- The joint entropy is the sum of the individual unit entropies:

  -\sum_i [\mu_i \log \mu_i + (1 - \mu_i) \log(1 - \mu_i)].

Variational energy

- The variational energy is:

  -\sum_H Q(H \mid V) \log P(H, V)

- In sigmoid networks:

  \log P(H, V) = \sum_{ij} \theta_{ij} S_i S_j - \sum_i \log\left(1 + e^{\sum_j \theta_{ij} S_j}\right).

- The first term is common to undirected networks. The last term is not.

- Averaging over Q(H | V) gives:

  \sum_{ij} \theta_{ij} \mu_i \mu_j - \sum_i \left\langle \log\left(1 + e^{\sum_j \theta_{ij} S_j}\right) \right\rangle

- The last term is intractable, so again we use Jensen's inequality to obtain an upper bound...

- Lemma (Seung): for any random variable z, and any real number \xi:

  \left\langle \log[1 + e^z] \right\rangle \le \xi \langle z \rangle + \log\left\langle e^{-\xi z} + e^{(1-\xi) z} \right\rangle.

- Ex: z is Gaussian with zero mean and unit variance.

  [Figure: the bound and the exact value plotted as functions of \xi]

- Ex: z is the sum of weighted inputs to S_i:

  z = \sum_j \theta_{ij} S_j + h_i.

  The lemma provides an upper bound on the intractable terms in the variational energy:

  \left\langle \log\left(1 + e^{\sum_j \theta_{ij} S_j}\right) \right\rangle \le \cdots

Variational mean field equations

- The bound on the log-likelihood,

  \log P(V \mid \theta) \ge (variational entropy) - (variational energy),

  is valid for any setting of the variational parameters \{\mu_i\}

- The optimal \{\mu_i\} are found by solving the variational equations:

  \mu_i = \sigma\left( \sum_j \left[ \theta_{ij} \mu_j + \theta_{ji} (\mu_j - \xi_j) - K_{ji} \right] \right)

  [Figure: node S_i and the terms that feed into its update]

- The effective input to S_i is composed of terms from its Markov blanket
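The equations above are specific to sigmoid networks. As a simpler illustration of the same fixed-point idea, here is a minimal sketch of naive mean-field updates for a pairwise binary (Boltzmann-machine-style) model, mu_i = sigma(sum_j J_ij mu_j + h_i); it deliberately omits the xi_j and K_ji correction terms that appear in the equations above.

```python
# A minimal sketch: naive mean-field fixed-point updates for a pairwise binary model.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_field(J, h, n_sweeps=50):
    """J: symmetric (N,N) couplings with zero diagonal; h: (N,) biases."""
    mu = np.full(len(h), 0.5)
    for _ in range(n_sweeps):
        for i in range(len(h)):
            mu[i] = sigmoid(J[i] @ mu + h[i])   # effective input from the Markov blanket
    return mu

J = np.array([[0.0, 1.2], [1.2, 0.0]])
h = np.array([-0.5, 0.3])
print(mean_field(J, h))
```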

Numerical experiments

- For small networks, the variational bound can be compared to the true likelihood obtained by exact enumeration.

- We considered the event that all the units in the bottom layer were inactive.

- This was done for 10000 random networks whose weights and biases were uniformly distributed between -1 and 1.

- Mean field approximation:

  [Figure: histogram of the relative error in log-likelihood for the mean field approximation; errors concentrated between 0 and about 0.07]

- Uniform approximation:

  [Figure: histogram of the relative error in log-likelihood for the uniform approximation; errors spread from about -1 to 1.5]

Digit recognition

- Images:

  [Figure: sample handwritten digit images]

- Confusion matrix:

        0    1    2    3    4    5    6    7    8    9
   0  388    2    2    0    1    3    0    0    4    0
   1    0  393    0    0    0    1    0    0    6    0
   2    1    2  376    1    3    0    4    0   13    0
   3    0    2    4  373    0   12    0    0    6    3
   4    0    0    2    0  383    0    1    2    2   10
   5    0    2    1   13    0  377    2    0    4    1
   6    1    4    2    0    1    6  386    0    0    0
   7    0    1    0    0    0    0    0  388    3    8
   8    1    9    1    7    0    7    1    1  369    4
   9    0    4    0    0    0    0    0    8    5  383

- Comparative results:

  algorithm            test error
  nearest neighbor        6.7
  back propagation        5.6
  Helmholtz machine       4.8
  variational             4.6

Example: Factorial HMM

(Ghahramani & Jordan, 1997)

- Recall the factorial hidden Markov model, which yielded intractably large cliques when triangulated:

  [Figure: three parallel state chains X_t^{(1)}, X_t^{(2)}, X_t^{(3)} feeding into the common outputs Y_1, Y_2, Y_3, ...]

- We can variationally transform this model into:

  [Figure: three decoupled state chains]

  where we see that we have to solve separate simple HMM problems on each iteration. The variational parameters couple the chains.

Example: Factorial HMM (cont.)

  [Figure: the factorial HMM]

- M independent Markov chains given the observations:

  Q(\{X_t^{(m)}\} \mid \lambda) = \prod_{m=1}^{M} \left[ Q(X_1^{(m)} \mid \lambda)
      \prod_{t=2}^{T} Q(X_t^{(m)} \mid X_{t-1}^{(m)}, \lambda) \right]

  where

  Q(X_1^{(m)} \mid \lambda) = \pi^{(m)} h_1^{(m)}

  Q(X_t^{(m)} \mid X_{t-1}^{(m)}, \lambda) = P^{(m)} h_t^{(m)}.

- The parameters of this approximation are the h_t^{(m)}, which play the role of observation log probabilities for the HMMs.

Example: Factorial HMM (cont.)

- Minimizing the KL divergence results in the following fixed point equation:

  h_t^{(m)} \propto \exp\left\{ W^{(m)\prime} C^{-1} \bigl(Y_t - \hat{Y}_t\bigr)
                               + W^{(m)\prime} C^{-1} W^{(m)} \langle X_t^{(m)} \rangle
                               - \tfrac{1}{2} \Delta^{(m)} \right\},

  where

  \hat{Y}_t = \sum_{\ell=1}^{M} W^{(\ell)} \langle X_t^{(\ell)} \rangle,

  and \Delta^{(m)} is the vector of diagonal elements of W^{(m)\prime} C^{-1} W^{(m)}.

- Repeat until convergence of KL(Q \| P):

  1. Compute h_t^{(m)} using the fixed-point equation, which depends on \langle X_t^{(m)} \rangle

  2. Compute \langle X_t^{(m)} \rangle using the forward-backward algorithm on HMMs with observation log probabilities given by h_t^{(m)}

Results

- Fitting the FHMM to the Bach chorale dataset (Ghahramani & Jordan, 1997):

  Modeling J. S. Bach's chorales

  Discrete event sequences:

    Attribute   Description                      Representation
    pitch       pitch of the event               int
    keysig      key signature of the chorale     int (num of sharps and flats)
    timesig     time signature of the chorale    int (1/16 notes)
    fermata     event under fermata?             binary
    st          start time of event              int (1/16 notes)
    dur         duration of event                int (1/16 notes)

  First 40 events of 66 chorale melodies; training: 30 melodies, test: 36 melodies.

  See Conklin and Witten (1995).

- Fitting a single HMM to the Bach chorale data:

  [Figure: validation set log likelihood vs. size of state space (up to 80) for a single HMM]

- Fitting a factorial HMM to the Bach chorale data:

  [Figure: validation set log likelihood vs. size of state space (10 to 1000, log scale) for the factorial HMM]

Example: hidden Markov decision tree

(Jordan, Ghahramani, & Saul, 1997)

- Recall the hidden Markov decision tree, which also yielded intractably large cliques when triangulated:

  [Figure: the hidden Markov decision tree with inputs U_1, U_2, U_3 and outputs Y_1, Y_2, Y_3]

- We can variationally transform this model into one of two simplified models:

  [Figure: either a forest of chains (decoupled state chains z_t^1, z_t^2, z_t^3) or a forest of trees (decoupled per-time-step trees)]

Forest of chains approximation

- Eliminating the vertical links that couple the states yields an approximating graph that is a forest of chains:

  [Figure: three decoupled chains]

- The Q distribution is given by:

  Q(\{z_t^1, z_t^2, z_t^3\} \mid \{y_t\}, \{x_t\}) =
    \frac{1}{Z_Q} \prod_{t=2}^{T} \tilde{a}_t^1(z_t^1 \mid z_{t-1}^1) \, \tilde{a}_t^2(z_t^2 \mid z_{t-1}^2) \, \tilde{a}_t^3(z_t^3 \mid z_{t-1}^3)
                  \prod_{t=1}^{T} \tilde{q}_t^1(z_t^1) \, \tilde{q}_t^2(z_t^2) \, \tilde{q}_t^3(z_t^3)

  where \tilde{a}_t^i(z_t^i \mid z_{t-1}^i) and \tilde{q}_t^i(z_t^i) are potentials that provide the variational parameterization

- The resulting algorithm is efficient because we know an efficient subroutine for single chains (the forward-backward algorithm)

Forest of trees approximation

- Eliminating the horizontal links that couple the states yields an approximating graph that is a forest of trees:

  [Figure: decoupled per-time-step trees]

- The Q distribution is given by:

  Q(\{z_t^1, z_t^2, z_t^3\} \mid \{y_t\}, \{x_t\}) =
    \prod_{t=1}^{T} \tilde{r}_t^1(z_t^1) \, \tilde{r}_t^2(z_t^2 \mid z_t^1) \, \tilde{r}_t^3(z_t^3 \mid z_t^1, z_t^2)

- The resulting algorithm is efficient because we know an efficient subroutine for decision trees (the upward recursion from Jordan and Jacobs)

A Viterbi-like approximation

- We can develop a Viterbi-like algorithm by utilizing an approximation Q that assigns probability one to a single path \{\bar{z}_t^1, \bar{z}_t^2, \bar{z}_t^3\}:

  Q(\{z_t^1, z_t^2, z_t^3\} \mid \{y_t\}, \{x_t\}) =
    \begin{cases} 1 & \text{if } z_t^i = \bar{z}_t^i \text{ for all } t, i \\ 0 & \text{otherwise} \end{cases}

- Note that the entropy -\sum Q \ln Q is zero

- The evaluation of the energy -\sum Q \ln P reduces to substituting \bar{z}_t^i for z_t^i in P

- The resulting algorithm involves a subroutine in which a standard Viterbi algorithm is run on a single chain, with the other (fixed) chains providing field terms

Mixture-based variational approximation

(Jaakkola & Jordan, 1997)

- Naive mean field approximations are unimodal

- Exact methods running as subroutines provide a way to capture multimodality

- But we would like a method that allows multimodality in the approximation itself

- This can be done by letting the Q distribution be a mixture:

  Q_{mix}(H \mid V) = \sum_m \alpha_m Q_{mf}(H \mid V, m)

  where each component Q_{mf}(H \mid V, m) is a factorized model

Mixture-based approximation (cont.)

- The bound on the likelihood takes the following form:

  F(Q_{mix}) = \sum_m \alpha_m F(Q_{mf} \mid m) + I(m; H)

  where I(m; H) is the mutual information between the mixture components and the hidden variables

- Derivation:

  F(Q_{mix}) = \sum_H Q_{mix}(H) \log \frac{P(H, V)}{Q_{mix}(H)}

             = \sum_{m,H} \alpha_m Q_{mf}(H \mid m) \log \frac{P(H, V)}{Q_{mix}(H)}

             = \sum_{m,H} \alpha_m Q_{mf}(H \mid m) \log \frac{P(H, V)}{Q_{mf}(H \mid m)}
               + \sum_{m,H} \alpha_m Q_{mf}(H \mid m) \log \frac{Q_{mf}(H \mid m)}{Q_{mix}(H)}

             = \sum_m \alpha_m F(Q_{mf} \mid m) + I(m; H)

- We see that the bound can be improved vis-a-vis naive mean field via the mutual information term

Mixture-based variational approximation

(Bishop, et al., 1997)

  [Figure: histograms of the relative error for mixture approximations with 1 to 5 components, and a plot of mean error vs. number of components; the mean error decreases from 0.0157 with 1 component to 0.0114 with 5 components]

CONCLUSIONS (PART II)

- General frameworks for probabilistic computation in graphical models:
  - exact computation
  - deterministic approximation
  - stochastic approximation

- Variational methods are deterministic approximation methods
  - they aim to take advantage of law-of-large-numbers phenomena in dense networks
  - they yield upper and lower bounds on desired probabilities

- The techniques can be combined to yield hybrid techniques, which may well turn out to be better than any single technique

ADDITIONAL TOPICS

- There are many other topics that we have not covered, including:
  - learning structure (see the Heckerman tutorial)
  - causality (see the UAI homepage)
  - qualitative graphical models (see the UAI homepage)
  - relationships to planning (see the UAI homepage)
  - influence diagrams (see the UAI homepage)
  - relationships to error control coding (see the Frey thesis)

- The Uncertainty in Artificial Intelligence (UAI) homepage:

  http://www.auai.org

REFERENCES

Jensen, F. (1996). An Introduction to Bayesian Networks. London: UCL Press (also published by Springer-Verlag).

Lauritzen, S. L., & Spiegelhalter, D. J. (1988). Local computations with probabilities on graphical structures and their application to expert systems (with discussion). Journal of the Royal Statistical Society B, 50, 157-224.

Jensen, F. V., Lauritzen, S. L., & Olesen, K. G. (1990). Bayesian updating in recursive graphical models by local computations. Computational Statistical Quarterly, 4, 269-282.

Dawid, A. P. (1992). Applications of a general propagation algorithm for probabilistic expert systems. Statistics and Computing, 2, 25-36.

Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo, CA: Morgan Kaufmann.

Whittaker, J. (1990). Graphical Models in Applied Multivariate Statistics. New York: John Wiley.

Lauritzen, S. (1996). Graphical Models. Oxford: Oxford University Press.

Spiegelhalter, D. J., Dawid, A. P., Lauritzen, S. L., & Cowell, R. G. (1993). Bayesian analysis in expert systems. Statistical Science, 8, 219-283.

Heckerman, D. (1995). A tutorial on learning Bayesian networks. [available through http://www.auai.org]

Buntine, W. (1994). Operations for learning with graphical models. Journal of Artificial Intelligence Research, 2, 159-225. [available through http://www.auai.org]

Titterington, D., Smith, A., & Makov, U. (1985). Statistical Analysis of Finite Mixture Distributions. Wiley, Chichester, UK.

Little, R. J. A., & Rubin, D. B. (1987). Statistical Analysis with Missing Data. New York: John Wiley.

Rabiner, L. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77, 257-285.

Smyth, P., Heckerman, D., & Jordan, M. I. (1997). Probabilistic independence networks for hidden Markov probability models. Neural Computation, 9, 227-270.

Jordan, M. I., & Jacobs, R. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, 181-214.

Saul, L. K., & Jordan, M. I. (1995). Boltzmann chains and hidden Markov models. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in Neural Information Processing Systems 7, MIT Press, Cambridge, MA.

Neal, R. (1995). Probabilistic inference using Markov Chain Monte Carlo methods. [http://www.cs.toronto.edu/radford/]

BUGS [http://www.mrc-bsu.cam.ac.uk/bugs]

Jordan, M. I., & Bishop, C. (1997). Neural networks. In Tucker, A. B. (Ed.), CRC Handbook of Computer Science, Boca Raton, FL: CRC Press.

Neal, R. (1992). Connectionist learning of belief networks. Artificial Intelligence, 56, 71-113.

Neal, R., & Hinton, G. E. (in press). A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan (Ed.), Learning in Graphical Models. Kluwer Academic Press.

Hinton, G. E., Dayan, P., Frey, B., & Neal, R. (1995). The wake-sleep algorithm for unsupervised neural networks. Science, 268, 1158-1161.

Dayan, P., Hinton, G. E., Neal, R., & Zemel, R. S. (1995). Helmholtz machines. Neural Computation, 7, 1022-1037.

Saul, L. K., & Jordan, M. I. (1996). Exploiting tractable substructures in intractable networks. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in Neural Information Processing Systems 8, MIT Press, Cambridge, MA.

Saul, L., Jaakkola, T., & Jordan, M. (1996). Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4, 61-76.

Williams, C. K. I., & Hinton, G. E. (1991). Mean field networks that learn to discriminate temporally distorted strings. In Touretzky, D. S., Elman, J., Sejnowski, T., & Hinton, G. E. (Eds.), Proceedings of the 1990 Connectionist Models Summer School. San Mateo, CA: Morgan Kaufmann.

Ghahramani, Z., & Jordan, M. I. (1996). Factorial hidden Markov models. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in Neural Information Processing Systems 8, MIT Press, Cambridge, MA.

Jaakkola, T., & Jordan, M. I. (1996). Computing upper and lower bounds on likelihoods in intractable networks. In E. Horvitz (Ed.), Workshop on Uncertainty in Artificial Intelligence, Portland, Oregon.

Jaakkola, T., & Jordan, M. I. (in press). Bayesian logistic regression: a variational approach. In D. Madigan & P. Smyth (Eds.), Proceedings of the 1997 Conference on Artificial Intelligence and Statistics, Ft. Lauderdale, FL.

Saul, L. K., & Jordan, M. I. (in press). Mixed memory Markov models. In D. Madigan & P. Smyth (Eds.), Proceedings of the 1997 Conference on Artificial Intelligence and Statistics, Ft. Lauderdale, FL.

Frey, B., Hinton, G. E., & Dayan, P. (1996). Does the wake-sleep algorithm produce good density estimators? In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in Neural Information Processing Systems 8, MIT Press, Cambridge, MA.

Jordan, M. I., Ghahramani, Z., & Saul, L. K. (1997). Hidden Markov decision trees. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in Neural Information Processing Systems 9, MIT Press, Cambridge, MA.

Jaakkola, T. S. (1997). Variational methods for inference and estimation in graphical models. Unpublished doctoral dissertation, Massachusetts Institute of Technology.

Jaakkola, T., & Jordan, M. I. (1997). Recursive algorithms for approximating probabilities in graphical models. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in Neural Information Processing Systems 9, MIT Press, Cambridge, MA.

Frey, B. (1997). Bayesian networks for pattern classification, data compression, and channel coding. Unpublished doctoral dissertation, University of Toronto.