AN INTRODUCTION TO GRAPHICAL
MODELS
Michael I. Jordan
Center for Biological and Computational Learning
Massachusetts Institute of Technology
http://www.ai.mit.edu/projects/jordan.html
Acknowledgments:
Zoubin Ghahramani, Tommi Jaakkola, Marina Meila
Lawrence Saul
December, 1997
GRAPHICAL MODELS
Graphical models are a marriage between graph
theory and probability theory
They clarify the relationship between neural
networks and related network-based models such as
HMMs, MRFs, and Kalman filters
Indeed, they can be used to give a fully probabilistic
interpretation to many neural network architectures
Some advantages of the graphical model point of view:
- inference and learning are treated together
- supervised and unsupervised learning are merged
  seamlessly
- missing data handled nicely
- a focus on conditional independence and
  computational issues
- interpretability if desired
Graphical models (cont.)
There are two kinds of graphical models: those based
on undirected graphs and those based on directed
graphs. Our main focus will be directed graphs.
Alternative names for graphical models: belief
networks, Bayesian networks, probabilistic
independence networks, Markov random fields,
loglinear models, influence diagrams
A few myths about graphical models:
- they require a localist semantics for the nodes
- they require a causal semantics for the edges
- they are necessarily Bayesian
- they are intractable
Learning and inference
A key insight from the graphical model point of view:
It is not necessary to learn that which can
be inferred
The weights in a network make local assertions about
the relationships between neighboring nodes
Inference algorithms turn these local assertions into
global assertions about the relationships between
nodes
- e.g., correlations between hidden units conditional
  on an input-output pair
- e.g., the probability of an input vector given an
  output vector
This is achieved by associating a joint probability
distribution with the network
Directed graphical models: basics
Consider an arbitrary directed acyclic graph, where
each node in the graph corresponds to a random
variable (scalar or vector):
[Figure: a directed acyclic graph on the nodes A, B, C, D, E, F]
There is no a priori need to designate units as
"inputs," "outputs," or "hidden"
We want to associate a probability distribution
P(A, B, C, D, E, F) with this graph, and we want
all of our calculations to respect this distribution,
e.g.,

P(F | A, B) = \frac{\sum_{C,D,E} P(A, B, C, D, E, F)}{\sum_{C,D,E,F} P(A, B, C, D, E, F)}
Some ways to use a graphical model
Prediction:
Diagnosis, control, optimization:
Supervised learning:
we want to marginalize over the unshaded nodes in
each case (i.e., integrate them out from the joint
probability)
"unsupervised learning" is the general case
Specification of a graphical model
There are two components to any graphical model:
- the qualitative specification
- the quantitative specification
Where does the qualitative specification come from?
- prior knowledge of causal relationships
- prior knowledge of modular relationships
- assessment from experts
- learning from data
- we simply like a certain architecture (e.g., a
  layered graph)
Qualitative specification of graphical models
[Figure: two copies of a chain through C linking A and B; C is shaded in the second copy]
A and B are marginally dependent
A and B are conditionally independent (given C)
[Figure: two copies of a graph with C as a common parent of A and B; C is shaded in the second copy]
A and B are marginally dependent
A and B are conditionally independent (given C)
Semantics of graphical models (cont.)
[Figure: two copies of a graph with A and B as parents of C; C is shaded in the second copy]
A and B are marginally independent
A and B are conditionally dependent (given C)
This is the interesting case...
"Explaining away"
[Figure: Burglar and Earthquake as parents of Alarm; Earthquake is also the parent of Radio]
All connections in both directions are "excitatory"
But an increase in the "activation" of Earthquake leads
to a decrease in the "activation" of Burglar
Where does the "inhibition" come from?
Quantitative specification of directed models
Question: how do we specify a joint distribution over
the nodes in the graph?
Answer: associate a conditional probability with
each node:
[Figure: the DAG with the local conditionals P(A), P(B), P(C|A,B), P(D|C), P(E|C), P(F|D,E) attached to the nodes]
and take the product of the local probabilities to
yield the global probabilities
Justification
In general, let {S} represent the set of random
variables corresponding to the N nodes of the graph
For any node S_i, let pa(S_i) represent the set of
parents of node S_i
Then

P(S) = P(S_1) P(S_2 | S_1) \cdots P(S_N | S_{N-1}, \ldots, S_1)
     = \prod_i P(S_i | S_{i-1}, \ldots, S_1)
     = \prod_i P(S_i | \mathrm{pa}(S_i))

where the last line follows from the conditional independence
assumptions encoded in the graph
It is possible to prove a theorem stating that if
arbitrary probability distributions are utilized for
P(S_i | pa(S_i)) in the formula above, then the family
of probability distributions obtained is exactly the
set that respects the qualitative specification (the
conditional independence relations described
earlier)
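To make the factorization concrete, here is a minimal Python sketch (the probability values are illustrative, made-up numbers) that evaluates the joint for the six-node example graph as the product of its local conditionals P(A) P(B) P(C|A,B) P(D|C) P(E|C) P(F|D,E):

```python
import itertools

# Hypothetical CPTs for binary nodes, following the parent structure of the
# example: A, B -> C;  C -> D;  C -> E;  D, E -> F.  Each table stores
# P(node = 0 | parents); the particular numbers are illustrative only.
P_A0 = 0.6
P_B0 = 0.7
P_C0 = {(0, 0): 0.9, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.1}   # keyed by (A, B)
P_D0 = {0: 0.8, 1: 0.3}                                        # keyed by C
P_E0 = {0: 0.6, 1: 0.2}                                        # keyed by C
P_F0 = {(0, 0): 0.9, (0, 1): 0.6, (1, 0): 0.5, (1, 1): 0.1}   # keyed by (D, E)

def bern(p0, x):
    """P(X = x) for a binary variable with P(X = 0) = p0."""
    return p0 if x == 0 else 1.0 - p0

def joint(a, b, c, d, e, f):
    """P(A,B,C,D,E,F) = P(A) P(B) P(C|A,B) P(D|C) P(E|C) P(F|D,E)."""
    return (bern(P_A0, a) * bern(P_B0, b) * bern(P_C0[(a, b)], c) *
            bern(P_D0[c], d) * bern(P_E0[c], e) * bern(P_F0[(d, e)], f))

# The joint sums to one over all 2^6 configurations, as the factorization guarantees.
print(sum(joint(*cfg) for cfg in itertools.product((0, 1), repeat=6)))
```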
Semantics of undirected graphs
[Figure: two copies of the undirected chain A - C - B; C is shaded in the second copy]
A and B are marginally dependent
A and B are conditionally independent (given C)
Comparative semantics
[Figure: left, an undirected cycle on the nodes A, B, C, D; right, a directed graph with A and B as parents of C]
The graph on the left yields conditional
independencies that a directed graph can't represent
The graph on the right yields marginal
independencies that an undirected graph can't
represent
Quantitative specification of undirected models
[Figure: the undirected graph on the nodes A, B, C, D, E, F]
identify the cliques in the graph:
(A, C), (B, C), (C, D, E), (D, E, F)
define a configuration of a clique as a specification of
values for each node in the clique
define a potential of a clique as a function that
associates a real number with each configuration of
the clique:
\Phi_{A,C}, \Phi_{B,C}, \Phi_{C,D,E}, \Phi_{D,E,F}
Quantitative specification (cont.)
Consider the example of a graph with binary nodes
A "potential" is a table with entries for each
combination of nodes in a clique:

        B=0   B=1
A=0     1.5   0.4
A=1     0.7   1.2

"Marginalizing" over a potential table simply means
collapsing (summing) the table along one or more dimensions:

marginalizing over B:  A=0 -> 1.9,  A=1 -> 1.9
marginalizing over A:  B=0 -> 2.2,  B=1 -> 1.6
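The same potential table can be written as a numpy array, in which case marginalizing is just a sum along an axis (a small sketch reusing the numbers above):

```python
import numpy as np

# Potential on the clique (A, B): rows index A, columns index B.
phi_AB = np.array([[1.5, 0.4],
                   [0.7, 1.2]])

print(phi_AB.sum(axis=1))  # marginalize over B -> [1.9, 1.9]
print(phi_AB.sum(axis=0))  # marginalize over A -> [2.2, 1.6]
```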
Quantitative specification (cont.)
Finally, define the probability of a global
configuration of the nodes as the product of the local
potentials on the cliques:

P(A, B, C, D, E, F) = \Phi_{A,C} \, \Phi_{B,C} \, \Phi_{C,D,E} \, \Phi_{D,E,F}

where, without loss of generality, we assume that the
normalization constant (if any) has been absorbed
into one of the potentials
It is then possible to prove a theorem stating
that if arbitrary potentials are utilized in the
product formula for probabilities, then the family
of probability distributions obtained is exactly the
set that respects the qualitative specification (the
conditional independence relations described
earlier)
This theorem is known as the Hammersley-Clifford theorem
Boltzmann machine
The Boltzmann machine is a special case of an
undirected graphical model
For a Boltzmann machine all of the potentials are
formed by taking products of factors of the form

\exp\{ J_{ij} S_i S_j \}

[Figure: two nodes S_i and S_j joined by an edge with weight J_{ij}]
Setting J_{ij} equal to zero for non-neighboring nodes
guarantees that we respect the clique boundaries
But we don't get the full conditional probability
semantics with the Boltzmann machine
parameterization
- i.e., the family of distributions parameterized by a
  Boltzmann machine on a graph is a proper subset
  of the family characterized by the conditional independencies
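A tiny sketch of this parameterization (the weight matrix J and the use of 0/1 states are illustrative assumptions): the unnormalized probability of a configuration is the product of the edge factors exp{J_ij S_i S_j}, and normalizing over all configurations gives the Boltzmann distribution.

```python
import itertools
import numpy as np

# Symmetric weight matrix J for four binary units; zeros encode absent edges
# (J_ij = 0 for non-neighboring nodes).  The numbers are illustrative only.
J = np.array([[0.0, 0.5, 0.0, 0.2],
              [0.5, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 0.3],
              [0.2, 0.0, 0.3, 0.0]])

def unnormalized(s):
    """Product over edges of exp{J_ij s_i s_j} = exp{sum_{i<j} J_ij s_i s_j}."""
    s = np.asarray(s)
    return np.exp(0.5 * s @ J @ s)   # 0.5 because s'Js counts each edge twice

# Normalizing over all 2^4 configurations gives the Boltzmann distribution.
configs = list(itertools.product((0, 1), repeat=4))
Z = sum(unnormalized(s) for s in configs)
print(sum(unnormalized(s) / Z for s in configs))   # 1.0
```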
Evidence and Inference
"Absorbing evidence" means observing the values of
certain of the nodes
Absorbing evidence divides the units of the network
into two groups:
- visible units {V}: those for which we have
  instantiated values ("evidence nodes")
- hidden units {H}: those for which we do not
  have instantiated values
"Inference" means calculating the conditional
distribution

P(H | V) = \frac{P(H, V)}{\sum_{\{H\}} P(H, V)}

- prediction and diagnosis are special cases
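For graphs small enough to enumerate, this conditional can be computed by brute-force summation over the hidden nodes. The sketch below (a hypothetical helper that expects a joint(a, b, c, d, e, f) function such as the one sketched earlier) computes P(F | A, B) this way, as an instance of the ratio formula above:

```python
import itertools

def conditional_F_given_AB(joint, a, b):
    """P(F | A=a, B=b) for binary nodes: sum the joint over the hidden
    nodes C, D, E, then normalize by also summing over F."""
    num = [sum(joint(a, b, c, d, e, f)
               for c, d, e in itertools.product((0, 1), repeat=3))
           for f in (0, 1)]
    total = sum(num)
    return [p / total for p in num]
```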
Inference algorithms for directed graphs
There are several inference algorithms, some of which
operate directly on the directed graph
The most popular inference algorithm, known as the
junction tree algorithm (which we'll discuss here),
operates on an undirected graph
It also has the advantage of clarifying some of the
relationships between the various algorithms
To understand the junction tree algorithm, we need to
understand how to "compile" a directed graph into an
undirected graph
Moral graphs
Note that for both directed graphs and undirected
graphs, the joint probability is in a product form
So let's convert local conditional probabilities into
potentials; then the products of potentials will give
the right answer
Indeed we can think of a conditional probability, e.g.,
P(C | A, B), as a function of the three variables A, B,
and C (we get a real number for each configuration):
[Figure: the fragment A -> C <- B, labeled with P(C|A,B)]
Problem: A node and its parents are not generally in
the same clique
Solution: Marry the parents to obtain the "moral graph"
[Figure: the moralized fragment on A, B, C (parents A and B married), with potential \Phi_{A,B,C} = P(C|A,B)]
Moral graphs (cont.)
Define the potential on a clique as the product over
all conditional probabilities contained within the
clique
Now the product of potentials gives the right
answer:

P(A, B, C, D, E, F)
  = P(A) P(B) P(C|A,B) P(D|C) P(E|C) P(F|D,E)
  = \Phi_{A,B,C} \, \Phi_{C,D,E} \, \Phi_{D,E,F}

where

\Phi_{A,B,C} = P(A) P(B) P(C|A,B)
\Phi_{C,D,E} = P(D|C) P(E|C)
\Phi_{D,E,F} = P(F|D,E)

[Figure: the original directed graph on A, B, C, D, E, F and its moral graph]
Propagation of probabilities
Now suppose that some evidence has been absorbed.
How do we propagate this effect to the rest of the graph?
Clique trees
A clique tree is an undirected tree of cliques
[Figure: a clique tree on the cliques (A, C), (B, C), (C, D, E), (D, E, F)]
Consider cases in which two neighboring cliques V
and W have an overlap S (e.g., (A, C) overlaps with
(C, D, E)):
[Figure: cliques V and W with potentials \Phi_V and \Phi_W, joined through the separator S with potential \Phi_S]
- the cliques need to "agree" on the probability of
  nodes in the overlap; this is achieved by
  marginalizing and rescaling (a numpy sketch of this
  update follows below):

  \Phi_S^* = \sum_{V \setminus S} \Phi_V

  \Phi_W^* = \frac{\Phi_S^*}{\Phi_S} \, \Phi_W

- this occurs in a parallel, distributed fashion
  throughout the clique tree
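A minimal numpy sketch of one such update, taking V = (A, C), W = (C, D, E), and separator S = {C}; the potential values are arbitrary illustrative numbers:

```python
import numpy as np

# Illustrative potentials: phi_V indexed by (A, C), phi_W indexed by (C, D, E).
phi_V = np.array([[1.5, 0.4],
                  [0.7, 1.2]])
phi_W = np.ones((2, 2, 2))      # arbitrary starting potential over (C, D, E)
phi_S = np.ones(2)              # separator potential over C, initialized to 1

# Message from V to W: marginalize phi_V onto the separator, then rescale phi_W.
phi_S_new = phi_V.sum(axis=0)                         # sum over A, leaving C
phi_W = phi_W * (phi_S_new / phi_S)[:, None, None]    # rescale along the C axis
phi_S = phi_S_new
print(phi_S, phi_W[:, 0, 0])
```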
Clique trees (cont.)
This simple local message-passing algorithm on a
clique tree defines the general probability
propagation algorithm for directed graphs!
Many interesting algorithms are special cases:
- calculation of posterior probabilities in mixture
  models
- the Baum-Welch algorithm for hidden Markov models
- posterior propagation for probabilistic decision
  trees
- Kalman filter updates
The algorithm seems reasonable. Is it correct?
A problem
Consider the following graph and a corresponding
clique tree:
[Figure: a four-cycle on A, B, C, D (edges A-B, B-D, A-C, C-D) and a clique tree on the cliques (A,B), (B,D), (A,C), (C,D)]
Note that C appears in two non-neighboring cliques.
Question: What guarantee do we have that the
probability associated with C in these two cliques
will be the same?
Answer: None. In fact this is a problem with the
algorithm as described so far. It is not true in
general that local consistency implies global consistency.
What else do we need to get such a guarantee?
Triangulation (last idea, hang in there)
A triangulated graph is one that contains no chordless
cycles of four or more nodes
We triangulate a graph by adding chords:
[Figure: the four-cycle on A, B, C, D, before and after adding the chord B-C]
Now we no longer have our problem:
[Figure: the triangulated graph and its clique tree on the cliques (A,B,C) and (B,C,D)]
A clique tree for a triangulated graph has the
running intersection property: if a node appears in
two cliques, it appears everywhere on the path
between the cliques
Thus local consistency implies global consistency
for such clique trees
Junction trees
A clique tree for a triangulated graph is referred to as
a junction tree
In junction trees, local consistency implies global
consistency. Thus the local message-passing
algorithm is provably correct.
It is also possible to show that only triangulated
graphs have the property that their clique trees are
junction trees. Thus, if we want local algorithms, we
must triangulate.
Summary of the junction tree algorithm
1. Moralize the graph
2. Triangulate the graph
3. Propagate by local message-passing in the junction
tree
Note that the first two steps are "off-line"
Note also that these steps provide a bound on the
complexity of the propagation step
Example: Gaussian mixture models
A Gaussian mixture model is a popular clustering model
[Figure: a scatter plot of data in the (x1, x2) plane, with clusters labeled by the hidden state q]
the hidden state q is a multinomial RV
the output x is a Gaussian RV
prior probabilities on the hidden states:

\pi_i = P(q_i = 1)

class-conditional probabilities:

P(x | q_i = 1) = \frac{1}{(2\pi)^{d/2} |\Sigma_i|^{1/2}}
                 \exp\left\{ -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right\}
Gaussian mixture models as a graphical model
[Figure: the two-node model with q the parent of x, shown unshaded and then with x shaded]
The inference problem is to calculate the posterior
probabilities:

P(q_i = 1 | x) = \frac{P(x | q_i = 1) P(q_i = 1)}{P(x)}

               = \frac{\pi_i \, |\Sigma_i|^{-1/2} \exp\{ -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \}}
                      {\sum_j \pi_j \, |\Sigma_j|^{-1/2} \exp\{ -\frac{1}{2} (x - \mu_j)^T \Sigma_j^{-1} (x - \mu_j) \}}

This is a trivial example of the rescaling operation
on a clique potential
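A short numpy sketch of this rescaling (the mixture parameters pis, mus, Sigmas are assumed given; the log-space computation is only for numerical stability):

```python
import numpy as np

def gmm_posteriors(x, pis, mus, Sigmas):
    """P(q_i = 1 | x) for a Gaussian mixture with known parameters."""
    x = np.asarray(x, dtype=float)
    logs = []
    for pi_i, mu_i, Sigma_i in zip(pis, mus, Sigmas):
        diff = x - mu_i
        _, logdet = np.linalg.slogdet(Sigma_i)
        quad = diff @ np.linalg.solve(Sigma_i, diff)
        logs.append(np.log(pi_i) - 0.5 * (logdet + quad))  # the (2 pi)^{d/2} factor cancels
    logs = np.array(logs)
    w = np.exp(logs - logs.max())        # subtract the max for numerical stability
    return w / w.sum()

# Two illustrative components in two dimensions.
post = gmm_posteriors([0.5, -0.2],
                      pis=[0.3, 0.7],
                      mus=[np.zeros(2), np.ones(2)],
                      Sigmas=[np.eye(2), 2 * np.eye(2)])
print(post, post.sum())   # the posteriors sum to one
```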
Example: Hidden Markov models
A hidden Markov model is a popular time series
model
It is a "mixture model with dynamics"
[Figure: the HMM as a chain q_1 -> q_2 -> ... -> q_T with transition matrix A and initial distribution \pi; each q_t emits y_t through the emission matrix B]
T time steps
M states (q_t is a multinomial RV)
N outputs (y_t is a multinomial RV)
state transition probability matrix A:

A = P(q_{t+1} | q_t)

emission matrix B:

B = P(y_t | q_t)

initial state probabilities \pi:

\pi = P(q_1)
HMM as a graphical model
Each node has a probability distribution associated
with it.
The graph of the HMM makes conditional
independence statements.
For example,

P(q_{t+1} | q_t, q_{t-1}) = P(q_{t+1} | q_t)

can be read off the graph as a separation property.
HMM probability calculations
[Figure: the HMM graph, with the outputs y_1, ..., y_T shaded as evidence]
The time series of y_t values is the evidence
The inference calculation involves calculating the
probabilities of the hidden states q_t given the
evidence
The classical algorithm for doing this calculation is
the forward-backward algorithm
The forward-backward algorithm involves:
- multiplication (conditioning)
- summation (marginalization) in the lattice
It is a special case of the junction tree algorithm (cf.
Smyth, et al., 1997)
Is the algorithm efficient?
To answer this question, let's consider the junction
tree
Note that the moralization and triangulation steps
are trivial, and we obtain the following junction tree:
[Figure: the HMM junction tree; a chain of cliques (q_1, q_2), (q_2, q_3), ... with cliques (q_1, y_1), (q_2, y_2), ... attached]
The clique tables are no bigger than M x M (where M is
the number of states), so the marginalization and rescaling
required by the junction tree algorithm run in time O(M^2)
per clique
There are T such cliques, thus the algorithm is
O(M^2 T) overall
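A sketch of the forward half of forward-backward, which makes the O(M^2) cost per time step explicit; the model parameters and the observation sequence are assumed given, and scaling is used only to avoid underflow:

```python
import numpy as np

def forward(A, B, pi, obs):
    """Scaled forward pass: alpha[t, i] is proportional to P(y_1..y_t, q_t = i).
    A: (M, M) transitions, B: (M, K) emissions, pi: (M,) initial, obs: symbol indices."""
    M, T = len(pi), len(obs)
    alpha = np.zeros((T, M))
    scale = np.zeros(T)
    alpha[0] = pi * B[:, obs[0]]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]   # O(M^2) work per time step
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]
    return alpha, np.log(scale).sum()                  # second value is log P(y_1..y_T)

# Illustrative 2-state, 3-symbol model.
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
alpha, loglik = forward(A, B, np.array([0.5, 0.5]), [0, 1, 2, 2])
print(loglik)
```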
Hierarchical mixture of experts
The HME is a "soft" decision tree
The input vector x drops through a tree of "gating
networks," each of which computes a probability for
the branches below it
[Figure: the HME tree; gating networks with parameters \eta, \nu_i, \zeta_{ij} choose among branches \omega_i, \omega_{ij}, \omega_{ijk}, and experts with parameters \theta_{ijk} at the leaves produce y]
At the leaves, the "experts" compute a probability
for the output y
The overall probability model is a conditional
mixture model
Representation as a graphical model
[Figure: a graphical model on the nodes x, \omega_i, \omega_{ij}, \omega_{ijk}, y]
For learning and inference, the nodes for x and y are
observed (shaded):
[Figure: the same model with x and y shaded]
We need to calculate the probabilities of the unshaded nodes
(the E step of EM)
Learning (parameter estimation)
The EM algorithm is natural for graphical models
The E step of the EM algorithm involves calculating
the probabilities of hidden variables given visible
variables
- this is exactly the inference problem
[Figure: the six-node graph with the evidence nodes shaded]
The M step involves parameter estimation for a fully
observed graph
- this is generally straightforward
[Figure: the six-node graph with all nodes shaded (fully observed)]
Example: Hidden Markov models
[Figure: the HMM graph with transition matrix A, emission matrix B, and initial distribution \pi]
Problem: Given a sequence of outputs

y_{1,T} = \{ y_1, y_2, \ldots, y_T \}

infer the parameters A, B, and \pi.
To see how to solve this problem, let's consider a
simpler problem
Fully observed Markov models
Suppose that at each moment in time we know what
state the system is in:
[Figure: the HMM graph with both the states q_t and the outputs y_t shaded]
Parameter estimation is easy in this case:
- to estimate the state transition matrix elements,
  simply keep a running count of the number n_{ij} of
  times the chain jumps from state i to state j.
  Then estimate a_{ij} as (a code sketch follows below):

  \hat{a}_{ij} = \frac{n_{ij}}{\sum_k n_{ik}}

- to estimate the Gaussian output probabilities,
  simply record which data points occurred in which
  states and compute sample means and covariances
- as for the initial state probabilities, we need
  multiple output sequences (which we usually have
  in practice)
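A brief sketch of the counting estimate for the transition matrix, given an assumed fully observed state sequence:

```python
import numpy as np

def estimate_transitions(states, M):
    """Maximum-likelihood transition matrix from a fully observed state sequence
    (states: ints in [0, M); every state is assumed to occur as a source at least once)."""
    counts = np.zeros((M, M))
    for i, j in zip(states[:-1], states[1:]):
        counts[i, j] += 1                                  # n_ij: jumps from i to j
    return counts / counts.sum(axis=1, keepdims=True)      # a_ij = n_ij / sum_k n_ik

print(estimate_transitions([0, 0, 1, 1, 1, 0, 1], M=2))
```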
HMM parameter estimation
When the hidden states are not in fact observed (the
case we're interested in), we first estimate the
probabilities of the hidden states
- this is a straightforward application of the junction
  tree algorithm (i.e., the forward-backward
  algorithm)
We then use the probability estimates instead of the
counts in the parameter estimation formulas to get
update formulas
This gives us a better model, so we run the junction
tree algorithm again to get better estimates of the
hidden state probabilities
And we iterate this procedure
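A minimal sketch of the corresponding transition update when the states are hidden: the counts n_ij are replaced by expected counts computed from the E-step posteriors (here xi is an assumed array of pairwise posteriors P(q_t = i, q_{t+1} = j | y), e.g. obtained from forward-backward):

```python
import numpy as np

def m_step_transitions(xi):
    """Update the transition matrix from expected transition counts.
    xi: array of shape (T-1, M, M) with xi[t, i, j] = P(q_t = i, q_{t+1} = j | y)."""
    expected_counts = xi.sum(axis=0)                                    # expected n_ij
    return expected_counts / expected_counts.sum(axis=1, keepdims=True)
```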
CONCLUSIONS (PART I)
Graphical models provide a general formalism for
putting together graphs and probabilities
- most so-called "unsupervised neural networks" are
  special cases
- Boltzmann machines are special cases
- mixtures of experts and related mixture-based
  models are special cases
- some supervised neural networks can be treated as
  special cases
The graphical model framework allows us to treat
inference and learning as two sides of the same coin
INTRACTABLE GRAPHICAL MODELS
There are a number of examples of graphical models
in which exact inference is efficient:
- chain-like graphs
- tree-like graphs
However, there are also a number of examples of
graphical models in which exact inference can be
hopelessly inefficient:
- dense graphs
- layered graphs
- coupled graphs
A variety of methods are available for approximate
inference in such situations:
- Markov chain Monte Carlo (stochastic)
- variational methods (deterministic)
Computational complexity of exact computation
passing messages requires marginalizing and scaling
the clique potentials
thus the time required is exponential in the number
of variables in the largest clique
good triangulations yield small cliques
- but the problem of finding an optimal
  triangulation is hard (NP-hard, in fact)
in any case, the triangulation is "off-line"; our
concern is generally with the "on-line" problem of
message propagation
Quick Medical Reference (QMR)
(University of Pittsburgh)
600 diseases, 4000 symptoms
arranged as a bipartite graph:
[Figure: a bipartite graph with disease nodes in the top layer and symptom nodes in the bottom layer]
Node probabilities P(symptom_i | diseases) were
obtained from an expert, under a noisy-OR model
Want to do diagnostic calculations:

P(diseases | findings)

Current methods (exact and Monte Carlo) are infeasible
QMR (cont.)
[Figure: the bipartite graph of diseases and symptoms]
"noisy-OR" parameterization:

P(f_i = 0 | d) = (1 - q_{i0}) \prod_{j \in \mathrm{pa}(i)} (1 - q_{ij})^{d_j}
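A small sketch of this noisy-OR factor for a single finding, with hypothetical leak (q_i0) and inhibition (q_ij) parameters:

```python
import numpy as np

def p_finding_absent(d, q_i0, q_i):
    """Noisy-OR: P(f_i = 0 | d) = (1 - q_i0) * prod_j (1 - q_ij)**d_j.
    d: 0/1 vector over the parent diseases; q_i0: leak; q_i: vector of q_ij."""
    d = np.asarray(d)
    q_i = np.asarray(q_i)
    return (1.0 - q_i0) * np.prod((1.0 - q_i) ** d)

# Illustrative numbers: two of three parent diseases present.
print(p_finding_absent([1, 0, 1], q_i0=0.01, q_i=[0.8, 0.5, 0.3]))
```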
rewrite in an exponential form:
P