Journal of Artificial Intelligence Research

Operations for Learning with Graphical Models

Wray L. Buntine (wray@kronos.arc.nasa.gov)

RIACS, NASA Ames Research Center, Moffett Field, CA, USA

Abstract

This paper is a multidisciplinary review of empirical, statistical learning from a graphical model perspective. Well-known examples of graphical models include Bayesian networks, directed graphs representing a Markov chain, and undirected networks representing a Markov field. These graphical models are extended to model data analysis and empirical learning using the notation of plates. Graphical operations for simplifying and manipulating a problem are provided, including decomposition, differentiation, and the manipulation of models from the exponential family. Two standard algorithm schemas for learning are reviewed in a graphical framework: Gibbs sampling and the expectation maximization algorithm. Using these operations and schemas, some popular algorithms can be synthesized from their graphical specification. This includes versions of linear regression, techniques for feed-forward networks, and learning Gaussian and discrete Bayesian networks from data. The paper concludes by sketching some implications for data analysis and summarizing how some popular algorithms fall within the framework presented.

The main original contributions here are the decomposition techniques and the demonstration that graphical models provide a framework for understanding and developing complex learning algorithms.

Introduction

A probabilistic graphical model is a graph where the nodes represent variables and the arcs (directed or undirected) represent dependencies between variables. They are used to define a mathematical form for the joint or conditional probability distribution between variables. Graphical models come in various forms: Bayesian networks, used to represent causal and probabilistic processes; dataflow diagrams, used to represent deterministic computation; influence diagrams, used to represent decision processes; and undirected Markov networks (random fields), used to represent correlation, for instance in images, and hidden causes.

Graphical models are used in domains such as diagnosis and probabilistic expert systems, and more recently in planning and control (Dean & Wellman; Chan & Shachter), dynamic systems and time-series (Kjærulff; Dagum, Galper, Horvitz & Seiver), and general data analysis (Gilks et al.; Whittaker). This paper shows that the task of learning can also be modeled with graphical models. This meta-level use of graphical models was first suggested by Spiegelhalter and Lauritzen in the context of learning for Bayesian networks.

Graphical models provide a representation for the decomposition of complex problems. They also have an associated set of mathematics and algorithms for their manipulation. When graphical models are discussed, both the graphical formalism and the associated algorithms and mathematics are implicitly included. In fact, the graphical formalism is unnecessary for the technical development of the approach, but its use conveys the important


structural information of a problem in a natural, visual manner. Graphical operations manipulate the underlying structure of a problem, unhindered by the fine detail of the connecting functional and distributional equations. This structuring process is important in the same way that a high-level programming language leads to higher productivity over assembly language.

A graphical model can be developed to represent the basic prediction done by linear regression, by an expert system, by a hidden Markov model, or by a connectionist feed-forward network (Buntine). A graphical model can also be used to represent and reason about the task of learning the parameters, weights, and structure of each of these representations. An extension of the standard graphical model that allows this kind of learning to be represented is used here. The extension is the notion of a plate, introduced by Spiegelhalter. Plates allow samples to be explicitly represented on the graphical model and thus reasoned about and manipulated. This makes data analysis problems explicit in much the same way that utility and decision nodes are used for decision analysis problems (Shachter).

This paper develops a framework in which the basic computational techniques for learning can be directly applied to graphical models. This forms the basis of a computational theory of Bayesian learning using the language of graphical models. By a computational theory we mean that the approach shows how a wide variety of learning algorithms can be created from graphical specifications and a few simple algorithmic criteria. The basic computational techniques of probabilistic (Bayesian) inference used in this computational theory of learning are widely reviewed (Tanner; Press; Kass & Raftery; Neal; Bretthorst). These include various exact methods, Markov chain Monte Carlo methods such as Gibbs sampling, the expectation maximization (EM) algorithm, and the Laplace approximation. More specialized computational techniques also exist for handling missing values (Little & Rubin), for making a batch algorithm incremental, and for adapting an algorithm to handle large samples. With creative combination these techniques are able to address a wide range of data analysis problems.

The paper provides the blueprint for a software toolkit that can be used to construct many data analysis and learning algorithms based on a graphical specification. The conceptual architecture for such a toolkit is given in the figure below: probability and decision theory are used to decompose a problem into a computational prescription, and then search and optimization techniques are used to fill the prescription. A version of this toolkit already exists using Gibbs sampling as the general computational scheme (Gilks et al.). The list of algorithms that can be constructed, in one form or another, by this scheme is impressive. But the real gain from the scheme does not arise from the potential reimplementation of existing software, but from the understanding gained by putting these in a common language, the ability to create novel hybrid algorithms, and the ability to tailor special purpose algorithms for specific problems.

This paper is tutorial in the sense that it collects material from different communities and presents it in the language of graphical models. First, this paper introduces graphical models to represent inference and learning. Second, this paper develops and reviews a number of operations on graphical models. Finally, this paper gives some examples of developing learning algorithms using combinations of the basic operations.

(Footnote: The notion of a replicated node was my version of this, developed independently. I have adopted the notation of Spiegelhalter and colleagues for uniformity.)


[Figure: A software generator. A graphical model specification (for instance a linear model with Gaussian error over inputs x_1, ..., x_n) is processed by statistical and decision-theoretic methods and by optimizing and search methods to generate code.]

The main original contribution is the demonstration that graphical models provide a framework for understanding and developing complex learning algorithms.

In more detail, the paper covers the following topics.

1. Introduction to graphical models. Graphical models are used in two ways:
(a) Graphical models for representing inference tasks. The next section reviews some basics of graphical models; graphical models provide a means of representing patterns of inference.
(b) Adapting graphical representations to represent learning. A subsequent section discusses the representation of the problem of learning within graphical models, using the notion of a plate.

2. Operations on graphical models. Operations take a graphical representation of a learning problem and simplify it, or perform an important calculation required to solve the problem:
(a) Operations using closed-form solutions. One section covers those classes of learning problems where closed-form solutions to learning are known; it adapts standard statistical methods to graphical models.
(b) Other basic operations. Other operations on graphs are required to handle more complex problems, including:
- Decomposition: breaking a learning problem into independent components and evaluating each component.
- Differentiation: computing the derivative of a probability or log-probability with respect to variables on the graph. Differentiation can be decomposed into operations local to groups of nodes in the graph, as is popular in neural networks.
(c) Some approximate operations. Some approximate algorithms follow naturally from the above methods; Gibbs sampling and its deterministic cousin, the EM algorithm, are reviewed.


3. Some example algorithms. The closed-form solutions to learning can sometimes be used to form a fast inner loop of more complex algorithms; a later section illustrates how graphical models help here.

The conclusion lists some common algorithms and their derivation within the above framework. Proofs of lemmas and theorems are collected in Appendix A.

Introduction to graphical models

This section introduces graphical models. The brief tour is necessary before introducing the operations for learning.

Graphical models offer a unified qualitative and quantitative framework for representing and reasoning with probabilities and independencies. They combine a representation for uncertain problems with techniques for performing inference. Flexible toolkits and systems exist for applying these techniques (Srinivas & Breese; Andersen, Olesen, Jensen & Jensen; Cowell). Graphical models are based on the notion of independence, which is worth repeating here.

Definition. A is independent of B given C if p(A, B | C) = p(A | C) p(B | C) whenever p(C) > 0, for all values of A, B and C.

The theory of independence as a basic tool for knowledge structuring is developed by Dawid and Pearl. A graphical model can be equated with the set of probability distributions that satisfy its implied constraints. Two graphical models are equivalent probability models if their corresponding sets of satisfying probability distributions are equivalent.

Directed graphical models

The basic kind of graphical model is the Bayesian network, also called a belief net, which is most popular in artificial intelligence; see Charniak, Shachter and Heckerman, and Pearl for an introduction. This is also a graphical representation for a Markov chain. A Bayesian network is a graphical model that uses directed arcs exclusively to form a DAG, that is, a directed graph without directed cycles. The figure below, adapted from Shachter and Heckerman, shows a simple Bayesian network for a simplified medical problem.

[Figure: A simplified medical problem. Age, Occupation and Climate are parents of Disease, which in turn is the parent of Symptoms.]

The graphical model represents a conditional decomposition of the joint probability (see Lauritzen, Dawid, Larsen and Leimer for more details and interpretations). This decomposition works as follows (full variable names have been abbreviated):


p(Age, Occ, Clim, Dis, Symp | M) = p(Age | M) p(Occ | M) p(Clim | M) p(Dis | Age, Occ, Clim, M) p(Symp | Dis, M),

where M is the conditioning context, for instance the expert's prior knowledge and the choice of the graphical model in the figure. Each variable is written conditioned on its parents, where parents(x) is the set of variables with a directed arc into x. The general form of this equation for a set of variables X is

p(X | M) = ∏_{x ∈ X} p(x | parents(x), M).

This equation is the interpretation of a Bayesian network used in this paper.
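The following minimal sketch evaluates this product form for a small directed model; the network structure and table entries below are illustrative, not taken from the paper.

    # A minimal sketch of p(X | M) = prod_x p(x | parents(x), M) for a small
    # Bayesian network with full conditional probability tables (illustrative).
    network = {
        "Age":     {"parents": [], "cpt": {(): {"old": 0.3, "young": 0.7}}},
        "Disease": {"parents": ["Age"],
                    "cpt": {("old",):   {"yes": 0.2, "no": 0.8},
                            ("young",): {"yes": 0.05, "no": 0.95}}},
        "Symptom": {"parents": ["Disease"],
                    "cpt": {("yes",): {"present": 0.9, "absent": 0.1},
                            ("no",):  {"present": 0.1, "absent": 0.9}}},
    }

    def joint_probability(network, assignment):
        """Product over nodes of p(x | parents(x)) for a full assignment."""
        prob = 1.0
        for node, spec in network.items():
            parent_values = tuple(assignment[p] for p in spec["parents"])
            prob *= spec["cpt"][parent_values][assignment[node]]
        return prob

    print(joint_probability(network,
          {"Age": "old", "Disease": "yes", "Symptom": "present"}))   # 0.054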

Undirected graphical models

Another popular form of graphical model is an undirected graph, sometimes called a Markov network (Pearl). This is a graphical model for a Markov random field. Markov random fields became used in statistics with the advent of the Hammersley-Clifford theorem (Besag, York & Mollié); a variant of the theorem is given later. Markov random fields are used in imaging and spatial reasoning (Ripley; Geman & Geman; Besag et al.) and in various stochastic models in neural networks (Hertz, Krogh & Palmer). Undirected graphs are also important because they simplify the theory of Bayesian networks (Lauritzen et al.). The figure below shows a simple image and an undirected model for the image.

[Figure: A simple image and its graphical model. The pixels p_{1,1}, ..., p_{4,4} form a 4 x 4 grid of nodes joined by undirected arcs between neighboring pixels.]

This model is based on the first-degree Markov assumption: the current pixel is only directly influenced by pixels positioned next to it, as indicated by the undirected arcs between neighboring pixel variables. Each node x corresponding to a pixel has its set of neighbors, those nodes it is directly connected to by an undirected arc. For instance, the neighbors of the corner pixel p_{1,1} are p_{1,2}, p_{2,1} and p_{2,2}. For the variable (node) x, denote these by neighbors(x).

In general there is no formula for undirected graphs in terms of conditional probabilities corresponding to the equation above for the Bayesian network. However, a functional decomposition does exist in another form, based on the maximal cliques of the graph. Maximal cliques are subgraphs that are fully connected but are not strictly contained in other fully connected subgraphs; here these are sets of variables such as {p_{1,1}, p_{1,2}, p_{2,1}, p_{2,2}}. The interpretation of the graph is that the joint probability is a product over functions of the maximal cliques:

p(p_{1,1}, ..., p_{4,4}) = ∏_{i=1}^{3} ∏_{j=1}^{3} f_{i,j}(p_{i,j}, p_{i,j+1}, p_{i+1,j}, p_{i+1,j+1})

for some functions f_{1,1}, ..., f_{3,3}, one per 2 x 2 block, defined up to a constant. From this formula it follows that a pixel such as p_{1,2} is conditionally independent of its non-neighbors given its neighbors p_{1,1}, p_{1,3}, p_{2,1}, p_{2,2} and p_{2,3}. The general form of this equation for a set of variables X is given in the next theorem; compare this with the Bayesian network equation above.

Theorem. An undirected graph G is on variables in the set X. The set of maximal cliques on G is Cliques(G). The distribution p(X) (a probability or probability density) is strictly positive in the domain ∏_{x ∈ X} domain(x). Then, under the distribution p(X), x is independent of X - ({x} ∪ neighbors(x)) given neighbors(x) for all x ∈ X (Frydenberg refers to this condition as local G-Markovian) if and only if p(X) has the functional representation

p(X) = ∏_{C ∈ Cliques(G)} f_C(C)

for some functions f_C.

The general form of this theorem for finite discrete domains is called the Hammersley-Clifford Theorem (Geman; Besag et al.). Again, this equation is used as the interpretation of a Markov network.
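The following minimal sketch evaluates the clique-product form for the image model above, with binary pixels, the nine 2 x 2 blocks taken as the maximal cliques, and an assumed agreement-rewarding potential; the result is known only up to the normalizing constant.

    # A minimal sketch of p(X) = prod_{C in Cliques(G)} f_C(C) for binary
    # pixels on a 4x4 grid whose maximal cliques are the nine 2x2 blocks.
    # The potential below is an illustrative choice, not the paper's.
    import itertools

    def clique_potential(values):
        # higher when the four pixels in the 2x2 block agree (assumed form)
        return 2.0 if len(set(values)) == 1 else 1.0

    def unnormalised_p(image):                 # image: dict (i, j) -> 0/1
        prob = 1.0
        for i, j in itertools.product(range(3), repeat=2):   # 2x2 blocks
            block = (image[i, j], image[i, j + 1],
                     image[i + 1, j], image[i + 1, j + 1])
            prob *= clique_potential(block)
        return prob                            # defined up to a constant

    all_black = {(i, j): 0 for i in range(4) for j in range(4)}
    print(unnormalised_p(all_black))           # 2.0 ** 9 = 512.0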

Conditional probability models

Consider the conditional probability p(Dis | Age, Occ, Clim) found in the simple medical problem above. This conditional probability models how the disease should vary for given values of age, occupation and climate. Class probability trees (Breiman, Friedman, Olshen & Stone; Quinlan), graphs and rules (Rivest; Oliver; Kohavi), and feed-forward networks are representations devised to express conditional models in different ways. In statistics the conditional distributions are also represented as regression models and generalized linear models (McCullagh & Nelder).

The directed and undirected models above show how the joint distribution is composed from simpler components; that is, they give a global model of the variables in a problem. The conditional probability models, in contrast, give a model for a subset of variables conditioned on knowing the values of another subset. In diagnosis the concern may be a particular direction for reasoning, such as predicting the disease given patient details and symptoms, so the full joint model provides unnecessary detail. The full joint model may require extra parameters and thus more data to learn.


In supervised learning applications the general view is that conditional models are superior unless prior knowledge dictates that a full joint model is more appropriate. This distinction is sometimes referred to as the diagnostic versus the discriminant approach to classification (Dawid).

There are a number of ways of explicitly representing conditional probability models. Any joint distribution implicitly gives the conditional distribution for any subset of variables, by the definition of conditional probability. For instance, if there is a model for p(Age, Occ, Clim, Dis, Symp), then by the definition of conditional probability a conditional model follows:

p(Dis | Age, Occ, Clim, Symp) = p(Age, Occ, Clim, Dis, Symp) / ∑_{Dis} p(Age, Occ, Clim, Dis, Symp).

Conditional distributions can also be represented by a single node that is labeled to identify which functional form the node takes. For instance, in the graphs to follow, labeled Gaussian nodes, linear nodes and other standard forms are all used. Conditional models such as rule sets and feed-forward networks can be constructed by the use of special deterministic nodes. For instance, the figure below shows four model constructs.

[Figure: Graphical conditional models. (a) A rule set built from deterministic rule_1, ..., rule_m nodes over inputs x_1, ..., x_n feeding a node c; (b) a Logistic node for a Boolean c given x; (c) a deterministic unit node over inputs x_1, ..., x_n; (d) a Gaussian node for y with mean mu and standard deviation sigma.]

In each case the input variables to the conditional models represented have been shaded. This shading means the values of these variables are known or given. In part (b) the shading indicates that the value for x is known but the value for c is unknown; presumably c will be predicted using x. Part (a) represents a rule set. The nodes with double ovals are deterministic functions of their inputs, in contrast to the usual nodes, which are probabilistic functions. This means that the value for rule_1 is a deterministic function of x_1, ..., x_n. Notice that this implies that the values for rule_1 and the other rules are known as well. The conditional probability for a deterministic node, as required for the product form, is treated as a delta function. Part (c) corresponds to the statement

p(unit | x_1, ..., x_n) = 1 if unit = f(x_1, ..., x_n), and 0 otherwise,

for some function f (not specified). A Bayesian network constructed entirely of double ovals is equivalent to a dataflow graph where the inputs are shaded. The analysis of deterministic nodes in Bayesian networks, and more generally in influence diagrams, is considered by Shachter. For some purposes deterministic nodes are best treated as intermediate variables and removed from the problem; the method for doing this is given later in a lemma.


The logical or conjunctive form of each rule in part (a) is not expressed in the graph, and presumably would be given in the formulas accompanying the graph; however, the basic functional structure of the rule set exists. In part (b) a node has been labeled with its functional type. The functional type for this node, with a Boolean variable c, is the function

p(c = true | x) = Sigmoid(x) = Logistic(x) = 1 / (1 + e^{-x}),

which maps a real value x onto a probability in (0, 1) for the binary variable c. This function is the inverse of the logistic or logit function used in generalized linear models (McCullagh & Nelder), and is also the sigmoid function used in feed-forward neural networks. Part (c) uses a deterministic node to reproduce a single unit from a connectionist feed-forward network, where the unit's activation is computed via a sigmoid function. Part (d) is a simple univariate Gaussian, which makes y normally distributed with mean mu and standard deviation sigma. Here the node is labeled in italics to indicate its conditional type.
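The following minimal sketch evaluates the two labeled conditional node types just described, the Logistic node and the Gaussian node, as plain functions; the test values are illustrative.

    # A minimal sketch of two labeled conditional nodes: a Logistic node
    # giving p(c = true | x) and a Gaussian node giving the density of y.
    import math

    def p_c_true_given_x(x):
        """Logistic node: p(c = true | x) = 1 / (1 + exp(-x))."""
        return 1.0 / (1.0 + math.exp(-x))

    def gaussian_density(y, mu, sigma):
        """Gaussian node: density of y with mean mu, std deviation sigma."""
        z = (y - mu) / sigma
        return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

    print(p_c_true_given_x(0.0))             # 0.5
    print(gaussian_density(0.0, 0.0, 1.0))   # ~0.3989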

At a more general level, networks can be conditional. The figure below shows two conditional versions of the simple medical problem.

[Figure: Two equivalent conditional models of the medical problem, (a) and (b). In both, Age, Occupation, Climate and Symptoms are shaded and Disease is not.]

If the shading of nodes is ignored, the joint probability p(Age, Occ, Clim, Dis, Symp) for the two graphs (a) and (b) is, respectively,

p(Age) p(Occ | Age) p(Clim | Age, Occ) p(Dis | Age, Occ, Clim) p(Symp | Age, Dis)   and
p(Age) p(Occ) p(Clim) p(Dis | Age, Occ, Clim) p(Symp | Age, Dis).

However, because four of the five nodes are shaded, their values are known. The conditional distributions computed from the above are identical:

p(Dis | Age, Occ, Clim, Symp) = p(Dis | Age, Occ, Clim) p(Symp | Age, Dis) / ∑_{Dis} p(Dis | Age, Occ, Clim) p(Symp | Age, Dis).

Why do these distinct graphs become identical when viewed from the conditional perspective? Because the conditional components of the model corresponding to age, occupation and climate cancel out when the conditional distribution is formed. However, the symptoms node has the unknown variable disease as a parent, so the arc from age to symptoms is kept.

More generally, the following simple lemma applies; it is derived directly from the product form for Bayesian networks.


Lemma. Given a Bayesian network G with some nodes shaded, representing a conditional probability distribution, if a node X and all its parents have their values given, then the Bayesian network G' created by deleting all the arcs into X represents a probability model equivalent to the Bayesian network G.

This does not mean, for instance in the graphs just discussed, that there are no causal or influential links between the variables age, occupation and climate; rather, their effects become irrelevant in the conditional model considered because their values are already known. A corresponding result holds for undirected graphs and follows directly from the theorem above.

Lemma. Given an undirected graph G with some nodes shaded, representing a conditional probability distribution, delete an arc between nodes A and B if all their common neighbors are given. The resultant graph G' represents a probability model equivalent to the graph G.

Mixed graphical models

Undirected and directed graphs can also be mixed in a sequence. These mixed graphs are called chain graphs (Wermuth & Lauritzen; Frydenberg). Chain graphs are sometimes used here; however, a precise understanding of them is not required for this paper. A simple chain graph is given in the figure below.

[Figure: An expanded medical problem. Age, Occupation and Climate point into Heart Disease and Lung Disease, which are joined by an undirected arc; the diseases point into Symptom A, Symptom B and Symptom C, which are joined by undirected arcs.]

In this case the single disease node and single symptom node of the earlier figure are expanded to represent the case where there are two possibly co-occurring diseases and three possibly co-occurring symptoms. The medical specialist may have said something like: "Lung disease and heart disease can influence each other, or may have some hidden common cause; however, it is often difficult to tell which is the cause of which, if at all." In the causal model, join the two disease nodes by an undirected arc to represent direct influence, and likewise for the symptom nodes. The resultant joint distribution takes the form

p(Age, Occ, Clim, HeartDis, LungDis, SympA, SympB, SympC)
  = p(Age) p(Occ) p(Clim) p(HeartDis, LungDis | Age, Occ, Clim) p(SympA, SympB, SympC | HeartDis, LungDis),


where the last two conditional probabilities can take on an arbitrary form. Notice that the probabilities now have more than one variable on the left side.

In general, a chain graph consists of a chain of undirected graphs connected by directed arcs. Any cycle through the graph cannot have directed arcs going in opposite directions. Chain graphs can be interpreted as Bayesian networks defined over the components of the chain instead of the original variables. This goes as follows.

Definition. Given a chain graph G over some variables X, the chain components are the subsets of X that are maximal undirected connected subgraphs in G (Frydenberg). Furthermore, let chain-components(A) denote all nodes in the same chain component as at least one variable in A.

The chain components for the graph above, ordered consistently with the directed arcs, are {Age}, {Occ}, {Clim}, {HeartDis, LungDis} and {SympA, SympB, SympC}. Informally, a chain graph over variables X with chain components given by the set T is interpreted, first, via the decomposition corresponding to the decomposition of Bayesian networks above:

p(X | M) = ∏_{τ ∈ T} p(τ | parents(τ), M),

where

parents(A) = (∪_{a ∈ A} parents(a)) - A.

Each component probability p(τ | parents(τ), M) has a form similar to the conditional decomposition above.

Sometimes, to process graphs of this form without having to consider the mathematics of chain graphs, the following device is used.

Comment. When a set of nodes U in a chain graph forms a clique (a fully connected subgraph) and the nodes all have identical children and parents otherwise, then the set of nodes can be replaced by a single node representing the cross product of the variables.

This operation, applied to the figure above, gives the figure below.

[Figure: The expanded medical problem with merged nodes. Heart Disease & Lung Disease form a single node, as do Symptom A & Symptom B & Symptom C.]

Furthermore, chain graphs are sometimes used here where Bayesian networks can also be used. In this case:

Comment. The chain components of a Bayesian network are the singleton sets of individual variables in the graph. Furthermore, chain-components(A) = A.


Chain graphs can be decomposed into a chain of directed and undirected graphs. An example is given in the figure below.

[Figure: Decomposing a chain graph. (a) The original chain graph over the variables a, b, c, d, e, f, g, h; (b) its directed and undirected components, together with the Bayesian network over the chain components showing how they are pieced together.]

Part (a) shows the original chain graph, and part (b) shows its directed and undirected components, together with the Bayesian network on the right showing how they are pieced together. Having done this decomposition, the components are analyzed using all the machinery of directed and undirected graphs. The interpretation of these graphs in terms of independence statements, and the implied functional form of the joint probability, is a combination of the previous two forms, based on Frydenberg and on the interpretation of conditional graphical models above.

Introduction to learning with graphical models

A simplified inference problem is represented in the figure below. Here the nodes var_1, var_2 and var_3 are shaded. This represents that the values of these nodes are given, so the inference task is to predict the value of the remaining variable, class. This graph matches the so-called idiot's Bayes classifier (Duda & Hart; Langley, Iba & Thompson) used in supervised learning for its speed and simplicity. The probabilities on this network are easily learned from data about the three input variables var_1, var_2, var_3 and the class. This graph also matches an unsupervised learning problem, where the class is not in the data but is hidden. An unsupervised learning algorithm learns hidden classes (Cheeseman, Self, Kelly, Taylor, Freeman & Stutz; McLachlan & Basford).

[Figure: A simple classification problem. The node class is the parent of var_1, var_2 and var_3, which are shaded.]


The implied joint for these variables, read from the graph, is

p(class, var_1, var_2, var_3) = p(class) p(var_1 | class) p(var_2 | class) p(var_3 | class).

The Bayesian classifier gets its name because it is derived by applying Bayes theorem to this joint to get the conditional formula

p(class | var_1, var_2, var_3) = p(class) p(var_1 | class) p(var_2 | class) p(var_3 | class) / ∑_{class} p(class) p(var_1 | class) p(var_2 | class) p(var_3 | class).

The same formula is used to predict the hidden class for objects in the simple unsupervised learning framework. Again, this formula, and the corresponding formulas for more general classifiers, can be found automatically by using exact methods for inference on Bayesian networks.
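The following minimal sketch evaluates the classifier formula above; the probability tables are illustrative, not taken from the paper.

    # A minimal sketch of p(class | var1..3) proportional to
    # p(class) * prod_j p(var_j | class), with illustrative tables.
    phi = {"c1": 0.6, "c2": 0.4}                      # p(class)
    theta = {                                         # p(var_j = True | class)
        "c1": [0.9, 0.2, 0.7],
        "c2": [0.3, 0.8, 0.4],
    }

    def class_posterior(observed):                    # observed: [bool] * 3
        scores = {}
        for c in phi:
            s = phi[c]
            for j, value in enumerate(observed):
                p_true = theta[c][j]
                s *= p_true if value else (1.0 - p_true)
            scores[c] = s
        total = sum(scores.values())                  # sum over class values
        return {c: s / total for c, s in scores.items()}

    print(class_posterior([True, False, True]))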

Consider the simple model given in the figure above. If the unsupervised learning problem for this model was represented, a sample of N cases of the variables would be observed, with the first case being var_{1,1}, var_{2,1}, var_{3,1} and the N-th case being var_{1,N}, var_{2,N}, var_{3,N}. The corresponding hidden classes class_1 to class_N would not be observed, but interest would be in performing inference about the parameters needed to specify the hidden classes. The learning problem is represented in the figure below. This includes two added features: an explicit representation of the model parameters φ and θ_1, θ_2, θ_3, and a representation of the sample as N repeated subgraphs.

[Figure: Learning the simple classification. The parameter φ is the parent of each class_i, and θ_1, θ_2, θ_3 are parents of var_{1,i}, var_{2,i}, var_{3,i} respectively, for cases i = 1, ..., N.]

The parameter φ, a vector of class probabilities, gives the proportions for the hidden classes, and the three parameters θ_1, θ_2 and θ_3 give how the variables are distributed within each hidden class. For instance, if there are C classes, then φ is a vector of C class probabilities such that the probability of a case being in class c is φ_c. If var_1 is a binary variable, then θ_1 would be C probabilities, one for each class, such that if the case is known to be in class c then the probability that var_1 is true is θ_{1,c} and the probability that var_1 is false is 1 - θ_{1,c}. This yields the following equations:

p(class = c | φ, M) = φ_c,
p(var_j = true | class = c, θ_j, M) = θ_{j,c}.
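The equations above also have a generative reading, which the following minimal sketch illustrates: each case draws a hidden class from φ and then each binary variable from the corresponding θ. The parameter values are illustrative, not from the paper.

    # A minimal sketch of sampling cases from the latent class model:
    # class_i ~ phi, then var_{j,i} ~ Bernoulli(theta_{j, class_i}).
    import random

    random.seed(0)
    phi = {"c1": 0.6, "c2": 0.4}                            # p(class = c)
    theta = {"c1": [0.9, 0.2, 0.7], "c2": [0.3, 0.8, 0.4]}  # theta_{j,c}

    def draw_class():
        r, acc = random.random(), 0.0
        for c, p in phi.items():
            acc += p
            if r < acc:
                return c
        return c

    def draw_case():
        c = draw_class()                        # hidden during learning
        variables = [random.random() < theta[c][j] for j in range(3)]
        return c, variables

    sample = [draw_case() for _ in range(5)]
    print(sample)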


The unknown model parameters φ and θ_1, θ_2, θ_3 are included in the graphical model to explicitly represent all unknown variables and parameters in the learning problem.

Introduction to Bayesian learning

Now is a useful time to introduce the basic terminology of Bayesian learning theory. This is not an introduction to the field; introductions are given in Bretthorst, Press, Loredo, Bernardo and Smith, and Cheeseman. This section reviews notions such as the sample likelihood, the posterior and the evidence that are important for subsequent results.

For the above unsupervised learning problem there is the model M, which is the use of the hidden class and the particular graphical structure of the figure above. There are data, assumed to be independently sampled, and there are the parameters of the model, φ, θ_1, etc. In order to use the theory it must be assumed that the model is correct; that is, the true distribution for the data can be assumed to come from this model with some parameters. In practice, hopefully, the model assumptions are sufficiently close to the truth, and different sets of model assumptions may be tried. Typically the true model parameters are unknown, although there may be some rough idea about their values. Sometimes several models are considered, for instance different kinds of Bayesian networks, but it is assumed that just one of them is correct. Model selection or model averaging methods are used to deal with them.

For the Bayesian classifier above, a subjective probability is placed over the model parameters in the form p(φ, θ_1, θ_2, θ_3 | M). This is called the prior probability. Bayesian statistics and decision theory is distinguished from all other statistical approaches in that it places initial probability distributions, the prior probability, over unknown model parameters. If the model is a feed-forward neural network, then a prior probability needs to be placed over the network weights and the standard deviation of the error. If the model is linear regression with Gaussian error, then it is placed over the linear parameters and the standard deviation of the error. Prior probabilities are an active area of research and are discussed in most introductions to Bayesian methods.

The next important component is the sample likelihood, which, on the basis of the model assumptions M and given a set of parameters φ and θ_1, θ_2, θ_3, says how likely the sample of data was. This is p(sample | φ, θ_1, θ_2, θ_3, M). The model needs to completely determine the sample likelihood. The sample likelihood is the basis of the maximum likelihood principle and many hypothesis testing methods (Casella & Berger). It combines with the prior to form the posterior:

p(φ, θ_1, θ_2, θ_3 | sample, M) = p(sample | φ, θ_1, θ_2, θ_3, M) p(φ, θ_1, θ_2, θ_3 | M) / p(sample | M).

This equation is Bayes theorem, and the term p(sample | M) is derived from the prior and sample likelihood using an integration or sum that is often difficult to do:

p(sample | M) = ∫ p(sample | φ, θ_1, θ_2, θ_3, M) p(φ, θ_1, θ_2, θ_3 | M) dφ dθ_1 dθ_2 dθ_3.

This term is called the evidence for model M, or the model likelihood, and is the basis for most Bayesian model selection and model averaging methods and Bayesian hypothesis testing


methods using Bayes factors (Smith & Spiegelhalter; Kass & Raftery). The Bayes factor is a relative quantity used to compare one model M_2 with another M_1:

Bayes-factor(M_2, M_1) = p(sample | M_2) / p(sample | M_1).

Kass and Raftery review the large variety of methods available for computing or estimating the evidence for a model, including numerical integration, importance sampling, and the Laplace approximation. In implementation, the log of the Bayes factor is used to keep the arithmetic within reasonable bounds. The log of the evidence can still produce large numbers, and since rounding errors in floating point arithmetic scale with the order of magnitude, the log Bayes factor is the preferred quantity to consider in implementation; the evidence is often simpler in mathematical analysis. The Bayes factor is the Bayesian equivalent of the likelihood ratio test used in orthodox statistics and developed by Wilks; see Casella and Berger for an introduction and Vuong for a recent review.
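As an illustration of comparing models by their evidence, the following minimal sketch computes a Bayes factor for the coin-tossing setting used later in the paper: one model fixes a fair coin, the other places a Beta(1, 1) prior on the bias. Both evidences have closed forms here, and the log is used as described above; the data are illustrative.

    # A minimal sketch of a Bayes factor from two model evidences for a
    # sequence of coin tosses.  M1: fair coin; M2: Beta(1, 1) prior on bias.
    import math

    def log_beta(a, b):
        return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

    heads = [True, True, True, False, True, True, True, True]   # 7 heads, 1 tail
    p, n = sum(heads), len(heads) - sum(heads)

    log_evidence_fair = len(heads) * math.log(0.5)               # p(sample | M1)
    log_evidence_beta = log_beta(1 + p, 1 + n) - log_beta(1, 1)  # p(sample | M2)

    log_bayes_factor = log_evidence_beta - log_evidence_fair     # M2 versus M1
    print(math.exp(log_bayes_factor))                            # ~3.56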

The evidence and Bayes factors are fundamental to Bayesian methods. It is often the case that a complex non-parametric model (a statistical term that loosely translates as "many and varied parameter model") is used for a problem rather than a simple model with some fixed number of parameters. Examples of such models are decision trees, most neural networks, and Bayesian networks. For instance, suppose two models are proposed, with M_1 and M_2 being two Bayesian networks suggested by the domain expert. These are given in the figure below. They are over two multinomial variables var_1 and var_2 and two Gaussian variables x_1 and x_2, and one of the models has an additional arc going from one of the discrete variables to one of the real valued variables.

[Figure: Two graphical models, M_1 and M_2, over discrete variables var_1, var_2 (with parameters θ_1, θ_2) and Gaussian variables x_1, x_2 (with means and standard deviations μ_1, σ_1, μ_2, σ_2), inside a plate of size N. Which should learning select?]

The parameters for model M_1 parameterize probability distributions for the first Bayesian network, and the parameters for M_2 those for the second. The task is to learn not only a set of parameters but also to select a Bayesian network from the two. The Bayes factor gives the comparative worth of the two models. This simple example extends, in principle, to selecting a single decision tree, rule set, or Bayesian network from the huge number available from the attributes in the domain.


In this case, compare the posterior probabilities of the two models, p(M_1 | sample) and p(M_2 | sample). Assuming the truth falls in one or other model, the first is computed using Bayes theorem as

p(M_1 | sample) = p(sample | M_1) p(M_1) / (p(sample | M_1) p(M_1) + p(sample | M_2) p(M_2))
               = 1 / (1 + (p(M_2) / p(M_1)) Bayes-factor(M_2, M_1)).

More generally, when multiple models exist, it still holds that

p(M_2 | sample) / p(M_1 | sample) = (p(M_2) / p(M_1)) Bayes-factor(M_2, M_1).

Notice that the computation requires, for each model, its prior and its evidence. The second form reduces the computation to relative quantities, namely the Bayes factor and a ratio of the priors. Bayesian hypothesis testing corresponds to checking whether the Bayes factor of the null hypothesis compared to the alternative hypothesis is very small or very large. Bayesian model building corresponds to searching for a model with a high value of p(M | sample) ∝ p(sample | M) p(M), which usually involves comparing Bayes factors of this model with alternative models during the search.

When making an estimate about a new case x, the estimate becomes

p(x | sample, {M_1, M_2}) = p(M_1 | sample) p(x | sample, M_1) + p(M_2 | sample) p(x | sample, M_2)
  = (p(sample | M_1) p(M_1) p(x | sample, M_1) + p(sample | M_2) p(M_2) p(x | sample, M_2)) / (p(sample | M_1) p(M_1) + p(sample | M_2) p(M_2)).

The predictions of the individual models are averaged according to the model posteriors p(M_1 | sample) and p(M_2 | sample) = 1 - p(M_1 | sample). The general components used in this calculation are the model priors, the evidence for each model (or the Bayes factors), and the prediction for the new case made by each model.

This process of model averaging happens in general. A typical non-parametric problem would be to learn class probability trees from data; the number of class probability tree models is super-exponential in the number of features. Even when learning Bayesian networks from data, the number of networks is exponential in the number of possible arcs, which is quadratic in the number of features. Doing an exhaustive search of these spaces, and doing the full averaging implied by the equation above, is computationally infeasible in general. It may be the case that a few models have relatively large posterior probabilities p(M | sample), while several thousand more models have small but non-negligible posteriors. Rather than select a single model, a representative set of several models might be chosen and averaged using the identity

p(x | sample) ≈ ∑_i p(M_i | sample) p(x | sample, M_i).

The general averaging process is depicted in the figure below, where a Gibbs sampler is used to generate a representative subset of models with high posterior. This kind of computation is done for class probability trees, where representative sets of trees are found using a heuristic branch and bound algorithm (Buntine), and for learning Bayesian networks (Madigan & Raftery). A sampling scheme for Bayesian networks is presented in a later section.
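A minimal sketch of the averaging identity above follows: the predictions of a representative set of models are weighted by their posteriors. The numbers loosely echo the figure below and are otherwise illustrative.

    # A minimal sketch of p(x | sample) ~= sum_i p(M_i | sample) p(x | sample, M_i).
    posteriors  = {"M1": 0.212, "M2": 0.324, "M3": 0.172, "M4": 0.292}  # p(M_i | sample)
    predictions = {"M1": 0.012, "M2": 0.032, "M3": 0.109, "M4": 0.201}  # p(x | sample, M_i)

    p_x_given_sample = sum(posteriors[m] * predictions[m] for m in posteriors)
    print(p_x_given_sample)     # ~0.090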

[Figure: Averaging over multiple Bayesian networks. A Gibbs sampler generates a representative set of networks; each network's posterior p(M_i | Sample) is multiplied by its prediction p(New Data | M_i), and the products are summed to give p(New Data | Sample), here about 0.090.]

Plates: representing learning with graphical models

In their current form, graphical models do not allow the convenient representation of a learning problem. There are four important points to be observed regarding the use of graphical models to improve their suitability for learning. Consider again the unsupervised learning system described above.

1. The unknown model parameters φ and θ_1, θ_2, θ_3 are included in the graphical model to explicitly represent all variables in the problem, even model parameters. By including these in the probabilistic model, an explicitly Bayesian model is constructed. Every variable in a graphical model, even an unknown model parameter, has a defined prior probability.

2. The learning sample is a repeated set of measured variables, so the basic classification model appears duplicated as many times as there are cases in the sample, as shown in the learning figure. Clearly this awkward repetition will occur whenever homogeneous data is being modeled, which is typical in learning. Techniques for handling this repetition form a major part of this paper.

3. Neither graph represents the goal of learning. For learning to be goal directed, additional information needs to be included in the graph: how is learned knowledge evaluated, or how can subsequent performance be measured? This is the role of decision theory, and it is modeled in graphical form using influence diagrams (Shachter). This is not discussed here but is covered in Buntine.

4. Finally, it must be possible to take a graphical representation of a learning problem and the goal of learning, and construct an algorithm to solve the problem. Subsequent sections discuss techniques for this.

Consider a simplified version of the same unsupervised problem. In fact, the simplest possible learning problem containing uncertainty goes as follows.


[Figure: Tossing a coin, modeled without and with a plate. On the left, θ with a Beta(1.5, 1.5) prior is the parent of heads_1, heads_2, ..., heads_N; on the right, the heads node sits inside a plate of size N.]

There is a biased coin with an unknown bias θ for heads; that is, the long-run proportion of heads for this coin on a fair toss is θ. The coin is tossed N times, and each time the binary variable heads_i is recorded. The graphical model for this is in part (a) of the figure above. The heads_i nodes are shaded because their values are given, but the θ node is not. The θ node has a Beta(1.5, 1.5) prior. This assumes θ is distributed according to the Beta distribution with parameters α_1 and α_2:

p(θ | α_1, α_2) = θ^{α_1 - 1} (1 - θ)^{α_2 - 1} / Beta(α_1, α_2),

where Beta(α_1, α_2) is the standard beta function given in many mathematical tables. This prior is plotted in the figure below. A Beta(1, 1) prior, for instance, is uniform in θ, whereas Beta(1.5, 1.5) slightly favors values closer to a fairer coin, θ = 0.5.

[Figure: The Beta(1.5, 1.5) prior and other Beta priors on θ.]

Part (b) of the coin-tossing figure is an equivalent graphical model using the notation of plates. The repeated group, in this case the heads_i nodes, is replaced by a single node with a box around it. The box is referred to as a plate and implies that:
- the enclosed subgraph is duplicated N times (into a stack of plates),
- the enclosed variables are indexed, and
- any exterior-interior links are duplicated.

It was shown above that any Bayesian network has a corresponding form for the joint probability of the variables in the network. The same applies to plates: each plate indicates that a product ∏ will appear in the corresponding form. The probability equation for the expanded learning figure, read directly from the graph, is

p(φ, θ_1, θ_2, θ_3, class_1, var_{1,1}, var_{2,1}, var_{3,1}, ..., class_N, var_{1,N}, var_{2,N}, var_{3,N})
  = p(φ) p(θ_1) p(θ_2) p(θ_3) p(class_1 | φ) p(var_{1,1} | class_1, θ_1) p(var_{2,1} | class_1, θ_2) p(var_{3,1} | class_1, θ_3)
    ... p(class_N | φ) p(var_{1,N} | class_N, θ_1) p(var_{2,N} | class_N, θ_2) p(var_{3,N} | class_N, θ_3).

The corresponding equation using product notation is

p(φ, θ_1, θ_2, θ_3, {class_i, var_{1,i}, var_{2,i}, var_{3,i} : i = 1, ..., N})
  = p(φ) p(θ_1) p(θ_2) p(θ_3) ∏_{i=1}^{N} p(class_i | φ) p(var_{1,i} | class_i, θ_1) p(var_{2,i} | class_i, θ_2) p(var_{3,i} | class_i, θ_3).

These two equations are equivalent. However, the differences in their written form correspond to the differences in their graphical form. Each plate is converted into a product ∏: the joint probability ignoring the plates is written, and then a product is added for each plate to index the variables inside it. If there are two disjoint plates then there are two disjoint products; overlapping plates yield overlapping products. The corresponding transformation for the unsupervised learning problem is given in the figure below.

[Figure: Simple unsupervised learning with a plate. The parameters φ, θ_1, θ_2, θ_3 sit outside a plate of size N containing class and var_1, var_2, var_3.]

Notice the hidden class variable is not shaded, so it is not given. The corresponding transformation for the supervised learning problem, where the classes are given and which thus corresponds to the idiot's Bayes classifier, is identical except that the class variable is shaded, because the classes are now part of the training sample.

Many learning problems can be similarly modeled with plates. Write down the graphical model for the full learning problem with only a single case provided. Put a box around the data part of the model, pull out the model parameters (for instance, the weights of the network, or the classification parameters), and ensure they are unshaded because they are unknown. Now add the data set size N to the bottom left corner of the box.
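As a concrete illustration of reading a probability off a plated model, the following minimal sketch computes the log joint of the unsupervised model above: the prior terms plus one factor per case inside the plate. Parameter values are illustrative and the prior densities are left as a stub since their form is not fixed here.

    # A minimal sketch of the plate product form: log p = log priors +
    # sum over the N cases of log[ p(class_i | phi) prod_j p(var_{j,i} | class_i, theta_j) ].
    import math

    def log_prior(phi, theta):
        return 0.0      # placeholder: add log p(phi) + sum_j log p(theta_j)

    def log_joint(phi, theta, cases):
        """cases: list of (class_label, [var1, var2, var3]) tuples."""
        total = log_prior(phi, theta)
        for c, variables in cases:                      # one factor per case i
            total += math.log(phi[c])
            for j, value in enumerate(variables):
                p_true = theta[c][j]
                total += math.log(p_true if value else 1.0 - p_true)
        return total

    phi = {"c1": 0.6, "c2": 0.4}
    theta = {"c1": [0.9, 0.2, 0.7], "c2": [0.3, 0.8, 0.4]}
    print(log_joint(phi, theta, [("c1", [True, False, True]),
                                 ("c2", [False, True, False])]))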

The notion of a plate is formalized below. This formalization is included for use in subsequent proofs.


Definition. A chain graph G with plates on variable set X consists of a chain graph G on variables X with additional boxes, called plates, placed around groups of variables. Only directed arcs can cross plate boundaries, and plates can be overlapping. Each plate P has an integer N_P in the bottom left corner indicating its cardinality. Each plate indexes the variables inside it with values i = 1, ..., N_P. Each variable V ∈ X occurs in some subset of the plates. Let indval(V) denote the set of values for the indices corresponding to these plates; that is, indval(V) is the cross product of the index sets {1, ..., N_P} for the plates P containing V.

A graph with plates can be expanded to remove the plates; the expanded learning figure given earlier is the expanded form of the plated figure above. Given a chain graph with plates G on variables X, construct the expanded graph as follows:
- For each variable V ∈ X, add a node for V_i for each i ∈ indval(V).
- For each undirected arc between variables U and V, add an undirected arc between U_i and V_i for i ∈ indval(V) = indval(U).
- For each directed arc between variables U and V, add a directed arc between U_i and V_j for i ∈ indval(U) and j ∈ indval(V), where i and j have identical values for index components from the same plate.
The parents for indexed variables in a graph with plates are the parents in the expanded graph:

parents(U) = ∪_{i ∈ indval(U)} parents(U_i).

A graph with plates is interpreted using the following product form. If the product form for the chain graph G without plates, with chain components T, is

p(X | M, G) = ∏_{τ ∈ T} p(τ | parents(τ), M),

then the product form for the chain graph G with plates has a product for each plate:

p(X | M, G) = ∏_{τ ∈ T} ∏_{i ∈ indval(τ)} p(τ_i | parents(τ_i), M).

This is given by the expanded version of the graph. Testing for independence on chain graphs with plates involves expanding the plates; in some cases this can be simplified.

Exact operations on graphical models

This section introduces basic inference methods on graphs without plates and exact inference methods on graphs with plates. While no well-known learning algorithms are presented in this section as such, the operations explained here are the mathematical basis of most fast learning algorithms, so their importance should not be underestimated. Their use within more well-known learning algorithms is explained in later sections.


Once a graphical model is developed to represent a problem, the graph can be manipulated using various exact or approximate transformations to simplify the problem. This section reviews the basic exact transformations available: arc reversal, node removal, and exact removal of plates by recursive arc reversal. The summary of operations emphasizes the computational aspects. A graphical model has an associated set of definitions or tables for the basic functions and conditional probabilities implied by the graph; the operations given below affect both the graphical structure and these underlying mathematical specifications. In both cases, the process of making these transformations should be constructive, so that a graphical specification for a learning problem can be converted into an algorithm.

There are several generic approaches to performing inference on directed and undirected networks without plates. These approaches are mentioned but will not be covered in detail. The first approach is exact, and corresponds to removing independent or irrelevant information from the graph and then attempting to optimize an exact probabilistic computation by finding a reordering of the variables. The second approach is approximate, and corresponds to approximate algorithms such as Gibbs sampling and other Markov chain Monte Carlo methods (Hrycej; Hertz et al.; Neal). In some cases the complexity of the first approach is inherently exponential in the number of variables, so the second can be more efficient. The two approaches can be combined in some cases after appropriate reformulation of the problem (Dagum & Horvitz).

Exact inference without plates

The exact inference approach has been highly refined for the case where all variables are discrete. It is not surprising that the available algorithms have strong similarities (Shachter, Andersen & Szolovits), since the major choice points involve the ordering of the summation and whether this ordering is selected dynamically or statically. Other special classes of inference algorithms include the cases where the model is a multivariate Gaussian (Shachter & Kenley; Whittaker) or corresponds to some specific diagnostic structure, such as two-level belief networks with a level of symptoms connected to a level of diseases (Henrion). This subsection reviews some simple exact transformations on graphical models without plates. Two representative methods are covered, but they are by no means optimal: arc reversal and arc removal. They are important, however, because they are the building blocks on which the methods for graphs with plates are based. Many more sophisticated variations and combinations of these algorithms exist in the literature, including the handling of deterministic nodes (Shachter) and of chain graphs and undirected graphs (Frydenberg).

Arc reversal

Two basic steps for inference are to marginalize out nuisance parameters or to condition on new evidence. This may require evaluating probability variables in a different order. The arc reversal operator interchanges the order of two nodes connected by a directed arc (Shachter). This operator corresponds to Bayes theorem and is used, for instance, to automate the derivation of the conditional classifier formula from the joint given earlier. The operator applies to directed acyclic graphs and to chain graphs where a and b are adjacent chain components.


[Figure: Arc reversal, reversing nodes a and b. Before reversal, b has parents(b) and a has parents {b} ∪ parents(a); after reversal, a and b each have parents parents(a) ∪ parents(b), with the arc now running from a to b.]

Consider a fragment of a graphical model as given in the left of the figure, where A = parents(a) ∪ parents(b). The equation for this fragment is

p(a, b | A) = p(b | parents(b)) p(a | b, parents(a)).

Suppose nodes a and b need to be reordered. Assume that between a and b there is no directed path of length greater than one; if there were, then reversing the arc between a and b would create a cycle, which is forbidden in a Bayesian network and in a chain graph. The formula for the variable reordering can be found by applying Bayes theorem to the above equation:

p(a | A) = ∑_b p(b | parents(b)) p(a | b, parents(a)),
p(b | a, A) = p(b | parents(b)) p(a | b, parents(a)) / ∑_b p(b | parents(b)) p(a | b, parents(a)).

The corresponding graph is given in the right of the figure. Notice that the effect on the graph is that the nodes for a and b now share their parents. This is an important point. If all of a's parents were also b's, and vice versa excepting b, that is parents(a) = parents(b) ∪ {b}, then the graph would be unchanged except for the direction of the arc between a and b. Regardless, the probability tables or formulas associated with the graph also need to be updated. If the variables are discrete and full conditional probability tables are maintained, then this operation requires instantiating the set {a, b} ∪ parents(a) ∪ parents(b) in all ways, which is exponential in the number of variables.
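The following minimal sketch applies these two formulas to probability tables in the simplest case of two discrete nodes with no other parents; the table entries are illustrative.

    # A minimal sketch of arc reversal via Bayes theorem: given p(b) and
    # p(a | b), compute p(a) and p(b | a).
    p_b = {"t": 0.3, "f": 0.7}                       # p(b), no parents
    p_a_given_b = {"t": {"t": 0.9, "f": 0.1},        # p(a | b)
                   "f": {"t": 0.2, "f": 0.8}}

    # p(a) = sum_b p(b) p(a | b)
    p_a = {a: sum(p_b[b] * p_a_given_b[b][a] for b in p_b) for a in ("t", "f")}

    # p(b | a) = p(b) p(a | b) / p(a)
    p_b_given_a = {a: {b: p_b[b] * p_a_given_b[b][a] / p_a[a] for b in p_b}
                   for a in ("t", "f")}

    print(p_a)            # {'t': 0.41, 'f': 0.59}
    print(p_b_given_a)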

Arc and node removal

Some variables in a graph are part of the model but are not important for the goal of the data analysis; these are called nuisance parameters. An unshaded node y without children (no outward going arcs) that is neither an action node nor a utility node can always be removed from a Bayesian network. This corresponds to leaving out the term p(y | parents(y)) in the product form of the network. Given

p(a, b, y) = p(a) p(b | a) p(y | a, b),

y can be marginalized out trivially to yield

p(a, b) = p(a) p(b | a).


More generally, this applies to chain graphs: a chain component whose nodes are all unshaded and have no children can be removed. If y is a node without children, then remove the node y from the graph along with the arcs into it, and ignore the factor p(y | parents(y)) in the full joint form. Notice, for instance, that the i-th case in the learning figure (the nodes class_i, var_{1,i}, var_{2,i}, var_{3,i}) can be removed from the model without affecting the rest of the graph.

Removal of plates by exact methods

Consider the simple coin-tossing problem again. The graph represents the joint probability p(θ, heads_1, ..., heads_N). The main question of interest here is the conditional probability of θ given the data heads_1, ..., heads_N. This could be obtained through repeated arc reversals: between θ and heads_1, then between θ and heads_2, and so on, until all the data appears before θ in the directed graph. Doing this repeated series of applications of Bayes theorem yields a fully connected graph with N(N + 1)/2 arcs. The corresponding formula for the posterior, simplified with the lemma above, is also simple:

p(θ | heads_1, ..., heads_N) = θ^{α_1 + p - 1} (1 - θ)^{α_2 + n - 1} / Beta(α_1 + p, α_2 + n),

where p is the number of heads in the sequence and n = N - p is the number of tails. This is a worthwhile introductory exercise in Bayesian decision theory (Howard) that should be familiar to most students of statistics. Compare this with the prior given earlier. There are several important points to notice about this result.

Effectively this does a parameter update, α_1 → α_1 + p and α_2 → α_2 + n, requiring no search or numerical optimization. The whole sequence of tosses, irrespective of its length and of the ordering of the heads and tails, can be summed up with two numbers. These are called sufficient statistics because, assuming the model used is correct, they are sufficient to explain all that is important about θ in the data.

The corresponding graph can be simplified as shown in the figure below: the plate is efficiently removed and replaced by the sufficient statistics, two numbers, irrespective of the size of the sample.

[Figure: Removing the plate in the coin problem. The plate of heads nodes is replaced by the sufficient statistics (n, p), and θ is given the updated Beta posterior.]
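A minimal sketch of the parameter update just described follows: the toss sequence is collapsed to the sufficient statistics (p, n) and the Beta(1.5, 1.5) prior parameters are incremented. The data are illustrative.

    # A minimal sketch of the conjugate Beta update: prior Beta(1.5, 1.5)
    # becomes posterior Beta(1.5 + p, 1.5 + n), with no search or optimization.
    alpha1, alpha2 = 1.5, 1.5                   # prior Beta parameters
    tosses = [True, False, True, True, False, True, True]

    p = sum(tosses)                             # number of heads
    n = len(tosses) - p                         # number of tails
    posterior = (alpha1 + p, alpha2 + n)        # sufficient statistics suffice

    print(p, n, posterior)                      # 5 2 (6.5, 3.5)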

The posterior distribution has a simple form. Furthermore, all the moments of θ, log θ and log(1 - θ) for the distribution can be computed as simple functions of the normalizing constant Beta(α_1, α_2). For instance, writing the posterior parameters as α_1' = α_1 + p and α_2' = α_2 + n,

E_{θ | heads_1, ..., heads_N}[log θ] = ∂ log Beta(α_1', α_2') / ∂α_1',
E_{θ | heads_1, ..., heads_N}[θ] = Beta(α_1' + 1, α_2') / Beta(α_1', α_2'),
E_{θ | heads_1, ..., heads_N}[θ (1 - θ)] = Beta(α_1' + 1, α_2' + 1) / Beta(α_1', α_2').

This result might seem somewhat obscure, but it is a general property, holding for a large class of distributions, that allows some averages to be calculated by symbolic manipulation of the normalizing constant.
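The following minimal sketch checks two of these identities numerically using ratios of Beta functions computed via log-gamma; the posterior parameters are illustrative.

    # A minimal sketch of reading moments off the normalizing constant of a
    # Beta(a, b) posterior, checked against the familiar closed forms.
    import math

    def beta_fn(a, b):
        return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))

    a, b = 1.5 + 5, 1.5 + 2                       # posterior parameters

    e_theta     = beta_fn(a + 1, b) / beta_fn(a, b)
    e_theta_1mt = beta_fn(a + 1, b + 1) / beta_fn(a, b)

    print(e_theta,     a / (a + b))                        # equal
    print(e_theta_1mt, a * b / ((a + b) * (a + b + 1)))    # equal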

The exponential family

This result generalizes to a much larger class of distributions, referred to as the exponential family (Casella & Berger; DeGroot). This includes standard undergraduate distributions such as the Gaussian, chi-squared and Gamma, and many more complex distributions constructed from simple components, including class probability trees over discrete input domains (Buntine), simple discrete and Gaussian versions of a Bayesian network (Whittaker), and linear regression with a Gaussian error. Thus this is a broad and not insignificant class of distributions, given in the definition below. Their general form has a linear combination of parameters and data in the exponential.

Definition. A space X is independent of the parameter θ if the space remains the same when just θ is changed. If the domains of x and y are independent of θ, then the conditional distribution for x given y, p(x | y, θ, M), is in the exponential family when

p(x | y, θ, M) = (h(x, y) / Z(θ)) exp( ∑_{i=1}^{k} w_i(θ) t_i(x, y) )

for some functions w_i, t_i, h and Z and some integer k, with h(x, y) ≥ 0. The normalization constant Z(θ) is known as the partition function.

Notice that this functional form is similar to the functional form for an undirected graph in the theorem above, as holds in many cases for a Markov random field. For the previous coin-tossing example, both the coin-tossing distribution (a binomial on heads_i) and the posterior distribution on the model parameters are in the exponential family. To see this, notice the following rewrites of the original probabilities, which make the components w_i, t_i and Z explicit:

p(heads | θ) = exp( 1_{heads = true} log θ + 1_{heads = false} log(1 - θ) ),
p(θ | heads_1, ..., heads_N) = (1 / Beta(α_1 + p, α_2 + n)) exp( (α_1 + p - 1) log θ + (α_2 + n - 1) log(1 - θ) ).
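The following minimal sketch checks the first rewrite numerically: evaluating the exponential-family form with the indicated t_i and w_i reproduces θ for heads and 1 - θ for tails.

    # A minimal sketch of the Bernoulli in exponential-family form, with
    # t_1 = 1[heads], t_2 = 1[not heads], w_1 = log theta, w_2 = log(1 - theta),
    # and h = Z = 1.
    import math

    def p_exponential_form(heads, theta):
        t = (1.0 if heads else 0.0, 0.0 if heads else 1.0)      # t_i(x)
        w = (math.log(theta), math.log(1.0 - theta))            # w_i(theta)
        return math.exp(sum(wi * ti for wi, ti in zip(w, t)))

    theta = 0.3
    print(p_exponential_form(True, theta), theta)           # both 0.3
    print(p_exponential_form(False, theta), 1.0 - theta)    # both 0.7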

A table in Appendix B gives a selection of distributions and their functional forms; further details can be found in most textbooks on probability distributions (DeGroot; Bernardo & Smith).

The following is a simple graphical reinterpretation of the Pitman-Koopman Theorem from statistics (Jeffreys; DeGroot). In part (a) of the figure below, T(x*, y*) is a statistic of fixed dimension, independent of the sample size N, corresponding to (n, p) in the coin-tossing example. The theorem says that the sample in part (a) can be summarized in statistics as shown in part (b) if and only if the probability distribution for x given y is in the exponential family. In this case T(x*, y*) is a sufficient statistic.


[Figure: The generalized graph for plate removal. In part (a), θ is the parent of x, with y given, inside a plate of size N; in part (b), the plate is replaced by the statistic T(x*, y*).]

Theorem (Recursive arc reversal). Consider the model M represented by the graphical model for a sample of size N given in part (a) of the figure. Let x be in the domain X and y in the domain Y, where both domains are independent of θ and both domains have components that are real valued or finite discrete. Let the conditional distribution for x given y be f(x | y, θ), which is positive for all x ∈ X. If first derivatives exist with respect to all real-valued components of x and y, then the plate removal operation applies for all samples x* = x_1, ..., x_N and y* = y_1, ..., y_N and all θ, as given in part (b) of the figure, for some sufficient statistics T(x*, y*) of dimension independent of N, if and only if the conditional distribution for x given y is in the exponential family given by the definition above. In this case T(x*, y*) is an invertible function of the k averages

(1/N) ∑_{j=1}^{N} t_i(x_j, y_j),   for i = 1, ..., k.

In some cases this extends to domains X and Y dependent on θ (Jeffreys).

Sufficient statistics for a distribution from the exponential family are easily read from the functional form by taking a logarithm. For instance, for the d-dimensional multivariate Gaussian with mean μ and covariance Σ, the sufficient statistics are x_i for i = 1, ..., d and x_i x_j for i, j = 1, ..., d, and the normalizing constant Z is given by

Z = (2π)^{d/2} det(Σ)^{1/2} exp( (1/2) μ^T Σ^{-1} μ ).

As for coin tossing, it generally holds that if a sample likelihood (here a binomial on heads_i) is in the exponential family, then the posterior distribution for the model parameters can also be cast in exponential family form. This is only useful when the normalizing constant (Beta(α_1 + p, α_2 + n) in the coin-tossing example) and its derivatives are readily computed.

Lemma (The conjugacy property). In the context of the theorem above, assume the distribution for x given y can be represented by the exponential family. Factor the normalizing constant Z(θ) into two components, Z(θ) = Z_1(θ) Z_2, where the second is the constant part independent of θ. Assume the prior on θ takes the form

p(θ | λ, M) = (f(θ) / Z'(λ)) exp( ∑_{i=1}^{k} λ_i w_i(θ) - λ_{k+1} log Z_1(θ) )

Learning with Graphical Models

for some k dimensional parameter where Z is the appropriate normalizing constant

and f is any function Then the posterior distribution for p j x x M is

N

also represented by Equation with the parameters

N

k

k

N

X

t x y i k

i j j i

i

j

When the function f is trivial for instance uniformly equal to then the distribution in

Equation is referred to as the conjugate distribution which means it has a mathematical

form mirroring that of the sample likeliho o d The prior parameters by lo oking at the

up date equations in the lemma can b e thought of as corresp onding to the sucient statistics

from some prior sample and given by

k

This property is useful for analytic and computational purposes. Once the posterior distribution is found, and assuming it is one of the standard distributions, the property can easily be established. A table in Appendix B gives some standard conjugate prior distributions for the distributions listed there, and a companion table gives their matching posteriors. More extensive summaries of this are given by DeGroot and by Bernardo and Smith. The parameters for these priors can be set using standard reference priors (Box & Tiao; Bernardo & Smith) or elicited from a domain expert.
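To make the conjugacy lemma concrete, here is a minimal Python sketch (my own illustration, not code from the paper) of the update \alpha'_i = \alpha_i + \sum_j t_i(x_j) together with \alpha'_{k+1} = \alpha_{k+1} + N, applied to the coin-tossing case where the conjugate prior is Beta-style pseudo-counts; the prior values below are hypothetical.

```python
from typing import Sequence

def conjugate_update(alpha: Sequence[float], suff_stats: Sequence[float], n: int):
    """Generic conjugate update for an exponential-family likelihood:
    alpha_i' = alpha_i + sum_j t_i(x_j) for i = 1..k, and the last
    component (a prior sample size) becomes alpha_{k+1} + N."""
    *alpha_k, alpha_size = alpha
    return [a + s for a, s in zip(alpha_k, suff_stats)] + [alpha_size + n]

# Coin tossing: the sufficient statistics are (#heads, #tails); the conjugate
# (Beta-style) prior is summarized by pseudo-counts plus a prior sample size.
data = [True, True, False, True, False]
suff = [float(sum(data)), float(len(data) - sum(data))]
prior = [1.0, 1.0, 2.0]        # hypothetical prior pseudo-counts and size
print(conjugate_update(prior, suff, len(data)))   # -> [4.0, 3.0, 7.0]
```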

There are several other important consequences of the Pitman-Koopman Theorem, or recursive arc reversal, that should not go unnoticed.

Comment  If x, y are discrete and finite valued, then the distribution p(x \mid y) can be represented as a member of the exponential family. This holds because a positive finite discrete distribution can always be represented as an extended case statement in the form

    p(x \mid y) = \exp\Big( \sum_{i \le k} 1_{t_i(x,y)}\, f_i \Big)

where the boolean functions t_i(x, y) are a set of mutually exclusive and exhaustive conditions, and the indicator function 1_A has the value 1 if the boolean A is true and 0 otherwise. The main importance of the exponential family is in continuous or integer domains. Of course, since a large class of functions \log p(x \mid y) can always be approximated arbitrarily well by a polynomial in x, y and \theta with sufficiently many terms, the exponential family covers a broad class of distributions.

The application of the exponential family to learning is perhaps the earliest published result on computational learning theory. The following two interpretations of the recursive arc reversal theorem are relevant mainly for distributions involving continuous variables.

Comment  An incremental learning algorithm with finite memory must compress the information it has seen so far in the training sample into a smaller set of statistics. This can only be done without sacrificing information in the sample (in a context where all probabilities are positive) if the hypothesis or search space of learning is a distribution from the exponential family.


Comment  The computational requirements for learning an exponential family distribution are guaranteed to be linear in the sample size: first compute the sufficient statistics, and then learning proceeds independently of the sample size. This could be exponential in the dimension of the feature space, however.

Furthermore, in the case where the functions w_i are full rank (the dimension of \theta is k, the same as w, and the Jacobian of w with respect to \theta is invertible, \det \frac{dw}{d\theta} \ne 0), various moments of the distribution can easily be found. For this situation the inverse function w^{-1}, when it exists, is called the link function (McCullagh & Nelder).

Lemma  Consider the notation of the definition above. If the link function w^{-1} for an exponential family distribution exists, then moments of functions of t_i(x, y) and \exp(t_i(x, y)) can be expressed in terms of derivatives and direct applications of the functions Z_\theta, t_i, w_i and w^{-1}. If the normalizing constant Z_\theta and the link function are in closed form, then so will be the moments.

Techniques for doing these symbolic calculations are given in Appendix B. Exponential family distributions then fall into two groups. There are those where the normalizing constant and link function are known, such as the Gaussian; one can efficiently compute their moments and determine the functional form of their conjugate distributions up to the normalizing constant. For others, such as a Markov random field used in image processing, this is not the case: moments can generally only be computed by an approximation process like Gibbs sampling, given in the section on approximate methods below.

Linear regression: an example

As an example, consider the problem of linear regression with Gaussian error described in the figure below. This is an instance of a generalized linear model, and has the linear construction

    m = \sum_{j=1}^{M} \theta_j\, basis_j(x_1, \ldots, x_n)

at its core.

[Figure: Linear regression with Gaussian error. Inputs x_1, \ldots, x_n feed M deterministic basis functions, which combine linearly with \theta to give the mean m of a Gaussian on y with standard deviation \sigma.]

The M basis functions are known deterministic functions of the input variables x_1, \ldots, x_n. These would typically be nonlinear orthogonal functions such as Legendre polynomials. They combine linearly with the parameters \theta to produce the mean m for the Gaussian.

The corresponding learning problem, represented with plates, is expressed in the figure below.

[Figure: The linear regression problem. As the previous figure, but with the data inside a plate of size N, a Gaussian prior on \theta and an inverse Gamma prior on \sigma.]

The joint probability for this model is as follows:

    p(\theta, \sigma, y^* \mid x^*) = p(\theta)\, p(\sigma)\, \frac{1}{(\sqrt{2\pi}\,\sigma)^N} \exp\!\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{N} \Big( \sum_{j=1}^{M} \theta_j\, basis_j(x_i) - y_i \Big)^{2} \right)

where x_i denotes the vector of values for the i-th datum, (x_{1,i}, \ldots, x_{n,i}). Linear regression with Gaussian error falls in the exponential family because a Gaussian is in the exponential family and the mean of the simple Gaussian is a linear function of the regression parameters \theta (see the lemma below).

For this case the correspondence to the exponential family is drawn as follows. Only the individual data likelihoods p(y \mid x_1, \ldots, x_n, \theta, \sigma) need be considered. Expand the probability to show it is a linear sum of data terms and parameter terms:

    p(y \mid x_1, \ldots, x_n, \theta, \sigma)
      = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{1}{2\sigma^2} \Big( y - \sum_{j=1}^{M} \theta_j\, basis_j(x) \Big)^{2} \right)
      = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{1}{2\sigma^2} \Big( y^2 - 2 \sum_{j=1}^{M} \theta_j\, basis_j(x)\, y + \sum_{j,k=1}^{M} \theta_j \theta_k\, basis_j(x)\, basis_k(x) \Big) \right) .

The data likelihood in this last line can be seen to be in the same form as the general exponential family, where the sufficient statistics are the various data terms in the exponential. Also, the link function does not exist because there are M + 1 parameters but of order M^2 sufficient statistics.

[Figure: The linear regression problem with the plate removed. The sufficient statistics S, q and y_{sq} feed \theta (Gaussian) and \sigma (inverse Gamma).]

The model of the previous figure can therefore be simplified to this graph, where q and S are the usual sample moments obtained from the so-called normal equations of linear regression. S is a matrix of dimension M (the number of basis functions) and q is a vector of dimension M:

    S_{j,k} = \frac{1}{N} \sum_{i=1}^{N} basis_j(x_i)\, basis_k(x_i) ,
    \qquad
    q_j = \frac{1}{N} \sum_{i=1}^{N} basis_j(x_i)\, y_i ,
    \qquad
    y_{sq} = \frac{1}{N} \sum_{i=1}^{N} y_i^2 .

These three sufficient statistics can be read directly from the data likelihood above.
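As an illustration (my own sketch, not code from the paper), the following Python snippet computes the three sufficient statistics S, q and y_{sq} for a chosen set of basis functions and then solves the normal equations S\theta = q for the least-squares estimate of \theta; with a zero-mean Gaussian prior on \theta one would instead solve a ridge-regularized version of the same system.

```python
import numpy as np

def design_matrix(x, basis_fns):
    """Evaluate each basis function on each input: B[i, j] = basis_j(x_i)."""
    return np.column_stack([f(x) for f in basis_fns])

def sufficient_stats(x, y, basis_fns):
    """S_{jk} = mean_i basis_j(x_i) basis_k(x_i), q_j = mean_i basis_j(x_i) y_i,
    y_sq = mean_i y_i^2 -- the three statistics that summarize the sample."""
    B = design_matrix(x, basis_fns)
    n = len(y)
    S = B.T @ B / n
    q = B.T @ y / n
    y_sq = float(y @ y) / n
    return S, q, y_sq

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=200)
    y = 1.0 + 2.0 * x - 0.5 * x**2 + 0.1 * rng.standard_normal(200)
    basis_fns = [lambda x: np.ones_like(x), lambda x: x, lambda x: x**2]
    S, q, y_sq = sufficient_stats(x, y, basis_fns)
    theta_hat = np.linalg.solve(S, q)        # normal equations: S theta = q
    print(theta_hat)                          # roughly [1.0, 2.0, -0.5]
```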

Consider the formula

    \int_{y} p(y \mid x_1, \ldots, x_n, \theta, \sigma)\, dy = 1 .

Differentiating with respect to \theta_1 shows that the expected value of y given x_1, \ldots, x_n and \theta is the mean, as expected:

    E_{y \mid x_1, \ldots, x_n, \theta, \sigma}\,[\,y\,] = m = \sum_{j=1}^{M} \theta_j\, basis_j(x) .

Differentiating with respect to \sigma shows that the expected squared error from the mean is, again as expected,

    E_{y \mid x_1, \ldots, x_n, \theta, \sigma}\,[\,(y - m)^2\,] = \sigma^2 .

Higher-order derivatives give formulae for higher-order moments such as skewness and kurtosis (Casella & Berger), which are functions of the second, third and fourth central moments. While these are well known for the Gaussian, the interesting point is that these formulae are constructed by differentiating the component functions in the likelihood, without recourse to integration. Finally, the conjugate distribution for the parameters in this linear regression problem is the multivariate Gaussian distribution for \theta when \sigma is known, and for \sigma is the inverted Gamma.


Recognizing and using the exponential family

How can recursive arc reversal be applied automatically to a graphical model? First, it needs to be identified when a graphical model, or some subset of a graphical model, falls in the exponential family. If each conditional distribution in a Bayesian network, or chain component in a chain graph, is exponential family, then the full joint is exponential family. The following lemma gives this, with some additional conditions for deterministic nodes. This applies to Bayesian networks using the earlier comment on discrete distributions.

Lemma  A chain graph has a single plate. Let the non-deterministic variables inside the plate be X and the deterministic variables be Y. Let the variables outside the plate be \theta. If:

1. All arcs crossing a plate boundary are directed into the plate.

2. For each chain component C, the conditional distribution p(C \mid parents(C)) is from the exponential family, with data variables from X \cup Y and model parameters from \theta; furthermore \log p(C \mid parents(C)) is a polynomial function of the variables in Y.

3. Each variable y \in Y can be expressed as a deterministic function of the form

       y = \sum_{i=1}^{l} u_i(X)\, v_i(\theta)

   for some functions u_i, v_i.

Then the conditional distribution p(X, Y \mid \theta) is from the exponential family.

Second, how can these results be used when the model does not fall in the exponential family? There are two categories of techniques available in this context. In both cases the algorithms concerned can be constructed from the graphical specifications. The two new classes, together with the recursive arc reversal case, are given in the figure below. In each case (I) denotes the graphical configuration and (II) denotes the operations and simplifications performed by the algorithm. When the various normalization constants are known in closed form, and appropriate moments and Bayes factors can be computed quickly, all three algorithm schemas have reasonable computational properties.

The first category is where a useful subset of the model does fall into the exponential family. This is represented by the partial exponential model in the figure. The part of the problem that is exponential family is simplified using recursive arc reversal, and the remaining part of the problem is typically handled approximately. Decision trees and Bayesian networks over multinomial or Gaussian variables also fall into this category. This happens because, when the structure of the tree or Bayesian network is given, the remaining problem is composed of a product of multinomials or Gaussians. This is the basis of various Bayesian algorithms developed for these problems (Buntine; Madigan & Raftery; Spiegelhalter, Dawid, Lauritzen & Cowell; Heckerman, Geiger & Chickering). Strictly speaking, decision trees and Bayesian networks over multinomial or Gaussian variables are in the exponential family (see the earlier comment on discrete distributions); however, it is more computationally convenient to treat them as partial exponential models. This category is discussed more in a later section.

[Figure: Three categories of algorithms using the exponential family: the exponential model, the partial exponential model and the mixture model. Row (I) gives the graphical configuration with data X in a plate of size N and parameters \Theta; row (II) gives the simplification, in which the plate is replaced by sufficient statistics ss(X*) (or ss(X*, T)) and the parameter distribution becomes conjugate.]

The second category is where, if some hidden variables are introduced into the data, the problem becomes exponential family once the hidden values are known. This is represented by the mixture model in the figure. Mixture models (Titterington, Smith & Makov; Poland) are used to model unsupervised learning and incomplete data in classification problems, among other things. Mixture models extend the exponential family to a rich class of distributions, so this second category is an important one in practice. General methods for handling these problems correspond to Gibbs sampling and other Markov chain Monte Carlo methods, discussed later, and their deterministic counterpart, the expectation maximization algorithm, also discussed later. As shown in the figure, these algorithms cycle back and forth between a process that re-estimates c given \Theta using first-order inference, and a process that uses the fast exponential family algorithms to re-estimate \Theta given c.

Other operations on graphical models

The recursive arc reversal theorem above characterizes when plates can be readily removed and the sample summarized in some statistics. Outside of these cases, more general classes of approximate algorithms exist. Several of these are introduced later, and more detail is given, for instance, by Tanner. These more general algorithms require a number of basic operations be performed on graphs:

Decomposition:  A learning problem can sometimes be decomposed into simpler subproblems, with each yielding to separate analysis. One form of decomposition of learning problems is considered below. Another, related form that applies to undirected graphs is developed by Dawid and Lauritzen. Other forms of decomposition can be done at the modeling level, where the initial model is constructed in a manner requiring fewer parameters, as in Heckerman's similarity networks.

Exact Bayes factors:  Model selection and averaging methods are used to deal with multiple models (Kass & Raftery; Buntine; Stewart; Madigan & Raftery). These require the computation of Bayes factors for models constructed during search. Exact methods for computing Bayes factors are considered below.

Derivatives:  Various approximation and search algorithms require derivatives be calculated, as discussed next.

Derivatives

An important operation on graphs is the calculation of derivatives with respect to parameters. This is useful, after conditioning on the known data, to do approximate inference. Numerical optimization using derivatives can be done to search for MAP values of parameters, or to apply the Laplace approximation to estimate moments. This section shows how to compute derivatives using operations local to each node. The computation is therefore easily parallelized, as is popular, for instance, in neural networks.

Suppose a graph is used to compile a function that searches for the MAP values of parameters in the graph conditioned on the known data. In general this requires use of numerical optimization methods (Gill, Murray & Wright). To use a gradient descent, conjugate gradient or Levenberg-Marquardt approach requires calculation of first derivatives; to use a Newton-Raphson approach requires calculation of second derivatives as well. While this could be done numerically by difference approximations, more accurate calculations exist. Methods for symbolically differentiating networks of functions and piecing together the results to produce global derivatives are well understood (Griewank & Corliss). For instance, software is available for taking a function defined in Fortran, C code or some other language and producing a second function that computes the exact derivative. These problems are also well understood for feedforward networks (Werbos, McAvoy & Su; Buntine & Weigend), and graphical models with plates only add some additional complexity. The basic results are reproduced in this section, and some simple examples given to highlight special characteristics arising from their use with chain graphs.

Consider the problem of learning a feedforward network. A simple feedforward network is given in Figure (a) below, and the corresponding learning problem in Figure (b), representing the feedforward network as a Bayesian network. Here the sigmoid units of the network are modeled with deterministic nodes, and the network output represents the mean of a bivariate Gaussian with inverse covariance matrix \Sigma^{-1}. Because of the nonlinear sigmoid function making the deterministic mapping from the inputs x_1, x_2, x_3 to the means m_1, m_2, this learning problem has no reasonable component falling in the exponential family. A rough fallback method is to calculate a MAP value for the weight parameters; this would be the method used for the Laplace approximation (Buntine & Weigend; MacKay), covered in Tanner and in Tierney and Kadane. The setting of priors for feedforward networks is difficult (MacKay; Nowlan & Hinton; Wolpert) and will not be considered here, other than assuming a prior p(w) is used.

[Figure: Learning a feedforward network. (a) A simple feedforward network with inputs x_1, x_2, x_3, sigmoid hidden units h_1, h_2, h_3, weights w_1, \ldots, w_5 and outputs o_1, o_2. (b) The corresponding learning problem as a Bayesian network with a plate of size N, in which the sigmoid units are deterministic nodes and the outputs are Gaussian with means m_1, m_2 and covariance \Sigma.]

The graph implies the following posterior probability:

    p(w_1, \ldots, w_5, \Sigma \mid o_{1,i}, o_{2,i}, x_{1,i}, x_{2,i}, x_{3,i},\ i = 1, \ldots, N)
      \;\propto\; \prod_{i=1}^{N} \frac{\sqrt{\det \Sigma^{-1}}}{2\pi}
          \exp\Big( -\tfrac{1}{2}\, (o_i - m_i)^\top \Sigma^{-1} (o_i - m_i) \Big)\; p(\Sigma)\, p(w_1, \ldots, w_5)

where

    m_i = Sigmoid(w^\top h_i), \qquad h_i = Sigmoid(w^\top x_i),

with each sigmoid unit using its own block of the weights w_1, \ldots, w_5.

The undirected clique on the parameters w indicates the prior has a term p(w). Suppose the posterior is differentiated with respect to the parameters w. The result is well known to the neural network community, since this kind of calculation yields the standard back-propagation equations.

Rather than work through this calculation, instead look at the general case. To develop the general formula for differentiating a graphical model, a few more concepts are needed. Deterministic nodes form islands of determinism within the uncertainty represented by the graph. Partial derivatives within each island can be calculated via recursive use of the chain rule, for instance by forward or backward propagation of derivatives through the equations. For instance, forward propagation for the above network gives

    \frac{\partial m_1}{\partial w_4} = \sum_{i} \frac{\partial h_i}{\partial w_4}\, \frac{\partial m_1}{\partial h_i} .

This is called forward propagation because the derivatives with respect to w_4 are propagated forward in the network. In contrast, backward propagation would propagate derivatives of m_1 with respect to different variables backwards. For each island of determinism the important variables are the output variables, and their derivatives are required. So for the feedforward network above, partial derivatives of m_1 and m_2 with respect to w are required.


Definition  The non-deterministic children of a node x, denoted ndchildren(x), are the set of non-deterministic variables y such that there exists a directed path from x to y given by x, y_1, \ldots, y_n, y with all intermediate variables y_1, \ldots, y_n being deterministic. The non-deterministic parents of a node x, denoted ndparents(x), are the set of non-deterministic variables y such that there exists a directed path from y to x given by y, y_1, \ldots, y_n, x with all intermediate variables y_1, \ldots, y_n being deterministic. The deterministic children of a node x, denoted detchildren(x), are the set of deterministic variables y that are children of x. The deterministic parents of a node x, denoted detparents(x), are the set of deterministic variables y that are parents of x.

For instance, in the feedforward network model above, the non-deterministic children of the weights w are o_1 and o_2. Deterministic nodes can be removed from a graph by rewriting the equations they represent into the remaining variables of the graph. Because some graphical operations do not apply to deterministic nodes, this removal is often done implicitly within a theorem. This goes as follows.

Lemma  A chain graph G with nodes X has deterministic nodes Y \subseteq X. The chain graph G' is created by adding to G a directed arc from every node to its non-deterministic children, and by deleting the deterministic nodes Y. The graphs G and G' are equivalent probability models on the nodes X - Y.

The general formula for differentiating Bayesian networks with plates and deterministic nodes is given below in the Differentiation Lemma. This is nothing more than the chain rule for differentiation, but it is important to notice the network structure of the computation. When partial derivatives are computed over networks, there are local and global partial derivatives, and they can be different. Consider the feedforward network of the figure again, and place on it an extra arc from w_4 to m_1. Now consider the partial derivative of m_1 with respect to w_4. The value of m_1 is influenced by w_4 directly, as the new arc shows, and indirectly via h_2. When computing a partial derivative involving indirect influences we need to differentiate between the direct and indirect effects. Various notations are used for this (Werbos et al.; Buntine & Weigend). Here the notation of a local versus a global derivative is used. The local partial derivative \partial_l is subscripted with an l and represents the partial derivative computed at the node using only the direct influences, the parents. For the example of the partial derivative of m_1 with respect to w_4, the various local partial derivatives combine to produce the global partial derivative:

    \frac{\partial m_1}{\partial w_4} = \frac{\partial_l m_1}{\partial_l w_4} + \frac{\partial_l h_2}{\partial_l w_4}\, \frac{\partial_l m_1}{\partial_l h_2} .

This is equivalent to

    \frac{\partial m_1}{\partial w_4} = \frac{\partial_l m_1}{\partial_l w_4} + \frac{\partial_l h_1}{\partial_l w_4}\, \frac{\partial_l m_1}{\partial_l h_1} + \frac{\partial_l h_2}{\partial_l w_4}\, \frac{\partial_l m_1}{\partial_l h_2}

since \frac{\partial_l h_1}{\partial_l w_4} = 0.

In general, the global partial derivative for an index variable \theta_i is the sum of: the local partial derivative at the node containing \theta_i; the partial derivatives for each child of \theta_i that is also a non-deterministic child; and combinations of global partial derivatives for deterministic children, found by backward or forward propagation of derivatives (a small numerical sketch of this propagation is given below).
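The following minimal Python sketch (my own illustration, with hypothetical unit names) shows this propagation numerically for a two-layer deterministic island: local derivatives are computed at each node from its direct parents only, and the global derivative of the output with respect to a weight is assembled by the chain rule, as in the forward-propagation equation above. A finite-difference check is included.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def forward_with_derivative(x, w_in, w_out, dw_index):
    """Forward pass through h = sigmoid(w_in . x), m = sigmoid(w_out * h),
    propagating d/dw_in[dw_index] forward alongside the values."""
    z_h = sum(wi * xi for wi, xi in zip(w_in, x))
    h = sigmoid(z_h)
    dh = h * (1.0 - h) * x[dw_index]          # local derivative dh/dw_in[dw_index]

    z_m = w_out * h
    m = sigmoid(z_m)
    dm_dh = m * (1.0 - m) * w_out             # local derivative dm/dh
    dm = dm_dh * dh                           # global derivative via the chain rule
    return m, dm

if __name__ == "__main__":
    x, w_in, w_out, k = [0.5, -1.0, 2.0], [0.1, 0.2, -0.3], 0.7, 1
    m, dm = forward_with_derivative(x, w_in, w_out, k)
    eps = 1e-6
    w_eps = list(w_in); w_eps[k] += eps
    m_eps, _ = forward_with_derivative(x, w_eps, w_out, k)
    print(dm, (m_eps - m) / eps)              # the two numbers should agree closely
```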

Lemma (Differentiation)  A model M is represented by a Bayesian network G with plates and deterministic nodes on variables X. Denote the known variables in X by K and the unknown variables by U = X - K. Let the conditional probability represented by the graph G be p(U \mid K, M). Let \theta be some unknown variable in the graph, and let 1_{nd} be 1 if \theta is non-deterministic and 0 otherwise. If \theta occurs inside a plate, then let i be some arbitrary valid index, i \in indval(\theta); otherwise let i be null. Then

    \frac{\partial \log p(U \mid K, M)}{\partial \theta_i}
      = 1_{nd}\, \frac{\partial_l \log p(\theta_i \mid parents(\theta_i))}{\partial_l \theta_i}
      + \sum_{x \in ndchildren(\theta_i) \cap children(\theta_i)} \frac{\partial_l \log p(x \mid parents(x))}{\partial_l \theta_i}
      + \sum_{x \in ndchildren(\theta_i)} \; \sum_{y \in detparents(x)} \frac{\partial y}{\partial \theta_i}\, \frac{\partial_l \log p(x \mid parents(x))}{\partial_l y} .

Furthermore, if Y \subseteq U is some subset of the unknown variables, then the partial derivative of the probability of Y given the known variables, p(Y \mid K, M), is an expected value of the above:

    \frac{\partial \log p(Y \mid K, M)}{\partial \theta_i}
      = E_{U - Y \mid Y, K, M}\left[ \frac{\partial \log p(U \mid K, M)}{\partial \theta_i} \right] .

The first equation contains only one global partial derivative, inside the double sum on the right side. This is the partial derivative \partial y / \partial \theta_i, and it can be computed from its local island of determinism using the chain rule of differentiation, for instance using forward propagation from \theta_i's deterministic children or backward propagation from the deterministic parents of \theta_i's non-deterministic children.

To apply the Differentiation Lemma to problems like feedforward networks or unsupervised learning, the lemma needs to be extended to chain graphs. This means differentiating Markov networks as well as Bayesian networks, and handling the expected value in the second equation above. These extensions are explained below, after first giving two examples.

As a first example, consider the feedforward network problem of the figure above. By treating the two output units in the feedforward network as a single variable (a Cartesian product o_1 \times o_2), the above Differentiation Lemma can be applied directly to the feedforward network model. This uses the simplification given earlier for chain components. Let P be the joint probability for the feedforward network model given above. The non-deterministic children of w_4 are the single chain component consisting of the two variables o_1 and o_2; its parents are the set \{m_1, m_2\}. Consider the Differentiation Lemma: there are no children of w_4 that are also non-deterministic, so the middle sum in the lemma is empty. Then the lemma yields, after expanding out the innermost sum,

    \frac{\partial \log P}{\partial w_4}
      = \frac{\partial_l \log p(w)}{\partial_l w_4}
      + \sum_{i=1}^{N} \left( \frac{\partial m_{1,i}}{\partial w_4}\, \frac{\partial_l \log p(o_{1,i}, o_{2,i} \mid m_{1,i}, m_{2,i}, \Sigma)}{\partial_l m_{1,i}}
      + \frac{\partial m_{2,i}}{\partial w_4}\, \frac{\partial_l \log p(o_{1,i}, o_{2,i} \mid m_{1,i}, m_{2,i}, \Sigma)}{\partial_l m_{2,i}} \right)

where p(o_1, o_2 \mid m_1, m_2, \Sigma) is the two-dimensional Gaussian, and \partial m_{1,i} / \partial w_4 comes from the global derivative but evaluates to a local derivative.

As a second example, reconsider the simple unsupervised learning problem given in the introduction to this part of the paper. The likelihood for a single datum given the model parameters is a marginal of the form

    p(var_1, var_2, var_3 \mid \phi, \theta) = \sum_{c} \phi_c\, p(var_1 \mid class = c, \theta)\, p(var_2 \mid class = c, \theta)\, p(var_3 \mid class = c, \theta) .

Taking the logarithm of the full case probability p(class, var_1, var_2, var_3 \mid \phi, \theta) reveals the vectors of components w and t of the exponential distribution:

    \log p(class, var_1, var_2, var_3 \mid \phi, \theta)
      = \sum_{c} \sum_{j} 1_{class=c \wedge var_j=true} \log \theta_{j|c}
      + \sum_{c} \sum_{j} 1_{class=c \wedge var_j=false} \log (1 - \theta_{j|c})
      + \sum_{c} 1_{class=c} \log \phi_c .

Notice that the normalizing constant Z_\theta is 1 in this case. Consider finding the partial derivative \partial \log p(var_1, var_2, var_3 \mid \phi, \theta) / \partial \theta_{2|1}. This is done for each case when differentiating the posterior or the likelihood of the unsupervised learning model. Applying the Differentiation Lemma to this yields

    \frac{\partial \log p(var_1, var_2, var_3 \mid \phi, \theta)}{\partial \theta_{2|1}}
      = E_{class \mid var_1, var_2, var_3}\left[ \sum_{d} 1_{class=d \wedge var_2=true}\, \frac{\partial \log \theta_{2|d}}{\partial \theta_{2|1}}
          + \sum_{d} 1_{class=d \wedge var_2=false}\, \frac{\partial \log (1 - \theta_{2|d})}{\partial \theta_{2|1}} \right]
      = \frac{p(class = 1 \mid var_1, var_2, var_3)\, 1_{var_2=true}}{\theta_{2|1}}
      - \frac{p(class = 1 \mid var_1, var_2, var_3)\, 1_{var_2=false}}{1 - \theta_{2|1}} .

Notice that the derivative is computed by doing first-order inference to find p(class \mid var_1, var_2, var_3), as noted by Russell, Binder and Koller. This property holds in general for exponential family models with missing or unknown variables: derivatives are calculated by some first-order inference followed by a combination with derivatives of the w functions. Consider the notation for the exponential family introduced previously in the definition, where the functional form is

    p(x \mid y, \theta, M) = \frac{h(x, y)}{Z_\theta} \exp\Big( \sum_{i=1}^{k} w_i(\theta)\, t_i(x, y) \Big) .


Consider the partial derivative of a marginal of this probability, p(x - u \mid y, \theta, M) for u \subseteq x. Using the Differentiation Lemma, the partial derivative becomes

    \frac{\partial \log p(x - u \mid y, \theta, M)}{\partial \theta}
      = \sum_{i=1}^{k} \frac{\partial w_i(\theta)}{\partial \theta}\, E_{u \mid x - u,\, y,\, \theta}\big[\, t_i(x, y) \,\big]
      - \frac{\partial \log Z_\theta}{\partial \theta} .

If the partition function is not known in closed form (the case with the Boltzmann machine), then the final derivative \partial \log Z_\theta / \partial \theta is approximated; the key formula for doing this is given in Appendix B.
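To illustrate the pattern of first-order inference followed by combination with local derivatives, the minimal Python sketch below (my own, not code from the paper) computes \partial \log p(var_1, var_2, var_3 \mid \phi, \theta) / \partial \theta_{j|c} for a small binary mixture by first computing the class posterior and then applying the expression derived above; a finite-difference check confirms the value.

```python
import copy, math

def joint(c, v, phi, theta):
    """Full-case probability p(class=c, var=v | phi, theta) for binary var_j."""
    p = phi[c]
    for j, vj in enumerate(v):
        p *= theta[j][c] if vj else (1.0 - theta[j][c])
    return p

def marginal(v, phi, theta):
    """Marginal p(var=v | phi, theta), summing the hidden class out."""
    return sum(joint(c, v, phi, theta) for c in range(len(phi)))

def dlog_marginal(v, phi, theta, j, c):
    """d log p(v)/d theta[j][c]: class posterior times the local log-derivative."""
    post = joint(c, v, phi, theta) / marginal(v, phi, theta)
    return post / theta[j][c] if v[j] else -post / (1.0 - theta[j][c])

if __name__ == "__main__":
    phi = [0.4, 0.6]
    theta = [[0.7, 0.2], [0.5, 0.9], [0.1, 0.6]]   # theta[j][c] = p(var_j=true | class=c)
    v = (True, False, True)
    analytic = dlog_marginal(v, phi, theta, j=1, c=0)
    eps, theta2 = 1e-6, copy.deepcopy(theta)
    theta2[1][0] += eps
    numeric = (math.log(marginal(v, phi, theta2)) - math.log(marginal(v, phi, theta))) / eps
    print(analytic, numeric)    # should agree to several decimal places
```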

To extend the Differentiation Lemma to chain graphs, use the trick illustrated with the feedforward network. First interpret the chain graph as a Bayesian network on chain components, as done earlier; then apply the Differentiation Lemma; finally, evaluate the necessary local partial derivatives with respect to \theta_i of each individual chain component. Since undirected graphs are not necessarily normalized, this may present a problem. In general there is an undirected graph G on variables X \cup Y. Following the earlier theorem, the general form is

    p(X \mid Y) = \frac{\prod_{C \in cliques(G)} f_C(C)}{\sum_{X} \prod_{C \in cliques(G)} f_C(C)} .

The local partial derivative with respect to x becomes

    \frac{\partial_l \log p(X \mid Y)}{\partial_l x}
      = \sum_{C \in cliques(G),\, x \in C} \frac{\partial_l \log f_C(C)}{\partial_l x}
      - E_{X \mid Y}\left[ \sum_{C \in cliques(G),\, x \in C} \frac{\partial_l \log f_C(C)}{\partial_l x} \right] .

The difficulty here is computing the expected value in the formula, which comes from the normalizing constant. Indeed, this computation forms the core of the early Boltzmann machine algorithm (Hertz et al.). In general this must be done using something like Gibbs sampling, and the techniques described later can be applied directly.

Decomposing learning problems

Learning problems can be decomposed into subproblems in some cases. While the material in this section applies generally to these sorts of decompositions, this section considers one simple example and then proves some general results on problem decomposition. Problem decompositions can also be recomputed on the fly to create a search through a space of models that takes advantage of decompositions that exist. A general result is also presented on incremental decomposition. These results are simple applications of known methods for testing independence (Frydenberg; Lauritzen et al.), with some added complication because of the use of plates.

Consider the simple learning problem given earlier over two multinomial variables var_1 and var_2 and two Gaussian variables x_1 and x_2. For this problem we have specified two alternative models, model M_1 and model M_2. Model M_2 has an additional arc going from the discrete variable var_2 to the real valued variable x_2. We will use this subsequently to discuss local search of these models evaluated by their evidence.

A manipulation of the conditional distribution for this model, making use of the lemma on deterministic nodes, yields for model M_1 the conditional distribution given in the figure below.

[Figure: A simplification of model M_1. The learning problem splits into separate plated subgraphs: var_1 with \theta_1; var_2 given var_1 with \theta_2; x_1 given var_1 with \mu_1, \sigma_1; and x_2 given x_1 and var_1 with \mu_2, \sigma_2.]

When parameters are a priori independent and their data likelihoods do not introduce cross terms between them, the parameters become a posteriori independent as well. This occurs for \theta_1, \theta_2, the set \{\mu_1, \sigma_1\} and the set \{\mu_2, \sigma_2\}. This model simplification also implies the evidence for model M_1 decomposes similarly. Denote the sample of the variable x_1 as x_1^* = (x_{1,1}, \ldots, x_{1,N}), and likewise for var_1 and var_2. In this case the result is

    evidence(M_1) = p(var_1^* \mid M_1)\; p(var_2^* \mid var_1^*, M_1)\; p(x_1^* \mid var_1^*, M_1)\; p(x_2^* \mid x_1^*, var_1^*, M_1) .

The evidence for model M_2 is similar, except that the posterior distribution of \mu_2 and \sigma_2 is replaced by the posterior distribution for the corresponding parameters with the additional parent var_2.

This result is general, and applies to Bayesian networks, undirected graphs and, more generally, to chain graphs. Similar results are covered by Dawid and Lauritzen for a family of models they call hyper-Markov. The general result described above is an application of the rules of independence applied to plates. This uses the notion of non-deterministic children and parents introduced earlier. It also requires a notion of local dependence, called the Markov blanket following Pearl, since it is a generalization of the equivalent set for Bayesian networks.

Definition  We have a chain graph G without plates. The Markov blanket of a node u is all neighbors, non-deterministic parents, non-deterministic children, and non-deterministic parents of the children's chain components:

    Markov\_blanket(u) = neighbors(u) \cup ndparents(u) \cup ndchildren(u) \cup ndparents(chaincomponents(ndchildren(u))) .

From Frydenberg it follows that u is independent of the other non-deterministic variables in the graph G given the Markov blanket.

To perform the simplification depicted in the figure, it is sufficient then to find the finest partitioning of the model parameters such that they are independent. The decomposition in the figure represents the finest such partition of model M_1. The evidence for the model will then factor according to the partition, as given for model M_1 in the equation above. For this task there is the following theorem, depicted graphically in the figure below.


Theorem (Decomposition)  A model M is represented by a chain graph G with plates. Let the variables in the graph be X. There are P possibly empty subsets of the variables X, X_i for i = 1, \ldots, P, such that unknown(X_i) is a partition of unknown(X). This induces a decomposition of the graph G into P subgraphs G_i where:

1. the graph G_i contains the nodes X_i and any arcs and plates occurring on these nodes, and

2. the potential functions for cliques in G_i are equivalent to those in G.

The induced decomposition represents the unique finest equivalent independence model to the original graph if and only if X_i, for i = 1, \ldots, P, is the finest collection of sets such that, when ignoring plates, for every unknown node u in X_i its Markov blanket is also in X_i. This finest decomposition takes O(|X|) to compute. Furthermore, the evidence for M now becomes a product over each subgraph:

    evidence(M) = p(known(X) \mid M) = \prod_{i} f_i(known(X_i))

for some functions f_i given in the proof.

The next figure shows how this decomposition works when there are unknown nodes: panel (a) shows the basic problem and panel (b) shows the finest decomposition. Notice the bottom component cannot be further decomposed because the variable x_2 is unknown.

[Figure: The incremental decomposition of a model. (a) A chain graph with plates over var_1, var_2, x_1, x_2, x_3 and parameters \theta_1, \ldots, \theta_5; (b) its finest decomposition into independent subgraphs.]

In some cases the functions f_i given in the Decomposition Theorem have a clean interpretation: they are equal to the evidence for the subgraphs. This result can be obtained from the following corollary.


Corollary (Local Evidence)  In the context of the Decomposition Theorem, suppose there exists a set of chain components \Gamma_j from the graph, ignoring plates, such that X_j = \Gamma_j \cup ndparents(\Gamma_j), where unknown(ndparents(\Gamma_j)) = \emptyset. Then

    f_j(known(X_j)) = p(known(\Gamma_j) \mid ndparents(\Gamma_j), M) .

If we denote the j-th subgraph by model M_j^S, then this term is the conditional evidence for model M_j^S given ndparents(\Gamma_j). Denote by M_j^S the maximal subgraph on known variables only, induced by the cliques as given in the proof of the Decomposition Theorem. If the condition of the corollary holds for M_j^S for j = 1, \ldots, P, then it follows that the evidence for the model M is equal to the product of the evidence for each subgraph:

    evidence(M) = \prod_{i=1}^{P} evidence(M_i^S) .

This holds in general if the original graph G is a Bayesian network, as used in learning Bayesian networks (Buntine; Cooper & Herskovits).

Corollary  The product equation above holds if the parent graph G is a Bayesian network with plates.

In general we might consider searching through a family of graphical models. To do this, local search (Johnson, Papadimitriou & Yannakakis) or numerical optimization can be used to find high posterior models, or Markov chain Monte Carlo methods to select a sample of representative models, as discussed later. To do this, how to represent a family of models must be shown. The figure below, for instance, is similar to the models discussed above, except that some arcs are hatched; this is used to indicate that these arcs are optional.

[Figure: A family of models (optional arcs hatched). The variables var_1, var_2 have parameters \theta_1, \theta_2, and x_1, x_2 are Gaussian with means \mu_1, \mu_2 and standard deviations \sigma_1, \sigma_2, inside a plate of size N; hatched arcs mark optional dependencies.]

To instantiate a hatched arc, it can either be removed or replaced with a full arc. This graphical model then represents many different models, one for each possible instantiation of the arcs. Prior probabilities for these models could be generated using a scheme such as in Buntine or Heckerman et al., where a prior probability is assigned by

a domain expert for different parts of the model (arcs and parameters), and the prior for a full model is found by multiplication. The family of models given by the figure includes those discussed above as instances. During search or sampling, an important property is the Bayes factor for the two models, BayesFactor(M_2, M_1), as described earlier. Because of the Decomposition Theorem and its corollary, the Bayes factor for M_2 versus M_1 can be found by looking at local Bayes factors. The difference between models M_1 and M_2 is the parent set for the variable x_2:

    BayesFactor(M_2, M_1) = \frac{p(x_2^* \mid x_1^*, var_1^*, var_2^*, M_2)}{p(x_2^* \mid x_1^*, var_1^*, M_1)} .

That is, the Bayes factor can be computed by only considering the component models for x_2 and its parameters.
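As an illustration of such a local Bayes factor computation (a sketch of my own, not the paper's code, using a conjugate Dirichlet-multinomial component in place of the Gaussian one so that the local evidence is available in closed form), the snippet below scores two candidate parent sets for a discrete child by their local marginal likelihoods and forms the Bayes factor from the ratio.

```python
from math import lgamma
from collections import defaultdict

def local_log_evidence(data, child, parents, child_arity, alpha=1.0):
    """log p(child column | parent columns, M) with independent Dirichlet(alpha)
    priors for each parent configuration (the standard closed-form local evidence)."""
    counts = defaultdict(lambda: [0] * child_arity)
    for row in data:
        key = tuple(row[p] for p in parents)
        counts[key][row[child]] += 1
    log_ev = 0.0
    for cfg_counts in counts.values():
        n = sum(cfg_counts)
        log_ev += lgamma(child_arity * alpha) - lgamma(child_arity * alpha + n)
        log_ev += sum(lgamma(alpha + c) - lgamma(alpha) for c in cfg_counts)
    return log_ev

if __name__ == "__main__":
    # rows are (var1, var2, x2) with x2 discretized to {0,1} for this toy example
    data = [(0, 0, 0), (0, 1, 1), (1, 1, 1), (1, 0, 0), (1, 1, 1), (0, 0, 0)]
    m2 = local_log_evidence(data, child=2, parents=(0, 1), child_arity=2)  # extra parent
    m1 = local_log_evidence(data, child=2, parents=(0,), child_arity=2)
    print("log Bayes factor (extra parent vs not):", m2 - m1)
```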

This incremental modification of evidence, Bayes factors and finest decompositions is also general and follows directly from the independence test. A similar property for undirected graphs is given in Dawid and Lauritzen. This is developed below for the case of directed arcs and non-deterministic variables. Handling deterministic variables will require repeated application of these results, because several non-deterministic variables may be affected when adding a single arc between deterministic variables.

Lemma (Incremental decomposition)  For a graph G in the context of the Decomposition Theorem, we have two non-deterministic variables U and V such that U is given. Consider adding or removing a directed arc from U to V. To update the finest decomposition of G, there is a unique subgraph containing the unknown variables in ndparents(chain-component(V)). To this subgraph, add or delete an arc from U to V, and add or delete U to the subgraph if required.

Shaded non-deterministic parents can be added at will to nodes in a graph, and the finest decomposition remains unchanged except for a few additional arcs. The use of hatched arcs in these contexts causes no additional trouble to the decomposition process. That is, the finest decomposition for a graph with plates and hatched directed arcs is formed as if the arcs were unhatched directed arcs. The evidence is adjusted during the search by adding the different parents as required.

Bayes factors for the exponential family

To make use of the decomposition results in a learning system, it is necessary to be able to generate Bayes factors or evidence for the component models. For models in the exponential family whose normalization constant is known in closed form, this turns out to be easy. If these exact computations are not available, various approximation methods can be used to compute the evidence or Bayes factors (Kass & Raftery); some are discussed later.

If the conjugate distribution for an exponential family model and its derivatives can be readily computed, then the Bayes factor for the model can be found in closed form. Along with the above decomposition methods, this result is an important basis of many fast Bayesian algorithms considering multiple models. It is used, explicitly or implicitly, in all Bayesian methods for learning decision trees, directed graphical models with discrete or Gaussian variables, and linear regression (Buntine; Spiegelhalter et al.).


For instance, if the normalizing constant Z_\alpha in the conjugacy lemma is known in closed form, then the Bayes factor can be readily computed.

Lemma  Consider the context of the conjugacy lemma. Then the model likelihood or evidence, given by evidence(M) = p(x_1, \ldots, x_N \mid y_1, \ldots, y_N, M), can be computed as

    evidence(M) = \frac{\prod_{j=1}^{N} p(x_j \mid y_j, \theta)\; p(\theta \mid \alpha)}{p(\theta \mid \alpha')}
                = \frac{Z_{\alpha'}}{Z_\alpha\, Z_2^{\,N}} .

For y given x Gaussian, this involves multiplying out the two sets of normalizing constants for the Gaussian and Gamma distributions. The evidence for some common exponential family distributions is given in a table in Appendix B.

For instance, consider the learning problem given in the figure. Assume that the variables var_1 and var_2 are both binary, 0 or 1, and that the parameters \theta_1 and \theta_2 are interpreted as follows:

    \theta_1 = p(var_1 = 1 \mid \theta_1) ,
    \qquad
    \theta_{2|1} = p(var_2 = 1 \mid var_1 = 1, \theta_2) ,
    \qquad
    \theta_{2|0} = p(var_2 = 1 \mid var_1 = 0, \theta_2) .

If we use Dirichlet priors for these parameters, as shown in the table in Appendix B, then the priors are

    \theta_1 \sim Dirichlet(\alpha_{1,0}, \alpha_{1,1}) ,
    \qquad
    \theta_{2|j} \sim Dirichlet(\alpha_{2,0|j}, \alpha_{2,1|j}) \quad \text{for } j = 0, 1,

where \theta_1 is a priori independent of \theta_{2|0} and \theta_{2|1}. The choice of priors for these distributions is discussed in Box and Tiao and in Bernardo and Smith. Denote the corresponding sufficient statistics as n_{1,j}, equal to the number of data where var_1 = j, and n_{2,i|j}, equal to the number of data where var_1 = j and var_2 = i. Then the first two terms of the evidence for model M_1, read directly from the table, can be written as

    p(var_1^* \mid M_1) = \frac{Beta(n_{1,0} + \alpha_{1,0},\; n_{1,1} + \alpha_{1,1})}{Beta(\alpha_{1,0}, \alpha_{1,1})} ,

    p(var_2^* \mid var_1^*, M_1) = \frac{Beta(n_{2,0|0} + \alpha_{2,0|0},\; n_{2,1|0} + \alpha_{2,1|0})}{Beta(\alpha_{2,0|0}, \alpha_{2,1|0})}
        \cdot \frac{Beta(n_{2,0|1} + \alpha_{2,0|1},\; n_{2,1|1} + \alpha_{2,1|1})}{Beta(\alpha_{2,0|1}, \alpha_{2,1|1})} .
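A small Python sketch of these closed-form evidence terms (my own illustration; the Dirichlet parameters below are hypothetical, uniform values) computes log p(var_1^* \mid M) and log p(var_2^* \mid var_1^*, M) directly from the counts using log-Beta functions.

```python
from math import lgamma

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_evidence_binary(counts, alpha=(1.0, 1.0)):
    """log [ Beta(n0 + a0, n1 + a1) / Beta(a0, a1) ] for a binary variable."""
    return log_beta(counts[0] + alpha[0], counts[1] + alpha[1]) - log_beta(*alpha)

if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 1), (1, 0), (1, 1), (0, 0)]      # (var1, var2) pairs
    n1 = [sum(1 for v1, _ in data if v1 == j) for j in (0, 1)]
    n2 = {j: [sum(1 for v1, v2 in data if v1 == j and v2 == i) for i in (0, 1)]
          for j in (0, 1)}
    log_ev = log_evidence_binary(n1)                              # p(var1* | M)
    log_ev += sum(log_evidence_binary(n2[j]) for j in (0, 1))     # p(var2* | var1*, M)
    print("log evidence of the discrete part:", log_ev)
```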

Assume the variables x_1 and x_2 are Gaussian, with means given by

    x_1:\quad \theta_{1|0} \text{ when } var_1 = 0, \qquad \theta_{1|1} \text{ when } var_1 = 1 ;
    x_2:\quad \theta_{2|0}^\top (1, x_1) \text{ when } var_1 = 0, \qquad \theta_{2|1}^\top (1, x_1) \text{ when } var_1 = 1 ;

and standard deviations \sigma_{1|j} and \sigma_{2|j} respectively. In this case we split the data set into two parts, those cases where var_1 = 0 and those where var_1 = 1; each gets its own parameters, sufficient statistics and contribution to the evidence. Conjugate priors from the table in Appendix B, using the Gaussian linear model, are indexed accordingly as

    \theta_{i|j} \sim Gaussian \quad \text{for } i = 1, 2 \text{ and } j = 0, 1 ,
    \qquad
    1/\sigma^2_{i|j} \sim Gamma \quad \text{for } i = 1, 2 \text{ and } j = 0, 1 ,

with hyperparameters indexed by i and j accordingly. Notice that \theta_{i|j} is one-dimensional when i = 1 and two-dimensional when i = 2. Suitable sufficient statistics for this situation are read from the table by looking at the data summaries used there. This can be simplified for x_1 because d = 1 and the covariate for the Gaussian is uniformly 1; thus the sufficient statistics for x_1 become the means and variances for the different values of var_1. Denote \bar{x}_{1|0} and \bar{x}_{1|1} as the sample means of x_1 when var_1 = 0, 1 respectively, and s_{1|0} and s_{1|1} their corresponding sample variances. This cannot be done for the second case, so we use the notation from the table, where the data summaries become S_{2|j}, m_{2|j} and s_{2|j} respectively; change the covariate vector y to (1, x_1) when making the calculations indicated here.

The sufficient statistics are, for each case of var_1 = j,

    S_{2|j} = \sum_{i=1}^{N} 1_{var_{1,i} = j}\; y_i\, y_i^\top ,
    \qquad
    m_{2|j} = \sum_{i=1}^{N} 1_{var_{1,i} = j}\; x_{2,i}\, y_i ,
    \qquad
    s_{2|j} = \sum_{i=1}^{N} 1_{var_{1,i} = j}\; x_{2,i}^{2} ,

where y_i is the covariate vector for the i-th datum in the table's notation.

The evidence for the last two terms can now be read from the tables in Appendix B. It becomes a product over the two cases j = 0, 1. For p(x_1^* \mid var_1^*, M_1), each factor is the standard Gaussian-Gamma evidence term evaluated at the count n_{1,j}, the sample mean \bar{x}_{1|j} and the sample variance s_{1|j}. For p(x_2^* \mid x_1^*, var_1^*, M_1), each factor is the corresponding evidence term for the Gaussian linear model, evaluated at n_{1,j}, S_{2|j}, m_{2|j} and s_{2|j}; it involves ratios of the square roots of the determinants of the prior and posterior scatter matrices, together with the matching ratios of Gamma normalizing constants. The final simplification of the model is given in the figure below.

[Figure: The full simplification of model M_1. The plates are removed and replaced by the sufficient statistics n_{1,j}, \bar{x}_{1|j}, s_{1|j}, S_{2|j}, m_{2|j} and s_{2|j}, which feed the Gaussian and inverse Gamma posteriors for the parameters of x_1 and x_2.]

Approximate methods on graphical models

Exact algorithms for learning of any reasonable size invariably involve the recursive arc reversal theorem presented earlier. Most learning methods, however, use approximate algorithms at some level. The most common uses of the exponential family within approximation algorithms were summed up in the figure of three algorithm categories. Various other methods for inference on plates can be applied, either at the model level or the parameter level: Gibbs sampling, described first below; other more general Markov chain Monte Carlo algorithms; EM style algorithms (Dempster, Laird & Rubin); and various closed form approximations such as the mean field approximation and the Laplace approximation (Berger; Azevedo-Filho & Shachter). This section summarizes the main families of these approximate methods.

Gibbs sampling

Gibbs sampling is the basic tool of simulation and can be applied to most probability distributions (Geman & Geman; Gilks et al.; Ripley), as long as the full joint has no zeros, that is, all variable instantiations are possible. It is a special case of the general Markov chain Monte Carlo methods for approximate inference (Ripley; Neal). Gibbs sampling can be applied to virtually any graphical model, whether there are plates, undirected or directed arcs, and whether the variables are real or discrete. Gibbs sampling does not apply to graphs with deterministic nodes, however, since these put zeroes in the full joint. This section describes Gibbs sampling without plates as a precursor to discussing Gibbs sampling with plates in the next subsection. On challenging problems other forms of Markov chain Monte Carlo sampling can and should be tried; the literature is extensive.

Gibbs sampling corresponds to a probabilistic version of gradient ascent, although their goals of averaging as opposed to maximizing are fundamentally different. Gradient ascent in real valued problems corresponds to simple methods from function optimization (Gill et al.), and in discrete problems corresponds to local repair or local search (Johnson et al.; Minton, Johnson, Philips & Laird; Selman, Levesque & Mitchell). Gibbs sampling varies gradient ascent by introducing a random component: the algorithm usually tries to ascend, but will sometimes descend as a strategy for exploring further around the search space. So the algorithm tends to wander around local maxima with occasional excursions to other regions of the space. Gibbs sampling is also the core algorithm of simulated annealing if temperature is held equal to one (van Laarhoven & Aarts).

To sample a set of variables X according to some nonzero distribution p(X), initialize X to some value and then repeatedly resample each variable x \in X according to its conditional probability p(x \mid X - \{x\}). For the simple medical problem given earlier, suppose the value of Symptoms is known and the remaining variables are to be sampled; then do as follows:

1. Initialize the remaining variables somehow.

2. Repeat the following for i = 1, 2, \ldots, and record the sample of (Age_i, Occ_i, Clim_i, Dis_i) at the end of each cycle:

  (a) Reassign Age by sampling it according to the conditional p(Age \mid Occ, Clim, Dis, Symp). That is, take the values of Occ, Clim, Dis, Symp as given and compute the resulting conditional distribution on Age; then sample Age according to that distribution.

  (b) Reassign Occ by sampling it according to the conditional p(Occ \mid Age, Clim, Dis, Symp).

  (c) Reassign Clim by sampling it according to the conditional p(Clim \mid Age, Occ, Dis, Symp).

  (d) Reassign Dis by sampling it according to the conditional p(Dis \mid Age, Clim, Occ, Symp).

This sequence of steps is depicted in the figure below, where the basic graph has been rearranged for each step to represent the dependencies that arise during the sampling process. This uses the arc reversal and conditioning operators introduced previously.

[Figure: Gibbs sampling on the medical example. Panels (a) to (d) show the rearranged graph over Age, Occupation, Climate, Disease and Symptoms for each of the four resampling steps.]

The effect of sampling is not immediate: (Age, Occ, Clim, Dis)_2 is conditionally dependent on (Age, Occ, Clim, Dis)_1, and in general so is (Age, Occ, Clim, Dis)_{i+1} on (Age, Occ, Clim, Dis)_i for any i. However, the effect of the sampling scheme is that in the long run, for large i, (Age, Occ, Clim, Dis)_i is approximately generated according to p(Age, Occ, Clim, Dis \mid Symp), independently of (Age, Occ, Clim, Dis)_1. In Gibbs sampling all the conditional sampling is done in accordance with the original distribution, and since this is a stationary process, in the long run the samples converge to the stationary distribution, or fixed-point, of the process. Methods for making subsequent samples independent are known as regenerative simulation (Ripley) and correspond to sending the temperature back to zero occasionally.

With this sample, different quantities, such as the probability that a patient will have a particular Age value and Clim = tropical given Symp, can be estimated. This is done by looking at the frequency of this event in the generated sample. The justification for this is the subject of Markov process theory (Cinlar). The following result, presented informally, applies.

Comment  Let x_1, x_2, \ldots, x_I be a sequence of discrete variables from a Gibbs sampler for the distribution p(x). Then the average of g(x_i) approaches the expected value with probability 1 as I approaches infinity:

    \frac{1}{I} \sum_{i=1}^{I} g(x_i) \;\longrightarrow\; E_{p(x)}\big[\, g(x) \,\big] .

Further, for a second function h(x_i), the ratio of two sample averages for g and h approaches their true ratio:

    \frac{\sum_{i=1}^{I} g(x_i)}{\sum_{i=1}^{I} h(x_i)} \;\longrightarrow\; \frac{E_{p(x)}\big[\, g(x) \,\big]}{E_{p(x)}\big[\, h(x) \,\big]} .

This is used to approximate conditional expected values. To complete this procedure it is necessary to know how many Gibbs samples to take, that is, how large to make I, and how to estimate the error in the estimate. Both these questions have no easy answer, but heuristic strategies exist (Ripley; Neal).

For Bayesian networks this scheme is easy in general, since the only requirement when sampling from p(x \mid X - \{x\}) is the conditional distribution for nodes connected to x, and the global probabilities do not need to be calculated. Notice, for instance, that in the figure above some sampling operations do not require all five variables. The general form for Bayesian networks goes as follows:

    p(x \mid X - \{x\}) = \frac{p(X)}{\sum_x p(X)}
      = \frac{p(x \mid parents(x)) \prod_{y:\, x \in parents(y)} p(y \mid parents(y))}{\sum_x p(x \mid parents(x)) \prod_{y:\, x \in parents(y)} p(y \mid parents(y))} .

Notice the product is over a subset of variables: only include conditional distributions for variables that have x as a parent. Thus the formula only involves examining the parents, children and children's parents of x, the so-called Markov blanket (Pearl). Also notice normalization is only required over the single dimension changed in the current cycle, done in the denominator. For x discrete, these conditional probabilities can be enumerated and direct sampling done for x.
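A minimal Python sketch of this local resampling step (my own illustration with a hypothetical three-node discrete network, not the medical example's actual tables) enumerates the conditional for one variable from its Markov blanket terms only, normalizes over that single dimension, and samples from it.

```python
import random

# Hypothetical binary network A -> B -> C with made-up conditional tables.
p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}   # key = (b, a)
p_c_given_b = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.4, (1, 1): 0.6}   # key = (c, b)

def resample_b(state, rng):
    """One Gibbs step for B: p(B | A, C) is proportional to
    p(B | A) * p(C | B) -- only the Markov blanket terms are needed."""
    weights = [p_b_given_a[(b, state["A"])] * p_c_given_b[(state["C"], b)]
               for b in (0, 1)]
    total = sum(weights)
    state["B"] = 0 if rng.random() < weights[0] / total else 1

if __name__ == "__main__":
    rng = random.Random(0)
    state = {"A": 1, "B": 0, "C": 1}      # C is treated as the known evidence
    counts = [0, 0]
    for sweep in range(5000):
        # resample the unknown variables; here only A and B (C is observed)
        wa = [p_a[a] * p_b_given_a[(state["B"], a)] for a in (0, 1)]
        state["A"] = 0 if rng.random() < wa[0] / sum(wa) else 1
        resample_b(state, rng)
        counts[state["B"]] += 1
    print("estimated p(B=1 | C=1):", counts[1] / sum(counts))
```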

The kind of simplification above for Bayesian networks also applies to undirected graphs and chain graphs, with or without plates. Here, modify the clique-product form:

    p(x \mid X - \{x\}) = \frac{\exp\Big( \sum_{C \in cliques(G),\, x \in C} f_C(C) \Big)}{\sum_x \exp\Big( \sum_{C \in cliques(G),\, x \in C} f_C(C) \Big)} .

In this formula, ignore all cliques not containing x, so again Gibbs sampling only computes with information local to the node. Also, the troublesome normalization constant does not have to be computed, because the probability is a ratio of functions and so it cancels out. As before, normalization is only required over the single dimension x.

Gibbs sampling on plates

Many learning problems can be represented as Bayesian networks. For instance, the simple unsupervised learning problem represented earlier is a Bayesian network once the plate is expanded out. It follows that Gibbs sampling is readily applied to learning as a general inference algorithm (Gilks et al.).

Consider a simplified example of this unsupervised learning problem. In this model, assume that each variable var_1 and var_2 belongs to a mixture of Gaussians of known variance equal to 1. This simple model is given in the figure below.

[Figure: Unsupervised learning in two dimensions. The class variable has mixture proportions \phi, and var_1, var_2 are Gaussian with class-dependent means \mu_1, \mu_2.]

For a given class, class = c, the variables var_1 and var_2 are distributed as Gaussian with means \mu_{1,c} and \mu_{2,c}. In the uniform unit-variance case the distribution for each sample is given by

    p(var_1, var_2 \mid \phi, \mu, M) = \sum_{c} \phi_c\, N_1(var_1 - \mu_{1,c})\, N_1(var_2 - \mu_{2,c})

where N_1 is the one-dimensional Gaussian probability density function with standard deviation 1. This model might seem trivial, but if the standard deviation were to vary as well, the model corresponds to a kernel density estimate, so it can approximate any other distribution arbitrarily well using a sufficient number of tiny Gaussians.

In this simplified Gaussian mixture model, the sequence of steps for Gibbs sampling goes as follows:

1. Initialize the variables \phi_c, \mu_{1,c}, \mu_{2,c} for each class c.

2. Repeat the following, and record the sample of \phi_c, \mu_{1,c}, \mu_{2,c} for each class c at the end of each cycle:

  (a) For i = 1, \ldots, N, reassign class_i according to the conditional p(class_i \mid \phi, \mu, var_{1,i}, var_{2,i}).

  (b) Reassign the vector \phi by sampling according to the conditional p(\phi \mid class_i,\ i = 1, \ldots, N).

  (c) Reassign the vectors \mu_1 and \mu_2 by sampling according to the conditional p(\mu_1, \mu_2 \mid var_{1,i}, var_{2,i}, class_i,\ i = 1, \ldots, N).

Figure (a) below illustrates Step 2(a) in the language of graphs, and Figure (b) illustrates Steps 2(b) and 2(c). Step 2(a) represents the standard sampling operation using inference on Bayesian networks without plates. Steps 2(b) and 2(c) are also easy to perform, because in this case the distributions are exponential family and the graph matches the conditions of the conjugacy lemma. Therefore each of the model parameters \phi, \mu_1, \mu_2 is a posteriori independent, and their distribution is known in closed form, with the sufficient statistics calculated in O(N) time.

[Figure: Gibbs sampling in the unsupervised learning problem. (a) Resampling class_i given \phi, \mu_1, \mu_2 and var_{1,i}, var_{2,i}; (b) resampling \phi, \mu_1, \mu_2 given the sampled classes.]
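The following Python sketch (my own, with assumed conjugate priors: a symmetric Dirichlet on \phi and broad Gaussian priors on the class means) implements one version of this cycle for the two-dimensional, unit-variance mixture: step (a) resamples each class label from its conditional, and steps (b) and (c) resample \phi and the means from their closed-form conditionals given the class assignments.

```python
import math, random

def gibbs_mixture(data, n_classes, n_sweeps, seed=0, prior_var=100.0):
    """Gibbs sampling for a 2-D mixture of unit-variance Gaussians.
    Assumed priors: phi ~ symmetric Dirichlet(1), mu_{j,c} ~ N(0, prior_var)."""
    rng = random.Random(seed)
    n, dim = len(data), 2
    z = [rng.randrange(n_classes) for _ in range(n)]           # class assignments
    phi = [1.0 / n_classes] * n_classes
    mu = [[rng.gauss(0, 1) for _ in range(n_classes)] for _ in range(dim)]

    def normal_pdf(x, m):
        return math.exp(-0.5 * (x - m) ** 2) / math.sqrt(2 * math.pi)

    for _ in range(n_sweeps):
        # (a) resample each class label from p(class_i | phi, mu, data_i)
        for i, (v1, v2) in enumerate(data):
            w = [phi[c] * normal_pdf(v1, mu[0][c]) * normal_pdf(v2, mu[1][c])
                 for c in range(n_classes)]
            r, acc = rng.random() * sum(w), 0.0
            for c, wc in enumerate(w):
                acc += wc
                if r <= acc:
                    z[i] = c
                    break
        # (b) resample phi from Dirichlet(1 + counts) via Gamma draws
        counts = [z.count(c) for c in range(n_classes)]
        g = [rng.gammavariate(1.0 + counts[c], 1.0) for c in range(n_classes)]
        phi = [gc / sum(g) for gc in g]
        # (c) resample each mean from its Gaussian conditional
        for j in range(dim):
            for c in range(n_classes):
                s = sum(data[i][j] for i in range(n) if z[i] == c)
                prec = counts[c] + 1.0 / prior_var
                mu[j][c] = rng.gauss(s / prec, 1.0 / math.sqrt(prec))
    return phi, mu, z

if __name__ == "__main__":
    rng = random.Random(1)
    data = [(rng.gauss(-2, 1), rng.gauss(-2, 1)) for _ in range(100)] + \
           [(rng.gauss(2, 1), rng.gauss(2, 1)) for _ in range(100)]
    phi, mu, _ = gibbs_mixture(data, n_classes=2, n_sweeps=200)
    print("phi:", phi)
    print("means:", [[round(m, 2) for m in row] for row in mu])
```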

One important caveat in the use of Gibbs sampling for learning is the problem of symmetry. In the above description there is nothing to distinguish class 1 from class 2. Initially the class centers for the above process will remain distinct; asymptotically, since there is nothing in the problem definition to distinguish between class 1 and class 2, they will appear indistinguishable. This problem is handled by symmetry breaking, for instance by forcing an ordering on the class means \mu_{1,c}.

Gibbs sampling applies whenever there are variables associated with the data that are not given. Hidden or latent variables are an example. Incomplete data or missing values (Quinlan), robust methods and modeling of outliers, and various density estimation and nonparametric methods all fall in this family of models (Titterington et al.). Gibbs sampling generalizes to virtually any graphical model with plates and unshaded nodes inside the plate; the sequence of sampling operations will be much the same as above. If the underlying distribution is exponential family, for instance, the conjugacy lemma applies after shading all nodes inside the plate, and each full cycle is guaranteed to be linear time in the sample size. The algorithm in the exponential family case is summed up in the figure below. Panel (I) shows the general learning problem, extending the mixture model; this same structure appears in the earlier categories figure, but in panel (I) the sufficient statistics T(x*, u*) are also shown. Panel (II) shows more of the algorithm, generalizing steps (a) and (b) of the previous figure; again the role of the sufficient statistics is shown. The sampling first computes the sufficient statistics, and then sampling proceeds from those.

Thomas, Spiegelhalter and Gilks (Gilks et al.) have taken advantage of this general applicability of sampling to create a compiler that converts a graphical representation of a data analysis problem with plates into a matching Gibbs sampling algorithm. This scheme applies to a broad variety of data analysis problems.

[Figure: Gibbs sampling with the exponential family. (I) The general learning problem with data x and hidden u inside a plate of size N, parameters \theta and sufficient statistics T(x*, u*); (II) the sampling cycle alternating between resampling u and resampling \theta from the conjugate form given T(x*, u*).]

A closed form approximation

What would happen to Gibbs sampling if the number of cases in the training sample, N, was large compared to the number of unknown parameters in the problem? A good way to think of this is as follows: if N is sufficiently large, then the samples of the model parameters \phi, \mu_1 and \mu_2 will tend to drift around their mean, because their posterior variance would be O(1/N). That is, after the i-th step the sample is (\phi, \mu_1, \mu_2)_i. The (i+1)-th sample would be conditionally dependent on these, but because N is large, the posterior variance of the (i+1)-th sample given the i-th sample would be small, so that

    (\phi, \mu_1, \mu_2)_{i+1} \;\approx\; E\big[ (\phi, \mu_1, \mu_2)_{i+1} \,\big|\, (\phi, \mu_1, \mu_2)_i \big] .

This approximation is used with Markov chain Monte Carlo methods by the mean field method from statistical physics, popular in neural networks (Hertz et al.). Rather than sampling a sequence of parameters (\phi, \mu_1, \mu_2)_i according to some scheme, use the deterministic update

    (\phi, \mu_1, \mu_2)_{i+1} \;=\; E\big[ (\phi, \mu_1, \mu_2)_{i+1} \,\big|\, (\phi, \mu_1, \mu_2)_i \big] ,

where the expected value is according to the sampling distribution. This instead generates a deterministic sequence (\phi, \mu_1, \mu_2)_i that under reasonable conditions converges to some maxima. This kind of approach leads naturally to the EM algorithm, which is discussed next.

The expectation maximization (EM) algorithm

The expectation maximization algorithm, widely known as the EM algorithm, corresponds to a deterministic version of Gibbs sampling used to search for the MAP estimate of model parameters (Dempster et al.). It is generally considered to be faster than gradient descent. Convergence is slow near a local maximum, so some implementations switch to conjugate gradient or other methods (Meilijson) when near a solution. The computation used to find the derivative is similar to the computation used for the EM algorithm, so this does not require a great deal of additional code. Also, the determinism means the EM algorithm no longer generates unbiased posterior estimates of model parameters; the intended gain is speed, not accuracy. The EM algorithm can generally be applied to exponential family models wherever Gibbs sampling can be applied. The correspondence between EM and Gibbs is shown below.

Consider again the simple unsupervised learning problem represented earlier. In this case the sequence of steps for the EM algorithm is similar to that for the Gibbs sampler. The EM algorithm works on the means or modes of unknown variables instead of sampling them. Rather than sampling the set of classes and thereby computing sufficient statistics so that a distribution for \phi and \theta can be found, a sequence of class means is generated and thereby used to compute expected sufficient statistics. Likewise, instead of sampling new parameters, \phi and \theta modes are then computed from the expected sufficient statistics.

Consider again the unsupervised learning problem of the figure. Suppose there are a number of classes and that the three variables var_1, var_2, var_3 are finite valued and discrete, modeled with a multinomial with probabilities conditional on the class value class_i. The sufficient statistics in this case are all counts: n_j is the number of cases where the class is j, and n_{v,k|j} is the number of cases where class = j and var_v = k:

    n_j = \sum_{i=1}^{N} 1_{class_i = j} ,
    \qquad
    n_{v,k|j} = \sum_{i=1}^{N} 1_{class_i = j \,\wedge\, var_{v,i} = k} .

The expected sufficient statistics, computed from the rules of probability for a given set of parameters \phi and \theta, are given by

    \bar{n}_j = \sum_{i=1}^{N} p(class_i = j \mid var_{1,i}, var_{2,i}, var_{3,i}, \phi, \theta) ,
    \qquad
    \bar{n}_{v,k|j} = \sum_{i=1}^{N} p(class_i = j \mid var_{1,i}, var_{2,i}, var_{3,i}, \phi, \theta)\; 1_{var_{v,i} = k} .

Thanks to the conjugacy lemma, these kinds of expected sufficient statistics can be computed for most exponential family distributions. Once sufficient statistics are computed, posterior means or modes of the model parameters (in this case \phi and \theta) can be found for any of the distributions:

1. Initialize the parameters \phi and \theta.

2. Repeat the following until some convergence criterion is met:

  (a) Compute the expected sufficient statistics \bar{n}_j and \bar{n}_{v,k|j}.

  (b) Recompute \phi and \theta to be equal to their mode conditioned on the sufficient statistics. For many posterior distributions these can be found in standard tables, and in most cases found via the conjugacy lemma. For instance, using the mean for \phi gives

        \phi_j = \frac{\bar{n}_j + \alpha_j}{\sum_{j'} \big( \bar{n}_{j'} + \alpha_{j'} \big)} .
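A compact Python sketch of this EM cycle for the discrete mixture (my own illustration; the Dirichlet-style smoothing constant and the symmetry-breaking initialization are assumptions, not taken from the paper) computes the expected counts in the E-step by first-order inference and re-estimates \phi and \theta from them in the M-step.

```python
from math import prod

def em_discrete_mixture(data, n_classes, n_values, n_iters=50, alpha=1.0):
    """EM for a mixture of discrete variables (class is hidden).
    data: list of equal-length tuples of integers in range(n_values).
    Returns phi[c] and theta[v][k][c] = p(var_v = k | class = c)."""
    n_vars = len(data[0])
    phi = [1.0 / n_classes] * n_classes
    theta = [[[1.0 / n_values] * n_classes for _ in range(n_values)]
             for _ in range(n_vars)]
    # break the symmetry between classes by perturbing one class slightly
    theta[0][0][0], theta[0][1][0] = 0.6, 0.4     # assumes n_values == 2

    for _ in range(n_iters):
        # E-step: expected sufficient statistics from the class posteriors
        exp_nc = [alpha] * n_classes
        exp_nvkc = [[[alpha] * n_classes for _ in range(n_values)]
                    for _ in range(n_vars)]
        for row in data:
            w = [phi[c] * prod(theta[v][row[v]][c] for v in range(n_vars))
                 for c in range(n_classes)]
            total = sum(w)
            for c in range(n_classes):
                p = w[c] / total
                exp_nc[c] += p
                for v in range(n_vars):
                    exp_nvkc[v][row[v]][c] += p
        # M-step: re-estimate parameters from the expected counts
        phi = [exp_nc[c] / sum(exp_nc) for c in range(n_classes)]
        for v in range(n_vars):
            for c in range(n_classes):
                col_total = sum(exp_nvkc[v][k][c] for k in range(n_values))
                for k in range(n_values):
                    theta[v][k][c] = exp_nvkc[v][k][c] / col_total
    return phi, theta

if __name__ == "__main__":
    data = [(0, 0, 1), (0, 0, 0), (0, 1, 0), (1, 1, 1), (1, 0, 1), (1, 1, 1)]
    phi, theta = em_discrete_mixture(data, n_classes=2, n_values=2)
    print("phi:", [round(p, 3) for p in phi])
```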


All of the other Gibbs sampling algorithms discussed above can be similarly placed in this EM framework. When the mode is used in Step 2(b), and ignoring numerical problems, the EM algorithm converges on a local maximum of the posterior distribution for the parameters (Dempster et al.). The general method is summarized in the following comment (Dempster et al.).

Comment  The conditions of the conjugacy lemma apply, with data variables X inside the plate and model parameters \theta outside. In addition, some of the variables U \subseteq X are latent, so they are unknown and unshaded. Some of the remaining variables are sometimes missing, so for the i-th datum the variables V_i \subseteq X - U are not given. This means the data given for the i-th datum is X_i - U - V_i, for i = 1, \ldots, N. The EM algorithm goes as follows.

E-step:  The contribution to the expected sufficient statistics for each datum is

    ET_i = E_{U, V_i \mid X_i - U - V_i,\, \theta}\big[\, t(X_i) \,\big] .

The expected sufficient statistic is then ET = \sum_{i=1}^{N} ET_i.

M-step:  Maximize the conjugate posterior using the expected sufficient statistics ET in place of the sufficient statistics, using the MAP approach for this distribution.

The fixed point of this algorithm is a local maximum of the posterior for \theta.

Figure II for Gibbs sampling in Section illustrates the steps in EM nicely. Figure II(a) corresponds to the E-step: the expected sufficient statistics are found from the parameters; rather than sampling u and thereby computing the sufficient statistics, the expected sufficient statistics are computed. Figure II(b) corresponds to the M-step: here, given the sufficient statistics, the mean or mode of the parameters is computed instead of being sampled. EM is therefore Gibbs with a mean/mode approximation done at the two major sampling steps of the algorithm.
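To make the substitution explicit, here is a hedged sketch of one sweep of each scheme for a single discrete variable with a symmetric Dirichlet prior (Python/NumPy; the names and the prior are illustrative assumptions, not the paper's notation). Gibbs samples at both steps; EM takes an expectation and then a posterior mean or mode.

```python
import numpy as np

rng = np.random.default_rng(1)

def one_gibbs_sweep(x, phi, theta, alpha=1.0):
    """x: (N,) ints in 0..K-1; phi: (C,) class proportions; theta: (C, K) conditionals."""
    # (a) sample the latent classes given the parameters
    r = phi[None, :] * theta[:, x].T
    r /= r.sum(axis=1, keepdims=True)
    z = np.array([rng.choice(len(phi), p=p) for p in r])
    # (b) sample new parameters from their Dirichlet posteriors
    n_j = np.bincount(z, minlength=len(phi))
    phi = rng.dirichlet(n_j + alpha)
    theta = np.vstack([rng.dirichlet(np.bincount(x[z == j], minlength=theta.shape[1]) + alpha)
                       for j in range(len(phi))])
    return phi, theta

def one_em_sweep(x, phi, theta, alpha=1.0):
    # (a) expected counts instead of sampled counts
    r = phi[None, :] * theta[:, x].T
    r /= r.sum(axis=1, keepdims=True)
    n_j = r.sum(axis=0)
    n_kj = np.vstack([r[x == k].sum(axis=0) for k in range(theta.shape[1])]).T
    # (b) posterior mean (or mode) instead of a posterior sample
    phi = (n_j + alpha) / (n_j + alpha).sum()
    theta = (n_kj + alpha) / (n_kj + alpha).sum(axis=1, keepdims=True)
    return phi, theta
```

The two functions differ only in whether each of the two steps draws a sample or substitutes an expectation, which is the correspondence described above.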

In some cases the expected sufficient statistics can be computed in closed form. Assume the exponential family distribution for p(X | θ) has a known normalization constant Z and the link function w^{-1} exists. For some i, the normalizing constant for the exponential family distribution p(U, V_i | X_i − U − V_i, θ) is known in closed form; denote it by Z_i. Then, using the notation of Theorem ,

$$
ET_i \;=\;
\begin{cases}
\;t(X_i) & \text{if } U \cup V_i = \emptyset, \\[6pt]
\;\left(\dfrac{dw}{d\theta}\right)^{-1} \dfrac{d \log Z_i}{d\theta} & \text{otherwise.}
\end{cases}
$$
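As a small check of this identity (a sketch under assumed notation, not an example from the paper), consider a single K-valued discrete variable that is entirely missing for datum i. Its sufficient statistics are the indicator counts, the local normalizer Z_i is a sum of exponentiated natural parameters, and because the model is parameterized directly by its natural parameters the Jacobian dw/dθ is the identity, so differentiating log Z_i reproduces the posterior probabilities that a direct E-step would compute.

```python
import numpy as np

# Natural parameters of a K-valued discrete variable: p(x = k) = exp(w_k) / Z,
# with Z = sum_k exp(w_k) and sufficient statistics t_k(x) = 1_{x = k}.
w = np.array([0.3, -1.2, 0.8, 0.1])

# Expected sufficient statistics for a fully missing datum, computed directly ...
p = np.exp(w) / np.exp(w).sum()

# ... and via the derivative of the log normalizer, d log Z / d w_k (finite differences).
eps = 1e-6
log_Z = lambda w: np.log(np.exp(w).sum())
grad = np.array([(log_Z(w + eps * e) - log_Z(w - eps * e)) / (2 * eps)
                 for e in np.eye(len(w))])

print(np.allclose(p, grad))   # True: E[t(x)] equals d log Z / dw
```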

Partial exponential models

In some cases only an initial, inner part of a learning problem can be handled using the recursive arc reversal theorem of Section . In this case, simplify what can be simplified, and then solve the remainder of the problem using a generic method like the MAP approximation. This section presents several examples: linear regression with heterogeneous variance, feedforward networks with a linear output layer, and Bayesian networks.

This general process was depicted in the graphical model for the partial exponential family of Figure . This is an abstraction used to represent the general process. Consider the problem of learning a Bayesian network, both structure and parameters, where the distribution is exponential family given the network structure. The variable T is a discrete variable indicating which graphical structure is chosen for the Bayesian network. The variable X represents the full set of variables given for the problem. The variable θ_T represents the distributional parameters for the Bayesian network and is the part of the model that is conveniently exponential family. That is, p(X | θ_T, T) will be treated as exponential family for different θ_T, holding T fixed. The sufficient statistics in this case are given by ss(X, T). In this case the subproblem that conveniently falls in the exponential family, p(θ_T | X, T), is simplified, but it is necessary to resort to the more general learning techniques of previous sections to solve the remaining part of the problem, p(T | X).

Linear regression with heterogeneous variance

Consider the heterogeneous variance problem given in Figure . This shows a graphical model for the linear regression problem of Section , modified to the situation where the standard deviation is heterogeneous, so it is a function of the inputs x as well. In this case the standard deviation s is not given but is computed via

$$
s \;=\; \exp\!\left(\sum_{i=1}^{m} weights\text{-}\sigma_i \; basis_i(x)\right).
$$

The exponential transformation guarantees that the standard deviation s will also be positive.

Figure : Linear regression with heterogeneous variance.

The corresponding learning model can be simplified to the graph in Figure . Compare this with the model given in Figure ; what is the difference? In this case the sufficient statistics exist, but they are shown to be deterministically dependent on the sample, and ultimately on the unknown parameters for the standard deviation, weights-σ. If the parameters for the standard deviation were known, then the graph could be reduced to Figure . Computationally this is an important gain. It says that for a given set of values for weights-σ, a calculation can be done that is linear time in the sample size to arrive at a characterization of what the parameters for the mean, weights-µ, should be. In short, one half of the problem, p(weights-µ | weights-σ, {(y_i, x_i) : i = 1, …, N}), is well understood. To do a search for the MAP solution, search the space of parameters for the standard deviation, weights-σ, since the remaining weights-µ is then given.

Figure : The heterogeneous variance problem with the plate simplified.
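The following is a minimal sketch of that two-level search in Python with NumPy and SciPy. The quadratic penalty, the ridge constant, and the use of the conditional mode for weights-µ inside the outer objective are illustrative assumptions, not the paper's exact objective; the point is that each evaluation of the outer objective costs one weighted least-squares pass over the data.

```python
import numpy as np
from scipy.optimize import minimize

def fit_hetero_regression(B, y, ridge=1e-3):
    """B: (N, M) matrix of basis values basis_i(x_j); y: (N,) targets."""
    N, M = B.shape

    def mu_given_sigma(w_sigma):
        s = np.exp(B @ w_sigma)                       # per-case standard deviations
        prec = 1.0 / s**2                             # precisions used as case weights
        A = (B * prec[:, None]).T @ B + ridge * np.eye(M)
        b = (B * prec[:, None]).T @ y
        return np.linalg.solve(A, b), s               # closed-form conditional weights-mu

    def neg_log_post(w_sigma):
        w_mu, s = mu_given_sigma(w_sigma)
        resid = y - B @ w_mu
        # Gaussian log likelihood with case-specific s, plus small quadratic penalties.
        return (np.sum(np.log(s) + 0.5 * (resid / s) ** 2)
                + 0.5 * ridge * (w_mu @ w_mu + w_sigma @ w_sigma))

    res = minimize(neg_log_post, x0=np.zeros(M), method="BFGS")
    w_mu, _ = mu_given_sigma(res.x)
    return w_mu, res.x                                # weights-mu, weights-sigma
```

Only weights-σ is exposed to the generic optimizer; weights-µ is recovered in closed form whenever it is needed.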

Another variation of linear regression replaces the Gaussian error function with a more robust error function, such as Student's t distribution or an L_q norm (for appropriate q). By introducing a convolution, these robust regression models can be handled by combining the EM algorithm with standard least squares (Lange & Sinsheimer).
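As a hedged illustration of that idea (a sketch, not the paper's or Lange and Sinsheimer's exact algorithm): writing the Student-t error as a Gaussian whose precision is scaled by a latent Gamma variable, the E-step computes an expected scale for each case and the M-step is an ordinary weighted least-squares fit.

```python
import numpy as np

def t_regression_em(X, y, nu=4.0, n_iters=50, ridge=1e-6):
    """EM for linear regression with Student-t errors of fixed degrees of freedom nu."""
    N, M = X.shape
    w = np.linalg.lstsq(X, y, rcond=None)[0]
    sigma2 = np.var(y - X @ w) + 1e-12
    for _ in range(n_iters):
        resid = y - X @ w
        # E-step: expected latent precision scale for each case under the t model.
        u = (nu + 1.0) / (nu + resid ** 2 / sigma2)
        # M-step: weighted least squares and a weighted variance update.
        A = (X * u[:, None]).T @ X + ridge * np.eye(M)
        w = np.linalg.solve(A, (X * u[:, None]).T @ y)
        sigma2 = np.sum(u * (y - X @ w) ** 2) / N
    return w, sigma2
```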

Feedforward networks with a linear output layer

A similar example is the standard feedforward network where the final output layer is linear. This situation is given by Figure if we change the deterministic functions for m1 and m2 to be linear instead of sigmoidal. In this case, Lemma identifies that when the weight vectors w3, w4 and w5 are assumed given, the distribution is in the exponential family. Thus the simplification to Figure is possible, using the standard sufficient statistics for multivariate linear regression. Algorithmically, this implies that given values for the internal weight vectors w3, w4 and w5, and assuming a conjugate prior holds for the output weight vectors w1 and w2, the posterior distribution for the output weight vectors w1 and w2, and their means and variances, can be found in closed form. The evidence for w1 and w2 given w3, w4 and w5, p(y | x1, …, xn, w3, w4, w5, M), can also be computed using the exact method of Lemma , and so therefore can the posterior for w3, w4 and w5,

$$
p(w_3, w_4, w_5 \mid y, x_1, \ldots, x_n, M) \;\propto\; p(w_3, w_4, w_5 \mid M)\; p(y \mid x_1, \ldots, x_n, w_3, w_4, w_5, M),
$$

which can be computed in closed form up to a constant. This effectively cuts the problem into two pieces, w3, w4 and w5, followed by w1 and w2 given w3, w4 and w5, and provides a clean solution to the second piece.
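A minimal sketch of the resulting computation (Python/NumPy; the Gaussian prior precision and the noise variance are illustrative assumptions): holding the hidden-layer weights fixed, the hidden activations become fixed regressors, so the output-layer weights get an exact Gaussian posterior, and the conditional evidence is available in closed form, ready to be handed to any generic search or sampler over the hidden weights.

```python
import numpy as np

def output_layer_posterior(X, y, W_hidden, prior_prec=1.0, noise_var=1.0):
    """Exact posterior for linear output weights given fixed hidden-layer weights.

    X: (N, d) inputs; y: (N,) targets; W_hidden: (d, H) hidden weight matrix.
    Returns the posterior mean and covariance of the output weights, and the
    log evidence log p(y | X, W_hidden) with the output weights integrated out.
    """
    H = 1.0 / (1.0 + np.exp(-X @ W_hidden))          # fixed sigmoidal features (N, H)
    A = prior_prec * np.eye(H.shape[1]) + H.T @ H / noise_var
    cov = np.linalg.inv(A)
    mean = cov @ H.T @ y / noise_var

    # Gaussian marginal likelihood of y with the output weights integrated out.
    N = len(y)
    S = noise_var * np.eye(N) + H @ H.T / prior_prec
    sign, logdet = np.linalg.slogdet(S)
    log_evidence = -0.5 * (N * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(S, y))
    return mean, cov, log_evidence
```

The returned log evidence plays the role of p(y | x1, …, xn, w3, w4, w5, M) above, up to the assumed prior and noise settings.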

Figure : Simplified learning of a feedforward network with linear output.

Bayesian networks with missing variables

Class probability trees and discrete Bayesian networks can be learned efficiently by noticing that their basic form is exponential family (Buntine; Cooper & Herskovits; Spiegelhalter et al.). Take, for instance, the family of models specified by the Bayesian network given in Figure . In this case the local evidence corollary, Corollary , applies. The evidence for Bayesian networks generated from this graph is therefore a product over the nodes in the Bayesian network. If we change a Bayesian network by adding or removing an arc, the Bayes factor is therefore simply the local Bayes factor for the node, as mentioned in the incremental decomposition lemma, Lemma . Local search is then quite fast, and Gibbs sampling over the space of Bayesian networks is possible. A similar situation exists with trees (Buntine). The same results apply to any Bayesian network with exponential family distributions at each node, such as Gaussian or Poisson; results for Gaussians are presented, for instance, by Geiger and Heckerman.

This local search approach is a MAP approach because it searches for the network structure maximizing posterior probability. More accurate approximation can be done by generating a Markov chain of Bayesian networks from the search space of Bayesian networks. Because the Bayes factors are readily computed in this case, Gibbs sampling or Markov chain Monte Carlo schemes can be used. The scheme given below is the Metropolis algorithm (Ripley). This only looks at single neighbors until a successor is found, and is done by repeating the following steps:

1. For the initial Bayesian network G, randomly select a neighboring Bayesian network G′ differing only by an arc.

2. Compute BayesFactor(G′, G) by making the decompositions described in Theorem , doing a local computation as described in Lemma , and using the Bayes factors computed with Lemma .

3. Accept the new Bayesian network G′ with probability given by

$$
\min\!\left(1,\; BayesFactor(G', G)\, \frac{p(G')}{p(G)}\right).
$$

4. If accepted, assign G′ to G; otherwise G remains unchanged.

A local maximum Bayesian network could be found concurrently; however, this scheme generates a set of Bayesian networks appropriate for model averaging and expert evaluation of the space of potential Bayesian networks. Of course, initialization might search for local maxima to use as a reference. This sampling scheme was illustrated in the context of averaging in Figure .

This scheme is readily adapted to learn the structure and parameters of a Bayesian network with missing or latent variables. For the Metropolis algorithm, add a step which resamples the missing and latent variables (a sketch of the overall scheme is given after this step):

For the current complete data and Bayesian network G, compute the predictive distribution for the missing data or latent variables. Use this to resample the missing data or latent variables, to construct a new set of complete data for subsequent use in computing Bayes factors.
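Below is a hedged sketch of the basic structure sampler for fully observed discrete data (Python with NumPy and SciPy). The Dirichlet-multinomial local score, the symmetric prior count alpha, the uniform prior over structures, and the single-arc neighborhood are illustrative assumptions; the point it illustrates is that only the local score of the node whose parent set changed enters the Bayes factor.

```python
import numpy as np
from scipy.special import gammaln

def local_log_score(data, child, parents, arities, alpha=1.0):
    """Dirichlet-multinomial log marginal likelihood of one node given its parents."""
    K = arities[child]
    counts = {}
    for row in data:
        key = tuple(int(row[p]) for p in parents)
        if key not in counts:
            counts[key] = np.zeros(K)
        counts[key][int(row[child])] += 1
    score = 0.0
    for c in counts.values():
        score += (gammaln(K * alpha) - gammaln(K * alpha + c.sum())
                  + np.sum(gammaln(alpha + c) - gammaln(alpha)))
    return score

def metropolis_step(data, graph, arities, rng, alpha=1.0):
    """graph: dict mapping column index -> set of parent column indices."""
    nodes = list(graph)
    child, parent = rng.choice(nodes, size=2, replace=False)
    proposal = {n: set(ps) for n, ps in graph.items()}
    if parent in proposal[child]:
        proposal[child].discard(parent)
    else:
        proposal[child].add(parent)   # a full implementation would reject cycles here
    # Only the changed node's local score enters the Bayes factor.
    log_bf = (local_log_score(data, child, sorted(proposal[child]), arities, alpha)
              - local_log_score(data, child, sorted(graph[child]), arities, alpha))
    if np.log(rng.random()) < min(0.0, log_bf):   # uniform structure prior assumed
        return proposal
    return graph
```

Handling missing data then amounts to filling in the incomplete entries from the current network's predictive distribution before the counts are taken, exactly as in the added resampling step.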

Conclusion

The marriage of learning and graphical models presented here provides a framework for understanding learning. It also provides a framework for developing a learning or data analysis toolkit, or, more ambitiously, a software generator for learning algorithms. Such a toolkit combines two important components: a language for representing a learning problem, together with techniques for generating a matching algorithm. While a working toolbox is not demonstrated, a blueprint is provided to show how it could be constructed, and the construction of some well-known learning algorithms has been demonstrated. Table lists some standard problems, the derivation of algorithms using the operations from the previous chapters, and where in the text they are considered. The notion of a learning toolkit is not new: it can be seen in the BUGS system by Thomas, Spiegelhalter and Gilks (Gilks et al.), in the work of Cohen on inductive logic programming, and emerging in software for handling generalized linear models (McCullagh & Nelder; Becker, Chambers & Wilks).

There is an important role for a data analysis toolkit. Every problem has its own quirks and requirements. Knowledge discovery, for instance, can vary in many ways depending on the user-defined notion of interestingness. Learning is often an embedded task in a larger system. So while there are some easy applications of learning, learning applications generally require special-purpose development of learning systems or related support software. Sometimes this can be achieved by patching together some existing techniques or by decomposing a problem into subproblems.

- Bayesian networks and exponential family conditionals: decomposition of exact Bayes factors, with local search or Gibbs sampling to generate alternative models.
- Bayesian networks with missing or latent variables, and other unsupervised learning models: Gibbs sampling or EM, making use of the above techniques where possible.
- Feedforward networks: MAP method by exact computation of derivatives.
- Feedforward networks with linear output: as above, with an initial removal of the linear component.
- Linear regression and extensions: least squares, EM, and MAP.
- Generalized linear models: MAP method by exact computation of derivatives.

Table : Derivation of learning algorithms.

Nevertheless, the decomposition and patching of learning algorithms with inference and decision making can be formalized and understood within graphical models. In some ways the S system plays the role of a toolkit (Chambers & Hastie): it provides a system for prototyping learning algorithms, includes the ability to handle generalized linear models, does automatic differentiation of expressions, and includes many statistical and mathematical functions useful as primitives. The language of graphical models is best viewed as an additional layer on top of this kind of system. Note also that it is impractical to assume that a software generator could create algorithms competitive with current finely tuned algorithms, for instance for hidden Markov models. However, a software toolkit for learning could be used to prototype an algorithm that could later be refined by hand.

The combination of learning and graphical models shares some of the superior aspects of each of the different learning fields. Consider the philosophy of neural networks: these nonparametric systems are composed of simple computational components, usually readily parallelizable and often nonlinear, and the components can be pieced together to tailor systems for specific applications. Graphical models for learning have these same features. Graphical models also have the expressibility of the probabilistic knowledge representations that were developed in artificial intelligence to be used in knowledge acquisition contexts; they therefore form an important basis for knowledge refinement. Finally, graphical models for learning allow the powerful tools of statistics to be applied to the problem.

Once learning problems are specified in the common language of graphical models, their associated learning algorithms, their derivation, and their interrelationships can be explored. This allows commonalities between seemingly diverse pairs of algorithms, such as k-means clustering versus approximate methods for learning hidden Markov models, learning decision trees versus learning Bayesian networks (Buntine), and Gibbs sampling versus the expectation maximization algorithm in Section , to be understood as variations of one another. The framework is important as an educational tool.


Appendix A. Proofs of Lemmas and Theorems

A.1 Proof of Theorem 

A useful property of independence is that A is independent of B given C if and only if p(A, B, C) = f(A, C) g(B, C) for some functions f and g. The only-if result follows directly from this property. The proof of the if result makes use of the following simple lemma: if A is independent of B given X − A − B, and p(X) = ∏_i f_i(X_i) for some functions f_i and variable sets X_i ⊆ X, then

$$
p(X) \;=\; \prod_i g_i(X_i - B)\, h_i(X_i - A)
$$

for some functions g_i, h_i. Notice that it is known that p(X) = g(X − B) h(X − A) for some functions g and h, by independence. Instantiate the variables in B to some value b. Then

$$
g(X - B)\, h(X - A)\big|_{B=b} \;=\; \prod_i f_i(X_i)\big|_{B=b}.
$$

Similarly, instantiate A to a; then

$$
g(X - B)\big|_{A=a}\, h(X - A) \;=\; \prod_i f_i(X_i)\big|_{A=a}.
$$

Multiplying both sides of the two equalities together and substituting in

$$
\prod_i f_i(X_i)\big|_{A=a,\,B=b} \;=\; g(X - B)\big|_{A=a}\, h(X - A)\big|_{B=b} \;=\; p(X)\big|_{A=a,\,B=b},
$$

we get

$$
p(X)\, \prod_i f_i(X_i)\big|_{A=a,\,B=b} \;=\; \prod_i f_i(X_i)\big|_{B=b}\; \prod_i f_i(X_i)\big|_{A=a}.
$$

This is defined for all X since the domain is a cross product. The lemma holds, because all functions are strictly positive, with

$$
g_i(X_i - B) \;=\; \frac{f_i(X_i)\big|_{B=b}}{f_i(X_i)\big|_{B=b,\,A=a}}
\qquad\text{and}\qquad
h_i(X_i - A) \;=\; f_i(X_i)\big|_{A=a}.
$$

The final proof of the if result follows by applying the equation above repeatedly. Suppose the variables in X are x_1, …, x_v. Now p(X) = f(X) for some strictly positive function f. Therefore

$$
p(X) \;=\; g\big(X - \{x_1\}\big)\; h\big(\{x_1\} \cup neighbors(x_1)\big).
$$

Denote A_{1,i} = X − {x_i} and A_{2,i} = {x_i} ∪ neighbors(x_i). Repeating the application of the equation for each variable yields

$$
p(X) \;=\; \prod_{j_1, \ldots, j_v \in \{1,2\}} g_{j_1 \cdots j_v}\!\left( \bigcap_{i=1}^{v} A_{j_i, i} \right)
$$

for strictly positive functions g_{j_1 ⋯ j_v}. Now consider these functions. It is only necessary to keep the function g_{j_1 ⋯ j_v} if the set ⋂_{i=1}^{v} A_{j_i,i} is maximal, that is, it is not contained in any such set; equivalently, keep the function g_{j_1 ⋯ j_v} if the set ⋃_{i=1}^{v} (X − A_{j_i,i}) is minimal. The minimal such sets are the set of cliques on the undirected graph. The result follows.


A.2 Proof of Theorem 

It takes O(|X|) operations to remove the deterministic nodes from a graph using Lemma . These nodes can be removed from the graph and then reinserted at the end; hereafter assume the graph contains no deterministic nodes. Also denote the unknown variables in a set Y by unknown(Y) = Y − known(Y). Then, without loss of generality, assume that X_i contains all known variables in the Markov blanket of unknown(X_i).

Showing the independence model is equivalent amounts to showing, for i = 1, …, P, that unknown(X_i) is independent of ⋃_{j ≠ i} unknown(X_j) given known(X). To test independence using the method of Frydenberg, each plate must be expanded (that is, duplicated the right number of times), the graph moralized, the given nodes removed, and separability then tested. The Markov blanket for each node in this expanded graph corresponds to those nodes directly connected to it in the moralized expanded graph. Suppose we have the finest unique partition unknown(X_i) of the unknown nodes; the X_i are then reconstructed by adding the known variables in the Markov blankets of the variables in unknown(X_i). Suppose V is an unknown variable in a plate and the V_j are its instances once the plate is expanded. Now, by symmetry, either every V_j is in the same element of the finest partition or they are all in separate elements. If V_j has a certain unknown variable in its Markov boundary outside the plate, then so must V_k for k ≠ j, by symmetry; therefore V_j and V_k are in the same element of the partition. Hence, by contradiction, if V_j is in a separate element, that element occurs wholly within the plate boundaries. Therefore this finest partition can be represented using plates, and the finest partition can be identified from the graph ignoring plates. The operation of finding the finest separated sets in a graph is quadratic in the size of the graph, hence the O(|X|²) complexity.

Assume the condition holds and consider Equation . Let cliques(π) denote the subsets of variables in π ∪ parents(π) that form cliques in the graph formed by restricting G to π ∪ parents(π) and placing an undirected arc between all parents. Let Π(X) be the set of chain components in X. From Frydenberg we have

$$
p(X \mid M) \;=\; \prod_{\pi \in \Pi(X)}\; \prod_{C \in cliques(\pi)}\; \prod_{i \in ind(C)} g_C(C_i).
$$

Furthermore, if u ∈ X_i is not known, then the variables in u's Markov blanket will occur in X_i, and therefore if u ∈ C for some clique C, then C ⊆ X_i. Therefore cliques containing an unknown variable can be partitioned according to which subgraph they belong in. Let

$$
cliques_j \;=\; \{\, C \;:\; C \in cliques(\pi) \text{ for some } \pi \in \Pi(X),\; unknown(X_j) \cap C \neq \emptyset \,\},
$$

and add to this any remaining cliques wholly contained in the set so far,

$$
\{\, C \;:\; C \in cliques(\pi) \text{ for some } \pi \in \Pi(X),\; C \subseteq \textstyle\bigcup_{C' \in cliques_j} C' \,\}.
$$

Call any remaining cliques cliques_{P+1}. Therefore

$$
p(X \mid M) \;=\; \prod_{j=1}^{P+1}\; \prod_{C \in cliques_j}\; \prod_{i \in ind(C)} g_C(C_i).
$$

Results in

$$
f_j\big(known(X_j)\big) \;=\; \int_{unknown(X_j)} \;\prod_{C \in cliques_j}\; \prod_{i \in ind(C)} g_C(C_i)\; d\, unknown(X_j).
$$

Furthermore, the potential functions on the cliques in G_i are well defined, as described.

- θ, C-dim multinomial:  p(x = j | θ) = θ_j,  for j = 1, …, C.
- y | x, Gaussian(βᵀx, σ):  (2π)^{-1/2} σ^{-1} exp( -(y - βᵀx)² / (2σ²) ),  for y ∈ R.
- x, Gamma(α, β):  β^α x^{α-1} e^{-βx} / Γ(α),  for x > 0.
- θ, C-dim Dirichlet(α₁, …, α_C):  ∏_{i=1}^{C} θ_i^{α_i - 1} / Beta(α₁, …, α_C).
- x, d-dim Gaussian(µ, Σ):  det(Σ)^{-1/2} (2π)^{-d/2} exp( -(x - µ)ᵀ Σ^{-1} (x - µ) / 2 ),  for x ∈ R^d.
- S, d-dim Wishart(ν, T):  det(T)^{ν/2} det(S)^{(ν-d-1)/2} exp( -trace(T S)/2 ) / ( 2^{νd/2} π^{d(d-1)/4} ∏_{i=1}^{d} Γ((ν+1-i)/2) ),  for S symmetric positive definite.

Table : Distributions and their functional form.

A.3 Proof of Corollary 

If X_j = π_j ∪ nd-parents(π_j), then every clique in a chain component in π_j will occur in cliques_j. Therefore

$$
\prod_{C \in cliques_j}\; \prod_{i \in ind(C)} g_C(C_i) \;=\; p\big(\pi_j \mid nd\text{-}parents(\pi_j)\big).
$$

A.4 Proof of Lemma 

Consider the definition of the Markov blanket. If a directed arc is added between the nodes, then the Markov blanket will only change for an unknown node X if U now enters the set of nondeterministic parents of the chain components containing nondeterministic children of X. This will not affect the subsequent graph separability, however, because it will only subsequently add arcs between U, a given node, and other nodes.

Appendix B. The exponential family

The exponential family of distributions was described in Definition . The common use of the exponential family exists because of Theorem . Table gives a few exponential family distributions and their functional form; further details and more extensive tables can be found in most Bayesian textbooks on probability distributions (DeGroot; Bernardo & Smith). Table gives some standard conjugate prior distributions for those in Table , and Table gives their matching posteriors (DeGroot; Bernardo & Smith).

- θ, C-dim multinomial:  θ ~ Dirichlet(α₁, …, α_C).
- y | x, Gaussian(βᵀx, σ):  1/σ² ~ Gamma;  β | σ ~ d-dim Gaussian.
- x, Gamma(α, β), α fixed:  β ~ Gamma(α₀, β₀).
- x, d-dim Gaussian(µ, Σ):  µ | Σ ~ Gaussian(µ₀, Σ/N₀);  Σ⁻¹ ~ Wishart(ν, S₀).

Table : Distributions and their conjugate priors.

- θ, C-dim multinomial:  θ ~ Dirichlet(n₁ + α₁, …, n_C + α_C),  for n_c = Σ_{i=1}^{N} 1_{x_i = c}.
- y | x, Gaussian(βᵀx, σ):  1/σ² ~ Gamma and β | σ ~ d-dim Gaussian, with hyperparameters given by the usual conjugate least-squares update built from Σᵢ xᵢxᵢᵀ, Σᵢ xᵢyᵢ and Σᵢ yᵢ².
- x, Gamma(α, β), α fixed:  β ~ Gamma(Nα + α₀, β₀ + Σ_{i=1}^{N} xᵢ).
- x, d-dim Gaussian(µ, Σ):  µ | Σ ~ Gaussian( (N x̄ + N₀ µ₀)/(N + N₀), Σ/(N + N₀) );  Σ⁻¹ ~ Wishart(N + ν, S₀ + S),  for x̄ = (1/N) Σ_{i=1}^{N} xᵢ and S = Σ_{i=1}^{N} (xᵢ - x̄)(xᵢ - x̄)ᵀ.

Table : Distributions and matching conjugate posteriors.

- θ, C-dim multinomial:  Beta(n₁ + α₁, …, n_C + α_C) / Beta(α₁, …, α_C).
- y | x, Gaussian(βᵀx, σ):  (2π)^{-N/2} ( det(V_N) / det(V₀) )^{1/2} Γ((α₀ + N)/2) ν₀^{α₀/2} / ( Γ(α₀/2) ν_N^{(α₀+N)/2} ),  where V₀, ν₀ and V_N, ν_N are the prior and posterior hyperparameters of the Gaussian-Gamma pair above.
- x, Gamma(α, β), α fixed:  Γ(Nα + α₀) β₀^{α₀} / ( Γ(α₀) (β₀ + Σ_{i=1}^{N} xᵢ)^{Nα + α₀} ).
- x, d-dim Gaussian(µ, Σ):  π^{-dN/2} ( N₀/(N + N₀) )^{d/2} det(S₀)^{ν/2} det(S₀ + S)^{-(ν+N)/2} ∏_{i=1}^{d} Γ((ν + N + 1 - i)/2) / Γ((ν + 1 - i)/2).

Table : Distributions and their evidence.

For the distributions in Table , with priors in Table , Table gives their matching evidence, derived using Lemma and cancelling a few common terms.

In the case where the functions w are full rank (the dimension of θ is k, the same as w) and the Jacobian of w with respect to θ is invertible, det(dw/dθ) ≠ 0, then various moments of the distribution can easily be found:

$$
E_{x \mid y}\left[\, t(x) \,\right] \;=\; \left(\frac{dw}{d\theta}\right)^{-1} \frac{1}{Z(\theta)}\, \frac{dZ(\theta)}{d\theta}.
$$

The vector function w now has an inverse, and it is referred to as the link function (McCullagh & Nelder). This yields

$$
E_{x \mid y}\left[\, t_i(x) \,\right] \;=\; \frac{\partial \log Z(w)}{\partial w_i},
\qquad
Z(w) \;=\; \int \exp\!\left(\sum_{i=1}^{k} w_i\, t_i(x)\right) dx.
$$

These are important because, if the normalization constant Z can be found in closed form, then it can be differentiated and divided, for instance symbolically, to construct formulae for various moments of the distribution, such as E_{x|y}[t_i(x)] and E_{x|y}[t_i(x) t_j(x)]. Furthermore, Equation implies that derivatives of the normalization constant dZ/dθ can be found by estimating moments of the sufficient statistics, for instance by Markov chain Monte Carlo methods.
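As a small illustration of this use of the normalizer (a sketch; the Gamma parameterization, the variable names, and the finite-difference check are assumptions for illustration rather than anything from the paper), the code below recovers E[log x] and E[x] for a Gamma distribution by differentiating log Z with respect to the natural parameters, and checks the answer against the known closed forms and a Monte Carlo estimate.

```python
import numpy as np
from scipy.special import gammaln, digamma

a, b = 3.5, 2.0                        # Gamma shape and rate (illustrative values)
w = np.array([a - 1.0, -b])            # natural parameters for t(x) = (log x, x)

def log_Z(w):
    # log normalizer of exp(w1*log(x) + w2*x) over x > 0: log Gamma(w1+1) - (w1+1) log(-w2)
    return gammaln(w[0] + 1.0) - (w[0] + 1.0) * np.log(-w[1])

# Moments of the sufficient statistics from derivatives of log Z (finite differences).
eps = 1e-6
grad = np.array([(log_Z(w + eps * e) - log_Z(w - eps * e)) / (2 * eps) for e in np.eye(2)])

print(grad)                            # approximately [digamma(a) - log(b), a / b]
print(digamma(a) - np.log(b), a / b)   # closed forms for E[log x] and E[x]
x = np.random.default_rng(0).gamma(shape=a, scale=1.0 / b, size=200_000)
print(np.log(x).mean(), x.mean())      # Monte Carlo check
```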

Acknowledgements

The general program presented here is shared by many, including Peter Cheeseman, who encouraged this development from its inception. These ideas were presented in formative stages at Snowbird (Neural Networks for Computing, April) and to the Bayesian Analysis in Expert Systems (BAIES) group in Pavia, Italy (June). Feedback from that group helped further develop these ideas. Graduate students at Stanford and Berkeley have also received various incarnations of these ideas. Thanks also to George John, Ronny Kohavi, Scott Schmidler, Scott Roy, Padhraic Smyth, and Peter Cheeseman for their feedback on drafts, and to the JAIR reviewers. Brian Williams pointed out the extension of the decomposition theorems to the deterministic case. Brian Ripley reminded me of the extensive features of S.


References

Andersen, S., Olesen, K., Jensen, F., & Jensen, F. HUGIN: a shell for building Bayesian belief universes for expert systems. In International Joint Conference on Artificial Intelligence, Detroit. Morgan Kaufmann.

Azevedo-Filho, A., & Shachter, R. Laplace's method approximations for probabilistic inference in belief networks with continuous variables. In de Mantaras, R. L., & Poole, D. (Eds.), Uncertainty in Artificial Intelligence: Proceedings of the Tenth Conference, Seattle, Washington.

Becker, R., Chambers, J., & Wilks, A. The New S Language. Pacific Grove, California: Wadsworth & Brooks/Cole.

Berger, J. Statistical Decision Theory and Bayesian Analysis. New York: Springer-Verlag.

Bernardo, J., & Smith, A. Bayesian Theory. Chichester: John Wiley.

Besag, J., York, J., & Mollié, A. Bayesian image restoration, with two applications in spatial statistics. Annals of the Institute of Statistical Mathematics.

Box, G., & Tiao, G. Bayesian Inference in Statistical Analysis. Reading, Massachusetts: Addison-Wesley.

Breiman, L., Friedman, J., Olshen, R., & Stone, C. Classification and Regression Trees. Belmont, California: Wadsworth.

Bretthorst, G. An introduction to model selection using probability theory as logic. In Heidbreder, G. (Ed.), Maximum Entropy and Bayesian Methods. Kluwer Academic. Proceedings at Santa Barbara.

Buntine, W. (a). Classifiers: a theoretical and empirical study. In International Joint Conference on Artificial Intelligence, Sydney. Morgan Kaufmann.

Buntine, W. (b). Learning classification trees. In Hand, D. (Ed.), Artificial Intelligence Frontiers in Statistics. London: Chapman & Hall.

Buntine, W. (c). Theory refinement of Bayesian networks. In D'Ambrosio, B., Smets, P., & Bonissone, P. (Eds.), Uncertainty in Artificial Intelligence: Proceedings of the Seventh Conference, Los Angeles, California.

Buntine, W. Representing learning with graphical models. Technical Report FIA, Artificial Intelligence Research Branch, NASA Ames Research Center. Submitted.

Buntine, W., & Weigend, A. Bayesian back-propagation. Complex Systems.

Buntine, W., & Weigend, A. Computing second derivatives in feed-forward networks: a review. IEEE Transactions on Neural Networks.

Casella, G., & Berger, R. Statistical Inference. Belmont, California: Wadsworth & Brooks/Cole.

Çinlar, E. Introduction to Stochastic Processes. Prentice-Hall.

Chambers, J., & Hastie, T. (Eds.). Statistical Models in S. Pacific Grove, California: Wadsworth & Brooks/Cole.

Chan, B., & Shachter, R. Structural controllability and observability in influence diagrams. In Dubois, D., Wellman, M., D'Ambrosio, B., & Smets, P. (Eds.), Uncertainty in Artificial Intelligence: Proceedings of the Eighth Conference, Stanford, California.

Charniak, E. Bayesian networks without tears. AI Magazine.

Cheeseman, P., Self, M., Kelly, J., Taylor, W., Freeman, D., & Stutz, J. Bayesian classification. In Seventh National Conference on Artificial Intelligence, Saint Paul, Minnesota.

Cheeseman, P. On finding the most probable model. In Shrager, J., & Langley, P. (Eds.), Computational Models of Discovery and Theory Formation. Morgan Kaufmann.

Cohen, W. Compiling prior knowledge into an explicit bias. In Ninth International Conference on Machine Learning. Morgan Kaufmann.

Cooper, G., & Herskovits, E. A Bayesian method for the induction of probabilistic networks from data. Machine Learning.

Cowell, R. BAIES: a probabilistic expert system shell with qualitative and quantitative learning. In Bernardo, J., Berger, J., Dawid, A., & Smith, A. (Eds.), Bayesian Statistics. Oxford University Press.

Dagum, P., Galper, A., Horvitz, E., & Seiver, A. Uncertain reasoning and forecasting. International Journal of Forecasting. Submitted.

Dagum, P., & Horvitz, E. Reformulating inference problems through selective conditioning. In Dubois, D., Wellman, M., D'Ambrosio, B., & Smets, P. (Eds.), Uncertainty in Artificial Intelligence: Proceedings of the Eighth Conference, Stanford, California.

Dawid, A. Conditional independence. SIAM Journal on Computing.

Dawid, A. P. Properties of diagnostic data distributions. Biometrics.

Dawid, A., & Lauritzen, S. Hyper Markov laws in the statistical analysis of decomposable graphical models. Annals of Statistics.

Dean, T., & Wellman, M. Planning and Control. San Mateo, California: Morgan Kaufmann.

DeGroot, M. Optimal Statistical Decisions. McGraw-Hill.

Dempster, A., Laird, N., & Rubin, D. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B.

Duda, R., & Hart, P. Pattern Classification and Scene Analysis. New York: John Wiley.

Frydenberg, M. The chain graph Markov property. Scandinavian Journal of Statistics.

Geiger, D., & Heckerman, D. Learning Gaussian networks. In de Mantaras, R. L., & Poole, D. (Eds.), Uncertainty in Artificial Intelligence: Proceedings of the Tenth Conference, Seattle, Washington.

Geman, D. Random fields and inverse problems in imaging. In Hennequin, P. (Ed.), École d'Été de Probabilités de Saint-Flour XVIII. Springer-Verlag, Berlin. Lecture Notes in Mathematics.

Geman, S., & Geman, D. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Gilks, W., Clayton, D., Spiegelhalter, D., Best, N., McNeil, A., Sharples, L., & Kirby, A. (a). Modelling complexity: applications of Gibbs sampling in medicine. Journal of the Royal Statistical Society B.

Gilks, W., Thomas, A., & Spiegelhalter, D. (b). A language and program for complex Bayesian modelling. The Statistician.

Gill, P. E., Murray, W., & Wright, M. H. Practical Optimization. San Diego: Academic Press.

Griewank, A., & Corliss, G. F. (Eds.). Automatic Differentiation of Algorithms: Theory, Implementation and Application. Breckenridge, Colorado: SIAM.

Heckerman, D., Geiger, D., & Chickering, D. Learning Bayesian networks: the combination of knowledge and statistical data. Technical Report MSR-TR (revised), Microsoft Research, Advanced Technology Division. Submitted to the Machine Learning Journal.

Heckerman, D. Probabilistic Similarity Networks. MIT Press.

Henrion, M. Towards efficient inference in multiply connected belief networks. In Oliver, R., & Smith, J. (Eds.), Influence Diagrams, Belief Nets and Decision Analysis. Wiley.

Hertz, J., Krogh, A., & Palmer, R. Introduction to the Theory of Neural Computation. Addison-Wesley.

Howard, R. Decision analysis: perspectives on inference, decision, and experimentation. Proceedings of the IEEE.

Hrycej, T. Gibbs sampling in Bayesian networks. Artificial Intelligence.

Jeffreys, H. Theory of Probability (third edition). Oxford: Clarendon Press.

Johnson, D., Papadimitriou, C., & Yannakakis, M. How easy is local search? In FOCS.

Kass, R., & Raftery, A. Bayes factors and model uncertainty. Technical Report, Department of Statistics, Carnegie Mellon University, PA. Submitted to the Journal of the American Statistical Association.

Kjærulff, U. A computational scheme for reasoning in dynamic probabilistic networks. In Dubois, D., Wellman, M., D'Ambrosio, B., & Smets, P. (Eds.), Uncertainty in Artificial Intelligence: Proceedings of the Eighth Conference, Stanford, California.

Kohavi, R. Bottom-up induction of oblivious read-once decision graphs: strengths and limitations. In Twelfth National Conference on Artificial Intelligence.

Lange, K., & Sinsheimer, J. Normal/independent distributions and their applications in robust regression. Journal of Computational and Graphical Statistics.

Langley, P., Iba, W., & Thompson, K. An analysis of Bayesian classifiers. In Tenth National Conference on Artificial Intelligence, San Jose, California.

Lauritzen, S., Dawid, A., Larsen, B., & Leimer, H.-G. Independence properties of directed Markov fields. Networks.

Little, R., & Rubin, D. Statistical Analysis with Missing Data. New York: John Wiley and Sons.

Loredo, T. The promise of Bayesian inference for astrophysics. In Feigelson, E., & Babu, G. (Eds.), Statistical Challenges in Modern Astronomy. Springer-Verlag.

MacKay, D. A practical Bayesian framework for backprop networks. Neural Computation.

MacKay, D. Bayesian nonlinear modeling for the energy prediction competition. Draft report, Cavendish Laboratory, University of Cambridge.

Madigan, D., & Raftery, A. Model selection and accounting for model uncertainty in graphical models using Occam's window. Journal of the American Statistical Association. To appear.

McCullagh, P., & Nelder, J. Generalized Linear Models (second edition). London: Chapman and Hall.

McLachlan, G. J., & Basford, K. E. Mixture Models: Inference and Applications to Clustering. New York: Marcel Dekker.

Meilijson, I. A fast improvement to the EM algorithm on its own terms. Journal of the Royal Statistical Society B.

Minton, S., Johnson, M., Philips, A., & Laird, P. Solving large-scale constraint satisfaction and scheduling problems using a heuristic repair method. In Eighth National Conference on Artificial Intelligence, Boston, Massachusetts.

Neal, R. Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR, Department of Computer Science, University of Toronto.

Nowlan, S., & Hinton, G. Simplifying neural networks by soft weight sharing. In Touretzky, D. (Ed.), Advances in Neural Information Processing Systems (NIPS). Morgan Kaufmann.

Oliver, J. Decision graphs: an extension of decision trees. In Proceedings of the Fourth International Workshop on Artificial Intelligence and Statistics. Extended version available as a technical report, Department of Computer Science, Monash University, Australia.

Pearl, J. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann.

Poland, W. Decision Analysis with Continuous and Discrete Variables: A Mixture Distribution Approach. PhD thesis, Department of Engineering-Economic Systems, Stanford University, Stanford, California.

Press, S. Bayesian Statistics. New York: Wiley.

Quinlan, J. Unknown attribute values in induction. In Segre, A. (Ed.), Proceedings of the Sixth International Machine Learning Workshop, Cornell, New York. Morgan Kaufmann.

Quinlan, J. C4.5: Programs for Machine Learning. Morgan Kaufmann.

Ripley, B. Spatial Statistics. New York: Wiley.

Ripley, B. Stochastic Simulation. John Wiley & Sons.

Rivest, R. Learning decision lists. Machine Learning.

Russell, S., Binder, J., & Koller, D. Adaptive probabilistic networks. Technical Report CSD, July. University of California, Berkeley.

Selman, B., Levesque, H., & Mitchell, D. A new method for solving hard satisfiability problems. In Tenth National Conference on Artificial Intelligence, San Jose, California.

Shachter, R. Evaluating influence diagrams. Operations Research.

Shachter, R. An ordered examination of influence diagrams. Networks.

Shachter, R., Andersen, S., & Szolovits, P. Global conditioning for probabilistic inference in belief networks. In de Mantaras, R. L., & Poole, D. (Eds.), Uncertainty in Artificial Intelligence: Proceedings of the Tenth Conference, Seattle, Washington.

Shachter, R., & Heckerman, D. Thinking backwards for knowledge acquisition. AI Magazine, Fall.

Shachter, R., & Kenley, C. Gaussian influence diagrams. Management Science.

Smith, A., & Spiegelhalter, D. Bayes factors and choice criteria for linear models. Journal of the Royal Statistical Society B.

Spiegelhalter, D. Personal communication.

Spiegelhalter, D., Dawid, A., Lauritzen, S., & Cowell, R. Bayesian analysis in expert systems. Statistical Science.

Spiegelhalter, D., & Lauritzen, S. Sequential updating of conditional probabilities on directed graphical structures. Networks.

Srinivas, S., & Breese, J. IDEAL: a software package for analysis of influence diagrams. In Bonissone, P. (Ed.), Proceedings of the Sixth Conference on Uncertainty in Artificial Intelligence, Cambridge, Massachusetts.

Stewart, L. Hierarchical Bayesian analysis using Monte Carlo integration: computing posterior distributions when there are many possible models. The Statistician.

Tanner, M. Tools for Statistical Inference (second edition). Springer-Verlag.

Thomas, A., Spiegelhalter, D., & Gilks, W. BUGS: a program to perform Bayesian inference using Gibbs sampling. In Bernardo, J., Berger, J., Dawid, A., & Smith, A. (Eds.), Bayesian Statistics. Oxford University Press.

Tierney, L., & Kadane, J. Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association.

Titterington, D., Smith, A., & Makov, U. Statistical Analysis of Finite Mixture Distributions. Chichester: John Wiley & Sons.

van Laarhoven, P., & Aarts, E. Simulated Annealing: Theory and Applications. Dordrecht: D. Reidel.

Vuong, T. Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica.

Werbos, P. J., McAvoy, T., & Su, T. Neural networks, system identification, and control in the chemical process industry. In White, D. A., & Sofge, D. A. (Eds.), Handbook of Intelligent Control. Van Nostrand Reinhold.

Wermuth, N., & Lauritzen, S. On substantive research hypotheses, conditional independence graphs and graphical chain models. Journal of the Royal Statistical Society B.

Whittaker, J. Graphical Models in Applied Multivariate Statistics. Wiley.

Wolpert, D. Bayesian backpropagation over functions rather than weights. In Tesauro, G. (Ed.), Advances in Neural Information Processing Systems (NIPS). Morgan Kaufmann.