
Bayesian Networks and Belief Propagation: From Rumelhart to Pearl to Today

Rina Dechter, Donald Bren School of Computer Science, University of California, Irvine, USA

The ELSC-ICNC Retreat 2012, Ein Gedi. The David E. Rumelhart Prize for Contributions to the Theoretical Foundations of Human Cognition.

Rumelhart Prize for Pearl, 2011 http://thesciencenetwork.org/programs/cogsci-2011/rumelhart-lecture-judea-pearl

Dr. Judea Pearl has been a key researcher in the application of probabilistic methods to the understanding of intelligent systems, whether natural or artificial. He has pioneered the development of graphical models, and especially a class of graphical models known as Bayesian networks, which can be used to represent and to draw inferences from probabilistic knowledge in a highly transparent and computationally natural fashion. Graphical models have had a transformative impact across many disciplines, from statistics and machine learning to artificial intelligence; and they are the foundation of the recent emergence of Bayesian cognitive science. Dr. Pearl's work can be seen as providing a rigorous foundation for a theory of epistemology which is not merely philosophically defensible, but which can be mathematically specified and computationally implemented. It also provides one of the most influential sources of hypotheses about the function of the human mind and brain in current cognitive science.

Rumelhart 1976: Towards an Interactive Model of Reading

Pearl: "... so we have a combination of top-down and bottom-up modes of reasoning which somehow coordinate their actions, resulting in a friendly handshaking."

Rumelhart's Proposed Solution

Rumelhart (1976), Figure 10.

Pearl 1982: Reverend Bayes on Inference Engines. Bayes Net (1985).

Outline
- Bayesian networks from a historical perspective
- Bayesian networks, a short tutorial
- Belief propagation on trees
- From trees to graphs
- From Bayesian networks to graphical models
- Some observations on loopy belief propagation

Bayesian Networks (Pearl 1985)

A Medical Diagnosis Example (Lauritzen and Spiegelhalter, 1988)

BN = (G, Θ). Nodes and CPDs: Smoking P(S); Lung Cancer P(C|S); Bronchitis P(B|S); X-ray P(X|C,S); Dyspnoea P(D|C,B).

CPD for Dyspnoea, P(D|C,B):

C B | P(D=1|C,B) | P(D=0|C,B)
0 0 |    0.1     |    0.9
0 1 |    0.7     |    0.3
1 0 |    0.8     |    0.2
1 1 |    0.9     |    0.1

P(S, C, B, X, D) = P(S) P(C|S) P(B|S) P(X|C,S) P(D|C,B). The product of all these assessments constitutes a joint-probability model which supports the assessed quantities.

Belief updating: P(lung cancer = yes | smoking = no, dyspnoea = yes) = ?

MPE: find the assignment maximizing P(S) · P(C|S) · P(B|S) · P(X|C,S) · P(D|C,B).

Bayesian Networks Encode Independencies
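Both queries can be read straight off the factored joint by brute-force enumeration. A minimal sketch in Python: only the P(D|C,B) table comes from the slide's CPD; the other CPT numbers are made-up, illustrative values.

```python
from itertools import product

# Toy "smoking" network. P(D|C,B) follows the slide's CPD table;
# pS, pC, pB, pX are hypothetical numbers for illustration only.
pS = {0: 0.7, 1: 0.3}
pC = {(0, 0): 0.98, (1, 0): 0.02, (0, 1): 0.80, (1, 1): 0.20}  # pC[(c, s)] = P(C=c|S=s)
pB = {(0, 0): 0.90, (1, 0): 0.10, (0, 1): 0.60, (1, 1): 0.40}  # pB[(b, s)] = P(B=b|S=s)
pX = {}  # pX[(x, c, s)] = P(X=x|C=c,S=s), hypothetical
for c, s in product((0, 1), repeat=2):
    p1 = 0.9 if c else 0.1
    pX[(1, c, s)], pX[(0, c, s)] = p1, 1 - p1
pD = {(1, 0, 0): 0.1, (1, 0, 1): 0.7, (1, 1, 0): 0.8, (1, 1, 1): 0.9}  # P(D=1|C,B), from slide
for (d, c, b), v in list(pD.items()):
    pD[(0, c, b)] = 1 - v

def joint(s, c, b, x, d):
    """P(S,C,B,X,D) as the product of the five local CPDs."""
    return pS[s] * pC[(c, s)] * pB[(b, s)] * pX[(x, c, s)] * pD[(d, c, b)]

# Belief updating by enumeration:
# P(C=1 | S=0, D=1) = sum over the hidden variables, renormalized.
num = sum(joint(0, 1, b, x, 1) for b, x in product((0, 1), repeat=2))
den = sum(joint(0, c, b, x, 1) for c, b, x in product((0, 1), repeat=3))
print("P(cancer | no smoking, dyspnoea) =", num / den)

# MPE: the single most probable complete assignment.
mpe = max(product((0, 1), repeat=5), key=lambda v: joint(*v))
print("MPE assignment (S,C,B,X,D):", mpe)
```

Enumeration is exponential in the number of variables, which is exactly why the tree and polytree propagation algorithms in the following slides matter.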

Each arc represents a causal relationship, and each node carries a CPD: Smoking P(S); Lung Cancer P(C|S); Bronchitis P(B|S); X-ray P(X|C,S); Dyspnoea P(D|C,B).

P(S, C, B, X, D) = P(S) P(C|S) P(B|S) P(X|C,S) P(D|C,B)

Monitoring Intensive Care Patients
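The factored form needs far fewer numbers than a full joint table. A quick count for the five-node network above (the counting helper is ours, but the rule it encodes is standard: each variable contributes (k − 1) free probabilities per parent configuration):

```python
# Free parameters of a Bayesian network over binary variables:
# each variable contributes (k - 1) * k^(#parents) numbers, k = domain size.
domain = 2
parents = {"S": [], "C": ["S"], "B": ["S"], "X": ["C", "S"], "D": ["C", "B"]}

def n_params(parents, k=2):
    return sum((k - 1) * k ** len(ps) for ps in parents.values())

print(n_params(parents))           # 1 + 2 + 2 + 4 + 4 = 13
print(domain ** len(parents) - 1)  # full joint table: 2^5 - 1 = 31
```

The gap widens dramatically with network size, which is the point of the ALARM example that follows.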

[Figure: the ALARM network for monitoring intensive-care patients; 37 nodes including MINVOLSET, PULMEMBOLUS, INTUBATION, KINKEDTUBE, VENTMACH, DISCONNECT, PAP, SHUNT, VENTLUNG, VENTTUBE, PRESS, MINVOL, FIO2, VENTALV, ANAPHYLAXIS, PVSAT, ARTCO2, TPR, SAO2, INSUFFANESTH, EXPCO2, HYPOVOLEMIA, LVFAILURE, CATECHOL, LVEDVOLUME, STROKEVOLUME, HISTORY, ERRLOWOUTPUT, HR, ERRCAUTER, CVP, PCWP, CO, HREKG, HRSAT, HRBP, BP.]

37 variables; 509 parameters instead of 2^37 for the full joint.

Outline
- Bayesian networks from a historical perspective
- Bayesian networks
- Belief propagation on trees
- From trees to graphs
- From Bayesian networks to graphical models
- Some observations on loopy belief propagation

Distributed Belief Propagation

The essence of belief propagation is to make global information be shared locally by every entity

How many people are in the line? Each person adds one to the count received from a neighbor and passes it on; combining the counts arriving from both directions, every person learns the global total (5) using only local exchanges. [Figure: a chain of five people; forward partial counts 1, 2, 3, 4, and the total 5 available at every node.]

Belief Propagation on Chains
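The counting puzzle above can be sketched as two sweeps of local messages along a chain, one in each direction; each node then combines its two incoming counts. This is a toy of our own construction, but it has exactly the shape of the two-pass belief propagation on chains:

```python
# Distributed counting on a chain: each node knows only its neighbors,
# receives a left-count and a right-count, and combines them locally.
def chain_counts(n):
    left = [0] * n   # left[i]: number of people strictly to the left of i
    right = [0] * n  # right[i]: number of people strictly to the right of i
    for i in range(1, n):
        left[i] = left[i - 1] + 1            # message passed rightward
        right[n - 1 - i] = right[n - i] + 1  # message passed leftward
    return [left[i] + 1 + right[i] for i in range(n)]

print(chain_counts(5))  # every node learns the global total: [5, 5, 5, 5, 5]
```

Two linear sweeps suffice on a chain, which is why trees are processed in linear time.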

P(x|u): each node fuses causal support (π, arriving from its parent) with diagnostic support (λ, arriving from its children).

Belief Propagation in Trees

The evidence splits into e+ (connected to X through its parent) and e- (connected to X through its children).

λ_X(u) = Σ_x P(x | u) · Π_k λ_{Y_k}(x)

Distributed Belief Propagation
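The λ update above is a product over the children's messages followed by a sum over x. A minimal sketch of that single message computation (the matrix convention and example numbers are ours):

```python
import numpy as np

# One lambda (diagnostic) message of tree belief propagation:
# lambda_X(u) = sum_x P(x|u) * prod_k lambda_{Y_k}(x).
def lambda_message(P_x_given_u, child_lambdas):
    """P_x_given_u[x, u] = P(X=x | U=u); child_lambdas is the list of
    lambda_{Y_k}(x) vectors arriving from X's children."""
    lam_x = np.ones(P_x_given_u.shape[0])
    for lam in child_lambdas:        # product over the children's messages
        lam_x *= lam
    return P_x_given_u.T @ lam_x     # sum over x, for each value of u

# Example: binary X with parent U; one child reports hard evidence X=1.
P = np.array([[0.9, 0.2],            # P(X=0|U=0), P(X=0|U=1)
              [0.1, 0.8]])           # P(X=1|U=0), P(X=1|U=1)
evidence = np.array([0.0, 1.0])      # child's lambda: X observed to be 1
print(lambda_message(P, [evidence])) # -> [0.1, 0.8]
```

With evidence X=1, the message simply reads off the row P(X=1|U=u), as the equation predicts.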

Causal support flows downward from parents (π messages); diagnostic support flows upward from children (λ messages).

Belief Propagation on Polytrees

(a) Fragment of a polytree; (b) the parents and children of a typical node X.

Outline
- Bayesian networks from a historical perspective
- Bayesian networks
- Belief propagation on trees
- From trees to graphs
- From Bayesian networks to graphical models
- Some observations on loopy belief propagation

A Loopy Bayesian Network

Coping with Loops (Pearl 1988)
- Clustering methods (4.4.1): Spiegelhalter and Lauritzen: junction-tree propagation (1988); join-tree propagation (Pearl 1988)
- Conditioning schemes (4.4.2): the loop-cutset scheme
- Stochastic simulation (Gibbs sampling) (4.4.3)
- Loopy belief propagation (exercise)

(Poly)-Tree Clustering

Clustering network (a) into (b) a tree or (c) a polytree.

Join-tree Clustering

Loop-cutset Conditioning

The Cutset Conditioning

Bayes Net (1985): breaking a loop.

Search vs. clustering
- Search (conditioning) is exponential in the cutset size; inference (elimination) is exponential in the induced width w.
- But: the graph may have too-large clusters, and need too many conditioning variables.

[Figure: conditioning on A = 1, ..., A = k splits one "denser" problem over A, B, C, D, E, F, G into k "sparser" subproblems.]

Belief Propagation when there are Loops

Some Applications
- Human thought processes
- Enable learning of Bayesian networks from data

HUJI 2012

Outline
- Bayesian networks from a historical perspective
- Bayesian networks
- Belief propagation on trees
- From trees to graphs
- From Bayesian networks to graphical models: general exact and approximate
- Some observations on loopy belief propagation

Constraint Networks

Map coloring. Variables: countries (A, B, C, etc.). Values: colors (red, green, blue). Constraints: A ≠ B, A ≠ D, D ≠ E, ...

[Figure: the constraint graph over countries A–G, with each node annotated by its allowed colors.]

Constraint Satisfaction Tasks

Example: map coloring. Variables: countries (A, B, C, ...); values: colors (e.g., red, green, yellow); constraints: A ≠ B, A ≠ D, D ≠ E, etc. Typical tasks:
- Are the constraints consistent?
- Find a solution; find all solutions
- Count all solutions
- Find a good solution

Graphical Models
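The map-coloring tasks above can be sketched with a minimal backtracking solver. The variables, colors, and the first three not-equal constraints follow the slide; the remaining adjacencies are assumed for illustration:

```python
# Minimal backtracking solver for the map-coloring CSP of the slides.
colors = ["red", "green", "blue"]
# A != B, A != D, D != E are from the slide; B-D and B-C are assumed edges.
neq = [("A", "B"), ("A", "D"), ("D", "E"), ("B", "D"), ("B", "C")]

def neighbors(v):
    return [q if p == v else p for p, q in neq if v in (p, q)]

def solve(assignment, variables):
    """Assign variables one by one; backtrack on a constraint violation."""
    if not variables:
        return assignment
    v, rest = variables[0], variables[1:]
    for c in colors:
        if all(assignment.get(u) != c for u in neighbors(v)):
            result = solve({**assignment, v: c}, rest)
            if result is not None:
                return result
    return None  # no consistent extension: backtrack

sol = solve({}, ["A", "B", "C", "D", "E"])
print(sol)  # one coloring in which every constrained pair differs
```

Plain backtracking is exponential in the worst case; the point of the tree-solving and cutset slides that follow is to exploit the constraint graph's structure instead.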

[Figure: four graphical models over variables A–G: (a) a belief network with CPDs P(A), P(B|A), P(C|A), P(D|A,B), P(F|B,C), P(G|D,F); (b) a constraint network with relations R(A), R(A,B), R(A,C), R(A,B,D), R(B,C,F), R(D,F,G); (c) an influence diagram for an oil-drilling decision, with nodes such as test cost, drill cost, oil sales, test result, oil produced, oil-sale policy, seismic structure, oil underground, sales cost, and market information; (d) a Markov network with potentials φ(A), φ(B), ψ(A,B).]

Tree Solving is Easy

On trees, all these queries reduce to local message passing:
- CSP consistency: projection-join
- Belief updating P(X): sum-product
- MPE: max-product
- #CSP: sum-product

[Figure: a tree rooted at X with children Y and Z (CPDs P(Y|X), P(Z|X)) and grandchildren T, R, L, M (CPDs P(T|Y), P(R|Y), P(L|Z), P(M|Z)); messages m_XY(X), m_YX(X), m_YT(Y), m_TY(Y), ... flow in both directions along every edge.]

Trees are processed in linear time and memory.

Inference vs. Conditioning

- By inference (thinking): compile the network into a tree of clusters (e.g., ABC, BDEF, EFH, FHK, HJ, KLM) and run tree propagation; exponential in treewidth, in both time and memory.
- By conditioning (guessing): enumerate assignments to a cycle-cutset (e.g., A = yellow, A = green; B = blue, green, red, ...), solving an easier subproblem for each; time exponential in the cycle-cutset size, but memory linear.

Outline
- Bayesian networks from a historical perspective
- Bayesian networks
- Belief propagation on trees
- From trees to graphs
- From Bayesian networks to graphical models
- Some observations on loopy belief propagation

Belief Propagation when there are Loops

Distributed Belief Propagation

The essence of belief propagation is to make global information be shared locally by every entity. How many people? On a graph with loops the same counting scheme breaks: messages can circulate and be double-counted, so nodes may conclude 6 or 7 instead of the true total of 5. [Figure: the counting example on a loopy graph.]

Linear Block Codes

[Figure: a linear block code as a Bayesian network. Input bits A, B, C, D, E, F, G, H feed parity checks (+); the input bits and the parity bits p1–p6 pass through a Gaussian noise channel (σ) and arrive as received bits a–h and p1–p6.]

Belief Propagation on Loopy Graphs

- Pearl (1988): proposed applying BP to loopy networks.
- McEliece et al. (1998): BP's success on coding networks.
- Lots of research into convergence and accuracy since. Even when it converges, still:
  - Why does BP work well for coding networks?
  - When does BP converge?
  - How accurate is it when it converges?
  - Can we characterize other good problem classes?
  - Can we get any guarantees on accuracy (when it converges)?

Constraint Propagation

Arc-consistency

Example: variables X, Y, Z, T with domains 1 ≤ X, Y, Z, T ≤ 3 and constraints X < Y, Y = Z, T < Z, X ≤ T.

Arc-consistency repeatedly shrinks each domain to the values that have a support in every constraint:

D_X ← π_X ( R_XY ⋈ D_Y )

Here the fixpoint leaves X, T ∈ {1, 2} and Y, Z ∈ {2, 3}.

Distributed Relational Arc-consistency
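The pruning rule above can be sketched as an AC-3-style loop on exactly this example (the queue-based revision scheme is the standard algorithm, coded here from scratch):

```python
# AC-3-style arc-consistency for: 1 <= X,Y,Z,T <= 3,
# with X < Y, Y = Z, T < Z, X <= T.
constraints = {
    ("X", "Y"): lambda x, y: x < y,
    ("Y", "Z"): lambda y, z: y == z,
    ("T", "Z"): lambda t, z: t < z,
    ("X", "T"): lambda x, t: x <= t,
}
arcs = {}  # both directions of every binary constraint
for (u, v), rel in constraints.items():
    arcs[(u, v)] = rel
    arcs[(v, u)] = lambda b, a, rel=rel: rel(a, b)

domains = {v: {1, 2, 3} for v in "XYZT"}

def revise(u, v):
    """Drop values of u with no support in v; return True if u changed."""
    rel = arcs[(u, v)]
    supported = {a for a in domains[u] if any(rel(a, b) for b in domains[v])}
    changed = supported != domains[u]
    domains[u] = supported
    return changed

queue = list(arcs)
while queue:
    u, v = queue.pop()
    if revise(u, v):
        # u's domain shrank: recheck every arc pointing at u
        queue.extend((w, u2) for (w, u2) in arcs if u2 == u and w != v)

print(domains)  # X, T shrink to {1, 2}; Y, Z shrink to {2, 3}
```

The fixpoint is order-independent: whichever arc is revised first, the algorithm stops at the same (largest) arc-consistent subdomains.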

Distributed relational arc-consistency can be applied with the nodes being the constraints themselves (the dual network): dual nodes A, AB, AC, ABD, BCF, DFG exchange messages over their shared variables.

Flattening the Bayesian Network

Flattening a Bayesian network: replace each CPT by its relational support, i.e., keep exactly the tuples with positive probability. [Tables: each CPT P(A), P(B|A), P(C|A), P(D|A,B), P(F|B,C), P(G|D,F) mapped to the relation of its nonzero tuples.]

Belief Zero Propagation = Arc-consistency

IBP over the belief network applies relational arc-consistency on the flat network. The two message updates have the same shape:

h_i^j = Σ_{elim(i,j)} ( p_i · Π_{k ∈ ne(i)} h_k^i )          (IBP, sum-product)
h_i^j = π_{l_ij} ( R_i ⋈ (⋈_{k ∈ ne_j(i)} h_k^i) )           (relational arc-consistency)

Updated belief: Bel(A,B) = P(B|A) · h_1 · h_4 · h_5. Updated relation: R(A,B) = R(A,B) ⋈ h_1 ⋈ h_4 ⋈ h_5. The belief is positive exactly where the relation keeps the tuple.

An Example

[Tables: the example's CPTs P(A), P(B|A), P(C|A), P(D|A,B), P(F|B,C), P(G|D,F), arranged on the dual graph A, AB, AC, ABD, BCF, DFG.]

Belief Propagation Example: Iterations 1–5

[Tables: at each iteration IBP infers additional zero beliefs; after five iterations the only assignment left with positive belief is (A, B, C, D, F, G) = (1, 3, 2, 2, 1, 3).]

The Inference Power of BP for Zero Beliefs

- Theorem: Iterative BP performs arc-consistency on the flat network (no more, no less).
- Soundness: inference of zero beliefs by IBP converges (in n·k iterations; n variables, k = domain size), and all the inferred zero beliefs are correct.
- Incompleteness: BP is as weak and as strong as arc-consistency (weak in general, strong for implicational constraints).
- Continuity hypothesis: IBP is sound for zero beliefs → is IBP also accurate for extreme beliefs? Tested empirically.

Experimental Results

We investigated empirically if the results for zero beliefs extend to ε-small beliefs (ε > 0)

- Network types: coding, linkage analysis*, grids*, two-layer noisy-OR*, CPCS54 and CPCS360 (grouped by whether they contain determinism)
- Measures: exact/IJGP histograms; recall absolute error; precision absolute error
- Algorithms: IBP, IJGP

* Instances from the UAI08 competition

Networks with Determinism: Coding

[Plots: N=200, 1000 instances, w*=15.]

Networks with Determinism: Linkage

[Plots: exact vs. IJGP histograms and recall/precision absolute errors for pedigree1 (w*=21) and pedigree37 (w*=30), at i-bound 3 and i-bound 7.]

Networks without Determinism: bn2o

[Plots: three bn2o instances, w* = 24, 27, 26.]

Cutset Phenomena & Irrelevant Nodes

- Observed variables break the flow of inference; BP is exact when the evidence variables form a loop-cutset.
- An unobserved variable with no observed descendants sends zero information to its parents: it is irrelevant.
- In a network without evidence, BP converges in one top-down iteration.

Nodes with Extreme Support

Observed variables with extreme priors or extreme support can nearly cut the flow of information. [Plot: average error vs. the root's prior in a small two-layer network (root A; intermediate nodes B1, B2, C1, C2; evidence nodes E_B, E_C, E_D); the error is largest for intermediate priors and vanishes toward the extremes 0.00001 and 0.99999.]

Conclusion: Networks with Determinism

- BP converges and is sound for zero beliefs.
- IBP's power to infer zeros is as weak and as strong as arc-consistency.
- However: inference of extreme (nonzero) beliefs can be wrong.
- Cutset property (Bidyuk and Dechter, 2000): evidence and inferred singleton variables act like a cutset; if the zeros form a cycle-cutset, all beliefs are exact; extensions to epsilon-cutsets were supported empirically.
- IJGP is an anytime propagation scheme with a good accuracy/cost tradeoff.

Thank You

Rina Dechter, Bozhena Bidyuk, Robert Mateescu and Emma Rollon. "On the Power of Belief Propagation: A Constraint Propagation Perspective" in Festschrift book in honor of Judea Pearl, 2010

Heuristics, Probability and Causality, A Tribute to Judea Pearl Editors Rina Dechter, Hector Geffner and Joseph Y. Halpern http://bayes.cs.ucla.edu/TRIBUTE/pearl-tribute2010.htm

http://www.ics.uci.edu/~dechter/publications.html

Coding Networks – Bit Error Rate

[Plots: bit error rate (BER) vs. i-bound for IBP, IJGP, and MC on coding networks, N=400, 30 iterations, w*=43; channel noise σ = 0.22 (1000 instances) and σ = 0.32, 0.51, 0.65 (500 instances each).]

CPCS 422 – KL Distance

[Plots: KL distance vs. i-bound for IJGP (at convergence), MC, and IBP (at convergence) on one CPCS 422 instance (w*=23), with evidence = 0 and evidence = 30.]

CPCS 422 – KL vs. Iterations

[Plots: KL distance vs. number of iterations for IJGP(3), IJGP(10), and IBP on the same instance, with evidence = 0 and evidence = 30.]

Time vs. Space for w-cutset (Dechter and El-Fatah, 2000) (Larrosa and Dechter, 2001) (Rish and Dechter, 2000)

evidence=0 evidence=30 Time vs Space for w-cutset (Dechter and El-Fatah, 2000) (Larrosa and Dechter, 2001) (Rish and Dechter 2000)

- Random graphs (50 nodes, 200 edges, average degree 8, w* ≈ 23)

[Plot: runtime of branch and bound (search) vs. bucket elimination (inference) as a function of w in the w-cutset scheme: time O(exp(w + cutset-size)), space O(exp(w)).]