Belief Propagation in Bayesian Networks

Céline Comte

Reading Group “Network Theory”, November 5, 2018
© 2018 Nokia

Introduction

(First-order) logic: represent causal relations between variables by a directed acyclic graph.

Probabilities: weight these causal relations by probabilities that implicitly account for non-represented variables.

“Belief networks are directed acyclic graphs in which the nodes represent propositions (or variables), the arcs signify direct dependencies between the linked propositions, and the strengths of these dependencies are quantified by conditional probabilities” (Pearl, 1986)

Bayesian networks are also …

• A memory-efficient way of storing a PMF
• Based on simple probability rules (more details in a few slides)
• Inspired by human causal reasoning (Pearl, 1986, 1988)
• Used for decision making if a utility function is provided
• Applied in many fields: medical diagnosis, turbo codes, (programming) language detection, …
• Related to other models: Markov random fields, Markov chains, hidden Markov models, …

References: Pearl’s articles and book

• J. Pearl (1982). “Reverend Bayes on Inference Engines: A Distributed Hierarchical Approach”. In: AAAI’82 → Belief propagation in causal trees

• J. Pearl (1986). “Fusion, propagation, and structuring in belief networks”. In: Artificial Intelligence → Belief propagation in causal trees and polytrees

• J. Pearl (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann → A complete reference (thanks Achille for providing me with this book)

References: Textbooks

• T. D. Nielsen and F. V. Jensen (2007). Bayesian Networks and Decision Graphs. Springer-Verlag → A lot of examples in Chapters 2 and 3

• D. Koller, N. Friedman, and F. Bach (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press

• M. Jordan (last modified 2015). An Introduction to Probabilistic Graphical Models. → Definitions and belief propagation (thanks Nathan for pointing out this reference)

Outline

Reminders on probability theory

Bayesian networks

Belief propagation in trees

Belief propagation in polytrees



Independence and conditional independence

Remark: we work exclusively with discrete random variables.

• A and B are marginally independent (written A ⊥⊥ B) if one of these three equivalent conditions is satisfied:
− P(A,B) = P(A)P(B)
− P(A | B) = P(A)
− P(B | A) = P(B)

• A and B are conditionally independent given C (written A ⊥⊥ B | C) if one of these three equivalent conditions is satisfied:
− P(A,B | C) = P(A | C)P(B | C)
− P(A | B,C) = P(A | C)
− P(B | A,C) = P(B | C)
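Both definitions can be checked numerically on any joint PMF. A minimal NumPy sketch (all probability values are made up, and the joint is constructed so that A ⊥⊥ B | C holds while A ⊥⊥ B fails):

```python
import numpy as np

# Hypothetical joint PMF P(A,B,C) over binary variables, built as
# P(C) P(A|C) P(B|C): A and B are conditionally independent given C
# by construction, but not marginally independent.
p_c = np.array([0.6, 0.4])
p_a_c = np.array([[0.9, 0.1], [0.3, 0.7]])  # p_a_c[c, a] = P(A=a | C=c)
p_b_c = np.array([[0.8, 0.2], [0.1, 0.9]])  # p_b_c[c, b] = P(B=b | C=c)
joint = np.einsum('c,ca,cb->abc', p_c, p_a_c, p_b_c)  # joint[a, b, c]

def cond_indep(joint):
    """Check P(A,B | C) = P(A | C) P(B | C) for every value of C."""
    pc = joint.sum(axis=(0, 1))            # P(C)
    p_ab_c = joint / pc                    # P(A,B | C)
    pa_c = joint.sum(axis=1) / pc          # P(A | C)
    pb_c = joint.sum(axis=0) / pc          # P(B | C)
    return np.allclose(p_ab_c, pa_c[:, None, :] * pb_c[None, :, :])

def marg_indep(joint):
    """Check P(A,B) = P(A) P(B) after summing out C."""
    p_ab = joint.sum(axis=2)
    return np.allclose(p_ab, np.outer(p_ab.sum(axis=1), p_ab.sum(axis=0)))

print(cond_indep(joint))  # True
print(marg_indep(joint))  # False
```

Marginally, the shared cause C correlates A and B; conditioning on C removes the correlation, exactly as in the definitions above.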


Useful rules

• The chain rule of probabilities: if A1, …, An are random variables, we have

P(A1, …, An) = P(A1) × P(A2 | A1) × P(A3 | A1,A2) × ··· × P(An | A1, …, An−1)

• Law of total probability: if A and B are two random variables,

P(B) = ∑_A P(B | A) P(A)

Useful rules

• Bayes’ rule: if A and B are two random variables,

P(B | A) = P(A | B) P(B) / P(A)

We can see P(A) as a normalizing constant: we can first compute P(B | A) ∝ P(A | B)P(B) for each value of B and then normalize to obtain P(B | A) without computing P(A)
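The normalizing-constant trick can be sketched in a few lines of NumPy (the disease/test numbers are invented for illustration):

```python
import numpy as np

# B: hypothetical disease indicator, A: test result.
p_b = np.array([0.99, 0.01])            # prior P(B)
p_a_b = np.array([[0.95, 0.05],         # P(A | B=0)
                  [0.10, 0.90]])        # P(A | B=1)

a = 1  # observed value of A (a positive test)

# Unnormalized posterior P(B | A=a) ∝ P(A=a | B) P(B);
# normalizing at the end avoids computing P(A) explicitly.
unnormalized = p_a_b[:, a] * p_b
posterior = unnormalized / unnormalized.sum()

print(posterior)  # posterior belief in B after the positive test
```

Dividing by the sum at the end is exactly the "∝ then normalize" shortcut described above.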


Glossary

• Belief in a random variable (“conviction” in French): the marginal distribution of this random variable (given the values of some observed variables)

• Observe a random variable

• Evidence (piece of evidence): the set of random variables that have been observed


The Student example

[Graph: nodes D (Course difficulty), I (Student intelligence), G (Grade), B (Baccalauréat), L (Reference letter); arrows D → G, I → G, I → B, G → L]

Borrowed from (Koller, Friedman, and Bach, 2009)


The Student example

• Local Markov property: each node is conditionally independent of its non-descendants given its parents:

D ⊥⊥ {I,B},  I ⊥⊥ D,  G ⊥⊥ B | {D,I},  B ⊥⊥ {D,G,L} | I,  L ⊥⊥ {D,I,B} | G

• Chain rule of Bayesian networks: by the chain rule of probabilities,

P(D,I,G,B,L) = P(D) P(I | D) P(G | D,I) P(B | D,I,G) P(L | D,I,G,B)
             = P(D) P(I) P(G | D,I) P(B | I) P(L | G)

These two definitions are equivalent
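The chain-rule factorization, and implied independencies such as D ⊥⊥ B, can be verified by brute force. A sketch with made-up CPT entries, using binary variables throughout for brevity:

```python
import numpy as np

# Hypothetical CPTs for the Student network (all variables binary here).
P_D = np.array([0.6, 0.4])                        # P(D)
P_I = np.array([0.7, 0.3])                        # P(I)
P_G = np.array([[[0.3, 0.7], [0.9, 0.1]],
                [[0.05, 0.95], [0.5, 0.5]]])      # P_G[d, i, g] = P(G=g | D=d, I=i)
P_B = np.array([[0.6, 0.4], [0.2, 0.8]])          # P_B[i, b] = P(B=b | I=i)
P_L = np.array([[0.9, 0.1], [0.4, 0.6]])          # P_L[g, l] = P(L=l | G=g)

# Chain rule of Bayesian networks: the joint is the product of the CPTs.
joint = np.einsum('d,i,dig,ib,gl->digbl', P_D, P_I, P_G, P_B, P_L)

# Implied independence D ⊥⊥ B: the marginal P(D,B) factorizes exactly.
p_db = joint.sum(axis=(1, 2, 4))                  # sum out I, G, L
p_b = p_db.sum(axis=0)
print(abs(joint.sum() - 1.0) < 1e-12)             # True: a valid PMF
print(np.allclose(p_db, np.outer(P_D, p_b)))      # True: D ⊥⊥ B
```

The factorization holds for any valid CPT entries; only the numbers above are invented.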

The Student example

Each node carries a conditional probability table given its parents: P(D) (Course difficulty), P(I) (Student intelligence), P(G | D,I) (Grade), P(B | I) (Baccalauréat), P(L | G) (Reference letter)

Bayesian networks in general

Described by:
• A directed acyclic graph
− Nodes ∼ (discrete) random variables X1, …, Xn
− Arrows ∼ conditional (in)dependencies
• Local conditional probability tables (CPTs)
− P(Xi | parents(Xi)) for each node Xi

Bayesian networks in general

Two equivalent definitions:
• Local Markov property: each node is conditionally independent of its non-descendants given its parents
• Chain rule of Bayesian networks:

P(X1, …, Xn) = ∏_{i=1}^{n} P(Xi | parents(Xi))

Proof of the equivalence: Corollary 4, p. 20 of (Pearl, 1988)

Base case: serial connection

X → Y → Z

X ⊥⊥ Z | Y        P(X,Y,Z) = P(X) P(Y | X) P(Z | Y)

• Interpretation: chain of causality, X “causes” Y, which “causes” Z
• Information can flow between X and Z through Y (that is, observing X changes our belief in Z and vice versa), unless Y is observed
• Example: Markov chains

Base case: diverging connection

X ← Y → Z

X ⊥⊥ Z | Y        P(X,Y,Z) = P(X | Y) P(Y) P(Z | Y)

• Interpretation: a single root cause Y with two observable consequences X and Z
• Information can flow between X and Z through Y, unless Y is observed

Base case: converging connection

X → Y ← Z

X ⊥⊥ Z        P(X,Y,Z) = P(X) P(Y | X,Z) P(Z)

• Interpretation: two possible explanations X and Z for an observed consequence Y
• “Explaining away” effect: information cannot flow between X and Z, unless Y is observed
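The “explaining away” effect shows up numerically in even the simplest converging connection. A pure-Python sketch in which X and Z are independent causes and Y = X OR Z (all numbers invented):

```python
from itertools import product

# X and Z are independent causes with P(X=1) = P(Z=1) = 0.3;
# the observed consequence is Y = X OR Z.
p = {}
for x, z in product([0, 1], repeat=2):
    p[(x, x | z, z)] = (0.3 if x else 0.7) * (0.3 if z else 0.7)

def prob(pred):
    return sum(q for xyz, q in p.items() if pred(*xyz))

p_x = prob(lambda x, y, z: x == 1)                          # prior belief in X
p_x_y = (prob(lambda x, y, z: x == 1 and y == 1)
         / prob(lambda x, y, z: y == 1))                    # after observing Y=1
p_x_yz = (prob(lambda x, y, z: x == 1 and y == 1 and z == 1)
          / prob(lambda x, y, z: y == 1 and z == 1))        # after also observing Z=1

print(p_x, p_x_y, p_x_yz)  # ≈0.3, ≈0.59, ≈0.3
```

Observing Y=1 raises the belief in X, but additionally observing the alternative explanation Z=1 drops it back to its prior: Z=1 “explains away” Y=1.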


Implied independencies

(Similar to the “strong Markov property” of Markov chains)

Which are correct?
1 G ⊥⊥ B? No
2 B ⊥⊥ L? No
3 D ⊥⊥ L? No
4 D ⊥⊥ B? Yes (by the local Markov property applied to D)
5 D ⊥⊥ B | G? No (“explaining away” effect)
6 D ⊥⊥ B | {I,G}? Yes


Implied independencies

Proof of 6, D ⊥⊥ B | {I,G}:

P(D,B | I,G) = P(G | D,I,B) P(D,B | I) / P(G | I)         (Bayes’ rule)
             = P(G | D,I) P(D,B | I) / P(G | I)           (local Markov property applied to G)
             = P(G | D,I) P(D | I) P(B | D,I) / P(G | I)  (definition of conditional probabilities)
             = P(D | G,I) P(B | D,I)                      (Bayes’ rule)
             = P(D | G,I) P(B | I,G)                      (local Markov property applied to B)


Memory and time complexity

Parameters:
n   number of random variables (typically, n ∼ hundreds or thousands)
r   number of values each variable can take
d↑  maximum number of parents of a node

Memory complexity:
• If we store the full joint probability distribution: O(r^n) entries
• If we store the node parents and the conditional probability tables: O(n(d↑ + r^{d↑+1})) = O(n r^{d↑+1}) entries

What about the time complexity?
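The gap between the two storage strategies is easy to make concrete with hypothetical numbers:

```python
# Hypothetical sizes: n binary variables (r = 2), each with
# at most d = 3 parents.
n, r, d = 100, 2, 3

full_joint = r ** n        # one entry per joint outcome: r^n
cpts = n * r ** (d + 1)    # one table of r^(d+1) entries per node

print(full_joint)  # 2^100, about 1.3e30 entries
print(cpts)        # 1600 entries
```

The factored representation is what makes networks with hundreds of variables tractable to store at all.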


Inference

“A guess that you make or an opinion that you form based on the information that you have” (Cambridge Dictionary)
→ In Bayesian networks: compute or update the belief in each variable given some evidence

Belief propagation, a.k.a. sum–product message passing: propagate the information through the network, starting from the evidence node(s)
• Each variable is a “separate processor” (a neuron?) that knows its own CPT and the messages received from its direct neighbors (Pearl, 1982)
• Dynamic programming



Tree Bayesian network

[Graph: A → B; B → C, D; D → E, F; with tables P(A), P(B | A), P(C | B), P(D | B), P(E | D), P(F | D)]

Each node (except the root) has at most one parent.

Each node separates the tree: its non-descendants and the subtrees rooted at each of its children are conditionally independent given this node.

Remark: we will explain the propagation algorithm on this toy example, borrowed from (Pearl, 1988)


No evidence: top-down propagation

• P(A): parameter
• P(B) = ∑_A P(B | A) P(A)
• P(C) = ∑_B P(C | B) P(B)
• P(D) = ∑_B P(D | B) P(B)

Each node X passes its marginal P(X) down to its children. Complexity: O(n r²)
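The top-down pass on the toy tree A → B, B → {C, D}, D → {E, F} can be sketched directly; the CPT entries are invented:

```python
import numpy as np

P_A = np.array([0.3, 0.7])
cpt = {                                          # cpt[Y][x, y] = P(Y=y | parent=x)
    'B': np.array([[0.7, 0.3], [0.2, 0.8]]),     # P(B | A)
    'C': np.array([[0.5, 0.5], [0.9, 0.1]]),     # P(C | B)
    'D': np.array([[0.6, 0.4], [0.1, 0.9]]),     # P(D | B)
    'E': np.array([[0.8, 0.2], [0.3, 0.7]]),     # P(E | D)
    'F': np.array([[0.9, 0.1], [0.4, 0.6]]),     # P(F | D)
}
children = {'A': ['B'], 'B': ['C', 'D'], 'C': [], 'D': ['E', 'F'], 'E': [], 'F': []}

# Each node passes its marginal down; the child sums out the parent:
# P(Y) = sum_X P(Y | X) P(X), one O(r^2) matrix-vector product per edge.
belief = {'A': P_A}
stack = ['A']
while stack:
    x = stack.pop()
    for y in children[x]:
        belief[y] = belief[x] @ cpt[y]
        stack.append(y)

for node in 'ABCDEF':
    print(node, belief[node])
```

One matrix–vector product per edge over n nodes gives the O(n r²) complexity stated above.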


28/40 © 2018 Nokia Public Three pieces of evidence

A • Evidence: We observe that = = = B C c, E e, and F f • Objective: Compute the c D belief BEL(X) = P(X | c,e,f) of each node X

e f • Principle: Propagate the information through the network, starting from the X evidence nodes

Y

Causal and diagnostic support

By Bayes’ rule:
• P(A | c, e, f) = P(c, e, f | A) P(A) / P(c, e, f) ∝ P(c, e, f | A) P(A)
• P(B | c, e, f) = P(c, e, f | B) P(B) / P(c, e, f) ∝ P(c, e, f | B) P(B)
• P(D | c, e, f) = P(e, f | D, c) P(D | c) / P(e, f | c) = P(e, f | D) P(D | c) / P(e, f | c) ∝ P(e, f | D) P(D | c)
  (the second equality holds because e and f are independent of c given D)

29/40 © 2018 Nokia Public
Causal and diagnostic support

For each node X, we compute
• Diagnostic support P(evidence below X | X): bottom-up propagation
• Causal support P(X | evidence above X): top-down propagation
• BEL(X) ∝ P(evidence below X | X) × P(X | evidence above X)

30/40 © 2018 Nokia Public
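The combination rule above is a pointwise product of the two supports followed by a normalization. A minimal sketch, with placeholder numbers:

```python
# BEL(X) ∝ P(evidence below X | X) × P(X | evidence above X).
# The values of lam and pi below are illustrative, not from the slides.

def belief(diagnostic, causal):
    """Pointwise product of the two supports, normalized to sum to 1."""
    unnorm = [d * c for d, c in zip(diagnostic, causal)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

lam = [0.3, 0.6]   # diagnostic support: a likelihood, need not sum to 1
pi = [0.7, 0.3]    # causal support: a probability distribution
bel = belief(lam, pi)
print(bel)
```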
Diagnostic support P(evidence below X | X)

Bottom-up propagation

• P(e, f | D) = P(e | D) P(f | D)
• Compute P(e, f | B):
  P(e, f | B) = ∑_D P(e, f | B, D) P(D | B)
             = ∑_D P(e, f | D) P(D | B)
• P(c, e, f | B) = P(c | B) P(e, f | B)

[Figure: the tree annotated with the upward messages P(e | D), P(f | D), P(c | B), and P(e, f | B).]

31/40 © 2018 Nokia Public
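The bottom-up pass can be written out directly from these three bullets. The CPT values below are illustrative assumptions; the evidence is taken to be c = e = f = 1.

```python
# Bottom-up pass for the diagnostic support, with illustrative binary CPTs.
cpt_c = [[0.7, 0.3], [0.4, 0.6]]   # cpt_c[b][x] = P(C = x | B = b)
cpt_d = [[0.5, 0.5], [0.1, 0.9]]   # P(D | B)
cpt_e = [[0.8, 0.2], [0.3, 0.7]]   # P(E | D)
cpt_f = [[0.6, 0.4], [0.2, 0.8]]   # P(F | D)
c = e = f = 1                      # observed values

# P(e, f | D) = P(e | D) P(f | D): children of D are independent given D.
lam_d = [cpt_e[d][e] * cpt_f[d][f] for d in range(2)]

# P(e, f | B) = ∑_D P(e, f | D) P(D | B)
lam_ef_b = [sum(lam_d[d] * cpt_d[b][d] for d in range(2)) for b in range(2)]

# P(c, e, f | B) = P(c | B) P(e, f | B)
lam_b = [cpt_c[b][c] * lam_ef_b[b] for b in range(2)]
print(lam_b)
```

Note that these diagnostic supports are likelihoods, so they need not sum to 1.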
Diagnostic support P(evidence below X | X)

Bottom-up propagation

• Compute P(c, e, f | A):
  P(c, e, f | A) = ∑_B P(c, e, f | A, B) P(B | A)
               = ∑_B P(c, e, f | B) P(B | A)

[Figure: the tree annotated with the upward message P(c, e, f | A).]

32/40 © 2018 Nokia Public
Causal support P(X | evidence above X)

Top-down propagation

• P(A): parameter
  → BEL(A) ∝ P(c, e, f | A) P(A)
• P(B) = ∑_A P(B | A) P(A)
  → BEL(B) ∝ P(c, e, f | B) P(B)

[Figure: the tree annotated with the downward message P(A).]

33/40 © 2018 Nokia Public
Causal support P(X | evidence above X)

Top-down propagation

• P(D | c) = ∑_B P(D | B, c) P(B | c)
          = ∑_B P(D | B) P(B | c)
• Compute P(B | c):
  P(B | c, e, f) = P(e, f | B, c) P(B | c) / P(e, f | c),
  i.e. P(B | c) ∝ BEL(B) / P(e, f | B)
  → BEL(D) ∝ P(e, f | D) P(D | c)

[Figure: the tree annotated with the downward messages P(A) and P(B | c).]

34/40 © 2018 Nokia Public
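This top-down step with evidence can be sketched in plain Python. All numeric values (the belief at B, the diagnostic supports, the CPT) are illustrative assumptions:

```python
# Top-down pass with evidence, sketched with illustrative numbers.
def normalize(v):
    z = sum(v)
    return [x / z for x in v]

bel_b = [0.2, 0.8]                 # BEL(B) = P(B | c, e, f), assumed known
lam_ef_b = [0.32, 0.512]           # P(e, f | B), from the bottom-up pass
cpt_d = [[0.5, 0.5], [0.1, 0.9]]   # cpt_d[b][d] = P(D = d | B = b)
lam_d = [0.08, 0.56]               # P(e, f | D)

# P(B | c) ∝ BEL(B) / P(e, f | B): divide out the evidence below B via D.
p_b_given_c = normalize([bel_b[b] / lam_ef_b[b] for b in range(2)])

# P(D | c) = ∑_B P(D | B) P(B | c), since D is independent of c given B.
p_d_given_c = [sum(cpt_d[b][d] * p_b_given_c[b] for b in range(2))
               for d in range(2)]

# BEL(D) ∝ P(e, f | D) P(D | c)
bel_d = normalize([lam_d[d] * p_d_given_c[d] for d in range(2)])
print(bel_d)
```

The division by P(e, f | B) is what keeps the message sent from B to D free of the evidence that D itself reported upward.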
Summary

[Figure: the tree, with the generic bottom-up message P(evidence below Y | X) and top-down message P(X | evidence above Y) on the edge X → Y.]

Algorithm
• Diagnostic support P(evidence below X | X): bottom-up propagation
• Causal support P(X | evidence above X): top-down propagation

In general
• Use a topological ordering
• Complexity: O(r d↓ + r² + r)

35/40 © 2018 Nokia Public
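The two passes can be combined and sanity-checked against brute-force enumeration on a small chain. The chain A → B → D → E and all CPT values below are illustrative, not from the slides; the point is that the two-pass belief at B equals the exact conditional P(B | e).

```python
# End-to-end check on a chain A -> B -> D -> E with evidence E = 1.
from itertools import product

p_a = [0.6, 0.4]
cpt = {  # cpt[child][parent_value][child_value]
    "B": [[0.9, 0.1], [0.2, 0.8]],
    "D": [[0.5, 0.5], [0.1, 0.9]],
    "E": [[0.8, 0.2], [0.3, 0.7]],
}
e_obs = 1

def normalize(v):
    z = sum(v)
    return [x / z for x in v]

# Bottom-up: lam(D) = P(e | D), then lam(B) = ∑_D P(e | D) P(D | B).
lam_d = [cpt["E"][d][e_obs] for d in range(2)]
lam_b = [sum(lam_d[d] * cpt["D"][b][d] for d in range(2)) for b in range(2)]

# Top-down: pi(B) = P(B) = ∑_A P(B | A) P(A) (no evidence above B).
pi_b = [sum(p_a[a] * cpt["B"][a][b] for a in range(2)) for b in range(2)]

bel_b = normalize([lam_b[b] * pi_b[b] for b in range(2)])

# Brute force: P(B | e) by summing the joint over all configurations.
joint = [0.0, 0.0]
for a, b, d in product(range(2), repeat=3):
    joint[b] += p_a[a] * cpt["B"][a][b] * cpt["D"][b][d] * cpt["E"][d][e_obs]
brute = normalize(joint)

assert all(abs(x - y) < 1e-9 for x, y in zip(bel_b, brute))
print(bel_b)
```

Brute force is exponential in the number of variables; the two-pass scheme visits each edge twice, which is the time efficiency the summary claims.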
Additional remarks

• If the evidence node is not a leaf: add a phantom node
• The calculations can be written as matrix products
  − Belief, causal and diagnostic supports, messages ∼ vectors
  − CPT ∼ matrix
• Asynchronous / parallel updates: acknowledgements (Pearl, 1982, 1986, 1988)

36/40 © 2018 Nokia Public
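The matrix-product remark in code form: viewing the CPT as a matrix, a bottom-up message is a matrix-vector product. The values below are illustrative.

```python
# A bottom-up message as a matrix-vector product.
# cpt[u][x] = P(X = x | U = u); lam_x = P(evidence below X | X).
# Then P(evidence below X | U) = cpt @ lam_x, sent from X to its parent U.

def matvec(m, v):
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]

cpt = [[0.5, 0.5], [0.1, 0.9]]   # P(X | U), rows indexed by U
lam_x = [0.08, 0.56]             # diagnostic support at X

lam_to_parent = matvec(cpt, lam_x)
print(lam_to_parent)
```

A linear-algebra library would do the same in one call; the plain-Python form just makes the correspondence between messages and vectors explicit.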
Outline

Reminders on probability theory

Bayesian networks

Belief propagation in trees

Belief propagation in polytrees

37/40 © 2018 Nokia Public
Polytree (or singly-connected) Bayesian network

The underlying undirected graph is a tree

[Figure: a polytree on nodes D, I, G, B, L.]

Separation properties
• Given a node, the non-descendants and the subtrees rooted at each of its children are independent
• If we condition on neither a node nor any of its descendants, the inverted subtrees rooted at its ancestors are independent

38/40 © 2018 Nokia Public
Belief propagation

P(X | evidence) ∝ P(evidence below X | X) × P(X | evidence above X)

[Figure: polytree with parents A and B of X and children C and D of X; ↑AX and ↑BX denote the subtrees above X through A and B, ↓XC and ↓XD the subtrees below X through C and D.]

• Diagnostic support P(evidence below X | X)
  Bottom-up propagation
  ↓XC and ↓XD are independent given X
• Causal support P(X | evidence above X)
  Top-down propagation
  ↑AX and ↑BX are independent

39/40 © 2018 Nokia Public
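The independence of the subtrees above X (and the fact that they become coupled once X is observed, the "explaining away" effect) can be checked numerically on a single v-structure A → X ← B. The priors and CPT below are illustrative.

```python
# Numeric check: in A -> X <- B, A and B are independent a priori
# but dependent once X is observed. Values are illustrative.
from itertools import product

p_a = [0.7, 0.3]
p_b = [0.4, 0.6]
# cpt[a][b][x] = P(X = x | A = a, B = b)
cpt = [[[0.9, 0.1], [0.5, 0.5]],
       [[0.4, 0.6], [0.05, 0.95]]]

joint = {(a, b, x): p_a[a] * p_b[b] * cpt[a][b][x]
         for a, b, x in product(range(2), repeat=3)}

# Marginally, P(A = 0, B = 0) = P(A = 0) P(B = 0).
p_ab = sum(joint[(0, 0, x)] for x in range(2))
assert abs(p_ab - p_a[0] * p_b[0]) < 1e-12

# Conditioned on X = 1, the factorization breaks.
z = sum(v for (a, b, x), v in joint.items() if x == 1)
p_ab_x = joint[(0, 0, 1)] / z
p_a_x = sum(joint[(0, b, 1)] for b in range(2)) / z
p_b_x = sum(joint[(a, 0, 1)] for a in range(2)) / z
print(abs(p_ab_x - p_a_x * p_b_x))  # strictly positive: dependent given X
```

This is why the causal support combines the parent messages by a plain product, while the evidence at X itself must not be conditioned on when doing so.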
Conclusion

• Bayesian networks
  A memory-efficient way of storing a PMF by leveraging conditional independencies between variables

• Belief propagation
  A time-efficient algorithm for computing the belief
  − Asynchronous, parallelizable
  − Exact in (poly)trees
  − In general, extended to the junction tree algorithm and to other (approximate) algorithms

40/40 © 2018 Nokia Public