
Bayesian networks: inference and learning

CS194-10 Fall 2011 Lecture 22

Outline

♦ Exact inference (briefly)
♦ Approximate inference (rejection sampling, MCMC)
♦ Parameter learning

Inference by enumeration

Slightly intelligent way to sum out variables from the joint without actually constructing its explicit representation.

Simple query on the burglary network (B → A ← E, with A → J and A → M):

P(B | j, m) = P(B, j, m)/P(j, m) = αP(B, j, m) = α Σ_e Σ_a P(B, e, a, j, m)

Rewrite full joint entries using product of CPT entries:

P(B | j, m) = α Σ_e Σ_a P(B)P(e)P(a | B, e)P(j | a)P(m | a)
            = αP(B) Σ_e P(e) Σ_a P(a | B, e)P(j | a)P(m | a)

Recursive depth-first enumeration: O(D) space, O(K^D) time
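The recursive depth-first enumeration can be sketched as follows. This is a minimal sketch, not the lecture's code: the dictionary-based network representation and function names are assumptions, while the CPT values are the standard burglary-network numbers from the evaluation tree.

```python
# Inference by enumeration on the burglary network.
# Each CPT entry returns P(var = True) given the parent values already in e.
CPT = {
    "B": lambda e: 0.001,
    "E": lambda e: 0.002,
    "A": lambda e: {(True, True): 0.95, (True, False): 0.94,
                    (False, True): 0.29, (False, False): 0.001}[(e["B"], e["E"])],
    "J": lambda e: 0.90 if e["A"] else 0.05,
    "M": lambda e: 0.70 if e["A"] else 0.01,
}
ORDER = ["B", "E", "A", "J", "M"]   # topological order

def prob(var, value, e):
    p = CPT[var](e)
    return p if value else 1 - p

def enumerate_all(variables, e):
    """Depth-first recursion: multiply in evidence vars, sum out hidden vars."""
    if not variables:
        return 1.0
    y, rest = variables[0], variables[1:]
    if y in e:
        return prob(y, e[y], e) * enumerate_all(rest, e)
    # hidden variable: sum over both values
    return sum(prob(y, v, dict(e, **{y: v})) * enumerate_all(rest, dict(e, **{y: v}))
               for v in (True, False))

def enumeration_ask(x, evidence):
    """Return the normalized distribution [P(x | e), P(¬x | e)]."""
    dist = [enumerate_all(ORDER, dict(evidence, **{x: v})) for v in (True, False)]
    total = sum(dist)
    return [d / total for d in dist]
```

For the slide's query P(B | j, m), `enumeration_ask("B", {"J": True, "M": True})` sums out E and A exactly as in the nested-sum formula above.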

Evaluation tree

[Evaluation tree for the query, with CPT entries at the nodes: P(b) = .001; P(e) = .002, P(¬e) = .998; P(a|b,e) = .95, P(¬a|b,e) = .05, P(a|b,¬e) = .94, P(¬a|b,¬e) = .06; P(j|a) = .90, P(j|¬a) = .05; P(m|a) = .70, P(m|¬a) = .01]

Enumeration is inefficient: repeated computation, e.g., it computes P(j|a)P(m|a) for each value of e

Efficient exact inference

♦ Junction tree algorithm: cache and avoid repeated computation (a generalized form of dynamic programming)
♦ Singly connected networks (polytrees):
  – any two nodes are connected by at most one (undirected) path
  – time and space cost of exact inference are O(K^L D), linear in the size of the network (L = maximum number of parents)
♦ Multiply connected networks:
  – can reduce 3SAT to exact inference ⇒ NP-hard
  – equivalent to counting 3SAT models ⇒ #P-complete

[Figure: reduction of 3SAT to inference. Root variables A, B, C, D, each with prior 0.5; clause nodes L1, L2, L3 (clause 1 over A, B, C; clause 2 over C, D, A; clause 3 over B, C, D; the original negation marks did not survive extraction); an AND node combines the three clauses]

Inference by stochastic simulation

Idea: replace the sum over hidden-variable assignments with a random sample.
1) Draw N samples from a sampling distribution S
2) Compute an approximate posterior P̂
3) Show this converges to the true probability P

Outline:
– Sampling from an empty network
– Rejection sampling: reject samples disagreeing with evidence
– Markov chain Monte Carlo (MCMC): sample from a Markov chain whose stationary distribution is the true posterior

Sampling from an empty network

function Prior-Sample(bn) returns an event sampled from the prior specified by bn
  inputs: bn, a Bayesian network specifying joint distribution P(X1, ..., XD)
  x ← an event with D elements
  for j = 1, ..., D do
    x[j] ← a random sample from P(Xj | values of parents(Xj) in x)
  return x
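A minimal Python sketch of Prior-Sample, assuming a hypothetical list-of-tuples network representation (not from the lecture): variables appear in topological order, each with its parents and a CPT mapping parent-value tuples to P(var = True). The CPT values are those of the sprinkler network used in the examples below.

```python
import random

# (name, parents, CPT) triples, in topological order; CPTs give P(var = True).
SPRINKLER_NET = [
    ("Cloudy",    [],                    {(): 0.5}),
    ("Sprinkler", ["Cloudy"],            {(True,): 0.10, (False,): 0.50}),
    ("Rain",      ["Cloudy"],            {(True,): 0.80, (False,): 0.20}),
    ("WetGrass",  ["Sprinkler", "Rain"], {(True, True): 0.99, (True, False): 0.90,
                                          (False, True): 0.90, (False, False): 0.01}),
]

def prior_sample(bn):
    """Sample one complete event from the prior distribution defined by bn."""
    x = {}
    for var, parents, cpt in bn:
        # Parents are already sampled because bn is in topological order.
        p_true = cpt[tuple(x[p] for p in parents)]
        x[var] = random.random() < p_true
    return x
```

Each call returns one complete assignment; by the S_PS argument on the next slide, the long-run fraction of any event among such samples converges to its prior probability.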

CS194-10 Fall 2011 Lecture 22 7 Example P(C) .50

Cloudy

C P(S|C) C P(|C) T .10 Sprinkler Rain T .80 F .50 F .20

Wet Grass SRP(W|S,R) TT .99 TF .90 FT .90 FF .01

CS194-10 Fall 2011 Lecture 22 8 Example P(C) .50

Cloudy

C P(S|C) C P(R|C) T .10 Sprinkler Rain T .80 F .50 F .20

Wet Grass SRP(W|S,R) TT .99 TF .90 FT .90 FF .01

CS194-10 Fall 2011 Lecture 22 9 Example P(C) .50

Cloudy

C P(S|C) C P(R|C) T .10 Sprinkler Rain T .80 F .50 F .20

Wet Grass SRP(W|S,R) TT .99 TF .90 FT .90 FF .01

CS194-10 Fall 2011 Lecture 22 10 Example P(C) .50

Cloudy

C P(S|C) C P(R|C) T .10 Sprinkler Rain T .80 F .50 F .20

Wet Grass SRP(W|S,R) TT .99 TF .90 FT .90 FF .01

CS194-10 Fall 2011 Lecture 22 11 Example P(C) .50

Cloudy

C P(S|C) C P(R|C) T .10 Sprinkler Rain T .80 F .50 F .20

Wet Grass SRP(W|S,R) TT .99 TF .90 FT .90 FF .01

CS194-10 Fall 2011 Lecture 22 12 Example P(C) .50

Cloudy

C P(S|C) C P(R|C) T .10 Sprinkler Rain T .80 F .50 F .20

Wet Grass SRP(W|S,R) TT .99 TF .90 FT .90 FF .01

CS194-10 Fall 2011 Lecture 22 13 Example P(C) .50

Cloudy

C P(S|C) C P(R|C) T .10 Sprinkler Rain T .80 F .50 F .20

Wet Grass SRP(W|S,R) TT .99 TF .90 FT .90 FF .01

Sampling from an empty network contd.

Probability that Prior-Sample generates a particular event:

S_PS(x1, . . . , xD) = Π_{j=1}^D P(xj | parents(Xj)) = P(x1, . . . , xD)

i.e., the true prior probability

E.g., S_PS(t, f, t, t) = 0.5 × 0.9 × 0.8 × 0.9 = 0.324 = P(t, f, t, t)

Let N_PS(x1, . . . , xD) be the number of samples generated for event x1, . . . , xD. Then we have

lim_{N→∞} P̂(x1, . . . , xD) = lim_{N→∞} N_PS(x1, . . . , xD)/N = S_PS(x1, . . . , xD) = P(x1, . . . , xD)

That is, estimates derived from Prior-Sample are consistent.

Shorthand: P̂(x1, . . . , xD) ≈ P(x1, . . . , xD)

Rejection sampling

P̂(X | e) estimated from samples agreeing with e

function Rejection-Sampling(X, e, bn, N) returns an estimate of P(X|e)
  local variables: N, a vector of counts for each value of X, initially zero
  for i = 1 to N do
    x ← Prior-Sample(bn)
    if x is consistent with e then
      N[x] ← N[x] + 1 where x is the value of X in x
  return Normalize(N)

E.g., estimate P(Rain | Sprinkler = true) using 100 samples
27 samples have Sprinkler = true
Of these, 8 have Rain = true and 19 have Rain = false.
P̂(Rain | Sprinkler = true) = Normalize(⟨8, 19⟩) = ⟨0.296, 0.704⟩
Similar to a basic real-world empirical estimation procedure
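The Rejection-Sampling pseudocode can be sketched in Python as follows. The network representation and helper `prior_sample` are assumptions (the same hypothetical list-of-tuples layout as the earlier sketch); the CPT values are the sprinkler network's.

```python
import random

NET = [
    ("Cloudy",    [],                    {(): 0.5}),
    ("Sprinkler", ["Cloudy"],            {(True,): 0.10, (False,): 0.50}),
    ("Rain",      ["Cloudy"],            {(True,): 0.80, (False,): 0.20}),
    ("WetGrass",  ["Sprinkler", "Rain"], {(True, True): 0.99, (True, False): 0.90,
                                          (False, True): 0.90, (False, False): 0.01}),
]

def prior_sample(bn):
    x = {}
    for var, parents, cpt in bn:
        x[var] = random.random() < cpt[tuple(x[p] for p in parents)]
    return x

def rejection_sampling(query, evidence, bn, n):
    """Estimate P(query | evidence): keep only samples consistent with evidence."""
    counts = {True: 0, False: 0}
    for _ in range(n):
        s = prior_sample(bn)
        if all(s[var] == val for var, val in evidence.items()):
            counts[s[query]] += 1
        # samples disagreeing with the evidence are simply rejected
    total = counts[True] + counts[False]
    return {v: c / total for v, c in counts.items()}
```

For the slide's query, the exact answer is P(rain | sprinkler) = 0.09/0.30 = 0.30, and the estimate converges to it as n grows; note that roughly 70% of the samples are wasted, which is the inefficiency analyzed on the next slide.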

Analysis of rejection sampling

P̂(X | e) = αN_PS(X, e)        (algorithm defn.)
         = N_PS(X, e)/N_PS(e)  (normalized by N_PS(e))
         ≈ P(X, e)/P(e)        (property of Prior-Sample)
         = P(X | e)            (defn. of conditional probability)

Hence rejection sampling returns consistent posterior estimates.

Problem: hopelessly expensive if P(e) is small
P(e) drops off exponentially with the number of evidence variables!

Approximate inference using MCMC

General idea of MCMC:
♦ Sample space Ω, probability π(ω) (e.g., the posterior given e)
♦ Would like to sample directly from π(ω), but it's hard
♦ Instead, wander around Ω randomly, collecting samples
♦ The random wandering is controlled by a transition kernel φ(ω → ω′) specifying the probability of moving to ω′ from ω (so the random state sequence ω0, ω1, . . . , ωt is a Markov chain)
♦ If φ is defined appropriately, the stationary distribution is π(ω), so that after a while (the mixing time) the collected samples are drawn from π

Gibbs sampling in Bayes nets

Markov chain state ωt = current assignment xt to all variables

Transition kernel: pick a variable Xj, sample it conditioned on all others

Markov blanket property: P (Xj | all other variables) = P (Xj | mb(Xj)) so generate next state by sampling a variable given its Markov blanket

function Gibbs-Ask(X, e, bn, N) returns an estimate of P(X|e)
  local variables: N, a vector of counts for each value of X, initially zero
                   Z, the nonevidence variables in bn
                   z, the current state of variables Z, initially random
  for i = 1 to N do
    choose Zj in Z uniformly at random
    set the value of Zj in z by sampling from P(Zj | mb(Zj))
    N[x] ← N[x] + 1 where x is the value of X in z
  return Normalize(N)
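Gibbs-Ask can be sketched as below, assuming a hypothetical dict-based network representation (name → (parents, CPT)); the Markov-blanket sampling step multiplies the variable's own CPT entry by its children's, as derived on the "Markov blanket sampling" slide. This is a sketch, not the lecture's code.

```python
import random

NET = {
    "Cloudy":    ([], {(): 0.5}),
    "Sprinkler": (["Cloudy"], {(True,): 0.10, (False,): 0.50}),
    "Rain":      (["Cloudy"], {(True,): 0.80, (False,): 0.20}),
    "WetGrass":  (["Sprinkler", "Rain"], {(True, True): 0.99, (True, False): 0.90,
                                          (False, True): 0.90, (False, False): 0.01}),
}

def p_given_parents(var, value, state):
    parents, cpt = NET[var]
    p_true = cpt[tuple(state[p] for p in parents)]
    return p_true if value else 1.0 - p_true

def sample_given_mb(var, state):
    """Sample var from P(var | mb(var)) ∝ own CPT entry × children's CPT entries."""
    children = [c for c, (ps, _) in NET.items() if var in ps]
    weights = []
    for value in (True, False):
        s = dict(state, **{var: value})
        w = p_given_parents(var, value, s)
        for child in children:
            w *= p_given_parents(child, s[child], s)
        weights.append(w)
    return random.random() < weights[0] / (weights[0] + weights[1])

def gibbs_ask(query, evidence, n):
    nonevidence = [v for v in NET if v not in evidence]
    state = dict(evidence, **{v: random.random() < 0.5 for v in nonevidence})
    counts = {True: 0, False: 0}
    for _ in range(n):
        zj = random.choice(nonevidence)       # pick a nonevidence variable
        state[zj] = sample_given_mb(zj, state)  # resample given its Markov blanket
        counts[state[query]] += 1
    return {v: c / n for v, c in counts.items()}
```

On the MCMC example query P(Rain | Sprinkler = true, WetGrass = true), the exact posterior is ≈ ⟨0.320, 0.680⟩, which the chain's visit frequencies approach.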

The Markov chain

With Sprinkler = true, WetGrass = true, there are four states:

[Figure: the four states of the chain, one for each joint assignment to (Cloudy, Rain), with transitions between states that differ in one variable]

MCMC example contd.

Estimate P(Rain | Sprinkler = true, WetGrass = true)
Sample Cloudy or Rain given its Markov blanket, repeat. Count the number of times Rain is true and false in the samples.
E.g., visit 100 states; 31 have Rain = true, 69 have Rain = false
P̂(Rain | Sprinkler = true, WetGrass = true) = Normalize(⟨31, 69⟩) = ⟨0.31, 0.69⟩

Markov blanket sampling

Markov blanket of Cloudy is Sprinkler and Rain
Markov blanket of Rain is Cloudy, Sprinkler, and WetGrass

Probability given the Markov blanket is calculated as follows:

P(x′j | mb(Xj)) = αP(x′j | parents(Xj)) Π_{Zℓ ∈ Children(Xj)} P(zℓ | parents(Zℓ))

E.g., with Sprinkler = true:
φ(¬cloudy, rain → cloudy, rain) = 0.5 × αP(cloudy)P(sprinkler | cloudy)P(rain | cloudy)
  = 0.5 × α × 0.5 × 0.1 × 0.8
  = 0.5 × 0.040/(0.040 + 0.050) = 0.2222
(the competing term for ¬cloudy is 0.5 × 0.5 × 0.2 = 0.050)

(Easy for discrete variables; the continuous case requires mathematical analysis for each combination of distribution types.)
Easily implemented in message-passing parallel systems, brains
Can converge slowly, especially for near-deterministic models
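The Markov-blanket calculation above can be checked numerically; the function name below is illustrative, and the numbers are the sprinkler network's CPT entries with Sprinkler = true and Rain = true.

```python
def p_cloudy_given_mb():
    """P(Cloudy = true | mb) ∝ P(c) · P(sprinkler | c) · P(rain | c)."""
    w_true = 0.5 * 0.10 * 0.80    # Cloudy = true:  0.040
    w_false = 0.5 * 0.50 * 0.20   # Cloudy = false: 0.050
    return w_true / (w_true + w_false)

# Multiplying by the 1/2 probability of picking Cloudy (two nonevidence
# variables) gives the transition kernel value from the slide.
phi = 0.5 * p_cloudy_given_mb()
```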

Theory for Gibbs sampling

Theorem: the stationary distribution for the Gibbs transition kernel is P(z | e); i.e., the long-run fraction of time spent in each state is exactly proportional to its posterior probability

Proof sketch:
– The Gibbs transition kernel satisfies detailed balance for P(z | e), i.e., for all z, z′ the "flow" from z to z′ is the same as from z′ to z
– π is the unique stationary distribution for any ergodic transition kernel satisfying detailed balance for π

Detailed balance

Let πt(z) be the probability the chain is in state z at time t.
Detailed balance condition: "outflow" = "inflow" for each pair of states:

πt(z)φ(z → z′) = πt(z′)φ(z′ → z) for all z, z′

Detailed balance ⇒ stationarity:

πt+1(z) = Σ_{z′} πt(z′)φ(z′ → z)
        = Σ_{z′} πt(z)φ(z → z′)
        = πt(z) Σ_{z′} φ(z → z′)
        = πt(z)

MCMC algorithms are typically constructed by designing a transition kernel φ that is in detailed balance with the desired π

Gibbs sampling transition kernel

Probability of choosing variable Zj to sample is 1/(D − |E|)

Let Z̄j be all other nonevidence variables, i.e., Z − {Zj}
Current values are zj and z̄j; e is fixed; the transition probability is

φ(z → z′) = φ(zj, z̄j → z′j, z̄j) = P(z′j | z̄j, e)/(D − |E|)

This gives detailed balance with P(z | e):

π(z)φ(z → z′) = 1/(D − |E|) · P(z | e) P(z′j | z̄j, e)
              = 1/(D − |E|) · P(zj, z̄j | e) P(z′j | z̄j, e)
              = 1/(D − |E|) · P(zj | z̄j, e) P(z̄j | e) P(z′j | z̄j, e)   (chain rule)
              = 1/(D − |E|) · P(zj | z̄j, e) P(z′j, z̄j | e)              (chain rule backwards)
              = φ(z′ → z)π(z′) = π(z′)φ(z′ → z)

Summary (inference)

Exact inference:
– polytime on polytrees, NP-hard on general graphs
– space = time, very sensitive to topology

Approximate inference by MCMC:
– generally insensitive to topology
– convergence can be very slow with probabilities close to 1 or 0
– can handle arbitrary combinations of discrete and continuous variables

Parameter learning: complete data

θjkℓ = P(Xj = k | Parents(Xj) = ℓ)
Let x(i)j = value of Xj in example i; assume Boolean for simplicity.

Log likelihood:

L(θ) = Σ_{i=1}^N Σ_{j=1}^D log P(x(i)j | parents(Xj)(i))
     = Σ_{i=1}^N Σ_{j=1}^D log θ_{j1ℓ(i)}^{x(i)j} (1 − θ_{j1ℓ(i)})^{1 − x(i)j}

∂L/∂θj1ℓ = Nj1ℓ/θj1ℓ − Nj0ℓ/(1 − θj1ℓ) = 0 gives

θj1ℓ = Nj1ℓ/(Nj0ℓ + Nj1ℓ) = Nj1ℓ/Njℓ.

I.e., learning decomposes completely; MLE = observed conditional frequency
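The counting MLE above can be sketched for a single Boolean node with one Boolean parent; the function name and the (parent_value, x_value) data layout are illustrative, not from the lecture.

```python
from collections import Counter

def mle_cpt(data):
    """theta_{j1,l} = N_{j1,l} / N_{j,l}: P(X = True | parent = l) by counting."""
    counts = Counter(data)                     # counts N_{jk,l} for each (l, k)
    cpt = {}
    for parent_val in (True, False):
        n_true = counts[(parent_val, True)]
        n_total = n_true + counts[(parent_val, False)]
        cpt[parent_val] = n_true / n_total
    return cpt

# Hypothetical complete data: 10 examples with parent = True, 10 with False.
data = [(True, True)] * 8 + [(True, False)] * 2 + \
       [(False, True)] * 1 + [(False, False)] * 9
```

Each CPT row is estimated independently from its own counts, which is exactly the decomposition the derivation establishes.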

Example

Red/green wrapper depends probabilistically on flavor:

[Network: Flavor → Wrapper, with P(F = cherry) = θ, P(W = red | cherry) = θ1, P(W = red | lime) = θ2]

Likelihood for, e.g., a cherry candy in a green wrapper:

P(F = cherry, W = green | hθ,θ1,θ2)
  = P(F = cherry | hθ,θ1,θ2) P(W = green | F = cherry, hθ,θ1,θ2)
  = θ · (1 − θ1)

N candies, c cherry and ℓ lime, rc red-wrapped cherry candies, etc.:

P(X | hθ,θ1,θ2) = θ^c (1 − θ)^ℓ · θ1^rc (1 − θ1)^gc · θ2^rℓ (1 − θ2)^gℓ

L = [c log θ + ℓ log(1 − θ)]
  + [rc log θ1 + gc log(1 − θ1)]
  + [rℓ log θ2 + gℓ log(1 − θ2)]

Example contd.

Derivatives of L contain only the relevant parameter:

∂L/∂θ  = c/θ − ℓ/(1 − θ) = 0    ⇒ θ  = c/(c + ℓ)
∂L/∂θ1 = rc/θ1 − gc/(1 − θ1) = 0 ⇒ θ1 = rc/(rc + gc)
∂L/∂θ2 = rℓ/θ2 − gℓ/(1 − θ2) = 0 ⇒ θ2 = rℓ/(rℓ + gℓ)
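A quick numeric check of these closed forms, using hypothetical counts: the closed-form ratios should maximize the candy log likelihood L from the previous slide.

```python
import math

# Hypothetical complete-data counts (not from the lecture).
c, l = 60, 40            # cherry / lime candies
rc, gc = 45, 15          # red- / green-wrapped cherry
rl, gl = 10, 30          # red- / green-wrapped lime

def log_lik(t, t1, t2):
    """L = [c log θ + ℓ log(1−θ)] + [rc log θ1 + gc log(1−θ1)] + [rℓ log θ2 + gℓ log(1−θ2)]"""
    return (c * math.log(t) + l * math.log(1 - t)
            + rc * math.log(t1) + gc * math.log(1 - t1)
            + rl * math.log(t2) + gl * math.log(1 - t2))

theta = c / (c + l)          # closed-form MLE for θ
theta1 = rc / (rc + gc)      # closed-form MLE for θ1
theta2 = rl / (rl + gl)      # closed-form MLE for θ2
```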

Hidden variables

Why learn models with hidden variables?

[Figure: (a) diagnostic network with a hidden HeartDisease variable — Smoking, Diet, Exercise (2 parameters each) → HeartDisease (54 parameters) → Symptom1, Symptom2, Symptom3 (6 each); (b) the same network with HeartDisease removed — the symptom CPTs then require 54, 162, and 486 parameters]

Hidden variables ⇒ simplified structure, fewer parameters ⇒ faster learning

Learning with and without hidden variables

[Figure: learning curves, average negative log likelihood per case (≈3.0–3.5) vs. number of training cases (0–5000), comparing the 3-3 network and 3-1-3 network with the APN algorithm, the NN algorithm with 1, 3, and 5 hidden nodes, and the target network]

EM for Bayes nets

For t = 0 to ∞ (until convergence) do
  E step: compute all p_{ijkℓ} = P(Xj = k, Parents(Xj) = ℓ | e(i), θ(t))
  M step: θ(t+1)_{jkℓ} = N̂jkℓ / Σ_{k′} N̂jk′ℓ = Σ_i p_{ijkℓ} / Σ_i Σ_{k′} p_{ijk′ℓ}

The E step can be any exact or approximate inference algorithm.
With MCMC, can treat each sample as a complete-data example.
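The E/M alternation can be sketched on the simplest version of the candy model: a hidden Bag and an observed Boolean Flavor, i.e. a two-component Bernoulli mixture. The data, starting point, and function names are hypothetical; the E step computes the posterior responsibilities (the p_{ijkℓ} above) and the M step renormalizes expected counts.

```python
import math

def em_step(data, theta, f1, f2):
    """One EM iteration for P(bag=1)=theta, P(cherry|bag=1)=f1, P(cherry|bag=2)=f2."""
    # E step: posterior responsibility of bag 1 for each candy
    resp = []
    for cherry in data:
        p1 = theta * (f1 if cherry else 1 - f1)
        p2 = (1 - theta) * (f2 if cherry else 1 - f2)
        resp.append(p1 / (p1 + p2))
    # M step: expected counts replace observed counts
    n1 = sum(resp)
    theta = n1 / len(data)
    f1 = sum(r for r, ch in zip(resp, data) if ch) / n1
    f2 = sum((1 - r) for r, ch in zip(resp, data) if ch) / (len(data) - n1)
    return theta, f1, f2

def log_lik(theta, f1, f2):
    m = theta * f1 + (1 - theta) * f2      # marginal P(cherry)
    return 70 * math.log(m) + 30 * math.log(1 - m)

data = [True] * 70 + [False] * 30          # 70 cherry, 30 lime
theta, f1, f2 = 0.5, 0.9, 0.2              # arbitrary starting point
for _ in range(30):
    theta, f1, f2 = em_step(data, theta, f1, f2)
```

Each iteration is guaranteed not to decrease the log likelihood, which is the behavior the candy-example plot on the next slide illustrates.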

Candy example

[Figure: network with a hidden Bag variable — Bag → Flavor, Wrapper, Hole, with P(Bag = 1) = θ, P(F = cherry | B = 1) = θF1, P(F = cherry | B = 2) = θF2 — alongside the log-likelihood curve, rising from about −2020 toward −1980 over roughly 120 EM iterations]

Example: Car insurance

[Network nodes: SocioEcon, Age, GoodStudent, ExtraCar, Mileage, RiskAversion, VehicleYear, SeniorTrain, DrivingSkill, MakeModel, DrivingHist, Antilock, DrivQuality, Airbag, CarValue, HomeBase, AntiTheft, Ruggedness, Accident, Theft, OwnDamage, Cushioning, OtherCost, OwnCost, MedicalCost, LiabilityCost, PropertyCost]

Example: Car insurance contd.

[Figure: learning curves, average negative log likelihood per case (≈1.8–3.4) vs. number of training cases (0–5000), comparing the 12-3 network and the insurance network with the APN algorithm, the NN algorithm with 1, 5, and 10 hidden nodes, and the target network]

Bayesian learning in Bayes nets

Parameters become variables (parents of their previous owners):

P(Wrapper = red | Flavor = cherry, Θ1 = θ1, Θ2 = θ2) = θ1.

The network is replicated for each example, with parameters shared across all examples:

[Figure: Θ is a parent of Flavor1, Flavor2, Flavor3; Θ1 and Θ2 are parents of Wrapper1, Wrapper2, Wrapper3; the original one-example network has P(F = cherry) = θ, P(W = red | cherry) = θ1, P(W = red | lime) = θ2]

Bayesian learning contd.

Priors for parameter variables: Beta, Dirichlet, Gamma, Gaussian, etc.
With independent Beta or Dirichlet priors, MAP EM learning = pseudocounts + expected counts:

M step: θ(t+1)_{jkℓ} = (ajkℓ + N̂jkℓ) / Σ_{k′} (ajk′ℓ + N̂jk′ℓ)

Implemented in EM training for Hugin and other Bayes net packages.
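The MAP M step above amounts to adding the Dirichlet pseudocounts to the expected counts before normalizing; a one-function sketch (names and numbers hypothetical):

```python
def map_estimate(expected_counts, pseudocounts):
    """MAP parameters for one (j, l) CPT row: normalize a_jkl + N_jkl over k."""
    totals = [n + a for n, a in zip(expected_counts, pseudocounts)]
    s = sum(totals)
    return [t / s for t in totals]

# E.g., expected counts [8, 2] with a uniform Dirichlet(1, 1) prior.
theta = map_estimate([8.0, 2.0], [1.0, 1.0])
```

With all pseudocounts zero this reduces to the plain EM M step; larger pseudocounts pull the estimates toward the prior, which also prevents zero probabilities for unseen values.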

Summary (parameter learning)

Complete data: the likelihood factorizes; each parameter θjkℓ is learned separately as the observed conditional frequency Njkℓ/Njℓ

Incomplete data: the likelihood is a summation over all values of the hidden variables; can apply EM by computing "expected counts"

Bayesian learning: parameters become variables in a replicated model, with their own prior distributions; Bayesian learning is then just ordinary inference in this model
