Bayesian Networks: Inference and Learning

Bayesian networks: Inference and learning CS194-10 Fall 2011 Lecture 22 CS194-10 Fall 2011 Lecture 22 1 Outline ♦ Exact inference (briefly) ♦ Approximate inference (rejection sampling, MCMC) ♦ Parameter learning CS194-10 Fall 2011 Lecture 22 2 Inference by enumeration Slightly intelligent way to sum out variables from the joint without actually constructing its explicit representation Simple query on the burglary network: P(B|j, m) B E = P(B, j, m)/P (j, m) = αP(B, j, m) A = α Σe Σa P(B, e, a, j, m) J M Rewrite full joint entries using product of CPT entries: P(B|j, m) = α Σe Σa P(B)P (e)P(a|B, e)P (j|a)P (m|a) = αP(B) Σe P (e) Σa P(a|B, e)P (j|a)P (m|a) Recursive depth-first enumeration: O(D) space, O(KD) time CS194-10 Fall 2011 Lecture 22 3 Evaluation tree P(b) .001 P(e) P( e) .002 .998 P(a|b,e) P( a|b,e) P(a|b, e) P( a|b, e) .95 .05 .94 .06 P(j|a) P(j| a) P(j|a) P(j| a) .90 .05 .90 .05 P(m|a) P(m| a) P(m|a) P(m| a) .70 .01 .70 .01 Enumeration is inefficient: repeated computation e.g., computes P (j|a)P (m|a) for each value of e CS194-10 Fall 2011 Lecture 22 4 Efficient exact inference Junction tree and variable elimination algorithms avoid repeated computation (generalized from of dynamic programming) ♦ Singly connected networks (or polytrees): – any two nodes are connected by at most one (undirected) path – time and space cost of exact inference are O(KLD) ♦ Multiply connected networks: – can reduce 3SAT to exact inference ⇒ NP-hard – equivalent to counting 3SAT models ⇒ #P-complete 0.5 0.5 0.5 0.5 A B C D L L 1. A v B v C L 2. C v D v A 1 2 3 L 3. B v C v D AND CS194-10 Fall 2011 Lecture 22 5 Inference by stochastic simulation Idea: replace sum over hidden-variable assignments with a random sample. 1) Draw N samples from a sampling distribution S 2) Compute an approximate posterior probability Pˆ 0.5 3) Show this converges to the true probability P Outline: Coin – Sampling from an empty network – Rejection sampling: reject samples disagreeing with evidence – Markov chain Monte Carlo (MCMC): sample from a stochastic process whose stationary distribution is the true posterior CS194-10 Fall 2011 Lecture 22 6 Sampling from an empty network function Prior-Sample(bn) returns an event sampled from prior specified by bn inputs: bn, a Bayesian network specifying joint distribution P(X1,...,XD) x ← an event with D elements for j = 1,...,D do x[j] ← a random sample from P(Xj | values of parents(Xj) in x) return x CS194-10 Fall 2011 Lecture 22 7 Example P(C) .50 Cloudy C P(S|C) C P(R|C) T .10 Sprinkler Rain T .80 F .50 F .20 Wet Grass SRP(W|S,R) TT .99 TF .90 FT .90 FF .01 CS194-10 Fall 2011 Lecture 22 8 Example P(C) .50 Cloudy C P(S|C) C P(R|C) T .10 Sprinkler Rain T .80 F .50 F .20 Wet Grass SRP(W|S,R) TT .99 TF .90 FT .90 FF .01 CS194-10 Fall 2011 Lecture 22 9 Example P(C) .50 Cloudy C P(S|C) C P(R|C) T .10 Sprinkler Rain T .80 F .50 F .20 Wet Grass SRP(W|S,R) TT .99 TF .90 FT .90 FF .01 CS194-10 Fall 2011 Lecture 22 10 Example P(C) .50 Cloudy C P(S|C) C P(R|C) T .10 Sprinkler Rain T .80 F .50 F .20 Wet Grass SRP(W|S,R) TT .99 TF .90 FT .90 FF .01 CS194-10 Fall 2011 Lecture 22 11 Example P(C) .50 Cloudy C P(S|C) C P(R|C) T .10 Sprinkler Rain T .80 F .50 F .20 Wet Grass SRP(W|S,R) TT .99 TF .90 FT .90 FF .01 CS194-10 Fall 2011 Lecture 22 12 Example P(C) .50 Cloudy C P(S|C) C P(R|C) T .10 Sprinkler Rain T .80 F .50 F .20 Wet Grass SRP(W|S,R) TT .99 TF .90 FT .90 FF .01 CS194-10 Fall 2011 Lecture 22 13 Example P(C) .50 Cloudy C P(S|C) C P(R|C) T .10 Sprinkler Rain T .80 F .50 F .20 Wet Grass SRP(W|S,R) TT .99 TF .90 FT .90 FF .01 CS194-10 Fall 2011 Lecture 22 14 Sampling from an empty network contd. Probability that PriorSample generates a particular event D SPS(x1 . xD) = Πj = 1P (xj|parents(Xj)) = P (x1 . xD) i.e., the true prior probability E.g., SPS(t, f, t, t) = 0.5 × 0.9 × 0.8 × 0.9 = 0.324 = P (t, f, t, t) Let NPS(x1 . xD) be the number of samples generated for event x1, . , xD Then we have lim Pˆ(x1, . , xD) = lim NPS(x1, . , xD)/N N→∞ N→∞ = SPS(x1, . , xD) = P (x1 . xD) That is, estimates derived from PriorSample are consistent Shorthand: Pˆ(x1, . , xD) ≈ P (x1 . xD) CS194-10 Fall 2011 Lecture 22 15 Rejection sampling Pˆ (X|e) estimated from samples agreeing with e function Rejection-Sampling(X , e, bn, N ) returns an estimate of P(X|e) local variables: N, a vector of counts for each value of X , initially zero for i = 1 to N do x ← Prior-Sample(bn) if x is consistent with e then N[x] ← N[x]+1 where x is the value of X in x return Normalize(N) E.g., estimate P(Rain|Sprinkler = true) using 100 samples 27 samples have Sprinkler = true Of these, 8 have Rain = true and 19 have Rain = false. Pˆ (Rain|Sprinkler = true) = Normalize(h8, 19i) = h0.296, 0.704i Similar to a basic real-world empirical estimation procedure CS194-10 Fall 2011 Lecture 22 16 Analysis of rejection sampling Pˆ (X|e) = αNPS(X, e) (algorithm defn.) = NPS(X, e)/NPS(e) (normalized by NPS(e)) ≈ P(X, e)/P (e) (property of PriorSample) = P(X|e) (defn. of conditional probability) Hence rejection sampling returns consistent posterior estimates Problem: hopelessly expensive if P (e) is small P (e) drops off exponentially with number of evidence variables! CS194-10 Fall 2011 Lecture 22 17 Approximate inference using MCMC General idea of Markov chain Monte Carlo ♦ Sample space Ω, probability π(ω) (e.g., posterior given e) ♦ Would like to sample directly from π(ω), but it’s hard ♦ Instead, wander around Ω randomly, collecting samples ♦ Random wandering is controlled by transition kernel φ(ω → ω0) specifying the probability of moving to ω0 from ω (so the random state sequence ω0, ω1, . , ωt is a Markov chain) ♦ If φ is defined appropriately, the stationary distribution is π(ω) so that, after a while (mixing time) the collected samples are drawn from π CS194-10 Fall 2011 Lecture 22 18 Gibbs sampling in Bayes nets Markov chain state ωt = current assignment xt to all variables Transition kernel: pick a variable Xj, sample it conditioned on all others Markov blanket property: P (Xj | all other variables) = P (Xj | mb(Xj)) so generate next state by sampling a variable given its Markov blanket function Gibbs-Ask(X , e, bn, N ) returns an estimate of P(X|e) local variables: N, a vector of counts for each value of X , initially zero Z, the nonevidence variables in bn z, the current state of variables Z, initially random for i = 1 to N do choose Zj in Z uniformly at random set the value of Zj in z by sampling from P(Zj|mb(Zj)) N[x] ← N[x] + 1 where x is the value of X in z return Normalize(N) CS194-10 Fall 2011 Lecture 22 19 The Markov chain With Sprinkler = true, W etGrass = true, there are four states: Cloudy Cloudy Sprinkler Rain Sprinkler Rain Wet Wet Grass Grass Cloudy Cloudy Sprinkler Rain Sprinkler Rain Wet Wet Grass Grass CS194-10 Fall 2011 Lecture 22 20 MCMC example contd. Estimate P(Rain|Sprinkler = true, W etGrass = true) Sample Cloudy or Rain given its Markov blanket, repeat. Count number of times Rain is true and false in the samples. E.g., visit 100 states 31 have Rain = true, 69 have Rain = false Pˆ (Rain|Sprinkler = true, W etGrass = true) = Normalize(h31, 69i) = h0.31, 0.69i CS194-10 Fall 2011 Lecture 22 21 Markov blanket sampling Markov blanket of Cloudy is Cloudy Sprinkler and Rain Markov blanket of Rain is Sprinkler Rain Cloudy, Sprinkler, and W etGrass Wet Grass Probability given the Markov blanket is calculated as follows: 0 0 P (xj|mb(Xj)) = αP (xj|parents(Xj))ΠZ`∈Children(Xj)P (z`|parents(Z`)) E.g., φ(¬cloudy, rain → cloudy, rain) = 0.5 × αP (cloudy)P ( | cloudy)P ( | cloudy) = 0.5 × α × 0.5 × 0.1 × 0.8 0.040 = 0.5 × 0.040+0.050 = 0.2222 (Easy for discrete variables; continuous case requires mathematical analysis for each combination of distribution types.) Easily implemented in message-passing parallel systems, brains Can converge slowly, especially for near-deterministic models CS194-10 Fall 2011 Lecture 22 22 Theory for Gibbs sampling Theorem: stationary distribution for Gibbs transition kernel is P (z | e); i.e., long-run fraction of time spent in each state is exactly proportional to its posterior probability Proof sketch: – The Gibbs transition kernel satisfies detailed balance for P (z | e) i.e., for all z, z0 the “flow” from z to z0 is the same as from z0 to z – π is the unique stationary distribution for any ergodic transition kernel satisfying detailed balance for π CS194-10 Fall 2011 Lecture 22 23 Detailed balance Let πt(z) be the probability the chain is in state z at time t Detailed balance condition: “outflow” = “inflow” for each pair of states: 0 0 0 0 πt(z)φ(z → z ) = πt(z )φ(z → z) for all z, z Detailed balance ⇒ stationarity: 0 0 0 πt+1(z) = Σz0πt(z )φ(z → z) = Σz0πt(z)φ(z → z ) 0 = πt(z)Σz0φ(z → z ) 0 = πt(z ) MCMC algorithms typically constructed by designing a transition kernel φ that is in detailed balance with desired π CS194-10 Fall 2011 Lecture 22 24 Gibbs sampling transition kernel Probability of choosing variable Zj to sample is 1/(D − |E|) Let Z¯j be all other nonevidence variables, i.e., Z − {Zj} Current values are zj and z¯j; e is fixed; transition probability is given by 0 0 0 φ(z → z ) = φ(zj, z¯j → zj, z¯j) = P (zj|z¯j, e)/(D − |E|) This gives detailed balance with P (z | e): 1 1 π(z)φ(z → z0) = P (z | e)P (z0 |z¯ , e) = P (z , z¯ | e)P (z0 |z¯ , e) D − |E| j j D − |E| j j j j 1 = P (z |z¯ , e)P (z¯ | e)P (z0 |z¯ , e) (chain rule) D − |E| j j j j j 1 = P (z |z¯ , e)P (z0 , z¯ | e) (chain rule backwards) D − |E| j j j j = φ(z0 → z)π(z0) = π(z0)φ(z0 → z) CS194-10 Fall 2011 Lecture 22 25 Summary (inference) Exact inference: – polytime on polytrees, NP-hard on general graphs – space = time, very sensitive to topology Approximate inference by MCMC: – Generally insensitive to topology – Convergence can be very slow with probabilities close to 1 or 0 – Can handle arbitrary combinations of discrete and continuous variables CS194-10 Fall 2011 Lecture 22 26 Parameter learning: Complete data θjk` = P (Xj = k | P arents(Xj) = `) (i) Let xj = value of Xj in example i; assume Boolean for simplicity Log likelihood N D X X (i) (i) L(θ) = log P (xj | parents(Xj) i = 1 j = 1 (i) N D x (i) X X j 1−xj = log θ (i)(1 − θj1`(i)) i = 1 j = 1 j1` ∂L N N = j1` − j0` = 0 gives ∂θj1` θj1` 1 − θj1` Nj1` Nj1` θj1` = = .

Bayesian Networks: Inference and Learning

Practical Statistics for Particle Physics Lecture 1 AEPS2018, Quy Nhon, Vietnam

Markov Chain Monte Carlo Convergence Diagnostics: a Comparative Review

Meta-Learning MCMC Proposals

Common Quandaries and Their Practical Solutions in Bayesian Network Modeling

3.3 Bayes' Formula

Markov Chain Monte Carlo Edps 590BAY

Bayesian Network Modelling with Examples

Numerical Physics with Probabilities: the Monte Carlo Method and Bayesian Statistics Part I for Assignment 2

The Bayesian Approach to Statistics

Bayesian Phylogenetics and Markov Chain Monte Carlo Will Freyman

Empirical Evaluation of Scoring Functions for Bayesian Network Model Selection

Paradoxes and Priors in Bayesian Regression