Bayesian networks: Inference and learning
CS194-10 Fall 2011 Lecture 22
Outline
♦ Exact inference (briefly)
♦ Approximate inference (rejection sampling, MCMC)
♦ Parameter learning
Inference by enumeration
Slightly intelligent way to sum out variables from the joint without actually constructing its explicit representation.

Simple query on the burglary network (nodes B, E, A, J, M):
P(B | j, m) = P(B, j, m)/P(j, m) = α P(B, j, m) = α Σ_e Σ_a P(B, e, a, j, m)

Rewrite full joint entries using products of CPT entries:
P(B | j, m) = α Σ_e Σ_a P(B) P(e) P(a | B, e) P(j | a) P(m | a)
            = α P(B) Σ_e P(e) Σ_a P(a | B, e) P(j | a) P(m | a)

Recursive depth-first enumeration: O(D) space, O(K^D) time
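A runnable sketch of this enumeration in Python (not from the slides; the representation is my own, and the ¬b alarm entries 0.29 and 0.001 are the standard textbook values, since the evaluation tree below shows only the b branch):

```python
def p(p_true, x):
    """P(X = x) given the table entry P(X = true)."""
    return p_true if x else 1.0 - p_true

P_B = 0.001                                          # P(b)
P_E = 0.002                                          # P(e)
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(a = true | B, E)
P_J = {True: 0.90, False: 0.05}                      # P(j = true | A)
P_M = {True: 0.70, False: 0.01}                      # P(m = true | A)

def query_b(j, m):
    """P(B | j, m) by enumeration: sum out e and a depth-first, then normalize."""
    unnorm = [sum(p(P_B, b) * p(P_E, e) * p(P_A[(b, e)], a)
                  * p(P_J[a], j) * p(P_M[a], m)
                  for e in (True, False) for a in (True, False))
              for b in (True, False)]
    z = sum(unnorm)
    return [u / z for u in unnorm]    # [P(b | j, m), P(not-b | j, m)]

print(query_b(True, True))            # approx [0.284, 0.716]
```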
Evaluation tree
[Evaluation tree for the query: branching on e then a, with leaves multiplying entries such as P(b) = .001, P(e) = .002, P(¬e) = .998, P(a|b,e) = .95, P(¬a|b,e) = .05, P(a|b,¬e) = .94, P(¬a|b,¬e) = .06, P(j|a) = .90, P(j|¬a) = .05, P(m|a) = .70, P(m|¬a) = .01.]
Enumeration is inefficient: repeated computation, e.g., it computes P(j|a)P(m|a) for each value of e.
Efficient exact inference
Junction tree and variable elimination algorithms avoid repeated computation (a generalized form of dynamic programming); see the sketch after the figure below.

♦ Singly connected networks (or polytrees):
– any two nodes are connected by at most one (undirected) path
– time and space cost of exact inference are O(K^L D)

♦ Multiply connected networks:
– can reduce 3SAT to exact inference ⇒ NP-hard
– equivalent to counting 3SAT models ⇒ #P-complete
[Figure: reduction from 3SAT to exact inference. Variables A, B, C, D, each with prior 0.5; clause nodes L1: A ∨ B ∨ C, L2: C ∨ D ∨ A, L3: B ∨ C ∨ D; the clause nodes feed a single AND node.]
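To see how pushing sums inward removes the repeated computation, here is a toy sketch on a hypothetical three-node chain A → B → C (all numbers invented):

```python
P_A = [0.6, 0.4]                         # P(A)
P_B_A = [[0.7, 0.3], [0.2, 0.8]]         # P(B | A), rows indexed by a
P_C_B = [[0.9, 0.1], [0.5, 0.5]]         # P(C | B), rows indexed by b

# Naive enumeration: every (a, b) pair is re-multiplied for each value of c.
P_C_enum = [sum(P_A[a] * P_B_A[a][b] * P_C_B[b][c]
                for a in (0, 1) for b in (0, 1)) for c in (0, 1)]

# Variable elimination: sum out A once to get a factor on B, then sum out B.
f_B = [sum(P_A[a] * P_B_A[a][b] for a in (0, 1)) for b in (0, 1)]   # = P(B)
P_C_ve = [sum(f_B[b] * P_C_B[b][c] for b in (0, 1)) for c in (0, 1)]

assert all(abs(x - y) < 1e-12 for x, y in zip(P_C_enum, P_C_ve))
```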
Inference by stochastic simulation
Idea: replace the sum over hidden-variable assignments with a random sample.

1) Draw N samples from a sampling distribution S
2) Compute an approximate posterior probability P̂
3) Show this converges to the true probability P

[Figure: a coin, P = 0.5]

Outline:
– Sampling from an empty network
– Rejection sampling: reject samples disagreeing with evidence
– Markov chain Monte Carlo (MCMC): sample from a stochastic process whose stationary distribution is the true posterior
Sampling from an empty network

function Prior-Sample(bn) returns an event sampled from the prior specified by bn
   inputs: bn, a Bayesian network specifying joint distribution P(X1, ..., XD)
   x ← an event with D elements
   for j = 1, ..., D do
      x[j] ← a random sample from P(Xj | values of parents(Xj) in x)
   return x
Example

[Sprinkler network: Cloudy is parent of Sprinkler and Rain; Sprinkler and Rain are parents of WetGrass]

P(C) = .50

C   P(S|C)        C   P(R|C)
T   .10           T   .80
F   .50           F   .20

S  R   P(W|S,R)
T  T   .99
T  F   .90
F  T   .90
F  F   .01
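A runnable sketch of Prior-Sample specialized to this network; the representation (a dict keyed 'C', 'S', 'R', 'W') is my own, while the CPT numbers are the ones above:

```python
import random

def prior_sample():
    """One event from the sprinkler network, sampled in topological order."""
    c = random.random() < 0.50                            # P(C)
    s = random.random() < (0.10 if c else 0.50)           # P(S | C)
    r = random.random() < (0.80 if c else 0.20)           # P(R | C)
    p_w = {(True, True): 0.99, (True, False): 0.90,
           (False, True): 0.90, (False, False): 0.01}[(s, r)]
    w = random.random() < p_w                             # P(W | S, R)
    return {'C': c, 'S': s, 'R': r, 'W': w}
```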
Sampling from an empty network contd.
Probability that Prior-Sample generates a particular event:
S_PS(x1, ..., xD) = Π_{j=1..D} P(xj | parents(Xj)) = P(x1, ..., xD)
i.e., the true prior probability.
E.g., S_PS(t, f, t, t) = 0.5 × 0.9 × 0.8 × 0.9 = 0.324 = P(t, f, t, t)
Let N_PS(x1, ..., xD) be the number of samples generated for the event x1, ..., xD. Then

lim_{N→∞} P̂(x1, ..., xD) = lim_{N→∞} N_PS(x1, ..., xD)/N = S_PS(x1, ..., xD) = P(x1, ..., xD)

That is, estimates derived from Prior-Sample are consistent.

Shorthand: P̂(x1, ..., xD) ≈ P(x1, ..., xD)
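A quick empirical check of this consistency claim, reusing the prior_sample sketch above:

```python
n = 100_000
target = {'C': True, 'S': False, 'R': True, 'W': True}
hits = sum(prior_sample() == target for _ in range(n))
print(hits / n)   # should approach 0.324 as n grows
```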
Rejection sampling
P̂(X | e) is estimated from samples agreeing with e.
function Rejection-Sampling(X, e, bn, N) returns an estimate of P(X | e)
   local variables: N, a vector of counts for each value of X, initially zero
   for i = 1 to N do
      x ← Prior-Sample(bn)
      if x is consistent with e then
         N[x] ← N[x] + 1 where x is the value of X in x
   return Normalize(N)
E.g., estimate P(Rain | Sprinkler = true) using 100 samples:
27 samples have Sprinkler = true; of these, 8 have Rain = true and 19 have Rain = false.
P̂(Rain | Sprinkler = true) = Normalize(⟨8, 19⟩) = ⟨0.296, 0.704⟩
Similar to a basic real-world empirical estimation procedure.
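The same procedure as a runnable sketch, again reusing prior_sample (the helper name is my own):

```python
def rejection_sample_rain(n):
    """Estimate P(Rain | Sprinkler = true) from n prior samples."""
    counts = {True: 0, False: 0}
    for _ in range(n):
        x = prior_sample()
        if x['S']:                    # keep only samples consistent with e
            counts[x['R']] += 1
    total = counts[True] + counts[False]
    return counts[True] / total, counts[False] / total

print(rejection_sample_rain(100_000))   # roughly (0.3, 0.7)
```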
Analysis of rejection sampling
P̂(X | e) = α N_PS(X, e)          (algorithm defn.)
         = N_PS(X, e)/N_PS(e)    (normalized by N_PS(e))
         ≈ P(X, e)/P(e)          (property of Prior-Sample)
         = P(X | e)              (defn. of conditional probability)

Hence rejection sampling returns consistent posterior estimates.

Problem: hopelessly expensive if P(e) is small; P(e) drops off exponentially with the number of evidence variables!
Approximate inference using MCMC
General idea of Markov chain Monte Carlo:
♦ Sample space Ω, probability π(ω) (e.g., the posterior given e)
♦ Would like to sample directly from π(ω), but it's hard
♦ Instead, wander around Ω randomly, collecting samples
♦ The random wandering is controlled by a transition kernel φ(ω → ω′) specifying the probability of moving to ω′ from ω (so the random state sequence ω0, ω1, ..., ωt is a Markov chain)
♦ If φ is defined appropriately, the stationary distribution is π(ω), so that after a while (the mixing time) the collected samples are drawn from π
Gibbs sampling in Bayes nets
Markov chain state ωt = current assignment xt to all variables
Transition kernel: pick a variable Xj, sample it conditioned on all others
Markov blanket property: P(Xj | all other variables) = P(Xj | mb(Xj)), so generate the next state by sampling a variable given its Markov blanket.
function Gibbs-Ask(X, e, bn, N) returns an estimate of P(X | e)
   local variables: N, a vector of counts for each value of X, initially zero
                    Z, the nonevidence variables in bn
                    z, the current state of variables Z, initially random
   for i = 1 to N do
      choose Zj in Z uniformly at random
      set the value of Zj in z by sampling from P(Zj | mb(Zj))
      N[x] ← N[x] + 1 where x is the value of X in z
   return Normalize(N)
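A runnable Gibbs sketch for the running query P(Rain | Sprinkler = true, WetGrass = true); the weight helper and state representation are my own:

```python
import random

def weight(c, r):
    """Unnormalized P(c, r, s = true, w = true) from the sprinkler CPTs."""
    p_r_true = 0.8 if c else 0.2
    return (0.5                                # P(c)
            * (0.1 if c else 0.5)              # P(s = true | c)
            * (p_r_true if r else 1 - p_r_true)
            * (0.99 if r else 0.90))           # P(w = true | s = true, r)

def gibbs_rain(n_steps):
    """Estimate P(Rain | s, w) by Gibbs sampling over Cloudy and Rain."""
    c, r = random.random() < 0.5, random.random() < 0.5   # random initial state
    count_true = 0
    for _ in range(n_steps):
        if random.random() < 0.5:   # pick Cloudy, resample given mb(Cloudy)
            pt = weight(True, r) / (weight(True, r) + weight(False, r))
            c = random.random() < pt
        else:                       # pick Rain, resample given mb(Rain)
            pt = weight(c, True) / (weight(c, True) + weight(c, False))
            r = random.random() < pt
        count_true += r
    return count_true / n_steps

print(gibbs_rain(100_000))          # approaches P(rain | s, w), about 0.32
```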
The Markov chain
With Sprinkler = true and WetGrass = true, there are four states:
[Figure: the four chain states, one per assignment to (Cloudy, Rain), each drawn as the sprinkler network with Sprinkler = true and WetGrass = true fixed.]
MCMC example contd.
Estimate P(Rain | Sprinkler = true, WetGrass = true):
Sample Cloudy or Rain given its Markov blanket; repeat.
Count the number of times Rain is true and false in the samples.
E.g., visit 100 states; 31 have Rain = true, 69 have Rain = false.
P̂(Rain | Sprinkler = true, WetGrass = true) = Normalize(⟨31, 69⟩) = ⟨0.31, 0.69⟩
Markov blanket sampling
Markov blanket of Cloudy is Sprinkler and Rain.
Markov blanket of Rain is Cloudy, Sprinkler, and WetGrass.

[Figure: sprinkler network with the two Markov blankets highlighted]

Probability given the Markov blanket is calculated as follows:
P(x′j | mb(Xj)) = α P(x′j | parents(Xj)) Π_{Zℓ ∈ Children(Xj)} P(zℓ | parents(Zℓ))

E.g., picking Cloudy (probability 0.5) and resampling it given mb(Cloudy) = {Sprinkler, Rain}:
φ(¬cloudy, rain → cloudy, rain) = 0.5 × α P(cloudy) P(sprinkler | cloudy) P(rain | cloudy)
  = 0.5 × α × 0.5 × 0.1 × 0.8
  = 0.5 × 0.040/(0.040 + 0.050) = 0.2222

(Easy for discrete variables; the continuous case requires mathematical analysis for each combination of distribution types.)

Easily implemented in message-passing parallel systems, brains.
Can converge slowly, especially for near-deterministic models.
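A three-line check of the kernel value just computed, using the CPTs above:

```python
p_c  = 0.5 * 0.1 * 0.8           # P(cloudy) P(s | cloudy) P(r | cloudy)
p_nc = 0.5 * 0.5 * 0.2           # P(not-cloudy) P(s | not-cloudy) P(r | not-cloudy)
print(0.5 * p_c / (p_c + p_nc))  # 0.5 is the chance of picking Cloudy -> 0.2222...
```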
Theory for Gibbs sampling
Theorem: the stationary distribution for the Gibbs transition kernel is P(z | e); i.e., the long-run fraction of time spent in each state is exactly proportional to its posterior probability.

Proof sketch:
– The Gibbs transition kernel satisfies detailed balance for P(z | e), i.e., for all z, z′ the "flow" from z to z′ is the same as from z′ to z
– π is the unique stationary distribution for any ergodic transition kernel satisfying detailed balance for π
Detailed balance
Let πt(z) be the probability that the chain is in state z at time t.

Detailed balance condition: "outflow" = "inflow" for each pair of states:
πt(z) φ(z → z′) = πt(z′) φ(z′ → z)   for all z, z′

Detailed balance ⇒ stationarity:
πt+1(z) = Σ_{z′} πt(z′) φ(z′ → z)
        = Σ_{z′} πt(z) φ(z → z′)     (detailed balance)
        = πt(z) Σ_{z′} φ(z → z′)
        = πt(z)
MCMC algorithms are typically constructed by designing a transition kernel φ that is in detailed balance with the desired π.
Gibbs sampling transition kernel
Probability of choosing variable Zj to sample is 1/(D − |E|)
Let Z̄j be all other nonevidence variables, i.e., Z − {Zj}. Current values are zj and z̄j; e is fixed. The transition probability is

φ(z → z′) = φ(zj, z̄j → z′j, z̄j) = P(z′j | z̄j, e)/(D − |E|)

This gives detailed balance with P(z | e):

π(z) φ(z → z′) = (1/(D − |E|)) P(z | e) P(z′j | z̄j, e)
              = (1/(D − |E|)) P(zj, z̄j | e) P(z′j | z̄j, e)
              = (1/(D − |E|)) P(zj | z̄j, e) P(z̄j | e) P(z′j | z̄j, e)   (chain rule)
              = (1/(D − |E|)) P(zj | z̄j, e) P(z′j, z̄j | e)              (chain rule backwards)
              = φ(z′ → z) π(z′) = π(z′) φ(z′ → z)
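A numeric sanity check of detailed balance for the four-state chain from the example (helper names are my own; weight is the same unnormalized posterior used in the Gibbs sketch above):

```python
import itertools

def weight(c, r):
    """Unnormalized posterior weight P(c, r, s = true, w = true)."""
    p_r = 0.8 if c else 0.2
    return 0.5 * (0.1 if c else 0.5) * (p_r if r else 1 - p_r) * (0.99 if r else 0.90)

states = list(itertools.product([True, False], repeat=2))   # (c, r)
Z = sum(weight(c, r) for c, r in states)
pi = {s: weight(*s) / Z for s in states}                    # P(c, r | s, w)

def phi(s, t):
    """Gibbs kernel: pick Cloudy or Rain with probability 1/2 and resample it."""
    (c, r), (c2, r2) = s, t
    p = 0.0
    if r == r2:   # move that resampled Cloudy
        p += 0.5 * weight(c2, r) / (weight(True, r) + weight(False, r))
    if c == c2:   # move that resampled Rain
        p += 0.5 * weight(c, r2) / (weight(c, True) + weight(c, False))
    return p

for s in states:            # verify pi(s) phi(s, t) == pi(t) phi(t, s) for all pairs
    for t in states:
        assert abs(pi[s] * phi(s, t) - pi[t] * phi(t, s)) < 1e-12
```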
Summary (inference)
Exact inference:
– polytime on polytrees, NP-hard on general graphs
– space cost can equal time cost; very sensitive to topology

Approximate inference by MCMC:
– generally insensitive to topology
– convergence can be very slow with probabilities close to 1 or 0
– can handle arbitrary combinations of discrete and continuous variables
Parameter learning: Complete data
θ_{jkℓ} = P(Xj = k | Parents(Xj) = ℓ)

Let x_j^(i) be the value of Xj in example i; assume Boolean variables for simplicity.

Log likelihood:
L(θ) = Σ_{i=1..N} Σ_{j=1..D} log P(x_j^(i) | parents(Xj)^(i))
     = Σ_{i=1..N} Σ_{j=1..D} log [ θ_{j1ℓ(i)}^{x_j^(i)} (1 − θ_{j1ℓ(i)})^{1 − x_j^(i)} ]
where ℓ(i) is the parent configuration of Xj in example i.

Setting ∂L/∂θ_{j1ℓ} = N_{j1ℓ}/θ_{j1ℓ} − N_{j0ℓ}/(1 − θ_{j1ℓ}) = 0
gives θ_{j1ℓ} = N_{j1ℓ}/(N_{j0ℓ} + N_{j1ℓ}) = N_{j1ℓ}/N_{jℓ}

I.e., learning decomposes completely; the MLE is the observed conditional frequency.
Example
Red/green wrapper depends probabilistically on flavor:
[Network: Flavor → Wrapper, with P(F = cherry) = θ and P(W = red | F) = θ1 for cherry, θ2 for lime]

Likelihood for, e.g., a cherry candy in a green wrapper:
P(F = cherry, W = green | h_{θ,θ1,θ2})
  = P(F = cherry | h_{θ,θ1,θ2}) P(W = green | F = cherry, h_{θ,θ1,θ2})
  = θ · (1 − θ1)

For N candies, with rc red-wrapped cherry candies, etc.:
P(X | h_{θ,θ1,θ2}) = θ^c (1 − θ)^ℓ · θ1^{rc} (1 − θ1)^{gc} · θ2^{rℓ} (1 − θ2)^{gℓ}

L = [c log θ + ℓ log(1 − θ)]
  + [rc log θ1 + gc log(1 − θ1)]
  + [rℓ log θ2 + gℓ log(1 − θ2)]
Example contd.
Derivatives of L contain only the relevant parameter:

∂L/∂θ  = c/θ − ℓ/(1 − θ) = 0      ⇒  θ  = c/(c + ℓ)
∂L/∂θ1 = rc/θ1 − gc/(1 − θ1) = 0  ⇒  θ1 = rc/(rc + gc)
∂L/∂θ2 = rℓ/θ2 − gℓ/(1 − θ2) = 0  ⇒  θ2 = rℓ/(rℓ + gℓ)
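These closed forms are just counting; a sketch with invented counts:

```python
# Invented counts for illustration: 1000 candies total.
c, l = 600, 400          # cherry / lime
rc, gc = 540, 60         # red / green wrappers among the cherry candies
rl, gl = 120, 280        # red / green wrappers among the lime candies

theta  = c / (c + l)     # MLE of P(F = cherry)           = 0.6
theta1 = rc / (rc + gc)  # MLE of P(W = red | F = cherry) = 0.9
theta2 = rl / (rl + gl)  # MLE of P(W = red | F = lime)   = 0.3
```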
Hidden variables
Why learn models with hidden variables?
[Figure: (a) network with hidden variable HeartDisease: Smoking, Diet, Exercise (2, 2, 2 parameters) → HeartDisease (54 parameters) → Symptom1, Symptom2, Symptom3 (6, 6, 6 parameters); (b) the same network with HeartDisease removed, where the symptom CPTs grow to 54, 162, and 486 parameters.]
Hidden variables ⇒ simplified structure, fewer parameters ⇒ faster learning
Learning with and without hidden variables
[Figure: learning curves; average negative log likelihood per case (about 3.0 to 3.5) vs. number of training cases (0 to 5000), for the 3-3 network and 3-1-3 network with the APN algorithm, NN algorithms with 1, 3, and 5 hidden nodes, and the target network.]
EM for Bayes nets
For t = 0 to ∞ (until convergence) do
E step: compute all p_{ijkℓ} = P(Xj = k, Parents(Xj) = ℓ | e^(i), θ^(t))

M step: θ^{(t+1)}_{jkℓ} = N̂_{jkℓ} / Σ_{k′} N̂_{jk′ℓ} = Σ_i p_{ijkℓ} / Σ_i Σ_{k′} p_{ijk′ℓ}

The E step can be any exact or approximate inference algorithm.
With MCMC, each sample can be treated as a complete-data example.
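A minimal runnable EM sketch for a candy-style model with one hidden Bag variable and one observed Flavor, i.e., a two-component Bernoulli mixture (all data and starting values are invented):

```python
import random

random.seed(0)
# Generate synthetic data: hidden Bag, observed Flavor (1 = cherry).
true_theta, true_f = 0.7, [0.9, 0.3]        # invented generating parameters
data = []
for _ in range(1000):
    bag = 0 if random.random() < true_theta else 1
    data.append(1 if random.random() < true_f[bag] else 0)

theta, f = 0.5, [0.6, 0.4]                  # arbitrary starting point
for _ in range(100):
    # E step: responsibility p_i = P(bag 1 | flavor_i), bag 1 = index 0 here
    resp = []
    for x in data:
        a = theta * (f[0] if x else 1 - f[0])
        b = (1 - theta) * (f[1] if x else 1 - f[1])
        resp.append(a / (a + b))
    # M step: expected counts play the role of observed counts
    n1 = sum(resp)
    theta = n1 / len(data)
    f[0] = sum(p * x for p, x in zip(resp, data)) / n1
    f[1] = sum((1 - p) * x for p, x in zip(resp, data)) / (len(data) - n1)

print(theta, f)   # lands near the generating parameters (up to label swap)
```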
Candy example
[Figure: candy network with hidden Bag: P(Bag = 1) = θ; Bag → Flavor, Wrapper, Hole, with P(F = cherry | B) = θF1 for bag 1 and θF2 for bag 2. Accompanying plot: log-likelihood rising from about −2020 toward −1980 over 120 EM iterations.]
Example: Car insurance
[Figure: car insurance network; nodes include SocioEcon, Age, GoodStudent, ExtraCar, Mileage, RiskAversion, VehicleYear, SeniorTrain, DrivingSkill, MakeModel, DrivingHist, Antilock, DrivQuality, Airbag, CarValue, HomeBase, AntiTheft, Ruggedness, Accident, Theft, OwnDamage, Cushioning, OtherCost, OwnCost, MedicalCost, LiabilityCost, PropertyCost.]
Example: Car insurance contd.
[Figure: learning curves; average negative log likelihood per case (about 1.8 to 3.4) vs. number of training cases (0 to 5000), for the 12-3 network and the full insurance network with the APN algorithm, NN algorithms with 1, 5, and 10 hidden nodes, and the target network.]
Bayesian learning in Bayes nets
Parameters become variables (parents of their previous owners):
P(Wrapper = red | Flavor = cherry, Θ1 = θ1, Θ2 = θ2) = θ1

The network is replicated for each example, with parameters shared across all examples:
[Figure: the original network (Flavor → Wrapper with P(F = cherry) = θ and P(W = red | F) = θ1 for cherry, θ2 for lime) and its replicated counterpart: Θ is parent of Flavor1, Flavor2, Flavor3; each Flavor_i is parent of Wrapper_i; Θ1 and Θ2 are parents of every Wrapper_i.]
Bayesian learning contd.
Priors for parameter variables: Beta, Dirichlet, Gamma, Gaussian, etc.

With independent Beta or Dirichlet priors, MAP EM learning adds pseudocounts to the expected counts:

M step: θ^{(t+1)}_{jkℓ} = (a_{jkℓ} + N̂_{jkℓ}) / Σ_{k′} (a_{jk′ℓ} + N̂_{jk′ℓ})

Implemented in the EM training mode of Hugin and other Bayes net packages.
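In code, the MAP M step is a one-line change to the ML M step (all numbers invented for illustration):

```python
# Beta(a1, a0) prior pseudocounts plus expected counts from the E step;
# the M step just normalizes their sums.
a1, a0 = 2.0, 2.0
N1, N0 = 31.4, 12.6                            # expected counts N-hat
theta = (a1 + N1) / ((a1 + N1) + (a0 + N0))    # approx 0.696
```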
Summary (parameter learning)
Complete data: the likelihood factorizes; each parameter θ_{jkℓ} is learned separately as the observed conditional frequency N_{jkℓ}/N_{jℓ}.

Incomplete data: the likelihood is a summation over all values of the hidden variables; can apply EM by computing "expected counts".

Bayesian learning: parameters become variables in a replicated model, with their own prior distributions defined by hyperparameters; Bayesian learning is then just ordinary inference in that model.