Learning Bayesian Networks from Data

Nir Friedman (Hebrew U.)    Daphne Koller (Stanford)

Overview

- Introduction
- Parameter Estimation
- Model Selection
- Structure Discovery
- Incomplete Data
- Learning from Structured Data

Bayesian Networks

Compact representation of probability distributions via conditional independence.

Qualitative part: directed acyclic graph (DAG)
- Nodes - random variables
- Edges - direct influence

Quantitative part: set of conditional probability distributions, e.g. the family of Alarm:

  E   B   | P(a | E,B)   P(¬a | E,B)
  e   b   |   0.9           0.1
  e   ¬b  |   0.2           0.8
  ¬e  b   |   0.9           0.1
  ¬e  ¬b  |   0.01          0.99

Together they define a unique distribution in factored form:

  P(B,E,A,C,R) = P(B) P(E) P(A|B,E) P(R|E) P(C|A)

[Figure: the Earthquake/Burglary network - Earthquake, Burglary, Alarm, Radio, Call]

Example: "ICU Alarm" Network

Domain: monitoring intensive-care patients
- 37 variables
- 509 parameters ...instead of 2^54

[Figure: the ICU Alarm network (MINVOLSET, PULMEMBOLUS, INTUBATION, VENTLUNG, ..., HRBP, BP)]

Inference

- Posterior probabilities
  - Probability of any event given any evidence
- Most likely explanation
  - Scenario that explains evidence
- Rational decision making
  - Maximize expected utility
  - Value of information
- Effect of intervention

[Figure: Earthquake/Burglary network with Radio, Alarm, Call]

Why Learning?

Knowledge acquisition bottleneck
- Knowledge acquisition is an expensive process
- Often we don't have an expert

Data is cheap
- Amount of available information growing rapidly
- Learning allows us to construct models from raw data

Why Learn Bayesian Networks?

- Conditional independencies & graphical language capture structure of many real-world distributions
- Graph structure provides much insight into the domain
  - Allows "knowledge discovery"
- Learned model can be used for many tasks
- Supports all the features of probabilistic learning
  - Model selection criteria
  - Dealing with missing data & hidden variables

Learning Bayesian Networks

Data + prior information → Learner → Bayesian network

[Figure: learned network over E, B → A → C, with CPT P(A | E,B): e,b: .9/.1; e,¬b: .7/.3; ¬e,b: .8/.2; ¬e,¬b: .99/.01]

Known Structure, Complete Data

- Network structure is specified
  - Inducer needs to estimate parameters
- Data does not contain missing values

Unknown Structure, Complete Data

- Network structure is not specified
  - Inducer needs to select arcs & estimate parameters
- Data does not contain missing values

Known Structure, Incomplete Data

- Network structure is specified
- Data contains missing values
  - Need to consider assignments to missing values

Unknown Structure, Incomplete Data

- Network structure is not specified
- Data contains missing values
  - Need to consider assignments to missing values

Overview

- Introduction
- Parameter Estimation
  - Bayesian estimation
- Model Selection
- Structure Discovery
- Incomplete Data
- Learning from Structured Data

Learning Parameters

Training data has the form:

  D = [ E[1] B[1] A[1] C[1]
        ...
        E[M] B[M] A[M] C[M] ]

[Figure: network E, B → A → C]

Likelihood Function

- Assume i.i.d. samples
- Likelihood function is

  L(Θ : D) = ∏_m P(E[m], B[m], A[m], C[m] : Θ)

- By definition of the network, each term factors:

  L(Θ : D) = ∏_m P(E[m] : Θ) P(B[m] : Θ) P(A[m] | B[m], E[m] : Θ) P(C[m] | A[m] : Θ)

[Figure: network E, B → A → C and the M×4 data matrix]

Likelihood Function

- Rewriting terms, the likelihood decomposes by variable:

  L(Θ : D) = ∏_m P(E[m] : Θ) · ∏_m P(B[m] : Θ) · ∏_m P(A[m] | B[m], E[m] : Θ) · ∏_m P(C[m] | A[m] : Θ)

General Bayesian Networks

Generalizing for any Bayesian network:

  L(Θ : D) = ∏_m P(x_1[m], ..., x_n[m] : Θ)
           = ∏_i ∏_m P(x_i[m] | Pa_i[m] : Θ_i)
           = ∏_i L_i(Θ_i : D)

Decomposition ⇒ independent estimation problems
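To make the decomposition concrete, here is a minimal sketch in Python. The structure, parameters, and data below are illustrative stand-ins, not from the tutorial; the point is that the log-likelihood is a sum of independent per-family terms.

```python
import math

# A toy structure: E and B are roots, A depends on E and B, C on A.
structure = {"E": [], "B": [], "A": ["E", "B"], "C": ["A"]}

# theta[X][parent_assignment] = P(X = 1 | parents); values are made up.
theta = {
    "E": {(): 0.1},
    "B": {(): 0.2},
    "A": {(1, 1): 0.9, (1, 0): 0.9, (0, 1): 0.8, (0, 0): 0.01},
    "C": {(1,): 0.7, (0,): 0.05},
}

data = [
    {"E": 0, "B": 1, "A": 1, "C": 1},
    {"E": 0, "B": 0, "A": 0, "C": 0},
]

def log_likelihood(structure, theta, data):
    """log L(Theta : D) = sum_i sum_m log P(x_i[m] | Pa_i[m] : Theta_i)."""
    total = 0.0
    for x, parents in structure.items():   # independent term per family i
        for inst in data:                  # product over instances m
            pa = tuple(inst[p] for p in parents)
            p1 = theta[x][pa]
            total += math.log(p1 if inst[x] == 1 else 1.0 - p1)
    return total

print(log_likelihood(structure, theta, data))
```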

Likelihood Function: Multinomials

  L(θ : D) = P(D | θ) = ∏_m P(x[m] | θ)

- The likelihood for the sequence H, T, T, H, H is

  L(θ : D) = θ · (1−θ) · (1−θ) · θ · θ

- General case:

  L(Θ : D) = ∏_{k=1..K} θ_k^{N_k}

  where θ_k is the probability of the kth outcome and N_k is the count of the kth outcome in D.

[Figure: likelihood L(θ : D) as a function of θ ∈ [0, 1]]

Bayesian Inference

- Represent uncertainty about parameters using a probability distribution over parameters, data
- Learning using Bayes rule:

  P(θ | x[1], ..., x[M]) = P(x[1], ..., x[M] | θ) P(θ) / P(x[1], ..., x[M])

  (posterior = likelihood × prior / probability of data)

Bayesian Inference

- Represent the Bayesian distribution as a Bayes net:

  θ → X[1], X[2], ..., X[M]    (observed data)

- The values of X are independent given θ
- P(x[m] = H | θ) = θ
- P(θ | x[1], ..., x[M]) ∝ P(x[1], ..., x[M] | θ) · P(θ)
- Bayesian prediction is inference in this network

Example: Binomial Data

- Prior: uniform for θ in [0,1]
  ⇒ P(θ | D) ∝ the likelihood L(θ : D)
- Observed data: (N_H, N_T) = (4, 1)
- MLE for P(X = H) is 4/5 = 0.8
- Bayesian prediction is

  P(x[M+1] = H | D) = ∫ θ · P(θ | D) dθ = 5/7 = 0.7142...

[Figure: posterior P(θ | D) over θ ∈ [0, 1]]
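The 5/7 above is easy to check numerically. A small sketch assuming nothing beyond NumPy; the grid integration just makes the "prediction = posterior mean" point visible:

```python
import numpy as np

theta = np.linspace(0.0, 1.0, 100001)
posterior = theta**4 * (1.0 - theta)       # uniform prior x likelihood
posterior /= np.trapz(posterior, theta)    # normalize

prediction = np.trapz(theta * posterior, theta)  # E[theta | D]
print(prediction)                          # ~0.714285..., i.e. 5/7
```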

Dirichlet Priors

- Recall that the likelihood function is

  L(Θ : D) = ∏_{k=1..K} θ_k^{N_k}

- Dirichlet prior with hyperparameters α_1, ..., α_K:

  P(Θ) ∝ ∏_{k=1..K} θ_k^(α_k − 1)

⇒ the posterior has the same form, with hyperparameters α_1 + N_1, ..., α_K + N_K:

  P(Θ | D) ∝ P(Θ) P(D | Θ) ∝ ∏_k θ_k^(α_k − 1) ∏_k θ_k^{N_k} = ∏_k θ_k^(α_k + N_k − 1)

Dirichlet Priors - Example

[Figure: densities of Dirichlet(0.5, 0.5), Dirichlet(1, 1), Dirichlet(2, 2), and Dirichlet(5, 5) priors over θ_heads]

Dirichlet Priors (cont.)

- If P(Θ) is Dirichlet with hyperparameters α_1, ..., α_K, then

  P(X[1] = k) = ∫ θ_k · P(Θ) dΘ = α_k / Σ_l α_l

- Since the posterior is also Dirichlet, we get

  P(X[M+1] = k | D) = ∫ θ_k · P(Θ | D) dΘ = (α_k + N_k) / Σ_l (α_l + N_l)

Bayesian Nets & Bayesian Prediction

[Figure: plate notation - parameters θ_X and θ_Y|X, instances X[m], Y[m] for m = 1..M (observed data), and the query X[M+1], Y[M+1]]

- Priors for each parameter group are independent
- Data instances are independent given the unknown parameters

Bayesian Nets & Bayesian Prediction

[Figure: θ_X, θ_Y|X with observed instances X[1..M], Y[1..M]]

- We can also "read" from the network:
  complete data ⇒ posteriors on parameters are independent
- Can compute the posterior over each parameter separately!

Learning Parameters: Summary

- Estimation relies on sufficient statistics
  - For multinomials: counts N(x_i, pa_i)
- Parameter estimation:

  MLE:                  θ̂_{x_i|pa_i} = N(x_i, pa_i) / N(pa_i)
  Bayesian (Dirichlet): θ~_{x_i|pa_i} = (α(x_i, pa_i) + N(x_i, pa_i)) / (α(pa_i) + N(pa_i))

- Both are asymptotically equivalent and consistent
- Both can be implemented in an on-line manner by accumulating sufficient statistics

Learning Parameters: Case Study

[Figure: KL divergence to the true distribution vs. number of instances, for instances sampled from the ICU Alarm network; curves for MLE and for Bayesian estimation with prior strength M' = 5, 20, 50]

Overview

- Introduction
- Parameter Learning
- Model Selection
  - Scoring function
  - Structure search
- Structure Discovery
- Incomplete Data
- Learning from Structured Data

Why Struggle for Accurate Structure?

[Figure: true network Earthquake, Burglary → Alarm Set → Sound, and two corrupted variants]

Missing an arc
- Cannot be compensated for by fitting parameters
- Wrong assumptions about domain structure

Adding an arc
- Increases the number of parameters to be estimated
- Wrong assumptions about domain structure

Score-based Learning

Define a scoring function that evaluates how well a structure matches the data, then search for a structure that maximizes the score.

Likelihood Score for Structure

  l(G : D) = log L(G : D) = M Σ_i ( I(X_i ; Pa_i) − H(X_i) )

  where L(G : D) = P(D | G, θ̂_G) uses the max-likelihood parameters for G, I(X_i ; Pa_i) is the mutual information between X_i and its parents, and H(X_i) is the entropy of X_i.

- Larger dependence of X_i on Pa_i ⇒ higher score
- Adding arcs always helps
  - I(X; Y) ≤ I(X; {Y,Z})
  - Max score attained by fully connected network
  - Overfitting: a bad idea...

Bayesian Score

Bayesian approach: deal with uncertainty by assigning probability to all possibilities.

  P(D | G) = ∫ P(D | G, θ) P(θ | G) dθ    (marginal likelihood = ∫ likelihood × prior over parameters)

  P(G | D) = P(D | G) P(G) / P(D)

Marginal Likelihood: Multinomials

Fortunately, in many cases the integral has a closed form:
- P(Θ) is Dirichlet with hyperparameters α_1, ..., α_K
- D is a dataset with sufficient statistics N_1, ..., N_K

Then

  P(D) = [ Γ(Σ_l α_l) / Γ(Σ_l (α_l + N_l)) ] · ∏_l [ Γ(α_l + N_l) / Γ(α_l) ]

Marginal Likelihood: Bayesian Networks

Network structure determines the form of the marginal likelihood. Example dataset over X, Y (7 instances):

  X: H T T H T H H
  Y: H T H H T T H

Network 1: X and Y independent
⇒ two Dirichlet marginal likelihoods: an integral over θ_X and an integral over θ_Y

Marginal Likelihood: Bayesian Networks

Same dataset, Network 2: X → Y
⇒ three Dirichlet marginal likelihoods: an integral over θ_X, an integral over θ_Y|X=H, and an integral over θ_Y|X=T

Marginal Likelihood for Networks

The marginal likelihood has the form:

  P(D | G) = ∏_i ∏_{pa_i} [ Γ(α(pa_i^G)) / Γ(α(pa_i^G) + N(pa_i^G)) ] · ∏_{x_i} [ Γ(α(x_i, pa_i^G) + N(x_i, pa_i^G)) / Γ(α(x_i, pa_i^G)) ]

i.e., one Dirichlet marginal likelihood for each multinomial P(X_i | pa_i). N(..) are counts from the data; α(..) are hyperparameters for each family, given G.
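A sketch of one family's term in this product, using log-Gamma to avoid overflow. The count and hyperparameter layout (N[pa][x], alpha[pa][x]) is an assumption of this sketch; the full log P(D | G) would sum such terms over all families:

```python
from scipy.special import gammaln  # log Gamma

def log_family_marginal(N, alpha):
    """One family's term: products of Gamma ratios become sums of
    gammaln differences, one block per parent assignment pa."""
    total = 0.0
    for pa in N:
        a_pa = sum(alpha[pa].values())
        n_pa = sum(N[pa].values())
        total += gammaln(a_pa) - gammaln(a_pa + n_pa)
        for x in N[pa]:
            total += gammaln(alpha[pa][x] + N[pa][x]) - gammaln(alpha[pa][x])
    return total

# Binary child, binary parent, uniform hyperparameters alpha = 1:
N = {"pa0": {"x0": 3, "x1": 1}, "pa1": {"x0": 0, "x1": 4}}
alpha = {pa: {x: 1.0 for x in N[pa]} for pa in N}
print(log_family_marginal(N, alpha))
```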

Bayesian Score: Asymptotic Behavior

  log P(D | G) = l(G : D) − (log M / 2) dim(G) + O(1)
               = M Σ_i ( I(X_i ; Pa_i) − H(X_i) ) − (log M / 2) dim(G) + O(1)

  (fit dependencies in the empirical distribution vs. a complexity penalty)

- As M (amount of data) grows:
  - Increasing pressure to fit dependencies in distribution
  - Complexity term avoids fitting noise
- Asymptotic equivalence to MDL score
- Bayesian score is consistent
  - Observed data eventually overrides prior

Structure Search as Optimization

Input:
- Training data
- Scoring function
- Set of possible structures

Output:
- A network that maximizes the score

Key computational property, decomposability:

  score(G) = Σ score( family of X in G )
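A minimal sketch of the asymptotic score. Here max_loglik and dim_g are assumed to be supplied by whatever likelihood and parameter-counting code is at hand; the numbers are made up to show the penalty at work:

```python
import math

def bic_score(max_loglik, dim_g, M):
    """log P(D | G) ~ l(G : D) - (log M / 2) * dim(G)."""
    return max_loglik - 0.5 * math.log(M) * dim_g

# Two candidates over M = 1000 instances: the denser structure fits
# slightly better but pays for its extra parameters.
print(bic_score(max_loglik=-4100.0, dim_g=25, M=1000))   # ~ -4186.3
print(bic_score(max_loglik=-4090.0, dim_g=60, M=1000))   # ~ -4297.2
```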

Tree-Structured Networks

Trees:
- At most one parent per variable

Why trees?
- Elegant math ⇒ we can solve the optimization problem
- Sparse parameterization ⇒ avoid overfitting

[Figure: the ICU Alarm network restricted to a tree]

Learning Trees

- Let p(i) denote the parent of X_i
- We can write the Bayesian score as

  Score(G : D) = Σ_i Score(X_i : Pa_i)
               = Σ_i ( Score(X_i : X_p(i)) − Score(X_i) ) + Σ_i Score(X_i)

  (improvement over the "empty" network, plus the score of the "empty" network)

Score = sum of edge scores + constant

Learning Trees

- Set w(j→i) = Score(X_j → X_i) − Score(X_i)
- Find the tree (or forest) with maximal weight
  - Standard max spanning tree algorithm, O(n² log n)

Theorem: This procedure finds the tree with max score. A sketch of the procedure follows below.

Beyond Trees

When we consider more complex networks, the problem is not as easy
- Suppose we allow at most two parents per node
- A greedy algorithm is no longer guaranteed to find the optimal network
- In fact, no efficient algorithm exists

Theorem: Finding the maximal scoring structure with at most k parents per node is NP-hard for k > 1.
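Here is the promised sketch of the tree-learning step, assuming the edge weights w(j→i) have already been computed and are symmetric (as they are for score-equivalent scores), so a standard maximum spanning tree applies:

```python
def max_weight_tree(n, w):
    """Kruskal's algorithm, taking edges in decreasing weight order.
    Edges with non-positive weight are dropped, which may give a forest."""
    parent = list(range(n))                 # union-find over variables
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    chosen = []
    for (i, j) in sorted(w, key=lambda e: -w[e]):
        if w[(i, j)] <= 0:                  # this edge would hurt the score
            break
        ri, rj = find(i), find(j)
        if ri != rj:                        # avoid creating a cycle
            parent[ri] = rj
            chosen.append((i, j))
    return chosen

# Illustrative weights w(j->i) = Score(Xj -> Xi) - Score(Xi):
w = {(0, 1): 2.5, (0, 2): 0.3, (1, 2): 1.1, (1, 3): -0.2, (2, 3): 0.7}
print(max_weight_tree(4, w))                # [(0, 1), (1, 2), (2, 3)]
```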

Heuristic Search

- Define a search space:
  - search states are possible structures
  - operators make small changes to structure
- Traverse space looking for high-scoring structures
- Search techniques:
  - Greedy hill-climbing
  - Best-first search
  - Simulated annealing
  - ...

Local Search

- Start with a given network
  - empty network
  - best tree
  - a random network
- At each iteration
  - Evaluate all possible changes
  - Apply change based on score
- Stop when no modification improves score

Heuristic Search

- Typical operations on a network over {S, C, E, D}:
  - add an edge (e.g., C → D)
  - delete an edge (e.g., C → E)
  - reverse an edge (e.g., C → E becomes E → C)
- To update the score after a local change, only re-score the families that changed, e.g.:

  Δscore = S({C,E} → D) − S({E} → D)

A skeleton of greedy hill-climbing with this family-level re-scoring appears below.

Learning in Practice: Alarm Domain

[Figure: KL divergence to the true distribution vs. number of samples, comparing "structure known, fit parameters" with "learn both structure & parameters"]
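The promised hill-climbing skeleton. family_score is assumed to be a supplied decomposable score (e.g., a BDe or BIC family term); only the one family whose parent set changes is re-scored per candidate move, and edge reversal is omitted for brevity:

```python
from itertools import permutations

def is_acyclic(parents):
    """DFS over parent pointers with an on-stack set for cycle detection."""
    seen, stack = set(), set()
    def visit(v):
        if v in stack:
            return False
        if v in seen:
            return True
        stack.add(v)
        ok = all(visit(p) for p in parents[v])
        stack.discard(v)
        seen.add(v)
        return ok
    return all(visit(v) for v in parents)

def hill_climb(variables, data, family_score):
    parents = {v: set() for v in variables}      # start from empty network
    fscore = {v: family_score(v, parents[v], data) for v in variables}
    while True:
        best_delta, best_move = 0.0, None
        for x, y in permutations(variables, 2):
            new_pa = parents[x] ^ {y}            # toggle edge y -> x
            trial = dict(parents)
            trial[x] = new_pa
            if not is_acyclic(trial):
                continue
            # Decomposability: only X's family needs re-scoring.
            delta = family_score(x, new_pa, data) - fscore[x]
            if delta > best_delta:
                best_delta, best_move = delta, (x, new_pa)
        if best_move is None:                    # no change improves score
            return parents
        x, new_pa = best_move
        parents[x] = new_pa
        fscore[x] = family_score(x, new_pa, data)
```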

Local Search: Possible Pitfalls

- Local search can get stuck in:
  - Local maxima: all one-edge changes reduce the score
  - Plateaux: some one-edge changes leave the score unchanged
- Standard heuristics can escape both
  - Random restarts
  - TABU search
  - Simulated annealing

Improved Search: Weight Annealing

- Standard annealing process:
  - Take bad steps with probability ∝ exp(Δscore / t)
  - Probability increases with temperature
- Weight annealing:
  - Take uphill steps relative to the perturbed score
  - Perturbation increases with temperature

[Figure: Score(G | D) as a function of G, with a perturbed score landscape]

Perturbing the Score

- Perturb the score by reweighting instances
- Each weight sampled from a distribution with:
  - Mean = 1
  - Variance ∝ temperature
- Instances sampled from the "original" distribution
- ... but perturbation changes emphasis

Benefit: allows global moves in the search space. A sketch of the reweighting step follows below.

Weight Annealing: ICU Alarm Network

[Figure: cumulative performance of 100 runs of annealed structure search, comparing the true structure with learned parameters, annealed search, and greedy hill-climbing]
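A sketch of the reweighting step. The slides specify only mean-1 weights with variance proportional to temperature; the Gamma distribution below is one convenient choice with those moments, not necessarily the authors':

```python
import numpy as np

rng = np.random.default_rng(0)

def perturbed_counts(data, temperature):
    """Sufficient statistics under one random mean-1 reweighting.
    Gamma(k, 1/k) has mean 1 and variance 1/k = temperature."""
    k = 1.0 / max(temperature, 1e-9)
    weights = rng.gamma(shape=k, scale=1.0 / k, size=len(data))
    counts = {}
    for wgt, inst in zip(weights, data):
        counts[inst] = counts.get(inst, 0.0) + wgt
    return counts

data = [("h", "t"), ("h", "h"), ("h", "t"), ("t", "t")]
print(perturbed_counts(data, temperature=0.5))   # emphasis shifts per draw
```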

Structure Search: Summary

- Discrete optimization problem
- In some cases, the optimization problem is easy
  - Example: learning trees
- In general, NP-hard
  - Need to resort to heuristic search
  - In practice, search is relatively fast (~100 vars in ~2-5 min), thanks to decomposability and sufficient statistics
  - Adding randomness to the search is critical

Overview

- Introduction
- Parameter Estimation
- Model Selection
- Structure Discovery
- Incomplete Data
- Learning from Structured Data

Structure Discovery

Task: Discover structural properties
- Is there a direct connection between X & Y?
- Does X separate two "subsystems"?
- Does X causally affect Y?

Scientific examples:
- Disease properties and symptoms
- Interactions between the expression of genes

Discovering Structure

P(G | D)

- Current practice: model selection
  - Pick a single high-scoring model
  - Use that model to infer domain structure

Discovering Structure

P(G | D)

Problem
- Small sample size ⇒ many high-scoring models
- Answer based on one model is often useless
- Want features common to many models

Bayesian Approach

- Posterior distribution over structures
- Estimate probability of features
  - Edge X → Y
  - Path X → ... → Y
  - ...

  P(f | D) = Σ_G f(G) P(G | D)

  where f(G) is an indicator function for the feature (e.g., the edge X → Y) and P(G | D) is the Bayesian score of G.

MCMC over Networks

- Cannot enumerate structures, so sample structures:

  P(f(G) | D) ≈ (1/n) Σ_{i=1..n} f(G_i)

- MCMC sampling:
  - Define Markov chain over BNs
  - Run chain to get samples from posterior P(G | D)
- Possible pitfalls:
  - Huge (superexponential) number of networks
  - Time for chain to converge to posterior is unknown
  - Islands of high posterior, connected by low bridges

A skeleton of such a sampler appears below.

ICU Alarm BN: No Mixing

[Figure: score of current sample vs. MCMC iteration for 500 instances, with chains started from the empty network and from a greedy solution]

The runs clearly do not mix.
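A Metropolis-Hastings skeleton of the sampler described above, assuming a supplied log_score(G, data) proportional to log P(G | D), a symmetric neighbor proposal propose(G) (so the Hastings correction drops out), and a 0/1 structural feature f:

```python
import math
import random

def mcmc_structures(G0, data, log_score, propose, n_steps, feature):
    """Estimate P(f | D) by averaging a 0/1 feature along the chain."""
    G, logp = G0, log_score(G0, data)
    hits = 0
    for _ in range(n_steps):
        G_new = propose(G)                       # neighboring DAG
        logp_new = log_score(G_new, data)
        # Accept with probability min(1, P(G_new | D) / P(G | D)):
        if math.log(random.random()) < logp_new - logp:
            G, logp = G_new, logp_new
        hits += feature(G)                       # e.g., 1 if edge X->Y in G
    return hits / n_steps
```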

Effects of Non-Mixing

- Two MCMC runs over the same 500 instances
- Probability estimates for edges for the two runs

[Figure: scatter plots of edge-probability estimates, comparing a run started from the true BN with a run started from a random network]

Probability estimates are highly variable, nonrobust.

Fixed Ordering

Suppose that
- We know the ordering of variables
  - say, X_1 > X_2 > X_3 > X_4 > ... > X_n, so the parents of X_i must be in X_1, ..., X_{i-1}
- Limit number of parents per node to k

⇒ 2^(k·n·log n) networks

Intuition: the order decouples the choice of parents
- Choice of Pa(X_7) does not restrict choice of Pa(X_12)

Upshot: can compute efficiently in closed form
- Likelihood P(D | ≺)
- Feature probability P(f | D, ≺)

Our Approach: Sample Orderings

We can write

  P(f | D) = Σ_≺ P(f | ≺, D) P(≺ | D)

Sample orderings and approximate

  P(f | D) ≈ (1/n) Σ_{i=1..n} P(f | ≺_i, D)

- MCMC sampling:
  - Define Markov chain over orderings
  - Run chain to get samples from posterior P(≺ | D)

Mixing with MCMC-Orderings

- 4 runs on ICU-Alarm with 500 instances
  - fewer iterations than MCMC-Nets
  - approximately same amount of computation

[Figure: score of current sample vs. MCMC iteration, chains started from random and greedy orderings]

The process appears to be mixing!

Mixing of MCMC Runs

- Two MCMC runs over the same instances
- Probability estimates for edges

[Figure: scatter plots of edge-probability estimates for the two runs, at 500 instances and at 1000 instances]

Probability estimates are very robust.

Application: Gene Expression

Input: measurement of gene expression under different conditions
- Thousands of genes
- Hundreds of experiments

Output: models of gene interaction
- Uncover pathways

Map of Feature Confidence

Yeast data [Hughes et al 2000]
- 600 genes
- 300 experiments

[Figure: map of feature confidence]

"Mating Response" Substructure

[Figure: sub-network over SST2, KAR4, TEC1, NDJ1, KSS1, FUS1, PRM1, AGA1, YLR343W, AGA2, TOM6, FIG1, FUS3, YLR334C, MFA1, STE6, YEL059W]

- Automatically constructed sub-network of high-confidence edges
- Almost exact reconstruction of the yeast mating pathway

Overview

- Introduction
- Parameter Estimation
- Model Selection
- Structure Discovery
- Incomplete Data
  - Parameter estimation
  - Structure search
- Learning from Structured Data

Incomplete Data

Data is often incomplete
- Some variables of interest are not assigned values

This phenomenon happens when we have
- Missing values:
  - Some variables unobserved in some instances
- Hidden variables:
  - Some variables are never observed
  - We might not even know they exist

Hidden (Latent) Variables

Why should we care about unobserved variables?

[Figure: two networks over X1, X2, X3 and Y1, Y2, Y3 - with a hidden variable H connecting them, 17 parameters; without H, 59 parameters]

Example

- Network X → Y; P(X) assumed to be known
- Likelihood function of θ_Y|X=H, θ_Y|X=T

[Figure: contour plots of the log likelihood over (θ_Y|X=H, θ_Y|X=T) for different numbers of missing values of X (M = 8): no missing values, 2 missing values, 3 missing values]

In general: the likelihood function has multiple modes.

Incomplete Data

- In the presence of incomplete data, the likelihood can have multiple maxima
- Example: hidden variable H
  - We can rename the values of hidden variable H
  - If H has two values, the likelihood has two maxima
- In practice, many local maxima

EM: MLE from Incomplete Data

[Figure: L(Θ | D) as a function of Θ, with a lower-bound surrogate touching it at the current point]

- Use the current point to construct a "nice" alternative function
- Max of the new function scores ≥ the current point

Expectation Maximization (EM)

A general-purpose method for learning from incomplete data.

Intuition:
- If we had true counts, we could estimate parameters
- But with missing values, counts are unknown
- We "complete" the counts using probabilistic inference based on the current parameter assignment
- We use the completed counts as if they were real to re-estimate the parameters

Expectation Maximization (EM)

Example: data over X, Y, Z with missing values ("?"), completed using the current model, e.g. P(Y=H | X=H, Z=T, Θ) = 0.3 and P(Y=H | X=T, Θ) = 0.4:

  Data (X, Y, Z)      Expected counts N(X, Y)
  H  ?  T             X Y  #
  T  ?  ?             H H  1.3
  H  H  ?             T H  0.4
  H  T  T             H T  1.7
  T  T  H             T T  1.6

Expectation Maximization (EM)

Reiterate:

  Initial network (G, Θ0) + training data
  → E-step: expected counts computation, e.g. N(X1), N(X2), N(X3), N(H, X1, X3), N(Y1, H), N(Y2, H), N(Y3, H)
  → M-step: reparameterize
  → updated network (G, Θ1)
  → ...

Expectation Maximization (EM)

Formal guarantees:
- L(Θ1 : D) ≥ L(Θ0 : D)
  - Each iteration improves the likelihood
- If Θ1 = Θ0, then Θ0 is a stationary point of L(Θ : D)
  - Usually, this means a local maximum

A minimal worked EM loop is sketched below.
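A minimal worked EM loop for the two-variable network X → Y, with some values of X missing (None). Everything here (data, initialization, binary domains) is illustrative; the E-step completes the counts exactly as in the table two slides back:

```python
def em(data, n_iters=50):
    th_x, th_y = 0.5, {0: 0.5, 1: 0.5}   # Theta_0: P(X=1), P(Y=1 | X)
    for _ in range(n_iters):
        # E-step: expected counts, completing missing X by inference.
        n_x1 = 0.0
        n_xy = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.0}
        for x, y in data:
            if x is None:
                p1 = th_x * (th_y[1] if y else 1 - th_y[1])
                p0 = (1 - th_x) * (th_y[0] if y else 1 - th_y[0])
                q = p1 / (p1 + p0)       # P(X=1 | Y=y, current Theta)
            else:
                q = float(x)
            n_x1 += q
            n_xy[(1, y)] += q
            n_xy[(0, y)] += 1 - q
        # M-step: re-estimate as if the expected counts were real.
        th_x = n_x1 / len(data)
        th_y = {x: n_xy[(x, 1)] / (n_xy[(x, 0)] + n_xy[(x, 1)])
                for x in (0, 1)}
    return th_x, th_y

data = [(1, 1), (1, 1), (0, 0), (None, 1), (None, 0), (0, 0), (1, 0)]
print(em(data))   # each iteration provably does not decrease L(Theta : D)
```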

Summary: Parameter Learning with Incomplete Data

- Incomplete data makes parameter estimation hard
- Likelihood function
  - Does not have closed form
  - Is multimodal
- Finding max likelihood parameters:
  - EM
  - Gradient ascent
  - Both exploit inference procedures for Bayesian networks to compute expected sufficient statistics

Expectation Maximization (EM)

Computational bottleneck:
- Computation of expected counts in the E-step
  - Need to compute the posterior for each unobserved variable in each instance of the training set
  - All posteriors for an instance can be derived from one pass of standard BN inference

Incomplete Data: Structure Scores

Recall the Bayesian score:

  P(G | D) ∝ P(G) P(D | G) = P(G) ∫ P(D | G, Θ) P(Θ | G) dΘ

With incomplete data:
- Cannot evaluate the marginal likelihood in closed form
- We have to resort to approximations:
  - Evaluate the score around the MAP parameters
  - Need to find the MAP parameters (e.g., EM)

Naive Approach

- Perform EM (parametric optimization to a local maximum) for each candidate graph G1, G2, ..., Gn
- Computationally expensive:
  - Parameter optimization via EM is non-trivial
  - Need to perform EM for all candidate structures
  - Spend time even on poor candidates
- ⇒ In practice, considers only a few candidates

Structural EM

Recall: with complete data we had decomposition ⇒ efficient search.

Idea:
- Instead of optimizing the real score...
- Find a decomposable alternative score
- Such that maximizing the new score ⇒ improvement in the real score

Structural EM

Idea: use the current model to help evaluate new structures.

Outline:
- Perform search in (Structure, Parameters) space
- At each iteration, use the current model for finding either:
  - Better scoring parameters: "parametric" EM step, or
  - Better scoring structure: "structural" EM step

Structural EM: Reiterate

  Current model (structure + parameters) + training data
  → score & parameterize: compute expected counts for the current structure (e.g., N(X1), N(H, X1, X3), N(Y1, H), ...) and for candidate changes (e.g., N(X2, X1), N(Y1, X2), N(Y2, Y1, H))
  → choose a better structure and/or parameters
  → ...

Example: Phylogenetic Reconstruction

Input: biological sequences, an "instance" of the evolutionary process

  Human  CGTTGC...
  Chimp  CCTAGG...
  Orang  CGAACG...
  ...

Assumption: positions are independent

Output: a phylogeny

[Figure: phylogenetic tree spanning ~10 billion years, with current-day species at the leaves]

Phylogenetic Model

- Topology: bifurcating
  - Observed species: 1...N
  - Ancestral species: N+1...2N-2
- Lengths t = {t_i,j} for each branch (i,j)
- Evolutionary model:
  - e.g., P(A changes to T | 10 billion yrs)

[Figure: tree with leaves 1-7, internal nodes 8-12, and branch (8,9) marked]

Phylogenetic Tree as a Bayes Net

- Variables: letter at each position for each species
  - Current-day species: observed
  - Ancestral species: hidden
- BN structure: tree topology
- BN parameters: branch lengths (time spans)

Main problem: learn the topology. If ancestral species were observed ⇒ easy learning problem (learning trees).

Algorithm Outline

Starting from the original tree (T0, t0), repeat until convergence:

- Compute expected pairwise statistics
  - O(N²) pairwise statistics suffice to evaluate all trees
- Weights: branch scores w_i,j (pairwise weights)
- Find: T' = argmax_T Σ_{(i,j)∈T} w_i,j
  - Max spanning tree
- Construct bifurcation T1 (new tree)

Theorem: L(T1, t1) ≥ L(T0, t0)

Real Life Data

                              Lysozyme c    Mitochondrial genomes
  # sequences                 43            34
  # positions                 122           3,578
  Log-likelihood:
    Traditional approach      -2,916.2      -74,227.9
    Structural EM approach    -2,892.1      -70,533.5
    Difference per position   0.19          1.03

With structural EM, each position is about twice as likely.

Overview

- Introduction
- Parameter Estimation
- Model Selection
- Structure Discovery
- Incomplete Data
- Learning from Structured Data

Bayesian Networks: Problem

- Bayesian nets use a propositional representation
- Real world has objects, related to each other

[Figure: a generic Intelligence, Difficulty → Grade model, and its ground instantiations: Intell_J.Doe, Diffic_CS101 → Grade_JDoe_CS101 = A; Intell_FGump, Diffic_CS101 → Grade_FGump_CS101 = C; Intell_FGump, Diffic_Geo101 → Grade_FGump_Geo101]

These "instances" are not independent!

St. Nordaf University

[Figure: example domain - professors with Teaching-Ability; courses (Welcome to Geo101, Welcome to CS101) with Difficulty; students (Forrest Gump, Jane Doe) with Intelligence; registrations with Grade and Satisfaction; links Teaches, In-course, Registered]

Relational Schema

Specifies the types of objects in the domain, the attributes of each type of object, & the types of links between objects.

  Classes and attributes:
    Professor: Teaching-Ability
    Student: Intelligence
    Course: Difficulty
    Registration: Grade, Satisfaction
  Links: Teach, Take, In

Representing the Distribution

Need to represent an infinite set of complex distributions:
- Infinitely many potential universities
- Each associated with a very different set of worlds

Possible Worlds

- World: assignment to all attributes of all objects in the domain
- Many possible worlds for a given university
  - All possible assignments of all attributes of all objects

[Figure: one possible world - teaching abilities for Prof. Jones and Prof. Smith, grades and satisfactions for Forrest Gump and Jane Doe in Geo101 and CS101]

Probabilistic Relational Models

Key ideas:
- Universals: probabilistic patterns hold for all objects in a class
- Locality: represent direct probabilistic dependencies
  - Links give us potential interactions!

[Figure: PRM for the school domain - Professor.Teaching-Ability, Student.Intelligence, Course.Difficulty, Reg.Grade, Reg.Satisfaction, with a CPT θ_Grade|Intell,Diffic giving the grade distribution (A/B/C) for easy/hard × weak/smart]

PRM Semantics

Instantiated PRM ⇒ BN
- variables: attributes of all objects
- dependencies: determined by links & the PRM
- the CPT θ_Grade|Intell,Diffic is shared across all registrations

The Web of Influence

- Objects are all correlated
- Need to perform inference over the entire model
- For large databases, use approximate inference:
  - Loopy belief propagation

[Figure: ground network linking Profs. Jones and Smith, Geo101 and CS101, and the students' grades and satisfactions]

PRM Learning: Complete Data

- Entire database is a single "instance"
- Introduce a prior over parameters
- Update the prior with sufficient statistics, e.g.:

  Count(Reg.Grade = A, Reg.Course.Diff = lo, Reg.Student.Intel = hi)

- Parameters are used many times in the instance

PRM Learning: Incomplete Data

- Use expected sufficient statistics
- But, everything is correlated:
  ⇒ E-step uses (approximate) inference over the entire model

[Figure: school domain with unobserved attributes marked "?"]

A Web of Data: WebKB [Craven et al.]

[Figure: web pages and links - Tom Mitchell (Professor), Sean Slattery (Student), CMU CS Faculty page; relations Project-of, Advisee-of, Works-on, Contains]

Standard Approach

Classify each page from its words alone:

  Page: Category → Word_1 ... Word_N

  (e.g., words such as "professor", "department", "extract", "information", "computer science")

[Figure: classification accuracy, roughly in the 0.52-0.68 range]

What's in a Link

Model the links as well:

  From-Page: Category → Word_1 ... Word_N
  To-Page: Category → Word_1 ... Word_N
  Link: Exists, depending on both categories

[Figure: classification accuracy improves when links are modeled]

Discovering Hidden Concepts

Internet Movie Database (http://www.imdb.com)

[Figure: relational schema - Actor (Type, Gender), Director (Type), Movie (Type, Genre, Rating, #Votes, Year, MPAA Rating), Appeared (Credit-Order) - with hidden Type attributes to be discovered]

Web of Influence, Yet Again

[Figure: clusters discovered in the IMDB data]

  Movies: Wizard of Oz, Cinderella, Sound of Music, The Love Bug, Pollyanna, The Parent Trap, Mary Poppins, Swiss Family Robinson, Terminator 2, Batman, Batman Forever, Mission: Impossible, GoldenEye, Starship Troopers, Hunt for Red October
  Actors: Sylvester Stallone, Bruce Willis, Harrison Ford, Steven Seagal, Kurt Russell, Kevin Costner, Jean-Claude Van Damme, Arnold Schwarzenegger, Anthony Hopkins, Robert De Niro, Tommy Lee Jones, Harvey Keitel, Morgan Freeman, Gary Oldman
  Directors: Alfred Hitchcock, Stanley Kubrick, David Lean, Milos Forman, Terry Gilliam, Francis Coppola, Steven Spielberg, Tim Burton, Tony Scott, James Cameron, John McTiernan, Joel Schumacher

Conclusion

- Many distributions have combinatorial dependency structure
- Utilizing this structure is good
- Discovering this structure has implications:
  - To density estimation
  - To knowledge discovery
- Many applications:
  - Medicine
  - Biology
  - Web

The END

Thanks to
- Gal Elidan
- Dana Pe’er
- Lise Getoor
- Eran Segal
- Moises Goldszmidt
- Ben Taskar
- Matan Ninio

Slides will be available from:
http://www.cs.huji.ac.il/~nir/
http://robotics.stanford.edu/~koller/