Learning Bayesian Networks from Data

Nir Friedman (Hebrew U.)    Daphne Koller (Stanford)

Overview

- Introduction
- Parameter Estimation
- Model Selection
- Structure Discovery
- Incomplete Data
- Learning from Structured Data

Bayesian Networks

Compact representation of probability distributions via conditional independence.

Qualitative part: directed acyclic graph (DAG)
- Nodes - random variables
- Edges - direct influence

Quantitative part: set of conditional probability distributions, e.g. the family of Alarm:

  E   B   | P(a | E,B)   P(¬a | E,B)
  e   b   |   0.9           0.1
  e   ¬b  |   0.2           0.8
  ¬e  b   |   0.9           0.1
  ¬e  ¬b  |   0.01          0.99

Together they define a unique distribution in factored form:

  P(B,E,A,C,R) = P(B) P(E) P(A|B,E) P(R|E) P(C|A)

[Figure: the Earthquake/Burglary network - Earthquake, Burglary, Alarm, Radio, Call]

Example: "ICU Alarm" Network

Domain: monitoring intensive-care patients
- 37 variables
- 509 parameters ...instead of 2^54

[Figure: the ICU Alarm network (MINVOLSET, PULMEMBOLUS, INTUBATION, VENTLUNG, ..., HRBP, BP)]

Inference

- Posterior probabilities
  - Probability of any event given any evidence
- Most likely explanation
  - Scenario that explains evidence
- Rational decision making
  - Maximize expected utility
  - Value of information
- Effect of intervention

[Figure: Earthquake/Burglary network with Radio, Alarm, Call]

Why Learning?

Knowledge acquisition bottleneck
- Knowledge acquisition is an expensive process
- Often we don't have an expert

Data is cheap
- Amount of available information growing rapidly
- Learning allows us to construct models from raw data

Why Learn Bayesian Networks?

- Conditional independencies & graphical language capture structure of many real-world distributions
- Graph structure provides much insight into the domain
  - Allows "knowledge discovery"
- Learned model can be used for many tasks
- Supports all the features of probabilistic learning
  - Model selection criteria
  - Dealing with missing data & hidden variables

Learning Bayesian Networks

Data + prior information → Learner → Bayesian network

[Figure: learned network over E, B → A → C, with CPT P(A | E,B): e,b: .9/.1; e,¬b: .7/.3; ¬e,b: .8/.2; ¬e,¬b: .99/.01]

Known Structure, Complete Data

- Network structure is specified
  - Inducer needs to estimate parameters
- Data does not contain missing values

Unknown Structure, Complete Data

- Network structure is not specified
  - Inducer needs to select arcs & estimate parameters
- Data does not contain missing values

Known Structure, Incomplete Data

- Network structure is specified
- Data contains missing values
  - Need to consider assignments to missing values

Unknown Structure, Incomplete Data

- Network structure is not specified
- Data contains missing values
  - Need to consider assignments to missing values

Overview

- Introduction
- Parameter Estimation
  - Bayesian estimation
- Model Selection
- Structure Discovery
- Incomplete Data
- Learning from Structured Data

Learning Parameters

Training data has the form:

  D = [ E[1] B[1] A[1] C[1]
        ...
        E[M] B[M] A[M] C[M] ]

[Figure: network E, B → A → C]

Likelihood Function

- Assume i.i.d. samples
- Likelihood function is

  L(Θ : D) = ∏_m P(E[m], B[m], A[m], C[m] : Θ)

- By definition of the network, each term factors:

  L(Θ : D) = ∏_m P(E[m] : Θ) P(B[m] : Θ) P(A[m] | B[m], E[m] : Θ) P(C[m] | A[m] : Θ)

[Figure: network E, B → A → C and the M×4 data matrix]

Likelihood Function

- Rewriting terms, the likelihood decomposes by variable:

  L(Θ : D) = ∏_m P(E[m] : Θ) · ∏_m P(B[m] : Θ) · ∏_m P(A[m] | B[m], E[m] : Θ) · ∏_m P(C[m] | A[m] : Θ)

General Bayesian Networks

Generalizing for any Bayesian network:

  L(Θ : D) = ∏_m P(x_1[m], ..., x_n[m] : Θ)
           = ∏_i ∏_m P(x_i[m] | Pa_i[m] : Θ_i)
           = ∏_i L_i(Θ_i : D)

Decomposition ⇒ independent estimation problems
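To make the decomposition concrete, here is a minimal sketch in Python. The structure, parameters, and data below are illustrative stand-ins, not from the tutorial; the point is that the log-likelihood is a sum of independent per-family terms.

```python
import math

# A toy structure: E and B are roots, A depends on E and B, C on A.
structure = {"E": [], "B": [], "A": ["E", "B"], "C": ["A"]}

# theta[X][parent_assignment] = P(X = 1 | parents); values are made up.
theta = {
    "E": {(): 0.1},
    "B": {(): 0.2},
    "A": {(1, 1): 0.9, (1, 0): 0.9, (0, 1): 0.8, (0, 0): 0.01},
    "C": {(1,): 0.7, (0,): 0.05},
}

data = [
    {"E": 0, "B": 1, "A": 1, "C": 1},
    {"E": 0, "B": 0, "A": 0, "C": 0},
]

def log_likelihood(structure, theta, data):
    """log L(Theta : D) = sum_i sum_m log P(x_i[m] | Pa_i[m] : Theta_i)."""
    total = 0.0
    for x, parents in structure.items():   # independent term per family i
        for inst in data:                  # product over instances m
            pa = tuple(inst[p] for p in parents)
            p1 = theta[x][pa]
            total += math.log(p1 if inst[x] == 1 else 1.0 - p1)
    return total

print(log_likelihood(structure, theta, data))
```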

Likelihood Function: Multinomials

  L(θ : D) = P(D | θ) = ∏_m P(x[m] | θ)

- The likelihood for the sequence H, T, T, H, H is

  L(θ : D) = θ · (1−θ) · (1−θ) · θ · θ

- General case:

  L(Θ : D) = ∏_{k=1..K} θ_k^{N_k}

  where θ_k is the probability of the kth outcome and N_k is the count of the kth outcome in D.

[Figure: likelihood L(θ : D) as a function of θ ∈ [0, 1]]

Bayesian Inference

- Represent uncertainty about parameters using a probability distribution over parameters, data
- Learning using Bayes rule:

  P(θ | x[1], ..., x[M]) = P(x[1], ..., x[M] | θ) P(θ) / P(x[1], ..., x[M])

  (posterior = likelihood × prior / probability of data)

Bayesian Inference

- Represent the Bayesian distribution as a Bayes net:

  θ → X[1], X[2], ..., X[M]    (observed data)

- The values of X are independent given θ
- P(x[m] = H | θ) = θ
- P(θ | x[1], ..., x[M]) ∝ P(x[1], ..., x[M] | θ) · P(θ)
- Bayesian prediction is inference in this network

Example: Binomial Data

- Prior: uniform for θ in [0,1]
  ⇒ P(θ | D) ∝ the likelihood L(θ : D)
- Observed data: (N_H, N_T) = (4, 1)
- MLE for P(X = H) is 4/5 = 0.8
- Bayesian prediction is

  P(x[M+1] = H | D) = ∫ θ · P(θ | D) dθ = 5/7 = 0.7142...

[Figure: posterior P(θ | D) over θ ∈ [0, 1]]
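The 5/7 above is easy to check numerically. A small sketch assuming nothing beyond NumPy; the grid integration just makes the "prediction = posterior mean" point visible:

```python
import numpy as np

theta = np.linspace(0.0, 1.0, 100001)
posterior = theta**4 * (1.0 - theta)       # uniform prior x likelihood
posterior /= np.trapz(posterior, theta)    # normalize

prediction = np.trapz(theta * posterior, theta)  # E[theta | D]
print(prediction)                          # ~0.714285..., i.e. 5/7
```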

Dirichlet Priors

- Recall that the likelihood function is

  L(Θ : D) = ∏_{k=1..K} θ_k^{N_k}

- Dirichlet prior with hyperparameters α_1, ..., α_K:

  P(Θ) ∝ ∏_{k=1..K} θ_k^(α_k − 1)

⇒ the posterior has the same form, with hyperparameters α_1 + N_1, ..., α_K + N_K:

  P(Θ | D) ∝ P(Θ) P(D | Θ) ∝ ∏_k θ_k^(α_k − 1) ∏_k θ_k^{N_k} = ∏_k θ_k^(α_k + N_k − 1)

Dirichlet Priors - Example

[Figure: densities of Dirichlet(0.5, 0.5), Dirichlet(1, 1), Dirichlet(2, 2), and Dirichlet(5, 5) priors over θ_heads]

Dirichlet Priors (cont.)

- If P(Θ) is Dirichlet with hyperparameters α_1, ..., α_K, then

  P(X[1] = k) = ∫ θ_k · P(Θ) dΘ = α_k / Σ_l α_l

- Since the posterior is also Dirichlet, we get

  P(X[M+1] = k | D) = ∫ θ_k · P(Θ | D) dΘ = (α_k + N_k) / Σ_l (α_l + N_l)

Bayesian Nets & Bayesian Prediction

[Figure: plate notation - parameters θ_X and θ_Y|X, instances X[m], Y[m] for m = 1..M (observed data), and the query X[M+1], Y[M+1]]

- Priors for each parameter group are independent
- Data instances are independent given the unknown parameters

Bayesian Nets & Bayesian Prediction

[Figure: θ_X, θ_Y|X with observed instances X[1..M], Y[1..M]]

- We can also "read" from the network:
  complete data ⇒ posteriors on parameters are independent
- Can compute the posterior over each parameter separately!

Learning Parameters: Summary

- Estimation relies on sufficient statistics
  - For multinomials: counts N(x_i, pa_i)
- Parameter estimation:

  MLE:                  θ̂_{x_i|pa_i} = N(x_i, pa_i) / N(pa_i)
  Bayesian (Dirichlet): θ~_{x_i|pa_i} = (α(x_i, pa_i) + N(x_i, pa_i)) / (α(pa_i) + N(pa_i))

- Both are asymptotically equivalent and consistent
- Both can be implemented in an on-line manner by accumulating sufficient statistics

Learning Parameters: Case Study

[Figure: KL divergence to the true distribution vs. number of instances, for instances sampled from the ICU Alarm network; curves for MLE and for Bayesian estimation with prior strength M' = 5, 20, 50]

Overview

- Introduction
- Parameter Learning
- Model Selection
  - Scoring function
  - Structure search
- Structure Discovery
- Incomplete Data
- Learning from Structured Data

Why Struggle for Accurate Structure?

[Figure: true network Earthquake, Burglary → Alarm Set → Sound, and two corrupted variants]

Missing an arc
- Cannot be compensated for by fitting parameters
- Wrong assumptions about domain structure

Adding an arc
- Increases the number of parameters to be estimated
- Wrong assumptions about domain structure

Score-based Learning

Define a scoring function that evaluates how well a structure matches the data, then search for a structure that maximizes the score.

Likelihood Score for Structure

  l(G : D) = log L(G : D) = M Σ_i ( I(X_i ; Pa_i) − H(X_i) )

  where L(G : D) = P(D | G, θ̂_G) uses the max-likelihood parameters for G, I(X_i ; Pa_i) is the mutual information between X_i and its parents, and H(X_i) is the entropy of X_i.

- Larger dependence of X_i on Pa_i ⇒ higher score
- Adding arcs always helps
  - I(X; Y) ≤ I(X; {Y,Z})
  - Max score attained by fully connected network
  - Overfitting: a bad idea...

Bayesian Score

Bayesian approach: deal with uncertainty by assigning probability to all possibilities.

  P(D | G) = ∫ P(D | G, θ) P(θ | G) dθ    (marginal likelihood = ∫ likelihood × prior over parameters)

  P(G | D) = P(D | G) P(G) / P(D)

Marginal Likelihood: Multinomials

Fortunately, in many cases the integral has a closed form:
- P(Θ) is Dirichlet with hyperparameters α_1, ..., α_K
- D is a dataset with sufficient statistics N_1, ..., N_K

Then

  P(D) = [ Γ(Σ_l α_l) / Γ(Σ_l (α_l + N_l)) ] · ∏_l [ Γ(α_l + N_l) / Γ(α_l) ]

Marginal Likelihood: Bayesian Networks

Network structure determines the form of the marginal likelihood. Example dataset over X, Y (7 instances):

  X: H T T H T H H
  Y: H T H H T T H

Network 1: X and Y independent
⇒ two Dirichlet marginal likelihoods: an integral over θ_X and an integral over θ_Y

Marginal Likelihood: Bayesian Networks

Same dataset, Network 2: X → Y
⇒ three Dirichlet marginal likelihoods: an integral over θ_X, an integral over θ_Y|X=H, and an integral over θ_Y|X=T

Marginal Likelihood for Networks

The marginal likelihood has the form:

  P(D | G) = ∏_i ∏_{pa_i} [ Γ(α(pa_i^G)) / Γ(α(pa_i^G) + N(pa_i^G)) ] · ∏_{x_i} [ Γ(α(x_i, pa_i^G) + N(x_i, pa_i^G)) / Γ(α(x_i, pa_i^G)) ]

i.e., one Dirichlet marginal likelihood for each multinomial P(X_i | pa_i). N(..) are counts from the data; α(..) are hyperparameters for each family, given G.
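A sketch of one family's term in this product, using log-Gamma to avoid overflow. The count and hyperparameter layout (N[pa][x], alpha[pa][x]) is an assumption of this sketch; the full log P(D | G) would sum such terms over all families:

```python
from scipy.special import gammaln  # log Gamma

def log_family_marginal(N, alpha):
    """One family's term: products of Gamma ratios become sums of
    gammaln differences, one block per parent assignment pa."""
    total = 0.0
    for pa in N:
        a_pa = sum(alpha[pa].values())
        n_pa = sum(N[pa].values())
        total += gammaln(a_pa) - gammaln(a_pa + n_pa)
        for x in N[pa]:
            total += gammaln(alpha[pa][x] + N[pa][x]) - gammaln(alpha[pa][x])
    return total

# Binary child, binary parent, uniform hyperparameters alpha = 1:
N = {"pa0": {"x0": 3, "x1": 1}, "pa1": {"x0": 0, "x1": 4}}
alpha = {pa: {x: 1.0 for x in N[pa]} for pa in N}
print(log_family_marginal(N, alpha))
```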

Bayesian Score: Asymptotic Behavior

  log P(D | G) = l(G : D) − (log M / 2) dim(G) + O(1)
               = M Σ_i ( I(X_i ; Pa_i) − H(X_i) ) − (log M / 2) dim(G) + O(1)

  (fit dependencies in the empirical distribution vs. a complexity penalty)

- As M (amount of data) grows:
  - Increasing pressure to fit dependencies in distribution
  - Complexity term avoids fitting noise
- Asymptotic equivalence to MDL score
- Bayesian score is consistent
  - Observed data eventually overrides prior

Structure Search as Optimization

Input:
- Training data
- Scoring function
- Set of possible structures

Output:
- A network that maximizes the score

Key computational property, decomposability:

  score(G) = Σ score( family of X in G )
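A minimal sketch of the asymptotic score. Here max_loglik and dim_g are assumed to be supplied by whatever likelihood and parameter-counting code is at hand; the numbers are made up to show the penalty at work:

```python
import math

def bic_score(max_loglik, dim_g, M):
    """log P(D | G) ~ l(G : D) - (log M / 2) * dim(G)."""
    return max_loglik - 0.5 * math.log(M) * dim_g

# Two candidates over M = 1000 instances: the denser structure fits
# slightly better but pays for its extra parameters.
print(bic_score(max_loglik=-4100.0, dim_g=25, M=1000))   # ~ -4186.3
print(bic_score(max_loglik=-4090.0, dim_g=60, M=1000))   # ~ -4297.2
```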

Tree-Structured Networks

Trees:
- At most one parent per variable

Why trees?
- Elegant math ⇒ we can solve the optimization problem
- Sparse parameterization ⇒ avoid overfitting

[Figure: the ICU Alarm network restricted to a tree]

Learning Trees

- Let p(i) denote the parent of X_i
- We can write the Bayesian score as

  Score(G : D) = Σ_i Score(X_i : Pa_i)
               = Σ_i ( Score(X_i : X_p(i)) − Score(X_i) ) + Σ_i Score(X_i)

  (improvement over the "empty" network, plus the score of the "empty" network)

Score = sum of edge scores + constant

Learning Trees

- Set w(j→i) = Score(X_j → X_i) − Score(X_i)
- Find the tree (or forest) with maximal weight
  - Standard max spanning tree algorithm, O(n² log n)

Theorem: This procedure finds the tree with max score. A sketch of the procedure follows below.

Beyond Trees

When we consider more complex networks, the problem is not as easy
- Suppose we allow at most two parents per node
- A greedy algorithm is no longer guaranteed to find the optimal network
- In fact, no efficient algorithm exists

Theorem: Finding the maximal scoring structure with at most k parents per node is NP-hard for k > 1.
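Here is the promised sketch of the tree-learning step, assuming the edge weights w(j→i) have already been computed and are symmetric (as they are for score-equivalent scores), so a standard maximum spanning tree applies:

```python
def max_weight_tree(n, w):
    """Kruskal's algorithm, taking edges in decreasing weight order.
    Edges with non-positive weight are dropped, which may give a forest."""
    parent = list(range(n))                 # union-find over variables
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    chosen = []
    for (i, j) in sorted(w, key=lambda e: -w[e]):
        if w[(i, j)] <= 0:                  # this edge would hurt the score
            break
        ri, rj = find(i), find(j)
        if ri != rj:                        # avoid creating a cycle
            parent[ri] = rj
            chosen.append((i, j))
    return chosen

# Illustrative weights w(j->i) = Score(Xj -> Xi) - Score(Xi):
w = {(0, 1): 2.5, (0, 2): 0.3, (1, 2): 1.1, (1, 3): -0.2, (2, 3): 0.7}
print(max_weight_tree(4, w))                # [(0, 1), (1, 2), (2, 3)]
```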

Heuristic Search

- Define a search space:
  - search states are possible structures
  - operators make small changes to structure
- Traverse space looking for high-scoring structures
- Search techniques:
  - Greedy hill-climbing
  - Best-first search
  - Simulated annealing
  - ...

Local Search

- Start with a given network
  - empty network
  - best tree
  - a random network
- At each iteration
  - Evaluate all possible changes
  - Apply change based on score
- Stop when no modification improves score

Heuristic Search

- Typical operations on a network over {S, C, E, D}:
  - add an edge (e.g., C → D)
  - delete an edge (e.g., C → E)
  - reverse an edge (e.g., C → E becomes E → C)
- To update the score after a local change, only re-score the families that changed, e.g.:

  Δscore = S({C,E} → D) − S({E} → D)

A skeleton of greedy hill-climbing with this family-level re-scoring appears below.

Learning in Practice: Alarm Domain

[Figure: KL divergence to the true distribution vs. number of samples, comparing "structure known, fit parameters" with "learn both structure & parameters"]
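The promised hill-climbing skeleton. family_score is assumed to be a supplied decomposable score (e.g., a BDe or BIC family term); only the one family whose parent set changes is re-scored per candidate move, and edge reversal is omitted for brevity:

```python
from itertools import permutations

def is_acyclic(parents):
    """DFS over parent pointers with an on-stack set for cycle detection."""
    seen, stack = set(), set()
    def visit(v):
        if v in stack:
            return False
        if v in seen:
            return True
        stack.add(v)
        ok = all(visit(p) for p in parents[v])
        stack.discard(v)
        seen.add(v)
        return ok
    return all(visit(v) for v in parents)

def hill_climb(variables, data, family_score):
    parents = {v: set() for v in variables}      # start from empty network
    fscore = {v: family_score(v, parents[v], data) for v in variables}
    while True:
        best_delta, best_move = 0.0, None
        for x, y in permutations(variables, 2):
            new_pa = parents[x] ^ {y}            # toggle edge y -> x
            trial = dict(parents)
            trial[x] = new_pa
            if not is_acyclic(trial):
                continue
            # Decomposability: only X's family needs re-scoring.
            delta = family_score(x, new_pa, data) - fscore[x]
            if delta > best_delta:
                best_delta, best_move = delta, (x, new_pa)
        if best_move is None:                    # no change improves score
            return parents
        x, new_pa = best_move
        parents[x] = new_pa
        fscore[x] = family_score(x, new_pa, data)
```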

Local Search: Possible Pitfalls

- Local search can get stuck in:
  - Local maxima: all one-edge changes reduce the score
  - Plateaux: some one-edge changes leave the score unchanged
- Standard heuristics can escape both
  - Random restarts
  - TABU search
  - Simulated annealing

Improved Search: Weight Annealing

- Standard annealing process:
  - Take bad steps with probability ∝ exp(Δscore / t)
  - Probability increases with temperature
- Weight annealing:
  - Take uphill steps relative to the perturbed score
  - Perturbation increases with temperature

[Figure: Score(G | D) as a function of G, with a perturbed score landscape]

Perturbing the Score

- Perturb the score by reweighting instances
- Each weight sampled from a distribution with:
  - Mean = 1
  - Variance ∝ temperature
- Instances sampled from the "original" distribution
- ... but perturbation changes emphasis

Benefit: allows global moves in the search space. A sketch of the reweighting step follows below.

Weight Annealing: ICU Alarm Network

[Figure: cumulative performance of 100 runs of annealed structure search, comparing the true structure with learned parameters, annealed search, and greedy hill-climbing]
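A sketch of the reweighting step. The slides specify only mean-1 weights with variance proportional to temperature; the Gamma distribution below is one convenient choice with those moments, not necessarily the authors':

```python
import numpy as np

rng = np.random.default_rng(0)

def perturbed_counts(data, temperature):
    """Sufficient statistics under one random mean-1 reweighting.
    Gamma(k, 1/k) has mean 1 and variance 1/k = temperature."""
    k = 1.0 / max(temperature, 1e-9)
    weights = rng.gamma(shape=k, scale=1.0 / k, size=len(data))
    counts = {}
    for wgt, inst in zip(weights, data):
        counts[inst] = counts.get(inst, 0.0) + wgt
    return counts

data = [("h", "t"), ("h", "h"), ("h", "t"), ("t", "t")]
print(perturbed_counts(data, temperature=0.5))   # emphasis shifts per draw
```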

Structure Search: Summary

- Discrete optimization problem
- In some cases, the optimization problem is easy
  - Example: learning trees
- In general, NP-hard
  - Need to resort to heuristic search
  - In practice, search is relatively fast (~100 vars in ~2-5 min), thanks to decomposability and sufficient statistics
  - Adding randomness to the search is critical

Overview

- Introduction
- Parameter Estimation
- Model Selection
- Structure Discovery
- Incomplete Data
- Learning from Structured Data

Structure Discovery

Task: Discover structural properties
- Is there a direct connection between X & Y?
- Does X separate two "subsystems"?
- Does X causally affect Y?

Scientific examples:
- Disease properties and symptoms
- Interactions between the expression of genes

Discovering Structure

P(G | D)

- Current practice: model selection
  - Pick a single high-scoring model
  - Use that model to infer domain structure

Discovering Structure

P(G | D)

Problem
- Small sample size ⇒ many high-scoring models
- Answer based on one model is often useless
- Want features common to many models

Bayesian Approach

- Posterior distribution over structures
- Estimate probability of features
  - Edge X → Y
  - Path X → ... → Y
  - ...

  P(f | D) = Σ_G f(G) P(G | D)

  where f(G) is an indicator function for the feature (e.g., the edge X → Y) and P(G | D) is the Bayesian score of G.

MCMC over Networks

- Cannot enumerate structures, so sample structures:

  P(f(G) | D) ≈ (1/n) Σ_{i=1..n} f(G_i)

- MCMC sampling:
  - Define Markov chain over BNs
  - Run chain to get samples from posterior P(G | D)
- Possible pitfalls:
  - Huge (superexponential) number of networks
  - Time for chain to converge to posterior is unknown
  - Islands of high posterior, connected by low bridges

A skeleton of such a sampler appears below.

ICU Alarm BN: No Mixing

[Figure: score of current sample vs. MCMC iteration for 500 instances, with chains started from the empty network and from a greedy solution]

The runs clearly do not mix.
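A Metropolis-Hastings skeleton of the sampler described above, assuming a supplied log_score(G, data) proportional to log P(G | D), a symmetric neighbor proposal propose(G) (so the Hastings correction drops out), and a 0/1 structural feature f:

```python
import math
import random

def mcmc_structures(G0, data, log_score, propose, n_steps, feature):
    """Estimate P(f | D) by averaging a 0/1 feature along the chain."""
    G, logp = G0, log_score(G0, data)
    hits = 0
    for _ in range(n_steps):
        G_new = propose(G)                       # neighboring DAG
        logp_new = log_score(G_new, data)
        # Accept with probability min(1, P(G_new | D) / P(G | D)):
        if math.log(random.random()) < logp_new - logp:
            G, logp = G_new, logp_new
        hits += feature(G)                       # e.g., 1 if edge X->Y in G
    return hits / n_steps
```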

Effects of Non-Mixing

- Two MCMC runs over the same 500 instances
- Probability estimates for edges for the two runs

[Figure: scatter plots of edge-probability estimates, comparing a run started from the true BN with a run started from a random network]

Probability estimates are highly variable, nonrobust.

Fixed Ordering

Suppose that
- We know the ordering of variables
  - say, X_1 > X_2 > X_3 > X_4 > ... > X_n, so the parents of X_i must be in X_1, ..., X_{i-1}
- Limit number of parents per node to k

⇒ 2^(k·n·log n) networks

Intuition: the order decouples the choice of parents
- Choice of Pa(X_7) does not restrict choice of Pa(X_12)

Upshot: can compute efficiently in closed form
- Likelihood P(D | ≺)
- Feature probability P(f | D, ≺)

Our Approach: Sample Orderings

We can write

  P(f | D) = Σ_≺ P(f | ≺, D) P(≺ | D)

Sample orderings and approximate

  P(f | D) ≈ (1/n) Σ_{i=1..n} P(f | ≺_i, D)

- MCMC sampling:
  - Define Markov chain over orderings
  - Run chain to get samples from posterior P(≺ | D)

Mixing with MCMC-Orderings

- 4 runs on ICU-Alarm with 500 instances
  - fewer iterations than MCMC-Nets
  - approximately same amount of computation

[Figure: score of current sample vs. MCMC iteration, chains started from random and greedy orderings]

The process appears to be mixing!

Mixing of MCMC Runs

- Two MCMC runs over the same instances
- Probability estimates for edges

[Figure: scatter plots of edge-probability estimates for the two runs, at 500 instances and at 1000 instances]

Probability estimates are very robust.

Application: Gene Expression

Input: measurement of gene expression under different conditions
- Thousands of genes
- Hundreds of experiments

Output: models of gene interaction
- Uncover pathways

Map of Feature Confidence

Yeast data [Hughes et al 2000]
- 600 genes
- 300 experiments

[Figure: map of feature confidence]

"Mating Response" Substructure

[Figure: sub-network over SST2, KAR4, TEC1, NDJ1, KSS1, FUS1, PRM1, AGA1, YLR343W, AGA2, TOM6, FIG1, FUS3, YLR334C, MFA1, STE6, YEL059W]

- Automatically constructed sub-network of high-confidence edges
- Almost exact reconstruction of the yeast mating pathway

Overview

- Introduction
- Parameter Estimation
- Model Selection
- Structure Discovery
- Incomplete Data
  - Parameter estimation
  - Structure search
- Learning from Structured Data

Incomplete Data

Data is often incomplete
- Some variables of interest are not assigned values

This phenomenon happens when we have
- Missing values:
  - Some variables unobserved in some instances
- Hidden variables:
  - Some variables are never observed
  - We might not even know they exist

Hidden (Latent) Variables

Why should we care about unobserved variables?

[Figure: two networks over X1, X2, X3 and Y1, Y2, Y3 - with a hidden variable H connecting them, 17 parameters; without H, 59 parameters]

Example

- Network X → Y; P(X) assumed to be known
- Likelihood function of θ_Y|X=H, θ_Y|X=T

[Figure: contour plots of the log likelihood over (θ_Y|X=H, θ_Y|X=T) for different numbers of missing values of X (M = 8): no missing values, 2 missing values, 3 missing values]

In general: the likelihood function has multiple modes.

Incomplete Data

- In the presence of incomplete data, the likelihood can have multiple maxima
- Example: hidden variable H
  - We can rename the values of hidden variable H
  - If H has two values, the likelihood has two maxima
- In practice, many local maxima

EM: MLE from Incomplete Data

[Figure: L(Θ | D) as a function of Θ, with a lower-bound surrogate touching it at the current point]

- Use the current point to construct a "nice" alternative function
- Max of the new function scores ≥ the current point

Expectation Maximization (EM)

A general-purpose method for learning from incomplete data.

Intuition:
- If we had true counts, we could estimate parameters
- But with missing values, counts are unknown
- We "complete" the counts using probabilistic inference based on the current parameter assignment
- We use the completed counts as if they were real to re-estimate the parameters

Expectation Maximization (EM)

Example: data over X, Y, Z with missing values ("?"), completed using the current model, e.g. P(Y=H | X=H, Z=T, Θ) = 0.3 and P(Y=H | X=T, Θ) = 0.4:

  Data (X, Y, Z)      Expected counts N(X, Y)
  H  ?  T             X Y  #
  T  ?  ?             H H  1.3
  H  H  ?             T H  0.4
  H  T  T             H T  1.7
  T  T  H             T T  1.6

Expectation Maximization (EM)

Reiterate:

  Initial network (G, Θ0) + training data
  → E-step: expected counts computation, e.g. N(X1), N(X2), N(X3), N(H, X1, X3), N(Y1, H), N(Y2, H), N(Y3, H)
  → M-step: reparameterize
  → updated network (G, Θ1)
  → ...

Expectation Maximization (EM)

Formal guarantees:
- L(Θ1 : D) ≥ L(Θ0 : D)
  - Each iteration improves the likelihood
- If Θ1 = Θ0, then Θ0 is a stationary point of L(Θ : D)
  - Usually, this means a local maximum

A minimal worked EM loop is sketched below.
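A minimal worked EM loop for the two-variable network X → Y, with some values of X missing (None). Everything here (data, initialization, binary domains) is illustrative; the E-step completes the counts exactly as in the table two slides back:

```python
def em(data, n_iters=50):
    th_x, th_y = 0.5, {0: 0.5, 1: 0.5}   # Theta_0: P(X=1), P(Y=1 | X)
    for _ in range(n_iters):
        # E-step: expected counts, completing missing X by inference.
        n_x1 = 0.0
        n_xy = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.0}
        for x, y in data:
            if x is None:
                p1 = th_x * (th_y[1] if y else 1 - th_y[1])
                p0 = (1 - th_x) * (th_y[0] if y else 1 - th_y[0])
                q = p1 / (p1 + p0)       # P(X=1 | Y=y, current Theta)
            else:
                q = float(x)
            n_x1 += q
            n_xy[(1, y)] += q
            n_xy[(0, y)] += 1 - q
        # M-step: re-estimate as if the expected counts were real.
        th_x = n_x1 / len(data)
        th_y = {x: n_xy[(x, 1)] / (n_xy[(x, 0)] + n_xy[(x, 1)])
                for x in (0, 1)}
    return th_x, th_y

data = [(1, 1), (1, 1), (0, 0), (None, 1), (None, 0), (0, 0), (1, 0)]
print(em(data))   # each iteration provably does not decrease L(Theta : D)
```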

Summary: Parameter Learning with Incomplete Data

- Incomplete data makes parameter estimation hard
- Likelihood function
  - Does not have closed form
  - Is multimodal
- Finding max likelihood parameters:
  - EM
  - Gradient ascent
  - Both exploit inference procedures for Bayesian networks to compute expected sufficient statistics

Expectation Maximization (EM)

Computational bottleneck:
- Computation of expected counts in the E-step
  - Need to compute the posterior for each unobserved variable in each instance of the training set
  - All posteriors for an instance can be derived from one pass of standard BN inference

Incomplete Data: Structure Scores

Recall the Bayesian score:

  P(G | D) ∝ P(G) P(D | G) = P(G) ∫ P(D | G, Θ) P(Θ | G) dΘ

With incomplete data:
- Cannot evaluate the marginal likelihood in closed form
- We have to resort to approximations:
  - Evaluate the score around the MAP parameters
  - Need to find the MAP parameters (e.g., EM)

Naive Approach

- Perform EM (parametric optimization to a local maximum) for each candidate graph G1, G2, ..., Gn
- Computationally expensive:
  - Parameter optimization via EM is non-trivial
  - Need to perform EM for all candidate structures
  - Spend time even on poor candidates
- ⇒ In practice, considers only a few candidates

Structural EM

Recall: with complete data we had decomposition ⇒ efficient search.

Idea:
- Instead of optimizing the real score...
- Find a decomposable alternative score
- Such that maximizing the new score ⇒ improvement in the real score

Structural EM

Idea: use the current model to help evaluate new structures.

Outline:
- Perform search in (Structure, Parameters) space
- At each iteration, use the current model for finding either:
  - Better scoring parameters: "parametric" EM step, or
  - Better scoring structure: "structural" EM step

Structural EM: Reiterate

  Current model (structure + parameters) + training data
  → score & parameterize: compute expected counts for the current structure (e.g., N(X1), N(H, X1, X3), N(Y1, H), ...) and for candidate changes (e.g., N(X2, X1), N(Y1, X2), N(Y2, Y1, H))
  → choose a better structure and/or parameters
  → ...

Example: Phylogenetic Reconstruction

Input: biological sequences, an "instance" of the evolutionary process

  Human  CGTTGC...
  Chimp  CCTAGG...
  Orang  CGAACG...
  ...

Assumption: positions are independent

Output: a phylogeny

[Figure: phylogenetic tree spanning ~10 billion years, with current-day species at the leaves]

Phylogenetic Model

- Topology: bifurcating
  - Observed species: 1...N
  - Ancestral species: N+1...2N-2
- Lengths t = {t_i,j} for each branch (i,j)
- Evolutionary model:
  - e.g., P(A changes to T | 10 billion yrs)

[Figure: tree with leaves 1-7, internal nodes 8-12, and branch (8,9) marked]

Phylogenetic Tree as a Bayes Net

- Variables: letter at each position for each species
  - Current-day species: observed
  - Ancestral species: hidden
- BN structure: tree topology
- BN parameters: branch lengths (time spans)

Main problem: learn the topology. If ancestral species were observed ⇒ easy learning problem (learning trees).

Algorithm Outline

Starting from the original tree (T0, t0), repeat until convergence:

- Compute expected pairwise statistics
  - O(N²) pairwise statistics suffice to evaluate all trees
- Weights: branch scores w_i,j (pairwise weights)
- Find: T' = argmax_T Σ_{(i,j)∈T} w_i,j
  - Max spanning tree
- Construct bifurcation T1 (new tree)

Theorem: L(T1, t1) ≥ L(T0, t0)

Real Life Data

                              Lysozyme c    Mitochondrial genomes
  # sequences                 43            34
  # positions                 122           3,578
  Log-likelihood:
    Traditional approach      -2,916.2      -74,227.9
    Structural EM approach    -2,892.1      -70,533.5
    Difference per position   0.19          1.03

With structural EM, each position is about twice as likely.

Overview

- Introduction
- Parameter Estimation
- Model Selection
- Structure Discovery
- Incomplete Data
- Learning from Structured Data

Bayesian Networks: Problem

- Bayesian nets use a propositional representation
- Real world has objects, related to each other

[Figure: a generic Intelligence, Difficulty → Grade model, and its ground instantiations: Intell_J.Doe, Diffic_CS101 → Grade_JDoe_CS101 = A; Intell_FGump, Diffic_CS101 → Grade_FGump_CS101 = C; Intell_FGump, Diffic_Geo101 → Grade_FGump_Geo101]

These "instances" are not independent!

St. Nordaf University

[Figure: example domain - professors with Teaching-Ability; courses (Welcome to Geo101, Welcome to CS101) with Difficulty; students (Forrest Gump, Jane Doe) with Intelligence; registrations with Grade and Satisfaction; links Teaches, In-course, Registered]

Relational Schema

Specifies the types of objects in the domain, the attributes of each type of object, & the types of links between objects.

  Classes and attributes:
    Professor: Teaching-Ability
    Student: Intelligence
    Course: Difficulty
    Registration: Grade, Satisfaction
  Links: Teach, Take, In

Representing the Distribution

Need to represent an infinite set of complex distributions:
- Infinitely many potential universities
- Each associated with a very different set of worlds

Possible Worlds

- World: assignment to all attributes of all objects in the domain
- Many possible worlds for a given university
  - All possible assignments of all attributes of all objects

[Figure: one possible world - teaching abilities for Prof. Jones and Prof. Smith, grades and satisfactions for Forrest Gump and Jane Doe in Geo101 and CS101]

Probabilistic Relational Models

Key ideas:
- Universals: probabilistic patterns hold for all objects in a class
- Locality: represent direct probabilistic dependencies
  - Links give us potential interactions!

[Figure: PRM for the school domain - Professor.Teaching-Ability, Student.Intelligence, Course.Difficulty, Reg.Grade, Reg.Satisfaction, with a CPT θ_Grade|Intell,Diffic giving the grade distribution (A/B/C) for easy/hard × weak/smart]

PRM Semantics

Instantiated PRM ⇒ BN
- variables: attributes of all objects
- dependencies: determined by links & the PRM
- the CPT θ_Grade|Intell,Diffic is shared across all registrations

The Web of Influence

- Objects are all correlated
- Need to perform inference over the entire model
- For large databases, use approximate inference:
  - Loopy belief propagation

[Figure: ground network linking Profs. Jones and Smith, Geo101 and CS101, and the students' grades and satisfactions]

PRM Learning: Complete Data

- Entire database is a single "instance"
- Introduce a prior over parameters
- Update the prior with sufficient statistics, e.g.:

  Count(Reg.Grade = A, Reg.Course.Diff = lo, Reg.Student.Intel = hi)

- Parameters are used many times in the instance

PRM Learning: Incomplete Data

- Use expected sufficient statistics
- But, everything is correlated:
  ⇒ E-step uses (approximate) inference over the entire model

[Figure: school domain with unobserved attributes marked "?"]

A Web of Data: WebKB [Craven et al.]

[Figure: web pages and links - Tom Mitchell (Professor), Sean Slattery (Student), CMU CS Faculty page; relations Project-of, Advisee-of, Works-on, Contains]

Standard Approach

Classify each page from its words alone:

  Page: Category → Word_1 ... Word_N

  (e.g., words such as "professor", "department", "extract", "information", "computer science")

[Figure: classification accuracy, roughly in the 0.52-0.68 range]

What's in a Link

Model the links as well:

  From-Page: Category → Word_1 ... Word_N
  To-Page: Category → Word_1 ... Word_N
  Link: Exists, depending on both categories

[Figure: classification accuracy improves when links are modeled]

Discovering Hidden Concepts

Internet Movie Database (http://www.imdb.com)

[Figure: relational schema - Actor (Type, Gender), Director (Type), Movie (Type, Genre, Rating, #Votes, Year, MPAA Rating), Appeared (Credit-Order) - with hidden Type attributes to be discovered]

Web of Influence, Yet Again

[Figure: clusters discovered in the IMDB data]

  Movies: Wizard of Oz, Cinderella, Sound of Music, The Love Bug, Pollyanna, The Parent Trap, Mary Poppins, Swiss Family Robinson, Terminator 2, Batman, Batman Forever, Mission: Impossible, GoldenEye, Starship Troopers, Hunt for Red October
  Actors: Sylvester Stallone, Bruce Willis, Harrison Ford, Steven Seagal, Kurt Russell, Kevin Costner, Jean-Claude Van Damme, Arnold Schwarzenegger, Anthony Hopkins, Robert De Niro, Tommy Lee Jones, Harvey Keitel, Morgan Freeman, Gary Oldman
  Directors: Alfred Hitchcock, Stanley Kubrick, David Lean, Milos Forman, Terry Gilliam, Francis Coppola, Steven Spielberg, Tim Burton, Tony Scott, James Cameron, John McTiernan, Joel Schumacher

Conclusion

- Many distributions have combinatorial dependency structure
- Utilizing this structure is good
- Discovering this structure has implications:
  - To density estimation
  - To knowledge discovery
- Many applications:
  - Medicine
  - Biology
  - Web

The END

Thanks to
- Gal Elidan
- Dana Pe’er
- Lise Getoor
- Eran Segal
- Moises Goldszmidt
- Ben Taskar
- Matan Ninio

Slides will be available from:
http://www.cs.huji.ac.il/~nir/
http://robotics.stanford.edu/~koller/