Learning Bayesian Networks from Data
Nir Friedman (Hebrew U.)  and  Daphne Koller (Stanford)

Overview
- Introduction
- Parameter Estimation
- Model Selection
- Structure Discovery
- Incomplete Data
- Learning from Structured Data
Bayesian Networks
Compact representation of probability distributions via conditional independence.
Qualitative part: a directed acyclic graph (DAG)
- Nodes: random variables
- Edges: direct influence
Example graph: Earthquake and Burglary are parents of Alarm; Earthquake is a parent of Radio; Alarm is a parent of Call.

Family of Alarm: P(A | E,B)
  E   B    P(a)   P(¬a)
  e   b    0.9    0.1
  e   ¬b   0.2    0.8
  ¬e  b    0.9    0.1
  ¬e  ¬b   0.01   0.99

Quantitative part: a set of conditional probability distributions.
Together they define a unique distribution in factored form:
  P(B, E, A, C, R) = P(B) P(E) P(A | B, E) P(R | E) P(C | A)

Example: the "ICU Alarm" network
Domain: monitoring intensive-care patients
- 37 variables
- 509 parameters, instead of ~2^54 for the full joint distribution
[Figure: the ICU Alarm network structure over variables such as MINVOLSET, PULMEMBOLUS, INTUBATION, VENTLUNG, CATECHOL, HR, BP, ...]
Inference
- Posterior probabilities: probability of any event given any evidence
- Most likely explanation: scenario that explains the evidence
- Rational decision making: maximize expected utility; value of information
- Effect of intervention
[Figure: the Earthquake, Burglary, Alarm, Radio, Call network]

Why learning?
Knowledge acquisition bottleneck:
- Knowledge acquisition is an expensive process
- Often we don't have an expert
Data is cheap:
- The amount of available information is growing rapidly
- Learning allows us to construct models from raw data
Why Learn Bayesian Networks?
- Conditional independencies and the graphical language capture the structure of many real-world distributions
- The graph structure provides much insight into the domain, allowing "knowledge discovery"
- The learned model can be used for many tasks
- Supports all the features of probabilistic learning: model selection criteria, dealing with missing data and hidden variables

Learning Bayesian networks
Data + prior information → Learner → Bayesian network
E.g., a network over E, B, R, A, C together with a learned CPT P(A | E,B) with entries .9/.1, .7/.3, .8/.2, .99/.01 for the four parent configurations.
Known Structure, Complete Data
- Network structure is specified; the inducer needs to estimate parameters
- Data does not contain missing values

Unknown Structure, Complete Data
- Network structure is not specified; the inducer needs to select arcs and estimate parameters
- Data does not contain missing values

Known Structure, Incomplete Data
- Network structure is specified
- Data contains missing values; need to consider assignments to the missing values

Unknown Structure, Incomplete Data
- Network structure is not specified
- Data contains missing values; need to consider assignments to the missing values
Overview
- Introduction
- Parameter Estimation: likelihood function, Bayesian estimation
- Model Selection
- Structure Discovery
- Incomplete Data
- Learning from Structured Data

Learning Parameters
Training data has the form (for the E, B, A, C network):
  D = ⟨ E[1], B[1], A[1], C[1] ⟩
          ⋮
      ⟨ E[M], B[M], A[M], C[M] ⟩
Likelihood Function
Assume i.i.d. samples. The likelihood function is
  L(Θ : D) = ∏_m P(E[m], B[m], A[m], C[m] : Θ)

Likelihood Function
By the definition of the network, we get
  L(Θ : D) = ∏_m P(E[m], B[m], A[m], C[m] : Θ)
           = ∏_m P(E[m] : Θ) · P(B[m] : Θ) · P(A[m] | B[m], E[m] : Θ) · P(C[m] | A[m] : Θ)
Likelihood Function
Rewriting terms (grouping by variable), we get
  L(Θ : D) = ∏_m P(E[m], B[m], A[m], C[m] : Θ)
           = ∏_m P(E[m] : Θ) · ∏_m P(B[m] : Θ) · ∏_m P(A[m] | B[m], E[m] : Θ) · ∏_m P(C[m] | A[m] : Θ)

General Bayesian Networks
Generalizing to any Bayesian network:
  L(Θ : D) = ∏_m P(x_1[m], …, x_n[m] : Θ)
           = ∏_i ∏_m P(x_i[m] | Pa_i[m] : Θ_i)
           = ∏_i L_i(Θ_i : D)
Decomposition ⇒ independent estimation problems, one per variable.
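To make the decomposition concrete, here is a minimal sketch (not from the tutorial) that evaluates the decomposed log-likelihood of complete discrete data under a fixed structure and fixed CPTs. The variable names, data layout, and CPT values are illustrative assumptions.

```python
import math

# Hypothetical example: the E, B, A, C network used in the slides.
# structure maps each variable to its (possibly empty) tuple of parents.
structure = {"E": (), "B": (), "A": ("B", "E"), "C": ("A",)}

def log_likelihood(data, structure, cpts):
    """Decomposed log-likelihood: sum over variables of the per-family terms.

    data: list of dicts, one complete assignment per instance
    cpts: cpts[X][parent values][x] = P(X = x | parents)
    """
    total = 0.0
    for x, parents in structure.items():          # independent estimation problems
        for instance in data:
            pa_vals = tuple(instance[p] for p in parents)
            total += math.log(cpts[x][pa_vals][instance[x]])
    return total

# Toy usage with two instances and made-up CPT entries:
data = [{"E": 0, "B": 1, "A": 1, "C": 1}, {"E": 0, "B": 0, "A": 0, "C": 0}]
cpts = {
    "E": {(): {0: 0.9, 1: 0.1}},
    "B": {(): {0: 0.7, 1: 0.3}},
    "A": {(b, e): {0: 0.5, 1: 0.5} for b in (0, 1) for e in (0, 1)},
    "C": {(a,): {0: 0.6, 1: 0.4} for a in (0, 1)},
}
print(log_likelihood(data, structure, cpts))
```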
Likelihood Function: Multinomials
  L(θ : D) = P(D | θ) = ∏_m P(x[m] | θ)
The likelihood for the sequence H, T, T, H, H is
  L(θ : D) = θ · (1−θ) · (1−θ) · θ · θ = θ³ (1−θ)²
General case:
  L(Θ : D) = ∏_{k=1..K} θ_k^{N_k}
where N_k is the count of the k-th outcome in D and θ_k is the probability of the k-th outcome.

Bayesian Inference
Represent uncertainty about parameters using a probability distribution over parameters, given the data.
Learning using Bayes rule:
  P(θ | x[1], …, x[M]) = P(x[1], …, x[M] | θ) · P(θ) / P(x[1], …, x[M])
  (posterior ∝ likelihood × prior)
Bayesian Inference
Represent the Bayesian setup itself as a Bayes net: θ → X[1], X[2], …, X[M].
- The values of X are independent given θ: P(x[m] | θ) = θ
- P(θ | x[1], …, x[M]) ∝ P(x[1], …, x[M] | θ) · P(θ)
- Bayesian prediction is inference in this network

Example: Binomial Data
- Prior: uniform for θ in [0,1]  ⇒  P(θ | D) ∝ the likelihood L(θ : D)
- Observed data: (N_H, N_T) = (4, 1)
- MLE for P(X = H) is 4/5 = 0.8
- Bayesian prediction: P(x[M+1] = H | D) = ∫ θ · P(θ | D) dθ = 5/7 ≈ 0.7142
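A tiny sketch reproducing the numbers in this example: with a uniform (Beta(1,1)) prior, the posterior-predictive probability of heads is (N_H + 1)/(M + 2), compared with the MLE N_H/M. This is just an illustration of the formulas above, not code from the tutorial.

```python
from fractions import Fraction

# MLE vs. Bayesian prediction for binomial data with a uniform prior.
n_heads, n_tails = 4, 1

mle = Fraction(n_heads, n_heads + n_tails)             # 4/5 = 0.8
# With a uniform prior the posterior is Beta(n_heads+1, n_tails+1),
# so the predictive probability of heads is (N_H + 1) / (M + 2).
bayes = Fraction(n_heads + 1, n_heads + n_tails + 2)   # 5/7 ≈ 0.714

print(float(mle), float(bayes))
```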
Dirichlet Priors
Recall that the likelihood function is
  L(Θ : D) = ∏_{k=1..K} θ_k^{N_k}
A Dirichlet prior with hyperparameters α_1, …, α_K:
  P(Θ) ∝ ∏_{k=1..K} θ_k^{α_k − 1}
⇒ the posterior has the same form, with hyperparameters α_1+N_1, …, α_K+N_K:
  P(Θ | D) ∝ P(Θ) P(D | Θ) ∝ ∏_k θ_k^{α_k − 1} · ∏_k θ_k^{N_k} = ∏_k θ_k^{α_k + N_k − 1}

Dirichlet Priors: Example
[Figure: densities of Dirichlet(α_heads, α_tails) over θ_heads for Dirichlet(0.5,0.5), Dirichlet(1,1), Dirichlet(2,2), and Dirichlet(5,5)]
Dirichlet Priors (cont.)
If P(Θ) is Dirichlet with hyperparameters α_1, …, α_K, then
  P(X[1] = k) = ∫ θ_k · P(Θ) dΘ = α_k / ∑_l α_l
Since the posterior is also Dirichlet, we get
  P(X[M+1] = k | D) = ∫ θ_k · P(Θ | D) dΘ = (α_k + N_k) / ∑_l (α_l + N_l)

Bayesian Nets & Bayesian Prediction
[Figure: plate notation with parameter nodes θ_X and θ_{Y|X} shared across instances X[m], Y[m]; the observed data X[1..M], Y[1..M] and the query X[M+1], Y[M+1]]
- Priors for each parameter group are independent
- Data instances are independent given the unknown parameters
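The Dirichlet predictive formula in "Dirichlet Priors (cont.)" is easy to compute directly; here is a small sketch in which the outcome count, hyperparameters, and observed counts are purely illustrative assumptions.

```python
# Predictive distribution under a Dirichlet prior:
#   P(X[M+1] = k | D) = (alpha_k + N_k) / sum_l (alpha_l + N_l)
def dirichlet_predictive(alpha, counts):
    total = sum(a + n for a, n in zip(alpha, counts))
    return [(a + n) / total for a, n in zip(alpha, counts)]

# Three-valued variable, prior Dirichlet(1,1,1), observed counts (2, 0, 3):
print(dirichlet_predictive([1, 1, 1], [2, 0, 3]))   # [0.375, 0.125, 0.5]
```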
Bayesian Nets & Bayesian Prediction (cont.)
We can also "read" from the network: with complete data, the posteriors on the parameters are independent, so we can compute the posterior over each parameter group separately.

Learning Parameters: Summary
Estimation relies on sufficient statistics; for multinomials these are the counts N(x_i, pa_i).
Parameter estimation:
- MLE:                  θ̂_{x_i|pa_i} = N(x_i, pa_i) / N(pa_i)
- Bayesian (Dirichlet): θ̃_{x_i|pa_i} = (α(x_i, pa_i) + N(x_i, pa_i)) / (α(pa_i) + N(pa_i))
Both are asymptotically equivalent and consistent, and both can be implemented in an on-line manner by accumulating sufficient statistics.
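The two estimators above differ only in how the counts are turned into probabilities. The following sketch estimates one CPT both ways; the variable names, data, and hyperparameter value are assumptions made for the example.

```python
from collections import Counter

# Estimate P(A | B, E) from complete data, by MLE and with a Dirichlet prior.
data = [
    {"B": 1, "E": 0, "A": 1},
    {"B": 1, "E": 0, "A": 1},
    {"B": 0, "E": 0, "A": 0},
    {"B": 0, "E": 1, "A": 1},
]

counts = Counter()           # N(a, b, e)
parent_counts = Counter()    # N(b, e)
for inst in data:
    counts[(inst["A"], inst["B"], inst["E"])] += 1
    parent_counts[(inst["B"], inst["E"])] += 1

alpha = 1.0                  # Dirichlet pseudo-count per entry ("prior strength")
values_of_A = (0, 1)

for (b, e), n_pa in parent_counts.items():
    for a in values_of_A:
        n = counts[(a, b, e)]
        mle = n / n_pa
        bayes = (alpha + n) / (alpha * len(values_of_A) + n_pa)
        print(f"P(A={a} | B={b}, E={e}):  MLE={mle:.2f}  Bayes={bayes:.2f}")
```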
Learning Parameters: Case Study
[Figure: KL divergence to the true distribution vs. number of instances, for data sampled from the ICU Alarm network. Curves: MLE, and Bayesian estimation with prior strength M' = 5, 20, 50]

Overview
- Introduction
- Parameter Learning
- Model Selection: scoring function, structure search
- Structure Discovery
- Incomplete Data
- Learning from Structured Data
Why Struggle for Accurate Structure?
Example: the true network is Earthquake → Alarm Set ← Burglary, with Alarm Set → Sound.
- Missing an arc: some dependencies cannot be represented, and no amount of parameter fitting can compensate for that
- Adding an arc: increases the number of parameters that must be fitted and suggests dependencies that are not in the domain

Score-based Learning
Define a scoring function that evaluates how well a structure matches the data (e.g., data over E, B, A), then search for the structure that maximizes the score.
Likelihood Score for Structure
  L(G : D) = P(D | G, θ̂_G)      (θ̂_G are the max-likelihood parameters for G)
  ℓ(G : D) = log L(G : D) = M ∑_i ( I(X_i ; Pa_i) − H(X_i) )
where I(X_i ; Pa_i) is the (empirical) mutual information between X_i and its parents.
- Larger dependence of X_i on Pa_i ⇒ higher score
- Adding arcs always helps: I(X ; Y) ≤ I(X ; {Y, Z})
- The maximum score is attained by a fully connected network ⇒ overfitting: a bad idea

Bayesian Score
Bayesian approach: deal with uncertainty by assigning probability to all possibilities.
  P(D | G) = ∫ P(D | G, θ) P(θ | G) dθ      (marginal likelihood = ∫ likelihood × prior over parameters)
  P(G | D) = P(D | G) P(G) / P(D)
Marginal Likelihood: Multinomials
Fortunately, in many cases the integral has a closed form:
- P(Θ) is Dirichlet with hyperparameters α_1, …, α_K
- D is a dataset with sufficient statistics N_1, …, N_K
Then
  P(D) = [ Γ(∑_l α_l) / Γ(∑_l (α_l + N_l)) ] · ∏_l [ Γ(α_l + N_l) / Γ(α_l) ]

Marginal Likelihood: Bayesian Networks
The network structure determines the form of the marginal likelihood.
Example data (instances 1-7):  X = H,T,T,H,T,H,H   Y = H,T,H,H,T,T,H
Network 1 (X and Y disconnected): two Dirichlet marginal likelihoods, one integral over θ_X and one over θ_Y.
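The closed-form expression above is easiest to evaluate in log space. Here is a minimal sketch; the prior and the counts (taken from the X sequence in the example) are assumptions for illustration.

```python
from math import lgamma, exp

# Closed-form Dirichlet/multinomial marginal likelihood P(D) for one variable,
# given hyperparameters alpha and counts N.
def log_marginal_likelihood(alpha, counts):
    a0, n0 = sum(alpha), sum(counts)
    result = lgamma(a0) - lgamma(a0 + n0)
    for a, n in zip(alpha, counts):
        result += lgamma(a + n) - lgamma(a)
    return result

# Example: X = H,T,T,H,T,H,H gives counts (4 heads, 3 tails); uniform prior (1,1).
print(exp(log_marginal_likelihood([1.0, 1.0], [4, 3])))   # 1/280
```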
Marginal Likelihood: Bayesian Networks
Same data, Network 2 (X → Y): three Dirichlet marginal likelihoods, with integrals over θ_X, θ_{Y|X=H}, and θ_{Y|X=T}.

Marginal Likelihood for Networks
The marginal likelihood has the form:
  P(D | G) = ∏_i ∏_{pa_i^G} [ Γ(α(pa_i^G)) / Γ(α(pa_i^G) + N(pa_i^G)) ] · ∏_{x_i} [ Γ(α(x_i, pa_i^G) + N(x_i, pa_i^G)) / Γ(α(x_i, pa_i^G)) ]
i.e., a Dirichlet marginal likelihood for each multinomial P(X_i | pa_i).
N(·) are counts from the data; α(·) are hyperparameters for each family, given G.
Bayesian Score: Asymptotic Behavior
  log P(D | G) = ℓ(G : D) − (log M / 2) · dim(G) + O(1)
               = M ∑_i ( I(X_i ; Pa_i) − H(X_i) ) − (log M / 2) · dim(G) + O(1)
The first term fits the dependencies in the empirical distribution; the second is a complexity penalty.
- As M (the amount of data) grows, there is increasing pressure to fit the dependencies in the distribution, while the complexity term avoids fitting noise
- Asymptotically equivalent to the MDL score
- The Bayesian score is consistent: the observed data eventually overrides the prior

Structure Search as Optimization
Input: training data, a scoring function, a set of possible structures
Output: a network that maximizes the score
Key computational property, decomposability:
  score(G) = ∑ score( family of X in G )
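The asymptotic (BIC-style) score and the decomposability property can be combined into a per-family computation. The sketch below does this for discrete data; the structures, data, and domains are illustrative assumptions, and any decomposable score could be substituted.

```python
import math
from collections import Counter

# BIC-style score of a candidate structure, using the decomposable form
#   score(G) = sum over families of [ max log-likelihood - (log M / 2) * #params ].
def bic_score(data, structure, domains):
    M = len(data)
    score = 0.0
    for x, parents in structure.items():
        fam_counts = Counter()   # N(x, pa)
        pa_counts = Counter()    # N(pa)
        for inst in data:
            pa = tuple(inst[p] for p in parents)
            fam_counts[(inst[x], pa)] += 1
            pa_counts[pa] += 1
        # maximized log-likelihood of this family
        for (xv, pa), n in fam_counts.items():
            score += n * math.log(n / pa_counts[pa])
        # number of free parameters for this family
        n_pa_configs = 1
        for p in parents:
            n_pa_configs *= len(domains[p])
        score -= (math.log(M) / 2) * (len(domains[x]) - 1) * n_pa_configs
    return score

# Toy usage: compare "X and Y disconnected" against "X -> Y".
data = [{"X": 1, "Y": 1}, {"X": 1, "Y": 1}, {"X": 0, "Y": 0}, {"X": 0, "Y": 1}]
domains = {"X": (0, 1), "Y": (0, 1)}
print(bic_score(data, {"X": (), "Y": ()}, domains))
print(bic_score(data, {"X": (), "Y": ("X",)}, domains))
```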
Tree-Structured Networks
Trees: at most one parent per variable.
Why trees?
- Elegant math ⇒ we can solve the optimization problem
- Sparse parameterization ⇒ avoid overfitting
[Figure: a tree-structured version of the ICU Alarm network]

Learning Trees
Let p(i) denote the parent of X_i. We can write the Bayesian score as
  Score(G : D) = ∑_i Score(X_i : Pa_i)
               = ∑_i ( Score(X_i : X_{p(i)}) − Score(X_i) ) + ∑_i Score(X_i)
The first sum is the improvement over the "empty" network; the second is the score of the "empty" network.
⇒ Score = sum of edge scores + constant
Learning Trees
- Set w(j→i) = Score(X_j → X_i) − Score(X_i)
- Find the tree (or forest) with maximal weight: a standard maximum spanning tree algorithm, O(n² log n)
Theorem: this procedure finds the tree with the maximum score.

Beyond Trees
When we consider more complex networks, the problem is not as easy.
- Suppose we allow at most two parents per node: a greedy algorithm is no longer guaranteed to find the optimal network
- In fact, no efficient algorithm exists
Theorem: finding the maximal-scoring structure with at most k parents per node is NP-hard for k > 1.
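A minimal sketch of the tree-learning step just described: score each candidate edge, then take a maximum spanning tree. Here the edge weight is the empirical mutual information, as in the Chow-Liu algorithm; the data, variable names, and use of Kruskal's algorithm are assumptions made for the example.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(data, a, b):
    M = len(data)
    pa, pb, pab = Counter(), Counter(), Counter()
    for inst in data:
        pa[inst[a]] += 1; pb[inst[b]] += 1; pab[(inst[a], inst[b])] += 1
    mi = 0.0
    for (x, y), n in pab.items():
        mi += (n / M) * math.log((n * M) / (pa[x] * pb[y]))
    return mi

def max_spanning_tree(variables, weight):
    """Kruskal's algorithm with a union-find, taking edges by decreasing weight."""
    parent = {v: v for v in variables}
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]; v = parent[v]
        return v
    tree = []
    for u, v in sorted(combinations(variables, 2), key=weight, reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((u, v))
    return tree

data = [{"E": 1, "B": 0, "A": 1}, {"E": 0, "B": 1, "A": 1}, {"E": 0, "B": 0, "A": 0}]
print(max_spanning_tree(["E", "B", "A"], lambda e: mutual_information(data, *e)))
```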
Heuristic Search
Define a search space:
- search states are possible structures
- operators make small changes to the structure
Traverse the space looking for high-scoring structures. Search techniques: greedy hill-climbing, best-first search, simulated annealing, ...

Local Search
Start with a given network: the empty network, the best tree, or a random network.
At each iteration:
- Evaluate all possible changes
- Apply a change based on the score
Stop when no modification improves the score.
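The local-search loop above can be sketched in a few lines. For brevity this sketch uses only add-edge moves (the slides also use delete and reverse), and `family_score` stands for any decomposable score, such as the per-family BIC sketched earlier. All names and the parent limit are assumptions, not the tutorial's code.

```python
def greedy_hill_climb(variables, data, family_score, max_parents=2):
    parents = {v: set() for v in variables}            # start from the empty network
    while True:
        best_delta, best_move = 0.0, None
        for child in variables:
            if len(parents[child]) >= max_parents:
                continue
            current = family_score(data, child, parents[child])
            for cand in variables:
                if cand == child or cand in parents[child]:
                    continue
                if creates_cycle(parents, cand, child):
                    continue
                # decomposability: only the child's family needs re-scoring
                delta = family_score(data, child, parents[child] | {cand}) - current
                if delta > best_delta:
                    best_delta, best_move = delta, (cand, child)
        if best_move is None:                           # no modification improves the score
            return parents
        src, dst = best_move
        parents[dst].add(src)

def creates_cycle(parents, src, dst):
    # Adding src -> dst creates a cycle iff dst is already an ancestor of src.
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return False
```

In a full implementation the loop would also cache family scores and sufficient statistics, which is what makes search over ~100 variables practical.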
Heuristic Search
Typical operations on the current structure: add an edge, delete an edge, reverse an edge (e.g., add C → D, delete C → E, reverse C → E).
To update the score after a local change, only re-score the families that changed, e.g. for adding C → D:
  ∆score = S({C, E} → D) − S({E} → D)

Learning in Practice: Alarm domain
[Figure: KL divergence to the true distribution vs. number of samples, comparing (a) structure known, fit parameters and (b) learn both structure and parameters]
Local Search: Possible Pitfalls
Local search can get stuck in:
- Local maxima: all one-edge changes reduce the score
- Plateaux: some one-edge changes leave the score unchanged
Standard heuristics can escape both: random restarts, TABU search, simulated annealing.

Improved Search: Weight Annealing
Standard annealing process:
- Take bad steps with probability ∝ exp(∆score / t)
- The probability increases with the temperature
Weight annealing:
- Take uphill steps relative to a perturbed score
- The perturbation increases with the temperature
Perturbing the Score
Perturb the score by reweighting the instances:
- Each weight is sampled from a distribution with mean 1 and variance ∝ temperature
- Instances are still sampled from the "original" distribution, but the perturbation changes the emphasis
Benefit: allows global moves in the search space.

Weight Annealing: ICU Alarm network
[Figure: cumulative performance of 100 runs of annealed structure search on data sampled from the ICU Alarm network, comparing the true structure with learned parameters, annealed search, and greedy hill-climbing]
Structure Search: Summary
- Discrete optimization problem
- In some cases the optimization problem is easy, e.g. learning trees
- In general it is NP-hard, so we need to resort to heuristic search
- In practice, search is relatively fast (~100 variables in ~2-5 min), thanks to decomposability and sufficient statistics
- Adding randomness to the search is critical

Overview
- Introduction
- Parameter Estimation
- Model Selection
- Structure Discovery
- Incomplete Data
- Learning from Structured Data
Structure Discovery
Task: discover structural properties
- Is there a direct connection between X and Y?
- Does X separate two "subsystems"?
- Does X causally affect Y?
Example: scientific data mining
- Disease properties and symptoms
- Interactions between the expression of genes

Discovering Structure
[Figure: posterior P(G | D) over several candidate structures on E, B, R, A, C]
Current practice: model selection
- Pick a single high-scoring model
- Use that model to infer the domain structure
Discovering Structure
[Figure: posterior P(G | D) over candidate structures]
Problem:
- Small sample size ⇒ many high-scoring models
- An answer based on one model is often useless
- We want features common to many models

Bayesian Approach
Posterior distribution over structures. Estimate the probability of features:
- Edge X → Y
- Path X → … → Y
- …
  P(f | D) = ∑_G f(G) · P(G | D)
where f(G) is the indicator function for feature f (e.g., X → Y) and P(G | D) is the Bayesian score for G.
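For a very small domain the sum over structures can be evaluated exactly, which makes the definition concrete before turning to MCMC. The sketch below enumerates all DAGs over three variables, weights them by a decomposable log-score under a uniform structure prior, and accumulates edge-feature probabilities. It is an illustration under those assumptions, not the tutorial's code.

```python
import math
from itertools import product
from collections import Counter

def all_dags(variables):
    pairs = [(a, b) for a in variables for b in variables if a != b]
    for present in product([False, True], repeat=len(pairs)):
        edges = [e for e, keep in zip(pairs, present) if keep]
        parents = {v: {a for a, b in edges if b == v} for v in variables}
        if is_acyclic(parents, variables):
            yield parents

def is_acyclic(parents, variables):
    visited, stack = set(), set()
    def visit(v):
        if v in stack: return False
        if v in visited: return True
        stack.add(v); visited.add(v)
        ok = all(visit(p) for p in parents[v])
        stack.discard(v)
        return ok
    return all(visit(v) for v in variables)

def edge_posteriors(data, variables, structure_score):
    """structure_score(data, parents) -> log-score, e.g. a BIC or BDe score."""
    scored = [(parents, structure_score(data, parents)) for parents in all_dags(variables)]
    m = max(s for _, s in scored)
    weights = [(parents, math.exp(s - m)) for parents, s in scored]   # uniform prior over G
    z = sum(w for _, w in weights)
    post = Counter()
    for parents, w in weights:
        for child, pas in parents.items():
            for pa in pas:
                post[(pa, child)] += w / z        # P(edge pa -> child | D)
    return post
```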
MCMC over Networks
We cannot enumerate structures, so sample structures:
  P(f(G) | D) ≈ (1/n) ∑_{i=1..n} f(G_i)
MCMC sampling: define a Markov chain over BNs; run the chain to get samples from the posterior P(G | D).
Possible pitfalls:
- Huge (superexponential) number of networks
- The time for the chain to converge to the posterior is unknown
- Islands of high posterior, connected by low bridges

ICU Alarm BN: No Mixing
[Figure: score of the current sample vs. MCMC iteration for 500 instances, with chains started from the empty network and from a greedy solution]
The runs clearly do not mix.
Effects of Non-Mixing
Two MCMC runs over the same 500 instances; compare the probability estimates for edges from the two runs.
[Figure: scatter plots of edge probabilities, run 1 vs. run 2, for a chain started at the true BN and a chain started at a random network]
The probability estimates are highly variable and nonrobust.

Fixed Ordering
Suppose that we know an ordering of the variables, say X_1 > X_2 > X_3 > X_4 > … > X_n, so the parents of X_i must be in X_1, …, X_{i-1}, and we limit the number of parents per node to k.
This leaves 2^{k·n·log n} networks.
Intuition: the order decouples the choices of parents; the choice of Pa(X_7) does not restrict the choice of Pa(X_12).
Upshot: we can compute efficiently, in closed form, the likelihood P(D | ≺) and any feature probability P(f | D, ≺).
Our Approach: Sample Orderings
We can write
  P(f | D) = ∑_≺ P(f | ≺, D) · P(≺ | D)
Sample orderings and approximate
  P(f | D) ≈ (1/n) ∑_{i=1..n} P(f | ≺_i, D)
MCMC sampling: define a Markov chain over orderings; run the chain to get samples from the posterior P(≺ | D).

Mixing with MCMC-Orderings
4 runs on ICU-Alarm with 500 instances: fewer iterations than MCMC over networks, with approximately the same amount of computation.
[Figure: score of the current sample vs. MCMC iteration for chains started from random and greedy initializations]
The process appears to be mixing!
Mixing of MCMC runs
Two MCMC runs over the same instances; compare the probability estimates for edges.
[Figure: scatter plots of edge probabilities, run 1 vs. run 2, for 500 and for 1000 instances]
The probability estimates are very robust.

Application: Gene expression
Input: measurements of gene expression under different conditions
- Thousands of genes
- Hundreds of experiments
Output: models of gene interaction
- Uncover pathways
Map of Feature Confidence
Yeast data [Hughes et al 2000]: 600 genes, 300 experiments.

"Mating response" Substructure
[Figure: automatically constructed sub-network of high-confidence edges over genes such as SST2, KAR4, TEC1, NDJ1, KSS1, FUS1, PRM1, AGA1, AGA2, TOM6, FIG1, FUS3, MFA1, STE6, ...]
Almost exact reconstruction of the yeast mating pathway.
Overview
- Introduction
- Parameter Estimation
- Model Selection
- Structure Discovery
- Incomplete Data: parameter estimation, structure search
- Learning from Structured Data

Incomplete Data
Data is often incomplete: some variables of interest are not assigned values.
This phenomenon happens when we have
- Missing values: some variables are unobserved in some instances
- Hidden variables: some variables are never observed; we might not even know they exist
Hidden (Latent) Variables
Why should we care about unobserved variables?
[Figure: X1, X2, X3 → H → Y1, Y2, Y3, which requires 17 parameters with the hidden H, versus 59 parameters for the corresponding network without H]

Example
P(X) is assumed to be known; consider the likelihood function of (θ_{Y|X=T}, θ_{Y|X=H}).
[Figure: contour plots of the log likelihood for different numbers of missing values of X (M = 8): no missing values, 2 missing values, 3 missing values]
In general: the likelihood function has multiple modes.
Incomplete Data
In the presence of incomplete data, the likelihood can have multiple maxima.
Example: a hidden variable H with two values. We can rename the values of H, so the likelihood has two equivalent maxima.
In practice, there are many local maxima.

EM: MLE from Incomplete Data
[Figure: L(Θ | D) as a multimodal function of Θ]
Use the current point to construct a "nice" alternative function; the maximum of the new function scores at least as well as the current point.
Expectation Maximization (EM)
A general-purpose method for learning from incomplete data.
Intuition:
- If we had true counts, we could estimate parameters
- But with missing values, the counts are unknown
- We "complete" the counts using probabilistic inference based on the current parameter assignment
- We use the completed counts as if they were real to re-estimate the parameters

Expectation Maximization (EM)
Example: data over X, Y, Z with missing values (?), and a current model giving
P(Y=H | X=H, Z=T, Θ) = 0.3 and P(Y=H | X=T, Θ) = 0.4.

  Data           Expected counts N(X, Y)
  X  Y  Z        X  Y   #
  H  ?  T        H  H   1.3
  T  ?  ?        T  H   0.4
  H  H  ?        H  T   1.7
  H  T  T        T  T   1.6
  T  T  H
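The "completing the counts" step is just bookkeeping once the posteriors are available. The sketch below reproduces the numbers in the table above; the posterior function is a stand-in assumption, whereas in a real network it would come from standard BN inference with the current parameters.

```python
from collections import defaultdict

def expected_counts(data, posterior_y):
    """data: list of dicts with keys X, Y, Z; Y may be None (missing).
    posterior_y(instance) -> {y: P(Y=y | observed values, current theta)}."""
    counts = defaultdict(float)                      # expected N(X, Y)
    for inst in data:
        if inst["Y"] is not None:
            counts[(inst["X"], inst["Y"])] += 1.0    # observed: an ordinary count
        else:
            for y, p in posterior_y(inst).items():   # missing: fractional counts
                counts[(inst["X"], y)] += p
    return counts

# Toy usage mirroring the slide's numbers (the posteriors are stand-ins):
data = [
    {"X": "H", "Y": None, "Z": "T"},
    {"X": "T", "Y": None, "Z": None},
    {"X": "H", "Y": "H", "Z": None},
    {"X": "H", "Y": "T", "Z": "T"},
    {"X": "T", "Y": "T", "Z": "H"},
]
posterior = lambda inst: {"H": 0.3, "T": 0.7} if inst["X"] == "H" else {"H": 0.4, "T": 0.6}
print(dict(expected_counts(data, posterior)))   # {('H','H'): 1.3, ('T','H'): 0.4, ...}
```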
Expectation Maximization (EM)
Reiterate:
- Start from an initial network (G, Θ0)
- E-step: compute expected counts (e.g., N(X1), N(X2), N(X3), N(H, X1, X2, X3), N(Y1, H), N(Y2, H), N(Y3, H)) from the training data using the current parameters
- M-step: reparameterize to obtain the updated network (G, Θ1)

Expectation Maximization (EM)
Formal guarantees:
- L(Θ1 : D) ≥ L(Θ0 : D): each iteration improves the likelihood
- If Θ1 = Θ0, then Θ0 is a stationary point of L(Θ : D); usually this means a local maximum
Expectation Maximization (EM)
Computational bottleneck: the computation of expected counts in the E-step.
- Need to compute a posterior for each unobserved variable in each instance of the training set
- All the posteriors for an instance can be derived from one pass of standard BN inference

Summary: Parameter Learning with Incomplete Data
Incomplete data makes parameter estimation hard. The likelihood function
- does not have a closed form
- is multimodal
Finding max-likelihood parameters: EM, gradient ascent.
Both exploit inference procedures for Bayesian networks to compute expected sufficient statistics.
Incomplete Data: Structure Scores
Recall the Bayesian score:
  P(G | D) ∝ P(G) · P(D | G) = P(G) ∫ P(D | G, Θ) P(Θ | G) dΘ
With incomplete data:
- We cannot evaluate the marginal likelihood in closed form
- We have to resort to approximations: evaluate the score around the MAP parameters
- Need to find the MAP parameters (e.g., by EM)

Naive Approach
Perform EM (parametric optimization) for each candidate graph G1, G2, G3, …, Gn.
Computationally expensive:
- Parameter optimization via EM is non-trivial
- Need to perform EM for all candidate structures
- Time is spent even on poor candidates
⇒ In practice, this considers only a few candidates.
Structural EM
Recall that with complete data we had: decomposition ⇒ efficient search.
Idea: instead of optimizing the real score, find a decomposable alternative score such that maximizing the new score ⇒ improvement in the real score.

Structural EM
Idea: use the current model to help evaluate new structures.
Outline:
- Perform search in (structure, parameters) space
- At each iteration, use the current model for finding either:
  - better-scoring parameters: a "parametric" EM step, or
  - a better-scoring structure: a "structural" EM step
Reiterate
[Figure: the structural EM loop. From the training data and the current network, compute expected counts for the current structure (N(X1), N(X2), N(X3), N(H, X1, X2, X3), N(Y1, H), ...) as well as the expected counts needed to score candidate structures (N(X2, X1), N(H, X1, X3), N(Y1, X2), N(Y2, Y1, H), ...); then score and re-parameterize]

Example: Phylogenetic Reconstruction
Input: biological sequences, e.g.
  Human  CGTTGC…
  Chimp  CCTAGG…
  Orang  CGAACG…
An "instance" of the evolutionary process; assumption: positions are independent.
Output: a phylogeny, a tree over current-day and ancestral species.
Phylogenetic Model
Variables: the letter at each position for each species
- Current-day species: observed
- Ancestral species: hidden
BN structure: the tree topology
BN parameters: branch lengths (time spans)
Evolutionary model: e.g., P(A changes to T | 10 billion yrs)
Main problem: learn the topology

Phylogenetic Tree as a Bayes Net
- Topology: bifurcating; observed species 1…N, ancestral species N+1…2N-2
- Branch lengths t = {t_{i,j}} for each branch (i, j)
- If the ancestral species were observed ⇒ easy learning problem (learning trees)
Algorithm Outline
Starting from the original tree (T0, t0):
- Compute expected pairwise statistics
- Weights: branch scores (pairwise weights)
- Find T' = argmax_T ∑_{(i,j)∈T} w_{i,j}  (maximum spanning tree)
- Construct a bifurcation T1 from the new tree
- Repeat until convergence
O(N²) pairwise statistics suffice to evaluate all trees.
Theorem: L(T1, t1) ≥ L(T0, t0).
Real Life Data
                                   Lysozyme c    Mitochondrial genomes
  # sequences                      43            34
  # positions                      122           3,578
  Log-likelihood, traditional      -2,916.2      -74,227.9
  Log-likelihood, Structural EM    -2,892.1      -70,533.5
  Difference per position          0.19          1.03
Each position is roughly twice as likely under the Structural EM model.

Overview
- Introduction
- Parameter Estimation
- Model Selection
- Structure Discovery
- Incomplete Data
- Learning from Structured Data
Bayesian Networks: Problem
Bayesian nets use a propositional representation, but the real world has objects, related to each other.
Example: students, courses, and grades, with attributes such as Intelligence, Difficulty, Grade.

Bayesian Networks: Problem
With a propositional representation we get ground variables such as
  Intell_JDoe,  Diffic_CS101,  Grade_JDoe_CS101 = A
  Intell_FGump, Diffic_CS101,  Grade_FGump_CS101 = C
  Intell_FGump, Diffic_Geo101, Grade_FGump_Geo101
These "instances" are not independent!
St. Nordaf University
[Figure: an example relational world. Professors (with Teaching-Ability) teach the courses "Welcome to Geo101" and "Welcome to CS101" (each with Difficulty); students Forrest Gump and Jane Doe (each with Intelligence) register for courses; each registration has a Grade and a Satisfaction]

Relational Schema
Specifies the types of objects in the domain, the attributes of each type of object, and the types of links between objects.
- Professor: Teaching-Ability
- Student: Intelligence
- Course: Difficulty
- Registration: Grade, Satisfaction
- Links: Teaches, Registered (Takes), In-course
Possible Worlds
A world is an assignment to all attributes of all objects in the domain.
[Figure: one possible world, with concrete grades, satisfactions, difficulties, and intelligences for the St. Nordaf objects]

Representing the Distribution
- There are many possible worlds for a given university
- There are infinitely many potential universities, each associated with a very different set of worlds
- We need to represent an infinite set of complex distributions
Probabilistic Relational Models
Key ideas:
- Universals: probabilistic patterns hold for all objects in a class
- Locality: represent direct probabilistic dependencies; links give us the potential interactions
[Figure: a PRM over the schema, with a shared CPD θ_{Grade | Intell, Diffic} shown for the parent configurations (easy, weak), (easy, smart), (hard, weak), (hard, smart)]

PRM Semantics
An instantiated PRM defines a BN:
- Variables: the attributes of all objects
- Dependencies: determined by the links and the PRM
The Web of Influence
[Figure: grades of students in shared courses propagate information about course difficulty and student intelligence]
Objects are all correlated.

PRM Learning: Complete Data
- The entire database is a single "instance"; parameters are used many times within that instance
- Introduce a prior over the parameters; update the prior with sufficient statistics, e.g.
    Count(Reg.Grade = A, Reg.Course.Diff = lo, Reg.Student.Intel = hi)
- Need to perform inference over the entire model; for large databases, use approximate inference: loopy belief propagation
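Collecting a relational sufficient statistic such as the count above amounts to joining the registration table with its related objects. A toy sketch is shown below; the table layout, attribute names, and values are assumptions made for illustration.

```python
from collections import Counter

students = {"JDoe": {"Intel": "hi"}, "FGump": {"Intel": "lo"}}
courses = {"CS101": {"Diff": "lo"}, "Geo101": {"Diff": "hi"}}
registrations = [
    {"Student": "JDoe",  "Course": "CS101",  "Grade": "A"},
    {"Student": "FGump", "Course": "CS101",  "Grade": "C"},
    {"Student": "FGump", "Course": "Geo101", "Grade": "B"},
]

# Every registration contributes one count to the shared CPD P(Grade | Diff, Intel):
counts = Counter(
    (reg["Grade"], courses[reg["Course"]]["Diff"], students[reg["Student"]]["Intel"])
    for reg in registrations
)
print(counts[("A", "lo", "hi")])  # Count(Reg.Grade=A, Reg.Course.Diff=lo, Reg.Student.Intel=hi)
```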
PRM Learning: Incomplete Data
- Use expected sufficient statistics
- But everything is correlated ⇒ the E-step uses (approximate) inference over the entire model
[Figure: a partially observed St. Nordaf world with unknown grades and abilities]

A Web of Data [Craven et al.]
WebKB: university web pages and the links between them, e.g. Tom Mitchell (Professor), a project page, Sean Slattery (Student), and the CMU CS Faculty page, with relations such as Project-of, Advisee-of, Works-on, and Contains.
Standard Approach
Model each page in isolation: Page → Category, with the words Word1 … WordN as evidence (words such as professor, department, extract, information, computer, science, machine, learning, ...).

What's in a Link
Model the link as well: From-Page and To-Page each have a Category and words, plus an Exists-Link variable.
[Figure: classification accuracy comparison of the two approaches]
Discovering Hidden Concepts
Data: the Internet Movie Database (http://www.imdb.com)
[Figure: a relational model over Actors (hidden Type, Gender), Directors (hidden Type), Movies (hidden Type, Genre, Rating, #Votes, Year, MPAA Rating), and Appeared links (Credit-Order)]
Web of Influence, Yet Again
[Figure: discovered clusters of movies (e.g., Wizard of Oz, Cinderella, Sound of Music, Mary Poppins, … versus Terminator 2, Batman, Mission: Impossible, GoldenEye, …), actors (e.g., Sylvester Stallone, Bruce Willis, Harrison Ford, …), and directors (e.g., Alfred Hitchcock, Stanley Kubrick, Steven Spielberg, …)]

Conclusion
- Many distributions have combinatorial dependency structure
- Utilizing this structure is good
- Discovering this structure has implications for density estimation and for knowledge discovery
- Many applications: medicine, biology, the Web
The END

Thanks to:
Gal Elidan Dana Pe’er Lise Getoor Eran Segal Moises Goldszmidt Ben Taskar Matan Ninio
Slides will be available from: http://www.cs.huji.ac.il/~nir/ http://robotics.stanford.edu/~koller/