Graphical Model II

MACHINE LEARNING
Vasant Honavar
Artificial Intelligence Research Laboratory, Department of Computer Science
Bioinformatics and Computational Biology Program
Center for Computational Intelligence, Learning, & Discovery
Iowa State University
[email protected] | www.cs.iastate.edu/~honavar/ | www.cild.iastate.edu/
Copyright Vasant Honavar, 2006.

Learning Bayesian Networks

From data together with prior information, the learner produces a Bayesian network: its structure — e.g., Earthquake (E) and Burglary (B) as parents of Alarm (A) — and its conditional probability tables, such as:

P(A | E, B):
  E   B     P(a)   P(¬a)
  e   b     0.9    0.1
  e   ¬b    0.2    0.8
  ¬e  b     0.9    0.1
  ¬e  ¬b    0.01   0.99

The Learning Problem

                    Known Structure                 Unknown Structure
  Complete Data     Statistical parametric          Discrete optimization over
                    estimation (closed-form eq.)    structures (discrete search)
  Incomplete Data   Parametric optimization         Combined
                    (EM, gradient descent...)       (Structural EM, mixture models...)

With known structure and complete data — e.g., records over (E, B, A) such as <Y,N,N>, <Y,Y,Y>, <N,N,Y>, <N,Y,Y>, <N,Y,Y> — learning reduces to filling in the unknown CPT entries (the "??" entries) of a fixed network.
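Once its CPTs are written down, the alarm network above can be queried directly via the chain rule. A minimal sketch (the priors P(E) and P(B) are invented for illustration; only the P(A | E, B) table comes from the slide):

```python
# Alarm network: E and B are parents of A.
P_E = {1: 0.01, 0: 0.99}          # hypothetical prior on Earthquake
P_B = {1: 0.02, 0: 0.98}          # hypothetical prior on Burglary
P_A_given_EB = {                  # P(A=1 | E, B), from the slide's table
    (1, 1): 0.9, (1, 0): 0.2, (0, 1): 0.9, (0, 0): 0.01,
}

def joint(e, b, a):
    """P(E=e, B=b, A=a) via the chain rule for this DAG."""
    pa1 = P_A_given_EB[(e, b)]
    return P_E[e] * P_B[b] * (pa1 if a == 1 else 1.0 - pa1)

# Marginal P(A=1): sum the joint over all parent configurations.
p_alarm = sum(joint(e, b, 1) for e in (0, 1) for b in (0, 1))
```

The same decomposition P(E, B, A) = P(E) P(B) P(A | E, B) is what makes the CPTs the only parameters that need to be learned.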
With incomplete data, some records have missing values — e.g., <Y,N,?>, <Y,?,Y>, <N,N,Y>, <?,Y,Y>, <N,?,Y> — so closed-form estimation no longer applies: parameters must be found by iterative optimization (EM, gradient descent), and with unknown structure this must be combined with structure search (Structural EM, mixture models).

Learning Bayesian Networks: Roadmap

• Parameter learning: complete data (review)
  – Statistical parametric fitting
  – Maximum likelihood estimation
  – Bayesian inference
• Parameter learning: incomplete data
• Structure learning: complete data
• Application: classification
• Structure learning: incomplete data
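When records contain missing values, parameters can no longer be read off from counts; EM alternates between filling in expected counts under the current parameters and re-estimating from those counts. A single-variable sketch (the records and starting value are invented for illustration):

```python
# One binary variable X; None marks a missing value in a record.
data = [1, 0, 1, None, 1, None, 0]   # hypothetical records
theta = 0.5                          # current estimate of P(X=1)

for _ in range(20):                  # EM iterations
    # E-step: a missing entry contributes its expected value,
    # P(X=1) = theta, as a fractional count.
    expected_ones = sum(theta if x is None else x for x in data)
    # M-step: re-estimate theta from the expected counts.
    theta = expected_ones / len(data)
```

Here the fixed point satisfies 7·theta = 3 + 2·theta, i.e. theta = 0.6: the missing entries end up distributed in the same proportion as the observed ones.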
Estimating Probabilities from Data (Discrete Case)

• Maximum likelihood estimation
• Bayesian estimation
• Maximum a posteriori estimation

Bayesian Estimation

• Treat the unknown parameters as random variables
• Assume a prior distribution over the unknown parameters
• Update the distribution of the parameters based on the data
• Use Bayes rule to make predictions

Bayesian Networks and Bayesian Prediction

In plate notation, the parameters θX and θY|X appear as nodes of their own, with the observed instances X[1..M], Y[1..M] and the query instance X[M+1], Y[M+1] depending on them.

• Priors for each parameter group are independent
• Data instances are independent given the unknown parameters
• We can "read" from the network: complete data ⇒ posteriors on the parameters are independent

Bayesian Prediction (cont.)

• Since the posteriors on the parameters of each node are independent, we can compute them separately
• Posteriors for parameters within a node are also independent: refining the model to separate θY|X=0 and θY|X=1, complete data ⇒ the posteriors on θY|X=0 and θY|X=1 are independent
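The four steps of Bayesian estimation above can be sketched with the simplest conjugate case: a Beta prior on the bias of a single binary variable (the prior hyperparameters and the observations are illustrative assumptions):

```python
# Beta(a, b) prior on theta = P(X=1); Bernoulli observations.
a, b = 1.0, 1.0                      # uniform Beta(1, 1) prior
data = [1, 1, 0, 1, 0, 1, 1]         # hypothetical observations

for x in data:                       # conjugate update: each outcome
    if x == 1:                       # adds one pseudo-count
        a += 1
    else:
        b += 1

# Predictive probability that the next observation is 1
# (the posterior mean of theta under Beta(a, b)):
p_next = a / (a + b)
```

The counts in the data simply add to the prior pseudo-counts — the same pattern reappears below with Dirichlet priors on multinomial CPT rows.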
Bayesian Prediction

• Given these observations, we can compute the posterior for each multinomial θXi|pai independently
• The posterior is Dirichlet with parameters
  α(Xi=1 | pai) + N(Xi=1 | pai), …, α(Xi=k | pai) + N(Xi=k | pai)
• The predictive distribution is then represented by the parameters

  θ̃(xi | pai) = (α(xi, pai) + N(xi, pai)) / (α(pai) + N(pai))

Assigning Priors for Bayesian Networks

• We need the pseudo-counts α(xi, pai) for each node Xi
• We can use initial parameters Θ0 as prior information, together with an equivalent sample size M0
• Then we let α(xi, pai) = M0 · P(xi, pai | Θ0)
• This allows a network to be updated in response to new data

Learning Parameters

To compare the true distribution P(x) with a learned distribution Q(x), measure their KL divergence:

  KL(P || Q) = Σx P(x) log( P(x) / Q(x) )

• KL(P || Q) ≥ 0
• KL(P || Q) = 0 iff P and Q are equal

Learning Parameters: Summary

• Estimation relies on sufficient statistics; for multinomials these are the counts N(xi, pai)
• Parameter estimates:

  MLE:                  θ̂(xi | pai) = N(xi, pai) / N(pai)
  Bayesian (Dirichlet): θ̃(xi | pai) = (α(xi, pai) + N(xi, pai)) / (α(pai) + N(pai))

• Bayesian methods also require a choice of priors
• MLE and Bayesian estimates are asymptotically equivalent and consistent, but the latter work better with small samples
• Both can be implemented in an online manner by accumulating sufficient statistics
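The MLE and Dirichlet formulas, together with the KL divergence, are easy to check numerically. A sketch with invented counts for a three-valued variable and a uniform Dirichlet prior:

```python
from math import log

# Counts N(x) and Dirichlet pseudo-counts alpha(x); values are
# illustrative assumptions, not from the slides.
N = {'a': 6, 'b': 3, 'c': 1}
alpha = {'a': 1.0, 'b': 1.0, 'c': 1.0}   # uniform Dirichlet prior

total_N = sum(N.values())
total_a = sum(alpha.values())

theta_mle = {x: N[x] / total_N for x in N}                       # N(x)/N
theta_bayes = {x: (alpha[x] + N[x]) / (total_a + total_N) for x in N}

def kl(P, Q):
    """KL(P || Q) = sum_x P(x) log(P(x)/Q(x)); >= 0, = 0 iff P == Q."""
    return sum(P[x] * log(P[x] / Q[x]) for x in P if P[x] > 0)

d = kl(theta_mle, theta_bayes)   # > 0: smoothing pulls toward uniform
```

Note how the Bayesian estimate never assigns probability zero to an unseen value, which is what makes it better behaved on small samples.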
Why Do We Need Accurate Structure?

Consider the true network Earthquake → Alarm Set ← Burglary, with Alarm Set → Sound.

• Missing an arc:
  – Cannot be compensated for by fitting parameters
  – Leads to incorrect independence assumptions
• Extraneous arc:
  – Increases the number of parameters to be estimated
  – Leads to incorrect assumptions about the domain's dependencies

Approaches to BN Structure Learning

• Score-based methods
  – Assign a score to each candidate BN structure using a suitable scoring function
  – Search the space of candidate structures for one with maximum score
• Independence-testing-based methods
  – Use statistical independence tests to determine the structure of the network

Score-Based BN Structure Learning

Define a scoring function that evaluates how well a structure matches the data — e.g., given records over (E, B, A) such as <Y,N,N>, <Y,Y,Y>, <N,N,Y>, <N,Y,Y>, <N,Y,Y>, score candidates such as E → A ← B against sparser alternatives — then search for a structure that maximizes the score.
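A minimal version of score-and-search: score a candidate structure by its maximum log-likelihood on complete data (the records and candidate structures are invented for illustration). Likelihood alone always prefers more arcs, which is why practical scores add a complexity penalty (BIC/MDL):

```python
from math import log

# Complete binary records over (E, B, A); hypothetical data.
data = [(1, 0, 1), (1, 1, 1), (0, 0, 0), (0, 1, 1), (0, 1, 1), (0, 0, 0)]
VARS = ('E', 'B', 'A')

def loglik(parents):
    """Max log-likelihood of the data for a structure given as
    {variable: tuple of parent names}, using MLE parameters."""
    ll = 0.0
    for i, v in enumerate(VARS):
        pidx = [VARS.index(p) for p in parents[v]]
        for row in data:
            ctx = tuple(row[j] for j in pidx)
            # MLE of P(v = row[i] | parents = ctx) from counts
            match = [r for r in data if tuple(r[j] for j in pidx) == ctx]
            n_val = sum(1 for r in match if r[i] == row[i])
            ll += log(n_val / len(match))
    return ll

empty   = {'E': (), 'B': (), 'A': ()}           # no arcs
with_pa = {'E': (), 'B': (), 'A': ('E', 'B')}   # E -> A <- B
# Adding parents can only raise the max log-likelihood score.
```

Here `loglik(with_pa) > loglik(empty)` on this data; a real search (e.g., greedy hill climbing) would apply such a score, with a penalty term, over edge additions, deletions, and reversals.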
Need for Parsimony

Maximizing fit to the data alone favors ever-denser structures; a useful score must trade off fit against model complexity.
