Learning Bayesian Networks


An Overview of Learning Bayes Nets From Data
Chris Meek
Microsoft Research
http://research.microsoft.com/~meek

What's and Why's
– What is a Bayesian network?
– Why are Bayesian networks useful?
– Why learn a Bayesian network?

What is a Bayesian Network?
Also called belief networks and (directed acyclic) graphical models.
Directed acyclic graph
– Nodes are variables (discrete or continuous)
– Arcs indicate dependence between variables
Conditional probabilities (local distributions)
Missing arcs imply conditional independence
Independencies + local distributions => modular specification of a joint distribution:
p(x_1) p(x_2 | x_1) p(x_3 | x_1, x_2) = p(x_1, x_2, x_3)

Why Bayesian Networks?
Expressive language
– Finite mixture models, factor analysis, HMM, Kalman filter, ...
Intuitive language
– Can utilize causal knowledge in constructing models
– Domain experts are comfortable building a network
General-purpose "inference" algorithms
– e.g., P(Bad Battery | Has Gas, Won't Start) in a Battery/Gas/Start network
– Exact: modular specification leads to large computational efficiencies
– Approximate: "loopy" belief propagation

Why Learning?
A spectrum from knowledge-based (expert systems) to data-based:
– Answer Wizard, Office 95, 97, & 2000
– Troubleshooters, Windows 98 & 2000
– Causal discovery
– Data visualization
– Concise model of data
– Prediction

Overview
Learning probabilities (local distributions)
– Introduction to Bayesian statistics: learning a probability
– Learning probabilities in a Bayes net
– Applications
Learning Bayes-net structure
– Bayesian model selection/averaging
– Applications

Learning Probabilities: Classical Approach
Simple case: flipping a thumbtack, which lands heads or tails.
The true probability θ is unknown.
Given iid data, estimate θ using an estimator with good properties: low bias, low variance, consistent (e.g., the ML estimate).

Learning Probabilities: Bayesian Approach
The true probability θ is unknown.
Maintain a Bayesian probability density p(θ) for θ ∈ [0, 1].

Bayesian Approach: use Bayes' rule to compute a new density for θ given data
p(θ | data) = p(θ) p(data | θ) / ∫ p(θ) p(data | θ) dθ ∝ p(θ) p(data | θ)
(posterior ∝ prior × likelihood)

The Likelihood
p(heads | θ) = θ
p(tails | θ) = 1 − θ
p(hhth...ttth | θ) = θ^{#h} (1 − θ)^{#t}
(the "binomial distribution")

Example: application of Bayes' rule to the observation of a single "heads"
p(θ | heads) ∝ p(θ) × θ
(prior × likelihood → posterior)

The probability of heads on the next toss
p(h | d) = ∫ p(h | θ, d) p(θ | d) dθ = ∫ θ p(θ | d) dθ = E_{p(θ|d)}[θ]
Note: this yields nearly identical answers to ML estimates when one uses a "flat" prior.
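As a concrete illustration (not from the slides): with a conjugate Beta(a, b) prior, the posterior after observing #h heads and #t tails is Beta(a + #h, b + #t), and the predictive probability of heads is the posterior mean. A minimal Python sketch, assuming that Beta prior:

```python
# Bayesian update for a thumbtack (Bernoulli) probability, assuming a
# conjugate Beta(a, b) prior: the posterior is Beta(a + #heads, b + #tails),
# and p(heads | data) is the posterior mean E[theta | data].

def update_beta(a: float, b: float, n_heads: int, n_tails: int) -> tuple[float, float]:
    """Return the Beta posterior parameters after observing the counts."""
    return a + n_heads, b + n_tails

def prob_next_heads(a: float, b: float) -> float:
    """Predictive p(heads | data) = mean of Beta(a, b)."""
    return a / (a + b)

# Example: a "flat" Beta(1, 1) prior and the single observed "heads" above.
a, b = update_beta(1.0, 1.0, n_heads=1, n_tails=0)
print(prob_next_heads(a, b))  # 2/3; the ML estimate would be 1.0
```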
Overview (outline repeated)

From thumbtacks to Bayes nets
The thumbtack problem can be viewed as learning the probability for a very simple BN:
a single node X (heads/tails) with P(X = heads) = f(θ).
[Plate model: parameter Θ with an arc to each X_i, i = 1 to N (toss 1, toss 2, ..., toss N).]

The next simplest Bayes net
Two unconnected variables, X (heads/tails) and Y (heads/tails), with parameters Θ_X and Θ_Y and data X_i, Y_i for i = 1 to N.
Are Θ_X and Θ_Y connected? With no arc between them we have "parameter independence"
⇓
two separate thumbtack-like learning problems.

A bit more difficult...
X (heads/tails) → Y (heads/tails)
Three probabilities to learn:
– θ_{X=heads}
– θ_{Y=heads|X=heads}
– θ_{Y=heads|X=tails}
[Parameters Θ_X, Θ_{Y|X=heads}, Θ_{Y|X=tails}; data (X_1, Y_1) for case 1, (X_2, Y_2) for case 2, ... Each case's observed X value (e.g., heads in case 1, tails in case 2) routes its Y to the matching parameter.]
⇒ 3 separate thumbtack-like problems

In general…
Learning probabilities in a BN is straightforward if:
– Likelihoods are from the exponential family (multinomial, Poisson, gamma, ...)
– Parameter independence
– Conjugate priors
– Complete data
(A sketch of this decomposition follows.)
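A minimal sketch of the decomposition for the X → Y network above, assuming complete data and illustrative Beta(1, 1) priors (the toy cases below are invented for the example):

```python
# Learning the three probabilities of the X -> Y network from complete data.
# Parameter independence + conjugate Beta(1, 1) priors mean each local
# distribution is estimated from its own counts, as in the thumbtack case.
from collections import Counter

cases = [("heads", "tails"), ("tails", "heads"),
         ("heads", "heads"), ("heads", "tails")]  # toy complete data (x, y)

counts = Counter(cases)
n = len(cases)

def posterior_mean(heads: int, total: int, a: float = 1.0, b: float = 1.0) -> float:
    """Posterior mean of theta under a Beta(a, b) prior."""
    return (heads + a) / (total + a + b)

# theta_{X=heads}: one thumbtack problem over all cases.
n_x_heads = sum(v for (x, _), v in counts.items() if x == "heads")
theta_x = posterior_mean(n_x_heads, n)

# theta_{Y=heads|X=x}: one thumbtack problem per value of the parent X.
theta_y_given = {}
for x in ("heads", "tails"):
    n_x = sum(v for (xi, _), v in counts.items() if xi == x)
    theta_y_given[x] = posterior_mean(counts[(x, "heads")], n_x)

print(theta_x, theta_y_given)
```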
Incomplete data makes parameters dependent
In the X (heads/tails) → Y (heads/tails) network, if a case's X_1 or Y_1 is unobserved, Θ_X, Θ_{Y|X=heads}, and Θ_{Y|X=tails} are no longer independent given the data.

Incomplete data
Incomplete data makes parameters dependent.
Parameter learning for incomplete data:
– Monte Carlo integration: investigate properties of the posterior and perform prediction
– Large-sample approximations (Laplace/Gaussian): use the expectation-maximization (EM) algorithm and inference to compute the mean and variance
– Variational methods

Overview (outline repeated)

Example: Audio-video fusion (Beal, Attias, & Jojic 2002)
Video scenario: a camera observes the scene. Audio scenario: mic. 1 and mic. 2, with the inter-microphone time delay τ determined by the source location l_x.
Goal: detect and track the speaker.
(Slide courtesy Beal, Attias and Jojic)

Separate audio-video models
[Figure: for each frame n = 1, ..., N, independent models generate the audio data and the video data.]
(Slide courtesy Beal, Attias and Jojic)

Combined model
[Figure: a shared variable α ties the audio and video models together for each frame n = 1, ..., N.]
(Slide courtesy Beal, Attias and Jojic)

Tracking Demo
(Slide courtesy Beal, Attias and Jojic)

Overview (outline repeated)

Two Types of Methods for Learning BNs
Constraint based
– Finds a Bayesian network structure whose implied independence constraints "match" those found in the data.
Scoring methods (Bayesian, MDL, MML)
– Find the Bayesian network structure that can represent distributions that "match" the data (i.e., could have generated the data).

Learning Bayes-net structure
Given data, which model is correct?
– model 1: X Y (no arc)
– model 2: X → Y

Bayesian approach
Given data, which model is more likely?
– model 1: X Y, with p(m_1) = 0.7 and p(m_1 | d) = 0.1
– model 2: X → Y, with p(m_2) = 0.3 and p(m_2 | d) = 0.9

Bayesian approach: Model Averaging
The data d move the priors p(m_1) = 0.7, p(m_2) = 0.3 to the posteriors p(m_1 | d) = 0.1, p(m_2 | d) = 0.9; average the models' predictions, weighted by these posteriors.

Bayesian approach: Model Selection
Keep the best model, for:
– Explanation
– Understanding
– Tractability

To score a model, use Bayes' rule
Given data d, the model score is
p(m | d) ∝ p(m) p(d | m)
where the "marginal likelihood" is
p(d | m) = ∫ p(d | θ, m) p(θ | m) dθ

The Bayesian approach and Occam's Razor
p(d | m) = ∫ p(d | θ, m) p(θ | m) dθ
[Figure: in the space of all distributions, a simple model's p(θ | m) misses the true distribution, a complicated model spreads its mass too thin, and the "just right" model concentrates mass near the true distribution.]

Computation of Marginal Likelihood
Efficient closed form if:
– Likelihoods are from the exponential family (binomial, Poisson, gamma, ...)
– Parameter independence
– Conjugate priors
– No missing data, including no hidden variables
Else use approximations:
– Monte Carlo integration
– Large-sample approximations
– Variational methods
(A closed-form sketch for the thumbtack follows.)
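For the single thumbtack, the integral has the standard closed form B(a + #h, b + #t) / B(a, b) under a Beta(a, b) prior (the prior choice here is an illustrative assumption). A minimal sketch in log space:

```python
# Closed-form log marginal likelihood of a heads/tails sequence under a
# Beta(a, b) prior:
#   p(d | m) = integral of theta^#h (1 - theta)^#t p(theta) dtheta
#            = B(a + #h, b + #t) / B(a, b)
from math import lgamma

def log_beta(a: float, b: float) -> float:
    """log B(a, b) = lgamma(a) + lgamma(b) - lgamma(a + b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marginal_likelihood(n_heads: int, n_tails: int,
                            a: float = 1.0, b: float = 1.0) -> float:
    return log_beta(a + n_heads, b + n_tails) - log_beta(a, b)

# With parameter independence, a network's marginal likelihood is a product
# of such terms, one per variable and parent configuration.
print(log_marginal_likelihood(7, 3))
```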
Practical considerations
The number of possible BN structures is super-exponential in the number of variables.
How do we find the best graph(s)?

Model search
Finding the BN structure with the highest score among those structures with at most k parents is NP-hard for k > 1 (Chickering, 1995).
Heuristic methods:
– Greedy
– Greedy with restarts
– MCMC methods
[Flowchart: initialize structure → score all possible single changes → perform best change → any changes better? If yes, repeat; if no, return saved structure.]
(A sketch of this greedy loop follows.)
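A minimal sketch of that greedy loop. Here `score` and `neighbors` are hypothetical hooks, not from the slides: `score(g, data)` would return a decomposable score such as the log marginal likelihood, and `neighbors(g)` would enumerate the acyclic graphs reachable by one edge addition, deletion, or reversal.

```python
# Greedy structure search, following the flowchart above: repeatedly score all
# single-edge changes and apply the best one until no change improves the score.
# `score` and `neighbors` are caller-supplied hooks (hypothetical, see lead-in).

def greedy_search(initial_graph, data, score, neighbors):
    current = initial_graph
    current_score = score(current, data)
    while True:
        # Score all possible single changes to the current structure.
        scored = [(score(g, data), g) for g in neighbors(current)]
        if not scored:
            return current
        best_score, best = max(scored, key=lambda pair: pair[0])
        # Any change better? If not, return the saved structure.
        if best_score <= current_score:
            return current
        # Perform the best change and repeat.
        current, current_score = best, best_score
```

Greedy with restarts and MCMC variants differ mainly in how they escape the local maxima this loop can get stuck in.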
Learning the correct model
Let G be the true graph and P the generative distribution.
– Markov assumption: P satisfies the independencies implied by G.
– Faithfulness assumption: P satisfies only the independencies implied by G.
Theorem: under the Markov and faithfulness assumptions, with enough data generated from P one can recover G (up to equivalence). Even with the greedy method!

Learning Bayes Nets From Data
[Figure: data (a table of cases over X_1, ..., X_9, e.g. (true, 1, Red), (false, 5, Blue), (false, 3, Green), (true, 2, Red)) plus prior/expert information feed a Bayes-net learner, which outputs Bayes net(s).]

Overview (outline repeated)