Bayesian inference and Bayesian model selection
Klaas Enno Stephan Lecture as part of "Methods & Models for fMRI data analysis", University of Zurich & ETH Zurich, 27 November 2018
With slides from and many thanks to: Kay Brodersen, Will Penny, Sudhir Shankar Raman

Why should I know about Bayesian inference?
Because Bayesian principles are fundamental for
• statistical inference in general
• system identification
• translational neuromodeling ("computational assays") – computational psychiatry – computational neurology – computational psychosomatics
• contemporary theories of brain function (the "Bayesian brain") – predictive coding – free energy principle – active inference

Bayes' theorem
The Reverend Thomas Bayes (1702-1761)

p(θ|y) = p(y|θ) p(θ) / p(y)

"... the theorem expresses how a ... degree of belief should rationally change to account for availability of related evidence." (Wikipedia)

posterior = likelihood ∙ prior / evidence

Bayesian inference: an animation

The evidence term
continuous case: p(θ|y) = p(y|θ) p(θ) / ∫ p(y|θ′) p(θ′) dθ′

discrete case: p(θ|y) = p(y|θ) p(θ) / Σθ′ p(y|θ′) p(θ′)

Bayesian inference: An example (with fictitious probabilities)
• symptom y: y=1: fever, y=0: no fever
• disease θ: θ=1: Ebola, θ=0: any other disease (AOD)

Likelihood p(y|θ):
                 θ=1 (Ebola)   θ=0 (AOD)
y=1 (fever)         99.99%        20%
y=0 (no fever)       0.01%        80%

• A priori: p(Ebola) = 10⁻⁶, p(AOD) = 1 − 10⁻⁶

• A patient presents with fever. What is the probability that he/she has Ebola?

p(θ=1 | y=1) = p(y=1 | θ=1) p(θ=1) / Σ_{j=0,1} p(y=1 | θ=j) p(θ=j)
             = (0.9999 × 10⁻⁶) / (0.9999 × 10⁻⁶ + 0.2 × (1 − 10⁻⁶))
             ≈ 5 × 10⁻⁶
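The arithmetic above is easy to check in a few lines; a minimal sketch with the same fictitious probabilities:

```python
# Bayes' theorem for the (fictitious) Ebola example above.
p_fever_given_ebola = 0.9999   # p(y=1 | theta=1)
p_fever_given_aod = 0.20       # p(y=1 | theta=0)
p_ebola = 1e-6                 # prior p(theta=1)

evidence = p_fever_given_ebola * p_ebola + p_fever_given_aod * (1 - p_ebola)
posterior = p_fever_given_ebola * p_ebola / evidence
print(posterior)               # ~5e-06: the tiny prior dominates the likelihood
```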
Generative models

likelihood p(y|θ,m), prior p(θ|m)

Generative models:
1. specify the joint probability over data (observations) and parameters
2. enforce mechanistic thinking: how could the data have been caused?
3. generate synthetic data (observations) by sampling from the prior – can the model explain certain phenomena at all?
4. allow inference about parameters → p(θ|y)
5. provide the model evidence p(y|m): an index of model quality
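As an illustration of point 3, a minimal sketch of sampling synthetic data from a generative model — here a hypothetical linear model y = θx + noise with a Gaussian prior on θ; the model and all numbers are assumptions for illustration:

```python
# Generate synthetic data by first sampling parameters from the prior,
# then sampling observations from the likelihood.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)

theta = rng.normal(loc=0.0, scale=1.0)         # draw theta from the prior
y_synth = theta * x + rng.normal(0, 0.1, 50)   # draw data from the likelihood
```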
Bayesian inference in practice

Formulation of a generative model:
– likelihood function p(y|θ)
– prior distribution p(θ)

Observation of data: measurement data y

Model inversion – updating one's beliefs:
p(θ|y) = p(y|θ) p(θ) / p(y), i.e. posterior = likelihood ∙ prior / model evidence

Example of a shrinkage prior

Priors

• Objective priors: – "non-informative" priors – objective constraints (e.g., non-negativity)
• Subjective priors: – subjective but not arbitrary – can express beliefs that result from understanding of the problem or system – can be the result of previous empirical findings
• Shrinkage priors: – emphasize regularization and sparsity
• Empirical priors: – learn parameters of prior distributions from the data ("empirical Bayes") – rest on a hierarchical model

A generative modelling framework for fMRI & EEG: Dynamic causal modeling (DCM)
[Figure: measurement modalities – dwMRI, EEG/MEG, fMRI]

Forward model: predicting measured activity from neuronal parameters
Model inversion: inferring neuronal parameters from measured activity

dx/dt = f(x, u, θ)
y = g(x, θ)

Friston et al. 2003, NeuroImage; Stephan et al. 2009, NeuroImage

DCM for fMRI
Stephan et al. 2015, Neuron

[Figure: simulated neural population activity in regions x1, x2, x3 driven by inputs u1 and u2, and the corresponding fMRI signal changes (%); plot details omitted]

Nonlinear dynamic causal model for fMRI:

dx/dt = (A + Σᵢ₌₁..ₘ uᵢ B⁽ⁱ⁾ + Σⱼ₌₁..ₙ xⱼ D⁽ʲ⁾) x + C u

Stephan et al. 2008, NeuroImage

Bayesian system identification
Design experimental inputs: u(t)

Define likelihood model:
– neural dynamics: dx/dt = f(x, u, θ)
– observer function: y = g(x, θ)
– p(y|θ,m) = N(g(θ), Σ(λ))

Specify priors: p(θ|m) = N(μθ, Σθ)

Invert model: p(θ|y,m) = p(y|θ,m) p(θ|m) / p(y|m)

Make inferences:
– on model structure, via the model evidence p(y|m) = ∫ p(y|θ,m) p(θ|m) dθ
– on parameters, via the posterior p(θ|y,m)
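A minimal forward-model sketch in this spirit: a linear two-region system dx/dt = Ax + Cu integrated by forward Euler, with an identity observer standing in for the hemodynamic function g; the coupling values and input are made up for illustration:

```python
import numpy as np

dt, T = 0.01, 10.0
t = np.arange(0.0, T, dt)
A = np.array([[-1.0, 0.0],
              [0.5, -1.0]])          # intrinsic coupling (assumed values)
C = np.array([1.0, 0.0])             # driving-input weights (input -> region 1)
u = (t < 2.0).astype(float)          # brief box-car input

x = np.zeros((2, t.size))
for k in range(t.size - 1):
    dx = A @ x[:, k] + C * u[k]      # dx/dt = A x + C u
    x[:, k + 1] = x[:, k] + dt * dx  # forward Euler step
y = x                                # identity observer (real DCM: y = g(x, theta))
```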
Why should I know about Bayesian inference?

Because Bayesian principles are fundamental for
• statistical inference in general
• system identification
• translational neuromodeling ("computational assays") – computational psychiatry – computational neurology – computational psychosomatics
• contemporary theories of brain function (the "Bayesian brain") – predictive coding – free energy principle – active inference

Generative models as "computational assays"
Computational assays – models of disease mechanisms, e.g. dx/dt = f(x, u, θ), with likelihood p(y|θ,m) and prior p(θ|m)

Translational neuromodeling:
– application to brain activity and behaviour of individual patients
– detecting physiological subgroups (based on inferred mechanisms, e.g. disease mechanisms A, B, C)
– individual treatment prediction

Stephan et al. 2015, Neuron

Generative embedding (supervised)
step 1 – model inversion: measurements from an individual subject → subject-specific inverted generative model
step 2 – kernel construction: subject representation in the generative score space (e.g., model parameters such as connections A→B, A→C, B→B, B→C)
step 3 – support vector classification: separating hyperplane fitted to discriminate between groups
step 4 – interpretation: jointly discriminative model parameters

Brodersen et al. 2011, PLoS Comput. Biol.

Generative embedding (unsupervised)
Brodersen et al. 2014, NeuroImage

Differential diagnosis by model selection
SYMPTOM y (behaviour or physiology)

Hypothetical mechanisms m₁ ... mₖ ... m_K, each with likelihood p(y|mₖ) and posterior p(mₖ|y):

p(mₖ|y) = p(y|mₖ) p(mₖ) / Σₖ p(y|mₖ) p(mₖ)

Stephan et al. 2017, NeuroImage

Why should I know about Bayesian inference?
Because Bayesian principles are fundamental for
• statistical inference in general
• system identification
• translational neuromodeling ("computational assays") – computational psychiatry – computational neurology – computational psychosomatics
• contemporary theories of brain function (the "Bayesian brain") – predictive coding – free energy principle – active inference

Perception = inversion of a hierarchical generative model
[Figure: environmental states, others' mental states, and bodily states are represented by neuronal states]

forward model: p(y|x,m) p(x|m)
perception = model inversion: p(x|y,m)

Free energy principle: predictive coding & active inference
prediction error = sensations – predictions

Action: change the sensory input
Perception: change the predictions

Maximizing the evidence (of the brain's generative model) = minimizing the surprise about the data (sensory inputs).

Friston et al. 2006, J Physiol Paris

How is the posterior computed = how is a generative model inverted?
Bayesian inference:
– analytical solutions
– approximate inference: variational Bayes, MCMC sampling

How is the posterior computed = how is a generative model inverted?
• compute the posterior analytically – requires conjugate priors
• variational Bayes (VB) – often hard work to derive, but fast to compute – uses approximations (approx. posterior, mean field) – problems: local minima, potentially inaccurate approximations
• Sampling: Markov Chain Monte Carlo (MCMC) – theoretically guaranteed to be accurate (for infinite computation time) – problems: may require very long run time in practice, convergence difficult to prove

Conjugate priors
• for a given likelihood function, the choice of prior determines the algebraic form of the posterior
• for some probability distributions a prior can be found such that the posterior has the same algebraic form as the prior
• such a prior is called “conjugate” to the likelihood
• examples (prior × likelihood → posterior of the same form):
– Normal × Normal → Normal
– Beta × Binomial → Beta
– Dirichlet × Multinomial → Dirichlet

p(θ|y) ∝ p(y|θ) p(θ): prior and posterior have the same algebraic form
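A minimal sketch of conjugacy for the Beta × Binomial case; prior hyperparameters and data are made-up values:

```python
# Beta prior x Binomial likelihood -> Beta posterior (same algebraic form),
# with a closed-form update of the hyperparameters.
a, b = 2.0, 2.0                        # Beta(a, b) prior on success probability
k, n = 7, 10                           # data: k successes in n trials

a_post, b_post = a + k, b + (n - k)    # posterior is Beta(a + k, b + n - k)
post_mean = a_post / (a_post + b_post)
print(post_mean)                       # ~0.64: pulled from MLE 0.7 toward prior mean 0.5
```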
Posterior mean & variance of univariate Gaussians

Likelihood and prior:
p(y|θ) = N(θ, σₑ²)
p(θ) = N(ηₚ, σₚ²)

Posterior: p(θ|y) = N(η, σ²), with

1/σ² = 1/σₑ² + 1/σₚ²
η = σ² (ηₚ/σₚ² + y/σₑ²)

Posterior mean = variance-weighted combination of prior mean and data mean

Same thing – but expressed as precision weighting
Likelihood and prior:
p(y|θ) = N(θ, λₑ⁻¹)
p(θ) = N(ηₚ, λₚ⁻¹)

Posterior: p(θ|y) = N(η, λ⁻¹), with

λ = λₑ + λₚ
η = (λₑ/λ) y + (λₚ/λ) ηₚ

Relative precision weighting of data and prior mean.
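A minimal numeric sketch of this precision weighting (all values made up):

```python
# Precision-weighted Gaussian posterior: precisions add, and the posterior
# mean is a precision-weighted average of the data and the prior mean.
lam_e, lam_p = 4.0, 1.0     # likelihood (data) and prior precisions
y, eta_p = 2.0, 0.0         # observed value and prior mean

lam = lam_e + lam_p
eta = (lam_e / lam) * y + (lam_p / lam) * eta_p
print(eta, 1.0 / lam)       # mean 1.6 (closer to the more precise data), var 0.2
```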
Variational Bayes (VB)

Idea: find an approximate density q(θ) that is maximally similar to the true posterior p(θ|y). This is often done by assuming a particular form for q (fixed-form VB) and then optimizing its sufficient statistics.

[Figure: the true posterior p(θ|y), the hypothesis class of candidate densities, and the best proxy q(θ) within it, separated by the divergence KL(q‖p)]

Kullback–Leibler (KL) divergence
• asymmetric measure of the difference between two probability distributions P and Q
• Interpretations of DKL(P‖Q): – "Bayesian surprise" when Q=prior, P=posterior: measure of the information gained when one updates one's prior beliefs to the posterior P – a measure of the information lost when Q is used to approximate P
• non-negative: DKL(P‖Q) ≥ 0, with equality iff P = Q
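A minimal sketch of the KL divergence for two discrete distributions (made-up probabilities), illustrating its asymmetry and non-negativity:

```python
import numpy as np

P = np.array([0.7, 0.2, 0.1])
Q = np.array([0.5, 0.3, 0.2])

kl_pq = np.sum(P * np.log(P / Q))   # D_KL(P || Q)
kl_qp = np.sum(Q * np.log(Q / P))   # D_KL(Q || P)
print(kl_pq, kl_qp)                 # both >= 0, and generally unequal (asymmetric)
```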
Variational calculus

Standard calculus (Newton, Leibniz, and others):
• functions f: x ↦ f(x)
• derivatives df/dx
• example: maximize the likelihood expression p(y|θ) w.r.t. θ

Variational calculus (Euler, Lagrange, and others):
• functionals F: f ↦ F[f]
• derivatives dF/df
• example: maximize the entropy H[p] w.r.t. a probability distribution p(x)

Leonhard Euler (1707–1783), Swiss mathematician ('Elementa Calculi Variationum')

Variational Bayes
ln p(y) = KL[q‖p] + F(q, y)

• KL[q‖p]: divergence between q(θ) and the true posterior p(θ|y); ≥ 0 and unknown
• F(q, y): negative free energy; easy to evaluate for a given q

F is a functional of the approximate posterior q(θ). Maximizing F(q, y) is equivalent to:
• minimizing KL[q‖p]
• tightening F(q, y) as a lower bound to the log model evidence

When F(q, y) is maximized (iterating from initialization to convergence), q(θ) is our best estimate of the posterior.

Derivation of the (negative) free energy approximation
• See whiteboard! • (or the Appendix of Stephan et al. 2007, NeuroImage 38: 387-401)

Mean field assumption
Factorize the approximate posterior q(θ) into independent partitions:

q(θ) = ∏ᵢ qᵢ(θᵢ)

where qᵢ(θᵢ) is the approximate posterior for the i-th subset of parameters.

For example, split parameters θ and hyperparameters λ:

p(θ, λ | y) ≈ q(θ, λ) = q(θ) q(λ)

Jean Daunizeau, www.fil.ion.ucl.ac.uk/~jdaunize/presentations/Bayes2.pdf

VB in a nutshell (under mean-field approximation)
Neg. free energy approximation to the model evidence:
ln p(y|m) = F + KL[q(θ,λ), p(θ,λ|y,m)]
F = ⟨ln p(y|θ,λ)⟩_q − KL[q(θ,λ), p(θ,λ|m)]

Mean field approximation:
p(θ,λ|y) ≈ q(θ,λ) = q(θ) q(λ)

Maximize the neg. free energy w.r.t. q (= minimize the divergence) by maximizing the variational energies:
q(θ) ∝ exp(I(θ)) = exp(⟨ln p(y,θ,λ)⟩_q(λ))
q(λ) ∝ exp(I(λ)) = exp(⟨ln p(y,θ,λ)⟩_q(θ))

Iterative updating of the sufficient statistics of the approximate posteriors by gradient ascent.
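A compact worked sketch of such coupled mean-field updates for one textbook case — a univariate Gaussian with unknown mean μ and precision τ, following Bishop (PRML, §10.1.3); priors, data, and initialization are assumptions for illustration:

```python
# Mean-field VB for mu and tau of a Gaussian: q(mu, tau) = q(mu) q(tau).
# Priors (assumed): mu | tau ~ N(mu0, (lam0*tau)^-1), tau ~ Gamma(a0, b0).
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=0.5, size=100)
N, xbar, xsq = x.size, x.mean(), np.sum(x ** 2)

mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0
E_tau = a0 / b0                                   # initialize E[tau]

for _ in range(50):                               # iterate the coupled updates
    # q(mu) = N(muN, 1/lamN)
    muN = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lamN = (lam0 + N) * E_tau
    E_mu, E_mu2 = muN, muN ** 2 + 1.0 / lamN
    # q(tau) = Gamma(aN, bN)
    aN = a0 + (N + 1) / 2.0
    bN = b0 + 0.5 * (xsq - 2 * N * xbar * E_mu + N * E_mu2
                     + lam0 * (E_mu2 - 2 * mu0 * E_mu + mu0 ** 2))
    E_tau = aN / bN

print(muN, np.sqrt(bN / aN))   # approx. posterior mean ~2.0, noise s.d. ~0.5
```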
Model comparison and selection

Given competing hypotheses on the structure & functional mechanisms of a system, which model is the best?
Which model represents the best balance between model fit and model complexity?
For which model m does p(y|m) become maximal?

Pitt & Myung (2002) TICS

Bayesian model selection (BMS)
• First step of inference: define the model space M

• Inference on model structure m – posterior model probability:

p(m|y) = p(y|m) p(m) / Σₘ p(y|m) p(m)

• For a uniform prior on m, the model evidence is sufficient for model selection:

p(y|m) = ∫ p(y|θ,m) p(θ|m) dθ

Bayesian model selection (BMS)
Model evidence:

p(y|m) = ∫ p(y|θ,m) p(θ|m) dθ

= probability that the data y were generated by model m, averaged over all possible parameter values (as specified by the prior); accounts for both the accuracy and the complexity of the model.

Intuition: "If I randomly sampled from my prior and plugged the resulting value into the likelihood function, how close would the predicted data be – on average – to my observed data?"

Various approximations:
– negative free energy (F)
– Akaike Information Criterion (AIC)
– Bayesian Information Criterion (BIC)

Ghahramani 2004
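The quoted intuition can be turned into a naive Monte Carlo estimator of the evidence; a sketch for an assumed toy model y ~ N(θ, 1) with prior θ ~ N(0, 1):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = 1.5                                        # observed data point

theta = rng.normal(0.0, 1.0, size=100_000)     # samples from the prior
evidence = norm.pdf(y, loc=theta, scale=1.0).mean()   # average likelihood
print(evidence)                                # -> ~0.16, the analytic N(y; 0, 2)
```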
Approximations to the model evidence

The logarithm is a monotonic function, so maximizing the log model evidence = maximizing the model evidence.

Log model evidence = balance between fit and complexity:

log p(y|m) = accuracy(m) − complexity(m)
           = log p(y|θ̂,m) − complexity(m)

With p = number of parameters and N = number of data points:

Akaike Information Criterion: AIC = log p(y|θ̂,m) − p
Bayesian Information Criterion: BIC = log p(y|θ̂,m) − (p/2) log N
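A sketch of AIC and BIC in this sign convention (higher = better) for an assumed toy regression fitted by maximum likelihood:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 40)
y = 1.0 + 2.0 * x + rng.normal(0, 0.3, x.size)

X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]    # ML estimate of the betas
resid = y - X @ beta
log_lik = norm.logpdf(resid, scale=resid.std()).sum()

p, N = 3, x.size                               # 2 betas + noise s.d.
aic = log_lik - p
bic = log_lik - 0.5 * p * np.log(N)
print(aic, bic)            # BIC penalizes complexity more once N > e^2 ~ 7.4
```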
The (negative) free energy approximation F

F is a lower bound on the log model evidence:

log p(y|m) = F + KL[q(θ), p(θ|y,m)]  ⇒  log p(y|m) ≥ F

Like AIC/BIC, F is an accuracy/complexity tradeoff:

F = ⟨log p(y|θ,m)⟩_q − KL[q(θ), p(θ|m)] = accuracy − complexity

The (negative) free energy approximation
• The log evidence is thus the expected log likelihood (w.r.t. q) plus two KL terms:

log p(y|m) = ⟨log p(y|θ,m)⟩_q − KL[q(θ), p(θ|m)] + KL[q(θ), p(θ|y,m)]

F = log p(y|m) − KL[q(θ), p(θ|y,m)]
  = ⟨log p(y|θ,m)⟩_q − KL[q(θ), p(θ|m)]
    (accuracy)          (complexity)

The complexity term in F
• In contrast to AIC & BIC, the complexity term of the negative free energy F accounts for parameter interdependencies. Under Gaussian assumptions about the posterior (Laplace approximation):

KL[q(θ), p(θ|m)] = ½ ln|C_θ| − ½ ln|C_θ|y| + ½ (μ_θ|y − μ_θ)ᵀ C_θ⁻¹ (μ_θ|y − μ_θ)

• The complexity term of F is higher:
– the more independent the prior parameters (↑ effective degrees of freedom)
– the more dependent the posterior parameters
– the more the posterior mean deviates from the prior mean
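A numeric sketch of this complexity term for hypothetical prior and posterior moments of two parameters (note: the full Gaussian KL also carries a trace term, ½(tr(C_θ⁻¹ C_θ|y) − k), which the slide's expression omits):

```python
import numpy as np

mu_prior, C_prior = np.zeros(2), np.eye(2)      # prior moments (assumed)
mu_post = np.array([0.5, -0.3])                 # posterior mean (assumed)
C_post = np.array([[0.20, 0.05],
                   [0.05, 0.10]])               # posterior covariance (assumed)

d = mu_post - mu_prior
complexity = (0.5 * np.log(np.linalg.det(C_prior))
              - 0.5 * np.log(np.linalg.det(C_post))
              + 0.5 * d @ np.linalg.solve(C_prior, d))
print(complexity)   # grows with mean deviation and posterior dependencies
```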
Bayes factors

To compare two models, we could just compare their log evidences.
But: the log evidence is just some number – not very intuitive!
A more intuitive interpretation of model comparisons is made possible by Bayes factors:
B₁₂ = p(y|m₁) / p(y|m₂): a positive value in [0, ∞[

Kass & Raftery classification:

B₁₂          p(m₁|y)    Evidence
1 to 3       50–75%     weak
3 to 20      75–95%     positive
20 to 150    95–99%     strong
≥ 150        ≥ 99%      very strong
Kass & Raftery 1995, J. Am. Stat. Assoc.
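A minimal sketch of turning two log evidences (e.g., free-energy values; numbers made up) into a Bayes factor and a posterior model probability:

```python
import numpy as np

log_ev_m1, log_ev_m2 = -410.2, -413.5     # hypothetical log model evidences
bf_12 = np.exp(log_ev_m1 - log_ev_m2)     # Bayes factor B12
p_m1 = bf_12 / (1.0 + bf_12)              # p(m1|y) under a uniform model prior
print(bf_12, p_m1)   # ~27 -> "strong" evidence on the Kass & Raftery scale
```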
Fixed effects BMS at group level

Group Bayes factor (GBF) for subjects k = 1...K:
GBF_ij = ∏ₖ BF_ij⁽ᵏ⁾

Average Bayes factor (ABF):

ABF_ij = (∏ₖ BF_ij⁽ᵏ⁾)^(1/K)

Problems:
– blind with regard to group heterogeneity
– sensitive to outliers

Random effects BMS for heterogeneous groups
Dirichlet parameters α = "occurrences" of models in the population

r ~ Dir(r; α): Dirichlet distribution of the model probabilities r, k = 1...K
mₙ ~ Mult(m; 1, r): multinomial distribution of the model labels m
yₙ ~ p(yₙ|mₙ): measured data, subjects n = 1...N; model inversion by variational Bayes or MCMC

Stephan et al. 2009, NeuroImage

Four equivalent options for reporting model ranking by random effects BMS
1. Dirichlet parameter estimates α

2. expected posterior probability of obtaining the k-th model for any randomly selected subject:
⟨rₖ⟩ = αₖ / (α₁ + ... + α_K)

3. exceedance probability that a particular model k is more likely than any other model (of the K models tested), given the group data:
φₖ = p(∀j ∈ {1...K | j ≠ k}: rₖ > rⱼ | y; α)

4. protected exceedance probability: see below
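Options 2 and 3 are easy to compute from the Dirichlet parameters; a sketch with made-up α for K = 3 models, estimating exceedance probabilities by sampling:

```python
import numpy as np

alpha = np.array([11.8, 2.2, 1.0])          # hypothetical Dirichlet estimates
expected_r = alpha / alpha.sum()            # option 2: expected probabilities

rng = np.random.default_rng(0)
r = rng.dirichlet(alpha, size=100_000)      # samples of model probabilities
winner = r.argmax(axis=1)                   # which model dominates each draw
exceedance = np.bincount(winner, minlength=3) / r.shape[0]   # option 3
print(expected_r, exceedance)
```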
Example: hemispheric interactions during vision

[Figure: two DCMs of interhemispheric integration between lingual gyrus (LG), middle occipital gyrus (MOG) and fusiform gyrus (FG), with inputs RVF stim., LVF stim. and LD; in m1, interhemispheric connections are modulated by LD|RVF and LD|LVF, in m2 by LD]

Data: Stephan et al. 2003, Science
Models: Stephan et al. 2007, J. Neurosci.

[Figure: log model evidence differences across subjects (approx. −35 to 5) and the posterior density p(r₁|y)]

Result: α₁ = 11.8, α₂ = 2.2; expected probabilities ⟨r₁⟩ = 84.3%, ⟨r₂⟩ = 15.7%; exceedance probability p(r₁ > 0.5 | y) = p(r₁ > r₂ | y) = 99.7%

Stephan et al. 2009a, NeuroImage

Example: Synaesthesia
• “projectors” experience color externally colocalized with a presented grapheme
• “associators” report an internally evoked association
• across all subjects: no evidence for either model
• but BMS results map precisely onto projectors (bottom-up mechanisms) and associators (top-down)
van Leeuwen et al. 2011, J. Neurosci.

Overfitting at the level of models
• the larger the number of models considered, the greater the risk of overfitting at the level of models

posterior model probability:
p(m|y) = p(y|m) p(m) / Σₘ p(y|m) p(m)

• solutions:
– regularisation: definition of the model space = choosing the priors p(m)
– family-level BMS
– Bayesian model averaging (BMA)

BMA: p(θ|y) = Σₘ p(θ|y,m) p(m|y)

Model space partitioning: comparing model families
• partitioning model space M into K subsets or families: M = {f₁, ..., f_K}

• pooling information over all models in these subsets allows one to compute the probability p(fₖ|y) of a model family, given the data

• effectively removes uncertainty about any aspect of model structure other than the attribute of interest (which defines the partition)

Stephan et al. 2009, NeuroImage; Penny et al. 2010, PLoS Comput. Biol.

Family-level inference: random effects – a special case
• When the families are of equal size, one can simply sum the posterior model probabilities within families by exploiting the agglomerative property of the Dirichlet distribution:

(r₁, r₂, ..., r_K) ~ Dir(α₁, α₂, ..., α_K)

⇒ (r*₁, ..., r*_J) ~ Dir(α*₁, ..., α*_J), where r*ⱼ = Σ_{k∈fⱼ} rₖ and α*ⱼ = Σ_{k∈fⱼ} αₖ

Stephan et al. 2009, NeuroImage
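A sketch verifying this agglomeration property by sampling (the α values and the family partition are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([3.0, 1.0, 2.0, 4.0])        # K = 4 models
families = [[0, 1], [2, 3]]                   # two equal-sized families

r = rng.dirichlet(alpha, size=200_000)
r_summed = np.stack([r[:, f].sum(axis=1) for f in families], axis=1)
r_direct = rng.dirichlet([alpha[f].sum() for f in families], size=200_000)
print(r_summed.mean(axis=0), r_direct.mean(axis=0))   # both ~[0.4, 0.6]
```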
Model space partitioning: comparing model families

[Figure: FFX – summed log evidences (log GBF) for eight models (CBMN, CBMN(ε), RBMN, RBMN(ε), CBML, CBML(ε), RBML, RBML(ε)); RFX – Dirichlet α per model; posterior p(r₁|y) after partitioning into a nonlinear family (models 1–4) and a linear family (models 5–8)]

Model space partitioning: α*₁ = Σ_{k=1..4} αₖ (nonlinear models), α*₂ = Σ_{k=5..8} αₖ (linear models)

Result: ⟨r₁⟩ = 73.5% (nonlinear) vs ⟨r₂⟩ = 26.5% (linear); p(r₁ > r₂) = p(r₁ > 0.5 | y) = 98.6%

Stephan et al. 2009, NeuroImage

Bayesian Model Averaging (BMA)
• abandons the dependence of parameter inference on a single model and takes model uncertainty into account
• uses the entire model space considered (or an optimal family of models)
• averages parameter estimates, weighted by the posterior model probabilities
• represents a particularly useful alternative
– when none of the models (or model subspaces) considered clearly outperforms all others
– when comparing groups for which the optimal model differs

single-subject BMA:
p(θ|y) = Σₘ p(θ|y,m) p(m|y)

group-level BMA:
p(θₙ|y₁..N) = Σₘ p(θₙ|yₙ,m) p(m|y₁..N)

NB: p(m|y₁..N) can be obtained by either FFX or RFX BMS

Penny et al. 2010, PLoS Comput. Biol.
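A minimal single-subject BMA sketch (the posterior model probabilities and per-model posterior means are made-up numbers):

```python
import numpy as np

p_m_given_y = np.array([0.6, 0.3, 0.1])      # p(m|y) for three models
theta_mean = np.array([0.8, 0.5, -0.2])      # E[theta | y, m] per model

bma_mean = p_m_given_y @ theta_mean          # E[theta | y], averaged over models
print(bma_mean)                              # 0.61
```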
Prefrontal-parietal connectivity during working memory in schizophrenia

• 17 at-risk mental state (ARMS) individuals • 21 first-episode patients (13 non-treated) • 20 controls
Schmidt et al. 2013, JAMA Psychiatry

BMS results for all groups
Schmidt et al. 2013, JAMA Psychiatry

BMA results: PFC → PPC connectivity
17 ARMS, 21 first-episode (13 non-treated), 20 controls
Schmidt et al. 2013, JAMA Psychiatry

Protected exceedance probability: using BMA to protect against chance findings

• exceedance probabilities (EPs) express our confidence that the posterior probabilities of models are different – under the hypothesis H₁ that models differ in probability: rₖ ≠ 1/K
• this does not account for the possibility of the "null hypothesis" H₀: rₖ = 1/K
• Bayesian omnibus risk (BOR): the posterior probability of wrongly accepting H₁ over H₀

• protected EP: Bayesian model averaging over H₀ and H₁:
φ̃ₖ = φₖ (1 − BOR) + (1/K) BOR
Rigoux et al. 2014, NeuroImage

[Decision tree (Stephan et al. 2010, NeuroImage):]

definition of model space, then: inference on model structure or inference on model parameters?

• inference on model structure – on individual models or on a model space partition?
– individual models: is the optimal model structure assumed to be identical across subjects? yes → FFX BMS; no → RFX BMS
– model space partition: comparison of model families using FFX or RFX BMS

• inference on model parameters – parameters of an optimal model or parameters of all models?
– optimal model: is the optimal model structure assumed to be identical across subjects? yes → FFX BMS, then FFX analysis of parameter estimates (e.g. BPA); no → RFX BMS, then RFX analysis of parameter estimates (e.g. t-test, ANOVA)
– all models: BMA

Stephan et al. 2010, NeuroImage

Further reading
• Penny WD, Stephan KE, Mechelli A, Friston KJ (2004) Comparing dynamic causal models. NeuroImage 22: 1157-1172.
• Penny WD, Stephan KE, Daunizeau J, Joao M, Friston K, Schofield T, Leff AP (2010) Comparing families of dynamic causal models. PLoS Computational Biology 6: e1000709.
• Penny WD (2012) Comparing dynamic causal models using AIC, BIC and free energy. NeuroImage 59: 319-330.
• Rigoux L, Stephan KE, Friston KJ, Daunizeau J (2014) Bayesian model selection for group studies – revisited. NeuroImage 84: 971-985.
• Stephan KE, Weiskopf N, Drysdale PM, Robinson PA, Friston KJ (2007) Comparing hemodynamic models with DCM. NeuroImage 38: 387-401.
• Stephan KE, Penny WD, Daunizeau J, Moran RJ, Friston KJ (2009) Bayesian model selection for group studies. NeuroImage 46: 1004-1017.
• Stephan KE, Penny WD, Moran RJ, den Ouden HEM, Daunizeau J, Friston KJ (2010) Ten simple rules for dynamic causal modelling. NeuroImage 49: 3099-3109.
• Stephan KE, Iglesias S, Heinzle J, Diaconescu AO (2015) Translational perspectives for computational neuroimaging. Neuron 87: 716-732.
• Stephan KE, Schlagenhauf F, Huys QJM, Raman S, Aponte EA, Brodersen KH, Rigoux L, Moran RJ, Daunizeau J, Dolan RJ, Friston KJ, Heinz A (2017) Computational neuroimaging strategies for single patient predictions. NeuroImage 145: 180-199.

Thank you