Bayesian inference and Bayesian model selection
Klaas Enno Stephan Lecture as part of "Methods & Models for fMRI data analysis", University of Zurich & ETH Zurich, 28 November 2017
With slides from and many thanks to: Kay Brodersen, Will Penny, Sudhir Shankar Raman

Why should I know about Bayesian inference?
Because Bayesian principles are fundamental for
• statistical inference in general
• system identification
• translational neuromodeling ("computational assays")
  – computational psychiatry
  – computational neurology
• contemporary theories of brain function (the "Bayesian brain")
  – predictive coding
  – free energy principle
  – active inference

Bayes' theorem
The Reverend Thomas Bayes (1702–1761)

p(θ|y) = p(y|θ) p(θ) / p(y)

posterior = likelihood ∙ prior / evidence

"Bayes' theorem describes how an ideally rational person processes information." (Wikipedia)

Bayes' Theorem
Given data y and parameters θ, the joint probability is:

p(y, θ) = p(y|θ) p(θ) = p(θ|y) p(y)

Eliminating p(y, θ) gives Bayes' rule:

p(θ|y) = p(y|θ) p(θ) / p(y)

(posterior = likelihood ∙ prior / evidence)

Bayesian inference: an animation

Generative models
• specify a joint probability distribution over all variables (observations and parameters)
• require a likelihood function and a prior: p(y, θ | m) = p(y | θ, m) p(θ | m)
• can be used to randomly generate synthetic data (observations) by sampling from the prior – we can check in advance whether the model can explain certain phenomena at all
• model comparison based on the model evidence: p(y | m) = ∫ p(y | θ, m) p(θ | m) dθ

Principles of Bayesian inference
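As a minimal sketch of these two uses – sampling synthetic data from the prior, and approximating the model evidence by averaging the likelihood over prior samples – here is a toy linear-Gaussian model (all names and values are illustrative assumptions, not from the lecture):

```python
# Sketch: a toy linear-Gaussian generative model (illustrative values only).
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)                # design: 20 sample points

def sample_prior():
    return rng.normal(0.0, 1.0)          # prior p(theta|m) = N(0, 1)

def likelihood(y, theta, sigma=0.5):
    # p(y|theta, m): independent Gaussian noise around theta * x
    resid = y - theta * x
    return np.exp(-0.5 * np.sum((resid / sigma) ** 2)) / \
           (np.sqrt(2 * np.pi) * sigma) ** len(y)

# 1) generate synthetic data by sampling from the prior
theta_true = sample_prior()
y = theta_true * x + rng.normal(0, 0.5, size=x.shape)

# 2) Monte Carlo estimate of the model evidence:
#    p(y|m) = integral p(y|theta,m) p(theta|m) dtheta
#           ~ average of p(y|theta,m) over prior samples of theta
evidence = np.mean([likelihood(y, sample_prior()) for _ in range(5000)])
print(f"log p(y|m) ~ {np.log(evidence):.2f}")
```

This makes the quoted intuition concrete: the evidence is the average fit obtained when parameters are drawn blindly from the prior.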
Formulation of a generative model: likelihood function p(y|θ) and prior distribution p(θ)

Observation of data: measurement data y

Model inversion – updating one's beliefs:

p(θ|y) ∝ p(y|θ) p(θ)

→ maximum a posteriori (MAP) estimates, model evidence

[Figure: example of a shrinkage prior]

Priors
Priors can be of different sorts, e.g.
• empirical (previous data)
• empirical (estimated from current data using a hierarchical model → "empirical Bayes")
• uninformed
• principled (e.g., positivity constraints)
• shrinkage

Generative models
p(y | θ, m), p(θ | m) → p(θ | y, m)

Generative models:
1. enforce mechanistic thinking: how could the data have been caused?
2. generate synthetic data (observations) by sampling from the prior – can the model explain certain phenomena at all?
3. inference about parameters → p(θ|y)
4. inference about model structure → p(y|m) or p(m|y)
5. model evidence: index of model quality

A generative modelling framework for fMRI & EEG: Dynamic causal modeling (DCM)
dwMRI, EEG, MEG, fMRI

Forward model: predicting measured activity. Model inversion: estimating neuronal mechanisms.

dx/dt = f(x, u, θ)
y = g(x, θ)

Friston et al. 2003, NeuroImage; Stephan et al. 2009, NeuroImage

DCM for fMRI
Stephan et al. 2015, Neuron

Bayesian system identification

1. Design experimental inputs u(t)
2. Define likelihood model:
   neural dynamics: dx/dt = f(x, u, θ)
   observer function: y = g(x, θ)
   p(y | θ, m) = N(g(θ), Σ(θ))
3. Specify priors: p(θ | m) = N(μθ, Σθ)
4. Invert model: p(θ | y, m) = p(y | θ, m) p(θ | m) / p(y | m)
5. Make inferences:
   on model structure: p(y | m) = ∫ p(y | θ, m) p(θ | m) dθ
   on parameters: p(θ | y, m)

Why should I know about Bayesian inference?
Because Bayesian principles are fundamental for
• statistical inference in general
• system identification
• translational neuromodeling ("computational assays")
  – computational psychiatry
  – computational neurology
• contemporary theories of brain function (the "Bayesian brain")
  – predictive coding
  – free energy principle
  – active inference

Generative models as "computational assays"
p(y | θ, m), p(θ | m) → p(θ | y, m)

dx/dt = f(x, u, θ)

Translational neuromodeling – computational assays: models of disease mechanisms, applied to brain activity and behaviour of individual patients. Goals:
• individual treatment prediction
• detecting physiological subgroups based on inferred mechanisms (e.g. disease mechanisms A, B, C)

Stephan et al. 2015, Neuron

Differential diagnosis based on generative models of disease symptoms
SYMPTOM y (behaviour or physiology)

HYPOTHETICAL MECHANISMS m1 … mk … mK: generative models p(y | θ, mk), inverted to yield p(mk | y):

p(mk | y) = p(y | mk) p(mk) / Σk′ p(y | mk′) p(mk′)
Stephan et al. 2016, NeuroImage

Why should I know about Bayesian inference?
Because Bayesian principles are fundamental for
• statistical inference in general
• system identification
• translational neuromodeling ("computational assays")
  – computational psychiatry
  – computational neurology
• contemporary theories of brain function (the "Bayesian brain")
  – predictive coding
  – free energy principle
  – active inference

Perception = inversion of a hierarchical generative model
Hidden causes: environmental states, neuronal states, others' mental states, bodily states

forward model: p(y | x, m) p(x | m)
perception (model inversion): p(x | y, m)

Example: free-energy principle and active inference
Prediction error = sensations – predictions

Action: change sensory input. Perception: change predictions.

Maximizing the evidence (of the brain's generative model) = minimizing the surprise about the data (sensory inputs).

Friston et al. 2006, J Physiol Paris

How is the posterior computed = how is a generative model inverted?
Bayesian inference: either analytical solutions, or approximate inference – variational Bayes or MCMC sampling.

How is the posterior computed = how is a generative model inverted?
• compute the posterior analytically
  – requires conjugate priors
  – even then, often difficult to derive an analytical solution
• variational Bayes (VB)
  – often hard work to derive, but fast to compute
  – caveat: local minima, potentially inaccurate approximations
• sampling methods (MCMC)
  – guaranteed to be accurate in theory (for infinite computation time)
  – but may require very long run times in practice
  – convergence difficult to prove

Conjugate priors
If the posterior p(θ|y) is in the same family as the prior p(θ), the prior and posterior are called "conjugate distributions", and the prior is a "conjugate prior" for the likelihood function:

p(θ | y) = p(y | θ) p(θ) / p(y), with prior and posterior of the same functional form

→ analytical expression for the posterior. Examples (likelihood–prior):
• Normal–Normal
• Normal–inverse Gamma
• Binomial–Beta
• Multinomial–Dirichlet

Posterior mean & variance of univariate Gaussians
Likelihood: p(y | θ) = N(y; θ, σe²)
Prior: p(θ) = N(θ; μp, σp²)

Posterior: p(θ | y) = N(θ; μ, σ²), with

1/σ² = 1/σe² + 1/σp²
μ = σ² (μp/σp² + y/σe²)

Posterior mean = variance-weighted combination of prior mean and data mean

Same thing – but expressed as precision weighting
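The two equivalent forms of this Gaussian update – variance weighting and precision weighting – can be checked numerically. A quick sketch with assumed example values:

```python
# Univariate Gaussian posterior update, in variance and precision form
# (example values assumed for illustration).
import numpy as np

mu_p, sigma2_p = 0.0, 4.0     # prior mean and variance
y, sigma2_e = 2.0, 1.0        # one observation and its noise variance

# variance form: 1/sigma2 = 1/sigma2_e + 1/sigma2_p
sigma2_post = 1.0 / (1.0 / sigma2_e + 1.0 / sigma2_p)
mu_post = sigma2_post * (y / sigma2_e + mu_p / sigma2_p)

# precision form: lambda = 1/sigma2; the posterior mean is a
# precision-weighted average of the data and the prior mean
lam_e, lam_p = 1.0 / sigma2_e, 1.0 / sigma2_p
mu_post2 = (lam_e * y + lam_p * mu_p) / (lam_e + lam_p)

print(mu_post, mu_post2)      # identical; lies between y and mu_p
assert np.isclose(mu_post, mu_post2)
```

Because the likelihood is four times more precise than the prior here, the posterior mean sits much closer to the data than to the prior mean.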
Likelihood: p(y | θ) = N(y; θ, λe⁻¹)
Prior: p(θ) = N(θ; μp, λp⁻¹)

Posterior: p(θ | y) = N(θ; μ, λ⁻¹), with

λ = λe + λp
μ = (λe/(λe+λp)) y + (λp/(λe+λp)) μp

i.e., relative precision weighting of data and prior.

Variational Bayes (VB)
Idea: find an approximate density q(θ) that is maximally similar to the true posterior p(θ|y). This is often done by assuming a particular form for q (fixed-form VB) and then optimizing its sufficient statistics.

[Figure: the true posterior p(θ|y) and the best proxy q(θ) within a hypothesis class, separated by the divergence KL[q‖p].]

Kullback–Leibler (KL) divergence
• asymmetric measure of the difference between two probability distributions P and Q
• interpretations of DKL(P‖Q):
  – "Bayesian surprise" when Q = prior, P = posterior: measure of the information gained when one updates one's prior beliefs to the posterior P
  – a measure of the information lost when Q is used to approximate P
• non-negative: DKL(P‖Q) ≥ 0 (zero iff P = Q)

Variational calculus
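The asymmetry and non-negativity of DKL above are easy to verify numerically for discrete distributions (example probabilities assumed):

```python
# KL divergence between two discrete distributions (example values assumed).
import numpy as np

def kl(p, q):
    # D_KL(P || Q) = sum_i p_i * log(p_i / q_i)
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log(p / q))

P = [0.5, 0.4, 0.1]
Q = [1/3, 1/3, 1/3]

print(kl(P, Q))   # > 0: P differs from Q
print(kl(Q, P))   # a different value: KL is asymmetric
print(kl(P, P))   # 0: zero iff the distributions are equal
```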
Standard calculus (Newton, Leibniz, and others) deals with functions f: x ↦ f(x) and their derivatives df/dx. Example: maximize the likelihood expression p(y|θ) w.r.t. θ.

Variational calculus (Euler, Lagrange, and others) deals with functionals F: f ↦ F[f] and their derivatives dF/df. Example: maximize the entropy H[p] w.r.t. a probability distribution p(x).

Leonhard Euler (1707–1783), Swiss mathematician: 'Elementa Calculi Variationum'

Variational Bayes
ln p(y) = KL[q‖p] + F(q, y)

with the divergence KL[q‖p] ≥ 0 (unknown) and the negative free energy F(q, y) (easy to evaluate for a given q). F is therefore a lower bound on the log model evidence: ln p(y) ≥ F(q, y).

F is a functional of the approximate posterior q(θ). Maximizing F(q, y) is equivalent to:
• minimizing KL[q‖p]
• tightening F(q, y) as a lower bound to the log model evidence

When F(q, y) is maximized (iterating from initialization to convergence), q(θ) is our best estimate of the posterior.

Derivation of the (negative) free energy approximation
• see whiteboard (or the Appendix to Stephan et al. 2007, NeuroImage 38: 387-401)

Mean field assumption
Factorize the approximate posterior q(θ) into independent partitions:

q(θ) = ∏i qi(θi)

where qi(θi) is the approximate posterior for the i-th subset of parameters. For example, split parameters and hyperparameters:

p(θ, λ | y) ≈ q(θ, λ) = q(θ) q(λ)

Jean Daunizeau, www.fil.ion.ucl.ac.uk/~jdaunize/presentations/Bayes2.pdf

VB in a nutshell (under mean-field approximation)
Neg. free energy approximation to the model evidence:

ln p(y|m) = F + KL[q(θ, λ), p(θ, λ | y, m)]
F = ⟨ln p(y | θ, λ)⟩q − KL[q(θ, λ), p(θ, λ | m)]

Mean-field approximation:

p(θ, λ | y) ≈ q(θ, λ) = q(θ) q(λ)

Maximise the neg. free energy w.r.t. q (= minimise the divergence) by maximising the variational energies:

q(θ) ∝ exp(I(θ)) = exp(⟨ln p(y, θ, λ)⟩q(λ))
q(λ) ∝ exp(I(λ)) = exp(⟨ln p(y, θ, λ)⟩q(θ))

Iterative updating of the sufficient statistics of the approximate posteriors by gradient ascent.

Model comparison and selection
Given competing hypotheses on structure & functional mechanisms of a system, which model is the best?
Which model represents the best balance between model fit and model complexity?
For which model m does p(y|m) become maximal?

Pitt & Myung (2002) TICS

Bayesian model selection (BMS)
Model evidence (marginal likelihood):

p(y | m) = ∫ p(y | θ, m) p(θ | m) dθ

The evidence accounts for both accuracy and complexity of the model, considered across all possible datasets. Intuition: "If I randomly sampled from my prior and plugged the resulting value into the likelihood function, how close would the predicted data be – on average – to my observed data?"

Various approximations, e.g.: negative free energy, AIC, BIC.

Ghahramani 2004; MacKay 1992, Neural Comput.; Penny et al. 2004a, NeuroImage

Model space (hypothesis set) M
Model space M is defined by a prior on models. Usual choice: flat prior over a small set of models:

p(m) = 1/|M| if m ∈ M, 0 otherwise

In this case, the posterior probability of model i is:

p(mi | y) = p(y | mi) p(mi) / Σj=1..M p(y | mj) p(mj) = p(y | mi) / Σj=1..M p(y | mj)

Differential diagnosis based on generative models of disease symptoms
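Under a flat model prior, posterior model probabilities follow directly from the log evidences. A sketch using the log-sum-exp trick for numerical stability (the log-evidence values are assumed for illustration):

```python
# Posterior model probabilities under a flat prior over models,
# computed from log evidences (example values assumed).
import numpy as np

log_ev = np.array([-1210.5, -1213.2, -1209.8])     # log p(y|m_i), e.g. from F

def model_posterior(log_ev):
    # flat prior cancels: p(m_i|y) = p(y|m_i) / sum_j p(y|m_j);
    # subtracting the max before exponentiating avoids underflow
    z = log_ev - log_ev.max()
    w = np.exp(z)
    return w / w.sum()

post = model_posterior(log_ev)
print(post.round(3))               # the third model has the highest posterior
assert np.isclose(post.sum(), 1.0)
```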
Approximations to the model evidence
The logarithm is a monotonic function, so maximizing the log model evidence = maximizing the model evidence.

Log model evidence = balance between fit and complexity:

log p(y|m) = accuracy(m) − complexity(m) = log p(y|θ̂, m) − complexity(m)

Akaike Information Criterion: AIC = log p(y|θ̂, m) − p  (p = number of parameters)
Bayesian Information Criterion: BIC = log p(y|θ̂, m) − (p/2) log N  (N = number of data points)

The (negative) free energy approximation F
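A quick numeric sketch of the two criteria above (the log likelihood, p and N are assumed example values):

```python
# AIC and BIC as rough approximations to the log model evidence
# (illustrative numbers, not from the lecture).
import numpy as np

log_lik = -250.0     # maximized log likelihood, log p(y|theta_hat, m)
p = 6                # number of parameters
N = 120              # number of data points

AIC = log_lik - p                      # penalty: one unit per parameter
BIC = log_lik - 0.5 * p * np.log(N)    # penalty grows with log N

print(AIC, BIC)      # here BIC penalizes complexity more, since log(N)/2 > 1
```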
F is a lower bound on the log model evidence:

log p(y|m) = F + KL[q(θ), p(θ | y, m)]

Like AIC/BIC, F is an accuracy/complexity tradeoff:

F = ⟨log p(y | θ, m)⟩q − KL[q(θ), p(θ | m)]
  = accuracy − complexity

The (negative) free energy approximation
• The log evidence is thus the expected log likelihood (w.r.t. q) plus two KL terms:

log p(y|m) = ⟨log p(y | θ, m)⟩q − KL[q(θ), p(θ | m)] + KL[q(θ), p(θ | y, m)]

F = log p(y|m) − KL[q(θ), p(θ | y, m)] = ⟨log p(y | θ, m)⟩q − KL[q(θ), p(θ | m)]
  = accuracy − complexity

The complexity term in F
• In contrast to AIC & BIC, the complexity term of the negative free energy F accounts for parameter interdependencies. Under Gaussian assumptions about the posterior (Laplace approximation):

KL[q(θ), p(θ|m)] = ½ ln|Cθ| − ½ ln|Cθ|y| + ½ (μθ|y − μθ)ᵀ Cθ⁻¹ (μθ|y − μθ)

• The complexity term of F is higher
  – the more independent the prior parameters (→ effective DFs)
  – the more dependent the posterior parameters
  – the more the posterior mean deviates from the prior mean

Bayes factors
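This complexity term can be evaluated directly for Gaussian prior and posterior. The sketch below uses the full Gaussian KL expression (which also carries a trace term alongside the determinant and mean-deviation terms); the example means and covariances are assumptions:

```python
# Complexity as a KL divergence between Gaussian posterior and prior
# (full Gaussian KL formula; example values assumed).
import numpy as np

def kl_gauss(mu_q, C_q, mu_p, C_p):
    # KL[ N(mu_q, C_q) || N(mu_p, C_p) ]
    d = len(mu_q)
    iCp = np.linalg.inv(C_p)
    dm = mu_q - mu_p
    return 0.5 * (np.trace(iCp @ C_q) + dm @ iCp @ dm - d
                  + np.log(np.linalg.det(C_p) / np.linalg.det(C_q)))

mu_prior = np.zeros(2)
C_prior = np.eye(2)

# the further the posterior mean moves from the prior mean,
# the larger the complexity penalty
lo = kl_gauss(np.array([0.1, 0.0]), 0.5 * np.eye(2), mu_prior, C_prior)
hi = kl_gauss(np.array([2.0, 0.0]), 0.5 * np.eye(2), mu_prior, C_prior)
print(lo, hi)
assert hi > lo
```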
To compare two models, we could just compare their log evidences. But: the log evidence is just some number – not very intuitive! A more intuitive interpretation of model comparisons is made possible by Bayes factors:

B12 = p(y | m1) / p(y | m2), a positive value in [0; ∞[

Kass & Raftery classification:

B12        p(m1|y)   Evidence
1 to 3     50–75%    weak
3 to 20    75–95%    positive
20 to 150  95–99%    strong
≥ 150      ≥ 99%     very strong

Kass & Raftery 1995, J. Am. Stat. Assoc.

Fixed effects BMS at group level
Group Bayes factor (GBF) for 1…K subjects:

GBFij = ∏k BFij(k)

Average Bayes factor (ABF):

ABFij = (∏k BFij(k))^(1/K)

Problems:
- blind with regard to group heterogeneity
- sensitive to outliers

Random effects BMS for heterogeneous groups
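In log space the GBF is a sum and the ABF an average of subject-wise log Bayes factors, which makes the outlier sensitivity easy to demonstrate (the log Bayes factors below are assumed example values):

```python
# Group and average Bayes factors from per-subject log Bayes factors
# (illustrative numbers). Note how a single outlier dominates both.
import numpy as np

# log BF(k) = log p(y_k|m1) - log p(y_k|m2), for K = 5 subjects
log_bf = np.array([1.2, 0.8, 1.5, 1.0, -8.0])   # subject 5 is an outlier

log_gbf = log_bf.sum()     # log GBF: product of subject BFs
log_abf = log_bf.mean()    # log ABF: K-th root of the GBF

print(log_gbf, log_abf)    # both negative, i.e. favoring m2,
                           # although 4 of 5 subjects favor m1
```

This is exactly the failure mode that motivates the random effects approach on the next slides.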
[Graphical model:]
r ~ Dir(r; α) — Dirichlet distribution of model probabilities r; the Dirichlet parameters α = "occurrences" of models in the population
mn ~ Mult(m; 1, r) — multinomial distribution of model labels m
yn ~ p(yn | mn) — measured data; model inversion by variational Bayes or MCMC

Stephan et al. 2009, NeuroImage

Random effects BMS for heterogeneous groups
[The same model in plate notation: models k = 1…K, subjects n = 1…N.]

Random effects BMS
p(r | α) = Dir(r; α) = (1/Z(α)) ∏k rk^(αk−1)

p(mn | r) = ∏k rk^(mnk)

p(yn | mnk) = ∫ p(yn | θnk) p(θnk | mnk) dθnk

Stephan et al. 2009, NeuroImage

Write down the joint probability and take the log
p(y, r, m | α0) = p(y | m) p(m | r) p(r | α0)

= (1/Z(α0)) ∏k rk^(α0k−1) · ∏n ∏k [p(yn | mnk) rk]^(mnk)

ln p(y, r, m | α0) = −ln Z(α0) + Σk (α0k − 1) ln rk + Σn Σk mnk [ln p(yn | mnk) + ln rk]

Mean-field approximation: q(r, m) = q(r) q(m)
Maximise the neg. free energy w.r.t. q (= minimise the divergence) by maximising the variational energies:

q(r) ∝ exp(I(r)),  I(r) = ⟨ln p(y, r, m)⟩q(m)
q(m) ∝ exp(I(m)),  I(m) = ⟨ln p(y, r, m)⟩q(r)

Iterative updating of the sufficient statistics of the approximate posteriors:

α = α0 = [1, …, 1]
Until convergence:
  unk = exp( ln p(yn | mnk) + Ψ(αk) − Ψ(Σk′ αk′) )
  gnk = unk / Σk′ unk′  ← our (normalized) posterior belief that model k generated the data from subject n
  βk = Σn gnk           ← expected number of subjects whose data we believe were generated by model k
  α = α0 + β
end

Four equivalent options for reporting model ranking by random effects BMS
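The iterative scheme above can be sketched in a few lines (update equations after Stephan et al. 2009; the log evidences below are made-up example numbers):

```python
# Sketch of variational RFX BMS (Stephan et al. 2009); example data assumed.
import numpy as np
from scipy.special import digamma

def rfx_bms(lme, n_iter=100):
    """lme: (N subjects x K models) array of log model evidences."""
    N, K = lme.shape
    alpha0 = np.ones(K)                  # flat Dirichlet prior
    alpha = alpha0.copy()
    for _ in range(n_iter):
        # variational energy: ln p(y_n|m_k) + Psi(alpha_k) - Psi(sum alpha)
        log_u = lme + digamma(alpha) - digamma(alpha.sum())
        log_u -= log_u.max(axis=1, keepdims=True)   # numerical stability
        g = np.exp(log_u)
        g /= g.sum(axis=1, keepdims=True)           # posterior model labels
        alpha = alpha0 + g.sum(axis=0)              # update Dirichlet counts
    return alpha, g

# 10 subjects, 2 models: most subjects favor model 1
rng = np.random.default_rng(1)
lme = np.column_stack([rng.normal(3, 1, 10), rng.normal(0, 1, 10)])
alpha, g = rfx_bms(lme)
print(alpha)                   # Dirichlet "counts" per model
print(alpha / alpha.sum())     # expected posterior model probabilities
```

The returned α and g feed directly into the reporting options listed on this slide.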
1. Dirichlet parameter estimates α
2. expected posterior probability of obtaining the k-th model for any randomly selected subject: ⟨rk⟩ = αk / Σk′ αk′
3. exceedance probability that a particular model k is more likely than any other model (of the K models tested), given the group data: φk = p(∀j ∈ {1…K}, j ≠ k: rk > rj | y; α)
4. protected exceedance probability: see below

Example: Hemispheric interactions during vision
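Exceedance probabilities have no simple closed form for K > 2, but they are easy to obtain by sampling from the Dirichlet posterior. A sketch, using the α values that appear in the example below (α = [11.8, 2.2]):

```python
# Exceedance probabilities by Monte Carlo sampling from the Dirichlet
# posterior (alpha values taken from the example in the lecture).
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([11.8, 2.2])            # posterior Dirichlet parameters

r = rng.dirichlet(alpha, size=100_000)   # samples of model probabilities r
# xp[k] = fraction of samples in which model k has the largest r_k
xp = (r.argmax(axis=1)[:, None] == np.arange(len(alpha))).mean(axis=0)
print(xp)    # xp[0] should be close to the 99.7% reported in the example
```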
[Figure: two DCMs (m1, m2) of interhemispheric interactions between visual areas LG, MOG and FG, with RVF/LVF visual stimulation as driving inputs. The models differ in whether interhemispheric connections are modulated by the letter decision task conditional on the visual field (LD|LVF, LD|RVF) or by the letter decision (LD) per se; BMS compares m1 and m2 across subjects.]

Data: Stephan et al. 2003, Science
Models: Stephan et al. 2007, J. Neurosci.
[Figure: subject-wise log model evidence differences between m1 and m2. Random effects BMS yields Dirichlet parameters α1 = 11.8, α2 = 2.2, expected posterior probabilities ⟨r1⟩ = 84.3%, ⟨r2⟩ = 15.7%, and exceedance probability p(r1 > r2 | y) = 99.7% (i.e. p(r1 > 0.5 | y) = 0.997).]

Stephan et al. 2009a, NeuroImage

Example: Synaesthesia
• “projectors” experience color externally colocalized with a presented grapheme
• “associators” report an internally evoked association
• across all subjects: no evidence for either model
• but BMS results map precisely onto projectors (bottom-up mechanisms) and associators (top-down)
van Leeuwen et al. 2011, J. Neurosci.

Protected exceedance probability: using BMA to protect against chance findings
• EPs express our confidence that the posterior model probabilities are different – under the hypothesis H1 that models differ in probability: rk ≠ 1/K
• they do not account for the possibility of the "null hypothesis" H0: rk = 1/K
• Bayesian omnibus risk (BOR): the risk of wrongly accepting H1 over H0
• protected EP: Bayesian model averaging over H0 and H1

Rigoux et al. 2014, NeuroImage

definition of model space
→ inference on model structure, or inference on model parameters?

Inference on model structure: on individual models, or on a model space partition (comparison of model families using FFX or RFX BMS)?
  – optimal model structure assumed to be identical across subjects? yes → FFX BMS; no → RFX BMS

Inference on model parameters: parameters of an optimal model, or parameters of all models (→ BMA)?
  – optimal model structure assumed to be identical across subjects? yes → FFX analysis of parameter estimates (e.g. BPA); no → RFX analysis of parameter estimates (e.g. t-test, ANOVA)

Stephan et al. 2010, NeuroImage

Further reading
• Penny WD, Stephan KE, Mechelli A, Friston KJ (2004) Comparing dynamic causal models. NeuroImage 22: 1157-1172.
• Penny WD, Stephan KE, Daunizeau J, Joao M, Friston K, Schofield T, Leff AP (2010) Comparing Families of Dynamic Causal Models. PLoS Computational Biology 6: e1000709.
• Penny WD (2012) Comparing dynamic causal models using AIC, BIC and free energy. NeuroImage 59: 319-330.
• Rigoux L, Stephan KE, Friston KJ, Daunizeau J (2014) Bayesian model selection for group studies – revisited. NeuroImage 84: 971-985.
• Stephan KE, Weiskopf N, Drysdale PM, Robinson PA, Friston KJ (2007) Comparing hemodynamic models with DCM. NeuroImage 38: 387-401.
• Stephan KE, Penny WD, Daunizeau J, Moran RJ, Friston KJ (2009) Bayesian model selection for group studies. NeuroImage 46: 1004-1017.
• Stephan KE, Penny WD, Moran RJ, den Ouden HEM, Daunizeau J, Friston KJ (2010) Ten simple rules for Dynamic Causal Modelling. NeuroImage 49: 3099-3109.
• Stephan KE, Iglesias S, Heinzle J, Diaconescu AO (2015) Translational Perspectives for Computational Neuroimaging. Neuron 87: 716-732.
• Stephan KE, Schlagenhauf F, Huys QJM, Raman S, Aponte EA, Brodersen KH, Rigoux L, Moran RJ, Daunizeau J, Dolan RJ, Friston KJ, Heinz A (2017) Computational Neuroimaging Strategies for Single Patient Predictions. NeuroImage 145: 180-199.

Thank you