Bayesian inference and Bayesian model selection
Klaas Enno Stephan Lecture as part of "Methods & Models for fMRI data analysis", University of Zurich & ETH Zurich, 27 November 2018
With slides from and many thanks to: Kay Brodersen, Will Penny, Sudhir Shankar Raman

Why should I know about Bayesian inference?
Because Bayesian principles are fundamental for
• statistical inference in general
• system identification
• translational neuromodeling ("computational assays") – computational psychiatry – computational neurology – computational psychosomatics
• contemporary theories of brain function (the "Bayesian brain") – predictive coding – free energy principle – active inference

Bayes' theorem
The Reverend Thomas Bayes (1702-1761)

p(θ|y) = p(y|θ) p(θ) / p(y)

"... the theorem expresses how a ... degree of belief should rationally change to account for availability of related evidence." (Wikipedia)

posterior = likelihood ∙ prior / evidence

Bayesian inference: an animation

The evidence term
continuous case: p(θ|y) = p(y|θ) p(θ) / ∫ p(y|θ′) p(θ′) dθ′

discrete case: p(θ|y) = p(y|θ) p(θ) / Σθ′ p(y|θ′) p(θ′)

Bayesian inference: An example (with fictitious probabilities)
• symptom y: y=1: fever, y=0: no fever
• disease θ: θ=1: Ebola, θ=0: any other disease (AOD)

Likelihood p(y|θ):
                 θ=1 (Ebola)   θ=0 (AOD)
y=1 (fever)         99.99%        20%
y=0 (no fever)       0.01%        80%

• A priori: p(Ebola) = 10⁻⁶, p(AOD) = 1 − 10⁻⁶

• A patient presents with fever. What is the probability that he/she has Ebola?

p(θ=1 | y=1) = p(y=1 | θ=1) p(θ=1) / Σ_{j=0,1} p(y=1 | θ=j) p(θ=j)
             = (0.9999 × 10⁻⁶) / (0.9999 × 10⁻⁶ + 0.2 × (1 − 10⁻⁶))
             ≈ 5 × 10⁻⁶
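The arithmetic above is easy to check in a few lines; a minimal sketch with the same fictitious probabilities:

```python
# Bayes' theorem for the (fictitious) Ebola example above.
p_fever_given_ebola = 0.9999   # p(y=1 | theta=1)
p_fever_given_aod = 0.20       # p(y=1 | theta=0)
p_ebola = 1e-6                 # prior p(theta=1)

evidence = p_fever_given_ebola * p_ebola + p_fever_given_aod * (1 - p_ebola)
posterior = p_fever_given_ebola * p_ebola / evidence
print(posterior)               # ~5e-06: the tiny prior dominates the likelihood
```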
Generative models

likelihood p(y|θ,m), prior p(θ|m)

Generative models:
1. specify the joint probability over data (observations) and parameters
2. enforce mechanistic thinking: how could the data have been caused?
3. generate synthetic data (observations) by sampling from the prior – can the model explain certain phenomena at all?
4. allow inference about parameters → p(θ|y)
5. provide the model evidence p(y|m): an index of model quality
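As an illustration of point 3, a minimal sketch of sampling synthetic data from a generative model — here a hypothetical linear model y = θx + noise with a Gaussian prior on θ; the model and all numbers are assumptions for illustration:

```python
# Generate synthetic data by first sampling parameters from the prior,
# then sampling observations from the likelihood.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)

theta = rng.normal(loc=0.0, scale=1.0)         # draw theta from the prior
y_synth = theta * x + rng.normal(0, 0.1, 50)   # draw data from the likelihood
```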
Bayesian inference in practice

Formulation of a generative model:
– likelihood function p(y|θ)
– prior distribution p(θ)

Observation of data: measurement data y

Model inversion – updating one's beliefs:
p(θ|y) = p(y|θ) p(θ) / p(y), i.e. posterior = likelihood ∙ prior / model evidence

Example of a shrinkage prior

Priors

• Objective priors: – "non-informative" priors – objective constraints (e.g., non-negativity)
• Subjective priors: – subjective but not arbitrary – can express beliefs that result from understanding of the problem or system – can be the result of previous empirical findings
• Shrinkage priors: – emphasize regularization and sparsity
• Empirical priors: – learn parameters of prior distributions from the data ("empirical Bayes") – rest on a hierarchical model

A generative modelling framework for fMRI & EEG: Dynamic causal modeling (DCM)
[Figure: measurement modalities – dwMRI, EEG/MEG, fMRI]

Forward model: predicting measured activity from neuronal parameters
Model inversion: inferring neuronal parameters from measured activity

dx/dt = f(x, u, θ)
y = g(x, θ)

Friston et al. 2003, NeuroImage; Stephan et al. 2009, NeuroImage

DCM for fMRI
Stephan et al. 2015, Neuron

[Figure: simulated neural population activity in regions x1, x2, x3 driven by inputs u1 and u2, and the corresponding fMRI signal changes (%); plot details omitted]

Nonlinear dynamic causal model for fMRI:

dx/dt = (A + Σᵢ₌₁..ₘ uᵢ B⁽ⁱ⁾ + Σⱼ₌₁..ₙ xⱼ D⁽ʲ⁾) x + C u

Stephan et al. 2008, NeuroImage

Bayesian system identification
Design experimental inputs: u(t)

Define likelihood model:
– neural dynamics: dx/dt = f(x, u, θ)
– observer function: y = g(x, θ)
– p(y|θ,m) = N(g(θ), Σ(λ))

Specify priors: p(θ|m) = N(μθ, Σθ)

Invert model: p(θ|y,m) = p(y|θ,m) p(θ|m) / p(y|m)

Make inferences:
– on model structure, via the model evidence p(y|m) = ∫ p(y|θ,m) p(θ|m) dθ
– on parameters, via the posterior p(θ|y,m)
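A minimal forward-model sketch in this spirit: a linear two-region system dx/dt = Ax + Cu integrated by forward Euler, with an identity observer standing in for the hemodynamic function g; the coupling values and input are made up for illustration:

```python
import numpy as np

dt, T = 0.01, 10.0
t = np.arange(0.0, T, dt)
A = np.array([[-1.0, 0.0],
              [0.5, -1.0]])          # intrinsic coupling (assumed values)
C = np.array([1.0, 0.0])             # driving-input weights (input -> region 1)
u = (t < 2.0).astype(float)          # brief box-car input

x = np.zeros((2, t.size))
for k in range(t.size - 1):
    dx = A @ x[:, k] + C * u[k]      # dx/dt = A x + C u
    x[:, k + 1] = x[:, k] + dt * dx  # forward Euler step
y = x                                # identity observer (real DCM: y = g(x, theta))
```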
Why should I know about Bayesian inference?

Because Bayesian principles are fundamental for
• statistical inference in general
• system identification
• translational neuromodeling ("computational assays") – computational psychiatry – computational neurology – computational psychosomatics
• contemporary theories of brain function (the "Bayesian brain") – predictive coding – free energy principle – active inference

Generative models as "computational assays"
Computational assays – models of disease mechanisms, e.g. dx/dt = f(x, u, θ), with likelihood p(y|θ,m) and prior p(θ|m)

Translational neuromodeling:
– application to brain activity and behaviour of individual patients
– detecting physiological subgroups (based on inferred mechanisms, e.g. disease mechanisms A, B, C)
– individual treatment prediction

Stephan et al. 2015, Neuron

Generative embedding (supervised)
step 1 – model inversion: measurements from an individual subject → subject-specific inverted generative model
step 2 – kernel construction: subject representation in the generative score space (e.g., model parameters such as connections A→B, A→C, B→B, B→C)
step 3 – support vector classification: separating hyperplane fitted to discriminate between groups
step 4 – interpretation: jointly discriminative model parameters

Brodersen et al. 2011, PLoS Comput. Biol.

Generative embedding (unsupervised)
Brodersen et al. 2014, NeuroImage

Differential diagnosis by model selection
SYMPTOM y (behaviour or physiology)

Hypothetical mechanisms m₁ ... mₖ ... m_K, each with likelihood p(y|mₖ) and posterior p(mₖ|y):

p(mₖ|y) = p(y|mₖ) p(mₖ) / Σₖ p(y|mₖ) p(mₖ)

Stephan et al. 2017, NeuroImage

Why should I know about Bayesian inference?
Because Bayesian principles are fundamental for
• statistical inference in general
• system identification
• translational neuromodeling ("computational assays") – computational psychiatry – computational neurology – computational psychosomatics
• contemporary theories of brain function (the "Bayesian brain") – predictive coding – free energy principle – active inference

Perception = inversion of a hierarchical generative model
[Figure: environmental states, others' mental states, and bodily states are represented by neuronal states]

forward model: p(y|x,m) p(x|m)
perception = model inversion: p(x|y,m)

Free energy principle: predictive coding & active inference
prediction error = sensations – predictions

Action: change the sensory input
Perception: change the predictions

Maximizing the evidence (of the brain's generative model) = minimizing the surprise about the data (sensory inputs).

Friston et al. 2006, J Physiol Paris

How is the posterior computed = how is a generative model inverted?
Bayesian inference:
– analytical solutions
– approximate inference: variational Bayes, MCMC sampling

How is the posterior computed = how is a generative model inverted?
• compute the posterior analytically – requires conjugate priors
• variational Bayes (VB) – often hard work to derive, but fast to compute – uses approximations (approx. posterior, mean field) – problems: local minima, potentially inaccurate approximations
• Sampling: Markov Chain Monte Carlo (MCMC) – theoretically guaranteed to be accurate (for infinite computation time) – problems: may require very long run time in practice, convergence difficult to prove

Conjugate priors
• for a given likelihood function, the choice of prior determines the algebraic form of the posterior
• for some probability distributions a prior can be found such that the posterior has the same algebraic form as the prior
• such a prior is called “conjugate” to the likelihood
• examples (prior × likelihood → posterior of the same form):
– Normal × Normal → Normal
– Beta × Binomial → Beta
– Dirichlet × Multinomial → Dirichlet

p(θ|y) ∝ p(y|θ) p(θ): prior and posterior have the same algebraic form
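A minimal sketch of conjugacy for the Beta × Binomial case; prior hyperparameters and data are made-up values:

```python
# Beta prior x Binomial likelihood -> Beta posterior (same algebraic form),
# with a closed-form update of the hyperparameters.
a, b = 2.0, 2.0                        # Beta(a, b) prior on success probability
k, n = 7, 10                           # data: k successes in n trials

a_post, b_post = a + k, b + (n - k)    # posterior is Beta(a + k, b + n - k)
post_mean = a_post / (a_post + b_post)
print(post_mean)                       # ~0.64: pulled from MLE 0.7 toward prior mean 0.5
```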
Posterior mean & variance of univariate Gaussians

Likelihood and prior:
p(y|θ) = N(θ, σₑ²)
p(θ) = N(ηₚ, σₚ²)

Posterior: p(θ|y) = N(η, σ²), with

1/σ² = 1/σₑ² + 1/σₚ²
η = σ² (ηₚ/σₚ² + y/σₑ²)

Posterior mean = variance-weighted combination of prior mean and data mean

Same thing – but expressed as precision weighting
Likelihood and prior:
p(y|θ) = N(θ, λₑ⁻¹)
p(θ) = N(ηₚ, λₚ⁻¹)

Posterior: p(θ|y) = N(η, λ⁻¹), with

λ = λₑ + λₚ
η = (λₑ/λ) y + (λₚ/λ) ηₚ

Relative precision weighting of data and prior mean.
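A minimal numeric sketch of this precision weighting (all values made up):

```python
# Precision-weighted Gaussian posterior: precisions add, and the posterior
# mean is a precision-weighted average of the data and the prior mean.
lam_e, lam_p = 4.0, 1.0     # likelihood (data) and prior precisions
y, eta_p = 2.0, 0.0         # observed value and prior mean

lam = lam_e + lam_p
eta = (lam_e / lam) * y + (lam_p / lam) * eta_p
print(eta, 1.0 / lam)       # mean 1.6 (closer to the more precise data), var 0.2
```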
Variational Bayes (VB)

Idea: find an approximate density q(θ) that is maximally similar to the true posterior p(θ|y). This is often done by assuming a particular form for q (fixed-form VB) and then optimizing its sufficient statistics.

[Figure: the true posterior p(θ|y), the hypothesis class of candidate densities, and the best proxy q(θ) within it, separated by the divergence KL(q‖p)]

Kullback–Leibler (KL) divergence
• asymmetric measure of the difference between two probability distributions P and Q
• Interpretations of DKL(P‖Q): – "Bayesian surprise" when Q=prior, P=posterior: measure of the information gained when one updates one's prior beliefs to the posterior P – a measure of the information lost when Q is used to approximate P
• non-negative: DKL(P‖Q) ≥ 0, with equality iff P = Q
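A minimal sketch of the KL divergence for two discrete distributions (made-up probabilities), illustrating its asymmetry and non-negativity:

```python
import numpy as np

P = np.array([0.7, 0.2, 0.1])
Q = np.array([0.5, 0.3, 0.2])

kl_pq = np.sum(P * np.log(P / Q))   # D_KL(P || Q)
kl_qp = np.sum(Q * np.log(Q / P))   # D_KL(Q || P)
print(kl_pq, kl_qp)                 # both >= 0, and generally unequal (asymmetric)
```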
Variational calculus

Standard calculus (Newton, Leibniz, and others):
• functions f: x ↦ f(x)
• derivatives df/dx
• example: maximize the likelihood expression p(y|θ) w.r.t. θ

Variational calculus (Euler, Lagrange, and others):
• functionals F: f ↦ F[f]
• derivatives dF/df
• example: maximize the entropy H[p] w.r.t. a probability distribution p(x)

Leonhard Euler (1707–1783), Swiss mathematician ('Elementa Calculi Variationum')

Variational Bayes
ln p(y) = KL[q‖p] + F(q, y)

• KL[q‖p]: divergence between q(θ) and the true posterior p(θ|y); ≥ 0 and unknown
• F(q, y): negative free energy; easy to evaluate for a given q

F is a functional of the approximate posterior q(θ). Maximizing F(q, y) is equivalent to:
• minimizing KL[q‖p]
• tightening F(q, y) as a lower bound to the log model evidence

When F(q, y) is maximized (iterating from initialization to convergence), q(θ) is our best estimate of the posterior.

Derivation of the (negative) free energy approximation
• See whiteboard! • (or the Appendix of Stephan et al. 2007, NeuroImage 38: 387-401)

Mean field assumption
Factorize the approximate posterior q(θ) into independent partitions:

q(θ) = ∏ᵢ qᵢ(θᵢ)

where qᵢ(θᵢ) is the approximate posterior for the i-th subset of parameters.

For example, split parameters θ and hyperparameters λ:

p(θ, λ | y) ≈ q(θ, λ) = q(θ) q(λ)

Jean Daunizeau, www.fil.ion.ucl.ac.uk/~jdaunize/presentations/Bayes2.pdf

VB in a nutshell (under mean-field approximation)
Neg. free energy approximation to the model evidence:
ln p(y|m) = F + KL[q(θ,λ), p(θ,λ|y,m)]
F = ⟨ln p(y|θ,λ)⟩_q − KL[q(θ,λ), p(θ,λ|m)]

Mean field approximation:
p(θ,λ|y) ≈ q(θ,λ) = q(θ) q(λ)

Maximize the neg. free energy w.r.t. q (= minimize the divergence) by maximizing the variational energies:
q(θ) ∝ exp(I(θ)) = exp(⟨ln p(y,θ,λ)⟩_q(λ))
q(λ) ∝ exp(I(λ)) = exp(⟨ln p(y,θ,λ)⟩_q(θ))

Iterative updating of the sufficient statistics of the approximate posteriors by gradient ascent.
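A compact worked sketch of such coupled mean-field updates for one textbook case — a univariate Gaussian with unknown mean μ and precision τ, following Bishop (PRML, §10.1.3); priors, data, and initialization are assumptions for illustration:

```python
# Mean-field VB for mu and tau of a Gaussian: q(mu, tau) = q(mu) q(tau).
# Priors (assumed): mu | tau ~ N(mu0, (lam0*tau)^-1), tau ~ Gamma(a0, b0).
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=0.5, size=100)
N, xbar, xsq = x.size, x.mean(), np.sum(x ** 2)

mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0
E_tau = a0 / b0                                   # initialize E[tau]

for _ in range(50):                               # iterate the coupled updates
    # q(mu) = N(muN, 1/lamN)
    muN = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lamN = (lam0 + N) * E_tau
    E_mu, E_mu2 = muN, muN ** 2 + 1.0 / lamN
    # q(tau) = Gamma(aN, bN)
    aN = a0 + (N + 1) / 2.0
    bN = b0 + 0.5 * (xsq - 2 * N * xbar * E_mu + N * E_mu2
                     + lam0 * (E_mu2 - 2 * mu0 * E_mu + mu0 ** 2))
    E_tau = aN / bN

print(muN, np.sqrt(bN / aN))   # approx. posterior mean ~2.0, noise s.d. ~0.5
```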
Model comparison and selection

Given competing hypotheses on the structure & functional mechanisms of a system, which model is the best?
Which model represents the best balance between model fit and model complexity?
For which model m does p(y|m) become maximal?

Pitt & Myung (2002) TICS

Bayesian model selection (BMS)
• First step of inference: define the model space M

• Inference on model structure m – posterior model probability:

p(m|y) = p(y|m) p(m) / Σₘ p(y|m) p(m)

• For a uniform prior on m, the model evidence is sufficient for model selection:

p(y|m) = ∫ p(y|θ,m) p(θ|m) dθ

Bayesian model selection (BMS)
Model evidence:

p(y|m) = ∫ p(y|θ,m) p(θ|m) dθ

= probability that the data y were generated by model m, averaged over all possible parameter values (as specified by the prior); accounts for both the accuracy and the complexity of the model.

Intuition: "If I randomly sampled from my prior and plugged the resulting value into the likelihood function, how close would the predicted data be – on average – to my observed data?"

Various approximations:
– negative free energy (F)
– Akaike Information Criterion (AIC)
– Bayesian Information Criterion (BIC)

Ghahramani 2004
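The quoted intuition can be turned into a naive Monte Carlo estimator of the evidence; a sketch for an assumed toy model y ~ N(θ, 1) with prior θ ~ N(0, 1):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = 1.5                                        # observed data point

theta = rng.normal(0.0, 1.0, size=100_000)     # samples from the prior
evidence = norm.pdf(y, loc=theta, scale=1.0).mean()   # average likelihood
print(evidence)                                # -> ~0.16, the analytic N(y; 0, 2)
```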
Approximations to the model evidence

The logarithm is a monotonic function, so maximizing the log model evidence = maximizing the model evidence.

Log model evidence = balance between fit and complexity:

log p(y|m) = accuracy(m) − complexity(m)
           = log p(y|θ̂,m) − complexity(m)

With p = number of parameters and N = number of data points:

Akaike Information Criterion: AIC = log p(y|θ̂,m) − p
Bayesian Information Criterion: BIC = log p(y|θ̂,m) − (p/2) log N
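A sketch of AIC and BIC in this sign convention (higher = better) for an assumed toy regression fitted by maximum likelihood:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 40)
y = 1.0 + 2.0 * x + rng.normal(0, 0.3, x.size)

X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]    # ML estimate of the betas
resid = y - X @ beta
log_lik = norm.logpdf(resid, scale=resid.std()).sum()

p, N = 3, x.size                               # 2 betas + noise s.d.
aic = log_lik - p
bic = log_lik - 0.5 * p * np.log(N)
print(aic, bic)            # BIC penalizes complexity more once N > e^2 ~ 7.4
```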
The (negative) free energy approximation F

F is a lower bound on the log model evidence:

log p(y|m) = F + KL[q(θ), p(θ|y,m)]  ⇒  log p(y|m) ≥ F

Like AIC/BIC, F is an accuracy/complexity tradeoff:

F = ⟨log p(y|θ,m)⟩_q − KL[q(θ), p(θ|m)] = accuracy − complexity

The (negative) free energy approximation
• The log evidence is thus the expected log likelihood (w.r.t. q) plus two KL terms:

log p(y|m) = ⟨log p(y|θ,m)⟩_q − KL[q(θ), p(θ|m)] + KL[q(θ), p(θ|y,m)]

F = log p(y|m) − KL[q(θ), p(θ|y,m)]
  = ⟨log p(y|θ,m)⟩_q − KL[q(θ), p(θ|m)]
    (accuracy)          (complexity)

The complexity term in F
• In contrast to AIC & BIC, the complexity term of the negative free energy F accounts for parameter interdependencies. Under Gaussian assumptions about the posterior (Laplace approximation):

KL[q(θ), p(θ|m)] = ½ ln|C_θ| − ½ ln|C_θ|y| + ½ (μ_θ|y − μ_θ)ᵀ C_θ⁻¹ (μ_θ|y − μ_θ)

• The complexity term of F is higher:
– the more independent the prior parameters (↑ effective degrees of freedom)
– the more dependent the posterior parameters
– the more the posterior mean deviates from the prior mean
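A numeric sketch of this complexity term for hypothetical prior and posterior moments of two parameters (note: the full Gaussian KL also carries a trace term, ½(tr(C_θ⁻¹ C_θ|y) − k), which the slide's expression omits):

```python
import numpy as np

mu_prior, C_prior = np.zeros(2), np.eye(2)      # prior moments (assumed)
mu_post = np.array([0.5, -0.3])                 # posterior mean (assumed)
C_post = np.array([[0.20, 0.05],
                   [0.05, 0.10]])               # posterior covariance (assumed)

d = mu_post - mu_prior
complexity = (0.5 * np.log(np.linalg.det(C_prior))
              - 0.5 * np.log(np.linalg.det(C_post))
              + 0.5 * d @ np.linalg.solve(C_prior, d))
print(complexity)   # grows with mean deviation and posterior dependencies
```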
Bayes factors

To compare two models, we could just compare their log evidences.
But: the log evidence is just some number – not very intuitive!
A more intuitive interpretation of model comparisons is made possible by Bayes factors:
B₁₂ = p(y|m₁) / p(y|m₂): a positive value in [0, ∞[

Kass & Raftery classification:

B₁₂          p(m₁|y)    Evidence
1 to 3       50–75%     weak
3 to 20      75–95%     positive
20 to 150    95–99%     strong
≥ 150        ≥ 99%      very strong
Kass & Raftery 1995, J. Am. Stat. Assoc.
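A minimal sketch of turning two log evidences (e.g., free-energy values; numbers made up) into a Bayes factor and a posterior model probability:

```python
import numpy as np

log_ev_m1, log_ev_m2 = -410.2, -413.5     # hypothetical log model evidences
bf_12 = np.exp(log_ev_m1 - log_ev_m2)     # Bayes factor B12
p_m1 = bf_12 / (1.0 + bf_12)              # p(m1|y) under a uniform model prior
print(bf_12, p_m1)   # ~27 -> "strong" evidence on the Kass & Raftery scale
```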
Fixed effects BMS at group level

Group Bayes factor (GBF) for subjects k = 1...K:
GBF_ij = ∏ₖ BF_ij⁽ᵏ⁾

Average Bayes factor (ABF):

ABF_ij = (∏ₖ BF_ij⁽ᵏ⁾)^(1/K)

Problems:
– blind with regard to group heterogeneity
– sensitive to outliers

Random effects BMS for heterogeneous groups
Dirichlet parameters α = "occurrences" of models in the population

r ~ Dir(r; α): Dirichlet distribution of the model probabilities r, k = 1...K
mₙ ~ Mult(m; 1, r): multinomial distribution of the model labels m
yₙ ~ p(yₙ|mₙ): measured data, subjects n = 1...N; model inversion by variational Bayes or MCMC

Stephan et al. 2009, NeuroImage

Four equivalent options for reporting model ranking by random effects BMS
1. Dirichlet parameter estimates α

2. expected posterior probability of obtaining the k-th model for any randomly selected subject:
⟨rₖ⟩ = αₖ / (α₁ + ... + α_K)

3. exceedance probability that a particular model k is more likely than any other model (of the K models tested), given the group data:
φₖ = p(∀j ∈ {1...K | j ≠ k}: rₖ > rⱼ | y; α)

4. protected exceedance probability: see below
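Options 2 and 3 are easy to compute from the Dirichlet parameters; a sketch with made-up α for K = 3 models, estimating exceedance probabilities by sampling:

```python
import numpy as np

alpha = np.array([11.8, 2.2, 1.0])          # hypothetical Dirichlet estimates
expected_r = alpha / alpha.sum()            # option 2: expected probabilities

rng = np.random.default_rng(0)
r = rng.dirichlet(alpha, size=100_000)      # samples of model probabilities
winner = r.argmax(axis=1)                   # which model dominates each draw
exceedance = np.bincount(winner, minlength=3) / r.shape[0]   # option 3
print(expected_r, exceedance)
```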
Example: hemispheric interactions during vision

[Figure: two DCMs of interhemispheric integration between lingual gyrus (LG), middle occipital gyrus (MOG) and fusiform gyrus (FG), with inputs RVF stim., LVF stim. and LD; in m1, interhemispheric connections are modulated by LD|RVF and LD|LVF, in m2 by LD]

Data: Stephan et al. 2003, Science
Models: Stephan et al. 2007, J. Neurosci.

[Figure: log model evidence differences across subjects (approx. −35 to 5) and the posterior density p(r₁|y)]

Result: α₁ = 11.8, α₂ = 2.2; expected probabilities ⟨r₁⟩ = 84.3%, ⟨r₂⟩ = 15.7%; exceedance probability p(r₁ > 0.5 | y) = p(r₁ > r₂ | y) = 99.7%

Stephan et al. 2009a, NeuroImage

Example: Synaesthesia
• “projectors” experience color externally colocalized with a presented grapheme
• “associators” report an internally evoked association
• across all subjects: no evidence for either model
• but BMS results map precisely onto projectors (bottom-up mechanisms) and associators (top-down)
van Leeuwen et al. 2011, J. Neurosci.

Overfitting at the level of models
• the larger the number of models considered, the greater the risk of overfitting at the level of models

posterior model probability:
p(m|y) = p(y|m) p(m) / Σₘ p(y|m) p(m)

• solutions:
– regularisation: definition of the model space = choosing the priors p(m)
– family-level BMS
– Bayesian model averaging (BMA)

BMA: p(θ|y) = Σₘ p(θ|y,m) p(m|y)

Model space partitioning: comparing model families
• partitioning model space M into K subsets or families: M = {f₁, ..., f_K}

• pooling information over all models in these subsets allows one to compute the probability p(fₖ|y) of a model family, given the data

• effectively removes uncertainty about any aspect of model structure other than the attribute of interest (which defines the partition)

Stephan et al. 2009, NeuroImage; Penny et al. 2010, PLoS Comput. Biol.

Family-level inference: random effects – a special case
• When the families are of equal size, one can simply sum the posterior model probabilities within families by exploiting the agglomerative property of the Dirichlet distribution:

(r₁, r₂, ..., r_K) ~ Dir(α₁, α₂, ..., α_K)

⇒ (r*₁, ..., r*_J) ~ Dir(α*₁, ..., α*_J), where r*ⱼ = Σ_{k∈fⱼ} rₖ and α*ⱼ = Σ_{k∈fⱼ} αₖ

Stephan et al. 2009, NeuroImage
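A sketch verifying this agglomeration property by sampling (the α values and the family partition are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([3.0, 1.0, 2.0, 4.0])        # K = 4 models
families = [[0, 1], [2, 3]]                   # two equal-sized families

r = rng.dirichlet(alpha, size=200_000)
r_summed = np.stack([r[:, f].sum(axis=1) for f in families], axis=1)
r_direct = rng.dirichlet([alpha[f].sum() for f in families], size=200_000)
print(r_summed.mean(axis=0), r_direct.mean(axis=0))   # both ~[0.4, 0.6]
```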
Model space partitioning: comparing model families

[Figure: FFX – summed log evidences (log GBF) for eight models (CBMN, CBMN(ε), RBMN, RBMN(ε), CBML, CBML(ε), RBML, RBML(ε)); RFX – Dirichlet α per model; posterior p(r₁|y) after partitioning into a nonlinear family (models 1–4) and a linear family (models 5–8)]

Model space partitioning: α*₁ = Σ_{k=1..4} αₖ (nonlinear models), α*₂ = Σ_{k=5..8} αₖ (linear models)

Result: ⟨r₁⟩ = 73.5% (nonlinear) vs ⟨r₂⟩ = 26.5% (linear); p(r₁ > r₂) = p(r₁ > 0.5 | y) = 98.6%

Stephan et al. 2009, NeuroImage

Bayesian Model Averaging (BMA)
• abandons the dependence of parameter inference on a single model and takes model uncertainty into account
• uses the entire model space considered (or an optimal family of models)
• averages parameter estimates, weighted by the posterior model probabilities
• represents a particularly useful alternative
– when none of the models (or model subspaces) considered clearly outperforms all others
– when comparing groups for which the optimal model differs

single-subject BMA:
p(θ|y) = Σₘ p(θ|y,m) p(m|y)

group-level BMA:
p(θₙ|y₁..N) = Σₘ p(θₙ|yₙ,m) p(m|y₁..N)

NB: p(m|y₁..N) can be obtained by either FFX or RFX BMS

Penny et al. 2010, PLoS Comput. Biol.
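A minimal single-subject BMA sketch (the posterior model probabilities and per-model posterior means are made-up numbers):

```python
import numpy as np

p_m_given_y = np.array([0.6, 0.3, 0.1])      # p(m|y) for three models
theta_mean = np.array([0.8, 0.5, -0.2])      # E[theta | y, m] per model

bma_mean = p_m_given_y @ theta_mean          # E[theta | y], averaged over models
print(bma_mean)                              # 0.61
```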
Prefrontal-parietal connectivity during working memory in schizophrenia

• 17 at-risk mental state (ARMS) individuals • 21 first-episode patients (13 non-treated) • 20 controls
Schmidt et al. 2013, JAMA Psychiatry

BMS results for all groups
Schmidt et al. 2013, JAMA Psychiatry

BMA results: PFC → PPC connectivity
17 ARMS, 21 first-episode (13 non-treated), 20 controls
Schmidt et al. 2013, JAMA Psychiatry

Protected exceedance probability: using BMA to protect against chance findings

• exceedance probabilities (EPs) express our confidence that the posterior probabilities of models are different – under the hypothesis H₁ that models differ in probability: rₖ ≠ 1/K
• this does not account for the possibility of the "null hypothesis" H₀: rₖ = 1/K
• Bayesian omnibus risk (BOR): the posterior probability of wrongly accepting H₁ over H₀

• protected EP: Bayesian model averaging over H₀ and H₁:
φ̃ₖ = φₖ (1 − BOR) + (1/K) BOR
Rigoux et al. 2014, NeuroImage

[Decision tree (Stephan et al. 2010, NeuroImage):]

definition of model space, then: inference on model structure or inference on model parameters?

• inference on model structure – on individual models or on a model space partition?
– individual models: is the optimal model structure assumed to be identical across subjects? yes → FFX BMS; no → RFX BMS
– model space partition: comparison of model families using FFX or RFX BMS

• inference on model parameters – parameters of an optimal model or parameters of all models?
– optimal model: is the optimal model structure assumed to be identical across subjects? yes → FFX BMS, then FFX analysis of parameter estimates (e.g. BPA); no → RFX BMS, then RFX analysis of parameter estimates (e.g. t-test, ANOVA)
– all models: BMA

Stephan et al. 2010, NeuroImage

Further reading
• Penny WD, Stephan KE, Mechelli A, Friston KJ (2004) Comparing dynamic causal models. NeuroImage 22: 1157-1172.
• Penny WD, Stephan KE, Daunizeau J, Joao M, Friston K, Schofield T, Leff AP (2010) Comparing families of dynamic causal models. PLoS Computational Biology 6: e1000709.
• Penny WD (2012) Comparing dynamic causal models using AIC, BIC and free energy. NeuroImage 59: 319-330.
• Rigoux L, Stephan KE, Friston KJ, Daunizeau J (2014) Bayesian model selection for group studies – revisited. NeuroImage 84: 971-985.
• Stephan KE, Weiskopf N, Drysdale PM, Robinson PA, Friston KJ (2007) Comparing hemodynamic models with DCM. NeuroImage 38: 387-401.
• Stephan KE, Penny WD, Daunizeau J, Moran RJ, Friston KJ (2009) Bayesian model selection for group studies. NeuroImage 46: 1004-1017.
• Stephan KE, Penny WD, Moran RJ, den Ouden HEM, Daunizeau J, Friston KJ (2010) Ten simple rules for dynamic causal modelling. NeuroImage 49: 3099-3109.
• Stephan KE, Iglesias S, Heinzle J, Diaconescu AO (2015) Translational perspectives for computational neuroimaging. Neuron 87: 716-732.
• Stephan KE, Schlagenhauf F, Huys QJM, Raman S, Aponte EA, Brodersen KH, Rigoux L, Moran RJ, Daunizeau J, Dolan RJ, Friston KJ, Heinz A (2017) Computational neuroimaging strategies for single patient predictions. NeuroImage 145: 180-199.

Thank you