
Bayesian inference and Bayesian model selection

Klaas Enno Stephan Lecture as part of "Methods & Models for fMRI analysis", University of Zurich & ETH Zurich, 27 November 2018

With slides from and many thanks to: Kay Brodersen, Will Penny, Sudhir Shankar Raman

Why should I know about Bayesian inference?

Because Bayesian principles are fundamental for

• statistical inference in general

• system identification

• translational neuromodeling ("computational assays")
  – computational psychiatry
  – computational neurology
  – computational psychosomatics

• contemporary theories of brain function (the "Bayesian brain")
  – predictive coding
  – free energy principle
  – active inference

Bayes' theorem

The Reverend Thomas Bayes (1702–1761)

p(θ|y) = p(y|θ) p(θ) / p(y)

"... the theorem expresses how a ... degree of belief should rationally change to account for availability of related evidence." (Wikipedia)

posterior = likelihood ∙ prior / evidence

Bayesian inference: an animation

The evidence term

pyp(|)() continuous  py(|)   pyp(|)()

p( y | ) p ( ) discrete  py( | )   p( y | ) p ( )  Bayesian inference: An example (with fictitious probabilities)

• symptom: y — y=1: fever, y=0: no fever

• disease: θ — θ=1: Ebola, θ=0: any other disease (AOD)

• likelihood p(y|θ):
  p(fever | Ebola) = 99.99%   p(no fever | Ebola) = 0.01%
  p(fever | AOD) = 20%        p(no fever | AOD) = 80%

• a priori: p(Ebola) = 10⁻⁶, p(AOD) = 1 − 10⁻⁶

• A patient presents with fever. What is the probability that he/she has Ebola?

p(θ=1 | y=1) = p(y=1 | θ=1) p(θ=1) / Σ_{j=0,1} p(y=1 | θ=j) p(θ=j)
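As a quick numerical check, the same computation in a few lines of Python (the probabilities are the fictitious ones from the slide):

```python
# Bayes' rule for the fever/Ebola example (fictitious numbers from the slide):
# p(fever | Ebola) = 0.9999, p(fever | AOD) = 0.2, prior p(Ebola) = 1e-6.
def posterior_h1(lik_h1, lik_h0, prior_h1):
    """p(H1 | data) for two exhaustive hypotheses H1 and H0."""
    evidence = lik_h1 * prior_h1 + lik_h0 * (1.0 - prior_h1)
    return lik_h1 * prior_h1 / evidence

p_ebola = posterior_h1(0.9999, 0.2, 1e-6)
print(p_ebola)  # ≈ 5e-6: fever alone barely moves the very small prior
```

Despite the near-perfect likelihood, the posterior stays tiny because the prior dominates — the point of the example.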

p(θ=1 | y=1) = 0.9999 · 10⁻⁶ / (0.9999 · 10⁻⁶ + 0.2 · (1 − 10⁻⁶)) ≈ 5 · 10⁻⁶

Generative models

p(θ|m) → p(y|θ,m)

1. specify the joint probability over data (observations) and parameters
2. enforce mechanistic thinking: how could the data have been caused?
3. generate synthetic data (observations) by sampling from the prior – can the model explain certain phenomena at all?
4. inference about parameters → p(θ|y)
5. model evidence p(y|m): index of model quality

Bayesian inference in practice

 Formulation of a generative model: likelihood p(y|θ), prior distribution p(θ)

 Observation of data: measurement data y

 Model inversion – updating one's beliefs:

posterior p(θ|y) = p(y|θ) p(θ) / p(y), with model evidence p(y)

Example of a prior

Priors

• Objective priors:
  – "non-informative" priors
  – objective constraints (e.g., non-negativity)

• Subjective priors:
  – subjective but not arbitrary
  – can express beliefs that result from understanding of the problem or system
  – can be the result of previous empirical results

• Shrinkage priors:
  – emphasize regularization and sparsity

• Empirical priors:
  – learn parameters of prior distributions from the data ("empirical Bayes")
  – rest on a hierarchical model

A generative modelling framework for fMRI & EEG: Dynamic causal modeling (DCM)

[Figure: generative models link hidden neuronal dynamics to different measurement modalities — dwMRI, EEG/MEG, fMRI.]

Forward model: predicting measured activity. Model inversion: inferring neuronal parameters.

dx/dt = f(x, u, θ)
y = g(x, θ) + ε

Friston et al. 2003, NeuroImage; Stephan et al. 2009, NeuroImage

DCM for fMRI

Stephan et al. 2015, Neuron

[Figure: simulated neural population activity (x1, x2, x3) of a three-region DCM driven by inputs u1 and u2, and the resulting fMRI signal change (%).]

Nonlinear Dynamic Causal Model for fMRI

dx/dt = ( A + Σ_{i=1..m} u_i B^(i) + Σ_{j=1..n} x_j D^(j) ) x + C u
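The flavour of these dynamics can be illustrated with a minimal forward-Euler simulation of a two-region bilinear DCM; the connectivity matrices below are purely illustrative, not taken from the lecture, and the nonlinear D terms are omitted:

```python
# Forward Euler simulation of a two-region bilinear DCM,
# dx/dt = (A + u*B) x + C*u  (nonlinear D terms omitted).
# All matrices are illustrative, not taken from the lecture.
dt, steps = 0.01, 1000
A = [[-0.5, 0.0], [0.4, -0.5]]   # fixed (endogenous) connectivity
B = [[0.0, 0.0], [0.3, 0.0]]     # modulation of the 1 -> 2 connection by u
C = [1.0, 0.0]                   # driving input enters region 1

x = [0.0, 0.0]
for t in range(steps):
    u = 1.0 if t < 500 else 0.0  # boxcar input
    dx = [sum((A[i][j] + u * B[i][j]) * x[j] for j in range(2)) + C[i] * u
          for i in range(2)]
    x = [x[i] + dt * dx[i] for i in range(2)]
print(x)  # activity rises under the input, then decays back towards zero
```

A full DCM would additionally pass x through a hemodynamic observation model to predict the BOLD signal.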


Stephan et al. 2008, NeuroImage

Bayesian system identification

Design experimental inputs: u(t)

Neural dynamics: dx/dt = f(x, u, θ)

Observer function: y = g(x, θ) + ε

Define likelihood model: p(y | θ, m) = N(g(θ), Σ(λ))

Specify priors: p(θ | m) = N(μ_θ, Σ_θ)

Invert model – inference on parameters: p(θ | y, m) = p(y | θ, m) p(θ | m) / p(y | m)

Inference on model structure: p(y | m) = ∫ p(y | θ, m) p(θ | m) dθ

Make inferences

Why should I know about Bayesian inference?

Because Bayesian principles are fundamental for

• statistical inference in general

• system identification

• translational neuromodeling ("computational assays")
  – computational psychiatry
  – computational neurology
  – computational psychosomatics

• contemporary theories of brain function (the "Bayesian brain")
  – predictive coding
  – free energy principle
  – active inference

Generative models as "computational assays"

p(θ|m) → p(y|θ,m)

Computational assays: models of disease mechanisms

dx/dt = f(x, u, θ)

Translational Neuromodeling

 Application to brain activity and behaviour of individual patients

 Detecting physiological subgroups (based on inferred mechanisms)
  – disease mechanism A
  – disease mechanism B
  – disease mechanism C

 Individual treatment prediction

Stephan et al. 2015, Neuron

Generative embedding (supervised)

step 1 — model inversion: measurements from an individual subject → subject-specific inverted generative model

step 2 — kernel construction: subject representation in the generative score space (parameters A → B, A → C, B → B, B → C)

step 3 — support vector classification: separating hyperplane fitted to discriminate between groups

step 4 — interpretation: jointly discriminative model parameters

Brodersen et al. 2011, PLoS Comput. Biol.

Generative embedding (unsupervised)

Brodersen et al. 2014, NeuroImage

Clinical differential diagnosis by model selection

SYMPTOM y (behaviour or physiology)

pym(|,)  k pmy(|) k

HYPOTHETICAL ...... MECHANISM m1 mk mK

p( y | mkk ) p ( m ) pm(k | y)   p( y | mkk ) p ( m ) k

Stephan et al. 2017, NeuroImage

Why should I know about Bayesian inference?

Because Bayesian principles are fundamental for

• statistical inference in general

• system identification

• translational neuromodeling ("computational assays")
  – computational psychiatry
  – computational neurology
  – computational psychosomatics

• contemporary theories of brain function (the "Bayesian brain")
  – predictive coding
  – free energy principle
  – active inference

Perception = inversion of a hierarchical generative model

environmental states, neuronal states, others' mental states, bodily states

forward model: p(y | x, m) p(x | m)

perception: p(x | y, m)

Free energy principle: predictive coding & active inference

prediction error = sensations – predictions

Action: change sensory input. Perception: change predictions.

Maximizing the evidence (of the brain's generative model) = minimizing the surprise about the data (sensory inputs).

Friston et al. 2006, J Physiol Paris

How is the posterior computed = how is a generative model inverted?

Bayesian inference

– analytical solutions
– approximate inference: variational Bayes, MCMC sampling

How is the posterior computed = how is a generative model inverted?

• compute the posterior analytically – requires conjugate priors

• variational Bayes (VB)
  – often hard work to derive, but fast to compute
  – uses approximations (approximate posterior, mean field)
  – problems: local minima, potentially inaccurate approximations

• Markov chain Monte Carlo (MCMC) sampling
  – theoretically guaranteed to be accurate (for infinite computation time)
  – problems: may require very long run time in practice, convergence difficult to prove

Conjugate priors

• for a given likelihood function, the choice of prior determines the algebraic form of the posterior

• for some probability distributions a prior can be found such that the posterior has the same algebraic form as the prior

• such a prior is called “conjugate” to the likelihood

• examples:
  – Normal × Normal → Normal
  – Beta × Binomial → Beta
  – Dirichlet × Multinomial → Dirichlet

p(θ|y) ∝ p(y|θ) p(θ) — posterior and prior have the same form

Posterior mean & variance of univariate Gaussians

Likelihood & prior:
p(y|θ) = N(y; θ, σ_e²)
p(θ) = N(θ; μ_p, σ_p²)

Posterior: p(θ|y) = N(θ; μ, σ²), with

1/σ² = 1/σ_e² + 1/σ_p²
μ = σ² ( μ_p/σ_p² + y/σ_e² )

Posterior mean = variance-weighted combination of prior mean and data mean

Same thing – but expressed as precision weighting

Likelihood & prior 1 y     p( y | ) N (  , e ) pN()(,)   1 pp   Posterior: pyN(|)(,)  1 Posterior Likelihood   e  p  p Prior     e   p    p

Relative precision weighting

Variational Bayes (VB)

Idea: find an approximate density q(θ) that is maximally similar to the true posterior p(θ|y). This is often done by assuming a particular form for q (fixed-form VB) and then optimizing its sufficient statistics.

[Figure: within a class of candidate densities, the best proxy q(θ) minimizes KL[q‖p] to the true posterior p(θ|y).]

Kullback–Leibler (KL) divergence

• asymmetric measure of the difference between two probability distributions P and Q

• Interpretations of DKL(P‖Q):
  – "Bayesian surprise" when Q = prior, P = posterior: measure of the information gained when one updates one's prior beliefs to the posterior P
  – a measure of the information lost when Q is used to approximate P

• non-negative: DKL(P‖Q) ≥ 0 (zero if and only if P = Q)
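For discrete distributions, the KL divergence and its asymmetry are a one-liner to compute (the two distributions below are illustrative):

```python
import math

def kl(p, q):
    """D_KL(P||Q) for discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]   # illustrative distributions
q = [0.5, 0.3, 0.2]
print(kl(p, q), kl(q, p))  # both non-negative, and the two directions differ
```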

Variational calculus

Standard calculus (Newton, Leibniz, and others):
• functions f: x ↦ f(x)
• derivatives df/dx
Example: maximize the likelihood expression p(y|θ) w.r.t. θ

Variational calculus (Euler, Lagrange, and others):
• functionals F: f ↦ F[f]
• derivatives dF/df
Example: maximize the entropy H[p] w.r.t. a probability distribution p(x)

Leonhard Euler (1707–1783), Swiss mathematician; 'Elementa Calculi Variationum'

Variational Bayes

ln p(y) = KL[q‖p] + F(q, y)

KL[q‖p] ≥ 0: divergence (unknown)
F(q, y): neg. free energy (easy to evaluate for a given q)

F is a functional w.r.t. the approximate posterior q(θ). Maximizing F(q, y) is equivalent to:
• minimizing KL[q‖p]
• tightening F(q, y) as a lower bound to the log model evidence

When F(q, y) is maximized, q(θ) is our best estimate of the posterior.

[Figure: F(q, y) increases from initialization to convergence towards ln p(y).]

Derivation of the (negative) free energy approximation
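Before the formal derivation, the decomposition ln p(y) = KL[q‖p] + F(q, y) can be checked numerically on a two-state discrete example (the prior and likelihood values are illustrative):

```python
import math

# Two-state example: theta in {0, 1}, one fixed observation y.
prior = [0.5, 0.5]   # p(theta)      (illustrative)
lik = [0.8, 0.1]     # p(y | theta)  (illustrative)
p_y = sum(l * p for l, p in zip(lik, prior))  # evidence p(y)

def free_energy(q):
    """F(q, y) = sum_theta q(theta) * ln( p(y, theta) / q(theta) )."""
    return sum(qi * math.log(lik[i] * prior[i] / qi)
               for i, qi in enumerate(q) if qi > 0)

posterior = [lik[i] * prior[i] / p_y for i in range(2)]
for q in ([0.9, 0.1], [0.5, 0.5], posterior):
    assert free_energy(q) <= math.log(p_y) + 1e-12  # F is a lower bound
print(math.log(p_y), free_energy(posterior))  # equal when q is the posterior
```

Any q gives F ≤ ln p(y), with equality exactly when q equals the true posterior — which is what makes maximizing F a sensible inference scheme.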

• See whiteboard! (or the Appendix to Stephan et al. 2007, NeuroImage 38: 387-401)

Mean field assumption

Factorize the approximate posterior q(θ) into independent partitions:

q(θ) = Π_i q_i(θ_i)

where q_i(θ_i) is the approximate posterior for the i-th subset of parameters.

For example, split parameters θ1 and θ2:

p(θ1, θ2 | y) ≈ q(θ1, θ2) = q(θ1) q(θ2)

Jean Daunizeau, www.fil.ion.ucl.ac.uk/~jdaunize/presentations/Bayes2.pdf

VB in a nutshell (under mean-field approximation)

 Neg. free-energy approximation to model evidence:
ln p(y|m) = F + KL[ q(θ1, θ2), p(θ1, θ2 | y, m) ]
F = ⟨ln p(y, θ1, θ2 | m)⟩_q − KL[ q(θ1, θ2), p(θ1, θ2 | m) ]

 Mean field approximation:
p(θ1, θ2 | y) ≈ q(θ1, θ2) = q(θ1) q(θ2)

 Maximise neg. free energy w.r.t. q = minimise divergence, by maximising variational energies:
q(θ1) ∝ exp( I(θ1) ) = exp( ⟨ln p(y, θ1, θ2)⟩_q(θ2) )
q(θ2) ∝ exp( I(θ2) ) = exp( ⟨ln p(y, θ1, θ2)⟩_q(θ1) )

 Iterative updating of sufficient statistics of approximate posteriors by gradient ascent.

Model comparison and selection

Given competing hypotheses on structure & functional mechanisms of a system, which model is the best?

Which model represents the best balance between model fit and model complexity?

For which model m does p(y|m) become maximal?

Pitt & Myung (2002) TICS

Bayesian model selection (BMS)

• First step of inference: define the model space M

• Inference on model structure m — posterior model probability:

p(m|y) = p(y|m) p(m) / p(y),  p(y) = Σ_m p(y|m) p(m)

• For a uniform prior on m, the model evidence is sufficient for model selection:

p(y|m) = ∫ p(y | θ, m) p(θ | m) dθ

Bayesian model selection (BMS)

Model evidence (Ghahramani 2004):

p(y|m) = ∫ p(y | θ, m) p(θ | m) dθ

[Figure: p(y|m) plotted over all possible datasets y.]

probability that the data were generated by model m, averaging over all possible parameter values (as specified by the prior):
"If I randomly sampled from my prior and plugged the resulting value into the likelihood function, how close would the predicted data be – on average – to my observed data?"

accounts for both accuracy and complexity of the model

Various approximations:
– negative free energy (F)
– Akaike Information Criterion (AIC)
– Bayesian Information Criterion (BIC)

Approximations to the model evidence
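The simplest approximation is the prior-sampling reading of the evidence quoted above: average the likelihood over draws from the prior. A sketch for a Beta-Bernoulli toy model (all values illustrative):

```python
import random

random.seed(1)
# Beta-Bernoulli toy model: a fixed sequence with 3 successes in 5 trials,
# uniform prior on the rate theta. Exact evidence = B(4, 3) = 1/60.
def likelihood(theta, k=3, n=5):
    return theta ** k * (1.0 - theta) ** (n - k)

n_samples = 100_000
estimate = sum(likelihood(random.random()) for _ in range(n_samples)) / n_samples
print(estimate)  # close to 1/60 ≈ 0.0167
```

This naive estimator is fine in one dimension but degrades quickly in high dimensions, which is why F, AIC and BIC are used in practice.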

Logarithm is a monotonic function: maximizing the log model evidence = maximizing the model evidence

Log model evidence = balance between fit and complexity:

log p(y|m) = accuracy(m) − complexity(m)
           ≈ log p(y | θ̂, m) − complexity(m)

Akaike Information Criterion (p = no. of parameters):
AIC = log p(y | θ̂, m) − p

Bayesian Information Criterion (N = no. of data points):
BIC = log p(y | θ̂, m) − (p/2) log N
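In the sign convention of these slides (higher = better), both criteria are one-line penalties on the maximized log likelihood; the fit values below are illustrative:

```python
import math

# AIC and BIC in the sign convention of the slide (higher = better).
def aic(log_lik, p):
    return log_lik - p

def bic(log_lik, p, n):
    return log_lik - (p / 2.0) * math.log(n)

# illustrative: two models fit to N = 100 data points
print(aic(-120.0, 4), bic(-120.0, 4, 100))   # simpler model
print(aic(-118.5, 8), bic(-118.5, 8, 100))   # better fit, more parameters
```

With these numbers both criteria prefer the simpler model: the 1.5-unit gain in fit does not pay for four extra parameters, and BIC penalizes them more heavily than AIC once log N > 2.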

The (negative) free energy approximation F

F is a lower bound on the log model evidence:

log p(y|m) = F + KL[ q(θ), p(θ | y, m) ]  ⟹  F ≤ log p(y|m)

Like AIC/BIC, F is an accuracy/complexity tradeoff:

F = ⟨log p(y | θ, m)⟩_q − KL[ q(θ), p(θ | m) ]
  = accuracy − complexity

• Log evidence is thus expected log likelihood (w.r.t. q) plus two KLs:

log p(y|m) = ⟨log p(y | θ, m)⟩_q − KL[ q(θ), p(θ | m) ] + KL[ q(θ), p(θ | y, m) ]

F = log p(y|m) − KL[ q(θ), p(θ | y, m) ]
  = ⟨log p(y | θ, m)⟩_q − KL[ q(θ), p(θ | m) ]
  = accuracy − complexity

The complexity term in F

• In contrast to AIC & BIC, the complexity term of the negative free energy F accounts for parameter interdependencies. Under Gaussian assumptions about the posterior (Laplace approximation):

KL[ q(θ), p(θ|m) ] = ½ ln|C_θ| − ½ ln|C_θ|y| + ½ (μ_θ|y − μ_θ)ᵀ C_θ⁻¹ (μ_θ|y − μ_θ)

• The complexity term of F is higher
  – the more independent the prior parameters (↑ effective degrees of freedom)
  – the more dependent the posterior parameters
  – the more the posterior mean deviates from the prior mean
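The univariate special case of this Gaussian KL makes the last bullet concrete — moving the posterior mean away from the prior mean inflates complexity (the means and variances below are illustrative):

```python
import math

def kl_gauss(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for univariate Gaussians."""
    return 0.5 * (math.log(var_p / var_q) + var_q / var_p
                  + (mu_q - mu_p) ** 2 / var_p - 1.0)

# complexity grows as the posterior mean moves away from the prior mean
print(kl_gauss(0.0, 0.5, 0.0, 1.0))  # posterior mean at the prior mean
print(kl_gauss(2.0, 0.5, 0.0, 1.0))  # larger: mean deviates from the prior
```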

Bayes factors

To compare two models, we could just compare their log evidences.

But: the log evidence is just some number – not very intuitive!

A more intuitive interpretation of model comparisons is made possible by Bayes factors:

B12 = p(y|m1) / p(y|m2) — positive value, ∈ ]0; ∞[

Kass & Raftery classification:

B12         p(m1|y)    Evidence
1 to 3      50–75%     weak
3 to 20     75–95%     positive
20 to 150   95–99%     strong
≥ 150       ≥ 99%      very strong
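Since model inversion returns log evidences, the Bayes factor and the posterior model probability in this table are simple transforms (the log evidence values below are illustrative):

```python
import math

# From log evidences to a Bayes factor and, under a uniform model prior,
# a posterior model probability (values are illustrative).
def bayes_factor(log_ev1, log_ev2):
    return math.exp(log_ev1 - log_ev2)

def posterior_prob_m1(log_ev1, log_ev2):
    b12 = bayes_factor(log_ev1, log_ev2)
    return b12 / (1.0 + b12)

print(bayes_factor(-120.0, -123.0))      # log evidence difference of 3
print(posterior_prob_m1(-120.0, -123.0))
```

A log evidence difference of about 3 already corresponds to a Bayes factor near 20, i.e. the upper end of "positive" evidence in the table.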

Kass & Raftery 1995, J. Am. Stat. Assoc.

Fixed effects BMS at group level

Group Bayes factor (GBF) for 1...K subjects:

GBF_ij = Π_k BF_ij^(k)

Average Bayes factor (ABF):

ABF_ij = ( Π_k BF_ij^(k) )^(1/K)

Problems:
– blind with regard to group heterogeneity
– sensitive to outliers

Random effects BMS for heterogeneous groups

Dirichlet parameters α = "occurrences" of models in the population

r ~ Dir(r; α) — Dirichlet distribution of model probabilities r

m_k ~ Mult(m; 1, r) — multinomial distribution of model labels m

y_k ~ p(y_k | m_k) — model inversion by variational Bayes or MCMC; measured data y

Stephan et al. 2009, NeuroImage

Random effects BMS for heterogeneous groups

Dirichlet parameters α = "occurrences" of models in the population

r_k, k = 1...K — Dirichlet distribution of model probabilities r

m_nk — multinomial distribution of model labels m

y_n, n = 1...N — model inversion by variational Bayes or MCMC; measured data y

Stephan et al. 2009, NeuroImage

Four equivalent options for reporting model ranking by random effects BMS

1. Dirichlet parameter estimates α

2. expected posterior probability of obtaining the k-th model for any randomly selected subject:
   ⟨r_k⟩_q = α_k / (α_1 + ... + α_K)

3. exceedance probability that a particular model k is more likely than any other model (of the K models tested), given the group data:
   φ_k = p( r_k > r_j | y; α ) for all j ∈ {1...K}, j ≠ k

4. protected exceedance probability: see below

Example: Hemispheric interactions during vision
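The Dirichlet summaries used in examples like the following can be estimated by sampling, using only the standard library; the α values are chosen to match the two-model example reported below (Stephan et al. 2009a):

```python
import random

random.seed(0)

def dirichlet_sample(alpha):
    """Sample from Dir(alpha) via normalized Gamma draws (stdlib only)."""
    g = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]

# Posterior Dirichlet parameters for two models, as in the example below.
alpha = [11.8, 2.2]
expected = [a / sum(alpha) for a in alpha]   # option 2: expected probabilities

n, count = 20_000, 0
for _ in range(n):
    r = dirichlet_sample(alpha)
    if r[0] > r[1]:
        count += 1
xp = count / n                               # option 3: exceedance probability

print(expected, xp)  # expected ≈ [0.843, 0.157]; xp ≈ 0.997
```

For two models the exceedance probability reduces to p(r1 > 0.5 | y), which is exactly what the example slide reports.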

[Figure: two DCMs of interhemispheric interactions in the ventral visual stream (LG, FG, MOG in both hemispheres; RVF/LVF stimulation as driving inputs). m1: interhemispheric MOG connections modulated by LD|RVF and LD|LVF; m2: modulated by LD.]

Data: Stephan et al. 2003; Models: Stephan et al. 2007, J. Neurosci.

[Figure: subject-wise log model evidence differences (m2 vs. m1), ranging from about −35 to 5.]

[Figure: posterior density p(r1|y) from random effects BMS: α1 = 11.8, α2 = 2.2; expected posterior probabilities r1 = 84.3%, r2 = 15.7%; exceedance probability p(r1 > r2) = p(r1 > 0.5 | y) = 99.7%.]

Stephan et al. 2009a, NeuroImage

Example: Synaesthesia

• “projectors” experience color externally colocalized with a presented grapheme

• “associators” report an internally evoked association

• across all subjects: no evidence for either model

• but BMS results map precisely onto projectors (bottom-up mechanisms) and associators (top-down)

van Leeuwen et al. 2011, J. Neurosci.

Overfitting at the level of models

• ↑ number of models → ↑ risk of overfitting

• solutions:
  – regularisation: definition of model space = choosing priors p(m)
  – family-level BMS
  – Bayesian model averaging (BMA)

posterior model probability:
p(m|y) = p(y|m) p(m) / Σ_m p(y|m) p(m)

BMA:
p(θ|y) = Σ_m p(θ | y, m) p(m | y)

Model space partitioning: comparing model families

• partitioning model space into K subsets or families: M = {f_1, ..., f_K}

• pooling information over all models in these subsets allows one to compute the probability of a model family p(f_k | y), given the data

• effectively removes uncertainty about any aspect of model structure, other than the attribute of interest (which defines the partition)

Stephan et al. 2009, NeuroImage; Penny et al. 2010, PLoS Comput. Biol.

Family-level inference: random effects – a special case

• When the families are of equal size, one can simply sum the posterior model probabilities within families by exploiting the agglomerative property of the Dirichlet distribution:

(r_1, r_2, ..., r_K) ~ Dir(α_1, α_2, ..., α_K)

r*_j = Σ_{k∈f_j} r_k,  α*_j = Σ_{k∈f_j} α_k

(r*_1, ..., r*_J) ~ Dir(α*_1, ..., α*_J)
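In code, the agglomerative property amounts to summing Dirichlet parameters within each family (the α values and family assignment below are illustrative):

```python
# Agglomerative property of the Dirichlet: summing model probabilities
# within families yields a Dirichlet over families (alphas illustrative).
alpha = [2.5, 1.5, 4.0, 2.0]           # four models
families = [[0, 1], [2, 3]]            # two families of two models each
alpha_fam = [sum(alpha[k] for k in f) for f in families]
total = sum(alpha)

print(alpha_fam)                        # family-level Dirichlet parameters
print([a / total for a in alpha_fam])   # expected family probabilities
```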

Stephan et al. 2009, NeuroImage

Model space partitioning: comparing model families

[Figure: FFX — summed log evidences (log GBF, relative to RBMN) for eight models (CBMN, CBMN(ε), RBMN, RBMN(ε), CBML, CBML(ε), RBML, RBML(ε)); nonlinear vs. linear models.]

[Figure: RFX — Dirichlet parameter estimates (α) for the same eight models.]

Model space partitioning into nonlinear (m1) and linear (m2) families:
α*_1 = Σ_{k=1..4} α_k,  α*_2 = Σ_{k=5..8} α_k

[Figure: posterior density p(r1|y) after partitioning: expected posterior probabilities r1 = 73.5% (nonlinear), r2 = 26.5% (linear); p(r1 > r2) = p(r1 > 0.5 | y) = 98.6%.]

Stephan et al. 2009, NeuroImage

Bayesian Model Averaging (BMA)

• abandons dependence of parameter inference on a single model and takes into account model uncertainty

• uses the entire model space considered (or an optimal family of models)

• averages parameter estimates, weighted by posterior model probabilities

• represents a particularly useful alternative
  – when none of the models (or model subspaces) considered clearly outperforms all others
  – when comparing groups for which the optimal model differs

single-subject BMA:
p(θ|y) = Σ_m p(θ | y, m) p(m | y)

group-level BMA:
p(θ_n | y_1..N) = Σ_m p(θ_n | y_n, m) p(m | y_1..N)

NB: p(m | y_1..N) can be obtained by either FFX or RFX BMS
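For a posterior-mean summary of a single parameter, single-subject BMA reduces to a weighted average over models (all numbers below are illustrative):

```python
# Single-subject BMA for one connection strength: posterior means are
# averaged, weighted by posterior model probabilities (numbers illustrative).
post_model_prob = [0.6, 0.3, 0.1]     # p(m | y) for three models
post_mean = [0.45, 0.30, 0.00]        # E[theta | y, m]; 0 where theta absent

bma_mean = sum(pm * th for pm, th in zip(post_model_prob, post_mean))
print(bma_mean)  # ≈ 0.36
```

Note that models in which the parameter is absent contribute zeros, shrinking the average — model uncertainty is built into the estimate.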

Penny et al. 2010, PLoS Comput. Biol.

Prefrontal-parietal connectivity during working memory in schizophrenia

• 17 at-risk mental state (ARMS) individuals
• 21 first-episode patients (13 non-treated)
• 20 controls

Schmidt et al. 2013, JAMA Psychiatry

BMS results for all groups

Schmidt et al. 2013, JAMA Psychiatry

BMA results: PFC → PPC connectivity

17 ARMS, 21 first-episode (13 non-treated), 20 controls

Schmidt et al. 2013, JAMA Psychiatry

Protected exceedance probability: using BMA to protect against chance findings

• EPs express our confidence that the posterior probabilities of models are different – under the hypothesis H1 that models differ in probability: r_k ≠ 1/K

• this does not account for the possibility of the "null hypothesis" H0: r_k = 1/K

• Bayesian omnibus risk (BOR) of wrongly accepting H1 over H0

• protected EP: Bayesian model averaging over H0 and H1

Rigoux et al. 2014, NeuroImage

[Decision tree:]

definition of model space
↓
inference on model structure or inference on model parameters?

• model structure — inference on individual models or on a model space partition?
  – individual models: optimal model structure assumed to be identical across subjects?
    yes → FFX BMS; no → RFX BMS
  – model space partition: comparison of model families using FFX or RFX BMS

• model parameters — parameters of an optimal model or parameters of all models?
  – optimal model: optimal model structure assumed to be identical across subjects?
    yes → FFX analysis of parameter estimates (e.g. BPA); no → RFX analysis of parameter estimates (e.g. t-test, ANOVA)
  – all models: BMA

Stephan et al. 2010, NeuroImage

Further reading

• Penny WD, Stephan KE, Mechelli A, Friston KJ (2004) Comparing dynamic causal models. NeuroImage 22: 1157-1172.
• Penny WD, Stephan KE, Daunizeau J, Joao M, Friston K, Schofield T, Leff AP (2010) Comparing Families of Dynamic Causal Models. PLoS Computational Biology 6: e1000709.
• Penny WD (2012) Comparing dynamic causal models using AIC, BIC and free energy. NeuroImage 59: 319-330.
• Rigoux L, Stephan KE, Friston KJ, Daunizeau J (2014) Bayesian model selection for group studies – revisited. NeuroImage 84: 971-985.
• Stephan KE, Weiskopf N, Drysdale PM, Robinson PA, Friston KJ (2007) Comparing hemodynamic models with DCM. NeuroImage 38: 387-401.
• Stephan KE, Penny WD, Daunizeau J, Moran RJ, Friston KJ (2009) Bayesian model selection for group studies. NeuroImage 46: 1004-1017.
• Stephan KE, Penny WD, Moran RJ, den Ouden HEM, Daunizeau J, Friston KJ (2010) Ten simple rules for Dynamic Causal Modelling. NeuroImage 49: 3099-3109.
• Stephan KE, Iglesias S, Heinzle J, Diaconescu AO (2015) Translational Perspectives for Computational Neuroimaging. Neuron 87: 716-732.
• Stephan KE, Schlagenhauf F, Huys QJM, Raman S, Aponte EA, Brodersen KH, Rigoux L, Moran RJ, Daunizeau J, Dolan RJ, Friston KJ, Heinz A (2017) Computational Neuroimaging Strategies for Single Patient Predictions. NeuroImage 145: 180-199.

Thank you