
Bayesian inference and Bayesian model selection

Klaas Enno Stephan

Lecture as part of "Methods & Models for fMRI analysis", University of Zurich & ETH Zurich, 28 November 2017

With slides from and many thanks to: Kay Brodersen, Will Penny, Sudhir Shankar Raman

Why should I know about Bayesian inference?


Because Bayesian principles are fundamental for

• statistical inference in general

• system identification

• translational neuromodeling ("computational assays") – computational psychiatry – computational neurology

• contemporary theories of brain function (the "Bayesian brain") – predictive coding – free energy principle – active inference

Bayes' theorem

The Reverend Thomas Bayes (1702–1761)

p(θ|y) = p(y|θ) p(θ) / p(y)

"Bayes' theorem describes how an ideally rational person processes information." (Wikipedia)

posterior = likelihood ∙ prior / evidence

Bayes' Theorem

Given data y and , the joint probability is:

p(y,θ) = p(y|θ) p(θ) = p(θ|y) p(y)

Eliminating p(y,) gives Bayes’ rule:

p(θ|y) = p(y|θ) p(θ) / p(y)    (posterior = likelihood × prior / evidence)

Bayesian inference: an animation

Generative models

• specify a joint probability distribution over all variables (observations and parameters)

• require the specification of a likelihood function and a prior: p(y,θ|m) = p(y|θ,m) p(θ|m)

• can be used to randomly generate synthetic data (observations) by sampling from the prior – we can check in advance whether the model can explain certain phenomena at all

• model comparison based on the model evidence: p(y|m) = ∫ p(y|θ,m) p(θ|m) dθ
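To make the last two bullets concrete, here is a minimal Python sketch (entirely illustrative – the linear model, prior, and numbers are invented for this handout, not taken from the lecture) that samples synthetic data from the prior and estimates the model evidence by Monte Carlo integration:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)        # hypothetical design
sigma_e = 0.2                    # known noise SD (assumed)

def log_lik(y, theta):
    # log p(y | theta, m) for the toy model y = theta * x + noise
    resid = y - theta * x
    return (-0.5 * np.sum(resid ** 2) / sigma_e ** 2
            - len(x) * np.log(np.sqrt(2 * np.pi) * sigma_e))

# 1) generate synthetic data by sampling from the prior p(theta|m) = N(0, 1)
theta_sim = rng.normal(0, 1)
y_sim = theta_sim * x + rng.normal(0, sigma_e, len(x))

# 2) Monte Carlo estimate of p(y|m) = integral of p(y|theta,m) p(theta|m) dtheta
theta_samples = rng.normal(0, 1, 50_000)
log_l = np.array([log_lik(y_sim, t) for t in theta_samples])
log_evidence = log_l.max() + np.log(np.mean(np.exp(log_l - log_l.max())))
print(log_evidence)

Principles of Bayesian inference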

 Formulation of a model: likelihood function p(y|θ) and prior distribution p(θ)

 Observation of data

Measurement data y

 Model inversion – updating one's beliefs

p(θ|y) ∝ p(y|θ) p(θ) → maximum a posteriori (MAP) estimates; model evidence

Example of a prior

Priors

Priors can be of different sorts, e.g.

• empirical (previous data)

• empirical (estimated from current data using a hierarchical model → "empirical Bayes")

• uninformative

• principled (e.g., positivity constraints)

• shrinkage

Generative models

Likelihood p(y|θ,m) and prior p(θ|m) together define the generative model. Generative models:

1. enforce mechanistic thinking: how could the data have been caused?
2. generate synthetic data (observations) by sampling from the prior – can the model explain certain phenomena at all?
3. inference about parameters → p(θ|y)
4. inference about model structure → p(y|m) or p(m|y)
5. model evidence: index of model quality

A generative modelling framework for fMRI & EEG: dynamic causal modeling (DCM)

dwMRI, EEG/MEG, fMRI

Forward model: predicting measured activity. Model inversion: estimating neuronal mechanisms.

dx/dt = f(x,u,θ)
y = g(x,θ) + ε

Friston et al. 2003, NeuroImage; Stephan et al. 2009, NeuroImage

DCM for fMRI

Stephan et al. 2015, Neuron

Bayesian system identification

Design experimental inputs: u(t)
Neural dynamics: dx/dt = f(x,u,θ)
Observer function: y = g(x,λ)
Define likelihood model: p(y|θ,m) = N(g(θ), Σ(λ))
Specify priors: p(θ|m) = N(μ_θ, Σ_θ)
Invert model: p(θ|y,m) = p(y|θ,m) p(θ|m) / p(y|m) → inference on parameters
Inference on model structure: p(y|m) = ∫ p(y|θ,m) p(θ|m) dθ
Make inferences
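A toy illustration of this loop (all names and numbers are hypothetical, and real DCM inversion uses variational Bayes rather than the grid used here):

import numpy as np

rng = np.random.default_rng(1)
dt, T = 0.1, 100
u = (np.arange(T) % 20 < 5).astype(float)   # hypothetical experimental input u(t)

def simulate(a, c):
    # neural dynamics dx/dt = a*x + c*u, Euler integration
    x, xs = 0.0, []
    for ut in u:
        x += dt * (a * x + c * ut)
        xs.append(x)
    return np.array(xs)

sigma_e = 0.05
y = simulate(-0.5, 1.0) + rng.normal(0, sigma_e, T)   # observer: y = x + noise

# model inversion on a grid: posterior over 'a' under prior N(-1, 1), c fixed
a_grid = np.linspace(-2.0, 0.5, 101)
log_post = np.array([
    -0.5 * np.sum((y - simulate(a, 1.0)) ** 2) / sigma_e ** 2   # log-likelihood
    - 0.5 * (a + 1.0) ** 2                                      # log-prior
    for a in a_grid
])
post = np.exp(log_post - log_post.max()); post /= post.sum()
print("posterior mean of a:", np.sum(a_grid * post))

Why should I know about Bayesian inference?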

Because Bayesian principles are fundamental for

• statistical inference in general

• system identification

• translational neuromodeling ("computational assays") – computational psychiatry – computational neurology

• contemporary theories of brain function (the "Bayesian brain") – predictive coding – free energy principle – active inference

Generative models as "computational assays"

Computational assays: models of disease mechanisms (likelihood p(y|θ,m), prior p(θ|m); dynamics dx/dt = f(x,u,θ)) → Translational Neuromodeling:

 Application to brain activity and behaviour of individual patients

 Detecting physiological subgroups (based on inferred mechanisms): disease mechanism A, disease mechanism B, disease mechanism C

 Individual treatment prediction

Stephan et al. 2015, Neuron

Differential diagnosis based on generative models of disease symptoms

SYMPTOM y (behaviour or physiology)

likelihood: p(y|θ,m_k)   posterior: p(m_k|y)

HYPOTHETICAL MECHANISMS m_1 ... m_k ... m_K

p(m_k|y) = p(y|m_k) p(m_k) / Σ_k p(y|m_k) p(m_k)

Stephan et al. 2016, NeuroImage

Why should I know about Bayesian inference?

Because Bayesian principles are fundamental for

• statistical inference in general

• system identification

• translational neuromodeling ("computational assays") – computational psychiatry – computational neurology

• contemporary theories of brain function (the "Bayesian brain") – predictive coding – free energy principle – active inference

Perception = inversion of a hierarchical generative model

environmental states, neuronal states, others' mental states, bodily states

forward model: p(y|x,m) · p(x|m)

perception = posterior: p(x|y,m)

Example: free-energy principle and active inference

prediction error = sensations – predictions

Action: change sensory input. Perception: change predictions.

Maximizing the evidence (of the brain's generative model) = minimizing the surprise about the data (sensory inputs).

Friston et al. 2006, J Physiol Paris

How is the posterior computed = how is a generative model inverted?

Bayesian inference

• analytical solutions
• approximate inference: variational Bayes, MCMC sampling

How is the posterior computed = how is a generative model inverted?

• compute the posterior analytically – requires conjugate priors – even then, it is often difficult to derive an analytical solution

• variational Bayes (VB) – often hard work to derive, but fast to compute – caveat: local minima, potentially inaccurate approximations

• sampling methods (MCMC) – guaranteed to be accurate in theory (for infinite computation time) – but may require very long run times in practice – convergence difficult to prove

Conjugate priors

If the posterior p(θ|y) is in the same family as the prior p(θ), the prior and posterior are called "conjugate distributions", and the prior is called a "conjugate prior" for the likelihood function.

p(θ|y) = p(y|θ) p(θ) / p(y)

same form → analytical expression for the posterior → examples (likelihood–prior): • Normal–Normal • Normal–inverse Gamma • Binomial–Beta • Multinomial–Dirichlet

Posterior mean & variance of univariate Gaussians

Likelihood & prior:
p(y|θ) = N(y; θ, σ_e²)
p(θ) = N(θ; μ_p, σ_p²)

Posterior: p(θ|y) = N(θ; μ, σ²), with
1/σ² = 1/σ_e² + 1/σ_p²
μ = σ² (μ_p/σ_p² + y/σ_e²)

Posterior mean = variance-weighted combination of prior mean and data mean

Same thing – but expressed as precision weighting

Likelihood & prior 1 y     p( y | ) N (  , e ) pN()(,)   1 pp   Posterior: pyN(|)(,)  1 Posterior Likelihood   e  p  p Prior     e   p    p

Relative precision weighting
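In code, the precision weighting is essentially one line (function name and numbers are invented for illustration):

def gaussian_posterior(y, mu_p, lam_p, lam_e):
    # posterior for likelihood N(y; theta, 1/lam_e) and prior N(theta; mu_p, 1/lam_p)
    lam = lam_e + lam_p                      # precisions add
    mu = (lam_e * y + lam_p * mu_p) / lam    # precision-weighted mean
    return mu, lam

# data four times more precise than the prior -> posterior mean sits close to y
print(gaussian_posterior(y=2.0, mu_p=0.0, lam_p=1.0, lam_e=4.0))   # (1.6, 5.0)

Variational Bayes (VB)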

Idea: find an approximate density q(θ) that is maximally similar to the true posterior p(θ|y). This is often done by assuming a particular parametric form for q (fixed-form VB) and then optimizing its sufficient statistics.

true posterior p(θ|y) ← KL[q‖p] → best proxy q(θ), chosen from a class of approximate densities

Kullback–Leibler (KL) divergence

• asymmetric measure of the difference between two probability distributions P and Q

• interpretations of DKL(P‖Q): – "Bayesian surprise" when Q = prior, P = posterior: a measure of the information gained when one updates one's prior beliefs to the posterior P – a measure of the information lost when Q is used to approximate P

• non-negative: 0 (zero when P=Q) Variational calculus

Standard calculus (Newton, Leibniz, and others):
• functions f: x ↦ f(x)
• derivatives df/dx
• example: maximize the likelihood expression p(y|θ) w.r.t. θ

Variational calculus (Euler, Lagrange, and others):
• functionals F: f ↦ F[f]
• derivatives dF/df
• example: maximize the entropy H[p] w.r.t. a probability distribution p(x)

Leonhard Euler (1707–1783), Swiss mathematician: 'Elementa Calculi Variationum'

Variational Bayes

ln p(y) = KL[q‖p] + F(q, y)

KL[q‖p] ≥ 0: divergence (unknown). F(q, y): neg. free energy (easy to evaluate for a given q).

F(q, y) is a functional of the approximate posterior q(θ). Maximizing F(q, y) is equivalent to:
• minimizing KL[q‖p]
• tightening F(q, y) as a lower bound to the log model evidence ln p(y)

When F(q, y) is maximized (iterating from initialization to convergence), q(θ) is our best estimate of the posterior.

Derivation of the (negative) free energy approximation

• See whiteboard! • (or the Appendix of Stephan et al. 2007, NeuroImage 38: 387-401)

Mean field assumption

Factorize the approximate posterior q(θ) into independent partitions:

q(θ) = ∏_i q_i(θ_i)

where q_i(θ_i) is the approximate posterior for the i-th subset of parameters.

For example, split the parameters into two subsets θ_1 and θ_2:

p(θ_1, θ_2 | y) ≈ q(θ_1, θ_2) = q(θ_1) q(θ_2)

Jean Daunizeau, www.fil.ion.ucl.ac.uk/~jdaunize/presentations/Bayes2.pdf

VB in a nutshell (under mean-field approximation)

 Neg. free energy is an approximation to the log model evidence:
ln p(y|m) = F(q, y) + KL[q(θ), p(θ|y,m)]
F(q, y) = ⟨ln p(y|θ,m)⟩_q − KL[q(θ), p(θ|m)]

 Mean field approximation: p(θ_1, θ_2 | y) ≈ q(θ) = q(θ_1) q(θ_2)

 Maximise neg. free energy w.r.t. q = minimise divergence, by maximising the variational energies:
q(θ_1) ∝ exp( I(θ_1) ) = exp( ⟨ln p(y, θ)⟩_{q(θ_2)} )
q(θ_2) ∝ exp( I(θ_2) ) = exp( ⟨ln p(y, θ)⟩_{q(θ_1)} )

 Iterative updating of the sufficient statistics of the approximate posteriors by gradient ascent.

VB (under mean-field assumption) in more detail

Model comparison and selection

Given competing hypotheses on structure & functional mechanisms of a system, which model is the best?

Which model represents the best balance between model fit and model complexity?

For which model m does p(y|m) become maximal?

Pitt & Myung (2002) TICS

Bayesian model selection (BMS)

Model evidence (marginal likelihood):

p(y|m) = ∫ p(y|θ,m) p(θ|m) dθ

The model evidence accounts for both accuracy and complexity of the model. (Figure: p(y|m) plotted over all possible datasets y; Ghahramani, 2004)

"If I randomly sampled from my prior and plugged the resulting value into the likelihood function, how close would the predicted data be – on average – to my observed data?"

Various approximations, e.g.: negative free energy, AIC, BIC

MacKay 1992, Neural Comput.; Penny et al. 2004a, NeuroImage

Model space (hypothesis set) M

Model space M is defined by the prior on models. Usual choice: a flat prior over a small set of models:

p(m) = 1/|M| if m ∈ M, 0 if m ∉ M

In this case, the posterior probability of model i is:

p(m_i|y) = p(y|m_i) p(m_i) / Σ_{j=1..M} p(y|m_j) p(m_j) = p(y|m_i) / Σ_{j=1..M} p(y|m_j)
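In practice one works with log evidences, so p(m_i|y) is computed with a log-sum-exp for numerical stability; a sketch with invented numbers:

import numpy as np

def posterior_model_probs(log_ev, log_prior=None):
    # p(m_i|y) from log evidences; flat prior over models if none is given
    log_ev = np.asarray(log_ev, float)
    if log_prior is None:
        log_prior = np.full_like(log_ev, -np.log(len(log_ev)))
    w = log_ev + log_prior
    w -= w.max()          # log-sum-exp trick: subtract the max before exponentiating
    p = np.exp(w)
    return p / p.sum()

print(posterior_model_probs([-120.3, -115.1, -118.0]))

Differential diagnosis based on generative models of disease symptoms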

SYMPTOM y (behaviour or physiology)

likelihood: p(y|θ,m_k)   posterior: p(m_k|y)

HYPOTHETICAL MECHANISMS m_1 ... m_k ... m_K

p(m_k|y) = p(y|m_k) p(m_k) / Σ_k p(y|m_k) p(m_k)

Stephan et al. 2016, NeuroImage

Approximations to the model evidence

The logarithm is a monotonic function: maximizing the log model evidence = maximizing the model evidence.

Log model evidence = balance between fit and complexity:
log p(y|m) = accuracy(m) − complexity(m) = log p(y|θ,m) − complexity(m)

Akaike Information Criterion: AIC = log p(y|θ,m) − p, where p = number of parameters
Bayesian Information Criterion: BIC = log p(y|θ,m) − (p/2) log N, where N = number of data points
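Transcribing these two criteria directly (note the log-evidence scale used here, where higher is better, rather than the common −2·log-likelihood convention; numbers invented):

import numpy as np

def aic(log_lik, n_params):
    # AIC on the log-evidence scale used in the slides (higher is better)
    return log_lik - n_params

def bic(log_lik, n_params, n_data):
    # BIC on the same scale
    return log_lik - 0.5 * n_params * np.log(n_data)

print(aic(-100.0, 10), bic(-100.0, 10, 200))   # -110.0, approx. -126.49

The (negative) free energy approximation F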

F is a lower bound on the log model evidence:

log p(y|m) = F + KL[q(θ), p(θ|y,m)]

Like AIC/BIC, F is an accuracy/complexity tradeoff:

F = ⟨log p(y|θ,m)⟩_q − KL[q(θ), p(θ|m)] = accuracy − complexity

• The log evidence is thus the expected log likelihood (w.r.t. q) plus two KL terms:

log p(y|m) = ⟨log p(y|θ,m)⟩_q − KL[q(θ), p(θ|m)] + KL[q(θ), p(θ|y,m)]

F = log p(y|m) − KL[q(θ), p(θ|y,m)] = ⟨log p(y|θ,m)⟩_q − KL[q(θ), p(θ|m)]
  = accuracy − complexity

The complexity term in F

• In contrast to AIC & BIC, the complexity term of the negative free energy F accounts for interdependencies among parameters. Under Gaussian assumptions about the posterior (Laplace approximation):

KL[q(θ), p(θ|m)] = ½ ln|C_θ| − ½ ln|C_θ|y| + ½ (μ_θ|y − μ_θ)ᵀ C_θ⁻¹ (μ_θ|y − μ_θ)

• The complexity term of F is higher – the more independent the prior parameters (↑ effective DFs) – the more dependent the posterior parameters – the more the posterior mean deviates from the prior mean
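For reference, a sketch of the full KL divergence between two multivariate Gaussians – the slide highlights its log-determinant and mean-difference terms; the complete expression also carries a trace term:

import numpy as np

def kl_gaussians(mu_q, C_q, mu_p, C_p):
    # KL[ N(mu_q, C_q) || N(mu_p, C_p) ], e.g. posterior q vs. prior p
    d = np.asarray(mu_q, float) - np.asarray(mu_p, float)
    C_p_inv = np.linalg.inv(C_p)
    return 0.5 * (np.log(np.linalg.det(C_p) / np.linalg.det(C_q))
                  + np.trace(C_p_inv @ C_q)
                  - len(d)
                  + d @ C_p_inv @ d)

# a posterior that has shrunk and moved away from the prior costs complexity
print(kl_gaussians([0.5, 0.2], np.eye(2) * 0.1, [0.0, 0.0], np.eye(2)))

Bayes factors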

To compare two models, we could just compare their log evidences.

But: the log evidence is just some number – not very intuitive!

A more intuitive interpretation of model comparisons is made possible by Bayes factors (a positive value in [0; ∞[):

B_12 = p(y|m_1) / p(y|m_2)

Kass & Raftery classification:

B_12        p(m_1|y)   Evidence
1 to 3      50–75%     weak
3 to 20     75–95%     positive
20 to 150   95–99%     strong
≥ 150       ≥ 99%      very strong

Kass & Raftery 1995, J. Am. Stat. Assoc.

Fixed effects BMS at group level

Group Bayes factor (GBF) for 1...K subjects:

GBF_ij = ∏_k BF_ij^(k)

Average Bayes factor (ABF):

ABF_ij = ( ∏_k BF_ij^(k) )^(1/K)

Problems:
– blind with regard to group heterogeneity
– sensitive to outliers
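In log space the GBF is a sum and the ABF a mean of per-subject log Bayes factors, which makes the outlier problem easy to see (log evidences invented):

import numpy as np

def ffx_group_bf(log_ev_1, log_ev_2):
    # per-subject log Bayes factors for model 1 vs. model 2
    log_bf = np.asarray(log_ev_1, float) - np.asarray(log_ev_2, float)
    return log_bf.sum(), log_bf.mean()        # log GBF, log ABF

# the third (outlier) subject dominates the group result
print(ffx_group_bf([-100, -98, -130], [-102, -99, -101]))

Random effects BMS for heterogeneous groups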

r ~ Dir(r; α) – Dirichlet distribution of model probabilities r; the Dirichlet parameters α = "occurrences" of models in the population

m_n ~ Mult(m; 1, r) – multinomial distribution of model labels m

y_n ~ p(y_n | m_n) – measured data y; model inversion by variational Bayes or MCMC

Stephan et al. 2009, NeuroImage

Random effects BMS for heterogeneous groups

Hierarchical model (plate notation): α → r_k (k = 1...K), the Dirichlet distribution of model probabilities; m_nk, the multinomial distribution of model labels; y_n (n = 1...N), the measured data, with model inversion by variational Bayes (VB) or MCMC.

Stephan et al. 2009, NeuroImage

Random effects BMS

1 k 1 p(|), rDirrr   k Z   k r ~ Dir(r;)  Z()  kk  k k

mk ~ p(mk | p) mnk p(|) mrrnk  m1 ~ Mult(m;1,r) k

y1 ~ p(y1 | m1) y ~ p(y | m ) p( y | m ) p ( y | ) p (  | m ) d  2 2 2 n nk nk y1 ~ p(y1 | m1) Stephan et al. 2009, NeuroImage  Write down joint probability and take the log

pyrm , ,||( | pympmrpr )    0   p( rp | y0 )|| m p mn r nn   n

1   0knk1 m   rpkn y nk mr  |  Z 0  knk 

1 mnk  p y m rr 0k 1  n nk kk Z 0  nk

lnpyrm , ,   ln Z00  k  1 ln rmpym k  nk log n | nk  ln r k  nk  Mean field approx. qrmqrqm(,)()() 

 Maximise neg. free q( rI )exp r    energy wrt. q = minimise divergence, q( mI )exp m    by maximising variational energies I rp ylog, r m ,     qm() I mp ylog, r m ,     qr()  Iterative updating of sufficient statistics of approx. posteriors

 0 0  [1,,1] Until convergence

 unkexp ln p ( y n | m nk )   k    k k

unk gnk  unk our (normalized) gqnknk m(1) k posterior that   g model k generated the k nk data from subject n n    0 k  g nk expected number of n subjects whose data we end believe were generated by model k Four equivalent options for reporting model by random effects BMS
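A compact sketch of these updates (written from the equations above, in the spirit of SPM's spm_BMS routine but simplified, with invented data):

import numpy as np
from scipy.special import digamma

def rfx_bms(log_ev, alpha0=None, tol=1e-6, max_iter=200):
    # Variational RFX BMS following Stephan et al. (2009); simplified sketch.
    # log_ev: (N subjects x K models) array of log model evidences ln p(y_n|m_k).
    N, K = log_ev.shape
    a0 = np.ones(K) if alpha0 is None else np.asarray(alpha0, float)
    alpha = a0.copy()
    for _ in range(max_iter):
        # u_nk = exp( ln p(y_n|m_k) + Psi(alpha_k) - Psi(sum_k alpha_k) )
        log_u = log_ev + digamma(alpha) - digamma(alpha.sum())
        log_u -= log_u.max(axis=1, keepdims=True)       # avoid overflow
        g = np.exp(log_u)
        g /= g.sum(axis=1, keepdims=True)               # posterior assignments g_nk
        beta = g.sum(axis=0)                            # expected counts per model
        alpha_new = a0 + beta
        if np.max(np.abs(alpha_new - alpha)) < tol:
            return alpha_new, g
        alpha = alpha_new
    return alpha, g

# invented log evidences for 6 subjects x 2 models
log_ev = np.array([[-98., -101.], [-95., -97.], [-102., -100.],
                   [-99., -104.], [-97., -99.], [-96., -100.]])
alpha, g = rfx_bms(log_ev)
print(alpha, alpha / alpha.sum())    # Dirichlet estimates and expected r_k

Four equivalent options for reporting model ranking by random effects BMS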

1. Dirichlet parameter estimates α

2. expected posterior probability of obtaining the k-th model for any randomly selected subject: ⟨r_k⟩_q = α_k / (α_1 + ... + α_K)

3. exceedance probability that a particular model k is more likely than any other model (of the K models tested), given the group data: φ_k = p( ∀j ∈ {1...K | j ≠ k}: r_k > r_j | y; α )

4. protected exceedance probability: see below
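For K > 2 the exceedance probabilities have no simple closed form, but sampling from the Dirichlet posterior suffices; a sketch whose α values anticipate the vision example that follows:

import numpy as np

def exceedance_probs(alpha, n_samples=100_000, seed=0):
    # Monte Carlo estimate of phi_k = p(r_k > r_j for all j != k | y; alpha)
    rng = np.random.default_rng(seed)
    r = rng.dirichlet(alpha, size=n_samples)
    winners = r.argmax(axis=1)
    return np.bincount(winners, minlength=len(alpha)) / n_samples

print(exceedance_probs([11.8, 2.2]))   # approx. [0.997, 0.003]

Example: Hemispheric interactions during vision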

Figure: two DCMs (m1, m2) of interhemispheric interactions between LG, MOG and FG during a letter decision (LD) task with RVF/LVF stimulation; the models differ in whether interhemispheric connections are modulated by LD per se or by LD conditional on the visual field of stimulation (LD|LVF, LD|RVF).

Data: Stephan et al. 2003. Models: Stephan et al. 2007, J. Neurosci.

Results (log model evidence differences across subjects, and the posterior p(r_1|y)):
α_1 = 11.8, α_2 = 2.2
⟨r_1⟩ = 84.3%, ⟨r_2⟩ = 15.7%
p(r_1 > r_2) = 99.7%; p(r_1 > 0.5 | y) = 0.997

Stephan et al. 2009a, NeuroImage

Example: Synaesthesia

• “projectors” experience color externally colocalized with a presented grapheme

• “associators” report an internally evoked association

• across all subjects: no evidence for either model

• but BMS results map precisely onto projectors (bottom-up mechanisms) and associators (top-down)

van Leeuwen et al. 2011, J. Neurosci.

Protected exceedance probability: using BMA to protect against chance findings

• EPs express our confidence that the posterior probabilities of models are

different – under the hypothesis H1 that models differ in probability: r_k ≠ 1/K

• this does not account for the possibility of the "null hypothesis" H0: r_k = 1/K

• Bayesian omnibus risk (BOR): the posterior probability of wrongly accepting H1 over H0

• protected EP: Bayesian model averaging over H0 and H1: φ̃_k = φ_k · (1 − BOR) + BOR/K

Rigoux et al. 2014, NeuroImage

Figure: decision tree for model-based inference:

definition of model space →
inference on model structure or inference on model parameters?

• inference on model structure → inference on individual models or model space partition?
– individual models: optimal model structure assumed to be identical across subjects? yes → FFX BMS; no → RFX BMS
– model space partition: comparison of model families using FFX or RFX BMS

• inference on model parameters → parameters of an optimal model or parameters of all models?
– all models: BMA
– optimal model: optimal model structure assumed to be identical across subjects? yes → FFX BMS, then FFX analysis of parameter estimates (e.g. BPA); no → RFX BMS, then RFX analysis of parameter estimates (e.g. t-test, ANOVA)

Stephan et al. 2010, NeuroImage

Further reading

• Penny WD, Stephan KE, Mechelli A, Friston KJ (2004) Comparing dynamic causal models. NeuroImage 22: 1157-1172.
• Penny WD, Stephan KE, Daunizeau J, Joao M, Friston K, Schofield T, Leff AP (2010) Comparing Families of Dynamic Causal Models. PLoS Computational Biology 6: e1000709.
• Penny WD (2012) Comparing dynamic causal models using AIC, BIC and free energy. NeuroImage 59: 319-330.
• Rigoux L, Stephan KE, Friston KJ, Daunizeau J (2014) Bayesian model selection for group studies – revisited. NeuroImage 84: 971-985.
• Stephan KE, Weiskopf N, Drysdale PM, Robinson PA, Friston KJ (2007) Comparing hemodynamic models with DCM. NeuroImage 38: 387-401.
• Stephan KE, Penny WD, Daunizeau J, Moran RJ, Friston KJ (2009) Bayesian model selection for group studies. NeuroImage 46: 1004-1017.
• Stephan KE, Penny WD, Moran RJ, den Ouden HEM, Daunizeau J, Friston KJ (2010) Ten simple rules for Dynamic Causal Modelling. NeuroImage 49: 3099-3109.
• Stephan KE, Iglesias S, Heinzle J, Diaconescu AO (2015) Translational Perspectives for Computational Neuroimaging. Neuron 87: 716-732.
• Stephan KE, Schlagenhauf F, Huys QJM, Raman S, Aponte EA, Brodersen KH, Rigoux L, Moran RJ, Daunizeau J, Dolan RJ, Friston KJ, Heinz A (2017) Computational Neuroimaging Strategies for Single Patient Predictions. NeuroImage 145: 180-199.

Thank you