
Bayesian inference and generative models

Klaas Enno Stephan Lecture as part of "Methods & Models for fMRI data analysis", University of Zurich & ETH Zurich, 22 November 2016

Slides with a yellow title were not covered in detail in the lecture and will not be part of the exam.

With slides from and many thanks to: Kay Brodersen, Will Penny, Sudhir Shankar Raman

Why should I know about Bayesian inference?

Because Bayesian principles are fundamental for

• statistical inference in general

• system identification

• translational neuromodeling ("computational assays") – computational psychiatry – computational neurology

• contemporary theories of brain function (the "Bayesian brain") – predictive coding – free energy principle – active inference

Bayes' Theorem

p(θ|y) = p(y|θ) p(θ) / p(y)

(posterior = likelihood × prior / evidence)

Reverend Thomas Bayes 1702–1761
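As a minimal numeric illustration of how the theorem combines likelihood, prior, and evidence (all numbers here are hypothetical), consider a diagnostic test:

```python
# Minimal numeric illustration of Bayes' theorem with hypothetical numbers:
# a diagnostic test with 90% sensitivity, 95% specificity, 1% prevalence.
prior = 0.01                  # p(disease)
likelihood = 0.90             # p(positive | disease)
false_pos = 0.05              # p(positive | no disease)

evidence = likelihood * prior + false_pos * (1 - prior)   # p(positive)
posterior = likelihood * prior / evidence                 # p(disease | positive)
print(round(posterior, 3))    # 0.154
```

Despite the seemingly accurate test, the posterior stays low because the prior (prevalence) is low — exactly the prior-weighting that Bayes' theorem formalizes.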

"Bayes' Theorem describes how an ideally rational person processes information." (Wikipedia)

Bayesian inference: an animation

Generative models

• specify a joint probability distribution over all variables (observations and parameters)

• require a likelihood function and a prior: p(y,θ|m) = p(y|θ,m) p(θ|m)

• can be used to randomly generate synthetic data (observations) by sampling from the prior – we can check in advance whether the model can explain certain phenomena at all

• allow model comparison based on the model evidence: p(y|m) = ∫ p(y|θ,m) p(θ|m) dθ

Principles of Bayesian inference

• Formulation of a generative model: likelihood function p(y|θ) and prior distribution p(θ)

• Observation of data: measurement data y

• Model inversion – updating one's beliefs

p(θ|y) ∝ p(y|θ) p(θ) → maximum a posteriori (MAP) estimates, model evidence

Example of a shrinkage prior

Priors

Priors can be of different sorts, e.g.

• empirical (previous data)

• empirical (estimated from current data using a hierarchical model → "empirical Bayes")

• uninformative

• principled (e.g., positivity constraints)

• shrinkage priors

Advantages of generative models

• describe how observed data were generated by hidden mechanisms

• we can check in advance whether a model can explain certain phenomena at all

• force us to think mechanistically and be explicit about pathophysiological theories

• formal framework for differential diagnosis: statistical comparison of competing generative models, each of which provides a different explanation of measured brain activity or clinical symptoms

A generative modelling framework for fMRI & EEG: Dynamic causal modeling (DCM)

dwMRI, EEG/MEG, fMRI

Forward model: predicting measured activity. Model inversion: estimating neuronal mechanisms.

dx/dt = f(x, u, θ)
y = g(x, θ) + ε

Friston et al. 2003, NeuroImage; Stephan et al. 2009, NeuroImage

[Figure: neuronal activity x(t) passes through a hemodynamic model (λ) to generate predicted BOLD signals y]

[Figure: neuronal states x1(t), x2(t), x3(t) with driving input u1(t) and modulatory input u2(t), integrated via the neuronal state equation]

Neuronal state equation:

dx/dt = (A + Σ_j u_j B^(j)) x + C u

A = ∂ẋ/∂x: endogenous connectivity
B^(j) = ∂/∂u_j (∂ẋ/∂x): modulation of connectivity
C = ∂ẋ/∂u: direct (driving) inputs
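As an illustration of what the neuronal state equation describes, here is a minimal forward-Euler integration of a hypothetical two-region system. All connectivity values are invented for illustration; actual DCM uses more sophisticated integration and adds the hemodynamic forward model on top.

```python
# Euler integration of the DCM neuronal state equation
#   dx/dt = (A + u2 * B2) x + C u1
# for two regions; all connectivity values are hypothetical.
A  = [[-1.0,  0.0],
      [ 0.4, -1.0]]          # endogenous connectivity (self-decay on diagonal)
B2 = [[ 0.0,  0.0],
      [ 0.6,  0.0]]          # modulation of the 1->2 connection by input u2
C  = [1.0, 0.0]              # driving input u1 enters region 1 only

def dxdt(x, u1, u2):
    # effective connectivity at this moment: A + u2 * B2
    J = [[A[i][j] + u2 * B2[i][j] for j in range(2)] for i in range(2)]
    return [sum(J[i][j] * x[j] for j in range(2)) + C[i] * u1
            for i in range(2)]

x, dt = [0.0, 0.0], 0.01
for step in range(1000):                 # 10 s of simulated time
    t = step * dt
    u1 = 1.0 if t < 1.0 else 0.0         # brief driving stimulus
    u2 = 1.0 if 2.0 < t < 6.0 else 0.0   # modulatory context input
    dx = dxdt(x, u1, u2)
    x = [x[i] + dt * dx[i] for i in range(2)]
print([round(v, 4) for v in x])
```

With the inputs switched off after 6 s and negative self-connections, the states decay back toward zero, as expected for a stable system.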

Hemodynamic model (Balloon model):

Neuronal states x_i(t) drive a vasodilatory signal s and blood flow f (rCBF):
ds/dt = x − κs − γ(f − 1)
df/dt = s

Changes in blood volume ν(t) and deoxyhemoglobin content q(t):
τ dν/dt = f − ν^(1/α)
τ dq/dt = f·E(f, E0)/E0 − ν^(1/α)·q/ν

BOLD signal change equation:
y = V0 [ k1(1 − q) + k2(1 − q/ν) + k3(1 − ν) ] + e
with k1 = 4.3·ϑ0·E0·TE, k2 = ε·r0·E0·TE, k3 = 1 − ε

Stephan et al. 2015, Neuron

[Figure: simulated neural population activity x1, x2, x3 driven by inputs u1, u2, and the resulting fMRI signal changes (%), over 100 s]

Nonlinear Dynamic Causal Model for fMRI:

dx/dt = (A + Σ_{i=1..m} u_i B^(i) + Σ_{j=1..n} x_j D^(j)) x + C u

Stephan et al. 2008, NeuroImage

Why should I know about Bayesian inference?

Because Bayesian principles are fundamental for

• statistical inference in general

• system identification

• translational neuromodeling ("computational assays") – computational psychiatry – computational neurology

• contemporary theories of brain function (the "Bayesian brain") – predictive coding – free energy principle – active inference

Generative models as "computational assays"

likelihood p(y|θ,m), prior p(θ|m) → posterior p(θ|y,m)

Differential diagnosis based on generative models of disease symptoms

SYMPTOM y (behaviour or physiology)

pym(|,)  k pmy(|) k

HYPOTHETICAL ...... MECHANISM m1 mk mK

p( y | mkk ) p ( m ) pm(k | y)   p( y | mkk ) p ( m ) k

Stephan et al. 2016, NeuroImage

Computational assays: models of disease mechanisms → Translational Neuromodeling

dx fxu(,,)  dt

• Individual treatment prediction

• Detecting physiological subgroups (based on inferred mechanisms): disease mechanisms A, B, C

• Application to brain activity and behaviour of individual patients

Stephan et al. 2015, Neuron

Perception = inversion of a hierarchical generative model

hidden states: environmental states, neuronal states, others' mental states, bodily states

forward model: p(y|x,m) p(x|m)

perception: p(x|y,m)

Example: free-energy principle and active inference

Prediction error = sensations – predictions

Action: change sensory input. Perception: change predictions.

Maximizing the evidence (of the brain's generative model) = minimizing the surprise about the data (sensory inputs). Friston et al. 2006, J Physiol Paris

How is the posterior computed = how is a generative model inverted?

Bayesian inference: analytical solutions vs. approximate inference (variational Bayes, MCMC sampling)

How is the posterior computed = how is a generative model inverted?

• compute the posterior analytically – requires conjugate priors – even then often difficult to derive an analytical solution

• variational Bayes (VB) – often hard work to derive, but fast to compute – caveat: local minima, potentially inaccurate approximations

• sampling methods (MCMC) – guaranteed to be accurate in theory (for infinite computation time) – but may require very long run time in practice

Conjugate priors

If the posterior p(θ|y) is in the same family as the prior p(θ), the prior and posterior are called "conjugate distributions", and the prior is called a "conjugate prior" for the likelihood function.

p(θ|y) = p(y|θ) p(θ) / p(y)

same form → analytical expression for posterior. Examples (likelihood–prior): Normal–Normal, Normal–inverse Gamma, Binomial–Beta, Multinomial–Dirichlet.

Posterior of univariate Gaussians
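One pair from the list, Binomial–Beta, can be sketched in a few lines; the prior pseudo-counts and data below are hypothetical:

```python
# Binomial likelihood with a Beta prior: the posterior is again a Beta
# distribution, Beta(a + k, b + n - k), after k successes in n trials.
a, b = 2.0, 2.0        # Beta prior (hypothetical pseudo-counts)
k, n = 7, 10           # observed: 7 successes in 10 trials

a_post, b_post = a + k, b + (n - k)
posterior_mean = a_post / (a_post + b_post)
print(posterior_mean)   # 9/14, approximately 0.643
```

No integration is needed: conjugacy reduces Bayesian updating to adding counts to the prior's parameters.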

Likelihood & prior:
p(y|θ) = N(y; θ, σe²)
p(θ) = N(θ; μp, σp²)

Posterior: p(θ|y) = N(θ; μ, σ²) with
1/σ² = 1/σe² + 1/σp²
μ = σ² (μp/σp² + y/σe²)

Posterior mean = variance-weighted combination of prior mean and data mean

Same thing – but expressed as precision weighting

Likelihood & prior:
p(y|θ) = N(y; θ, λe⁻¹)
p(θ) = N(θ; μp, λp⁻¹)

Posterior: p(θ|y) = N(θ; μ, λ⁻¹) with
λ = λe + λp
μ = (λe/λ) y + (λp/λ) μp
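A minimal numeric sketch of this precision-weighted update (all values hypothetical):

```python
# Precision-weighted combination of a Gaussian prior and a Gaussian
# likelihood: the posterior precision is the sum of the two precisions,
# and the posterior mean is the precision-weighted average of prior mean
# and datum. All numbers are hypothetical.
mu_p, lam_p = 0.0, 1.0     # prior mean and precision
y,    lam_e = 2.0, 4.0     # datum and likelihood (noise) precision

lam = lam_e + lam_p                             # posterior precision
mu  = (lam_e / lam) * y + (lam_p / lam) * mu_p  # posterior mean
print(mu, 1.0 / lam)       # 1.6 0.2
```

Because the likelihood is four times more precise than the prior here, the posterior mean (1.6) sits much closer to the datum (2.0) than to the prior mean (0.0).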

Relative precision weighting

Variational Bayes (VB)

Idea: find an approximate density q(θ) that is maximally similar to the true posterior p(θ|y). This is often done by assuming a particular form for q (fixed-form VB) and then optimizing its sufficient statistics.

[Figure: within a hypothesis class, the best proxy q(θ) minimizes KL(q‖p) to the true posterior p(θ|y)]

Kullback–Leibler (KL) divergence

• non-symmetric measure of the difference between two probability distributions P and Q

• DKL(P‖Q) = a measure of the information lost when Q is used to approximate P: the expected number of extra bits required to code samples from P when using a code optimized for Q, rather than using the true code optimized for P.

Variational calculus
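For discrete distributions, the KL divergence and its non-symmetry can be computed directly; the two distributions below are hypothetical:

```python
import math

# KL divergence between two discrete distributions; note the asymmetry:
# D_KL(P||Q) != D_KL(Q||P) in general. Numbers are hypothetical.
def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.4, 0.1]
Q = [1/3, 1/3, 1/3]
print(kl(P, Q), kl(Q, P))   # both nonnegative, but not equal
```

Gibbs' inequality guarantees both values are ≥ 0, with equality only when the distributions coincide.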

Standard calculus (Newton, Leibniz, and others): functions f: x ↦ f(x); derivatives df/dx. Example: maximize the likelihood expression p(y|θ) w.r.t. θ.

Variational calculus (Euler, Lagrange, and others): functionals F: f ↦ F[f]; functional derivatives dF/df. Example: maximize the entropy H[p] w.r.t. a probability distribution p(x).

Leonhard Euler (1707–1783), Swiss mathematician, 'Elementa Calculi Variationum'

ln p(y) = KL[q‖p] + F(q, y)

KL[q‖p] ≥ 0: divergence between approximate and true posterior (unknown)
F(q, y): negative free energy (easy to evaluate for a given q)

F(q, y) is a functional of the approximate posterior q(θ). Maximizing F(q, y) is equivalent to:
• minimizing KL[q‖p]
• tightening F(q, y) as a lower bound to the log model evidence

From initialization to convergence: when F(q, y) is maximized, q(θ) is our best estimate of the posterior.

Derivation of the (negative) free energy approximation
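The decomposition ln p(y) = KL[q‖p] + F(q, y) can be checked numerically for a discrete toy model; the identity holds for any choice of q (all numbers hypothetical):

```python
import math

# Numeric check of ln p(y) = KL[q || p(theta|y)] + F(q, y) for a discrete
# model with two parameter values. Joint probabilities are hypothetical.
joint = {0: 0.10, 1: 0.30}                      # p(y, theta) at the observed y
p_y = sum(joint.values())                       # model evidence p(y)
post = {t: j / p_y for t, j in joint.items()}   # exact posterior p(theta|y)

q = {0: 0.5, 1: 0.5}                            # arbitrary approximate posterior

F  = sum(q[t] * math.log(joint[t] / q[t]) for t in q)   # neg. free energy
KL = sum(q[t] * math.log(q[t] / post[t]) for t in q)    # divergence to posterior
assert abs(math.log(p_y) - (F + KL)) < 1e-12            # identity holds exactly
print(F, KL)
```

Since KL ≥ 0, F is always a lower bound on ln p(y), and improving q raises F toward the log evidence.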

• See whiteboard! (or the Appendix of Stephan et al. 2007, NeuroImage 38: 387-401)

Mean field assumption

Factorize the approximate posterior q(θ) into independent partitions:

q(θ) = Π_i q_i(θ_i)

where q_i(θ_i) is the approximate posterior for the i-th subset of parameters.

For example, split parameters and hyperparameters:

p(θ, λ | y) ≈ q(θ, λ) = q(θ) q(λ)

Jean Daunizeau, www.fil.ion.ucl.ac.uk/~jdaunize/presentations/Bayes2.pdf

VB in a nutshell (under mean-field approximation)

• Neg. free energy approximation to model evidence:
ln p(y|m) = F + KL[q(θ,λ), p(θ,λ|y,m)]
F = ⟨ln p(y,θ,λ|m)⟩_q − ⟨ln q(θ,λ)⟩_q

• Mean field approximation:
p(θ,λ|y) ≈ q(θ,λ) = q(θ) q(λ)

• Maximise neg. free energy w.r.t. q = minimise divergence, by maximising the variational energies:
q(θ) ∝ exp(I(θ)) = exp(⟨ln p(y,θ,λ)⟩_q(λ))
q(λ) ∝ exp(I(λ)) = exp(⟨ln p(y,θ,λ)⟩_q(θ))

• Iterative updating of sufficient statistics of the approximate posteriors by gradient ascent.

VB (under mean-field assumption) in more detail

Model comparison and selection

Given competing hypotheses on structure & functional mechanisms of a system, which model is the best?

Which model represents the best balance between model fit and model complexity?

For which model m does p(y|m) become maximal? Pitt & Myung (2002) TICS

Bayesian model selection (BMS)

Model evidence (marginal likelihood): Ghahramani, 2004

p(y|m) = ∫ p(y|θ,m) p(θ|m) dθ

accounts for both accuracy and complexity of the model

[Figure: p(y|m) plotted over all possible datasets y]

"If I randomly sampled from my prior and plugged the resulting value into the likelihood function, how close would the predicted data be – on average – to my observed data?"

Various approximations, e.g.: negative free energy, AIC, BIC

MacKay 1992, Neural Comput.; Penny et al. 2004a, NeuroImage

Model space (hypothesis set) M
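The quoted intuition corresponds to a simple Monte Carlo estimate of the evidence: average the likelihood over samples from the prior. Here is a sketch for a toy model where the exact evidence is also known analytically (model and numbers are assumptions for illustration):

```python
import math, random

# Monte Carlo illustration of the model evidence as the average likelihood
# under the prior: p(y|m) ~ mean_i p(y | theta_i), with theta_i ~ p(theta|m).
# Toy model (hypothetical): theta ~ N(0, 1) and y | theta ~ N(theta, 1).
random.seed(0)
y = 1.0

def norm_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

samples = [random.gauss(0.0, 1.0) for _ in range(200_000)]
evidence_mc = sum(norm_pdf(y, th, 1.0) for th in samples) / len(samples)

# Analytically, marginalizing theta out gives y ~ N(0, 2) for this model.
evidence_exact = norm_pdf(y, 0.0, 2.0)
print(evidence_mc, evidence_exact)   # the two values agree closely
```

This naive estimator is fine for a one-dimensional toy problem, but becomes hopelessly inefficient in high dimensions — which is why approximations like the negative free energy are used in practice.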

Model space M is defined by prior on models. Usual choice: flat prior over a small set of models.

p(m) = 1/M if m ∈ M, 0 if m ∉ M

In this case, the posterior probability of model i is:

p(m_i|y) = p(y|m_i) p(m_i) / Σ_{j=1..M} p(y|m_j) p(m_j) = p(y|m_i) / Σ_{j=1..M} p(y|m_j)

Differential diagnosis based on generative models of disease symptoms

SYMPTOM y (behaviour or physiology)

pym(|,)  k pmy(|) k

HYPOTHETICAL ...... MECHANISM m1 mk mK

p( y | mkk ) p ( m ) pm(k | y)   p( y | mkk ) p ( m ) k

Stephan et al. 2016, NeuroImage

Approximations to the model evidence

The logarithm is a monotonic function, so maximizing the log model evidence = maximizing the model evidence.

Log model evidence = balance between fit and complexity:
log p(y|m) = accuracy(m) − complexity(m) = log p(y|θ,m) − complexity(m)

Akaike Information Criterion: AIC = log p(y|θ,m) − p   (p: number of parameters)
Bayesian Information Criterion: BIC = log p(y|θ,m) − (p/2) log N   (N: number of data points)

The (negative) free energy approximation F
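A minimal sketch of the two criteria, using the penalty convention shown on the slide and hypothetical numbers:

```python
import math

# AIC and BIC as rough approximations to the log model evidence: both start
# from the maximal log-likelihood and subtract a complexity penalty.
# All numbers are hypothetical.
log_lik = -120.0   # log p(y | theta_hat, m), hypothetical model fit
p = 6              # number of free parameters
N = 100            # number of data points

AIC = log_lik - p
BIC = log_lik - (p / 2) * math.log(N)
print(AIC, BIC)    # BIC penalizes more heavily whenever N > e^2
```

For any realistic N, BIC's per-parameter penalty (½ log N) exceeds AIC's (1), so BIC favors simpler models more strongly.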

F is a lower bound on the log model evidence:

log p(y|m) = F + KL[q(θ), p(θ|y,m)]

Like AIC/BIC, F is an accuracy/complexity tradeoff:

F = ⟨log p(y|θ,m)⟩_q − KL[q(θ), p(θ|m)] = accuracy − complexity

The (negative) free energy approximation

• The log evidence is thus the expected log likelihood (w.r.t. q) plus two KL terms:

log p(y|m) = ⟨log p(y|θ,m)⟩_q − KL[q(θ), p(θ|m)] + KL[q(θ), p(θ|y,m)]

F = log p(y|m) − KL[q(θ), p(θ|y,m)] = ⟨log p(y|θ,m)⟩_q − KL[q(θ), p(θ|m)] = accuracy − complexity

The complexity term in F

• In contrast to AIC & BIC, the complexity term of the negative free energy F accounts for parameter interdependencies. Under Gaussian assumptions about the posterior (Laplace approximation):

KL[q(θ), p(θ|m)] = ½ ln|C_θ| − ½ ln|C_θ|y| + ½ (μ_θ|y − μ_θ)ᵀ C_θ⁻¹ (μ_θ|y − μ_θ)

• The complexity term of F is higher
– the more independent the prior parameters (→ effective degrees of freedom)
– the more dependent the posterior parameters
– the more the posterior mean deviates from the prior mean

Bayes factors

To compare two models, we could just compare their log evidences.

But: the log evidence is just some number – not very intuitive!

A more intuitive interpretation of model comparisons is made possible by Bayes factors (positive values in [0; ∞[):

B12 = p(y|m1) / p(y|m2)

Kass & Raftery classification:

B12        p(m1|y)   Evidence
1 to 3     50–75%    weak
3 to 20    75–95%    positive
20 to 150  95–99%    strong
> 150      > 99%     very strong

Kass & Raftery 1995, J. Am. Stat. Assoc.

Fixed effects BMS at group level
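In practice one usually works with log evidences, so the Bayes factor is computed from their difference; a sketch with hypothetical values:

```python
import math

# Bayes factor from two log model evidences (hypothetical values), plus the
# posterior probability of model 1 under a flat prior over the two models.
log_ev1, log_ev2 = -250.0, -253.0

B12 = math.exp(log_ev1 - log_ev2)     # e^3, approximately 20.1
p_m1 = B12 / (B12 + 1.0)              # p(m1 | y) under a flat model prior
print(round(B12, 1), round(p_m1, 3))  # 20.1 0.953
```

A log evidence difference of 3 thus corresponds to a Bayes factor of about 20 — the conventional threshold for "strong" evidence in the Kass & Raftery scheme.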

Group Bayes factor (GBF) for 1..K subjects:

GBF_ij = Π_k BF_ij^(k)

Average Bayes factor (ABF):

ABF_ij = (Π_k BF_ij^(k))^(1/K)

Problems:
- blind with regard to group heterogeneity
- sensitive to outliers

Random effects BMS for heterogeneous groups

• Dirichlet parameters α = "occurrences" of models in the population

r ~ Dir(r; α): Dirichlet distribution of model probabilities r

m_n ~ Mult(m; 1, r): multinomial distribution of model labels m

y_n ~ p(y_n|m_n): model inversion by variational Bayes or MCMC yields the evidence for each subject's measured data y_n

Stephan et al. 2009, NeuroImage

Random effects BMS

1 k 1 p(|), rDirrr   k Z   k r ~ Dir(r;)  Z()  kk  k k

mk ~ p(mk | p) mnk p(|) mrrnk  m1 ~ Mult(m;1,r) k

y1 ~ p(y1 | m1) y ~ p(y | m ) p( y | m ) p ( y | ) p (  | m ) d  2 2 2 n nk nk y1 ~ p(y1 | m1) Stephan et al. 2009, NeuroImage  Write down joint probability and take the log

pyrm , ,||( | pympmrpr )    0   p( rp | y0 )|| m p mn r nn   n

1   0knk1 m   rpkn y nk mr  |  Z 0  knk 

1 mnk  p y m rr 0k 1  n nk kk Z 0  nk

lnpyrm , ,   ln Z00  k  1 ln rmpym k  nk log n | nk  ln r k  nk  Mean field approx. qrmqrqm(,)()() 

 Maximise neg. free q( rI )exp r    energy wrt. q = minimise divergence, q( mI )exp m    by maximising variational energies I rp ylog, r m ,     qm() I mp ylog, r m ,     qr()  Iterative updating of sufficient statistics of approx. posteriors

 0 0  [1,,1] Until convergence

 unkexp ln p ( y n | m nk )   k    k k

unk gnk  unk our (normalized) gqnknk m(1) k posterior belief that   g model k generated the k nk data from subject n n    0 k  g nk expected number of n subjects whose data we end believe were generated by model k Four equivalent options for reporting model by random effects BMS

1. Dirichlet parameter estimates α

2. expected posterior probability of obtaining the k-th model for any randomly selected subject: ⟨r_k⟩ = α_k / Σ_j α_j

3. exceedance probability that a particular model k is more likely than any other model (of the K models tested), given the group data: φ_k = p(r_k > r_j, ∀ j ≠ k | y)

4. protected exceedance probability: see below

Example: Hemispheric interactions during vision
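The iterative update scheme (α₀ = [1,…,1]; u_nk, g_nk, β_k, α) can be sketched in Python given a hypothetical matrix of subject-wise log model evidences. The digamma function ψ is implemented here by recurrence plus an asymptotic series rather than imported, and a fixed number of iterations stands in for a proper convergence check:

```python
import math

# Sketch of the random-effects BMS update scheme (after Stephan et al. 2009)
# for N subjects and K models, given log model evidences log_ev[n][k].
def digamma(x):
    r = 0.0
    while x < 6.0:                 # recurrence: psi(x) = psi(x+1) - 1/x
        r -= 1.0 / x
        x += 1.0
    # asymptotic expansion, accurate for x >= 6
    return r + math.log(x) - 1.0 / (2 * x) - 1.0 / (12 * x * x)

def rfx_bms(log_ev, n_iter=50):
    N, K = len(log_ev), len(log_ev[0])
    alpha = [1.0] * K                              # alpha_0 = [1, ..., 1]
    for _ in range(n_iter):
        a_sum = sum(alpha)
        g = []
        for n in range(N):
            u = [log_ev[n][k] + digamma(alpha[k]) - digamma(a_sum)
                 for k in range(K)]
            m = max(u)                             # normalize in log space
            w = [math.exp(v - m) for v in u]
            s = sum(w)
            g.append([v / s for v in w])           # g_nk: model assignment
        # beta_k = sum_n g_nk; alpha = alpha_0 + beta
        alpha = [1.0 + sum(g[n][k] for n in range(N)) for k in range(K)]
    return alpha

# hypothetical log evidences: 3 subjects x 2 models
log_ev = [[-100.0, -103.0], [-98.0, -102.0], [-110.0, -109.0]]
alpha = rfx_bms(log_ev)
r_exp = [a / sum(alpha) for a in alpha]            # expected model frequencies
print([round(a, 2) for a in alpha], [round(r, 2) for r in r_exp])
```

Note that Σ_k α_k always equals Σ_k α₀k + N, so the Dirichlet counts can be read as "prior subjects plus assigned subjects".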

Two DCMs (m1, m2) of hemispheric interactions between visual regions (LG, MOG, FG), differing in whether interhemispheric connections are modulated by letter decisions (LD) or by letter decisions conditional on the visual field of stimulation (LD|LVF, LD|RVF).

Data: Stephan et al. 2003, Science. Models: Stephan et al. 2007, J. Neurosci.

[Figure: log model evidence differences across subjects; posterior density p(r1|y) with α1 = 11.8, α2 = 2.2, ⟨r1⟩ = 84.3%, ⟨r2⟩ = 15.7%, p(r1 > 0.5 | y) = 0.997, p(r1 > r2) = 99.7%]

Stephan et al. 2009a, NeuroImage

Example: Synaesthesia

• “projectors” experience color externally colocalized with a presented grapheme

• “associators” report an internally evoked association

• across all subjects: no evidence for either model

• but BMS results map precisely onto projectors (bottom-up mechanisms) and associators (top-down)

van Leeuwen et al. 2011, J. Neurosci.

Overfitting at the level of models

• more models → greater risk of overfitting

• solutions:
– regularisation: definition of model space = choosing priors p(m)
– family-level BMS
– Bayesian model averaging (BMA)

posterior model probability:
p(m|y) = p(y|m) p(m) / Σ_m p(y|m) p(m)

BMA:
p(θ|y) = Σ_m p(θ|y, m) p(m|y)

Model space partitioning: comparing model families

• partitioning model space into K subsets or families: M = {f_1, …, f_K}

• pooling information over all models in these subsets allows one to compute the probability of a model family, given the data

• effectively removes uncertainty about any aspect of model structure, other than the attribute of interest (which defines the partition)

Stephan et al. 2009, NeuroImage; Penny et al. 2010, PLoS Comput. Biol.

Family-level inference: fixed effects

• We wish to have a uniform prior at the family level: p(f_k) = 1/K

• This is related to the model level via the sum of the priors on models: p(f_k) = Σ_{m∈f_k} p(m)

• Hence the uniform prior at the family level is: ∀ m ∈ f_k: p(m) = 1/(K·|f_k|)

• The probability of each family is then obtained by summing the posterior probabilities of the models it includes: p(f_k|y_1..N) = Σ_{m∈f_k} p(m|y_1..N)

Penny et al. 2010, PLoS Comput. Biol.

Family-level inference: random effects
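The fixed-effects summing step can be sketched directly; the model posteriors and family assignments below are hypothetical:

```python
# Family-level fixed-effects inference sketch: posterior family
# probabilities are sums of posterior model probabilities over the models
# in each family. All model posteriors here are hypothetical.
post_models = {"m1": 0.40, "m2": 0.25, "m3": 0.20, "m4": 0.15}
families = {"nonlinear": ["m1", "m2"], "linear": ["m3", "m4"]}

post_families = {f: sum(post_models[m] for m in ms)
                 for f, ms in families.items()}
print(post_families)   # nonlinear ~0.65, linear ~0.35
```

Although no single model dominates here, pooling within families yields a clear verdict on the attribute that defines the partition.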

• The frequency of a family in the population is given by: s_k = Σ_{m∈f_k} r_m

• In RFX-BMS, this follows a Dirichlet distribution, with a uniform prior on the parameters α (see above): p(s) = Dir(α)

• A uniform prior over family probabilities can be obtained by setting: ∀ m ∈ f_k: α_m(prior) = 1/|f_k|

Stephan et al. 2009, NeuroImage; Penny et al. 2010, PLoS Comput. Biol.

Family-level inference: random effects – a special case

• When the families are of equal size, one can simply sum the posterior model probabilities within families by exploiting the agglomerative property of the Dirichlet distribution:

(r_1, r_2, …, r_K) ~ Dir(α_1, α_2, …, α_K)

(r*_1, r*_2, …, r*_J) ~ Dir(α*_1, α*_2, …, α*_J),   where r*_j = Σ_{k∈f_j} r_k and α*_j = Σ_{k∈f_j} α_k

Stephan et al. 2009, NeuroImage

Model space partitioning: comparing model families

[Figure: FFX summed log evidences (relative to RBML) and RFX Dirichlet parameter estimates for 8 DCMs (CBMN, CBMN(ε), RBMN, RBMN(ε), CBML, CBML(ε), RBML, RBML(ε)); partitioning into a nonlinear family (α*₁ = Σ α_k over the nonlinear models) and a linear family (α*₂ = Σ α_k over the linear models) gives p(r1 > 0.5 | y) = 0.986, p(r1 > r2) = 98.6%, ⟨r1⟩ = 73.5% (nonlinear) vs. ⟨r2⟩ = 26.5% (linear)]

Stephan et al. 2009, NeuroImage

Mismatch negativity (MMN)

• elicited by surprising stimuli (scales with unpredictability)
• reduced in schizophrenic patients
• classical interpretations: pre-attentive change detection, neuronal adaptation
• current theories: reflection of (hierarchical) Bayesian inference

[Figure: hierarchical network (A1 → STG → IFG) exchanging predictions and prediction errors; ERPs to standards vs. deviants]

Garrido et al. 2009, Clin. Neurophysiol.

Mismatch negativity (MMN)

• reduced in schizophrenic patients
• highly relevant for computational assays of glutamatergic and cholinergic transmission: NMDAR (e.g., placebo vs. ketamine), ACh (nicotinic & muscarinic), 5HT, DA

Schmidt, Diaconescu et al. 2013, Cereb. Cortex

Modelling trial-by-trial changes of the mismatch negativity (MMN)

Lieder et al. 2013, PLoS Comput. Biol.

MMN model comparison at multiple levels

• Comparing MMN theories

• Comparing individual models

• Comparing modeling frameworks

Lieder et al. 2013, PLoS Comput. Biol.

Bayesian Model Averaging (BMA)

• abandons dependence of parameter inference on a single model and takes model uncertainty into account
• uses the entire model space considered (or an optimal family of models)
• averages parameter estimates, weighted by posterior model probabilities
• represents a particularly useful alternative
– when none of the models (or model subspaces) considered clearly outperforms all others
– when comparing groups for which the optimal model differs

single-subject BMA:
p(θ|y) = Σ_m p(θ|y, m) p(m|y)

group-level BMA:
p(θ_n|y_1..N) = Σ_m p(θ_n|y_n, m) p(m|y_1..N)

NB: p(m|y_1..N) can be obtained by either FFX or RFX BMS

Penny et al. 2010, PLoS Comput. Biol.

Prefrontal-parietal connectivity during working memory in schizophrenia
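The single-subject BMA average can be sketched as a posterior-probability-weighted sum over models; all numbers are hypothetical:

```python
# Bayesian model averaging sketch: the marginal posterior mean of a
# parameter is the average of the model-specific posterior means, weighted
# by posterior model probabilities. All numbers are hypothetical.
post_model = [0.7, 0.2, 0.1]          # p(m | y) for three models
post_mean  = [0.5, 0.1, -0.3]         # E[theta | y, m] under each model

bma_mean = sum(pm * mu for pm, mu in zip(post_model, post_mean))
print(round(bma_mean, 3))             # 0.7*0.5 + 0.2*0.1 + 0.1*(-0.3) = 0.34
```

The full BMA posterior is a mixture of the model-specific posteriors, so higher moments (e.g. variance) also pick up between-model disagreement, not just within-model uncertainty.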

• 17 at-risk mental state (ARMS) individuals • 21 first-episode patients (13 non-treated) • 20 controls

Schmidt et al. 2013, JAMA Psychiatry BMS results for all groups

Schmidt et al. 2013, JAMA Psychiatry BMA results: PFC  PPC connectivity

17 ARMS, 21 first-episode (13 non-treated), 20 controls

Schmidt et al. 2013, JAMA Psychiatry

Protected exceedance probability: using BMA to protect against chance findings

• Exceedance probabilities (EPs) express our confidence that the posterior probabilities of models are different – under the hypothesis H1 that models differ in probability: r_k ≠ 1/K

• This does not account for the possibility of the "null hypothesis" H0: r_k = 1/K

• Bayesian omnibus risk (BOR): the risk of wrongly accepting H1 over H0

• protected EP: Bayesian model averaging over H0 and H1:
φ̃_k = φ_k (1 − BOR) + (1/K) BOR

Rigoux et al. 2014, NeuroImage

[Flow chart: definition of model space → inference on model structure or inference on model parameters? Model structure: inference on individual models (optimal model structure assumed identical across subjects? yes → FFX BMS, no → RFX BMS) or on a model space partition (comparison of model families using FFX or RFX BMS). Model parameters: parameters of an optimal model (assumed identical across subjects? yes → FFX analysis of parameter estimates, e.g. BPA; no → RFX analysis of parameter estimates, e.g. t-test, ANOVA) or parameters of all models (→ BMA)]

Stephan et al. 2010, NeuroImage

Further reading

• Penny WD, Stephan KE, Mechelli A, Friston KJ (2004) Comparing dynamic causal models. NeuroImage 22: 1157-1172.
• Penny WD, Stephan KE, Daunizeau J, Joao M, Friston K, Schofield T, Leff AP (2010) Comparing Families of Dynamic Causal Models. PLoS Computational Biology 6: e1000709.
• Penny WD (2012) Comparing dynamic causal models using AIC, BIC and free energy. NeuroImage 59: 319-330.
• Rigoux L, Stephan KE, Friston KJ, Daunizeau J (2014) Bayesian model selection for group studies – revisited. NeuroImage 84: 971-985.
• Stephan KE, Weiskopf N, Drysdale PM, Robinson PA, Friston KJ (2007) Comparing hemodynamic models with DCM. NeuroImage 38: 387-401.
• Stephan KE, Penny WD, Daunizeau J, Moran RJ, Friston KJ (2009) Bayesian model selection for group studies. NeuroImage 46: 1004-1017.
• Stephan KE, Penny WD, Moran RJ, den Ouden HEM, Daunizeau J, Friston KJ (2010) Ten simple rules for Dynamic Causal Modelling. NeuroImage 49: 3099-3109.
• Stephan KE, Iglesias S, Heinzle J, Diaconescu AO (2015) Translational Perspectives for Computational Neuroimaging. Neuron 87: 716-732.
• Stephan KE, Schlagenhauf F, Huys QJM, Raman S, Aponte EA, Brodersen KH, Rigoux L, Moran RJ, Daunizeau J, Dolan RJ, Friston KJ, Heinz A (2016) Computational Neuroimaging Strategies for Single Patient Predictions. NeuroImage, in press. DOI: 10.1016/j.neuroimage.2016.06.038

Thank you