
Statistical Methods in Medical Research 2003; 12: 419–446

Controlling the familywise error rate in functional neuroimaging: a comparative review

Thomas Nichols and Satoru Hayasaka, Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA

Functional neuroimaging data embodies a massive multiple testing problem, where 100 000 correlated test statistics must be assessed. The familywise error rate, the chance of any false positives, is the standard measure of Type I errors in multiple testing. In this paper we review and evaluate three approaches to thresholding images of test statistics: Bonferroni, random field and the permutation test. Owing to recent developments, improved Bonferroni procedures, such as Hochberg’s methods, are now applicable to dependent data. Continuous random field methods use the smoothness of the image to adapt to the severity of the multiple testing problem. Also, increased computing power has made both permutation and bootstrap methods applicable to functional neuroimaging. We evaluate these approaches on t images using simulations and a collection of real datasets. We find that Bonferroni-related tests offer little improvement over Bonferroni, while the permutation method offers substantial improvement over the random field method for low smoothness and low degrees of freedom. We also show the limitations of trying to find an equivalent number of independent tests for an image of correlated test statistics.

1 Introduction

Functional neuroimaging refers to an array of technologies used to measure neuronal activity in the living brain. Two widely used methods, positron emission tomography (PET) and functional magnetic resonance imaging (fMRI), both use blood flow as an indirect measure of brain activity. An experimenter images a subject repeatedly under different cognitive states and typically fits a massively univariate model. That is, a univariate model is independently fit at each of hundreds of thousands of volume elements, or voxels. Images of statistics are created that assess evidence for an experimental effect. Naive thresholding of 100 000 voxels at a 5% threshold is inappropriate, since 5000 false positives would be expected in null data.

False positives must be controlled over all tests, but there is no single measure of Type I error in multiple testing problems. The standard measure is the chance of any Type I errors, the familywise error rate (FWE). A relatively new development is the false discovery rate (FDR) error metric, the expected proportion of rejected hypotheses that are false positives. FDR-controlling procedures are more powerful than FWE procedures, yet still control false positives in a useful manner. We predict that FDR may soon eclipse FWE as the most common multiple false positive measure. In light of this, we believe that this is a choice moment to review FWE-controlling measures. (We prefer the term multiple testing problem over multiple comparisons problem. ‘Multiple comparisons’ can allude to pairwise comparisons on a single model, whereas in imaging a large collection of models is each subjected to a hypothesis test.)

In this paper we attempt to describe and evaluate all FWE multiple testing procedures useful for functional neuroimaging. Owing to the spatial dependence of functional neuroimaging data, there are actually quite a small number of applicable methods. The only methods that are appropriate under these conditions are Bonferroni, random field methods and resampling methods. We limit our attention to finding FWE-corrected thresholds and P-values for Gaussian Z and Student t images. (An $\alpha_0$ FWE-corrected threshold is one that controls the FWE at $\alpha_0$, while an FWE-corrected P-value is the most significant $\alpha_0$ corrected threshold such that a test can be rejected.) In particular we do not consider inference on size of contiguous suprathreshold regions or clusters. We focus particular attention on low degrees-of-freedom (DF) t images, as these are unfortunately common in group analyses. The typical images have tens to hundreds of thousands of tests, where the tests have a complicated, though usually positive, dependence structure.

In 1987 Hochberg and Tamhane wrote: ‘The question of which error rate to control . . . has generated much discussion in the literature’.¹ While that could be true in the statistics literature in the decades before their book was published, the same cannot be said of the neuroimaging literature. When multiple testing has been acknowledged at all, the familywise error rate has usually been assumed implicitly. For example, of the two papers that introduced corrected inference methods to neuroimaging, only one explicitly mentions familywise error.²,³ It is hoped that the introduction of FDR will enrich the discussion of multiple testing issues in the neuroimaging literature.

The remainder of the paper is organized as follows. We first introduce the general multiple comparison problem and FWE. We then review Bonferroni-related methods, followed by random field methods and then resampling methods. We then evaluate these different methods with 11 real datasets and simulations.

Address for correspondence: Thomas Nichols, Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA. E-mail: [email protected]

© Arnold 2003 10.1191/0962280203sm341ra

2 Multiple testing background in functional neuroimaging

In this section we formally define strong and weak control of familywise error, as well as other measures of false positives in multiple testing. We describe the relationship between the maximum statistic and FWE and also introduce step-up and step-down tests and the notion of equivalent number of independent tests.

2.1 Notation

Consider image data on a two- (2D) or three-dimensional (3D) lattice. Usually the lattice will be regular, but it may also be irregular, corresponding to a 2D surface. Through some modeling process we have an image of test statistics $T = \{T_i\}$. Let $T_i$ be the value of the statistic image at spatial location $i$, $i \in \mathcal{V} = \{1, \ldots, V\}$, where $V$ is the number of voxels in the brain. Let $H = \{H_i\}$ be a hypothesis image, where $H_i = 0$ indicates that the null hypothesis holds at voxel $i$, and $H_i = 1$ indicates that the alternative hypothesis holds. Let $H_0$ indicate the complete null case where $H_i = 0$ for all $i$. A decision to reject the null for voxel $i$ will be written $\hat{H}_i = 1$, not rejecting $\hat{H}_i = 0$.

Write the null distribution of $T_i$ as $F_{0,T_i}$, and let the image of P-values be $P = \{P_i\}$. We require nothing about the modeling process or statistic other than the test being unbiased, and, for simplicity of exposition, we will assume that all distributions are continuous.

2.2 Measures of false positives in multiple testing

A valid $\alpha$-level test at location $i$ corresponds to a rejection threshold $u$ where $P\{T_i \ge u \mid H_i = 0\} \le \alpha$. The challenge of the multiple testing problem is to find a threshold $u$ that controls some measure of false positives across the entire image. While FWE is the focus of this paper, it is useful to compare its definition with other measures. To this end consider the cross-classification of all voxels in a threshold statistic image of Table 1.

In Table 1, $V$ is the total number of voxels tested, $V_{\cdot|1} = \sum_i H_i$ is the number of voxels with a false null hypothesis, $V_{\cdot|0} = V - \sum_i H_i$ the number of true nulls, and $V_{1|\cdot} = \sum_i \hat{H}_i$ is the number of voxels above threshold, $V_{0|\cdot} = V - \sum_i \hat{H}_i$ the number below; $V_{1|1}$ is the number of suprathreshold voxels correctly classified as signal, $V_{1|0}$ the number incorrectly classified as signal; $V_{0|0}$ is the number of subthreshold voxels correctly classified as noise, and $V_{0|1}$ the number incorrectly classified as noise.

With this notation a range of false positive measures can be defined, as shown in Table 2. An observed familywise error (oFWE) occurs whenever $V_{1|0}$ is greater than zero, and the familywise error rate (FWE) is defined as the probability of this event. The observed FDR (oFDR) is the proportion of rejected voxels that have been falsely rejected, and the expected false discovery rate (FDR) is defined as the expected value of oFDR. Note the contradictory notation: an ‘observed familywise error’ and the ‘observed false discovery rate’ are actually unobservable, since they require knowledge of which tests are falsely rejected. The measures actually controlled, FWE and FDR, are the probability and expectation of the unobservable oFWE and oFDR respectively. Other variants on FDR have been proposed, including the positive FDR (pFDR), the FDR conditional on at least one rejection,⁴ and controlling the oFDR at some specified level of confidence (FDRc).⁵ The per-comparison error rate (PCE) is essentially the ‘uncorrected’ measure of false positives, which makes no accommodation for multiplicity, while the per-family error rate (PFE) measures the expected count of false positives. Per-family errors can also be controlled with some level of confidence (PFEc).⁶ This taxonomy demonstrates that there are many potentially useful multiple false positive metrics, but we are choosing to focus on but one.

Table 1 Cross-classification of voxels in a threshold statistic image

                        Null not rejected      Null rejected
                        (declared inactive)    (declared active)
Null true (inactive)    $V_{0|0}$              $V_{1|0}$              $V - V_{\cdot|1}$
Null false (active)     $V_{0|1}$              $V_{1|1}$              $V_{\cdot|1}$
                        $V - V_{1|\cdot}$      $V_{1|\cdot}$          $V$

Table 2 Different measures of false positives in the multiple testing problem. $I\{A\}$ is the indicator function for event $A$

Measure of false positives          Abbreviation    Definition
Observed familywise error           oFWE            $V_{1|0} > 0$
Familywise error rate               FWE             $P\{\mathrm{oFWE}\}$
Observed false discovery rate       oFDR            $(V_{1|0}/V_{1|\cdot})\, I\{V_{1|\cdot} > 0\}$
Expected false discovery rate       FDR             $E\{\mathrm{oFDR}\}$
Positive false discovery rate       pFDR            $E\{\mathrm{oFDR} \mid V_{1|\cdot} > 0\}$
False discovery rate confidence     FDRc            $P\{\mathrm{oFDR} \le \alpha\}$
Per-comparison error rate           PCE             $E\{V_{1|0}\}/V$
Per-family error rate               PFE             $E\{V_{1|0}\}$
Per-family error rate confidence    PFEc            $P\{V_{1|0} \le \alpha\}$
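The cross-classification of Table 1 and the two 'observed' measures of Table 2 can be illustrated with a minimal Python sketch. It assumes the true hypothesis image is known, which is only possible in simulation; the function names and the toy eight-voxel images are hypothetical and are not taken from the paper.

```python
def cross_classify(h_true, h_hat):
    """Count the four cells of Table 1.

    h_true[i] is 1 if the null is false (active) at voxel i, else 0.
    h_hat[i]  is 1 if the null was rejected at voxel i, else 0.
    Keys follow the V_{decision|truth} notation of the text.
    """
    cells = {"V0|0": 0, "V1|0": 0, "V0|1": 0, "V1|1": 0}
    for truth, decision in zip(h_true, h_hat):
        cells["V%d|%d" % (decision, truth)] += 1
    return cells


def observed_measures(h_true, h_hat):
    """Return (oFWE, oFDR) as defined in Table 2."""
    cells = cross_classify(h_true, h_hat)
    n_rejected = cells["V1|0"] + cells["V1|1"]        # V_{1|.}
    ofwe = cells["V1|0"] > 0                           # any false positive
    # oFDR is defined as zero when nothing is rejected (the indicator term)
    ofdr = cells["V1|0"] / n_rejected if n_rejected > 0 else 0.0
    return ofwe, ofdr


# Hypothetical toy images: 3 active voxels, 3 rejections, 1 of them false
h_true = [0, 0, 0, 0, 1, 1, 1, 0]
h_hat = [0, 1, 0, 0, 1, 1, 0, 0]
cells = cross_classify(h_true, h_hat)
ofwe, ofdr = observed_measures(h_true, h_hat)
```

FWE and FDR are then the probability and expectation of `ofwe` and `ofdr` over repeated realizations, which is why they can be controlled even though a single realization of either quantity is unobservable in real data.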

There are two senses of FWE control, weak and strong. Weak control of FWE only requires that false positives are controlled under the complete null $H_0$:

$$ P\left( \bigcup_{i \in \mathcal{V}} \{T_i \ge u\} \,\middle|\, H_0 \right) \le \alpha_0 \tag{1} $$

where $\alpha_0$ is the nominal FWE. Strong control requires that false positives are controlled for any subset $\mathcal{V}_0 \subset \mathcal{V}$ where the null hypothesis holds:

$$ P\left( \bigcup_{i \in \mathcal{V}_0} \{T_i \ge u\} \,\middle|\, H_i = 0,\ i \in \mathcal{V}_0 \right) \le \alpha_0 \tag{2} $$

Significance determined by a method with weak control only implies that $H_0$ is false, and does not allow the localization of individual significant voxels. Because of this, tests with only weak control are sometimes called ‘omnibus’ tests. Significance obtained by a method with strong control allows rejection of individual $H_i$s while controlling the FWE at all nonsignificant voxels. This localization is essential to neuroimaging, and in the rest of this document we focus on strong control of FWE and we will omit the qualifier unless needed.

Note that variable thresholds, $u_i$, could be used instead of a common threshold $u$. Any collection of thresholds $\{u_i\}$ can be used as long as the overall FWE is controlled. Also note that FDR controls FWE weakly. Under the complete null, oFDR becomes an indicator for an oFWE, and the expected oFDR is exactly the probability of an oFWE.

2.3 The maximum statistic and FWE

The maximum statistic, $M_T = \max_i T_i$, plays a key role in FWE control. The connection is that one or more voxels will exceed a threshold if and only if the maximum exceeds the threshold:

$$ \bigcup_i \{T_i \ge u\} = \{M_T \ge u\} \tag{3} $$

This relationship can be used to directly obtain an FWE-controlling threshold from the distribution of the maximum statistic under $H_0$. To control FWE at $\alpha_0$ let $u_{\alpha_0} = F^{-1}_{M_T|H_0}(1 - \alpha_0)$, the $(1 - \alpha_0)100$th percentile of the maximum distribution under the complete null hypothesis. Then $u_{\alpha_0}$ has weak control of FWE:

$$
\begin{aligned}
P\left( \bigcup_i \{T_i \ge u_{\alpha_0}\} \,\middle|\, H_0 \right) &= P(M_T \ge u_{\alpha_0} \mid H_0) \\
&= 1 - F_{M_T|H_0}(u_{\alpha_0}) \\
&= \alpha_0
\end{aligned}
$$

Further, this $u_{\alpha_0}$ also has strong control of FWE, although an assumption of subset pivotality is required.⁷ A family of tests has subset pivotality if the null distribution of a subset of voxels does not depend on the state of other null hypotheses. Strong control follows:

\begin{align}
P\left( \bigcup_{i \in \mathcal{V}_0} \{T_i \ge u_{\alpha_0}\} \,\middle|\, H_i = 0,\ i \in \mathcal{V}_0 \right)
&= P\left( \bigcup_{i \in \mathcal{V}_0} \{T_i \ge u_{\alpha_0}\} \,\middle|\, H_0 \right) \tag{4} \\
&\le P\left( \bigcup_{i \in \mathcal{V}} \{T_i \ge u_{\alpha_0}\} \,\middle|\, H_0 \right) \tag{5} \\
&= \alpha_0 \tag{6}
\end{align}

where the first equality uses the assumption of subset pivotality. In imaging, subset pivotality is trivially satisfied, as the image of hypotheses $H$ satisfies the free combination condition. That is, there are no logical constraints between different voxels’ null hypotheses, and all combinations of voxel-level nulls ($\{0,1\}^V$) are possible. Situations where subset pivotality can fail include tests of elements of a correlation matrix (Ref. 7, p. 43).

Note that we could equivalently find P-value thresholds using the distribution of the minimum P-value $N_P = \min_i P_i$. Whether with $M_T$ or $N_P$, we stress that we are not simply making inference on the extremal statistic, but rather using its distribution to find a threshold that controls FWE strongly.
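The link between the maximum statistic and the FWE threshold can be sketched numerically. The Python example below is only an illustration under stated assumptions: the null 'images' are hypothetical 1D Gaussian white noise smoothed by a moving average (not the data or simulation design of this paper), and the maximum distribution is approximated by Monte Carlo simulation rather than by any of the methods reviewed later.

```python
import random


def smooth_null_image(n_voxels, width, rng):
    """One hypothetical null image: 1D white noise smoothed by a moving
    average of `width` points, rescaled so each voxel has unit variance."""
    noise = [rng.gauss(0.0, 1.0) for _ in range(n_voxels + width - 1)]
    scale = width ** 0.5
    return [sum(noise[i:i + width]) / scale for i in range(n_voxels)]


def max_stat_threshold(max_stats, alpha0):
    """Empirical (1 - alpha0)100th percentile of the simulated maxima,
    chosen so that at most a fraction alpha0 of maxima reach it."""
    ordered = sorted(max_stats)
    k = len(ordered) - int(alpha0 * len(ordered))
    k = min(k, len(ordered) - 1)
    return ordered[k]


rng = random.Random(2003)          # fixed seed for reproducibility
alpha0, n_voxels, width, n_sim = 0.05, 200, 5, 2000

# Simulated distribution of M_T = max_i T_i under the complete null
maxima = [max(smooth_null_image(n_voxels, width, rng)) for _ in range(n_sim)]
u_alpha0 = max_stat_threshold(maxima, alpha0)

# Fraction of null images with any voxel at or above the threshold
fwe_in_sample = sum(1 for m in maxima if m >= u_alpha0) / n_sim
```

By equation (3), `fwe_in_sample` is both the proportion of simulated maxima reaching the threshold and the proportion of null images containing at least one suprathreshold voxel.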

2.4 Step-up and step-down tests

A generalization of the single threshold test takes the form of multi-step tests. Multi-step tests consider the sequence of sorted P-values and compare each P-value to a different threshold. Let the ordered P-values be $P_{(1)} \le P_{(2)} \le \cdots \le P_{(V)}$ and $H_{(i)}$ be the null hypothesis corresponding to $P_{(i)}$. Each P-value is assessed according to

$$ P_{(i)} \le u_i \tag{7} $$

Usually $u_1$ will correspond to a standard fixed threshold test (e.g., Bonferroni $u_1 = \alpha_0/V$).

There are two types of multi-step tests, step-up and step-down. A step-up test proceeds from the least to most significant P-value ($P_{(V)}, P_{(V-1)}, \ldots$) and successively applies equation (7). The first $i'$ that satisfies (7) implies that all smaller P-values are significant; that is, $\hat{H}_{(i)} = 1$ for all $i \le i'$, $\hat{H}_{(i)} = 0$ otherwise. A step-down test proceeds from the most to least significant P-value ($P_{(1)}, P_{(2)}, \ldots$). The first $i'$ that does not satisfy (7) implies that all strictly smaller P-values are significant; that is, $\hat{H}_{(i)} = 1$ for all $i < i'$, $\hat{H}_{(i)} = 0$ otherwise. For the same critical values $\{u_i\}$, a step-up test will be as or more powerful than a step-down test.
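The two stopping rules can be made concrete with a short Python sketch. The function names and the four P-values are hypothetical, and the critical values $u_i = \alpha_0/(V - i + 1)$ are one illustrative choice; the point is only to show that, for identical critical values, the step-up rule rejects at least as much as the step-down rule.

```python
def step_down(p_values, thresholds):
    """Step-down test: start at the smallest P-value and stop at the first
    ordered P-value that fails P_(i) <= u_i; reject everything before it."""
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    rejected = []
    for rank, idx in enumerate(order):      # rank 0 corresponds to i = 1
        if p_values[idx] <= thresholds[rank]:
            rejected.append(idx)
        else:
            break
    return sorted(rejected)


def step_up(p_values, thresholds):
    """Step-up test: start at the largest P-value and stop at the first
    ordered P-value that satisfies P_(i) <= u_i; reject it and all smaller."""
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    for rank in range(len(order) - 1, -1, -1):
        if p_values[order[rank]] <= thresholds[rank]:
            return sorted(order[:rank + 1])
    return []


alpha0 = 0.05
p_values = [0.049, 0.030, 0.045, 0.040]     # hypothetical P-values
V = len(p_values)
# Illustrative critical values u_i = alpha0 / (V - i + 1), i = 1..V
thresholds = [alpha0 / (V - i + 1) for i in range(1, V + 1)]

down = step_down(p_values, thresholds)      # stops immediately: 0.030 > 0.0125
up = step_up(p_values, thresholds)          # 0.049 <= 0.05, so all are rejected
```

Here the step-down test rejects nothing because the smallest P-value already fails its threshold, while the step-up test rejects all four hypotheses because the largest P-value meets the most lenient threshold.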

2.5 Equivalent number of independent tests

The main challenge in FWE control is dealing with dependence. Under independence, the maximum null distribution $F_{M_T|H_0}$ is easily derived:

$$ F_{M_T}(t) = \prod_i F_{T_i}(t) = F_{T_1}^{V}(t) \tag{8} $$

where we have suppressed the $H_0$ subscript, and the last equality comes from assuming a common null distribution for all $V$ voxels. However, in neuroimaging data, independence is rarely a tenable assumption, as there is usually some form of spatial dependence either due to acquisition physics or physiology. In such cases the maximum distribution will not be known or even have a closed form. The methods described in this paper can be seen as different approaches to bounding or approximating the maximum distribution.

One approach that has an intuitive appeal, but which has not been a fruitful avenue of research, is to find an equivalent number of independent observations. That is, to find a multiplier $\theta$ such that

$$ F_{M_T}(t) = F_{T_1}^{\theta V}(t) \tag{9} $$

or, in terms of P-values,

$$
\begin{aligned}
F_{N_P}(t) &= 1 - (1 - F_{P_1}(t))^{\theta V} \\
&= 1 - (1 - t)^{\theta V}
\end{aligned}
\tag{10}
$$

where the second equality comes from the uniform distribution of P-values under the null hypothesis. We are drawn to the second form, for P-values, because of its simplicity and the log-linearity of $1 - F_{N_P}(t)$.

For the simulations considered below, we will assess if minimum P-value distributions follow equation (10) for a small $t$. If they do, one can find the effective number of tests $\theta V$. This could be called the number of maximum-equivalent elements in the data, or maxels, where a single maxel would consist of $V/(\theta V) = \theta^{-1}$ voxels.
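Taking logarithms in equation (10) gives $\theta V = \log(1 - F_{N_P}(t)) / \log(1 - t)$, which suggests a simple plug-in estimate from simulated minimum P-values. The Python sketch below is a toy illustration only: the two dependence structures (fully independent voxels, and hypothetical blocks of four perfectly dependent voxels) are assumptions chosen because the effective number of tests is known exactly for them, and they are not the smooth dependence studied in the paper.

```python
import math
import random


def min_p_independent(n_voxels, rng):
    """Minimum of n_voxels independent uniform null P-values."""
    return min(rng.random() for _ in range(n_voxels))


def min_p_blocks(n_voxels, block, rng):
    """Minimum P-value when voxels come in blocks of `block` perfectly
    dependent tests: only n_voxels // block values are distinct."""
    return min(rng.random() for _ in range(n_voxels // block))


def effective_tests(min_ps, t):
    """Plug-in estimate of theta * V from equation (10) at a small t."""
    f_np = sum(1 for p in min_ps if p <= t) / len(min_ps)   # empirical F_NP(t)
    return math.log(1.0 - f_np) / math.log(1.0 - t)


rng = random.Random(10)            # fixed seed for reproducibility
V, n_sim, t = 100, 20000, 0.001

indep = [min_p_independent(V, rng) for _ in range(n_sim)]
blocks = [min_p_blocks(V, 4, rng) for _ in range(n_sim)]

theta_v_indep = effective_tests(indep, t)     # should be close to V = 100
theta_v_blocks = effective_tests(blocks, t)   # should be close to V / 4 = 25
```

In the block case each maxel consists of $\theta^{-1} = 4$ voxels. Whether real or smoothed images are described by a single $\theta$ across values of $t$ is exactly the question the simulations in the paper address.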

3 FWE methods for functional neuroimaging

There are two broad classes of FWE methods, those based on the Bonferroni inequality and those based on the maximum statistic (or minimum P-value) distribution. We first describe Bonferroni-type methods and then two types of maximum distribution methods: random field theory-based and resampling-based methods.

3.1 Bonferroni and related

The most widely known multiple testing procedure is the ‘Bonferroni correction’. It is based on the Bonferroni inequality, a truncation of Boole’s formula.¹ We write the Bonferroni inequality as

$$ P\left\{ \bigcup_i A_i \right\} \le \sum_i P\{A_i\} \tag{11} $$

where $A_i$ corresponds to the event of test $i$ rejecting the null hypothesis when true. The inequality makes no assumption on dependence between tests, although it can be quite conservative for strong dependence. As an extreme, consider that $V$ tests with perfect dependence ($T_i = T_{i'}$ for $i \ne i'$) require no correction at all for multiple testing. However, for many independent tests the Bonferroni inequality is quite tight for typical $\alpha_0$. For example, for $V = 32^3$ independent voxels and $\alpha_0 = 0.05$, the exact one-sided P-value threshold is $1 - (1 - \alpha_0)^{1/V} = 1.5653 \times 10^{-6}$ while Bonferroni gives $\alpha_0/V = 1.5259 \times 10^{-6}$. For a $t_9$ distribution, this is the difference between 10.1616 and 10.1928. Surprisingly, for weakly-dependent data, Bonferroni can also be fairly tight. To preview the results below, for a $32^3$-voxel $t_9$ statistic image based on Gaussian data with isotropic FWHM smoothness of three voxels, we find that the correct threshold is 10.0209. Hence Bonferroni thresholds and P-values can indeed be useful in imaging.

Considering another term of Boole’s formula yields a second-order Bonferroni, or the Kounias inequality¹

$$ P\left\{ \bigcup_i A_i \right\} \le \sum_i P\{A_i\} - \max_{k = 1, \ldots, V} \left\{ \sum_{i \ne k} P\{A_i \cap A_k\} \right\} \tag{12} $$

The Slepian or Dunn–Šidák inequalities can be used to replace the bivariate probabilities with products. The Slepian inequality, also known as the positive orthant dependence property, is used when $A_i$ corresponds to a one-sided test; it requires some form of positive dependence, like Gaussian data with positive correlations.⁸ Dunn–Šidák is used for two-sided tests and is a weaker condition, for example, only requiring the data follow a multivariate Gaussian, t or F distribution (Ref. 7, p. 45).

If all the null distributions are identical and the appropriate inequality holds (Slepian or Dunn–Šidák), the second-order Bonferroni P-value threshold is $c$ such that

$$ Vc - (V - 1)c^2 = \alpha_0 \tag{13} $$

When $V$ is large, however, $c$ will have to be quite small, making $c^2$ negligible and essentially reducing (13) to the first-order Bonferroni. For the example considered above, with $V = 32^3$ and $\alpha_0 = 0.05$, the Bonferroni and Kounias P-value thresholds agree to five decimal places ($\alpha_0/V = 1.525879 \times 10^{-6}$ versus $c = 1.525881 \times 10^{-6}$).

Other approaches to extending the Bonferroni method are step-up or step-down tests. A Bonferroni step-down test can be motivated as follows. If we compare the smallest P-value $P_{(1)}$ to $\alpha_0/V$ and reject, then our multiple testing problem has been reduced by one test, and we should compare the next smallest P-value $P_{(2)}$ to $\alpha_0/(V - 1)$. In general, we have

$$ P_{(i)} \le \alpha_0 \frac{1}{V - i + 1} \tag{14} $$

This yields the Holm step-down test,⁹ which starts at $i = 1$ and stops as soon as the inequality is violated, rejecting all tests with smaller P-values. Using the very same critical values, the Hochberg step-up test¹⁰ starts at $i = V$ and stops as soon as the inequality is satisfied, rejecting all tests with smaller P-values. However, the Hochberg test depends on a result of Simes. Simes¹¹ proposed a step-up method that only controlled FWE weakly and was only proven to be valid under independence. In their seminal paper, Benjamini and Hochberg¹² showed that Simes’ method controlled what they named the ‘False Discovery Rate’. Both Simes’ test and Benjamini and Hochberg’s FDR have the form

$$ P_{(i)} \le \alpha_0 \frac{i}{V} \tag{15} $$

Both are step-up tests, which work from $i = V$.

The Holm method, like Bonferroni, makes no assumption on the dependence of the tests. But if the Slepian or Dunn–Šidák inequality holds, the ‘Šidák improvement on Holm’ can be used.¹³ The Šidák method is also a step-down test but uses thresholds $u_i = 1 - (1 - \alpha_0)^{1/(V - i + 1)}$ instead.

Recently Benjamini and Yekutieli¹⁴ showed that the Simes/FDR method is valid under ‘positive regression dependency on subsets’ (PRDS). As with the Slepian inequality, Gaussian data with positive correlations will satisfy the PRDS condition, but it is more lenient than other results in that it only requires positive dependency on the null, that is, only between test statistics for which $H_i = 0$. Interestingly, since Hochberg’s method depended on Simes’ result, Ref. 14 also implies that Hochberg’s step-up method is valid under dependence.

Table 3 summarizes the multi-step methods. The Hochberg step-up method and Šidák step-down method appear to be the most powerful Bonferroni-related FWE procedures available for functional neuroimaging. Hochberg uses the same critical values as Holm, but Hochberg can only be more powerful since it is a step-up test. The Šidák method has slightly more lenient critical values, but may be more conservative than Hochberg because it is a step-down method. If positive dependence cannot be assumed for one-sided tests, Holm’s step-down method would be the best alternative. The Simes/FDR procedure has the most lenient critical values, but only controls FWE weakly. See works by Sarkar¹⁵,¹⁶ for a more detailed overview of recent developments in multiple testing and the interaction between FDR and FWE methods.

Table 3 Summary of multi-step procedures

Procedure    $u_i$                                   Type         FWE control    Assumptions
Holm         $\alpha_0 (1/(V - i + 1))$              Step-down    Strong         None
Šidák        $1 - (1 - \alpha_0)^{1/(V - i + 1)}$    Step-down    Strong         Slepian/Dunn–Šidák
Hochberg     $\alpha_0 (1/(V - i + 1))$              Step-up      Strong         PRDS
Simes/FDR    $\alpha_0 (i/V)$                        Step-up      Weak           PRDS

The multi-step methods can adapt to the signal present in the data, unlike Bonferroni. For the characteristics of neuroimaging data, with large images with sparse signals, however, we are not optimistic these methods will offer much improvement over Bonferroni. For example, with $\alpha_0 = 0.05$ and $V = 32^3$, consider a case where 3276 voxels (10%) of the image have a very strong signal, that is, infinitesimal P-values. For the 3276th smallest P-value the relevant critical value would be $0.05/(V - 3276 + 1) = 1.6953 \times 10^{-6}$ for Hochberg and $1 - (1 - 0.05)^{1/(V - 3276 + 1)} = 1.7392 \times 10^{-6}$ for Šidák, each only slightly more generous than the Bonferroni threshold of $1.5259 \times 10^{-6}$ and a decrease of less than 0.16 in $t_9$ units. For more typical, even more sparse signals there will be less of a difference. (Note that the Simes/FDR critical value would be $0.05 \times 3276/V = 0.005$, although with no strong FWE control.)

The strength of Bonferroni and related methods is their lack of assumptions, or only weak assumptions, on dependence. However, none of the methods makes use of the spatial structure of the data or the particular form of dependence. The following two methods explicitly account for image smoothness and dependence.
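The P-value thresholds quoted in this section can be checked with a few lines of Python. This is a sketch of the arithmetic only: the variable names are hypothetical, the Kounias threshold is taken as the smaller root of the quadratic in equation (13), and the $t_9$ thresholds are not recomputed because the standard library has no Student t quantile function.

```python
import math

V = 32 ** 3        # number of voxels in the running example
alpha0 = 0.05      # nominal FWE

# First-order Bonferroni threshold
bonferroni = alpha0 / V

# Exact threshold under independence, 1 - (1 - alpha0)^(1/V),
# evaluated with log1p/expm1 to avoid round-off for tiny probabilities
exact_indep = -math.expm1(math.log1p(-alpha0) / V)

# Kounias (second-order Bonferroni): smaller root of V c - (V - 1) c^2 = alpha0,
# written in a form that avoids subtracting two nearly equal numbers
kounias = 2.0 * alpha0 / (V + math.sqrt(V * V - 4.0 * (V - 1) * alpha0))

# Critical values at the 3276th smallest P-value (i = 3276)
i = 3276
holm_hochberg = alpha0 / (V - i + 1)
sidak = -math.expm1(math.log1p(-alpha0) / (V - i + 1))
simes_fdr = alpha0 * i / V

for name, value in [("Bonferroni", bonferroni), ("Exact (independent)", exact_indep),
                    ("Kounias", kounias), ("Holm/Hochberg, i=3276", holm_hochberg),
                    ("Sidak, i=3276", sidak), ("Simes/FDR, i=3276", simes_fdr)]:
    print("%-24s %.6e" % (name, value))
```

The Kounias threshold is necessarily slightly larger than the Bonferroni threshold, since subtracting the $c^2$ term in equation (13) allows a marginally more lenient $c$ for the same $\alpha_0$.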

3.2 Random field theory

Both the random field theory (RFT) methods and the resampling-based methods account for dependence in the data, as captured by the maximum distribution. The random field methods approximate the upper tail of the maximum distribution, the end needed to find small $\alpha_0$ FWE thresholds. In this section we review the general approach of the random field methods and specifically use the Gaussian random field results to give intuition to the methods. Instead of detailed formulae for every type of statistic image we motivate the general approach for thresholding a statistic image, and then briefly review important details of the theory and common misconceptions.

For a detailed description of the random field results we refer to a number of useful papers. The original paper introducing RFT to brain imaging² is a very readable and well-illustrated introduction to the Gaussian random field method. A later work¹⁷ unifies Gaussian and t, $\chi^2$ and F field results. A very technical, but comprehensive summary¹⁸ also includes results on Hotelling’s $T^2$ and correlation fields. As part of a broad review of statistical methods in neuroimaging, Ref. 19 describes RFT methods and highlights their limitations and assumptions.

3.2.1 RFT intuition

Consider a continuous Gaussian random field, $Z(s)$ defined on $s \in \Omega \subset \Re^D$, where $D$ is the dimension of the process, typically 2 or 3. Let $Z(s)$ have zero mean and unit variance, as would be required by the null hypothesis. For a threshold $u$ applied to the field, the regions above the threshold are known as the excursion set, $A_u = \Omega \cap \{s : Z(s) > u\}$. The Euler characteristic $\chi(A_u) \equiv \chi_u$ is a topological measure of the excursion set. While the precise definition involves the curvature of the boundary of the excursion set (Ref. 20, cited in Ref. 21), it essentially counts the number of connected suprathreshold regions or ‘clusters’, minus the number of ‘holes’ plus the number of ‘hollows’ (Figure 1). For high thresholds the excursion set will have no holes or hollows and $\chi_u$ will just count the number of clusters; for yet higher thresholds the $\chi_u$ will be either 0 or 1, an indicator of the presence of any clusters. This seemingly esoteric topological measure is actually very relevant for FWE control. If a null Gaussian statistic image $T$ approximates a continuous random field, then

\begin{align}
\mathrm{FWE} &= P\{\mathrm{oFWE}\} \tag{16} \\
&= P\left\{ \bigcup_i \{T_i \ge u\} \right\} \tag{17} \\
&= P\left\{ \max_i T_i \ge u \right\} \tag{18} \\
&\approx P\{\chi_u > 0\} \tag{19} \\
&\approx E\{\chi_u\} \tag{20}
\end{align}

The first approximation comes from assuming that the threshold $u$ is high enough for there to be no holes or hollows and hence the $\chi_u$ is just counting clusters. The second approximation is obtained when $u$ is high enough such that the probability of two or more clusters is negligible.

Figure 1 Illustration of Euler characteristic ($\chi_u$) for different thresholds $u$. The left figure shows the random field, and the remaining images show the excursion set for different thresholds. The $\chi_u$ illustrated here only counts clusters minus holes (neglecting hollows). For high thresholds there are no holes, and $\chi_u$ just counts the number of clusters.

The expected value of the $\chi_u$ has a closed-form approximation;²¹ for $D = 3$

$$ E\{\chi_u\} \approx \lambda(\Omega)\, |\Lambda|^{1/2}\, (u^2 - 1) \exp(-u^2/2) / (2\pi)^2 \tag{21} $$

where $\lambda(\Omega)$ is the Lebesgue measure of the search region, the volume in three dimensions, and $\Lambda$ is the variance–covariance matrix of the gradient of the process,

$$ \Lambda = \mathrm{Var}\left( \left[ \frac{\partial}{\partial x} Z,\ \frac{\partial}{\partial y} Z,\ \frac{\partial}{\partial z} Z \right]^{\top} \right) \tag{22} $$

Its determinant $|\Lambda|$ is a measure of roughness; the more ‘wiggly’ a process, the more variable the partial derivatives, the larger the determinant.

Consider an observed value $z$ of the process at some location. To build intuition consider the impact of search region size and smoothness on the corrected P-value $P^c_z$. The corrected P-value is the upper tail area of the maximum distribution:

$$ P^c_z = P\left( \max_s Z(s) \ge z \right) \approx E\{\chi_z\} \tag{23} $$

For large $z$, equation (21) gives

$$ P^c_z \propto \lambda(\Omega)\, |\Lambda|^{1/2}\, z^2 \exp\left( -\frac{z^2}{2} \right) \tag{24} $$

approximately. First note that, all other things constant, increasing large $z$ reduces the corrected P-value. Of course P-values must be nonincreasing in $z$, but note that equation (24) is not monotonic for all $z$, and that $E\{\chi_z\}$ can be greater than 1 or even negative! Next observe that increasing the search region $\lambda(\Omega)$ increases the corrected P-value, decreasing significance. This should be anticipated, since an increased search volume results in a more severe multiple testing problem. And next consider the impact of smoothness, the inverse of roughness. Increasing smoothness decreases $|\Lambda|$, which in turn decreases the corrected P-value and increases significance. The intuition here is that an increase in smoothness reduces the severity of the multiple testing problem; in some sense there is less information with greater smoothness. In particular, consider that in the limit of infinite smoothness the entire process has a common value, and there is no multiple testing problem at all.
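The remarks on non-monotonicity and on $E\{\chi_z\}$ exceeding 1 or becoming negative can be verified directly from the $z$-dependent factor of equation (21). The following short derivation is a worked check added for illustration, with $g$ a symbol introduced here for convenience:

$$
\begin{aligned}
g(z) &= (z^2 - 1)\exp(-z^2/2), \\
g'(z) &= 2z \exp(-z^2/2) - z(z^2 - 1)\exp(-z^2/2) = z\,(3 - z^2)\exp(-z^2/2).
\end{aligned}
$$

Hence, for $z > 0$, $g$ increases on $(0, \sqrt{3})$ and decreases only for $z > \sqrt{3}$, so the approximation behaves like a P-value only at thresholds above $\sqrt{3} \approx 1.73$. It is negative whenever $|z| < 1$, and its maximum is $g(\sqrt{3}) = 2e^{-3/2} \approx 0.446$. From equation (21), $E\{\chi_z\}$ can therefore exceed 1 at some thresholds whenever

$$ \lambda(\Omega)\,|\Lambda|^{1/2} > \frac{(2\pi)^2 e^{3/2}}{2} \approx 88.5, $$

which is consistent with the approximation being intended only for high thresholds.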

3.2.2 RFT strengths and weaknesses

As presented above, the merit of the RFT results is that they adapt to the volume searched (like Bonferroni) and to the smoothness of the image (unlike Bonferroni). When combined with the general linear model (GLM), the random field methods comprise a flexible framework for neuroimaging inference. For functional neuroimaging data that is intrinsically smooth (PET, SPECT, MEG or EEG) or heavily smoothed (multisubject fMRI), these results provide a unified framework to find FWE-corrected inferences for statistic images from a GLM. While we only discuss results for peak statistic height, a family of available results includes P-values for the size of a cluster, the number of clusters and joint inference on peak height and cluster size. Further, the results only depend on volume searched and the smoothness (see below for more details on edge corrections), and are not computationally burdensome. Finally, they have been incorporated into various neuroimaging software packages and are widely used (if poorly understood) by thousands of users. (The software packages include SPM, http://www.fil.ion.ucl.ac.uk; VoxBo, www.voxbo.org; FSL, www.fmrib.ox.ac.uk/fsl; and Worsley’s own fmristat, http://www.math.mcgill.ca/keith/fmristat.)

The principal weakness of the random field methods is their assumptions. The assumptions are sometimes misstated, so we carefully itemize them.

1) The multivariate distribution of the image is Gaussian or derived from Gaussian data (e.g., for a t or F statistic image).
2) The discretely sampled statistic images are assumed to be sufficiently smooth to approximate the behaviour of a continuous random field. The recommended rule of thumb is three voxels FWHM smoothness,¹⁹ although we will critically assess this with simulations and data (see below for the definition of FWHM smoothness).
3) The spatial autocorrelation function (ACF) must have two derivatives at the origin. Except for the joint cluster-peak height test,²² the ACF is not assumed to have the form of a Gaussian density.
4) The data must be stationary or there must exist a deformation of space such that the transformed data is stationary.²³ This assumption is most questionable for reconstructed MEG and EEG data, which may have very convoluted covariance structures. Remarkably, for the peak height results we discuss here, nonstationarity need not even be estimated (see below for more).
5) The results assume that roughness/smoothness is known with negligible error. Poline et al.²⁴ found that uncertainty in smoothness was in fact appreciable, causing corrected P-values to be accurate to only ±20% if smoothness was estimated from a single image. (In a recent abstract, Worsley proposed methods for nonstationary cluster size inference, which account for variability in smoothness estimation.²⁵)

3.2.3 RFT essential details

To simplify the presentation, the results above avoided several important details which we now review.

Roughness and RESELs. Because the roughness parameter $|\Lambda|^{1/2}$ lacks interpretability, Worsley proposed a reparameterization in terms of the convolution of a white noise field into a random field with smoothness that matches the data. Consider a white noise Gaussian random field convolved with a Gaussian kernel. If the kernel has variance matrix $\Sigma$ then the roughness of the convolved random field is $\Lambda = \Sigma^{-1}/2$.²⁶ If $\Sigma$ has $\sigma_x^2, \sigma_y^2, \sigma_z^2$ on the diagonal and zero elsewhere, then

\begin{align}
|\Lambda|^{1/2} &= \left( |\Sigma|^{-1} 2^{-3} \right)^{1/2} \tag{25} \\
&= (\sigma_x \sigma_y \sigma_z)^{-1} 2^{-3/2} \tag{26}
\end{align}

To parameterize in terms of smoothness, Worsley used the full width at half maximum (FWHM), a general measure of spread of a density. For a symmetric density $f$ centered about zero, FWHM is the value $x$ such that $f(-x/2) = f(x/2) = f(0)/2$. A Gaussian kernel has a FWHM of $\sigma\sqrt{8 \log 2}$. If a white noise field is convolved with a Gaussian kernel with scale ($\mathrm{FWHM}_x$, $\mathrm{FWHM}_y$, $\mathrm{FWHM}_z$) (and zero correlations), the roughness parameter is

$$ |\Lambda|^{1/2} = \frac{(4 \log 2)^{3/2}}{\mathrm{FWHM}_x \mathrm{FWHM}_y \mathrm{FWHM}_z} \tag{27} $$

Worsley defined a RESolution ELement, or RESEL, to be a spatial element with dimensions $\mathrm{FWHM}_x \times \mathrm{FWHM}_y \times \mathrm{FWHM}_z$. The denominator of (27) is then the volume of one RESEL.

Noting that $E\{\chi_u\}$ in equation (21) depends on the volume and roughness through $\lambda(\Omega)|\Lambda|^{1/2}$, it can be seen that search volume and RESEL size can be combined and instead written as the search volume measured in RESELs:

$$ R_3 = \frac{\lambda(\Omega)}{\mathrm{FWHM}_x \mathrm{FWHM}_y \mathrm{FWHM}_z} \tag{28} $$

The RFT results then depend only on this single quantity, a resolution-adjusted search volume, the RESEL volume. The essential observation was that $|\Lambda|^{1/2}$ can be interpreted as a function of FWHM, the scale of a Gaussian kernel required to convolve a white noise field into one with the same smoothness as the data. When $|\Lambda|^{1/2}$ is unknown it is estimated from the data (see below), but not by assuming that the ACF is Gaussian. A Gaussian ACF is not assumed by random field results; rather, a Gaussian ACF is only used to reparameterize roughness into interpretable units of smoothness. If the true ACF is not Gaussian the accuracy of the resulting threshold is not impacted; only the precise interpretation of RESELs is disturbed.

Component fields and smoothness estimation. For the Gaussian case presented above, the smoothness of the statistic image under the null hypothesis is the key parameter of interest. For results on other types of fields including t, $\chi^2$ and F, the smoothness parameter describes the smoothness of the component fields. If each voxel’s data are fit with a general linear model $Y = X\beta + \epsilon$, the component fields are images of $\epsilon_j / \sqrt{\mathrm{Var}\{\epsilon_j\}}$, where $\epsilon_j$ is scan $j$’s error. That is, the component fields are the unobservable, mean zero, unit variance Gaussian noise images that underlie the observed data.

Estimation of component field smoothness is performed on standardized residual images,²⁷ not the statistic image itself. The statistic image is not used because it will generally contain signal, increasing roughness and decreasing estimated smoothness. Additionally, except for the Z statistic, the smoothness of the null statistic image will be different from that of the component fields. For example, see Ref. 21, Equation (7) and Ref. 26, appendix G for the relationship between t and component field smoothness.
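The reparameterization from roughness to FWHM and RESELs in equations (26) to (28) can be checked numerically. The Python sketch below uses hypothetical function names and an illustrative search region of $32^3$ voxels with isotropic three-voxel FWHM; it assumes the Gaussian-kernel parameterization described above and the $D = 3$ result of equation (21), without edge corrections.

```python
import math


def fwhm_from_sigma(sigma):
    """FWHM of a Gaussian kernel with standard deviation sigma."""
    return sigma * math.sqrt(8.0 * math.log(2.0))


def roughness_from_sigma(sx, sy, sz):
    """|Lambda|^(1/2) from kernel standard deviations, equation (26)."""
    return 2.0 ** -1.5 / (sx * sy * sz)


def roughness_from_fwhm(fx, fy, fz):
    """|Lambda|^(1/2) from kernel FWHMs, equation (27)."""
    return (4.0 * math.log(2.0)) ** 1.5 / (fx * fy * fz)


def resel_volume(volume, fx, fy, fz):
    """Search volume measured in RESELs, equation (28)."""
    return volume / (fx * fy * fz)


def expected_ec(u, volume, roughness):
    """Expected Euler characteristic for D = 3, equation (21)."""
    return volume * roughness * (u * u - 1.0) * math.exp(-u * u / 2.0) / (2.0 * math.pi) ** 2


# Illustrative values: 32^3 voxels, isotropic kernel with FWHM of 3 voxels
volume = 32 ** 3
sigma = 3.0 / math.sqrt(8.0 * math.log(2.0))     # sigma giving FWHM = 3
fwhm = fwhm_from_sigma(sigma)

rough_sigma = roughness_from_sigma(sigma, sigma, sigma)
rough_fwhm = roughness_from_fwhm(fwhm, fwhm, fwhm)
r3 = resel_volume(volume, fwhm, fwhm, fwhm)

# The same expected Euler characteristic, written via RESELs only
u = 4.5
ec_direct = expected_ec(u, volume, rough_fwhm)
ec_resels = r3 * (4.0 * math.log(2.0)) ** 1.5 * (u * u - 1.0) * math.exp(-u * u / 2.0) / (2.0 * math.pi) ** 2
```

The agreement of `ec_direct` and `ec_resels` reflects the point made in the text: for a fixed threshold, equation (21) depends on the search region and smoothness only through the RESEL volume $R_3$.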
Edge corrections and unified RFT results. The original Gaussian RFT results (21) assumed a negligible chance of the excursion set $A_u$ touching the boundary of the search region $\Omega$.[21] If clusters did contact the surface of $\Omega$ they would have a contribution less than unity to $\chi_u$. Worsley developed correction terms to (21) to account for this effect.[17,18] These 'unified' results have the form

$$P^c_u = \sum_{d=0}^{D} R_d\,\rho_d(u) \qquad (29)$$

where $D$ is the number of dimensions of the data, $R_d$ is the $d$-dimensional RESEL measure and $\rho_d(u)$ is the Euler characteristic density. These results are convenient as they dissociate the terms that depend only on the topology of the search regions ($R_d$) from those that depend only on the type of statistical field ($\rho_d(u)$).

Nonstationarity and cluster size tests. For inferences on peak height, with the appropriate estimator of average smoothness,[23] equation (21) will be accurate in the presence of nonstationarity or variable smoothness. However, cluster size inference is greatly affected by nonstationarity. In a null statistic image, large clusters will be more likely in smooth regions and small clusters more likely in rough regions. Hence an incorrect assumption of stationarity will likely lead to an inflated false positive rate in smooth regions and reduced power in rough regions. As alluded to above, the solution is to deform space until stationarity holds (if possible[29]). Explicit application of this transformation is actually not required, and local roughness can be used to determine cluster sizes in the transformed space.[23,25,30]

RESEL Bonferroni. A common misconception is that the random field results apply a Bonferroni correction based on the RESEL volume.[31] They are actually quite different results. Using Mills' ratio, the Bonferroni corrected P-value can be seen to be approximately

$$P^c_{\mathrm{Bonf}} \propto V e^{-u^2/2}\,u^{-1} \qquad (30)$$

while the RFT P-value for 3D data is approximately

$$P^c_{\mathrm{RFT}} \propto R_3\,e^{-u^2/2}\,u^{2} \qquad (31)$$

where $R_3$ is the RESEL volume (28). Replacing $V$ with $R_3$ obviously does not align these two results, nor are they even proportional. We will characterize the performance of a naive RESEL Bonferroni approach in Section 4.

Gaussianized t images. Early implementations of random field methods (e.g., SPM96 and previous versions) used Gaussian RFT results on $t$ images. While an image of $t$ statistics can be converted into $Z$ statistics with the probability integral transform, the resulting process is not a $t$ random field. Worsley[17] found that the degrees of freedom would have to be quite high, as many as 120, for a $t$ field to behave like a Gaussian field. We will examine the performance of the Gaussianized $t$ approach with simulations.
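The contrast between equations (30) and (31) can be illustrated numerically. The sketch below is only an illustration under stated assumptions: the RFT value uses the standard leading 3D term for the expected Euler characteristic of a Gaussian field, $R_3 (4\ln 2)^{3/2} (2\pi)^{-2} (u^2 - 1) e^{-u^2/2}$, which we assume matches the form of equation (21); the function names and the choice of a $32^3$ volume with three voxel FWHM smoothness are hypothetical. Neither approximation is clipped to 1, so the values are only meaningful when small.

```python
import math


def normal_upper_tail(u):
    """P(Z > u) for a standard normal variate."""
    return 0.5 * math.erfc(u / math.sqrt(2.0))


def bonferroni_p(u, n_tests):
    """Bonferroni bound on the corrected P-value for threshold u."""
    return n_tests * normal_upper_tail(u)


def rft_p_3d(u, resels):
    """Leading 3D term of the expected Euler characteristic (Gaussian field).

    Assumed form: R3 * (4 ln 2)^(3/2) / (2 pi)^2 * (u^2 - 1) * exp(-u^2 / 2).
    """
    density = ((4.0 * math.log(2.0)) ** 1.5 / (2.0 * math.pi) ** 2
               * (u * u - 1.0) * math.exp(-u * u / 2.0))
    return resels * density


n_voxels = 32 ** 3                 # hypothetical search volume in voxels
resels = n_voxels / 3.0 ** 3       # RESEL volume for 3-voxel isotropic FWHM

for u in (4.5, 5.0, 5.5):
    print(f"u={u}: Bonferroni={bonferroni_p(u, n_voxels):.2e}  "
          f"RFT={rft_p_3d(u, resels):.2e}  "
          f"RESEL-Bonferroni={bonferroni_p(u, resels):.2e}")
```

In this setting the ratio of the RFT to the Bonferroni value grows with $u$, as expected from the different powers of $u$ in (30) and (31), and the naive RESEL Bonferroni value is far smaller than either, which is why it cannot be relied on to control FWE.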

3.2.4 RFT conclusion

In this subsection we have tried to motivate the RFT results, as well as highlight important details of their application. They are a powerful set of tools for data that are smooth enough to approximate continuous random fields. When the data are insufficiently smooth, or when other assumptions are dubious, nonparametric resampling techniques may be preferred.

3.3 Resampling methods for FWE control

The central purpose of the random field methods is to approximate the upper tail of the maximal distribution ($F_{M_T}(t)$). Instead of making assumptions on the smoothness and distribution of the data, another approach is to use the data itself to obtain an empirical estimate of the maximal distribution. There are two general approaches, permutation-based and bootstrap-based. Excellent treatments of permutation tests[32,33] and the bootstrap[34,35] are available, so here we only briefly review the methods and focus on the differences between the two approaches and specific issues relevant to neuroimaging.

To summarize briefly, both permutation and bootstrap methods proceed by resampling the data under the null hypothesis. In the permutation test the data are resampled without replacement, while in a bootstrap test the residuals are resampled with replacement and a null dataset constituted. To preserve the spatial dependence the resampling is not performed independently voxel by voxel, but rather entire images are resampled as a whole. For each resampled null dataset, the model is fit, the statistic image computed, and the maximal statistic recorded. By repeating this process many times an empirical null distribution of the maximum, $\hat{F}_{M_T}$, is constructed, and the $100(1 - \alpha_0)$th percentile provides an FWE-controlling threshold. For a more detailed introduction to the permutation test for functional neuroimaging see Ref. 36.

We now consider three issues that have an impact on the application of resampling methods to neuroimaging.
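The resampling recipe summarized above (relabel whole images, refit, record the maximum, take a percentile) can be sketched in a few lines. This is a minimal illustration, not the SnPM implementation: it assumes a two-group design with a pooled-variance $t$ statistic, uses random relabelings rather than full enumeration, and all function names are hypothetical.

```python
import numpy as np


def t_image(data, labels):
    """Two-sample pooled-variance t statistic at every voxel.

    data has shape (n_scans, n_voxels); labels is a 0/1 group vector.
    """
    a, b = data[labels == 0], data[labels == 1]
    na, nb = len(a), len(b)
    pooled = ((na - 1) * a.var(axis=0, ddof=1)
              + (nb - 1) * b.var(axis=0, ddof=1)) / (na + nb - 2)
    return (b.mean(axis=0) - a.mean(axis=0)) / np.sqrt(pooled * (1 / na + 1 / nb))


def max_t_permutation_threshold(data, labels, alpha=0.05, n_perm=1000, seed=0):
    """Empirical distribution of the maximum statistic and its FWE threshold."""
    rng = np.random.default_rng(seed)
    max_stats = np.empty(n_perm)
    # The observed labeling counts as one of the relabelings.
    max_stats[0] = t_image(data, labels).max()
    for i in range(1, n_perm):
        # Relabel whole images: every voxel shares the same permutation,
        # which preserves the spatial dependence between voxels.
        max_stats[i] = t_image(data, rng.permutation(labels)).max()
    # 100(1 - alpha)th percentile of the maximum distribution.
    k = int(np.ceil((1 - alpha) * n_perm)) - 1
    return np.sort(max_stats)[k], max_stats


# Small null example: 12 scans in two groups of 6, 500 voxels.
rng = np.random.default_rng(1)
labels = np.repeat([0, 1], 6)
data = rng.standard_normal((12, 500))
threshold, max_stats = max_t_permutation_threshold(data, labels, n_perm=200)
print(threshold)
```

Voxels whose observed statistic exceeds `threshold` would be declared significant at FWE level `alpha`; because the threshold is a percentile of the maximum, no smoothness estimate is needed.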

3.3.1 Voxel-level statistic and homogeneity of specificity and sensitivity

The first issue we consider is common to all resampling methods used to find the maximum distribution and an FWE threshold. While a parametric method will assume a common null distribution for each voxel in the statistic image, $F_{0,T_i} = F_0$, FWE resampling methods are actually valid regardless of whether the null distributions $\{F_{0,T_i}\}$ are the same. This is because the maximum distribution captures any heterogeneity; as equation (4) shows, the relationship between the maximum and FWE makes no assumption of homogeneous voxel null distributions. Nonparametric methods will accurately reflect the maximum distribution of whatever statistic is chosen, and produce valid estimates of FWE-controlling thresholds.

However, once an FWE-controlling threshold is chosen, the false positive rate and power at each voxel depend on each voxel's distribution. As shown in Figure 2, FWE can be controlled overall, but if the voxel null distributions are heterogeneous, the Type I error rate will be higher at some voxels and lower at others. As a result, even though nonparametric methods admit the use of virtually any statistic (e.g., raw percent change, or mean difference), we prefer a voxel-level statistic that has a common null distribution $F_0$ (e.g., $t$ or $F$). Hence the usual statistics motivated by parametric theory are used to provide a more pivotal $T_i$ than an un-normalized statistic. Note that even though the statistic image may use a parametric statistic, the FWE-corrected P-values are nonparametric.

Figure 2 Impact of heterogeneous null distributions on FWE control. Shown are the null distributions for five independent voxels, the null distribution of the maximum of the five voxels, and the 5% FWE thresholds. (a) Use of the mean difference statistic allows variance to vary from voxel to voxel, even under the null hypothesis. Voxel 2 has relatively large variance and shifts the maximum distribution to the right; the risk of Type I error is largely due to voxel 2, and, in contrast, voxel 3 will almost never generate a false positive. (b) If a t statistic is used the variance is standardized but the data may still exhibit variable skew. This would occur if the data are not Gaussian and have heterogeneous skew. Here voxels 2 and 4 bear most of the FWE risk. (c) If the voxel-level null distributions are homogeneous (e.g. if the t statistic is used and the data are Gaussian) there will be uniform risk of false positives. In all three of these cases the FWE is controlled, but the risk of Type I error may not be evenly distributed.
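The situation in panel (a) of Figure 2 can be mimicked with a short Monte Carlo sketch. The per-voxel standard deviations below are hypothetical values chosen only to resemble the figure (voxel 2 with large variance, voxel 3 with small variance); the null statistics are taken to be independent zero-mean Gaussians.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical null standard deviations for five independent voxels.
sds = np.array([1.0, 2.0, 0.5, 1.0, 1.0])
n_sim = 200_000

# Null realizations of an un-normalized statistic at each voxel.
null_stats = rng.normal(0.0, sds, size=(n_sim, sds.size))

# FWE threshold: 95th percentile of the maximum over voxels.
max_stat = null_stats.max(axis=1)
threshold = np.quantile(max_stat, 0.95)

fwe = (max_stat > threshold).mean()                 # overall familywise error
per_voxel = (null_stats > threshold).mean(axis=0)   # voxel-wise Type I error

print("FWE:", fwe)
print("per-voxel false positive rates:", per_voxel)
```

The familywise rate sits at the nominal 5%, but nearly all of it is carried by the high-variance voxel, while the low-variance voxel essentially never exceeds the threshold.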

3.3.2 Randomization versus permutation versus bootstrap

The permutation and bootstrap tests are most readily distinguished by their sampling schemes (without versus with replacement). However, there are several other important differences, and subtle aspects to the justification of the permutation test. These are summarized in Table 4.

Table 4 Differences between randomization, permutation and bootstrap tests

              Randomization            Permutation           Bootstrap
P-values      Exact                    Exact                 Asymptotic/approximate
Assumption    Randomized experiment    H0-exchangeability    I.I.D.
Inference     Sample only              Population            Population
Models        Simple                   Simple                General

A permutation test requires an assumption of exchangeability under the null hypothesis. This is typically justified by an assumption of independence and identical distribution. However, if a randomized design is used, no assumptions on the data are required at all. A randomization test uses the random selection of the experimental design to justify the resampling of the data (or, more precisely, the relabeling of the data). While the permutation test and randomization test are often referred to by the same name, we find it useful to distinguish them. As the randomization test supposes no population, the resulting inferences are specific to the sample at hand. The permutation test, in contrast, uses a population distribution to justify resampling and hence makes inference on the population sampled.[37]

A strength of randomization and permutation tests is that they exactly control the false positive rate. Bootstrap tests are only asymptotically exact, and each particular type of model should be assessed for specificity of the FWE thresholds. We are unaware of any studies of the accuracy of the bootstrap for the maximum distribution in functional neuroimaging. Further, the permutation test allows the slightly more general condition of exchangeability, in contrast to the bootstrap's independent and identically distributed assumption.

The clear advantage of the bootstrap is that it is a general modeling method. With the permutation test, each type of model must be studied to find the nature of exchangeability under the null hypothesis. And some data, such as positive one-sample data (i.e., not difference data), cannot be analysed with a permutation test, as the usual statistics are invariant to permutations of the data. The bootstrap can be implemented generally for a broad array of models. While we do not assert that bootstrap tests are automatic, and indeed general linear model design matrices can be found where the bootstrap performs horribly (see Ref. 35, p. 276), it is a more flexible approach than the permutation test.

3.3.3 Exchangeability and fMRI time series

Functional magnetic resonance imaging (fMRI) data is currently the most widely used functional neuroimaging modality. However, fMRI time series exhibit temporal autocorrelation that violates the exchangeability/independence assumption of the resampling methods. Three strategies to deal with temporal dependence have been applied: do nothing, resample ignoring autocorrelation;[38] use a randomized design and randomization test;[39] and decorrelate and then resample.[40–42] Ignoring the autocorrelation in parametric settings tends to inflate significance due to biased variance estimates; with nonparametric analyses there may be either inflated or deflated significance depending on the resampling schemes. In a randomization test the data is considered fixed, and hence any autocorrelation is irrelevant to the validity of the test (power surely does depend on the autocorrelation and the resampling scheme, but this has not been studied to our knowledge). The preferred approach is the last one. The process consists of fitting a model, estimating the autocorrelation with the residuals, decorrelating the residuals, resampling, and then recorrelating the resampled residuals, creating null hypothesis realizations of the data. The challenges of this approach are the estimation of the autocorrelation and the computational burden of the decorrelation–recorrelation process. To have an exact permutation test the residuals must be exactly whitened, but this is impossible without the true autocorrelation. However, in simulations and with real null data, Brammer and colleagues found that the false positive rate was well controlled.[40] To reduce the computational burden, Fadili and Bullmore[43] proposed performing the entire analysis in the whitened (i.e., wavelet) domain.
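The decorrelate, resample, recorrelate sequence can be sketched for a single voxel as follows. This is an illustration under strong assumptions that are ours, not those of the cited studies: the autocorrelation is modeled as first-order autoregressive, AR(1), the design matrix `X` holds only nuisance regressors (here an intercept), and the function name `ar1_null_realization` is hypothetical. The permutation index is passed in so that the same reordering can be shared by all voxels.

```python
import numpy as np


def ar1_null_realization(y, X, perm):
    """One null realization of a voxel time series under an assumed AR(1) model.

    y: (n,) time series; X: (n, p) nuisance design; perm: permutation of range(n).
    """
    # 1. Fit the model by least squares and form residuals.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    # 2. Estimate the lag-1 autocorrelation from the residuals.
    rho = np.dot(resid[1:], resid[:-1]) / np.dot(resid, resid)
    # 3. Decorrelate (whiten) the residuals.
    white = np.empty_like(resid)
    white[0] = resid[0] * np.sqrt(1.0 - rho ** 2)
    white[1:] = resid[1:] - rho * resid[:-1]
    # 4. Resample: permute the approximately exchangeable whitened residuals.
    shuffled = white[perm]
    # 5. Recorrelate by running the AR(1) recursion forward.
    colored = np.empty_like(shuffled)
    colored[0] = shuffled[0] / np.sqrt(1.0 - rho ** 2)
    for t in range(1, len(shuffled)):
        colored[t] = rho * colored[t - 1] + shuffled[t]
    # Null data: nuisance fit plus recorrelated residuals.
    return X @ beta + colored, rho


# Synthetic AR(1) series with true lag-1 correlation 0.5 around a mean of 10.
rng = np.random.default_rng(0)
n = 2000
noise = rng.standard_normal(n)
series = np.empty(n)
series[0] = noise[0]
for t in range(1, n):
    series[t] = 0.5 * series[t - 1] + noise[t]
y = 10.0 + series
X = np.ones((n, 1))

y_null, rho_hat = ar1_null_realization(y, X, rng.permutation(n))
print(rho_hat)
```

Because the whitening uses an estimated rather than the true autocorrelation, the resulting test is only approximately exact, which is the limitation noted in the text.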

3.3.4 Resampling conclusion

Nonparametric permutation and bootstrap methods provide estimation of the maximum distribution without strong assumptions, and without inequalities that loosen with increasing dependence. Only their computational intensity and lack of generality preclude their widespread use.

4 Evaluation of FWE methods

We evaluated methods from the three classes of FWE-controlling procedures. Of particular interest is a comparison of random field and resampling methods, permutation in particular. In earlier work[36] comparing permutation and RFT methods on small group PET and fMRI data, we found the permutation method to be much more sensitive, and the RFT method comparable to Bonferroni. The present goal is to examine more datasets to see if those results generalize, and to examine simulations to discover if the RFT method is intrinsically conservative or if specific assumptions did not hold in the datasets considered. In particular, we seek the minimum smoothness required by the random field theory methods to perform well. We also investigate if two of the extended Bonferroni methods enhance the sensitivity of Bonferroni.

4.1 Real data results

We applied Bonferroni-related, random field and permutation methods to nine fMRI group datasets and two PET datasets. All data were analysed with a mixed effect model based on summary statistics.[44] This approach consists of fitting intrasubject general linear models on each subject, creating a contrast image of the effect of interest and assessing the population effect with a one-sample t test on the contrast images. The smoothness parameter of the random field results was estimated from the standardized residual images of the one-sample t.[27] Random field results were obtained with SPM99 (http://www.fil.ion.ucl.ac.uk/spm) and nonparametric results were obtained with SnPM99 (http://www.fil.ion.ucl.ac.uk/spm/snpm).

A detailed description of each dataset is omitted for reasons of space, but we summarize each briefly. Verbal Fluency is a five-subject PET dataset comparing a baseline of passive listening versus word generation as cued by single letter (complete dataset available at http://www.fil.ion.ucl.ac.uk/spm/data). Location Switching and Task Switching are two different effects from a 10-subject fMRI study of attention switching (Tor Wager et al., in preparation). Faces: Main Effect and Faces: Interaction are two effects (main effect data available at http://www.fil.ion.ucl.ac.uk/spm/data) from a 12-subject fMRI study of repetition priming.[45] Item Recognition is one effect from a 12-subject fMRI study of working memory.[46] Visual Motion is a 12-subject PET study of visual motion perception, comparing moving squares to fixed ones.[47]

Emotional Pictures is one effect from a 13-subject fMRI study of emotional processing, as probed by photographs of varying emotional intensity.[48] Pain: Warning, Pain: Anticipation and Pain: Pain are three effects from a 23-subject fMRI study of pain and the placebo effect (Tor Wager et al., in preparation).

Tables 5 and 6 show the results for the 11 datasets. Table 5 shows that for every dataset the permutation method has the lowest threshold, often dramatically so. Using either Bonferroni or permutation as a reference, the RFT becomes more conservative with decreasing degrees of freedom (DF), for example specifying a threshold of 4701.32 for a 4 DF analysis. The Bonferroni threshold is lower than the RFT threshold for all the low-DF studies. Only for the 22 DF studies is the RFT threshold below Bonferroni, although the two approaches have comparable thresholds for one of the 11 DF studies and the 12 DF study. The smoothness is greater than three voxel FWHM for all studies, except for the z-smoothness in the visual motion study. This suggests that a three voxel FWHM rule of thumb[19] is insufficient for low-DF t statistic images.

Degrees of freedom and not smoothness seems to be the biggest factor in the convergence of the RFT and permutation results. That is, RFT comes closest to permutation not when smoothness is particularly large (e.g., Task switching), but when degrees of freedom exceed 12 (e.g., the Pain datasets). This suggests a conservativeness in the low-DF RFT t results that is not explained by excessive roughness.

Comparing the 11 DF studies Item recognition and Visual motion is informative, as one has twice as many voxels and yet half as many RESELs. This situation results in Bonferroni being higher on Item recognition (9.80 versus 8.92) yet RFT being lower (9.87 versus 11.07). Item recognition has the lower permutation threshold (7.67 versus 8.40), suggesting that the resampling approach is adapting to greater smoothness despite the larger number of voxels.
Hochberg and Šidák are often infinity, indicating that no $\alpha_0 = 0.05$ threshold exists [i.e., no P-value satisfied equation (7)]. Also note that Hochberg and Šidák can be more stringent than Bonferroni even though their critical values $u_i$ are never less than $\alpha_0/V$. This occurs because the critical P-value falls below both $u_i$ and $\alpha_0/V$.

Table 5 Summary of FWE inferences for 11 PET and fMRI studies. 5% FWE thresholds for five different methods are presented: RFT, Bonferroni, Hochberg step-up, Šidák step-down and permutation. Note how RFT only outperforms other methods for studies with the largest degrees of freedom. Hochberg and Šidák's method rarely differs from Bonferroni by much. Permutation always has the lowest threshold

                                   Voxel FWHM       RESEL   FWE-corrected t threshold
Study                 DF   Voxels  smoothness      volume      RFT   Bonf.  Hoch.  Šidák  Perm.
Verbal fluency         4    55027  5.6  6.3  3.9    399.9  4701.32   42.59      ∞      ∞  10.14
Location switching     9    36124  6.1  5.9  5.1    196.8    11.17   10.31      ∞      ∞   5.83
Task switching         9    37181  6.4  6.9  5.2    161.9    10.79   10.35  10.29  10.29   5.10
Faces: main effect    11    51560  4.1  4.1  4.3    713.3    10.43    9.07   9.07   9.04   7.92
Faces: interaction    11    51560  3.8  3.9  4.0    869.8    10.70    9.07      ∞      ∞   8.26
Item recognition      11   110776  5.1  6.8  6.9    462.9     9.87    9.80   9.99   9.99   7.67
Visual motion         11    43724  3.9  4.4  2.2   1158.2    11.07    8.92   8.90   8.87   8.40
Emotional pictures    12    44552  5.6  5.4  5.0    294.7     8.48    8.41      ∞      ∞   7.15
Pain: warning         22    23263  4.7  4.9  3.5    288.6     5.93    6.05   6.09   6.04   4.99
Pain: anticipation    22    23263  5.0  5.1  3.6    253.4     5.87    6.05   6.07   6.07   5.05
Pain: pain            22    23263  4.6  4.8  3.4    309.9     5.95    6.05   6.05   6.05   5.15

Table 6 shows how, even though the permutation thresholds are always lower, it fails to detect any voxels in some studies. (As noted in Ref. 45, the Faces: Interaction effect is significant in an a priori region of interest.) While truth is unknown here, this is evidence of permutation's specificity. The last column of this table includes results using a smoothed variance t statistic, a means to boost degrees of freedom by 'borrowing strength' from neighboring voxels.[49,36] In all of these studies it increased the number of detected voxels, in some cases dramatically.

Table 6 Summary of FWE inferences for 11 PET and fMRI studies (continued). Shown are the number of significant voxels detected with the five methods discussed, along with permutation method on the smoothed variance t statistic

                      Number of significant voxels
                      t                                      Sm. Var t
Study                  RFT   Bonf.   Hoch.   Šidák   Perm.       Perm.
Verbal fluency           0       0       0       0       0           0
Location switching       0       0       0       0     158         354
Task switching           4       6       7       7    2241        3447
Faces: main effect     127     371     372     379     917        4088
Faces: interaction       0       0       0       0       0           0
Item recognition         5       5       4       4      58         378
Visual motion          626    1260    1269    1281    1480        4064
Emotional pictures       0       0       0       0       0           7
Pain: warning          127     116     116     118     221         347
Pain: anticipation      74      55      55      55     182         402
Pain: pain             387     349     350     353     732        1300

4.2 Simulation methods

We simulated 32 × 32 × 32 images, since a voxel count of $32^3 = 32\,768$ is typical for moderate resolution ($\approx 3\,\mathrm{mm}^3$) data. Smooth images were generated as Gaussian white noise convolved with a 3D isotropic Gaussian kernel of size 0, 1.5, 3, 6 and 12 voxels FWHM ($\sigma$ = 0, 0.64, 1.27, 2.55, 5.1). We did not simulate in the Fourier domain to avoid wrap-around effects, and to avoid edge effects of the kernel we padded all sides of the image by a factor of three FWHM, and then truncated after convolution. In total, 3000 realizations of one-sample t statistic images were simulated for three different $n$, $n$ = 10, 20, 30. Each t statistic image was based on $n$ realizations of smooth Gaussian random fields; to our knowledge there is no direct way to simulate smooth t fields. For each simulated dataset, a simple linear model was fit and residuals computed and used to estimate the smoothness, as in Ref. 27. To stress, for each realized dataset both the estimated and known smoothness were used for the random field inferences, allowing the assessment of this important source of variability. For each simulated dataset we computed a permutation test with 100 resamples. While the exactness of the permutation test is given by exchangeability holding in these examples, this serves to validate our code and support other work.
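The image generation step described above can be sketched as follows. This is a minimal NumPy illustration and not the code used for the reported simulations: the kernel is truncated at three FWHM, the separable convolution and the unit-variance rescaling are our choices, and the function names are hypothetical. The conversion $\sigma = \mathrm{FWHM}/\sqrt{8 \ln 2}$ reproduces the $\sigma$ values quoted in the text.

```python
import numpy as np


def gaussian_kernel_1d(fwhm):
    """Normalized 1D Gaussian kernel, truncated at three FWHM either side."""
    sigma = fwhm / np.sqrt(8.0 * np.log(2.0))
    radius = int(np.ceil(3 * fwhm))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    return kernel / kernel.sum()


def smooth_gaussian_field(shape, fwhm, rng):
    """White noise convolved with an isotropic Gaussian kernel (spatial domain)."""
    if fwhm == 0:
        return rng.standard_normal(shape)
    pad = int(np.ceil(3 * fwhm))          # pad every side by three FWHM
    field = rng.standard_normal(tuple(s + 2 * pad for s in shape))
    kernel = gaussian_kernel_1d(fwhm)
    for axis in range(field.ndim):        # separable convolution, axis by axis
        field = np.apply_along_axis(np.convolve, axis, field, kernel, mode="same")
    core = tuple(slice(pad, pad + s) for s in shape)
    # Truncate the padding and rescale to unit variance.
    return field[core] / np.sqrt(np.sum(kernel ** 2) ** len(shape))


def one_sample_t_image(n, shape, fwhm, rng):
    """One-sample t image built from n independent smooth Gaussian fields."""
    data = np.stack([smooth_gaussian_field(shape, fwhm, rng) for _ in range(n)])
    return data.mean(axis=0) / (data.std(axis=0, ddof=1) / np.sqrt(n))


rng = np.random.default_rng(0)
t_img = one_sample_t_image(10, (16, 16, 16), 3.0, rng)   # 9 DF, 3-voxel FWHM
print(t_img.shape, t_img.max())
```

Recording `t_img.max()` over many realizations gives a Monte Carlo estimate of the null maximum distribution against which any candidate threshold can be assessed.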

We also simulated Gaussian statistic images with the same set of smoothness, but we did not apply the permutation test nor estimate smoothness. For each realized statistic image we computed the Bonferroni, random field theory and permutation thresholds (except Gaussian) and noted the proportion of realizations for which the maximal statistic was above the threshold, which is the Monte Carlo estimate of the familywise error in these null simulations.

For each realization we also computed three other FWE procedures: an FDR threshold, a threshold based on Gaussianizing the t images, and a Bonferroni threshold using the estimated number of RESELs. To estimate the equivalent number of independent tests (see Section 2.5) we estimated $\theta$ with regression through the origin based on a transformation of equation (10):

$$\log\bigl(1 - F_{P_{(1)}}(t)\bigr) = \log(1 - t)\,V\theta \qquad (32)$$

We replace $F_{P_{(1)}}(t)$ with the empirical cumulative distribution function of the minimum P-value found under simulation. Because we are generally interested in $\alpha_0 = 0.05$ we only use values of $t$ such that $0.03 \le F_{P_{(1)}}(t) \le 0.07$.

4.3 Simulation results

Figure 3 shows the accuracy in the smoothness estimate. As in Ref. 27, we found positive bias for low smoothness, although for higher smoothness we found slight negative bias. Positive bias, or overestimation of smoothness, underestimates the degree of the multiple testing problem and can cause inferences to be anticonservative. (However, anticonservativeness was not a problem; see below.)

Figure 3 Smoothness estimation bias as function of smoothness. Bias (estimated minus true) is smaller for larger DF and larger FWHM smoothness. Overestimation (for low smoothness) could result in anticonservative inferences (though apparently does not; see other results).

Figure 4 shows the results using the estimated smoothness. Figure 4(a) shows the permutation and true results tracking closely, while the RFT results are very conservative, only approaching truth for very high smoothness. Bonferroni is of course not adaptive to smoothness, but is very close to truth for low smoothness, especially for low DF. The Gaussian results are much closer to truth than any of the t results (note the y-axis range). Figure 4(b) shows the familywise error rates, which magnify performance differences. RFT is seen to be severely conservative for all but extremely smooth data, and Bonferroni is indistinguishable from truth for FWHM of three or less with DF of 9 and 19. The permutation performance is consistent with its exactness. For six FWHM and above, the Gaussian result is close to nominal.

Figure 4 Simulation results. (a) FWE threshold found by three different methods compared to truth. The Bonferroni threshold is nonadaptive, while permutation and random field methods both use lower thresholds with higher smoothness. The low-smoothness conservativeness of the random field thresholds intensifies with decreasing degrees of freedom. (b) Rejection rate of null simulations for a nominal $\alpha_0 = 0.05$ threshold, with a pointwise Monte Carlo 95% confidence interval shown with fine dotted line. The random field theory results are valid, but quite conservative for all but high smoothnesses. Bonferroni results are surprisingly satisfactory for up to three voxels FWHM smoothness, but then become conservative.

Using true smoothness instead of estimated smoothness had little impact on the results. The rejection rates never differed by more than 0.003, except for the case of 9 DF and 12 FWHM, where it increased the rejection rate by 0.0084.

Figure 5 plots the cumulative distribution functions (CDFs) of the minimum P-value found by simulation, and compares it to other methods for approximating or bounding FWE. The CDF approximation provided by Bonferroni is the same for all figures, since the number of voxels is fixed. The RFT approximation (dash–dot line) changes with smoothness, but is far from the true CDF for low smoothness and low DF; critically, for any given FWHM and DF, the RFT results do not improve with a more stringent (decreasing P-value) threshold. This indicates that the poor RFT performance is not due to use of an insufficiently high threshold. Finally, note that the CDF of an equivalent independent number (EIN) of observations (dashed line) follows the true CDF quite well for moderate smoothness, but at high smoothness it has the wrong slope and cannot match the CDF in general [as predicted by equations (30) and (31)]. That the EIN approach performs so well for moderate smoothness suggests that it may yet be a tenable theoretical approach.

For 9 DF point estimates for $\theta$ were found to be 0.90, 0.94, 0.87, 0.043 and 0.06 for 0, 1.5, 3, 6, 12 voxel FWHM smoothness, respectively. While Figure 5 indicates that the EIN approach is inappropriate for high smoothness, for three voxel FWHM smoothness a $32^3$ voxel $t_9$ image has the same FWE threshold as $\theta V = 0.87 \times 32^3 = 28\,623$ independent voxels.

Figure 5 Approximating minimum P-value distributions with FWE methods. The minimum P-value CDF obtained by simulation ('Truth', solid line with dots) is compared to three different approximations: Bonferroni inequality ('Bonf', solid line), random field theory ('RFT', dot–dashed line) and the equivalent independent n (EIN, dashed line); a corrected P-value of 0.05 is indicated (horizontal dotted line). These plots reflect the findings in Figure 4: Bonferroni is accurate for data as smooth as 1.5 FWHM data; RFT is more conservative than Bonferroni for data as smooth as three FWHM, and for six FWHM for 6 DF. While Figure 4 only depicted results for $\alpha_0 = 0.05$, note that for a given smoothness and DF the RFT results do not improve with more stringent thresholds (less than 0.05 corrected). For the 12 FWHM smoothness data the RFT results are quite accurate, and provide a better approximation than the equivalent independent n approach. Particularly for 9 DF and 12 FWHM, note that the EIN approach fails to have the correct slope (it intersects the true CDF near a corrected P-value of 0.05 by construction; see Section 2.5).

Figure 6 Comparison of other FWE methods. The RESEL–Bonferroni approach fails to control FWE for any smoothness considered. The Gaussianized T approach does not reliably control FWE, in particular being anticonservative for smooth, low DF images. FDR does control FWE (weakly), but becomes somewhat conservative for increasing smoothness. Fine dotted line indicates pointwise Monte Carlo 95% confidence interval.

Figure 6 shows the performance of three alternative methods. The RESEL Bonferroni approach fails to control FWE, and for moderate to high smoothness exceeded an FWE of 0.5 (off the plot, not shown). The Gaussianized t method exhibits conservativeness for low smoothness, but for low DF it is anticonservative, suggesting it would be inappropriate to use for all but the high DF. In this complete null simulation, Benjamini and Hochberg's FDR controls FWE, although it becomes somewhat conservative for increasing smoothness.
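The regression through the origin used to estimate $\theta$ from equation (32) (Section 4.2) can be sketched as below. This is an illustration under stated assumptions: the function name `estimate_theta` is hypothetical, the empirical CDF is evaluated at the sorted simulated minimum P-values, and the check uses independent uniform P-values, for which $\theta$ should be close to 1.

```python
import numpy as np


def estimate_theta(min_pvalues, n_voxels, lo=0.03, hi=0.07):
    """Regression-through-origin estimate of theta in equation (32).

    min_pvalues: simulated minimum P-values, one per null realization.
    Only points whose empirical CDF lies in [lo, hi] are used.
    """
    t = np.sort(np.asarray(min_pvalues))
    ecdf = np.arange(1, t.size + 1) / t.size        # empirical F_{P(1)}(t)
    keep = (ecdf >= lo) & (ecdf <= hi)
    x = n_voxels * np.log1p(-t[keep])               # V log(1 - t)
    y = np.log1p(-ecdf[keep])                       # log(1 - F_{P(1)}(t))
    return np.sum(x * y) / np.sum(x * x)            # slope with no intercept


# Sanity check with V independent tests per realization.
rng = np.random.default_rng(0)
n_voxels, n_sim = 500, 10_000
min_p = rng.random((n_sim, n_voxels)).min(axis=1)
theta_hat = estimate_theta(min_p, n_voxels)
print(theta_hat, theta_hat * n_voxels)
```

The product `theta_hat * n_voxels` is the equivalent number of independent tests; for smooth images one would pass the minimum P-values from the null simulations instead of independent uniforms.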

4.4 Results discussion

While some authors have observed RFT conservativeness,[36,50,51] others have not.[2,26] However, our findings are consistent with the literature, because the authors that found RFT results to be accurate used Gaussian data with high smoothness. For example, Worsley et al.[2] found the expected $\chi_u$ was quite accurate on Z images, but the smoothness of their data was approximately 10 voxels FWHM. Our Gaussian simulations are consistent with this, and, for all but the lowest DF, our t simulations also suggest that 10 FWHM is sufficient.

With our real data studies the permutation method was found to be more sensitive in all 11 datasets. This is consistent with our simulations, in particular that the RFT method was increasingly conservative for shrinking degrees of freedom. By conventional standards in functional neuroimaging our real data would be considered quite smooth (4–6 voxel FWHM), but our simulations indicate this is still insufficient for accurate RFT thresholds.

As a note on the selection of these datasets, they represent a three-year process of collecting group-level fMRI and PET datasets. The only data omitted were other effects from the studies included, usually other nonorthogonal contrasts with qualitatively similar results. In five years of applying these methods we have never seen a small DF dataset (< 10) where the t random field method outperforms the permutation test.

5 Discussion

We have attempted to provide a comprehensive review and a representative comparison of FWE methods for functional neuroimaging. From Bonferroni and its extensions, to cutting-edge random field theory methods, to permutation methods of Fisher, we have attempted to cull all available tools that are relevant for the massive, dependent data of functional neuroimaging. With an assumption of positive dependence, we can make use of slightly improved Bonferroni methods. With an assumption of smoothness, we can make use of smoothness-adaptive RFT methods. And with few assumptions at all and some computational effort, we have both an adaptive and powerful method.

There are several limitations of these findings. First, yet more datasets should be studied, over yet a wider range of smoothnesses and group sizes. We have focused on very small group data to demonstrate a suspected conservativeness of RFT methods. However, more moderate group sizes are needed to see exactly when RFT methods lose power. Secondly, more simulations are needed for larger volumes, and for more realistically shaped search regions. Our 32-cubed volume is too small when $1\,\mathrm{mm}^3$ voxels are used and does not reflect the wrinkled-ellipsoidal topology of real brain data. And finally, the computational burden of the permutation tests must be considered, along with the flexibility of a general linear modeling tool combined with RFT inference.

Acknowledgements

The authors would like to thank Keith Worsley for many valuable conversations on random field theory, and especially Tor Wager for help with the pain data. The authors wish to thank all of the individuals who contributed data to our evaluations.

References

1 Hochberg Y, Tamhane AC. Multiple comparison procedures. New York: Wiley, 1987.
2 Worsley K, Evans A, Marrett S, Neelin P. A three-dimensional statistical analysis for CBF activation studies in human brain. Journal of Cerebral Blood Flow & Metabolism 1992; 12: 900–18.
3 Friston KJ, Frith CD, Liddle PF, Frackowiak RSJ. Comparing functional (PET) images: the assessment of significant change. Journal of Cerebral Blood Flow & Metabolism 1991; 11: 690–99.
4 Storey JD. A direct approach to false discovery rates. Journal of the Royal Statistical Society, Series B 2002; 64: 479–498.
5 McShane LM. Statistical issues in the analysis of microarray data. In: Proceedings of the International Biometrics Society, Freiburg, Germany, July 2002.
6 Korn DL, Troendle JF, McShane LM, Simon R. Controlling the number of false discoveries: application to high-dimensional genomic data, Technical report. Biometric Research Branch, National Cancer Institute, Bethesda, Maryland 20892 USA, August 2001.
7 Westfall PH, Young SS. Resampling-based multiple testing: examples and methods for p-value adjustment. New York: Wiley, 1993.
8 Tong Y. Probability inequalities in multivariate distributions. New York: Academic Press, 1980.
9 Holm S. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 1979; 6: 65–70.
10 Hochberg Y. A sharper Bonferroni procedure for multiple tests of significance. Biometrika 1988; 75: 800–802.
11 Simes RJ. An improved Bonferroni procedure for multiple tests of significance. Biometrika 1986; 73: 751–54.
12 Benjamini Y, Hochberg Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, Methodological 1995; 57: 289–300.
13 Holland BS, Copenhaver MD. An improved sequentially rejective Bonferroni test procedure (Corr: V43 p. 737). Biometrics 1987; 43: 417–23.
14 Benjamini Y, Yekutieli D. The control of the false discovery rate in multiple testing under dependency. Annals of Statistics 2001; 29(4): 1165–88.
15 Sarkar SK. Some results on false discovery rate in stepwise multiple testing procedures. Annals of Statistics 2002; 30(1): 239–57.
16 Sarkar SK. Recent advances in multiple testing. Technical report. Philadelphia: Temple University, 2002.
17 Worsley KJ, Marrett S, Neelin P, Vandal AC, Friston KJ, Evans AC. A unified statistical approach for determining significant signals in images of cerebral activation. Human Brain Mapping 1995; 4: 58–73.
18 Cao J, Worsley KJ. Spatial statistics: methodological aspects and applications, chapter 8. Applications of random fields in human brain mapping. Lecture Notes in Statistics; v. 159. New York: Springer, 2001; pp 169–82.
19 Petersson KM, Nichols TE, Poline J-B, Holmes AP. Statistical limitations in functional neuroimaging II. Signal detection and statistical inference. Philosophical Transactions of the Royal Society, Series B 1999; 354: 1261–81.

20 Adler RJ. The geometryo frandomŽelds. New thresholding. Journalo fComputerAided York: Wiley,1981. Tomography 2000; 24(1):128–3 8. 21WorsleyKJ, Evans AC,Ma rrettS, NeelinP. A 32 Good P. Permutationtests.Apracticalguide three-dimensionalstatistica lanalysisfor CBF toresamp lingmetho ds fortes tinghyp otheses. activationstudies in humanbrain. Journal of NewYork: SpringerVerla g,1 994. CerebralBlo odFlow&Metabolism 1992; 33PeasarinF. Multivariatepermutatio ntests: 12(6):900–1 8. with applications in biostatistics. New York: 22PolineJB, WorsleyKJ, EvansA C,F ristonK J. Wiley,2002. Combining spatialextenta nd peakintensity 34Efron B, TibshiraniR. An introductionto the to testfor activations in functionalimaging. bootstrap. BocaRa ton: Chapman&Hall, NeuroImage 1997; 5(2): 83–96. 1993. 23WorsleyKJ, Andermann M,Koulis T, 35DavisonA C,Hinkley DV. Bootstrapmetho ds MacDonald D,Evans A C.Detecting changes and theirapp lication. Cambridge:Cambridge in nonisotropic images. Human Brain UniversityPress, 1 997. Mapping 1999; 8: 98–101. 36NicholsTE ,HolmesAP .Nonparametric 24PolineJ-B, Worsley KJ, HolmesAP, permutationtests for functiona l FrackowiakRSJ,Friston K J.Estimating neuroimaging:a primerwith examples. smoothnessin statisticalparametricma ps: Human BrainMapp ing 2001; 15: 1–25. variabilityof p values. Journalo fComputer 37Scheffe´H.Sta tisticalinference in thenon- AssistedTo mography 1995; 19: 788–96. parametricca se. Annals ofMathematical 25WorsleyKJ. Non-stationary FWHM and its Statistics 1947; 14: 304–32. effecton statisticalinferenceof fMRIdata. 38BelmonteM, Yurgelun-Todd D.Permutation Presentedat the 8th Internatio nalCo nference testingma de practicalfor functiona lmagnetic onFunctionalMap pingo ftheHuman Brain , resonanceima geana lysis. IEEE Transactions 2–6 June, 2002,Sendai,Ja pan. Availableon onMedicalImaging 2001; 20: 243–48. CDRom. NeuroImage 2002; 16(2):7 79–80. 39Liu C,Raz J, Turetsky B. An estimatora nd 26HolmesA P. 
Statisticalis suesin functional permutationtest for single-trial fM RIda ta. brainmap ping ,PhD thesis.U niversityof In Abstractso fENAR Meetingof the Glasgow,G lasgow,1 994.Availablefrom InternationalBiometric So ciety ,Pittsburgh, http:==www.Žl.ion.ucl.a c.uk =spm= March 1998. papers=APH_thesis. 40BrammerMJ, BullmoreE T,Simmons A et al. 27KiebelS, Poline J,FristonK, Holmes A, Genericbra in activationma pping in functional WorsleyK. Robust smoothnessestima tion in mri:a nonparametrica pproach. Magnetic statisticalparametricmaps using sta ndardized ResonanceImaging 1997; 15: 763–70. residualsfrom the genera llinearmodel. 41LocascioJJ, JenningsPJ, Moore CI,Corkin S. NeuroImage 1999; 10: 756–66. Timeseries a nalysisin thetime doma in and 28WorsleyKJ. Estimatingthe number of peaks resamplingmethods forstudies of functional in arandom Želdusing the Ha dwiger magneticresona ncebrain ima ging. Human characteristicof excursionsets, with BrainMapp ing 1997; 3: 168–93. applicationsto medicalimages. Annals of 42BullmoreE, Long C,Suckling J. Colored noise Statistics 1995; 23: 640–69. and computationalinferencein 29Sampson PD,G uttorp P. Nonparametric neurophysiological(fMRI)time series estimation of nonstationary spatialcovariance analysis:resa mplingmethods in timea nd structure. Journalof theAmeric anStatis tical waveletdoma ins. Human BrainMap ping Association 1992; 87: 108–19. 2001; 12: 61–78. 30Hayasaka S,NicholsTE. Aresel-basedcluster 43FadiliJ, BullmoreET. Wavelet-generalised sizepermuta tion testfor non-sta tionary leastsquares: a new BLUestimatorof images. Presentedat the 8 th International regressionmodels with long-memory errors. Conferenceon FunctionalMap pingo fthe NeuroImage 2001; 15: 217–32. Human Brain,2–6June, 2002,Sendai,Ja pan. 44HolmesA P, FristonKJ. Generalisability, Availableon CDRom. NeuroImage 2002; random effects& population inference. 16(2):1062–63. NeuroImage 1999; 7(4) S754. Proceedings 31Dinov ID,Mega MS, Thompson PM et al. 
ofFourth InternationalCo nferenceo n Analyzingfunctiona lbrain imagesin a FunctionalMap pingof theHuman Brain, probabilistica tlas:a validation of sub-volume 7–1 2June, 1998,Montreal,Ca nada. 446 T Nichols et al.

45Henson RNA,ShalliceT, Gorno-Tempini extended amygdalaby individualratingsof ML, DolanRJ. Facerepetition effects in emotionalarousal:An fMRIstudy . Biological implicita nd explicitmemory testsas measured Psychiatry 2003; 53: 211–15. by fMRI. CerebralCortex 2002; 12: 178–86. 49HolmesA P, BlairRC, WatsonJDG ,Ford I. 46MarshuetzC, Smith EE ,Jonides J,DeGutisJ, Nonparametricana lysisof statisticima ges ChenevertTL. Orderinformation in fromfunctiona lmapping experiments. Journal working memory: fMRIevidencefor pa rietal ofCerebralBloo dFlow&Metabolism 1996; and prefrontalmechanisms. Journal of 16(1): 7–22. CognitiveNeuros cience 2000; 12(S2): 50StoecklJ, Poline J-B,Malandain G,Ayache N, 130–44. DarcourtJ. Smoothnessand degreesof 47Watson JDG,MyersR, FrackowiakRSJ et al. freedomrestrictions when usingspm9 9. AreaV5 of thehuma nbrain:evidence from a NeuroImage 2001; 13: S259. combined study usingpositron emission 51SinghKD, BarnesGR, HillebrandA. G roup tomography and magneticresona nceima ging. imagingof task-relatedchanges in cortical CerebralCo rtex 1993; 3: 79–94. synchronisation usingnon-parametric 48PhanKL,Taylor SF,Welsh RC et al. permutationtesting. NeuroImage 2003; 19: Activation of themedial prefronta lcortexa nd 1589–1601.