Controlling the Familywise Error Rate in Functional Neuroimaging

Statistical Methods in Medical Research 2003; 12: 419–446

Controlling the familywise error rate in functional neuroimaging: a comparative review

Thomas Nichols and Satoru Hayasaka
Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA

Functional neuroimaging data embodies a massive multiple testing problem, where 100 000 correlated test statistics must be assessed. The familywise error rate, the chance of any false positives, is the standard measure of Type I errors in multiple testing. In this paper we review and evaluate three approaches to thresholding images of test statistics: Bonferroni, random field and the permutation test. Owing to recent developments, improved Bonferroni procedures, such as Hochberg's methods, are now applicable to dependent data. Continuous random field methods use the smoothness of the image to adapt to the severity of the multiple testing problem. Also, increased computing power has made both permutation and bootstrap methods applicable to functional neuroimaging. We evaluate these approaches on t images using simulations and a collection of real datasets. We find that Bonferroni-related tests offer little improvement over Bonferroni, while the permutation method offers substantial improvement over the random field method for low smoothness and low degrees of freedom. We also show the limitations of trying to find an equivalent number of independent tests for an image of correlated test statistics.

1 Introduction

Functional neuroimaging refers to an array of technologies used to measure neuronal activity in the living brain. Two widely used methods, positron emission tomography (PET) and functional magnetic resonance imaging (fMRI), both use blood flow as an indirect measure of brain activity.
An experimenter images a subject repeatedly under different cognitive states and typically fits a massively univariate model. That is, a univariate model is independently fit at each of hundreds of thousands of volume elements, or voxels. Images of statistics are created that assess evidence for an experimental effect. Naive thresholding of 100 000 voxels at a 5% threshold is inappropriate, since 5000 false positives would be expected in null data.

False positives must be controlled over all tests, but there is no single measure of Type I error in multiple testing problems. The standard measure is the chance of any Type I errors, the familywise error rate (FWE). A relatively new development is the false discovery rate (FDR) error metric, the expected proportion of rejected hypotheses that are false positives. FDR-controlling procedures are more powerful than FWE procedures, yet still control false positives in a useful manner. We predict that FDR may soon eclipse FWE as the most common multiple false positive measure. In light of this, we believe that this is a choice moment to review FWE-controlling measures. (We prefer the term multiple testing problem over multiple comparisons problem. 'Multiple comparisons' can allude to pairwise comparisons on a single model, whereas in imaging a large collection of models is each subjected to a hypothesis test.)

Address for correspondence: Thomas Nichols, Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA. © Arnold 2003 10.1191/0962280203sm341ra

In this paper we attempt to describe and evaluate all FWE multiple testing procedures useful for functional neuroimaging. Owing to the spatial dependence of functional neuroimaging data, there are actually quite a small number of applicable methods. The only methods that are appropriate under these conditions are Bonferroni, random field methods and resampling methods.
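
The arithmetic behind the remark on naive thresholding can be sketched in a few lines of Python. The voxel count and the 5% level are taken from the text; the function name `fwe_independent` is ours, and the independence assumption is made purely for illustration, since the tests in a real image are spatially dependent.

```python
# Illustrative arithmetic only; the figures come from the text above.
n_voxels = 100_000   # number of voxel-wise tests
alpha = 0.05         # uncorrected per-voxel threshold

# Expected number of false positives under the complete null.
expected_false_positives = n_voxels * alpha


def fwe_independent(n_tests, per_test_alpha):
    """Chance of at least one false positive among n_tests null tests.

    ASSUMPTION: the tests are independent. Real images violate this,
    so the value is only a rough guide.
    """
    return 1.0 - (1.0 - per_test_alpha) ** n_tests


# Uncorrected thresholding: a familywise error is all but certain.
fwe_uncorrected = fwe_independent(n_voxels, alpha)

# Bonferroni: test each voxel at alpha / n_voxels.
bonferroni_alpha = alpha / n_voxels
fwe_bonferroni = fwe_independent(n_voxels, bonferroni_alpha)

print(expected_false_positives, fwe_uncorrected, fwe_bonferroni)
```

Under these assumptions the uncorrected FWE is essentially one, while the Bonferroni-corrected FWE stays just below the nominal 5% level.
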
We limit our attention to finding FWE-corrected thresholds and P-values for Gaussian Z and Student t images. (An $\alpha_0$ FWE-corrected threshold is one that controls the FWE at $\alpha_0$, while an FWE-corrected P-value is the most significant $\alpha_0$ corrected threshold such that a test can be rejected.) In particular we do not consider inference on size of contiguous suprathreshold regions or clusters. We focus particular attention on low degrees-of-freedom (DF) t images, as these are unfortunately common in group analyses. The typical images have tens to hundreds of thousands of tests, where the tests have complicated, though usually positive dependence structure.

In 1987 Hochberg and Tamhane wrote: 'The question of which error rate to control ... has generated much discussion in the literature'.¹ While that could be true in the statistics literature in the decades before their book was published, the same cannot be said of neuroimaging literature. When multiple testing has been acknowledged at all, the familywise error rate has usually been assumed implicitly. For example, of the two papers that introduced corrected inference methods to neuroimaging, only one explicitly mentions familywise error.²,³ It is hoped that the introduction of FDR will enrich the discussion of multiple testing issues in the neuroimaging literature.

The remainder of the paper is organized as follows. We first introduce the general multiple comparison problem and FWE. We then review Bonferroni-related methods, followed by random field methods and then resampling methods. We then evaluate these different methods with 11 real datasets and simulations.

2 Multiple testing background in functional neuroimaging

In this section we formally define strong and weak control of familywise error, as well as other measures of false positives in multiple testing. We describe the relationship between the maximum statistic and FWE and also introduce step-up and step-down tests and the notion of equivalent number of independent tests.
2.1 Notation

Consider image data on a two- (2D) or three-dimensional (3D) lattice. Usually the lattice will be regular, but it may also be irregular, corresponding to a 2D surface. Through some modeling process we have an image of test statistics $T = \{T_i\}$. Let $T_i$ be the value of the statistic image at spatial location $i$, $i \in \mathcal{V} = \{1, \ldots, V\}$, where $V$ is the number of voxels in the brain. Let $H = \{H_i\}$ be a hypothesis image, where $H_i = 0$ indicates that the null hypothesis holds at voxel $i$, and $H_i = 1$ indicates that the alternative hypothesis holds. Let $H_0$ indicate the complete null case where $H_i = 0$ for all $i$. A decision to reject the null for voxel $i$ will be written $\hat{H}_i = 1$, not rejecting $\hat{H}_i = 0$. Write the null distribution of $T_i$ as $F_{0,T_i}$, and let the image of P-values be $P = \{P_i\}$. We require nothing about the modeling process or statistic other than the test being unbiased, and, for simplicity of exposition, we will assume that all distributions are continuous.

2.2 Measures of false positives in multiple testing

A valid $\alpha$-level test at location $i$ corresponds to a rejection threshold $u$ where $P\{T_i \geq u \mid H_i = 0\} \leq \alpha$. The challenge of the multiple testing problem is to find a threshold $u$ that controls some measure of false positives across the entire image. While FWE is the focus of this paper, it is useful to compare its definition with other measures. To this end consider the cross-classification of all voxels in a threshold statistic image of Table 1.

In Table 1 $V$ is the total number of voxels tested, $V_{\cdot|1} = \sum_i H_i$ is the number of voxels with a false null hypothesis, $V_{\cdot|0} = V - \sum_i H_i$ the number of true nulls, and $V_{1|\cdot} = \sum_i \hat{H}_i$ is the number of voxels above threshold, $V_{0|\cdot} = V - \sum_i \hat{H}_i$ the number below; $V_{1|1}$ is the number of suprathreshold voxels correctly classified as signal, $V_{1|0}$ the number incorrectly classified as signal; $V_{0|0}$ is the number of subthreshold voxels correctly classified as noise, and $V_{0|1}$ the number incorrectly classified as noise.
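
The cross-classification counts defined above can be tallied as in the following minimal Python sketch. The function name `cross_classify`, the dictionary keys and the toy arrays are hypothetical, chosen only to mirror the notation. The true hypothesis image $H$ is known only in simulations, so this kind of tally applies to simulated data, not to real data.

```python
def cross_classify(truth, decision):
    """Tally the cross-classification counts for a thresholded image.

    truth[i]    plays the role of H_i     (0 = null true, 1 = null false)
    decision[i] plays the role of H-hat_i (0 = not rejected, 1 = rejected)
    Keys follow the pattern "V_<decision>|<truth>".
    """
    if len(truth) != len(decision):
        raise ValueError("truth and decision must have the same length")
    counts = {"V_0|0": 0, "V_1|0": 0, "V_0|1": 0, "V_1|1": 0}
    for h, h_hat in zip(truth, decision):
        counts[f"V_{h_hat}|{h}"] += 1
    counts["V"] = len(truth)         # total number of voxels tested
    counts["V_.|1"] = sum(truth)     # voxels with a false null
    counts["V_1|."] = sum(decision)  # voxels above threshold
    return counts


# Hypothetical toy image of eight voxels, three of them truly active.
truth = [0, 0, 0, 0, 1, 1, 0, 1]
decision = [0, 1, 0, 0, 1, 0, 0, 1]
table = cross_classify(truth, decision)
print(table)
```

In this toy example one null voxel is rejected, so $V_{1|0} = 1$, and one active voxel is missed, so $V_{0|1} = 1$.
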
With this notation a range of false positive measures can be defined, as shown in Table 2. An observed familywise error (oFWE) occurs whenever $V_{1|0}$ is greater than zero, and the familywise error rate (FWE) is defined as the probability of this event. The observed FDR (oFDR) is the proportion of rejected voxels that have been falsely rejected, and the expected false discovery rate (FDR) is defined as the expected value of oFDR. Note the contradictory notation: an 'observed familywise error' and the 'observed false discovery rate' are actually unobservable, since they require knowledge of which tests are falsely rejected. The measures actually controlled, FWE and FDR, are the probability and expectation of the unobservable oFWE and oFDR respectively. Other variants on FDR have been proposed, including the positive FDR (pFDR), the FDR conditional on at least one rejection,⁴ and controlling the oFDR at some specified level of confidence (FDRc).⁵ The per-comparison error rate (PCE) is essentially the 'uncorrected' measure of false positives, which makes no accommodation for multiplicity, while the per-family error rate (PFE) measures the expected count of false positives. Per-family errors can also be controlled with some level of confidence (PFEc).⁶ This taxonomy demonstrates that there are many potentially useful multiple false positive metrics, but we are choosing to focus on but one.

Table 1 Cross-classification of voxels in a threshold statistic image

                        Null not rejected      Null rejected
                        (declared inactive)    (declared active)
Null true (inactive)    $V_{0|0}$              $V_{1|0}$            $V - V_{\cdot|1}$
Null false (active)     $V_{0|1}$              $V_{1|1}$            $V_{\cdot|1}$
                        $V - V_{1|\cdot}$      $V_{1|\cdot}$        $V$

Table 2 Different measures of false positives in the multiple testing problem. $I\{A\}$ is the indicator function for event $A$

Measure of false positives         Abbreviation    Definition
Observed familywise error          oFWE            $V_{1|0} > 0$
Familywise error rate              FWE             $P\{\mathrm{oFWE}\}$
Observed false discovery rate      oFDR            $(V_{1|0} / V_{1|\cdot}) \, I\{V_{1|\cdot} > 0\}$
Expected false discovery rate      FDR             $E\{\mathrm{oFDR}\}$
Positive false discovery rate      pFDR            $E\{\mathrm{oFDR} \mid V_{1|\cdot} > 0\}$
False discovery rate confidence    FDRc            $P\{\mathrm{oFDR} \leq \alpha\}$
Per-comparison error
