NeuroImage 15, 870–878 (2002). doi:10.1006/nimg.2001.1037, available online at http://www.idealibrary.com

Thresholding of Statistical Maps in Functional Neuroimaging Using the False Discovery Rate

Christopher R. Genovese,* Nicole A. Lazar,* and Thomas Nichols† *Department of Statistics, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213; and †Department of Biostatistics, University of Michigan, Ann Arbor, Michigan 48109

Received February 23, 2001

This work was partially supported by NSF Grant SES 9866147.

Finding objective and effective thresholds for voxelwise statistics derived from neuroimaging data has been a long-standing problem. With at least one test performed for every voxel in an image, some correction of the thresholds is needed to control the error rates, but standard procedures for multiple hypothesis testing (e.g., Bonferroni) tend to not be sensitive enough to be useful in this context. This paper introduces to the neuroscience literature statistical procedures for controlling the false discovery rate (FDR). Recent theoretical work in statistics suggests that FDR-controlling procedures will be effective for the analysis of neuroimaging data. These procedures operate simultaneously on all voxelwise test statistics to determine which tests should be considered statistically significant. The innovation of the procedures is that they control the expected proportion of the rejected hypotheses that are falsely rejected. We demonstrate this approach using both simulations and functional magnetic resonance imaging data from two simple experiments. © 2002 Elsevier Science (USA)

Key Words: functional neuroimaging; false discovery rate; multiple testing.

INTRODUCTION

A common approach to identifying active voxels in a neuroimaging data set is to perform voxelwise hypothesis tests (after suitable preprocessing of the data) and to threshold the resulting image of test statistics. At each voxel, a test statistic is computed from the data, usually related to the null hypothesis of no difference between specified experimental conditions. The voxels for which the test statistics exceed the threshold are then classified as active, relative to the particular comparison being made. While this approach has proved reasonably effective for a wide variety of testing methods, a basic problem remains: choosing the threshold.

When one uses theoretically motivated thresholds for the individual tests, ignoring the fact that many tests are being performed, the probability that there will be false positives (voxels declared active when they are really inactive) among all the tests becomes very high. For example, for a one-sided t test with a 0.05 significance level, the threshold would be 1.645, which would lead to approximately 1433 of the 28672 voxels in a 64 × 64 × 7 image being declared active on average when there is no real activity. The 5% error rate thus leads to a very large number of false positives in absolute terms, especially relative to the typical number of true positives.

The traditional way to deal with multiple testing is to adjust thresholds such that Type I error is controlled for all voxels in the brain simultaneously. There are two types of error control, weak and strong. Weak control requires that, when the null hypothesis is true everywhere, the chance of rejecting one or more tests is less than or equal to a specified level α. Strong control requires that, for any subset of voxels where the null hypothesis is true, the chance of rejecting one or more of the subset's tests is less than or equal to α. As concisely stated by Holmes et al. (1996), "A test with strong control declares nonactivated voxels as activated with probability at most α . . ." A significant result from a test procedure with weak control only implies there is an activation somewhere; a procedure with strong control allows individual voxels to be declared active; it has localizing power.

There is a variety of methods available for controlling the false-positive rate when performing multiple tests. Among these, perhaps the most commonly used is the Bonferroni correction (see, for example, Miller, 1981). If there are k tests being performed, the Bonferroni correction replaces the nominal significance level α (e.g., 0.05) with the level α/k for each test. It can be shown that the Bonferroni correction has strong control of Type I error. This is a conservative condition, and in practice with neuroimaging data the Bonferroni correction has a tendency to wipe out both false and true positives when applied to the entire data set.
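To make the arithmetic concrete, the short sketch below (ours, not the paper's; it uses the large-sample normal quantile, which matches the 1.645 quoted above) contrasts the uncorrected and Bonferroni-corrected one-sided cutoffs for the 64 × 64 × 7 example:

    # Sketch: uncorrected vs. Bonferroni-corrected one-sided thresholds
    # (illustrative only; assumes large degrees of freedom, so the normal
    # quantile stands in for the t quantile).
    from scipy.stats import norm

    V = 64 * 64 * 7          # 28672 voxels in the example image
    alpha = 0.05

    z_uncorrected = norm.ppf(1 - alpha)      # ~1.645
    z_bonferroni = norm.ppf(1 - alpha / V)   # ~4.64; each test run at level alpha/V

    # Under a global null, the uncorrected rule declares about alpha * V
    # voxels active (~1433.6 here); the corrected rule yields any false
    # positive at all with probability at most alpha.
    print(z_uncorrected, alpha * V, z_bonferroni)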

To get useful results, it is necessary to use a more complicated method or to reduce the number of tests considered simultaneously. For instance, one could identify regions of interest (ROI) and apply the correction separately within each region. More involved methods include random field approaches (such as Worsley et al., 1996) or permutation-based methods (such as Holmes et al., 1996). The random field methods are suitable only for smoothed data and may require assumptions that are very difficult to check; the permutation method makes few assumptions, but has an additional computational burden and does not account for temporal autocorrelation easily. ROIs are labor intensive to create, and further, they must be created prior to data analysis and left unchanged throughout, a rigid condition of which researchers are understandably wary.

Variation across subjects has a critical impact on threshold selection in practice. It has frequently been observed that, even with the same scanner and experimental paradigm, subjects vary in the degree of activation they exhibit, in the sense of contrast-to-noise. Subjective selection of thresholds (set low enough that meaningful structure is observed, but high enough so that appreciable random structure is not evident) suggests that different thresholds are appropriate for different subjects. Without an objective method for selecting these thresholds, however, the meaning of the statistical tests can be subverted by the researcher by adjusting the thresholds, implicitly or explicitly, to give desirable results. Many researchers using neuroimaging therefore tend to choose a single threshold consistently for all data analyzed in an individual experiment. This choice is usually based on what has "worked well" in the past. For example, a t threshold of 6 and a P value of less than 0.001 are commonly used, though completely arbitrary, values for thresholding maps. This practice avoids biases from ad hoc threshold adjustments, but its forced consistency can significantly reduce sensitivity (and waste data).

There have been a number of efforts to find an objective and effective method for threshold determination (Genovese et al., 1997; Worsley et al., 1996; Holmes et al., 1996). While these methods are promising, they all involve either extra computational effort or additional requirements that may deter researchers from using them. In this paper, we describe a recent development in statistics that can be adapted to automatic and implicit threshold selection in neuroimaging: procedures that control the false discovery rate (FDR) (Benjamini and Hochberg, 1995; Benjamini and Liu, 1999; Benjamini and Yekutieli, 2001).

Whenever one performs multiple tests, the FDR is the proportion of false positives (incorrect rejections of the null hypothesis) among those tests for which the null hypothesis is rejected. We believe that this quantity gets at the essence of what one wants to control, in contrast to the Bonferroni correction, for instance, which controls the rate of false positives among all tests whether or not the null is actually rejected. A procedure that controls the FDR bounds the expected rate of false positives among those tests that show a significant result. The procedures we describe operate simultaneously on all voxels in a specified part of the data (e.g., the entire data set) and identify in which of those voxels the test is rejected. This implicitly corresponds to a threshold selection method that adapts to the properties of the given data set. These methods work for any statistical testing procedure for which one can generate a P value. FDR methods also offer an objective way to select thresholds that is automatically adaptive across subjects.

An outline of the paper is as follows. In Section 2, we describe the FDR in more detail and present a family of FDR-controlling procedures that have been studied in the statistics literature. In Section 3, we present simple simulations that illustrate the performance of the FDR-controlling procedures. In Section 4, we apply the methods to two data sets, one describing a simple motor task (Kinahan and Noll, 1999) and the other from a study of auditory stimulation. Finally, in Section 5, we discuss some of the practical issues in the use of FDR.

THE FALSE DISCOVERY RATE

In a typical functional magnetic resonance imaging (fMRI) data analysis, one computes, for each voxel of interest, a test statistic that relates to the magnitude of a particular contrast among experimental conditions. A voxel is declared active if the corresponding test statistic is sufficiently extreme with respect to the statistic's distribution under the null hypothesis.

Let V denote the total number of voxels being tested in such an analysis. Each voxel can be classified into one of four types, depending on whether or not the voxel is truly active and whether or not it is declared active, as shown in Table 1.

TABLE 1
Classifications of Voxels in V Simultaneous Tests

                    Declared active    Declared inactive
                    (discovered)       (not discovered)
Truly active        V_aa               V_ai                 T_a
Truly inactive      V_ia               V_ii                 T_i
                    D_a                D_i                  V

For example, V_ia denotes the number of false positives and D_a = V_aa + V_ia denotes the number of voxels declared active. In any data analysis, we only observe D_a, D_i, and V; the remaining counts are unknown. The FDR is given by the ratio

    FDR = V_ia / (V_ia + V_aa) = V_ia / D_a;    (1)
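In simulation settings where the truth is known, the counts in Table 1 and the ratio in Eq. (1) are easy to compute directly. A minimal sketch (Python/NumPy; the function name and boolean-array representation are our own conventions, not the paper's):

    import numpy as np

    def realized_fdr(truly_active, declared_active):
        """V_ia / D_a from Eq. (1), defined as 0 when no voxel is declared active."""
        v_ia = np.sum(declared_active & ~truly_active)  # false positives
        d_a = np.sum(declared_active)                   # D_a = V_aa + V_ia
        return v_ia / d_a if d_a > 0 else 0.0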

that is, the proportion of declared-active voxels that are false positives. If none of the tests is rejected, the FDR is defined to be 0.

A procedure controlling the FDR specifies a rate q between 0 and 1 and ensures that on average the FDR is no bigger than q. This works even though V_ia, the number of false positives, is unknown. The phrase "on average" here is important to the interpretation of the procedure. The guarantee is that if one were to replicate the experiment many times, then the average FDR over those replications would be no bigger than q. For any particular data analysis, the actual FDR might be larger than q.

In contrast to FDR, the Bonferroni procedure controls the probability of having any false positives: one specifies an error rate α, and the procedure ensures that P{V_ia > 0} ≤ α. While this does a good job of reducing false positives, it is conservative, meaning that P{V_ia > 0} is much less than α, and in general the method has low power.

The FDR-controlling techniques introduced by Benjamini and Hochberg (1995) are easily implemented, even for very large data sets. These procedures guarantee control of the FDR in the sense that

    E(FDR) ≤ (T_i / V) q ≤ q,    (2)

where E denotes expected value and where the first inequality is an equality when the P values are obtained from a continuous distribution. The unknown factor T_i/V, the proportion of truly inactive voxels, shows that the procedure somewhat overcontrols the expected FDR. In analyses of the entire data set, this factor will in practice be very close to 1 and can reasonably be ignored. For analyses of smaller ROIs, however, it might be useful to estimate T_i/V and choose q accordingly.

For the V voxels being tested, the general procedure is as follows:

1. Select a desired FDR bound q between 0 and 1. This is the maximum FDR that the researcher is willing to tolerate on average.

2. Order the P values from smallest to largest:

    P_(1) ≤ P_(2) ≤ ··· ≤ P_(V).

Let v_(i) be the voxel corresponding to P value P_(i).

3. Let r be the largest i for which

    P_(i) ≤ (i / V) (q / c(V)),

where c(V) is a predetermined constant described below.

4. Declare the voxels v_(1), . . ., v_(r) active, or in other words, threshold the image of test statistics at the value corresponding to the P value P_(r).

The choice of the constant c(V) depends on assumptions about the joint distribution of the P values across voxels. The following choices control FDR under different conditions: (i) c(V) = 1 and (ii) c(V) = ∑_{i=1}^{V} 1/i. The constant in (ii) is larger than that in (i); therefore, all else being equal, the corresponding cutoff for significance, and hence the number of voxels declared active, is smaller. (Note that ∑_{i=1}^{V} 1/i = ln(V) + γ + r(V), where γ ≈ 0.5772 is Euler's constant and r(V) ≤ 1/V. Hence, for large V, one can approximate the harmonic sum with ln(V) + γ.) The second choice of c(V) applies for any joint distribution of the P values across voxels. The first choice of c(V) applies under slightly more restrictive assumptions. It holds when the P values at the different voxels are independent, and under a technical condition, called positive dependence, that holds when the noise in the data is Gaussian with nonnegative correlation across voxels (Benjamini and Yekutieli, 2001). The latter may be a reasonable assumption for many fMRI data sets.

A graphical perspective can be helpful to understanding the procedure. One plots the ordered P values P_(i) and the line through the origin with slope q/c(V) and finds the largest P value that is below the line. All voxels with P values less than or equal to this one are declared active (see Fig. 1).

To implement the procedure, one must choose a value for the parameter q, but one strength of the method is that this is not an arbitrary choice. From Eq. (2), q has a meaningful and rigorous interpretation that can be relied on in selecting its value and that makes it comparable across studies. While it is common to set q to conventional levels for significance testing (e.g., 0.01–0.05), this is by no means required. For instance, values of q in the range of 0.10–0.20 are reasonable in many problems (Benjamini, personal communication).

Another advantage of this method is that it is adaptive, in the sense that the chosen thresholds are automatically adjusted to the strength of the signal. The researcher chooses a tolerable rate of false discoveries, and the specific thresholds are determined from the data. This solves the threshold selection problem automatically, even for multiple subjects: there is no need to find an arbitrary and ad hoc threshold that works for all subjects simultaneously or to use a complicated method of targeting the threshold to each subject.
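The four steps translate directly into code. The sketch below is our implementation of the step-up rule described above (function and variable names are ours); it returns the P value cutoff for a chosen q, with c(V) = 1 for the independent/positive-dependence case or the harmonic-sum constant for arbitrary dependence:

    import numpy as np

    def fdr_cutoff(p_values, q=0.05, arbitrary_dependence=False):
        """Steps 1-4 above: return the P value threshold; 0.0 means no rejections."""
        p_sorted = np.sort(np.asarray(p_values))                 # step 2
        V = p_sorted.size
        c_V = np.sum(1.0 / np.arange(1, V + 1)) if arbitrary_dependence else 1.0
        i = np.arange(1, V + 1)
        under_line = p_sorted <= (i / V) * (q / c_V)             # step 3
        if not under_line.any():
            return 0.0
        r = np.nonzero(under_line)[0].max()                      # largest such i (0-based)
        return p_sorted[r]                                       # step 4: threshold at P_(r)

Voxels with P values at or below the returned cutoff are declared active; thresholding the statistic image at the value corresponding to this P value gives the same classification.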

SIMULATION STUDIES

To show how the FDR-controlling procedures perform, we give in this section the results of simulations in which some of the basic parameters (V, T_a, etc.) are systematically varied.

FIG. 1. A graphical display of the FDR-controlling procedures. Sorted P values are plotted in order (with index i at i/V, for i = 1, . . ., V). The largest P value below the line through the origin with slope q is the corresponding threshold; the threshold is 0 if all P values are above the line. One rejects the null hypotheses for tests whose P values are at or below this threshold. Using the false discovery rate, the threshold for significance is approximately 0.0049, with five rejections; using the Bonferroni correction, the threshold is approximately 0.00280, with four rejections.

There are two important points about FDR to keep in mind. First, the procedures guarantee that the FDR will be below the specified bound on average over many replications of an experiment. For any given data set, the FDR need not be below the bound. Second, the FDR by itself does not tell us what proportion of the truly active voxels were detected. To find this we would need also the dual quantity, the false nondiscovery rate (FNR), which is the proportion of voxels declared inactive that are truly active. That is,

    FNR = V_ai / (V_ai + V_ii) = V_ai / D_i,    (3)

with FNR = 0 if all voxels are declared active. The simulations enable us to find the underlying distribution of the FDR and to compute the FNR to assess power.

In our simulations, we generate random two-sample t statistics (one-sided) that correspond to those computed from a data set of 98 images. We consider two image sizes, 64 × 64 and 128 × 128, which determines the number of tests performed. Within each image, we include four square blocks of active voxels. The effect size, as measured by the shift in the t distribution of the statistic, is 0.5, 1, 2, and 3 across the blocks, providing a range of magnitudes for the task-related signal changes from barely detectable to easily detectable. We vary the block size (0, 10, 20, 30) across simulation runs, thus changing the proportion of truly active voxels. Within each run, we obtain 2500 samples using q = 0.05. Throughout we apply the FDR procedure with c(V) = 1. Table 2 shows the simulation results. Figure 2 shows voxel-by-voxel proportions of rejections across one simulation run, with the truly active voxels delineated for comparison.

TABLE 2
Summaries of FDR and FNR over Replications of Simulated Data, with q = 0.05

Image size    Block size    E(FDR)    (T_i/V)q    P{FDR > q}    E(FNR)

64 × 64        0            0.046     0.050       0.046         0.000
              10            0.047     0.045       0.430         0.057
              20            0.038     0.030       0.187         0.211
              32            0.000     0.006       0.000         1.000
128 × 128      0            0.054     0.050       0.054         0.000
              10            0.049     0.049       0.432         0.023
              20            0.048     0.045       0.441         0.057
              30            0.038     0.039       0.036         0.211

Note. Each row in the table represents a different simulation run. In each run, the data set generated at each iteration consists of four blocks of the stated size with different degrees of activation and surrounding nonactivating voxels.

The expected FDRs follow the pattern predicted by Eq. (2) quite closely, in that they are all quite close to (T_i/V)q = 0.05. As the proportion of active voxels increases, the distribution of the FDR becomes more concentrated, less skewed, and seems to approach a Gaussian. For the 32 block size of the 64 × 64 simulations, there are virtually no false discoveries (E(FDR) ≈ 0), because very few voxels are declared active when the null is true (E(FNR) ≈ 1); this suggests that FDR is most powerful with sparse signals.
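As a rough illustration of how rows of Table 2 can be reproduced, the sketch below (our simplification, not the authors' code: Gaussian statistics stand in for the two-sample t statistics, a single active block is used, and fewer replications than the 2500 above) estimates E(FDR), E(FNR), and P{FDR > q} by Monte Carlo, reusing fdr_cutoff and realized_fdr from the sketches above:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    q, shape, shift, n_rep = 0.05, (64, 64), 2.0, 500

    truth = np.zeros(shape, dtype=bool)
    truth[:10, :10] = True                                # one 10 x 10 active block

    fdrs, fnrs = [], []
    for _ in range(n_rep):
        z = rng.standard_normal(shape) + shift * truth    # Gaussian stand-in for the t map
        p = norm.sf(z)                                    # one-sided P values
        declared = p <= fdr_cutoff(p.ravel(), q=q)
        fdrs.append(realized_fdr(truth, declared))
        d_i = (~declared).sum()                           # voxels declared inactive
        fnrs.append((truth & ~declared).sum() / d_i if d_i else 0.0)

    # estimated E(FDR), E(FNR), and P{FDR > q}
    print(np.mean(fdrs), np.mean(fnrs), np.mean(np.asarray(fdrs) > q))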

The probability that the FDR is larger than the tolerance q drops precipitously as the number of active voxels increases. Figure 2 shows that the FNR decreases with the effect size. A shift of 0.5 in the t statistics is barely detectable over the background, but a shift of 3 is almost completely recovered.

The FDR-controlling procedure indicates which voxels should be declared active. The largest P value among these voxels corresponds to a threshold on the original test statistics. Figure 3 shows the distribution of these equivalent t thresholds across simulation runs for the 128 × 128 image with 10 × 10 active blocks. The distribution is centered on the value 4.16 with a standard deviation of approximately 0.21. This variation from data set to data set shows the FDR-based method adapting to local variations in the contrast-to-noise ratio.

Additional insight on the comparison between FDR and Bonferroni (FWER) methods can be gleaned from Figs. 4 and 5. Each of these shows the results of applying one method (FDR or Bonferroni, respectively) to t statistics from 20 simulated data sets. Each data set is generated with independent voxels and a circular active region derived from a truncated Gaussian. The figures display the truly active voxels (in gray), the truly inactive voxels (in white), and those declared active by the corresponding method (in black). The same data were used in both figures. Bonferroni correctly classifies many fewer voxels, but only once among the 20 data sets are any false discoveries made. FDR, on the other hand, correctly classifies a much larger portion of active voxels, but makes more false discoveries as well.

FIG. 2. Proportions of tests rejected, by voxel, in the simulation runs with image size 128 and block size 20. See also Table 2. Boxes delineate voxels that are truly active. The true shift of the t statistic increases from 0.5, 1, 2, to 3, going counterclockwise from the bottom left. False discoveries correspond to nonzero values outside the delineated boxes; false nondiscoveries correspond to non-one values inside the delineated boxes.

FIG. 3. of equivalent t thresholds generated by the FDR-controlling procedure across simulation runs with 128 ϫ 128 image size and 10 ϫ 10 block. The vertical line on the histogram is the corresponding Bonferroni threshold. Note, however, that all FDR thresholds above Bonferroni yield exactly the same rejections as would the Bonferroni threshold. CONTROLLING FDR IN NEUROIMAGING 875

FIG. 4. Results of applying the Benjamini and Hochberg method to t statistics from 20 data sets. The white voxels are truly inactive, the gray voxels are truly active, and the black voxels are those classified as active by the procedure. The data are generated with 128 × 128 independent voxels, and the effect sizes in the circular region have a truncated Gaussian shape.

FIG. 5. Results of applying the Bonferroni method to t statistics from the same 20 data sets used in Fig. 4.

DATA EXAMPLE

In this section, we consider the effectiveness of the FDR approach on real data examples. We demonstrate the methods on two data sets. One data set was described by Kinahan and Noll (1999), where PET and fMRI studies of finger opposition were compared; we use the fMRI data from one subject. The other data set is from a study of auditory stimulation; it is available on the Web, at http://www.fil.ion.ucl.ac.uk/spm/data. Both data sets are used here with the kind permission of the respective authors.

For the finger opposition task, subjects sequentially touched their thumb to the fingers of the right hand, starting with the little finger. The movements were synchronized to a numeric visual cue presented at a 2-Hz rate for 60 s. The control condition was the same visual cue for 60 s, though no movement was made. Data from 12 pairs of task-control blocks were collected. A GE 1.5-T scanner was used, collecting T2*-weighted EPI images. The acquired volumes had dimensions 128 × 64 × 20, with voxels of size 3.125 × 3.125 × 4.0 mm (no skip); TR was 6 s, and TE, 45 ms. Images were trimmed to 64 × 64 × 20. There were 10 images per block, 12 pairs of blocks, and hence a total of 240 images. A t test statistic image was created by comparing the rest to active blocks.

The auditory stimulation experiment consisted of fourteen 42-s blocks, the blocks alternating between silent rest and presentation of bisyllabic words. Words were paced at 60 per minute. A modified 2-T Siemens scanner was used to collect T2*-weighted EPI images. The acquired volumes had dimensions 64 × 64 × 64, with voxel size 3.0 × 3.0 × 3.0 mm (no skip); TR was 7 s. There were 6 images per block, 14 blocks, and hence a total of 84 images. For these data we fit the authors' recommended model, a linear model consisting of a boxcar function convolved with a canonical hemodynamic response, global image intensity, and a seven-element discrete cosine basis effecting a high-pass filter with cutoff periodicity of 168 s. A t statistic image was created based on the experimental covariate.
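Since the FDR procedures operate on P values rather than on the statistics themselves, applying them to these data requires converting the t statistic image to P values within a brain mask (cropping out the air outside the head, as described below). A sketch with hypothetical array names; the degrees of freedom follow from the fitted model:

    import numpy as np
    from scipy.stats import t

    def masked_p_values(t_map, brain_mask, df):
        """One-sided P values for the voxels inside the mask (air excluded)."""
        return t.sf(t_map[brain_mask], df)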

FIG. 6. Coronal slice of suprathreshold pixels overlaid on a T2* image. Colored pixels show −log10 of the P value. Top left, t > 4 threshold. Top right, threshold controlling FDR at 5% with c(V) = 1; the equivalent t threshold is 4.5339. Bottom left, Bonferroni-corrected threshold; the equivalent t threshold is 4.7615. Bottom right, threshold controlling FDR at 5% making no assumptions on the P value distribution; the equivalent t threshold is 5.3232.

Four thresholding methods were applied: the arbitrary cutoff point of 4 in a t map, the Bonferroni correction, the basic FDR procedure (with c(V) = 1), and the FDR procedure for arbitrary P value distributions (c(V) = ∑_{i=1}^{V} 1/i). Both FDR procedures used q = 0.05. Prior to implementation of the FDR method, images were cropped to exclude air outside the head, where no activity should be observed.

As seen in Figs. 6 and 7, there is a noticeable difference between the two FDR results, with FDR using c(V) = 1 leading to many more active voxels. The comparison with the t maps thresholded at 4 in both figures shows that the distribution-free version of FDR highlights basically the same regions, although slightly fewer voxels. The Bonferroni threshold shows a pattern similar to that of the distribution-free FDR procedure. These relations are consistent across all the slices. The ad hoc threshold of 4 tends to resemble the results under FDR with c(V) = 1.

FIG. 7. Axial slice of suprathreshold pixels overlaid on a T1 structural image. Colored pixels show −log10 of the P value. Top left, t > 4 threshold. Top right, threshold controlling FDR at 5% with c(V) = 1; the equivalent t threshold is 3.8119. Bottom left, Bonferroni-corrected threshold; the equivalent t threshold is 5.2485. Bottom right, threshold controlling FDR at 5% making no assumptions on the P value distribution; the equivalent t threshold is 5.0747.

DISCUSSION

We have examined methods to control the false discovery rate as a solution to the threshold selection problem in neuroimaging data. These methods provide an interpretable and adaptive criterion with higher power than other methods for multiple comparisons, such as the Bonferroni correction. In contrast to purely subjective threshold selection, the threshold varies automatically across subjects, with a consequent gain in sensitivity. In contrast to complicated threshold-selection schemes, the methods are simple to implement and computationally efficient even for large data sets.

Although the procedure for controlling the FDR was not developed for the case of many thousands of tests and has not often been used in that context, the method gives sensible results with both simulated and real data from two fMRI experiments. As seen in the reported studies, controlling the FDR offers no guarantee that the activation maps will reveal some new structure. What then is the advantage? We see three main strengths of FDR-based methods, all of which derive from the additional information provided about the proportion of voxels falsely declared active.

First, any single choice of threshold across data sets will give an error rate that is too high for some data and too low for others. The FDR method adapts its threshold to the features of the data, eliminating an unnecessary excess of errors. Second, the parameter q has a definite and clear meaning that is comparable across studies. Researchers might obtain different thresholds through their personal choice of q, but because the criterion is clear, we can understand the differences that will result. Third, since the FDR-controlling procedure works on the P values, and not on the test statistics themselves, it can be applied with any valid statistical test.

Choosing q is only one of the implementation issues that the researcher needs to consider. We have presented two slight variations on the basic procedure that differ in the assumptions they require about the joint distribution of the P values across voxels. Which of these should be used in a given situation will in general be determined by the nature of the data and the willingness of the researcher to make assumptions about the P values. When it applies, the procedure with c(V) = 1 will have the highest power. While strict independence is hard to verify and will often fail with neuroimaging data, it is often approximately true in the sense that correlations are local and tend to be positive. In this case choosing c(V) = 1 seems to be a good default. Taking c(V) = ∑_{i=1}^{V} 1/i protects the user against unexpected deviations from assumptions, but comes with a loss of power.

A second consideration relates to data smoothing. The FDR method becomes more conservative as correlations increase, and hence it is most powerful for unsmoothed data. This is in contrast to random field methods, which are typically more conservative for unsmoothed data.

A third issue is that because FDR procedures operate simultaneously on all voxels included in the analysis, it is important to remove those voxels (e.g., air, CSF in the ventricles) for which we already know the truth. While it is common practice to remove voxels outside the head, it is still a somewhat discretionary step when thresholding voxelwise statistics. For FDR methods this is a necessary step. However, it is not necessary to be too exacting at boundaries; a few extra voxels here or there will likely have little impact on the results.

We have presented the FDR-controlling procedures here as part of the process of identifying active voxels. More generally, the procedures apply to any multiple testing situation. Many recent methods for the analysis of fMRI data rely on fitting sophisticated statistical models to the data (see, for example, Friston et al., 1994; Genovese, 2000; Lange and Zeger, 1997). Part of such analyses inevitably involves examining the values of fitted parameters at each voxel to test hypotheses about the underlying value of those parameters. FDR-based methods can also be used to perform these voxelwise statistical tests.

REFERENCES

Benjamini, Y., and Hochberg, Y. 1995. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B 57: 289–300.
Benjamini, Y., and Liu, W. 1999. A Distribution-Free Multiple Test Procedure That Controls the False Discovery Rate, technical report. Department of Statistics and Operations Research, Tel Aviv University.
Benjamini, Y., and Yekutieli, D. 2001. The Control of the False Discovery Rate in Multiple Testing under Dependency, technical report. Department of Statistics and Operations Research, Tel Aviv University. [To appear in The Annals of Statistics]
Friston, K. J., Jezzard, P., and Turner, R. 1994. Analysis of functional MRI time series. Hum. Brain Map. 1: 153–171.
Genovese, C. R. 2000. A Bayesian time-course model for functional magnetic resonance imaging data. J. Am. Stat. Assoc. 95: 691–719.
Genovese, C. R., Noll, D. C., and Eddy, W. F. 1997. Estimating test–retest reliability in fMRI. I. Statistical methodology. Magn. Reson. Med. 38: 497–507.
Holmes, A. P., Blair, R. C., Watson, J. D. G., and Ford, I. 1996. Nonparametric analysis of statistic images from functional mapping experiments. J. Cereb. Blood Flow Metab. 16: 7–22.
Kinahan, P. E., and Noll, D. C. 1999. A direct comparison between whole-brain PET and BOLD fMRI measurements of single-subject activation response. NeuroImage 9: 430–438.
Lange, N., and Zeger, S. L. 1997. Nonlinear Fourier time series analysis for human brain mapping by functional magnetic resonance imaging. Appl. Stat. 46: 1–29.
Miller, R. G., Jr. 1981. Simultaneous Statistical Inference, second ed. Springer-Verlag, New York.
Worsley, K. J., Marrett, S., Neelin, P., Vandal, A. C., Friston, K. J., and Evans, A. C. 1996. A unified statistical approach for determining significant signals in images of cerebral activation. Hum. Brain Map. 4: 58–73.