Biometrika (2011), 98, 4, pp. 979–985 doi: 10.1093/biomet/asr057 © 2011 Biometrika Trust. Printed in Great Britain

Miscellanea

False discovery rate for scanning statistics

BY D. O. SIEGMUND, N. R. ZHANG

Department of Statistics, Stanford University, 390 Serra Mall, Stanford, California 94305-4065, U.S.A.
[email protected] [email protected]

AND B. YAKIR

Department of Statistics, The Hebrew University of Jerusalem, Jerusalem 91905, Israel
[email protected]

SUMMARY

The false discovery rate is a criterion for controlling Type I error in simultaneous testing of multiple hypotheses. For scanning statistics, due to local dependence, clusters of neighbouring hypotheses are likely to be rejected together. In such situations, it is more intuitive and informative to group neighbouring rejections together and count them as a single discovery, with the false discovery rate defined as the proportion of clusters that are falsely declared among all declared clusters. Assuming that the number of false discoveries, under this broader definition of a discovery, is approximately Poisson and independent of the number of true discoveries, we examine approaches for estimating and controlling the false discovery rate, and provide examples from biological applications.

Some key words: False discovery rate; Multiple comparisons; Poisson approximation; Scan statistic.

1. INTRODUCTION

In a pioneering paper, Benjamini & Hochberg (1995) initiated a fruitful line of research into the false discovery rate as a method to evaluate Type I error when simultaneously testing large numbers of hypotheses. We use their notation, so R is the number of discoveries that emerge as a result of a particular statistical procedure, and V is the number of false discoveries among them. Then S = R − V is the number of true discoveries. The false discovery rate is the expected proportion of false discoveries, FDR = E(V/R; R > 0). These quantities are defined implicitly in terms of the specific procedure that is used to make discoveries.

We are concerned with estimation and control of false discovery rates when there is substantial local correlation among the statistics used for testing the hypotheses. Due to local correlation, large values of the statistic tend to occur in clumps, and multiple rejections within a clump may constitute only a single discovery as far as model identification is concerned. Yet a possibly large number of correct rejections at some locations can inflate the denominator in the definition of the false discovery rate, hence artificially creating a small false discovery rate and lowering the barrier to possibly false detections at distant locations.

Scanning statistics to detect sparsely distributed signals provide typical examples. In the examples that follow, there is an underlying set of observations yt, where t varies over an indexing set having some geometric structure. The yt are often assumed to be independent, but this is not necessary, provided the dependence between them is local with respect to the geometric structure. The test statistics {Zt : t ∈ D}, where Zt is a function of t and of the ys for s ∈ Nt, an appropriate neighbourhood of t, are related by a measure of distance within the scanning index set D. Hence, values of Zt and Zs for nearby t and s in D are correlated, so a large value at a specific τ ∈ D causes a cluster of large values at t close to τ. Thus, a group of large values of Zt within close proximity are often associated with a single signal.

Example 1. The random fields used to detect local activity in an fMRI scan, as discussed in a series of papers by Worsley, for example, Worsley et al. (1992) or Siegmund & Worsley (1995).

Example 2. Massively parallel paired end DNA re-sequencing used to detect structural variation in genomic sequences. For a review, see Medvedev et al. (2009). The data come in the form of distances yt between mapped positions of relatively short paired reads from the ends of DNA sequences of approximately w base pairs in length, with the leftmost read mapped to position xt in the genome. Where there are no structural variations one observes, after subtracting w and standardizing, yt that are independent and have a standard normal distribution. For read pairs that straddle the breakpoint τ of a structural variation, the distribution of some percentage of the yt, for xt near τ, is shifted away from zero.

Example 3. A scan statistic with variable window width, used by Zhang et al. (2010) and Siegmund et al. (2011) to detect common regions of copy number variation in a set of subjects. An appropriate likelihood based statistic for the special case of a single sequence (Olshen et al., 2004) is similar to that of the preceding example, but the window width is unknown, so the scan involves two-dimensional maximization with respect to τ and w.

Example 4. A genome scan to detect either linkage or association between a phenotype and related genetic variation, e.g., Lander & Botstein (1989), Siegmund & Yakir (2007).

The main results of this paper are methods for estimating and controlling the false discovery rate of a given procedure, with discovery defined in the sense of detection of sparse, local signals. In order to focus on the conceptual aspect of how one defines a discovery, our assumptions are given in general, abstract terms, and we avoid, except for a few comments, the necessarily technical and application specific discussion of methods to ensure and test the validity of those assumptions. Motivated by a different genomic application, Zhang (2008) contains a similar approach without any theoretical analysis. In an unpublished 2011 manuscript in the Harvard University Working Paper Series, A. Schwartzman, Y. Gavrilov and R. J. Adler discuss a different approach to the same general issue under specific technical assumptions motivated by, and apparently limited to, a one-dimensional process having the structure of Example 1.

Our two central assumptions are: (i) the distribution of the number of false discoveries, V, is Poisson with mean λ; and (ii) the number of false discoveries is independent of the number of true discoveries, S. The number of true discoveries must be nonnegative, but otherwise may follow any distribution. The Poisson assumption is valid asymptotically in a variety of applications. These include the examples given above under the assumptions made in the cited references, or more generally if one makes suitable adjustments when there are local dependencies in the underlying observations. See Aldous (1988) for numerous examples, Lindgren et al. (1983) for relevant general theorems under different sets of conditions, and Arratia et al. (1989) for a flexible Poisson approximation theorem that applies quite generally to processes involving local dependence. Methods for determining λ depend on the specific problem. Illustrative examples based on more explicit assumptions about the underlying process are discussed below.

The assumption of independence between V and S is more subtle. If we treat the locations of the signals in D as fixed but unknown quantities, then D can be partitioned into disjoint sets D0 ∪ D1, where D0 contains the hypotheses that, if rejected, would be considered part of a false discovery, and D1 the hypotheses that, if rejected, would be considered part of a true discovery. For example, in the simple case of a one-dimensional scan with fixed window size w as in Example 2, suppose the true signals are a set of intervals I within the scan region. If we count those windows that overlap with any interval in I towards true discoveries, and the rest towards false discoveries, then D0 = {t : (t, t + w] ∩ ι = ∅ for all ι ∈ I} and D1 = D \ D0. Then V would be a function solely of {Zt : t ∈ D0}, while S would be a function solely of {Zt : t ∈ D1}.
At least when the true signals are sparse, approximate independence between V and S would follow if long-range dependencies between {Zt : t ∈ D0} and {Zt : t ∈ D1} are negligible. In practice, near overlap of detected signals is a danger sign regarding possible violation of this hypothesis of independence. For a more detailed discussion, see § 3·1. Significant long range dependence within {Zt} may cause nonnegligible dependence between V and S. Scanning procedures that are based on a collection of localized tests are inherently designed for problems where dependence can be assumed to be local, since if long range dependence is present, then procedures that account for that dependence would be preferred on the basis of greater power.

The estimator that we propose for the false discovery rate is

FDR̂ = λ/(R + 1),   (1)

where λ is the expected number of false discoveries and R is the total number of discoveries. This estimator has been considered by Efron (2010) in the framework of hypothesis testing with a large number of independent hypotheses and, except for the constant 1 in the denominator, is the same as that suggested by Zhang (2008). In some cases, the parameter λ can be derived analytically. In other cases, it can be computed via permutations or simulations conducted under the null assumption. In § 2, we show that the estimator (1) is unbiased under assumptions (i) and (ii).

Our method for controlling the rate of false discoveries is closely associated with the procedure proposed by Benjamini & Hochberg (1995) for ordered p-values. We in effect replace their assumption regarding the relations between p-values of individual hypotheses by the assumption that an appropriately indexed family of false discoveries is a Poisson process.
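To make (1) concrete, here is a minimal R sketch in which λ is estimated by permutation, along the lines described above. The function scan_count is a hypothetical placeholder for whichever procedure counts discovered clumps in a data vector; permuting the data is assumed to destroy the signal while preserving the null behaviour.

```r
# Minimal sketch of the estimator (1); scan_count is a hypothetical
# user-supplied function returning the number of discovered clumps.
estimate_fdr <- function(y, scan_count, n_perm = 100) {
  R <- scan_count(y)  # discoveries in the observed data
  # lambda: expected number of false discoveries, estimated by applying
  # the same scan to data permuted under the global null
  lambda_hat <- mean(replicate(n_perm, scan_count(sample(y))))
  lambda_hat / (R + 1)  # the estimator (1)
}
```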

2. ESTIMATING AND CONTROLLING THE FALSE DISCOVERY RATE

Let V ∼ Po(λ) be the number of false discoveries and let S ≥ 0 be the number of true discoveries. Assume that S is a nonnegative random variable independent of V. The total number of discoveries is R = V + S. Consider the ratio V/R, which is defined to be 0 if R = 0, and compare it to the estimator λ/(R + 1).

THEOREM 1. Under assumptions (i) and (ii), E(V/R; R > 0) = E{λ/(R + 1)}.

Proof. For fixed s ≥ 0, let Fs(x) = (x + s)^{−1} I(x + s > 0), with the understanding that F0(0) = 0. After writing the expectations as infinite series, algebraic manipulation shows that

E{V Fs(V)} = λ E{Fs(V + 1)}.   (2)

The result follows by taking expectations with respect to the distribution of S. Hence, FDR̂ defined in (1) is an unbiased estimator of the false discovery rate. □
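For completeness, the series manipulation behind (2) is the standard Poisson identity: writing the expectation as a sum over the Po(λ) mass function,

```latex
\begin{aligned}
E\{V F_s(V)\}
  &= \sum_{k=1}^{\infty} k\,F_s(k)\, e^{-\lambda}\frac{\lambda^{k}}{k!}
   = \lambda \sum_{k=1}^{\infty} F_s(k)\, e^{-\lambda}\frac{\lambda^{k-1}}{(k-1)!} \\
  &= \lambda \sum_{j=0}^{\infty} F_s(j+1)\, e^{-\lambda}\frac{\lambda^{j}}{j!}
   = \lambda\, E\{F_s(V+1)\}.
\end{aligned}
```

Taking s = S and averaging over the distribution of S, which is independent of V, the left side becomes E(V/R; R > 0) and the right side becomes E{λ/(R + 1)}.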

Remark. Equation (2) has been applied elsewhere. In particular, it is the basis for the Chen (1975) method of Poisson approximation.

Now suppose that the false detections form a Poisson process Vλ of rate 1, defined on the interval [0, λ̄]. We assume also that the process Rλ = Vλ + Sλ is nondecreasing and that the processes Vλ and Sλ are independent. Define the backwards stopping time Λ = max{λ ≤ λ̄ : Rλ ≥ λ/α}. This is a function of the observed process Rλ, and thereby a function of the Poisson process Vλ and the independent process Sλ, both unobserved. The extreme case Λ = 0 corresponds to the case where RΛ is equal to zero, and the ratio VΛ/RΛ is then defined to be equal to zero as well.

Consider the procedure whereby the stopping time Λ is evaluated and RΛ is reported as the number of discoveries. In Theorem 2, we prove that the expected proportion of false discoveries, E(VΛ/RΛ), is bounded by α. The proof is a version of the argument given by Storey et al. (2004).

THEOREM 2. Under the given conditions and for the procedure associated with the stopping time Λ, E(VΛ/RΛ) ≤ α.

Proof. Consider the process Vλ/λ and notice that it is a mean one backwards martingale with respect to the filtration Fλ = σ(Vt, St : λ ≤ t ≤ λ̄). The stopping time Λ is measurable with respect to this filtration. It follows that E{VΛ∨λ/(Λ ∨ λ)} = E(Vλ̄/λ̄) = 1, for any λ > 0. Let λ → 0 and observe that 1(Λ < λ)Vλ/λ converges to 0 and is bounded by 1/α, since on the event {Λ < λ} the definition of Λ gives Rλ < λ/α, and hence Vλ/λ ≤ Rλ/λ < 1/α. Hence, by the dominated convergence theorem, we see that E(VΛ/Λ; Λ > 0) = E(Vλ̄/λ̄) = 1.

Consider the proportion VΛ/RΛ of false detections under the proposed procedure. Since this proportion is defined to be equal to zero when Λ = 0, E(VΛ/RΛ) = E(VΛ/RΛ; Λ > 0). Dividing and multiplying by Λ, we get

E(VΛ/RΛ; Λ > 0) = E{(Λ/RΛ) × (VΛ/Λ); Λ > 0} ≤ α × E{(VΛ/Λ); Λ > 0} = α,

where the inequality follows from the fact that, when Λ > 0, the definition of the stopping time implies that Λ/RΛ ≤ α. The conclusion follows. □

3. EXAMPLES

3·1. Fixed-width sliding window scan

Consider a fixed window scan statistic. Suppose Y1, . . . , Ym are independent and normally distributed random variables with unit variance. Under the global null hypothesis they are standard normal. Under the alternative there are intervals of known length w, starting at unknown positive integers τ, such that Yτ+1, . . . , Yτ+w have mean μτ > 0. The values of μτ and the number of such intervals are unknown, although we assume that the total width of all intervals is small relative to the sample size m. This situation corresponds roughly to Example 2 in § 1, although, to facilitate our simulations, the numerical values of the parameters we use below are smaller than would be typical for that application.

Let Zt = (Yt+1 + · · · + Yt+w)/w^{1/2}. The behaviour of Zt as a process under the global null hypothesis that all discoveries are false is easily inferred from known results. Specifically, an asymptotic approximation to p = pr(max0≤t≤m−w Zt > z) is given, for a two-sided alternative, in display (5.3) of Siegmund & Yakir (2007, p. 112), with parameters C = 1, Δ = 1, L = m − w and β = 1/w. For large enough thresholds z, the probability that Zt exceeds z is small, and the number of clumps of Zt that exceed z is approximately Poisson distributed with mean

λ0 = −log(1 − p) = mzw^{−1}φ(z)ν{z(2/w)^{1/2}},   (3)

where φ denotes the standard normal probability density function and ν is a special function associated with the overshoot of a stopped random walk (cf. Siegmund & Yakir, 2007, p. 112).

Although there is no unique definition of a clump, there should usually be little difficulty in recognizing one in practice. Roughly speaking, it is a set of values of t that are relatively close together, where Zt ≥ z. Except when different true discoveries are themselves close together, different clumps are distinguished by relatively long gaps where Zt remains below the level z. If all clumps were false positives and z were large, then the size of a clump would be stochastically bounded, while the expected distance between clumps would be approximately 1/λ0, and hence would grow faster than exponentially in z. The independence of the Yt makes Zs and Zt independent as long as |s − t| > w. Clumps of false positives should be short and approximately uniformly distributed across the search interval. Hence, unless the true signals occur very frequently, the probability of a false positive occurring close to a true signal is small, so the independence of V and S would be approximately satisfied. The same would be true of the variable window scans of Example 3, provided the maximum window size is much smaller than the number of observations. See Siegmund et al. (2011) for a discussion of the data normalization used to validate the normality and independence assumptions needed in Example 3.

Some simulated results are presented in Tables 1 and 2. For the simulations we took m = 50 000 and w = 50. A total of 21 intervals of length w, scattered about the sequence, were simulated from the alternative distribution with mean values μτ ranging between 6/w^{1/2} and 2/w^{1/2} in steps of size 0·2/w^{1/2}.
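A brief R sketch of the approximation (3) may be helpful. The function nu below uses a standard analytic approximation to ν (see, e.g., Siegmund & Yakir, 2007); treating it as exact is an assumption of this sketch.

```r
# Sketch of lambda0(z) in (3); nu() is a standard analytic approximation
# to the overshoot function of a stopped random walk.
nu <- function(x) {
  (2 / x) * (pnorm(x / 2) - 0.5) / ((x / 2) * pnorm(x / 2) + dnorm(x / 2))
}
lambda0 <- function(z, m, w) m * z * w^(-1) * dnorm(z) * nu(z * sqrt(2 / w))
lambda0(3.5, m = 50000, w = 50)  # approximately 2, the nominal value for z = 3.5 in Table 1
```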

Table 1. Simulated values of the false discovery rate and of E{λ0/(R + 1)}, based on 400 repetitions with w = 50, m = 50 000. Nominal values of λ0 are 5, 3, 2 and 1, respectively. There are 21 possible discoveries, with noncentrality parameters ranging from 6 to 2 in steps of size 0·2.

z      FDR     E{λ0/(R + 1)}   E(V)   E(S)
3·21   0·242   0·244           4·9    14·9
3·37   0·165   0·170           3·0    14·3
3·5    0·120   0·122           2·0    13·7
3·7    0·071   0·070           0·9    12·6

FDR, false discovery rate.

Table 2. Simulated values of the false discovery rate for the procedure that controls this rate. The simulations are based on 400 repetitions with w = 50, m = 50 000. The false discovery rate is controlled to be no more than 0·3, 0·2, 0·1 or 0·05, respectively. There are 21 possible discoveries, with noncentrality parameters ranging from 6 to 2 in steps of size 0·2.

α      FDR    E(V)   E(S)
0·3    0·26   5·9    15·2
0·2    0·18   3·3    14·4
0·1    0·09   1·4    13·0
0·05   0·04   0·61   11·9

FDR, false discovery rate.

Table 1 examines the estimator λ0/(R + 1) of the false discovery rate for several thresholds z. Four values of z, corresponding to nominal values of 5, 3, 2 and 1 for λ0, are considered. For each level the actual false discovery rate and the expectation of the estimator are presented. The expected number of false discoveries, E(V), and the expected number of true discoveries, E(S), are also given. The expectations are based on 400 replicates of the scanning process.

Table 2 examines the procedure for controlling the false discovery rate. We used the stopping rule inf{z ≥ 2 : R(z) ≥ λ0(z)/α}, where R(z) is the number of discoveries associated with the threshold z and λ0(z) is the approximation (3) to the expected number of clumps associated with z, computed under the global null distribution. Four values of α, namely 0·3, 0·2, 0·1 and 0·05, are considered. For each α the actual false discovery rate, the expected number of false discoveries and the expected number of true discoveries are presented. The expectations are based on 400 replicates of the scanning process.
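The stopping rule is straightforward to implement. The following R sketch, reusing lambda0 from the previous sketch, scans a single sequence and searches for the smallest threshold z at which R(z) ≥ λ0(z)/α. The clump-counting convention used here (exceedances separated by a gap of more than w start a new clump) is one reasonable choice, not prescribed by the theory.

```r
# Count clumps of windows whose standardized sums exceed z.
count_clumps <- function(y, w, z) {
  cs <- cumsum(y)
  zt <- (cs[w:length(y)] - c(0, cs[1:(length(y) - w)])) / sqrt(w)  # moving sums
  hits <- which(zt > z)
  if (length(hits) == 0) return(0)
  sum(diff(hits) > w) + 1  # gaps longer than w separate clumps
}
# Smallest z >= 2 with R(z) >= lambda0(z)/alpha, as in the stopping rule above.
control_fdr <- function(y, w, alpha, z_grid = seq(2, 6, by = 0.01)) {
  for (z in z_grid) {
    R <- count_clumps(y, w, z)
    if (R >= lambda0(z, m = length(y), w = w) / alpha) return(list(z = z, R = R))
  }
  list(z = Inf, R = 0)  # bound not attained on the grid
}
```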

3·2. Allelic bias in transcribed RNA

Another example involves an analysis of RNA expression profiles in autistic subjects (Ben-David et al., 2011). The goal of the experiment was to identify autosomal loci where only one of the two alleles is expressed. Nuclear RNA was extracted from blood cell-lines of 17 subjects and reverse transcribed. Both the cDNA produced and the genomic DNA of each of the subjects were genotyped using the Affymetrix Single Nucleotide Polymorphism 6·0 array technology. The identification of loci with mono-allelic expression of RNA resulted from the examination of the cDNA genotypes at single nucleotide polymorphisms that had been identified as heterozygous in genomic DNA.

Specifically, the algorithm for the discovery of differentially expressed regions involved the removal, for each subject, of the single nucleotide polymorphisms that were homozygous in the genomic DNA, or were determined not to be sufficiently expressed. For the remaining cDNA polymorphisms, an exponentially distributed distance from heterozygous expression was calculated using the log transformed value of the confidence score from the Affymetrix Birdseed V2 genotyping algorithm (Korn et al., 2008). The p-values for the sum of scores in windows of five consecutive polymorphisms were calculated using the function rollapply from the R package zoo (R Development Core Team, 2011). Windows that included polymorphisms more than 1 Mbp apart were excluded from the analysis. On the other hand, consecutive windows with p-values < 0·05 were combined if the distance between them was < 1 Mbp. The p-values for the merged windows were recalculated. Final windows with a p-value < 0·0001 were declared to be discoveries.

A total of 507 such windows were discovered using the algorithm described above. In order to estimate the false discovery rate of the algorithm, the method of § 2 was applied. The markers used are heterozygous and widely separated on the scale of base pairs. Hence, it seems reasonable to assume that they behave independently, since transcription, currently understood as a localized process within the genome, should not induce dependence between the allelic expression of distantly separated polymorphisms. The transcribed allelic ratios can be permuted within individuals, and a Monte Carlo experiment then determines the Poisson parameter λ.

The algorithm was applied to each permuted set of data, and the number of discoveries was counted. The average number of discoveries, computed from 100 permutations, was 11·48. This average served as an estimate of the expected number of false discoveries. Consequently, the estimated rate of false discovery is 11·48/(507 + 1) = 0·0226.
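As an illustration of the windowed scoring step described above: if the per-polymorphism distances are modelled as standard exponential, the sum of five consecutive scores has a Gamma(5, 1) distribution under the null, which yields the window p-values. This is a sketch under stated assumptions: the unit rate and the vector score, standing for one subject's distances, are placeholders, and the real pipeline also includes the merging and filtering steps described above.

```r
library(zoo)
# score: placeholder for one subject's exponentially distributed distances
# from heterozygous expression (unit rate assumed for illustration)
score <- rexp(1000)
# p-value for the sum of 5 consecutive scores; under the null the sum of
# five independent Exp(1) variables is Gamma(5, 1)
window_p <- rollapply(score, width = 5, FUN = function(x)
  pgamma(sum(x), shape = 5, rate = 1, lower.tail = FALSE))
```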

3·3. Population-wide copy number variation

To detect copy number variation, Olshen et al. (2004) introduced a change-point model with white Gaussian measurement errors. Their procedure was found by Lai et al. (2005) to be preferable to other existing methods. See Jeng et al. (2010) for a recent discussion of this model. For the more general problem of aligned copy number variation in multiple sequences, Zhang et al. (2010) and Siegmund et al. (2011), after a suitable normalization of the data described in those papers, found that the change-point model with Gaussian white noise measurement errors was reasonable.

It follows from (3.3) and (3.4) of Siegmund et al. (2011) that V is approximately Poisson for high thresholds. We used (3.4) from that paper applied to the data from chromosome 4 of the Stanford Panel. For a complete description of this application and dataset, see the cited papers. There is a total of 33 238 positions, with 62 samples. We restricted our analysis to small intervals, and so conducted a variable window scan of all positions with a maximum window size of 50 and a minimum window size of 1. The theoretically derived value of λ(z) compares well with values estimated via Monte Carlo simulation, even for values of z where λ(z) is fairly large. With a false discovery rate threshold of 0·01, 337 discoveries were made. With a false discovery rate threshold of 0·1, 472 discoveries were made. See Fig. 1 for an example region containing 500 positions and 3 discoveries.

Fig. 1. Scanning windows (t, t + w) that exceed the threshold z = 30 for a region containing 500 positions in the DNA copy number data of § 3·3. Each black horizontal segment shows the start and end points of a window, with the actual value of the scan statistic shown on the y-axis. This region contains three discoveries, or clumps, shown as thick bars at the top of the plot.
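For intuition about the variable window scan used here, a single-sample R sketch follows. The actual analysis uses the multi-sample statistics of Zhang et al. (2010) and Siegmund et al. (2011), so this simplification to one sequence is only illustrative.

```r
# Sketch: at each start position, maximize the standardized window sum
# over window widths w = 1, ..., w_max (single sequence only).
scan_variable <- function(y, w_max = 50) {
  m <- length(y); cs <- c(0, cumsum(y))
  best <- rep(-Inf, m)
  for (w in 1:w_max) {
    zt <- (cs[(w + 1):(m + 1)] - cs[1:(m + 1 - w)]) / sqrt(w)  # width-w sums
    best[1:(m + 1 - w)] <- pmax(best[1:(m + 1 - w)], zt)
  }
  best  # maximal statistic over widths, indexed by start position
}
```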

ACKNOWLEDGEMENT

The research of the first and third authors is supported by the Israeli-American Bi-National Fund. The second and third authors are supported by the National Science Foundation, U.S.A. We would like to thank Dr Shifman from The Hebrew University of Jerusalem for giving us access to the data of the experiment described in § 3 and for conducting the simulation described therein.

REFERENCES

Aldous, D. (1988). Probability Approximations via the Poisson Clumping Heuristic. New York: Springer.
Arratia, R., Goldstein, L. & Gordon, L. (1989). Two moments suffice for Poisson approximations: the Chen–Stein method. Ann. Prob. 17, 9–25.
Ben-David, E., Granot-Hershkovitz, E., Monderer-Rothkoff, G., Lerer, E., Levi, S., Yaari, M., Ebstein, R. P., Yirmiya, N. & Shifman, S. (2011). Identification of a functional rare variant in autism using genome-wide screen for monoallelic expression. Hum. Molec. Genet. 20, 3632–41.
Benjamini, Y. & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Statist. Soc. B 57, 289–300.
Chen, L. (1975). Poisson approximation for dependent trials. Ann. Prob. 3, 533–45.
Efron, B. (2010). Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction. Cambridge: Cambridge University Press.
Jeng, X. J., Cai, T. T. & Li, H. (2010). Optimal sparse segment identification with application in copy number variation analysis. J. Am. Statist. Assoc. 105, 1156–66.
Korn, J. M., Kuruvilla, F. G., McCarroll, S. A., Wysoker, A., Nemesh, J., Cawley, S., Hubbell, E., Veitch, J., Collins, P. J., Darvishi, K., et al. (2008). Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nature Genet. 40, 1253–60.
Lai, W. R., Johnson, M. D., Kucherlapati, R. & Park, P. J. (2005). Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 21, 3763–70.
Lander, E. & Botstein, D. (1989). Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics 121, 185–99.
Lindgren, G., Leadbetter, M. R. & Rootzén, H. (1983). Extremes and Related Properties of Random Sequences and Processes. New York: Springer.
Medvedev, P., Stanciu, M. & Brudno, M. (2009). Computational methods for discovering structural variation with next-generation sequencing. Nature Meth. Suppl. 6, S13–20.
Olshen, A. B., Venkatraman, E. S., Lucito, R. & Wigler, M. (2004). Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 5, 557–72.
R Development Core Team (2011). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-07-0.
Siegmund, D. O. & Worsley, K. J. (1995). Testing for a signal with unknown location and scale in a stationary Gaussian random field. Ann. Statist. 23, 608–39.
Siegmund, D. & Yakir, B. (2007). The Statistics of Gene Mapping. New York: Springer.
Siegmund, D., Yakir, B. & Zhang, N. (2011). Detecting simultaneous variant intervals in aligned sequences. Ann. Appl. Statist. 5, 645–68.
Storey, J. D., Taylor, J. E. & Siegmund, D. O. (2004). Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. J. R. Statist. Soc. B 66, 187–205.
Worsley, K., Evans, A. C., Marrett, S. & Neelin, P. (1992). A three-dimensional statistical analysis for CBF activation studies in human brain. J. Cerebral Blood Flow Metab. 12, 900–18.
Zhang, Y. (2008). Poisson approximation for significance in genome-wide ChIP-chip tiling arrays. Bioinformatics 24, 2825–31.
Zhang, N., Siegmund, D., Ji, H. & Li, J. Z. (2010). Detecting simultaneous change-points in multiple sequences. Biometrika 97, 631–45.

[Received September 2010. Revised August 2011]