
Resampling-based Multiple Testing with Applications to Microarray Analysis

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the

Graduate School of The Ohio State University

By

Dongmei Li, B.A., M.S.

*****

The Ohio State University

2009

Dissertation Committee:

Dr. Jason C. Hsu, Adviser
Dr. Elizabeth Stasny
Dr. William Notz
Dr. Steve MacEachern

Approved by:
Adviser, Graduate Program in Statistics
The Ohio State University

© Copyright by

Dongmei Li

2009

ABSTRACT

In microarray data analysis, resampling methods are widely used to discover significantly differentially expressed genes under different biological conditions when the distributions of the test statistics are unknown. When the sample size is small, however, simultaneous testing of thousands, or even millions, of null hypotheses in microarray data analysis brings challenges to the multiple hypothesis testing field. We study the small sample behavior of three commonly used resampling methods in multiple hypothesis testing: permutation tests, post-pivot resampling methods, and pre-pivot resampling methods. We show that the model-based pre-pivot resampling methods have the largest maximum number of unique resampled test statistic values, and thus tend to produce more reliable P-values than the other two resampling methods. To avoid problems with the application of the three resampling methods in practice, we propose new conditions, based on the Partitioning Principle, to control the multiple testing error rates in fixed-effects general linear models. Meanwhile, from both theoretical results and simulation studies, we show the discrepancies between the true expected values of order statistics and the expected values of order statistics estimated by permutation in the Significance Analysis of Microarrays (SAM) procedure.

Moreover, we show the conditions for SAM to control the expected number of false rejections in the permutation-based SAM procedure. We also propose a more powerful adaptive two-step procedure that controls the expected number of false rejections with larger critical values than the Bonferroni procedure.

This is dedicated to my dear husband Zidian Xie, my cute daughter Catherine Xie,

my cute son Matthew Xie, and my dear parents.

ACKNOWLEDGMENTS

I would like to express my heartfelt gratitude to my advisor Professor Jason C.

Hsu for his encouragement, constant guidance and extreme patience. Without his

advice, it would have been impossible for me to finish this dissertation.

A special thanks goes to Professor Elizabeth Stasny, Graduate Studies Chair in

Statistics, who carefully proofread my papers and gave me tons of help during my

Ph.D. study.

I would also like to thank my other committee members, Professor William Notz and Professor Steve MacEachern for their thoughtful questions and advice.

I am enormously grateful to my parents, my husband, and my kids for their support and love, especially my husband Zidian Xie, who always supports me whenever I need him.

VITA

1998 ............................. B.A. Pomology, Laiyang Agriculture College, China

2001 ............................. M.S. Biophysics, China Agriculture University, China

2006 ............................. M.S. Statistics, The Ohio State University, U.S.A.

2001-present ..................... Graduate Teaching and Research Associate, The Ohio State University

PUBLICATIONS

Research Publications

Violeta Calian, Dongmei Li, and Jason C. Hsu. Partitioning to Uncover Conditions for Permutation Tests to Control Multiple Testing Error Rates. Biometrical Journal, 50 (5): 756-766, 2008. DOI:10.1002/bimj.200710471.

FIELDS OF STUDY

Major Field: Biostatistics

TABLE OF CONTENTS


Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures

Chapters:

1. Multiple hypotheses testing and resampling methods

   1.1 Multiple hypotheses testing
       1.1.1 Introduction
       1.1.2 Two definitions of Type I error rate
       1.1.3 Familywise Error Rate (FWER)
       1.1.4 False Discovery Rate (FDR)
       1.1.5 Multiple testing principles
   1.2 Resampling methods
       1.2.1 Permutation tests
       1.2.2 Bootstrap methods

2. Small sample behavior of resampling methods

   2.1 Tomato microarray example
   2.2 Conditions for getting adjusted P-values of zero using the post-pivot resampling method
       2.2.1 Conditions for getting adjusted P-values of zero with a sample size of two
       2.2.2 Conditions for getting adjusted P-values of zero with a sample size of three
   2.3 Conditions for getting adjusted P-values of zero using the pre-pivot resampling method
   2.4 Discreteness of resampled test statistics' distributions
       2.4.1 Paired samples
       2.4.2 Two independent samples
       2.4.3 Multiple independent samples
       2.4.4 General linear mixed-effects models

3. Conditions for resampling methods to control multiple testing error rates

   3.1 Two-group comparison
       3.1.1 Permutation tests
       3.1.2 Post-pivot resampling method
       3.1.3 Pre-pivot resampling method
   3.2 Fixed-effects general linear models
   3.3 Estimating the test statistic's null distribution
       3.3.1 Permutation tests
       3.3.2 Pre-pivot resampling method
       3.3.3 Post-pivot resampling method
   3.4 Estimating critical values for strong control of FWER
       3.4.1 Permutation tests
       3.4.2 Pre-pivot resampling method
       3.4.3 Post-pivot resampling method
   3.5 Shortcuts of partitioning tests using resampling methods
       3.5.1 Permutation tests
       3.5.2 Pre-pivot resampling method
       3.5.3 Post-pivot resampling method

4. Conditions for Significance Analysis of Microarrays (SAM) to control the empirical FDR

   4.1 Introduction to the Significance Analysis of Microarrays (SAM) method
   4.2 Discrepancies between true expected values of order statistics and expected values estimated by permutation
       4.2.1 Effect of unequal variance-covariance matrices and sample sizes
       4.2.2 Effect of higher order cumulants with equal sample sizes
   4.3 Conditions for controlling the expected number of false rejections in SAM
   4.4 An adaptive two-step procedure controlling the expected number of false rejections
   4.5 Discussion

5. Concluding remarks

References

LIST OF TABLES


1.1 Summary of possible outcomes from testing k null hypotheses

2.1 Adjusted P-values calculated from formula (2.1) for the permutation test, post-pivot resampling method and pre-pivot resampling method

2.2 Maximum number of unique resampled test statistic values for the permutation test, post-pivot resampling method and pre-pivot resampling method

LIST OF FIGURES


2.1 Null distribution of max_{i=1,2,3} |T_i| for k = 3 and n = 3. Observed test statistics and resampled test statistics from permutation test, post-pivot resampling and pre-pivot resampling methods

4.1 Q-Q plot of the true expected values of order statistics against the expected values estimated by permutation for unequal variance-covariance matrices and sample sizes. Dashed line in the Q-Q plot is the 45 degree diagonal line

4.2 Q-Q plot of the true expected values of order statistics against the expected values estimated by permutation for unequal correlations and sample sizes. Dashed line in the Q-Q plot is the 45 degree diagonal line

4.3 Q-Q plot of the true expected values of order statistics against the expected values estimated by permutation for unequal higher order cumulants. Dashed line in the Q-Q plot is the 45 degree diagonal line

4.4 Q-Q plot of the true expected values of order statistics against the expected values estimated by permutation for unequal third order cross cumulants. Dashed line in the Q-Q plot is the 45 degree diagonal line

CHAPTER 1

MULTIPLE HYPOTHESES TESTING AND RESAMPLING METHODS

1.1 Multiple hypotheses testing

1.1.1 Introduction

With the rapid development of biotechnology, microarray technology has become widely used in biomedical and biological fields to identify differentially expressed genes and transcription factor binding sites, and to map complex traits using single nucleotide polymorphisms (SNPs) (Kulesh et al. (1987), Schena et al. (1995), Lashkari et al. (1997), Pollack et al. (1999), Buck and Lieb (2004), Mei et al. (2000), Hehir-Kwa et al. (2007)). Having thousands, or even millions, of genes on a small array makes multiple comparisons a central topic in statistics today, because thousands, or even millions, of hypotheses must be tested simultaneously.

Without multiplicity adjustment, if each hypothesis is tested at level α, the probability of rejecting at least one true null hypothesis increases rapidly as the number of hypotheses grows. If, for example, 20 hypotheses are tested simultaneously and each hypothesis is tested at the 5% level, the probability of rejecting at least one true null hypothesis is 64%, assuming all the test statistics are independent. Therefore, in order to make the multiplicity adjustment, a multiple hypotheses testing procedure needs to control a certain type of error rate at a level α. A popular multiple testing error rate controlled in many multiple hypotheses testing procedures is the family-wise error rate (FWER) (Hochberg and Tamhane (1987), Shaffer (1995)), which is defined as the probability of at least one false rejection. Another, less stringent, multiple testing error rate commonly used is the false discovery rate (FDR) (Benjamini and Hochberg (1995)), which is defined as the expected proportion of falsely rejected null hypotheses.
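To see the scale of the problem, a short Python computation (illustrative only) evaluates 1 − (1 − α)^k, the probability of at least one false rejection among k independent level-α tests:

```python
# P(at least one false rejection) among k independent level-alpha tests.
alpha = 0.05
for k in (1, 5, 20, 100):
    fwer = 1 - (1 - alpha) ** k
    print(f"k = {k:3d}: P(at least one false rejection) = {fwer:.2f}")
# k = 20 gives 0.64, matching the 64% figure quoted above.
```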

1.1.2 Two definitions of Type I error rate

Suppose k genes are probed to compare expression levels between high risk and low risk patients. Let µ_Hi, µ_Li, i = 1, ..., k, denote the expected (logarithms of) expression levels of the ith gene of a randomly sampled patient from the high risk and low risk groups, respectively. Let θ_i = µ_Hi − µ_Li denote the difference of expected (logarithm of) expression levels of the ith gene between the high risk group and the low risk group. To determine which of the genes are differentially expressed in expectation between the high risk and low risk patients, we need to test the following null hypotheses:

H_0i : θ_i = 0,  i = 1, ..., k.  (1.1)

There are two different ways to define the Type I error rate when testing a single null hypothesis. Let θ = (θ_1, θ_2, ..., θ_k), and let Σ denote generically all nuisance parameters that the observed expression levels depend on, such as the covariances of the expression levels within each of the high risk and low risk groups. Let θ^0 = (θ_1^0, ..., θ_k^0) and Σ^0 be a collection of all (unknown) true parameter values. A traditional definition of the Type I error rate, given by Casella and Berger (1990) or Berger (1993), is

sup_{θ,Σ : θ_i = 0} P_{θ,Σ}{Reject H_0i},

where the supremum is taken over all possible θ and Σ subject to θ_i = 0.

Another definition of the Type I error rate, given by Pollard and van der Laan (2005), is

P_{θ^0,Σ^0}{Reject H_0i},

where θ_i^0 = 0, θ^0 = (θ_1^0, ..., θ_k^0), and Σ^0 represents the set of all (unknown) true parameter values.

The first definition of the Type I error rate is more widely used than the second definition. The second definition can only be controlled asymptotically, since the true parameter values are unknown in microarray data analysis.

1.1.3 Familywise Error Rate (FWER)

When we are testing k null hypotheses simultaneously, the summary of possible

outcomes is shown in Table 1.1.

Table 1.1: Summary of possible outcomes from testing k null hypotheses

                             Number not rejected   Number rejected   Total
True null hypotheses                 U                    V           k_0
Non-true null hypotheses             T                    S           k - k_0
Total                              k - R                  R           k

In Table 1.1, V denotes the number of incorrectly rejected true null hypotheses when testing k null hypotheses; R denotes the number of hypotheses rejected among

those k null hypotheses; k_0 denotes the number of true null hypotheses; and k − k_0

FWER is defined as the probability of rejecting at least one true null hypothesis

(at least one false rejection). FWER has the following expression:

FWER = P{V ≥ 1}.  (1.2)

There are two kinds of control of FWER. One is strong control of FWER, which controls the probability of at least one false rejection under any combination of true and false null hypotheses (controls the supremum). The other is weak control of FWER, which controls the probability of at least one false rejection under the complete null hypothesis H_0^C : ∩_{i=1}^k H_0i with k_0 = k (Westfall and Young (1993), Lehmann and Romano (2005)). In microarray experiments, since it is rare that no gene is differentially expressed, controlling FWER strongly is more appropriate than controlling it weakly.

Strong control of FWER is desired to minimize the number of false rejections in some cases, such as selecting genes to build diagnostic or prognostic chips for diseases. An example is the MammaPrint developed by Agendia, which is based on the well-known

Amsterdam 70-gene breast cancer gene signature (van ’t Veer et al. (2002), van de

Vijver et al. (2002), Buyse et al. (2006), Glas et al. (2006)). MammaPrint is used to predict whether existing breast cancer will metastasize (spread to other parts of a patient’s body).

The multiple testing procedure proposed by Pollard and van der Laan (2005) has strong asymptotic control of FWER. It controls an error rate α_n for a sample of size n, with lim sup_{n→∞} α_n ≤ α under the true data generating distribution as the sample size n goes to infinity.

1.1.4 False Discovery Rate (FDR)

The concept of false discovery rate (FDR) was first proposed by Benjamini and

Hochberg (1995) to reduce the stringency of strong FWER control. FDR is more widely used than FWER in studies because the investigators are more interested in finding all potential genes that are differentially expressed even if some genes could be falsely identified (Benjamini and Yekutieli (2001), Storey (2002),

Storey and Tibshirani (2003b), Storey and Tibshirani (2003a), Benjamini et al. (2006),

Strimmer (2008)).

FDR is defined as the expected proportion of erroneously rejected null hypotheses among all rejected null hypotheses:

FDR = E(V/R | R > 0) Pr(R > 0).

Benjamini and Hochberg (1995) also presented four alternative formulations of

FDR:

(1) Positive FDR

pFDR = E(V/R | R > 0).

The pFDR is recommended by Storey (2002), who argued that pFDR is a more appropriate error measure than FDR.

(2) Conditional FDR

cFDR = E(V/R | R = r),

where r is the observed number of rejected null hypotheses.

(3) Marginal FDR

mFDR = E(V)/E(R).

(4) Empirical FDR

Fdr = E(V)/r.

Benjamini and Hochberg (1995) argued that all four of these formulations cannot be controlled when all null hypotheses are true (k_0 = k). If k_0 = k, then whenever even a single null hypothesis is rejected, V/R = 1, and the rate cannot be controlled. pFDR, cFDR, mFDR, and Fdr all have the same problem: they are identically 1 when k_0 = k.

Tsai et al. (2003) showed that pFDR, cFDR and mFDR are equivalent under the

Bayesian framework, in which the number of true null hypotheses is modeled as a random variable. The Significance Analysis of Microarrays (SAM) method that will be discussed in Chapter 4 estimates the empirical FDR.
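As a toy illustration of these quantities, the following Python simulation sketch estimates FDR = E(V/R | R > 0) Pr(R > 0) and mFDR = E(V)/E(R) by Monte Carlo; the mixture of true and false nulls and the effect size are arbitrary choices for the example:

```python
import numpy as np

rng = np.random.default_rng(3)
k, k0, reps = 100, 90, 2000            # 90 true nulls, 10 false nulls
v_over_r, any_rej, v_tot, r_tot = [], 0, 0, 0

for _ in range(reps):
    # z-statistics: true nulls centered at 0, false nulls at 3.
    z = np.concatenate([rng.normal(0, 1, k0), rng.normal(3, 1, k - k0)])
    reject = np.abs(z) > 1.96          # unadjusted level-0.05 tests
    v, r = reject[:k0].sum(), reject.sum()
    v_tot, r_tot = v_tot + v, r_tot + r
    if r > 0:
        any_rej += 1
        v_over_r.append(v / r)

fdr = np.mean(v_over_r) * (any_rej / reps)   # E(V/R | R > 0) * Pr(R > 0)
mfdr = v_tot / r_tot                         # E(V)/E(R)
print(round(fdr, 3), round(mfdr, 3))
```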

1.1.5 Multiple testing principles

A general principle of multiple testing is the Partitioning Principle proposed by

Stefansson et al. (1988), and further refined by Finner and Strassburger (2002). Both

Holm (1979)’s step-down method and Hochberg (1988)’s step-up method are special cases of partition testing (Huang and Hsu (2007)). The principle of partition testing is to partition the parameter space into disjoint subspaces, test each partitioning null hypothesis at level α, and collate the results across the subspaces, as follows:

Let P = {1, ..., k}, and consider testing H_0i : θ_i = 0, i = 1, ..., k. To control FWER strongly, the Partitioning Principle states:

P1: For each I ⊆ {1, ..., k}, I ≠ ∅, form H*_0I : θ_i = 0 for all i ∈ I and θ_j ≠ 0 for j ∉ I. In total, there are 2^k parameter subspaces and 2^k − 1 null hypotheses to be tested.

P2: Test each H*_0I at level α. Since all the null hypotheses are disjoint, at most one null hypothesis is true. Therefore, no multiplicity adjustment is required for each H*_0I.

P3: For each i, infer θ_i ≠ 0 if and only if all H*_0I with i ∈ I are rejected, since H_0i is the union of the H*_0I with i ∈ I.

Taking k = 3 as an example, the parameter space Θ = {θ_1, θ_2, θ_3} will be partitioned into eight disjoint subspaces:

Θ_1 = {θ_1 = 0 and θ_2 = 0 and θ_3 = 0}
Θ_2 = {θ_1 = 0 and θ_2 = 0 and θ_3 ≠ 0}
Θ_3 = {θ_1 = 0 and θ_2 ≠ 0 and θ_3 = 0}
⋯
Θ_7 = {θ_1 ≠ 0 and θ_2 ≠ 0 and θ_3 = 0}
Θ_8 = {θ_1 ≠ 0 and θ_2 ≠ 0 and θ_3 ≠ 0}

Next, we will test each of the following H*_0I's at level α:

H*_0{123} : θ_1 = 0 and θ_2 = 0 and θ_3 = 0
H*_0{12}  : θ_1 = 0 and θ_2 = 0 and θ_3 ≠ 0
H*_0{13}  : θ_1 = 0 and θ_2 ≠ 0 and θ_3 = 0
⋯
H*_0{2}   : θ_1 ≠ 0 and θ_2 = 0 and θ_3 ≠ 0
H*_0{3}   : θ_1 ≠ 0 and θ_2 ≠ 0 and θ_3 = 0

Finally, infer θ_i ≠ 0 if and only if all H*_0I involving θ_i = 0 are rejected.

Another multiple testing principle, similar to the Partitioning Principle, is the closed testing principle (Marcus et al. (1976)). The closed testing principle states:
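To make steps P1-P3 concrete, here is a small Python sketch that enumerates the 2^k − 1 partitioning hypotheses for k = 3 and applies the P3 inference rule; the rejection decisions are hypothetical placeholders, not tied to any particular test:

```python
from itertools import combinations

k = 3
genes = range(1, k + 1)

# All nonempty I subset of {1,...,k}: the 2^k - 1 partitioning nulls H*_0I.
subsets = [set(c) for r in range(1, k + 1) for c in combinations(genes, r)]

# Hypothetical level-alpha test decisions for each H*_0I (True = rejected).
rejected = {frozenset(I): True for I in subsets}
rejected[frozenset({1, 2})] = False        # e.g., suppose H*_0{12} is retained

# P3: infer theta_i != 0 iff every H*_0I with i in I is rejected.
for i in genes:
    if all(rejected[frozenset(I)] for I in subsets if i in I):
        print(f"infer theta_{i} != 0")
    else:
        print(f"cannot infer theta_{i} != 0")
```

With H*_0{12} retained, genes 1 and 2 cannot be declared significant, while gene 3 can, exactly the collation described in P3.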

C1: For each I ⊆ {1, ..., k}, form the intersection null hypothesis H_0I : θ_i = 0 for all i ∈ I.

C2: Test each H_0I at level α.

C3: For each i, infer θ_i ≠ 0 if and only if all H_0I with i ∈ I are rejected.

Compared to the partition testing procedure, the closed testing procedure tests less restrictive hypotheses. However, the closed testing procedure still controls FWER strongly, because a level-α test for H_0I is also a level-α test for H*_0I.

To test H_0 : θ_i = 0 (i = 1, ..., k) using the test statistics T_i = θ̂_i (i = 1, ..., k), we will test 2^k − 1 null hypotheses in accordance with the Partitioning Principle. Here is a typical partitioning null hypothesis:

H*_0{12⋯t} : θ_1 = 0 and ⋯ and θ_t = 0 and θ_{t+1} ≠ 0 and ⋯ and θ_k ≠ 0  (1 ≤ t ≤ k).

The above null hypothesis can be simplified as

H_0{12⋯t} : θ_1 = 0 and θ_2 = 0 and ⋯ and θ_t = 0  (1 ≤ t ≤ k)

according to the closed testing principle. This still controls FWER strongly, because a level-α test for H_0{12⋯t} is also a level-α test for H*_0{12⋯t}. The test statistic for testing H_0{12⋯t} is max_{i=1,...,t} |T_i| = max_{i=1,...,t} |θ̂_i|, because H_0{12⋯t} is tested with a union-intersection test (Casella and Berger (1990)), and the rejection region for a union-intersection test is ∪_{i∈{1,...,t}} {|T_i| > c} = {max_{i=1,...,t} |T_i| > c} (where c is the critical value for testing H_0{12⋯t}).

1.2 Resampling methods

Resampling methods can be used to estimate the precision of sample statistics (medians, variances, percentiles), perform significance tests, and validate models (Westfall and Young (1993), Efron and Tibshirani (1994), Davison and Hinkley (1997), Good (2005)). The commonly used resampling techniques include permutation tests and bootstrap methods. Two different bootstrap methods, the post-pivot resampling method and the pre-pivot resampling method, will be introduced in this section. Westfall and Young (1993) introduced procedures that use resampling to adjust P-values in multiple testing in order to control multiple testing error rates.

1.2.1 Permutation tests

A permutation test is a type of non-parametric statistical significance test in

which a reference distribution is constructed by calculating all possible values of

test statistics from permuted observations under a null hypothesis. The theory of

permutation tests is based on the works of Fisher and Pitman in the 1930s (Good

(2005)).

Compared to parametric testing procedures, the fewer distributional assumptions and the simpler procedures make permutation tests attractive to many researchers. For example, when comparing the means of two populations, a two-sample t-test assumes that the distribution of the difference between sample averages is normal, which is not true in many cases; the t-test is only valid when both populations have independent or joint normal distributions. In contrast, the permutation test is distribution-free, so it can give exact P-values even when the sample size is small. The permutation test permutes the labels of observations between the two groups, and obtains the P-value by calculating the proportion of test statistic values from resamples that are as extreme or more extreme than the observed test statistic value. In microarray data analysis, when the correlations between genes are considered in the joint distribution of test statistics, the parametric form of a multivariate t distribution becomes very complex and difficult to calculate. In contrast, the permutation test is easy to conduct and avoids complex calculations.

To carry out a permutation test based on a test statistic that measures the size of an effect of interest, we proceed as follows:

1. Compute the test statistic for the observed data set.

2. Permute the original data in a way that matches the null hypothesis to get permuted resamples, and construct the reference distribution using the test statistics calculated from permuted resamples.

3. Calculate the critical value of a level-α test based on the upper α percentile of the reference distribution, or obtain the P-value by computing the proportion of permutation test statistics that are as extreme or more extreme than the observed test statistic. A minimal code sketch of these steps follows.
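The sketch below, in Python, carries out the three steps for a two-sample comparison of means by complete enumeration of the group-label assignments; the toy data and the difference-in-means statistic are assumptions made for illustration:

```python
import itertools
import numpy as np

def permutation_pvalue(x, y):
    """Two-sided permutation P-value for a difference in group means,
    using complete enumeration of the group-label assignments."""
    pooled = np.concatenate([x, y])
    m = len(x)
    t_obs = abs(np.mean(x) - np.mean(y))   # step 1: observed statistic
    count = total = 0
    # Step 2: enumerate all ways to choose which m observations get label 1.
    for idx in itertools.combinations(range(len(pooled)), m):
        mask = np.zeros(len(pooled), dtype=bool)
        mask[list(idx)] = True
        t_perm = abs(pooled[mask].mean() - pooled[~mask].mean())
        count += t_perm >= t_obs
        total += 1
    return count / total                   # step 3: exact P-value

x = np.array([2.1, 3.4, 2.8])
y = np.array([1.0, 1.6, 0.9])
print(permutation_pvalue(x, y))            # 2/20 = 0.1 for these toy data
```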

Permutation tests can be used in a wide variety of settings. For example, Fisher's exact test (a permutation test) is used to detect the association between a row variable and a column variable for small, sparse, or unbalanced data sets. Ein-Dor et al. (2005) used a permutation test for selecting genes whose expression profiles are significantly correlated with breast cancer survival status. Based on random permutations of time points, Ptitsyn et al. (2006) applied the permutation test to identifying periodic patterns in relatively short time series obtained using microarray technology; such periodic processes are important for modulating and coordinating the transcription of genes governing key metabolic pathways. Churchill and Doerge (1994) used a permutation test based on the permutation of observed quantitative traits to determine quantitative trait loci. To identify significant changes in gene expression in microarray experiments, Tusher et al. (2001) used permutations of the repeated measurements in the Significance Analysis of Microarrays (SAM) procedure.

For two-group comparisons, permuting the labels of observations between the two groups requires the assumption that the two populations are identical when the null hypothesis is true; that is, not only are their means the same, but also their spreads and shapes. Pollard and van der Laan (2005) demonstrated that, if both the correlation structures and the sample sizes differ between the two populations, then a permutation test does not control the Type I error rate at its nominal significance level for detecting differentially expressed genes between the two groups. The conditions for permutation tests to control multiple testing error rates when comparing two groups and finding significant predictor variables in fixed-effects general linear models will be further discussed in Chapter 3.

For testing hypotheses about a single population, comparing populations that differ even under the null hypothesis, or testing general relationships, permutation tests cannot be used because we do not know how to resample in a way that matches the null hypothesis in these settings. Hence, bootstrap methods should be used instead.

1.2.2 Bootstrap methods

The bootstrap method was first introduced by Efron (1979) and further discussed by Efron and Tibshirani (1994).

The bootstrap method is a way of approximating the sampling distribution of a statistic from just one sample. Instead of taking many simple random samples from the population to find the sampling distribution of a sample statistic, the bootstrap method repeatedly resamples with replacement from the one random sample. The bootstrap distribution of a statistic collects the values of the statistic from many resamples, and gives information about the sampling distribution of the statistic. For example, the bootstrap distribution of a sample mean is obtained from the resampled means calculated from hundreds of resamples drawn with replacement from a single original sample. The bootstrap distribution of a sample mean has the following mean and standard error:

mean_boot = X̄_boot = (1/B) Σ_b X̄*_b,

SE_boot = √[ (1/(B − 1)) Σ_b (X̄*_b − mean_boot)² ],

where X̄*_b is the sample mean of the bth bootstrap resample and B is the number of resamples.
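As an illustration, here is a small Python sketch of these two formulas; the data and the number of resamples B are arbitrary choices for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
sample = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9])
B = 1000

# Bootstrap distribution of the sample mean: resample with replacement
# from the one observed sample, B times.
boot_means = np.array([
    rng.choice(sample, size=len(sample), replace=True).mean()
    for _ in range(B)
])

mean_boot = boot_means.mean()        # center of the bootstrap distribution
se_boot = boot_means.std(ddof=1)     # bootstrap standard error, 1/(B-1) scaling
print(mean_boot, se_boot)
```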

Since a bootstrap distribution of a statistic is generated from a single original sample, it is centered at the value of the sample statistic rather than at the parameter value. Bootstrap distributions include two sources of random variation: one is from choosing the original sample at random from the population, and the other is from choosing bootstrap resamples at random from the original sample, which introduces little additional variation.

Bootstrap methods are asymptotically valid (as the original sample size goes to ∞). Efron (1979) showed that the bootstrap method can (asymptotically) correctly estimate the variance of a sample median, and the error rates in a linear discrimination problem (outperforming cross-validation). Freedman (1981) showed that the bootstrap approximation to the distribution of least squares estimates is valid. Hall (1986) showed the bootstrap method's reduction of the coverage error probability from O(n^{-1/2}) to O(n^{-1}), which makes the bootstrap method one order more accurate than the delta method.

Bootstrap methods are widely used in all kinds of data analysis. Davison and

Hinkley (1997) illustrated the application of bootstrap methods in stratified data;

finite populations; censored and missing data; linear, nonlinear, and smooth regression models; classification; and time series and spatial problems. For example, using Efron's bootstrap resampling method, Liu et al. (2004) analyzed the performance of artificial neural networks (ANNs) in feature classification for the analysis of mammographic masses to achieve more accurate results. Feature classification in mammography is used to discover the salient information that can be used to discriminate benign from malignant masses.

In microarray data analysis, there are two commonly used bootstrap methods, including the post-pivot resampling method and the pre-pivot resampling method.

Both methods can control FWER asymptotically and give similar results in a fixed-effects linear model with i.i.d. errors. In two-group comparisons, the null distribution estimated by the pre-pivot resampling method has more resampled test statistic values than that estimated by the post-pivot resampling method under an assumption (that the distributions of the errors are exchangeable) that is reasonable for microarray data.

Post-pivot resampling method

The post-pivot resampling method was introduced by Pollard and van der Laan

(2005) to estimate the null distribution of test statistics in multiple hypotheses testing

to achieve asymptotic multiple testing error rate control. The post-pivot resampling

method obtains the asymptotically correct null distribution of the test statistic (based

on the true data generating distribution) from centered and/or scaled resampled test

statistics.

In microarray data analysis with two or more treatment groups, the post-pivot resampling method resamples the observed data within each group, calculates the resampled test statistics from each resample, centers and/or scales the resampled test statistics (subtracts the average of the resampled test statistics and/or divides by their standard error), and estimates the test statistic's null distribution from the centered and/or scaled resampled test statistics.

To carry out a hypothesis test based on a test statistic that measures the location difference between two populations, the post-pivot resampling method proceeds as follows:

1. Compute the test statistic for the observed data set.

2. Resample the data with replacement within each group to obtain bootstrap resamples, compute the test statistic for each resampled data set, and construct the reference distribution using the centered and/or scaled resampled test statistics.

3. Calculate the critical value of a level-α test based on the upper α percentile of the reference distribution, or obtain the P-value by computing the proportion of bootstrapped test statistics that are as extreme or more extreme than the observed test statistic.
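As a concrete illustration of the three steps above, here is a minimal Python sketch for a two-group difference in means; the data, the number of bootstrap resamples, and the choice to center (but not scale) the resampled statistics are assumptions made for this example:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.array([2.1, 3.4, 2.8])   # group 1 (illustrative data)
y = np.array([1.0, 1.6, 0.9])   # group 2
B = 2000

t_obs = x.mean() - y.mean()     # step 1: observed test statistic

# Step 2: resample with replacement *within* each group and recompute
# the test statistic for each resample.
t_boot = np.array([
    rng.choice(x, size=len(x), replace=True).mean()
    - rng.choice(y, size=len(y), replace=True).mean()
    for _ in range(B)
])

# Center the resampled statistics so they estimate the null distribution.
z = t_boot - t_boot.mean()

# Step 3: two-sided P-value against the centered reference distribution.
p_value = np.mean(np.abs(z) >= np.abs(t_obs))
print(p_value)
```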

Pre-pivot resampling method

The pre-pivot resampling method first fits a model to the observed data and then estimates the test statistic's null distribution by bootstrapping the centered residuals (subtracting the sample mean of the residuals) (Freedman (1981)). Under the assumption that the model fits the data well, the pre-pivot resampling method provides asymptotically valid results, i.e., it can control multiple testing error rates asymptotically when testing multiple null hypotheses.

In microarray data analysis, the pre-pivot resampling method estimates the null distributions of test statistics by bootstrapping residuals from a probe-level or a gene-level model with treatment effects. The way the residuals are resampled with replacement (bootstrapped) depends on the assumptions about the residuals. The residuals can be resampled across treatments under the assumption of the same distributions across treatments, but not across genes. If the distributions are the same across genes, then residuals across treatments and genes can be pooled together for resampling with replacement.

To carry out a hypothesis test based on a test statistic that measures the location difference between two populations, the pre-pivot resampling method has the following procedure:

1. Compute the test statistic for the observed data set.

2. Fit a one-way model to the observed data, and compute the residuals from the one-way model (subtract the sample mean from each observation within each group).

3. Combine the residuals of two groups together under an assumption that the

distributions of the residuals are the same for these two groups.

4. Resample the pooled residuals with replacement to get bootstrapped residuals, and center the bootstrapped residuals at the average (subtract the average of the bootstrapped residuals) if the average of those bootstrapped residuals is not zero.

5. Add the centered bootstrapped residuals from each resample back to the one-way model, and recompute the test statistic for each resample. The test statistics from all resamples then form the reference distribution.

6. Calculate the critical value of a level-α test based on the upper α percentile of the reference distribution, or obtain the P-value by computing the proportion of bootstrapped test statistics that are as extreme or more extreme than the observed test statistic. A code sketch of steps 1-6 follows.
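The sketch below implements the six steps in Python for a two-group one-way model; the data, the number of resamples, and the pooling of residuals across the two groups are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.array([2.1, 3.4, 2.8])   # group 1 (illustrative data)
y = np.array([1.0, 1.6, 0.9])   # group 2
B = 2000

# Step 1: observed test statistic.
t_obs = x.mean() - y.mean()

# Steps 2-3: residuals from the one-way model, pooled across groups
# (assumes the residual distributions are the same in both groups).
residuals = np.concatenate([x - x.mean(), y - y.mean()])

t_boot = np.empty(B)
for b in range(B):
    # Step 4: resample pooled residuals with replacement and recenter.
    r = rng.choice(residuals, size=len(residuals), replace=True)
    r -= r.mean()
    # Step 5: adding the resampled residuals back to the fitted model and
    # recomputing the statistic; under H0 the group effects cancel in the
    # difference, leaving the difference of resampled residual means.
    rx, ry = r[:len(x)], r[len(x):]
    t_boot[b] = rx.mean() - ry.mean()

# Step 6: two-sided P-value from the reference distribution.
print(np.mean(np.abs(t_boot) >= np.abs(t_obs)))
```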

CHAPTER 2

SMALL SAMPLE BEHAVIOR OF RESAMPLING METHODS

Resampling techniques are popular in microarray data analysis. In this chapter, we will discuss the small sample behavior of three popular resampling techniques for multiple testing: the permutation test, the post-pivot resampling method, and the pre-pivot resampling method. We will show that when the sample size is small, for matched pairs, a permutation test is unlikely to give small P-values, while both post-pivot and pre-pivot resampling methods might give P-values of zero for the same data, even after adjusting for multiplicity. The discreteness of the test statistics' null distributions estimated by the above three resampling methods will be compared based on the maximum number of unique test statistic values.

2.1 Tomato microarray example

A biology professor in the Department of Horticulture and Crop Science at the

Ohio State University wishes to identify differentially expressed genes between control tomato plants and mutant tomato plants at different tomato fruit developmental stages (flower bud, flower, and fruit). Lee et al. (2000) recommended that at least three replicates be used when designing cDNA microarray experiments,

particularly when gene expression data from single specimens will be analyzed. In the

tomato microarray , there are three paired samples at each stage (three

plants in the control group and three plants in the mutant group). Suppose we

only have three genes at the fruit stage and wish to learn which genes are differentially

expressed between the mutant group and the control group using the single step maxT method, a method based on resampling techniques, for the multiplicity adjustment.

Let X_ij (i = 1, 2, 3 and j = 1, 2, 3) denote the gene expression levels for the ith gene, jth sample in the control group, and Y_ij (i = 1, 2, 3 and j = 1, 2, 3) denote the gene expression levels for the ith gene, jth sample in the treatment group. For the ith gene, X_ij ~ i.i.d. F_Xi and Y_ij ~ i.i.d. F_Yi.

Let d_ij = x_ij − y_ij denote the observed paired difference for the ith gene, jth paired sample, and let θ_i denote the true paired difference between the paired samples. To identify the differentially expressed genes among these three genes, we will test the null hypotheses H_0 : θ_i = 0 (i = 1, 2, 3) using the test statistics T_i = d̄_i (i = 1, 2, 3).

pling methods:

♯ b : T T Raw P = { | i,b|≥| i|}, for i =1,...,k. i B

The single step maxT method based on resampling techniques will be used to calculate the adjusted P-values for multiplicity when testing the three null hypotheses simultaneously. The formula for calculating maxT adjusted P-values with monotonicity enforced is (cf. Westfall and Young (1993)):

Adjusted P_i = #{b : max_{j=1,2,3} |T_{j,b}| ≥ |T_i|} / B,  for i = 1, 2, 3,  (2.1)

where T_{j,b} denotes the resampled test statistic for the jth gene, bth resampling, and B is the total number of resamplings (b = 1, ..., B).
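A minimal Python sketch of formula (2.1), assuming a matrix of resampled statistics has already been produced by one of the three resampling methods (the toy numbers below are not the tomato data):

```python
import numpy as np

# t_obs[i]       : observed statistic for gene i
# t_resamp[i, b] : resampled statistic for gene i in resample b
t_obs = np.array([0.5, 2.0, 2.5])
t_resamp = np.array([[0.3, -0.6, 0.9, -0.2],
                     [0.8, -1.1, 0.4,  0.7],
                     [1.2, -0.5, 0.6, -0.9]])

# Single-step maxT: compare each |T_i| with the resampled max statistic.
max_t = np.abs(t_resamp).max(axis=0)             # max over genes, per resample
adj_p = np.array([(max_t >= abs(t)).mean() for t in t_obs])
print(adj_p)   # monotone in |T_i| by construction of the max
```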

Figure 2.1 shows the absolute values of the observed test statistics, |T_i|, and the maxima of the absolute values of the resampled test statistics, max_{i=1,2,3} |T_{i,b}|, from the three resampling methods. The dots denote the observed test statistics; the rectangles denote the maxima of resampled test statistics from the permutation test; the diamonds denote the maxima of resampled test statistics from the post-pivot resampling method; and the triangles denote the maxima of resampled test statistics from the pre-pivot resampling method.

As shown in Figure 2.1, the permutation test always produces permuted test statistics that are greater than or equal to the observed test statistic. Thus, it is unlikely that the permutation test gives zero adjusted P-values. In contrast, for either the pre-pivot or the post-pivot resampling method, there is a high probability that the observed test statistic is far from the resampled test statistics. Therefore, we might get zero adjusted P-values using these two resampling methods.

Based on the formula of the single step maxT method for calculating adjusted

P-values, we can get the adjusted P-values for all three genes. Table 2.1 summarizes the adjusted P-values obtained from the permutation test, the post-pivot resampling method, and the pre-pivot resampling method for three tomato fruit genes based on

Figure 2.1. Based on the null distribution of max |T| estimated from the permutation test (the rectangles), we can observe that the adjusted P-value for gene 1 is 0.75, since 6 out of 8 max |T| values (rectangles in Figure 2.1) are greater than or equal to |T_1| (dot in Figure 2.1). Similarly, the adjusted P-values for gene 2 and gene 3 are both 0.25 based on the permutation test. Using the post-pivot resampling method, the adjusted

Figure 2.1: Null distribution of max_{i=1,2,3} |T_i| for k = 3 and n = 3. Observed test statistics and resampled test statistics from the permutation test, post-pivot resampling, and pre-pivot resampling methods.

P-value for gene 1 is 0.30, since 3 out of 10 max |T| values (diamonds in Figure 2.1) are greater than or equal to |T_1| (dot in Figure 2.1). For gene 2 and gene 3, however, there is no resampled max |T| value from the post-pivot resampling method that is greater than or equal to either |T_2| or |T_3| (dots in Figure 2.1). Thus, the adjusted P-values for gene 2 and gene 3 are both zero using the post-pivot resampling method.

We obtain the same adjusted P-values from the pre-pivot resampling method as from the post-pivot resampling method for all three fruit genes.

Table 2.1: Adjusted P-values calculated from formula (2.1) for the permutation test, post-pivot resampling method and pre-pivot resampling method

         Permutation   Post-pivot resampling   Pre-pivot resampling
gene 1   6/8 = 0.75    3/10 = 0.30             3/10 = 0.30
gene 2   2/8 = 0.25    0/10 = 0                0/10 = 0
gene 3   2/8 = 0.25    0/10 = 0                0/10 = 0

Strikingly, for matched pairs, the permuted test statistics (unstandardized or standardized) under complete enumeration always have a mean of zero. The reason is that each sample in a pair can be assigned either zero or one as its group label. When the labels are switched, the signs of the test statistics are also switched. Thus, the positive signs and negative signs cancel each other out, so the mean of all permuted test statistics is equal to zero. For standardized test statistics, since the MSEs are the same for paired permuted samples when the labels are switched, the mean of all permuted test statistics is also zero.
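A short check of this sign-flip cancellation for one gene (a minimal sketch; the paired differences are arbitrary):

```python
import itertools
import numpy as np

d = np.array([0.7, -0.2, 1.1])     # observed paired differences (illustrative)
# Switching a pair's labels flips the sign of its difference.
stats = [np.mean(s * d) for s in itertools.product([1, -1], repeat=len(d))]
print(len(stats), np.mean(stats))  # 2^3 = 8 permuted statistics, mean 0
```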

2.2 Conditions for getting adjusted P-values of zero using the post-pivot resampling method

The tomato microarray example suggests that P-values of zero may occur often even after multiplicity adjustment. Therefore, we need to explore the conditions for getting an adjusted P-value of zero using the post-pivot and pre-pivot resampling methods for paired samples with small sample sizes (2 or 3 each).

2.2.1 Conditions for getting adjusted P-values of zero with a sample size of two

To expand the three genes in our tomato microarray example to k genes, let X_ij (i = 1, 2, ..., k; j = 1, 2, ..., n) denote the gene expression levels for the ith gene, jth sample in the control group, and Y_ij (i = 1, 2, ..., k; j = 1, 2, ..., n) denote the gene expression levels for the ith gene, jth sample in the mutant group. For the ith gene, X_ij ~ i.i.d. F_Xi and Y_ij ~ i.i.d. F_Yi.

Assume d_ij = x_ij − y_ij are the observed paired differences for the ith gene in the jth paired sample. We wish to determine which genes are differentially expressed among those k genes by testing the k null hypotheses H_0 : θ_i = 0 (i = 1, ..., k) using the test statistics T_i = d̄_i.

When the sample size n is two, the observed differences are d_ij = x_ij − y_ij (i = 1, 2, ..., k and j = 1, 2). For the first two genes, we have the following observation matrix:

[ d_11  d_12 ]
[ d_21  d_22 ]

The observed test statistics are T_1 = (d_11 + d_12)/2 and T_2 = (d_21 + d_22)/2. Using the post-pivot resampling method, the resampled test statistic matrix is:

[ d_11  (d_11 + d_12)/2  (d_11 + d_12)/2  d_12 ]
[ d_21  (d_21 + d_22)/2  (d_21 + d_22)/2  d_22 ]

We can get the following matrix after subtracting the average in each row:

[ (d_11 − d_12)/2  0  0  (d_12 − d_11)/2 ]
[ (d_21 − d_22)/2  0  0  (d_22 − d_21)/2 ]

To get a raw P-value of zero for the first gene, we need to have

|(d_11 + d_12)/2| > |(d_11 − d_12)/2|  and  |(d_11 + d_12)/2| > 0.

Similarly, to get a raw P-value of zero for the second gene, we need

|(d_21 + d_22)/2| > |(d_21 − d_22)/2|  and  |(d_21 + d_22)/2| > 0.

Therefore, the necessary and sufficient conditions for getting a raw P-value of zero for the ith gene (i = 1, 2) are:

either  d_i1 > 0 and d_i2 > 0,
or      d_i1 < 0 and d_i2 < 0.

Using the single step maxT method, the necessary and sufficient conditions for getting an adjusted P-value of zero for the first gene are:

either  d_11 > 0,  d_12 > 0,  and  d_11 + d_12 > |d_21 − d_22|,
or      d_11 < 0,  d_12 < 0,  and  d_11 + d_12 < −|d_21 − d_22|.

Similarly, the necessary and sufficient conditions for getting an adjusted P-value of zero for the second gene are:

either  d_21 > 0,  d_22 > 0,  and  d_21 + d_22 > |d_11 − d_12|,
or      d_21 < 0,  d_22 < 0,  and  d_21 + d_22 < −|d_11 − d_12|.

In other words, to have both raw and adjusted P-values of zero with a sample size of two for two genes, the conditions are:

1. To have raw P-values of zero, the necessary and sufficient condition is that both observations are in the same direction (either both are bigger than zero or both are smaller than zero).

2. To have adjusted P-values of zero, the necessary and sufficient conditions are:

(a) Both observations for the same gene are in the same direction.

(b) The sum of the two observations for one gene is either bigger than the absolute difference of the two observations of the other gene (in the positive direction) or smaller than the negative of the absolute difference of the two observations of the other gene (in the negative direction).

If k genes are considered, the necessary and sufficient conditions for the ith gene to have a raw P-value of zero with a sample size of two are:

either  d_i1 > 0 and d_i2 > 0,
or      d_i1 < 0 and d_i2 < 0,

for i = 1, 2, ..., k.

For getting an adjusted P-value of zero for the ith gene with a sample size of two, the necessary and sufficient conditions are:

either  d_i1 > 0,  d_i2 > 0,  and  d_i1 + d_i2 > max_{j≠i, j=1,2,...,k} |d_j1 − d_j2|,
or      d_i1 < 0,  d_i2 < 0,  and  d_i1 + d_i2 < −max_{j≠i, j=1,2,...,k} |d_j1 − d_j2|,

for i = 1, 2, ..., k.

2.2.2 Conditions for getting adjusted P-values of zero with a sample size of three

When the sample size increases from two to three for each group, the observed differences are d_ij = x_ij − y_ij (i = 1, 2, ..., k and j = 1, 2, 3). The observed difference matrix for the first two genes is:

[ d_11  d_12  d_13 ]
[ d_21  d_22  d_23 ]

T_1 = (d_11 + d_12 + d_13)/3 and T_2 = (d_21 + d_22 + d_23)/3 will be our observed test statistics for the first two genes when the sample size is three, and there will be 3 × 3 × 3 = 27 complete bootstrap resampled test statistics. The ten bootstrap resamples that give ten unique test statistic values are:

(1 1 1), (1 1 2), (1 1 3), (1 2 2), (1 2 3), (1 3 3), (2 2 2), (2 2 3), (2 3 3), (3 3 3),

where 1 is the label for the first paired difference, 2 is the label for the second paired difference, and 3 is the label for the third paired difference.

If the bootstrap resamplings all come from the first paired difference, then we will have the following resampled difference matrix for the first two genes:

[ d_11  d_11  d_11 ]
[ d_21  d_21  d_21 ]

The resampled test statistics computed from the above difference matrix are T_{1,b=1} = d_11 and T_{2,b=1} = d_21.

If the bootstrap resamplings include the first paired difference twice and the second paired difference once, then the resampled difference matrix is:

[ d_11  d_11  d_12 ]
[ d_21  d_21  d_22 ]

The resampled test statistics computed from the above difference matrix are T_{1,b=2} = (2d_11 + d_12)/3 and T_{2,b=2} = (2d_21 + d_22)/3.

In the post-pivot resampling method, we subtract the average of all resampled test statistics, which is T_1 = (d_11 + d_12 + d_13)/3 for the first gene and T_2 = (d_21 + d_22 + d_23)/3 for the second gene, from each resampled test statistic to get the reference distribution Z_b for both genes:

[ (2d_11 − d_12 − d_13)/3  (d_11 − d_13)/3  ⋯  0  ⋯  (2d_12 − d_11 − d_13)/3  (d_13 − d_12)/3  ⋯  (2d_13 − d_11 − d_12)/3 ]
[ (2d_21 − d_22 − d_23)/3  (d_21 − d_23)/3  ⋯  0  ⋯  (2d_22 − d_21 − d_23)/3  (d_23 − d_22)/3  ⋯  (2d_23 − d_21 − d_22)/3 ]

According to the formula for calculating raw P-values, if all |Z_{1,b}| < |T_1|, the raw P-value of the first gene is equal to zero.

To have |Z_{1,b}| < |T_1|, the following relationships need to be satisfied:

|(d_11 − d_13)/3| < |(d_11 + d_12 + d_13)/3|
|(d_11 − d_12)/3| < |(d_11 + d_12 + d_13)/3|
|(d_12 − d_13)/3| < |(d_11 + d_12 + d_13)/3|
|(2d_11 − d_12 − d_13)/3| < |(d_11 + d_12 + d_13)/3|
|(2d_12 − d_11 − d_13)/3| < |(d_11 + d_12 + d_13)/3|
|(2d_13 − d_11 − d_12)/3| < |(d_11 + d_12 + d_13)/3|
0 < |(d_11 + d_12 + d_13)/3|

From the above inequalities, we derive the following necessary and sufficient conditions for the first gene to have a raw P-value of zero:

either
  d_11 > max(−2d_12, −2d_13)
  d_12 > max(−2d_11, −2d_13)
  d_13 > max(−2d_11, −2d_12)
  d_11 + d_12 > d_13/2
  d_11 + d_13 > d_12/2
  d_12 + d_13 > d_11/2
or
  d_11 < 0,  d_12 < 0,  d_13 < 0.

For the second gene, to have |Z_{2,b}| < |T_2|, the following relationships need to be satisfied:

|(d_21 − d_23)/3| < |(d_21 + d_22 + d_23)/3|
|(d_21 − d_22)/3| < |(d_21 + d_22 + d_23)/3|
|(d_22 − d_23)/3| < |(d_21 + d_22 + d_23)/3|
|(2d_21 − d_22 − d_23)/3| < |(d_21 + d_22 + d_23)/3|
|(2d_22 − d_21 − d_23)/3| < |(d_21 + d_22 + d_23)/3|
|(2d_23 − d_21 − d_22)/3| < |(d_21 + d_22 + d_23)/3|
0 < |(d_21 + d_22 + d_23)/3|

From the above inequalities, the necessary and sufficient conditions for the second gene to have a raw P-value of zero are:

either
  d_21 > max(−2d_22, −2d_23)
  d_22 > max(−2d_21, −2d_23)
  d_23 > max(−2d_21, −2d_22)
  d_21 + d_22 > d_23/2
  d_21 + d_23 > d_22/2
  d_22 + d_23 > d_21/2
or
  d_21 < 0,  d_22 < 0,  d_23 < 0.

If we expand the two-gene case to the k-gene case, the necessary and sufficient conditions for the ith gene to have a raw P-value of zero using the post-pivot resampling method are:

either
  d_i1 > max(−2d_i2, −2d_i3)
  d_i2 > max(−2d_i1, −2d_i3)
  d_i3 > max(−2d_i1, −2d_i2)
  d_i1 + d_i2 > d_i3/2
  d_i1 + d_i3 > d_i2/2
  d_i2 + d_i3 > d_i1/2
or
  d_i1 < 0,  d_i2 < 0,  d_i3 < 0,

for i = 1, 2, ..., k.

To have an adjusted P-value of zero for the first gene when we only have two genes, the following relationships need to be satisfied:

max(|(d_11 − d_13)/3|, |(d_21 − d_23)/3|) < |(d_11 + d_12 + d_13)/3|
max(|(d_11 − d_12)/3|, |(d_21 − d_22)/3|) < |(d_11 + d_12 + d_13)/3|
max(|(d_12 − d_13)/3|, |(d_22 − d_23)/3|) < |(d_11 + d_12 + d_13)/3|
max(|(2d_11 − d_12 − d_13)/3|, |(2d_21 − d_22 − d_23)/3|) < |(d_11 + d_12 + d_13)/3|
max(|(2d_12 − d_11 − d_13)/3|, |(2d_22 − d_21 − d_23)/3|) < |(d_11 + d_12 + d_13)/3|
max(|(2d_13 − d_11 − d_12)/3|, |(2d_23 − d_21 − d_22)/3|) < |(d_11 + d_12 + d_13)/3|
0 < |(d_11 + d_12 + d_13)/3|

The above inequalities give us the following necessary and sufficient conditions for getting an adjusted P-value of zero for the first gene:

either
  d_11 > max(−2d_12, −2d_13)
  d_12 > max(−2d_11, −2d_13)
  d_13 > max(−2d_11, −2d_12)
  d_11 + d_12 > d_13/2
  d_11 + d_13 > d_12/2
  d_12 + d_13 > d_11/2
  d_11 + d_12 + d_13 > max(|d_21 − d_23| + |d_21 − d_22|, |d_21 − d_22| + |d_22 − d_23|, |d_21 − d_23| + |d_22 − d_23|)
or
  d_11 < 0,  d_12 < 0,  d_13 < 0,
  d_11 + d_12 + d_13 < −max(|d_21 − d_23| + |d_21 − d_22|, |d_21 − d_22| + |d_22 − d_23|, |d_21 − d_23| + |d_22 − d_23|).

If we have k genes instead of two genes, we need to solve the following inequalities to get an adjusted P-value of zero for the ith gene (i = 1, ..., k):

max_{i=1,...,k} |(d_i1 − d_i3)/3| < |(d_i1 + d_i2 + d_i3)/3|
max_{i=1,...,k} |(d_i1 − d_i2)/3| < |(d_i1 + d_i2 + d_i3)/3|
max_{i=1,...,k} |(d_i2 − d_i3)/3| < |(d_i1 + d_i2 + d_i3)/3|
max_{i=1,...,k} |(2d_i1 − d_i2 − d_i3)/3| < |(d_i1 + d_i2 + d_i3)/3|
max_{i=1,...,k} |(2d_i2 − d_i1 − d_i3)/3| < |(d_i1 + d_i2 + d_i3)/3|
max_{i=1,...,k} |(2d_i3 − d_i1 − d_i2)/3| < |(d_i1 + d_i2 + d_i3)/3|
0 < |(d_i1 + d_i2 + d_i3)/3|

The following necessary and sufficient conditions are derived for getting an adjusted P-value of zero for the ith gene when the sample size is three in each group.

Either
  d_i1 > max(−2d_i2, −2d_i3)
  d_i2 > max(−2d_i1, −2d_i3)
  d_i3 > max(−2d_i1, −2d_i2)
  d_i1 + d_i2 > d_i3/2
  d_i1 + d_i3 > d_i2/2
  d_i2 + d_i3 > d_i1/2
  d_i1 + d_i2 + d_i3 > max_{l≠i, l=1,2,...,k}(|d_l1 − d_l3| + |d_l1 − d_l2|, |d_l1 − d_l2| + |d_l2 − d_l3|, |d_l1 − d_l3| + |d_l2 − d_l3|)
or
  d_i1 < 0,  d_i2 < 0,  d_i3 < 0,
  d_i1 + d_i2 + d_i3 < −max_{l≠i, l=1,2,...,k}(|d_l1 − d_l3| + |d_l1 − d_l2|, |d_l1 − d_l2| + |d_l2 − d_l3|, |d_l1 − d_l3| + |d_l2 − d_l3|),

for i = 1, 2, ..., k.

2.3 Conditions for getting adjusted P-values of zero using the pre-pivot resampling method

For paired data, a two-group comparison is equivalent to a one-sample problem. The pre-pivot resampling method subtracts the difference of the two groups' means first, and then resamples the residuals with replacement for paired data. Since (x_i − x̄) − (y_i − ȳ) = (x_i − y_i) − (x̄ − ȳ), the test statistic's null distribution estimated by the pre-pivot resampling method is the same as that estimated by the post-pivot resampling method for paired data, as shown below.

With a sample size of n, the observed test statistic for the ith gene will be d̄_i = (d_i1 + d_i2 + ⋯ + d_in)/n. For the post-pivot resampling method, there are ten unique bootstrap test statistics calculated from the resamples for each gene when the sample size is three (n = 3). The bootstrap resampled test statistic matrix T_b is:

[ d_11  (2d_11 + d_12)/3  ⋯  (d_11 + d_12 + d_13)/3  ⋯  d_12  (d_11 + 2d_13)/3  ⋯  d_13 ]
[ d_21  (2d_21 + d_22)/3  ⋯  (d_21 + d_22 + d_23)/3  ⋯  d_22  (d_21 + 2d_23)/3  ⋯  d_23 ]
[  ⋮           ⋮                      ⋮                   ⋮           ⋮               ⋮  ]
[ d_k1  (2d_k1 + d_k2)/3  ⋯  (d_k1 + d_k2 + d_k3)/3  ⋯  d_k2  (d_k1 + 2d_k3)/3  ⋯  d_k3 ]

The estimated mean vector Ê(T_b) is

[ (d_11 + d_12 + d_13)/3 ]
[ (d_21 + d_22 + d_23)/3 ]
[           ⋮            ]
[ (d_k1 + d_k2 + d_k3)/3 ]

The estimated null distribution matrix Z_b is

[ (2d_11 − d_12 − d_13)/3  (d_11 − d_13)/3  ⋯  0  (2d_12 − d_11 − d_13)/3  (d_13 − d_12)/3  ⋯  (2d_13 − d_11 − d_12)/3 ]
[ (2d_21 − d_22 − d_23)/3  (d_21 − d_23)/3  ⋯  0  (2d_22 − d_21 − d_23)/3  (d_23 − d_22)/3  ⋯  (2d_23 − d_21 − d_22)/3 ]
[           ⋮                      ⋮           ⋮             ⋮                     ⋮                       ⋮           ]
[ (2d_k1 − d_k2 − d_k3)/3  (d_k1 − d_k3)/3  ⋯  0  (2d_k2 − d_k1 − d_k3)/3  (d_k3 − d_k2)/3  ⋯  (2d_k3 − d_k1 − d_k2)/3 ]

The residuals for paired data sets using the pre-pivot resampling method are:

[ d_11 − (x̄_1 − ȳ_1)  d_12 − (x̄_1 − ȳ_1)  d_13 − (x̄_1 − ȳ_1) ]
[ d_21 − (x̄_2 − ȳ_2)  d_22 − (x̄_2 − ȳ_2)  d_23 − (x̄_2 − ȳ_2) ]
[          ⋮                    ⋮                    ⋮          ]
[ d_k1 − (x̄_k − ȳ_k)  d_k2 − (x̄_k − ȳ_k)  d_k3 − (x̄_k − ȳ_k) ]

where x̄_i − ȳ_i = (x_i1 + x_i2 + x_i3)/3 − (y_i1 + y_i2 + y_i3)/3 = (d_i1 + d_i2 + d_i3)/3. The number of unique bootstrap resampled test statistics from the pre-pivot resampling method is \binom{n+n-1}{n} = (2n−1)!/(n!(n−1)!), which is the same as that from the post-pivot resampling method, since n data points are resampled from the n paired differences with replacement for both the post-pivot and pre-pivot resampling methods. Therefore, there are ten unique resampled test statistic values for each gene when the sample size n is three. The calculated bootstrap test statistic matrix, which is the estimated null distribution of the test statistics, is the following, since the same equation (x̄_i − ȳ_i = (d_i1 + d_i2 + d_i3)/3) is used in the calculation:

[ (2d_11 − d_12 − d_13)/3  (d_11 − d_13)/3  ⋯  0  (2d_12 − d_11 − d_13)/3  (d_13 − d_12)/3  ⋯  (2d_13 − d_11 − d_12)/3 ]
[ (2d_21 − d_22 − d_23)/3  (d_21 − d_23)/3  ⋯  0  (2d_22 − d_21 − d_23)/3  (d_23 − d_22)/3  ⋯  (2d_23 − d_21 − d_22)/3 ]
[           ⋮                      ⋮           ⋮             ⋮                     ⋮                       ⋮           ]
[ (2d_k1 − d_k2 − d_k3)/3  (d_k1 − d_k3)/3  ⋯  0  (2d_k2 − d_k1 − d_k3)/3  (d_k3 − d_k2)/3  ⋯  (2d_k3 − d_k1 − d_k2)/3 ]

As shown, the estimated test statistic's null distribution matrix from the pre-pivot resampling method is exactly the same as that from the post-pivot resampling method. The relationship still holds when the sample size is n and the equation x̄_i − ȳ_i = (x_i1 + ⋯ + x_in)/n − (y_i1 + ⋯ + y_in)/n = (d_i1 + ⋯ + d_in)/n is used in the null distribution calculation. Therefore, the necessary and sufficient conditions for obtaining adjusted P-values of zero from the pre-pivot resampling method are exactly the same as those from the post-pivot resampling method.

2.4 Discreteness of resampled test statistics' distributions

For the same data set, while the permutation test can give large P-values, the post-pivot and pre-pivot resampling methods may give P-values of zero even after the multiplicity adjustment when the sample size is small. This contradictory behavior results from the discreteness of the test statistics' null distributions estimated by these three resampling methods for small-sample data. The discreteness can be represented by the maximum number of unique resampled test statistic values, which will be discussed and compared for the three resampling methods in different situations.

2.4.1 Paired samples

The simplest general linear model is the one with observations of paired differences, which can be written as

y_i = µ + ǫ_i,  i = 1, 2, ..., n,  (2.2)

where y_i denotes the ith paired difference, µ denotes the overall mean, and the residuals ǫ_i are independent and identically distributed with an arbitrary distribution F with mean zero. In this case, we will test the null hypothesis H_0 : µ = 0 using the test statistic T = ȳ. The maximum number of unique resampled test statistic values in each resampling method is given in Theorem 2.1.

Theorem 2.1. In the case of one sample with a sample size of n, the maximum numbers of unique resampled test statistic values in the permutation test, the post-pivot resampling method, and the pre-pivot resampling method are 2^n, \binom{2n-1}{n}, and \binom{2n-1}{n}, respectively.

Proof (Theorem 2.1). In the case of paired samples, the permutation test permutes the labels of the paired data. We only need to consider the possible labels of one sample, since the labels of the paired samples are fixed once the labels of one sample are known.

Since each sample in one group can be labeled as either 0 or 1, there will be

2 × 2 × ⋯ × 2 (n factors) = 2^n  (2.3)

possible permuted labels for n samples, and the maximum number of unique permuted test statistics will be 2^n for the permutation test.

For the post-pivot resampling method, we will resample n data points from the n paired differences with replacement. However, we do not need to consider the order, since different orders give the same test statistic values. Therefore, there will be

\binom{n+n-1}{n} = \binom{2n-1}{n}  (2.4)

possible arrangements. Each arrangement corresponds to one test statistic value, so the maximum number of unique test statistic values for the post-pivot resampling method will be \binom{2n-1}{n}.

For the pre-pivot resampling method, we will resample n residuals from the n residuals with replacement, without considering the order, under the null hypothesis. Therefore, similar to the post-pivot resampling method, there will be

\binom{n+n-1}{n} = \binom{2n-1}{n}  (2.5)

possible arrangements. Thus, the maximum number of unique test statistic values for the pre-pivot resampling method is also \binom{2n-1}{n}.

2.4.2 Two independent samples

The case of two independent samples is common in the general linear model setting. A simple model for two independent samples can be written as

y_ij = µ + τ_i + ǫ_ij,  i = 1, 2;  j = 1, ..., n_i  (n_1 = m, n_2 = n),  (2.6)

where y_ij denotes the observed value for the ith treatment, jth replicate; µ denotes the overall mean; τ_i denotes the ith treatment effect; and the residuals ǫ_ij are independent and identically distributed with an arbitrary distribution F with E(ǫ_ij) = 0.

To test the null hypothesis H_0 : τ_1 − τ_2 = 0, the test statistic T = ȳ_1· − ȳ_2· will be used, whose null distribution will be estimated by the three resampling methods. Theorem 2.2 gives the maximum number of unique resampled test statistic values in the permutation test, post-pivot resampling, and pre-pivot resampling methods.

Theorem 2.2. In the case of two independent samples with a sample size of m in one group and a sample size of n in the other group, the maximum numbers of unique resampled test statistic values in the permutation test, the post-pivot resampling method, and the pre-pivot resampling method are:

1. The permutation test:

\binom{m+n}{m}  (2.7)

2. The post-pivot resampling method:

\binom{2m-1}{m} × \binom{2n-1}{n}  (2.8)

3. The pre-pivot resampling method:

\binom{2m+n-1}{m} × \binom{m+2n-1}{n} − (m + n) + 1  if m ≠ n  (2.9)

\binom{3n-1}{n} × \binom{3n-1}{n} − \binom{3n-1}{n} + 1  if m = n.  (2.10)

Proof (Theorem 2.2). In the case of two independent samples, the permutation test permutes the labels (0s and 1s) among the (m + n) samples under the null hypothesis. It can be considered as choosing m samples from the (m + n) pooled samples without considering their orders, and assigning 0's to the chosen m samples and 1's to the remaining n samples, or vice versa. Therefore, there will be

\binom{m+n}{m} = \binom{m+n}{n}  (2.11)

unique permutations, which will give at most \binom{m+n}{m} unique resampled test statistic values.

For the post-pivot resampling method, we will resample m samples with replacement from the m samples within one treatment, and n samples with replacement from the n samples within the other treatment. The possible arrangements are

\binom{m+m-1}{m} = \binom{2m-1}{m}  and  (2.12)

\binom{n+n-1}{n} = \binom{2n-1}{n}.  (2.13)

The fundamental counting theorem states: if there are a ways of choosing one thing, b ways of choosing a second after the first is chosen, ..., and z ways of choosing the last item after the earlier choices, then the total number of choice patterns is a × b × c × ⋯ × z. Since we do not need to consider the order of the resampled observations, based on the fundamental counting theorem, there will be a maximum of

\binom{m+m-1}{m} × \binom{n+n-1}{n} = \binom{2m-1}{m} × \binom{2n-1}{n}  (2.14)

unique test statistic values for the post-pivot resampling method.

For the pre-pivot resampling method, we will resample (m + n) residuals, since the residuals from the two treatment groups can be combined together. We will then resample the groups with replacement under the assumption that the distributions of the residuals are the same for both treatment groups. Although the residuals from the two treatment groups are combined, we can still obtain the possible arrangements within each treatment group and combine them later. The maximum number of unique resampled test statistic values can be calculated by the following steps.

(1) Sample m data points from the m + n residuals with replacement, without considering the order, to get \binom{m+m+n-1}{m} possible arrangements for one treatment group.

(2) Sample n data points from the m + n residuals with replacement, without considering their order, to get \binom{n+m+n-1}{n} possible arrangements for the other treatment group.

(3) Based on the fundamental counting theorem, there will be

\binom{m+m+n-1}{m} × \binom{n+m+n-1}{n} = \binom{2m+n-1}{m} × \binom{m+2n-1}{n}

possible arrangements for calculating the test statistic values.

(4) If m ≠ n, there will be (m + n) arrangements giving 0-valued resampled test statistics. Therefore, there are (m + n) − 1 redundant 0-valued resampled test statistics from step (3). If m = n, the number of redundant 0-valued resampled test statistics is \binom{3n-1}{n} − 1.

(5) The maximum number of unique resampled test statistic values for the pre-pivot resampling method will therefore be

\binom{2m+n-1}{m} × \binom{m+2n-1}{n} − (m + n) + 1  if m ≠ n;  (2.15)

\binom{3n-1}{n} × \binom{3n-1}{n} − \binom{3n-1}{n} + 1  if m = n.  (2.16)

2.4.3 Multiple independent samples

In this section, we will extend the case of two samples to the case of multiple independent samples. In the case of multiple independent samples, there are multiple treatment groups instead of two treatment groups. A simple multiple independent samples model can be written as

y_ij = µ + τ_i + ǫ_ij,  i = 1, ..., I;  j = 1, ..., n_i.  (2.17)

In the above model, y_ij denotes the observed value for the ith treatment and the jth replicate, µ denotes the overall mean, τ_i denotes the ith treatment effect, and the ǫ_ij are independent and identically distributed with an arbitrary distribution F with E(ǫ_ij) = 0.

To test the null hypothesis H_0 : C′β = 0 (where β = (µ, τ_1, ..., τ_I)′), we use the test statistic T = C′β⁰, where C′β (C is a contrast vector) is an estimable function of β, and β⁰ is a generalized least squares estimator. In the case of multiple independent samples, Theorem 2.3 gives the maximum number of unique resampled test statistic values using the permutation test, the post-pivot resampling method, and the pre-pivot resampling method.

Theorem 2.3. In the case of multiple independent samples, assume the number of

non-zero values in vector C is t (t I) and the sample sizes corresponding to these ≤

t treatments are n[1], n[2], ..., and n[t] respectively. Then, the maximum number of

unique resampled test statistic values in the permutation test, the post-pivot resampling

method and the pre-pivot resampling method are shown as follows:

1. The permutation test:

t n t n t n i=1 [i] i=2 [i] i=t−1 [i] (2.18) n n ··· n P [1] P [2]  P [t−1]  2. The post-pivot resampling method:

2n 1 2n 1 2n 1 [1] − [2] − . . . [t] − (2.19) n n n  [1]  [2]   [t]  3. The pre-pivot resampling method:

n + t n n + t n n + t n t [1] i=1 [i] [2] i=1 [i] [t] i=1 [i] n +1 (2.20) n n ··· n − [i] [1] [2] [t] i=1  P  P   P  X if no two n[i]’s is the same; (t + 1)n 1 t (t + 1)n 1 − − +1 if n = n. (2.21) n − n [i]    

37 Proof. (Theorem 2.3). In the case of multiple independent samples, the permutation

t test will permute all the t labels among i=1 n[i] samples under the null hypothesis.

Each time we do a permutation, we willP choose n[i] (i =1, ..., t) samples from the

t i=1 n[i] pooled samples without considering their order. According to the funda- mentalP counting theorem, there will be

t n t n t n i=1 [i] i=2 [i] . . . i=t−1 [i] (2.22) n n n P [1] P [2]  P [t−1]  unique permutations, which will give the above maximum number of unique permuted

test statistic values.

For the post-pivot resampling method, we will resample each n[i] (i =1, ..., t) sample with replacement within each treatment. The possible arrangements in each, within treatment resampling, are

2n 1 [i] − for i =1,...,t (2.23) n  [i]  without considering the sample order.

Based on the fundamental counting theorem, there will be at most

2n 1 2n 1 2n 1 [1] − [2] − . . . [t] − (2.24) n n n  [1]  [2]   [t]  unique resampled test statistic values for the post-pivot resampling method.

t For the pre-pivot resampling method, we will resample i=1(n[i]) residuals with replacement under an assumption that the distributions of rPesiduals are the same for

all treatment groups. The maximum number of unique resampled test statistic values

for the pre-pivot resampling method can be calculated as follows:

t (1) Sample n[i] (i =1, ..., t) data points from the i=1 n[i] residuals with replace-

n +Pt n ment, without considering the order, to get [i] i=1P[i] possible arrangements for n[1] each treatment group. 

38 (2) Based on the fundamental counting theorem, there will be

n + t n n + t n n + t n [1] i=1 [i] [2] i=1 [i] . . . [t] i=1 [i] (2.25) n n n  P[1]  P[2]   P[t]  possible arrangements to calculate the test statistics under the null hypothesis.

(3) For any n = n (1 i = j t), there will be t n arrangements giving [i] 6 [j] ≤ 6 ≤ i=1 [i] test statistics with a value of 0. Therefore, there are Pt (n ) 1 redundant test i=1 [i] − statistics with a value of 0 from step (2). For n[i] = n P(i = 1, ..., t), the number of redundant test statistics with a value of 0 is (t+1)n−1 1. n − (4) Therefore, the maximum number of unique resampled test statistic values will be

n + t n n + t n n + t n t [1] i=1 [i] [2] i=1 [i] . . . [t] i=1 [i] n +1 (2.26) n n n − [i] [1] [2] [t] i=1  P  P   P  X if no two n[i]’s is the same; (t + 1)n 1 t (t + 1)n 1 − − +1 if n = n. (2.27) n − n [i]     for the pre-pivot resampling method.

2.4.4 General linear mixed-effects models

The general linear mixed-effects models are widely used for analyzing data with repeated measurements. Since customized microarray experiments, designed accord- ing to statistical principles, have become more popular, general linear mixed-effects models have been applied to microarray data analysis (Pan et al. (2003)). A standard general linear mixed-effects model can be written as

y = Xββ + Zγγ + ǫǫ, (2.28) where y is the vector of observations, X and Z are the design matrix for the fixed effects β and the random effects γ, respectively. The residuals ǫ are assumed to have

39 a mean of zero and R. The random effects γ are assumed to have a mean of zero and covariance matrix G. Under the above assumptions, the vector y will have a mean of Xββ and the covariance matrix of y is V = ZGZ ′ + R.

Under the general linear mixed-effects model setting outlined above, we are inter-

′ ′ ˆ ested in testing the null hypothesis H0 : C β = 0 using the test statistics T = C βˆ, where C ′β is an estimable function of β. βˆ denotes an estimator of β (such as the gen- eralized least square estimator, the maximum likelihood estimator, or the restricted maximum likelihood estimator).γ ˆ denotes an unbiased predictor of γ. By solving

Henderson’s equations, we can get the best linear unbiased estimator

(BLUE) of fixed effects β, and the best linear unbiased predictor (BLUP) of random effects γ, shown as follows:

′ ′ βˆ =(X V −1X)−X V −1y (2.29)

′ γˆ = GZ V −1(y Xββ) (2.30) −

A simple example of the linear mixed-effects model with two treatments in a balanced design is showed as follows:

yijk = µ + τi + γj(i) + ǫijk, (2.31)

i =1, 2; j =1, ,J; k =1, , n. (2.32) ··· ···

In the above example, yijk denotes the observation from the ith treatment, the jth subject, and the kth replicate; µ denotes the overall mean; τi denotes the ith treatment effect; γj(i) denotes the random effect from the jth subject nested in the ith treatment; and ǫijk is the random error associated with yijk.

40 Assume that τ1 + τ2 = 0, and γj(i) and ǫijk are mutually independent with

γ i.i.d. N(0, σ2) and (2.33) j(i) ∼ γ ǫ i.i.d. N(0, σ2). (2.34) ijk ∼ ǫ

2 2 Under these assumptions, the unbiased for unknown σγ and σǫ can be derived using the ANOVA method. These estimators are:

2 J n 2 (yijk y¯ij.) σˆ2 = MSE = i=1 j=1 k=1 − and (2.35) ǫ 2J(n 1) P P P − 2 J 2 2 J n 2 2 MSB MSE i=1 j=1(¯yij. y¯i..) i=1 j=1 k=1(yijk y¯ij.) σˆγ = − = − − . n P P2n(J 1) − P P 2JnP(n 1) − − (2.36)

Simple calculations show that the BLUE of µ and τi, the BLUPs of γj(i), and ǫijk

are

µˆ =y ¯...; (2.37)

τˆ =y ¯ y¯ for i =1, 2; (2.38) i i.. − ... 2 nσγ γˆj(i) = 2 2 (¯yij. y¯i..) for i =1, 2; and j =1, ,J; (2.39) nσγ + σǫ − ··· ǫˆ = y y¯ for i =1, 2; j =1, ,J; and k =1, , n. (2.40) ijk ijk − ij. ··· ···

To test the null hypothesis H : τ τ = 0, we will use the test statistics 0 1 − 2 T =τ ˆ τˆ = Y¯ Y¯ . 1 − 2 1.. − 2.. To get the resampled test statistic values, the permutation test will treat the

observations from the same subject as one resampling unit. By resampling without

replacement among those 2J i.i.d. resampling units, the post-pivot resampling method

will have the same resampling unit as the permutation test. Finally, we will resample

the J i.i.d. resampling units with replacement within each treatment group.

41 The pre-pivot resampling method will calculate the resampled test statistic values

according to the following several steps:

2 2 (1) Get the BLUEs of µ and τi, the unbiased estimators of σγ and σǫ , and the

BLUPs of γj(i) and ǫijk.

(2) Resample the predicted random effectsγ ˆj(i) and the predicted residualsǫ ˆijk

∗ from the jth subject across all treatments with replacement to get the resampledγ ˆj(i)

∗ andǫ ˆijk.

∗ ∗ ∗ ∗ (3) Calculate yijk by yijk =µ ˆ +ˆτi +ˆγj(i) +ˆǫijk.

∗ ∗ ∗ (4) Get the BLUE ofµ ˆ andτ ˆi from the resampled yijk. (5) Get the resampled test statistic T ∗ =τ ˆ∗ τˆ∗ = Y¯ ∗ Y¯ ∗ in the linear mixed- 1 − 2 1.. − 2.. effects model.

Theorem 2.4. In the above balanced, simple, linear mixed-effects model with two

treatments, the maximum number of unique resampled test statistic values in the per-

mutation test, the post-pivot resampling method and the pre-pivot resampling method

are:

1. The permutation test: 2J (2.41) J   2. The post-pivot resampling method:

2J 1 2 − (2.42) J   3. The pre-pivot resampling method:

4J 1 3nJ 1 2 3nJ 1 − − 2J − +1. (2.43) 2J nJ − nJ     

42 Proof. (Theorem 2.4). In the balanced linear mixed effects model setting, the per- mutation test can only permute the subjects since all observations from the same subject are correlated. Therefore, there will be

J + J 2J = (2.44) J J    

2J unique permutations so that no more than J of the maximum number of unique resampled test statistic values. 

For the post-pivot resampling method, we will resample subjects with replacement within each treatment.

The possible arrangements in each within-treatment resampling are

J + J 1 2J 1 − = − (2.45) J J     without considering the order of those resampled subjects.

Based on the fundamental counting theorem, there will be at most

2J 1 2J 1 2J 1 2 − − = − (2.46) J J J      unique resampled test statistics for the post-pivot resampling method.

For the pre-pivot resampling method, we will resample the predicted values of

γˆj(i) andǫ ˆijk across the treatments with replacement, under the assumption of the balanced simple linear mixed effects model.

The maximum number of unique resampled test statistic values from the pre-pivot resampling method can be calculated with the following steps.

2J+2J−1 4J−1 (1) First, resample 2J predicted values with replacement to get 2J = 2J possible arrangements without considering the order.  

43 (2) Next, resample nJ predicted residualsǫ ˆijk from the pooled 2nJ predicted residuals with replacement for each treatment group to get

nJ +2nJ 1 3nJ 1 − = − (2.47) nJ nJ     possible arrangements for each treatment group.

(3) Based on the fundamental counting theorem, there will be

4J 1 3nJ 1 3nJ 1 4J 1 3nJ 1 2 − − − = − − (2.48) 2J nJ nJ 2J nJ        possible arrangements for calculating the test statistic values under the null hypoth-

esis.

(4) The test statistic value will be equal to zero when both treatment groups have

the same resampled predicted random effects and predicted residuals. In total, there

3nJ−1 are 2J nJ test statistic values of zero given by all possible arrangements. Thus, there will be 2J 3nJ−1 1 redundant test statistic values of zero in the resampling nJ − process. 

(5) Therefore, the maximum number of unique resampled test statistic values will

be 4J 1 3nJ 1 2 3nJ 1 − − 2J − +1 (2.49) 2J nJ − nJ      for the pre-pivot resampling method.

As shown, the pre-pivot resampling method always gives the largest maximum number of unique resampled test statistic values, following by the post-pivot resam- pling method and the permutation test. As indicated in Table 2.2, the pre-pivot resampling method produces more reliable P-values than the other two resampling methods because the test statistic’s null distribution estimated by the pre-pivot re- sampling method tends to be more close to the true test statistic’s null distribution

44 Table 2.2: Maximum number of unique resampled test statistic values for the permu- tation test, post-pivot resampling method and pre-pivot resampling method Resampling Method m = n =3 m = n =4 Paired samples Permutation 8 16 Post-pivot resampling 10 35 Pre-pivot resampling 10 35 Two independent samples Permutation 20 70 Post-pivot resampling 100 1225 Pre-pivot resampling 3081 108571

than the other two resampling methods with more unique resampled test statistic values.

45 CHAPTER 3

CONDITIONS FOR RESAMPLING METHODS TO CONTROL MULTIPLE TESTING ERROR RATES

Resampling methods are popular tools in hypotheses testing where the (joint or

marginal) distribution of test statistics are unknown. To control multiple testing error

rates at a desired level, resampling methods need to satisfy certain conditions. In this

chapter, we derive the sufficient conditions for controlling the multiple testing error

rates using the permutation test, the post-pivot resampling method or the pre-pivot

resampling method for both two-group comparisons and fixed-effects general linear

models. Throughout our derivation, microarray data analysis is used for illustration.

3.1 Two-group comparison

Suppose that we observe gene expressions in m patients with high risk for breast cancer and n patients with low risk for breast cancer. Let Xl = (Xl1,...,Xlk), l = 1,...,m, denote the observations of k genes of the lth individual from the high

risk group, and let Yj = (Yj1,...,Yjk), j = 1,...,n, denote the observations of the

same k genes of the jth individual from the low risk group. We assume that X i.i.d F i ∼ X and Y i.i.d F , where the F and F are arbitrary multivariate distributions. The j ∼ Y X Y goal is to identify which genes are differentially expressed between the high risk group

46 and the low risk group, i.e., to test the k null hypotheses

H : θ = µ µ =0, i =1,...,k. 0i i Xi − Yi

The test statistics are

T = X¯ Y¯ , i =1,...,k, i i − i 1 m 1 n X¯ = X , Y¯ = Y . i m li i n ji j=1 Xl=1 X 3.1.1 Permutation tests

A step-down permutation test (based on test statistics Tis), that controls FWER,

would proceed as follows (Westfall and Young (1993)).

Suppose the test statistics for these k tests are ordered such that T T | [1]|≤| [2]| ≤ T . ···≤| [k]| Step 1. If T >c , then infer θ = 0 and go to step 2; otherwise stop. | [k]| k [k] 6 Step 2. If T >c , then infer θ = 0 and go to step 3; otherwise stop. | [k−1]| k−1 [k−1] 6

··· Step k: If T >c , then infer θ = 0 and stop; otherwise stop. | [1]| 1 [1] 6

The critical values ck,ck−1,...,c1 can be estimated by the following permutation

test:

For the pth permutation (resample without replacement), p = 1,...,P :

1. Permute the group labels (here we use 0 to denote the low risk group and 1 to

denote the high risk group) of the data vectors X , , X , Y , , Y . 1 ··· m 1 ··· n 2. Compute the test statistics T p ,..., T p based on the permuted data. | [1]| | [k]| 3. Find max T p , max T p , . . . , max( T p , T p ) and T p . i=1,...,k| [i]| i=1,...,k−1| [i]| | [1]| | [2]| | [1]|

47 Repeat the above steps P times, and compute the upper α quantiles of the distri-

butions of the max T p , max T p , . . . , max( T p , T p ) and T p . Set i=1,...,k| [i]| i=1,...,k−1| [i]| | [1]| | [2]| | [1]| them as critical values ck,ck−1,...,c1 respectively.

To control FWER strongly, the permutation distribution of the test statistics T

(T =(T1,...,Tk)) needs to be the same as the true distribution of the test statistics

T under the null hypothesis. Let ka(FX ) and ka(FY ), a = 1, 2, 3,... , denote the cumulants of FX and FY respectively (assuming they exist).

Huang et al. (2006) showed that the permutation distribution and the true distri- bution of the test statistics can be expressed in terms of ka(FX ) and ka(FY ).

Theorem 3.1. (1) The true distribution of the test statistics T = X¯ Y¯ has cumu- − lants

k (T )= m1−ak (F )+( 1)an1−ak (F ). (3.1) a a X − a Y

(2) For a given permutation with r elements relabeled, the distribution (Pr) of the test statistics T r = X¯ r Y¯ r obtained by a permutation has cumulants − 1 ( 1)a k (T r)= k (T ) r( − )(k (F ) k (F )). (3.2) a a − ma − na a X − a Y

Comparing the cumulants in the permutation distribution with the true distribu-

tion, we obtain the following results immediately.

Corollary 3.2. (1) If m = n, i.e., two groups have the same sample size, then the

true and permutation distributions of the test statistics T have the same even order

cumulants.

(2) The true and permutation distributions of the test statistics do not necessarily

have the same odd order cumulants even if m = n, unless ka(FX ) = ka(FY ) for all

odd a’s.

48 Therefore, despite whether the sample sizes for the high risk group and the low risk group are equal or not, the data distributions must have cumulants at the least favorable configuration (LFC) to make the true and permutation distribution of the test statistics the same. The LFC is the configuration where the supremum is taken. A sufficient condition for this to occur is the marginal-determine-the-joint

(MDJ) condition proposed by Xu and Hsu (2007).

The MDJ condition is used to connect the marginal distributions with the joint distribution. In partition testing, the null hypotheses are the equivalence of two marginal distributions as follows:

HP : F = F for i I and F = F for j/ I 0I Xi Yi ∈ Xj 6 Yj ∈ for each I 1,...,k . ⊆{ } The null hypotheses being tested by permutation testing are, however

perm H0I : FXI = FYI ,

where FXI and FYI are the joint distributions of the expression levels of genes with indices in I from the low risk and high risk groups, respectively.

perm P Thus, a level-α test for H0I would be a level-α test for H0I only if the following marginal-determine-the-joint (MDJ) condition holds:

MDJ let I , j = 1,...,n, be any collection of disjointed subsets of 1,...,k , j { } I 1,...,k . If the marginal distributions of the observations are identical for two j ⊆{ }

groups, FXIj = FYIj for all j = 1,...,n, then the joint distributions are identical as

U well, F U = F U where I = I . XI YI ∪j=1,...,n j

49 3.1.2 Post-pivot resampling method

The post-pivot resampling method resamples (with replacement) data within each treatment group instead of across treatment groups, and estimates the null distribu- tion of test statistics by centering (and scaling) the resampled test statistics. A step-down FWER-controlling multiple test procedure proceeds as follows:

Suppose the test statistics are ordered such that T T T . | [1]|≤| [2]|≤···≤| [k]| Step 1. If T >c , then infer θ = 0 and go to step 2; otherwise stop. | [k]| k [k] 6 Step 2. If T >c , then infer θ = 0 and go to step 3; otherwise stop. | [k−1]| k−1 [k−1] 6

··· Step k: If T >c , then infer θ = 0 and stop; otherwise stop. | [1]| 1 [1] 6

The critical values ck,ck−1,...,c1 can be obtained by the following steps.

For the bth bootstrap (resample with replacement), b = 1,...,B:

1. Resample the data vectors X , , X and Y , , Y with replacement 1 ··· m 1 ··· n within the low risk group and high risk group respectively.

b b 2. Compute the test statistics T1 ,...,Tk based on the resampled data. 3. Repeat step 1 and step 2 B (B mmnn) times and get all resampled test ≤ statistics.

b b 4. Center the resampled test statistics T1 ,...,Tk to get centered test statistics Zb,...,Zb (where Zb = T b B (T b)/B for i =1,...,k). 1 k i i − b=1 i 5. Find max Zb , Pmax Zb , . . . , max( Zb , Zb ) and Zb for i=1,...,k| [i]| i=1,...,k−1| [i]| | [1]| | [2]| | [1]| each resample.

6. Compute the upper α quantiles of the distributions of the max Zb , i=1,...,k| [i]| max Zb , . . . , max( Zb , Zb ) and Zb . Set them as the critical values of i=1,...,k−1| [i]| | [1]| | [2]| | [1]| ck,ck−1,...,c1 respectively.

50 The post-pivot step-down FWER-controlling multiple testing procedure controls

FWER asymptotically at level α. It means that, for a sample of size n, the error rate

α has the property limsup α α under the true data generating distribution. n n→∞ n ≤ Let S = j : θ (P )=0 denote a set of true null hypotheses (P denotes the true 0 { j }

data generating distribution), Q denote the true distribution of test statistics (Qn

is an estimate of Q), and Q0 denote the null distribution of test statistics. Given a

vector of cutoff values c, the random variables V (c Q) and R(c Q) are defined as: | |

V (c Q) = I( T >c ) | | jn| j j∈S X0 k R(c Q) = I( T >c ), where T Q. | | jn| j n ∼ j=1 X Let V = V (c Q (P )) denote the number of false positives of the multiple testing n | n procedure and R = R(c Q (P )) denote the total number of rejected null hypotheses n | n in the same multiple testing procedure. For a discrete distribution F on 0,...,k , we { } define a real valued parameter η(F ) (0, 1) to represent a particular multiple testing ∈

error rate, where F represents a candidate for the distribution of Vn. Let FVn denote

the cumulative distribution of the random variable V . We wish to have η(F ) α n Vn ≤ at least asymptotically.

FWER can be written as a function of the distribution of FVn , shown below:

η(F )=1 F (0) = Pr(V > 0). Vn − Vn n

The distance measure we will use between two cumulative distribution functions

F and F on 0,...,k is defined as 1 2 { }

d(F , F )= max F ( j ) F ( j ) . 1 2 j∈{0,...,k}| 1 { } − 2 { } |

51 Similarly, we will test the null hypotheses

H : θ = µ µ =0, i =1,...,k, 0i i Xi − Yi

using the test statistics

T = X¯ Y¯ , i =1,...,k. i i − i

Pollard and van der Laan (2005) proved that the post-pivot step-down multiple testing procedure controls FWER asymptotically at level α if the following assumptions are satisfied.

1. Uniform Continuity: if d(F ,G ) 0, then η(F ) η(G ) 0; n n → n − n →

2. When the centered test statistics is Z D Z, it has a limiting distribution of n → Q N(0, Σ(P )); 0 ≡

3. Let Q be an estimate of Q , define c c(Q , α) and c c(Q , α). c c 0n 0 0n ≡ 0n 0 ≡ 0 0n → 0 in probability for n . →∞ 3.1.3 Pre-pivot resampling method

Under an assumption that the residuals from two groups have the same distribu-

tion, the pre-pivot resampling method resamples (with replacement) residuals of a

model across treatment groups. We add the centered residuals back to get resampled

observations, and recompute the test statistics for each resampled observations to

get the estimated test statistic’s null distribution. Based on the pre-pivot resampling

method, a step-down FWER-controlling multiple test procedure would proceed as

follows:

Suppose that the test statistics are ordered such that T T T . | [1]|≤| [2]|≤···≤| [k]|

52 Step 1. If T >c , then infer θ = 0 and go to step 2; otherwise stop. | [k]| k [k] 6 Step 2. If T >c , then infer θ = 0 and go to step 3; otherwise stop. | [k−1]| k−1 [k−1] 6

··· Step k: If T >c , then infer θ = 0 and stop; otherwise stop. | [1]| 1 [1] 6

The critical values ck,ck−1,...,c1 can be obtained by the following steps.

For the bth bootstrap (resample with replacements), b = 1,...,B:

1. Calculate the residuals for both groups by subtracting the average from each observation within each group for each gene ((X X¯ ,X X¯ ,...,X X¯ ,Y i1 − i i2 − i im − i i1 − Y¯ ,Y Y¯ ,...,Y Y¯ ) for i =1,...,k). i i2 − i in − i 2. Resample the above m + n residual vectors with replacements B (B mmnn) ≤ times.

3. For each gene, check whether the (B(m + n)) resampled residuals have an estimated mean (average) of 0. If not, center the resampled residuals at the estimated mean.

4. For each gene, add the centered residuals back to the original observations to get resampled observations for each of those B resamples.

5. Calculate the test statistics T b∗ (i =1,...,k and b =1,...,B) for each gene | i | within each of the resampled observations.

6. Find max T b∗ , max T b∗ , . . . , max( T b∗ , T b∗ ) and T b∗ for i=1,...,k| [i] | i=1,...,k−1| [i] | | [1] | | [2]| | [1]| each resampled sample.

7. Compute the upper α quantiles of the distributions of the max T b∗ , i=1,...,k| [i] | max T b∗ , . . . , max( T b∗ , T b∗ ) and T b∗ . Set them as the critical values i=1,...,k−1| [i] | | [1]| | [2]| | [1]| ck,ck−1,...,c1 respectively.

53 The pre-pivot step-down FWER-controlling multiple testing procedure also con- trols FWER asymptotically at a level of α under certain conditions. Two-group comparisons are special cases of fixed-effects general linear models. Therefore, for controlling FWER asymptotically at a level of α, the conditions for two-group com- parisons are exactly the same as that for fixed-effects general linear models.

3.2 Fixed-effects general linear model

Consider a fixed-effects general linear model with the form:

y = Xβ + ǫ (3.3)

In this equation, y =(Y ,...,Y )′ is an n 1 data vector of observed responses, and β 1 n × =(β , β ,...,β )′ isa(k+1) 1 vector of unknown parameters that can be estimated 0 1 k × from the data. In addition, X is an n (k +1) data matrix of full rank (k + 1) n × ≤ of known predictors, while ǫ = (ǫ ,...,ǫ )′ is an n 1 random vector of unobserved 1 n × errors. The matrix X can be written as 1 X X 11 ··· 1k 1 X21 X2k X =  . . ···. .  . . . .    1 Xn1 Xnk   ···   ′ Here, the first column of X is the vector 1n = (1,..., 1) so that the first coefficient

β is the intercept. We can write X = (1 ,X ), where X is an n k matrix. There 0 n ∗ ∗ ×

are k coefficients β1,...,βk that correspond to the explanatory variables X1,...,Xk.

We assume E(ǫ)=0, i =1,...,n.

Using the approach, when r(X ′ X)=(k + 1), there is only a unique

estimator of β denoted by

′ ′ βˆ =(X X)−1X y,

54 with a mean of β and a variance-covariance matrix of σ2(X ′ X)−1.

To determine whether those predictors have significant effects on the response variable, we need to test multiple null hypotheses H0i : βi =0(i = 1,...,k). The test statistics are T = βˆ (i =1,...,k). | i| | i| 3.3 Estimating the test statistic’s null distribution

To control the Type I error rate in each partitioning null hypothesis, we need to know the test statistic’s null distribution (distribution of test statistics under the null hypothesis). However, the test statistic’s null distribution is unknown in most cases. Resampling techniques are commonly used to estimate the test statistic’s null distribution in multiple hypotheses testing.

3.3.1 Permutation tests

Permutation tests are popular resampling techniques for hypothesis testing when the distribution of the test statistic is unknown. The conditions for permutation tests to be valid for two-group comparisons were discussed by Xu and Hsu (2007). Com- pared to two-group comparisons, fixed-effects general linear models are more widely used in microarray data analysis. We (Calian et al. (2008)) showed, for the first time, that the test statistic’s distribution estimated by permutation tests is identical to the test statistic’s true distribution under a typical partition null hypothesis. For the

fixed-effects general linear model (3.3), the permutation test estimates the distribu- tion of the test statistic max T = max βˆ under a typical partition null γ=1,...,t| γ| γ=1,...,t| γ| hypothesis H0{12···t} by the following steps:

′ Step 1. Permute y =(Y1,...,Yn) data vector to get a permuted response vector

p p p ′ y =(Y1 ,...,Yn ) ;

55 ˆp ˆp ˆp ′ ′ −1 ′ p Step 2. Calculate permuted test statistics βI =(β1 ,..., βt ) based on (X X) X y ; Step 3. Repeat steps 1 and 2 for all n! possible permutations;

Step 4. Get the joint permutation distribution of βˆ by combining all n! joint ˆp permutation distributions of βI for p =1,...,n!; Step 5. Find the permutation distribution of max T = max βˆ γ=1,...,t| γ| γ=1,...,t| γ| based on the joint permutation distribution of βˆ ,..., βˆ . { 1 t} As shown in theorem 3.3, if the errors in the model (3.3) are i.i.d. distributed and the test statistics are ordinary least squares estimates, the test statistic max T ’s γ=1,...,t| γ | distribution estimated by permutation tests is identical to its true distribution under a typical partitioning null hypothesis H0{12···t}

Theorem 3.3. In the fixed-effects general linear model (3.3), assume

(1) The errors ǫ , , ǫ are i.i.d., and 1 ··· n

(2) The test statistics are simply the ordinary least squares estimates (OLS), Ti = βˆi,

i =1,...,k.

Then, ˆ ˆ (a) Average over all permutations of Y , one has β0 = Y¯ , βi =0, i =1,...,k;

(b) The test statistic max T ’s distribution estimated by permutation tests is γ=1,...,t| γ|

identical to its true distribution under a typical partitioning null hypothesis H0{12···t}.

¯ ¯ n Proof. First we show that the average of all permuted Y is Y 11, where Y = (1/n) i=1 Yi and 11=(1,..., 1)′ . P

Among all the permuted Y , those with the i-th element being y appear (n 1)! 1 − times, and those with the i-th element being y appear (n 1)! times, etc. Thus, the 2 −

56 sum of all permutations of Y is

n 1 . (n 1)!( yi) . −   i=1 1 X     Since there are n! permutations, the average of the permuted Y is Y¯11.

For any Y , βˆ is the projection of Y onto the subspace spanned by columns of the design matrix X. Because X remains fixed, the average of this projection of the permuted Y is the same as the projection of the average of the permuted Y . From the above results, the average of the permuted Y is Y¯11. For such a data vector, with an intercept term in the model, βˆ0 = Y¯ , βˆi = 0, i =1,...,k, proving (a).

For any data vector, the variance-covariance matrix (second order cumulant) of

βˆ is σ2(X ′ X)−1. Therefore, the true distribution and the permutation distribution

have the same variance-covariance matrix.

For higher order cumulants, the following three properties about cumulants (p1,

p2, and p3) will be used in the proof of Theorem 3.3.

(p1) Multilinearity of cumulants: for any two random vectors Z, W and any

matrix M so that Z = MW, the components of the cumulants ka(Z) and ka(W)

(which are, in general, tensors of order a) are related by:

(ka(Z))j1j2...ja = Mj1l1 Mj2l2 ...Mjala (ka(W))l1l2...la l1Xl2...la (p2) Any permutation matrix A(P ) satisfies:

A(p)(A(p))T =(A(p))T A(p) = I

(p3) When the errors are i.i.d., the joint distribution of (Y1,...,Yn) has higher

order (a 2) cumulants which are diagonal tensors: k (Y) k (Y ,...,Y ) ≥ a j1j2...ja ≡ a j1 j1 57 with non-zero elements k (Y ) k (y ). Only when the errors are also identically a jj...j ≡ a j distributed, write k (Y) = k (ǫ) k (y). a jj...j a ≡ a Since βˆ = ΩY with Ω (X ′ X)−1X ′ , using (p1), we find that the true distribution ≡ of βˆ has the following cumulants:

ˆ (ka(β))i1i2...ia = Ωi1j1 Ωi2j2 . . . Ωiaja (ka(Y))j1j2...ja j j ...j 1X2 a which are, for independent errors:

ˆ (ka(β))i1i2...ia = Ωi1jΩi2j . . . Ωiajka(yj) j X and ˆ (ka(β))i1i2...ia = ka(y) Ωi1jΩi2j . . . Ωiaj j X if errors are also identically distributed. ˆ Let βpermute denote the permutation distribution of the test statistic. At any ˆ(p) given permutation, the cumulants of the distribution of βpermute defined in terms of a permutation matrix A(p) are:

ˆ(p) (p) βpermute = Ω(A Y).

Using (p1), one can write:

ˆ(p) (p) (p) (ka(βpermuted))i1i2...ia = Ωi1j1 Ωi2j2 . . . Ωiaja Aj1l1 ...Ajala (ka(Y))l1l2...la . j j ...j 1X2 a l1Xl2...la When errors are independent, this gives:

ˆ(p) ka(βpermuted))i1i2...ia = Ωi1j(p)Ωi2j(p) . . . Ωiaj(p)k(yl(p)) (j(Xp),l(p)) where (j(p), l(p)) are all pairs that correspond to A(p) =00 for any permutation matrix 6 A(p). As shown in the following, we obtain common higher order cumulants only when the errors are also identically distributed.

58 From (p3), we have:

ˆ(p) (p) (p) ka(βpermuted))i1i2...ia = Ωi1j1 . . . Ωiaja (Aj1l ...Ajal)ka(y) j j ...j 1X2 a Xl and by the permutation matrix structure we obtain:

ˆ(p) ˆ ka(βpermuted))i1i2...ia = Ωi1j . . . Ωiaj(ka(y))=(ka(β))i1i2...ia . j X Therefore, the joint distribution of βˆp,..., βˆp estimated by the permutation { 1 t } test is exactly the same as the true joint distribution of βˆ ,..., βˆ under a typical { 1 t}

partition null hypothesis H0{12···t}.

Since the distribution of max βˆ is a one-to-one function of the joint distri- γ=1,...,t| γ| bution of βˆ ,..., βˆ , the distribution of max T = max βˆ under the { 1 t} γ=1,...,t| γ| γ=1,...,t| γ| null hypothesis H is exactly the same as the distribution of max T p = 0{12···t} γ=1,...,t| γ | max βˆp . γ=1,...,t| γ | ∗ To control the Type I error rate for testing H0I at level α, we need to have

supH∗ P maxγ=1,...,t Tγ >c α. 0{12···t} { | | } ≤

Since the joint distribution of βˆ ,..., βˆ only depends on β ,...,β , { 1 t} { 1 t}

p supH∗ P maxγ=1,...,t Tγ >c = supH∗ P maxγ=1,...,t Tγ >c α, 0{12···t} { | | } 0{12···t} { | | } ≤

p ∗ where c (the critical value for testing H0I ) is just the upper αth quantile of max T p estimated by the permutation test. γ=1,...,t| γ |

Therefore, a permutation test, rejecting H∗ when max βˆ > c, can correctly 0I i∈I | i| estimate the distribution of max T under a typical partition null hypothesis γ=1,...,t| γ |

H0{12···t}.

59 An example of testing H0i : βi =0(i = 1,...,k) using permutation tests is the analysis of quantitative trait loci (QTL) by Churchill and Doerge (1994). In the QTL

th setting, Xri is a categorical predictor indicating whether the allele linked with i marker is in a recurrent or non-recurrent state for the rth observation. In Churchill and Doerge (1994)’s paper, the test statistics used in permutation test for testing

QTL effects is the LOD score in Lander and Botstein (1989)’s paper. Lander and

Botstein (1989) gave the LOD score for testing the QTL effects based on a single marker. The regression equation involving only a single marker is simply

φi = a + bgi + ǫi, (3.4)

where φi is the phenotype; gi is an indicator variable (with values 1 and 0) for allele

2 status; and ǫi is i.i.d. normal random variable with mean 0 and variance σ . a, b

and σ2 are unknown parameters. Here, b denotes the estimated phenotypic effect of

a single allele substitution at a putative QTL. To test whether there is a QTL effect,

they tested H0 : b = 0 using the following LOD score:

L(ˆa, ˆb, σˆ2) LOD = log10( 2 ), (3.5) L(ˆµA, 0, σˆB1) where

L(a, b, σ2)= z((φ (a + bg )), σ2), i − i i Y Here, z is the probability density of a normal distribution with mean 0 and variance σ2;

ˆ 2 2 (ˆa, b, σˆ ) are maximum likelihood estimates (MLE); (ˆµA, 0, σˆB1) are the constrained

2 MLEs under the null hypothesis H0 : b = 0. The estimated MLEs for a, b, σ , µA

60 2 and σB1 are

aˆ = φ¯ ˆbg¯ − n (φ φ¯)(g g¯) ˆb = i=1 i − i − n (g g¯)2 P i=1 i − n (φ aˆ ˆbg )2 σˆ2 = i=1P i − − i n P ¯ µˆA = φ n (φ φ¯)2 σˆ2 = i=1 i − . B1 n P The formula of the LOD score shows that the test used by Lander and Botstein

(1989) is a likelihood ratio test, which is equivalent to the t test for testing a single null hypothesis. When the sample size n is large enough, the LOD score is asymptot-

2 2 2 ically distributed as 1/2(log10e)χ , where χ denotes the χ distribution with 1 d.f..

Therefore, the conclusion of Theorem 3.1 still holds when the LOD score is used as the test statistic for testing the QTL effect based on a single marker.

It is interesting to observe that the sum squares of treatment (SSTreatment, de-

2 ˆ2 noted byσ ˆexp in Lander and Botstein’s paper) happens to be proportional to b for

61 the regression model involving only a single marker.

σˆ2 =σ ˆ2 σˆ2 exp B1 − res n (φ φ¯)2 n (φ aˆ ˆbg )2 = i=1 i − i=1 i − − i n − n P P n (φ φ¯ + φ aˆ ˆbg )(φ φ¯ φ +ˆa + ˆbg ) = i=1 i − i − − i i − − i i n P n [2ˆb(φ φ¯)(g g¯) ˆb2(g g¯)2] = i=1 i − i − − i − n P 2ˆb n (φ φ¯)(g g¯) ˆb2 n (g g¯)2 = i=1 i − i − − i=1 i − n P P 2ˆb2 n (g g¯)2 ˆb2 n (g g¯)2 = i=1 i − − i=1 i − n P P ˆb2 n (g g¯)2 = i=1 i − n P ˆb2 = . 4 The last equality holds because in the backcross (B1) population, there are equal

amounts of recurrent and non-recurrent states and (g g¯)2 =( 1 )2 = 1 . Under the i − ± 2 4 ˆ 1 ˆ assumption of complete co-dominance and no epistasis, we will have b = 2 δ, where δ denotes the phenotypic effect of a QTL. Therefore, the variance explained by the

2 2 QTL can be written as σexp = δ /16 as presented by Lander and Botstein (1989)

2 ˆ 2 since bothσ ˆexp and b are unbiased estimators for σexp and b.

3.3.2 Pre-pivot resampling method

The pre-pivot resampling method is a bootstrap method. The bootstrap method,

first introduced by Efron (1979), is another resampling technique commonly used to estimate the unknown test statistics distributions. The bootstrap method consists of approximating the test statistic’s distribution by the empirical distribution of the data, and then resampling the data with replacement to obtain the estimated test statistic’s distribution that is asymptotically consistent.

62 The pre-pivot resampling method resamples the residuals of the fixed-effects gen-

eral linear model with replacement, and estimates the joint distribution of the test

statistics T ,...,T = βˆ ,..., βˆ under a typical partition null hypothesis H { 1 t} { 1 t} 0{12···t} according to the following several steps:

Step 1. Get the estimated residuals (ˆǫ) from the fixed-effects general linear model,

i.e.,ǫ ˆ = y Xβˆ; − Step 2. Check whether E(ˆǫ) = 0. If the mean is not equal to 0, then center the

residuals at the mean, i.e., computeǫ ˆ µˆ , whereµ ˆ = (1/n) n ǫˆ ; i − n n i=1 i Step 3. Resample the n centered residuals with replacementP to get the bootstrap

∗ ∗ residuals ǫ1,...,ǫn;

Step 4. Get the resampled y∗ values by y∗ = Xβˆ+ǫ∗, and calculate the resampled

ordinary least square estimates βˆ∗ =(X ′ X)−1X ′ y∗;

Step 5. Repeat Step 3 and Step 4 for B (B nn) times; ≤ Then, the joint distribution of √n(βˆ∗ βˆ) is asymptotically consistent to the joint − distribution of √n(βˆ β), provided n is large and n trace(X ′ X)−1 is small as shown − · in Theorem 3.4. Suppose

1 ′ X X V, which is positive definite. n →

If we also assume that the elements of X are uniformly small compared to √n, then

√n(βˆ β) is asymptotically normal with mean 0 and variance-covariance matrix − 2 −1 σ V . To satisfy this condition, the sample size nγ (γ = 1,...,r) in each treat- ment group should go to infinity at the same rate, i.e., for a sequence of sample size

ν ν (n1,...,nr ), ν =1, 2,..., with total sample size

N = nν + + nν ν 1 ··· r 63 such that ν nr λγ as ν Nν → →∞

where λγ = 1 and λγ > 0 ((Lehmann, 1999)). P Theorem 3.4. In the fixed-effects general linear model (3.3), assume

(a) The errors are i.i.d., and

(b) 1 X ′ X V , which is positive definite, n → then, the joint distribution of the test statistics T ,...,T , estimated by the pre- { 1 t} pivot resampling method, converges to the true joint distribution of the test statistics

asymptotically under a typical partitioning null hypothesis H0{12···t}.

r Proof. Letρ ˜ be the Mallows metric on F = G F s : x dG(x) < . The r r,s { ∈ R k k ∞}

Mallow’s distance between two distributions H and G in Fr,sR is defined as

ρ˜ (H,G)= inf (E X Y r)1/r, r τX,Y k − k

where τX,Y is the collection of all possible joint distributions of the pairs (X,Y ) that

have marginal distributions H and G respectively. For random variables U and V ,

which have distributions H and G in Fr,s respectively, we defineρ ˜r(U, V )=˜ρr(H,G).

Theρ ˜r-convergence is stronger than the convergence in distribution.

Bickel and Freedman (1981) proved that the Mallow’s distance between a distri-

bution F F and its empirical distribution F converges to 0 a.s., i.e., ∈ r,p n

ρ˜ (F , F ) a.s. 0; (3.6) r n →

Freedman (1981) also proved that

ρ˜ (Fˆ , F ) a.s. 0, (3.7) 2 n n → 64 where Fˆn is the estimated empirical distribution using the bootstrap method. In the

fixed-effects general linear model, Fˆn is the estimated empirical distribution of the centered residuals ˆǫ1,..., ǫˆn.

Next, we express √n(βˆ∗ βˆ) and √n(βˆ β) in the form of residuals. − −

′ ′ √n(βˆ∗ βˆ)= √n((X X)−1X Y ∗ βˆ) − − ′ ′ = √n((X X)−1X (Xβˆ + ǫ∗) βˆ) − ′ ′ = √n(βˆ +(X X)−1X ǫ∗ βˆ) − ′ ′ = √n((X X)−1X ǫ∗).

Similarly, we have

′ ′ √n(βˆ β)= √n((X X)−1X ǫ). − Therefore, we have

′ ′ ′ ′ ρ˜ (√n(βˆ∗ βˆ), √n(βˆ β))=ρ ˜ (√n((X X)−1X ǫ∗, √n((X X)−1X ǫ)) 2 − − 2 n trace X ′ X −1 ρ˜ (Fˆ , F )2 ≤ · { } · 2 n q 2n trace X ′ X −1 ρ˜ (Fˆ , F )2 +ρ ˜ (F , F )2 ≤ · { } · 2 n n 2 n p q = o(1) a.s. (3.8)

The first inequality holds by Theorem 2.1 in Freedman (1981). The first term n · trace X ′ X −1 = O(1) by assumption (b); the second term and the third term go to { } zero a.s. by (4) and (5) respectively. Therefore, we haveρ ˜ (√n(βˆ∗ βˆ), √n(βˆ β)) 2 − − goes to o(1) a.s., showing the strongρ ˜ - consistency of √n(βˆ∗ βˆ) for T = βˆ. 2 − n Thus, the joint distribution of the test statistics T ,...,T , estimated by the pre- { 1 t} pivot resampling method, converges to the true joint distribution of the test statistics

asymptotically.

65 The test statistic for testing H is max T . Next, we show that the dis- 0{12···t} γ=1,...,t| γ| tribution of max T , estimated by the pre-pivot resampling method, converges γ=1,...,t| γ| to the true distribution of max T asymptotically. γ=1,...,t| γ|

Theorem 3.5. In the fixed-effects general linear model (3.3), assume

(i) The errors are i.i.d., and

(ii) 1 X ′ X V , which is positive definite, n → then, the distribution of the test statistic max T , estimated by the pre-pivot re- γ=1,...,t| γ| sampling method, converges to the true distribution of the test statistic asymptotically

under a typical partitioning null hypothesis H0{12···t}.

Proof. Under assumption (i), it is trivial that max βˆ is a continuous func- γ=1,...,t| γ| tion based on βˆ , βˆ ,..., βˆ . According to theorem 3.4, the joint distribution of { 1 2 t} the test statistics T ,...,T estimated by the pre-pivot resampling method has a { 1 t}

strongρ ˜2 consistency under a typical partition null hypothesis H0{12···t}. Thus, by the

continuous mapping theorem, we have

√n(Mˆ ∗ Mˆ ) a.s. √n(Mˆ M), (3.9) − → − where M = max β . γ=1,...,t| γ| Therefore, the distribution of the test statistic max T , estimated by the γ=1,...,t| γ| pre-pivot resampling method, converges to the true distribution of the test statistic

asymptotically under a typical partition null hypothesis H0{12···t}.

3.3.3 Post-pivot resampling method

Similar to the permutation test, the post-pivot resampling method also resamples

the observed data. However, it will resample with replacement and center and/or

66 scale the resampled test statistics to estimate the test statistic’s null distribution. To

estimate the joint distribution of the test statistics T ,...,T = βˆ ,..., βˆ under { 1 t} { 1 t} a typical partitioning null hypothesis H0{12···t}, the post-pivot resampling method proceeds as follows:

′ Step 1. Resample the y = (Y1,...,Yn) data vector with replacement to get a

♯ ♯ ♯ ′ bootstrap response vector y =(Y1 ,...,Yn ) ;

ˆ♯ ˆ♯ ˆ♯ ′ ′ −1 ′ ♯ Step 2. Calculate the resampled test statistics β =(β1,..., βk) based on (X X) X y ; Step 3. Repeat step 1 and step 2 for B (B nn) times; ≤ Step 4. Center the resampled test statistics at the sample average, i.e. βˆc♯ =

βˆ♯ E (βˆ♯) (where G denotes the empirical distribution of y and G denotes the − Gn n true distribution of y).

Then, the joint distribution of √n(βˆ♯ E (βˆ♯)) converges to the joint distribution − Gn of √n(βˆ β) asymptotically, provided n is large and n trace(X ′ X)−1 is small. The − · limiting distribution of √n(βˆ β) is normal with mean 0 and variance-covariance − matrix σ2V −1.

Theorem 3.6. In the fixed effects general linear model (3.3), assume

(a) The errors are i.i.d., and

(b) 1 X ′ X V , which is positive definite, n → then, the joint distribution of the test statistics T ,...,T , estimated by the post- { 1 t} pivot resampling method, converges to the true joint distribution of the test statistics

asymptotically under a typical partitioning null hypothesis H0{12···t}.

Proof. First, I will introduce two properties about Mallow’s distance.

67 (P1). Let U and V be random vectors and E(Ui) = E(Vi) (i = 1,...,n). Let A be a m n matrix of scalars. Then, ×

′ ρ˜ (AU, AV )2 trace(AA ) ρ˜ (U ,V )2. 2 ≤ · 2 i i

(P2). If E U 2 < and E V 2 < , k k ∞ k k ∞

[˜ρ (U, V )]2 = [˜ρ (U EU,V EV )]2 + EU EV 2. 2 2 − − k − k

In the post-pivot resampling method,

′ ′ AU = βˆ♯ =(X X)−1X Y ♯,

′ ′ AV = βˆ =(X X)−1X Y,

EU = E(Y ♯)= Xβ,

EV = E(Y )= Xβ.

By (4), E (βˆ♯) a.s. E (βˆ♯)= β, and E Y ♯ 2 < and E Y 2 < . Therefore, Gn → G k k ∞ k k ∞

ρ˜ (√n(βˆ♯ E (βˆ♯), √n(βˆ β))=ρ ˜ (√n(βˆ♯ E (βˆ♯), √n(βˆ β)) 2 − Gn − 2 − G − =ρ ˜ (√n(βˆ♯ β), √n(βˆ β)) 2 − − [˜ρ (√nβˆ♯, √nβˆ)]2 ≤ 2 q ′ −1 ′ ♯ ′ −1 ′ 2 = [˜ρ2(√n(X X) X Y , √n(X X) X Y )] q n trace X ′ X −1 ρ˜ (G ,G)2 ≤ · { } · 2 n p = o(1) a.s. (3.10)

By (P2), the first inequality holds. By (P1), the second inequality holds. From assumption (b) and (4), we get the last equality. Therefore, we can show the strong

ρ˜ consistency of √n(βˆ♯ E (βˆ♯)) to √n(βˆ β). 2− − Gn − 68 Thus, the joint distribution of the test statistics T ,...,T , estimated by the { 1 t} post-pivot resampling method, converges to the true joint distribution of the test

statistics asymptotically under a typical partitioning null hypothesis H0{12···t}.

Using the similar proof as that for the pre-pivot resampling method, we can show

that the distribution of the test statistic max T , estimated by the post-pivot γ=1,...,t| γ | resampling method, converges to the true distribution of the test statistic asymptot-

ically under a typical partitioning null hypothesis H0{12···t}.

3.4 Estimating critical values for strong control of FWER

In the previous section (3.3), we discussed the procedures that use the Partition- ing Principle to achieve a strong control of FWER for testing multiple null hypothe- ses. According to the partitioning principle, each partitioning null hypothesis will be

∗ tested at a level of α. For a typical partitioning null hypothesis H0{12···t}, we need

supH∗ P maxγ=1,...,t Tγ >c α to control FWER strongly. 0{12···t} { | | } ≤ We also proved that the null distribution of the test statistic (max T ) γ=1,...,t| γ| can be estimated either exactly or asymptotically using the resampling methods. In

this section, we will focus on estimating the critical value for a typical partition null

hypothesis based on the estimated test statistic’s null distribution.

3.4.1 Permutation tests

Let G denote the cumulative distribution function of max T and Gˆ denote γ=1,...,t| γ | the cumulative distribution function of max T p . The following theorem gives γ=1,...,t| γ | ∗ the critical value estimated by the permutation test for testing H0{12···t}.

69 Theorem 3.7. The critical value determined by the permutation test for testing

∗ H0{1···t} at a level of α is given by

cp = Gˆ−1(1 α). (3.11) −

Proof. In previous sections, we showed that the test statistic’s distribution estimated by the permutation test is exactly the same as the true test statistic’s distribution under the null hypothesis H0{1···t}. Therefore, we will have Gˆ = G under the null

∗ hypothesis H0{1···t}.

∗ For the fixed-effects general linear model (3.3) with i.i.d. errors, when H0{1···t} is true, the joint distribution of βˆ (i 1 t ) only depends on β (i 1 t ), but i ∈{ ··· } i ∈{ ··· } not β (j / 1 t ). Thus, we can estimate the critical value from the test statistic’s j ∈{ ··· } permutation distribution as follows:

p supH∗ P maxγ=1,...,t Tγ >c α 0{12···t} { | | } ≤ p p = supH∗ P maxγ=1,...,t T >c α 0{12···t} { | γ | } ≤ = sup P max T p >cp α H0{12···t} { γ=1,...,t| γ | } ≤ = P max T p >cp α H0{12···t} { γ=1,...,t| γ | } ≤ 1 Gˆ(cp) α ⇔ − ≤ Gˆ(cp) 1 α ⇔ ≥ − cp = Gˆ−1(1 α). (3.12) ⇔ −

Therefore, the critical value determined by the permutation test is just the (1 − α)th percentile of the max T p distribution (the max T distribution γ=1,...,t| γ | γ=1,...,t| γ| ∗ under the null hypothesis H0{1···t}).

70 3.4.2 Pre-pivot resampling method

The pre-pivot resampling method controls FWER asymptotically. We showed that the max T distribution, estimated by the pre-pivot resampling method, con- γ=1,...,t| γ | verges to the true test statistic’s distribution asymptotically under a typical partition

∗ null hypothesis H0{1···t}. Let G∗ denote the cumulative distribution function of max T ∗ = max βˆ∗ . γ=1,...,t| γ | γ=1,...,t| γ | Then, the critical value estimated by the pre-pivot resampling method was given in

∗ Theorem 3.8 for testing H0{1···t} at a level of α.

Theorem 3.8. The critical value determined by the pre-pivot resampling method for

∗ asymptotic control of type I error rate at a level of α for testing H0{1···t} is given by

c∗ = G∗−1(1 α). (3.13) −

Proof.

∗ supH∗ P maxγ=1,...,t Tγ >c α 0{12···t} { | | } ≤ ∗ ∗ = limn→∞supH∗ P maxγ=1,...,t T >c α 0{12···t} { | γ | } ≤ = lim sup P max T ∗ >c∗ α n→∞ H0{12···t} { γ=1,...,t| γ | } ≤ = lim P max T ∗ >c∗ α n→∞ H0{12···t} { γ=1,...,t| γ | } ≤ lim (1 G∗(c∗)) α ⇔ n→∞ − ≤ 1 lim G∗(c∗) α ⇔ − n→∞ ≤ lim G∗(c∗) 1 α ⇔ n→∞ ≥ − c∗ = G∗−1(1 α) for asymptotic control. (3.14) ⇒ −

The first equality holds due to the asymptotic convergence of the max T ∗ γ=1,...,t| γ | distribution estimated by the pre-pivot resampling method to the true max T γ=1,...,t| γ| 71 distribution. The joint distribution of βˆ (i 1 t ) only depends on β (i i ∈ { ··· } i ∈ 1 t ), but not on β (j / 1 t ), which establishes the third equality. { ··· } j ∈{ ··· } Thus, the critical value determined by the pre-pivot resampling method is just

the (1 α)th percentile of the max T ∗ distribution. − γ=1,...,t| γ | 3.4.3 Post-pivot resampling method

Similar to the pre-pivot resampling method, the post-pivot resampling method

only controls FWER asymptotically. We have shown that the distribution of max T , γ=1,...,t| γ| estimated by the post-pivot resampling method, converges to the true distribution of

max T asymptotically under a typical partitioning null hypothesis H∗ . γ=1,...,t| γ| 0{1···t} Let G♯ denote the cumulative distribution function of max T ♯ = max βˆ♯ . γ=1,...,t| γ | γ=1,...,t| γ | Then, the theorem below will give the critical value estimated by the post-pivot re- sampling method for asymptotic type I error rate control.

Theorem 3.9. The critical value determined by the post-pivot resampling method for

∗ controlling the type I error rate asymptotically at a level of α for testing H0{1···t} is given by

c♯ = G♯−1(1 α). (3.15) −

72 Proof.

♯ supH∗ P maxγ=1,...,t Tγ >c α 0{12···t} { | | } ≤ ♯ ♯ = limn→∞supH∗ P maxγ=1,...,t T >c α 0{12···t} { | γ | } ≤ = lim sup P max T ♯ >c♯ α n→∞ H0{12···t} { γ=1,...,t| γ | } ≤ = lim P max T ♯ >c♯ α n→∞ H0{12···t} { γ=1,...,t| γ | } ≤ lim (1 G♯(c♯)) α ⇔ n→∞ − ≤ 1 lim G♯(c♯) α ⇔ − n→∞ ≤ lim G♯(c♯) 1 α ⇔ n→∞ ≥ − c♯ = G♯−1(1 α) for asymptotic control. (3.16) ⇒ −

Therefore, the critical value determined by the post-pivot resampling method is

the (1 α)th percentile of the max T ♯ distribution for controlling the type I − γ=1,...,t| γ | error rate at a level of α asymptotically.

3.5 Shortcuts of partitioning tests using resampling methods

According to the Partitioning Principle, for testing k null hypotheses, we need to test 2k 1 partitioning null hypotheses. When the number of k is very large, such as − thousands even millions in the microarray data analysis, the partitioning tests become computationally impracticable.

Under certain conditions, there are shortcuts for partitioning tests that can reduce

the number of tests from 2k 1 to at most k. The reason that partitioning tests have − ∗ shortcuts is that the rejection of some H0I ’s can cause the rejection of many other

∗ H0I ’s without actual testing.

73 The shortcuts are usually in the form of step-wise tests (step-down or step-up tests). For step-down shortcuts, the form of the test statistic for each partition-

∗ ∗ ing hypothesis H0I need to be maxT type, and the decision rule is: Reject H0I if maxi∈I Ti >cI . Xu and Hsu (2007) showed three conditions for step-down shortcuts of partitioning tests with the gFWER control. A more precise set of the sufficient conditions for a step-down shortcut to be valid are shown as follows ((Calian et al.,

2008)):

∗ ∗ D1: The test for H0I is of the form: reject H0I if maxi∈I Ti >cI ;

D2: supH∗ P maxi∈I Ti >cI α; 0I { } ≤

D3: The values of the test statistics Ti, i = 1,...,k, are not re-computed for

∗ different H0I ;

D4: Critical values c have the property that if J I then c c . I ⊂ J ≤ I Condition D2 is used to control the FWER strongly for shortcuts of partition- ing tests. Conditions D1, D3 and D4 make the partitioning tests computationally practicable.

Let [1], [2],..., [k] be random indices for 1, 2,...,k, such that the test statistics

T , T ,...,T have an order T T T . 1 2 k [1] ≤ [2] ≤···≤ [k] If T >c , all partitioning null hypotheses H∗ with [k] J [1],..., [k] [k] {[1],...,[k]} 0J ⊂ ⊆{ } k−1 ∗ will be rejected. There are actually 2 such null hypotheses H0J . Thus, H0[k] can be rejected. Otherwise, if T c , one can accept H and stop. [k] ≤ {[1],...,[k]} 0[k] If T > c , all partitioning null hypotheses H∗ with [k 1] J [k−1] {[1],...,[k−1]} 0J − ⊂ ⊆ [1],..., [k 1] will be rejected. There are actually 2k−2 such null hypotheses H∗ . { − } 0J Thus, H can be rejected. Otherwise, if T c , one can accept 0[k−1] [k−1] ≤ {[1],...,[k−1]}

H0[k−1] and stop.

74 We continue this process until T[1] is compared with c[1] for testing H0[1]. There is

only 20 = 1 null hypothesis in this step.

Altogether, with k steps, there are

20 +21 +22 + +2k−1 =2k 1 ··· − null hypotheses in the step-down shortcut, which is just the complete set of parti- tioning null hypotheses.

For example, if we want to test four null hypotheses H0i : θi =0(i =1, 2, 3, 4) using the test statistic values T1, T2, T3 and T4 calculated from the data, we need to test 24 1 = 15 partitioning null hypotheses in total. −

Suppose T2 < T3 < T4 < T1, then we have [1]=2, [2]=3, [3]=4, and [4]=1. With

∗ the conditions D1-D4 satisfied, if T[4] = T1 >c{1,2,3,4}, then H0{1,2,3,4} will be rejected.

∗ ∗ ∗ ∗ ∗ ∗ ∗ We do not need to test H0{1,2,3}, H0{1,2,4}, H0{1,3,4}, H0{1,2}, H0{1,3}, H0{1,4}, and H0{1} because all these null hypotheses will be automatically rejected. There are 23 = 8 such partitioning null hypotheses to be rejected once T[4] = T1 >c{1,2,3,4}. Finally, all the above rejects lead to the rejection of the null hypothesis H01 : θ1 = 0.

Once the above eight null hypotheses are rejected, we will compare T[3] = T4 with

2 c{[1],[2],[3]} = c{2,3,4}. If T4 >c{2,3,4}, then there are 2 = 4 partitioning null hypotheses

∗ ∗ ∗ ∗ (H0{2,3,4}, H0{2,4}, H0{3,4} and H0{4}) to be rejected. The rejections will lead to the

final rejection of the null hypothesis H04 : θ4 = 0.

1 ∗ ∗ In step 3, we will reject 2 = 2 partitioning null hypotheses H0{2,3} and H0{3} if

T3 >c{2,3}. This will lead to the final rejection of the null hypothesis H03 : θ3 = 0.

0 ∗ In the last step, we only have 2 = 1 partitioning null hypothesis H0{2}. If T2 > c{2}, then we will reject H0{2} and finally reject the null hypothesis H02 : θ2 = 0.

75 By making the above four comparisons, we actually tested all 8+4+2+1=15 partitioning null hypotheses.

Thus, the partitioning tests have the following step-down shortcut when the con- ditions D1-D4 are satisfied:

Step 1: If T >c , then infer θ / Θ and go to step 2; otherwise stop. [k] {[1],...,[k]} [k] ∈ [k] Step 2: If T >c , then infer θ / Θ and go to step 3; otherwise [k−1] {[1],...,[k−1]} [k−1] ∈ [k−1] stop.

··· Step k: If T >c , then infer θ / Θ and stop; otherwise stop. [1] {[1]} [1] ∈ [1] For the fixed-effects general linear model (3.3) with i.i.d. errors, the conditions

D1-D3 are all satisfied when the test statistics are Ti = βˆi for testing H0i : βi = 0

(i =1,...,k). Thus, we will show whether the condition D4 is satisfied or not for the resampling methods, and the corresponding shortcuts for each resampling method.

3.5.1 Permutation tests

In section 3.3, for the fixed-effects general linear model (3.3) with i.i.d. errors, we showed that the permutation distribution of the test statistic max T is γ=1,...,t| γ | ∗ identical to its true distribution under a typical partitioning null hypothesis H0{12···t}. Next, we will show that the critical values estimated by the permutation test satisfy the condition D4.

Suppose J I, then ⊆

max( T p ,γ I) max( T p ,γ J) | γ | ∈ ≥ | γ | ∈

Therefore, the (1-α)th percentile of max( T p ,γ I) is at least as large as the (1-α)th | γ | ∈ percentile of max( T p ,γ J), i.e., c c . | γ | ∈ J ≤ I 76 In the step-down shortcut, we need to determine k critical values from k permu- tation distributions of maxT type test statistics.

In step 1, we determine c based on the permutation distribution of max T {[1],...,[k]} i=1,...,k| i| (max βˆ in model (3.3) setting). Let Gˆ denote the distribution of max T i=1,...,k| i| k i=1,...,k| i| under the partitioning null hypothesis H∗ , thenc ˆ = Gˆ−1(1 α). 0{12···k} {[1],...,[k]} k −

In step 2, c{[1],...,[k−1]} can be determined from the permutation distribution of max T (max βˆ ). Let Gˆ denote the distribution of max T i=[1],...,[k−1]| i| γ=[1],...,[k−1]| i| k−1 i=[1],...,[k−1]| i| under the partitioning null hypothesis H∗ , thenc ˆ = Gˆ−1 (1 α). 0{[1][2]···[k−1]} {[1],...,[k−1]} k−1 −

Similarly, we can determine all of the critical values for k steps. In step k,c ˆ{[1]} =

Gˆ−1(1 α), where Gˆ denotes the permutation distribution of T = βˆ . 1 − 1 | [1]| | [1]| The above process can be summarized in the following algorithm.

1. Calculate the test statistic values T1, T2,...,Tk from the data.

2. Order the test statistic values T (i =1,...,k) such that T T | i| | [1]|≤| [2]|≤···≤ T . | [k]| 3. Permute (resample without replacement) the observed data to get the permuted data.

4. Calculate the test statistics T p (i =1,...,k) based on the permuted data. | [i]| 5. Compute max T p , max T p , . . . , max(T p , T p ) and T p . i=1,...,k| [i]| i=1,...,k−1| [i]| [1] [2] [1] 6. Repeat Step 3 to Step 5 for n! times.

7. The critical values (c{[1],...,[k]}, c{[1],...,[k−1]}, . . . , and c{[1]}) determined from the permutation distributions are the (1 α)th percentiles of the distributions of − max T p , max T p , . . . , and T p (Gˆ−1(1 α), Gˆ−1 (1 α),..., Gˆ−1(1 i=1,...,k| [i]| i=1,...,k−1| [i]| [1] k − k−1 − 1 − α)) respectively.

77 3.5.2 Pre-pivot resampling method

For the fixed-effects general linear model, we showed that the distribution of the test statistic max T , estimated by the pre-pivot resampling method, converges γ=1,...,t| γ| to its true distribution asymptotically under a typical partitioning null hypothesis

∗ H0{12···t}. Next, we show that the critical values determined by the pre-pivot resampling method also satisfy the condition D4.

Suppose J I, then ⊆

max( T ∗ ,γ I) max( T ∗ ,γ J) | γ | ∈ ≥ | γ | ∈

Therefore, the (1-α)th percentile of max( T ∗ ,γ I) is at least as large as the (1-α)th | γ | ∈ percentile of max( T ∗ ,γ J), i.e. c c . | γ | ∈ J ≤ I The critical values in the step-down shortcuts can be determined by the pre-pivot resampling method according to the following algorithm.

1. Calculate the test statistics T1, T2,...,Tk.

2. Order the test statistic values T (i =1,...,k) such that T T | i| | [1]|≤| [2]|≤···≤ T . | [k]| 3. Calculate the test statistic values T ∗ (i =1,...,k) using the pre-pivot resam- | [i]| pling method according to steps 1-5 in section 3.3.2.

4. Compute max T ∗ , max T ∗ , . . . , max(T ∗ , T ∗ ) and T ∗ . i=1,...,k| [i]| i=1,...,k−1| [i]| [1] [2] [1] 5. Repeat Step 3 and Step 4 for B (B nn) times. ≤

6. The critical values (c{[1],...,[k]}, c{[1],...,[k−1]}, . . . , and c{[1]}) determined by the pre-pivot resampling method are the (1 α)th percentiles of the distributions of −

78 max T ∗ , max T ∗ , . . . , and T ∗ (G∗−1(1 α),G∗−1(1 α),...,G∗−1(1 i=1,...,k| [i]| i=1,...,k−1| [i]| [1] k − k−1 − 1 − α)) respectively.

3.5.3 Post-pivot resampling method

In the previous section, we proved the asymptotic convergency of the distribution

of the test statistic max T , estimated by the post-pivot resampling method, γ=1,...,t| γ| to the true distribution of max T under a typical partitioning null hypothesis γ=1,...,t| γ |

H0{12···t}.

Similar to the pre-pivot resampling method, we have c c when J I holds J ≤ I ⊆ for the post-pivot resampling method.

The following algorithm shows the steps of estimating the critical values in the

step-down shortcuts by the post-pivot resampling method.

1. Calculate the test statistics T1, T2,...,Tk.

2. Order the test statistic values T (i =1,...,k) such that T T | i| | [1]|≤| [2]|≤···≤ T . | [k]| 3. Calculate the test statistics T ♯ (i =1,...,k) using the post-pivot resampling | [i]| method according to steps 1-4 in section 3.3.3.

4. Compute max T ♯ , max T ∗ , . . . , max(T ♯ , T ♯ ) and T ♯ . i=1,...,k| [i]| i=1,...,k−1| [i]| [1] [2] [1] 5. Repeat Step 3 and Step 4 for B (B nn) times. ≤

6. The critical values (c{[1],...,[k]}, c{[1],...,[k−1]}, . . . , and c{[1]}) determined by the

post-pivot resampling method are the (1 α)th percentiles of the distributions of − max T ♯ , max T ♯ , . . . , and T ♯ (G♯−1(1 α),G♯−1 (1 α),...,G♯−1(1 i=1,...,k| [i]| i=1,...,k−1| [i]| [1] k − k−1 − 1 − α)) respectively.

79 CHAPTER 4

CONDITIONS FOR SIGNIFICANT ANALYSIS OF MICROARRAYS (SAM) TO CONTROL THE EMPIRICAL FDR

SAM (Tusher et al. (2001)) is frequently used in the biological sciences to iden- tify genes whose expression levels have significantly changed between different bi- ological states in microarray experiments. From the website of Stanford Univer- sity (Chu et al. (2000)), the SAM software can be freely downloaded (http://www- stat.stanford.edu/ tibs/SAM/), which makes SAM a popular method for identifying significantly expressed genes.

In SAM, Tusher et al. (2001) used permutation to estimate the empirical FDR

(Fdr), which is defined as the ratio of the expected number of falsely rejected null

hypotheses to the observed number of rejected null hypotheses. The SAM software

aims to control the Fdr at a desired nominal level α such as 5% or 10 %. To identify significantly changed genes in expression between two different biological states, the test statistics used in SAM (Tusher et al. (2001)) are

x¯i y¯i ti = − , i =1,...,k, (4.1) si + s0

80 wherex ¯i andy ¯i are defined as the average expression levels of the ith gene in two different biological states, respectively. The standard error ofx ¯ y¯ is s : i − i i

1/m +1/n m n s = (x x¯ ) (y y¯ ) , (4.2) i v m + n 2 ij − i − il − i u ( j=1 l=1 ) u − X X t where xij are the levels of expression of the ith gene, jth sample in the first biological state (j = 1,...,m), and yil are the levels of expression of the ith gene, lth sample in the second biological state. s0 is a small positive constant, which is chosen to minimize the coefficient of variation.

It was reported in recent literature, however, that the Fdr is not well controlled by SAM. Pan (2003) found the permutation-based SAM over-estimates the number of false positives. The simulation studies by Dudoit et al. (2003) showed the nominal

Fdr estimated by SAM was much smaller than the actual Fdr in some simulations, and greater than 1 in other simulations. Xie et al. (2005) also pointed out that SAM tends to overestimate Fdr. Larsson et al. (2005) used real data examples to argue for caution when using SAM. Zhang (2007) evaluated the SAM R-package SAM 2.20 and showed the poor estimation of Fdr by SAM, and pointed out that SAM 2.20 may produce erroneous and conflicting results under certain situations.

There are some misconceptions about the poor performance of SAM regarding the permutation method used. For example, Zhang (2007) provided one reason for SAM’s poor estimation of Fdr is that the test statistic’s null distribution estimated from permutation may not have a mean of zero, which leads to over-dispersed null scores.

However, for the SAM procedure with two independent samples, when the sample sizes of those two groups are equal, the permutated test statistics (unstandardized or

81 standardized like the SAM test statistics) with complete enumerations have a mean of zero.

Theorem 4.1. For comparing two independent samples with equal sample sizes using either unstandardized or standardized t-test statistics, permuting the labels of the two groups with a complete enumeration makes the mean of the permuted test statistics equal to zero.

Proof. The reason is intuitive. For a complete enumeration with equal numbers of the two group labels of zeros and ones, one permutation with one set of labels always has its opposite set of labels (with zeros and ones switched). Thus, the positive permuted test statistics and the corresponding negative permuted test statistics cancel each other out (since we always have the same MSE for standardized test statistics), resulting in a zero mean of all permuted test statistics with a complete enumeration.

In this chapter, first we will show the discrepancies between true expected values of order statistics and expected values estimated by permutation in SAM through simulation studies. Then, we will show conditions for the equivalence of the expected number of false rejections estimated by permutation to the true expected number of false rejections in SAM. For simplicity, we use unstandardized test statistics t =x ¯ y¯ i i− i in SAM. Another reason we choose unstandardized test statistics is to avoid the possibility of near zero standard errors calculated from some permutation resamples since those standard errors make the test statistics close to infinity and lead to invalid results. Finally, we will propose a more powerful adaptive two-step procedure that controls the expected number of false rejections at a pre-determined constant µ.

82 4.1 Introduction to Significant Analysis of Microarrays (SAM) method

The Significance Analysis of Microarrays (SAM) was first introduced by Tusher et al. (2001) for identifying genes with statistically significant changes in expression by assimilating a set of gene-specific t tests. In SAM, each gene is assigned a score on the basis of its change in gene expression relative to the standard deviation of repeated measurements for that gene. Then, a of the observed relative difference versus the expected relative difference estimated by permutation is used to select statistically significant genes based on a fixed threshold.

The SAM procedure can be summarized as follows based on the description of

SAM in Tusher et al. (2001).

1. Compute a test statistic ti for each gene i (i =1,...,g).

2. Compute order statistics t such that t t t . (i) (1) ≤ (2) ···≤ (g)

3. Perform B permutations of the responses/covariates y1,...,yn. For each per-

mutation b, compute the permuted test statistics ti,b and the corresponding order

statistics t t t . (1),b ≤ (2),b ···≤ (g),b 4. From the B permutations, estimate the expected values of order statistics by ¯ t(i) = (1/B) b t(i),b.

5. Form aP quantile-quantile (Q-Q) plot (SAM plot) of the observed t(i) versus the

expected t¯(i).

6. For a given threshold ∆, starting at the origin, and moving up to find the

first i = i such that t t¯ > ∆. All genes past i are called significant positive. 1 (i) − (i) 1

Similarly, starting at the origin, moving down to the left and find the first i = i2

such that t¯ t > ∆. All genes past i are called significant negative. Define (i) − (i) 2

83 the upper cut point cut (∆) = min t : i i = t , and the lower cut point up { (i) ≤ 1} (i1) cut (∆) = max t : i i = t . low { (i) ≥ 2} (i2) 7. For a given threshold ∆, the expected number of false rejections E(V ) is esti- mated by computing the number of genes with ti,b above cutup(∆) or below cutlow(∆) for each of the B permutations, and averaging the numbers over B permutations.

8. A threshold ∆ is chosen to control the Fdr (Fdr = E(V )/r) under the complete null hypothesis, at an acceptable nominal level.

4.2 Discrepancies between true expected values of order statis- tics and expected values estimated by permutation

In this section, we will show the situations in which the joint distribution of test statistics estimated by permutation is different from the true joint distribution in

SAM. The difference of joint distributions leads to the discrepancies between true expected values of order statistics and expected values estimated by permutation.

4.2.1 Effect of unequal variances-covariance matrices and sam- ple sizes

Suppose we want to observe gene expressions of m samples in the first biological condition and n samples in the second biological condition. Let Xl =(Xl1,...,Xlk), l =1,...,m denote the observations of k genes of the lth sample from the first biolog- ical condition, and Yj = (Yj1,...,Yjk), j = 1,...,n, denote the observations of the same k genes of the jth sample from the second biological condition. We assume that

i.i.d both X and Y follow multivariate normal distributions, i.e. X MV N (µX ,ΣX ) i j i ∼ k X X i.i.d and Y MV N (µY ,ΣY ), i =1, 2,...,m, j =1, 2,...,n. Consider testing the null j ∼ k Y Y

84 µ hypotheses H : µX = µY using the test statistics T = X¯ Y¯ , where 0 X Y −

T =(T ,...,T )=(X¯ Y¯ ,..., X¯ Y¯ ). 1 k 1 − 1 k − k

µ Under the null hypotheses H0 : µX = µY , the distribution of T is

Σ Σ T MV N (00, X + Y ) ∼ k m n

µ The permutation distribution of T under the null hypotheses H0 : µX = µY is

min(m,n) m n (m r)ΣΣX + rΣY rΣX +(n r)ΣΣY r r MV N (00, − X Y + X − Y ). (4.3) m+n k m2 n2 r=0 m  X  Thus, if m = n and ΣX = ΣY , the true and permutation distributions of T 6 X 6 Y µ under the null hypotheses H0 : µX = µY are different, which leads to the difference

between the true expected values of order statistics and the expected values of order

statistics estimated by permutation. We conduct simulation studies to show the effect

of unequal variances-covariance matrices and sample sizes on the SAM procedure.

In the first simulation study, 10,000 sets of random samples are drawn from

MV N50(µX = 0,ΣX = I50 ) with m = 3 for the first biological condition and from

MV N (µY =00,ΣY =4 I50) with n = 6 for the second biological condition, indepen- 50 Y Y · 50 dently. Figure 4.1 shows the Q-Q plot of the true expected values of order statistics against the expected values of order statistics estimated by permutation. The Q-Q plot shows the departure of the expected values of order statistics by permutation from the true expected values of order statistics.

Let J denote a k k matrix with all elements equalling to 1. In the second × simulation study, 10,000 sets of random samples are drawn from MV N50(µX =

0,ΣX = 0.1J50 +0.9I50) with m = 3 for the first biological condition and from

85 Q−Q plot True expectation of ordered test statistics −2 −1 0 1 2

−2 −1 0 1 2

Expectation of ordered test statistics estimated by permutation

Figure 4.1: Q-Q plot of the true expected values of order statistics against the expected values estimated by permutation for unequal variance and sample sizes. Dashed line in the Q-Q plot is the 45 degree diagonal line.

86 Q−Q plot True expectation of ordered test statistics −1.0 −0.5 0.0 0.5 1.0

−1.0 −0.5 0.0 0.5 1.0

Expectation of ordered test statistics estimated by permutation

Figure 4.2: Q-Q plot of the true expected values of order statistics against the ex- pected values estimated by permutation for unequal correlations and sample sizes. Dashed line in the Q-Q plot is the 45 degree diagonal line.

MV N50(µY = 0,ΣY = 0.9J50 +0.1I50 ) with n = 6 for the second biological con- dition, independently. Figure 4.2 shows the Q-Q plot of the true expected values of order statistics against the expected values estimated by permutation. There is a clear difference between the true expected values of order statistics and the expected values estimated by permutation as shown in Figure 4.2.

87 4.2.2 Effect of higher order cumulants with equal sample sizes

In previous section (4.2.1), we showed the discrepancies between the true ex- pected values of order statistics and expected values of order statistics estimated by permutation when two independent multivariate normal distributions have unequal variance-covariance matrices and sample sizes. In this section, we will show the ef- fect of skewness and third order cross cumulants on the SAM procedure when the observations in both biological conditions have independent multivariate lognormal distributions.

Let X =(X ,...,X ) i.i.d MV Lognormal (µ ,Σ ) and Y =(Y ,...,Y ) i.i.d i i1 ik ∼ k X X j j1 jk ∼

MV Lognormalk(µY ,ΣY ), i =1, 2,...,m, j =1, 2,...,n, where MV Lognormalk(µµ,ΣΣ) denotes a k-dimensional multivariate lognormal distribution with logmean vector µ

and log variance-covariance matrix ΣΣ.

To discover the genes whose expression levels are significantly changed between

µ two biological conditions, one can test the null hypotheses H0 : E(X) = E(Y ).

Consider the following test statistics

T = X¯ Y¯ , (4.4) − where

T = T , T ,...,T { 1 2 k} X¯ = X¯ , X¯ ,..., X¯ { 1 2 k} Y¯ = Y¯ , Y¯ ,..., Y¯ { 1 2 k} T = X¯ Y¯ , l =1,...,k, l l − l ¯ m ¯ n where Xl = i=1 Xil/m and Yl = j=1 Yjl/n.

P P 88 X Y Suppose ΣX =(σlp ) and ΣY =(σlp) (l =1,...,k and p =1,...,k). Consider the following case: X Y σlp = σlp = σll if l = p σX = σY if l = p.  lp 6 lp 6 If the correlations within the logarithm of X are different from the correlations within the logarithm of Y , i.e. ρX = ρY , then we get σX = σY when l = p. lp 6 lp lp 6 lp 6 The mean vectors of X and Y are respectively

µX +(1/2)σX µX +(1/2)σX E(X)= E(X ),...,E(X ) = e 1 11 ,...,e k kk { 1 k } { } Y Y Y Y E(Y )= E(Y ),...,E(Y ) = eµ1 +(1/2)σ11 ,...,eµk +(1/2)σkk . { 1 k } { }

In addition, the variance, covariance, skewness and third order cross cumulants of

Xil and Yjl (l =1,...,k) are

X X X V ar(X )= e2µl +σll (eσll 1) il − Y Y Y V ar(Y )= e2µl +σll (eσll 1) jl − X X X X σX Cov(X ,X )= eµl +µp +(1/2)(σll +σpp)(e lp 1) il ip − Y Y Y Y σY Cov(Y ,Y )= eµl +µp +(1/2)(σll +σpp)(e lp 1) jl jp − X X 2 X 2 X 2 κ (X )= e3µl +(2/3)(σll ) (e3(σll ) 3e(σll ) + 2) 3 il − 3µY +(2/3)(σY )2 3(σY )2 (σY )2 κ (Y )= e l ll (e ll 3e ll + 2) 3 jl − X X X X X X σX +σX +σX σX X X Cum(X ,X ,X )= eµl +µp +µs +(1/2)(σll +σpp+σss)(e lp ls ps e lp eσls eσps + 2) il ip is − − − Y Y Y Y Y Y σY +σY +σY σY Y Y Cum(Y ,Y ,Y )= eµl +µp +µs +(1/2)(σll +σpp+σss)(e lp ls ps e lp eσls eσps + 2). jl jp js − − −

X Y When σll = σll for l =1,...,k, the mean vectors defined above can be written as

µX +(1/2)σ µX +(1/2)σ E(X)= e 1 11 ,...,e k kk { } Y Y E(Y )= eµ1 +(1/2)σ11 ,...,eµk +(1/2)σkk . { } 89 X Y µ Thus, only when σll = σll for l = 1,...,k, testing the null hypothesis H0 :

µ E(X)= E(Y ) is equivalent to testing the null hypothesis H0 : µX = µY .

Under the null hypothesis Hµ : E(X) = E(Y ), the test statistics T = X¯ 0 l l −

Y¯l, l = 1,...,k, have the first and second order cumulants for l = 1,...,k, shown as follows:

µX +(1/2)σX µY +(1/2)σY E(T )= E(X¯ Y¯ )= e l ll e l ll =0 l l − l − 1 1 V ar(T )= V ar(X¯ Y¯ )= V ar(X )+ V ar(Y ) l l − l m il n jl 1 X X X 1 Y Y X = e2µl +σll (eσll 1) + e2µl +σll (eσll 1) m − n − Cov(T , T )= Cov(X¯ Y¯ , X¯ Y¯ ) l p l − l p − p 1 1 = Cov(X ,X )+ Cov(Y ,Y ) m il ip n il ip 1 X X X X σX = eµl +µp +(1/2)(σll +σpp)(e lp 1) m − 1 Y Y Y Y σY + eµl +µp +(1/2)(σll +σpp)(e lp 1). n −

The third order cumulants are:

κ (T )= κ (X¯ Y¯ ) 3 l 3 l − l 1 1 = κ (X ) κ (Y ) m2 3 il − n2 3 jl 1 3µX +(2/3)(σX )2 3(σX )2 (σX )2 = e l ll (e ll 3e ll + 2) m2 − 1 Y Y 2 Y 2 Y 2 e3µl +(2/3)(σll ) (e3(σll ) 3e(σll ) + 2) − n2 − 1 1 Cum(T , T , T )= Cum(X ,X ,X ) Cum(Y ,Y ,Y ) l p s m2 il ip is − n2 jl jp js 1 X X X X X X σX +σX +σX σX X X = eµl +µp +µs +(1/2)(σll +σpp+σss)(e lp ls ps e lp eσls eσps + 2) m2 − − − 1 Y Y Y Y Y Y σY +σY +σY σY Y Y eµl +µp +µs +(1/2)(σll +σpp+σss)(e lp ls ps e lp eσls eσps + 2). − n2 − − −

90 If we resample X i, i =1,...,m and Y j, j =1,...,n, from the pooled sample

X ,...,XX ,YY ,...,YY , and recompute T = T ,...,T each time from resam- { 1 m 1 n} { 1 k} pled observation vectors, then, for a given permutation with r (r min(m, n)) vectors ≤ relabeled, the distribution of the test statistic T r = X¯ r Y¯ r has the first, second and l l − l µ third order cumulants as follows under the null hypothesis H0 : E(X)= E(Y ):

E(X + + X + Y + + Y ) E(T r)= l1 ··· l(m−r) l1 ··· lr l m E(Y + + Y + X + + X ) l1 ··· l(n−r) l1 ··· lr − n X X Y Y meµl +(1/2)σll neµl +(1/2)σll = m − n =0

X + + X + Y + + Y V ar(T r)= var( l1 ··· l(m−r) l1 ··· lr l m Y + + Y + X + + X l1 ··· l(n−r) l1 ··· lr ) − n m r r r n r =( − + )V ar(X )+( + − )V ar(Y ) m2 n2 il m2 n2 jl m r r X X X n r Y Y Y =( − + )e2µl +σll (eσll 1) + − )e2µl +σll (eσll 1) m2 n2 − n2 − m r r r n r Cov(T r, T r)=( − + )Cov(X ,X )+( + − )Cov(Y ,Y ) l p m2 n2 il ip m2 n2 jl jp m r r X X X X σX =( − + )eµl +µp +(1/2)(σll +σpp)(e lp 1) m2 n2 − r n r Y Y Y Y σY +( + − )eµl +µp +(1/2)(σll +σpp)(e lp 1) m2 n2 −

91 m r r r n r κ (T r)=( − )κ (X )+( − )κ (Y ) 3 l m3 − n3 3 il m3 − n3 3 jl m r r X X 2 X 2 X 2 =( − )e3µl +(2/3)(σll ) (e3(σll ) 3e(σll ) + 2) m3 − n3 − r n r Y Y 2 Y 2 Y 2 +( − )e3µl +(2/3)(σll ) (e3(σll ) 3e(σll ) + 2) m3 − n3 − 1 1 Cum(T r, T r, T r)= Cum(T , T , T ) r( + ) (Cum(X ,X ,X ) l p s l p s − m3 n3 · il ip is Cum(Y ,Y ,Y )) − jl jp js 1 1 = Cum(T , T , T ) r( + ) l p s − m3 n3 · X X X X X X σX +σX +σX σX X X [eµl +µp +µs +(1/2)(σll +σpp+σss)(e lp ls ps e lp eσls eσps + 2) − − − Y Y Y Y Y Y σY +σY +σY σY Y Y eµl +µp +µs +(1/2)(σll +σpp+σss)(e lp ls ps e lp eσls eσps + 2)]. − − − − When σX = σY for l = 1,...,k, the test statistics T = X¯ Y¯ , l = 1,...,k, ll ll l l − l µ have the cumulants that can be simplified as follows under the null hypothesis H0 :

E(X)= E(Y ):

E(T )= E(X¯ Y¯ )= eµl+(1/2)σll eµl+(1/2)σll =0 l l − l − 1 1 V ar(T )= V ar(X¯ Y¯ )=( + )e2µl+σll (eσll 1) l l − l m n − Cov(T , T )= Cov(X¯ Y¯ , X¯ Y¯ ) l p l − l p − p 1 X 1 Y = eµl+µp+(1/2)(σll+σpp)[ (eσlp 1) + (eσlp 1)]. m − n − 1 1 3µ +(3/2)σ2 3σ2 σ2 κ (T )=( )e l ll (e ll 3e ll + 2) 3 l m2 − n2 −

µl+µp+µs+(1/2)(σll +σpp+σss) Cum(Tl, Tp, Ts)= e

1 σX +σX +σX σX X X [ (e lp ls ps e lp eσls eσps + 2) · m2 − − − 1 σY +σY +σY σY Y Y (e lp ls ps e lp eσls eσps + 2)]. − n2 − − − The permuted test statistics T r = X¯ r Y¯ r, l = 1,...,k, have the following l l − l X Y µ simplified cumulants when σll = σll for l = 1,...,k under the null hypothesis H0 :

92 E(X)= E(Y ):

E(X + + X + Y + + Y ) E(T r)= l1 ··· l(m−r) l1 ··· lr l m E(Y + + Y + X + + X ) l1 ··· l(n−r) l1 ··· lr − n meµl+(1/2)σll neµl+(1/2)σll = m − n =0

X + + X + Y + + Y V ar(T r)= var( l1 ··· l(m−r) l1 ··· lr l m Y + + Y + X + + X l1 ··· l(n−r) l1 ··· lr ) − n 1 1 =( + )e2µl+σll (eσll 1) m n − m r r X r n r Y Cov(T r, T r)= eµl+µp+(1/2)(σll +σpp)[( − + )(eσlp 1)+( + − )(eσlp 1)] l p m2 n2 − m2 n2 −

1 1 2 2 2 κ (T r)=( )e3µl+(3/2)σll (e3σll 3eσll + 2) 3 l m2 − n2 − 1 1 Cum(T r, T r, T r)= Cum(T , T , T ) r( + ) [Cum(X ,X ,X ) l p s l p s − m3 n3 · il ip is Cum(Y ,Y ,Y )] − jl jp js 1 1 = Cum(T , T , T ) r( + )eµl+µp+µs+(1/2)(σll +σpp+σss) l p s − m3 n3 σX +σX +σX σX X X σY +σY +σY σY Y Y [(e lp ls ps e lp eσls eσps ) (e lp ls ps e lp eσls eσps )]. · − − − − − − −

Direct consequences from the above calculations are:

r X Y (1) V ar(Tl)= V ar(Tl ) when m = n or σll = σll .

X Y r r (2) If m = n and σll = σll , then Cov(Tl, Tp)= Cov(Tl , Tp ). (3) If m = n and σX = σY , then κ (T ) = κ (T r) = 0; If m = n, but σX = σY , ll ll 3 l 3 l 6 ll ll then κ (T )= κ (T r) = 0; If m = n and σX = σY , then κ (T ) = κ (T r). 3 l 3 l 6 ll 6 ll 3 l 6 3 l 93 (4) Even when m = n and σX = σY , Cum(T , T , T ) = Cum(T r, T r, T r) as a ll ll l p s 6 l p s consequence of Cum(X ,X ,X ) = Cum(Y ,Y ,Y ). il ip is 6 jl jp js Thus, if two joint distributions have different skewness or third order cross cumu- lants, the permutation and the true distributions of T are different under the null

µ hypothesis H0 : E(X) = E(Y ). We will conduct simulation studies to show the discrepancies between the true expected values of order statistics and expected val- ues of order statistics estimated by permutation when two independent multivariate lognormal distributions have unequal skewness and third order cross cumulants.

In the first simulation study, two groups have unequal skewness with the same sample sizes. We generate 10,000 sets of random samples for both biological conditions

(Group X and Group Y ) from multivariate lognormal distributions with the following mean vectors and variance-covariance matrices in the logarithm scale.

µ = 0.4 , Σ =1.6I , m = 5; X − 50 X 50

µY =00.250, ΣY =0.4I 50, n =5.

Figure 4.3 shows the Q-Q plot of the true expected values of the order statistics versus the expected values of the order statistics estimated by permutation. Appar- ently, the true expected values of order statistics and the expected values estimated by permutation are different.

In the second simulation study, we generate 10,000 sets of random samples for both biological conditions (Group X and Group Y ) from multivariate lognormal dis- tributions with the following mean vectors and variance-covariance matrices in the

94 Q−Q plot True expectation of ordered test statistics −2 0 2 4

−4 −2 0 2 4

Expectation of ordered test statistics estimated by permutation

Figure 4.3: Q-Q plot of the true expected values of order statistics against the ex- pected values estimated by permutation for unequal skewness. Dashed line in the Q-Q plot is the 45 degree diagonal line.

95 logarithm scale.

µX =1150, ΣX =0.1I 50 +0.9J 50, m = 5;

µY =1150, ΣY = I 50, n =5.

The two multivariate lognormal distributions have unequal third order cross cu- mulants. Figure 4.4 shows the Q-Q plot of the true expected values of the order statistics versus the expected values of the order statistics estimated by permutation.

There is a difference between the true expected values of order statistics and the expected values estimated by permutation.

4.3 Conditions for controlling the expected number of false rejections in SAM

We showed the situations that the expected values of order statistics estimated by permutation are different from the true expected values of order statistics. SAM shows invalid results in those situations. In this section, we will derive the conditions for controlling the expected number of false rejections in SAM.

To identify significantly differentially expressed genes, the SAM procedure pro- ceeds as follows:

Consider testing H1,H2,...,Hk using the corresponding test statistics T1, T2,...,Tk

(T = X¯ Y¯ ). Let T T T be the ordered test statistics, and H de- i i − i (1) ≤ (2) ≤···≤ (k) (i) note the null hypothesis corresponding to T(i). For a given threshold ∆, the procedure for finding significant positives is:

Let i be the smallest i for which T E(T ) > ∆; 1 (i) − (i)

then reject all H(i) i = i1, i1 +1,...,k.

96 Q−Q plot True expectation of ordered test statistics −10 −8 −6 −4 −2 0 2 4

−5 0 5

Expectation of ordered test statistics estimated by permutation

Figure 4.4: Q-Q plot of the true expected values of order statistics against the ex- pected values estimated by permutation for unequal third order cross cumulants. Dashed line in the Q-Q plot is the 45 degree diagonal line.

97 Similarly, the procedure for finding significant negatives is:

Let i be the largest i for which E(T ) T > ∆; 2 (i) − (i)

then reject all H(i) i =1, 2,...,i2.

For a given data set, the upper cut-point and the lower cut-point according to

Tusher et al. (2001) and Chu et al. (2000) are defined as

Cutup(∆) = min(t(i1), t(i1+1),...,t(k))= t(i1) and

Cutlow(∆) = max(t(1), t(2),...,t(i2))= t(i2), where t ,...,t , t ,...,t , t ,...,t are the observed order statistics. { (1) (i2) (i2+1) (i1−1) (i1) (k)} Thus, for a given threshold,

Reject H if T Cut (∆) (t ) or T Cut (∆) (t ) (i) (i) ≥ up (i1) (i) ≤ low (i2) is equivalent to

Reject H if T > E(T )+∆ or T < E(T ) ∆. (i) (i) (i1) (i) (i2) −

Let S = i : H is true . There are k true null hypotheses. Then, the number 0 { (i) } 0 of false rejections V for a given threshold ∆ can be expressed as

V = I T > E(T ) + ∆ or T < E(T ) ∆ (4.5) { (i) (i1) (i) (i2) − } i∈S X0

For the bth permutation, the number of false rejections Vb for the same given threshold ∆ has the following expression:

k V = I T b > E(T ) + ∆ or T b < E(T ) ∆ (4.6) b { (i) (i1) (i) (i2) − } i=1 X

98 To simplify our expression, let

p = Pr T > E(T ) + ∆ or T < E(T ) ∆ , i { (i) (i1) (i) (i2) − } pb = Pr T b > E(T ) + ∆ or T b < E(T ) ∆ , i { (i) (i1) (i) (i2) − } k b pi p p¯ = i∈S0 , p¯b = i=1 i . k k P 0 P Then,

E(V )= E( I T > E(T ) + ∆ or T < E(T ) ∆ ) { (i) (i1) (i) (i2) − } i∈S X0 = Pr T > E(T ) + ∆ or T < E(T ) ∆ { (i) (i1) (i) (i2) − } i∈S X0 = k0p,¯ (4.7) and

k E(V )= E( I T b > E(T ) + ∆ or T b < E(T ) ∆ ) b { (i) (i1) (i) (i2) − } i=1 k X = Pr T b > E(T ) + ∆ or T b < E(T ) ∆ { (i) (i1) (i) (i2) − } i=1 X = kp¯b. (4.8)

Assume all test statistics T1, T2,...,Tk are independent, identically distributed with pdf f and CDF F . Let c = E(T ) + ∆ and c = E(T ) ∆. Then the T T 1 (i1) 2 (i2) − number of false rejections V has the following mean:

E(V )= Pr T >c or T

99 The number of false rejections from bth permutation Vb has the expected value

E(Vb):

k E(V )= Pr T b >c or T b

If the true and permutation distribution of test statistics T can be described in

b terms of cumulants κa(T ) and κa(T ), then the following results can be established based on Theorem 2.2 in Huang et al. (2006)’s paper:

1 ( 1)a κ (T b)= κ (T ) b( − )(κ (F ) κ (F )). (4.11) a a − ma − na a X − a Y

Thus, E(V )= E(Vb) only when the following conditions are satisfied:

(1) k0 = k (when all the null hypotheses are true),

(2) m = n for even order cumulants, or κa(FX )= κa(FY ) for odd order cumulants.

The counter examples given in the previous sections do not satisfy the above conditions. Therefore, when the above conditions are not satisfied, E(V ) can not be controlled by the SAM procedure.

4.4 An adaptive two-step procedure controlling the expected number of false rejections

Gordon et al. (2007) proposed that the Bonferroni procedure can control E(V ) at a pre-specified number of false rejections γ. The Bonferroni procedure compares each observed P -value with γ/k, and rejects the null hypothesis Hi if Pi < γ/k. The simulation studies conducted by Gordon et al. (2007) showed that the Bonferroni

100 procedure has an equivalent power, and a smaller variance of both the number of true

rejections and the total number of rejections compared to the Benjamini-Hochberg

procedure (Benjamini and Hochberg (1995)).

The proof of Bonferroni procedure controlling E(V ) is straightforward as follows:

γ E(V )= E( I P γ/k )= Pr P γ/k k γ, (4.12) { i ≤ } { i ≤ } ≤ 0 k ≤ i∈S i∈S X X where S is the set of indices of true null hypotheses, and Pi are observed P-values that are uniformly distributed on [0, 1].

A sharp bound can be obtained using the Bonferroni procedure described above when all null hypotheses are true (k0 = k). When some null hypotheses are false, the Bonferroni procedure is conservative by a factor of k0/k. Similarly, the FDR

controlling procedures, such as the Benjamini-Hochberg procedure, are conservative

by a factor of k0/k when k0 < k. Extensive efforts have been taken to estimate k0

from the data in order to sharp the bound in FDR controlling procedures (Schweder

and Spjotvoll (1982), Benjamini and Hochberg (2000), Efron et al. (2001), Storey

(2002), Storey and Tibshirani (2003b), Storey and Tibshirani (2003a), Storey et al.

(2004)). Benjamini et al. (2006) reviewed the previous adaptive FDR controlling

procedures (estimating k0 from the data), and further proposed a two-stage linear

step-up procedure (TST) as follows:

Step 1. Use the Benjamini-Hochberg linear step-up procedure (Benjamini and

′ Hochberg (1995)) at level q = q/(1+q). Let r1 be the number of rejected hypotheses.

If r1 = 0, do not reject any hypothesis and stop; if r1 = k, reject all k hypotheses and stop; otherwise continue.

Step 2. Let kˆ = k r , use the Benjamini-Hochberg linear step-up procedure at 0 − 1 ∗ ′ level q = q k/kˆ0.

101 Benjamini et al. (2006) proved that when the test statistics are independent, the above linear step-up procedure (TST) could control FDR at level q. Their proof for controlling FDR is summarized as follows:

According to Benjamini and Yekutieli (2001), the FDR of any multiple testing procedure can be expressed as

k0 k 1 F DR = Pr l hypotheses are rejected one of which is H l { 0i} i=1 X Xl=1 k 1 = k Pr l hypotheses are rejected one of which is H 0 l { 01} l=1 X (1) = k0EP (1) Q(P ). (4.13)

Conditioning on P (1), the vector of P-values corresponds to m 1 hypotheses − (1) excluding H01. Q(P ) is defined as

k (1) 1 Q(P )= Pr (1) l hypotheses are rejected one of which is H , l P01|P { 01} Xl=1 where P01 is the P-value associated with H01.

(1) For each value of P , let r(P01) denote the number of hypotheses that are rejected

as a function of P01, and l(P01) be the indicator that H01 is rejected as a function of

P01. When r(p) is a non-increasing function and l(p) takes the form of 1[0,p∗], where

p∗ p∗(P (1)) satisfies p∗ qr(p∗)/kˆ (p∗), Q(P (1)) can be expressed as ≡ ≤ 0 ∗ (1) l(P01) (1) l(p) p q Q(P )= E( P )= dµ01(p) ∗ , (4.14) r(P01)| r(p) ≤ r(p ) ≤ kˆ p∗ Z 0( ) where, by assumed independence, µ01 is the marginal distribution of P01. In contin- uous case, µ01 is a uniform distribution on [0, 1]. In discrete case, it is stochastically

larger than the uniform. Thus, FDR has the following upper bound

k0 F DR qE (1) ( ). (4.15) P ∗ ≤ kˆ0(p ) 102 ˆ ˆ ˆ In the two-stage procedure, k0 can only be one of two values, k0(0) or k0(1).

For P r q′ /k, H is rejected at both stages of the two-stage procedure, and 01 ≤ 1 01 ′ kˆ0 = kˆ0(1). For P01 > r1q /k, H01 is not rejected at the first stage, and hence

kˆ = kˆ (0). As long as P r(P )q′ /kˆ (1), however, H is rejected at the second 0 0 01 ≤ 01 0 01 ∗ stage, and kˆ0(p )= kˆ0(1). We will have

q′ Q(P (1)) . (4.16) ≤ kˆ0(1) ˆ ′ If k0(1) = k, then for P01 > r1q /k, the second stage of the testing procedure is identical to the first stage. Thus, H01 is no longer rejected, and kˆ0 = kˆ0(0). As

p∗ = r q′ /k and r r(P ∗), however, we will have 1 1 ≤ ∗ ′ ′ ′ (1) p r1q /k q q Q(P ) ∗ = = . (4.17) ≤ r(p ) ≤ r1 k kˆ0(1)

As kˆ (1) is stochastically larger than Y + 1, where Y Binomial(k 1, 1 0 ∼ 0 − − q/(q + 1)) and E 1/(Y + 1) < 1/(k /(1 + q)), the above inequality yields { } 0

(1) q k0 q k0 F DR k0EP (1) Q(P ) EP (1) = q. (4.18) ≤ ≤ 1+ q Y +1 ≤ 1+ q k0/(1 + q)

A simulation study was conducted by Benjamini et al. (2006) to compare various adaptive FDR-controlling procedures in terms of the control of FDR and the power.

The simulation results showed that, in the independent case, FDR can not be con- trolled at a nominal level using the adaptive linear step-up procedure incorporated in recent version of SAM software (Storey (2002), Storey and Tibshirani (2003b), Storey and Tibshirani (2003a)) and the Median adaptive linear step-up procedure (Storey

(2002)). When the test statistics are positively correlated, among all proposed adap- tive procedures, only the two-stage linear step-up procedure (TST) can control FDR at the nominal level. Their simulation results also showed that the variance of kˆ0

103 in the two-stage linear step-up procedure was smaller than that in Benjamini and

Hochberg (2000) and that in Storey et al. (2004).

To control E(V ), k0 can be estimated from the data to adjust the conservative

factor in the Bonferroni procedure. We propose the following adaptive two-step pro-

cedure to control the expected number of false positives at a pre-specified number µ

(0 <µ

′ Step 1. Compare each P-value with µ = µ (1 µ ). Let r be the number of rejected k k − k hypotheses. If r = 0, do not reject any null hypothesis and stop; If r = k, reject all k null hypotheses and stop; otherwise, go to the second step.

′ ′ ˆ µ µ Step 2. Let k0 = k r, and compare each P-value with ˆ = . − k0 k−r

Theorem 4.2. The adaptive two-step procedure described above controls E(V ) at the pre-specified number µ.

Proof. Let Bv denote the event that the Bonferroni procedure rejects V true null hypotheses. Then E(V ) can be written as

k0

E(V )= vPr(Bv). v=1 X

104 For a fixed v, let s denote a subset of 1,...,k of size v, and Bs denote the { 0} v

event in Bv that the v true null hypotheses rejected are s. Then, we have

k0 µ k0 µ Pr( P B )= Pr( P Bs) { i ≤ k } ∩ v { i ≤ k } ∩ v i=1 i=1 s X X X k0 µ = Pr( P Bs) { i ≤ k } ∩ v s i=1 X X k0 = I i s Pr(Bs) { ∈ } v s i=1 X X s = v Pr(Bv) s X = vPr(Bv). (4.19)

Thus, 1 k0 µ Pr(B )= Pr( P B ). v v { i ≤ k } ∩ v i=1 X Therefore, the E(V ) is

k0 k0 µ E(V )= Pr( P B ) { i ≤ k } ∩ v v=1 i=1 X X k0 k0

= Pr(V true null hypotheses are rejected, one of which is H0i) v=1 i=1 X X k0

= k0 Pr(V true null hypotheses are rejected, one of which is H01). v=1 X The second equality follows as the exchangeability of the problem in P-values corresponding to the k0 true null hypotheses.

Let P be the P-value associated with H . We have k 1 since otherwise 01 01 0 ≥ E(V )=0. Let P (01) be the vector of P-values corresponding to the remaining k 1 0 − (01) true null hypotheses excluding H01. Conditioning on P , E(V ) can be expressed as

(01) E(V )= k0EP (01) W (P ), (4.20)

105 where W (P (01)) is defined as

k0 (01) (01) W (P )= PrP01|P (V true null hypotheses are rejected, one of which is H01). v=1 X (4.21)

(01) For each value of P , let l(P01) be the indicator that H01 is rejected, as a function

of P01. Then

W (P (01))= E(l(P ) P (01))= l(p)dx (p), 01 | 01 Z where x01 is the marginal distribution of P01, which is just Uniform(0, 1) in continu-

ous cases, and stochastically larger than the uniform in discrete cases. l(p) takes the

∗ ∗ (01) ∗ µ ∗ form 1[0,p ], where p p (P ) satisfies p ˆ ∗ . Note that l(P01)=1 as long as ≡ ≤ k0(p ) µ P01 ˆ . Then, we have ≤ k0(P01)

(01) ∗ µ W (P )= l(p)dx01(p) p (4.22) ≤ ≤ kˆ p∗ Z 0( ) and the following upper bound for E(V ):

k0 E(V ) µE (01) ( ). (4.23) P ˆ ∗ ≤ k0(p )

′ In our two-step procedure, each P-value is compared with µ = µ (1 µ ), r is the k k − k number of rejected null hypotheses at the first step, and kˆ = k r. kˆ can only be 0 − 0 one of two values kˆ (0) or kˆ (1). For P µ′ /k, H is rejected at both steps of the 0 0 01 ≤ 01 ′ two-step procedure, and kˆ0 = kˆ0(1). For P01 > µ /k, H01 is not rejected at the first

step, and hence kˆ = kˆ (0). As long as P µ′ /kˆ (1), however, H is rejected at 0 0 01 ≤ 0 01 ∗ the second step, and thus kˆ0(p )= kˆ0(1). We will have

µ′ W (P (01)) . (4.24) ˆ ≤ k0(1)

′ If kˆ0(1) = k, then for P01 > µ /k, the second step of the testing procedure is identical to the first step. Thus, H01 is no longer rejected, and kˆ0 = kˆ0(0). As

106 p∗ = µ′ /k, we have µ′ µ′ W (P (01)) p∗ = = . (4.25) ≤ k kˆ0(1) As kˆ (1) is stochastically larger than D +1, where D Binomial(k 1, 1 µ′ /k) 0 ∼ 0 − − 1 1 and E < ′ , the above inequality yields { D+1 } k0(1−µ /k)

(01) ′ k0 ′ k0 E(V )= k0EP (01) W (P ) µ EP (01) µ EP (01) ≤ kˆ0(1) ≤ D +1 µ k µ(1 ) 0 µ. ≤ − k k (1 µ (1 µ )) ≤ 0 − k − k

µ(1−µ/k) In the adaptive two-step procedure, the critical value in the second step is k−r , which is greater than µ if µ

than the Bonferroni procedure.

If the joint distribution of test statistics can be estimated, more powerful multiple testing procedures can be established when the test statistics are correlated. Resam- pling methods can be used to estimate the distribution of test statistics. Yekutieli and Benjamini (1999) reported that, for correlated test statistics, the resampling- based multiple testing procedures not only control FDR at nominal level but also have higher power than other FDR controlling procedures.

Yekutieli and Benjamini (1999) proposed two resampling-based FDR local esti- mators: a point estimator and a 1 β FDR upper limit based on the number of − rejections from each resample R∗, the number of rejections from original data r, and

r∗ which is the 1 β quantile of R∗. β − R∗(p) ∗ ∗ ∗ ER R∗(p)+r(p)−p·k , if r(p) rβ(p) p k F DR (p)= ∗ − ≥ · Pr ∗ (R (p) 1), otherwise ( R ≥ 107 is the point estimator of FDR and

R∗(x) ∗ ∗ ER∗ ∗ ∗ , if r(x) rβ(x) > 0 F DR (p)= sup R (x)+r(x)−rβ (x) − β x∈[0,p] ∗ ( PrR∗ (R (x) 1), otherwise ≥ is the 1 β FDR upper limit. − The resampling-based FDR multiple testing procedures are as follows:

find k = max F DR∗(p) q , then reject H ,...,H . q k{ ≤ } 0(1) 0(kq)

Similarly, multiple testing procedures controlling E(V ) can be explored based on

resampling methods.

4.5 Discussion

By showing the difference between the true joint distribution of test statistics

and the joint distribution of test statistics estimated by permutation, and conducting

simulation studies, we explored the discrepancies between the true expected values of

order statistics and the expected values of order statistics estimated by permutation.

Then, we derived the formulas for both the true expected number of false rejections

and the permutation-based expected number of false rejections. The expected values

for both the number of false rejections and permutation-based number of false re-

jections were also derived for independent test statistics. The formulas clearly show

the evidence for overestimating Fdr by SAM. Based on the formulas, the conditions

for SAM to control the expected number of false rejections were given to provide

researchers guidance when using SAM for their microarray data analysis.

kˆ0 In SAM 2.20, the estimation of the proportion of true null hypothesis (ˆπ0 = k ) was used to improve the Fdr estimation by SAM. However, the Fdr is still greater than the nominal level according to the simulation studies by Benjamini et al. (2006).

108 The formulas for calculating the expected number of false rejections showed that the expected number of false rejections is a function of the joint distribution of test statistics. As Efron (2000) pointed out, the correlations among test statistics can considerably widen or narrow the null distribution of test statistics. At the same time, there is a strong dependence of the true false discovery proportion on the dispersion variable. The dispersion variable is a function of the correlations between test statistics. How the variability of the number of false rejections (V ) is affected by the correlations between genes need to be further investigated.

We proposed an adaptive two-step procedure to control E(V ) at a predetermined value of µ, which has larger critical values compared to the Bonferroni procedure proposed by Gordon et al. (2007). Therefore, the adaptive two-step procedure is more powerful than the Bonferroni procedure.

Yekutieli and Benjamini (1999) proposed to improve the power of the FDR con- trolling procedures using resampling methods. Likewise, exploring resampling-based

E(V ) controlling procedures is an interesting topic for future research.

109 CHAPTER 5

CONCLUDING REMARKS

In microarray studies with two-group comparisons, the sample size of each group is typically small and the distributions of test statistics are usually unknown. Thus, resampling methods are widely used in microarray data analysis. For the same paired data set with a small sample size of two or three, permutation tests are very unlikely to give small P-values. On the contrary, the post-pivot and pre-pivot resampling methods are likely to give small P-values even adjusted for multiplicity.

The above contradictive result was addressed in chapter 2. At first, for paired samples with a sample size of two or three in each group, the necessary and suffi- cient conditions for obtaining zero adjusted P-values were derived for the post-pivot and pre-pivot resampling methods. In addition, the test statistics null distribution estimated by the post-pivot resampling method was shown to be the same as that estimated by the pre-pivot resampling method for paired samples.

The discreteness of the test statistic’s null distribution, estimated by the resam- pling methods, was further explored by calculating the maximum number of unique resampled test statistic values. The mathematical formulas for calculating the maxi- mum number of unique resampled test statistic values were derived for two-group com- parisons, fixed-effects general linear models, and general linear mixed-effects models

110 using three resampling methods. According to the formulas, the pre-pivot resampling

method always gives the largest maximum number of unique resampled test statistic

values among all three resampling methods. Therefore, the P-values computed by

the pre-pivot resampling method are more reliable than the P-values computed by

the permutation test and post-pivot resampling method.

To estimate the test statistic’s null distribution, resampling methods are widely used in various hypothesis testing procedures. But researchers tend to ignore the conditions for resampling methods to control the multiple testing error rates when testing multiple null hypotheses simultaneously, resulting in inflated type I error rates. To control FWER at a desired level α in fixed-effects general linear models, the conditions were explored in chapter 3 for the permutation test, the post-pivot resampling method and the pre-pivot resampling method.

In two-group comparisons, to control the multiple testing error rates at a de-

sired level α, Xu and Hsu (2007) showed that permutation tests need to satisfy the Marginals-equalities-Determine-Joint-equalities (MDJ) condition to connect the

marginal distributions (the null hypotheses) and the joint distributions (the assump-

tion made by permutation tests). In fixed-effects general linear models, the conditions

for permutation tests to control FWER were derived for the first time based on the

Partitioning Principle, a basic multiple testing principle. The conditions are: (1) the

errors of the fixed-effects general linear models are i.i.d., and (2) the test statistics are simply the ordinary least square (OLS) estimates.

The conditions for the post-pivot resampling method and pre-pivot resampling

method to control FWER asymptotically were also derived based on the Partitioning

Principle for fixed-effects general linear models. The conditions are: (1) the errors of

111 the fixed-effects general linear models are i.i.d., and (2) 1 X ′ X V , where X is a n → design matrix and V is a positive definite matrix.

The Significance Analysis of Microarrays (SAM) procedure was first proposed by

Tusher et al. (2001) to identify genes with statistically significant changes in expres- sion, which is based on the Q-Q plot of the observed relative differences versus the expected relative differences estimated from permutation. SAM is frequently used in the biological sciences to identify differentially expressed genes in microarray experi- ments. Much literature (Pan (2003), Dudoit et al. (2003), Xie et al. (2005), Larsson et al. (2005), Zhang (2007)), however, has shown that SAM can not control the Fdr at desired nominal levels. The reason for this lack of control was explored by showing the discrepancies between the true expected values of order statistics and the expected values estimated by permutation. To provide a good reference for researchers who are willing to use SAM to find differentially expressed genes in microarray experiments, the situations when SAM gives invalid results were explored in chapter 4.

SAM can not give correct reference distributions in the following cases:

Case 1: Two independent data generating distributions have unequal variances and sample sizes.

Case 2: Two independent data generating distributions have unequal correlations and sample sizes.

Case 3: Two independent data generating distributions have unequal marginal skewness.

Case 4: Two independent data generating distributions have unequal third order cross cumulants.

The SAM procedure should be avoided in the above cases.

112 The conditions for SAM to control the expected number of false rejections are:

(1) k0 = k (When all the null hypotheses are true), and

(2) m = n (equal sample size) for even order cumulants or κa(FX ) = κa(FY ) for odd order cumulants.

Gordon et al. (2007) proposed that the Bonferroni procedure can control the ex- pected number of false rejections at certain number. Based on the Bonferroni proce- dure, an adaptive two-step procedure was proposed to control the expected number of false rejections at a predetermined number µ, which has larger critical values than the

Bonferroni procedure, resulting in more powerful test than the Bonferroni procedure.

The adaptive two-step procedure based on the Bonferroni procedure is as follows:

′ Step 1. Compare each P-value with µ = µ (1 µ ). Let r be the number of rejected k k − k hypotheses. If r = 0, do not reject any hypothesis and stop; If r = k, reject all k hypotheses and stop; otherwise, go to the second step.

′ ˆ µ Step 2. Let k0 = k r and compare each P-value with ˆ . − k0 With more probes being put on microarrays, ChIP-chip (a technique that com- bines chromatin immunoprecipitation (ChIP) with microarray technology (chip)) ex- periments are being used to find transcription factor target genes in genomes to build the regulatory network in plants, animals and human beings. In contrast to microar- ray data analysis, one-sided testing (looking for enrichments) is used in ChIP-chip data analysis, and the distribution of probe intensities is usually skewed to the right.

Buck and Lieb (2004), Buck et al. (2005), Hong et al. (2005), Smith et al. (2005), and

Keles et al. (2006) have done some exciting work on ChIP-chip data analysis, but the methods for controlling the type I error rate are very conservative and the powers are very low in their studies. One of interesting future research topics is applying the

113 Partitioning Principle and resampling techniques to the ChIP-chip data sets to find more powerful testing procedures.

114 REFERENCES

Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), 57(1):289 – 300. Benjamini, Y. and Hochberg, Y. (2000). On the adaptive control of the false discovery rate in multiple testing with independent statistics. Journal of Educational and Behavioral Statistics, 25:60 – 83. Benjamini, Y., Krieger, A. M., and Yekutieli, D. (2006). Adaptive linear step-up procedures that control the false discovery rate. , 93(3):491 – 507. Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. The , 29(4):1165 – 1188. Berger, J. O. (1993). Statistical decision theory and Bayesian analysis. Springer, 2nd edition. Bickel, P. J. and Freedman, D. A. (1981). Some asymptotic theory for the bootstrap. The Annals of Statistics, 9(6):1196 – 1217. Buck, M. J. and Lieb, J. D. (2004). ChIP-chip: considerations for the design, analy- sis, and application of genome-wide chromatin immunoprecipitation experiments. Genomics, 83:349 – 360. Buck, M. J., Nobel, A. B., and Lieb, J. D. (2005). ChIPOTle: a user-friendly tool for the analysis of ChIP-chip data. Genome Biology, 6:R97:doi:10.1186/gb–2005– 6–11–r97. Buyse, M., Loi, S., van’t Veer, L., Viale, G., Delorenzi, M., Glas, A. M., d’Assignies, M. S., Bergh, J., Lidereau, R., Ellis, P., Harris, A., Bogaerts, J., Therasse, P., Floore, A., Amakrane, M., Piette, F., Rutgers, E., Sotiriou, C., Cardoso, F., and Piccart, M. J. (2006). Validation and clinical utility of a 70-gene prognostic signa- ture for women with node-negative breast cancer. Journal of the National Cancer Institute, 98(17):1183 – 1192. Calian, V., Li, D., and Hsu, J. C. (2008). Partitioning to uncover conditions for permutation tests to control multiple testing error rates. Biometrical Journal, 50(5):756 – 766.

115 Casella, G. and Berger, R. L. (1990). . California: Wadsworth, Pacific Grove.

Chu, G., Goss, V., Narasimhan, B., and Tibshirani, R. (2000). Sam ”Significant Analysis of Microarrays” - Users guide and technical document. Technical report, Stanford University.

Churchill, G. A. and Doerge, R. W. (1994). Empirical threshold values for quantitative trait mapping. Genetics, 138:963 – 971.

Davison, A. C. and Hinkley, D. V. (1997). Bootstrap methods and their application. Cambridge University Press.

Dudoit, S., Shaffer, J. P., and Boldrick, J. C. (2003). Multiple hypothesis testing in microarray experiments. Statistical Science, 18(1):71 – 103.

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7(1):1 – 26.

Efron, B. (2000). Correlation and large-scale simultaneous significance testing. Jour- nal of the American Statistical Association, 102(477):93 – 103.

Efron, B. and Tibshirani, R. J. (1994). An introduction to the bootstrap. Chapman & Hall/CRC.

Efron, B., Tibshirani, R. J., Storey, J. D., and Tusher, V. (2001). Empirical Bayes analysis of microarray experiment. Journal of American Statistical Association, 96:1151 – 1160.

Ein-Dor, L., Kela, I., Getz, G., Givol, D., and Domany, E. (2005). Outcome signature genes in breast cancer: Is there a unique set? Bioinformatics, 21:171 – 178.

Finner, H. and Strassburger, K. (2002). The partitioning principle: A powerful tool in multiple decision theory. Annuals of Statistics, 30:1194 – 1213.

Freedman, D. A. (1981). Bootstrapping regression models. The Annals of Statistics, 9(6):1218 – 1228.

Glas, A. M., Floore, A., Delahaye, L. J., Witteveen, A. T., Pover, R. C., Bakx, N., Lahti-Domenici, J. S., Bruinsma, T. J., Warmoes, M. O., Bernards, R., Wessels, L. F., and Van ’t Veer, L. J. (2006). Converting a breast cancer microarray signature into a high-throughput diagnostic test. BMC Genomics, 7:278.

Good, P. I. (2005). Permutation, parametric and bootstrap tests of hypotheses. Springer, 3rd edition.

116 Gordon, A., Glazko, G., Qiu, X., and Yakovlev, A. (2007). Control of the mean number of false discoveries, bonferroni and stability of multiple testing. The Annals of Applied Statistics, 1(1):179 – 190.

Hall, P. (1986). On the bootstrap and confidence intervals. The Annals of Statistics, 14(4):1431 – 1452.

Hehir-Kwa, J., Egmont-Petersen, M., Janssen, I., Smeets, D., Geurts van Kessel, A., and Veltman, J. (2007). Genome-wide copy number profiling on high-density bac- terial artificial chromosomes, single-nucleotide polymorphisms, and oligonucleotide microarrays: A platform comparison based on statistical power analysis. DNA Research, 14:1 – 11.

Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75:800 – 802.

Hochberg, Y. and Tamhane, A. C. (1987). Multiple comparison procedures. New York: Wiley.

Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6:65 – 70.

Hong, P., Liu, X. S., Zhou, Q., Lu, X., Liu, J. S., and Wong, W. H. (2005). A boosting approach for motif modeling using ChIP-chip data. Bioinformatics, 21(11):2636 – 2643.

Huang, Y. and Hsu, J. C. (2007). Hochberg’s step-up method: Cutting corners off Holm’s step-down method. Biometrika, 94:965 – 975.

Huang, Y., Xu, H., Calian, V., and Hsu, J. C. (2006). To permute or not to permute. Bioinformatics, 22:2244 – 2248.

Keles, S., Van Der Laan, M. J., Dudoit, S., and Cawley, S. E. (2006). Multiple testing methods for ChIP-chip high density oligonucleotide array data. Journal of Computational Biology, 13(3):579 – 613.

Kulesh, D. A., Clive, D. R., Zarlenga, D. S., and Greene, J. J. (1987). Identification of interferon-modulated proliferation-related cDNA sequences. Proceedings of the National Academy of Sciences, 84:8453 – 8457.

Lander, E. S. and Botstein, D. (1989). Mapping mendelian factors underlying quan- titative traits using RFLP linkage maps.

Larsson, O., Wahlestedt, C., and Timmons, J. A. (2005). Considerations when using the significance analysis of microarrays (SAM) algorithm. BMC Bioinformatics, 6:129 – 134.

Lashkari, D. A., DeRisi, J. L., McCusker, J. H., Namath, A. F., Gentile, C., Hwang, S. Y., Brown, P. O., and Davis, R. W. (1997). Yeast microarrays for genome wide parallel genetic and gene expression analysis. Proceedings of the National Academy of Sciences, 94:13057 – 13062.

Lee, M. T., Kuo, F. C., Whitmore, G. A., and Sklar, J. (2000). Importance of replication in microarray gene expression studies: Statistical methods and evidence from repetitive cDNA hybridizations. Proceedings of the National Academy of Sciences, 97(18):9834 – 9839.

Lehmann, E. L. (1999). Elements of large-sample theory. New York: Springer.

Lehmann, E. L. and Romano, J. P. (2005). Generalizations of the familywise error rate. The Annals of Statistics, 33(3):1138 – 1154.

Liu, Y., Smith, M. R., and Rangayyan, R. M. (2004). The application of Efron's bootstrap methods in validating feature classification using artificial neural networks for the analysis of mammographic masses. Engineering in Medicine and Biology Society, 1:1553 – 1556.

Marcus, R., Peritz, E., and Gabriel, K. R. (1976). On closed testing procedures with special reference to ordered analysis of variance. Biometrika, 63(3):655 – 660.

Mei, R., Galipeau, P. C., Prass, C., Berno, A., Ghandour, G., Patil, N., Wolff, R. K., Chee, M. S., Reid, B. J., and Lockhart, D. J. (2000). Genome-wide detection of allelic imbalance using human SNPs and high-density DNA arrays. Genome Research, 10:1126 – 1137.

Pan, W. (2003). On the use of permutation in and the performance of a class of nonparametric methods to detect differential gene expression. Bioinformatics, 19(11):1333 – 1340.

Pan, W., Lin, J., and Le, C. T. (2003). A mixture model approach to detecting differentially expressed genes with microarray data. Functional & Integrative Genomics, 3:117 – 124.

Pollack, J. R., Perou, C. M., Alizadeh, A. A., Eisen, M. B., Pergamenschikov, A., Williams, C. F., Jeffrey, S. S., Botstein, D., and Brown, P. O. (1999). Genome-wide analysis of DNA copy-number changes using cDNA microarrays. Nature Genetics, 23:41 – 46.

Pollard, K. S. and van der Laan, M. J. (2005). Resampling-based multiple testing: Asymptotic control of type I error and applications to gene expression data. Journal of Statistical Planning and Inference, 125:85 – 100.

Ptitsyn, A., Zvonic, S., and Gimble, J. (2006). Permutation test for periodicity in short time series data. BMC Bioinformatics, 7(2):S10.

Schena, M., Shalon, D., Davis, R. W., and Brown, P. O. (1995). Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science, 270:467 – 470.

Schweder, T. and Spjøtvoll, E. (1982). Plots of P-values to evaluate many tests simultaneously. Biometrika, 69:493 – 502.

Shaffer, J. P. (1995). Multiple hypothesis testing: A review. Annual Review of Psychology, 46:561 – 584.

Smith, A. D., Sumazin, P., Das, D., and Zhang, M. Q. (2005). Mining ChIP-chip data for transcription factor cofactor binding sites. Bioinformatics, 21:i403 – i412.

Stefansson, G., Kim, W., and Hsu, J. C. (1988). On confidence sets in multiple comparisons. In Gupta, S. S. and Berger, J. O., editors, Statistical Decision Theory and Related Topics IV, 2:89 – 104.

Storey, J. D. (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 64(3):479 – 498.

Storey, J. D., Taylor, J. E., and Siegmund, D. (2004). Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: A unified approach. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 66:187 – 205.

Storey, J. D. and Tibshirani, R. (2003a). SAM thresholding and false discovery rates for detecting differential gene expression in DNA microarrays. In Parmigiani, G., Garrett, E. S., Irizarry, R. A., and Zeger, S. L., editors, The Analysis of Gene Expression Data: Methods and Software. Springer, New York.

Storey, J. D. and Tibshirani, R. (2003b). Statistical significance for genome-wide studies. Proceedings of the National Academy of Sciences, 100:9440 – 9445.

Strimmer, K. (2008). A unified approach to false discovery rate estimation. BMC Bioinformatics, 9:303.

Tsai, C., Hsueh, H., and Chen, J. J. (2003). Estimation of false discovery rates in multiple testing: Application to gene microarray data. Biometrics, 59(4):1071 – 1081.

Tusher, V. G., Tibshirani, R., and Chu, G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences, 98(9):5116 – 5121.

van de Vijver, M. J., He, Y. D., van 't Veer, L. J., Dai, H., Hart, A. A., Voskuil, D. W., Schreiber, G. J., Peterse, J. L., Roberts, C., Marton, M. J., Parrish, M., Atsma, D., Witteveen, A., Glas, A., Delahaye, L., van der Velde, T., Bartelink, H., Rodenhuis, S., Rutgers, E. T., Friend, S. H., and Bernards, R. (2002). A gene-expression signature as a predictor of survival in breast cancer. New England Journal of Medicine, 347(25):1999 – 2009.

van 't Veer, L. J., Dai, H., van de Vijver, M. J., He, Y. D., Hart, A. A., Mao, M., Peterse, H. L., van der Kooy, K., Marton, M. J., Witteveen, A. T., Schreiber, G. J., Kerkhoven, R. M., Roberts, C., Linsley, P. S., Bernards, R., and Friend, S. H. (2002). Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415(6871):530 – 536.

Westfall, P. H. and Young, S. S. (1993). Resampling-based multiple testing: Examples and methods for p-value adjustment. New York: Wiley.

Xie, Y., Pan, W., and Khodursky, A. B. (2005). A note on using permutation-based false discovery rate estimates to compare different analysis methods for microarray data. Bioinformatics, 21(23):4280 – 4288.

Xu, H. and Hsu, J. C. (2007). Applying the generalized partitioning principle to control the generalized familywise error rate. Biometrical Journal, 49:52 – 67.

Yekutieli, D. and Benjamini, Y. (1999). Resampling-based false discovery rate controlling multiple test procedures for correlated test statistics. Journal of Statistical Planning and Inference, 82:171 – 196.

Zhang, S. (2007). A comprehensive evaluation of SAM, the SAM R-package and a simple modification to improve its performance. BMC Bioinformatics, 8:230 – 241.
