MULTIPLE HYPOTHESIS TESTING FOR FINITE AND INFINITE NUMBER OF HYPOTHESES, by Zhongfa Zhang. Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Total Pages: 16

File Type: PDF, Size: 1020 KB

MULTIPLE HYPOTHESIS TESTING FOR FINITE AND INFINITE NUMBER OF HYPOTHESES

by Zhongfa Zhang

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy.
Dissertation Advisor: Dr. Jiayang Sun
Department of Statistics, Case Western Reserve University, August 2005

CASE WESTERN RESERVE UNIVERSITY, SCHOOL OF GRADUATE STUDIES
We hereby approve the dissertation of Zhongfa Zhang, candidate for the Doctor of Philosophy degree.
Committee Chair: Dr. Jiayang Sun, Dissertation Advisor, Professor, Department of Statistics
Committee: Dr. Wojbor Woyczynski, Professor, Department of Statistics
Committee: Dr. Robert Elston, Professor, Department of Epidemiology & Biostatistics
Committee: Dr. Hemant Ishwaran, Adjunct Associate Professor, Department of Statistics; Staff, Dept. of Quantitative Health Sciences, Cleveland Clinic Foundation
August 2005

Table of Contents

Table of Contents
List of Tables
List of Figures
Acknowledgement
Abstract
1 Introduction
  1.1 Hypothesis Testing
    1.1.1 Single Hypothesis Testing
    1.1.2 Multiple Hypothesis Testing
    1.1.3 Test Equality of Curves
  1.2 Road Map of the Following Chapters
2 Multiple Hypothesis Testing — New FDR Controlling Procedures
  2.1 Introduction
  2.2 Relationship between FDR and FWER
  2.3 Literature Review
  2.4 A Few Theorems
  2.5 Our Proposed Procedure (PP)
  2.6 Comparison with Other Procedures
  2.7 Application to a Real Data Set
3 Test Equality of Curves
  3.1 An Environmental Study — Lead Project
  3.2 Model Setup
  3.3 Related Work and Outline
  3.4 Methods
    3.4.1 Homoscedastic Case
    3.4.2 Special Case When f2(t) ≡ 0
    3.4.3 Heteroscedastic Case
  3.5 Simulations
  3.6 Test Results on Teeth Lead Data Set
4 Connections and Discussions
  4.1 Connections
  4.2 Discussions and Future Research
Appendices
A Proofs of Lemmas and Theorems in Chapter 2
  A.1 Proof of Lemma 2.4.2
  A.2 Proof of Theorem 2.4.6
    A.2.1 Key Lemma
    A.2.2 Other Lemmas
    A.2.3 Proof of Theorem 2.4.6
B Proof of Theorem in Chapter 3
  B.1 Lemmas
  B.2 Proof of Theorem 3.4.2
C Software ctest
Bibliography

List of Tables

1.1 Outcome of single hypothesis testing
1.2 Outcome of multiple hypothesis testing
2.1 Number of genes discovered by three FDR procedures
B.1 Comparison of simulated degrees of freedom ν = 4πm2 (upper element, via simulation) with approximated degrees of freedom ν (lower element, by formula (3.4.19)) for different combinations of sample sizes n1, n2 and degrees of freedom ν1, ν2

List of Figures

2.1 Trellis plot to explore the functional relationship between FWER and FDR: simulated samples from the normal distribution N(µ, 1). Total hypotheses m = 1000, with number of true null hypotheses m0 = 100, 400, 700, 950, 990, 1000 from left to right; µ = 0 for the null and µ = 0.06, 0.12, ..., 0.36 from bottom up for the alternative distributions.
2.2 Explanation of why the FDR produced by the BH procedure at level β (in the case of independent test statistics) is (m0/m)β, which depends on m0, m and β only, but not on the realized p-values from the alternative. Solid straight line: BH critical line; thick blue curve: sorted p-values against indices.
2.3 Partition of the unit square such that the joint distribution of (P1, P2) constitutes a counterexample to Theorem 2.4.1 when the independence assumption is violated. β/2 = c1 and β = c2.
2.4 A joint distribution of (P1, P2) that constitutes a counterexample to Theorem 2.4.1.
2.5 The asymptotic quadratic relationship between FDR level β and variance when m and m0 are large, based on Corollary 2.4.7. A realization of variances for fixed m and m0.
2.6 Comparison of three FDR controlling procedures: 1. ST (Storey's), 2. PP (Proposed, Uncorrected), 3. BH procedure. 10000 repetitions were performed to average the FDR, FNR, and Power. Total number of tests is m = 1000. The generated signal sampled from N(µ, σ²) is relatively weak, with means µ = 0.04, 0.04 + 1.18·1/(m1 − 1), 0.04 + 1.18·2/(m1 − 1), ..., 1.2.
2.7 Average FDP (left panel), FNP (middle panel) and POWER (right panel) (y-axis) by Storey's (line with mark 1) and Corrected PP (line with mark 2), with δ = 0.035 in formula (2.5.4). 10000 replications were used for the average. Number of total tests is m = 1000; m0 (x-axis) is the number of true null hypotheses.
2.8 Variance comparison of the false discovery proportion of three procedures. Averaged FDRs produced by Storey's (ST, solid blue), Uncorrected PP (PP, dashed green) and BH's (BH, dotted brown) procedures were plotted together. "Confidence bands" (plus and minus one standard deviation) were added to the plot. 12000 replications were used for the average. Total test number is m = 1000; m0 = number of true null hypotheses.
2.9 Index plot for the 7129 p-values computed through permutation and t-test. Two straight lines are added to the plot. One is y = x/m, which corresponds to the case when all genes are insignificant to the class differentiation. The other is y = (β/m)x, corresponding to the BH line at level β.
3.1 Plot of teeth lead concentrations. Red square: M1 group; blue circle: M2 group.
3.2 Plot of teeth lead concentrations. Solid red: M1 group; dotted blue: M2 group. Local smoothing curves were superimposed for each group. Solid red line: M1 group; dotted blue line: M2 group.
3.3 Simulation result. Test: f1(t) = f2(t), t ∈ T = [0, 1]. Homoscedastic variances were assumed. 10000 repetitions were used.
3.4 Simulation result. Test H0: f(t) = 0. 10000 iterations were used. σ = 0.1, h = 0.1.
3.5 Simulation results. Test: f1(t) = f2(t), for t ∈ T = [0, 1]. 10000 repetitions were used. Heteroscedastic variances were used, with σ1² = 0.02 and σ2² = 0.03.
A.1 Illustration for case 3: m = 40, m0 = 10. All p-values are the same except one, which comes from the alternative. This point was marked as M in the left panel and N in the right panel.
B.1 The true density of Y (solid black) and the density of χ²_ν/ν (dashed red), with ν computed by formula (3.4.19). The density curves of χ²_νi/νi are also added to the plot. n1 = 800, n2 = 1000, ν1 = 120, ν2 = 300.
B.2 Comparison of the degrees of freedom ν estimated by formula (3.4.19) (dotted green lines) with the degrees of freedom ν obtained from simulated data with ν = 4πm2 (solid red line), for different combinations of values ν1 = 100, 200, ..., 800 (x-axis) and ν2 = 100, 200, ..., 800 (from the bottom curve up). Here n1 = 1000 and n2 = 1500.
B.3 Tubes with 2 endpoints around a 1-dimensional manifold embedded in R².

ACKNOWLEDGEMENTS

First, I would like to thank my parents, who have sacrificed so much for their children, and my brother and sisters for their unselfish love. No matter what has happened and what will happen, they were and will always be there, ready to give whatever support they can offer at their utmost. My gratitude also goes to the numerous other people who have enlightened me during my primary and middle school years and who have sincerely cared about and helped me. I have been feeling so lucky to have them in my life. Without their help, I cannot imagine where my life would be.

I would also like to express my gratitude toward Drs. Alexander, Elston, Ishwaran, Sedransk, Sun, Werner, Woyczynski and Wu for my education in the Mathematics/Statistics Departments and for their understanding. I thank Drs. Elston, Ishwaran and Woyczynski for serving on my thesis committee. Special thanks to my thesis advisor, Jiayang Sun, who not only supported this research in part through her NSF awards, but also spent so much time and took so much effort in trying to make me a successful researcher during my graduate years here at CWRU. I thank her for her guidance, knowledge and patience. Finally, I thank Dr. Steve Ganocy for proofreading my entire thesis and for his support at this critical period.

Multiple Hypothesis Testing for Finite and Infinite Number of Hypotheses

Abstract

by Zhongfa Zhang

Multiple hypothesis testing is one of the most active research areas in statistics. The number of hypotheses can be finite or infinite. For a multiple hypothesis testing problem, an overall error criterion must be properly defined and appropriate test procedures must be developed. In this thesis, we investigate both the finite and the infinite hypothesis testing situations. Accordingly, the thesis is roughly divided into two parts.

The first part of this thesis focuses on finite hypothesis testing. We study the False Discovery Rate (FDR), proposed by Benjamini and Hochberg in 1995, as an error criterion for a multiple testing procedure. We first attempt to find a functional relationship between FDR and the more familiar family-wise error rate (FWER) in order to study the practical aspects of the two criteria and to obtain a controlling procedure for one from that of the other. A few new theoretical results about FDR are then presented and, based on these results, a new "suboptimal" FDR controlling procedure is proposed. Comparisons are made between the performance of the proposed procedure and that of Benjamini and Hochberg's (1995) and Storey et al.'s (2003) procedures. The procedure is then applied to a microarray data set to illustrate its application in the bioinformatics area.
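As a concrete illustration of the kind of procedure studied in the first part, the sketch below implements the standard Benjamini and Hochberg (1995) step-up rule in R on simulated p-values. It is not the thesis's proposed procedure; the mixture of nulls and alternatives, the effect size, and the level β = 0.05 are assumptions chosen only for the example.

```r
# Benjamini-Hochberg step-up procedure: a minimal sketch on simulated p-values.
set.seed(1)
m  <- 1000
m0 <- 900                                             # number of true null hypotheses
p_null <- runif(m0)                                   # p-values under the null
p_alt  <- 1 - pnorm(rnorm(m - m0, mean = 3))          # one-sided p-values for true signals
p      <- c(p_null, p_alt)

beta <- 0.05                                          # target FDR level
o    <- order(p)                                      # indices of sorted p-values
k    <- max(c(0, which(p[o] <= (1:m) / m * beta)))    # largest index under the BH line
rejected <- if (k > 0) o[1:k] else integer(0)

length(rejected)                                      # number of discoveries
# The same decisions follow from sum(p.adjust(p, method = "BH") <= beta).
```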
Recommended publications
  • Monte Carlo Study on Power Rates of Some Heteroscedasticity Detection Methods in a Linear Regression Model with a Multicollinearity Problem
    Monte Carlo study on power rates of some heteroscedasticity detection methods in a linear regression model with a multicollinearity problem. O.O. Alabi, Kayode Ayinde, O. E. Babalola, and H.A. Bello, Department of Statistics, Federal University of Technology, P.M.B. 704, Akure, Ondo State, Nigeria. Corresponding Author: O. O. Alabi, [email protected] Abstract: This paper examined the power rates exhibited by some heteroscedasticity detection methods in a linear regression model with a multicollinearity problem. Violation of the equal error variance assumption in a linear regression model leads to the problem of heteroscedasticity, while violation of the assumption of no linear dependency between the exogenous variables leads to the problem of multicollinearity. Whenever these two problems exist, one is faced with estimation and hypothesis testing problems. In order to overcome these hurdles, one needs to determine the best method of heteroscedasticity detection so as to avoid making a wrong decision under hypothesis testing. This leads to the question of how to determine the best heteroscedasticity detection method in a linear regression model with a multicollinearity problem, via the power rate. In practice, the variances of the error terms are unequal and unknown, but there is a need to determine the presence or absence of this problem as a preliminary diagnostic on the data set to be analyzed or on which hypothesis testing is to be performed. Although there are several forms of heteroscedasticity and several detection methods, for a researcher to arrive at a reasonable and correct decision, the best and most consistently performing detection methods under any form or structure of heteroscedasticity must be determined.
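The Monte Carlo power-rate idea described in the abstract can be sketched as follows. The example uses the Breusch-Pagan test (bptest from the lmtest package) as one detection method and builds in a strongly collinear pair of regressors; the sample size, collinearity strength, and heteroscedasticity structure are illustrative assumptions, not the designs studied in the paper.

```r
# Monte Carlo estimate of the power rate of the Breusch-Pagan test
# under heteroscedasticity, with collinear regressors (a sketch).
library(lmtest)

set.seed(123)
n     <- 100
reps  <- 2000
alpha <- 0.05

rejections <- replicate(reps, {
  x1 <- rnorm(n)
  x2 <- 0.95 * x1 + sqrt(1 - 0.95^2) * rnorm(n)   # strongly collinear with x1
  sigma <- exp(0.8 * x1)                          # error variance depends on x1
  y   <- 1 + 2 * x1 + 3 * x2 + rnorm(n, sd = sigma)
  fit <- lm(y ~ x1 + x2)
  bptest(fit)$p.value < alpha                     # did the test detect heteroscedasticity?
})

mean(rejections)   # estimated power rate of the detection method
```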
  • The Effects of Simplifying Assumptions in Power Analysis
    University of Nebraska - Lincoln, DigitalCommons@University of Nebraska - Lincoln: Public Access Theses and Dissertations from the College of Education and Human Sciences (CEHS), 4-2011. Kupzyk, Kevin A., "The Effects of Simplifying Assumptions in Power Analysis" (2011). Public Access Theses and Dissertations from the College of Education and Human Sciences. 106. https://digitalcommons.unl.edu/cehsdiss/106 THE EFFECTS OF SIMPLIFYING ASSUMPTIONS IN POWER ANALYSIS, by Kevin A. Kupzyk, Ph.D., University of Nebraska, 2011 ([email protected]). A dissertation presented to the Faculty of the Graduate College at the University of Nebraska in partial fulfillment of requirements for the degree of Doctor of Philosophy, Major: Psychological Studies in Education, under the supervision of Professor James A. Bovaird. Lincoln, Nebraska, April 2011. Adviser: James A. Bovaird. In experimental research, planning studies that have sufficient probability of detecting important effects is critical. Carrying out an experiment with an inadequate sample size may result in the inability to observe the effect of interest, wasting the resources spent on an experiment.
  • Introduction to Hypothesis Testing
    Introduction to Hypothesis Testing. OPRE 6301. Motivation. The purpose of hypothesis testing is to determine whether there is enough statistical evidence in favor of a certain belief, or hypothesis, about a parameter. Examples: Is there statistical evidence, from a random sample of potential customers, to support the hypothesis that more than 10% of the potential customers will purchase a new product? Is a new drug effective in curing a certain disease? A sample of patients is randomly selected. Half of them are given the drug while the other half are given a placebo. The conditions of the patients are then measured and compared. These questions/hypotheses are similar in spirit to the discrimination example studied earlier. Below, we provide a basic introduction to hypothesis testing. Criminal Trials. The basic concepts in hypothesis testing are actually quite analogous to those in a criminal trial. Consider a person on trial for a "criminal" offense in the United States. Under the US system a jury (or sometimes just the judge) must decide if the person is innocent or guilty, while in fact the person may be innocent or guilty. These combinations are summarized in the table below.

                          Person is: Innocent    Person is: Guilty
    Jury says: Innocent   No Error               Error
    Jury says: Guilty     Error                  No Error

    Notice that there are two types of errors. Are both of these errors equally important? Or, is it as bad to decide that a guilty person is innocent and let them go free as it is to decide an innocent person is guilty and punish them for the crime? Or, is a jury supposed to be totally objective, not assuming that the person is either innocent or guilty, and make their decision based on the weight of the evidence one way or another? In a criminal trial, there actually is a favored assumption, an initial bias if you will.
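The two error types in the table can be made concrete with a short simulation: the sketch below repeatedly applies a one-sample t-test of H0: µ = 0, once with data generated under the null and once under an alternative. The sample size, effect size, and significance level are assumed values chosen only for illustration.

```r
# Estimating Type I and Type II error rates of a one-sample t-test by simulation.
set.seed(42)
reps  <- 5000
n     <- 30
alpha <- 0.05

# Type I error: data generated under the null (mu = 0); rejecting H0 is an error.
type1 <- mean(replicate(reps, t.test(rnorm(n, mean = 0))$p.value < alpha))

# Type II error: data generated under an alternative (mu = 0.5); failing to reject is an error.
type2 <- mean(replicate(reps, t.test(rnorm(n, mean = 0.5))$p.value >= alpha))

c(type1 = type1, type2 = type2, power = 1 - type2)
```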
  • Power of a Statistical Test
    Power of a Statistical Test. By Smita Skrivanek, Principal Statistician, MoreSteam.com LLC. What is the power of a test? The power of a statistical test gives the likelihood of rejecting the null hypothesis when the null hypothesis is false. Just as the significance level (alpha) of a test gives the probability that the null hypothesis will be rejected when it is actually true (a wrong decision), power quantifies the chance that the null hypothesis will be rejected when it is actually false (a correct decision). Thus, power is the ability of a test to correctly reject the null hypothesis. Why is it important? Although you can conduct a hypothesis test without it, calculating the power of a test beforehand will help you ensure that the sample size is large enough for the purpose of the test. Otherwise, the test may be inconclusive, leading to wasted resources. On rare occasions the power may be calculated after the test is performed, but this is not recommended except to determine an adequate sample size for a follow-up study (if a test failed to detect an effect, it was obviously underpowered – nothing new can be learned by calculating the power at this stage). How is it calculated? As an example, consider testing whether the average time per week spent watching TV is 4 hours versus the alternative that it is greater than 4 hours. We will calculate the power of the test for a specific value under the alternative hypothesis, say, 7 hours. The null hypothesis is H0: μ = 4 hours; the alternative hypothesis is H1: μ = 7 hours, where μ = the average time per week spent watching TV.
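The power calculation set up above can be completed once a population standard deviation and a sample size are specified; neither appears in the excerpt, so the values below (σ = 4 hours, n = 25, one-sided α = 0.05) are assumptions for a sketch.

```r
# Power of a one-sided z-test of H0: mu = 4 against H1: mu > 4,
# evaluated at the specific alternative mu = 7 (assumed sigma and n).
mu0   <- 4      # hypothesized mean (hours of TV per week)
mu1   <- 7      # specific value under the alternative
sigma <- 4      # assumed population standard deviation
n     <- 25     # assumed sample size
alpha <- 0.05

se    <- sigma / sqrt(n)
crit  <- mu0 + qnorm(1 - alpha) * se        # reject H0 if the sample mean exceeds this
power <- 1 - pnorm(crit, mean = mu1, sd = se)
power                                       # probability of rejecting H0 when mu = 7
```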
  • Confidence Intervals and Hypothesis Tests
    Chapter 2: Confidence intervals and hypothesis tests. This chapter focuses on how to draw conclusions about populations from sample data. We'll start by looking at binary data (e.g., polling), and learn how to estimate the true ratio of 1s and 0s with confidence intervals, and then test whether that ratio is significantly different from some baseline value using hypothesis testing. Then, we'll extend what we've learned to continuous measurements. 2.1 Binomial data. Suppose we're conducting a yes/no survey of a few randomly sampled people, and we want to use the results of our survey to determine the answers for the overall population. 2.1.1 The estimator. The obvious first choice is just the fraction of people who said yes. Formally, suppose we have samples x1, ..., xn that can each be 0 or 1, and the probability that each xi is 1 is p (in frequentist style, we'll assume p is fixed but unknown: this is what we're interested in finding). We'll assume our samples are independent and identically distributed (i.i.d.), meaning that each one has no dependence on any of the others, and they all have the same probability p of being 1. Then our estimate for p, which we'll call p̂, or "p-hat", would be p̂ = (1/n) ∑ xi, summing over i = 1, ..., n. Notice that p̂ is a random quantity, since it depends on the random quantities xi. In statistical lingo, p̂ is known as an estimator for p. Also notice that except for the factor of 1/n in front, p̂ is almost a binomial random variable (in particular, np̂ ∼ B(n, p)).
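A minimal sketch of this estimator, together with the usual normal-approximation confidence interval for p, is below. The simulated survey (n = 200, true p = 0.3) is an assumed example, not data from the chapter.

```r
# Estimating p from yes/no survey data and forming a 95% Wald confidence interval.
set.seed(7)
n      <- 200
p_true <- 0.3
x      <- rbinom(n, size = 1, prob = p_true)   # i.i.d. 0/1 responses

p_hat <- mean(x)                               # the estimator p-hat = (1/n) * sum(x_i)
se    <- sqrt(p_hat * (1 - p_hat) / n)         # estimated standard error
ci    <- p_hat + c(-1, 1) * qnorm(0.975) * se

p_hat
ci                                             # approximate 95% confidence interval for p
```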
  • Understanding Statistical Hypothesis Testing: the Logic of Statistical Inference
    Review Understanding Statistical Hypothesis Testing: The Logic of Statistical Inference Frank Emmert-Streib 1,2,* and Matthias Dehmer 3,4,5 1 Predictive Society and Data Analytics Lab, Faculty of Information Technology and Communication Sciences, Tampere University, 33100 Tampere, Finland 2 Institute of Biosciences and Medical Technology, Tampere University, 33520 Tampere, Finland 3 Institute for Intelligent Production, Faculty for Management, University of Applied Sciences Upper Austria, Steyr Campus, 4040 Steyr, Austria 4 Department of Mechatronics and Biomedical Computer Science, University for Health Sciences, Medical Informatics and Technology (UMIT), 6060 Hall, Tyrol, Austria 5 College of Computer and Control Engineering, Nankai University, Tianjin 300000, China * Correspondence: [email protected]; Tel.: +358-50-301-5353 Received: 27 July 2019; Accepted: 9 August 2019; Published: 12 August 2019 Abstract: Statistical hypothesis testing is among the most misunderstood quantitative analysis methods from data science. Despite its seeming simplicity, it has complex interdependencies between its procedural components. In this paper, we discuss the underlying logic behind statistical hypothesis testing, the formal meaning of its components and their connections. Our presentation is applicable to all statistical hypothesis tests as generic backbone and, hence, useful across all application domains in data science and artificial intelligence. Keywords: hypothesis testing; machine learning; statistics; data science; statistical inference 1. Introduction We are living in an era that is characterized by the availability of big data. In order to emphasize the importance of this, data have been called the ‘oil of the 21st Century’ [1]. However, for dealing with the challenges posed by such data, advanced analysis methods are needed.
  • Post Hoc Power: Tables and Commentary
    Post Hoc Power: Tables and Commentary Russell V. Lenth July, 2007 The University of Iowa Department of Statistics and Actuarial Science Technical Report No. 378 Abstract Post hoc power is the retrospective power of an observed effect based on the sample size and parameter estimates derived from a given data set. Many scientists recommend using post hoc power as a follow-up analysis, especially if a finding is nonsignificant. This article presents tables of post hoc power for common t and F tests. These tables make it explicitly clear that for a given significance level, post hoc power depends only on the P value and the degrees of freedom. It is hoped that this article will lead to greater understanding of what post hoc power is—and is not. We also present a “grand unified formula” for post hoc power based on a reformulation of the problem, and a discussion of alternative views. Key words: Post hoc power, Observed power, P value, Grand unified formula 1 Introduction Power analysis has received an increasing amount of attention in the social-science literature (e.g., Cohen, 1988; Bausell and Li, 2002; Murphy and Myors, 2004). Used prospectively, it is used to determine an adequate sample size for a planned study (see, for example, Kraemer and Thiemann, 1987); for a stated effect size and significance level for a statistical test, one finds the sample size for which the power of the test will achieve a specified value. Many studies are not planned with such a prospective power calculation, however; and there is substantial evidence (e.g., Mone et al., 1996; Maxwell, 2004) that many published studies in the social sciences are under-powered.
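The point that post hoc power is determined by the P value and the degrees of freedom alone can be checked directly. The sketch below recovers the observed t statistic from a two-sided P value and treats it as the noncentrality parameter; the choice α = 0.05 and the example inputs are assumptions.

```r
# Post hoc ("observed") power for a two-sided t-test, computed from the
# P value and the degrees of freedom only (a sketch).
post_hoc_power <- function(p, df, alpha = 0.05) {
  t_obs  <- qt(1 - p / 2, df)              # |t| implied by the two-sided P value
  t_crit <- qt(1 - alpha / 2, df)
  # Power under a noncentral t with noncentrality equal to the observed t:
  1 - pt(t_crit, df, ncp = t_obs) + pt(-t_crit, df, ncp = t_obs)
}

post_hoc_power(p = 0.049, df = 30)   # just-significant result: observed power near 0.5
post_hoc_power(p = 0.40,  df = 30)   # nonsignificant result: low observed power
```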
  • STAT 141 11/02/04 POWER and SAMPLE SIZE Rejection & Acceptance Regions Type I and Type II Errors (S&W Sec 7.8) Power
    STAT 141, 11/02/04. POWER and SAMPLE SIZE: rejection and acceptance regions; Type I and Type II errors (S&W Sec 7.8); power; sample size needed for one-sample z-tests; using R to compute power for t-tests. For Thursday: read Chapter 7.10 and Chapter 8. A typical study design question: A new drug regimen has been developed to (hopefully) reduce weight in obese teenagers. Weight reduction over the one-year course of treatment is measured by the change X in body mass index (BMI). Formally we will test H0: µ = 0 vs H1: µ ≠ 0. Previous work shows that σ_X = 2. A change in BMI of 1.5 is considered important to detect (if the true effect size is 1.5 or higher, we need the study to have a high probability of rejecting H0). How many patients should be enrolled in the study? The testing example we use below is the simplest one: if x̄ ∼ N(µ, σ²/n), test H0: µ = µ0 against the two-sided alternative H1: µ ≠ µ0. However, the concepts apply much more generally. A test at level α has both a rejection region, R = {x̄ > µ0 + z(α/2)·σ_x̄} ∪ {x̄ < µ0 − z(α/2)·σ_x̄}, and an "acceptance" region, A = {|x̄ − µ0| < z(α/2)·σ_x̄}. Two kinds of errors: a Type I error is the error made when the null hypothesis is rejected when in fact the null hypothesis is true.
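For the BMI study design question, the normal-approximation sample size and the resulting power can be computed as below. The values σ = 2, an important effect of 1.5, and two-sided α = 0.05 come from the excerpt; the target power of 0.90 is an assumed design choice.

```r
# Sample size and power for the two-sided one-sample z-test of H0: mu = 0,
# with sigma = 2 and an important effect of 1.5 BMI units (a sketch).
sigma <- 2
delta <- 1.5           # effect size considered important to detect
alpha <- 0.05
target_power <- 0.90   # assumed design target

# Approximate n (ignoring the negligible chance of rejecting on the wrong side):
n <- ceiling(((qnorm(1 - alpha / 2) + qnorm(target_power)) * sigma / delta)^2)
n                      # patients needed

# Power actually achieved with that n, at true mu = 1.5:
se    <- sigma / sqrt(n)
power <- pnorm(-qnorm(1 - alpha / 2) + delta / se) + pnorm(-qnorm(1 - alpha / 2) - delta / se)
power

# For comparison, the t-test analogue (n is slightly larger because sigma is treated as unknown):
power.t.test(delta = 1.5, sd = 2, sig.level = 0.05, power = 0.90, type = "one.sample")
```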
  • A Test of Independence in Two-Way Contingency Tables Based on Maximal Correlation
    A TEST OF INDEPENDENCE IN TWO-WAY CONTINGENCY TABLES BASED ON MAXIMAL CORRELATION. Deniz C. Yenigün. A dissertation submitted to the Graduate College of Bowling Green State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY, August 2007. Committee: Gábor Székely, Advisor; Maria L. Rizzo, Co-Advisor; Louisa Ha, Graduate Faculty Representative; James Albert; Craig L. Zirbel. ABSTRACT (Gábor Székely, Advisor): Maximal correlation has several desirable properties as a measure of dependence, including the fact that it vanishes if and only if the variables are independent. Except for a few special cases, it is hard to evaluate maximal correlation explicitly. In this dissertation, we focus on two-dimensional contingency tables and discuss a procedure for estimating maximal correlation, which we use for constructing a test of independence. For large samples, we present the asymptotic null distribution of the test statistic. For small samples or tables with sparseness, we use exact inferential methods, where we employ maximal correlation as the ordering criterion. We compare the maximal correlation test with other tests of independence by Monte Carlo simulations. When the underlying continuous variables are dependent but uncorrelated, we point out some cases for which the new test is more powerful. ACKNOWLEDGEMENTS: I would like to express my sincere appreciation to my advisor, Gábor Székely, and my co-advisor, Maria Rizzo, for their advice and help throughout this research. I thank all the members of my committee, Craig Zirbel, Jim Albert, and Louisa Ha, for their time and advice.
  • Statistical Power and P-Values: an Epistemic Interpretation Without Power Approach Paradoxes
    Statistical Power and P-values: An Epistemic Interpretation Without Power Approach Paradoxes. Guillaume Rochefort-Maranda, December 16, 2017. Contents: 1 Introduction; 2 The Paradox (2.1 Technical Background, 2.2 Epistemic Interpretations, 2.3 A Paradox); 3 The Consensus; 4 The Solution; 5 Conclusion. 1 Introduction. It has been claimed that if statistical power and p-values are both used to measure the strength of our evidence for the null hypothesis when the results of our tests are not significant, then they can also be used to derive inconsistent epistemic judgements as we compare two different experiments. Those problematic derivations are known as power approach paradoxes. The consensus is that we can avoid them if we abandon the idea that statistical power can measure the strength of our evidence (Hoenig and Heisey 2001; Machery 2012). In this paper, however, I put forward a different solution. I argue that every power approach paradox rests on an equivocation on "strong evidence". The main idea is that we need to make a careful distinction between (i) the evidence provided by the quality of the test and (ii) the evidence provided by the outcome of the test. Both provide different types of evidence, and their respective strengths are to be evaluated differently. Without loss of generality [1], I analyse only one power approach paradox in order to reach this conclusion. But first, I set up the frequentist framework within which we can find such a paradox. [1] My analysis is without loss of generality because every other formulation of the paradox rests on the same idea that I reject: power and p-values measure the same thing.
  • The Probability of Not Committing a Type II Error Is Called the Power of a Hypothesis Test
    The probability of not committing a Type II Error is called the Power of a hypothesis test. Effect Size. To compute the power of the test, one offers an alternative view about the "true" value of the population parameter, assuming that the null hypothesis is false. The effect size is the difference between the true value and the value specified in the null hypothesis. Effect size = True value - Hypothesized value. For example, suppose the null hypothesis states that a population mean is equal to 100. A researcher might ask: What is the probability of rejecting the null hypothesis if the true population mean is equal to 90? In this example, the effect size would be 90 - 100, which equals -10. Factors That Affect Power. The power of a hypothesis test is affected by three factors. • Sample size (n). Other things being equal, the greater the sample size, the greater the power of the test. • Significance level (α). The higher the significance level, the higher the power of the test. If you increase the significance level, you reduce the region of acceptance. As a result, you are more likely to reject the null hypothesis. This means you are less likely to accept the null hypothesis when it is false; i.e., less likely to make a Type II error. Hence, the power of the test is increased. • The "true" value of the parameter being tested. The greater the difference between the "true" value of a parameter and the value specified in the null hypothesis, the greater the power of the test.
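The three factors can be seen at work in a short calculation for the example above (null mean 100, true mean 90). The one-sided z-test, σ = 15, and the baseline n = 25 and α = 0.05 are assumptions made only to illustrate the direction of each effect.

```r
# How sample size, significance level, and the true parameter value affect power,
# for the example H0: mu = 100 when the true mean may be 90 (one-sided z-test, a sketch).
z_power <- function(mu_true, mu0 = 100, sigma = 15, n = 25, alpha = 0.05) {
  se   <- sigma / sqrt(n)
  crit <- mu0 - qnorm(1 - alpha) * se      # reject for small sample means (mu_true < mu0)
  pnorm(crit, mean = mu_true, sd = se)     # probability of rejecting H0
}

z_power(mu_true = 90)                      # baseline power
z_power(mu_true = 90, n = 50)              # larger sample size -> higher power
z_power(mu_true = 90, alpha = 0.10)        # larger significance level -> higher power
z_power(mu_true = 95)                      # smaller effect size -> lower power
```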
  • Simulation-Based Power-Analysis for Factorial ANOVA Designs, by Daniel Lakens & Aaron R. Caldwell
    Simulation-Based Power-Analysis for Factorial ANOVA Designs. Daniel Lakens (Human-Technology Interaction Group, Eindhoven University of Technology, The Netherlands) & Aaron R. Caldwell (Department of Health, Human Performance and Recreation, University of Arkansas, USA; Thermal and Mountain Medicine Division, U.S. Army Research Institute of Environmental Medicine, USA). Researchers often rely on analysis of variance (ANOVA) when they report results of experiments. To ensure a study is adequately powered to yield informative results when performing an ANOVA, researchers can perform an a-priori power analysis. However, power analysis for factorial ANOVA designs is often a challenge. Current software solutions do not allow power analyses for complex designs with several within-subject factors. Moreover, power analyses often need partial eta-squared or Cohen's f as input, but these effect sizes are not intuitive and do not generalize to different experimental designs. We have created the R package Superpower and online Shiny apps to enable researchers without extensive programming experience to perform simulation-based power analysis for ANOVA designs of up to three within- or between-subject factors. Predicted effects are entered by specifying means, standard deviations, and, for within-subject factors, the correlations. The simulation provides the statistical power for all ANOVA main effects, interactions, and individual comparisons. The software can plot power across a range of sample sizes, can control for multiple comparisons, and can compute power when the homogeneity or sphericity assumptions are violated. This tutorial will demonstrate how to perform a-priori power analysis to design informative studies for main effects, interactions, and individual comparisons, and highlights important factors that determine the statistical power for factorial ANOVA designs.
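The simulation idea behind such tools can be illustrated without the package itself. The sketch below estimates power for the main effect of a one-way between-subject ANOVA by repeated simulation; it is not the Superpower API, and the cell means, standard deviation, group size, and number of replications are assumptions for the example.

```r
# Simulation-based power for the main effect of a one-way between-subject ANOVA (a sketch).
set.seed(2020)
means <- c(0, 0.4, 0.4)   # assumed cell means for three conditions
sd    <- 1                # assumed common standard deviation
n     <- 50               # participants per condition
alpha <- 0.05
nsim  <- 2000

one_run <- function() {
  g <- factor(rep(seq_along(means), each = n))
  y <- rnorm(n * length(means), mean = rep(means, each = n), sd = sd)
  p <- summary(aov(y ~ g))[[1]][["Pr(>F)"]][1]   # p-value for the main effect of condition
  p < alpha
}

power <- mean(replicate(nsim, one_run()))
power   # estimated power for the ANOVA main effect under this design
```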