Novel P-Value Combination Methods for Signal Detection in Large-Scale Data Analysis
by Hong Zhang

A Dissertation Submitted to the Faculty of the WORCESTER POLYTECHNIC INSTITUTE in partial fulfillment of the requirements for the Degree of Doctor of Philosophy in Mathematical Sciences

April 2018

© 2018 - Hong Zhang. All rights reserved.

Thesis advisor: Zheyang Wu
Author: Hong Zhang

Abstract

In this dissertation, we first study the distributional properties of gGOF, a family of maximum-based goodness-of-fit statistical tests, and we propose TFisher, a new family of aggregation-based tests that generalize and optimize the classic Fisher's p-value combination method. Robust data-adaptive versions of these tests are proposed to reduce the sensitivity of statistical power to different signal patterns. We also develop analytical algorithms to efficiently find the p-values of both tests under arbitrary correlation structures, so that these optimal methods are not only powerful but also computationally feasible for analyzing large-scale correlated data. Both families of tests are successfully applied to detect the joint genetic effect of human complex diseases by analyzing genome-wide association study (GWAS) data and whole exome sequencing data.

In Chapter 1, we study analytical distribution calculations for gGOF statistics, which cover the optimal tests, the φ-divergence statistics, under arbitrary independent and continuous $H_0$ and $H_1$ models. Compared with a rich literature of analytical p-value calculations, this work possesses advantages in its generality, accuracy, and computational simplicity. We also provide a general data analysis framework to apply gGOF statistics to SNP-set based GWAS for either quantitative or categorical traits. An application to a Crohn's disease study shows that these optimal tests have good potential for detecting novel disease genes.
In Chapter 2, we address the issue for gGOF under general settings of correlated data. We provide a novel p-value calculation approach that is often more accurate than the commonly used moment-matching approach under various correlation structures. We also propose a strategy of combining the innovated transformation with gGOF statistics, called igGOF. Furthermore, igGOF allows a natural double-omnibus test, called digGOF, which adapts both the functional of the statistics and the truncation of input p-values to unknown data and signal patterns. We applied the tests in genetic association studies, both by simulations and in a real exome-sequencing data analysis of amyotrophic lateral sclerosis (ALS).

In Chapter 3, we propose a unifying family of Fisher's p-value combination statistics, called TFisher, with general p-value truncation and weighting schemes. Analytical calculations for the p-value and the statistical power of TFisher under general hypotheses are given. A soft-thresholding scheme is shown to be optimal for signal detection in a large space of signal patterns. When prior information on the signal pattern is unavailable, an omnibus test, oTFisher, can adapt to the given data. Simulations evidenced the accuracy of the calculations and validated the theoretical properties. The TFisher tests were applied to analyze whole exome sequencing data of ALS.

In Chapter 4, we propose an approximation for the p-value of TFisher and oTFisher for analyzing correlated data with general correlation structures. The methods extend Brown's method [1] to a more general Gamma distribution. An analytical approximation of the variance of TFisher is also provided. Numerical results show that both approximations are accurate.

Contents

Title Page
Abstract
Table of Contents
1 Distributions and Statistical Power of Optimal Signal-Detection Methods In Finite Cases
    1.1 Introduction
    1.2 The gGOF Family for Weak-Sparse Signals
    1.3 Analytical Results
    1.4 Numerical Results
    1.5 A Framework for GWAS And Application to Crohn's Disease Study
    1.6 Discussion
2 digGOF: Double-Omnibus Innovated Goodness-Of-Fit Tests For Dependent Data Analysis
    2.1 Introduction
    2.2 Models of Hypotheses
    2.3 The gGOF Family Under Dependence
    2.4 Innovated Transformation
    2.5 Numerical Studies
    2.6 Application to Genome-wide Association Study
    2.7 Discussion
3 TFisher Tests: Optimal and Adaptive Thresholding for Combining p-Values
    3.1 Introduction
    3.2 TFisher Tests and Hypotheses
    3.3 TFisher Distribution Under $H_0$
    3.4 TFisher Distribution Under General $H_1$
    3.5 Asymptotic Optimality for Signal Detection
    3.6 Statistical Power Comparison For Signal Detection
    3.7 ALS Exome-seq Data Analysis
    3.8 Discussion
4 TFisher Distribution Under Dependent Input Statistics
    4.1 Introduction
    4.2 Approximate Null Distribution of TFisher
    4.3 Numerical Results
    4.4 Extension to oTFisher
A Proofs of Chapter 1
B Proofs of Chapter 2
C Proofs of Chapter 3
D Proofs of Chapter 4

Chapter 1

Distributions and Statistical Power of Optimal Signal-Detection Methods In Finite Cases

Chapter 1: gGOF under Independence

1.1 Introduction

In big data analysis, signals are often buried within a large amount of noise and are thus relatively weak and sparse. Developing optimal tests for detecting weak-sparse signals is important for much data-driven scientific research. For example, in large-scale genetic association studies millions of genetic variants are queried. Only a relatively small proportion are expected to be truly associated with a given disease, and most genetic effects are relatively weak, especially compared with the cumulative noise level of high-throughput data [2,3].

In recent years, theoretical studies have revealed a collection of asymptotically optimal tests in the sense that they can reach the boundary of the detectable region. In other words, they are capable of detecting signals at the minimal intensity level required for any statistic to detect them reliably. In particular, the Higher Criticism (HC) type tests [4,5,6,7], the Berk-Jones (B-J) type tests [8], the φ-divergence type tests [9], etc., have been proven to share this asymptotic optimality.

Despite this exciting progress, some practical questions remain to be answered. In particular, the asymptotic optimality is mostly proven under theoretical assumptions such as a mixture model and testing group size $n \to \infty$ [4, 9, 10, 11]. For real data analysis, however, the hypothesis model could be more arbitrary and $n$ is always finite (even small or moderate). Optimal tests that are equivalent in the asymptotic sense could in fact perform quite differently.
In order to apply these optimal tests to real data analysis, as well as to make an appropriate choice based on signal patterns, it is important to analytically calculate p-values as well as statistical power under more general and realistic assumptions. Moreover, in order to better understand the performance of relevant tests in a broader context, it is beneficial to study a generic statistic family that unifies the common style of these test statistics. This chapter addresses these issues.

1.1.1 Scope of The Work

This chapter considers a generic family of goodness-of-fit (GOF) test statistics, called gGOF. Following the essential idea of GOF tests, a gGOF statistic is loosely defined as a summary statistic based on the maximal contrast between ordered input p-values and their expectations under the null hypothesis. Let $P_1, \ldots, P_n$, for given $n > 1$, be a set of input p-values, and let $P_{(1)} \le \cdots \le P_{(n)}$ be the ordered p-values. A gGOF statistic is defined as the supremum of a generic contrast function $f$ over a truncation domain $R$:

$$S_{n,R} = \sup_{R} f\left(\frac{i}{n}, P_{(i)}\right), \quad \text{where } R = \{i : k_0 \le i \le k_1\} \cap \{P_{(i)} : \alpha_0 < P_{(i)} < \alpha_1\}, \tag{1.1}$$

for given $k_0 \le k_1 \in \{1, \ldots, n\}$ and $\alpha_0 \le \alpha_1 \in [0, 1]$. The gGOF requires the null hypothesis

$$H_0 : P_i \overset{\text{i.i.d.}}{\sim} \text{Uniform}[0, 1], \quad i = 1, \ldots, n, \tag{1.2}$$

so that the null expectation is $E(P_{(i)}) = \frac{i}{n+1}$. If the null is untrue, the $P_{(i)}$ will differ from their null expectations, which is to be captured by the contrast function $f$. In practice, smaller p-values indicate signals or positive outcomes. Therefore $f(x, y)$ can be any monotonically decreasing function in $y$ at fixed $x$, so that the smaller the input p-values, the larger the statistic and the stronger the evidence against the null hypothesis. In other words, gGOF is one-sided in nature. Note that $\frac{i}{n}$, instead of $\frac{i}{n+1}$, is used to represent the null mean of $P_{(i)}$, following the tradition of the GOF definition.
In fact, either fraction can be used, and the choice makes no practical difference.

For both theoretical and practical reasons, it is important to allow a general truncation domain $R$ to restrict both the index $i$ and the magnitude of the p-values $P_{(i)}$. Aside from the benefit of computational efficiency (e.g., big input p-values can be safely truncated as they are likely not signals), some gGOF statistics could be improved by excluding p-values that are too small. For example, HC could have the long-tail problem under the null due to small p-values, while removing $P_{(1)}$, $P_{(2)}$, etc., may not guarantee the fix [4, 12]. It is better to directly restrict the magnitude of the p-values, and thus a modified version of HC was created with $R = \{1 < i \le n/2,\ P_{(i)} \ge 1/n\}$ [4, 13]. The significant influence of restricting $P_{(i)} \ge 1/n$ under finite $n$ is further demonstrated in Section 1.4.