Novel P-value Combination Methods for Signal Detection in Large-Scale Data Analysis
by
Hong Zhang
A Dissertation
Submitted to the Faculty
of the
WORCESTER POLYTECHNIC INSTITUTE
in partial fulfillment of the requirements for the
Degree of Doctor of Philosophy
in
Mathematical Sciences
April 2018

© 2018 Hong Zhang
All rights reserved.

Thesis advisor: Zheyang Wu
Author: Hong Zhang
Novel P-value Combination Methods for Signal Detection in Large-Scale Data Analysis
Abstract
In this dissertation, we first study the distributional properties of gGOF, a family
of maximum-based goodness-of-fit statistical tests, and we propose TFisher, a new
family of aggregation-based tests that generalizes and optimizes the classic Fisher's
p-value combination method. Robust, data-adaptive versions of these tests are
proposed to reduce the sensitivity of statistical power to different signal patterns.
We also develop analytical algorithms to efficiently find the p-values of both tests
under arbitrary correlation structures so that these optimal methods are not only
powerful but also computationally feasible for analyzing large-scale correlated data.
Both families of tests are successfully applied to detect the joint genetic effect of
human complex diseases by analyzing genome-wide association study (GWAS) data
and whole exome sequencing data.
In Chapter 1, we study analytical distribution calculations for gGOF statistics,
which cover the optimal tests, the φ-divergence statistics, under arbitrary independent and continuous H0 and H1 models. Compared with the rich literature on analytical
p-value calculations, this work has advantages in generality, accuracy, and
computational simplicity. We also provide a general data-analysis framework for applying
gGOF statistics to SNP-set-based GWAS for either quantitative or categorical traits.
An application to a Crohn's disease study shows that these optimal tests have good
potential for detecting novel disease genes.
In Chapter 2, we address gGOF under general settings of correlated data. We provide a novel p-value calculation approach that is often more accurate than the commonly used moment-matching approach under various correlation structures. We also propose a strategy of combining the innovated transformation with gGOF statistics, called igGOF. Furthermore, igGOF allows a natural double-omnibus test, called digGOF, which adapts both the functional of the statistic and the truncation of input p-values to unknown data and signal patterns. We applied the tests in genetic association studies, both through simulations and a real exome-sequencing data analysis of amyotrophic lateral sclerosis (ALS).
In Chapter 3, we propose a unifying family of Fisher's p-value combination statistics, called TFisher, with general p-value truncation and weighting schemes. Analytical calculations for the p-value and the statistical power of TFisher under general hypotheses are given. A soft-thresholding scheme is shown to be optimal for signal detection in a large space of signal patterns. When prior information on the signal pattern is unavailable, an omnibus test, oTFisher, can adapt to the given data. Simulations confirmed the accuracy of the calculations and validated the theoretical properties. The TFisher tests were applied to a whole-exome sequencing data set of ALS.
In Chapter 4, we propose an approximation for the p-values of TFisher and oTFisher for analyzing correlated data with general correlation structures. The methods extend Brown's method [1] to a more general gamma distribution. An analytical approximation of the variance of TFisher is also provided. Numerical results show that both approximations are accurate.

Contents
Title Page
Abstract
Table of Contents

1 Distributions and Statistical Power of Optimal Signal-Detection Methods In Finite Cases
  1.1 Introduction
  1.2 The gGOF Family for Weak-Sparse Signals
  1.3 Analytical Results
  1.4 Numerical Results
  1.5 A Framework for GWAS And Application to Crohn's Disease Study
  1.6 Discussion

2 digGOF: Double-Omnibus Innovated Goodness-Of-Fit Tests For Dependent Data Analysis
  2.1 Introduction
  2.2 Models of Hypotheses
  2.3 The gGOF Family Under Dependence
  2.4 Innovated Transformation
  2.5 Numerical Studies
  2.6 Application to Genome-wide Association Study
  2.7 Discussion

3 TFisher Tests: Optimal and Adaptive Thresholding for Combining p-Values
  3.1 Introduction
  3.2 TFisher Tests and Hypotheses
  3.3 TFisher Distribution Under H0
  3.4 TFisher Distribution Under General H1
  3.5 Asymptotic Optimality for Signal Detection
  3.6 Statistical Power Comparison For Signal Detection
  3.7 ALS Exome-seq Data Analysis
  3.8 Discussion

4 TFisher Distribution Under Dependent Input Statistics
  4.1 Introduction
  4.2 Approximate Null Distribution of TFisher
  4.3 Numerical Results
  4.4 Extension to oTFisher

A Proofs of Chapter 1
B Proofs of Chapter 2
C Proofs of Chapter 3
D Proofs of Chapter 4

Chapter 1
Distributions and Statistical Power of Optimal Signal-Detection Methods In Finite Cases
1.1 Introduction
In big data analysis, signals are often buried within a large amount of noise and
are thus relatively weak and sparse. Developing optimal tests for detecting weak-sparse
signals is important for many data-driven scientific studies. For example,
in large-scale genetic association studies millions of genetic variants are queried.
Only a relatively small proportion are expected to be truly associated with a given
disease, and most genetic effects are relatively weak, especially compared with the
cumulative noise level of high-throughput data [2, 3]. In recent years, theoretical
studies have revealed a collection of asymptotically optimal tests in the sense that
they can reach the boundary of the detectable region. In other words, they are capable of
detecting signals at an intensity level that is minimally required for any statistic
to detect them reliably. In particular, the Higher Criticism (HC) type tests [4, 5, 6, 7], the
Berk-Jones (B-J) type tests [8], the φ-divergence type tests [9], etc., have been proven to share this asymptotic optimality.
Despite this exciting progress, some practical questions remain to be answered. In particular, the asymptotic optimality is mostly proven under theoretical assumptions such as a mixture model and testing group size n → ∞ [4, 9, 10, 11]. For real data analysis, however, the hypothesis model could be more arbitrary and n is always finite (even small or moderate). Optimal tests that are equivalent in the asymptotic sense could in fact perform quite differently. In order to apply these optimal tests to real data analysis, as well as to make an appropriate choice based on signal patterns, it is important to analytically calculate p-values as well as statistical power under more general and realistic assumptions. Moreover, in order to better understand the performance of relevant tests in a broader context, it is beneficial to study a generic statistic family that unifies the common style of these test statistics. This chapter addresses these issues.
1.1.1 Scope of The Work
This chapter considers a generic family of goodness-of-fit (GOF) test statistics, called gGOF. Following the essential idea of GOF tests, a gGOF statistic is loosely defined as a summary statistic based on the maximal contrast between ordered input p-values and their expectations under the null hypothesis. Let $P_1, ..., P_n$, for given $n > 1$, be a set of input p-values, and $P_{(1)} \le ... \le P_{(n)}$ be the ordered p-values. A gGOF statistic is defined as the supremum of a generic contrast function $f$ over a truncation domain $R$:
$$S_{n,R} = \sup_R f\Big(\frac{i}{n}, P_{(i)}\Big), \quad \text{where } R = \{i : k_0 \le i \le k_1\} \cap \{P_{(i)} : \alpha_0 < P_{(i)} < \alpha_1\}, \tag{1.1}$$
for given $k_0 \le k_1 \in \{1, ..., n\}$ and $\alpha_0 \le \alpha_1 \in [0, 1]$. The gGOF requires the null hypothesis
$$H_0 : P_i \overset{i.i.d.}{\sim} \text{Uniform}[0, 1], \quad i = 1, ..., n, \tag{1.2}$$
so that the null expectation $E(P_{(i)}) = \frac{i}{n+1}$. If the null is untrue, the $P_{(i)}$ will differ from their null expectations, which is to be captured by the contrast function $f$. In practice, smaller p-values indicate signals or positive outcomes. Therefore $f(x, y)$ can be any function that is monotonically decreasing in $y$ at fixed $x$, so that the smaller the input p-values, the larger the statistic and the stronger the evidence against the null hypothesis. In other words, gGOF is one-sided in nature. Note that $\frac{i}{n}$, instead of $\frac{i}{n+1}$, is used to represent the null mean of $P_{(i)}$, following the tradition of the GOF definition. In fact, either fraction can be used, and they impose no practical difference.
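To make the definition concrete, the sketch below evaluates a gGOF statistic on a vector of p-values, using the HC2004 contrast function defined later in (1.11) as the example f. It is a minimal illustration assuming Python with NumPy; the function names ggof_stat and f_hc2004 are ours and not from any published package.

```python
import numpy as np

def f_hc2004(x, y, n):
    # HC2004 contrast from (1.11): sqrt(n) * (x - y) / sqrt(y(1 - y))
    return np.sqrt(n) * (x - y) / np.sqrt(y * (1.0 - y))

def ggof_stat(pvals, f, k0=1, k1=None, a0=0.0, a1=1.0):
    """Generic gGOF statistic (1.1): supremum of f(i/n, P_(i)) over R."""
    n = len(pvals)
    if k1 is None:
        k1 = n // 2                      # a common choice, R = {1 <= i <= n/2}
    p = np.sort(pvals)
    i = np.arange(1, n + 1)
    # the truncation domain R restricts both the rank i and the magnitude P_(i)
    in_R = (i >= k0) & (i <= k1) & (p > a0) & (p < a1)
    return f(i[in_R] / n, p[in_R], n).max()

# Example: HC2004 on null p-values
rng = np.random.default_rng(1)
p = rng.uniform(size=100)                # H0: P_i iid Uniform[0, 1]
print(ggof_stat(p, f_hc2004))
```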
For both theoretical and practical reasons, it is important to allow a general truncation domain R that restricts both the index i and the magnitude of the p-values $P_{(i)}$.
Aside from the benefit of computational efficiency (e.g., large input p-values can be safely truncated since they are unlikely to be signals), some gGOF statistics can be improved by excluding overly small p-values. For example, HC can have the long-tail problem under the null due to small p-values, and removing $P_{(1)}, P_{(2)}$, etc., may not guarantee a fix [4, 12]. It is better to directly restrict the magnitude of the p-values, and thus a modified version of HC was created with $R = \{1 < i \le n/2, P_{(i)} \ge 1/n\}$
[4, 13]. The significant influence of restricting $P_{(i)} \ge 1/n$ under finite n is further demonstrated in Section 1.4.
Concerning the power calculation problem, this chapter considers general hypotheses for the input statistics $T_1, ..., T_n$:
$$H_0 : T_i \overset{iid}{\sim} F_0, \qquad H_1 : T_i \overset{iid}{\sim} F_1, \quad i = 1, ..., n, \tag{1.3}$$
where $F_j$, j = 0, 1, denote the cumulative distribution functions (CDFs). $F_0$ and $F_1$ can be arbitrary continuous distributions, such as the Gaussian model often assumed in theoretical studies, or t- or chi-squared distributions, which are ubiquitous in practical data analysis. The input p-values for gGOF are obtained from the input statistics. Without loss of generality, the p-values are defined as
$$P_i = 1 - F_0(T_i). \tag{1.4}$$
A few remarks should be made regarding the setting of the study. First, the
p-value definition in (1.4) actually covers two-sided p-values. Since $F_0$ is allowed to be arbitrary, if the signs of the input statistics have meaningful directionality, the statistics can simply be replaced by $T_i' = T_i^2 \sim F_0'$. Therefore the framework of this chapter allows detecting directional signals, e.g., both protective and deleterious effects of mutations
in genetic association studies. Second, the hypotheses are defined based on input
statistics rather than input p-values because that is more convenient for meaningfully modeling
"signal patterns." Following the tradition of statistical power studies,
signal patterns are defined by the distinction between $F_0$ and $F_1$, often through their
parameters. More details on the interpretation of the hypotheses are given in Section 2.2.
Third, the iid assumption in (1.3) is for the convenience of power calculation. If
p-value calculation is the only concern in real data analysis, the $T_i$'s are allowed to have
different distributions, for which the null hypothesis in (1.3) can be generalized to
$$H_0 : T_i \sim F_{0i}, \quad i = 1, ..., n. \tag{1.5}$$
As long as the $T_i$'s are independent, the p-values obtained in (1.4) still satisfy the null in (1.2), which is the only requirement for the p-value calculation. That is, our p-value calculation methods can be applied to meta-analysis or integrative analysis of heterogeneous data, where input test statistics may follow significantly different distributions.
One application of gGOF is the SNP-set-based association test for finding disease genes. Here each gene (or other meaningful genome segment) is a subject of testing.
To decide whether a gene is associated with a phenotype trait, a gGOF statistic is calculated from input p-values that measure the associations between the trait and each single nucleotide polymorphism (SNP) within that gene. Section 1.5 provides a general framework for analyzing GWAS data of either quantitative or categorical traits. For a real GWAS of Crohn's disease, Figure 1.1 shows that the framework and the analytical p-value calculation work well for four classic weak-sparse optimal tests.
Details on the data analysis and the putative disease genes are given in Section 1.5.
1.1.2 Connection to Relevant Literature
Analytical calculation of p-values and statistical power possesses significant advantages over studies based on Monte Carlo simulations or permutations, not only for faster and more precise computation but also for deeper understanding. In particular, empirical p-values from simulations and permutations are discrete, and their accuracy and stability depend on the number of simulations or permutations. In big data analysis such as GWAS, very small p-values are needed to control the error rate over enormous numbers of simultaneous tests. That requires huge numbers of repeated simulations to stably obtain very small p-values [14, 6]. Even worse, when genetic mutations are rare, more permutations cannot break the ties among p-values. As for the study of statistical power, compared to blackbox-like simulations, analytical calculations can provide mathematical insights that elucidate how the signal-defining parameters affect the statistic's distributions and its power. These insights could be helpful in designing better statistics.
Analytical power calculation involves calculating the statistic's distributions under both H0 and H1. To the best of our knowledge, there is no satisfactory work yet that calculates the alternative distribution of gGOF under finite n. As for the null distribution, there is indeed a rich literature on p-value calculations for Kolmogorov-Smirnov type statistics.

Figure 1.1: The association p-values for genes by exact calculation of four asymptotically weak-sparse-optimal tests. First row: HC2004 and B-J; second row: reverse B-J and HC2008.

Two main strategies have been used: exact calculation and approximation. For exact calculation, various recursive methods (e.g., Noe's recursion [15, 16], Bolshev's recursion [17, 18], Steck's recursion [19, 20], Ruben's recursion [21], etc.) were developed to calculate the null distribution. These methods cover a more recent work that specifically dealt with HC [22]. Such recursive methods were designed for non-truncated statistics (i.e., R = {1 ≤ i ≤ n} in (1.1)) and require a computational complexity of O(n³). Denuit et al. [23] provided an exact calculation that unified these recursive methods, allowed R = {k0 ≤ i ≤ k1}, and brought the complexity down to O(n²). It should be noted that Denuit's method also fully covers a result given in a later paper [12]. Our results on exact calculation were developed independently of [23] and [12]. We share a similar idea of utilizing the joint distribution of ordered uniform random variables, but the differences are significant. First, for the case R = {k0 ≤ i ≤ k1}, our computational complexity is further reduced to $O((k_1 - k_0)^2)$, where $k_1 - k_0$ could be much smaller than n. Second, we provide results for truncation by $P_{(i)}$, which cannot be trivially extended from the results in [23, 12]. As discussed above, truncation by $P_{(i)}$ has a naturally different influence on the statistic from truncation by i; in fact, the corresponding computational complexities are quite different (see Section 1.3.2). Third, the main focus of our work is not only calculating p-values but also statistical power.
As for approximating the null distribution, it is well known that Kolmogorov-Smirnov type statistics converge in law to an extreme-value distribution [24, 25]. However, this convergence is too slow to be accurate even for moderately large n [22]. Recently, Li and Siegmund (LS) [13] developed an asymptotic approximation for HC and B-J, which performs well at the right tail but not at the left tail. See Figure 1.2 for an example. This natural limitation also prevents the LS method from being used for power calculation. In this chapter we also study distribution approximation in order to further simplify computation and to reveal insights into gGOF performance. We give a sufficient condition for LS-type asymptotics to work for the gGOF family under the general hypotheses in (1.3). Furthermore, we propose to use a gamma approximation instead of the beta approximation used by the original LS. Our formula retains the same accuracy at the right tail and can improve accuracy over the whole distribution at small n.

Figure 1.2: Comparison among different methods for calculating the right-side probability of the modified HC test (MHC) in (1.8) with $R = \{1 < i \le n/2, P_{(i)} \ge 1/n\}$ [4] under $H_0 : T_i \overset{i.i.d.}{\sim} N(0, 1)$. Simu: curve obtained by simulation; Exact: by Corollary 1; Li&Siegmund: by [13].
This chapter is organized as follows. In Section 1.2 we review the literature on asymptotically optimal tests for weak-sparse signals and illustrate their connection with the gGOF family. The analytical results are presented in Section 1.3 for both exact and approximate calculations. Through simulations, Section 1.4 numerically demonstrates the calculation accuracy and provides systematic power comparisons among these asymptotically optimal tests. We show the application of the gGOF tests to a real GWAS in Section 1.5. In Section 1.6 we discuss the limitations of this work and future plans. All proofs and supporting lemmas are given in Appendix A.
1.2 The gGOF Family for Weak-Sparse Signals
The signal detection problem is a set-testing problem that combines the input
p-values $P_1, ..., P_n$ of a set of input statistics $T_1, ..., T_n$ into one summary statistic,
which is then used to test whether there exist "signals". Signals are characterized
by the contrast between the alternative and the null hypothesis of the whole set. As a special case of (1.3), a classic setting of the null and the alternative is the Gaussian mixture model:
$$H_0 : T_i \sim F_0 = \Phi, \qquad H_1 : T_i \sim F_1 = \varepsilon\Phi_\mu + (1 - \varepsilon)\Phi, \quad i = 1, ..., n, \tag{1.6}$$
where $\Phi$ and $\Phi_\mu$ are the CDFs of N(0, 1) and N(µ, 1), respectively. $H_1$ indicates that
an $\varepsilon \in (0, 1)$ proportion of the n input statistics are true "signals" (e.g., disease markers) with strength µ [4, 26, 6]. The setting is also consistent with meta-analysis, where
$H_1$ could indicate that an $\varepsilon$ proportion of n studies are true positives (e.g., differential gene expressions) with effect size µ [27, 28]. The summary statistic usually combines p-values rather than the input statistics because p-values directly measure statistical significance, no matter how different the data scales are. When the distribution of a statistic is known, its p-value provides the same information as the statistic itself.
Under the asymptotic rare and weak (ARW) setting, i.e., the parameters in (1.6)
are regulated as $\varepsilon_n = n^{-\alpha}$, $\alpha \in (1/2, 1)$, $\mu_n = \sqrt{2r\log(n)}$, $r \in (0, 1)$, a few seminal
studies [4, 29, 30, 11] have discovered the asymptotic detection boundary in terms of
a function curve of the signal strength and sparsity:
$$r = \rho^*(\alpha) = \begin{cases} \alpha - 1/2, & 1/2 < \alpha \le 3/4, \\ (1 - \sqrt{1 - \alpha})^2, & 3/4 < \alpha < 1. \end{cases} \tag{1.7}$$
When the signal-representing parameters (α, r) of the input statistics are below the
curve, $H_0$ and $H_1$ merge asymptotically as n → ∞. That is, no statistical method can reliably
detect the signals because they are too weak. Whenever (α, r) is above the curve, the
asymptotically optimal tests are asymptotically powerful in the sense that they are
capable of making both the type I and type II error rates converge to zero as n → ∞.
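As a side note, the boundary (1.7) is simple to evaluate numerically, which is convenient when reproducing power studies along the boundary such as Figure 1.8 below. A minimal sketch, assuming Python with NumPy (the helper name detection_boundary is ours):

```python
import numpy as np

def detection_boundary(alpha):
    """ARW detection boundary r = rho*(alpha) in (1.7), for alpha in (1/2, 1)."""
    alpha = np.asarray(alpha, dtype=float)
    return np.where(alpha <= 0.75,
                    alpha - 0.5,
                    (1.0 - np.sqrt(1.0 - alpha)) ** 2)

# signal strength mu_n = sqrt(2 r log n) just above the boundary at alpha = 0.8
alpha, n = 0.8, 1000
r = detection_boundary(alpha) + 0.01
print(np.sqrt(2 * r * np.log(n)))   # roughly the minimal detectable strength
```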
A particular optimal statistic is the Higher Criticism (HC) statistic [31, 4]:
$$HC_{n,R} = \sup_R \sqrt{n}\, \frac{i/n - P_{(i)}}{\sqrt{P_{(i)}(1 - P_{(i)})}}, \tag{1.8}$$
where R has several versions as special cases of that in (1.1) [4, 13]. Note that in the literature [4, 10, 22] the HC formula can also be written as (assuming the input p-values are one-sided)
$$HC = \sup_{t \in R^*} \frac{\sum_i 1\{T_i > t\} - n\bar{\Phi}(t)}{\sqrt{n\bar{\Phi}(t)\Phi(t)}}, \tag{1.9}$$
where $\bar{\Phi}(t) = 1 - \Phi(t)$. We do not follow this formula because it is restricted to the hypothesis setting $F_0 = \Phi$, and the supremum domain $R^*$ on t is equivalent to R on $P_{(i)}$ (but not on the index i).
A variety of versions of the HC statistics, the Berk-Jones (B-J) type statistics, a spectrum of φ-divergence statistics, etc., have all been proven asymptotically optimal [4, 5, 32, 9]. These statistics can be considered goodness-of-fit (GOF) statistics, which by definition test the distinction between given data and a given distribution (i.e., whether the input statistics "fit" the null distribution well). P-values that are smaller than their null expectations are evidence against the null. The simplest GOF statistic is the one-sided Kolmogorov-Smirnov test statistic (cf. [33], page 447; denoted KS⁺ here), which directly measures the difference between $P_{(i)}$ and $i/n$. Under the roof of the gGOF family in (1.1), the KS⁺ statistic corresponds to the contrast function
$$f_{KS^+}(x, y) = x - y. \tag{1.10}$$
Because smaller p-values are more likely to indicate the alternative, the absolute difference $i/n - P_{(i)}$ should be reweighted with regard to $P_{(i)}$ or $i/n$. Such rescaled KS tests are related to the Higher Criticism (HC) statistics proposed in 2004 and 2008 [4, 5], respectively, where the f functions are defined as
$$f_{HC^{2004}}(x, y) = \sqrt{n}\,\frac{x - y}{\sqrt{y(1-y)}}; \qquad f_{HC^{2008}}(x, y) = \sqrt{n}\,\frac{x - y}{\sqrt{x(1-x)}}. \tag{1.11}$$
The HC2004 statistic is similar to the Anderson-Darling statistic [34], but is more general due to its definition based on p-values and the truncation domain R for improved performance [4, 12].
Jager and Wellner introduced a collection of φ-divergence statistics [9]; each one of them is based on a contrast function at a given s:
$$f_s^\phi(x, y) = \frac{1}{s(1-s)}\big(1 - x^s y^{1-s} - (1-x)^s (1-y)^{1-s}\big), \quad s \ne 0, 1,$$
$$f_1^\phi(x, y) = x \log\Big(\frac{x}{y}\Big) + (1-x)\log\Big(\frac{1-x}{1-y}\Big), \tag{1.12}$$
$$f_0^\phi(x, y) = y \log\Big(\frac{y}{x}\Big) + (1-y)\log\Big(\frac{1-y}{1-x}\Big).$$
At certain s values (e.g., s = 2 or −1) these statistics are two-sided in the sense that switching the values of $x = i/n$ and $y = P_{(i)}$ gives the same statistic. However, as mentioned above, because smaller p-values indicate signals, we consider the one-sided version of the φ-divergence statistics. A simple adjustment of the f function is
$$f_s(x, y) = \begin{cases} \sqrt{2n f_s^\phi(x, y)}, & y \le x, \\ -\sqrt{2n f_s^\phi(x, y)}, & y > x. \end{cases} \tag{1.13}$$
Now for all s, $f_s(x, y)$ is guaranteed to be decreasing in y. Such one-sided φ-divergence statistics cover HC exactly: $f_2 = f_{HC^{2004}}$ and $f_{-1} = f_{HC^{2008}}$. Also, s = 1 and s = 0 correspond to the Berk-Jones statistic [8, 4, 13] and the reverse Berk-Jones statistic [8], respectively.
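The one-sided family (1.13) is also straightforward to code. The sketch below (Python with NumPy assumed; helper names ours) implements $f_s$ and checks the identity $f_2 = f_{HC^{2004}}$ numerically:

```python
import numpy as np

def phi_div(x, y, s):
    """The phi-divergence contrast in (1.12)."""
    if s == 1:
        return x * np.log(x / y) + (1 - x) * np.log((1 - x) / (1 - y))
    if s == 0:
        return y * np.log(y / x) + (1 - y) * np.log((1 - y) / (1 - x))
    return (1 - x**s * y**(1 - s) - (1 - x)**s * (1 - y)**(1 - s)) / (s * (1 - s))

def f_s(x, y, n, s):
    """One-sided phi-divergence contrast (1.13); decreasing in y for all s."""
    root = np.sqrt(2 * n * phi_div(x, y, s))
    return np.where(y <= x, root, -root)

# s = 2 and s = -1 recover HC2004 and HC2008; s = 1 and 0 give B-J and reverse B-J
x, y, n = 0.3, 0.1, 100
print(f_s(x, y, n, 2), np.sqrt(n) * (x - y) / np.sqrt(y * (1 - y)))  # equal
```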
1.3 Analytical Results
1.3.1 Calculation Strategy
In the following we first summarize the general idea of the calculation. Then,
specific strategies for obtaining the exact or approximated distributions will follow
under various settings and assumptions.
First, regarding the hypotheses in (1.3), for any given continuous CDFs $F_0, F_1$ we define a monotone transformation function on the domain [0, 1]:
$$D(x) = \begin{cases} x & \text{under } H_0, \\ 1 - F_1(F_0^{-1}(1 - x)) & \text{under } H_1. \end{cases} \tag{1.14}$$
Note that for any p-value $P_i$, $D(P_i) \sim \text{Uniform}[0, 1]$ under either $H_0$ or $H_1$.

Second, regarding the gGOF statistic $S_{n,R}$ in (1.1), for each fixed x define the inverse of the contrast function $f(x, y)$:
$$g(x, \cdot) = f^{-1}(x, \cdot). \tag{1.15}$$
For example, the g functions for the HC statistics defined in (1.11) at any constant b are
$$g_{HC^{2004}}(x, b) = \frac{1}{1 + b^2/n}\left[x + \frac{b^2/n - (b/\sqrt{n})\sqrt{b^2/n + 4x(1-x)}}{2}\right], \qquad g_{HC^{2008}}(x, b) = x - (b/\sqrt{n})\sqrt{x(1-x)}. \tag{1.16}$$
In general, if the closed form of the g function is not available, it can always be found numerically since $f(x, y)$ is strictly decreasing in y.
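For instance, a root-finder suffices to invert f numerically when no closed form exists; the sketch below (SciPy assumed; helper names ours) recovers the closed-form $g_{HC^{2004}}$ in (1.16) by solving $f(x, y) = b$ in y:

```python
import numpy as np
from scipy.optimize import brentq

def f_hc2004(x, y, n):
    return np.sqrt(n) * (x - y) / np.sqrt(y * (1.0 - y))

def g_numeric(f, x, b, n, lo=1e-12, hi=1.0 - 1e-12):
    """Invert y -> f(x, y) at level b; f is strictly decreasing in y,
    so g(x, b) is the unique root of f(x, y) - b on (0, 1)."""
    return brentq(lambda y: f(x, y, n) - b, lo, hi)

# compare the numerical inverse with the closed form g_HC2004 in (1.16)
n, x, b = 100, 0.3, 2.0
c = b**2 / n
y_closed = (x + (c - (b / np.sqrt(n)) * np.sqrt(c + 4 * x * (1 - x))) / 2) / (1 + c)
print(g_numeric(f_hc2004, x, b, n), y_closed)   # the two agree
```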
Now, under either $H_0$ or $H_1$, the CDF of $S_{n,R}$ is
$$P(S_n \le b) = P\Big(\sup_R f\big(\tfrac{i}{n}, P_{(i)}\big) \le b\Big) = P\Big(\bigcap_R \big\{P_{(i)} > g(\tfrac{i}{n}, b)\big\}\Big) = P\big(D(P_{(i)}) > D(g(\tfrac{i}{n}, b)) \text{ for all } i \text{ and } P_{(i)} \text{ in } R\big). \tag{1.17}$$
For both exact and approximate calculation of the distributions, we take advantage of the fact that under either $H_0$ or $H_1$, $U_{(i)} := D(P_{(i)})$ is the $i$th order statistic of Uniform[0, 1], and we study the joint distribution of the $U_{(i)}$ under the restriction R in different ways.
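The boundary-crossing representation (1.17) can be checked directly by Monte Carlo. The sketch below (Python with NumPy assumed; the function name is ours) does so for the simple KS⁺ case, where under $H_0$ the D function is the identity and $g(x, b) = x - b$:

```python
import numpy as np

def ks_plus_cdf_mc(n, b, k1=None, n_sim=100_000, seed=0):
    """Monte Carlo version of (1.17) for KS+ under H0:
    P(S <= b) = P(P_(k) > k/n - b for all k in R), with R = {1 <= k <= k1}."""
    k1 = n if k1 is None else k1
    rng = np.random.default_rng(seed)
    U = np.sort(rng.uniform(size=(n_sim, n)), axis=1)   # uniform order statistics
    u = np.arange(1, n + 1) / n - b                     # boundaries u_k = g(k/n, b)
    u[k1:] = -np.inf                                    # ranks outside R: unrestricted
    return np.mean(np.all(U > u, axis=1))

print(ks_plus_cdf_mc(n=50, b=0.2))
```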
To simplify the presentation, we list below the notations to be referred to later on.

(N1) $u_k := D(g(\frac{k}{n}, b) \vee \alpha_0)$ based on equations (1.14)–(1.17), where $\alpha_0 \ge 0$ is the lower-bound constant for truncating $P_{(i)}$ in (1.1).

(N2) $\bar{F}_{B(\alpha,\beta)}(x)$ denotes the survival function of the Beta(α, β) distribution.

(N3) $F_{\Gamma(\alpha)}(x)$ and $\bar{F}_{\Gamma(\alpha)}(x)$ denote the CDF and survival function of the Gamma(α, 1) distribution, respectively, where the shape parameter is α and the scale parameter is 1.

(N4) Based on notation (N3), define $h_k(x) := x F_{\Gamma(k-1)}(kx) - F_{\Gamma(k)}(kx)$.

(N5) $f_{P(\lambda)}(x)$ denotes the probability mass function of the Poisson(λ) distribution.
1.3.2 Exact Calculations of gGOF Distributions
In this section we provide calculation methods for the exact distribution of any
gGOF statistic in (1.1) under either H0 or H1 in (1.3). Accordingly, p-values and
statistical power can be calculated exactly. In the following, each theorem
concerns a specific truncation domain R. The first theorem is for truncation based
on the index i only. For example, the initial HC was defined with R = {1 ≤ i ≤ n/2}
[4].
Theorem 1.3.1. Consider any gGOF statistic in (1.1) with $R = \{k_0 \le i \le k_1\}$ for given $1 \le k_0 \le k_1 \le n$. Let $m = n - k_1 + 1$. Following notations (N1) and (N2), define
$$a_{k_1} = \frac{n!}{(n-k_1+1)!}\,\bar{F}_{B(1,m)}(u_{k_1}), \quad\text{and}\quad a_k = \frac{n!}{(n-k+1)!}\,\bar{F}_{B(k_1-k+1,\,m)}(u_{k_1}) - \sum_{j=1}^{k_1-k} \frac{u_{k+j-1}^{\,j}}{j!}\, a_{k+j}, \quad k = k_1 - 1, ..., 1.$$
Under either $H_0$ or $H_1$, we have
$$P(S_{n,R} \le b) = \bar{F}_{B(k_1,m)}(u_{k_1}) - \sum_{i=k_0}^{k_1-1} \frac{u_i^{\,i}}{i!}\, a_{i+1}.$$
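For illustration, the following is a direct transcription of the recursion in Theorem 1.3.1 (Python with SciPy assumed; the function name is ours). As noted in the Discussion, the recursion can lose significant digits when $k_1 - k_0$ is large, so this naive version should only be trusted for moderate n:

```python
from math import lgamma, exp, factorial
from scipy.stats import beta

def ggof_cdf_exact(u, k0, k1, n):
    """Theorem 1.3.1 for R = {k0 <= i <= k1}. u is 1-indexed via u[k-1] and
    holds u_k = D(g(k/n, b) v alpha_0), k = 1..k1. Returns P(S_{n,R} <= b)."""
    m = n - k1 + 1
    perm = lambda k: exp(lgamma(n + 1) - lgamma(n - k + 2))   # n!/(n-k+1)!
    a = {k1: perm(k1) * beta.sf(u[k1 - 1], 1, m)}
    for k in range(k1 - 1, k0, -1):       # backward recursion down to a_{k0+1}
        s = sum(u[k + j - 2] ** j / factorial(j) * a[k + j]
                for j in range(1, k1 - k + 1))
        a[k] = perm(k) * beta.sf(u[k1 - 1], k1 - k + 1, m) - s
    return beta.sf(u[k1 - 1], k1, m) - sum(
        u[i - 1] ** i / factorial(i) * a[i + 1] for i in range(k0, k1))

# Example: null CDF of KS+ at b, where u_k = max(k/n - b, 0) under H0
n, b, k1 = 20, 0.25, 20
u = [max(k / n - b, 0.0) for k in range(1, k1 + 1)]
print(ggof_cdf_exact(u, 1, k1, n))
```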
It should be noted that the computational complexity of this formula is $O((k_1 - k_0)^2)$, which is a significant reduction from the $O(n^2)$ of [23] and [12], especially when $(k_1 - k_0) = o(n)$.

The next theorem concerns truncation based on the magnitude of the p-values, i.e., $R = \{\alpha_0 \le P_{(i)} \le \alpha_1\}$. For example, the HC statistic defined in (1.9) has the equivalent truncation, which is required in proving some theoretical properties (see [10] for example). Also, such truncation is needed to resolve the long-tail problem of HC [4, 12].
Theorem 1.3.2. Consider any gGOF statistic in (1.1) with $R = \{\alpha_0 \le P_{(i)} \le \alpha_1\}$ for given $0 \le \alpha_0 < \alpha_1 \le 1$. Following notations (N1) and (N2), define
$$\beta_0 = D(\alpha_0), \qquad \beta_1 = D(\alpha_1), \qquad c_{ij} = \frac{\beta_0^{\,i-1}(1-\beta_1)^{\,n-j+1}}{(i-1)!\,(n-j+1)!},$$
$$a_j(k) = \frac{n!}{(j-k)!}\,\beta_1^{\,j-k}\,\bar{F}_{B(j-k,1)}\Big(\frac{u_{j-1}}{\beta_1}\Big) - \sum_{l=1}^{j-k} \frac{u_{k+l-1}^{\,l}}{l!}\, a_j(k+l), \quad\text{and}\quad a_j(j) = 0,$$
for $1 \le i \le n$, $i < j \le n+1$, $k = 1, ..., j-1$. Under either $H_0$ or $H_1$, we have
$$P(S_{n,R} \le b) = \sum_{i=1}^{n} \sum_{j=i+1}^{n+1} c_{ij}\, a_j(i).$$
Comparing Theorems 1.3.1 and 1.3.2, we can see that truncation imposed on $P_{(i)}$ requires much more complicated computation than truncation imposed on i. The complexity of the formula in Theorem 1.3.2 is O(n³) (or, more precisely, O(n³/6)). Next, the following theorem provides the exact calculation under the most general R defined in (1.1), where the truncation applies to both the index and the p-values.
Theorem 1.3.3. Consider any gGOF statistic in (1.1) with R = {α0 ≤ P(i) ≤
α1} ∩ {k0 ≤ i ≤ k1} for given 1 ≤ k0 ≤ k1 ≤ n and 0 ≤ α0 < α1 ≤ 1. Follow
notations (N1) and (N2) and those in Theorem 1.3.2. Define
$$\tilde{i} = i \vee k_0, \qquad \tilde{j} = j \wedge (k_1 + 1), \qquad \tilde{\beta}_0 = \beta_0\, I\{i \ge k_0\}, \qquad a_j(\tilde{j}) = 0,$$
for $1 \le i \le k_1$, $\tilde{i} < j \le n + 1$, $k = 1, ..., \tilde{j} - 1$. Under either $H_0$ or $H_1$, we have
$$P(S_{n,R} \le b) = \sum_{i=1}^{k_1} \sum_{j=\tilde{i}+1}^{n+1} c_{ij} \left[ \frac{n!\,(\beta_1 - \tilde{\beta}_0)^{j-i}}{(j-i)!}\, \bar{F}_{B(\tilde{j}-i,\, j-\tilde{j}+1)}\!\left(\frac{u_{\tilde{j}-1} - \tilde{\beta}_0}{\beta_1 - \tilde{\beta}_0}\right) - \sum_{k=\tilde{i}}^{\tilde{j}-1} \frac{(u_k - \tilde{\beta}_0)^{k-i+1}}{(k-i+1)!}\, a_j(k+1) \right].$$

The complexity of the formula in Theorem 1.3.3 is $O(nk_1^2)$; adding the truncation on the index i actually simplifies the computation compared with Theorem 1.3.2. As discussed above, overly small p-values under H0 are a major concern for the performance of some gGOF statistics (e.g., causing the long-tail problem for HC). Thus, the truncation on p-values may be imposed on the lower bound α0 only, which can also significantly reduce the computational complexity. Corollary 1 below addresses this special case of Theorem 1.3.3 with α1 = 1, where the formula complexity reduces to $O(k_1^2)$.

Corollary 1. Consider any gGOF statistic in (1.1) with $R = \{\alpha_0 \le P_{(i)}\} \cap \{k_0 \le i \le k_1\}$ for given $1 \le k_0 \le k_1 \le n$ and $\alpha_0 > 0$. Follow notations (N1) and (N2) and those in Theorems 1.3.1 and 1.3.2. Define $c_i = \frac{\beta_0^{\,i-1}}{(i-1)!}$, $1 \le i \le k_1$. Under either $H_0$ or $H_1$, we have
$$P(S_{n,R} \le b) = \sum_{i=1}^{k_1} c_i \left[ \frac{n!\,(1 - \tilde{\beta}_0)^{n+1-i}}{(n+1-i)!}\, \bar{F}_{B(k_1+1-i,\,m)}\!\left(\frac{u_{k_1} - \tilde{\beta}_0}{1 - \tilde{\beta}_0}\right) - \sum_{k=\tilde{i}}^{k_1-1} \frac{(u_k - \tilde{\beta}_0)^{k+1-i}}{(k+1-i)!}\, a_{k+1} \right].$$

A special case of Corollary 1 is the modified HC in (1.8) with $R = \{1 < i \le n/2, P_{(i)} \ge 1/n\}$ [4, 13]. As shown in Figure 1.2, the LS approximation [13] is good only for the right tail of the distribution under H0. Corollary 1 gives the exact distribution under both H0 and H1.

Obviously, Theorem 1.3.3 addresses the most general truncation and covers the other theorems and the corollary. Based on this general formula, the formula in Theorem 1.3.1 is obtained by fixing i = 1, j = n + 1, α0 = 0, and α1 = 1. It covers the formula of Theorem 1.3.2 by letting k0 = 1, k1 = n, and also covers the formula of Corollary 1 by fixing j = n + 1 and α1 = 1. However, we still separate these formulae and their implementations in order to simplify the computation whenever possible.

1.3.3 Approximation of gGOF Distributions

In this section we study approximation methods for the distributions of gGOF statistics based on appropriate asymptotics that hold good accuracy under small or moderate n. The purpose is to 1) further simplify computation, and 2) reveal more insights for understanding gGOF performance. Two strategies are considered. First, we follow the basic idea of the exact calculation described above, except that a distribution approximation is applied. This strategy maintains the generality of the results and provides the inspiring technique of the gamma approximation for the second strategy, which produces non-iterative one-step formulae. The cost of such further simplified formulae is the requirement of stronger assumptions. Here we study the calculation based on the linearity property of the D function in (1.14). Related to that, we give a sufficient condition for LS-style asymptotics [13] to be workable under the general gGOF family and general hypotheses. Moreover, the gamma approximation much simplifies the proof (compared with the beta approximation in the original LS paper) and could potentially improve the accuracy under some circumstances (e.g., small n). For simplicity of presentation, the theorems below focus on the case $R = \{k_0 \le i \le k_1\}$. The results can be extended to a general R with truncation on $P_{(i)}$.
First, following the idea of the exact calculation, Theorem 1.3.4 below gives a formula based on approximation by the joint gamma distribution.

Theorem 1.3.4. Consider any gGOF statistic in (1.1) with $R = \{k_0 \le i \le k_1\}$. Following notations (N1) and (N3), define
$$d_k = (n+1)\,D\big(g(\tfrac{k}{n}, b)\big), \quad k = k_0, ..., k_1,$$
$$c_k = \bar{F}_{\Gamma(k)}(d_{k_1}) - \sum_{j=1}^{k-1} \frac{d_{k_1-k+j}^{\,j}}{j!}\, c_{k-j}, \quad k = 2, ..., k_1, \quad\text{and}\quad c_1 = \bar{F}_{\Gamma(1)}(d_{k_1}).$$
Under either $H_0$ or $H_1$, we have
$$P(S_{n,R} \le b) = (1+o(1))\Big(\bar{F}_{\Gamma(k_1)}(d_{k_1}) - \sum_{k=k_0}^{k_1-1} \frac{d_k^{\,k}}{k!}\, c_{k_1-k}\Big).$$

Theorem 1.3.4 has no extra advantage over Theorem 1.3.1 in terms of computation and accuracy. However, it shows that the gamma approximation is a good choice under general settings of gGOF statistics and hypotheses, since the formula is quite accurate under finite n (see Section 1.4 for numerical results). This result inspired us to apply the gamma approximation for distribution calculation with further simplified formulae.

Under stronger assumptions, in particular if $D(g(\frac{k}{n}, b))$ is a linear or near-linear function of k, we can provide a one-step formula for the distribution calculation. Starting with the exactly linear case, Theorem 1.3.5 gives such a one-step formula that guarantees the same accuracy as Theorem 1.3.4 due to the same gamma approximation.

Theorem 1.3.5. Consider a gGOF statistic in (1.1) with $R = \{1 \le i \le k_1\}$ and $D(g(\frac{k}{n}, b)) = a + \lambda k$ for some $\lambda \ge 0$. Following notations (N3) and (N4), under either $H_0$ or $H_1$, we have
$$P(S_{n,R} \le b) = (1+o(1))\, e^{-a}\big(1 - \lambda + h_{k_1}(\lambda)\big).$$

One example that satisfies the linearity of $D(g(\frac{k}{n}, b))$ is the simple Kolmogorov-Smirnov (KS⁺) statistic in (1.10) under $H_0$, where $a = -(n+1)b$ and $\lambda = \frac{n+1}{n}$. The following corollary summarizes this case.

Corollary 2. Consider the test statistic KS⁺ in (1.10) with $R = \{1 \le i \le k_1\}$. Following notations (N3) and (N4), for $b \le \frac{1}{n}$ we have that, under $H_0$,
$$P(KS^+ \le b) = (1+o(1))\, e^{(n+1)b}\Big(-\frac{1}{n} + h_{k_1}\big(\tfrac{n+1}{n}\big)\Big).$$

In general, the requirement of a linear $D(g(\frac{k}{n}, b))$ is often too stringent. However, if $D(g(\frac{k}{n}, b))$ is close to linear, we can still simplify the calculation of Theorem 1.3.4. In particular, Theorem 1.3.6 below provides a sufficient condition on $D(g(\frac{k}{n}, b))$ under which LS-style asymptotics [13] can be extended to the gGOF family under general hypotheses. Again, we apply the gamma approximation (rather than the beta approximation used in the original LS paper), which has a simpler density function for easier generalization to the gGOF family (note that the LS paper mainly addresses the HC and B-J type statistics). See the supplemental proof [35] for details.

Theorem 1.3.6. Consider any gGOF statistic in (1.1) with $R = \{k_0 \le i \le k_1\}$. Follow notations (N1)–(N5), and define $d_k = (n+1)D(g(\frac{k}{n}, b))$, $d_k' = (n+1)\frac{d}{dx}D(g(\frac{k}{n}, b))$, and $k^* = \min\{k_1 - k, \sqrt{n}\}$. Assume $D(g(x, b))$ satisfies

1. $D(g(x, b)) < 1$ is increasing and convex in x for $\frac{k_0}{n} \le x \le \frac{k_1}{n}$,
2. $\frac{d}{dx} D(g(x, b)) < 1$, and
3. $D(g(k/n, b)) < \frac{k}{n+1}$, for $k > 1$ and large n.

Under either $H_0$ or $H_1$ in (1.3), we have
$$P(S_{n,R} \ge b) = (1+o(1)) \sum_{k=k_0}^{k_1} \Big(1 - \frac{d_k'}{n} + h_{k^*}\big(\tfrac{d_k'}{n}\big)\Big)\, f_{P(d_k)}(k).$$

This sufficient condition on $D(g(\frac{k}{n}, b))$ can be partially satisfied by HC2004 under $H_0$, for which $D(g(\frac{k}{n}, b)) = g(x, b)$ is given in (1.16). The result is formally stated in Corollary 3 below, which basically says that the condition is satisfied on the right tail when b is of the order $O(\sqrt{n})$.
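Before stating Corollary 3, note that the one-step formulae above reduce to a few special-function evaluations. A minimal sketch of Corollary 2, assuming SciPy (helper names ours; valid only in the stated regime $b \le 1/n$):

```python
from math import exp
from scipy.stats import gamma

def h(k, x):
    """h_k(x) = x * F_Gamma(k-1)(k x) - F_Gamma(k)(k x), notation (N4); k >= 2."""
    return x * gamma.cdf(k * x, k - 1) - gamma.cdf(k * x, k)

def ks_plus_cdf_onestep(n, b, k1):
    """Corollary 2: P(KS+ <= b) ~ e^{(n+1)b} (-1/n + h_{k1}((n+1)/n))."""
    return exp((n + 1) * b) * (-1.0 / n + h(k1, (n + 1) / n))

print(ks_plus_cdf_onestep(n=50, b=0.01, k1=50))
```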
Corollary 3. Consider the HC2004 statistic in (1.11) with $R = \{k_0 \le i \le k_1\}$. Let $b_0 = \frac{b}{\sqrt{n}}$ be a positive constant $> 2x - 1$, $\frac{k_0}{n} < x < \frac{k_1}{n}$. Define
$$g(x, b_0) = \frac{1}{1+b_0^2}\Big[x + \big(b_0^2 - b_0\sqrt{b_0^2 + 4x(1-x)}\big)/2\Big], \qquad g'(x, b_0) = \frac{1}{1+b_0^2}\Big[1 - \frac{b_0(1-2x)}{\sqrt{b_0^2 + 4x(1-x)}}\Big].$$
Following notations (N4) and (N5), under $H_0$ we have
$$P(HC^{2004} \ge b) = (1+o(1)) \sum_{k=k_0}^{k_1} \Big(1 - g'\big(\tfrac{k}{n}, b_0\big) + h_{k^*}\big(g'(\tfrac{k}{n}, b_0)\big)\Big)\, f_{P(g(\frac{k}{n}, b_0)\,n)}(k).$$

The formula of Corollary 3 is different from that given in Li and Siegmund [13]. However, both formulae require the threshold $b = O(\sqrt{n})$. Thus, in theory neither method gives the whole distribution. However, as shown in Figure 1.3, our formula based on the gamma approximation can be closer over the whole distribution under small n. Meanwhile, the accuracy also depends on the linear approximation of the $D(g(\frac{k}{n}, b))$ function, which can hardly hold under a general $H_1$. Thus this type of calculation has a natural limitation for being used to calculate statistical power.

1.4 Numerical Results

In this section we first demonstrate the accuracy of our methods by comparing the calculations with Monte Carlo simulations under various settings of $H_0$ and $H_1$. Then, based on the calculations, we compare the finite-n performance of the asymptotically optimal tests over various signal patterns. Unless specified otherwise, results reported below were based on the truncation domain $R = \{1 \le i \le n/2\}$ and the number of simulations was set at 5,000.

1.4.1 Calculation Accuracy for gGOF Distributions

Our calculation methods can handle the general hypothesis setting in (1.3) with input statistics of arbitrary continuous distributions. In this section we evaluate how accurately our calculation methods construct the distribution curves of the HC statistic, as an example of gGOF, under various $H_0$ and $H_1$.

First we calculate the null distribution of the HC statistic under the general $H_0$ in (1.2). Figure 1.3 shows the right-tail probability of the HC statistic over varying threshold b. Compared with simulation (black solid curves), the exact calculation by Theorem 1.3.1 (cyan dashed curves) is a perfect match. The approximation by Theorem 1.3.4 is fairly accurate over the whole distribution too. The one-step formulae of Li and Siegmund [13] (blue dotted curves) and of Corollary 3 (green dashed curves) provide good approximations for the right tail, and thus can be used for calculating small p-values at large thresholds. Li and Siegmund's formula has a limitation at the left tail of the distribution; the formula of Corollary 3 provides a correction of a sort, which is preferred at small n but is more conservative at large n.

Figure 1.3: Comparison among different calculations for the null distribution of HC. Simulation: curve obtained by simulations; Exact: by Theorem 1.3.1; Approximate: by Theorem 1.3.4; Li&Siegmund: by [13]; Corollary 3: by Corollary 3.

Now we assess the accuracy of calculating the alternative distribution of the HC statistic. Assume the input statistics come from either of the following mixture models:
$$H_1 : T_i \overset{i.i.d.}{\sim} (1-\varepsilon)N(0, 1) + \varepsilon N(1, 1), \quad\text{or}\quad H_1 : T_i \overset{i.i.d.}{\sim} (1-\varepsilon)N(0, 1) + \varepsilon t_\nu,$$
while the input p-values for gGOF were obtained by $P_i = 1 - \Phi(T_i)$ (i.e., under $H_0 : T_i \overset{i.i.d.}{\sim} N(0, 1)$).
These two alternatives can be roughly interpreted as an ε proportion of "signals" having either a different mean (i.e., N(1, 1)) or a different variance (i.e., Student's t with ν degrees of freedom) compared with the "noise" (i.e., N(0, 1)). Accordingly, Figure 1.4 demonstrates the right-tail probability of the HC statistic (row 1: µ = 1, ε = 0.1; row 2: ν = 5, ε = 0.5). In both cases the exact calculation (Theorem 1.3.1, cyan dashed curves) is perfect, and the approximation (Theorem 1.3.4, red dot-dashed curves) is close to simulation (black solid curves), with its accuracy increasing with n.

Besides the normal distributions, we also assessed four non-normal settings studied in the initial paper of HC [4]. The first setting regards a chi-squared mixture model:
$$H_0 : T_i \overset{i.i.d.}{\sim} \chi^2_\nu(0), \quad\text{vs.}\quad H_1 : T_i \overset{i.i.d.}{\sim} (1-\varepsilon)\chi^2_\nu(0) + \varepsilon\chi^2_\nu(\delta),$$
where ν is the degrees of freedom and δ is the non-centrality parameter. The second setting is a Student's t mixture model:
$$H_0 : T_i \overset{i.i.d.}{\sim} t_\nu(0), \quad\text{vs.}\quad H_1 : T_i \overset{i.i.d.}{\sim} (1-\varepsilon)t_\nu(0) + \varepsilon t_\nu(\delta).$$
The third setting is a chi-squared-exponential mixture model:
$$H_0 : T_i \overset{i.i.d.}{\sim} \exp(\nu), \quad\text{vs.}\quad H_1 : T_i \overset{i.i.d.}{\sim} (1-\varepsilon)\exp(\nu) + \varepsilon\chi^2_\nu(\delta).$$

Figure 1.4: The alternative distribution of the HC statistic under $H_0 : T_i \overset{i.i.d.}{\sim} N(0, 1)$ vs. $H_1 : T_i \sim 0.9N(0, 1) + 0.1N(1, 1)$ (row 1), or $H_1 : T_i \sim 0.5N(0, 1) + 0.5t_5$ (row 2). Column 1: n = 10; column 2: n = 100. Simulation: curve obtained by simulations; Exact: by Theorem 1.3.1; Approximate: by Theorem 1.3.4.

The fourth setting concerns a generalized normal distribution (also known as the power exponential distribution) model:
$$H_0 : T_i \overset{i.i.d.}{\sim} GN_p(0, \sigma), \quad\text{vs.}\quad H_1 : T_i \overset{i.i.d.}{\sim} (1-\varepsilon)GN_p(0, \sigma) + \varepsilon GN_p(\mu, \sigma),$$
where the probability density function of $GN_p(\mu, \sigma)$ is
$$\frac{1}{C_p}\exp\Big(-\frac{|x-\mu|^p}{p\,\sigma^p}\Big), \qquad C_p = 2p^{1/p}\,\Gamma(1+1/p)\,\sigma.$$
Notice that $GN_1(\mu, \sigma)$ is the Laplace distribution and $GN_2(\mu, \sigma)$ is $N(\mu, \sigma^2)$. Each row of Figure 1.5 illustrates the alternative distribution of HC under each of the four settings for n = 10 (left column) and n = 100 (right column). Again, the exact calculation is perfect, and the approximation is fairly accurate, especially when n is large.

Figure 1.5: The alternative distributions of the HC statistic under four non-normal settings for H0 and H1. Column 1: n = 10; column 2: n = 100. Simulation: curve obtained by simulations; Exact: by Theorem 1.3.1; Approximate: by Theorem 1.3.4.

For the one-step calculation formula given by Theorem 1.3.5, the boundary is assumed linear: $D(g(\frac{i}{n}, b)) = a + \lambda k \ge 0$ in (1.17). One example is the KS⁺ statistic in (1.10) under $H_0$. Figure 1.6 demonstrates the accuracy of the calculation based on either a fixed slope λ = 0.5 or a fixed intercept a = 0.5. Here $k_0 = 1$, $k_1 = n = 50$. It shows that this gamma-approximation based one-step formula performs well if the linearity assumption on $D(g(\frac{i}{n}, b))$ is satisfied. As the boundary a + λk increases, the probabilities from both calculation and simulation decrease as expected.

Figure 1.6: Probability in (1.17) with a hypothetical linear boundary function $D(g(\frac{i}{n}, b)) = a + \lambda k$. Simulation: probability obtained by simulations; Approximation: by Theorem 1.3.5. Left panel: fix λ = 0.5 and vary a; right panel: fix a = 0.5 and vary λ.

1.4.2 Comparison of Asymptotically Optimal Tests Under Finite n

As discussed in Section 1.2, the asymptotically optimal methods for weak-sparse signals possess the same asymptotic property. It is of interest to know the performance of these statistics under finite n. Here we focus on the φ-divergence statistics defined in
Simulation: probability obtained by simulations; Approxi- mation: by Theorem 1.3.5. Left panel: fix λ = 0.5 and vary a; right panel: fix a = 0.5 and vary λ. (1.13), which is asymptotically optimal for any statistic-defining parameter s ∈ [−1, 2] [9]. As discussed in Section 2.2, the values of s = 2, 1, 0, −1 correspond to HC2004, the Berk-Jones statistic, the reverse Berk-Jones statistic, and HC2008, respectively. These s values represent a spectrum of gGOF statistics of different performances. First, we show the accuracy of p-value calculations in a similar manner as [13]. Specifically, for each gGOF statistic the thresholds at the significance levels of 10%, 5% and 1% were obtained through calculation (by Theorem 1.3.1). Then at these thresholds the empirical type I error rates were acquired through simulations (10,000 repetitions). As shown in Table 1.1, the close match of the given significance levels and the obtained empirical type I error rates evidences that the calculations for the p-values of these statistics are accurate. Not surprisingly, the accuracy by the Chapter 1: gGOF under Independence 30 Table 1.1: Empirical type I error rates at the calculated thresholds for the significance levels of 10%, 5% and 1%. HC2004: s = 2; B-J: s = 1, reverse B-J: s = 0, and HC2008: s = −1. s n 10% 5% 1% Threshold Emp. Err. Threshold Emp. Err. Threshold Emp. Err. 2 10 3.357 0.992 4.648 0.049 10.088 0.010 50 3.507 0.102 4.714 0.050 10.102 0.011 100 3.539 0.103 4.723 0.049 10.102 0.009 1 10 2.181 0.101 2.504 0.050 3.110 0.011 50 2.408 0.098 2.716 0.048 3.300 0.010 100 2.478 0.104 2.780 0.049 3.354 0.009 0 10 1.750 0.100 1.974 0.049 2.390 0.011 50 2.040 0.101 2.301 0.047 2.803 0.011 100 2.136 0.101 2.402 0.051 2.915 0.010 -1 10 1.618 0.098 1.838 0.051 2.227 0.009 50 1.909 0.099 2.165 0.049 2.662 0.009 100 2.010 0.107 2.271 0.052 2.777 0.010 approximated calculation of [13] requires relatively large n, whereas the calculation by Theorem 1.3.1 is exact and shall be perfectly accurate at any n. Now through power calculation (again by Theorem 1.3.1), we can systematically compare the power of any gGOF statistics. To be consistent with literature, here we focus on the classic normal mixture model in (1.6). With the type I error rate controlled at 5%, Figure 1.7 provides the statistical power of HC2004, B-J, reverse B- J, and HC2008 at various signal patterns represented by parameters (n, µ, ). There are a few interesting observations. First, it seems that at finite n the average num- ber n of signals is more relevant than the proportion of the signals. To see this point, note that columns 1 – 3 of the figure panels correspond to fixed signal numbers n = 5, 25, 50, respectively; each column demonstrates one same pattern of compar- Chapter 1: gGOF under Independence 31 ative performance among these four statistics. Meanwhile, the diagonal of the figure panel matrix correspond to a fixed signal proportion = 0.05 but the comparative performances of the four statistics changed significantly over increased n at differ- ent rows. Similar observations can be seen at fixed = 0.01 or 0.005 but different n. Second, considering signals sparsity / density in terms of signal numbers, within the φ-divergence family, bigger s values are related to better performance for sparser signals (HC2004: s = 2; B-J: s = 1), whereas smaller s values are related to better performances for denser signals (reverse B-J: s = 0, and HC2008: s = −1). 
This is evidenced by the columns of the figure: HC2004 performs best in the first column, and HC2008 performs best in the third column. One possible reason for HC2008 being less powerful for sparse signals is that its statistic weights the expectation-observation difference $i/n - P_{(i)}$ by $i/n$ rather than by $P_{(i)}$ in the denominator, and it is therefore less sensitive to small p-values than HC2004 (see their formulas in (1.11)). Third, with s = 1 in the middle of the optimality parameter space [−1, 2], B-J has a more robust performance over various µ, n, and ε. This robustness of B-J is consistent with the finding of Li and Siegmund [13].

Figure 1.7: Comparison of statistical power. HC2004: s = 2; B-J: s = 1; reverse B-J: s = 0; HC2008: s = −1. Rows 1–4: n = 100, 500, 1000, 5000. Columns 1–3: εn = 5, 25, 50. Type I error rate: 5%.

It is also of interest to compare the performance of these optimal methods along the asymptotic detection boundary given in (1.7) under finite n. As discussed in Section 1.2, when the signal-representing parameters (α, r) are below the curve, signals are too weak to be reliably detectable by any statistics. Whenever these parameters are above the curve, all four optimal tests are asymptotically powerful as n → ∞. Thus, areas right above the detection boundary are the challenging scenario in which optimal methods stand out, since sub-optimal tests will have asymptotically zero power there. Figure 1.8 shows the statistical power of the four optimal methods over the sparsity parameter α ∈ (1/2, 1); the r value is calculated according to equation (1.7). It shows that the statistical power of these methods is in fact significantly different even for very large n. Consistent with Figure 1.7, HC2008 and reverse B-J have similar power curves; they are more powerful for denser signals (at smaller α, corresponding to bigger ε = n^{−α}). HC2004 is more powerful for very sparse signals (at larger α). B-J again shows a more robust performance over all α values.

Figure 1.8: Statistical power along the ARW detection boundary (at type I error rate 5%).

Last but not least, the truncation domain R in (1.1) is very important to the performance of test statistics. In particular, as discussed in the Introduction, truncation based on the p-values $P_{(i)}$ can have extra benefit over truncation based on the index i only [4, 13]. Here we compare HC2004 under R = {1 ≤ i ≤ n/2} with the modified HC (MHC) under $R = \{1 < i \le n/2, P_{(i)} \ge 1/n\}$. Figure 1.9 shows that the MHC performs poorly when the number of signals is small, whereas it improves the performance when the number of signals increases. One reason is that 1/n is fairly large at finite n. By excluding p-values less than 1/n, MHC can easily miss the signal-representing p-values, especially when there are just a few strong true signals with large µ. However, when signals are dense MHC is more powerful, because (A) with high probability some signals (especially the weaker ones) will have p-values larger than 1/n, and (B) removing p-values less than 1/n corrects the long-tail problem of HC [4]. Thus, in practice, when n is not too big the original HC is a better choice for relatively sparser and stronger signals, whereas MHC is better for denser and weaker signals.

Figure 1.9: Power comparison for the HC statistic with R = {1 ≤ i ≤ n/2} and the MHC statistic with $R = \{1 < i \le n/2, P_{(i)} \ge 1/n\}$. Type I error rate: 5%.
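The HC-versus-MHC comparison is easy to reproduce by simulation. The sketch below (Python with NumPy/SciPy assumed; function names and the specific parameter values are ours) estimates power under the Gaussian mixture (1.6) with a simulated null critical value:

```python
import numpy as np
from scipy.stats import norm

def hc_stat(pvals, modified=False):
    """HC2004 over R = {1 <= i <= n/2}; the modified HC (MHC) additionally
    requires i > 1 and P_(i) >= 1/n, as in [4, 13]."""
    n = len(pvals)
    p = np.sort(pvals)
    i = np.arange(1, n + 1)
    keep = i <= n // 2
    if modified:
        keep &= (i > 1) & (p >= 1.0 / n)
    if not keep.any():
        return -np.inf
    return (np.sqrt(n) * (i / n - p) / np.sqrt(p * (1 - p)))[keep].max()

def empirical_power(mu, eps, n=100, n_sim=2000, level=0.05, modified=False):
    """Power under the Gaussian mixture (1.6) with eps*n planted signals."""
    rng = np.random.default_rng(0)
    def stats(signal):
        T = rng.normal(size=(n_sim, n))
        if signal:
            T[:, : int(eps * n)] += mu        # plant the signals
        return np.array([hc_stat(row, modified) for row in norm.sf(T)])
    crit = np.quantile(stats(False), 1 - level)   # simulated null critical value
    return np.mean(stats(True) > crit)

print(empirical_power(3.0, 0.02), empirical_power(3.0, 0.02, modified=True))
```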
1.5 A Framework for GWAS And Application to Crohn's Disease Study

According to the genetics of complex diseases, disease-associated markers usually have moderate to small genetic effects [36]. In genome-wide studies, which tend to screen as many markers as possible, the number of true disease markers often accounts for a small proportion of the total candidate markers. Therefore, it is appealing to apply optimal tests for weak-sparse signals to detect weak genetic effects.

In this section, we provide a general framework for applying the gGOF tests to SNP-set association studies in GWAS data analysis. The input p-values are obtained based on generalized linear models (GLMs), so the framework can handle both quantitative and categorical traits. Here we focus on gene-based tests: each gene is tested separately; input p-values from the group of SNPs within that gene form a gGOF statistic, and then the summary p-value of this statistic is obtained to measure how significantly the gene is associated. Certainly, a similar idea can be straightforwardly extended to SNP-set tests based on other meaningful segments of loci (e.g., pathway-based association studies [37]).

Specifically, assume a gene contains n SNPs. With an appropriate link function, a GLM can be defined as
$$\text{link}(E(Y_k | X_k, Z_k)) = X_k'\beta + Z_k'\gamma, \tag{1.18}$$
where, for the kth subject, k = 1, ..., N, $Y_k$ denotes the trait value (quantitative or categorical), $X_k = (X_{k1}, ..., X_{kn})$ denotes the genotype vector of the n SNPs in the gene, and $Z_k = (Z_{k1}, ..., Z_{km})$ denotes a vector of m controlling variables of environmental and/or other independent genetic factors. The null hypothesis is that none of the SNPs are associated with the trait, and therefore the gene is not associated:
$$H_0 : \beta_i = 0, \quad i = 1, ..., n.$$
Many statistics can be used to test this null hypothesis while controlling for the effects of the other factors represented by $Z_k$. One classic example is a marginal test with statistics given by [38, 39]
$$M_i = \sum_{k=1}^{N} X_{ki}(Y_k - \tilde{Y}_k), \quad i = 1, ..., n,$$
where $\tilde{Y}_k$ is the fitted outcome value (e.g., by least squares or iteratively reweighted least squares) under $H_0$. It can be shown that under $H_0$ the vector of the marginal statistics $M = (M_1, ..., M_n) \overset{D}{\to} N(0, \Sigma)$ as $N \to \infty$. The covariance matrix Σ can be estimated by
$$\hat{\Sigma} = X'WX - X'WZ(Z'WZ)^{-1}Z'WX,$$
where the matrices $X = (X_{ki})$, $Z = (Z_{ki})$, and W is the covariance matrix of Y. In the case of the multiple regression model for quantitative traits, $W = \hat{\sigma}^2 I$, where $\hat{\sigma}^2$ is the least-squares estimate of the residual variance. In the case of the logistic regression model for binary traits, $W = \text{diag}\{\tilde{Y}_k(1 - \tilde{Y}_k), k = 1, ..., N\}$. We can de-correlate M to obtain the input statistics for gGOF:
$$(T_1, ..., T_n) = \hat{\Sigma}^{-\frac{1}{2}} M \overset{D}{\to} N(0, I_{n\times n}),$$
and thus the input p-values are $P_i = 2(1 - \Phi(|T_i|)) \overset{D}{\to} \text{Uniform}[0, 1]$. Then for any gGOF statistic, its p-value can be calculated by the methods given in this chapter for measuring how significantly the gene is associated with the phenotype trait. It should be noted that the input statistics are not required to follow a normal distribution; the calculation methods only require that the input p-values be iid Uniform[0, 1] under the null.
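As an illustration of this pipeline in the quantitative-trait (linear model) case, the sketch below computes the marginal statistics M, the estimate $\hat{\Sigma}$, and the de-correlated input statistics and p-values. It assumes Python with NumPy/SciPy; the function name and the toy data are ours:

```python
import numpy as np
from scipy.stats import norm
from scipy.linalg import fractional_matrix_power

def snp_set_pvalues(X, Z, Y):
    """Linear-model case of (1.18): marginal statistics M = X'(Y - Y_hat),
    Sigma_hat = X'WX - X'WZ (Z'WZ)^{-1} Z'WX with W = sigma2 * I, then
    T = Sigma_hat^{-1/2} M and two-sided p-values P_i = 2(1 - Phi(|T_i|))."""
    beta0, *_ = np.linalg.lstsq(Z, Y, rcond=None)   # fit the null model Y ~ Z
    resid = Y - Z @ beta0
    sigma2 = resid @ resid / (len(Y) - Z.shape[1])
    M = X.T @ resid
    W = sigma2 * np.eye(len(Y))
    Sigma = X.T @ W @ X - X.T @ W @ Z @ np.linalg.solve(Z.T @ W @ Z, Z.T @ W @ X)
    T = (fractional_matrix_power(Sigma, -0.5) @ M).real
    return 2 * norm.sf(np.abs(T))

# toy example: N = 500 subjects, n = 8 SNPs, intercept-only covariate
rng = np.random.default_rng(2)
N = 500
X = rng.binomial(2, 0.3, size=(N, 8)).astype(float)
Z = np.ones((N, 1))
Y = 0.25 * X[:, 0] + rng.normal(size=N)   # SNP 1 carries a weak signal
print(snp_set_pvalues(X, Z, Y).round(3))
```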
That is, other input statistics following t or chi-squared distributions can be used as long as they are not correlated or can be de-correlated.

We applied this gene-based analysis framework to GWAS data on Crohn's disease from the NIDDK-IBDGC (National Institute of Diabetes and Digestive and Kidney Diseases - Inflammatory Bowel Disease Genetics Consortium). It contains 1,145 individuals from a non-Jewish population (572 Crohn's disease cases and 573 controls) [40]. After typical quality control for genotype data, 308,330 somatic SNPs were grouped into 15,857 genes according to their physical locations. As a special case of the GLM in (1.18), the logistic regression model was applied to search for genes associated with Crohn's disease susceptibility. The controlling covariates $Z_k = (1, Z_{k1}, Z_{k2})$ contain an intercept and the first two principal components of the genotype data, which serve the purpose of controlling potential population structure [41]. In case a gene contains only one SNP, no gGOF test was needed.

We examined four gGOF statistics: HC2004, B-J, reverse B-J, and HC2008. Figure 1.1 gives the QQ plots of the gene-based p-values calculated by Theorem 1.3.1. The genomic inflation factors (i.e., the ratios of the empirical median of −log(p-values) vs. the expected median under H0 [42]) are all close to 1, indicating that the genome-wide type I errors were well controlled. Among the four statistics, B-J seemed to have higher power because it yielded more genes significantly above the red diagonal line of the H0-expected p-values.

Among the top-ranked genes, many are relevant to Crohn's disease. In particular, IL23R and CARD15 (also known as NOD2) are well-known Crohn's disease genes [43, 44, 40]. Gene NPTX2 was top-ranked by both HC2004 and B-J. It has not been reported previously through association studies, but could be a putative disease gene because it encodes a neuronal pentraxin, which is related to C-reactive protein [45], an indicator of Crohn's disease activity level [46]. Furthermore, NPTX2 has an important paralog gene, APCS (www.genecards.org), which is related to arthritis, a disease highly correlated with Crohn's disease [47]. Gene SLC44A4 is also related to the pathophysiology of Crohn's disease. Defects in this gene can cause sialidosis [45], a lysosomal storage disease due to a deficiency of sialidase, an enzyme important for various cells to defend against infection [48]. Gene BMP2 was identified by B-J, reverse B-J, and HC2008. This gene could also be relevant because it is associated with digestive phenotypes, especially colon cancer [49, 50]. Certainly, further studies are needed to validate these top-ranked genes.

1.6 Discussion

This chapter provided techniques to calculate the exact and approximated null and alternative distributions of the generic gGOF statistic family. It gives a foundation for applying gGOF statistics in real data analysis, and for studying and comparing important statistics, such as the asymptotically optimal ones, in the finite-n case. A few future studies will be carried out. First, to calculate the exact distribution, the result in Theorem 1.3.1 brings the computational complexity down to $O((k_1 - k_0)^2)$. Meanwhile, when $k_1 - k_0$ is large, the calculation could suffer from the loss of significant digits. The current practice is to truncate the summation to the first 25-30 terms, which yields a fairly accurate result and saves computation time.
This issue could be further addressed by improving the numerical techniques. Second, we will look for better one-step approximations, especially for power calculation. Third, in real data analysis the input statistics are often correlated. It would be desirable to incorporate such correlation into the calculation of p-values and statistical power. We will report the results we have obtained on that front in a separate paper.

Chapter 2

digGOF: Double-Omnibus Innovated Goodness-Of-Fit Tests For Dependent Data Analysis

2.1 Introduction

With a long history and numerous applications, the goodness-of-fit (GOF) test is one of the breakthroughs in statistics [51, 52]. In particular, the GOF test provides a promising tool for signal detection problems in analyzing big data. A collection of GOF statistics, such as the Higher Criticism (HC) type statistics, Berk-Jones (BJ) type statistics, and φ-divergence statistics, have been proven asymptotically optimal for weak-and-rare signals [4, 9, 6]. These GOF statistics can be unified into a general family called gGOF, defined by a generic functional and a general truncation scheme of input p-values $P_1, ..., P_n$ [53]. Under the null hypothesis all p-values are from Uniform(0, 1). Let $P_{(1)} \le ... \le P_{(n)}$ be the ordered p-values. A gGOF statistic measures the supremum departure of $P_{(i)}$ from its null expectation, which is roughly $\frac{i}{n}$:
$$S_{n,f,R} = \sup_R f\Big(\frac{i}{n}, P_{(i)}\Big), \tag{2.1}$$
where
$$R = \{i : k_0 \le i \le k_1\} \cap \{P_{(i)} : \alpha_0 \le P_{(i)} \le \alpha_1\}$$
represents an arbitrary truncation scheme for p-values based on their ranks, for given $k_0 \le k_1 \in \{1, ..., n\}$, and/or their magnitudes, for given $\alpha_0 \le \alpha_1 \in [0, 1]$. For fixed $x = i/n$ the function $f(x, y)$ is monotonically decreasing in $y = P_{(i)}$, so that the smaller the input p-values, the larger the statistic and the stronger the evidence against $H_0$.

Under the assumption that the input p-values are independent and identically distributed (iid), the p-value and power calculations of the gGOF family have been well resolved. However, correlation is ubiquitous in real data analysis. A few problems are to be addressed for analyzing correlated data. First, to practically apply gGOF statistics, p-value calculation under arbitrary correlation is desired. Second, to study the power of gGOF for correlated data, it is important to understand how signal patterns and correlation structures influence the signal-to-noise ratio (SNR). Proper transformations of correlated data have the potential to advance signal detection, while improper ones can do harm. Third, since different test statistics in the gGOF family have relative advantages for certain signal patterns and data properties, it is ideal to fully utilize the family-wide advantages to provide a powerful and robust solution suited to various situations.

In recent years, individual modifications have been proposed based on specific statistics in gGOF for analyzing correlated data. In particular, GHC and GBJ [7, 54] were proposed based on the original HC and BJ statistics in genetic association studies. These developments were motivated from an interpretation perspective, for example, on how to incorporate data variation into the statistic under correlations [7]. However, a certain interpretation does not necessarily guarantee higher statistical power. On one hand, under finite n, Figure 2.1 shows that if input p-values are properly transformed, GHC and GBJ can be similar to or less powerful than the original versions of HC and BJ.
Figure 2.1: Statistical power for HC, GHC, BJ, and GBJ, with and without the innovated transformation (innov) of the input test statistics $(T_1, \ldots, T_{100}) \sim N(\mu, \Sigma)$. The nonzero elements (i.e., signals) in $\mu$ are distributed arbitrarily, with the same magnitude $A$. $\Sigma$ is the equal-correlation matrix with off-diagonal elements $\rho$. Left: $A = 2$, $\Sigma = \Sigma_1$ with $\rho = 0.3$; Right: $A = 0.3$, $\Sigma = \Sigma_2$ with $\rho = -0.0099$. $\Sigma_1$ and $\Sigma_2$ are mutually inverse matrices, i.e., $\Sigma_1 = \Sigma_2^{-1}$. Type I error rate 0.05; 10,000 simulations.

Rather than carrying out individual developments for specific test statistics, the higher perspective of this work is to address the whole statistic family and to automatically choose the best statistic function $f$ and truncation domain $R$ for any given data. Indeed, a family of statistics defined by different $f$ functions can possess high statistical power over a broader (if not the full) spectrum of data properties and signal patterns. For example, the HC2004 statistic [4] is more powerful for very sparse signals, while the BJ and HC2008 statistics [5] are more powerful for denser signals [53]. Moreover, the truncation domain $R$ of the input p-values is also quite relevant. For example, the modified HC2004 with $R = \{i : 1 < i \le n/2,\ P_{(i)} \ge 1/n\}$ improves the performance over that with no truncation [4, 13]. Thus, by allowing a general $f$ and $R$, the gGOF family retains all of these advantages to give high and robust power in analyzing various data.

In addition, a convenient feature of gGOF allows automatic selection of the best $f$ and $R$ in harmony with the p-value calculation itself, without the need for simulation or approximation. Specifically, any gGOF statistic is a supremum of the monotone function $f$. Therefore, the distribution of any gGOF statistic is essentially a probability that the ordered p-values cross a set of boundaries:

$$P(S_{n,f,R} \le b) = P\!\left(\sup_{R} f\!\left(\frac{i}{n}, P_{(i)}\right) \le b\right) = P\big(P_{(k)} > u_k \text{ for all } k \text{ with } P_{(k)} \in R\big), \tag{2.2}$$

where the boundaries are decided by $f$ and the threshold $b$:

$$u_k = f^{-1}\!\left(\frac{k}{n}, b\right). \tag{2.3}$$

This special property makes it convenient to analytically calculate the exact null distribution under a double-adaptation to both $f$ and $R$. Specifically, for a given gGOF statistic $S_{f,R}$ at a fixed $n$, let $G_{f,R}$ be its survival function under the null hypothesis. We define a double-adaptation omnibus statistic by the smallest p-value (i.e., the strongest statistical evidence against the null) among all statistics indexed by various $f$ and $R$:

$$S_o = \inf_{f,R} G_{f,R}(S_{f,R}). \tag{2.4}$$

Under the null, the survival function of $S_o$ is

$$P(S_o > s_o) = P\big(S_{f,R} \le G_{f,R}^{-1}(s_o) \text{ for all } f, R\big) = P\big(P_{(1)} > u_1^\star, \ldots, P_{(n)} > u_n^\star\big), \tag{2.5}$$

where, for each $k = 1, \ldots, n$,

$$u_k^\star = \sup_{f,R} u_{f,R,k}, \qquad u_{f,R,k} = f^{-1}\!\left(\frac{k}{n}, G_{f,R}^{-1}(s_o)\right). \tag{2.6}$$

Now the calculations of the cross-boundary probabilities in (2.2) and (2.5) are similar. We have provided efficient exact and approximate calculations in the independence case [53].
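Since $f(x, \cdot)$ is monotone, the boundaries in (2.3) can be recovered numerically when no closed-form inverse is available. Below is a small sketch (an illustration, not the dissertation's implementation) that inverts a functional $f$ at each rank with a one-dimensional root finder; SciPy's `brentq` is our choice of root solver.

```python
import numpy as np
from scipy.optimize import brentq

def boundaries(f, b, n):
    """u_k = f^{-1}(k/n, b) of (2.3): the largest p-value at rank k
    that still pushes the statistic to the threshold b."""
    u = np.zeros(n)
    for k in range(1, n + 1):
        g = lambda y: f(k / n, y) - b
        lo, hi = 1e-12, 1.0 - 1e-12
        if g(lo) <= 0:
            u[k - 1] = 0.0   # f(k/n, y) < b even at y ~ 0: no rejection at this rank
        else:
            u[k - 1] = brentq(g, lo, hi)  # unique root since f is decreasing in y
    return u

n = 100
hc = lambda x, y: np.sqrt(n) * (x - y) / np.sqrt(y * (1 - y))
u = boundaries(hc, b=3.0, n=n)   # reject H0 iff P_(k) <= u_k for some k
```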
In this chapter, to calculate p-values in the dependence case, we first provide an exact calculation under equal correlation. Based on that, we approximate the calculation under arbitrarily complex correlation. This strategy is an explicit calculation method based on theoretical deduction, which differs from the typical moment-matching of certain distributions, and it is shown to be more accurate under broad circumstances.

Regarding statistical power, beyond extensive simulation-based studies, in this chapter we explore how correlation structure and signal pattern influence the signal-to-noise ratio (SNR) under both the Gaussian mean model (GMM) and the generalized linear model (GLM). Ideally, correlation information could be incorporated to help improve the power of signal detection. Indeed, this can be realized through linear transformation of the input statistics before obtaining the input p-values. We study the de-correlation transformation (DT) and the innovated transformation (IT) [55, 57] and reveal conditions for them to strengthen or weaken the SNR under GMM and GLM. In particular, for the GLM we show that under weak conditions marginal model-fitting is essentially the IT of the joint model-fitting. This result is particularly interesting in applied statistical studies because it indicates that the computationally simple marginal model-fitting is actually often superior to the computationally more demanding joint model-fitting for dependent data analysis, even when some extent of signal cancellation exists. Since IT is considered the optimal linear transformation to maximize the SNR (at least under sparse cases) [58, 57], it should be applied, either implicitly or explicitly, before testing hypotheses by gGOF. As a natural extension of iHC [55], we call such a procedure the igGOF test. When the double-adaptation omnibus test in (2.4) is carried out on top of igGOF, we call the testing procedure digGOF.

The chapter is organized as follows. Section 2.2 formulates the problem by defining the models of hypotheses. Section 2.3.1 provides analytical p-value calculation methods, whose accuracy is evaluated under various settings. Section 2.5 provides studies of the innovated transformation and statistical power. In Section 2.6, igGOF and digGOF are applied and evaluated for GWAS under correlated cases, and a real GWAS analysis is carried out to find new disease genes. We discuss the limitations of this work and future plans in Section 2.7.

2.2 Models of Hypotheses

In this section we consider two well-connected settings for the hypotheses: the Gaussian mean model (GMM) and the generalized linear model (GLM). GMM serves as a foundation for the GLM; the latter can be considered a constrained GMM conditional on data patterns.

GMM assumes the vector of $n$ input test statistics $T = (T_1, \ldots, T_n)$ is jointly Gaussian:

$$T \sim N(\mu, \Sigma), \tag{2.7}$$

where $\Sigma$ is known or can be reliably estimated, and $\mu$ is the unknown parameter corresponding to the hypotheses:

$$H_0: \mu = 0 \quad \text{versus} \quad H_1: \mu \ne 0. \tag{2.8}$$

For consistency of presentation, in this chapter the input statistics are always assumed standardized, so that the variance $\Sigma$ is a correlation matrix with 1's on the diagonal. Accordingly, the magnitude of the nonzero mean elements $\mu_i$ represents the SNR, serving as a measure of signal strength.
Certainly, the higher the SNR, the higher the statistical power of any reasonable group test based on the $T_i$'s. The gGOF statistics defined in (2.1) take $T$'s p-values as input, which can be one-sided or two-sided depending on the specific data analysis:

$$\text{One-sided: } P_i = \bar\Phi(T_i); \qquad \text{Two-sided: } P_i = 2\bar\Phi(|T_i|). \tag{2.9}$$

Note that this pre-assumption of automatic data standardization is consistent with the fact that any point-wise rescaling of the $T_i$ affects neither the input p-values nor their summary statistics.

A GLM is defined as

$$g\big(E(Y_k \mid X_{k\cdot}, Z_{k\cdot})\big) = X_{k\cdot}'\beta + Z_{k\cdot}'\gamma. \tag{2.10}$$

For the $k$th subject, $k = 1, \ldots, N$, $Y_k$ denotes the response value, $X_{k\cdot} = (X_{k1}, \ldots, X_{kn})'$ denotes the $k$th row vector of the design matrix $X_{N\times n}$ of $n$ inquiry covariates, and $Z_{k\cdot} = (Z_{k1}, \ldots, Z_{km})'$ denotes the $k$th row vector of $Z_{N\times m}$ of $m$ control covariates. For example, in gene-based single-nucleotide polymorphism (SNP)-set studies, if a gene contains $n$ SNPs, $X_{k\cdot}$ denotes the genotype vector of the $n$ SNPs, and $Z_{k\cdot}$ denotes a data vector of $m$ control covariates, such as the intercept and other environmental or genetic variants.

The function $g$ is called the link function, based on the distribution of $Y_k$ given $X_{k\cdot}$ and $Z_{k\cdot}$. Generally $Y_k$ follows a distribution in the exponential family with density function

$$f(y_k) = \exp\!\left\{\frac{y_k\theta_k - b(\theta_k)}{a_k(\phi)} + c(y_k, \phi)\right\},$$

where $\theta$ is the natural parameter, $\phi$ is the dispersion parameter, and $a$, $b$, and $c$ are given functions. The null hypothesis is that none of the inquiry covariates are associated with the outcome:

$$H_0: \beta = 0 \quad \text{vs.} \quad H_1: \beta \ne 0. \tag{2.11}$$

Through the asymptotics of joint model-fitting, the GLM with correlated covariates is connected to the GMM in (2.7). Joint model-fitting means estimating $\beta$ simultaneously, and there are many ways of doing so. Here we consider a classic one-step maximum likelihood estimation (MLE; cf. Theorem 4.19 and Exercise 4.152 of [33]) with initial estimate $\mu_k^{(0)} = g^{-1}(Z_{k\cdot}'\gamma^{(0)})$, $k = 1, \ldots, N$, where $\gamma^{(0)}$ is the MLE of $\gamma$ under $H_0$. Let the corresponding estimated weights matrix be $W^{(0)} = \mathrm{diag}\{\mathrm{Var}^{(0)}(Y_k)\} = \mathrm{diag}\{a(\phi^{(0)})\, b''(\theta_k^{(0)})\}$, where $\phi^{(0)}$ and $\theta_k^{(0)}$ are also the MLEs of $\phi$ and $\theta_k$, $k = 1, \ldots, N$, under $H_0$. Assume $N > n$, define $\tilde X = W^{1/2}X$ and $\tilde Z = W^{1/2}Z$, and let $\tilde H = \tilde Z(\tilde Z'\tilde Z)^{-1}\tilde Z'$ be the projection matrix onto the column space of $\tilde Z$. It can be shown that the estimator of $\beta$ satisfies

$$\hat\beta_J = (\tilde X'(I - \tilde H)\tilde X)^{-1}\tilde X'(Y - \mu^{(0)}) \xrightarrow{D} N\big(\beta,\ (\tilde X'(I - \tilde H)\tilde X)^{-1}\big), \tag{2.12}$$

as $N \to \infty$. To obtain the input p-values for gGOF statistics, the input statistics are the standardized $\hat\beta_J$:

$$T_J = \tilde\Lambda\hat\beta_J \xrightarrow{D} N\big(\tilde\Lambda\beta,\ \tilde\Lambda(\tilde X'(I - \tilde H)\tilde X)^{-1}\tilde\Lambda\big), \tag{2.13}$$

where the diagonal matrix $\tilde\Lambda = \mathrm{diag}(1/\sqrt{\tilde\lambda_i})_{1\le i\le n}$, with $\tilde\lambda_i = ((\tilde X'(I - \tilde H)\tilde X)^{-1})_{ii}$ being the diagonal elements of the covariance matrix. Writing the marginal fitting/IT as $U = (\tilde X'(I - \tilde H)\tilde X)\hat\beta_J$, we can see that $\mathrm{Var}(U) = \tilde X'(I - \tilde H)\tilde X = X'(W - WZ(Z'WZ)^{-1}Z'W)X$. This is consistent with the marginal score statistics in [54].

A special case of the GLM is the linear regression model (LM), with model equation

$$Y = X\beta + Z\gamma + \epsilon, \tag{2.14}$$

where $X_{N\times n}$ and $Z_{N\times m}$ are still the design matrices with $k$th row vectors $X_{k\cdot}'$ and $Z_{k\cdot}'$, respectively. The error term $\epsilon \sim N(0, \sigma^2 I_{N\times N})$, where the variance $\sigma^2$ is known or can be consistently estimated. Here the one-step MLE is the same as the least-squares (LS) estimation in joint model-fitting. In particular, in the linear regression model the weights matrix is $W = \mathrm{diag}\{1/\sigma^2\}$, and the initial estimate $\mu^{(0)} = HY$ is the projection of $Y$ onto the column space of $Z$, i.e., the LS fit of $Y$ under $H_0$. Also note that the standardized statistics follow an exact normal distribution:

$$T_J = \Lambda\hat\beta_J/\sigma \sim N\big(\Lambda\beta/\sigma,\ \Lambda(X'(I - H)X)^{-1}\Lambda\big), \tag{2.15}$$

where the diagonal matrix $\Lambda = \mathrm{diag}(1/\sqrt{\lambda_i})_{1\le i\le n}$, with $\lambda_i = ((X'(I - H)X)^{-1})_{ii}$ being the diagonal elements of the covariance matrix.
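As a concrete illustration of (2.14)–(2.15) and the input p-values of (2.9), the following sketch (simulated data and parameter values are our own, not from the dissertation) forms the joint-fitting statistics $T_J$ for a linear model with control covariates and converts them to two-sided p-values.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
N, n, sigma = 500, 10, 1.0
X = rng.normal(size=(N, n))                            # inquiry covariates
Z = np.column_stack([np.ones(N), rng.normal(size=N)])  # intercept + one control
beta = np.zeros(n); beta[3] = 0.5                      # one sparse signal
gamma = np.array([1.0, -0.5])
Y = X @ beta + Z @ gamma + sigma * rng.normal(size=N)

H = Z @ np.linalg.solve(Z.T @ Z, Z.T)              # projection onto col(Z)
XtMX = X.T @ X - X.T @ H @ X                       # X'(I - H)X
beta_J = np.linalg.solve(XtMX, X.T @ (Y - H @ Y))  # LS of beta adjusting for Z
lam = np.diag(np.linalg.inv(XtMX))                 # lambda_i in (2.15)
T_J = beta_J / (sigma * np.sqrt(lam))              # standardized statistics (2.15)
p_two = 2 * norm.sf(np.abs(T_J))                   # two-sided p-values, (2.9)
```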
From this perspective, a GLM can be considered a restricted GMM in (2.7). In GMM, $\mu$ and $\Sigma$ can be defined independently. In GLM, even though the "effects" $\beta$ are defined separately from the data $X$ and $Z$, both the mean vector $\mu_{T_J}$ (i.e., the signal strength under $H_a$) and the correlation matrix $\Sigma_{T_J}$ depend on the data. In particular, the data correlation structure plays a critical role. In the formula in (2.13), $\tilde X'(I - \tilde H)\tilde X$ gives a measure of the covariance among the inquiry covariates conditional on the control covariates. In the typical case where $Z$ contains only the intercept, $\tilde X'(I - \tilde H)\tilde X = X'(I - J)X$, where $J$ is the matrix with all entries $1/N$, which is exactly the empirical covariance matrix among the columns of $X$.

This connection between the input statistics $T_J$ and the data correlation structure has two consequences. First, the signal strength depends on the data correlation. As shown in Section 2.4.1, it is often the case that $\tilde\lambda_i < 1$ (when $\sigma = 1$, for example), indicating that the signal strength in $T_J$ is actually less than the effect size defined by the nonzero elements of $\beta$. Secondly, linear transformations of the input statistics are related to the data correlation. It is often the case that the de-correlation transformation and the innovated transformation of $T_J$ can increase the signal strength.

2.3 The gGOF Family Under Dependence

2.3.1 P-value of gGOF under Dependence

In this section we propose several methods to calculate the p-values for the gGOF-related statistics in (2.1) and (2.4). This novel strategy differs from the typical moment-matching methods and is shown by simulations to be more accurate under various situations.

First, we provide an exact calculation for the null distribution of $S_{n,f,R}$ under the equal-correlation matrix, summarized in Theorem 2.3.1.

Theorem 2.3.1. Consider input statistics $T$ in (2.7) with $\mu = 0$, $\Sigma_{ij} = \rho$ for all $i \ne j$, and input p-values defined in (2.9). Assume $R = \{i : k_0 \le i \le k_1\}$. Let $U_{(1)} \le \ldots \le U_{(n)}$ be the order statistics of $n$ iid Uniform(0, 1) random variables. For any gGOF statistic in (2.1),

$$P(S_{n,f,R} < b) = \int_{-\infty}^{\infty} \phi(z)\, P\big(U_{(k)} > c_{ik},\ k = k_0, \ldots, k_1\big)\, dz, \tag{2.16}$$

where

$$c_{1k} = 1 - \Phi\!\left(\frac{\Phi^{-1}(1 - u_k) - \sqrt{\rho}\,z}{\sqrt{1-\rho}}\right) \quad\text{and}\quad c_{2k} = 1 - \Phi\!\left(\frac{\Phi^{-1}(1 - u_k/2) - \sqrt{\rho}\,z}{\sqrt{1-\rho}}\right) + \Phi\!\left(\frac{-\Phi^{-1}(1 - u_k/2) - \sqrt{\rho}\,z}{\sqrt{1-\rho}}\right)$$

are for the one-sided and two-sided input p-values in (2.9), respectively, and $\phi(z)$ and $\Phi(z)$ are the density and distribution functions of $N(0, 1)$.

Efficient exact as well as approximate calculations for $P(U_{(k)} > a_k,\ k = k_0, \ldots, k_1)$ have been given in [53]. The integration over $z$ can be carried out numerically.
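To illustrate Theorem 2.3.1 numerically, the sketch below evaluates (2.16) for one-sided p-values. The inner crossing probability $P(U_{(k)} > c_k,\ k = 1, \ldots, n)$ is computed by a standard conditional-binomial recursion for ordered uniforms (our illustrative choice, not necessarily the algorithm of [53]), and the outer integral over $z$ is approximated on a grid. It assumes $0 \le \rho < 1$ and boundaries $u_k$ nondecreasing in $k$ (set $u_k = 0$ for ranks outside $R$).

```python
import numpy as np
from scipy.stats import norm, binom
from scipy.integrate import trapezoid

def uniform_crossing(c):
    """P(U_(k) > c_k for all k), c nondecreasing in [0, 1), using:
    given N(c_{t-1}) = i, the increment N(c_t) - N(c_{t-1}) is
    Binomial(n - i, (c_t - c_{t-1}) / (1 - c_{t-1})).  O(n^3); a sketch."""
    n = len(c)
    W = np.zeros(n + 1)                    # W[j] = P(valid so far, N(c_t) = j)
    W[0] = binom.pmf(0, n, c[0])           # rank 1 requires N(c_1) = 0
    for t in range(1, n):
        p = max((c[t] - c[t - 1]) / (1.0 - c[t - 1]), 0.0)
        Wn = np.zeros(n + 1)
        for i in range(t):                 # i = N(c_{t-1}) <= t - 1
            if W[i] > 0.0:
                j = np.arange(i, t + 1)    # j = N(c_t) <= t
                Wn[i:t + 1] += W[i] * binom.pmf(j - i, n - i, p)
        W = Wn
    return W.sum()

def ggof_pvalue_equicorr(u, rho, zgrid=np.linspace(-8, 8, 201)):
    """One-sided version of (2.16): P(S_{n,f,R} >= b) given the boundaries
    u_k = f^{-1}(k/n, b) (nondecreasing) and equal correlation rho."""
    vals = []
    for z in zgrid:
        c = 1.0 - norm.cdf((norm.ppf(1.0 - u) - np.sqrt(rho) * z)
                           / np.sqrt(1.0 - rho))          # c_{1k} of Theorem 2.3.1
        vals.append(uniform_crossing(np.clip(c, 0.0, 1.0 - 1e-12)))
    cdf = trapezoid(norm.pdf(zgrid) * np.array(vals), zgrid)
    return 1.0 - cdf
```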
In real data analysis, the structure of the correlation matrix $\Sigma$ can be complex, and $\Sigma$ often needs to be estimated, which makes things even more complicated. Here we propose several strategies to approximate the distributions of gGOF statistics based on the exact calculation developed above.

The first strategy, called the weighted average method (WAM), is specially designed for the case where the correlation matrix is roughly a Toeplitz matrix, in which each off-diagonal line has equal elements. This assumption is appropriate for data whose correlation decays as the "distance" between two covariates increases, for example, the autoregressive model in time series data, or genetic data with decaying linkage disequilibrium (LD) over the physical/genetic distance. Specifically, let $G_\Sigma(b) \equiv P(S_{n,f,R} < b \mid \Sigma)$ be the distribution function of $S_{n,f,R}$ when the input statistics $T$ have correlation matrix $\Sigma$. Consider $\Sigma$ Toeplitz, i.e., $\Sigma(l, k) = \rho_j$, $j = |l - k|$. Denote by $\Sigma_j$ the equal-correlation matrix with correlation $\rho_j$. The WAM approximation is

$$G_\Sigma(b) = \sum_{j=1}^{\alpha n} \omega_j G_{\Sigma_j}(b).$$

Theoretical [59] as well as our empirical studies show that the near off-diagonal components are more important for characterizing $\Sigma$. Thus, we propose a bandwidth parameter $\alpha$, which truncates the far off-diagonal components to 0. The weights are based on the relative sizes of the off-diagonal lines:

$$\omega_j = \frac{n - j}{((1+\alpha)n - 1)(1-\alpha)n/2}, \qquad j = 1, \ldots, \alpha n.$$

Empirical results show that $\alpha = 0.5$ is a robust choice in most cases. When $\Sigma$ is not exactly Toeplitz, we can take $\rho_j$ to be the average correlation on the $j$th off-diagonal of $\Sigma$.

When the correlation matrix $\Sigma$ is more complicated than Toeplitz, we propose a second strategy based on locally weighted smoothing (LOESS) to estimate the distribution. Specifically, we obtain the $G_\Sigma(b)$ curve by locally smoothed curve fitting of $G_{\Sigma_j}(b)$ around $b$, where $\Sigma_j$ still represents the equal-correlation matrix with constant correlation $\rho_j$. However, instead of focusing on the near off-diagonals as in WAM, here $\rho_j$ is chosen within the range of all elements of $\Sigma$. The idea is that large correlation elements in $\Sigma$, even if not necessarily close to the diagonal, could have non-negligible influence. We first set $m$ equally spaced neighbor points in an interval around $b$, i.e., $b_i \in [b - \epsilon, b + \epsilon]$, $i = 1, \ldots, m$. Then, for each $b_i$, we randomly choose $N$ off-diagonal elements $\rho_j$ of $\Sigma$ and calculate $y_{ij} = G_{\Sigma_j}(b_i)$, $j = 1, \ldots, N$. After that, we use the $y_{ij}$ as input data for a local polynomial regression with tri-cube weights to predict the curve $G_\Sigma(b)$ [60]. For implementation, we found that $m = 10$, $\epsilon = 1$, $N = n$, and a quadratic polynomial curve function often provide a good rule of thumb.

Comparing the two strategies, extensive numerical studies show that when $\Sigma$ is Toeplitz or nearly so, WAM is more accurate, while LOESS is also fairly accurate and is more robust to more complex correlation structures.
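The WAM idea translates directly into code. Below is a minimal sketch assuming a function `G_equal(rho, b)` (a stand-in name) that returns the equal-correlation distribution value $G_{\Sigma_j}(b)$ of Theorem 2.3.1, for example the integration sketch above; the weights are written generically, normalized to sum to one.

```python
import numpy as np

def wam(Sigma, b, G_equal, alpha=0.5):
    """Approximate G_Sigma(b) by a weighted average of equal-correlation
    distributions over the first alpha*n off-diagonals of Sigma."""
    n = Sigma.shape[0]
    J = max(1, int(alpha * n))
    # rho_j: average correlation on the j-th off-diagonal (exact if Toeplitz)
    rho = np.array([np.diag(Sigma, k=j).mean() for j in range(1, J + 1)])
    w = n - np.arange(1, J + 1, dtype=float)   # longer off-diagonals weigh more
    w /= w.sum()
    return float(np.sum(w * np.array([G_equal(r, b) for r in rho])))
```

With the earlier sketches, a call such as `wam(Sigma, b, lambda r, b: 1.0 - ggof_pvalue_equicorr(boundaries(hc, b, n), r))` would plug in the equal-correlation calculation; this wiring is our illustration of the strategy, not the dissertation's implementation.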
For the p-value calculation of digGOF, as clarified in (2.5), the calculation is still a cross-boundary probability, so it follows the same methods given above. The key is to obtain the boundaries $u_k^\star$. For efficient implementation, we consider double-adaptation over a discrete sequence of functions and truncations $\{(f_1, R_1), (f_2, R_2), \ldots\}$. Potential choices of the functions $f_j$ are from the φ-divergence statistic family with various $s \in [-1, 2]$, which ensures theoretical optimality for detecting weak-and-rare signals [9]; certainly, any monotone function $f$ would work. The truncation domains $R_j = \{k_{0j}, \ldots, k_{1j}\}$ focus only on index truncation for two reasons. First, for given data, the adaptation to all $k_0 \le k_1 \in \{1, \ldots, n\}$ is equivalent to the adaptation to all $\alpha_0 \le \alpha_1 \in [0, 1]$. Secondly, the computation can be simplified significantly. Specifically, using the index $j$, the statistic in (2.4) can be denoted $S_o = \inf_j G_j(S_j)$, and its survival function under $H_0$ is

$$P(S_o > s_o) = P\big(S_j(P_1, \ldots, P_n) < G_j^{-1}(s_o) \text{ for all } j\big) = P\big(P_{(1)} > u_1^\star, \ldots, P_{(n)} > u_n^\star\big),$$

where $u_k^\star = \sup_j u_{jk}$, and for each given $j$ we only need to calculate $u_{jk}$ within $k_{0j} \le k \le k_{1j}$:

$$u_{jk} = \begin{cases} f_j^{-1}\!\left(\dfrac{k}{n},\, G_j^{-1}(s_o)\right) & \text{if } k_{0j} \le k \le k_{1j}, \\[4pt] 0 & \text{otherwise.} \end{cases} \tag{2.17}$$

2.3.2 Rejection Boundary Analysis

Following (2.2), the acceptance region of a gGOF statistic is

$$A_S = \{P_{(k)} : P_{(k)} > u_k \text{ for all } k,\ P_{(k)} \in R\}, \tag{2.18}$$

and the rejection region is $A_S^c$. That is, $H_0$ is rejected whenever $P_{(k)} \le u_k$ at any $k$. Thus, at a given correlation $\Sigma$ and a given type I error rate, the series $\{u_1, \ldots, u_n\}$ forms a rejection boundary (RB), which can serve as an indicator of statistical power. Figure 2.2 illustrates the RBs on the logarithmic scale for various gGOF statistics. $H_0$ is rejected if any ordered p-value falls below the corresponding boundary. If the RB curve of one statistic is uniformly higher than that of another, then the first statistic has uniformly higher power over all possible signal strengths. However, when RBs cross, the statistical power depends on the signal patterns.

The left panel evaluates various φ-divergence statistics with $s \in \{-2, -1, 0, 1, 2, 3\}$. The RB curves show an interesting pattern: statistics with large $s$ (e.g., HC2004 with $s = 2$) have higher RBs at the top-ranked p-values. If signals are strong and sparse, so that they likely correspond to the smallest p-values, then these statistics are more sensitive to such signals and thus give more statistical power. In particular, at the given setting, some curves (i.e., $s \le 1$) do not start from $k = 1$ because their $u_k$'s are negative, indicating that they have no power to detect signals located at the smallest p-values. On the other hand, statistics with small $s$ are more sensitive to the lower-ranked p-values, indicating their advantage for signals that are weaker and/or denser. This pattern also indicates that focusing on a certain range of ordered p-values, by applying a proper truncation domain $R$, can benefit the statistical power too.

As shown in the right panel, the omnibus methods over various $f$ functions provide a balanced and thus robust solution. The RB curves of the omnibus tests tend to be closer to the best statistic at all positions. Moreover, it seems that we do not need an omnibus of too many tests: the RB for adapting to $s \in \{1, 2\}$ is similar to that for adapting to $s \in \{-1, 0, 1, 2, 3\}$. This is likely because BJ is already a robust statistic.

Meanwhile, a power study by RB needs to be treated with caution due to its limitations. In particular, the distances between two RBs at different $k$ locations do not proportionally reflect the relative advantages in terms of statistical power; a small RB difference at some locations may matter more than a larger RB difference at other locations. Thus, if two RBs cross, it is hard to say which method is more powerful when the signal pattern is ambiguous.

Figure 2.2: The rejection boundaries of the φ-divergence and omnibus statistics. The null hypothesis is rejected whenever any $\log_{10}(P_{(k)})$ is below the boundary. Left panel: RBs for φ-divergence statistics with $s = 3$ (black), $2$ (i.e., HC2004, red), $1$ (i.e., BJ, green), $0$ (i.e., reversed BJ, blue), and $-1$ (i.e., HC2008, cyan). Right panel: RBs for HC2004 (red), the omnibus over HC2004 and BJ (black), and the omnibus over φ-divergence with $s \in \{-1, 0, 1, 2, 3\}$ (green).
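RB curves in the spirit of Figure 2.2 can be drawn with the boundary routine sketched earlier: fix a threshold $b$ for each statistic and plot $\log_{10} u_k$ against $k$. In the sketch below, the thresholds are illustrative stand-ins rather than critical values calibrated to a type I error rate, and the two functionals are common HC-type and BJ-type shapes (a Bernoulli Kullback–Leibler divergence for the latter), not necessarily the chapter's exact φ-divergence parameterization.

```python
import numpy as np
from scipy.optimize import brentq
import matplotlib.pyplot as plt

n = 100
hc = lambda x, y: np.sqrt(n) * (x - y) / np.sqrt(y * (1 - y))     # HC-type
kl = lambda x, y: x * np.log(x / y) + (1 - x) * np.log((1 - x) / (1 - y))
bj = lambda x, y: n * kl(x, y)                                    # BJ-type, y < x branch

def rb(f, b, search_hi):
    """log10 of the rejection boundary u_k = f^{-1}(k/n, b); NaN if unreachable."""
    u = np.full(n, np.nan)
    for k in range(1, n + 1):
        x = min(k / n, 1 - 1e-9)           # guard the x = 1 endpoint for BJ-type
        g = lambda y: f(x, y) - b
        hi = search_hi(x)
        if g(1e-14) > 0 and g(hi) < 0:
            u[k - 1] = brentq(g, 1e-14, hi)
    return np.log10(u)

k = np.arange(1, n + 1)
# thresholds b = 3.0 and 8.0 are arbitrary stand-ins for level-0.05 critical values
plt.plot(k, rb(hc, 3.0, lambda x: 1 - 1e-14), label="HC-type")
plt.plot(k, rb(bj, 8.0, lambda x: x * (1 - 1e-14)), label="BJ-type")
plt.xlabel("rank k"); plt.ylabel(r"$\log_{10} u_k$"); plt.legend(); plt.show()
```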
2.4 Innovated Transformation

2.4.1 Innovated Transformation and Sparse Signals

In this section we consider proper ways of incorporating correlations into the group testing procedure. This is realized by a proper linear transformation of the input statistics before obtaining the input p-values for the gGOF statistic. We first clarify the innovated transformation suggested by [55] under the settings of the Gaussian mean model and the GLM. Under either setting, we illustrate why it may (or may not) provide higher statistical power, especially for detecting sparse signals.

Transformations Under GMM

For a vector of input test statistics under the GMM in (2.7), [55] proposed the decorrelation transformation (DT) and the innovated transformation (IT). Define the Cholesky factorization of $\Sigma$ by $Q$ (a lower triangular matrix), i.e., $\Sigma = QQ'$, or $Q'\Sigma^{-1}Q = I$. Define $U = Q^{-1}$, which is also a lower triangular matrix. We have $\Sigma^{-1} = U'U$, or $U\Sigma U' = I$. After DT the input statistics become

$$T^{DT} = UT \sim N(U\mu, I). \tag{2.19}$$

IT is defined below:

Definition 1. IT is a transformation procedure for the Gaussian random vector $T$ in (2.7). If the locations of the nonzero elements of $\mu$ do not depend on $\Sigma$, then the IT of $T$ is

$$T^{IT} = D\Sigma^{-1}T \sim N(D\Sigma^{-1}\mu,\ D\Sigma^{-1}D), \tag{2.20}$$

where the matrix $D$ rescales the statistics after transformation so that the variances of the $T_i^{IT}$'s are 1:

$$D = \mathrm{diag}\big(1/\sqrt{(\Sigma^{-1})_{ii}},\ i = 1, \ldots, n\big).$$

Note that the IT definition requires that the locations of the nonzero elements in the mean vector be independent of the correlation matrix. This condition gives a directional inverse transformation, which avoids a "looping definition" of IT (i.e., two statistic vectors cannot be mutually IT). It helps to anchor the baseline from which IT can increase the signal strength. Moreover, the rescaling after the inverse transformation is consistent with the requirement for the GMM in (2.7). That is, the input statistics of the gGOF statistics are always normalized, so that the p-values are obtained by (2.9) and the nonzero mean elements give the SNRs.

Note that [55]'s proposal is broader, including a spectrum of transformations

$$V_{b_n}T \sim N(V_{b_n}\mu,\ V_{b_n}\Sigma V_{b_n}'), \tag{2.21}$$

where $V_{b_n}$ is a diagonal-band truncated (bandwidth $b_n$) and column-normalized (for rescaling the statistics) transformation matrix. When $b_n = 1$, $V_1 = U$ and the transformation is DT; when $b_n = n$, $V_n = D\Sigma^{-1}$ and the transformation is IT. Besides providing a spectrum of transformations, introducing $b_n$ also provides technical convenience in the theoretical proof of optimality (see Theorem 2.4.3). However, in finite data analyses, DT and IT are often enough to present the two extremes of the signal strength after transformation. Figure 2.3 gives an experiment on the influence of $b_n$. When the signals are separated far from each other (first row), IT is better than DT; when the signals are clustered (second row), DT can be better. For simplicity, we can use either DT or IT to get the best-case scenario without concern for the proper choice of a $b_n$ value in between.

Figure 2.3: Transformed signal strength. First row: $\mu_{16} = \mu_{32} = \mu_{48} = 1$ and the other components of $\mu$ are 0. Second row: $\mu_{16} = \mu_{17} = \mu_{18} = 1$ and the other components of $\mu$ are 0. Black: $V_1\mu$. Red: $D\Sigma^{-1}\mu$. $\Sigma$ is a polynomial-decay correlation matrix, $\Sigma_{ij} = |i - j|^{-1}$, $i \ne j$, and $\Sigma_{ii} = 1$, $i = 1, \ldots, 50$. Left column: distribution of the transformed signals. Right column: the maximum signal after transformation across different bandwidths $b_n$.
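The following sketch (our illustration) computes the DT and IT of a mean vector with sparse, well-separated signals and compares the transformed means; it shows the rescaling matrix $D$ of Definition 1 keeping unit variances. For numerical safety we use a positive-definite polynomial-type decay $\Sigma_{ij} = (1 + |i - j|)^{-1}$ as a stand-in for the decay pattern of Figure 2.3.

```python
import numpy as np

n = 50
idx = np.arange(n)
Sig = 1.0 / (1.0 + np.abs(idx[:, None] - idx[None, :]))  # PD polynomial-type decay

mu = np.zeros(n); mu[[15, 31, 47]] = 1.0      # sparse, well-separated signals
Sinv = np.linalg.inv(Sig)
Q = np.linalg.cholesky(Sig)                   # Sigma = Q Q'
U = np.linalg.inv(Q)                          # DT matrix: U Sigma U' = I
D = np.diag(1.0 / np.sqrt(np.diag(Sinv)))     # rescaling of Definition 1

mu_dt = U @ mu                                # post-DT means (unit variances)
mu_it = D @ Sinv @ mu                         # post-IT means (unit variances)
# at a signal position the SNR should roughly satisfy A <= A*U_jj <= A*sqrt((Sinv)_jj)
print(mu[15], mu_dt[15], mu_it[15])
```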
In the following we illustrate why DT and IT can increase the SNR under sparse signals and sparse correlations (similar to the conditions in the first row of Figure 2.3 and the conditions given by [55]). The reasoning is based on a few important linear algebra properties of the relevant matrices.

1. The diagonal of $Q$ can be calculated as $Q_{j,j} = \sqrt{\Sigma_{j,j} - \sum_{k<j} Q_{j,k}^2} = \sqrt{1 - \sum_{k<j} Q_{j,k}^2} < 1$.

2. As the inverse of $Q$, the diagonal elements of $U$ satisfy $U_{j,j} = 1/Q_{j,j}$, thus $U_{j,j} > 1$.

3. $\Sigma^{-1}$ is symmetric, with diagonal elements $(\Sigma^{-1})_{1,1}, \ldots, (\Sigma^{-1})_{n,n} > 1$.

4. If the correlations in $\Sigma$ are all positive, most of the off-diagonal elements of $\Sigma^{-1}$ are negative.

5. If $\Sigma$ decays polynomially, then $U$, $Q$, and $\Sigma^{-1}$ all decay polynomially [61, 55].

6. The $U_{ii}$ are increasing in $i$, with $U_{1,1} = 1 < U_{2,2} < \ldots < U_{n,n} = \sqrt{(\Sigma^{-1})_{n,n}}$. This is due to
$$(\Sigma^{-1})_{j,j} = \sum_{k=0}^{n-j} u_{j+k,j}^2, \quad \text{with special cases } (\Sigma^{-1})_{1,1} = \sum_{k=0}^{n-1} u_{1+k,1}^2 \ \text{and}\ (\Sigma^{-1})_{n,n} = u_{n,n}^2.$$

7. Point-wise rescaling of the test statistics $T$, i.e., multiplication by a diagonal matrix $D$, does not change the signal-to-noise ratio (SNR), because it multiplies the mean and the standard deviation by the same proportion: $D_j\mu_j/\sqrt{D_j^2\Sigma_{j,j}} = \mu_j/\sqrt{\Sigma_{j,j}}$. The input p-values also remain the same.

To show when signals are surely enhanced by DT and IT, first consider a special case where $\mu$ has only one nonzero element at position $j$, i.e., $\mu_j = A$ for a constant $A$ and $\mu_i = 0$ for all $i \ne j$. For DT, the post-transformation mean vector $U\mu$ has elements $(U\mu)_j = AU_{j,j} \ge A$ and $(U\mu)_i = 0$ for all $i \ne j$. Similarly, for IT, the post-transformation mean vector $D\Sigma^{-1}\mu$ has elements

$$(D\Sigma^{-1}\mu)_j = \frac{1}{\sqrt{(\Sigma^{-1})_{j,j}}}(\Sigma^{-1})_{j,j}A = \sqrt{(\Sigma^{-1})_{j,j}}\,A > A.$$

At the same time, after transformation the marginal variances of the input statistics remain 1. Thus the signal strength, in terms of the SNR, is increased. Note that IT is better than DT because $1 \le U_{j,j} \le \sqrt{(\Sigma^{-1})_{j,j}}$, as specified in the linear algebra properties above.

More generally, let $\mu_j = A$ for $j \in M^* = \{j_1, \ldots, j_K\}$, the domain of the true signals. A sufficient condition for DT and IT to strengthen the signals is: 1) the signals are sparse, i.e., $K = o(n)$; 2) the signal locations are randomly distributed, independent of the correlation structure of $\Sigma$, e.g., $M^*$ is uniformly distributed over $\{1, \ldots, n\}$; and 3) the correlation matrix is sparse, e.g., $\Sigma$ decays polynomially, so that $U$, $Q$, and $\Sigma^{-1}$ all decay polynomially. Under this condition, after transformation the SNRs are still roughly $AU_{j,j}$ for DT and $A\sqrt{(\Sigma^{-1})_{j,j}}$ for IT (cf. Lemma 6 in [10] and Lemma 11.2 in [55]). Thus DT and IT still strengthen the signals. This idea is demonstrated by the diagram in Figure 2.4, where dots represent nonzero elements and segments represent the width of the correlations. The nonzero elements of $\mu$ are strengthened by the banded correlation of $D\Sigma^{-1}$ after transformation to get $D\Sigma^{-1}\mu$. On the other hand, the inverse transformation from $D\Sigma^{-1}\mu$ back to $\mu$ would reduce the SNR; however, because the nonzero locations of $D\Sigma^{-1}\mu$ depend on the correlation $\Sigma$, that reverse direction is not an IT as defined in (2.20). As illustrated for the GLM below, this requirement is important for us to start from joint model-fitting and show that marginal model-fitting is its IT, and not vice versa.
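A quick numeric check of properties 2, 3, and 6 above, and of the inequality $1 \le U_{j,j} \le \sqrt{(\Sigma^{-1})_{j,j}}$, using the same positive-definite decay matrix as in the earlier sketch (an illustration, not a proof):

```python
import numpy as np

n = 50
idx = np.arange(n)
Sig = 1.0 / (1.0 + np.abs(idx[:, None] - idx[None, :]))  # PD polynomial-type decay
U = np.linalg.inv(np.linalg.cholesky(Sig))
Sinv = np.linalg.inv(Sig)

d_U = np.diag(U)              # U_jj: all >= 1, with U_11 = 1 (properties 2 and 6)
d_S = np.sqrt(np.diag(Sinv))  # sqrt((Sigma^{-1})_jj) >= 1 (property 3)
# the chain 1 <= U_jj <= sqrt((Sigma^{-1})_jj) behind "IT is better than DT"
assert np.all(d_U >= 1 - 1e-12) and np.all(d_S >= d_U - 1e-12)
print(d_U[[0, 1, -1]], d_S[[0, 1, -1]])
```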
Figure 2.4: Demonstration of IT when both the correlations ($D\Sigma^{-1}$) and the signals ($\mu$) are sparse.

Transformations Under GLM

In this section we study the performance of DT and IT in terms of the SNR under the GLM setting in (2.10). The marginal model-fitting can be roughly considered the IT of the joint model-fitting, and we show various situations where joint model-fitting, DT, and IT have their relative advantages.

For the linear regression model in (2.14), Theorem 2.4.1 illustrates the relationship between the model-fitting types and the transformation types. When signals are sparse, in particular if there is only one signal, no matter what the correlation structure of the data is, DT is guaranteed to improve on the original joint model-fitting, and IT (i.e., the marginal model-fitting) guarantees a further improvement.

Theorem 2.4.1. Consider the linear regression model in (2.14) with error term $\epsilon \sim N(0, \sigma^2 I)$, where $\sigma$ is known. $H = Z(Z'Z)^{-1}Z'$ is the projection matrix onto the column space of $Z$. Denote by $X_j$ the $j$th column of $X$, $j = 1, \ldots, n$.

1. The test statistics from joint model-fitting (least-squares estimation) are $T_J$ in (2.15); the test statistics from marginal model-fitting are

$$T_M = CX'(I - H)Y/\sigma \sim N\big(\Sigma_{T_M}C^{-1}\beta/\sigma,\ \Sigma_{T_M}\big), \tag{2.22}$$

where $C = \mathrm{diag}\{1/\sqrt{X_j'(I - H)X_j},\ j = 1, \ldots, n\}$, so that $\Sigma_{T_M} = CX'(I - H)XC$ is a correlation matrix with 1's on the diagonal.

2. The IT of $T_J$ is $T_M$, i.e., $T_J^{IT} = T_M$, and the DTs of $T_J$ and $T_M$ are the same:

$$T_J^{DT} = T_M^{DT} = U_M CX'(I - H)Y/\sigma \sim N\big((CU_M')^{-1}\beta/\sigma,\ I\big),$$

where $U_M$ is the inverse of the Cholesky factor of $\Sigma_{T_M}$, i.e., $U_M\Sigma_{T_M}U_M' = I$.

3. If $\beta$ has one nonzero element $A > 0$ under $H_a$, then the SNRs of the three methods have the relationship (at the signal's position)

$$E(T_J) \le E(T_J^{DT}) \le E(T_M).$$

Note that $T_J$ and $T_M$ can be mutually transformed by multiplying by the inverse of the correlation matrix. However, according to the rule of IT described above, we say $T_M$ is the IT of $T_J$, not vice versa. This is because the locations of the nonzero elements in the mean vector of $T_M$ depend on the correlation structure of its correlation matrix. For example, it could be the case that two elements of the statistics, $T_{J1}$ with nonzero mean (indicating a causal factor) and $T_{J2}$ with zero mean (a non-causal factor), are correlated. $T_M$ depends on the data covariance matrix $X'(I - H)X$ (conditional on $Z$): if $X_1$ and $X_2$ are correlated, then $\mu_{M1} \ne 0$ implies $\mu_{M2} \ne 0$ even if $\beta_2 = 0$. Thus $\hat\beta_M$ is a biased estimator, but it can provide more power.

Theorem 2.4.1 considers one signal for simplicity; in general, a similar result shows that DT and IT guarantee an improvement of the SNR as long as the signals are sparser than the correlation, the situation illustrated in Figure 2.4. In this case, marginal model-fitting is still always preferred over joint model-fitting, no matter whether the correlations are positive or negative. Furthermore, Theorem 2.4.1 assumes $N > n + m$ so that the joint fitting is feasible. In general, for high-dimensional data analysis problems where $N < n + m$, the marginal fitting is still a good choice due to its simple computation and its relationship with IT, which maximizes the signal-to-noise ratio of the test statistics when the covariates are correlated and the true $\beta$ is sparse.
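A small sketch (our own illustration) of part 3 of Theorem 2.4.1: with one nonzero coefficient, the expected statistic at the signal position should be ordered $E(T_J) \le E(T_J^{DT}) \le E(T_M)$. The expectations are computed analytically from a simulated design rather than by Monte Carlo; the equal-correlation design is an assumption of this example.

```python
import numpy as np

rng = np.random.default_rng(2)
N, n, sigma, j, A = 200, 8, 1.0, 3, 0.5
# correlated inquiry covariates (equal correlation 0.5); intercept-only control
X = rng.multivariate_normal(np.zeros(n), 0.5 + 0.5 * np.eye(n), size=N)
Z = np.ones((N, 1))
H = Z @ np.linalg.solve(Z.T @ Z, Z.T)
M = X.T @ X - X.T @ H @ X                 # X'(I - H)X
Minv = np.linalg.inv(M)
C = np.diag(1.0 / np.sqrt(np.diag(M)))
S_TM = C @ M @ C                          # correlation matrix of T_M, (2.22)
Q_M = np.linalg.cholesky(S_TM)            # (C U_M')^{-1} has diagonal Q_M[j, j]/C[j, j]

E_TJ  = (A / np.sqrt(Minv[j, j])) / sigma         # joint fitting, from (2.15)
E_TDT = A * np.sqrt(M[j, j]) * Q_M[j, j] / sigma  # after DT
E_TM  = A * np.sqrt(M[j, j]) / sigma              # marginal fitting / IT, from (2.22)
print(E_TJ, E_TDT, E_TM)    # expect E_TJ <= E_TDT <= E_TM
```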
Now we extend the result to the GLM.

Theorem 2.4.2. Consider the GLM in (2.10).

1. The test statistics from joint model-fitting are $T_J$ in (2.13). The test statistics from marginal model-fitting are

$$T_M = CX'(Y - \mu^{(0)}) \xrightarrow{D} N\big(\Sigma_{T_M}C^{-1}\beta,\ \Sigma_{T_M}\big), \tag{2.23}$$

where $\mu^{(0)}$ is the MLE of the mean under $H_0$, $C = \mathrm{diag}\{1/\sqrt{\tilde X_j'(I - \tilde H)\tilde X_j},\ j = 1, \ldots, n\}$, $\tilde X = W^{1/2}X$, $\tilde Z = W^{1/2}Z$, $\tilde H = \tilde Z(\tilde Z'\tilde Z)^{-1}\tilde Z'$, and $W = \mathrm{diag}\{\mathrm{Var}^{(0)}(Y_j)\}$ is the estimated weights matrix under $H_0$.

2. The IT of $T_J$ is $T_M$, i.e., $T_J^{IT} = T_M$, and the DTs of $T_J$ and $T_M$ are the same:

$$T_J^{DT} = T_M^{DT} = U_M CX'(Y - \mu^{(0)}) \xrightarrow{D} N\big((CU_M')^{-1}\beta,\ I\big),$$

where $U_M$ is the inverse of the Cholesky factor of $\Sigma_{T_M}$, i.e., $U_M\Sigma_{T_M}U_M' = I$.

3. If $\beta$ has one nonzero element $A > 0$ under $H_a$, then the SNRs of the three methods have the relationship

$$E(T_J) \le E(T_J^{DT}) \le E(T_M).$$

By Theorems 2.4.1 and 2.4.2 and [55], under the GLM with sparse coefficients $\beta$ and Toeplitz correlation of the covariates, we can show the optimality of iHC for detecting asymptotically weak-and-rare signals. Specifically, the assumptions are:

1. For $j \in M^* = \{j_1, \ldots, j_K\}$, $\beta_j = A_{n,j} = \sqrt{2r_j\log n}$, and $\beta_j = 0$ otherwise. The domain of $\beta$, $M^*$, has size $K = n^{1-\alpha}$, $\alpha \in (1/2, 1)$, indicating sparse signals. Moreover, $M^*$ is uniformly distributed on $\{1, \ldots, n\}$ with equal