Novel P-value Combination Methods for Signal Detection in Large-Scale Data Analysis

by

Hong Zhang

A Dissertation

Submitted to the Faculty

of the

WORCESTER POLYTECHNIC INSTITUTE

in partial fulfillment of the requirements for the

Degree of Doctor of Philosophy

in

Mathematical Sciences

April 2018

© 2018 - Hong Zhang

All rights reserved.

Thesis advisor: Zheyang Wu
Author: Hong Zhang

Novel P-value Combination Methods for Signal Detection in Large-Scale Data Analysis

Abstract

In this dissertation, we first study the distributional properties of gGOF, a family

of maximum based goodness-of-fit statistical tests and we propose TFisher, a new

family of aggregation based tests that generalize and optimize the classic Fisher’s

p-value combination method. The robust data-adaptive versions of these tests are

proposed to reduce the sensitivity of statistical power to different signal patterns.

We also develop analytical algorithms to efficiently find the p-values of both tests

under arbitrary correlation structures so that these optimal methods are not only

powerful but also computationally feasible for analyzing large-scale correlated data.

Both families of tests are successfully applied to detect the joint genetic effect of

human complex diseases by analyzing genome-wide association study (GWAS) data

and whole exome sequencing data.

In Chapter 1, we study analytical distribution calculations for gGOF statistics,

which cover the optimal tests, the φ-divergence statistics, under arbitrary independent and continuous H0 and H1 models. Compared with the rich literature of analytical

p-value calculations, this work possesses advantages in its generality, accuracy, and

computational simplicity. We also provide a general data analysis framework to apply

gGOF statistics into SNP-set based GWAS for either quantitative or categorical traits.

An application to a Crohn’s disease study shows that these optimal tests do have good

potential for detecting novel disease genes.


In Chapter 2, we address the issue for gGOF under general settings of correlated data. We provide a novel p-value calculation approach which often possesses better accuracy than the commonly used moment-matching approach under various correlation structures. We also propose a strategy of combining innovated transformation and gGOF statistics, called igGOF. Furthermore, igGOF allows a natural double-omnibus test, called digGOF, which adapts both the functional of the statistics and the truncation of the input p-values to unknown data and signal patterns. We applied the tests in genetic association studies, both by simulations and a real exome-sequencing data analysis of amyotrophic lateral sclerosis (ALS).

In Chapter 3, we propose a unifying family of Fisher’s p-value combination statistics, called TFisher, with general p-value truncation and weighting schemes. Analytical calculations for the p-value and the statistical power of TFisher under general hypotheses are given. A soft-thresholding scheme is shown to be optimal for signal detection in a large space of signal patterns. When prior information of the signal pattern is unavailable, an omnibus test, oTFisher, can adapt to the given data. Simulations evidenced the accuracy of the calculations and validated the theoretical properties. The

TFisher tests were applied to analyzing a whole exome sequencing data of ALS.

In Chapter 4, we propose an approximation for the p-value of TFisher and oTFisher for analyzing correlated data with general correlation structures. The methods extend Brown’s method [1] to a more general Gamma distribution. An analytical approximation of the variance of TFisher is also provided. Numerical results show that both approximations are accurate.

Contents

Title Page
Abstract
Table of Contents

1 Distributions and Statistical Power of Optimal Signal-Detection Methods In Finite Cases
  1.1 Introduction
  1.2 The gGOF Family for Weak-Sparse Signals
  1.3 Analytical Results
  1.4 Numerical Results
  1.5 A Framework for GWAS And Application to Crohn’s Disease Study
  1.6 Discussion

2 digGOF: Double-Omnibus Innovated Goodness-Of-Fit Tests For Dependent Data Analysis
  2.1 Introduction
  2.2 Models of Hypotheses
  2.3 The gGOF Family Under Dependence
  2.4 Innovated Transformation
  2.5 Numerical Studies
  2.6 Application to Genome-wide Association Study
  2.7 Discussion

3 TFisher Tests: Optimal and Adaptive Thresholding for Combining p-Values
  3.1 Introduction
  3.2 TFisher Tests and Hypotheses
  3.3 TFisher Distribution Under H0
  3.4 TFisher Distribution Under General H1
  3.5 Asymptotic Optimality for Signal Detection
  3.6 Statistical Power Comparison For Signal Detection
  3.7 ALS Exome-seq Data Analysis
  3.8 Discussion

4 TFisher Distribution Under Dependent Input Statistics
  4.1 Introduction
  4.2 Approximate Null Distribution of TFisher
  4.3 Numerical Results
  4.4 Extension to oTFisher

A Proofs of Chapter 1
B Proofs of Chapter 2
C Proofs of Chapter 3
D Proofs of Chapter 4


Chapter 1

Distributions and Statistical Power of Optimal Signal-Detection Methods In Finite Cases


1.1 Introduction

In big data analysis, signals are often buried within a large amount of noise and

are thus relatively weak and sparse. Developing optimal tests for detecting weak-

sparse signals is important for many data-driven scientific studies. For example,

in large-scale genetic association studies millions of genetic variants are queried.

Only a relatively small proportion are expected to be truly associated with a given

disease, and most genetic effects are relatively weak, especially compared with the

cumulative noise level of high-throughput data [2,3]. In recent years, theoretical

studies have revealed a collection of asymptotically optimal tests in the sense that

they can reach the boundary of detectable region. In other words, they are capable of

detecting the signals at an intensity level that is minimally required for any statistics

to detect reliably. In particular, the Higher Criticism (HC) type tests [4,5,6,7], the

Berk-Jones (B-J) type tests [8], the φ-divergence type tests [9], etc., have been proven to share the same property of such asymptotic optimality.

Despite these exciting progresses, some practical questions still remain to be answered. In particular, the asymptotic optimality is mostly proven under theoretical assumptions such as the mixture model and testing group size n → ∞ [4, 9, 10, 11]. For real data analysis, however, the hypothesis model could be more arbitrary and n is always finite (even small or moderate). Optimal tests that are equivalent in the asymptotic sense could in fact perform quite differently. In order to apply these optimal tests to real data analysis as well as to make an appropriate choice based on signal patterns, it is important to analytically calculate p-values as well as statistical power under more general and realistic assumptions. Moreover, in order to better understand the performance of relevant tests in a broader context, it is beneficial to study a generic statistic family that unifies the common style of these test statistics. This chapter is to address these issues.

1.1.1 Scope of The Work

This chapter considers a generic family of goodness-of-fit (GOF) test statistics, called gGOF. Following the essential idea of GOF tests, a gGOF statistic is loosely defined as a summary statistic based on the maximal contrast between ordered input p-values and their expectations under the null hypothesis. Let P1, ..., Pn, for given n > 1, be a set of input p-values, and P(1) ≤ ... ≤ P(n) be the ordered p-values. A gGOF statistic is defined as the supremum of a generic contrast function f over a truncation domain R:

$$S_{n,R} = \sup_{R} f\!\left(\tfrac{i}{n}, P_{(i)}\right), \quad \text{where } R = \{i : k_0 \le i \le k_1\} \cap \{P_{(i)} : \alpha_0 < P_{(i)} < \alpha_1\}, \tag{1.1}$$

for given k0 ≤ k1 ∈ {1, ..., n} and α0 ≤ α1 ∈ [0, 1]. The gGOF requires the null hypothesis

$$H_0: P_i \overset{\text{i.i.d.}}{\sim} \text{Uniform}[0, 1], \quad i = 1, ..., n, \tag{1.2}$$

so that the null expectation E(P(i)) = i/(n+1). If the null is untrue, P(i) will differ from their null expectations, which is to be captured by the contrast function f. In practice, smaller p-values indicate signals or positive outcomes. Therefore f(x, y) can be any monotonically decreasing function in y at fixed x, so that the smaller the input p-values, the larger the statistic and the stronger the evidence against the null hypothesis. In other words, gGOF is one-sided in nature. Note that i/n, instead of i/(n+1), is used to represent the null mean of P(i) following the tradition of the GOF definition.

In fact, either fraction can be used and they impose no practical difference.

For both theoretical and practical reasons, it is important to allow a general truncation domain R to restrict both the index i and the magnitude of the p-values P(i).

Aside from the benefit of computational efficiency (e.g., big input p-values can be safely truncated as they are likely not signals), some gGOF statistics could be improved by excluding too small p-values. For example, HC could have the long-tail problem under the null due to small p-values, while removing P(1), P(2), etc., may not guarantee the fix [4, 12]. It is better to directly restrict the magnitude of the p-values, and thus a modified version of HC was created with R = {1 < i ≤ n/2, P(i) ≥ 1/n}

[4, 13]. The significant influence of restricting P(i) ≥ 1/n under finite n is further demonstrated in Section 1.4.
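As a concrete illustration of definition (1.1), the following minimal sketch (in Python, with function names of my own choosing; this is not code from the dissertation) evaluates a gGOF statistic such as HC2004 over a truncation domain R restricted both by the index range and by the p-value magnitude.

```python
import numpy as np

def hc2004_f(x, y, n):
    # HC2004 contrast function: f(x, y) = sqrt(n) * (x - y) / sqrt(y * (1 - y))
    return np.sqrt(n) * (x - y) / np.sqrt(y * (1 - y))

def ggof_statistic(pvals, f, k0=1, k1=None, alpha0=0.0, alpha1=1.0):
    """Evaluate S_{n,R} = sup_R f(i/n, P_(i)) over
    R = {k0 <= i <= k1} intersected with {alpha0 < P_(i) < alpha1}."""
    p = np.sort(np.asarray(pvals))
    n = p.size
    if k1 is None:
        k1 = n // 2                     # a common default, e.g., R = {1 <= i <= n/2}
    i = np.arange(1, n + 1)
    in_domain = (i >= k0) & (i <= k1) & (p > alpha0) & (p < alpha1)
    if not np.any(in_domain):
        return -np.inf
    return np.max(f(i[in_domain] / n, p[in_domain], n))

# Example: HC2004 over the modified-HC domain R = {1 < i <= n/2, P_(i) >= 1/n}
# (strict vs. non-strict boundary makes no practical difference for continuous p-values).
rng = np.random.default_rng(0)
pvals = rng.uniform(size=100)           # null p-values
mhc = ggof_statistic(pvals, hc2004_f, k0=2, k1=50, alpha0=1 / 100)
print(mhc)
```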

Concerning the power calculation problem, this chapter considers general hypotheses for input statistics T1, ..., Tn:

$$H_0: T_i \overset{\text{iid}}{\sim} F_0, \qquad H_1: T_i \overset{\text{iid}}{\sim} F_1, \quad i = 1, ..., n, \tag{1.3}$$

where Fj, j = 0, 1, denote the cumulative distribution functions (CDFs). F0 and F1 can be arbitrary continuous distributions, such as the Gaussian model, which is often assumed in theoretical studies, or t- or chi-squared distributions, which are ubiquitously seen in practical data analysis. The input p-values for gGOF are obtained based on the input statistics. Without loss of generality, the p-values are defined as

Pi = 1 − F0(Ti). (1.4)

A few remarks should be made regarding the setting of the study. First, the p-value definition in (1.4) actually covers two-sided p-values. Since F0 is allowed to be arbitrary, if the signs of input statistics have meaningful directionality, the statistics can simply be replaced by Ti' = Ti^2 ~ F0'. Therefore the framework of this chapter allows detecting directional signals, e.g., both protective and deleterious effects of mutations

in genetic association studies. Second, the hypotheses are defined based on input

statistics rather than input p-values because that is more convenient for modeling

“signal patterns” meaningfully. Following the tradition of statistical power studies,

signal patterns are defined by the distinction between F0 and F1, often through their

parameters. More details on the interpretation of hypotheses are given in Section 2.2.

Thirdly, the iid assumption in (1.3) is for the convenience of power calculation. If

p-value calculation is the only concern in real data analysis, Ti’s are allowed to have

different distributions, for which the null hypothesis in (1.3) can be generalized to

H0 : Ti ∼ F0i, i = 1, ..., n. (1.5)

As long as Ti’s are independent, the p-values obtained in (1.4) still satisfy the null in (1.2), which is the only requirement for the p-value calculation. That is, our p-value calculation methods can be applied to meta-analysis or integrative analysis of heterogeneous data, where input test statistics could follow significantly different distributions.
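To make the first remark concrete, when F0 = Φ the directional case reduces to squaring the statistic, since Ti² follows a chi-squared distribution with one degree of freedom under the null; the snippet below is only a small numerical check of that equivalence (the code and variable names are illustrative, not from the dissertation).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
T = rng.normal(size=5)                      # input statistics with F0 = N(0, 1)

# One-sided p-values from (1.4): P_i = 1 - F0(T_i)
p_one_sided = 1 - stats.norm.cdf(T)

# Directional (two-sided) signals: replace T_i by T_i' = T_i^2 ~ chi-square(1) under H0,
# then apply (1.4) with F0' the chi-square(1) CDF.
p_two_sided = 1 - stats.chi2.cdf(T**2, df=1)

# Equivalent classical form: 2 * (1 - Phi(|T_i|))
print(np.allclose(p_two_sided, 2 * (1 - stats.norm.cdf(np.abs(T)))))  # True
```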

One application of gGOF is the SNP-set based association test for finding disease genes. Here each gene (or other meaningful genome segment) is a subject of test.

To decide whether a gene is associated with a phenotype trait, a gGOF statistic is calculated based on input p-values that measure the associations between the trait and each single nucleotide polymorphism (SNP) within that gene. Section 1.5 provides a general framework for analyzing GWAS data of either quantitative or categorical traits. For a real GWAS of Crohn’s disease, Figure 1.1 shows that the framework and the analytical p-value calculation work well for four classic weak-sparse optimal tests.

Details on the data analysis and the putative disease genes are given in Section 1.5.

1.1.2 Connection to Relevant Literature

Analytical calculation for the p-value and statistical power possesses significant advantages over studies based on Monte-Carlo simulations or permutations, not only for faster and more precise computation but also for deeper understanding. In particular, empirical p-values by simulations and permutations are discrete, and their accuracy and stability depend on the number of simulations and permutations. In big data analysis such as GWAS, very small p-values are desired to control the error rate due to the enormous number of simultaneous tests. That requires huge numbers of repeated simulations to stably obtain very small p-values [14, 6]. Even worse, when genetic mutations are rare, more permutations actually cannot break the ties among p-values. As for the study of statistical power, compared to blackbox-like simulations, analytical calculations could provide mathematical insights to elucidate how the signal-defining parameters affect the statistic’s distributions and its power. These insights could potentially be helpful in improving the design of better statistics.

Figure 1.1: The association p-values for genes by exact calculation of four asymptotically weak-sparse-optimal tests. First row: HC2004 and B-J; second row: reverse B-J and HC2008.

Analytical power calculation involves calculating the statistic’s distributions under both H0 and H1. To the best of our knowledge there is no satisfactory work yet to calculate the alternative distribution for gGOF under finite n. As for the null distribution, there is indeed a rich literature on p-value calculations for Kolmogorov-Smirnov type statistics. Two main strategies were used: the exact calculation and the approximation. For exact calculation, various recursive methods (e.g., Noe’s recursion [15, 16],

Bolshev’s recursion [17, 18], Steck’s recursion [19, 20], Ruben’s recursion [21], etc.)

were developed to calculate the null distribution. These methods also cover a more recent work that specifically dealt with HC [22]. Such recursive methods were for non-truncated statistics (i.e., R = {1 ≤ i ≤ n} in (1.1)), and require a computational complexity of O(n^3). Denuit et al. [23] provided an exact calculation that unified these recursive methods, allowed R = {k0 ≤ i ≤ k1}, and brought the complexity down to O(n^2). It should be noted that Denuit’s method also fully covers a result given in a later paper [12]. Our results on exact calculation were developed independently of [23] and [12]. We share a similar idea of utilizing the joint distribution of ordered uniform random variables, but the differences are significant. First, for the case R = {k0 ≤ i ≤ k1} our computational complexity is further reduced to O((k1 − k0)^2), where k1 − k0 could be much smaller than n. Second, we provide results for truncation by P(i), which cannot be trivially extended from the results in [23, 12]. As discussed above, truncation by P(i) has a naturally different influence on the statistic than truncation by i. Indeed, the corresponding computational complexity is quite different (see Section 1.3.2). Third, the main focus of our work is not only calculating the p-value but also statistical power.

As for approximating the null distribution, it is well known that Kolmogorov-

Smirnov type statistics converge in law to an extreme-value distribution [24, 25].

However, such convergence is too slow to be accurate for even moderately large n

[22]. Recently, Li and Siegmund (LS) [13] developed an asymptotic approximation

Figure 1.2: Comparison among different methods for calculating the right-side probability of the modified HC test (MHC) in (1.8) with R = {1 < i ≤ n/2, P(i) ≥ 1/n} [4] under H0 : Ti ~ i.i.d. N(0, 1). Simu: curve obtained by simulation; Exact: by Corollary 1; Li&Siegmund: by [13].

for HC and B-J, which performs well at the right tail but not at the left tail. See

Figure 1.2 for an example. This natural limitation prevents the LS method from power calculation too. In this chapter we also study distribution approximation in order to further simplify computation and to reveal insights on gGOF performance. We give a sufficient condition for LS type asymptotics to be workable for the gGOF family under general hypotheses in (1.3). Furthermore, we propose to use the gamma approximation, instead of the beta approximation used by the original LS. Our formula retains the same accuracy at the right tail, and could improve accuracy for the whole distribution at small n.

This chapter is organized as follows. In Section 1.2 we review the literature of asymptotically optimal tests for weak-sparse signals, and illustrate their connection with the gGOF family. The analytical results are presented in Section 1.3 for both exact and approximated calculations. Through simulations, Section 1.4 numerically evidences the calculation accuracy, and provides systematic power comparisons among these asymptotically optimal tests. We show the application of the gGOF tests in a real GWAS in Section 1.5. In Section 1.6 we discuss the limitations of this work and future plans. All proofs and supportive lemmas are given in Appendix A.

1.2 The gGOF Family for Weak-Sparse Signals

The signal detection problem is a set-testing problem that combines the input

p-values P1, ..., Pn of a set of input statistics T1, ..., Tn into one summary statistic,

which is then used to test whether there exist “signals”. Signals are characterized

by the contrast between the alternative and the null hypothesis of the whole set. As

a special case of (1.3), a classic setting of the null and the alternative is Gaussian

mixture model:

$$H_0: T_i \sim F_0 = \Phi, \qquad H_1: T_i \sim F_1 = \epsilon\,\Phi_\mu + (1-\epsilon)\,\Phi, \quad i = 1, ..., n, \tag{1.6}$$

where Φ and Φµ are the CDFs of N(0, 1) and N(µ, 1), respectively. H1 indicates that

an ε ∈ (0, 1) proportion of the n input statistics are for true “signals” (e.g., disease markers) with strength µ [4, 26, 6]. The setting is also consistent with meta-analysis, where H1 could indicate that an ε proportion of n studies are true positives (e.g., differential gene expressions) with effect size µ [27, 28]. The summary statistic usually combines p-values rather than the input statistics because p-values directly measure statistical significance, no matter how different the data scales are. When the distribution of a statistic is known, its p-value provides the same information as itself.

Under the asymptotic rare and weak (ARW) setting, i.e., the parameters in (1.6)

are regulated as εn = n^(−α), α ∈ (1/2, 1), µn = √(2r log n), r ∈ (0, 1), a few seminal

studies [4, 29, 30, 11] have discovered the asymptotic detection boundary in terms of

a function curve of the signal strength and sparsity:

$$r = \rho^*(\alpha) = \begin{cases} \alpha - 1/2, & 1/2 < \alpha \le 3/4, \\ (1 - \sqrt{1-\alpha})^2, & 3/4 < \alpha < 1. \end{cases} \tag{1.7}$$

When the signal-representing parameters (α, r) of the input statistics are below the

curve, H0 and H1 converge as n → ∞. That is, no statistical methods can reliably

detect signals because they are too weak. Whenever (α, r) are above the curve, the

asymptotically optimal tests are asymptotically powerful in the sense that they are

capable of making both the type I and the type II error rates converge to zero as n → ∞. A particular optimal statistic is the Higher Criticism (HC) statistic [31, 4]:

$$HC_{n,R} = \sup_{R} \sqrt{n}\,\frac{i/n - P_{(i)}}{\sqrt{P_{(i)}(1 - P_{(i)})}}, \tag{1.8}$$

where R has several versions as special cases of that in (1.1) [4, 13]. Note that in the literature [4, 10, 22] the HC formula could also be written as (assuming input p-values are one-sided):

$$HC = \sup_{t \in R^*} \frac{\sum_i I\{T_i > t\} - n\bar{\Phi}(t)}{\sqrt{n\,\bar{\Phi}(t)\,\Phi(t)}}, \tag{1.9}$$

where Φ̄(t) = 1 − Φ(t). We do not follow this formula because it is restricted to the hypothesis setting of F0 = Φ, and the supremum domain R* on t is equivalent to R on P(i) (but not on the index i).

A variety of versions of HC statistics, the Berk-Jones (B-J) type statistics, a spectrum of φ-divergence statistics, etc. were all proven asymptotically optimal [4,5,

32, 9]. These statistics can each be considered as a goodness-of-fit (GOF) statistic, which by definition is to test the distinction between given data and a given distribution (i.e., whether the input statistics have a good “fit” with the null distribution). P-values that are smaller than their null expectations evidence against the null. The simplest

GOF statistic is the simple one-sided Kolmogorov-Smirnov test statistic (c.f. [33],

page 447, denoted KS+ here), which directly measures the difference between P(i) and i/n. Under the roof of the gGOF family in (1.1), the KS+ statistic corresponds to the contrast function f:

fKS+ (x, y) = x − y. (1.10)

Because smaller p-values are more likely to indicate the alternative, the absolute difference i/n − P(i) should be reweighted with regard to P(i) or i/n. Such rescaled KS

tests are related to the Higher Criticism (HC) statistics proposed in 2004 and 2008

[4, 5], respectively, where the f functions are defined as

$$f_{HC2004}(x, y) = \sqrt{n}\,\frac{x - y}{\sqrt{y(1 - y)}}; \qquad f_{HC2008}(x, y) = \sqrt{n}\,\frac{x - y}{\sqrt{x(1 - x)}}. \tag{1.11}$$

The HC2004 statistic is similar to the Anderson-Darling statistic [34], but is more general

due to its definition based on p-values and the truncation domain R for improved performance [4, 12].

Jager and Wellner introduced a collection of φ-divergence statistics [9]; each one

of them is based on a contrast function at a given s:

$$\begin{aligned}
f_s^{\phi}(x, y) &= \frac{1}{s(1 - s)}\left(1 - x^s y^{1-s} - (1 - x)^s (1 - y)^{1-s}\right), \quad s \ne 0, 1, \\
f_1^{\phi}(x, y) &= x\log\!\left(\frac{x}{y}\right) + (1 - x)\log\!\left(\frac{1 - x}{1 - y}\right), \\
f_0^{\phi}(x, y) &= y\log\!\left(\frac{y}{x}\right) + (1 - y)\log\!\left(\frac{1 - y}{1 - x}\right).
\end{aligned} \tag{1.12}$$

At certain s values (e.g., s = 2 or −1) these statistics are two-sided in the sense

that switching the values of x = i/n and y = P(i) gives the same statistic. However,

as mentioned above, because smaller p-values indicate signals, we consider the one-

sided version of φ-divergence statistics. A simple adjustment of the f function could

be:

$$f_s(x, y) = \begin{cases} \sqrt{2n f_s^{\phi}(x, y)}, & y \le x, \\ -\sqrt{2n f_s^{\phi}(x, y)}, & y > x. \end{cases} \tag{1.13}$$

Now for all s, fs(x, y) is guaranteed decreasing in y. Such one-sided φ-divergence

statistics cover HC exactly: f2 = fHC2004 and f−1 = fHC2008 . Also, s = 1 and 0

correspond to the Berk-Jones statistic [8,4, 13] and the reverse Berk-Jones statistic

[8], respectively. Chapter 1: gGOF under Independence 14
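The family (1.12)–(1.13) is easy to evaluate directly; the sketch below (illustrative Python code, not from the dissertation) implements f_s with the limiting cases s = 1 and s = 0 handled separately, and checks numerically that s = 2 reproduces the HC2004 contrast.

```python
import numpy as np

def phi_divergence(x, y, s):
    """The phi-divergence contrast of Jager and Wellner, equation (1.12)."""
    if s == 1:
        return x * np.log(x / y) + (1 - x) * np.log((1 - x) / (1 - y))
    if s == 0:
        return y * np.log(y / x) + (1 - y) * np.log((1 - y) / (1 - x))
    return (1 - x**s * y**(1 - s) - (1 - x)**s * (1 - y)**(1 - s)) / (s * (1 - s))

def f_s(x, y, s, n):
    """One-sided version (1.13): the sign depends on whether y <= x."""
    val = np.sqrt(2 * n * phi_divergence(x, y, s))
    return np.where(y <= x, val, -val)

# s = 2 recovers the HC2004 contrast (the two printed values should agree):
n = 50
x, y = 10 / n, 0.05
hc2004 = np.sqrt(n) * (x - y) / np.sqrt(y * (1 - y))
print(float(f_s(x, y, 2, n)), hc2004)
```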

1.3 Analytical Results

1.3.1 Calculation Strategy

In the following we first summarize the general idea of the calculation. Then,

specific strategies for obtaining the exact or approximated distributions will follow

under various settings and assumptions.

First, regarding the hypotheses in (1.3), for any given continuous CDFs F0, F1 we define a monotone transformation function on the domain [0, 1]:

$$D(x) = \begin{cases} x & \text{under } H_0, \\ 1 - F_1\!\left(F_0^{-1}(1-x)\right) & \text{under } H_1. \end{cases} \tag{1.14}$$

Note that for any p-value Pi, D(Pi) ∼ Uniform[0, 1] under either H0 or H1.

Second, regarding the gGOF statistic Sn,R in (1.1), for each fixed x define the

inverse of the contrast function f(x, y):

g(x, ·) = f^(−1)(x, ·). (1.15)

For example, the g functions for the HC statistics defined in (1.11) at any constant b are

$$g_{HC2004}(x, b) = \frac{1}{1+b^2/n}\left[x + \frac{b^2/n - (b/\sqrt{n})\sqrt{b^2/n + 4x(1-x)}}{2}\right]; \qquad g_{HC2008}(x, b) = x - (b/\sqrt{n})\sqrt{x(1-x)}. \tag{1.16}$$

In general, if the closed form of the g function is not available, it can always be found numerically since f(x, y) is strictly decreasing in y.

Now under either H0 or H1, the CDF of Sn,R is

$$\begin{aligned} P(S_n \le b) &= P\!\left(\sup_{R} f\!\left(\tfrac{i}{n}, P_{(i)}\right) \le b\right) \\ &= P\!\left(\bigcap_{R}\left\{P_{(i)} > g\!\left(\tfrac{i}{n}, b\right)\right\}\right) \\ &= P\!\left\{D(P_{(i)}) > D\!\left(g\!\left(\tfrac{i}{n}, b\right)\right), \text{ all } i \text{ and } P_{(i)} \text{ in } R\right\}. \end{aligned} \tag{1.17}$$

For both exact and approximate calculation of the distributions, we take advantage

of the fact that under either H0 or H1, U(i) := D(P(i)) is the i-th order statistic of

Uniform[0, 1], and we study the joint distribution of U(i) under the restriction R in different ways.
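As a rough illustration of this strategy, the sketch below encodes the D transform in (1.14) and finds g(x, b) in (1.15) by root-finding when no closed form is available; the Gaussian mixture used for H1 and all function names are assumptions for the example only.

```python
import numpy as np
from scipy import stats, optimize

def D(x, F0_ppf=None, F1_cdf=None):
    """D(x) in (1.14): identity under H0; 1 - F1(F0^{-1}(1 - x)) under H1."""
    if F1_cdf is None:
        return x
    return 1 - F1_cdf(F0_ppf(1 - x))

def g_numeric(x, b, f, tol=1e-12):
    """Numerically invert y -> f(x, y) at level b (f is strictly decreasing in y)."""
    return optimize.brentq(lambda y: f(x, y) - b, tol, 1 - tol)

# Example with HC2004's contrast and a Gaussian mixture alternative
n, eps, mu = 100, 0.1, 1.0
f_hc = lambda x, y: np.sqrt(n) * (x - y) / np.sqrt(y * (1 - y))
F0_ppf = stats.norm.ppf
F1_cdf = lambda t: (1 - eps) * stats.norm.cdf(t) + eps * stats.norm.cdf(t - mu)

y_star = g_numeric(0.1, 2.0, f_hc)          # g(x = 0.1, b = 2)
print(y_star, D(y_star, F0_ppf, F1_cdf))    # boundary point and its D-transform under H1
```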

To simplify the presentation, we list below the notations to be referred later on.

(N1) uk := D(g(k/n, b) ∨ α0) based on the equations in (1.14)–(1.17), and α0 ≥ 0 is the

lower bound constant for truncating P(i) in (1.1).

(N2) F̄B(α,β)(x) denotes the survival function of the Beta(α, β) distribution.

(N3) FΓ(α)(x) and F̄Γ(α)(x) denote the CDF and survival function of the Gamma(α, 1)

distribution, respectively, where the shape parameter is α, the scale parameter

is 1.

(N4) Based on the notation (N3) define

hk(x) := xFΓ(k−1)(kx) − FΓ(k)(kx).

(N5) fP (λ)(x) denotes the probability mass function of Poisson(λ) distribution. Chapter 1: gGOF under Independence 16
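These notations map directly onto standard distribution functions; a small helper sketch using scipy (the function names are mine, not the dissertation's) is given for reference.

```python
from scipy import stats

def beta_sf(x, a, b):          # (N2): survival function of Beta(a, b)
    return stats.beta.sf(x, a, b)

def gamma_cdf(x, a):           # (N3): CDF of Gamma(a, 1)
    return stats.gamma.cdf(x, a)

def gamma_sf(x, a):            # (N3): survival function of Gamma(a, 1)
    return stats.gamma.sf(x, a)

def h(k, x):                   # (N4): h_k(x) = x * F_Gamma(k-1)(k x) - F_Gamma(k)(k x)
    return x * gamma_cdf(k * x, k - 1) - gamma_cdf(k * x, k)

def poisson_pmf(k, lam):       # (N5): probability mass function of Poisson(lambda)
    return stats.poisson.pmf(k, lam)

print(beta_sf(0.2, 1, 5), h(10, 0.5), poisson_pmf(3, 2.0))
```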

1.3.2 Exact Calculations of gGOF Distributions

In this section we provide calculation methods for the exact distribution of any

gGOF statistic in (1.1) under either H0 or H1 in (1.3). Accordingly, p-value and

statistical power can be calculated in an exact manner. In the following each theorem

concerns a specific truncation domain R. The first theorem is for truncation based

on the index i only. For example, the initial HC was defined with R = {1 ≤ i ≤ n/2}

[4].

Theorem 1.3.1. Consider any gGOF statistic in (1.1) with R = {k0 ≤ i ≤ k1} for

given 1 ≤ k0 ≤ k1 ≤ n. Let m = n − k1 + 1. Follow notations (N1) and (N2), and define

$$a_{k_1} = \frac{n!}{(n-k_1+1)!}\,\bar{F}_{B(1,m)}(u_{k_1}), \quad \text{and}$$

$$a_k = \frac{n!}{(n-k+1)!}\,\bar{F}_{B(k_1-k+1,\,m)}(u_{k_1}) - \sum_{j=1}^{k_1-k} \frac{u_{k+j-1}^{\,j}}{j!}\,a_{k+j}, \quad k = k_1-1, ..., 1.$$

Under either H0 or H1, we have

$$P(S_{n,R} \le b) = \bar{F}_{B(k_1,m)}(u_{k_1}) - \sum_{i=k_0}^{k_1-1} \frac{u_i^{\,i}}{i!}\,a_{i+1}.$$

It should be noted that the computational complexity of this equation is O((k1 −

k0)^2), which is a significant reduction from the O(n^2) of [23] and [12], especially when

(k1 − k0) = o(n). The next theorem concerns the truncation based on the magnitude of p-values, i.e., R = {α0 ≤ P(i) ≤ α1}. For example, the HC statistic defined in (1.9)

has the equivalent truncation, which is required in proving some theoretical properties

(see [10] for example). Also, such truncation is needed to resolve the long tail problem

of HC [4, 12]. Chapter 1: gGOF under Independence 17

Theorem 1.3.2. Consider any gGOF statistic in (1.1) with R = {α0 ≤ P(i) ≤ α1} for given 0 ≤ α0 < α1 ≤ 1. Follow notations (N1) and (N2), and define

$$\beta_0 = D(\alpha_0), \quad \beta_1 = D(\alpha_1), \quad c_{ij} = \frac{\beta_0^{\,i-1}(1-\beta_1)^{\,n-j+1}}{(i-1)!\,(n-j+1)!},$$

$$a_j(k) = \frac{n!\,\beta_1^{\,j-k}}{(j-k)!}\,\bar{F}_{B(j-k,\,1)}\!\left(\frac{u_{j-1}}{\beta_1}\right) - \sum_{l=1}^{j-k} \frac{u_{k+l-1}^{\,l}}{l!}\,a_j(k+l), \quad \text{and}$$

$$a_j(j) = 0, \qquad 1 \le i \le n, \quad i < j \le n+1, \quad k = 1, ..., j-1.$$

Under either H0 or H1, we have

$$P(S_{n,R} \le b) = \sum_{i=1}^{n}\sum_{j=i+1}^{n+1} c_{ij}\, a_j(i).$$

Comparing Theorems 1.3.1 and 1.3.2, we can see that the truncation imposed on

P(i) requires much more complicated computation than the truncation imposed on i.

The complexity of the formula in Theorem 1.3.2 is O(n^3) (or more precisely O(n^3/6)).

Next, the following theorem provides the exact calculation under the most general R

defined in (1.1), where truncation is for both the index and the p-values.

Theorem 1.3.3. Consider any gGOF statistic in (1.1) with R = {α0 ≤ P(i) ≤

α1} ∩ {k0 ≤ i ≤ k1} for given 1 ≤ k0 ≤ k1 ≤ n and 0 ≤ α0 < α1 ≤ 1. Follow

notations (N1) and (N2) and those in Theorem 1.3.2. Define

$$\tilde{i} = i \vee k_0, \qquad \tilde{j} = j \wedge (k_1 + 1), \qquad \tilde{\beta}_0 = \beta_0\, I_{\{i < k_0\}},$$

$$a_j(\tilde{j}) = 0, \qquad 1 \le i \le k_1, \quad \tilde{i} < j \le n+1, \quad k = 1, ..., \tilde{j}-1.$$

Under either H0 or H1, we have

$$P(S_{n,R} \le b) = \sum_{i=1}^{k_1}\sum_{j=\tilde{i}+1}^{n+1} c_{ij}\left[\frac{n!\,(\beta_1 - \tilde{\beta}_0)^{\,j-i}}{(j-i)!}\,\bar{F}_{B(\tilde{j}-i,\; j-\tilde{j}+1)}\!\left(\frac{u_{\tilde{j}-1} - \tilde{\beta}_0}{\beta_1 - \tilde{\beta}_0}\right) - \sum_{k=\tilde{i}}^{\tilde{j}-1} \frac{(u_k - \tilde{\beta}_0)^{\,k-i+1}}{(k-i+1)!}\, a_j(k+1)\right].$$

The complexity of the formula in Theorem 1.3.3 is O(nk1^2); adding truncation on the index i actually simplifies the computation compared with Theorem 1.3.2. As discussed above, too small p-values under H0 are a major concern for the performance of some gGOF statistics (e.g., causing the long-tail problem for HC). Thus, the truncation on p-values could be on the lower bound α0 only, which can also significantly reduce the computational complexity. Corollary 1 below addresses such a special case of Theorem 1.3.3 with α1 = 1, where the formula complexity reduces to O(k1^2).

Corollary 1. Consider any gGOF statistic in (1.1) with R = {α0 ≤ P(i)} ∩ {k0 ≤

i ≤ k1} for given 1 ≤ k0 ≤ k1 ≤ n and α0 > 0. Follow notations (N1) and (N2) and

those in Theorems 1.3.1 and 1.3.2. Define c_i = β0^(i−1)/(i−1)!, 1 ≤ i ≤ k1. Under either H0 or H1, we have

$$P(S_{n,R} \le b) = \sum_{i=1}^{k_1} c_i \left[\frac{n!\,(1-\tilde{\beta}_0)^{\,n+1-i}}{(n+1-i)!}\,\bar{F}_{B(k_1+1-i,\,m)}\!\left(\frac{u_{k_1}-\tilde{\beta}_0}{1-\tilde{\beta}_0}\right) - \sum_{k=\tilde{i}}^{k_1-1} \frac{(u_k-\tilde{\beta}_0)^{\,k+1-i}}{(k+1-i)!}\,a_{k+1}\right].$$

A special case of Corollary 1 is the modified HC in (1.8) with R = {1 < i ≤ n/2, P(i) ≥ 1/n} [4, 13]. As shown in Figure 1.2, the LS approximation [13] is good only for the right tail of the distribution under H0. Corollary 1 gives the exact distribution

under both H0 and H1.

Obviously, Theorem 1.3.3 addresses the most general truncation and covers other

theorems and corollary. Based on this general formula, the formula in Theorem 1.3.1

is obtained by fixing i = 1, j = n + 1, α0 = 0, and α1 = 1. It covers the formula of

Theorem 1.3.2 by letting k0 = 1, k1 = n, and also covers the formula of Corollary 1

by fixing j = n + 1 and α1 = 1. However, we still separate these formulae and their Chapter 1: gGOF under Independence 19

implementations in order to simplify the computation whenever possible.
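Any implementation of these exact formulas can be sanity-checked against a brute-force Monte Carlo estimate of the null right-tail probability; the sketch below performs only such a check (it is not the recursion of Theorems 1.3.1–1.3.3), with illustrative parameter choices of my own.

```python
import numpy as np

def hc2004_stat(pvals, k0=1, k1=None):
    """HC2004 statistic (1.8) with R = {k0 <= i <= k1}."""
    p = np.sort(pvals)
    n = p.size
    k1 = n // 2 if k1 is None else k1
    i = np.arange(1, n + 1)
    keep = (i >= k0) & (i <= k1)
    return np.max(np.sqrt(n) * (i[keep] / n - p[keep]) / np.sqrt(p[keep] * (1 - p[keep])))

def mc_right_tail(b, n=100, reps=20000, seed=0):
    """Monte Carlo estimate of P(HC >= b) under H0: p-values i.i.d. Uniform(0, 1)."""
    rng = np.random.default_rng(seed)
    stats_null = np.array([hc2004_stat(rng.uniform(size=n)) for _ in range(reps)])
    return np.mean(stats_null >= b)

# The estimate can then be compared with the exact value from Theorem 1.3.1
print(mc_right_tail(b=3.5, n=100))
```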

1.3.3 Approximation of gGOF Distributions

In this section we study approximation methods for the distributions of gGOF

statistics based on appropriate asymptotics that holds good accuracy under small or

moderate n. The purpose is to 1) further simplify computation, and 2) reveal more

insights to understand the gGOF performance. Two strategies are considered. First,

we follow the basic idea of the exact calculation described above, except applying

distribution approximation. This strategy maintains the generality of the results and

provides the inspiring technique of the gamma approximation for the second strategy to produce non-iterative one-step formulae. The cost for such further simplified

formulae is the requirement of stronger assumptions. Here we study the calculation

based on the linearity property of the D function in (1.14). Related to that, we give a sufficient condition for LS style asymptotics [13] to be workable under the general gGOF family and general hypotheses. Moreover, the gamma approximation greatly simplifies the proof (compared with the beta approximation in the original LS paper) and could potentially improve the accuracy under certain circumstances (e.g., small n). For the simplicity of presentation, the theorems below focus on the case of

R = {k0 ≤ i ≤ k1}. The results can be extended to general R with truncation on

P(i).

First, following the idea of the exact calculation, Theorem 1.3.4 below gives a

formula based on the approximation by the joint gamma distribution.

Theorem 1.3.4. Consider any gGOF statistic in (1.1) with R = {k0 ≤ i ≤ k1}. Chapter 1: gGOF under Independence 20

Follow the notations (N1) and (N3), and define

$$d_k = (n+1)\,D\!\left(g\!\left(\tfrac{k}{n}, b\right)\right), \quad k = k_0, ..., k_1,$$

$$c_k = \bar{F}_{\Gamma(k)}(d_{k_1}) - \sum_{j=1}^{k-1} \frac{d_{k_1-k+j}^{\,j}}{j!}\, c_{k-j}, \quad k = 2, ..., k_1, \quad \text{and} \quad c_1 = \bar{F}_{\Gamma(1)}(d_{k_1}).$$

Under either H0 or H1, we have

$$P(S_{n,R} \le b) = (1+o(1))\left(\bar{F}_{\Gamma(k_1)}(d_{k_1}) - \sum_{k=k_0}^{k_1-1} \frac{d_k^{\,k}}{k!}\, c_{k_1-k}\right).$$

putation and accuracy. However, it evidences that gamma approximation is a good

choice under general settings of gGOF statistics and hypotheses, since the formula

is pretty accurate under finite n (see Section 1.4 for numeric results). This result

inspired us to apply gamma approximation for distribution calculation with further

simplified formula.

Under stronger assumptions, in particular if D(g(k/n, b)) is a linear or near-linear function of k, we can provide a one-step formula for the distribution calculation.

Starting with the exact linear case, Theorem 1.3.5 gives such a one-step formula that

guarantees the same accuracy as Theorem 1.3.4 due to the same gamma approxima-

tion.

Theorem 1.3.5. Consider a gGOF statistic in (1.1) with R = {1 ≤ i ≤ k1} and

D(g(k/n, b)) = a + λk, for some λ ≥ 0. Following notations (N3) and (N4), under either H0 or H1, we have

$$P(S_{n,R} \le b) = (1+o(1))\, e^{-a}\left(1 - \lambda + h_{k_1}(\lambda)\right).$$

One example that satisfies the linearity of D(g(k/n, b)) is the simple Kolmogorov-Smirnov (KS+) statistic in (1.10) under H0, where a = −(n+1)b and λ = (n+1)/n. The following corollary summarizes this case.

Corollary 2. Consider the test statistic KS+ in (1.10) with R = {1 ≤ i ≤ k1}.

Following notations (N3) and (N4), for b ≤ 1/n, we have that under H0,

$$P(KS^+ \le b) = (1+o(1))\, e^{(n+1)b}\left(-\frac{1}{n} + h_{k_1}\!\left(\frac{n+1}{n}\right)\right).$$

In general, the requirement of a linear D(g(k/n, b)) is often too stringent. However, if D(g(k/n, b)) is close to linear, we can still simplify the calculation of Theorem 1.3.4. In particular, Theorem 1.3.6 below provides a sufficient condition on D(g(k/n, b)), under which LS style asymptotics [13] can be extended to the gGOF family under general hypotheses. Again, we apply the gamma approximation (rather than the beta approximation used in the original LS paper), which has a simpler density function for easier generalization to the gGOF family (note that the LS paper mainly addresses the HC and B-J type

statistics). See the supplemental proof [35] for details.

Theorem 1.3.6. Consider any gGOF statistic in (1.1) with R = {k0 ≤ i ≤

k1}. Follow notations (N1)–(N5), and define d_k = (n+1)D(g(k/n, b)), d'_k = (n+1) (d/dx) D(g(x, b))|_{x=k/n}, and k* = min{k1 − k, √n}. Assume D(g(x, b)) satisfies

1. D(g(x, b)) < 1 is increasing and convex in x for k0/n ≤ x ≤ k1/n,

2. (d/dx) D(g(x, b)) < 1, and

3. D(g(k/n, b)) < k/(n+1), for k > 1 and large n.

Under either H0 or H1 in (1.3), we have

$$P(S_{n,R} \ge b) = (1+o(1)) \sum_{k=k_0}^{k_1}\left(1 - \frac{d_k'}{n} + h_{k^*}\!\left(\frac{d_k'}{n}\right)\right) f_{P(d_k)}(k).$$

This sufficient condition on D(g(k/n, b)) can be partially satisfied by HC2004 under H0, for which D(g(x, b)) = g(x, b) is given in (1.16). The result is formally stated in Corollary 3 below, which basically says that the condition is satisfied on the right tail when b is of the order O(√n).

Corollary 3. Consider the HC2004 statistic in (1.11) with R = {k0 ≤ i ≤ k1}. Let b0 = b/√n be a positive constant > 2x − 1, k0/n < x < k1/n. Define

$$g(x, b_0) = \frac{1}{1+b_0^2}\left[x + \frac{b_0^2 - b_0\sqrt{b_0^2 + 4x(1-x)}}{2}\right],$$

$$g'(x, b_0) = \frac{1}{1+b_0^2}\left[1 - \frac{b_0(1-2x)}{\sqrt{b_0^2 + 4x(1-x)}}\right].$$

Following the notation (N2), under H0, we have

$$P(HC^{2004} \ge b) = (1+o(1)) \sum_{k=k_0}^{k_1}\left(1 - g'\!\left(\tfrac{k}{n}, b_0\right) + h_{k^*}\!\left(g'\!\left(\tfrac{k}{n}, b_0\right)\right)\right) f_{P\left(g\left(\frac{k}{n}, b_0\right)\, n\right)}(k).$$

The formula of Corollary 3 is different from that given in Li and Siegmund [13]. However, both formulae require the threshold b = O(√n). Thus, in theory both

methods do not get the whole distribution. However, as shown in Figure 1.3, our

formula based on gamma approximation could be closer to the whole distribution

under small n. Meanwhile, the accuracy also depends on the linear approximation

of the D(g(k/n, b)) function, which could hardly hold under general H1. Thus this type of calculation has a natural limitation for being utilized to calculate statistical

power. Chapter 1: gGOF under Independence 23

1.4 Numerical Results

In this section we first evidence the accuracy of our methods by comparing the

calculations with the Monte-Carlo simulations under various settings of H0 and H1.

Then, based on calculation we compare the finite-n performance of the asymptotically

optimal tests over various signal patterns. Unless specified otherwise, results reported

below were based on truncation domain R = {1 ≤ i ≤ n/2} and the number of

simulations was set at 5,000.

1.4.1 Calculation Accuracy for gGOF Distributions

Our calculation methods can handle general hypothesis setting in (1.3) with in-

put statistics of arbitrary continuous distributions. In this section we evaluate how

accurate our calculation methods are for constructing the distribution curves of HC

statistic, as an example of gGOF, under various H0 and H1.

First we calculate the null distribution of HC statistic under general H0 in (1.2).

Figure 1.3 shows the right-tail probability of HC statistic over varying threshold b.

Comparing with simulation (black solid curves), the exact calculation by Theorem

1.3.1 (cyan dashed curves) has a perfect match. The approximation by Theorem

1.3.4 is fairly accurate over the whole distribution too. The one-step formulae of Li and Siegmund [13] (blue dotted curves) and of Corollary 3 (green dashed curves) can provide good approximation for the right tail, and thus can be used for calculating small p-values at large thresholds. Li and Siegmund’s formula has a limitation for the left tail of the distribution; the formula of Corollary 3 provides a correction of a sort, which is preferred at small n but is more conservative at large n.

Figure 1.3: Comparison among different calculations for the null distribution of HC.

Simulation: curve obtained by simulations; Exact: by Theorem 1.3.1; Approximate: by Theorem 1.3.4; Li&Siegmund: by [13]; Corollary 3: by Corollary 3.

Now we assess the accuracy of calculating the alternative distribution of HC statis-

tic. Assume the input statistics were from either of the following mixture models:

$$H_1: T_i \overset{\text{i.i.d.}}{\sim} (1-\epsilon)\,N(0,1) + \epsilon\,N(1,1), \quad \text{or} \quad H_1: T_i \overset{\text{i.i.d.}}{\sim} (1-\epsilon)\,N(0,1) + \epsilon\, t_\nu,$$

while the input p-values for gGOF were obtained by Pi = 1 − Φ(Ti) (i.e., under H0 : Ti ~ i.i.d. N(0, 1)). These two alternatives can be roughly interpreted as follows: an ε proportion of “signals” have either different means (i.e., N(1, 1)) or different variances (i.e., the Student’s t with ν degrees of freedom) when compared with the “noises” (i.e., N(0, 1)). Accordingly, Figure 1.4 demonstrates the right-tail probability of the HC statistic (row 1: µ = 1, ε = 0.1; row 2: ν = 5, ε = 0.5). In both cases the exact

calculation (Theorem 1.3.1, cyan dashed curves) is perfect, and the approximation

(Theorem 1.3.4, red dot-dashed) is close to simulation (black solid curves) with its

accuracy increasing together with n.
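A curve like those in Figure 1.4 can be reproduced by direct simulation under the first mixture alternative; the following sketch is one such simulation with illustrative sample sizes and parameters (function names are mine).

```python
import numpy as np
from scipy import stats

def hc2004_stat(pvals, k1=None):
    # HC2004 with R = {1 <= i <= k1}
    p = np.sort(pvals)
    n = p.size
    k1 = n // 2 if k1 is None else k1
    i = np.arange(1, k1 + 1)
    return np.max(np.sqrt(n) * (i / n - p[:k1]) / np.sqrt(p[:k1] * (1 - p[:k1])))

def simulate_alt_tail(b_grid, n=100, eps=0.1, mu=1.0, reps=5000, seed=0):
    """Empirical P(HC >= b) when T_i ~ (1 - eps) N(0, 1) + eps N(mu, 1), P_i = 1 - Phi(T_i)."""
    rng = np.random.default_rng(seed)
    vals = np.empty(reps)
    for r in range(reps):
        signal = rng.random(n) < eps
        T = rng.normal(loc=np.where(signal, mu, 0.0))
        vals[r] = hc2004_stat(1 - stats.norm.cdf(T))
    return np.array([np.mean(vals >= b) for b in b_grid])

print(simulate_alt_tail(b_grid=[2.0, 3.0, 4.0]))
```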

Besides the normal distributions, we also assessed four non-normal settings studied

in the initial paper of HC [4]. The first setting regards a chi-squared model:

$$H_0: T_i \overset{\text{i.i.d.}}{\sim} \chi^2_\nu(0), \quad \text{vs.} \quad H_1: T_i \overset{\text{i.i.d.}}{\sim} (1-\epsilon)\,\chi^2_\nu(0) + \epsilon\,\chi^2_\nu(\delta),$$

where ν is the degree of freedom, δ is the non-centrality parameter. The second

setting is a Student’s t mixture model:

$$H_0: T_i \overset{\text{i.i.d.}}{\sim} t_\nu(0), \quad \text{vs.} \quad H_1: T_i \overset{\text{i.i.d.}}{\sim} (1-\epsilon)\,t_\nu(0) + \epsilon\, t_\nu(\delta).$$

The third setting is a chi-squared-exponential mixture model:

$$H_0: T_i \overset{\text{i.i.d.}}{\sim} \exp(\nu), \quad \text{vs.} \quad H_1: T_i \overset{\text{i.i.d.}}{\sim} (1-\epsilon)\,\exp(\nu) + \epsilon\,\chi^2_\nu(\delta).$$

Figure 1.4: The alternative distribution of the HC statistic under H0 : Ti ~ i.i.d. N(0, 1) vs. H1 : Ti ∼ 0.9N(0, 1) + 0.1N(1, 1) (row 1), or H1 : Ti ∼ 0.5N(0, 1) + 0.5t5 (row 2).

Column 1: n = 10; column 2: n = 100. Simulation: curve obtained by simulations;

Exact: by Theorem 1.3.1; Approximate: by Theorem 1.3.4. Chapter 1: gGOF under Independence 27

The fourth setting concerns a generalized normal distribution (also known as power exponential distribution) model:

$$H_0: T_i \overset{\text{i.i.d.}}{\sim} GN_p(0, \sigma), \quad \text{vs.} \quad H_1: T_i \overset{\text{i.i.d.}}{\sim} (1-\epsilon)\,GN_p(0, \sigma) + \epsilon\, GN_p(\mu, \sigma),$$

where the probability density function of GNp(µ, σ) is

$$\frac{1}{C_p}\exp\!\left(-\frac{|x-\mu|^p}{p\,\sigma^p}\right), \qquad C_p = 2\,p^{1/p}\,\Gamma(1+1/p)\,\sigma.$$

Notice that GN1(µ, σ) is the Laplace distribution and GN2(µ, σ) is N(µ, σ²). Each row of Figure 1.5 illustrates the alternative distribution of HC under each of the four settings for n = 10 (left column) and 100 (right column). Again, the exact calculation is perfect, and the approximation is fairly accurate especially when n is large.

For the one-step calculation formula given by Theorem 1.3.5, the boundary is

assumed linear: D(g(i/n, b)) = a + λk ≥ 0 in (1.17). One example is the KS+ statistic in (1.10) under H0. Figure 1.6 demonstrates the accuracy of the calculation based on either fixed slope λ = 0.5 or fixed intercept a = 0.5. Here k0 = 1, k1 = n = 50.

It shows that this gamma-approximation based one-step formula performs well if the

linearity assumption of D(g(i/n, b)) is satisfied. As the boundary a + λk increases, the probabilities from both calculation and simulation decrease as expected.

1.4.2 Comparison of Asymptotically Optimal Tests Under

Finite n

As discussed in Section 1.2, the asymptotically optimal methods for weak-sparse signals possess the same asymptotic property. It is of interest to know the performance of those statistics under finite n. Here we focus on the φ-divergence statistics defined in

Figure 1.5: The alternative distributions of the HC statistic under four non-normal settings for H0 and H1. Column 1: n = 10; column 2: n = 100. Simulation: curve obtained by simulations; Exact: by Theorem 1.3.1; Approximate: by Theorem 1.3.4.

Figure 1.6: Probability in (1.17) with a hypothetical linear boundary function

D(g(i/n, b)) = a + λk. Simulation: probability obtained by simulations; Approximation: by Theorem 1.3.5. Left panel: fix λ = 0.5 and vary a; right panel: fix a = 0.5 and vary λ.

(1.13), which is asymptotically optimal for any statistic-defining parameter s ∈ [−1, 2]

[9]. As discussed in Section 1.2, the values of s = 2, 1, 0, −1 correspond to HC2004,

the Berk-Jones statistic, the reverse Berk-Jones statistic, and HC2008, respectively.

These s values represent a spectrum of gGOF statistics of different performances.

First, we show the accuracy of p-value calculations in a similar manner as [13].

Specifically, for each gGOF statistic the thresholds at the significance levels of 10%,

5% and 1% were obtained through calculation (by Theorem 1.3.1). Then at these

thresholds the empirical type I error rates were acquired through simulations (10,000

repetitions). As shown in Table 1.1, the close match of the given significance levels

and the obtained empirical type I error rates evidences that the calculations for

the p-values of these statistics are accurate. Not surprisingly, the accuracy by the Chapter 1: gGOF under Independence 30

Table 1.1: Empirical type I error rates at the calculated thresholds for the significance

levels of 10%, 5% and 1%. HC2004: s = 2; B-J: s = 1, reverse B-J: s = 0, and HC2008: s = −1.

s    n     10% Threshold   10% Emp. Err.   5% Threshold   5% Emp. Err.   1% Threshold   1% Emp. Err.
2    10    3.357           0.992           4.648          0.049          10.088         0.010
2    50    3.507           0.102           4.714          0.050          10.102         0.011
2    100   3.539           0.103           4.723          0.049          10.102         0.009
1    10    2.181           0.101           2.504          0.050          3.110          0.011
1    50    2.408           0.098           2.716          0.048          3.300          0.010
1    100   2.478           0.104           2.780          0.049          3.354          0.009
0    10    1.750           0.100           1.974          0.049          2.390          0.011
0    50    2.040           0.101           2.301          0.047          2.803          0.011
0    100   2.136           0.101           2.402          0.051          2.915          0.010
-1   10    1.618           0.098           1.838          0.051          2.227          0.009
-1   50    1.909           0.099           2.165          0.049          2.662          0.009
-1   100   2.010           0.107           2.271          0.052          2.777          0.010

approximated calculation of [13] requires relatively large n, whereas the calculation

by Theorem 1.3.1 is exact and shall be perfectly accurate at any n.

Now through power calculation (again by Theorem 1.3.1), we can systematically

compare the power of any gGOF statistics. To be consistent with literature, here

we focus on the classic normal mixture model in (1.6). With the type I error rate

controlled at 5%, Figure 1.7 provides the statistical power of HC2004, B-J, reverse B-

J, and HC2008 at various signal patterns represented by the parameters (n, µ, ε). There

are a few interesting observations. First, it seems that at finite n the average number nε of signals is more relevant than the proportion ε of the signals. To see this

point, note that columns 1 – 3 of the figure panels correspond to fixed signal numbers

nε = 5, 25, 50, respectively; each column demonstrates the same pattern of comparative performance among these four statistics. Meanwhile, the diagonal of the figure panel matrix corresponds to a fixed signal proportion ε = 0.05, but the comparative performances of the four statistics changed significantly over increased n at different rows. Similar observations can be seen at fixed ε = 0.01 or 0.005 but different n. Second, considering signal sparsity/density in terms of signal numbers, within the φ-divergence family, bigger s values are related to better performance for sparser signals (HC2004: s = 2; B-J: s = 1), whereas smaller s values are related to better performance for denser signals (reverse B-J: s = 0, and HC2008: s = −1). This is evidenced by the columns of the figure: HC2004 performs the best in the first column, and HC2008 performs the best in the third column. One possible reason for HC2008 being less powerful for sparse signals is that its statistic weights the expectation-observation difference, i.e., i/n − P(i), by i/n rather than by P(i) in the denominator, and therefore it is less sensitive to small p-values than HC2004 (see their formulas in (1.11)). Thirdly, with s = 1 in the middle of the parameter space

[−1, 2] of the optimality, B-J has a more robust performance over various µ, n, and ε.

This robustness of B-J’s is consistent with the finding of Li and Siegmund [13].

It is also of interest to compare the performance of these optimal methods along the asymptotic detection boundary given in (1.7) under finite n. As discussed in Section 1.2, when the signal-representing parameters (α, r) are below the curve, signals are too weak to be reliably detectable by any statistics. Whenever these parameters are above the curve, all of these four optimal tests are asymptotically powerful as n → ∞. Thus, areas right above the detection boundary are the challenging scenario for optimal methods to be prominent, since sub-optimal tests will have asymptotically zero power there. Figure 1.8 shows the statistical power of the four optimal methods over the sparsity parameter α ∈ (1/2, 1); the r value is calculated according to equation (1.7).

Figure 1.7: Comparison of statistical power. HC2004: s = 2; B-J: s = 1; reverse B-J: s = 0; and HC2008: s = −1. Rows 1 – 4: n = 100, 500, 1000, 5000. Columns 1 – 3: nε = 5, 25, 50. Type I error rate: 5%.

Figure 1.8: Statistical power along the ARW detection boundary (at type I error rate 5%).

It shows that the statistical power of these methods is in fact significantly different even for very large n. Consistent with Figure 1.7, HC2008 and reverse B-J have similar power curves; they are more powerful for denser signals (at smaller α corresponding to bigger ε = n^(−α)). HC2004 is more powerful for very sparse signals (at larger α). B-J again shows a more robust performance over all α values.
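For reference, the boundary (1.7) and the implied ARW signal parameters behind figures such as Figure 1.8 can be computed with a few lines of code; the function names below are mine and the printed values are only illustrative.

```python
import numpy as np

def rho_star(alpha):
    """Detection boundary (1.7): r = rho*(alpha) for 1/2 < alpha < 1."""
    alpha = np.asarray(alpha, dtype=float)
    return np.where(alpha <= 0.75, alpha - 0.5, (1 - np.sqrt(1 - alpha))**2)

def arw_signal(n, alpha):
    """ARW parameterization: eps_n = n^(-alpha), mu_n = sqrt(2 r log n) with r on the boundary."""
    r = rho_star(alpha)
    return n**(-alpha), np.sqrt(2 * r * np.log(n))

alphas = np.array([0.55, 0.65, 0.75, 0.85, 0.95])
eps_n, mu_n = arw_signal(n=5000, alpha=alphas)
print(np.round(rho_star(alphas), 3), np.round(eps_n * 5000, 1), np.round(mu_n, 2))
```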

Last but not least, the truncation domain R in (1.1) is very important to the performance of test statistics. In particular, as discussed in the Introduction, the truncation based on p-values P(i) could have extra benefit over the truncation based on the index i only [4, 13]. Here we compare HC2004 under R = {1 ≤ i ≤ n/2} with the modified HC (MHC) under R = {1 < i ≤ n/2, P(i) ≥ 1/n}. Figure 1.9 shows that MHC performs poorly when the number of signals is small, whereas it improves the performance when the number of signals increases. One reason is that 1/n is fairly large at finite n. By excluding p-values less than 1/n, MHC could easily miss those signal-representing p-values, especially when there are just a few strong true signals with large µ. However, when signals are dense MHC is more powerful, because (A) with high chance some signals (especially the weaker ones) will have p-values larger than 1/n, and (B) removing p-values less than 1/n corrects the long-tail problem of HC [4]. Thus, in practice when n is not too big, the original HC is still a better choice for relatively sparser and stronger signals, whereas MHC is better for denser and weaker signals.

Figure 1.9: Power comparison for the HC statistic with R = {1 ≤ i ≤ n/2} and the MHC statistic with R = {1 < i ≤ n/2, P(i) ≥ 1/n}. Type I error rate: 5%.

1.5 A Framework for GWAS And Application to

Crohn’s Disease Study

According to the genetics of complex diseases, disease-associated markers usually have moderate to small genetic effects [36]. In genome-wide studies that tend to screen as many markers as possible, the number of true disease markers often accounts for a small proportion of the total candidate markers. Therefore, it is appealing to apply optimal tests for weak-sparse signals to detect weak genetic effects. In this section, we provide a general framework for applying the gGOF tests to SNP-set association studies in GWAS data analysis. The input p-values are obtained based on generalized linear models (GLMs) so that the framework can handle both quantitative and categorical traits. Here we focus on gene-based tests: each gene is tested separately; input p-values from the group of SNPs within that gene form a gGOF statistic, and then the summary p-value of this statistic is obtained to measure

how significantly the gene is associated. Certainly, a similar idea can be straightforwardly

extended to SNP-set tests based on other meaningful segments of loci (e.g., pathway-

based association studies [37]).

Specifically, assume a gene contains n SNPs. With an appropriate link function, a GLM can be defined as

$$\text{link}\big(E(Y_k \mid X_k, Z_k)\big) = X_k'\beta + Z_k'\gamma, \tag{1.18}$$

where for the kth subject, k = 1, ..., N, Yk denotes the trait value (quantitative or

categorical), Xk = (Xk1, ..., Xkn) denotes the genotype vector of the n SNPs in the

gene. Zk = (Zk1, ..., Zkm) denotes a vector of m controlling variables of environmental

and/or other independent genetic factors. The null hypothesis is that none of the

SNPs are associated with the trait, and therefore the gene is not associated:

H0 : βi = 0, i = 1, ..., n.

Many statistics can be used to test this null hypothesis while controlling the effects of

other factors represented by Zk. One classic example is a marginal test with statistics

given by [38, 39]

$$M_i = \sum_{k=1}^{N} X_{ki}\,(Y_k - \tilde{Y}_k), \quad i = 1, ..., n,$$

where Ỹk is the fitted outcome value (e.g., by least squares or iteratively reweighted

least squares) under H0. It can be shown that under H0 the vector of the marginal

statistics M = (M1, ..., Mn) converges in distribution to N(0, Σ) as N → ∞. The covariance matrix Σ can be

estimated by

$$\hat{\Sigma} = X'WX - X'WZ\,(Z'WZ)^{-1}Z'WX,$$

where matrices X = (Xki), Z = (Zki), and W is the covariance matrix of Y . In the

case of the multiple regression model for quantitative traits, W = σ̂²I, where σ̂² is the

least squares estimate of the residual variance. In the case of the logistic regression model for binary traits, W = diag{Ỹk(1 − Ỹk), k = 1, ..., N}.

We can de-correlate M to obtain the input statistics for gGOF:

$$(T_1, ..., T_n) = \hat{\Sigma}^{-1/2} M \xrightarrow{D} N(0, I_{n\times n}),$$

and thus the input p-values are Pi = 2(1 − Φ(|Ti|)), which are asymptotically i.i.d. Uniform[0, 1]. Then for any

gGOF statistic, its p-value can be calculated by the methods given in this chapter for measuring how significantly the gene is associated with the phenotype trait. It should be noted that the input statistics are not required to follow a normal distribution; the calculation methods only require that the input p-values are iid Uniform[0, 1] under the null. That is, other input statistics following t or chi-squared distributions can be used as long as they are not correlated or can be de-correlated.
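To illustrate the framework end to end, the sketch below covers the quantitative-trait case (multiple linear regression, so W = σ̂²I): it computes the marginal statistics M, their null covariance, the de-correlated statistics, and the two-sided input p-values. The toy data, function name, and parameter choices are mine, and a real analysis would require the usual genotype quality control.

```python
import numpy as np
from scipy import stats

def gene_input_pvalues(X, Z, Y):
    """Marginal statistics M = X'(Y - Y_fit) under H0, decorrelated to T ~ N(0, I),
    returning two-sided input p-values for a gGOF statistic (quantitative-trait case)."""
    N = len(Y)
    # Fit the null model Y ~ Z by least squares
    beta0, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    resid = Y - Z @ beta0
    sigma2 = np.sum(resid**2) / (N - Z.shape[1])
    M = X.T @ resid
    # Covariance of M under H0: sigma^2 * (X'X - X'Z (Z'Z)^{-1} Z'X)
    Sigma = sigma2 * (X.T @ X - X.T @ Z @ np.linalg.solve(Z.T @ Z, Z.T @ X))
    # Decorrelate via the inverse symmetric square root of Sigma
    w, V = np.linalg.eigh(Sigma)
    Sigma_inv_half = V @ np.diag(1 / np.sqrt(w)) @ V.T
    T = Sigma_inv_half @ M
    return 2 * (1 - stats.norm.cdf(np.abs(T)))

# Toy example: N subjects, a gene with 8 SNPs, intercept plus 2 covariates
rng = np.random.default_rng(2)
N = 500
X = rng.binomial(2, 0.3, size=(N, 8)).astype(float)          # additively coded genotypes
Z = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])   # intercept + covariates
Y = Z @ np.array([1.0, 0.5, -0.2]) + rng.normal(size=N)      # trait generated under H0
print(np.round(gene_input_pvalues(X, Z, Y), 3))
```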

We applied the gene-based analysis framework to GWAS data of Crohn’s disease from NIDDK-IBDGC (National Institute of Diabetes, Digestive and Kidney Diseases - Inflammatory Bowel Disease Genetics Consortium). It contains 1,145 individuals from a non-Jewish population (572 Crohn’s disease cases and 573 controls) [40]. After typical quality control of the genotype data, 308,330 somatic SNPs were grouped into 15,857 genes according to their physical locations. As a special case of the GLM in (1.18), the logistic regression model was applied to search for genes associated with Crohn’s disease susceptibility. The controlling covariate Zk = (1, Zk1, Zk2) contains an intercept and the first two principal components of the genotype data, which serve the purpose of controlling potential population structure [41]. In case a gene contains only one SNP, no gGOF test was needed.

We examined four gGOF statistics: HC2004, B-J, reverse B-J, and HC2008. Figure 1.1 gives the QQ plots of the gene-based p-values calculated by Theorem 1.3.1. The genomic inflation factors (i.e., the ratios of the empirical median of -log(p-values) vs. the expected median under H0 [42]) are all close to 1, indicating that the genome-wide type I errors were well controlled. Among the four statistics, B-J seemed to have higher power because it yielded more genes significantly above the red diagonal line of the H0-expected p-values. Among the top ranked genes, many are relevant

to Crohn’s disease. In particular, IL23R and CARD15 (also known as NOD2) are well-known Crohn’s disease genes [43, 44, 40]. Gene NPTX2 was top ranked by both

HC2004 and B-J. It hasn’t been reported previously through association studies, but

could be a putative disease gene because it encodes a neuronal pentraxin, which is

related to C-reactive protein [45], an indicator for Crohn’s disease activity level [46].

Furthermore, NPTX2 has an important paralog gene APCS (www..org),

which is related to arthritis, a disease highly correlated with Crohn’s disease [47].

Gene SLC44A4 is also related to the pathophysiology of Crohn’s disease. Defects in this gene can cause sialidosis [45], a lysosomal storage disease due to a deficiency of sialidase, an enzyme important for various cells to defend against infection [48].

Gene BMP2 was identified by B-J, reverse B-J, and HC2008. This gene could also

be relevant because it is associated with digestive phenotypes, especially colon cancer

[49, 50]. Certainly, further studies are needed to validate those top ranked genes.

1.6 Discussion

This chapter provided techniques to calculate the exact and approximated null

and alternative distributions of a generic gGOF statistic family. It gave a foundation

for applying gGOF statistics in real data analysis, and for studying and comparing

important statistics such as the asymptotically optimal ones in the finite n case.

A few future studies will be carried out. First, to calculate the exact distribution,

the result in Theorem 1.3.1 brings down the computational complexity to O((k1 −

k0)^2). Meanwhile, when k1 − k0 is large, the calculation could suffer from the loss of

significant digits. The current practice is to truncate the summation to the first 25 - Chapter 1: gGOF under Independence 39

30 terms, which yields a fairly accurate result and saves computation time. This issue could be further addressed by improving numerical techniques. Second, we will keep looking for better one-step approximations, especially for power calculation. Third, in real data analysis input statistics are often correlated. It would be desirable to incorporate such correlation into the calculation of p-values and statistical power. For that, we will report the results we have obtained in a separate paper.

Chapter 2

digGOF: Double-Omnibus

Innovated Goodness-Of-Fit Tests

For Dependent Data Analysis


2.1 Introduction

With a long history and numerous applications the goodness-of-fit (GOF) test is

one of the breakthroughs in statistics [51, 52]. In particular, the GOF test provides a promising tool for signal detection problems in analyzing big data. A collection of GOF statistics, such as the Higher Criticism (HC) type statistics, Berk-Jones (BJ) type statistics, and φ-divergence statistics, have been proven asymptotically optimal for weak-and-rare signals [4, 9, 6]. These GOF statistics can be unified into a

general family called gGOF defined by a generic functional and a general truncation

scheme of input p-values P1, ..., Pn [53]. Under the null hypothesis all p-values are

from Uniform(0, 1). Let P(1) ≤ ... ≤ P(n) be the ordered p-values. A gGOF statistic measures the supremum departure of P(i) from its null expectation, which is roughly

$i/n$:
$$S_{n,f,R} = \sup_{R} f\Big(\frac{i}{n},\, P_{(i)}\Big), \qquad (2.1)$$
where
$$R = \{i : k_0 \le i \le k_1\} \cap \{P_{(i)} : \alpha_0 \le P_{(i)} \le \alpha_1\}$$

represents an arbitrary truncation scheme for p-values based on their ranks for given

k0 ≤ k1 ∈ {1, ..., n}, and/or their magnitudes for given α0 ≤ α1 ∈ [0, 1]. For fixed

x = i/n the function f(x, y) is monotonically decreasing in y = P(i), so that the

smaller the input p-values, the larger the statistic and the stronger the evidence

against H0.
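For illustration, the following minimal Python sketch (not part of this dissertation's software) evaluates $S_{n,f,R}$ for two example functionals: the HC2004-type functional that appears later in (2.25), and a Berk–Jones-type functional written, as an assumption for illustration, in its Bernoulli Kullback–Leibler form. The function names and the truncation interface are ours.

```python
# A minimal sketch of evaluating a gGOF statistic S_{n,f,R} = sup_R f(i/n, P_(i)).
import numpy as np

def hc_f(x, p, n):
    """HC2004-type functional, as in (2.25): sqrt(n)(x - p)/sqrt(p(1 - p))."""
    return np.sqrt(n) * (x - p) / np.sqrt(p * (1.0 - p))

def bj_f(x, p, n):
    """One common form of the Berk-Jones functional: n*K(x, p) for p < x, where K is
    the Bernoulli Kullback-Leibler divergence (an illustrative assumption)."""
    K = x * np.log(x / p) + (1.0 - x) * np.log((1.0 - x) / (1.0 - p))
    return np.where(p < x, n * K, 0.0)

def ggof_stat(pvals, f, k0=1, k1=None, a0=0.0, a1=1.0):
    """sup over the truncation set R = {k0 <= i <= k1} and {a0 <= P_(i) <= a1}."""
    n = len(pvals)
    k1 = k1 or n
    ps = np.sort(pvals)
    i = np.arange(1, n + 1)
    keep = (i >= k0) & (i <= k1) & (ps >= a0) & (ps <= a1)
    return np.max(f(i[keep] / n, ps[keep], n))

rng = np.random.default_rng(1)
p = rng.uniform(size=100)                      # null p-values
print(ggof_stat(p, hc_f, k0=1, k1=50))         # HC with R = {1 <= i <= n/2}
print(ggof_stat(p, bj_f, k0=1, k1=50))
```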

Under the assumption that the input p-values are independent and identically

distributed (iid), the p-value and power calculations of the gGOF family have been well resolved. However, correlation is ubiquitous in real data analysis. A few problems are to be addressed for analyzing correlated data. First, to practically apply gGOF statistics, p-value calculation under arbitrary correlation is desired. Second, to study the power of gGOF for correlated data, it is important to understand how signal patterns and correlation structures influence the signal-to-noise ratio (SNR).

Proper transformations of correlated data have the potential to advance signal detection, while improper ones could be harmful. Third, since different test statistics in the gGOF family have relative advantages for certain signal patterns and data properties, it is ideal to fully utilize these family-retained advantages to provide a powerful and robust solution across various situations.

In recent years, individual modifications have been proposed based on specific statistics in gGOF for analyzing correlated data. In particular, GHC and GBJ [7, 54] were proposed based on original HC and BJ statistics in genetic association studies.

These developments were motivated from an interpretation perspective, for example, on how to incorporate data variation into the statistic under correlations [7]. However, a particular interpretation does not necessarily guarantee higher statistical power. On one hand, under finite n, Figure 2.1 shows that if the input p-values are properly transformed, GHC and GBJ can be similarly or less powerful than the original versions of

HC and BJ. On the other hand, under asymptotics for n → ∞, gGOF statistics have already reached optimality under both independent and dependent cases [55, 56].

Thus, as long as the computation issue is addressed, the gGOF family is a natural choice that contains numerous statistics and solutions for analyzing various data, whether independent or dependent.

Figure 2.1: Statistical power of HC, GHC, BJ, and GBJ, with and without innovated transformation (innov) of the input test statistics $(T_1, ..., T_{100}) \sim N(\mu, \Sigma)$. The nonzero elements (i.e., signals) in $\mu$ are distributed arbitrarily, with the same magnitude $A$. $\Sigma$ is an equal-correlation matrix with off-diagonal elements $\rho$. Left: $A = 2$, $\Sigma = \Sigma_1$ with $\rho = 0.3$; Right: $A = 0.3$, $\Sigma = \Sigma_2$ with $\rho = -0.0099$. $\Sigma_1$ and $\Sigma_2$ are mutually inverse matrices, i.e., $\Sigma_1 = \Sigma_2^{-1}$. Type I error rate 0.05; 10,000 simulations.

Rather than carrying out individual developments for specific test statistics, this work takes a higher-level perspective: it addresses the whole statistic family and automatically chooses the best statistic function f and truncation domain R for any given data. Indeed, a family of statistics defined by different f functions can possess high statistical power over a broader (if not the full) space of data properties and signal patterns. For example, the HC2004 statistic [4] is more powerful for very sparse signals, while the BJ and HC2008 statistics [5] are more powerful for denser

signals [53]. Moreover, the truncation R of the input p-values is also quite relevant. For example, the modified HC2004 with $R = \{1 < i \le n/2,\ P_{(i)} \ge 1/n\}$ improves the performance over the version without truncation [4, 13]. Thus, by allowing a general f and R, the gGOF family could retain all of these advantages to give high and robust power in analyzing various data.

In addition, a convenient feature of gGOF is that the best f and R can be selected automatically within the same calculation framework, without the need for simulation or approximation. Specifically, any gGOF statistic is a supremum of the monotone function f. Therefore, the distribution of any gGOF statistic is essentially a boundary-crossing probability of the ordered p-values:

$$P(S_{n,f,R} \le b) = P\Big(\sup_{R} f\Big(\frac{i}{n}, P_{(i)}\Big) \le b\Big) = P\big(P_{(k)} > u_k \text{ for all } k \text{ with } P_{(k)} \in R\big), \qquad (2.2)$$

where the boundaries are decided by f and the threshold b:

$$u_k = f^{-1}\Big(\frac{k}{n}, b\Big). \qquad (2.3)$$
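As a small illustration of (2.3), the sketch below numerically inverts the HC-type functional to obtain the boundaries $u_k$ for a given threshold b; the helper names are ours and are not from the dissertation's software.

```python
# A sketch of computing the boundaries u_k = f^{-1}(k/n, b) by inverting a monotone
# functional f; here f is the HC2004-type functional.
import numpy as np
from scipy.optimize import brentq

def hc_f(x, p, n):
    return np.sqrt(n) * (x - p) / np.sqrt(p * (1.0 - p))

def boundary(x, b, n, f=hc_f):
    """Solve f(x, u) = b for u in (0, x); f is assumed decreasing in its second argument.
    Returns 0 when the threshold b cannot be reached at this rank."""
    lo = 1e-16
    if f(x, lo, n) <= b:
        return 0.0
    return brentq(lambda u: f(x, u, n) - b, lo, x)

n, b = 100, 3.0
u = np.array([boundary(k / n, b, n) for k in range(1, n // 2 + 1)])
print(u[:5])                      # the first few boundaries
```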

This special property makes it convenient to analytically calculate the exact null

distribution for a double-adaptation to both f and R. Specifically, for a given gGOF

statistic Sf,R at a fixed n, let Gf,R be its survival function under the null hypothesis.

We define a double-adaptation omnibus statistic by the smallest p-value (i.e., the

strongest statistical evidence against the null) among all statistics indexed by various

f and R:

$$S_o = \inf_{f,R} G_{f,R}(S_{f,R}). \qquad (2.4)$$

Under the null, the survival function of So is

$$P(S_o > s_o) = P\big(S_{f,R} \le G_{f,R}^{-1}(s_o) \text{ for all } f, R\big) = P\big(P_{(1)} > u_1^\star, ..., P_{(n)} > u_n^\star\big), \qquad (2.5)$$
where, for each $k = 1, ..., n$,

$$u_k^\star = \sup_{f,R} u_{f,R,k} = f^{-1}\Big(\frac{k}{n}, G_{f,R}^{-1}(s_o)\Big). \qquad (2.6)$$

Now the calculations for the cross-boundary probabilities in (2.2) and (2.5) are similar.

We have provided efficient calculations for exact and approximate solutions in the independence case [53]. In this paper, to calculate p-values in the dependence case, we will first provide the exact calculation under equal correlation. Based on that, we approximate the calculation under arbitrary complex correlation. This strategy is an explicit calculation method based on a theoretical deduction, which is different from the typical moment-matching of certain distributions. It is shown to be more accurate under broad circumstances.
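For concreteness, the standalone sketch below computes one version of the cross-boundary probability $P(U_{(k)} > u_k,\ k = 1, ..., m)$ under independence using a simple binomial-increment recursion; it is illustrative only and does not reproduce the efficient exact/approximate algorithms of [53]. It assumes the boundaries are nondecreasing and constrain the first m order statistics.

```python
# Sketch: P(U_(k) > u_k, k = 1..m) for n iid Uniform(0,1) variables, by recursion.
import numpy as np
from scipy.stats import binom

def noncross_prob(n, u):
    # state q[l] = P(N(u_k) = l and N(u_j) <= j-1 for all j <= k), where
    # N(t) counts the uniforms that are <= t
    q = binom.pmf(np.arange(1), n, u[0])                 # only N(u_1) = 0 is allowed
    for k in range(2, len(u) + 1):
        p = (u[k - 1] - u[k - 2]) / (1.0 - u[k - 2])     # conditional increment prob.
        q_new = np.zeros(k)
        for l, ql in enumerate(q):
            q_new[l:k] += ql * binom.pmf(np.arange(k - l), n - l, p)
        q = q_new
    return q.sum()

# quick Monte Carlo check of the recursion
rng = np.random.default_rng(0)
n, m = 20, 10
u = np.linspace(0.001, 0.2, m)
sim = np.sort(rng.uniform(size=(200_000, n)), axis=1)[:, :m]
print(noncross_prob(n, u), (sim > u).all(axis=1).mean())
```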

Regarding statistical power, beyond extensive simulation-based studies, in this chapter we explore how correlation structure and signal pattern influence the signal-to-noise ratio (SNR) under both Gaussian mean models (GMM) and generalized linear models (GLM). Ideally, correlation information could be incorporated to help improve the power of signal detection. Indeed, this can be realized through linear transformation of the input statistics for the input p-values. We study the de-correlation transformation (DT) and the innovated transformation (IT) [55, 57] and reveal conditions for them to strengthen or weaken the SNR under GMM and GLM. In particular, for GLM, we show that under weak conditions marginal model-fitting is essentially the IT of the joint model-fitting. This result is particularly interesting in applied statistical studies because it indicates that the computationally simple marginal model-fitting is often superior to the computationally more demanding joint model-fitting for dependent data analysis, even when some extent of signal cancellation exists. Since IT is considered the optimal linear transformation to maximize SNR (at least under sparse cases) [58, 57], it should be applied, either implicitly or explicitly, before testing hypotheses by gGOF. As a natural extension of iHC [55], we call such a procedure the igGOF test. When the double-adaptation omnibus test in (2.4) is carried out on top of igGOF, we call the testing procedure digGOF.

The paper is organized as follows. Section 2.2 formulates the problem by defining models for hypotheses. Section 2.3.1 provides analytical p-value calculation methods; their accuracy is evaluated under various settings. Section 2.5 provides studies of the innovated transformation and statistical power. In Section 2.6 igGOF and digGOF are applied and evaluated for GWAS studies under correlated cases, and real GWAS analysis is also applied to find new disease genes. We discuss the limitations of this work and future plans in Section 2.7.

2.2 Models of Hypotheses

In this section we consider two well connected settings for hypotheses: the Gaussian mean model (GMM) and the generalized linear model (GLM). GMM serves as a foundation for the GLM; the latter can be considered a constrained GMM conditional on data patterns.

GMM assumes that the vector of n input test statistics $T = (T_1, ..., T_n)$ is jointly Gaussian:
$$T \sim N(\mu, \Sigma), \qquad (2.7)$$
where Σ is known or can be reliably estimated; µ is the unknown parameter corresponding to the hypotheses:

$$H_0: \mu = 0 \quad \text{versus} \quad H_1: \mu \ne 0. \qquad (2.8)$$

For consistency of the presentation, in this chapter the input statistics are always assumed standardized so that their covariance matrix Σ is a correlation matrix with 1's on the diagonal. Following that, the magnitude of the nonzero mean elements $\mu_i$ represents the SNR, serving as a measure of signal strength. Certainly, the higher the SNR, the higher the statistical power of any reasonable group test based on the $T_i$'s. Here the gGOF statistics defined in (2.1) take T's p-values as input, which could be one-sided or two-sided depending on the specific data analysis:

$$\text{One-sided: } P_i = \bar\Phi(T_i); \qquad \text{Two-sided: } P_i = 2\bar\Phi(|T_i|). \qquad (2.9)$$

Note that this pre-assumption of automatic data standardization is consistent with the fact that any point-wise rescaling for Ti won’t affect the input p-values nor their summary statistics.

A GLM is defined as

$$g(E(Y_k \mid X_{k\cdot}, Z_{k\cdot})) = X_{k\cdot}'\beta + Z_{k\cdot}'\gamma. \qquad (2.10)$$

For the kth subject, $k = 1, ..., N$, $Y_k$ denotes the response value, $X_{k\cdot} = (X_{k1}, ..., X_{kn})'$ denotes the kth row vector of the design matrix $X_{N\times n}$ of n inquiry covariates, and $Z_{k\cdot} = (Z_{k1}, ..., Z_{km})'$ denotes the kth row vector of $Z_{N\times m}$ of m control covariates. For example,

in gene-based single-nucleotide polymorphism (SNP)-set studies, assume a gene contains n SNPs; $X_{k\cdot}$ denotes the genotype vector of the n SNPs, and $Z_{k\cdot}$ denotes a data vector of m control covariates, such as the intercept, environmental factors, and other genetic variants. The function g is called a link function based on the distribution of $Y_k$ given $X_{k\cdot}$ and $Z_{k\cdot}$. Generally $Y_k$ follows a distribution in the exponential family with density function
$$f(y_k) = \exp\Big\{\frac{y_k\theta_k - b(\theta_k)}{a_k(\phi)} + c(y_k, \phi)\Big\},$$
where θ is the natural parameter, φ is the dispersion parameter, and a, b and c are given functions. The null

hypothesis is that none of the inquiry covariates are associated with the outcome:

$$H_0: \beta = 0 \quad \text{vs.} \quad H_1: \beta \ne 0. \qquad (2.11)$$

Through asymptotics for joint model-fitting, the GLM with correlated covariates is connected to the GMM in (2.7). Joint model-fitting means estimating β simultaneously; there could be many ways of doing so. Here we consider a class of one-step maximum likelihood estimation (MLE, cf. Theorem 4.19 and Exercise 4.152 of [33]) with initial estimate $\mu_k^{(0)} = g^{-1}(Z_{k\cdot}'\gamma^{(0)})$, $k = 1, ..., N$, where $\gamma^{(0)}$ is the MLE of γ under $H_0$. Let the corresponding estimated weights matrix be $W^{(0)} = \mathrm{diag}\{\mathrm{Var}^{(0)}(Y_k)\} = \mathrm{diag}\{a(\phi^{(0)}) b''(\theta_k^{(0)})\}$, where $\phi^{(0)}$ and $\theta_k^{(0)}$ are also the MLEs of φ and $\theta_k$, $k = 1, ..., N$, under $H_0$. Assume $N > n$, define $\tilde X = W^{1/2}X$ and $\tilde Z = W^{1/2}Z$, and let $\tilde H = \tilde Z(\tilde Z'\tilde Z)^{-1}\tilde Z'$ be the projection matrix onto the column space of $\tilde Z$. It can be shown that the estimator of β is

$$\hat\beta_J = (\tilde X'(I - \tilde H)\tilde X)^{-1} X'(Y - \mu^{(0)}) \xrightarrow{D} N\big(\beta,\, (\tilde X'(I - \tilde H)\tilde X)^{-1}\big), \qquad (2.12)$$
as $N \to \infty$. To obtain the input p-values for gGOF statistics, the input statistics are

the standardized $\hat\beta_J$:
$$T_J = \tilde\Lambda\hat\beta_J \xrightarrow{D} N\big(\tilde\Lambda\beta,\, \tilde\Lambda(\tilde X'(I - \tilde H)\tilde X)^{-1}\tilde\Lambda\big), \qquad (2.13)$$

where the diagonal matrix $\tilde\Lambda = \mathrm{diag}(1/\sqrt{\tilde\lambda_i})_{1\le i\le n}$, with $\tilde\lambda_i = ((\tilde X'(I - \tilde H)\tilde X)^{-1})_{ii}$ being the diagonal elements of the covariance matrix. Writing the marginal fitting/IT statistic as $U = (\tilde X'(I - \tilde H)\tilde X)\hat\beta_J$, we can see that $\mathrm{Var}(U) = \tilde X'(I - \tilde H)\tilde X = X'(W - WZ(Z'WZ)^{-1}Z'W)X$. This is consistent with the marginal score statistics in [54].

A special case of GLM is the linear regression model (LM), where the model equation is

$$Y = X\beta + Z\gamma + \epsilon, \qquad (2.14)$$

where $X_{N\times n}$ and $Z_{N\times m}$ are still the design matrices with their kth row vectors being $X_{k\cdot}'$ and $Z_{k\cdot}'$, respectively. The error term $\epsilon \sim N(0, \sigma^2 I_{N\times N})$, where the variance $\sigma^2$ is known or can be consistently estimated. Here the one-step MLE is the same as the least-squares (LS) estimation in joint model-fitting. In particular, in the linear regression model the weights matrix is $W = \mathrm{diag}\{1/\sigma^2\}$, and the initial estimate $\mu^{(0)} = \tilde HY$ is the projection of Y onto the column space of Z, i.e., the LS fit of Y under $H_0$.

Also note that the standardized statistics follow an exact normal distribution:
$$T_J = \Lambda\hat\beta_J/\sigma \sim N\big(\Lambda\beta/\sigma,\, \Lambda(X'(I - H)X)^{-1}\Lambda\big), \qquad (2.15)$$
where the diagonal matrix $\Lambda = \mathrm{diag}(1/\sqrt{\lambda_i})_{1\le i\le n}$, with $\lambda_i = ((X'(I - H)X)^{-1})_{ii}$ being the diagonal elements of the covariance matrix.
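As a small numerical illustration of (2.15), the sketch below computes the standardized joint-fitting statistics $T_J$ from simulated data; the variable names are ad hoc and σ is treated as known, as assumed in the text.

```python
# A small sketch computing the standardized joint-fitting statistics T_J in (2.15)
# for Y = X*beta + Z*gamma + eps with known sigma; names are ad hoc.
import numpy as np

rng = np.random.default_rng(0)
N, n, m, sigma = 500, 10, 2, 1.0
X = rng.standard_normal((N, n))
Z = np.column_stack([np.ones(N), rng.standard_normal((N, m - 1))])
beta = np.zeros(n); beta[0] = 0.3                         # one signal
Y = X @ beta + Z @ np.array([0.5, 0.1]) + sigma * rng.standard_normal(N)

H = Z @ np.linalg.solve(Z.T @ Z, Z.T)                     # projection onto columns of Z
G = X.T @ (np.eye(N) - H) @ X                             # X'(I - H)X
beta_J = np.linalg.solve(G, X.T @ (np.eye(N) - H) @ Y)    # least-squares (joint) fit
Lam = np.diag(1.0 / np.sqrt(np.diag(np.linalg.inv(G))))   # Lambda = diag(1/sqrt(lambda_i))
T_J = Lam @ beta_J / sigma                                # standardized statistics (2.15)
print(T_J.round(2))
```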

From this perspective, a GLM can be considered a restricted GMM in

(2.7). In GMM, µ and Σ can be defined independently. In GLM, even though

the “effects” β are defined separately from the data X and Z, both the mean vector $\mu_{T_J}$ (i.e., the signal strength under $H_a$) and the correlation matrix $\Sigma_{T_J}$ depend on

the data. In particular, data correlation structure plays a critical role. Consider

the formula in (2.13): $\tilde X'(I - \tilde H)\tilde X$ gives a measure of the covariance among the inquiry covariates conditional on the control covariates. In a typical case where Z only contains the intercept in regression, $\tilde X'(I - \tilde H)\tilde X = X'(I - J)X$, where J is a matrix with all entries 1/N, is exactly the empirical covariance matrix among the columns of X.

Such a connection between the input statistics $T_J$ and the data correlation structure has two consequences. First, the signal strength depends on the data correlation. As will be shown in Section 2.4.1, it is often the case that $\tilde\lambda_i < 1$ (when σ = 1, for example), indicating that the signal strength in $T_J$ is actually less than the effect size defined by the nonzero elements of β. Second, linear transformations of the input statistics are related to the data correlation. It is often the case that the de-correlation transformation and the innovated transformation of $T_J$ could increase the signal strength.

2.3 The gGOF Family Under Dependence

2.3.1 P-value of gGOF under Dependence

In this section we propose a few methods to calculate the p-value for the gGOF

related statistics in (2.1) and (2.4). This novel strategy is different from the typical

moment-matching methods, and is shown by simulations to be more accurate under various

situations.

First, we provide an exact calculation for the null distribution of gGOF statistics under the equal

correlation matrix summarized in Theorem 2.3.1.

Theorem 2.3.1. Consider input statistics T in (2.7) with µ = 0, Σij = ρ for all

$i \ne j$, and the input p-values are defined in (2.9). Assume $R = \{i : k_0 \le i \le k_1\}$.

Let U(1) ≤ ... ≤ U(n) be the order statistics of n iid Uniform(0, 1) random variables.

For any gGOF statistic in (2.1):
$$P(S_{n,f,R} < b) = \int_{-\infty}^{\infty} \phi(z)\, P\big(U_{(k)} > c_{ik},\ k = k_0, ..., k_1\big)\, dz, \qquad (2.16)$$
where
$$c_{1k} = 1 - \Phi\Big(\frac{\Phi^{-1}(1 - u_k) - \sqrt{\rho}\,z}{\sqrt{1-\rho}}\Big) \quad \text{and} \quad c_{2k} = 1 - \Phi\Big(\frac{\Phi^{-1}(1 - u_k/2) - \sqrt{\rho}\,z}{\sqrt{1-\rho}}\Big) + \Phi\Big(\frac{-\Phi^{-1}(1 - u_k/2) - \sqrt{\rho}\,z}{\sqrt{1-\rho}}\Big)$$
are respectively for the one-sided and two-sided input p-values in (2.9), and φ(z) and

Φ(z) are respectively the density and distribution functions of N(0, 1).

Efficient exact as well as approximate calculations for $P(U_{(k)} > a_k,\ k = k_0, ..., k_1)$ have been given in [53]. The integration over z can be calculated numerically.
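For illustration, the following self-contained Python sketch implements (2.16) numerically for the one-sided case and the HC2004 functional, pairing a simple non-crossing recursion for $P(U_{(k)} > c_k, \forall k)$ with one-dimensional quadrature over z. The helper names are ours; this is not the dissertation's implementation.

```python
# A self-contained numerical sketch of (2.16): P(S < b) under equal correlation rho,
# one-sided p-values, R = {1 <= i <= m}, and the HC2004 functional.
import numpy as np
from scipy.stats import norm, binom
from scipy.integrate import quad
from scipy.optimize import brentq

def hc_boundary(x, b, n):
    # u_k = f^{-1}(k/n, b) for the HC functional in (2.25), assuming b > 0
    g = lambda p: np.sqrt(n) * (x - p) / np.sqrt(p * (1 - p)) - b
    return brentq(g, 1e-16, x)

def noncross_prob(n, c):
    # P(U_(k) > c_k, k = 1..len(c)) for n iid Uniform(0,1); c nondecreasing in [0,1)
    q = np.array([(1.0 - c[0]) ** n])
    for k in range(2, len(c) + 1):
        p = min(max((c[k - 1] - c[k - 2]) / (1.0 - c[k - 2]), 0.0), 1.0)
        q_new = np.zeros(k)
        for l, ql in enumerate(q):
            q_new[l:k] += ql * binom.pmf(np.arange(k - l), n - l, p)
        q = q_new
    return q.sum()

def hc_cdf_equicorr(b, n, rho, m=None):
    # P(S_{n,f,R} < b) from (2.16), one-sided p-values, R = {1 <= i <= m}
    m = m or n // 2
    u = np.array([hc_boundary(k / n, b, n) for k in range(1, m + 1)])
    zq = norm.ppf(1.0 - u)                       # Phi^{-1}(1 - u_k)
    def integrand(z):
        c = 1.0 - norm.cdf((zq - np.sqrt(rho) * z) / np.sqrt(1.0 - rho))   # c_{1k}
        return norm.pdf(z) * noncross_prob(n, c)
    val, _ = quad(integrand, -8.0, 8.0, limit=200)
    return val

print(1.0 - hc_cdf_equicorr(3.0, n=20, rho=0.5))   # approximate P(HC >= 3)
```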

In real data analysis, the structure of the correlation matrix Σ could be complex, and Σ often needs to be estimated, which makes it even more complicated. Here we propose several strategies to approximate the distributions of gGOF statistics based on the exact calculation developed above.

The first strategy, called the weighted average method (WAM), is specially designed for the case where the correlation matrix is roughly a Toeplitz matrix, where the off-diagonal lines have equal elements. This assumption is appropriate for data with correlation that decays as the "distance" between two covariates increases, for example, the autoregressive model in time series data, or genetic data with decaying linkage disequilibrium (LD) over the physical/genetic distance. Specifically, let $G_\Sigma(b) \equiv P(S_{n,f,R} < b \mid \Sigma)$ be the distribution function of $S_{n,f,R}$ when the input statistics T have correlation matrix Σ. Consider Σ to be Toeplitz, i.e., $\Sigma(l,k) = \rho_j$, $j = |l - k|$. Denote by $\Sigma_j$ the equal-correlation matrix with correlation $\rho_j$. The WAM

approximation is
$$G_\Sigma(b) = \sum_{j=1}^{\alpha n} \omega_j G_{\Sigma_j}(b).$$
Theoretical [59] as well as our empirical studies show that the near off-diagonal components are more important for characterizing Σ. Thus, we propose a bandwidth parameter α, which truncates the far off-diagonal components to 0. The

weights are based on the relative sizes of the off-diagonal lines: $\omega_j = \frac{n-j}{((1+\alpha)n-1)(1-\alpha)n/2}$, $j = 1, ..., \alpha n$. Empirical results show that α = 0.5 is a robust choice in most cases.

When Σ is not exactly Toeplitz, we can take $\rho_j$ to be the average correlation on the jth off-diagonal of Σ.
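The sketch below shows one reading of the WAM recipe: average equal-correlation distribution values over the first αn off-diagonals with weights proportional to n − j, normalized to sum to one (which may differ from the exact normalizing constant printed above). The callable `G_equicorr` stands in for the equal-correlation calculation, e.g., the sketch following Theorem 2.3.1; all names are hypothetical.

```python
# One reading of the WAM approximation: a weighted average of equal-correlation
# values G_{Sigma_j}(b) over the first alpha*n off-diagonals, weights ~ (n - j).
import numpy as np

def wam_cdf(b, Sigma, G_equicorr, alpha=0.5):
    n = Sigma.shape[0]
    J = max(1, int(alpha * n))                    # bandwidth: first alpha*n off-diagonals
    rho = np.array([np.diag(Sigma, k=j).mean() for j in range(1, J + 1)])
    w = (n - np.arange(1, J + 1)).astype(float)
    w /= w.sum()                                  # normalized weights, proportional to n - j
    return float(np.sum(w * np.array([G_equicorr(b, n, r) for r in rho])))

# usage sketch with a Toeplitz, polynomially decaying Sigma:
n = 20
Sigma = np.array([[1.0 / (1 + abs(i - j)) for j in range(n)] for i in range(n)])
# wam_cdf(3.0, Sigma, G_equicorr=hc_cdf_equicorr)   # hypothetical combined call
```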

When the correlation matrix Σ is more complicated than Toeplitz, we propose a second strategy based on the locally weighted smoothing method (LOESS) to estimate the distribution. Specifically, we obtain the $G_\Sigma(b)$ curve by local smoothing of $G_{\Sigma_j}(b)$ around b. $\Sigma_j$ still represents the equal-correlation matrix with constant correlation $\rho_j$. However, instead of focusing on the near off-diagonals as in WAM, here $\rho_j$ is chosen within the data range of all elements in Σ. The idea is that large correlation elements in Σ, even if not necessarily close to the diagonal, could likely have non-negligible influence. We first set m equal-distance neighbor points in the interval around b, i.e., $b_i \in [b - \epsilon, b + \epsilon]$, $i = 1, ..., m$. Then, for each $b_i$, we randomly choose N off-diagonal elements $\rho_j$ of Σ to calculate $y_{ij} = G_{\Sigma_j}(b_i)$, $j = 1, ..., N$. After that, we use $y_{ij}$ as the input data for a local polynomial regression with tri-cube weights to predict the curve $G_\Sigma(b)$ [60]. For implementation, we found that m = 10, ε = 1, N = n, and a quadratic polynomial curve function often provide

a good rule of thumb. Comparing these two strategies, extensive numerical studies show that when Σ is Toeplitz or nearly so, WAM is more accurate, while LOESS is also fairly accurate and is more robust to more complex correlation structures.

For the p-value calculation of digGOF, as clarified in (2.5), the calculation is still a

cross-boundary probability function. Thus, it follows the same methods given above.

The key is to obtain the boundaries $u_k^\star$. For efficient implementation, we consider double-adaptation over a discrete sequence of functions and truncations $\{(f_1, R_1), (f_2, R_2), ...\}$.

Potential choices of the functions $f_j$ are from the φ-divergence statistic family with various $s \in [-1, 2]$, which ensures a theoretical optimality for detecting weak-and-rare signals [9]. Certainly, any monotone function f would work. The truncation domains $R_j = \{k_{0j}, ..., k_{1j}\}$ focus only on the index truncation for two reasons. First, for given data, the adaptation to all $k_0 \le k_1 \in \{1, ..., n\}$ is equivalent to the adaptation to all $\alpha_0 \le \alpha_1 \in [0, 1]$. Second, computation can be simplified significantly. Specifically, using the index j the statistic in (2.4) can be denoted as $S_o = \inf_j G_j(S_j)$, and its survival function under $H_0$ is

$$P(S_o > s_o) = P\big(S_j(P_1, ..., P_n) < G_j^{-1}(s_o) \text{ for all } j\big) = P\big(P_{(1)} > u_1^\star, ..., P_{(n)} > u_n^\star\big),$$

where $u_k^\star = \sup_j u_{jk}$, and for each given j, we only need to calculate $u_{jk}$ within $k_{0j} \le k \le k_{1j}$:
$$u_{jk} = \begin{cases} f_j^{-1}\big(\frac{k}{n}, G_j^{-1}(s_o)\big) & \text{if } k_{0j} \le k \le k_{1j}, \\ 0 & \text{otherwise.} \end{cases} \qquad (2.17)$$
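To illustrate (2.17), the following sketch merges rank-wise boundaries across a small set of (f_j, R_j) choices, here HC2004- and Berk–Jones-type functionals with two index truncations. The per-statistic thresholds $G_j^{-1}(s_o)$ are taken as given inputs (they come from the p-value calculations described above); all names and the example threshold values are ours.

```python
# A sketch of the double-adaptation boundaries in (2.17): u*_k = sup_j u_{jk}.
import numpy as np
from scipy.optimize import brentq

def hc_f(x, p, n):
    return np.sqrt(n) * (x - p) / np.sqrt(p * (1.0 - p))

def bj_f(x, p, n):   # Berk-Jones-type functional (one common form, for illustration)
    k = x * np.log(x / p)
    if x < 1:
        k += (1 - x) * np.log((1 - x) / (1 - p))
    return n * k

def boundary(f, x, b, n):
    lo = 1e-16
    return 0.0 if f(x, lo, n) <= b else brentq(lambda u: f(x, u, n) - b, lo, x)

def digGOF_boundaries(n, components):
    """components: list of (f, k0, k1, b_j) with b_j the threshold G_j^{-1}(s_o)."""
    u_star = np.zeros(n)
    for f, k0, k1, b in components:
        for k in range(k0, k1 + 1):
            u_star[k - 1] = max(u_star[k - 1], boundary(f, k / n, b, n))
    return u_star

n = 100
comps = [(hc_f, 1, 50, 4.0), (bj_f, 1, 100, 8.0)]   # illustrative thresholds only
print(digGOF_boundaries(n, comps)[:5])
```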

2.3.2 Rejection Boundary Analysis

Following (2.2), the acceptance region of a gGOF statistic is

$$A_S = \{P_{(k)} : P_{(k)} > u_k \text{ for all } k,\ P_{(k)} \in R\}, \qquad (2.18)$$

and the rejection region is $A_S^c$. That is, $H_0$ is rejected whenever $P_{(k)} \le u_k$ at any k.

Thus, at a given correlation Σ and a type I error rate control, the series $\{u_1, ..., u_n\}$ gives a rejection boundary (RB), which could provide an indicator of statistical power.

Figure 2.2 illustrates the RBs at the logarithm scale for various gGOF statistics.

H0 is rejected if any ordered p-values are below the corresponding boundary. If the

RB curve of one statistic is uniformly higher than that of the other, then the first statistic will have uniformly higher power over all possible signal strengths. However, when RBs cross, the statistical power depends on the signal patterns.

The left panel evaluates various φ-divergence statistics with s ∈ {−2, −1, 0, 1, 2, 3}.

The RB curves show an interesting pattern: statistics with large s (e.g., HC2004 with s = 2) have higher RBs at the top ranked p-values. If signals are strong and sparse so that the signals likely correspond to the smallest p-values, then these statistics are more sensitive to them and thus give higher statistical power. In particular, in the given setting, some curves (i.e., s ≤ 1) do not start from k = 1 because their $u_k$'s are negative, indicating that they have no power to detect signals at the smallest p-values. On the other hand, statistics with small s are more sensitive to the lower ranked p-values, indicating their advantages for signals that are weaker and/or denser. This pattern also indicates that focusing on a certain range of ordered p-values, by applying a proper p-value truncation domain R, could also benefit the statistical power.

As shown in the right panel, the omnibus method over various $f_s$ functions provides a balanced and thus robust solution. The RB curves of the omnibus tests tend to be closer to the best statistic at all positions. Moreover, it seems that we do not need the omnibus of too many tests: the RB for adapting to s ∈ {1, 2} is similar to that for adapting to s ∈ {−1, 0, 1, 2, 3}. This is likely because BJ is already a robust statistic.

Meanwhile, the power study by RB needs to be treated with caution due to its limitations. In particular, the distances between two RBs at different k locations do not proportionally reflect the relative advantages in terms of statistical power. That is, a small RB difference at some locations may be more important than a larger RB difference at other locations. Thus, if two RBs cross, it is hard to say which method is more powerful when the signal pattern is ambiguous.

Figure 2.2: The rejection boundary of the φ-divergence and omnibus statistics. The null hypothesis is rejected whenever any one log10(P(k)) is below the boundary. Left panel: RBs for φ-divergence statistics with s = 3 (black), 2 (i.e., HC2004, red), 1

(i.e., BJ, green), 0 (i.e., reversed BJ, blue) and −1 (i.e., HC2008, cyan). Right panel:

RBs for HC2004 (red), omnibus over HC2004 and BJ (black), and omnibus over φ-divergence with s ∈ {−1, 0, 1, 2, 3} (green).

2.4 Innovated Transformation

2.4.1 Innovated Transformation and Sparse Signals

In this section we consider proper ways of incorporating correlations into the group testing procedure. This is realized by proper linear transformation of the input statistics before obtaining the input p-values for the gGOF statistics. We first clarify the innovated transformation suggested by [55] under the settings of the Gaussian mean model and the GLM. Under either setting, we illustrate why it may (or may not) provide higher statistical power, especially for detecting sparse signals.

Transformations Under GMM

For a vector of input test statistics under GMM in (2.7), [55] proposed the decorrelation transformation (DT) and the innovated transformation (IT). Define the Cholesky factorization of Σ by Q (a lower triangular matrix), i.e.,
$$\Sigma = QQ', \quad \text{or} \quad Q'\Sigma^{-1}Q = I.$$

Define U = Q−1, which is also a lower triangular matrix. We have

$$\Sigma^{-1} = U'U, \quad \text{or} \quad U\Sigma U' = I.$$

After DT the input statistics become

$$T^{DT} = UT \sim N(U\mu,\ I). \qquad (2.19)$$

IT is defined below:

Definition 1. IT is a transformation procedure for a Gaussian random vector T in (2.7): if the locations of the nonzero elements of µ do not depend on Σ, then the IT of T is
$$T^{IT} = D\Sigma^{-1}T \sim N(D\Sigma^{-1}\mu,\ D\Sigma^{-1}D), \qquad (2.20)$$
where the matrix D rescales the statistics after transformation so that the variances of the $T_i^{IT}$'s are 1:
$$D = \mathrm{diag}\big(1/\sqrt{(\Sigma^{-1})_{ii}},\ i = 1, ..., n\big).$$
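As a numerical illustration of DT in (2.19) and IT in (2.20), the sketch below applies both to a single sparse signal under a polynomially decaying Σ (the form in (2.24) with γ = 1) and prints the resulting means at the signal coordinate, which equal the SNRs since the variances remain 1. All names are illustrative.

```python
# A numerical sketch of DT (2.19) and IT (2.20) for one sparse signal.
import numpy as np

n, j, A = 50, 24, 1.0
Sigma = np.array([[1.0 / (1 + abs(i - k)) for k in range(n)] for i in range(n)])
mu = np.zeros(n); mu[j] = A

Q = np.linalg.cholesky(Sigma)              # Sigma = QQ'
U = np.linalg.inv(Q)                       # DT matrix
Sinv = np.linalg.inv(Sigma)
D = np.diag(1.0 / np.sqrt(np.diag(Sinv)))  # rescaling so variances stay 1

mu_dt = U @ mu                             # mean after DT; Var(UT) = I
mu_it = D @ Sinv @ mu                      # mean after IT; unit marginal variances
print("original SNR:", mu[j])
print("DT SNR at j :", mu_dt[j])           # = A * U[j, j] >= A
print("IT SNR at j :", mu_it[j])           # = A * sqrt((Sigma^{-1})_{jj}) > A
```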

Note that the IT definition requires that the locations of the nonzero elements in the mean vector are independent of the correlation matrix. This condition gives a directional inverse transformation, which avoids a "looping definition" of IT (i.e., two statistic vectors cannot both be the IT of each other). It helps to anchor the baseline from which IT can increase the signal strength. Moreover, the rescaling procedure after the inverse transformation is consistent with the requirement for GMM in (2.7). That is, the input statistics of gGOF are always normalized so that the p-values are obtained by (2.9) and the nonzero mean elements give the SNRs.

Note that [55]’s proposal is broader, including a spectrum of transformations

$$V_{b_n}T \sim N\big(V_{b_n}\mu,\ V_{b_n}\Sigma V_{b_n}'\big), \qquad (2.21)$$

where $V_{b_n}$ is a diagonal-band truncated (bandwidth $b_n$) and column-normalized (for rescaling the statistics) transformation matrix. When $b_n = 1$, $V_1 = U$, the transformation is DT. When $b_n = n$, $V_n = D\Sigma^{-1}$, the transformation is IT. Besides providing a spectrum of transformations, introducing $b_n$ also provides technical convenience in the theoretical proof of optimality (see Theorem 2.4.3). However, in finite data analyses, UT and IT are often enough to represent the two extremes of the signal strength after transformation. Figure 2.3 gives an experiment on the influence of $b_n$. When the signals are separated far from each other (first row), IT is better than UT; when signals are clustered (second row), UT could be better. For simplicity we can simply use either DT or IT to get the best-case scenario without being concerned with the proper choice of the $b_n$ value in between.

Figure 2.3: Transformed signal strength. First row: $\mu_{16} = \mu_{32} = \mu_{48} = 1$ and other components of µ are 0. Second row: $\mu_{16} = \mu_{17} = \mu_{18} = 1$ and other components of µ are 0. Black: $V_1\mu$. Red: $D\Sigma^{-1}\mu$. Σ is a polynomial-decay correlation matrix, $\Sigma_{ij} = |i - j|^{-1}$, $i \ne j$, and $\Sigma_{ii} = 1$, $i = 1, ..., 50$. Left column: distribution of the transformed signals. Right column: the maximum signal after transformation across different bandwidths $b_n$.

In the following we illustrate why UT and IT could increase the SNR under sparse signals and sparse correlations (similar to the conditions in the first row of Figure 2.3 and the conditions given by [55]). The reasoning is based on a few important linear algebra properties of the relevant matrices.

1. The diagonal of Q can be calculated as $Q_{j,j} = \sqrt{\Sigma_{j,j} - \sum_{k<j} Q_{j,k}^2} = \sqrt{1 - \sum_{k<j} Q_{j,k}^2} < 1$.

2. As the inverse of Q, the diagonal elements of U satisfy $U_{j,j} = 1/Q_{j,j}$, thus $U_{j,j} > 1$.

3. $\Sigma^{-1}$ is symmetric with diagonal elements $(\Sigma^{-1})_{1,1} = ... = (\Sigma^{-1})_{n,n} > 1$.

4. If the correlations in Σ are all positive, most of the off-diagonal elements in $\Sigma^{-1}$ will be negative.

5. If Σ is polynomially decaying, then U, Q, and $\Sigma^{-1}$ are all polynomially decaying [61, 55].

6. The $U_{ii}$ are increasing in i, with $U_{1,1} = 1 < U_{2,2} < ... < U_{n,n} = \sqrt{(\Sigma^{-1})_{n,n}}$. This is due to
$$(\Sigma^{-1})_{j,j} = \sum_{k=0}^{n-j} u_{j+k,j}^2, \quad \text{with special cases } (\Sigma^{-1})_{1,1} = \sum_{k=0}^{n-1} u_{1+k,1}^2 \text{ and } (\Sigma^{-1})_{n,n} = u_{n,n}^2.$$

7. Point-wise rescaling of the test statistics T, i.e., multiplying by a diagonal matrix D, does not change the signal-to-noise ratio (SNR), because it scales the mean and the standard deviation by the same proportion: $D_j\mu_j/\sqrt{D_j^2\Sigma_{j,j}} = \mu_j/\sqrt{\Sigma_{j,j}}$. The input p-values also remain the same.

To show when signals are surely enhanced by DT and IT, let us first consider a special case where there is only one non-zero element of µ at position j, i.e., $\mu_j = A$ for a constant A and $\mu_i = 0$ for all $i \ne j$. For DT, the post-transformed mean vector $U\mu$ has elements $(U\mu)_j = AU_{j,j} \ge A$ and $(U\mu)_i = 0$ for all $i \ne j$. Similarly, for IT, the mean vector after IT, $D\Sigma^{-1}\mu$, has elements $(D\Sigma^{-1}\mu)_j = \frac{1}{\sqrt{(\Sigma^{-1})_{j,j}}}(\Sigma^{-1})_{j,j}A = \sqrt{(\Sigma^{-1})_{j,j}}\,A > A$. At the same time, after transformation the marginal variances of the input statistics remain 1. Thus, the signal strength in terms of SNR is increased. Note that IT is better than DT because $1 \le U_{j,j} \le \sqrt{(\Sigma^{-1})_{j,j}}$, as specified in the linear algebra properties above.

More generally, let $\mu_j = A$ for $j \in M^* = \{j_1, ..., j_K\}$, the domain of true signals. A sufficient condition for UT and IT to strengthen signals is: 1) the signals are sparse, i.e., $K = o(n)$; 2) the signal locations are randomly distributed independently of the correlation structure of Σ, e.g., $M^*$ is uniformly distributed over $\{1, ..., n\}$; and 3) the correlation matrix is sparse, e.g., Σ is polynomially decaying, such that U, Q, and $\Sigma^{-1}$ are all polynomially decaying. Under this condition, after transformation, the SNRs are still roughly $AU_{j,j}$ for UT and $A\sqrt{(\Sigma^{-1})_{j,j}}$ for IT (cf. Lemma 6 in [10] and Lemma 11.2 in [55]). Thus UT and IT still strengthen the signals. This idea is demonstrated by the diagram in Figure 2.4. Dots represent nonzero elements; segments represent the width of the correlations. The nonzero elements of µ are strengthened by the banded correlation of $D\Sigma^{-1}$ after transformation to get $D\Sigma^{-1}\mu$. On the other hand, the inverse transformation from $D\Sigma^{-1}\mu$ to µ will reduce the SNR; however, because $D\Sigma^{-1}\mu$ depends on the correlation Σ, that direction is not considered an IT as defined in (2.20). As will be illustrated for the GLM below, this assumption is important for us to start from joint model-fitting and show that the marginal model-fitting is the IT, not vice versa.

Figure 2.4: Demonstration of IT when both correlations (DΣ−1) and signals (µ) are sparse.

Transformations Under GLM

In this section we study the performance of UT and IT in terms of SNR under the GLM setting in (2.10). The marginal model-fitting can be roughly considered as the IT of the joint model-fitting, and we show various situations where joint model-fitting, UT, and IT have their relative advantages.

For the linear regression model in (2.14), Theorem 2.4.1 illustrates the relationship between the model-fitting types and the transformation types. When signals are sparse, in particular if there is only one signal, no matter what the correlation structure of the data is, UT is guaranteed to improve over the original joint model-fitting, and IT (i.e., the marginal model-fitting) guarantees a further improvement.

Theorem 2.4.1. Consider the linear regression model in (2.14) with error term $\epsilon \sim N(0, \sigma^2 I)$, where σ is known. $H = Z(Z'Z)^{-1}Z'$ is the projection matrix onto the columns of Z. Denote by $X_j$ the jth column of X, $j = 1, ..., n$.

1. The test statistics from joint model-fitting (least-squares estimation) are $T_J$ in (2.15); the test statistics from marginal model-fitting are
$$T_M = CX'(I - H)Y/\sigma \sim N\big(\Sigma_{T_M}C^{-1}\beta/\sigma,\ \Sigma_{T_M}\big), \qquad (2.22)$$
where $C = \mathrm{diag}\Big\{\frac{1}{\sqrt{X_j'(I - H)X_j}},\ j = 1, ..., n\Big\}$ so that $\Sigma_{T_M} = CX'(I - H)XC$ is a correlation matrix with 1's on the diagonal.

2. The IT of $T_J$ is $T_M$, i.e., $T_J^{IT} = T_M$, and the UTs of $T_J$ and $T_M$ are the same:
$$T_J^{UT} = T_M^{UT} = U_M CX'(I - H)Y/\sigma \sim N\big((CU_M')^{-1}\beta/\sigma,\ I\big),$$
where $U_M$ is the inverse of the Cholesky factor of $\Sigma_{T_M}$, i.e., $U_M\Sigma_{T_M}U_M' = I$.

3. If β has one nonzero element A > 0 under $H_a$, then the SNRs of the three methods have the relationship:
$$E(T_J) \le E(T_J^{UT}) \le E(T_M).$$

Note that $T_J$ and $T_M$ can be mutually transformed by multiplying by the inverse of the correlation matrix. However, according to the rule of IT described above, we say $T_M$ is the IT of $T_J$, not vice versa. This is because the locations of the nonzero elements in the mean vector of $T_M$ depend on the structure of its correlation matrix. For example, it could be the case that, among two elements of the statistics, $T_{J1}$ has a nonzero mean (indicating a causal factor) and $T_{J2}$ has a zero mean (a non-causal factor), and yet they are correlated. However, $T_M$ depends on the data covariance matrix $X'(I - H)X$ (conditional on Z): if $X_1$ and $X_2$ are correlated, then $\mu_{M1} \ne 0$ implies $\mu_{M2} \ne 0$ even if $\beta_2 = 0$. The marginal estimator $\beta_M$ is biased, but could provide more power.

Theorem 2.4.1 considers one signal for simplicity, but in general a similar result can be shown: UT and IT guarantee an improvement of the SNR as long as the signals are sparser than the correlation, a situation illustrated in Figure 2.4. In this case, marginal model-fitting is still always preferred over joint model-fitting no matter whether the correlations are positive or negative. Furthermore, Theorem 2.4.1 assumes that N > n + m so that joint fitting is doable. In general, for high-dimensional data analysis problems, where N < n + m, marginal fitting is still a good choice due to its simple computation and its relationship with IT, which maximizes the signal-to-noise ratio of the test statistics when the covariates are correlated and the true β is sparse.
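A small simulation sketch of the comparison in Theorem 2.4.1 follows: with one nonzero coefficient and positively correlated covariates, the marginal-fitting statistics $T_M$ in (2.22) concentrate more mean at the signal coordinate than the joint-fitting statistics $T_J$ in (2.15). The variable names and parameter choices are ours.

```python
# A simulation sketch comparing T_J (2.15) and T_M (2.22) at the signal coordinate.
import numpy as np

rng = np.random.default_rng(2)
N, n, sigma, rho = 2000, 20, 1.0, 0.5
Sig = rho * np.ones((n, n)) + (1 - rho) * np.eye(n)       # equal correlation of covariates
X = rng.multivariate_normal(np.zeros(n), Sig, size=N)
Z = np.ones((N, 1))                                       # intercept-only control
beta = np.zeros(n); beta[0] = 0.2                         # a single signal
Y = X @ beta + sigma * rng.standard_normal(N)

H = Z @ np.linalg.solve(Z.T @ Z, Z.T)
M = np.eye(N) - H
G = X.T @ M @ X                                           # X'(I - H)X
T_J = np.diag(1 / np.sqrt(np.diag(np.linalg.inv(G)))) @ np.linalg.solve(G, X.T @ M @ Y) / sigma
C = np.diag(1 / np.sqrt(np.diag(G)))
T_M = C @ X.T @ M @ Y / sigma                             # marginal statistics (2.22)
print("T_J at signal:", T_J[0].round(2), " T_M at signal:", T_M[0].round(2))
```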

Now we extend the result to GLM.

Theorem 2.4.2. Consider GLM in (2.10).

1. The test statistics from joint model-fitting are $T_J$ in (2.13). The test statistics from marginal model-fitting are
$$T_M = CX'(Y - \mu^{(0)}) \xrightarrow{D} N\big(\Sigma_{T_M}C^{-1}\beta,\ \Sigma_{T_M}\big), \qquad (2.23)$$
where $\mu^{(0)}$ is the MLE of the mean under $H_0$, $C = \mathrm{diag}\Big\{\frac{1}{\sqrt{\tilde X_j'(I - \tilde H)\tilde X_j}},\ j = 1, ..., n\Big\}$, $\tilde X = W^{1/2}X$, $\tilde Z = W^{1/2}Z$, $\tilde H = \tilde Z(\tilde Z'\tilde Z)^{-1}\tilde Z'$, and $W = \mathrm{diag}\{\mathrm{Var}^{(0)}(Y_j)\}$ is the estimated weights matrix under $H_0$.

2. The IT of $T_J$ is $T_M$, i.e., $T_J^{IT} = T_M$, and the UTs of $T_J$ and $T_M$ are the same:
$$T_J^{UT} = T_M^{UT} = U_M CX'(Y - \mu^{(0)}) \xrightarrow{D} N\big((CU_M')^{-1}\beta,\ I\big),$$
where $U_M$ is the inverse of the Cholesky factor of $\Sigma_{T_M}$, i.e., $U_M\Sigma_{T_M}U_M' = I$.

3. If β has one nonzero element A > 0 under $H_a$, then the SNRs of the three methods have the relationship:
$$E(T_J) \le E(T_J^{UT}) \le E(T_M).$$

By Theorems 2.4.1 and 2.4.2 and [55], under the GLM with sparse coefficients β and Toeplitz correlation of covariates, we can show the optimality of iHC for detecting asymptotically weak-and-rare signals. Specifically, the assumptions are

1. For $j_k \in M^* = \{j_1, ..., j_K\}$, $\beta_j = A_{n,j} = \sqrt{2 r_j \log n}$ if $j \in M^*$, and $\beta_j = 0$ otherwise. The domain of β, $M^*$, has size $K = n^{1-\alpha}$, $\alpha \in (1/2, 1)$, indicating sparse signals. Moreover, $M^*$ is uniformly distributed on $\{1, ..., n\}$ with equal probability $\binom{n}{K}^{-1}$.

2. Assume $\Sigma = \tilde\Lambda(\tilde X'(I - \tilde H)\tilde X)^{-1}\tilde\Lambda$ in (2.13) can be written as a Toeplitz matrix that is generated by a spectral density f on $(-\pi, \pi)$, i.e.,
$$\Sigma_{jk} = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(t)e^{-i|j-k|t}\,dt$$
is the $|j - k|$th Fourier coefficient of f.

3. Assume there exist constants γ > 1 and $M_0 = M_0(f) > 0$ such that
$$|\Sigma_{jk}| \le \frac{M_0(f)}{(1 + |j - k|)^\gamma},$$
which indicates that Σ's off-diagonals decay at a polynomial rate or faster.

Theorem 2.4.3. Consider the LM in (2.14). Follow the assumptions for the mean vector and the correlation matrix described above. The detection boundary for β is
$$\rho_j^*(\alpha) = \frac{\rho(\alpha)}{\tilde X_j'(I - \tilde H)\tilde X_j}, \quad \text{where} \quad \rho(\alpha) = \begin{cases} \alpha - 1/2, & 1/2 < \alpha \le 3/4, \\ (1 - \sqrt{1 - \alpha})^2, & 3/4 < \alpha < 1. \end{cases}$$

When $r_j < \rho_j^*(\alpha)$, all tests asymptotically have power 0 in the sense that the sum of type I and type II error rates always converges to 1 as $N > n \to \infty$. When $r_j > \rho_j^*(\alpha)$, if we apply the IT-transformed $T_J$ in (2.21) with bandwidth $b_n = \log n$ to the HC2004 statistic in (1.9) with $R = \{1/n \le P_{(i)} \le 1/2\}$, and reject $H_0$ whenever HC2004 $\ge (\log n)^{5/2}$, then such an iHC procedure is asymptotically powerful in the sense that the type I error rate converges to 0 and its power converges to 1 as $N > n \to \infty$.

2.5 Numerical Studies

2.5.1 Accuracy of P-value Calculations

By simulation here we study the accuracy of our calculation methods for the null distributions of gGOF under various models of correlation structure. In recent studies [7, 54] moment-matching based methods have been applied to p-value calculations for GHC (by the beta-binomial distribution) and GBJ (by the extended beta-binomial distribution). Such methods can also be used for calculating p-values for HC and BJ. Here we use the GBJ package implemented by the authors for calculation.

Under Gaussian Mean Models

Consider GMM in (2.7) with several typical patterns of correlation matrix Σ =

(ρij)1≤i,j≤n, indexed by parameter γ.

1) Equal correlation matrix ρij = γ.

2) Polynomial decay matrix

$$\rho_{ij} = 1/(1 + |i - j|)^\gamma. \qquad (2.24)$$

3) Exponential decay matrix $\rho_{ij} = \gamma^{|i-j|}$.

Depending on the correlation decay speed along Σ's off-diagonals, the equal correlation has the strongest/densest correlation and the exponential decay has the weakest/sparsest correlation. Figure 2.5 gives the null distribution of

HC with k1 = n = 20.

$$HC_{n,R} = \sup_{1\le i\le n/2} \sqrt{n}\,\frac{i/n - P_{(i)}}{\sqrt{P_{(i)}(1 - P_{(i)})}}, \qquad (2.25)$$

It is clear that when the correlation is weaker all methods are closer to the truth represented by the simulation. However, our methods (both WAM and LOESS) are substantially more accurate than the moment-matching method under stronger correlations.

Figure 2.5: The survival function of HC (first row) and BJ (second row) under various correlation settings. Left: common correlation ρ = 0.5. Middle: polynomial decay

γ = 0.5. Right: exponential decay γ = 0.5. n = 20. 50,000 repetitions for simulation.
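For reference, the brief sketch below generates the three correlation patterns and the simulation-based "truth" used as the benchmark in Figure 2.5: draw T ~ N(0, Σ), form one-sided p-values, and evaluate the HC statistic in (2.25). The builder names are ours.

```python
# Sketch: three correlation patterns and the simulated null reference for HC (2.25).
import numpy as np
from scipy.stats import norm

def equal_corr(n, g):
    return g * np.ones((n, n)) + (1 - g) * np.eye(n)

def poly_decay(n, g):
    return np.array([[1.0 / (1 + abs(i - j)) ** g for j in range(n)] for i in range(n)])

def exp_decay(n, g):
    return np.array([[g ** abs(i - j) for j in range(n)] for i in range(n)])

rng = np.random.default_rng(3)
n, reps, b = 20, 50_000, 3.0
k = np.arange(1, n // 2 + 1) / n
for name, Sig in [("equal, rho=0.5", equal_corr(n, 0.5)),
                  ("polynomial, gamma=0.5", poly_decay(n, 0.5)),
                  ("exponential, gamma=0.5", exp_decay(n, 0.5))]:
    L = np.linalg.cholesky(Sig)
    T = rng.standard_normal((reps, n)) @ L.T             # rows ~ N(0, Sigma)
    P = np.sort(norm.sf(T), axis=1)[:, : n // 2]         # smallest one-sided p-values
    S = (np.sqrt(n) * (k - P) / np.sqrt(P * (1 - P))).max(axis=1)   # HC in (2.25)
    print(name, " empirical P(HC >", b, ") =", (S > b).mean())
```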

We also studied the multiple regression model in (2.14) with β = 0 and Z = 0.

The design matrix $X_{N\times n}$, with N = 1000 and n = 20, was randomly generated in each of 5,000 simulations. The iid rows of X were drawn from a multivariate normal distribution with two blocks A and B, each containing 10 covariates. Table 2.1 lists seven different correlation structures within and between A and B. The input statistics were based on marginal model-fitting in (2.22). Three gGOF tests were

Table 2.1: Seven correlation patterns among the covariates of multiple regression

model. Covariates have 2 blocks: A and B. ρ = 0.5 for constant correlation; decay:

polynomial decay correlation in (2.24) with γ = 1.

Cases       1   2   3   4   5      6      7
Within A    0   ρ   ρ   ρ   decay  decay  decay
Within B    0   0   ρ   ρ   0      decay  decay
Cross A-B   0   0   0   ρ   0      0      decay

studied: HC2004, BJ, and the omnibus of HC2004 and BJ, all with R = {1 ≤ i ≤ n/2}.

We studied whether our p-value calculation (by the LOESS method) gave a good control of type I error rates at typical significance levels of α = 0.05, 0.01 and 0.005. Table 2.2 shows the empirical type I error rates, i.e., the proportion of simulations in which the calculated p-values were less than or equal to α. The results show that our calculations were mostly accurate. The calculation could sometimes be slightly conservative because the true type I error rates, represented by the empirical values, were slightly smaller than the nominal values α.

Moreover, we evaluated the p-value calculation methods by simulations in the context of genetic association studies. Specifically, we obtained 1,290 haplotypes for a genome region of 250k base-pairs using the genetic data simulation software Cosi2 [62]. The genetic model followed the typical coalescent model with an LD pattern similar to that of the European population. To mimic real GWAS practice, where genotype data are obtained randomly from sample to sample, each simulation generated one random genotype sample. We randomly drew two haplotypes with replacement from the 1,290 to form each subject's genotype; the sample size

Table 2.2: Empirical type I error rates at significance levels α under the regression model. Covariates follow a normal distribution with the 7 cases of correlations listed in Table 2.1. Three statistics: HC2004, BJ, and the omnibus of HC2004 and BJ, all with R = {1 ≤ i ≤ n/2}.

Case  α      BJ      HC      Omni
1     0.05   0.0508  0.0522  0.059
1     0.01   0.0092  0.0116  0.009
1     0.005  0.0052  0.0046  0.003
2     0.05   0.0516  0.054   0.059
2     0.01   0.0106  0.0108  0.008
2     0.005  0.0044  0.0046  0.003
3     0.05   0.0562  0.0556  0.061
3     0.01   0.0116  0.0108  0.013
3     0.005  0.0052  0.0054  0.005
4     0.05   0.0476  0.0516  0.062
4     0.01   0.0088  0.0108  0.010
4     0.005  0.0054  0.0052  0.008
5     0.05   0.0618  0.0508  0.045
5     0.01   0.0078  0.0098  0.006
5     0.005  0.0034  0.004   0.003
6     0.05   0.0604  0.0464  0.059
6     0.01   0.010   0.010   0.007
6     0.005  0.003   0.0052  0.003
7     0.05   0.0588  0.0498  0.053
7     0.01   0.009   0.0098  0.006
7     0.005  0.0034  0.0048  0.003

was N = 200, 500 or 1000. Genotype data of the first n = 20 common variants (with minor allele frequency (MAF) > 5%) were used as the data for genetic covariates.

An example of the correlation structure among 20 genotypes is illustrated in Figure

2.6. Two non-genetic controlling covariates were further simulated, Z1 (a binary variable of Bernoulli(0.5)) and Z2 (a continuous variable of N(0, 1)). The responses were generated from the non-genetic controlling covariates in order to mimic the null hypothesis that the phenotype is only influenced by environmental factors, but not any genetic variants:

$$Y = 0.5Z_1 + 0.1Z_2 + \epsilon.$$

We examined four individual statistics (BJ, HC2004, reversed BJ, and HC2008) and the digGOF omnibus test over these four test statistics and over the p-value truncation domains R = {k0 ≤ i ≤ k1} with k0 ∈ {1, 2} and k1 ∈ {10, 15, 20}.

Table 2.3 summarizes the empirical type I error rates, i.e., the proportion of simulations where the calculated p-values based on the simulated test statistics are less than a given α value.

Figure 2.6: Correlation matrix for genotypes of the 20 simulated common variants by

Cosi2.

We performed the digGOF test over s ∈ {−1, 0, 1, 2}, k0 ∈ {1, 2}, and k1 ∈ {10, 15, 20} for sample sizes 200, 500, and 1000. The empirical type I error rates are summarized in Table 2.3.

2.5.2 Comparisons of Statistical Power

We studied the statistical power under GMM, where the input statistics T were generated from a multivariate normal distribution with various mean vectors µ and correlation matrices Σ.

First we considered the single-signal case, i.e., only one nonzero element of µ representing a sparse signal, over various SNRs and correlation structures. Figure

Table 2.3: Empirical type I error rates at significance levels of α under simulated

genetic data (10,000 simulations). An example of correlation structure is illustrated

in Figure 2.6.

N = 200
α      BJ      HC      rev. BJ  rev. HC  Omnibus
0.05   0.0662  0.0522  0.0471   0.0421   0.0573
0.01   0.0091  0.0102  0.0049   0.0058   0.0111
0.005  0.0038  0.0050  0.0017   0.0023   0.006
0.001  0.0003  0.0009  0.0002   0.0003   0.0009

N = 500
α      BJ      HC      rev. BJ  rev. HC  Omnibus
0.05   0.0665  0.0545  0.0433   0.0442   0.0605
0.01   0.0089  0.0116  0.0050   0.0051   0.0129
0.005  0.0045  0.0051  0.0015   0.0014   0.0065
0.001  0.0002  0.0013  0.0003   0.0004   0.0015

N = 1000
α      BJ      HC      rev. BJ  rev. HC  Omnibus
0.05   0.0675  0.0540  0.0539   0.0466   0.0614
0.01   0.0094  0.0100  0.0055   0.0079   0.0130
0.005  0.0042  0.0048  0.0013   0.0024   0.0076
0.001  0.0006  0.0009  0.0003   0.0003   0.0014

2.7 gives the power curves for the HC, GHC, BJ, and GBJ tests, where the input statistics are either the original T (n = 100; SNR values on the x-axis) or T transformed by IT. Four correlations were studied: positive equal correlation (top-left panel: $\Sigma^{(1)}_{ij} = 0.3$); negative equal correlation (top-right: $\Sigma^{(2)}_{ij} = -0.00987$); positive polynomial decay (bottom-left: $\Sigma^{(3)}_{ij} = |i - j|^{-1}$); and negative polynomial decay (bottom-right: $\Sigma^{(4)} = (\Sigma^{(3)})^{-1}$). There are a few interesting observations. In this sparse-signal case,

IT always increases the largest SNR, and always increases the power of HC, GHC and BJ. However, the performance of GBJ depends on the correlation structure.

When the correlations are positive equal correlation or polynomial decay (left panels), GBJ has almost no power after IT, which leads to weak and negative correlations. Moreover, HC and GHC give almost identical power (their curves almost overlap, except GBJ is slightly better under the original positive equal correlation), and they are always the best after IT for detecting such a sparse signal.

Second, we considered increasing numbers of signals with random locations (i.e., signals were likely spread out). Figure 2.8 gives power curves under four correlation structures similar to those above. Again, IT increases statistical power significantly for all statistics. HC-type statistics are more powerful for sparser signals while BJ-type statistics are better for denser signals. After IT, BJ is always similar to (right panels) or significantly better than (left panels) GBJ, and HC is always similar to (left panels) or slightly better than (right panels) GHC.

Figure 2.7: Power comparison among HC, GHC, BJ and GBJ for detecting a single signal. X-axis: strength of the single signal. Top-left panel: $\Sigma^{(1)}_{ij} = 0.3$; top-right: $\Sigma^{(2)}_{ij} = -0.00987$; bottom-left: $\Sigma^{(3)}_{ij} = |i - j|^{-1}$; bottom-right: $\Sigma^{(4)} = (\Sigma^{(3)})^{-1}$ (i.e., polynomial decay with negative entries). n = 100.

Figure 2.8: Power comparison among HC, GHC, BJ and GBJ for detecting multiple signals. X-axis: number of signals. Top-left panel: $\Sigma^{(1)}_{ij} = 0.3$, µ = 2; top-right: $\Sigma^{(2)}_{ij} = -0.00987$, µ = 0.4; bottom-left: $\Sigma^{(3)}_{ij} = |i - j|^{-1}$, µ = 2; bottom-right: $\Sigma^{(4)} = (\Sigma^{(3)})^{-1}$, µ = 1. n = 100.

Now we move to the power study under the GLM setting.

Following relevant literature [54], here we consider block-wise correlation structures, a typical pattern in genetic association data due to haplotype blocks on DNA

[63]. Specifically, the genotype data were simulated based on two blocks A and B:

XN×n = (XA,XB) with sample size N = 1000 and number of covariates n = 100.

Block A contained all associated SNPs (i.e., signals) with size nA ∈ {1, 2, ..., 10}; block B contained all non-associated SNPs with block size nB = n − nA. Consider

equal correlations within block A ($\rho^A_{ij}$), within block B ($\rho^B_{ij}$), and across A and B ($\rho^{AB}_{ij}$). Four LD structures were defined based on the block-wise correlations among SNPs:

(A) $\rho^A_{ij} = 0.5$, $\rho^B_{ij} = 0$, $\rho^{AB}_{ij} = 0$;
(B) $\rho^A_{ij} = 0.5$, $\rho^B_{ij} = 0.2$, $\rho^{AB}_{ij} = 0$;
(C) $\rho^A_{ij} = 0.5$, $\rho^B_{ij} = 0.2$, $\rho^{AB}_{ij} = 0.2$;
(D) $\rho^A_{ij} = 0.5$, $\rho^B_{ij} = 0.2$, $\rho^{AB}_{ij} = 0.33$.

The associated SNPs have equal effect sizes $\beta^A_j$, and the controlling covariates Z = 0. The response variable Y was a quantitative trait generated by the regression model in (2.14) with σ = 1, or a binary trait generated by the GLM in (2.10) with the logistic link function.

The number of simulations is 5000 and the significance level is 0.05 (controlled by simulation). All SNPs had MAF = 0.3, i.e., $X_{ij} \sim \mathrm{Binomial}(2, 0.3)$.

Figure 2.9: Statistical power of statistics under linear regression of genetic covariates with equal correlation. Four correlation structures: (A): top-left panel; (B): top-right; (C): lower-left; (D): lower-right.

Figure 2.10: Statistical power of statistics under logistic regression of genetic covariates with equal correlation. Four correlation structures: (A): top-left panel; (B): top-right; (C): lower-left; (D): lower-right.

Figure 2.11: Power of gGOF under different fitting methods of linear regression of genetic covariates with equal correlation. 'M': marginal regression. 'J': joint regression. 'D': decorrelation. Top-left: $\rho^A_{ij} = 0.5$, $\rho^B_{ij} = \rho^{AB}_{ij} = 0$; top-right: $\rho^A_{ij} = 0.5$, $\rho^B_{ij} = 0.2$, $\rho^{AB}_{ij} = 0$; lower-left: $\rho^A_{ij} = 0.5$, $\rho^B_{ij} = 0.2$, $\rho^{AB}_{ij} = 0.2$; lower-right: $\rho^A_{ij} = 0.5$, $\rho^B_{ij} = 0.2$, $\rho^{AB}_{ij} = 0.33$. The HC-m and BJ-m curves here should be the same as HC and BJ in Figure 2.9 except for slight differences caused by smoothing.

Figure 2.12: Power of GBJ and GHC under different fitting methods of linear regression of genetic covariates with equal correlation. 'M': marginal regression. 'J': joint regression. 'D': decorrelation. Top-left: $\rho^A_{ij} = 0.5$, $\rho^B_{ij} = \rho^{AB}_{ij} = 0$; top-right: $\rho^A_{ij} = 0.5$, $\rho^B_{ij} = 0.2$, $\rho^{AB}_{ij} = 0$; lower-left: $\rho^A_{ij} = 0.5$, $\rho^B_{ij} = 0.2$, $\rho^{AB}_{ij} = 0.2$; lower-right: $\rho^A_{ij} = 0.5$, $\rho^B_{ij} = 0.2$, $\rho^{AB}_{ij} = 0.33$.

Figure 2.13: Power comparison under the regression model with genotype covariates and block-wise equal correlations. Correlation structures follow Cases 1–4 in Table 2.1 with ρ = 0.3, the same settings as [54] (Section 5.2). Top-left panel: $\rho^A_{ij} = \rho^B_{ij} = \rho^{AB}_{ij} = 0$, i.e., independent covariates; top-right: $\rho^A_{ij} = 0.3$, $\rho^B_{ij} = \rho^{AB}_{ij} = 0$; lower-left: $\rho^A_{ij} = 0.3$, $\rho^B_{ij} = 0.3$, $\rho^{AB}_{ij} = 0$; lower-right: $\rho^A_{ij} = \rho^B_{ij} = \rho^{AB}_{ij} = 0.3$.

We further made power comparisons under decaying correlations. Figure 2.14 studies three cases of the polynomial decay defined in (2.24). In all three cases of block-dictated polynomial decay, HC and GHC are virtually equivalent and give the highest power over most signal proportions.

Figure 2.14: Power comparison under the regression model with genotype covariates and polynomial-decay correlations. Correlation structures follow Cases 5–7 in Table 2.1 with γ = 1. Left: $\rho^A_{ij} = |i - j|^{-1}$ is polynomial decay and $\rho^B_{ij} = \rho^{AB}_{ij} = 0$; middle: $\rho^A_{ij} = |i - j|^{-1}$, $\rho^B_{ij} = |i - j|^{-1}$, and $\rho^{AB}_{ij} = 0$; right: $\rho^A_{ij}$, $\rho^B_{ij}$, $\rho^{AB}_{ij}$ are all decaying with $\rho_{ij} = |i - j|^{-1}$.

We also studied cases where the covariates are continuous variables. To be consistent, we followed the same settings as in Figure 2.13 and Figure 2.14, except that each covariate in X followed N(0, 1). Figure 2.15 and Figure 2.16 correspond to equal correlations and polynomial-decay correlations, respectively. We observe similar patterns to those in Figure 2.13 and Figure 2.14. This result indicates that the distribution of X is not a primary influential factor for the performance, as long as the test statistics are based on standardized data (or, equivalently, an intercept is also included in the model-fitting no matter what the underlying true model is). More important factors are the SNR and the correlation structure.

Figure 2.15: Power comparison under LM with Xj ∼ N(0, 1) and equal correlation in blocks. Top left: independent covariates. Top right: 0.3 within correlation of signals.

Lower left: 0.3 within correlation of signals and 0.3 within correlation of noises. Lower right: 0.3 equal correlation of all covariates.

Figure 2.16: Power comparison under LM with Xj ∼ N(0, 1) and polynomial decay correlation. Left : polynomial decay 1 within correlation of signals. Middle: polyno- mial decay 1 within correlation of signals and polynomial decay 1 within correlation of noises. Right: polynomial decay 1 correlation of all covariates.

2.6 Application to Genome-wide Association Study

In this section we examine the p-value calculation for gGOF statistics with correlated input statistics in gene-based SNP-set association studies based on real data.

Two data sets were studied. The first is a GWAS data set of Crohn's disease from NIDDK-IBDGC [40]. It contains 1,145 individuals from a non-Jewish population (572 Crohn's disease cases and 573 controls). After typical quality control for genotype data, 308,330 somatic SNPs were grouped into 15,857 genes according to their physical locations. The logit regression model was applied to obtain the input statistics for gGOF. The controlling covariates Z contain an intercept and the first two principal components of the genotype data in order to control potential population structure

[41]. The second is a whole exome sequencing data set from the ALS Sequencing Consortium [64]; the data cleaning and single nucleotide variant (SNV) filtering process followed the same steps as in the original study. After quality control filtering, the data contained 457 ALS cases and 141 controls, with 105,764 SNVs in 17,088 genes. Two non-genetic categorical covariates, gender and country of origin (6 countries), were included as the controlling covariates Z in the association tests based on the logit model.

For both data analyses, no gGOF test was needed for genes containing only one SNP.

For analyzing all genes, Figure 2.17 gives the p-value calculations based on the independence assumption and the calculations that incorporate LD among SNPs (correlations calculated from the genotype data), i.e., the WAM and MM methods. In general,

WAM and MM successfully bring the QQ curves closer to the diagonal, indicating better type I error rate control. Meanwhile, compared with HC, BJ is more sensitive to correlations, so the correlation-incorporated calculation is more important for BJ. The ALS exome-sequencing data have weaker correlations than the Crohn's disease GWAS data, but the correlation-incorporated calculation still shows a benefit, especially for BJ.

Figure 2.17: QQ plots of the HC and BJ statistics. P-value calculations were based on the independence assumption vs. correlation-incorporated calculations by the WAM and MM methods. Top: Crohn's disease data. Bottom: ALS data. Left: HC test. Right: BJ test.

2.7 Discussion

This chapter provided a unified solution for a generic family of goodness-of-fit statistics in the analysis of correlated data. Compared to individual developments, such as GHC/GBJ, the study of a broad family possesses a few advantages. First, it allows immediate application of any related GOF statistic, new or old, to correlated data. Second, different statistics could have advantages in different situations, and therefore as a group the family retains high statistical power over a broad parameter space. For example, HC and GHC have the same or very close power under most circumstances. The relative merit of BJ and GBJ depends on signal patterns and correlation structures. Other methods such as reversed HC and reversed BJ could be preferred when signals are weak and dense. Third, because of the family-wise advantage, the double-omnibus gGOF, which adapts both the proper statistic function and the p-value truncation scheme for given data, can be robust over different situations. Even more interestingly, the omnibus test shares the same essential property of gGOF statistics. Thus it can be easily applied based on our unified p-value calculation method, without the typical need for computationally intensive simulations.

One challenge of studying correlated data is that it relies on the joint distribution of the input statistics. Many convenient properties, such as exchangeability under independence, are no longer applicable in the dependent case in general. This chapter illustrated how correlation structure and signal patterns work together to influence the SNR and thus the power of signal-detection methods. In particular, not only the signal magnitude but also the signal locations are relevant. Thus we emphasize the starting point for defining the innovated transformation (IT): the signal locations should be independent of the correlations before transformation. Under this condition, IT is often shown to be preferred in the context of generalized linear models, especially when signals are relatively sparse.

Following that, this chapter showed that under the GLM, the joint model-fitting is the starting point and the marginal model-fitting is the IT. Therefore, for analyzing correlated data, even though marginal model-fitting gives biased estimates of the coefficients, it often gives higher power for the signal-detection purpose. Meanwhile, through a comprehensive study of the SNR formulae in two dimensions, we demonstrated the relative advantages of the fundamental modeling strategies: marginal fitting, joint fitting, decorrelation, and statistic summation. For example, when signals are dense, marginal fitting could be harmed by signal cancellation in various situations, such as when the directions of signals and correlations are opposite in a certain way.

These theoretical results provide insightful guidance for practical data analysis applications. Based on these studies, we proposed to combine the double-adaptive omnibus test, the IT, and the gGOF statistics into a so-called digGOF strategy, which is a novel and significant methodology for broad data analysis applications.

Beyond the achievements of this chapter, there are a few future studies we plan to carry out. First, we studied the theoretical advantage of the IT in the case of sparse signals or in the case of bi-covariates. It would be interesting to discover rules for more general cases of dense signals. For example, in a specific application, if the signal pattern and data correlation structure can be reasonably determined or modeled, we could design proper data transformation and analysis strategies to improve power. Second, we assumed that the correlation matrix can be properly estimated. In real data analysis this requirement can be a challenge, for example in high-dimensional data where the number of covariates is much larger than the sample size. The problem may be easier when correlations are relatively sparse [65, 59], but otherwise it could be much more difficult. Third, we assumed that the input statistics follow a multivariate normal distribution, which is reasonable when the sample size is large. It is also of interest to study more general cases where the input statistics are non-normal or depart from normality, for example, when the sample size is small and observations are also dependent.

Chapter 3

TFisher Tests: Optimal and Adaptive Thresholding for Combining p-Values

3.1 Introduction

The p-value combination approach is an important statistical strategy for information-aggregated decision making. It is foundational to many applications such as meta-analysis, data integration, and signal detection. In this approach a group of input p-values Pi, i = 1, ..., n, are combined to form a single statistic for testing a property of the whole group. For example, in meta-analysis each p-value corresponds to the significance level of one study, and a group of similar studies and their p-values are combined to test a common scientific hypothesis. In the scenario of signal detection, each p-value is for one feature factor, and the p-values of a group of factors are combined to determine whether some of those factors are associated with a given outcome. In either scenario, regardless of the original data variation, p-values provide commonly scaled statistical evidence from various sources (i.e., studies or factors); therefore the p-value combination approach can be considered as combining information from different sources to make a reliable conclusion. Indeed, p-value combination can provide more power than non-combination methods. In signal detection, for example, weak signals could be detectable as a group but not recoverable as individuals [4, 58].

The question is how we should combine a given group of p-values. One of the earliest methods is Fisher's combination statistic proposed in the 1930s [66], which is simply the product of all p-values, or equivalently its monotonic log transformation:

$$T = \prod_{i=1}^{n} P_i \;\Leftrightarrow\; W = -2\log(T) = \sum_{i=1}^{n} -2\log(P_i). \qquad (3.1)$$
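As a quick illustration (not from the dissertation; function and variable names are ours), under H0 with independent uniform p-values the statistic W in (3.1) follows a chi-squared distribution with 2n degrees of freedom, so its combined p-value can be computed directly:

```python
import numpy as np
from scipy import stats

def fisher_combination(pvalues):
    """Fisher's combination: W = -2 * sum(log(P_i)) ~ chi-square(2n) under H0."""
    p = np.asarray(pvalues, dtype=float)
    w = -2.0 * np.sum(np.log(p))
    return w, stats.chi2.sf(w, df=2 * len(p))  # survival function gives the combined p-value

w, p_comb = fisher_combination([0.001, 0.02, 0.3, 0.5, 0.7, 0.8, 0.15, 0.6, 0.9, 0.45])
print(w, p_comb)
```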

Fisher's combination enjoys asymptotic optimality over any possible way of combining p-values when all p-values represent "signals", e.g., all studies are positive or all features are associated [67, 68]. In this sense, the log-transformation of Fisher's combination is superior to other transformation functions, e.g., the inverse Gaussian Z-transformation [69, 70]. However, in real applications it is often the case that only part of the p-values are related to signals. One example is the meta-analysis of differential gene expression, where the positive outcomes could happen in only one or some of the studies [28]. Another example is detecting genetic associations for a group of genetic markers, where some of these markers are associated but others are not [26, 71]. In fact, it has been shown that when true signals are in a very small proportion, e.g., at the level of n^{-α} with α ∈ (3/4, 1), an optimal choice is to simply use the minimal p-value as the statistic [72]. However, the minimal p-value may no longer be optimal for denser weak signals, e.g., under α ∈ (1/2, 3/4) [4, 6]. Thus, between the two ends of the classic methods (the optimality of Fisher's combination for very dense signals and the optimality of the minimal p-value method for very sparse signals), a straightforward idea is to combine a subgroup of smaller p-values that more likely represent true signals. Following this idea, several styles of truncation methods were proposed. For example, the truncated product method (TPM) statistic is defined as

[73, 74]:

$$T = \prod_{i=1}^{n} P_i^{I(P_i \le \tau)} \;\Leftrightarrow\; W = -2\log(T) = \sum_{i=1}^{n} -2\log(P_i)\, I(P_i \le \tau), \qquad (3.2)$$

where I(·) is the indicator function and τ is the threshold of truncation. A variation of TPM is called the rank truncation product (RTP) method, in which τ is set as the kth smallest p-value for a given k [75, 76].

Truncation-based methods have been widely applied in various practical studies and have shown desirable performance. For example, many papers have been published in genome-wide association studies ([75, 77, 27, 78, 79], and others). However, there is a lack of theoretical study on the best choice of τ. Two ad hoc intuitions have been considered. One is a "natural" choice of τ = 0.05, the value of a typical significance level in a single hypothesis test [73]. The other intuition is to take τ as the true proportion of signals. In Sections 3.5 and 3.6 of this chapter, however, we will show that in general neither of the two intuitions gives the best choice of τ.

Moreover, even if we can get the best τ for TPM, would it be an optimal statistic? The answer is still no. In fact, besides truncation, the statistical power can be improved through properly weighting the p-values. In this chapter, we propose a general weighting and truncation framework through a family of statistics called TFisher.

We provide accurate analytical calculations for both the p-value and the statistical power of TFisher under general hypotheses. For the signal detection problem, the theoretical optimality of the truncation and weighting schemes is systematically studied based on Bahadur Efficiency (BE), as well as a more sophisticated measure, Asymptotic Power Efficiency (APE), proposed here. The results show that in a large parameter space, TPM and RTP are not optimal; the optimal method coordinates weighting and truncation in a soft-thresholding manner. This result provides an interesting connection to a rich literature of shrinkage and penalty methods in the context of de-noising and model selection [80, 81].

When prior information on signal patterns is unavailable, an omnibus test, called oTFisher, is proposed to obtain a data-adaptive weighting and truncation scheme. In general, an omnibus test does not guarantee the highest power for all signal patterns, but it often provides a robust solution that performs reasonably well in most scenarios. In the literature, omnibus tests mostly depend on computationally intensive simulations or permutations [77, 27, 82, 83]. In order to reduce the computation and improve stability and accuracy, we provide an analytical calculation for determining the statistical significance of oTFisher.

The remainder of the chapter is organized as follows. The problem formulation is given in Section 3.2, where the definitions of TFisher and the settings of hypotheses are clarified. For the whole TFisher family under finite n, we provide analytical calculations for the p-values in Section 3.3 and for the statistical power in Section 3.4. Theoretical studies of optimality based on BE and APE are given in Section 3.5. With extensive simulations, Section 3.6 demonstrates that our analytical calculations are accurate and that our theoretical studies reflect reality well. Section 3.7 shows an application of the TFisher tests to analyzing whole exome sequencing data to find putative disease genes of amyotrophic lateral sclerosis. Concluding remarks are given in Section 3.8. Detailed proofs of lemmas and theorems and the supplementary figures are given in Appendix C.

3.2 TFisher Tests and Hypotheses

3.2.1 TFisher

With the input p-values Pi, i = 1, ..., n, the TFisher family extends Fisher's p-value combination to a general weighting and truncation scheme. The general formula of TFisher statistics can be equivalently written as

$$T = \prod_{i=1}^{n}\left(\frac{P_i}{\tau_{2i}}\right)^{I(P_i \le \tau_1)} \;\Leftrightarrow\; W = -2\log T = \sum_{i=1}^{n}\left(-2\log(P_i) + 2\log(\tau_{2i})\right) I(P_i \le \tau_1), \qquad (3.3)$$

where τ1 is the truncation parameter that excludes overly large p-values and the τ2i are the weighting parameters for the p-values. This statistic family unifies a broad range of p-value combination methods. When τ1 = τ2i = 1, the statistic is the traditional Fisher's combination statistic. When τ1 ∈ (0, 1) and τ2i = 1, it becomes the truncated product method (TPM) [73]. When τ1 = P(k) and τ2i = 1 for a given k, where P(1) ≤ ... ≤ P(n) are the ordered input p-values, it becomes the rank truncation product method (RTP) [75]. When τ1 = 1 and $\tau_{2i} = P_i^{1-\lambda_i}$, it leads to the power-weighted p-value combination statistic $T = \prod_{i=1}^{n} P_i^{\lambda_i}$ [84, 27]. For simplicity of the theoretical studies, in what follows we restrict to constant parameters τ1 and τ2i = τ2. Such a two-parameter definition corresponds to the dichotomous mixture model in the classic signal detection setting, such as those specified in (3.10) and (3.22).

The weighting and truncation scheme is also related to thresholding methods in a rich literature of shrinkage estimation, de-noising, and model selection [81, 80]. In particular, when τ1 = τ and τ2 = 1, TFisher corresponds to the hard-thresholding (i.e., TPM):

$$W_h = \sum_{i=1}^{n} \left(-2\log(P_i)\right) I(P_i \le \tau_1). \qquad (3.4)$$

When τ1 = τ2 = τ, TFisher is a soft-thresholding method:

$$W_s = \sum_{i=1}^{n} \left(-2\log(P_i) + 2\log(\tau)\right)_{+}, \qquad (3.5)$$

where (x)+ = max{x, 0}. Soft-thresholding could have three benefits over hard-thresholding here. First, a value τ2 ∈ (0, 1) downscales the significance of the original p-values, which could reduce the type I error rate in the related context of multiple hypothesis testing. Second, even though E_{H1}(Ws) − E_{H0}(Ws) < E_{H1}(Wh) − E_{H0}(Wh), Ws has a much smaller variance, which could make it more powerful than Wh. Third, Ws has a better weighting scheme for small p-values. To see this point, Figure 3.1 illustrates that the hard-thresholding scheme, represented by the curve −2 log(Pi)I(Pi ≤ τ), is discontinuous at the cutoff τ. In contrast, the soft-thresholding scheme 2(−log(Pi) + log(τ))+ is pushed down to a smoothed curve. The more steeply dropping curve of the soft-thresholding gives relatively heavier weights to smaller p-values, which are more likely associated with true signals. In Section 3.5, we will provide theoretical results on the functional relationship between signal patterns and the optimal τ1 and τ2. Soft-thresholding will be shown to be mostly optimal, which is consistent with the conclusion in shrinkage analysis [81].
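For concreteness, a minimal sketch (our own, using plain NumPy; names are illustrative) that evaluates the hard-thresholding statistic (3.4) and the soft-thresholding statistic (3.5) on the same p-values:

```python
import numpy as np

def tfisher_stat(pvalues, tau1, tau2):
    """TFisher statistic W in (3.3) with constant truncation tau1 and weight tau2."""
    p = np.asarray(pvalues, dtype=float)
    kept = p[p <= tau1]
    return float(np.sum(-2.0 * np.log(kept / tau2)))

p = np.array([0.001, 0.02, 0.04, 0.3, 0.6, 0.9])
w_hard = tfisher_stat(p, tau1=0.05, tau2=1.0)   # hard-thresholding, i.e., TPM (3.4)
w_soft = tfisher_stat(p, tau1=0.05, tau2=0.05)  # soft-thresholding (3.5)
print(w_hard, w_soft)
```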

Figure 3.1: Comparison between the hard-thresholding curve −2 log(Pi)I(Pi ≤ τ) (black) and the soft-thresholding curve 2(−log(Pi) + log(τ))+ (green dot). τ = 0.5.

When there is no prior information on the signal pattern, the optimal τ1 and τ2 are difficult to determine. However, we can apply an omnibus test, called oTFisher, which adapts the choice of these parameters to the given data. oTFisher does not guarantee the highest power, but it often provides a robust test that performs reasonably well over most signal patterns. In general, oTFisher adaptively chooses the τ1 and τ2 that give the smallest p-value over the space of (0, 1] × (0, +∞):

$$W_o = \min_{\tau_1, \tau_2} G_{\tau_1, \tau_2}\left(W(\tau_1, \tau_2)\right),$$

where G_{τ1,τ2} is the survival function of W(τ1, τ2) defined in (3.3) under the null hypothesis. For practical computation, we study a discrete domain over (τ1j, τ2j), j = 1, ..., m:

$$W_o = \min_{j} G_j(W_j). \qquad (3.6)$$

As we will show in theory and in simulations, a grid of τ1j = τ2j over small, intermediate, and large values in (0, 1) could perform sufficiently well in most cases.

3.2.2 Hypotheses

To answer the key question of how p-values should be combined, we keep in mind that the performance, in particular the statistical power, of different methods depends on the setting of the null and alternative hypotheses. A general setting for the group testing problem is given in the following. For independent and identically distributed (i.i.d.) input statistics X1, ..., Xn, we aim at testing the null and alternative hypotheses:

$$H_0: X_i \sim F_0 \text{ for all } i \quad \text{vs.} \quad H_1: X_i \sim F_1 \text{ for all } i, \qquad (3.7)$$

where Fj, j = 0, 1, denote arbitrary continuous cumulative distribution functions (CDFs). Based on the given H0, the corresponding input p-values are

$$P_i = \bar{F}_0(X_i), \qquad (3.8)$$

where \bar{F}_0 = 1 − F0 denotes the survival function of the null distribution. Note that the one-sided p-value definition in (3.8) actually covers two-sided tests too. This is because F0 is arbitrary, and the statistics can simply be replaced by $X_i' = X_i^2 \sim F_0'$ whenever the signs of the input statistics have meaningful directionality (e.g., protective and deleterious effects of mutations in genetic association studies). Also note that the i.i.d. assumption in (3.7) is for the convenience of power calculation. If the p-value calculation of TFisher is the only concern in a data analysis, the null hypothesis can be generalized to

$$H_0: \text{independent } T_i \sim F_{0i}, \quad \text{or equivalently,} \quad H_0: P_i \overset{\text{i.i.d.}}{\sim} \text{Uniform}[0, 1], \quad i = 1, ..., n. \qquad (3.9)$$

That is, the TFisher tests can be applied to meta-analysis or integrative analysis of heterogeneous data, where the input test statistics could potentially follow different distributions.

A particularly interesting scenario is the signal detection problem, where the target is to test the existence of "signals" in a group of statistics. Usually the test statistics follow, or can be approximated by, the Gaussian distribution. Thus the problem is to test the null hypothesis of all "noises" versus the alternative hypothesis that a proportion of signals exist:

$$H_0: X_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2) \quad \text{vs.} \quad H_1: X_i \overset{\text{i.i.d.}}{\sim} \epsilon N(\mu, \sigma^2) + (1 - \epsilon) N(0, \sigma^2), \quad i = 1, ..., n. \qquad (3.10)$$

Here the zero mean indicates the noise, the non-zero mean µ represents the signal strength, and ε ∈ (0, 1] represents the proportion of signals. The signal patterns are characterized by the parameter space of (ε, µ). For simplicity, we assume the variance σ² is known or can be accurately estimated, which is equivalent to assuming σ = 1 without loss of generality (otherwise, the data can be rescaled by σ).
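To make the setting concrete, a small simulation sketch (our own illustration, not part of the dissertation) that draws statistics from the mixture in (3.10) with σ = 1 and converts them to input p-values:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def simulate_pvalues(n, eps, mu, two_sided=True):
    """Draw X_i from eps*N(mu,1) + (1-eps)*N(0,1) and return the input p-values."""
    is_signal = rng.random(n) < eps
    x = rng.normal(loc=np.where(is_signal, mu, 0.0), scale=1.0)
    return 2 * stats.norm.sf(np.abs(x)) if two_sided else stats.norm.sf(x)

p_null = simulate_pvalues(n=50, eps=0.0, mu=0.0)  # pure noise (H0)
p_alt  = simulate_pvalues(n=50, eps=0.1, mu=2.0)  # 10% signals of strength 2
```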

3.3 TFisher Distribution Under H0

In this section we provide the calculation for the exact null distribution of TFisher in (3.3) when τ1 and τ2 are given. Based on that, an asymptotic approximation for the null distribution of oTFisher in (3.6) is also provided. Thus the p-values of TFisher and oTFisher can be quickly and accurately calculated in practical applications.

3.3.1 Exact Distribution at Given τ1 and τ2

Consider the general null hypothesis in (3.9). Let Ui be i.i.d. Uniform[0, 1], i = 1, ..., n, and let N be the number of the Ui less than or equal to τ1. The TFisher statistic in (3.3) can be written as

$$W(\tau_1, \tau_2) = \sum_{i=1}^{N} -2\log\left(\frac{\tau_1}{\tau_2}\, U_i\right).$$

For a fixed positive integer k ≥ 1, it is easy to check that

$$P\left(\sum_{i=1}^{k} -2\log\left(\frac{\tau_1}{\tau_2}\, U_i\right) \ge w\right) = \bar{F}_{\chi^2_{2k}}\left(w + 2k\log\frac{\tau_1}{\tau_2}\right),$$

where \bar{F}_{\chi^2_{2k}}(x) is the survival function of a chi-squared distribution with 2k degrees of freedom. Since N ∼ Binomial(n, τ1), W can be viewed as a compound of this shifted chi-squared distribution and the binomial distribution:

$$P(W \ge w) = (1-\tau_1)^n I_{\{w \le 0\}} + \sum_{k=1}^{n} \binom{n}{k} \tau_1^k (1-\tau_1)^{n-k}\, \bar{F}_{\chi^2_{2k}}\left(w + 2k\log\frac{\tau_1}{\tau_2}\right).$$

We can further simplify the above formula by noting the relationship between \bar{F}_{\chi^2_{2k}}(x) and the upper incomplete gamma function Γ(s, x):

$$\bar{F}_{\chi^2_{2k}}(x) = \int_x^{+\infty} \frac{u^{k-1} e^{-u/2}}{2^k (k-1)!}\, du = \int_{x/2}^{+\infty} \frac{y^{k-1} e^{-y}}{(k-1)!}\, dy = \frac{\Gamma(k, x/2)}{(k-1)!} = e^{-x/2}\sum_{j=0}^{k-1}\frac{(x/2)^j}{j!}.$$

Finally, the survival function of W is given by

$$P(W \ge w) = (1-\tau_1)^n I_{\{w\le 0\}} + \sum_{k=1}^{n}\binom{n}{k}\tau_1^k(1-\tau_1)^{n-k}\, e^{-w/2}\left(\frac{\tau_2}{\tau_1}\right)^k \sum_{j=0}^{k-1}\frac{\left[w/2 + k\log(\tau_1/\tau_2)\right]^j}{j!} = (1-\tau_1)^n I_{\{w\le 0\}} + e^{-w/2}\sum_{k=1}^{n}\sum_{j=0}^{k-1}\binom{n}{k}\tau_2^k(1-\tau_1)^{n-k}\frac{\left[w + 2k\log(\tau_1/\tau_2)\right]^j}{(2j)!!}. \qquad (3.11)$$

Note that the formula is not continuous in the first term because of the truncation at τ1. Also, as a special case, for the soft-thresholding statistic with τ1 = τ2 = τ we have

$$P(W_s \ge w) = (1-\tau)^n I_{\{w\le 0\}} + e^{-w/2}\sum_{k=1}^{n}\sum_{j=0}^{k-1}\binom{n}{k}\tau^k(1-\tau)^{n-k}\frac{w^j}{(2j)!!}.$$

For given τ1 and τ2, this p-value calculation is exact. As evidenced by simulations, Figure 3.2 shows that formula (3.11) provides a perfect null distribution curve for the TFisher family W in (3.3).
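A direct numerical transcription of (3.11) (our own sketch; for very small tail probabilities or large n, a log-scale implementation would be safer):

```python
import numpy as np
from scipy import stats
from scipy.special import comb

def tfisher_sf(w, n, tau1, tau2):
    """Exact null survival function P(W >= w) of the TFisher statistic, formula (3.11)."""
    if w <= 0:
        return 1.0
    total = 0.0
    for k in range(1, n + 1):
        binom_weight = comb(n, k) * tau1 ** k * (1.0 - tau1) ** (n - k)
        shift = w + 2.0 * k * np.log(tau1 / tau2)
        total += binom_weight * stats.chi2.sf(shift, df=2 * k)  # sf(x) = 1 for x <= 0
    return total

# p-value of a soft-thresholding TFisher statistic (tau1 = tau2 = 0.05) with n = 10 inputs
print(tfisher_sf(w=20.0, n=10, tau1=0.05, tau2=0.05))
```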

3.3.2 Calculation for Omnibus Test

For the omnibus test oTFisher in (3.6), noting that Gj is monotone, we have

$$P\left(\min_j G_j(W_j) > t\right) = P\left(W_j(P_1, ..., P_n) < w_j, \; j = 1, ..., m\right), \qquad (3.12)$$

Figure 3.2: The right-tail distribution curve of W(τ1, τ2) under H0. Left panel: (τ1, τ2) = (0.05, 0.05); Right panel: (τ1, τ2) = (0.25, 0.75). Simulation: curve obtained by 10^4 simulations; Exact: by formula (3.11).

where for each j and given (τ1j, τ2j), the exact value of $w_j \equiv G_j^{-1}(t)$ can be calculated by (3.11). These Wj's are functions of the same set of input p-values, and therefore they are dependent on each other. Fortunately, since $W_j = \sum_{i=1}^{n} -2\log(P_i/\tau_{2j})\, I(P_i < \tau_{1j})$, by the Central Limit Theorem (CLT) the statistics (W1, ..., Wm) asymptotically follow the multivariate normal (MVN) distribution with mean vector µ = (µ1, ..., µm) and covariance matrix Σ, where

$$\mu_j = E(W_j) = 2n\tau_{1j}\left(1 + \log(\tau_{2j}/\tau_{1j})\right),$$

and

$$\Sigma_{jk} = \mathrm{Cov}(W_j, W_k) = 4n\tau_{1jk} + 4n\left[\tau_{1jk}\left(1 + \log\frac{\tau_{2j}}{\tau_{1jk}}\right)\left(1 + \log\frac{\tau_{2k}}{\tau_{1jk}}\right) - \tau_{1j}\tau_{1k}\left(1 + \log\frac{\tau_{2j}}{\tau_{1j}}\right)\left(1 + \log\frac{\tau_{2k}}{\tau_{1k}}\right)\right], \qquad (3.13)$$

where τ1jk = min{τ1j, τ1k}. Note that under the special case of the soft-thresholding with τ1j = τ2j = τj, the two formulas can be readily simplified (assuming τj ≤ τk) as

$$\mu_j = 2n\tau_j, \qquad \Sigma_{jk} = 4n\tau_j\left(2 - \tau_k + \log\frac{\tau_k}{\tau_j}\right).$$

Thus we can approximate the p-value of oTFisher by the asymptotic distribution of the Wj's:

$$P\left(\min_j G_j(W_j) > w_o\right) \approx P\left(W_j' < w_j, \; j = 1, ..., m\right), \qquad (3.14)$$

where (W_j') ∼ MVN(µ, Σ), and µ and Σ are given in (3.13). The multivariate normal probabilities can be efficiently computed, e.g., by [85]. Figure 3.3 shows the left-tail probability of Wo, which corresponds to the p-value because a smaller Wo indicates stronger evidence against the null. The figure shows that the calculation method is accurate even for small n, and the accuracy improves as n increases. The calculation is slightly conservative, which guarantees that the type I error rate will be sufficiently controlled in real applications.
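A sketch of the approximation (3.14) for a soft-thresholding grid (our own illustration; it reuses the `tfisher_sf` function sketched in Section 3.3.1, and the grid values are arbitrary):

```python
import numpy as np
from scipy import stats
from scipy.optimize import brentq

def otfisher_pvalue(pvalues, taus):
    """Approximate p-value of the soft-thresholding oTFisher (3.6) via the MVN approximation (3.14)."""
    p, taus = np.asarray(pvalues, float), np.asarray(taus, float)
    n = len(p)
    # observed statistics and the minimum of their exact marginal survival probabilities G_j(W_j)
    w_obs = np.array([np.sum(-2.0 * np.log(p[p <= t] / t)) for t in taus])
    w_o = min(tfisher_sf(w, n, t, t) for w, t in zip(w_obs, taus))
    # critical values w_j = G_j^{-1}(w_o) by root-finding on the exact survival function
    def crit(t):
        if tfisher_sf(1e-8, n, t, t) <= w_o:   # G_j already below w_o for any positive W_j
            return 1e-8
        return brentq(lambda w: tfisher_sf(w, n, t, t) - w_o, 1e-8, 1e5)
    w_crit = np.array([crit(t) for t in taus])
    # MVN approximation with the soft-thresholding mean/covariance from (3.13)
    mu = 2.0 * n * taus
    tmin, tmax = np.minimum.outer(taus, taus), np.maximum.outer(taus, taus)
    sigma = 4.0 * n * tmin * (2.0 - tmax + np.log(tmax / tmin))
    joint = stats.multivariate_normal(mean=mu, cov=sigma, allow_singular=True).cdf(w_crit)
    return 1.0 - joint   # p-value = P(min_j G_j(W_j) <= w_o)

rng = np.random.default_rng(1)
print(otfisher_pvalue(rng.uniform(size=50), taus=[0.01, 0.05, 0.5, 1.0]))
```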

Figure 3.3: The left-tail null distribution of Wo over τ1j = τ2j = τj ∈ {0.1, 0.2, ..., 1}. Simulation: curve obtained by 10^4 simulations; Approx.: by calculation in (3.11).

3.4 TFisher Distribution Under General H1

In this section we provide a methodology for calculating the distribution of TFisher in (3.3) under the general H0 and H1 in (3.7), and thus the statistical power. Even though the calculation is derived asymptotically, it possesses high accuracy for small to moderate n.

For any given CDF F0 or F1 in (3.7), we define a monotone transformation function on [0, 1]:

$$D(x) = \begin{cases} x & \text{under } H_0: F_0, \\ \bar{F}_1\left(\bar{F}_0^{-1}(x)\right) & \text{under } H_1: F_1 \ne F_0. \end{cases} \qquad (3.15)$$

For any random p-value Pi in (3.8), we have D(Pi) ∼ Uniform[0, 1] under either H0 or H1. Furthermore, we define the function

$$\delta(x) = D(x) - x, \qquad (3.16)$$

which provides a metric for the difference between H0 and H1. For example, for any level α test, δ(α) represents the difference between the statistical power and the size. For any random p-value P, δ(P) measures a stochastic difference between the p-value distribution under H0 versus that under H1.

The TFisher statistic can be written as

$$W = \sum_{i=1}^{n} -2\log\left(\frac{P_i}{\tau_2}\right) I_{(P_i \le \tau_1)} = \sum_{i=1}^{n} Y_i, \qquad (3.17)$$

where $Y_i \equiv -2\log\left(\frac{D^{-1}(U_i)}{\tau_2}\right) I_{(D^{-1}(U_i) \le \tau_1)}$, and the Ui = D(Pi) are i.i.d. Uniform[0, 1].

For arbitrary F0 and F1, the D function could be complicated and exact calculation could be difficult. Here we propose an asymptotic approximation for the distribution of W under H1. Note that since W is a sum of i.i.d. random variables, it is asymptotically normal by the CLT. However, for small to moderate n and for a small truncation parameter τ1, the normal approximation is not very accurate. Here we use a three-parameter (ξ, ω, α) skew normal distribution (SN) to accommodate the departure from normality [86]. Specifically, we approximate W by

$$W \overset{D}{\approx} SN(\xi, \omega, \alpha),$$

where the probability density function of the SN is

$$f(x) = \frac{2}{\omega}\,\phi\!\left(\frac{x-\xi}{\omega}\right)\Phi\!\left(\alpha\,\frac{x-\xi}{\omega}\right),$$

with φ and Φ being the probability density function and the CDF of N(0, 1), respectively. The parameters (ξ, ω, α) are obtained by solving the equations of the first three moments:

$$\xi = \mu - \left(\frac{2\mu_3}{4-\pi}\right)^{1/3}, \qquad \omega = \sqrt{\sigma^2 + \left(\frac{2\mu_3}{4-\pi}\right)^{2/3}}, \qquad \alpha = \mathrm{sgn}(\mu_3)\sqrt{\frac{\pi(2\mu_3)^{2/3}}{2\sigma^2(4-\pi)^{2/3} + (2-\pi)(2\mu_3)^{2/3}}},$$

where

$$\mu = E(W) = nE(Y_1), \qquad \sigma^2 = \mathrm{Var}(W) = n\left[E(Y_1^2) - E^2(Y_1)\right], \qquad \mu_3 = E\left(W - E(W)\right)^3 = n\,E\left[\left(Y_1 - E(Y_1)\right)^3\right],$$

with

$$E\,Y_1^k = \int_0^{D(\tau_1)} \left(-2\log\frac{D^{-1}(u)}{\tau_2}\right)^k du, \qquad k = 1, 2, 3.$$
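For the Gaussian mixture alternative (3.10) with one-sided input p-values, D'(u) has a closed form, so the moments above reduce to one-dimensional quadrature. A sketch of the resulting power approximation (our own illustration; it reuses the `tfisher_sf` sketch from Section 3.3.1 for the critical value and assumes µ3 > 0 so the cube root is real):

```python
import numpy as np
from scipy import stats, integrate, optimize

def sn_power(n, eps, mu, tau1, tau2, alpha=0.05):
    """Skew-normal approximation of TFisher power under the Gaussian mixture H1 (3.10)."""
    # density of the p-value under H1: D'(t) = eps * exp(mu*z - mu^2/2) + (1 - eps), z = Phi^{-1}(1 - t)
    d_prime = lambda t: eps * np.exp(mu * stats.norm.isf(t) - mu ** 2 / 2.0) + (1.0 - eps)
    # first three raw moments of Y_1 = (-2 log(P/tau2)) I(P <= tau1) under H1
    ey = [integrate.quad(lambda t, k=k: (-2.0 * np.log(t / tau2)) ** k * d_prime(t), 0.0, tau1)[0]
          for k in (1, 2, 3)]
    m, s2 = n * ey[0], n * (ey[1] - ey[0] ** 2)
    m3 = n * (ey[2] - 3.0 * ey[0] * ey[1] + 2.0 * ey[0] ** 3)
    # moment-matched skew-normal parameters (xi, omega, a)
    c = (2.0 * m3 / (4.0 - np.pi)) ** (1.0 / 3.0)
    xi, omega = m - c, np.sqrt(s2 + c ** 2)
    a = np.sign(m3) * np.sqrt(np.pi * c ** 2 / (2.0 * s2 + (2.0 - np.pi) * c ** 2))
    # level-alpha critical value from the exact null formula (3.11)
    w_crit = optimize.brentq(lambda w: tfisher_sf(w, n, tau1, tau2) - alpha, 1e-8, 1e5)
    return stats.skewnorm(a, loc=xi, scale=omega).sf(w_crit)

print(sn_power(n=50, eps=0.1, mu=2.0, tau1=0.05, tau2=0.05))
```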

As shown by Figure 3.4, the SN approximation for calculating statistical power is accurate even for small n and τ1. We have also studied other distribution-approximation techniques, including the generalized normal distribution [87, 88], the first- and second-order Edgeworth expansions [89], the saddlepoint approximation [90, 91], etc. Based on our simulation results (not reported in this chapter to save space), we note that the SN approximation provides better accuracy for calculating the power of TFisher with small τ1 under small n.

Figure 3.4: The right-tail distribution of W under the alternative hypotheses of the Gaussian mixture in (3.10). Left panel: (τ1, τ2) = (0.05, 0.05); right panel: (0.10, 0.25). Simulation: curve obtained by 10^4 simulations; Approx. SN: by the skew-normal approximation; Approx. N: by the normal approximation.

3.5 Asymptotic Optimality for Signal Detection

In this section, we study the asymptotic performance and optimality within the TFisher family in (3.3). The subscript n is explicitly added to indicate that the asymptotics is driven by n → ∞. Overall, the studies of BE and APE consistently conclude that the soft-thresholding with τ1 = τ2 is optimal or close to optimal in a broad space of the signal parameters (ε, µ), whereas Fisher's method (i.e., no truncation) and TPM (i.e., hard-thresholding) are not. The functional relationship between the optimal (τ1*, τ2*) and (ε, µ) by APE better reflects the patterns of statistical power in real data analysis than that by BE.

3.5.1 Properties Based on Bahadur Efficiency

BE was first introduced by [92] to study the large-sample properties of test statistics. Consider a test Tn = T(X1, ..., Xn), where X1, ..., Xn are random samples. Denote Ln(t) = P_{H0}(Tn > t) as the survival function of Tn under H0, and Ln(t|θ) as the survival function under H1. Under H1, if

$$\lim_{n\to\infty} -\frac{2}{n}\log L_n(T_n \mid \theta) = c_T(\theta) \in (0, \infty), \qquad (3.18)$$

we call the constant cT(θ) the Bahadur Efficiency (BE, or Bahadur exact slope) of Tn [93]. Since Ln(Tn|θ) is actually the p-value under H1, cT(θ) suggests how quickly the p-value decays to zero. Thus, BE indicates how much the null and alternative distributions of Tn are separated in an asymptotic sense. It is also related to the minimal sample size n that is necessary for the test to reach a given statistical power at a given significance level [94]. If another test T' has c_{T'}(θ) > c_T(θ), then T' is said to be Bahadur asymptotically more efficient than Tn. Here, for the signal detection problem defined in (3.10), the parameter θ is the vector (ε, µ).

Note that under the hypothesis settings in (3.7) and (3.10), the input statistics X1, ..., Xn can be regarded as the input samples for the p-value combination tests, e.g., in (3.3). Thus the number n of tests to be combined can be regarded as the sample size n in the Bahadur asymptotics given in (3.18). This setting is similar to some BE studies for p-value combination methods (e.g., [95]), but different from others where the input statistics Xi are related to the sample size (e.g., [67, 68]).

To calculate cT(θ), one can apply a composition method (cf. Theorem 1.2.2 in [93]). Specifically, if (i) Tn converges in probability to g(θ) under H1, and (ii) the tail property of the p-value under H0 satisfies lim_{n→∞} −(2/n) log Ln(t) = f(t), where f(t) is continuous on an open interval I and g(θ) ∈ I for all θ under H1, then cT(θ) = f(g(θ)). Note that the convergence of Tn to g(θ) in probability under H1 implies that the variance of Tn converges to 0 under H1. Thus BE contains the variance information only under H0. We state this important property as a remark.

Remark 1. Bahadur efficiency does not incorporate information on the variance of the statistic under H1.

Now we calculate the BE of any TFisher statistic Wn(τ1, τ2) in (3.3). Considering an equivalent test statistic Tn = Wn/n and following (3.17) and the Law of Large Numbers, under H1 we have

$$\frac{W_n}{n} \overset{P}{\to} E_1 = E_1(Y_i) = \int_0^{\tau_1} -\log\left(\frac{u}{\tau_2}\right) D'(u)\,du.$$

Note that

$$P(W_n/n > t) = P\left(\frac{\frac{1}{n}\sum_i Y_i - E_0}{\sqrt{V_0/n}} > \frac{t - E_0}{\sqrt{V_0/n}}\right),$$

where E0 and V0 denote the mean and variance of Yi under H0, respectively:

$$E_0 = E_{H_0}(Y_i) = \tau_1\left(1 - \log\tau_1 + \log\tau_2\right), \qquad V_0 = \mathrm{Var}_{H_0}(Y_i) = \tau_1\left(1 + (1-\tau_1)(1 - \log\tau_1 + \log\tau_2)^2\right). \qquad (3.19)$$

Considering the statistic under H0, by the CLT and Mill's ratio we have

$$\lim_{n\to\infty} -\frac{2}{n}\log P(W_n/n > t) = \frac{(t - E_0)^2}{V_0}.$$

Thus the BE of Wn is

$$c(\epsilon, \mu; \tau_1, \tau_2) = \frac{(E_1 - E_0)^2}{V_0} = \frac{\Delta^2}{V_0}. \qquad (3.20)$$

The signal parameters ε and µ are involved through the D'(u) function in the expression of E1. The formula does not contain information on the variance of the statistic under H1, as stated in Remark 1.

The BE-optimal τ1 and τ2 are the ones that maximize c(ε, µ; τ1, τ2). Under the general hypotheses in (3.7), based on the metric δ(x) for the difference between H0 and H1 defined in (3.16), Lemma 1 gives a loose condition for the soft-thresholding to be "first-order optimal" in the sense that it reaches a stationary point of the maximization. It means that in the very general case of an arbitrary H1, the soft-thresholding with τ1 = τ2 may provide a promising choice for constructing a powerful test.

Lemma 1. Consider the TFisher statistics Wn(τ1, τ2) in (3.3) under the general hypotheses in (3.7). With δ(x) in (3.16), if τ* is the solution of the equation

$$\int_0^x \log(u)\,d\delta(u) = \delta(x)\left(\log(x) - \frac{2-x}{1-x}\right), \qquad (3.21)$$

then the soft-thresholding with τ1 = τ2 = τ* satisfies the first-order conditions for maximizing c(ε, µ; τ1, τ2) in (3.20).

Equation (3.21) can be easily checked, and it is often satisfied in broad cases, e.g., the signal detection problem defined by the Gaussian mixture model in (3.10). However, before obtaining the specific maximizers τ1* and τ2* of BE, we study their first-order property in a more general case than the Gaussian: hypotheses based on a general mixture model with arbitrary continuous CDFs G0 and G1:

$$H_0: X_i \overset{\text{i.i.d.}}{\sim} G_0 \quad \text{vs.} \quad H_1: X_i \overset{\text{i.i.d.}}{\sim} \epsilon G_1 + (1 - \epsilon) G_0, \qquad (3.22)$$

where the proportion ε ∈ (0, 1) can be considered as the signal proportion. Lemma 2 gives a somewhat surprising result: the maximizers τ1* and τ2* of BE are irrelevant to ε.

Lemma 2. Consider the TFisher statistics Wn(τ1, τ2) in (3.3) under the hypotheses of the mixture model in (3.22); the maximizers τ1* and τ2* of c(ε, µ; τ1, τ2) do not depend on ε.

The result of Lemma 2 becomes not so surprising if we consider the limitation of BE as stated in Remark 1. In particular, the denominator V0 of BE in (3.20) represents the variation of the test under H0, which is irrelevant to ε. BE is related to H1 only through the difference of the means E1 − E0, which is proportional to ε in the same way no matter what τ1 and τ2 are.

For the signal detection problem defined by the Gaussian mixture model in (3.10), Theorem 3.5.1 gives a sufficient condition that guarantees the soft-thresholding will reach a local maximum.

Theorem 3.5.1. Consider the TFisher statistics Wn(τ1, τ2) in (3.3) under the signal detection problem in (3.10), and follow the same notation as in Lemma 1. It can be shown that the solution τ* of equation (3.21) exists for any ε ∈ (0, 1) and µ > 0.85. Furthermore, if τ* also satisfies the condition

$$\frac{\delta(\tau^*)}{\delta'(\tau^*)} = \frac{1 - \tau^* - \Phi\left(\Phi^{-1}(1-\tau^*) - \mu\right)}{e^{\mu\Phi^{-1}(1-\tau^*) - \mu^2/2} - 1} > 2 - \tau^*,$$

then the soft-thresholding with τ1 = τ2 = τ* guarantees a local maximum of c(ε, µ; τ1, τ2) in (3.20). In particular, τ* > \bar{\Phi}(µ/2) satisfies the above condition.

Theorem 3.5.1 illustrates that for the signal detection problem, if µ is not too small, the optimal τ* can be calculated. The theorem does not guarantee that the maximum is unique. However, since we have the closed form of BE in (3.20), we can always study its properties numerically. Fixing ε = 0.5, for µ = 0.5, 1, and 1.5, Figure 3.5 gives the numerical values of c(ε, µ; τ1, τ2) over a grid of τ1 ∈ (0, 1) and τ2 ∈ (0, 3) with step size 0.01. It shows that the local maximum is unique. More numerical studies under various setups of µ and ε also confirm that the maximum is likely unique (results not shown here to save space).
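A minimal numerical sketch of such a grid study (our own, not the dissertation's code) evaluates (3.19) and (3.20) for the Gaussian mixture, using one-dimensional quadrature for E1, on a coarser grid than Figure 3.5 to keep the run short. Since BE is invariant to a common rescaling of Yi, the per-term mean and variance here follow the same convention as (3.19):

```python
import numpy as np
from scipy import stats, integrate

def bahadur_efficiency(eps, mu, tau1, tau2):
    """BE c(eps, mu; tau1, tau2) in (3.20) for the Gaussian mixture model (3.10)."""
    d_prime = lambda u: eps * np.exp(mu * stats.norm.isf(u) - mu ** 2 / 2.0) + (1.0 - eps)
    e1 = integrate.quad(lambda u: -np.log(u / tau2) * d_prime(u), 0.0, tau1)[0]
    e0 = tau1 * (1.0 - np.log(tau1) + np.log(tau2))                                  # (3.19)
    v0 = tau1 * (1.0 + (1.0 - tau1) * (1.0 - np.log(tau1) + np.log(tau2)) ** 2)
    return (e1 - e0) ** 2 / v0

eps, mu = 0.5, 1.0
grid1, grid2 = np.arange(0.05, 1.0, 0.05), np.arange(0.05, 3.0, 0.05)
values = np.array([[bahadur_efficiency(eps, mu, t1, t2) for t2 in grid2] for t1 in grid1])
i, j = np.unravel_index(values.argmax(), values.shape)
print("tau1* =", grid1[i], ", tau2* =", grid2[j], ", c* =", values.max())
```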

Figure 3.5: 3D surface of BE c(ε, µ; τ1, τ2) over τ1 and τ2. ε = 0.5. Left panel: µ = 0.5, the maximizers τ1* = 0.9, τ2* = 1.28 and the global maximum c* = 0.071; Middle: µ = 1, τ1* = τ2* = 0.39 and c* = 0.394; Right: µ = 1.5, τ1* = τ2* = 0.05 and c* = 1.674.

To further study the relationship between the maximizers and the maximum of c(ε, µ; τ1, τ2), the left panel of Figure 3.6 shows the values of the global maximizers τ1*, τ2* over µ; the right panel shows the global and restricted maxima. A few observations can be made. First, the soft-thresholding with τ1* = τ2* is globally optimal for maximizing BE when µ > 0.78. It indicates that the lower bound for the cutoff of 0.85 given in Theorem 3.5.1 is pretty tight. Second, when µ is larger than this cutoff, τ1* = τ2* = τ* is a decreasing function of the signal strength µ. That is, the stronger the signals, the more beneficial the truncation method will be. When the signals are weaker, i.e., when µ is less than the cutoff, the optimal τ1* and τ2* could be different: τ1* is close to 1, but τ2* could be larger than 1. It means that for weak signals we should not truncate too much, but instead should give a heavier weight to smaller p-values through τ2. Third, even when the soft-thresholding is not optimal, it still gives a very similar value of c(ε, µ; τ1, τ2). This can be seen from the right panel of Figure 3.6: when µ is small, the various methods have similar c(ε, µ; τ1, τ2) values, which are close to 0. However, when µ is large, we note that the optimal soft-thresholding is significantly better than the optimal hard-thresholding (TPM), and both are better than Fisher's method (no truncation). This result means that the difference between soft-thresholding and the globally optimal methods could be practically negligible.

3.5.2 Properties Based on Asymptotic Power Efficiency

BE has a limitation in fully reflecting the statistical power of a given test. Following Remark 1 and Lemma 2, for any mixture model in (3.22), BE does not reflect the influence of ε on statistical power, which may not hold in real data analysis. In particular, BE in (3.20) is related to H1 only through the difference of the means E1 − E0, but not the variance. However, in reality a given statistic could have significantly different variations under the null and the alternative. To address this limitation, we develop a new asymptotic metric, called Asymptotic Power Efficiency (APE), which takes such variation differences into consideration.

Figure 3.6: BE-optimality over µ values. Left panel: Global maximizers τ1* and τ2* of BE c(ε, µ; τ1, τ2) over µ. Right panel: Maxima of BE over µ. Optimal: Globally maximal BE; Soft: Maximal BE under restriction τ1 = τ2; TPM: Maximal BE under restriction τ2 = 1; Fisher: BE at τ1 = τ2 = 1. ε = 0.5.

APE is defined based on a more direct and accurate asymptotics to reflect the patterns of statistical power. Following equation (3.17), under H0, by the CLT we have

$$P_{H_0}\left(W_n > nE_0 + z_\alpha\sqrt{nV_0}\right) \to \alpha,$$

where E0 and V0 are defined in (3.19) and zα is the (1 − α) quantile of N(0, 1). We call nE0 + zα√(nV0) the level-α asymptotic critical value for Wn. Accordingly, the asymptotic power is

$$P_{H_1}\left(W_n > nE_0 + z_\alpha\sqrt{nV_0}\right) = P\left(\frac{W_n - nE_1}{\sqrt{nV_1}} > z_\alpha\sqrt{\frac{V_0}{V_1}} - \sqrt{n}\,\frac{E_1 - E_0}{\sqrt{V_1}}\right),$$

where

$$V_1 = \mathrm{Var}_{H_1}(Y_i) = \int_0^{\tau_1}\log^2(u)\,D'(u)\,du - \left(\int_0^{\tau_1}\log(u)\,D'(u)\,du\right)^2 + 2\left(D(\tau_1)-1\right)\log(\tau_2)\int_0^{\tau_1}\log(u)\,D'(u)\,du + \log^2(\tau_2)\,D(\tau_1)\left(1 - D(\tau_1)\right).$$

The rescaled critical value

$$a(\epsilon, \mu; \tau_1, \tau_2) = z_\alpha\sqrt{\frac{V_0}{V_1}} - \sqrt{n}\,\frac{\Delta}{\sqrt{V_1}} \qquad (3.23)$$

is called the APE. Since $(W_n - nE_1)/\sqrt{nV_1} \to N(0, 1)$, the smaller the a(ε, µ; τ1, τ2), the bigger the asymptotic power, and thus the more "efficient" the test is. BE and APE are consistent in the sense that the bigger the mean difference ∆, the more efficient a test is. Meanwhile, APE is more sophisticated as it accounts for differences in both the means and the variances under the alternative versus the null. When n is large, a(ε, µ; τ1, τ2) is dominated by the √n term. We define

$$b(\epsilon, \mu; \tau_1, \tau_2) = \frac{\Delta}{\sqrt{V_1}} \qquad (3.24)$$

as another measure of the performance of a statistic, called the Asymptotic Power Rate (APR). Note that APR is similar to BE except that the denominator refers to the alternative variance under H1. Since APR is more directly related to statistical power than BE, this formula indicates that the variance of the statistic under the alternative hypothesis could be more relevant to its power than its null variance.

The next theorem indicates that the soft-thresholding method can be a promising candidate in terms of maximizing b(ε, µ; τ1, τ2), as long as the signal strength µ is not too small and the signal proportion ε is not too large.

Theorem 3.5.2. Consider the TFisher statistics Wn(τ1, τ2) in (3.3) under the signal detection problem in (3.10). When µ > 0.85 and

$$\epsilon < h_b(\mu) = \frac{1 + \tilde{g}_1(\mu)}{\left(\tilde{g}_1(\mu)\right)^2 - \tilde{g}_1(\mu) - \tilde{g}_2(\mu)},$$

where $\tilde{g}_k(\mu) = \int_0^1 \log^k(u)\left(e^{\mu\Phi^{-1}(1-u) - \mu^2/2} - 1\right)du$, the soft-thresholding with τ1 = τ2 = τ*, for some τ*, is a stationary point of b(ε, µ; τ1, τ2) in (3.24).

Compared with Theorem 3.5.1 for BE, Theorem 3.5.2 for APR provides a consistent yet more comprehensive picture of the optimality domain involving ε. Moreover, we give a similar theorem concerning the APE, which further allows the number of tests n and the significance level α to play a role in determining the theoretical boundary for the soft-thresholding to be promising.

Theorem 3.5.3. Follow the assumptions and notation in Theorem 3.5.2. There exists a lower bound µ0 > 0 such that if µ > µ0 and

$$\epsilon < h_a(\mu) = \frac{(1 - c_n)\left[1 + \tilde{g}_1(\mu)\right] + 2\tilde{g}_1(\mu) + \tilde{g}_2(\mu)}{(1 - c_n)\left[\left(\tilde{g}_1(\mu)\right)^2 - \tilde{g}_1(\mu) - \tilde{g}_2(\mu)\right] + 2\tilde{g}_1(\mu) + \tilde{g}_2(\mu)},$$

where $c_n = \sqrt{n}/z_\alpha$, then the soft-thresholding with τ1 = τ2 = τ*, for some τ*, is a stationary point of a(ε, µ; τ1, τ2) in (3.23).

Theorems 3.5.2 and 3.5.3 show that when ε is not too large and µ is not too small, the soft-thresholding is promising. Figure 3.7 shows that when n becomes larger, the theoretical boundary defined by a(ε, µ; τ1, τ2) gets closer to the boundary defined by b(ε, µ; τ1, τ2), as expected. Under finite n, the advantage of soft-thresholding is even more prominent because the curve with n = 50 covers a bigger parameter space than those of the other two.

Figure 3.7: The boundaries defined by hb(µ) (Theorem 3.5.2, black) and ha(µ) (Theorem 3.5.3, α = 0.05, red: n = 50; cyan: n = 5000). The soft-thresholding τ1 = τ2 = τ*, for some τ*, satisfies the first-order condition of maximizing b(ε, µ; τ1, τ2) or a(ε, µ; τ1, τ2) for all (ε, µ) below the corresponding boundary curves.

We further study numerically the optimal points based on APE. At n = 50 and α = 0.05, the left panels of Figure 3.8 fix ε (row 1: ε = 0.01; row 2: ε = 0.1) and plot the maximizers τ1*, τ2* over µ. The pattern is consistent with that for BE in Figure 3.6: the soft-thresholding is indeed globally optimal when µ is large enough, and τ* is a decreasing function of µ. Moreover, the smaller the ε, the smaller the µ cutoff needed to guarantee that the soft-thresholding is optimal. When µ is smaller than the cutoff, both τ1* and τ2* could be large, indicating a light truncation and a significance-upscaling weighting for the p-values. The right panels of Figure 3.8 fix µ (row 1: µ = 1; row 2: µ = 2) and plot the maximizers τ1*, τ2* over ε. Consistent with our theorem, the soft-thresholding is indeed globally optimal when ε is not too large (i.e., sparse signals). Such an optimal τ* is proportional to the signal proportion ε. The τ*/ε ratio is a decreasing function of µ, and it could be larger or smaller than 1. Thus, the best cutoff τ* is not a "natural" value 0.05 as suggested in the literature [73]; it is also not simply the signal proportion ε. Instead, there is a functional relationship between τ* and the signal pattern defined by ε and µ, as is given here.

Figure 3.8: The global maximizer (τ1*, τ2*) for a(ε, µ; τ1, τ2) when n = 50, α = 0.05. From top to bottom, left column: ε = 0.01 or 0.1; right column: µ = 1 or 2.

When ε is big or µ is small, the soft-thresholding may not be optimal based on APE. However, when that happens, the practically meaningful difference is likely small because these regions correspond to the true statistical power being close to 0 or 1. Figure 3.9 compares the statistical power between the global optimum (with maximizers (τ1*, τ2*) of APE) and the optimal soft-thresholding (under the restriction τ1 = τ2 = τ*). The two power curves match perfectly, even in regions where the soft-thresholding may not be globally optimal in theory. Here the optimization is done by a grid search over τ1 ∈ {0.001, 0.002, ..., 1} and τ2 ∈ {0.001, 0.002, ..., 10}, and the statistical power is calculated by the method provided in Section 3.4. The result suggests that we may almost always focus on the soft-thresholding in the TFisher family.

Figure 3.9: Power comparison between the globally optimal TFisher statistic (at the global maximizers (τ1*, τ2*) of APE) and the optimal soft-thresholding TFisher (at the restricted maximizers τ1 = τ2 = τ* of APE). The number of tests n = 50, the type I error α = 0.05. Left: ε = 0.1. Right: µ = 2.

3.6 Statistical Power Comparison For Signal Detection

In this section, we focus on the statistical power for the signal detection problem in (3.10). First, we show that our analytical power calculation is accurate compared with simulations. Then, we compare the statistical power among different methods and demonstrate their relative performance.

The statistical power calculation combines the calculations for the null distribution (for controlling the type I error) given in Section 3.3 and for the alternative distribution given in Section 3.4. Here we demonstrate the accuracy of these calculation methods by comparing the statistical power obtained by calculation versus simulation. Figure 3.10 shows that even for relatively small n, we have accurate statistical power calculations under various model parameter setups.

Figure 3.10: The statistical power calculation versus simulation for signal detection. Type I error rate α = 0.05. Left panel: τ1 = 0.1, τ2 = 0.5; Middle: τ1 = 0.05, τ2 = 0.05; Right: τ1 = 0.05, τ2 = 0.25. Simu: curve by 10^4 simulations. Calc SN: by calculation.

Next, we compare various methods in the TFisher family: the optimal TFisher with the global maximizers τ1*, τ2* of APE in (3.23), the soft-thresholding with fixed τ1 = τ2 = 0.05, the soft-thresholding omnibus test oTFisher with adaptive τ1 = τ2 ∈ {0.01, 0.05, 0.5, 1}, Fisher's method with τ1 = τ2 = 1, and TPM with τ1 = 0.05 and τ2 = 1. Figure 3.11 illustrates the power over the signal strength µ at various numbers n of input p-values (by row) and expected numbers nε of signals (by column). Figure 3.12 illustrates the power over the signal proportion ε at various n (by row) and signal strengths µ (by column).

Interesting observations can be seen from these two figures. First, with no surprise, the optimal TFisher is always the best over all settings. Actually, in most of those cases the optimal TFisher corresponds to the soft-thresholding with τ1* = τ2*, and if they are not equal, the power difference is almost always negligible (see Figures 3.8 and 3.9). Second, the soft-thresholding oTFisher is a relatively robust method over various signal patterns. It is often close to the best and never the worst. In fact, its power is often close to the power of the statistic with the parameters it adaptively chooses. For example, if oTFisher chooses τ1 = τ2 = 0.05, it gives a similar but slightly lower power than TFisher with the same parameters. The slight loss of power is possibly due to the variation of the adaptive choice. Third, the soft-thresholding TFisher with fixed τ1 = τ2 = 0.05 has a clear advantage when signals are sparse, i.e., when ε is small. It also has a clear disadvantage when the signals are dense. The original Fisher's method, which is also a special case of soft-thresholding, shows the opposite pattern. Meanwhile, the relative advantage of Soft-0.05 versus Fisher is also related to the signal strength µ. Consistent with the theoretical study of both BE and APE, the larger the µ, the smaller the optimal τ* should be. This phenomenon is evidenced by panel 3-3 in Figure 3.11 and panel 1-3 in Figure 3.12: when ε is relatively big, say around 0.1 to 0.2, Soft-0.05 could still be better than Fisher at large µ. Lastly, the hard-thresholding TPM-0.05 is mostly not among the best. In particular, it has a clear disadvantage relative to Soft-0.05 for detecting sparse signals.

Finally, we compare the power of three omnibus tests: oTFisher with soft-thresholding, the adaptive TPM (ATPM, hard-thresholding), and the adaptive RTP (ARTP). ARTP was shown to have the highest power among a group of adaptive set-based methods for genetic association testing [77, 71]. The Supplementary Figures C.1 and C.2 in the Supplementary Materials illustrate the power of the optimal TFisher and the three omnibus tests under the same settings as Figures 3.11 and 3.12, respectively. The key result is that oTFisher actually dominates both ATPM and ARTP across all settings of signal patterns. ARTP could be better than ATPM for sparser and stronger signals, but the opposite is true for denser and weaker signals.

In summary, the pattern of the power comparison well reflects the theoretical study in Section 3.5. The soft-thresholding that restricts τ1 = τ2 = τ is the right strategy to reach the optimal statistic in most cases. The optimal τ* is related to the signal pattern defined by both parameters ε and µ. If we know the signal pattern, e.g., small ε (especially if µ is big at the same time), then we should choose a small τ. However, if no such prior information is available in a study, then the soft-thresholding oTFisher with a grid of τ over small, intermediate, and large values in (0, 1) will likely be a robust solution.

Figure 3.11: The power comparison over signal strength µ. Type I error rate α = 0.05. Soft 0.05: soft-thresholding at τ1 = τ2 = 0.05; TPM 0.05: hard-thresholding at τ1 = 0.05, τ2 = 1; Fisher: Fisher's combination at τ1 = τ2 = 1; Optimal: optimal TFisher at maximizers τ1*, τ2* of APE; Omnibus: soft-thresholding oTFisher with adaptive τ ∈ {0.01, 0.05, 0.5, 1}.

Figure 3.12: The power comparison over signal proportion ε. Type I error rate α = 0.05. Soft 0.05: soft-thresholding at τ1 = τ2 = 0.05; TPM 0.05: hard-thresholding at τ1 = 0.05, τ2 = 1; Fisher: Fisher's combination at τ1 = τ2 = 1; Optimal: optimal TFisher at maximizers τ1*, τ2* of APE; Omnibus: soft-thresholding oTFisher with adaptive τ ∈ {0.01, 0.05, 0.5, 1}.

3.7 ALS Exome-seq Data Analysis

The p-value combination methods have been widely used in genetic association studies, but most of them were based on hard-thresholding, including the TPM and RTP methods [26, 75, 77, 27, 78, 79, 71]. In this section we apply and assess the soft-thresholding TFisher by analyzing whole exome sequencing data of amyotrophic lateral sclerosis (ALS). ALS is a neurodegenerative disorder resulting from motor neuron death. It is the most common motor neuron disease in adults (Motor Neuron Diseases Fact Sheet, NINDS). ALS is a brutal disease that causes patients to lose muscle strength and coordination, even for breathing and swallowing, while leaving their senses of pain unaffected. ALS is uniformly fatal, usually within five years. Genetics plays a critical role in ALS; the heritability is estimated to be about 61% [64]. The identification of ALS genes is foundational to elucidating disease pathogenesis, developing disease models, and designing targeted therapeutics. Despite numerous advances in ALS gene detection, these genes can explain only a small proportion (about 10%) of cases [3].

Exome-sequencing data are obtained by next-generation sequencing technology for sequencing all protein-coding genes in a genome, i.e., the exome. This approach identifies genetic variants that alter protein sequences and may affect diseases. It provides a good balance between the depth of sequencing and the cost compared with whole-genome sequencing. Our data come from the ALS Sequencing Consortium, and the data cleaning and single nucleotide variant (SNV) filtering process follows the same steps as the original study [64]. Specifically, we focused on SNVs that occur at highly conserved positions (with positive GERP score [96]) or that represent stop-gain or stop-loss mutations [97]. SNVs with low genotyping quality (missing rate < 40%) were removed; missing genotypes were also removed. After these filtering steps, the data contained 457 ALS cases and 141 controls, with 105,764 SNVs in 17,088 genes. Two non-genetic categorical covariates, gender and country of origin (6 countries), were also included in the association tests.

We focus on gene-based SNP-set tests. Each gene is tested separately; the input p-values from the group of SNVs within that gene generate a TFisher statistic, and then the summary p-value of this statistic is obtained to measure how significantly the gene is associated. Here we apply the logistic regression model to obtain the input SNV p-values, which allows adjusting for other covariates such as non-genetic factors. Specifically, let yk be the binary indicator of case (yk = 1) or control (yk = 0) for the kth individual, k = 1, ..., N. Let Gk = (Gk1, ..., Gkn) denote the genotype vector of the n SNVs in the given gene, and let Zk = (1, Zk1, Zk2) be the vector of the intercept and the covariates of gender and country of origin. The logistic regression model is

$$\mathrm{logit}\left(E(Y_k \mid G_k, Z_k)\right) = G_k'\beta + Z_k'\gamma,$$

where β and γ are the coefficients. The null hypothesis is that none of the SNVs in the gene are associated, and thus the gene is not associated:

$$H_0: \beta_i = 0, \quad i = 1, ..., n.$$

To test this null hypothesis, we adopt a classic marginal test statistic [38, 39]:

$$U_i = \sum_{k=1}^{N} G_{ki}\left(Y_k - \tilde{Y}_k\right), \quad i = 1, ..., n,$$

where \tilde{Y}_k is the fitted probability of being a case under H0. It can be shown that under H0 the vector of statistics U = (U1, ..., Un) converges in distribution to N(0, Σ) as N → ∞, where Σ can be estimated by

$$\hat{\Sigma} = G'WG - G'WZ(Z'WZ)^{-1}Z'WG,$$

where G = (G_{ki}) and Z = (Z_{ki}) are the corresponding design matrices, and W = diag(\tilde{Y}_k(1 − \tilde{Y}_k)) is a diagonal matrix. After de-correlation we get the input statistics $X = \hat{\Sigma}^{-1/2} U$, which converge in distribution to N(0, I_{n×n}), and the input p-values 2P(N(0, 1) > |Xi|) are asymptotically i.i.d. Uniform[0, 1]. Thus our p-value calculation methods given in Section 3.3 can be applied to any TFisher or oTFisher statistic.
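A sketch of this gene-level pipeline (our own illustration with made-up array names; the null-model fit uses statsmodels, which is our choice of tool, and `tfisher_sf` is the exact-formula sketch from Section 3.3.1):

```python
import numpy as np
from scipy import stats, linalg
import statsmodels.api as sm

def gene_level_pvalue(y, G, Z, tau=0.05):
    """Soft-thresholding TFisher p-value for one gene: marginal score statistics,
    de-correlation, two-sided input p-values, then the exact null formula (3.11)."""
    null_fit = sm.GLM(y, Z, family=sm.families.Binomial()).fit()   # null model: covariates only
    y_tilde = null_fit.fittedvalues
    W = np.diag(y_tilde * (1.0 - y_tilde))
    U = G.T @ (y - y_tilde)                                        # marginal score statistics
    Sigma = G.T @ W @ G - G.T @ W @ Z @ np.linalg.solve(Z.T @ W @ Z, Z.T @ W @ G)
    X = linalg.sqrtm(np.linalg.inv(Sigma)).real @ U                # de-correlated statistics
    p_in = 2.0 * stats.norm.sf(np.abs(X))                          # two-sided input p-values
    w = np.sum(-2.0 * np.log(p_in[p_in <= tau] / tau))             # soft-thresholding TFisher
    return tfisher_sf(w, len(p_in), tau, tau)
```

Here Z is assumed to already contain an intercept column, matching Zk = (1, Zk1, Zk2) above.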

The left panel of Figure 3.13 gives the Q-Q plot of the gene-level p-values of the TFisher statistics at τ1 = τ2 = 0.05. Because of the truncation, it is natural that some genes have p-values of 1 (indicated by the flat part of the dots). This often happens when a gene contains only a few SNVs and their marginal p-values, as the input of its TFisher statistic, are all large, say larger than 0.05 here. Such genes are likely not associated anyway, so the truncation does not prevent the type I error rate from being well controlled at the gene level. The majority of the p-values are still along the diagonal, as expected. The right panel of Figure 3.13 provides the Q-Q plot of the gene-level p-values by the oTFisher test, in which the parameters τ1 = τ2 adapt over {0.05, 0.5, 1}. The top-ranked genes by both methods are consistent, which is reasonable because the signals, i.e., the ALS-associated SNVs, are expected to be in a small proportion of all SNVs.

Figure 3.13: Q-Q plots of p-values based on soft-thresholding tests. Left: τ1 = τ2 = 0.05. Right: omnibus with τ1 = τ2 ∈ {0.05, 0.5, 1}.

To the best of our knowledge, most of these top-ranked genes have not been directly reported in genetic association studies of ALS, even though they are promisingly related to ALS from the functional perspective, as discussed below. This result indicates that the TFisher tests could likely contribute extra power over existing methods for the discovery of novel disease genes. Certainly, the result is based on very limited data; further statistical and biological validations are needed to clarify their genetic mechanisms in ALS.

The biological relevance of the top-ranked genes is briefly discussed here. Gene SMAP1 (a group of 8 SNVs, p-value 1.76 × 10^{-6}) is among significant clusters of altered genes in the frontal cortex of ALS samples [98]. The STRING protein-protein network [99] shows that it has a strong connection with LRRK2, a gene associated with late-onset Parkinson's disease (PD), which is a neurodegenerative disease closely related to ALS [100]. Gene SLC22A24 (12 SNVs, p-value 1.85 × 10^{-5}) has a reported statistical association with Alzheimer's disease, another neurodegenerative disease closely related to ALS [101]. Furthermore, the STRING network shows that SLC22A24 has strong connections with two ALS-related genes: AMACR and C7orf10. AMACR is the gene of AMACR deficiency, a neurological disorder similar to ALS; both initiate and slowly worsen in later adulthood. C7orf10 is associated with ALS types 3 and 4 [102]. Gene OSMR (8 SNVs, p-value 6.35 × 10^{-5}) has been found to be critically involved in neuronal function regulation and protection [103]. Also, it is associated with the IL31RA functional receptor, which is a critical neuroimmune link between TH2 cells and sensory nerves [104]. Gene TBX6 (8 SNVs, p-value 9.47 × 10^{-5}) is involved in the regulation of neural development and maturation [105]. Moreover, in a novel stem cell therapy of ALS, TBX6 and its associated SOX2 play a critical role [106]. Gene VAX2 (7 SNVs, p-value 1.22 × 10^{-4}) plays a functional role in specifying the dorsoventral forebrain. It has a direct protein-protein interaction with the ALS gene CHMP2B [107]. It also has a direct STRING connection with SIX3, which proliferates and differentiates neural progenitor cells (GeneCards database: www.genecards.org). Gene GFRA1 (4 SNVs, p-value 2.99 × 10^{-4}) encodes a member of the glial cell line-derived neurotrophic factor receptor (GDNFR) family. It has direct STRING connections with two ALS-related genes: RAP1A, which is associated with ALS by influencing the activation of Nox2, a modifier of survival in ALS [108], and PIK3CA, which is an up-regulated gene in the ALS mouse model [109].

3.8 Discussion

We proposed and studied a family of Fisher-type p-value combination tests, TFisher, with a general weighting and truncation scheme, of which many existing methods are special cases. For the signal detection problem, we studied the optimal TFisher statistics that maximize the BE and the APE. As a result, we showed that soft-thresholding is nearly the best choice, better than the TPM and RTP methods used in a rich literature of applied statistics.

From the theoretical perspective, the studies of BE and APE revealed the rules for how to best weight and truncate the input p-values in order to best reveal true signals. Our results validated a general principle: when the signals are sparse and strong, more relative weight should be given to the smallest p-values; when the signals are dense and weak, a more "flat" weighting scheme is appropriate. Meanwhile, the original magnitude of the p-values often needs to be downscaled by the parameter τ2 ∈ (0, 1). We obtained a quantitative relationship between the optimal weighting and truncation scheme and the signal proportion as well as the signal-to-noise ratio. Therefore, this work moves the literature forward, as previous work was mostly based on ad hoc justification and simulation studies. Moreover, this work demonstrated an idea for designing novel powerful statistics by studying the interactive relationship between the statistic-defining parameters and the H0/H1-defining parameters. Based on this idea, the statistic family could be further generalized, and powerful methods could be obtained for specific testing problems in the future.

From the practical perspective, this chapter provided analytical calculations for both the p-value and the statistical power of a broad family of TFisher statistics under general hypotheses. Data-adaptive omnibus tests can also be applied to real data with unknown signal patterns. A data analysis pipeline for genetic association studies was illustrated, and a list of putative ALS genes was identified and discussed.

Chapter 4

TFisher Distribution Under Dependent Input Statistics

4.1 Introduction

When the p-values that need to be combined are dependent, for example, as in lin- ear regression or generalized linear regression models the p-values of the covariates are not independent, the null distribution of the TFisher statistics is unknown. Theorems

2.4.1 and 2.4.2 show that although de-correlation strategy is better than not doing any transformation, innovation is even better in terms of having higher signal-to-noise ratio. The fact that keeping some correlation of the individual p-values could increase the statistical power motivates an accurate approximation to the null distribution of

TFisher.

The most straightforward idea for approximation is by the method of moments.

Brown [1] proposed a scaled chi-square approximation by matching the first two moments of Fisher's statistic when the input statistics are multivariate normal. Kost and McDermott [110] extended this approach to the case when the variance of the input statistics is unknown. Poole et al. [111] proposed an empirical method to estimate the variance of Fisher's combination. One drawback of the existing methods is that the variance estimation is either by numerical double integration, which is time consuming, or by simulation or simulation-guided polynomial regression, which can be inaccurate in some scenarios. Also, there is no existing method for TFisher statistics [112]. We proposed a Shifted-Mixed Gamma distribution to accommodate the thresholding and normalization parameters of TFisher, and the variance is estimated by a numerical univariate integration, which is both accurate and computationally efficient.

4.2 Approximate Null Distribution of TFisher

Consider two-sided p-values Pi = 2(1 − Φ(|Zi|)), i = 1, ..., n, where the Zi's are multivariate normal with unit variance, Z ∼ MVN(µ, Σ), with correlation matrix Σij = ρij.

Under H0, µ = 0, while under H1, some element of µ ≠ 0.

The TFisher statistic is defined as

\[
W = \sum_{i=1}^{n} -2\log\!\left(\frac{P_i}{\tau_2}\right) I(P_i < \tau_1) = \sum_{i=1}^{n} Y_i, \qquad (4.1)
\]
where \(Y_i = -2\log\!\left(\frac{P_i}{\tau_2}\right) I(P_i < \tau_1)\).
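For illustration, the statistic in (4.1) can be computed directly from a vector of input p-values; the following is a minimal Python sketch (the function name and the example values are ours, not from the dissertation's software).

```python
import numpy as np

def tfisher_stat(pvals, tau1, tau2):
    """TFisher statistic W in (4.1): sum of -2*log(P_i/tau2) over the p-values below tau1."""
    pvals = np.asarray(pvals, dtype=float)
    kept = pvals[pvals < tau1]
    return np.sum(-2.0 * np.log(kept / tau2))

# tau1 = tau2 corresponds to soft-thresholding; tau1 = tau2 = 1 recovers Fisher's combination.
p = np.array([0.001, 0.04, 0.2, 0.5, 0.9])
print(tfisher_stat(p, tau1=0.05, tau2=0.05), tfisher_stat(p, tau1=1.0, tau2=1.0))
```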

When ρij = 0 for every 1 ≤ i, j ≤ n, W follows a compound of a shifted chi-squared distribution and the binomial distribution [112]. Motivated by the truncation and discontinuity properties of TFisher, we proposed a type of four-parameter general

Gamma distribution to approximate the null distribution of TFisher.

4.2.1 Shifted-Mixed Gamma Distribution

Suppose X is a random variable such that it has probability p0 to be 0 and probability 1 − p0 to be an s-shifted Gamma random variable. The CDF of the Shifted-

Mixed Gamma Distribution is

\[
P(X \le x) = p_0 + (1 - p_0)\, F_{\Gamma(k,\theta)}(x - s), \quad x \ge 0, \qquad (4.2)
\]

where p0 is the point mass probability at 0, P(X = 0) = p0; s is the shift parameter, i.e., the left endpoint of the support of the shifted Gamma component (so that P(X ≤ x) = p0 for all 0 ≤ x ≤ s); and k and θ are the shape and scale parameters of the Gamma distribution.

The mean, µ, and variance, σ2, of X are

\[
\begin{aligned}
\mu &= (1 - p_0)k\theta + (1 - p_0)s, \\
\sigma^2 &= (1 - p_0)\left(k\theta^2 + p_0(k\theta + s)^2\right).
\end{aligned} \qquad (4.3)
\]

When applied to the approximation of the distribution of W under H0, the point

mass parameter p0 can be estimated by

p0 = P (Pi > τ1, i = 1, ..., n) (4.4)

The shift parameter s is the discontinuity gap of the TFisher statistic. It can be

estimated by

s = −2 log(τ1/τ2) (4.5)

k and θ are determined by matching the first two moments of W , i.e.

\[
k = \frac{\left(\mu_w - s(1 - p_0)\right)^2}{(1 - p_0)\sigma_w^2 - p_0\mu_w^2}, \qquad
\theta = \frac{(1 - p_0)\sigma_w^2 - p_0\mu_w^2}{(1 - p_0)\left(\mu_w - s(1 - p_0)\right)}. \qquad (4.6)
\]
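To show how (4.2)–(4.6) fit together in practice, here is a minimal Python sketch (the function name is ours) that turns the null moments into an approximate right-tail p-value for W. It assumes µw, σw², and p0 are supplied: µw from (4.7), σw² from the estimate of Section 4.2.2, and p0 from (4.4), e.g., by a multivariate normal rectangle probability or Monte Carlo.

```python
import numpy as np
from scipy.stats import gamma

def sm_gamma_sf(w_obs, mu_w, var_w, p0, tau1, tau2):
    """Approximate P(W >= w_obs) under H0 using the Shifted-Mixed Gamma distribution."""
    s = -2.0 * np.log(tau1 / tau2)                        # discontinuity gap, (4.5)
    num = (1.0 - p0) * var_w - p0 * mu_w ** 2
    k = (mu_w - s * (1.0 - p0)) ** 2 / num                # shape parameter, (4.6)
    theta = num / ((1.0 - p0) * (mu_w - s * (1.0 - p0)))  # scale parameter, (4.6)
    if w_obs <= 0:
        return 1.0
    # By (4.2), P(W >= w) = (1 - p0) * P(Gamma(k, theta) >= w - s) for w > 0.
    return (1.0 - p0) * gamma.sf(w_obs - s, a=k, scale=theta)
```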

4.2.2 Variance Estimation for TFisher

The mean and variance of W are

\[
\begin{aligned}
\mu_w &= E[W] = 2n\tau_1\left(1 + \log(\tau_2/\tau_1)\right), \\
\sigma_w^2 &= \mathrm{Var}(W) = \sum_{i,j} \mathrm{Cov}(Y_i, Y_j) = \sum_{i,j} \mathrm{Cor}(Y_i, Y_j)\,\mathrm{Var}(Y) = \mathrm{Var}(Y)\sum_{i,j} \mathrm{Cor}(Y_i, Y_j),
\end{aligned} \qquad (4.7)
\]
where

\[
\mathrm{Var}(Y) = 4\tau_1\left(1 + (1 - \tau_1)\left(1 - \log\tau_1 + \log\tau_2\right)^2\right), \qquad (4.8)
\]

which is exactly known.
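The exact expressions (4.7)–(4.8) are easy to sanity-check by simulation in the independent case; a small Python sketch with illustrative parameter values (names are ours):

```python
import numpy as np

def mean_W(n, tau1, tau2):
    """mu_w in (4.7)."""
    return 2.0 * n * tau1 * (1.0 + np.log(tau2 / tau1))

def var_Y(tau1, tau2):
    """Var(Y) in (4.8)."""
    return 4.0 * tau1 * (1.0 + (1.0 - tau1) * (1.0 - np.log(tau1) + np.log(tau2)) ** 2)

rng = np.random.default_rng(0)
tau1 = tau2 = 0.05
P = rng.uniform(size=(200_000, 10))                   # independent null p-values
Y = -2.0 * np.log(P / tau2) * (P < tau1)
W = Y.sum(axis=1)
print(W.mean(), mean_W(10, tau1, tau2))               # should be close
print(Y.var(), var_Y(tau1, tau2))                     # should be close
```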

Notice that, by the Mills ratio, when the p-value P is small enough,

\[
-2\log\frac{P}{\tau_2} \;\overset{D}{\approx}\; Z^2 + 2\log|Z| + \log\!\left(\frac{\pi\tau_2^2}{2}\right). \qquad (4.9)
\]

This suggests we may approximate Yi by Zi² I(Pi < τ1). More specifically,

\[
\mathrm{Cor}(Y_i, Y_j) \approx \mathrm{Cor}\!\left(Z_i^2\, I(|Z_i| > b),\; Z_j^2\, I(|Z_j| > b)\right). \qquad (4.10)
\]

The advantage of this approximation is that it allows us to deduce an analytical

formula for calculating Cor(Yi,Yj). The formula will involve univariate integrals only,

thus the computation can still be efficient.

Theorem 4.2.1. Suppose U, V follow a bivariate normal distribution with mean 0,

variance 1, and correlation ρ. Let b = Φ−1(1 − τ/2), then

\[
\mathrm{Var}\!\left(U^2 I(|U| > b)\right) = 3 - M_4(b) - \left(1 - M_2(b)\right)^2,
\]

\[
\begin{aligned}
\mathrm{Cov}&\!\left(U^2 I(|U| > b),\, V^2 I(|V| > b)\right) \\
&= 2\rho^2 b^3\phi(b) + (1 + 2\rho^2)\left(\tau + 2b\phi(b)\right) - \left(\tau + 2b\phi(b)\right)^2 - \int_{\mathbb{R}\setminus[-b,b]} u^2 h(u)\,du,
\end{aligned} \qquad (4.11)
\]
where \(M_n(b) = E\!\left[Z^n I(-b < Z < b)\right]\), \(h(u) = \left(\rho^2 u^2 M_0(u) + 2\rho\sqrt{1-\rho^2}\,u M_1(u) + (1 - \rho^2) M_2(u)\right)\phi(u)\),

and \(M_n(u) = E\!\left[Z^n I(g(u) < Z < f(u))\right]\) with \(f(u) = \frac{-\rho u + b}{\sqrt{1-\rho^2}}\) and \(g(u) = \frac{-\rho u - b}{\sqrt{1-\rho^2}}\).

\(M_n(b)\) and \(M_n(u)\) can be found by Corollaries 4 and 5.

Corollary 4. Let \(M_n(b) = E\!\left[Z^n I(-b < Z < b)\right]\). Then \(M_1(b) = M_3(b) = \cdots = 0\) and

\[
M_0(b) = \Phi(b) - \Phi(-b) = 1 - \tau, \qquad
M_2(b) = M_0(b) - 2b\phi(b), \qquad
M_4(b) = 3M_2(b) - 2b^3\phi(b).
\]

Corollary 5. Let \(M_n(u) = E\!\left[Z^n I(g(u) < Z < f(u))\right]\). Then

M0(u) = Φ(f(u)) − Φ(g(u))

M1(u) = φ(g(u)) − φ(f(u))

M2(u) = M0(u) + g(u)φ(g(u)) − f(u)φ(f(u))

Corollaries 4 and 5 follow from Lemma 10. The proofs of Theorem 4.2.1 and Lemma 10 are given in Appendix D.
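The following Python sketch (function names are ours) implements Theorem 4.2.1 together with Corollaries 4 and 5: it evaluates Var(U²I(|U| > b)) in closed form and Cov(U²I(|U| > b), V²I(|V| > b)) with a single univariate quadrature, which is the computational point of the theorem.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def var_U2I(b):
    """Var(U^2 I(|U| > b)), from Theorem 4.2.1 and Corollary 4."""
    tau = 2.0 * (1.0 - norm.cdf(b))
    M2 = (1.0 - tau) - 2.0 * b * norm.pdf(b)
    M4 = 3.0 * M2 - 2.0 * b ** 3 * norm.pdf(b)
    return 3.0 - M4 - (1.0 - M2) ** 2

def cov_U2I_V2I(b, rho):
    """Cov(U^2 I(|U|>b), V^2 I(|V|>b)) via (4.11); only univariate integrals are needed."""
    tau = 2.0 * (1.0 - norm.cdf(b))
    q = 1.0 - rho ** 2

    def h(u):  # h(u) as defined below (4.11), using M_n(u) from Corollary 5
        f = (-rho * u + b) / np.sqrt(q)
        g = (-rho * u - b) / np.sqrt(q)
        M0 = norm.cdf(f) - norm.cdf(g)
        M1 = norm.pdf(g) - norm.pdf(f)
        M2 = M0 + g * norm.pdf(g) - f * norm.pdf(f)
        return (rho ** 2 * u ** 2 * M0 + 2.0 * rho * np.sqrt(q) * u * M1 + q * M2) * norm.pdf(u)

    tail = quad(lambda u: u ** 2 * h(u), b, np.inf)[0] + quad(lambda u: u ** 2 * h(u), -np.inf, -b)[0]
    base = tau + 2.0 * b * norm.pdf(b)
    return 2.0 * rho ** 2 * b ** 3 * norm.pdf(b) + (1.0 + 2.0 * rho ** 2) * base - base ** 2 - tail

# Cor(Y_i, Y_j) in (4.10) is then approximated by cov_U2I_V2I(b, rho) / var_U2I(b),
# with b = norm.ppf(1 - tau1 / 2); a direct Monte Carlo draw of (U, V) gives a quick check.
```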

4.3 Numerical Results

We first evaluate the accuracy of the variance approximation. We consider two types of correlation matrices for the input statistics: 1) equal correlation

Σij = ρ and 2) polynomial decaying correlation Σij = |i − j|^{−λ}. The number of input statistics is 20. Figures 4.1 and 4.2 show that under equal correlation, especially when the truncation τ1 is not too big and the correlation is not too large, our approximation approach improves the accuracy of the variance calculation significantly compared to

Brown's polynomial fitting method. Figures 4.3 and 4.4 show that under polynomial decaying correlation, our proposed method is accurate across all λ.

Figure 4.1: Variance of the soft-thresholding statistic under equal correlation. 'simu': sample variance. 'cal poly': the polynomial fitting method used in our previous reports. 'cal theo': the theoretical calculation method proposed in Section 4.2.

Figure 4.2: Variance of the TFisher statistic under equal correlation.

Figure 4.3: Variance of the soft-thresholding statistic under polynomial decaying correlation. 'simu': sample variance. 'cal poly': the polynomial fitting method used in our previous reports. 'cal theo': the theoretical calculation method proposed in Section 4.2.

Figure 4.4: Variance of TFisher statistic under polynomial decaying correlation.

Next, we evaluate the accuracy of the p-value approximation of W. From Figures

4.5, 4.6, 4.7, and 4.8 we can see that the theoretical approach leads to p-value calculations that are very close to the 'oracle' case, that is, when we know the true

Var(W). The proposed method is more accurate than Brown's polynomial fitting method.

Figure 4.5: P-value calculation of soft-thresholding statistic under equal correlation.

The parameter ρ is chosen such that the theoretical calculation has a large deviation from the sample variance.

Figure 4.6: P-value calculation of TFisher statistic under equal correlation.

Figure 4.7: P-value calculation of the soft-thresholding statistic under polynomial decaying correlation.

Figure 4.8: P-value calculation of the TFisher statistic under polynomial decaying correlation.

4.4 Extension to oTFisher

Consider multiple TFisher statistics

\[
W_k = \sum_{i=1}^{n} -2\log\!\left(\frac{P_i}{\tau_{2k}}\right) I(P_i < \tau_{1k}) = \sum_{i=1}^{n} Y_{ik}, \qquad k = 1, \ldots, K,
\]
where

\[
Y_{ik} = -2\log\!\left(\frac{P_i}{\tau_{2k}}\right) I(P_i < \tau_{1k}), \qquad i = 1, \ldots, n.
\]

Our best hope is to model (W1, ..., WK) as multivariate normal. Thus we are interested in the covariances between the Wk's:

\[
\mathrm{Cov}(W_l, W_k) = \sum_{i,j} \mathrm{Cov}(Y_{il}, Y_{jk}) = \sum_{i,j} \mathrm{Cor}(Y_{il}, Y_{jk})\sqrt{\mathrm{Var}(Y_{il})\,\mathrm{Var}(Y_{jk})},
\]
where

\[
\mathrm{Var}(Y_{il}) = 4\tau_{1l}\left(1 + (1 - \tau_{1l})\left(1 - \log\tau_{1l} + \log\tau_{2l}\right)^2\right),
\]

\[
\mathrm{Var}(Y_{ik}) = 4\tau_{1k}\left(1 + (1 - \tau_{1k})\left(1 - \log\tau_{1k} + \log\tau_{2k}\right)^2\right).
\]

We can approximate the correlation

\[
\mathrm{Cor}(Y_{il}, Y_{jk}) \approx \mathrm{Cor}\!\left(Z_i^2\, I(|Z_i| > b_l),\; Z_j^2\, I(|Z_j| > b_k)\right).
\]

The theoretical derivation of Cor(Zi² I(|Zi| > bl), Zj² I(|Zj| > bk)) is a natural extension of the TFisher case where bl = bk.

This approximation is overall conservative and works well when n is not too large and the correlation is not too dense. See Figures 4.9 and 4.10.
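As a companion check of the covariance approximation and of the conservativeness noted above, one can also draw (W1, ..., WK) under H0 directly by Monte Carlo; a minimal sketch for the soft-thresholding case (function name and defaults are ours):

```python
import numpy as np
from scipy.stats import norm

def simulate_null_W(Sigma, taus, n_rep=100_000, seed=0):
    """Monte Carlo draw of (W_1, ..., W_K) under H0 for soft-thresholding statistics
    (tau_1k = tau_2k = taus[k]) with correlated input Z ~ MVN(0, Sigma)."""
    rng = np.random.default_rng(seed)
    Z = rng.multivariate_normal(np.zeros(Sigma.shape[0]), Sigma, size=n_rep)
    P = np.clip(2.0 * (1.0 - norm.cdf(np.abs(Z))), 1e-300, None)   # two-sided p-values
    return np.stack([(-2.0 * np.log(P / t) * (P < t)).sum(axis=1) for t in taus], axis=1)

# np.cov(W, rowvar=False) can then be compared with the analytical Cov(W_l, W_k) above,
# and the empirical null distribution of the omnibus statistic checked for conservativeness.
```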

Figure 4.9: P-value calculation of omnibus soft-thresholding with τ = 0.1, 0.2, ..., 1 under polynomial decaying correlation λ = 2.

Figure 4.10: P-value calculation of omnibus soft-thresholding with τ = 0.1, 0.2, ..., 1 under equal correlation ρ = 0.2. Bibliography

[1] M. B. Brown, “400: A method for combining non-independent, one-sided tests of significance,” Biometrics, pp. 987–992, 1975.

[2] P. Kraft and D. J. Hunter, “Genetic risk prediction–are we there yet?,” New England Journal of Medicine, vol. 360, no. 17, p. 1701, 2009.

[3] E. T. Cirulli, B. N. Lasseigne, S. Petrovski, P. C. Sapp, P. A. Dion, C. S. Leblond, J. Couthouis, Y.-F. Lu, Q. Wang, B. J. Krueger, et al., “Exome se- quencing in amyotrophic lateral sclerosis identifies risk genes and pathways,” Science, vol. 347, no. 6229, pp. 1436–1441, 2015.

[4] D. L. Donoho and J. Jin, “Higher criticism for detecting sparse heterogeneous mixtures,” The Annals of Statistics, vol. 32, no. 3, pp. 962–994, 2004.

[5] D. L. Donoho and J. Jin, “Higher criticism thresholding: Optimal feature se- lection when useful features are rare and weak,” Proceedings of the National Academy of Sciences of the United States of America, vol. 105, pp. 14790–14795, Sep 30 2008.

[6] Z. Wu, Y. Sun, S. He, J. Cho, H. Zhao, and J. Jin, “Detection boundary and Higher Criticism approach for sparse and weak genetic effects.,” The Annals of Applied Statistics, vol. 8, no. 2, pp. 824–851, 2014.

[7] I. Barnett, R. Mukherjee, and X. Lin, “The generalized higher criticism for testing snp-set effects in genetic association studies,” Journal of the American Statistical Association, no. just-accepted, 2016.

[8] R. H. Berk and D. H. Jones, “Goodness-of-fit test statistics that dominate the kolmogorov statistics,” Probability Theory and Related Fields, vol. 47, no. 1, pp. 47–59, 1979.

[9] L. Jager and J. A. Wellner, “Goodness-of-fit tests via phi-divergences,” The Annals of Statistics, pp. 2018–2053, 2007.

146 Bibliography 147

[10] E. Arias-Castro, E. J. Candès, and Y. Plan, “Global testing under sparse alternatives: Anova, multiple comparisons and the higher criticism,” The Annals of Statistics, vol. 39, no. 5, pp. 2533–2556, 2011.

[11] T. T. Cai and Y. Wu, “Optimal detection of sparse mixtures against a given null distribution,” IEEE Transactions on Information Theory, vol. 60, no. 4, pp. 2217–2232, 2014.

[12] A. Moscovich, B. Nadler, and C. Spiegelman, “On the exact berk-jones statistics and their p-value calculation,” Electronic Journal of Statistics, vol. 10, no. 2, pp. 2329–2354, 2016.

[13] J. Li and D. Siegmund, “Higher criticism: p-values and criticism,” Annals of Statistics, vol. 43, no. 3, pp. 1323–1350, 2015.

[14] A. L. Price, G. V. Kryukov, P. I. W. de Bakker, S. M. Purcell, J. Staples, L. J. Wei, and S. R. Sunyaev, “Pooled association tests for rare variants in exon- resequencing studies,” American Journal of Human Genetics, vol. 86, no. 6, pp. 832–838, 2010.

[15] M. Noé, “The calculation of distributions of two-sided Kolmogorov-Smirnov type statistics,” The Annals of Mathematical Statistics, pp. 58–64, 1972.

[16] M. Noé and G. Vandewiele, “The calculation of distributions of Kolmogorov-Smirnov type statistics including a table of significance points for a particular case,” The Annals of Mathematical Statistics, vol. 39, no. 1, pp. 233–241, 1968.

[17] V. Kotel’Nikova and E. Chmaladze, “On computing the probability of an em- pirical process not crossing a curvilinear boundary,” Theory of Probability & Its Applications, vol. 27, no. 3, pp. 640–648, 1983.

[18] G. R. Shorack and J. A. Wellner, Empirical processes with applications to statis- tics, vol. 59. SIAM, 2009.

[19] G. Steck, “The smirnov two sample tests as rank tests,” The Annals of Mathe- matical Statistics, pp. 1449–1466, 1969.

[20] M. Breth, “On a recurrence of steck,” Journal of Applied Probability, pp. 823– 825, 1976.

[21] H. Ruben, “On the evaluation of steck’s determinant for rectangle probabilities of uniform order statistics,” Communications in Statistics-Theory and Methods, vol. 5, no. 6, pp. 535–543, 1976.

[22] I. J. Barnett and X. Lin, “Analytical p-value calculation for the higher criticism test in finite-d problems,” Biometrika, vol. 101, no. 4, pp. 964–970, 2014. Bibliography 148

[23] M. Denuit, C. Lefèvre, and P. Picard, “Polynomial structures in order statistics distributions,” Journal of Statistical Planning and Inference, vol. 113, no. 1, pp. 151–178, 2003.

[24] F. Eicker, “The asymptotic distribution of the suprema of the standardized empirical processes,” The Annals of Statistics, pp. 116–138, 1979.

[25] D. Jaeschke, “The asymptotic distribution of the supremum of the standardized empirical distribution function on subintervals,” The Annals of Statistics, vol. 7, no. 1, pp. 108–115, 1979.

[26] J. Hoh, A. Wille, and J. Ott, “Trimming, weighting, and grouping SNPs in human case-control association studies,” Genome Research, vol. 11, no. 12, pp. 2115–2119, 2001.

[27] J. Li and G. C. Tseng, “An adaptively weighted statistic for detecting differ- ential gene expression when combining multiple transcriptomic studies,” The Annals of Applied Statistics, vol. 5, no. 2A, pp. 994–1019, 2011.

[28] C. Song and G. C. Tseng, “Hypothesis setting and order statistic for robust genomic meta-analysis,” The Annals of Applied Statistics, vol. 8, no. 2, pp. 777– 800, 2014.

[29] Y. I. Ingster, “Some problems of hypothesis testing leading to infinitely divisible distributions,” Mathematical Methods of Statistics, vol. 6, no. 1, pp. 47–69, 1997.

[30] Y. I. Ingster, “Minimax detection of a signal for in-balls,” Mathematical Methods of Statistics, vol. 7, no. 4, pp. 401–428, 1998.

[31] J. Tukey, “The higher criticism.” Course Notes, Statistics 411, Princeton Uni- versity., 1976.

[32] D. L. Donoho and J. Jin, “Higher criticism for large-scale inference: especially for rare and weak effects,” Statistical Science, vol. 30, no. 1, pp. 1–25 DOI: 10.1214/14–STS506, 2015.

[33] J. Shao, Mathematical Statistics. Springer Verlag, 2010.

[34] T. W. Anderson and D. A. Darling, “Asymptotic theory of certain ‘goodness of fit’ criteria based on stochastic processes,” The Annals of Mathematical Statistics, pp. 193–212, 1952.

[35] H. Zhang, J. Jin, and Z. Wu, “Supplement to ‘Distributions and statistical power of optimal signal-detection methods in finite cases’,” 2017.

[36] D. B. Goldstein, “Common genetic variation and human traits,” New England Journal of Medicine, vol. 360, no. 17, pp. 1696–1698, 2009.

[37] L. Luo, G. Peng, Y. Zhu, H. Dong, C. I. Amos, and M. Xiong, “Genome-wide gene and pathway analysis,” European Journal of Human Genetics, vol. 18, no. 9, pp. 1045–1053, 2010.

[38] P. McCullagh and J. A. Nelder, Generalized Linear Models. Florida: CRC Press LLC, 2nd ed., 1989.

[39] D. J. Schaid, C. M. Rowland, D. E. Tines, R. M. Jacobson, and G. A. Poland, “Score tests for association between traits and haplotypes when linkage phase is ambiguous,” The American Journal of Human Genetics, vol. 70, no. 2, pp. 425– 434, 2002.

[40] R. Duerr, K. Taylor, S. Brant, J. Rioux, M. Silverberg, M. Daly, A. Steinhart, C. Abraham, M. Regueiro, A. Griffiths, et al., “A genome–wide association study identifies il23r as an inflammatory bowel disease gene,” Science Signalling, vol. 314, no. 5804, p. 1461, 2006.

[41] A. L. Price, N. J. Patterson, R. M. Plenge, M. E. Weinblatt, N. A. Shadick, and D. Reich, “Principal components analysis corrects for stratification in genome- wide association studies,” Nature genetics, vol. 38, no. 8, pp. 904–909, 2006.

[42] J. Yang, M. N. Weedon, S. Purcell, G. Lettre, K. Estrada, C. J. Willer, A. V. Smith, E. Ingelsson, J. R. O’Connell, M. Mangino, R. Magi, P. A. Madden, A. C. Heath, D. R. Nyholt, N. G. Martin, G. W. Montgomery, T. M. Frayling, J. N. Hirschhorn, M. I. McCarthy, M. E. Goddard, P. M. Visscher, and G. Consor- tium, “Genomic inflation factors under polygenic inheritance,” European jour- nal of human genetics : EJHG, vol. 19, pp. 807–812, Jul 2011.

[43] J.-P. Hugot, M. Chamaillard, H. Zouali, S. Lesage, J.-P. Cézard, J. Belaiche, S. Almer, C. Tysk, C. A. O’Morain, M. Gassull, et al., “Association of NOD2 leucine-rich repeat variants with susceptibility to Crohn’s disease,” Nature, vol. 411, no. 6837, pp. 599–603, 2001.

[44] Y. Ogura, D. K. Bonen, N. Inohara, D. L. Nicolae, F. F. Chen, R. Ramos, H. Britton, T. Moran, R. Karaliuskas, R. H. Duerr, et al., “A frameshift muta- tion in nod2 associated with susceptibility to crohn’s disease,” Nature, vol. 411, no. 6837, pp. 603–606, 2001.

[45] D. Maglott, J. Ostell, K. D. Pruitt, and T. Tatusova, “Entrez Gene: gene-centered information at NCBI,” Nucleic Acids Research, vol. 39, no. suppl 1, pp. D52–D57, 2011.

[46] P. Chamouard, Z. Richert, N. Meyer, G. Rahmi, and R. Baumann, “Diagnos- tic value of c-reactive protein for predicting activity level of crohns disease,” Clinical Gastroenterology and Hepatology, vol. 4, no. 7, pp. 882–887, 2006.

[47] G. Trikudanathan, P. G. Venkatesh, and U. Navaneethan, “Diagnosis and ther- apeutic management of extra-intestinal manifestations of inflammatory bowel disease,” Drugs, vol. 72, no. 18, pp. 2333–2349, 2012.

[48] W. D. James, T. Berger, and D. Elston, Andrew’s diseases of the skin: clinical dermatology. Elsevier Health Sciences, 2011.

[49] S. Yuvaraj, S. H. Al-Lahham, R. Somasundaram, P. A. Figaroa, M. P. Pep- pelenbosch, and N. A. Bos, “E. coli-produced bmp-2 as a chemopreventive strategy for colon cancer: a proof-of-concept study,” Gastroenterology research and practice, vol. 2012, 2012.

[50] M. L. Slattery, A. Lundgreen, J. S. Herrick, S. Kadlubar, B. J. Caan, J. D. Potter, and R. K. Wolff, “Genetic variation in bone morphogenetic protein and colon and rectal cancer,” International Journal of Cancer, vol. 130, no. 3, pp. 653–664, 2012.

[51] A. Kolmogorov, “Sulla determinazione empirica di una leggi di distribuzione,” G. Ist. Ital. Attuari., vol. 4, pp. 83–91, 1933.

[52] S. Kotz and N. L. Johnson, Breakthroughs in Statistics: Foundations and basic theory. Springer Science & Business Media, 2012.

[53] H. Zhang, J. Jin, and Z. Wu, “Distributions and statistical power of optimal signal-detection methods in finite cases,” Submitted, 2016.

[54] R. Sun and X. Lin, “Set-based tests for genetic association using the generalized berk-jones statistic,” arXiv preprint arXiv:1710.02469, 2017.

[55] P. Hall and J. Jin, “Innovated higher criticism for detecting sparse signals in correlated noise,” The Annals of Statistics, vol. 38, no. 3, pp. 1686–1732, 2010.

[56] P. Hall and J. Jin, “Properties of higher criticism under strong dependence,” The Annals of Statistics, vol. 36, no. 1, pp. 381–402, 2008.

[57] Y. Fan, J. Jin, Z. Yao, et al., “Optimal classification in sparse gaussian graphic model,” The Annals of Statistics, vol. 41, no. 5, pp. 2537–2571, 2013.

[58] J. Jin and Z. T. Ke, “Rare and weak effects in large-scale inference: methods and phase diagrams,” Statistica Sinica, vol. 26, pp. 1–34, 2016. Bibliography 151

[59] T. T. Cai, C. H. Zhang, and H. H. Zhou, “Optimal rates of convergence for covariance matrix estimation,” The Annals of Statistics, vol. 38, no. 4, pp. 2118– 2144, 2010.

[60] W. S. Cleveland, E. Grosse, and W. M. Shyu, “Local regression models,” Sta- tistical models in S, pp. 309–376, 1992.

[61] Q. Sun, “Wiener’s lemma for infinite matrices with polynomial off-diagonal decay,” Comptes Rendus Mathematique, vol. 340, no. 8, pp. 567–570, 2005.

[62] I. Shlyakhter, P. C. Sabeti, and S. F. Schaffner, “Cosi2: an efficient simulator of exact and approximate coalescent with selection,” Bioinformatics, vol. 30, no. 23, pp. 3427–3429, 2014.

[63] J. D. Wall and J. K. Pritchard, “Haplotype blocks and linkage disequilibrium in the ,” Nature Reviews Genetics, vol. 4, no. 8, p. 587, 2003.

[64] B. N. Smith, N. Ticozzi, C. Fallini, A. S. Gkazi, S. Topp, K. P. Kenna, E. L. Scotter, J. Kost, P. Keagle, J. W. Miller, et al., “Exome-wide rare variant analysis identifies TUBA4A mutations associated with familial ALS,” Neuron, vol. 84, no. 2, pp. 324–331, 2014.

[65] P. J. Bickel and E. Levina, “Regularized estimation of large covariance matri- ces,” The Annals of Statistics, pp. 199–227, 2008.

[66] R. A. Fisher, Statistical Methods for Research Workers. Oliver and Boyd, Ed- inburgh, 1932.

[67] R. C. Littell and J. L. Folks, “Asymptotic optimality of Fisher’s method of combining independent tests,” Journal of the American Statistical Association, vol. 66, no. 336, pp. 802–806, 1971.

[68] R. C. Littell and J. L. Folks, “Asymptotic optimality of Fisher’s method of com- bining independent tests II,” Journal of the American Statistical Association, vol. 68, no. 341, pp. 193–194, 1973.

[69] S. A. Stouffer, E. A. Suchman, L. C. DeVinney, S. A. Star, and R. M. Williams, The American Soldier: Adjustment during Army Life, vol. I. New Jersey: Princeton University Press, 1949.

[70] M. C. Whitlock, “Combining probability from independent tests: the weighted Z-method is superior to Fisher’s approach,” Journal of Evolutionary Biology, vol. 18, no. 5, pp. 1368–1373, 2005. Bibliography 152

[71] Y.-C. Su, W. J. Gauderman, K. Berhane, and J. P. Lewinger, “Adaptive set- based methods for association testing,” Genetic Epidemiology, vol. 40, no. 2, pp. 113–122, 2016.

[72] L. Tippert, The Methods of Statistics. London: Williams and Norgate Ltd., 1931.

[73] D. V. Zaykin, L. A. Zhivotovsky, P. H. Westfall, and B. S. Weir, “Truncated product method for combining p-values,” Genetic Epidemiology, vol. 22, no. 2, pp. 170–185, 2002.

[74] D. V. Zaykin, L. A. Zhivotovsky, W. Czika, S. Shao, and R. D. Wolfinger, “Com- bining p-values in large-scale genomics experiments,” Pharmaceutical Statistics, vol. 6, no. 3, pp. 217–226, 2007.

[75] F. Dudbridge and B. P. Koeleman, “Rank truncated product of p-values, with application to genomewide association scans,” Genetic Epidemiology, vol. 25, no. 4, pp. 360–366, 2003.

[76] C.-L. Kuo and D. V. Zaykin, “Novel rank-based approaches for discovery and replication in genome-wide association studies,” Genetics, vol. 189, no. 1, pp. 329–340, 2011.

[77] K. Yu, Q. Li, A. W. Bergen, R. M. Pfeiffer, P. S. Rosenberg, N. Caporaso, P. Kraft, and N. Chatterjee, “Pathway analysis by adaptive combination of P-values,” Genetic Epidemiology, vol. 33, no. 8, pp. 700–709, 2009.

[78] J. M. Biernacka, G. D. Jenkins, L. Wang, A. M. Moyer, and B. L. Fridley, “Use of the gamma method for self-contained gene-set analysis of SNP data,” European Journal of Human Genetics, vol. 20, no. 5, pp. 565–571, 2012.

[79] H. Dai, J. S. Leeder, and Y. Cui, “A modified generalized Fisher method for combining probabilities from dependent tests,” Frontiers in Genetics, vol. 5, no. 32, 2014.

[80] F. Abramovich, Y. Benjamini, D. L. Donoho, and I. M. Johnstone, “Adapting to unknown sparsity by controlling the false discovery rate,” The Annals of Statistics, vol. 34, no. 2, pp. 584–653, 2006.

[81] D. L. Donoho, “De-noising by soft-thresholding,” IEEE Transactions on Infor- mation Theory, vol. 41, no. 3, pp. 613–627, 1995.

[82] S. Lee, M. J. Emond, M. J. Bamshad, K. C. Barnes, M. J. Rieder, D. A. Nicker- son, D. C. Christiani, M. M. Wurfel, and X. Lin, “Optimal unified approach for rare-variant association testing with application to small-sample case-control Bibliography 153

whole-exome sequencing studies,” The American Journal of Human Genetics, vol. 91, no. 2, pp. 224–237, 2012.

[83] X. Lin, S. Lee, M. C. Wu, C. Wang, H. Chen, Z. Li, and X. Lin, “Test for rare variants by environment interactions in sequencing association studies,” Biometrics, vol. 72, no. 1, pp. 156–164, 2016.

[84] I. J. Good, “On the weighted combination of signifiance tests,” Journal of the Royal Statistical Society: Series B, vol. 17, no. 2, pp. 264–265, 1955.

[85] A. Genz, “Numerical computation of multivariate normal probabilities,” Jour- nal of Computational and Graphical Statistics, vol. 1, no. 2, pp. 141–149, 1992.

[86] A. Azzalini, “A class of distributions which includes the normal ones,” Scandi- navian Journal of Statistics, vol. 12, no. 2, pp. 171–178, 1985.

[87] S. Nadarajah, “A generalized normal distribution,” Journal of Applied Statis- tics, vol. 32, no. 7, pp. 685–694, 2005.

[88] M. K. Varanasi and B. Aazhang, “Parametric generalized Gaussian density estimation,” The Journal of the Acoustical Society of America, vol. 86, no. 4, pp. 1404–1415, 1989.

[89] A. DasGupta, Asymptotic Theory of Statistics and Probability. New York: Springer Science & Business Media, 2008.

[90] H. E. Daniels, “Saddlepoint approximations in statistics,” The Annals of Math- ematical Statistics, vol. 25, no. 4, pp. 631–650, 1954.

[91] R. Lugannani and S. Rice, “Saddle point approximation for the distribution of the sum of independent random variables,” Advances in Applied Probability, vol. 12, no. 2, pp. 475–490, 1980.

[92] R. R. Bahadur, “Stochastic comparison of tests,” The Annals of Mathematical Statistics, vol. 31, no. 2, pp. 276–295, 1960.

[93] Y. Nikitin, Asymptotic Efficiency of Nonparametric Tests. New York: Cam- bridge University Press, 1995.

[94] R. R. Bahadur, “Rates of convergence of estimates and test statistics,” The Annals of Mathematical Statistics, vol. 38, no. 2, pp. 303–324, 1967.

[95] W. A. Abu-Dayyeh, M. A. Al-Momani, and H. A. Muttlak, “Exact bahadur slope for combining independent tests for normal and logistic distributions,” Applied mathematics and computation, vol. 135, no. 2, pp. 345–360, 2003. Bibliography 154

[96] E. V. Davydov, D. L. Goode, M. Sirota, G. M. Cooper, A. Sidow, and S. Bat- zoglou, “Identifying a high fraction of the human genome to be under selec- tive constraint using GERP++,” PLoS Computational Biology, vol. 6, no. 12, p. e1001025, 2010.

[97] X. Liu, C. Wu, C. Li, and E. Boerwinkle, “dbNSFP v3.0: A one-stop database of functional predictions and annotations for human nonsynonymous and splice- site SNVs,” Human Mutation, vol. 37, no. 3, pp. 235–241, 2016.

[98] P. Andrés-Benito, J. Moreno, E. Aso, M. Povedano, and I. Ferrer, “Amyotrophic lateral sclerosis, gene deregulation in the anterior horn of the spinal cord and frontal cortex area 8: implications in frontotemporal lobar degeneration,” Aging, vol. 9, no. 3, pp. 823–851, 2017.

[99] D. Szklarczyk, A. Franceschini, S. Wyder, K. Forslund, D. Heller, J. Huerta- Cepas, M. Simonovic, A. Roth, A. Santos, K. P. Tsafou, et al., “STRING v10: protein–protein interaction networks, integrated over the tree of life,” Nucleic Acids Research, vol. 43, no. D1, pp. D447–D452, 2014.

[100] V. Bonifati, “Parkinson’s disease: the LRRK2-G2019S mutation: opening a novel era in Parkinson’s disease genetics,” European Journal of Human Genet- ics, vol. 14, no. 10, pp. 1061–1062, 2006.

[101] K. L. Ayers, U. L. Mirshahi, A. H. Wardeh, M. F. Murray, K. Hao, B. S. Glicksberg, S. Li, D. J. Carey, and R. Chen, “A loss of function variant in CASP7 protects against Alzheimer’s disease in homozygous APOE ε4 allele carriers,” BMC Genomics, vol. 17, no. Suppl 2, p. 445, 2016.

[102] S. Fanning, W. Xu, C. Beaurepaire, J. Suhan, A. Nantel, and A. Mitchell, “Functional control of the Candida albicans cell wall by catalytic protein kinase A subunit Tpk1,” Molecular Microbiology, vol. 86, no. 2, pp. 284–302, 2012.

[103] S. Guo, Z.-Z. Li, J. Gong, M. Xiang, P. Zhang, G.-N. Zhao, M. Li, A. Zheng, X. Zhu, H. Lei, et al., “Oncostatin M confers neuroprotection against ischemic stroke,” Journal of Neuroscience, vol. 35, no. 34, pp. 12047–12062, 2015.

[104] F. Cevikbas, X. Wang, T. Akiyama, C. Kempkes, T. Savinko, A. Antal, G. Kukova, T. Buhl, A. Ikoma, J. Buddenkotte, et al., “A sensory neuron– expressed IL-31 receptor mediates T helper cell–dependent itch: Involvement of TRPV1 and TRPA1,” Journal of Allergy and Clinical Immunology, vol. 133, no. 2, pp. 448–460, 2014.

[105] D. L. Chapman and V. E. Papaioannou, “Three neural tubes in mouse embryos with mutations in the T-box gene Tbx6,” Nature, vol. 391, no. 6668, pp. 695– 697, 1998. Bibliography 155

[106] R. S Pandya, L. LJ Mao, E. W Zhou, R. Bowser, Z. Zhu, Y. Zhu, and X. Wang, “Neuroprotection for amyotrophic lateral sclerosis: role of stem cells, growth factors, and gene therapy,” Central Nervous System Agents in Medicinal Chem- istry (Formerly Current Medicinal Chemistry-Central Nervous System Agents), vol. 12, no. 1, pp. 15–27, 2012.

[107] L. E. Cox, L. Ferraiuolo, E. F. Goodall, P. R. Heath, A. Higginbottom, H. Mort- iboys, H. C. Hollinger, J. A. Hartley, A. Brockington, C. E. Burness, et al., “Mutations in CHMP2B in lower motor neuron predominant amyotrophic lat- eral sclerosis (ALS),” PLoS One, vol. 5, no. 3, p. e9872, 2010.

[108] B. J. Carter, P. Anklesaria, S. Choi, and J. F. Engelhardt, “Redox modifier genes and pathways in amyotrophic lateral sclerosis,” Antioxidants & Redox Signaling, vol. 11, no. 7, pp. 1569–1586, 2009.

[109] G. P. de Oliveira, J. R. Maximino, M. Maschietto, E. Zanoteli, R. D. Puga, L. Lima, D. M. Carraro, and G. Chadi, “Early gene expression changes in skeletal muscle from SOD1G93A amyotrophic lateral sclerosis animal model,” Cellular and Molecular Neurobiology, vol. 34, no. 3, pp. 451–462, 2014.

[110] J. T. Kost and M. P. McDermott, “Combining dependent p-values,” Statistics & Probability Letters, vol. 60, no. 2, pp. 183–190, 2002.

[111] W. Poole, D. L. Gibbs, I. Shmulevich, B. Bernard, and T. A. Knijnen- burg, “Combining dependent p-values with an empirical adaptation of brown’s method,” Bioinformatics, vol. 32, no. 17, pp. i430–i436, 2016.

[112] H. Zhang, T. Tong, J. E. Landers, and Z. Wu, “TFisher tests: Optimal and adaptive thresholding for combining p-values,” arXiv preprint, 2017.

[113] H. David and H. Nagaraja, “Order statistics,” 2003.

[114] A. M. Mathai and P. G. Moschopoulos, “On a multivariate gamma,” Journal of Multivariate Analysis, vol. 39, no. 1, pp. 135–153, 1991.

[115] Y. L. Tong, The multivariate normal distribution. Springer Science & Business Media, 2012. Appendix A

Proofs of Chapter 1

Proof of Theorem 1.3.1. For k = k1, ..., 1, define

1 x x Z n! Z k1 Z k+1 n−k1 ak = (1 − xk1 ) ... dxk...dxk1−1dxk1 . (n − k1)! uk1 uk1−1 uk

n! ¯ Then obviously ak = F (uk ), and for k ≤ k1 − 1, 1 (n−k1+1)! B(1,m) 1

1 x x Z n! Z k1 Z k+2 n−k1 ak = (1 − xk1 ) ... xk+1dxk+1...dxk1−1dxk1 − ukak+1 (n − k1)! uk1 uk1−1 uk+1 Z 1 n! xk1−k k1−k uj n−k1 k1 X k+j−1 = (1 − xk1 ) dxk1 − ak+j (n − k1)! (k1 − k)! j! uk1 j=1

k1−k j n! X uk+j−1 = F¯ (u ) − a (n − k + k)! B(k1−k+1,m) k1 j! k+j 1 j=1


Now by Lemma 3,

P (Sn ≤ b) = P {U(k) > uk, k0 ≤ k ≤ k1}

1 x x k0−1 Z n! Z k1 Z k0+1 x n−k1 k0 = (1 − xk1 ) ... dxk0 ...dxk1−1dxk1 (n − k1)! (k0 − 1)! uk1 uk1−1 uk0 1 x x k0 k0 Z n! Z k1 Z k0+2 x u n−k1 k0+1 k0 = (1 − xk1 ) ... dxk0+1...dxk1−1dxk1 − ak0+1 (n − k1)! k0! k0! uk1 uk1−1 uk0+1 Z 1 n! xk1−1 k1−1 ui n−k1 k1 X i = (1 − xk1 ) dxk1 − ai+1 uk (n − k1)! (k1 − 1)! i! 1 i=k0 k −1 X1 ui =F¯ (u ) − i a . B(k1,m) k1 i! i+1 i=k0

Proof of Theorem 1.3.2. The idea is to use total probability theorem. n n+1 X X  P (Sn,R ≤ b) = P {Sn,R ≤ b} ∩ { exactly p(i)...p(j−1) fall in [α0, α1]} i=1 j=i+1 n n+1 X X = Pij i=1 j=i+1 Notice that 1 ≤ i ≤ n, i + 1 ≤ j ≤ n + 1, we have

Pij = P (U(i−1) < β0,U(i) ≥ ui, ..., U(j−1) ≥ uj−1,U(j−1) ≤ β1,U(j) > β1),

where the joint density of U(i−1), ..., U(j) is n! f(x , ..., x ) = xi−2(1 − x )n−j, 0 ≤ x ≤ ... ≤ x ≤ 1. i−1 j (i − 2)!(n − j)! i−1 j i−1 j Then

Z 1 n−j Z β1 Z xi+1 Z β0 i−2 (1 − xj) xi−1 Pij = n! ··· dxi−1dxi...dxj−1dxj β1 (n − j)! uj−1 ui 0 (i − 2)! Z 1 n−j Z β1 Z xi+1 Z β0 i−2 (1 − xj) xi−1 = n! dxj ··· dxi...dxj−1 dxi−1 β1 (n − j)! uj−1 ui 0 (i − 2)! Z β1 Z xi+1 = cijn! ··· dxi...dxj−1. uj−1 ui Appendix A: Proofs of Chapter1 158

Direct calculation similar to the proof of Theorem 1.3.1 gives the final result.

Proof of Theorem 1.3.3. Following the idea in the proof of Theorem 1.3.2,

k1 n+1 X X  P (Sn,R ≤ b) = P {Sn,R ≤ b} ∩ { exactly p(i)...p(j−1) fall in [α0, α1]} i=1 j=˜i+1 k n+1 X1 X = Pij. i=1 j=˜i+1

For each feasible pair of (i, j), direct calculation shows Pij can be concisely written

as

˜ ˜ Z β1 j−j Z x˜j−1 Z x˜i+1 ˜ i−i (β1 − x˜j−1) (x˜i − β0) Pij = cijn! ··· dx˜...dx˜ dx˜ (j − ˜j)! (˜i − i)! i j−2 j−1 u˜j−1 u˜j−2 u˜i  ˜  ˜ j−i ˜ j−1 ˜ k−i+1 (β1 − β0) u˜j−1 − β0 X (uk − β0) = c n! F¯ ( ) − a (k + 1) , ij  B(˜j−i,j−˜j+1) ˜ j  (j − i)! β1 − β0 (k − i + 1)! k=˜i where ˜ Z β1 j−j Z x˜j−1 Z xk+1 (β1 − x˜j−1) aj(k) = n! ··· dxk...dx˜ dx˜ (j − ˜j)! j−2 j−1 u˜j−1 u˜j−2 uk ˜ j−k j−k−1 l u˜ u β1 ¯ j−1 X k+l−1 = n! FB(˜j−k,j−˜j+1)( ) − aj(k + 1). (j − k)! β1 l! l=1

Proof of Theorem 1.3.4. The main idea of the proofs of Theorem 1.3.4 and 1.3.5

is as follows. Note that U(i) := D(p(i)) defined in (1.17) follow the same distribution

Γi Pi i.i.d. of , where Γi = εj, εj ∼ Exp(1), so that Γi ∼ Gamma(i, 1). Thus we Γn+1 j=1 can approximate

k P {U > D(g( , b)), for all k ≤ k ≤ k }. (k) n 0 1 k ≈P {Γ > (n + 1)D(g( , b)), for all k ≤ k ≤ k }. (k) n 0 1 Appendix A: Proofs of Chapter1 159

We take advantage of the joint density of (Γk0 , ..., Γk1 ), which is given by Lemma4, while Γn+1 can be approximated by n+1 when n is reasonably large. Accordingly, we apply similar calculation as the proof of Theorem 1.3.1 except by applying Lemmas

7 and8 instead.

Proof of Theorem 1.3.5. Note that

P (Γk > a + λk, 1 ≤ k ≤ k1)

Z +∞ Z zk1 Z z2  −zk1 = e ... dz1...dzk1−1 dzk1 a+λk1 a+λ(k1−1) a+λ

By lemma6,

Z +∞ k1−2 (zk1 − a − (k1 − 1)λ)(zk1 − a) −zk1 = e dzk1 a+λk1 (k1 − 1)! Z +∞  k1−1 k1−2  (zk1 − a) (k1 − 1)λ(zk1 − a) −zk1 −zk1 = e − e dzk1 a+λk1 (k1 − 1)! (k1 − 1)!

Let x = zk1 − a, Z +∞ xk1−1 Z +∞ xk1−2  =e−a e−x dx − λ e−x dx λk1 (k1 − 1)! λk1 (k1 − 2)!

−a =e [1 − P (Γk1 ≤ λk1) − λ (1 − P (Γk1−1 ≤ λk1))]

−a =e (1 − λ + λFΓ(k1−1)(λk1) − FΓ(k1)(λk1))

−a =e (1 − λ + hk1 (λ)).

Thus Theorem 1.3.5 is proved by combining this equation and Lemma7.

The idea of the proof of Theorem 1.3.6 is motivated by [13]. Instead of directly considering the distribution function, we look at the right-tail probability which can be decomposed into the union of disjoint sets. Appendix A: Proofs of Chapter1 160

Proof of Theorem 1.3.6 and Corollary3. Let event An,k be defined as in Lemma

9. They are disjoint and {S ≥ b} = Sk1 A . In this proof we mainly focus on n k=k0 n,k

approximating P (An,k)

k k 0 dD(g( n ,b)) 0 dk1 −dk Let dk = (n + 1)D(g( , b)), d = (n + 1) , d = (n + 1) . Notice that n k dx k,max k1−k 0 dk D(g(x, b)) is convex in x, so dk+j > dk + n j. From Lemma8 and Lemma9, we have

for 1 ≤ k ≤ k1 − 1, dk P (A ) = k (Γ > d , 1 ≤ j ≤ k − k) n,k k! j k+j 1 dk d0 ≤ k P (Γ > d + k j, 1 ≤ j ≤ k − k) k! j k n 1 dk d0 d0 = k e−dk (1 − k + h ( k )) k! n k1−k n d0 d0 = (1 − k + h ( k ))f (k). n k1−k n P (dk)

Next we need to find the lower bound for P (An,k). √ √ We first consider k ≥ k1 − n, here n is chosen for covenience and the proof 0 d √ γ k+ n works for any n , 0 < γ < 1. Note that, due to the convexity, dk+j ≤ dk + n j, √ 1 ≤ j ≤ n dk d0 √ √ P (A ) ≥ k P (Γ ≥ d + k+ n j, 1 ≤ j ≤ n) n,k k! j k n k 0 √ 0 √ d dk+ n dk+ n ≥ k e−dk (1 − + h√ ( ) k! n n n d0 d0 = (1 + o(1))(1 − k + h√ ( k ))f (k). n n n P (dk) The last equation is due to Theorem 1.3.5 and the continuity of D(g(x, b)). We can 0 0 √ dk √ dk see P (An,k) → (1 − n + h n( n ))fP (dk)(k) uniformly in k, k ≥ k1 − n.

√ We then consider k ≤ k1 − n. In this case, the proof is slightly more compli-

cated than the first case, however the idea is similar. Appendix A: Proofs of Chapter1 161

Let pn,k(y) = P (Γj > dk+j − dk + y, 1 ≤ j ≤ k1 − k), 0 ≤ y ≤ dk, and fΓk be the

R dk density function of Γk, then P (An,k) = 0 pn,k(y)fΓk (dk − y)dy. Similar to the proof in the first case,

0 √ k1−k dk+ n √ [ pn,k(y) ≥ P (Γj ≥ y + , 1 ≤ j ≤ n) − P ( {Γj ≤ y + dk+j − dk}) n √ j= n d0 √ d0 √ k1−k d0 −y k+ n √ k+ n X Γj y k,max ≥ e (1 − + h n( )) − P ( ≤ + ) n n √ j j n j= n 0 0 k1−k d0 −y dk √ dk X Γj y k,max = (1 + o(1))e (1 − + h n( )) − P ( ≤ + ). n n √ j j n j= n To prove the residual uniformly in k converges to 0, we need to apply Lemma5.

For y of some lower order, say O(log n)

y d0 log n d0 + k,max ≤ √ + k,max j n n n log n n + 1 dD(g(x, b)) < √ + sup { } n n k0 k1 dx n ≤x≤ n

≤ δ < 1 when n is large.

Then, by Lemma5, we have the residual √ k1−k −I(δ) n X −jI(δ) e resk ≤ e ≤ −I(δ) → 0 exponentially uniformly in k. √ 1 − e j= n From

d0 d0 d0 d0 (1 + o(1))e−y(1 − k + h√ ( k )) − res ≤ p ≤ e−y(1 − k + h√ ( k )), n n n k n,k n n n

we can conclude that

0 0 −y dk √ dk pn,k(y) → e (1 − + h n( )) uniformly in k, n n √ e−I(δ) n with error bound res ≤ . k 1 − e−I(δ) Appendix A: Proofs of Chapter1 162

Let α denote log n for simplicity, α > 1, then, for some constant c > 1,

Z min{dk,cα} d0 d0 ˆ −y k √ k In,k = e (1 − + h n( ))fΓk (dk − y)dy 0 n n √ Z min{dk,cα} ne−I(δ) n →I = p (y)f (d − y)dy uniformly with error bound ≤ . n,k n,k Γk k −I(δ) 0 1 − e (A.1)

When dk > cα, dk − y < dk < k − 1, fΓk (dk − y) is decreasing in y

P (A ) − I (d − cα)f (d − cα) n,k n,k ≤ k Γk k In,k αfΓk (dk − α) cα−dk k−1 (dk − cα)e (dk − cα) = α−d k−1 αe k (dk − α) (d − cα) (c − 1)α = k e(c−1)α(1 − )k−1 α dk − α (c−1)α (dk − cα) (c−1)α (k−1) log(1− ) = e e dk−α α (c−1)α dk α(c−1) −(k−1)( ) ≤ e e dk−α α dk α(c−1)(1− k−1 ) ≤ e dk α

k Let x = n , y = D(g(x, b)), then

(n + 1)y −(c−1)( nx−1 −1) = n (n+1)y . α

nx−1 nx−1 Notice that (n+1)y > 1, we need (c − 1)( (n+1)y − 1) > 1 for all x, that is c =

nx−1 P (An,k)−In,k y −γ sup 2 k1 > 1. Therefore ≤ n , γ > 0. For k = 1, n ≤x≤ n nx−1−(n+1)D(g(x,b)) In,k α

the result follows d1 → 0,

In,k → P (An,k) uniformly in k as n → ∞. (A.2) Appendix A: Proofs of Chapter1 163

d d0 d0 ˆ R k −y k √ k Define P (An,k) = 0 e (1 − n + h n( n ))fΓk (dk − y)dy,

k−1 ˆ ˆ R min{dk,cα} (dk−y) P (An,k) − In,k 0 (k−1)! dy = 1 − k−1 Pˆ(A ) R dk (dk−y) n,k 0 (k−1)! dy k k dk − (dk − cα) = 1 − ( k ) dk  cαk = 1 − dk  k  ≤ exp −c log n . dk

Since c, k > 1, we have dk

ˆ ˆ In,k → P (An,k) uniformly in k as n → ∞. (A.3)

k−1 −y −dk (dk−y) Finally, notice that e fΓk (dk − y) = e (k−1)! , then

0 0 Z dk k−1 d d (dk − y) Pˆ(A ) = (1 − k + h√ ( k ))e−dk dy n,k n n n (k − 1)! 0 (A.4) d0 d0 = (1 − k + h√ ( k ))f (k). n n n P (dk) Combine lemma7 and equations (A.1), (A.2), (A.3), and (A.4), we have

P (Sn ≥ b) =(1 + o(1))P (Γk ≤ dk, for some k0 ≤ k ≤ k1)

k1 X ˆ =(1 + o(1)) P (An,k)

k=k0 k1 0 0 X dk dk =(1 + o(1)) (1 − + h ∗ ( ))f (k), n k n P (dk) k=k0 ∗ √ where k = min{k1 − k, n}.

To proof Corollary3, by Theorem 1.3.6, we have k1   2004 X n + 1 0 k n + 1 0 k P (HC ≥ b) = (1+o(1)) 1 − g ( , b0) + hk∗ ( g ( , b0)) fP ((n+1)g( k ,b )) (k) . n n n n n 0 k=k0 Appendix A: Proofs of Chapter1 164

Lemma 3. Let U(i) be the i-th order statistic of i.i.d. samples from Uniform(0,1),

1 ≤ k0 < k1 ≤ n. Then the joint density of (U(k0), ..., U(k1)) is

\[
f_{(U_{(k_0)},\ldots,U_{(k_1)})}(z_{k_0}, \ldots, z_{k_1}) = \frac{n!}{(n - k_1)!\,(k_0 - 1)!}\, z_{k_0}^{k_0 - 1}\,(1 - z_{k_1})^{n - k_1}, \quad 0 < z_{k_0} < \cdots < z_{k_1} < 1.
\]

Proof. Standard result from formula (2.2.2) of the book Order Statistics [113].

Lemma 4. Let εi be i.i.d. exponential random variables with parameter 1 and Γk = ε1 + · · · + εk, 1 ≤ k0 < k1 ≤ n. Then the joint density of (Γk0, ..., Γk1) is

\[
f(z_{k_0}, \ldots, z_{k_1}) = \frac{z_{k_0}^{k_0 - 1}}{(k_0 - 1)!}\, e^{-z_{k_1}}, \quad 0 < z_{k_0} < \cdots < z_{k_1} < \infty.
\]

In particular, when k0 = 1, the joint density is

\[
f(z_1, \ldots, z_{k_1}) = e^{-z_{k_1}}, \quad 0 < z_1 < \cdots < z_{k_1} < \infty.
\]

Proof. The proof follows the deduction process shown in Theorem 1.1 by Mathai and

Moschopoulos [114]. Note that their theorem only considers the case of k0 = 1, but

the deduction idea can be applied to general k0.

Lemma 5. Let c < 1, I(c) = c − 1 − log(c) > 0, and Γn = ε1 + · · · + εn, where the εi are i.i.d. exponentially distributed with parameter 1. Then

\[
P\!\left(\frac{\Gamma_n}{n} \le c\right) \le e^{-nI(c)}.
\]

Proof. Let εi be Exp(1) distributed. The log moment generating function of −εi is

Λ(t) = − log(1 + t), t > −1. Then the convex rate function is

\[
\Lambda^*(x) = \sup_{t > -1}\, \{t x - \Lambda(t)\}
\]

= −x − 1 − log(−x).

By Cramér's theorem, letting x = −c, we have the desired result.
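A quick numerical check of the bound in Lemma 5 (values are illustrative):

```python
import numpy as np
from scipy.stats import gamma

# Check P(Gamma_n / n <= c) <= exp(-n * I(c)) for c < 1, where Gamma_n ~ Gamma(n, 1).
n, c = 50, 0.6
I_c = c - 1.0 - np.log(c)
lhs = gamma.cdf(n * c, a=n)
rhs = np.exp(-n * I_c)
print(lhs <= rhs, lhs, rhs)
```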

Lemma 6. (Abel-Goncharov Polynomial)

\[
\int_{a+\lambda(k_1-1)}^{z_{k_1}} \cdots \int_{a+\lambda}^{z_2} dz_1 \cdots dz_{k_1-1} = \frac{\left(z_{k_1} - a - \lambda(k_1 - 1)\right)\left(z_{k_1} - a\right)^{k_1 - 2}}{(k_1 - 1)!}.
\]

Proof. We will prove by induction.

When k1 = 2, it’s easily shown that

\[
\int_{a+\lambda}^{z_2} dz_1 = z_2 - a - \lambda = \frac{\left(z_2 - a - \lambda(2-1)\right)\left(z_2 - a\right)^{2-2}}{(2-1)!}.
\]

Assume it’s true for k1 = k,

\[
\int_{a+\lambda(k-1)}^{z_k} \cdots \int_{a+\lambda}^{z_2} dz_1 \cdots dz_{k-1} = \frac{\left(z_k - a - \lambda(k-1)\right)\left(z_k - a\right)^{k-2}}{(k-1)!}.
\]

Then for k1 = k + 1,

\[
\begin{aligned}
\int_{a+\lambda k}^{z_{k+1}} \int_{a+\lambda(k-1)}^{z_k} \cdots \int_{a+\lambda}^{z_2} dz_1 \cdots dz_{k-1}\, dz_k
&= \int_{a+\lambda k}^{z_{k+1}} \frac{\left(z_k - a - \lambda(k-1)\right)\left(z_k - a\right)^{k-2}}{(k-1)!}\, dz_k \\
&= \int_{a+\lambda k}^{z_{k+1}} \frac{(z_k - a)^{k-1}}{(k-1)!}\, dz_k - \int_{a+\lambda k}^{z_{k+1}} \frac{\lambda (z_k - a)^{k-2}}{(k-2)!}\, dz_k \\
&= \frac{(z_{k+1} - a)^{k}}{k!} - \frac{(\lambda k)^{k}}{k!} - \frac{\lambda (z_{k+1} - a)^{k-1}}{(k-1)!} + \frac{\lambda(\lambda k)^{k-1}}{(k-1)!} \\
&= \frac{(z_{k+1} - a)^{k-1}\left(z_{k+1} - a - \lambda k\right)}{k!}.
\end{aligned}
\]

This finishes the proof.
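Lemma 6 can also be verified numerically by nesting one-dimensional quadratures; a small Python sketch (with k1 = m + 1 and illustrative values):

```python
from math import factorial
from scipy.integrate import quad

def iterated(z, m, a, lam):
    """m-fold iterated integral in Lemma 6, innermost variable integrated first."""
    if m == 0:
        return 1.0
    return quad(lambda t: iterated(t, m - 1, a, lam), a + lam * m, z)[0]

def closed_form(z, m, a, lam):
    # (z - a - lam*m) * (z - a)^(m-1) / m!, i.e., Lemma 6 with k1 = m + 1
    return (z - a - lam * m) * (z - a) ** (m - 1) / factorial(m)

a, lam, z = 0.5, 0.3, 4.0
print(iterated(z, 3, a, lam), closed_form(z, 3, a, lam))  # should agree
```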

Lemma 7. Let c = (ck0 , ..., ck1 ), 1 ≤ k0 < k1 ≤ n be a finite sequence of increasing

numbers. Then, given ε > 0, as n → ∞,

\[
P\!\left(\frac{\Gamma_j}{\Gamma_{n+1}} > c_j,\ k_0 \le j \le k_1\right) = (1 + O(\varepsilon))\, P\!\left(\Gamma_j > (n+1)c_j,\ k_0 \le j \le k_1\right).
\]

Proof. Given ε > 0, let L(c; n, ε) and R(c; n, ε) be the lower bound and upper bound

Γj for P ( > cj, k0 ≤ j ≤ k1), Γn+1

L(c; n, ) =P (Γj > (1 + )(n + 1)cj, k0 ≤ j ≤ k1) Γ −P ( n+1 ∈ [1 − , 1 + ]c), n + 1

R(c; n, ) =P (Γj > (1 − )(n + 1)cj, k0 ≤ j ≤ k1) Γ +P ( n+1 ∈ [1 − , 1 + ]c). n + 1

By Lemma5,

Γ 1 P ( n+1 ∈ [1 − , 1 + ]c) ≤ e−nΛ(1−) → 0 exponentially. n + 1 2

Because of the continuity of the multivariate gamma distribtution,

L(c; n, ) →(1 + O())P (Γj > (n + 1)cj, k0 ≤ j ≤ k1),

R(c; n, ) →(1 + O())P (Γj > (n + 1)cj, k0 ≤ j ≤ k1).

Lemma 8. Let (d1, ..., dk1 ) be a sequence of nondecreasing and nonnegative numbers. ¯ Qj(dj+1, ..., dk1 ) = P {Γk ≥ dk+j, 1 ≤ k ≤ k1−j}, 0 ≤ j ≤ k1−1. FΓk (x) is the survival function of Gamma distribution with shape parameter k and scale parameter 1. Then for l = 2, 3, ..., k1

\[
Q_{k_1 - l} = \bar F_{\Gamma_l}(d_{k_1}) - \sum_{j=1}^{l-1} \frac{d_{k_1 - l + j}^{\,j}}{j!}\, Q_{k_1 - l + j},
\]
with \(Q_{k_1 - 1} = \bar F_{\Gamma_1}(d_{k_1})\), and, for \(k_0 \ge 1\), the joint survival probability

\[
P\{\Gamma_k \ge d_k,\ k_0 \le k \le k_1\} = \bar F_{\Gamma_{k_1}}(d_{k_1}) - \sum_{j=k_0}^{k_1 - 1} \frac{d_j^{\,j}}{j!}\, Q_j.
\]

Proof.

Q0 = P {Γk ≥ dk, 1 ≤ k ≤ k1}

Z +∞ Z zk1 Z z2 −zk1 = e ... dz1...dzk1−1dzk1 dk1 dk1−1 d1

Z +∞ Z zk1 Z z3 −zk1 = e ... (z2 − d1)dz2...dzk1−1dzk1 dk1 dk1−1 d2

Z +∞ Z zk1 Z z3 −zk1 = e ... z2dz2...dzk1−1dzk1 − d1Q1 dk1 dk1−1 d2 k −1 j Z +∞ Z zk Z zk +1 k0−1 0 1 0 zk X dj −zk1 0 = e ... dzk0 ...dzk1−1dzk1 − Qj (k0 − 1)! j! dk1 dk1−1 dk0 j=1 Z +∞ k1−1 k1−1 j zk X dj −zk1 1 = e dzk1 − Qj (k1 − 1)! j! dk1 j=1 k1−1 dj ¯ X j = FΓ (dk ) − Qj. k1 1 j! j=1 The rest of recursive formulae can be derived similarly and from the last third equation. Also notice that {Γk ≥ dk, k0 ≤ k ≤ k1} = {Γk ≥ dk, 1 ≤ k ≤ S k1} { some Γk < dk, 1 ≤ k ≤ k0 − 1 and Γk ≥ dk, k0 ≤ k ≤ k1}, by Lemma9 we have

k0−1 j X dj P {Γ ≥ d , k ≤ k ≤ k } = Q + Q k k 0 1 0 j! j j=1 k1−1 dj ¯ X j = FΓ (dk ) − Qj. k1 1 j! j=k0

Lemma 9. Let (d1, ..., dk1 ) be a sequence of increasing numbers, An,k = {Γk ≤ dk, Γk+l > dk+l, 1 ≤ l ≤ k1 − k}, 1 ≤ k ≤ k1 − 1. Then

\[
P(A_{n,k}) = \frac{d_k^{\,k}}{k!}\, Q_k.
\]

Proof.

P (An,k) = P {Γk ≤ dk, Γk+l > dk+l for all 1 ≤ l ≤ k1 − k}

Z +∞ Z zk1 Z zk+1 Z dk k−1 −z zk = e k1 ... dz dz ...dz dz (k − 1)! k k+1 k1−1 k1 dk1 dk1−1 dk+1 0

Z +∞ Z zk1 Z zk+1 k −z dk = e k1 ... dz ...dz dz k! k+1 k1−1 k1 dk1 dk1−1 dk+1 dk = k P {Γ ≥ d , 1 ≤ l ≤ k − k} k! l k+l 1 dk = k Q . k! k Appendix B

Proofs of Chapter 2

Proof of Theorem 2.3.1. Consider input statistics T in (2.7) with µ = 1 and Σij =

ρ. The elements of T can be written as

p 2 Ti = ρZi + 1 − ρ Z0, i = 1, ..., n,

where Z0, Z1, ..., Zn are i.i.d. N(0, 1). For simplicity, we consider R = {i : k0 ≤ i ≤ k1}. Let

U(1) ≤ ... ≤ U(n) be the order statistics of n iid Uniform(0, 1) random variables. First we consider that input p-values in (2.9) are one-sided. For a gGOF statistic Sn,f,R in (2.1), its distribution function can be written in (2.2)–(2.3). By the property of exchangeable normal random variables [115] we can get the cumulative distribution


function (CDF):

P (Sn,f,R < b) = P (P(k) > uk , all k = k0, ..., k1)

−1 = P (T(n−k+1) < Φ (1 − uk), k = k0, ..., k1)

p √ −1 = P ( 1 − ρZ(n−k+1) + ρZ0 < Φ (1 − uk), k = k0, ..., k1) ∞ −1 √ Z Φ (1 − uk) − ρz = φ(z)P (Z(n−k+1) < √ , k = k0, ..., k1)dz −∞ 1 − ρ ∞  −1 √  Z Φ (1 − uk) − ρz = φ(z)P (U(k) > 1 − Φ √ , k = k0, ..., k1)dz. −∞ 1 − ρ

When the input p-values are two-sided, the calculation is adjusted accordingly.

P (Sn,f,R < b) = P (P(k) > uk, k = k0, ..., k1)

= P (2(1 − Φ(|T |(n−k+1))) > uk, k = k0, ..., k1)

p √ −1 = P (| 1 − ρZ + ρZ0|(n−k+1) < Φ (1 − uk/2), k = k0, ..., k1) Z ∞ p √ −1 = φ(z0)P (| 1 − ρZ + ρz0|(n−k+1) < Φ (1 − uk/2), k = k0, ..., k1)dz0 −∞ Z ∞ −1  = φ(z0)P (U(k) > 1 − Fz0 Φ (1 − uk/2) , k = k0, ..., k1)dz0, −∞ √ √ where Fz0 is the CDF of | 1 − ρZ + ρz0|: √ √ x − ρz  −x − ρz  F (x) = Φ √ 0 − Φ √ 0 . z0 1 − ρ 1 − ρ

√  Φ−1(1−u )− ρz  √ k −1 In summary, define c1k = 1 − Φ 1−ρ , c2k = 1 − Fz (Φ (1 − uk/2)), then

Z ∞ P (Sn,f,R < b) = φ(z)P (U(k) > cik, k = k0, ..., k1)dz, i = 1, 2. −∞

Proof of Theorem 2.4.1. First we derive the least squares estimator of β by joint model-fitting. Write SN×(n+m) = [X, Z] and α(n+m)×1 =

[β⊤, γ⊤]⊤,
\[
\hat\alpha = (S^\top S)^{-1}S^\top Y \iff
\begin{bmatrix} \hat\beta \\ \hat\gamma \end{bmatrix} =
\begin{bmatrix} X^\top X & X^\top Z \\ Z^\top X & Z^\top Z \end{bmatrix}^{-1}
\begin{bmatrix} X^\top Y \\ Z^\top Y \end{bmatrix}.
\]

We can write the matrix inverse explicitly,  −1 X0XX0G     G0XG0G   (X0(I − H)X)−1 −(X0(I − H)X)−1X0Z(Z0Z)−1   =   −(Z0Z)−1Z0X(X0(I − H)X)−1 (X0X)−1 − (X0X)−1Z0X(X0(I − H)X)−1X0Z(X0X)−1

Thus

\[
\hat\beta_J = (X^\top(I - H)X)^{-1}X^\top Y - (X^\top(I - H)X)^{-1}X^\top Z(Z^\top Z)^{-1}Z^\top Y = (X^\top(I - H)X)^{-1}X^\top(I - H)Y.
\]

Since the covariance matrix of Y is σ²I and I − H is idempotent, the covariance matrix of β̂J is

\[
\mathrm{Var}(\hat\beta_J) = (X^\top(I - H)X)^{-1}X^\top(I - H)\,\sigma^2 I\,(I - H)X(X^\top(I - H)X)^{-1} = \sigma^2 (X^\top(I - H)X)^{-1}.
\]

Thus the scaled test statistic by joint model-fitting is TJ given in (2.15), with mean \(\mu_{T_J} = \Lambda\beta\) and correlation matrix \(\Sigma_{T_J} = \Lambda(X^\top(I - H)X)^{-1}\Lambda\).
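The projected form of β̂J can be checked numerically against a direct least squares fit on [X, Z]; a small simulated-data sketch (dimensions and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
N, n, m = 200, 5, 3
X, Z = rng.normal(size=(N, n)), rng.normal(size=(N, m))
Y = X @ rng.normal(size=n) + Z @ rng.normal(size=m) + rng.normal(size=N)

# Full least squares on S = [X, Z]: keep the coefficients of X ...
S = np.hstack([X, Z])
beta_joint = np.linalg.lstsq(S, Y, rcond=None)[0][:n]

# ... equals the projected form (X'(I-H)X)^{-1} X'(I-H)Y with H = Z(Z'Z)^{-1}Z'.
H = Z @ np.linalg.solve(Z.T @ Z, Z.T)
M = np.eye(N) - H
beta_proj = np.linalg.solve(X.T @ M @ X, X.T @ M @ Y)
print(np.allclose(beta_joint, beta_proj))  # True
```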

For the marginal model-fitting between Y and the jth covariate Xj, j = 1, ..., n:

\(\hat Y_{Mj} = X_j\hat\beta_{Mj} + Z\hat\gamma\); following the same derivation as above, the least squares estimator is

\[
\hat\beta_{Mj} = (X_j^\top(I - H)X_j)^{-1}X_j^\top(I - H)Y \sim N\!\left(\frac{1}{X_j^\top(I - H)X_j}\, X_j^\top(I - H)X\beta,\ \frac{\sigma^2}{X_j^\top(I - H)X_j}\right).
\]

The corresponding scaled statistic is

\[
T_{Mj} = \frac{1}{\sigma}\sqrt{X_j^\top(I - H)X_j}\;\hat\beta_{Mj} = \frac{1}{\sigma\sqrt{X_j^\top(I - H)X_j}}\, X_j^\top(I - H)Y \sim N\!\left(\frac{1}{\sigma\sqrt{X_j^\top(I - H)X_j}}\, X_j^\top(I - H)X\beta,\ 1\right),
\]

\[
T_M = CX^\top(I - H)Y/\sigma \sim N\!\left(CX^\top(I - H)X\beta/\sigma,\ CX^\top(I - H)XC\right) = N\!\left(\Sigma_{T_M} C^{-1}\beta/\sigma,\ \Sigma_{T_M}\right).
\]

Next, we consider the transformations. To simplify the deduction, define X∗ =

(I − H)XΛ−1. We can write

0 −1 0 ∗0 ∗ −1 ∗0 TJ = Λ(X (I − H)X) X (I − H)Y = (X X ) X Y/σ;

0 −1 ∗0 ∗ −1 ΣTJ = Λ(X (I − H)X) Λ = (X X ) ; 1 D = diag{ , j = 1, ..., n} where (Σ−1) = Λ−1X0 (I − H)X Λ−1. TJ q TJ jj j j j j (Σ−1) TJ jj

Apply IT defined in (2.20) to TJ ,

T IT = D Σ−1T = D X∗0Y/σ, J TJ TJ J TJ

for which the jth element is

IT 1 ∗0 1 −1 0 TJj = q Xj Y/σ = q Λj Xj(I − H)Y/σ = TMj. (Σ−1) Λ−1X0 (I − H)X Λ−1 TJ jj j j j j

Regarding the UTs, for the marginal fitting

UT 0 TM = UM TM = UM CX (I − H)Y/σ.

and since

0 0 0 I = UM ΣTM UM = UM CX (I − H)XCUM . Appendix B: Proofs of Chapter2 173

UT 0 0 −1 The mean is E(TM ) = UM CX (I − H)Xβ/σ = (CUM ) β/σ. For the joint fitting,

let UJ be the inverse of the Cholesky factorization of ΣTJ , i.e.,

0 0 −1 0 I = UJ ΣTJ UJ = UJ Λ(X (I − H)X) ΛUJ .

We have

0 −1 0 0 (ΛUJ ) = (CUM ) = UM C.

Thus,

UT 0 −1 0 0 −1 0 0 TJ = UJ TJ = UJ Λ(X (I−H)X) X (I−H)Y = (ΛUJ ) X (I−H)Y = UM CX (I−H)Y.

Lastly, for the SNRs, assume βj/σ = A > 0 and βi = 0 for all i 6= j. Since

(UJ )jj ≥ 1, we have

UT E(TJj ) = (UJ Λβ/σ)j = (UJ )jjΛjjA ≥ ΛjjA = (Λβ/σ)j = E(TJj).

Note also

0 −1 −1 E(TMj) = (CX (I − H)Xβ/σ)j = (ΣTM C β/σ)j = Cjj A,

−1 and since (UM )jj ≤ 1, we have

UT −1 0 −1 −1 E(TJj ) = ((UM ) C β/σ)j ≤ Cjj A.

UT For all i 6= j, E(TJi) = E(TJi ) = E(TMi) = 0. This ends the proof.

Proof of Theorem 2.4.2. First, by the maximum likelihood estimation (MLE) and the Fisher Scoring method, we show that the MLE estimator of β by joint model-

fitting is given in (2.12). Define SN×(n+m) = [X, Z] and α(n+m)×1 = [β⊤, γ⊤]⊤. The log-likelihood is
\[
\log L = \sum_{i=1}^{N} \log L_i = \sum_{i=1}^{N}\left(\frac{y_i\theta_i - b(\theta_i)}{a_i(\phi)} + c(y_i, \phi)\right).
\]

Thus the score functions,

∂ log L X ∂ log Li ∂θi U (α) = = j ∂α ∂θ ∂α j i i j

Since

∂ log L y − b0(θ ) y − µ i = i i = i i ∂θi ai(φ) ai(φ) and with canonical link θi = g(µi) = Si·α,

∂θi ∂Si·α = = Sij ∂αj ∂αj

We have

\[
U_j(\alpha) = \sum_{i=1}^{N} \frac{S_{ij}\,(Y_i - \mu_i)}{a_i(\phi)}, \qquad j = 1, \ldots, n + m, \qquad (B.1)
\]

−1 0 or in vector form U(α) = A S (Y − µ), where A = diag{ai(φ)} which is a diagonal

matrix of the overdispersion parameters. For linear regression and logistic regression

A is identity.

Next we can show that the Fisher’s information matrix

  N 00 ∂Uj(α) X b (θi) I (j, k) = cov(U (α),U (α)) = −E = S0 S (B.2) n j k ∂α i· a (φ) i· k i=1 i

0 2 or in matrix form In = S WS, where W = diag{var(Yi)/ai (φ)}. When there is no

overdispersion, the weight matrix becomes a diagonal matrix of the variances of Yi.

The MLE estimator of α is obtained by solving the score equations U(α) = 0

through the Fisher Scoring method.

\[
\alpha^{(j+1)} = \alpha^{(j)} + I_n^{(j)\,-1}\, U(\alpha^{(j)}) \qquad (B.3)
\]

Consider a one-step MLE (cf. Theorem 4.19 and Exercise 4.152 of [33]) with initial

estimation

0 0 α(0) = (0, γ(0) ) where γ(0) is the MLE of γ using only control covariates Z.

For simplicity we focus on the case A = I (no overdispersion), if not, we can

re-define the S to be A−1S and the arguments are the same.      −1   β(1) 0 X0W (0)XX0W (0)Z X0(Y − µ(0))           =   +     γ(1) γ(0) Z0W (0)XZ0W (0)Z Z0(Y − µ(0))

where µ(0) is the MLE estimator of µ using only the control covariates Z. W (0) is the

corresponding MLE estimator of the weight matrix defined in/below B.2.

Define X˜ = W (0)1/2X, Z˜ = W (0)1/2Z and H˜ = Z˜(Z˜0Z˜)−1Z˜0

 −1  −1 β(1) = X˜ 0(I − H˜ )X˜ X0(Y − µ(0)) − X˜ 0(I − H˜ )X˜ X˜ 0Z˜(Z˜0Z˜)−1Z0(Y − µ(0))

The second term is exactly 0 because µ(0) is the MLE using only Z, thus the scores

Z0(Y − µ(0)) ≡ 0. The MLE of β therefore can be written as

 −1 βˆ = X˜ 0(I − H˜ )X˜ X0(Y − µ(0)) →D N(β, (X˜ 0(I − H˜ )X˜)−1) (B.4)

ˆ After standardization of β, the test statistics by joint model-fitting is TJ in 2.13. The

rest of the proof follows the proof of Theorem 2.4.1.

Proof of Theorem 2.4.3. Note that for all 1 ≤ j ≤ n:

−1 ˜ ˜ 0 ˜ ˜ −1 ˜ −1 ˜ −2 ˜ 0 ˜ ˜ (Σ )jj = [(Λ(X (I − H)X) Λ) ]jj = Λjj Xj(I − H)Xj = C(f) + o(1), Appendix B: Proofs of Chapter2 176 where 1 Z π C(f) = f(t)−1dt. 2π −π So, after IT, the SNR is

q q p ˜ ˜ 0 ˜ ˜ ˜ 0 ˜ ˜ p C(f)Λjjβj ≈ Xj(I − H)Xjβj = 2Xj(I − H)Xjrj log n > 2ρ(α) log n.

The result applies by following the definition of iHC in (4.8) and the assumptions and the conclusion of Theorem 5.1 in [55]. Appendix C

Proofs of Chapter 3

Proof of Lemma1. The first order partial derivative of c(θ; τ1, τ2) with respect to

τ1 is

∂c(θ; τ1, τ2) ∂∆ ∂V0 ∝ 2V0 − ∆ , ∂τ1 ∂τ1 ∂τ1 where

  ∂∆ τ2 0 = log (D (τ1) − 1), ∂τ1 τ1      2! ∂V0 τ2 τ2 = 1 − 2(1 − τ1) 1 + log + (1 − 2τ2) 1 + log . ∂τ1 τ1 τ1

At τ1 = τ2, we have both partials equal to 0.

We further examine the partial derivative with respect to τ2 and then evaluate it at τ2 = τ1,

∂∆ 2(D(τ ) − τ ) ∂V 1 1 0 = ; = 2(1 − τ1); ∂τ2 τ2=τ1 τ1 ∂τ2 τ2=τ1 Z τ1 0 ∆ = − log (u)(D (u) − 1)du + log(τ1)(D(τ1) − τ1); V0 = τ(2 − τ1). τ2=τ1 0 τ2=τ1

177 Appendix C: Proofs of Chapter3 178

Thus

∂c(θ; τ , τ ) 1 2 = 0 ∂τ2 τ2=τ1 Z τ1 0 ⇔(D(τ1) − τ1)(2 − τ1) − (1 − τ1) − log(u)(D (u) − 1)du − (D(τ1) − τ1)(1 − τ1) log(τ1) = 0 0 Z τ1   0 2 − τ1 ⇔ − log(u)(D (u) − 1)du = (D(τ1) − τ1) − log(τ1) . 0 1 − τ1

Proof of Lemma 2. The Bahadur efficiency is c(θ; τ1, τ2) = (E1 − E0)²/V0 = ∆²/V0, where V0 is irrelevant to H1, and thus to ε. On the other hand, ∆ = −∫₀^{τ1} log(u/τ2)(D′(u) − 1)du. We can show that ∆ = εg(τ1, τ2, µ). This is equivalent to showing that D′(u) − 1 has such separability of ε.

−1 By (3.15), D(x) = 1 − F1(F0 (1 − x)) where F0(x) = G0(x) and F1(x) = (1 −

)G0(x) + G1(x; µ). We can further write

−1 −1 D(x) = 1 − (1 − )G0(G0 (1 − x)) − G1(G0 (1 − x); µ)

−1 = 1 − (1 − )(1 − x) − G1(G0 (1 − x); µ)

−1 = x +  − x − G1(G0 (1 − x); µ).

−1 D(x) − x = (1 − x − G1(G0 (1 − x); µ)).  0 −1  0 G1(G0 (1 − x); µ) D (x) − 1 =  −1 + 0 −1 . G0(G0 (1 − x)) This completes the proof.

Proof of Theorem 3.5.1. Following Lemma 1 for the first-order conditions for max-

imizing c(θ; τ1, τ2) in (3.20), note that for the Gaussian mixture model in (3.10),

D′(x) − 1 = ε(e^{µΦ⁻¹(1−x) − µ²/2} − 1) and D(x) − x = ε(1 − x − Φ(Φ⁻¹(1 − x) − µ)).

Therefore the optimal τ1 = τ2 = τ* does not depend on ε.

2−τ  R τ 0 Let f(τ) = (D(τ)−τ) 1−τ − log(τ) + 0 log(u)(D (u)−1)du. Note that f(0) = 0, f 0(0) = 1 − D0(0) > 0. A sufficient condition for the existence of the root τ ∗ is

Z τ f(1) = 1 − D0(1) − log(u)(D0(u) − 1)du > 0 0 Z τ ⇔  +  log(u)(eµΦ−1(1−x)−µ2/2 − 1)du < 0. 0

This is equivalent to

Z 1 e−µ2/2 − log(u)eµΦ−1(1−u)du > 0 ⇐⇒ µ > µ = 0.84865. 0

Next we will examine the second order derivatives. In a generic form,

2 ! 2 2∆2 ∂V0 ∂V0 2 ∂V0 ∂∆ + ∆ ∂ V0 2 ∂ c(θ; τ1, τ2) 1 ∂τ1 ∂τ2 ∂τ1 ∂τ2 ∂τ1∂τ2 ∂ ∆ ∂∆ ∂∆ = 2 − + 2∆ + 2 . ∂τ1∂τ2 V0 V0 V0 ∂τ1∂τ2 ∂τ1 ∂τ2

Again ∂∆ = 0 and ∂V0 = 0. We can simplify ∂τ1 ∂τ1 τ2=τ1 τ2=τ1

 2  2 ∂ V0 2 ∂ c(θ; τ , τ ) 1 ∆ ∂τ 2 ∂ ∆ 1 2 = − 1 + 2∆ , 2  2  ∂τ1 τ2=τ1 V0 V0 ∂τ1 τ2=τ1

 ∂V ∂∆ ∂2V  2 2 ∂V0 2 0 0 2 ∂ c(θ; τ , τ ) 1 2∆ ( ) 2 ∂τ ∂τ + ∆ ∂τ 2 ∂ ∆ ∂∆ 1 2 = ∂τ2 − 2 2 2 + 2∆ + 2( )2 , 2  2 2  ∂τ2 τ2=τ1 V0 V0 V0 ∂τ2 ∂τ2 τ2=τ1 2 ! ∂2c(θ; τ , τ ) 1 ∆ ∂ V0 ∂2∆ 1 2 ∂τ1∂τ2 = − + 2∆ . ∂τ1∂τ2 τ2=τ1 V0 V0 ∂τ1∂τ2 τ2=τ1

∗ The following are the relevant terms evaluated at τ2 = τ1 = τ

2 2 ∗ 2 ∂ V0 ∂ V0 2(1 − τ ) ∂ V0 2 = 2; 2 = ∗ ; = −2; ∂τ1 ∂τ2 τ ∂τ1∂τ2 ∂2∆ D0(τ ∗) − 1 ∂2∆ D(τ ∗) − τ ∗ ∂2∆ D0(τ ∗) − 1 2 = − ∗ ; 2 = − ∗2 ; = ∗ ; ∂τ1 τ ∂τ2 τ ∂τ1∂τ2 τ (D(τ ∗) − τ ∗)(2 − τ ∗) ∆ = ; V = τ ∗(2 − τ ∗). 1 − τ ∗ 0 Appendix C: Proofs of Chapter3 180

Plugging them back, we can get the conditions for local maximization. In particular,

2 ∂ c(θ;τ1,τ2) ∗ ∗ 0 ∗ the condition ∂τ 2 < 0 is equivalent to D(1) < D(τ ) + (1 − τ )D (τ ), 1 τ2=τ1=τ∗ which is always true because D(x) is a concave function. Finally, after some algebra,

2 2 2 ∂ c(θ;τ1,τ2) ∂ c(θ;τ1,τ2) ∂ c(θ;τ1,τ2) 2 the condition ( 2 2 − ( ) ) > 0 is equivalent to ∂τ ∂τ ∂τ1∂τ2 1 2 τ2=τ1=τ∗ D(τ ∗) − τ ∗ > 2 − τ ∗. D0(τ ∗) − 1

One sufficient condition for such inequality holds is D0(τ ∗) > 1 which is equivalent to

τ ∗ > 1 − Φ(µ/2).

Proof of Theorem 3.5.2. Taking the partial of b(θ; τ1, τ2) with respect to τ1, we

have

  ∂ ∂∆ ∂V1 b(θ; τ1, τ2) ∝ 2V1 − ∆ , ∂τ1 ∂τ1 ∂τ1 where

Z τ1 ∂V1 2 0 0 0 =[log (τ1)D (τ1) − 2 log(τ1)D (τ1) log(u)D (u)du ∂τ1 0 Z τ1 0 0 0 +2D (τ1) log(τ2) log(u)D (u)du + 2(D(τ1) − 1) log(τ2) log(τ1)D (τ1) 0

2 0 0 + log (τ2)(D (τ1) − 2D(τ1)D (τ1))].

Therefore,

∂V 1 2 0 0 2 2 0 = [log (τ1)D (τ1) − 2D (τ1) log (τ1) + log (τ1)D (τ1)] = 0. ∂τ1 τ2=τ1

Together with ∂V0 = ∂∆ = 0, as was shown in the proof of Theorem 3.5.1, ∂τ1 ∂τ1 τ1=τ2 τ1=τ2 we have

b(θ; τ1, τ2) = 0. ∂τ1 τ1=τ2 Appendix C: Proofs of Chapter3 181

∗ The choice τ1 = τ2 = τ meets the first order conditions for the optimality if

∂ ∗ b(θ; τ1, τ2) = 0 has a solution τ . This is equivalent to solve ∂τ2 ∗ τ1=τ2=τ   ∂∆ ∂V1 2V1 − ∆ = 0, ∗ ∂τ2 ∂τ2 τ1=τ2=τ where

Z τ1 0 ∂∆ D(τ1) − τ1 ∆|τ1=τ2 = − log(u)D (u)du + D(τ1) log(τ1) − τ1; |τ1=τ2 = ; 0 ∂τ2 τ1 Z τ1 Z τ1 2 0 0 2 V1|τ1=τ2 = [ log (u)D (u)du − ( log(u)D (u)du) 0 0 Z τ1 0 2 + 2(D(τ1) − 1) log(τ1) log(u)D (u)du + log (τ1)D(τ1)(1 − D(τ1))]; 0 Z τ1 ∂V1 D(τ1) − 1 0 |τ1=τ2 = 2 [ log(u)D (u)du − D(τ1) log(τ1)]. ∂τ2 τ1 0

Plug them in and after simplification, we want to solve

τ(1 − 2 log(τ))(D(τ) − 1) D(τ) − τ f (τ) = (g (τ))2 − g (τ) − g (τ) b 1 1 − τ 1 1 − τ 2 D(τ)(D(τ) − 1) log(τ)(1 − log(τ)) + = 0, 1 − τ

R 1 k 0 where gk(τ) = gk(τ; , µ) = 0 log (u)D (u)du.

It is easy to check that fb(0) = 0 and fb′(0) < 0. A sufficient condition for the existence of a root is that fb(1) > 0, i.e.,

2 0 0 fb(1) = (g1(1)) + D (1)g1(1) − (1 − D (1))g2(1) > 0.

Notice that g1(1) = g˜1(µ) − 1 and g2(1) = g˜2(µ) + 2. This is equivalent to

2 [(˜g1(µ)) − g˜1(µ) − g˜2(µ)] > 1 +g ˜1(µ). Appendix C: Proofs of Chapter3 182

2 Since (˜g1(µ)) − g˜1(µ) − g˜2(µ) < 0 and 1 +g ˜(µ) needs to be < 0, the sufficient

conditions for the existence of a root is

µ > µ = 0.84865,

1 +g ˜1(µ)  < 2 , (˜g1(µ)) − g˜1(µ) − g˜2(µ)

where µ is the same given in Theorem 3.5.1.

Proof of Theorem 3.5.3. Taking the partial derivative of $a(\theta;\tau_1,\tau_2)$ with respect to $\tau_1$, we have

$$\begin{aligned} \frac{\partial}{\partial\tau_1}\, a(\theta;\tau_1,\tau_2) &\propto \big(z_\alpha\sqrt{V_0} - \sqrt{n}\,\Delta\big)\left(V_1\frac{\partial V_0}{\partial\tau_1} - V_0\frac{\partial V_1}{\partial\tau_1}\right) - \sqrt{n}\,V_1\left(2V_0\frac{\partial\Delta}{\partial\tau_1} - \Delta\frac{\partial V_0}{\partial\tau_1}\right)\\ &\propto \left(V_1\frac{\partial V_0}{\partial\tau_1} - V_0\frac{\partial V_1}{\partial\tau_1}\right) - \frac{\sqrt{n}\,V_0}{z_\alpha}\left(2V_1\frac{\partial\Delta}{\partial\tau_1} - \Delta\frac{\partial V_1}{\partial\tau_1}\right). \end{aligned}$$

Following the proofs of Theorems 3.5.1 and 3.5.2, we have $\left.\frac{\partial V_1}{\partial\tau_1}\right|_{\tau_1=\tau_2} = \left.\frac{\partial V_0}{\partial\tau_1}\right|_{\tau_1=\tau_2} = \left.\frac{\partial\Delta}{\partial\tau_1}\right|_{\tau_1=\tau_2} = 0$, and thus

$$\left.\frac{\partial}{\partial\tau_1}\, a(\theta;\tau_1,\tau_2)\right|_{\tau_1=\tau_2} = 0.$$

The choice $\tau_1 = \tau_2 = \tau^*$ meets the first-order conditions for optimality if

$\left.\frac{\partial}{\partial\tau_2}\, a(\theta;\tau_1,\tau_2)\right|_{\tau_1=\tau_2} = 0$ has a solution $\tau^*$. This is equivalent to solving
$$\left.\left[\left(V_1\frac{\partial V_0}{\partial\tau_2} - V_0\frac{\partial V_1}{\partial\tau_2}\right) - \frac{\sqrt{n}\,V_0}{z_\alpha}\left(2V_1\frac{\partial\Delta}{\partial\tau_2} - \Delta\frac{\partial V_1}{\partial\tau_2}\right)\right]\right|_{\tau_1=\tau_2} = 0,$$

where $V_0|_{\tau_1=\tau_2} = \tau_1(2-\tau_1)$, $\left.\frac{\partial V_0}{\partial\tau_2}\right|_{\tau_1=\tau_2} = 2(1-\tau_1)$, and the rest of the terms are given in the proof of Theorem 3.5.2. We can simplify the equation to

$$\begin{aligned} f_c(\tau) = {}& (\tau - c_\tau)(g_1(\tau))^2 - \frac{\tau(D(\tau)-1)}{1-\tau}\Big(2\log(\tau)(1-\tau+c_\tau) - 2 + \tau - c_\tau\Big)\, g_1(\tau)\\ &+ \Big(\frac{D(\tau)-\tau}{1-\tau}\,\tau - c_\tau\Big)\, g_2(\tau) + \frac{\tau D(\tau)(D(\tau)-1)\log(\tau)}{1-\tau}\Big(\log(\tau)(1-\tau+c_\tau) - 2 + \tau - c_\tau\Big) = 0, \end{aligned}$$

where $c_\tau = c_n\sqrt{\tau(2-\tau)}$. Here we have $f_c(0) = 0$ and $f_c'(0) > 0$. The condition $f_c(1) < 0$ means

$$\epsilon^2(1-c_n)\Big[(\tilde g_1(\mu))^2 - \tilde g_1(\mu) - \tilde g_2(\mu)\Big] - \epsilon(1-c_n)\big(\tilde g_1(\mu) + 1\big) + \epsilon^2\big(2\tilde g_1(\mu) + \tilde g_2(\mu)\big) - \epsilon\big(2\tilde g_1(\mu) + \tilde g_2(\mu)\big) < 0.$$

For $c_n$ large enough, $(1-c_n)\big[(\tilde g_1(\mu))^2 - \tilde g_1(\mu) - \tilde g_2(\mu)\big] + 2\tilde g_1(\mu) + \tilde g_2(\mu) > 0$. Thus, sufficient conditions for the existence of the stationary point $\tau^*$ are

$$\mu > \mu' \quad\text{such that}\quad (1-c_n)\big[1 + \tilde g_1(\mu')\big] + 2\tilde g_1(\mu') + \tilde g_2(\mu') = 0,$$

$$\epsilon < \frac{(1-c_n)\big[1 + \tilde g_1(\mu)\big] + 2\tilde g_1(\mu) + \tilde g_2(\mu)}{(1-c_n)\big[(\tilde g_1(\mu))^2 - \tilde g_1(\mu) - \tilde g_2(\mu)\big] + 2\tilde g_1(\mu) + \tilde g_2(\mu)}.$$

Supplementary Figures

Figure C.1: Power comparison between the optimal and adaptive tests over signal strength $\mu$. Type I error rate $\alpha = 0.05$. Optimal: optimal TFisher at the maximizers $\tau_1^*, \tau_2^*$ of APE; ARTP: adaptive RTP with adaptive $K \in \{1, 0.05n, 0.5n, n\}$; oTFisher: soft-thresholding omnibus TFisher with adaptive $\tau \in \{0.01, 0.05, 0.5, 1\}$; ATPM: adaptive TPM (hard-thresholding) with adaptive $\tau \in \{0.01, 0.05, 0.5, 1\}$.

Figure C.2: Power comparison between the optimal and adaptive tests over signal proportion $\epsilon$. Type I error rate $\alpha = 0.05$. Optimal: optimal TFisher at the maximizers $\tau_1^*, \tau_2^*$ of APE; ARTP: adaptive RTP with adaptive $K \in \{1, 0.05n, 0.5n, n\}$; oTFisher: soft-thresholding omnibus TFisher with adaptive $\tau \in \{0.01, 0.05, 0.5, 1\}$; ATPM: adaptive TPM (hard-thresholding) with adaptive $\tau \in \{0.01, 0.05, 0.5, 1\}$.

Appendix D

Proofs of Chapter 4

Proof of Theorem 4.2.1.

Let
$$U \sim N(0,1), \qquad V = \rho U + \sqrt{1-\rho^2}\,Z, \qquad \rho = \rho_{ij},$$
where $Z$ is a standard normal random variable independent of $U$. Thus
$$\mathrm{Cor}\big(Z_i^2 I(|Z_i| > b),\, Z_j^2 I(|Z_j| > b)\big) = \mathrm{Cor}\big(U^2 I(|U| > b),\, V^2 I(|V| > b)\big).$$
Under this representation, the underlying random variables $U$ and $Z$ are independent.

Next we will focus on the calculation of

$$\begin{aligned} &E\big(U^2 I(U^2 > b^2)\, V^2 I(V^2 > b^2)\big)\\ &\quad= E\Big[U^2\big(\rho U + \sqrt{1-\rho^2}\,Z\big)^2\, I(|U| > |b|)\, I\big(|\rho U + \sqrt{1-\rho^2}\,Z| > b\big)\Big]\\ &\quad= E\Big[\big(\rho^2 U^4 + 2\rho\sqrt{1-\rho^2}\,U^3 Z + (1-\rho^2) U^2 Z^2\big)\, I(|U| > |b|)\, I\big(|\rho U + \sqrt{1-\rho^2}\,Z| > b\big)\Big] \end{aligned}$$

(D.1)

The integration region is $\mathbb{R}^2 \setminus \big(\{-b < u < b\} \cup \{g(u) < z < f(u)\}\big)$, where $f(u) = \frac{-\rho u + b}{\sqrt{1-\rho^2}}$ and $g(u) = \frac{-\rho u - b}{\sqrt{1-\rho^2}}$. For visualization, see Figure D.1.
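The equivalence between the event $\{|V| < b\}$ and the band $\{g(U) < Z < f(U)\}$ underlying this region can be spot-checked directly; the following is a small illustrative sketch (the values of $\rho$ and $b$ are arbitrary).

```python
# Small sketch (illustration only): check that |V| < b  <=>  g(U) < Z < f(U),
# where V = rho * U + sqrt(1 - rho^2) * Z and f, g are as defined above.
import numpy as np

rng = np.random.default_rng(0)
rho, b = 0.6, 1.2          # arbitrary illustrative values
s = np.sqrt(1 - rho**2)
u = rng.standard_normal(100_000)
z = rng.standard_normal(100_000)
v = rho * u + s * z

f_u = (-rho * u + b) / s
g_u = (-rho * u - b) / s
print(np.array_equal(np.abs(v) < b, (z > g_u) & (z < f_u)))   # expected: True
```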


Figure D.1: Integration region in equation (D.1): $\{U > b\} \cup \{U < -b\} \cup \{Z > f(u)\} \cup \{Z < g(u)\}$, where $f(u) = \frac{-\rho u + b}{\sqrt{1-\rho^2}}$ and $g(u) = \frac{-\rho u - b}{\sqrt{1-\rho^2}}$.

Denote $S = \rho^2 U^4 + 2\rho\sqrt{1-\rho^2}\,U^3 Z + (1-\rho^2) U^2 Z^2$. By the inclusion-exclusion principle,

$$\begin{aligned} &E\Big[S\, I(|U| > |b|)\, I\big(|\rho U + \sqrt{1-\rho^2}\,Z| > b\big)\Big]\\ &\quad= ES - E\big[S\, I(-b < U < b)\big] - E\big[S\, I(g(U) < Z < f(U))\big] + E\big[S\, I(-b < U < b)\, I(g(U) < Z < f(U))\big] \end{aligned}$$

(D.2)

It is obvious that $ES = 3\rho^2 + 1 - \rho^2 = 1 + 2\rho^2$. For the other three terms, we need Lemma 10 to deal with truncated normal moments. Now we have all the information to calculate equation (D.2).

$$\begin{aligned} E\big[S\, I(-b < U < b)\big] &= E\Big[\big(\rho^2 U^4 + 2\rho\sqrt{1-\rho^2}\,U^3 Z + (1-\rho^2) U^2 Z^2\big)\, I(-b < U < b)\Big]\\ &= \rho^2 M_4(b) + (1-\rho^2) M_2(b)\\ &= (1 + 2\rho^2)\big(1 - \tau - 2b\phi(b)\big) - 2\rho^2 b^3 \phi(b) \end{aligned}$$

(D.3)

$$\begin{aligned} &E\big[S\, I(g(U) < Z < f(U))\big] - E\big[S\, I(-b < U < b)\, I(g(U) < Z < f(U))\big]\\ &\quad= E\big[S\, I(U > b \text{ or } U < -b)\, I(g(U) < Z < f(U))\big]\\ &\quad= \int_{\mathbb{R}\setminus[-b,b]} \int_{g(u)}^{f(u)} \big(\rho^2 u^4 + 2\rho\sqrt{1-\rho^2}\,u^3 z + (1-\rho^2) u^2 z^2\big)\,\phi(z)\phi(u)\,dz\,du \qquad \text{(D.4)}\\ &\quad= \int_{\mathbb{R}\setminus[-b,b]} \big(\rho^2 u^4 M_0(u) + 2\rho\sqrt{1-\rho^2}\,u^3 M_1(u) + (1-\rho^2) u^2 M_2(u)\big)\,\phi(u)\,du\\ &\quad= \int_{\mathbb{R}\setminus[-b,b]} u^2 h(u)\,du, \end{aligned}$$
where $h(u) \overset{\mathrm{def}}{=} \big(\rho^2 u^2 M_0(u) + 2\rho\sqrt{1-\rho^2}\,u\, M_1(u) + (1-\rho^2) M_2(u)\big)\,\phi(u)$, and here $M_k(u)$ denotes the truncated moment of Lemma 10 evaluated with limits $c = g(u)$ and $a = f(u)$.

Putting these back into equations (D.1) and (D.2), we conclude that

$$E\big(U^2 I(|U| > b)\, V^2 I(|V| > b)\big) = 2\rho^2 b^3\phi(b) + (1 + 2\rho^2)\big(\tau + 2b\phi(b)\big) - \int_{\mathbb{R}\setminus[-b,b]} u^2 h(u)\,du \qquad \text{(D.5)}$$

$$E\big[V^2 I(|V| > b)\big] = E\big[U^2 I(|U| > b)\big] = 1 - E\big[U^2 I(|U| < b)\big] = 1 - M_2(b) = \tau + 2b\phi(b) \qquad \text{(D.6)}$$

Therefore the covariance

$$\mathrm{Cov}\big(U^2 I(|U| > b),\, V^2 I(|V| > b)\big) = 2\rho^2 b^3\phi(b) + (1 + 2\rho^2)\big(\tau + 2b\phi(b)\big) - \int_{\mathbb{R}\setminus[-b,b]} u^2 h(u)\,du - \big(\tau + 2b\phi(b)\big)^2 \qquad \text{(D.7)}$$

Finally, the variance is

$$\mathrm{Var}\big(U^2 I(|U| > b)\big) = E\big[U^4 I(|U| > b)\big] - \Big(E\big[U^2 I(|U| > b)\big]\Big)^2 = 3 - M_4(b) - \big(1 - M_2(b)\big)^2 \qquad \text{(D.8)}$$
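Formulas (D.6) and (D.8) are straightforward to verify by simulation. The following is a minimal sketch (illustration only; the value of $b$ is arbitrary), where $M_2(b)$ and $M_4(b)$ are computed by numerical integration rather than by the recursion of Lemma 10 below.

```python
# Minimal sketch (illustration only): Monte Carlo check of (D.6) and (D.8).
# Here tau = 2 * (1 - Phi(b)); M_2(b) and M_4(b) are computed by quadrature.
import numpy as np
from scipy import integrate
from scipy.stats import norm

b = 1.5
tau = 2 * norm.sf(b)
M2, _ = integrate.quad(lambda z: z**2 * norm.pdf(z), -b, b)
M4, _ = integrate.quad(lambda z: z**4 * norm.pdf(z), -b, b)

rng = np.random.default_rng(1)
u = rng.standard_normal(2_000_000)
t = u**2 * (np.abs(u) > b)

print(t.mean(), tau + 2 * b * norm.pdf(b))   # (D.6): values should be close
print(t.var(),  3 - M4 - (1 - M2)**2)        # (D.8): values should be close
```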

Lemma 10. Consider $Z \sim N(0,1)$. Let $M_n = E\big[Z^n I(c < Z < a)\big]$. Then, with $M_0 = \Phi(a) - \Phi(c)$ and $M_1 = \phi(c) - \phi(a)$,
$$M_n = (n-1)M_{n-2} - \big(a^{n-1}\phi(a) - c^{n-1}\phi(c)\big).$$

Proof of Lemma 10.

$$\begin{aligned} M_n &= \int_c^a z^n\phi(z)\,dz = \int_c^a -z^{n-1}\,d\phi(z)\\ &= -z^{n-1}\phi(z)\Big|_c^a - (n-1)\int_c^a -z^{n-2}\phi(z)\,dz\\ &= (n-1)M_{n-2} - \big(a^{n-1}\phi(a) - c^{n-1}\phi(c)\big). \end{aligned}$$
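As a closing illustration (a sketch under the notation above, not code from the thesis; the parameter values are arbitrary), the recursion of Lemma 10 can be implemented directly and combined with $h(u)$ from (D.4) to evaluate the covariance formula (D.7), with a Monte Carlo estimate serving as an independent check.

```python
# Sketch (illustration only, arbitrary parameter values): implement the
# recursion of Lemma 10, then use it with h(u) from (D.4) to evaluate the
# covariance formula (D.7), and compare against a Monte Carlo estimate.
import numpy as np
from scipy import integrate
from scipy.stats import norm

def M(n, c, a):
    """Truncated normal moment M_n = E[Z^n I(c < Z < a)] via Lemma 10."""
    if n == 0:
        return norm.cdf(a) - norm.cdf(c)
    if n == 1:
        return norm.pdf(c) - norm.pdf(a)
    return (n - 1) * M(n - 2, c, a) - (a**(n - 1) * norm.pdf(a) - c**(n - 1) * norm.pdf(c))

# Check the recursion against direct numerical integration
c, a = -0.7, 2.1
for n in range(5):
    direct, _ = integrate.quad(lambda z: z**n * norm.pdf(z), c, a)
    assert abs(direct - M(n, c, a)) < 1e-8

# Evaluate the covariance formula (D.7)
rho, b = 0.6, 1.2
s = np.sqrt(1 - rho**2)
tau = 2 * norm.sf(b)

def h(u):
    f_u, g_u = (-rho * u + b) / s, (-rho * u - b) / s
    return (rho**2 * u**2 * M(0, g_u, f_u)
            + 2 * rho * s * u * M(1, g_u, f_u)
            + (1 - rho**2) * M(2, g_u, f_u)) * norm.pdf(u)

tail = sum(integrate.quad(lambda u: u**2 * h(u), lo, hi, limit=200)[0]
           for lo, hi in [(-np.inf, -b), (b, np.inf)])
cov_d7 = (2 * rho**2 * b**3 * norm.pdf(b)
          + (1 + 2 * rho**2) * (tau + 2 * b * norm.pdf(b))
          - tail
          - (tau + 2 * b * norm.pdf(b))**2)

# Independent Monte Carlo estimate
rng = np.random.default_rng(2)
U = rng.standard_normal(2_000_000)
Z = rng.standard_normal(2_000_000)
V = rho * U + s * Z
x = U**2 * (np.abs(U) > b)
y = V**2 * (np.abs(V) > b)
print(cov_d7, np.cov(x, y)[0, 1])   # the two values should be close
```

With the illustrative values above, the analytical value from (D.7) and the Monte Carlo estimate typically agree to about two or three decimal places.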