On genetic correlation estimation with summary statistics from genome-wide association studies

Bingxin Zhao and Hongtu Zhu

University of North Carolina at Chapel Hill March 5, 2019

Abstract Genome-wide association studies (GWAS) have been widely used to examine the association between single nucleotide polymorphisms (SNPs) and complex traits, where both the sample size n and the number of SNPs p can be very large. Recently, cross-trait polygenic risk score (PRS) method has gained extremely popular for assessing genetic correlation of complex traits based on GWAS summary statistics (e.g., SNP effect size). However, empirical evidence has shown a common bias phenomenon that even highly significant cross-trait PRS can only account for a very small amount of genetic (R2 often < 1%). The aim of this paper is to develop a novel and powerful method to address the bias phenomenon of cross-trait PRS. We theoretically show that the estimated genetic correlation is asymptotically biased towards zero when complex traits are highly polygenic/omnigenic. When all p SNPs are used to construct PRS, we show that the asymptotic bias of PRS estimator is independent of the unknown number of causal SNPs m. We propose a consistent PRS estimator to correct such asymptotic bias. We also develop a novel estimator of genetic correlation which is solely based on two sets of GWAS summary statistics. In addition, we investigate whether or not SNP screening by GWAS p-values can lead to improved estimation and show the effect arXiv:1903.01301v1 [stat.ME] 4 Mar 2019 of overlapping samples among GWAS. Our results may help demystify and tackle the puzzling “missing genetic overlap” phenomenon of cross-trait PRS for dissecting the genetic similarity of closely related heritable traits. We illustrate the finite sample performance of our bias-corrected PRS estimator by using both numerical experiments and the UK Biobank data, in which we assess the genetic correlation between brain white matter tracts and neuropsychiatric disorders.

Keywords. Summary statistics; Genetic correlation; Polygenic risk score; GWAS; Omnigenic; Polygenic; Marginal screening; Bias correction.

1 1 Introduction

The major aim of many genome-wide association studies (GWAS) [Visscher et al., 2017] is to examine the genetic influences on complex human traits given that most traits have a polygenic architecture [Fisher, 1919, Gottesman and Shields, 1967, Hill, 2010, Orr and Coyne, 1992, Penrose, 1953, Wray et al., 2018]. That is, a large number of single nucleotide polymorphisms (SNPs) have small but nonzero contributions to the phenotypic variation. Many statistical methods have been developed on the use of individual-level GWAS SNP data to infer the and cross-trait genetic correlation in general populations [Chen, 2014, Golan et al., 2014, Guo et al., 2017b, Jiang et al., 2016, Lee and Van der Werf, 2016, Lee et al., 2012, Loh et al., 2015, Yang et al., 2010, 2011]. For instance, heritability h2 can be estimated by aggregating the small contributions of a large number of SNP markers, resulting in the SNP heritability estimator [Yang et al., 2017]. For two highly polygenic traits, cross-trait genetic corre- lation can be calculated as the correlation of the genetic effects of numerous SNPs on the two traits [Guo et al., 2017b, Lu et al., 2017, Pasaniuc and Price, 2017, Shi et al., 2017]. A growing number of empirical evidence [Chatterjee et al., 2016, Dudbridge, 2016, Ge et al., 2017, Shi et al., 2016, Yang et al., 2010] supports the polygenicity of many complex human traits and verify that the common (minor allele frequency [MAF] ≥ 0.05) SNP can account for a large amount of heritability of many complex traits. The term “omnigenic” has been introduced to acknowledge the widespread causal genetic variants contributing to various complex human traits [Boyle et al., 2017]. Accessing individual-level SNP data is often inconvenient due to policy restrictions, and a recent standard practice in the genetic community is to share the summary association statistics, including the estimated effect size, , p-value, and sample size n, of all genotyped SNPs after GWAS are published [MacArthur et al., 2016, Zheng et al., 2017]. Therefore, joint analysis of summary-level data of different GWAS provides new opportunities for further analyses and novel genetic discoveries, such as the shared genetic basis of complex traits. It has became an active research area to examine the heritability and cross-trait genetic correlation based on GWAS summary statistics [Bulik-Sullivan et al., 2015a,b, Dudbridge, 2013, Lee et al., 2013, Lu et al., 2017, Palla and Dudbridge, 2015, Shi et al., 2017, Weissbrod et al., 2018, Zhou, 2017]. Among them, the cross-trait polygenic risk score (PRS) [Power et al., 2015, Purcell et al., 2009] has became a popular routine to measure genetic similarity of polygenic traits with widespread applications [Bogdan et al., 2018, Clarke et al., 2016, Hagenaars et al., 2016, Mistry et al., 2018, Nivard et al., 2017, Pouget et al., 2018, Socrates et al., 2017]. Compared with other popular methods such as cross-trait LD score regression [Bulik-Sullivan et al., 2015a], Bivariate GCTA [Lee et al., 2012], and BOLT-REML [Loh et al., 2015], cross-trait PRS offers at least two unique strengths as follows. First, cross-trait PRS only requires the GWAS summary statistics of one trait obtained from a large discovery GWAS, while it allows those of the other trait obtained from a much smaller GWAS dataset. In contrast, most other methods require large GWAS data for both traits on either summary or individual-level. Second, cross-trait

2 PRS can provide genetic propensity for each sample in the testing dataset, enabling further prediction and treatment. However, given these strengths of cross-trait PRS, empirical evidence has shown a common bias phenomenon that even highly significant cross-trait PRS can only account for a very small amount of variance (R2 often < 1%) when dissecting the shared genetic basis among highly related heritable traits [Bogdan et al., 2018, Clarke et al., 2016, Mistry et al., 2018, Socrates et al., 2017]. Except for some introductory studies [Daetwyler et al., 2008, Dudbridge, 2013, Visscher et al., 2014], few attempts have ever been made to rigorously study cross-trait PRS and to explain such a counterintuitive phenomenon. This paper fills this significant gap with the following contributions. By compre- hensively investigating the properties of cross-trait PRS for polygenic/omnigenic traits, our first contribution in Section 2 is to show that the estimated genetic correlation is asymptotically biased towards zero, uncovering that the underlying genetic overlap is seriously underestimated. Furthermore, when all p SNPs are used in cross-trait PRS, we show that the asymptotic bias is largely determined by the triple (n, p, h2) and is independent of the unknown number of causal SNPs of the two traits. Thus, our second contribution in Section 2 is to propose a consistent estimator by correcting such asymptotic bias in cross-trait PRS. We also develop a novel estimator of genetic correlation which only requires two sets of summary statistics. Next, in Section 3, we show that when cross-trait PRS is constructed using q top- ranked SNPs whose GWAS p-values pass a given threshold, in addition to (n, p, h2), the asymptotic bias will also be determined by the number of causal SNPs m, since the sparsity m/p determines the quality of the q selected SNPs. Particularly, for highly polygenic/omnigenic traits with dense SNP signals, such screening may fail, resulting in larger bias in genetic correlation estimation. In Section 4, we generalize our results to quantify the influence of overlapping samples among GWAS. We show that our bias- corrected estimator for independent GWAS can be smoothly extended to GWAS with partially or even fully overlapping samples. The remainder of this paper is structured as follows. Sections 2 and 3 study the cross-trait PRS with all SNPs and selected SNPs, respectively. Section 4 considers the effect of overlapping samples among different GWAS. Sections 5 and 6 summarize the numerical results on numerical experiments and real data analysis. The paper concludes with some discussions in Section 7.

2 Cross-trait PRS with all SNPs

Since cross-trait PRS is designed for polygenic traits based on their GWAS summary statistics, we first introduce the polygenic model and some properties of GWAS sum- mary statistics. We note that the standard approach in GWAS is marginal screening. That is, the marginal association between the and single SNP is assessed each at a time, while adjusting for the same set of covariates including population stratification [Price et al., 2006]. Marginal screening procedures often work well to prioritize important variables given that the signals are sparse [Fan and Lv, 2008], but

3 they may have noisy outcomes when signals are dense [Fan et al., 2012], which is often the case for GWAS of highly polygenic traits.

2.1 Polygenic trait and GWAS summary statistics

Let X(1) be an n × m matrix of the SNP data with nonzero effects, and X(2) be an n × (p − m) matrix of the null SNPs, resulting in an n × p matrix of all SNPs, donated by X = [X(1), X(2)] = (x1, ··· , xm, xm+1, ··· , xp), where xi is an n × 1 vector of the SNP i, i = 1, ··· , p. Columns of X are assumed to be independent after linkage disequilibrium (LD)-based pruning. Further, we assume column-wise normalization on X is performed such that each variable has sample mean zero and sample variance one. Therefore, we may introduce the following condition on SNP data:

Condition 1. Entries of X = [X(1), X(2)] are real-value independent random variables with mean zero, variance one and a finite eighth order moment.

Let y be an n × 1 vector of continuous polygenic phenotype. We assume a linear polygenic structure between y and X as follows:

p m X X y = xiβi +  = xiβi +  = X(1)β(1) + , (1) i=1 i=1

T T T  where β = (β1, ··· , βm, βm+1, ··· , βp) = β(1), β(2) is a vector of genetic effects T T T such that βi in β(1) = (β1, ··· , βm) are random variables (i = 1, ··· , m), β(2) = T (βm+1, ··· , βp) are zeros, and  represents the vector of independent non-genetic random errors. For simplicity, we assume that there are no other fixed effects in model (1), or equivalently, other covariates can be well observed and adjusted for. We allow flexible ratios among (n, p, m). As min(n, p) → ∞, we assume m m = γ → γ and = ω → ω for 0 < γ ≤ ∞ and 0 ≤ ω ≤ 1, n 0 p 0 0 0 which should satisfy most large-scale GWAS of polygenic traits. Most GWAS use ordinary least squares (OLS) to perform linear regression given by

∗ y = 1nµ + xiβi + i (2) for i = 1, ··· , p, where 1n is an n×1 vector of ones. Let µb and βbi be the OLS estimates of µ and βi, respectively, for i = 1, ··· , p. When y and xi are normalized and both n and m → ∞, under Condition 1 and model (1), it can be shown that

  E µb = 0, E βbi = βi, and

( −1 Pm 2  n · j6=i βj = O(m/n), for i ∈ [1, m]; Var βbi = −1 Pm 2 (3) n · j=1 βj = O(m/n), for i ∈ [m + 1, p].

4 ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●●● GWAS of polygenic trait: estimation GWAS of polygenic trait: estimation● GWAS of polygenic trait: estimation●●● ● ● ● ●● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●●● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● n=10000, m=100 n=10000, m=1000 ● n=10000, m=5000 ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ●●●●● ● ● ● ● ● ● ●●●●●●●● ●● ● ● ●●● ● 1.0 1.0 1.0 ● ● ● ● ● ● ● ● ● ● ● ●● ● ●●● ●●●●●●●●●●● ● ● ● ● ●●●● ● ●●●●●●●●●●● ● ●● ● ● ●●● ● ● ●●●● ●●●●● ●●● ●● ● ● ● ● ●● ● ● ● ●●●●●● ● ● ● ● ● ● ● ● ●● ●●●●●●●● ● ● ●●●● ● ●● ●● ● ● ● ● ●●●●●●●●●● ● ●●●● ●●●● ● ●● ●● ●● ●● ●●● ●●●●● ●●●●●●●● ● ● ● ● ●●● ●●●●●●●● ●●● ●● ●●● ● ● ● ●●● ●●●●●● ●●● ●●● ●● ● ● ● ●●● ●● ● ●● ● ●●●● ● ●●●● ●●●●●●●●●●●● ●●●●● ● ●●● ● ● ● ● ● ● ●● ● ●●●●●●●● ●● ●●●●●●●●● ● ●●●●● ● ● ● ●●● ● ● ● ● ● ●● ●●●●●●●●●● ●● ●●●●●●●●●● ●●●●● ●●●●● ●● ●●●● ● ●● ● ● ●● ●●●●●● ●●●●●●●●●●●●●●●●●● ●●● ●● ● ● ●● ● ●●● ●●●●●● ●●●●●●●●●●●●● ● ●●●●●●●● ● ● ●● ● ● ● ● ●●● ● ●● ●●● ● ●●●●●●●●●●●●●●● ●● ●●● ●●● ● ● ●●●●●●● ● ●●● ●●●●●●●●●● ●●●●● ● ● ●●●●● ●●●● ● ●●●●●●● ●● ● ● ● ●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●● ● ● ●●● ● ●● ●●● ●● ●●●●● ●●●●●●●●●●●●●●●●●●●●●● ● ● ● ●●●●●● ● ● ●●●●●●●●●●●●● ●●●●●●●●●●●● ●●●●●●●● ● ● ● ●●●● ●● ● ●● ●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●● ●● 0.5 ● 0.5 ●● 0.5 ● ● ● ● ●● ●● ●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●● ● ●●● ●● ● ● ● ●● ●●●●● ● ● ●●●●● ●●●●●●●●●●●●●●●●●●●●●●● ●●●● ●●●●● ●● ●● ● ●●●● ●● ● ● ● ● ●●●●●●● ●●●●●●●●● ●●●●●●●●●●●●●●●● ●● ● ●● ●● ●●●● ● ● ●● ● ● ● ●● ●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●● ●●● ●● ● ● ●●●●●●●●●●●●● ● ● ●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●● ● ● ● ● ●●●●● ●●●● ●●● ● ● ●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●● ●●● ● ●● ●● ●●●●●●●●●●●●●●● ●● ●● ● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● ●●●●●●●● ●● ●● ● ● ●● ● ●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●● ● ●●●●●●●●●●●● ● ●●●● ● ● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●● ●●●● ● ●●●●●●● ●●●● ●●● ●●●●● ●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● ●●●● ● ●● ● ●●●●●●●●●●●● ● ●●●●● ● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●● ● ● ● ●●●●●●●● ●● ● ●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●● ●● ● ● ●●● ●●●●●●●●●●● ● ● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●● ● ● ●●●●●●●●●●●●●●●●●● ●●● ● ●● ● ● ●●●●● ●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●● ● ● ●●●● ●●●●● ●●●●●●●●●● ● ●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●● ● ●● ●●●●●●●●●●●● ● ● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●● ●● ● ● ●●●●●●●●●●●●● ● ● ● ● ● ● ●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●● ●●● ● ●●● ●●●●●●●●●●●●●●●●● ● ●● ●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● ●●●●●● ●● ● ● ● ● ● ●●●●●●●●●●●●●●●● ●● ●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●● ●● ● ● ●●●●●● ●●●● ●● ● ● ● ●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●● ●●● ● ● ● ●●●●● ●●●●●●●●● ●● ●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● ●●● ● 0.0 ● 0.0 0.0 ● ● ● ●● ● ● ●●● ● ●●● ● ● ●●●●●●●●● ●● ●● ●● ●●●● ● ● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●● ●●●●● ● ●● ●●●●●●●●●●●●●● ● ● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● ●● ● ●●●●●●●●●●● ●●●● ● ●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●● ●● ● ●●● ●●●●●●●● ●● ●● ●●●●● ●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●● ●●●●●●● ● ● ● ●●● ●●●●●●●●●●● ● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●● ● ● ●● ● ●● ●●● ●●● ● ●● ●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●● ●● ●● ● ●●●●● ●● ●●●●●●●● ●●●● ● ● ●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● ● ● ●● ● ●●● ●●●●● ●● ●● ● ● ● ● ● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●● ● ●●● ● ●● ●●●●●●● ●● ●●● ● ●● ● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●● ● ● ●● ●● ● ● ●●●●●● ●●●● ● ● ●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● ●●● ● ● ● ● ●●● ●● ●●● ●● ● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● ●● ● ● ● ● ●● ●● ● ●● ● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●● ● ● ● ●● ● ●●●● ● ● ●●●● ●● ●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●● ●●●● ● ● ● ●●●●●● ●● ● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●● ● ●●●● ● ● ●● ● ●●●●●● ●● ● ● ●● ●●●●●●●●●● ●●●●●●●●●●●●● ●●●● ●●●●●●● ● ● ● ●● ●● ●●●●●●●●● ●●● ● ●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●● ●● ● ●● ● ●● ●●●●● ●● ● ● ●●●●●● ●● ●●● ●●●●●●●●●●●●●●●●●●●●●●● ●●● Marginal Estimates ● Marginal Estimates Marginal Estimates ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ●●●●●●●●●● ● ●● ● ● ●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●● ● ● ● ●● ● ● ● ● ● ●●●●● ●●● ●●● ●●●●●●●●●●●●●●●●●●●● ● ● ●●●● ●●●●●● ● ● ●● ● ●●●●●●●●●●●●●●●●●●● ●●●●●●●●●● ● ●●●● ●● ●● ●●●● ●● ●●● ● ●●●●●● ●● ●●●●●●●●●●●●●●●●●●●●● ●●●● ● ● ●● ●●● ● ● ● ● ●● ●● ●●●●●●●●●●●●●●●●●●●● ●●●● ● −0.5 −0.5 ● ●● −0.5 ● ● ● ● ●●● ●● ● ●●●●●● ● ●● ●●● ● ● ●●●●●●●●●● ●●●●● ●●●● ● ●●● ● ● ● ●●● ● ●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●● ● ● ● ● ● ●●● ●●● ● ● ● ● ●●●●●●●●●●●●●● ●●●●●●● ● ● ● ●● ●●● ●● ●●●●●●● ●●●●●●●●●● ●●●● ● ●●● ●●● ●●● ●●●●● ●●●●●●● ●●●●● ●● ● ●● ●● ● ●●● ● ●●● ●●●●●●●●●●●●●●●●●●●●●●●●● ● ● ●● ●● ● ● ●●●●● ● ● ●●●●●● ●●●●●● ●● ● ● ● ● ●● ● ● ● ●●● ●●●●●●●●● ●●●●●●● ● ● ● ●● ● ●●● ●●●●●●●●●●●●●● ●● ●●● ● ● ● ● ● ● ● ●●● ●● ●●●●●●●●● ● ●●● ● ● ●●●●●● ●●●●●●● ●● ●● ● ● ● ● ● ● ●●● ●● ●●● ●●●●●●●●● ● ● ● ● ● ●● ● ● ●● ●●● ●● ●● ● ● ● ● ● ●●●●●● ●●●●● ●● ●●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ●●●●● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ●● ● ● ●●●●● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ●●●●● ● ● ● −1.0 −1.0 −1.0 ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ●● ● −1.0 −0.5 0.0 0.5 1.0 −1.0 −0.5 0.0 0.5 1.0 −1.0● ● ● −0.5● 0.0 0.5 1.0 ● ● ● ● ●● ● ● ● ● ● ● True Beta Value True Beta Value ●● ● True Beta Value ●

GWAS of polygenic trait: testing GWAS of polygenic trait: testing GWAS of polygenic trait: testing n=10000, m=100 n=10000, m=1000 n=10000, m=5000

● ● 25 ● ● ●

● 12

120 ●

● ● 10 20

100 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 8 ● ● ● ● 80 ● ● ● ● 15 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 6 ● ● ● ● ● ● ●● ● ● 60 ● ● ● ● ●● ●● ● ● ●● ● ● ● ● ●●●● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ● 10 ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ●● ● ●● ● ●● ● ● ● ●● ● ● ● ●● ● ●● ● ● ● ● ●● ● ●● ●● ● ● ● ● ● ●● ● ● ● ●● ●● ● 4 ●● ● ●●● ● ● ● ●● ● ●● ● ●● ● ● ● ● ● ●● ●● ●● ● ● ●● ● ● ●● ●●● ● ● ● ● ● 40 ● ●●● ● ●●● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● −log10(P−values) ● −log10(P−values) ●● ● ● ●● −log10(P−values) ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ●● ● ● ●●●●● ● ● ● ● ● ● ●●● ●●●●● ● ● ● ●●● ● ●● ● ● ● ● ●●●● ● ● ● ● ● ● ● ● ● ● ●●●●●●●● ●● ● ● ●●● ● ● ●●●●● ●● ●● ● ● ●●●●●● ●●●●●● ●● ●● ●● ● ● ● ● ● ●●● ●●● ● ●●●●● ●● ● ● ● ● ●● ●●●● ●● ● ● ● ● ●● ●●●●● ●●●● ● ● ● ●●●●●● ●●●●●●●●●● ● ● ● ● ●●● ● ● ●● ● ● ●●● ●●●●●● ● ●●●● ●● ●●● ● ●● ●●●● ●●● ●● ● ● ● ● ● ●● ●●● ●● ● ● ● ●● ● ● ●● ● ●●●●●●● ●● ●●●●●● ● ● ●● ● ● ●●●●●● ●● ●●●● ●● ● ● ● ● ●●● ● ● ●●●●● ●●●● ●●●●●●●●●●●●●●●●●●●●●●●● ●● ● ● ●● ●●●●●●●● ● ● ● ●●●● ●● ● 5 ● ●● ●●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●●●●●● ● ● ● ●● ● ●●●●●●●●●●●●● ●●●●●●● ●● ● ● ●●●●●●●●●●●●● ●●●●●●● ●●● ● ● ●● ● ● ● ●● ● ● ●●● ●● ●●●●●●● ●●●●●●●●●● ● ●●● ●● ●●● ●●● ●●●●●●●● ●●● ● ●● ● ● ● ●●●●●●●●● ● ●●●● ● ●●●●● ●●●●●● ●●●●●●●●●●●●●●●●●●●●●●● ● ●● ● ● ●●●●●●●●●●● ●●●●●●●●● ●● ● ● ● ● ● ●● ● ● ● ●●● ●● 2 ● ●●●● ●●● ●●●●● ●●●● ●● ● ●● ●●●● ● ●●●● ●●●●●●●●●●●● ●●●●●● ● ●●● ● ●● ●●● ●●●●●●● ● ●●●● ●● ● ●● ●●●● ●●●●●●●●●●●●●●●●● ●●●●●●●● ●●●● ●●●●● ● ●●●●●●● ● ●●● ●●●●●●●●●●●●●●● ●●●●● ● 20 ● ● ●●●●●● ●●● ● ● ● ●● ●● ●●●●●●●●●●●● ●●●● ●●●● ●●●●●●● ●● ● ●● ●●●●●● ●●●●●●●●●●●●●●●●● ●●●●●●●● ● ●● ● ●●●●●● ●● ● ● ● ● ●●●●● ● ●● ●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●● ● ● ● ● ● ●●●●●● ●● ●● ●● ●●●●● ● ● ●●●●●●● ●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●● ●●●●● ●●●●●●● ●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●● ●●●● ● ● ● ●● ●● ●● ●●●● ● ●● ●● ●● ● ● ●● ● ● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●● ●●● ● ● ● ●●● ●● ●● ●●●● ● ●●●●●●●●●●●●●●● ●● ●● ●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● ● ●● ●● ●● ●● ●●●●●●●● ●●● ●●●●● ●●●● ●●●● ● ● ● ● ●● ● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● ●● ●● ● ●●● ● ●●● ●●●●●●●● ●● ●●● ●●●●●●●●●●●●●●●●●●●● ● ●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●● ● ●●●● ●●●● ● ●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●● ●●●●● ●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●● ● ●● ●● ●●●● ●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●● ● ●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● ● ●● ● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● ● ●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●● ● ●● ● ●●● ●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● ● ●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●● ●●●●● ●●● ●● ● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● ● ● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●● 0 0 0

−1.0 −0.5 0.0 0.5 1.0 −1.0 −0.5 0.0 0.5 1.0 −1.0 −0.5 0.0 0.5 1.0 True Beta Value True Beta Value True Beta Value

Fig. 1: Estimation (upper panels) and testing (bottom panels) of marginal genetic effects in GWAS of polygenic traits. We set n = 10, 000 and m = p = 100, 1000 and 5000.

Equation (3) indicates that the variance or mean squared error (MSE) of βbi calculated from model (2) moves up linearly as m → ∞. Therefore, the T scores for testing

H0i : βi = 0 versus H1i : βi 6= 0, for i = 1, ··· , p are given by

( Pm 2 1/2 p βbi/( j6=i βj /n) = βbi · O( n/m), for i ∈ [1, m]; Ti = Pm 2 1/2 p βbi/( j=1 βj /n) = βbi · O( n/m), for i ∈ [m + 1, p] under H0i, i = 1, ··· , p. Remark 1. The above simple derivations reveal important insights into the challenge of performing marginal screening for polygenic traits. For estimation, although βbis are all unbiased given that X are independent, Var(βbi) is O(m/n) instead of O(1/n). Therefore, when m/n is large, the variance (and MSE) of βbis can be so overwhelming that βbis might be dominated by their standard errors. Note that all Var(βbi)s are in the same scale regardless of whether their original βis are zeros or not. Thus, the βbis from causal and null variants can be totally mixed up when m/n is large. In addition, the test statistics Tis may not well preserve the ranking of variables in X when m/n is large, resulting in potential low power and high false positive rate in detecting and prioritizing important SNPs.

Figure 1 demonstrates the estimation and testing of marginal genetic effects in GWAS with n = 10, 000 as p = m increases from 100, 1000 to 5000. Each entry of X

5 is i.i.d generated from N(0, 1), elements of β(1) are i.i.d generated from N(0, 0.4), and entries of  are i.i.d from N(0, 1). Then, y is generated from model (1). The estimated genetic effects are unbiased in general, however, the uncertainty clearly moves up as m increases. The relative contribution of each SNP decreases as m increases, and thus the testing power drops as well. More simulations on GWAS summary statistics can be found in Section 5. As illustrated in later sections, these properties of GWAS summary statistics are closely related to the asymptotic bias of cross-trait PRS and the performance of SNP screening. Specifically, i) when cross-trait PRS is constructed with all p SNPs, the  p Var βbi s are aggregated, resulting in inflated genetic variance and underestimated genetic correlation; and ii) when cross-trait PRS is constructed with top-ranked SNPs that pass a pre-specified p-value threshold, it may have worse performance if GWAS marginal screening fails to prioritize the causal SNPs.

2.2 General setup In this subsection, we introduce the modelling framework to investigate the cross- trait PRS, including the of polygenic traits, distribution of genetic effects, and genetic correlation estimators.

2.2.1 Polygenic traits Consider three independent GWAS that are conducted for three different traits as follows:

n1×p n1×mα • Discovery GWAS-I: (X, yα), with X = [X(1), X(2)] ∈ R , X(1) ∈ R , n ×1 and yα ∈ R 1 .

n2×p n2×mβ • Discovery GWAS-II: (Z, yβ), with Z = [Z(1), Z(2)] ∈ R , Z(1) ∈ R , and n2×1 yβ ∈ R .

n3×p n3×mη • Target testing GWAS: (W , yη), with W = [W(1), W(2)] ∈ R , W(1) ∈ R , n ×1 and yη ∈ R 3 .

Here yα, yβ, and yη are three different continuous studied in three GWAS with sample sizes n1, n2, and n3, respectively. Thus, mα, mβ, and mη are different numbers of causal SNPs in general. The X(1), Z(1), and W(1) denote the causal SNPs of yα, yβ, and yη, respectively, and X(2), Z(2), and W(2) donate the corresponding null SNPs. Thus, X, Z, and W are three matrices of p SNPs. It is assumed that X, Z, and W have been normalized and satisfy Condition 1. Similar to model (1), the linear polygenic model assumes

yα = Xα + α, yβ = Zβ + β, and yη = W η + η, (4) where α, β, and η are p × 1 vectors of SNP effects, and α, β, and η represent independent random error vectors. The overall genetic heritability of yα is, therefore,

6 given by

2 Var(Xα) Var(X(1)α(1)) hα = = , Var(yα) Var(X(1)α(1)) + Var(α) which measures the proportion of variation in yα that can be explained by the genetic 2 variation Xα. The yα is fully heritable when hα = 1. Similarly, we can define the 2 2 2 2 2 heritability hβ of yβ and hη of yη, respectively. We assume hα, hβ, and hη ∈ (0, 1]. The genetic correlation in this paper is defined as the correlation of SNP effects on pairs of phenotypes [Guo et al., 2017b, Lu et al., 2017, Pasaniuc and Price, 2017, Shi et al., 2017].

Definition 1 (Genetic Correlation). The genetic correlation between yα and yη and that between yα and yβ are respectively given by

αT η αT β ϕ = · I(kαk · kηk > 0) and ϕ = · I(kαk · kβk > 0), αη kαk · kηk αβ kαk · kβk where I(·) is the indicator function, k · k is the l2 norm of a vector, and ϕαη and ϕαβ ∈ [−1, 1].

2.2.2 Genetic effects

Since mα, mβ and mη can be different and the causal SNPs of different phenotypes may partially overlap, we let mαη be the number of overlapping causal SNPs of yα and yη, and mαβ be the number of overlapping causal SNPs of yα and yβ. Let F (0,V ) represent a generic distribution with mean zero, (co)variance V , and finite fourth order moments. Without loss of generality, we introduce the following condition on genetic effects and random errors.

Condition 2. αi, βj, and ηk are independent random variables satisfying

2 2 αi ∼ F (0, σα), i = 1, ..., mα; βj ∼ F (0, σβ), j = 1, ..., mβ; 2 ηk ∼ F (0, ση), k = 1, ..., mη.

The mαη overlapping nonzero effects (αi, ηi)s of (yα,yη) and mαβ overlapping nonzero effects (αj, βj)s of (yα,yβ) satisfy

! " ! 2 !# ! " ! 2 !# αi 0 σα σαη αj 0 σα σαβ ∼ F , 2 and ∼ F , 2 , ηi 0 σαη ση βj 0 σαβ σβ

respectively. And αi , βj and ηk are independent random variables satisfying

 ∼ F (0, σ2 ), i = 1, ..., n ;  ∼ F (0, σ2 ), j = 1, ..., n ; αi α 1 βj β 2 2 ηk ∼ F (0, ση ), k = 1, ..., n3; where σαη = ραη · σαση and σαβ = ραβ · σασβ.

7 Since the three GWAS have independent samples, we assume that their random errors are independent. Overlapping samples and the induced non-genetic correlation will be studied in Section 4. Under Condition 2, when n1, n3, and p → ∞, if mαη, mα, √ and mη → ∞, and mαη/ mαmη = καη → κ0αη ∈ (0, 1], then the genetic correlation between yα and yη is asymptotically given by

αT η Pmαη α η ϕ = = i=1 i i αη Pmα 2 1/2 Pmη 2 1/2 kαk · kηk ( i=1 αi ) ( i=1 ηi ) m = αη · ρ ·{1 + o(1)} = κ · ρ ·{1 + o(1)}. 1/2 αη 0αη αη (mαmη) √ Similarly, when n1, n2, n3, and p → ∞, if mαβ, mα, and mβ → ∞ and mαβ/ mαmβ = καβ → κ0αβ ∈ (0, 1], then the genetic correlation between yα and yβ is asymptotically given by

αT β m ϕ = = αβ · ρ ·{1 + o(1)} = κ · ρ ·{1 + o(1)}. αβ 1/2 αβ 0αβ αβ kαk · kβk (mαmβ)

2 2 2 As in Jiang et al. [2016], heritability hα, hβ, and hη can be asymptotically represented as follows:

2 2 2 m σ mβσ mησ h2 = α α , h2 = β , and h2 = η . α m σ2 + σ2 β m σ2 + σ2 η m σ2 + σ2 α α α β β β η η η

2.2.3 Genetic correlation estimators Now we introduce the cross-trait PRS and genetic correlation estimators. We need the following data. As n1, n2, and p → ∞, the summary association statistics for yα and yβ from Discovery GWAS-I & II are given by

1 T  1 T  αb = X X(1)α(1) + α and βb = Z Z(1)β(1) + β . n1 n2

We assume that the individual-level SNP W and phenotype yη in the Target testing 2 2 2 GWAS can be accessed. In addition, hα, hβ, and hη are assumed to be estimable, using either their corresponding individual-level data [Loh et al., 2015, Yang et al., 2011] or summary-level data [Bulik-Sullivan et al., 2015b, Palla and Dudbridge, 2015, Weissbrod et al., 2018], or can be found in the literature [Polderman et al., 2015]. In 2 2 2 summary, besides (n1, n2, n3, p), it is assumed that αb, βb, W , yη, bhα, bhβ, and bhη are available. We construct cross-trait PRSs as follows:

p X Sbα = wibai = W ab = W(1,α)ab(1) + W(2,α)ab(2) for yα and i=1 p X Sbβ = wibi = W b = W(1,β)b(1) + W(2,β)b(2) for yβ, i=1

8 where a = (a , ··· , a , a , ··· , a )T = a T , a T , in which a = α · I(|α | > b b1 bmα bmα+1 bp b(1) b(2) bi bi bi T T T  cα), b = (b1, ··· ,bmβ ,bmβ +1, ··· ,bp) = b(1), b(2) , in which bi = βbi · I(|βbi| > cβ), and cα and cβ are given thresholds used for SNP screening in order to calculate Sbα and Sbβ. Moreover, we define W(1,α) = [w1, ··· , wmα ], W(2,α) = [wmα+1, ··· , wp],

W(1,β) = [w1, ··· , wmβ ], W(2,β) = [wmβ +1, ··· , wp], and W = [W(1,α), W(2,α)] = [W(1,β), W(2,β)].  We estimate the genetic correlation between yα and yη with Sbα,yη and that  between yα and yβ with Sbα,Sbβ . They represent two common cases in real data  applications. For Sbα,yη , individual-level data are available for one trait, but not for another one. It often occurs when the traits are studied in two different GWAS. For  Sbα,Sbβ , neither of the two traits has individual-level data. This happens when we have GWAS summary statistics of two traits and estimate their genetic correction on an independent target dataset. The genetic correlation estimators are given by

T T  yη Sbα W(1)η(1) + η W(1,α)ab(1) + W(2,α)ab(2) Gαη = = yη · Sbα W(1)η(1) + η · W(1,α)ab(1) + W(2,α)ab(2) for ϕαη, and

T T  Sbβ Sbα W(1,β)b(1) + W(2,β)b(2) W(1,α)ab(1) + W(2,α)ab(2) Gαβ = = Sbβ · Sbα W(1,β)b(1) + W(2,β)b(2) · W(1,α)ab(1) + W(2,α)ab(2) for ϕαβ.

2.3 Asymptotic bias and correction

We first investigate Gαβ and Gαη when all of the p candidate SNPs are used, or when cα = cβ = 0. Thus, ab(1) = αb(1), ab(2) = αb(2), b(1) = βb(1), and b(2) = βb(2). Then, we have

T T  W(1)η(1) + η WX X(1)α(1) + α Gαη = T T W(1)η(1) + η · X(1)α(1) + α XW and

T T T  Z(1)β(1) + β ZW WX X(1)α(1) + α Gαβ = . T T T T Z(1)β(1) + β ZW · X(1)α(1) + α XW

We have the following results on the asymptotic properties of Gαη, whose proof can be found in the Appendix A.

Theorem 1. Under polygenic model (4) and Conditions 1 and 2, suppose mαη, mα, a and mη → ∞ as min(n1, n3, p) → ∞, and let p = c · (n1n3) for some constants c > 0

9 and a ∈ (0, ∞]. If a ∈ (0, 1), then we have   r n1 Gαη = ϕαη + 2 · hη − 1 · ϕαη ·{1 + op(1)}. (5) n1 + p/hα

If a ∈ [1, ∞], then we have

Gαη · n3 = Op(1). (6)

p 2 Remark 2. For a ∈ (0, 1), Gαη is a biased estimator of ϕαη since n1/(n1 + p/hα)·hη is smaller than 1. Interestingly, the asymptotic bias is independent of the unknown 2 2 numbers mα, mη, and mαη, and is only determined by n1, p, hα and hη. When n1 and p are comparable, a consistent estimator of ϕαη is given by

s 2 A n1 + p/hα Gαη = Gαη · 2 = ϕαη ·{1 + op(1)}. n1 · hη

In addition, the testing sample size n3 vanishes in Gαη for a ∈ (0, 1), which verifies that given the sample size n1 of discovery GWAS is large, we can apply the summary statistics onto a much smaller set of target samples.

If a ∈ [1, ∞], i.e., p/(n1n3) is too large, then Gαη will have a zero asymptotic limit. In practice, this occurs when the sample size of discovery GWAS is too small to obtain reliable GWAS summary statistics. When these summary statistics are applied on an T independent target dataset, the mean of genetic covariance yη Sbα cannot dominate its T standard error. The genetic variance Sbα Sbα is so overwhelming that Gαη goes to zero. Details can be found in Appendix A.

The asymptotic properties of Gαβ are given as follows.

Theorem 2. Under polygenic model (4) and Conditions 1 and 2, suppose mαβ, mα, 2 a and mβ → ∞ as min(n1, n2, n3, p) → ∞, and let p = c · (n1n2n3) for some constants c > 0 and a ∈ (0, ∞]. If a ∈ (0, 1), then we have

s  n1 n2 Gαβ = ϕαβ + 2 · 2 − 1 · ϕαβ ·{1 + op(1)}. n1 + p/hα n2 + p/hβ

If a ∈ [1, ∞], then we have

n (n + p)(n + p) G · 3 1 2 = O (1). αβ p2 p

p 2 Remark 3. For a ∈ (0, 1), Gαβ is a biased estimator of ϕαβ since n1/(n1 + p/hα) and q 2 n2/(n2 + p/hβ) are smaller than 1. The asymptotic bias is independent of mα, mβ, 2 2 and mαβ, and is determined by n1, n2, p, hα and hβ. Giving that n1, n2, and p are

10 A B

1.0 1.0

0.8 0.8 ● ●

● 0.6 0.6 ● ● ● ● 0.4 ● 0.4 ● ● ● ● Raw PRS Estimates Raw 0.2 0.2

Corrected PRS Estimates ● ● 0.0 0.0 0.1 0.3 0.5 0.7 0.9 0.1 0.3 0.5 0.7 0.9 True Genetic Correlation True Genetic Correlation

C D

● 1.0 1.0 ●

0.8 0.8 ● ● ●

0.6 0.6 ● ● ● ● 0.4 ● ● 0.4 ● ● ● Raw PRS Estimates Raw 0.2 0.2 Corrected PRS Estimates

● ● ● 0.0 0.0 ● 0.1 0.3 0.5 0.7 0.9 0.1 0.3 0.5 0.7 0.9 True Genetic Correlation True Genetic Correlation

Fig. 2: Raw genetic correlations estimated by cross-trait PRS with all SNPs (left panels, A: Gαη, C: Gαβ) and the bias-corrected genetic correla- A A 2 2 2 tion estimates (right panels, B: Gαη, D: Gαβ). We set hα = hβ = hη = 1, n1 = n2 = n3 = p = 10, 000, and m = 2000.

comparable, a consistent estimator of ϕαβ is given by s (n + p/h2 ) · (n + p/h2 ) A 1 α 2 β Gαβ = Gαβ · = ϕαβ ·{1 + op(1)}. n1n2

Now we propose a novel estimator of ϕαβ that can be directly constructed by using two sets of summary statistics αb and βb. Let

T T T  α βb X(1)α(1) + α XZ Z(1)β(1) + β ϕ = b = , bαβ T T αb · βb X(1)α(1) + α X · Z(1)β(1) + β Z we have the following asymptotic properties.

Theorem 3. Under polygenic model (4) and Conditions 1 and 2, suppose mαβ, mα, a and mβ → ∞ as min(n1, n2, p) → ∞, and let p = c · (n1n2) for some constants c > 0 and a ∈ (0, ∞]. If a ∈ (0, 1), then we have

s n n  ϕ = ϕ + 1 · 2 − 1 · ϕ ·{1 + o (1)}. bαβ αβ 2 2 αβ p n1 + p/hα n2 + p/hβ

11 If a ∈ [1, ∞], then we have

(n + p)(n + p) ϕ · 1 2 = O (1). bαβ p p

It follows from Theorem 3 that a consistent estimator of ϕαβ is given by s (n + p/h2 ) · (n + p/h2 ) A 1 α 2 β ϕbαβ = ϕbαβ · = ϕαβ ·{1 + op(1)}. n1n2

Since ϕbαβ and Gαβ have similar asymptotic properties, in what follows we will focus on Gαβ and the general conclusions of Gαβ remain the same for ϕbαβ.

3 SNP screening

As shown in Theorems 1 and 2, in addition to heritability, the asymptotic bias of Gαη or Gαβ is largely affected by n/p. These results intuitively suggest to select a subset of p SNPs to construct cross-trait PRS. The common approach in practice is to screen the SNPs according to their GWAS p-values. We investigate this strategy in this section.

For a given threshold cα > 0, let qα = p·πα = qα1+qα2 (πα ∈ (0, 1]) be the number of top-ranked SNPs selected for yα, among which there are qα1 true causal SNPs and the remaining qα2 are null SNPs, and we let qαη be the number of overlapping causal SNPs of yα and yη. Similarly, given a threshold cβ > 0, let qβ = p·πβ = qβ1 +qβ2 (πβ ∈ (0, 1]) be the number of top-ranked SNPs selected for yβ, among which there are qβ1 true causal SNPs and the remaining qβ2 are null SNPs, and we let qαβ be the number of overlapping causal SNPs of yα and yβ. Thus, qα1 ≥ qαη and min(qβ1, qα1) ≥ qαβ. The SNP data are defined accordingly. We write X(1) = [X(11), X(12)], X(2) = [X(21), X(22)], Z(1) = [Z(11), Z(12)], Z(2) = [Z(21), Z(22)], W(1,α) = [W(11,α), W(12,α)], W(2,α) = [W(21,α), W(22,α)], W(1,β) = [W(11,β), W(12,β)], and W(2,β) = [W(21,β), W(22,β)]. Here X(11) and W(11,α) are the selected qα1 causal SNPs of yα, and Z(11) and W(11,β) are the selected qβ1 causal SNPs of yβ. Similarly, X(21) and W(21,α) are the selected qα2 null SNPs of yα, and Z(21) and W(21,β) are the selected qβ2 null SNPs of yβ. In addition, we let αb(1) = [αb(11), αb(12)], αb(2) = [αb(21), αb(22)], βb(1) = [βb(11), βb(12)], and βb(2) = [βb(21), βb(22)], where αb(11) and βb(11) correspond to the selected causal SNPs of yα and yβ, respectively, and αb(21) and βb(21) correspond to the selected null ones. Then we have T  W(1)η(1) + η W(11,α)αb(11) + W(21,α)αb(21) CT αη GT αη = = W(1)η(1) + η · W(11,α)αb(11) + W(21,α)αb(21) Vη · VT α

12 n=10000, p=10000 n=10000, p=10000 n=10000, p=10000 n=10000, p=10000 m/p= 0.01 m/p= 0.1 m/p= 0.5 m/p= 0.8 1.0 1.0 1.0 1.0

0.8 0.8 0.8 0.8

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.6 ● ● 0.6 ● ● ● 0.6 ● 0.6 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.4 0.4 0.4 0.4 ● ● ● ● ● ● ● ● ● 0.2 0.2 0.2 ● 0.2 ● ● ●

Estimated Genetic Correlation Estimated Genetic Correlation Estimated Genetic Correlation ● Estimated Genetic Correlation 0.0 0.0 0.0 0.0 ● ● 1 1 1 1 0.8 0.5 0.4 0.3 0.2 0.1 0.8 0.5 0.4 0.3 0.2 0.1 0.8 0.5 0.4 0.3 0.2 0.1 0.8 0.5 0.4 0.3 0.2 0.1 0.08 0.05 0.02 0.01 0.08 0.05 0.02 0.01 0.08 0.05 0.02 0.01 0.08 0.05 0.02 0.01 0.001 0.001 0.001 0.001 Cutoffs 1e−04 1e−05 1e−06 1e−07 1e−08 Cutoffs 1e−04 1e−05 1e−06 1e−07 1e−08 Cutoffs 1e−04 1e−05 1e−06 1e−07 1e−08 Cutoffs 1e−04 1e−05 1e−06 1e−07 1e−08

n=2000, p=10000 n=2000, p=10000 n=2000, p=10000 n=2000, p=10000 m/p= 0.01 m/p= 0.1 m/p= 0.5 m/p= 0.8 1.0 1.0 1.0 1.0

0.8 ● 0.8 0.8 0.8

● 0.6 ● ● 0.6 0.6 0.6 ● ● ● ● ● ● ● ● ● ● 0.4 ● 0.4 0.4 0.4 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.2 0.2 ● 0.2 ● 0.2 ● ● ● ● ● ● ●

Estimated Genetic Correlation Estimated Genetic Correlation ● Estimated Genetic Correlation Estimated Genetic Correlation 0.0 0.0 0.0 ● 0.0 ● ● 1 1 1 1 0.8 0.5 0.4 0.3 0.2 0.1 0.8 0.5 0.4 0.3 0.2 0.1 0.8 0.5 0.4 0.3 0.2 0.1 0.8 0.5 0.4 0.3 0.2 0.1 0.08 0.05 0.02 0.01 0.08 0.05 0.02 0.01 0.08 0.05 0.02 0.01 0.08 0.05 0.02 0.01 0.001 0.001 0.001 0.001 Cutoffs 1e−04 1e−05 1e−06 1e−07 1e−08 Cutoffs 1e−04 1e−05 1e−06 1e−07 1e−08 Cutoffs 1e−04 1e−05 1e−06 1e−07 1e−08 Cutoffs 1e−04 1e−05 1e−06 1e−07 1e−08

Fig. 3: Raw genetic correlation GT αη estimated by cross-trait PRS with selected SNPs under different sparsity m/p and sample size n. We set 2 2 hα = hη = 1, ϕαη = 0.8, p = 10, 000, and n = 10, 000 (upper panels) or 2000 (lower panels).

where Vη = W(1)η(1) + η ,

T  T  VT α = W(11,α)X(11) X(1)α(1) + α + W(21,α)X(21) X(1)α(1) + α , and T T  CT αη = W(1)η(1) + η W(11,α)X(11) X(1)α(1) + α + T T  W(1)η(1) + η W(21,α)X(21) X(1)α(1) + α .

Corollary 1. Under polygenic model (4) and Conditions 1 and 2, suppose that

min(mαη, mα, mη) → ∞ and min(qαη, qα1, qα2) → ∞ as min(n1, n3, p) → ∞, fur-  2 2 ther if mαη(qα1 + qα2) /(qαηn1n3) → 0, then we have   r n1mα qαη GT αη = ϕαη + 2 · · hη − 1 · ϕαη ·{1 + op(1)}. n1qα1 + mαqα/hα mαη

Corollary 1 shows the trade-off of SNP screening. Given n1, mα, mαη, hα, and hη, the bias of GT αη is also affected by qα, qα1 and qαη. As more SNPs are selected, the p 2 numerator of (n1mα)/(n1qα1 + mαqα/h ) · (qαη/mαη) increases with qαη, while the √ √ α denominator increases with qα (and qα1). Therefore, whether or not SNP screening can improve the estimation is largely affected by the quality of the selected SNPs, which is highly related to the properties of the GWAS summary statistics. In the optimistic

13 case where qαη = mαη and qα = qα1 = mα, GT αη becomes

r n1 2 · hη · ϕαη, n1 + mα/hα which is the theoretical upper limit. We note that this optimistic upper limit is still biased towards zero. Another interesting case is that the GWAS summary statistics of causal and null SNPs are totally mixed up, which may occur when n1 = o(mα) (i.e., sample size is small or trait is highly polygenic/omnigenic) according to (3). Therefore, we have qα1/qα ≈ mα/p. Suppose also qαη/qα1 ≈ mαη/mα, we have

r n1 GT αη ≈ 2 2 · qα · hη · ϕαη, n1p + p /hα which increases with qα. As qα = p, GT αη reaches its upper bound

r n1 2 · hη · ϕαη. n1 + p/hα

That is, GT αη achieves the best performance when the cross-trait PRS is constructed without SNP screening. For example, in the left two panels of Figure 3, we set m/p = 0.01 to reflect the sparse signal case, in which causal and null SNPs can be easily separated by SNP screening. Thus, SNP screening can reduce the bias of Gαη when signals are sparse. However, as the number of causal SNPs increase (from left to right in Figure 3), it becomes much hard to separate causal and null SNPs by their GWAS p-values. Therefore, SNP screening will enlarge the bias. Similarly, we have

T  W βb + W βb W α + W α C G = (11,β) (11) (21,β) (21) (11,α) b(11) (21,α) b(21) = T αβ , T αβ V · V kW(11,β)βb(11) + W(21,β)βb(21)k · kW(11,α)αb(11) + W(21,α)αb(21)k T α T β where

 T  T  T CT αβ = W(11,β)Z(11) Z(1)β(1) + β + W(21,β)Z(21) Z(1)β(1) + β  T  T  W(11,α)X(11) X(1)α(1) + α + W(21,α)X(21) X(1)α(1) + α

T  T  and VT β = W(11,β)Z(11) Z(1)β(1) + β + W(21,β)Z(21) Z(1)β(1) + β . Corollary 2. Under polygenic model (4) and Conditions 1 and 2, suppose that min(mαβ, mα, mβ) → ∞ and min(qαβ, qα1, qα2, qβ1, qβ2) → ∞ as min(n1, n2, n3, p) →  2 2 ∞. Further if mαβ(qα1 + qα2)(qβ1 + qβ2) /(qαβn1n2n3) → 0, then we have s  n1mα n2mβ qαβ Gαβ = ϕαβ + 2 · 2 · − 1 · ϕαβ ·{1 + op(1)}. n1qα1 + mαqα/hα n2qβ1 + mβqβ/hβ mαβ

Corollary 2 shows the trade-off of SNP screening for Gαβ. Given n1, n2, mα, mβ,

14 mαβ, hα, and hβ, the bias of GT αη is also affected by qα, qα1, qβ, qβ1 and qαβ. As more SNPs are selected, the numerator of qαβ/mαβ increases with qαβ, while the denominator q 2 2 √ of (n1mα)/(n1qα1 + mαqα/hα) · (n2mβ)/(n2qβ1 + mβqβ/hβ) increases with qα and √ √ √ qβ (also qα1 and qβ1). In the optimistic case where qαβ = mαβ, qα = qα1 = mα and qβ = qβ1 = mβ, GT αβ reduces to s n1 n2 2 · 2 · ϕαβ, n1 + mα/hα n2 + mβ/hβ which is the theoretical upper limit. On the other hand, suppose qαβ/qα1 ≈ mαβ/mα and qαβ/qβ1 ≈ mαβ/mβ, when n1 = o(mα), n2 = o(mβ), i.e., the causal SNPs and null SNPs are totally mixed, we have qα1/qα ≈ mα/p, qβ1/qβ ≈ mβ/p, and

r n1 n2 GT αβ ≈ 2 2 · 2 2 · qαqβ · ϕαβ, n1p + p /hα n2p + p /hα which increases with qα and qβ. Therefore, as qα = qβ = p, GT αβ reaches its upper bound s n1 n2 2 · 2 · ϕαβ. n1 + p/hα n2 + p/hβ

In conclusion, when causal SNP and null SNP can be easily separated by GWAS, the top-ranked SNPs are more likely to be causal ones, that is, SNP screening helps. However, for highly polygenic complex traits whose m/n is large, SNP screening may result in larger bias and should be used with caution.

4 Overlapping samples

In real data applications, different GWAS may share a subset of participants. It is often inconvenient to recalculate the GWAS summary statistics after removing the overlapping samples. In this section, we examine the effect of overlapping samples on the bias of cross-trait PRS, which provides more insights into the bias phenomenon of cross-trait PRS. Particularly, we focus on two distinct cases which are both common in practice: i) ns overlapping samples between discovery GWAS and Target testing data for ϕαη estimation; and ii) ns overlapping samples between two discovery GWAS for ϕαβ estimation.

Case i)

We add ns overlapping samples into Discovery GWAS-I and Target testing GWAS, resulting in the following two new datasets:

• Dataset IV: (X, S, y ), with X ∈ n1×p, S ∈ ns×p, and yT = (yT , yT ) ∈ α R R α αX αS (n +n )×1 R 1 s .

15 • Dataset V: (W , S, y ), with W ∈ n3×p, S ∈ ns×p, and yT = (yT , yT ) ∈ η R R η ηW ηS (n +n )×1 R 3 s . 2 Mimicking h , we define hαη ∈ (0, 1] as the proportion of phenotypic correlation that can be explained by the correlation of their genetic components

mαησαη hαη = . mαησαη + σαη

On the overlapping samples, we allow nonzero correlation between random errors to capture the non-genetic contribution to phenotypic correlation. We introduce an ad- ditional condition on random errors.

Condition 3. On ns overlapping samples, αj and ηj are independent random vari- ables satisfying

! " ! 2 !# αj 0 σα σαη ∼ F , 2 ηj 0 σαη ση

for j = 1, ..., ns, where σαη = ραη · σα ση .

Theorem 4. Under polygenic model (4) and Conditions 1 - 3, suppose min(mαη, mα, a mη) → ∞ as min{(n1 + ns), (n3 + ns), p} → ∞, and let p = c ·{(n1 + ns)(n3 + ns)} for some constants c > 0 and a ∈ (0, ∞]. If a ∈ (0, 1), then GSαη can be written as     1 + nsp/{(n1 + ns)(n3 + ns) · hαη} · hη · ϕαη ·{1 + op(1)} .  2 2 2 2 1/2 1 + p/{(n1 + ns) · hα} + 2nsp/{(n1 + ns)(n3 + ns)} + nsp /{(n1 + ns) (n3 + ns) · hα}

If a ∈ [1, ∞], then we have GSαη = op(1).

Remark 4. Theorem 4 shows the effect of ns overlapping samples on the estimation of ϕαη. Both sample sizes (n1 +ns) and (n3 +ns) are involved in the bias. A consistent A estimator GSαη can be derived given that hαη is estimable. An interesting special case is when the two GWAS are fully overlapped, then we have

ns + p/hαη GSαη = · hη · ϕαη ·{1 + op(1)}.  2 2 1/2 ns + 2nsp + p(p + ns)/hα

2 2 In the optimal situation where hα = hη = hαη = 1, we have

 1 −1/2 GSαη = 1 + · ϕαη ·{1 + op(1)}. p/ns + ns/p + 2

Therefore, GSαη is asymptoticly biased unless either p = o(ns) or ns = o(p) holds, neither of which is the case in modern GWAS. As ns and p are more comparable, the asymptotic bias in GSαη increases and the largest bias occurs as p = ns → ∞. Note that it is not recommended to estimate the genetic correlation between two traits with (fully) overlapping samples due to concerns such as confounding and overfitting [Dudbridge,

16 A B

1.0 1.0 ●

0.8 0.8 ● ●

● 0.6 ● 0.6 ●

0.4 0.4 ● ● ● ● ● Raw PRS Estimates Raw 0.2 ● 0.2 Corrected PRS Estimates ● ● 0.0 0.0 0.1 0.3 0.5 0.7 0.9 0.1 0.3 0.5 0.7 0.9 True Genetic Correlation True Genetic Correlation

C D

1.0 1.0 ●

0.8 0.8 ● ● ● 0.6 0.6 ●

0.4 0.4 ●

● ● Raw PRS Estimates Raw 0.2 0.2 ● ● ● Corrected PRS Estimates ● ● 0.0 0.0 0.1 0.3 0.5 0.7 0.9 0.1 0.3 0.5 0.7 0.9 True Genetic Correlation True Genetic Correlation

Fig. 4: Raw genetic correlations estimated by cross-trait PRS with all SNPs (left panels, A: GSαη, C: GSαβ) and bias-corrected genetic correlation A A 2 2 2 estimates (right panels, B: GSαη, D: GSαβ). We set hα = hβ = hη = 1, n1 = ns = n2 = n3 = 5000 (half samples overlap), p = 10, 000, and m = 2000.

2013, Pasaniuc and Price, 2017]. In our analysis, such concern is quantified by the value of hαη. That is, when non-genetic correlation exists in error terms, we have hαη < 1, and the estimation of genetic correlation is inflated. However, on the other hand, our 2 2 results show that even in an optimal overlapping setting with hα = hη = hαη = 1, the cross-trait PRS estimator based on GWAS summary statistics is biased towards zero.

Case ii)

In this case, we add ns overlapping samples into Discovery GWAS-I and II, resulting in the following two new datasets:

• Dataset IV: (X, S, y ), with X ∈ n1×p, S ∈ ns×p, and yT = (yT , yT ) ∈ α R R α αX αS (n +n )×1 R 1 s . • Dataset VI: (Z, S, y ), with Z ∈ n2×p, S ∈ Rns×p, and yT = (yT , yT ) ∈ β R β βZ βS (n +n )×1 R 2 s .

Then we define hαβ ∈ (0, 1] as

mαβσαβ hαβ = , mαβσαβ + σαβ which quantifies the contribution of genetic correlation to the phenotypic correlation. We introduce the following additional condition on random errors.

17 Condition 4. On ns overlapping samples, αj and βj are independent random vari- ables satisfying ! " ! !#  0 σ2 σ αj ∼ F , α αβ σ σ2 βj 0 αβ β

for j = 1, ..., ns, where σαβ = ραβ · σα σβ .

Theorem 5. Under polygenic model (4) and Conditions 1, 2 and 4, suppose min(mαβ, mα, mβ) → ∞ as min{(n1 + ns), (n2 + ns), n3, p} → ∞, and let p = c ·{(n1 + ns)(n2 + a ns)n3} for some constants c > 0 and a ∈ (0, ∞]. If a ∈ (0, 1), then GSαβ is given by

1/2 1/2 1/2 1/2 (n1 + ns) (n2 + ns) + nsp/{(n1 + ns) (n2 + ns) · hαβ} · ϕαβ ·{1 + op(1)}.  2 2 1/2 (n1 + ns + p/hα) · (n2 + ns + p/hβ)

If a ∈ [1, ∞], then we have GSαβ = op(1).

Remark 5. Theorem 5 shows the effect of ns overlapping samples on the estimation of ϕαβ. Since n3 vanishes in the bias, when (n1 + ns) and (n2 + ns) are large, a consistent A estimator GSαβ can be derived given that hαβ is estimable. When the two discovery GWAS are fully overlapped, i.e., the two set of summary statistics are generated from the same GWAS, then we have

ns + p/hαβ GSαβ = · ϕαβ ·{1 + op(1)}.  2 2 1/2 (ns + p/hα) · (ns + p/hβ)

2 2 In the optimal situation with hα = hβ = hαβ = 1, we have GSαβ = ϕαβ ·{1 + op(1)}. Thus, GSαβ is a consistent estimator and we may have an unbiased estimator of genetic correlation. In summary, above analyses reveal that the bias in cross-trait PRS estimator may result from the following facts: i) summary statistics are generated from independent GWAS, where the induced bias is largely determined by the n/p ratio; ii) phenotypes are not fully heritable, i.e., heritability is less than one; and iii) non-genetic correlation exists in the random errors of overlapping samples. This may happen, for example, when confounding effects are not fully adjusted. The first two facts may bias the ge- netic correlation estimator towards zero, while the last fact may inflate the estimated genetic correlation. In the supplementary file, we further investigate several other spe- cific overlapping cases, which can be useful for quantifying potential bias and perform correction in real data.

18 5 Numerical experiments

5.1 GWAS of polygenic traits We first numerically evaluate the marginal effect size estimates in GWAS with p = 100, 000 and n = 10, 000 or 1000. Each entry of X is independently generated from N(0, 1). We vary the ratio m/p from 0.001 to 0.8 to reflect a wide range of sparsity.

The nonzero SNP effects in β(1) are independently generated from N(0, 1). Entries of  are independently generated from N(0, 1). A continuous phenotype y is then generated from model (1) and we apply model (2) to estimate the marginal effects. A total of 200 replicates was conducted. We calculated the sum of the MSE of regression coefficients βb, the area under curve (AUC) and power of test statistics Ti (i = 1, ··· , p), and enrichment, which is the proportion of true causal SNPs among the top (10%×p)- ranked SNPs. As expected, when sparsity m/p increasing, the MSE of βb is inflated, and both AUC and power of Tis decrease dramatically (Supplementary Figure 1). When m/p is larger than 0.5, AUC is close to 0.5 and power is near zero. Enrichment is high when m/p is small, but it drops dramatically as m/p increases. Finally, enrichment becomes similar to m/p, reflecting that marginal screening can well preserve the rank of variants only when signals are very sparse. These results indicate that causal and null SNPs may be highly mixed in the ranking list of SNP for polygenic traits.

5.2 Cross-trait PRS with all SNPs To illustrate the finite sample performance of our theoretical results, we simulate 10, 000 uncorrelated SNPs. The MAF of each SNP, f, is independently generated from Uni- form [0.05, 0.45] based on which the SNP genotypes are independently sampled from {0, 1, 2} with probabilities {(1 − f)2, 2f(1 − f), f 2}, respectively. We set the same 2000 causal SNPs on each trait and the nonzero genetic effects are generated from Normal distribution according to Condition 2 with σα = ση = σβ = 1. We set all heritability to one and vary ϕαη and ϕαβ from 0.1 to 0.9. Model (4) is used to generate contin- uous phenotypes. We generated 10, 000 samples in each dataset and a total of 200 replicates was conducted. Cross-trait PRS was built with all SNPs. We calculated the raw estimators Gαη and Gαβ studied in Theorems 1 - 2, and the corresponding A A bias-corrected estimators Gαη and Gαβ. The performance of Gαη and Gαβ is displayed in the left panels of Figure 2. It is clear that these raw estimates are biased towards zero. For example, when ϕαη = ϕαβ = 0.9, Gαη is around 0.6 while Gαβ is less than A A 0.45. The performance of Gαη and Gαβ is displayed in the right panels of Figure 2, which indicates that the two bias-corrected estimators perform well and are close to the true value of ϕαη and ϕαβ, respectively. To verify that our results are independent of the signal sparsity, we set mα = mβ = mη = p · aα and vary the sparsity aα = 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 0.6, 0.7 and 0.8 to generate sparse and dense signals. Next, we fix aα = 0.2 and set mβ = mη = k · mα to allow phenotypes to have different number of causal SNPs, where k = 0.3, 0.4, 0.5, 0.8, 1, 1.25, 2, 2.5 and 3.3. We set all heritability to one and let ϕαη = ϕαβ =

19 0.5. Sample size is set to either 2000 or 10, 000. The performance of Gαη is displayed in the upper panels of Supplementary Figure 2. The bias of Gαη is independent of the sparsity aα of a trait or the ratio of sparsity k between two traits, which verifies our results of Theorem 1. The bottom panels of Supplementary Figure 2 display the A A performance of Gαη. It is clear that Gαη is unbiased regardless of aα and k. The A 2 2 Supplementary Figure 3 shows a similar pattern in Gαη as heritability hα = hη = 0.5. A The performance of Gαβ and Gαβ is displayed in Supplementary Figure 4 and supports A our results in Theorem 2. Finally, we illustrate the performance of ϕbαβ and ϕbαβ in Supplementary Figure 5, verifying our results in Theorem 3 and the unbiasedness of A ϕbαβ.

5.3 SNP screening and overlapping samples Instead of using all the 10, 000 SNPs, we construct cross-trait PRS with the top-ranked SNPs whose GWAS p-values pass a pre-specified threshold. We consider a series of thresholds {1, 0.8, 0.5, 0.4, 0.3, 0.2, 0.1, 0.08, 0.05, 0.02, 0.01, 10−3, 10−4, 10−5, 10−6, −7 −8 10 , 10 } and generate a series of GT αη accordingly. We set heritability to one and ϕαη = 0.8. Four levels of sparsity mα/p = mη/p = 0.01, 0.1, 0.5 and 0.8 are examined. Figure 3 displays the performance of GT αη across a series of thresholds. As expected, the pattern of GT αη varies dramatically with the sparsity. When signals are sparse, SNP screening helps and GT αη performs better than Gαη. However, when signals are dense, the performance of GT αη drops as the threshold decreases. GT αη has the best performance as all SNPs are selected, i.e., the same as Gαη, which confirms our results of GT αη in Corollary 1. In addition, we examine our analyses of overlapping samples. For GSαη and GSαβ, half of the 10, 000 samples are set to be overlapping. Other A settings remain the same as those of Figure 2. The performance of GSαη, GSαβ, GSαη A and GSαβ is displayed in Figure 4, which fully support the results in Theorems 4 - 5.

6 UK Biobank data analysis

We apply our bias-corrected estimator on the United Kingdom (UK) Biobank data [Sudlow et al., 2015] to assess the genetic correlation between brain white matter (WM) tracts and several neuropsychiatric disorders. The structural changes of WM tracts are measured and quantified in diffusion tensor imaging (dMRI). We run the TBSS- ENIGMA pipeline [Thompson et al., 2014] to generate tract-based diffusion tensor imaging (DTI) parameters from dMRI of UK Biobank samples. Seven DTI parameters, FA, MD, MO, RD, L1, L2, and L3 (Supplementary Table 1) are derived in each of the 18 WM tracts (Supplementary Table 2, Supplementary Figure 6), thus there are 7 × 18 = 126 DTI parameters in total. We use the unimputed UK Biobank SNP data released in July 2017. Detailed genetic data collection/processing procedures and quality control prior to the release of data are documented at http://www.ukbiobank. ac.uk/scientists-3/genetic-data/. We take all autosomal SNPs and apply the standard quality control procedures using the Plink tool set [Purcell et al., 2007]:

20 excluding subjects with more than 10% missing genotypes, only including SNPs with MAF > 0.01, genotyping rate > 90%, and passing Hardy-Weinberg test (P > 1×10−7). The number of SNPs are 461, 488 after these steps. We further removed non-European subjects if any. To avoid including closely related relatives, we excluded one of any pair of individuals with estimated genetic relationship larger than 0.025. We then select subjects that have DTI data as well, which yields a final dataset consisting of 7979 UK Biobank samples with age range [47, 80] (mean=64.26 years, sd=7.44), and the proportion of female is 0.526. Cross-trait PRSs of three psychiatric disorders are constructed on these UK Biobank samples by using their published GWAS summary statistics, including attention-deficit /hyperactivity disorder (ADHD, sample size 55, 374), bipolar disorder (BD, 41, 653), and Schizophrenia (SCZ, 65, 967). The original GWAS [Demontis et al., 2017, Ruderfer et al., 2018] have no overlapping samples with the UK Biobank data used in this study. The GWAS summary data of these disorders are downloaded from the Psychiatric Genomics Consortium [Sullivan et al., 2017]. To obtain independent SNPs, we perform LD pruning with R2 = 0.2 and window size 50. There are 230, 072 SNPs remain after LD pruning and they are used in later steps as candidates for constructing PRS. We generate one PRS separately for each disorder by summarizing across all the pruned candidates SNPs, weighed by their GWAS effect sizes (log odds ratios). The number of overlapping SNPs is 204, 367 for SCZ, 215, 655 for BD, and 129, 052 for ADHD. Plink tool set [Purcell et al., 2007] is used to generate these scores. The association between each pair of PRS and DTI parameter is estimated and tested in linear regression, adjusting for age, sex and ten genetic principal components of the UK Biobank. There are 7 × 18 × 3 = 378 tests and we correct for multiple testing using the false discovery rate (FDR) method [Storey, 2002] at 0.05 level. We focus on the 20 significant associations after controlling for FDR: 17 for ADHD, 1 for BD, and 2 for SCZ (Supplementary Figure 7, Supplementary Table 3). On these significant DTI-Disorder pairs, the proportions of variation in DTI parameter that can be explained by PRS of disorder (partial R2) are all less than 0.2% (mean=0.125%, max=0.197%, left panel of Figure 5). Partial R2 is the square of the estimated genetic correlation between PRS and WM tract DTI parameters after adjusting for other covariates and is often interpreted as the genetic overlap or shared genetic etiology between the two traits. Such small R2s are widely reported in similar studies for highly heritable psychiatric disorders [Bogdan et al., 2018, Clarke et al., 2016, Guo et al., 2017a, Mistry et al., 2018, Power et al., 2015]. Next, we correct these estimates with our formula in Theorem 1. We applied the heritability estimates of psychiatric disorders reported in a recent large-scale study [Anttila et al., 2018]: 0.256 for SCZ, 0.205 for BD, and 0.100 for ADHD. We estimate heritability of the 126 DTI parameters with the individual-level UK Biobank data using the GCTA tool set [Yang et al., 2011]. These heritability estimates range from 0.224 to 0.733 with mean=0.532 and sd=0.087, and are reported in Zhao et al. [2018]. Plugging in these heritability estimates, sample sizes and number of SNPs, the updated partial R2s are much larger than previous ones (mean=5.260%, max=7.270%, right panel of

21 0.30 ● SCZ Significant after controlling FDR 10 ● SCZ Significant after controlling FDR ● BD * ● BD * ● ADHD ● ADHD 0.25 8

0.20

(X100%) * 2 6

R * (X100%) * * 2 * * * R 0.15 **** * * ** * ** * * * * *** * 4 * 0.10 * * * ** ** * * * * * *

Raw Partial Raw 2 0.05 Corrected Partial

0.00 0

EC SS EC SS IFO SLF IFO SLF ACR PTR SFO ACR PTR SFO ALIC BCC CGC CGH GCC PCR PLIC RLIC SCC SCR ALIC BCC CGC CGH GCC PCR PLIC RLIC SCC SCR FXST FXST

Fig. 5: Raw partial R2 of fitting psychiatric disorder PRS on 18 brain WM tracts (listed in x axis) in the UK Biobank data (left panel) and corrected ones based on our formulas (right panel). Each tract has seven DTI parameters (FA, MD, MO, RD, L1, L2, L3). Partial R2 measures the variance in DTI parameter that can be explained by the PRS, adjusting for age, sex and ten genetic principal components of the UK Biobank.

Figure 5). These corrected partial R2s are within 4% to 7.5% for ADHD, 2.5% to 3.5% for SCZ, and is 4.8% for BD. In conclusion, we detect the significant association between genetic risk scores of psychiatric disorders and brain WM microstructure changes in UK Biobank participants sampled from the general population. Compared to the originally estimated partial R2s, the corrected partial R2s may better reflect the degree of genetic similarity between the two set of traits and suggest the potential prediction power of brain imaging markers on these disorders.

7 Discussion

Understanding the genetic similarity among human complex traits is essential to model biological mechanisms, improve genetic risk prediction, and design personalized pre- vention/treatment. Cross-trait PRS [Power et al., 2015, Purcell et al., 2009] is one of the most popular methods for genetic correlation estimation with thousands of pub- lications. This paper empirically and theoretically studies the asymptotic properties of cross-trait PRS. Our analyses demystify the commonly observed small R2 in real data applications, and help avoid over- or under-interpreting of research findings. More importantly, the asymptotic bias is largely independent of the unknown genetic archi- tecture if we use all SNPs in cross-trait PRS, which enables bias correction. As the sample size of discovery GWAS becomes much larger in the last few years [Evangelou

22 et al., 2018, Lee et al., 2018] and may keep on increasing in the future, our bias-corrected estimators can be used to recover the underlying genetic correlation of many complex traits. We also discuss the popular SNP screening strategy and illustrate that this procedure may enlarge the bias for highly polygenic traits, and thus should be used with caution. Influence of overlapping samples is also quantified in several practical cases. The training-testing design employed by cross-trait PRS may help avoid the in- flation caused by non-genetic correlation, but results in systematic bias due to the restricted prediction power of GWAS summary statistics in testing data. The be- havior of cross-trait PRS studied in this paper is closely related to the properties of GWAS summary statistics, which have received little scrutiny in statistical genetics. Our research should bring attentions to the potential unexpected results when analyz- ing summary-level data of different GWAS for polygenic traits, and call to thoroughly (re)study the statistical properties of other popular GWAS summary statistics-based methods.

Acknowledgement

We would like to thank Haiyan Deng from North Carolina State University for help- ful discussion. This research was partially supported by U.S. NIH grants MH086633 and MH116527, and a grant from the Cancer Prevention Research Institute of Texas. We thank the individuals represented in the UK Biobank (http://www.ukbiobank. ac.uk/) for their participation and the research teams for their work in collecting, processing and disseminating these datasets for analysis. This research has been conducted using the UK Biobank resource (application number 22783), subject to a data transfer agreement. We thank the Psychiatric Genomics Consortium (PGC, http://www.med.unc.edu/pgc/) for providing GWAS summary-level data used in the real data analysis. The authors acknowledge the Texas Advanced Computing Center (TACC, http://www.tacc.utexas.edu/) at The University of Texas at Austin for providing HPC and storage resources that have contributed to the research results reported within this paper.

Appendix A: Proofs

In this appendix, we highlight the key steps and results to prove our main theorems. More proofs and technical details can be found in the supplementary file.

Proposition A1. Under polygenic model (4) and Conditions 1 and 2, if mαη, mα, and

23 mη → ∞ as min(n1, n3, p) → ∞, then we have

T  W(1)η(1) + η W(1)η(1) + η 2 2 = 1 + op(1), n3mη · ση + n3 · ση T T T  X(1)α(1) + α XW WX X(1)α(1) + α 2 2 = 1 + op(1). {n1n3mα(p − mα) + n1n3mα(mα + n1)}· σα + n1n3p · σα

Further if p/(n1n3) → 0, then we have

T T  W(1)η(1) + η WX X(1)α(1) + α = 1 + op(1). n1n3mαη · σαη

By continuous mapping theorem, we have

r n1 Gαη = 2 · hη · ϕαη ·{1 + op(1)}. n1 + p/hα

Then Theorem 1 holds for a ∈ (0, 1). When a ∈ [1, ∞], i.e., p/(n1n3) 6→ 0, we note

T T   1/2 1/2 1/2 W(1)η(1) + η WX X(1)α(1) + α = Op (n1 n3 mαηp + n1n3mαη) · σαη .

It follows that

 2 2 2 2 2 Op (n1n3m p + n n m ) · σ G2 = αη 1 3 αη αη αη  2   2  n3mη · ση ·{1 + o(1)} · n1n3mα(n1 + p) · σα ·{1 + o(1)}  2 2  n1n3p + n1n3 2 1 = Op 2 · ϕαη = Op( ). n3n1(n1 + p) n3 Thus, Theorem 1 is proved.

Proposition A2. Under polygenic model (4) and Conditions 1 and 2, if mαβ, mα, and mβ → ∞ as min(n1, n2, n3, p) → ∞, then we have

T T T  Z β + β ZW WZ Z β + β (1) (1) (1) (1) = 1 + o (1). {n n m (p − m ) + n n m (m + n )}· σ2 + n n p · σ2 p 2 3 β β 2 3 β β 2 β 2 3 β

2 Further if p /(n1n2n3) → 0, then we have

T T T  Z(1)β(1) + β ZW WX X(1)α(1) + α = 1 + op(1). n1n2n3mαβ · σαβ

It follows that Theorem 2 holds for a ∈ (0, 1). When a ∈ [1, ∞], we have

 2 2 2 2 2 2 Op (n1n2n3m p + n n n m ) · σαβ G2 = αβ 1 2 3 αβ αβ  2   2  n2n3mβ(n2 + p) · σβ ·{1 + o(1)} · n1n3mα(n1 + p) · σα ·{1 + o(1)}  2 2 2 2  2 n1n2n3p + n1n2n3 n p o = Op 2 = Op . n3n1n2(n1 + p)(n2 + p) n3(n1 + p)(n2 + p)

24 Thus, Theorem 2 is proved.

Proposition A3. Under polygenic model (4) and Conditions 1 and 2, if mαβ, mα, and mβ → ∞ as min(n1, n2, p) → ∞, then we have

T T  X(1)α(1) + α XX X(1)α(1) + α 2 2 = 1 + op(1), {n1mα(n1 + mα) + n1mα(p − mα)}· σα + n1p · σα T T  Z(1)β(1) + β ZZ Z(1)β(1) + β = 1 + op(1). n m (n + m ) + n m (p − m ) · σ2 + n p · σ2 2 β 2 β 2 β β β 2 β

Further if p/(n1n2) → 0, then we have

T T  X(1)α(1) + α XZ Z(1)β(1) + β = 1 + op(1). n1n2mαβ · σαβ

Therefore Theorem 3 holds for a ∈ (0, 1). When a ∈ [1, ∞], we have

 2 2 2 2 Op (n1n2m p + n n m ) · σαβ ϕ2 = αβ 1 2 αβ bαβ  2   2  n2mβ(n2 + p) · σβ ·{1 + o(1)} · n1mα(n1 + p) · σα ·{1 + o(1)}  2 2  n1n2p + n1n2 n p o = Op = Op . n1n2(n1 + p)(n2 + p) (n1 + p)(n2 + p)

Thus, Theorem 3 is proved. Corollaries 1 and 2 follow from the two propositions below. The proofs of overlapping samples can be found in the supplementary file.

Proposition A4. Under polygenic model (4) and Conditions 1 and 2, if min(mαη, mα, mη) → ∞, min(qαη, qα1, qα2) → ∞ when min(n1, n3, p) → ∞, then we have

VT α 2 2 = 1 + op(1). {n1n3mαqα2 + n1n3qα1(mα + n1)}· σα + n1n3qα · σα

2 2 Further if {mαη(qα1 + qα2)}/(qαηn1n3) → 0, then we have

CT αη = 1 + op(1). n1n3qαη · σαη

Proposition A5. Under polygenic model (4) and Conditions 1 and 2, if min(mαβ, mα, mβ) → ∞, min(qαβ, qα1, qα2, qβ1, qβ2) → ∞ when min(n1, n2, n3, p) → ∞, then we have

V T β = 1 + o (1). {n n m q + n n q (m + n )}· σ2 + n n q · σ2 p 2 3 β β2 2 3 β1 β 2 β 1 3 β β

2 2 Further if {mαβ(qα1 + qα2)(qβ1 + qβ2)}/(qαβn1n2n3) → 0, then we have

CT αβ = 1 + op(1). n1n2n3qαβ · σαβ

25 References

Anttila, V., Bulik-Sullivan, B., Finucane, H. K., Walters, R. K., Bras, J., Duncan, L., Escott-Price, V., Falcone, G. J., Gormley, P., Malik, R. et al. (2018) Analysis of shared heritability in common disorders of the brain. Science, 360, 1313 (eaap8757).

Bogdan, R., Baranger, D. A. and Agrawal, A. (2018) Polygenic risk scores in clini- cal psychology: bridging genomic risk to individual differences. Annual Review of Clinical Psychology, 14, 119–157.

Boyle, E. A., Li, Y. I. and Pritchard, J. K. (2017) An expanded view of complex traits: from polygenic to omnigenic. Cell, 169, 1177–1186.

Bulik-Sullivan, B., Finucane, H. K., Anttila, V., Gusev, A., Day, F. R., Loh, P.-R., Duncan, L., Perry, J. R., Patterson, N., Robinson, E. B. et al. (2015a) An atlas of genetic correlations across human diseases and traits. Nature Genetics, 47, 1236– 1241.

Bulik-Sullivan, B. K., Loh, P.-R., Finucane, H. K., Ripke, S., Yang, J., Patterson, N., Daly, M. J., Price, A. L., Neale, B. M., of the Psychiatric Genomics Consortium, S. W. G. et al. (2015b) Ld score regression distinguishes confounding from polygenicity in genome-wide association studies. Nature Genetics, 47, 291–295.

Chatterjee, N., Shi, J. and Garc´ıa-Closas,M. (2016) Developing and evaluating poly- genic risk prediction models for stratified disease prevention. Nature Reviews Ge- netics, 17, 392–406.

Chen, G.-B. (2014) Estimating heritability of complex traits from genome-wide asso- ciation studies using ibs-based haseman–elston regression. Frontiers in Genetics, 5, 107.

Clarke, T., Lupton, M., Fernandez-Pujals, A., Starr, J., Davies, G., Cox, S., Pattie, A., Liewald, D., Hall, L., MacIntyre, D. et al. (2016) Common polygenic risk for autism spectrum disorder (asd) is associated with cognitive ability in the general population. Molecular Psychiatry, 21, 419–425.

Daetwyler, H. D., Villanueva, B. and Woolliams, J. A. (2008) Accuracy of predicting the genetic risk of disease using a genome-wide approach. PloS One, 3, e3395.

Demontis, D., Walters, R. K., Martin, J., Mattheisen, M., Als, T. D., Agerbo, E., Belliveau, R., Bybjerg-Grauholm, J., Bækved-Hansen, M., Cerrato, F. et al. (2017) Discovery of the first genome-wide significant risk loci for adhd. BioRxiv, 145581.

Dudbridge, F. (2013) Power and predictive accuracy of polygenic risk scores. PLoS Genetics, 9, e1003348.

— (2016) Polygenic epidemiology. Genetic Epidemiology, 40, 268–272.

26 Evangelou, E., Warren, H. R., Mosen-Ansorena, D., Mifsud, B., Pazoki, R., Gao, H., Ntritsos, G., Dimou, N., Cabrera, C. P., Karaman, I. et al. (2018) Genetic analysis of over 1 million people identifies 535 new loci associated with blood pressure traits. Nature Genetics, 50, 1412–1425.

Fan, J., Guo, S. and Hao, N. (2012) Variance estimation using refitted cross-validation in ultrahigh dimensional regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74, 37–65.

Fan, J. and Lv, J. (2008) Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70, 849–911.

Fisher, R. A. (1919) Xv.the correlation between relatives on the supposition of mendelian inheritance. Earth and Environmental Science Transactions of the Royal Society of Edinburgh, 52, 399–433.

Ge, T., Chen, C.-Y., Neale, B. M., Sabuncu, M. R. and Smoller, J. W. (2017) Phenome- wide heritability analysis of the uk biobank. PLoS Genetics, 13, e1006711.

Golan, D., Lander, E. S. and Rosset, S. (2014) Measuring missing heritability: infer- ring the contribution of common variants. Proceedings of the National Academy of Sciences, 111, E5272–E5281.

Gottesman, I. and Shields, J. (1967) A polygenic theory of schizophrenia. Proceedings of the National Academy of Sciences, 58, 199–205.

Guo, W., Samuels, J., Wang, Y., Cao, H., Ritter, M., Nestadt, P., Krasnow, J., Green- berg, B., Fyer, A., McCracken, J. et al. (2017a) Polygenic risk score and heritability estimates reveals a genetic relationship between asd and ocd. European Neuropsy- chopharmacology, 27, 657–666.

Guo, Z., Wang, W., Cai, T. T. and Li, H. (2017b) Optimal estimation of genetic relatedness in high-dimensional linear models. Journal of the American Statistical Association, in press.

Hagenaars, S. P., Harris, S. E., Davies, G., Hill, W. D., Liewald, D. C., Ritchie, S. J., Marioni, R. E., Fawns-Ritchie, C., Cullen, B., Malik, R. et al. (2016) Shared genetic aetiology between cognitive functions and physical and mental health in uk biobank (n= 112 151) and 24 gwas consortia. Molecular Psychiatry, 21, 1624–1632.

Hill, W. G. (2010) Understanding and using quantitative genetic variation. Philo- sophical Transactions of the Royal Society of London B: Biological Sciences, 365, 73–85.

Jiang, J., Li, C., Paul, D., Yang, C., Zhao, H. et al. (2016) On high-dimensional misspecified mixed model analysis in genome-wide association study. The Annals of Statistics, 44, 2127–2160.

27 Lee, J. J., Wedow, R., Okbay, A., Kong, E., Maghzian, O., Zacher, M., Nguyen-Viet, T. A., Bowers, P., Sidorenko, J., Linn´er,R. K. et al. (2018) discovery and polygenic prediction from a genome-wide association study of educational attainment in 1.1 million individuals. Nature Genetics, 50, 1112–1121.

Lee, S. H., Ripke, S., Neale, B. M., Faraone, S. V., Purcell, S. M., Perlis, R. H., Mowry, B. J., Thapar, A., Goddard, M. E., Witte, J. S. et al. (2013) Genetic relationship be- tween five psychiatric disorders estimated from genome-wide snps. Nature Genetics, 45, 984–994.

Lee, S. H. and Van der Werf, J. H. (2016) Mtg2: an efficient algorithm for multivariate linear mixed model analysis based on genomic information. Bioinformatics, 32, 1420–1422.

Lee, S. H., Yang, J., Goddard, M. E., Visscher, P. M. and Wray, N. R. (2012) Estima- tion of between complex diseases using single-nucleotide polymorphism- derived genomic relationships and restricted maximum likelihood. Bioinformatics, 28, 2540–2542.

Loh, P.-R., Bhatia, G., Gusev, A., Finucane, H. K., Bulik-Sullivan, B. K., Pollack, S. J., de Candia, T. R., Lee, S. H., Wray, N. R., Kendler, K. S. et al. (2015) Contrasting genetic architectures of schizophrenia and other complex diseases using fast variance-components analysis. Nature Genetics, 47, 1385–1392.

Lu, Q., Li, B., Ou, D., Erlendsdottir, M., Powles, R. L., Jiang, T., Hu, Y., Chang, D., Jin, C., Dai, W. et al. (2017) A powerful approach to estimating annotation- stratified genetic covariance via gwas summary statistics. The American Journal of Human Genetics, 101, 939–964.

MacArthur, J., Bowler, E., Cerezo, M., Gil, L., Hall, P., Hastings, E., Junkins, H., McMahon, A., Milano, A., Morales, J. et al. (2016) The new nhgri-ebi catalog of published genome-wide association studies (gwas catalog). Nucleic Acids Research, 45, D896–D901.

Mistry, S., Harrison, J. R., Smith, D. J., Escott-Price, V. and Zammit, S. (2018) The use of polygenic risk scores to identify phenotypes associated with genetic risk of bipolar disorder and depression: A systematic review. Journal of Affective Disorders, 234, 148–155.

Nivard, M. G., Gage, S. H., Hottenga, J. J., van Beijsterveldt, C. E., Abdellaoui, A., Bartels, M., Baselmans, B. M., Ligthart, L., Pourcain, B. S., Boomsma, D. I. et al. (2017) Genetic overlap between schizophrenia and developmental psychopathology: longitudinal and multivariate polygenic risk prediction of common psychiatric traits during development. Schizophrenia Bulletin, 43, 1197–1207.

Orr, H. A. and Coyne, J. A. (1992) The genetics of adaptation: a reassessment. The American Naturalist, 140, 725–742.

28 Palla, L. and Dudbridge, F. (2015) A fast method that uses polygenic scores to estimate the variance explained by genome-wide marker panels and the proportion of variants affecting a trait. The American Journal of Human Genetics, 97, 250–259.

Pasaniuc, B. and Price, A. L. (2017) Dissecting the genetics of complex traits using summary association statistics. Nature Reviews Genetics, 18, 117–127.

Penrose, L. (1953) The genetical background of common diseases. Human Heredity, 4, 257–265.

Polderman, T. J., Benyamin, B., De Leeuw, C. A., Sullivan, P. F., Van Bochoven, A., Visscher, P. M. and Posthuma, D. (2015) Meta-analysis of the heritability of human traits based on fifty years of twin studies. Nature Genetics, 47, 702–709.

Pouget, J. G., Han, B., Mignot, E., Ollila, H. M., Barker, J., Spain, S., Dand, N., Trembath, R., Martin, J., Mayes, M. D. et al. (2018) Cross-disorder analysis of schizophrenia and 19 immune diseases reveals genetic correlation. BioRxiv, 068684.

Power, R. A., Steinberg, S., Bjornsdottir, G., Rietveld, C. A., Abdellaoui, A., Nivard, M. M., Johannesson, M., Galesloot, T. E., Hottenga, J. J., Willemsen, G. et al. (2015) Polygenic risk scores for schizophrenia and bipolar disorder predict creativity. Nature Neuroscience, 18, 953–955.

Price, A. L., Patterson, N. J., Plenge, R. M., Weinblatt, M. E., Shadick, N. A. and Reich, D. (2006) Principal components analysis corrects for stratification in genome- wide association studies. Nature Genetics, 38, 904–909.

Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M. A., Bender, D., Maller, J., Sklar, P., De Bakker, P. I., Daly, M. J. et al. (2007) Plink: a tool set for whole-genome association and population-based linkage analyses. The American Journal of Human Genetics, 81, 559–575.

Purcell, S. M., Wray, R., Stone, L., Visscher, M., O’Donovan, C., Sullivan, F., Sklar, P., Ruderfer, M., McQuillin, A., Morris, W. et al. (2009) Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature, 460, 748–752.

Ruderfer, D. M., Ripke, S., McQuillin, A., Boocock, J., Stahl, E. A., Pavlides, J. M. W., Mullins, N., Charney, A. W., Ori, A. P., Loohuis, L. M. O. et al. (2018) Genomic dissection of bipolar disorder and schizophrenia, including 28 subphenotypes. Cell, 173, 1705–1715.

Shi, H., Kichaev, G. and Pasaniuc, B. (2016) Contrasting the genetic architecture of 30 complex traits from summary association data. The American Journal of Human Genetics, 99, 139–153.

Shi, H., Mancuso, N., Spendlove, S. and Pasaniuc, B. (2017) Local genetic correlation gives insights into the shared genetic architecture of complex traits. The American Journal of Human Genetics, 101, 737–751.

29 Socrates, A., Bond, T., Karhunen, V., Auvinen, J., Rietveld, C., Veijola, J., Jarvelin, M.-R. and O’Reilly, P. (2017) Polygenic risk scores applied to a single cohort reveal pleiotropy among hundreds of human phenotypes. BioRxiv, 203257.

Storey, J. D. (2002) A direct approach to false discovery rates. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64, 479–498.

Sudlow, C., Gallacher, J., Allen, N., Beral, V., Burton, P., Danesh, J., Downey, P., Elliott, P., Green, J., Landray, M. et al. (2015) Uk biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Medicine, 12, e1001779.

Sullivan, P. F., Agrawal, A., Bulik, C. M., Andreassen, O. A., Børglum, A. D., Breen, G., Cichon, S., Edenberg, H. J., Faraone, S. V., Gelernter, J. et al. (2017) Psychiatric genomics: an update and an agenda. American Journal of Psychiatry, 175, 15–27.

Thompson, P. M., Stein, J. L., Medland, S. E., Hibar, D. P., Vasquez, A. A., Renteria, M. E., Toro, R., Jahanshad, N., Schumann, G., Franke, B. et al. (2014) The enigma consortium: large-scale collaborative analyses of neuroimaging and genetic data. Brain Imaging and Behavior, 8, 153–182.

Visscher, P. M., Hemani, G., Vinkhuyzen, A. A., Chen, G.-B., Lee, S. H., Wray, N. R., Goddard, M. E. and Yang, J. (2014) Statistical power to detect genetic (co) variance of complex traits using snp data in unrelated samples. PLoS Genetics, 10, e1004269.

Visscher, P. M., Wray, N. R., Zhang, Q., Sklar, P., McCarthy, M. I., Brown, M. A. and Yang, J. (2017) 10 years of gwas discovery: biology, function, and translation. The American Journal of Human Genetics, 101, 5–22.

Weissbrod, O., Flint, J. and Rosset, S. (2018) Estimating snp-based heritability and genetic correlation in case-control studies directly and with summary statistics. The American Journal of Human Genetics, 103, 89–99.

Wray, N. R., Wijmenga, C., Sullivan, P. F., Yang, J. and Visscher, P. M. (2018) Common disease is more complex than implied by the core gene omnigenic model. Cell, 173, 1573–1580.

Yang, J., Benyamin, B., McEvoy, B. P., Gordon, S., Henders, A. K., Nyholt, D. R., Madden, P. A., Heath, A. C., Martin, N. G., Montgomery, G. W. et al. (2010) Common snps explain a large proportion of the heritability for human height. Nature Genetics, 42, 565–569.

Yang, J., Lee, S. H., Goddard, M. E. and Visscher, P. M. (2011) Gcta: a tool for genome-wide complex trait analysis. The American Journal of Human Genetics, 88, 76–82.

30 Yang, J., Zeng, J., Goddard, M. E., Wray, N. R. and Visscher, P. M. (2017) Concepts, estimation and interpretation of snp-based heritability. Nature Genetics, 49, 1304– 1310.

Zhao, B., Zhang, J., Ibrahim, J., Santelli, R., Li, Y., Li, T., Shan, Y., Zhu, Z., Zhou, F., Liao, H. et al. (2018) Large-scale neuroimaging and genetic study reveals genetic architecture of brain white matter microstructure. BioRxiv, 288555.

Zheng, J., Erzurumluoglu, A. M., Elsworth, B. L., Kemp, J. P., Howe, L., Haycock, P. C., Hemani, G., Tansey, K., Laurin, C., Pourcain, B. S. et al. (2017) Ld hub: a centralized database and web interface to perform ld score regression that maximizes the potential of summary level gwas data for snp heritability and genetic correlation analysis. Bioinformatics, 33, 272–279.

Zhou, X. (2017) A unified framework for variance component estimation with summary statistics in genome-wide association studies. The Annals of Applied Statistics, 11, 2027–2051.

31 Supplementary Material 8 More proofs

Proposition S1. Under polygenic model (4) and Conditions 1 - 3, suppose mαη, mα, and mη → ∞ as (n1 + ns), (n3 + ns), p → ∞, then we have

T T yη yη SbSαSbSα 2 2 = 1 + op(1) and = 1 + op(1), (n3 + ns)mη · ση + (n3 + ns) · ση vSα where

T T  T  yη yη = W(1,η)η(1) + ηw W(1,η)η(1) + ηw + S(1,η)η(1) + ηs S(1,η)η(1) + ηs ,

T T T T  SbSαSbSα = X(1,α)α(1) + αx XW WX X(1,α)α(1) + αx + T T T  2 X(1,α)α(1) + αx XW WS S(1,α)α(1) + αs + T T T  S(1,α)α(1) + αs SW WS S(1,α)α(1) + αs + T T T  X(1,α)α(1) + αx XS SX X(1,α)α(1) + αx + T T T  2 X(1,α)α(1) + αx XS SS S(1,α)α(1) + αs + T T T  S(1,α)α(1) + αs SS SS S(1,α)α(1) + αs , and

2 2 2 vSα ={n1n3mα(p + n1) · σα + n1n3p · σα } + 2{n1n3nsmα · σα}+ 2 2 2 2 {nsn3mα(p + ns) · σα + nsn3p · σα } + {n1nsmα(p + n1) · σα + n1nsp · σα } 2 2 2 2 2 + 2{n1nsmα(ns + p) · σα} + {nsmα(ns + p + 3nsp) · σα + nsp(ns + p) · σα }.

Further if p/{(n1 + nS)(n3 + ns)} → 0, then we have

T yη SbSα = 1 + op(1), (n1 + ns)n3mαη · σαη + {nsmαη(p + ns) · σαη + nsp · σαη } + nsn1mαη · σαη

T where yη SbSα is given by

T T T  T  T T T  T  ηw + η(1)W(1,η) WX X(1,α)α(1) + αx + ηw + η(1)W(1,η) WS S(1,α)α(1) + αs T T T  T  T T T  T  + ηs + η(1)S(1,η) SS S(1,α)α(1) + αs + ηs + η(1)S(1,η) SX X(1,α)α(1) + αx .

Proposition S2. Under polygenic model (4) and Conditions 1, 2 and 4, suppose mαβ, mα, and mβ → ∞ as (n1 + ns), (n2 + ns), n3, p → ∞, then we have

T ST S SbSαSbSα bSβ bSβ = 1 + op(1) and = 1 + op(1), vSα vSβ

32 where

T T T T  SbSαSbSα = X(1,α)α(1) + αx XW WX X(1,α)α(1) + αx + T T T  2 X(1,α)α(1) + αx XW WS S(1,α)α(1) + αs + T T T  S(1,α)α(1) + αs SW WS S(1,α)α(1) + αs ,

T T T T  SbSβSbSβ = Z(1,β)β(1) + βz ZW WZ Z(1,β)β(1) + βz + T T T  2 Z(1,β)β(1) + βz ZW WS S(1,β)β(1) + βs + T T T  S(1,β)β(1) + βs SW WS S(1,β)β(1) + βs ,

2 2 2 vSα =n1n3mα(p + n1) · σα + n1n3p · σα + 2n1n3nsmα · σα+ 2 2 nsn3mα(p + ns) · σα + nsn3p · σα , and

v =n n m (p + n ) · σ2 + n n p · σ2 + 2n n n m · σ2+ Sβ 2 3 β 2 β 2 3 β 2 3 s β β n n m (p + n ) · σ2 + n n p · σ2 . s 3 β s β s 3 β

2 Further if p /{(n1 + ns)(n2 + ns)n3} → 0, then we have

T SbSαSbSβ = 1 + op(1), vSαβ where

T T T T  SbSαSbSβ = X(1,α)α(1) + αx XW WZ Z(1,β)β(1) + βz + T T T  X(1,α)α(1) + αx XW WS S(1,β)β(1) + βs + T T T  S(1,α)α(1) + αs SW WZ Z(1,β)β(1) + βz + T T T  S(1,α)α(1) + αs SW WS S(1,β)β(1) + βs , and

vSαβ =n1n2n3mαβ · σαβ + n1nsn3mαβ · σαβ + nsn2n3mαβ · σαβ+

{ns(p + ns)n3mαβ · σαβ + nsn3p · σαβ }.

Then Theorems 4 and 5 follow from continuous mapping theorem and similar ar- guments in the Appendix A.

33 9 More overlapping cases

This section provides more analyses on the overlapping samples. We consider several additional cases that might occur in real data applications.

Case iii)

When the two GWAS are fully overlapped, i.e., the two set of summary statistics αb and βb are generated from the same GWAS data

n1×p • Dataset VII: (X, yα, yβ), with X = [X(1,α), X(2,α)] = [X(1,β), X(2,β)] ∈ R , n1×mα n1×mβ n1×1 n1×1 X(1,α) ∈ R , X(1,β) ∈ R , yα ∈ R , and yβ ∈ R .

We assume that yα and yβ have polygenic architectures. We estimate ϕαβ directly by estimating the correlation of αb and βb

T αb βb ϕbXαβ = αb · βb T T  X α + α XX X β + β = (1,α) (1) (1,β) (1) .  T T  1/2 T T  1/2 X(1,α)α(1) + α XX X(1,α)α(1) + α X(1,β)β(1) + β XX X(1,β)β(1) + β

Proposition S3. Under polygenic models and Conditions 1, 2 and 4, if mαβ, mα, and mβ → ∞, as n1, p → ∞, then we have

T T  X(1,α)α(1) + α XX X(1,α)α(1) + α 2 2 = 1 + op(1), {n1mα(n1 + mα) + n1mα(p − mα)}· σα + n1p · σα T T  X β + β XX X β + β (1,β) (1) (1,β) (1) = 1 + o (1), {n m (n + m ) + n m (p − m )}· σ2 + n p · σ2 p 1 β 1 β 1 β β β 1 β and

T T  X(1,α)α(1) + α XX X(1,β)β(1) + β = 1 + op(1). n1mαβ(p + n1) · σαβ + n1p · σαβ

Thus, we have

n + p/h ϕ = 1 αβ · ϕ ·{1 + o (1)}. bXαβ 2 1/2 2 1/2 αβ p (n1 + p/hα) (n1 + p/hβ)

2 2 It follows that ϕbXαβ is asymptotically unbiased as hα = hβ = hαβ = 1. Otherwise, ϕbXαβ may be biased towards zero.

Case iv)

Again, the two set of summary statistics αb and βb are generated from the same GWAS dataset

34 n1×p • Dataset VII: (X, yα, yβ), with X = [X(1,α), X(2,α)] = [X(1,β), X(2,β)] ∈ R , n1×mα n1×mβ n1×1 n1×1 X(1,α) ∈ R , X(1,β) ∈ R , yα ∈ R , and yβ ∈ R .

And we construct two PRSs SbXα and SbXβ on X.

Proposition S4. Under polygenic models and Conditions 1, 2 and 4, if mαβ, mα, and mβ → ∞, as n1, p → ∞, then we have

T ST S SbXαSbXα bXβ bXβ = 1 + op(1) and = 1 + op(1), vXα vXβ where

T T T T SbXαSbXα = (X(1,α)α(1) + α) XX XX (X(1,α)α(1) + α), T T T T SbXβSbXβ = (X(1,β)β(1) + β) XX XX (X(1,β)β(1) + β), 2 2 2 vXα = n1mα{(n1 + p) + n1p}· σα + n1p(n1 + p) · σα , and v = n m {(n + p)2 + n p}· σ2 + n p(n + p) · σ2 . Xβ 1 β 1 1 β 1 1 β

Similarly, we have

T SbXβSbXα = 1 + op(1), vXαβ

2 where vXαβ = n1mαβ{(n1 + p) + n1p}· σαβ + n1p(n1 + p) · σαβ and

T T T T SbXβSbXα = (X(1,β)β(1) + β) XX XX (X(1,α)α(1) + α).

Thus, we have

2 n1 + 2n1p + p(n1 + p)/hαβ GXαβ = · ϕαβ ·{1 + op(1)}.  2 2 1/2 2 2 1/2 n1 + 2n1p + p(n1 + p)/hα n1 + 2n1p + p(n1 + p)/hβ

2 2 It follows that GXαβ is asymptotically unbiased if hα = hβ = hαβ = 1. Otherwise, GXαβ may be biased towards zero.

Case v)

The two set of GWAS summary statistics αb and βb are generated from the following two independent datasets:

n1×p n1×mα • Dataset VIII: (X, yα), with X = [X(1), X(2)] ∈ R , X(1) ∈ R , and n ×1 yα ∈ R 1 .

n2×p n2×mβ • Dataset IX: (Z, yβ), with Z = [Z(1), Z(2)] ∈ R , Z(1) ∈ R , and yβ ∈ n ×1 R 2 .

We construct two PRSs SbXα and SbXβ on X.

35 Proposition S5. Under polygenic models and Conditions 1, 2 and 4, if mαβ, mα, and mβ → ∞, as n1, n2, p → ∞, then we have

T ST S SbXαSbXα bXβ bXβ = 1 + op(1) and = 1 + op(1), vXα vXβ where

T T T T  SbXαSbXα = X(1,α)α(1) + α XX XX X(1,α)α(1) + α , T T T T  SbXβSbXβ = Z(1,β)β(1) + β ZX XZ Z(1,β)β(1) + β , 2 2 2 vXα = n1mα{(n1 + p) + n1p}· σα + n1p(n1 + p) · σα , and v = n n m (p + n ) · σ2 + n n p · σ2 . Xβ 1 2 β 2 β 1 2 β

Further if p/(n1n2) → 0, then we have

T SbXβSbXα = 1 + op(1), vXαβ where vXαβ = n1n2mαβ(n1 + p) · σαβ and

T T T T  SbXβSbXα = X(1,α)α(1) + α XX XZ Z(1,β)β(1) + β .

a Let p = c · (n1n2) for some constants c > 0 and a ∈ (0, ∞]. As n1 and n2 increase to ∞, if a ∈ (0, 1), then we have

1/2 (n1 + p) · n2 GXαβ = · ϕαβ ·{1 + op(1)}.  2 2 1/2 2 1/2 n1 + 2n1p + p(n1 + p)/hα n2 + p/hβ

36 10 Intermediate results

Cross-trait PRS with all SNPs

Proposition S6. Under polygenic model (4) and Conditions 1 and 2, if mαη, mα, and mη → ∞ as n1, n3, p → ∞, then we have

n T T o E W(1)η(1) + η WX X(1)α(1) + α = n1n3mαη · σαη,

n T T o 2 Var W(1)η(1) + η WX X(1)α(1) + α = {(n1n3mαηp+ 2 2 2 2 2 2 2 2 2n1n3mαη + 2n1n3mαη) · σαη + n1n3mαη · (a22 − σαη)}·{1 + o(1)},

n T o 2 2 E W(1)η(1) + η W(1)η(1) + η = n3mη · ση + n3 · ση , n T o 2 2 4 Var W(1)η(1) + η W(1)η(1) + η = o(n3mη · ση),

n T T T o E X(1)α(1) + α XW WX X(1)α(1) + α   2 2 = n1n3mα(n1 + mα) ·{1 + o(1)} + n1n3mα(p − mα) · σα + n1n3p · σα , n T T T o 2 2 2 2 4 Var X(1)α(1) + α XW WX X(1)α(1) + α = o{n1n3mα(n1 + p) · σα},

2 2 where a22 = E(α η ) < ∞.

Proposition S7. Under polygenic model (4) and Conditions 1 and 2, if mαβ, mα, and mβ → ∞ as n1, n2, n3, p → ∞, then we have

n T T T o E Z(1)β(1) + β ZW WX X(1)α(1) + α = n1n2n3mαβ · σαβ,

n T T T o 2 2 2 2 2 2 Var Z(1)β(1) + β ZW WX X(1)α(1) + α = O(n1n2n3mαβp ) + o(n1n2n3mαβ),

n T T T o E Z(1)β(1) + β ZW WZ Z(1)β(1) + β = n n m (n + m ) ·{1 + o(1)} + n n m (p − m ) · σ2 + n n p · σ2 , 2 3 β 2 β 2 3 β β β 2 3 β n T T T o 2 2 2 2 4 Var Z(1)β(1) + β ZW WZ Z(1)β(1) + β = o{n2n3mβ(n2 + p) · σβ}.

Proposition S8. Under polygenic model (4) and Conditions 1 and 2, if mαβ, mα, and

37 mβ → ∞ as n1, n2, p → ∞, then we have

n T T o E Z(1)β(1) + β ZX X(1)α(1) + α = n1n2mαβ · σαβ,

n T T o 2 2 2 2 Var Z(1)β(1) + β ZX X(1)α(1) + α = O(n1n2mαβp) + o(n1n2mαβ),

n T T o E X(1)α(1) + α XX X(1)α(1) + α   2 2 = n1mα(n1 + mα) ·{1 + o(1)} + n1mα(p − mα) · σα + n1p · σα , n T T o 2 2 2 4 Var X(1)α(1) + α XX X(1)α(1) + α = o{n1mα(n1 + p) · σα},

n T T o E Z(1)β(1) + β ZZ Z(1)β(1) + β = n m (n + m ) ·{1 + o(1)} + n m (p − m ) · σ2 + n p · σ2 , 2 β 2 β 2 β β β 2 β n T T o 2 2 2 4 Var Z(1)β(1) + β ZZ Z(1)β(1) + β = o{n2mβ(n2 + p) · σβ}.

Then Propositions A1 - A3 follow from Markov’s inequality.

Cross-trait PRS with selected SNPs

Proposition S9. Under polygenic model (4) and Conditions 1 and 2, suppose mαη, mαβ, mα, mη, and mβ → ∞, qαβ, qα1, qα2, qβ1, qβ2, and qαη → ∞ as n1, n2, n3, p → ∞, then we have

E(Cαη) = n1n3qαη · σαη, 2 2 2 2 Var(Cαη) = O(mαηn1n3qα) + o(n1n3qαη),

2 2 E(Vα) = {n1n3mαqα2 + n1n3qα1(mα + n1)}· σα ·{1 + o(1)} + n1n3qα · σα ,  2 Var(Vα) = o {n1n3mαqα2 + n1n3qα1(mα + n1)} ,

E(Cαβ) = n1n2n3qαβ · σαβ, 2 2 2 2 2 Var(Cαη) = O(mαβn1n2n3qαqβ) + o(n1n2n3qαβ),

2 2 E(Vβ) = {n2n3mβqβ2 + n2n3qβ1(mβ + n2)}· σβ ·{1 + o(1)} + n2n3qα · σα ,  2 Var(Vα) = o {n2n3mβqβ2 + n2n3qβ1(mβ + n2)} .

Then Propositions A4 and A5 follow from Markov’s inequality.

38 Overlapping samples

Proposition S10. Under polygenic model (4) and Conditions 1 - 3, suppose mαη, mα, and mη → ∞ as (n1 + ns), (n3 + ns), p → ∞, then we have

T  E yη SbSα = (n3 + ns)(n1 + ns)mαη · σαη + nsmαηp · σαη + nsp · σαη , T   2  2 T Var yη SbSα = O (n3 + ns)(n1 + ns)mαηp + o E (yη SbSα) ,

T  2 2 E yη yη = (n3 + ns)mη · ση + (n3 + ns) · ση , T   2 2 4 Var yη yη = o (n3 + ns) mη · ση ,

T   2 2  2 E SbSαSbSα = n1n3mα(p + n1) · σα + n1n3p · σα + 2 n1n3nsmα · σα +  2 2  2 2 nsn3mα(p + ns) · σα + nsn3p · σα + n1nsmα(p + n1) · σα + n1nsp · σα  2  2 2 2 2 + 2 n1nsmα(ns + p) · σα + nsmα(ns + p + 3nsp) · σα + nsp(ns + p) · σα , T   2 T Var SbSαSbSα = o E (SbSαSbSα) .

Proposition S11. Under polygenic model (4) and Conditions 1, 2 and 4, suppose mαβ, mα, and mβ → ∞ as (n1 + ns), (n2 + ns), n3, p → ∞, then we have

T  E SbSαSbSβ = (n1 + ns)(n2 + ns)n3mαβ · σαβ + nsn3mαβp · σαβ + nsn3p · σαβ , T   2 2  2 T Var SbSαSbSβ = O (n1 + ns)(n2 + ns)n3mαηp + o E (SbSαSbSβ) ,

T 2 2 2 E(SbSαSbSα) = n1n3mα(p + n1) · σα + n1n3p · σα + 2n1n3nsmα · σα+ 2 2 nsn3mα(p + ns) · σα + nsn3p · σα , T  2 T Var(SbSαSbSα) = o E (SbSαSbSα) ,

E(ST S ) = n n m (p + n ) · σ2 + n n p · σ2 + 2n n n m · σ2+ bSβ bSβ 2 3 β 2 β 2 3 β 2 3 s β β n n m (p + n ) · σ2 + n n p · σ2 , s 3 β s β s 3 β T  2 T Var(SbSβSbSβ) = o E (SbSβSbSβ) .

Then Propositions S1 and S2 follow from Markov’s inequality.

11 Additional technical details

The following technical details are useful in proving our theoretical results. Most of them involve in calculating the asymptotic expectation of the trace of the product of multiple large random matrices.

39 First moment of covariance term

n T T o E W(1)η(1) + η WX X(1)α(1) + α

n T T T o = E tr(η(1)W(1)WX X(1)α(1)) = n1n3mαη · σαη.

First moment of variance terms

n T o  T 2 T 2 2 E W(1)η(1) + η W(1)η(1) + η = E tr(W(1)W(1)) · ση + η η = n3mη · ση + n3 · ση .

n T T T o E X(1)α(1) + α XW WX X(1)α(1) + α

 T T T T   T T T T  = E α(1)X(1)X(1)W(1)W(1)X(1)X(1)α(1) + E α(1)X(1)X(2)W(2)W(2)X(2)X(1)α(1)  T T T T   T T T  + 2E α(1)X(1)X(1)W(1)W(2)X(2)X(1)α(1) + E α XW WX α   2 2 = n1n3mα(n1 + mα) ·{1 + o(1)} + n1n3mα(p − mα) · σα + n1n3p · σα .

Second moment of covariance term

 T T T T T T  E η(1)W(1)WX X(1)α(1)η(1)W(1)WX X(1)α(1)  T T T T T T  = E tr(η(1)W(1)W(1)X(1)X(1)α(1)η(1)W(1)W(1)X(1)X(1)α(1))  T T T T T T  + E tr(η(1)W(1)W(2)X(2)X(1)α(1)η(1)W(1)W(2)X(2)X(1)α(1))  T T T T T T  + 2E tr(η(1)W(1)W(1)X(1)X(1)α(1)η(1)W(1)W(2)X(2)X(1)α(1)) 2 2 2 2 2 2 2 2 2 2 2 2 = (n1n3mαη + n1n3mαηp + 2n1n3mαη + 2n1n3mαη) · σαη + n1n3mαη · (a22 − σαη).

It follows that

 T T T  2 2 2 2 2 2 Var η(1)W(1)WX X(1)α(1) = n1n3mαηp · σαη ·{1 + o(1)} + o(n1n3mαη · σαη).

Second moment of variance terms

 T T T T  E η(1)W(1)W(1)η(1)η(1)W(1)W(1)η(1) 2 2 4 4 4 4 2 2 4 = n3mη · ση + n3mη{2mηση + n3(b4 − ση) + c4b4 − 2ση − b4} = n3mη · ση ·{1 + o(1)}.

40 and

 T T T T T T T T  E α(1)X(1)XW WX X(1)α(1)α(1)X(1)XW WX X(1)α(1)  T T T T T T T T  = E α(1)X(1)X(1)W(1)W(1)X(1)X(1)α(1)α(1)X(1)X(1)W(1)W(1)X(1)X(1)α(1)  T T T T T T T T  + E α(1)X(1)X(2)W(2)W(2)X(2)X(1)α(1)α(1)X(1)X(2)W(2)W(2)X(2)X(1)α(1)  T T T T T T T T  + 4E α(1)X(1)X(1)W(2)W(2)X(1)X(1)α(1)α(1)X(1)X(1)W(1)W(2)X(2)X(1)α(1)  T T T T T T T T  + 2E α(1)X(1)X(1)W(1)W(1)X(1)X(1)α(1)α(1)X(1)X(2)W(2)W(2)X(2)X(1)α(1) 2 2 2 2 4 = n1n3mα(n1 + p) · σα ·{1 + o(1)}.

Similarly, we have

 T T T T T T T T  E β(1)Z(1)ZW WX Z(1)β(1)β(1)Z(1)ZW WZ Z(1)β(1) 2 2 2 2 4 = n2n3mβ(n2 + p) · σβ ·{1 + o(1)}.

Thus, we have

 T T  2 2 4 Var η(1)W(1)W(1)η(1) = o(n3mη · ση),  T T T T   2 2 2 2 4 Var α(1)X(1)XW WX X(1)α(1) = o n1n3mα(n1 + p) · σα ,  T T T T   2 2 2 2 4 Var β(1)Z(1)ZW WX Z(1)β(1) = o n2n3mβ(n2 + p) · σβ .

First moment of covariance term

n T T T o  T T T T  E Z(1)β(1) + β ZW WX X(1)α(1) + α = E β(1)Z(1)ZW WX X(1)α(1)  T T T T  = E tr(β(1)Z(1)Z(1)W(1)W(1)X(1)X(1)α(1)) = n1n2n3mαβ · σαβ.

Second moment of covariance term

 T T T T T T T T  E β(1)Z(1)ZW WX X(1)α(1)β(1)Z(1)ZW WX X(1)α(1)  T T T T T T T T  = E β(1)Z(1)Z(1)W(1)W(1)X(1)X(1)α(1)β(1)Z(1)Z(1)W(1)W(1)X(1)X(1)α(1)  T T T T T T T T  + E β(1)Z(1)Z(1)W(1)W(2)X(2)X(1)α(1)β(1)Z(1)Z(1)W(1)W(2)X(2)X(1)α(1) T T T T T T T T  + E(β(1)Z(1)Z(2)W(2)W(1)X(1)X(1)α(1)β(1)Z(1)Z(2)W(2)W(1)X(1)X(1)α(1)  T T T T T T T T  + E β(1)Z(1)Z(2)W(2)W(2)X(2)X(1)α(1)β(1)Z(1)Z(2)W(2)W(2)X(2)X(1)α(1) h 2 i = O n1n2n3mαβ{(p − mα)(p − mβ) + (p − mα)mβ + mα(p − mβ) + mαmα} 2 2 2 2 2 + n1n2n3mαβ · σαβ ·{1 + o(1)} 2 2 2 2 2 2 2 = O(n1n2n3mαβp ) + n1n2n3mαβ · σαβ ·{1 + o(1)}.

41 It follows that

 T T T T  2 2 2 2 2 2 2 Var β(1)Z(1)ZW WX X(1)α(1) = O(n1n2n3mαβp ) + o(n1n2n3mαβ · σαβ).

First moment of covariance term

n T T o E Z(1)β(1) + β ZX X(1)α(1) + α

 T T T  = E tr(β(1)Z(1)ZX X(1)α(1)) = n1n2mαβ · σαβ.

First moment of variance terms

n T T o E X(1)α(1) + α XX X(1)α(1) + α

 T T T   T T T   T T  = E α(1)X(1)X(1)X(1)X(1)α(1) + E α(1)X(1)X(2)X(2)X(1)α(1) + E α XX α   2 2 = n1mα(n1 + mα) ·{1 + o(1)} + n1mα(p − mα) · σα + n1p · σα .

Similarly, we have

n T T o E Z(1)β(1) + β ZZ Z(1)β(1) + β = n m (n + m ) ·{1 + o(1)} + n m (p − m ) · σ2 + n p · σ2 . 2 β 2 β 2 β β β 2 β

Second moment of covariance term

 T T T T T T  E β(1)Z(1)ZX X(1)α(1)β(1)Z(1)ZX X(1)α(1)  T T T T T T  = E β(1)Z(1)Z(1)X(1)X(1)α(1)β(1)Z(1)Z(1)X(1)X(1)α(1)  T T T T T T  + E β(1)Z(1)Z(1)X(1)X(1)α(1)β(1)Z(1)Z(2)X(2)X(1)α(1)  T T T T T T  + E β(1)Z(1)Z(2)X(2)X(1)α(1)β(1)Z(1)Z(1)X(1)X(1)α(1)  T T T T T T  + E β(1)Z(1)Z(2)X(2)X(1)α(1)β(1)Z(1)Z(2)X(2)X(1)α(1) 2 2 2 2 2 = O(n1n2mαβp) + n1n2mαβ · σαβ ·{1 + o(1)}.

It follows that

 T T T  2 2 2 2 2 Var β(1)Z(1)ZX X(1)α(1) = O(n1n2mαβp) + o(n1n2mαβ · σαβ).

42 Second moment of variance terms

 T T T T T T  E α(1)X(1)XX X(1)α(1)α(1)X(1)XX X(1)α(1)  T T T T T T  = E α(1)X(1)X(1)X(1)X(1)α(1)α(1)X(1)X(1)X(1)X(1)α(1)  T T T T T T  + E α(1)X(1)X(2)X(2)X(1)α(1)α(1)X(1)X(1)X(1)X(1)α(1)  T T T T T T  + E α(1)X(1)X(1)X(1)X(1)α(1)α(1)X(1)X(2)X(2)X(1)α(1)  T T T T T T  + E α(1)X(1)X(2)X(2)X(1)α(1)α(1)X(1)X(2)X(2)X(1)α(1) 2 2 2 4 = n1mα(p + n1) · σα ·{1 + o(1)}.

Similarly, we have

 T T T T T T  2 2 2 4 E β(1)Z(1)ZZ Z(1)β(1)β(1)Z(1)ZZ Z(1)β(1) = n2mβ(p + n2) · σβ ·{1 + o(1)}.

Thus, we have

 T T T   2 2 2 4 Var α(1)X(1)XX X(1)α(1) = o n1mα(p + n1) · σα ,  T T T   2 2 2 4 Var β(1)Z(1)ZZ Z(1)β(1) = o n2mβ(p + n2) · σβ .

43 AUC Power Enrichment MSE p=100000, n=10000 p=100000, n=10000 p=100000, n=10000 p=100000, n=10000

● 1.0 1.0 1.0 ● ● 10 ● ●

● ● ● ● ● ● ●

0.8 ● 0.8 0.8 8 ●

● ● ● ● ● ●

● ● ● ● 0.6 ● 0.6 0.6 6 ● ●

● ● ● ● AUC ● 0.4 Power 0.4 0.4 4 log(MSE) ● Enrichment

● ● 0.2 0.2 0.2 2 ●

● ● ● ● ● ● ●

0.0 0.0 0.0 0 ● 0.1 0.2 0.5 0.8 0.1 0.2 0.5 0.8 0.1 0.2 0.5 0.8 0.1 0.2 0.5 0.8 0.01 0.02 0.05 0.01 0.02 0.05 0.01 0.02 0.05 0.01 0.02 0.05 0.001 0.002 0.005 m/p 0.001 0.002 0.005 m/p 0.001 0.002 0.005 m/p 0.001 0.002 0.005 m/p

AUC Power Enrichment MSE p=100000, n=1000 p=100000, n=1000 p=100000, n=1000 p=100000, n=1000

1.0 1.0 1.0 ● 10

● ● ● ● ●

0.8 0.8 0.8 ● 8 ● ● ● ● ● ● ● ●

0.6 0.6 0.6 6 ●

● ● ● ● ● ● ● ● ● ● ● ● ● ●

AUC ●

Power ● 0.4 0.4 0.4 4 ● log(MSE) Enrichment

● ● ● 0.2 0.2 ● 0.2 2

● ● ● ● ● ● ● ● ● ● ● ● 0.0 0.0 0.0 0 0.1 0.2 0.5 0.8 0.1 0.2 0.5 0.8 0.1 0.2 0.5 0.8 0.1 0.2 0.5 0.8 0.01 0.02 0.05 0.01 0.02 0.05 0.01 0.02 0.05 0.01 0.02 0.05 0.001 0.002 0.005 m/p 0.001 0.002 0.005 m/p 0.001 0.002 0.005 m/p 0.001 0.002 0.005 m/p

Supplementary Fig. 1: Trends of GWAS performance when varying sparsity m/p and sample size n: AUC of tests, power of tests, enrichment of top-ranked SNP, and MSE of estimated genetic effects.

44 n=2000, p=10000 n=2000, p=10000 n=10000, p=10000 n=10000, p=10000 m1=2000 m1=2000 1.0 1.0 1.0 1.0

0.8 0.8 0.8 0.8

0.6 0.6 0.6 0.6

● ● ● ● ● 0.4 0.4 0.4 ● 0.4 ● ●

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

Raw PRS Estimates Raw 0.2 PRS Estimates Raw 0.2 PRS Estimates Raw 0.2 ● PRS Estimates Raw 0.2 ● ● ● ● ● ● ● ● ● ●

0.0 0.0 0.0 0.0 1 2 1 2 0.1 0.2 0.5 0.6 0.7 0.8 0.3 0.4 0.5 0.8 2.5 3.3 0.1 0.2 0.5 0.6 0.7 0.8 0.3 0.4 0.5 0.8 2.5 3.3 0.01 0.02 0.05 1.25 0.01 0.02 0.05 1.25 Sparsity: m/p Ratio of Sparsity: m2/m1 Sparsity: m/p Ratio of Sparsity: m2/m1

n=2000, p=10000 n=2000, p=10000 n=10000, p=10000 n=10000, p=10000 m1=2000 m1=2000 1.0 1.0 1.0 1.0

0.8 0.8 0.8 0.8 ● ● ● ● ● ● ● ● ● ● ● ● ● 0.6 0.6 0.6 ● 0.6 ● ● ●

● ● ● ● ● ● 0.4 0.4 0.4 ● ● 0.4 ● ● ● ● ● ● ● ● ● ● ● ● 0.2 0.2 0.2 0.2 Corrected PRS Estimates Corrected PRS Estimates Corrected PRS Estimates Corrected PRS Estimates

0.0 0.0 0.0 0.0 1 2 1 2 0.1 0.2 0.5 0.6 0.7 0.8 0.3 0.4 0.5 0.8 2.5 3.3 0.1 0.2 0.5 0.6 0.7 0.8 0.3 0.4 0.5 0.8 2.5 3.3 0.01 0.02 0.05 1.25 0.01 0.02 0.05 1.25 Sparsity: m/p Ratio of Sparsity: m2/m1 Sparsity: m/p Ratio of Sparsity: m2/m1

Supplementary Fig. 2: Raw genetic correlations estimated by cross- trait PRS with all SNPs (Gαη, upper panels) and corrected ones based A 2 2 on our formulas (Gαη, bottom panels). We set hα = hη = 1, ϕαη = 0.5, p = 10, 000, and vary mα, mη and n.

n=2000, p=10000 n=2000, p=10000 n=10000, p=10000 n=10000, p=10000 m1=2000 m1=2000 1.0 1.0 1.0 1.0

0.8 0.8 0.8 0.8

0.6 0.6 0.6 0.6

0.4 0.4 0.4 0.4

● ● ● ● ● ●

Raw PRS Estimates Raw 0.2 PRS Estimates Raw 0.2 PRS Estimates Raw 0.2 PRS Estimates Raw 0.2 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.0 ● 0.0 0.0 0.0 1 2 1 2 0.1 0.2 0.5 0.6 0.7 0.8 0.3 0.4 0.5 0.8 2.5 3.3 0.1 0.2 0.5 0.6 0.7 0.8 0.3 0.4 0.5 0.8 2.5 3.3 0.01 0.02 0.05 1.25 0.01 0.02 0.05 1.25 Sparsity: m/p Ratio of Sparsity: m2/m1 Sparsity: m/p Ratio of Sparsity: m2/m1

n=2000, p=10000 n=2000, p=10000 n=10000, p=10000 n=10000, p=10000 m1=2000 m1=2000 1.0 1.0 1.0 1.0

● ● 0.8 0.8 ● ● 0.8 0.8

● ● ● ● ● ● 0.6 0.6 0.6 0.6

0.4 0.4 0.4 0.4 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.2 ● ● 0.2 ● 0.2 0.2 ● ● ● ● Corrected PRS Estimates Corrected PRS Estimates Corrected PRS Estimates Corrected PRS Estimates

0.0 0.0 0.0 0.0 ● 1 2 1 2 0.1 0.2 0.5 0.6 0.7 0.8 0.3 0.4 0.5 0.8 2.5 3.3 0.1 0.2 0.5 0.6 0.7 0.8 0.3 0.4 0.5 0.8 2.5 3.3 0.01 0.02 0.05 1.25 0.01 0.02 0.05 1.25 Sparsity: m/p Ratio of Sparsity: m2/m1 Sparsity: m/p Ratio of Sparsity: m2/m1

Supplementary Fig. 3: Raw genetic correlations estimated by cross- trait PRS with all SNPs (Gαη, upper panels) and corrected ones based on A 2 2 our formulas (Gαη, bottom panels). We set hα = hη = 0.5, ϕαη = 0.5, p = 10, 000, and vary mα, mη and n.

45 (n1,n2)=(2000,2000), p=10000 (n1,n2)=(2000,2000), p=10000 (n1,n2)=(10000,10000), p=10000 (n1,n2)=(10000,10000), p=10000 m1=2000 m1=2000 1.0 1.0 1.0 1.0

0.8 0.8 0.8 0.8

0.6 0.6 0.6 0.6

0.4 0.4 0.4 0.4 ● ● ● ● ● ●

Raw PRS Estimates Raw 0.2 PRS Estimates Raw 0.2 PRS Estimates Raw 0.2 ● ● ● PRS Estimates Raw 0.2 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.0 ● 0.0 ● ● ● 0.0 0.0 1 2 1 2 0.1 0.2 0.5 0.6 0.7 0.8 0.3 0.4 0.5 0.8 2.5 3.3 0.1 0.2 0.5 0.6 0.7 0.8 0.3 0.4 0.5 0.8 2.5 3.3 0.01 0.02 0.05 1.25 0.01 0.02 0.05 1.25 Sparsity: m/p Ratio of Sparsity: m2/m1 Sparsity: m/p Ratio of Sparsity: m2/m1

(n1,n2)=(2000,2000), p=10000 (n1,n2)=(2000,2000), p=10000 (n1,n2)=(10000,10000), p=10000 (n1,n2)=(10000,10000), p=10000 m1=2000 m1=2000 1.0 ● ● 1.0 ● 1.0 1.0 ● ● ● ● ● ● ● ● 0.8 0.8 0.8 0.8

● ● ● ● ● ● 0.6 0.6 0.6 0.6

0.4 0.4 0.4 ● 0.4 ● ● ● ● ● ● ● ● ● ● ● 0.2 0.2 0.2 0.2 ● ● Corrected PRS Estimates ● Corrected PRS Estimates ● ● Corrected PRS Estimates Corrected PRS Estimates ● ● ● ● ● ● ● ● ● 0.0 0.0 ● ● 0.0 0.0 ● ●

● 1 2 1 2 0.1 0.2 0.5 0.6 0.7 0.8 0.3 0.4 0.5 0.8 2.5 3.3 0.1 0.2 0.5 0.6 0.7 0.8 0.3 0.4 0.5 0.8 2.5 3.3 0.01 0.02 0.05 1.25 0.01 0.02 0.05 1.25 Sparsity: m/p Ratio of Sparsity: m2/m1 Sparsity: m/p Ratio of Sparsity: m2/m1

Supplementary Fig. 4: Raw genetic correlations estimated by cross- trait PRS with all SNPs (Gαβ, upper panels) and corrected ones based A 2 2 on our formulas (Gαβ, bottom panels). We set hα = hβ = 1, ϕαβ = 0.5, p = 10, 000, and vary mα, mβ and n.

(n1,n2)=(2000,2000), p=10000 (n1,n2)=(2000,2000), p=10000 (n1,n2)=(10000,10000), p=10000 (n1,n2)=(10000,10000), p=10000 m1=2000 m1=2000 1.0 1.0 1.0 1.0

0.8 0.8 0.8 0.8

0.6 0.6 0.6 0.6

0.4 0.4 0.4 0.4

● ● ● ● ● ● ● ● ● ● ● ● ● ● Raw PRS Estimates Raw 0.2 PRS Estimates Raw 0.2 PRS Estimates Raw 0.2 ● PRS Estimates Raw 0.2 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.0 0.0 0.0 0.0 1 2 1 2 0.1 0.2 0.5 0.6 0.7 0.8 0.3 0.4 0.5 0.8 2.5 3.3 0.1 0.2 0.5 0.6 0.7 0.8 0.3 0.4 0.5 0.8 2.5 3.3 0.01 0.02 0.05 1.25 0.01 0.02 0.05 1.25 Sparsity: m/p Ratio of Sparsity: m2/m1 Sparsity: m/p Ratio of Sparsity: m2/m1

(n1,n2)=(2000,2000), p=10000 (n1,n2)=(2000,2000), p=10000 (n1,n2)=(10000,10000), p=10000 (n1,n2)=(10000,10000), p=10000 m1=2000 m1=2000 1.0 1.0 1.0 1.0

0.8 0.8 0.8 0.8 ● ● ● ● ● ● ● ● ● ● ● ● ● 0.6 0.6 0.6 ● ● 0.6 ● ● ● ● ●

● ● ● ● ● ● ● ● 0.4 0.4 0.4 ● 0.4 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.2 0.2 0.2 0.2 Corrected PRS Estimates Corrected PRS Estimates Corrected PRS Estimates Corrected PRS Estimates

0.0 0.0 0.0 0.0 1 2 1 2 0.1 0.2 0.5 0.6 0.7 0.8 0.3 0.4 0.5 0.8 2.5 3.3 0.1 0.2 0.5 0.6 0.7 0.8 0.3 0.4 0.5 0.8 2.5 3.3 0.01 0.02 0.05 1.25 0.01 0.02 0.05 1.25 Sparsity: m/p Ratio of Sparsity: m2/m1 Sparsity: m/p Ratio of Sparsity: m2/m1

Supplementary Fig. 5: Raw genetic correlations estimated by cross- trait PRS directly with all SNPs (ϕbαβ, upper panels) and corrected ones A 2 2 based on our formulas (ϕbαβ, bottom panels). We set hα = hβ = 1, ϕαβ = 0.5, p = 10, 000, and vary mα, mβ and n.

46 A S S

L R R L A P

P I I (a) (c) (e) A S S

R L L R P A

P I I (b) (d) (f) ACR ALIC BCC CGC CGH CST EC FX FXST GCC IFO PCR PLIC PTR RLIC SCC SCR SFO SLF SS UNC

(a) Superior view (b) Inferior view (c) Anterior view (d) Posterior view (e) Left view (f) Right view

Supplementary Fig. 6: WM main tracts annotation. Originally pub- lished in Zhao et al. [2018]. We examine 18 WM tracts in our real data analysis, whose full names are listed in Supplementary Table 2.

47 5 ● SCZ Significant after controlling FDR ● BD * ● ADHD

4 ** 3 * * * * *** * * ** * *** ** * 2 −log10(p−value)

1

0

FXST RLIC ACR ALIC BCC CGC CGH GCC PCR PLIC PTR SCC SCR SFO SLF EC IFO SS

Supplementary Fig. 7: Associations between the PRS of four psychi- atric disorders created from the published GWAS summary statistics, and 18 brain WM tracts in the UK Biobank dataset. We control for age, sex, and 10 genetic principal components for population structure. FDR is con- trolled at 0.05 level. SCZ, Schizophrenia; BD, Bipolar disorder; ADHD: Attention-decit/hyperactivity disorder; ACR, Anterior corona radiata; ALIC, Anterior limb of internal cap- sule; BCC, Body of corpus callosum; CGC, Cingulum (cingulate gyrus); CGH, Cingulum (hippocampus); EC, External capsule; FXST, Fornix (cres)/Stria terminalis; GCC, Genu of corpus callosum; IFO, Inferior fronto-occipital fasciculus; PCR, Posterior corona radiata; PLIC, Posterior limb of internal capsule; PTR, Posterior thalamic radiation (include optic radiation); RLIC, Retrolenticular part of internal capsule; SCC, Splenium of corpus callosum; SCR, Superior corona radiata; SFO, Superior fronto- occipital fasciculus; SLF, Superior longitudinal fasciculus; SS, Sagittal stratum.

48 Supplementary Table 1: Full name and description of DTI parameters.

DTI parameter Full name Description FA fractional anisotropy a summary measure of WM integrity MD mean diffusivities magnitude of absolute directionality AD(L1) axial diffusivities eigenvalue of the principal diffusion direction RD radial diffusivities average of the eigenvalues of the two secondary directions MO mode of anisotropy third moment of the tensor L2, L3 two secondary diffusion direction eigenvalues

Supplementary Table 2: Full name of 18 WM tracts.

WM tract Full name ACR Anterior corona radiata ALIC Anterior limb of internal capsule BCC Body of corpus callosum CGC Cingulum (cingulate gyrus) CGH Cingulum (hippocampus) EC External capsule FXST Fornix (cres)/Stria terminalis GCC Genu of corpus callosum IFO Inferior fronto-occipital fasciculus PCR Posterior corona radiata PLIC Posterior limb of internal capsule PTR Posterior thalamic radiation (include optic radiation) RLIC Retrolenticular part of internal capsule SCC Splenium of corpus callosum SCR Superior corona radiata SFO Superior fronto-occipital fasciculus SLF Superior longitudinal fasciculus SS Sagittal stratum

49 Supplementary Table 3: The 20 significant associations between the PRS of four psychiatric disorders and brain WM tracts in the UK Biobank dataset (FDR controlled at 0.05 level). Raw R2 stands for the proportion of variance in the DTI parameter that can be explained by psychiatric disorder PRS. Corrected R2s are the ones after correction according our formulas. Seven DTI parameters are considered on each WM tract: FA, fractional anisotropy; MD, mean diffusivities; L1(AD): axial diffusivities, also the eigenvalue of the primary diffusion direction; RD, radial diffusivi- ties; MO, mode of anisotropy; L2 and L3, two secondary diffusion direction eigenvalues.

Disorder-Tract-DTI Estimate Std. Error P -value Raw R2 (×100%) Corrected R2 (×100%) ADHD-ACR-L2 -0.0317 0.0105 0.0026 0.1006 4.3005 ADHD-ALIC-MO 0.0311 0.0104 0.0026 0.0970 4.7888 ADHD-FXST-L1 -0.0373 0.0110 0.0007 0.1392 5.9635 ADHD-FXST-MO -0.0337 0.0104 0.0012 0.1133 4.8436 ADHD-PCR-L3 -0.0319 0.0108 0.0031 0.1017 4.9767 ADHD-PCR-RD -0.0325 0.0110 0.0030 0.1055 4.8518 ADHD-PLIC-FA 0.0388 0.0113 0.0006 0.1509 5.4437 ADHD-PLIC-L2 -0.0431 0.0113 0.0001 0.1862 6.6136 ADHD-PLIC-L3 -0.0361 0.0110 0.0010 0.1306 5.4399 ADHD-PLIC-MD -0.0330 0.0109 0.0024 0.1089 5.6350 ADHD-PLIC-RD -0.0444 0.0112 0.0001 0.1974 7.2696 ADHD-PTR-L1 -0.0351 0.0112 0.0018 0.1233 5.6933 ADHD-PTR-MD -0.0351 0.0109 0.0012 0.1233 6.1215 ADHD-RLIC-MD -0.0336 0.0109 0.0021 0.1127 5.6610 ADHD-SCC-FA 0.0331 0.0113 0.0033 0.1098 4.8173 ADHD-SCC-L2 -0.0367 0.0113 0.0011 0.1347 6.4249 ADHD-SS-L1 -0.0338 0.0111 0.0022 0.1142 5.2611 BD-BCC-L1 0.0331 0.0109 0.0023 0.1092 4.7963 SCZ-BCC-L1 0.0363 0.0109 0.0009 0.1315 2.8821 SCZ-IFO-L1 0.0341 0.0113 0.0025 0.1163 3.4237

50