Supplementary Notes to Accompany “Genome-Wide Association Study Results for Educational Attainment Aid in Identifying Genetic Heterogeneity of Schizophrenia”
Total Page:16
File Type:pdf, Size:1020Kb
Supplementary notes to accompany “Genome-wide association study results for educational attainment aid in identifying genetic heterogeneity of schizophrenia” Table of contents 1 STATISTICAL CONSIDERATION 4 1.1 Genetic dependence under homogeneity 4 1.2 Heterogeneity 4 1.2.1 Genetic dependence of educational attainment and schizophrenia 5 1.2.2 Simulation study 6 1.3 Improved polygenic prediction of endophenotypes and symptoms 8 1.3.1 Improved polygenic prediction with GWAS results from genetically correlated traits 8 1.3.2 Detecting genetic heterogeneity among endophenotypes or symptoms 10 2 GWAS ON EDUCATIONAL ATTAINMENT 12 3 GWAS ON SCHIZOPHRENIA 13 4 QUALITY CONTROL 13 5 SELECTION OF EDUCATION-ASSOCIATED CANDIDATE SNPS 14 5.1 Setting the P value threshold for the proxy-based SNPs 14 6 LOOK-UP, SIGN CONCORDANCE AND ENRICHMENT 14 6.1 Look-up 14 6.2 Bayesian credibility of results 15 6.3 Sign concordance 17 6.4 Enrichment 17 6.4.1 Raw enrichment factor (not corrected for LD score of SNPs) 17 6.4.2 Raw enrichment P value (not corrected for LD score of SNPs) 18 6.5 Prediction of future genome-wide significant loci for schizophrenia 18 7 PLEIOTROPY BETWEEN EDUCATIONAL ATTAINMENT AND SCHIZOPHRENIA 19 8 BIOLOGICAL ANNOTATIONS 21 8.1 Prioritisation of genes, pathways, and tissues/cell types with DEPICT 21 1 8.2 GWAS catalog lookup 23 9 LD-AWARE ENRICHMENT TEST 24 10 THE GRAS DATA COLLECTION 26 10.1 Subjects 26 10.2 Genotyping 26 10.3 Imputation and estimation of genetic principal components 26 10.4 Phenotyping procedures 27 11 REPLICATION IN THE GRAS DATA COLLECTION 27 11.1 Polygenic score calculations 28 11.1.1 Schizophrenia scores 28 11.1.2 Educational attainment scores 29 11.1.3 Bipolar disorder score 29 11.1.4 Neuroticism score 29 11.2 Polygenic score correlations 30 11.3 Predicting case-control status using PGS 30 12 POLYGENIC PREDICTION OF SCHIZOPHRENIA SYMPTOMS 31 12.1 Phenotypic correlations 31 12.2 Based on the 132 EA lead SNPs 31 12.3 Co-dependent polygenic risk score prediction results 32 13 CONTROLLING FOR GENETIC OVERLAP BETWEEN SCHIZOPHRENIA AND BIPOLAR DISORDER 34 13.1 GWIS schizophrenia – bipolar disorder 34 13.2 Look-up in GWIS results schizophrenia – bipolar disorder 35 13.2.1 Sign concordance 36 13.2.1 Raw enrichment factor (not corrected for LD score of SNPs) 36 13.2.2 Raw enrichment P value (not corrected for LD score of SNPs) 36 13.3 GWIS bipolar disorder - schizophrenia 36 13.4 Genetic correlations of GWAS and GWIS results 36 14 SIMULATING ASSORTATIVE MATING 38 14.1 Simulations 38 14.1.1 Description 38 14.1.2 Power 38 2 14.2 Results 39 15 REFERENCES 40 16 ADDITIONAL ACKNOWLEDGMENTS 44 16.1 Individual author contributions: 44 17 SUPPLEMENTARY FIGURES 45 3 1 Statistical consideration 1.1 Genetic dependence under homogeneity The concept of genetic correlation is frequently used in the field of quantitative genetics and is typically defined as the correlation of the true effects of single nucleotide polymorphisms (SNPs) on two outcomes1,2. Although the true effects of SNPs are unknown, estimates of genetic correlation can be obtained from the noisy estimates of SNP effects obtained from GWAS1 or they can be inferred from the phenotypic similarity of individuals conditional on their (estimated) genetic similarity3. The broader concept of genetic dependence is, however, less well understood. In fact, for polygenic traits, a convincing definition of genetic dependence that resonates both with intuition and with the definition of dependence between random variables in probability theory is difficult to formulate. Here, we define genetic dependence in a manner that genetic dependence does not necessarily imply genetic correlation and, vice versa, that genetic correlation does not necessarily imply genetic dependence. More specifically, given two traits, A and B, with proportion πA (resp. πB) of independent variants associated with A (B), and πAB associated with both traits, we define genetic dependence as πAB > πAπB. Simply put, under this definition, two traits are considered genetically dependent in case they share more causal (or causal-loci tagging) variants than expected by chance. As a result, πAB = πAπB implies independence. Finally, πAB < πAπB is also a form of dependence, where an increase in probability of a variant being associated with A, decreases the chance of that variant also being also associated with B. This definition of genetic dependence, however, says very little about how the genetic value of A influences the distribution of the genetic value of B. Consider, for example, the case where A and B are both highly heritable and genetically dependent (i.e., πAB > πAπB) and where the effects of the shared loci follow a random-effects model, but in which the effects of these shared loci are not correlated across traits. In this scenario, the genetic correlation between these traits is zero. Nevertheless, an individual for whom A takes on a rather extreme value is also likely to have an extreme value for B, yet we do not know anything about the direction in which B will go given A. This example reveals that true independence, as it is defined in probability theory, may be a rare event in the context of complex trait genetics. Moreover, under our definition of independence (i.e., πAB = πAπB), traits A and B can still be genetically correlated, viz., when the effects of the shared loci are correlated across traits. Hence, under our definition, genetic dependence is not equal to genetic correlation; genetic correlation does not imply dependence, and genetic dependence does not imply correlation. 1.2 Heterogeneity Up until this point, we have implicitly considered genetically homogeneous traits, where a subset of variants is shared among traits, and the effects of these shared variants have a uniform correlation across traits. However, a trait can be genetically heterogeneous in the sense that it aggregates over several endophenotypes or symptoms that are not genetically identical (i.e. their genetic correlation is not 1). One can imagine a scenario where several endophenotypes are affecting traits A and B, where some of the endophenotypes affect both 4 traits positively or both negatively, some endophenotypes affect one trait positively and the other trait negatively, and other endophenotypes affect only either of the two traits. For example, consider the following structural model for A and B: 퐴 = 퐺1 + 퐺2 + 퐺3 + 퐸퐴, 퐵 = 퐺1 − 퐺2 + 퐺4 + 퐸퐵, where G1 and G2 are endophenotypes affecting both traits, with G1 affecting A and B sign- concordantly and G2 affecting A and B with a different sign, G3 and G4 are endophenotypes both affecting only one trait, and E denotes the environmental component of a given trait, denoted by its subscript. Obviously, when the variants involved in the four endophenotypes are not overlapping, the total effects of the variants shaping G1 on A and B are positively correlated at +1, whereas the effects of variants shaping G2 on A and B are negatively correlated at −1. Moreover, if Var(G1) = Var(G2) the genetic correlation of A and B will be zero, even when a disproportionate amount of variants shape G1 and G2 (i.e., disproportionate in the sense that πAB > πAπB). Hence, we have devised a simple model where two traits are highly genetically dependent (i.e., πAB > πAπB) and have strongly correlated effects for subsets of shared loci, whilst the total genetic correlation between the traits is zero. 1.2.1 Genetic dependence of educational attainment and schizophrenia In the particular case of schizophrenia (SZ) and educational attainment (EA), a small, 1,2 positive genetic correlation (휌퐸퐴,푆푍 ≈ 0.08) and genetic dependence between both traits have been previously reported2,4. In particular, given that a variant is robustly associated with either trait, it significantly increases the chances of that variant also being associated with the other trait. To illustrate this, Supplementary Fig. 1 shows a scatter plot of all Z-statistics from the GWASs on EA and SZ in non-overlapping samples that we describe in Supplementary Notes 2-4. As can be seen, there is inflation in the test statistics along both the horizontal and the vertical axes, highlighting variants that are likely to be associated with one of the two traits. However, there are also variants that reach marginal nominal genome-wide significance for both traits on both diagonal axes. In Supplementary Note 6, we formally show that SNPs associated with EA are more likely to be associated with SZ than expected by chance, given the overall distribution of Z-statistics for SZ. Yet, the sign of association with one trait is not very informative of the sign of association with the other trait. It seems, though, as if there are more SNPs with the same sign for both traits that reach marginal significance than there are SNPs that have discordant signs. This is consistent with the idea that both traits are genetically dependent (i.e., more shared loci than expected by chance) with a subtle positive genetic correlation. However, clearly there are also outliers which seem to defy expectations under a positive genetic correlation between both traits, i.e., we see SNPs that are robustly associated with both traits, but with opposing signs in the effect estimates. Nevertheless, if we assume for now that both traits do actually have a homogenous co- architecture, then the ‘effective genetic correlation’ (i.e., correlation as inferred from genome-wide data; ρG) is expected to be given by the product of πAB and the correlation in the 5 effects of shared loci (denoted by ρ), ρG = ρπAB . 5 However, even when ρ is large (e.g., ρ = 1) and when both πA and πB are relatively small (e.g., πA = πB = 10%), it holds that a relatively small πAB (e.g., πAB = 2%) may still be in excess of πAπB, hence, constituting genetic dependence, whilst the effective genetic correlation remains small (e.g., ρG = 0.02).