<<

REviEWs

Genetic correlations of polygenic disease traits: from theory to practice

Wouter van Rheenen1*, Wouter J. Peyrot2, Andrew J. Schork3, S. Hong Lee4 and Naomi R. Wray 5,6* Abstract | The genetic correlation describes the genetic relationship between two traits and can contribute to a better understanding of the shared biological pathways and/or the causality relationships between them. The rarity of large family cohorts with recorded instances of two traits, particularly disease traits, has made it difficult to estimate genetic correlations using traditional epidemiological approaches. However, advances in genomic methodologies, such as genome-wide association studies, and widespread sharing of data now allow genetic correlations to be estimated for virtually any trait pair. Here, we review the definition, estimation, interpretation and uses of genetic correlations, with a focus on applications to human disease.

Parameter The genetic correlation is a quantitative genetic parameter­ which in traditional epidemiology can only be sepa­ A numerical value that that describes the genetic relationship between two traits rated by collecting large data sets of families that include summarizes a characteristic of and has been expected to reflect pleiotropic action of different types of relatives (such as full siblings, cousins a population, such as the mean or correlation between causal loci in two traits. Studies­ of and parents–offspring) measured for both diseases. height of men, the lifetime risk Genome-wide association studies of schizophrenia or the genetically correlated traits improve our understanding (GWAS) have pro­ of a specific trait. of complex traits because they can reveal genetic variation vided a new paradigm for estimating genetic correlations that contributes to disease, improve genetic prediction among disease traits from data sets that have been inde­ and inform therapeutic interventions. pendently collected for two diseases. Individual-trait 1Department of Neurology, In humans, most lifestyle risk factors of disease, as well data sets have much larger sample sizes and, because Brain Center Rudolf Magnus, University Medical Center as the diseases themselves, are at least partially herit­ the data for the two traits are independently collected, the 1 Utrecht, Utrecht, Netherlands. able ; genetic correlation estimates help to describe their opportunity of confounding through shared common 2Department of Psychiatry, complex relationships, which, particularly in the context environmental factors is minimized. These new data, Amsterdam UMC, VU of disease traits, may be unrecognized. For example, combined with new statistical methods to estimate University Medical Center, although genetic correlations had been hypothesized genetic correlations from both individual-level genetic Amsterdam, Netherlands. among psychiatric diseases2, they long remained diffi­ data or GWAS summary statistics and widespread data 3Institute for Biological cult to measure using traditional genetic epidemiological sharing, have greatly increased the potential to describe Psychiatry, Mental Health approaches, which require data from many families with relationships between complex diseases and traits. Services Snct. Hans, Roskilde, Denmark. two or more blood relatives recorded for each trait. Given Here, we review the definition, estimation, inter­ 4Australian Centre for that most diseases defined as common have lifetime risks pretation and uses of genetic correlations for human Precision Health, University of 0.5–5%, collating data sets that are informative for two complex traits, with a focus on disease. We discuss how of South Australia Cancer diseases is difficult and subject to ascertainment biases. genetic correlations between measured traits can be Research Institute, University The Scandinavian registries3, which comprise data on used to improve the power of association tests for iden­ of South Australia, Adelaide, diagnosis codes from national hospital admissions and tifying genetic variants contributing to disease risk, can South Australia, Australia. discharges for up to several million individuals, have been improve genetic prediction and can aid inferences about 5Institute for Molecular Bioscience, University of useful for calculating estimates of increased risk of a given causality to inform intervention strategies. We describe Queensland, Brisbane, disease in relatives of those with a different specified dis­ the interpretation of genetic correlation estimates and Queensland, Australia. ease4, which is the key observation for estimating genetic consider scenarios that can lead to misinterpretation. 6Queensland Brain Institute, correlation between diseases from family data. However, Supporting theory, together with simulations and code, University of Queensland, these registries also have limitations for estimating genetic are provided in the Supplementary note. Brisbane, Queensland, correlation between diseases; the data set is restricted to Australia. the size of the national population, and recording began Defining genetic correlations *e-mail: w.vanrheenen-2@ 3 Genetic correlation measures . umcutrecht.nl; naomi.wray@ quite recently for many disease traits , resulting in incom­ Pleiotropy is uq.edu.au plete or censored observations for late-onset disease. present when a genetic locus affects more than one trait. https://doi.org/10.1038/ Moreover, it remains challenging to disentangle genetic The term pleiotropy was introduced before the mol­ s41576-019-0137-z sharing from sharing of a common family environment, ecular characterization of DNA5 and hence is attributed

Nature Reviews | Genetics Reviews

phenotypic Traits to a genetic locus, which can imply pleiotropy at the If the traits are standardized (that is, Measurements or level of either a DNA variant or a . In the context = 1) and the genetic values consider only the that are usually studied as the of a discussion about genetic correlation, our interest additive genetic effects, then the genetic are outcome of statistical analyses. is in DNA variants. Pleiotropy between two traits can narrow-sense and the numerator is the They can be quantitative reflect different modes of action5,6 (Fig. 1). The main (for example, height) or a Horizontal pleiotropy dichotomous (for example, distinction is whether the two phenotypes are part of a ρ ρ schizophrenia). causal cascade (vertical or mediated pleiotropy) or not x g y x g y (horizontal or biological pleiotropy). Understanding Estimates horizontal pleiotropy may lead to a better understand­ Approximations of a parameter based on a sample of observed ing of biological processes that are common between z data drawn from a population. traits. Vertical pleiotropy can inform on causality for intervention strategies for disease prevention. It is Ascertainment biases worth noting that in earlier studies7,8 that laid the found­ G G Types of bias that occur when ation for understanding pleiotropy, vertical pleiotropy the studied trait or disease affects how data were was termed ‘spurious pleiotropy’, a term that we now use for spurious genetic correlation estimates due to ascertained. For example, b Vertical pleiotropy patients with a family history bias, misclassification or linkage disequilibrium (Fig. 1). of diabetes may have more Methods to discriminate between vertical and hori­ y x frequent examinations for cardiovascular diseases. zontal pleiotropy indicate that often a combination of 9,10 Causality ρ ρ Reverse causality both contribute to the genetic correlation . Genetic g g βxy βyx Genome-wide association correlation describes the average effect of pleiotropy x y studies across all causal loci, but the underlying architecture Studies in which up to millions of correlations at individual loci can vary (Fig. 2). Local of mostly common single- nucleotide polymorphisms from genetic correlation can deviate from the genome-wide across the genome are each average and regions with strong positive or negative G G tested for association with a trait. genetic correlation have been described for multiple traits even in the absence of genome-wide genetic GWAS summary statistics 11 c Spurious pleiotropy The output of statistical tests of correlation . r r association of a trait with each x g y x g y single-nucleotide poly­morphism Defining genetic correlation mathematically. In a generated by a genome-wide general quantitative genetic model, in which, for each Misclassification bias association study (GWAS), individual, two traits (x and y) are each defined as the typically including the effect Linkage genetic value disequilibrium allele, signed effect estimate, sum of a (g) and a residual value (e, with , test statistic residual simply meaning the difference between the trait (for example, a z-score) and/or value and the genetic value): G M G G p-value. xg=+e Power x x (1) The probability that a study Disease Gene/variant Marker correctly rejects the null yg=+y ey, (2) hypothesis of no association Fig. 1 | Different mechanisms of pleiotropy between two or correlation, also described diseases. a | In horizontal pleiotropy, the genetic variant (G) as 1– type II error. the genetic correlation (ρg) of the traits is: contributes directly to risk of both diseases (x and y) or Bias indirectly through an intermediate (endo) (z). σ b | In vertical pleiotropy, there is a causal relationship Phenomenon where statistical ggxy, analyses produce estimates in ρ = between disease x and y where disease x itself leads to an g 22 (3) observed data that σσ increased risk of disease y (left); in some circumstances, ggxy system­ati­cally overestimate or reverse causation may be observed (here we assume that underestimate the population there is a natural order in the traits to expect causation from parameter. Bias can arise from where σgg, is the covariance of the genetic values and 22xy trait x to trait y, with the reverse causation from y to x less the ascertainment of the σ , σ genetic variance ggxy is the of the two traits in the popu­ expected) (right). In some circumstances, there may be observed data or the statistical lation. As a result, ρ ranges from –1 to 1. Following causality , reverse causality and genetic correlation. procedures used to generate g | the estimates. convention, the Greek letters emphasize parameters c Spurious pleiotropy at a locus can occur when a measured (ρg), which are replaced by Roman letters for esti­ genetic marker (M) ‘tags’, via linkage disequilibrium, two 2 distinct causal variants (left); however, this sort of spurious Linkage disequilibrium mates (rg), although we note that h is commonly used (LD). The non-random to represent both the parameter and the estimate of pleiotropy has to be consistent across loci to result in a segregation of alleles at two heritability. Because the definition of genetic correlation non-zero estimate of genome-wide genetic correlation. distinct loci. LD induces a Different sources of bias (such as disease misclassification) depends on invoking a latent model, it is important to correlation between two may also lead to the incorrect assumption of pleiotropy single-nucleotide polymorphism acknowledge that any estimates of genetic para­meters between disease x and y (right). In early work on pleiotropy, (SNP) genotypes in the may be biased if the assumed latent model is an the term ‘spurious pleiotropy’ was used for what we term population and is caused by the imperfect representation of nature. Lack of clarity in fact that alleles of neighbouring ‘vertical pleiotropy’. βxy is the expected change in trait y SNPs are transmitted together distinguishing the conceptual parameters from the caused by each unit increase in trait x;β yx is the expected until broken down by estimates made from the available data is a common change in trait x caused by each unit increase in trait y. ρg, 12 recombination events. problem . genetic correlation; rg, estimated genetic correlation.

www.nature.com/nrg Reviews

a Constant genetic correlation b Strong regional genetic correlation Genetic value (g). The sum of the total effects 0.6 0.6 of all genetic loci on the trait in an individual, that is g = Xß 0.5 where X is a vector of 0.4 0.4 genotypes for all loci and ß is a vector with additive allelic effects on the trait. It is also g g

ρ 0.2 ρ 0.2 called the genotypic value, 0.2 0.2 0.2 0.2 0.2 true polygenic (risk) score or breeding value. 0.0 0.0 0.05 0.05 Covariance

(σxy, ). The expected product of the deviation of two random −0.2 −0.2 variables from their mean

(σxy, =−EX[( μYxy)( −μ )]). Region 1 Region 2 Region 3 Total Region 1 Region 2 Region 3 Total Genomic region Genomic region Genetic variance 2 (σg). The expected squared deviation of genetic values c Opposite regional genetic correlation d Regional genetic correlation without genome-wide from the mean genetic value genetic correlation ( 22EG[( μ )]), and can also σg =−g 0.6 0.6 be considered the covariance 0.6 of a genetic value with itself.

Heritability 0.4 0.4 (h2). The proportion of phenotypic variance 2 (parameter σP , estimate VP) g g ρ 0.2 ρ 0.2 attributable to variance in 0.2 0.2 0.2 genetic factors. In the context of human traits, most often 0.0 0.0 0.0 0.0 only additive genetic factors are considered for the genetic 2 variance (parameter σA, −0.2 −0.2 −0.2 −0.2 estimate VA) and the ratio of variances is the narrow-sense heritability. Region 1 Region 2 Region 3 Total Region 1 Region 2 Region 3 Total Genomic region Genomic region Latent model A collection of formalized Fig. 2 | Genome-wide genetic correlation versus regional genetic correlation. The overall genetic correlation assumptions to describe (as estimated from pedigree or genome-wide association study data) between two traits has been fixed at 0.2, but the a data-generating process underlying regional architecture of the genetic correlation can vary widely11. In part a, the genetic correlation is constant through which observed across the genome. Alternative scenarios include strong regional genetic correlation (part b) and a combination of both variables (such as disease positive and negative regional correlations (part c); in both of these cases, the regional genetic correlation can far exceed occurrence) can be used to the overall genome-wide genetic correlation. In part d, both positive and negative regional correlations occur in the identify unobserved (latent) absence of an overall genetic correlation. Regions can be interpreted as physical genomic loci, as allele-frequency bins variables (for example, genetic parameters: heritability or as functionally annotated categories (such as coding versus non-coding, biological pathways or tissue-specific and genetic correlation). expression). ρg, genetic correlation.

Phenotypic variance 2 (σP). Variance of phenotypic covariance of the standardized traits or the coheritability 22 22 values (for example, height or ρh=+hρ 1−hh1− ρ , (5) p xyg ()xy()e disease liability) after (hxy). Equation 3 can therefore be rewritten as: accounting for the variance attributable to fixed effects 13 hxy Cheverud’s conjecture proposes that estimates (for example, sex). When ρg = (4) of ρ can be used to approximate estimates of ρ . The phenotypes are standardized, 22 p g hhxy these phenotypic values are conjecture relies on the assumption that most environ­ scaled such that µP = 0 mental effects act in the same direction and through 2 and σP = 1. As the genetic covariance is scaled relative to the the same pathways as genetic effects, which leads to a

two genetic standard deviations when computing ρg, a similarity between phenotypic and genetic correlations. Coheritability high genetic correlation is possible even if there is only Importantly, unlike estimation of genetic correlation, the (hxy). The genetic covariance of standardized traits. This is a a small genetic contribution to the two traits. Hence, phenotypic correlation can be estimated from cohorts 2 useful measure for comparisons reporting estimates for both rg and h can provide a refer­ of unrelated individuals that have been measured for of coheritabilities and ence for the importance of the shared genetic contribution both traits. Although it is relatively easy to collect data heritabilities on the same scale. to the trait phenotypes by enabling coheritability to be to estimate phenotypic correlations between quantita­ estimated and benchmarked against the heritabilities. tive traits, it remains challenging to estimate pheno­ Although the relationship between phenotypic corre­ typic correlations for disease traits because of limited

lation (ρp) and genetic correlation additionally includes data availability, potential ascertainment biases and 14 the correlation between residual factors (ρe): symptom overlap .

Nature Reviews | Genetics Reviews

Linear mixed model Independently-collected GWAS data sets for a range depends on the coefficient of kinship, heritability for (LMM). A linear model that of important traits are now widely available and offer both traits and disease prevalence (Fig. 3). Very similar includes both fixed and 2 an alternative to family studies for estimating genetic estimates of h and rg are obtained for schizophrenia random effects to describe correlations attributable to common genetic variants. and bipolar disorder using a LMM approach on data phenotypic values and 2 that allows a correlation However, the expectation that genetic correlation esti­ from the large Swedish registry (h of 0.64 (95% CI structure between the random mates from family phenotypic records are the same as 0.62–0.68) and 0.59 (95% CI 0.56–0.62), respectively, 4 effect levels. those from GWAS data assumes that ρg is homogeneous and rg = 0.60 (no 95% CI reported)) or meta-analysis across the allelic frequency spectrum of risk loci15. of the estimates derived from the simple liability Restricted maximum threshold equations (h2 of 0.64 (95% CI 0.61–0.67) likelihood (REML). A method for Methods to estimate genetic correlations and 0.56 (95% CI 0.54–0.58), respectively, and rg = 0.47 19 maximum likelihood estimation Methods to estimate genetic correlations depend on the (95% CI 0.32–0.62)) described in equations 6 and 7, of variance–covariance data sets available, such as large cohorts of related indi­ demonstrating that the simple approach is a good components of the parameters viduals or GWAS. Here, we describe the methods that approximation of the complex analysis. The LMM in linear mixed models. laid the foundation for studies of genetic correlation, approach also estimated a contribution for environ­ 2 Liability threshold model including methods to study the distribution of genetic ment shared between relatives (c ), which was 0.045 A model that describes a correlation across the genome (genome partitioning) (95% CI 0.044–0.074) for schizophrenia and 0.034 dichotomous trait (disease) and methods to study genetic correlations of the same (95% CI 0.023–0.062) for bipolar disorder. as a threshold partitioning of trait in different environments. Table 1 presents a list of Alternatively, the tetrachoric correlation (r ) of ‘liability’, which is a latent tc 20 1 variable assumed to follow available methods and software. Pearson can be used to estimate heritability and a standard normal distribution genetic correlation from the 2 × 2 table of observations in the population. The liability Genetic epidemiological data for related individuals. of disease1,21,22 in related individuals: threshold (T) defines lifetime A bivariate linear mixed model (LMM) can be used to esti­ risk (K) of disease as the mate heritabilities and genetic correlations from large 2 1 proportion of individuals hxx= rtc, (8) exceeding this threshold. cohorts of families measured for two traits. In a bivar­ aR iate LMM, each phenotype is modelled as a function Risk ratio of the latent genetic values of individuals, and they are Ratio between the risk of rtc,,xy assumed to be drawn from a bivariate normal (that is, rg = disease in a specific group 22 (9) (for example, relatives of polygenic) distribution, where in this case the variance– ahR xyh affected individuals) and the covariance structure of the genetic relationships is risk of disease in the general based on pedigree data. Best estimates of genetic and Here, the main assumption is that health and disease population. phenotypic variances and covariances can be obtained are a result of dichotomizing an underlying bivariate restricted maximum likelihood 16 Tetrachoric correlation using (REML) . For dis­ normal distribution, which is consistent with the liabil­ The correlation between two ease traits, a simpler approach is to estimate genetic ity threshold model but requires that the proportion of latent normally distributed correlation estimates using population disease risk cases in the study equals the population risk. Therefore, liability phenotypes assumed and risks in pairs of related individuals (such as full both methods yield similar estimates when applied to to underlie dichotomous siblings or parent–offspring), assuming, for example, such data (Supplementary note). population data and estimated liability threshold model from an observed 2 × 2 a . Under this model, the herita­ frequency table. bilities and genetic correlations can be estimated using Individual-level GWAS data for unrelated indi- normal distribution theory17,18. Although the equations viduals. Estimation of the genetic correlation from Genomic relationship look complex, they depend on only five measures: life­ individual-level GWAS data involves a bivariate exten­ matrix 23 (GRM). A matrix whose time risk of disease x and y (Kx and Ky), lifetime risk of sion of the univariate genome-based REML (GREML) off-diagonal elements disease y or x in relatives of those with disease x or y that uses a genomic relationship matrix (GRM) to estimate represent a coefficient of (KRy,x, KRx,x, KRy,y) and the coefficient of relationship (αR), single-nucleotide polymorphism-based heritability genetic sharing between which is 0.5 for full siblings or parent–offspring: (SNP-based heritability)24,25. As for traditional epidemio­ individuals to describe the logy data, this approach uses an LMM, in which the variance–covariance structure between their genetic values phenotype is modelled as a function of the genetic values TT−1−1−−Ti∕ TT2 2 calculated from observed xxR,xx()xx()R,xx of individuals, but the variance–covariance structure of 2 (6) single-nucleotide hx = ai +−iTT 2  genetic values is described by genetic relationships in polymorphism (SNP) data. RR xx()xx,x the GRM constructed from observed genome-wide SNP GRM coefficients can be calculated based on different data rather than from pedigree data. SNP-based herita­ bility is expected to be lower than heritability estimated assumptions of the expected 2 2 distribution of per-SNP TTyy−1R,xx−1()−−Ti∕ xy()TTR,yx from epidemiological family records because it aims heritability. rg = (7) to capture only causal variation that is in linkage dise­ ai +−iTTh2  h RR xx()xy,xx y quilibrium (LD) with the measured SNPs. Therefore, it provides insight into the relative importance of common

Here, Tx and TRy,x are the normal distribution thres­ SNP variation, which can differ among traits. Relatives

holds that reflect the proportions Kx and KRy,x; and ix is closer than second or third cousins are excluded from the mean phenotypic liability of those with disease x, the analysis to ensure that short haplotype segments are

which is calculated as zx/Kx, where zx is the height of the tracked by the shared genetic relationships between pairs

standard normal curve at Tx. The relationship between of individuals. Compared to close relatives, distant rela­ 26 the increased risks to relatives (KRy,x/Ky; for exam­ tives share negligible non-additive genetic variation ple, cross-disorder risk ratio) and genetic correlation and are expected to have lower phenotypic correlation

www.nature.com/nrg Reviews

Table 1 | summary of methods and software packages name Description input source Refs Estimate genetic correlation polycor Estimate tetrachoric correlation 2 × 2 contingency tables from https://cran.r-project.org/web/packages/ 22 through MLE pedigrees polycor/ GCTA (—reml-bivar) Bivariate GREML; includes Individual-level genotypes http://cnsgenomics.com/software/gcta/ 23,46 options for different model assumptions MTG2 Computationally efficient Individual-level genotypes https://sites.google.com/site/ 114 bivariate GREML honglee0707/mtg2 BOLT-REML Computationally efficient Individual-level genotypes https://data.broadinstitute.org/ 115 approximate bivariate GREML alkesgroup/BOLT-LMM/ with GRM with fully overlapping individuals for both traits LDSC (—rg) Weighted regression of product GWAS summary statistics https://github.com/bulik/ldsc 28,29 of GWAS summary statistics on LD scores LDAK Calculate weighted kinship Individual-level genotypes http://dougspeed.com/ldak 41,43 matrix to model distribution of causal variants across LD and/or MAF spectrum SumHer (—sum-cors) Analogue to LDSC, but adopts GWAS summary statistics http://dougspeed.com/sumher/ 44 ‘LDAK model’ GCTA (—HEreg-bivar) Bivariate Haseman–Elston Individual-level genotypes http://cnsgenomics.com/software/gcta/ 46,51 regression S-PCGC Phenotype correlation genotype GWAS summary statistics • https://github.com/omerwe/S-PCGC 49,50,116 correlation regression. Extension • https://data.broadinstitute.org/ of Haseman–Elston regression, alkesgroup/PCGC/ robust to ascertainment and covariates Popcorn Bayesian estimate of transethnic GWAS summary statistics https://github.com/brielin/Popcorn 37 genetic correlation LD Hub Server to estimate genetic GWAS summary statistics http://ldsc.broadinstitute.org/ldhub/ 30 correlation from published GWAS using LDSC GNOVA Annotation-partitioned genetic GWAS summary statistics https://github.com/xtonyjiang/GNOVA 35 correlation ρHESS Local genetic correlation GWAS summary statistics https://github.com/huwenboshi/hess 11 Power calculation genetic correlation estimation GCTA power calculator Calculate the power of bivariate User-defined parameters https://cnsgenomics.shinyapps.io/ 34 GREML analysis in GCTA gctaPower/ Multitrait association analysis MultiMeta Inverse variance-weighted SNP effects and standard errors https://CRAN.R-project.org/ 61 meta-analysis package=MultiMeta ASSET Subset meta-analysis which can SNP effects and standard errors https://bioconductor.org/packages/ 62 include correlated traits release/bioc/html/ASSET.html HIPO Heritability-based weighting of SNP effects and standard errors https://github.com/gqi/hipo 63 correlated traits MTAG Pleiotropy-informed SNP SNP effects and standard errors https://github.com/omeed-maghzian/mtag 64 association analysis for single trait CPMA Test SNP pleiotropy through SNP p-values http://coruscant.itmat.upenn.edu/ 117 distribution of p-values software.html LEP Heterogeneous sharing of risk SNP effects and standard errors https://github.com/daviddaigithub/LEP 118 variants metaUSAT Score-based association test SNP effects and standard errors https://github.com/RayDebashree/ 65 metaUSAT TATES Multivariate analysis of single SNPs SNP p-values https://ctg.cncr.nl/software/tates 73 metaCCA Multivariate analysis of multiple SNP effects and standard errors https://bioconductor.org/packages/ 74 SNPs using canonical correlation release/bioc/html/metaCCA.html analysis

Nature Reviews | Genetics Reviews

Table 1 (cont.) | summary of methods and software packages name Description input source Refs Multitrait association analysis (cont.) genomic SEM Identify SNPs associated with SNP effects and standard errors https://github.com/MichelNivard/ 72 general dimensions of cross-trait GenomicSEM liability through SEM cFDR Bayesian conditional analysis SNP p-values https://github.com/jamesliley/ 75,76 cFDR-common-controls CPBayes Bayesian conditional analysis SNP effects and standard errors https://CRAN.R-project.org/ 77 allowing >2 traits package=CPBayes GPA , GPA-MDS Bayesian association analysis SNP p-values https://github.com/dongjunchung/GPA 78,79 incorporating SNP annotation EPS Bayesian association analysis SNP p-values https://github.com/gordonliu810822/EPS 119 incorporating SNP annotation Multitrait prediction MTAG Predictor based on β coefficients GWAS summary https://github.com/omeed-maghzian/ 64 from pleiotropy-informed SNP statistics + individual-level mtag associations genotypes SMTpred Combine SNP weights GWAS summary https://github.com/uqrmaie1/smtpred 90 (β coefficient or BLUP) from statistics + individual-level multiple single-trait predictors genotypes PleioPred Bayesian framework for GWAS summary https://github.com/yiminghu/PleioPred 91 multitrait prediction potentially statistics + individual-level modelling genome annotation genotypes (PleioPred-anno) Inferences on causality MR-Egger MR using Egger regression GWAS summary statistics https://cran.r-project.org/web/packages/ 94 MendelianRandomization/index.html MRBase Server for MR analysis using GWAS summary statistics http://www.mrbase.org/ 95 published GWAS MR Steiger Detect directionality in causal GWAS summary statistics https://github.com/explodecomputer/ 97 relationships causal-directions GSMR Generalized summary GWAS summary statistics http://cnsgenomics.com/software/gsmr/ 9 data-based MR , including HEIDI test to exclude SNPs with evidence for horizontal pleiotropy MR-PRESSO MR after detecting and GWAS summary statistics https://github.com/rondolab/ 10 correcting for horizontal MR-PRESSO pleiotropy LCV Latent causal variable model GWAS summary statistics https://github.com/lukejoconnor/LCV 103 to infer causality , less biased by horizontal pleiotropy Conditional analysis Multi-trait conditional Multitrait conditional analysis of GWAS summary statistics https://github.com/yangq001/conditional 104 GWAS analysis GWAS in same individuals GCTA-mtCOJO Multitrait conditional analysis of GWAS summary statistics http://cnsgenomics.com/software/gcta/ 9 independent GWAS GWIS Approximate conditioned GWAS GWAS summary statistics https://sites.google.com/site/mgnivard/ 105 summary statistics gwis Fine-mapping causal variants RiVIERA Bayesian framework to combine GWAS summary statistics https://github.com/yueli-compbio/ 106 multitrait SNP associations and RiVIERA-beta annotation for fine mapping fastPAINTOR Bayesian framework to combine GWAS summary statistics https://github.com/gkichaev/ 107 multitrait SNP associations and PAINTOR_V3.0 annotation for fine mapping BLUP, best linear unbiased predictor; GCTA, genome-wide complex trait analysis; GCTA-mtCOJO, genome-wide complex trait analysis–multitrait conditional and joint analysis; GREML, genetic restricted maximum likelihood; GRM, genomic relationship matrix; GWAS, genome-wide association study; HEIDI, heterogeneity in dependent instruments; LD, linkage disequilibrium; LDAK, linkage disequilibrium adjusted kinship; LDSC, linkage disequilibrium score regression; MAF, minor allele frequency; MLE, maximum likelihood estimation; MR, Mendelian randomization; SNP, single-nucleotide polymorphism; SEM, structural equation modelling.

www.nature.com/nrg Reviews

2 a CDRR by coefficient of kinship (aR) b CDRR by heritability (h ) c CDRR by disease prevalence (K) 4 4 4

2

aR =0.125 (second cousin) hx =0.1 Kx =0.001 2

aR =0.25 (full cousin) hx =0.5 Kx =0.01 2

aR =0.5 (full sibling) hx =1.0 Kx =0.1 3 3 3

RR for disease x 2 RR for disease x 2 RR for disease x 2

1 1 1 0 0.25 0.5 0.75 1 0 0.25 0.5 0.75 1 0 0.25 0.5 0.75 1

ρg ρg ρg Fig. 3 | Relation between cross-disorder relative risk (cDRR) and genetic correlation. The graphs show the

relationship between genetic correlation (ρg) and the risk ratio (RR) with varying coefficient of relationship (aR) (part a), 2 2 heritability of disease x (hx) (part b) and lifetime risk of disease x (Kx) (part c). Reference parameters are hy = 0.4, Ky = 0.15 2 (typical of major depression), hx= 0.65, Kx = 0.01 (typical of schizophrenia) and aR = 0.5 (that is, full sibling). In part a, the 2 sibling (aR = 0.5) of someone with major depressive disorder (hy = 0.4, Ky = 0.15) has a 1.56-fold increased chance of having 2 schizophrenia hx = 0.65, Kx = 0.01, ρg = 0.47) compared to the general population. The risk ratio is lower for more distant relatives (1.26-fold and 1.13-fold increase for aR = 0.25 and 0.125, respectively; dotted black lines). Using similar parameters, part b shows that the relative risk for the sibling increases with increasing heritability of disease x (1.21, 1.49 2 and 1.72 for hx= 0.1, 0.5 and 1, respectively; dotted black lines). In part c, the relative risk for the sibling increases with decreasing lifetime risk of disease x (1.35, 1.57 and 1.75 for Kx = 0.1, 0.01 and 0.001, respectively; dotted black lines). The Supplementary note provides the theoretical background and code for this figure.

24,25,27 associated with the shared environment , so esti­ for the two traits (Nx, Ny) and the total number of SNPs

mates are unlikely to be biased by these factors. Bivariate (M). The intercept term estimates sample overlap (Ns) GREML analysis simultaneously estimates the genetic and hence reflects the proportion of shared individuals Ns variances of the two traits and the genetic covariance () and their phenotypic correlation (ρp): NN between them that best fit the data given the model. xy It can be applied to data sets that have been collected ρNp s NNxyhxy independently because pairs of individuals across data Ez[]xj zlyj∣ j =+ lj (10) sets are distantly related. Hence, bivariate models include NNxy M a GRM that has two block diagonal matrices represent­ ing the two univariate GRMs, with the off-diagonal LDSC SNP-based heritabilities, which are needed block representing the genetic relationships between to estimate the genetic correlation, are calculated sim­ pairs of individuals represented in the two data sets. ilarly with x = y (and with an additional intercept term to account for residual confounding, such as population GWAS summary statistics for unrelated individuals. stratification, within a data set). As LDSC is computa­ Linkage disequilibrium score regression (LDSC)28 was tionally very efficient, summary statistics from GWAS the first method to propose estimation of genetic cor­ are widely shared and there is no bias introduced by relation from GWAS summary statistics. It is based on sample overlap, genetic correlations between hundreds the observation, expected under polygenicity, that the of traits can be studied29–31. LD Hub30 provides a pub­ greater the total amount of LD a SNP has with other licly available server that hosts LDSC calculations and a genetic variants, the greater its chance of being corre­ library of published GWAS summary statistics. lated with causal variants, and the higher its expected association test statistic. Exploiting this relationship Genome partitioning. A question of key interest is SNP-based heritability allows estimation of SNP-based heritability when whether causal variants for a trait are found randomly An estimate of the proportion using association test statistics for a single trait or esti­ across the genome or are enriched based on genomic of the total phenotypic mation of SNP-based coheritability when combining annotation. The univariate GREML approach can model variance attributable to the association test statistics from two traits. Specifically, multiple random effects and hence estimate multiple additive effects of the class 29 of variants (that is, common bivariate LDSC uses a weighted regression frame­ genetic variances using multiple GRMs, each built with 32 single-nucleotide work to estimate the coheritability (hxy) from GWAS SNPs selected on different annotations . The REML polymorphisms (SNPs)) that association statistics for SNP j of both traits (zxj and zyj) approach optimizes the partitioning of the variance are typically genotyped and and the SNP LD scores (lj). The LD score of SNP j is the to these annotations. The computational efficiency of imputed in pursuit of a sum of r2 LD of SNP j with other SNPs obtained from LDSC in estimating the enrichment of SNP-based genome-wide association study. It is often shortened sequencing data and can thus be regarded as a measure heritability in sets of variants with particular genomic to SNP heritability, but this of the genetic variation that is ‘tagged’ by SNP j. The annotations has enabled study of genomic partitioning should be avoided. regression relationship also depends on the sample size of genetic variance using a stratified LDSC33 approach.

Nature Reviews | Genetics Reviews

Extension of these methods to investigate differences in effects and/or population-specific allele frequencies. genetic correlations between traits based on genomic With this in mind, Popcorn estimates the correlation annotations is appealing, but they would generate esti­ of SNP effect sizes (the genetic-effect correlation) and mates with high standard errors (which depend on the the correlation of per SNP heritability (the genetic- number of SNPs contributing to the estimates as well as impact correlation); the genetic-impact correlation is sample sizes34). Heritability Estimation from Summary dependent on differences between populations in terms Statistics (ρHESS), which was developed to partition of both effect sizes and allele frequencies. Both in simu­ genetic correlations based on genomic regions, addresses lations and in application to real data, the estimates were this issue by reducing the noise in the LD matrix through found to be similar37. principal component-based regularization (that is, block diagonalization)11. By contrast, the GeNetic cOVariance Interpretation of SNP-based estimates Analyzer (GNOVA), which partitions genetic correla­ SNP-based genetic correlation estimates are robust to tions based on functional annotations, uses the method most model assumptions. There is an ongoing debate of moments as the underlying framework instead of about model assumptions of GREML and LDSC and the weighted regression in LDSC35. Both methods have their impact on SNP-based heritability estimates41–45. shown that the genetic correlation is not constant By contrast, estimates of genetic correlations are gen­ across the genome for different trait pairs. For exam­ erally shown to be robust to these assumptions29,44. To ple, 11 regions of statistically significant local genetic support this conclusion, we summarize the current dis­ correlation (four positive, seven negative) were found cussions for SNP-based heritability estimation and then between LDL and HDL cholesterol in the absence of justify why the issues have little impact on estimates of genome-wide genetic correlation11. genetic correlations. Briefly, the basic GREML model, one of the models implemented in the genome-wide Same trait measured in different environments. complex trait analysis (GCTA) software46, assumes Bivariate methods can be used to analyse data for the same an infinitesimal model and that causal effects are drawn trait that have been intentionally recorded in two differ­ from a normal distribution. In simulations41, estimates ent environments (or populations); data from the two of SNP-based heritability were found to be robust environments are treated as different traits in the analy­ to three key underlying assumptions relating to the sis. The resulting genetic correlation estimates can of the trait: extent of polygenicity, reflect the sensitivity of the genetic effects to the cho­ normality of genetic effects and the inverse relationship sen environments, and estimates less than one may be between minor allele frequency (MAF) and effect size. indicative of genotype by environment (G × E) interaction. However, the estimates were found to be sensitive to the An important caveat, especially for GWAS-derived esti­ assumption that causal variants are found randomly mates, is that these analyses should always be bench­ with respect to the LD patterns of the genome. This led marked against estimates from different cohorts of the to the introduction of the LD adjusted kinship (LDAK) same trait recorded in the same environment. In this REML methods41,43, in which contributions from SNPs case, the true genetic correlation parameter is one, but, in low LD regions are assigned higher weights when con­ in practice, small sample sizes, differences in participant structing the GRM. Instead of using a weighted GRM, ascertainment or unrecognized differences in pheno­ LD and MAF stratified GREML (GREML-LDMS)42 Genotype by environment type definition can induce sample heterogeneity, which introduced multiple-component GREML models that (G × E) interaction results in lower estimates15,36. Estimates of genetic corre­ estimate multiple genetic variances stratified by LD and Differences in size and/or lation between samples of different ethnicities are addi­ MAF which sum to the SNP-based heritability. In turn, direction of the effect of genotype on disease risk tionally affected by differences in allele frequency and a multivariance-component LDAK model with LDAK (ref.37) 43 in two different environments. LD structure that may lead to rg estimates <1 . In GRM stratified by MAF was also introduced . A com­ LMM methods, the bivariate GRM can be constructed prehensive comparison of analyses that use single or Sample heterogeneity using allele frequencies estimated from the two different multiple-component GRM constructed with different Differences in the effects of genotype on disease risk in samples, which accounts for both allele frequency and underlying assumptions on LD and/or MAF-dependent 38 two different cohorts. Potential LD differences between the populations . For example, architectures, and based on simulations from genome- causes include differences the estimated genetic correlation between European sequence data, showed that multicomponent models in phenotype criteria, cohorts and East Asian cohorts was 0.76 (s.e. 0.04) for performed better than single-component models, but ascertainment methods and Crohn’s disease and 0.79 (s.e. 0.04) for ulcerative colitis39. biases were observed in all methods depending on the unknown environmental 47 differences with genotype By contrast, the genetic correlation between these ethni­ underlying architecture . It is therefore difficult to by environment interaction. cities for attention-deficit hyperactivity disorder foresee which biases may occur in real data, as the true (ADHD) was only 0.39 (s.e. 0.15)40; however, the genetic genetic architecture is unknown. In these comparisons, Infinitesimal model correlation estimated between two European ancestry LDSC estimates were biased downwards by 5–10% when This model assumes that a trait 15 is shaped by a very large ADHD cohorts was only 0.71 (s.e. 0.17) , which indi­ causal variants were common, and this bias increased number of variants with small cates sample heterogeneity in these ADHD GWAS. The as causal variants became less common. Recently, the (infinitesimal) effects resulting Popcorn method37 extends LDSC to allow estimation summary statistics-based method SumHer44 was intro­ in a normally distributed of genetic correlation between two traits from GWAS duced, which includes the assumption that low LD SNPs phenotype. A polygenic conducted in populations of different ethnicity using LD should have higher effect sizes. architecture of >~10 causal variants is approximated reference panels from both populations. In the absence Discussion about model assumptions has focused well by normal distribution of sample heterogeneity, an interesting question is mostly on the estimates of SNP-based genetic variance infinitesimal model theory. whether rg < 1 is because of population-specific allelic and heritabilities, but the same concerns apply to the

www.nature.com/nrg Reviews

Haseman–Elston regression estimates of SNP-based genetic covariances and coherit­ statistics-based methods is that larger sample sizes can 44 Regression of the product of abilities between two traits . The key point of discussion be achieved and they require only a small fraction the standardized phenotypes is whether the genetic architecture assumed in a model of the computational expertise and resources required for of pairs of individuals on their of analysis matches the true, but unknown, genetic archi­ the methods that use individual-level data. coefficient of genetic sharing as defined in the genomic tecture of the trait under analysis. A parti­cular concern is relationship matrix. the LD properties of causal variants, but for causal vari­ Genetic correlation estimates are robust to scale ants shared by two traits the same LD properties apply as transformations. For disease traits, SNP-based herita­ they are a property of the genome not of the trait. Hence, bility estimates are made relative to the phenotypic vari­ differences in estimates of (co)variances are expected to ance in the sample, which is a function of the proportion approximately cancel out through their impact on both of cases in the sample. The raw estimates are transformed the numerator and the denominator. Therefore, esti­ based on normal distribution theory to account for sam­ mates of genetic correlations are observed to be much ple ascertainment, so that they are interpretable and can more robust to underlying assumptions in simulations be compared across different samples21,47. Likewise, raw conducted across different methods29,44,48,49. coheritability estimates reflect the proportions of cases in the two samples. As the transformations apply to the Precision and required sample size for genetic cor- numerator and the denominator, the estimated genetic relation estimates. The standard error of GREML correlation is scale independent29. SNP-based heritability estimates depend only on sam­ ple size (N) and SNP density (accurately approximated Genetic correlation estimates are robust to ascertain- as 316/N for GWAS data34), but the standard errors of ment and strong environmental factors. Ascertainment genetic correlation estimates are several-fold larger of cases (resulting in oversampling of individuals with because they reflect errors of three estimated parame­ both high genetic and environmental values) or the ters, that is, the heritabilities of the two traits and the presence of environmental factors with strong effects genetic correlation parameter itself (Fig. 4a). Because can violate GREML (and hence also LDAK) assump­ standard errors can be estimated with good accuracy, tions that environmental values are normally distributed power calculations can be undertaken before conduct­ and lead to a downward bias of SNP-based heritability ing a study34. Empirical standard errors of summary estimates49,50. Haseman–Elston regression is robust to this statistics-based methods such as LDSC are ~2× higher assumption but gives ~1.5× higher standard errors than than those of GREML. Therefore, LDSC is less powerful GREML-based approaches51. Phenotype correlation to detect genetic correlations that are significantly differ­ genotype correlation (PCGC) regression is an extension ent from zero or one, for a given sample size, compared of Haseman–Elston regression that accounts for ascer­ to GREML28,48 (Fig. 4b). The major advantage of summary tainment and covariates in estimation of SNP-based

a Precision of genetic correlation estimates b Power to detect non-null genetic correlations 1.00 h2 0.3 2 2

rg: ρg =0.2, hx =0.8, hy =0.8 2 2

rg: ρg =0.2, hx =0.8, hy =0.2 2 2 0.75

rg: ρg =0.2, hx =0.2, hy =0.2 0.2 0.50 Power (1− β ) Power

Standard error Standard 0.1 GREML LDSC

0.25 ρg =0.05

ρg =0.2

ρg =0.5 0.0 0.00 0 25,000 50,000 75,000 100,000 0 25,000 50,000 75,000 100,000 Per−study sample size Per−study sample size

Fig. 4 | Precision of genetic correlation estimates compared to heritability estimates and power for GReML and LDsc. a | Standard errors for heritability and genetic correlation estimates obtained from genome-based restricted maximum likelihood (GREML). We use equal sample sizes for the two traits and show that the standard errors for genetic correlation estimates are substantially larger than for heritability estimates, meaning that larger sample sizes are needed to obtain equally accurate genetic correlation estimates compared to heritability estimates. b | Power calculations for GREML34 and linkage disequilibrium score regression (LDSC). We assume that standard errors are approximately twice as large for LDSC regression compared to GREML based on observations from simulations and real data28,48, in which the LDSC intercept was not constrained. For equal sample sizes, GREML is more powerful to detect genetic correlations than LDSC, but often the individual-level data sets needed for GREML are smaller than those contributing to the genome-wide association study summary statistics used in LDSC. Unless stated otherwise, the following parameters were fixed for both traits: heritability (h2) = 0.2, lifetime risk (K) = 0.01, proportion of the study sample that are cases (P) = 0.5 and significance threshold (α) = 0.05. Power is described as 1 – type II error (that is, 1 –β ). The Supplementary note

provides the theoretical background and code for this figure. ρg, genetic correlation; rg, estimated genetic correlation.

Nature Reviews | Genetics Reviews

a b Misclassification 1 1

0.75 0.75

g_ eq 0.5 0.5

ρ estimate g r

ρm =0.1 MTx =0.05

0.25 ρm =0.3 0.25 MTx =0.1

ρm =0.5 MTx =0.2

ρm =0.8 MTx =0.3 0 0 0 0.25 0.5 0.75 1 0 0.25 0.5 0.75 1

ρg ρg

c Double-screening control cohorts Bipolar disease and schizophrenia Asthma and hayfever 1 1 rg estimate

ρg parameter 2 hy estimate 0.75 0.75 2 hy parameter 2 hx estimate 2 hx parameter 0.5 0.5 rg estimate Estimates Estimates ρg parameter 2 hy estimate 0.25 2 0.25 hy parameter 2 hx estimate 2 hx parameter 0 0 0 0.25 0.5 0.75 1 0 0.25 0.5 0.75 1

ρg ρg

d Collider bias Uncorrelated disease traits Correlated disease traits 0.0 1

0.8

0.6 −0.1 0.4

estimate estimate 0.2 g g r r

−0.2 Kincl =0.05 ρincl =0 0 Kincl =0.2 ρincl =0.25

Kincl =0.4 ρincl =0.5 −0.2

Kincl =0.8 ρincl =0.75

−0.3 −0.4 –0.8 –0.4 0 0.4 0.8 0 0.25 0.5 0.75 1

ρincl ρg

heritability. However, extensive simulations suggest confounding is more likely to occur with a binary trait24. Confounding bias A type of bias that emerges that genetic correlation estimates obtained by GREML, Hence, stringent SNP quality control is needed when when a covariate, a LDSC and PCGC are not biased by strong ascertainment applying GREML to disease data, as well as inclusion ‘confounder’, causally or strong environmental factors49. of ancestry-informative principal components as fixed- influences the predictor variable effect covariates. LDSC and SumHer attempt to model and outcome variable. When Impact of population stratification on genetic cor- inflation of association statistics due to any residual­ the confounder is not relation estimates. Confounding bias accounted for, the relationship due to popula­ population stratification when estimating SNP-based 52 between predictor and outcome tion stratification can inflate (co)heritability estimates heritability, but bias may still remain . The impact of may be biased (confounded). from GREML. Moreover, population and technical population stratification on coheritability estimates and

www.nature.com/nrg Reviews

◀ Fig. 5 | Bias in estimated genetic parameters. a | Positive assortative mating increases Hence, the likelihood of over­estimation of genetic cor­ genetic correlation estimates at equilibrium (ρg_eq). Here, ρm is the phenotypic correlation relation should be guided by the context of realistic between mate pairs for trait x, heritability of trait x ( 2) 0.65 and heritability of trait hx = misclassification rates for each disease pair. y (h2) 0.4. b | Misclassification inflates genetic correlation estimates (r ). Here, y = g Individual-level genotype data can help detect MTx represents the misclassification rate of trait x as trait y. There is no misclassification 2 2 potential misclassification. The Breaking Up Hetero­ of trait y, hx= 0.65 and hy = 0.4. c | Extending the methodology that considered disease misclassification56, double-screening controls can yield inflated estimates for heritability geneous Mixture Based On cross(X)-locus correlations and genetic correlation. This bias is modest when at least one trait is relatively rare, for (BUHMBOX) algorithm leverages correlation patterns 2 58 example (left) for major depressive disorder (hx = 0.4, Kx = 0.15) and schizophrenia of disease risk loci to detect subgroup heterogeneity . 2 (hy = 0.65, Ky = 0.01), but can be substantial for two common traits such as hayfever There have been few applications to real data but one 2 2 113 (hx = 0.15, Kx = 0.15) and asthma (hy = 0.18, Ky = 0.25) (right). Dashed lines reflect true example used rheumatoid arthritis data and indicated parameters. d | In the panel on the left, collider bias results in a downward bias of the that the genetic correlation between seropositive and genetic correlation estimates when both traits are associated with the probability sero­negative types was (partly) due to subgroup hetero­ of being included in the study, here modelled on the liability scale ( as ρincl,x = ρincl,y geneity possibly introduced by false-negative serum presented on x axis). This bias is most pronounced when a smaller proportion of 58 2 2 rheumatoid blood factor tests (misclassification) . samples is included in the study (Kincl). For this panel, hx = hy = 0.4 and ρg = 0. In the panel on the right, trait parameters are chosen to reflect major depressive disorder Notably, subgroup heterogeneity­ can theoretically also 2 2 result from (molecular) subtypes or vertical pleio­ (hx= 0.4, Kx = 0.15) and schizophrenia (h = 0.65, Ky = 0.01). Again, ρincl,x = ρincl,y and we y tropy, which can, in contrast to misclassification, be of set ρg = ρe as presented on the x axis. The bias in genetic correlation estimates due to collider bias is most pronounced when traits are uncorrelated. The Supplementary note great interest.

provides the theoretical background and code for all panels of this figure. ρg, genetic correlation. Double-screening control cohorts can inflate genetic correlation estimates. In addition to screening controls to exclude the case trait, it is also common to exclude genetic correlation estimates for these methods has not control subjects with potentially related diseases in what been thoroughly investigated; however, for LDSC, theory we term here as double-screening of control cohorts. predicts that it affects the bivariate LDSC intercept but Although this ascertainment bias may increase power not coheritability estimates53. for detection in GWAS, it can induce biased estimates of genetic parameters (Fig. 5c). When the true genetic corre­ Positive assortative mating increases genetic lation is zero, a non-zero estimate reflects the increased correlation.­ Compared to random mating, positive prevalence of risk alleles for the secondary trait in the assortative mating on a trait (x) increases its genetic cases relative to the doubly-screened controls. This bias 2 (Fig. 5c) variance at equilibrium (hx_eq) and the genetic variance increases with increased population risk and of any correlated trait (y)54. Their genetic correlation at increased differences in heritability between the primary

equilibrium, ρg_eq, compared to the genetic correlation and secondary trait. 54 under random mating (ρg) can be expressed as : Collider bias can affect genetic correlation estimates. Collider bias can introduce spurious correlations when 1 ρρ= two traits both influence a third ‘collider’ variable and gg_eq 22 (11) their association is conditioned on this third variable. 1−ρhm x_eq 1−ρg () A special form of collider bias arises through selection bias in which both traits influence the probability that (Fig. 5d) where ρm is the phenotypic correlation between mates an individual is included in the study . Collider

for the assortatively-mated trait x. For disease traits, ρm bias has been acknowledged as a potential pitfall in the would be the correlation between liability to disease. As use of large-scale biobanks59, in which there may be a a result, assortative mating increases genetic correlation high degree of self-ascertainment. For example, in the estimates at equilibrium (Fig. 5a). For assortative mating UK Biobank study, only 5% of invitations to participate 55 60 typical of humans (ρm < 0.3) , ρg_eq/ρg has a ­maximum were accepted , with participants having higher educa­ of ~1.2. tional status and lower prevalence of smoking59. Hence, the genetic correlation estimated between educational Assortative mating Misclassification inflates genetic correlation esti- status and smoking obtained from UK Biobank data may Mating selection on a trait mates. Misclassification, or misdiagnosis, may occur be biased. where the phenotypes of mates are positively when two traits share phenotypic characteristics, such correlated. Examples of as Crohn’s disease and ulcerative colitis, and can lead to Uses of genetic correlations assortative mating in humans spurious estimates of genetic correlation56. The impact If non-zero genetic correlations are estimated between include height or educational of misclassification on the estimated genetic correlation two traits, analyses can be constructed that could attainment. can be quantified from theory56, and the bias is greatest improve power to detect new disease-associated vari­ Collider bias when two diseases are genetically unrelated (Fig. 5b). ants, imp­rove genetic risk prediction, make inferences A type of bias that emerges However, if two diseases are truly genetically corre­ on causality, perform conditional analyses and describe when estimates are lated then perhaps phenotypic overlap in clinical­ pres­ the bio­logical aetiology of complex traits. Methods and conditioned on a covariate, entation leading to diagnostic ambiguity is expected. software packages (Table 1) that can be used to achieve a ‘collider,’ that is causally 57 influenced by both the For example, in schizophrenia and bipolar disorder , these aims are discussed below, with a focus on those predictor variable and changing diagnoses over time is likely a real reflection that use GWAS summary statistics because these are outcome variable. of the longitudinal symptom profile in some people. broadly applicable.

Nature Reviews | Genetics Reviews

Identification of new trait-associated variants. The and early intervention (prevention) strategies82–84. combined association analyses of correlated traits can The accuracy of genetic risk prediction is dictated by increase power to detect new SNP–trait associations. disease characteristics (such as heritability and life­ A wide variety of methods is available that combine GWAS time risk)82, but also study design (including reference summary statistics to identify trait-associated SNPs sample size, population and trait)85. Leveraging SNP (Table 1). In general, methods encompass extensions of effect estimates from GWAS of genetically correlated (inverse-variance weighted) meta-analysis61–65, the score- traits is equivalent to increasing the effective discovery based association test65, linear combinations of GWAS sample size of the focal trait86 (Fig. 6a). Predictors for test statistics66–71, structural equation modelling72, multi­ case–control traits with relatively low heritability can variate models73,74 and Bayesian methods where the particularly benefit from this multitrait approach87. The prior is informed by SNP associations in correlated increase in predictive accuracy, measured as the area traits35,75–79. A detailed description of these methods is under the receiver operator curve, can be predicted beyond the scope of this review and is provided else­ from theory85,88 (Fig. 6b). An increase in prediction where80,81. The increase in power of the combined accuracy has indeed been observed when combining analyses compared to single-trait analyses can be inter­ individual-level genotypes for psychiatric traits using preted as equivalent to the increase in sample size for ridge regression89 or for inflammatory bowel diseases the single-trait association analysis. For example, the using best linear unbiased predictors of SNP effects, 86 multitrait analysis of the genetically correlated (ρg ≈ 0.7) such as MT-GBLUP . Methods for multitrait prediction traits depressive symptoms (N = 354,311), neuroticism have also been extended to work with single-trait GWAS (N = 168,105) and subjective well-being (N = 388,542) summary statistics (such as SMTPred90 and Multi-Trait with a large proportion of overlapping individuals led to Analysis of GWAS (MTAG)64) and can include genome an increase in power compared to the single-trait analy­ annotation in the prediction model (for example, sis, which is equivalent to a 27%, 55% and 55% increase PleioPred-anno)91. in sample size, respectively64. Inferences on causality. When traits x and y are found Improved genetic risk prediction. Genetic risk pre­ to be genetically correlated, it may be of interest to diction is of great interest for common complex traits understand whether the correlation has been induced because it can inform diagnostic decision-making by a causal relationship, that is, trait x causes trait y

a Increase in prediction accuracy using genetic b Increase in prediction accuracy using genetic correlation by sample size equivalence of focal correlation by genetic correlation and sample trait and genetic correlation size of correlated trait 0.75

0.65 0.70

0.65 0.60 AUC AUC 0.60

Single trait N2 =25,000 0.55

ρg =0.2 N2 =50,000 0.55 ρg =0.4 N2 =100,000

ρg =0.6 N2 =150,000 0.50 0.50 0 10,000 25,000 50,000 75,000 100,000 0.0 0.2 0.4 0.6 0.8 1.0

N1 ρg

Fig. 6 | Prediction accuracy increases when correlated traits are combined. Based on the theory in quantitative trait genetics described in ref.85, the prediction accuracy of a genetic predictor (parameterized as the area under receiver operating characteristic curve (AUC)) in the target population sample increases when two genetically correlated traits are combined to calculate single-nucleotide polymorphism (SNP) weights for 1,000,000 SNPs (M). The target trait in this example is major depressive disorder (heritability (h2) = 0.40, lifetime risk (K) = 0.15, proportion of cases (P) = 0.5). The genetically correlated trait is schizophrenia (h2 = 0.65, K = 0.01, P = 0.5). The effective number of

chromosome segments is chosen to reflect an unrelated sample of the European population (Meff = 50,000). Part a illustrates the increase in prediction accuracy that is achieved by including information from the correlated

trait equivalent to the increase in the sample size of the first trait (N1) in a single-trait analysis. Compared to the

single-trait analysis in 10,000 individuals, adding a secondary trait with estimated genetic correlation (rg) = 0.2, 0.4

or 0.6 and sample size N2 = 50,000 results in an increase in prediction accuracy equivalent to an increase in sample

size of focal trait genome-wide association studies (GWAS) (N1) of 40%, 260% and 510%, respectively (dotted black lines). In part b, the prediction accuracy increases with increasing genetic correlation and increasing sample size of

the discovery GWAS of the correlated trait (N2), here N1 = 20,000. The Supplementary note provides the theoretical

background and code for this figure. ρg, genetic correlation.

www.nature.com/nrg Reviews

(or vice versa). Identifying causality is particularly when the true sharing of genetic risk is variable across important if the putative causal trait is potentially the genome. modifiable, such as smoking or LDL cholesterol levels. However, only formal tests for causality justify these Improved interpretation of GWAS results. The devel­ claims. The gold standard to prove causality is a rando­ opment of frameworks to include multiple correlated mized clinical trial, which can be costly and, in some traits to translate GWAS results to functional biology instances, unethical or impossible to conduct. Genetic is still in the early stages but is likely to become an area correlations can be leveraged to aid inferences on cau­ of active research. Similar to single-trait heritability sality through Mendelian randomization (MR, reviewed enrichment analyses33, the GNOVA framework aims to elsewhere92,93) and is a cost-effective way to explore cau­ elucidate the biological processes underpinning genetic sality. Briefly, if trait x (for example, diabetes) increases correlations by identifying functionally annotated sets the risk of trait y (for example, cardiovascular disease), of SNPs that contribute most to genetic correlations35. then all risk factors (the instrumental variables), includ­ Furthermore, genetic correlations can be leveraged to ing genetic risk factors, for trait x will, to some consist­ help fine-map causal variants in GWAS loci, under an ent proportional extent, also increase the risk of trait y. assumption of shared causal variants between traits, A strong assumption in MR analyses is the absence of as illustrated by the Risk Variant Inference using horizontal pleiotropy, that is, the instrumental variable Epigenomic Reference Annotation (RiVIERA)106 does not affect trait y directly. Numerous methods for method; this Bayesian framework approach identified performing MR analyses9,94,95 and detecting horizontal more causal variants that regulated gene expression in pleiotropy9,10,96 are available (Table 1). MR Steiger97 can disease-relevant tissue when multiple correlated traits help to detect directionality of causal relationships and to were combined compared to single-trait analyses. disentangle causal relationships between multiple related Similarly, the Probabilistic Annotation INtegraTOR risk factors, and multivariate98,99 or conditional9 MR (fastPAINTOR)107 algorithm combines multiple corre­ analyses can be applied. An illustrative example of con­ lated traits and genome annotation to prioritize causal ditional MR identified a causal effect of LDL cholesterol variants in GWAS loci, explicitly modelling multiple levels, but not HDL cholesterol levels, on cardiovascular causal variants within a locus. disease9, which is reflected in the results of randomized trials100–102. The latent causal variable (LCV) method103 Conclusions and future perspectives has been proposed to better differentiate causality and The past decade has seen great advances in our under­ partial causality from horizontal pleiotropy, but a lim­ standing of complex traits and common diseases and itation is that it has not been extended to multitrait disorders. It is now well recognized that pleiotropy is conditional analyses. ubiquitous29,31,108 and that the number of uncorrelated traits is constrained109,110. The availability of large inde­ Conditional analysis. When it is known that two pendently collected data sets for multiple traits and the traits are genetically correlated and potentially causally sharing of GWAS summary statistics enable genetic related, it can be equally interesting to focus on which correlations to be estimated and used on an unprece­ SNPs induce phenotypic heterogeneity and cause dis­ dented scale. Although ongoing discussion focuses on ease x to be different from disease y. Multitrait condi­ model assumptions and how they can bias estimates of tional analyses that condition a SNP–trait association genetic variances and covariance, simulation studies on a second disease can provide this insight. When only consistently conclude that genetic correlation estimates summary statistics are available, conditional analyses are robust to these assumptions. Large-scale genotype– can be performed for GWAS with fully overlapping indi­ phenotype resources show that a genetic contribution viduals104 and for completely independent GWAS (for can be attributed to the vast majority of measured traits example, using genome-wide complex trait analy­sis– and lifestyle factors, further increasing the potential of multitrait conditional and joint analysis (GCTA- studies on genetic correlation to describe disease bio­ mtCOJO)9). If the genetic correlation between the logy111. In the short term, genetic correlation estimates traits reflects a purely causal relationship, then the SNP will help to find disease-associated genetic variation and effects for trait y conditional on trait x are expected to improve polygenic risk prediction as it makes its way 83,112 be uncorrelated with SNP effects of trait x (rgy xx, =0). into clinical practice . They may also contribute to However, if the genetic correlation between the traits improved nosology and diagnosis, risk stratification includes horizontal pleiotropy where the genetic shar­ to inform clinical trials and lifestyle interventions. Here,

ing may not be the same across the genome, then rgy xx, our focus is common, polygenic, disease traits, coded as may differ from zero. The Genome-Wide Inferred a simple dichotomy of health and disease. However, the Statistics (GWIS)105 method conditions trait y on trait x, disease process and its estimated genetic contribution

forcing rgy xx, =0. This approach was used to disentan­ can be conceptualized as the complex product of multi­ gle the genetic correlation between schizophrenia and ple intermediate phenotypes at the level of gene expres­ educational attainment, attributing the observed genetic sion, cumulative over time and cell types. The coming correlation to only those SNPs that are shared between decades may generate the data that, through studies schizophrenia and bipolar disorder105. Like GCTA- of genetically correlated traits, enable deconstruction of mtCOJO, GWIS assumes that the same adjustment disease risk at the molecular level. applies across the genome. Therefore, both approaches Published online xx xx xxxx may generate results that are difficult to interpret

Nature Reviews | Genetics Reviews

1. Polderman, T. J. C. et al. Meta-analysis of the 24. Yang, J. et al. Common SNPs explain a large genetic architecture of complex traits. Nat. Genet. 50, heritability of human traits based on fifty years proportion of the heritability for human height. 737–745 (2018). of twin studies. Nat. Genet. 47, 702–709 (2015). Nat. Genet. 42, 565–569 (2010). 48. Ni, G., Moser, G., Schizophrenia Working Group of 2. Craddock, N. & Owen, M. J. The beginning of the end 25. Lee, S. H., Wray, N. R., Goddard, M. E. & Visscher, P. M. the Psychiatric Genomics Consortium, Wray, N. R. for the Kraepelinian dichotomy. Br. J. Psychiatry 186, Estimating missing heritability for disease from & Lee, S. H. Estimation of genetic correlation via 364–366 (2005). genome-wide association studies. Am. J. Hum. Genet. linkage disequilibrium score regression and genomic 3. Maret-Ouda, J., Tao, W., Wahlin, K. & Lagergren, J. 88, 294–305 (2011). restricted maximum likelihood. Am. J. Hum. Genet. Nordic registry-based cohort studies: possibilities and 26. Falconer, D. S. & Mackay, T. F. C. Introduction to 102, 1185–1194 (2018). pitfalls when combining Nordic registry data. Scand. J. 4th edn (Pearson, 1996). 49. Weissbrod, O., Flint, J. & Rosset, S. Estimating Public Health 45 (Suppl. 17), 14–19 (2017). 27. Zaitlen, N. et al. Using extended genealogy to SNP-based heritability and genetic correlation in 4. Lichtenstein, P. et al. Common genetic determinants estimate components of heritability for 23 case–control studies directly and with summary of schizophrenia and bipolar disorder in Swedish quantitative and dichotomous traits. PLOS Genet. 9, statistics. Am. J. Hum. Genet. 103, 89–99 (2018). families: a population-based study. Lancet 373, e1003520 (2013). 50. Golan, D., Lander, E. S. & Rosset, S. Measuring 234–239 (2009). 28. Bulik-Sullivan, B. K. et al. LD score regression missing heritability: inferring the contribution of This work reports a population-scale data set for distinguishes confounding from polygenicity in common variants. Proc. Natl Acad. Sci. USA 111, estimation of genetic correlation between diseases genome-wide association studies. Nat. Genet. 47, E5272–E5281 (2014). based on family data. 291–295 (2015). 51. Yang, J., Zeng, J., Goddard, M. E., Wray, N. R. 5. Stearns, F. W. One hundred years of pleiotropy: 29. Bulik-Sullivan, B. et al. An atlas of genetic correlations & Visscher, P. M. Concepts, estimation and a retrospective. Genetics 186, 767–773 (2010). across human diseases and traits. Nat. Genet. 47, interpretation of SNP-based heritability. Nat. Genet. 6. Paaby, A. B. & Rockman, M. V. The many faces of 1236–1241 (2015). 49, 1304–1310 (2017). pleiotropy. Trends Genet. 29, 66–73 (2013). This study introduces the LDSC method to estimate 52. Holmes, J. B., Speed, D. & Balding, D. J. Summary 7. Grüneberg, H. An analysis of the ‘pleiotropic’ effects genetic correlation from GWAS summary data. statistic analyses do not correct confounding bias. of a new lethal mutation in the rat (Mus norvegicus). 30. Zheng, J. et al. LD Hub: a centralized database and Preprint at bioRxiv https://doi.org/10.1101/532069 Proc. R. Soc. Lond. B 125, 123–144 (1938). web interface to perform LD score regression that (2019). 8. Wagner, G. P. & Zhang, J. The pleiotropic structure maximizes the potential of summary level GWAS data 53. Yengo, L., Yang, J. & Visscher, P. M. Expectation of of the genotype–phenotype map: the evolvability of for SNP heritability and genetic correlation analysis. the intercept from bivariate LD score regression in the complex organisms. Nat. Rev. Genet. 12, 204 Bioinformatics 33, 272–279 (2017). presence of population stratification. Preprint at (2011). This work introduces LD Hub, a server that hosts bioRxiv https://doi.org/10.1101/310565 (2018). 9. Zhu, Z. et al. Causal associations between risk factors GWAS summary statistics and LDSC analyses to 54. Gianola, D. Assortative mating and the genetic and common diseases inferred from GWAS summary estimate genetic correlations. correlation. Theor. Appl. Genet. 62, 225–231 (1982). data. Nat. Commun. 9, 224 (2018). 31. Brainstorm Consortium et al. Analysis of shared 55. Peyrot, W. J., Robinson, M. R., Penninx, B. W. J. H. 10. Verbanck, M., Chen, C.-Y., Neale, B. & Do, R. heritability in common disorders of the brain. Science & Wray, N. R. Exploring boundaries for the genetic Detection of widespread horizontal pleiotropy in 360, eaap8757 (2018). consequences of assortative mating for psychiatric causal relationships inferred from Mendelian 32. Yang, J. et al. Genome partitioning of genetic variation traits. JAMA Psychiatry 73, 1189–1195 (2016). randomization between complex traits and diseases. for complex traits using common SNPs. Nat. Genet. 56. Wray, N. R., Lee, S. H. & Kendler, K. S. Impact of Nat. Genet. 50, 693–698 (2018). 43, 519–525 (2011). diagnostic misclassification on estimation of genetic 11. Shi, H., Mancuso, N., Spendlove, S. & Pasaniuc, B. 33. Finucane, H. K. et al. Partitioning heritability by correlations using genome-wide genotypes. Eur. J. Local genetic correlation gives insights into the shared functional annotation using genome-wide association Hum. Genet. 20, 668–674 (2012). genetic architecture of complex traits. Am. J. Hum. summary statistics. Nat. Genet. 47, 1228–1235 57. Bromet, E. J. et al. Diagnostic shifts during the Genet. 101, 737–751 (2017). (2015). decade following first admission for psychosis. 12. Zuk, O., Hechter, E., Sunyaev, S. R. & Lander, E. S. 34. Visscher, P. M. et al. Statistical power to detect Am. J. Psychiatry 168, 1186–1194 (2011). The mystery of missing heritability: genetic genetic (co)variance of complex traits using SNP data 58. Han, B. et al. A method to decipher pleiotropy by interactions create phantom heritability. Proc. Natl in unrelated samples. PLOS Genet. 10, e1004269 detecting underlying heterogeneity driven by hidden Acad. Sci. USA 109, 1193–1198 (2012). (2014). subgroups applied to autoimmune and neuropsychiatric 13. Cheverud, J. M. A comparison of genetic and phenotypic 35. Lu, Q. et al. A powerful approach to estimating diseases. Nat. Genet. 48, 803–810 (2016). correlations. Evolution 42, 958–968 (1988). annotation-stratified genetic covariance via GWAS This work describes a method that tries to This study describes phenotypic correlations summary statistics. Am. J. Hum. Genet. 101, distinguish between genetic correlation driven by as estimates of genetic correlations based on 939–964 (2017). sample heterogeneity and that driven by trait observation data. 36. Wray, N. R. et al. Genome-wide association analyses pleiotropy. 14. Rzhetsky, A., Wajngurt, D., Park, N. & Zheng, T. identify 44 risk variants and refine the genetic 59. Munafò, M. R., Tilling, K., Taylor, A. E., Evans, D. M. Probing genetic overlap among complex human architecture of major depression. Nat. Genet. 50, & Davey Smith, G. Collider scope: when selection phenotypes. Proc. Natl Acad. Sci. USA 104, 668–681 (2018). bias can substantially influence observed associations. 11694–11699 (2007). 37. Brown, B. C., Asian Genetic Epidemiology Network Int. J. Epidemiol. 47, 226–235 (2018). 15. Cross-Disorder Group of the Psychiatric Genomics Type 2 Diabetes Consortium, Ye, C. J., Price, A. L. 60. Allen, N. et al. UK Biobank: current status and what Consortium et al. Genetic relationship between five & Zaitlen, N. Transethnic genetic-correlation estimates it means for epidemiology. Health Policy Technol. 1, psychiatric disorders estimated from genome-wide from summary statistics. Am. J. Hum. Genet. 99, 123–126 (2012). SNPs. Nat. Genet. 45, 984–994 (2013). 76–88 (2016). 61. Vuckovic, D., Gasparini, P., Soranzo, N. & Iotchkova, V. This study is among the first to estimate genetic 38. de Candia, T. R. et al. Additive genetic variation in MultiMeta: an R package for meta-analyzing correlation between diseases using independently schizophrenia risk is shared by populations of African multi-phenotype genome-wide association studies. collected GWAS samples. and European descent. Am. J. Hum. Genet. 93, Bioinformatics 31, 2754–2756 (2015). 16. Tenesa, A. & Haley, C. S. The heritability of human 463–470 (2013). 62. Bhattacharjee, S. et al. A subset-based approach disease: estimation, uses and abuses. Nat. Rev. Genet. 39. Liu, J. Z. et al. Association analyses identify 38 improves power and interpretation for the combined 14, 139–149 (2013). susceptibility loci for inflammatory bowel disease analysis of genetic association studies of 17. Falconer, D. S. The inheritance of liability to certain and highlight shared genetic risk across populations. heterogeneous traits. Am. J. Hum. Genet. 90, diseases, estimated from the incidence among Nat. Genet. 47, 979–986 (2015). 821–835 (2012). relatives. Ann. Hum. Genet. 29, 51–76 (1965). 40. Yang, L. et al. Polygenic transmission and complex 63. Qi, G. & Chatterjee, N. Heritability informed power 18. Reich, T., James, J. W. & Morris, C. A. The use neurodevelopmental network for attention deficit optimization (HIPO) leads to enhanced detection of multiple thresholds in determining the mode of hyperactivity disorder: genome-wide association study of genetic associations across multiple traits. transmission of semi-continuous traits. Ann. Hum. of both common and rare variants. Am. J. Med. Genet. PLOS Genet. 14, e1007549 (2018). Genet. 36, 163–184 (1972). B Neuropsychiatr. Genet. 162B, 419–430 (2013). 64. Turley, P. et al. Multi-trait analysis of genome-wide 19. Wray, N. R. & Gottesman, I. I. Using summary 41. Speed, D., Hemani, G., Johnson, M. R. & Balding, D. J. association summary statistics using MTAG. data from the Danish national registers to estimate Improved heritability estimation from genome-wide Nat. Genet. 50, 229–237 (2018). heritabilities for schizophrenia, bipolar disorder, SNPs. Am. J. Hum. Genet. 91, 1011–1021 (2012). 65. Ray, D. & Boehnke, M. Methods for meta-analysis and major depressive disorder. Front. Genet. 3, 118 42. Yang, J. et al. Genetic variance estimation with of multiple traits using GWAS summary statistics. (2012). imputed variants finds negligible missing heritability Genet. Epidemiol. 42, 134–145 (2018). 20. Pearson, K. I. Mathematical contributions to the for human height and body mass index. Nat. Genet. 66. O’Brien, P. C. Procedures for comparing samples with theory of evolution. — VII. On the correlation of 47, 1114–1120 (2015). multiple endpoints. Biometrics 40, 1079–1087 characters not quantitatively measurable. Philos. 43. Speed, D. et al. Reevaluation of SNP heritability in (1984). Trans. A Math. Phys. Eng. Sci. 195, 1–405 (1900). complex human traits. Nat. Genet. 49, 986–992 67. Xu, X., Tian, L. & Wei, L. J. Combining dependent tests 21. Sham, P. Statistics in Human Genetics (Wiley, 1998). (2017). for linkage or association across multiple phenotypic 22. Olsson, U. Maximum likelihood estimation of the 44. Speed, D. & Balding, D. J. SumHer better estimates traits. Biostatistics 4, 223–229 (2003). polychoric correlation coefficient. Psychometrika 44, the SNP heritability of complex traits from summary 68. Yang, Q., Wu, H., Guo, C.-Y. & Fox, C. S. Analyze 443–460 (1979). statistics. Nat. Genet. 51, 277–284 (2019). multivariate phenotypes in genetic association 23. Lee, S. H., Yang, J., Goddard, M. E., Visscher, P. M. & 45. Gazal, S., Marquez-Luna, C., Finucane, H. K. & studies by combining univariate association tests. Wray, N. R. Estimation of pleiotropy between complex Price, A. L. Reconciling S-LDSC and LDAK models Genet. Epidemiol. 34, 444–454 (2010). diseases using single-nucleotide polymorphism- and functional enrichment estimates. Preprint at 69. Bolormaa, S. et al. A multi-trait, meta-analysis for derived genomic relationships and restricted bioRxiv https://doi.org/10.1101/256412 (2018). detecting pleiotropic polymorphisms for stature, maximum likelihood. Bioinformatics 28, 2540–2542 46. Yang, J., Lee, S. H., Goddard, M. E. & Visscher, P. M. fatness and reproduction in beef cattle. PLOS Genet. (2012). GCTA: a tool for genome-wide complex trait analysis. 10, e1004198 (2014). This study introduces the bivariate GREML method Am. J. Hum. Genet. 88, 76–82 (2011). 70. Zhu, X. et al. Meta-analysis of correlated traits via to estimate genetic correlation from genome-wide 47. Evans, L. M. et al. Comparison of methods that use summary statistics from GWASs with an application in SNP data. whole genome data to estimate the heritability and hypertension. Am. J. Hum. Genet. 96, 21–36 (2015).

www.nature.com/nrg Reviews

71. He, L. et al. Pleiotropic meta-analyses of longitudinal epidemiological studies. Hum. Mol. Genet. 23, based on genomic information. Bioinformatics 32, studies discover novel genetic variants associated with R89–R98 (2014). 1420–1422 (2016). age-related diseases. Front. Genet. 7, 179 (2016). 94. Bowden, J., Davey Smith, G. & Burgess, S. Mendelian 115. Loh, P.-R. et al. Efficient Bayesian mixed-model 72. Grotzinger, A. D. et al. Genomic structural equation randomization with invalid instruments: effect analysis increases association power in large cohorts. modelling provides insights into the multivariate estimation and bias detection through Egger Nat. Genet. 47, 284–290 (2015). genetic architecture of complex traits. Nat. Hum. regression. Int. J. Epidemiol. 44, 512–525 (2015). 116. Loh, P.-R. et al. Contrasting genetic architectures of Behav. 3, 513–525 (2019). 95. Hemani, G. et al. The MR-Base platform supports schizophrenia and other complex diseases using fast 73. van der Sluis, S., Posthuma, D. & Dolan, C. V. TATES: systematic causal inference across the human variance-components analysis. Nat. Genet. 47, efficient multivariate genotype-phenotype analysis for phenome. eLife 7, e34408 (2018). 1385–1392 (2015). genome-wide association studies. PLOS Genet. 9, 96. Zhu, Z. et al. Integration of summary data from GWAS 117. Cotsapas, C. et al. Pervasive sharing of genetic effects in e1003235 (2013). and eQTL studies predicts complex trait gene targets. autoimmune disease. PLOS Genet. 7, e1002254 (2011). 74. Cichonska, A. et al. metaCCA: summary statistics- Nat. Genet. 48, 481–487 (2016). 118. Dai, M. et al. Joint analysis of individual-level and based multivariate meta-analysis of genome-wide 97. Hemani, G., Tilling, K. & Davey Smith, G. Orienting summary-level GWAS data by leveraging pleiotropy. association studies using canonical correlation the causal relationship between imprecisely measured Bioinformatics 35, 1729–1736 (2018). analysis. Bioinformatics 32, 1981–1989 (2016). traits using GWAS summary data. PLOS Genet. 13, 119. Liu, J., Wan, X., Ma, S. & Yang, C. EPS: an empirical 75. Andreassen, O. A. et al. Improved detection of common e1007081 (2017). Bayes approach to integrating pleiotropy and variants associated with schizophrenia and bipolar 98. Burgess, S. & Thompson, S. G. Multivariable tissue-specific information for prioritizing risk genes. disorder using pleiotropy-informed conditional false Mendelian randomization: the use of pleiotropic Bioinformatics 32, 1856–1864 (2016). discovery rate. PLOS Genet. 9, e1003455 (2013). genetic variants to estimate causal effects. Am. J. 76. Liley, J. & Wallace, C. A pleiotropy-informed Bayesian Epidemiol. 181, 251–260 (2015). Acknowledgements false discovery rate adapted to a shared control design This study introduces MR, a method to determine W.v.R. was funded by the ALS Foundation Netherlands. W.J.P. finds new disease associations from GWAS summary whether genetic correlation results from a causal was funded by an NWO Veni grant (91619152). S.H.L. is an statistics. PLOS Genet. 11, e1004926 (2015). relationship. ARC Future Fellow (FT160100229). N.R.W. acknowledges 77. Majumdar, A., Haldar, T., Bhattacharya, S. & Witte, J. S. 99. Do, R. et al. Common variants associated with plasma funding from the Australian National Health and Medical An efficient Bayesian meta-analysis approach for triglycerides and risk for coronary artery disease. Research Council (1078901, 1087889 and 1113400). W.v.R. studying cross-phenotype genetic associations. PLOS Nat. Genet. 45, 1345–1352 (2013). and N.R.W. acknowledge funding from the EU Joint Genet. 14, e1007139 (2018). 100. Baigent, C. et al. Efficacy and safety of cholesterol- Programme – Neurodegenerative Disease Research (JPND) 78. Chung, D., Yang, C., Li, C., Gelernter, J. & Zhao, H. lowering treatment: prospective meta-analysis of data project (Australia, NHMRC 1151854; The Netherlands, GPA: a statistical approach to prioritizing GWAS from 90,056 participants in 14 randomised trials of ZonMW project number 733051071). The authors thank K. results by integrating pleiotropy and annotation. statins. Lancet 366, 1267–1278 (2005). Tilling, G. Davey Smith and the members of the University of PLOS Genet. 10, e1004787 (2014). 101. Nissen, S. E. et al. Effect of torcetrapib on the Queensland Program in Complex Trait Genomics for their 79. Wei, W. et al. GPA-MDS: a visualization approach to progression of coronary atherosclerosis. N. Engl. insightful discussions. investigate genetic architecture among phenotypes J. Med. 356, 1304–1316 (2007). using GWAS results. Int. J. Genomics 2016, 6589843 102. Barter, P. J. et al. Effects of torcetrapib in patients at Author contributions (2016). high risk for coronary events. N. Engl. J. Med. 357, All authors researched data for the article, made substantial 80. Solovieff, N., Cotsapas, C., Lee, P. H., Purcell, S. M. & 2109–2122 (2007). contributions to discussions of the content and reviewed Smoller, J. W. Pleiotropy in complex traits: challenges 103. O’Connor, L. J. & Price, A. L. Distinguishing genetic and/or edited the manuscript before submission. W.v.R. and and strategies. Nat. Rev. Genet. 14, 483–495 (2013). correlation from causation across 52 diseases and N.R.W. wrote the article. 81. Shriner, D. Moving toward system genetics through complex traits. Nat. Genet. 50, 1728–1734 (2018). multiple trait analysis in genome-wide association 104. Deng, Y. & Pan, W. Conditional analysis of multiple Competing interests studies. Front. Genet. 3, 1 (2012). quantitative traits based on marginal GWAS summary The authors declare no competing interests. 82. Wray, N. R. et al. Pitfalls of predicting complex traits statistics. Genet. Epidemiol. 41, 427–436 (2017). from SNPs. Nat. Rev. Genet. 14, 507–515 (2013). 105. Nieuwboer, H. A., Pool, R., Dolan, C. V., Boomsma, D. I. Publisher’s note 83. Khera, A. V. et al. Genome-wide polygenic scores for & Nivard, M. G. GWIS: genome-wide inferred statistics Springer Nature remains neutral with regard to jurisdictional common diseases identify individuals with risk for functions of multiple phenotypes. Am. J. Hum. claims in published maps and institutional affiliations. equivalent to monogenic mutations. Nat. Genet. 50, Genet. 99, 917–927 (2016). 1219–1224 (2018). 106. Li, Y. & Kellis, M. Joint Bayesian inference of risk Reviewer information 84. Torkamani, A., Wineinger, N. E. & Topol, E. J. variants and tissue-specific epigenomic enrichments Nature Reviews Genetics thanks D. Balding, B. Pasaniuc and The personal and clinical utility of polygenic risk across multiple complex human diseases. Nucleic the other, anonymous, reviewer(s) for their contribution to the scores. Nat. Rev. Genet. 19, 581–590 (2018). Acids Res. 44, e144 (2016). peer review of this work. 85. Lee, S. H., Clark, S. & van der Werf, J. H. J. Estimation 107. Kichaev, G. et al. Improved methods for multi-trait fine of genomic prediction accuracy from reference mapping of pleiotropic risk loci. Bioinformatics 33, Supplementary information populations with varying degrees of relationship. 248–255 (2017). Supplementary information is available for this paper at PLOS ONE 12, e0189775 (2017). 108. Pickrell, J. K. et al. Detection and interpretation https://doi.org/10.1038/s41576-019-0137-z. 86. Maier, R. et al. Joint analysis of psychiatric disorders of shared genetic influences on 42 human traits. increases accuracy of risk prediction for schizophrenia, Nat. Genet. 48, 709–717 (2016). bipolar disorder, and major depressive disorder. 109. Barton, N. H. Pleiotropic models of quantitative Related links Am. J. Hum. Genet. 96, 283–294 (2015). variation. Genetics 124, 773–782 (1990). BUHMBOX: http://software.broadinstitute.org/mpg/buhmbox/ 87. Guo, G. et al. Comparison of single-trait and 110. Walsh, B. & Blows, M. W. Abundant genetic fastPAiNTOR: https://github.com/gkichaev/PAINTOR_V3.0 multiple-trait genomic prediction models. BMC Genet. variation + strong selection = multivariate genetic GCTA: http://cnsgenomics.com/software/gcta/ 15, 30 (2014). constraints: a geometric view of adaptation. Annu. GNOvA: https://github.com/xtonyjiang/GNOVA 88. Dudbridge, F. Power and predictive accuracy of polygenic Rev. Ecol. Evol. Syst. 40, 41–59 (2009). Gwis: https://sites.google.com/site/mgnivard/gwis risk scores. PLOS Genet. 9, e1003348 (2013). This work puts forward arguments for multivariate JPND: www.jpnd.eu 89. Li, C., Yang, C., Gelernter, J. & Zhao, H. Improving genetic constraints and strong limits on the number LCv: https://github.com/lukejoconnor/LCV genetic risk prediction by leveraging pleiotropy. of independent traits. LDAK: http://dougspeed.com/ldak Hum. Genet. 133, 639–650 (2014). 111. Visscher, P. M. et al. 10 years of GWAS discovery: LD Hub: http://ldsc.broadinstitute.org/ldhub/ 90. Maier, R. M. et al. Improving genetic prediction by biology, function, and translation. Am. J. Hum. Genet. LDsC: https://github.com/bulik/ldsc leveraging genetic correlations among human diseases 101, 5–22 (2017). MR steiger: https://github.com/explodecomputer/ and traits. Nat. Commun. 9, 989 (2018). 112. Inouye, M. et al. Genomic risk prediction of coronary causal-directions 91. Hu, Y. et al. Joint modeling of genetically correlated artery disease in 480,000 adults: implications for MTAG: https://github.com/omeed-maghzian/mtag diseases and functional annotations increases primary prevention. J. Am. Coll. Cardiol. 72, PCGC: https://data.broadinstitute.org/alkesgroup/PCGC/ accuracy of polygenic risk prediction. PLOS Genet. 13, 1883–1893 (2018). ρHess: https://github.com/huwenboshi/hess e1006836 (2017). 113. Ferreira, M. A. et al. Shared genetic origin of PleioPred: https://github.com/yiminghu/PleioPred 92. Pingault, J.-B. et al. Using genetic data to strengthen asthma, hay fever and eczema elucidates allergic Popcorn: https://github.com/brielin/Popcorn causal inference in observational research. Nat. Rev. disease biology. Nat. Genet. 49, 1752–1757 RivieRA: https://github.com/yueli-compbio/RiVIERA-beta Genet. 19, 566–580 (2018). (2017). sMTpred: https://github.com/uqrmaie1/smtpred 93. Smith, G. D., Davey Smith, G. & Hemani, G. Mendelian 114. Lee, S. H. & van der Werf, J. H. J. MTG2: an efficient sumHer: http://dougspeed.com/sumher/ randomization: genetic anchors for causal inference in algorithm for multivariate linear mixed model analysis

Nature Reviews | Genetics